# SRE Runbook Index

## Architecture
- Ops runbook delivery and mapping model:
  - `doc/operations/Ops_Runbook_Architecture.md`
- SRE tool access matrix and API-first ops policy:
  - `doc/operations/SRE_Tool_Access_Matrix_v1.md`
- Observability read-model gap map:
  - `doc/operations/Observability_Read_Model_Gap_Map_v1.md`
- Observability health snapshot read-model contract:
  - `doc/operations/Observability_Health_Snapshot_Read_Model_Contract_v1.md`
- Observability correlation timeline read-model contract:
  - `doc/operations/Observability_Correlation_Timeline_Read_Model_Contract_v1.md`
- Observability log and trace pivot read-model contract:
  - `doc/operations/Observability_Log_Trace_Pivot_Read_Model_Contract_v1.md`
- Observability alert and SLO evidence read-model contract:
  - `doc/operations/Observability_Alert_SLO_Evidence_Read_Model_Contract_v1.md`
- Registry ops read-model gap map:
  - `doc/operations/Registry_Ops_Read_Model_Gap_Map_v1.md`
- Registry environment artifact inventory read-model contract:
  - `doc/operations/Registry_Environment_Artifact_Inventory_Read_Model_Contract_v1.md`
- Registry artifact trust status read-model contract:
  - `doc/operations/Registry_Artifact_Trust_Status_Read_Model_Contract_v1.md`
- Registry app artifact ops-status read-model contract:
  - `doc/operations/Registry_App_Artifact_Ops_Status_Read_Model_Contract_v1.md`
- Registry pull diagnosis read-model contract:
  - `doc/operations/Registry_Pull_Diagnosis_Read_Model_Contract_v1.md`
- Secrets/PKI ops read-model gap map:
  - `doc/operations/Secrets_PKI_Ops_Read_Model_Gap_Map_v1.md`
- Secrets/PKI purpose inventory read-model contract:
  - `doc/operations/Secrets_PKI_Purpose_Inventory_Read_Model_Contract_v1.md`
- Secrets/PKI Vault readiness read-model contract:
  - `doc/operations/Secrets_PKI_Vault_Readiness_Read_Model_Contract_v1.md`
- Secrets/PKI certificate lifecycle read-model contract:
  - `doc/operations/Secrets_PKI_Certificate_Lifecycle_Read_Model_Contract_v1.md`
- Secrets/PKI rotation evidence read-model contract:
  - `doc/operations/Secrets_PKI_Rotation_Evidence_Read_Model_Contract_v1.md`
- Secrets/PKI break-glass evidence read-model contract:
  - `doc/operations/Secrets_PKI_Breakglass_Evidence_Read_Model_Contract_v1.md`
- Provider console break-glass access model:
  - `doc/operations/Provider_Console_Breakglass_Access_Model_v1.md`
- Provider console access evidence packet:
  - `doc/operations/Provider_Console_Access_Evidence_Packet_v1.md`
- Provider capacity read-model gap map:
  - `doc/operations/Provider_Capacity_Read_Model_Gap_Map_v1.md`
- Cloudflare and DNS change evidence gate:
  - `doc/operations/Cloudflare_DNS_Change_Evidence_Gate_v1.md`
- Edge DNS read-model gap map:
  - `doc/operations/Edge_DNS_Read_Model_Gap_Map_v1.md`
- Incident notification and SOC operating model:
  - `doc/operations/Incident_Notification_And_SOC_Operating_Model_v1.md`
- Incident notification templates:
  - `doc/operations/Incident_Notification_Templates_v1.md`
- Temporal UI ops access decision:
  - `doc/operations/Temporal_UI_Ops_Access_Decision_v1.md`
- Temporal workflow read-model gap map:
  - `doc/operations/Temporal_Workflow_Read_Model_Gap_Map_v1.md`
- Temporal workflow search read-model contract:
  - `doc/operations/Temporal_Workflow_Search_Read_Model_Contract_v1.md`
- Temporal retry-history read-model contract:
  - `doc/operations/Temporal_Retry_History_Read_Model_Contract_v1.md`
- Temporal schedule-status read-model contract:
  - `doc/operations/Temporal_Schedule_Status_Read_Model_Contract_v1.md`
- Temporal stuck-activity read-model contract:
  - `doc/operations/Temporal_Stuck_Activity_Read_Model_Contract_v1.md`
- Release smoke validation checklist:
  - `doc/operations/Release_Smoke_Checklist.md`

## Core Runbooks
1. Admin Ops dashboard interpretation and decision-first triage
   - `doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md`
2. API degradation and high error rate
   - `doc/operations/runbooks/API_Degradation_Runbook.md`
3. Queue backlog and worker saturation
   - `doc/operations/runbooks/Queue_Backlog_Runbook.md`
4. Billing worker failure
   - `doc/operations/runbooks/Billing_Worker_Failure_Runbook.md`
5. Webhook processing outage
   - `doc/operations/runbooks/Webhook_Processing_Outage_Runbook.md`
6. Provisioning workflow stuck/failing
   - `doc/operations/runbooks/Provisioning_Workflow_Stuck_Runbook.md`
7. Database latency or failover
   - `doc/operations/runbooks/Database_Latency_or_Failover_Runbook.md`
8. Incident communication and stakeholder updates
   - `doc/operations/runbooks/Incident_Communication_Runbook.md`
9. Terminal gateway incidents (Option C cutover/rollback)
   - `doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md`
10. Node onboarding and bootstrap controls
    - `doc/operations/runbooks/Node_Onboarding_Runbook.md`
11. Tenant/project authorization failures
    - `doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md`
12. User onboarding and auth context failures
    - `doc/operations/runbooks/User_Onboarding_Auth_Context_Runbook.md`
13. IAM role assignment and membership incident response
    - `doc/operations/runbooks/IAM_Role_Assignment_and_Membership_Incident_Runbook.md`
14. App catalog browse/filter and entitlement incident response
    - `doc/operations/runbooks/App_Catalog_Incident_Runbook.md`
15. Fleet telemetry incident response (CPU/GPU/Memory/Storage)
    - `doc/operations/runbooks/Fleet_Telemetry_Incident_Runbook.md`
16. CLI incident and support triage
    - `doc/operations/runbooks/CLI_Incident_and_Support_Triage_Runbook.md`
17. Python SDK incident and observability triage
    - `doc/operations/runbooks/Python_SDK_Incident_and_Observability_Runbook.md`
18. App artifact lifecycle and trust incident response
    - `doc/operations/runbooks/App_Artifact_Lifecycle_Incident_Runbook.md`
19. Slurm reference instance stuck in `deploying`
    - `doc/operations/runbooks/Slurm_Reference_Deploying_Stuck_Runbook.md`
20. Platform-control disk cleanup
    - `doc/operations/runbooks/Platform_Control_Disk_Cleanup_Runbook.md`
21. Platform-control k3s recovery after disk-full or bad local image rollout
    - `doc/operations/runbooks/Platform_Control_K3s_Recovery_Runbook.md`
22. Platform-control dev Cloudflare reset planning
    - `doc/operations/runbooks/Platform_Control_Dev_Cloudflare_Reset_Runbook.md`
23. Managed app UAT failures: Headlamp, OpenClaw, Slurm
    - `doc/operations/runbooks/Managed_App_UAT_Failure_Runbook.md`
24. App runtime lifecycle incidents
    - `doc/operations/runbooks/App_Runtime_Lifecycle_Incident_Runbook.md`
25. Edge and app error presentation incidents
    - `doc/operations/runbooks/Edge_And_App_Error_Presentation_Runbook.md`
26. Database operations readiness
    - `doc/operations/runbooks/Database_Operations_Readiness_Runbook.md`
27. Database backup, restore, and DR
    - `doc/operations/runbooks/Database_Backup_Restore_DR_Runbook.md`
28. Provider VM operations readiness
    - `doc/operations/runbooks/Provider_VM_Ops_Readiness_Runbook.md`
29. Keycloak persistent DB validation
    - `doc/operations/runbooks/Keycloak_Persistent_DB_Validation_Runbook.md`
30. IAM MFA enrollment, reset, and break-glass
    - `doc/operations/runbooks/IAM_MFA_Enrollment_Reset_and_Breakglass_Runbook.md`
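
Every entry in this index is a repository path, so a renamed or deleted file leaves a stale entry that is easy to miss. The following is a minimal sketch of a path check, not an existing repository script. The function name `check_index_paths` is hypothetical, and it assumes the check runs from the repository root with the index file passed as the first argument.

```bash
# Sketch: report indexed doc paths that do not exist on disk.
# Assumptions (not from this index): run from the repository root;
# the index file path is passed as the first argument.

check_index_paths() {
  index_file=$1
  # Extract every backticked doc/...md path, then keep the ones that are not files.
  missing=$(grep -o '`doc/[^`]*\.md`' "$index_file" | tr -d '`' |
    while IFS= read -r doc_path; do
      [ -f "$doc_path" ] || printf 'missing: %s\n' "$doc_path"
    done)
  if [ -n "$missing" ]; then
    printf '%s\n' "$missing"
    return 1
  fi
  return 0
}

# Run the check only when an index file is given on the command line.
if [ "$#" -gt 0 ]; then
  check_index_paths "$1"
fi
```

Saved as a script, it takes the path to this index file as its only argument. It prints one `missing:` line per stale entry and exits non-zero if any are found.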

## Readiness Gate

Runbook and on-call readiness is checked by:

```bash
scripts/ops/observability_oncall_readiness.sh
```

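To feed an on-call readiness report into the Status/Ops snapshot, pass its path through `PLATFORM_STATUS_OBSERVABILITY_ONCALL_JSON` (`<run>` is a placeholder):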

```bash
PLATFORM_STATUS_OBSERVABILITY_ONCALL_JSON=dist/ops/observability-oncall/observability-oncall-<run>.json \
  scripts/ci/platform_status_snapshot.sh
```
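
The two commands above can be chained so that the snapshot always picks up the newest report. This is a sketch under assumptions this index does not confirm: that the readiness script exits non-zero on failure, and that it writes its report under `dist/ops/observability-oncall/`. The helper names `latest_report` and `run_gate_and_ingest` are hypothetical.

```bash
# Sketch: chain the readiness gate into the status snapshot.
# Assumptions (not confirmed by this index): the readiness script exits
# non-zero on failure and writes its JSON report under REPORT_DIR.
REPORT_DIR="${REPORT_DIR:-dist/ops/observability-oncall}"

# Print the most recently modified report in a directory, or nothing if none exist.
latest_report() {
  ls -t "$1"/observability-oncall-*.json 2>/dev/null | head -n 1
}

run_gate_and_ingest() {
  scripts/ops/observability_oncall_readiness.sh || return 1
  report=$(latest_report "$REPORT_DIR")
  if [ -z "$report" ]; then
    echo "no report found in $REPORT_DIR" >&2
    return 1
  fi
  PLATFORM_STATUS_OBSERVABILITY_ONCALL_JSON="$report" \
    scripts/ci/platform_status_snapshot.sh
}
```

Calling `run_gate_and_ingest` from the repository root stops at the first failing step, so a failed gate never produces a snapshot that references an older report.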

## Runbook Template
- Trigger condition
- Impact and blast radius
- Immediate mitigation
- Deep diagnosis
- Recovery steps
- Post-incident follow-up
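
A runbook can be checked against this template mechanically. The sketch below assumes each template item appears in the runbook as a Markdown heading containing the item text, matched case-insensitively. This index does not specify that convention, and the function name `missing_template_sections` is hypothetical.

```bash
# Sketch: list template sections that a runbook file lacks.
# Assumption (not from this index): each template item appears as a
# Markdown heading line containing the item text, in any letter case.
missing_template_sections() {
  runbook=$1
  for section in \
    "Trigger condition" \
    "Impact and blast radius" \
    "Immediate mitigation" \
    "Deep diagnosis" \
    "Recovery steps" \
    "Post-incident follow-up"
  do
    # A heading line starts with '#' and mentions the section name.
    grep -qi "^#.*$section" "$runbook" || printf 'missing section: %s\n' "$section"
  done
  return 0
}
```

For example, `missing_template_sections doc/operations/runbooks/API_Degradation_Runbook.md` prints nothing when all six sections are present.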
