Incident Workflow Runbook
The incident model should route people to the owning domain quickly, preserve evidence, and turn repeated manual work into product or platform gaps.
Correlation-First Flow
Triage Questions
| Question | Why it matters |
|---|---|
| Who is impacted? | Sets the scope: tenant, project, user, node fleet, app, or internal process |
| What changed? | Narrows the likely cause: release, config, policy, credential, node state, or external dependency |
| Which domain owns it? | Prevents symptom-only fixes in the wrong layer |
| What evidence exists? | Identifies what to preserve: correlation ID, alert, trace, dashboard, runbook action, or audit row |
| How do we verify recovery? | Confirms the fix through API/read-model surfaces rather than direct database inspection |
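
One way to keep the triage questions from being skipped is to record the answers in a small structure and check which are still open. The sketch below is illustrative only: `TriageRecord`, its field names, and `unanswered` are hypothetical and do not correspond to any existing tool.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class TriageRecord:
    """Hypothetical record holding one answer per triage question."""

    impacted: str = ""        # Who is impacted?
    change: str = ""          # What changed?
    owning_domain: str = ""   # Which domain owns it?
    evidence: List[str] = field(default_factory=list)  # What evidence exists?
    recovery_check: str = ""  # How do we verify recovery?

    def unanswered(self) -> List[str]:
        """Return the names of fields that are still empty, in question order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example values are made up for illustration.
record = TriageRecord(impacted="tenant", evidence=["correlation ID"])
open_questions = record.unanswered()
```

Here `open_questions` lists the questions that still need an answer before the incident is handed to an owning domain.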
Runbook Routing
- API degradation and dependency latency route to API/dependency runbooks.
- Provisioning stuck states route to provisioning workflow and node-task runbooks.
- Terminal failures route to terminal gateway and node-agent preflight runbooks.
- Billing and payment incidents route to billing worker, ledger, and webhook runbooks.
- IAM incidents route to role assignment, membership, JWKS, and federation runbooks.
- App runtime incidents route to app platform, artifact, catalog, and lifecycle runbooks.
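
The routing rules above can be read as a lookup table from incident category to runbooks. The following is a minimal sketch under that assumption; the category keys, `RUNBOOK_ROUTES`, and `route` are hypothetical names chosen for illustration, and only the runbook names come from the list.

```python
from typing import Dict, List

# Hypothetical category keys mapped to the runbooks named in the routing list.
RUNBOOK_ROUTES: Dict[str, List[str]] = {
    "api": ["API", "dependency"],
    "provisioning": ["provisioning workflow", "node-task"],
    "terminal": ["terminal gateway", "node-agent preflight"],
    "billing": ["billing worker", "ledger", "webhook"],
    "iam": ["role assignment", "membership", "JWKS", "federation"],
    "app_runtime": ["app platform", "artifact", "catalog", "lifecycle"],
}


def route(category: str) -> List[str]:
    """Return the runbooks for a category, or raise if no route is defined."""
    try:
        return RUNBOOK_ROUTES[category]
    except KeyError:
        # Failing loudly avoids silently sending an incident to the wrong domain.
        raise ValueError(f"no runbook route for category {category!r}") from None


billing_runbooks = route("billing")
```

Raising on an unknown category, rather than returning a default, keeps an unrouted incident visible instead of letting it land in the wrong layer.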