Incident Workflow Runbook

The incident model should route people to the owning domain quickly, preserve evidence, and surface repeated manual work as product or platform gaps to be fixed.

Correlation-First Flow

Triage Questions

Question                     Why it matters
Who is impacted?             Tenant, project, user, node fleet, app, or internal process
What changed?                Release, config, policy, credential, node state, external dependency
Which domain owns it?        Prevents symptom-only fixes in the wrong layer
What evidence exists?        Correlation ID, alert, trace, dashboard, runbook action, audit row
How do we verify recovery?   Prefer API/read-model surfaces over direct database inspection
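
One way to keep these questions in front of responders is to record the answers in a small structure and check which ones are still open. The sketch below is illustrative only: the `TriageRecord` class, its field names, and the sample values are assumptions, not part of any existing tooling described in this runbook.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class TriageRecord:
    """Answers to the five triage questions (hypothetical field names)."""

    impacted: str = ""        # tenant, project, user, node fleet, app, or internal process
    change: str = ""          # release, config, policy, credential, node state, external dependency
    owning_domain: str = ""   # the domain that owns the fix
    evidence: List[str] = field(default_factory=list)  # correlation IDs, alerts, traces, ...
    recovery_check: str = ""  # how recovery will be verified

    def unanswered(self) -> List[str]:
        """Return the names of the questions that have no answer yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Sample values are made up for illustration.
record = TriageRecord(
    impacted="tenant: example-tenant",
    evidence=["correlation-id: example-123"],
)
print(record.unanswered())  # ['change', 'owning_domain', 'recovery_check']
```

Listing the open questions makes it obvious when an incident is being worked without a known owner or a recovery check.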

Runbook Routing

  • API degradation and dependency latency route to API/dependency runbooks.
  • Provisioning stuck states route to provisioning workflow and node-task runbooks.
  • Terminal failures route to terminal gateway and node-agent preflight runbooks.
  • Billing and payment incidents route to billing worker, ledger, and webhook runbooks.
  • IAM incidents route to role assignment, membership, JWKS, and federation runbooks.
  • App runtime incidents route to app platform, artifact, catalog, and lifecycle runbooks.
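
The routing list above can be sketched as a simple lookup table. This is a minimal illustration, not an existing implementation: the category keys, the `RUNBOOK_ROUTES` name, and the `route` function are assumptions, while the runbook names are copied from the list.

```python
from typing import Dict, List

# Hypothetical category keys; runbook names come from the routing list.
RUNBOOK_ROUTES: Dict[str, List[str]] = {
    "api": ["API", "dependency"],
    "provisioning": ["provisioning workflow", "node-task"],
    "terminal": ["terminal gateway", "node-agent preflight"],
    "billing": ["billing worker", "ledger", "webhook"],
    "iam": ["role assignment", "membership", "JWKS", "federation"],
    "app_runtime": ["app platform", "artifact", "catalog", "lifecycle"],
}


def route(category: str) -> List[str]:
    """Return the runbooks for a category, or raise if none is registered."""
    try:
        return RUNBOOK_ROUTES[category]
    except KeyError:
        # No silent default: an unknown category should go back through triage
        # to find the owning domain instead of landing in the wrong layer.
        raise LookupError(f"no runbook route for category {category!r}") from None


print(route("billing"))  # ['billing worker', 'ledger', 'webhook']
```

Raising on an unknown category, rather than falling back to a default runbook, mirrors the triage question about which domain owns the incident.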