Operators runbook

This path is for operating GPUaaS environments: release promotion, environment profiles, observability, patching, backups, incident response, and readiness evidence.

Persona Route

Platform operator
  • First-read path: Production Deployment Model, Day-2 Operations, Production Baseline
  • Decision points: Is the work about environment shape, promotion, incident triage, fleet/node lifecycle, observability, billing diagnostics, or evidence capture?
  • Next-action pages: Release Operations, Incident Workflow, Node Lifecycle, Billing Diagnostics

Security reviewer
  • First-read path: Security & Production Readiness, Release Evidence, Current Controls
  • Decision points: Does the operator workflow preserve separation, auditability, evidence, and rollback confidence?
  • Next-action pages: Security Controls, Observability, External Security Path

Operator Map

  • Environment profiles and promotion expectations.
  • Release, patch, and rollback workflow.
  • UAT automation and evidence capture.
  • Observability, traces, metrics, logs, and alert drills.
  • Backup/restore and data-growth checks.
  • Runbook index by failure mode.

What Operators Should Be Able To Answer

  • What release is running, what SHA it came from, and what evidence supports it?
  • Which nodes and workers are healthy, degraded, draining, or reserved?
  • Which alerts map to which runbooks and owning domains?
  • How do we prove UAT, release, runtime, security, and rollback readiness?
  • When is direct database inspection acceptable, and which missing API or read model should replace repeated direct queries?
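
Two of the questions above (node health by state, and alert-to-runbook ownership) can be answered mechanically once the data is available in one place. The following is a minimal sketch in Python, not the platform's actual API: the `NodeState` values are taken from the states named above, while `AlertRoute`, `summarize_nodes`, `unmapped_alerts`, and all node, alert, runbook, and domain names are hypothetical and used only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    # States named in the operator questions above.
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DRAINING = "draining"
    RESERVED = "reserved"


@dataclass(frozen=True)
class AlertRoute:
    # Hypothetical record linking an alert to its runbook and owner.
    alert: str
    runbook: str  # empty string means no runbook is mapped yet
    owning_domain: str  # empty string means no owner is assigned yet


def summarize_nodes(states: dict[str, NodeState]) -> dict[str, int]:
    """Count nodes per state, including states with zero nodes."""
    counts = Counter(states.values())
    return {state.value: counts.get(state, 0) for state in NodeState}


def unmapped_alerts(routes: list[AlertRoute]) -> list[str]:
    """Return alert names that lack a runbook or an owning domain."""
    return [r.alert for r in routes if not r.runbook or not r.owning_domain]


# Hypothetical sample data; real values would come from the platform.
nodes = {
    "gpu-node-a": NodeState.HEALTHY,
    "gpu-node-b": NodeState.DRAINING,
    "gpu-node-c": NodeState.HEALTHY,
}
routes = [
    AlertRoute("node-not-ready", "node-lifecycle", "fleet"),
    AlertRoute("billing-lag", "", "billing"),
]

node_summary = summarize_nodes(nodes)
missing = unmapped_alerts(routes)
print(node_summary)
print(missing)
```

A check like `unmapped_alerts` could run as part of readiness evidence capture, so that an alert without a runbook or owner is caught before an incident rather than during one.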