Operators runbook
This path covers operating GPUaaS environments: release promotion, environment profiles, observability, patching, backups, incident response, and readiness evidence.
Persona Route
| Persona | First-read path | Decision points | Next-action pages |
|---|---|---|---|
| Platform operator | Production Deployment Model, Day-2 Operations, Production Baseline | Is the work about environment shape, promotion, incident triage, fleet/node lifecycle, observability, billing diagnostics, or evidence capture? | Release Operations, Incident Workflow, Node Lifecycle, Billing Diagnostics |
| Security reviewer | Security & Production Readiness, Release Evidence, Current Controls | Does the operator workflow preserve separation, auditability, evidence, and rollback confidence? | Security Controls, Observability, External Security Path |
Operator Map
- Environment profiles and promotion expectations.
- Release, patch, and rollback workflow.
- UAT automation and evidence capture.
- Observability, traces, metrics, logs, and alert drills.
- Backup/restore and data-growth checks.
- Runbook index by failure mode.
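As an illustration of what a runbook index keyed by failure mode could look like, here is a minimal sketch. The failure-mode names, the owning domains, and the fallback to Incident Workflow are all hypothetical and chosen only for the example. The runbook names reuse page titles from this section, but the pairing of failure mode to page is an assumption, not the actual index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    failure_mode: str   # hypothetical identifier for a class of failure
    runbook: str        # page an operator should open first
    owning_domain: str  # hypothetical owning team or domain


# Hypothetical entries; the real index lives on the Runbook Index page.
RUNBOOK_INDEX = {
    entry.failure_mode: entry
    for entry in [
        RunbookEntry("node-unreachable", "Node Lifecycle", "fleet"),
        RunbookEntry("billing-usage-mismatch", "Billing Diagnostics", "billing"),
        RunbookEntry("release-rollback", "Release Operations", "platform"),
    ]
}


def lookup_runbook(failure_mode: str) -> RunbookEntry:
    """Return the entry for a failure mode, or a generic fallback.

    Falling back to Incident Workflow is an assumption made for this sketch.
    """
    try:
        return RUNBOOK_INDEX[failure_mode]
    except KeyError:
        return RunbookEntry(failure_mode, "Incident Workflow", "unassigned")
```

Keying the index by failure mode rather than by alert name keeps the lookup stable when alert rules are renamed; alerts can then map onto failure modes in a separate table.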
Pages
- Day-2 Operations
- CI/CD Delivery System
- Production Deployment Model
- Production Baseline
- Release Operations
- Stabilization Closeout Status
- Observability
- Incident Workflow
- Billing Diagnostics
- Node Lifecycle
- Runbook Index
What Operators Should Be Able To Answer
- What release is running, what SHA it came from, and what evidence supports it?
- Which nodes and workers are healthy, degraded, draining, or reserved?
- Which alerts map to which runbooks and owning domains?
- How do we prove UAT, release, runtime, security, and rollback readiness?
- When is direct database inspection acceptable, and which missing API or read model should replace repeated direct queries?
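The first two questions above can be made concrete with a small sketch. It assumes a release is described by a version, a source SHA, and a list of evidence references, and that each node reports exactly one of the four states named above. The field names, the `ReleaseRecord` and `summarize_fleet` helpers, and the sample values are hypothetical and do not describe the platform's actual API.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    DRAINING = "draining"
    RESERVED = "reserved"


@dataclass
class ReleaseRecord:
    version: str                  # hypothetical release identifier
    sha: str                      # commit the release was built from
    evidence: list = field(default_factory=list)  # references to evidence items

    def has_evidence(self) -> bool:
        """True when at least one evidence reference is attached."""
        return len(self.evidence) > 0


def summarize_fleet(nodes: dict) -> dict:
    """Count nodes per state, reporting zero for states with no nodes.

    `nodes` maps a node name to its state string. An unrecognized
    state string raises ValueError instead of being silently dropped.
    """
    counts = Counter(NodeState(state) for state in nodes.values())
    return {state.value: counts.get(state, 0) for state in NodeState}


# Hypothetical sample data for illustration only.
release = ReleaseRecord("2024.06.1", "abc1234", ["uat-run-17"])
fleet = {"gpu-a": "healthy", "gpu-b": "draining", "gpu-c": "healthy"}
summary = summarize_fleet(fleet)
```

With the sample data, `summary` reports two healthy nodes, one draining node, and zero for the remaining states. Rejecting unknown states loudly is a deliberate choice here: a node in an unmodelled state is itself something an operator should see.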