Day-2 Operations runbook
This page explains how GPUaaS should be managed after deployment. It is the
operator-facing bridge between production-readiness architecture and the detailed
runbooks under doc/operations/**.
Operating Model
| Area | Operator responsibility | Evidence to capture |
|---|---|---|
| Release and patching | Promote known SHAs, use rings, preserve rollback paths, and avoid pushing untested changes to all production capacity | Release evidence bundle, UAT result, security checks, rollback readiness |
| Capacity and node fleet | Maintain reserve capacity, track node health, rotate or drain nodes before disruptive changes | Inventory health, node-agent status, drain/return-to-service record |
| Provisioning | Monitor allocation lifecycle, stuck workflows, queue backlog, and node-task execution | Allocation state, Temporal/NATS status, worker logs, correlation IDs |
| Terminal access | Validate session binding, gateway health, node-agent preflight, and websocket drain behavior | Token mint evidence, gateway metrics, session failure traces |
| Billing and payments | Monitor billing worker windows, ledger integrity, Stripe webhook processing, and balance-driven force release | Ledger entries, webhook dedupe, billing worker runs, low-balance events |
| Storage and artifacts | Verify storage attachment, path safety, app artifacts, and data lifecycle expectations | Storage operation logs, artifact trust records, user/project scope evidence |
| Security controls | Validate secrets, certs, WAF, rate limits, admin revocation, audit logs, and policy values | Security control verification, audit rows, cert expiry checks, exception records |
| Observability | Keep logs, metrics, traces, dashboards, alerts, and runbook mappings current | Alert history, traces, runbook links, incident timeline |
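One way to keep the "Evidence to capture" column honest is to record evidence in the same step as the action itself. The following is a minimal sketch under stated assumptions: `capture_evidence`, its parameters, and the record fields are hypothetical names for illustration, not part of GPUaaS or its tooling.

```python
import json
from contextlib import contextmanager
from datetime import datetime, timezone


@contextmanager
def capture_evidence(area, action, sink):
    """Record one operator action and its evidence as a single JSON entry.

    Hypothetical helper: `sink` is any object with an append() method,
    for example a list or a thin wrapper around a log writer.
    """
    record = {
        "area": area,
        "action": action,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "evidence": {},
        "outcome": "unknown",
    }
    try:
        # The caller attaches evidence while the action is running.
        yield record["evidence"]
        record["outcome"] = "succeeded"
    except Exception as exc:
        record["outcome"] = "failed"
        record["error"] = repr(exc)
        raise
    finally:
        # The entry is written whether the action succeeded or failed.
        record["finished_at"] = datetime.now(timezone.utc).isoformat()
        sink.append(json.dumps(record, sort_keys=True))


# Usage: the evidence keys mirror the "Capacity and node fleet" row.
entries = []
with capture_evidence("Capacity and node fleet", "drain node", entries) as evidence:
    evidence["node_agent_status"] = "healthy"
    evidence["drain_record"] = "example-drain-record"
```

Because the entry is appended in the `finally` clause, a failed action still leaves a record with its error, which supports the incident timeline rather than relying on a later narrative.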
Day-2 Loop
Daily And Weekly Checks
| Cadence | Checks |
|---|---|
| Daily | API health, worker health, NATS/Temporal lag, node-agent status, billing worker completion, webhook errors, terminal gateway health, alert noise |
| Weekly | Backup restore evidence, cert expiry review, policy value drift, release evidence completeness, runbook freshness, reserved capacity posture |
| Per release | Exact SHA, UAT automation evidence, security checks, migration/read-model checks, rollback plan, release notes, owner signoff |
| Per incident | Correlation ID trail, impacted tenants/projects, timeline, mitigation, root-cause owner, follow-up guard or product gap |
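The cadence table above can also be kept as data so that a small utility reports which checks are still outstanding. This is an illustrative sketch only: the check names are copied from the table, while `CHECKS_BY_CADENCE` and `outstanding_checks` are hypothetical names, and how a check gets marked as completed is left to the operator's tooling.

```python
# Check names are taken from the cadence table; the keys are assumptions.
CHECKS_BY_CADENCE = {
    "daily": [
        "API health",
        "worker health",
        "NATS/Temporal lag",
        "node-agent status",
        "billing worker completion",
        "webhook errors",
        "terminal gateway health",
        "alert noise",
    ],
    "weekly": [
        "Backup restore evidence",
        "cert expiry review",
        "policy value drift",
        "release evidence completeness",
        "runbook freshness",
        "reserved capacity posture",
    ],
}


def outstanding_checks(cadence, completed):
    """Return the checks for `cadence` that are not in `completed`, in table order."""
    if cadence not in CHECKS_BY_CADENCE:
        raise ValueError(f"unknown cadence: {cadence!r}")
    done = set(completed)
    return [check for check in CHECKS_BY_CADENCE[cadence] if check not in done]


# Usage: two daily checks are done, the remaining six are reported.
remaining = outstanding_checks("daily", ["API health", "worker health"])
print(f"{len(remaining)} daily checks outstanding: {', '.join(remaining)}")
```

Keeping the list in table order makes the output stable between runs, so operators can compare one day's packet with the next.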
First Response By Symptom
| Symptom | First place to look | Owning path |
|---|---|---|
| Allocation stuck in provisioning or releasing | Provisioning workflow, node task status, NATS backlog | Runbook Index |
| Browser terminal fails | Token minting, terminal gateway, node-agent preflight, websocket routing | Runbook Index |
| Balance or billing looks wrong | Billing worker run, ledger entries, payment webhook status | Runbook Index |
| App launch fails | App catalog, manifest, runtime evidence, artifact lifecycle | Build on AI Cloud |
| API degraded | Edge, API process, DB latency, Redis/NATS/Temporal dependency, error classification | System Overview |
| Security exception needed | Control verification, exception owner, expiry, risk approval | Security & Production Readiness |
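For tooling such as an alert annotation or an on-call helper, the symptom table can be expressed as a lookup. The sketch below is an assumption-laden illustration: the symptom keys, the `FirstResponse` type, and the `first_response` function are hypothetical, and only the table contents come from this page.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FirstResponse:
    first_look: Tuple[str, ...]  # where to look first, in table order
    owning_path: str             # the document that owns the detailed procedure


# Keys are hypothetical identifiers; values mirror the symptom table.
FIRST_RESPONSE = {
    "allocation_stuck": FirstResponse(
        ("Provisioning workflow", "node task status", "NATS backlog"),
        "Runbook Index",
    ),
    "browser_terminal_fails": FirstResponse(
        ("Token minting", "terminal gateway", "node-agent preflight", "websocket routing"),
        "Runbook Index",
    ),
    "billing_looks_wrong": FirstResponse(
        ("Billing worker run", "ledger entries", "payment webhook status"),
        "Runbook Index",
    ),
    "app_launch_fails": FirstResponse(
        ("App catalog", "manifest", "runtime evidence", "artifact lifecycle"),
        "Build on AI Cloud",
    ),
}


def first_response(symptom_key):
    """Look up the first-response entry, failing loudly for unmapped symptoms."""
    try:
        return FIRST_RESPONSE[symptom_key]
    except KeyError:
        raise KeyError(f"no first-response entry for {symptom_key!r}") from None


# Usage
entry = first_response("browser_terminal_fails")
print(entry.owning_path, "->", "; ".join(entry.first_look))
```

Raising on an unmapped symptom, rather than returning a default, surfaces gaps in the table so they can be closed in the owning document.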
Management Principles
- Prefer public/admin APIs and explicit read models for verification.
- Use direct SQL only as a stopgap while the owning operator or debug surface is missing.
- Capture evidence as part of the operational action, not as a later narrative.
- Fix the owning layer instead of adding symptom-only workarounds.
- Keep release and patch work ring-based with reserve capacity.
- Graduate recurring report-only findings into CI or operational gates.
- Prefer deterministic local utilities for CI/deploy/UAT/evidence summaries so operators review stable packets instead of chat-only status.
- Treat UAT as a confidence gate, not the first defect detector: pre-UAT readiness gates should catch stale images, missing app prerequisites, unhealthy services, terminal/websocket gaps, and unsafe error presentation.
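The pre-UAT readiness principle in the last bullet can be sketched as a set of named gates that must all pass before UAT starts. This is a minimal illustration under assumptions: `run_readiness_gates` and the gate names are hypothetical, and the lambdas stand in for real probes that are not defined on this page.

```python
from typing import Callable, Dict, List, Tuple

Gate = Callable[[], bool]


def run_readiness_gates(gates: Dict[str, Gate]) -> Tuple[bool, List[str]]:
    """Run every gate and return (all_passed, names_of_failed_gates)."""
    failures = []
    for name, gate in gates.items():
        try:
            passed = bool(gate())
        except Exception:
            # A gate that cannot be evaluated is treated as failed.
            passed = False
        if not passed:
            failures.append(name)
    return (not failures, failures)


# Usage with stand-in probes; real gates would query the deployed system.
example_gates = {
    "images_match_release_sha": lambda: True,
    "app_prerequisites_present": lambda: True,
    "services_healthy": lambda: False,  # simulated unhealthy service
    "terminal_websocket_reachable": lambda: True,
    "error_presentation_safe": lambda: True,
}

ready, failed = run_readiness_gates(example_gates)
if not ready:
    print("UAT blocked by readiness gates:", ", ".join(failed))
```

All gates run even after one fails, so a single pass reports every readiness gap instead of stopping at the first.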
Canonical sources