Day-2 Operations Runbook

This page explains how GPUaaS should be managed after deployment. It is the operator-facing bridge between production-readiness architecture and the detailed runbooks under doc/operations/**.

Operating Model

Area	Operator responsibility	Evidence to capture
Release and patching	Promote known SHAs, use rings, preserve rollback paths, and avoid pushing untested changes to all production capacity	Release evidence bundle, UAT result, security checks, rollback readiness
Capacity and node fleet	Maintain reserve capacity, track node health, rotate or drain nodes before disruptive changes	Inventory health, node-agent status, drain/return-to-service record
Provisioning	Monitor allocation lifecycle, stuck workflows, queue backlog, and node-task execution	Allocation state, Temporal/NATS status, worker logs, correlation IDs
Terminal access	Validate session binding, gateway health, node-agent preflight, and websocket drain behavior	Token mint evidence, gateway metrics, session failure traces
Billing and payments	Monitor billing worker windows, ledger integrity, Stripe webhook processing, and balance-driven force release	Ledger entries, webhook dedupe, billing worker runs, low-balance events
Storage and artifacts	Verify attachment, path-safety, app artifacts, and data lifecycle expectations	Storage operation logs, artifact trust records, user/project scope evidence
Security controls	Validate secrets, certs, WAF, rate limits, admin revocation, audit logs, and policy values	Security control verification, audit rows, cert expiry checks, exception records
Observability	Keep logs, metrics, traces, dashboards, alerts, and runbook mappings current	Alert history, traces, runbook links, incident timeline
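
The release and capacity rows above combine two constraints: changes go out in rings, and reserve capacity stays available while nodes are being changed. A minimal sketch of that planning step follows. The function name `plan_rings`, the percentage-based ring sizes, and the node names are hypothetical and used only for illustration; they are not part of GPUaaS.

```python
from typing import List


def plan_rings(nodes: List[str], ring_percents: List[int], reserve: int) -> List[List[str]]:
    """Split nodes into ordered rings.

    Each ring is capped so that at least `reserve` nodes are left untouched
    while that ring is being changed (assumed policy, for illustration only).
    """
    if reserve < 0 or reserve >= len(nodes):
        raise ValueError("reserve must be between 0 and len(nodes) - 1")

    max_ring = len(nodes) - reserve  # largest batch that still leaves the reserve
    rings: List[List[str]] = []
    remaining = list(nodes)

    # Early rings are sized as a percentage of the whole fleet.
    for percent in ring_percents:
        if not remaining:
            break
        size = max(1, len(nodes) * percent // 100)
        size = min(size, max_ring, len(remaining))
        rings.append(remaining[:size])
        remaining = remaining[size:]

    # Whatever is left goes out in batches that still respect the reserve.
    while remaining:
        size = min(max_ring, len(remaining))
        rings.append(remaining[:size])
        remaining = remaining[size:]

    return rings


if __name__ == "__main__":
    fleet = [f"node-{i}" for i in range(10)]
    for index, ring in enumerate(plan_rings(fleet, [10, 30], reserve=4)):
        print(index, ring)
```

Each ring would be promoted only after the previous ring's evidence (UAT result, security checks, rollback readiness) has been captured.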

Day-2 Loop

Daily And Weekly Checks

Cadence	Checks
Daily	API health, worker health, NATS/Temporal lag, node-agent status, billing worker completion, webhook errors, terminal gateway health, alert noise
Weekly	Backup restore evidence, cert expiry review, policy value drift, release evidence completeness, runbook freshness, reserved capacity posture
Per release	Exact SHA, UAT automation evidence, security checks, migration/read-model checks, rollback plan, release notes, owner signoff
Per incident	Correlation ID trail, impacted tenants/projects, timeline, mitigation, root-cause owner, follow-up guard or product gap
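
The daily and weekly rows can be kept as data so that a skipped check is as visible as a failed one. The sketch below assumes results arrive as a simple name-to-boolean mapping; the `CHECKS` structure and `outstanding_checks` function are hypothetical helpers, not an existing GPUaaS tool.

```python
from typing import Dict, List

# Check names copied from the cadence table; the structure itself is an assumption.
CHECKS: Dict[str, List[str]] = {
    "daily": [
        "API health",
        "worker health",
        "NATS/Temporal lag",
        "node-agent status",
        "billing worker completion",
        "webhook errors",
        "terminal gateway health",
        "alert noise",
    ],
    "weekly": [
        "backup restore evidence",
        "cert expiry review",
        "policy value drift",
        "release evidence completeness",
        "runbook freshness",
        "reserved capacity posture",
    ],
}


def outstanding_checks(cadence: str, results: Dict[str, bool]) -> List[str]:
    """Return checks for a cadence that failed or were never recorded."""
    return [name for name in CHECKS[cadence] if not results.get(name, False)]


if __name__ == "__main__":
    today = {"API health": True, "worker health": False}
    print(outstanding_checks("daily", today))
```

Treating a missing result the same as a failure keeps an unrun check from passing silently.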

First Response By Symptom

Symptom	First place to look	Owning path
Allocation stuck in provisioning or releasing	Provisioning workflow, node task status, NATS backlog	Runbook Index
Browser terminal fails	Token minting, terminal gateway, node-agent preflight, websocket routing	Runbook Index
Balance or billing looks wrong	Billing worker run, ledger entries, payment webhook status	Runbook Index
App launch fails	App catalog, manifest, runtime evidence, artifact lifecycle	Build on AI Cloud
API degraded	Edge, API process, DB latency, Redis/NATS/Temporal dependency, error classification	System Overview
Security exception needed	Control verification, exception owner, expiry, risk approval	Security & Production Readiness
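
For the first symptom in the table, "stuck" needs a working definition before anyone looks at workflows or NATS backlog. One possible definition is an allocation that has stayed in a transitional state longer than a threshold. The sketch below assumes that definition; the `Allocation` fields and the 15-minute default are hypothetical and would need to match the real allocation read model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# States named in the symptom table as places an allocation can get stuck.
TRANSITIONAL_STATES = {"provisioning", "releasing"}


@dataclass
class Allocation:
    # Hypothetical fields; the real read model may differ.
    allocation_id: str
    state: str
    state_since: datetime
    correlation_id: str


def find_stuck(
    allocations: List[Allocation],
    now: datetime,
    threshold: timedelta = timedelta(minutes=15),  # assumed value
) -> List[Allocation]:
    """Return allocations that have sat in a transitional state past the threshold."""
    return [
        a
        for a in allocations
        if a.state in TRANSITIONAL_STATES and now - a.state_since > threshold
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    sample = [
        Allocation("a1", "provisioning", now - timedelta(minutes=30), "corr-1"),
        Allocation("a2", "active", now - timedelta(hours=2), "corr-2"),
    ]
    for alloc in find_stuck(sample, now):
        # The correlation ID is the handle for tracing workflow and node-task logs.
        print(alloc.allocation_id, alloc.state, alloc.correlation_id)
```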

Management Principles

Prefer public/admin APIs and explicit read models for verification.
Use direct SQL only while the owning operator/debug surface is missing.
Capture evidence as part of the operational action, not as a later narrative.
Fix the owning layer instead of adding symptom-only workarounds.
Keep release and patch work ring-based with reserve capacity.
Graduate recurring report-only findings into CI or operational gates.
Prefer deterministic local utilities for CI/deploy/UAT/evidence summaries so operators review stable packets instead of chat-only status.
Treat UAT as a confidence gate, not the first defect detector: pre-UAT readiness gates should catch stale images, missing app prerequisites, unhealthy services, terminal/websocket gaps, and unsafe error presentation.
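
The last two principles can be combined in one small utility: run the pre-UAT readiness gates, then emit a packet that is byte-identical for identical inputs so it can be diffed and reviewed. This is a sketch under assumptions. The gate names, the packet fields, and the `evidence_packet` and `packet_digest` functions are hypothetical, and the gate results are supplied by the caller rather than probed.

```python
import hashlib
import json
from typing import Dict


def evidence_packet(sha: str, gates: Dict[str, bool]) -> str:
    """Build a deterministic JSON packet for one release candidate.

    Keys are sorted, so the same gate results always serialize identically.
    An empty gate set is treated as not ready.
    """
    packet = {
        "sha": sha,
        "gates": gates,
        "failed": sorted(name for name, ok in gates.items() if not ok),
        "ready_for_uat": bool(gates) and all(gates.values()),
    }
    return json.dumps(packet, sort_keys=True, indent=2)


def packet_digest(packet: str) -> str:
    """SHA-256 of the packet text, usable as a stable reference in a release record."""
    return hashlib.sha256(packet.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical gate names derived from the readiness concerns listed above.
    results = {
        "images_fresh": True,
        "app_prerequisites_present": True,
        "services_healthy": False,
        "terminal_websocket_ok": True,
        "error_presentation_safe": True,
    }
    packet = evidence_packet("abc123", results)  # "abc123" is a placeholder SHA
    print(packet)
    print(packet_digest(packet))
```

Because the packet is deterministic, a changed digest between two runs on the same SHA signals a real change in gate results, not a formatting difference.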
