Day-2 Operations runbook

This page explains how GPUaaS should be managed after deployment. It is the operator-facing bridge between production-readiness architecture and the detailed runbooks under doc/operations/**.

Operating Model

| Area | Operator responsibility | Evidence to capture |
| --- | --- | --- |
| Release and patching | Promote known SHAs, use rings, preserve rollback paths, and avoid pushing untested changes to all production capacity | Release evidence bundle, UAT result, security checks, rollback readiness |
| Capacity and node fleet | Maintain reserve capacity, track node health, rotate or drain nodes before disruptive changes | Inventory health, node-agent status, drain/return-to-service record |
| Provisioning | Monitor allocation lifecycle, stuck workflows, queue backlog, and node-task execution | Allocation state, Temporal/NATS status, worker logs, correlation IDs |
| Terminal access | Validate session binding, gateway health, node-agent preflight, and websocket drain behavior | Token mint evidence, gateway metrics, session failure traces |
| Billing and payments | Monitor billing worker windows, ledger integrity, Stripe webhook processing, and balance-driven force release | Ledger entries, webhook dedupe, billing worker runs, low-balance events |
| Storage and artifacts | Verify attachment, path-safety, app artifacts, and data lifecycle expectations | Storage operation logs, artifact trust records, user/project scope evidence |
| Security controls | Validate secrets, certs, WAF, rate limits, admin revocation, audit logs, and policy values | Security control verification, audit rows, cert expiry checks, exception records |
| Observability | Keep logs, metrics, traces, dashboards, alerts, and runbook mappings current | Alert history, traces, runbook links, incident timeline |
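
Each area above pairs a responsibility with evidence to capture. One way to make that evidence a byproduct of the action, rather than something written up afterwards, is to wrap the action in a recorder. The following is a minimal sketch under stated assumptions: `evidence_record`, the record fields, and the in-memory `evidence_log` list are hypothetical names chosen for illustration and are not part of any GPUaaS API or schema.

```python
import time
import uuid
from contextlib import contextmanager


@contextmanager
def evidence_record(area, action, sink):
    """Wrap one operational action and append one evidence record to `sink`.

    The field names are illustrative assumptions, not a GPUaaS schema.
    """
    record = {
        "correlation_id": str(uuid.uuid4()),
        "area": area,
        "action": action,
        "started_at": time.time(),
        "outcome": "unknown",
        "notes": [],
    }
    try:
        yield record  # the caller attaches notes while doing the work
        record["outcome"] = "success"
    except Exception as exc:
        record["outcome"] = "failed"
        record["error"] = repr(exc)
        raise  # never hide the failure from the operator
    finally:
        # The record is written whether the action succeeded or failed.
        record["finished_at"] = time.time()
        sink.append(record)


evidence_log = []  # stand-in for wherever evidence is actually stored
with evidence_record("Capacity and node fleet", "drain node", evidence_log) as rec:
    # Placeholder for the real drain step.
    rec["notes"].append("node-agent status checked before drain")

print(evidence_log[0]["action"], evidence_log[0]["outcome"])
```

Because the record is appended in the `finally` clause, a failed action still leaves a trace with its correlation ID, which is the part most likely to be skipped when evidence is collected by hand.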

Day-2 Loop

Daily And Weekly Checks

| Cadence | Checks |
| --- | --- |
| Daily | API health, worker health, NATS/Temporal lag, node-agent status, billing worker completion, webhook errors, terminal gateway health, alert noise |
| Weekly | Backup restore evidence, cert expiry review, policy value drift, release evidence completeness, runbook freshness, reserved capacity posture |
| Per release | Exact SHA, UAT automation evidence, security checks, migration/read-model checks, rollback plan, release notes, owner signoff |
| Per incident | Correlation ID trail, impacted tenants/projects, timeline, mitigation, root-cause owner, follow-up guard or product gap |
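
The daily and weekly rows above can be kept as data so that a small local script reports which checks are still outstanding. This is only a sketch: the check names are copied from the table, while `CADENCE_CHECKS` and `missing_checks` are hypothetical names and the dictionary layout is an assumption.

```python
# Check names come from the cadence table; the structure is illustrative.
CADENCE_CHECKS = {
    "daily": [
        "API health",
        "worker health",
        "NATS/Temporal lag",
        "node-agent status",
        "billing worker completion",
        "webhook errors",
        "terminal gateway health",
        "alert noise",
    ],
    "weekly": [
        "backup restore evidence",
        "cert expiry review",
        "policy value drift",
        "release evidence completeness",
        "runbook freshness",
        "reserved capacity posture",
    ],
}


def missing_checks(cadence, completed):
    """Return the checks for `cadence` not yet completed, in table order.

    Matching is case-insensitive so "Cert expiry review" counts as done.
    """
    done = {name.lower() for name in completed}
    return [c for c in CADENCE_CHECKS[cadence] if c.lower() not in done]


remaining = missing_checks("weekly", ["Cert expiry review", "Runbook freshness"])
print(remaining)
```

Returning the outstanding checks in a fixed order keeps the output stable between runs, which makes it easier to diff one week's summary against the next.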

First Response By Symptom

| Symptom | First place to look | Owning path |
| --- | --- | --- |
| Allocation stuck in provisioning or releasing | Provisioning workflow, node task status, NATS backlog | Runbook Index |
| Browser terminal fails | Token minting, terminal gateway, node-agent preflight, websocket routing | Runbook Index |
| Balance or billing looks wrong | Billing worker run, ledger entries, payment webhook status | Runbook Index |
| App launch fails | App catalog, manifest, runtime evidence, artifact lifecycle | Build on AI Cloud |
| API degraded | Edge, API process, DB latency, Redis/NATS/Temporal dependency, error classification | System Overview |
| Security exception needed | Control verification, exception owner, expiry, risk approval | Security & Production Readiness |

Management Principles

  • Prefer public/admin APIs and explicit read models for verification.
  • Use direct SQL only while the owning operator/debug surface is missing.
  • Capture evidence as part of the operational action, not as a later narrative.
  • Fix the owning layer instead of adding symptom-only workarounds.
  • Keep release and patch work ring-based with reserve capacity.
  • Graduate recurring report-only findings into CI or operational gates.
  • Prefer deterministic local utilities for CI/deploy/UAT/evidence summaries so operators review stable packets instead of chat-only status.
  • Treat UAT as a confidence gate, not the first defect detector: pre-UAT readiness gates should catch stale images, missing app prerequisites, unhealthy services, terminal/websocket gaps, and unsafe error presentation.
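
The last two principles can be illustrated with a small deterministic gate that runs before UAT and prints a stable pass/fail summary. This is a hedged sketch, not the project's actual tooling: `GateResult`, `run_readiness_gate`, `gate_is_open`, and both example checks are hypothetical, and the checks return placeholder values where a real one would query deployed images or service health.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def run_readiness_gate(checks):
    """Run every check, even after a failure, so the report lists all blockers."""
    results = []
    for name, check in checks.items():
        try:
            passed, detail = check()
        except Exception as exc:  # a crashing check counts as a failed gate
            passed, detail = False, f"check raised {exc!r}"
        results.append(GateResult(name, passed, detail))
    return results


def gate_is_open(results):
    """UAT may start only when every readiness check passed."""
    return all(r.passed for r in results)


# Hypothetical checks; each returns (passed, detail).
def images_match_release_sha():
    deployed, expected = "abc123", "abc123"  # placeholder SHAs
    return deployed == expected, f"deployed={deployed} expected={expected}"


def terminal_websocket_reachable():
    return False, "websocket handshake not verified"  # placeholder failure


results = run_readiness_gate({
    "stale images": images_match_release_sha,
    "terminal/websocket": terminal_websocket_reachable,
})
for r in results:
    print(f"{'PASS' if r.passed else 'FAIL'} {r.name}: {r.detail}")
print("proceed to UAT" if gate_is_open(results) else "blocked before UAT")
```

Running all checks instead of stopping at the first failure means one pass surfaces every blocker, so UAT time is not spent rediscovering problems one at a time.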