Observability runbook
Observability is the operating surface for GPUaaS. Logs, metrics, traces, dashboards, alerts, and correlation IDs should lead operators to the owning domain before they reach for infrastructure internals.
Signal Model
Operating Surfaces
| Surface | Use for |
|---|---|
| Logs | Request context, sanitized error details, worker activity, audit-adjacent debugging |
| Traces | Cross-service latency, dependency spans, terminal/session path, worker flow |
| Metrics | Health, rate, saturation, error, queue, billing, terminal, and node-agent signals |
| Dashboards | Control-plane overview, runtime health, billing/payments, terminal/notifications, incident correlation |
| Alerts | SLO and symptom detection tied to runbook ownership |
| Evidence | Smoke reports, alert simulations, incident drills, readiness artifacts |
Rules
- Start every incident investigation from the correlation ID when one exists.
- Logs must be sanitized at write time so that secrets, tokens, and PII never appear in them.
- Dashboards should map to owning domains and runbooks.
- Alert noise is an operating gap; tune rules or add missing context.
- Repeated manual diagnosis should become a dashboard, API/read-model, or guard.
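
The first two rules can be illustrated with a minimal sketch of a logging filter that redacts sensitive values and stamps each record with a correlation ID. This is an assumption-laden example, not the GPUaaS implementation: the redaction patterns, the `SanitizingFilter` class, and the `correlation_id` attribute name are all hypothetical, and a real deployment would maintain its own pattern list.

```python
import logging
import re

# Hypothetical redaction patterns (assumptions for illustration only).
_REDACTIONS = [
    # Bearer credentials, e.g. "Authorization: Bearer abc.def"
    (re.compile(r"(?i)\bbearer\s+[\w.~+/=-]+"), "Bearer [REDACTED]"),
    # key=value or key: value pairs with sensitive key names
    (re.compile(r"(?i)\b(token|password|secret|api[_-]?key)\b(\s*[=:]\s*)(\S+)"),
     r"\1\2[REDACTED]"),
    # Email addresses as a simple stand-in for PII
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def sanitize(text: str) -> str:
    """Apply every redaction pattern to the text, in order."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


class SanitizingFilter(logging.Filter):
    """Redact the fully formatted message and attach a correlation ID."""

    def __init__(self, correlation_id: str = "-"):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as args are sanitized too.
        record.msg = sanitize(record.getMessage())
        record.args = ()
        # Keep a correlation ID supplied via `extra=`, else use the default.
        if not hasattr(record, "correlation_id"):
            record.correlation_id = self.correlation_id
        return True
```

A handler would typically attach the filter with `handler.addFilter(SanitizingFilter("req-123"))` and include `%(correlation_id)s` in its format string, so every sanitized line can be pivoted on the same ID.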
Read-Model Direction
The current ops backlog is moving repeated direct checks into explicit observability read models: health snapshots, correlation timelines, log/trace pivots, and alert/SLO evidence. Until every read model exists, direct backend or SQL inspection remains temporary evidence, not the desired operator UX.
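
As a rough illustration of what a correlation timeline read model could look like, the sketch below merges events from several signal sources into one time-ordered view for a single correlation ID. The `SignalEvent` fields and the `build_timeline` function are hypothetical names chosen for this example; the actual read-model schema is not defined here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class SignalEvent:
    """One observation from any surface (hypothetical shape)."""
    correlation_id: str
    timestamp: datetime
    surface: str   # e.g. "log", "trace", "alert"
    domain: str    # owning domain the operator should be led to
    summary: str


def build_timeline(
    correlation_id: str, *sources: Iterable[SignalEvent]
) -> list[SignalEvent]:
    """Collect matching events from all sources, ordered by timestamp."""
    events = [
        event
        for source in sources
        for event in source
        if event.correlation_id == correlation_id
    ]
    return sorted(events, key=lambda event: event.timestamp)
```

Calling `build_timeline("req-123", log_events, trace_events, alert_events)` would give an operator one ordered list that names the owning domain for each event, which is the kind of pivot that direct backend or SQL inspection currently stands in for.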
Canonical sources