Observability runbook

Observability is the operating surface for GPUaaS. Logs, metrics, traces, dashboards, alerts, and correlation IDs should lead operators to the owning domain before they reach for infrastructure internals.

Signal Model

Operating Surfaces

Surface       Use for
Logs          Request context, sanitized error details, worker activity, audit-adjacent debugging
Traces        Cross-service latency, dependency spans, terminal/session path, worker flow
Metrics       Health, rate, saturation, error, queue, billing, terminal, and node-agent signals
Dashboards    Control-plane overview, runtime health, billing/payments, terminal/notifications, incident correlation
Alerts        SLO and symptom detection tied to runbook ownership
Evidence      Smoke reports, alert simulations, incident drills, readiness artifacts
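
As one illustration of tying alerts to runbook ownership, the sketch below checks that every alert rule names an owning domain and a runbook. The `AlertRule` shape, its field names, and the sample values are assumptions made for this example; they are not taken from the GPUaaS alerting configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    owning_domain: str = ""
    runbook: str = ""


def ownership_gaps(rules):
    """Return {alert name: [missing fields]} for rules lacking an owner or runbook."""
    gaps = {}
    for rule in rules:
        missing = [
            field
            for field in ("owning_domain", "runbook")
            if not getattr(rule, field).strip()
        ]
        if missing:
            gaps[rule.name] = missing
    return gaps


# Sample data, invented for the example.
rules = [
    AlertRule("queue_saturation", "scheduler", "runbooks/queue-saturation"),
    AlertRule("billing_errors", "billing"),
]
gaps = ownership_gaps(rules)
```

A check like this could run wherever alert rules are reviewed, so that an alert without an owner is caught before it pages anyone.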

Rules

  • Every incident starts with the correlation ID when one exists.
  • Logs must be sanitized at write time so that secrets, tokens, and PII never appear in them.
  • Dashboards should map to owning domains and runbooks.
  • Alert noise is an operating gap; tune rules or add missing context.
  • Repeated manual diagnosis should become a dashboard, API/read-model, or guard.
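
The first two rules can be sketched with Python's standard `logging` module: a filter that redacts sensitive values and stamps each record with a correlation ID. This is a minimal sketch, not the platform's implementation. The redaction patterns, the `build_logger` helper, and the logger name are assumptions, and a real deployment would maintain its own pattern list.

```python
import io
import logging
import re

# Hypothetical redaction patterns; illustrative, not exhaustive.
_REDACTIONS = [
    (re.compile(r"(?i)\b(token|secret|password|api[_-]?key)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[REDACTED_EMAIL]"),
]


def sanitize(text: str) -> str:
    """Apply every redaction pattern to the text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


class SanitizingFilter(logging.Filter):
    """Redact the formatted message and attach a correlation ID to the record."""

    def __init__(self, correlation_id: str = "-"):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so values passed as args are redacted too.
        record.msg = sanitize(record.getMessage())
        record.args = None
        if not hasattr(record, "correlation_id"):
            record.correlation_id = self.correlation_id
        return True


def build_logger(name: str, stream, correlation_id: str = "-") -> logging.Logger:
    """Build a logger whose handler sanitizes records before writing them."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(SanitizingFilter(correlation_id))
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger("gpuaas.example", buffer, correlation_id="req-123")
log.error("payment failed token=%s for %s", "abc123", "user@example.com")
```

Attaching the filter to the handler rather than the logger means the redaction happens at the point of output, whichever code path produced the record.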

Read-Model Direction

The current ops backlog is moving repeated direct checks into explicit observability read models: health snapshots, correlation timelines, log/trace pivots, and alert/SLO evidence. Until every read model exists, direct backend or SQL inspection remains temporary evidence, not the desired operator UX.
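
A correlation timeline read model could look roughly like the following: events from logs, traces, and alerts are filtered by correlation ID and ordered by time, so the operator sees which domains a request touched and in what order. The `Event` shape, the function names, and the sample data are hypothetical; the actual read-model schema is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    """Hypothetical unified event; fields are illustrative only."""
    timestamp: datetime
    source: str          # e.g. "log", "trace", "alert"
    domain: str          # owning domain that emitted the signal
    correlation_id: str
    summary: str


def correlation_timeline(events, correlation_id):
    """Return the events for one correlation ID, oldest first."""
    matching = [e for e in events if e.correlation_id == correlation_id]
    return sorted(matching, key=lambda e: e.timestamp)


def domains_in_order(timeline):
    """List owning domains in order of first appearance in the timeline."""
    seen = []
    for event in timeline:
        if event.domain not in seen:
            seen.append(event.domain)
    return seen


def at(second):
    """Helper for sample timestamps within one minute."""
    return datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc)


# Sample data, invented for the example.
events = [
    Event(at(5), "alert", "billing", "req-123", "payment error rate above SLO"),
    Event(at(1), "log", "api", "req-123", "request accepted"),
    Event(at(3), "trace", "billing", "req-123", "charge span timed out"),
    Event(at(2), "log", "api", "req-999", "unrelated request"),
]
timeline = correlation_timeline(events, "req-123")
```

Here `domains_in_order(timeline)` gives the operator a starting list of owning domains without a direct backend or SQL query.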