Observability runbook

Observability is the operating surface for GPUaaS. Logs, metrics, traces, dashboards, alerts, and correlation IDs should lead operators to the owning domain before they reach for infrastructure internals.

Signal Model

Operating Surfaces

Surface	Use for
Logs	Request context, sanitized error details, worker activity, audit-adjacent debugging
Traces	Cross-service latency, dependency spans, terminal/session path, worker flow
Metrics	Health, rate, saturation, error, queue, billing, terminal, and node-agent signals
Dashboards	Control-plane overview, runtime health, billing/payments, terminal/notifications, incident correlation
Alerts	SLO and symptom detection tied to runbook ownership
Evidence	Smoke reports, alert simulations, incident drills, readiness artifacts
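
As an illustration of alerts "tied to runbook ownership", the following minimal sketch checks that every alert rule names an owning domain and a runbook. The `AlertRule` type, its fields, and the `unowned_alerts` helper are hypothetical names introduced for this example; they do not describe the actual GPUaaS alert schema.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule record; field names are assumptions."""
    name: str
    owner_domain: str = ""
    runbook: str = ""


def unowned_alerts(rules: Iterable[AlertRule]) -> List[str]:
    """Return the names of rules missing an owning domain or a runbook."""
    return [
        rule.name
        for rule in rules
        if not rule.owner_domain.strip() or not rule.runbook.strip()
    ]


if __name__ == "__main__":
    rules = [
        AlertRule("terminal-session-errors", "terminal", "runbooks/terminal"),
        AlertRule("billing-queue-backlog", "billing", ""),
    ]
    # Lists every rule an operator could not route to an owner.
    print(unowned_alerts(rules))
```

A check like this could run as a guard in CI so that an alert without an owner never reaches operators.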

Rules

Start every incident investigation from the correlation ID when one exists.
Sanitize log output before it is written so that secrets, tokens, and PII never appear in logs.
Dashboards should map to owning domains and runbooks.
Alert noise is an operating gap; tune rules or add missing context.
Repeated manual diagnosis should become a dashboard, API/read-model, or guard.
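
The first two rules can be sketched with Python's standard `logging` module: a filter that redacts sensitive values and makes sure every record carries a correlation ID. This is a minimal sketch, not the platform's implementation. The redaction patterns, the `SanitizingFilter` class, and the `correlation_id` record attribute are assumptions made for illustration; a real deployment would maintain its own pattern list.

```python
import logging
import re

# Hypothetical redaction patterns, applied in order.
_REDACTIONS = [
    (re.compile(r"(?i)\b(token|secret|password|api[_-]?key)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def sanitize(text: str) -> str:
    """Apply every redaction pattern to the text and return the result."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


class SanitizingFilter(logging.Filter):
    """Redact the rendered message and ensure a correlation_id attribute."""

    def __init__(self, default_correlation_id: str = "-"):
        super().__init__()
        self.default_correlation_id = default_correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its args first, then redact the result,
        # so values passed as format arguments are sanitized too.
        record.msg = sanitize(record.getMessage())
        record.args = ()
        if not hasattr(record, "correlation_id"):
            record.correlation_id = self.default_correlation_id
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(correlation_id)s %(message)s"))
    handler.addFilter(SanitizingFilter())
    log = logging.getLogger("gpuaas.example")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("charge failed token=%s", "abc123", extra={"correlation_id": "req-42"})
```

Attaching the filter to the handler rather than to individual loggers means the redaction runs on every record that handler emits, whichever logger produced it.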

Read-Model Direction

The current ops backlog is moving repeated direct checks into explicit observability read models: health snapshots, correlation timelines, log/trace pivots, and alert/SLO evidence. Until every read model exists, direct backend or SQL inspection remains temporary evidence, not the desired operator UX.
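
A correlation timeline read model might look like the sketch below: events from several surfaces are filtered by correlation ID and ordered by time, so an operator can see which domains a request touched without querying backends directly. The `Event` type, its fields, and both function names are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical normalized event from any operating surface."""
    timestamp: datetime
    surface: str          # e.g. "log", "trace", "alert"
    domain: str           # owning domain, e.g. "billing"
    correlation_id: str
    summary: str


def correlation_timeline(events: Iterable[Event], correlation_id: str) -> List[Event]:
    """Return the events for one correlation ID, oldest first."""
    matching = [e for e in events if e.correlation_id == correlation_id]
    return sorted(matching, key=lambda e: e.timestamp)


def owning_domains(timeline: Iterable[Event]) -> List[str]:
    """Return the domains in the order they first appear in the timeline."""
    seen: List[str] = []
    for event in timeline:
        if event.domain not in seen:
            seen.append(event.domain)
    return seen


if __name__ == "__main__":
    def at(second: int) -> datetime:
        return datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc)

    events = [
        Event(at(5), "alert", "billing", "req-42", "payment error rate high"),
        Event(at(1), "log", "terminal", "req-42", "session opened"),
        Event(at(3), "trace", "billing", "req-42", "charge span failed"),
        Event(at(2), "log", "terminal", "req-7", "unrelated request"),
    ]
    timeline = correlation_timeline(events, "req-42")
    for event in timeline:
        print(event.timestamp.isoformat(), event.surface, event.domain, event.summary)
    print(owning_domains(timeline))
```

Keeping the read model a pure function over normalized events makes it straightforward to expose later through an API or dashboard without changing the logic.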
