# Observability Read Model Gap Map v1

Status: gap map for `OPS-PROD-OBSERVABILITY-READMODEL-GAPS-001`

Owner: Platform Operations / Architecture / Backend / Frontend

Last updated: 2026-06-06

## Purpose

Replace operator dependence on the Grafana, Prometheus, Loki, and Tempo direct
UIs with GPUaaS-first operator read models, evidence bundles, and runbook pivots.

The default operator path still starts with `/platform/ops`, platform evidence,
and runbooks. Direct observability UIs remain SRE escape hatches for questions
the platform surface cannot yet answer.

## Operator Questions

| Operator question | Current first surface | Direct UI still used | Gap |
|---|---|---|---|
| Is the platform healthy now by service, worker, queue, and runtime boundary? | `/platform/ops` health and setup surfaces, `/healthz`, readiness checks. | Grafana and Prometheus for detailed panel/query pivots. | Need a compact platform observability health snapshot with freshness, scrape status, error rate, queue lag, and worker failure rollups. |
| Which incident timeline belongs to this correlation ID? | `/platform/evidence` and audit/correlation links. | Loki and Tempo through Grafana Explore after the correlation ID is known. | Need a correlation timeline read model that joins audit rows, structured logs, trace IDs, task IDs, workflow IDs, and owning resources without exposing raw log payloads. |
| What logs are relevant to this node/allocation/workflow without broad Loki access? | Runbooks and resource detail pages. | Loki direct query by service, node, correlation ID, or JSON fields. | Need a bounded log excerpt/evidence API with redaction, pagination, source labels, and retention metadata. |
| What trace/span evidence explains this request or workflow failure? | Error response details may include trace/span when available. | Tempo direct trace-by-id or Grafana trace pivot. | Need a trace pivot/readiness read model with trace ID, root service, span count, error span summaries, and evidence links. |
| Are alerts firing, routed, acknowledged, and mapped to runbooks? | `doc/operations/local-dev/observability/prometheus-alerts.yaml`, alert drill, on-call evidence. | Grafana alerting view. | Need an alert/runbook routing read model with current firing count, owner team, runbook ID, notification route, last drill evidence, and stale-route warnings. |
| Are SLOs/error budgets healthy enough for release or production readiness? | CI checks and ops runbooks. | Grafana dashboards and Prometheus ad hoc queries. | Need an SLO/error-budget evidence bundle that can be attached to release readiness without screenshots or manual dashboard exports. |
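
The health snapshot gap in the first row could be approached roughly as below. This is a minimal sketch, not the contract: `Signal`, `HealthSnapshot`, `build_snapshot`, the status values, and the two-minute freshness threshold are all hypothetical names and defaults chosen for illustration. It shows one way to roll signals up so that stale data is reported with a reason instead of being counted as healthy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Signal:
    """One predefined query outcome or readiness result (hypothetical shape)."""
    name: str              # e.g. "scrape_status", "queue_lag"
    healthy: bool
    observed_at: datetime  # when the backing query or check last succeeded


@dataclass
class HealthSnapshot:
    status: str                                    # "healthy" | "degraded" | "stale"
    failing: list = field(default_factory=list)
    stale_reasons: list = field(default_factory=list)


def build_snapshot(signals, now, max_age=timedelta(minutes=2)):
    """Roll signals up into one status; stale signals never count as healthy."""
    snapshot = HealthSnapshot(status="healthy")
    for s in signals:
        age = now - s.observed_at
        if age > max_age:
            # Too old to trust either way: record why instead of its value.
            snapshot.stale_reasons.append(
                f"{s.name}: last observed {int(age.total_seconds())}s ago")
        elif not s.healthy:
            snapshot.failing.append(s.name)
    # A fresh failure outranks staleness (assumed precedence).
    if snapshot.failing:
        snapshot.status = "degraded"
    elif snapshot.stale_reasons:
        snapshot.status = "stale"
    return snapshot
```

Passing `now` in explicitly keeps the rollup deterministic and easy to test; whether "degraded" should outrank "stale" is a design choice the real contract would need to settle.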

## Current Coverage

| Area | Existing coverage | Status |
|---|---|---|
| Platform health summary | `/api/v1/v3/platform/ops` and related read models show inventory, lifecycle, audit, payment, setup, and action-required signals. | Partial |
| Evidence pivots | `/platform/evidence` supports correlation-oriented evidence and release/status artifacts. | Partial |
| Structured logs | Log field contract and node log gateway exist; Loki remains the backing store. | Partial |
| Metrics | Prometheus metrics and alert rules exist; platform surfaces do not yet expose reusable query outcomes. | Gap |
| Traces | OpenTelemetry/Tempo exist; platform surfaces do not yet expose trace lookup summaries. | Gap |
| Alerts/on-call | Alert rules, alert drill, on-call evidence docs, and readiness script exist. | Partial |

## Read Model Work Packages

| Task | First output | Notes |
|---|---|---|
| `OPS-PROD-OBSERVABILITY-HEALTH-SNAPSHOT-001` | Contract for service/worker/queue/runtime health snapshot backed by readiness, Prometheus query outcomes, and platform stats. Output: `doc/operations/Observability_Health_Snapshot_Read_Model_Contract_v1.md`. | Start with API/worker uptime, scrape freshness, error-rate class, queue lag, terminal gateway health, node-log gateway health, and observability stack reachability. |
| `OPS-PROD-OBSERVABILITY-CORRELATION-TIMELINE-001` | Contract for correlation ID timeline across audit, logs, traces, tasks, events, and workflows. Output: `doc/operations/Observability_Correlation_Timeline_Read_Model_Contract_v1.md`. | Must redact raw log payloads and expose excerpts/summaries only. |
| `OPS-PROD-OBSERVABILITY-LOG-TRACE-PIVOTS-001` | Contract for bounded log excerpt and trace summary pivots from resource pages and evidence bundles. Output: `doc/operations/Observability_Log_Trace_Pivot_Read_Model_Contract_v1.md`. | Loki/Tempo remain backing tools; direct UI remains an escape hatch. |
| `OPS-PROD-OBSERVABILITY-ALERT-SLO-EVIDENCE-001` | Contract for alert routing, runbook mapping, drill evidence, SLO/error-budget snapshot, and release evidence export. Output: `doc/operations/Observability_Alert_SLO_Evidence_Read_Model_Contract_v1.md`. | This should feed release readiness and on-call readiness without dashboard screenshots. |
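
As an illustration of the correlation timeline package, the sketch below merges records from several sources into one ordered timeline that carries summaries only. Everything here is assumed for the example: the `TimelineEntry` fields, the record dictionary keys, the source labels, and the 80-character limit. Whitespace collapsing plus truncation stands in for the real redaction rules, which this sketch does not implement.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    source: str    # "audit" | "log" | "trace" | "task" (illustrative labels)
    ref_id: str    # ID of the audit row, log line, span, or task
    summary: str   # short summary, never the raw payload


def summarize(text, limit=80):
    """Collapse whitespace and truncate so raw payloads are not passed through."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[: limit - 3] + "..."


def build_timeline(correlation_id, records):
    """Select records for one correlation ID and order them by time.

    Each record is a dict with the assumed keys:
    correlation_id, at, source, ref_id, message.
    """
    entries = [
        TimelineEntry(r["at"], r["source"], r["ref_id"], summarize(r["message"]))
        for r in records
        if r["correlation_id"] == correlation_id
    ]
    return sorted(entries, key=lambda e: e.at)
```

Keeping `correlation_id` as a filter on already-fetched records, not as a stored metric label, matches the intent that high-cardinality IDs stay query pivots.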

## Direct UI Policy

| Tool | Default operator path | Direct UI remains allowed when |
|---|---|---|
| Grafana | Platform health snapshot, SLO/evidence bundle, incident/runbook links. | The incident requires dashboard panels or Explore pivots not yet represented in platform read models. |
| Prometheus | Predefined query outcomes exposed as health/evidence fields. | SRE needs ad hoc query debugging for platform metrics. |
| Loki | Correlation timeline and bounded log evidence. | SRE needs raw query exploration after the correlation/resource scope is known. |
| Tempo | Trace summary and trace-by-id evidence links. | SRE needs full trace inspection after the trace ID is known. |

## Guardrails

- Direct observability UIs must not become tenant/product self-service.
- Platform read models must expose summaries, IDs, status, timestamps, owner
  domain, runbook IDs, and evidence links, not broad raw telemetry payloads.
- High-cardinality fields such as `correlation_id`, `trace_id`, `task_id`,
  `allocation_id`, and `node_id` stay query pivots, not mandatory metric labels.
- Log excerpts must pass existing sanitization/redaction rules and include
  source labels and retention/freshness metadata.
- SLO/error-budget evidence must be machine-readable and attachable to release
  readiness.
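
The log excerpt guardrail above might translate into something like the following sketch. The redaction patterns, the `LogExcerptPage` shape, and the page-size limits are assumptions made for illustration; the existing sanitization/redaction rules are defined elsewhere and would replace `_REDACTIONS` in practice.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed patterns for illustration only; not the platform's actual rules.
_REDACTIONS = [
    (re.compile(r"(?i)(authorization|token|password)=\S+"), r"\1=[REDACTED]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[REDACTED_EMAIL]"),
]


def redact(line):
    """Apply every redaction pattern to one log line."""
    for pattern, replacement in _REDACTIONS:
        line = pattern.sub(replacement, line)
    return line


@dataclass
class LogExcerptPage:
    lines: list                 # redacted lines only
    source_labels: dict         # e.g. {"service": "api"}
    retention_days: int         # retention metadata for the backing store
    next_cursor: Optional[int]  # None when there are no further pages


def excerpt(raw_lines, source_labels, retention_days,
            cursor=0, page_size=50, max_page_size=200):
    """Return one bounded, redacted page of log lines starting at cursor."""
    size = max(1, min(page_size, max_page_size))  # clamp caller-supplied size
    window = raw_lines[cursor:cursor + size]
    end = cursor + len(window)
    return LogExcerptPage(
        lines=[redact(line) for line in window],
        source_labels=dict(source_labels),
        retention_days=retention_days,
        next_cursor=end if end < len(raw_lines) else None,
    )
```

Clamping `page_size` on the server side is what keeps the excerpt bounded even when a caller asks for more than the limit.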

## Exit Criteria

Direct observability UI access stops being the normal first path when:

- `/platform/ops` exposes the current health snapshot and stale-signal reasons;
- correlation timeline answers the common "what happened for this request?"
  question without raw Loki/Tempo access;
- resource pages can link to bounded log/trace pivots;
- alert/SLO evidence can be attached to release readiness and incident review;
- runbooks clearly say when Grafana, Prometheus, Loki, or Tempo direct UI use is
  justified.
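
For the alert/SLO evidence criterion, a machine-readable error-budget record could look like the sketch below. The `SloEvidence` fields, the `evaluate_slo` helper, and the release threshold are hypothetical; the request counts would come from predefined Prometheus query outcomes, which are passed in here as plain numbers. The arithmetic is the standard error-budget calculation: with objective `o` and `n` requests, `(1 - o) * n` failures are allowed, and the remaining budget is one minus the fraction of that allowance already spent.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SloEvidence:
    slo_id: str
    objective: float         # e.g. 0.999 for 99.9 percent
    total_requests: int
    failed_requests: int
    budget_remaining: float  # fraction of budget left; negative when overspent
    release_ok: bool


def evaluate_slo(slo_id, objective, total, failed, min_budget_remaining=0.0):
    """Compute the remaining error budget and a release gate decision."""
    if not 0.0 < objective < 1.0 or total <= 0:
        raise ValueError("objective must be in (0, 1) and total must be positive")
    allowed_failures = (1.0 - objective) * total
    remaining = round(1.0 - failed / allowed_failures, 4)
    return SloEvidence(slo_id, objective, total, failed,
                       remaining, remaining >= min_budget_remaining)


def to_json(evidence):
    """Serialize with sorted keys so attached evidence diffs cleanly."""
    return json.dumps(asdict(evidence), sort_keys=True)
```

A record like this can be attached to a release readiness check as a JSON artifact, which is the property the screenshot-based flow lacks.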
