# Observability Health Snapshot Read Model Contract v1

Status: contract draft for `OPS-PROD-OBSERVABILITY-HEALTH-SNAPSHOT-001`

Owner: Platform Operations / Backend / Frontend / Architecture

Last updated: 2026-06-06

## Purpose

Define the first platform-owned health snapshot that lets operators answer
"is the platform healthy enough to proceed?" before opening Grafana,
Prometheus, Loki, or Tempo.

This is a contract draft. The implementation task must update
`doc/api/openapi/domains/v3-read-models.yaml`, regenerate
`doc/api/openapi.draft.yaml`, regenerate client artifacts, and add backend/UI
tests before publishing a new endpoint or response shape.

## Proposed Endpoint

```text
GET /api/v1/v3/platform/ops/observability/health-snapshot
```

Required capability:

```text
platform.ops.read
```

## Query Parameters

| Parameter | Type | Required | Notes |
|---|---|---|---|
| `environment_profile` | string | no | Defaults to the active runtime profile. |
| `component_type` | enum | no | `api`, `web`, `worker`, `gateway`, `queue`, `database`, `cache`, `event_bus`, `workflow`, `observability`, `runtime`, `node_log_gateway`. |
| `status` | enum | no | `healthy`, `degraded`, `unhealthy`, `unknown`, `not_reported`. |
| `include_evidence` | boolean | no | Default `true`; includes evidence links and source freshness metadata. |
| `include_prometheus_queries` | boolean | no | Default `false`; includes query IDs and result classes, not raw samples. |
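
As a client-side illustration of the table above, the following minimal sketch builds a request URL and rejects enum values the table does not list. The helper name `build_snapshot_url`, the host `https://platform.example`, and the lowercase `true`/`false` wire encoding for booleans are assumptions made for this example, not part of the contract.

```python
from urllib.parse import urlencode

ENDPOINT = "/api/v1/v3/platform/ops/observability/health-snapshot"

# Enum values copied from the query parameter table.
COMPONENT_TYPES = {
    "api", "web", "worker", "gateway", "queue", "database", "cache",
    "event_bus", "workflow", "observability", "runtime", "node_log_gateway",
}
STATUSES = {"healthy", "degraded", "unhealthy", "unknown", "not_reported"}


def build_snapshot_url(base_url, *, environment_profile=None, component_type=None,
                       status=None, include_evidence=None,
                       include_prometheus_queries=None):
    """Return the request URL. Parameters left as None are omitted so the
    server-side defaults from the table apply."""
    if component_type is not None and component_type not in COMPONENT_TYPES:
        raise ValueError(f"unsupported component_type: {component_type!r}")
    if status is not None and status not in STATUSES:
        raise ValueError(f"unsupported status: {status!r}")

    params = {}
    if environment_profile is not None:
        params["environment_profile"] = environment_profile
    if component_type is not None:
        params["component_type"] = component_type
    if status is not None:
        params["status"] = status
    # Assumption: booleans travel as lowercase "true"/"false".
    if include_evidence is not None:
        params["include_evidence"] = "true" if include_evidence else "false"
    if include_prometheus_queries is not None:
        params["include_prometheus_queries"] = (
            "true" if include_prometheus_queries else "false"
        )

    query = urlencode(params)
    url = base_url.rstrip("/") + ENDPOINT
    return f"{url}?{query}" if query else url


# Hypothetical host; only degraded queues, with query outcome classes included.
print(build_snapshot_url("https://platform.example", component_type="queue",
                         status="degraded", include_prometheus_queries=True))
```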

## Response Shape

```json
{
  "overall": {
    "status": "degraded",
    "degradation_level": "partial",
    "summary": "Terminal gateway route readiness is stale; core API and queues are healthy.",
    "generated_at": "2026-06-06T00:15:00Z",
    "environment_profile": "dev-control-rke2"
  },
  "components": [
    {
      "component_id": "cmd-api",
      "component_type": "api",
      "owner_domain": "platform-api",
      "status": "healthy",
      "freshness_seconds": 120,
      "checked_at": "2026-06-06T00:13:00Z",
      "degradation_reason": null,
      "runbook_id": "ops.api.degradation",
      "evidence_href": "/platform/evidence?component_id=cmd-api",
      "metrics": {
        "scrape_state": "fresh",
        "scrape_freshness_seconds": 30,
        "error_rate_class": "ok",
        "latency_class": "ok",
        "saturation_class": "ok"
      },
      "details": {
        "healthz_status": "ok",
        "runtime_commit": "abcdef1"
      }
    }
  ],
  "rollups": {
    "service": {
      "healthy": 8,
      "degraded": 1,
      "unhealthy": 0,
      "unknown": 0,
      "not_reported": 0
    },
    "worker": {
      "healthy": 4,
      "degraded": 0,
      "unhealthy": 0,
      "unknown": 1,
      "not_reported": 0
    },
    "queue": {
      "max_lag_seconds": 42,
      "dlq_backlog": 0,
      "stale_consumers": 0
    },
    "runtime": {
      "stale_deployments": 0,
      "image_digest_missing": 0,
      "profile_mismatch": 0
    },
    "observability": {
      "prometheus_reachable": true,
      "loki_reachable": true,
      "tempo_reachable": true,
      "grafana_escape_hatch_configured": true
    }
  },
  "evidence": [
    {
      "source": "platform_status_snapshot",
      "state": "fresh",
      "freshness_seconds": 120,
      "href": "/platform/evidence?bundle_id=status-dev-001"
    }
  ],
  "direct_ui": {
    "grafana": {
      "configured": true,
      "path": "https://aicloud-dev-grafana.core42.dev/",
      "use_when": "Dashboard or Explore pivots are needed after the platform snapshot identifies the failing area."
    },
    "prometheus": {
      "configured": false,
      "path": null,
      "use_when": "Ad hoc query debugging by SRE only."
    }
  },
  "meta": {
    "cache": "miss",
    "sources": ["platform_status_snapshot", "guard_report", "prometheus_query_outcome", "runtime_metadata"],
    "raw_telemetry_included": false
  }
}
```
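
In the example, the `cmd-api` component's `freshness_seconds` (120) equals the gap between its `checked_at` and `overall.generated_at`. The sketch below assumes that this is the intended definition. The helper names and the clamp to zero for clock skew are assumptions of this example.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse an RFC 3339 style timestamp such as 2026-06-06T00:15:00Z."""
    # Replace the trailing "Z", which datetime.fromisoformat rejects before Python 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(timezone.utc)


def freshness_seconds(checked_at, generated_at):
    """Whole seconds between the component check and snapshot generation."""
    delta = parse_utc(generated_at) - parse_utc(checked_at)
    # Assumption: a check stamped after generation (clock skew) counts as 0.
    return max(0, int(delta.total_seconds()))


# Values taken from the example response.
print(freshness_seconds("2026-06-06T00:13:00Z", "2026-06-06T00:15:00Z"))  # 120
```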

## Component Coverage

| Component family | Minimum fields |
|---|---|
| API/Web | Health status, runtime commit/image, latest deploy age, error-rate class, latency class. |
| Workers | Heartbeat or runtime metadata freshness, queue ownership, failure/backoff class, last successful unit of work where available. |
| Queues/Event bus | NATS/JetStream reachability, consumer lag, DLQ backlog, outbox failed count. |
| Workflow runtime | Temporal reachability, stuck workflow count, retry/failure class, schedule freshness where configured. |
| Database/Redis | Reachability, degraded-state reason, connection/error class, backup/restore evidence freshness where available. |
| Terminal and node-log gateways | Route readiness, WebSocket/gateway health class, node-facing endpoint freshness, gateway rejection/forward-failure class. |
| Observability stack | Prometheus/Loki/Tempo/Grafana reachability, scrape freshness, query outcome freshness, direct-UI escape-hatch state. |
| Runtime/app platform | App-runtime-worker freshness, managed-ingress health class, artifact/runtime image freshness where available. |

## Status Semantics

| Status | Meaning |
|---|---|
| `healthy` | Fresh evidence exists and no blocking degradation is detected. |
| `degraded` | Service is usable but one or more signals are stale, partial, or in a warning state. |
| `unhealthy` | A required service, worker, route, queue, or dependency is failing. |
| `unknown` | The component is expected but the platform cannot classify it from available evidence. |
| `not_reported` | The component is not expected for the current profile or has no evidence source yet. |

`overall.status` is the most severe status among the components required for
the selected profile. Optional escape-hatch tools such as Grafana do not
worsen the overall status unless the profile declares them required for the
current release validation.
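
The rollup rule can be sketched as a small pure function. This draft does not fix a severity order, so the ranking below (`healthy` < `degraded` < `unknown` < `unhealthy`), skipping `not_reported` components, and returning `unknown` when no required component can be classified are all assumptions of this example. The set of required component IDs is assumed to come from the profile, and the IDs `terminal-gateway` and `grafana` are hypothetical.

```python
# Assumed severity ranking; the contract draft does not define one.
SEVERITY = {"healthy": 0, "degraded": 1, "unknown": 2, "unhealthy": 3}


def overall_status(components, required_ids):
    """Return the most severe status among required components."""
    worst = None
    for component in components:
        if component["component_id"] not in required_ids:
            continue  # optional components never affect the overall status
        status = component["status"]
        if status == "not_reported":
            continue  # assumption: not_reported is ignored for the rollup
        if worst is None or SEVERITY[status] > SEVERITY[worst]:
            worst = status
    # Assumption: with nothing classifiable, report "unknown".
    return worst if worst is not None else "unknown"


components = [
    {"component_id": "cmd-api", "status": "healthy"},
    {"component_id": "terminal-gateway", "status": "degraded"},
    {"component_id": "grafana", "status": "unhealthy"},
]

# Grafana is optional here, so it does not raise the overall status.
print(overall_status(components, {"cmd-api", "terminal-gateway"}))
# Declaring it required for release validation changes the result.
print(overall_status(components, {"cmd-api", "terminal-gateway", "grafana"}))
```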

## Sources

Use platform-owned evidence and query outcomes first:

- `platform_status_snapshot` payloads;
- platform foundation guard reports;
- runtime metadata read model;
- existing `/api/v1/v3/platform/ops` health and action signals;
- worker/queue counters already emitted by API, workers, NATS, Redis, and
  Temporal;
- bounded Prometheus query outcomes, classified into status classes;
- observability/on-call readiness evidence.

Do not expose raw Prometheus samples, Loki log payloads, or Tempo traces in the
health snapshot. Link to follow-up read models or direct UI escape hatches when
the summary is not enough.

## Excluded Data

The response must not include:

- raw logs, full stack traces, or unsanitized exception text;
- raw Prometheus sample vectors;
- raw Tempo spans;
- bearer tokens, cookies, kubeconfigs, registry credentials, Vault material, or
  private keys;
- payment references or customer identifiers;
- tenant-owned app payloads or tenant data plane content.
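
One way to back the exclusion list with a test is a recursive scan of the serialized response. The sketch below is illustrative only: the forbidden key names and value patterns are hypothetical heuristics chosen for this example, they cover only part of the list above, and a real check would need a reviewed denylist.

```python
import re

# Hypothetical key names that would suggest raw telemetry or secrets.
FORBIDDEN_KEYS = {
    "samples", "spans", "log_lines", "stack_trace",
    "authorization", "cookie", "kubeconfig", "private_key",
}

# Heuristic value patterns: bearer tokens and PEM private key headers.
FORBIDDEN_VALUE_PATTERNS = [
    re.compile(r"Bearer\s+[A-Za-z0-9._~+/-]{16,}"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def find_violations(node, path="$"):
    """Walk a decoded JSON value and list the paths that look like excluded data."""
    violations = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key.lower() in FORBIDDEN_KEYS:
                violations.append(f"{child}: forbidden key")
            violations.extend(find_violations(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            violations.extend(find_violations(item, f"{path}[{index}]"))
    elif isinstance(node, str):
        for pattern in FORBIDDEN_VALUE_PATTERNS:
            if pattern.search(node):
                violations.append(f"{path}: matches {pattern.pattern}")
    return violations


clean = {
    "components": [{"details": {"healthz_status": "ok"}}],
    "meta": {"raw_telemetry_included": False},
}
print(find_violations(clean))  # []
```

A backend test could assert both that `find_violations` returns an empty list and that `meta.raw_telemetry_included` is `false`.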

## Implementation Notes

1. Start with a read-only backend projection. It may reuse existing platform
   stats and evidence tables before adding Prometheus query execution.
2. Add the OpenAPI definition before handler work.
3. Keep `cmd/api` as a thin handler; projection logic belongs in a platform
   read-model package.
4. Add tests for authorization, profile filtering, stale evidence, queue
   degradation, missing observability evidence, and excluded raw telemetry.
5. Add a `/platform/ops` UI section that shows the snapshot before direct
   Grafana/Prometheus pivots.
6. Attach snapshot evidence to release/deploy-run tasks so UAT is not the first
   place operators discover stale service or worker health.
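
To illustrate notes 3 and 4, projection logic can live in pure functions that are unit-tested without a handler. The sketch below (in Python for brevity, independent of the backend's language) computes zero-filled status counts in the shape of `rollups.service`. Which component types feed the `service` rollup is not defined in this draft, so the set used here is an assumption, as are the function name and the component IDs other than `cmd-api`.

```python
from collections import Counter

STATUSES = ("healthy", "degraded", "unhealthy", "unknown", "not_reported")


def rollup_status_counts(components, component_types):
    """Count statuses for the given component types, zero-filling missing ones."""
    counts = Counter(
        component["status"]
        for component in components
        if component["component_type"] in component_types
    )
    return {status: counts.get(status, 0) for status in STATUSES}


components = [
    {"component_id": "cmd-api", "component_type": "api", "status": "healthy"},
    {"component_id": "web-ui", "component_type": "web", "status": "healthy"},
    {"component_id": "terminal-gateway", "component_type": "gateway", "status": "degraded"},
    {"component_id": "billing-worker", "component_type": "worker", "status": "unknown"},
]

# Assumption: the "service" rollup covers api, web, and gateway components.
print(rollup_status_counts(components, {"api", "web", "gateway"}))
```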