# Observability Alert And SLO Evidence Read Model Contract v1

Status: contract draft for `OPS-PROD-OBSERVABILITY-ALERT-SLO-EVIDENCE-001`

Owner: Platform Operations / SRE / Backend / Architecture

Last updated: 2026-06-06

## Purpose

Define platform-owned alert routing and SLO/error-budget evidence read models
that let operators attach current operational confidence to release readiness,
incident review, and UAT exit without manually exporting Grafana screenshots.

This is a contract draft. Before any new endpoint or response shape is
published, the implementation task must update
`doc/api/openapi/domains/v3-read-models.yaml`, regenerate
`doc/api/openapi.draft.yaml` and the generated client artifacts, and add
backend/UI tests.

## Proposed Endpoints

```text
GET /api/v1/v3/platform/ops/observability/alerts
GET /api/v1/v3/platform/ops/observability/slo-evidence
```

Required capability:

```text
platform.ops.read
```

## Alert Routing Query Parameters

| Parameter | Type | Required | Notes |
|---|---|---|---|
| `environment_profile` | string | no | Defaults to active runtime profile. |
| `severity` | enum | no | `sev1`, `sev2`, `sev3`, `warning`, `critical`, `unknown`. |
| `owner_team` | string | no | Owner team from alert labels or route mapping. |
| `service` | string | no | Service/component name. |
| `domain` | string | no | Optional alert domain label such as `security`, `provisioning-control-loop`, or `terminal-stream-relay`. |
| `state` | enum | no | `firing`, `resolved`, `suppressed`, `unknown`, `not_loaded`. |
| `runbook_id` | string | no | Runbook annotation/id. |
| `include_inactive` | bool | no | Default `false`; includes loaded-but-not-firing alerts when true. |
| `cursor` | string | no | Pagination cursor. |
| `page_size` | int | no | Default 50, max 200. |
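
As an illustration of how a client might apply the table above, the following minimal sketch validates the enum and paging filters and builds a relative request URL. The function name `build_alerts_url`, the lower `page_size` bound of 1, and the lowercase `true`/`false` boolean encoding are assumptions made for this example; the contract does not define them.

```python
from urllib.parse import urlencode

ALERTS_PATH = "/api/v1/v3/platform/ops/observability/alerts"
SEVERITIES = {"sev1", "sev2", "sev3", "warning", "critical", "unknown"}
STATES = {"firing", "resolved", "suppressed", "unknown", "not_loaded"}
MAX_PAGE_SIZE = 200


def build_alerts_url(**filters):
    """Validate filters against the parameter table and return a relative URL."""
    severity = filters.get("severity")
    if severity is not None and severity not in SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    state = filters.get("state")
    if state is not None and state not in STATES:
        raise ValueError(f"unsupported state: {state!r}")
    page_size = filters.get("page_size")
    # Assumption: the lower bound of 1 is not stated in the contract.
    if page_size is not None and not 1 <= page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")

    query = {}
    for name, value in sorted(filters.items()):
        if value is None:
            continue  # omitted parameters fall back to server defaults
        # Assumption: booleans are serialized as lowercase true/false.
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{ALERTS_PATH}?{urlencode(query)}" if query else ALERTS_PATH


print(build_alerts_url(severity="sev2", state="firing", include_inactive=True))
```

The sketch does not reject unknown parameter names; a real client or handler would likely derive that check from the OpenAPI definition.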

## SLO Evidence Query Parameters

| Parameter | Type | Required | Notes |
|---|---|---|---|
| `environment_profile` | string | no | Defaults to active runtime profile. |
| `service` | string | no | API, worker, gateway, queue, app runtime, observability, or provider component. |
| `slo_id` | string | no | Specific SLO definition. |
| `time_range` | enum | no | `1h`, `6h`, `24h`, `7d`, `30d`, `custom`. |
| `from` | RFC3339 | only when `time_range=custom` | Inclusive lower bound. |
| `to` | RFC3339 | only when `time_range=custom` | Exclusive upper bound. |
| `release_id` | string | no | Optional release/deploy-run association. |
| `include_query_outcomes` | bool | no | Default `true`; includes classified query outcomes, not raw samples. |
| `cursor` | string | no | Pagination cursor. |
| `page_size` | int | no | Default 25, max 100. |
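
The interaction between `time_range`, `from`, and `to` can be sketched as a small resolver that returns a half-open `[start, end)` window. This is only an illustration: `resolve_window` and `parse_rfc3339` are hypothetical names, the sketch rejects `from`/`to` for non-custom ranges as one reading of the table, and it takes `time_range` as a required argument because the contract does not state a default.

```python
from datetime import datetime, timedelta, timezone

PRESETS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}


def parse_rfc3339(value):
    """Parse an RFC3339 timestamp; a trailing 'Z' is treated as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    return parsed


def resolve_window(time_range, from_=None, to=None, now=None):
    """Return (start, end): start is inclusive, end is exclusive."""
    now = now or datetime.now(timezone.utc)
    if time_range == "custom":
        if from_ is None or to is None:
            raise ValueError("time_range=custom requires both from and to")
        start, end = parse_rfc3339(from_), parse_rfc3339(to)
        if start >= end:
            raise ValueError("from must be earlier than to")
        return start, end
    if from_ is not None or to is not None:
        raise ValueError("from/to are only valid with time_range=custom")
    if time_range not in PRESETS:
        raise ValueError(f"unsupported time_range: {time_range!r}")
    return now - PRESETS[time_range], now


start, end = resolve_window(
    "custom", from_="2026-06-01T00:00:00Z", to="2026-06-02T00:00:00Z"
)
print(start.isoformat(), end.isoformat())
```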

## Alert Routing Response Shape

```json
{
  "overall": {
    "status": "degraded",
    "environment_profile": "dev-control-rke2",
    "generated_at": "2026-06-06T00:20:00Z",
    "firing_count": 2,
    "missing_route_count": 1,
    "stale_drill_count": 1
  },
  "items": [
    {
      "alert_name": "GPUaaSProvisioningQueueDepthHigh",
      "state": "firing",
      "severity": "sev2",
      "owner_team": "platform",
      "service": "provisioning-worker",
      "domain": "provisioning-control-loop",
      "signal_key": "provisioning_queue_depth_high",
      "runbook_id": "ops.provisioning.stuck",
      "runbook_href": "/operations/runbooks/Provisioning_Workflow_Stuck_Runbook",
      "summary": "Provisioning queue depth sustained above 200 pending messages.",
      "last_fired_at": "2026-06-06T00:15:00Z",
      "last_resolved_at": null,
      "route": {
        "destination": "platform-oncall",
        "configured": true,
        "last_verified_at": "2026-06-04T00:00:00Z"
      },
      "drill": {
        "last_drill_at": null,
        "state": "missing"
      },
      "evidence_href": "/platform/evidence?alert_name=GPUaaSProvisioningQueueDepthHigh"
    }
  ],
  "rollups": {
    "by_severity": {
      "sev1": 0,
      "sev2": 2,
      "sev3": 0,
      "warning": 0,
      "critical": 0,
      "unknown": 0
    },
    "by_owner_team": {
      "platform": 2
    },
    "route_state": {
      "configured": 8,
      "missing": 1,
      "unknown": 0
    }
  },
  "pagination": {
    "next_cursor": null,
    "page_size": 50
  },
  "meta": {
    "sources": [
      "alert_rule_manifest",
      "prometheus_alert_state",
      "oncall_roster",
      "runbook_index",
      "alert_drill_evidence"
    ],
    "raw_prometheus_samples_included": false
  }
}
```
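
A consumer of this shape would typically follow `pagination.next_cursor` until it is `null`. The sketch below shows one way to do that and to flag items whose route is unconfigured or whose drill has never run. The `fetch_page` callable, the `needs_follow_up` rule, the in-memory `PAGES` fixture, and the `ExampleRoutedAlert` name are all hypothetical stand-ins for a real HTTP client and real data.

```python
def iter_alert_items(fetch_page, page_size=50):
    """Yield every alert item, following pagination.next_cursor until it is null."""
    cursor = None
    while True:
        page = fetch_page(cursor=cursor, page_size=page_size)
        yield from page["items"]
        cursor = page["pagination"]["next_cursor"]
        if cursor is None:
            return


def needs_follow_up(item):
    """True when the route is not configured or no drill has been recorded."""
    return not item["route"]["configured"] or item["drill"]["last_drill_at"] is None


# Hypothetical in-memory pages standing in for real HTTP responses.
PAGES = {
    None: {
        "items": [{
            "alert_name": "GPUaaSProvisioningQueueDepthHigh",
            "route": {"configured": True},
            "drill": {"last_drill_at": None, "state": "missing"},
        }],
        "pagination": {"next_cursor": "page-2", "page_size": 1},
    },
    "page-2": {
        "items": [{
            "alert_name": "ExampleRoutedAlert",
            "route": {"configured": True},
            "drill": {"last_drill_at": "2026-06-01T00:00:00Z", "state": "current"},
        }],
        "pagination": {"next_cursor": None, "page_size": 1},
    },
}


def fake_fetch(cursor, page_size):
    """Look up a canned page by cursor; page_size is accepted but unused here."""
    return PAGES[cursor]


flagged = [
    item["alert_name"]
    for item in iter_alert_items(fake_fetch, page_size=1)
    if needs_follow_up(item)
]
print(flagged)
```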

## SLO Evidence Response Shape

```json
{
  "overall": {
    "status": "blocked",
    "environment_profile": "dev-control-rke2",
    "generated_at": "2026-06-06T00:20:00Z",
    "release_id": "release-001",
    "summary": "API error budget is healthy; provisioning queue SLO is blocked by stale staging alert evidence."
  },
  "items": [
    {
      "slo_id": "api.availability.30d",
      "service": "api",
      "owner_team": "platform",
      "objective": {
        "target": "99.9",
        "unit": "percent",
        "window": "30d"
      },
      "current": {
        "value": "99.95",
        "status": "within_budget",
        "error_budget_remaining_percent": 50.0
      },
      "query_outcome": {
        "source": "prometheus_query_outcome",
        "query_id": "api_availability_30d",
        "state": "fresh",
        "checked_at": "2026-06-06T00:18:00Z"
      },
      "alert_coverage": {
        "required_alerts": ["GPUaaSAPIHighErrorRate", "GPUaaSAPIP99LatencyHigh"],
        "loaded": true,
        "route_configured": true
      },
      "evidence_href": "/platform/evidence?slo_id=api.availability.30d"
    }
  ],
  "release_gate": {
    "state": "blocked",
    "blocking_reasons": [
      "provisioning queue alert simulation evidence is stale"
    ],
    "attachable_evidence_href": "/platform/evidence?bundle_type=slo_release_evidence&release_id=release-001"
  },
  "pagination": {
    "next_cursor": null,
    "page_size": 25
  },
  "meta": {
    "sources": [
      "slo_alert_pack",
      "prometheus_query_outcome",
      "alert_rule_manifest",
      "observability_oncall_gate",
      "release_evidence"
    ],
    "raw_prometheus_samples_included": false,
    "grafana_screenshots_required": false
  }
}
```
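
The example values are consistent with the usual availability error-budget arithmetic: a 99.9 percent target leaves a 0.1 percent budget, and a current value of 99.95 percent has consumed half of it. The sketch below reproduces that calculation. The formula and the function name `error_budget_remaining_percent` are assumptions for illustration; the contract does not specify how the backend derives the field.

```python
from decimal import Decimal


def error_budget_remaining_percent(target, current):
    """Return the share of error budget left, as a percentage.

    `target` and `current` are percent strings, as in the response shape.
    A negative result means the budget is overspent.
    """
    target_d, current_d = Decimal(target), Decimal(current)
    budget = Decimal(100) - target_d  # allowed unavailability, in percent
    if budget <= 0:
        raise ValueError("target must be below 100 percent")
    return float((current_d - target_d) / budget * 100)


print(error_budget_remaining_percent("99.9", "99.95"))
```

`Decimal` is used because `target` and `value` are serialized as strings, which avoids binary floating-point rounding on the subtraction.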

## Source Coverage

| Source | Minimum fields |
|---|---|
| Alert rule manifests | Alert name, expression id or query id, severity, service, owner/domain labels, runbook id, loaded state. |
| Prometheus alert state | Firing/resolved/suppressed state, last transition timestamps, classified freshness, not raw sample vectors. |
| On-call roster and routing | Destination, owner team, last verification timestamp, missing/stale route reasons. |
| Runbook index | Runbook id, runbook href, required coverage area, stale/missing status. |
| Alert drill evidence | Last drill timestamp, result, open gaps, expiry. |
| SLO/error-budget evidence | SLO id, target, window, classified query outcome, error-budget state, release gate state. |

## Status Semantics

| Status | Meaning |
|---|---|
| `healthy` | Required alerts/routes/evidence are loaded, current, and no blocking alert/SLO condition is active. |
| `degraded` | Alert/SLO evidence is usable but stale, partial, warning, or has non-blocking findings. |
| `blocked` | Release or UAT exit should stop because required alert, route, drill, or SLO evidence is missing or failing. |
| `unknown` | The platform cannot classify required evidence from current sources. |
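
One plausible way to derive an `overall.status` from per-item statuses is a worst-of rollup, sketched below. The precedence `blocked` > `unknown` > `degraded` > `healthy` and the choice to report `unknown` for an empty input are assumptions of this sketch, not rules stated by the contract.

```python
# Assumed precedence, least to most severe; the contract does not rank these.
STATUS_ORDER = ["healthy", "degraded", "unknown", "blocked"]


def overall_status(statuses):
    """Return the most severe status present, or 'unknown' when nothing was classified."""
    statuses = list(statuses)
    if not statuses:
        return "unknown"
    # list.index raises ValueError for a status outside the table above.
    return max(statuses, key=STATUS_ORDER.index)


print(overall_status(["healthy", "degraded", "healthy"]))
```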

## Excluded Data

The responses must not include:

- raw Prometheus sample vectors;
- raw Grafana dashboard JSON or screenshots as evidence payloads;
- notification destination secrets, webhook URLs, tokens, or credentials;
- private incident chat content;
- customer payment details, raw app payloads, tenant data plane content, or
  unsanitized exception text.
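
A test for these exclusions could walk a response payload and report suspicious fields. The following is a heuristic sketch only: the key fragments in `FORBIDDEN_KEY_FRAGMENTS` and the rule that absolute URLs are suspect (the example responses use relative hrefs) are assumptions chosen for illustration, and they do not cover every item in the list above.

```python
import re

# Hypothetical deny-list; tune to the real schema before relying on it.
FORBIDDEN_KEY_FRAGMENTS = ("secret", "token", "password", "credential", "webhook_url")
ABSOLUTE_URL = re.compile(r"^https?://", re.IGNORECASE)


def find_violations(node, path="$"):
    """Return JSON paths whose key or string value looks like excluded data."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if any(fragment in key.lower() for fragment in FORBIDDEN_KEY_FRAGMENTS):
                found.append(child)
                continue  # already flagged; do not report the value again
            found.extend(find_violations(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(find_violations(value, f"{path}[{index}]"))
    elif isinstance(node, str) and ABSOLUTE_URL.match(node):
        found.append(path)
    return found


leaky = {
    "items": [{
        "route": {
            "destination": "platform-oncall",
            "webhook_url": "https://hooks.example.invalid/abc",
        }
    }]
}
print(find_violations(leaky))
```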

## Implementation Notes

1. Add the OpenAPI definitions before starting handler work.
2. Start with checked-in alert manifests, `slo_alert_pack` evidence,
   `observability_oncall` status snapshots, and classified Prometheus query
   outcomes.
3. Keep raw Prometheus queries and Grafana views as SRE escape hatches. The
   platform read model returns only query ids and status classes, never raw
   samples.
4. Attach SLO evidence bundles to deploy-run/release readiness tasks so UAT is
   not the first place stale alerting or SLO evidence is discovered.
5. Add tests for authorization, missing alert routes, stale drill evidence,
   blocked release gates, pagination, and exclusion of raw samples and secrets.
