# Ops Runbook Architecture (v1)

Purpose:
- Define how runbooks are delivered, discovered, and governed from the Admin Ops experience.
- Prevent ad-hoc link sprawl and keep alert, dashboard, and runbook mappings consistent.

Admin Ops interaction model:
- `/admin/ops` is decision-first, not flat-dashboard-first.
- The default operator path is:
  1. Decision Header
  2. Action Required
  3. Health Summary
  4. Investigation Tools
  5. Fleet and Sample Detail
- Runbooks, saved queries, and observability links must preserve that order.

## 1. Scope and Principles

Scope:
- Admin/ops-facing runbook discovery from `/admin/ops`.
- Mapping from operational signals/alerts to deterministic runbook entries.
- Ownership, review cadence, and version tracking.
- Tenant/project authorization failure triage and project-context lookup workflow.

Principles:
- Docs-as-source-of-truth for content (`doc/operations/runbooks/*`).
- Stable runbook IDs (do not key UI to file path strings).
- Contract-first integration for UI/API additions.
- RBAC-enforced access (admin now, dedicated ops role later).

## 2. Layered Delivery Model

### MVP (now)
- Content source: markdown files in `doc/operations/runbooks/`.
- Metadata source: repository manifest file:
  - `doc/operations/runbooks/runbooks.catalog.json`
- UI behavior: Admin Ops page shows decision-first incident cards and grouped runbook links by signal group.
- Mapping: explicit `signal_key -> runbook_id[]` in manifest.
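
One way to derive the `signal_key -> runbook_id[]` mapping is to invert the per-runbook signal lists. The sketch below assumes each manifest entry carries the `id` and `signals` fields described in section 3. `SAMPLE_CATALOG` and `build_signal_index` are hypothetical names with illustrative data, not the contents of the real manifest.

```python
from collections import defaultdict

# Illustrative entries only; the real manifest lives in
# doc/operations/runbooks/runbooks.catalog.json and its layout may differ.
SAMPLE_CATALOG = [
    {"id": "ops.outbox.relay", "signals": ["outbox_relay_degraded", "dlq_backlog_present"]},
    {"id": "ops.queue.backlog", "signals": ["dlq_backlog_present"]},
]


def build_signal_index(entries):
    """Invert per-runbook `signals` lists into signal_key -> sorted runbook IDs."""
    index = defaultdict(set)
    for entry in entries:
        for signal_key in entry.get("signals", []):
            index[signal_key].add(entry["id"])
    # Sort keys and values so the rendered link order is deterministic.
    return {key: sorted(ids) for key, ids in sorted(index.items())}


if __name__ == "__main__":
    for signal_key, runbook_ids in build_signal_index(SAMPLE_CATALOG).items():
        print(signal_key, "->", runbook_ids)
```

Sorting both keys and IDs keeps the grouped links stable between builds, which matters when the UI is keyed to runbook IDs rather than file paths.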

### Pre-production
- Add read-only API endpoints for runbook metadata:
  - `GET /api/v1/admin/runbooks`
  - `GET /api/v1/admin/runbooks/{runbook_id}`
- UI reads from API instead of local constants.
- Alert rules reference `runbook_id` in annotations.

### Production scale
- Optional runbook service or knowledge gateway.
- Add ownership workflow and review SLA checks.
- Add execution checklist telemetry/audit for incident follow-through.

## 3. Canonical Metadata Contract

Each runbook entry must include:
- `id` (stable identifier, e.g. `ops.outbox.relay`).
- `title`.
- `severity_hint` (`sev1`/`sev2`/`sev3`).
- `owner_team`.
- `last_reviewed_at` (ISO-8601 date).
- `url` (repo path or docs URL).
- `signals` (array of signal keys this runbook addresses).
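
The contract above could be enforced by a small validator in CI. This is a minimal sketch: the function name `validate_runbook_entry` and the error wording are assumptions, and the date check accepts only what `date.fromisoformat` accepts on the running Python version.

```python
from datetime import date

REQUIRED_FIELDS = (
    "id", "title", "severity_hint", "owner_team",
    "last_reviewed_at", "url", "signals",
)
ALLOWED_SEVERITIES = {"sev1", "sev2", "sev3"}


def validate_runbook_entry(entry):
    """Return a list of problems; an empty list means the entry passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    if problems:
        # Skip value checks when fields are absent to avoid KeyError below.
        return problems
    if entry["severity_hint"] not in ALLOWED_SEVERITIES:
        problems.append(f"severity_hint must be one of {sorted(ALLOWED_SEVERITIES)}")
    try:
        date.fromisoformat(entry["last_reviewed_at"])
    except (TypeError, ValueError):
        problems.append("last_reviewed_at must be an ISO-8601 date (YYYY-MM-DD)")
    signals = entry["signals"]
    if not isinstance(signals, list) or not all(isinstance(s, str) and s for s in signals):
        problems.append("signals must be a list of non-empty strings")
    return problems
```

Returning a list of problems instead of raising on the first one lets a single CI run report every defect in an entry.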

Suggested signal keys:
- `outbox_relay_degraded`
- `dlq_backlog_present`
- `api_error_rate_high`
- `billing_worker_failures`
- `app_runtime_billing_reconciliation_failed`
- `app_runtime_billing_attribution_drift`
- `webhook_reconcile_failures`
- `database_latency_high`
- `provisioning_queue_depth_high`
- `provisioning_dispatch_latency_high`
- `provisioning_timeout_rate_high`
- `provisioning_failure_rate_high`
- `tenant_project_membership_denied`
- `project_context_missing`
- `user_onboarding_bootstrap_failed`
- `app_catalog_browse_failed`
- `app_catalog_entitlement_mutation_failed`
- `fleet_telemetry_api_error`
- `fleet_telemetry_data_stale`
- `app_operator_service_account_failed`
- `scheduler_reference_flow_failed`
- `app_runtime_mode_scope_mismatch`
- `enterprise_federation_oidc_failed`
- `enterprise_federation_saml_failed`
- `enterprise_federation_membership_denied`
- `enterprise_federation_state_invalid`
- `app_artifact_publish_intent_failed`
- `app_artifact_registration_failed`
- `app_artifact_promotion_failed`
- `app_artifact_trust_failed`
- `app_artifact_retirement_failed`
- `lab_control_stack_failed`
- `gpu_worker_host_degraded`
- `control_host_stack_failed`
- `cross_host_correlation_investigation`
- `cli_auth_failed`
- `cli_project_context_missing`
- `cli_command_failure`
- `python_sdk_request_failed`
- `python_sdk_auth_failed`
- `python_sdk_project_context_missing`

## 4. Alert and Dashboard Mapping

Requirements:
- Every actionable alert must include a `runbook_id`.
- Every critical incident card in `Action Required` must map to at least one runbook.
- Every `Health Summary` block must have a deterministic investigation path, even when it is not yet an incident.
- Runbook IDs must be unique and stable across file moves/renames.
- Incident flows must preserve canonical API error envelope fields
  (`code`, `message`, `correlation_id`, `details`) for lookup and triage.

Examples:
- `outbox_relay_ok=false` -> `ops.outbox.relay`
- `dlq_pending>0` -> `ops.queue.backlog` and `ops.outbox.relay`
- `payments_reconcile_failed_total` spike -> `ops.webhook.outage`
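
The first requirement above lends itself to an automated check. The sketch below assumes alert rules are available as dictionaries with an `alert` name and an `annotations` mapping (the shape used by Prometheus rule files); `find_unmapped_alerts` is a hypothetical helper, not an existing tool in this repository.

```python
def find_unmapped_alerts(alert_rules, known_runbook_ids):
    """Return (alert_name, reason) pairs for alerts that break the mapping rules."""
    violations = []
    for rule in alert_rules:
        name = rule.get("alert", "<unnamed>")
        runbook_id = rule.get("annotations", {}).get("runbook_id")
        if not runbook_id:
            violations.append((name, "missing runbook_id annotation"))
        elif runbook_id not in known_runbook_ids:
            # The annotation exists but points at an ID the catalog does not define.
            violations.append((name, f"unknown runbook_id: {runbook_id}"))
    return violations


if __name__ == "__main__":
    rules = [
        {"alert": "OutboxRelayDown", "annotations": {"runbook_id": "ops.outbox.relay"}},
        {"alert": "DlqBacklog", "annotations": {}},
    ]
    print(find_unmapped_alerts(rules, {"ops.outbox.relay", "ops.queue.backlog"}))
```

The alert names in the usage example are invented for illustration.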

## 5. Governance and Ownership

Owner model:
- Primary owner per runbook (`owner_team`).
- Backup owner in on-call roster.

Review policy:
- `last_reviewed_at` must be refreshed at least every 90 days.
- Any incident that uses a runbook should trigger a post-incident review and update of that runbook.
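
A staleness check for the 90-day policy might look like the following. It is a sketch under the assumption that catalog entries expose `id` and `last_reviewed_at` as described in section 3; `overdue_runbooks` is a hypothetical name.

```python
from datetime import date, timedelta

REVIEW_WINDOW = timedelta(days=90)


def overdue_runbooks(entries, today):
    """Return IDs of runbooks last reviewed more than 90 days before `today`."""
    overdue = []
    for entry in entries:
        reviewed = date.fromisoformat(entry["last_reviewed_at"])
        if today - reviewed > REVIEW_WINDOW:
            overdue.append(entry["id"])
    return overdue
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible in tests.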

Change management:
- Runbook changes require PR review from owning team.
- A new signal or alert requires its runbook mapping in the same PR.

## 6. Security and Access

- Runbook links are exposed only in authenticated admin/ops surfaces.
- No secrets, credentials, or private key material in runbook content.
- External links (Grafana/Tempo/Loki) must remain behind RBAC and audited access.

## 7. Correlation-ID-First Incident Workflow

Use the `/admin/ops` layout as the operator entry path:
1. Read the Decision Header for freshness and incident count.
2. Start from the highest-severity card in `Action Required`.
3. Open the linked runbook before dropping into raw logs and traces.
4. Use `Investigation Tools` only after the incident class is selected.
5. Use `Fleet and Sample Detail` as supporting evidence, not as the first triage surface.

Correlation-ID workflow:
1. Start from a surfaced failure and extract `correlation_id` from the API error envelope.
2. Use `correlation_id` as the primary join key across logs, traces, alerts, and audit rows.
3. Resolve canonical `resource_name` and use it as a secondary deterministic pivot when correlation spans multiple services.
4. Map matched alert/runbook entries using `runbook_id` annotations.
5. Record final incident timeline with envelope `code`, `correlation_id`, and canonical resource evidence.
6. In the three-host lab, record the first failing `host_role`:
   - `platform_control`
   - `app_control`
   - `worker_compute`

### 7.0 Canonical Resource Name Baseline

Use the canonical resource identifier format from `doc/architecture/Resource_Identifier_Spec.md`:

`core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}`

Operator notes:
- `core42:aicloud` is fixed and must not be replaced with aliases.
- Prefer exact `resource_name` matches in logs/traces before fallback to partial IDs.
- Capture both `correlation_id` and `resource_name` in incident handoff notes.
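
To make exact matching easier, a resource name can be split into its segments before it is used as a pivot. The parser below is a sketch based only on the format string shown above; the authoritative rules are in `doc/architecture/Resource_Identifier_Spec.md`. Allowing extra colons inside `resource_id` is an assumption, as are the names `ResourceName` and `parse_resource_name`.

```python
from typing import NamedTuple

PREFIX = ("core42", "aicloud")


class ResourceName(NamedTuple):
    region: str
    tenant_id: str
    project_id: str
    resource_type: str
    resource_id: str


def parse_resource_name(value):
    """Parse a canonical resource name, raising ValueError on any deviation."""
    # Assumption: any colon after the sixth separator belongs to resource_id.
    parts = value.split(":", 6)
    if len(parts) != 7 or tuple(parts[:2]) != PREFIX:
        raise ValueError(f"not a canonical resource name: {value!r}")
    if not all(parts[2:]):
        raise ValueError(f"empty segment in resource name: {value!r}")
    return ResourceName(*parts[2:])


if __name__ == "__main__":
    # The segment values here are made up for illustration.
    print(parse_resource_name("core42:aicloud:region-1:t-1:p-1:allocation:a-123"))
```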

### 7.1 Query Templates (Loki / Tempo / Prometheus)

Use these as the default triage sequence:

1. API error envelope by correlation ID (Loki):
   - `{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"`
2. Terminal gateway failures by correlation ID (Loki):
   - `{service="gpuaas-terminal-gateway"} | json | correlation_id="<CORRELATION_ID>"`
3. Worker-side event processing by correlation ID (Loki):
   - `{service=~"gpuaas-(billing-worker|provisioning-worker|notification-relay|webhook-worker)"} | json | correlation_id="<CORRELATION_ID>"`
4. Trace pivot by trace ID (Tempo/Grafana):
   - Use `trace_id` from the error envelope (`details.trace_id`) or the `X-Trace-ID` response header.
5. Error-rate snapshot around incident window (Prometheus):
   - `sum(rate(http_server_requests_total{status=~"5.."}[5m])) by (service)`
6. Terminal gateway websocket event outcomes (Prometheus):
   - `sum(rate(terminal_gateway_ws_events_total[5m])) by (outcome, reason)`

Guidance:
- Prefer `correlation_id` first, then pivot to `trace_id`.
- Include both IDs in incident notes and handoff updates.
- If no `trace_id` is present, treat it as a telemetry gap and log follow-up work.
- Fast helper:
  - `make ops-correlation-lookup CORRELATION_ID=<id> [TRACE_ID=<id>] [SESSION_ID=<id>] [ERROR_CODE=<code>] [WINDOW=30m]`
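
The three Loki templates above differ only in their stream selector, so they can be generated from one correlation ID. This is a sketch, not the implementation behind `make ops-correlation-lookup`; `loki_queries` is a hypothetical helper, and the use of JSON escaping for the quoted value is an assumption that should hold for LogQL string literals.

```python
import json

WORKER_SERVICES = "gpuaas-(billing-worker|provisioning-worker|notification-relay|webhook-worker)"


def loki_queries(correlation_id):
    """Return the three Loki queries from the triage sequence, in order."""
    # json.dumps produces a double-quoted string with quotes and backslashes
    # escaped, so an unexpected ID cannot terminate the string literal early.
    quoted = json.dumps(correlation_id)
    selectors = [
        '{service="gpuaas-api"}',
        '{service="gpuaas-terminal-gateway"}',
        '{service=~"%s"}' % WORKER_SERVICES,
    ]
    return [f"{selector} | json | correlation_id={quoted}" for selector in selectors]


if __name__ == "__main__":
    for query in loki_queries("<CORRELATION_ID>"):
        print(query)
```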

### 7.1a Three-Host Lab Host-Role Pivot

When the incident occurs in the three-host lab:
1. Determine `host_role` first:
   - `platform_control`
   - `app_control`
   - `worker_compute`
2. Use `correlation_id` across all matching roles before assuming a single-host defect.
3. If the first failing boundary is unknown, start with:
   - `{host_role=~"platform_control|app_control|worker_compute"} | json | correlation_id="<CORRELATION_ID>"`
4. Keep control-plane, platform-app control-stack, and real GPU worker evidence separate in notes.
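
Once log lines for a correlation ID have been collected from all roles, finding the first failing boundary is a matter of ordering the errors by time. The sketch below assumes each record is a dictionary with `ts`, `level`, and `host_role` keys, and that `ts` values are directly comparable (for example epoch seconds). That record shape and the function name are assumptions, not the actual log schema.

```python
HOST_ROLES = ("platform_control", "app_control", "worker_compute")


def first_failing_host_role(records):
    """Return the host_role of the earliest error record, or None if none failed."""
    errors = [
        record for record in records
        if record.get("level") == "error" and record.get("host_role") in HOST_ROLES
    ]
    if not errors:
        return None
    # The earliest error marks the first failing boundary; later errors on
    # other hosts may only be downstream effects.
    return min(errors, key=lambda record: record["ts"])["host_role"]
```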

### 7.2 Tenant/Project Authorization Triage

When handling authz failures tied to tenant/project scope:
1. Capture envelope `code` and `correlation_id` from the failing response.
2. Verify request project context:
   - required header: `X-Project-ID`
   - confirm the project ID value matches the expected tenant membership.
3. Distinguish expected classes:
   - missing project context (`invalid_request`)
   - membership/ownership denial (`insufficient_permissions` / related authz code)
4. Pivot with `correlation_id` across API logs, authz checks, and audit trail entries.
5. Route to tenant/project authz runbook for operator response workflow.
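
Steps 2 and 3 can be expressed as a small classifier over the error envelope and request headers. This is a sketch: the returned class labels and the function name are hypothetical, and only the two codes named above are handled, so any related authz code falls through to `unclassified`.

```python
def classify_authz_failure(envelope, request_headers):
    """Map an error envelope to one of the expected triage classes."""
    code = envelope.get("code")
    # HTTP header names are case-insensitive, so normalize before lookup.
    headers = {name.lower(): value for name, value in request_headers.items()}
    has_project = bool(headers.get("x-project-id"))
    if code == "invalid_request" and not has_project:
        return "missing_project_context"
    if code == "insufficient_permissions":
        return "membership_or_ownership_denial"
    return "unclassified"
```

An `invalid_request` response that did carry `X-Project-ID` is deliberately left unclassified, since the header check alone does not explain it.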

## 8. Immediate Next Steps

1. Keep `doc/operations/runbooks/runbooks.catalog.json` updated with stable IDs.
2. Keep signal-to-runbook mappings aligned with the incident derivation logic in `packages/web/app/admin/ops/page.tsx`.
3. Preserve the `Action Required -> Runbook -> Investigation Tools` sequence from section 7 when adding new Admin Ops signals or panels.
4. Add tests:
   - API contract tests for runbook list/detail.
   - UI tests for signal-to-runbook rendering and incident-card/runbook linkage.
   - Governance check ensuring every critical alert has `runbook_id`.
5. Add correlation-ID operator workflow:
   - API endpoint(s) to query recent API/audit/outbox records by `correlation_id`.
   - Admin Ops UI search panel for incident triage using correlation IDs from user-facing errors.

## 10. Symptom-to-Query Mapping (Ops Hardened Baseline)

Use this deterministic path during incident triage:
- UI/admin symptom -> Grafana panel -> Loki/Tempo saved query -> owning service.

Reference shortlist:
- Symptom: API errors spike in Admin Ops panel.
  - Panel: API/control-plane reliability.
  - Saved query: `api_error_by_correlation_id`.
  - Owner: `gpuaas-api`.
- Symptom: terminal sessions fail to connect/reconnect.
  - Panel: terminal gateway/session reliability.
  - Saved query: `terminal_resource_name_join`.
  - Owner: `gpuaas-terminal-gateway` (+ API for token/session mint path).
- Symptom: allocation stuck in provisioning/releasing.
  - Panel: provisioning workflow and queue health.
  - Saved query: `provisioning_timeout_failure_window`.
  - Owner: `gpuaas-provisioning-worker`.
- Symptom: billing balance updates missing/delayed.
  - Panel: billing/payment reconciliation.
  - Saved query: `billing_webhook_reconcile_failures`.
  - Owner: `gpuaas-billing-worker` and `gpuaas-webhook-worker`.
- Symptom: app runtime billing row looks wrong, missing, or not explainable from app-instance context.
  - Panel: billing/payment reconciliation + app runtime lifecycle context.
  - Saved query: `app_runtime_billing_reconciliation`.
  - Owner: `gpuaas-billing-worker` first, then app-runtime owner if attribution source is wrong.
- Symptom: new users can sign in but fail project-scoped actions.
  - Panel: API/control-plane reliability.
  - Saved query: `api_error_by_correlation_id` (filter onboarding/auth context codes).
  - Owner: `gpuaas-api` (auth/membership bootstrap path).
- Symptom: App Catalog page fails to load/filter or entitlement writes fail.
  - Panel: API/control-plane reliability + IAM mutation counters.
  - Saved query: `api_error_by_correlation_id` (filter app catalog/entitlement routes).
  - Owner: `gpuaas-api`.
- Symptom: app runtime lifecycle actions fail (deploy/upgrade/rollback/decommission) or instance state is stuck.
  - Panel: API/control-plane reliability + queue/outbox health + app runtime worker.
  - Saved query: `api_error_by_correlation_id` (seed with lifecycle `correlation_id`) and app runtime worker log filters.
  - Owner: `gpuaas-api`, `gpuaas-outbox-relay`, and `gpuaas-app-runtime-worker`.
- Symptom: app operator automation or scheduler reference flow fails despite healthy runtime worker status.
  - Panel: API/control-plane reliability + IAM/service-account context + app runtime worker.
  - Saved query: `api_error_by_correlation_id` with pivots on `operator_service_account_id`, `app_slug`, and `app_instance_id`.
  - Owner: `gpuaas-api` first, then app-platform/runtime owner.
- Symptom: Fleet Telemetry tabs (CPU/GPU/Memory/Storage) show stale or empty data.
  - Panel: fleet telemetry + observability backend health.
  - Saved query: `api_error_by_correlation_id` and stack readiness checks.
  - Owner: `gpuaas-api` + observability platform owner.
- Symptom: enterprise onboarding or work-account sign-in fails through OIDC/SAML flow.
  - Panel: API/control-plane reliability.
  - Saved query: `api_error_by_correlation_id` filtered for federation auth routes/messages (`oidc|saml|federation|state`).
  - Owner: `gpuaas-api` auth/federation owner, then IAM/onboarding owner for membership-gate failures.
- Symptom: three-host lab failure where ownership is unclear between platform-control, app-control, and worker-compute hosts.
  - Panel: host-role aware lab overview.
  - Saved query: `lab_control_plane_failure`, `lab_control_host_failure`, and `lab_gpu_worker_failure`.
  - Owner: platform owner first, then boundary-specific owner after first failing `host_role` is established.
- Symptom: CLI command failure in user/operator automation.
  - Panel: API/control-plane reliability + endpoint-specific worker panel.
  - Saved query: `api_error_by_correlation_id` from CLI-provided `correlation_id`.
  - Owner: `gpuaas-api` plus owning downstream service by endpoint.
- Symptom: Python SDK exception spike in integrations.
  - Panel: API/control-plane reliability + endpoint-specific worker panel.
  - Saved query: `api_error_by_correlation_id` from SDK exception `correlation_id`.
  - Owner: `gpuaas-api` plus owning downstream service by endpoint.
- Symptom: node lifecycle action fails (retire/reactivate/remove) or node-agent behavior diverges after transition.
  - Panel: API/control-plane reliability + node health/admin lifecycle surfaces.
  - Saved query: `api_error_by_correlation_id` plus node-agent logs by `node_id`.
  - Owner: `gpuaas-api` + inventory/node lifecycle owner.

Operator requirement:
- incident records must include both `correlation_id` and canonical `resource_name` when available.

## 9. Ops Metrics Source Modes

Use explicit ops metrics source modes for Admin Ops dashboard behavior:

1. `in_memory` mode:
   - local/dev fallback from process counters.
   - useful for fast smoke checks when observability backend is unavailable.
2. `backend` mode:
   - durable totals and historical trend reads from observability backend queries.
   - required for staging/production incident response and audits.

Ops metrics source runbook requirements:
- incident responders must confirm current mode first.
- if the query path fails in `backend` mode, degrade gracefully and surface mode-specific guidance.
- keep query pack ownership documented with alert/runbook linkage in ops evidence docs.
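
The degrade-gracefully requirement can be illustrated with a reader that never raises on a backend query failure. This is a sketch under stated assumptions: `read_ops_metrics`, the `backend_query` callable, and the guidance strings are hypothetical and do not describe the actual Admin Ops implementation.

```python
def read_ops_metrics(mode, backend_query, in_memory_counters):
    """Return (metrics, guidance) for the mode without raising on backend failure."""
    if mode == "in_memory":
        return dict(in_memory_counters), "in_memory mode: process counters only, no history."
    if mode != "backend":
        raise ValueError(f"unknown ops metrics source mode: {mode!r}")
    try:
        return backend_query(), "backend mode: durable totals from observability backend."
    except Exception as exc:  # broad on purpose: any query failure should degrade
        guidance = f"backend mode: query failed ({exc}); check observability backend health."
        return None, guidance
```

The sketch returns no metrics on failure instead of silently falling back to process counters, so a responder in staging or production is not shown non-durable numbers without being told.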

Platform role/authz control-plane counters (available in `/api/v1/admin/ops/overview` `control_plane` and `/metrics`):
- `platform_role_list_requests_total`
- `platform_role_bind_requests_total`
- `platform_role_revoke_requests_total`
- `platform_role_mutation_success_total`
- `platform_role_mutation_failure_total`
- `platform_role_admin_denied_total`
- `platform_role_service_unavailable_total`

Ops interpretation baseline:
- An increase in `platform_role_admin_denied_total` while the success/failure counters stay stable usually indicates expected authorization boundary enforcement.
- An increase in `platform_role_mutation_failure_total` with matching API 5xx/error envelopes indicates a backend failure path that requires triage.
- An increase in `platform_role_service_unavailable_total` indicates missing role-binding schema/runtime availability (migration/bootstrap drift).
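
The baseline above can be turned into a first-pass interpretation of counter changes over an incident window. In this sketch, `deltas` is assumed to map counter names to their increase over the window, and "stable" is read as a zero increase; `interpret_role_counters` and the note wording are hypothetical.

```python
def interpret_role_counters(deltas):
    """Return interpretation notes for counter increases over an incident window."""
    notes = []
    denied = deltas.get("platform_role_admin_denied_total", 0)
    success = deltas.get("platform_role_mutation_success_total", 0)
    failure = deltas.get("platform_role_mutation_failure_total", 0)
    unavailable = deltas.get("platform_role_service_unavailable_total", 0)
    if denied > 0 and success == 0 and failure == 0:
        notes.append("admin_denied rising alone: likely expected authorization enforcement")
    if failure > 0:
        notes.append("mutation_failure rising: look for matching API 5xx envelopes and triage")
    if unavailable > 0:
        notes.append("service_unavailable rising: check role-binding migration/bootstrap drift")
    return notes
```

The notes are hints for the responder, not verdicts; the failure case still needs confirmation against the API error envelopes.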