# Observability Baseline

## Backend Stack Decision (v1)
- OpenTelemetry Collector as the single telemetry pipeline.
- Prometheus for metrics.
- Tempo for traces.
- Loki for logs.
- Grafana for dashboards and alerting views.
- Vector is deferred by default (add only if advanced multi-sink transforms are required).

Reference:
- `doc/architecture/Observability_Architecture.md`
- `doc/governance/Observability_Standards.md`
- `doc/operations/OTEL_Collector_Tenant_Isolation.md`

## Logging
- Structured JSON logs with fields:
  - timestamp
  - level
  - service
  - correlation_id
  - trace_id/span_id (when span context exists)
  - org_id/project_id (when available)
  - error_code for failed requests/operations (catalog-aligned)
  - resource_name when the affected resource can be resolved

Runtime structured-log field contract (API/gateway/workers):
- `correlation_id`
- `error_code`
- `resource_name`
- `org_id` (tenant boundary)
- `project_id` (project boundary)

Three-host lab host-role field contract (required for lab evidence and triage):
- `host_role`:
  - `platform_control`
  - `app_control`
  - `worker_compute`
- `host_name`:
  - `dev-control-1`
  - `dev-lab-1`
  - `dev-gpu-1`
- `lab_stack` when a platform-app control stack is involved (for example `slurm-reference`)
- `node_id` when the real GPU worker host is involved

Field omission is allowed only when context is not yet established (for example,
startup/bootstrap logs before a request scope exists).
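
The field contract above can be sketched as a small formatter. This is only an illustrative sketch, not the runtime's actual logger: the class name `JsonLogFormatter`, the `message` key, and the choice of Python's `logging` module are assumptions made for the example. Context fields are emitted only when they are present on the record, which mirrors the omission rule for startup/bootstrap logs.

```python
import json
import logging
from datetime import datetime, timezone

# Context fields from the contract above; each is emitted only when set.
CONTEXT_FIELDS = (
    "correlation_id", "trace_id", "span_id", "org_id", "project_id",
    "error_code", "resource_name", "host_role", "host_name",
    "lab_stack", "node_id",
)


class JsonLogFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),  # assumed key, not in the contract
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value:  # omit fields whose context is not yet established
                entry[field] = value
        return json.dumps(entry)
```

With the standard library, context can be attached per call through `extra`, for example `logger.error("allocation failed", extra={"correlation_id": cid, "error_code": code})`.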

## Tracing
- OpenTelemetry tracing enabled for:
  - API requests
  - async worker jobs
  - external integrations (Stripe, node SSH operations)

## Metrics
Required core metrics:
- API request rate, latency, error rate
- Queue depth and consumer lag
- Workflow success/failure counts
- Billing debit/credit event counts
- Webhook processing latency and failure rate

Provisioning control-loop required metrics:
- `provisioning_queue_depth` (gauge): backlog of provisioning dispatch work.
- `provisioning_dispatch_latency_seconds` (histogram): event enqueue-to-dispatch delay.
- `provisioning_timeouts_total` (counter): provisioning task/workflow timeout outcomes.
- `provisioning_failures_total` (counter): non-timeout provisioning failures.
- `nats_consumer_lag{stream="PROVISIONING"}` (gauge): JetStream lag for provisioning stream.

Provisioning control-loop alert objectives:
- Queue depth sustained above threshold triggers backlog alert.
- Dispatch p95 latency sustained above threshold triggers dispatch-delay alert.
- Timeout rate above threshold triggers timeout alert.
- Failure burst above threshold triggers failure-rate alert.
- All provisioning control-loop alerts must include `runbook_id: ops.provisioning.stuck`.
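
To make "sustained above threshold" concrete, the following is a minimal sketch of the evaluation logic, loosely analogous to the `for:` clause of a Prometheus alerting rule. The function names, the alert name `ProvisioningQueueBacklog`, and the threshold and hold values are hypothetical; only the `runbook_id` value comes from the objectives above.

```python
from typing import Dict, Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, observed value)


def sustained_above(
    samples: Sequence[Sample], threshold: float, hold_seconds: float
) -> bool:
    """True when the latest unbroken run above threshold spans hold_seconds."""
    breach_start = None
    last_ts = None
    for ts, value in sorted(samples):
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # any dip resets the breach window
        last_ts = ts
    return breach_start is not None and last_ts - breach_start >= hold_seconds


def backlog_alert(
    samples: Sequence[Sample], threshold: float, hold_seconds: float
) -> Optional[Dict]:
    """Build an alert payload (hypothetical shape) when the breach is sustained."""
    if not sustained_above(samples, threshold, hold_seconds):
        return None
    return {
        "alert": "ProvisioningQueueBacklog",  # hypothetical alert name
        "annotations": {"runbook_id": "ops.provisioning.stuck"},
    }
```

A short spike that drops back under the threshold resets the window, so only a continuous breach produces an alert.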

Webhook worker baseline counters (via `GET /metrics` on `cmd/webhook-worker`):
- `webhook_events_received_total`
- `webhook_signature_failures_total`
- `webhook_invalid_payload_total`
- `webhook_persist_failures_total`
- `webhook_processed_success_total`
- `webhook_processing_failures_total`
- `payments_reconcile_failed_total`

API baseline counters/gauges (via `GET /metrics` on `cmd/api`):
- `api_ratelimit_fail_open_total`
- `api_idempotency_persisted_body_json_total`
- `api_idempotency_skipped_body_empty_total`
- `api_idempotency_skipped_body_non_json_total`
- `api_idempotency_replays_served_total`
- `terminal_token_consumed_ok_total`
- `terminal_token_replay_rejected_total`
- `ws_notifications_active_connections`
- `ws_notifications_forwarded_messages_total`
- `ws_notifications_write_errors_total`
- `api_platform_role_list_requests_total`
- `api_platform_role_bind_requests_total`
- `api_platform_role_revoke_requests_total`
- `api_platform_role_mutation_success_total`
- `api_platform_role_mutation_failure_total`
- `api_platform_role_admin_denied_total`
- `api_platform_role_service_unavailable_total`

Note:
- Platform-role counters are expected when the API runtime includes role-binding management telemetry.
- In mixed-version local environments, smoke checks warn (not fail) until the API runtime is refreshed.

Terminal stream relay observability baseline:
- session lifecycle counters (open/close/error) with close-reason labels.
- relay write error rate and drop counters at gateway/runtime boundary.
- token replay/consume counters correlated with session failures.
- alert annotations must map to `ops.terminal.gateway`.

SSH key management observability baseline:
- key mutation counters (create/delete/default-switch/allocation-keyset-update).
- authorization-denied and validation-error rate for SSH key APIs.
- audit-log completeness checks for key-management mutations.
- alert annotations must map to `ops.node.onboarding`.

Baseline validation command:
- `make ops-observability-smoke`
- Script: `scripts/ops/observability_smoke.sh`
- Latest local evidence: `doc/operations/evidence/observability_local_smoke_report.md`

Correlation-first validation checks (required):
- API error envelope includes:
  - `code`, `message`, `correlation_id`
  - machine-readable `details` with at least `service`, and when available `trace_id`, `span_id`
- Terminal gateway error envelope includes:
  - the same fields as above, plus route/method metadata in `details`
- NATS event path preserves context:
  - `x-correlation-id` in message headers
  - `traceparent`/`tracestate` propagation when trace context is present
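
The checks above could be expressed as small validators along these lines. This is a sketch under assumptions, not the smoke script's implementation: the function names are hypothetical, the `route` and `method` key names inside `details` are guesses at how the route/method metadata is spelled, and header lookup assumes the exact lowercase names shown.

```python
from typing import Any, List, Mapping

REQUIRED_ENVELOPE_FIELDS = ("code", "message", "correlation_id")


def envelope_problems(
    envelope: Mapping[str, Any], *, gateway: bool = False
) -> List[str]:
    """List contract violations in an error envelope (empty list means OK)."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_ENVELOPE_FIELDS
        if not envelope.get(name)
    ]
    details = envelope.get("details")
    if not isinstance(details, Mapping):
        return problems + ["details must be an object"]
    if not details.get("service"):
        problems.append("missing details.service")
    if gateway:
        # Assumed key names for the terminal gateway route/method metadata.
        for name in ("route", "method"):
            if not details.get(name):
                problems.append(f"missing details.{name}")
    return problems


def nats_header_problems(
    headers: Mapping[str, str], *, traced: bool
) -> List[str]:
    """Check that a NATS message carries correlation and trace context."""
    problems = []
    if not headers.get("x-correlation-id"):
        problems.append("missing x-correlation-id")
    # traceparent is only required when a trace context was present upstream.
    if traced and not headers.get("traceparent"):
        problems.append("missing traceparent")
    return problems
```

`trace_id` and `span_id` are deliberately not enforced, since the contract requires them only when available.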

Local overlay bring-up:
- `make dev-up-observability`
- Compose overlay: `doc/operations/local-dev/docker-compose.observability.yaml`
- Stack readiness check: `make ops-observability-stack-smoke`
- OTLP export is enabled for all core runtime services in observability mode:
  - `gpuaas-api`
  - `gpuaas-terminal-gateway`
  - `gpuaas-billing-worker`
  - `gpuaas-provisioning-worker`
  - `gpuaas-webhook-worker`
  - `gpuaas-notification-relay`
  - `gpuaas-outbox-relay`

`platform_control` deployed baseline:
- namespace: `gpuaas-observability`
- components:
  - `otel-collector`
  - `prometheus`
  - `loki`
  - `tempo`
  - `grafana`
  - `promtail`
- current public endpoints on `dev-control-1`:
  - Grafana: `http://100.90.157.34:3001`
  - Prometheus: `http://100.90.157.34:9090`
  - Loki: `http://100.90.157.34:3100`
  - Tempo: `http://100.90.157.34:3200`
- current proof points:
  - Prometheus scrapes `gpuaas-core` services
  - Tempo stores API request traces from `gpuaas-api`
  - Loki stores API warning/error logs from `gpuaas-core`

Three-host lab observability baseline:
- `dev-control-1` must remain identifiable as `host_role=platform_control`.
- `dev-lab-1` must remain identifiable as `host_role=app_control`.
- `dev-gpu-1` must remain identifiable as `host_role=worker_compute`.
- dashboards, logs, and runbooks must preserve `correlation_id` across those boundaries.
- real GPU incidents must stay distinguishable from scheduler/platform-app control-stack incidents.
- platform-control Kubernetes logs must be queried with the live Kubernetes label model (`namespace`, `job`, `host_role`) plus JSON payload fields such as `service`, `correlation_id`, and `trace_id`; do not rely on the retired local-dev `compose_service` label.
- `platform_control` observability is automation-owned through:
  - `infra/k8s/base/observability/`
  - environment-specific Prometheus label patches in
    `infra/k8s/overlays/*/observability-prometheus-env-*.yaml`
  - `infra/ansible/roles/platform_control_k8s_observability/`

Collector-backed node-agent logs:
- `gpuaas-node-agent` and `gpuaas-metrics-helper` emit normal structured stdout/journald/file logs.
- Worker nodes run a host-local Vector collector when
  `GPUAAS_NODE_LOG_COLLECTOR_ENABLED=1`.
- The collector tails:
  - `gpuaas-node-agent.service`
  - `gpuaas-metrics-helper.service`
  - `gpuaas-metrics-helper.timer`
  - `/var/log/gpuaas-node-agent*.log`
- Existing-node convergence must not backfill old host logs into Loki. The
  bootstrap Vector config reads file logs from the end, restricts journald to
  the current boot, and drops events older than 10 minutes before the Loki sink.
  Use `--reset-buffer` when enabling the collector on an already-running node.
- The collector forwards to `gpuaas-node-log-gateway` through the node-facing
  ingress with bounded disk buffering. Service code never ships logs itself,
  and in production the collector never pushes directly to raw Loki.
- Configure Vector's Loki sink endpoint as the gateway base path
  (`https://node-api.<env>/internal/v1/node-logs`). Vector appends
  `/loki/api/v1/push`; using the full push URL as the endpoint causes a doubled
  path and 404s.
- `gpuaas-node-log-gateway` validates the node bearer token, caps request size,
  forwards only Loki push batches to in-cluster Loki, and exposes
  `node_log_gateway_*` Prometheus counters. Backend rejections are logged with a
  bounded sanitized backend response snippet so operators can distinguish Loki
  policy failures such as stale timestamps from gateway/network failures.
- Production-shape gateway deployment:
  - base replica floor: 3 replicas;
  - HPA: CPU target 65%, min 3, max 20, conservative scale-down;
  - pod requests: `100m` CPU and `128Mi` memory; limits: `1` CPU and `512Mi`;
  - rolling updates allow one unavailable pod and one surge pod.
- Local kind includes `metrics-server` under
  `infra/k8s/overlays/local-kind/metrics-server.yaml` so `kubectl top` and
  HPA CPU targets work during parity testing. The kind add-on uses
  `--kubelet-insecure-tls`, which is appropriate for local kind only; production
  RKE2/k3s environments should use their distro-managed metrics-server or a
  cluster add-on with normal kubelet CA validation.
- Gateway scaling signals:
  - `node_log_gateway_batches_total{outcome="accepted|rejected_auth|rejected_method|rejected_size|forward_failed"}`
  - `node_log_gateway_forwarded_bytes_total`
  - `node_log_gateway_in_flight_requests`
  - `node_log_gateway_push_duration_seconds`
- Gateway alerts:
  - `GPUAASNodeLogGatewayDown`
  - `GPUAASNodeLogGatewayForwardFailures`
  - `GPUAASNodeLogGatewayInFlightBacklog`
- Gateway dashboard:
  - `GPUaaS Node Log Gateway`, provisioned from
    `infra/k8s/base/observability/config/grafana/dashboards/gpuaas-node-log-gateway.json`.
- Gateway load smoke:
  - `scripts/ops/node_log_gateway_load_smoke.sh` simulates many nodes pushing
    Loki batches through the gateway. Run it against a local port-forward or a
    staging/demo node-facing ingress before changing production HPA limits or
    Loki ingestion sizing.
- The collector must validate the gateway TLS chain with the node bootstrap CA
  bundle (`GPUAAS_NODE_LOG_COLLECTOR_CA_FILE`, default
  `/etc/gpuaas/ca-bundle.crt`). Do not disable certificate or hostname
  verification to work around node trust drift.
- Required Loki labels:
  - `service=gpuaas-node-agent|gpuaas-metrics-helper`
  - `component=node-agent|metrics-helper`
  - `source=journald|self-update-finalizer`
  - `systemd_unit`
  - `host_role=worker_compute`
  - `host_name`
  - `node_id`
- High-cardinality values such as `task_id`, `allocation_id`, and
  `correlation_id` stay as JSON fields and are queried with `| json`.
- Bootstrap ownership:
  - `build/node-agent-bootstrap/observability/vector-node-logs.toml.tmpl`
  - `build/node-agent-bootstrap/systemd/gpuaas-node-log-collector.service.tmpl`
  - `build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.service.tmpl`
  - `build/node-agent-bootstrap/systemd/gpuaas-metrics-helper.timer.tmpl`
- Worker-host automation ownership:
  - `infra/ansible/roles/worker_compute/`
  - `scripts/ops/gpuaas_manual_worker_node_converge.sh`
  - `scripts/ops/gpuaas_manual_worker_node_log_collector_converge.sh`
- Smoke validation:
  - `scripts/ops/observability_stack_smoke.sh` with
    `OBSERVABILITY_EXPECT_DEPLOYMENT_ENV=<env>`
  - `make ops-node-agent-loki-smoke`
  - set `LOKI_BASE_URL` and `NODE_ID` for node-specific validation before
    marking a worker node observability-ready.
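
The doubled-path failure described for the Vector Loki sink endpoint can be caught before a config is rolled out. The sketch below models the behaviour stated above (the push path is appended to the configured endpoint) and flags endpoints that already contain it. The function names and the host `node-api.example.test` are placeholders for illustration only.

```python
from typing import List
from urllib.parse import urlsplit

LOKI_PUSH_SUFFIX = "/loki/api/v1/push"


def effective_push_url(endpoint: str) -> str:
    """Model the URL that results when the push path is appended."""
    return endpoint.rstrip("/") + LOKI_PUSH_SUFFIX


def endpoint_problems(endpoint: str) -> List[str]:
    """Flag a sink endpoint that would produce a doubled push path."""
    problems = []
    path = urlsplit(endpoint).path.rstrip("/")
    if path.endswith(LOKI_PUSH_SUFFIX):
        problems.append(
            "endpoint already contains the push path; it would be doubled"
        )
    return problems


if __name__ == "__main__":
    base = "https://node-api.example.test/internal/v1/node-logs"
    for candidate in (base, base + LOKI_PUSH_SUFFIX):
        print(candidate, "->", effective_push_url(candidate))
        print("  problems:", endpoint_problems(candidate))
```

A check like this fits naturally into a converge or smoke step, before the collector is restarted with a new endpoint.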

Node-local Netdata telemetry edge:
- Worker bootstrap, not the API route layer, owns the stable Netdata edge used
  by platform-proxy.
- Netdata listens on `127.0.0.1:19998`.
- nginx listens on `0.0.0.0:19999`.
- `/gpuaas/telemetry/health` returns `ok`.
- `/gpuaas/telemetry/netdata/` redirects to the locally detected Netdata UI
  version path.
- Bootstrap ownership:
  - `build/node-agent-bootstrap/nginx/gpuaas-netdata-edge.conf.tmpl`
  - `build/node-agent-bootstrap/install-node-agent.sh`
- Worker-host automation ownership:
  - `infra/ansible/roles/worker_compute/`
- Existing-node repair/convergence helper:
  - `scripts/ops/gpuaas_netdata_edge_converge.sh`

Ops metrics query pack (backend mode):
- Use `backend mode` for durable totals displayed on Admin Ops views.
- Query pack must include canonical mappings for:
  - request/error totals by service and status class
  - websocket/terminal session outcomes
  - queue/backlog and worker failure aggregates
- Query failures in backend mode must emit an explicit operator-facing
  degradation reason with fallback instructions.

## Alerts
- Alert on SLO burn for API latency/error budgets.
- Alert on queue backlog thresholds.
- Alert on webhook failures and billing worker failures.
- Alert on repeated provisioning failures.
- Alert on provisioning dispatch latency and timeout rates.
- Alert on terminal stream relay degradation (session drop/error spikes).
- Alert on SSH key management anomaly spikes (mutation failures/denials).
- Alert rules should carry `runbook_id` annotations mapped to
  `doc/operations/runbooks/runbooks.catalog.json`.
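
The `runbook_id` requirement lends itself to a lint step. The following is a rough sketch, assuming rules have already been parsed into dictionaries with Prometheus-style `alert` and `annotations` keys and that the catalog has been reduced to a set of IDs. The schema of `runbooks.catalog.json` is not assumed here, and the function name is hypothetical.

```python
from typing import Iterable, List, Mapping, Set


def rules_missing_runbook(
    rules: Iterable[Mapping], catalog_ids: Set[str]
) -> List[str]:
    """Return names of rules whose runbook_id is absent or not in the catalog."""
    failures = []
    for rule in rules:
        runbook_id = rule.get("annotations", {}).get("runbook_id")
        if runbook_id not in catalog_ids:
            failures.append(rule.get("alert", "<unnamed>"))
    return failures


if __name__ == "__main__":
    catalog = {"ops.webhook.outage", "ops.terminal.gateway"}
    rules = [
        {
            "alert": "GPUAASWebhookProcessingFailuresSpike",
            "annotations": {"runbook_id": "ops.webhook.outage"},
        },
        {"alert": "ExampleRuleWithoutRunbook", "annotations": {}},
    ]
    print(rules_missing_runbook(rules, catalog))
```

An empty result means every rule maps to a catalogued runbook; any names returned point at rules that need an annotation or a catalog entry.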

Current local Prometheus rule pack (`doc/operations/local-dev/observability/prometheus-alerts.yaml`):
- `GPUAASWebhookProcessingFailuresSpike` -> `runbook_id: ops.webhook.outage`
- `GPUAASTerminalTokenReplaySpike` -> `runbook_id: ops.terminal.gateway`
- `GPUAASNotificationWriteErrorsSpike` -> `runbook_id: ops.terminal.gateway`
- `GPUAASRateLimitFailOpenDetected` -> `runbook_id: ops.api.degradation`

Alert drill command (synthetic test vectors):
- `make ops-observability-alert-drill`
- validates rule syntax and firing behavior via `promtool test rules`

Grafana alert routing baseline (local provisioning):
- Contact points:
  - `gpuaas-default` (fallback)
  - `gpuaas-platform`
  - `gpuaas-payments`
- Notification policy routes by:
  - `owner_team` label
  - `runbook_id` label (explicit runbook mapping)
- Message templates:
  - `gpuaas.alert.title`
  - `gpuaas.alert.body`

## Dashboards
- Service health overview
- Provisioning workflow dashboard
- Terminal gateway/session reliability dashboard
- Billing and payments dashboard
- Node fleet health dashboard
- Security/authentication anomalies dashboard

Local Grafana pack (current auto-provisioned baseline):
- `GPUaaS Control-Plane Overview` (API/control-plane health + error logs)
- `GPUaaS Billing & Payments` (webhook and reconcile path)
- `GPUaaS Terminal & Notifications` (terminal token + websocket reliability)
- `GPUaaS Runtime Health` (process/runtime saturation by scraped job)
- `GPUaaS Incident Correlation` (correlation_id/trace_id pivots in Loki)
- `GPUaaS Local Overview` (legacy starter dashboard; retained as compatibility view)
- `GPUaaS Fleet Telemetry` (CPU/GPU/Memory/Storage rollups for `/admin/telemetry`)

Initial Grafana dashboard set ownership:
- API/control-plane reliability: Platform/API owner.
- Provisioning pipeline and worker lag: Provisioning owner.
- Terminal session and token path reliability: Terminal owner.
- Billing/payment reconciliation path: Billing owner.
- Node fleet enrollment and health posture: Platform/Inventory owner.

Three-host lab dashboard/query direction:
- `platform_control`:
  - GitLab, registry, control-plane stack, and observability stack health
- `app_control`:
  - platform-app control stacks such as `slurm-reference`
- `worker_compute`:
  - node-agent, terminal path, allocation runtime, and GPU validation
- any host-role alert should carry `runbook_id: ops.lab.three-host` when the first failing boundary is not yet obvious

Admin Ops decision-first observability mapping:
- `Decision Header` is the scan point for freshness and incident count.
- `Action Required` is the default entry point for degraded signals needing action now.
- `Health Summary` is for compact state confirmation, not primary diagnosis.
- `Investigation Tools` is where correlation, trace, and saved-query pivots live after the incident class is selected.
- `Fleet and Sample Detail` is supporting evidence only.
- Auth/login failures must stay visible as WARN/401-class incidents and must not rely on 5xx-only dashboards.

Saved query cookbook (incident-ready defaults):
- API 5xx burst by correlation:
  - Loki saved query: `api_error_by_correlation_id`
  - `{service="gpuaas-api"} | json | status=~"5.." | correlation_id!=""`
- Terminal incident join by resource_name:
  - Loki saved query: `terminal_resource_name_join`
  - `{service=~"gpuaas-(terminal-gateway|api|notification-relay)"} | json | resource_name="<RESOURCE_NAME>"`
- Provisioning timeout/failure sweep:
  - Loki saved query: `provisioning_timeout_failure_window`
  - `{service="gpuaas-provisioning-worker"} | json | event_type=~"provisioning\\.(failed|release_failed)"`
- Billing/webhook reconciliation failures:
  - Loki saved query: `billing_webhook_reconcile_failures`
  - `{service=~"gpuaas-(billing-worker|webhook-worker)"} | json | code=~"upstream_error|service_unavailable|internal_error"`
- App runtime billing reconciliation failures:
  - Loki saved query: `app_runtime_billing_reconciliation`
  - `{service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | correlation_id!="" | (app_instance_id!="" or usage_source="app_runtime")`
- Fleet telemetry endpoint failures:
  - Loki saved query: `fleet_telemetry_api_error`
  - `{service="gpuaas-api"} | json | path="/api/v1/admin/telemetry/fleet" | status=~"4..|5.."`
- App operator/service-account failures:
  - Loki saved query: `app_operator_service_account_failure`
  - `{service="gpuaas-api"} | json | correlation_id!="" | operator_service_account_id!=""`
- App controller audit flood check:
  - Loki saved query: `app_controller_audit_flood`
  - `{service="gpuaas-api"} | json | action=~"app_instance\\..*report|app_instance\\.bootstrap_ssh\\.reconcile|shared_runtime\\..*report" | correlation_id!=""`
  - Runtime controllers may poll frequently, but report APIs must only write audit rows when persisted reconciliation state changes. Repeated identical controller observations are heartbeat telemetry, not privileged mutations.
  - Bootstrap SSH trust reconcile requests are audited only when they enqueue a new node task or change the requested trust state. Identical in-flight reconcile requests should not emit another audit row.
  - A sudden increase means either controller backoff/reconciliation is unhealthy or the runtime read model is changing on every poll. Inspect the correlation IDs before adding UI-side filtering.
- Enterprise federation failures:
  - Loki saved query: `enterprise_federation_auth_failure`
  - `{service="gpuaas-api"} | json | correlation_id!="" |~ "(oidc|saml|federation|state)"`
- Three-host lab control-plane failures:
  - Loki saved query: `lab_control_plane_failure`
  - `{host_role="platform_control"} | json | correlation_id!=""`
- Three-host GPU worker failures:
  - Loki saved query: `lab_gpu_worker_failure`
  - `{host_role="worker_compute"} | json | correlation_id!=""`
- Three-host app-control host failures:
  - Loki saved query: `lab_control_host_failure`
  - `{host_role="app_control"} | json | correlation_id!=""`
- Trace pivot helper:
  - Tempo/Grafana saved query: `trace_from_correlation_id`
  - Start from `details.trace_id` in the API error envelope, then inspect cross-service spans.
  - When `details.trace_id` is absent:
    1. use Loki with `correlation_id` to find the API log line,
    2. extract `trace_id`,
    3. open the trace in Tempo by ID,
    4. verify downstream spans from workers (`billing`, `provisioning`, `notification`, `outbox`) are present for async flows.
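
The fallback pivot in the trace helper (steps 1, 2, and 4) can be sketched as follows. This assumes the Loki query results are already available as raw JSON log lines and that the service names of the spans in the trace have been collected from Tempo. Both function names are hypothetical, and matching workers by substring is a simplification.

```python
import json
from typing import Iterable, List, Optional


def trace_id_for_correlation(
    log_lines: Iterable[str], correlation_id: str
) -> Optional[str]:
    """Find the first log line for correlation_id that carries a trace_id."""
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("correlation_id") == correlation_id and entry.get("trace_id"):
            return entry["trace_id"]
    return None


def missing_worker_spans(
    span_services: Iterable[str], expected_workers: Iterable[str]
) -> List[str]:
    """Return expected worker names with no matching span service in the trace."""
    services = list(span_services)
    return [
        worker
        for worker in expected_workers
        if not any(worker in service for service in services)
    ]
```

Which workers to expect depends on the async flow under investigation, so `expected_workers` is left to the caller rather than hard-coded.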

SLO/SLI shortlist (operations review baseline):
- API availability SLI: non-5xx request ratio over rolling 30d.
- API latency SLI: p95 request latency on authenticated API endpoints.
- Provisioning workflow SLI: requested->active success ratio within SLO window.
- Terminal session SLI: successful websocket open + stable session duration ratio.
- Billing/reconcile SLI: successful webhook processing + reconcile completion ratio.
- Queue health SLI: outbox/NATS backlog age below threshold.
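
As a worked example of the API availability SLI, the arithmetic can be sketched as below. The 99.9% target used in the usage note is a hypothetical value chosen for illustration; this document does not state the SLO targets.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Non-5xx request ratio over the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - server_errors) / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left; negative when the budget is overspent."""
    allowed = 1.0 - slo_target  # error ratio the SLO tolerates
    spent = 1.0 - sli           # error ratio actually observed
    return 1.0 - spent / allowed


if __name__ == "__main__":
    sli = availability_sli(total_requests=1_000_000, server_errors=400)
    print(f"SLI: {sli:.4%}")
    print(f"budget remaining: {error_budget_remaining(sli, 0.999):.1%}")
```

With 1,000,000 requests and 400 5xx responses in the window, the SLI is 99.96%. Against an assumed 99.9% target, about 40% of the error budget is spent and about 60% remains.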

Operator interpretation reference:
- `doc/operations/runbooks/Admin_Ops_Dashboard_Usage_Runbook.md`
- `doc/operations/Ops_Runbook_Architecture.md`
- `doc/operations/runbooks/Three_Host_Lab_Incident_Runbook.md`