# Observability Architecture v1

Purpose:
- Define the telemetry backend and data flow before implementing Ops UI and operational workflows.
- Keep observability implementation contract-driven and consistent across API and workers.

## 1. Backend Decision (v1)

Selected stack:
- OpenTelemetry SDKs in services/workers.
- OpenTelemetry Collector as the single telemetry pipeline.
- Prometheus for metrics scraping and alert rule evaluation.
- Tempo for trace storage/query.
- Loki for log storage/query.
- Grafana for dashboards and alert operations.

Deferred:
- Vector log pipeline agent (defer unless multi-sink routing or heavy log transforms are required).

Rationale:
- One collector pipeline reduces per-service telemetry complexity.
- The Prometheus/Tempo/Loki/Grafana stack covers metrics, traces, and logs behind a single operational surface.
- Matches current local stack direction and production platform baseline.

## 2. Topology

### Local Development
1. `cmd/api`, workers -> OTLP (gRPC/HTTP) -> OTel Collector.
2. Collector exports:
   - metrics -> Prometheus.
   - traces -> Tempo.
   - logs -> Loki.
3. Grafana reads Prometheus/Tempo/Loki datasources.

### Production
1. Edge/API/worker telemetry -> OTel Collector deployment (HA).
2. Collector exports to managed or self-hosted backends:
   - metrics: Prometheus + long-term store (Mimir/Thanos) when scale requires.
   - traces: Tempo.
   - logs: Loki.
3. Alerting through Grafana + Prometheus Alertmanager integration.

## 3. Telemetry Contract

Required resource attributes on all services:
- `service.name`
- `service.version`
- `deployment.environment`
- `service.instance.id`
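
A service could check this list at startup. The following is a minimal sketch, assuming the attributes arrive as a comma-separated `key=value` string in the style of the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable; the function names are hypothetical and not part of any SDK.

```python
# Required keys, in the order given by the contract above.
REQUIRED_RESOURCE_ATTRIBUTES = (
    "service.name",
    "service.version",
    "deployment.environment",
    "service.instance.id",
)


def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Parse a comma-separated key=value string into a dict."""
    attrs: dict[str, str] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed resource attribute: {pair!r}")
        attrs[key.strip()] = value.strip()
    return attrs


def missing_resource_attributes(attrs: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty, in contract order."""
    return [key for key in REQUIRED_RESOURCE_ATTRIBUTES if not attrs.get(key)]
```

A service might call `missing_resource_attributes(parse_resource_attributes(os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")))` and refuse to start, or log a warning, when the result is non-empty.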

Required request/event correlation fields:
- `correlation_id` (log field and span attribute)
- `user_id` where available (redaction policy applies)
- `org_id` where available
- `event_id` for async events

Metrics contract:
- Use stable metric names and units.
- For counters use `_total` suffix.
- For durations use seconds.
- Avoid high-cardinality labels (no raw UUIDs/session IDs in labels).
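
These rules lend themselves to a lint step in CI. The sketch below is one possible shape, not an existing tool: the `kind` values, the label deny-list, and the function name are assumptions made for illustration.

```python
import re

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)

# Hypothetical deny-list of label names that tend to be high-cardinality.
HIGH_CARDINALITY_LABELS = {"user_id", "session_id", "correlation_id", "event_id"}


def check_metric(name: str, kind: str, labels: dict[str, str]) -> list[str]:
    """Return a list of contract violations for one metric definition.

    kind is "counter", "duration", or anything else (not checked).
    labels maps each label name to a sample value.
    """
    problems = []
    if kind == "counter" and not name.endswith("_total"):
        problems.append("counter name should end with _total")
    if kind == "duration" and not name.endswith("_seconds"):
        problems.append("duration metric should be expressed in seconds")
    for label, sample in labels.items():
        if label in HIGH_CARDINALITY_LABELS or UUID_RE.match(sample):
            problems.append(f"label {label!r} looks high-cardinality")
    return problems
```

For example, `check_metric("http_requests", "counter", {"method": "GET"})` reports the missing `_total` suffix, while `check_metric("http_requests_total", "counter", {"method": "GET"})` returns an empty list.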

Tracing contract:
- Every incoming HTTP request has a root span.
- NATS publish/consume creates spans linked by correlation context.
- Stripe and SSH operations run in child spans, which are tagged with a failure status when the operation fails.
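
To illustrate how publish and consume spans can be tied together, here is a stdlib-only sketch that carries a W3C `traceparent` value and the `correlation_id` in message headers. It is an assumption that context travels in message headers, and the `correlation_id` header name is hypothetical; a real service would normally rely on the OpenTelemetry SDK's propagators rather than hand-rolled parsing.

```python
import re
import secrets

# W3C Trace Context, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 8-byte span id as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def inject(headers: dict, trace_id: str, span_id: str, correlation_id: str) -> dict:
    """Publisher side: copy trace and correlation context into message headers."""
    out = dict(headers)
    out["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled
    out["correlation_id"] = correlation_id
    return out


def extract(headers: dict):
    """Consumer side: recover the context, or None if traceparent is missing or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "correlation_id": headers.get("correlation_id"),
    }
```

The consumer would start its own span with a fresh `new_span_id()` and reference `parent_span_id` from the extracted context, so both sides appear in the same trace.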

Logging contract:
- Structured JSON only.
- Include `timestamp`, `level`, `service`, `message`, `correlation_id`.
- Redaction rules follow `doc/governance/Coding_Standards.md`.
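
As a sketch of the required log shape, the formatter below uses Python's standard `logging` module to emit one JSON object per record with exactly the fields listed above. The class name and the choice of ISO 8601 UTC timestamps and lowercase level names are assumptions; the contract does not fix them.

```python
import json
import logging
from datetime import datetime, timezone


class ContractJsonFormatter(logging.Formatter):
    """Render each log record as a JSON object with the contract's required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
            # Set by callers via logger.info(..., extra={"correlation_id": ...}).
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)
```

Attaching it is a matter of `handler.setFormatter(ContractJsonFormatter("api"))`; call sites then pass `extra={"correlation_id": cid}` so the field is present on every record.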

## 4. Security and Retention

Security requirements:
- Telemetry transport must use TLS in production.
- Access to logs/traces/dashboards is restricted by role (admin role in v1, ops role in v2).
- No secrets/tokens/private keys in logs or span attributes.
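
One way to back the last requirement with code is a scrubber applied to log fields and span attributes before they are emitted. The sketch below is illustrative only: the key pattern and the `[REDACTED]` placeholder are assumptions, and the authoritative redaction rules remain those in `doc/governance/Coding_Standards.md`.

```python
import re

# Hypothetical pattern; the real list of sensitive keys comes from the coding standards.
SENSITIVE_KEY_RE = re.compile(
    r"(secret|token|password|private_key|authorization)", re.IGNORECASE
)
REDACTED = "[REDACTED]"


def scrub(attrs: dict) -> dict:
    """Return a copy of attrs with values under sensitive-looking keys replaced."""
    clean = {}
    for key, value in attrs.items():
        if SENSITIVE_KEY_RE.search(key):
            # Check the key first so a sensitive nested object is dropped whole.
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = scrub(value)
        else:
            clean[key] = value
    return clean
```

Key-based scrubbing does not catch secrets embedded in free-text messages, so it complements rather than replaces review of what gets logged.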

Retention baseline:
- Metrics: 30 days minimum (longer with remote storage after scale trigger).
- Traces: 7 to 14 days baseline.
- Logs: 30 days baseline for operational logs, longer for security/audit streams per policy.

## 5. Ops UI Integration (Admin v1)

Initial UI route:
- `/admin/ops` (admin role only in v1).

Panel sources:
- Service health and internal stats endpoints.
- Aggregated telemetry overview endpoint (must be added to OpenAPI before the UI panel's data fetching is coded).
- Deep links to Grafana dashboards for detailed troubleshooting.

Rules:
- Do not query Prometheus/Loki/Tempo directly from the browser.
- Browser calls API/BFF only; backend enforces authz and returns sanitized operational summaries.
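
The backend side of this rule can be sketched as a function that checks the caller's role and returns only allow-listed fields. All names here (`Principal`, `ops_summary`, the summary fields) are hypothetical, since the actual endpoint contract is still to be defined in OpenAPI.

```python
from dataclasses import dataclass


class Forbidden(Exception):
    """Raised when the caller lacks the role required for ops data."""


@dataclass(frozen=True)
class Principal:
    user_id: str
    roles: frozenset


# Hypothetical allow-list; the real fields come from the OpenAPI contract.
SUMMARY_FIELDS = ("service", "status", "error_rate", "p95_latency_seconds")


def ops_summary(principal: Principal, raw_service_stats: list[dict]) -> list[dict]:
    """Enforce authz, then reduce raw backend stats to the allow-listed fields."""
    if "admin" not in principal.roles:
        raise Forbidden("admin role required")
    return [
        {field: stats.get(field) for field in SUMMARY_FIELDS}
        for stats in raw_service_stats
    ]
```

Using an allow-list rather than a deny-list means that new fields in the raw backend data (hostnames, query strings, internal labels) stay hidden from the browser until they are added deliberately.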

## 6. Pre-Implementation Gates

Before coding observability features:
- OpenAPI includes admin ops summary endpoint contract(s).
- AsyncAPI references any new operational events (if introduced).
- UX mock exists for `/admin/ops` with state matrix.
- Governance standards for telemetry are approved.