# Observability Standards

Purpose:
- Define mandatory telemetry standards before observability implementation starts.
- Ensure consistent, secure, low-noise operational signals across services.

## 1. Required Telemetry Outputs

Every runtime component (`cmd/api`, all workers) must emit:
- structured logs
- OpenTelemetry traces
- Prometheus-compatible metrics

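All signals must include correlation context where applicable: logs and spans carry `correlation_id` (Section 2); metric labels do not, because per-request identifiers are unbounded.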

## 2. Required Fields

Required log fields:
- `timestamp`
- `level`
- `service`
- `message`
- `correlation_id`
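
The field list above can be sketched with Go's standard `log/slog` package. This is a minimal illustration, not the project's logger: it assumes Go 1.21+, and the helper names `newLogger` and `sampleRecord` are hypothetical. `slog` writes `time` and `msg` by default, so the sketch renames those two top-level keys to `timestamp` and `message`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// newLogger (hypothetical helper) returns a JSON logger whose records use the
// field names required above. The service name is attached once, up front.
func newLogger(w io.Writer, service string) *slog.Logger {
	h := slog.NewJSONHandler(w, &slog.HandlerOptions{
		ReplaceAttr: func(groups []string, a slog.Attr) slog.Attr {
			if len(groups) > 0 {
				return a // leave attributes inside groups untouched
			}
			switch a.Key {
			case slog.TimeKey: // "time"
				a.Key = "timestamp"
			case slog.MessageKey: // "msg"
				a.Key = "message"
			}
			return a
		},
	})
	return slog.New(h).With("service", service)
}

// sampleRecord (hypothetical helper) logs one record into a buffer and returns
// it decoded, which is convenient when unit-testing log fields.
func sampleRecord(service, correlationID, msg string) map[string]any {
	var buf bytes.Buffer
	newLogger(&buf, service).Info(msg, "correlation_id", correlationID)
	var rec map[string]any
	if err := json.Unmarshal(buf.Bytes(), &rec); err != nil {
		panic(err)
	}
	return rec
}

func main() {
	logger := newLogger(os.Stdout, "cmd/api")
	logger.Info("request completed", "correlation_id", "example-correlation-id")
}
```

The sketch does not enforce that `correlation_id` is present on every call; a real logger would more likely pull it from the request context.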

Required span attributes:
- `service.name`
- `deployment.environment`
- `correlation_id`
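
One way to check the required span attributes in a unit test is a small presence check over the attribute set. The sketch below uses only the standard library and treats attributes as a plain string map; `missingSpanAttributes` is a hypothetical helper, and wiring it to a real OpenTelemetry span exporter is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// requiredSpanAttributes mirrors the list in this section.
var requiredSpanAttributes = []string{
	"service.name",
	"deployment.environment",
	"correlation_id",
}

// missingSpanAttributes (hypothetical helper) returns the required keys that
// are absent or empty in attrs, sorted alphabetically.
func missingSpanAttributes(attrs map[string]string) []string {
	var missing []string
	for _, key := range requiredSpanAttributes {
		if attrs[key] == "" {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	attrs := map[string]string{"service.name": "cmd/api"}
	fmt.Println(missingSpanAttributes(attrs))
}
```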

Metric dimensions (labels):
- Keep labels bounded and low-cardinality.
- Allowed: `service`, `route_template`, `status_code_class`, `operation`.
- Forbidden: raw UUIDs, session IDs, token IDs, email addresses.
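
The label rules above could be enforced with a small guard like the following sketch. The function names, the `unknown` fallback class, and the regular expressions are assumptions for illustration. Pattern matching only catches UUID-like and email-like values; session and token IDs have no fixed shape, so the allow-list on label names remains the main control.

```go
package main

import (
	"fmt"
	"regexp"
)

// allowedLabels mirrors the allowed label names in this section.
var allowedLabels = map[string]bool{
	"service":           true,
	"route_template":    true,
	"status_code_class": true,
	"operation":         true,
}

// Heuristic patterns (assumptions) for values that must never become labels.
var (
	uuidRe  = regexp.MustCompile(`(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}`)
	emailRe = regexp.MustCompile(`[^@\s/]+@[^@\s/]+\.[^@\s/]+`)
)

// statusCodeClass collapses an HTTP status code into a bounded label value
// such as "2xx" or "5xx". Out-of-range codes map to "unknown" (an assumption).
func statusCodeClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return fmt.Sprintf("%dxx", code/100)
}

// validateLabels rejects label names outside the allow-list and values that
// look like raw UUIDs or email addresses.
func validateLabels(labels map[string]string) error {
	for name, value := range labels {
		if !allowedLabels[name] {
			return fmt.Errorf("label %q is not on the allow-list", name)
		}
		if uuidRe.MatchString(value) || emailRe.MatchString(value) {
			return fmt.Errorf("label %q carries an identifier-like value", name)
		}
	}
	return nil
}

func main() {
	labels := map[string]string{
		"service":           "cmd/api",
		"route_template":    "/v1/orders/{order_id}",
		"status_code_class": statusCodeClass(404),
	}
	fmt.Println(validateLabels(labels))
}
```

Using the route template rather than the concrete path is what keeps `route_template` bounded: `/v1/orders/{order_id}` is one label value no matter how many orders exist.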

## 3. Naming and Units

Rules:
- Counter names end in `_total`.
- Durations are recorded in seconds.
- Monetary values are recorded as integers in minor units (for example, cents).
- Retry/failure metrics must be explicit (`*_failures_total`, `*_retries_total`).
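
A short sketch of how these rules look in code. The helper names and the metric name `outbox_publish_failures_total` are hypothetical, and `toMinorUnits` takes the minor-units-per-major-unit factor as a parameter because that factor differs between currencies.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// sampleLatency is an example measurement used below.
var sampleLatency = 250 * time.Millisecond

// checkCounterName enforces the `_total` suffix for counters.
func checkCounterName(name string) error {
	if !strings.HasSuffix(name, "_total") {
		return fmt.Errorf("counter %q must end in _total", name)
	}
	return nil
}

// seconds converts a measured duration to the unit used for duration metrics.
func seconds(d time.Duration) float64 {
	return d.Seconds()
}

// toMinorUnits expresses an amount as an integer count of minor units, so no
// floating-point money ever reaches a metric.
func toMinorUnits(major, minor, minorPerMajor int64) int64 {
	return major*minorPerMajor + minor
}

func main() {
	fmt.Println(checkCounterName("outbox_publish_failures_total"))
	fmt.Println(seconds(sampleLatency))
	fmt.Println(toMinorUnits(12, 34, 100))
}
```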

## 4. Redaction and Data Safety

Never emit any of the following into logs, spans, or metric labels:
- access/refresh/id tokens
- passwords and password hashes
- SSH private key material
- payment card data
- full webhook payloads containing PII
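
As a last line of defense, a logger can redact by attribute key. The sketch below uses `log/slog`'s `ReplaceAttr` hook; the key names in `sensitiveKeys` and the helpers `newRedactingLogger` and `leaks` are assumptions, not the rules from the referenced coding standards. Key-based redaction cannot catch a secret interpolated into the message string, so it complements, rather than replaces, not logging the value at all.

```go
package main

import (
	"bytes"
	"io"
	"log/slog"
	"os"
	"strings"
)

// sensitiveKeys is an illustrative deny-list (hypothetical key names).
var sensitiveKeys = map[string]bool{
	"access_token":    true,
	"refresh_token":   true,
	"id_token":        true,
	"password":        true,
	"password_hash":   true,
	"ssh_private_key": true,
	"card_number":     true,
	"webhook_payload": true,
}

// redact replaces the value of any deny-listed attribute, matching keys
// case-insensitively.
func redact(groups []string, a slog.Attr) slog.Attr {
	if sensitiveKeys[strings.ToLower(a.Key)] {
		return slog.String(a.Key, "[REDACTED]")
	}
	return a
}

// newRedactingLogger returns a JSON logger with the redaction hook installed.
func newRedactingLogger(w io.Writer) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, &slog.HandlerOptions{ReplaceAttr: redact}))
}

// leaks reports whether value survives into the rendered record when logged
// under key. Intended as a unit-test helper.
func leaks(key, value string) bool {
	var buf bytes.Buffer
	newRedactingLogger(&buf).Info("login attempt", key, value)
	return strings.Contains(buf.String(), value)
}

func main() {
	logger := newRedactingLogger(os.Stdout)
	logger.Info("login attempt", "operation", "login", "password", "example-secret")
}
```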

Reference:
- The sanitization and redaction rules in `doc/governance/Coding_Standards.md` are mandatory.

## 5. Error and Correlation Requirements

Requirements:
- Every error response must use the canonical `ErrorResponse` body, including `correlation_id`.
- `X-Correlation-ID` must be returned on every HTTP response.
- Async handlers must propagate the correlation ID from the event envelope into their logs and spans.
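
The two HTTP requirements above can be sketched as `net/http` middleware. Several details here are assumptions rather than part of this standard: reusing an incoming `X-Correlation-ID` instead of always generating one, the 32-character hex ID format, and every `ErrorResponse` field other than `correlation_id`. The `roundTrip` helper exists only to exercise the middleware in-process.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

const correlationHeader = "X-Correlation-ID"

type ctxKey struct{}

// ErrorResponse is a stand-in for the canonical type; only correlation_id is
// taken from this document, the other fields are hypothetical.
type ErrorResponse struct {
	Code          string `json:"code"`
	Message       string `json:"message"`
	CorrelationID string `json:"correlation_id"`
}

// newCorrelationID returns 16 random bytes as hex (format is an assumption).
func newCorrelationID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// withCorrelation sets the response header before the wrapped handler runs,
// so the header is present on every response, and stores the ID in the context.
func withCorrelation(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(correlationHeader)
		if id == "" {
			id = newCorrelationID()
		}
		w.Header().Set(correlationHeader, id)
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// correlationID reads the ID stored by withCorrelation, or "" if absent.
func correlationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// writeError emits an ErrorResponse carrying the request's correlation ID.
func writeError(w http.ResponseWriter, r *http.Request, status int, code, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(ErrorResponse{
		Code:          code,
		Message:       msg,
		CorrelationID: correlationID(r.Context()),
	})
}

// roundTrip sends one failing request through the middleware in-process and
// returns the response header value and the decoded error body.
func roundTrip(incomingID string) (string, ErrorResponse) {
	h := withCorrelation(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		writeError(w, r, http.StatusNotFound, "not_found", "resource not found")
	}))
	req := httptest.NewRequest(http.MethodGet, "/example", nil)
	if incomingID != "" {
		req.Header.Set(correlationHeader, incomingID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	var body ErrorResponse
	if err := json.NewDecoder(rec.Body).Decode(&body); err != nil {
		panic(err)
	}
	return rec.Result().Header.Get(correlationHeader), body
}

func main() {
	headerID, body := roundTrip("")
	fmt.Println(headerID == body.CorrelationID)
}
```

An async handler would follow the same pattern, reading the ID from the event envelope instead of a request header before placing it in the context.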

## 6. Dashboards and Alerts (Minimum v1)

Must-have dashboards:
- API SLO dashboard (latency/error/availability)
- Billing + payment processing dashboard
- Provisioning workflow dashboard
- Queue/relay health dashboard

Must-have alerts:
- API error-rate burn
- high p95 latency
- outbox relay publish failures/backlog growth
- webhook processing failures
- worker heartbeat loss

## 7. CI and Review Gates

Before merge for observability-impacting changes:
- unit tests for new metrics/log fields (where practical)
- integration smoke test for telemetry endpoints, where available
- documentation updated for added/changed metrics
- no new high-cardinality labels introduced

Review checklist:
- [ ] Correlation preserved end-to-end
- [ ] PII-safe telemetry
- [ ] Dashboard/alert impact documented
- [ ] Contract docs updated before code
