# Error and Traceability DNA Standard

## Purpose
Make error handling and traceability a non-negotiable implementation baseline across backend, UI, and ops.
This standard is enforcement-oriented: code, tests, CI gates, and runbooks must follow it.

## Non-Negotiable Invariants
1. Every API error response uses the canonical `ErrorResponse` shape: `code`, `message`, `correlation_id`, and optional `details`.
2. Every failure path logs structured fields including `correlation_id`.
3. Every incident-relevant flow is traceable across UI symptom -> API response -> service log -> runbook.
4. No silent failures in UX for API/WS flows.
5. No fallback may hide a root cause unless it has an explicit backlog entry and removal criteria.

## Required Error Envelope
All HTTP handlers must return only canonical catalog codes from `doc/architecture/Error_Code_Catalog.md`.

Minimum payload:
```json
{
  "code": "invalid_request",
  "message": "human text",
  "correlation_id": "uuid-or-stable-id"
}
```

Rules:
1. `correlation_id` is required in all non-2xx responses.
2. `details` is required for `validation_error`.
3. No legacy `{ "error": "..." }` payloads are allowed.
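
The envelope rules above can be checked mechanically. The following is a minimal Python sketch, not the project's implementation: the function names `make_error_response` and `envelope_violations` are hypothetical, generating a UUID when no `correlation_id` is supplied is an assumption, and the catalog is passed in as a plain set because this document does not list its contents.

```python
import uuid

REQUIRED_FIELDS = ("code", "message", "correlation_id")


def make_error_response(code, message, correlation_id=None, details=None):
    """Build a dict in the canonical ErrorResponse shape."""
    body = {
        "code": code,
        "message": message,
        # Assumption of this sketch: mint a UUID if the caller has no id yet.
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }
    if details is not None:
        body["details"] = details
    return body


def envelope_violations(body, catalog_codes=None):
    """Return a list of rule violations for one error payload (empty if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not body.get(field):
            problems.append(f"missing {field}")
    # Catalog membership is only checked when the caller supplies the catalog.
    if catalog_codes is not None and body.get("code") not in catalog_codes:
        problems.append(f"code not in catalog: {body.get('code')}")
    if body.get("code") == "validation_error" and "details" not in body:
        problems.append("validation_error requires details")
    if "error" in body:
        problems.append("legacy error payload")
    return problems
```

A handler unit test could assert `envelope_violations(response_body) == []` for every non-2xx response it exercises.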

## Required Logging/Trace Fields
For all error logs and key warning logs:
1. `correlation_id`
2. `error`
3. `org_id` when available
4. `project_id` when available
5. `actor_type` (`user` or `service_account`) when available
6. `actor_id` (subject/service account id) when available
7. `resource_name` when target resource exists

PII/secret sanitization remains mandatory per `doc/governance/Coding_Standards.md`.
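
As an illustration of the field list above, here is a minimal Python sketch that assembles the log fields for one error. The helper names `error_log_fields` and `log_error` are hypothetical, JSON-per-line output is an assumption, and PII/secret sanitization is deliberately not shown because its rules live in the coding standards.

```python
import json
import logging

# Fields that are attached only when the caller has a value for them.
WHEN_AVAILABLE = ("org_id", "project_id", "actor_type", "actor_id", "resource_name")
ACTOR_TYPES = ("user", "service_account")


def error_log_fields(correlation_id, error, **context):
    """Return the structured fields for one error or key warning log."""
    if not correlation_id:
        raise ValueError("correlation_id is required on every error log")
    fields = {"correlation_id": correlation_id, "error": str(error)}
    for key in WHEN_AVAILABLE:
        value = context.get(key)
        if value is not None:
            fields[key] = value
    actor_type = fields.get("actor_type")
    if actor_type is not None and actor_type not in ACTOR_TYPES:
        raise ValueError(f"unexpected actor_type: {actor_type}")
    return fields


def log_error(logger, correlation_id, error, **context):
    """Emit the fields as one JSON line (output format is an assumption)."""
    fields = error_log_fields(correlation_id, error, **context)
    logger.error(json.dumps(fields, sort_keys=True))
```

Raising on a missing `correlation_id` keeps the omission visible at the call site instead of producing an untraceable log line.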

## Route-Class Requirements
### HTTP (REST)
1. Pre-handler validation failures must still return the canonical envelope.
2. AuthN/AuthZ failures must be explicit and mapped to catalog codes.
3. Project-scope failures must be deterministic: `400` for missing context, `403` for ownership/scope denial.
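
The deterministic project-scope rule can be sketched as a pure function. This Python example is illustrative only: the function name and both code strings (`missing_project_context`, `project_scope_denied`) are hypothetical placeholders, and real handlers must use whatever the error code catalog defines.

```python
def project_scope_rejection(requested_project_id, allowed_project_ids):
    """Return (status, code) for a rejection, or None if the request may proceed."""
    if not requested_project_id:
        # Missing context is always a 400, never a 403 or 404.
        return (400, "missing_project_context")  # hypothetical catalog code
    if requested_project_id not in allowed_project_ids:
        # Ownership/scope denial is always a 403.
        return (403, "project_scope_denied")  # hypothetical catalog code
    return None
```

Because the function has no side effects, the same inputs always produce the same status and code, which is what makes the rejection paths straightforward to cover in integration tests.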

### WebSocket/Realtime
1. Pre-upgrade failures must map to a user-visible error state in UI.
2. Post-upgrade failures must emit stable control codes and UX message mappings.
3. Silent close without user-facing state change is a defect.
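
One way to rule out silent closes is to route every WebSocket failure through a single mapping that always yields a visible state. The Python sketch below only illustrates that logic; the control codes, message texts, and the shape of the returned state are all hypothetical, and the UI's own language and component model may differ.

```python
# Hypothetical control codes and user-facing messages.
WS_MESSAGES = {
    "session_expired": "Your session expired. Sign in again to reconnect.",
    "project_context_missing": "Select a project to open this connection.",
}

# Shown for unmapped codes so the user still sees a state change.
GENERIC_MESSAGE = "The connection was lost. Retry, and quote the reference ID if it persists."


def ux_state_for_ws_failure(control_code, correlation_id=None):
    """Map a pre- or post-upgrade failure to a visible UI state."""
    return {
        "connected": False,
        "banner": WS_MESSAGES.get(control_code, GENERIC_MESSAGE),
        # Keep the raw code and id so support can trace the incident.
        "control_code": control_code,
        "correlation_id": correlation_id,
    }
```

The generic message is a user-visible default, not a hidden fallback: the original control code and `correlation_id` stay attached to the state for tracing.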

## CI Enforcement Gates
### Backend (`A`)
1. Unit tests for handler error mapping (status + code + `correlation_id`).
2. Integration tests for project-scope and auth scope rejection paths.
3. Static grep gate:
   - fail if new `\"error\":` payload patterns are introduced in API handlers.
4. An observability smoke test confirms that required metrics and runbook mapping signals remain present.
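
The static grep gate in item 3 could be approximated by scanning only the added lines of a unified diff. This is a minimal Python sketch under stated assumptions: the function name `legacy_payload_hits` is hypothetical, the regular expression accepts both the plain and the backslash-escaped form of the key, and the default path prefix `cmd/api/` is a guess at where API handlers live.

```python
import re

# Matches "error": as well as the escaped form \"error\": inside string literals.
LEGACY_PATTERN = re.compile(r'\\?"error\\?"\s*:')


def legacy_payload_hits(diff_text, handler_prefixes=("cmd/api/",)):
    """Return (file, line) pairs for added lines that introduce a legacy payload."""
    hits = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # File header of a unified diff, e.g. "+++ b/cmd/api/handler.go".
            path = line[4:]
            current_file = path[2:] if path.startswith("b/") else path
            continue
        if not line.startswith("+") or current_file is None:
            continue  # only added lines count as "new" patterns
        if not current_file.startswith(tuple(handler_prefixes)):
            continue
        if LEGACY_PATTERN.search(line[1:]):
            hits.append((current_file, line[1:].strip()))
    return hits
```

A CI step could feed it the output of `git diff` against the target branch and fail the build when the returned list is non-empty.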

### UI (`B`)
1. Unit tests ensure API and WS error states render mapped user-facing messages.
2. E2E tests cover:
   - pre-open WS failure
   - auth/session expiry
   - missing project context
3. No hidden failure states (spinner forever, disconnected without banner).

### Ops (`C`)
1. Runbooks map UI symptom to owning service and query path.
2. Correlation-first incident flow must be documented and tested in smoke gates.
3. Dashboard panels must distinguish control-plane degradation from node-metrics degradation.

## Review Checklist (PR Gate)
1. Does every new error path return the canonical envelope?
2. Does every new error path include `correlation_id` in logs?
3. Is there at least one test covering each new domain error path?
4. Is user-facing error behavior explicit for UI-impacting changes?
5. Are fallback paths documented in `Fallback_Tech_Debt_Register.md` when introduced?

## Report-Only Sweep Procedure (No Fixes)
Use this when establishing baseline quality before a cleanup sprint.

Deliverable: one report document containing findings only, no code changes.

### Sweep Scope
1. `cmd/api`, `cmd/terminal-gateway`, `packages/services/*`, `packages/shared/middleware`
2. `packages/web` API/WS user-facing error surfaces
3. `doc/operations` runbook linkage for top incident paths

### Finding Categories
1. `E1` Contract violation (non-canonical envelope, missing `correlation_id`, invalid code)
2. `E2` Traceability gap (missing fields/log lineage breaks)
3. `E3` UX silent failure (error not surfaced)
4. `E4` Test gap (path exists but no automated coverage)

### Severity
1. `S1` blocks prod readiness
2. `S2` high operational/debugging risk
3. `S3` quality debt

### Required Report Format
1. Findings first, ordered by severity.
2. Each finding includes file reference and exact failing behavior.
3. No fixes in the sweep report.
4. End with a prioritized fix plan grouped by owner (`A`, `B`, `C`).
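
The ordering and grouping rules above can be expressed as a small data model. The Python sketch below is one possible shape, not a mandated tool: the `Finding` class and both function names are hypothetical, and the category, severity, and owner values are the ones defined in this standard.

```python
from collections import defaultdict
from dataclasses import dataclass

SEVERITY_ORDER = {"S1": 0, "S2": 1, "S3": 2}


@dataclass(frozen=True)
class Finding:
    category: str  # E1..E4
    severity: str  # S1..S3
    owner: str     # A (backend), B (UI), or C (ops)
    file_ref: str  # file reference for the finding
    behavior: str  # exact failing behavior


def ordered_findings(findings):
    """Sort findings by severity; sorted() is stable, so ties keep input order."""
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])


def fix_plan_by_owner(findings):
    """Group severity-ordered findings by owner for the closing fix plan."""
    plan = defaultdict(list)
    for finding in ordered_findings(findings):
        plan[finding.owner].append(finding)
    return dict(plan)
```

An unknown severity raises `KeyError`, which surfaces a malformed finding instead of silently misplacing it in the report.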

## Ownership
1. Architecture defines invariants.
2. Backend/UI/Ops owners enforce via tests and implementation.
3. A queue task cannot be marked `done` if it violates this standard for newly introduced paths.
