# Scalability and Security Watchlist

Purpose:
- Capture hardening items that do not block launch but must be scheduled against a specific launch phase.
- Prevent known risks from getting lost during implementation velocity.

Status semantics:
- `open`: not started.
- `planned`: design agreed, implementation pending.
- `in_progress`: active implementation.
- `done`: implemented and validated.

## Current Watchlist

1. Notification delivery durability beyond Redis Pub/Sub
- Status: `open`
- Why: Redis Pub/Sub is ephemeral and does not provide per-user persistence/read-state guarantees.
- Current baseline: NATS -> notification-relay -> Redis Pub/Sub -> WS fanout.
- Future target:
  - Add persistent notification store (Postgres/object log) with retention policy.
  - Add read/dismiss tracking and replay for reconnecting clients.
  - Keep Pub/Sub as low-latency fanout path.
- Owners: Platform + Notification service owner.

2. Data growth guardrails (usage/ledger/audit)
- Status: `in_progress`
- Why: High-growth tables can degrade query performance and operational recovery.
- Current baseline: partition strategy documented at architecture level.
- Future target:
  - Define row-count/size triggers to activate partitioning.
  - Add runbook automation for partition creation/archival/retention.
  - Add dashboard alerts on growth and vacuum/maintenance lag.
- Progress:
  - baseline guard script added: `scripts/ops/data_growth_check.sh` with row/size thresholds for `usage_records`, `ledger_entries`, and `platform_audit_logs`.
  - make target added: `make ops-data-growth-check`.
- Owners: Platform + Infra/SRE.

3. Security key management and rotation runbooks
- Status: `in_progress`
- Why: JWT/terminal/KMS key rotation needs deterministic operational procedures.
- Current baseline: architecture direction documented.
- Future target:
  - Rotation cadence and break-glass procedure.
  - Key compromise response path with timeline targets.
  - Validation checklist for JWKS and terminal token signer rotation.
- Progress:
  - added unified runbook: `doc/operations/runbooks/Key_Rotation_and_Compromise_Response_Runbook.md`.
  - break-glass JWKS path already wired via `POST /internal/auth/jwks/refresh`.
- Owners: Security + Platform.

4. WS token replay/concurrency hardening tests
- Status: `in_progress`
- Why: single-use token semantics must hold under race/concurrency conditions.
- Current baseline: contract and architecture specify short-lived single-use tokens.
- Future target:
  - Add concurrency tests for duplicate WS connect attempts.
  - Add metrics/alerts for token replay rejection rates.
- Progress:
  - terminal service now has concurrent consume race test proving only one `GETDEL` consume succeeds and all competing consumes fail with `ErrTokenInvalid`.
  - terminal service now exposes snapshot counters for `consumed_ok` and `replay_rejected` to support replay anomaly alerting.
- Owners: Backend + QA.
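
The race property described in the progress notes can be sketched as follows. This is a minimal illustration, not the terminal service's actual test: `memStore`, `Consume`, `RaceConsume`, and the `ws_token:` key prefix are hypothetical, and the in-memory `GetDel` only stands in for the atomicity that Redis `GETDEL` provides.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// ErrTokenInvalid mirrors the error name used in the notes above.
// The real definition lives in the terminal service; this one is a stand-in.
var ErrTokenInvalid = errors.New("token invalid")

// memStore is a hypothetical in-memory stand-in for Redis.
type memStore struct {
	mu   sync.Mutex
	data map[string]string
}

// GetDel returns and deletes a key in one atomic step, like Redis GETDEL.
func (s *memStore) GetDel(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	if ok {
		delete(s.data, key)
	}
	return v, ok
}

// Consume succeeds at most once per token; every later attempt
// fails with ErrTokenInvalid.
func Consume(s *memStore, token string) (string, error) {
	v, ok := s.GetDel("ws_token:" + token) // key prefix is an assumption
	if !ok {
		return "", ErrTokenInvalid
	}
	return v, nil
}

// RaceConsume releases n goroutines at the same moment and returns how many
// consumes succeeded and how many were rejected as replays.
func RaceConsume(s *memStore, token string, n int) (int64, int64) {
	var consumedOK, replayRejected atomic.Int64
	var wg sync.WaitGroup
	start := make(chan struct{})
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-start // hold every goroutine until the gate opens
			if _, err := Consume(s, token); errors.Is(err, ErrTokenInvalid) {
				replayRejected.Add(1)
			} else if err == nil {
				consumedOK.Add(1)
			}
		}()
	}
	close(start)
	wg.Wait()
	return consumedOK.Load(), replayRejected.Load()
}
```

With one token and `n` competing consumers, the expected result is exactly one success and `n-1` rejections. The two counters play the same role as the `consumed_ok` and `replay_rejected` snapshot counters mentioned above.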

5. Abuse controls beyond RPM
- Status: `planned`
- Why: single RPM limit is insufficient for mixed endpoint classes and abuse patterns.
- Current baseline: policy-driven per-route-group limits implemented.
- Future target:
  - Add burst controls and per-IP heuristics.
  - Add anomaly thresholds for auth/payment/terminal endpoints.
  - Add SOC-facing signals for automated blocking decisions.
- Owners: Security + Backend.

## Scheduling Guidance
- Before public beta:
  - Item 3 (key management/rotation)
  - Item 4 (WS replay/concurrency tests)
- First scale milestone (>= 10M usage rows or equivalent load):
  - Item 2 (data growth guardrails)
- Post-beta reliability enhancement:
  - Item 1 (persistent notifications)
  - Item 5 (advanced abuse controls)

## Accepted MVP Tradeoffs (Revisit Triggers)

These are intentional MVP decisions. They are acceptable now, but must be re-evaluated
at the listed trigger to avoid future scaling/extensibility/security constraints.

1. Single control-plane API binary (`cmd/api`)
- Trigger: first domain extraction candidate or sustained saturation of one domain path.
- Revisit: split high-load domains into independent deployables behind stable contracts.

2. Service mesh deferred (Envoy/Istio)
- Trigger: multiple independently deployed internal services with complex east/west policy needs.
- Revisit: adopt mesh when platform-native controls are no longer sufficient.

3. Notification delivery is best-effort (Redis Pub/Sub fanout)
- Trigger: product requires reliable inbox/replay/read-state or support tickets show missed alerts.
- Revisit: add persistent notification store and replay semantics.

4. Policy evaluation is DB-direct at MVP
- Trigger: duplicated policy-resolution logic/caching drift across services.
- Revisit: extract dedicated Policy Service behind existing `PolicyClient`, and adopt OPA/OPAL in the same step for distributed policy propagation.

5. API key auth deferred
- Trigger: CLI/automation demand exceeds browser-only workflows.
- Revisit: add API key issuance/rotation/revocation with resolver-chain integration.

6. MVP scope constraints (single-region runtime, scheduler backends deferred)
- Trigger: enterprise onboarding requiring multi-region/scheduler integration.
- Revisit: activate additive Phase-2 components without public contract breaks.

7. Dedicated terminal WS runtime in `cmd/terminal-gateway` (Option C)
- Trigger: pre-production hardening gate before public launch.
- Revisit: continue gateway hardening (strict ingress/egress policy, saturation controls,
  and incident drill evidence) and remove any legacy assumptions about API-hosted WS paths.

References:
- `doc/governance/Assumptions_Register.md`
- `doc/operations/Parallel_Ops_Track.md`

## Pre-Beta Hardening Additions (Captured)

1. Policy cache invalidation across pods
- Status: `in_progress`
- Why: the 60s local policy cache can leave pods enforcing stale policy for up to a minute after a policy update.
- Target: publish/subscribe invalidation (`policy.invalidate.<key>`) and immediate local eviction.
- Progress:
  - policy cache invalidation methods added: `PostgresClient.Invalidate(key)` and `InvalidateAll()`.
  - API process now runs Redis pub/sub subscriber on `policy.invalidate.*` and invalidates local cache immediately.
  - invalidation message parsing tests added in `cmd/api/policy_invalidation_test.go`.
- Remaining:
  - wire publisher on admin policy update path when policy management APIs are implemented.
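
A rough sketch of the subscriber-side handling follows, assuming the policy key is carried in the channel name after the `policy.invalidate.` prefix. `localCache`, `ParseInvalidationChannel`, and `HandleInvalidation` are hypothetical names, and treating a `*` key as "invalidate everything" is an assumption made here for illustration; the real parsing rules are the ones covered in `cmd/api/policy_invalidation_test.go`.

```go
package main

import (
	"strings"
	"sync"
)

const invalidatePrefix = "policy.invalidate."

// ParseInvalidationChannel extracts the policy key from a channel name such
// as "policy.invalidate.rate_limit". It reports false for other channels
// and for an empty key.
func ParseInvalidationChannel(channel string) (string, bool) {
	if !strings.HasPrefix(channel, invalidatePrefix) {
		return "", false
	}
	key := strings.TrimPrefix(channel, invalidatePrefix)
	if key == "" {
		return "", false
	}
	return key, true
}

// localCache is a hypothetical stand-in for the per-pod policy cache.
type localCache struct {
	mu      sync.RWMutex
	entries map[string]string
}

// Invalidate evicts a single key.
func (c *localCache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}

// InvalidateAll drops every cached entry.
func (c *localCache) InvalidateAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = make(map[string]string)
}

// Len returns the number of cached entries.
func (c *localCache) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.entries)
}

// HandleInvalidation applies one pub/sub message to the local cache and
// reports whether the message was recognized.
func HandleInvalidation(c *localCache, channel string) bool {
	key, ok := ParseInvalidationChannel(channel)
	if !ok {
		return false
	}
	if key == "*" { // assumed convention for "evict everything"
		c.InvalidateAll()
		return true
	}
	c.Invalidate(key)
	return true
}
```

Evicting immediately on message receipt means the 60s TTL only bounds staleness when a message is lost, which matters because Redis Pub/Sub delivery is best-effort.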

2. Encryption envelope specification
- Status: `in_progress`
- Why: `_enc` fields need one canonical format and rotation strategy to avoid ad-hoc implementations.
- Progress:
  - `doc/architecture/Encryption_Envelope_Spec.md` added with canonical envelope shape and rotation rules.
  - `packages/shared/crypto/envelope.go` + tests added (AES-256-GCM envelope helper).
  - provisioning worker now uses shared envelope helper when writing `allocations.ssh_private_key_enc`.
- Remaining:
  - lock provider-specific KMS authn/authz constraints and secure command execution policy for any remaining app-layer key fetch surfaces.
  - wire helper into any future storage/scheduler credential material paths that use `_enc` fields.
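
For orientation, an AES-256-GCM envelope with key-ID-based rotation might look roughly like the sketch below. It is not the format defined in `Encryption_Envelope_Spec.md` and not the API of `packages/shared/crypto/envelope.go`: the `Envelope` fields, `Seal`, `Open`, and the use of the key ID as additional authenticated data are all assumptions made for illustration.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// Envelope is a hypothetical shape: the key ID records which key sealed the
// value, so older values stay readable after the active key is rotated.
type Envelope struct {
	KeyID      string
	Nonce      []byte
	Ciphertext []byte
}

func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts plaintext under key with a fresh random nonce. The key ID is
// bound as additional authenticated data so it cannot be swapped undetected.
func Seal(keyID string, key, plaintext []byte) (Envelope, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return Envelope{}, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return Envelope{}, err
	}
	ct := gcm.Seal(nil, nonce, plaintext, []byte(keyID))
	return Envelope{KeyID: keyID, Nonce: nonce, Ciphertext: ct}, nil
}

// Open looks up the key named by the envelope and decrypts. Any tampering
// with the ciphertext, nonce, or key ID causes authentication to fail.
func Open(env Envelope, keys map[string][]byte) ([]byte, error) {
	key, ok := keys[env.KeyID]
	if !ok {
		return nil, errors.New("unknown key id")
	}
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(env.Nonce) != gcm.NonceSize() {
		return nil, errors.New("invalid nonce length")
	}
	return gcm.Open(nil, env.Nonce, env.Ciphertext, []byte(env.KeyID))
}
```

Under this shape, rotation amounts to sealing new values with a new key ID while keeping retired keys available to `Open` until all old `_enc` values have been re-encrypted.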

3. Rate-limit fail-open observability
- Status: `done`
- Why: the rate limiter fails open, so without metrics a Redis outage silently disables app-layer limits.
- Target: metrics + alerts for fail-open events; document WAF compensating control.
- Progress:
  - rate limiter now tracks fail-open occurrences via `RateLimiter.Snapshot().FailOpenCount`.
  - unit test added for Redis-unavailable fail-open path with counter increment.
  - API now exports fail-open metrics via `GET /metrics` (`api_ratelimit_fail_open_total`) and secured JSON stats via `GET /api/v1/internal/stats`.
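
The fail-open counting pattern can be sketched as below. Only `RateLimiter`, `Snapshot()`, and `FailOpenCount` come from the notes above; the `counterStore` interface, the two store implementations, and the simple count-versus-limit check are hypothetical simplifications, and the export to `api_ratelimit_fail_open_total` is not shown.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// counterStore is a hypothetical abstraction over a Redis-style INCR.
type counterStore interface {
	Incr(key string) (int64, error)
}

// unavailableStore simulates a Redis outage.
type unavailableStore struct{}

func (unavailableStore) Incr(string) (int64, error) {
	return 0, errors.New("redis unavailable")
}

// memoryStore is a healthy store for illustration; it is not safe for
// concurrent use.
type memoryStore struct{ counts map[string]int64 }

func (m *memoryStore) Incr(key string) (int64, error) {
	m.counts[key]++
	return m.counts[key], nil
}

// Snapshot is a point-in-time copy of the limiter's counters.
type Snapshot struct {
	FailOpenCount int64
}

// RateLimiter allows up to limit requests per key. Window expiry is
// omitted to keep the sketch short.
type RateLimiter struct {
	store    counterStore
	limit    int64
	failOpen atomic.Int64
}

// Allow fails open: when the store errors, the request is allowed and the
// event is counted so the outage is visible to metrics and alerts.
func (r *RateLimiter) Allow(key string) bool {
	n, err := r.store.Incr(key)
	if err != nil {
		r.failOpen.Add(1)
		return true
	}
	return n <= r.limit
}

// Snapshot returns the current counter values.
func (r *RateLimiter) Snapshot() Snapshot {
	return Snapshot{FailOpenCount: r.failOpen.Load()}
}
```

Alerting on the rate of change of this counter, rather than its absolute value, distinguishes an ongoing outage from a past one.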

4. JWKS compromise break-glass
- Status: `done`
- Why: key-compromise response path is time-sensitive.
- Target: runbook with forced JWKS refresh and emergency key-rotation procedure.
- Progress:
  - emergency runbook added: `doc/operations/runbooks/JWKS_Compromise_Breakglass_Runbook.md`.
  - auth resolver now exposes `JWKSAuth.ForceRefresh(ctx)` hook for incident tooling paths.
  - API now exposes authenticated internal trigger `POST /internal/auth/jwks/refresh` (enabled by `INTERNAL_JWKS_REFRESH_TOKEN`) to invoke force refresh on demand.

5. Node probe SSRF guardrails
- Status: `in_progress`
- Why: admin probe can otherwise target sensitive internal addresses.
- Target: allowlist GPU node CIDRs and block metadata/internal reserved ranges.
- Progress:
  - inventory service now validates probe targets before dial (`packages/services/inventory/service.go`).
  - blocked by default: loopback, unspecified, multicast, link-local, and metadata endpoint `169.254.169.254`.
  - optional CIDR allowlist enforced via `NODE_PROBE_ALLOWED_CIDRS`.
  - API handlers map denied targets to `400` for admin node create/probe flows.
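
The validation order described above (default denials first, then the optional allowlist) could be sketched with the standard `net/netip` package. This is not the code in `packages/services/inventory/service.go`: `ValidateProbeTarget`, `ParseAllowedCIDRs`, and `ErrProbeTargetDenied` are hypothetical names, the comma-separated format for `NODE_PROBE_ALLOWED_CIDRS` is an assumption, and this sketch rejects hostnames outright instead of resolving them.

```go
package main

import (
	"errors"
	"net/netip"
	"strings"
)

// ErrProbeTargetDenied is a hypothetical sentinel that a handler could map
// to a 400 response.
var ErrProbeTargetDenied = errors.New("probe target denied")

var metadataAddr = netip.MustParseAddr("169.254.169.254")

// ParseAllowedCIDRs parses a comma-separated CIDR list (assumed format).
// An empty string yields an empty allowlist.
func ParseAllowedCIDRs(csv string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(csv, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, err
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// ValidateProbeTarget denies sensitive address classes unconditionally and
// then, if an allowlist is configured, requires membership in it.
func ValidateProbeTarget(host string, allowed []netip.Prefix) error {
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return ErrProbeTargetDenied // literal IPs only in this sketch
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	if addr.IsLoopback() || addr.IsUnspecified() || addr.IsMulticast() ||
		addr.IsLinkLocalUnicast() || addr.IsLinkLocalMulticast() ||
		addr == metadataAddr {
		return ErrProbeTargetDenied
	}
	if len(allowed) == 0 {
		return nil // no allowlist configured: default denials only
	}
	for _, p := range allowed {
		if p.Contains(addr) {
			return nil
		}
	}
	return ErrProbeTargetDenied
}
```

Running the deny checks before the allowlist means a misconfigured CIDR such as `169.254.0.0/16` cannot re-enable access to the metadata endpoint. If hostnames are accepted in the real service, the check has to run on the resolved address actually dialed, otherwise DNS rebinding can bypass it.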

6. Idempotency response-body sanitization
- Status: `in_progress`
- Why: cached replay bodies may carry PII.
- Target: sanitize before persistence or store bounded allowlisted subset.
- Progress:
  - idempotency middleware now sanitizes JSON response bodies before persisting `platform_api_idempotency_keys.response_body`.
  - invalid/non-JSON response bodies are skipped (fail-safe, no raw payload persistence).
  - tests added for sensitive-field redaction and invalid payload behavior.
  - counters added via `IdempotencySnapshot()` for persisted JSON bodies, skipped-empty, skipped-non-JSON, and replay-served totals.
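
A minimal sketch of the sanitize-or-skip behavior follows. `SanitizeJSON`, the `sensitiveKeys` list, and the `[REDACTED]` placeholder are hypothetical; the middleware's real field list and redaction marker are not stated in this document.

```go
package main

import (
	"encoding/json"
	"strings"
)

// sensitiveKeys is an illustrative denylist, matched case-insensitively.
var sensitiveKeys = map[string]bool{
	"password": true,
	"token":    true,
	"email":    true,
}

// SanitizeJSON returns a redacted copy of body that is safe to persist.
// It reports false for empty or non-JSON bodies so the caller can skip
// persistence instead of storing a raw payload.
func SanitizeJSON(body []byte) ([]byte, bool) {
	var v any
	if err := json.Unmarshal(body, &v); err != nil {
		return nil, false
	}
	out, err := json.Marshal(redact(v))
	if err != nil {
		return nil, false
	}
	return out, true
}

// redact walks objects and arrays recursively and replaces the value of
// any sensitive key, whatever its type, with a fixed placeholder.
func redact(v any) any {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			if sensitiveKeys[strings.ToLower(k)] {
				t[k] = "[REDACTED]"
			} else {
				t[k] = redact(child)
			}
		}
		return t
	case []any:
		for i := range t {
			t[i] = redact(t[i])
		}
		return t
	default:
		return v
	}
}
```

A denylist like this only catches fields someone thought to list; the "bounded allowlisted subset" option in the target is the stricter variant, because unknown fields are dropped by default.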

7. Notification channel namespace extensibility
- Status: `done`
- Why: user-only channels limit future org/system broadcast patterns.
- Target: define channel constructors for user/org/broadcast namespaces.
- Progress:
  - channel constructors implemented in `packages/services/notification/channels.go`:
    - `UserChannel(userID)`
    - `OrgChannel(orgID)`
    - `BroadcastChannel()`
  - constructor behavior covered in `packages/services/notification/transform_test.go`.

8. Scheduler metadata encryption rule
- Status: `in_progress`
- Why: future scheduler credentials could leak if stored plaintext.
- Target: mandate envelope encryption for credential material in `scheduler_metadata`.
- Progress:
  - allocation create path now envelope-encrypts `scheduler_request` into `allocations.scheduler_metadata.scheduler_request_enc`.
  - ERD and schema notes now explicitly require envelope-encryption for credential-bearing scheduler metadata.
- Remaining:
  - enforce equivalent envelope handling on all future scheduler-adapter write paths (slurm/k8s/ray workers).

9. Temporal execution-path parity
- Status: `open`
- Why: differing local/prod scheduler paths increase drift risk.
- Target: run billing schedule through Temporal locally and in production.

10. Outbox payload data minimization
- Status: `in_progress`
- Why: outbox may contain sensitive payload fields.
- Target: prefer IDs over rich payloads and enforce encryption-at-rest controls.
- Progress:
  - added CI guard `scripts/ci/outbox_payload_guard.sh` (wired through `contracts_validate.sh`) to block secret/token-like fields in event payload contracts.
- Remaining:
  - continue tightening payload schemas toward ID-first patterns where full host/user context is not required.

11. Storage path-safety algorithm lock
- Status: `done`
- Why: traversal prevention must be deterministic and testable before coding.
- Progress:
  - `packages/shared/storagepath/path.go` codifies namespace-rooted `filepath.Clean` + prefix-check enforcement.
  - `packages/shared/storagepath/path_test.go` covers success, normalization, absolute-path reject, and traversal reject.
- Remaining:
  - none for MVP baseline; keep enforcing this helper in future storage refactors.
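
The namespace-rooted `filepath.Clean` plus prefix-check approach can be sketched as follows. `Resolve` and `ErrUnsafePath` are hypothetical names, not the actual API of `packages/shared/storagepath/path.go`, and the sketch assumes the root is a namespace directory rather than the filesystem root.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// ErrUnsafePath is a hypothetical sentinel for rejected paths.
var ErrUnsafePath = errors.New("unsafe storage path")

// Resolve maps a user-supplied relative path onto a namespace root and
// rejects anything that would land outside that root.
func Resolve(root, userPath string) (string, error) {
	if filepath.IsAbs(userPath) {
		return "", ErrUnsafePath
	}
	cleanRoot := filepath.Clean(root)
	// Join cleans the result, so "a/../b" and "../x" are normalized here.
	joined := filepath.Join(cleanRoot, userPath)
	// The separator in the prefix prevents "/data/ns1evil" from matching
	// the root "/data/ns1".
	if joined != cleanRoot &&
		!strings.HasPrefix(joined, cleanRoot+string(filepath.Separator)) {
		return "", ErrUnsafePath
	}
	return joined, nil
}
```

This check is purely lexical: it does not follow symlinks, so a symlink inside the namespace that points outside it would need separate handling (for example `filepath.EvalSymlinks` before the prefix check).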

12. Browser token storage hardening (sessionStorage -> httpOnly cookie session)
- Status: `open`
- Why: browser-accessible token storage raises XSS blast radius and weakens central session controls.
- Target: migrate web auth to a server-managed session cookie with `HttpOnly`, `Secure`, and `SameSite` set (or equivalent BFF token handling) before production launch.
- Progress:
  - current implementation keeps the access token in browser `sessionStorage` for MVP velocity.
- Remaining:
  - define migration plan and acceptance tests for cookie-based auth flow and logout revocation behavior.

13. Persistent user SSH private-key storage removal
- Status: `open`
- Why: storing user-access private keys server-side increases blast radius and key compromise impact.
- Current baseline: public `/api/v1/allocations/{id}/ssh-key` endpoint removed from contract; runtime/path cleanup remains in progress under `A-P7-005`.
- Target:
  - one-time key delivery model and/or user-managed public-key model.
  - control plane stores public keys, fingerprints, and metadata only for steady-state.
  - terminal/provisioning runtime paths do not depend on persistent user private-key retrieval.
- Execution mode: pre-launch cutover (no backward-compatibility migration window required).
- Owners: Security + Provisioning + Terminal.

14. Queue acceptance-check execution evidence enforcement
- Status: `open`
- Why: queue currently validates acceptance-check syntax and done-commit lineage, but does not execute each task's acceptance checks as part of done-state enforcement.
- Target:
  - add CI gate that executes task `acceptance_checks` for tasks moved to `done` and records pass/fail evidence.
  - require evidence link or artifact reference in queue metadata before final done-state acceptance.
- Trigger: activate before introducing reviewer agent or before enabling multi-lane (V2) parallel execution.
- Owners: Governance + CI maintainers.
