# Parallel Ops Track (Execute Alongside Feature Coding)

Purpose:
- Track operational work that must run in parallel with backend/frontend implementation.
- Convert operations readiness from "later" into gated deliverables with evidence.

Fairway tracking:
- Active operational production-readiness epic: `OPS-PROD-READINESS-BACKLOG`.
- This document remains the ops-readable workstream summary. Fairway owns task
  state, dependencies, evidence, and completion gates.
- `doc/architecture/Production_Deployment_Readiness_v1.md` is the gap/risk
  source; this track converts those findings into operational controls,
  runbooks, evidence, drills, and environment/profile gates.

Status values:
- `open`
- `in_progress`
- `done`

## 1) SLO and Alert Pack
- Status: `in_progress`
- Owner: `SRE Lead`
- Target: `Before public beta`
- Scope:
  - API availability/latency/error budget alerts.
  - Queue lag and worker failure alerts.
  - Billing drift/processing anomaly alerts.
  - Webhook ingestion + reconciliation failure alerts.
- Evidence:
  - Evidence baseline check command: `make ops-parallel-evidence-check`
  - Execution artifact: `doc/operations/evidence/slo_alert_pack.md`
  - Baseline definitions: `doc/operations/Observability_Baseline.md`
  - Incident model linkage: `doc/operations/Incident_Severity_Model.md`
  - Alert rules baseline: `doc/operations/evidence/alert_rule_manifest_baseline.yaml`
  - Local observability smoke script: `scripts/ops/observability_smoke.sh`
  - Local observability smoke report: `doc/operations/evidence/observability_local_smoke_report.md`
  - Simulation playbook/report: `doc/operations/evidence/alert_simulation_playbook.md`, `doc/operations/evidence/alert_simulation_report.md`
  - Pending: staging load/trigger execution and captured outputs.

## 2) Runbooks and On-Call Readiness
- Status: `in_progress`
- Owner: `SRE Lead`
- Target: `Before public beta`
- Scope:
  - Complete runbooks for API, queue, billing worker, provisioning stuck/failures, webhook outage.
  - On-call ownership matrix and escalation path.
  - Incident drill calendar (tabletop + restore drill).
- Evidence:
  - Evidence baseline check command: `make ops-parallel-evidence-check`
  - Execution artifact: `doc/operations/evidence/runbooks_oncall_readiness.md`
  - Runbook index: `doc/operations/SRE_Runbook_Index.md`
  - Existing runbooks: `doc/operations/runbooks/*`
  - On-call roster + escalation baseline: `doc/operations/evidence/oncall_roster_and_escalation.md`
  - Drill calendar/report baseline: `doc/operations/evidence/incident_drill_reports.md`
  - Pending: executed drills with timestamps and outcomes.

## 3) Backup, Restore, and DR Validation
- Status: `in_progress`
- Owner: `Platform + DBA`
- Target: `Before staging go-live`
- Scope:
  - Automated backup schedule and retention policy.
  - Staging restore rehearsal with time-to-recovery measurement.
  - RPO/RTO verification and risk note for gaps.
- Evidence:
  - Evidence baseline check command: `make ops-parallel-evidence-check`
  - Execution artifact: `doc/operations/evidence/backup_restore_dr.md`
  - Local smoke drill script: `scripts/ops/backup_restore_smoke.sh`
  - Local rehearsal report: `doc/operations/evidence/backup_restore_rehearsal_report.md`
  - Pending: backup job config + restore rehearsal report.

## 4) Secrets and Key Operations
- Status: `in_progress`
- Owner: `Security + Platform`
- Target: `Before public beta`
- Scope:
  - KMS/secret-manager integration for runtime secrets.
  - Rotation cadence for app secrets, JWT signing/JWKS dependencies, terminal token signer.
  - Break-glass access process and audit trail.
- Evidence:
  - Evidence baseline check command: `make ops-parallel-evidence-check`
  - Execution artifact: `doc/operations/evidence/secrets_key_ops.md`
  - Platform baseline requirements: `doc/operations/Production_Platform_Baseline.md`
  - KMS operational guardrails: `doc/operations/KMS_Control_Key_Source_Guardrails.md`
  - Node probe SSRF guardrails implementation: `packages/services/inventory/service.go`
  - Pending: formal rotation SOP + execution logs.

## 5) East/West Security and Certificate Lifecycle
- Status: `in_progress`
- Owner: `Security + Platform`
- Target: `Before staging->prod promotion`
- Scope:
  - Default-deny network policies and explicit allow-list flows.
  - Internal mTLS (or equivalent) across service/workload paths.
  - Cert lifecycle controls: issuance, rotation, revocation, expiry alerting.
- Evidence:
  - Evidence baseline check command: `make ops-parallel-evidence-check`
  - Execution artifact: `doc/operations/evidence/east_west_security_certs.md`
  - PKI/step-ca staging readiness artifact: `doc/operations/evidence/pki_stepca_staging_readiness.md`
  - Baseline requirements + workstream: `doc/operations/Production_Platform_Baseline.md`
  - Baseline network policy manifest: `doc/operations/evidence/network_policy_baseline.yaml`
  - Baseline cert check script: `scripts/ops/cert_expiry_check.sh`
  - Pending: cluster-applied validation output and cert health dashboard evidence.

## 6) Capacity and Load Baseline
- Status: `in_progress`
- Owner: `Performance/Infra`
- Target: `Before public beta`
- Scope:
  - Staging load profile for API endpoints, websocket concurrency, queue throughput.
  - Baseline limits for DB connections/pools and worker concurrency.
  - Saturation indicators and scale playbook.
- Evidence:
  - Data growth guard baseline: `doc/operations/evidence/data_growth_guardrails.md`
  - Pending: load-test scripts/results + baseline report.

## 7) Security Verification Pipeline
- Status: `in_progress`
- Owner: `AppSec + DevEx`
- Target: `Before feature-complete freeze`
- Scope:
  - SAST/DAST/dependency/container scanning wired into CI gates.
  - Severity-based policy (fail on critical/high as defined by governance).
  - Exception workflow with expiry.
- Evidence:
  - Baseline policy references: `doc/governance/CI_Enforcement_Checklist.md`
  - Pending: SAST/DAST job outputs in hosted CI.

## 8) Environment Parity and Config Governance
- Status: `in_progress`
- Owner: `Platform`
- Target: `Before staging go-live`
- Scope:
  - Controlled environment diffs (dev/staging/prod) for non-secret config.
  - Feature flag policy and promotion controls.
  - Explicit change approval for risk-sensitive config changes.
- Evidence:
  - Promotion policy baseline: `doc/operations/Environment_Promotion_Policy.md`
  - Pending: config diff reports + approval records.

## 9) Audit and Compliance Operations
- Status: `in_progress`
- Owner: `Security + Billing Ops`
- Target: `Before public beta`
- Scope:
  - Audit log access controls and export path validation.
  - Retention and retrieval procedure for security/finance investigations.
  - Correlation-id based incident trace workflow.
- Evidence:
  - Audit controls baseline: `doc/governance/Security_Control_Verification.md`
  - Pending: sample investigation report + export validation evidence.

## 10) Cost Observability and Budget Guardrails
- Status: `in_progress`
- Owner: `FinOps + Platform`
- Target: `Before public beta`
- Scope:
  - Cost dashboards for compute/storage/egress and alert thresholds.
  - Budget policy and anomaly escalation path.
  - Cost-per-feature visibility where practical.
- Evidence:
  - Pending: dashboard links + alert policy references.

## 11) Terminal gateway ingress/network policy rollout (Option C)
- Status: `in_progress`
- Owner: `Platform + Security`
- Target: `Before terminal Option C production cutover`
- Scope:
  - Route websocket terminal traffic to dedicated `cmd/terminal-gateway` runtime.
  - Apply default-deny + explicit allow-list network policy for terminal gateway pods.
  - Validate rollback path to last known-good terminal-gateway revision without contract changes.
- Evidence:
  - Evidence baseline check command: `make ops-parallel-evidence-check`
  - Execution artifact: `doc/operations/evidence/terminal_gateway_rollout_plan.md`
  - Network policy baseline: `doc/operations/evidence/network_policy_baseline.yaml`
  - Pending: staging cutover evidence and rollback drill output.

## Launch Gating
- Public launch requires items 1-5 to be `done`.
- Items 6-10 must be at least `in_progress` with owners, timeline, and evidence links.
