# Implementation Roadmap

Ordered coding breakdown for GPUaaS v1. Each phase lists its hard prerequisites,
the exact files to create, the API endpoints to implement, and the tests to write.

## Document Role
- Purpose: source-of-truth plan for what to build and in what order.
- Scope: phase definitions, prerequisites, deliverables, and done criteria.
- Does not track daily progress; use `doc/Execution_Progress.md` for commit-level status and `doc/Phase_Readiness_Tracker.md` for readiness gating.
- Historical note: early phase sections preserve the original execution plan language for sequencing context. Current implementation truth for completed phases lives in the repo, `doc/Execution_Progress.md`, and queue state in `doc/governance/Agent_Work_Queue.yaml`.

**How to use this document**
- Work phases in order. Phases within a group that share no dependencies may be
  parallelised across agents (noted where applicable).
- Before starting any task: read `AGENTS.md` for architecture rules, then read
  the files listed under "Read first" for that phase.
- Follow `doc/governance/Coding_Standards.md §Go Implementation Patterns`
  for every handler, service function, and test file.
- After completing a phase: tick all boxes and verify `make test` passes before
  moving on.

**CI portability rule (host-agnostic execution)**
- Keep gate logic in `scripts/ci/*.sh` and call these scripts from pipeline YAML.
- `.gitlab-ci.yml` and GitHub Actions workflows should stay as orchestration wrappers only.
- If host changes (GitLab -> GitHub or reverse), reuse `scripts/ci` unchanged and only adapt runner/secret wiring.

---

## Pre-Phase UX — UX completion before feature coding

**Prerequisite**: Docs baseline complete (PRD + OpenAPI + AsyncAPI + Architecture).
**Blocking rule**: Feature coding does not start until UX completion checklist is signed off.

Read first:
- `doc/product/UX_Journeys.md`
- `doc/product/UX_Implementation_Spec.md`
- `doc/api/openapi.draft.yaml`
- `doc/api/asyncapi.draft.yaml`

Deliverables:
- Screen inventory (user + admin) with route ownership and contract links.
- State matrix per screen (`loading`, `empty`, `error`, `success`, `restricted`, `rate_limited`).
- Async UX patterns for allocation lifecycle (`requested/provisioning/active/releasing/released/failed/release_failed`).
- Terminal UX flow (mint token -> connect WS -> disconnected/retry states).
- Accessibility baseline (keyboard, focus trap, aria labels, contrast checks).

Done when:
- [ ] Every user action maps to an OpenAPI endpoint or AsyncAPI channel.
- [ ] No production flow depends on prototype-only behavior.
- [ ] UX spec includes explicit handling for `401`, `403`, `404`, `409`, `429`.
- [ ] UX signoff recorded in `doc/Phase_Readiness_Tracker.md`.

---

## Pre-Phase Tooling — Contract Codegen

**Purpose**: avoid losing codegen setup while implementation starts.

Deliverables:
- Add `scripts/codegen.sh` with deterministic OpenAPI-driven generation steps.
- Wire `make codegen` to this script and verify it runs locally.
- Make the `sdk_codegen_smoke` CI step execute codegen (not just echo) once the toolchain is installed.

Done when:
- [x] `scripts/codegen.sh` exists and is executable.
- [x] `make codegen` updates generated artifacts without manual edits.
- [x] `AGENTS.md` repo layout note is updated to include `scripts/codegen.sh`.

---

## Pre-Phase Platform — Production Baseline (DevOps Parallel)

**Purpose**: allow DevOps/security platform work to run in parallel with app feature coding.

Read first:
- `doc/operations/Production_Platform_Baseline.md`
- `doc/operations/Parallel_Ops_Track.md`
- `doc/operations/Environment_Promotion_Policy.md`
- `doc/governance/Security_Control_Verification.md`

Deliverables:
- Managed edge gateway + WAF configured for API + websocket routes.
- TLS and cert rotation in place for public endpoints.
- East/west default-deny network policy + explicit allow-list flows implemented.
- Internal mTLS (or equivalent) with certificate issuance/rotation/revocation SOP implemented.
- Centralized logs/metrics/traces with alert rules for MVP SLOs.
- Secret manager/KMS wiring for runtime secrets.
- Backup/restore rehearsal completed for Postgres.

Done when:
- [ ] All “Required for Public MVP” controls in `Production_Platform_Baseline.md` are implemented in staging.
- [ ] Parallel operations items in `Parallel_Ops_Track.md` have owners and status updates.
- [ ] Evidence links are recorded in `doc/Phase_Readiness_Tracker.md`.

---

## Pre-Phase Observability — Contract and UX Gate

**Purpose**: lock observability backend and Ops UI contracts before implementation.

Read first:
- `doc/architecture/Observability_Architecture.md`
- `doc/governance/Observability_Standards.md`
- `doc/governance/UX_Contract_Gate.md`
- `doc/operations/Observability_Baseline.md`
- `doc/operations/Ops_Runbook_Architecture.md`
- `doc/product/ux-mocks/admin-ops.md`

Deliverables:
- Observability backend decision finalized (OTel Collector + Prometheus + Tempo + Loki + Grafana).
- OpenAPI additions for admin ops aggregated endpoints (required before UI coding).
- OpenAPI additions for runbook metadata endpoints (`/api/v1/admin/runbooks*`) before runbook panel UI coding.
- Runbook metadata architecture (manifest, stable IDs, alert mapping) approved.
- Ops UI mock (`/admin/ops`) reviewed and approved.
- Alert and dashboard minimum set mapped to SLOs.

Done when:
- [ ] Observability architecture/standards docs are approved.
- [ ] Ops UI route contract is added to OpenAPI.
- [ ] Runbook metadata route contracts are added to OpenAPI.
- [ ] Ops UI mock maps every panel interaction to a contract endpoint.
- [ ] Degraded panel states have deterministic runbook mappings by `runbook_id`.
- [ ] UX/contract gate checklist is satisfied for `/admin/ops` before feature implementation.

---

## Pre-Phase Node Agent — Secure Node Communication (Blocking for Phase 7)

**Purpose**: replace raw SSH provisioning with a pull-based, typed-task node agent
protected by mTLS and task-signing. All node-side operations are performed by compiled,
audited handlers — no arbitrary command execution.

**Blocking rule**: Phase 7 (Provisioning Worker) does not start until every box below
is checked.

Read first:
- `doc/architecture/PKI_Spec.md`
- `doc/architecture/Node_Agent_Spec.md`
- `doc/architecture/db_schema_v1.sql`
- `doc/api/openapi.draft.yaml §Internal`

Deliverables:

**Specification**
- `doc/architecture/PKI_Spec.md` written and reviewed (CA hierarchy, enrollment flow,
  renewal, revocation, task signing, Vault migration path). ✅
- `doc/architecture/Node_Agent_Spec.md` written and reviewed (task catalog, protocol,
  privilege model, parameter validation). ✅

**DB Schema**
- `node_tasks` table added to `doc/architecture/db_schema_v1.sql` and applied to dev DB.

**OpenAPI Contract**
- `/internal/v1/nodes/*` endpoints added to `doc/api/openapi.draft.yaml` before any
  implementation of handlers or node agent code.

**CA Infrastructure**
- step-ca deployed in Kubernetes internal namespace (`pki-ca.internal:9000`).
- CA ceremony completed (Root CA offline, Intermediate CA in KMS, fingerprint recorded).
- Root CA cert fingerprint added to `doc/Phase_Readiness_Tracker.md`.

**MAAS Integration (optional, parallel track — gates `full-reimage` isolation model)**
- `packages/services/maas/` — `MAASClient` interface + implementation (OAuth 1.0 auth,
  `DeployMachine`, `ReleaseMachine`, `GetMachineStatus`, `ListMachines`).
- `nodes` table: `maas_system_id TEXT` column added to `db_schema_v1.sql`.
- `POST /internal/v1/maas/machine-commissioned` internal webhook endpoint added to
  `openapi.draft.yaml` — MAAS calls this on commissioning complete to auto-register
  nodes and generate enrollment tokens.
- `MAAS_URL`, `MAAS_API_KEY` added to `cmd/api/config.go` (only required when
  `MAAS_ENABLED=true`; config validates conditionally).
- Policy keys `maas.enabled` and `allocation.isolation_model` seeded in `scripts/seed.sql`.
- Isolation model read from `PolicyClient` in provisioning worker — never hardcoded.

**Go Scaffold (historical bootstrap requirement)**
- `packages/shared/pki/client.go` — `CAClient` interface + `StepCAClient` implementation.
- `cmd/node-agent/` directory scaffold compiles (`go build ./cmd/node-agent`).
- `catalog/catalog.go` dispatch function with full task type registry (historical bootstrap note:
  handler stubs were acceptable at scaffold time only; current implementation work must not treat stubs as acceptable).
- `validate/params.go` — parameter validators for all task types.
- `signing/verify.go` — Ed25519 signature verification.
- Unit tests pass: catalog dispatch (known type dispatches, unknown type rejects),
  parameter validation (valid params pass, invalid params reject), replay protection.
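
The dispatch, signature, and replay checks listed above can be sketched roughly as follows. This is a minimal illustration and not the contents of `catalog/catalog.go` or `signing/verify.go`: the `Task` shape, the `signingInput` framing, and the in-memory replay cache are assumptions, and parameter validation is omitted for brevity.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// Task is a hypothetical signed task envelope; field names are illustrative.
type Task struct {
	ID        string // unique per task, used for replay protection
	Type      string // for example "allocation.provision_user"
	Payload   []byte // serialized task parameters
	Signature []byte // Ed25519 signature over signingInput(task)
}

// Handler is a compiled, typed task handler.
type Handler func(payload []byte) error

var (
	ErrBadSignature = errors.New("invalid task signature")
	ErrUnknownTask  = errors.New("unknown task type")
	ErrReplay       = errors.New("task already processed")
)

// Dispatcher verifies, de-duplicates, and routes tasks to registered handlers.
type Dispatcher struct {
	pub      ed25519.PublicKey
	handlers map[string]Handler
	seen     map[string]bool // in-memory replay cache; a real agent would bound or persist it
}

// signingInput binds ID, type, and payload together. The newline framing
// assumes IDs and types never contain newlines.
func signingInput(t Task) []byte {
	return []byte(t.ID + "\n" + t.Type + "\n" + string(t.Payload))
}

// Sign is what the control plane would do before queueing a task.
func Sign(priv ed25519.PrivateKey, t Task) Task {
	t.Signature = ed25519.Sign(priv, signingInput(t))
	return t
}

// Dispatch rejects bad signatures, unknown types, and replays, in that order.
func (d *Dispatcher) Dispatch(t Task) error {
	if !ed25519.Verify(d.pub, signingInput(t), t.Signature) {
		return ErrBadSignature
	}
	h, ok := d.handlers[t.Type]
	if !ok {
		return ErrUnknownTask
	}
	if d.seen[t.ID] {
		return ErrReplay
	}
	d.seen[t.ID] = true
	return h(t.Payload)
}

// demoSetup builds a dispatcher with a throwaway key pair and one no-op handler.
func demoSetup() (*Dispatcher, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	handlers := map[string]Handler{
		"allocation.provision_user": func([]byte) error { return nil },
	}
	return &Dispatcher{pub: pub, handlers: handlers, seen: map[string]bool{}}, priv
}

func main() {
	d, priv := demoSetup()
	task := Sign(priv, Task{ID: "t-1", Type: "allocation.provision_user", Payload: []byte(`{"user":"alice"}`)})
	fmt.Println("first delivery:", d.Dispatch(task))
	fmt.Println("second delivery:", d.Dispatch(task))
}
```

Checking the signature before the type lookup means an unsigned caller learns nothing about which task types exist.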

Done when:
- [ ] `doc/architecture/PKI_Spec.md` signed off in `doc/Phase_Readiness_Tracker.md`
- [ ] `doc/architecture/Node_Agent_Spec.md` signed off in `doc/Phase_Readiness_Tracker.md`
- [ ] `node_tasks` table in schema and applied
- [ ] `/internal/v1/nodes/*` endpoints in `openapi.draft.yaml`
- [ ] step-ca running in staging; CA ceremony complete
- [ ] `go build ./cmd/node-agent` passes
- [ ] `go build ./packages/shared/pki/...` passes
- [ ] `make test` passes (node agent unit tests)

---

## Pre-Phase Security — Encryption Envelope Baseline (Blocking)

**Purpose**: prevent ad-hoc encryption implementations in provisioning, storage, and scheduler metadata paths.

Read first:
- `doc/operations/Scalability_Security_Watchlist.md` (SEC-3, E-3)
- `doc/architecture/db_schema_v1.sql` (`*_enc` fields, `scheduler_metadata`)

Deliverables:
- `doc/architecture/Encryption_Envelope_Spec.md` defining:
  - envelope format/version fields
  - key identifiers and KMS key source conventions
  - rotation and re-encryption strategy
  - decrypt failure handling and audit expectations
- `packages/shared/crypto/` scaffold with:
  - envelope encode/decode interfaces
  - KMS adapter abstraction
  - deterministic test fixtures and redaction-safe logging behavior
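
As a rough illustration of the envelope encode/decode and KMS adapter deliverables, the sketch below wraps AES-256-GCM ciphertext in a versioned JSON envelope. Everything here is an assumption pending `Encryption_Envelope_Spec.md`: the `Envelope` field names, the `KeyProvider` interface, the choice of AES-GCM, and the use of the key ID as additional authenticated data.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
	"fmt"
)

// Envelope is a hypothetical wire format with version and key identifier fields.
type Envelope struct {
	Version    int    `json:"v"`
	KeyID      string `json:"kid"`
	Nonce      []byte `json:"nonce"`
	Ciphertext []byte `json:"ct"`
}

// KeyProvider stands in for the KMS adapter: it resolves a key ID to key bytes.
type KeyProvider interface {
	Key(keyID string) ([]byte, error)
}

// staticKeys is a fixed key set for tests and local development only.
type staticKeys map[string][]byte

func (s staticKeys) Key(id string) ([]byte, error) {
	k, ok := s[id]
	if !ok {
		return nil, errors.New("unknown key id")
	}
	return k, nil
}

func newAEAD(kp KeyProvider, keyID string) (cipher.AEAD, error) {
	key, err := kp.Key(keyID)
	if err != nil {
		return nil, err
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts plaintext under keyID and returns the serialized envelope.
func Seal(kp KeyProvider, keyID string, plaintext []byte) ([]byte, error) {
	aead, err := newAEAD(kp, keyID)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, aead.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	env := Envelope{Version: 1, KeyID: keyID, Nonce: nonce}
	// The key ID is authenticated so it cannot be swapped without detection.
	env.Ciphertext = aead.Seal(nil, nonce, plaintext, []byte(keyID))
	return json.Marshal(env)
}

// Open parses an envelope and decrypts it; any failure returns an error
// that deliberately contains no plaintext or key material.
func Open(kp KeyProvider, data []byte) ([]byte, error) {
	var env Envelope
	if err := json.Unmarshal(data, &env); err != nil {
		return nil, err
	}
	if env.Version != 1 {
		return nil, fmt.Errorf("unsupported envelope version %d", env.Version)
	}
	aead, err := newAEAD(kp, env.KeyID)
	if err != nil {
		return nil, err
	}
	if len(env.Nonce) != aead.NonceSize() {
		return nil, errors.New("bad nonce length")
	}
	return aead.Open(nil, env.Nonce, env.Ciphertext, []byte(env.KeyID))
}

func main() {
	keys := staticKeys{"k1": make([]byte, 32)} // all-zero demo key, never use in practice
	sealed, err := Seal(keys, "k1", []byte("ssh-credential"))
	if err != nil {
		panic(err)
	}
	plain, err := Open(keys, sealed)
	fmt.Println(string(plain), err)
}
```

Carrying the key ID inside the envelope is what makes rotation workable: old rows stay decryptable while new writes use the current key.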

Done when:
- [ ] Encryption envelope spec exists and is referenced by provisioning/storage implementation phases.
- [ ] Shared crypto package compiles and is usable by provisioning worker code.
- [ ] Security owner signoff recorded in `doc/Phase_Readiness_Tracker.md`.

---

## Pre-Phase Tenant Ownership — Tenant/Project Enforcement Baseline (Blocking)

**Purpose**: lock ownership semantics before further feature coding so access control,
billing scope, and policy evaluation stay coherent.
**Prerequisite gate**: update and approve `doc/governance/Testing_Standards.md` tenant/project
authz coverage expectations before implementation tasks in this phase start.

Read first:
- `doc/architecture/Tenant_Project_Ownership_Baseline.md`
- `doc/architecture/adrs/ADR-008-tenant-project-ownership-baseline.md`
- `doc/architecture/ERD.md`
- `doc/architecture/db_schema_v1.sql`
- `doc/api/openapi.draft.yaml`

Deliverables:
- Ownership semantics adopted in docs:
  - tenant(org) as ownership root,
  - project as resource scope,
  - user as actor attribution (not owner-of-record).
- Baseline schema tightened for ownership invariants (reset-baseline, no data migration):
  - `allocations.org_id` non-null
  - `allocations.project_id` non-null
- Membership baseline added now to lock authz query shape:
  - `tenant_memberships` (MVP-enforced single-tenant via `UNIQUE(user_id)`)
  - `project_memberships` (`UNIQUE(project_id, user_id)`)
- Hybrid auth context baseline documented:
  - tenant claim remains in JWT/session for MVP boundary enforcement
  - active project remains request-scoped and membership-validated
- API contract updated for explicit project context on project-owned mutations.
- Authorization rules updated to tenant/project checks for resource list/read/mutate paths.
- Billing scope plan documented for tenant-owned customer/balance model.
- Policy scope plan documented for both project cap and tenant cap concurrency limits.

Done when:
- [ ] ADR-008 is accepted and linked from architecture index.
- [ ] Ownership baseline doc is approved.
- [ ] `doc/governance/Testing_Standards.md` tenant/project authz coverage section is approved before phase implementation starts.
- [ ] ERD and `db_schema_v1.sql` reflect non-null allocation ownership fields.
- [ ] ERD and `db_schema_v1.sql` include membership baseline tables and constraints.
- [ ] OpenAPI reflects project-context requirements on project-owned mutations.

---

## Pre-Phase Service Accounts — Machine Identity Baseline (Blocking for App Integrations)

**Purpose**: define machine identity contracts and controls before app-team integration work.

Read first:
- `doc/architecture/Service_Account_Model.md`
- `doc/architecture/adrs/ADR-004-identity-authz-model.md`
- `doc/architecture/Tenant_Project_Ownership_Baseline.md`
- `doc/governance/Security_Control_Verification.md`
- `doc/api/openapi.draft.yaml`

Deliverables:
- Service-account ownership baseline defined:
  - service account belongs to one project and one tenant,
  - machine auth is project-scoped, tenant-bounded.
- Planned schema objects documented:
  - `service_accounts`
  - `service_account_credentials`
- Planned token model documented:
  - actor_type=`service_account`
  - short-lived token TTL, audience + scope claims.
- Planned API surface documented for lifecycle + token issuance.
- Security control set documented for key storage, rotation, revocation, and audit.

Done when:
- [ ] Service account baseline doc is approved.
- [ ] OpenAPI change list for service-account endpoints/tokens is defined.
- [ ] ERD/db schema change list includes service-account tables and constraints.
- [ ] Authz matrix includes `actor_type=service_account`.
- [ ] Security controls for service accounts are added to verification checklist.

---

## Pre-Phase Resource Naming — Canonical Identifier Baseline (Blocking)

**Purpose**: establish one machine-readable identifier shape across API/events/audit
before broad feature expansion.

Read first:
- `doc/architecture/Resource_Identifier_Spec.md`
- `doc/architecture/adrs/ADR-009-canonical-resource-identifier-format.md`
- `doc/architecture/Tenant_Project_Ownership_Baseline.md`

Deliverables:
- Canonical format adopted:
  - `core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}`
- Resource type registry baseline documented for MVP domains.
- Shared parser/formatter implementation plan captured (single package, no per-service drift).
- API/event/audit adoption targets listed for initial rollout.
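
A shared parser/formatter for the canonical format could look roughly like the sketch below. The type and function names are hypothetical, and the rules that no segment may be empty or contain `:` are assumptions that `Resource_Identifier_Spec.md` would need to confirm.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ResourceID holds the variable segments of the canonical identifier.
type ResourceID struct {
	Region, TenantID, ProjectID, Type, ID string
}

var ErrInvalidResourceID = errors.New("invalid resource identifier")

// String formats the identifier in canonical form.
func (r ResourceID) String() string {
	return strings.Join([]string{
		"core42", "aicloud", r.Region, r.TenantID, r.ProjectID, r.Type, r.ID,
	}, ":")
}

// Parse splits a canonical identifier into its segments.
// Assumption: segments are non-empty and never contain ':'.
func Parse(s string) (ResourceID, error) {
	p := strings.Split(s, ":")
	if len(p) != 7 || p[0] != "core42" || p[1] != "aicloud" {
		return ResourceID{}, fmt.Errorf("%w: %q", ErrInvalidResourceID, s)
	}
	for _, seg := range p[2:] {
		if seg == "" {
			return ResourceID{}, fmt.Errorf("%w: empty segment in %q", ErrInvalidResourceID, s)
		}
	}
	return ResourceID{Region: p[2], TenantID: p[3], ProjectID: p[4], Type: p[5], ID: p[6]}, nil
}

func main() {
	// The region and IDs below are made-up example values.
	id, err := Parse("core42:aicloud:region-1:t-123:p-456:allocation:a-789")
	if err != nil {
		panic(err)
	}
	fmt.Println(id.Type, id.ID)
	fmt.Println(id.String())
}
```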

Done when:
- [ ] ADR-009 is accepted and linked from architecture index.
- [ ] Resource identifier spec is approved.
- [ ] Architecture docs reference canonical identifier usage.
- [ ] Initial implementation backlog includes parser + boundary emission tasks.

---

## Pre-Phase Frontend — UX Foundation Packages

**Purpose**: establish shared UX platform primitives before feature slices, so UI work stays consistent and API-first.

Deliverables:
- `packages/web/src/lib/api/`:
  - contract client wrapper (typed calls, auth header injection, refresh flow handling, correlation-id propagation)
  - common error mapper from `ErrorResponse` to UX-safe message model
- `packages/web/src/lib/query/`:
  - cache/query conventions (keys, stale times, retry defaults)
- `packages/web/src/lib/session/`:
  - session/user/role state + protected-route helpers
- `packages/web/src/components/system/`:
  - shared async states (`LoadingState`, `EmptyState`, `ErrorState`, `RestrictedState`, `RateLimitedState`)
  - pagination/table primitives bound to cursor model
  - confirm modal + destructive action pattern
- `packages/web/src/components/a11y/`:
  - focus trap, keyboard shortcuts helper, aria-live announcement helper
- `packages/web/src/styles/`:
  - design tokens (color/spacing/type/radius/elevation) and theme contract

Done when:
- [ ] Frontend can call protected API via shared client with automatic token refresh path.
- [ ] All list screens can reuse common cursor pagination primitives.
- [ ] Shared UX state components are used by at least one screen each.
- [ ] A11y helpers are integrated in modal + notification flows.

---

## Phase 0 — Foundation ✅ DONE

| Artifact | Status |
|---|---|
| `go.mod` + directory scaffold | ✅ |
| `packages/shared/errors` | ✅ |
| `packages/shared/events` | ✅ |
| `packages/shared/middleware` | ✅ |
| `packages/shared/policy` | ✅ |
| `packages/shared/db` + `rdb` | ✅ |
| Initial `cmd/*` scaffolds | ✅ |
| `doc/governance/Coding_Standards.md` | ✅ |
| `doc/governance/Testing_Standards.md` | ✅ |

---

## Phase 1 — Test harness + cmd/api wiring

**Prerequisite**: Phase 0.
**Parallel**: 1A (test harness) and 1B (cmd/api wiring) can run concurrently.

### 1A — Unit tests for packages/shared

Read first: `doc/governance/Testing_Standards.md §Go Test Patterns`

Files to create:
```
packages/shared/errors/errors_test.go
packages/shared/middleware/sanitize_test.go
packages/shared/middleware/correlation_test.go
packages/shared/middleware/auth_test.go         # httptest + fake JWKS server
packages/shared/middleware/ratelimit_test.go    # stub policy + stub Redis via interface
packages/shared/middleware/idempotency_test.go  # stub pgxpool
packages/shared/events/types_test.go
packages/shared/policy/policy_test.go           # stub DB via interface
```

Tests to write:
- `errors`: New(), WithDetails(), all ErrCode constants compile
- `sanitize`: redacts each blocked field, redacts `ssh_private_key*` prefix, recurses into nested maps, leaves unblocked fields untouched
- `correlation`: generates UUID when header absent, echoes existing header, stores in context
- `auth`: valid JWT passes, expired JWT → 401 ErrTokenExpired, missing Bearer → 401 ErrTokenMissing, bad signature → 401 ErrTokenInvalid, RequireAdmin passes admin role, RequireAdmin blocks user role
- `ratelimit`: under limit passes, at limit+1 returns 429, X-RateLimit-* headers present, fails open on Redis error
- `idempotency`: no header → passes through, same key+body → replays cached response, same key+different body → 422, in-flight key → 409
- `events/types`: all Subject* constants are non-empty and unique, all payload structs have json tags
- `policy`: GetInt / GetBool / GetString return correct types, cache hit skips DB call, cache miss queries DB

Done when: `make test` passes with zero failures.

---

### 1B — Wire cmd/api + outbox relay

Read first: `doc/architecture/Inter_Service_Communication.md`

Files to create / replace:
```
cmd/api/main.go           # full wiring (replaces stub)
cmd/api/config.go         # env-var config struct with validation
cmd/api/server.go         # http.Server setup, graceful shutdown
cmd/api/routes.go         # route mounting (historical bootstrap note; current handlers are implemented)
cmd/api/outbox.go         # outbox relay loop
cmd/outbox-relay/main.go  # dedicated outbox relay process (scalable option)
packages/shared/outbox/relay.go  # shared relay logic (used by billing-worker too)
```

`config.go` — reads and validates:
```
DATABASE_URL          (required)
REDIS_URL             (required)
NATS_URL              (default nats://localhost:4222)
KEYCLOAK_ISSUER_URL   (required)
PORT                  (default 8080)
OTEL_EXPORTER_OTLP_ENDPOINT  (optional)
```
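
One possible shape for the config loader is sketched below. It is not the contents of `cmd/api/config.go`: the `Config` field names and the injected `lookup` function are assumptions, chosen so that `config_test.go` can pass a map instead of mutating the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Config mirrors the environment variables listed above.
type Config struct {
	DatabaseURL       string
	RedisURL          string
	NatsURL           string
	KeycloakIssuerURL string
	Port              string
	OTLPEndpoint      string // optional
}

// LoadConfig reads values through lookup (os.LookupEnv in production),
// applies defaults, and reports every missing required variable at once.
func LoadConfig(lookup func(string) (string, bool)) (Config, error) {
	get := func(key, def string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return def
	}
	cfg := Config{
		DatabaseURL:       get("DATABASE_URL", ""),
		RedisURL:          get("REDIS_URL", ""),
		NatsURL:           get("NATS_URL", "nats://localhost:4222"),
		KeycloakIssuerURL: get("KEYCLOAK_ISSUER_URL", ""),
		Port:              get("PORT", "8080"),
		OTLPEndpoint:      get("OTEL_EXPORTER_OTLP_ENDPOINT", ""),
	}
	required := []struct{ name, value string }{
		{"DATABASE_URL", cfg.DatabaseURL},
		{"REDIS_URL", cfg.RedisURL},
		{"KEYCLOAK_ISSUER_URL", cfg.KeycloakIssuerURL},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return Config{}, fmt.Errorf("missing required env vars: %s", strings.Join(missing, ", "))
	}
	return cfg, nil
}

func main() {
	cfg, err := LoadConfig(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("would listen on port", cfg.Port)
}
```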

`main.go` wiring order:
1. Parse config
2. `middleware.SetupOTel(ctx, "gpuaas-api", version)`
3. `db.Connect(ctx, cfg.DatabaseURL)`
4. `rdb.Connect(ctx, cfg.RedisURL)`
5. `events.Connect(cfg.NatsURL)` → `events.InitStreams(js)`
6. `middleware.NewJWKSAuth(ctx, cfg.KeycloakIssuerURL)`
7. `policy.NewPostgresClient(pool)`
8. `middleware.NewRateLimiter(rdb, policyClient)`
9. Mount middleware chain: `Tracing → CorrelationID → Auth → RateLimit`
10. Start outbox relay goroutine
11. `server.ListenAndServe`
12. On SIGTERM: shut down HTTP server → drain NATS → close pool
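
The middleware ordering in step 9 depends on how the chain is composed. A minimal sketch, assuming a hypothetical `Chain` helper and placeholder middleware in place of the real `Tracing`, `CorrelationID`, `Auth`, and `RateLimit` implementations:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Middleware wraps a handler with extra behavior.
type Middleware func(http.Handler) http.Handler

// Chain wraps h so that the first middleware listed runs first on each request.
func Chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a placeholder middleware that records its name when invoked.
func tag(name string, order *[]string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// traceOrder sends one request through the chain and returns the call order.
func traceOrder() []string {
	var order []string
	final := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "handler")
	})
	h := Chain(final,
		tag("Tracing", &order),
		tag("CorrelationID", &order),
		tag("Auth", &order),
		tag("RateLimit", &order),
	)
	req := httptest.NewRequest(http.MethodGet, "/api/v1/healthz", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return order
}

func main() {
	fmt.Println(traceOrder())
}
```

Wrapping from the last middleware to the first is what makes the listed order match the execution order, so tracing spans cover authentication and rate-limit decisions.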

`outbox relay` — claims rows with:
- `SELECT ... FROM platform_outbox_events WHERE status = 'pending' ORDER BY occurred_at LIMIT 50 FOR UPDATE SKIP LOCKED`
- Publish and update status inside the same worker transaction as the claim, so the row locks are held until each row's outcome is recorded.
- Publish each row via `events.PublishTyped`
- On success: `UPDATE platform_outbox_events SET status = 'published', published_at = now()`
- On failure: `UPDATE platform_outbox_events SET retry_count = retry_count + 1, last_attempted_at = now()`; after 10 retries set `status = 'failed'`
- Runs every 2 s; jitter ±200 ms to avoid thundering herd on multi-instance deploy
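
The timing and retry rules above can be isolated into two small pure functions, sketched here. The function names are hypothetical, and reading "after 10 retries" as "when `retry_count` reaches 10" is an interpretation rather than something the plan states.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	baseInterval = 2 * time.Second
	maxJitter    = 200 * time.Millisecond
	maxRetries   = 10
)

// nextInterval returns baseInterval plus a uniform offset in [-maxJitter, +maxJitter].
func nextInterval(rng *rand.Rand) time.Duration {
	offset := time.Duration(rng.Int63n(int64(2*maxJitter)+1)) - maxJitter
	return baseInterval + offset
}

// afterFailure returns the new retry count and row status after a failed publish.
func afterFailure(retryCount int) (int, string) {
	retryCount++
	if retryCount >= maxRetries {
		return retryCount, "failed"
	}
	return retryCount, "pending"
}

// sampleIntervals draws n intervals and reports the smallest and largest seen.
func sampleIntervals(seed int64, n int) (lo, hi time.Duration) {
	rng := rand.New(rand.NewSource(seed))
	lo, hi = baseInterval+maxJitter, baseInterval-maxJitter
	for i := 0; i < n; i++ {
		d := nextInterval(rng)
		if d < lo {
			lo = d
		}
		if d > hi {
			hi = d
		}
	}
	return lo, hi
}

func main() {
	lo, hi := sampleIntervals(42, 1000)
	fmt.Println("observed interval range:", lo, hi)
	count, status := afterFailure(9)
	fmt.Println("after tenth failure:", count, status)
}
```

Keeping these free of I/O lets `relay_test.go` cover the retry threshold without a database or NATS stub.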

Endpoints to implement:
- `GET /api/v1/healthz` → checks DB ping + Redis ping + NATS connection; returns 200 or 503
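
A health handler along these lines could take its dependency checks as injected functions, which keeps `routes_test.go` free of real connections. The `Check` type and the `"unavailable"` status string are assumptions; only the `{"status":"ok"}` success body comes from this plan.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Check reports whether one dependency is reachable.
type Check func(ctx context.Context) error

// HealthHandler returns 200 when every check passes and 503 otherwise.
func HealthHandler(checks map[string]Check) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		status, code := "ok", http.StatusOK
		for _, check := range checks {
			if err := check(r.Context()); err != nil {
				status, code = "unavailable", http.StatusServiceUnavailable
				break
			}
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string]string{"status": status})
	})
}

// probe exercises the handler with stub checks; dbUp toggles the DB stub.
func probe(dbUp bool) (int, string) {
	up := func(context.Context) error { return nil }
	db := func(context.Context) error {
		if dbUp {
			return nil
		}
		return errors.New("db ping failed")
	}
	h := HealthHandler(map[string]Check{"db": db, "redis": up, "nats": up})
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/v1/healthz", nil))
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	code, body := probe(true)
	fmt.Println(code, body)
	code, body = probe(false)
	fmt.Println(code, body)
}
```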

Tests to write:
```
cmd/api/config_test.go        # missing required env → error
cmd/api/routes_test.go        # GET /healthz returns 200 with all deps up; 503 with DB down
packages/shared/outbox/relay_test.go # pending rows published; retry incremented on NATS error; failed after 10 retries
```

Done when:
- [ ] `make dev-infra && make dev-api` starts without error
- [ ] `curl localhost:8080/api/v1/healthz` returns `{"status":"ok"}`
- [ ] `make test` passes

---

## Phase 2 — Auth + Users service

**Prerequisite**: Phase 1 complete.
Read first: `doc/api/openapi.draft.yaml §Auth §Users`, `doc/architecture/Inter_Service_Communication.md §JWT`

Files to create:
```
packages/services/auth/service.go
packages/services/auth/handler.go
packages/services/auth/handler_test.go
packages/services/auth/service_test.go
packages/services/auth/models.go
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | `/api/v1/auth/oidc/authorize` | Redirect to Keycloak authorize URL with PKCE params |
| POST | `/api/v1/auth/oidc/exchange` | Exchange code for tokens; upsert user in `users` table |
| POST | `/api/v1/auth/personal/login` | Personal account login (feature-flag controlled) |
| POST | `/api/v1/auth/token/refresh` | Forward refresh token to Keycloak |
| POST | `/api/v1/auth/logout` | Revoke refresh token at Keycloak |
| GET | `/api/v1/users/me` | Returns current user from `users` table by JWT `sub` |

Service logic:
- `UpsertUserFromClaims(ctx, claims)` — `INSERT … ON CONFLICT (oidc_issuer, oidc_subject) DO UPDATE SET …`; maps `realm_access.roles` claim to `users.role`
- `GetUserByOIDCSub(ctx, issuer, subject)` — lookup by `(oidc_issuer, oidc_subject)` unique index
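
The role mapping in `UpsertUserFromClaims` might be isolated as below. Keycloak places realm roles under `realm_access.roles`; the `Claims` struct, the role names `admin` and `user`, and the "admin wins, otherwise user" rule are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Claims holds the subset of token claims needed for the upsert.
type Claims struct {
	Issuer      string `json:"iss"`
	Subject     string `json:"sub"`
	RealmAccess struct {
		Roles []string `json:"roles"`
	} `json:"realm_access"`
}

// ParseClaims decodes an already-verified JWT payload.
func ParseClaims(raw []byte) (Claims, error) {
	var c Claims
	err := json.Unmarshal(raw, &c)
	return c, err
}

// RoleFromClaims maps realm roles to a single users.role value.
// Unknown or missing roles fall back to the least-privileged role.
func RoleFromClaims(c Claims) string {
	for _, r := range c.RealmAccess.Roles {
		if r == "admin" {
			return "admin"
		}
	}
	return "user"
}

func main() {
	raw := []byte(`{"iss":"https://kc.example/realms/dev","sub":"abc","realm_access":{"roles":["offline_access","admin"]}}`)
	c, err := ParseClaims(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Issuer, c.Subject, RoleFromClaims(c))
}
```

Defaulting to the least-privileged role means a token with a malformed or absent roles claim can never be promoted by accident.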

DB tables touched: `users`

Tests to write:
- `UpsertUserFromClaims`: new user created, existing user updated, role mapped correctly
- `GetUserByOIDCSub`: found, not-found → `ErrUserNotFound`
- `GET /users/me`: valid token returns user; missing token → 401; unknown sub → 404

Done when:
- [ ] `POST /auth/oidc/exchange` with dev Keycloak token upserts a user row
- [ ] `GET /users/me` returns the user
- [ ] All unit tests pass

---

## Phase 3 — Inventory service (Catalog + Nodes)

**Prerequisite**: Phase 1 complete. (Parallel with Phase 2.)
Read first: `doc/api/openapi.draft.yaml §Catalog §Nodes §AdminNodes`, `doc/architecture/db_schema_v1.sql`

Files to create:
```
packages/services/inventory/service.go
packages/services/inventory/handler.go
packages/services/inventory/handler_test.go
packages/services/inventory/service_test.go
packages/services/inventory/models.go
```

Endpoints to implement:
| Method | Path | Auth | Notes |
|---|---|---|---|
| GET | `/api/v1/skus` | user | List active SKUs from `sku_catalog` |
| GET | `/api/v1/nodes` | user | List node lifecycle + occupancy projection (no SSH secrets) |
| GET | `/api/v1/admin/nodes` | admin | List all nodes with onboarding mode and occupancy context |
| POST | `/api/v1/admin/nodes` | admin | Insert node; validate SKU exists; choose onboarding mode (`manual` or `maas`) |
| POST | `/api/v1/admin/nodes/{node_id}/probe` | admin | Reachability probe; update lifecycle status (`active` or `offline`) |
| DELETE | `/api/v1/admin/nodes/{node_id}` | admin | Soft-retire node (`status = 'retired'`) |

Service logic:
- `ListSKUs(ctx)` — `SELECT … FROM sku_catalog WHERE active = true ORDER BY sku`
- `ListAvailableNodes(ctx)` — schedulable nodes are lifecycle `active` and occupancy `available`
- `CreateNode(ctx, req)` — validate SKU, insert; write audit log
- `ProbeNode(ctx, nodeID)` — SSH dial with 10 s timeout; set `active`/`offline`; write audit log
- `DisableNode(ctx, nodeID)` — set `retired`; write audit log; fail if node has active allocation

DB tables touched: `sku_catalog`, `nodes`, `platform_audit_logs`

Tests:
- `ListAvailableNodes`: only lifecycle `active` + occupancy `available` nodes returned
- `DisableNode`: fails when node has active allocation (`ErrNodeInUse`)
- `POST /admin/nodes` without admin token → 403

Done when:
- [ ] `GET /api/v1/skus` returns seeded SKUs
- [ ] Admin can register and probe a node

---

## Phase 4 — Billing service (read path)

**Prerequisite**: Phase 2 (user identity).
Read first: `doc/architecture/State_Machines.md §3-4`, `doc/architecture/db_schema_v1.sql §ledger_entries §usage_records`

Files to create:
```
packages/services/billing/service.go
packages/services/billing/handler.go
packages/services/billing/handler_test.go
packages/services/billing/service_test.go
packages/services/billing/models.go
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | `/api/v1/billing/balance` | Sum of `ledger_entries.amount_minor` for user; never a column |
| GET | `/api/v1/billing/usage` | Paginated `usage_records` for user |
| GET | `/api/v1/billing/usage/csv` | CSV export of usage records |
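
For the CSV export, writing rows as they are read avoids buffering the whole result set in memory. A minimal sketch using `encoding/csv`, with a hypothetical `UsageRecord` shape and column set (the real columns come from `usage_records`):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// UsageRecord is an illustrative row shape, not the actual schema.
type UsageRecord struct {
	AllocationID    string
	SKU             string
	DurationSeconds int64
	AmountMinor     int64
}

// WriteUsageCSV pulls rows from next until it reports false and writes each
// one immediately, so a handler can pass its http.ResponseWriter as w.
func WriteUsageCSV(w io.Writer, next func() (UsageRecord, bool)) error {
	cw := csv.NewWriter(w)
	header := []string{"allocation_id", "sku", "duration_seconds", "amount_minor"}
	if err := cw.Write(header); err != nil {
		return err
	}
	for {
		r, ok := next()
		if !ok {
			break
		}
		row := []string{
			r.AllocationID,
			r.SKU,
			strconv.FormatInt(r.DurationSeconds, 10),
			strconv.FormatInt(r.AmountMinor, 10),
		}
		if err := cw.Write(row); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

// renderCSV adapts a slice to the iterator form; useful in tests.
func renderCSV(records []UsageRecord) (string, error) {
	i := 0
	next := func() (UsageRecord, bool) {
		if i >= len(records) {
			return UsageRecord{}, false
		}
		i++
		return records[i-1], true
	}
	var sb strings.Builder
	err := WriteUsageCSV(&sb, next)
	return sb.String(), err
}

func main() {
	out, err := renderCSV([]UsageRecord{{AllocationID: "a-1", SKU: "example-sku", DurationSeconds: 3600, AmountMinor: 1250}})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```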

Service logic:
- `GetBalance(ctx, userID)` — `SELECT COALESCE(SUM(amount_minor),0) FROM ledger_entries WHERE user_id = $1`; **never** a balance column
- `GetUsage(ctx, userID, filter)` — paginated query on `usage_records`
- `GetLedger(ctx, userID, filter)` — paginated query on `ledger_entries`
- `CreditLedger(ctx, tx, userID, amount, entryType, refID, corrID)` — shared helper used by payments and admin; inserts a ledger row inside an existing transaction

Tests:
- Balance: credits and debits sum correctly; empty → 0
- Balance: no balance column in schema — test verifies query uses SUM

Done when:
- [ ] `GET /billing/balance` returns correct sum after seeded ledger entry
- [ ] CSV export streams correctly

---

## Phase 5 — Payments service

**Prerequisite**: Phase 4 (billing ledger credit helper).
Read first: `doc/api/openapi.draft.yaml §Payments §AdminPayments`, `doc/architecture/State_Machines.md §5`

Files to create:
```
packages/services/payments/service.go
packages/services/payments/handler.go
packages/services/payments/webhook.go
packages/services/payments/handler_test.go
packages/services/payments/service_test.go
packages/services/payments/models.go
cmd/webhook-worker/main.go   # implemented (historically replaced stub)
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| POST | `/api/v1/payments/checkout-session` | Create Stripe session; insert `payment_sessions` row; idempotent via X-Idempotency-Key |
| POST | `/api/v1/payments/customer-portal-session` | Stripe billing portal URL |
| POST | `/api/v1/payments/webhook` | Stripe webhook — **buffer raw body first** before any JSON parse |
| GET | `/api/v1/admin/payments/sessions` | List stuck/failed sessions for reconciliation |

Service logic:
- `CreateCheckoutSession(ctx, userID, req)` — validate amount against policy min/max; create Stripe session; insert `payment_sessions` with `status = 'initiated'`; idempotency via `ix_payment_sessions_idempotency`
- `HandleWebhook(ctx, rawBody, sigHeader)` — verify with `webhook.ConstructEvent` (stripe-go); on `checkout.session.completed`: update `payment_sessions` to `checkout_completed`, then in one transaction: post ledger credit + update to `credited` + write `payments.balance_credited` to outbox
- Amount mismatch → `failed_reconcile`

**Critical**: `POST /payments/webhook` must read and buffer `r.Body` as raw bytes BEFORE calling any JSON decoder. The Stripe signature is computed over the exact raw bytes.
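
The sketch below shows why the ordering matters, using a simplified HMAC-SHA256 check as a stand-in for the Stripe library's verification. The header names `X-Demo-Timestamp` and `X-Demo-Signature`, the body size cap, and the helper names are hypothetical; timestamp tolerance checks are omitted.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

const maxWebhookBody = 1 << 16 // assumed cap to bound memory use

// sign computes a hex HMAC over "timestamp.body", byte for byte.
func sign(secret, timestamp string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "."))
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// WebhookHandler buffers the raw body, verifies it, and only then would parse it.
func WebhookHandler(secret string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw, err := io.ReadAll(http.MaxBytesReader(w, r.Body, maxWebhookBody))
		if err != nil {
			http.Error(w, "cannot read body", http.StatusBadRequest)
			return
		}
		ts := r.Header.Get("X-Demo-Timestamp")
		got := r.Header.Get("X-Demo-Signature")
		if !hmac.Equal([]byte(sign(secret, ts, raw)), []byte(got)) {
			http.Error(w, "invalid signature", http.StatusBadRequest)
			return
		}
		// Safe to json.Unmarshal(raw, &event) from here on: decoding and
		// re-encoding before this point could change the bytes and break the check.
		w.WriteHeader(http.StatusOK)
	})
}

// post signs signedBody but sends sentBody, returning the response status.
func post(secret, signedBody, sentBody string) int {
	req := httptest.NewRequest(http.MethodPost, "/api/v1/payments/webhook", strings.NewReader(sentBody))
	req.Header.Set("X-Demo-Timestamp", "1700000000")
	req.Header.Set("X-Demo-Signature", sign(secret, "1700000000", []byte(signedBody)))
	rec := httptest.NewRecorder()
	WebhookHandler(secret).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	body := `{"type":"checkout.session.completed"}`
	fmt.Println("intact body:", post("demo-secret", body, body))
	fmt.Println("mutated body:", post("demo-secret", body, body+" "))
}
```

Even a single added space changes the MAC, which is why re-serialized JSON cannot be used for verification.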

Tests:
- Checkout session created and `payment_sessions` row inserted
- Webhook with valid signature credits balance + transitions session state
- Duplicate webhook (same `stripe_event_id`) does not double-credit (idempotent via `stripe_events` PK)
- Webhook with mutated body → 400 (signature invalid)
- Amount below minimum → 400

Done when:
- [ ] `POST /payments/checkout-session` returns a Stripe URL
- [ ] Webhook handler verifies signature and credits balance
- [ ] Duplicate webhook (same `stripe_event_id`) does not double-credit

---

## Phase 6 — Provisioning orchestrator

**Prerequisite**: Phases 2, 3, 4 (auth, inventory, billing balance check) and Pre-Phase Security (Encryption Envelope Baseline).
Read first: `doc/architecture/State_Machines.md §1`, `doc/architecture/Sequence_Flows.md`, `doc/api/openapi.draft.yaml §Allocations`

Files to create:
```
packages/services/provisioning/orchestrator/service.go
packages/services/provisioning/orchestrator/handler.go
packages/services/provisioning/orchestrator/handler_test.go
packages/services/provisioning/orchestrator/service_test.go
packages/services/provisioning/orchestrator/models.go
packages/services/provisioning/orchestrator/statemachine.go
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| POST | `/api/v1/allocations` | Create allocation; check balance + concurrency limit via policy; insert `requested`; write outbox `provisioning.requested`; write `usage_records` row |
| GET | `/api/v1/allocations` | Paginated list for current user |
| GET | `/api/v1/allocations/{id}` | Single allocation; ownership check |
| POST | `/api/v1/allocations/{id}/release` | Transition to `releasing`; write outbox `provisioning.releasing.requested`; also accepts `release_failed` → `releasing` |
| GET | `/api/v1/ssh-keys` | List current user's registered SSH public keys |
| POST | `/api/v1/ssh-keys` | Register SSH public key for runtime access |
| DELETE | `/api/v1/ssh-keys/{key_id}` | Revoke/remove SSH public key |
| GET | `/api/v1/admin/allocations` | Admin list with status filter |
| POST | `/api/v1/admin/allocations/{id}/force-release` | Transition `release_failed` → `releasing`; write outbox `provisioning.force_release_requested`; write audit log |

State machine transitions (see `doc/architecture/State_Machines.md §1`):
```
requested → provisioning   (on provisioning.requested consumed by worker)
provisioning → active      (on provisioning.active event)
provisioning → failed      (on provisioning.failed event)
active → releasing         (user/admin release request)
releasing → released       (on provisioning.releasing.completed event)
releasing → release_failed (on provisioning.release_failed event)
release_failed → releasing (user retry or admin force-release)
```
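
The transition list above maps directly onto a lookup table. A minimal sketch of what `statemachine.go` could contain, with hypothetical type and function names:

```go
package main

import "fmt"

// Status is an allocation lifecycle state.
type Status string

const (
	Requested     Status = "requested"
	Provisioning  Status = "provisioning"
	Active        Status = "active"
	Failed        Status = "failed"
	Releasing     Status = "releasing"
	Released      Status = "released"
	ReleaseFailed Status = "release_failed"
)

// transitions lists every allowed edge; states without an entry are terminal.
var transitions = map[Status][]Status{
	Requested:     {Provisioning},
	Provisioning:  {Active, Failed},
	Active:        {Releasing},
	Releasing:     {Released, ReleaseFailed},
	ReleaseFailed: {Releasing},
}

// CanTransition reports whether from -> to is an allowed edge.
func CanTransition(from, to Status) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Transition returns the new status, or an error a handler could map to 409.
func Transition(from, to Status) (Status, error) {
	if !CanTransition(from, to) {
		return from, fmt.Errorf("illegal transition %s -> %s", from, to)
	}
	return to, nil
}

func main() {
	s, err := Transition(Active, Releasing)
	fmt.Println(s, err)
	s, err = Transition(Releasing, Releasing)
	fmt.Println(s, err)
}
```

Keeping the edges in one table gives the "all state machine edge cases tested" criterion a single place to enumerate from.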

Concurrency check: `SELECT COUNT(*) FROM allocations WHERE user_id = $1 AND status IN ('requested','provisioning','active','releasing')` — compare against `policy.KeyAllocationMaxConcurrentPerUser`.

Balance check: `billing.GetBalance(ctx, userID) > 0` before creating allocation.
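
The two admission checks can be combined into one guard that runs before the `requested` row is inserted. In this sketch the function names and the order of the checks (balance first) are assumptions; the 402 and 429 mappings follow the test expectations listed for this phase.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var (
	ErrInsufficientBalance = errors.New("insufficient balance")
	ErrConcurrencyLimit    = errors.New("concurrent allocation limit reached")
)

// CheckAdmission applies the balance and concurrency rules.
// openAllocations is the COUNT(*) over non-terminal statuses for the user.
func CheckAdmission(balanceMinor int64, openAllocations, maxConcurrent int) error {
	if balanceMinor <= 0 {
		return ErrInsufficientBalance
	}
	if openAllocations >= maxConcurrent {
		return ErrConcurrencyLimit
	}
	return nil
}

// StatusFor maps an admission error to an HTTP status code.
func StatusFor(err error) int {
	switch {
	case errors.Is(err, ErrInsufficientBalance):
		return http.StatusPaymentRequired
	case errors.Is(err, ErrConcurrencyLimit):
		return http.StatusTooManyRequests
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	fmt.Println(CheckAdmission(500, 1, 2))
	fmt.Println(StatusFor(CheckAdmission(0, 0, 2)))
	fmt.Println(StatusFor(CheckAdmission(500, 2, 2)))
}
```

In a real handler the count query and the insert would need to share a transaction (or a lock) so two concurrent requests cannot both pass the limit.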

DB tables touched: `allocations`, `usage_records`, `platform_outbox_events`, `platform_audit_logs`

Tests:
- Create allocation: happy path, insufficient balance → 402, concurrency limit → 429, SKU unavailable → 409
- Release: active → releasing transition; already releasing → 409; wrong owner → 403
- Force-release: only for `release_failed` status; requires admin role
- `GET /allocations/{id}`: wrong owner → 403, not found → 404

Done when:
- [ ] Full create → release cycle transitions correctly in DB
- [ ] All state machine edge cases tested

---

## Phase 7 — Provisioning worker

**Prerequisite**: Phase 6 (allocation state machine in DB), Pre-Phase Security (Encryption Envelope Baseline), and **Pre-Phase Node Agent** (node agent scaffold + step-ca running).
Read first: `doc/architecture/NATS_Stream_Config.md`, `doc/architecture/Inter_Service_Communication.md §GPU nodes`

Files to create:
```
packages/services/provisioning/worker/workflow.go       # Temporal workflow definitions
packages/services/provisioning/worker/activities.go     # Node agent task activities
packages/services/provisioning/worker/ssh.go            # SSH dial + exec helpers (admin probe only)
packages/services/provisioning/worker/consumer.go       # NATS consumer setup
packages/services/provisioning/worker/workflow_test.go
packages/services/provisioning/worker/activities_test.go
cmd/provisioning-worker/main.go   # implemented (historically replaced stub)
```

**Note**: provisioning activities use the node agent task API, not raw SSH.
See `doc/architecture/Node_Agent_Spec.md §12` for the activity pattern.
`ssh.go` is retained only for the admin probe endpoint (`POST /admin/nodes/{id}/probe`).

Temporal workflows:
- `ProvisionNodeWorkflow(allocationID)` — activity sequence: AllocateNode → ProvisionUser (via node agent `allocation.provision_user` task) → UpdateAllocationActive → EmitOutboxActive
- `ReleaseNodeWorkflow(allocationID)` — activity sequence: RevokeUser (via node agent `allocation.revoke_user` task) → UpdateAllocationReleased → EmitOutboxReleasingCompleted; on max retries → UpdateAllocationReleaseFailed + EmitOutboxReleaseFailed

Node agent activities:
- `ProvisionUserActivity` — inserts `node_tasks` row (`allocation.provision_user`); polls until `succeeded` or timeout; applies user public key set and avoids persistent storage of user private keys in control-plane DB.
- `RevokeUserActivity` — inserts `node_tasks` row (`allocation.revoke_user`); polls until `succeeded` or timeout; sets `release_failed_reason` on task `failed` or timeout
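
Both activities share the same wait loop: poll the task row until it reaches a terminal status or the deadline passes. The sketch below illustrates that loop under stated assumptions. `StatusFunc` stands in for the real `node_tasks` lookup, `queued` is an assumed name for the non-terminal status, and `waitScripted` exists only to exercise the loop without a database.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type TaskStatus string

const (
	TaskQueued    TaskStatus = "queued" // assumed name for a non-terminal state
	TaskSucceeded TaskStatus = "succeeded"
	TaskFailed    TaskStatus = "failed"
)

var ErrTaskFailed = errors.New("node task failed")

// StatusFunc is a hypothetical lookup of the current node_tasks status.
type StatusFunc func(ctx context.Context) (TaskStatus, error)

// WaitForTask polls until the task succeeds, fails, or ctx expires.
func WaitForTask(ctx context.Context, lookup StatusFunc, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		status, err := lookup(ctx)
		if err != nil {
			return err
		}
		switch status {
		case TaskSucceeded:
			return nil
		case TaskFailed:
			return ErrTaskFailed
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("node task did not finish: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// waitScripted replays a fixed list of statuses (repeating the last one)
// with a 50 ms deadline. It is a demo helper, not part of the worker.
func waitScripted(statuses ...TaskStatus) error {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	i := 0
	lookup := func(context.Context) (TaskStatus, error) {
		s := statuses[i]
		if i < len(statuses)-1 {
			i++
		}
		return s, nil
	}
	return WaitForTask(ctx, lookup, time.Millisecond)
}

func main() {
	fmt.Println(waitScripted(TaskQueued, TaskSucceeded)) // <nil>
	fmt.Println(waitScripted(TaskQueued, TaskFailed))    // node task failed
}
```

Returning distinct errors for `failed` and for timeout lets the caller record a specific `release_failed_reason`.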

NATS consumers to register on startup (durable names from `NATS_Stream_Config.md`):
- `provisioning_worker_provision_requested` → start `ProvisionNodeWorkflow`
- `provisioning_worker_releasing_requested` → start `ReleaseNodeWorkflow`
- `provisioning_worker_force_release` → start `ReleaseNodeWorkflow`

Tests:
- `ProvisionUserActivity`: node task queued; task `failed` → activity returns error; task `succeeded` → allocation updated
- `RevokeUserActivity`: task timeout → activity returns error with reason
- `ProvisionNodeWorkflow`: activity failures retried; exhausted retries → `failed` state

Done when:
- [ ] `make dev-infra && make dev-worker-provisioning` starts
- [ ] Creating an allocation (Phase 6) triggers the workflow and transitions to `active`
- [ ] Agent runtime waits for `node_tasks` completion and maps timeout/failure to `failed`/`release_failed` deterministically
- [ ] Internal node endpoints enforce node identity authorization with mTLS certificate binding
- [ ] Temporal workflow path includes retry + compensation behavior for duplicate events and task timeouts
- [ ] Provisioning resumes from `provisioning` when node assignment arrives (no stranded rows)
- [ ] step-ca integration + KMS signing key lifecycle verified in staging
- [ ] Integration tests cover full node-agent flow (`requested -> active`, `releasing -> released`, `release_failed` retry)
- [ ] Provisioning metrics include task queue depth, dispatch latency, timeout count, and failure reasons
- [ ] Private-key handling cutover complete: no persistent server-side user SSH private-key storage before public launch

Remaining backend provisioning checklist (active):
- [ ] Agent runtime completion semantics (`node_tasks` enqueue is not terminal success).
- [ ] Internal node auth hardening (mTLS identity -> `node_id` binding).
- [ ] Workflow robustness for retries/compensation and duplicate event handling.
- [ ] Scheduler/assignment re-trigger when allocation is waiting for node assignment.
- [ ] PKI production path completion (step-ca + rotation + KMS key lifecycle).
- [ ] Provisioning integration/e2e coverage for success/failure/retry/replay.
- [ ] Operational metrics + alerts for provisioning control loop health.
- [ ] Remove persistent private-key storage dependency from provisioning/terminal path via pre-launch cutover (one-time delivery and/or user-managed key model).

---

## Phase 8 — Billing worker

**Prerequisite**: Phase 4 (ledger credit helper), Phase 6 (usage_records).
Read first: `doc/architecture/State_Machines.md §3-4`, `doc/architecture/NATS_Stream_Config.md`

Files to create:
```
packages/services/billing/accrual.go          # billing loop logic
packages/services/billing/accrual_test.go
packages/services/billing/consumer.go         # NATS consumer setup
cmd/billing-worker/main.go   # implemented (historically replaced stub)
```

Worker responsibilities:
1. **Accrual loop** — every `policy.KeyBillingWindowSeconds`:
   - Query `usage_records WHERE end_time IS NULL` (active usage)
   - For each: compute `elapsed * gpu_hourly_price_minor * gpus_total_snapshot`, with `elapsed` expressed in hours to match the hourly price
   - Insert `ledger_entries` debit row; update `usage_records.last_billed_at` and `usage_records.accrued_cost_minor`
   - Idempotency key: `(usage_record_id, window_start)` in `platform_api_idempotency_keys`
2. **Low balance check** — after each accrual cycle:
   - `GetBalance(ctx, userID)` for every user with active usage
   - If balance ≤ `policy.KeyBillingLowBalanceThresholdMinor` and `last_low_balance_notified_at` is nil or > 24 h ago: write outbox `billing.low_balance_warning`; update `users.last_low_balance_notified_at`
3. **Balance depleted** — if balance ≤ 0: write outbox `billing.balance_depleted`; write outbox `provisioning.force_release_requested` for each active allocation
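
The accrual arithmetic in step 1 is easy to get wrong on units, so a small worked sketch may help. It assumes the elapsed time is tracked in whole seconds and that sub-unit remainders are truncated. The rounding policy, the function names, and the `accrual:` key format are all assumptions made for illustration.

```go
package main

import "fmt"

const secondsPerHour = 3600

// AccrualCostMinor converts elapsed seconds to hours and multiplies by the
// hourly price and GPU count. Multiplying before dividing keeps the math in
// integers; the remainder is truncated (an assumed rounding policy).
func AccrualCostMinor(elapsedSeconds, gpuHourlyPriceMinor, gpus int64) int64 {
	return elapsedSeconds * gpuHourlyPriceMinor * gpus / secondsPerHour
}

// AccrualIdempotencyKey derives a stable key from (usage_record_id,
// window_start). The string layout is hypothetical.
func AccrualIdempotencyKey(usageRecordID string, windowStartUnix int64) string {
	return fmt.Sprintf("accrual:%s:%d", usageRecordID, windowStartUnix)
}

func main() {
	fmt.Println(AccrualCostMinor(3600, 250, 8))              // 2000
	fmt.Println(AccrualCostMinor(60, 250, 8))                // 33
	fmt.Println(AccrualIdempotencyKey("ur_123", 1700000000)) // accrual:ur_123:1700000000
}
```

Because the key depends only on the usage record and the window start, replaying the same window yields the same key and the debit can be skipped.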

NATS consumers (durable names from `NATS_Stream_Config.md`):
- `billing_worker_provision_active` → open `usage_records` row
- `billing_worker_releasing_completed` → close `usage_records` row (`end_time = now()`)
- `billing_worker_release_failed` → close `usage_records` row (billing stops)
- `billing_worker_balance_credited` → check if any paused allocations can resume

Tests:
- Accrual: correct cost for GPU-hours elapsed; idempotent on replay
- Low balance: warning emitted once per transition; not re-emitted within 24 h while still low
- Depleted: force-release outbox row written for each active allocation

Done when:
- [ ] `make dev-worker-billing` starts
- [ ] Creating and holding an allocation for 2 billing cycles accrues expected cost
- [ ] Depleting balance triggers force-release flow

---

## Phase 9 — Terminal service

**Prerequisite**: Phase 6 (allocation active + runtime access credential model).
Read first: `doc/architecture/Inter_Service_Communication.md §Terminal tokens`

Files to create:
```
packages/services/terminal/service.go
packages/services/terminal/handler.go
packages/services/terminal/proxy.go         # WebSocket → SSH proxy
packages/services/terminal/handler_test.go
packages/services/terminal/service_test.go
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| POST | `/api/v1/allocations/{id}/terminal-token` | Mint single-use 256-bit token; store in Redis with 300 s TTL; key: `terminal_token:{token}` → `{user_id, allocation_id, expires_at}` |
| WS | `/ws/terminal/{allocation_id}` | WebSocket upgrade; validate terminal token via `Sec-WebSocket-Protocol` for browser clients or `Authorization` header for non-browser clients (never query string); open SSH shell; stream bidirectionally |

Token storage:
```
Key:   terminal_token:{random_hex}
Value: JSON {user_id, allocation_id, expires_at}
TTL:   300 s
```

Token validation: `GETDEL` from Redis (atomic single-use consume).
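
A rough sketch of minting and single-use consumption follows. To stay self-contained it uses an in-memory map as a stand-in for Redis, so the TTL and `expires_at` are not modelled. `Consume` mimics what `GETDEL` provides (read and delete in one atomic step). `MintToken`, `Claims`, and `memStore` are hypothetical names.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// MintToken returns 32 random bytes (256 bits) as a 64-character hex string.
func MintToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// Claims is the value stored under the token key.
type Claims struct {
	UserID       string
	AllocationID string
}

// memStore is an in-memory stand-in for Redis, used only in this sketch.
type memStore struct {
	mu sync.Mutex
	m  map[string]Claims
}

func newMemStore() *memStore { return &memStore{m: make(map[string]Claims)} }

func (s *memStore) Put(token string, c Claims) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m["terminal_token:"+token] = c
}

// Consume reads and deletes the entry under one lock, like GETDEL.
func (s *memStore) Consume(token string) (Claims, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := "terminal_token:" + token
	c, ok := s.m[key]
	delete(s.m, key)
	return c, ok
}

func main() {
	store := newMemStore()
	tok, err := MintToken()
	if err != nil {
		panic(err)
	}
	store.Put(tok, Claims{UserID: "u1", AllocationID: "a1"})
	_, first := store.Consume(tok)
	_, second := store.Consume(tok)
	fmt.Println(len(tok), first, second) // 64 true false
}
```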

WebSocket proxy:
- Dial SSH using non-persistent runtime credentials
- Bidirectional copy: WS frames ↔ SSH stdin/stdout/stderr
- On SSH disconnect: close WS with normal close code

Tests:
- Token minting: stored in Redis with correct TTL
- Token is single-use: second validation returns not-found
- Token for wrong user → 403
- Terminal token must be in header, not query string
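
The header-only rule can be enforced in a small extraction helper. The sketch below is illustrative. The `token` query parameter name, the `Bearer ` scheme, and the `token.<value>` subprotocol convention are assumptions, not confirmed parts of the contract.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

var (
	ErrTokenInQuery = errors.New("terminal token must not be sent in the query string")
	ErrNoToken      = errors.New("terminal token missing")
)

// tokenFromParts applies the rule to raw header and query values.
func tokenFromParts(authorization, wsProtocols, rawQuery string) (string, error) {
	// Refuse any request that carries a token in the URL, even if a header is present.
	q, _ := url.ParseQuery(rawQuery)
	if _, found := q["token"]; found {
		return "", ErrTokenInQuery
	}
	// Non-browser clients: Authorization header (assumed Bearer scheme).
	if strings.HasPrefix(authorization, "Bearer ") {
		if tok := strings.TrimPrefix(authorization, "Bearer "); tok != "" {
			return tok, nil
		}
	}
	// Browser clients: one offered subprotocol carries the token (assumed "token." prefix).
	for _, p := range strings.Split(wsProtocols, ",") {
		p = strings.TrimSpace(p)
		if strings.HasPrefix(p, "token.") && len(p) > len("token.") {
			return strings.TrimPrefix(p, "token."), nil
		}
	}
	return "", ErrNoToken
}

// TokenFromRequest is the thin wrapper a WebSocket handler would call.
func TokenFromRequest(r *http.Request) (string, error) {
	return tokenFromParts(
		r.Header.Get("Authorization"),
		r.Header.Get("Sec-WebSocket-Protocol"),
		r.URL.RawQuery,
	)
}

func main() {
	fmt.Println(tokenFromParts("", "terminal.v1, token.abc123", ""))
	fmt.Println(tokenFromParts("Bearer abc123", "", "token=abc123"))
}
```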

Done when:
- [ ] `POST /api/v1/allocations/{id}/terminal-token` returns an opaque token
- [ ] WS connection with token proxies to SSH

---

## Phase 9C — Hardened Terminal Gateway Extraction (Option C)

**Prerequisite**: Phase 9 complete and stable in staging.
Read first:
- `doc/architecture/Inter_Service_Communication.md §5.1`
- `doc/operations/Production_Platform_Baseline.md`
- `doc/operations/Parallel_Ops_Track.md`

Goal:
- Move terminal streaming from embedded API runtime (Option B) to a dedicated
  hardened service (Option C) without changing public contracts.

Files to create/update:
```
cmd/terminal-gateway/main.go
cmd/terminal-gateway/config.go
cmd/terminal-gateway/server.go
packages/services/terminal/gateway_service.go
packages/services/terminal/gateway_service_test.go
doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md
doc/operations/local-dev/docker-compose.yaml
```

Scope:
1. Route ownership split:
   - `cmd/api`: terminal token mint endpoint only.
   - `cmd/terminal-gateway`: `WS /ws/terminal/{allocation_id}` only.
2. Security hardening:
   - gateway can consume terminal tokens atomically via Redis `GETDEL`.
   - strict origin/protocol validation for browser WS.
   - deny direct access to non-terminal API surfaces.
3. Network policy:
   - edge routes `/ws/terminal/*` to gateway service.
   - gateway egress limited to Redis and node SSH targets.
4. Observability:
   - connection success/failure counters, replay rejects, active session gauges.
   - runbook links for degraded/error states in admin ops.

Tests:
- Gateway accepts valid token and rejects replay/expired/mismatched allocation token.
- Multi-instance gateway can serve concurrent sessions without sticky routing failures.
- API terminal-token endpoint remains unchanged and interoperates with gateway consumer.
- Integration: route switch from API WS handler to gateway with zero contract changes.

Done when:
- [ ] `cmd/terminal-gateway` runs in local dev and staging.
- [ ] `/ws/terminal/*` ingress points to gateway, not `cmd/api`.
- [ ] Terminal contracts in OpenAPI/AsyncAPI remain unchanged.
- [ ] Ops dashboard includes gateway health and replay/anomaly signals.
- [ ] Rollback procedure (route switch back to API WS handler) documented and tested.

---

## Phase 10 — Storage service

**Prerequisite**: Phase 2 (user identity).
Read first: `doc/api/openapi.draft.yaml §Storage`

Files to create:
```
packages/services/storage/service.go
packages/services/storage/handler.go
packages/services/storage/pathsafety.go    # path traversal prevention
packages/services/storage/handler_test.go
packages/services/storage/service_test.go
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | `/api/v1/storage/list` | List `platform_storage_objects` for active project under given path prefix |
| GET | `/api/v1/storage/download` | Stream object bytes from S3 |
| PUT | `/api/v1/storage/upload` | Stream to S3; insert/update `platform_storage_objects`; quota check |
| POST | `/api/v1/storage/mkdir` | Insert `dir` type row in `platform_storage_objects` |
| POST | `/api/v1/storage/rename` | Update `path` column; validate the destination against the path safety rules |
| DELETE | `/api/v1/storage/delete` | Delete from S3 + `platform_storage_objects` |

Path safety rules:
- Resolved path must remain under `/{org_id}/{project_id}/` prefix
- Reject `..` components: return 400 `ErrStoragePathTraversal`
- Quota: SUM of `size_bytes` for project must stay under `policy.KeyStorageQuotaBytes` (add this policy key)
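
The first two rules might be implemented along these lines in `pathsafety.go`. This is a sketch under assumptions: `ResolvePath` is a hypothetical name, paths use forward slashes only, and the error value simply reuses the `ErrStoragePathTraversal` identifier from the rules above.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

var ErrStoragePathTraversal = errors.New("storage_path_traversal")

// ResolvePath joins a user-supplied path onto the project root and rejects
// anything that contains a ".." component or escapes the root.
func ResolvePath(orgID, projectID, userPath string) (string, error) {
	// Check components before cleaning, so "a/../b" is rejected rather than
	// silently normalised.
	for _, part := range strings.Split(userPath, "/") {
		if part == ".." {
			return "", ErrStoragePathTraversal
		}
	}
	root := "/" + orgID + "/" + projectID + "/"
	resolved := path.Join(root, userPath) // Join also cleans "." and "//"
	// Defence in depth: the result must be the root itself or sit under it.
	if resolved+"/" != root && !strings.HasPrefix(resolved, root) {
		return "", ErrStoragePathTraversal
	}
	return resolved, nil
}

func main() {
	fmt.Println(ResolvePath("org1", "proj1", "data/model.bin"))
	fmt.Println(ResolvePath("org1", "proj1", "../proj2/secret"))
}
```

Rejecting `..` outright, instead of cleaning it away, matches the rule as written and keeps the 400 response predictable for clients.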

Tests:
- `../` traversal → 400 `storage_path_traversal`
- Quota exceeded → 400 `storage_quota_exceeded`
- Download non-existent object → 404 `storage_object_not_found`

Done when:
- [ ] Upload → list → download round-trip works
- [ ] Path traversal and quota tests pass

---

## Phase 11 — Admin service

**Prerequisite**: Phases 2–6 complete (all domain entities exist).
Read first: `doc/api/openapi.draft.yaml §AdminUsers §AdminAllocations §AdminAudit §AdminPayments`

Files to create:
```
packages/services/admin/users_handler.go
packages/services/admin/nodes_handler.go
packages/services/admin/allocations_handler.go
packages/services/admin/audit_handler.go
packages/services/admin/payments_handler.go
packages/services/admin/service.go
packages/services/admin/handler_test.go
packages/services/admin/service_test.go
```

Endpoints to implement:
| Method | Path | Notes |
|---|---|---|
| GET | `/api/v1/admin/users` | Paginated user list |
| POST | `/api/v1/admin/users` | Create user directly (bypass OIDC) |
| GET | `/api/v1/admin/users/{id}` | User + balance + active allocations |
| POST | `/api/v1/admin/users/{id}/balance` | Admin ledger credit/adjustment; write audit log |
| POST | `/api/v1/admin/users/{id}/refunds` | Refund within policy window; write `refund_requests`; write audit log |
| GET | `/api/v1/admin/allocations` | Paginated with status filter (esp. `release_failed`) |
| POST | `/api/v1/admin/allocations/{id}/force-release` | Already in Phase 6 |
| GET | `/api/v1/admin/audit-logs` | Filterable by actor, target, action, date range |
| GET | `/api/v1/admin/audit-logs/export` | CSV export |
| GET | `/api/v1/admin/payments/sessions` | Sessions in `initiated`/`failed_reconcile` state |

All admin endpoints must:
- Be gated by `middleware.RequireAdmin`
- Write a `platform_audit_logs` row for every mutation

Refund logic:
- Check `policy.KeyAllocationRefundWindowDays` from credit posting date
- Beyond window → 422 `refund_window_exceeded`
- Within window: call Stripe refund API OR post internal ledger credit; update `refund_requests`
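
The window check itself is a date comparison. The sketch below assumes the window is counted in calendar days from the credit posting time and that a request made exactly at the deadline still qualifies. Both are assumptions, and `WithinRefundWindow` and `day` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// WithinRefundWindow reports whether now falls on or before
// postedAt + windowDays. The inclusive boundary is an assumption.
func WithinRefundWindow(postedAt, now time.Time, windowDays int) bool {
	deadline := postedAt.AddDate(0, 0, windowDays)
	return !now.After(deadline)
}

// day builds a UTC midnight timestamp; a convenience for the demo only.
func day(year int, month time.Month, d int) time.Time {
	return time.Date(year, month, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	posted := day(2024, time.January, 1)
	fmt.Println(WithinRefundWindow(posted, day(2024, time.January, 10), 14)) // true
	fmt.Println(WithinRefundWindow(posted, day(2024, time.January, 16), 14)) // false
}
```

A false result corresponds to the 422 `refund_window_exceeded` response.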

Tests:
- Non-admin token → 403 on every admin endpoint
- Admin balance credit creates ledger entry and audit log
- Refund beyond window → 422

Done when:
- [ ] All admin CRUD operations work via Keycloak dev-admin token
- [ ] Audit log populated for every mutation

---

## Phase 12 — Notification service

**Prerequisite**: Phase 8 (billing events), Phase 7 (provisioning events).
Read first: `doc/architecture/NATS_Stream_Config.md §BILLING §PROVISIONING`

Files to create:
```
packages/services/notification/service.go
packages/services/notification/consumer.go    # NATS consumer setup
packages/services/notification/email.go       # email adapter (stub → SES/SMTP)
packages/services/notification/ws.go          # WebSocket broadcast (user-facing)
packages/services/notification/service_test.go
```

NATS consumers (durable names from `NATS_Stream_Config.md`):
- `notification_relay_low_balance` → publish user-scoped Redis notification + optional email
- `notification_relay_auto_release_pending` → publish user-scoped Redis notification
- `notification_relay_balance_depleted` → publish user-scoped Redis notification + optional email
- `notification_relay_provision_active` → publish user-scoped Redis notification
- `notification_relay_provision_failed` → publish user-scoped Redis notification + optional email
- `notification_relay_releasing_completed` → publish user-scoped Redis notification
- `notification_relay_release_failed` → publish user-scoped Redis notification + optional email

Enable/disable controlled by `policy.KeyNotificationLowBalanceEnabled` and `policy.KeyNotificationBalanceDepletedEnabled`.

Tests:
- Low balance with feature flag disabled → no notification dispatched
- Provisioning failed → email adapter called with correct user ID and reason

Done when:
- [ ] Low-balance event triggers log entry (email adapter stubbed)
- [ ] Policy flag disables notification correctly

---

## Phase 13 — Integration test harness

**Prerequisite**: Phase 1B (dev-infra running), any service under test.
Read first: `doc/governance/Testing_Standards.md §Integration test setup`

Files to create:
```
packages/testhelpers/db.go       # DB(t) — pool + t.Cleanup
packages/testhelpers/redis.go    # Redis(t) — client + t.Cleanup
packages/testhelpers/nats.go     # NATS(t) — connection + t.Cleanup
packages/testhelpers/truncate.go # TruncateTables(t, pool, tables...)
packages/testhelpers/jwt.go      # MintToken(t, userID, roles) using test JWKS
packages/testhelpers/jwks.go     # NewFakeJWKSServer(t) — httptest JWKS + token signer
```

`MintToken` generates an RS256 JWT signed by the fake JWKS server's private key,
used to test auth-protected endpoints in integration tests without Keycloak.
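
For orientation, RS256 signing needs only the standard library. The sketch below shows the core of what a `MintToken` helper could do. `SignRS256`, `VerifyRS256`, `NewTestKey`, and the `sub`/`roles` claim names are assumptions for illustration; the real helper may use a JWT library and different claims.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// NewTestKey generates a throwaway RSA key for tests.
func NewTestKey() (*rsa.PrivateKey, error) {
	return rsa.GenerateKey(rand.Reader, 2048)
}

// SignRS256 builds header.payload.signature as a compact JWT.
func SignRS256(key *rsa.PrivateKey, kid string, claims map[string]any) (string, error) {
	header, err := json.Marshal(map[string]string{"alg": "RS256", "typ": "JWT", "kid": kid})
	if err != nil {
		return "", err
	}
	payload, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	enc := base64.RawURLEncoding // JWTs use unpadded base64url
	signingInput := enc.EncodeToString(header) + "." + enc.EncodeToString(payload)
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + enc.EncodeToString(sig), nil
}

// VerifyRS256 checks only the signature; it does not validate claims.
func VerifyRS256(pub *rsa.PublicKey, token string) error {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return errors.New("malformed token")
	}
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return err
	}
	digest := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
	return rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig)
}

func main() {
	key, err := NewTestKey()
	if err != nil {
		panic(err)
	}
	claims := map[string]any{"sub": "user-1", "roles": []string{"admin"}}
	tok, err := SignRS256(key, "test-key", claims)
	if err != nil {
		panic(err)
	}
	fmt.Println(VerifyRS256(&key.PublicKey, tok)) // <nil>
}
```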

Add integration tests for:
```
packages/services/auth/service_integration_test.go
packages/services/billing/accrual_integration_test.go
packages/services/payments/webhook_integration_test.go
packages/services/provisioning/orchestrator/service_integration_test.go
```

Done when:
- [ ] `make test-integration` passes with `make dev-infra` running
- [ ] `TruncateTables` isolates each test

---

## Phase 14 — E2E acceptance tests

**Prerequisite**: All phases complete. `make e2e-up` running.
Read first: `doc/governance/Testing_Standards.md §Acceptance Matrix`

Files to create:
```
tests/e2e/auth_test.go          # AT-001, AT-002, AT-003
tests/e2e/marketplace_test.go   # AT-010
tests/e2e/provisioning_test.go  # AT-020, AT-023, AT-030 – AT-032
tests/e2e/billing_test.go       # AT-040 – AT-042
tests/e2e/payments_test.go      # AT-050 – AT-053
tests/e2e/storage_test.go       # AT-060, AT-061
tests/e2e/ratelimit_test.go     # AT-070 – AT-072
tests/e2e/audit_test.go         # AT-080 – AT-083
```

All E2E tests use `//go:build e2e` build tag.

Done when:
- [ ] All AT-xxx cases from `Testing_Standards.md` pass against full stack

---

## Dependency graph summary

```
Pre-Phase Node Agent (step-ca + node-agent scaffold + schema + OpenAPI)
    │
    └──────────────────────────────────────── Phase 7 (provisioning worker) [BLOCKS]

Phase 0 (done)
    └── Phase 1 (test harness + cmd/api)
            ├── Phase 2 (auth)      ─┐
            ├── Phase 3 (inventory)  ├── Phase 6 (provisioning orchestrator)
            └── Phase 4 (billing)   ─┘       └── Phase 7 (provisioning worker)
                    └── Phase 5 (payments)              ← also requires Pre-Phase Node Agent
                                              └── Phase 8 (billing worker)
                                                        └── Phase 12 (notification)
Phase 6 ──────────────────────────────────────────────── Phase 9 (terminal)
Phase 2 ──────────────────────────────────────────────── Phase 10 (storage)
Phases 2–6 ───────────────────────────────────────────── Phase 11 (admin)
Phase 1 ──────────────────────────────────────────────── Phase 13 (integration harness)
All phases ───────────────────────────────────────────── Phase 14 (E2E)
```

Phases 2, 3, 4 have no inter-dependencies and can run in parallel after Phase 1.
Phases 9, 10, 11, 12 can run in parallel after their prerequisites are met.

## UX + API Vertical Slice Strategy

To reduce integration risk, deliver backend and frontend together per slice after pre-phases:

1. Slice A — Auth + Profile
- API: auth/session + `GET /users/me`
- UX: login redirect/exchange/logout, protected layout, session expiry handling

2. Slice B — Marketplace + Allocations Read
- API: `GET /skus`, `GET /nodes`, `GET /allocations`, `GET /allocations/{id}`
- UX: capacity cards, allocation list/detail, async state rendering

3. Slice C — Provision/Release + Terminal
- API: create/release allocation, terminal token endpoint, terminal WS path
- UX: request/provisioning/releasing lifecycle states, terminal connect/reconnect flow

4. Slice D — Billing + Payments
- API: balance/usage/csv, checkout session, portal session
- UX: balance cards, usage table/export, payment redirect outcomes

5. Slice E — Admin Ops
- API: admin users/nodes/allocations/audit endpoints
- UX: admin tables, filters, force-release and refund workflows, audit export

6. Slice F — Storage
- API: list/upload/download/mkdir/rename/delete
- UX: file explorer interactions with path-safety errors and confirmations

Rule:
- If a UX flow needs multiple services, ship partial sections with explicit loading/degraded states rather than blocking full-screen delivery.
