# GPUaaS — Agent Context

GPU-as-a-Service platform: users provision GPU nodes, get billed per GPU-hour, and
access them via SSH or browser terminal. Built contract-first; all agent work must stay
within the contracts defined in `doc/api/`.

---

## Tech Stack

| Layer | Technology |
|---|---|
| Frontend | Next.js + TypeScript (`packages/web`) |
| CLI | Go — `cmd/gpuaas-cli` (operator/user CLI for auth/catalog/projects/nodes/allocations/billing) |
| API server | Go — single binary `cmd/api` (BFF, imports all domain packages) |
| Workers | Go — `cmd/billing-worker`, `cmd/provisioning-worker`, `cmd/webhook-worker`, `cmd/notification-relay`, `cmd/outbox-relay` |
| Node agent | Go — `cmd/node-agent` (deployed on GPU nodes; pull-based typed-task model, mTLS) |
| Workflow engine | Temporal (`temporalio/auto-setup:1.24`) |
| Event bus | NATS JetStream 2.10 |
| Database | PostgreSQL 16 (`pgx/v5` + `pgxpool`) |
| Cache / rate limit | Redis 7 (`go-redis/v9`) |
| Identity / OIDC | Keycloak 26 (local dev: H2 in-memory, realm imported from JSON) |
| Payments | Stripe (`stripe-go/v76`) |
| Observability | OpenTelemetry + Prometheus + structured JSON logs |
| Node access (user sessions) | Browser terminal via node-agent stream relay (`cmd/terminal-gateway` ↔ `cmd/api` internal stream ↔ `cmd/node-agent`) |
| Terminal WS topology | Current: dedicated `cmd/terminal-gateway` WebSocket endpoint (`/ws/terminal/{allocation_id}`), API mints/validates session bindings |
| Bare metal provisioning (planned) | Canonical MAAS (Metal as a Service) — optional, gated by `MAAS_ENABLED`; delivers cloud-init at deploy time, manages BMC/power, enables full re-image isolation between allocations |
| Internal PKI (planned) | Smallstep step-ca — node/worker mTLS certs (24h, X5C renewal); Vault PKI migration path via `packages/shared/pki.CAClient` interface |

---

## Repo Layout

```
/
├── cmd/
│   ├── api/                    # BFF entry point — all public HTTP routes
│   ├── billing-worker/         # Scheduled billing accrual
│   ├── provisioning-worker/    # Temporal worker — node-task provisioning lifecycle (agent/noop runtime modes)
│   ├── webhook-worker/         # Stripe webhook consumer
│   ├── notification-relay/     # NATS -> Redis Pub/Sub bridge for WS notifications
│   └── outbox-relay/           # Outbox -> NATS publisher process
├── packages/
│   ├── shared/
│   │   ├── errors/             # ErrorResponse type + ErrCode constants
│   │   ├── events/             # NATS client, typed event structs, InitStreams()
│   │   ├── middleware/         # Auth (JWKS), rate-limit, PII scrubber, OTel
│   │   ├── outbox/             # Outbox relay primitives (claim/publish/retry)
│   │   └── policy/             # PolicyClient interface → queries policy_values table
│   ├── platform/               # Shared platform services: IAM/auth, billing, audit, evidence, policy, storage, MAAS, etc.
│   └── products/
│       ├── gpuaas/             # GPUaaS product domains: inventory, provisioning, terminal
│       └── appplatform/        # App Platform product domains: catalog, runtime, SDK
├── doc/                        # All documentation (canonical reference)
│   ├── api/
│   │   ├── openapi.draft.yaml  # REST API contracts (canonical bundled artifact)
│   │   ├── asyncapi.draft.yaml # Event contracts (canonical bundled artifact)
│   │   ├── openapi/            # Domain OpenAPI authoring manifests/fragments
│   │   └── asyncapi/           # Domain AsyncAPI authoring manifests/fragments
│   ├── architecture/           # DB schema, ERD, state machines, ADRs, etc.
│   └── governance/             # Coding standards, testing standards, CI gates
├── scripts/
│   ├── seed.sql                # Bootstrap reference data (idempotent)
│   ├── codegen.sh              # Generate Go + TS OpenAPI artifacts
│   └── ci/                     # Reusable CI gate scripts (platform-agnostic)
├── cmd/gpuaas-cli/             # CLI entrypoint
└── doc/operations/local-dev/   # docker-compose + Keycloak realm + env example
```

Implementation note: the former `packages/services/*` tree has been retired.
Use `packages/platform/*` or `packages/products/*` for all new domain work.

---

## Local Dev — Quick Start

```bash
# 1. Copy env
cp doc/operations/local-dev/env.local.example .env.local

# 2. Start infra (Postgres, Redis, NATS, Temporal, Keycloak)
docker compose --env-file .env.local -f doc/operations/local-dev/docker-compose.yaml up -d

# 3. Apply schema + seed
psql "$DATABASE_URL" -f doc/architecture/db_schema_v1.sql
psql "$DATABASE_URL" -f scripts/seed.sql

# 4. Run API server (with hot-reload via air)
make dev-api
```

Get a dev bearer token:
```bash
curl -s -X POST http://localhost:8080/realms/gpuaas/protocol/openid-connect/token \
  -d "grant_type=password&client_id=gpuaas-api&client_secret=dev-client-secret&username=dev-user&password=dev123" \
  | jq -r .access_token
```

Dev users: `dev-user` / `dev123` (role: user) · `dev-admin` / `admin123` (role: user, admin)

---

## Key Makefile Targets

```
make dev-infra              # start infra docker-compose
make dev-api                # go run ./cmd/api with hot-reload
make dev-worker-billing     # go run ./cmd/billing-worker
make dev-worker-provisioning # go run ./cmd/provisioning-worker
make dev-worker-notification # go run ./cmd/notification-relay
make dev-worker-outbox      # go run ./cmd/outbox-relay
make build-cli              # build CLI binary (host os/arch by default)
make test                   # go test ./...
make test-integration       # go test ./... -tags integration
make verify-web             # pnpm typecheck + full web tests
make lint                   # golangci-lint run
make codegen                # regenerate Go + TS artifacts from openapi.draft.yaml
make db-init                # apply db_schema_v1.sql
make seed                   # apply scripts/seed.sql
make e2e-up                 # full-stack docker-compose (all services + web)
```

---

## Allocation State Machine

```
requested → provisioning → active → releasing → released
               ↘ failed        ↘ release_failed (admin retry via force-release)
```

Status values: `requested` `provisioning` `active` `releasing` `released` `failed` `release_failed`

- **`release_failed`**: all SSH retries exhausted. Billing stops. Node stays assigned
  until admin retries via `POST /api/v1/admin/allocations/{id}/force-release`.
- User can also retry: `POST /api/v1/allocations/{id}/release` on a `release_failed`
  allocation transitions it back to `releasing`.
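
The transitions above can be sketched as a lookup table. This is a minimal illustration, not the platform's implementation: the `Status` type, the `transitions` map, and `CanTransition` are hypothetical names, and the map assumes that `release_failed` is reached from `releasing` and that both retry paths lead back to `releasing`.

```go
package main

// Status is a hypothetical type for the allocation status values listed above.
type Status string

const (
	Requested     Status = "requested"
	Provisioning  Status = "provisioning"
	Active        Status = "active"
	Releasing     Status = "releasing"
	Released      Status = "released"
	Failed        Status = "failed"
	ReleaseFailed Status = "release_failed"
)

// transitions encodes the arrows in the diagram plus the retry path
// out of release_failed (assumed to return to releasing).
var transitions = map[Status][]Status{
	Requested:     {Provisioning},
	Provisioning:  {Active, Failed},
	Active:        {Releasing},
	Releasing:     {Released, ReleaseFailed},
	ReleaseFailed: {Releasing},
}

// CanTransition reports whether moving from one status to another is allowed.
// Terminal states (released, failed) have no outgoing transitions.
func CanTransition(from, to Status) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}
```

A handler could call `CanTransition(current, Releasing)` before accepting a release request and return `allocation_not_active` or `allocation_already_releasing` when it reports false.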

---

## Error Handling — Required Pattern

Every error response must use `ErrorResponse` from `packages/shared/errors`:
```json
{ "code": "<catalog_code>", "message": "human text", "correlation_id": "...", "details": {} }
```

`correlation_id` is **required** in every error response. `details` is required for `validation_error`.
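
As a rough sketch of these two requirements, the constructor below rejects a response without a `correlation_id` and a `validation_error` without `details`. The struct is an illustrative copy of the JSON shape only; the real `ErrorResponse` in `packages/shared/errors` may differ, and `NewErrorResponse` is a hypothetical helper name.

```go
package main

import (
	"encoding/json"
	"errors"
)

// ErrorResponse mirrors the JSON shape shown above (illustrative copy;
// the canonical type lives in packages/shared/errors).
type ErrorResponse struct {
	Code          string         `json:"code"`
	Message       string         `json:"message"`
	CorrelationID string         `json:"correlation_id"`
	Details       map[string]any `json:"details"`
}

// NewErrorResponse (hypothetical helper) enforces the two required-field rules.
func NewErrorResponse(code, message, correlationID string, details map[string]any) (ErrorResponse, error) {
	if correlationID == "" {
		return ErrorResponse{}, errors.New("correlation_id is required")
	}
	if code == "validation_error" && len(details) == 0 {
		return ErrorResponse{}, errors.New("details is required for validation_error")
	}
	if details == nil {
		details = map[string]any{} // serialize as {} rather than null
	}
	return ErrorResponse{Code: code, Message: message, CorrelationID: correlationID, Details: details}, nil
}

// JSON renders the response body.
func (e ErrorResponse) JSON() ([]byte, error) {
	return json.Marshal(e)
}
```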

### Error Code Catalog (all valid values for `code`)

| Group | Codes |
|---|---|
| Auth | `token_missing` `token_invalid` `token_expired` `token_scope_invalid` |
| Authz | `insufficient_permissions` `admin_required` `ownership_required` |
| Validation | `validation_error` `invalid_request` |
| Allocation | `allocation_not_found` `allocation_not_active` `allocation_already_releasing` `allocation_concurrency_limit` `insufficient_balance` `sku_unavailable` |
| Node | `node_not_found` `node_offline` `node_in_use` `node_already_exists` |
| User | `user_not_found` `user_already_exists` |
| Billing/Pay | `stripe_signature_invalid` `refund_window_exceeded` |
| Storage | `storage_object_not_found` `storage_path_traversal` `storage_already_exists` `storage_quota_exceeded` |
| Catalog | `sku_not_found` |
| Rate limit | `rate_limit_exceeded` |
| Server | `internal_error` `upstream_error` `service_unavailable` |

Do not invent codes outside this list. Add to `doc/architecture/Error_Code_Catalog.md` first.

---

## Domain Events

All events use this envelope:
```json
{ "event_id": "uuid", "event_type": "domain.name", "occurred_at": "RFC3339",
  "version": "1.0", "correlation_id": "...", "payload": {} }
```

| Subject | Producer | Key consumers |
|---|---|---|
| `provisioning.requested` | orchestrator | provisioning-worker |
| `provisioning.active` | provisioning-worker | billing-worker, notification-relay |
| `provisioning.failed` | provisioning-worker | notification-relay |
| `provisioning.releasing.requested` | orchestrator | provisioning-worker |
| `provisioning.releasing.completed` | provisioning-worker | billing-worker, notification-relay |
| `provisioning.release_failed` | provisioning-worker | billing-worker, notification-relay |
| `provisioning.force_release_requested` | billing-worker | provisioning-worker |
| `billing.low_balance_warning` | billing-worker | notification-relay |
| `billing.auto_release_pending` | billing-worker | notification-relay |
| `billing.balance_depleted` | billing-worker | notification-relay |
| `payments.balance_credited` | payments-svc | billing-worker |

NATS streams: `BILLING` (`billing.>`), `PAYMENTS` (`payments.>`), `PROVISIONING` (`provisioning.>`), `DLQ` (`dlq.>`)

Full consumer catalog: `doc/architecture/NATS_Stream_Config.md`
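
The envelope above maps naturally onto a Go struct. The sketch below is an assumption about how such a type could look, not the typed structs in `packages/shared/events`; `Envelope`, `NewEnvelope`, and `newUUIDv4` are hypothetical names, and the UUID is generated with the standard library only to keep the example dependency-free.

```go
package main

import (
	"crypto/rand"
	"encoding/json"
	"fmt"
	"time"
)

// Envelope mirrors the event envelope fields shown above (illustrative only).
type Envelope struct {
	EventID       string          `json:"event_id"`
	EventType     string          `json:"event_type"`
	OccurredAt    string          `json:"occurred_at"`
	Version       string          `json:"version"`
	CorrelationID string          `json:"correlation_id"`
	Payload       json.RawMessage `json:"payload"`
}

// newUUIDv4 builds a random (version 4) UUID string.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// NewEnvelope wraps a payload with a fresh event_id and an RFC3339 timestamp.
func NewEnvelope(eventType, correlationID string, payload any) (Envelope, error) {
	raw, err := json.Marshal(payload)
	if err != nil {
		return Envelope{}, err
	}
	id, err := newUUIDv4()
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{
		EventID:       id,
		EventType:     eventType,
		OccurredAt:    time.Now().UTC().Format(time.RFC3339),
		Version:       "1.0",
		CorrelationID: correlationID,
		Payload:       raw,
	}, nil
}
```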

---

## Policy Keys

Read policy values from the `policy_values` table via `packages/shared/policy.PolicyClient`.
**Never hardcode these values in service code.**

| Key | Type | Default | Meaning |
|---|---|---|---|
| `billing.low_balance_threshold_minor` | int | 500 | cents below which warning fires |
| `billing.window_seconds` | int | 60 | billing accrual interval |
| `billing.minimum_deposit_minor` | int | 1000 | min Stripe checkout (cents) |
| `billing.maximum_deposit_minor` | int | 100000 | max Stripe checkout (cents) |
| `allocation.max_concurrent_per_user` | int | 50 | max active allocations per user |
| `allocation.refund_window_days` | int | 30 | refund eligibility window |
| `rate_limit.api_requests_per_minute` | int | 120 | default rate limit per user |
| `rate_limit.terminal_token_requests_per_minute` | int | 10 | terminal token mint rate limit |
| `rate_limit.financial_requests_per_minute` | int | 30 | payments/refunds/balance rate limit |
| `rate_limit.admin_overview_requests_per_minute` | int | 600 | admin overview polling rate limit |
| `terminal.session_max_ttl_seconds` | int | 14400 | max active terminal session lifetime (independent of token TTL; enforced by gateway and node-agent) |
| `notification.low_balance_enabled` | bool | true | enable low-balance alerts |
| `notification.balance_depleted_enabled` | bool | true | enable depletion alerts |
| `auth.service_account_token_ttl_seconds` | int | 900 | service-account access token TTL |
| `allocation.isolation_model` | string | `user-revoke` | Node isolation between allocations: `user-revoke` (revoke OS user, keep node enrolled) or `full-reimage` (MAAS re-deploy full OS wipe; requires `maas.enabled=true`) |
| `maas.enabled` | bool | false | Enable MAAS bare-metal integration (node deploy/release via MAAS API) |
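
A minimal sketch of reading a threshold through a policy client instead of a literal follows. The `PolicyClient` interface shown here is a hypothetical single-method stand-in (the real interface in `packages/shared/policy` is not reproduced in this document and would likely also take a `context.Context`), and `staticPolicy` is an in-memory fake for illustration.

```go
package main

import "fmt"

// PolicyClient is a hypothetical, simplified view of the policy client.
type PolicyClient interface {
	GetInt(key string) (int, error)
}

// staticPolicy is an in-memory fake standing in for the policy_values table.
type staticPolicy map[string]int

func (s staticPolicy) GetInt(key string) (int, error) {
	v, ok := s[key]
	if !ok {
		return 0, fmt.Errorf("policy key %q not found", key)
	}
	return v, nil
}

// IsLowBalance compares a balance (in cents) against the policy threshold.
// The threshold is looked up at call time, never hardcoded.
func IsLowBalance(p PolicyClient, balanceMinor int) (bool, error) {
	threshold, err := p.GetInt("billing.low_balance_threshold_minor")
	if err != nil {
		return false, err
	}
	return balanceMinor < threshold, nil
}
```

In tests, a fake such as `staticPolicy{"billing.low_balance_threshold_minor": 500}` keeps the default value in the fixture rather than in service code.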

---

## Coding Rules (mandatory)

**1. Contract-first** — every API or event change starts in the contract under
`doc/api/`: use the domain fragment when that domain has migrated, otherwise update the
canonical bundled artifact (`openapi.draft.yaml` or `asyncapi.draft.yaml`). Code follows
the spec, never the reverse.

**1a. UX-first gate before feature coding** — complete and sign off
`doc/product/UX_Implementation_Spec.md` and `doc/product/UX_Journeys.md`
before implementing user-facing feature flows.

For any major new page, major page refactor, new navigation section, or
multi-step workflow:
- place it in the current product information architecture first
- define the intended role, scope, and primary user intent
- confirm its page family and shell/navigation fit
- produce or update the relevant mock/spec before broad implementation

Do not keep adding major user/admin/ops surfaces as isolated page work while
the product structure is unclear. Resolve IA and mock questions first so the
same surfaces do not get redesigned repeatedly.

**1a.1. UX/e2e validation gate** — any change that touches user-visible web UX,
navigation, page layout, role/persona flows, browser app connect flows, auth
redirects, route guards, or web API consumption must run the frontend e2e gate
before the task is marked done.

Required pattern:
- run `make verify-web` or the focused equivalent for type/unit coverage,
- run `bash scripts/ci/frontend_e2e.sh` for full web-flow validation,
- when narrowing scope, set `E2E_SPEC=<spec>` only when the touched journey is
  fully covered by that spec and record why the full matrix was not needed,
- for v3 visual/shell/navigation work, use the managed harness path, not an
  ad hoc Playwright command against an existing localhost server,
- record the frontend e2e command, result, and artifact/log path as Fairway
  evidence.

CI failures such as `frontend_e2e` pipeline failures after a UX change are
process misses unless the task evidence already shows the matching local e2e
gate was run or explicitly blocked. Create a `HARNESS-FIX-*` task for false
positives/false negatives, and a `UAT-BUG-*` or normal implementation task for
real product regressions.

**1a.2. Contract/codegen validation gate** — any OpenAPI, AsyncAPI, generated
SDK/client, API response shape, route contract, or frontend API consumption
change must validate generated artifacts before completion.

Required pattern:
- update the contract first,
- run `make codegen` or `bash scripts/ci/sdk_codegen_smoke.sh`,
- run `CODEGEN_ENFORCE_CLEAN=1 bash scripts/ci/sdk_codegen_smoke.sh` before
  marking contract/client work done,
- commit generated Go/TypeScript artifacts with the contract change,
- if codegen changes unexpectedly, treat that as contract drift and resolve it
  before pushing.

CI failures such as stale `packages/web/src/lib/gen/openapi.types.ts`, generated
Go client drift, or SDK smoke failures are process misses unless the task
evidence includes the enforced-clean codegen gate.

**1a.3. Flow/dependency mapping gate** — before implementing, broadening UAT,
or scheduling an approval-gated drill for any user-visible, operator-visible,
or live-environment workflow, map the flow and its dependencies explicitly.

Use `doc/governance/Product_Gap_Readiness_Gate.md` and
`doc/operations/Product_Quality_Flow_Coverage_Operating_Model_v1.md` as the
governing model. For critical P0/P1 flows, update or create the flow row before
feature code or live validation starts. The row must name:

- persona and canonical entry point;
- happy, empty, blocked, recovery, negative, and cleanup paths;
- contract/API/CLI/runtime owners;
- fixture, identity, permission, provider, DNS/edge, browser, CI/CD, and
  environment prerequisites;
- preflight proof that can run before broad UAT or a live window;
- evidence owner, rollback/cleanup owner, and accepted residual gaps.

If any required dependency is unknown, do discovery or create a scoped
`UAT-BUG-*`, `HARNESS-FIX-*`, `OPS-FIX-*`, `CI-FIX-*`, `CD-FIX-*`, or
`DOC-FIX-*` task before proceeding. Do not let broad UAT, user testing, deploy,
or a live drill be the first place where missing flow behavior, provider API
semantics, browser/runtime viability, fixture setup, or cleanup ownership is
discovered.

**1b. v3 migration and domain-ownership foundation** — v3 is the long-term product
and code organization model. Current v1 routes are only a frozen demo/internal-user
continuity surface while the migration is in progress; do not treat v1 as a public
backward-compatibility contract.

Before adding or changing product/API/UI work:
- read `doc/architecture/AI_Factory_Team_Domain_Operating_Model_v1.md`
- read `doc/architecture/API_Domain_Authoring_Model_v1.md`
- read `doc/architecture/API_Route_Modularization_and_V1_Freeze_v1.md`
- keep new UI work aligned to the v3 shell/page-family model
- keep `/api/v1/v3/*` read models isolated until their target domain route is clear
- do not add new product routes to `cmd/api/routes_v1_frozen.go` unless fixing a real demo/internal bug, security issue, or migration continuity issue
- prefer domain route files (`routes_provisioning_*.go`, `routes_platform_*.go`, `routes_access_*.go`, etc.) for durable production routes

**2. Idempotent mutations** — all POST/PUT/PATCH must be safe to retry with the same
`X-Idempotency-Key`. Exception: terminal token minting (single-use by design, documented
in the endpoint description).
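
One way to picture rule 2 is a store that runs a mutation at most once per key and replays the stored result on retry. This is an in-memory sketch under stated assumptions: `IdempotencyStore` and `Do` are hypothetical names, results are plain strings, and a real implementation would persist keys (for example in Postgres or Redis) rather than in a map.

```go
package main

import "sync"

// IdempotencyStore (hypothetical) remembers the result produced for each key.
type IdempotencyStore struct {
	mu      sync.Mutex
	results map[string]string
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{results: make(map[string]string)}
}

// Do runs fn at most once per key. A retry with the same key gets the stored
// result back without running fn again. Failures are not stored, so the
// client can retry them. The lock is held while fn runs, which serializes
// all calls; that is acceptable for a sketch but not for production.
func (s *IdempotencyStore) Do(key string, fn func() (string, error)) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.results[key]; ok {
		return res, nil
	}
	res, err := fn()
	if err != nil {
		return "", err
	}
	s.results[key] = res
	return res, nil
}
```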

**3. No hardcoded business policy constants** — billing thresholds, rate limits, concurrency
limits, refund windows must come from `PolicyClient`, not literals. Test-only constants in
`_test.go` files are allowed.

**4. Sanitize first** — before logging or emitting a trace span, pass the request/response
through `packages/shared/middleware.Sanitize()`. Never log: `password`, `password_hash`,
`access_token`, `refresh_token`, `id_token`, `ssh_private_key*`, `stripe_customer_id`,
`payment_reference`. Redact as `[REDACTED]`, never omit the field.
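
The redaction rule can be illustrated with a small recursive helper. This is a sketch only: `RedactFields` is a hypothetical function, not the signature of `middleware.Sanitize()`, and it assumes the payload has already been decoded into `map[string]any` / `[]any` values.

```go
package main

import "strings"

// sensitiveKeys lists the exact field names from rule 4.
var sensitiveKeys = map[string]bool{
	"password":           true,
	"password_hash":      true,
	"access_token":       true,
	"refresh_token":      true,
	"id_token":           true,
	"stripe_customer_id": true,
	"payment_reference":  true,
}

// isSensitive also matches the ssh_private_key* prefix pattern.
func isSensitive(key string) bool {
	k := strings.ToLower(key)
	return sensitiveKeys[k] || strings.HasPrefix(k, "ssh_private_key")
}

// RedactFields returns a copy of v in which sensitive values are replaced by
// "[REDACTED]". Fields are kept, never omitted, and the input is not mutated.
func RedactFields(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if isSensitive(k) {
				out[k] = "[REDACTED]"
			} else {
				out[k] = RedactFields(val)
			}
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = RedactFields(val)
		}
		return out
	default:
		return v
	}
}
```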

**5. Immutable ledger** — never UPDATE or DELETE `ledger_entries` rows. Corrections are
new entries. No direct balance column; balance is always computed from the ledger.
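
A toy model of the append-only rule: the balance is a sum over entries, and a correction is just another entry. The `LedgerEntry` fields and the `Ledger` type are hypothetical simplifications and do not reflect the actual `ledger_entries` schema.

```go
package main

// LedgerEntry is a hypothetical, simplified row shape.
type LedgerEntry struct {
	ID          int
	AmountMinor int64 // positive credits, negative debits, in minor units
	Reason      string
}

// Ledger is an append-only in-memory stand-in for the ledger table.
// It deliberately offers no update or delete operation.
type Ledger struct {
	entries []LedgerEntry
}

// Append adds a new entry and returns it.
func (l *Ledger) Append(amountMinor int64, reason string) LedgerEntry {
	e := LedgerEntry{ID: len(l.entries) + 1, AmountMinor: amountMinor, Reason: reason}
	l.entries = append(l.entries, e)
	return e
}

// Balance is always derived from the entries; nothing stores it directly.
func (l *Ledger) Balance() int64 {
	var total int64
	for _, e := range l.entries {
		total += e.AmountMinor
	}
	return total
}
```

For example, a 1000-cent deposit, a 300-cent usage debit, and a 50-cent correction credit leave three rows and a computed balance of 750 cents.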

**6. Outbox for events** — write the domain change + the outbox row in the same Postgres
transaction. Never publish to NATS directly from a handler; the outbox relay publishes.
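
The shape of rule 6 can be sketched as one function that issues both writes through the same transaction handle. Everything named here is an assumption for illustration: the `Tx` interface is a minimal stand-in for a real `pgx` transaction, and the `allocations` and `outbox` table and column names are hypothetical rather than taken from `db_schema_v1.sql`.

```go
package main

import "errors"

// Tx is a hypothetical minimal transaction handle.
type Tx interface {
	Exec(query string, args ...any) error
}

// SaveWithOutbox writes the domain change and the outbox row through the same
// transaction, so both commit or roll back together. It never talks to NATS;
// the outbox relay publishes the row later.
func SaveWithOutbox(tx Tx, allocationID string, eventJSON []byte) error {
	if allocationID == "" {
		return errors.New("allocation id is required")
	}
	if err := tx.Exec(
		`UPDATE allocations SET status = 'active' WHERE id = $1::uuid`, // hypothetical table
		allocationID,
	); err != nil {
		return err
	}
	return tx.Exec(
		`INSERT INTO outbox (subject, payload) VALUES ($1::text, $2::jsonb)`, // hypothetical table
		"provisioning.active", eventJSON,
	)
}

// recordingTx is a test double that records the statements it receives.
type recordingTx struct {
	queries []string
	failOn  int // 1-based index of the statement that should fail; 0 = never
}

func (r *recordingTx) Exec(query string, args ...any) error {
	r.queries = append(r.queries, query)
	if r.failOn == len(r.queries) {
		return errors.New("simulated database error")
	}
	return nil
}
```

The caller is expected to roll back the transaction when `SaveWithOutbox` returns an error, which leaves neither the domain change nor the event behind.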

**7. Stripe raw-body-first** — for `POST /api/v1/payments/webhook`, buffer the raw request
body BEFORE any JSON parsing. Signature verification requires the exact bytes Stripe sent.
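
To show why the raw bytes matter, here is a standard-library sketch of Stripe's documented scheme: the `Stripe-Signature` header carries `t=<timestamp>` and one or more `v1=<hex>` values, and the signature is HMAC-SHA256 over `<timestamp>.<raw body>`. The function names are hypothetical, and the sketch omits the timestamp tolerance check that a production verifier (for example the `webhook` package in `stripe-go`) performs.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"io"
	"net/http"
	"strings"
)

// ComputeStripeSignature returns the hex HMAC-SHA256 of "<timestamp>.<rawBody>".
func ComputeStripeSignature(rawBody []byte, timestamp, secret string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp + "."))
	mac.Write(rawBody)
	return hex.EncodeToString(mac.Sum(nil))
}

// VerifyStripeSignature checks the header against the exact bytes received.
func VerifyStripeSignature(rawBody []byte, header, secret string) error {
	var timestamp string
	var signatures []string
	for _, part := range strings.Split(header, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			continue
		}
		switch k {
		case "t":
			timestamp = v
		case "v1":
			signatures = append(signatures, v)
		}
	}
	if timestamp == "" || len(signatures) == 0 {
		return errors.New("malformed Stripe-Signature header")
	}
	expected := ComputeStripeSignature(rawBody, timestamp, secret)
	for _, sig := range signatures {
		if hmac.Equal([]byte(sig), []byte(expected)) { // constant-time compare
			return nil
		}
	}
	return errors.New("signature mismatch")
}

// ReadAndVerify buffers the body first, verifies it, and only then hands the
// bytes back for JSON parsing.
func ReadAndVerify(r *http.Request, secret string) ([]byte, error) {
	raw, err := io.ReadAll(r.Body)
	if err != nil {
		return nil, err
	}
	if err := VerifyStripeSignature(raw, r.Header.Get("Stripe-Signature"), secret); err != nil {
		return nil, err
	}
	return raw, nil
}
```

A body that has been decoded and re-encoded can differ by as little as one space, which is enough to change the HMAC; a failed check maps to `stripe_signature_invalid`.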

**8. No query-string tokens** — auth material (`Authorization`, terminal/notification tokens)
must never be passed in URL query parameters. For browser WebSockets, pass auth tokens via
`Sec-WebSocket-Protocol` (approved exception to header-only transport). Non-browser clients
may use `Authorization` header where supported. Never `?token=` in any endpoint.

**9. Audit required** — every privileged mutation must write an `audit_logs` row containing
`actor_user_id`, `actor_role`, `action`, `target_type`, `target_id`, `result`, `correlation_id`.

**10. DB access boundaries** — each service package queries only its own domain tables. No
cross-domain direct DB joins. Cross-domain data flows through NATS events or explicit API calls.

**10a. API-first ops verification** — operator, admin, and debugging verification should use
public/admin APIs or explicit read-model surfaces by default, not direct SQL. Temporary direct DB
inspection is acceptable only while the owning operator/debug surface is missing; if a query is
needed repeatedly during implementation or incident work, add the corresponding GET/read-model API
and treat the missing surface as a product gap.

**11. CI portability** — keep gate logic in `scripts/ci/*.sh`; CI workflow files
(`.gitlab-ci.yml` / GitHub Actions) are orchestration wrappers only.
If the CI host changes, reuse the scripts and adapt only the runner/secrets wiring.

**12. Root-cause ownership (no symptom-only fixes)** — for any bug, fix the owning layer
(contract/schema/query/service/runtime/UI boundary). If the owner is outside current scope,
mark the task `blocked` and create an upstream fix task; do not mark it `done` with workaround-only patches.

**13. Postgres polymorphic typing safety** — when using `jsonb_build_object(...)` (or other
polymorphic functions), explicitly cast bind parameters (`$n::text`, `$n::int`, `$n::uuid`).
Do not rely on implicit parameter typing.
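
`jsonb_build_object` takes variadic arguments of any type, so Postgres cannot infer the type of a bare `$n` and may fail with an error such as "could not determine data type of parameter". The sketch below shows a query with explicit casts plus a small, hypothetical lint helper (`UncastParams`) that flags bind parameters lacking a cast; the field names in the query are illustrative.

```go
package main

import "regexp"

// detailsQuery casts every bind parameter explicitly.
const detailsQuery = `
SELECT jsonb_build_object(
  'allocation_id', $1::uuid,
  'reason',        $2::text,
  'attempt',       $3::int
)`

// bindParam matches $n, optionally followed by a ::type cast.
var bindParam = regexp.MustCompile(`\$\d+(::[a-z_]+)?`)

// UncastParams returns the bind parameters in query that have no explicit cast.
func UncastParams(query string) []string {
	var missing []string
	for _, m := range bindParam.FindAllStringSubmatch(query, -1) {
		if m[1] == "" {
			missing = append(missing, m[0])
		}
	}
	return missing
}
```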

**14. 5xx classification gate** — every new/changed 5xx path must be classified in review as:
- upstream dependency (`upstream_error` / `service_unavailable`), or
- local defect (`internal_error`)

Then add a regression test covering that path.
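
One possible way to make the classification explicit in code is a single function that maps an error to a status and catalog code. The sentinel errors, the `UpstreamError` helper, and the specific HTTP statuses (502/503/500) are assumptions for illustration; this document only fixes the catalog codes.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used to tag failures at their origin.
var (
	ErrUpstream    = errors.New("upstream dependency failed")
	ErrUnavailable = errors.New("service unavailable")
)

// UpstreamError tags a failure as coming from an upstream dependency.
func UpstreamError(detail string) error {
	return fmt.Errorf("%w: %s", ErrUpstream, detail)
}

// Classify5xx maps an error to an HTTP status and a catalog code.
// Anything not explicitly tagged is treated as a local defect.
func Classify5xx(err error) (status int, code string) {
	switch {
	case errors.Is(err, ErrUnavailable):
		return 503, "service_unavailable"
	case errors.Is(err, ErrUpstream):
		return 502, "upstream_error"
	default:
		return 500, "internal_error"
	}
}
```

A regression test for a new 5xx path can then assert on the returned code rather than on log output.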

---

## JWT / Auth

`cmd/api` validates bearer tokens locally using a cached JWKS from Keycloak (refreshed every
5 min). No per-request Keycloak call. Required claims:

| Claim | Used for |
|---|---|
| `sub` | `user_id` in all authz checks |
| `realm_access.roles` | RBAC: `user` or `admin` |
| `exp` | Token expiry |
| `iss` | Must match `KEYCLOAK_ISSUER_URL` |
| `org_id` | Tenant scoping (custom claim, nullable) |
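
The claim checks can be sketched as follows. This covers only the step after signature verification against the cached JWKS, which is not shown; `Claims`, `ParseClaims`, `Check`, and `HasRole` are hypothetical names, and the mapping of failures to `token_invalid` / `token_expired` is an assumption based on the error catalog.

```go
package main

import "encoding/json"

// Claims holds the required claims from the table above.
type Claims struct {
	Sub         string  `json:"sub"`
	Exp         int64   `json:"exp"`
	Iss         string  `json:"iss"`
	OrgID       *string `json:"org_id"` // nullable custom claim
	RealmAccess struct {
		Roles []string `json:"roles"`
	} `json:"realm_access"`
}

// ParseClaims decodes an already signature-verified JWT payload.
func ParseClaims(payload []byte) (Claims, error) {
	var c Claims
	err := json.Unmarshal(payload, &c)
	return c, err
}

// Check returns "" when the claims are acceptable, or a catalog error code.
func (c Claims) Check(expectedIssuer string, nowUnix int64) string {
	switch {
	case c.Sub == "" || c.Iss != expectedIssuer:
		return "token_invalid"
	case nowUnix >= c.Exp: // exp is the time on or after which the token is rejected
		return "token_expired"
	default:
		return ""
	}
}

// HasRole reports whether realm_access.roles contains role.
func (c Claims) HasRole(role string) bool {
	for _, r := range c.RealmAccess.Roles {
		if r == role {
			return true
		}
	}
	return false
}
```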

Terminal tokens (MVP): opaque 256-bit random, stored in Redis with TTL 300s, single-use
(deleted on first use). Key: `terminal_token:{token}` → `{user_id, allocation_id, expiry}`.
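
A minimal in-memory sketch of the mint/consume cycle follows. It is not the Redis-backed implementation: `TokenStore`, `Binding`, `Mint`, and `Consume` are hypothetical names, the map stands in for Redis, and the TTL is passed in by the caller rather than hardcoded. With Redis, the atomic read-and-delete could be done with `GETDEL`.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"sync"
)

// Binding is the value stored under terminal_token:{token}.
type Binding struct {
	UserID       string
	AllocationID string
	ExpiryUnix   int64
}

// TokenStore is an in-memory stand-in for the Redis keyspace.
type TokenStore struct {
	mu       sync.Mutex
	bindings map[string]Binding
}

func NewTokenStore() *TokenStore {
	return &TokenStore{bindings: make(map[string]Binding)}
}

// Mint creates an opaque 256-bit random token (64 hex characters).
func (s *TokenStore) Mint(userID, allocationID string, nowUnix, ttlSeconds int64) (string, error) {
	buf := make([]byte, 32) // 32 bytes = 256 bits
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	token := hex.EncodeToString(buf)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.bindings["terminal_token:"+token] = Binding{userID, allocationID, nowUnix + ttlSeconds}
	return token, nil
}

// Consume returns the binding once. The key is deleted on the first lookup,
// whether or not the token is still within its TTL.
func (s *TokenStore) Consume(token string, nowUnix int64) (Binding, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := "terminal_token:" + token
	b, ok := s.bindings[key]
	if !ok {
		return Binding{}, false
	}
	delete(s.bindings, key)
	if nowUnix >= b.ExpiryUnix {
		return Binding{}, false
	}
	return b, true
}
```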

Terminal WS runtime ownership:
- Current: `cmd/terminal-gateway` handles `/ws/terminal/{allocation_id}`.
- API remains the control-plane authority for token minting/session binding and internal
  node stream relay checks; public terminal contract remains unchanged.

Auth endpoint visibility rules:
- `POST /api/v1/auth/login`: internal/dev bootstrap only (do not expose in production UX by default).
- `POST /api/v1/auth/token/refresh`: public session renewal endpoint (OIDC-backed sessions supported).

Policy key authority for implementation:
- Treat `scripts/seed.sql` and `doc/architecture/Seed_Data_Spec.md` as the authoritative
  source for currently implemented policy keys/defaults.

---

## Development Workflow

Follow this sequence for every feature. Do not skip steps — each gate catches
a different class of problem.

Required coordination reading before starting Fairway-tracked work:
- `doc/operations/Fairway_Agent_Operating_Model.md`
- `doc/operations/Fairway_Review_Operating_Model.md`
- `doc/governance/Product_Gap_Readiness_Gate.md` and
  `doc/operations/Product_Quality_Flow_Coverage_Operating_Model_v1.md` for
  user/admin/operator workflows, UAT, deploy-readiness, and approval-gated
  drills
- `doc/operations/Shared_Service_Lane_Worktree_Model_v1.md` when creating,
  delegating, merging, or cleaning up parallel lane worktrees

### Fairway Coordination And Provider Sessions

Read `doc/operations/Fairway_Agent_Operating_Model.md` before claiming,
delegating, or resuming Fairway work. It is the focused operating model for
provider sessions, tmux/Claude attachments, provider-event checkpoints, active
queue sources, lane-monitor fallback, and stabilization-phase coordination.
Read `doc/operations/Fairway_Review_Operating_Model.md` before closing,
reviewing, merging, or promoting high-risk Fairway work.

Fairway is the active coordination layer for platform-foundation and Docusaurus
tracks. The durable unit is the Fairway lane/task, not a specific provider chat.

Rule:

```text
Durable lane, replaceable provider attachment.
```

Use Codex, Claude, Gemini, tmux, or shell as execution attachments to the lane.
Fairway remains the source of truth for task state, ownership, checkpoints,
evidence, handoffs, reviews, and merge gates.

For parallel implementation, lanes should map to platform shared-service and
code ownership boundaries rather than provider threads or generic
backend/frontend buckets. Use
`doc/operations/Shared_Service_Lane_Worktree_Model_v1.md` for lane assignment.
Normal delivery is:

```text
shared-service lane worktree -> local lane commit -> Fairway evidence/review
-> orchestrator merge to master -> master push -> CI monitor
```

Remote task branches are exceptional and require Fairway push-intent. The
orchestrator owns integration and should keep `master` as the normal remote
validation branch.

Execution surface rule:

```text
Desktop thread = control/review surface.
CLI or tmux with bypass = trusted execution attachment.
Fairway = durable coordination and audit state.
```

Use Desktop threads for architecture, review, steering, screenshots, browser
checks, and cross-thread coordination. For trusted long-running ops work such as
CI monitoring, deploy validation, Kubernetes rollout loops, UAT/smoke harness
runs, branch closeout, and Fairway reconciliation, prefer a CLI/tmux provider
session. The approved unattended CLI form is:

```bash
codex --dangerously-bypass-approvals-and-sandbox
```

This mode must still register a Fairway provider session, record started/active
checkpoints, attach evidence, and reconcile before stopping. Do not use bypass
mode for ambiguous scope, destructive cleanup, production mutation, secret
handling, or live failure injection unless explicit approval and rollback
criteria are recorded in Fairway.

Desktop shells can still inherit Codex app sandbox limits even when they run as
the logged-in user. A common symptom is `git add` failing with:

```text
fatal: Unable to create '.git/index.lock': Operation not permitted
```

When the task is already reviewed and the only blocker is a sandbox-limited
operation such as staging, committing, or a trusted deploy/watch command, use a
tmux or SSH session that was started outside the Codex Desktop sandbox as the
execution attachment. Example local commit-boundary lane:

```bash
tmux new -d -s gpuaas-git
tmux send-keys -t gpuaas-git 'cd /Users/subash/dev/GPUasService && git status --short --branch' C-m
tmux capture-pane -pt gpuaas-git -S -200
```

Treat this as a controlled execution lane, not an untracked escape hatch:

```text
Fairway decides and records.
Desktop coordinates and reviews.
tmux/SSH executes only the approved command boundary.
```

For commit-boundary use, Fairway evidence must name the exact reviewed files,
the validation already run, the command output captured from the tmux/SSH lane,
the resulting commit SHA, `merge-ready` readback, and final reconciliation.

For long-running Codex-backed orchestrator or track sessions, set an explicit
provider goal in addition to the Fairway task and `tmp-ux` memory file. The
roles are:

```text
Fairway task = durable work unit and audit state.
tmp-ux memory file = provider-independent resume packet.
Codex goal = active provider objective and completion criteria.
Orchestrator prompt = current steering instruction.
```

Use a provider goal when work crosses CI/deploy/UAT waits, review waits,
context compaction, account/surface switching, or multiple parallel lanes. The
goal must name the Fairway config, memory file, task/batch scope, completion
criteria, stop conditions, and forbidden actions. A provider goal is not a
Fairway status change and does not approve reviews, claim ownership, or make
work merge-ready.

Parallel Fairway tooling work is not an automatic stop signal for GPUaaS
stabilization. If the architecture track is updating Fairway docs, dashboard,
provider adapters, release tooling, or coordination rules, continue the current
GPUaaS task unless there is a direct dependency on that unfinished Fairway work.
Only wait for this thread or the user when approval is required for destructive or
production-impacting action, a needed credential/secret is missing, the
environment is blocked, required review has no alternate ready work, scope is
unsafe to infer, or a missing Fairway capability prevents correct task tracking.
Otherwise proceed to the next ready non-conflicting task, record checkpoints and
evidence, and reconcile before ending.

When work stalls on current external platform behavior, consult a second
provider/current-info source after one serious local evidence pass. Good
examples are Apple signing/notarization, Cloudflare, Pomerium, GitHub/GitLab CI
runners, Homebrew, Kubernetes/kind, registries, MAAS/LXD, OpenClaw, Keycloak,
and provider-specific networking or deployment behavior. Validate the finding
locally or against the environment, then record the symptom, consulted source,
confirmed interpretation, and next action as Fairway evidence or a checkpoint.
The second source is advisory; Fairway evidence and local verification remain
the authority.

Starting or switching work is an atomic coordination step:

1. Register or refresh the provider session with `fairway session upsert`.
2. Associate that session with the exact Fairway task being worked.
3. Claim or set the task `in_progress`.
4. Record an `active` checkpoint or `started` provider event for the same task.
5. Confirm `fairway session status` shows the active session before editing.

Do not mark a task `in_progress` and then start implementation without a
matching active session/checkpoint. If an agent is working, the Fairway wall
must be able to show who is working. If a provider thread named
`orchestrator` is executing a backend task, keep the task role as `backend` and
record the active session as the orchestrator provider attachment to that task.

Short direct coordinator/orchestrator work may temporarily appear as
`in_progress` without an active session only when all of the following are true:
the work is expected to finish within one short burst, the task has a fresh
checkpoint explaining who is working and why no session is attached, and the
task will be closed, reset, blocked, or explicitly handed off before the burst
ends. High-risk stabilization, UAT, production-readiness, delegated provider,
tmux/Claude/Codex external, or multi-step work must register a provider session
and emit a `started` provider event. Treat `in_progress without session` as a
known temporary state, not as the normal execution model.

When delegating work to an external provider session:

1. Register or update the session with `fairway session upsert`.
2. Associate the session with the current Fairway task.
3. Immediately emit a `started` provider event so the task receives an
   `active` checkpoint.
4. Feed runtime state back through the Fairway provider event adapter:
   `../fairway/examples/session-adapters/provider-event.sh`.
5. Record `awaiting_input` checkpoints for approvals, questions, failures,
   stale sessions, or no-progress states.
6. Emit a `completed` provider event when the delegated session finishes; this
   records a `done` checkpoint plus evidence or handoff.
7. Do not change task status, approve reviews, or mark merge readiness from
   provider chat alone; use normal Fairway gates.

For active GPUaaS tracks that use non-default Fairway config files, place a
temporary wrapper earlier in `PATH` so the adapter writes to the right Fairway
DB:

```bash
export GPUAAS_REPO="${GPUAAS_REPO:-$(pwd)}"
export FAIRWAY_REPO="${FAIRWAY_REPO:-../fairway}"
tmpdir=$(mktemp -d /tmp/gpuaas-fairway-adapter.XXXXXX)
cat > "$tmpdir/fairway" <<'EOF'
#!/usr/bin/env bash
cd "$FAIRWAY_REPO" || exit 1
exec go run ./cmd/fairway --config "$GPUAAS_REPO/.fairway/platform-foundation-config.toml" "$@"
EOF
chmod +x "$tmpdir/fairway"
PATH="$tmpdir:$PATH" ../fairway/examples/session-adapters/provider-event.sh \
  --provider codex \
  --backend codex-thread \
  --external-session-id <thread-id> \
  --role orchestrator \
  --task-id <fairway-task-id> \
  --state started \
  --summary "delegated provider session started" \
  --transcript .fairway/transcripts/<thread-id>.log
```

Then emit additional provider events for `waiting_on_approval`,
`waiting_on_input`, `failed`, `stale`, `no_progress`, and `completed` as the
external session changes state. Active external sessions must have Fairway
provider-event checkpoints at start, waiting/stale/failure, and completion.

Use `.fairway/docusaurus-config.toml` for Docusaurus portal work. Use
`.fairway/platform-foundation-config.toml` for platform-foundation work. Do not
update legacy `doc/governance/Agent_Work_Queue.yaml` as the active queue.

### Fairway Review Rules

Fairway task `review_domains` are the expected independent review domains, not
decorative metadata. GPUaaS uses lightweight, risk-scaled review as the
first-class default. Use `risk_level` to determine the minimum review bar:

- `low`: owner self-check plus durable evidence is acceptable unless a domain
  trigger says otherwise.
- `medium`: at least one independent reviewer from a primary review domain.
- `high`: at least two review domains.
- `critical` or launch-sensitive: architecture, security, and ops are required,
  plus backend/frontend when code or UI changed.

For pre-user/pre-production stabilization, do not route the full review matrix
for every child task by default. Use one accountable reviewer or grouped review
for small docs, harness, classifier, setup/readback, stale-blocker, and
non-live/disposable fixes that preserve the same safety boundary. Escalate to
the full matrix at real boundary decisions: live/source-prod mutation,
credential reset/submission, token/API sensitive-operation proof,
break-glass, public exposure, deploy/release, sensitive-operation enforcement,
production-readiness claims, or external compliance/customer claims.

If a heavier process is proposed as the new default, first run a bounded pilot
and record whether it improved speed, quality, safety, defect discovery, or
rollback confidence. If it does not, keep the lightweight model and invest in
preflight, tests, UAT flow coverage, and automation.

Security and CISO-style controls should follow a maturity ramp. In early
product development, implement the feature, add reproducible tests/preflights,
run representative UAT, and keep claims modest. Do not turn every security
feature into a production control attestation program before there are users,
release pressure, or an external claim. Add formal control packets, custody
evidence, multi-domain review, and compliance mapping when work crosses a real
boundary: production cutover, regulated/customer claim, external audit,
credential/break-glass action, public exposure, or enforcement of a sensitive
operation. If a control is introduced earlier, it must be justified by a
specific defect or risk it is expected to catch.

Use dedicated track sessions for review-heavy work so architecture, security,
ops, backend, frontend, and governance reviewers keep context. The track
session is a provider attachment; Fairway remains the durable record. Record
reviews with `fairway record review`, and do not rely on provider chat approval
as review evidence.

### Commit Boundary Rule

Commit at task or review boundaries, not as an end-of-day cleanup batch. A
commit should represent one coherent task outcome whenever possible.
Documentation changes follow the same rule: commit them when the doc update is
complete and sanity-checked, even if no code changed.

Required pattern:

1. Implement the task slice.
2. Attach Fairway evidence and decide task status.
3. Run `git diff --check`, focused tests, and relevant script syntax checks.
4. Route and record required Fairway reviews for medium/high/critical work.
5. Commit the reviewed, merge-ready slice before deploy, UAT, or release
   validation.
6. Leave blocked or partial work as explicit Fairway follow-up tasks rather
   than bundling it into an unrelated commit.

Large multi-task commits are allowed only for explicit repository cleanup or
recovery from a dirty-tree checkpoint, and must say so in the commit message or
Fairway evidence. Agents should not leave many unrelated completed tasks
uncommitted in the same worktree.

### Deploy Run And Finding Taxonomy

Every meaningful push, CI run, deploy, smoke test, or UAT attempt should have a
lightweight Fairway deploy-run task. Use one deploy-run task to record the
source SHA, environment, CI result, deploy result, smoke/UAT result, evidence
paths, and final status.

Create scoped child or follow-up tasks for actionable findings:

- `CI-FIX-*`: build, test, lint, generated contract, or CI runner failure.
- `CD-FIX-*`: promotion, deploy script, image freshness, rollout, secret
  wiring, or cluster apply failure.
- `UAT-BUG-*`: product/runtime behavior found by UAT.
- `OPS-FIX-*`: environment, observability, credential, backup, network, or
  capacity issue.
- `HARNESS-FIX-*`: UAT/test harness false positive, false negative, flake, or
  sequencing issue.
- `DOC-FIX-*`: docs or runbook mismatch found during operation.

Rule: one deploy-run task per meaningful release/deploy attempt; one
child/follow-up task per actionable finding. Do not create child tasks for
one-off transient noise unless it recurs or changes operator action.
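
A minimal sketch of recognizing the finding prefixes above in a task ID. `ClassifyFinding` and the example IDs are hypothetical; the only assumption about the ID format is that a non-empty suffix follows the prefix and a hyphen.

```go
package main

import (
	"fmt"
	"strings"
)

// findingPrefixes mirrors the taxonomy listed above.
var findingPrefixes = map[string]string{
	"CI-FIX":      "build, test, lint, generated contract, or CI runner failure",
	"CD-FIX":      "promotion, deploy script, image freshness, rollout, secret wiring, or cluster apply failure",
	"UAT-BUG":     "product/runtime behavior found by UAT",
	"OPS-FIX":     "environment, observability, credential, backup, network, or capacity issue",
	"HARNESS-FIX": "UAT/test harness false positive, false negative, flake, or sequencing issue",
	"DOC-FIX":     "docs or runbook mismatch found during operation",
}

// ClassifyFinding returns the taxonomy prefix of a task ID such as
// "CI-FIX-lint-timeout". It reports false for IDs outside the taxonomy
// or IDs with nothing after the prefix.
func ClassifyFinding(taskID string) (string, bool) {
	for prefix := range findingPrefixes {
		if strings.HasPrefix(taskID, prefix+"-") && len(taskID) > len(prefix)+1 {
			return prefix, true
		}
	}
	return "", false
}

func main() {
	for _, id := range []string{"CI-FIX-lint-timeout", "UAT-BUG-terminal-reconnect", "FEATURE-12"} {
		prefix, ok := ClassifyFinding(id)
		fmt.Printf("%s -> %q (finding=%v)\n", id, prefix, ok)
	}
}
```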

Every actionable finding task should include `detected_by` and `expected_gate`
classifiers using `doc/operations/DevSecOps_Escape_Rate_Classifier_v1.md`.
These fields allow DevSecOps escape-rate metrics to distinguish issues caught
by the intended gate from issues that escaped to CI, deploy, UAT, incident, or
manual review. Use `unknown` only as a temporary classifier and create or
update a governance follow-up when either value is unknown. Do not include
tenant names, customer names, credentials, tokens, private URLs, or
incident-sensitive detail in classifier metadata.
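
A hedged sketch of how the two classifiers could feed an escape-rate metric. The authoritative values and formula live in `doc/operations/DevSecOps_Escape_Rate_Classifier_v1.md`; the type, the gate names (`ci`, `uat`), and the "escaped when `detected_by` differs from `expected_gate`" rule below are assumptions for illustration.

```go
package main

import "fmt"

// Unknown is the temporary classifier value described above.
const Unknown = "unknown"

// FindingClassifier is an illustrative shape for the two fields.
type FindingClassifier struct {
	DetectedBy   string // where the issue was actually caught
	ExpectedGate string // gate that was meant to catch it
}

// NeedsGovernanceFollowUp is true while either classifier is still unknown.
func (c FindingClassifier) NeedsGovernanceFollowUp() bool {
	return c.DetectedBy == Unknown || c.ExpectedGate == Unknown
}

// Escaped reports whether the issue got past its intended gate. The second
// result is false when the finding cannot be classified yet.
func (c FindingClassifier) Escaped() (escaped bool, known bool) {
	if c.NeedsGovernanceFollowUp() {
		return false, false
	}
	return c.DetectedBy != c.ExpectedGate, true
}

// EscapeRate returns escaped/classified, skipping unknown findings.
func EscapeRate(findings []FindingClassifier) (rate float64, counted int) {
	escapedCount := 0
	for _, f := range findings {
		escaped, known := f.Escaped()
		if !known {
			continue
		}
		counted++
		if escaped {
			escapedCount++
		}
	}
	if counted == 0 {
		return 0, 0
	}
	return float64(escapedCount) / float64(counted), counted
}

func main() {
	findings := []FindingClassifier{
		{DetectedBy: "ci", ExpectedGate: "ci"},     // caught by the intended gate
		{DetectedBy: "uat", ExpectedGate: "ci"},    // escaped CI, found in UAT
		{DetectedBy: Unknown, ExpectedGate: "ci"},  // needs a governance follow-up
	}
	rate, counted := EscapeRate(findings)
	fmt.Printf("escape rate %.2f over %d classified findings\n", rate, counted)
}
```

Findings with an `unknown` classifier are excluded from the rate rather than counted as caught, so temporary classifiers cannot make the metric look better than it is.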

### CI And Deploy Wait Window Rule

See `doc/operations/Fairway_Agent_Operating_Model.md §10 CI And Deploy Wait
Windows` for the full rule.

Summary: CI and deploy waits are active operating windows, not idle chat time.
Before switching attention, record a deploy-run task update or checkpoint with:

- pipeline or deploy URL when available;
- source SHA or release branch;
- target environment;
- expected completion window as an absolute time or `+N minutes`;
- what is blocked on the result;
- the safe next action while waiting.

Do not start conflicting code changes in the same worktree while the pipeline
result is deciding whether the current SHA is valid. If parallel work is needed,
use a separate Fairway task and provider session, and use a separate worktree
when code edits could collide with the SHA under test. When CI or deploy
finishes, update the deploy-run or checkpoint with the result, create
`CI-FIX-*`, `CD-FIX-*`, `UAT-BUG-*`, `OPS-FIX-*`, `HARNESS-FIX-*`, or
`DOC-FIX-*` tasks for actionable findings, close or reset the waiting task
explicitly, and reconcile active sessions before ending the work block.

### Platform-Control Release Rule

Before any `release/platform-control` promotion or deploy work, read:
- `doc/governance/Platform_Control_Release_Promotion_Policy.md`
- `doc/governance/Multi_Agent_Lane_Worktrees_v1.md`

Hard requirements:

- Do not hand-edit `release/platform-control` as part of the normal workflow.
- Merge fixes to `master` first.
- Promote the release with `scripts/ci/platform_control_promote_release_branch.sh`.

For work that adds or changes behaviour, follow these steps in order:

```
1. Spec first          Update openapi.draft.yaml (and/or asyncapi.draft.yaml)
                       before writing any Go code.

2. Unit tests          Write _test.go files for the service logic using stub
                       dependencies. Tests must pass with `make test` (no infra).

3. Implement           Write the service function and HTTP handler following the
                       patterns in doc/governance/Coding_Standards.md.

4. Wire + smoke test   Ensure `make dev-infra && make dev-api` starts cleanly
                       and `GET /healthz` returns 200.

5. Integration test    Add a //go:build integration test covering the DB/Redis/
                       NATS path. Must pass with `make test-integration`.

6. Definition of Done  Check every box in the checklist below before opening a PR.
```

### Definition of Done

Every PR that adds or changes behaviour must satisfy all of the following.
Agents: do not mark a task complete until every box is checked.

- [ ] `openapi.draft.yaml` updated if endpoint shape changed
- [ ] `asyncapi.draft.yaml` updated if event payload changed
- [ ] Service function implemented; domain sentinel errors defined in the package
- [ ] HTTP handler follows the standard shape (corrID + claims at top, no HTTP types below handler layer)
- [ ] Unit tests for service logic: happy path + all domain error paths
- [ ] HTTP handler tested via `httptest` (auth required, validation failure, success)
- [ ] Outbox row written in same DB transaction as domain mutation (no direct NATS publish from handler)
- [ ] Audit log row written for every privileged mutation
- [ ] `middleware.Sanitize` applied before any log line that echoes request data
- [ ] No policy constant hardcoded — all thresholds read from `policy.Client`
- [ ] `go build ./...` passes
- [ ] `go vet ./...` passes
- [ ] `make lint` passes
- [ ] `make test` passes (unit tests, no infra)
- [ ] `make test-integration` passes (requires `make dev-infra`)
- [ ] `make verify-web` passes for any task touching `packages/web/**`
- [ ] `bash scripts/ci/frontend_e2e.sh` passes for any user-visible UX,
      navigation, route guard, app connect, auth redirect, or frontend API
      consumption change; if scoped, record the exact `E2E_SPEC` and why it is
      sufficient
- [ ] `CODEGEN_ENFORCE_CLEAN=1 bash scripts/ci/sdk_codegen_smoke.sh` passes for
      any OpenAPI/AsyncAPI/API response/generated client/frontend API
      consumption change
- [ ] Mutating DB paths that write JSON/audit fields are executed against real Postgres in tests (to catch bind-type/query issues)

### Coding patterns quick-ref

Full patterns (handler shape, logging, DB transaction, outbox, audit, naming):
→ `doc/governance/Coding_Standards.md §Go Implementation Patterns`

Test patterns (table-driven, httptest, mocks, integration setup, coverage targets):
→ `doc/governance/Testing_Standards.md §Go Test Patterns`

---

## Task → Files to Read

| Task | Files |
|---|---|
| Add / change a REST endpoint | `doc/api/openapi/manifest.yaml` for domain ownership, then the domain fragment or `doc/api/openapi.draft.yaml` → update spec first |
| Add / change a domain event | `doc/api/asyncapi/manifest.yaml`, domain fragment or `doc/api/asyncapi.draft.yaml`, `doc/architecture/Event_Taxonomy.md` |
| Add / change v3 UI/API migration work | `doc/architecture/AI_Factory_Team_Domain_Operating_Model_v1.md`, `doc/architecture/API_Route_Modularization_and_V1_Freeze_v1.md`, `doc/architecture/API_Domain_Authoring_Model_v1.md` |
| Change DB schema | `doc/architecture/db_schema_v1.sql`, `doc/architecture/ERD.md` |
| Return a new error | `doc/architecture/Error_Code_Catalog.md` — add code there first |
| Add a policy key | `doc/architecture/Seed_Data_Spec.md`, `scripts/seed.sql` |
| Implement billing logic | `doc/architecture/State_Machines.md §3-4`, `doc/architecture/db_schema_v1.sql` |
| Implement provisioning | `doc/architecture/State_Machines.md §1`, `doc/architecture/Sequence_Flows.md` |
| Implement node agent tasks | `doc/architecture/Node_Agent_Spec.md` — task catalog, protocol, security model |
| Implement PKI / cert enrollment | `doc/architecture/PKI_Spec.md` — CA hierarchy, enrollment flow, renewal |
| Implement MAAS integration | `doc/architecture/Node_Agent_Spec.md §4`, `packages/platform/maas/` — isolation model, deploy/release API |
| Add a NATS consumer | `doc/architecture/NATS_Stream_Config.md` |
| Write tests | `doc/governance/Testing_Standards.md` (pyramid, patterns, coverage targets) |
| Follow Go coding patterns | `doc/governance/Coding_Standards.md §Go Implementation Patterns` |
| Find what to build next | `doc/Implementation_Roadmap.md` — ordered phases with files, endpoints, tests |
| Understand service boundaries | `doc/architecture/Domain_Ownership_Map.md` |
| Understand failure/retry flows | `doc/architecture/Compensation_Matrix.md` |
| Understand the data model | `doc/architecture/ERD.md`, `doc/architecture/db_schema_v1.sql` |
| Inter-service calls | `doc/architecture/Inter_Service_Communication.md` |
| Monorepo / build structure | `doc/architecture/Monorepo_Structure.md` |

---

## Anti-Patterns — Do Not Do These

- Do not return errors with a `code` not in the catalog above.
- Do not write balance as a column; compute it from `ledger_entries`.
- Do not publish to NATS directly from an HTTP handler (use outbox).
- Do not add `?token=` or `?auth=` query parameters to any endpoint.
- Do not hardcode billing thresholds or rate limit values in service code.
- Do not log or trace any field from the PII/credential blocklist above.
- Do not add gRPC before Phase-2 service extraction — internal calls are direct Go package calls.
- Do not add a new API field without updating `openapi.draft.yaml` first.
- Do not UPDATE or DELETE rows in `ledger_entries` or `audit_logs`.
- Do not skip the outbox pattern for cross-service state changes.
- Do not call the node agent task API directly from HTTP handlers — task dispatch goes through Temporal activities only.
- Do not use `full-reimage` isolation without `maas.enabled=true` — guard this path in the provisioning worker.
- Do not call MAAS API or step-ca directly from node-agent — the node agent talks only to `api.internal`.
