# Architecture v1 (Public-Facing, Agent-Friendly)

## 1. Goals
- Security-first public platform.
- Horizontally scalable control plane.
- Durable workflows for provisioning, billing, and payments.
- Contract-first APIs for SDK/CLI generation.

## 2. Architecture Principles
- API-first and event-contract-first.
- Async-by-default for long-running and failure-prone workflows.
- Idempotent mutations and replay-safe consumers.
- Immutable financial ledger as source of truth.
- Policy enforcement at every trust boundary.
- Service-owned data stores (no BFF direct DB access).
- Tenant/project ownership baseline: tenant is ownership root, project is resource scope, user is actor attribution.
- Canonical resource names across API/events/audit: `core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}`.
- Shared backend isolation must distinguish hierarchy scope, operational domain, and custody domain (see `Platform_Domain_Isolation_Model_v1.md`); deployment topology may vary without changing platform contracts.
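
The canonical resource name format above can be built and parsed mechanically. The following is a minimal sketch, assuming that no segment contains a colon and that empty segments are invalid; `ResourceName` and `parse_resource_name` are hypothetical helper names, not part of any existing platform library.

```python
from dataclasses import dataclass

# Fixed leading segments taken from the canonical format in this document.
PREFIX = ("core42", "aicloud")
FIELDS = ("region", "tenant_id", "project_id", "resource_type", "resource_id")


@dataclass(frozen=True)
class ResourceName:
    region: str
    tenant_id: str
    project_id: str
    resource_type: str
    resource_id: str

    def __str__(self) -> str:
        parts = PREFIX + tuple(getattr(self, name) for name in FIELDS)
        return ":".join(parts)


def parse_resource_name(name: str) -> ResourceName:
    """Split a canonical name into its segments, rejecting malformed input."""
    parts = name.split(":")
    if len(parts) != len(PREFIX) + len(FIELDS) or tuple(parts[:2]) != PREFIX:
        raise ValueError(f"not a canonical resource name: {name!r}")
    values = parts[2:]
    if any(value == "" for value in values):
        raise ValueError(f"empty segment in resource name: {name!r}")
    return ResourceName(*values)
```

Round-tripping through `str()` and `parse_resource_name` gives API handlers, event consumers, and audit writers one shared definition of the format.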

## 3. C4 - System Context
```mermaid
flowchart LR
  user[End User]
  admin[Admin Operator]
  stripe[Stripe]
  gpu[GPU Nodes]

  subgraph gp[GPUaaS Platform]
    api[Public API Gateway/BFF]
  end

  user --> api
  admin --> api
  api --> stripe
  api --> gpu
```

## 4. C4 - Container View
```mermaid
flowchart TB
  subgraph edge[Public Edge]
    waf[WAF + API Gateway]
  end

  subgraph cp[Control Plane]
    bff[API/BFF]
    auth[Identity/Auth Service]
    inv[Inventory Service]
    orch[Provisioning Orchestrator]
    bill[Billing + Ledger Service]
    pay[Payments/Webhook Service]
    term[Terminal Gateway]
    adm[Admin Service]
    stg[Storage Service]
    notif[Notification WS Hub]
  end

  subgraph workers[Workers]
    pw[Provisioning Worker]
    bw[Billing Worker]
    ww[Webhook Worker]
    nrw[Notification Relay Worker]
  end

  subgraph data[Data Layer]
    pg[(PostgreSQL)]
    rd[(Redis)]
    bus[(NATS JetStream)]
    obj[(Object Storage)]
    sec[(Secrets + KMS)]
    otel[(OTel/Prometheus/Logs)]
  end

  waf --> bff
  bff --> auth
  bff --> inv
  bff --> orch
  bff --> bill
  bff --> pay
  bff --> term
  bff --> adm
  bff --> stg
  bff --> notif

  auth --> pg
  auth --> sec
  inv --> pg
  orch --> pg
  bill --> pg
  pay --> pg
  adm --> pg
  stg --> pg

  orch --> bus
  bill --> bus
  pay --> bus
  pw <--> bus
  bw <--> bus
  ww <--> bus
  nrw <--> bus

  pw --> pg
  bw --> pg
  ww --> pg

  stg --> obj
  term --> rd
  nrw --> rd
  notif --> rd

  bff --> otel
  pw --> otel
  bw --> otel
  ww --> otel
  nrw --> otel
```

## 5. Trust Boundaries
```mermaid
flowchart LR
  subgraph internet[Untrusted Internet]
    client[Browser / SDK / CLI]
    stripe[Stripe Webhook Source]
  end

  subgraph edge[Boundary A: Edge]
    waf[WAF + Rate Limiting + TLS]
  end

  subgraph app[Boundary B: Service Mesh / Private Network]
    services[API Services]
    workers[Async Workers]
  end

  subgraph data[Boundary C: Data Plane]
    db[(Postgres)]
    cache[(Redis)]
    queue[(JetStream)]
    obj[(Object Storage)]
    kms[(KMS/Secrets)]
    obs[(Observability Stack)]
  end

  client --> waf --> services
  stripe --> waf --> services
  services <--> workers
  services --> db
  services --> cache
  services --> queue
  services --> obj
  services --> kms
  services --> obs
  workers --> db
  workers --> queue
  workers --> obs
```

## 6. Critical Sequence - Provision (Success and Failure)
```mermaid
sequenceDiagram
  participant U as User
  participant API as API/BFF
  participant ORCH as Orchestrator
  participant Q as Queue
  participant W as Provisioning Worker
  participant N as GPU Node
  participant DB as Postgres

  U->>API: POST /api/v1/allocations
  API->>ORCH: Validate + create AllocationRequested
  ORCH->>DB: tx(write request + outbox)
  API-->>U: request accepted + status endpoint
  ORCH->>Q: publish provisioning.requested
  Q->>W: consume event
  W->>API: queue signed node task + wakeup
  API-->>N: node agent long-poll returns task
  N->>API: task result (success/failure/rejected)
  alt success
    W->>DB: mark allocation active, usage start
    W->>Q: publish provisioning.active
  else failure
    W->>DB: mark request failed + reason
    W->>Q: publish provisioning.failed
  end
```
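
The `tx(write request + outbox)` step is what lets the orchestrator avoid a dual write. Below is a minimal sketch of that step using an in-memory SQLite database as a stand-in for Postgres. The table names, columns, and the `relay_outbox` helper are assumptions for illustration only, and the `publish` callable stands in for a JetStream publish.

```python
import json
import sqlite3
from typing import Callable

# Hypothetical schema; the real tables live in Postgres.
OUTBOX_SCHEMA = """
CREATE TABLE allocation_requests (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    subject TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(OUTBOX_SCHEMA)
    return conn


def request_allocation(conn: sqlite3.Connection, request_id: str) -> None:
    """Write the request row and its event in one transaction."""
    with conn:  # commits on success, rolls back if either insert fails
        conn.execute(
            "INSERT INTO allocation_requests (id, status) VALUES (?, 'requested')",
            (request_id,),
        )
        conn.execute(
            "INSERT INTO outbox (subject, payload) VALUES (?, ?)",
            ("provisioning.requested", json.dumps({"request_id": request_id})),
        )


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    """Publish pending outbox rows in order and mark each one as published."""
    rows = conn.execute(
        "SELECT seq, subject, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, subject, payload in rows:
        publish(subject, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because a row is marked only after `publish` returns, a crash between the two steps causes a re-publish rather than a lost event. That is why consumers need to be replay-safe.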

## 7. Critical Sequence - Billing + Force Release
```mermaid
sequenceDiagram
  participant T as Scheduler
  participant BW as Billing Worker
  participant DB as Postgres
  participant Q as Queue
  participant PW as Provisioning Worker

  T->>BW: Run billing window
  BW->>DB: load active usage windows
  BW->>DB: write ledger debits + usage accrual
  opt balance <= threshold
    BW->>Q: publish billing.low_balance_warning
  end
  opt balance <= 0
    BW->>Q: publish provisioning.force_release_requested
    Q->>PW: force release allocations
    PW->>DB: close allocations + end usage
    PW->>Q: publish provisioning.releasing.completed
  end
```
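
The two balance checks in the billing sequence are independent conditions, not alternatives. The sketch below shows that decision as a pure function, assuming balances are tracked as integers in minor currency units; `settle_window` and `BillingOutcome` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BillingOutcome:
    new_balance_minor: int
    events: Tuple[str, ...]


def settle_window(
    balance_minor: int, debit_minor: int, low_balance_threshold_minor: int
) -> BillingOutcome:
    """Apply one window's debit and decide which events to publish."""
    if debit_minor < 0:
        raise ValueError("debit must be non-negative")
    new_balance = balance_minor - debit_minor
    events = []
    # Each check is evaluated on its own, mirroring the sequence diagram.
    if new_balance <= low_balance_threshold_minor:
        events.append("billing.low_balance_warning")
    if new_balance <= 0:
        events.append("provisioning.force_release_requested")
    return BillingOutcome(new_balance, tuple(events))
```

Keeping the decision free of I/O makes it easy to cover with ledger invariant tests. The worker would then persist the debit and enqueue the returned events.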

## 8. Critical Sequence - Stripe Webhook Idempotency
```mermaid
sequenceDiagram
  participant S as Stripe
  participant API as Payments Service
  participant DB as Postgres

  S->>API: webhook(event)
  API->>API: verify signature + timestamp
  API->>DB: check event_id uniqueness
  alt duplicate
    API-->>S: 200 already processed
  else new
    API->>DB: tx(insert event, insert ledger credit)
    API->>API: emit payments.balance_credited
    API-->>S: 200 processed
  end
```
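
A separate "check, then insert" can race when the same event is delivered twice at once. One common way to close that gap is to let a unique constraint on the event ID decide. The sketch below assumes signature and timestamp verification has already passed; it uses SQLite as a stand-in for Postgres, with hypothetical table and function names.

```python
import sqlite3

# Hypothetical schema; event_id uniqueness is enforced by the primary key.
WEBHOOK_SCHEMA = """
CREATE TABLE webhook_events (event_id TEXT PRIMARY KEY);
CREATE TABLE ledger_entries (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    event_id TEXT NOT NULL,
    tenant_id TEXT NOT NULL,
    amount_minor INTEGER NOT NULL
);
"""


def handle_credit_event(
    conn: sqlite3.Connection, event_id: str, tenant_id: str, amount_minor: int
) -> str:
    """Record the event and its ledger credit atomically, exactly once."""
    try:
        with conn:  # one transaction: both rows are written, or neither
            conn.execute(
                "INSERT INTO webhook_events (event_id) VALUES (?)", (event_id,)
            )
            conn.execute(
                "INSERT INTO ledger_entries (event_id, tenant_id, amount_minor) "
                "VALUES (?, ?, ?)",
                (event_id, tenant_id, amount_minor),
            )
    except sqlite3.IntegrityError:
        # The event ID was already recorded, so this delivery is a replay.
        return "already processed"
    return "processed"
```

Both outcomes map to a `200` response, so the sender stops retrying while the ledger is credited only once.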

## 9. Critical Sequence - Terminal Session
```mermaid
sequenceDiagram
  participant C as Client
  participant TG as Terminal Gateway
  participant DB as Postgres
  participant N as GPU Node

  C->>TG: WS connect(token, allocation)
  TG->>DB: validate token + ownership + allocation active
  TG->>N: open remote shell channel
  TG-->>C: stream bi-directional terminal data
```

Terminal auth evolution:
- Day 1: gateway validates token + allocation ownership against cache/DB for correctness.
- Day 2 target: mint short-lived signed terminal session tokens (`allocation_id`, `user_id`, expiry, scope) so gateway can verify locally and reduce per-connect DB dependency.
- Keep server-side revocation strategy (deny-list/versioned token claims) for emergency access revocation.
- Single-use token consumption must be atomic (`GETDEL` or equivalent) to prevent race/replay on concurrent upgrades.
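
The Day 2 token and the single-use rule can be sketched with the standard library alone. This is an illustration under assumptions: an HMAC-SHA256 signature with a shared key (the real design might use asymmetric signatures or a standard token format), a fixed `terminal` scope string, and an in-memory class standing in for a Redis key consumed with `GETDEL`. All names here are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import threading
import time
from typing import Dict, Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def mint_terminal_token(key: bytes, allocation_id: str, user_id: str,
                        ttl_s: int, now: Optional[float] = None) -> str:
    """Return '<claims>.<signature>' with a short expiry."""
    issued_at = time.time() if now is None else now
    claims = {"allocation_id": allocation_id, "user_id": user_id,
              "scope": "terminal", "exp": int(issued_at) + ttl_s}
    body = json.dumps(claims, sort_keys=True).encode("utf-8")
    signature = hmac.new(key, body, hashlib.sha256).digest()
    return _b64(body) + "." + _b64(signature)


def verify_terminal_token(key: bytes, token: str, allocation_id: str,
                          now: Optional[float] = None) -> Dict[str, object]:
    """Verify signature, expiry, and scope locally, without a DB lookup."""
    try:
        body_b64, sig_b64 = token.split(".")
        body = base64.urlsafe_b64decode(body_b64)
        signature = base64.urlsafe_b64decode(sig_b64)
    except ValueError as exc:  # wrong part count or invalid base64
        raise PermissionError("malformed token") from exc
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise PermissionError("bad signature")
    claims = json.loads(body)
    current = time.time() if now is None else now
    if claims["exp"] <= current:
        raise PermissionError("token expired")
    if claims["scope"] != "terminal" or claims["allocation_id"] != allocation_id:
        raise PermissionError("token not valid for this allocation")
    return claims


class SingleUseTokens:
    """In-memory stand-in for atomic get-and-delete (Redis GETDEL semantics)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._values: Dict[str, str] = {}

    def put(self, token: str, allocation_id: str) -> None:
        with self._lock:
            self._values[token] = allocation_id

    def consume(self, token: str) -> Optional[str]:
        # Read and delete under one lock so only one caller can win.
        with self._lock:
            return self._values.pop(token, None)
```

Local verification removes the per-connect database read. The deny-list or claim-version check mentioned above would still be needed to revoke a token before it expires.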

## 10. Data and Key Management
- All secrets stored in secret manager; no plaintext secrets in code or repo.
- KMS envelope encryption for sensitive values.
- SSH private material never stored unencrypted.
- SSH key retrieval uses an endpoint authenticated via the `Authorization` header; no auth tokens in query strings.
- Node provisioning does not use direct control-plane SSH; node-agent task dispatch is the MVP path.
- If signed download URLs are used internally, their TTL must be short and logs must redact credential material.
- Key rotation policy documented per environment.
- Database encryption at rest and TLS in transit.
- Sensitive token/credential values must be log-sanitized.

## 11. Policy Configuration Architecture
- Policy values are runtime-configured from DB/config store; no hardcoded business policy constants in service code.
- Policy scope hierarchy:
  - `global`
  - `plan`
  - `org`
  - `user`
- Resolution rule: most specific applicable scope wins, with bounded fallback to broader scope.
- Policy change flow:
  - Admin API validates key + bounds
  - Write policy change record transactionally
  - Emit policy-change event for cache invalidation
  - Audit log entry required for every change
- Required policy keys include rate limits, allocation concurrency, billing window/thresholds, refund window/rules, and notification defaults.
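
The resolution rule can be sketched as a lookup that walks scopes from most to least specific. The data shape here (a mapping keyed by scope, scope ID, and policy key), the `"*"` ID for the global scope, and the key name in the usage note are all assumptions for illustration.

```python
from typing import Mapping, Optional, Tuple

# Most specific scope first; the fallback is bounded by this fixed order.
SCOPE_ORDER = ("user", "org", "plan", "global")

PolicyKey = Tuple[str, str, str]  # (scope, scope_id, policy key)


def resolve_policy(
    values: Mapping[PolicyKey, int],
    key: str,
    *,
    user_id: Optional[str] = None,
    org_id: Optional[str] = None,
    plan_id: Optional[str] = None,
) -> int:
    """Return the value from the most specific scope that defines `key`."""
    scope_ids = {"user": user_id, "org": org_id, "plan": plan_id, "global": "*"}
    for scope in SCOPE_ORDER:
        scope_id = scope_ids[scope]
        if scope_id is None:
            continue  # caller has no identity at this scope
        value = values.get((scope, scope_id, key))
        if value is not None:
            return value
    raise KeyError(f"no policy value for {key!r} at any scope")
```

For example, with a global limit of 2 and an org override of 8 for a hypothetical `allocation.max_concurrent` key, a user in that org resolves to 8 and everyone else resolves to 2.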

## 12. Scalability Model
- API HTTP paths are stateless and scale horizontally behind a load balancer.
- Terminal websocket sessions are stateful per connected pod; deployment must use websocket-capable routing with session affinity (or dedicated terminal gateway extraction) to avoid reconnect/session disruption.
- Worker pools scale independently by queue depth and latency SLO.
- Redis for hot-path cache, rate limiting, and short-lived coordination.
- Notification fanout path uses `notification-relay` (NATS consumer) -> Redis Pub/Sub -> API WS hub, so `cmd/api` does not subscribe directly to domain events.
- DB optimization plan:
  - indexes for allocation state and user lookups
  - partition large usage/ledger tables by time
  - read-replica strategy for analytics/export workloads

## 13. Reliability Model
- Retries with backoff for transient failures.
- DLQ for poisoned events and manual replay tooling.
- Outbox pattern to avoid dual-write loss.
- Compensation steps for partial failures (provision/release).
- Circuit breaker + timeout + bulkhead policies on internal node-agent dispatch and result-reporting paths.
- RTO/RPO targets per environment:
  - RTO target: <= 30 minutes
  - RPO target: <= 15 minutes
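
The retry and DLQ bullets above combine naturally in a consumer loop. The sketch below assumes exponential backoff with full jitter and a distinct `TransientError` type to separate retryable failures from poisoned events. The function names and default values are illustrative, not settled policy.

```python
import random
import time
from typing import Callable, List


class TransientError(Exception):
    """Raised by a handler when a retry may succeed (timeouts, brief outages)."""


def backoff_delays(attempts: int, base_s: float = 0.5, cap_s: float = 30.0) -> List[float]:
    """Upper bounds for each wait: base * 2^n, capped at cap_s."""
    return [min(cap_s, base_s * (2 ** n)) for n in range(attempts)]


def process_with_retry(
    event: dict,
    handler: Callable[[dict], None],
    dead_letter: Callable[[dict], None],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Run handler with retries; route the event to the DLQ if it never succeeds."""
    delays = backoff_delays(max_attempts - 1)
    for attempt in range(max_attempts):
        try:
            handler(event)
            return True
        except TransientError:
            if attempt == max_attempts - 1:
                break  # retries exhausted
            sleep(delays[attempt] * rng())  # full jitter: wait in [0, bound)
        except Exception:
            break  # non-transient failure: retrying would not help
    dead_letter(event)
    return False
```

Injecting `sleep` and `rng` keeps the loop deterministic in tests. Events that reach `dead_letter` are the input for the manual replay tooling.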

## 14. Observability Model
- End-to-end distributed tracing with correlation IDs across HTTP and event flows.
- Structured JSON logging with required fields: `timestamp`, `level`, `service`, `correlation_id`, and `org_id` where present.
- Metrics baseline:
  - request latency/error/rate
  - queue depth/consumer lag
  - workflow success/failure counters
  - billing and payment event counters
- Alerting:
  - SLO burn alerts
  - queue backlog thresholds
  - webhook and billing worker failure alerts
- Cardinality policy:
  - avoid high-cardinality label values (e.g., raw user ids in high-volume metrics).
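
The required log fields can be enforced in one place with a formatter. The sketch below uses Python's standard `logging` module. It assumes that callers attach `correlation_id` and `org_id` through the `extra` argument; the `message` field is an addition beyond the listed required fields.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # org_id is emitted only when the caller supplied one.
        org_id = getattr(record, "org_id", None)
        if org_id is not None:
            entry["org_id"] = org_id
        return json.dumps(entry)
```

A handler configured with `JsonFormatter("api")` would then emit a compliant line for calls such as `logger.info("allocation active", extra={"correlation_id": cid, "org_id": org})`.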

## 15. Security Controls Matrix
| Threat | Control | Owner | Verification |
|---|---|---|---|
| Credential theft | OIDC + short-lived tokens + refresh rotation | Security | Pen-test + auth tests |
| API abuse | WAF + rate limits + bot controls | Platform | Load + abuse tests |
| Replay webhook | signature + timestamp + idempotency key | Payments | Integration tests |
| Privilege escalation | RBAC/ABAC checks server-side | Backend | Authz tests + reviewguard |
| Secret leakage | secret manager + scans + no hardcoding | Security | CI secret scan |
| Financial tampering | immutable ledger + transactional writes | Billing | Ledger invariant tests |

## 16. Deployment Topology
- Environments: `dev`, `staging`, `prod`.
- Separate cloud accounts/projects per environment.
- Network segmentation:
  - public ingress only at edge gateway
  - private service and data subnets
- GitOps deployment with policy checks before promotion.

## 17. NFR and SLO Baseline
| Domain | Target |
|---|---|
| API availability | 99.9% monthly |
| p95 read latency | < 300ms |
| p95 mutation accept latency | < 500ms |
| Webhook processing latency | < 60s end-to-end |
| Provision workflow completion | < 5 min normal path |

## 18. ADR References (Required)
- ADR-001: Service-oriented control plane from day 1.
- ADR-002: Postgres + immutable ledger model.
- ADR-003: Queue/event bus selection.
- ADR-004: AuthN/AuthZ architecture.
- ADR-005: Terminal gateway isolation model.

## 19. Phase-2 Expansion Without Rework (Design Constraints)

### 19.1 Scheduler Abstraction Boundary
- Introduce `SchedulerAdapter` interface at orchestrator layer:
  - `request_capacity`
  - `provision_runtime`
  - `release_runtime`
  - `get_runtime_status`
- MVP uses `BareMetalAdapter`; future adapters: `SlurmAdapter`, `K8sAdapter`, `RayAdapter`.
- API contracts remain scheduler-agnostic; scheduler-specific details live in adapter metadata.
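
The adapter boundary can be expressed as an abstract base class. The four method names come from the list above, but the argument and return types are assumptions. `InMemoryAdapter` is a hypothetical test double, not the `BareMetalAdapter`.

```python
import itertools
from abc import ABC, abstractmethod
from typing import Any, Dict, Set


class SchedulerAdapter(ABC):
    """Scheduler-agnostic boundary used by the orchestrator."""

    @abstractmethod
    def request_capacity(self, request: Dict[str, Any]) -> str:
        """Reserve capacity and return a reservation ID."""

    @abstractmethod
    def provision_runtime(self, reservation_id: str) -> str:
        """Turn a reservation into a runtime and return its ID."""

    @abstractmethod
    def release_runtime(self, runtime_id: str) -> None:
        """Release a runtime; safe to call more than once."""

    @abstractmethod
    def get_runtime_status(self, runtime_id: str) -> str:
        """Return the current runtime status."""


class InMemoryAdapter(SchedulerAdapter):
    def __init__(self, free_nodes: int) -> None:
        self._free = free_nodes
        self._reservations: Set[str] = set()
        self._runtimes: Dict[str, str] = {}
        self._ids = itertools.count(1)

    def request_capacity(self, request: Dict[str, Any]) -> str:
        if self._free < 1:
            raise RuntimeError("no capacity")
        self._free -= 1
        reservation_id = f"res-{next(self._ids)}"
        self._reservations.add(reservation_id)
        return reservation_id

    def provision_runtime(self, reservation_id: str) -> str:
        self._reservations.remove(reservation_id)  # KeyError if unknown
        runtime_id = f"rt-{next(self._ids)}"
        self._runtimes[runtime_id] = "active"
        return runtime_id

    def release_runtime(self, runtime_id: str) -> None:
        # Idempotent: only an active runtime returns its node to the pool.
        if self._runtimes.get(runtime_id) == "active":
            self._runtimes[runtime_id] = "released"
            self._free += 1

    def get_runtime_status(self, runtime_id: str) -> str:
        return self._runtimes.get(runtime_id, "unknown")
```

Making `release_runtime` idempotent matters because release requests travel over the event bus and may be redelivered.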

### 19.2 Multi-Tenant by Construction
- Every mutable domain aggregate must carry tenant scope (`org_id`, optional `project_id`).
- Tenant scoping enforced in data access layer and policy layer.
- Partition and index strategy includes tenant keys from day 1.

### 19.3 Enterprise Billing Extensibility
- Billing engine split into metering, rating, ledger posting, invoicing projection.
- Invoicing/subscriptions/commitments are additive modules on top of the same ledger core.

### 19.4 Multi-Region Topology Ready
- Resource identity format includes region (`region/resource_id`), consistent with the `{region}` segment of the canonical resource name in Section 2.
- Placement service evaluates policy + region capacity.
- Event bus and workflows include region context in all messages.

### 19.5 Contract and Schema Guardrails
- No API field names that hard-code backend type.
- Use extensible enums and metadata fields with versioned schemas.
- Keep compatibility policy mandatory for contract evolution.

### 19.6 Architecture Review Gates for Phase-2 Readiness
- Gate A: new feature does not require changing core identity model.
- Gate B: new feature does not require replacing billing ledger engine.
- Gate C: new feature does not require public API break for existing consumers.
- Gate D: new feature can be introduced as additive service/module and migration.

### 19.7 Policy Service Extraction
**MVP approach**: services query the `platform_policy_values` table directly via the shared database, applying scope-resolution logic locally.

**Phase-2 extraction path**: introduce a dedicated Policy Service that owns all policy reads, scope resolution, and cache invalidation. Other services call it via an internal RPC/HTTP interface rather than touching the policy tables directly.

Benefits: single cache layer eliminates redundant per-service policy caching; consistent scope-resolution logic in one place; policy cache invalidation on write is reliable and propagated uniformly; enables policy evaluation audit trail without per-service instrumentation.

Migration approach: introduce a `PolicyClient` interface in all consuming services from day 1 (backed by direct DB query at MVP). Route calls through the dedicated service in Phase-2 without changing call sites. Remove direct `platform_policy_values` DB access from non-policy services.
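
The migration hinges on call sites depending only on the interface. The sketch below assumes a `get_int` method, an org-then-global lookup, and hypothetical column names for `platform_policy_values`. SQLite stands in for Postgres, and `RemotePolicyClient` wraps whatever transport Phase-2 settles on.

```python
import sqlite3
from typing import Callable, Optional, Protocol


class PolicyClient(Protocol):
    def get_int(self, key: str, org_id: Optional[str] = None) -> int: ...


class DirectDbPolicyClient:
    """MVP backing: resolves the value with a direct query."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get_int(self, key: str, org_id: Optional[str] = None) -> int:
        row = self._conn.execute(
            "SELECT policy_value FROM platform_policy_values "
            "WHERE policy_key = ? "
            "AND ((scope = 'org' AND scope_id = ?) OR scope = 'global') "
            "ORDER BY CASE scope WHEN 'org' THEN 0 ELSE 1 END LIMIT 1",
            (key, org_id),
        ).fetchone()
        if row is None:
            raise KeyError(key)
        return int(row[0])


class RemotePolicyClient:
    """Phase-2 backing: delegates to the Policy Service through `fetch`."""

    def __init__(self, fetch: Callable[[str, Optional[str]], int]) -> None:
        self._fetch = fetch

    def get_int(self, key: str, org_id: Optional[str] = None) -> int:
        return self._fetch(key, org_id)


def allocation_limit(policy: PolicyClient, org_id: str) -> int:
    # Call site: unchanged whichever implementation is injected.
    return policy.get_int("allocation.max_concurrent", org_id)
```

Swapping `DirectDbPolicyClient` for `RemotePolicyClient` at wiring time is then the only change a consuming service needs.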

### 19.8 Node Identity and Credentials (Node-Agent + PKI)
**MVP approach**: node-agent mTLS and task-signing are required. Nodes enroll via internal enrollment flow and renew short-lived certs. Provisioning dispatches typed signed tasks to node agents; no direct control-plane SSH provisioning.

Node onboarding model (MVP canonical):
- Admin registers node inventory first.
- Control plane issues a bootstrap bundle (`node_id`, single-use enrollment token, API URL, CA trust).
- Delivery mode is explicit per node: `manual` (operator-installed) or `maas` (provider automation).
- Lifecycle status tracks trust/readiness (`registered` -> `bootstrap_issued` -> `enrolling` -> `active`, with `offline`/`quarantined`/`retired` control states).
- Admin lifecycle controls include retire (reversible), reactivation of retired nodes (same `node_id`, returns to `offline`), and permanent removal (allowed only from `retired`).
- Occupancy is projected separately from allocations (`available`/`assigned`/`releasing`/`cleanup`/`unavailable`).
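
The lifecycle statuses above lend themselves to an explicit transition table. The onboarding chain, the `retired` to `offline` reactivation, and the removal rule come from the list above. Every other edge (for example `active` to `offline`, or `quarantined` to `offline`) is an assumption made for this sketch and would need confirming.

```python
from typing import Dict, Set

# Allowed status transitions. Edges not stated in the onboarding model
# are illustrative assumptions.
ALLOWED: Dict[str, Set[str]] = {
    "registered": {"bootstrap_issued", "retired"},
    "bootstrap_issued": {"enrolling", "retired"},
    "enrolling": {"active", "retired"},
    "active": {"offline", "quarantined", "retired"},
    "offline": {"active", "quarantined", "retired"},
    "quarantined": {"offline", "retired"},
    "retired": {"offline"},  # reactivation returns the node to offline
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal node transition: {current} -> {target}")
    return target


def can_remove(status: str) -> bool:
    """Permanent removal is allowed only from the retired status."""
    return status == "retired"
```

Centralizing the table keeps the admin API and the enrollment flow from drifting apart on which moves are legal.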

**Phase-2 evolution path**: add deeper PKI automation (rotation workflows, provider abstraction hardening, optional TPM-backed key storage), expanded task catalog, and automated agent rollout controls.

Benefits: no long-lived node-admin credentials in control-plane runtime paths; bounded credential lifetime; stronger authorization boundary on node operations; auditable task dispatch and execution trail.

### 19.9 Audit Log Service (Event-Driven, Decoupled from Primary DB)
**MVP approach**: each service writes `platform_audit_logs` rows transactionally to the primary PostgreSQL database as part of its operation. The admin API reads directly from this table.

**Phase-2 evolution path**: introduce a dedicated Audit Log Service that consumes structured audit events from the NATS event bus (a new `audit.entry` subject). All services publish audit events to the bus rather than writing to the shared table. The Audit Log Service persists and indexes them independently.

Benefits: decouples audit write volume from the primary database write path; enables separate retention, archival, and compliance policy (e.g., write to immutable cold storage); removes shared-table coupling between services; scales audit throughput independently.

Migration approach: at MVP, add `audit.entry` event publication to the outbox alongside the existing transactional DB write. Once the dedicated Audit Log Service is live, retire the dual-write by removing direct `platform_audit_logs` DB writes from service code.

## 20. Assumptions Register (Required Companion)
- All architecture-level assumptions are tracked in `doc/governance/Assumptions_Register.md`.
- Any change to ingress model, auth model, notification bridge, or policy model must update both this document and the assumptions register in the same PR.
