System Overview
Status: designed
GPUaaS uses a contract-first control plane with a single API/BFF binary today,
domain packages behind it, workers for asynchronous work, and a pull-based node
agent model for GPU host operations. The portal summarizes this shape; the
canonical contracts and implementation rules remain in doc/api/** and
doc/architecture/**.
Runtime Topology
| Layer | Main components | Responsibility |
|---|---|---|
| Edge | Public ingress, WAF, TLS, websocket routing policy | Terminate public traffic, enforce edge policy, route API and terminal traffic |
| API/control plane | cmd/api, domain packages, middleware, policy client | Authenticate, authorize, validate contracts, write domain state, mint access/session tokens |
| Workers | Billing, provisioning, webhook, notification relay, outbox relay | Process long-running, scheduled, webhook, notification, and event publishing flows |
| Workflow/event layer | Temporal, NATS JetStream, outbox | Coordinate durable workflows and async state changes |
| Data layer | Postgres, Redis, object storage, secrets/PKI | Store durable state, session/cache state, user data, credentials, and certificates |
| Node fleet | cmd/node-agent, GPU hosts, terminal relay path | Execute typed node tasks, report status, serve user access under policy |
| Observability | OpenTelemetry, Prometheus, structured logs, runbook evidence | Trace requests, inspect workers/events, alert on degradation, preserve release/incident evidence |
Control-Plane Flow
Provisioning Flow
- A user or automation requests capacity through the API.
- The API validates identity, project/tenant scope, policy, idempotency, and catalog availability.
- The provisioning domain writes allocation state and an outbox row in the same database transaction.
- The outbox relay publishes the domain event to NATS.
- Provisioning workers and Temporal activities drive the allocation lifecycle.
- The node agent receives typed tasks through the approved control-plane path.
- Allocation state moves through requested, provisioning, active, releasing, released, failed, or release_failed.
- Billing, notifications, and operator evidence react to state changes through events and read models.
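The allocation lifecycle above can be pictured as a small state machine. The following is a minimal sketch only: the state names come from the list above, but the transition table, the retry from release_failed, and the names `AllocationState`, `CanTransition`, and `Transition` are assumptions made for illustration. The authoritative rules remain in doc/api/**.

```go
package main

import "fmt"

// AllocationState mirrors the lifecycle states named in the provisioning flow.
type AllocationState string

const (
	Requested     AllocationState = "requested"
	Provisioning  AllocationState = "provisioning"
	Active        AllocationState = "active"
	Releasing     AllocationState = "releasing"
	Released      AllocationState = "released"
	Failed        AllocationState = "failed"
	ReleaseFailed AllocationState = "release_failed"
)

// allowed is an ASSUMED transition table, not the canonical contract.
// Released and Failed have no entries, so they act as terminal states.
var allowed = map[AllocationState][]AllocationState{
	Requested:     {Provisioning, Failed},
	Provisioning:  {Active, Failed},
	Active:        {Releasing},
	Releasing:     {Released, ReleaseFailed},
	ReleaseFailed: {Releasing}, // assumption: a failed release may be retried
}

// CanTransition reports whether moving from one state to another is permitted.
func CanTransition(from, to AllocationState) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// Transition returns the new state, or the unchanged state and an error
// when the move is not in the table.
func Transition(from, to AllocationState) (AllocationState, error) {
	if !CanTransition(from, to) {
		return from, fmt.Errorf("invalid transition %s -> %s", from, to)
	}
	return to, nil
}

func main() {
	state := Requested
	for _, next := range []AllocationState{Provisioning, Active, Releasing, Released} {
		newState, err := Transition(state, next)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		state = newState
		fmt.Println("state:", state)
	}
}
```

Keeping the table in one place means workers, Temporal activities, and API handlers can reject an out-of-order update the same way instead of each encoding its own rules.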
Access And Terminal Flow
Browser terminal access is mediated by session binding and token validation. Tokens are not sent in query strings. The API remains the control-plane authority for terminal token minting and validation, while the terminal gateway handles the stateful websocket path to the node agent.
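To make session binding concrete, here is a hedged sketch of minting and validating a session-bound token with an HMAC. The token layout (`sessionID.expiry.signature`), the use of HMAC-SHA256, and the function names are assumptions for illustration; the platform's real token format is not described on this page. How the token travels (it must not be a query string) is left out of the sketch.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// sign returns a URL-safe HMAC-SHA256 signature over payload.
func sign(key []byte, payload string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(payload))
	return base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// MintTerminalToken binds a token to one session and an expiry (Unix seconds).
// Assumption: sessionID contains no "." characters.
func MintTerminalToken(key []byte, sessionID string, expiresUnix int64) string {
	payload := sessionID + "." + strconv.FormatInt(expiresUnix, 10)
	return payload + "." + sign(key, payload)
}

// ValidateTerminalToken checks the signature, the session binding, and expiry.
func ValidateTerminalToken(key []byte, token, sessionID string, nowUnix int64) error {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return errors.New("malformed token")
	}
	payload := parts[0] + "." + parts[1]
	// hmac.Equal compares in constant time to avoid leaking signature bytes.
	if !hmac.Equal([]byte(sign(key, payload)), []byte(parts[2])) {
		return errors.New("bad signature")
	}
	if parts[0] != sessionID {
		return errors.New("token not bound to this session")
	}
	exp, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return errors.New("malformed expiry")
	}
	if nowUnix >= exp {
		return errors.New("token expired")
	}
	return nil
}

func main() {
	key := []byte("demo-key-not-for-production")
	now := time.Now().Unix()
	token := MintTerminalToken(key, "session-123", now+60)
	fmt.Println("same session:", ValidateTerminalToken(key, token, "session-123", now))
	fmt.Println("other session:", ValidateTerminalToken(key, token, "session-456", now))
}
```

In this sketch the gateway never needs the signing key if it asks the API to validate, which matches the split described above: the API stays the authority and the gateway only carries the websocket.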
Billing Flow
Billing is ledger-backed. Active usage windows generate ledger entries; balances are computed from immutable ledger rows. Low-balance and depleted-balance events drive notifications and force-release behavior. Operators should treat the ledger as append-only: corrections are new entries, never edits to old entries.
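The append-only rule can be illustrated with a small in-memory ledger. This is a sketch under stated assumptions: amounts in cents, the `Entry` fields, the `BalanceEvent` helper, and the low-balance threshold are all hypothetical and do not describe the real billing schema.

```go
package main

import "fmt"

// Entry is one immutable ledger row. Field names are illustrative.
type Entry struct {
	ID int
	// AmountCents is positive for credits and negative for usage charges.
	AmountCents int64
	Reason      string
	// CorrectsID points at the entry being corrected, or 0 if none.
	CorrectsID int
}

// Ledger exposes only Append and read methods, so existing rows are never edited.
type Ledger struct {
	entries []Entry
}

// Append adds a new row and returns it; corrections are also appended.
func (l *Ledger) Append(amountCents int64, reason string, correctsID int) Entry {
	e := Entry{
		ID:          len(l.entries) + 1,
		AmountCents: amountCents,
		Reason:      reason,
		CorrectsID:  correctsID,
	}
	l.entries = append(l.entries, e)
	return e
}

// Balance is always derived from the rows, never stored and mutated.
func (l *Ledger) Balance() int64 {
	var sum int64
	for _, e := range l.entries {
		sum += e.AmountCents
	}
	return sum
}

// BalanceEvent names the event a balance would raise, given an ASSUMED
// low-balance threshold. An empty string means no event.
func BalanceEvent(balanceCents, lowThresholdCents int64) string {
	switch {
	case balanceCents <= 0:
		return "depleted"
	case balanceCents < lowThresholdCents:
		return "low"
	default:
		return ""
	}
}

func main() {
	l := &Ledger{}
	l.Append(10000, "top-up", 0)
	l.Append(-2500, "usage window", 0)
	overcharge := l.Append(-900, "usage window", 0)
	// The window should have cost 600, so refund 300 as a new entry.
	l.Append(300, "correction", overcharge.ID)
	fmt.Println("balance:", l.Balance()) // prints "balance: 6900"
}
```

Because the overcharge row stays in place and the correction references it, an operator can later reconstruct both what was charged and why it was adjusted.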
Event And Evidence Flow
Domain changes that need cross-service visibility are written through the outbox. Events are published by the relay, consumed by workers, and tied back to correlation IDs. Release, UAT, runtime, security, and incident evidence should all follow one rule: prove behavior through APIs and read models first, and inspect data directly only when no owning surface exists.
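A relay pass over the outbox might look roughly like the sketch below. It is in-memory on purpose: the `OutboxRow` shape, the `Publisher` interface, and the stop-at-first-failure behavior are assumptions for illustration, and no real Postgres or NATS client calls are shown.

```go
package main

import (
	"errors"
	"fmt"
)

// OutboxRow is a hypothetical shape for a pending domain event.
type OutboxRow struct {
	ID            int
	Subject       string
	CorrelationID string
	Payload       string
	Published     bool
}

// Publisher abstracts the broker so the relay logic can be tested without one.
type Publisher interface {
	Publish(subject, correlationID, payload string) error
}

// RelayOnce publishes unpublished rows in order and stops at the first failure.
// The failed row and everything after it stay unpublished for the next pass,
// so consumers should expect at-least-once delivery.
func RelayOnce(rows []OutboxRow, p Publisher) (int, error) {
	sent := 0
	for i := range rows {
		if rows[i].Published {
			continue
		}
		err := p.Publish(rows[i].Subject, rows[i].CorrelationID, rows[i].Payload)
		if err != nil {
			return sent, fmt.Errorf("publish row %d: %w", rows[i].ID, err)
		}
		rows[i].Published = true
		sent++
	}
	return sent, nil
}

// memoryPublisher records what it saw and can fail on one subject.
type memoryPublisher struct {
	failOn string
	seen   []string
}

func (m *memoryPublisher) Publish(subject, correlationID, payload string) error {
	if subject == m.failOn {
		return errors.New("broker unavailable")
	}
	m.seen = append(m.seen, correlationID+" "+subject)
	return nil
}

func main() {
	rows := []OutboxRow{
		{ID: 1, Subject: "allocation.requested", CorrelationID: "req-1", Payload: "{}"},
		{ID: 2, Subject: "allocation.active", CorrelationID: "req-1", Payload: "{}"},
	}
	pub := &memoryPublisher{}
	sent, err := RelayOnce(rows, pub)
	fmt.Println("sent:", sent, "err:", err)
	fmt.Println("seen:", pub.seen)
}
```

Carrying the correlation ID on every published event is what lets a worker log line or trace be tied back to the originating API request.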
Architecture Guardrails
- API and event changes start in doc/api/**.
- Service packages own their domain tables.
- Cross-domain coordination flows through events, APIs, or explicit shared platform services.
- The platform-foundation implementation sequence is maps first, guard visibility second, facade implementation third.
- Report-only guards should graduate to warning and then blocking gates when the team has cleaned up false positives and agreed the boundary is enforceable.
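The graduation path for guards can be sketched as a three-level mode with an explicit promotion check. Everything here is hypothetical: the `Mode` type, `Evaluate`, `Graduate`, and the inputs used to decide promotion are illustrative names, not the platform's actual guard tooling.

```go
package main

import "fmt"

// Mode is the enforcement level of an architecture guard.
type Mode int

const (
	ReportOnly Mode = iota
	Warning
	Blocking
)

func (m Mode) String() string {
	switch m {
	case ReportOnly:
		return "report-only"
	case Warning:
		return "warning"
	case Blocking:
		return "blocking"
	default:
		return "unknown"
	}
}

// Evaluate reports whether a change may proceed. Only a blocking guard
// with at least one violation stops it; lower modes just surface findings.
func Evaluate(m Mode, violations int) bool {
	return m != Blocking || violations == 0
}

// Graduate promotes a guard by one level, and only when known false
// positives are cleaned up and the team has agreed the boundary is enforceable.
func Graduate(m Mode, falsePositives int, boundaryAgreed bool) Mode {
	if m == Blocking || falsePositives > 0 || !boundaryAgreed {
		return m
	}
	return m + 1
}

func main() {
	mode := ReportOnly
	fmt.Println(mode, "allows change with 3 violations:", Evaluate(mode, 3))
	mode = Graduate(mode, 0, true)
	mode = Graduate(mode, 0, true)
	fmt.Println(mode, "allows change with 3 violations:", Evaluate(mode, 3))
}
```

Promoting one level at a time keeps a noisy guard from jumping straight to a blocking gate before its findings have been reviewed at the warning level.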