
System Overview (status: designed)

GPUaaS uses a contract-first control plane: a single API/BFF (backend-for-frontend) binary today with domain packages behind it, workers for asynchronous work, and a pull-based node agent model for GPU host operations. This portal page summarizes that shape; the canonical contracts and implementation rules remain in doc/api/** and doc/architecture/**.

Runtime Topology

| Layer | Main components | Responsibility |
| --- | --- | --- |
| Edge | Public ingress, WAF, TLS, websocket routing policy | Terminate public traffic, enforce edge policy, route API and terminal traffic |
| API/control plane | cmd/api, domain packages, middleware, policy client | Authenticate, authorize, validate contracts, write domain state, mint access/session tokens |
| Workers | Billing, provisioning, webhook, notification relay, outbox relay | Process long-running, scheduled, webhook, notification, and event publishing flows |
| Workflow/event layer | Temporal, NATS JetStream, outbox | Coordinate durable workflows and async state changes |
| Data layer | Postgres, Redis, object storage, secrets/PKI | Store durable state, session/cache state, user data, credentials, and certificates |
| Node fleet | cmd/node-agent, GPU hosts, terminal relay path | Execute typed node tasks, report status, serve user access under policy |
| Observability | OpenTelemetry, Prometheus, structured logs, runbook evidence | Trace requests, inspect workers/events, alert on degradation, preserve release/incident evidence |

Control-Plane Flow

Provisioning Flow

  1. A user or automation requests capacity through the API.
  2. The API validates identity, project/tenant scope, policy, idempotency, and catalog availability.
  3. The provisioning domain writes allocation state and an outbox row in the same database transaction.
  4. The outbox relay publishes the domain event to NATS.
  5. Provisioning workers and Temporal activities drive the allocation lifecycle.
  6. The node agent receives typed tasks through the approved control-plane path.
  7. Allocation state moves through requested, provisioning, active, releasing, released, failed, or release_failed.
  8. Billing, notifications, and operator evidence react to state changes through events and read models.
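
The allocation lifecycle in step 7 can be read as a small state machine. The sketch below is illustrative only: the state names come from the list above, but the transition table, the `transition` helper, and the assumption that a failed release can be retried are hypothetical and are not taken from doc/api/** or doc/architecture/**.

```python
from enum import Enum


class AllocationState(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASED = "released"
    FAILED = "failed"
    RELEASE_FAILED = "release_failed"


S = AllocationState

# Hypothetical transition table: an assumption for illustration,
# not the canonical lifecycle contract.
ALLOWED = {
    S.REQUESTED: {S.PROVISIONING, S.FAILED},
    S.PROVISIONING: {S.ACTIVE, S.FAILED},
    S.ACTIVE: {S.RELEASING},
    S.RELEASING: {S.RELEASED, S.RELEASE_FAILED},
    S.RELEASE_FAILED: {S.RELEASING},  # assumed: a failed release may be retried
    S.RELEASED: set(),                # terminal
    S.FAILED: set(),                  # terminal
}


def transition(current: AllocationState, target: AllocationState) -> AllocationState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data rather than scattered conditionals makes it straightforward to compare against the canonical contract once that is consulted.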

Access And Terminal Flow

Browser terminal access is mediated by session binding and token validation. Tokens are not sent in query strings. The API remains the control-plane authority for terminal token minting and validation, while the terminal gateway handles the stateful websocket path to the node agent.
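
As a rough illustration of session-bound token minting and validation, the sketch below signs a short-lived claim set with HMAC-SHA256 and rejects tokens that are tampered with, expired, or presented under a different session. The token format, claim names, TTL, and function names are all assumptions made for this example; the source does not describe the real token scheme or how the token is carried, beyond stating that it is not sent in a query string.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def mint_terminal_token(secret: bytes, session_id: str, allocation_id: str,
                        ttl_s: int = 60, now: Optional[float] = None) -> str:
    """Mint a signed token bound to one session (hypothetical format)."""
    issued = time.time() if now is None else now
    claims = {"sid": session_id, "alloc": allocation_id, "exp": issued + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    sig = hmac.new(secret, payload, hashlib.sha256).digest()
    return _b64(payload) + "." + _b64(sig)


def validate_terminal_token(secret: bytes, token: str, session_id: str,
                            now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the token is authentic, unexpired, and bound
    to session_id; otherwise return None."""
    current = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong part count or invalid base64
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        return None
    claims = json.loads(payload)
    if claims["sid"] != session_id or claims["exp"] < current:
        return None
    return claims
```

In this sketch the session binding is the `sid` check: a token copied into a different browser session fails validation even though its signature is intact.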

Billing Flow

Billing is ledger-backed. Active usage windows generate ledger entries; balances are computed from immutable ledger rows. Low-balance and depleted-balance events drive notifications and force-release behavior. Operators should treat the ledger as append-only: corrections are new entries, never edits to old entries.
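
The append-only rule can be sketched in a few lines. This is a minimal in-memory model, not the billing schema: the `LedgerEntry` fields, the sign convention (positive for credits, negative for usage charges), and the `correct` helper are assumptions for illustration, and a Python list stands in for the ledger table.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)  # frozen: an entry cannot be edited after creation
class LedgerEntry:
    entry_id: int
    account: str
    amount: Decimal            # assumed: positive = credit, negative = charge
    reason: str
    corrects: Optional[int] = None  # id of the entry this one reverses, if any


class Ledger:
    def __init__(self) -> None:
        self._entries = []

    def append(self, account, amount, reason, corrects=None) -> LedgerEntry:
        entry = LedgerEntry(len(self._entries) + 1, account,
                            Decimal(amount), reason, corrects)
        self._entries.append(entry)
        return entry

    def correct(self, entry_id: int, reason: str) -> LedgerEntry:
        """Reverse an earlier entry by appending its negation."""
        original = self._entries[entry_id - 1]
        return self.append(original.account, -original.amount,
                           reason, corrects=entry_id)

    def balance(self, account: str) -> Decimal:
        """Compute the balance from the rows; nothing is stored separately."""
        return sum((e.amount for e in self._entries if e.account == account),
                   Decimal("0"))

    def __len__(self) -> int:
        return len(self._entries)
```

A mistaken charge is undone with `ledger.correct(charge.entry_id, "billing error")`, which leaves both the original row and its reversal visible for audit.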

Event And Evidence Flow

Domain changes that need cross-service visibility are written through the outbox. Events are published by the relay, consumed by workers, and tied back to correlation IDs. Release, UAT, runtime, security, and incident evidence should use the same posture: API/read-model proof first, direct data inspection only when an owning surface is missing.
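
The outbox pattern described here can be sketched with an in-memory SQLite database standing in for Postgres. The table layout, event type string, and the `publish` callback (a stand-in for a NATS publisher) are hypothetical; the point is only that the domain row and the outbox row commit in one transaction, and that the relay publishes before marking a row as sent.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE allocations (id TEXT PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            correlation_id TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn


def request_allocation(conn, allocation_id, correlation_id) -> None:
    # "with conn" commits on success and rolls back on any exception,
    # so the state row and the outbox row are written together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO allocations (id, state) VALUES (?, 'requested')",
            (allocation_id,),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload, correlation_id) "
            "VALUES (?, ?, ?)",
            ("allocation.requested",
             json.dumps({"allocation_id": allocation_id}),
             correlation_id),
        )


def relay_once(conn, publish) -> int:
    """Publish unpublished rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT seq, event_type, payload, correlation_id "
        "FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload, correlation_id in rows:
        publish(event_type, json.loads(payload), correlation_id)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because the relay in this sketch marks a row only after publishing, a crash between the two steps would republish the event on restart. Consumers in such a design need to tolerate duplicates, for example by deduplicating on the correlation ID or event sequence.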

Architecture Guardrails

  • API and event changes start in doc/api/**.
  • Service packages own their domain tables.
  • Cross-domain coordination flows through events, APIs, or explicit shared platform services.
  • The platform-foundation implementation sequence is maps first, guard visibility second, facade implementation third.
  • Report-only guards should graduate to warning and then blocking gates when the team has cleaned up false positives and agreed the boundary is enforceable.
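
The guard graduation path in the last bullet can be pictured as a single mode switch per guard. The following is a hypothetical sketch: `GuardMode`, `evaluate_guard`, and the message prefixes are invented for illustration and do not describe the project's actual guard tooling.

```python
from enum import Enum


class GuardMode(Enum):
    REPORT_ONLY = "report_only"  # record violations, never affect the build
    WARNING = "warning"          # surface violations loudly, still pass
    BLOCKING = "blocking"        # fail the gate on any violation


def evaluate_guard(mode: GuardMode, violations: list):
    """Return (passed, messages) for a list of boundary violations."""
    if not violations:
        return True, []
    if mode is GuardMode.REPORT_ONLY:
        return True, [f"report: {v}" for v in violations]
    if mode is GuardMode.WARNING:
        return True, [f"WARNING: {v}" for v in violations]
    return False, [f"BLOCKED: {v}" for v in violations]
```

Under this model, graduating a guard is a one-line configuration change, which keeps the decision to enforce separate from the work of cleaning up false positives.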