Shared Platform Governance (in progress)

GPUaaS is both the first AI Factory product and the proving ground for governed agentic engineering. The platform model and the delivery model are tied together: shared services own cross-product controls, and Fairway keeps the work that changes those controls auditable.

Operating Principle

Shared platform services own cross-product controls.
Fairway owns durable work coordination and evidence.
Provider sessions are replaceable execution attachments.

Platform Boundary

Layer	Owns
GPUaaS product domains	Allocations, node lifecycle, terminal access, GPU SKUs, MAAS behavior
App Platform / SDK	App manifests, app catalog, runtime adapters, developer workflows
Token Factory	Model endpoint routing, model policy, inference analytics
Shared platform services	IAM, billing, audit, evidence, status, notification, registry, secrets/PKI, policy

The immediate goal is ownership clarity and contract composition. Physical service extraction comes after routes, schemas, events, read models, and review gates can enforce the boundary.
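The ownership table above can be written down as data, so that a review gate or lint step could ask which layer owns a concern before any service is physically extracted. The following is a minimal sketch under stated assumptions: the layer names and concerns are transcribed from the table, but `OWNERSHIP`, `owner_of`, and `overlapping_concerns` are hypothetical names for illustration, not part of any existing GPUaaS or Fairway API.

```python
from typing import Optional

# Layer -> owned concerns, transcribed from the Platform Boundary table.
# Keys are lower-cased so lookups are case-insensitive.
OWNERSHIP: dict[str, frozenset[str]] = {
    "GPUaaS product domains": frozenset({
        "allocations", "node lifecycle", "terminal access",
        "gpu skus", "maas behavior",
    }),
    "App Platform / SDK": frozenset({
        "app manifests", "app catalog", "runtime adapters",
        "developer workflows",
    }),
    "Token Factory": frozenset({
        "model endpoint routing", "model policy", "inference analytics",
    }),
    "Shared platform services": frozenset({
        "iam", "billing", "audit", "evidence", "status", "notification",
        "registry", "secrets/pki", "policy",
    }),
}


def owner_of(concern: str) -> Optional[str]:
    """Return the layer that owns a concern, or None if it is unmapped."""
    key = concern.strip().lower()
    for layer, concerns in OWNERSHIP.items():
        if key in concerns:
            return layer
    return None


def overlapping_concerns() -> set[str]:
    """Return concerns claimed by more than one layer (ideally empty)."""
    seen: set[str] = set()
    dupes: set[str] = set()
    for concerns in OWNERSHIP.values():
        dupes |= seen & concerns
        seen |= concerns
    return dupes
```

An unmapped concern returning `None` is a deliberate choice in this sketch: it surfaces an ownership gap for review rather than defaulting it to any layer.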

Agentic Engineering Boundary

Control	Authority
Fairway task	Scope, owner, status, dependencies, risk, review domains
Provider session	Execution attachment for Codex, Claude, Gemini, tmux, shell, or browser work
Evidence artifact	Command/result, source SHA, environment, logs, screenshots, UAT or scan output
Review record	Independent domain approval or concrete requested changes
Deploy-run	CI, deploy, smoke, UAT, rollback, and follow-up evidence

Provider chat is useful context, but it is not approval. Fairway evidence and reviews are the durable record.
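One way to picture this boundary is a small data model in which provider sessions hang off a task but are never consulted when deciding closure. This is only a sketch: every class and field name below is hypothetical, and the closure rule (at least one evidence artifact, plus an approving latest review for each review domain) is an assumption made for illustration, not Fairway's documented behavior.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvidenceArtifact:
    command: str
    result: str
    source_sha: str


@dataclass(frozen=True)
class ReviewRecord:
    domain: str
    approved: bool
    requested_changes: str = ""


@dataclass(frozen=True)
class ProviderSession:
    provider: str          # e.g. "codex", "claude", "tmux"
    transcript: str = ""   # useful context, never treated as approval


@dataclass
class FairwayTask:
    task_id: str
    owner: str
    review_domains: tuple[str, ...]
    evidence: list[EvidenceArtifact] = field(default_factory=list)
    reviews: list[ReviewRecord] = field(default_factory=list)
    sessions: list[ProviderSession] = field(default_factory=list)


def can_close(task: FairwayTask) -> bool:
    """Decide closure from evidence and reviews only (assumed rule)."""
    if not task.evidence:
        return False
    # The most recent review per domain wins, in list order.
    latest: dict[str, ReviewRecord] = {}
    for review in task.reviews:
        latest[review.domain] = review
    return all(
        domain in latest and latest[domain].approved
        for domain in task.review_domains
    )
```

Note that `can_close` never reads `task.sessions`: a session transcript that says "looks good" changes nothing until a review record and evidence artifact exist.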

Security Boundaries

Surface	Required control
MFA	Keycloak owns human MFA enforcement; GPUaaS consumes provider posture and claims without collecting factor secrets
Secrets / PKI	Custody stays in Vault, step-ca, cert-manager, and service identity tooling; GPUaaS records purpose, policy, delivery, and audit
CI runners	Scaleout requires non-secret inventory, host headroom, and ops approval
Cloudflare / edge	DNS, tunnel, Access, TLS, and route changes need explicit ops/security evidence and rollback
RTE environments	Boundary, segmentation, storage, observability, IAM, and separation exports are required before service exposure closure
UAT / deploy	Meaningful deploy and UAT attempts need Fairway deploy-runs with deterministic artifacts
Agent automation	Agents may execute and summarize; Fairway reviews and evidence decide closure

Closeout Behavior

When a lane is waiting on reviewers, credentials, exports, or an approval window, the team should keep moving on safe fallback work:

Close deploy/CI monitor tasks with terminal evidence.
Run approved non-production UAT and smoke harnesses.
Convert findings into scoped follow-up tasks.
Update runbooks, architecture docs, operations docs, evidence packets, and this portal.
Reconcile Fairway and route reviews before switching lanes.

Fallback work does not loosen production controls. Keycloak, runner, Cloudflare, firewall, route, RTE, secret, destructive cleanup, and production deploy changes still need explicit approval and rollback criteria.
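The rule above can be sketched as a pre-flight check that lists what a change still lacks before it may proceed. The controlled categories are taken from the sentence above; the `ChangeRequest` shape, the category slugs, and `missing_controls` are hypothetical names assumed for illustration.

```python
from dataclasses import dataclass

# Categories named above as always needing approval and rollback criteria.
# The slug spellings are an assumption for this sketch.
CONTROLLED_CATEGORIES = frozenset({
    "keycloak", "runner", "cloudflare", "firewall", "route", "rte",
    "secret", "destructive-cleanup", "production-deploy",
})


@dataclass(frozen=True)
class ChangeRequest:
    category: str
    approved_by: str = ""
    rollback_criteria: str = ""


def missing_controls(change: ChangeRequest) -> list[str]:
    """Return the controls a change still lacks; empty means it may proceed."""
    if change.category.strip().lower() not in CONTROLLED_CATEGORIES:
        return []  # fallback work such as docs or runbook updates
    missing = []
    if not change.approved_by.strip():
        missing.append("explicit approval")
    if not change.rollback_criteria.strip():
        missing.append("rollback criteria")
    return missing
```

For example, `missing_controls(ChangeRequest("cloudflare", approved_by="ops"))` reports that rollback criteria are still missing, while a runbook update passes with no requirements.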

Current Closeout Dependencies

MFA live drill evidence is required before sensitive-operation MFA gates.
RTE export evidence is required before service exposure baseline closure.
Runner inventory and host-headroom evidence are required before controlled scaleout.
Kind deploy and deterministic smoke evidence exist, but full kind UAT and dev deploy remain gated by OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001. That gate lifts once compute-vm-small in local-maas-lxd has schedulable capacity, or once an approved alternate profile or waiver is recorded.
Portal source checks and build evidence are required before the documentation portal is called current.
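The dependency list above maps each gated outcome to the evidence it waits on. A minimal sketch of that mapping follows, assuming hypothetical evidence labels and a hypothetical `open_gates` helper; the kind capacity condition (restored capacity, or an approved alternate profile or waiver) is collapsed into a single label for simplicity.

```python
# Gated outcome -> evidence that must exist first.
# Labels are illustrative shorthand for the items listed above.
GATES: dict[str, list[str]] = {
    "sensitive-operation MFA gates": ["MFA live drill"],
    "service exposure baseline closure": ["RTE export"],
    "controlled runner scaleout": ["runner inventory", "host headroom"],
    # Tracked by OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001.
    "full kind UAT and dev deploy": ["kind capacity resolution"],
    "documentation portal called current": [
        "portal source checks", "portal build",
    ],
}


def open_gates(evidence: set[str]) -> dict[str, list[str]]:
    """Return each still-blocked outcome with the evidence it is missing."""
    blocked: dict[str, list[str]] = {}
    for gate, required in GATES.items():
        missing = [item for item in required if item not in evidence]
        if missing:
            blocked[gate] = missing
    return blocked
```

Calling `open_gates({"runner inventory"})` would still report controlled runner scaleout as blocked on host headroom, which mirrors the requirement that both pieces of evidence exist before scaleout.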

Current Safe Work While Capacity Is Blocked

The closeout program can still progress without weakening release controls:

Keep UAT coverage matrices current with evidence paths and Fairway blockers.
Run non-mutating kind smoke for auth/session, account/security, catalog/read models, billing and finance reads, admin/ops read models, and terminal connect against the existing active allocation.
Update runbooks, source-of-truth maps, portal pages, and cleanup/archive recommendations.
Route the ops decision for capacity restore, alternate profile, or scoped waiver before mutating UAT or dev deploy.
