Shared Platform Governance (in progress)
GPUaaS is both the first AI Factory product and the proving ground for governed agentic engineering. The platform model and the delivery model are tied together: shared services own cross-product controls, and Fairway keeps the work that changes those controls auditable.
Operating Principle
Shared platform services own cross-product controls.
Fairway owns durable work coordination and evidence.
Provider sessions are replaceable execution attachments.
Platform Boundary
| Layer | Owns |
|---|---|
| GPUaaS product domains | Allocations, node lifecycle, terminal access, GPU SKUs, MAAS behavior |
| App Platform / SDK | App manifests, app catalog, runtime adapters, developer workflows |
| Token Factory | Model endpoint routing, model policy, inference analytics |
| Shared platform services | IAM, billing, audit, evidence, status, notification, registry, secrets/PKI, policy |
The immediate goal is ownership clarity and contract composition. Physical service extraction comes after routes, schemas, events, read models, and review gates can enforce the boundary.
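As an illustration of ownership clarity, the table above can be read as a lookup in which every concern resolves to exactly one owning layer. The following is a minimal sketch, not the platform's actual registry: the `OWNERS` map and the `owner_of` helper are hypothetical names, and the entries are transcribed from the table.

```python
# Hypothetical ownership map transcribed from the Platform Boundary table.
OWNERS = {
    "GPUaaS product domains": {
        "allocations", "node lifecycle", "terminal access", "GPU SKUs", "MAAS behavior",
    },
    "App Platform / SDK": {
        "app manifests", "app catalog", "runtime adapters", "developer workflows",
    },
    "Token Factory": {
        "model endpoint routing", "model policy", "inference analytics",
    },
    "Shared platform services": {
        "IAM", "billing", "audit", "evidence", "status", "notification",
        "registry", "secrets/PKI", "policy",
    },
}


def owner_of(concern: str) -> str:
    """Return the single layer that owns a concern, or raise if ownership is unclear."""
    matches = [layer for layer, concerns in OWNERS.items() if concern in concerns]
    if len(matches) != 1:
        # Zero owners or several owners both count as an unclear boundary.
        raise LookupError(f"{concern!r} has {len(matches)} owners; expected exactly 1")
    return matches[0]
```

A review gate could call `owner_of("billing")` to confirm that a change touching billing is routed to shared platform services rather than to a product domain.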
Agentic Engineering Boundary
| Control | Authority |
|---|---|
| Fairway task | Scope, owner, status, dependencies, risk, review domains |
| Provider session | Execution attachment for Codex, Claude, Gemini, tmux, shell, or browser work |
| Evidence artifact | Command/result, source SHA, environment, logs, screenshots, UAT or scan output |
| Review record | Independent domain approval or concrete requested changes |
| Deploy-run | CI, deploy, smoke, UAT, rollback, and follow-up evidence |
Provider chat is useful context, but it is not approval. Fairway evidence and reviews are the durable record.
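The distinction between context and approval can be sketched as a closure check. This is a minimal illustration under stated assumptions, not Fairway's data model: the `Task` and `Review` classes, their fields, and `can_close` are hypothetical, and "independent" is interpreted here as "the reviewer is not the task owner".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Review:
    domain: str      # review domain, e.g. "security"
    reviewer: str
    approved: bool   # False means concrete changes were requested


@dataclass
class Task:
    task_id: str
    owner: str
    review_domains: set[str]
    evidence: list[str] = field(default_factory=list)       # evidence artifact paths
    reviews: list[Review] = field(default_factory=list)
    provider_chat: list[str] = field(default_factory=list)  # context only


def can_close(task: Task) -> bool:
    """Closure needs evidence plus an independent approval for every review domain."""
    if not task.evidence:
        return False
    approved = {
        r.domain
        for r in task.reviews
        if r.approved and r.reviewer != task.owner  # assumption: owner cannot self-approve
    }
    # provider_chat is deliberately never consulted.
    return task.review_domains <= approved
```

In this sketch a provider session can append to `provider_chat` indefinitely without changing the result of `can_close`; only evidence and review records move a task toward closure.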
Security Boundaries
| Surface | Required control |
|---|---|
| MFA | Keycloak owns human MFA enforcement; GPUaaS consumes provider posture and claims without collecting factor secrets |
| Secrets / PKI | Custody stays in Vault, step-ca, cert-manager, and service identity tooling; GPUaaS records purpose, policy, delivery, and audit |
| CI runners | Scaleout requires non-secret inventory, host headroom, and ops approval |
| Cloudflare / edge | DNS, tunnel, Access, TLS, and route changes need explicit ops/security evidence and rollback |
| RTE environments | Boundary, segmentation, storage, observability, IAM, and separation exports are required before service exposure closure |
| UAT / deploy | Meaningful deploy and UAT attempts need Fairway deploy-runs with deterministic artifacts |
| Agent automation | Agents may execute and summarize; Fairway reviews and evidence decide closure |
Closeout Behavior
When a lane is waiting on reviewers, credentials, exports, or an approval window, the team should keep moving on safe fallback work:
- Close deploy/CI monitor tasks with terminal evidence.
- Run approved non-production UAT and smoke harnesses.
- Convert findings into scoped follow-up tasks.
- Update runbooks, architecture docs, operations docs, evidence packets, and this portal.
- Reconcile Fairway and route reviews before switching lanes.
Fallback work does not loosen production controls. Keycloak, runner, Cloudflare, firewall, route, RTE, secret, destructive cleanup, and production deploy changes still need explicit approval and rollback criteria.
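The rule above can be sketched as a small guard. The surface labels, the `CONTROLLED_SURFACES` set, and `may_proceed` are hypothetical names chosen for illustration; treating every unlisted surface as safe fallback work is also an assumption of this sketch, not a stated policy.

```python
# Surfaces named above as still requiring explicit approval and rollback criteria.
CONTROLLED_SURFACES = {
    "keycloak", "runner", "cloudflare", "firewall", "route", "rte",
    "secret", "destructive-cleanup", "production-deploy",
}


def may_proceed(surface: str, approved: bool = False, rollback_criteria: str = "") -> bool:
    """Allow fallback work freely; controlled surfaces need approval and rollback criteria."""
    if surface not in CONTROLLED_SURFACES:
        # Assumption: anything not listed counts as safe fallback work.
        return True
    return approved and bool(rollback_criteria.strip())
```

For example, `may_proceed("cloudflare", approved=True)` is still false in this sketch because no rollback criteria were supplied.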
Current Closeout Dependencies
- MFA live drill evidence is required before sensitive-operation MFA gates.
- RTE export evidence is required before service exposure baseline closure.
- Runner inventory and host-headroom evidence are required before controlled scaleout.
- Kind deploy and deterministic smoke evidence exist, but full kind UAT and dev deploy remain gated by `OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001` until `compute-vm-small` in `local-maas-lxd` has schedulable capacity or an approved alternate profile/waiver is recorded.
- Portal source checks and build evidence are required before the documentation portal is called current.
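These dependencies amount to a mapping from each gate to the evidence it waits on. The sketch below covers the evidence-only gates from the list; the `GATES` map, the evidence labels, and `open_gates` are hypothetical, and the capacity-gated kind UAT item is left out because it depends on an ops decision rather than on evidence alone.

```python
# Hypothetical gate -> required evidence map, paraphrasing the dependency list.
GATES = {
    "sensitive-operation MFA gates": ["MFA live drill"],
    "service exposure baseline closure": ["RTE export"],
    "controlled scaleout": ["runner inventory", "host headroom"],
    "documentation portal called current": ["portal source checks", "portal build"],
}


def open_gates(evidence: set[str]) -> list[str]:
    """Return the gates whose required evidence is fully present, in declaration order."""
    return [
        gate
        for gate, needed in GATES.items()
        if all(item in evidence for item in needed)
    ]
```

Partial evidence does not open a gate: runner inventory without host-headroom evidence leaves controlled scaleout blocked.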
Current Safe Work While Capacity Is Blocked
The closeout program can still progress without weakening release controls:
- Keep UAT coverage matrices current with evidence paths and Fairway blockers.
- Run non-mutating kind smoke for auth/session, account/security, catalog/read models, billing and finance reads, admin/ops read models, and terminal connect against the existing active allocation.
- Update runbooks, source-of-truth maps, portal pages, and cleanup/archive recommendations.
- Route the ops decision for capacity restore, alternate profile, or scoped waiver before mutating UAT or dev deploy.
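One way to picture this split is to tag each planned item as mutating or non-mutating and hold the mutating ones until the ops decision is recorded. The `WorkItem` class, its `mutating` flag, and `runnable` are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    name: str
    mutating: bool  # True for work that changes cluster or environment state


def runnable(items: list[WorkItem], ops_decision_recorded: bool) -> list[str]:
    """Non-mutating work always runs; mutating work waits for the ops decision."""
    return [
        item.name
        for item in items
        if ops_decision_recorded or not item.mutating
    ]


# Example plan using items named in the list above.
PLAN = [
    WorkItem("update UAT coverage matrix", mutating=False),
    WorkItem("non-mutating kind smoke", mutating=False),
    WorkItem("mutating UAT", mutating=True),
    WorkItem("dev deploy", mutating=True),
]
```

With `ops_decision_recorded=False`, only the first two items in `PLAN` are returned; the mutating UAT and dev deploy stay held until capacity restore, an alternate profile, or a scoped waiver is decided.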