Shared Platform Governance (in progress)

GPUaaS is both the first AI Factory product and the proving ground for governed agentic engineering. The platform model and the delivery model are tied together: shared services own cross-product controls, and Fairway keeps the work that changes those controls auditable.

Operating Principle

Shared platform services own cross-product controls.
Fairway owns durable work coordination and evidence.
Provider sessions are replaceable execution attachments.

Platform Boundary

Layer | Owns
GPUaaS product domains | Allocations, node lifecycle, terminal access, GPU SKUs, MAAS behavior
App Platform / SDK | App manifests, app catalog, runtime adapters, developer workflows
Token Factory | Model endpoint routing, model policy, inference analytics
Shared platform services | IAM, billing, audit, evidence, status, notification, registry, secrets/PKI, policy

The immediate goal is ownership clarity and contract composition. Physical service extraction comes after routes, schemas, events, read models, and review gates can enforce the boundary.
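
One way to picture ownership clarity is a lookup that resolves each capability to exactly one owning layer. The sketch below mirrors the Platform Boundary rows above. The `OWNERSHIP` dictionary and the `owner_of` helper are hypothetical names used for illustration only; they are not part of GPUaaS or Fairway.

```python
# Hypothetical ownership map mirroring the Platform Boundary rows.
OWNERSHIP = {
    "GPUaaS product domains": [
        "allocations", "node lifecycle", "terminal access", "GPU SKUs", "MAAS behavior",
    ],
    "App Platform / SDK": [
        "app manifests", "app catalog", "runtime adapters", "developer workflows",
    ],
    "Token Factory": [
        "model endpoint routing", "model policy", "inference analytics",
    ],
    "Shared platform services": [
        "IAM", "billing", "audit", "evidence", "status", "notification",
        "registry", "secrets/PKI", "policy",
    ],
}


def owner_of(capability: str) -> str:
    """Return the single layer that owns a capability, or raise if ownership is unclear."""
    wanted = capability.casefold()
    owners = [
        layer
        for layer, capabilities in OWNERSHIP.items()
        if wanted in (c.casefold() for c in capabilities)
    ]
    if len(owners) != 1:
        raise LookupError(f"{capability!r} has {len(owners)} owners; expected exactly 1")
    return owners[0]
```

A check like this could run in a review gate, so that a capability with zero or multiple owners fails early, before any physical service extraction.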

Agentic Engineering Boundary

Control | Authority
Fairway task | Scope, owner, status, dependencies, risk, review domains
Provider session | Execution attachment for Codex, Claude, Gemini, tmux, shell, or browser work
Evidence artifact | Command/result, source SHA, environment, logs, screenshots, UAT or scan output
Review record | Independent domain approval or concrete requested changes
Deploy-run | CI, deploy, smoke, UAT, rollback, and follow-up evidence

Provider chat is useful context, but it is not approval. Fairway evidence and reviews are the durable record.
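
The rule that provider chat is context rather than approval can be sketched as a closure check that only reads evidence and reviews. Every type and field name below (`FairwayTask`, `EvidenceArtifact`, `ReviewRecord`, `can_close`) is an assumption made for illustration; Fairway's actual data model is not described in this page.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceArtifact:
    command: str
    result: str
    source_sha: str


@dataclass
class ReviewRecord:
    domain: str
    approved: bool  # False means concrete changes were requested


@dataclass
class FairwayTask:
    task_id: str
    review_domains: list[str]
    evidence: list[EvidenceArtifact] = field(default_factory=list)
    reviews: list[ReviewRecord] = field(default_factory=list)
    provider_chat: list[str] = field(default_factory=list)  # context only, never consulted


def can_close(task: FairwayTask) -> bool:
    """Closure needs at least one evidence artifact and an approval in every review domain."""
    # Later review records override earlier ones for the same domain.
    latest = {review.domain: review.approved for review in task.reviews}
    all_approved = all(latest.get(domain) is True for domain in task.review_domains)
    return bool(task.evidence) and all_approved
```

Note that `provider_chat` is stored but never read by `can_close`, which is the point: a session transcript that says "looks good" cannot close a task in this sketch.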

Security Boundaries

Surface | Required control
MFA | Keycloak owns human MFA enforcement; GPUaaS consumes provider posture and claims without collecting factor secrets
Secrets / PKI | Custody stays in Vault, step-ca, cert-manager, and service identity tooling; GPUaaS records purpose, policy, delivery, and audit
CI runners | Scaleout requires non-secret inventory, host headroom, and ops approval
Cloudflare / edge | DNS, tunnel, Access, TLS, and route changes need explicit ops/security evidence and rollback
RTE environments | Boundary, segmentation, storage, observability, IAM, and separation exports are required before service exposure closure
UAT / deploy | Meaningful deploy and UAT attempts need Fairway deploy-runs with deterministic artifacts
Agent automation | Agents may execute and summarize; Fairway reviews and evidence decide closure

Closeout Behavior

When a lane is waiting on reviewers, credentials, exports, or an approval window, the team should keep moving on safe fallback work:

  1. Close deploy/CI monitor tasks with terminal evidence.
  2. Run approved non-production UAT and smoke harnesses.
  3. Convert findings into scoped follow-up tasks.
  4. Update runbooks, architecture docs, operations docs, evidence packets, and this portal.
  5. Reconcile Fairway and route reviews before switching lanes.

Fallback work does not loosen production controls. Keycloak, runner, Cloudflare, firewall, route, RTE, secret, destructive cleanup, and production deploy changes still need explicit approval and rollback criteria.
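
As a minimal sketch of that rule, a pre-flight check could refuse any controlled change that lacks an approval reference or rollback criteria. The `ChangeRequest` type, its fields, and the category strings are hypothetical; treating every other kind of change as free to proceed is also an assumption of this example, not a statement of policy.

```python
from dataclasses import dataclass
from typing import Optional

# Change categories named in the paragraph above (string labels are illustrative).
CONTROLLED_CHANGES = frozenset({
    "keycloak", "runner", "cloudflare", "firewall", "route", "rte",
    "secret", "destructive-cleanup", "production-deploy",
})


@dataclass
class ChangeRequest:
    kind: str
    approval_ref: Optional[str] = None       # e.g. an identifier for a recorded approval
    rollback_criteria: Optional[str] = None  # when and how the change is reverted


def may_proceed(change: ChangeRequest) -> bool:
    """Controlled changes need both explicit approval and rollback criteria."""
    if change.kind not in CONTROLLED_CHANGES:
        return True  # assumption: uncontrolled fallback work is not gated here
    return bool(change.approval_ref) and bool(change.rollback_criteria)
```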

Current Closeout Dependencies

  • MFA live drill evidence is required before sensitive-operation MFA gates.
  • RTE export evidence is required before service exposure baseline closure.
  • Runner inventory and host-headroom evidence are required before controlled scaleout.
  • Kind deploy and deterministic smoke evidence exist. Full kind UAT and dev deploy remain gated by OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001 until compute-vm-small in local-maas-lxd has schedulable capacity, or until an approved alternate profile or waiver is recorded.
  • Portal source checks and build evidence are required before the documentation portal is called current.
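
The evidence-before-gate dependencies above can be read as a map from each gate to the evidence it is waiting on. The sketch below is illustrative only: the evidence keys and the `blocked_gates` helper are made-up names. The kind UAT gate is left out because it depends on a capacity condition or a recorded waiver rather than a plain evidence list.

```python
# Gate -> evidence that must exist first (keys are hypothetical labels).
GATE_EVIDENCE = {
    "sensitive-operation MFA gates": {"mfa-live-drill"},
    "service exposure baseline closure": {"rte-export"},
    "controlled runner scaleout": {"runner-inventory", "host-headroom"},
    "documentation portal current": {"portal-source-checks", "portal-build"},
}


def blocked_gates(available: set[str]) -> dict[str, set[str]]:
    """Return each gate that is still blocked, with the evidence it is missing."""
    return {
        gate: needed - available
        for gate, needed in GATE_EVIDENCE.items()
        if needed - available
    }
```

For example, calling `blocked_gates({"rte-export", "runner-inventory"})` would report the MFA and portal gates as fully blocked and the runner scaleout gate as missing only `host-headroom`.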

Current Safe Work While Capacity Is Blocked

The closeout program can still progress without weakening release controls:

  1. Keep UAT coverage matrices current with evidence paths and Fairway blockers.
  2. Run non-mutating kind smoke for auth/session, account/security, catalog/read models, billing and finance reads, admin/ops read models, and terminal connect against the existing active allocation.
  3. Update runbooks, source-of-truth maps, portal pages, and cleanup/archive recommendations.
  4. Route the ops decision for capacity restore, alternate profile, or scoped waiver before mutating UAT or dev deploy.
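
The split between items 2 and 4 can be sketched as a small guard: non-mutating work runs freely, while mutating UAT or dev deploy waits for a recorded ops decision. The function name, parameters, and decision labels are assumptions for illustration, not an existing Fairway or GPUaaS interface.

```python
from typing import Optional

# Ops decisions that would unblock mutating work (labels are illustrative).
UNBLOCKING_DECISIONS = frozenset({"capacity-restored", "alternate-profile", "scoped-waiver"})


def allowed_while_capacity_blocked(mutating: bool, ops_decision: Optional[str] = None) -> bool:
    """Non-mutating work is always allowed; mutating work needs a recorded ops decision."""
    if not mutating:
        return True
    return ops_decision in UNBLOCKING_DECISIONS
```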