# GPUaaS Security and CD Current-State Gap Roadmap v1

Date: 2026-05-29

## Purpose

This document summarizes where GPUaaS currently stands as a production operating
model, what security/CD controls already exist, what is still missing, and the
roadmap to reach a production-grade end state.

The intended audience is platform architecture, security architecture, release
engineering, platform operations, and the agent/coordinator team responsible for
turning this into executable work.

## Architecture Position

GPUaaS should not inherit a direct Dev-to-Prod promotion path or a single-node
patch-validation model. The target operating model must treat production
release, GPU node patching, security validation, and capacity reserve as
first-class controls.

The repo already contains a strong foundation: contract-first APIs, reusable CI
gate scripts, GitLab release orchestration, production enforcement policy,
security verification docs, release promotion discipline, environment profile
resolution, and an agent-native execution model.

The remaining gap is less about missing tools than about enforcement. Controls
that are documented, automated, or still report-only must become enforced
production gates, backed by a dedicated production-like UAT/Security
environment, release rings, capacity reserve, signed artifacts, progressive
rollout, and auditable agent evidence.

## Current State

| Area | Current Capability | Evidence |
|---|---|---|
| Source and CI orchestration | GitLab CI is the active orchestration backbone. Pipeline stages include contracts, build/test, security, SDK, migration, package, preflight, deploy, and post-deploy. | `.gitlab-ci.yml`, `doc/governance/CI_Pipeline_Implementation.md` |
| Portable CI logic | Most gate logic lives in reusable `scripts/ci/*.sh`, making GitLab an orchestrator rather than the only place where controls exist. | `scripts/ci/README.md` |
| Contract-first controls | OpenAPI/AsyncAPI validation, breaking-change checks, contract invariant guards, SDK codegen smoke, and route-structure guards exist. | `scripts/ci/contracts_validate.sh`, `scripts/ci/contracts_breaking_change.sh`, `scripts/ci/sdk_codegen_smoke.sh` |
| Security scans | SAST/secret/dependency/image/DAST scripts exist or are wired into report paths. Tools include gosec, semgrep, gitleaks, govulncheck, Trivy, ZAP, and Schemathesis where available. | `scripts/ci/security_*.sh`, `doc/operations/Security_Scan_Triage_2026-04-01.md` |
| Runtime invariant guards | Audit, outbox, policy literal, logging, correlation, trace, node control-plane, and token/query protections exist as repo gates. | `scripts/ci/*guard.sh` |
| Release branch discipline | `release/platform-control` is a promotion branch, not a normal development branch. Releases should promote one exact SHA from `master`. | `doc/governance/Platform_Control_Release_Promotion_Policy.md` |
| Multi-environment model | Local-kind, dev-control, demo, and planned staging/prod profiles are modeled with environment/profile/artifact/gate separation. | `doc/operations/Platform_Control_CI_CD_Multi_Environment_Model_v1.md` |
| UAT automation | Persona-based UAT automation exists for the current demo environment. It includes read-only checks, gated mutating checks, Playwright browser journeys, API/CLI/SDK smokes, provider-capacity/read-model checks, app-route checks, terminal/WebSocket checks, and timestamped evidence under `dist/uat/`. | `doc/operations/Demo_UAT_Package_v1.md`, `doc/operations/Demo_UAT_Flow_Coverage_Matrix_v1.md`, `doc/governance/Persona_Journey_UAT_Model_v1.md`, `scripts/ops/demo_uat_package.sh` |
| Ops readiness tracking | Public launch controls are tracked for SLOs, runbooks, backup/restore, secrets, east/west security, cert lifecycle, load, audit, and cost. | `doc/operations/Parallel_Ops_Track.md`, `doc/operations/Production_Platform_Baseline.md` |
| Agent execution model | Role lanes, queue authority, reviewer lanes, context packets, watcher lanes, and handoff evidence are documented. | `doc/governance/Agent_Orchestrator_v2_Coordinated_Execution.md`, `doc/governance/Multi_Agent_Lane_Worktrees_v1.md` |

## Key Gaps

| Gap | Current Risk | Target State |
|---|---|---|
| Production-like UAT/Security environment | UAT automation is strong, but it is currently condensed into the available demo/kind/dev-control environments. That proves workflows and catches product gaps, but it does not fully prove production GPU infrastructure, security, and rollout behavior in a separate environment. | Dedicated UAT/Security and staging environments using production-like GPU nodes, networking, identity, storage, and edge paths; reuse the existing UAT automation there. |
| Security scan enforcement | Scan summaries feed `security_promotion_gate.sh`, and full release promotion can now fail on missing SAST/SCA/secrets/image/DAST/hardening evidence or unwaived high/critical-class findings. The residual risk is uncalibrated scanner signal and stale exceptions weakening the gate. | Keep scanner signal calibrated and maintain expiring exceptions in `doc/governance/security_scan_exceptions.json`. |
| SBOM, signing, and provenance | `package_and_attest.sh` now produces local SBOM, signature, provenance, artifact-signature, and release-evidence references, with `supply_chain_evidence_gate.sh` able to block when evidence is missing or only dev-local signatures are present. Runner/secrets assumptions and exception handling are documented in `doc/operations/Supply_Chain_Evidence_Gate_Runbook.md`. | Production promotion still needs approved Sigstore/cosign or external signing configuration and non-local signature evidence. |
| Capacity reserve | Selling or assigning 100% of GPU capacity removes room for patching, testing, rollback, and hot spares. | 15-20% reserve capacity by pool/ring, enforced by scheduler/inventory policy. |
| Release rings for GPU nodes | Single-node validation followed by broad rollout can create fleet-wide failure. | Ring 0 internal, Ring 1 UAT/Security, Ring 2 production canary, Ring 3 broad production, Ring 4 sensitive tenants. |
| Blue-green/canary maturity | GitLab can trigger deploys, but production rollout safety needs runtime-level progressive delivery. | Blue-green/canary for web/API/control plane; ring-based drain/patch/validate for GPU hosts and node-agent. |
| GitOps and drift control | Deploy scripts are strong, but desired-vs-live drift reconciliation is not yet the end-state control layer. | Argo CD or Flux-style GitOps reconciliation, manifest drift reporting, and policy-controlled sync. |
| Agent-native SDLC enforcement | Agent model is documented, but context packets, watcher closeout, model-diverse review, and release blocking are not fully enforced. | Queue-backed task authority, mandatory context/evidence, no self-approval for critical changes, human release approval. |
| UAT as release gate | Demo UAT automation now exists and is active, including read-only and gated mutating lanes. The maturity gap is promotion gating and environment separation, not absence of automation. | Stable persona UAT and security UAT packs become promotion gates for staging/prod, first report-only, then blocking by risk tier. |
| Ops evidence completion | Public MVP ops items 1-5 are still `in_progress`. | SLO/alerts, runbooks/on-call, backup/restore, secrets/key ops, and east/west/cert lifecycle all marked done with staging evidence. |

## End-State Operating Model

### Environment Promotion

No direct Dev-to-Prod path.

Target sequence:

```text
Dev -> Integration -> UAT/Security -> Staging/Pre-Prod -> Production
```

Each promotion must use one immutable release artifact and one resolved
environment/profile contract.
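
As an illustration of the rule above, a promotion check could reject any request that skips an environment or swaps the artifact. This is a minimal sketch, not the repo's gate: the environment identifiers, the `PromotionRequest` type, and `validate_promotion` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Order mirrors the target sequence above; the identifiers are assumed labels.
PROMOTION_ORDER = ["dev", "integration", "uat-security", "staging", "production"]


@dataclass(frozen=True)
class PromotionRequest:
    artifact_digest: str  # digest of the artifact proposed for the target
    source_env: str       # environment where the artifact was last validated
    target_env: str       # environment being promoted to
    source_digest: str    # digest actually validated in source_env


def validate_promotion(req: PromotionRequest) -> list[str]:
    """Return violations; an empty list means the promotion is allowed."""
    if req.source_env not in PROMOTION_ORDER or req.target_env not in PROMOTION_ORDER:
        return [f"unknown environment: {req.source_env} -> {req.target_env}"]
    violations = []
    src = PROMOTION_ORDER.index(req.source_env)
    dst = PROMOTION_ORDER.index(req.target_env)
    if dst != src + 1:
        # Covers Dev-to-Prod jumps as well as backwards promotions.
        violations.append(
            f"{req.source_env} -> {req.target_env} skips or reverses the sequence"
        )
    if req.artifact_digest != req.source_digest:
        violations.append("artifact differs from the one validated in the source environment")
    return violations
```

Returning a list of violations rather than a boolean lets a pipeline record every reason for a refusal in the release evidence.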

### Release Strategy

Control-plane services:
- blue-green or canary rollout,
- health, authz, observability, and rollback gates before full promotion,
- manifest-vs-live image digest drift checks.
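
A manifest-vs-live digest drift check can be sketched as a comparison of two maps. The example below assumes both sides have already been collected as `{workload name: image digest}` dictionaries; how they are collected, and the `digest_drift` name, are assumptions of this sketch.

```python
def digest_drift(desired: dict, live: dict) -> dict:
    """Map workload name -> (desired digest, live digest) for every mismatch.

    A workload missing on one side is reported with None for that side.
    """
    drift = {}
    for name in sorted(desired.keys() | live.keys()):
        want, have = desired.get(name), live.get(name)
        if want != have:
            drift[name] = (want, have)
    return drift
```

An empty result means the running images match the manifest; any entry is a candidate rollout blocker or drift report item.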

GPU nodes and node-agent:
- reserve nodes before patching,
- drain active workloads where possible,
- patch by ring,
- validate node health, GPU driver/CUDA/runtime compatibility, node-agent
  connectivity, task signing, terminal, and workload launch,
- promote only after ring evidence is captured,
- keep rollback or replacement capacity available.
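
The "patch by ring, promote only after evidence" behavior can be sketched as a loop that halts at the first failed validation. The ring labels follow the ring list in the Key Gaps table; `patch_ring` and `validate_ring` are hypothetical callbacks standing in for the real drain/patch and health-check steps.

```python
from typing import Callable, Optional

# Ring order follows the Key Gaps table; the string labels are illustrative.
RINGS = [
    "ring0-internal",
    "ring1-uat-security",
    "ring2-prod-canary",
    "ring3-broad-prod",
    "ring4-sensitive-tenants",
]


def roll_out(
    patch_ring: Callable[[str], None],
    validate_ring: Callable[[str], bool],
) -> tuple[list[str], Optional[str]]:
    """Patch rings in order and stop at the first ring that fails validation.

    Returns (rings that passed validation, ring that failed or None).
    """
    passed = []
    for ring in RINGS:
        patch_ring(ring)
        if not validate_ring(ring):
            return passed, ring  # later rings are never patched
        passed.append(ring)
    return passed, None
```

Because the loop returns before touching later rings, a failure in the canary ring leaves broad production and sensitive tenants on the previous version.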

### Capacity Reserve

GPUaaS should enforce a reserve policy, not rely on operator discipline.

Baseline target:
- 15% reserved capacity minimum for production pools,
- 20% for early launch or high-change periods,
- reserve split across patch validation, hot spare, rollback, and security/UAT
  nodes.

The scheduler/inventory layer should prevent production allocation from
consuming reserved capacity except through explicit break-glass approval.
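
The arithmetic behind such a policy is small enough to sketch. The example below assumes a single pool counted in whole GPUs and a reserve expressed as an integer percentage; the function names and the `break_glass` flag are hypothetical, and a real scheduler would also track the reserve per ring.

```python
def allocatable_gpus(total_gpus: int, reserve_percent: int) -> int:
    """GPUs available to production tenants after holding back the reserve."""
    # Integer ceiling division so the reserve is never rounded down.
    reserved = (total_gpus * reserve_percent + 99) // 100
    return total_gpus - reserved


def can_allocate(
    total_gpus: int,
    allocated: int,
    requested: int,
    reserve_percent: int = 15,
    break_glass: bool = False,
) -> bool:
    """Allow a request only if it stays out of the reserve.

    An approved break-glass request may use the reserve, but nothing may
    exceed physical capacity.
    """
    if allocated + requested > total_gpus:
        return False
    if break_glass:
        return True
    return allocated + requested <= allocatable_gpus(total_gpus, reserve_percent)
```

For a 100-GPU pool at the 15% baseline, 85 GPUs are allocatable: a request that would take usage from 80 to 86 is refused unless break-glass approval is attached.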

### Security Gates

Before production promotion, the release packet should include:
- contract validation,
- breaking-change report,
- backend/frontend/integration test results,
- migration validation,
- SAST/secret/dependency scan results,
- image scan results,
- SBOM and provenance,
- signed artifacts,
- authz and tenant/project isolation evidence,
- terminal/token replay evidence,
- node-agent/task signing evidence,
- rollback proof,
- release approver and exception records.
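
A completeness check over the release packet might look like the sketch below. The evidence keys are hypothetical names that mirror the list above one-to-one; they are not the schema used by the repo's gate scripts.

```python
# One key per item in the list above; the key names are assumed for illustration.
REQUIRED_EVIDENCE = [
    "contract_validation",
    "breaking_change_report",
    "test_results",
    "migration_validation",
    "sast_secret_dependency_scans",
    "image_scan",
    "sbom_and_provenance",
    "signed_artifacts",
    "authz_isolation",
    "terminal_token_replay",
    "node_agent_task_signing",
    "rollback_proof",
    "approver_and_exceptions",
]


def missing_evidence(packet: dict) -> list[str]:
    """Return required evidence keys that are absent or empty in the packet."""
    return [key for key in REQUIRED_EVIDENCE if not packet.get(key)]
```

A promotion gate built on this would block whenever the returned list is non-empty, which also catches entries that exist but point at nothing.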

### UAT Automation and Environment Separation

GPUaaS already has meaningful UAT automation. The current package validates
persona journeys and supporting implementation evidence in the demo environment:

- safe read-only UAT,
- gated mutating UAT for disposable workflows,
- Playwright product/browser journeys,
- external app browser checks for managed routes,
- API, CLI, Go SDK, and Python SDK smokes,
- provider-capacity and provider-ops readiness checks,
- terminal/WebSocket checks,
- app catalog and scheduler checks,
- tenant-admin and account/security flows,
- timestamped evidence under `dist/uat/`.

This is a strength. The near-term weakness is that UAT and security validation
are compressed into the environments currently available. That is acceptable
for demo iteration, but production readiness requires running the same UAT
automation against a separate production-like UAT/Security environment and then
staging/pre-prod.

The roadmap should therefore preserve the current automation investment and
move it across environments rather than rebuilding UAT from scratch.

### Agent-Native SDLC

GPUaaS relies heavily on agents, so the SDLC must be agent-native rather than a
traditional human-only PR model.

Required controls:
- queue task is the source of work authority,
- each non-trivial task has a context packet,
- agent work is scoped by owned files, non-goals, stop conditions, and
  acceptance checks,
- bug fixes record root cause, owning layer, proof command, regression coverage,
  and residual risk,
- watcher agents have explicit allowed fixes and close conditions,
- high-risk changes get D-arch and E-governance review,
- implementation agents cannot self-approve critical changes,
- production release remains human-approved.
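
Two of these controls, no self-approval for critical changes and human approval for production release, can be expressed as a simple check. This is a sketch under stated assumptions: the `ChangeRecord` fields and the idea of passing in a set of known human approvers are hypothetical, not the queue's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    author: str                                       # agent or human that made the change
    critical: bool                                    # whether the change is classed as critical
    approvers: list = field(default_factory=list)     # identities that approved it
    human_approvers: set = field(default_factory=set) # approvers known to be human
    production_release: bool = False


def approval_violations(change: ChangeRecord) -> list[str]:
    """Return the approval rules this change breaks; empty means compliant."""
    violations = []
    independent = [a for a in change.approvers if a != change.author]
    if change.critical and not independent:
        violations.append("critical change needs an approver other than its author")
    if change.production_release and not (set(change.approvers) & change.human_approvers):
        violations.append("production release needs a human approver")
    return violations
```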

## Tooling Roadmap

| Capability | Current State | Recommended Direction |
|---|---|---|
| CI orchestration | GitLab CI active | Keep GitLab as CI/orchestration backbone. Do not make it the only production control plane. |
| GitOps | No GitOps controller is yet clearly established as the deployment authority | Add Argo CD or Flux for controlled reconciliation and drift evidence. |
| Progressive delivery | Scripted release profiles and fast lanes exist | Add Argo Rollouts, Flagger, or equivalent for canary/blue-green rollout policy. |
| Policy as code | Repo-specific guards and production enforcement YAML exist | Add/adopt OPA Gatekeeper or Kyverno for Kubernetes admission controls. |
| Artifact trust | Package/attestation scaffold exists | Add Syft SBOM, Cosign signing, in-toto/SLSA provenance, signed release packets. |
| Vulnerability management | Scan scripts and production promotion blocking exist; `doc/operations/Vulnerability_Remediation_SLA_v1.md` defines severity clocks, owner routing, escalation, exception handling, and the evidence report model. `scripts/ci/vulnerability_sla_summary.sh` emits open, overdue, remediated, waived, resurfaced, false-positive, and invalid finding posture. | Surface vulnerability SLA posture in Status/Ops and release packets. |
| Waiver governance | `doc/governance/Security_Waiver_Governance_v1.md` defines waiver states, required schema, approval rules, expiry handling, monthly review, release-blocking behavior, and reconciliation with current exception mechanisms. `scripts/ci/security_waiver_summary_check.sh` emits active, expired, deferred, invalid, and not-applicable waiver posture. | Surface waiver posture in Status/Ops and release packets. |
| UAT automation | Persona UAT package, flow coverage matrix, read-only/mutating lanes, and evidence output exist for demo | Reuse the same UAT package across UAT/Security and staging; make stable flows release-blocking by risk tier. |
| Runtime drift | Release validation checks exist | Add GitOps drift views plus product/operator release evidence read models. |
| Agent governance | Strong docs and queue scripts exist | Enforce context packets, queue-git checks, review routing, and evidence closeout in CI. |

## Roadmap

### Phase 1: Operating Model Baseline

Goal: establish non-negotiable operating principles.

Deliverables:
- Documented rule that GPUaaS will not use Dev-to-Prod as a production path.
- Capacity reserve policy target: 15-20%.
- Release ring model for GPU nodes and node-agent.
- Security scan severity policy, vulnerability remediation clocks, and exception
  expiry/waiver model.
- Human approval requirement for production release.

Exit criteria:
- Principles recorded in architecture/governance docs and accepted by the
  owning platform, security, product, and release domains.
- Gaps converted into queue tasks with owners.

### Phase 2: Enforced Release Evidence

Goal: turn existing scripts and docs into release-blocking evidence.

Deliverables:
- Promote critical/high security scans from report-only to blocking. The
  production promotion gate exists; remaining work is scanner signal
  calibration, exception hygiene, and Status/Ops SLA/waiver surfacing.
- Finish `package_and_attest.sh` so that it produces production-grade SBOM,
  signing, and provenance evidence. Local deterministic evidence exists;
  configuring the production signing provider remains approval and environment
  work.
- Classify existing demo UAT checks into advisory, required, and release-blocking
  tiers.
- Make stable persona UAT checks blocking for staging/prod promotion by risk
  tier while keeping exploratory or environment-dependent checks report-only.
- Add queue/context/evidence enforcement for high-risk agent tasks.
- Complete ops launch items 1-5 in `Parallel_Ops_Track.md`.

Exit criteria:
- A staging release cannot proceed without a complete evidence packet.
- The release packet includes UAT evidence or an approved exception for each
  required persona journey.
- Exceptions require owner, reason, expiry, and security-domain approval where
  risk is high.
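
The exception rule in the last criterion can be sketched as a record check. The field names (`owner`, `reason`, `expiry`, `risk`, `security_approver`) are assumptions for this example and are not taken from `security_scan_exceptions.json`; expiry is assumed to be an ISO date string.

```python
from datetime import date


def exception_problems(exc: dict, today: date) -> list[str]:
    """Return reasons an exception record is not acceptable; empty means valid."""
    problems = [f"missing {k}" for k in ("owner", "reason", "expiry") if not exc.get(k)]
    expiry = exc.get("expiry")
    if expiry and date.fromisoformat(expiry) < today:
        problems.append("expired")
    if exc.get("risk") == "high" and not exc.get("security_approver"):
        problems.append("high-risk exception lacks security-domain approval")
    return problems
```

Passing `today` in explicitly keeps the check deterministic, so the same evidence packet evaluates the same way when it is re-verified later.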

### Phase 3: Production-Like UAT and Ring Rollout

Goal: prove changes on real infrastructure without risking the whole fleet.

Deliverables:
- Dedicated UAT/Security GPU pool.
- Staging/pre-prod environment with production-like edge, identity, storage,
  network policy, node-agent, and observability.
- Existing `demo_uat_package.sh` and related persona checks parameterized for
  UAT/Security and staging profiles.
- Node ring metadata in inventory.
- Patch workflow: reserve -> drain -> patch -> validate -> promote, with
  rollback when validation fails.
- Canary/blue-green rollout for API/web/control plane.

Exit criteria:
- Node-agent and host patching can be demonstrated ring-by-ring.
- A failed canary or node patch stops before broad production impact.

### Phase 4: GitOps and Policy-Controlled Production

Goal: make desired state, runtime state, and security posture continuously
auditable.

Deliverables:
- GitOps controller for staging/prod reconciliation.
- Progressive delivery controller for control-plane workloads.
- Admission policies for signed images, privileged workloads, unsafe mounts,
  missing labels, and required security context.
- Release drift dashboard/read model.
- Continuous vulnerability and exception dashboard.

Exit criteria:
- Production drift is visible and actionable.
- Unsigned or policy-violating workloads cannot be admitted without explicit
  break-glass.
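
The intent of the admission policies can be illustrated outside any specific policy engine. The sketch below evaluates a Pod manifest loaded as a dictionary; it is not Gatekeeper or Kyverno policy syntax. Treating any `hostPath` volume as an unsafe mount, the `verified_images` set, and the `break_glass` flag are all assumptions of the example.

```python
def admission_violations(
    pod: dict,
    verified_images: set,
    required_labels: set,
    break_glass: bool = False,
) -> list[str]:
    """Return policy violations for a Pod manifest; empty means admissible."""
    if break_glass:
        return []  # explicit bypass; a real system would also audit this path
    violations = []
    labels = pod.get("metadata", {}).get("labels", {})
    for label in sorted(required_labels - set(labels)):
        violations.append(f"missing label: {label}")
    spec = pod.get("spec", {})
    for container in spec.get("containers", []):
        image = container.get("image")
        if image not in verified_images:
            violations.append(f"unsigned image: {image}")
        security_context = container.get("securityContext")
        if security_context is None:
            violations.append("missing securityContext")
        elif security_context.get("privileged"):
            violations.append("privileged container")
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:  # assumed definition of an unsafe mount
            violations.append(f"unsafe mount: {volume.get('name')}")
    return violations
```

In a cluster, the same rules would normally live in an admission controller so that they apply to every workload rather than only to those that pass through CI.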

## Architecture Implications

1. GPUaaS should use the platform production operating model rather than inherit
   a direct Dev-to-Prod or broad-push patching flow.
2. Existing UAT automation should be preserved and moved across environments
   instead of rebuilt from scratch.
3. Capacity reserve is part of resilience and security architecture, not only a
   commercial capacity-planning question.
4. Release rings and progressive delivery are required for GPU nodes,
   node-agent, API/web, and control-plane services.
5. Artifact trust, release evidence, and exception expiry should become
   shared platform controls.
6. Agent-native delivery requires stricter task authority, evidence, review,
   and release gates than a traditional human-only PR flow.
7. Production release authority remains human-gated until the control model has
   sufficient environment-backed evidence to justify further automation.

## Source Documents

- `doc/governance/CI_Pipeline_Implementation.md`
- `doc/governance/Security_Control_Verification.md`
- `doc/governance/production_enforcement_policy.yaml`
- `doc/governance/Platform_Control_Release_Promotion_Policy.md`
- `doc/operations/Platform_Control_CI_CD_Multi_Environment_Model_v1.md`
- `doc/operations/Vulnerability_Remediation_SLA_v1.md`
- `doc/governance/Security_Waiver_Governance_v1.md`
- `doc/operations/Demo_UAT_Package_v1.md`
- `doc/operations/Demo_UAT_Flow_Coverage_Matrix_v1.md`
- `doc/governance/Persona_Journey_UAT_Model_v1.md`
- `doc/operations/Production_Platform_Baseline.md`
- `doc/operations/Parallel_Ops_Track.md`
- `doc/governance/Agent_Orchestrator_v2_Coordinated_Execution.md`
- `doc/governance/Agent_Execution_Quality_and_Context_Model_v1.md`
- `doc/governance/Multi_Agent_Lane_Worktrees_v1.md`
- `doc/governance/External_Architectural_Review_2026-04.md`
