# Security Architecture Current State v1

Status: canonical-current
Owner: Platform Architecture / Security
Last updated: 2026-06-05

## Purpose

This document is the current security architecture package for GPUaaS and the
AI Factory platform foundation. It replaces stale whitepaper-style claims with
repo-backed current state, explicit partial controls, future regulated-profile
work, and non-claims.

Use this document for architecture, security, product, operations, and CISO
review conversations. Do not use `Core42_GPUaaS_Cloud.pdf` as active current
architecture evidence.

## Current Security Posture Summary

GPUaaS has a strong platform-security direction, but it should be described as
production-hardening in progress, not compliance-ready or regulated-profile
ready.

Current strengths:

1. contract-first API/event development;
2. platform IAM model separating Keycloak authentication from product
   authorization truth;
3. append-only billing ledger and privileged audit requirements;
4. node-agent mTLS/PKI design and MAAS/full-reimage isolation path;
5. Pomerium/managed ingress direction for browser/app routes;
6. runtime trust, Secrets/PKI, cert-manager, Keycloak, and evidence/status
   readiness gates;
7. Fairway-tracked stabilization, review domains, deploy-run evidence, and
   CI/deploy wait-window discipline.

Current caveat:

GPUaaS should not claim FedRAMP readiness, PCI-DSS CDE readiness, HIPAA
readiness, FIPS cryptographic posture, HSM-backed key custody, PostgreSQL RLS
as the primary tenant isolation boundary, or WORM/tamper-proof audit immutability
until the corresponding evidence packages and owner approvals exist.

## Source Of Truth

| Area | Current authority |
|---|---|
| Platform shared-service boundary | `Platform_Shared_Services_Model_v2.md` |
| Security review triage | `Security_Architecture_Review_Triage_v1.md` |
| IAM and Keycloak boundary | `../Platform_IAM_Model_v1.md` |
| MFA policy and enforcement | `../IAM_MFA_Policy_and_Keycloak_Enforcement_v1.md` |
| Node-agent trust | `../Node_Agent_Spec.md`, `../PKI_Spec.md` |
| MAAS and node lifecycle | `../MAAS_Bare_Metal_Lifecycle_v1.md` |
| Capacity/workload isolation | `../Allocation_Capacity_Shapes_and_GPU_Slices_v1.md` |
| Production readiness | `../Production_Deployment_Readiness_v1.md` |
| Runtime trust | `Secrets_PKI_Runtime_Trust_Model_v1.md` |
| Release and evidence gates | `Platform_Release_Profile_Gates_v1.md`, `Platform_Evidence_Status_Schema_v1.md` |
| Operational readiness | `../../operations/Production_Platform_Baseline.md` |
| Tenant/workload isolation evidence | `Tenant_And_Workload_Isolation_Evidence_v1.md` |

## Architecture Boundary

Security controls are split across shared platform services and product-owned
domains.

Platform-owned:

1. IAM/access, organizations, departments, projects, memberships, roles, scopes,
   service accounts, and API keys;
2. policy/entitlements, quota, and feature gates;
3. billing/metering/payment ledger authority and money reconciliation;
4. audit/evidence/status/readiness records;
5. Secrets/PKI coordination, credential delivery contracts, and cert lifecycle
   evidence;
6. registry/artifact trust metadata;
7. notification templates and security/status notices;
8. edge/runtime evidence for release, UAT, security scan, and operator gates.

Product-owned:

1. GPU inventory, SKU/capacity semantics, allocation lifecycle, node lifecycle,
   terminal access, and MAAS orchestration;
2. App Platform catalog/runtime/SDK behavior;
3. future Token Factory model routing and token/request usage semantics.

Products must compose shared services through platform contracts, read models,
or events. They must not invent product-local IAM, billing, audit, credential,
or evidence systems.
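
The ownership split above can be pictured as a lookup that a boundary guard might consult. The following is a minimal sketch, not the repo's guard implementation: the domain keys, `PLATFORM_OWNED`, `PRODUCT_OWNED`, and `local_domain_violations` are hypothetical names chosen for illustration.

```python
# Hypothetical ownership sets summarizing the lists above; keys are illustrative.
PLATFORM_OWNED = {
    "iam", "policy", "billing", "audit", "secrets_pki",
    "registry_trust", "notifications", "evidence",
}
PRODUCT_OWNED = {
    "gpu_inventory", "allocation", "node_lifecycle", "terminal_access",
    "maas_orchestration", "app_platform", "token_factory",
}


def local_domain_violations(declared_domains: set[str]) -> list[str]:
    """Return the platform-owned domains a product declares as its own."""
    return sorted(d for d in declared_domains if d in PLATFORM_OWNED)
```

For example, a product that declares `{"allocation", "billing"}` would be flagged for `billing`, because billing authority stays with the platform.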

## Active Controls

| Control area | Current state | Evidence / authority |
|---|---|---|
| API contract discipline | Contract-first OpenAPI/AsyncAPI model with generated SDK/codegen gates. | `doc/api/`, `scripts/codegen.sh`, CI contract gates |
| Authentication | Keycloak handles human login, token issuance, JWKS, federation entry, and future MFA flow enforcement. | `Platform_IAM_Model_v1.md` |
| Product authorization | Platform DB remains product IAM authority for tenant/project membership, roles, service accounts, and scopes. | `Platform_IAM_Model_v1.md`, `Unified_IAM_Billing_Across_Products_v1.md` |
| MFA architecture | MFA authority assigned to Keycloak for humans; service accounts/API keys are not MFA subjects. | `IAM_MFA_Policy_and_Keycloak_Enforcement_v1.md` |
| Privileged audit | Privileged mutations require `audit_logs` with actor, target, result, and correlation ID. | `AGENTS.md`, coding standards |
| Billing integrity | Ledger entries are immutable; balance is derived from ledger, not mutable balance columns. | `Billing_Architectural_Invariants_v1.md`, `AGENTS.md` |
| Evidence/status | Platform evidence/status schema and gates exist for release, UAT, security, guard reports, and component status. | `Platform_Evidence_Status_Schema_v1.md`, `Platform_Release_Profile_Gates_v1.md` |
| Boundary guards | Import, route, schema, event, frontend, worker, and v3 namespace ownership guards exist and have graduation policy. | `Platform_Foundation_Boundary_Guards_v1.md` |
| Runtime trust | Secrets/PKI custody and runtime trust gates exist; live production rotation remains approval-gated. | `Secrets_PKI_Runtime_Trust_Model_v1.md`, `scripts/ci/secrets_pki_runtime_trust_gate.sh` |
| Cert lifecycle | cert-manager lifecycle readiness gate exists; live issuer/certificate evidence is environment-gated. | `scripts/ci/cert_manager_lifecycle_readiness.sh` |
| Edge posture | Cloudflare/security-insights triage created HTTPS/HSTS, legacy DNS/TLS, ops access, security.txt, and pre-prod scan follow-ups. | Fairway `OPS-CLOUDFLARE-*`, `SEC-PUBLIC-*`, `SEC-PREPROD-*` |
| Node-agent trust | Node-agent protocol and PKI enrollment model exist; stronger hardware-rooted attestation is future hardening. | `Node_Agent_Spec.md`, `PKI_Spec.md` |
| MAAS isolation | MAAS bare-metal lifecycle and full-reimage path are documented; current isolation profile must be stated per environment. | `MAAS_Bare_Metal_Lifecycle_v1.md` |
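
Two rows in the table describe invariants that are easy to misread: ledger entries are immutable with balance derived from the ledger, and privileged mutations carry an audit record with actor, target, result, and correlation ID. The sketch below illustrates both shapes in Python. The class names, field names, and sign convention are assumptions for illustration and are not the platform schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class LedgerEntry:
    account_id: str
    amount: Decimal  # assumed convention: positive = credit, negative = debit


@dataclass(frozen=True)
class AuditRecord:
    actor: str
    target: str
    result: str
    correlation_id: str


def balance(entries: list[LedgerEntry], account_id: str) -> Decimal:
    """Derive the balance by summing ledger entries; nothing is stored or updated."""
    return sum(
        (e.amount for e in entries if e.account_id == account_id),
        Decimal("0"),
    )
```

The point of the design is that there is no balance value to drift out of sync with the ledger: the ledger is the only input.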

## Partial Controls And Open Gaps

These are real current-state gaps, not stale-review noise.

| Gap | Current status | Owning work |
|---|---|---|
| MFA implementation | Architecture exists; Keycloak flow, runbook, read model, UX, and sensitive-op gate remain tasks. | `IAM-MFA-*` |
| Keycloak/IdP HA evidence | Runtime posture must prove HA/managed-IdP or accepted single-instance risk, realm drift, backup/restore, JWKS freshness, MFA posture, and break-glass state. Gate implementation is tracked separately and should become active evidence only after it lands on the release branch. | `OPS-PROD-IAM-KEYCLOAK-HA-001` |
| Live Secrets/PKI drills | Custody/runtime gates exist; live rotation and break-glass drills are approval-gated. | `OPS-PROD-SECRETS-PKI-*` |
| Public edge hardening | Plans exist for HTTPS/HSTS, stale DNS/TLS cleanup, ops access policy, security.txt, and pre-prod scans; live mutations need approval. | `OPS-CLOUDFLARE-*`, `SEC-PUBLIC-*`, `SEC-PREPROD-*` |
| Tenant/workload isolation evidence | The current model is broader than the stale PDF, but the negative test/evidence package is not complete. | `SEC-ARCH-TENANT-WORKLOAD-ISOLATION-EVIDENCE-001` |
| Audit tamper-evidence | Append-only audit and ledger controls exist; cryptographic/WORM audit evidence is not current state. | `SEC-ARCH-AUDIT-TAMPER-EVIDENCE-001` |
| Data retention and erasure | References are scattered; a single data classification, retention, legal hold, and erasure matrix is pending. | `SEC-ARCH-RETENTION-ERASURE-MATRIX-001` |
| Incident/SOC model | Runbooks and on-call readiness exist; a customer/regulator notification and SOC operating matrix is pending. | `SEC-ARCH-INCIDENT-SOC-MODEL-001` |
| Supply chain evidence | CI scans and gates exist; a full SBOM/provenance/signing/release evidence package is pending. | `SEC-ARCH-SUPPLY-CHAIN-EVIDENCE-GATE-001` |
| Regulated crypto/key custody | Normal production baseline is separate from FIPS/HSM/FedRAMP-grade requirements. | `SEC-ARCH-REGULATED-CRYPTO-KEY-CUSTODY-001` |

## Explicit Non-Claims

GPUaaS must not currently claim:

1. FedRAMP authorized or FedRAMP ready;
2. PCI-DSS CDE readiness;
3. HIPAA/ePHI readiness;
4. SOC 2 or ISO 27001 certification;
5. FIPS-validated cryptographic boundary;
6. HSM-backed KEK custody;
7. WORM/tamper-proof audit immutability;
8. hardware-rooted node attestation;
9. PostgreSQL RLS as the primary tenant-isolation boundary;
10. full public-production edge posture until Cloudflare/DNS/access tasks are
    approved and proven.

These may become future enterprise or regulated-profile claims only after the
specific evidence packages and owner approvals land.
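
The rule in the last paragraph is a two-part conjunction, which a short sketch makes explicit. This is an illustration only: `ClaimStatus` and `may_claim` are hypothetical names, and no such check is asserted to exist in the repo.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimStatus:
    evidence_package_landed: bool
    owner_approved: bool


def may_claim(status: Optional[ClaimStatus]) -> bool:
    """Allow a claim only when both the evidence package and owner approval exist."""
    if status is None:  # no tracked status means the claim is not supported
        return False
    return status.evidence_package_landed and status.owner_approved
```

Evidence without approval, or approval without evidence, both leave the item on the non-claims list.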

## Current Data And Trust Flows

### Identity And Authorization

```text
Human user
  -> Keycloak / external IdP
  -> JWT + JWKS validation at API
  -> platform IAM membership / role / scope checks
  -> product operation
  -> audit/evidence/status record
```

Keycloak authenticates humans and publishes tokens/JWKS. Platform IAM owns
product authorization, memberships, scoped roles, service accounts, API keys,
and authorization evidence.
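
A minimal sketch of that split, assuming the token signature has already been verified against JWKS upstream: the authorization decision reads platform membership data and ignores any roles carried in the token. `Membership`, `authorize`, and the scope strings are hypothetical names for illustration, not the platform IAM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Membership:
    subject: str
    project_id: str
    scopes: frozenset[str]


def authorize(claims: dict, memberships: list[Membership],
              project_id: str, scope: str) -> bool:
    """Decide from platform IAM memberships, not from roles embedded in the token."""
    subject = claims.get("sub")  # identity comes from the already-validated token
    return any(
        m.subject == subject and m.project_id == project_id and scope in m.scopes
        for m in memberships
    )
```

A token that carries an `admin` role but has no matching platform membership is still denied, which is the intended consequence of keeping authorization truth in the platform DB.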

### Node And Runtime Trust

```text
Allocation request
  -> provisioning product workflow
  -> MAAS / provider lifecycle
  -> node-agent enrollment
  -> mTLS / task pull
  -> terminal/app runtime binding
  -> status, audit, billing, and release evidence
```

The current production baseline consists of service identity, cert lifecycle
evidence, the node-agent task protocol, and provider reconciliation. TPM,
measured boot, firmware/BMC trust, and hardware attestation are future hardening.

### Edge And App Access

```text
Browser / API client
  -> Cloudflare / managed edge / Pomerium direction
  -> GPUaaS API or app route
  -> platform IAM / policy / token checks
  -> product runtime
  -> audit/status/evidence
```

The current direction is managed ingress with explicit DNS/TLS/access evidence.
Live public edge changes remain approval-gated.

## Regulated-Profile Separation

Normal production readiness and regulated-profile readiness are different
tracks.

Normal production baseline:

1. HA or accepted-risk posture for critical runtime dependencies;
2. secret custody and rotation evidence;
3. TLS/cert lifecycle evidence;
4. CI/CD release evidence and deploy-run traceability;
5. observability/on-call/runbook evidence;
6. tenant/workload isolation evidence;
7. backup/restore and recovery evidence.

Regulated-profile additions:

1. FedRAMP package/SSP/SAR/POA&M and 3PAO process;
2. FIPS module boundary and CMVP certificate evidence;
3. HSM-backed key custody or managed KMS/HSM controls;
4. PCI CDE scope package if payment data enters platform control;
5. HIPAA Security Rule risk analysis if ePHI is in scope;
6. WORM/Object Lock audit retention with separation-of-duties evidence.

Do not block ordinary stabilization work on regulated-profile controls unless a
specific customer or launch profile selects that target.
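
One way to picture the two tracks is as a baseline set that always applies plus additions that apply only when a launch profile selects them. The sketch below assumes hypothetical identifiers for each list item and a hypothetical `required_evidence` helper; it is not the release-profile gate implementation.

```python
from typing import Iterable

# Illustrative keys for the two lists above.
NORMAL_BASELINE = (
    "ha_or_accepted_risk", "secret_custody_rotation", "tls_cert_lifecycle",
    "release_evidence_traceability", "observability_oncall_runbooks",
    "tenant_workload_isolation", "backup_restore_recovery",
)
REGULATED_ADDITIONS = (
    "fedramp_package", "fips_cmvp_evidence", "hsm_key_custody",
    "pci_cde_scope", "hipaa_risk_analysis", "worm_audit_retention",
)


def required_evidence(selected_additions: Iterable[str] = ()) -> tuple[str, ...]:
    """Baseline always applies; regulated items apply only when selected."""
    selected = tuple(selected_additions)
    unknown = [a for a in selected if a not in REGULATED_ADDITIONS]
    if unknown:
        raise ValueError(f"unknown regulated-profile additions: {unknown}")
    return NORMAL_BASELINE + selected
```

With no additions selected, ordinary stabilization work is gated only by the baseline.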

## Review And Release Gate Use

For production-impacting changes, attach security posture through release
evidence rather than prose-only approval:

1. source SHA and environment profile;
2. contract/codegen/SDK drift result;
3. UAT coverage and invariant result;
4. security scan/triage result;
5. Secrets/PKI and cert lifecycle status;
6. Keycloak/IdP runtime posture when auth behavior changes;
7. edge/DNS/access status when public routes change;
8. product, platform, and security approval, plus residual risk.

Fairway deploy-run tasks are the operating record for these gates.
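
The checklist above can be read as a completeness check on a release evidence record, where items 6 and 7 are required only when auth behavior or public routes change. This is a minimal sketch under that reading. The field names and `missing_evidence` are hypothetical and do not reflect the Fairway or evidence-schema formats.

```python
# Always-required evidence fields (items 1-5 and 8 above); names are illustrative.
REQUIRED_FIELDS = (
    "source_sha", "environment_profile", "contract_drift_result",
    "uat_result", "security_scan_result", "secrets_pki_cert_status",
    "approvals", "residual_risk",
)
# Change flag -> evidence field that becomes required when the flag is set.
CONDITIONAL_FIELDS = {
    "auth_behavior_changed": "idp_runtime_posture",
    "public_routes_changed": "edge_dns_access_status",
}


def missing_evidence(record: dict) -> list[str]:
    """List required evidence fields that are absent or empty in the record."""
    required = list(REQUIRED_FIELDS)
    for flag, field in CONDITIONAL_FIELDS.items():
        if record.get(flag):
            required.append(field)
    return [f for f in required if not record.get(f)]
```

An empty result means the record is complete for the change being shipped; a non-empty result names exactly what still has to be attached.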

## Forward Work

Continue the `SEC-ARCH-REVIEW-EPIC` sequence in this order:

1. `SEC-ARCH-COMPLIANCE-SCOPE-MATRIX-001`
2. `SEC-ARCH-TENANT-WORKLOAD-ISOLATION-EVIDENCE-001`
3. `SEC-ARCH-AUDIT-TAMPER-EVIDENCE-001`
4. `SEC-ARCH-NODE-TRUST-HARDENING-001`
5. `SEC-ARCH-RETENTION-ERASURE-MATRIX-001`
6. `SEC-ARCH-INCIDENT-SOC-MODEL-001`
7. `SEC-ARCH-SUPPLY-CHAIN-EVIDENCE-GATE-001`
8. `SEC-ARCH-REGULATED-CRYPTO-KEY-CUSTODY-001`

MFA implementation remains under `IAM-MFA-EPIC`, not duplicated in this
security architecture epic.