# Tenant And Workload Isolation Evidence v1

Status: current evidence package with named follow-up gaps
Owner: Security / Architecture / Backend / Ops / Governance
Last updated: 2026-06-05

## Purpose

State the current GPUaaS tenant and workload isolation posture in reviewable
terms. This document separates implemented controls from partial controls and
missing evidence. It is not a regulated-workload approval and it does not claim
FedRAMP, HIPAA, PCI CDE, or dedicated-tenant readiness.

## Scope

Covered surfaces:

1. organization, department, project, resource, and principal scoping;
2. API authorization and negative access evidence;
3. database constraints and query boundaries;
4. Redis/read-cache and terminal/session key boundaries;
5. NATS and Temporal event/workflow boundaries;
6. terminal, managed ingress, app runtime, and storage surfaces;
7. GPU bare-metal, VM slice, PCI passthrough, fabric/RDMA, and cleanup
   assumptions.

## Current Isolation Model

GPUaaS uses layered logical isolation. No single mechanism is treated as the
whole tenant boundary.

| Layer | Current control | Evidence | Status |
|---|---|---|---|
| Product IAM | Keycloak authenticates humans; platform IAM owns organization, project, role, service-account, and API-key authorization. | `doc/architecture/Platform_IAM_Model_v1.md`, `packages/platform/iam`, `packages/platform/auth` | Implemented / evolving |
| Department hierarchy | Every organization has a default department; projects and service accounts are department-attributed and same-organization constrained. | `doc/architecture/db_schema_v1.sql`, `doc/architecture/IAM_Department_Hierarchy_Implementation_Plan_v1.md` | Implemented schema invariant |
| Project access | Durable v3 handlers resolve project scope before project-owned reads or mutations. Missing membership maps to access denied. | `cmd/api/routes.go`, `cmd/api/routes_v3_*.go`, `packages/platform/auth/legacyimpl/project_scope.go` | Implemented |
| Role decisions | Role/capability decisions deny when membership or permission is missing, and deny disabled actors before permission checks run. | `packages/shared/authz/decision_test.go`, `packages/platform/iam/adapter_auth_test.go` | Unit-tested |
| Allocation ownership | Allocations carry `org_id` and `project_id`; project/org foreign-key constraints prevent cross-org project binding. | `doc/architecture/db_schema_v1.sql`, `packages/products/gpuaas/provisioning` | Implemented |
| Service accounts | Service accounts are project-scoped, department-attributed, and same-organization constrained. | `doc/architecture/db_schema_v1.sql`, `doc/architecture/Unified_IAM_Billing_Across_Products_v1.md` | Implemented schema invariant |
| Managed ingress | Route intent owns org, project, app instance, endpoint, auth mode, route family, and proxy pool. Pomerium is the renderer/runtime, not the source of ownership truth. | `doc/architecture/Managed_Ingress_Tenant_Isolation_and_Scaling_v1.md`, `cmd/api/routes_test.go` | Partially implemented; production network isolation remains open |
| Terminal | Terminal tokens/session bindings are scoped to allocation/session and gateway/node stream state. | `cmd/terminal-gateway/routes.go`, `cmd/terminal-gateway/routes_test.go` | Implemented for current topology |
| Node-agent tasks | Node-agent uses a pull-based, signed, typed task catalog; the control plane does not run arbitrary shell commands on nodes. | `doc/architecture/Node_Agent_Spec.md`, `cmd/node-agent/catalog_test.go` | Implemented contract / tested catalog constraints |
| Read cache | Cache keys normalize namespace and include tenant/project-style parts; prefix deletion can target one tenant prefix without deleting another. Redis key families are inventoried and CI-guarded. | `packages/shared/readcache/cache_test.go`, `doc/architecture/platform-foundation/Redis_Keyspace_Isolation_Evidence_v1.md`, `scripts/ci/redis_keyspace_isolation_guard.sh` | Unit-tested / CI-guarded |
| GPU placement | Bare-metal exclusivity and resource-claim uniqueness prevent simultaneous active bare-metal or slot claims. | `doc/architecture/db_schema_v1.sql`, `doc/architecture/Allocation_Capacity_Shapes_and_GPU_Slices_v1.md` | Implemented for current placement primitives; slice maturity partial |
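
The evaluation order in the "Role decisions" row can be illustrated with a minimal sketch. It is not the code in `packages/shared/authz`; the `Actor`, `Membership`, and `Decide` names and the permission strings are hypothetical and exist only to show disabled-actor denial running before membership and permission checks.

```go
package main

import "fmt"

// Decision is the outcome of one authorization check (hypothetical type).
type Decision string

const (
	Allow             Decision = "allow"
	DenyActorDisabled Decision = "deny_actor_disabled"
	DenyNoMembership  Decision = "deny_no_membership"
	DenyNoPermission  Decision = "deny_no_permission"
)

// Actor and Membership are illustration-only types, not platform IAM types.
type Actor struct {
	ID       string
	Disabled bool
}

type Membership struct {
	ProjectID   string
	Permissions map[string]bool
}

// Decide checks actor state first, then membership, then permission.
// Every path that is not an explicit grant ends in a deny.
func Decide(actor Actor, m *Membership, projectID, permission string) Decision {
	if actor.Disabled {
		return DenyActorDisabled
	}
	if m == nil || m.ProjectID != projectID {
		return DenyNoMembership
	}
	if !m.Permissions[permission] {
		return DenyNoPermission
	}
	return Allow
}

func main() {
	m := &Membership{
		ProjectID:   "proj-a",
		Permissions: map[string]bool{"allocation.read": true},
	}
	fmt.Println(Decide(Actor{ID: "u1"}, m, "proj-a", "allocation.read"))
	fmt.Println(Decide(Actor{ID: "u1", Disabled: true}, m, "proj-a", "allocation.read"))
	fmt.Println(Decide(Actor{ID: "u1"}, nil, "proj-a", "allocation.read"))
	fmt.Println(Decide(Actor{ID: "u1"}, m, "proj-b", "allocation.read"))
}
```

Because the disabled check comes first, a disabled actor is denied even when a valid membership and permission grant still exist.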

## Database Constraints And Ownership Evidence

| Boundary | Current evidence | Residual risk |
|---|---|---|
| Department attribution | `platform_iam_departments`, `platform_iam_projects.department_id not null`, default department trigger, and same-org project/department foreign key. | Existing tests should continue proving migration/bootstrap paths on real Postgres. |
| Project/org consistency | `platform_iam_projects` has `(id, org_id)` identity, and `gpuaas_allocations(project_id, org_id)` references that pair. | Query-level route tests still need broader negative matrix coverage by surface. |
| Service-account/project consistency | `platform_iam_service_accounts(project_id, org_id)` references projects, and department attribution is assigned from the project. | Service-account API-key and runtime paths need a complete live contract matrix. |
| Allocation claims | Bare-metal claims use unique active node ownership; slot claims use unique active slot ownership. | Slice-mode scheduling and cleanup proof remains a future evidence package. |
| App/runtime ownership | App instances, runtime credentials, routes, components, and worker operations carry org/project fields. | Existing tests cover the happy path and selected route authz well, but they do not yet form a complete negative authz matrix. |

PostgreSQL row-level security is not the primary isolation boundary today. The
current boundary is application-level authorization plus schema constraints and
project/org-scoped queries. Any future RLS adoption should be an additional
defense-in-depth layer, not a replacement for service authorization checks.
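
The project/org pair reference described in the table above can be modeled in memory to show what the constraint rejects. This is a sketch under stated assumptions, not the schema or the provisioning code: `Store`, `AddProject`, and `ValidateAllocationOwner` are hypothetical names, and the real enforcement lives in the Postgres foreign key.

```go
package main

import (
	"errors"
	"fmt"
)

// projectKey mirrors the (id, org_id) identity of a project row.
type projectKey struct{ ProjectID, OrgID string }

// ErrProjectOrgMismatch is a hypothetical error for a pair that has no
// matching project row.
var ErrProjectOrgMismatch = errors.New("project does not belong to org")

// Store is an in-memory stand-in for the projects table.
type Store struct {
	projects map[projectKey]struct{}
}

func NewStore() *Store {
	return &Store{projects: make(map[projectKey]struct{})}
}

func (s *Store) AddProject(projectID, orgID string) {
	s.projects[projectKey{projectID, orgID}] = struct{}{}
}

// ValidateAllocationOwner accepts an allocation only when its
// (project_id, org_id) pair matches an existing project, which is the
// effect a composite foreign key has at insert time.
func (s *Store) ValidateAllocationOwner(projectID, orgID string) error {
	if _, ok := s.projects[projectKey{projectID, orgID}]; !ok {
		return ErrProjectOrgMismatch
	}
	return nil
}

func main() {
	s := NewStore()
	s.AddProject("proj-a", "org-1")

	fmt.Println(s.ValidateAllocationOwner("proj-a", "org-1")) // matching pair
	fmt.Println(s.ValidateAllocationOwner("proj-a", "org-2")) // cross-org binding
}
```

Referencing the pair rather than the project ID alone is what prevents an allocation from naming a real project under the wrong organization.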

## API And Negative Authorization Evidence

Current proof points:

1. `resolveProjectScope` denies missing or mismatched project membership and
   maps the denial to canonical project access errors.
2. Integration coverage proves soft-deleted tenant or project memberships do
   not continue authorizing project scope.
3. Managed-ingress route authz denies `project_scope_mismatch`.
4. Managed-ingress forwarding strips caller-controlled `Authorization` and
   spoofed `X-GPUaaS-*` headers before injecting trusted project/route headers.
5. Service-account route access rechecks service-account state even when route
   authorization is cached.
6. Role/capability decisions deny missing membership before permission grants.
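
The header handling in proof point 4 can be sketched as follows. This is a minimal illustration, not the managed-ingress implementation: `sanitizeForwardHeaders` and the injected header names `X-GPUaaS-Project-ID` and `X-GPUaaS-Route-ID` are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// sanitizeForwardHeaders returns a copy of the caller's headers with
// Authorization and every X-GPUaaS-* header removed, then sets trusted
// values derived from the already-authorized route.
// The injected header names are hypothetical.
func sanitizeForwardHeaders(in http.Header, projectID, routeID string) http.Header {
	out := make(http.Header, len(in))
	for name, values := range in {
		canonical := http.CanonicalHeaderKey(name)
		lower := strings.ToLower(canonical)
		if lower == "authorization" || strings.HasPrefix(lower, "x-gpuaas-") {
			continue // drop caller-controlled credentials and spoofed platform headers
		}
		out[canonical] = append(out[canonical], values...)
	}
	// Trusted values are written only after stripping, so a caller cannot
	// pre-populate them.
	out.Set("X-GPUaaS-Project-ID", projectID)
	out.Set("X-GPUaaS-Route-ID", routeID)
	return out
}

func main() {
	in := http.Header{}
	in.Set("Authorization", "Bearer caller-token")
	in.Set("X-GPUaaS-Project-ID", "spoofed-project")
	in.Set("Accept", "application/json")

	out := sanitizeForwardHeaders(in, "proj-a", "route-1")
	fmt.Println(out.Get("Authorization") == "")
	fmt.Println(out.Get("X-GPUaaS-Project-ID"))
	fmt.Println(out.Get("Accept"))
}
```

Stripping by prefix rather than by an allowlist of known platform headers means a newly added `X-GPUaaS-*` header is protected without a code change.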

Missing evidence:

1. one consolidated negative authorization matrix across v3 compute, app
   launch, app runtime credentials, storage grants, terminal token mint/open,
   managed ingress, billing/audit, and admin read models;
2. route-family negative tests for `browser_app`, `api_app`, `terminal_ws`, and
   `platform_admin`;
3. a CI gate that fails when a project-owned handler does not call the project
   scope resolver or an equivalent platform facade.

Follow-up:

- `SEC-ARCH-TENANT-AUTHZ-NEGATIVE-MATRIX-001`

## Redis, Cache, NATS, And Temporal Evidence

| Backend | Current posture | Evidence | Gap |
|---|---|---|---|
| Redis read cache | Cache keys are namespace-normalized and can include tenant/project parts. Prefix deletion can target one tenant prefix. High-risk Redis key families are inventoried with owner-scope dimensions. | `packages/shared/readcache/cache_test.go`, `doc/architecture/platform-foundation/Redis_Keyspace_Isolation_Evidence_v1.md`, `doc/architecture/platform-foundation/redis_keyspace_isolation_inventory.json`, `scripts/ci/redis_keyspace_isolation_guard.sh` | Covered for keyspace inventory and read-cache prefix deletion; individual read-model owners must keep owner parts before untrusted query parts. |
| Redis terminal/session state | Terminal tokens remain allocation/user-bound and single-use; terminal stream channels and bindings are session/allocation/node/gateway scoped. | `packages/products/gpuaas/terminal/runtime_backend_test.go`, `cmd/terminal-gateway/routes_test.go`, `scripts/ci/redis_keyspace_isolation_guard.sh` | Covered for Redis token/session keyspace behavior; live UAT still proves deployed gateway/node stream reachability. |
| NATS | Streams are domain-scoped (`platform.billing.>`, `platform.payments.>`, `gpuaas.provisioning.>`, `appplatform.>`, `storage.>`, `dlq.>`). Events carry correlation IDs. | `doc/architecture/NATS_Stream_Config.md`, `packages/shared/events` | Tenant/project ownership is payload-level, not stream-level; provisioning lifecycle events now carry optional `project_id` owner fields when allocation/project scope is known. |
| Temporal | Temporal is the execution engine for long-running workflows; persisted DB state remains product truth. Provisioning workflows use event payloads; MAAS and node-agent workflows use persisted lifecycle/onboarding/decommission records. | `doc/architecture/Intent_Control_And_Reconciliation_Model_v1.md`, `cmd/provisioning-worker`, `packages/platform/maas`, `packages/platform/adminops` | Provisioning, MAAS, and node-agent lifecycle workflows now carry safe owner/target metadata in input, memo, or persisted records. Queryable Temporal owner-scope search attributes remain tracked separately. |
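
The read-cache rule in the first row (owner parts before untrusted query parts, with tenant-prefix deletion) can be sketched with an in-memory map standing in for Redis. The key layout, the `:` separator, and the `Key`, `TenantPrefix`, and `DeleteByPrefix` helpers are assumptions for illustration and are not the `packages/shared/readcache` API.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// escapePart encodes a key part so it cannot contain the ":" separator.
func escapePart(p string) string { return url.QueryEscape(p) }

// normalizeNamespace lowercases and trims a trusted namespace label.
func normalizeNamespace(ns string) string {
	return strings.ToLower(strings.TrimSpace(ns))
}

// TenantPrefix ends with ":" so that "org-1" never matches "org-10".
func TenantPrefix(namespace, orgID string) string {
	return normalizeNamespace(namespace) + ":" + escapePart(orgID) + ":"
}

// Key places owner parts (org, project) before any untrusted query parts.
func Key(namespace, orgID, projectID string, queryParts ...string) string {
	var b strings.Builder
	b.WriteString(TenantPrefix(namespace, orgID))
	b.WriteString(escapePart(projectID))
	for _, q := range queryParts {
		b.WriteString(":")
		b.WriteString(escapePart(q))
	}
	return b.String()
}

// DeleteByPrefix removes every entry under one prefix and reports the count.
func DeleteByPrefix(cache map[string]string, prefix string) int {
	deleted := 0
	for k := range cache {
		if strings.HasPrefix(k, prefix) {
			delete(cache, k)
			deleted++
		}
	}
	return deleted
}

func main() {
	cache := map[string]string{
		Key("ReadModel", "org-1", "proj-a", "status:active"): "a",
		Key("ReadModel", "org-10", "proj-z", "status:active"): "z",
	}
	fmt.Println(Key("ReadModel", "org-1", "proj-a", "status:active"))
	fmt.Println(DeleteByPrefix(cache, TenantPrefix("readmodel", "org-1")))
	fmt.Println(len(cache))
}
```

Escaping the untrusted parts keeps a query value such as `status:active` from injecting an extra separator and shifting into the owner positions of the key.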

### NATS Tenant/Project Event Map

NATS stream and subject boundaries are domain boundaries, not tenant boundaries.
Tenant/project isolation therefore depends on payload owner fields, consumer
query boundaries, idempotency, and DLQ operator visibility.

| Stream | Subjects | Sensitive owner fields present today | Consumers / visibility | Status |
|---|---|---|---|---|
| `BILLING` | `platform.billing.low_balance_warning`, `platform.billing.auto_release_pending`, `platform.billing.balance_depleted`, `platform.billing.budget_threshold_crossed`, `platform.billing.usage.metered` | User/account billing events carry `user_id` and optional `org_id`; budget and usage events carry `org_id` and `project_id` where project-scoped. | `billing-worker`, `notification-relay`; DLQ through `dlq.>` with correlation id. | Covered for current account/project billing model; keep project billing events on explicit `project_id`. |
| `PAYMENTS` | `platform.payments.balance_credited`, `platform.payments.reconcile_failed` | `user_id`, optional `org_id`, payment/session ids. | `billing-worker`, `notification-relay`; financial replay requires operator review. | Account/user scoped; no project claim implied. |
| `PROVISIONING` | `gpuaas.provisioning.*` allocation lifecycle subjects | Payload structs carry `allocation_id`, `user_id`, optional `org_id`, optional `project_id`, node/sku details. Current producers populate `project_id` from allocation or triggering event scope when project ownership is known. | `provisioning-worker`, `billing-worker`, `notification-relay`; provisioning workflow start uses event envelope and payload. | Covered for payload owner fields; Temporal owner-scope visibility remains tracked separately. |
| `APPS` | `appplatform.runtime.*`, `appplatform.artifact.*`, `platform.artifact.*` | App instance, artifact, and runtime events carry `org_id` and `project_id` for project-owned app instances/artifacts; shared-runtime events carry `org_id` for tenant-owned runtime resources. | `app-runtime-worker` and related operators; DLQ subjects remain domain-scoped. | Covered for payload owner fields; shared runtimes are tenant-scoped by design. |
| `STORAGE` | `storage.attachment.*` | Attachment payloads carry `org_id`, `project_id`, `bucket_id`, allocation/workload/node identifiers. | provisioning/storage worker; DLQ through `dlq.>`. | Covered for payload owner fields; consumer queries still need owner-scope guard coverage with storage worker implementation. |
| `DLQ` | `dlq.>` | DLQ inherits original event body and correlation id. | Ops/replay tooling. | Partial; replay runbook should expose owner scope before replay and require idempotent original-subject replay. |
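
Because ownership is carried at payload level, a consumer can treat the payload owner fields as a claim to verify against persisted state rather than as authority. The sketch below assumes a hypothetical `ProvisioningEvent` struct, a `lookup` callback standing in for a `gpuaas_allocations` read, and hypothetical error values; it does not reproduce the structs in `packages/shared/events`.

```go
package main

import (
	"errors"
	"fmt"
)

// ProvisioningEvent is an illustration-only payload with optional owner fields.
type ProvisioningEvent struct {
	AllocationID  string
	OrgID         string // optional
	ProjectID     string // optional
	CorrelationID string
}

// AllocationOwner is the persisted owner of an allocation.
type AllocationOwner struct{ OrgID, ProjectID string }

var (
	ErrUnknownAllocation = errors.New("allocation not found")
	ErrOwnerMismatch     = errors.New("event owner fields contradict persisted allocation owner")
)

// ResolveOwner returns the persisted owner and rejects events whose
// optional owner fields disagree with it. Absent fields are tolerated
// because the payload fields are optional.
func ResolveOwner(
	ev ProvisioningEvent,
	lookup func(allocationID string) (AllocationOwner, bool),
) (AllocationOwner, error) {
	owner, ok := lookup(ev.AllocationID)
	if !ok {
		return AllocationOwner{}, ErrUnknownAllocation
	}
	if ev.OrgID != "" && ev.OrgID != owner.OrgID {
		return AllocationOwner{}, ErrOwnerMismatch
	}
	if ev.ProjectID != "" && ev.ProjectID != owner.ProjectID {
		return AllocationOwner{}, ErrOwnerMismatch
	}
	return owner, nil
}

func main() {
	persisted := map[string]AllocationOwner{
		"alloc-1": {OrgID: "org-1", ProjectID: "proj-a"},
	}
	lookup := func(id string) (AllocationOwner, bool) {
		o, ok := persisted[id]
		return o, ok
	}

	_, err := ResolveOwner(ProvisioningEvent{AllocationID: "alloc-1", ProjectID: "proj-a"}, lookup)
	fmt.Println(err)
	_, err = ResolveOwner(ProvisioningEvent{AllocationID: "alloc-1", ProjectID: "proj-b"}, lookup)
	fmt.Println(err)
}
```

The same check is a candidate guard for DLQ replay, where the event body is inherited from the original message and may be older than the current persisted state.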

### Temporal Workflow Isolation Map

Temporal does not own tenant authorization. Product/platform records remain the
source of truth, and workflow execution should be recoverable from persisted
records. The current workflow evidence is:

| Workflow family | Task queue / ID shape | Persisted owner state | Current evidence | Gap |
|---|---|---|---|---|
| Provisioning event workflow | Task queue `provisioning-workflows`; workflow ID remains `provisioning-{event_type}-{event_id_or_correlation_id}` to preserve idempotency for in-flight/redelivered events. | Event payload is the workflow input; provisioning payloads preserve optional `project_id`, and workflow memo/static summary expose safe allocation/org/project/user/node metadata for operators. Allocation ownership remains in `gpuaas_allocations`. | `cmd/provisioning-worker/worker.go`, `cmd/provisioning-worker/worker_test.go`, `doc/operations/Temporal_Search_Attribute_Registry.md`. | Covered for provisioning owner/target memo visibility. Queryable owner-scope search attributes are defined but must be namespace-provisioned before code uses `SearchAttributes:`. |
| MAAS onboarding | Workflow ID `maas-onboarding:{onboarding_id}[:attempt]`; run id stored on `gpuaas_node_onboardings`. | Onboarding records carry site/profile/SKU/host identity and requester; workflow input now carries the same safe target/requester metadata for Temporal activity/UI context. Lifecycle state remains in Postgres. | `packages/platform/maas/legacyimpl/onboarding.go`, `workflow_contract.go`, helper tests, read models and integration tests. | Covered for workflow input target metadata; queryable Temporal search attributes are defined but not yet consumed by code. |
| MAAS decommission | Workflow ID `maas-decommission:{decommission_id}[:attempt]`; run id stored on `gpuaas_node_decommissions`. | Decommission records carry node/site/system/mode/requester identity; workflow input now carries the same safe target/requester metadata for Temporal activity/UI context. Workflow recovery updates persisted records. | `packages/platform/maas/legacyimpl/decommission.go`, `workflow_contract.go`, execution/read model tests. | Covered for workflow input target metadata; queryable Temporal search attributes are defined but not yet consumed by code. |
| Node-agent lifecycle | Workflow ID `node-agent-lifecycle:{lifecycle_id}` remains stable and idempotent. | Lifecycle records carry node, target version, actor, safety policy, and correlation id; workflow input and Temporal memo/static summary now carry the same safe node/lifecycle/actor metadata. | `packages/platform/adminops`, `cmd/api/temporal.go`, `cmd/provisioning-worker/temporal.go`, focused workflow input tests. | Covered for workflow input and memo visibility; queryable Temporal search attributes are defined but not yet consumed by code. |
| MAAS reconciliation scan | Scheduled workflow, task queue `provisioning-workflows`, schedule id `maas-reconciliation-scan`. | Reconciliation findings write product/platform drift records. | `cmd/provisioning-worker/temporal.go`, MAAS reconciliation read models. | Platform-wide by design; run evidence should show it cannot mutate tenant resources without persisted owner checks. |
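
The workflow ID shapes in the table can be expressed as small builder functions. This is a sketch, not the code in `cmd/provisioning-worker` or `packages/platform/maas`: the function names are hypothetical, the example event type is invented, and two details are assumptions: that the event ID is preferred with the correlation ID as fallback, and that `attempt` is a positive integer that is omitted when zero.

```go
package main

import (
	"fmt"
	"strconv"
)

// ProvisioningWorkflowID follows provisioning-{event_type}-{event_id_or_correlation_id}.
// Assumption: event ID is used when present, otherwise the correlation ID.
func ProvisioningWorkflowID(eventType, eventID, correlationID string) string {
	key := eventID
	if key == "" {
		key = correlationID
	}
	return "provisioning-" + eventType + "-" + key
}

// withAttempt appends the optional ":attempt" suffix.
// Assumption: attempt <= 0 means no suffix.
func withAttempt(base string, attempt int) string {
	if attempt <= 0 {
		return base
	}
	return base + ":" + strconv.Itoa(attempt)
}

// MAASOnboardingWorkflowID follows maas-onboarding:{onboarding_id}[:attempt].
func MAASOnboardingWorkflowID(onboardingID string, attempt int) string {
	return withAttempt("maas-onboarding:"+onboardingID, attempt)
}

// MAASDecommissionWorkflowID follows maas-decommission:{decommission_id}[:attempt].
func MAASDecommissionWorkflowID(decommissionID string, attempt int) string {
	return withAttempt("maas-decommission:"+decommissionID, attempt)
}

// NodeAgentLifecycleWorkflowID follows node-agent-lifecycle:{lifecycle_id}.
func NodeAgentLifecycleWorkflowID(lifecycleID string) string {
	return "node-agent-lifecycle:" + lifecycleID
}

func main() {
	fmt.Println(ProvisioningWorkflowID("allocation_requested", "evt-1", "corr-9"))
	fmt.Println(ProvisioningWorkflowID("allocation_requested", "", "corr-9"))
	fmt.Println(MAASOnboardingWorkflowID("onb-7", 0))
	fmt.Println(MAASDecommissionWorkflowID("dec-3", 2))
	fmt.Println(NodeAgentLifecycleWorkflowID("lc-5"))
}
```

Deriving the ID only from persisted record identifiers or event identifiers means a redelivered event or a restarted worker produces the same ID, which is what lets the workflow engine reject a duplicate start.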

Follow-up:

- `SEC-ARCH-EVENT-WORKFLOW-ISOLATION-EVIDENCE-001`
- `SEC-ARCH-PROVISIONING-EVENT-PROJECT-OWNER-FIELDS-001`
- `SEC-ARCH-TEMPORAL-WORKFLOW-OWNER-SCOPE-VISIBILITY-001`

## Terminal, Proxy, And App Runtime Boundaries

| Surface | Current boundary | Evidence | Residual risk |
|---|---|---|---|
| Terminal WebSocket | API mints/validates terminal session binding; terminal-gateway proxies allocation-bound stream. | `cmd/terminal-gateway/routes.go`, `cmd/terminal-gateway/routes_test.go` | Need cross-project allocation-token negative evidence. |
| Managed ingress `api_bearer` | Route authz validates actor, project, route, route version, route family, client auth mode, and service-account state. | `cmd/api/routes_test.go`, `doc/architecture/Managed_Ingress_Tenant_Isolation_and_Scaling_v1.md` | Need production network controls so workloads cannot bypass Pomerium/shared edge. |
| Browser app routes | Pomerium can own the browser login redirect, but GPUaaS remains the route ownership authority. | `doc/architecture/Managed_Ingress_Tenant_Isolation_and_Scaling_v1.md` | Need stale-route denial and dedicated proxy-pool evidence. |
| App runtime | App instance, components, routes, credentials, and worker operations carry org/project context. | `packages/products/appplatform/runtime`, integration tests | Need complete negative tests for cross-project app instance, credential, route, and member-operation access. |
| Storage grants | Owner project and subject project are explicit in grant request and read model. | `cmd/api/routes_v3_storage_grants.go` | Need negative tests for cross-project bucket/grant revoke and list paths. |
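
The cross-project terminal evidence that is still needed depends on two token properties stated earlier in this document: tokens are allocation/user-bound and single-use. The sketch below models both with an in-memory map in place of Redis. `TokenStore`, `Mint`, and `Redeem` are hypothetical names, and consuming the token even on a binding mismatch is a design assumption for the example, not a documented behavior.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// binding records what a terminal token was minted for.
type binding struct{ AllocationID, UserID string }

// ErrTokenInvalid deliberately does not reveal why redemption failed.
var ErrTokenInvalid = errors.New("terminal token invalid, used, or bound elsewhere")

// TokenStore is an in-memory stand-in for the Redis-backed token state.
type TokenStore struct {
	mu     sync.Mutex
	tokens map[string]binding
}

func NewTokenStore() *TokenStore {
	return &TokenStore{tokens: make(map[string]binding)}
}

// Mint binds a token to one allocation and one user.
func (s *TokenStore) Mint(token, allocationID, userID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[token] = binding{allocationID, userID}
}

// Redeem succeeds at most once per token, and only for the bound
// allocation and user. The token is removed on first lookup.
func (s *TokenStore) Redeem(token, allocationID, userID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, ok := s.tokens[token]
	if !ok {
		return ErrTokenInvalid
	}
	delete(s.tokens, token) // assumption: consumed even when the binding check fails
	if b.AllocationID != allocationID || b.UserID != userID {
		return ErrTokenInvalid
	}
	return nil
}

func main() {
	s := NewTokenStore()
	s.Mint("tok-1", "alloc-a", "user-1")
	fmt.Println(s.Redeem("tok-1", "alloc-a", "user-1")) // first use
	fmt.Println(s.Redeem("tok-1", "alloc-a", "user-1")) // replay

	s.Mint("tok-2", "alloc-a", "user-1")
	fmt.Println(s.Redeem("tok-2", "alloc-b", "user-1")) // another allocation
}
```

A cross-project negative test would follow the third case: mint for an allocation in one project, then attempt to open a session against an allocation in another.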

Follow-up:

- `SEC-ARCH-APP-RUNTIME-CROSS-PROJECT-NEGATIVE-TESTS-001`
- `SEC-ARCH-TERMINAL-CROSS-PROJECT-NEGATIVE-TESTS-001`

## GPU Workload Isolation Model

| Shape | Current model | Isolation assumption | Residual risk |
|---|---|---|---|
| Bare metal | One allocation claims a whole node; active bare-metal allocation blocks active node reuse. | Tenant workload owns the host for the allocation period; release cleanup or reimage policy determines reuse confidence. | Default `user-revoke` cleanup is weaker than full reimage; regulated profiles need MAAS full reimage or stronger wipe evidence. |
| VM slice | Allocation claims approved host-local slots: GPU, fabric/VF, NVMe/volume, NUMA, vNIC identity. | Isolation comes from VM boundary, VFIO/IOMMU grouping, approved slot mapping, storage wipe, and network identity. | Slice mode does not yet have full production evidence; it needs host readiness, slot approval, VFIO/IOMMU, wipe proof, and network isolation tests. |
| PCI passthrough | PCI devices are assigned only from platform-approved slot claims. | Device identity and IOMMU group safety must be verified before assignment. | Needs negative tests for unapproved device paths and unsafe IOMMU groups. |
| Fabric/RDMA | Fabric compatibility is part of slot compatibility for IB/RoCE products. | Low-latency fabric is promised only when matching slot capability is verified. | Need RDMA/fabric tenant isolation and no-cross-tenant fabric reachability evidence before regulated/dedicated claims. |
| Shared GPU / MIG / vGPU | Future capacity shapes only. | Not current baseline. | Do not market as supported isolation until explicit mechanism, scheduler model, and cleanup proof exist. |
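
The bare-metal rule in the first row, where an active allocation blocks active node reuse, can be modeled as a registry that holds at most one active claim per node. This is an in-memory sketch with hypothetical names (`ClaimRegistry`, `Claim`, `Release`); in the platform the uniqueness is enforced by the database schema, and this example does not cover cleanup or reimage.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeClaimed is a hypothetical error for a node that already has an
// active bare-metal claim.
var ErrNodeClaimed = errors.New("node already has an active bare-metal claim")

// ClaimRegistry tracks the single active claim per node.
type ClaimRegistry struct {
	activeByNode map[string]string // node ID -> allocation ID
}

func NewClaimRegistry() *ClaimRegistry {
	return &ClaimRegistry{activeByNode: make(map[string]string)}
}

// Claim fails when any allocation already holds the node.
func (r *ClaimRegistry) Claim(nodeID, allocationID string) error {
	if _, held := r.activeByNode[nodeID]; held {
		return ErrNodeClaimed
	}
	r.activeByNode[nodeID] = allocationID
	return nil
}

// Release clears the claim only when the caller is the current holder.
func (r *ClaimRegistry) Release(nodeID, allocationID string) {
	if r.activeByNode[nodeID] == allocationID {
		delete(r.activeByNode, nodeID)
	}
}

func main() {
	r := NewClaimRegistry()
	fmt.Println(r.Claim("node-1", "alloc-a")) // first claim
	fmt.Println(r.Claim("node-1", "alloc-b")) // blocked while alloc-a is active

	r.Release("node-1", "alloc-b") // not the holder, so nothing changes
	fmt.Println(r.Claim("node-1", "alloc-b"))

	r.Release("node-1", "alloc-a")
	fmt.Println(r.Claim("node-1", "alloc-b")) // allowed after release
}
```

The model shows exclusivity only. Reuse confidence after release still depends on the cleanup or reimage policy named in the residual-risk column.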

Baseline production can continue with bare-metal and VM/compute profiles when
the selected environment evidence matches the product promise. Regulated or
dedicated-profile claims require stronger cleanup, network, storage, and
hardware-root evidence.

Follow-up:

- `SEC-ARCH-GPU-SLICE-ISOLATION-EVIDENCE-001`
- `SEC-ARCH-FABRIC-RDMA-ISOLATION-EVIDENCE-001`

## Current Decision

The current GPUaaS isolation posture may be described as:

```text
layered logical tenant/project isolation with schema constraints, platform IAM,
project-scoped queries, route authorization, signed node-agent tasks, and
allocation/claim ownership controls
```

It must not be described as:

```text
regulated workload isolation, FedRAMP-ready isolation, HIPAA/ePHI-ready
isolation, PCI CDE segmentation, RLS-only tenancy, or cryptographically
tamper-proof workload separation
```

## Required Follow-Up Tasks

The following tasks are required before this evidence package can support a
stronger production or regulated-profile claim:

| Task | Owner | Purpose |
|---|---|---|
| `SEC-ARCH-TENANT-AUTHZ-NEGATIVE-MATRIX-001` | Backend + security | Build and run a route-family negative authorization matrix across user, service-account, project, terminal, storage, managed-ingress, billing, and audit surfaces. |
| `SEC-ARCH-EVENT-WORKFLOW-ISOLATION-EVIDENCE-001` | Backend + ops | Prove NATS event payload ownership, DLQ handling, Temporal workflow IDs/payload scope, and consumer query boundaries. |
| `SEC-ARCH-PROVISIONING-EVENT-PROJECT-OWNER-FIELDS-001` | Backend + security + ops | Add and test `project_id` owner fields for provisioning lifecycle events that drive allocation/project-sensitive consumers and workflows. |
| `SEC-ARCH-TEMPORAL-WORKFLOW-OWNER-SCOPE-VISIBILITY-001` | Backend + ops + security + architecture | Make owner/target scope visible in workflow input, persisted evidence, and safe Temporal operational metadata; prove reruns cannot cross owner boundaries. |
| `SEC-ARCH-APP-RUNTIME-CROSS-PROJECT-NEGATIVE-TESTS-001` | Backend | Add negative tests for app instance, runtime credential, route, and member-operation cross-project access. |
| `SEC-ARCH-TERMINAL-CROSS-PROJECT-NEGATIVE-TESTS-001` | Backend + ops | Add negative tests for terminal token mint/open against another project's allocation/session. |
| `SEC-ARCH-GPU-SLICE-ISOLATION-EVIDENCE-001` | Architecture + ops | Produce slice-mode isolation proof for slot approval, VFIO/IOMMU, storage wipe, and cleanup. |
| `SEC-ARCH-FABRIC-RDMA-ISOLATION-EVIDENCE-001` | Architecture + ops | Produce IB/RoCE tenant isolation and no-cross-tenant reachability evidence before fabric isolation claims. |

Until those tasks are complete, this package should be used as a current-state
evidence map and gap register, not a production or regulated-profile attestation.
