# Platform UAT Completeness Matrix v1

Status: draft
Owner: Product / Platform / Ops
Primary environments: kind, dev

## Purpose

This document defines what must be true before post-PSSM UAT is considered
complete. It is not a replacement for pre-UAT readiness gates. Readiness gates
prove the environment and deployed services are fit to test. This matrix proves
the UAT suite covers the product workflows and platform invariants that matter.

The problem this matrix prevents: a full UAT script can exit green while a
critical persona, workflow, negative path, cleanup path, or environment lane was
never exercised.

## Completion Rules

- UAT is complete only when every required row below is `passed`, or explicitly
  `skipped` with owner, reason, expiry, and Fairway follow-up task.
- A green shell exit is supporting evidence only. It is not sufficient unless it
  maps to the required rows in this matrix.
- Required rows are not enough by themselves. For critical flows, the Product
  Quality flow coverage map must also show happy path, empty state, blocked
  state, in-place recovery, negative path, cleanup path, fixtures, contracts,
  and automation anchors before broad UAT is treated as a verification gate.
- Missing fixtures are not product failures. They are recorded as `blocked`
  prerequisite rows and must point to the owning setup gate.
- Basic login, schema compatibility, deployed image freshness, environment
  profile drift, provider bootstrap, and app artifact visibility must be proven
  before full UAT starts.
- UAT should find integrated workflow regressions, not stale deploys, broken
  schema migrations, missing credentials, or harness defects.
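The admission and classification rules above can be sketched as a small Python gate. This is a minimal sketch under stated assumptions, not the logic in `scripts/ops/demo_uat_package.sh`: the gate identifiers and the failure-cause labels (`missing_fixture`, `no_capacity`, and so on) are hypothetical, and a real harness would derive causes from its own prerequisite checks.

```python
# Pre-UAT gates listed in the Completion Rules. These identifiers are
# hypothetical labels, not names used by an existing script.
PRE_UAT_GATES = (
    "basic_login",
    "schema_compatibility",
    "image_freshness",
    "environment_profile_drift",
    "provider_bootstrap",
    "app_artifact_visibility",
)

# Hypothetical cause labels for failures that come from missing setup
# (fixtures, credentials, capacity) rather than from the product itself.
PREREQ_CAUSES = frozenset({
    "missing_fixture",
    "missing_ssh_key",
    "missing_storage",
    "missing_app_route",
    "missing_service_account",
    "no_capacity",
})


def unmet_pre_uat_gates(gate_results: dict[str, bool]) -> list[str]:
    """Return gates that have not passed; an empty list means full UAT may start.

    A gate with no recorded result counts as unmet.
    """
    return [gate for gate in PRE_UAT_GATES if not gate_results.get(gate, False)]


def classify_failed_row(cause: str) -> str:
    """Record prerequisite gaps as `blocked`, everything else as `failed`."""
    return "blocked" if cause in PREREQ_CAUSES else "failed"
```

Keeping the cause-to-status mapping explicit is what stops a missing SSH key or an empty capacity pool from surfacing as a product `failed` row.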

## Environment Lanes

| Lane | Provider | Purpose | Required Before Completion |
|---|---|---|---|
| kind | MAAS-LXD | Fast post-PSSM iteration, MAAS-LXD lifecycle, compute-only app/runtime proof | read-only, mutating compute, terminal, app runtime, managed ingress, clean-log checks |
| dev | Proxmox | Higher-capacity workflow proof, browser UAT, app runtime, scheduler, provider bootstrap | read-only, mutating compute, browser workflows, scheduler apps, OpenAI endpoint, cleanup |
| demo | Proxmox/demo | External/demo handoff only | not required for post-PSSM completion unless explicitly requested |
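One way to make the lane requirements checkable is to transcribe the table into data and diff it against the checks that already have evidence. This sketch assumes each check is tracked as a free-text label matching the table; the labels and the function name are hypothetical.

```python
# Required-before-completion checks per lane, transcribed from the table above.
# The labels are shorthand for illustration only.
LANE_REQUIRED_CHECKS = {
    "kind": frozenset({"read-only", "mutating compute", "terminal",
                       "app runtime", "managed ingress", "clean-log checks"}),
    "dev": frozenset({"read-only", "mutating compute", "browser workflows",
                      "scheduler apps", "OpenAI endpoint", "cleanup"}),
}

# demo is not required for post-PSSM completion unless explicitly requested,
# and the table defines no check set for it.
OPTIONAL_LANES = frozenset({"demo"})


def missing_lane_checks(lane: str, proven: set[str]) -> set[str]:
    """Return required checks for `lane` that have no proving evidence yet."""
    if lane in OPTIONAL_LANES:
        return set()
    return set(LANE_REQUIRED_CHECKS[lane] - proven)
```

Against the 2026-06-11 snapshot, `mutating compute` would still be reported missing for `kind`, because the compute launch prerequisite is blocked.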

## 2026-06-11 Stabilization Closeout Snapshot

Current closeout status is intentionally split between proven non-mutating
coverage and blocked mutating coverage. The kind deploy and smoke profile are
usable as release evidence, but full kind UAT and dev promotion are not yet
approved because the kind compute launch prerequisite has no schedulable
capacity.

| Area | Current Status | Evidence | Completion Impact |
|---|---|---|---|
| Kind deploy | Passed | `.fairway/artifacts/deploy-kind-stabilization-closeout-20260611/kind-parity-up-envlocal-rerun.log`; `.fairway/artifacts/deploy-kind-stabilization-closeout-20260611/kind-parity-validate-envlocal.log`; `.fairway/artifacts/local-deploy-monitor-kind-stabilization-20260611/summary.md` | Satisfies deploy evidence for kind, subject to UAT blockers below |
| Deterministic kind smoke profile | Passed with expected block | `.fairway/artifacts/kind-smoke-profile-wrapper-20260611/full/summary.json`; Fairway task `HARNESS-FIX-KIND-SMOKE-PROFILE-ENV-001`; commit `0f9393f0e65b901be28ea73fcde6ac1ab27cfd7b`; GitLab pipeline `2635` | Wrapper is the preferred command for closeout smoke evidence |
| Kind read-only UAT smoke | Passed | `.fairway/artifacts/kind-uat-smoke-20260611/read-only-smokes-kind/`; `.fairway/artifacts/kind-uat-smoke-20260611/billing-read-smoke/summary.json` | Covers read-model, authz, persona, billing/account, and platform read paths |
| Kind terminal/connect smoke | Passed against existing allocation | `.fairway/artifacts/kind-uat-smoke-20260611/non-mutating-connect/terminal-remote-smoke.log` | Covers terminal session behavior only for existing allocation `f090e861-358c-48c8-b6b0-42cc8a7db718` |
| Kind mutating compute launch | Blocked | `.fairway/artifacts/kind-compute-capacity-recheck-20260611/prereq-baseline/summary.md`; Fairway task `OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001` | Blocks `UAT-COMPUTE-001`, `UAT-COMPUTE-002`, and any downstream app/runtime row that requires a fresh allocation |
| Dev deploy and dev full UAT | Gated | `.fairway/artifacts/kind-compute-capacity-recheck-20260611/ops-capacity-decision-request.md` | Do not start until kind capacity is restored or an approved alternate profile/waiver exists |

While this blocker remains open, the only approved work is non-mutating
coverage: auth/session, account/security, catalog/read models,
billing/finance reads, admin/ops read models, terminal/connect against the
existing active allocation, documentation, and evidence cleanup. Restored
capacity, an approved alternate profile, or an approved waiver must be recorded
on `OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001` before any full kind UAT or dev
deploy claim can proceed.
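The gate described above reduces to a set-membership check. The sketch below is hypothetical: the decision labels stand in for whatever is actually recorded on `OPS-FIX-KIND-COMPUTE-CAPACITY-PREREQ-001`, and the scope strings only restate the approved non-mutating areas.

```python
# Non-mutating areas approved while the kind capacity blocker is open.
NON_MUTATING_SCOPE = (
    "auth/session",
    "account/security",
    "catalog/read models",
    "billing/finance reads",
    "admin/ops read models",
    "terminal/connect (existing allocation only)",
    "documentation",
    "evidence cleanup",
)

# Hypothetical labels for the three decisions that can lift the block.
UNBLOCKING_DECISIONS = frozenset({
    "capacity_restored",
    "alternate_profile_approved",
    "waiver_approved",
})


def mutating_claims_allowed(recorded_on_task: set[str]) -> bool:
    """True once any unblocking decision is recorded on the capacity task."""
    return bool(UNBLOCKING_DECISIONS & recorded_on_task)


def approved_scope(recorded_on_task: set[str]) -> tuple[str, ...]:
    """Return the work areas that may proceed given the recorded decisions."""
    if mutating_claims_allowed(recorded_on_task):
        return NON_MUTATING_SCOPE + ("full kind UAT", "dev deploy")
    return NON_MUTATING_SCOPE
```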

## Required Coverage

| ID | Persona | Workflow / Invariant | Automation Layer | kind | dev | Required Evidence |
|---|---|---|---|---|---|---|
| UAT-AUTH-001 | End user / admin | Login, session refresh, whoami, project context | API + Playwright | Required | Required | token/session smoke, shell screenshot or trace, correlation IDs on failures |
| UAT-READ-001 | End user | Catalog, workloads, billing posture, account/security read models | API + UI smoke | Required | Required | read-only summary and route/API result artifacts |
| UAT-COMPUTE-001 | End user | Compute launch from catalog to active | API + browser where available | Required | Required | allocation id, provider resource id, state transition evidence |
| UAT-COMPUTE-002 | End user | Compute release and provider cleanup | API + provider read model | Required | Required | release state, node/provider cleanup evidence |
| UAT-TERMINAL-001 | End user / ops | Terminal token mint, WebSocket connect, command execution | terminal smoke + browser | Required | Required | terminal smoke log, no query-token use, cleanup continuation on failure |
| UAT-METRICS-001 | End user / ops | Current metrics and classified historical metrics state | API + UI smoke | Required | Required | current metrics response, classified time-series status |
| UAT-APP-001 | App user | Launchable OCI app launch/connect/decommission | SDK/API + browser route | Required | Required | app instance id, route id, route health, decommission evidence |
| UAT-APP-002 | App developer | App SDK manifest/launch/connect contract | SDK smoke | Required | Required | contract-only and live-runtime evidence separated |
| UAT-OPENAI-001 | App developer | OpenAI-compatible endpoint with service-account bearer auth | API client smoke | Conditional | Required | `/v1/models` plus chat/completions, service-account actor attribution |
| UAT-SCHED-001 | End user / ops | RKE2/Headlamp scheduler app | API + browser route | Conditional | Required | scheduler instance, managed route, controller credential evidence |
| UAT-SCHED-002 | End user / ops | Slurm scheduler app | API + runtime smoke | Conditional | Required | scheduler instance and controller/runtime readiness evidence |
| UAT-STORAGE-001 | Project admin | Storage bucket/grant/mount flow | API + UI smoke | Required | Required | create/grant/mount/detach/delete or approved skip |
| UAT-IAM-001 | Tenant/project admin | Project, membership, service-account lifecycle | API + browser | Required | Required | create/update/delete, authz transition, audit evidence |
| UAT-BILLING-001 | Finance/admin | Billing balance, usage records, insufficient balance guard | API + UI smoke | Required | Required | billing read model, blocked launch proof, usage/audit rows |
| UAT-MFA-USER-001 | Human user / platform admin / platform ops / platform superadmin / security reviewer | MFA factor lifecycle: no-factor setup, existing-factor manage, disable/delete, lost-phone/app-upgrade recovery, locked-out recovery, provider unavailable, provider return success/cancel/error, privileged-human behavior, break-glass, non-human exclusion, branding, provider internals, and current-session assurance | Product UAT checklist + browser/e2e/API/runbook substitute gates | Required | Required before MFA product-ready claim | one result per subpath in `MFA_Factor_Lifecycle_UAT_Coverage_v1.md`; sanitized artifacts; gap task, owner, expiry, substitute gate, and residual risk for every non-pass |
| UAT-MFA-OPS-001 | Platform ops / security / governance | MFA operational rollout readiness: isolated IdP topology, target readback, browser auth-entry, token/API assurance plan, rollback owner, and evidence closeout | readiness bundle + flow-contract review | Conditional | Required before exact-window approval | sanitized readiness bundle, flow-contract packet, reviewed target-readback, no live/source mutation |
| UAT-MFA-OPS-002 | Platform ops / security / governance | MFA rollback, cleanup, and break-glass lifecycle, including safe denial/stop paths and post-action readback | runbook packet + dry-run/readback evidence | Conditional | Required before exact-window approval | rollback/delete owner, cleanup/readback artifact, break-glass owner/approver/expiry, residual risk |
| UAT-MFA-OPS-003 | Platform ops / backend / security | MFA assurance propagation and non-human exclusions for browser, API, CLI, service-account, API-key, and automation flows | non-live plan + contract/API tests where implemented | Conditional | Required before enforcement claim | `amr`/`acr` or explicit non-claim plan, token/API matrix plan, service-account/API-key exclusion proof |
| UAT-OPS-001 | Platform operator | Provider capacity refresh and bootstrap preflight | ops script + API | Required | Required | capacity snapshot, bootstrap reachability, provider lane evidence |
| UAT-OPS-002 | Platform operator | Node lifecycle and orphan cleanup | API + provider read model | Required | Required | node/provider state, cleanup or quarantine evidence |
| UAT-EDGE-001 | End user / security | User-safe errors for app/proxy/upstream failures | negative smoke + UI | Required | Required | branded/product error, correlation ID, classified 5xx cause |
| UAT-OBS-001 | Ops/security | Clean logs for audit, evidence, usage, authz decisions | log gate | Required | Required | clean-log report scoped to UAT window and approved-warning list |
| UAT-HARNESS-001 | Governance | Harness exit status, summary, body capture, cleanup continuation agree | failure-injection suite | Required | Required | forced-failure evidence with nonzero exit and preserved body |
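A sketch of deriving the mandatory row set for a lane from the Required Coverage table, assuming the table is encoded as data. Only a few rows are transcribed, and the treatment of `Conditional` (mandatory only when a run declares the row in scope) is an assumption; the matrix itself does not define when a conditional row applies.

```python
# Excerpt of the Required Coverage table: matrix_id -> (kind level, dev level).
# Only a few rows are shown; a real encoding would carry every row.
MATRIX = {
    "UAT-AUTH-001": ("required", "required"),
    "UAT-COMPUTE-001": ("required", "required"),
    "UAT-OPENAI-001": ("conditional", "required"),
    "UAT-SCHED-001": ("conditional", "required"),
    "UAT-HARNESS-001": ("required", "required"),
}
LANE_COLUMN = {"kind": 0, "dev": 1}


def expected_rows(lane: str, conditionals_in_scope: frozenset[str] = frozenset()) -> set[str]:
    """Return the matrix IDs treated as mandatory for `lane`.

    Assumption: a `conditional` row becomes mandatory only when the run
    declares it in scope.
    """
    column = LANE_COLUMN[lane]
    expected = set()
    for matrix_id, levels in MATRIX.items():
        level = levels[column]
        if level == "required" or (level == "conditional" and matrix_id in conditionals_in_scope):
            expected.add(matrix_id)
    return expected
```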

## Known Current Gaps

These gaps mean current UAT evidence is not yet enough to call post-PSSM UAT
complete:

| Gap | Impact | Fairway Task |
|---|---|---|
| No single post-PSSM summary maps UAT output back to this matrix. | A green script exit can hide missing persona or workflow rows. | `PSSM-UAT-CI-PROMOTION-WIRING-001` |
| Pre-UAT admission does not yet block on touched-service image freshness. | UAT can spend hours rediscovering stale API, worker, terminal, or web images. | `PSSM-UAT-PREDEPLOY-SERVICE-FRESHNESS-GATE-001` |
| Integration smoke can still be warn-only outside pre-UAT mode. | DB-backed and trace defects may not fail before full UAT starts. | `PSSM-UAT-BLOCKING-INTEGRATION-SMOKE-001` |
| Live prerequisites are partially implicit in `demo_uat_package.sh`. | Missing SSH keys, storage, app routes, service accounts, or capacity are reported as product failures. | `PSSM-UAT-LIVE-PREREQ-GATE-001` |
| App SDK evidence separates contract-only and live runtime proof in scripts, but UAT completion does not yet require both. | Developer-facing app contract gaps can pass as runtime-only UAT. | `PSSM-UAT-APP-SDK-LIVE-CONTRACT-GATE-001` |
| Clean-log audit/evidence/usage assertions are not yet a hard prerequisite for successful UAT rows. | Hidden platform defects can sit behind otherwise successful workflows. | `PSSM-UAT-CLEAN-LOG-OBSERVABILITY-GATE-001` |
| Harness failure-injection coverage is tracked but not yet a required gate. | The UAT harness itself can mask failures or lose response bodies. | `PSSM-UAT-HARNESS-FAILURE-INJECTION-001` |
| Kind mutating launch can use a default region instead of the live provider region. | UAT may create on-demand capacity or fail placement instead of using the intended MAAS-LXD worker lane. | `PSSM-KIND-REGION-PROFILE-LAUNCH-GATE-001` |
| Scheduler launch precheck and submit can diverge on required dependency blockers. | Late runtime 500s replace early product-owned prerequisite errors. | `PSSM-SCHEDULER-LAUNCH-PRECHECK-SUBMIT-PARITY-001` |
| Bare-metal allocation create can proceed without a resolvable SSH key path. | Provisioning fails after mutation instead of blocking before allocation create. | `PSSM-ALLOCATION-CREATE-SSH-KEY-VALIDATION-001` |
| CPU-only compute VM allocation behavior lacks explicit zero-GPU contract coverage. | Compute-only UAT can regress through GPU-specific assumptions. | `PSSM-ZERO-GPU-ALLOCATION-CONTRACT-TEST-001` |
| App member-operation add can fail with allocation-intent payloads. | Scheduler/app-runtime UAT can expose handler 500s that unit tests should catch. | `PSSM-APP-MEMBER-OPERATION-ADD-500-001` |
| Slurm worker-add retry behavior is not proven idempotent. | Retried UAT or controller operations can leave scheduler workers in inconsistent state. | `PSSM-SLURM-WORKER-ADD-IDEMPOTENCY-001` |
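`UAT-HARNESS-001` asks that exit status, summary, body capture, and cleanup continuation agree. A minimal Python sketch of a step runner with those properties follows; `run_step`, `StepResult`, and `summarize` are hypothetical names, and the sketch does not describe how the existing harness is built. Treating a failed cleanup as a failed step is a design choice of this sketch, not something the matrix specifies.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    name: str
    ok: bool
    body: str  # response body, preserved even when the step fails


def run_step(name: str,
             action: Callable[[], tuple[bool, str]],
             cleanup: Callable[[], None],
             log: list[str]) -> StepResult:
    """Run one step, keep its body, and always attempt cleanup afterwards."""
    ok, body = False, ""
    try:
        ok, body = action()
    except Exception as exc:  # an exception is recorded as a failure, not a crash
        ok, body = False, f"exception: {exc!r}"
    try:
        cleanup()
        log.append(f"cleanup ran: {name}")
    except Exception as exc:
        ok = False  # a failed cleanup also fails the step
        log.append(f"cleanup failed: {name}: {exc!r}")
    return StepResult(name, ok, body)


def summarize(results: list[StepResult]) -> tuple[dict[str, str], int]:
    """Derive summary and exit code from the same results so they cannot disagree."""
    summary = {r.name: ("passed" if r.ok else "failed") for r in results}
    exit_code = 0 if results and all(r.ok for r in results) else 1
    return summary, exit_code
```

Deriving the exit code from the same results that produce the summary is what keeps a green exit from coexisting with a failed row; an empty run also exits nonzero.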

## MFA Operational Journey Coverage

MFA rollout, rollback, and recovery are approval-gated operational journeys.
They must be represented in UAT coverage even when the unsafe parts cannot run
inside ordinary UAT. A row result may not be omitted on the grounds that the
live workflow is unsafe; it must be `passed`, `blocked`, or `skipped` with a
substitute gate, owner, artifact, expiry, and residual risk.

`UAT-MFA-USER-001` is the product-facing factor lifecycle row. It uses
`doc/operations/MFA_Factor_Lifecycle_UAT_Coverage_v1.md` as its subpath matrix
and must be evaluated before any broad MFA product-ready claim. The operational
rows below remain separate because exact-window rollout, rollback,
break-glass, and token/API assurance gates can be unsafe to execute inside
ordinary UAT.

These rows are evidence gates only. Passing them does not authorize live MFA,
source/prod mutation, disposable preflight rerun, credential submission,
token/API matrix execution, sensitive-operation gate execution, break-glass
use, deploy, merge, push, or cleanup.

| Operational journey | Unsafe-to-run scope | Substitute gate | Owner | Required evidence artifact | Residual risk / follow-up |
|---|---|---|---|---|---|
| Isolated IdP rollout and target topology | Creating, changing, or deleting source/prod Keycloak realm, client, flow, user, or group state. | MFA drill readiness bundle validation plus Product Quality flow-contract review. | Ops owns target/readback; security owns MFA boundary; governance owns approval/evidence boundary. | Sanitized readiness bundle with target readback, source-realm read-only proof, isolated realm state, Keycloak API proof, Fairway proof, and browser runtime launch proof. | Live provider timing and source/prod behavior remain unproven until an approved exact-window packet. |
| Browser auth-entry and MFA state sequence | Browser credential submission or OTP/TOTP material use outside an approved disposable or live packet. | Auth-entry classifier and disposable browser-flow proof in a non-source/approved isolated realm. | Product Quality / Integration with ops and security review. | Sanitized classifier/runner artifacts showing expected login, MFA challenge/enrollment, callback, blocked, excluded, or fail-closed state. | Source-realm browser behavior remains a non-claim unless explicitly mapped by an approved packet. |
| Rollback and cleanup | Realm-wide MFA disablement, production rollback, destructive realm/client/user/group cleanup, or unapproved retained fixture deletion. | Rollback packet with dry-run or isolated-realm delete/readback evidence and cleanup manifest. | Ops owns rollback and cleanup; governance owns closeout evidence; security owns no-secret/no-factor evidence. | Before/after readback, rollback/delete owner, scoped target, stop condition, cleanup manifest, and Fairway follow-up for retained or blocked resources. | Actual live rollback duration and provider-side eventual consistency remain residual until a live approved drill closeout. |
| Break-glass lifecycle | Activating emergency access, using break-glass credentials, or extending emergency access. | Break-glass packet/readback review without activation unless separately approved. | Security owns custody and expiry; ops owns activation/deactivation runbook; governance owns approval and post-incident evidence. | Owner, approver, expiry, factor custody class, two-person control, activation/deactivation readback plan, post-incident review owner. | Break-glass usability remains unproven until a separately approved drill or incident packet. |
| Token/API assurance propagation | Running token/API matrix checks with live credentials or treating provider claims as production enforcement proof. | Token/API matrix plan plus contract/API tests where implementation exists. | Backend owns token/session contract; security owns assurance interpretation; governance owns non-claim wording. | Source-vs-isolated issuer plan, human MFA persona expectations, normal-user expectations, service-account/API-key/client-credential exclusions, expected `amr`/`acr` or explicit non-claim handling. | Actual production token claim behavior remains unproven until reviewed provider evidence exists. |
| Sensitive-operation MFA/step-up catalog | Executing sensitive gates or using MFA state to allow/deny production operations. | Closed sensitive-operation and audit-action catalog review before implementation. | Security and governance own the catalog; backend owns implementation tasks. | Catalog entries with operation, actor, required assurance/freshness, audit action, denial path, and tests/follow-up task. | No sensitive-operation enforcement claim until implementation, tests, and reviews close. |
| Non-human exclusions | Pulling service accounts, API keys, client credentials, node identities, or automation identities into human browser MFA. | Non-human exclusion proof in rollout/readiness bundle and API/CLI tests where implemented. | Security owns credential class boundaries; backend/ops own readback evidence. | Service-account/API-key/client-credential exclusion evidence, affected auth flow list, and follow-up for unknown flows. | Unknown auth flows block enforcement claims until classified. |
| Evidence closeout | Treating logs, screenshots, provider responses, or transcripts as evidence when they contain secrets, tokens, cookies, raw HTML bodies, OTP/TOTP material, QR payloads, recovery codes, or raw provider bodies. | Redaction guard plus Fairway evidence closeout review. | Governance owns evidence governance; security owns redaction boundary; ops owns artifact production. | Sanitized artifact paths, redaction result, evidence owner, retention class, review handback, and follow-up task for gaps. | Missing or failed redaction blocks closeout and exact-window reuse. |
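The redaction guard can be sketched as a pattern scan that blocks closeout on any finding. The patterns below are hypothetical and cover only a few classes (bearer tokens, JWT-shaped strings, cookie headers, `otpauth://` URIs, raw HTML); a production guard would also need rules for recovery codes, QR payloads, and raw provider bodies, and pattern matching alone cannot prove an artifact is clean.

```python
import re

# Hypothetical patterns for a few of the material classes the Evidence
# closeout row forbids. Not exhaustive.
FORBIDDEN_PATTERNS = {
    "bearer_token": re.compile(r"\bbearer\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "cookie_header": re.compile(r"^\s*(?:set-)?cookie\s*:", re.IGNORECASE | re.MULTILINE),
    "otpauth_uri": re.compile(r"otpauth://", re.IGNORECASE),
    "raw_html": re.compile(r"<html[\s>]", re.IGNORECASE),
}


def redaction_findings(text: str) -> list[str]:
    """Return the names of forbidden patterns found in one artifact."""
    return sorted(name for name, pattern in FORBIDDEN_PATTERNS.items() if pattern.search(text))


def closeout_check(artifacts: dict[str, str]) -> tuple[bool, dict[str, list[str]]]:
    """Allow closeout only if no artifact has a finding; return findings per path."""
    findings = {}
    for path, text in artifacts.items():
        hits = redaction_findings(text)
        if hits:
            findings[path] = hits
    return not findings, findings
```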

## Summary Mapping Contract

Every full UAT summary must include one result row per required matrix ID above.
The row format can be markdown, JSONL, or both, but it must preserve these
fields:

| Field | Required Meaning |
|---|---|
| `matrix_id` | One of the IDs in Required Coverage, for example `UAT-COMPUTE-001`. |
| `environment` | `kind`, `dev`, or another explicitly approved lane. |
| `status` | `passed`, `failed`, `blocked`, or `skipped`. |
| `required_level` | `required`, `conditional`, `approved_skip`, or `future_only`. |
| `evidence_path` | Durable artifact path under `dist/uat` or `.fairway/artifacts`. |
| `fairway_task` | Owning follow-up task when status is not `passed`. |
| `owner` | Product, ops, backend, frontend, security, app-developer, or architecture owner for non-pass rows. |
| `expiry` | Required for `skipped` and approved bypass rows. |
| `substitute_gate` | Required when the workflow is unsafe to run in ordinary UAT; names the reviewed gate that stands in for live execution. |
| `residual_risk` | Required for `skipped`, `blocked`, `conditional`, and substitute-gate rows. |
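A minimal Python validator for one summary row, assuming the JSONL form with one object per line. It treats `evidence_path` as mandatory on every row and reads "approved bypass rows" as `required_level: approved_skip`; both are assumptions. It checks `substitute_gate` only in one direction (a row that names one must carry `residual_risk`), because the contract does not say how a row marks itself as unsafe to run.

```python
import json

STATUSES = {"passed", "failed", "blocked", "skipped"}
LEVELS = {"required", "conditional", "approved_skip", "future_only"}
EVIDENCE_ROOTS = ("dist/uat/", ".fairway/artifacts/")
ALWAYS_REQUIRED = ("matrix_id", "environment", "status", "required_level", "evidence_path")


def parse_summary(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per non-blank line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def row_problems(row: dict) -> list[str]:
    """Return contract violations for one summary row; empty means well-formed."""
    def missing(field: str) -> bool:
        return not str(row.get(field) or "").strip()

    problems = [f"missing {f}" for f in ALWAYS_REQUIRED if missing(f)]
    status, level = row.get("status"), row.get("required_level")
    if status and status not in STATUSES:
        problems.append(f"unknown status {status!r}")
    if level and level not in LEVELS:
        problems.append(f"unknown required_level {level!r}")
    path = str(row.get("evidence_path") or "")
    if path and not path.startswith(EVIDENCE_ROOTS):
        problems.append(f"evidence_path outside dist/uat or .fairway/artifacts: {path}")
    if status in {"failed", "blocked", "skipped"}:
        problems += [f"{f} required for {status} rows" for f in ("fairway_task", "owner") if missing(f)]
    if (status == "skipped" or level == "approved_skip") and missing("expiry"):
        problems.append("expiry required for skipped or approved_skip rows")
    needs_risk = status in {"skipped", "blocked"} or level == "conditional" or not missing("substitute_gate")
    if needs_risk and missing("residual_risk"):
        problems.append("residual_risk required")
    return problems
```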

Completion rule: if any required `kind` or `dev` row is absent, `failed`,
`blocked` without an approved exception, or `skipped` without owner, reason,
and expiry, post-PSSM UAT is not complete.
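The completion rule can be evaluated over parsed summary rows with a short function. This sketch is hypothetical: `expected` and `approved_exceptions` would come from the matrix and from approved Fairway decisions, and the `reason` key is an assumption because the summary contract lists no `reason` field.

```python
def completion_blockers(expected: dict[str, set[str]],
                        rows: list[dict],
                        approved_exceptions: frozenset = frozenset()) -> list[str]:
    """List reasons post-PSSM UAT is not complete; an empty list means complete.

    `expected` maps a lane to its required matrix IDs. `approved_exceptions`
    holds (lane, matrix_id) pairs whose `blocked` status has an approved exception.
    """
    by_key = {(r.get("environment"), r.get("matrix_id")): r for r in rows}
    blockers = []
    for lane in sorted(expected):
        for matrix_id in sorted(expected[lane]):
            key = (lane, matrix_id)
            row = by_key.get(key)
            status = row.get("status") if row else None
            if row is None:
                blockers.append(f"{lane}/{matrix_id}: absent")
            elif status == "passed":
                continue
            elif status == "failed":
                blockers.append(f"{lane}/{matrix_id}: failed")
            elif status == "blocked":
                if key not in approved_exceptions:
                    blockers.append(f"{lane}/{matrix_id}: blocked without approved exception")
            elif status == "skipped":
                # `reason` is a hypothetical key; see the lead-in.
                if not all(row.get(f) for f in ("owner", "reason", "expiry")):
                    blockers.append(f"{lane}/{matrix_id}: skipped without owner/reason/expiry")
            else:
                blockers.append(f"{lane}/{matrix_id}: unrecognized status {status!r}")
    return blockers
```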

## Related Docs

- `doc/operations/Product_Quality_Flow_Coverage_Operating_Model_v1.md`
- `doc/operations/MFA_Factor_Lifecycle_UAT_Coverage_v1.md`
- `doc/operations/MFA_User_Factor_Setup_Manage_Flow_Coverage_v1.md`
- `doc/operations/Demo_UAT_Flow_Coverage_Matrix_v1.md`
- `.fairway/artifacts/post-pssm-uat-learning-gaps.md`
- `.fairway/artifacts/pssm-uat-readiness-code-sweep-2026-06-04.md`
- `scripts/ops/demo_uat_package.sh`