# IAM MFA Policy and Keycloak Enforcement v1

Status: canonical target decision for the first MFA production slice.

Date: 2026-06-03.

Owner: `IAM-MFA-ARCHITECTURE-001`.

## Purpose

Define how GPUaaS should introduce MFA without moving password or factor
verification into product code.

The security request is valid, but the first implementation must preserve the
IAM boundary:

1. Keycloak authenticates humans and enforces browser-login MFA.
2. Pomerium consumes the resulting OIDC session for browser edge access.
3. GPUaaS validates tokens, resolves memberships, applies product
   authorization, and exposes MFA posture/readiness.
4. Service accounts and API keys do not use human MFA; they use scoped
   credentials, rotation, rate limits, and audit.

## Current State

1. `GET /api/v1/v3/account/security` already has an MFA posture contract with
   `totp_enabled` and `webauthn_enabled`.
2. The current backend read model returns both MFA flags as `false`.
3. The V3 account security UI displays MFA posture, but it does not provide
   enrollment, reset, or enforcement flows.
4. Local/dev Keycloak bootstrap scripts create dev users with no required
   actions.
5. Platform IAM owns memberships, roles, scopes, service accounts, API keys,
   audit, and product authorization. Keycloak is not the source of product IAM
   truth.

## Development-Stage Operating Posture

MFA is an early product feature in GPUaaS, not yet a production control
attestation program. During feature development, the work should be driven by
normal product engineering evidence:

1. Keycloak configuration and GPUaaS read models are implemented from checked-in
   contracts and runbooks.
2. Reproducible scripts and tests prove setup/readback, browser login behavior,
   non-human exclusions, UI posture, and rollback paths.
3. UAT/e2e coverage proves user, admin, and ops flows.
4. Claims stay modest: provider-enforced browser MFA for named privileged
   humans, with sensitive-operation step-up and external compliance explicitly
   deferred until separately proven.

Formal CISO-style controls such as exact production windows, multi-approver
packets, custody evidence, external compliance mapping, and broad control
attestation are release and claim gates. They should be introduced when moving
to production, handling credentials/break-glass, enforcing sensitive
operations, or making customer/compliance claims, not as the default loop for
every development fix.

## Decision

Use provider-enforced human MFA as the first production model.

1. Keycloak is the enforcement authority for human login MFA.
2. GPUaaS does not verify TOTP, WebAuthn, or recovery factors directly.
3. GPUaaS exposes MFA posture and policy state through platform IAM read models.
4. GPUaaS may require an MFA-authenticated session for sensitive operations only
   when the session contains a trustworthy MFA signal, such as `amr` or `acr`.
5. If Keycloak does not emit a trustworthy MFA signal in the current realm,
   first-slice enforcement remains at the Keycloak authentication-flow boundary
   and sensitive-operation claim checks are deferred.

## Browser Flow Enforcement Decision

First production enforcement model:

```text
Keycloak browser flow with Conditional OTP for a platform-admin/ops MFA group
```

Use a Keycloak group or equivalent realm role such as
`gpuaas-platform-mfa-required` as the enforcement selector. Membership in that
selector is managed by the platform/IAM operations process for users who hold
`platform_superadmin`, `platform_admin`, or `platform_ops` authority. This avoids
realm-wide MFA for normal users while keeping enforcement at the identity
provider boundary.

Selected option:

| Option | Decision | Reason |
|---|---|---|
| Realm-wide conditional OTP | Not first slice | Too broad for normal users and tenant personas before enrollment/reset support is proven. |
| Role/group-scoped conditional OTP | Selected | Targets platform admin/ops humans, keeps service accounts/API keys out of MFA, and uses Keycloak-native enforcement. |
| Separate admin client or flow | Defer | Useful later if admin surfaces move to a separate client or Pomerium app, but not required for first enforcement. |
| Managed IdP equivalent | Acceptable later | Enterprise IdP can replace Keycloak-native factors if it emits equivalent assurance and the rollout/read-model contract is preserved. |

Operational rule:

1. Platform admin/ops human users must be placed in the MFA-required Keycloak
   selector before production admin/ops access.
2. Service accounts, API keys, product runtime credentials, and automation
   credentials must not be placed in the human MFA selector.
3. Direct-grant/password-token flows for production platform admin/ops users are
   blocked or disabled. Non-production direct-grant personas may remain exempt
   only in dev/kind/test profiles and must not be treated as production
   evidence.
4. GPUaaS sensitive-operation MFA gates remain deferred until reliable `amr`,
   `acr`, or equivalent MFA claim evidence is proven through Keycloak, Pomerium,
   and API JWT validation.
5. Break-glass access remains time-bounded and audited; it must not become a
   standing exclusion from the MFA selector.

Human admin/ops CLI access is production admin/ops access. It must use browser
OIDC with PKCE, or a future reviewed OAuth device-code handoff, so that the
human operator passes through the same Keycloak or brokered IdP MFA boundary.
Long-lived local bearer tokens, copied browser tokens, direct-grant/password-token
flows, service-account tokens, API keys, and client credentials are not
acceptable substitutes for a human privileged CLI session.

Implementation note: if platform roles are not yet mirrored into Keycloak roles,
the first slice should use an explicit Keycloak group maintained by an ops
runbook or reconciliation task. The platform read model should still report the
effective relationship between platform role, Keycloak MFA selector membership,
and enforcement posture so operators can see drift.
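The drift visibility described in the implementation note could be computed with a plain set comparison. The following is a minimal sketch, assuming the read model can obtain platform role holders and Keycloak selector members as in-memory sets; `selector_drift`, `SelectorDrift`, and the field names are hypothetical and not existing GPUaaS APIs.

```python
from dataclasses import dataclass

# Platform roles that require the MFA selector, as named in this document.
MFA_REQUIRED_ROLES = frozenset(
    {"platform_superadmin", "platform_admin", "platform_ops"}
)


@dataclass(frozen=True)
class SelectorDrift:
    missing_from_selector: frozenset   # privileged humans not in the selector
    unexpected_in_selector: frozenset  # selector members without a privileged role


def selector_drift(platform_roles: dict, selector_members: set) -> SelectorDrift:
    """Compare platform role holders with MFA selector membership.

    platform_roles maps a human user id to the set of platform roles held.
    selector_members is the set of user ids in the Keycloak selector
    (for example the `gpuaas-platform-mfa-required` group).
    """
    privileged = {
        user
        for user, roles in platform_roles.items()
        if MFA_REQUIRED_ROLES & set(roles)
    }
    return SelectorDrift(
        missing_from_selector=frozenset(privileged - selector_members),
        unexpected_in_selector=frozenset(selector_members - privileged),
    )


# Illustrative data only: bob holds platform_ops but is not in the selector.
drift = selector_drift(
    {"alice": {"platform_admin"}, "bob": {"platform_ops"}, "carol": set()},
    {"alice", "carol"},
)
```

A non-empty `missing_from_selector` is the case that matters most for enforcement, because it names privileged humans who would not be challenged by the Keycloak flow.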

## Rollout Policy

Initial enforcement:

| Subject | MFA requirement |
|---|---|
| `platform_superadmin` | Required before production admin access. |
| `platform_admin` | Required before production admin access. |
| `platform_ops` | Required before production ops access. |
| Tenant/org owners and admins | Planned next; can be enabled per organization after support/reset flows exist. |
| Project admins and members | Optional in first slice unless organization policy requires it. |
| Normal users | Optional in first slice. |
| Service accounts/API keys | Not applicable; use credential custody, expiry, rotation, scopes, rate limits, and audit. |

Tenant/admin rollout and recovery semantics are defined separately in
`doc/architecture/IAM_MFA_Tenant_Admin_Rollout_Recovery_Model_v1.md`.
The first platform-admin slice does not imply tenant/admin MFA readiness.
Tenant/admin enforcement must declare whether recovery is customer-managed
federation, tenant-admin-approved, or platform-assisted before broad rollout.

MFA capabilities:

1. TOTP is the first supported factor for production enforcement.
2. WebAuthn/passkeys are a follow-up capability.
3. Recovery and reset flows must exist before broad tenant/org-admin
   enforcement.

### Platform Superadmin Factor Requirement

`platform_superadmin` requires phishing-resistant MFA for production access.
Acceptable first-class evidence is WebAuthn/passkey, hardware security key, or
an equivalent provider-managed phishing-resistant factor whose issuer,
audience, freshness, custody, recovery, and revocation semantics are reviewed.

TOTP-only superadmin MFA is allowed only as a transitional non-production or
time-bounded exception posture. A TOTP-only exception must name:

1. environment and account scope;
2. owner and approver;
3. expiry or sunset date;
4. compensating controls, such as stricter session lifetime and monitoring;
5. migration path to phishing-resistant MFA;
6. evidence that the exception is not used as a production readiness claim.

`platform_admin` and `platform_ops` may use TOTP for the first production slice
when the approved rollout packet says so, but any `platform_superadmin`
production-readiness claim remains blocked until phishing-resistant posture or a
reviewed exception is recorded. This section is a policy/model requirement only;
it does not mutate Keycloak, enroll factors, reset credentials, or authorize
break-glass.
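The six fields a TOTP-only exception must name can be checked mechanically before a packet is reviewed. This is a minimal sketch under stated assumptions: the `TotpOnlyException` record, its field names, and `exception_problems` are hypothetical, and the check only confirms the fields are present and unexpired, not that a reviewer has approved them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TotpOnlyException:
    environment: str
    account_scope: str
    owner: str
    approver: str
    expiry: date
    compensating_controls: tuple
    migration_path: str
    # True when evidence shows the exception is not used as a readiness claim.
    not_production_readiness_evidence: bool


def exception_problems(exc: TotpOnlyException, today: date) -> list:
    """Return the reasons an exception record is incomplete or no longer valid."""
    problems = []
    for name in ("environment", "account_scope", "owner", "approver",
                 "migration_path"):
        if not getattr(exc, name).strip():
            problems.append(f"missing {name}")
    if exc.expiry <= today:
        problems.append("expired")
    if not exc.compensating_controls:
        problems.append("missing compensating_controls")
    if not exc.not_production_readiness_evidence:
        problems.append("used as production readiness claim")
    return problems
```

An empty result means the record is structurally complete; it does not replace the reviewed exception or change the blocked state of any `platform_superadmin` production-readiness claim.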

## Compliance Mapping And Non-Claims

The MFA design is mapped to external compliance control families in
`doc/governance/IAM_MFA_Compliance_Frame_Mapping_v1.md`.

That mapping is a governance readiness artifact, not an attestation. It does
not assert SOC 2, ISO 27001, UAE ISR/NESA, FedRAMP/M-22-09, PCI, or
customer-specific compliance. Production-quality compliance claims remain
blocked until the mapped Fairway hardening tasks close with reviewed evidence,
including privileged phishing-resistant MFA, `amr`/`acr` claim proof,
sensitive-operation/audit cataloging, reset proofing, retention mapping,
factor drift evidence, and recovery controls.

## Policy Shape

MFA policy should be represented as platform policy/IAM state, not hardcoded UI
logic.

Minimum policy dimensions:

1. required by platform role, for example `platform_superadmin`,
   `platform_admin`, `platform_ops`;
2. required by organization;
3. optional grace period for rollout;
4. explicit break-glass exemption with expiry, reason, actor, and audit;
5. factor policy, initially `totp`, later `webauthn`.

Policy examples:

```text
auth.mfa.required_platform_roles = platform_superadmin,platform_admin,platform_ops
auth.mfa.required_org_ids = <org ids or policy rows>
auth.mfa.grace_period_seconds = <optional rollout grace>
auth.mfa.allowed_factors = totp
```

The exact storage can be policy-table backed or Keycloak-flow backed for the
first slice, but the platform read model must make the effective posture visible
to operators.
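To illustrate how the policy keys above might be turned into an effective-posture decision, here is a minimal sketch. It assumes the keys arrive as `key = value` text with comma-separated lists; `MfaPolicy`, `parse_mfa_policy`, and `mfa_required` are hypothetical names, and grace-period and break-glass handling are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MfaPolicy:
    required_platform_roles: frozenset
    required_org_ids: frozenset
    grace_period_seconds: Optional[int]
    allowed_factors: frozenset


def _csv(value: str) -> frozenset:
    return frozenset(item.strip() for item in value.split(",") if item.strip())


def parse_mfa_policy(text: str) -> MfaPolicy:
    """Parse `auth.mfa.*` key/value lines into a policy object."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key = value line: {line!r}")
        values[key.strip()] = value.strip()
    grace = values.get("auth.mfa.grace_period_seconds", "")
    return MfaPolicy(
        required_platform_roles=_csv(values.get("auth.mfa.required_platform_roles", "")),
        required_org_ids=_csv(values.get("auth.mfa.required_org_ids", "")),
        grace_period_seconds=int(grace) if grace else None,
        allowed_factors=_csv(values.get("auth.mfa.allowed_factors", "")),
    )


def mfa_required(policy: MfaPolicy, platform_roles, org_id=None) -> bool:
    """True when a platform role or the organization puts the user in scope."""
    if policy.required_platform_roles & set(platform_roles):
        return True
    return org_id is not None and org_id in policy.required_org_ids
```

Keeping the decision in one function that reads policy state, rather than in UI logic, is what lets the read model report the same effective posture that enforcement uses.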

## Rollout Preview And Cutover Model

Privileged-human MFA rollout must use explicit rollout states. A live drill can
prove drill mechanics, but it does not by itself approve a production rollout
slice unless the packet says so and includes the production rollout evidence
below.

Rollout states:

| State | Purpose | Required evidence | Exit criteria |
|---|---|---|---|
| `notification` | Tell affected admin/ops users that MFA enrollment is coming. | target population, communication owner, support path, deadline, exception intake path | affected users have notification evidence or approved exception |
| `soft_launch_preview` | Let selected users enroll and exercise MFA without broad lockout risk. | selected cohort, Keycloak flow/client scope, account/security read-model posture, support owner | cohort can enroll, sign in, reset/recover, and rollback in non-prod/prod-like evidence |
| `enrollment_monitoring` | Track readiness before enforcement cutover. | enrollment counts, unenrolled exceptions, failed enrollment/reset events, break-glass posture, support queue health | required population is enrolled or has approved time-bounded exception |
| `cutover_ready` | Confirm enforcement can be enabled for the target slice. | successful drill or equivalent proof, rollback proof, break-glass owner/approver/expiry/factor custody, UAT/ops coverage, no unresolved no-go findings | Architecture Control/user approves exact cutover packet |
| `enforced` | MFA is required for the approved human population. | before/after readback, audit/change evidence, login smoke, non-human exclusion proof | monitoring remains healthy through the observation window |
| `rollback` | Restore access when enforcement causes outage or unsupported client behavior. | failure scope, rollback owner, before/after flow readback, normal access proof, follow-up task | affected access restored and residual risk accepted |
| `exception_active` | Track named users/accounts temporarily outside enforcement. | owner, approver, reason, expiry, compensating control, review date | exception expires, is renewed by approval, or user is enrolled |

Cutover criteria for a production or production-like privileged slice:

1. rollout state is at least `cutover_ready`;
2. target population and non-human exclusions are named;
3. provider enrollment/readback evidence exists or an explicit waiver is
   approved;
4. break-glass posture is approved, including owner, approver, expiry, factor
   custody, rollback/readback evidence, and two-person control where required;
5. rollback procedure is rehearsed or has equivalent non-live evidence;
6. UAT/ops coverage references the active rollout state and expected evidence;
7. no open finding would make the packet a live-drill no-go.

Live drill packets must declare one of these packet intents:

1. `drill_mechanics_only`: proves isolated drill mechanics, admin tooling,
   rollback, evidence capture, and browser-flow behavior. It does not authorize
   production rollout or production population changes.
2. `rollout_preview_slice`: enables a named non-production or approved
   production-like preview cohort. It requires notification, support owner,
   enrollment monitoring, rollback, and exception handling evidence.
3. `production_cutover_slice`: enables enforcement for a named production
   population. It requires all cutover criteria, explicit Architecture
   Control/user authorization, and security/ops/governance approval.

Packets that omit intent are treated as `drill_mechanics_only` and cannot be
used as production rollout approval.
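The packet-intent default and the cutover criteria can be expressed as a small fail-safe check. This is a sketch under assumptions: the packet is modeled as a dictionary, the criterion labels and the `criteria_met` and `authorized` fields are hypothetical, and unrecognized intent values are treated the same as an omitted intent, which is an assumption beyond what the text above states.

```python
PACKET_INTENTS = (
    "drill_mechanics_only",
    "rollout_preview_slice",
    "production_cutover_slice",
)

# Hypothetical labels for the seven cutover criteria listed above.
CUTOVER_CRITERIA = (
    "state_cutover_ready",
    "population_and_exclusions_named",
    "enrollment_readback_or_waiver",
    "break_glass_approved",
    "rollback_rehearsed",
    "uat_ops_coverage",
    "no_open_no_go_findings",
)


def effective_intent(packet: dict) -> str:
    """Packets without a recognized intent are drill-mechanics-only."""
    intent = packet.get("intent")
    return intent if intent in PACKET_INTENTS else "drill_mechanics_only"


def may_approve_production_cutover(packet: dict) -> bool:
    """True only for an authorized cutover packet meeting every criterion."""
    if effective_intent(packet) != "production_cutover_slice":
        return False
    met = set(packet.get("criteria_met", ()))
    if not all(criterion in met for criterion in CUTOVER_CRITERIA):
        return False
    return packet.get("authorized") is True
```

The check returns `False` whenever anything is absent, so a drill packet can never be read as production rollout approval by omission.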

## Session and Claim Contract

Status: contract defined, runtime acceptance deferred.

The completed disposable preflight and isolated live drill prove the browser
flow can challenge the intended human admin/ops personas and leave the source
realm healthy. They do not prove that GPUaaS can safely authorize sensitive
operations from token/session claims. Until non-live token evidence proves the
claim path end to end, sensitive-operation MFA gates remain blocked.

Accepted first-slice claim contract, when proven:

1. Token source: a locally validated JWT already accepted by the normal GPUaaS
   API JWT path. The API must not call Keycloak per request to infer MFA.
2. Issuer boundary: `iss` must exactly match the configured source realm issuer
   for the current environment. Drill, disposable, cross-realm, tenant-federated,
   or unknown issuers are not accepted for production sensitive-operation gates.
3. Audience/client boundary: token audience or authorized party must match an
   approved GPUaaS browser/API client for human sessions. Service accounts,
   client credentials, API keys, direct-grant/dev-token flows, automation
   credentials, and break-glass sessions are not human MFA proof.
4. MFA assurance: token must contain one reviewed MFA signal:
   - `amr` array containing an approved MFA method value such as `otp`, `totp`,
     `mfa`, `webauthn`, or `hwk`, only after evidence proves the provider emits
     that value after the MFA step; or
   - `acr` string equal to an approved MFA assurance value; or
   - a separately reviewed equivalent claim with issuer, audience, freshness,
     and replay semantics documented.
5. Token location: accepted claims must be present in the access token used for
   API authorization, or in an edge/session assertion that the API validates
   locally with equivalent issuer/audience/signature checks. ID-token-only or
   UI-only evidence is not enough for API gates.
6. Freshness: token must include trusted `auth_time` or an equivalent
   authentication timestamp and satisfy:

   ```text
   now - auth_time <= 15 minutes
   ```

7. Refresh behavior: refreshed access tokens may preserve MFA assurance only
   when non-live evidence proves `amr`/`acr` and `auth_time` remain bound to the
   original MFA-authenticated session and are not silently reset to refresh time
   without MFA. If refresh semantics are ambiguous, refreshed tokens are not
   accepted for sensitive-operation step-up.
8. Pomerium/browser edge behavior: if browser API traffic depends on Pomerium or
   another edge session, evidence must prove the relevant claims are preserved
   into the locally validated API token or edge assertion without raw token
   persistence.
9. Fail-closed behavior: missing, stale, untrusted, issuer-mismatched,
   audience-mismatched, non-human, or ambiguous claims deny the sensitive
   operation or keep the gate unavailable. They must not fall back to group
   membership, account-security read-model posture, browser UI state, prior live
   drill success, or per-request provider lookup.
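The issuer, audience, assurance, freshness, and fail-closed rules above can be sketched as one local evaluation over claims that the normal JWT path has already signature-validated. Everything here is illustrative: `evaluate_mfa_claims`, the `flow_class` label, and the reason-class strings are hypothetical, the approved `amr`/`acr` values must come from reviewed evidence, and the sketch does not address refresh semantics. It does not change the current deferred state of the gate.

```python
import time

MAX_AUTH_AGE_SECONDS = 15 * 60  # freshness bound from the contract above


def evaluate_mfa_claims(claims, *, issuer, approved_clients, approved_amr,
                        approved_acr, flow_class, now=None):
    """Return (allowed, reason_class) for already signature-validated claims.

    Every missing, mismatched, or ambiguous value denies. Nothing falls back
    to group membership, UI posture, or a per-request provider lookup.
    """
    now = time.time() if now is None else now
    if flow_class != "human_browser":  # hypothetical flow-class label
        return False, "non_human_flow"
    if claims.get("iss") != issuer:
        return False, "issuer_mismatch"

    # `aud` may be a string or a list; `azp` is the authorized party.
    aud = claims.get("aud")
    audiences = [aud] if isinstance(aud, str) else aud if isinstance(aud, list) else []
    candidates = [c for c in [*audiences, claims.get("azp")] if isinstance(c, str)]
    if not any(c in approved_clients for c in candidates):
        return False, "audience_mismatch"

    amr = claims.get("amr")
    amr_ok = isinstance(amr, list) and any(
        isinstance(m, str) and m in approved_amr for m in amr
    )
    acr = claims.get("acr")
    acr_ok = isinstance(acr, str) and acr in approved_acr
    if not (amr_ok or acr_ok):
        return False, "missing_mfa_signal"

    auth_time = claims.get("auth_time")
    if isinstance(auth_time, bool) or not isinstance(auth_time, (int, float)):
        return False, "missing_auth_time"
    if auth_time > now or now - auth_time > MAX_AUTH_AGE_SECONDS:
        return False, "stale_auth_time"
    return True, "mfa_session_fresh"
```

Returning a reason class rather than raw claim values keeps the result usable in audit metadata without persisting token content.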

Current accepted production-sensitive-operation state:

```text
claim_contract_state = deferred_no_reliable_claim_signal
sensitive_operation_gate_state = blocked
```

The deferral is intentional. The bounded non-live proof
`HARNESS-IAM-MFA-NONLIVE-CLAIM-CONTRACT-PROOF-001` completed with
`final_recommendation=keep_deferral`: current Keycloak tokens did not expose a
reliable local `amr`, `acr`, or fresh `auth_time` contract for admin/ops
sessions, and refresh behavior was classified unsafe for step-up.

Proof artifact:
`/Users/subash/dev/GPUasService/.fairway/artifacts/harness-iam-mfa-nonlive-claim-contract-proof-20260615/kind-sanitized-claim-collection-20260615T135201Z/claim-contract-proof-closeout.md`.

This result does not block the first production slice when the slice is limited
to provider-enforced browser MFA. It does block any claim that GPUaaS locally
enforces sensitive-operation MFA step-up from current token claims.

Required evidence before `IAM-MFA-SENSITIVE-OPS-GATE-001` can start (item 1 or
item 2, plus all of items 3 through 10):

1. a new provider/edge assurance source that emits reviewed MFA evidence into a
   locally trusted token or assertion; or
2. a dedicated step-up design such as a separate admin client/flow with fresh
   login and an API-verifiable handoff; and
3. source-realm issuer/JWKS/client readback with sanitized IDs only;
4. admin/ops/normal/non-human token or edge-assertion shape from a non-live or
   explicitly approved production-like proof path;
5. claim presence/absence matrix for `amr`, `acr`, `auth_time`, `iss`, `aud`,
   `azp`, `sub`, and token flow class;
6. refresh-token behavior proof showing whether MFA assurance and `auth_time`
   survive refresh safely;
7. direct-grant, service-account, API-key, client-credential, automation, and
   break-glass exclusion proof;
8. Pomerium or edge-preservation proof if browser sessions traverse that edge;
9. redaction proof that raw access tokens, refresh tokens, ID tokens, cookies,
   authorization headers, client secrets, credentials, TOTP material, private
   keys, and full JWT bodies are not persisted;
10. explicit recommendation: accept claim contract, keep deferral, or move to
    an alternate design.

Multirealm, multi-region, and federated assurance semantics are defined in
`doc/architecture/IAM_MFA_Multirealm_Federation_Semantics_v1.md`. The first
production slice is single realm/region. Cross-realm or external IdP `amr`,
`acr`, AuthnContextClassRef, or equivalent assurance is rejected as MFA evidence
until issuer/entity, audience, tenant binding, freshness, replay, revocation,
recovery, and evidence mapping are reviewed.

If reliable claims exist:

1. API may gate selected sensitive operations on MFA-authenticated sessions.
2. Failed gates return canonical `ErrorResponse` with `insufficient_permissions`
   or a future cataloged MFA-specific error if approved.
3. The gate implementation must still wait for
   `SEC-IAM-MFA-SENSITIVE-OPS-CATALOG-AUDIT-001` to name the closed operation
   catalog and audit actions.

If reliable claims do not exist:

1. Keycloak flow enforcement is the only first-slice enforcement boundary.
2. GPUaaS sensitive-operation MFA gates remain deferred.
3. The account/security read model must report posture source as provider-flow
   or unavailable rather than pretending to know factor state.
4. The alternate design task must decide whether to use provider-managed
   step-up, a dedicated admin client/flow with fresh login, or report-only
   posture until trustworthy local claims exist.

Production-readiness implication for the first slice:

```text
provider_enforced_browser_mfa_state = production_candidate
sensitive_operation_stepup_state = excluded_from_first_slice
```

Any production packet that includes sensitive-operation step-up must reference a
new accepted claim/edge/step-up proof. Packets that only claim provider-enforced
browser MFA may proceed if rollout, superadmin, break-glass, UAT, monitoring,
rollback, and exception criteria are satisfied.

## Alternate Sensitive-Operation Step-Up Design

Status: selected design direction, not implemented.

Because current Keycloak access-token evidence did not prove reliable local
`amr`, `acr`, `auth_time`, issuer, audience, and refresh semantics, GPUaaS must
not build sensitive-operation gates by inspecting today's normal access tokens.
The alternate path is a provider-neutral step-up grant issued only after a fresh
provider/edge assurance event.

Target state:

```text
alternate_stepup_design_state = provider_assertion_to_short_lived_stepup_grant
sensitive_operation_gate_state = blocked_until_grant_proof
```

The design is:

1. A cataloged sensitive operation first returns a fail-closed
   `step_up_required` response with operation family, correlation ID, and a
   server-created challenge ID. The operation does not mutate provider or
   product state at this point.
2. The browser or CLI starts a dedicated step-up flow through the configured
   assurance provider. Provider adapters may be Keycloak OIDC, Pomerium or
   another edge assertion source, an airgapped IdP, or a future enterprise IdP.
   The adapter must request fresh authentication where supported (`max_age=0`,
   `prompt=login`, `acr_values`, or provider-specific equivalent).
3. The step-up callback or edge assertion is validated locally by GPUaaS using
   configured issuer, audience/client, signature/JWKS, nonce/state, subject,
   tenant, session binding, operation family, and freshness rules. The API must
   not call the provider per sensitive-operation request.
4. On success, GPUaaS issues an internal step-up grant. The grant is short
   lived, bound to actor, browser or CLI session, tenant/project where
   applicable, operation family, provider, assurance class, challenge ID, and
   correlation ID. Default TTL is 10 minutes, with an upper bound of 15 minutes.
5. The sensitive operation consumes the grant. High-risk operations should use
   single-use grants; lower-risk operations may allow bounded reuse within the
   TTL only when the operation family and target scope match exactly.
6. Every allow, deny, expiry, replay, stale, issuer mismatch, audience mismatch,
   tenant mismatch, operation mismatch, and non-human attempt writes a sanitized
   audit event using the reserved `platform.iam.mfa.sensitive_gate.*` action
   family.
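Steps 4 and 5 of the design, issuing a short-lived bound grant and consuming it, might look like the sketch below. It is an in-memory illustration only: `StepUpGrant`, `issue_grant`, `consume_grant`, and the result strings are hypothetical, tenant/project, provider, and assurance-class binding are omitted for brevity, and a real implementation would need durable, concurrency-safe storage plus the audit events described in step 6.

```python
import secrets
from dataclasses import dataclass

DEFAULT_TTL_SECONDS = 10 * 60  # default TTL from the design above
MAX_TTL_SECONDS = 15 * 60      # upper bound from the design above


@dataclass
class StepUpGrant:
    grant_id: str
    actor: str
    session_id: str
    operation_family: str
    challenge_id: str
    expires_at: float
    single_use: bool
    consumed: bool = False


def issue_grant(actor, session_id, operation_family, challenge_id, *, now,
                ttl_seconds=DEFAULT_TTL_SECONDS, single_use=True):
    """Issue a grant after the provider assertion has been validated locally."""
    if not 0 < ttl_seconds <= MAX_TTL_SECONDS:
        raise ValueError("ttl_seconds must be in (0, 900]")
    return StepUpGrant(
        grant_id=secrets.token_urlsafe(16),
        actor=actor,
        session_id=session_id,
        operation_family=operation_family,
        challenge_id=challenge_id,
        expires_at=now + ttl_seconds,
        single_use=single_use,
    )


def consume_grant(grant, *, actor, session_id, operation_family, now):
    """Return a result class; anything other than an exact match denies."""
    if grant.actor != actor or grant.session_id != session_id:
        return "deny_binding_mismatch"
    if grant.operation_family != operation_family:
        return "deny_operation_mismatch"
    if now >= grant.expires_at:
        return "deny_expired"
    if grant.single_use and grant.consumed:
        return "deny_replay"
    grant.consumed = True
    return "allow"
```

Defaulting `single_use` to `True` matches the guidance that high-risk operations use single-use grants; bounded reuse has to be opted into per operation family.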

Required proof before implementation:

1. provider adapter proof for at least Keycloak and one provider-neutral
   fallback path, or a documented reason why only Keycloak is in first scope;
2. non-live proof that fresh assurance is distinguishable from normal login and
   token refresh;
3. negative proof for service accounts, API keys, client credentials,
   automation, direct grant, stale sessions, mismatched operation family,
   mismatched tenant/project, replay, and expired grants;
4. redaction proof for callback payloads, assertions, tokens, cookies, state,
   nonce, and grant material;
5. contract-first API and UI/CLI flow design for `step_up_required`,
   challenge start, callback/complete, and grant consumption;
6. UAT coverage for browser admin and CLI/operator paths before production use.

Disallowed shortcuts:

1. Treating Keycloak group membership, UI posture, account-security read model
   status, or a prior live drill as sensitive-operation MFA proof.
2. Calling Keycloak or another provider on every sensitive operation to infer
   assurance.
3. Accepting ID-token-only or browser-only evidence for API authorization
   without a locally validated API token or internal step-up grant.
4. Letting non-human credentials, break-glass sessions, service accounts, API
   keys, or automation satisfy human MFA step-up.
5. Making customer or compliance claims that sensitive-operation step-up is
   enforced before `sensitive_operation_gate_state` moves out of blocked state
   with reviewed runtime evidence.

## Sensitive Operation Catalog And Audit Actions

This section is the closed catalog required before
`IAM-MFA-SENSITIVE-OPS-GATE-001` can be implemented. It is a docs/model and
registry-reservation boundary only. It does not authorize token/API matrix
execution, source/prod mutation, live MFA action, or enforcement code.

Current state:

```text
sensitive_operation_catalog_state = defined_not_enforced
audit_action_registry_state = reserved_not_runtime_enforced
sensitive_operation_gate_state = blocked
```

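The catalog may be used by implementation only after a new claim, edge, or
step-up proof is accepted and backend/security review accepts the enforcement
path. The completed `HARNESS-IAM-MFA-NONLIVE-CLAIM-CONTRACT-PROOF-001` run ended
with `final_recommendation=keep_deferral` and does not satisfy this condition.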

### Operations That Require MFA Step-Up Before Future Execution

The future sensitive-operation gate must treat these as closed, reviewed
operation families. New operation families require a separate architecture and
security review before enforcement.

| Operation family | Example operations | Required actor | Required MFA evidence before allow | Audit action family | Denial behavior |
|---|---|---|---|---|---|
| Platform role elevation | Granting or revoking `platform_superadmin`, `platform_admin`, or `platform_ops`; changing equivalent privileged bindings. | Privileged human admin/ops only. | Proven local claim contract plus `auth_time` freshness. | Existing role audit plus `platform.iam.mfa.sensitive_gate.*`. | Deny before mutation; write deny audit once implementation exists. |
| MFA selector membership | Adding/removing users from the Keycloak MFA selector group or equivalent realm role; changing selector target. | Privileged human admin/ops only. | Proven local claim contract plus source-realm change packet. | `platform.iam.mfa.selector_membership.change`. | Deny before Keycloak write; no direct required-action substitute. |
| MFA policy or flow change | Changing browser flow, Conditional OTP, realm/client MFA policy, factor policy, or rollout state. | Privileged human admin/ops only. | Proven local claim contract, exact change packet, rollback owner. | `platform.iam.mfa.policy.change`. | Deny before provider write; require rollback/no-op evidence. |
| Factor reset and recovery | Requesting, approving, denying, or completing privileged human factor reset/re-enrollment. | Separate requester and approver where policy requires. | Proven local claim contract for the acting admin, plus reset proofing evidence. | `platform.iam.mfa.factor_reset.*`. | Deny or escalate before reset; preserve proofing evidence class only. |
| Break-glass lifecycle | Activating, extending, deactivating, or post-use closing break-glass access. | Named break-glass approver/operator, not routine admin self-service. | Break-glass packet approval; MFA claim gate may be unavailable during emergency but must record explicit exception. | `platform.iam.mfa.break_glass.*`. | Deny routine use; emergency use requires packet and expiry. |
| Privileged session revocation | Revoking privileged browser/CLI sessions after role/group/factor change or suspected stolen session. | Privileged human admin/ops or incident owner. | Proven local claim contract when not incident/emergency; incident packet otherwise. | `platform.iam.mfa.session.revoke`. | Deny non-emergency self-service; require incident evidence for emergency path. |
| Credential custody changes for privileged control-plane access | Creating, rotating, disabling, or emergency revoking service-account token, API-key, OIDC-client, or provider credential material that can administer IAM/MFA or production control plane. | Privileged human admin/ops only; non-human actors cannot satisfy human MFA. | Proven local claim contract for human approval and existing credential custody audit. | Existing credential audit plus `platform.iam.mfa.sensitive_gate.*`. | Deny before credential mutation; service accounts are excluded as MFA proof. |
| Destructive infrastructure break-glass | Force-delete, force-detach, emergency disable, or equivalent irreversible platform operations where existing model marks break-glass/destructive. | Privileged human admin/ops or incident owner. | Proven local claim contract for routine privileged path; incident/break-glass packet for emergency. | Existing operation audit plus `platform.iam.mfa.sensitive_gate.*`. | Deny before destructive mutation unless approved emergency packet exists. |

### Operations That Do Not Use Human MFA Step-Up

These operations must not be pulled into the human browser MFA path:

1. read-only account/security posture views and posture display that does not
   query the provider;
2. normal user allocation release, terminal token minting, and self-service
   non-privileged workflows already governed by ownership and policy checks;
3. service-account, API-key, client-credential, node-agent, worker, webhook,
   and automation flows;
4. billing deposits, refunds, and ledger corrections unless a separate finance
   privileged-action review classifies the operation as MFA sensitive;
5. disposable, drill, or non-live proof operations, which are evidence inputs
   only and never production-sensitive allow decisions.

### Reserved MFA Audit Actions

The platform registry reserves these audit actions with lifecycle `reserved`.
They are not runtime-active enforcement evidence until implementation tasks
activate them with backend/security/governance review.

| Audit action | Target type | Retention class | Required use |
|---|---|---|---|
| `platform.iam.mfa.enrollment.challenge` | `mfa_enrollment` | `security` | Record future MFA enrollment/challenge presentation without raw OTP, QR, recovery code, or provider body. |
| `platform.iam.mfa.enrollment.complete` | `mfa_enrollment` | `security` | Record successful future enrollment or re-enrollment completion. |
| `platform.iam.mfa.factor.challenge` | `mfa_factor` | `security` | Record future factor-use challenge result classes. |
| `platform.iam.mfa.factor.failure` | `mfa_factor` | `security` | Record future factor failure/lockout result classes. |
| `platform.iam.mfa.factor_reset.request` | `mfa_factor_reset` | `security` | Record reset request and proofing reference class. |
| `platform.iam.mfa.factor_reset.approve` | `mfa_factor_reset` | `security` | Record separate approval or denial of reset. |
| `platform.iam.mfa.factor_reset.complete` | `mfa_factor_reset` | `security` | Record provider reset/re-enrollment completion readback. |
| `platform.iam.mfa.policy.change` | `mfa_policy` | `release` | Record realm/client/flow/factor policy change packet and rollback reference. |
| `platform.iam.mfa.selector_membership.change` | `mfa_selector_membership` | `security` | Record MFA selector group/role membership changes. |
| `platform.iam.mfa.break_glass.activate` | `break_glass_access` | `security` | Record break-glass activation with owner, approver, expiry, and custody class. |
| `platform.iam.mfa.break_glass.extend` | `break_glass_access` | `security` | Record extension with renewed approval and expiry. |
| `platform.iam.mfa.break_glass.deactivate` | `break_glass_access` | `security` | Record deactivation and post-use readback. |
| `platform.iam.mfa.session.revoke` | `privileged_session` | `security` | Record privileged session revocation or failed revocation. |
| `platform.iam.mfa.sensitive_gate.evaluate` | `mfa_sensitive_gate` | `security` | Record future allow evaluation for cataloged sensitive operation. |
| `platform.iam.mfa.sensitive_gate.deny` | `mfa_sensitive_gate` | `security` | Record fail-closed denial with safe reason class. |

Audit metadata for these actions must never include raw tokens, cookies, OTP
codes, TOTP seeds, QR payloads, recovery codes, raw provider responses, raw
browser bodies, client secrets, private keys, or full request/response payloads.
Allowed metadata is limited to actor, target, operation family, result,
reason-class, correlation/change id, retention class, reviewer/approver
reference, expiry where applicable, and sanitized evidence artifact paths.
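
The allowlist above could be enforced with a small filter before any audit write. This is a minimal sketch, not the platform's audit code: the key names are hypothetical, since this document names the allowed metadata classes but does not define field names.

```python
# Hypothetical key names for the allowed metadata classes listed above.
ALLOWED_AUDIT_METADATA_KEYS = frozenset({
    "actor", "target", "operation_family", "result", "reason_class",
    "correlation_id", "change_id", "retention_class", "reviewer_ref",
    "approver_ref", "expires_at", "evidence_artifact_paths",
})


def sanitize_audit_metadata(metadata: dict) -> dict:
    """Keep only allowlisted keys; everything else is dropped before the write."""
    return {k: v for k, v in metadata.items() if k in ALLOWED_AUDIT_METADATA_KEYS}


def rejected_audit_keys(metadata: dict) -> list:
    """Return the sorted keys the allowlist would drop, useful in regression tests."""
    return sorted(k for k in metadata if k not in ALLOWED_AUDIT_METADATA_KEYS)
```

An allowlist is used here rather than a denylist so that a new, unreviewed field is dropped by default instead of leaking by default.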

## Privileged Admin Session, Step-Up, And Revocation Model

This section defines the target posture for human platform admin/ops sessions.
It is a security model and review gate, not proof that the current Keycloak
realm, Pomerium edge, API, or CLI already enforces every value. Runtime
implementation requires separate task evidence before production use.

### Token And Session Lifetime Targets

For human users with `platform_superadmin`, `platform_admin`, or
`platform_ops` authority:

| Control | Target posture | Notes |
|---|---|---|
| Access-token TTL | 5 minutes or less for privileged admin/ops sessions. | Short-lived bearer exposure is the primary post-MFA stolen-token containment control. |
| Refresh-token idle timeout | 30 minutes or less for privileged admin/ops sessions. | Idle sessions should require a fresh browser login or step-up after inactivity. |
| Refresh-token absolute lifetime | 8 hours or less for privileged admin/ops sessions. | Longer sessions need an explicit exception and evidence. |
| Admin browser session idle timeout | 30 minutes or less. | Applies to Keycloak/Pomerium/browser access where supported. |
| Admin browser session max lifetime | One work shift, 8 hours or less. | Longer incident sessions require a named exception and expiry. |
| CLI human admin session | Same or stricter than browser admin session. | CLI PKCE tokens must not outlive the privileged browser policy by default. |

If the current Keycloak/Pomerium/client configuration cannot express these
values separately for privileged humans, first production rollout must either:

1. use stricter realm/client defaults that satisfy privileged access; or
2. create a scoped follow-up before treating privileged session lifetime as
   compliant.
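
A readiness check against the targets in the table could look like the sketch below. The control names are hypothetical labels for the table rows, not Keycloak or Pomerium setting names, and the input is assumed to be a readback of effective values in seconds.

```python
# Upper bounds in seconds, taken from the target table above.
# Key names are illustrative labels, not provider configuration keys.
PRIVILEGED_SESSION_LIMITS_SECONDS = {
    "access_token_ttl": 5 * 60,
    "refresh_idle_timeout": 30 * 60,
    "refresh_absolute_lifetime": 8 * 60 * 60,
    "browser_idle_timeout": 30 * 60,
    "browser_max_lifetime": 8 * 60 * 60,
}


def lifetime_violations(configured: dict) -> list:
    """Return controls that are missing, non-positive, or above the target."""
    violations = []
    for name, limit in PRIVILEGED_SESSION_LIMITS_SECONDS.items():
        value = configured.get(name)
        # A missing readback is treated as a violation, not as compliance.
        if value is None or value <= 0 or value > limit:
            violations.append(name)
    return violations
```

A non-empty result would map to option 2 above: a scoped follow-up before privileged session lifetime is treated as compliant.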

Service accounts, API keys, product runtime credentials, and automation
credentials are not human MFA sessions. Their lifetimes are controlled through
scoped credentials, expiry, rotation, rate limits, custody, and audit.

### Step-Up Freshness

Sensitive admin operations may require a fresh MFA-authenticated session only
after reliable provider/token evidence exists. The step-up freshness rule is:

```text
now - auth_time <= 15 minutes
```

The API may enforce this only when all of the following are true:

1. the JWT has already been validated locally by the normal API JWT path;
2. `auth_time` or an equivalent trusted authentication timestamp is present;
3. `amr`, `acr`, or equivalent provider evidence proves MFA occurred in that
   session;
4. Pomerium or the browser edge preserves the relevant OIDC context for browser
   sessions where applicable;
5. direct-grant, service-account, API-key, and automation flows are excluded
   from human step-up decisions.

If any value is missing or untrusted, the sensitive-operation gate must fail
closed or remain unavailable. It must not infer MFA freshness from UI state,
group membership alone, cached account-security read models, or a per-request
Keycloak call.

The first sensitive-operation gate must reference this model and include test
coverage for fresh, stale, missing, and non-human token cases before it can be
treated as an enforcement control.
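
The freshness rule and its fail-closed conditions can be sketched as a pure function over already-validated claims. This is an illustration under stated assumptions: `ValidatedClaims`, `step_up_decision`, the result strings, and the set of `amr` values treated as MFA evidence are all hypothetical, and whether the realm emits reliable `amr`/`acr` claims remains an open question in this document.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STEP_UP_MAX_AGE_SECONDS = 15 * 60

# Assumption: `amr` values that would count as MFA evidence for this realm.
MFA_AMR_VALUES = frozenset({"mfa", "otp", "hwk"})


@dataclass(frozen=True)
class ValidatedClaims:
    """Claims from a JWT already validated by the normal API JWT path."""
    is_human: bool                     # False for service-account/API-key/automation
    auth_time: Optional[int] = None    # epoch seconds
    amr: Tuple[str, ...] = ()


def step_up_decision(claims: ValidatedClaims, now: int) -> str:
    """Return "allow" only when every precondition holds; otherwise fail closed."""
    if not claims.is_human:
        return "deny_non_human"
    if claims.auth_time is None:
        return "deny_missing_auth_time"
    if not MFA_AMR_VALUES.intersection(claims.amr):
        return "deny_missing_mfa_evidence"
    age = now - claims.auth_time
    if age < 0:
        # An authentication time in the future is not trusted.
        return "deny_untrusted_auth_time"
    if age > STEP_UP_MAX_AGE_SECONDS:
        return "deny_stale"
    return "allow"
```

The four required test cases (fresh, stale, missing, non-human) correspond directly to the distinct return values.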

### Concurrent Session Policy

Privileged human accounts should have a small concurrent-session limit. Target
posture:

1. `platform_superadmin`: one active privileged browser/CLI session unless an
   incident packet approves a second session with expiry;
2. `platform_admin` and `platform_ops`: no more than two active privileged
   sessions by default;
3. break-glass accounts: one active session, time-bounded, with named owner and
   approver;
4. service accounts and API keys: not counted as human sessions and governed by
   credential policy.

Until an automated concurrent-session control exists, the rollout/readiness
packet must state whether Keycloak/Pomerium can enforce the limit, whether
admin readback can detect excess sessions, and what manual revocation action is
required.
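
For admin readback, excess sessions could be detected with a lookup like the following sketch. The limits come from the target posture above; the `break_glass` label and the function names are hypothetical, and the active-session count is assumed to come from a separate provider readback.

```python
from typing import Iterable, Optional

CONCURRENT_SESSION_LIMITS = {
    "platform_superadmin": 1,
    "platform_admin": 2,
    "platform_ops": 2,
    "break_glass": 1,  # hypothetical label for break-glass accounts
}


def session_limit(roles: Iterable[str]) -> Optional[int]:
    """Strictest limit across the user's privileged roles, or None if none apply."""
    limits = [CONCURRENT_SESSION_LIMITS[r] for r in roles if r in CONCURRENT_SESSION_LIMITS]
    return min(limits) if limits else None


def excess_sessions(roles: Iterable[str], active_sessions: int) -> int:
    """Number of sessions above the limit; 0 for users with no privileged role."""
    limit = session_limit(roles)
    if limit is None:
        return 0
    return max(0, active_sessions - limit)
```

Taking the minimum across roles means a user who is both `platform_admin` and `platform_superadmin` is held to the superadmin limit.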

### Revocation On Role Or Group Change

Changing a privileged user's platform role, Keycloak MFA selector membership,
break-glass status, or admin/ops group must trigger session revocation. Target
behavior:

1. revoke the user's Keycloak realm sessions or client sessions;
2. revoke or invalidate refresh tokens for affected clients;
3. clear Pomerium/browser edge sessions where applicable;
4. clear platform-side authorization/session caches if any are introduced;
5. record audit evidence with actor, target, reason, before/after membership,
   revocation result, and correlation/change id.

If automated revocation is not yet implemented, the ops runbook must require a
manual admin kill-switch step before the change is considered complete. A role
or group removal that leaves a privileged session active is a security blocker
for production admin/ops rollout.
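
One way to make "complete" checkable is to treat each revocation step as a recorded result and block on anything that is not confirmed. The step names and result values below are hypothetical; the sketch assumes steps 3 and 4 may be explicitly recorded as `not_applicable`, while steps 1, 2, and 5 must succeed.

```python
# Hypothetical step names mirroring the five target behaviors above.
ALWAYS_REQUIRED_STEPS = ("keycloak_sessions", "refresh_tokens", "audit_record")
WHERE_APPLICABLE_STEPS = ("edge_sessions", "platform_caches")


def revocation_blockers(results: dict) -> list:
    """Return steps that prevent a role/group change from being considered complete."""
    blockers = [s for s in ALWAYS_REQUIRED_STEPS if results.get(s) != "ok"]
    # "Where applicable" steps must be recorded explicitly; silence is a blocker.
    blockers += [
        s for s in WHERE_APPLICABLE_STEPS
        if results.get(s) not in ("ok", "not_applicable")
    ]
    return blockers
```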

### Post-MFA Stolen-Session Recovery

MFA reduces credential theft risk but does not make bearer or refresh tokens
safe after compromise. Recovery from a suspected post-MFA stolen session must
use layered containment:

1. short privileged access-token TTL as above;
2. refresh-token rotation with reuse detection where supported;
3. if refresh rotation or reuse detection is not proven, create an explicit
   follow-up before production rollout and keep privileged refresh lifetimes
   conservative;
4. admin kill-switch or realm-session revoke path for the affected user;
5. Pomerium/browser edge session clear where applicable;
6. role/group/session readback after revocation;
7. incident evidence and post-incident follow-up.

Sensitive-operation gates must not be used as the only stolen-session recovery
control. Revocation and refresh-token containment remain required even when
step-up freshness is enforced.

## Account Security Read Model

`GET /api/v1/v3/account/security` should evolve from hardcoded booleans to a
provider-aware posture contract.

This is an extension of the existing V3 account security read model and
`/account/security` UI. MFA does not create a new account-security page, a new
parallel API surface, or a product-owned MFA verifier.

Minimum fields for implementation:

1. `totp_enabled`
2. `webauthn_enabled`
3. posture source: provider, token claim, manual policy, or unavailable
4. effective requirement: optional, required, grace, exempt
5. next action URL or disabled reason

Contract changes must start in `doc/api`.
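
As a non-authoritative illustration of the minimum fields, the posture could be modeled as below. Only `totp_enabled` and `webauthn_enabled` are existing contract names; every other field name and enum spelling is an assumption until the `doc/api` contract defines it.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed spellings for the posture source and effective requirement values.
POSTURE_SOURCES = ("provider", "token_claim", "manual_policy", "unavailable")
EFFECTIVE_REQUIREMENTS = ("optional", "required", "grace", "exempt")


@dataclass(frozen=True)
class MfaPosture:
    totp_enabled: bool
    webauthn_enabled: bool
    posture_source: str
    effective_requirement: str
    next_action_url: Optional[str] = None
    disabled_reason: Optional[str] = None

    def __post_init__(self) -> None:
        if self.posture_source not in POSTURE_SOURCES:
            raise ValueError("unknown posture_source")
        if self.effective_requirement not in EFFECTIVE_REQUIREMENTS:
            raise ValueError("unknown effective_requirement")
        # Field 5 is "next action URL or disabled reason": exactly one is set.
        if (self.next_action_url is None) == (self.disabled_reason is None):
            raise ValueError("set exactly one of next_action_url or disabled_reason")
```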

## UX Boundary

The user-facing account page should show:

1. current MFA state;
2. whether MFA is required for the user's role or organization;
3. action to enroll/manage MFA through Keycloak or a platform-owned redirect;
4. clear unavailable state when provider posture cannot be read.

The platform should not build a custom TOTP verifier unless Keycloak is no
longer the authentication authority.

## Ops and Support Boundary

Required runbooks:

1. enable MFA for platform admin/ops roles;
2. enroll a user;
3. recover/reset a lost factor;
4. handle break-glass access;
5. audit an MFA policy or reset event;
6. rollback enforcement without disabling all authentication;
7. revoke human admin/ops CLI sessions after logout, credential exposure, role
   removal, or incident response.

Break-glass access must be explicit, time-bounded, and audit logged.

### Privileged Factor Reset Proofing Posture

Factor reset is a privileged recovery path, not a helpdesk shortcut around MFA.
For `platform_superadmin`, `platform_admin`, and `platform_ops`, reset proofing
must be explicit before any production or production-like rollout.

Minimum posture:

1. reset is available only for a named target account and target role;
2. the requester identity is verified out of band, through a channel that is
   independent of both the failing MFA factor and the browser MFA path being
   recovered;
3. privileged resets require two-person control: a reset actor and a separate
   approver;
4. `platform_superadmin` reset requires the strongest available proofing path,
   plus Architecture Control or incident commander acknowledgement when the
   reset affects production or production-like access;
5. the proofing path records an allow, deny, or escalate decision before any
   factor is removed or re-enrollment is forced;
6. denial is the safe default when ownership, approver, proofing evidence,
   ticket/incident id, or target role is ambiguous;
7. service accounts, API keys, client credentials, automation identities, node
   identities, and runtime credentials are not human factor-reset subjects;
8. reset evidence must not contain passwords, TOTP seeds, QR payloads, recovery
   codes, WebAuthn private material, bearer tokens, cookies, raw headers, or
   raw provider response bodies.

Required reset evidence:

1. target user and privileged role;
2. requester and proofing channel class;
3. reset actor and separate approver;
4. reason, ticket, incident, or change id;
5. decision: allowed, denied, or escalated to break-glass/incident response;
6. exact Keycloak/provider action taken, if allowed;
7. re-enrollment/readback result and source realm health where applicable;
8. audit event or follow-up when platform audit cannot yet represent the
   action.

Missing proofing evidence is a no-go for production admin/ops MFA rollout,
live drill packets that claim reset readiness, and any broad tenant/admin MFA
expansion.
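
The allow/deny/escalate decision with denial as the safe default can be sketched as follows. The `ResetRequest` shape and field names are hypothetical, and the sketch assumes a production or production-like scope, so a `platform_superadmin` reset without acknowledgement is escalated rather than allowed.

```python
from dataclasses import dataclass
from typing import Optional

PRIVILEGED_ROLES = ("platform_superadmin", "platform_admin", "platform_ops")


@dataclass(frozen=True)
class ResetRequest:
    target_user: Optional[str]
    target_role: Optional[str]
    reset_actor: Optional[str]
    approver: Optional[str]
    proofing_channel_class: Optional[str]
    change_id: Optional[str]           # reason, ticket, incident, or change id
    superadmin_ack: bool = False       # Architecture Control / incident commander


def reset_decision(req: ResetRequest) -> str:
    """Return "allow", "deny", or "escalate"; ambiguity always denies."""
    required = (req.target_user, req.target_role, req.reset_actor,
                req.approver, req.proofing_channel_class, req.change_id)
    if any(not value for value in required):
        return "deny"
    if req.target_role not in PRIVILEGED_ROLES:
        return "deny"
    if req.reset_actor == req.approver:
        return "deny"  # two-person control requires a separate approver
    if req.target_role == "platform_superadmin" and not req.superadmin_ack:
        return "escalate"
    return "allow"
```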

### Break-Glass And Out-Of-Band Recovery Posture

Break-glass is not an MFA bypass class for daily operations. It is a
controlled emergency recovery path for cases where the normal Keycloak browser
MFA path, factor reset path, or operator session path cannot restore access in
time to protect the platform.

Minimum posture:

1. break-glass accounts are inventory-controlled, disabled by default where the
   provider supports it, and excluded from ordinary admin rotations;
2. each break-glass account has a named owner, named approver, expiry, and
   activation reason before use;
3. phishing-resistant factor custody is required for production break-glass
   accounts where supported by the IdP; otherwise the accepted factor type and
   compensating controls must be recorded before rollout;
4. activation requires two-person control for production or production-like
   environments unless Architecture Control records an incident waiver;
5. credential and factor material must be stored outside the same Keycloak path
   that is being recovered, using an approved secrets custody process;
6. activation, use, deactivation, rollback/readback, and post-incident review
   evidence are mandatory;
7. break-glass access must expire or be disabled after use, and any continued
   access requires a new approved exception.

Out-of-band recovery must not depend solely on the same broken Keycloak browser
MFA path. The recovery model must preserve at least one reviewed path for:

1. proving the operator identity outside the failing browser-flow path;
2. reaching Keycloak admin tooling or an equivalent IdP recovery surface;
3. restoring normal admin/ops MFA access;
4. proving source realm health and rollback after recovery;
5. recording platform or incident audit evidence even when Keycloak admin event
   visibility is degraded.

For the first production slice, sensitive-operation MFA gates and live drill
packets must treat missing break-glass owner, approver, expiry, factor custody,
or rollback/readback evidence as a no-go condition.
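
The no-go condition above reduces to a presence check over the packet's evidence. A minimal sketch, assuming a dictionary-shaped packet with hypothetical key names:

```python
# Hypothetical keys for the evidence classes named in the no-go rule above.
BREAK_GLASS_REQUIRED_EVIDENCE = (
    "owner", "approver", "expiry", "factor_custody", "rollback_readback",
)


def break_glass_no_go_reasons(packet: dict) -> list:
    """Return the evidence classes that are missing or empty; non-empty means no-go."""
    return [field for field in BREAK_GLASS_REQUIRED_EVIDENCE if not packet.get(field)]
```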

## MFA Factor Inventory And Drift Read Model

Status/Ops needs a platform-level read model that answers a different question
than the personal `GET /api/v1/v3/account/security` endpoint. The account
endpoint reports the caller's current-session MFA posture and must continue to
avoid replaying token-derived assurance from cache. The factor inventory/drift
model is an operator read model for privileged human accounts and must not be
implemented as a per-request Keycloak query.

### Operator Question

The first slice must let platform operators answer:

1. which human users hold privileged platform roles;
2. whether those users are members of the MFA-required Keycloak selector, such
   as `gpuaas-platform-mfa-required`;
3. whether provider evidence shows at least one enrolled factor;
4. whether provider evidence shows a phishing-resistant factor such as WebAuthn
   or passkey;
5. which accounts are stale, unqueried, pending, or errored so they are not
   accidentally treated as compliant.

Service accounts, API keys, automation identities, node identities, and runtime
credentials are out of scope for human MFA factor inventory. They should appear
only as non-human exclusions in aggregate counts or evidence notes, never as
factor rows.

### Proposed API Contract

Contract work must start in `doc/api` before code implementation. The proposed
first read-only surface is:

```text
GET /api/v1/v3/platform/iam/mfa-factor-drift
```

Access:

1. platform admin/ops only;
2. no mutation, no Keycloak write, and no browser-flow side effect;
3. no per-request provider call unless a separately approved provider snapshot
   mode is added later with explicit rate, cache, timeout, and redaction
   controls.

Request filters:

| Field | Purpose |
|---|---|
| `role` | Optional privileged role filter, initially `platform_superadmin`, `platform_admin`, or `platform_ops`. |
| `drift_state` | Optional drift filter, for example `selector_missing`, `no_factor`, `totp_only`, `stale`, or `error`. |
| `factor_state` | Optional factor-state filter. |
| `selector_state` | Optional MFA-required selector membership filter. |
| `cursor` / `page_size` | Standard paginated operator inventory shape. |

Response shape:

```json
{
  "summary": {
    "privileged_human_users": 0,
    "selector_missing": 0,
    "no_factor": 0,
    "totp_only": 0,
    "webauthn_present": 0,
    "phishing_resistant_missing": 0,
    "stale": 0,
    "error": 0,
    "provider_unqueried": 0,
    "provider_pending": 0
  },
  "rows": [],
  "evidence": {},
  "operations": [],
  "meta": {}
}
```

Each row should expose only operator-safe identity and posture fields:

| Field | Meaning |
|---|---|
| `user_id` | GPUaaS platform user id. |
| `username` / `display_name` / `email` | Redacted or omitted according to existing platform IAM operator rules. |
| `platform_roles` | Platform roles that make the user privileged. |
| `human_identity` | `true` only for human users; service accounts and API keys are excluded. |
| `selector_state` | `member`, `missing`, `not_applicable`, `provider_unqueried`, `provider_pending`, or `error`. |
| `factor_state` | See state table below. |
| `phishing_resistant_state` | `present`, `missing`, `provider_unqueried`, `provider_pending`, `stale`, or `error`. |
| `drift_state` | Highest-severity drift classification for the row. |
| `evidence_source` | `provider_snapshot`, `provider_pending`, `manual_policy`, `token_claim`, or `unavailable`. |
| `evidence_collected_at` | Time the sanitized provider snapshot was produced, if any. |
| `stale_after` | Time after which the row must be treated as stale. |
| `error_class` | Sanitized provider/read-model error class only; no provider response body. |
| `recommended_action` | Operator action such as add selector membership, enroll factor, require WebAuthn, refresh snapshot, or investigate provider error. |

The read model may include `operations` for future operator actions, but those
operations must be disabled in the first slice unless a separate mutation
contract, audit action, idempotency policy, and Keycloak write packet are
approved.

### State Semantics

The factor inventory must distinguish unknown, pending, non-compliant, and
compliant states explicitly:

| State | Meaning | Compliance use |
|---|---|---|
| `provider_unqueried` | No approved provider snapshot has queried this user/factor relationship. | Not proof of compliance. |
| `provider_pending` | Snapshot or reconciliation is scheduled/running but not complete. | Not proof of compliance. |
| `no_factor` | Provider snapshot found no enrolled MFA factor for a privileged human user. | Drift/blocker. |
| `totp_only` | Provider snapshot found OTP/TOTP but no phishing-resistant factor. | Acceptable only when policy allows TOTP; drift when WebAuthn/passkey is required. |
| `webauthn_present` | Provider snapshot found at least one WebAuthn/passkey or equivalent phishing-resistant factor. | Satisfies phishing-resistant evidence when snapshot is fresh. |
| `stale` | Evidence exists but is older than the configured freshness window or source realm changed after collection. | Not current proof; refresh required. |
| `error` | Provider snapshot/read-model failed with a sanitized error class. | Not proof of compliance; investigate. |

Selector membership and factor state are separate axes. A platform admin can be
in the MFA-required selector but still have `no_factor`; another can have a
factor but be missing selector membership. Both are drift.

Drift classification should use the highest-severity state that applies to a
row, in this order from highest to lowest:

1. `error`
2. `stale`
3. `selector_missing`
4. `no_factor`
5. `phishing_resistant_missing`
6. `totp_only`
7. `provider_pending`
8. `provider_unqueried`
9. `compliant`
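
Picking the row's `drift_state` from this ordering can be sketched as a first-match scan. The function name and the idea of passing the set of conditions that apply to a row are assumptions for illustration; falling back to `provider_unqueried` when nothing is known follows the rule that compliance is never inferred.

```python
# Highest severity first, matching the list above.
DRIFT_SEVERITY_ORDER = (
    "error", "stale", "selector_missing", "no_factor",
    "phishing_resistant_missing", "totp_only",
    "provider_pending", "provider_unqueried", "compliant",
)


def classify_drift(conditions: set) -> str:
    """Return the highest-severity state present in `conditions`."""
    for state in DRIFT_SEVERITY_ORDER:
        if state in conditions:
            return state
    # No recognized evidence at all: never report inferred compliance.
    return "provider_unqueried"
```

For example, a row that is both `totp_only` and `selector_missing` is reported as `selector_missing`.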

### Data Sources And Freshness

Initial implementation must read from platform IAM tables plus an explicitly
approved sanitized provider snapshot table or evidence bundle. It must not call
Keycloak on every operator request.

Required inputs before code:

1. platform role source for `platform_superadmin`, `platform_admin`, and
   `platform_ops`;
2. mapping from platform user to provider subject/username;
3. MFA-required selector id/name and last verified readback;
4. sanitized provider factor inventory snapshot, with only factor type classes
   and timestamps;
5. snapshot freshness policy, for example `auth.mfa.factor_inventory_freshness_seconds`;
6. non-human exclusion evidence for service accounts/API keys.

The provider snapshot collector is a separate follow-up. It must be approved by
security, ops, architecture, and backend before it queries Keycloak factor
surfaces. Until that exists, the read model must return `provider_unqueried` or
`provider_pending`, not inferred compliance.
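
Freshness evaluation for a snapshot row could follow the sketch below. It assumes epoch-second timestamps, a freshness window supplied from a setting such as `auth.mfa.factor_inventory_freshness_seconds`, and a hypothetical `realm_changed_at` input for the "source realm changed after collection" case.

```python
from typing import Optional


def evidence_freshness(
    collected_at: Optional[int],
    now: int,
    freshness_seconds: int,
    realm_changed_at: Optional[int] = None,
) -> str:
    """Return "provider_unqueried", "stale", or "fresh" for one evidence row."""
    if collected_at is None:
        # No snapshot exists yet, so nothing can be treated as current proof.
        return "provider_unqueried"
    if now - collected_at > freshness_seconds:
        return "stale"
    if realm_changed_at is not None and realm_changed_at > collected_at:
        return "stale"
    return "fresh"
```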

### Evidence And Logging Boundary

The read model and any future collector must never log, persist, or return:

1. TOTP secrets, seed values, QR payloads, recovery codes, or WebAuthn
   credential public-key material; a stable non-secret credential
   id/fingerprint class may be kept only if explicitly approved;
2. Keycloak admin tokens, bearer tokens, cookies, raw headers, raw provider
   response bodies, passwords, client secrets, private keys, or OTP values;
3. raw browser page bodies or screenshots from MFA enrollment/challenge flows.

Allowed evidence:

1. factor type class: `otp`, `totp`, `webauthn`, `passkey`,
   `recovery_code_present`, or `unknown_provider_factor`;
2. count classes such as `none`, `one`, `multiple`;
3. timestamps and freshness classes;
4. provider subject id or username only if already part of platform IAM operator
   inventory and permitted by the platform IAM redaction rules;
5. sanitized error class, HTTP status class, and correlation id;
6. evidence artifact path and checksum for the sanitized snapshot.

All logs must use a correlation id and sanitized classes. Classify provider
response parsing errors as `upstream_error`, Keycloak unavailability as
`service_unavailable`, and use `internal_error` only when the local parser or
read-model code is defective. Add regression tests for each new 5xx path.
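
One reading of the error-class rule is sketched below: Keycloak unavailability maps to `service_unavailable`, an unparseable provider response maps to `upstream_error`, and anything else is a local defect. The two exception types are hypothetical stand-ins for whatever the future collector raises.

```python
class ProviderUnavailable(Exception):
    """Hypothetical: Keycloak could not be reached or timed out."""


class ProviderResponseParseError(Exception):
    """Hypothetical: Keycloak answered, but the response could not be parsed."""


def sanitized_error_class(exc: Exception) -> str:
    """Map an exception to a sanitized class; never include the provider body."""
    if isinstance(exc, ProviderUnavailable):
        return "service_unavailable"
    if isinstance(exc, ProviderResponseParseError):
        return "upstream_error"
    # Anything unexpected is treated as a defect in local parser/read-model code.
    return "internal_error"
```

Only the returned class string and a correlation id would be logged; the exception message itself may contain provider content and is deliberately not used.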

### Contract And Implementation Sequence

This task's first slice is design only. Implementation is not safe until the
following contract steps are reviewed:

1. Add OpenAPI schema and route contract for
   `GET /api/v1/v3/platform/iam/mfa-factor-drift`.
2. Run `make codegen` or `bash scripts/ci/sdk_codegen_smoke.sh`, then
   `CODEGEN_ENFORCE_CLEAN=1 bash scripts/ci/sdk_codegen_smoke.sh`.
3. Add backend read-model tests for the state table above, cache/freshness
   behavior, non-human exclusions, and no-secret response shape.
4. Add a disabled/empty frontend or ops surface only after the contract is
   accepted; frontend must render unknown/pending states as not compliant.
5. Create a separate provider-snapshot collector task before any live Keycloak
   factor collection. That task must name exact Keycloak Admin REST endpoints,
   rate limits, redaction proof, rollback/no-mutation boundary, and review
   domains.

Until those steps are complete, Status/Ops may use this document as the
approved target shape but must not treat the read model as implemented or use it
as live MFA evidence.

## Implementation Task Split

| Task | Owner lane | Purpose |
|---|---|---|
| `IAM-MFA-KEYCLOAK-FLOW-001` | ops/security | Configure and smoke Keycloak MFA enforcement for platform admin/ops roles. |
| `IAM-MFA-POSTURE-READMODEL-001` | backend/security | Extend the existing `/api/v1/v3/account/security` read model from hardcoded MFA booleans to provider-aware posture and contract updates. |
| `IAM-MFA-ACCOUNT-UX-001` | frontend/security | Extend the existing `/account/security` page with required/optional/unavailable MFA state and provider enrollment/manage links. |
| `IAM-MFA-OPS-RUNBOOK-001` | ops/security | Document enrollment, reset, break-glass, rollback, and audit procedures. |
| `SEC-IAM-MFA-ADMIN-SESSION-STEPUP-MODEL-001` | security/architecture/backend/ops | Define privileged admin session lifetime, step-up freshness, concurrent-session, revocation, and stolen-session recovery posture before sensitive-operation gates. |
| `IAM-MFA-SENSITIVE-OPS-GATE-001` | backend/security | Gate sensitive operations on MFA-authenticated sessions only after reliable `amr`/`acr` and `auth_time` evidence is proven and this session/step-up model has an implementation-ready control path. |
| `SEC-IAM-MFA-ROLLOUT-PREVIEW-MODEL-001` | ops/security/governance | Define rollout states, preview/enrollment monitoring, cutover criteria, exception handling, and packet intent semantics. |
| `SEC-IAM-MFA-TENANT-ADMIN-ROLLOUT-RECOVERY-001` | architecture/security/ops/governance | Define tenant/admin MFA rollout modes, recovery authority, reset proofing, support scalability, and first-slice future/scoped boundary. |
| `SEC-IAM-MFA-MULTIREALM-FEDERATION-SEMANTICS-001` | architecture/security/backend/ops | Define single-realm first-slice boundary, federated IdP assurance mapping, and future multi-region trust requirements. |

## Non-Goals

1. Do not implement custom password or TOTP verification inside GPUaaS.
2. Do not require MFA for service-account/API-key usage.
3. Do not block normal user onboarding in the first slice.
4. Do not infer MFA enrollment from frontend state alone.
5. Do not call Keycloak on every API request for authorization decisions.
6. Do not create a new account security route or page; extend the existing V3
   account security surface.

## Open Questions

1. Resolved for first production slice: use a Keycloak browser flow with
   Conditional OTP scoped to a platform-admin/ops MFA group or equivalent realm
   role. Do not use realm-wide MFA as the first slice.
2. Can the current Keycloak realm emit reliable `amr`/`acr` claims after MFA?
3. Should tenant/org-admin MFA enforcement wait for tenant federation SSO
   implementation or be supported for local/OIDC users first?
4. What is the acceptable break-glass path for the first production environment?

## Done Criteria

The architecture task is complete when:

1. MFA authority is assigned to Keycloak for human login.
2. GPUaaS-owned posture, policy, UX, audit, and sensitive-operation boundaries
   are documented.
3. Implementation tasks exist in Fairway with owners and dependencies.
4. The current hardcoded MFA posture is identified as implementation debt.