# Node Agent Drift And Recovery Model v1

Status: model v1 locked; implementation hardening continues through child tasks
Date: 2026-05-18
Owner: A-NODE-REENROLLMENT-RECOVERY-HARDENING-001

## Purpose

Node-agent connectivity is a multi-part contract. A node can come back from a
host reboot, network partition, VPN/routing change, extended control-plane
outage, or environment profile switch with only part of that contract still
valid. Treat this as drift, not only as "node down".

This model is separate from environment profile switching. The environment
profile owns canonical edge URLs and rendered trust material. Node-agent
recovery owns detection and repair when a host-local runtime no longer matches
that canonical environment.

## Layer Boundary

Node-agent drift is first a control-plane management problem. It means the
platform may not be able to inspect, mutate, recover, or clean up the node. It
does not automatically mean that already-running workloads are down.

The data-plane impact must be measured separately:

- A running Jupyter container can continue serving through managed ingress while
  node-agent task polling is stale.
- A Pomerium or route failure can break user access even when node-agent task
  polling is healthy.
- A completed status probe reporting a stopped container is workload evidence;
  an expired status probe is only missing evidence.

Recovery implementation and V3 UX should therefore keep three layers separate:

| Layer | Question | Primary owner | User-facing impact |
|---|---|---|---|
| Control-plane management | Can GPUaaS still talk to and manage the node? | `cmd/api`, `cmd/node-agent`, lifecycle/provisioning workers | usually none until a mutation, cleanup, or recovery is needed |
| App/runtime liveness | Is the workload process actually running and reachable from the expected network? | `cmd/app-runtime-worker`, node-agent task results, route probes | workload may be degraded or unavailable |
| Edge/user access | Can a user authenticate and reach the workload through the product URL? | Pomerium/proxy runtime, edge profile, app route health | user-visible access failure |

This distinction is the guardrail against treating "node-agent unreachable" as
"user workload failed". User-facing state should change only when app/runtime
or edge/user-access evidence indicates impact, or when control-plane drift is
severe enough that safe lifecycle guarantees cannot be maintained.

## Control-Plane Versus Workload Recovery

The recovery model has three separate questions. They must not collapse into
one `failed` bit:

| Question | Failure means | Normal repair path | User-facing default |
|---|---|---|---|
| Can the platform manage the node? | mutations, cleanup, placement, probe dispatch, or recovery tasks may be blocked | restore node-agent identity, endpoint profile, certs, or task truth | `temporary` unless workload/edge evidence also fails |
| Is the workload alive? | the runtime process, scheduler job, container, service, or endpoint is unhealthy | app-runtime recovery, restart, or explicit failed status with completed probe evidence | degraded or unavailable only after completed unhealthy evidence |
| Can the user reach it? | Pomerium, edge, route, auth, header injection, or upstream connect path is failing | proxy-runtime, edge-profile, or app-route repair | access failure, not automatically node recovery |

This is intentionally closer to Kubernetes' "node not ready does not prove pods
are dead" model than to a binary agent-up/agent-down model. The GPUaaS addition
is stricter tenant identity, audit, and recovery evidence, not a different
definition of liveness.
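
One way to keep the three questions from collapsing into a single `failed` bit is to carry them as separate inputs. The following is a minimal sketch under stated assumptions, not the platform's implementation: the `Probe` and `Evidence` types, their field names, and `user_facing_default` are hypothetical, and the `workload_unavailable` and `access_failure` labels are illustrative stand-ins for the "User-facing default" column.

```python
from dataclasses import dataclass
from enum import Enum


class Probe(Enum):
    HEALTHY = "healthy"      # a completed probe reported a healthy result
    UNHEALTHY = "unhealthy"  # a completed probe reported an unhealthy result
    MISSING = "missing"      # probe expired or never ran: no evidence either way


@dataclass(frozen=True)
class Evidence:
    node_manageable: bool  # can the platform manage the node?
    workload: Probe        # is the workload alive?
    user_access: Probe     # can the user reach it?


def user_facing_default(ev: Evidence) -> str:
    # Only completed unhealthy evidence marks the workload itself as impacted.
    if ev.workload is Probe.UNHEALTHY:
        return "workload_unavailable"  # hypothetical label
    # An edge/route failure is an access failure, not a node recovery case.
    if ev.user_access is Probe.UNHEALTHY:
        return "access_failure"        # hypothetical label
    # Management drift or missing observations stay `temporary`.
    if not ev.node_manageable or Probe.MISSING in (ev.workload, ev.user_access):
        return "temporary"
    return "normal"
```

With this shape, an unmanageable node whose workload probe merely expired yields `temporary`, while the same node with a completed unhealthy probe yields the workload label.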

## Security Model Cost

Some failure modes in this document are consequences of the chosen trust model,
not intrinsic to every server-agent system:

- mTLS between node-agent and node-api creates certificate expiry, issuer, and
  ingress trust drift cases.
- Environment-profile portability creates endpoint and trust-bundle drift cases.
- A node-bound recovery token enables automatic recovery after long outages but
  creates an additional credential lifecycle.

Those choices may still be correct for multi-tenant, sovereign, and airgapped
customers, but each one must pay for itself. Future simplification work should
explicitly evaluate whether a workload data path can use in-cluster
`Service`/`Endpoints` or a node-local relay instead of mTLS-to-node ingress, and
whether expired-cert recovery should fail closed to operator re-enrollment
instead of carrying a long-lived recovery credential.

Security-model choices that add operational cost:

| Choice | What it buys | What it costs | Follow-up owner |
|---|---|---|---|
| mTLS node identity for management paths | strong node-bound control-plane identity | cert expiry, issuer drift, trust-bundle drift | `A-NODE-REENROLLMENT-RECOVERY-HARDENING-001` |
| mTLS-to-node workload ingress where used | direct node endpoint authentication | ingress trust CA drift and harder data-path diagnosis | `D-NODE-DATA-PATH-MTLS-SIMPLIFICATION-DECISION-001` |
| node-bound recovery token | automatic recovery after long outage | one more credential to rotate, revoke, and lose | `D-NODE-RECOVERY-TOKEN-ROTATION-POLICY-001` |
| portable environment profiles | one profile shape reused across kind, demo, prod, and airgapped environments | endpoint and trust-profile drift | `C-DEV-ENV-PROFILE-SWITCHING-001` |

## Driftable Contract

The node-agent contract has five independently driftable parts:

1. **Node process liveness**: `gpuaas-node-agent` is running and heartbeat is
   current.
2. **Public/control endpoint identity**: `GPUAAS_API_URL`,
   `GPUAAS_TERMINAL_API_URL`, and local CA bundle point at the active edge
   profile.
3. **Node client certificate**: `/etc/gpuaas/cert.pem` and key are present,
   unexpired, and node-bound.
4. **Control-plane signing CA**: API signs node certificates with the same CA
   expected by the node-api ingress mTLS verifier.
5. **Task truth**: queued node tasks have not outlived `expires_at`, and current
   read models do not treat expired queued tasks as active operations.

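Observation ownership must be explicit so recovery does not depend on ad hoc
SQL or one-off log inspection. The table covers the five parts above and adds
two derived observation surfaces, ingress mTLS acceptance and app runtime
observations: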

| Driftable part | Primary observation signal | Observer / implementation owner |
|---|---|---|
| Node process liveness | heartbeat staleness and resumed heartbeat with the same node identity | `cmd/api` node read-model queries plus lifecycle sweeper |
| Public/control endpoint identity | enrolled node profile fields versus active environment profile; node heartbeat-reported API/terminal hosts | `cmd/api` enrollment/profile verifier |
| Node client certificate | certificate expiry, subject/node binding, and poll-start self-check result | `cmd/node-agent`, reported to `cmd/api` |
| Control-plane signing CA | signing CA fingerprint compared with node-api ingress trusted CA fingerprint | `cmd/api` startup/preflight and ops profile verifier |
| Ingress mTLS acceptance | node task wait / heartbeat mTLS result and ingress rejection logs | `cmd/api` node-api surface plus ingress/Loki dashboards |
| Task truth | queued/dispatched tasks where `expires_at <= now()` and idempotent task completion evidence | provisioning/lifecycle task sweeper, then V3 read models |
| App runtime observations | status probe task lease, completed probe output, endpoint route health, and credential health | `cmd/app-runtime-worker`, proxy runtime, credential broker |

Default operational budgets:

| Severity | Meaning | Detection budget | MTTR target |
|---|---|---|---|
| P0 | possible security boundary failure or tenant-impacting control-plane drift | under 1 minute | under 15 minutes or active incident |
| P1 | active user/workload impact with known repair path | under 5 minutes | under 1 hour |
| P2 | degraded observation, stale state, or routine recovery | under 30 minutes | next business day or automatic recovery |

These are defaults. A production environment may tighten them, but should not
weaken P0/P1 behavior without an explicit operations decision.
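
Two of the observation signals in the table above, heartbeat staleness and task truth, reduce to simple time comparisons. The sketch below is illustrative only: `NodeTask`, `heartbeat_is_stale`, and `split_task_truth` are hypothetical names, and the 120-second threshold assumes the `node.heartbeat_stale_after_seconds` default suggested under Recovery Policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed default for node.heartbeat_stale_after_seconds.
HEARTBEAT_STALE_AFTER = timedelta(seconds=120)


@dataclass(frozen=True)
class NodeTask:
    task_id: str
    status: str           # e.g. "queued", "dispatched", "completed"
    expires_at: datetime


def heartbeat_is_stale(last_seen: datetime, now: datetime) -> bool:
    """A heartbeat is stale once it is older than the configured threshold."""
    return now - last_seen > HEARTBEAT_STALE_AFTER


def split_task_truth(tasks, now):
    """Split in-flight tasks into (current, expired).

    Queued or dispatched tasks with expires_at <= now are historical
    evidence and must not be shown as current work. Tasks in any other
    status are not in flight and appear in neither list.
    """
    current, expired = [], []
    for task in tasks:
        if task.status not in ("queued", "dispatched"):
            continue
        (expired if task.expires_at <= now else current).append(task)
    return current, expired
```

A read model built on `split_task_truth` would render only the `current` list as latest-operation state and keep `expired` for history.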

## Recovery Policy

- If only the node process is down, restart or lifecycle-repair the agent.
- If only the node certificate is expired, or the control plane rejects the
  current node identity after a reachable endpoint/profile change, and a current
  recovery enrollment token exists, the node-agent may attempt bounded
  re-enrollment automatically and continue polling.
- "Current" means the recovery token is unexpired, node-bound, and issued for
  the active environment profile. Tokens older than the configured recovery
  window must fail closed and require an operator-issued bundle.
- If endpoint hostnames, local trust bundle, enrollment token, or control-plane
  signing CA have drifted, use operator re-enrollment/repair. The recovery
  bundle must rewrite endpoint fields and CA material, not only mint a new token.
- If API signing CA and ingress mTLS trust CA diverge, stop and repair the
  profile/secret rendering path. Re-enrolling nodes alone will keep producing
  certificates that ingress rejects.
- If a freshly recovered certificate is still rejected, the agent must suppress
  immediate re-enrollment retries and expose recovery counters so operators can
  distinguish rare drift from a recurring profile or CA mismatch.
- If task polling resumes after an outage, expired queued tasks must become
  historical evidence, not current latest-operation state.

Suggested policy keys:

- `node.recovery_token_max_age_seconds`: default `86400`.
- `node.heartbeat_stale_after_seconds`: default `120`.
- `node.task_stale_sweep_interval_seconds`: default `60`.
- `audit.dedupe_window_seconds`: default `300`.
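
The definition of a "current" recovery token above can be expressed as one predicate. This is a minimal sketch, assuming a hypothetical `RecoveryToken` record with `node_id`, `profile`, `issued_at`, and `expires_at` fields; the real token format is not specified in this document.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumed default for node.recovery_token_max_age_seconds.
RECOVERY_TOKEN_MAX_AGE = timedelta(seconds=86400)


@dataclass(frozen=True)
class RecoveryToken:  # hypothetical shape, for illustration only
    node_id: str
    profile: str
    issued_at: datetime
    expires_at: datetime


def token_is_current(
    token: Optional[RecoveryToken],
    node_id: str,
    active_profile: str,
    now: datetime,
) -> bool:
    """True only if the token is unexpired, node-bound, issued for the
    active environment profile, and inside the recovery window."""
    if token is None:
        return False
    return (
        token.node_id == node_id
        and token.profile == active_profile
        and now < token.expires_at
        and now - token.issued_at <= RECOVERY_TOKEN_MAX_AGE
    )
```

Any `False` result corresponds to the fail-closed path: the agent does not attempt automatic re-enrollment and waits for an operator-issued bundle.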

## Self-Update, Rebootstrap, And Reimage Boundary

The node-agent may participate in its own update path, but it is not the
authority for node identity, recovery-token trust, tenant isolation, or host
reimage decisions. Self-update is a bounded runtime maintenance action over an
already trusted node. Rebootstrap and reimage are operator/control-plane
recovery actions.

Decision boundary:

| Trigger | Allowed path | Node-agent may do | Requires control-plane/operator action | Rollback expectation |
|---|---|---|---|---|
| Patch-level node-agent binary update with unchanged endpoint, CA, task protocol, and service unit | `node.self_update` | verify digest, install package, refresh task verifier material, optionally replace current recovery token, restart its own systemd unit | release evidence must prove package digest, expected version, task signing keys, and recovery token binding | restart previous package only if previous digest and verifier set are still trusted; otherwise rebootstrap |
| Task verifier key rotation where current agent can verify the update task | `node.self_update` | replace `GPUAAS_TASK_SIGNING_PUBKEYS` from the signed task payload | control plane must publish old+new verifier overlap and evidence | roll back by publishing a signed task with the prior trusted verifier set during overlap |
| Recovery-token refresh on an otherwise healthy node | `node.self_update` or recovery-token rotation task | replace node-bound `GPUAAS_ENROLLMENT_TOKEN` after control-plane issuance | IAM/PKI owner must audit issuance and expiry | old token must be revoked or allowed to expire inside policy window |
| Expired cert with current recovery token and valid recovery trust path | automatic recovery enrollment | request bounded re-enrollment, replace cert/key on success, resume polling | API remains authority and may reject retired/quarantined/cloned identities | no binary rollback; failed recovery remains `operator_action_required` |
| Runtime API URL, terminal API URL, CA bundle, install root, systemd unit, or registry trust drift | operator rebootstrap/repair bundle | stop retry flooding; write local diagnostic evidence | operator/control plane must issue a bundle that rewrites endpoint, trust, env, and package metadata together | rollback is another audited bundle; do not partially edit host config without evidence |
| API signing CA versus ingress trust CA mismatch | control-plane trust repair first | suppress immediate re-enrollment loops after repeated rejection | repair profile/secret rendering; then re-enroll affected nodes only if needed | roll back control-plane CA/trust rendering before issuing node bundles |
| Lost cert/key and no valid recovery token | operator rebootstrap | expose local reason and wait | issue audited node-bound bundle for the existing node, or retire and onboard as new node | rollback by invalidating bundle and restoring previous trusted material if still valid |
| Cloned identity, suspected credential compromise, tenant-isolation violation, or host transferred between trust domains | quarantine then full reimage/new identity | stop accepting work and surface security reason | security/operator approval; revoke identity and bootstrap a new node identity after wipe | no in-place rollback; reimage or decommission |
| GPU/runtime/kernel/container drift that can affect tenant isolation or workload correctness | repair if proven safe, otherwise full reimage | run approved diagnostics only | provider lifecycle owner decides repair versus reimage using reconciliation evidence | workload admission remains blocked until evidence clears |

`node.self_update` therefore has these hard limits:

1. It must be digest pinned and signed/authorized by the control plane.
2. It may update the node-agent package, task verifier material, and a
   control-plane-issued recovery token.
3. It must not mint its own credentials, relax sudoers policy, change tenant
   isolation state, reclassify node ownership, or decide that a host is safe for
   reuse after suspected compromise.
4. It must leave local evidence before restart, including requested version,
   package digest, verifier-set version, token refresh posture, result, and
   rollback hint.
5. Failure to self-update is a node-management degradation. It is not workload
   downtime unless app/runtime or edge evidence separately proves customer
   impact.
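
Hard limits 1 to 3 lend themselves to a pre-execution check on the task payload. The sketch below assumes a hypothetical payload shape: the keys `package_digest`, `control_plane_signature_verified`, and `updates`, and the update-target names in `ALLOWED_UPDATES`, are illustrative and not taken from the actual `node.self_update` task schema.

```python
import re

# Hypothetical names for the three things self-update may touch.
ALLOWED_UPDATES = {"package", "task_verifier_keys", "recovery_token"}
SHA256_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def self_update_violations(task: dict) -> list:
    """Return human-readable violations; an empty list means the task
    stays inside the self-update boundary sketched here."""
    violations = []
    digest = str(task.get("package_digest") or "")
    if not SHA256_DIGEST.match(digest):
        violations.append("package digest is missing or not pinned")
    if not task.get("control_plane_signature_verified", False):
        violations.append("task is not signed/authorized by the control plane")
    # Anything outside the allowed set (sudoers, tenant isolation state,
    # node ownership, ...) is out of scope for self-update.
    for name in sorted(set(task.get("updates", [])) - ALLOWED_UPDATES):
        violations.append(f"update target not allowed: {name}")
    return violations
```

A task with any violation would be rejected before install, and the rejection itself would be part of the local evidence required by limit 4.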

Release/profile gates for node-agent changes must attach evidence for:

- package digest and expected version;
- task-signing verifier overlap when verifier keys change;
- recovery-token issuance/rotation posture;
- cert renewal and recovery enrollment behavior;
- terminal node-stream readiness when terminal-related code changes;
- app-runtime task compatibility when task protocol changes;
- provider lifecycle cleanup posture when node update can affect provisioning
  or decommissioning.

Gate recommendation must be `block_production` when a release changes
node-agent update, cert, credential, terminal, or task protocol behavior without
corresponding recovery/rollback evidence.
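
The gate rule can be sketched as a lookup from changed area to required evidence. Everything here except `block_production` is an assumption made for illustration: the area names, the evidence identifiers in `REQUIRED_EVIDENCE`, and the `allow` result are hypothetical.

```python
# Hypothetical mapping from a sensitive change area to the evidence
# it must carry, loosely following the evidence list above.
REQUIRED_EVIDENCE = {
    "update": {"package_digest", "expected_version"},
    "cert": {"cert_renewal", "recovery_enrollment"},
    "credential": {"recovery_token_rotation"},
    "terminal": {"terminal_node_stream_readiness"},
    "task_protocol": {"app_runtime_task_compatibility"},
}


def gate_recommendation(changed_areas, attached_evidence):
    """Return (recommendation, missing), where missing maps each changed
    area to the evidence it still lacks. Areas not listed in
    REQUIRED_EVIDENCE are treated as needing no evidence in this sketch."""
    attached = set(attached_evidence)
    missing = {}
    for area in changed_areas:
        lacking = REQUIRED_EVIDENCE.get(area, set()) - attached
        if lacking:
            missing[area] = sorted(lacking)
    return ("block_production" if missing else "allow", missing)
```

Returning the missing evidence alongside the recommendation lets a release report state exactly what would unblock the gate.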

## Operator State Reasons

Operator-facing state should distinguish at least:

- `agent_down`: heartbeat stale or process not reporting.
- `cert_expired`: local node certificate expired.
- `cert_untrusted_by_ingress`: node certificate issuer does not match ingress
  trusted CA.
- `endpoint_profile_drift`: node env points at a different edge/profile host
  family than the active environment.
- `recovery_token_missing`: no current node-bound recovery token is available, so
  automatic re-enrollment is not possible.
- `task_backlog_stale`: expired queued tasks are present after recovery.
- `identity_revoked_or_fenced`: the node identity is retired, quarantined, or
  explicitly denied.
- `cloned_identity_detected`: more than one host instance is presenting the same
  node identity.

These reason codes are debug and runbook evidence, not the normal user-visible
state model. Operators need the specific reason after clicking into the
resource; users and first-line support need a small state family:
`normal`, `temporary`, `recovering`, `operator_action_required`, or
`security_action_required`.

## Process Ownership

Each recovery loop has one binary that owns the decision. Other components may
provide signals, but they should not independently mutate lifecycle truth.

| Decision / action | Owning binary | Supporting signals |
|---|---|---|
| Mark heartbeat stale/current | `cmd/api` read model / lifecycle sweeper | node-agent heartbeat payload, last-seen timestamp |
| Detect local cert expiry before polling | `cmd/node-agent` | local cert parse, node-bound cert subject |
| Attempt automatic recovery enrollment | `cmd/node-agent` | current `GPUAAS_ENROLLMENT_TOKEN`, active endpoint profile |
| Issue or reject recovery enrollment | `cmd/api` | node status, token binding, profile, signing CA |
| Validate API signing CA versus ingress trust CA | `cmd/api` startup/profile preflight | rendered Kubernetes secret, signing CA fingerprint |
| Reclaim expired queued/dispatched node tasks | provisioning/lifecycle sweeper | `node_tasks.expires_at`, idempotency contract, task type |
| Classify app status probe expiry | `cmd/app-runtime-worker` | task lease, prior runtime observation, backoff state |
| Classify completed app runtime failure | `cmd/app-runtime-worker` | completed probe output, runtime adapter result |
| Classify edge/user route failure | proxy runtime / Pomerium edge checks | route health, Pomerium logs, upstream status |
| Deduplicate app token/controller retry audit | credential broker / `cmd/app-runtime-worker` | actor, app, credential, action, reason, retry window |
| Detect cloned identity | `cmd/api` heartbeat correlator | duplicate node identity, host fingerprint, conflicting heartbeat source |

## Recovery Evidence

Minimum recovery evidence before marking a node healthy:

- heartbeat current,
- node cert issuer matches ingress trusted CA,
- task wait succeeds through node-api mTLS,
- a typed no-op/probe or terminal-open task completes,
- lifecycle/latest-operation read model no longer surfaces expired queued tasks
  as current work.
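
The checklist above can be modeled as a record in which every item must hold before the node is marked healthy. This is a sketch only; `RecoveryEvidence` and its field names are hypothetical labels for the five items.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RecoveryEvidence:  # one boolean per checklist item
    heartbeat_current: bool
    cert_issuer_matches_ingress_ca: bool
    task_wait_succeeds_over_mtls: bool
    probe_or_terminal_task_completed: bool
    no_expired_tasks_shown_as_current: bool


def missing_evidence(ev: RecoveryEvidence) -> list:
    """Names of checklist items that are not yet satisfied, in order."""
    return [f.name for f in fields(ev) if not getattr(ev, f.name)]


def may_mark_healthy(ev: RecoveryEvidence) -> bool:
    """A node is healthy only when no checklist item is missing."""
    return not missing_evidence(ev)
```

Reporting `missing_evidence` rather than a bare boolean keeps the operator view specific about which proof is still outstanding.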

## Recurrence Analysis: What Was Missed

The 2026-05-21 and 2026-05-22 local-kind recurrences showed that the matrix was
conceptually right but still missing an implementation boundary:

1. A node-agent can be alive and correctly running while the API read model only
   sees a stale heartbeat.
2. The agent may be unable to report the reason because the same stale HTTPS
   trust/profile material blocks task polling, certificate renewal, and bearer
   recovery enrollment before any API handler sees a request.
3. In that state, API-side evidence is necessarily incomplete. A read model that
   only has `agent_reported_at` can only say "heartbeat stale", not whether the
   root cause is process liveness, server trust, endpoint profile, DNS/routing,
   identity fencing, or task backlog.

The design adjustment is:

- Node-agent must maintain a local diagnostic record for pre-auth and pre-API
  failures. The record should be written to disk and journald before every
  retry/backoff decision. It must include low-cardinality reason codes such as
  `process_alive`, `server_tls_untrusted`, `endpoint_profile_drift`,
  `recovery_enrollment_blocked`, `identity_revoked`, `edge_rate_limited`, and
  `task_poll_failed`.
- Host-side parity/smoke tooling must read that local diagnostic record when API
  evidence is stale. API-first remains the default, but this is the explicit
  fallback for failures that prevent API reporting.
- Bootstrap material must include a recovery trust/profile path that can reach
  enrollment after the active API trust bundle drifts. If that is intentionally
  unavailable, the node must fail closed into an operator-visible
  `endpoint_profile_drift` / `recovery_enrollment_blocked` state instead of
  retrying forever as generic `node_unavailable`.
- `GPUAAS_RECOVERY_API_URL` and `GPUAAS_RECOVERY_CA_BUNDLE_PATH` are the
  node-agent inputs for that path. They default to the runtime API/profile in
  simple environments, but can be pinned separately for local-kind, demo, or
  private-ingress recovery paths.
- Node-agent metrics must expose identity rejection and recovery enrollment
  counters: attempts, successes, failures, and locally suppressed retries. These
  counters are the first signal that a trust/profile issue is recurring across
  nodes rather than being a one-off recovery event.
- The API/V3 read model must merge heartbeat age, latest node-agent lifecycle
  error, stale task count, and optional host diagnostic evidence into one
  classified recovery status. `node_agent_unreachable` remains only the fallback
  when no better evidence exists.

This is the difference between a workaround and a stable recovery model: a
host-local failure that prevents API reporting still produces durable,
low-cardinality evidence that an operator or smoke test can collect.
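
A minimal sketch of the two halves of this adjustment follows: a durable local diagnostic record on the node, and a read-model merge that prefers it when the heartbeat is stale. The record layout (`reason`, `recorded_at`), the function names, and the precedence order in `classify_recovery_status` are assumptions, and the sketch writes to disk only, without the journald write.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

REASON_CODES = {
    "process_alive", "server_tls_untrusted", "endpoint_profile_drift",
    "recovery_enrollment_blocked", "identity_revoked", "edge_rate_limited",
    "task_poll_failed",
}


def write_diagnostic(path, reason, now=None):
    """Write the record durably, then swap it in with an atomic rename."""
    if reason not in REASON_CODES:
        raise ValueError(f"unknown reason code: {reason}")
    now = now or datetime.now(timezone.utc)
    record = {"reason": reason, "recorded_at": now.isoformat()}
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)
    return record


def classify_recovery_status(heartbeat_stale, lifecycle_error,
                             stale_task_count, host_diagnostic):
    """Merge the evidence sources into one reason; None means no drift.

    Assumed precedence: with a stale heartbeat, host-local evidence beats
    the last API-side lifecycle error, and `node_agent_unreachable` is
    only the fallback when nothing better exists.
    """
    if heartbeat_stale:
        reason = (host_diagnostic or {}).get("reason")
        if reason in REASON_CODES - {"process_alive"}:
            return reason
        return lifecycle_error or "node_agent_unreachable"
    if stale_task_count > 0:
        return "task_backlog_stale"
    return lifecycle_error
```

Host-side smoke tooling would load the same JSON file and pass it as `host_diagnostic` when API evidence is stale.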

## App Runtime Liveness Boundary

Node-agent liveness and app-runtime liveness are related but not equivalent.
A missed app status probe is a missing observation. It must not automatically
mean that the app container, compose service, terminal path, or managed ingress
route is unhealthy.

The app-runtime controller should treat these signals separately:

1. **Agent heartbeat**: can the node-agent currently report to the control
   plane?
2. **Task lease**: was the specific task claimed and completed before its
   lease/TTL expired?
3. **Runtime process state**: does the node report the container, Compose
   service, scheduler job, or Kubernetes workload as running?
4. **Endpoint reachability**: can the platform proxy or direct probe reach the
   app endpoint from the expected network location?
5. **User-visible route health**: does the route complete authentication,
   header injection, upstream connect, and response/WebSocket handling?
6. **Credential/token health**: can the app runtime refresh or use required
   app/user/service tokens without generating duplicate audit noise?

Policy:

- Launch, stop, start, restart, and remove tasks are desired-state operations.
  If they expire before completion, the operation state can fail or require
  explicit recovery because the requested mutation did not complete.
- Recurring status probes are observations. If they expire before completion,
  clear the in-flight probe, record `last_runtime_probe_failed_at`, keep the
  previous app status, and retry with bounded backoff. Do not mark a running
  app failed unless a completed probe reports a real unhealthy runtime state or
  a separate endpoint/route health policy reaches its failure threshold.
- If a running app was nevertheless marked unhealthy due only to an expired
  status probe, a later successful status probe may recover it to `running`
  without a new deploy operation.
- Read models must show "last observation stale" separately from "runtime
  failed" so operators can distinguish control-plane/node-agent polling issues
  from workload issues.
- Repeated credential refresh failures, expired app tokens, or expected retry
  loops must be summarized/deduplicated. Audit rows should represent security
  or lifecycle decisions, not every controller retry tick. High-frequency token
  failure detail belongs in structured logs/metrics with correlation, bounded
  counters, and "first seen / last seen / count" summaries.
- The default audit dedupe key is `(actor_type, actor_id, app_instance_id,
  credential_id, action, reason)`. Within `audit.dedupe_window_seconds`, emit at
  most one audit row plus a summary update with `first_seen_at`,
  `last_seen_at`, and `count`. Detailed retry errors belong in structured logs
  and metrics, not immutable audit spam.
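
The status-probe rule in the policy above (an expired probe is a missing observation, a completed probe is evidence) can be sketched as a small state transition. `AppState`, `on_probe_result`, `next_probe_delay`, the 30-second base delay, and the 5-minute cap are all hypothetical; only `last_runtime_probe_failed_at` comes from the policy text.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional

MAX_BACKOFF = timedelta(minutes=5)  # hypothetical cap on probe backoff


@dataclass(frozen=True)
class AppState:
    status: str  # e.g. "running", "stopped"
    probe_in_flight: bool = False
    last_runtime_probe_failed_at: Optional[datetime] = None
    consecutive_expired_probes: int = 0


def on_probe_result(state, outcome, now, reported_status=None):
    """Apply one probe outcome: 'expired' or 'completed'."""
    if outcome == "expired":
        # Missing observation: clear the probe, record when it failed,
        # and keep the previous app status untouched.
        return replace(
            state,
            probe_in_flight=False,
            last_runtime_probe_failed_at=now,
            consecutive_expired_probes=state.consecutive_expired_probes + 1,
        )
    if outcome == "completed" and reported_status is not None:
        # Completed probe: the reported runtime state is real evidence.
        return replace(state, status=reported_status, probe_in_flight=False,
                       consecutive_expired_probes=0)
    raise ValueError(f"unsupported probe outcome: {outcome!r}")


def next_probe_delay(state, base=timedelta(seconds=30)):
    """Bounded exponential backoff driven by consecutive expired probes."""
    return min(base * (2 ** state.consecutive_expired_probes), MAX_BACKOFF)
```

Because `status` only changes on a completed probe, a later successful probe restores `running` without a new deploy operation.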

Concrete output shape:

- Immutable audit: one row when the failure first appears, one row when it
  recovers, and one periodic summary row if the failure persists beyond the
  dedupe window.
- Structured log: every retry may log a sanitized event with `correlation_id`,
  `dedupe_key_hash`, `first_seen_at`, `last_seen_at`, and `count`.
- Metrics: counters by low-cardinality reason and route/runtime family; no raw
  token, URL query string, user secret, or provider error payload labels.
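
The audit shape above (first appearance, periodic summary, recovery) can be sketched with an in-memory deduper keyed by the default dedupe tuple. `AuditDeduper`, `Summary`, and the row labels `first_seen`, `summary`, and `recovered` are hypothetical; a real implementation would persist summaries rather than hold them in a dictionary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed default for audit.dedupe_window_seconds.
DEDUPE_WINDOW = timedelta(seconds=300)


@dataclass
class Summary:
    first_seen_at: datetime
    last_seen_at: datetime
    count: int
    last_audit_at: datetime


class AuditDeduper:
    def __init__(self, window=DEDUPE_WINDOW):
        self.window = window
        self.summaries = {}   # dedupe key -> Summary
        self.audit_rows = []  # stand-in for the immutable audit table

    def record_failure(self, key, now):
        """Count the failure; return True only if an audit row was emitted."""
        summary = self.summaries.get(key)
        if summary is None:
            self.summaries[key] = Summary(now, now, 1, now)
            self.audit_rows.append(("first_seen", key, now))
            return True
        summary.last_seen_at = now
        summary.count += 1
        if now - summary.last_audit_at >= self.window:
            summary.last_audit_at = now
            self.audit_rows.append(("summary", key, now))
            return True
        return False  # detail belongs in structured logs and metrics

    def record_recovery(self, key, now):
        """Emit one recovery row if the key was failing, then forget it."""
        if self.summaries.pop(key, None) is None:
            return False
        self.audit_rows.append(("recovered", key, now))
        return True
```

The key passed in is the tuple `(actor_type, actor_id, app_instance_id, credential_id, action, reason)`; ten retries inside one window then produce a single audit row and a summary with `count` of 10.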

## User-Facing Recovery States

Operator reasons should map to stable V3 user-facing states, but they should
not leak low-level reason codes into normal user copy. The detailed reason code
belongs in operator evidence and runbooks; the user surface should present a
small state family.

Recommended state families:

| V3 state family | Meaning | Typical underlying reasons |
|---|---|---|
| `normal` | workload is reachable; any control-plane drift is not user-impacting | none, or non-impacting management drift |
| `temporary` | observation or management is stale; workload may still be running | `agent_down`, `task_backlog_stale`, short outage |
| `recovering` | automatic repair is in progress | `cert_expired` with current recovery token |
| `operator_action_required` | platform must repair identity, profile, or lifecycle state | `cert_untrusted_by_ingress`, `endpoint_profile_drift`, `recovery_token_missing`, `identity_revoked_or_fenced` |
| `security_action_required` | possible identity or tenant-boundary issue | `cloned_identity_detected` |

Detailed mapping:

| Operator reason | V3 user state | User-facing behavior |
|---|---|---|
| `agent_down` | `temporary` | show runtime temporarily disconnected; keep allocation/app visible |
| `cert_expired` | `recovering` | show reconnecting/recovering while the recovery token is current |
| `cert_untrusted_by_ingress` | `operator_action_required` | show platform maintenance; suppress user retry loops |
| `endpoint_profile_drift` | `operator_action_required` | show platform maintenance; link operator evidence |
| `recovery_token_missing` | `operator_action_required` | show support/admin action required |
| `task_backlog_stale` | `temporary` | show operation history stale, not workload failed |
| `identity_revoked_or_fenced` | `operator_action_required` | show node unavailable; hide self-service recovery |
| `cloned_identity_detected` | `security_action_required` | show node unavailable; page security/operator team |

For shared or sliced nodes, a node-level drift reason fans out to every active
allocation/app on that node. Each affected allocation should show the same
user-facing recovery state but keep its own evidence trail and notification
record so tenant communication remains scoped.

The detailed fanout model is intentionally a separate task:
`D-NODE-DRIFT-TENANT-FANOUT-IMPACT-001`. Until that lands, node-level evidence
may be summarized on operator pages, but tenant-facing pages must not expose
other tenants' affected allocation counts or identifiers.
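
The detailed mapping and the node-level fan-out can be sketched together. The mapping dictionary mirrors the table in this section; the severity ordering used when a node reports several reasons at once, and the `user_state` and `fan_out` helpers, are assumptions rather than decided behavior.

```python
REASON_TO_STATE = {
    "agent_down": "temporary",
    "cert_expired": "recovering",
    "cert_untrusted_by_ingress": "operator_action_required",
    "endpoint_profile_drift": "operator_action_required",
    "recovery_token_missing": "operator_action_required",
    "task_backlog_stale": "temporary",
    "identity_revoked_or_fenced": "operator_action_required",
    "cloned_identity_detected": "security_action_required",
}

# Assumed ordering, least to most severe, for combining several reasons.
STATE_SEVERITY = [
    "normal",
    "temporary",
    "recovering",
    "operator_action_required",
    "security_action_required",
]


def user_state(reasons):
    """Most severe user-facing state for a set of operator reasons.
    Unknown reason codes raise KeyError instead of being guessed."""
    states = [REASON_TO_STATE[r] for r in reasons] or ["normal"]
    return max(states, key=STATE_SEVERITY.index)


def fan_out(node_reasons, allocations):
    """Apply one node-level state to every allocation, grouped by tenant.

    allocations is an iterable of (tenant_id, allocation_id). Each tenant
    only ever receives its own sub-dictionary, so counts and identifiers
    of other tenants' allocations are not exposed.
    """
    state = user_state(node_reasons)
    per_tenant = {}
    for tenant_id, allocation_id in allocations:
        per_tenant.setdefault(tenant_id, {})[allocation_id] = state
    return per_tenant
```

A tenant-facing page would render only `fan_out(...)[tenant_id]`, while operator pages may summarize the whole result.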

## Recovery As V3 Tasks

Recovery actions must be visible through the same lifecycle/evidence model as
normal provisioning work. Re-enrollment, recovery-bundle generation,
signing-CA repair, node rebootstrap, cloned-identity fencing, and stale-task
sweeps should emit V3 tasks/evidence with a shared `correlation_id`.

The correlation chain should let an operator pivot from an ingress rejection or
Pomerium/proxy symptom to node-agent re-enrollment attempts, recovery bundle
issuance, task-sweeper changes, and any user-visible state change.

## Follow-Up Task Split

This document locks the shared model. The following work remains intentionally
split so each task has a crisp owner:

| Task | Scope |
|---|---|
| `A-NODE-REENROLLMENT-RECOVERY-HARDENING-001` | implementation of cert/profile/trust recovery hardening and operator-visible reason fields |
| `A-NODE-AGENT-LOCAL-DIAGNOSTIC-EVIDENCE-001` | node-agent local diagnostic record for pre-auth/pre-API trust, endpoint, and recovery failures |
| `A-NODE-AGENT-BOOTSTRAP-RECOVERY-TRUST-001` | bootstrap-pinned recovery trust/profile path for enrollment when the active API trust bundle is stale |
| `C-NODE-READINESS-CLASSIFIED-EVIDENCE-001` | API/V3 readiness classification that merges heartbeat, lifecycle error, stale task, and host diagnostic evidence |
| `C-KIND-NODE-AGENT-RECURRENCE-SMOKE-001` | local-kind disruption/recovery smoke that proves classified recovery instead of generic unreachable state |
| `D-NODE-CONTROL-PLANE-VS-WORKLOAD-IMPACT-SPLIT-001` | live proof that node-agent drift does not automatically imply workload downtime |
| `D-NODE-DATA-PATH-MTLS-SIMPLIFICATION-DECISION-001` | decide whether bare-metal workload ingress can move away from mTLS-to-node data paths |
| `D-NODE-DRIFT-TENANT-FANOUT-IMPACT-001` | tenant-scoped notification/evidence semantics for shared-node drift |
| `D-NODE-RECOVERY-TOKEN-ROTATION-POLICY-001` | recovery-token lifetime, rotation, revoke, and fail-closed policy |
| `D-NODE-SELF-UPDATE-RECOVERY-BOUNDARY-001` | boundary between node-agent binary self-update, rebootstrap, and recovery-token/cert repair; links to `Node_Agent_OCI_Distribution_v1.md` and `A-NODE-AGENT-RELEASE-UPDATE-PATH-001` |
| `A-APP-RUNTIME-TOKEN-AUDIT-DEDUPE-001` | code-level dedupe for repeated app token/controller audit emissions |

## Failure Scenario Matrix

| Scenario | Severity | Detection budget | MTTR target | Example trigger | Expected detection | Implementation owner | Expected recovery |
|---|---|---|---|---|---|---|---|
| Short node outage | P2 | < 5m | automatic / next sweep | host reboot, node-agent restart | heartbeat stale then current, same cert issuer | `cmd/api` lifecycle read model + `cmd/node-agent` | automatic resume; no re-enrollment |
| Long node outage beyond cert expiry | P1 | < 5m after return | < 1h | host powered off past node cert TTL | local cert expired before task wait | `cmd/node-agent` recovery path | automatic re-enrollment only if recovery token and trust bundle are current |
| Node identity rejected after reachable profile change | P1 | < 5m | < 1h | node has stale cert or cert chain after endpoint/profile repair | task wait or renewal returns node identity rejection while recovery endpoint is reachable | `cmd/node-agent` recovery path + `cmd/api` node authz | bounded automatic re-enrollment; backend remains authority and rejects retired/quarantined nodes |
| Control-plane outage | P1 | < 1m | < 1h | API/ingress down while nodes are running | task wait/network failures with bounded backoff | `cmd/node-agent` retry loop + ops alerts | automatic retry; no task/log flood; recover when API returns |
| Control-plane CA restart drift | P0 | < 1m | < 15m | API restarts without persisted node signing CA | enrollment succeeds but task wait returns ingress mTLS rejection | `cmd/api` startup/preflight + profile renderer | repair API signing CA and ingress trusted CA first; then re-enroll |
| Ingress trust CA drift | P0 | < 1m | < 15m | ingress secret differs from API node signing CA | ingress `400 SSL certificate error` or unknown CA | profile renderer + ingress/Loki alert | render one CA into API signing config, bootstrap CA, and ingress trust secret |
| Endpoint/profile drift | P1 | < 5m | < 1h | node env points at old hostname or old edge family | node logs show old `GPUAAS_API_URL`/terminal URL; profile verifier mismatch | `cmd/api` enrollment/profile verifier | recovery bundle rewrites endpoint fields and CA material |
| Trust bundle drift on host | P1 | < 5m | < 1h | node local CA bundle trusts old public/node API cert | enrollment or task wait TLS verify failure | `cmd/node-agent` TLS self-check + `cmd/api` classifier | recovery bundle rewrites local CA bundle; restart agent |
| Trust bundle drift blocks recovery enrollment | P1 | < 5m | < 1h | local-kind or demo edge cert/profile rotates while external workers retain stale `/etc/gpuaas/ca-bundle.crt` | task polling and bearer recovery enrollment both fail server TLS verification before API handlers see the request | `cmd/node-agent` TLS self-check + `cmd/api` profile/read-model evidence | use bootstrap-pinned recovery trust, profile repair material, or operator-issued bundle; do not loop indefinitely on generic `node_unavailable` |
| Pre-auth diagnostic blind spot | P1 | < 5m | < 1h | node-agent cannot verify API TLS or reach recovery endpoint, so API has no fresh failure payload | stale heartbeat plus host diagnostic record reports `server_tls_untrusted`, `endpoint_profile_drift`, or `recovery_enrollment_blocked` | `cmd/node-agent` local diagnostic writer + host parity/smoke tooling + API/V3 classifier | collect host diagnostic evidence, classify recovery status, repair profile/trust path |
| Edge/API rate limiting | P2 | < 5m | automatic / rule-dependent | node-agent is pointed at a browser/edge path or an edge rule throttles repeated task polling | task wait returns 429, `rate_limit_exceeded`, or a Cloudflare temporary-rate-limit page | `cmd/node-agent` retry loop + edge runbook | classify as `edge_rate_limited`, slow the polling cadence, and repair the endpoint profile or edge rule instead of flooding retries |
| Recovery token missing/consumed | P1 | < 5m after cert failure | < 1h | env lacks current node-bound token | expired cert plus missing/expired `GPUAAS_ENROLLMENT_TOKEN` | `cmd/node-agent` recovery path + `cmd/api` evidence | admin issues audited re-enrollment material |
| Node identity revoked/fenced | P0 | < 1m | < 15m or explicit operator resolution | node status retired/quarantined or identity revoked | 401/403 classified as revoked/fenced | `cmd/api` node authz | pause polling; operator reactivates or re-enrolls |
| Cloned VM identity | P0 | < 1m | < 15m | two hosts share node ID/cert/env | duplicate instance IDs, conflicting heartbeats, cert identity mismatch | `cmd/api` heartbeat correlator | fence duplicate, require new node identity/bootstrap |
| IP/route drift | P2 | < 30m | next sweep | DHCP/VPN/routing changed host address | heartbeat recorded against the old host address, or connect path fails while the node still reports heartbeats | `cmd/api` lifecycle read model | refresh node network facts; recovery only if identity still matches |
| Task backlog after outage | P2 | < 5m | automatic / next sweep | queued tasks expire while node/control plane is down | `node_tasks.status=queued` with `expires_at <= now()` | provisioning/lifecycle task sweeper | mark/filter expired tasks; preserve audit history |
| Dispatched result lost | P2 | < 30m | automatic or operator replay | node completed task but response was lost | stale dispatched task with idempotent task type | provisioning/lifecycle task reclaimer | safe requeue or idempotent result acceptance |
| App status probe expires | P2 | < 5m | automatic / next probe | app is running, but node-agent/control plane misses one status task | status probe task `expired`, previous runtime observation exists | `cmd/app-runtime-worker` | clear probe, keep app status, record stale observation, retry with backoff |
| App status probe reports stopped container | P1 | < 5m | < 1h | node-agent completes status task with container not running | completed status output contains stopped/exited state | `cmd/app-runtime-worker` | mark app runtime unhealthy/failed with completed probe evidence |
| App status probe succeeds after false unhealthy | P2 | < 5m | automatic / next probe | app row says unhealthy due only to stale/expired probe, container still running | completed status output reports running | `cmd/app-runtime-worker` | recover app to running and clear failure reason |
| App endpoint fails while runtime process is running | P1 | < 5m | < 1h | container reports running but HTTP/WS route fails | endpoint probe/proxy metrics fail, status probe succeeds | proxy runtime + `cmd/app-runtime-worker` | classify as route/upstream health, not node-agent failure |
| App token refresh/use fails repeatedly | P1 | < 5m | < 1h | expired app token, bad credential binding, upstream auth rejects token | repeated credential operation errors with same app/credential/reason | credential broker + `cmd/app-runtime-worker` | dedupe audit, emit metrics/log summaries, surface one operator action item |
| Terminal gateway outage | P1 | < 5m | < 1h | terminal edge down while node task path is healthy | terminal token mints but WS cannot complete | `cmd/terminal-gateway` + Pomerium/edge checks | edge/gateway retry; do not re-enroll node |
| Terminal node stream never connects | P1 | < 30s | < 1h | `terminal.open` task returns success but node-agent cannot reach the gateway's node-facing internal WebSocket | browser session waits for node stream; gateway never logs internal node websocket connected | `cmd/terminal-gateway`, `cmd/node-agent`, endpoint-profile verifier | fail fast with `node_stream_timeout`; repair node-facing terminal endpoint/profile instead of treating task success as session readiness |
| Node-agent version drift | P2 | < 30m | next maintenance window unless blocking workloads | old agent lacks current recovery or task protocol | heartbeat reports old build; tasks fail because the agent does not support the task type | lifecycle rollout worker | self-update/rebootstrap before workload recovery |
| Clock skew | P1 | < 5m | < 1h | node clock behind/ahead | cert appears not-yet-valid/expired inconsistently | `cmd/node-agent` self-check + `cmd/api` classifier | sync time first; then retry cert validation |

The implementation should classify these states explicitly enough that an
operator can see whether the next action is automatic wait, repair certs,
re-enroll, repair endpoint profile, rebootstrap, reactivate, or fix the control
plane. A generic "node-agent unreachable" state is only useful as a temporary
fallback when the control plane lacks enough evidence.
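
The classification above can be sketched as a small lookup from evidence to a
next action. This is a minimal illustration, not the shipped classifier: the
reason strings reuse identifiers from this document, while the Go type names,
the action identifiers, and the reason-to-action mapping are assumptions.

```go
package main

// DriftReason is a classified recovery reason. The string values reuse
// identifiers from this document; the Go names are hypothetical.
type DriftReason string

const (
	ReasonServerTLSUntrusted        DriftReason = "server_tls_untrusted"
	ReasonEndpointProfileDrift      DriftReason = "endpoint_profile_drift"
	ReasonRecoveryEnrollmentBlocked DriftReason = "recovery_enrollment_blocked"
	ReasonEdgeRateLimited           DriftReason = "edge_rate_limited"
	ReasonNodeStreamTimeout         DriftReason = "node_stream_timeout"
	ReasonNodeUnavailable           DriftReason = "node_unavailable" // generic fallback
)

// NextAction names the operator-facing next step (hypothetical identifiers).
type NextAction string

const (
	ActionRepairCerts           NextAction = "repair_certs"
	ActionRepairEndpointProfile NextAction = "repair_endpoint_profile"
	ActionCollectDiagnostics    NextAction = "collect_host_diagnostics"
)

// nextActionFor is an illustrative mapping, not the shipped policy.
var nextActionFor = map[DriftReason]NextAction{
	ReasonServerTLSUntrusted:        ActionRepairCerts,
	ReasonEndpointProfileDrift:      ActionRepairEndpointProfile,
	ReasonRecoveryEnrollmentBlocked: ActionRepairCerts,
	ReasonEdgeRateLimited:           ActionRepairEndpointProfile,
	ReasonNodeStreamTimeout:         ActionRepairEndpointProfile,
}

// Classification is what a read model could show to an operator.
type Classification struct {
	Reason   DriftReason
	Action   NextAction
	Fallback bool // true only when no specific evidence was available
}

// Classify returns the first evidence item that has a known next action.
// Without such evidence it returns the generic node_unavailable fallback.
func Classify(evidence []DriftReason) Classification {
	for _, reason := range evidence {
		if action, ok := nextActionFor[reason]; ok {
			return Classification{Reason: reason, Action: action}
		}
	}
	return Classification{
		Reason:   ReasonNodeUnavailable,
		Action:   ActionCollectDiagnostics,
		Fallback: true,
	}
}
```

Keeping `Fallback` explicit lets a UI distinguish "we know the reason" from
"we are still collecting evidence" instead of showing both as the same state.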

## Implementation Status

Implemented:

- local certificate expiry detection before task polling,
- automatic recovery enrollment when a current `GPUAAS_ENROLLMENT_TOKEN` exists,
- bounded retry/backoff for disconnected nodes,
- stale queued/dispatched task expiry and V3 latest-operation filtering,
- admin re-enrollment bundle that rewrites endpoint URLs and CA bundle files.
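
For reference, bounded retry/backoff of the kind listed above can be sketched
as capped exponential backoff. This is an illustrative sketch only: the
function name, parameters, and defaults are assumptions rather than the
`cmd/node-agent` implementation, and jitter is omitted to keep it deterministic.

```go
package main

import "time"

// Backoff returns the delay before retry number `attempt` (0-based):
// base, 2*base, 4*base, ... and never more than max.
// Hypothetical helper; a real agent would usually add jitter.
func Backoff(attempt int, base, max time.Duration) time.Duration {
	if base <= 0 {
		base = time.Second // assumed default for invalid input
	}
	if max < base {
		max = base
	}
	d := base
	for i := 0; i < attempt; i++ {
		// Doubling would exceed the cap, so stop at the cap.
		// Comparing against max/2 also avoids integer overflow.
		if d > max/2 {
			return max
		}
		d *= 2
	}
	return d
}
```

With `base = time.Second` and `max = time.Minute`, the delays grow as 1s, 2s,
4s, 8s, 16s, 32s and then stay at 60s, so a disconnected node keeps retrying
without flooding the edge.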

Still required:

- operator-visible reason fields for the states above,
- active API signing CA versus ingress trusted CA drift validation,
- cloned identity detection and fencing,
- explicit recovery path for a lost cert/key,
- local node-agent diagnostic file/journald evidence for failures that block API
  reporting,
- bootstrap-pinned recovery trust or equivalent profile-repair path for stale
  server trust bundles that block recovery enrollment before API auth,
- node-facing terminal endpoint/profile reachability validation so
  `terminal.open` task completion cannot be mistaken for browser-session
  readiness,
- host reuse/reactivation semantics after uninstall or rebootstrap,
- read-model/API surface that shows the recovery evidence checklist.
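
As one possible shape for the signing CA versus ingress trusted CA validation
item above, the check can compare SHA-256 fingerprints of the certificates in
each PEM bundle. This is a sketch under assumptions: the function names are
hypothetical, and it fingerprints the raw DER bytes of each `CERTIFICATE` block
without parsing or validating the certificates themselves.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/pem"
)

// fingerprints returns the SHA-256 fingerprint (hex) of the DER bytes of
// every CERTIFICATE block in a PEM bundle. Other block types are skipped.
func fingerprints(bundle []byte) map[string]bool {
	out := map[string]bool{}
	for {
		var block *pem.Block
		block, bundle = pem.Decode(bundle)
		if block == nil {
			return out // no further PEM blocks
		}
		if block.Type != "CERTIFICATE" {
			continue
		}
		sum := sha256.Sum256(block.Bytes)
		out[hex.EncodeToString(sum[:])] = true
	}
}

// IngressTrustsSigningCA reports whether every certificate in the API
// signing CA bundle also appears in the ingress trust bundle. An empty
// signing bundle is treated as drift, matching the case where the API
// restarts without a persisted signing CA.
func IngressTrustsSigningCA(signingCA, ingressTrust []byte) bool {
	signing := fingerprints(signingCA)
	if len(signing) == 0 {
		return false
	}
	trusted := fingerprints(ingressTrust)
	for fp := range signing {
		if !trusted[fp] {
			return false
		}
	}
	return true
}
```

A preflight could run this at API startup and after profile rendering, and
refuse to report the node ingress path healthy while it returns `false`.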

2026-05-21 local-kind recurrence:

- `compute-node-01` and `compute-node-02` surfaced as generic
  `node_unavailable` after Docker/kind/cloudflared disruption.
- SSH/journald on `compute-node-01` showed `gpuaas-node-agent` was active, but
  task polling and recovery enrollment both failed with server TLS verification
  errors against `node-api.gpuaas.test`.
- Manual repair replaced stale `/etc/gpuaas/ca-bundle.crt` with the active
  node-api certificate and restarted the agent; the node immediately completed
  a status task.
- Evidence: `doc/operations/evidence/node_agent_stability_kind_recurrence_2026-05-21.md`.

2026-05-22 follow-up:

- local-kind again surfaced node-agent fragility after host/runtime disruption.
  The read-model must classify such failures by profile, trust, or transport
  evidence rather than collapsing them into a single `node_unavailable` bucket.
- demo terminal validation exposed an adjacent transport blind spot: the
  node-agent completed the `terminal.open` task, but the gateway never observed
  the node-facing internal WebSocket connection. Task success is therefore only
  "open command accepted"; browser terminal readiness requires the node stream
  to connect before the gateway timeout.
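
The distinction between "open command accepted" and session readiness can be
sketched as a bounded wait on the node stream. This is a minimal illustration,
assuming the gateway signals the node-facing WebSocket connection on a channel;
the function, channel, and error variable names are hypothetical, and only the
`node_stream_timeout` reason string comes from this document.

```go
package main

import (
	"errors"
	"time"
)

// ErrNodeStreamTimeout carries the node_stream_timeout reason used in
// the drift matrix. The variable name is hypothetical.
var ErrNodeStreamTimeout = errors.New("node_stream_timeout")

// WaitForNodeStream blocks until the node-facing stream is reported as
// connected (a send on, or close of, `connected`) or the timeout elapses.
// It should run after the terminal.open task has already succeeded,
// because task success alone does not prove the stream exists.
func WaitForNodeStream(connected <-chan struct{}, timeout time.Duration) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-connected:
		return nil // session is ready for the browser side
	case <-timer.C:
		return ErrNodeStreamTimeout // fail fast instead of waiting forever
	}
}
```

A gateway using this shape would mark the browser session ready only when
`WaitForNodeStream` returns `nil`, and would surface `node_stream_timeout` as a
node-facing endpoint/profile problem rather than a node re-enrollment trigger.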
