# Node Agent And Terminal Preflight Runbook

Status: draft
Owner: Platform Operations
Last reviewed: 2026-05-25

Use this runbook when a node appears unreachable, an allocation terminal stays
at "Connecting...", or a demo/promotion smoke test proves the allocation
lifecycle but fails browser terminal readiness.

This runbook covers two related but distinct failure domains:

1. Node-agent connectivity: can the worker node reach the control plane, poll
   tasks, authenticate, and report task results?
2. Terminal readiness: after token mint and browser WebSocket upgrade, can the
   node-agent open the node-facing terminal stream back to the terminal gateway?

The common trap is treating `terminal.open` task success as terminal readiness.
Node-agent builds after 2026-05-25 must not report `terminal.open` success until
the node-facing internal WebSocket connects and the PTY starts. Older agents may
still report a false success after accepting the command; the browser session is
ready only after the terminal gateway observes the node-facing internal
WebSocket and starts the stream.

## First Five Minutes

Collect these identifiers before changing anything:

- `allocation_id`
- `node_id`
- `correlation_id` from the failing API/UI request
- terminal `session_id` if the gateway created one
- node hostname, provider, and private IP
- expected `GPUAAS_API_URL`
- expected node-facing terminal URL, usually `GPUAAS_TERMINAL_API_URL`
- root filesystem usage and diagnostic path writability

Run the checks in this order. Stop at the first failed hop.

1. Confirm the allocation is active and bound to the expected node.
2. Confirm the node-agent is recently reporting heartbeat/task poll activity.
3. From the node, check `df -h /` and confirm the node-agent diagnostic path is
   writable. A full root filesystem can block cert recovery, task evidence, and
   terminal startup while presenting as a generic "node-agent unreachable"
   symptom.
4. From the node, test TCP/TLS reachability to `GPUAAS_API_URL`.
5. Confirm a simple node task can be claimed and its result is posted.
6. Mint a terminal token through the API for the allocation.
7. Confirm the browser-facing terminal gateway accepts or rejects the WebSocket
   intentionally. It must not silently hang.
8. Confirm the API enqueues `terminal.open` and the node-agent completes it.
9. From the node, test TCP/TLS reachability to `GPUAAS_TERMINAL_API_URL`.
10. Confirm the terminal gateway logs the node-facing internal WebSocket for the
    same `session_id`.
11. Confirm the browser receives a ready or explicit error frame. "Connecting..."
    past the node-stream timeout is a product bug.
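
The node-side hops in steps 3, 4, and 9 can be sketched as small shell
functions. This is a minimal sketch, not the preflight helper itself: the
function names are hypothetical, the diagnostic directory is passed in because
its real path is not stated here, and the TLS probe only attempts a handshake
without validating the certificate chain.

```bash
#!/usr/bin/env bash
# Hypothetical node-side helpers for steps 3, 4, and 9 (illustrative only).

# Print "host:port" for an http(s) URL; default to 80 or 443 when no port is set.
# Assumes a hostname or IPv4 address (IPv6 literals are not handled).
url_host_port() {
  local url="$1" scheme rest hostport
  scheme="${url%%://*}"
  rest="${url#*://}"
  hostport="${rest%%/*}"
  case "$hostport" in
    *:*) printf '%s\n' "$hostport" ;;
    *)
      if [ "$scheme" = "http" ]; then
        printf '%s:80\n' "$hostport"
      else
        printf '%s:443\n' "$hostport"
      fi
      ;;
  esac
}

# Print root filesystem usage as an integer percentage (POSIX df output format).
root_usage_pct() {
  df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Succeed only if a file can be created and removed in the given directory.
dir_writable() {
  local dir="$1" probe
  probe="$(mktemp "$dir/.preflight.XXXXXX" 2>/dev/null)" || return 1
  rm -f "$probe"
}

# Attempt a TCP connect and TLS handshake to "host:port" with a 10 second cap.
# This does not validate the certificate chain.
tls_reachable() {
  local hostport="$1"
  timeout 10 openssl s_client -connect "$hostport" \
    -servername "${hostport%%:*}" </dev/null >/dev/null 2>&1
}

# Example, run on the node (stop at the first failed hop):
#   root_usage_pct
#   dir_writable /path/to/diagnostics || echo "failed hop: diagnostic path"
#   tls_reachable "$(url_host_port "$GPUAAS_API_URL")" || echo "failed hop: API TCP/TLS"
#   tls_reachable "$(url_host_port "$GPUAAS_TERMINAL_API_URL")" || echo "failed hop: terminal TCP/TLS"
```

Keeping the API and terminal probes separate preserves the distinction between
the two failure domains: a node can pass the API probe and still fail the
terminal one.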

## Node-Agent Connectivity Matrix

| Symptom | Fast preflight | Likely owner | Recovery | Dashboard signal |
|---|---|---|---|---|
| Node status stale or unreachable | Check latest node heartbeat/report timestamp and node-agent service status | Node runtime / provider network | Restart node-agent if local; otherwise repair route/DNS/firewall first | `node_agent_heartbeat_stale` |
| Node cannot reach API | From node: connect to API host and port in `GPUAAS_API_URL` | Environment networking | Fix VPN/Tailscale/private route, DNS, firewall, or edge profile | `node_agent_api_unreachable` |
| API TLS fails from node | From node: TLS probe to API host with current trust bundle | PKI / edge profile | Refresh trust bundle or repair signing/trust CA drift | `node_agent_api_tls_failed` |
| Node-agent logs auth failures | Check enrollment cert/token and API auth response code | PKI / node enrollment | Re-enroll, repair cert, or revoke cloned identity | `node_agent_auth_failed` |
| Node-agent polls but cannot claim tasks | Check task signing/key validation errors | Control plane / node-agent version | Roll node-agent to compatible build or repair task signing config | `node_task_claim_failed` |
| Node claims tasks but result post fails | Check API result endpoint reachability and node-agent logs | API / node-agent network | Repair API route or retry result post; do not mark app runtime failed | `node_task_result_post_failed` |
| Node-agent process exits | `systemctl status gpuaas-node-agent` and journald | Node runtime | Restart; inspect binary arch, env file, cert paths, and permissions | `node_agent_process_down` |
| Node root filesystem is full | `df -h /`; diagnostic writes fail with `no space left on device`; large journald, runtime, metrics, or local archive usage | Node runtime / observability | Free space, stop runaway local buffers, rotate/archive aggressively, then rerun cert and terminal preflight | `disk_full` or `node_observability_disk_pressure` |
| Exec format error or missing binary | Check node-agent binary arch and version | Bootstrap / release packaging | Install correct multi-arch node-agent package | `node_agent_binary_invalid` |
| SSH to node fails but node-agent polls | Compare SSH path vs node-agent API path | Ops access / provider network | Fix SSH/user/firewall separately; do not classify as node-agent down | `node_ops_ssh_failed` |
| SSH resets and QEMU guest agent is down while node-agent polls | Compare API heartbeat with SSH and provider guest-agent status | Provider VM runtime / ops access | Repair guest OS access path or recreate worker; node-agent liveness alone is not enough for day-2 operability | `node_ops_access_unhealthy` |

## Terminal Failure Matrix

| Symptom | Fast preflight | Likely owner | Recovery | Dashboard signal |
|---|---|---|---|---|
| Token mint fails | API response code and error envelope | API / authz / allocation state | Fix session, role, project context, or allocation active state | `terminal_token_mint_failed` |
| Browser WebSocket gets route error | Probe browser-facing terminal health path and, for Pomerium routes, run the authenticated edge WebSocket smoke | Edge / Pomerium / Cloudflare | Repair terminal host route, health path, and WebSocket upgrade config | `terminal_public_route_missing` |
| Browser WebSocket upgrade fails | Gateway logs and browser console close code | Terminal gateway / edge | Repair route, token subprotocol, Redis/session binding, or gateway health | `terminal_gateway_upgrade_failed` |
| Gateway accepts browser, then times out | Gateway close reason `node_stream_timeout` | Node-facing terminal endpoint/profile | Test `GPUAAS_TERMINAL_API_URL` from node; fix route, port, DNS, or firewall | `terminal_node_stream_timeout` |
| `terminal.open` task completes but no node stream arrives | Compare node task result with gateway internal stream logs for same `session_id` | Node-agent terminal callback path / stale node-agent build | Upgrade node-agent to a build that gates success on stream readiness, then fix node-facing WebSocket endpoint/profile | `terminal_open_without_stream` |
| Node stream reaches gateway but mTLS rejected | Gateway TLS/client-cert error logs | PKI / terminal gateway | Repair node cert, CA trust, SAN, or gateway client-cert policy | `terminal_node_stream_mtls_failed` |
| Node stream connected but PTY fails | Node-agent terminal logs for user/shell/pty errors | Node runtime / allocation user setup | Repair allocation user, shell, sudo/pty permissions, or cleanup drift | `terminal_pty_failed` |
| Browser stays "Connecting..." after timeout | UI does not render gateway error frame | Web UX / terminal gateway contract | Return structured error frame and map it to user-visible recovery text | `terminal_ui_connecting_stuck` |
| Terminal works in kind but not demo | Compare node reachability to node-facing terminal endpoint in each env | Environment profile | Fix profile-specific edge/node route; record as env drift | `terminal_env_profile_drift` |

## Required Preflight Helper

The intended helper should be runnable by ops before handing an environment to
testers:

```bash
scripts/ops/node_agent_terminal_preflight.sh \
  --api-base https://aicloud-demo-api.core42.dev \
  --terminal-base https://aicloud-demo-term.core42.dev \
  --allocation-id <allocation_id> \
  --node-id <node_id> \
  --node-host <private_ip_or_hostname>
```

For provider-backed workers that cannot resolve private node-facing names
through DNS yet, pass the explicit node runtime and terminal endpoints plus the
private resolve targets:

```bash
scripts/ops/node_agent_terminal_preflight.sh \
  --api-base https://aicloud-kind-api.core42.dev \
  --terminal-base https://aicloud-kind-term.core42.dev \
  --node-id <node_id> \
  --node-host <private_ip_or_hostname> \
  --node-user ubuntu \
  --node-runtime-api-url https://node-api.gpuaas.test \
  --node-runtime-resolve-address <platform_private_ip> \
  --node-terminal-url https://term.gpuaas.test:18443 \
  --node-terminal-resolve-address <platform_private_ip>
```

When the node runtime URL is a node-facing API endpoint, the helper probes
`/internal/v1/nodes/{node_id}/tasks/wait` with the node certificate and key
from `/etc/gpuaas`. When the node terminal URL targets the node-facing terminal
listener (`term.*` or `:18443`), the helper also probes with the node
certificate and key. A plain public Cloudflare tunnel is not sufficient for
these checks because it terminates TLS before the API/gateway can see the node
client certificate.
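
A manual equivalent of the mTLS probe can be sketched with `curl`, pinning the
node-facing hostname to a private address with `--resolve` instead of editing
`/etc/hosts`. The certificate, key, and CA file names under `/etc/gpuaas` are
assumptions, as are the function names; the sketch also assumes an `https` URL
and does not decide which HTTP status counts as healthy.

```bash
#!/usr/bin/env bash
# Hypothetical manual mTLS probe; file names under /etc/gpuaas are assumed.

# Print one curl argument per line for a node-facing probe.
# $1 = URL, $2 = private resolve address (may be empty), $3 = cert directory.
build_mtls_probe_args() {
  local url="$1" resolve_addr="$2" cert_dir="${3:-/etc/gpuaas}"
  local rest hostport host port
  rest="${url#*://}"
  hostport="${rest%%/*}"
  host="${hostport%%:*}"
  port=443
  case "$hostport" in *:*) port="${hostport##*:}" ;; esac
  printf '%s\n' \
    --silent --show-error --max-time 10 \
    --output /dev/null --write-out '%{http_code}' \
    --cert "$cert_dir/node.crt" --key "$cert_dir/node.key" \
    --cacert "$cert_dir/ca.crt"
  # Pin host:port to the private address only when one was supplied.
  if [ -n "$resolve_addr" ]; then
    printf '%s\n' --resolve "$host:$port:$resolve_addr"
  fi
  printf '%s\n' "$url"
}

# Run the probe and print the HTTP status code that curl observed.
mtls_probe() {
  local args=()
  mapfile -t args < <(build_mtls_probe_args "$@")
  curl "${args[@]}"
}

# Example, run on the node:
#   mtls_probe "https://node-api.gpuaas.test/internal/v1/nodes/<node_id>/tasks/wait" <platform_private_ip>
#   mtls_probe "https://term.gpuaas.test:18443/" <platform_private_ip>
```

A TLS or client-certificate failure from this probe points at PKI or the
endpoint profile, while a connect failure points at route, DNS, or firewall.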

For kind routes fronted by Pomerium/Cloudflare, `--run-ws-smoke` without an
edge-auth cookie is not browser-equivalent proof. Use
`scripts/ops/pomerium_kind_ws_authenticated_smoke.sh` for the authenticated
terminal and notification WebSocket edge parity check. The preflight helper's
`terminal_public_route` hop checks the terminal gateway `/healthz` route; the
gateway root may legitimately return an edge/Pomerium 404 and should not be
used as the route-health signal.

Minimum report sections:

- allocation binding and state
- node-agent heartbeat freshness
- root filesystem usage and diagnostic path writability
- API TCP/TLS reachability from the node
- node task claim/result round trip
- terminal token mint
- browser-facing terminal WebSocket route classification
- `terminal.open` task lifecycle
- node-facing terminal WebSocket TCP/TLS reachability from the node
- gateway observation of internal node stream
- final classification and recommended owner

The helper should output a single machine-readable JSON summary and a concise
human-readable table. A failed check should identify the failed hop, not just
return "terminal failed."
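
One way the two outputs could fit together is sketched below. The input format
(one `hop status` pair per line, in chain order), the JSON field names, and the
function names are illustrative assumptions, not the helper's actual contract.
Any status other than `ok` is treated as the failed hop, and hop names are
assumed to be plain identifiers that need no JSON escaping.

```bash
#!/usr/bin/env bash
# Hypothetical summary step for a preflight run (illustrative only).

# Read "hop status" lines on stdin and print the first hop that is not "ok",
# or "none" when every hop passed.
first_failed_hop() {
  awk '$2 != "ok" { print $1; found = 1; exit } END { if (!found) print "none" }'
}

# Print a human-readable table on stderr and one JSON line on stdout.
summarize_preflight() {
  local results="$1" failed status
  failed="$(printf '%s\n' "$results" | first_failed_hop)"
  if [ "$failed" = "none" ]; then status=pass; else status=fail; fi
  printf '%-32s %s\n' HOP STATUS >&2
  printf '%s\n' "$results" | awk '{ printf "%-32s %s\n", $1, $2 }' >&2
  printf '{"status":"%s","failed_hop":"%s"}\n' "$status" "$failed"
}

# Example:
#   summarize_preflight $'node_heartbeat ok\nterminal_public_route ok\nnode_terminal_reachability failed'
```

Writing the table to stderr and the JSON to stdout lets a caller pipe the
summary into other tooling while an operator still sees the table.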

## Dashboard Requirements

The platform dashboard should eventually expose the same chain so an operator
does not need direct SQL or ad hoc log scraping:

- node-agent status: `healthy`, `stale`, `auth_failed`, `api_unreachable`,
  `task_result_failed`, `unknown`
- terminal readiness: `token_ready`, `browser_ws_ready`, `terminal_open_sent`,
  `node_stream_connected`, `pty_ready`, `failed`
- last failure reason with `correlation_id`
- environment profile and expected node-facing terminal endpoint
- last successful preflight timestamp
- one-click evidence bundle with gateway logs, node task IDs, and node-agent
  diagnostic output

## Chaos And Negative Test Scenarios

Create repeatable scenarios for the two critical domains.

Node-agent connectivity:

- block node egress to API host
- break node API DNS
- replace API trust bundle with stale CA
- stop node-agent service
- run incompatible node-agent binary
- make result-post endpoint unreachable after task claim
- rotate node cert while node-agent is running

Terminal:

- remove browser-facing terminal route
- make Redis/session binding unavailable
- block node egress to node-facing terminal endpoint
- point `GPUAAS_TERMINAL_API_URL` at the wrong host or port
- make gateway reject node client cert
- remove allocation OS user before terminal open
- force PTY shell failure
- delay node stream beyond timeout and verify UI receives structured failure

Each scenario should assert both recovery behavior and operator evidence:

- the user sees a bounded, actionable error
- the operator dashboard names the failed hop
- audit/evidence includes the same `correlation_id`
- a healthy retry after repair succeeds without manual database edits

### Concrete Chaos Scenario Matrix

Use this matrix to turn the scenario list into repeatable day-2 checks. Prefer
temporary route, DNS, service, or config changes that can be reverted without
database edits. Capture the preflight JSON before repair, repair the fault, and
rerun the same helper command to prove recovery.

| Scenario | Injection | Expected failed hop | Expected signal | Evidence to capture | Recovery proof |
|---|---|---|---|---|---|
| Node-agent service stopped | Stop `gpuaas-node-agent` on the worker or block its supervisor restart | `node_heartbeat` or `node_recovery_status` | `node_agent_heartbeat_stale` | Node read model heartbeat summary, journald service state, allocation/node IDs | Restart service; heartbeat becomes current and task poll resumes |
| Node-facing API unreachable | Block node egress to the private API host/port, or remove the node-facing API host resolve entry | `node_runtime_api_reachability` | `node_agent_api_unreachable` | Helper SSH probe detail, node-agent poll errors, endpoint profile | Restore route/DNS/firewall; `/internal/v1/nodes/{node_id}/tasks/wait` probe succeeds |
| Node API mTLS/auth rejected | Replace node cert with expired/stale cert in a disposable test node or point node-agent at a public edge that cannot pass client certs | `node_recovery_status` | `node_agent_auth_failed` | API auth response, node-agent auth logs, cert serial/SAN, profile URL | Re-enroll/restore cert and confirm node recovery status returns healthy |
| Node disk pressure blocks recovery | Fill a disposable worker root filesystem past the configured pressure threshold, or redirect bounded test logs into a small filesystem | `node_recovery_status` or `node_readiness` | `disk_full` or `node_observability_disk_pressure` | `df -h /`, diagnostic write result, journald/log-buffer usage, node-agent local diagnostic evidence | Free space and rotate/drop observability buffers; node-agent renews or enrolls and heartbeat returns without database repair |
| Task result post blocked | Allow task claim, then block result-post path or API egress before completion | `node_recovery_status` or latest operation | `node_task_result_post_failed` | Latest operation ID/status, node-agent result post error, correlation ID | Restore API reachability; queued result posts complete or retry cleanly |
| Browser terminal route missing | Remove or mispoint terminal public host/health path while API remains healthy | `terminal_public_route` | `terminal_public_route_missing` | HTTP status from terminal `/healthz`, edge route config, browser close/error | Restore route; helper sees terminal `/healthz` 200 and authenticated Pomerium WebSocket smoke succeeds |
| Browser gateway upgrade rejected | Break WebSocket upgrade headers, token subprotocol forwarding, or Redis/session binding | `browser_terminal_ws` | `terminal_gateway_upgrade_failed` or `terminal_open_without_stream` | Gateway close code/reason, Redis/session binding, token mint correlation ID | Restore edge/gateway/session path; smoke receives ready or structured terminal error |
| Node terminal stream blocked | Block node egress to `GPUAAS_TERMINAL_API_URL` or remove the node-facing terminal resolve entry | `node_terminal_reachability` | `terminal_node_stream_timeout` | Helper node-side terminal probe, gateway `node_stream_timeout`, node endpoint URL | Restore route/DNS/firewall; gateway logs internal node WebSocket for same session |
| Node terminal mTLS rejected | Require client cert and present stale/untrusted node cert to the terminal node listener | `workload_terminal_startup_failure` or gateway logs | `terminal_node_stream_mtls_failed` | Gateway TLS/client-cert error, cert serial/SAN, correlation ID | Restore trusted node cert/CA policy; internal stream connects |
| Allocation OS user missing | Remove or rename the allocation user before terminal open in a disposable allocation | `workload_terminal_startup_failure` | `terminal_pty_failed` | Node-agent PTY/user error, allocation ID, user setup task evidence | Recreate allocation user through provisioning cleanup/setup; terminal reaches PTY |
| PTY readiness delayed | Delay shell/PTY startup beyond gateway timeout | `workload_terminal_startup_failure` or `browser_terminal_ws` | `terminal_node_stream_timeout` or `terminal_pty_failed` | Gateway timeout/error frame, node-agent PTY timing logs, browser-visible message | Remove delay; user receives ready frame within timeout |
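
For the "Node terminal stream blocked" row, the inject, capture, repair, and
rerun loop might look like the sketch below. It assumes the disposable worker
filters egress with `iptables`, which may not hold for every provider, and the
function names are hypothetical. The first function only prints the paired
commands so that the revert is recorded before anything is injected.

```bash
#!/usr/bin/env bash
# Hypothetical chaos helpers; assumes iptables on a disposable worker.

# Print (do not run) the paired inject and revert commands for blocking
# node egress to a single IP and TCP port.
egress_block_commands() {
  local ip="$1" port="$2"
  printf 'inject: iptables -I OUTPUT -p tcp -d %s --dport %s -j REJECT\n' "$ip" "$port"
  printf 'revert: iptables -D OUTPUT -p tcp -d %s --dport %s -j REJECT\n' "$ip" "$port"
}

# Succeed only when the pre-repair run failed at the expected hop and the
# post-repair run reported no failed hop ("none").
recovery_proven() {
  local before="$1" after="$2" expected="$3"
  [ "$before" = "$expected" ] && [ "$after" = "none" ]
}

# Example flow:
#   egress_block_commands <platform_private_ip> 18443    # review, then run the inject line
#   ...run the preflight helper, record the failed hop as $before...
#   ...run the revert line, rerun the same helper command, record $after...
#   recovery_proven "$before" "$after" node_terminal_reachability && echo "recovery proven"
```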

### Network Path Lessons From MAAS-LXD And Proxmox

The `local-maas-lxd` path is asymmetric by default. A MAAS-LXD VM on a
Mac/UTM host can usually reach the control plane through VPN/Tailscale and can
prove node-agent polling, but the terminal callback still depends on the node
resolving and reaching the private node-facing terminal endpoint. Public
Cloudflare hosts prove browser/bootstrap reachability only; they do not prove
that the API or terminal gateway can see the node client certificate.

When the local MAAS-LXD worker used the public or unresolved terminal profile,
the browser gateway accepted the browser WebSocket and the node-agent reached
`terminal.open`, but the gateway timed out because no internal node stream
arrived. The expected classification is `terminal_node_stream_timeout` if the
node cannot reach the endpoint, or `terminal_open_without_stream` if the task
completion exists but no matching gateway node-stream evidence exists.
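
The classification rule in the paragraph above can be written down as a small
decision function. This is a sketch under assumptions: the `yes`/`no` inputs
are observations an operator or helper has already gathered, the function name
and the `node_stream_present` and `unclassified` labels are hypothetical, and
checking reachability before task completion is one possible ordering.

```bash
#!/usr/bin/env bash
# Hypothetical classifier for a missing node-facing terminal stream.

# $1 = can the node reach the node-facing terminal endpoint? (yes/no)
# $2 = did the terminal.open task report completion?         (yes/no)
# $3 = did the gateway log a node stream for the session_id?  (yes/no)
classify_missing_node_stream() {
  local node_reaches_endpoint="$1" open_task_completed="$2" gateway_saw_stream="$3"
  if [ "$gateway_saw_stream" = "yes" ]; then
    echo "node_stream_present"            # hypothetical label: nothing to classify
  elif [ "$node_reaches_endpoint" = "no" ]; then
    echo "terminal_node_stream_timeout"   # node cannot reach the endpoint
  elif [ "$open_task_completed" = "yes" ]; then
    echo "terminal_open_without_stream"   # task completed, no gateway evidence
  else
    echo "unclassified"                   # hypothetical label: gather more evidence
  fi
}

# Example:
#   classify_missing_node_stream no yes no
```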

Proxmox demo workers live on the demo/VPN side and are the better validation
target for demo terminal readiness. If Proxmox still reports
`node_stream_timeout`, treat it as terminal endpoint/profile drift or gateway
listener policy drift, not as the local MAAS-LXD host-network asymmetry.

## Environment Notes

MAAS-LXD on a local Mac/UTM host can reach demo through VPN/Tailscale, but demo
cannot reach back into Mac-side VM networks unless a return route is installed.
That setup can validate API/task polling from the node if the node reaches the
control plane, but it cannot validate terminal node-stream callback unless the
node-facing terminal endpoint is reachable from the MAAS-LXD VM.

Kind/local MAAS-LXD has the same split in miniature:

- public Cloudflare hosts such as `aicloud-kind-api.core42.dev` are browser and
  bootstrap fetch edges;
- node-agent task polling must use the private node-facing API profile
  (`node-api.gpuaas.test` in the current local profile);
- node terminal streams must use the private node-facing terminal profile
  (`term.gpuaas.test:18443` in the current local profile);
- until private DNS exists, bootstrap or the preflight helper must materialize
  those names with a private resolve target.

If the node-agent logs `node identity revoked` while using a Cloudflare-fronted
API URL, first verify whether the request is reaching a node-facing mTLS
endpoint. A browser/public edge can be healthy while still being the wrong
runtime endpoint for node identity.

If the node-agent diagnostic reason is `edge_rate_limited`, treat it as an
endpoint/profile or edge-policy problem, not as a reason to increase retry
pressure. The agent deliberately slows task polling for this class. Confirm the
node is using the private node-facing API profile first; only then inspect
Cloudflare/WAF/rate-limit rules.

Proxmox demo nodes are a better fit for demo terminal validation because they
live on the demo/VPN side. If Proxmox terminal still fails with
`node_stream_timeout`, treat it as node-facing terminal endpoint/profile drift,
not as MAAS-LXD network asymmetry.

## Durable Endpoint Options

Option 2, bootstrap-managed materialization of the node-facing terminal
hostname, is the current bridge because it is fully controlled by the GPUaaS
environment profile and bootstrap automation:

- The environment profile declares a node-facing terminal hostname and port.
- Bootstrap writes `GPUAAS_TERMINAL_API_URL` with that hostname.
- Bootstrap materializes only the terminal hostname into node-local resolver
  state when a private DNS service is not available yet.
- This is acceptable as an automated bridge; manual per-node `/etc/hosts` edits
  are diagnostic only.

Keep the other two options in backlog for infra alignment:

- Internal DNS: own the node-facing hostname in the demo/VPN resolver so worker
  nodes resolve it privately without host-file materialization.
- Private load balancer: expose `gpuaas-terminal-gateway-node` on a stable
  private VIP/port and point internal DNS at that VIP instead of pinning
  bootstrap to a single platform-control host and Kubernetes NodePort.

## Immediate Network Plan

The current action split is documented in
`doc/architecture/Network_Reachability_Immediate_Plan_v1.md`.

Short version:

- GPUaaS owns endpoint profile correctness, automated host materialization for
  kind/local nodes, and preflight classification.
- Infra follow-up owns private DNS, private load balancers/VIPs, and production
  tenant/project network isolation.
- Until private DNS exists, use `scripts/ops/reconcile_kind_external_node.sh`
  for kind/local provider workers instead of manual `/etc/hosts` edits.
- Do not treat public Cloudflare/Pomerium browser hosts as proof of node-facing
  mTLS reachability.

## Related Runbooks

- `doc/operations/runbooks/Terminal_Gateway_Incident_Runbook.md`
- `doc/operations/runbooks/Node_Onboarding_Runbook.md`
- `doc/operations/runbooks/Node_Agent_Control_Plane_Recovery_2026-03.md`
- `doc/operations/runbooks/Pomerium_Host_Proxy_Incident_Runbook.md`