# Terminal Gateway Incident Runbook

## Trigger
- Terminal websocket sessions fail on the terminal-gateway runtime (`/ws/terminal/{allocation_id}`).
- Spike in terminal token replay rejections or websocket write/upgrade failures.
- Terminal gateway health checks fail, or an ingress route switch produces elevated 5xx responses or timeouts.
- Terminal stream relay degradation:
  - sustained stream setup failures
  - elevated relay write/drop errors
  - abnormal session churn (rapid connect/disconnect)

## Impact
- Users cannot establish terminal sessions to active allocations.
- Support load increases and admin operations may require gateway config rollback/redeploy.

## Immediate Mitigation
1. Confirm ingress route target for `/ws/terminal/*` points to `cmd/terminal-gateway`.
2. If customer impact is ongoing, execute gateway rollback path:
   - revert recent gateway deployment/config changes.
   - keep `/ws/terminal/*` contract and gateway route unchanged.
3. Freeze further terminal-gateway config changes until error rate stabilizes.

## Diagnosis
1. Check terminal-gateway process health and restart events.
2. Inspect gateway and ingress logs for websocket upgrade failures.
3. Validate terminal token consume/replay behavior and redis connectivity.
4. Verify network policy allows required gateway ingress/egress paths.
5. Confirm alert annotations map to this runbook in alert manifest/catalog.
6. Review terminal stream relay counters and error trends:
   - `ws_notifications_write_errors_total` (relay/write failure proxy)
   - `terminal_token_replay_rejected_total` (session control anomaly)
   - terminal stream relay service-specific counters if enabled
7. Perform correlation-id-first tracing:
   - capture `correlation_id` from error envelope/log/event first
   - pivot logs/traces/alerts using that correlation value across gateway, API, and worker paths

## Recovery
1. Restore known-good ingress route and policy set.
2. Re-run terminal websocket smoke checks.
3. Confirm token mint/consume path success for new sessions.
4. Re-enable full terminal-gateway traffic incrementally (canary/percentage) after stabilization.
5. Validate terminal stream relay recovery over a soak window before full traffic restore.

## Deploy Drain Policy
Terminal gateway pods are stateful for active browser terminal sessions. During
deploy, rollback, node drain, or manual pod termination:

1. `SIGTERM` marks the gateway as draining.
2. New `/ws/terminal/{allocation_id}` sessions are rejected with canonical
   `service_unavailable` and message `terminal gateway draining`.
3. Existing sessions continue until they close, hit TTL/idle timeout, or the
   drain timeout expires.
4. Kubernetes `terminationGracePeriodSeconds` must exceed
   `TERMINAL_GATEWAY_DRAIN_TIMEOUT_SECONDS`; the base manifest uses 45s grace
   for a 30s drain timeout.
5. If sessions are still active at the end of the drain window, the pod may
   terminate them. This is an operator-visible deploy event, not a silent
   application hang.

Operator expectations:
- Existing terminals may close during rollback or deploy, but they should close
  with controlled gateway behavior and users can reconnect with a new token.
- Do not reduce `terminationGracePeriodSeconds` below the configured drain
  timeout.
- During emergency rollback, prefer restoring the last known-good gateway image
  over disabling drain behavior.
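
The grace-period rule above can be sanity-checked before a rollout. This is a minimal sketch, assuming both values are already known as whole seconds; `check_drain_budget` is a hypothetical helper name, not part of the gateway tooling.

```shell
# Hypothetical helper: verify that terminationGracePeriodSeconds exceeds
# TERMINAL_GATEWAY_DRAIN_TIMEOUT_SECONDS. Both arguments are whole seconds.
check_drain_budget() {
  local grace="$1" drain="$2" v
  # Reject anything that is not a non-negative integer.
  for v in "$grace" "$drain"; do
    case "$v" in
      ''|*[!0-9]*) echo "invalid input: grace='$grace' drain='$drain'" >&2; return 2 ;;
    esac
  done
  # The grace period must be strictly greater than the drain timeout.
  if [ "$grace" -gt "$drain" ]; then
    echo "ok: grace ${grace}s exceeds drain timeout ${drain}s"
  else
    echo "unsafe: grace ${grace}s does not exceed drain timeout ${drain}s" >&2
    return 1
  fi
}
```

With the base manifest values, `check_drain_budget 45 30` passes; equal values fail, because the rule requires the grace period to exceed the drain timeout, not merely match it.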

## Post-Incident
- Record cutover/rollback timestamps and impacted session counts.
- Capture root cause and permanent corrective action.
- Update rollout evidence: `doc/operations/evidence/terminal_gateway_rollout_plan.md`.
- Add terminal stream relay incident notes and metric snapshots to on-call evidence log.

## Correlation Lookup Workflow
1. Start from user-visible/API failure and extract `correlation_id` from the returned error envelope.
2. Resolve the canonical `resource_name` for the impacted allocation/session context.
3. Search terminal-gateway logs by the same `correlation_id`.
4. Pivot to exact `resource_name` matches to connect API, gateway, and worker evidence deterministically.
5. Correlate with API logs and alert fire timeline.
6. Confirm final incident record includes one canonical `correlation_id` trail and `resource_name`.

Canonical `resource_name` format:
- `core42:aicloud:{region}:{tenant_id}:{project_id}:gpuaas/allocation:{allocation_id}`
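
To avoid typos when pivoting on exact matches, the name can be assembled from its four fields. This is a sketch only; `build_resource_name` is a hypothetical helper, and rejecting empty fields or fields containing `:` is an assumption made here to keep the result unambiguous, not a documented rule.

```shell
# Hypothetical helper: assemble the canonical resource_name for an allocation.
build_resource_name() {
  if [ "$#" -ne 4 ]; then
    echo "usage: build_resource_name REGION TENANT_ID PROJECT_ID ALLOCATION_ID" >&2
    return 2
  fi
  local part
  for part in "$@"; do
    # Assumption: empty fields or embedded colons would make the name ambiguous.
    case "$part" in
      ''|*:*) echo "invalid field: '$part'" >&2; return 2 ;;
    esac
  done
  printf 'core42:aicloud:%s:%s:%s:gpuaas/allocation:%s\n' "$1" "$2" "$3" "$4"
}
```

For example, `build_resource_name "$REGION" "$TENANT_ID" "$PROJECT_ID" "$ALLOCATION_ID"` prints a string that can be pasted directly into a log search.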

## Public Funnel / Browser WS Checks
Use this path when terminals work from a Tailscale-connected machine but fail from a non-Tailscale browser with repeated websocket errors.

Symptoms:
- Browser console shows websocket failures for `/ws/terminal/{allocation_id}`.
- `https://gpuaas-dev-term.tailfe39f5.ts.net/healthz` may be healthy.
- `https://gpuaas-kind-term.tailfe39f5.ts.net/healthz` may be healthy for kind demo environments.
- A direct websocket probe against `gpuaas-dev-term.tailfe39f5.ts.net` returns `101 Switching Protocols` with a fresh terminal token.
- The deployed web bundle still references an internal/private websocket host such as `wss://term.retired-dev-control.example.invalid`.

Checks:
1. Verify terminal Funnel health:
   - `curl -fsS https://gpuaas-dev-term.tailfe39f5.ts.net/healthz`
   - `curl -fsS https://gpuaas-kind-term.tailfe39f5.ts.net/healthz`
2. Verify the deployed configmap has browser-facing public WS bases:
   - `sudo k3s kubectl -n gpuaas-core get configmap gpuaas-core-config -o jsonpath='{.data.NEXT_PUBLIC_WS_BASE_URL}{"\n"}{.data.NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL}{"\n"}'`
   - Expected for platform-control demo: `wss://gpuaas-dev-term.tailfe39f5.ts.net` and `wss://gpuaas-dev-api.tailfe39f5.ts.net`.
   - Expected for kind demo: `wss://gpuaas-kind-term.tailfe39f5.ts.net` and `wss://gpuaas-kind-api.tailfe39f5.ts.net`.
3. Verify the served frontend bundle matches the deployed config:
   - `PLATFORM_CONTROL_WEB_URL=https://gpuaas-dev-app.tailfe39f5.ts.net scripts/ci/platform_control_web_runtime_assert.sh`
   - For kind, rebuild/redeploy the web image with `bash scripts/ops/build_kind_public_funnel_web.sh`; this bakes `NEXT_PUBLIC_WS_BASE_URL` into the Next.js bundle.
4. If a live allocation is available, mint a terminal token and probe the public gateway. Browser-compatible token transport carries only the token value in `Sec-WebSocket-Protocol`; do not send `?token=` or any other query-string auth.
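
The probe in step 4 could look roughly like the sketch below. It assumes `curl` is available and that the gateway needs no headers beyond the standard websocket handshake; `ws_probe` and `WS_PROBE_DRY_RUN` are hypothetical names introduced for illustration, and the way the token is minted is left to the existing tooling.

```shell
# Hypothetical helper: send a websocket upgrade request to the public
# terminal gateway. The token travels only in Sec-WebSocket-Protocol.
ws_probe() {
  local host="$1" allocation_id="$2" token="$3"
  if [ -z "$host" ] || [ -z "$allocation_id" ] || [ -z "$token" ]; then
    echo "usage: ws_probe HOST ALLOCATION_ID TOKEN" >&2
    return 2
  fi
  # No query string: the URL carries only the allocation path.
  local url="https://${host}/ws/terminal/${allocation_id}"
  # Sample nonce from RFC 6455; any base64-encoded 16-byte value works.
  local key="dGhlIHNhbXBsZSBub25jZQ=="
  set -- curl --http1.1 -sS -i -N --max-time 10 \
    -H "Connection: Upgrade" \
    -H "Upgrade: websocket" \
    -H "Sec-WebSocket-Version: 13" \
    -H "Sec-WebSocket-Key: ${key}" \
    -H "Sec-WebSocket-Protocol: ${token}" \
    "$url"
  if [ "${WS_PROBE_DRY_RUN:-0}" = "1" ]; then
    # Print the request arguments instead of sending, with the token masked.
    printf '%s\n' "$@" | sed 's/^Sec-WebSocket-Protocol: .*/Sec-WebSocket-Protocol: <redacted>/'
    return 0
  fi
  # Only the status line matters for this check.
  "$@" | head -n 1
}
```

A call such as `ws_probe gpuaas-dev-term.tailfe39f5.ts.net "$ALLOCATION_ID" "$TERMINAL_TOKEN"` should report a `101 Switching Protocols` status line when the public edge is healthy. Setting `WS_PROBE_DRY_RUN=1` shows the request without sending it, which is useful for confirming that no token appears in the URL.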

Fix:
1. Update platform-control web build defaults and dev-control configmap so `NEXT_PUBLIC_WS_BASE_URL` points to `wss://gpuaas-dev-term.tailfe39f5.ts.net`.
2. Update `NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL` to `wss://gpuaas-dev-api.tailfe39f5.ts.net`.
3. Promote `release/platform-control` and deploy a rebuilt web runtime. A configmap-only rollout is not sufficient because Next.js public env values are baked into the browser bundle at build time.
4. Confirm non-Tailscale browsers no longer direct terminal websocket traffic to `100.x` addresses or to retired IP-derived DNS names for private hosts.
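
Part of step 4 can be checked mechanically. The sketch below flags a websocket base URL whose host is a `100.x` IPv4 literal; `ws_base_is_private` is a hypothetical helper, and it deliberately does not cover the retired IP-derived DNS names, since their suffix is not given here.

```shell
# Hypothetical helper: succeed (exit 0) when the URL host is a 100.x IPv4
# literal, which a non-Tailscale browser cannot reach.
ws_base_is_private() {
  local url="$1" host
  # Strip the scheme, then any path, then any port, leaving the bare host.
  host="${url#*://}"
  host="${host%%/*}"
  host="${host%%:*}"
  printf '%s\n' "$host" | grep -Eq '^100\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}$'
}
```

For example, `ws_base_is_private "$WS_BASE" && echo "private host in bundle config"` could be run against the value read from the configmap or extracted from the served bundle.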

Environment helpers:
- Platform-control: `bash scripts/ops/platform_control_tailscale_funnel_edges.sh start-term && bash scripts/ops/platform_control_tailscale_funnel_edges.sh verify`
- Kind: `bash scripts/ops/kind_tailscale_funnel_edges.sh start-term && bash scripts/ops/kind_tailscale_funnel_edges.sh verify`

Important: app/api/auth public edges are not enough for browser terminals. The terminal gateway is a separate public edge because browsers must reach the websocket gateway directly.

## Operator Query Cheatsheet
Use these queries during terminal incidents, replacing the `<CORRELATION_ID>` and `<SESSION_ID>` placeholders:

1. Terminal gateway errors by correlation:
   - `{service="gpuaas-terminal-gateway"} | json | correlation_id="<CORRELATION_ID>" | level=~"ERROR|WARN"`
2. API-side token/session failures by correlation:
   - `{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>" | code=~"token_.*|service_unavailable|internal_error"`
3. Node-agent terminal stream events by session:
   - `{service="gpuaas-node-agent"} | json | session_id="<SESSION_ID>"`
4. Terminal websocket outcomes (5m):
   - `sum(rate(terminal_gateway_ws_events_total[5m])) by (outcome, reason)`
5. Token replay anomalies (5m):
   - `sum(rate(terminal_token_replay_rejected_total[5m]))`
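
Filling in the placeholders by hand is error-prone under pressure. The sketch below prints queries 1 and 2 for a given correlation ID; `terminal_incident_queries` is a hypothetical helper, and refusing IDs that contain quotes or backslashes is an assumption made so the value cannot break out of the quoted LogQL string.

```shell
# Hypothetical helper: print the gateway and API LogQL queries for one
# correlation ID, ready to paste into the log query UI.
terminal_incident_queries() {
  local cid="$1"
  # Refuse empty values and characters that would break the quoted string.
  case "$cid" in
    ''|*'"'*|*'\'*) echo "invalid correlation_id" >&2; return 2 ;;
  esac
  printf '{service="gpuaas-terminal-gateway"} | json | correlation_id="%s" | level=~"ERROR|WARN"\n' "$cid"
  printf '{service="gpuaas-api"} | json | correlation_id="%s" | code=~"token_.*|service_unavailable|internal_error"\n' "$cid"
}
```

Running `terminal_incident_queries "$CORRELATION_ID"` emits one query per line, gateway first and API second.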

Evidence capture minimum:
- `correlation_id`
- `trace_id` (if present in `details.trace_id` or response `X-Trace-ID`)
- `session_id` (terminal incidents)
- final `error code` and mitigation action
