# Terminal WebSocket Bridge Implementation Plan v1

## Status

Proposed

Build plan for:
- [Terminal WebSocket Bridge Architecture v1](./Terminal_WebSocket_Bridge_Architecture_v1.md)

## Purpose

Break the terminal redesign into implementation slices that can be delivered,
tested, and rolled out without falling back into incident-mode changes.

## Governing Decisions

- browser-facing terminal contract remains unchanged
- terminal-gateway becomes the live terminal byte bridge
- API remains session authority, not byte relay
- node-facing terminal path is a dedicated internal WebSocket over mTLS
- no per-frame Redis or DB access in the hot data path
- v1 sessions are non-resumable

## New ADR Required

Before or during slice 1, add a new ADR under `doc/architecture/adrs/`:

- `ADR-011-terminal-node-websocket-bridge.md`

That ADR should:
- supersede the current internal HTTP relay assumption for the terminal data plane
- reference:
  - [`ADR-005-terminal-gateway-isolation.md`](./adrs/ADR-005-terminal-gateway-isolation.md)
  - [`ADR-007-terminal-access-auth-model.md`](./adrs/ADR-007-terminal-access-auth-model.md)
  - [`Terminal_WebSocket_Bridge_Architecture_v1.md`](./Terminal_WebSocket_Bridge_Architecture_v1.md)

## Slice Order

### Slice 1: Session Authority And Broker Contract Cleanup

Owner:
- API + terminal service

Goal:
- make session binding a stable broker-owned control-state model before changing transport

In scope:
- normalize Redis session schema:
  - `terminal_session:{session_id}`
  - `terminal_allocation_active:{allocation_id}`
  - `terminal_gateway_sessions:{gateway_instance_id}`
- ensure `terminal.open` task payload carries everything node-agent needs for bridge connect
- define explicit close reasons and session states
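The schema and state work above could be pinned down in one small Go file so that API, gateway, and tests share the same key builders. This is a minimal sketch: the three key patterns come from this plan, while the state names, close-reason values, and the `CanTransition` helper are assumptions for illustration and not an agreed contract.

```go
package main

import "fmt"

// SessionState is the broker-owned lifecycle state of a terminal session.
// The three values below are assumed for illustration.
type SessionState string

const (
	StatePending SessionState = "pending" // token consumed, node not yet attached
	StateActive  SessionState = "active"  // browser and node sockets are bridged
	StateClosed  SessionState = "closed"  // final; v1 sessions are non-resumable
)

// CloseReason is recorded when a session ends. Values are hypothetical.
type CloseReason string

const (
	CloseBrowserDisconnect CloseReason = "browser_disconnect"
	CloseNodeDisconnect    CloseReason = "node_disconnect"
	CloseGatewayShutdown   CloseReason = "gateway_shutdown"
	CloseBackpressure      CloseReason = "backpressure_limit"
	CloseIdentityMismatch  CloseReason = "node_identity_mismatch"
)

// SessionKey builds the per-session Redis key.
func SessionKey(sessionID string) string {
	return fmt.Sprintf("terminal_session:%s", sessionID)
}

// AllocationActiveKey builds the key that marks the single active
// session for an allocation.
func AllocationActiveKey(allocationID string) string {
	return fmt.Sprintf("terminal_allocation_active:%s", allocationID)
}

// GatewaySessionsKey builds the key listing sessions owned by one
// gateway instance.
func GatewaySessionsKey(gatewayInstanceID string) string {
	return fmt.Sprintf("terminal_gateway_sessions:%s", gatewayInstanceID)
}

// CanTransition reports whether a state change is allowed. Closed is
// final because v1 sessions cannot be resumed.
func CanTransition(from, to SessionState) bool {
	switch from {
	case StatePending:
		return to == StateActive || to == StateClosed
	case StateActive:
		return to == StateClosed
	default:
		return false
	}
}
```

Keeping the transition table in one function makes the non-resumable rule testable on its own, before any transport changes land.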

Out of scope:
- node-facing internal websocket listener
- browser UI changes

Files likely touched:
- `packages/services/terminal/service.go`
- [`cmd/api/routes.go`](../../cmd/api/routes.go)
- [`doc/api/openapi.draft.yaml`](../api/openapi.draft.yaml) if contract text/fields change

Acceptance:
- unit tests for token consume + session binding creation/cleanup
- integration test for single active session per allocation
- audit/session state logs remain correct

### Slice 2: Terminal-Gateway Internal Node Listener

Owner:
- terminal-gateway

Goal:
- add dedicated node-facing internal WebSocket listener with native mTLS verification

In scope:
- second listener/port in `cmd/terminal-gateway`
- TLS client auth using node CA
- session lookup and node identity validation
- in-process browser socket <-> node socket bridge
- log the first upstream and the first downstream frame of each session

Out of scope:
- node-agent switching to the new path

Files likely touched:
- [`cmd/terminal-gateway/main.go`](../../cmd/terminal-gateway/main.go)
- [`cmd/terminal-gateway/routes.go`](../../cmd/terminal-gateway/routes.go)
- `packages/services/terminal/service.go`

Acceptance:
- local/integration test that gateway accepts node mTLS websocket
- session ownership is registered on connect and cleared on close
- no Redis pubsub used for live frame relay on this new path
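The "registered on connect and cleared on close" criterion can be exercised with a small in-process registry. The sketch below is a stand-in only: `SessionRegistry` and its methods are hypothetical names, and the real implementation would presumably also mirror ownership into `terminal_gateway_sessions:{gateway_instance_id}`.

```go
package main

import (
	"errors"
	"sync"
)

// ErrSessionOwned is returned when a second node connection tries to attach
// to a session that already has a live owner.
var ErrSessionOwned = errors.New("session already has a live node connection")

// SessionRegistry tracks which node currently owns each live session on this
// gateway instance. It is an in-memory sketch, not the production store.
type SessionRegistry struct {
	mu     sync.Mutex
	owners map[string]string // session ID -> node ID
}

func NewSessionRegistry() *SessionRegistry {
	return &SessionRegistry{owners: make(map[string]string)}
}

// Register claims a session for a node when its socket connects. A second
// claim fails rather than replacing the first, to avoid split-brain sessions.
func (r *SessionRegistry) Register(sessionID, nodeID string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.owners[sessionID]; ok {
		return ErrSessionOwned
	}
	r.owners[sessionID] = nodeID
	return nil
}

// Release clears ownership when the node socket closes. It is safe to call
// for a session that was never registered.
func (r *SessionRegistry) Release(sessionID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.owners, sessionID)
}

// Len reports the number of live sessions, which is useful in tests.
func (r *SessionRegistry) Len() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.owners)
}
```

A handler would typically call `Register` after identity validation and `defer` the matching `Release`, so ownership is cleared on every exit path.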

### Slice 3: Node-Agent Internal WebSocket Client

Owner:
- node-agent

Goal:
- replace HTTP terminal relay client with node-facing internal WebSocket client

In scope:
- on `terminal.open`, node-agent opens internal websocket to gateway
- PTY byte relay over binary frames
- resize/close/heartbeat over typed control frames
- explicit non-resumable close behavior
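One way to read the split above is that PTY bytes travel as opaque binary WebSocket frames while resize, close, and heartbeat travel as small typed messages. The sketch below assumes the control messages are JSON in text frames; the field names and the `DecodeControl` helper are hypothetical and the actual wire format remains to be defined in slice 2.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ControlFrame is an assumed JSON shape for typed control messages.
// PTY data does not use this type; it is relayed as raw binary frames.
type ControlFrame struct {
	Type   string `json:"type"`             // "resize", "close", or "heartbeat"
	Cols   int    `json:"cols,omitempty"`   // resize only
	Rows   int    `json:"rows,omitempty"`   // resize only
	Reason string `json:"reason,omitempty"` // close only
}

// DecodeControl parses and validates one control frame. Unknown types are
// rejected so that protocol drift fails loudly instead of being ignored.
func DecodeControl(raw []byte) (ControlFrame, error) {
	var f ControlFrame
	if err := json.Unmarshal(raw, &f); err != nil {
		return ControlFrame{}, fmt.Errorf("malformed control frame: %w", err)
	}
	switch f.Type {
	case "resize":
		if f.Cols <= 0 || f.Rows <= 0 {
			return ControlFrame{}, fmt.Errorf("resize needs positive cols and rows, got %dx%d", f.Cols, f.Rows)
		}
	case "close":
		if f.Reason == "" {
			return ControlFrame{}, fmt.Errorf("close frame must carry an explicit reason")
		}
	case "heartbeat":
		// No payload to validate.
	default:
		return ControlFrame{}, fmt.Errorf("unknown control frame type %q", f.Type)
	}
	return f, nil
}
```

Requiring a reason on every close frame keeps close-reason propagation checkable in the node-agent unit tests.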

Out of scope:
- lifecycle/task polling transport changes

Files likely touched:
- [`cmd/node-agent/terminal_stream.go`](../../cmd/node-agent/terminal_stream.go)
- [`cmd/node-agent/config.go`](../../cmd/node-agent/config.go)

Acceptance:
- node-agent unit tests for:
  - connect success
  - wrong node identity rejection
  - close reason propagation
- PTY prompt appears and typed key echoes against a fake gateway

### Slice 4: Kubernetes Exposure And Node-Reachable Routing

Owner:
- infra / platform-control deploy path

Goal:
- expose node-facing terminal listener on a worker-node-routable path

In scope:
- dedicated Service / port for internal node terminal websocket
- no Traefik in the critical node stream path
- bootstrap/runtime config for node-agent internal terminal endpoint

Out of scope:
- browser ingress route changes

Files likely touched:
- `infra/k8s/base/core/*`
- `infra/k8s/overlays/dev-control/*`
- deploy scripts under `scripts/ci/`

Acceptance:
- worker node can reach the terminal internal endpoint
- mTLS handshake succeeds from node network
- route does not depend on `X-Forwarded-*` identity propagation

### Slice 5: Browser/Gateway Integration And UI State

Owner:
- terminal-gateway + web

Goal:
- keep browser contract stable while adapting gateway to the new bridge internals

In scope:
- preserve:
  - `POST /api/v1/allocations/{id}/terminal-token`
  - `WS /ws/terminal/{allocation_id}`
- browser receives typed `ready`, `data`, `close`, `error`
- explicit close reason rendering

Out of scope:
- browser contract redesign

Files likely touched:
- [`packages/web/src/components/terminal/TerminalPanel.tsx`](../../packages/web/src/components/terminal/TerminalPanel.tsx)
- [`cmd/terminal-gateway/routes.go`](../../cmd/terminal-gateway/routes.go)

Acceptance:
- browser prompt appears
- typed key leaves browser and echoes
- resize works
- explicit close reason visible

### Slice 6: Deployed-Environment Smoke And Failure Tests

Owner:
- cross-cutting

Goal:
- prove the redesign in the environment that exposed the failure

Required tests:
- deployed terminal smoke:
  - open terminal
  - prompt appears
  - type `echo hi`
  - verify echo/output before disconnect
- post-reimage terminal smoke
- browser disconnect test
- gateway restart test
- node-agent restart test
- wrong node cert / wrong node_id test

Files likely touched:
- `packages/web/e2e/terminal-input.spec.ts`
- CI/deploy validation scripts under `scripts/ci/`

Acceptance:
- all required tests above pass in the deployed environment
- this slice is required before declaring the redesign done

## Ordering Rules

- do not build slice 3 before slice 2 contracts are stable
- do not roll out slice 4 before slices 2 and 3 can be tested together in a lower-risk environment
- do not remove the old HTTP relay path until slice 6 is passing in the deployed environment

## Temporary Compatibility Strategy

Use a feature flag during migration:

- `TERMINAL_NODE_TRANSPORT=legacy_http|internal_ws`

This flag is temporary and must be removed after a successful soak.

Compatibility rules:

- browser contract must remain unchanged during migration
- task polling/provisioning transport must remain untouched
- do not mix old and new node stream behavior within the same live session

## Default Backpressure Behavior

V1 rule:
- bounded buffers only
- no silent output dropping
- if queue exceeds limit:
  - close session
  - emit explicit close reason
  - log saturation event
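The V1 rule above maps naturally onto a bounded channel with a non-blocking send: when the buffer is full, the queue closes itself with an explicit reason instead of dropping the frame silently. This is a minimal sketch under stated assumptions; `OutputQueue`, the error names, and the `backpressure_limit` reason string are hypothetical, and the limit itself is left to the caller.

```go
package main

import (
	"errors"
	"log"
	"sync"
)

var (
	ErrSaturated   = errors.New("output queue exceeded limit")
	ErrQueueClosed = errors.New("output queue closed")
)

// OutputQueue is a bounded frame buffer between the node socket reader and
// the browser socket writer.
type OutputQueue struct {
	mu     sync.Mutex
	frames chan []byte
	closed bool
	reason string
}

func NewOutputQueue(limit int) *OutputQueue {
	return &OutputQueue{frames: make(chan []byte, limit)}
}

// Enqueue never blocks and never drops silently. If the buffer is full it
// closes the queue, records the reason, logs the event, and returns
// ErrSaturated so the caller can close the session.
func (q *OutputQueue) Enqueue(frame []byte) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closed {
		return ErrQueueClosed
	}
	select {
	case q.frames <- frame:
		return nil
	default:
		q.closed = true
		q.reason = "backpressure_limit"
		close(q.frames) // frames already buffered can still be drained
		log.Printf("terminal output queue saturated: limit=%d", cap(q.frames))
		return ErrSaturated
	}
}

// Frames exposes the receive side for the writer goroutine.
func (q *OutputQueue) Frames() <-chan []byte { return q.frames }

// CloseReason returns the recorded reason, or "" while the queue is open.
func (q *OutputQueue) CloseReason() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	return q.reason
}
```

Because sends and the close both happen under the same mutex, a producer can never send on a closed channel, and the writer goroutine can simply range over `Frames()` until it is drained.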

## Risks

- mTLS setup for the dedicated node-facing listener may expose CA/config drift again
- gateway session ownership bugs could create split-brain live sessions
- node-facing listener exposure may require infra changes on worker-reachable networking

## Success Criteria

The redesign is successful when all of these are true in the deployed environment:

- prompt appears without disconnect/reconnect workarounds
- a typed key echoes before disconnect
- no terminal byte path depends on ingress request/response buffering
- node identity is validated directly by the node-facing listener
- browser contract is unchanged
- terminal survives normal load at the target concurrency envelope

## Not Allowed

- more incremental fixes to the old NDJSON duplex-over-HTTP relay
- relying on Traefik forwarded client-cert headers for the node-facing terminal path
- per-frame Redis pubsub as the steady-state bridge
