# Terminal Node Transport Redesign v1

## Status

Decided. Superseded by [Terminal WebSocket Bridge Architecture v1](./Terminal_WebSocket_Bridge_Architecture_v1.md)

This document remains the option-analysis and decision trail.

Decision outcome:

- this document initially recommended Option C (gRPC bidirectional streaming)
- after further first-principles review, the chosen v1 architecture is Option B:
  a dedicated WebSocket bridge with a node-facing mTLS listener
- the active build specification is:
  - [Terminal WebSocket Bridge Architecture v1](./Terminal_WebSocket_Bridge_Architecture_v1.md)

## Problem

The current node-agent terminal path entangles two concerns that need to be
designed together rather than patched separately:

- live bidirectional terminal transport
- node identity and authorization for that transport

Production findings from 2026-03:

- the ingress path (`https://node-api...`) is reachable from nodes
- however, the NDJSON duplex-over-HTTP relay buffers live terminal traffic so heavily
  that prompts and input are effectively delayed until disconnect or session unwind
- a direct node-reachable path removes that streaming stall
- however, the current direct path fails with `401 invalid node identity`, because the
  present identity model depends on ingress/mTLS handoff behavior

As a result, neither path gives the current system a stable terminal transport
between node and control plane.

## Goals

- terminal must be truly bidirectional during the live session
- transport behavior must not depend on proxy buffering quirks
- node identity must be explicit on the chosen transport
- browser contract stays stable:
  - `POST /api/v1/allocations/{id}/terminal-token`
  - `WS /ws/terminal/{allocation_id}`
- terminal transport must remain separable from node task polling/provisioning
- reconnection, close, resize, and readiness should be typed protocol events

## Non-Goals

- changing the browser-facing websocket contract immediately
- coupling terminal redesign to task polling/provisioning transport changes
- continuing incremental incident patches to the current NDJSON duplex relay

## Options

### Option A: Keep NDJSON-over-HTTP and add more bypass rules

Shape:
- browser -> terminal-gateway websocket
- terminal-gateway/api -> node-agent via HTTP request/response streams

Pros:
- smallest code churn
- keeps existing OpenAPI-shaped internals

Cons:
- already shown in production not to be robust
- still proxy-sensitive
- still requires transport-specific auth exceptions
- harder to test and reason about full duplex behavior

Decision:
- reject as long-term direction

### Option B: Direct websocket from gateway to node-facing endpoint

Shape:
- browser -> terminal-gateway websocket
- terminal-gateway -> node-agent websocket or websocket-like direct stream

Pros:
- true duplex transport
- simpler browser/gateway mental model

Cons:
- introduces a second public-ish node-facing runtime surface
- harder to keep API as the control-plane authority
- more awkward to integrate with existing node mTLS/task identity model

Decision:
- not preferred at the time of analysis; later chosen as the v1 architecture (see Status)

### Option C: gRPC bidirectional stream between node-agent and control plane

Shape:
- browser -> terminal-gateway websocket
- terminal-gateway/API broker -> node-agent via gRPC bidi stream on a dedicated
  node-facing control-plane endpoint

Pros:
- transport matches the problem: true duplex typed streaming
- explicit stream lifecycle and backpressure
- typed protocol for `ready`, `data`, `resize`, `close`, `error`
- avoids dependence on ingress request/response buffering semantics
- identity can be designed explicitly for this stream instead of inherited from
  the ingress header-forwarding path

Cons:
- larger change than another HTTP patch
- needs a new node-facing service boundary and tests

Decision:
- recommended at the time of analysis; later superseded by Option B (see Status)

## Recommendation At Time Of Analysis

At the time this document was written, Option C was the recommended direction:

- keep browser edge as websocket through `cmd/terminal-gateway`
- move node/control-plane terminal transport to a dedicated gRPC bidirectional stream
- expose that stream on a node-reachable control-plane endpoint separate from the
  current ingress-buffered HTTP terminal relay

That recommendation was later superseded by the WebSocket bridge design after weighing:

- operational simplicity
- fit for byte-stream relay semantics
- reuse of existing websocket runtime and operational model

The active design is now:
- [Terminal WebSocket Bridge Architecture v1](./Terminal_WebSocket_Bridge_Architecture_v1.md)

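## Recommended Runtime Topology

This section and the ones that follow describe the Option C (gRPC) design as
recommended at the time of analysis. They are retained as part of the decision trail.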

```text
browser
  -> terminal-gateway websocket
  -> terminal session broker in control plane
  -> gRPC bidi stream
  -> node-agent PTY
```

Suggested ownership split:

- `terminal-gateway`
  - browser websocket termination
  - browser token/session validation
  - browser resize/input/output relay

- `cmd/api` or a dedicated terminal broker service
  - session authority
  - allocation/user/node binding validation
  - audit/session lifecycle events
  - issues short-lived node stream credentials

- `cmd/node-agent`
  - opens one gRPC bidi stream per terminal session
  - runs PTY as allocation user
  - relays typed frames

## Identity Model

This part must be explicit. The direct path should not depend on ingress header
forwarding.

Recommended model:

1. Node remains enrolled and authenticated by its existing node certificate
   for lifecycle/task APIs.

2. Terminal stream transport gets a dedicated auth model:
   - mTLS at transport level on the node-facing gRPC endpoint using the node cert, or
   - short-lived API-issued node stream token bound to:
     - `node_id`
     - `session_id`
     - `allocation_id`
     - `exp`

3. Server verifies both:
   - the node identity
   - the terminal session binding

4. Terminal stream authorization does not depend on Traefik forwarding
   `X-Forwarded-Tls-Client-Cert*` headers.

Preferred variant:
- mTLS on the dedicated gRPC endpoint plus signed stream/session claims

Reason:
- transport-level peer identity remains strong
- session-level authorization remains explicit and auditable
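
The session-claim half of the preferred variant can be sketched as follows. This is a minimal illustration, not the project's implementation: the HMAC-SHA256 signing scheme, the token layout, and the function names `issue_stream_token` and `verify_stream_token` are assumptions. `peer_node_id` stands for the node identity the server has already taken from the mTLS peer certificate on the dedicated endpoint, not from forwarded headers.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _sign(key: bytes, payload: bytes) -> bytes:
    return hmac.new(key, payload, hashlib.sha256).digest()


def issue_stream_token(key: bytes, node_id: str, session_id: str,
                       allocation_id: str, ttl_s: int,
                       now: Optional[float] = None) -> str:
    """Issue a short-lived token bound to node, session, and allocation."""
    issued_at = time.time() if now is None else now
    claims = {
        "node_id": node_id,
        "session_id": session_id,
        "allocation_id": allocation_id,
        "exp": int(issued_at) + ttl_s,
    }
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    parts = (payload, _sign(key, payload))
    return ".".join(base64.urlsafe_b64encode(p).decode("ascii") for p in parts)


def verify_stream_token(key: bytes, token: str, peer_node_id: str,
                        session_id: str, allocation_id: str,
                        now: Optional[float] = None) -> dict:
    """Check signature, expiry, node identity, and session binding."""
    current = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(sig_b64)
    except ValueError as exc:  # wrong part count or invalid base64
        raise PermissionError("malformed token") from exc
    if not hmac.compare_digest(signature, _sign(key, payload)):
        raise PermissionError("bad signature")
    claims = json.loads(payload)
    if claims["exp"] <= current:
        raise PermissionError("token expired")
    # Node identity: the claim must match the transport-level peer identity.
    if claims["node_id"] != peer_node_id:
        raise PermissionError("node identity mismatch")
    # Session binding: the claim must match the stream being opened.
    if (claims["session_id"] != session_id
            or claims["allocation_id"] != allocation_id):
        raise PermissionError("session binding mismatch")
    return claims
```

Each rejection has its own message, so an audit log can distinguish a node identity failure from a session binding failure.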

## Protocol Shape

Use a typed stream instead of free-form NDJSON frames.

Suggested messages:

- `TerminalOpen`
- `TerminalReady`
- `TerminalData`
- `TerminalResize`
- `TerminalClose`
- `TerminalError`
- `TerminalHeartbeat`
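
To illustrate what "typed" buys over free-form frames, here is a minimal sketch of the message set as a closed union. The message names come from the list above; every field (`cols`, `rows`, `payload`, `reason`, and so on) is an assumption, and the JSON envelope is only a stand-in for whatever wire encoding the real contract defines.

```python
import base64
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TerminalOpen:
    session_id: str
    allocation_id: str
    cols: int
    rows: int


@dataclass(frozen=True)
class TerminalReady:
    session_id: str


@dataclass(frozen=True)
class TerminalData:
    session_id: str
    payload: bytes


@dataclass(frozen=True)
class TerminalResize:
    session_id: str
    cols: int
    rows: int


@dataclass(frozen=True)
class TerminalClose:
    session_id: str
    reason: str


@dataclass(frozen=True)
class TerminalError:
    session_id: str
    message: str


@dataclass(frozen=True)
class TerminalHeartbeat:
    session_id: str


_TYPES = {cls.__name__: cls for cls in (
    TerminalOpen, TerminalReady, TerminalData, TerminalResize,
    TerminalClose, TerminalError, TerminalHeartbeat)}


def encode(frame) -> bytes:
    """Serialize a frame with an explicit type tag."""
    body = asdict(frame)
    if isinstance(frame, TerminalData):
        body["payload"] = base64.b64encode(frame.payload).decode("ascii")
    return json.dumps({"type": type(frame).__name__, "body": body}).encode()


def decode(raw: bytes):
    """Parse a frame, rejecting unknown types and malformed bodies."""
    envelope = json.loads(raw)
    cls = _TYPES.get(envelope.get("type"))
    if cls is None:
        raise ValueError(f"unknown frame type: {envelope.get('type')!r}")
    body = dict(envelope["body"])
    if cls is TerminalData:
        body["payload"] = base64.b64decode(body["payload"])
    return cls(**body)  # TypeError on missing or unexpected fields
```

An unknown type or a body with the wrong fields fails at decode time instead of being passed along to the PTY relay.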

Required invariants:

- exactly one active terminal session per allocation unless future multiplexing is
  explicitly added
- server-enforced TTL
- explicit close reasons
- correlation IDs carried end to end
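
A minimal in-memory sketch of how a broker could enforce these invariants. The class and enum names, and the specific close reasons, are hypothetical; a real registry would live in the broker-owned state described under Restart / Reconnect Model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class CloseReason(Enum):
    # Hypothetical reason codes; the real set belongs to the protocol contract.
    CLIENT_CLOSED = "client_closed"
    TTL_EXPIRED = "ttl_expired"
    NODE_AGENT_RESTARTED = "node_agent_restarted"


@dataclass(frozen=True)
class Session:
    session_id: str
    allocation_id: str
    correlation_id: str  # carried on every frame and log line for this session
    expires_at: float


class SessionRegistry:
    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._active: Dict[str, Session] = {}  # keyed by allocation_id
        self.closed: List[Tuple[Session, CloseReason]] = []  # audit trail

    def open(self, session_id: str, allocation_id: str,
             correlation_id: str, now: float) -> Session:
        self.expire(now)  # the server enforces the TTL, not the client
        if allocation_id in self._active:
            raise RuntimeError(
                f"allocation {allocation_id} already has an active session")
        session = Session(session_id, allocation_id, correlation_id,
                          now + self._ttl_s)
        self._active[allocation_id] = session
        return session

    def close(self, allocation_id: str, reason: CloseReason) -> None:
        # Every close records an explicit reason.
        session = self._active.pop(allocation_id)
        self.closed.append((session, reason))

    def expire(self, now: float) -> None:
        for allocation_id, session in list(self._active.items()):
            if session.expires_at <= now:
                self.close(allocation_id, CloseReason.TTL_EXPIRED)
```

Keying the active map by `allocation_id` makes the one-session-per-allocation rule a structural property rather than a check that callers must remember.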

## Restart / Reconnect Model

The design should support restartability explicitly.

Requirements:

- if terminal-gateway restarts:
  - browser reconnects with a new websocket
  - broker resumes or cleanly reopens the node stream for the existing live session
- if node-agent restarts:
  - session is closed explicitly with a typed reason
  - UI gets a deterministic reconnectable/closed state
- if control-plane broker restarts:
  - session registry rebuild is deterministic or active sessions are explicitly expired

Recommendation:
- keep session registry in broker-owned durable/replicated state
- support reopen-by-session semantics only if the PTY is still valid
- otherwise close cleanly and require a fresh `terminal.open`

Do not infer reconnect state from transport side effects.
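
The restart rules above can be sketched as a single decision function over explicit registry state. All names here (`Component`, `SessionRecord`, `on_restart`, the reason strings) are illustrative assumptions. Note that the inputs are the restarted component and the broker's session record, never an observation such as "the socket dropped".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Component(Enum):
    GATEWAY = "terminal-gateway"
    NODE_AGENT = "node-agent"
    BROKER = "broker"


class Action(Enum):
    REOPEN_BY_SESSION = "reopen_by_session"
    CLOSE = "close"  # the client must start a fresh session


@dataclass(frozen=True)
class SessionRecord:
    session_id: str
    pty_alive: bool
    expired: bool


@dataclass(frozen=True)
class Decision:
    action: Action
    reason: Optional[str]


def on_restart(component: Component,
               record: Optional[SessionRecord]) -> Decision:
    """Decide what happens to a session after a component restart."""
    if record is None:
        # Registry could not be rebuilt for this session: expire explicitly.
        return Decision(Action.CLOSE, "session_not_found")
    if component is Component.NODE_AGENT:
        # A node-agent restart always ends the session with a typed reason.
        return Decision(Action.CLOSE, "node_agent_restarted")
    if record.expired:
        return Decision(Action.CLOSE, "session_expired")
    if not record.pty_alive:
        return Decision(Action.CLOSE, "pty_gone")
    # Gateway or broker restart with a still-valid PTY: resume the session.
    return Decision(Action.REOPEN_BY_SESSION, None)
```

Because every branch returns a typed reason, the UI can map each outcome to a deterministic reconnectable or closed state.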

## Migration Plan

### Phase 1: Design and contract

- define protobuf for node/control-plane terminal stream
- define auth model for the node-facing gRPC endpoint
- document session lifecycle and close semantics

### Phase 2: Broker implementation

- implement gRPC stream server in API or dedicated terminal broker
- keep existing browser websocket contract unchanged
- add typed session/state logging

### Phase 3: Node-agent implementation

- add gRPC terminal client in node-agent
- keep old HTTP path behind a temporary feature flag for rollback only
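
A rollback-only flag could look like the sketch below. The environment variable name and the transport labels are hypothetical. The point is that the new path is the default and the legacy relay is reachable only by an explicit opt-in.

```python
import os
from typing import Mapping

# Hypothetical flag name; set only during a rollback.
LEGACY_FLAG = "TERMINAL_LEGACY_NDJSON_ROLLBACK"


def select_terminal_transport(env: Mapping[str, str] = os.environ) -> str:
    """Return the transport the node-agent should use for terminal sessions."""
    if env.get(LEGACY_FLAG, "").strip().lower() in {"1", "true"}:
        return "ndjson-http"  # old relay, kept temporarily for rollback
    return "grpc"  # default: the new stream
```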

### Phase 4: Cutover

- route a dev-control environment to the gRPC path
- validate:
  - prompt delivery
  - keystroke echo
  - resize
  - disconnect
  - restart behavior
- remove old NDJSON duplex relay after soak

## Test Requirements

Minimum required before calling the redesign done:

- unit tests for stream auth/session validation
- integration test for node-agent <-> broker bidi stream
- deployed-environment smoke test that verifies:
  - prompt appears
  - typed key reaches PTY
  - shell echoes before disconnect
- post-reimage terminal smoke test
- restart tests:
  - gateway restart
  - node-agent restart
  - broker restart
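
The ordering in the smoke requirements matters: the echo must be observed while the session is still open, because the production fault only released output at disconnect. A minimal sketch of such a check follows, run here against an in-memory fake. `smoke_check`, `FakeSession`, and the `read_until`/`write`/`close` interface are assumptions, not an existing client API.

```python
def smoke_check(session, prompt: bytes = b"$ ", timeout_s: float = 5.0) -> None:
    """Raise TimeoutError unless prompt and echo arrive before disconnect."""
    session.read_until(prompt, timeout_s)      # 1. prompt appears
    marker = b"smoke-ok"
    session.write(marker)                      # 2. typed keys reach the PTY
    session.read_until(marker, timeout_s)      # 3. echo seen while still open
    session.close()                            # disconnect only after the echo


class FakeSession:
    """In-memory stand-in for a terminal session.

    With buffered=True it mimics the relay fault: output is held back and
    released only when the session is closed.
    """

    def __init__(self, buffered: bool = False) -> None:
        self._buffered = buffered
        self._visible = bytearray()
        self._held = bytearray()
        self._emit(b"$ ")  # the shell prints its prompt on start

    def _emit(self, data: bytes) -> None:
        (self._held if self._buffered else self._visible).extend(data)

    def write(self, data: bytes) -> None:
        self._emit(data)  # the terminal echoes typed input

    def read_until(self, marker: bytes, timeout_s: float) -> bytes:
        idx = self._visible.find(marker)
        if idx < 0:
            raise TimeoutError(f"{marker!r} not seen within {timeout_s}s")
        end = idx + len(marker)
        chunk = bytes(self._visible[:end])
        del self._visible[:end]
        return chunk

    def close(self) -> None:
        self._visible.extend(self._held)  # buffered output flushes too late
        self._held.clear()
```

Against `FakeSession()` the check passes. Against `FakeSession(buffered=True)` it fails at the first step with `TimeoutError`, which is the behavior a deployed-environment smoke test needs to catch.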

## Decision

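Do not continue incident-style fixes on the current NDJSON duplex relay.

At the time of analysis, the decision was to build a dedicated node-facing gRPC
terminal stream with an explicit identity model, while preserving the browser
websocket contract. The transport choice was later superseded by Option B, a
dedicated WebSocket bridge with a node-facing mTLS listener (see Status).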
