# Terminal WebSocket Bridge Architecture v1

## Status

Proposed v1 design target

## Purpose

Define a first-principles terminal architecture for GPUaaS that:

- provides true full duplex terminal I/O
- does not depend on HTTP request/response streaming behavior through proxies
- uses explicit node identity verification
- scales to hundreds of simultaneous terminal sessions
- can serve as the primary access path in restricted or sovereign environments where
  direct SSH may be unavailable or undesirable

This proposal is intended to replace incident-driven patching of the current NDJSON
terminal relay path.

## Problem Statement

The existing terminal runtime mixes three separate concerns into a single ingress path:

1. full duplex terminal transport
2. node identity/authentication
3. session authorization and control-plane ownership

That coupling created two production failure modes:

- the ingress path was reachable but buffered live duplex traffic badly enough that
  the prompt and typed input did not appear until disconnect or session unwind
- the direct path restored live streaming but failed node identity checks, because
  the current identity model assumes an ingress/mTLS handoff

The new design must separate these concerns.

## First-Principles Requirements

### Functional

- terminal input and output must flow in both directions independently and immediately
- browser contract should remain stable for users
- node-agent remains the terminal execution owner on the node
- terminal can be the primary remote shell surface when SSH is not the main access path

### Security

- node identity must be verified without forwarded header dependence
- user/session authorization must remain control-plane owned
- transport must be encrypted in transit
- session open/close/error activity must remain auditable

### Operational

- no correctness dependence on proxy buffering quirks
- support hundreds of simultaneous terminal sessions
- explicit connection caps and backpressure
- clear behavior on browser restart, gateway restart, and node-agent restart

## Design Decision

Adopt a dual-WebSocket bridge design:

- browser ↔ terminal-gateway: WebSocket
- node-agent ↔ terminal-gateway: dedicated internal WebSocket over mTLS

The terminal-gateway becomes the terminal data-plane bridge.

The control plane (`cmd/api`) remains the session authority:

- mint terminal token
- validate allocation ownership
- create session binding
- enqueue `terminal.open`
- record audit trail

The API is not the terminal byte relay.

## Why WebSocket Bridge

### Why not HTTP streamed request/response

- request/response body streaming is not a reliable model for long-lived full duplex I/O
- proxy and ingress behavior can buffer or delay one or both directions
- production has already exhibited this failure mode

### Why WebSocket

- true duplex after upgrade
- widely supported operationally
- already used on the browser side of the system
- simpler than introducing a gRPC runtime and protobuf for a byte-stream bridge
- easy to carry binary data plus a few control messages

### Why not Redis pubsub in the frame path

- per-frame broker hops add latency and failure modes
- terminal data plane should remain in-process once both sockets are connected
- Redis remains acceptable for session binding and token state, not live byte transport

## Topology

```text
browser
  -> websocket
terminal-gateway
  -> session authority calls
cmd/api
  -> node_tasks terminal.open
node-agent
  -> websocket over mTLS
terminal-gateway
```

Live relay path:

```text
browser websocket <-> terminal-gateway <-> node websocket <-> PTY
```

No API byte relay in the steady-state data path.

## Session Flow

### Phase 1: Browser authorization

1. Browser calls:
   - `POST /api/v1/allocations/{id}/terminal-token`
2. API validates:
   - user owns allocation
   - allocation is active
   - rate limits
3. API stores single-use short-lived token in Redis.

### Phase 2: Browser terminal connect

1. Browser opens:
   - `WS /ws/terminal/{allocation_id}`
2. Browser sends terminal token via `Sec-WebSocket-Protocol`.
3. Terminal-gateway validates token and creates session binding:
   - `session_id`
   - `allocation_id`
   - `user_id`
   - `node_id`
   - `username`
   - expiry / TTL
4. Terminal-gateway asks the control plane (`cmd/api`) to enqueue the `terminal.open` task.
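
As a minimal sketch of step 2, the gateway could read the token out of the `Sec-WebSocket-Protocol` request header. The `token.` prefix and the function name are assumptions made for illustration; the design above does not fix how the token is encoded within the subprotocol list.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenPrefix is a hypothetical convention, not part of the design above:
// the browser offers the token as a subprotocol entry "token.<value>".
const tokenPrefix = "token."

// terminalTokenFromProtocols scans Sec-WebSocket-Protocol header values
// (each may hold a comma-separated list) and returns the first token entry.
func terminalTokenFromProtocols(headerValues []string) (string, bool) {
	for _, value := range headerValues {
		for _, entry := range strings.Split(value, ",") {
			entry = strings.TrimSpace(entry)
			if strings.HasPrefix(entry, tokenPrefix) && len(entry) > len(tokenPrefix) {
				return strings.TrimPrefix(entry, tokenPrefix), true
			}
		}
	}
	return "", false
}

func main() {
	// In an HTTP handler this would be r.Header.Values("Sec-WebSocket-Protocol").
	values := []string{"terminal.v1, token.abc123"}
	token, ok := terminalTokenFromProtocols(values)
	fmt.Println(token, ok) // prints "abc123 true"
}
```

Subprotocol entries have to be valid HTTP tokens, so a token value carried this way should avoid characters such as `=` and `/` (for example by using unpadded URL-safe base64 or hex).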

### Phase 3: Node-agent connect

1. Node-agent claims `terminal.open`.
2. Node-agent starts PTY for the allocation user.
3. Node-agent opens a dedicated internal WebSocket to terminal-gateway:
   - node-facing listener only
   - mTLS required
   - session ID included in path or signed header/token
4. Terminal-gateway validates:
   - peer cert is a valid node cert
   - node identity matches session binding
   - session is still active
5. Terminal-gateway bridges browser socket and node socket.

### Phase 4: Live session

- browser input frames -> gateway -> node PTY stdin
- node PTY output (stdout and stderr, merged by the PTY) -> gateway -> browser output frames
- resize/close/heartbeat remain typed control messages

### Phase 5: Close and cleanup

- either side can close
- gateway records close reason and audit
- session binding is removed
- browser receives explicit close reason

## Security Model

Security has two separate layers.

### 1. Transport identity: node mTLS

The node-facing internal WebSocket listener requires:

- TLS
- client certificate verification against the node CA
- peer identity extraction directly from the TLS layer

This must not depend on Traefik forwarding client-cert headers.
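
A minimal sketch of what "identity from the TLS layer" could look like in the gateway process, using Go's standard `crypto/tls`. The function names are illustrative, and carrying the node ID in the certificate Common Name is an assumption; the actual node certificate profile may use a SAN or another field.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
)

// nodeListenerTLSConfig builds a server-side TLS config that rejects any
// client that does not present a certificate chaining to the node CA pool.
func nodeListenerTLSConfig(nodeCAs *x509.CertPool, serverCert tls.Certificate) *tls.Config {
	return &tls.Config{
		MinVersion:   tls.VersionTLS12,
		Certificates: []tls.Certificate{serverCert},
		ClientCAs:    nodeCAs,
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}
}

// nodeIDFromPeer reads the node identity from the verified leaf certificate.
// Assumption: the node ID is carried in the Subject Common Name.
func nodeIDFromPeer(state *tls.ConnectionState) (string, error) {
	if state == nil || len(state.VerifiedChains) == 0 || len(state.VerifiedChains[0]) == 0 {
		return "", errors.New("no verified client certificate")
	}
	cn := state.VerifiedChains[0][0].Subject.CommonName
	if cn == "" {
		return "", errors.New("client certificate has no common name")
	}
	return cn, nil
}

func main() {
	// Stand-in for r.TLS on a request accepted by the node-facing listener.
	leaf := &x509.Certificate{Subject: pkix.Name{CommonName: "node-42"}}
	state := &tls.ConnectionState{VerifiedChains: [][]*x509.Certificate{{leaf}}}
	id, err := nodeIDFromPeer(state)
	fmt.Println(id, err) // prints "node-42 <nil>"
}
```

Because the identity comes from `VerifiedChains` on the connection itself, no forwarded header is consulted at any point.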

### 2. Session authorization

Session authorization remains control-plane owned.

The gateway accepts the node socket only if:

- the session binding exists
- the bound `node_id` matches the node identity proven by the client cert
- the session is not expired or closed
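
The three acceptance conditions above can be sketched as a single check. The struct, field names, status value, and error values below are hypothetical; only the conditions themselves come from this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// SessionBinding is a hypothetical in-memory view of the stored binding.
type SessionBinding struct {
	SessionID     string
	NodeID        string
	ExpiresAtUnix int64  // expiry as Unix seconds (assumed encoding)
	Status        string // assumed values: "pending", "active", "closed"
}

var (
	ErrNoBinding    = errors.New("session binding not found")
	ErrNodeMismatch = errors.New("node identity does not match binding")
	ErrSessionEnded = errors.New("session expired or closed")
)

// authorizeNodeSocket applies the acceptance rules to a node connection
// whose identity (certNodeID) was already proven by the client certificate.
func authorizeNodeSocket(b *SessionBinding, certNodeID string, nowUnix int64) error {
	if b == nil {
		return ErrNoBinding
	}
	if b.NodeID != certNodeID {
		return ErrNodeMismatch
	}
	if b.Status == "closed" || nowUnix >= b.ExpiresAtUnix {
		return ErrSessionEnded
	}
	return nil
}

func main() {
	now := time.Now().Unix()
	b := &SessionBinding{SessionID: "sess-1", NodeID: "node-42", ExpiresAtUnix: now + 60, Status: "pending"}
	fmt.Println(authorizeNodeSocket(b, "node-42", now)) // accepted: prints <nil>
	fmt.Println(authorizeNodeSocket(b, "node-7", now))  // rejected: wrong node
}
```

Returning distinct errors keeps the "auth failure reason" available for logs and metrics.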

Optional strengthening:

- require a short-lived signed session claim from API in addition to the cert
- bind that claim to:
  - `session_id`
  - `node_id`
  - `allocation_id`
  - `exp`

The signed claim is recommended but not strictly required for v1 if the binding store
and mTLS checks are already strong.
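
One possible shape for such a claim is sketched below, using an HMAC-SHA256 signature over a JSON payload. The token format, the function names, and the idea of a key shared between the API and the gateway are all assumptions for illustration; an asymmetric signature would work equally well and avoids sharing a secret.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// SessionClaim carries the four bound fields listed above.
type SessionClaim struct {
	SessionID    string `json:"session_id"`
	NodeID       string `json:"node_id"`
	AllocationID string `json:"allocation_id"`
	Exp          int64  `json:"exp"` // Unix seconds
}

// signClaim returns "<payload>.<signature>", both unpadded URL-safe base64.
func signClaim(key []byte, c SessionClaim) (string, error) {
	payload, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	enc := base64.RawURLEncoding
	return enc.EncodeToString(payload) + "." + enc.EncodeToString(mac.Sum(nil)), nil
}

// verifyClaim checks the signature first, then the expiry.
func verifyClaim(key []byte, token string, nowUnix int64) (SessionClaim, error) {
	payloadB64, sigB64, ok := strings.Cut(token, ".")
	if !ok {
		return SessionClaim{}, errors.New("malformed claim")
	}
	enc := base64.RawURLEncoding
	payload, err := enc.DecodeString(payloadB64)
	if err != nil {
		return SessionClaim{}, errors.New("malformed claim payload")
	}
	sig, err := enc.DecodeString(sigB64)
	if err != nil {
		return SessionClaim{}, errors.New("malformed claim signature")
	}
	mac := hmac.New(sha256.New, key)
	mac.Write(payload)
	if !hmac.Equal(sig, mac.Sum(nil)) {
		return SessionClaim{}, errors.New("bad claim signature")
	}
	var c SessionClaim
	if err := json.Unmarshal(payload, &c); err != nil {
		return SessionClaim{}, err
	}
	if nowUnix >= c.Exp {
		return SessionClaim{}, errors.New("claim expired")
	}
	return c, nil
}

func main() {
	key := []byte("demo-key-not-for-production")
	token, _ := signClaim(key, SessionClaim{SessionID: "sess-1", NodeID: "node-42", AllocationID: "alloc-9", Exp: 2000})
	claim, err := verifyClaim(key, token, 1000)
	fmt.Println(claim.NodeID, err) // prints "node-42 <nil>"
}
```

The gateway would still compare the verified claim's `node_id` against the identity from the client certificate; the claim supplements the mTLS check rather than replacing it.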

## Session Directory And Redis Schema

Redis remains the session-control store, but not the frame relay path.

Recommended v1 keys:

- `terminal_session:{session_id}`
  - value:
    - `allocation_id`
    - `user_id`
    - `node_id`
    - `username`
    - `expires_at`
    - `gateway_instance_id`
    - `status`
- `terminal_allocation_active:{allocation_id}`
  - value:
    - `session_id`
- `terminal_gateway_sessions:{gateway_instance_id}`
  - set of owned `session_id`

Required semantics:

- session binding is created before the node socket attaches
- `gateway_instance_id` is written when the gateway takes ownership
- session ownership is removed on close or expiry
- session TTL remains enforced by control-plane policy

This extends the existing session-binding model rather than inventing a parallel
session directory with unrelated ownership rules.
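
To make the key layout concrete, here is a small sketch of key builders and a session record. The key prefixes and field names come from the list above; storing the value as a JSON string and `expires_at` as Unix seconds are assumptions, since this document does not specify the Redis value encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TerminalSession mirrors the fields listed for terminal_session:{session_id}.
type TerminalSession struct {
	AllocationID      string `json:"allocation_id"`
	UserID            string `json:"user_id"`
	NodeID            string `json:"node_id"`
	Username          string `json:"username"`
	ExpiresAt         int64  `json:"expires_at"` // Unix seconds (assumed encoding)
	GatewayInstanceID string `json:"gateway_instance_id"`
	Status            string `json:"status"`
}

// sessionKey names the per-session binding record.
func sessionKey(sessionID string) string {
	return "terminal_session:" + sessionID
}

// allocationActiveKey names the pointer from an allocation to its session.
func allocationActiveKey(allocationID string) string {
	return "terminal_allocation_active:" + allocationID
}

// gatewaySessionsKey names the set of sessions owned by one gateway instance.
func gatewaySessionsKey(gatewayInstanceID string) string {
	return "terminal_gateway_sessions:" + gatewayInstanceID
}

func main() {
	s := TerminalSession{
		AllocationID: "alloc-9", UserID: "user-3", NodeID: "node-42",
		Username: "alice", ExpiresAt: 1700000000,
		GatewayInstanceID: "gw-1", Status: "pending",
	}
	value, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	fmt.Println(sessionKey("sess-1"), "=>", string(value))
	fmt.Println(allocationActiveKey(s.AllocationID), "=>", "sess-1")
	fmt.Println(gatewaySessionsKey(s.GatewayInstanceID), "contains", "sess-1")
}
```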

## Protocol

### Browser-side protocol

Use JSON text frames with typed control/data messages.

Examples:

- `ready`
- `data`
- `resize`
- `close`
- `error`
- `heartbeat`

Payload bytes are base64 encoded for browser-side structured messaging.
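
A sketch of how the gateway might model these messages follows. The message type names come from the list above; the JSON field names (`type`, `data`, `cols`, `rows`, `reason`) are assumptions, since the existing browser contract is not reproduced in this document. Go's `encoding/json` encodes `[]byte` fields as base64 strings, which matches the payload rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BrowserMessage is a hypothetical envelope for browser-side text frames.
type BrowserMessage struct {
	Type   string `json:"type"`
	Data   []byte `json:"data,omitempty"` // base64 in JSON
	Cols   int    `json:"cols,omitempty"`
	Rows   int    `json:"rows,omitempty"`
	Reason string `json:"reason,omitempty"`
}

// knownTypes lists the message types named in this section.
var knownTypes = map[string]bool{
	"ready": true, "data": true, "resize": true,
	"close": true, "error": true, "heartbeat": true,
}

// encodeMessage serializes a message, refusing unknown types.
func encodeMessage(m BrowserMessage) ([]byte, error) {
	if !knownTypes[m.Type] {
		return nil, fmt.Errorf("unknown message type %q", m.Type)
	}
	return json.Marshal(m)
}

// decodeMessage parses one text frame and rejects unknown types.
func decodeMessage(frame []byte) (BrowserMessage, error) {
	var m BrowserMessage
	if err := json.Unmarshal(frame, &m); err != nil {
		return BrowserMessage{}, err
	}
	if !knownTypes[m.Type] {
		return BrowserMessage{}, fmt.Errorf("unknown message type %q", m.Type)
	}
	return m, nil
}

func main() {
	out, err := encodeMessage(BrowserMessage{Type: "data", Data: []byte("ls\n")})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // prints {"type":"data","data":"bHMK"}
}
```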

### Node-side protocol

Use:

- binary frames for PTY byte data
- text frames for control messages (`resize`, `close`, `heartbeat`, `error`)

This keeps the hot path efficient while preserving structured control behavior.

## Scale Model

This architecture must handle hundreds of simultaneous sessions.

### Scaling assumptions

- many nodes may be connected concurrently
- a restricted environment may prefer terminal over SSH
- terminal sessions are long-lived and bursty

### Required scale properties

#### Gateway horizontal scaling

Terminal-gateway must scale horizontally.

Each active session is owned by one gateway instance.

Session directory requirements:

- `session_id -> gateway_instance_id`
- `session_id -> allocation_id, user_id, node_id, expiry`

This registry can live in Redis or another shared store because it is control state,
not the frame path.

#### No per-frame external dependency

There must be:

- no Redis publish/subscribe for live frame movement
- no database writes per frame
- no control-plane API hop per keypress

#### Backpressure and buffering

Each session must have bounded buffers.

Required policies:

- max outbound queue per browser socket
- max outbound queue per node socket
- close session or shed data predictably if peer is too slow

V1 default policy:

- do not silently drop terminal output bytes
- if a bounded queue is exceeded:
  - close the session explicitly
  - emit an explicit close reason such as:
    - `output_backpressure_exceeded`
    - `input_backpressure_exceeded`
  - surface that reason to logs and to the browser

Terminal should prefer correctness and bounded memory over unbounded buffering.
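
The default policy can be sketched with a buffered channel and a non-blocking send. The type and function names are illustrative, and bounding by frame count is a simplifying assumption; a real gateway might bound by bytes instead.

```go
package main

import "fmt"

// Close reasons named in the v1 default policy.
const (
	reasonOutputBackpressure = "output_backpressure_exceeded"
	reasonInputBackpressure  = "input_backpressure_exceeded"
)

// boundedQueue holds frames waiting to be written to one peer socket.
type boundedQueue struct {
	frames chan []byte
}

func newBoundedQueue(maxFrames int) *boundedQueue {
	return &boundedQueue{frames: make(chan []byte, maxFrames)}
}

// offer enqueues without blocking. A false result means the bound was hit;
// the frame is not queued and the caller is expected to close the session.
func (q *boundedQueue) offer(frame []byte) bool {
	select {
	case q.frames <- frame:
		return true
	default:
		return false
	}
}

// forwardOutput queues PTY output for the browser socket. It returns a
// close reason when the browser is too slow, or "" on success.
func forwardOutput(browserQueue *boundedQueue, frame []byte) string {
	if browserQueue.offer(frame) {
		return ""
	}
	return reasonOutputBackpressure
}

// forwardInput queues browser input for the node socket in the same way.
func forwardInput(nodeQueue *boundedQueue, frame []byte) string {
	if nodeQueue.offer(frame) {
		return ""
	}
	return reasonInputBackpressure
}

func main() {
	browserQueue := newBoundedQueue(2) // nothing drains the queue in this demo
	for i := 0; i < 3; i++ {
		if reason := forwardOutput(browserQueue, []byte("output")); reason != "" {
			fmt.Println("closing session:", reason)
		}
	}
}
```

The frame that overflows the queue is never dropped silently: the caller learns about it immediately and closes the session with the returned reason.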

#### Connection caps

Enforce:

- per-user active session cap
- per-allocation active session cap
- per-gateway active session cap

Expose saturation metrics so operators know when to scale out.
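
A minimal in-process sketch of the three caps is shown below. The type names and rejection reason strings are hypothetical. Note that a counter like this only sees sessions on one gateway instance; enforcing per-user and per-allocation caps across instances would presumably rely on the shared session store.

```go
package main

import (
	"fmt"
	"sync"
)

// capLimits holds the three caps listed above.
type capLimits struct {
	PerUser       int
	PerAllocation int
	PerGateway    int
}

// sessionCaps tracks active sessions on one gateway instance.
type sessionCaps struct {
	mu           sync.Mutex
	limits       capLimits
	byUser       map[string]int
	byAllocation map[string]int
	total        int
}

func newSessionCaps(limits capLimits) *sessionCaps {
	return &sessionCaps{
		limits:       limits,
		byUser:       make(map[string]int),
		byAllocation: make(map[string]int),
	}
}

// tryAcquire reserves a slot and returns "" on success, or a rejection
// reason (hypothetical strings) naming the cap that was hit.
func (c *sessionCaps) tryAcquire(userID, allocationID string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch {
	case c.total >= c.limits.PerGateway:
		return "gateway_session_cap_exceeded"
	case c.byUser[userID] >= c.limits.PerUser:
		return "user_session_cap_exceeded"
	case c.byAllocation[allocationID] >= c.limits.PerAllocation:
		return "allocation_session_cap_exceeded"
	}
	c.total++
	c.byUser[userID]++
	c.byAllocation[allocationID]++
	return ""
}

// release returns a previously acquired slot when the session closes.
func (c *sessionCaps) release(userID, allocationID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.byUser[userID] == 0 || c.byAllocation[allocationID] == 0 {
		return // nothing was acquired for this pair
	}
	c.total--
	c.byUser[userID]--
	c.byAllocation[allocationID]--
}

func main() {
	caps := newSessionCaps(capLimits{PerUser: 2, PerAllocation: 1, PerGateway: 10})
	fmt.Printf("%q\n", caps.tryAcquire("user-1", "alloc-1")) // ""
	fmt.Printf("%q\n", caps.tryAcquire("user-1", "alloc-1")) // "allocation_session_cap_exceeded"
	fmt.Printf("%q\n", caps.tryAcquire("user-1", "alloc-2")) // ""
	fmt.Printf("%q\n", caps.tryAcquire("user-1", "alloc-3")) // "user_session_cap_exceeded"
}
```

The `total`, `byUser`, and `byAllocation` counters are also natural sources for the saturation metrics.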

## Restart And Reconnect Model

Restart and reconnect behavior must be explicit, not accidental.

### V1 recommendation

Non-resumable sessions.

Rules:

- browser disconnect:
  - close session after short grace period or immediately, depending on product choice
- gateway restart:
  - all sessions close with explicit reason `gateway_restart`
- node-agent restart:
  - session closes with reason `node_restart`
- control-plane/API restart:
  - existing browser and node sockets continue as long as the gateway stays alive, because the API does not relay bytes

Why this is recommended for v1:

- simpler correctness model
- easier to test
- avoids hidden half-open session semantics

### V2 option

Add resumable sessions only after v1 is stable.

That requires:

- durable session directory
- attach/re-attach semantics
- PTY survivability rules

## Deployment Model

### Node-facing listener

Expose a dedicated node-facing WebSocket listener outside the current HTTP ingress path.

Requirements:

- node-reachable address
- websocket-safe/L4-safe load balancing
- native mTLS at the gateway process
- bounded node-stream connection timeout so browser sessions fail with an
  actionable `node_stream_timeout` instead of waiting indefinitely when the
  node-facing endpoint/profile is wrong

Do not place the critical node stream path behind the same ingress assumptions that
caused the current buffering issue.
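
The bounded node-stream timeout can be sketched as a wait on an "attached" signal with a timer. The function name and the channel-based signal are assumptions for illustration; only the `node_stream_timeout` reason comes from the requirement above.

```go
package main

import (
	"fmt"
	"time"
)

const reasonNodeStreamTimeout = "node_stream_timeout"

// waitForNodeAttach blocks until the node socket attaches or the timeout
// elapses. It returns "" on attach, or the close reason on timeout.
func waitForNodeAttach(attached <-chan struct{}, timeout time.Duration) string {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-attached:
		return ""
	case <-timer.C:
		return reasonNodeStreamTimeout
	}
}

func main() {
	attached := make(chan struct{}) // never signalled in this demo
	reason := waitForNodeAttach(attached, 50*time.Millisecond)
	fmt.Println(reason) // prints node_stream_timeout
}
```

On timeout the gateway would close the browser socket with this reason, so a misconfigured node-facing endpoint shows up as an actionable error rather than a hung terminal.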

Concrete v1 deployment shape:

- `cmd/terminal-gateway` serves two listeners:
  - existing browser-facing websocket listener
  - new node-facing internal websocket listener on a dedicated port
- Kubernetes exposes the node-facing listener through a dedicated Service:
  - separate from the current browser-facing route
  - separate from the current ingress-buffered `node-api` path
- preferred v1 exposure:
  - dedicated `LoadBalancer` Service for the node-facing websocket port
  - reachable from worker nodes on the node network
  - TLS terminates in `cmd/terminal-gateway`, not at Traefik

Explicit non-goal:

- do not reuse the current Traefik/ingress terminal or node-api route for the live
  node stream path

### Browser-facing listener

Keep the existing browser-facing websocket endpoint and routing model, provided it
remains websocket-safe.

## Observability Requirements

Required logs:

- browser websocket accepted
- session binding created
- node websocket accepted
- first downstream frame forwarded
- first upstream frame forwarded
- close reason and initiator
- auth failure reason

Required metrics:

- active sessions
- sessions opened / closed
- auth failures by reason
- average session duration
- buffer saturation / dropped-session counts
- gateway instance session counts

## Test Plan

The redesign is not complete until it passes tests in a deployed environment.

### Must-have tests

- prompt appears in deployed environment
- typed key is echoed before disconnect
- resize changes PTY size
- session closes with explicit reason
- post-reimage terminal smoke

### Failure-mode tests

- browser disconnect
- gateway restart
- node-agent restart
- expired session binding
- wrong node cert / wrong node_id

### Scale tests

- N concurrent sessions across multiple gateways
- sustained typing/output across many sessions
- no per-session memory runaway

## Migration Plan

### Phase 1

- add dedicated node-facing internal WebSocket listener to terminal-gateway
- add mTLS verification at the listener
- add feature flag for node transport selection

### Phase 2

- add node-agent internal WebSocket client for terminal sessions
- keep existing browser contract unchanged

### Phase 3

- run dev-control soak
- validate duplex, auth, restart behavior, and post-reimage flow

### Phase 4

- remove NDJSON HTTP relay path
- remove frame-path Redis pubsub coupling
- simplify terminal runbooks and alerts around the new architecture

## Decision

The recommended v1 terminal redesign is:

- browser-side WebSocket unchanged
- node-side direct internal WebSocket over mTLS
- terminal-gateway as the only live byte bridge
- API as session authority, not byte relay
- no dependence on ingress behavior for node terminal streaming
