Node Agent Runtime Depth

Status: implemented

The node agent is not a bootstrap helper or a wrapper around SSH. It is a bounded runtime executor that turns control-plane intent into host-local action without giving the control plane arbitrary shell authority.

Why It Matters

If this subsystem were weak, GPUaaS would collapse into:

- direct control-plane SSH,
- host-local scripts as hidden product logic,
- or runtime behavior that bypasses audit and typed contracts.

The node agent is the mechanism that prevents that.

Authority Boundary

The important rule is simple:

- the control plane may request an approved task;
- the node agent may execute only a compiled handler for that task;
- neither side gets to silently widen that boundary during incidents or new feature work.
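The boundary above can be sketched as a handler registry that is fixed at build time. This is a minimal illustration, not the agent's actual code: the `Registry`, `Handler`, and `Dispatch` names and the `node.health_check` task are hypothetical. The point it shows is that a task type without a compiled handler is rejected rather than interpreted.

```go
package main

import (
	"errors"
	"fmt"
)

// TaskType names an approved task. The names used in main are hypothetical.
type TaskType string

// Handler is a function compiled into the agent for exactly one task type.
type Handler func(payload []byte) (string, error)

// ErrUnknownTask is returned when no compiled handler exists for a task type.
var ErrUnknownTask = errors.New("task type has no compiled handler")

// Registry maps approved task types to handlers. It has no method for
// adding handlers after construction, so the set cannot widen at run time.
type Registry struct {
	handlers map[TaskType]Handler
}

// NewRegistry copies the handler map so later changes by the caller
// cannot alter what the registry will execute.
func NewRegistry(h map[TaskType]Handler) *Registry {
	copied := make(map[TaskType]Handler, len(h))
	for k, v := range h {
		copied[k] = v
	}
	return &Registry{handlers: copied}
}

// Dispatch runs the handler for t, or fails closed if none is registered.
func (r *Registry) Dispatch(t TaskType, payload []byte) (string, error) {
	h, ok := r.handlers[t]
	if !ok {
		return "", fmt.Errorf("%w: %q", ErrUnknownTask, t)
	}
	return h(payload)
}

func main() {
	reg := NewRegistry(map[TaskType]Handler{
		"node.health_check": func([]byte) (string, error) { return "healthy", nil },
	})

	out, err := reg.Dispatch("node.health_check", nil)
	fmt.Println(out, err)

	// A request for something that was never compiled in is refused.
	_, err = reg.Dispatch("shell.exec", []byte("id"))
	fmt.Println(errors.Is(err, ErrUnknownTask))
}
```

Because the registry only looks up handlers and never evaluates the payload as a command, adding a capability in this sketch requires a code change and a new build, not a new message.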

The Execution Domains Inside The Agent

The current binary contains two distinct execution domains:

Domain	Purpose	Failure model
Lifecycle/task execution	enroll, renew, poll, verify, dispatch, result reporting	pull-based, typed-task, bounded execution
Terminal execution	open session, bind PTY, stream frames, close session	long-lived interactive stream with separate runtime behavior

That split matters because terminal behavior is not just another task loop. The platform already preserves this seam so terminal transport can evolve without rewriting lifecycle execution.
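A rough sketch of why the two domains differ in shape, under the assumption that lifecycle tasks run under a deadline while a terminal session runs until its stream closes. The function names (`runBounded`, `streamSession`) and the sample tasks are hypothetical and are not taken from the repository.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runBounded models the lifecycle domain: one task, one hard deadline.
// A real agent would likely accept a parent context; this sketch uses
// context.Background() to stay small.
func runBounded(limit time.Duration, task func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never leaks on timeout
	go func() { done <- task(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// streamSession models the terminal domain: it relays frames for as long
// as the input stays open and returns the number of frames relayed.
// The caller must drain out (or buffer it) so sends do not block.
func streamSession(in <-chan []byte, out chan<- []byte) int {
	frames := 0
	for f := range in {
		out <- f
		frames++
	}
	return frames
}

// neverFinishes simulates a task that only stops when its deadline fires.
func neverFinishes(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// finishesAtOnce simulates a task that completes immediately.
func finishesAtOnce(context.Context) error { return nil }

func main() {
	fmt.Println("stuck lifecycle task:", runBounded(50*time.Millisecond, neverFinishes))
	fmt.Println("quick lifecycle task:", runBounded(time.Second, finishesAtOnce))

	in := make(chan []byte, 2)
	out := make(chan []byte, 2)
	in <- []byte("ls\n")
	in <- []byte("exit\n")
	close(in)
	fmt.Println("terminal frames relayed:", streamSession(in, out))
}
```

The two functions share no control flow, which is the property the split is meant to preserve: changing how frames are transported does not touch the deadline logic for tasks.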

What The Agent Already Does

Current Depth In Repo

The current implementation is already substantial:

- about 11K non-test Go lines in cmd/node-agent/**
- major execution surfaces:
  - agent.go
  - catalog.go
  - slice_vm.go
  - slice_topology.go
  - oci_workload.go
  - terminal_stream.go

This is platform runtime software, not glue code.

Control Plane To Host Sequence

Why GPU Slice Support Raises The Stakes

For slice-backed products, the node agent is responsible for the most dangerous host-local boundaries:

- topology discovery
- VM image and cloud-init handling
- VFIO and device passthrough preparation
- libvirt-backed VM lifecycle
- cleanup proof before slot reuse

That is why the platform keeps slice behavior behind typed tasks such as:

- slice.topology_discover
- slice.vm_provision
- slice.vm_release

Scheduling and product truth stay in the control plane. The agent owns execution truth on the host.
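One way to read "typed task" for a slice operation is a payload that decodes into a fixed struct and is rejected if it carries anything else. The sketch below assumes a JSON payload for `slice.vm_provision`; the struct and its fields (`slot_id`, `image_ref`, `gpu_count`) are hypothetical and chosen only for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// VMProvisionPayload is a hypothetical payload shape for slice.vm_provision.
type VMProvisionPayload struct {
	SlotID   string `json:"slot_id"`
	ImageRef string `json:"image_ref"`
	GPUCount int    `json:"gpu_count"`
}

// decodeVMProvision accepts only the declared fields and validates them.
// Unknown fields are an error, so a payload cannot smuggle extra intent.
func decodeVMProvision(raw []byte) (VMProvisionPayload, error) {
	var p VMProvisionPayload
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return VMProvisionPayload{}, fmt.Errorf("decode slice.vm_provision payload: %w", err)
	}
	if p.SlotID == "" || p.ImageRef == "" {
		return VMProvisionPayload{}, errors.New("slot_id and image_ref are required")
	}
	if p.GPUCount < 1 {
		return VMProvisionPayload{}, errors.New("gpu_count must be at least 1")
	}
	return p, nil
}

func main() {
	good := []byte(`{"slot_id":"slot-0","image_ref":"base-image","gpu_count":1}`)
	p, err := decodeVMProvision(good)
	fmt.Println(p, err)

	// An extra field is rejected instead of being silently ignored.
	bad := []byte(`{"slot_id":"slot-0","image_ref":"base-image","gpu_count":1,"command":"id"}`)
	_, err = decodeVMProvision(bad)
	fmt.Println(err != nil)
}
```

In this sketch the control plane still decides which slot and image to use; the agent only checks that the request has the shape its compiled handler expects before acting on the host.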

Security And Ops Reading

Security should read this as:

- no arbitrary shell from the control plane;
- task execution is typed and compiled;
- terminal is a separate runtime surface, not a hidden host bypass;
- host-local mutation is bounded and reportable.

Ops should read this as:

- node rollout risk is real runtime risk, not just package-install risk;
- drain, cleanup, and return-to-service proof matter because the agent is part of the runtime substrate;
- slice-mode hosts require stronger prerequisite and cleanup discipline than whole-node leasing.
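To make "cleanup proof" concrete, here is a small sketch of a report that must pass every check before a slot is treated as reusable. The `CleanupProof` type and its individual checks are assumptions made for illustration; the source does not specify what the real proof contains.

```go
package main

import "fmt"

// CleanupProof is a hypothetical report the agent could return after a
// release task. The individual checks are illustrative assumptions.
type CleanupProof struct {
	SlotID          string
	VMDestroyed     bool
	DevicesReleased bool
	DiskWiped       bool
}

// Missing lists the checks that have not passed, in a stable order.
func (p CleanupProof) Missing() []string {
	var missing []string
	if !p.VMDestroyed {
		missing = append(missing, "vm_destroyed")
	}
	if !p.DevicesReleased {
		missing = append(missing, "devices_released")
	}
	if !p.DiskWiped {
		missing = append(missing, "disk_wiped")
	}
	return missing
}

// ReadyForReuse is true only when every check has passed.
func (p CleanupProof) ReadyForReuse() bool {
	return len(p.Missing()) == 0
}

func main() {
	partial := CleanupProof{SlotID: "slot-0", VMDestroyed: true}
	fmt.Println(partial.ReadyForReuse(), partial.Missing())

	full := CleanupProof{SlotID: "slot-0", VMDestroyed: true, DevicesReleased: true, DiskWiped: true}
	fmt.Println(full.ReadyForReuse(), full.Missing())
}
```

The design choice illustrated is that reuse is gated on positive evidence for each step, so a partially failed release leaves the slot out of service and names what is still outstanding.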

Architecture should read this as:

- the node agent is one of the platform moats;
- it is the right place for bounded host primitives;
- it is the wrong place for product policy, UI truth, or control-plane ownership decisions.

Canonical sources
