Node Agent Runtime Depth
Status: implemented
The node agent is not a bootstrap helper or a wrapper around SSH. It is a bounded runtime executor that turns control-plane intent into host-local action without giving the control plane arbitrary shell authority.
Why It Matters
If this subsystem were weak, GPUaaS would collapse into:
- direct control-plane SSH,
- host-local scripts as hidden product logic,
- or runtime behavior that bypasses audit and typed contracts.
The node agent is the mechanism that prevents that.
Authority Boundary
The important rule is simple:
- the control plane may request an approved task;
- the node agent may execute only a compiled handler for that task;
- neither side gets to silently widen that boundary during incidents or new feature work.
The Execution Domains Inside The Agent
The current binary contains two distinct execution domains:
| Domain | Purpose | Failure model |
|---|---|---|
| Lifecycle/task execution | enroll, renew, poll, verify, dispatch, result reporting | pull-based, typed-task, bounded execution |
| Terminal execution | open session, bind PTY, stream frames, close session | long-lived interactive stream with separate runtime behavior |
That split matters because terminal behavior is not just another task loop. The platform already preserves this seam so terminal transport can evolve without rewriting lifecycle execution.
What The Agent Already Does
Current Depth In Repo
The current implementation is already substantial:
- about 11K non-test Go lines in cmd/node-agent/**
- major execution surfaces:
  - agent.go
  - catalog.go
  - slice_vm.go
  - slice_topology.go
  - oci_workload.go
  - terminal_stream.go
This is platform runtime software, not glue code.
Control Plane To Host Sequence
Why GPU Slice Support Raises The Stakes
For slice-backed products, the node agent is responsible for the most dangerous host-local boundaries:
- topology discovery
- VM image and cloud-init handling
- VFIO and device passthrough preparation
- libvirt-backed VM lifecycle
- cleanup proof before slot reuse
That is why the platform keeps slice behavior behind typed tasks such as:
- slice.topology_discover
- slice.vm_provision
- slice.vm_release
The scheduling and product truth stays in the control plane. The agent owns execution truth on the host.
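To make "typed task" concrete, here is a hedged Go sketch of strict payload decoding for the three task names listed above. The task names come from this page. The payload structs and their fields (`slot_id`, `image_ref`, `gpu_count`) are hypothetical, as is the assumption that topology discovery takes no parameters. The sketch uses the standard library's `DisallowUnknownFields` so that a payload carrying extra keys is rejected instead of silently ignored.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Task type names as listed on this page.
const (
	TaskTopologyDiscover = "slice.topology_discover"
	TaskVMProvision      = "slice.vm_provision"
	TaskVMRelease        = "slice.vm_release"
)

// VMProvisionPayload and VMReleasePayload use hypothetical fields.
type VMProvisionPayload struct {
	SlotID   string `json:"slot_id"`
	ImageRef string `json:"image_ref"`
	GPUCount int    `json:"gpu_count"`
}

type VMReleasePayload struct {
	SlotID string `json:"slot_id"`
}

// decodeStrict rejects any JSON key that the target struct does not declare.
func decodeStrict(raw []byte, dst any) error {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	return dec.Decode(dst)
}

// ParseSliceTask returns a typed payload for a known slice task, or an error.
func ParseSliceTask(taskType string, raw []byte) (any, error) {
	switch taskType {
	case TaskTopologyDiscover:
		return struct{}{}, nil // assumed to take no parameters
	case TaskVMProvision:
		var p VMProvisionPayload
		if err := decodeStrict(raw, &p); err != nil {
			return nil, fmt.Errorf("%s: %w", taskType, err)
		}
		if p.SlotID == "" || p.ImageRef == "" || p.GPUCount < 1 {
			return nil, fmt.Errorf("%s: missing or invalid required field", taskType)
		}
		return p, nil
	case TaskVMRelease:
		var p VMReleasePayload
		if err := decodeStrict(raw, &p); err != nil {
			return nil, fmt.Errorf("%s: %w", taskType, err)
		}
		if p.SlotID == "" {
			return nil, fmt.Errorf("%s: slot_id is required", taskType)
		}
		return p, nil
	default:
		return nil, fmt.Errorf("unknown slice task %q", taskType)
	}
}

func main() {
	good := []byte(`{"slot_id":"slot-3","image_ref":"img-a","gpu_count":1}`)
	p, err := ParseSliceTask(TaskVMProvision, good)
	fmt.Printf("%+v %v\n", p, err)

	// An extra key such as "command" fails decoding instead of being ignored.
	bad := []byte(`{"slot_id":"slot-3","image_ref":"img-a","gpu_count":1,"command":"sh"}`)
	_, err = ParseSliceTask(TaskVMProvision, bad)
	fmt.Println(err != nil)
}
```

In this sketch the control plane decides which slot and image to request, while the agent only validates the request shape and executes it, which mirrors the split between product truth and execution truth described above.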
Security And Ops Reading
Security should read this as:
- no arbitrary shell from the control plane;
- task execution is typed and compiled;
- terminal is a separate runtime surface, not a hidden host bypass;
- host-local mutation is bounded and reportable.
Ops should read this as:
- node rollout risk is real runtime risk, not just package-install risk;
- drain, cleanup, and return-to-service proof matter because the agent is part of the runtime substrate;
- slice-mode hosts require stronger prerequisite and cleanup discipline than whole-node leasing.
Architecture should read this as:
- the node agent is one of the platform moats;
- it is the right place for bounded host primitives;
- it is the wrong place for product policy, UI truth, or control-plane ownership decisions.
Related Portal Pages
- Node Lifecycle
- GPU Slicing And Scheduler Layers
- Workload Access and Runtime Surfaces
- Platform Proof Points