Node Agent Runtime Depth (status: implemented)

The node agent is not a bootstrap helper or a wrapper around SSH. It is a bounded runtime executor that turns control-plane intent into host-local action without giving the control plane arbitrary shell authority.

Why It Matters

If this subsystem were weak, GPUaaS would collapse into:

  • direct control-plane SSH,
  • host-local scripts as hidden product logic,
  • or runtime behavior that bypasses audit and typed contracts.

The node agent is the mechanism that prevents that.

Authority Boundary

The important rule is simple:

  • the control plane may request an approved task;
  • the node agent may execute only a compiled handler for that task;
  • neither side gets to silently widen that boundary during incidents or new feature work.

The Execution Domains Inside The Agent

The current binary contains two distinct execution domains:

| Domain | Purpose | Failure model |
| --- | --- | --- |
| Lifecycle/task execution | enroll, renew, poll, verify, dispatch, result reporting | pull-based, typed-task, bounded execution |
| Terminal execution | open session, bind PTY, stream frames, close session | long-lived interactive stream with separate runtime behavior |

That split matters because terminal behavior is not just another task loop. The platform already preserves this seam so terminal transport can evolve without rewriting lifecycle execution.

What The Agent Already Does

Current Depth In Repo

The current implementation is already substantial:

  • about 11K non-test Go lines in cmd/node-agent/**
  • major execution surfaces:
    • agent.go
    • catalog.go
    • slice_vm.go
    • slice_topology.go
    • oci_workload.go
    • terminal_stream.go

This is platform runtime software, not glue code.

Control Plane To Host Sequence

Why GPU Slice Support Raises The Stakes

For slice-backed products, the node agent is responsible for the most dangerous host-local boundaries:

  • topology discovery
  • VM image and cloud-init handling
  • VFIO and device passthrough preparation
  • libvirt-backed VM lifecycle
  • cleanup proof before slot reuse

That is why the platform keeps slice behavior behind typed tasks such as:

  • slice.topology_discover
  • slice.vm_provision
  • slice.vm_release

The scheduling and product truth stays in the control plane. The agent owns execution truth on the host.
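A typed task implies a typed payload. As a hedged sketch of what that could look like for `slice.vm_provision`, the code below decodes a payload strictly and rejects unknown fields. The struct `VMProvisionPayload`, its field names, and `DecodeVMProvision` are illustrative assumptions, not the platform's real schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// VMProvisionPayload is a hypothetical payload for slice.vm_provision.
// The field names are assumptions made for this example.
type VMProvisionPayload struct {
	SlotID   string   `json:"slot_id"`
	ImageRef string   `json:"image_ref"`
	GPUAddrs []string `json:"gpu_pci_addrs"`
}

// DecodeVMProvision accepts only the declared fields. Any extra key, such
// as an injected command string, makes the whole payload invalid.
func DecodeVMProvision(raw []byte) (VMProvisionPayload, error) {
	var p VMProvisionPayload
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return VMProvisionPayload{}, fmt.Errorf("decode slice.vm_provision: %w", err)
	}
	if p.SlotID == "" || p.ImageRef == "" || len(p.GPUAddrs) == 0 {
		return VMProvisionPayload{}, errors.New("slice.vm_provision: missing required field")
	}
	return p, nil
}
```

Strict decoding keeps the contract narrow: the control plane can express which slot, image, and devices it wants, and nothing about how the host should carry that out.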

Security, Ops, And Architecture Reading

Security should read this as:

  • no arbitrary shell from the control plane;
  • task execution is typed and compiled;
  • terminal is a separate runtime surface, not a hidden host bypass;
  • host-local mutation is bounded and reportable.

Ops should read this as:

  • node rollout risk is real runtime risk, not just package-install risk;
  • drain, cleanup, and return-to-service proof matter because the agent is part of the runtime substrate;
  • slice-mode hosts require stronger prerequisite and cleanup discipline than whole-node leasing.
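For operators, "cleanup proof" is easiest to reason about as an explicit record that gates slot reuse. The sketch below assumes three checks (VM destroyed, devices released, disk wiped). The type `CleanupProof`, those checks, and `ReadyForReuse` are hypothetical and stand in for whatever the platform actually verifies.

```go
package main

import "fmt"

// CleanupProof is a hypothetical record a host could report after
// slice.vm_release. The three checks are assumptions for illustration.
type CleanupProof struct {
	SlotID          string
	VMDestroyed     bool
	DevicesReleased bool
	DiskWiped       bool
}

// MissingSteps lists the checks that have not been proven yet.
func (p CleanupProof) MissingSteps() []string {
	var missing []string
	if !p.VMDestroyed {
		missing = append(missing, "vm_destroyed")
	}
	if !p.DevicesReleased {
		missing = append(missing, "devices_released")
	}
	if !p.DiskWiped {
		missing = append(missing, "disk_wiped")
	}
	return missing
}

// ReadyForReuse returns nil only when every check has been proven, so a
// slot with partial cleanup can never be handed back silently.
func ReadyForReuse(p CleanupProof) error {
	if missing := p.MissingSteps(); len(missing) > 0 {
		return fmt.Errorf("slot %s not reusable, unproven steps: %v", p.SlotID, missing)
	}
	return nil
}
```

The design choice is that reuse is denied by default: a zero-value proof fails every check, so forgetting to report a step blocks the slot rather than releasing it.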

Architecture should read this as:

  • the node agent is one of the platform moats;
  • it is the right place for bounded host primitives;
  • it is the wrong place for product policy, UI truth, or control-plane ownership decisions.