
Node Lifecycle runbook

GPU nodes are production capacity, a security boundary, and a user runtime all at once. Node lifecycle operations should be ring-based, observable, and recoverable, so that untested changes are never pushed across the full fleet.

Lifecycle Map

Operating Areas

Area | Operator posture
Node-agent | Pull-based typed tasks, health, drift detection, recovery
Certificates | Short-lived host cert lifecycle, renewal, revocation, evidence
Isolation | Current user-revoke model, future full-reimage path behind MAAS readiness
Release rings | Internal, UAT/security, canary, broad production, sensitive tenants
Reserve capacity | Drain and patch only when spare capacity exists
Recovery | Node-agent, terminal preflight, MAAS state, reimage, and return-to-service evidence
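
The reserve-capacity posture ("drain and patch only when spare capacity exists") can be sketched as a simple pre-drain check. This is a minimal illustration, not the platform's actual logic: the PoolCapacity type, its field names, and the reserve_floor threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolCapacity:
    """Hypothetical snapshot of one GPU node pool."""
    total_nodes: int
    in_use_nodes: int
    draining_nodes: int
    reserve_floor: int  # assumed minimum number of spare nodes to keep


def spare_nodes(pool: PoolCapacity) -> int:
    """Nodes that are neither serving workloads nor already draining."""
    return pool.total_nodes - pool.in_use_nodes - pool.draining_nodes


def can_drain(pool: PoolCapacity, batch_size: int = 1) -> bool:
    """Allow a drain only if spare capacity can absorb the batch
    and still leave the reserve floor intact."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return spare_nodes(pool) - batch_size >= pool.reserve_floor
```

With 10 nodes, 7 in use, 1 already draining, and a reserve floor of 1, this check permits draining one more node but refuses a batch of two.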

Slice-Ready Node Posture

When a host participates in GPU slice products, operators should treat slice-readiness as a promoted runtime posture, not an automatic hardware fact.

The control plane owns SKU intent, placement, and claim truth. The node owns host-local discovery and execution proof. In practice that means:

  • approved slot inventory is the scheduling source of truth;
  • host-local topology discovery is advisory until approved into slot inventory;
  • cleanup proof is required before a slot returns to reuse;
  • draining and cleanup-blocked states matter at both node and slot level.
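
The rules above can be sketched as a small slot state model. This is an assumption-laden illustration: the state names, the Slot type, and the cleanup_proof string are hypothetical and do not describe the real control-plane schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlotState(Enum):
    DISCOVERED = "discovered"            # host-local topology, advisory only
    APPROVED = "approved"                # in approved slot inventory
    CLAIMED = "claimed"                  # held by a workload
    DRAINING = "draining"
    CLEANUP_BLOCKED = "cleanup_blocked"  # cleanup proof missing


@dataclass
class Slot:
    slot_id: str
    state: SlotState
    cleanup_proof: Optional[str] = None  # hypothetical evidence reference


def is_schedulable(slot: Slot, node_draining: bool) -> bool:
    """Only approved slots on a non-draining node are schedulable."""
    return slot.state is SlotState.APPROVED and not node_draining


def release_slot(slot: Slot, cleanup_proof: Optional[str]) -> Slot:
    """Return a slot to reuse only when cleanup proof is supplied;
    otherwise park it in CLEANUP_BLOCKED."""
    releasable = (SlotState.CLAIMED, SlotState.DRAINING, SlotState.CLEANUP_BLOCKED)
    if slot.state not in releasable:
        raise ValueError(f"cannot release slot in state {slot.state.value}")
    if not cleanup_proof:
        slot.state = SlotState.CLEANUP_BLOCKED
        return slot
    slot.cleanup_proof = cleanup_proof
    slot.state = SlotState.APPROVED
    return slot
```

In this sketch a discovered slot is never schedulable until it is approved, and a node-level drain overrides an otherwise approved slot, which mirrors the node-level and slot-level distinction in the last bullet.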

See GPU Slicing And Scheduler Layers for the full control-plane versus node-plane split. See Node Agent Runtime Depth for the actual host-runtime authority and task model, rather than just the lifecycle posture.

Release / Reimage Direction

Do not treat every node change as safe for whole-fleet rollout. Patch and feature work should start with ring evidence. Full reimage isolation depends on MAAS readiness and should remain gated until the deploy/release path is proven.
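
One way to picture ring-gated rollout and the reimage gate is the sketch below. It assumes the rings are promoted in the order they are listed under Operating Areas and that evidence is recorded as a per-ring pass/fail flag; the ring identifiers, the evidence mapping, and both function names are hypothetical.

```python
from typing import Dict, List

# Assumed promotion order, taken from the order the rings are listed above.
RINGS: List[str] = [
    "internal",
    "uat_security",
    "canary",
    "broad_production",
    "sensitive_tenants",
]


def allowed_rings(evidence: Dict[str, bool]) -> List[str]:
    """Rings a change may currently be deployed to: every ring with passing
    evidence, plus the first ring that still lacks it. Evidence recorded
    for a later ring does not let a change skip an earlier one."""
    allowed: List[str] = []
    for ring in RINGS:
        allowed.append(ring)
        if not evidence.get(ring, False):
            break
    return allowed


def full_reimage_allowed(maas_ready: bool, release_path_proven: bool) -> bool:
    """Full-reimage isolation stays gated until both conditions hold."""
    return maas_ready and release_path_proven
```

A change with no evidence at all may only reach the internal ring, and a whole-fleet rollout requires passing evidence in every earlier ring.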