Node Lifecycle runbook
GPU nodes are production capacity, a security boundary, and the user runtime all at once. Node lifecycle operations should therefore be ring-based, observable, and recoverable, and should never push untested changes across the full fleet.
Lifecycle Map
Operating Areas
| Area | Operator posture |
|---|---|
| Node-agent | Pull-based typed tasks, health, drift detection, recovery |
| Certificates | Short-lived host cert lifecycle, renewal, revocation, evidence |
| Isolation | Current user-revoke model, future full-reimage path behind MAAS readiness |
| Release rings | Internal, UAT/security, canary, broad production, sensitive tenants |
| Reserve capacity | Drain and patch only when spare capacity exists |
| Recovery | Node-agent, terminal preflight, MAAS state, reimage, and return-to-service evidence |
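The reserve-capacity rule in the table can be read as a simple gate in front of any drain. The following is a minimal sketch under stated assumptions, not the platform's actual check: `PoolCapacity`, its fields, and `can_drain` are hypothetical names, and how demand and reserve are measured is left to the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolCapacity:
    """Hypothetical snapshot of one GPU node pool (all fields are assumptions)."""
    total_nodes: int      # nodes in the pool
    draining_nodes: int   # nodes already drained or being patched
    required_nodes: int   # nodes needed to serve current demand
    reserve_nodes: int    # spare nodes to keep in hand at all times


def can_drain(pool: PoolCapacity, count: int = 1) -> bool:
    """Return True only if draining `count` more nodes keeps the reserve intact."""
    if count < 1:
        raise ValueError("count must be at least 1")
    in_service_after = pool.total_nodes - pool.draining_nodes - count
    return in_service_after >= pool.required_nodes + pool.reserve_nodes


if __name__ == "__main__":
    pool = PoolCapacity(total_nodes=10, draining_nodes=1, required_nodes=7, reserve_nodes=1)
    print(can_drain(pool, 1))  # one more drain still leaves demand plus reserve covered
    print(can_drain(pool, 2))  # two more drains would eat into the reserve
```

Keeping the gate as a pure function of a capacity snapshot makes the decision easy to log as evidence alongside the drain request.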
Slice-Ready Node Posture
When a host participates in GPU slice products, operators should treat slice-readiness as a promoted runtime posture, not an automatic hardware fact.
The control plane owns SKU intent, placement, and claim truth. The node owns host-local discovery and execution proof. In practice that means:
- approved slot inventory is the scheduling source of truth;
- host-local topology discovery is advisory until approved into slot inventory;
- cleanup proof is required before a slot returns to reuse;
- draining and cleanup-blocked states matter at both node and slot level.
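The four rules above can be sketched as a small slot model. This is an illustrative sketch only: `SlotState`, `Slot`, `schedulable`, and `return_to_reuse` are hypothetical names, and the cleanup proof is treated as an opaque reference because the source does not describe its format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlotState(Enum):
    AVAILABLE = "available"
    CLAIMED = "claimed"
    DRAINING = "draining"
    CLEANUP_BLOCKED = "cleanup_blocked"


@dataclass
class Slot:
    slot_id: str
    approved: bool = False               # True once promoted into approved slot inventory
    state: SlotState = SlotState.AVAILABLE
    cleanup_proof: Optional[str] = None  # opaque evidence reference reported by the node


def schedulable(slot: Slot, node_draining: bool) -> bool:
    """Discovered-but-unapproved slots are advisory; node and slot state both gate placement."""
    return slot.approved and slot.state is SlotState.AVAILABLE and not node_draining


def return_to_reuse(slot: Slot, cleanup_proof: Optional[str]) -> Slot:
    """Release a slot for reuse; without cleanup proof it stays cleanup-blocked."""
    if not cleanup_proof:
        slot.state = SlotState.CLEANUP_BLOCKED
        slot.cleanup_proof = None
        return slot
    slot.cleanup_proof = cleanup_proof
    slot.state = SlotState.AVAILABLE
    return slot
```

In this sketch a slot that was only discovered on the host (`approved=False`) is never schedulable, and a released slot becomes available again only after the node supplies proof of cleanup.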
See GPU Slicing And Scheduler Layers for the full control-plane versus node-plane split. See Node Agent Runtime Depth for the host-runtime authority and task model, as opposed to the lifecycle posture covered here.
Release / Reimage Direction
Do not treat every node change as safe for whole-fleet rollout. Patch and feature work should start in the earliest ring and advance only on evidence from that ring. Full reimage isolation depends on MAAS readiness and should remain gated until the deploy/release path is proven.
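One way to picture ring gating is a function that names the only ring a change may enter next. This is a minimal sketch, assuming the rings are promoted in the order they are listed under Operating Areas; the ring identifiers, `next_ring`, and `may_full_reimage` are hypothetical and do not describe an existing tool.

```python
from typing import Dict, Optional

# Assumed promotion order, taken from the order the rings are listed in this runbook.
RINGS = ["internal", "uat-security", "canary", "broad-production", "sensitive-tenants"]


def next_ring(evidence: Dict[str, bool]) -> Optional[str]:
    """Return the first ring that lacks passing evidence, or None if every ring passed.

    A change may roll out only to the returned ring; all later rings stay gated,
    even if evidence was somehow recorded for them out of order.
    """
    for ring in RINGS:
        if not evidence.get(ring, False):
            return ring
    return None


def may_full_reimage(maas_ready: bool, release_path_proven: bool) -> bool:
    """Full reimage stays gated until both conditions hold."""
    return maas_ready and release_path_proven


if __name__ == "__main__":
    print(next_ring({"internal": True, "uat-security": True}))  # canary is next
    print(may_full_reimage(maas_ready=True, release_path_proven=False))
```

Because `next_ring` always returns the earliest ring without evidence, skipping a ring is not expressible in this model, which matches the intent of not pushing untested changes across the fleet.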