Node Lifecycle Runbook

GPU nodes are production capacity, a security boundary, and the user runtime all at once. Node lifecycle operations should therefore be ring-based, observable, and recoverable, and should never push untested changes across the full fleet.

Lifecycle Map

Operating Areas

Area	Operator posture
Node-agent	Pull-based typed tasks, health, drift detection, recovery
Certificates	Short-lived host cert lifecycle, renewal, revocation, evidence
Isolation	Current user-revoke model, future full-reimage path behind MAAS readiness
Release rings	Internal, UAT/security, canary, broad production, sensitive tenants
Reserve capacity	Drain and patch only when spare capacity exists
Recovery	Node-agent, terminal preflight, MAAS state, reimage, and return-to-service evidence
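
The reserve-capacity posture in the table can be read as a simple gate in front of any drain. The following is a minimal sketch, not the platform's actual implementation: `PoolCapacity`, `reserve_floor`, and `can_drain` are hypothetical names, and the idea of a fixed idle-node floor per pool is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolCapacity:
    """Hypothetical capacity snapshot for one GPU pool (illustrative only)."""
    total_nodes: int
    nodes_in_use: int      # nodes currently serving tenant workloads
    nodes_draining: int    # nodes already drained or being patched
    reserve_floor: int     # assumed minimum number of idle nodes to keep


def can_drain(pool: PoolCapacity, batch_size: int = 1) -> bool:
    """Return True only if draining batch_size more nodes keeps the reserve floor."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Idle nodes are the spare capacity that absorbs displaced workloads.
    idle = pool.total_nodes - pool.nodes_in_use - pool.nodes_draining
    return idle - batch_size >= pool.reserve_floor


pool = PoolCapacity(total_nodes=10, nodes_in_use=6, nodes_draining=1, reserve_floor=2)
print(can_drain(pool, batch_size=1))
print(can_drain(pool, batch_size=2))
```

With three idle nodes and a floor of two, the sketch allows a one-node drain and refuses a two-node drain. A real gate would likely read capacity from the control plane rather than from a static snapshot.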

Slice-Ready Node Posture

When a host participates in GPU slice products, operators should treat slice-readiness as a promoted runtime posture, not an automatic hardware fact.

The control plane owns SKU intent, placement, and claim truth. The node owns host-local discovery and execution proof. In practice that means:

approved slot inventory is the scheduling source of truth;
host-local topology discovery is advisory until approved into slot inventory;
cleanup proof is required before a slot returns to reuse;
draining and cleanup-blocked states matter at both node and slot level.
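
The four rules above can be sketched as a small slot state model. This is a hedged illustration only: the state names, the `Slot` type, and the `is_schedulable` and `release` helpers are hypothetical and are not taken from the node-agent or control-plane code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SlotState(Enum):
    DISCOVERED = "discovered"            # host-local topology, advisory only
    APPROVED = "approved"                # in slot inventory, schedulable
    CLAIMED = "claimed"                  # bound to a tenant claim
    DRAINING = "draining"                # being emptied
    CLEANUP_BLOCKED = "cleanup_blocked"  # cleanup not yet proven


@dataclass
class Slot:
    slot_id: str
    state: SlotState
    cleanup_proof: Optional[str] = None  # hypothetical evidence reference


def is_schedulable(slot: Slot, node_draining: bool) -> bool:
    """Only approved slots on a non-draining node are scheduling candidates."""
    return slot.state is SlotState.APPROVED and not node_draining


def release(slot: Slot, cleanup_proof: Optional[str]) -> Slot:
    """Return a slot to reuse only when cleanup proof is supplied."""
    releasable = (SlotState.CLAIMED, SlotState.DRAINING, SlotState.CLEANUP_BLOCKED)
    if slot.state not in releasable:
        raise ValueError(f"cannot release slot in state {slot.state.value}")
    if not cleanup_proof:
        # Without proof the slot stays out of the schedulable pool.
        slot.state = SlotState.CLEANUP_BLOCKED
        return slot
    slot.cleanup_proof = cleanup_proof
    slot.state = SlotState.APPROVED
    return slot
```

In this sketch a discovered slot is never schedulable until something outside the node promotes it to `APPROVED`, and node-level draining overrides slot-level state, which mirrors the point that both levels matter.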

See GPU Slicing And Scheduler Layers for the full control-plane versus node-plane split. See Node Agent Runtime Depth for the actual host-runtime authority and task model, which go beyond the lifecycle posture described here.

Release / Reimage Direction

Do not treat every node change as safe for whole-fleet rollout. Patch and feature work should start with ring evidence. Full reimage isolation depends on MAAS readiness and should remain gated until the deploy/release path is proven.
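
One way to picture this gating is a check that a change may enter a ring only when every earlier ring has recorded evidence, with full reimage additionally blocked until MAAS is ready. The sketch below is an assumption-laden illustration: the ring identifiers are derived from the Release rings row, treating that row's order as the promotion order is an assumption, and `may_roll_out` with its parameters is a hypothetical helper.

```python
from typing import Set

# Assumed promotion order, taken from the order listed under "Release rings".
RINGS = [
    "internal",
    "uat-security",
    "canary",
    "broad-production",
    "sensitive-tenants",
]


def may_roll_out(
    target_ring: str,
    rings_with_evidence: Set[str],
    *,
    is_full_reimage: bool = False,
    maas_ready: bool = False,
) -> bool:
    """Allow rollout to target_ring only if all earlier rings have evidence.

    Full-reimage changes are refused outright until MAAS readiness is proven.
    """
    if target_ring not in RINGS:
        raise ValueError(f"unknown ring: {target_ring}")
    if is_full_reimage and not maas_ready:
        return False
    earlier = RINGS[: RINGS.index(target_ring)]
    return all(ring in rings_with_evidence for ring in earlier)


print(may_roll_out("canary", {"internal"}))
print(may_roll_out("canary", {"internal", "uat-security"}))
```

The first call is refused because the UAT/security ring has no evidence yet; the second is allowed. What counts as sufficient ring evidence is left out of the sketch on purpose, since the source does not define it.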

Canonical sources
