GPU Slicing And Scheduler Layers (designed)
GPUaaS does not treat a “GPU count” as sufficient scheduling truth. The platform separates three distinct concerns:
- capacity shape
- control-plane placement
- node-local runtime execution
That separation is what lets the platform support whole-node leasing today and host-local GPU slice products without collapsing scheduling into ad hoc node scripts.
The Two Capacity Shapes
| Shape | Meaning | Scheduling unit |
|---|---|---|
| `baremetal` | one allocation owns an entire enrolled GPU host | physical node |
| `gpu_slice` | one allocation owns one or more approved host-local resource bundles | resource slot set on one node |
The first slice model is not “fractional GPU by inference”. It is an approved slot inventory per host. A slot is a schedulable bundle that can include:
- GPU device identity
- NVMe or volume identity
- network or fabric device identity
- NUMA locality
- CPU and memory reservation metadata
- VM/runtime metadata needed to launch safely
This is the key product and architecture rule: the platform schedules approved slots, not raw PCI devices and not just a GPU count.
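To make the slot rule concrete, here is a minimal data-model sketch. The class names, field names, and the `schedulable_inventory` helper are hypothetical illustrations of the bundle contents listed above, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class CapacityShape(Enum):
    BAREMETAL = "baremetal"  # one allocation owns an entire enrolled GPU host
    GPU_SLICE = "gpu_slice"  # one allocation owns approved slots on one node


@dataclass(frozen=True)
class Slot:
    """One schedulable host-local bundle (hypothetical field names)."""
    slot_id: str
    node_id: str
    gpu_device: str               # GPU device identity
    nvme_device: Optional[str]    # NVMe or volume identity, if bundled
    fabric_device: Optional[str]  # network or fabric device identity, if bundled
    numa_node: int                # NUMA locality
    cpu_cores: int                # CPU reservation metadata
    memory_gib: int               # memory reservation metadata
    approved: bool = False        # only approved slots are schedulable
    # VM/runtime launch metadata; excluded from hash/eq because dicts are unhashable.
    runtime_meta: dict = field(default_factory=dict, hash=False, compare=False)


def schedulable_inventory(slots: list[Slot]) -> list[Slot]:
    # The scheduler works from approved slots only, never from raw devices.
    return [s for s in slots if s.approved]
```

An unapproved slot can exist in discovery output without ever reaching placement, which is the point of keeping approval on the slot itself.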
Control Plane Vs Node Plane
What The Control Plane Owns
The control plane is the authority for intent and placement. It owns:
- the public SKU and capacity-shape model;
- tenant, project, policy, and entitlement boundaries;
- allocation intent and lifecycle;
- placement candidate resolution;
- durable resource claims;
- read models shown to users, admins, and operators;
- reconciliation when provider or node truth drifts.
Important consequence: GPUaaS should not let node-local runtime code invent its own product model. The control plane decides:
- whether a request is `baremetal` or `gpu_slice`;
- which node family is eligible;
- which slot set is compatible;
- whether a multi-slot request must stay on one node;
- when a slot or node is blocked, draining, cleanup-blocked, or reusable.
What The Node Plane Owns
The node plane is the authority for host-local execution truth. It owns:
- discovery of candidate slot topology on the host;
- verification that the host is actually slice-ready;
- realization of an approved allocation into a VM or runtime instance;
- host-local cleanup and proof on release.
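Verifying that a host is slice-ready can be modeled as a promotion gate over collected evidence. This is a sketch under assumptions: the evidence keys below are hypothetical placeholders, since the actual readiness checks are not enumerated here.

```python
# Hypothetical evidence keys; the real readiness checks are not specified here.
REQUIRED_EVIDENCE = frozenset({
    "topology_discovered",       # slot topology was discovered on the host
    "slot_inventory_approved",   # candidate slots were approved
    "provision_release_tested",  # a test VM was provisioned and released cleanly
})


def promotion_decision(evidence: dict[str, bool]) -> tuple[bool, list[str]]:
    """Promote a host to slice-capable only if every required check passed."""
    # Missing or False evidence both count as "not proven".
    missing = sorted(k for k in REQUIRED_EVIDENCE if not evidence.get(k, False))
    return (not missing, missing)
```

Returning the missing items, not just a boolean, gives operators the reason a host stays whole-node only.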
For slice mode, node-agent tasks already expose this boundary:
- `slice.topology_discover`
- `slice.vm_provision`
- `slice.vm_release`
Those tasks are bounded host primitives. They are not a generic remote shell.
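One way to keep the node-agent surface bounded is an allowlisted dispatcher with no fallback path. The `BoundedTaskRunner` class and its handler signature are hypothetical; only the three task names come from the list above.

```python
from typing import Callable

# Hypothetical handler signature: takes a params dict, returns a result dict.
Handler = Callable[[dict], dict]


class BoundedTaskRunner:
    """Dispatches only allowlisted slice tasks; anything else is rejected."""

    ALLOWED = frozenset({
        "slice.topology_discover",
        "slice.vm_provision",
        "slice.vm_release",
    })

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        # Refuse to register anything outside the bounded primitive set.
        if name not in self.ALLOWED:
            raise ValueError(f"task {name!r} is not a bounded host primitive")
        self._handlers[name] = handler

    def run(self, name: str, params: dict) -> dict:
        handler = self._handlers.get(name)
        if handler is None:
            # No fallback to arbitrary command execution.
            raise PermissionError(f"unknown or unregistered task {name!r}")
        return handler(params)
```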
Scheduler Layers
There are two scheduler layers, and they solve different problems.
1. Control-plane scheduler
This is the platform scheduler layer. It decides:
- which node family or node can satisfy the request;
- whether the request is whole-node or slice-backed;
- whether a compatible same-node slot set exists;
- whether the request violates policy, health, drain, or occupancy rules;
- which placement candidate wins.
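The decisions above can be sketched as a filter-then-pick pipeline. This is a simplified illustration: the node and request fields are hypothetical, slot compatibility is collapsed into a free-slot count, and the tightest-fit tie-break is an assumed policy, not a documented one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeState:
    node_id: str
    node_family: str
    healthy: bool
    draining: bool
    free_slots: int          # approved, unclaimed slots (compatibility simplified away)
    claimed_slots: int
    baremetal_claimed: bool


@dataclass
class PlacementRequest:
    shape: str               # "baremetal" or "gpu_slice"
    node_family: str
    slot_count: int = 0      # only meaningful for gpu_slice


def is_eligible(node: NodeState, req: PlacementRequest, allowed_families: set[str]) -> bool:
    if req.node_family not in allowed_families:  # policy / entitlement boundary
        return False
    if node.node_family != req.node_family:      # node family must match the SKU
        return False
    if not node.healthy or node.draining:         # health and drain rules
        return False
    if node.baremetal_claimed:                    # whole-node claim blocks everything
        return False
    if req.shape == "baremetal":
        return node.claimed_slots == 0            # whole node must have no claims
    return node.free_slots >= req.slot_count      # slice set must fit on this node


def choose_candidate(nodes: list[NodeState], req: PlacementRequest,
                     allowed_families: set[str]) -> Optional[NodeState]:
    candidates = [n for n in nodes if is_eligible(n, req, allowed_families)]
    if not candidates:
        return None
    # Assumed tie-break: prefer the tightest fit so whole nodes stay free.
    return min(candidates, key=lambda n: (n.free_slots, n.node_id))
```

Keeping eligibility as a pure predicate separates hard rules (policy, health, drain, occupancy) from the softer choice of which valid candidate wins.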
This layer works with product constructs such as:
- SKUs
- capacity shapes
- node occupancy
- resource claims
- policy overlays
- placement candidates
2. Node or app runtime scheduler
This is the runtime execution layer on top of placed capacity. It decides how work runs inside already-allocated capacity. Examples:
- a VM runtime using a selected slice bundle;
- Slurm or another scheduler app dispatching jobs inside project-owned capacity;
- future app operators managing queues, workers, and runtime topology.
This layer must not replace the control-plane scheduler. App schedulers consume platform placement outcomes; they do not own node inventory truth.
Why This Separation Matters
Without the split, three failure modes show up quickly:
- product SKUs drift from what hosts can actually run;
- node-local scripts become the real scheduler, outside tenancy and audit;
- runtime schedulers such as Slurm start leaking scheduler-specific behavior into core allocation APIs.
GPUaaS explicitly avoids these failure modes. The platform keeps scheduler-agnostic control primitives in core and pushes runtime-specific logic behind app/operator boundaries.
Placement And Occupancy Rules
The architecture docs already define occupancy as a derived aggregate, not a single “node busy” flag.
| Node posture | Meaning |
|---|---|
| fully available | no blocking claims and node is schedulable |
| partially allocated | some approved slots are claimed |
| fully allocated by slices | no additional compatible slots remain |
| exclusively allocated by baremetal | whole-node claim blocks all slot use |
| draining | existing allocations continue, new placement stops |
| cleanup blocked | host or slot cleanup proof is not sufficient for reuse |
| unavailable | health or admin posture blocks placement |
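Because posture is derived rather than stored as a flag, it can be computed from underlying facts. In this sketch the `NodeFacts` fields are hypothetical, the precedence order (blocking conditions before occupancy) is an assumption, and "no compatible slots remain" is simplified to "all approved slots claimed".

```python
from dataclasses import dataclass
from enum import Enum


class Posture(Enum):
    FULLY_AVAILABLE = "fully available"
    PARTIALLY_ALLOCATED = "partially allocated"
    FULLY_ALLOCATED_BY_SLICES = "fully allocated by slices"
    EXCLUSIVE_BAREMETAL = "exclusively allocated by baremetal"
    DRAINING = "draining"
    CLEANUP_BLOCKED = "cleanup blocked"
    UNAVAILABLE = "unavailable"


@dataclass
class NodeFacts:
    healthy: bool
    admin_enabled: bool
    draining: bool
    cleanup_proof_ok: bool
    baremetal_claim: bool
    approved_slots: int
    claimed_slots: int


def derive_posture(f: NodeFacts) -> Posture:
    # Assumed precedence: blocking conditions win over occupancy.
    if not f.healthy or not f.admin_enabled:
        return Posture.UNAVAILABLE
    if not f.cleanup_proof_ok:
        return Posture.CLEANUP_BLOCKED
    if f.draining:
        return Posture.DRAINING
    if f.baremetal_claim:
        return Posture.EXCLUSIVE_BAREMETAL
    if f.claimed_slots == 0:
        return Posture.FULLY_AVAILABLE
    if f.claimed_slots >= f.approved_slots:
        return Posture.FULLY_ALLOCATED_BY_SLICES
    return Posture.PARTIALLY_ALLOCATED
```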
For the first slice implementation, `gpu_slice` placement must fit on a single physical node. The control plane should fail closed with `sku_unavailable` when no same-node compatible slot set exists.
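A same-node search that fails closed might look like the following. It assumes, hypothetically, that "compatible" means sharing a NUMA node; the `SkuUnavailable` exception and the tuple input format are illustrative only.

```python
from collections import defaultdict


class SkuUnavailable(Exception):
    """Raised when no single node can host the whole slot set (fail closed)."""
    code = "sku_unavailable"


def find_same_node_slot_set(free_slots: list[tuple[str, str, int]],
                            count: int) -> tuple[str, list[str]]:
    """free_slots: (node_id, slot_id, numa_node) for approved, unclaimed slots."""
    # Group by (node, NUMA node): the assumed compatibility rule.
    groups: dict[tuple[str, int], list[str]] = defaultdict(list)
    for node_id, slot_id, numa in free_slots:
        groups[(node_id, numa)].append(slot_id)
    for (node_id, _numa), slot_ids in sorted(groups.items()):
        if len(slot_ids) >= count:
            return node_id, sorted(slot_ids)[:count]
    # Never split a request across nodes or NUMA groups.
    raise SkuUnavailable(f"no node has {count} compatible free slots")
```

With four free slots spread over two nodes, a three-slot request still fails: total free capacity does not matter if no single node holds a compatible set.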
Intent, Observed State, And Repair
GPUaaS uses a hybrid control model:
- desired state: what the platform intends
- observed state: what provider or node evidence says is true
- execution state: what the current workflow is doing
- projected state: what the UI currently shows
That matters for slicing because a node may be:
- intended for slice use,
- observed as missing a required host prerequisite,
- executing a cleanup or release workflow,
- still projected in the UI as available or blocked depending on reconciliation.
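Keeping the layers as separate fields makes it harder to confuse them. In this sketch the `NodeView` class and its string values are hypothetical; the one deliberate choice is that placement ignores the projected UI value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeView:
    desired: str              # intent, e.g. "slice_capable"
    observed: str             # evidence, e.g. "slice_capable" or "missing_prerequisite"
    execution: Optional[str]  # active workflow, e.g. "slice.vm_release", or None
    projected: str            # what the UI last rendered

    def needs_reconcile(self) -> bool:
        # Intent and evidence disagree: repair or re-verify before placing.
        return self.desired != self.observed

    def placeable(self) -> bool:
        # Place only when evidence matches intent and no workflow is in flight;
        # the projected UI value is intentionally not consulted.
        return not self.needs_reconcile() and self.execution is None
```

A node shown as "available" in the UI but observed as missing a prerequisite is not placeable, and a node still shown as "blocked" can be placeable once evidence catches up.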
The portal should make this easy to understand: placement is not the same as execution, and neither is the same as what the UI last projected.
What Product And Ops Should Read From This
Product view
- GPU slice is a first-class product shape, not a hidden runtime detail.
- A user buys a capacity shape and SKU outcome, not a raw host implementation.
- Future MIG, vGPU, or shared-GPU products should land as explicit shapes or child-slot models, not by overloading the first slice contract.
Ops and infra view
- slice capability is a promotion decision with readiness evidence, not a hardware guess;
- approved slot inventory is the scheduling source of truth;
- cleanup proof is part of safe reuse, not an afterthought;
- node-agent remains the bounded execution surface for host-local operations.
Architecture and developer view
- core allocation and placement logic stays scheduler-agnostic;
- runtime schedulers such as Slurm belong behind the app/operator boundary;
- placement correctness depends on claims and slots, not UI read models.