GPU Slicing And Scheduler Layers

Status: designed

GPUaaS does not treat “GPU count” as enough scheduling truth. The platform separates three different concerns:

  1. capacity shape
  2. control-plane placement
  3. node-local runtime execution

That separation is what lets the platform support whole-node leasing today, and host-local GPU slice products alongside it, without collapsing scheduling into ad hoc node scripts.

The Two Capacity Shapes

Shape | Meaning | Scheduling unit
baremetal | one allocation owns an entire enrolled GPU host | physical node
gpu_slice | one allocation owns one or more approved host-local resource bundles | resource slot set on one node

The first slice model is not “fractional GPU by inference”. It is an approved slot inventory per host. A slot is a schedulable bundle that can include:

  • GPU device identity
  • NVMe or volume identity
  • network or fabric device identity
  • NUMA locality
  • CPU and memory reservation metadata
  • VM/runtime metadata needed to launch safely

This is the key product and architecture rule: the platform schedules approved slots, not raw PCI devices and not just a GPU count.
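
As a rough illustration of the slot bundle described above, the sketch below models one slot as a record and filters discovered slots down to the approved inventory. The class name, field names, and the schedulable_slots helper are all hypothetical; the source only lists which kinds of identity and metadata a slot can carry.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class Slot:
    """One schedulable bundle on a single host (all field names are assumptions)."""
    slot_id: str
    node_id: str
    gpu_device: str                       # GPU device identity
    volume: Optional[str] = None          # NVMe or volume identity
    fabric_device: Optional[str] = None   # network or fabric device identity
    numa_node: Optional[int] = None       # NUMA locality
    cpu_cores: int = 0                    # CPU reservation metadata
    memory_gib: int = 0                   # memory reservation metadata
    runtime_metadata: Dict[str, str] = field(default_factory=dict)


def schedulable_slots(discovered: Iterable[Slot], approved_ids: Set[str]) -> List[Slot]:
    """Keep only slots that appear in the approved inventory.

    Discovery alone never makes a device schedulable; approval does.
    """
    return [s for s in discovered if s.slot_id in approved_ids]
```

The point of the filter is that a discovered device with no approved slot id never reaches the scheduler, regardless of how many GPUs the host reports.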

Control Plane Vs Node Plane

What The Control Plane Owns

The control plane is the authority for intent and placement. It owns:

  • the public SKU and capacity-shape model;
  • tenant, project, policy, and entitlement boundaries;
  • allocation intent and lifecycle;
  • placement candidate resolution;
  • durable resource claims;
  • read models shown to users, admins, and operators;
  • reconciliation when provider or node truth drifts.

Important consequence: GPUaaS should not let node-local runtime code invent its own product model. The control plane decides:

  • whether a request is baremetal or gpu_slice,
  • which node family is eligible,
  • which slot set is compatible,
  • whether a multi-slot request must stay on one node,
  • when a slot or node is blocked, draining, cleanup-blocked, or reusable.

What The Node Plane Owns

The node plane is the authority for host-local execution truth. It owns:

  • discovery of candidate slot topology on the host;
  • verification that the host is actually slice-ready;
  • realization of an approved allocation into a VM or runtime instance;
  • host-local cleanup and proof on release.

For slice mode, node-agent tasks already expose this boundary:

  • slice.topology_discover
  • slice.vm_provision
  • slice.vm_release

Those tasks are bounded host primitives. They are not a generic remote shell.
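
One way to picture "bounded host primitives" is a dispatcher that accepts only a fixed set of task names and rejects everything else. The three task names below come from the list above; the make_dispatcher function, the handler signature, and the payload shape are assumptions made for this sketch and are not the node-agent's actual interface.

```python
from typing import Callable, Dict

# Task names taken from the list above; everything else here is hypothetical.
ALLOWED_TASKS = (
    "slice.topology_discover",
    "slice.vm_provision",
    "slice.vm_release",
)

Handler = Callable[[dict], dict]


class UnknownTaskError(ValueError):
    """Raised when a task is outside the bounded set or has no handler."""


def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str, dict], dict]:
    # Refuse to register handlers for anything outside the allowlist.
    extra = set(handlers) - set(ALLOWED_TASKS)
    if extra:
        raise UnknownTaskError(f"handlers outside allowlist: {sorted(extra)}")

    def dispatch(task_name: str, payload: dict) -> dict:
        # Only allowlisted tasks with a registered handler may run.
        if task_name not in ALLOWED_TASKS or task_name not in handlers:
            raise UnknownTaskError(task_name)
        return handlers[task_name](payload)

    return dispatch
```

Because the allowlist is checked both at registration and at dispatch, there is no path by which an arbitrary command string becomes an executed task.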

Scheduler Layers

There are two scheduler layers, and they solve different problems.

1. Control-plane scheduler

This is the platform scheduler layer. It decides:

  • which node family or node can satisfy the request;
  • whether the request is whole-node or slice-backed;
  • whether a compatible same-node slot set exists;
  • whether the request violates policy, health, drain, or occupancy rules;
  • which placement candidate wins.

This layer works with product constructs such as:

  • SKUs
  • capacity shapes
  • node occupancy
  • resource claims
  • policy overlays
  • placement candidates
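
The decisions listed for this layer can be sketched as a filter over placement candidates that records why each node was rejected and then picks a winner. The Candidate fields, the rejection labels, and the tie-break (fewest free slots, then node id) are assumptions for illustration; the source does not specify how the winning candidate is chosen.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """A node considered for placement (hypothetical fields)."""
    node_id: str
    policy_ok: bool
    healthy: bool
    draining: bool
    free_slots: int


def resolve(
    candidates: List[Candidate], slots_needed: int
) -> Tuple[Optional[Candidate], Dict[str, str]]:
    """Return the winning candidate (or None) plus a rejection reason per node."""
    rejected: Dict[str, str] = {}
    eligible: List[Candidate] = []
    for c in candidates:
        if not c.policy_ok:
            rejected[c.node_id] = "policy"
        elif not c.healthy:
            rejected[c.node_id] = "health"
        elif c.draining:
            rejected[c.node_id] = "drain"
        elif c.free_slots < slots_needed:
            rejected[c.node_id] = "occupancy"
        else:
            eligible.append(c)
    # Assumed tie-break: prefer the tightest fit, then the lowest node id.
    winner = min(eligible, key=lambda c: (c.free_slots, c.node_id), default=None)
    return winner, rejected
```

Keeping the rejection reasons alongside the winner gives operators an audit trail for why a node was skipped, which node-local scripts would not provide.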

2. Node or app runtime scheduler

This is the runtime execution layer on top of placed capacity. It decides how work runs inside already allocated capacity. Examples:

  • a VM runtime using a selected slice bundle;
  • Slurm or another scheduler app dispatching jobs inside project-owned capacity;
  • future app operators managing queues, workers, and runtime topology.

This layer must not replace the control-plane scheduler. App schedulers consume platform placement outcomes; they do not own node inventory truth.

Why This Separation Matters

Without the split, three failure modes show up quickly:

  1. product SKUs drift from what hosts can actually run;
  2. node-local scripts become the real scheduler, outside tenancy and audit;
  3. runtime schedulers such as Slurm start leaking scheduler-specific behavior into core allocation APIs.

GPUaaS explicitly avoids all three. The platform keeps scheduler-agnostic control primitives in core and pushes runtime-specific logic behind app/operator boundaries.

Placement And Occupancy Rules

The architecture docs already define occupancy as a derived aggregate, not a single “node busy” flag.

Node posture | Meaning
fully available | no blocking claims and node is schedulable
partially allocated | some approved slots are claimed
fully allocated by slices | no additional compatible slots remain
exclusively allocated by baremetal | whole-node claim blocks all slot use
draining | existing allocations continue, new placement stops
cleanup blocked | host or slot cleanup proof is not sufficient for reuse
unavailable | health or admin posture blocks placement
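
To show what "derived aggregate" can mean in practice, the sketch below computes a posture from several independent facts about a node instead of storing one flag. The NodeFacts fields and the precedence order of the checks are assumptions; the posture names come from the table above. "No additional compatible slots remain" is simplified here to "every approved slot is claimed".

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class NodeFacts:
    """Inputs to the posture derivation (hypothetical field names)."""
    healthy: bool
    admin_enabled: bool
    draining: bool
    cleanup_proven: bool
    baremetal_claimed: bool
    approved_slots: FrozenSet[str]
    claimed_slots: FrozenSet[str]


def derive_posture(n: NodeFacts) -> str:
    # Assumed precedence: blocking conditions first, then claims.
    if not (n.healthy and n.admin_enabled):
        return "unavailable"
    if not n.cleanup_proven:
        return "cleanup blocked"
    if n.draining:
        return "draining"
    if n.baremetal_claimed:
        return "exclusively allocated by baremetal"
    claimed = n.claimed_slots & n.approved_slots
    if not claimed:
        return "fully available"
    if claimed == n.approved_slots:
        return "fully allocated by slices"
    return "partially allocated"
```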

For the first slice implementation, gpu_slice placement must fit on a single physical node. The control plane should fail closed with sku_unavailable when no same-node compatible slot set exists.
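
A minimal sketch of the single-node rule follows. It assumes the control plane already has a map from node id to free, compatible slot ids; the function name, the PlacementError type, and the best-fit choice among eligible nodes are hypothetical. Only the sku_unavailable code and the same-node constraint come from the text above.

```python
from typing import Dict, List, Tuple


class PlacementError(Exception):
    """Carries a machine-readable failure code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def place_gpu_slice(
    slots_needed: int, free_slots_by_node: Dict[str, List[str]]
) -> Tuple[str, List[str]]:
    """Pick one node that can hold the whole request, or fail closed."""
    if slots_needed < 1:
        raise ValueError("slots_needed must be at least 1")
    fits = [
        (node, slots)
        for node, slots in free_slots_by_node.items()
        if len(slots) >= slots_needed
    ]
    if not fits:
        # Never split a request across nodes, even if total free slots suffice.
        raise PlacementError("sku_unavailable")
    # Assumed policy: tightest fit first, node id as a deterministic tie-break.
    node, slots = min(fits, key=lambda f: (len(f[1]), f[0]))
    return node, sorted(slots)[:slots_needed]
```

For example, with one free slot on n1, three on n2, and two on n3, a request for two slots lands on n3, while a request for four slots fails with sku_unavailable even though six slots are free across the fleet.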

Intent, Observed State, And Repair

GPUaaS uses a hybrid control model:

  • desired state: what the platform intends
  • observed state: what provider or node evidence says is true
  • execution state: what the current workflow is doing
  • projected state: what the UI currently shows

That matters for slicing because a node may be:

  • intended for slice use,
  • observed as missing a required host prerequisite,
  • executing a cleanup or release workflow,
  • still projected in the UI as available or blocked depending on reconciliation.

The portal should make this easy to understand: placement is not the same as execution, and neither is the same as what the UI last projected.
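
The four states can be held side by side in one record so that placement logic reads desired, observed, and execution state but never the projection. The string values and both helper functions below are hypothetical and only illustrate the distinction; they are not the platform's actual state vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSliceState:
    """The four views of one node (state values are illustrative assumptions)."""
    desired: str     # what the platform intends, e.g. "slice"
    observed: str    # what node evidence says, e.g. "ready"
    execution: str   # what the current workflow is doing, e.g. "idle"
    projected: str   # what the UI last showed, e.g. "available"


def placeable(s: NodeSliceState) -> bool:
    # Deliberately ignores s.projected: the UI read model is not an input.
    return s.desired == "slice" and s.observed == "ready" and s.execution == "idle"


def projection_stale(s: NodeSliceState) -> bool:
    # The projection is stale when it disagrees with what placement would decide.
    expected = "available" if placeable(s) else "blocked"
    return s.projected != expected
```

A node with desired "slice", observed "missing_prerequisite", execution "cleanup", and projected "available" is not placeable, and its projection is stale until reconciliation catches up.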

What Product And Ops Should Read From This

Product view

  • GPU slice is a first-class product shape, not a hidden runtime detail.
  • A user buys a capacity shape and SKU outcome, not a raw host implementation.
  • Future MIG, vGPU, or shared-GPU products should land as explicit shapes or child-slot models, not by overloading the first slice contract.

Ops and infra view

  • slice capability is a promotion decision with readiness evidence, not a hardware guess;
  • approved slot inventory is the scheduling source of truth;
  • cleanup proof is part of safe reuse, not an afterthought;
  • node-agent remains the bounded execution surface for host-local operations.

Architecture and developer view

  • core allocation and placement logic stays scheduler-agnostic;
  • runtime schedulers such as Slurm belong behind the app/operator boundary;
  • placement correctness depends on claims and slots, not UI read models.