# Managed Runtime Bundles v1

Purpose:
- Define how framework and environment choice should be delivered on top of GPU allocations.
- Keep allocation creation fast while still offering supported runtime choices such as PyTorch.
- Establish the ownership boundary between platform-managed runtime state and user-managed SSH-installed state.

Inputs:
- `doc/product/Allocation_Experience_Gaps_v1.md`
- `doc/product/Navigation_Redesign_App_Platform_v1.md`
- `doc/architecture/App_Runtime_Operating_Modes_v1.md`
- `doc/architecture/App_Runtime_Instance_Lifecycle_v1.md`

Related:
- `doc/architecture/Kubernetes_Platform_Options_v1.md`
- `doc/product/Slurm_UI_Options_v1.md`

---

## 1. Executive Summary

The platform should not solve framework choice by turning every allocation into a slow, heavily parameterized image-build exercise.

Instead:
- node provisioning prepares the machine for GPU use
- allocation stays fast
- managed runtime bundles provide supported user-space environments on top

Examples:
- PyTorch
- Jupyter
- notebook workspaces
- future inference or training bundles

These bundles should be:
- optional at allocation time
- installable or changeable later
- clearly separated from arbitrary user-managed SSH customization

---

## 2. Problem Statement

Today the platform has a good fast-allocation model:
- heavy system work happens during node provisioning
- drivers and core prerequisites are ready before the user asks for compute

But users still need environment choice.

If we push that choice into node provisioning alone:
- image sprawl grows
- compatibility management becomes harder
- allocation UX becomes slower and less predictable

If we leave it entirely to user SSH customization:
- every user repeats the same work
- the platform cannot support or reason about the resulting state

Managed runtime bundles are the middle layer.

---

## 3. Design Principles

### 3.1 Drivers belong to the base node

Base node provisioning should continue to own:
- OS
- GPU drivers
- core system/runtime prerequisites

### 3.2 Framework choice belongs above raw allocation

PyTorch, Jupyter, and similar stacks should be treated as managed runtime bundles, not base-node identity.

### 3.3 Platform-managed runtime state must live in a platform-owned path

Do not let managed runtime installs mutate arbitrary user shell state by default.

Preferred pattern:
- `/opt/gpuaas/runtimes/<bundle>/<version>`

Examples:
- `/opt/gpuaas/runtimes/pytorch/2.6.0-cu124`
- `/opt/gpuaas/runtimes/jupyter/2026.04`
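
As an illustration of the path pattern above, the following minimal sketch builds an install root from a bundle name and a version. The function name `runtime_install_root` and the segment validation rule are assumptions made for this example; the document does not define a naming rule for bundles or versions.

```python
import re
from pathlib import PurePosixPath

# Root taken from the preferred pattern above.
RUNTIMES_ROOT = PurePosixPath("/opt/gpuaas/runtimes")

# Assumed naming rule (not specified by this document): lowercase letters,
# digits, dots, underscores, and hyphens, starting with a letter or digit.
# This also rejects "..", empty strings, and anything containing "/".
_SEGMENT = re.compile(r"[a-z0-9][a-z0-9._-]*")


def runtime_install_root(bundle: str, version: str) -> PurePosixPath:
    """Return /opt/gpuaas/runtimes/<bundle>/<version> for validated segments."""
    for label, value in (("bundle", bundle), ("version", version)):
        if not _SEGMENT.fullmatch(value):
            raise ValueError(f"invalid {label} segment: {value!r}")
    return RUNTIMES_ROOT / bundle / version


print(runtime_install_root("pytorch", "2.6.0-cu124"))
```

Validating each segment before joining keeps a malformed bundle or version string from escaping the platform-owned root.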

### 3.4 User-managed SSH customization remains allowed

Users may still:
- install their own packages
- create their own virtualenv/conda envs
- place their own tools under home/project paths

The platform should not try to govern or support arbitrary user-installed state outside the managed runtime paths.

### 3.5 Allocation-time selection is optional

Users should be able to:
- create raw compute with no managed runtime selected
- or choose an initial managed runtime during allocation creation

Later, they should also be able to:
- apply a bundle
- switch bundles
- upgrade a bundle

---

## 4. Product Model

### 4.1 Bundle identity

A managed runtime bundle should have:
- product name
- bundle slug
- compatible GPU/runtime family
- version
- support level

Examples:
- `PyTorch`
- `Jupyter`
- `PyTorch Notebook`

The user-facing label should be product-oriented, not implementation-oriented.
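
The identity fields listed above could be modeled roughly as follows. This is a sketch only: the class name, field names, and the `SupportLevel` values are hypothetical, since this document names the attributes but does not fix their types or allowed values.

```python
from dataclasses import dataclass
from enum import Enum


class SupportLevel(Enum):
    # Hypothetical values; this document does not define the support levels.
    SUPPORTED = "supported"
    PREVIEW = "preview"


@dataclass(frozen=True)
class ManagedRuntimeBundle:
    product_name: str            # product-oriented, user-facing name
    slug: str                    # stable identifier, e.g. for paths and APIs
    gpu_family: str              # compatible GPU/runtime family (format assumed)
    version: str
    support_level: SupportLevel

    @property
    def label(self) -> str:
        """User-facing label: the product name, not the slug or version."""
        return self.product_name


bundle = ManagedRuntimeBundle(
    product_name="PyTorch",
    slug="pytorch",
    gpu_family="cuda",           # assumed example value
    version="2.6.0-cu124",
    support_level=SupportLevel.SUPPORTED,
)
print(bundle.label)
```

Keeping the label separate from the slug and version reflects the requirement that the user sees a product name rather than an implementation detail.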

### 4.2 Where bundle choice appears

Bundle choice should appear in two places:

1. Allocation creation
   - optional initial runtime selection

2. Allocation detail
   - install / change / upgrade managed runtime

This preserves the simple compute-first flow while still allowing guided runtime setup.

### 4.3 How bundles relate to apps

These bundles can still be treated as app bundles in the broader product architecture.

That means they are:
- cataloged
- versioned
- lifecycle-managed
- eventually upgradeable and auditable

But for user experience, some of them should also be usable as allocation-scoped runtime choices rather than only as heavyweight standalone workloads.

---

## 5. Ownership Boundary

### 5.1 Platform-owned

The platform owns and supports:
- bundle installation in platform-owned paths
- bundle activation helpers
- bundle version visibility
- bundle upgrades/rollbacks where supported
- repair/reinstall of the managed bundle

### 5.2 User-owned

The user owns:
- custom packages installed over SSH
- custom conda/venv state outside managed paths
- custom containers they run themselves
- arbitrary home-directory mutations

### 5.3 Support statement

The platform should say clearly:
- managed runtime bundles are supported and observable
- manual SSH-installed environments are allowed but not platform-managed

That is the cleanest support and incident boundary.
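
For tooling that needs to apply this boundary mechanically, for example when triaging an incident, a path check along these lines may be enough. The function name is hypothetical, and the check covers only the managed install root under `/opt/gpuaas/runtimes`. It is purely lexical and does not resolve symlinks on a real filesystem.

```python
from pathlib import PurePosixPath

# Platform-owned install root used throughout this document.
MANAGED_ROOT = PurePosixPath("/opt/gpuaas/runtimes")


def is_platform_managed(path: str) -> bool:
    """Return True if path is the managed root or lies beneath it.

    Lexical comparison only: paths containing ".." are treated as
    not managed rather than being normalized.
    """
    parts = PurePosixPath(path).parts
    root = MANAGED_ROOT.parts
    return parts[: len(root)] == root and ".." not in parts


print(is_platform_managed("/opt/gpuaas/runtimes/pytorch/2.6.0-cu124/bin/python"))
print(is_platform_managed("/home/alice/venvs/custom/bin/python"))
```

Comparing path components rather than string prefixes avoids misclassifying a sibling directory such as `/opt/gpuaas/runtimes-old`.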

---

## 6. Recommended Filesystem Pattern

### 6.1 Managed install root

Use:
- `/opt/gpuaas/runtimes/<bundle>/<version>`

Optional convenience links:
- `/opt/gpuaas/runtimes/<bundle>/current`

### 6.2 Activation helpers

Provide a stable entrypoint such as:
- `source /opt/gpuaas/runtimes/pytorch/current/activate.sh`

or a wrapper like:
- `gpuaas-runtime use pytorch`

The exact shell UX can evolve, but the important part is:
- users can opt into the platform-managed runtime explicitly
- platform-managed paths remain isolated from arbitrary user customization

---

## 7. First Candidate Bundles

Recommended initial managed runtime bundle set:
- PyTorch
- Jupyter

Potential later additions:
- TensorFlow
- vLLM / inference runtime
- data science notebook stacks
- CUDA/ROCm development bundle

PyTorch is the highest-value first candidate: it meets the most obvious ML framework expectation, and a managed bundle spares users from manually rebuilding the environment on every allocation.

---

## 8. UX Direction

Allocation create flow:
- raw compute remains the primary object
- optional managed runtime selector appears as an advanced or recommended step

Allocation detail:
- show current managed runtime state
- show installed version
- show last applied timestamp
- allow upgrade/change/reinstall where supported

Workload/app surfaces:
- longer term, these bundles can also appear in app catalog form when it is useful to present them as packaged experiences

---

## 9. API And Runtime Boundary

Initial implementation uses a bounded contract:
- `GET /api/v1/runtime-bundles` lists platform-managed bundles from `managed_runtime_bundles`
- `POST /api/v1/allocations` accepts optional `runtime_bundle_id` or `runtime_bundle_slug`
- `GET /api/v1/allocations/{allocation_id}/runtime-bundles` shows allocation-scoped bundle state
- `POST /api/v1/allocations/{allocation_id}/runtime-bundles` applies or changes the desired bundle
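
A client-side sketch of the optional bundle selection on `POST /api/v1/allocations` is shown below. Only `runtime_bundle_id` and `runtime_bundle_slug` come from the contract above. The other request fields are hypothetical placeholders, and treating the two selectors as mutually exclusive is an assumption of this sketch, not something the contract states.

```python
import json
from typing import Optional


def build_allocation_request(
    base: dict,
    *,
    runtime_bundle_id: Optional[str] = None,
    runtime_bundle_slug: Optional[str] = None,
) -> str:
    """Return a JSON body for POST /api/v1/allocations.

    With neither selector set, the body describes raw compute with no
    managed runtime. Passing both is rejected (assumed rule).
    """
    if runtime_bundle_id is not None and runtime_bundle_slug is not None:
        raise ValueError("pass runtime_bundle_id or runtime_bundle_slug, not both")
    body = dict(base)  # copy so the caller's dict is not mutated
    if runtime_bundle_id is not None:
        body["runtime_bundle_id"] = runtime_bundle_id
    elif runtime_bundle_slug is not None:
        body["runtime_bundle_slug"] = runtime_bundle_slug
    return json.dumps(body, sort_keys=True)


# "gpu_count" is a hypothetical allocation field used only for illustration.
print(build_allocation_request({"gpu_count": 1}))
print(build_allocation_request({"gpu_count": 1}, runtime_bundle_slug="pytorch"))
```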

The node-agent task is intentionally constrained to `runtime.write_env_file`. For an active allocation, the control plane writes a descriptor under:

```text
/etc/gpuaas/runtime/allocations/<allocation_id>.env
```

The descriptor points at the platform-owned install root, for example:

```text
/opt/gpuaas/runtimes/pytorch/2.6.0-cu124
```

This first contract does not run arbitrary package installs on behalf of a user. It makes the selected supported runtime discoverable and auditable while keeping allocation provisioning fast.
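
To make the descriptor concrete, the sketch below renders one possible file body and computes its location. The key names (`GPUAAS_RUNTIME_BUNDLE` and so on), the allocation ID character set, and both function names are assumptions for illustration. This document specifies only the descriptor directory and that the descriptor points at the install root.

```python
import re
from pathlib import PurePosixPath

# Descriptor directory taken from the contract above.
DESCRIPTOR_DIR = PurePosixPath("/etc/gpuaas/runtime/allocations")

# Assumed allocation ID character set; the real format is not specified here.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def descriptor_path(allocation_id: str) -> PurePosixPath:
    """Return /etc/gpuaas/runtime/allocations/<allocation_id>.env."""
    if not _SAFE_ID.fullmatch(allocation_id):
        raise ValueError(f"invalid allocation id: {allocation_id!r}")
    return DESCRIPTOR_DIR / f"{allocation_id}.env"


def render_descriptor(bundle_slug: str, version: str, install_root: str) -> str:
    """Render KEY=value lines; key names are hypothetical."""
    fields = {
        "GPUAAS_RUNTIME_BUNDLE": bundle_slug,
        "GPUAAS_RUNTIME_VERSION": version,
        "GPUAAS_RUNTIME_ROOT": install_root,
    }
    for key, value in fields.items():
        # Reject empty values and whitespace so each line stays a single
        # unquoted KEY=value pair.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid value for {key}: {value!r}")
    return "".join(f"{key}={value}\n" for key, value in fields.items())


print(descriptor_path("alloc-123"))
print(render_descriptor("pytorch", "2.6.0-cu124",
                        "/opt/gpuaas/runtimes/pytorch/2.6.0-cu124"), end="")
```

The descriptor carries only pointers to already-installed, platform-owned state, which is consistent with a write-only task that never runs package installs.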

---

## 10. Open Questions

- Is the first PyTorch bundle allocation-scoped only, or also exposed as a standalone app-catalog item?
- Do upgrades mutate `current` in place, or install side-by-side and switch a symlink atomically?
- How much of the bundle install should be observable in workload/allocation activity feeds?
- Should bundle application require restart, or can it be purely user-space in the first version?
- How do bundle compatibility rules interact with GPU vendor family and driver branch?
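
To inform the upgrade question above, this is a minimal sketch of the side-by-side option, assuming versions are already installed as sibling directories and `current` is a symlink. It does not settle the question. The function name and the temporary link name are hypothetical.

```python
import os


def switch_current(bundle_dir: str, version: str) -> None:
    """Point <bundle_dir>/current at <version> with an atomic rename."""
    target = os.path.join(bundle_dir, version)
    if not os.path.isdir(target):
        raise FileNotFoundError(target)

    link = os.path.join(bundle_dir, "current")
    tmp = os.path.join(bundle_dir, f".current.tmp.{os.getpid()}")

    # Clear a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp):
        os.remove(tmp)

    # Create the new link under a temporary name, using a relative target so
    # the bundle directory stays relocatable.
    os.symlink(version, tmp)

    # On POSIX, rename replaces the old link in a single step, so readers
    # see either the old version or the new one, never a missing link.
    os.replace(tmp, link)
```

Because the previous version directory is left in place, a rollback under this approach is the same call with the older version string.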

---

## 11. Decision Summary

The right boundary is:
- base node owns drivers and core system setup
- allocation stays fast
- managed runtime bundles provide supported framework choice
- managed bundles install into platform-owned paths
- users remain free to customize elsewhere over SSH

This preserves the platform’s fast-allocation advantage while giving users a supported way to get common environments such as PyTorch without turning allocation provisioning into a large mutable image matrix.
