# MAAS Node State Model v1

## 1. Purpose

Define the state model for MAAS-managed GPU nodes in a way that separates:
- coarse GPUasService node lifecycle state,
- detailed workflow/job state,
- observed MAAS machine state,
- operator actions and recovery transitions.

This document exists because the MAAS lifecycle is no longer simple enough to describe safely with prose alone.

Use this together with:
- [`MAAS_Bare_Metal_Lifecycle_v1.md`](./MAAS_Bare_Metal_Lifecycle_v1.md)
- [`State_Machines.md`](./State_Machines.md)

## 2. Model Layers

There are three distinct state layers:

1. `nodes.status`
   - coarse lifecycle state of the GPUasService node record
   - stable and operator-facing

2. `node_onboardings.status` / `node_decommissions.status`
   - detailed workflow/job execution state
   - expresses retryability, manual intervention, compensation, reconciliation

3. observed MAAS machine state
   - current upstream machine status from MAAS
   - not owned by GPUasService

Do not collapse these into one enum.
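
One way to keep the layers apart in code is to give each its own type and carry them side by side. The sketch below is illustrative only: `NodeStateSnapshot` and its field names are hypothetical, while the enum values are the ones this document defines in sections 3.1 and 4.1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeStatus(str, Enum):
    """Layer 1: coarse lifecycle state (nodes.status)."""
    BOOTSTRAP_ISSUED = "bootstrap_issued"
    ENROLLING = "enrolling"
    ACTIVE = "active"
    OFFLINE = "offline"
    QUARANTINED = "quarantined"
    DRAINING = "draining"
    RETIRED = "retired"
    REMOVING = "removing"
    DELETED = "deleted"


class JobStatus(str, Enum):
    """Layer 2: workflow/job state (node_onboardings / node_decommissions)."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED_RETRYABLE = "failed_retryable"
    FAILED_MANUAL_INTERVENTION = "failed_manual_intervention"
    CANCELLED = "cancelled"
    COMPENSATING = "compensating"
    RECONCILED = "reconciled"


@dataclass(frozen=True)
class NodeStateSnapshot:
    """Hypothetical read model: one field per layer, never merged."""
    node_status: NodeStatus
    onboarding_status: Optional[JobStatus] = None
    decommission_status: Optional[JobStatus] = None
    # Layer 3: raw status string reported by MAAS; None means the machine is absent.
    observed_maas_state: Optional[str] = None
```

Keeping the MAAS state as a plain observed string, rather than a GPUasService enum, reflects that it is not owned by GPUasService.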

## 3. GPUasService Node Lifecycle

### 3.1 Canonical coarse states

| State | Meaning |
|---|---|
| `bootstrap_issued` | manual bootstrap bundle/token issued; node not yet enrolled |
| `enrolling` | MAAS/manual onboarding is in progress; node not ready for scheduling |
| `active` | node is healthy and schedulable |
| `offline` | node expected to exist but not currently polling/responding |
| `quarantined` | node exists but is blocked from scheduling due to drift, cleanup failure, or investigation |
| `draining` | retire/decommission drain is in progress; the node is unschedulable until the drain lifecycle finishes |
| `retired` | node intentionally removed from scheduling but not yet being deleted |
| `removing` | remove/uninstall workflow is in progress |
| `deleted` | terminal state outside active inventory |

### 3.2 Node state transitions

```text
bootstrap_issued -> enrolling -> active
enrolling -> quarantined
active -> offline
offline -> active
active -> quarantined
offline -> quarantined
quarantined -> active
quarantined -> draining
active -> draining
offline -> draining
draining -> retired
draining -> offline          (drain failed / operator recovery chooses non-retired fallback)
retired -> offline           (reactivate/reuse same identity; health must prove active)
retired -> removing
removing -> retired          (uninstall failure / operator rollback)
removing -> deleted
```
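
The transition list above can be encoded as a lookup table so that writers of `nodes.status` reject anything not listed. This is a minimal sketch; the names `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical, not existing GPUasService APIs.

```python
from typing import Dict, FrozenSet

# Each key maps to the set of states it may move to, per the list above.
ALLOWED_TRANSITIONS: Dict[str, FrozenSet[str]] = {
    "bootstrap_issued": frozenset({"enrolling"}),
    "enrolling": frozenset({"active", "quarantined"}),
    "active": frozenset({"offline", "quarantined", "draining"}),
    "offline": frozenset({"active", "quarantined", "draining"}),
    "quarantined": frozenset({"active", "draining"}),
    "draining": frozenset({"retired", "offline"}),
    "retired": frozenset({"offline", "removing"}),
    "removing": frozenset({"retired", "deleted"}),
    "deleted": frozenset(),  # terminal: no outgoing transitions
}


def can_transition(current: str, target: str) -> bool:
    """Return True only if current -> target appears in the transition list."""
    return target in ALLOWED_TRANSITIONS.get(current, frozenset())
```

For example, `can_transition("retired", "active")` is false: a retired node must pass through `offline` and prove health before it becomes `active`.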

### 3.3 Transition ownership

| Transition | Owner |
|---|---|
| `bootstrap_issued -> enrolling` | admin API / onboarding workflow start |
| `enrolling -> active` | onboarding workflow after successful agent enrollment and health checks |
| `enrolling -> quarantined` | onboarding workflow or operator action when onboarding reaches a blocked/unsafe state |
| `active -> offline` | heartbeat/reconciliation logic |
| `offline -> active` | heartbeat recovery / reconciliation |
| `active/offline -> quarantined` | reconciler or cleanup validation |
| `quarantined -> active` | explicit operator recovery or reconciliation after the node is proven healthy again |
| `active/offline/quarantined -> draining` | admin lifecycle action that starts a resumable drain operation |
| `draining -> retired` | drain lifecycle completes successfully |
| `draining -> offline` | drain lifecycle fails or operator recovery elects to stop short of retirement |
| `retired -> offline` | explicit reactivate/reuse flow, only if the node was paused/retired and not fully removed; heartbeat or probe may later move it to `active` |
| `retired -> removing` | remove workflow |
| `removing -> retired` | uninstall failure path |
| `removing -> deleted` | uninstall success / final delete |

Guardrails:
- `draining` and `removing` are coarse lifecycle states, not self-sufficient execution state.
- Entering either state must also create or resume an owning lifecycle operation/task.
- A node must never be left in `draining` or `removing` solely because a short-lived task lease expired.
- `retired -> offline` is valid only for retained node identity reuse. It preserves the installed identity without making the node immediately schedulable.
- Once a full decommission reaches completed removal (`removing -> deleted`), the old node identity must not be reactivated.
- Re-onboarding after full remove creates a new GPUasService node record.
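
The first two guardrails can be read as an invariant: a node in `draining` or `removing` always has an owning task. The sketch below assumes a hypothetical in-memory store (`InMemoryInventory`) in place of the real persistence layer; the task names `node.drain` and `node.uninstall` are the ones used in section 8.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

# In-progress coarse state -> lifecycle task that must own it.
OWNING_TASK = {"draining": "node.drain", "removing": "node.uninstall"}


@dataclass
class InMemoryInventory:
    """Hypothetical stand-in for the real store; not the actual schema."""
    status: Dict[str, str] = field(default_factory=dict)
    open_tasks: Set[Tuple[str, str]] = field(default_factory=set)

    def enter_in_progress(self, node_id: str, target: str) -> None:
        """Set the coarse state and ensure its owning task exists, together."""
        if target not in OWNING_TASK:
            raise ValueError(f"{target!r} is not an in-progress lifecycle state")
        self.status[node_id] = target
        # Adding to a set is a no-op if the task is already open (resume case).
        self.open_tasks.add((node_id, OWNING_TASK[target]))

    def has_owner(self, node_id: str) -> bool:
        """True unless the node is in draining/removing without an owning task."""
        task = OWNING_TASK.get(self.status.get(node_id, ""))
        return task is None or (node_id, task) in self.open_tasks


def may_reactivate(node_status: str) -> bool:
    """Only a retained (retired) identity may be reused; deleted never is."""
    return node_status == "retired"
```

A real implementation would presumably perform the status write and the task creation in one transaction; the sketch only shows that the two belong to a single operation.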

## 4. Onboarding Workflow State

### 4.1 Workflow/job states

These belong in `node_onboardings.status`, not `nodes.status`.

| State | Meaning |
|---|---|
| `pending` | accepted but not started |
| `running` | workflow executing |
| `completed` | workflow reached intended terminal success |
| `failed_retryable` | failed, but safe recovery path exists |
| `failed_manual_intervention` | blocked pending operator action |
| `cancelled` | operator/system cancelled workflow |
| `compensating` | rollback/cleanup is in progress |
| `reconciled` | previously failed/ambiguous workflow has been realigned with observed state |

### 4.2 Typical onboarding stages

| Stage |
|---|
| `load_site_config` |
| `resolve_power_credentials` |
| `create_or_find_in_maas` |
| `commission_node` |
| `wait_for_ready` |
| `configure_storage` |
| `apply_roce_phase2` |
| `ensure_pxe_interface_auto` |
| `render_cloud_init` |
| `deploy_via_maas` |
| `wait_for_deployed` |
| `classify_deploy_failure` |
| `recover_for_datasource_retry` |
| `ensure_hardware_sync_configured` |
| `wait_for_hardware_sync_healthy` |
| `wait_for_agent_enrollment` |

### 4.3 Mapping: node state vs onboarding state

| Node state | Onboarding workflow state | Meaning |
|---|---|---|
| `enrolling` | `pending` / `running` | normal onboarding in progress |
| `enrolling` | `failed_retryable` | node still not ready; safe rerun/resume available |
| `enrolling` or `quarantined` | `failed_manual_intervention` | workflow blocked pending operator decision |
| `active` | `completed` | normal success |
| `active` | `reconciled` | workflow succeeded after adopt/reconcile action |

Rule:
- Do not create micro-node states such as `commissioning`, `deploying`, or `waiting_for_hw_sync`; those are workflow stages only.
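
The mapping table above can serve as a consistency check while an onboarding is the node's most recent lifecycle activity. The sketch below is an assumption about how such a check might look; `onboarding_pair_is_expected` is a hypothetical helper, and workflow states the table does not list (`cancelled`, `compensating`) are reported as unspecified rather than guessed.

```python
from typing import Dict, FrozenSet, Optional

# Onboarding workflow state -> node states the mapping table pairs it with.
NODE_STATES_FOR_ONBOARDING: Dict[str, FrozenSet[str]] = {
    "pending": frozenset({"enrolling"}),
    "running": frozenset({"enrolling"}),
    "failed_retryable": frozenset({"enrolling"}),
    "failed_manual_intervention": frozenset({"enrolling", "quarantined"}),
    "completed": frozenset({"active"}),
    "reconciled": frozenset({"active"}),
}


def onboarding_pair_is_expected(node_status: str, onboarding_status: str) -> Optional[bool]:
    """True/False if the table covers the workflow state, None if it does not."""
    allowed = NODE_STATES_FOR_ONBOARDING.get(onboarding_status)
    if allowed is None:
        return None  # e.g. cancelled / compensating: not specified by the table
    return node_status in allowed
```

A `False` result is a signal for reconciliation or operator review, not proof of a bug: a node that completed onboarding long ago may legitimately be `offline` today.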

## 5. Decommission Workflow State

### 5.1 Workflow/job states

`node_decommissions.status` should use the same job-state model as onboarding:
- `pending`
- `running`
- `completed`
- `failed_retryable`
- `failed_manual_intervention`
- `cancelled`
- `compensating`
- `reconciled`

### 5.2 Typical decommission stages

This is the superset stage catalog across `soft_reset`, `reimage`, `full_decommission`, and `storage_cleanup`. Not every mode uses every stage.

| Stage |
|---|
| `disable_node` |
| `force_release_allocations` |
| `drain_node` |
| `cleanup_storage` |
| `scrub_gpu` |
| `validate_clean_node` |
| `load_site_config` |
| `release_maas_node` |
| `power_off_maas_node` |
| `retire_gpuaas_node` |
| `remove_gpuaas_node_record` |
| `cleanup_secrets` |
| `remove_maas_record` |

### 5.3 Mapping: node state vs decommission state

| Node state | Decommission workflow state | Meaning |
|---|---|---|
| `active` / `offline` / `quarantined` / `draining` / `retired` | `running` | pre-remove cleanup or reimage in progress |
| `removing` | `running` | uninstall/remove path in progress |
| `retired` | `failed_retryable` | remove/uninstall failed but identity preserved |
| `retired` / `quarantined` | `failed_manual_intervention` | decommission blocked pending operator action |
| `deleted` | `completed` | full remove succeeded |
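
For the lifecycle detail view, the same table can be flattened into a lookup keyed on the `(node state, decommission state)` pair. This is only a sketch; `DECOMMISSION_MEANING` and `describe_decommission` are hypothetical names, and pairs absent from the table return `None`.

```python
from typing import Dict, Optional, Tuple

# Node states in which a running decommission means pre-remove work.
_PRE_REMOVE = ("active", "offline", "quarantined", "draining", "retired")

DECOMMISSION_MEANING: Dict[Tuple[str, str], str] = {
    **{(s, "running"): "pre-remove cleanup or reimage in progress" for s in _PRE_REMOVE},
    ("removing", "running"): "uninstall/remove path in progress",
    ("retired", "failed_retryable"): "remove/uninstall failed but identity preserved",
    ("retired", "failed_manual_intervention"): "decommission blocked pending operator action",
    ("quarantined", "failed_manual_intervention"): "decommission blocked pending operator action",
    ("deleted", "completed"): "full remove succeeded",
}


def describe_decommission(node_status: str, job_status: str) -> Optional[str]:
    """Return the table's meaning for the pair, or None if the pair is not listed."""
    return DECOMMISSION_MEANING.get((node_status, job_status))
```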

## 6. MAAS State Mapping

### 6.1 Relevant MAAS states

| MAAS state | Interpretation |
|---|---|
| `New` | discovered but not accepted/commissioned |
| `Commissioning` | MAAS is commissioning the machine |
| `Ready` | commissioned and available for deployment; configuration still editable, not deployed |
| `Allocated` | MAAS allocated but not yet fully deployed |
| `Deploying` | OS deployment in progress |
| `Deployed` | OS deployed |
| `Failed` / `Broken` / `Failed deployment` | upstream failure state |
| absent | machine missing/deleted in MAAS |

### 6.2 Expected MAAS state by GPUasService lifecycle

| GPUasService node state | Expected MAAS state |
|---|---|
| `bootstrap_issued` | none required |
| `enrolling` | `New`, `Commissioning`, `Ready`, `Allocated`, `Deploying`, or `Deployed` depending on stage |
| `active` | `Deployed` |
| `offline` | usually `Deployed` |
| `quarantined` | often `Ready`, `Failed`, `Broken`, `Deployed`, or absent |
| `draining` | usually `Deployed` or `Ready`; decommission workflow owns the transition to `retired` |
| `retired` | `Deployed`, `Ready`, or powered off depending on mode |
| `removing` | `Ready`, powered off, or absent |
| `deleted` | absent or retained in MAAS by policy |
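
A reconciler could use this table to flag surprising combinations. The sketch below makes several assumptions: it compares MAAS status only (power state, such as "powered off", is out of scope), it treats the "usually" and "often" rows as hints, and it leaves `bootstrap_issued` and `deleted` unconstrained. `maas_state_unexpected` is a hypothetical helper.

```python
from typing import Dict, FrozenSet, Optional

ABSENT = "absent"  # marker for "machine missing/deleted in MAAS"

EXPECTED_MAAS: Dict[str, FrozenSet[str]] = {
    "enrolling": frozenset({"New", "Commissioning", "Ready", "Allocated", "Deploying", "Deployed"}),
    "active": frozenset({"Deployed"}),
    "offline": frozenset({"Deployed"}),
    "quarantined": frozenset({"Ready", "Failed", "Broken", "Deployed", ABSENT}),
    "draining": frozenset({"Deployed", "Ready"}),
    "retired": frozenset({"Deployed", "Ready"}),
    "removing": frozenset({"Ready", ABSENT}),
    # bootstrap_issued and deleted are intentionally missing: no MAAS requirement.
}


def maas_state_unexpected(node_status: str, observed: Optional[str]) -> bool:
    """True if the observed MAAS status is outside the table's expectation.

    `observed=None` means the machine was not found in MAAS.
    """
    expected = EXPECTED_MAAS.get(node_status)
    if expected is None:
        return False  # no constraint for this node state
    return (observed or ABSENT) not in expected
```

A `True` result should feed the workflow/job layer (for example, a move to `failed_manual_intervention`) rather than introduce a new node state.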

## 7. Manual Intervention Triggers

These should usually move workflow state to `failed_manual_intervention`:
- conflicting discovery candidates
- BOSS disk not found
- PXE interface not fixable automatically
- machine deleted from MAAS during workflow
- repeated datasource/cloud-init failure after bounded retry
- repeated hardware-sync or SSH seed failure
- workflow state and observed MAAS/node state disagree in a way automation cannot safely adopt
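
Several of these triggers are "repeated failure after bounded retry" cases. A minimal sketch of that escalation follows; the helper name and the default limit of 3 attempts are assumptions chosen for illustration, not documented values.

```python
def status_after_failure(attempt: int, max_attempts: int = 3) -> str:
    """Pick the workflow state after the Nth failed attempt of a stage.

    `attempt` is 1-based. `max_attempts` is a hypothetical bound.
    """
    if attempt < 1 or max_attempts < 1:
        raise ValueError("attempt and max_attempts must be >= 1")
    if attempt < max_attempts:
        return "failed_retryable"  # safe recovery path still available
    return "failed_manual_intervention"  # bound reached: stop and ask an operator
```

Triggers that are unsafe on first sight, such as conflicting discovery candidates, would skip the retry budget and go straight to `failed_manual_intervention`.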

## 8. Operator Actions

These actions operate on workflow/job state, not directly on coarse node state:

| Action | Effect |
|---|---|
| `retry_stage` | rerun current failed stage |
| `resume` | continue from last safe stage |
| `rerun` | re-enter workflow from top with status-aware adoption |
| `restart_clean` | explicitly compensate/reset then start again |
| `cancel` | stop workflow and compensate where possible |
| `adopt_observed_state` | accept externally advanced state and continue/finish |
| `mark_manual_intervention_required` | freeze workflow until human resolution |

For node inventory lifecycle, `resume` must be available whenever the node is in an in-progress coarse state whose owning task may have been lost or stalled:
- `draining` -> resume/reissue/requeue `node.drain`
- `removing` -> resume/reissue/requeue `node.uninstall`

These operator actions are idempotent:
- if a fresh queued task already exists, the action returns success without duplication
- if a live dispatched task still holds the lease, the action returns success without duplication
- if the task expired or the dispatch lease is stale, the action requeues or recreates the lifecycle task
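
The three idempotency cases can be sketched as a single decision function. Everything here other than the task names `node.drain` and `node.uninstall` is an assumption: the `LifecycleTask` shape, the task state strings, and the outcome labels are hypothetical, and the freshness check for queued tasks is simplified to "a queued task exists".

```python
from dataclasses import dataclass
from typing import Optional, Tuple

REQUIRED_TASK = {"draining": "node.drain", "removing": "node.uninstall"}


@dataclass(frozen=True)
class LifecycleTask:
    name: str
    state: str                                # hypothetical: "queued", "dispatched", "expired"
    lease_expires_at: Optional[float] = None  # epoch seconds; set while dispatched


def resume(node_status: str, task: Optional[LifecycleTask], now: float) -> Tuple[LifecycleTask, str]:
    """Return the task that should own the node and what the call did."""
    required = REQUIRED_TASK.get(node_status)
    if required is None:
        raise ValueError(f"resume does not apply to node state {node_status!r}")
    if task is None or task.name != required:
        return LifecycleTask(required, "queued"), "created"
    if task.state == "queued":
        return task, "noop_queued"  # already queued: succeed without duplicating
    lease_live = task.lease_expires_at is not None and task.lease_expires_at > now
    if task.state == "dispatched" and lease_live:
        return task, "noop_leased"  # a worker still holds the lease
    # Expired task or stale lease: put a fresh task back on the queue.
    return LifecycleTask(required, "queued"), "requeued"
```

Calling `resume` twice in a row yields `created` and then `noop_queued`, so an operator can repeat the action without creating duplicate work.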

## 9. Recommended UI/Operator Presentation

### 9.1 Inventory view

Show coarse node state only:
- `active`
- `offline`
- `quarantined`
- `retired`
- `removing`

### 9.2 Lifecycle detail view

Show:
- workflow state
- current stage
- attempt count
- failure class
- last observed MAAS state
- recommended next action

This prevents inventory screens from becoming overloaded with workflow microstates.

## 10. Rules

1. `nodes.status` must remain coarse and stable.
2. Workflow stages and retryability belong in onboarding/decommission read models.
3. MAAS state is observed upstream truth, not GPUasService-owned state.
4. Reconciliation may change workflow/job state without directly inventing new node lifecycle states.
5. Any new MAAS-specific lifecycle transition should be added here before implementation if it changes operator-facing behavior.
