# V3 Workflow Consolidation Plan v1

Status: active migration input  
Scope: canonical product routes under `/*` and backend read models/mutations
required to retire V1 admin/entity dump pages. Historical `/v3/*` mock routes
are retired from the current app tree.

## Problem

The V3 migration now exposes most user, admin, and ops data, but the recordings
from 2026-05-01 show that several flows still behave like migrated entity dumps.
V1 hid a lot of operational complexity under broad admin pages; V3 must turn that
complexity into workflow surfaces before V1 can be retired.

The migration goal is not only route parity. A V3 surface is complete when the
user or operator can answer:

1. What is active right now?
2. What needs action right now?
3. What can I safely ignore or archive?
4. What operation should I run next?
5. What will the operation do, where is progress tracked, and where is evidence?

## Design Decisions

### Workloads Are Runtime-First

The workload landing page is a runtime workbench, not an audit archive.

Default contents:
- active workloads
- provisioning workloads
- recent actionable failures

Hidden by default:
- released workloads
- old failed test resources with no current remediation path
- historical failures retained only for audit/history

Released workloads may appear in an explicit `History` or `All states` view. They
should not occupy the default action band after the user has intentionally released
them.

Action-band eligibility should eventually be contract-backed:

```text
show as live action if:
  is_actionable = true
  and archived_at is null
```

The server should compute `is_actionable` and any time-bound actionability from
its own clock. UI code should not reimplement `actionable_until >= now` in each
operator browser because clock skew would make triage inconsistent. The read model
may still expose `action_required`, `actionable_until`, and `archived_at` for
explanation and filtering, but default workbench behavior should use the derived
server-side boolean.
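The split between server-side derivation and client-side display can be sketched as follows. This is a minimal illustration, not the backend contract: `WorkloadRow`, `derive_is_actionable`, and `show_as_live_action` are hypothetical names, and how `action_required` and `actionable_until` combine into `is_actionable` is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class WorkloadRow:
    """Hypothetical read-model row; field names follow the contract above."""
    action_required: bool
    actionable_until: Optional[datetime]
    archived_at: Optional[datetime]


def derive_is_actionable(row: WorkloadRow, server_now: datetime) -> bool:
    """Server side: evaluate the time window once, against the server clock."""
    if row.archived_at is not None or not row.action_required:
        return False
    # Assumption: a missing deadline means "actionable until archived".
    return row.actionable_until is None or row.actionable_until >= server_now


def show_as_live_action(is_actionable: bool, archived_at: Optional[datetime]) -> bool:
    """Client side: the rule from the contract, with no clock comparison."""
    return is_actionable and archived_at is None


server_now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
fresh = WorkloadRow(True, server_now + timedelta(hours=1), None)
expired = WorkloadRow(True, server_now - timedelta(hours=1), None)
archived = WorkloadRow(True, None, server_now)
```

Because the browser only reads the boolean, two operators with skewed clocks see the same action band for the same response.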

Until the backend exposes those fields, the UI should apply a conservative
temporary default: show active, provisioning, and recently failed resources first
and keep old released rows behind an opt-in filter.
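One possible shape for that temporary rule is sketched below. The state names, the `in_default_view` helper, and the seven-day window are all assumptions for illustration; the plan does not define what counts as "recent". The rule still compares against a clock, so it is only a stopgap until the server-derived boolean exists.

```python
from datetime import datetime, timedelta, timezone

# Assumed cutoff for "recent" failures; the plan does not fix a value.
RECENT_FAILURE_WINDOW = timedelta(days=7)
# Assumed state names for rows that always appear by default.
DEFAULT_STATES = {"active", "provisioning"}


def in_default_view(
    state: str,
    updated_at: datetime,
    now: datetime,
    include_history: bool = False,
) -> bool:
    """Temporary migration rule: hide released and old failed rows by default."""
    if include_history:
        # The explicit History / All states view shows everything.
        return True
    if state in DEFAULT_STATES:
        return True
    if state == "failed":
        return now - updated_at <= RECENT_FAILURE_WINDOW
    # Released and any other states stay behind the opt-in filter.
    return False


now = datetime(2026, 5, 1, tzinfo=timezone.utc)
```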

### Admin Pages Become Workflows

V1 admin pages were useful as entity dumps, but that model should not carry into
V3. V3 platform/admin surfaces should be organized by operator intent:

- setup and configuration
- onboarding
- maintenance
- recovery
- reimage/retire
- evidence and audit
- destructive break-glass

Tables remain useful as pivot lookups, but they should not be the primary page
shape for lifecycle-heavy resources.

### Node Onboarding Starts Guidance-First

The first V3 node onboarding flow should be a guidance-first workflow unless the
full backend mutation contract is ready.

Guidance-first flow:
1. Choose region/site and onboarding mode.
2. Choose node role/profile: bare metal, slice host, scheduler controller, worker,
   storage/WEKAFS-adjacent host if applicable.
3. Confirm required setup: MAAS site when automated, registry, bootstrap bundle,
   node-agent version, network profile, telemetry, and SSH/break-glass access.
4. Generate or select a bootstrap/enrollment token.
5. Show the exact manual command or MAAS profile attachment steps.
6. Poll for first heartbeat, cert status, capabilities, and telemetry.
7. Land on node detail with next operations.

Full mutation flow later:
- create MAAS onboarding workflow
- commission/deploy
- deliver bootstrap payload
- enroll node-agent
- verify cert renewal, telemetry, GPU inventory, slice slots, and runtime readiness

The guidance shell should be designed so the full mutation path can replace manual
steps without changing the user journey.
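A sketch of how the guidance shell might keep the journey stable while individual steps move from manual to automated. The step keys, `StepMode`, and `with_automation` are hypothetical; only the step order and titles come from the flow above.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import List, Set


class StepMode(Enum):
    MANUAL = "manual"        # operator follows displayed instructions
    AUTOMATED = "automated"  # a backend mutation performs the step


@dataclass(frozen=True)
class OnboardingStep:
    key: str
    title: str
    mode: StepMode = StepMode.MANUAL


# The seven guidance-first steps, in journey order.
GUIDANCE_STEPS: List[OnboardingStep] = [
    OnboardingStep("site", "Choose region/site and onboarding mode"),
    OnboardingStep("role", "Choose node role/profile"),
    OnboardingStep("prerequisites", "Confirm required setup"),
    OnboardingStep("token", "Generate or select a bootstrap/enrollment token"),
    OnboardingStep("bootstrap", "Run the manual command or attach the MAAS profile"),
    OnboardingStep("verify", "Poll for heartbeat, cert status, capabilities, telemetry"),
    OnboardingStep("handoff", "Land on node detail with next operations"),
]


def with_automation(steps: List[OnboardingStep], automated: Set[str]) -> List[OnboardingStep]:
    """Return the same journey with the named steps switched to automated."""
    return [
        replace(step, mode=StepMode.AUTOMATED) if step.key in automated else step
        for step in steps
    ]


later = with_automation(GUIDANCE_STEPS, {"bootstrap", "verify"})
```

The step list and its order never change; only the mode of a step does, which is what lets the full mutation path replace manual steps without reshaping the wizard.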

### Node Lifecycle Operations Are First-Class

Node detail must expose operations by intent, not as a flat button cluster.

Operation groups:
- Inspect: probe, open telemetry, view evidence
- Repair: repair node-agent, repair certs, re-enroll, refresh inventory
- Scheduling: enable/disable scheduling, drain, resume
- Provisioning: start MAAS onboarding, reimage, retry failed workflow
- Retirement: retire, detach, remove
- Destructive: delete, force cleanup

Each operation should show:
- preconditions
- expected result
- progress location
- failure path
- evidence/audit/task pivots
- danger level and confirmation behavior
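The operation groups and per-operation fields above could be carried by a single descriptor type. The sketch below is illustrative only: `NodeOperation`, the three-level `Danger` scale, and the typed-confirmation rule are assumptions, and the example values are placeholders rather than real operation definitions.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Intent(Enum):
    INSPECT = "inspect"
    REPAIR = "repair"
    SCHEDULING = "scheduling"
    PROVISIONING = "provisioning"
    RETIREMENT = "retirement"
    DESTRUCTIVE = "destructive"


class Danger(Enum):  # assumed three-level scale
    SAFE = 0
    DISRUPTIVE = 1
    DESTRUCTIVE = 2


@dataclass(frozen=True)
class NodeOperation:
    key: str
    intent: Intent
    preconditions: Tuple[str, ...]
    expected_result: str
    progress_location: str
    failure_path: str
    pivots: Tuple[str, ...]  # evidence/audit/task links
    danger: Danger

    @property
    def needs_typed_confirmation(self) -> bool:
        # Assumption: only destructive operations require a typed confirmation.
        return self.danger is Danger.DESTRUCTIVE


def group_by_intent(ops: List[NodeOperation]) -> Dict[Intent, List[NodeOperation]]:
    """Group operations by intent, emitting groups in declared Intent order."""
    groups: Dict[Intent, List[NodeOperation]] = defaultdict(list)
    for op in ops:
        groups[op.intent].append(op)
    return {intent: groups[intent] for intent in Intent if intent in groups}


# Placeholder descriptors for illustration.
probe = NodeOperation("probe", Intent.INSPECT, (), "fresh health snapshot",
                      "node activity stream", "retry probe", ("evidence",), Danger.SAFE)
drain = NodeOperation("drain", Intent.SCHEDULING, ("node is schedulable",),
                      "no new workloads are placed", "node activity stream",
                      "resume scheduling", ("tasks", "audit"), Danger.DISRUPTIVE)
force_cleanup = NodeOperation("force_cleanup", Intent.DESTRUCTIVE, ("node is retired",),
                              "node records removed", "task detail",
                              "escalate to break-glass", ("audit",), Danger.DESTRUCTIVE)
```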

### Platform Setup Has A Checklist

Platform config needs a setup checklist, not only family cards.

Checklist areas:
- Vault and registry
- MAAS site and profiles
- bootstrap bundle and image/profile digest
- node-agent release and cert renewal posture
- SKUs and images
- telemetry/Netdata proxy posture
- network/security profile placeholders
- storage provider posture, initially WEKAFS-first

Each item should report `configured`, `missing`, `stale`, or `unsafe`, with owner,
next action, and evidence.
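A checklist read model along these lines might look like the sketch below. The four status values come from the text; `ChecklistItem`, `open_items`, `ready_for_onboarding`, and the triage order (unsafe, then missing, then stale) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ItemStatus(Enum):
    CONFIGURED = "configured"
    MISSING = "missing"
    STALE = "stale"
    UNSAFE = "unsafe"


# Assumed triage order: unsafe first, then missing, then stale.
_SEVERITY = {ItemStatus.UNSAFE: 0, ItemStatus.MISSING: 1, ItemStatus.STALE: 2}


@dataclass(frozen=True)
class ChecklistItem:
    area: str
    status: ItemStatus
    owner: str
    next_action: str
    evidence: str


def open_items(items: List[ChecklistItem]) -> List[ChecklistItem]:
    """Items that still need attention, most severe first."""
    pending = [item for item in items if item.status is not ItemStatus.CONFIGURED]
    return sorted(pending, key=lambda item: _SEVERITY[item.status])


def ready_for_onboarding(items: List[ChecklistItem]) -> bool:
    """Assumption: onboarding is gated on every checklist item being configured."""
    return not open_items(items)


# Placeholder rows for illustration.
checklist = [
    ChecklistItem("Vault and registry", ItemStatus.CONFIGURED, "platform", "none", "config audit"),
    ChecklistItem("MAAS site and profiles", ItemStatus.STALE, "ops", "refresh profiles", "last sync"),
    ChecklistItem("node-agent release", ItemStatus.UNSAFE, "ops", "rotate release", "cert report"),
]
```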

### SSH Keys Are Canonical In Account, Available In Flow

Account Security remains the canonical record-management page for personal SSH
keys. Launch and onboarding flows must still let the user add or select keys
inline so missing prerequisites do not send the user away and force a restart.

### App Workloads Keep App Identity

App-backed workloads should carry the app catalog identity into the workload
detail. JupyterLab, vLLM, Slurm, RKE2, Axolotl, and future apps should not look
like plain compute with a different subtitle. Use app mark, tone, category, and
kind-aware primary action consistently.
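One way to keep that identity consistent is a shared lookup used by both the catalog and workload detail. The sketch below is hypothetical: `AppIdentity`, `identity_for`, and every tone, category, and action label are placeholder values, not the catalog's real data.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AppIdentity:
    mark: str            # icon/logo key
    tone: str            # colour tone used by the hero and badges
    category: str
    primary_action: str  # kind-aware primary action label


# Fallback for workloads that are not backed by a catalog app.
PLAIN_COMPUTE = AppIdentity("compute", "neutral", "compute", "Connect")

# Placeholder entries; real values would come from the app catalog.
APP_IDENTITIES: Dict[str, AppIdentity] = {
    "jupyterlab": AppIdentity("jupyterlab", "orange", "notebook", "Open notebook"),
    "vllm": AppIdentity("vllm", "violet", "inference", "Open endpoint"),
    "slurm": AppIdentity("slurm", "blue", "cluster", "Open scheduler"),
}


def identity_for(app_kind: Optional[str]) -> AppIdentity:
    """Resolve the identity for a workload, falling back to plain compute."""
    if app_kind is None:
        return PLAIN_COMPUTE
    return APP_IDENTITIES.get(app_kind.lower(), PLAIN_COMPUTE)
```

Routing every surface through one helper means a new app only needs one catalog entry to look right everywhere.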

### Console And Cert Recovery Are Product Flows

A browser console failure caused by expired or lost node-agent certificates is not
only a bug; it is a production recovery scenario. V3 should expose cert age and
renewal posture, failed-renewal evidence, and a repair/re-enroll workflow.
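Cert posture could be derived server-side and mapped to a suggested recovery step, roughly as sketched here. The posture names, the seven-day renewal window, and the posture-to-action mapping are assumptions; only "repair certs" and "re-enroll" are operations named in this plan.

```python
from datetime import datetime, timedelta, timezone

# Assumed renewal window; the plan does not specify one.
RENEW_BEFORE = timedelta(days=7)


def cert_posture(not_after: datetime, server_now: datetime, last_renewal_failed: bool) -> str:
    """Classify a node-agent cert using the server clock."""
    if server_now >= not_after:
        return "expired"
    if last_renewal_failed:
        return "renewal_failed"
    if not_after - server_now <= RENEW_BEFORE:
        return "renewal_due"
    return "healthy"


# Assumed mapping from posture to the operator's next step.
NEXT_STEP = {
    "expired": "re-enroll",
    "renewal_failed": "repair certs",
    "renewal_due": "watch renewal",
    "healthy": "none",
}

server_now = datetime(2026, 5, 1, tzinfo=timezone.utc)
```

An expired cert maps to re-enroll here on the assumption that the agent can no longer authenticate to renew itself; if that does not hold, the mapping should change.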

## Historical Mock Lessons For B

The retired `/v3` mock captured several workflow decisions that still need to
be reflected in canonical `/*` implementation:

1. Workloads actionability and history model.
2. Node onboarding guidance-first wizard.
3. Platform setup checklist.
4. Node lifecycle operation groups.
5. SSH key inline add/select flow in launch.
6. App runtime identity carried into workload detail.

Current canonical routes should stay honest about backend readiness: where full
mutation support is not ready, label the flow as guided/manual and show the
future automated path as disabled or pending configuration.

Priority notes:
- Workloads actionability has the highest impact because the workloads page is the
  default landing surface.
- Node onboarding is the highest-effort operator workflow and should be designed
  early even if the first pass is guidance-only.
- Platform setup checklist is the first-login operator prerequisite for onboarding.
- App runtime identity is a low-risk visual win and can be implemented independently
  by lifting the app hero/tone mapping into a shared helper.

## Backend Contract Follow-Ups

Backend/read-model work should focus on:

- server-derived actionability plus archive/history fields for workloads and failed
  operations
- node onboarding workflow mutations and read models
- node-agent cert repair/re-enroll operations
- platform setup checklist read model
- operation capability fields and durable activity streams

Do not add UI-only state that cannot be defended by a read model or an explicit
temporary migration rule.

## Acceptance For This Consolidation

V3 migration can proceed page-by-page only after each page maps to one of these
workflow families. If a migrated page still behaves like an entity dump, meaning
it cannot answer the five questions listed under Problem, it should either:

- be demoted to a pivot table under a workflow surface, or
- get a named backend/read-model task if the workflow cannot be represented yet.
