# Intent Control And Reconciliation Model v1

## Purpose

Define the control-plane execution model for GPUaaS as a hybrid of:

- intent-owned resources
- workflow-owned execution
- reconciliation-owned truth repair

This avoids two bad extremes:

- request/response-only control that cannot survive long-running distributed work
- one generic reconciler abstraction forced onto every domain, even where staged workflow execution is the real model

The platform should instead choose the right mechanism per boundary while requiring explicit reconciliation whenever truth can drift.

## Core Model

Every managed resource or workflow should distinguish four kinds of state:

- `desired_state`
  - what the control plane intends to be true
- `observed_state`
  - what the platform currently believes is true from providers, agents, or health checks
- `execution_state`
  - what in-flight workflow or task engine execution is doing right now
- `projected_state`
  - what the user/admin UI reads from persisted records and read models

These four do not always agree.

Examples:

- a decommission may have:
  - desired: `reimage requested`
  - execution: `wait_node_enrollment`
  - observed: node already `active`
  - projected: still `running` because reconciliation lagged

- an allocation may have:
  - desired: `released`
  - execution: `releasing`
  - observed: node assignment still present
  - projected: `release_failed`
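
Holding the four kinds of state side by side on one record makes divergence something the platform can compute rather than infer later. The following is a minimal sketch, not the platform's actual schema: `ResourceState` and its `divergences` method are hypothetical, and it assumes the states have already been normalized to one shared vocabulary (the `assigned` value below is illustrative).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResourceState:
    """Hypothetical record holding the four kinds of state side by side."""
    desired: str
    observed: str
    projected: str
    execution: Optional[str] = None  # None when no workflow is in flight

    def divergences(self) -> list[str]:
        """Return human-readable mismatches between the state kinds."""
        found = []
        if self.observed != self.desired:
            found.append(f"observed={self.observed!r} != desired={self.desired!r}")
        if self.projected != self.observed:
            found.append(f"projected={self.projected!r} != observed={self.observed!r}")
        return found


# The allocation example above, with an illustrative observed value.
allocation = ResourceState(
    desired="released",
    observed="assigned",
    projected="release_failed",
    execution="releasing",
)
for line in allocation.divergences():
    print(line)
```

An empty list means intent, observation, and projection agree; anything else is a candidate input for reconciliation.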

The design rule is:

- every cross-boundary handoff must declare:
  - authoritative truth
  - persisted projection
  - divergence detection
  - reconciliation path

## Invariants And Bounded Authority

Intent alone is not enough.

If the platform only stores desired state and execution strategy, it still lacks an internal notion of correctness. The missing layer is invariants:

- what must remain true
- what the platform is allowed to violate temporarily
- what requires escalation instead of automation

For GPUaaS, invariants should be treated as first-class control inputs over typed state.

### Invariant shape

At minimum, an invariant should declare:

- `scope`
  - which tenant/project/resource set it governs
- `predicate`
  - the condition that must hold over desired, observed, or derived state
- `class`
  - one of: safety, isolation, availability, cost, governance
- `severity`
  - `hard` or `soft`

Examples:

- `availability`
  - minimum capacity reserved for a project
- `safety`
  - no destructive remediation while a node is still assigned
- `isolation`
  - reimage required between tenants when policy says full isolation
- `cost`
  - do not start workloads that would exceed bounded budget
- `governance`
  - project admin may manage project lifecycle without seeing project content
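
The invariant shape above can be written down as a typed record. This is a sketch under assumptions: the `Invariant` type, the `scope` selector format, and the dictionary-shaped state passed to the predicate are all hypothetical; only the four field names and the class and severity values come from this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Mapping


class InvariantClass(Enum):
    SAFETY = "safety"
    ISOLATION = "isolation"
    AVAILABILITY = "availability"
    COST = "cost"
    GOVERNANCE = "governance"


class Severity(Enum):
    HARD = "hard"
    SOFT = "soft"


@dataclass(frozen=True)
class Invariant:
    name: str
    scope: str  # tenant/project/resource selector (format is assumed)
    predicate: Callable[[Mapping[str, object]], bool]  # True when it holds
    klass: InvariantClass
    severity: Severity

    def holds(self, state: Mapping[str, object]) -> bool:
        return self.predicate(state)


# The safety example above: no destructive remediation while a node is assigned.
no_destructive_while_assigned = Invariant(
    name="no_destructive_remediation_while_assigned",
    scope="node:*",
    predicate=lambda s: not (s.get("action") == "reimage" and s.get("assigned")),
    klass=InvariantClass.SAFETY,
    severity=Severity.HARD,
)
```

Keeping the predicate a pure function over state means the same invariant can be evaluated inline, at a workflow gate, or by a reconciler.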

### Hard vs soft invariants

- `hard`
  - automation must not violate the invariant
  - if no compliant action exists, stop and surface the reason
- `soft`
  - automation may proceed only with explicit override, time-bounded exception, or downgraded advisory handling
  - all such decisions must be auditable

This gives the control plane bounded authority:

- workflow engines do not get to “just continue”
- reconcilers do not get to “heal” by violating correctness
- AI- or planner-assisted proposals remain advisory unless they preserve the invariant set
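
The hard/soft distinction reduces to a small decision function. The sketch below is illustrative only: `Override` and `decide` are hypothetical names, and the "downgraded advisory handling" path for soft invariants is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Override:
    """Hypothetical explicit, time-bounded exception for a soft invariant."""
    approved_by: str
    expires_at: datetime


def decide(severity: str, holds: bool, override: Optional[Override] = None,
           now: Optional[datetime] = None) -> str:
    """Return 'proceed', 'proceed_with_audit', or 'stop'."""
    if holds:
        return "proceed"
    if severity == "hard":
        return "stop"  # automation never violates a hard invariant
    now = now or datetime.now(timezone.utc)
    if override is not None and override.expires_at > now:
        return "proceed_with_audit"  # caller must record the override decision
    return "stop"
```

Note that an override has no effect on a hard invariant: the only way past it is a compliant action.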

### Invariants over mechanisms

Execution strategy is not correctness.

For this platform:

- workflows
- reconcilers
- janitors
- direct mutations

are mechanisms.

Invariants are the bound on what those mechanisms are allowed to do.

### Conflict model

Invariant conflicts should be treated as first-class control events, not silent fallback logic.

Initial dominance order for GPUaaS, highest precedence first, should be:

1. safety
2. isolation
3. governance
4. availability
5. cost

If two invariants conflict and no invariant-preserving option exists:

- do not guess
- surface alternatives
- require operator or policy-owned decision

This is especially important for:

- remediation vs workload continuity
- cost vs reserved capacity
- tenant admin control vs project content visibility
- auto-recovery vs strict isolation policy
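
One way to make the conflict model concrete is a resolver that proceeds only when an invariant-preserving option exists and otherwise escalates with ranked alternatives. This is a sketch under assumptions: `resolve`, the shape of `options`, and the action names in the usage example are hypothetical; only the dominance order comes from this document.

```python
DOMINANCE = ["safety", "isolation", "governance", "availability", "cost"]


def rank(invariant_class: str) -> int:
    """Lower rank dominates. Raises ValueError for an unknown class."""
    return DOMINANCE.index(invariant_class)


def resolve(options: dict[str, set[str]]) -> dict:
    """Map each candidate action to the invariant classes it would violate.

    Returns the action to take if one preserves every invariant. Otherwise
    returns an escalation whose alternatives are ordered so that the option
    violating only the least-dominant classes comes first.
    """
    compliant = sorted(a for a, violated in options.items() if not violated)
    if compliant:
        # Choosing the first name alphabetically is an arbitrary tie-break.
        return {"decision": "proceed", "action": compliant[0]}
    # No invariant-preserving option exists: do not guess, surface alternatives.
    ordered = sorted(options, key=lambda a: (-min(rank(c) for c in options[a]), a))
    return {"decision": "escalate", "alternatives": ordered}


print(resolve({"reimage_now": {"safety"}, "exceed_budget": {"cost"}}))
```

The resolver never picks among violating options itself; the ordering is only a hint for the operator or policy owner who makes the decision.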

## Control Mechanisms

### 1. Direct mutation

Use when:

- the state change is local
- completion is synchronous or bounded
- no external workflow engine is needed

Examples:

- updating a policy row
- creating a simple role binding
- recording an audit log row in the same transaction

### 2. Workflow-owned execution

Use when:

- execution is staged
- retries/compensation matter
- external systems are involved
- operators need stage visibility

Examples in GPUaaS:

- MAAS onboarding
- MAAS decommission / reimage
- node-agent lifecycle execution
- allocation provisioning / release orchestration

Temporal is the execution engine here, but it is not the only source of product truth.

### 3. Reconciler-owned convergence

Use when:

- the resource should continuously converge toward intended state
- drift can appear after initial execution
- external/manual actions can change truth out of band

Examples in GPUaaS:

- MAAS state drift vs platform node state
- node-agent identity/task truth vs control-plane truth
- billing usage backfill / repair
- externally cancelled workflow runs that must reconcile into DB/UI truth

### 4. Janitor / discovery / backfill

Use when:

- orphaned resources or missed transitions must be found later
- correctness relies on periodic repair, not only foreground control flow

Examples:

- usage event backfill
- orphan/ghost resource detection
- stale workflow truth repair

## Mechanism Selection

### Prefer workflow when:

- the business meaning is a staged operation
- operators think in terms of attempts, stages, retries, and recovery
- manual recovery actions must be attached to execution state

Examples:

- MAAS reimage
- operator-assisted remediation
- allocation release

### Prefer reconciler when:

- the business meaning is “make reality match intent”
- truth can drift long after initial execution
- external/manual changes must be absorbed safely

Examples:

- node inventory drift
- MAAS machine truth vs GPUaaS node truth
- service-account/session/credential cleanup

### Use both when:

- a workflow performs the initial execution
- a reconciler later repairs drift or out-of-band divergence

This is the normal model for distributed infrastructure.
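
The selection guidance above can be approximated as a function over a few boundary traits. This is a rough sketch, not a decision procedure the platform prescribes: `BoundaryTraits` and `select_mechanisms` are hypothetical, and real boundaries will need more nuance than three booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryTraits:
    staged: bool              # business meaning is a staged operation
    external_systems: bool    # execution touches systems outside the service
    drifts_out_of_band: bool  # truth can change after or outside execution


def select_mechanisms(t: BoundaryTraits) -> list[str]:
    """Return the mechanisms a boundary with these traits likely needs."""
    mechanisms = []
    if t.staged or t.external_systems:
        mechanisms.append("workflow")
    if t.drifts_out_of_band:
        mechanisms.append("reconciler")
    if not mechanisms:
        mechanisms.append("direct_mutation")
    return mechanisms


# A MAAS reimage is staged, external, and subject to out-of-band drift.
print(select_mechanisms(BoundaryTraits(True, True, True)))
```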

### Require invariant evaluation when:

- a workflow chooses between multiple repair paths
- a reconciler can take an action with blast-radius or cost implications
- automation could trade one correctness property against another

Examples:

- choose `rebootstrap` vs `reimage`
- continue retrying MAAS deploy vs stop and require operator review
- auto-release or deny provisioning when balance/capacity constraints are close to their limits

## GPUaaS Mapping

### Allocation domain

- intent-owned:
  - allocation desired lifecycle
- workflow-owned:
  - provisioning, release, forced release
- reconciliation-owned:
  - release drift
  - billing window repair
  - node assignment cleanup

### MAAS domain

- intent-owned:
  - onboarding request
  - decommission / reimage request
  - site/profile policy
- workflow-owned:
  - onboarding execution
  - decommission/reimage execution
- reconciliation-owned:
  - MAAS machine truth vs node truth
  - Temporal run truth vs `node_onboardings` / `node_decommissions`
  - management IP discovery and refresh

### Node-agent lifecycle

- intent-owned:
  - desired lifecycle mode/scenario
- workflow-owned:
  - manual install / rebootstrap / repair execution state
- reconciliation-owned:
  - node identity drift
  - cert enrollment/renew truth
  - stale node rows / reenrollment truth

### IAM

- intent-owned:
  - role bindings
  - memberships
  - invitations
- workflow-owned:
  - invitation acceptance / federation setup flows where needed
- reconciliation-owned:
  - capability projection into session/read model
  - external IdP/federation truth vs platform bindings

### Billing

- intent-owned:
  - ledger semantics
  - policy-driven billing rules
- workflow-owned:
  - payment/session reconciliation where stepwise processing matters
- reconciliation-owned:
  - usage backfill
  - activation/deactivation dedupe
  - failed periodic accrual repair

## Boundary Contract

For every major handoff, document:

### 1. Authority

Which system is authoritative for this truth?

Examples:

- Temporal run status is authoritative for workflow engine execution
- Postgres is authoritative for the product read model
- MAAS is authoritative for MAAS machine deployment state
- node-agent is authoritative for the host-local runtime task execution result

### 2. Projection

Where is the product-readable projection stored?

Examples:

- `node_decommissions`
- `node_onboardings`
- `allocations`
- session/user read model

### 3. Divergence signal

How do we detect the truth has drifted?

Examples:

- workflow record says `running`, Temporal says `Canceled`
- node record says `active`, MAAS says not deployed
- control plane is waiting on node tasks, but node identity is missing or revoked

### 4. Reconciliation path

How does convergence happen?

Examples:

- read-path reconciliation
- scheduled sweeper
- event-driven repair
- explicit admin reconcile action

### 5. Operator-visible degraded mode

What should the UI/logs show when truth is not yet reconciled?

Never present stale projection as final truth when the system already knows reconciliation is incomplete.
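
The five contract elements can be captured as a record that reports which parts are still undeclared. This is a sketch under assumptions: `BoundaryContract` and `missing` are hypothetical, and pairing this particular boundary with a scheduled sweeper is an illustration, not a decision recorded in this document.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BoundaryContract:
    boundary: str
    authority: str            # which system is authoritative for this truth
    projection: str           # where the product-readable projection lives
    divergence_signal: str    # how drift is detected
    reconciliation_path: str  # how convergence happens
    degraded_mode: str        # what UI/logs show while unreconciled

    def missing(self) -> list[str]:
        """Names of fields left blank; a complete contract returns []."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


maas_decommission = BoundaryContract(
    boundary="Temporal <-> node_decommissions",
    authority="Temporal run status",
    projection="node_decommissions",
    divergence_signal="record says running, Temporal says Canceled",
    reconciliation_path="scheduled sweeper",
    degraded_mode="",  # not yet decided, so the contract is incomplete
)
print(maas_decommission.missing())
```

A check like `missing()` could run in review tooling so that no handoff ships without all five answers.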

## Required Current Boundaries

The current platform must explicitly model reconciliation for:

- service ↔ Temporal
  - especially external/manual cancel/terminate
- Temporal ↔ MAAS-backed product records
- control plane ↔ node-agent identity/task truth
- backend truth ↔ workflow detail UI
- billing usage projection ↔ actual activation/deactivation events

## Operator Controls

Long-running workflows need explicit operator controls:

- `reconcile now`
- `cancel`
- `rerun`
- `resume`
- `quarantine` where automation should stop

These controls must update product truth, not only the engine.

They must also respect invariant severity:

- hard-invariant violations block execution automatically
- soft-invariant overrides must be explicit and auditable
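
A control such as `cancel` therefore has three duties: check invariants, act on the engine, and update the product record with an audit entry. The sketch below is illustrative: `WorkflowRecord`, `FakeEngine`, `operator_cancel`, and the `cancel_requested` status are hypothetical, and `FakeEngine` stands in for a real engine client rather than reproducing any Temporal API.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowRecord:
    """Hypothetical persisted product record for a long-running workflow."""
    run_id: str
    status: str
    audit: list[str] = field(default_factory=list)


class FakeEngine:
    """Stand-in for a workflow engine client; not a real Temporal API."""

    def __init__(self) -> None:
        self.cancelled: list[str] = []

    def cancel(self, run_id: str) -> None:
        self.cancelled.append(run_id)


def operator_cancel(record, engine, actor, hard_violations=(),
                    soft_violations=(), override_reason=None):
    """Cancel a run, updating both the engine and the product record."""
    if hard_violations:
        # Hard invariants block the control outright.
        raise PermissionError(f"blocked by hard invariants: {sorted(hard_violations)}")
    if soft_violations and not override_reason:
        raise PermissionError(f"override required for: {sorted(soft_violations)}")
    engine.cancel(record.run_id)
    record.status = "cancel_requested"  # product truth changes with the engine
    note = f"{actor} cancelled {record.run_id}"
    if soft_violations:
        note += f" overriding {sorted(soft_violations)}: {override_reason}"
    record.audit.append(note)
    return record
```

In a real service the engine call and the record update would need to be made consistent, for example by persisting the intent first and letting reconciliation confirm the engine outcome.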

## Evaluation Strategy

Not every invariant or truth check belongs in the same place.

The control plane should choose the cheapest enforcement point that still preserves correctness.

### 1. Inline checks

Use inline checks when the signal is:

- local
- cheap
- authoritative enough for immediate decision

Examples:

- authz/capability checks
- policy lookups
- state-machine transition guards
- row/version/attempt guards
- idempotency checks

These should run directly in request handlers or service mutations.
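
A state-machine transition guard combined with a row-version guard is a typical inline check. The sketch below assumes a dictionary-shaped row and a hypothetical transition table; `guarded_transition`, `GuardError`, and the `pending`, `provisioning`, `failed`, and `cancelled` states are illustrative names.

```python
# Hypothetical allocation transition table; not the platform's real state machine.
ALLOWED = {
    "pending": {"provisioning", "cancelled"},
    "provisioning": {"active", "failed"},
    "active": {"releasing"},
    "releasing": {"released", "release_failed"},
}


class GuardError(Exception):
    """Raised when an inline guard rejects a mutation."""


def guarded_transition(row: dict, to_state: str, expected_version: int) -> dict:
    """Return the updated row, or raise GuardError without changing anything."""
    if row["version"] != expected_version:
        raise GuardError(f"stale version: have {row['version']}, expected {expected_version}")
    if to_state not in ALLOWED.get(row["state"], set()):
        raise GuardError(f"transition {row['state']} -> {to_state} not allowed")
    return {**row, "state": to_state, "version": row["version"] + 1}


row = {"id": "a1", "state": "active", "version": 3}
print(guarded_transition(row, "releasing", expected_version=3))
```

Both guards read only the local row, which is what makes them cheap enough to run on every mutation.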

### 2. Workflow-stage gates

Use workflow-stage gates when the check is part of progressing a staged operation and the cost is moderate.

Examples:

- current attempt/run matches expected workflow identity
- node is in an allowed lifecycle state before next stage
- provider/MAAS response proves the stage precondition
- release/reactivation/deploy guards before continuing

These checks should block stage progression when the invariant matters before the next action.
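
A stage gate can be written as a function that returns blocking reasons, so the workflow can both stop and explain why. This is a sketch under assumptions: `StageContext`, `gate_wait_node_enrollment`, the allowed node states, and the `Deployed` provider status string are hypothetical stand-ins for whatever the real workflow compares.

```python
from dataclasses import dataclass

# Hypothetical lifecycle states from which the next stage may start.
ALLOWED_NODE_STATES = {"deploying", "deployed"}


@dataclass(frozen=True)
class StageContext:
    expected_run_id: str  # run recorded on the product row
    current_run_id: str   # run actually executing this stage
    node_state: str       # platform's node lifecycle state
    maas_status: str      # status string reported by the provider (assumed)


def gate_wait_node_enrollment(ctx: StageContext) -> list[str]:
    """Return blocking reasons; an empty list means the stage may proceed."""
    reasons = []
    if ctx.current_run_id != ctx.expected_run_id:
        reasons.append("run does not match expected workflow identity")
    if ctx.node_state not in ALLOWED_NODE_STATES:
        reasons.append(f"node state {ctx.node_state!r} not allowed for this stage")
    if ctx.maas_status != "Deployed":
        reasons.append(f"MAAS reports {ctx.maas_status!r}, precondition unproven")
    return reasons
```

Returning every reason at once, rather than failing on the first, gives operators the full picture in the stage view.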

### 3. Reconciliation checks

Use periodic or read-triggered reconciliation when:

- truth can drift out of band
- the check would be too expensive to run inline on every mutation
- correctness can be safely repaired after the fact

Examples:

- Temporal status vs persisted workflow status
- MAAS machine truth vs platform node truth
- node-agent identity truth vs control-plane node record
- billing activation/deactivation backfill
- orphan/ghost resource cleanup

These should produce:

- converged product state
- operator-visible drift when convergence is incomplete
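
The first example, engine status versus persisted workflow status, can be sketched as a pure function that either converges the record or marks it as drifted. The mapping, the `drift` field, and the product status names below are assumptions for illustration; the engine status strings are assumed to arrive in the form `Running`, `Canceled`, and so on, and `None` models an engine that could not be queried.

```python
from typing import Optional

# Hypothetical mapping from terminal engine statuses to product statuses.
ENGINE_TO_PRODUCT = {
    "Completed": "succeeded",
    "Failed": "failed",
    "TimedOut": "failed",
    "Canceled": "cancelled",
    "Terminated": "cancelled",
}


def reconcile(record: dict, engine_status: Optional[str]) -> dict:
    """Return a new record that is either converged or flagged with drift."""
    if engine_status is None:
        # Cannot converge yet: keep the status but make the uncertainty visible.
        return {**record, "drift": "engine_unreachable"}
    if engine_status == "Running":
        drift = None if record["status"] == "running" else "engine_still_running"
        return {**record, "drift": drift}
    product = ENGINE_TO_PRODUCT.get(engine_status)
    if product is None:
        return {**record, "drift": f"unmapped_engine_status:{engine_status}"}
    return {**record, "status": product, "drift": None}


# An externally cancelled run that the product record still shows as running.
print(reconcile({"status": "running"}, "Canceled"))
```

A non-null `drift` value is what the UI would render as degraded truth instead of presenting the stale status as final.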

### 4. Expensive cross-system verification

Some checks cost more but are still required because correctness depends on them across systems.

Examples:

- proving reenrollment belongs to the current MAAS run
- management IP discovery from MAAS after deploy
- host-local runtime verification before declaring a repair complete
- completion proof when one system can report success before the next boundary is actually ready

Rule:

- do not run expensive cross-system checks everywhere by default
- do run them at the points where advancing without them would make the next state unsafe or misleading

### 5. Deferred but mandatory verification

If an expensive check is not needed before the immediate next step, move it into reconciliation rather than skipping it entirely.

This is the preferred model when:

- the user/operator can proceed safely without the result
- but the platform still needs eventual truth for audit, billing, or later recovery

## Practical Selection Rules

Use this order:

1. inline if local and cheap
2. workflow-stage gate if progression depends on it
3. reconciliation if drift is acceptable temporarily
4. expensive cross-system verification when correctness or safety depends on it before proceeding

Do not pay a high-cost verification on every happy-path request if:

- the invariant can be enforced safely by reconciliation
- and the temporary mismatch does not create unsafe execution or operator deception

Do pay the higher cost when:

- the next step would become unsafe
- the user would be told a lie without it
- or recovery later would be materially harder
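
The selection order can be read as a short cascade. The sketch below is a simplification for illustration: `Check` and `enforcement_point` are hypothetical, and reducing each rule to one boolean hides judgment that real cases will need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    local_and_cheap: bool               # rule 1
    gates_stage_progression: bool       # rule 2
    drift_acceptable_temporarily: bool  # rule 3


def enforcement_point(c: Check) -> str:
    """Apply the selection order above, first match wins."""
    if c.local_and_cheap:
        return "inline"
    if c.gates_stage_progression:
        return "workflow_stage_gate"
    if c.drift_acceptable_temporarily:
        return "reconciliation"
    # Nothing cheaper is safe: pay for verification before proceeding.
    return "cross_system_verification"


# Billing backfill: not local, not a stage gate, temporary drift is tolerable.
print(enforcement_point(Check(False, False, True)))
```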

## Design Rules

### Rule 1

Do not assume all truth changes pass through one happy-path API mutation.

### Rule 2

Workflow engine status alone is not a sufficient product read model.

### Rule 3

DB projection alone is not sufficient when the engine or provider can change out of band.

### Rule 4

Every long-running boundary must have a reconciliation owner.

### Rule 5

UI must surface uncertainty or degraded truth when reconciliation is pending.

## Immediate Applications

This model directly applies to the currently observed live issues:

- externally cancelled MAAS Temporal run still shown as `running`
- reenrolled node initially left `retired` because MAAS/workflow/node-agent handoff raced
- stale `node_not_found` control-plane noise from identity drift
- runtime/observability truth being clearer on the node than in the control plane

## Next Documents

This model should inform:

- [State_Machines.md](./State_Machines.md)
- [Domain_Ownership_Map.md](./Domain_Ownership_Map.md)
- MAAS recovery/reconciliation tasks in [Agent_Work_Queue.yaml](../governance/Agent_Work_Queue.yaml)
- IAM capability/session projection work