# State Machines (Current + Target Clarification)

## 1. Allocation Lifecycle

### Canonical lifecycle (contract + implementation target)
- `requested -> provisioning -> active -> releasing -> released`
- Failure side transitions:
  - `provisioning -> failed`
  - `releasing -> release_failed` (after `max_deliver` retries exhausted on `gpuaas.provisioning.releasing.requested`)

### `release_failed` state behaviour
- **Billing**: billing-worker closes the usage window when it receives `gpuaas.provisioning.release_failed`, so the user is not charged for a failed release.
- **Node**: remains assigned to the allocation until an admin manually retries or removes the assignment. Affected allocations are listed via `GET /api/v1/admin/allocations?status=release_failed`.
- **Retry path**: admin uses `POST /api/v1/admin/allocations/{id}/force-release` to trigger a new release attempt, transitioning the allocation back to `releasing`.
- **User retry**: `POST /api/v1/allocations/{id}/release` on a `release_failed` allocation is also accepted and transitions back to `releasing`.
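
The allocation lifecycle above can be written down as a transition table. The following is a minimal sketch, not the service implementation: the names `AllocationState`, `ALLOWED`, and `transition` are hypothetical, and treating `released` and `failed` as terminal is an assumption based on the absence of outgoing transitions in this document.

```python
from enum import Enum


class AllocationState(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASED = "released"
    FAILED = "failed"
    RELEASE_FAILED = "release_failed"


S = AllocationState

# Allowed transitions, copied from the lifecycle described above.
ALLOWED = {
    S.REQUESTED: {S.PROVISIONING},
    S.PROVISIONING: {S.ACTIVE, S.FAILED},
    S.ACTIVE: {S.RELEASING},
    S.RELEASING: {S.RELEASED, S.RELEASE_FAILED},
    # Admin force-release or user release retry re-enters `releasing`.
    S.RELEASE_FAILED: {S.RELEASING},
    # Assumption: no outgoing transitions are documented for these two.
    S.RELEASED: set(),
    S.FAILED: set(),
}


def transition(current: AllocationState, target: AllocationState) -> AllocationState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data makes it straightforward to reject an out-of-order event (for example `requested -> active`) before it reaches the database.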

## 1a. Allocation Group Lifecycle

Allocation groups are aggregate parent resources over normal single-node
allocations. They do not replace allocation state and they do not own billing or
placement correctness.

Canonical aggregate lifecycle:
- `requested -> provisioning -> active -> releasing -> released`
- Failure/degraded side transitions:
  - `requested|provisioning -> failed` when no required member can become usable
  - `active -> degraded` when one required member fails while another member remains usable
  - `releasing -> release_failed` when one or more member releases exhaust retries

Rules:
- Group status is derived from member allocation status plus group-level release
  intent.
- Member allocations keep their own allocation lifecycle, connection target, and
  usage/billing windows.
- Group release fans out to member release requests and remains idempotent under
  the group release idempotency key.
- App runtimes may bind an app instance to an `allocation_group_id`, but
  app-specific topology and member semantics stay in the app-instance member
  contract.

Full model: `doc/architecture/Allocation_Group_Model_v1.md`.
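
As an illustration of "group status is derived from member allocation status plus group-level release intent", here is one possible derivation function. It is a sketch under stated assumptions: every member is treated as required, a member counts as usable only when `active`, members released individually outside a group release are not modelled, and the function name `derive_group_status` is hypothetical. The authoritative rules live in the full model document.

```python
def derive_group_status(member_statuses, release_requested):
    """Derive an aggregate group status from member allocation statuses.

    Assumptions (not confirmed by the source): all members are required,
    and "usable" means the member allocation is `active`.
    """
    members = list(member_statuses)
    if not members:
        return "requested"

    if release_requested:
        # One or more member releases exhausted their retries.
        if "release_failed" in members:
            return "release_failed"
        # Members that never became usable have nothing left to release.
        if all(m in ("released", "failed") for m in members):
            return "released"
        return "releasing"

    # No required member can become usable.
    if all(m == "failed" for m in members):
        return "failed"
    # A member failed while another remains usable.
    if "failed" in members and "active" in members:
        return "degraded"
    if all(m == "active" for m in members):
        return "active"
    if any(m in ("provisioning", "active") for m in members):
        return "provisioning"
    return "requested"
```

Because the status is computed rather than stored independently, the group cannot drift away from its members; only the release intent needs to be persisted at group level.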

## 2. Node Assignment Lifecycle

### Current behavior
- Free when `assignedAllocationId = null`
- In use when `assignedAllocationId = allocation.id`

### Target
- Derive occupancy from the active allocation relation in the DB.
- Keep the assignment pointer only as a cache, if it is needed at all.
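
A minimal sketch of derived occupancy follows. The record fields (`node_id`, `status`) and the set of occupying states are assumptions for illustration; `release_failed` is included because section 1 states that the node remains assigned in that state.

```python
# Assumption: these allocation states hold a node. The source only confirms
# this explicitly for `release_failed`; the others are inferred.
OCCUPYING_STATES = frozenset({"provisioning", "active", "releasing", "release_failed"})


def occupied_node_ids(allocations):
    """Return the set of node ids held by allocations in an occupying state."""
    return {
        a["node_id"]
        for a in allocations
        if a["status"] in OCCUPYING_STATES and a.get("node_id") is not None
    }


def is_node_free(node_id, allocations):
    """A node is free when no occupying allocation references it."""
    return node_id not in occupied_node_ids(allocations)
```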

## 3. Usage/Billing Lifecycle

### Current behavior
- Usage starts with allocation creation: `startTime`, `endTime=null`
- A billing loop runs every minute and updates `lastBilledAt` and `cost`
- Usage closes on allocation release (`endTime` set)

### Target
- Event-sourced debit windows with idempotency key per `(usage_record, interval_window)`.
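
One way to picture the per-window idempotency key is sketched below. The key format, the `DebitLedger` class, and the function names are hypothetical; the only property taken from the target above is that a debit is identified by `(usage_record, interval_window)`, so replaying the same window must not charge twice.

```python
import hashlib
from datetime import datetime, timezone


def debit_idempotency_key(usage_record_id: str, window_start: datetime, window_end: datetime) -> str:
    """Build a deterministic key for one (usage_record, interval_window) pair."""
    if window_start.tzinfo is None or window_end.tzinfo is None:
        raise ValueError("window bounds must be timezone-aware")
    if window_end <= window_start:
        raise ValueError("window_end must be after window_start")
    # Normalise to UTC so the same instant always yields the same key.
    start = window_start.astimezone(timezone.utc).isoformat()
    end = window_end.astimezone(timezone.utc).isoformat()
    raw = f"{usage_record_id}|{start}|{end}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DebitLedger:
    """In-memory stand-in for a ledger with a unique constraint on the key."""

    def __init__(self):
        self._debits = {}

    def apply(self, key: str, amount_minor: int) -> bool:
        """Record the debit once; return False when the key was already applied."""
        if key in self._debits:
            return False
        self._debits[key] = amount_minor
        return True

    def total(self) -> int:
        return sum(self._debits.values())
```

In a real store the duplicate check would typically be a unique index on the key rather than an in-process dictionary.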

## 4. User Balance State

### Current behavior
- thresholds:
  - low: `balance <= 10`
  - depleted: `balance <= 0`
- `lowBalanceNotified` prevents repeated warning spam.
- depleted triggers force release of all active allocations.

### Target
- explicit state field or derived state view:
  - `healthy`, `low_balance`, `depleted`
- notification service/channel decoupled from terminal WS.
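
The derived state view could look like the sketch below. Thresholds are taken from the current behaviour above; the function names and the use of `Decimal` are assumptions, and `depleted` is checked first because a balance at or below zero also satisfies the low threshold.

```python
from decimal import Decimal

LOW_BALANCE_THRESHOLD = Decimal("10")
DEPLETED_THRESHOLD = Decimal("0")


def balance_state(balance: Decimal) -> str:
    """Map a balance to `healthy`, `low_balance`, or `depleted`."""
    if balance <= DEPLETED_THRESHOLD:
        return "depleted"
    if balance <= LOW_BALANCE_THRESHOLD:
        return "low_balance"
    return "healthy"


def should_send_low_balance_warning(balance: Decimal, low_balance_notified: bool) -> bool:
    """Warn once per low-balance episode, mirroring the `lowBalanceNotified` flag."""
    return balance_state(balance) == "low_balance" and not low_balance_notified
```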

## 5. Stripe Payment State

### Target lifecycle (implemented via `payment_sessions` table)

```
initiated → checkout_completed → credited
                                ↘ failed_reconcile
         ↘ expired  (session TTL elapsed with no webhook)
```

| Status | Trigger |
|---|---|
| `initiated` | `POST /api/v1/payments/checkout-session` — Stripe session created, URL returned to user |
| `checkout_completed` | Stripe webhook `checkout.session.completed` received and verified |
| `credited` | Ledger credit posted transactionally with the webhook; `ledger_entry_id` set |
| `failed_reconcile` | Checkout completed but credit application failed after all retries, or `credited_amount_minor` did not match `requested_amount_minor` |
| `expired` | Session TTL elapsed (default 24h) with no `checkout.session.completed` webhook |

### Key properties
- `stripe_checkout_session_id` is the join key between the webhook payload and the session row.
- `idempotency_key` (from `X-Idempotency-Key` header) is stored; duplicate session creation
  requests for the same user + key return the existing session URL without a second Stripe call.
- `credited_amount_minor` is set from the webhook payload and must equal `requested_amount_minor`;
  a mismatch is flagged as `failed_reconcile` for ops investigation.
- Admin endpoint: `GET /api/v1/admin/payments/sessions` surfaces stuck sessions
  (`initiated` with no completion after >1h, or `failed_reconcile`) for support resolution.
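
The properties above suggest a webhook handler shaped roughly like the following sketch. It deliberately makes no Stripe API calls: `PaymentSession`, `apply_checkout_completed`, and the `post_credit` callback are hypothetical names, signature verification and the lookup by `stripe_checkout_session_id` are assumed to have happened already, and how a late webhook for an `expired` session is handled is not specified in this document, so the sketch leaves such a session untouched.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PaymentSession:
    stripe_checkout_session_id: str
    requested_amount_minor: int
    status: str = "initiated"
    credited_amount_minor: Optional[int] = None
    ledger_entry_id: Optional[str] = None


def apply_checkout_completed(
    session: PaymentSession,
    paid_amount_minor: int,
    post_credit: Callable[[int], str],
) -> PaymentSession:
    """Apply a verified `checkout.session.completed` event to a session row."""
    # Duplicate deliveries (and the unspecified expired case) are no-ops.
    if session.status not in ("initiated", "checkout_completed"):
        return session

    session.status = "checkout_completed"
    session.credited_amount_minor = paid_amount_minor

    # A mismatch is flagged for ops instead of crediting a wrong amount.
    if paid_amount_minor != session.requested_amount_minor:
        session.status = "failed_reconcile"
        return session

    # In a real system this and the status update share one DB transaction.
    session.ledger_entry_id = post_credit(paid_amount_minor)
    session.status = "credited"
    return session
```

Calling the handler twice for the same session posts the ledger credit only once, which is the behaviour webhook redelivery requires.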

## 6. Terminal Session State

### Canonical lifecycle
- `opening -> active -> closing -> closed`
- Failure side transitions:
  - `opening -> error`
  - `active -> error` (node stream/tunnel drop, upstream termination, policy timeout)

### Rules
- The open handshake uses a short-lived, single-use token.
- Active session max lifetime is enforced by policy key `terminal.session_max_ttl_seconds` (default 4h).
- Single active terminal session per allocation.
- Reconnect is full reopen (new token + new open task + new session id); no resume.

### Edge-case sequencing
- Allocation release during active terminal:
  1. send deterministic close reason (`allocation_released`),
  2. wait for the ack or for the close-timeout window to elapse,
  3. continue release path (`allocation.revoke_user` then workflow completion).
- Node stream drop during active terminal:
  - close with retryable reason (`node_stream_dropped`); the UI may reopen using the full open flow.
- User OIDC expiry during active terminal:
  - the session remains valid until it closes or the session TTL elapses; auth is rechecked on the next reopen.
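
The terminal rules (one live session per allocation, TTL enforcement, reconnect as a full reopen) can be sketched with a small in-memory registry. This is illustrative only: the `TerminalSessions` class, its method names, and the `policy_timeout` reason string are hypothetical, timestamps are plain seconds, and the `closing`/`closed` path is omitted for brevity.

```python
import uuid

LIVE_STATES = ("opening", "active")


class TerminalSessions:
    """Tracks at most one live terminal session per allocation."""

    def __init__(self, max_ttl_seconds=4 * 3600):
        # Default mirrors `terminal.session_max_ttl_seconds` (4h).
        self.max_ttl_seconds = max_ttl_seconds
        self._by_allocation = {}

    def open(self, allocation_id):
        current = self._by_allocation.get(allocation_id)
        if current is not None and current["state"] in LIVE_STATES:
            raise RuntimeError("allocation already has a live terminal session")
        # Reconnect is a full reopen: every open gets a fresh session id.
        session = {"id": uuid.uuid4().hex, "state": "opening", "activated_at": None, "reason": None}
        self._by_allocation[allocation_id] = session
        return session

    def activate(self, allocation_id, now):
        session = self._by_allocation[allocation_id]
        if session["state"] != "opening":
            raise RuntimeError("only an opening session can become active")
        session["state"] = "active"
        session["activated_at"] = now
        return session

    def fail(self, allocation_id, reason):
        session = self._by_allocation[allocation_id]
        if session["state"] not in LIVE_STATES:
            raise RuntimeError("session is not live")
        session["state"] = "error"
        session["reason"] = reason
        return session

    def enforce_ttl(self, allocation_id, now):
        """Move an active session to `error` once the max lifetime has elapsed."""
        session = self._by_allocation[allocation_id]
        if session["state"] == "active" and now - session["activated_at"] >= self.max_ttl_seconds:
            return self.fail(allocation_id, "policy_timeout")
        return session
```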

## 7. Storage Attachment State

Storage attachment is the runtime binding between a project-owned storage
namespace and one allocation/workload mount. The full workflow is defined in
`doc/architecture/Storage_Attachment_Workflow_v1.md`.

Canonical lifecycle:

```
requested -> prechecking -> grant_applying -> grant_applied -> mounting -> mounted
mounted -> detaching -> detached
```

Failure side transitions:

- `requested|prechecking -> failed`
- `grant_applying|grant_applied|mounting -> failed`
- `mounted|detaching -> detach_failed`
- `failed|detach_failed -> detaching -> detached`

Rules:

- Attach/detach is owned by Temporal, not direct HTTP handler side effects.
- Persistent storage is never deleted by an attachment detach or allocation
  release path.
- `quota_bytes` lives on the storage namespace; attachment precheck validates
  whether the requested write mode is allowed under current quota posture.
- `multi_writer` is allowed only when both provider capability and product/app
  storage policy allow it.
- Node-agent performs local mount/unmount through typed tasks and reports
  result; API/Temporal never SSHes into nodes directly.
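
To check a recorded attachment history against the lifecycle and failure transitions listed in this section, a path validator is enough. The sketch below encodes only the transitions stated above; the names `ATTACHMENT_TRANSITIONS` and `is_valid_path` are hypothetical, and treating `detached` as terminal is an assumption.

```python
# Transitions copied from the canonical lifecycle and failure side transitions.
ATTACHMENT_TRANSITIONS = {
    "requested": {"prechecking", "failed"},
    "prechecking": {"grant_applying", "failed"},
    "grant_applying": {"grant_applied", "failed"},
    "grant_applied": {"mounting", "failed"},
    "mounting": {"mounted", "failed"},
    "mounted": {"detaching", "detach_failed"},
    "detaching": {"detached", "detach_failed"},
    # Failed attachments are cleaned up through the detach path.
    "failed": {"detaching"},
    "detach_failed": {"detaching"},
    # Assumption: no outgoing transitions are documented for `detached`.
    "detached": set(),
}


def is_valid_path(states):
    """Return True when each consecutive pair in `states` is an allowed transition."""
    states = list(states)
    return all(
        nxt in ATTACHMENT_TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )
```

A check like this fits naturally in workflow tests, where the sequence of states written by the attach and detach activities can be asserted as a whole.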
