# Compensation Matrix (Workflow Failure Handling)

## Objective
Define compensating actions for partial failures so workflows are recoverable without manual data surgery.

| Workflow Step | Failure Point | Detection | Compensation Action | Owner |
|---|---|---|---|---|
| Provision request accepted | event publish fails after DB write | outbox lag monitor | outbox re-dispatch and retry publish | Orchestrator |
| Remote user setup | SSH user/key install fails | worker command exit/error | mark allocation failed; ensure node remains unassigned | Provisioning Worker |
| Allocation activation | DB write fails after remote setup | DB transaction failure | retry transaction; if repeated fail, enqueue cleanup task on node | Provisioning Worker |
| Release start | DB state updated but SSH cleanup fails | SSH failure log + retry counter | retry cleanup job; keep allocation released and flag cleanup_pending | Provisioning Worker |
| Billing debit | debit posted twice via replay | idempotency conflict | reject duplicate using key/window check | Billing Worker |
| Stripe credit | duplicate webhook event | unique stripe_event id conflict | no-op, acknowledge success | Payments Service |
| Notification publish | bus temporary failure | publish retry failure | retry; if max exceeded, persist notification for later dispatch | Notification Worker |
| Active allocation runtime | node goes offline mid-session | probe/heartbeat failure + worker checks | transition allocation to `failed`; stop usage accrual; apply credit adjustment per policy; emit failure notification | Provisioning Worker + Billing Worker |
| Admin force-release | force-release accepted while usage active | admin mutation accepted | stop usage accrual at accepted timestamp; proceed releasing workflow; no automatic credit unless refund policy triggers | Admin Service + Billing Worker |
| Storage attach precheck | target workload/allocation, quota, grant, provider capability, or mount policy is invalid | Temporal activity validation failure | mark `platform_storage_attachments.state=failed`; write audit and outbox; no provider/node side effect | Storage Workflow |
| Storage provider grant apply | provider grant/namespace ACL partially applied, later activity fails | provider adapter result + workflow retry exhaustion | revoke provider grant if possible; otherwise mark provider drift and block mount; mark attachment `failed` | Storage Workflow + Provider Adapter |
| Storage node mount | node-agent mount task dispatched but task fails or health check fails | node task result, heartbeat timeout, mount probe | dispatch idempotent unmount if needed; revoke ephemeral provider grant; mark attachment `failed`; storage namespace remains intact | Storage Workflow + Node Agent |
| Storage detach | allocation release or explicit detach cannot unmount after retries | node task timeout/result + workflow retry exhaustion | mark attachment `detach_failed`; keep storage namespace; require operator retry through API/workflow, not direct DB update | Storage Workflow + Ops |
| Storage quota reconcile | provider reports usage above quota or quota cannot be enforced | provider reconcile activity/read model | block new read-write attachments; surface quota flag; keep existing mounted workloads unless policy explicitly requires intervention | Storage Workflow + Storage Service |

## Operational Rules
- Every compensation must be auditable.
- Compensation side effects must also be idempotent.
- Manual intervention path must be documented for exhausted retries.
- Billing stop/credit behavior must be deterministic and policy-driven (no ad-hoc per-handler decisions).
