# Runbook: Billing Worker Failure and Reconciliation Drift

## Trigger
1. Alert: billing worker failures / billing queue lag.
2. Alert: payment reconcile failures (`payments.reconcile_failed`).
3. User/support report: unexpected charge, missing credit, or lifecycle boundary dispute.

## Required Context
1. `correlation_id` from API/UI error envelope (or support ticket metadata).
2. `trace_id` if present in error details/logs.
3. Scope identifiers when available:
- `org_id`
- `project_id`
- `user_id`
- `allocation_id`
- `app_instance_id`
- `usage_source`
- `control_plane_component`

## Immediate Actions
1. Determine blast radius:
- single user/project dispute vs global worker degradation.
2. If widespread:
- freeze risky manual adjustments until consistency is verified.
3. Confirm dependency health:
- Postgres connectivity
- NATS consumer health
- outbox relay health
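
One way to make the blast-radius call in step 1 less subjective is to count the distinct scopes that appear in the failing log lines. The sketch below assumes the failures have already been exported as dictionaries keyed by the scope identifiers from Required Context; the function name `blast_radius` is hypothetical, and the sketch sets no thresholds because the runbook does not define any.

```python
from typing import Dict, Iterable


def blast_radius(failures: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count distinct org/project/user scopes seen in failure records."""
    scopes = {"org_id": set(), "project_id": set(), "user_id": set()}
    for failure in failures:
        for key, seen in scopes.items():
            # Scope identifiers are optional, so skip missing or empty values.
            if failure.get(key):
                seen.add(failure[key])
    return {key: len(seen) for key, seen in scopes.items()}
```

A result such as one org and one project points toward a single dispute; many distinct orgs points toward worker-wide degradation.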

## Correlation-First Diagnosis
1. Start in Loki with `correlation_id`:
- `service=gpuaas-billing-worker` and related services (`gpuaas-api`, `gpuaas-webhook-worker`).
2. Extract `trace_id` from log/error details and open in Tempo.
3. Validate event sequence:
- `provisioning.active`
- `provisioning.releasing.completed` or `provisioning.release_failed`
- `billing.*` notifications
- `payments.balance_credited` / `payments.reconcile_failed` when the payment path is involved.
4. Validate data consistency:
- `usage_records` lifecycle (`start_time`, `end_time`, `last_billed_at`, `accrued_cost_minor`)
- `ledger_entries` projection for affected `requested_by_user_id`
- allocation status boundary (`active`, `releasing`, `released`, `release_failed`)
- app-runtime attribution boundary (`app_instance_id`, `usage_source`, `control_plane_component`, `operating_mode`, `control_plane_scope`, `runtime_backend`)
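
The event-sequence check in step 3 can be sketched as a small function over the event names recovered from Loki or Tempo, in the order they were emitted. This is a minimal illustration, not the worker's own validation: the function name `sequence_problems` and the `payment_path` flag are hypothetical, and only two rules are enforced (activation must exist and must precede any release outcome; a payment-path flow must show one of the two `payments.*` outcomes).

```python
from typing import List

ACTIVE = "provisioning.active"
RELEASE_OUTCOMES = {"provisioning.releasing.completed", "provisioning.release_failed"}
PAYMENT_OUTCOMES = {"payments.balance_credited", "payments.reconcile_failed"}


def sequence_problems(events: List[str], payment_path: bool = False) -> List[str]:
    """Return human-readable ordering problems; an empty list means none found."""
    problems = []
    first_active = events.index(ACTIVE) if ACTIVE in events else None
    if first_active is None:
        problems.append("missing provisioning.active")
    elif any(e in RELEASE_OUTCOMES for e in events[:first_active]):
        # A release outcome seen before the first activation is out of order.
        problems.append("release outcome precedes provisioning.active")
    if payment_path and not any(e in PAYMENT_OUTCOMES for e in events):
        problems.append("no payments outcome event on payment path")
    return problems
```

`billing.*` notifications are deliberately not ordered here, since the runbook does not state where they must fall relative to the provisioning events.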

## Reconciliation Checklist
1. Missing open usage:
- allocation is `active|releasing|release_failed` but no open `usage_records` row.
2. Orphan open usage:
- open `usage_records` row while allocation is not active/releasing.
3. Unbilled closed usage:
- closed `usage_records` with positive accrued cost and no usage ledger debit.
4. Payment drift:
- `payment_sessions.status=failed_reconcile` or missing linked `ledger_entry_id` after checkout completion.
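
Checklist items 1 to 3 can be expressed as a classification over exported rows, which helps when reviewing a large scope by hand. The sketch below makes several assumptions that should be confirmed against the schema: a usage record is "open" when `end_time` is null, each record carries an `allocation_id`, and `has_usage_debit` is a hypothetical flag computed beforehand from `ledger_entries`. It also treats `release_failed` as still billable for the orphan check, whereas item 2 names only active/releasing. Payment drift (item 4) is not covered.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumption: release_failed still expects an open usage row (per item 1).
BILLABLE_STATUSES = {"active", "releasing", "release_failed"}


@dataclass
class UsageRecord:
    allocation_id: str
    end_time: Optional[str]      # None is assumed to mean the record is open
    accrued_cost_minor: int
    has_usage_debit: bool        # hypothetical: a usage ledger debit exists


def find_drift(
    allocation_status: Dict[str, str], usage: List[UsageRecord]
) -> List[Tuple[str, str]]:
    """Return sorted (check_name, allocation_id) pairs for items 1-3."""
    findings = []
    open_ids = {u.allocation_id for u in usage if u.end_time is None}

    # Item 1: billable allocation without an open usage row.
    for alloc_id, status in allocation_status.items():
        if status in BILLABLE_STATUSES and alloc_id not in open_ids:
            findings.append(("missing_open_usage", alloc_id))

    for u in usage:
        status = allocation_status.get(u.allocation_id)
        if u.end_time is None:
            # Item 2: open usage row for a non-billable (or unknown) allocation.
            if status not in BILLABLE_STATUSES:
                findings.append(("orphan_open_usage", u.allocation_id))
        elif u.accrued_cost_minor > 0 and not u.has_usage_debit:
            # Item 3: closed usage with positive cost and no ledger debit.
            findings.append(("unbilled_closed_usage", u.allocation_id))
    return sorted(findings)
```

Findings are inputs to the approved reconciliation procedure, not instructions to mutate state directly.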

## Mitigation
1. Worker/process remediation:
- restart or roll back the billing worker if a runtime regression is suspected.
2. Data remediation:
- use approved reconciliation procedure; avoid ad-hoc direct state mutation.
3. Payment remediation:
- investigate provider/webhook mismatch and apply the recovery path per the payments runbook.
4. Keep every corrective action auditable: record the actor and the `correlation_id` it relates to.

## Launch Restriction Verification

When a prepaid or overdue posture should block new work, verify the product
launch boundary before debugging worker internals:

```bash
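# 1. Read the current financial posture for the caller.
#    (-k disables TLS certificate verification, e.g. for self-signed test clusters.)
curl -sk -H "Authorization: Bearer $TOKEN" \
  "$API/api/v1/billing/financial-posture" | jq .

# 2. Run the compute launch precheck and show only the billing check.
curl -sk -X POST "$API/api/v1/v3/launch/compute/precheck" \
  -H "Authorization: Bearer $TOKEN" \
  -H "X-Project-ID: $PROJECT_ID" \
  -H "Content-Type: application/json" \
  --data '{"sku":"compute-vm-small"}' \
  | jq '.checks[] | select(.id=="billing")'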
```

Expected restricted state:

```text
state=restricted
reason=prepaid_balance_depleted
effects.block_new_launches=true
precheck billing severity=blocker
submit HTTP 402 code=insufficient_balance
```
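
The expected state above can be compared against the two responses mechanically. The sketch below assumes the JSON field names mirror the labels in the expected-state listing (`state`, `reason`, `effects.block_new_launches`, and a `severity` field on the billing check); the response schemas are not spelled out in this runbook, so treat those names and the function `restriction_gaps` as hypothetical. The submit-time HTTP 402 is not checked.

```python
from typing import Any, Dict, List


def restriction_gaps(
    posture: Dict[str, Any],
    precheck: Dict[str, Any],
    expected_reason: str = "prepaid_balance_depleted",
) -> List[str]:
    """List the ways parsed responses differ from the expected restricted state."""
    gaps = []
    if posture.get("state") != "restricted":
        gaps.append("posture state is not restricted")
    if posture.get("reason") != expected_reason:
        gaps.append(f"posture reason is not {expected_reason}")
    if posture.get("effects", {}).get("block_new_launches") is not True:
        gaps.append("effects.block_new_launches is not true")

    # Same selection as the jq filter: the check whose id is "billing".
    billing = [c for c in precheck.get("checks", []) if c.get("id") == "billing"]
    if not billing:
        gaps.append("precheck has no billing check")
    elif billing[0].get("severity") != "blocker":
        gaps.append("billing check severity is not blocker")
    return gaps
```

An empty list means the launch boundary is behaving as expected, and the investigation can move on to worker internals.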

Recovery is the inverse: clear or downgrade the financial restriction through
the admin billing API, then verify the precheck billing check returns `ok`.
Latest kind proof:
`dist/uat/kind/20260530T083827Z-billing-launch-guard`.

## Recovery Criteria
1. Billing worker resumes stable processing.
2. Reconciliation checks return no unresolved drift for the impacted scope.
3. No duplicate debits/credits introduced by recovery.
4. Alerts return below threshold.
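
Criterion 3 can be spot-checked by grouping exported ledger rows on a key that should be unique per debit or credit. The column names below (`entry_type`, `source_ref`, `amount_minor`) are hypothetical stand-ins, since the `ledger_entries` schema is not described in this runbook; substitute whatever identifies the usage record or payment session an entry settles. A match is a prompt for manual review rather than proof of a duplicate, because incremental billing may legitimately produce repeated amounts.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def duplicate_ledger_keys(rows: Iterable[Dict[str, Any]]) -> List[Tuple[str, str, int]]:
    """Return (entry_type, source_ref, amount_minor) keys that occur more than once."""
    counts = Counter(
        (row["entry_type"], row["source_ref"], row["amount_minor"]) for row in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

Running it on the before and after exports captured as evidence shows whether remediation introduced any new repeated keys.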

## Evidence to Capture
1. Incident timeline with `correlation_id` and `trace_id`.
2. Query evidence for `usage_records` and `ledger_entries` before/after remediation.
3. Impacted tenant/project/user scope and customer-facing summary.
4. Follow-up tasks for root-cause prevention.
