# Runbook: App Runtime Billing Reconciliation Incident

## Trigger
1. User/operator report: an app-runtime charge is unexpected, missing, or duplicated.
2. Billing UI/export shows mixed allocation and app-runtime rows that do not reconcile.
3. Alert or log evidence indicates app-runtime metering drift, especially for control-plane components.

## Required Context
1. `correlation_id` from API/UI error envelope or support ticket.
2. `trace_id` when present.
3. Attribution identifiers when available:
   - `org_id`
   - `project_id`
   - `app_instance_id`
   - `usage_source`
   - `control_plane_component`
   - `operating_mode`
   - `control_plane_scope`
   - `runtime_backend`

## Immediate Actions
1. Confirm whether the dispute is:
   - one app instance,
   - one project,
   - one tenant,
   - or broad billing-worker degradation.
2. Do not treat app-runtime usage as a separate billing system.
   - all reconciliation must still flow through `usage_records` and `ledger_entries`.
3. If multiple app instances are impacted, freeze manual customer-facing adjustments until source attribution is verified.
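
The scoping decision in step 1 and the freeze rule in step 3 can be sketched as below. This is a minimal illustration, not the platform's tooling: the function names are hypothetical, rows are assumed to be plain dicts, and "tenant" is assumed to map to `org_id`.

```python
from typing import Iterable, Mapping


def classify_dispute_scope(rows: Iterable[Mapping[str, str]]) -> str:
    """Return the narrowest scope that covers every affected usage row.

    Assumption: each row carries org_id, project_id, and app_instance_id,
    and a tenant corresponds to one org_id.
    """
    orgs, projects, instances = set(), set(), set()
    for row in rows:
        orgs.add(row["org_id"])
        projects.add((row["org_id"], row["project_id"]))
        instances.add(row["app_instance_id"])
    if not instances:
        raise ValueError("no affected rows supplied")
    if len(instances) == 1:
        return "app_instance"
    if len(projects) == 1:
        return "project"
    if len(orgs) == 1:
        return "tenant"
    # More than one tenant affected: suspect billing-worker degradation.
    return "broad"


def must_freeze_adjustments(scope: str) -> bool:
    """Step 3: freeze manual adjustments once more than one instance is hit."""
    return scope != "app_instance"
```

A result of `"broad"` only suggests worker-level degradation; it still has to be confirmed from logs and traces.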

## Correlation-First Diagnosis
1. Start in Loki with `correlation_id`:
   - `{service=~"gpuaas-(api|billing-worker|app-runtime-worker|webhook-worker)"} | json | correlation_id="<CORRELATION_ID>"`
2. If `app_instance_id` is known, pivot on it:
   - `{service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | app_instance_id="<APP_INSTANCE_ID>"`
3. Extract `trace_id` and inspect Tempo for:
   - app lifecycle call,
   - outbox relay publish,
   - billing worker handling,
   - any webhook/payment follow-on if customer funding is involved.
4. Confirm source attribution on affected usage rows:
   - `usage_source = app_runtime`
   - `app_instance_id` present
   - `control_plane_component` correct for control-plane cost
   - `operating_mode`, `control_plane_scope`, `runtime_backend` align with the instance
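
The two Loki queries from steps 1 and 2 can be generated rather than hand-edited, which avoids quoting mistakes when pasting identifiers from tickets. The helper below is a sketch: `loki_query` and `_quote` are hypothetical names, and the escaping assumes LogQL double-quoted strings accept backslash escapes.

```python
# Service selectors copied from the queries in steps 1 and 2.
SERVICES_BY_FIELD = {
    "correlation_id": "gpuaas-(api|billing-worker|app-runtime-worker|webhook-worker)",
    "app_instance_id": "gpuaas-(api|billing-worker|app-runtime-worker)",
}


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def loki_query(field: str, value: str) -> str:
    """Build the LogQL query for a correlation_id or app_instance_id pivot."""
    if field not in SERVICES_BY_FIELD:
        raise ValueError(f"unsupported pivot field: {field}")
    services = SERVICES_BY_FIELD[field]
    return f'{{service=~"{services}"}} | json | {field}={_quote(value)}'


if __name__ == "__main__":
    print(loki_query("correlation_id", "<CORRELATION_ID>"))
```

Run as a script, it prints the step 1 query with the placeholder left in place.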

## Reconciliation Checklist
1. Missing usage row:
   - app runtime activity happened, but no `usage_records` row exists for the `app_instance_id`.
2. Wrong source attribution:
   - usage is recorded against `allocation` when it should be `app_runtime`, or vice versa.
3. Wrong attribution anchor:
   - `project_id`, `app_instance_id`, `operating_mode`, or `control_plane_scope` does not match the instance.
4. Ledger mismatch:
   - app-runtime `usage_records` rows exist, but no corresponding debit or credit appears in `ledger_entries` or in customer-visible billing state.
5. Control-plane classification drift:
   - `control_plane_component` is false for scheduler, head, or other control services that should be metered separately.
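
The five checks above can be expressed as one pass over candidate rows. The sketch below is illustrative only and rests on several assumptions: rows and instance metadata are plain dicts, `control_plane_component` is a boolean, and the `id` and `is_control_plane` keys plus the set of usage-record ids referenced by `ledger_entries` are hypothetical stand-ins for whatever the real schema provides. It checks only the `allocation`-instead-of-`app_runtime` direction of item 2.

```python
from typing import Collection, List, Mapping, Sequence

ANCHOR_FIELDS = ("project_id", "app_instance_id", "operating_mode", "control_plane_scope")


def checklist_findings(
    instance: Mapping[str, object],
    usage_rows: Sequence[Mapping[str, object]],
    ledger_usage_ids: Collection[str],
) -> List[str]:
    """Return the checklist items that apply to one app instance."""
    if not usage_rows:
        return ["missing_usage_row"]  # item 1
    findings = set()
    for row in usage_rows:
        if row["usage_source"] != "app_runtime":
            findings.add("wrong_source_attribution")  # item 2
        if any(row[f] != instance[f] for f in ANCHOR_FIELDS):
            findings.add("wrong_attribution_anchor")  # item 3
        if row["id"] not in ledger_usage_ids:
            findings.add("ledger_mismatch")  # item 4
        if instance["is_control_plane"] and not row["control_plane_component"]:
            findings.add("control_plane_classification_drift")  # item 5
    return sorted(findings)
```

An empty result means none of the five conditions were detected for the rows supplied; it does not prove the rows are complete.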

## Mitigation
1. Fix the owning layer:
   - metering emitter,
   - attribution mapping,
   - billing worker interpretation,
   - or UI/filter/export path.
2. Do not patch around drift by inventing app-runtime-only ledgers or manual hidden adjustments.
3. If remediation needs data correction:
   - use the approved, auditable reconciliation procedure,
   - preserve `correlation_id` linkage in incident notes and corrective records.

## Recovery Criteria
1. Mixed usage listing/export clearly distinguishes `allocation` vs `app_runtime`.
2. `app_instance_id`-scoped usage rows reconcile with ledger-visible customer impact.
3. Control-plane cost rows are explainable by `control_plane_component`, `operating_mode`, `control_plane_scope`, and `runtime_backend`.
4. No duplicate or missing customer-visible charges remain for impacted scope.
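
Criterion 4 can be spot-checked mechanically once the before/after data is exported. The sketch below assumes, purely for illustration, that each usage record should map to exactly one ledger entry and that ledger entries expose the id of the usage record they charge for; the real relationship may be one-to-many (for example, a debit plus a corrective credit), in which case the duplicate rule needs adjusting.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def charge_gaps(
    usage_ids: Iterable[str], ledger_usage_ids: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, duplicated) usage-record ids for the impacted scope.

    missing:    usage records with no ledger entry.
    duplicated: usage records referenced by more than one ledger entry.
    """
    counts = Counter(ledger_usage_ids)
    missing = sorted(u for u in set(usage_ids) if counts[u] == 0)
    duplicated = sorted(u for u, n in counts.items() if n > 1)
    return missing, duplicated
```

Both lists should be empty for the impacted scope before the incident is closed.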

## Evidence to Capture
1. Incident timeline with `correlation_id` and `trace_id`.
2. Before/after query evidence for:
   - `usage_records`
   - `ledger_entries`
   - app instance metadata (`app_instance_id`, mode/scope/runtime backend)
3. Customer-visible impact summary by tenant/project/app instance.
4. Follow-up task for the owning layer if drift originated outside billing.
