Troubleshooting designed
Troubleshooting should start from the state the user sees, then point to the owning product or operator surface.
Common States
- provisioning: placement or node bootstrap is still running.
- active: workload is ready for terminal, SSH, metrics, or app access.
- releasing: teardown is in progress.
- release_failed: billing has stopped, but cleanup needs retry or operator attention.
- failed: provisioning did not complete; check machine-readable failure reason.
- insufficient_balance: add funds or contact the tenant/customer admin.
- sku_unavailable: select a different SKU or wait for capacity.
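The states listed above lend themselves to a simple lookup from the state a user sees to the guidance to show. The following is a minimal sketch, not the product's actual client code: the `AllocationState` enum, the `NEXT_STEP` table, and the `next_step` helper are hypothetical names, and the fallback for unrecognized states is an assumption.

```python
from enum import Enum


class AllocationState(str, Enum):
    """Allocation states, using the identifiers from the list above."""
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASE_FAILED = "release_failed"
    FAILED = "failed"
    INSUFFICIENT_BALANCE = "insufficient_balance"
    SKU_UNAVAILABLE = "sku_unavailable"


# Guidance text paraphrases the state descriptions; wording is illustrative.
NEXT_STEP = {
    AllocationState.PROVISIONING: "Placement or node bootstrap is still running.",
    AllocationState.ACTIVE: "Workload is ready for terminal, SSH, metrics, or app access.",
    AllocationState.RELEASING: "Teardown is in progress.",
    AllocationState.RELEASE_FAILED: "Billing has stopped; cleanup needs retry or operator attention.",
    AllocationState.FAILED: "Provisioning did not complete; check the machine-readable failure reason.",
    AllocationState.INSUFFICIENT_BALANCE: "Add funds or contact the tenant/customer admin.",
    AllocationState.SKU_UNAVAILABLE: "Select a different SKU or wait for capacity.",
}


def next_step(raw_state: str) -> str:
    """Return user-facing guidance for a raw state string."""
    try:
        state = AllocationState(raw_state)
    except ValueError:
        # Assumption: unknown states are escalated rather than guessed at.
        return "Unknown state; escalate with the correlation id."
    return NEXT_STEP[state]
```

Keeping the table keyed by an enum means a newly added state fails loudly in review (a missing key) instead of silently showing no guidance.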
Troubleshooting By Symptom
| Symptom | First check | Next safe step |
|---|---|---|
| Launch stays in provisioning | current allocation state and correlation id | wait out the normal provisioning window, then escalate with the correlation id |
| Browser or SSH access missing | allocation is actually active and the expected access path is enabled | retry the correct access path, then escalate with proof of the observed state |
| Release does not complete | state is releasing or release_failed | retry the release if it is user-safe; otherwise use the support/operator path |
| MFA status looks stale | account security page and refresh path | refresh status before assuming the factor was lost |
| Recovery path fails | capture correlation id and the action attempted | move to support-assisted recovery path |
| Billing warning appears | current balance, recent usage, and tenant context | add funds or contact the tenant admin |
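The first row of the table (a launch that stays in provisioning) can be sketched as a small decision function. This is an illustration under stated assumptions: `triage_provisioning`, `Triage`, and `PROVISIONING_WINDOW_SECONDS` are hypothetical names, and the 600-second window is a placeholder because the source does not state how long the normal provisioning window is.

```python
from dataclasses import dataclass

# Assumed placeholder; the real provisioning window is not given in this document.
PROVISIONING_WINDOW_SECONDS = 600.0


@dataclass(frozen=True)
class Triage:
    action: str  # one of "wait", "escalate", "recheck"
    note: str


def triage_provisioning(
    state: str,
    elapsed_seconds: float,
    correlation_id: str,
    window_seconds: float = PROVISIONING_WINDOW_SECONDS,
) -> Triage:
    """Decide the next safe step for a launch that appears stuck."""
    if state != "provisioning":
        # The symptom no longer matches this row; look up the new state instead.
        return Triage("recheck", f"state is now {state}")
    if elapsed_seconds <= window_seconds:
        return Triage("wait", "still inside the normal provisioning window")
    # Past the window: escalate, always carrying the correlation id.
    return Triage("escalate", f"provisioning exceeded window; correlation id {correlation_id}")
```

For example, `triage_provisioning("provisioning", 30, "corr-123").action` evaluates to `"wait"`, while the same call with an elapsed time beyond the window returns an `"escalate"` result whose note includes the correlation id.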
User-Safe Rule
Prefer product/API surfaces and correlation IDs over direct infrastructure inspection. Repeated operator-only checks should become product read models.
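One way to apply this rule is to make every escalation carry only what the product surface already exposes: the correlation id, the observed state, and the action attempted. The sketch below assumes a hypothetical `EscalationReport` shape and `build_report` helper; the field names and the `source` marker are illustrative, not an existing schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EscalationReport:
    correlation_id: str
    observed_state: str
    action_attempted: str
    # Hypothetical marker: records that the data came from a product/API
    # surface rather than direct infrastructure inspection.
    source: str = "product_api"


def build_report(correlation_id: str, observed_state: str, action_attempted: str) -> str:
    """Serialize an escalation report; refuse to escalate without a correlation id."""
    if not correlation_id.strip():
        raise ValueError("a correlation id is required before escalating")
    report = EscalationReport(correlation_id, observed_state, action_attempted)
    return json.dumps(asdict(report), sort_keys=True)
```

Requiring the correlation id up front keeps support and operators from having to reconstruct context by inspecting infrastructure directly.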