# PRD v0.3 - Core42 AI Cloud Compute Platform

## 1. Document Intent
This PRD converts prototype learning into a production-oriented, API-first product baseline.

Assumption control:
- Cross-cutting product/architecture assumptions are tracked in `doc/governance/Assumptions_Register.md` and must be updated with PRD-affecting changes.

## 2. Product Vision
Provide a secure self-service GPU platform where users can discover capacity, provision compute, access nodes, monitor usage, and pay based on consumption.

## 3. Product Goals
- Fast time-to-compute: provision in minutes.
- Transparent usage and billing.
- Safe multi-user operations with role-based controls.
- Operator-friendly admin surface for inventory and accounts.

## 4. Personas
- End User: provisions and operates GPU nodes.
- Admin: manages users, balances, and node inventory.
- Billing Operator: monitors charges, top-ups, and reconciliation.

## 5. In Scope (MVP Rebuild)
- OIDC-based auth and role/tenant-aware authorization.
- SKU catalog and node inventory.
- Allocation lifecycle: provision, active, release.
- Browser terminal session to active allocation.
- Secure SSH key retrieval for active allocation.
- Usage metering and periodic rating.
- Balance warnings and depleted enforcement.
- Stripe checkout top-up and webhook processing.
- Admin user/node management.
- Admin allocation/audit/payment-session operational visibility.
- Admin operations telemetry overview (health, queue depth, throughput, error rates).
- Object-storage-backed user storage operations.
- Audit logging for privileged and financial actions.

## 6. Out of Scope (MVP)
- Managed scheduler products (SLURM/k8s/Ray) as real backend features.
- Enterprise invoicing/subscriptions/commit contracts.
- Full multi-tenant org hierarchy UX (schema/policy readiness still required).
- Multi-region active/active runtime.
- User-managed API keys for programmatic auth (deferred; auth middleware remains pluggable so a key resolver can be added later).

## 7. Functional Requirements

### FR-1 Authentication and Session
- Users authenticate via an OIDC-compatible provider.
- API accepts short-lived access tokens and enforces server-side authz.
- Protected APIs reject missing/invalid tokens.

Acceptance criteria:
- Requests with missing or invalid tokens receive an unauthorized response.
- Admin-only routes enforce role.
- Tenant-scoped routes enforce tenant policy.
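
The three acceptance checks above can be sketched as a single server-side decision function. This is a minimal illustration, not the platform's middleware: the `Claims` shape, the role values, and the choice of HTTP 401 versus 403 are assumptions, and token signature/expiry verification is assumed to have happened before this function is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claims:
    """Hypothetical claims taken from an already-verified access token."""
    subject: str
    role: str        # assumed values: "user" or "admin"
    tenant_id: str


def authorize(claims: Optional[Claims], *, admin_only: bool = False,
              resource_tenant_id: Optional[str] = None) -> int:
    """Return an HTTP-style status code for a protected route."""
    if claims is None:
        return 401  # missing or invalid token
    if admin_only and claims.role != "admin":
        return 403  # authenticated, but the role is insufficient
    if resource_tenant_id is not None and claims.tenant_id != resource_tenant_id:
        return 403  # authenticated, but outside the caller's tenant
    return 200
```

Passing `claims=None` models a token that failed verification, so a route handler never sees unauthenticated requests.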

### FR-2 SKU Catalog and Inventory
- System exposes SKU catalog and availability.
- User-facing inventory view excludes infrastructure connection secrets.

Acceptance criteria:
- Free capacity reflects only online and unassigned resources.
- User node list does not expose admin-only connection coordinates.
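
As a rough sketch of both criteria, free capacity can be counted from nodes that are online and unassigned, and the user-facing view can be built from an allow-list of fields so that connection coordinates never leak by omission. The `Node` fields shown here (`ssh_host`, `ssh_port`, and so on) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    sku: str
    online: bool
    allocation_id: Optional[str]  # None means the node is unassigned
    ssh_host: str                 # admin-only connection coordinate (assumed field)
    ssh_port: int                 # admin-only connection coordinate (assumed field)


def is_free(node: Node) -> bool:
    """A node counts as free only when it is online and unassigned."""
    return node.online and node.allocation_id is None


def free_capacity(nodes: list[Node]) -> dict[str, int]:
    """Count free nodes per SKU."""
    counts: dict[str, int] = {}
    for node in nodes:
        if is_free(node):
            counts[node.sku] = counts.get(node.sku, 0) + 1
    return counts


def user_view(node: Node) -> dict:
    """Allow-list projection: only fields that are safe for end users."""
    return {"node_id": node.node_id, "sku": node.sku, "available": is_free(node)}
```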

### FR-3 Provisioning Lifecycle
- User requests allocation creation.
- Prechecks enforce availability, policy, and funding constraints.
- System provisions runtime access and records allocation + usage start.

Acceptance criteria:
- Provision failures return explicit machine-readable reason.
- Allocation status transitions are visible: `requested`, `provisioning`, `active`, `releasing`, `released`, `failed`, `release_failed`.

### FR-4 Runtime Access
- User can open terminal to active allocation.
- User can view metrics for active allocation.
- User can retrieve access credentials without persistent server-side private-key storage.

Acceptance criteria:
- Access denied for non-owner/non-admin.
- Key retrieval does not rely on query-token auth.
- Production path does not require long-lived storage of user SSH private key material in control-plane DB.

### FR-5 Release Lifecycle
- User/admin can request release.
- System transitions allocation to `releasing` then `released`.
- System performs runtime cleanup and ends usage accounting.

Acceptance criteria:
- Released allocation is not billed further.
- Node/resource becomes available for next provisioning.
- `release_failed` is surfaced with retry path; billing is stopped while in `release_failed`.

### FR-6 Usage and Billing
- Usage is rated as SKU rate x quantity x duration.
- Monetary values use minor units (integer) with explicit currency.
- Low-balance and depleted-balance policies enforced.

Acceptance criteria:
- When a user's balance is depleted, all of that user's active allocations are force-released.
- Billing APIs include currency and minor units.
- Low-balance and projected depletion warnings are emitted before forced release when projection data is available.
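
The rating rule can be illustrated with integer-only arithmetic. This sketch assumes an hourly SKU rate expressed in minor units and rounds fractional minor units down; neither the pricing granularity nor the rounding policy is specified in this PRD, so both are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Money:
    amount_minor: int  # integer minor units, e.g. cents
    currency: str      # explicit currency code


def rate_usage(rate_minor_per_hour: int, quantity: int,
               duration_seconds: int, currency: str) -> Money:
    """Charge = SKU rate x quantity x duration, computed without floats."""
    if rate_minor_per_hour < 0 or quantity < 0 or duration_seconds < 0:
        raise ValueError("rate, quantity, and duration must be non-negative")
    # Multiply first, divide last, so precision is lost only once.
    total_minor = rate_minor_per_hour * quantity * duration_seconds // 3600
    return Money(total_minor, currency)
```

Keeping the division as the final step avoids accumulating rounding error across billing windows.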

### FR-7 Payments
- User can initiate Stripe checkout top-up.
- Webhook processing is signature-verified and idempotent.
- Successful credits emit domain event for downstream consumers.
- Refund policy uses hybrid model:
  - Provider refund allowed within configurable window `refund_window_days`.
  - Outside window, refund request falls back to internal balance credit.
  - Refundable amount must be constrained by configurable policy for unused/prepaid balance.

Acceptance criteria:
- Duplicate webhook does not double-credit.
- Payment-credit event available for notifications/billing read model.
- Refund outcome is explicit (`provider_refund` or `internal_credit`) and auditable.
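
One way to satisfy the duplicate-webhook criterion is to key every credit on the provider's event ID and treat a repeated ID as a no-op. The sketch below uses an in-memory store for illustration only; it assumes the webhook signature has already been verified, and the class name, method name, and `payment.credited` event type are hypothetical. In a production path the deduplication record and the balance update would presumably share one database transaction.

```python
class BalanceLedger:
    """In-memory stand-in for a transactional balance store (illustrative only)."""

    def __init__(self) -> None:
        self.balances_minor: dict[str, int] = {}
        self.processed_event_ids: set[str] = set()
        self.emitted_events: list[dict] = []

    def apply_topup(self, event_id: str, user_id: str,
                    amount_minor: int, currency: str) -> bool:
        """Credit the user once per event ID; return False for a duplicate."""
        if amount_minor <= 0:
            raise ValueError("top-up amount must be positive")
        if event_id in self.processed_event_ids:
            return False  # duplicate delivery: no second credit, no second event
        self.processed_event_ids.add(event_id)
        self.balances_minor[user_id] = self.balances_minor.get(user_id, 0) + amount_minor
        # Domain event for downstream consumers (notifications, billing read model).
        self.emitted_events.append({
            "type": "payment.credited",  # hypothetical event name
            "event_id": event_id,
            "user_id": user_id,
            "amount_minor": amount_minor,
            "currency": currency,
        })
        return True
```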

### FR-8 Admin Operations
- Admin can create users.
- Admin can adjust user balance with explicit credit/debit semantics.
- Admin can request refunds through dedicated refund API (not generic balance adjustment).
- Admin can add/probe/delete nodes.
- Admin can view cross-user allocations and force-release with explicit reason.
- Admin can query and export audit logs.
- Admin can view payment sessions for reconciliation.

Acceptance criteria:
- All admin mutations write audit logs.
- Node probe status reflected in admin inventory.
- Refund API behavior matches policy window + fallback rules.
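
The window-plus-fallback rule from FR-7 could be checked with a small decision function like the one below. It is a sketch under assumptions: the refundable amount is capped at the user's unused prepaid balance (the PRD only says the cap is policy-driven), and the function name and parameters are illustrative.

```python
from datetime import datetime, timedelta, timezone


def decide_refund(paid_at: datetime, requested_at: datetime,
                  requested_minor: int, unused_prepaid_minor: int,
                  refund_window_days: int) -> tuple[str, int]:
    """Return (outcome, amount_minor) for a refund request."""
    if requested_minor <= 0:
        raise ValueError("refund amount must be positive")
    # Assumed cap: never refund more than the unused prepaid balance.
    amount_minor = min(requested_minor, max(unused_prepaid_minor, 0))
    within_window = requested_at - paid_at <= timedelta(days=refund_window_days)
    outcome = "provider_refund" if within_window else "internal_credit"
    return outcome, amount_minor
```

For example, a request made 9 days after payment with a 14-day window resolves to `provider_refund`, while the same request made 31 days after payment falls back to `internal_credit`.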

### FR-9 Storage Access
- User storage is object-storage-backed with metadata index.
- User can list/upload/download/create/rename/delete within scoped namespace.
- Traversal and namespace breakout attempts are rejected.

Acceptance criteria:
- Namespace enforcement verified by negative tests.
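
A negative test for namespace enforcement might target a resolver like the following, which normalizes the requested path and rejects anything that lands outside the user's prefix. The `users/<user_id>` key layout is an assumption made for illustration, and `user_id` is assumed to be validated elsewhere.

```python
import posixpath


def resolve_object_key(user_id: str, relative_path: str) -> str:
    """Map a user-supplied path to an object key inside the user's namespace."""
    if "\x00" in relative_path or "\\" in relative_path:
        raise ValueError("illegal character in path")
    if relative_path.startswith("/"):
        raise ValueError("absolute paths are not allowed")
    root = f"users/{user_id}"  # assumed key layout
    # normpath collapses "." and ".." segments so the final location is explicit.
    normalized = posixpath.normpath(posixpath.join(root, relative_path))
    if normalized != root and not normalized.startswith(root + "/"):
        raise ValueError("path escapes the user namespace")
    return normalized
```

Comparing against `root + "/"` rather than `root` alone keeps a sibling prefix such as `users/u1x` from passing the check.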

### FR-10 Abuse Protection and Rate Limiting
- Public APIs enforce rate limits and abuse controls.
- Limits are policy-configurable per endpoint/user class.

Acceptance criteria:
- Rate-limit responses are deterministic and observable.
- Abuse policy ownership is defined in operations/security docs.
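
To make "deterministic" concrete, a limiter can take the current time as an argument so that the same inputs always produce the same decision. The sketch below uses a fixed one-minute window, which is one possible algorithm and not one the PRD mandates; the class name and the retry-after return value are illustrative.

```python
class FixedWindowLimiter:
    """Per-key request counter over fixed one-minute windows (illustrative)."""

    def __init__(self, limit_per_minute: int) -> None:
        self.limit = limit_per_minute
        self._counts: dict[tuple[str, int], int] = {}

    def check(self, key: str, now_seconds: float) -> tuple[bool, int]:
        """Return (allowed, retry_after_seconds) for one request."""
        window = int(now_seconds // 60)
        count = self._counts.get((key, window), 0)
        if count >= self.limit:
            # Seconds remaining until the next window opens.
            return False, 60 - int(now_seconds % 60)
        self._counts[(key, window)] = count + 1
        return True, 0
```

The sketch never evicts old windows; a real deployment would need expiry, and would likely source `limit_per_minute` from a policy key such as `rate_limit.api_requests_per_minute`.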

### FR-11 Audit Logging
- System records immutable audit entries for privileged actions.
- Billing and payment mutations are auditable with correlation IDs.
- Admins can query and export audit logs for compliance and incident response.

Acceptance criteria:
- Admin balance adjustments, refunds, user creation, and node deletion are auditable.
- Audit logs are available via paginated admin API and CSV export endpoint.
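
As a loose illustration of the export criterion, audit entries can be modeled as frozen records and serialized with the standard `csv` module. The field names are assumptions, and a frozen dataclass only prevents in-process mutation; storage-level immutability would need to be enforced separately.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AuditEntry:
    """Hypothetical audit record; frozen so it cannot be modified after creation."""
    entry_id: str
    actor_id: str
    action: str
    target_id: str
    correlation_id: str
    occurred_at: str  # ISO 8601 timestamp in UTC


def export_csv(entries: list[AuditEntry]) -> str:
    """Serialize entries to CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AuditEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()
```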

### FR-12 Operations Observability Surface
- Admin can view a read-only operational telemetry summary from within the product UI.
- Ops summary is aggregated and sanitized for browser exposure (no raw infra secrets or tokens).
- Ops panel is role-gated to admin users.

Acceptance criteria:
- `/api/v1/admin/ops/overview` returns aggregated health/queue/error/throughput metrics.
- Endpoint enforces admin authorization and standard error model.

## 8. Allocation State Machine (Required)
States:
- `requested` -> `provisioning` -> `active` -> `releasing` -> `released`
- Failure side paths:
  - `provisioning` -> `failed`
  - `releasing` -> `release_failed`
  - `release_failed` -> `releasing` (user retry or admin force-release)

Rules:
- A resource can have at most one active allocation.
- User concurrent allocation limit policy: default `allocation.max_concurrent_per_user = 2` (configurable).
- `release_failed` means cleanup retries were exhausted; billing is stopped and admin/user retry path must remain available.
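
The state list and failure side paths above translate directly into a transition table. The following is a minimal sketch of such a guard; the function name and error type are illustrative, and treating `released` and `failed` as terminal is an inference from the absence of outgoing transitions in this section.

```python
ALLOWED_TRANSITIONS: dict[str, set[str]] = {
    "requested": {"provisioning"},
    "provisioning": {"active", "failed"},
    "active": {"releasing"},
    "releasing": {"released", "release_failed"},
    "release_failed": {"releasing"},  # user retry or admin force-release
    "released": set(),                # assumed terminal
    "failed": set(),                  # assumed terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current}")
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Centralizing the table means a direct jump such as `active` to `released` is rejected instead of silently skipping cleanup.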

## 9. Billing State Machine (Required)
User billing states:
- `healthy` (balance > low threshold)
- `low_balance` (0 < balance <= low threshold)
- `auto_release_pending` (advisory warning state when projected depletion time is within warning window)
- `depleted` (balance <= 0)

Transitions:
- `healthy -> low_balance`: trigger warning
- `low_balance -> auto_release_pending` (advisory): trigger projected depletion warning when an estimate is available
- `low_balance -> depleted`: trigger forced release
- `depleted -> healthy`: after successful top-up; allocations do **not** auto-restart by default
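
The state definitions can be read as a pure classification of the current balance. The sketch below follows the thresholds given in this section; the projection inputs (`seconds_to_depletion`, `warning_window_seconds`) are hypothetical names, and the advisory state is only reported when both are supplied.

```python
from typing import Optional


def billing_state(balance_minor: int, low_threshold_minor: int,
                  seconds_to_depletion: Optional[int] = None,
                  warning_window_seconds: Optional[int] = None) -> str:
    """Classify a user's billing state from balance and optional projection."""
    if balance_minor <= 0:
        return "depleted"
    if balance_minor > low_threshold_minor:
        return "healthy"
    # From here on: 0 < balance <= low threshold.
    if (seconds_to_depletion is not None
            and warning_window_seconds is not None
            and seconds_to_depletion <= warning_window_seconds):
        return "auto_release_pending"  # advisory only; no release is forced yet
    return "low_balance"
```

Comparing the state before and after each rating run would then identify which transition, and therefore which warning or forced release, to trigger.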

Recovery policy:
- After top-up, user must manually reprovision (default).
- Auto-restart may be introduced as explicit future policy.

First-run onboarding policy:
- If a user has zero balance and no allocations at first login, the UX routes the user to billing with an onboarding CTA.

## 10. Non-Functional Requirements
- Security: secure token handling, audited privileged actions, secret management, abuse protections.
- Reliability: idempotent webhook/provisioning/billing flows with retries and DLQ.
- Durability: transactional DB and immutable ledger.
- Performance: responsive APIs under expected load with enforced pagination.
- Observability: tracing, structured logs, metrics, and alerting.

## 11. Policy Configuration Model (Mandatory)
All operational policy values are configuration-driven (DB/config), not hardcoded constants.

Required capabilities:
- Scoped policy resolution: `global -> plan -> org -> user` (or narrower where applicable).
- Validated bounds on every policy key (`min`, `max`, allowed enum values).
- Effective-at support for controlled rollouts of policy changes.
- Full audit trail for policy updates (`who`, `what`, `before`, `after`, `when`, `reason`).
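
Scoped resolution with bounds checking might look like the sketch below. It assumes the narrowest scope that defines a key wins and that bounds are numeric `(min, max)` pairs; the bound values used in the usage note are placeholders, since launch defaults are still an open question.

```python
from typing import Mapping

# Narrowest scope first: the first scope that defines the key wins (assumption).
SCOPE_ORDER = ("user", "org", "plan", "global")


def resolve_policy(key: str,
                   scoped_values: Mapping[str, Mapping[str, int]],
                   bounds: Mapping[str, tuple[int, int]]) -> int:
    """Return the effective value for `key`, validated against its bounds."""
    for scope in SCOPE_ORDER:
        value = scoped_values.get(scope, {}).get(key)
        if value is None:
            continue  # not set at this scope; fall back to the next broader one
        minimum, maximum = bounds[key]
        if not minimum <= value <= maximum:
            raise ValueError(f"{key}={value} at scope '{scope}' is outside [{minimum}, {maximum}]")
        return value
    raise KeyError(f"no value configured for {key}")
```

With a global value of 2 and an org override of 4 for `allocation.max_concurrent_per_user`, the resolver returns 4 for users in that org and 2 for everyone else.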

Initial policy keys (launch defaults configurable):
- `rate_limit.api_requests_per_minute`
- `rate_limit.terminal_token_requests_per_minute`
- `rate_limit.financial_requests_per_minute`
- `rate_limit.admin_overview_requests_per_minute`
- `allocation.max_concurrent_per_user`
- `billing.window_seconds`
- `billing.low_balance_threshold_minor`
- `billing.refund_window_days`
- `billing.minimum_deposit_minor`
- `billing.maximum_deposit_minor`
- `notification.low_balance_enabled`
- `notification.balance_depleted_enabled`

## 12. Open Questions
- Launch default values for policy keys above.
- Enterprise override ranges and approval workflow for policy changes.

## 13. Delivery Milestones and Success Criteria
1. Architecture/Contract Baseline Ready
   - Success criteria: ADRs frozen, OpenAPI+AsyncAPI validated, Phase tracker >= Ready for Signoff for Phases 1-4.
2. Core Platform Slice
   - Success criteria: Auth + catalog + allocation read APIs passing contract/integration tests.
3. Provision/Billing/Payments Core
   - Success criteria: end-to-end allocate->bill->release flow stable with idempotency tests.
4. Admin/Storage/Terminal Completion
   - Success criteria: admin and storage APIs + terminal gateway pass security and integration suites.
5. Hardening and Launch Readiness
   - Success criteria: Go/No-Go checklist mandatory items all pass.

## 14. Phase-2 Readiness Constraints (Mandatory in MVP Design)
The following items remain out of MVP feature scope, but MVP architecture/design MUST avoid blocking them.

### 14.1 Managed Schedulers (SLURM/k8s/Ray)
- Allocation model supports pluggable execution backends.
- Allocation API remains scheduler-agnostic.

### 14.2 Enterprise Billing
- Ledger model extensible for invoices, subscriptions, commitments.

### 14.3 Multi-Tenant Hierarchy and Policy
- Core entities tenant-aware (`org_id`, optional `project_id`).

### 14.4 Multi-Region
- Region first-class in placement and resource identity.

### 14.5 No-Rework Acceptance Criteria
- Scheduler backend addition does not require breaking allocation API.
- Enabling org tenancy does not require ledger redesign.
- Second-region introduction does not require identity rewrite.
- Enterprise pricing is additive over billing core.