# App Control Plane v1 (Extensibility Baseline)

## Goal
Enable product teams to add platform applications (for example model serving, inference, schedulers) without changing core allocation APIs per app.

Core principle:
- Core platform exposes primitives.
- App teams integrate through a consistent app control plane contract.

Companion baseline for scheduler apps:
- `doc/architecture/Scheduler_as_Platform_App_v1.md`
- `doc/architecture/App_Runtime_Operating_Modes_v1.md`
- `doc/architecture/Clustered_App_Model_v1.md`
- `doc/architecture/App_Platform_Primitive_Boundary_v1.md`
- `doc/architecture/Slurm_First_Slice_Platform_App_Split_v1.md`
- `doc/architecture/App_Platform_OCI_Registry_Baseline_v1.md`

## Platform Invariants (Non-Negotiable)
These invariants must hold for every app-platform feature, including internal reference apps.

1. Policy/IAM is first-class, not optional.
- Every app action is tenant/project scoped and evaluated through the same role/policy path as core resources.
- No internal-app bypass paths are allowed for authz.
- Privileged app mutations must produce audit logs and canonical error envelopes with `correlation_id`.

2. Artifact and runtime neutrality is mandatory.
- The control plane contract stays runtime-agnostic (adapters such as `k8s|slurm|ray|bare_metal` sit behind one API model).
- Artifact sources are policy-governed allowlists; no hardcoded single-vendor/source coupling in API semantics.
- OCI registry integration and non-OCI artifact source workflows are foundation requirements, although neither is fully implemented in the runtime yet.

3. Lifecycle is event-driven and loosely coupled.
- App instance lifecycle transitions must emit typed domain events (`apps.instance.*`).
- Integrations consume contracts/events, not database internals.
- Async state changes are observable by correlation and trace context across services.

4. Internal reference apps must prove platform generality.
- Any first-party app added to validate the platform must use the same public contracts and operator model as third-party teams.
- If a feature only works for an internal app via special-case code, it is considered a platform defect.

## Scope (v1)
1. App catalog (platform-owned metadata).
2. Project app entitlements (tenant/project scoping + policy overlays).
3. App instances (request/deploy/run/fail/delete lifecycle).
4. Async lifecycle events for operators and observability.

Out of scope for v1:
- App runtime implementation details (k8s/slurm/ray operator internals).
- UI workflows beyond existing shell.
- Final pricing engine and runtime-specific metering implementation.

## Ownership Model
- App instance ownership: project.
- Attribution: `requested_by_user_id`.
- Tenant boundary: inferred from project -> org.
- Service accounts remain project-scoped and are used by app operators.

Important distinction:
1. app instance ownership remains project-scoped
2. runtime control plane may be `project`, `tenant`, or `platform` scoped depending on operating mode

This is required so the platform can start with tenant-dedicated app backends and later introduce platform-managed services without changing ownership semantics.

Current limitation:
- a tenant-scoped runtime control plane is not yet the same thing as a
  tenant-owned shared runtime contract,
- tenant-owned shared mode needs an explicit attached-project model instead of
  overloading one project-owned app instance to mean tenant ownership,
- see:
  - `doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md`
  - `doc/architecture/App_Tenant_Shared_Runtime_API_Direction_v1.md`

## Operating Mode Baseline
See `doc/architecture/App_Runtime_Operating_Modes_v1.md`.

Initial direction:
1. production default is `tenant_dedicated`
2. future shared offerings use `platform_managed`
3. both modes use the same app catalog, entitlement, app instance, IAM, audit, and observability paths

## Billing Attribution Baseline
See `doc/architecture/App_Runtime_Billing_Model_v1.md`.

Baseline rules:
1. project remains the primary app-runtime billing anchor, even when the effective control plane is `tenant` or `platform` scoped
2. app-runtime billing must preserve:
   - `org_id`
   - `project_id`
   - `app_instance_id`
   - `app_slug`
   - `operating_mode`
   - `control_plane_scope`
   - `runtime_backend`
   - `correlation_id`
3. `tenant_dedicated + project` is the clean default for `dev/test/stage/prod` style environment boundaries
4. tenant-scoped shared control planes are supported, but any cross-project cost distribution must be explicit and policy-driven
5. platform-managed shared-service costs must still reconcile back to tenant and project usage records

## API Surface (contract-first)
See `doc/api/openapi.draft.yaml`.

Added surfaces:
1. `GET /api/v1/apps/catalog`
2. `GET /api/v1/apps/catalog/{app_slug}/versions`
3. `POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/publish`
4. `POST /api/v1/admin/apps/catalog/{app_slug}/versions/{version}/deprecate`
5. `GET /api/v1/projects/{project_id}/apps/entitlements`
6. `PUT /api/v1/projects/{project_id}/apps/entitlements/{app_slug}`
7. `GET /api/v1/projects/{project_id}/app-instances`
8. `POST /api/v1/projects/{project_id}/app-instances`
9. `GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}`
10. `DELETE /api/v1/projects/{project_id}/app-instances/{app_instance_id}`
11. `GET /api/v1/apps/registry`
12. `GET /api/v1/projects/{project_id}/app-artifacts`
13. `POST /api/v1/projects/{project_id}/app-artifacts/publish-intents`
14. `POST /api/v1/projects/{project_id}/app-artifacts`
15. `POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/promote`
16. `POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/deprecate`
17. `POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/retire`
18. `GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members`
19. `GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members/{member_id}`
20. `POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations`
21. `GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations/{operation_id}`

Behavioral intent:
- Catalog is read-only for tenant users.
- Entitlements are project-scoped controls (enable/disable + policy overrides).
- Instance create and delete are async (`202 Accepted`); state transitions are tracked via events.
- Member lifecycle requests are async (`202 Accepted`) and remain generic platform operation envelopes; runtime-specific implementation stays in the adapter.
- Artifact upload bytes flow directly to the registry; the API owns publish intent, digest registration, lifecycle, and audit.

## Event Surface
See `doc/api/asyncapi.draft.yaml`.

Added lifecycle events:
1. `apps.entitlement.updated`
2. `apps.instance.requested`
3. `apps.instance.running`
4. `apps.instance.failed`
5. `apps.instance.deleting`
6. `apps.instance.deleted`
7. `apps.artifact.registered`
8. `apps.artifact.promoted`
9. `apps.artifact.deprecated`
10. `apps.artifact.retired`

Envelope remains standard:
- `event_id`
- `event_type`
- `occurred_at`
- `version`
- `correlation_id`
- `payload`
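
A minimal sketch of building an event with the standard envelope fields listed above. The UUID event id, the ISO-8601 UTC timestamp, the string version `"1"`, and the payload keys are assumptions made for illustration; `doc/api/asyncapi.draft.yaml` remains the authority on field formats.

```python
import uuid
from datetime import datetime, timezone

# Envelope field names, in the order listed in this document.
ENVELOPE_FIELDS = (
    "event_id", "event_type", "occurred_at",
    "version", "correlation_id", "payload",
)


def build_event(event_type: str, correlation_id: str, payload: dict,
                version: str = "1") -> dict:
    """Wrap a payload in the standard envelope (formats are assumptions)."""
    if not event_type.startswith("apps."):
        raise ValueError(f"not an app control plane event: {event_type!r}")
    if not correlation_id:
        raise ValueError("correlation_id is required")
    return {
        "event_id": str(uuid.uuid4()),                          # assumed: UUID
        "event_type": event_type,
        "occurred_at": datetime.now(timezone.utc).isoformat(),  # assumed: ISO-8601 UTC
        "version": version,                                     # assumed: string
        "correlation_id": correlation_id,
        "payload": payload,
    }


# Hypothetical payload keys, for illustration only.
event = build_event(
    "apps.instance.requested",
    correlation_id="corr-123",
    payload={"project_id": "p-1", "app_slug": "model-serving"},
)
```

Keeping the envelope builder separate from payload construction lets every `apps.*` event share one validation path for `correlation_id`.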

## Security and Isolation
1. Project context (`X-Project-ID` or project path) is authoritative for all project-owned app operations.
2. Cross-project/cross-tenant operations are denied by default.
3. App operators authenticate via project-scoped service accounts only.
4. No raw command execution surface is exposed via app APIs.

## RBAC Action Matrix (v1 Baseline)

### Project-scoped app actions

| Action | platform_superadmin | platform_ops | tenant_owner | tenant_admin | project_owner | project_admin | project_member | project_viewer | service_account |
|---|---|---|---|---|---|---|---|---|---|
| apps.catalog.read | allow | allow | allow | allow | allow | allow | allow | allow | deny |
| apps.versions.read | allow | allow | allow | allow | allow | allow | allow | allow | deny |
| apps.entitlement.read | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.entitlement.write | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.instance.read | allow | allow | allow | allow | allow | allow | allow | allow | allow (same project only) |
| apps.instance.create | allow | allow | allow | allow | allow | allow | allow | deny | allow (same project only) |
| apps.instance.delete | allow | allow | allow | allow | allow | allow | deny | deny | deny |
| apps.instance.member.read | allow | allow | allow | allow | allow | allow | allow | allow | allow (same project only) |
| apps.instance.member.operate | allow | allow | allow | allow | allow | allow | deny | deny | allow (same project only, explicit allowlist only) |

Rules:
1. `platform_superadmin` and `platform_ops` are break-glass/platform operations roles; they bypass tenant-level membership checks for explicit admin endpoints only.
2. `service_account` permissions are constrained to same-project resources and an explicitly allowlisted endpoint set.
3. `tenant_owner` and `tenant_admin` can manage project entitlements and app instances inside their tenant.
4. Project-scoped role evaluation follows the `project -> tenant -> platform` decision chain from the role lifecycle baseline.
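
The matrix above can be encoded as data and evaluated with a deny-by-default lookup. This is a sketch only: `is_allowed` and its `same_project` / `allowlisted` parameters are hypothetical names, and the sketch does not model tenant membership or the `project -> tenant -> platform` decision chain.

```python
HUMAN_ROLES = (
    "platform_superadmin", "platform_ops", "tenant_owner", "tenant_admin",
    "project_owner", "project_admin", "project_member", "project_viewer",
)
ADMIN_ROLES = HUMAN_ROLES[:6]            # through project_admin
MEMBER_AND_ABOVE = HUMAN_ROLES[:7]       # through project_member

# action -> (human roles that are allowed, service_account rule)
MATRIX = {
    "apps.catalog.read":            (HUMAN_ROLES, "deny"),
    "apps.versions.read":           (HUMAN_ROLES, "deny"),
    "apps.entitlement.read":        (ADMIN_ROLES, "deny"),
    "apps.entitlement.write":       (ADMIN_ROLES, "deny"),
    "apps.instance.read":           (HUMAN_ROLES, "same_project"),
    "apps.instance.create":         (MEMBER_AND_ABOVE, "same_project"),
    "apps.instance.delete":         (ADMIN_ROLES, "deny"),
    "apps.instance.member.read":    (HUMAN_ROLES, "same_project"),
    "apps.instance.member.operate": (ADMIN_ROLES, "same_project_allowlisted"),
}


def is_allowed(action: str, role: str, *, same_project: bool = True,
               allowlisted: bool = False) -> bool:
    """Return True only when the matrix explicitly allows the action."""
    if action not in MATRIX:
        return False                      # unknown actions are denied
    human_roles, sa_rule = MATRIX[action]
    if role == "service_account":
        if sa_rule == "deny" or not same_project:
            return False
        if sa_rule == "same_project_allowlisted":
            return allowlisted
        return True
    return role in human_roles
```

For example, `is_allowed("apps.instance.delete", "project_member")` is `False`, while a service account may operate members only when it is in the same project and explicitly allowlisted.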

## Policy Overlay Direction
Overlay resolution order (future implementation):
1. global defaults
2. tenant overrides
3. project app entitlement overrides

Most-specific scope wins, subject to the narrowing restrictions listed below. Global hard-deny remains non-overridable.

### Overlay schema direction (v1)
`project_app_entitlements.policy_overrides` supports:
1. `allowed_regions`: array of region codes.
2. `allowed_skus`: array of catalog sku codes.
3. `max_instances_per_project`: integer.
4. `max_gpus_per_instance`: integer.
5. `artifact_source_allowlist`: array of host patterns.
6. `allowed_operating_modes`: array of `tenant_dedicated | platform_managed`.
7. `allowed_control_plane_scopes`: array of `project | tenant | platform`.

Restrictions:
1. Project override can only narrow scope versus tenant/global policy.
2. Project override cannot enable an app disabled by tenant/global hard-deny.
3. Conflicts resolve by most-specific then most-restrictive.
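
One way to read the resolution order and restrictions above is as a narrowing merge: list-valued keys intersect and integer limits take the minimum. The sketch below follows that reading. It assumes a key missing from the broader scope means "unrestricted", that tenant overrides narrow global defaults the same way project overrides do, and that host patterns are compared as exact strings; none of these is confirmed by this document.

```python
LIST_KEYS = (
    "allowed_regions", "allowed_skus", "artifact_source_allowlist",
    "allowed_operating_modes", "allowed_control_plane_scopes",
)
INT_KEYS = ("max_instances_per_project", "max_gpus_per_instance")


def narrow(base: dict, override: dict) -> dict:
    """Apply a more specific scope that may only restrict the broader one."""
    result = dict(base)
    for key in LIST_KEYS:
        if key not in override:
            continue
        if key in base:
            allowed = set(base[key])
            # Keep only values the broader scope already permits.
            result[key] = [v for v in override[key] if v in allowed]
        else:
            result[key] = list(override[key])   # assumed: missing = unrestricted
    for key in INT_KEYS:
        if key in override:
            result[key] = min(base[key], override[key]) if key in base else override[key]
    return result


def resolve(global_defaults: dict, tenant: dict, project: dict) -> dict:
    """global defaults -> tenant overrides -> project entitlement overrides."""
    return narrow(narrow(global_defaults, tenant), project)


effective = resolve(
    {"allowed_regions": ["us-east", "eu-west", "ap-south"], "max_gpus_per_instance": 8},
    {"allowed_regions": ["us-east", "eu-west"]},
    {"allowed_regions": ["eu-west", "sa-east"], "max_gpus_per_instance": 16},
)
```

Here the project's attempt to add `sa-east` and to raise the GPU limit to 16 has no effect: the result keeps only `eu-west` and a limit of 8.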

## Observability Requirements
Every app-instance mutation and event should carry:
- `correlation_id`
- `org_id`
- `project_id`
- `app_slug`
- `app_instance_id`

Target triage path:
- API error envelope -> correlation id
- Loki lookup by correlation id
- Tempo trace by trace_id
- App lifecycle event timeline from async stream
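
A small guard can make the required fields listed above hard to omit from structured logs. The sketch below is illustrative: `log_line` and the JSON log shape are hypothetical, not an existing platform helper.

```python
import json

# Fields this document requires on every app-instance mutation and event.
REQUIRED_FIELDS = (
    "correlation_id", "org_id", "project_id", "app_slug", "app_instance_id",
)


def missing_observability_fields(fields: dict) -> list:
    """Return the required field names that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]


def log_line(message: str, **fields) -> str:
    """Render a JSON log line, refusing to emit one without required context."""
    missing = missing_observability_fields(fields)
    if missing:
        raise ValueError(f"missing observability fields: {missing}")
    return json.dumps({"message": message, **fields}, sort_keys=True)


line = log_line(
    "app instance delete accepted",
    correlation_id="corr-123", org_id="o-1", project_id="p-1",
    app_slug="model-serving", app_instance_id="ai-1",
)
```

Because `correlation_id` is always present in the rendered line, a Loki lookup by correlation id can find every mutation log for a request.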

## Follow-ups (next iterations)
1. Runtime-specific metering implementation and usage-record to ledger pipeline.
2. Operator onboarding guide (reference implementation for one app, e.g. model serving).
3. Admin catalog version disable endpoint and retirement workflow.
4. Registry credential delivery through Vault-backed publish/pull secret paths.
5. Generic clustered-app topology and component-role contract for multi-member example apps.
6. Tenant-shared runtime ownership and attachment contract for apps that support
   tenant-owned shared mode.

## DB Schema Proposal (v1 Draft)

SQL companion:
- `doc/architecture/db_schema_app_control_plane_phase1_draft.sql`

### Tables
1. `app_catalog`
   - `id uuid pk`
   - `slug text unique not null`
   - `display_name text not null`
   - `category text not null`
   - `publisher text not null`
   - `status text not null check (status in ('active','deprecated','disabled'))`
   - `created_at timestamptz not null default now()`
   - `updated_at timestamptz not null default now()`

2. `app_versions`
   - `id uuid pk`
   - `app_id uuid not null references app_catalog(id) on delete cascade`
   - `version text not null`
   - `runtime_backend text not null check (runtime_backend in ('k8s','rke2','slurm','ray','bare_metal'))`
   - `manifest jsonb not null`
   - `status text not null check (status in ('active','deprecated','disabled'))`
   - `created_at timestamptz not null default now()`
   - `unique (app_id, version)`

3. `project_app_entitlements`
   - `id uuid pk`
   - `org_id uuid not null references organizations(id) on delete cascade`
   - `project_id uuid not null references projects(id) on delete cascade`
   - `app_id uuid not null references app_catalog(id) on delete cascade`
   - `enabled boolean not null default true`
   - `policy_overrides jsonb not null default '{}'::jsonb`
   - `updated_by_user_id uuid null references users(id)`
   - `created_at timestamptz not null default now()`
   - `updated_at timestamptz not null default now()`
   - `unique (project_id, app_id)`

4. `app_instances`
   - `id uuid pk`
   - `resource_name text not null unique`
   - `org_id uuid not null references organizations(id) on delete cascade`
   - `project_id uuid not null references projects(id) on delete cascade`
   - `app_id uuid not null references app_catalog(id) on delete restrict`
   - `app_version_id uuid not null references app_versions(id) on delete restrict`
   - `display_name text not null`
   - `operating_mode text not null default 'tenant_dedicated' check (operating_mode in ('tenant_dedicated','platform_managed'))`
   - `control_plane_scope text not null default 'project' check (control_plane_scope in ('project','tenant','platform'))`
   - `tenant_boundary_mode text not null default 'tenant_isolated' check (tenant_boundary_mode in ('tenant_isolated','shared_service'))`
   - `status text not null check (status in ('requested','deploying','running','failed','deleting','deleted'))`
   - `requested_by_user_id uuid not null references users(id)`
   - `operator_service_account_id uuid null references service_accounts(id)`
   - `config jsonb not null default '{}'::jsonb`
   - `runtime_state jsonb not null default '{}'::jsonb`
   - `failure_reason text null`
   - `created_at timestamptz not null default now()`
   - `updated_at timestamptz not null default now()`
   - `deleted_at timestamptz null`
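
The `app_instances.status` values above imply a lifecycle state machine. The sketch below shows one possible transition table; the allowed transitions are assumptions, since this document lists the states but not the edges. The status-to-event mapping covers only the `apps.instance.*` events named in the Event Surface section, so `deploying` has no event in this sketch.

```python
from typing import Optional

# Assumed transition graph over the statuses in the check constraint.
TRANSITIONS = {
    "requested": frozenset({"deploying", "failed", "deleting"}),
    "deploying": frozenset({"running", "failed", "deleting"}),
    "running":   frozenset({"failed", "deleting"}),
    "failed":    frozenset({"deleting"}),
    "deleting":  frozenset({"deleted"}),
    "deleted":   frozenset(),              # terminal
}

# Lifecycle events listed in this document, keyed by the status they announce.
EVENT_FOR_STATUS = {
    status: f"apps.instance.{status}"
    for status in ("requested", "running", "failed", "deleting", "deleted")
}


def next_status(current: str, target: str) -> str:
    """Validate a status change and return the new status."""
    if target not in TRANSITIONS.get(current, frozenset()):
        raise ValueError(f"illegal transition {current!r} -> {target!r}")
    return target


def event_for(status: str) -> Optional[str]:
    """Return the lifecycle event type for a status, if one is defined."""
    return EVENT_FOR_STATUS.get(status)
```

Validating transitions in one place lets every runtime adapter report state changes without encoding lifecycle rules of its own.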

### Required indexes
1. `ix_app_instances_project_status_created` on `(project_id, status, created_at desc)`
2. `ix_app_instances_org_status` on `(org_id, status)`
3. `ix_project_app_entitlements_project` on `(project_id)`
4. `ix_app_versions_app_status` on `(app_id, status, created_at desc)`

### Required integrity constraints
1. `app_instances.project_id` must belong to `app_instances.org_id`.
2. `project_app_entitlements.project_id` must belong to `project_app_entitlements.org_id`.
3. `app_instances.operator_service_account_id` (if set) must belong to the same `project_id` and `org_id`.
4. `app_instances.app_version_id` must reference the same `app_id` as `app_instances.app_id`.
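
Constraint 4 can be enforced declaratively with a composite foreign key, provided the parent table exposes a matching unique key. The sketch below demonstrates the idea using Python's built-in `sqlite3` purely so it runs without a database server. The tables are trimmed to the relevant columns, the extra `unique (id, app_id)` on `app_versions` is an assumption not present in the draft, and the SQL companion file may take a different approach (for example triggers).

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create trimmed-down tables with a composite FK for constraint 4."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FKs off by default
    conn.executescript("""
        CREATE TABLE app_catalog (id TEXT PRIMARY KEY);
        CREATE TABLE app_versions (
            id     TEXT PRIMARY KEY,
            app_id TEXT NOT NULL REFERENCES app_catalog(id),
            UNIQUE (id, app_id)                -- assumed: target for composite FK
        );
        CREATE TABLE app_instances (
            id             TEXT PRIMARY KEY,
            app_id         TEXT NOT NULL REFERENCES app_catalog(id),
            app_version_id TEXT NOT NULL,
            FOREIGN KEY (app_version_id, app_id)
                REFERENCES app_versions (id, app_id)
        );
    """)
    return conn


def try_insert_instance(conn, instance_id, app_id, version_id) -> bool:
    """Insert an instance; return False if the database rejects it."""
    try:
        conn.execute(
            "INSERT INTO app_instances (id, app_id, app_version_id) VALUES (?, ?, ?)",
            (instance_id, app_id, version_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = make_db()
conn.execute("INSERT INTO app_catalog (id) VALUES ('app-a'), ('app-b')")
conn.execute("INSERT INTO app_versions (id, app_id) VALUES ('ver-1', 'app-a')")
matching = try_insert_instance(conn, "inst-1", "app-a", "ver-1")
mismatched = try_insert_instance(conn, "inst-2", "app-b", "ver-1")
```

The second insert is rejected because `ver-1` belongs to `app-a`, not `app-b`. The same pattern could cover constraints 1 to 3 by adding `(id, org_id)` style unique keys on the parent tables.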

### Migration/cutover strategy
1. Add tables and indexes without touching existing allocation/storage paths.
2. Launch catalog read APIs first.
3. Gate entitlement and instance mutations behind a feature flag.
4. Introduce one reference operator integration before broad app onboarding.
