# App SDK Design Principles and Composition Contracts v1

Status: draft
Date: 2026-06-01
Related:
- `doc/architecture/Platform_Proxy_OSS_Data_Plane_ADR_v1.md`
- `doc/architecture/Pomerium_Edge_Migration_Next_Steps_v1.md`
- `doc/architecture/Example_App_Developer_Reference_Workflow_v1.md`
- `doc/architecture/Platform_Proxy_Host_Routing_DNS_TLS_v1.md`

## Purpose

This document is the **principles layer** above the existing App SDK specs. It does not redefine any contract already settled elsewhere — it captures the decision rules that should constrain how those contracts evolve, what kinds of apps the platform admits, and how cross-cutting concerns (auth, billing, K8s exposure) generalize across app categories.

Read this **before** changing:
- the launchable manifest schema
- the endpoint type enum
- the scheduler-as-app contract
- the billing attribution shape
- the proxy / route model
- the App SDK developer-facing API

Existing documents that this references (do not duplicate):
- `App_Platform_Primitive_Boundary_v1.md` — the platform/app split rule
- `App_Runtime_Billing_Model_v1.md` — attribution anchors and operating modes
- `App_Runtime_Metering_v1.md`, `App_Runtime_Instance_Lifecycle_v1.md`, `App_Runtime_Operating_Modes_v1.md`
- `App_Control_Plane_v1.md`
- `App_Manifest_Registration_Guide_v1.md`
- `App_Artifact_Trust_and_Promotion_v1.md`
- `App_Tenant_Shared_Attachment_Model_v1.md`
- `Launchable_OCI_Workload_Profile_Contract_v1.md`
- `Scheduler_as_Platform_App_v1.md`
- `Clustered_App_Model_v1.md`

The team operates with a small staff and a composition-first strategy. The principles below exist to keep that strategy leverage-positive as the catalog grows.

---

## Decision Summary

This document is the guardrail for app-runtime, proxy, scheduler, and building
block work.

The important implementation consequences are:

1. app behavior is declared by capability, endpoint type, auth pattern, and
   lifecycle contract, not by app name,
2. Pomerium, Helm, Headlamp, and future OSS components are render targets, not
   product contracts,
3. `managed_ingress` is the default product primitive for exposing app
   endpoints through the platform edge,
4. apps that cannot express auth through the supported two-tier patterns are
   not contract-ready,
5. billing must evolve toward dimensional ledger attribution before apps and
   building blocks become separately billable,
6. every new endpoint type, auth pattern, or building block is a platform
   commitment and should be tracked as a deliberate contract change.

Use this as a review checklist before adding Pomerium route behavior, new
launchable manifest fields, scheduler-app extensions, or app-specific runtime
code.

Every app-related change must declare its class before implementation:

| Change class | Correct owner | Required contract action |
|---|---|---|
| Runtime fix | app runtime/controller/backend owning layer | Fix the runtime behavior and add regression evidence; SDK update only if the public contract changes. |
| Catalog or manifest change | App SDK / manifest contract | Update manifest docs, examples, validators, and app registration path; avoid seed-only or backend-only assumptions. |
| SDK/developer contract change | App SDK / developer platform | Update SDK-visible contract, examples, smoke tests, and portal docs before treating the behavior as supported. |

Runtime/controller bugs should be fixed where they live. App defaults,
developer-facing launch behavior, publish/promotion semantics, and connect
actions should move toward SDK-visible manifests and validation harnesses rather
than accumulating as backend seed/runtime one-offs.

For day-to-day PR review, start with this Decision Summary, §6 Endpoint Type
Catalog, and §7 Anti-Patterns. Read the deeper sections when changing a
contract, adding a building block, or reviewing a new app category.

---

## 1. Core Principles

### 1.1 Compose, don't build
Build only what is **unique to GPUaaS**: the control plane (route lifecycle, scheduler-app boundary, audit context), the trust roots (User CA, Host CA, identity model), the billing ledger, the allocation state machine. Everything else is a candidate for OSS composition. The "save engineering time" argument is secondary; the durable reason is **avoiding ownership of code we don't have to own**.

### 1.2 The platform stays the trust authority
Adopting OSS at the runtime layer (Pomerium, Headlamp, Helm, etc.) does **not** transfer the security root. gpuaas owns:
- the User CA and Host CA
- the identity model (OIDC issuer trust, claim shape)
- the policy authority (`platform_policy_values`)
- the audit ledger
- the allocation lifecycle

OSS components are renderable targets, not sources of truth. If Pomerium ever needs to be swapped for Envoy, Caddy, or a successor, the gpuaas contract should not change.

### 1.3 Contract over code
The platform may not branch on app identity. `if app == "X"` is the failure mode. The platform branches on **declared capability**:
- has `provides_scheduler_runtime`?
- has `endpoint.type == http`?
- declares `requires_persistent_state`?

If an app behavior cannot be expressed via declared capability, **fix the contract**, not the platform code. A contract change is cheap today; once the catalog holds 15 apps, the same change can cost most of a quarter.
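Here is a minimal sketch of capability-based branching. The `Capabilities` and `RuntimePlan` types and their fields are hypothetical and are not the real manifest model. The planning function never receives the app slug, so declared capability is the only thing it can branch on.

```go
package main

// Capabilities is a hypothetical, trimmed view of what a launchable manifest
// declares; field names mirror the §1.3 examples but are illustrative only.
type Capabilities struct {
	ProvidesSchedulerRuntime bool
	EndpointTypes            []string
	RequiresPersistentState  bool
}

// RuntimePlan is the set of platform actions derived from capabilities.
type RuntimePlan struct {
	RegisterScheduler  bool
	MountPersistentPVC bool
	RenderHTTPRoutes   bool
}

// PlanFor branches only on declared capability. The app slug is not an
// input, so `if app == "X"` cannot be written here.
func PlanFor(c Capabilities) RuntimePlan {
	plan := RuntimePlan{
		RegisterScheduler:  c.ProvidesSchedulerRuntime,
		MountPersistentPVC: c.RequiresPersistentState,
	}
	for _, t := range c.EndpointTypes {
		if t == "http" {
			plan.RenderHTTPRoutes = true
		}
	}
	return plan
}
```

Under this shape, adding a new behavior means adding a capability field to the contract. That field addition is exactly the review point this principle asks for.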

### 1.4 Closed enums, open manifests
Some manifest fields are **closed enums** controlled by the platform: endpoint types, scheduler kinds, operating modes, lifecycle phases. Adding a value is a deliberate platform decision with an ADR. Other manifest fields are **open data** — names, descriptions, resource requests, env vars, values for templated charts.

Closed enums are how composition stays bounded. Open data is how apps stay expressive.

### 1.5 Edge concentration of cross-cutting concerns
Identity, audit, rate-limit, TLS termination, CORS, header injection, and trace propagation belong at the edge (Pomerium today). Once an authenticated, identified, audited request reaches a workload, the workload should not re-implement these. Every binary's responsibility narrows to **domain logic + an identity header**.

### 1.6 Lifecycle uniformity
Every runtime — `oci_image`, `oci_compose`, `kubernetes` (via Helm or manifests), `slurm`, future runtimes — declares the same lifecycle hooks with the same contract:
- `provision` — start
- `ready_probe` — readiness signal
- `drain` — graceful shutdown signal, time-bounded
- `teardown` — final teardown

Where runtimes differ in *implementation*, the difference belongs in a **runtime adapter** (Go interface), never in the manifest schema. If a manifest field exists because Helm needs it, the contract has leaked.

### 1.7 Manifest as a product surface
The launchable manifest schema is a **product** for app developers. It has:
- a semver version
- a changelog
- a deprecation policy for fields
- migration tooling when major versions change
- documented examples for every operating mode

Treating the manifest casually is how the SDK loses its leverage promise.

The manifest/SDK path should own or validate these developer-facing contracts:

- app defaults: ports, health paths, route intent, auth mode, and connect
  actions;
- runtime expectations: env vars, mounted credentials, tokens, service
  accounts, storage, and network posture;
- publish and promotion: artifact selection, signing/trust, versioning, and
  promotion state;
- launch: required inputs, optional dependencies, default dependency creation,
  and validation errors;
- connect: `Open app`, `Try endpoint`, `Open cluster`, token/API-key, and
  app-specific connection flows;
- UAT: each SDK example should have launch, connect, and decommission smoke
  coverage;
- failure behavior: app-auth failure, upstream 503, missing token, bad route,
  unavailable artifact, and node-task timeout should render product-owned
  errors rather than raw proxy/runtime pages.

### 1.8 Two-tier app trust model: curated vs open
Not every app gets the same trust. Two tiers:

| Tier        | Examples                                                     | Review                                                   | Capability scope                                             |
| ----------- | ------------------------------------------------------------ | -------------------------------------------------------- | ------------------------------------------------------------ |
| **Curated** | Schedulers (Slurm, K8s, Ray), platform building blocks (managed Ingress, vector DB, MLflow), identity-sensitive integrations | Platform team review, signed artifacts, capability audit | Can hold long-lived state, admit other workloads, request platform-scoped credentials |
| **Open**    | Direct user workloads (Jupyter, vLLM, Open WebUI, ComfyUI), service-providing apps (MCP servers, model APIs) | Manifest validation only                                 | Allocation-scoped, no cross-user identity, no privileged platform calls |

External developers may submit apps to both tiers; the bar is different. Schedulers and platform building blocks are curated because they hold a higher trust posture (admitting third-party workloads, propagating identity). This is not gatekeeping — it is acknowledging that the security model differs.

**Decide tier explicitly at manifest registration. Do not promote apps between tiers via runtime exceptions.**

### 1.9 Visibility spans composition levels
A user submitting a job to a Ray cluster running on K8s on an allocation has a four-level stack. When something fails, the platform's task/evidence pattern must **flatten** the layers for the user, not require them to climb each layer separately. Every composition level emits tasks correlated to the parent. The V3 task-detail surface is the right shape; it needs to walk parent links across levels.
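Here is a sketch of the flattening step. It assumes a hypothetical `Task` record whose `ParentID` points one composition level up. The function walks from the failing leaf to the root and returns the chain root-first. It rejects cycles and dangling parents rather than silently truncating the chain.

```go
package main

import "fmt"

// Task is a hypothetical minimal task record; ParentID links a task to the
// task one composition level up (job -> Ray cluster -> K8s -> allocation).
type Task struct {
	ID       string
	ParentID string
	Level    string
	Status   string
}

// FlattenChain walks parent links from leafID to the root and returns the
// chain root-first. It returns an error on a missing parent or a cycle.
func FlattenChain(tasks map[string]Task, leafID string) ([]Task, error) {
	var chain []Task
	seen := map[string]bool{}
	for id := leafID; id != ""; {
		if seen[id] {
			return nil, fmt.Errorf("cycle at task %q", id)
		}
		seen[id] = true
		t, ok := tasks[id]
		if !ok {
			return nil, fmt.Errorf("missing task %q", id)
		}
		chain = append([]Task{t}, chain...) // prepend so the root ends up first
		id = t.ParentID
	}
	return chain, nil
}
```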

### 1.10 Reversibility
Every OSS component adopted should be **swappable** without changing the public app contract. This means:
- Pomerium-specific data shapes do not appear in the launchable manifest
- Helm-specific lifecycle hooks do not appear in the lifecycle enum
- Headlamp-specific plugin contracts do not appear in V3's K8s read models

If you cannot articulate what would change in our code to replace component X, the contract is too coupled.

---

## 2. App Design Heuristics

When evaluating whether to admit an app to the catalog, ask these questions in order. The answers determine tier, endpoint types, lifecycle shape, and whether the contract needs an addition.

### 2.1 Trust shape
1. **Does this app admit workloads from other users at runtime?** If yes → curated tier (scheduler-class).
2. **Does this app need credentials beyond its allocation scope?** (Cross-allocation reads, cross-tenant access, platform-issued service tokens.) If yes → curated tier with capability review.
3. **Can this app run as a non-privileged Linux user?** If no → curated tier or rejected.
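The three trust questions reduce to a small decision function. This is a hypothetical sketch. It treats a "no" to Q3 as curated-only and leaves the "or rejected" outcome to the curated review itself.

```go
package main

// Tier is the closed two-tier trust enum from §1.8.
type Tier string

const (
	TierCurated Tier = "curated"
	TierOpen    Tier = "open"
)

// TrustAnswers holds the answers to the three §2.1 questions.
type TrustAnswers struct {
	AdmitsOtherUsersWorkloads  bool // Q1
	NeedsBeyondAllocationCreds bool // Q2
	RunsAsNonPrivilegedUser    bool // Q3
}

// DecideTier applies §2.1 in order. Any trigger lands in the curated tier,
// whose review (signed artifacts, capability audit) may still reject the app.
func DecideTier(a TrustAnswers) Tier {
	switch {
	case a.AdmitsOtherUsersWorkloads: // Q1: scheduler-class
		return TierCurated
	case a.NeedsBeyondAllocationCreds: // Q2: curated with capability review
		return TierCurated
	case !a.RunsAsNonPrivilegedUser: // Q3: privileged apps are curated-only
		return TierCurated
	default:
		return TierOpen
	}
}
```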

### 2.2 Auth shape
4. **Can this app authenticate users via OIDC natively?** If yes → standard `http` endpoint, Pomerium-fronted, no credential bridging needed.
5. **Does this app authenticate via a non-OIDC native protocol** (postgres-auth, mongo SCRAM, Redis AUTH, mTLS, API key)? → two-tier auth pattern (see §3).
6. **Does this app embed credentials in its config that the user cannot rotate?** If yes → per-user instance, not shared.

### 2.3 Endpoint shape
7. **Does the app expose only HTTP/WS?** → `endpoint.type: http`.
8. **Does the app expose raw TCP (gRPC, custom protocol, DB protocol)?** → `endpoint.type: tcp`, identity-aware tunnel.
9. **Is the app a scheduler that accepts job submissions?** → `endpoint.type: job_submission`, sub-protocol declared.
10. **Does the app need access to a K8s API on the allocation?** → `endpoint.type: kubernetes`, kubeconfig minted on demand.
11. **Is the app an MCP tool server?** → `endpoint.type: mcp`, identity propagated to the MCP protocol.

If the answer is "none of the above," the contract has a gap. **Propose a new endpoint type via ADR**, do not add a per-app shim.

### 2.4 Lifecycle shape
12. **Does the app outlive any single workload submitted to it?** (Scheduler/cluster pattern.) → declares `lifecycle.long_lived: true`, must implement explicit `drain` with bounded time.
13. **Does the app hold mutable state that survives restart?** → declares `requires_persistent_state`, mounts a platform-managed PVC.
14. **Does the app support in-place upgrade?** (Helm upgrade, rolling restart.) → declares `supports_upgrade: true` with the upgrade contract version.
15. **Can the app be released cleanly within 60s?** If no → declares `drain_timeout_seconds`, platform allocates accordingly.
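A hypothetical validator for these lifecycle answers might look like this. `LifecycleDecl` and its fields are illustrative. The 60-second default comes from Q15.

```go
package main

import (
	"errors"
	"fmt"
)

// DefaultDrainSeconds is the 60s clean-release window from Q15.
const DefaultDrainSeconds = 60

// LifecycleDecl is a hypothetical slice of the manifest lifecycle block.
type LifecycleDecl struct {
	LongLived           bool // lifecycle.long_lived (Q12)
	ImplementsDrain     bool // explicit drain hook present
	DrainTimeoutSeconds int  // drain_timeout_seconds (Q15); 0 = not declared
}

// Validate enforces Q12: long-lived apps must implement an explicit drain.
func (d LifecycleDecl) Validate() error {
	if d.DrainTimeoutSeconds < 0 {
		return fmt.Errorf("drain_timeout_seconds must not be negative, got %d", d.DrainTimeoutSeconds)
	}
	if d.LongLived && !d.ImplementsDrain {
		return errors.New("lifecycle.long_lived requires an explicit drain hook")
	}
	return nil
}

// EffectiveDrainSeconds is the bounded drain budget the platform allocates:
// the declared value when present, otherwise the 60s default.
func (d LifecycleDecl) EffectiveDrainSeconds() int {
	if d.DrainTimeoutSeconds > 0 {
		return d.DrainTimeoutSeconds
	}
	return DefaultDrainSeconds
}
```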

### 2.5 Cost shape
16. **Does the app consume measurable resources beyond the allocation's GPU-hour bundle?** (External LB, vector DB, managed Ingress, license token, premium tier.) → declares billable building-block dependencies; metering follows the dimensional ledger (see §4).
17. **Can multiple users share one instance within a project?** → declares `sharing_model: project_shared`; per-user attribution is mandatory.
18. **Does the app auto-scale workers/replicas?** → declares the scaling envelope; platform meters actual provisioned capacity.

### 2.6 Visibility shape
19. **Does the app produce sub-tasks the user should see?** (Job submissions, pipeline steps, training runs.) → declares task emission via the standard task contract; tasks correlate to the parent allocation/app-instance task.
20. **Does the app produce failure modes that need root-cause across layers?** → declares evidence pivot keys (correlation_id propagation rules).

**Apps that cannot answer all 20 questions cleanly are not yet contract-ready.** Either the contract needs an addition or the app is not a good fit. Do not admit by exception.

---

## 3. Two-Tier Authentication Model

This is the section that addresses the **"some apps cannot do OIDC"** problem (databases, Redis, MongoDB, vector DBs, gRPC services, anything with a native auth protocol).

### 3.1 The model
Every authenticated endpoint has two tiers:

```
┌─────────────────────────────────────────────────────────────┐
│  TIER 1 — Edge (OIDC, platform-owned)                       │
│  Pomerium authenticates the user via Keycloak OIDC          │
│  Validates session, scope, project membership               │
│  Decides: "may this user reach this endpoint?"              │
│  Emits audit event with user identity + correlation_id      │
└─────────────────┬───────────────────────────────────────────┘
                  │
                  │  Tier 1 → Tier 2 bridge
                  │  Platform-managed credential injection
                  │
┌─────────────────▼───────────────────────────────────────────┐
│  TIER 2 — App native (whatever the app speaks)              │
│  Postgres auth / Redis AUTH / mTLS / API key / token        │
│  App receives a credential that proves user identity        │
│  (or a credential that *represents* the user, signed by     │
│  platform), validates per its native protocol               │
└─────────────────────────────────────────────────────────────┘
```

**Key property:** the user never possesses, manages, or sees the tier-2 credential. The platform mints, rotates, and injects it. If a user can read or pass their own tier-2 credential, the model has broken.

### 3.2 Bridging patterns

The pattern used depends on the app's native auth capabilities. There are five canonical patterns (A through E), listed roughly in order of preference, followed by one anti-pattern (F):

#### Pattern A — OIDC-native (no bridge needed)
App speaks OIDC directly. Examples: Jupyter (with `jupyter-server-proxy` OIDC), Open WebUI, Grafana, Headlamp.
- Pomerium forwards the user's OIDC token to the app via `Authorization: Bearer ...` or `X-Forwarded-Email`-style headers
- App validates the token against the same OIDC issuer (Keycloak)
- No credential minting; the OIDC token *is* the tier-2 credential
- **Preferred whenever available**

#### Pattern B — Header-injected identity (claim-based)
App trusts a verified header set by an upstream proxy.
- Pomerium signs a short-lived JWT with user claims
- Injects via `X-Pomerium-Jwt-Assertion` or similar
- App validates the JWT signature (Pomerium's public key) and reads claims
- Examples: Jenkins, Guacamole, internal admin tools
- **Use when app supports trusted-header auth**

#### Pattern C — Per-connection minted credential (database protocol)
App speaks a credentialed protocol (Postgres, MongoDB, Redis, MySQL).
- Pomerium establishes the TCP tunnel after OIDC check
- Platform credential broker mints a per-user, short-lived database credential (TTL ≤ 60min)
- Credential is created in the app (e.g., `CREATE USER alice_20240514_xyz`) and granted scoped permissions
- Pomerium injects the credential into the connection handshake
- Credential auto-expires; cleanup job removes stale users
- **Use for stateful protocols where the app cannot do OIDC**

Implementation notes:
- Credential broker is a curated-tier service (platform-owned)
- Audit row at credential mint, at connection bind, at user expiry
- Failure mode: broker unreachable → connection refused with a clear
  `service_unavailable` response at the edge until a broker-specific catalog
  code is added with the first implementation
- Never reuse credentials across users; never share long-lived service-account credentials for tier 2

#### Pattern D — mTLS with platform-issued client cert
App speaks mTLS (gRPC services, internal-only services).
- Platform mints a short-lived client cert signed by User CA with subject = user identity
- Pomerium establishes the TCP tunnel, presents the cert to the upstream
- App validates against the User CA pubkey, reads identity from cert subject
- **Use for mTLS-native apps; this is the cleanest pattern when available**

#### Pattern E — Per-user instance (last resort)
App cannot do A–D. Either it has hardcoded credentials, no auth at all, or auth that cannot represent identity.
- Each user gets their own instance of the app, with credentials known only to the platform
- Pomerium routes to the per-user instance based on session identity
- No credential sharing, no cross-user access
- **Higher cost, use only when no other pattern works**

#### Pattern F (anti-pattern): user-managed tier-2 credentials
The user is shown the tier-2 credential and uses it themselves. Examples: "here is your postgres password, use it from your laptop."
- **Do not allow this for platform-mediated apps.** It defeats audit, identity propagation, and revocation.
- Users wanting raw protocol access should go through the platform's OIDC-bound tunnel (Pattern C), not be handed credentials.

### 3.3 What the manifest declares

Every endpoint declares its auth pattern explicitly. The pattern is a closed enum, not a string:

```yaml
endpoints:
  - name: jupyter
    type: http
    auth_pattern: oidc_native           # Pattern A
    port: 8888

  - name: vllm-api
    type: http
    auth_pattern: header_injected_jwt    # Pattern B
    port: 8000
    header_contract: pomerium_assertion_v1

  - name: vector-db
    type: tcp
    protocol: postgres
    auth_pattern: per_connection_credential  # Pattern C
    credential_broker: gpuaas_pg_broker
    credential_ttl_seconds: 3600
    port: 5432

  - name: triton-grpc
    type: tcp
    protocol: grpc
    auth_pattern: mtls_user_cert         # Pattern D
    port: 8001

  - name: legacy-app
    type: http
    auth_pattern: per_user_instance      # Pattern E
    isolation: per_user_instance
    port: 8080
```
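A minimal validator for these endpoint declarations might look like the following. The closed enum values come from the example above. The per-pattern rules are assumptions inferred from that example and from Pattern C's 60-minute bound:

- `header_contract` is required for Pattern B.
- `type: tcp` and a `credential_broker` are required for Pattern C.
- The Pattern C TTL is at most 3600 seconds.

```go
package main

import "fmt"

// AuthPattern is the closed enum from §3.3; any other value is rejected.
type AuthPattern string

const (
	OIDCNative              AuthPattern = "oidc_native"
	HeaderInjectedJWT       AuthPattern = "header_injected_jwt"
	PerConnectionCredential AuthPattern = "per_connection_credential"
	MTLSUserCert            AuthPattern = "mtls_user_cert"
	PerUserInstance         AuthPattern = "per_user_instance"
)

// Endpoint mirrors a subset of the YAML keys in the example manifest.
type Endpoint struct {
	Name              string
	Type              string // "http" | "tcp" | ...
	AuthPattern       AuthPattern
	HeaderContract    string
	CredentialBroker  string
	CredentialTTLSecs int // 0 = broker default
}

// ValidateEndpoint enforces the closed enum plus assumed per-pattern fields.
func ValidateEndpoint(e Endpoint) error {
	switch e.AuthPattern {
	case OIDCNative, MTLSUserCert, PerUserInstance:
		return nil
	case HeaderInjectedJWT:
		if e.HeaderContract == "" {
			return fmt.Errorf("%s: header_injected_jwt requires header_contract", e.Name)
		}
	case PerConnectionCredential:
		if e.Type != "tcp" || e.CredentialBroker == "" {
			return fmt.Errorf("%s: per_connection_credential requires type tcp and credential_broker", e.Name)
		}
		if e.CredentialTTLSecs < 0 || e.CredentialTTLSecs > 3600 {
			return fmt.Errorf("%s: credential_ttl_seconds must be within [0, 3600]", e.Name)
		}
	default:
		return fmt.Errorf("%s: unknown auth_pattern %q", e.Name, e.AuthPattern)
	}
	return nil
}
```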

### 3.4 Audit guarantees
For every tier-2 connection, the platform must emit at minimum:
- `tier1.auth_decision` — user identified, scope checked, allowed/denied
- `tier2.credential_bind` — credential issued/looked up, expiry
- `tier2.connection_open` — tier-2 protocol handshake succeeded
- `tier2.connection_close` — duration, bytes, close reason

All four rows share `correlation_id`. This is the minimum bar for incident investigation across the bridge.
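Below is a hypothetical completeness check of the kind a UAT harness could run. `AuditRow` and `CheckBridgeAudit` are illustrative names. A row logged under a different `correlation_id` counts as missing.

```go
package main

import "fmt"

// AuditRow is a hypothetical minimal audit record.
type AuditRow struct {
	Event         string
	CorrelationID string
}

var requiredTier2Events = []string{
	"tier1.auth_decision",
	"tier2.credential_bind",
	"tier2.connection_open",
	"tier2.connection_close",
}

// CheckBridgeAudit returns an error naming any required event that has no
// row under correlationID; rows under other IDs are ignored.
func CheckBridgeAudit(rows []AuditRow, correlationID string) error {
	seen := map[string]bool{}
	for _, r := range rows {
		if r.CorrelationID == correlationID {
			seen[r.Event] = true
		}
	}
	var missing []string
	for _, ev := range requiredTier2Events {
		if !seen[ev] {
			missing = append(missing, ev)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("correlation %s missing audit events %v", correlationID, missing)
	}
	return nil
}
```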

### 3.5 Managed credential broker contract

Pattern C requires a platform-owned `managed_credential_broker` building block.
Do not implement database-class app onboarding with ad-hoc per-app secret
scripts. The broker contract is the durable boundary.

This is a control-plane contract, not a user-facing API in the first slice.
Callers are platform components only:

- the proxy/runtime bridge that needs bind material for a connection,
- app-runtime reconciliation that revokes credentials when an endpoint changes,
- IAM/project membership flows that revoke credentials when access changes,
- a scheduled sweeper that removes expired native users/secrets.

The broker owns four operations:

| Operation | Purpose | Required inputs | Output |
|---|---|---|---|
| `mint` | Create or refresh a short-lived native credential for one user and one endpoint | `org_id`, `project_id`, `app_instance_id`, `endpoint_name`, `user_id`, `subject_claims`, `requested_ttl_seconds`, `scope`, `correlation_id` | `credential_ref`, `expires_at`, `native_subject`, protocol-specific bind material |
| `lookup` | Resolve bind material during a connection attempt | `credential_ref`, `connection_id`, `correlation_id` | protocol-specific username/password/cert/token material, expiry, `native_subject` |
| `revoke` | Invalidate active credentials after user or project membership removal, endpoint route removal, app stop or decommission, or admin action | `credential_ref` or owner tuple, `reason`, `actor`, `correlation_id` | revocation status, revoked count, best-effort cleanup errors |
| `sweep` | Remove stale native users/secrets from the app | owner tuple, max age, dry-run flag, `correlation_id` | scanned count, deleted count, retained count, errors |

The broker must enforce:

- credentials are per user, per app instance, per endpoint,
- TTL is capped at 60 minutes by default; an app contract may set a shorter
  (stricter) TTL but not a longer one,
- credentials are never returned to browser clients or stored in app manifests,
- native usernames or certificate subjects are deterministic enough for audit
  but not reusable across app instances,
- every mutation writes an audit row and emits route/app evidence,
- failure is fail-closed.

Credential references are opaque. The only stable identity fields are the owner
tuple and `native_subject`; callers must not infer protocol, grant shape, or
secret storage layout from `credential_ref`.

Protocol adapters sit behind this contract. A Postgres adapter may use
`CREATE ROLE` plus scoped grants, Redis may mint ACL users, and mTLS-native
services may issue short-lived client certs. The app manifest only declares
`auth_pattern: per_connection_credential` and broker capability requirements;
it does not declare SQL, Redis ACL syntax, or certificate plumbing.

#### 3.5.1 State model

The implementation should model credentials as durable state even when native
bind material is short-lived:

| Field | Requirement |
|---|---|
| `credential_ref` | Opaque platform ID, stable for one minted credential lifecycle |
| `owner` | `org_id`, `project_id`, `app_instance_id`, `endpoint_name`, `user_id` |
| `native_subject` | Sanitized subject visible in audit and native system logs |
| `scope` | Protocol-neutral grants, for example read-only database/schema/table set |
| `status` | `active`, `expired`, `revoked`, `cleanup_failed` |
| `expires_at` | Required; defaults to `now + min(requested_ttl, 60m)` |
| `last_bound_at` | Updated on successful `lookup`/connection bind |
| `revoked_at` / `revoked_reason` | Required for revoked credentials |

Raw bind material may be returned only to the internal connection bridge and
must be held for the minimum protocol-required lifetime. Browser clients,
manifests, read models, task evidence, logs, and audit rows must never contain
passwords, private keys, bearer tokens, SCRAM material, or wrapped secrets.

#### 3.5.2 Audit and evidence

The broker writes audit rows for these actions:

| Action | Result examples | Notes |
|---|---|---|
| `app.credential.mint` | `success`, `denied`, `failed` | Includes owner tuple, `credential_ref`, `native_subject`, expiry, no secret material |
| `app.credential.bind` | `success`, `expired`, `revoked`, `failed` | Includes `connection_id`, route ID if present, and correlation ID |
| `app.credential.revoke` | `success`, `not_found`, `partial_failure` | Includes actor, reason, and revoked count |
| `app.credential.sweep` | `success`, `partial_failure`, `failed` | Includes scanned/deleted/retained counts and adapter name |

The same facts should appear in app/runtime evidence so operators can debug
route failures without direct DB inspection. Evidence payloads must be sanitized
with the same credential blocklist as logs.
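One way to share the blocklist between logs and evidence is a single key-based sanitizer. The key fragments below are a hypothetical list derived from the state-model rules. Matching on key names errs toward over-redaction; for example, a key like `token_count` would also be masked. Over-redaction is the safer failure mode.

```go
package main

import "strings"

// credentialKeyFragments is a hypothetical blocklist; a key containing any
// fragment (case-insensitive) is redacted.
var credentialKeyFragments = []string{"password", "private_key", "token", "scram", "secret"}

const redacted = "[REDACTED]"

// SanitizeEvidence returns a redacted copy of payload, recursing into nested
// maps and slices. The input is not modified.
func SanitizeEvidence(payload map[string]any) map[string]any {
	out := make(map[string]any, len(payload))
	for k, v := range payload {
		if isCredentialKey(k) {
			out[k] = redacted
			continue
		}
		out[k] = sanitizeValue(v)
	}
	return out
}

func sanitizeValue(v any) any {
	switch t := v.(type) {
	case map[string]any:
		return SanitizeEvidence(t)
	case []any:
		cp := make([]any, len(t))
		for i, e := range t {
			cp[i] = sanitizeValue(e)
		}
		return cp
	default:
		return v
	}
}

func isCredentialKey(k string) bool {
	lk := strings.ToLower(k)
	for _, f := range credentialKeyFragments {
		if strings.Contains(lk, f) {
			return true
		}
	}
	return false
}
```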

#### 3.5.3 Cleanup semantics

Cleanup is two-layered:

1. `revoke` is synchronous for platform state and best-effort for native system
   cleanup. If native cleanup fails, the platform credential becomes
   `cleanup_failed`, connection binds remain denied, and the sweeper retries.
2. `sweep` is idempotent. It removes expired/revoked native subjects, repairs
   drift where native users exist without active platform credentials, and
   emits evidence for every partial failure.

App stop, endpoint route removal, user/project membership removal, app
decommission, and admin disable all call `revoke` on the owner tuple before the
runtime is considered fully drained.
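The decision half of `sweep` can be sketched as a pure function, which makes idempotency easy to test. Given the native subjects and their platform status, it keeps only `active` subjects. It deletes everything else, including drift with no platform record. `SweepPlan` is hypothetical. Performing the deletions, or skipping them in dry-run mode, is left to the caller.

```go
package main

import "sort"

// SweepPlan keeps subjects whose platform status is "active" and schedules
// all others for deletion: expired, revoked, cleanup_failed, and subjects
// unknown to the platform (drift). The deletion list is sorted for stable
// evidence output; retained counts the kept subjects.
func SweepPlan(nativeSubjects []string, status map[string]string) (toDelete []string, retained int) {
	for _, s := range nativeSubjects {
		if status[s] == "active" {
			retained++
			continue
		}
		toDelete = append(toDelete, s)
	}
	sort.Strings(toDelete)
	return toDelete, retained
}
```

After the caller deletes the planned subjects, running the plan again on the remaining subjects yields no further deletions.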

#### 3.5.4 Error catalog decision

Do **not** add `credential_broker_unavailable` to the runtime code path until
the first broker implementation lands. For the contract stage, unavailable
broker failures map to the existing `service_unavailable` code at the edge and
`upstream_error` when a native protocol adapter fails after the broker accepted
the request. Add a broker-specific catalog code only with the first API or
operator surface that needs clients to distinguish broker outage from other
service outages.

---

## 4. Billing Evolution

### 4.1 Current shape
Billing today centers on **allocation-as-the-billable-resource**: GPU-hours accrue per allocation, ledger entries key on `(user_id, allocation_id, ...)`. This is correct for the current product where allocations are the only thing users buy.

`App_Runtime_Billing_Model_v1.md` already extends attribution to include `app_instance_id`, `app_slug`, `operating_mode`, `control_plane_scope`, `runtime_backend`, `correlation_id`. Those fields are sufficient for the next phase. **This document does not redefine that.**

### 4.2 Why it needs to evolve further
The current allocation-centric model breaks at four known future inflection points:

1. **Apps with intrinsic value.** Premium scheduler apps, licensed commercial apps, platform-curated building blocks (managed Ingress, vector DB, MLflow). The "free with allocation" assumption breaks.
2. **Platform building blocks consumed by user apps.** Managed Ingress requests, managed storage IOPS, managed vector DB queries — these are consumed by apps inside an allocation but are not part of the allocation's GPU-hour budget.
3. **Cross-allocation resources.** A shared MLflow tracking server, a project-wide vector DB, an org-level model registry. These exist outside any single allocation.
4. **Per-submitter attribution within shared schedulers.** A shared Ray cluster bills the project, but the audit and cost breakdown needs per-submitter detail.

### 4.3 Dimensional ledger principles

The ledger entry shape must be extensible enough to express all four futures without rewriting historical rows. Treat ledger entries as **dimensional events**, not allocation-pegged units:

```sql
-- Conceptual shape; do not migrate live rows
CREATE TABLE ledger_entries (
  entry_id          uuid primary key,
  org_id            text not null,
  project_id        text not null,
  user_id           text,                 -- nullable when entry is not user-attributable
  resource_type     text not null,         -- closed enum: 'allocation' | 'app_instance' | 'building_block' | 'license' | 'credit' | 'refund' | ...
  resource_id       text not null,         -- the id of the resource of resource_type
  parent_resource_type  text,              -- optional parent (e.g. allocation that hosts the app)
  parent_resource_id    text,
  submitter_user_id text,                  -- for scheduler job attribution within a shared instance
  unit              text not null,         -- 'gpu_hour' | 'request' | 'gb_month' | 'token' | 'connection_hour' | ...
  quantity_numeric  numeric not null,
  amount_minor      bigint not null,       -- still in minor units, signed
  occurred_at       timestamptz not null,
  correlation_id    text not null,
  metadata_jsonb    jsonb                  -- runtime-specific signals
  -- further columns elided
);
```

**The immutable ledger rule does not change.** Corrections are new entries. No `UPDATE` or `DELETE`.

Balance is still **computed from the ledger**, never stored. New `resource_type` values are additive — they don't require historical re-attribution.
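Here is a minimal sketch of computing totals from the ledger. It assumes a trimmed, hypothetical `LedgerEntry` type and a sign convention in which charges are positive and refunds or credits are negative. Grouping by `resource_type` is one way read models can generalize before new writers appear (§4.4).

```go
package main

// LedgerEntry is a trimmed, hypothetical view of the dimensional entry.
type LedgerEntry struct {
	ProjectID    string
	ResourceType string
	AmountMinor  int64 // signed; corrections and refunds are new entries
}

// Balance is recomputed from entries on every read; nothing is stored or
// updated in place, and every resource_type is included without backfill.
func Balance(entries []LedgerEntry, projectID string) int64 {
	var total int64
	for _, e := range entries {
		if e.ProjectID == projectID {
			total += e.AmountMinor
		}
	}
	return total
}

// ByResourceType groups a project's totals so dashboards do not assume
// allocation-only rows.
func ByResourceType(entries []LedgerEntry, projectID string) map[string]int64 {
	out := map[string]int64{}
	for _, e := range entries {
		if e.ProjectID == projectID {
			out[e.ResourceType] += e.AmountMinor
		}
	}
	return out
}
```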

### 4.4 Migration discipline
- **Today's entries stay valid.** Existing rows have `resource_type=allocation`, `unit=gpu_hour`. No backfill required.
- **New entries adopt the wider shape.** As app-instance, building-block, and submitter-attributed entries come online, they fill the new fields. Old code paths continue to write allocation entries.
- **Reads generalize before writes.** Update read models and balance queries to handle the wider schema before any new writer emits non-allocation entries. Otherwise dashboards silently undercount.
- **Refund/credit semantics extend.** A refund today is `resource_type=allocation`. A future refund may target `resource_type=app_instance` or `resource_type=building_block`. The refund window policy (`allocation.refund_window_days`) generalizes to a per-resource-type policy.

### 4.5 What this means for app developers
External app developers do not write ledger entries. The platform meters apps based on signals declared in the manifest:

```yaml
metering:
  unit: gpu_hour                    # or 'request' | 'connection_hour' | etc.
  signal_source: node_agent_runtime  # or 'pomerium_request_count' | 'k8s_pod_uptime' | etc.
  rate_per_unit_minor: 1500          # for curated/licensed apps; absent for free-with-allocation apps
  billable_to: project               # or 'submitter' | 'allocation'
```

The platform's metering subsystem reads these signals and writes ledger entries. App code never touches billing.
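Here is a sketch of how the metering subsystem might turn one signal into a ledger entry, using the manifest block above. `EntryFor` and its types are hypothetical. Quantities are kept as exact decimals with `math/big`. Truncating fractional minor units toward zero is only a placeholder for the real rounding policy.

```go
package main

import (
	"errors"
	"math/big"
)

// MeteringDecl mirrors the manifest `metering` block.
type MeteringDecl struct {
	Unit             string
	RatePerUnitMinor int64  // 0 when absent (free with allocation)
	BillableTo       string // "project" | "submitter" | "allocation"
}

// Attribution holds the IDs known when a signal arrives.
type Attribution struct {
	ProjectID, SubmitterUserID, AllocationID string
}

// Entry is a trimmed, platform-written ledger row.
type Entry struct {
	Unit        string
	Quantity    string
	AmountMinor int64
	BilledTo    string
}

// EntryFor converts a decimal quantity such as "2.5" into an entry.
func EntryFor(m MeteringDecl, a Attribution, quantity string) (Entry, error) {
	q, ok := new(big.Rat).SetString(quantity)
	if !ok || q.Sign() < 0 {
		return Entry{}, errors.New("invalid quantity")
	}
	amount := new(big.Rat).Mul(q, new(big.Rat).SetInt64(m.RatePerUnitMinor))
	minor := new(big.Int).Quo(amount.Num(), amount.Denom()) // truncates toward zero

	var target string
	switch m.BillableTo {
	case "project":
		target = a.ProjectID
	case "submitter":
		target = a.SubmitterUserID
	case "allocation":
		target = a.AllocationID
	default:
		return Entry{}, errors.New("unknown billable_to")
	}
	if target == "" {
		return Entry{}, errors.New("billable_to target missing from attribution")
	}
	return Entry{Unit: m.Unit, Quantity: q.FloatString(6), AmountMinor: minor.Int64(), BilledTo: target}, nil
}
```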

### 4.6 Cross-allocation route identity
Decision: proxy route identity is project/app/endpoint scoped, with allocation
or runtime target bindings attached below that identity.

The durable route key is:

```text
(org_id, project_id, app_instance_id, endpoint_name)
```

`allocation_id` is optional target metadata, not part of the primary route
identity. For the current single-allocation launchable app shape, each route
usually has exactly one allocation target. For scheduler apps, project-wide
building blocks, multi-node groups, and future shared runtimes, one route
identity may fan out to zero, one, or many target bindings.

This matches the current implementation direction in
`packages/services/inventory/proxy_runtime.go`: `ProxyRouteOwnerType` is
closed to `app_instance` or `platform_service`, not `allocation`.

Required consequences:

1. `app_proxy_routes` and future route-intent read models store `org_id`,
   `project_id`, `app_instance_id`, and `endpoint_name` as the stable intent
   key.
2. `allocation_id` remains available on target bindings for allocation-local
   routing, billing/audit joins, health checks, and rollback.
3. A route without an active allocation target may still be valid when the
   endpoint belongs to a project-scoped building block or external managed
   runtime.
4. Pomerium hostnames and renderer object names are derived from stable route
   identity, not from `allocation_id`.
5. Browser-session authorization may still require allocation ownership when
   the selected target binding is allocation-local.
6. Capacity, billing, and lifecycle evidence preserve allocation-level detail
   when present, but the route schema must not assume every app endpoint is
   allocation-owned.
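Consequence 4 can be illustrated with a deterministic name derived only from the route key. This is a minimal sketch assuming a hash-based scheme: the `route-` prefix, the 16-hex-character truncation, the base domain, and the `RendererName`/`Hostname` helpers are hypothetical choices, not the platform's naming convention.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// RouteKey (hypothetical) is the stable identity; allocation_id is deliberately absent.
type RouteKey struct {
	OrgID, ProjectID, AppInstanceID, EndpointName string
}

// RendererName derives a deterministic object name from the stable key.
func RendererName(k RouteKey) string {
	h := sha256.New()
	for _, part := range []string{k.OrgID, k.ProjectID, k.AppInstanceID, k.EndpointName} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	}
	// 16 hex chars (64 bits) keeps the result a short, valid DNS label.
	return "route-" + hex.EncodeToString(h.Sum(nil))[:16]
}

// Hostname places the derived label under a (hypothetical) base domain.
func Hostname(k RouteKey, baseDomain string) string {
	return RendererName(k) + "." + baseDomain
}
```

Because no allocation data feeds the hash, a rollback or allocation swap leaves the hostname and renderer object name unchanged. Only the target bindings move.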

This decision is locked before the Pomerium Phase 2/3 route schema hardens. It
keeps current allocation-backed JupyterLab/vLLM routes simple while allowing the
same managed-ingress contract to serve shared schedulers and building blocks
later.

---

## 5. K8s Service Exposure and Platform Building Blocks

This section addresses the unresolved question: when a user has K8s on their allocation and a workload exposes a `Service`, **how does it become reachable, and who provides the LB/Ingress/Gateway?**

### 5.1 The two paths and why we pick one

**Path A — Users figure it out.** Platform provides a vanilla K8s cluster. User installs their own ingress controller, configures LoadBalancer Services, manages DNS, etc. Cheaper for the platform, painful for users. Same pattern as bare K8s on a bare cloud VM.

**Path B — Platform provides exposure as a building block.** Platform offers a managed Ingress/Gateway primitive. User annotates their Service; platform reconciler picks it up, creates the Pomerium route, returns the public URL. Same pattern as cloud-provider ALB or GKE Ingress. **This is the recommended path.**

Reasons to pick Path B:
- It is consistent with the rest of the platform's UX (typed endpoints, OIDC at edge, audit).
- It removes a class of "how do I expose this?" support load.
- It turns ingress capacity into a billable building block.
- It does not preclude users from going lower-level when they want to; advanced users can always install their own ingress.

### 5.2 The building block model

A **building block** is a platform-provided service that user apps consume. It is not the same as a curated user-facing app — building blocks have:
- A platform-managed lifecycle (provisioned per-region or per-cluster, not per-allocation)
- An SLO (uptime, latency, throughput)
- A unit price (request, GB-hour, connection-hour, etc.)
- A quota policy
- A deprecation policy if the underlying implementation is swapped
- A dependency that consuming apps declare as a capability in their manifests

User-facing manifest contract:

```yaml
dependencies:
  - building_block: managed_ingress
    version: ">=1.0"
    expose:
      - service: vllm-api
        type: http
        auth_pattern: header_injected_jwt
```
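Resolving the `version: ">=1.0"` constraint is a small, mechanical step. This is a minimal sketch that assumes building-block versions are plain `MAJOR.MINOR` pairs and that only the `>=` operator appears. `Version`, `ParseVersion`, and `Satisfies` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version (hypothetical) is a MAJOR.MINOR pair, matching the ">=1.0" form above.
type Version struct{ Major, Minor int }

// ParseVersion parses "MAJOR.MINOR" with non-negative integer parts.
func ParseVersion(s string) (Version, error) {
	parts := strings.Split(s, ".")
	if len(parts) != 2 {
		return Version{}, fmt.Errorf("version %q: want MAJOR.MINOR", s)
	}
	major, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	if err1 != nil || err2 != nil || major < 0 || minor < 0 {
		return Version{}, fmt.Errorf("version %q: parts must be non-negative integers", s)
	}
	return Version{Major: major, Minor: minor}, nil
}

// Less orders versions by major, then minor.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// Satisfies reports whether an offered building-block version meets a
// ">=MAJOR.MINOR" constraint. Other operators are out of scope for this sketch.
func Satisfies(constraint string, offered Version) (bool, error) {
	c := strings.TrimSpace(constraint)
	if !strings.HasPrefix(c, ">=") {
		return false, fmt.Errorf("constraint %q: only >= is handled here", constraint)
	}
	floor, err := ParseVersion(strings.TrimSpace(strings.TrimPrefix(c, ">=")))
	if err != nil {
		return false, err
	}
	return !offered.Less(floor), nil
}
```

The sketch reads `>=` literally, so a future `2.0` block would also satisfy `>=1.0`. Whether the constraint should carry an implicit upper bound at the next major version is not settled here.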

Platform reconciler:
1. Watches `Service` resources in the user's cluster that carry the `gpuaas.io/expose: managed_ingress` annotation
2. Creates a Pomerium route pointing at the Service
3. Updates the app-instance read model with the public URL
4. Emits ledger entries against `resource_type=building_block`, `resource_id=managed_ingress`, `unit=connection_hour` or similar
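A single reconciler pass over the four steps above can be sketched as follows. This is a minimal sketch, not the platform reconciler. `Service` is a trimmed stand-in for the Kubernetes object, and the `Ports` hooks into the renderer, read model, and ledger are hypothetical. Reconcilers run repeatedly, so the sketch assumes every hook is idempotent; for example, it assumes the ledger deduplicates per Service and billing window.

```go
package main

import "fmt"

// Annotation key and value taken from step 1 above.
const (
	exposeAnnotation = "gpuaas.io/expose"
	managedIngress   = "managed_ingress"
)

// Service is a trimmed stand-in for a Kubernetes Service.
type Service struct {
	Namespace   string
	Name        string
	Annotations map[string]string
}

// Ports are hypothetical hooks, kept as plain functions so the sketch has no framework.
type Ports struct {
	EnsureRoute  func(svc Service) (publicURL string, err error)
	SetPublicURL func(svc Service, publicURL string) error
	EmitLedger   func(resourceType, resourceID, unit string) error
}

// Reconcile runs one pass over the Services visible in a user's cluster.
func Reconcile(services []Service, p Ports) error {
	for _, svc := range services {
		// Step 1: only Services opted in via the annotation are in scope.
		if svc.Annotations[exposeAnnotation] != managedIngress {
			continue
		}
		// Step 2: create (or confirm) the Pomerium route for this Service.
		url, err := p.EnsureRoute(svc)
		if err != nil {
			return fmt.Errorf("route for %s/%s: %w", svc.Namespace, svc.Name, err)
		}
		// Step 3: publish the public URL on the app-instance read model.
		if err := p.SetPublicURL(svc, url); err != nil {
			return fmt.Errorf("read model for %s/%s: %w", svc.Namespace, svc.Name, err)
		}
		// Step 4: attribute usage to the building block, not the allocation.
		if err := p.EmitLedger("building_block", managedIngress, "connection_hour"); err != nil {
			return fmt.Errorf("ledger for %s/%s: %w", svc.Namespace, svc.Name, err)
		}
	}
	return nil
}
```

This sketch stops at the first error. A production reconciler would more likely record the failure and continue with the remaining Services.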

### 5.3 Initial building-block catalog

The first wave of building blocks worth providing (in roughly priority order):

| Building block                | What it is                                                   | Billing unit                                         | Notes                                                        |
| ----------------------------- | ------------------------------------------------------------ | ---------------------------------------------------- | ------------------------------------------------------------ |
| **managed_ingress**           | Pomerium-fronted public exposure for HTTP/TCP from user clusters | per request + per connection-hour for long-lived TCP | The first one to build; unblocks K8s scheduler-apps entirely |
| **managed_storage**           | PVC-style mounts backed by the platform's storage layer      | GB-month + IOPS                                      | Bridges existing bucket model into K8s                       |
| **managed_secrets**           | OIDC-scoped secret store (Vault-equivalent abstraction)      | per secret-month                                     | Required for Pattern C credentials                           |
| **managed_credential_broker** | Database/Redis/Mongo credential minting for Pattern C two-tier auth | per credential-mint + per broker-month               | Curated tier; platform-owned                                 |
| **managed_vector_db**         | Shared or per-project Qdrant/Weaviate/etc.                   | GB-month + query                                     | First clear "premium app" pattern; tests the cost model      |
| **managed_tracking**          | MLflow tracking server as a service                          | per tracking-server-month + storage                  | High user demand, low marginal cost                          |
| **managed_cache**             | Redis as a service                                           | GB-month                                             | Useful for inference + session caches                        |

Each block is a curated-tier app under the existing SDK. Building blocks are **dogfooded**: they are apps in the catalog, with manifests, deployed through the same machinery. They just happen to be platform-curated and platform-billed.

### 5.4 The decision a user actually faces

When a user deploys a workload that needs exposure:

```yaml
# In the user's launchable manifest
expose:
  - service: my-inference-api
    via: managed_ingress         # the building block they want
    auth_pattern: oidc_native
    rate_limit_per_minute: 600
```

The platform handles everything downstream. The user does not learn about LoadBalancer Services, NodePorts, Ingress controllers, or DNS. **If they want to go deeper** (install their own nginx, use Gateway API directly, raw NodePort), they can — but the default path is the building block.

### 5.5 What this rules out
- **No per-allocation ad-hoc LoadBalancers.** All public exposure goes through the managed building block. This keeps the public surface auditable, rate-limited, and Pomerium-fronted.
- **No bypass of Pomerium for "internal" exposure.** Internal-to-cluster traffic stays internal; cross-cluster or cross-allocation traffic goes through the building block, which goes through Pomerium.
- **No "give me a public IP."** Public IPs are managed at the building-block layer, not handed to users.

### 5.6 Quota and SLO discipline
Every building block has:
- A per-project quota (default + override)
- A documented SLO (e.g., managed_ingress: 99.9% uptime, p99 added latency < 50 ms)
- A burndown / overage policy
- A capacity-planning model owned by the platform team

Without these, building blocks become an unmonitored cost center. With them, they become a credible part of the product.

---

## 6. Endpoint Type Catalog (closed enum)

The platform recognizes the following endpoint types. **Adding a new type requires an ADR.**

| Type             | Description                                                  | Auth patterns (§3)        | Bridged by                                        |
| ---------------- | ------------------------------------------------------------ | ------------------------- | ------------------------------------------------- |
| `http`           | HTTP/HTTPS/WebSocket                                         | A, B                      | Pomerium                                          |
| `tcp`            | Generic TCP (gRPC, DB protocols, custom binary)              | C, D, E                   | Pomerium TCP tunnel                               |
| `ssh`            | SSH session-class (TTY, scp, sftp, bounded port-forward)     | D (user cert)             | gpuaas-ssh-gateway (planned) or Pomerium Zero-SSH |
| `kubernetes`     | K8s API server access (kubectl, Headlamp)                    | A (OIDC) → kubeconfig     | Pomerium kubectl proxy                            |
| `mcp`            | Model Context Protocol tool servers                          | A, B                      | Pomerium                                          |
| `job_submission` | Scheduler job-submission protocols (slurmrestd, k8s-api-as-scheduler, ray-client) | D, plus sub-protocol enum | Per-scheduler adapter                             |

Each entry implies:
- A bridge mechanism (which gateway speaks which protocol)
- An auth-pattern allowlist (which §3 patterns make sense)
- A protocol-specific audit shape
- A reconciler responsible for promoting the type to a routable target

**Anti-pattern:** introducing `http_with_grafana_quirks` or `http_no_html_rewrite`. The existing per-app HTML rewriting in the proxy is the canonical example of what this rule prevents going forward.
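The table and its allowlists can be held as data so that the closed enum is enforced in one place. This is a minimal sketch. `EndpointType`, `AuthPattern`, `catalog`, and `CheckEndpoint` are hypothetical names. Auth patterns are written as the §3 letters, because the mapping from manifest values such as `oidc_native` to letters is not defined here. The `job_submission` sub-protocol enum is omitted.

```go
package main

import "fmt"

// EndpointType (hypothetical) mirrors the closed enum in the table above.
type EndpointType string

const (
	HTTP          EndpointType = "http"
	TCP           EndpointType = "tcp"
	SSH           EndpointType = "ssh"
	Kubernetes    EndpointType = "kubernetes"
	MCP           EndpointType = "mcp"
	JobSubmission EndpointType = "job_submission"
)

// AuthPattern names a §3 pattern by letter (A to E).
type AuthPattern string

type typeSpec struct {
	Bridge  string
	Allowed []AuthPattern
}

// catalog transcribes the table. Adding a key here is the change that needs an ADR.
var catalog = map[EndpointType]typeSpec{
	HTTP:          {Bridge: "Pomerium", Allowed: []AuthPattern{"A", "B"}},
	TCP:           {Bridge: "Pomerium TCP tunnel", Allowed: []AuthPattern{"C", "D", "E"}},
	SSH:           {Bridge: "gpuaas-ssh-gateway (planned) or Pomerium Zero-SSH", Allowed: []AuthPattern{"D"}},
	Kubernetes:    {Bridge: "Pomerium kubectl proxy", Allowed: []AuthPattern{"A"}},
	MCP:           {Bridge: "Pomerium", Allowed: []AuthPattern{"A", "B"}},
	JobSubmission: {Bridge: "per-scheduler adapter", Allowed: []AuthPattern{"D"}}, // sub-protocol enum omitted
}

// CheckEndpoint rejects unknown types and auth patterns outside the allowlist,
// and returns the bridge responsible for the type.
func CheckEndpoint(t EndpointType, p AuthPattern) (string, error) {
	spec, ok := catalog[t]
	if !ok {
		return "", fmt.Errorf("endpoint type %q is not in the closed enum", t)
	}
	for _, a := range spec.Allowed {
		if a == p {
			return spec.Bridge, nil
		}
	}
	return "", fmt.Errorf("auth pattern %q is not allowed for %q", p, t)
}
```

Under this shape, a manifest asking for `http_no_html_rewrite` fails at validation. The only way to make it pass is to add a catalog entry, which is exactly the change the ADR rule gates.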

---

## 7. Anti-Patterns

Document these so reviewers can name them in PRs:

1. **`if app.name == "X"` branches in platform code.** Always indicates a contract gap. Fix the contract.
2. **New endpoint-type enum values for one app.** If only one app needs the type, the type is the wrong abstraction.
3. **Manifest fields that exist "because Helm needs it" / "because vLLM needs it."** Runtime-specific concerns live in adapters, not the manifest.
4. **Allocation-keyed billing assumptions in new code.** Use the dimensional ledger keys.
5. **Schedulers that do not enforce per-submission identity.** Implicit "everyone runs as the allocation user" is a privilege-escalation surface.
6. **Apps admitted with no clean security story.** "We will figure out auth later" never becomes "we figured out auth."
7. **Pomerium-specific data shapes in the manifest.** Manifest is upstream of any single rendering target.
8. **Helm-specific lifecycle hooks in the lifecycle enum.** Same rule, different runtime.
9. **Tier-2 credentials shown to users.** Defeats audit and revocation.
10. **Per-app proxy shims (the original sin).** Same shape as #1 and #2. Always the wrong direction.
11. **Direct database queries across domain boundaries in the platform code.** Already in AGENTS.md; restated for app-runtime code.
12. **Hardcoded policy values in app code.** Apps read policy via the standard mechanism.

---

## 8. Decision Discipline

### 8.1 When to update the SDK contract
Update the contract when:
- Two or more app classes need the same new capability
- A capability that was app-specific turns out to be general (and can be expressed at the contract level)
- A production failure shows that the contract did not enumerate enough cases

Do not update the contract for:
- A single app's quirks
- Convenience for one team
- Tactical proxy/route fixes that should live in the renderer

### 8.2 When to say no to an app
- Cannot answer the 20 design heuristics (§2) cleanly
- Requires a per-app shim in platform code
- Tier-2 auth cannot be expressed in Patterns A–E (§3)
- No clean failure-mode story that the V3 task/evidence model can capture
- Asks for capabilities reserved for a tier (curated vs. open) that the app does not belong in

Saying no is a feature of the strategy, not a bug. Every yes that does not fit the contract is a tax on every future app.

### 8.3 Manifest schema bumps
- **Minor bump (e.g., 1.x → 1.y):** new optional fields, new enum values added (additive only), new lifecycle hooks (optional). Old apps continue to load.
- **Major bump (e.g., 1.x → 2.0):** field removed, enum value removed, semantic change. Requires migration tooling + deprecation window.
- **Every bump ships with a CHANGELOG entry and updated examples.**
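The structural half of these rules can be checked mechanically by diffing two schema versions. Semantic changes still need human review. This is a minimal sketch over a deliberately tiny, hypothetical schema model (`Schema`, `Bump`, `ClassifyBump`). Treating a new required field, or an optional field that becomes required, as a major bump is an assumption: either change would stop old manifests from loading, which the minor-bump rule forbids.

```go
package main

// Schema (hypothetical) is a tiny model of a manifest schema.
type Schema struct {
	Fields map[string]bool     // field name -> required?
	Enums  map[string][]string // enum name -> allowed values
}

type Bump int

const (
	NoBump Bump = iota
	Minor
	Major
)

func contains(vals []string, v string) bool {
	for _, x := range vals {
		if x == v {
			return true
		}
	}
	return false
}

// ClassifyBump applies the structural rules. Semantic changes are invisible to it.
func ClassifyBump(prev, next Schema) Bump {
	bump := NoBump
	for name, wasRequired := range prev.Fields {
		isRequired, ok := next.Fields[name]
		if !ok {
			return Major // field removed
		}
		if isRequired && !wasRequired {
			return Major // assumption: tightening breaks old manifests
		}
		if wasRequired && !isRequired {
			bump = Minor // relaxing keeps old manifests loadable
		}
	}
	for name, isRequired := range next.Fields {
		if _, ok := prev.Fields[name]; ok {
			continue
		}
		if isRequired {
			return Major // assumption: a new required field breaks old manifests
		}
		bump = Minor // new optional field
	}
	for enum, vals := range prev.Enums {
		for _, v := range vals {
			if !contains(next.Enums[enum], v) {
				return Major // enum value removed
			}
		}
	}
	for enum, vals := range next.Enums {
		for _, v := range vals {
			if !contains(prev.Enums[enum], v) {
				bump = Minor // additive enum value
			}
		}
	}
	return bump
}
```

A check like this could run in CI against the previous published schema. It cannot replace the CHANGELOG entry, which is where semantic changes get called out.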

### 8.4 Building-block additions
Adding a building block (§5.3) is a platform commitment with:
- Capacity model
- SLO + alerting
- Pricing + quota
- Deprecation policy
- Curated tier registration

Do not ship building blocks as "experimental" without a published deprecation path. Users will build on them and the platform will inherit the support burden.

---

## 9. Open Questions

Things deliberately not resolved here, captured so they don't get lost:

1. **Per-submitter identity propagation contract.** How does a scheduler-app on a node ask the platform, with cryptographic certainty, "what is the scope for this submitter?" Probable answer: a short-lived signed token issued at submission time and validated by node-agent. Needs explicit design.
2. **Scheduler-app sub-state model.** Today's allocation state machine compresses scheduler-internal states (cluster forming → ready → draining jobs → draining cluster → gone). Probably needs a sub-state shape, not a flat extension.
3. **Identity propagation through MCP.** When a user's Claude/ChatGPT calls an MCP tool server on the user's allocation, how does the tool server know which user is calling? Likely Pomerium claim → MCP context header.
4. **Building-block tenancy model.** Shared-across-org vs. per-project vs. per-allocation. Probably decided block by block, but each block should declare its default tenancy.
5. **External app trust review process.** What's the practical bar for admitting an external app to the catalog (open tier)? To the curated tier? Process is not yet documented.
6. **Manifest schema host.** Is the manifest schema published as a separate versioned artifact (separate repo, JSON schema) consumable by external developers, or embedded in the platform repo?

These are next-decisions, not bugs. Sequencing them deliberately is part of the strategy.

---

## 10. How This Document Is Used

- **PR reviewers** cite section numbers when flagging contract drift. Examples: "this is anti-pattern §7.1," "auth pattern not in §3.2."
- **ADRs** that propose new endpoint types, new building blocks, or new manifest fields reference the relevant sections and explain how the principles are preserved.
- **App developers** (curated tier) read this before submitting a manifest.
- **The platform team** revisits this document quarterly and after every category expansion (e.g., after the first scheduler-app, the first building block, the first cross-allocation app).

This document is **v1**. It will need a v2 after the first KubeRay spike, the first managed-ingress build, and the first non-OIDC database integration land. Treat the version number as a real commitment to revisit.
