# Platform Shared Services Model v2

Status: active target model
Owner: Platform Architecture
Last updated: 2026-06-01
Supersedes: `doc/architecture/Platform_Shared_Services_Model_v1.md`

## Purpose

Make the platform shared-services model concrete enough to drive product,
architecture, security, and implementation planning.

GPUaaS is the first product on the AI Factory platform. The App SDK, managed app
runtime, internal developer experience, and future Token Factory work should not
rebuild identity, billing, audit, evidence, release posture, notification,
policy, secrets, or operational status per product.

This v2 model defines:

- which services are shared platform services;
- which domains remain product-owned;
- the minimum contracts and registries needed before the next product surface;
- how release, UAT, security, and agent-delivery evidence become first-class;
- how the App SDK becomes an internal developer platform surface;
- the phased path from current GPUaaS implementation to the target model.

## Architecture Summary

Platform shared services provide identity, money, evidence, trust, policy,
status, and operational control. Products provide domain-specific customer value
on top of those services.

For GPUaaS, this means allocation, node lifecycle, terminal access, SKU catalog,
and MAAS behavior stay product-owned. IAM, ledger, payment custody, audit,
notification, release evidence, trusted artifacts, secrets, service accounts,
status, product registration, and developer onboarding become shared platform
capabilities.

The immediate goal is not physical microservice extraction. The immediate goal
is to stop creating product-specific forks of shared platform concerns while
the code can still be reshaped cheaply.

## Target Split

```text
Products
  GPUaaS
    GPU inventory, SKUs, allocations, node lifecycle, terminal, MAAS
  App Platform / SDK
    app catalog, manifests, app instances, runtime adapters, developer workflows
  Token Factory
    model endpoints, inference routing, model policy, token analytics
  Future products
    product-specific domain resources and workflows

Platform shared services
  IAM / Access
    identity, tenants, projects, memberships, scopes, service accounts, API keys
  Billing / Metering / Payments
    usage units, attribution, rating, ledger, balances, checkout, refunds
  Audit / Evidence
    privileged actions, correlation trail, release/UAT/security evidence
  Status / Ops
    health, incidents, maintenance, component versions, release readiness, SLOs
  Notification
    templates, preferences, dispatch, security/billing/status notices
  Registry / Artifacts
    product registry, app registrations, trusted artifacts, runtime bundles
  Secrets / PKI
    service identity, short-lived credentials, cert lifecycle, runtime secrets
  Policy / Entitlements
    quotas, limits, feature flags, product access, plan/organization/department/project entitlements

Shared infrastructure libraries
  errors, middleware, events, outbox, read cache, storage primitives, SDK codegen
```

## Service Ownership

| Shared service | Owns | Current anchor | Target contract surface |
|---|---|---|---|
| IAM / Access | principals, tenants, projects, roles, memberships, service accounts, API keys, scopes, authorization evidence | `packages/platform/auth`, `packages/platform/iam`, access/platform routes | `/access/*`, `/iam/*`, product scope registry, service-account APIs |
| Billing / Metering | usage ingestion, rating, ledger entries, balances, product attribution, quotas tied to money | `packages/platform/billing`, `cmd/billing-worker` | usage-unit registry, usage ingest API/events, ledger read models |
| Payments | checkout sessions, webhooks, refunds, provider reconciliation, ledger credits | `packages/platform/payments`, `cmd/webhook-worker` | payment session APIs, webhook worker, finance recovery APIs |
| Audit / Evidence | append-only audit rows, evidence bundles, retention classes, release/UAT/security evidence | audit helpers and platform evidence UI | audit action registry, evidence bundle APIs, platform evidence read model |
| Status / Ops | component versions, health snapshots, maintenance, incidents, release readiness, SLO evidence | platform ops/read-model routes, operations docs | status APIs, incident APIs, release evidence APIs |
| Notification | templates, preferences, durable delivery intent, WS/email/security notices | `packages/platform/notification`, `cmd/notification-relay` | template registry, preference APIs, dispatch events |
| Registry / Artifacts | product registry, app registration, trusted artifact metadata, runtime bundle promotion | app platform docs/scripts, OCI artifact scripts | product/app/artifact registries, artifact trust state APIs |
| Secrets / PKI | service identity, cert issuance/renewal, runtime secret custody, credential delivery | `packages/shared/pki`, Vault/PKI docs | secret purpose registry, service identity APIs, cert lifecycle events |
| Policy / Entitlements | quotas, feature flags, product access, plan/organization/department/project entitlements, policy value authority, future policy-engine integration | `packages/shared/policy`, `platform_policy_values` | policy registry, entitlement APIs, versioned policy snapshots, decision input/output contract |

Secrets / PKI is a coordination layer over Vault, step-ca, cert-manager, and
`packages/shared/pki`. It owns secret-purpose registration, credential delivery
contracts, lifecycle events, audit, and policy integration; it does not replace
the underlying secret or certificate custody tools.

## Product-Owned Domains

| Product/domain | Owns | Must compose shared services for |
|---|---|---|
| GPUaaS provisioning | allocation lifecycle, node assignment, release, force-release, MAAS/reimage orchestration | IAM authz, billing attribution, audit, notification, status evidence |
| GPUaaS inventory/catalog | GPU nodes, SKUs, capacity availability, marketplace presentation | policy, entitlement, billing unit registration, audit |
| GPUaaS terminal/access runtime | browser terminal, SSH/session relay, node-agent stream binding | IAM, short-lived credentials, audit, status |
| App Platform / SDK | app manifests, app catalog, app instances, shared runtimes, app operations, runtime adapters | IAM service accounts, artifact trust, billing hooks, audit, developer docs |
| Token Factory | model endpoints, routing, model-specific policy, token analytics, inference gateway | API keys/scopes, usage units, billing, audit, status, notifications |

Product services must not directly query shared-service tables. They call shared
contracts, consume read models, or subscribe to outbox-backed events.
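As a rough illustration of the rule above, product code can depend on a narrow contract interface rather than on IAM tables. This is a minimal sketch, not the actual facade: `Authorizer`, `staticAuthorizer`, `CreateAllocation`, and the scope name `gpuaas.allocation.create` are all hypothetical names introduced for the example, and a real contract would likely also carry a request context and correlation ID.

```go
package main

import (
	"errors"
	"fmt"
)

// scopeAllocationCreate is a hypothetical scope ID used only in this sketch.
const scopeAllocationCreate = "gpuaas.allocation.create"

// ErrForbidden is returned when the shared contract denies the action.
var ErrForbidden = errors.New("forbidden")

// Authorizer is an assumed shape for a shared IAM contract. Product code
// depends on this interface and never reads IAM tables directly.
type Authorizer interface {
	Authorize(principal, scope string) (bool, error)
}

// staticAuthorizer is an in-memory stand-in for the IAM facade so the
// sketch can run without a database.
type staticAuthorizer struct {
	grants map[string]map[string]bool // principal -> scope -> granted
}

func (a staticAuthorizer) Authorize(principal, scope string) (bool, error) {
	return a.grants[principal][scope], nil
}

// CreateAllocation is product-owned logic that composes the shared contract.
// It fails closed: any authorization error or denial stops the operation.
func CreateAllocation(authz Authorizer, principal, sku string) (string, error) {
	ok, err := authz.Authorize(principal, scopeAllocationCreate)
	if err != nil {
		return "", fmt.Errorf("authorize: %w", err)
	}
	if !ok {
		return "", ErrForbidden
	}
	return "allocation requested for " + sku, nil
}

func main() {
	authz := staticAuthorizer{grants: map[string]map[string]bool{
		"alice": {scopeAllocationCreate: true},
	}}
	fmt.Println(CreateAllocation(authz, "alice", "example-sku"))
	fmt.Println(CreateAllocation(authz, "bob", "example-sku"))
}
```

Because the product only sees the interface, the IAM implementation behind it can later move from a co-located package to a separate service without changing product code.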

## Required Registries

The registries are the concrete mechanism that lets products compose shared
services without hardcoding behavior.

| Registry | Minimum fields | First users |
|---|---|---|
| Product registry | `product_id`, display name, lifecycle, owner, current version | GPUaaS, App Platform, Token Factory |
| Scope registry | `scope_id`, `product_id`, version, description, lifecycle, risk class | IAM, API keys, service accounts |
| Usage-unit registry | `unit_id`, owner product/service, version, precision, rating category, lifecycle | GPU-hour, app-runtime-hour, token/request units |
| Audit-action registry | `action_id`, product/service, version, retention class, privileged flag | admin actions, release actions, app publish actions |
| Notification-template registry | `template_id`, product/service, version, channels, severity | low balance, release incident, security notice |
| Evidence-type registry | `evidence_type`, owner, required fields, retention class, release-blocking flag | UAT, security scan, release approval, residual risk |
| Artifact-type registry | `artifact_type`, trust states, signing requirement, promotion policy | app bundles, runtime images, SDK examples |
| Policy/entitlement registry | `policy_key`, scope, type, default, lifecycle, effective version | quotas, feature flags, reserve policy, department/project envelopes |

Registration changes that affect authorization, billing, audit wording,
customer-visible behavior, or release gates must be versioned. Long-lived
resources snapshot the registry version they were created under.
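The versioning and snapshot rule can be sketched with the usage-unit registry. This is an illustrative in-memory model only: the type and function names (`UsageUnit`, `UsageUnitRegistry`, `Allocation`, `NewAllocation`) and the sample values are assumptions, and the first implementation described elsewhere in this doc is seed-backed rather than built like this.

```go
package main

import "fmt"

// UsageUnit mirrors the minimum usage-unit registry fields from the table.
// Field names are assumptions made for this sketch.
type UsageUnit struct {
	UnitID         string
	OwnerProduct   string
	Version        int
	Precision      int // decimal places kept when rating
	RatingCategory string
	Lifecycle      string
}

// UsageUnitRegistry keeps every published version of each unit so that
// older resources can still be interpreted under the version they used.
type UsageUnitRegistry struct {
	versions map[string][]UsageUnit
}

func NewUsageUnitRegistry() *UsageUnitRegistry {
	return &UsageUnitRegistry{versions: make(map[string][]UsageUnit)}
}

// Publish appends a new version; the registry assigns the version number.
func (r *UsageUnitRegistry) Publish(u UsageUnit) UsageUnit {
	u.Version = len(r.versions[u.UnitID]) + 1
	r.versions[u.UnitID] = append(r.versions[u.UnitID], u)
	return u
}

// Current returns the latest published version of a unit.
func (r *UsageUnitRegistry) Current(unitID string) (UsageUnit, bool) {
	vs := r.versions[unitID]
	if len(vs) == 0 {
		return UsageUnit{}, false
	}
	return vs[len(vs)-1], true
}

// Allocation is a long-lived resource that snapshots the registry version
// it was created under.
type Allocation struct {
	ID          string
	UnitID      string
	UnitVersion int
}

func NewAllocation(r *UsageUnitRegistry, id, unitID string) (Allocation, error) {
	u, ok := r.Current(unitID)
	if !ok {
		return Allocation{}, fmt.Errorf("usage unit %q is not registered", unitID)
	}
	return Allocation{ID: id, UnitID: u.UnitID, UnitVersion: u.Version}, nil
}

func main() {
	reg := NewUsageUnitRegistry()
	reg.Publish(UsageUnit{UnitID: "gpu-hour", OwnerProduct: "gpuaas", Precision: 4,
		RatingCategory: "compute", Lifecycle: "active"})
	alloc, _ := NewAllocation(reg, "alloc-1", "gpu-hour")

	// A later billing-affecting change is published as a new version.
	reg.Publish(UsageUnit{UnitID: "gpu-hour", OwnerProduct: "gpuaas", Precision: 6,
		RatingCategory: "compute", Lifecycle: "active"})
	cur, _ := reg.Current("gpu-hour")

	fmt.Printf("allocation pinned to v%d, registry now at v%d\n", alloc.UnitVersion, cur.Version)
}
```

The allocation keeps the version it was created under, so a later precision change does not silently alter how existing usage is rated.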

## Release, UAT, And Security Evidence

Release readiness becomes a shared Status/Ops plus Audit/Evidence capability,
not a collection of product-specific runbook notes.

### Evidence bundle

Every production-impacting release candidate should be able to produce a bundle
with:

- source SHA and release branch;
- environment/profile target;
- package/image/artifact digests;
- migration status;
- GitLab pipeline and job links;
- contract/codegen/SDK drift result;
- unit, integration, frontend, and UAT automation results;
- security scan/triage result;
- product owner approval;
- platform owner approval;
- security owner approval when required;
- residual risk statement;
- rollback or forward-fix plan;
- capacity reserve/ring posture;
- correlation IDs for deploy and smoke events.
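One way to make the bundle checkable is a typed record with a completeness check. The sketch below models only a subset of the items listed above, and every name in it (`EvidenceBundle`, its fields, `Missing`, the approval role keys) is an assumption for illustration rather than the actual evidence schema.

```go
package main

import "fmt"

// EvidenceBundle is a hypothetical, partial shape for a release evidence
// bundle. Only some of the listed items are modeled here.
type EvidenceBundle struct {
	SourceSHA       string
	ReleaseBranch   string
	TargetProfile   string
	ArtifactDigests []string
	PipelineURL     string
	ResidualRisk    string
	RollbackPlan    string
	Approvals       map[string]bool // keyed by role: "product", "platform", "security"
}

// Missing returns the modeled items that are absent from the bundle.
// Security owner approval is only required when securityRequired is true.
func (b EvidenceBundle) Missing(securityRequired bool) []string {
	var missing []string
	require := func(name string, present bool) {
		if !present {
			missing = append(missing, name)
		}
	}
	require("source SHA", b.SourceSHA != "")
	require("release branch", b.ReleaseBranch != "")
	require("environment/profile target", b.TargetProfile != "")
	require("artifact digests", len(b.ArtifactDigests) > 0)
	require("pipeline link", b.PipelineURL != "")
	require("residual risk statement", b.ResidualRisk != "")
	require("rollback or forward-fix plan", b.RollbackPlan != "")
	require("product owner approval", b.Approvals["product"])
	require("platform owner approval", b.Approvals["platform"])
	if securityRequired {
		require("security owner approval", b.Approvals["security"])
	}
	return missing
}

func main() {
	b := EvidenceBundle{
		SourceSHA:     "0123abc",
		ReleaseBranch: "release/example",
		TargetProfile: "staging",
		Approvals:     map[string]bool{"product": true},
	}
	for _, item := range b.Missing(true) {
		fmt.Println("missing:", item)
	}
}
```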

### Release gates

Initial gates:

| Gate | Owner | Blocks promotion when |
|---|---|---|
| Contract and SDK drift | Platform Architecture / owning domain | OpenAPI/AsyncAPI and generated SDK artifacts are stale |
| UAT automation | Product/UI + owning domain | Required persona journey or demo UAT suite fails |
| Security evidence | Security / Platform Control | Required scan, authz, secret, or audit check is missing/failing |
| Environment profile | Platform Control | host, DNS, TLS, registry, secret, or package profile is inconsistent |
| Capacity reserve | Provisioning / Ops | no ring/canary/reserve posture is recorded for a capacity-impacting change |
| Approval and residual risk | Product + Platform + Security as applicable | approval or residual risk statement is absent for a change that carries release risk |

GitLab remains the execution system. The shared platform model records whether
the evidence from GitLab and UAT satisfies the release gate.
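The split between execution (GitLab) and recording (the shared model) can be sketched as a pure evaluation over recorded gate results. `GateResult`, `Blockers`, and `CanPromote` are hypothetical names, and the notion of an "applicable" gate is an assumption drawn from conditions such as "capacity-impacting change" in the table.

```go
package main

import "fmt"

// GateResult records whether evidence gathered elsewhere (for example from
// GitLab jobs or a UAT run) satisfies one release gate.
type GateResult struct {
	Gate       string
	Applicable bool // false when the gate does not apply to this change
	Satisfied  bool
}

// Blockers returns the applicable gates whose evidence is missing or failing.
func Blockers(results []GateResult) []string {
	var blocked []string
	for _, r := range results {
		if r.Applicable && !r.Satisfied {
			blocked = append(blocked, r.Gate)
		}
	}
	return blocked
}

// CanPromote reports whether no applicable gate blocks promotion.
func CanPromote(results []GateResult) bool {
	return len(Blockers(results)) == 0
}

func main() {
	results := []GateResult{
		{Gate: "Contract and SDK drift", Applicable: true, Satisfied: true},
		{Gate: "UAT automation", Applicable: true, Satisfied: false},
		{Gate: "Capacity reserve", Applicable: false},
	}
	fmt.Println("can promote:", CanPromote(results))
	fmt.Println("blocked by:", Blockers(results))
}
```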

## App SDK And Internal Developer Platform

The App SDK is a product surface and a shared developer platform capability, not
only a repo implementation detail.

App runtime/controller bugs can be fixed in the owning runtime/backend layer.
App defaults, manifest behavior, publish/promotion semantics, connect actions,
and developer-visible failure behavior should move toward the SDK/manifest
contract instead of staying as backend seed/runtime assumptions.

Before internal developers build against the SDK broadly, the platform should
provide:

- a developer onboarding track in the Docusaurus portal;
- service-account creation and scoped credential flow;
- example app manifests and runnable examples;
- artifact publish and promotion workflow;
- app registration checklist;
- local and platform-control smoke commands;
- generated API/SDK reference entry points;
- contract tests for SDK examples;
- audit/evidence expectations for app publish and runtime operations;
- support boundary: what the platform owns versus what the app adapter owns.

Developer readiness gates:

| Order | Gate | Pass condition |
|---|---|---|
| 1 | API contract | App APIs and async events are documented and generated SDK smoke passes |
| 2 | Example app | at least one reference app launches through public contracts |
| 3 | Credential flow | app service account can be created, scoped, rotated, and audited |
| 4 | Portal entry | developer can follow docs without reading internal Go packages |
| 5 | Runtime evidence | app operation emits status, audit, and billing attribution hooks |
| 6 | Artifact trust | app artifact can be registered, promoted, and traced to source |

The first readiness artifact for `PSS-SDK-001` is a manifest/launch/connect
contract matrix covering current supported apps. It should identify which
behavior is already expressible through SDK-visible manifests and validators,
which behavior is still implemented as seed/backend compatibility logic, and
which gaps need SDK examples or smoke tests.

## Persona Surfaces

The same shared services serve multiple audiences through different surfaces.

| Persona | Surface | Needs |
|---|---|---|
| End user | user app/web/CLI | launch, connect, monitor, billing visibility |
| Tenant/customer admin | access/account/platform tenant pages | members, projects, quotas, audit, budgets |
| Internal app developer | Docusaurus + SDK + CLI | build, register, test, publish apps |
| External/partner developer | future public developer portal | stable APIs, SDKs, examples, partner submission |
| Platform operator | `/platform/*`, runbooks, Fairway/evidence | release posture, incidents, health, recovery |
| Security reviewer | security readiness and evidence views | controls, gaps, scan evidence, residual risk |
| Product/architecture | roadmap and domain docs | maturity, domain split, gaps, decisions |

Access control can be phased. The information architecture should assume public,
customer, partner, and internal tracks even if early implementation is internal
only.

## Degradation Model

Shared services fail differently and must advertise their posture.

| Service | Degradation posture | Initial SLO target | Status/Ops evidence |
|---|---|---|---|
| IAM / Access | authorization and key validation fail closed; cached reads allowed only within freshness budget | authorize p99 < 20ms from local read model | authz latency, cache age, JWKS age, key validation failures |
| Billing / Metering | usage ingestion buffers durably; ledger mutation never silently skips; stale balance reads are stamped | usage ingest/outbox lag p99 < 30s | usage lag, unrated events, ledger writer health, balance rollup age |
| Payments | payment initiation fails closed; webhook processing retries durably | webhook processing p99 < 60s after receipt | webhook lag, provider reachability, DLQ count |
| Audit / Evidence | privileged mutations fail closed if required audit/evidence cannot be written | audit/evidence write p99 < 50ms | audit write failures, evidence bundle lag, query read-model lag |
| Status / Ops | read-only status can degrade; incident/release evidence writes must be durable or backfilled | status read p99 < 100ms | component freshness, incident event lag, release evidence age |
| Notification | non-critical notices retry; billing/security notices retain durable intent | dispatch intent p99 < 30s | pending notifications, delivery failures, durable intent age |
| Registry / Artifacts | untrusted or unknown artifacts fail closed for promotion/deploy | registry lookup p99 < 50ms | registry version freshness, artifact trust state, promotion failures |
| Secrets / PKI | issuance and renewal fail closed outside documented grace windows | cert renewal attempted before remaining lifetime falls below 50% | cert age, renewal failures, secret rotation age, grace-window exceptions |
| Policy / Entitlements | capacity, authority, and financial writes fail closed if policy freshness is outside budget | policy snapshot freshness < 5m for write paths | policy snapshot age, entitlement decision latency, stale-write rejects |

These targets are starting points, not SLA commitments. Product SLAs should
reference Status/Ops evidence, not ad hoc service logs.
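To make "fail closed outside a freshness budget" concrete, here is a minimal sketch of the Policy / Entitlements row: a write is rejected when the policy snapshot is not younger than 5 minutes. `PolicySnapshot`, `CheckWrite`, and `ErrPolicyStale` are hypothetical names, and the 5-minute budget is the starting target from the table, not a committed value.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// policyFreshnessBudget is the initial write-path target from the table.
const policyFreshnessBudget = 5 * time.Minute

// ErrPolicyStale signals a stale-write reject.
var ErrPolicyStale = errors.New("policy snapshot outside freshness budget")

// PolicySnapshot is an assumed shape for a versioned policy snapshot.
type PolicySnapshot struct {
	Version   int
	FetchedAt time.Time
}

// CheckWrite fails closed: capacity, authority, and financial writes are
// rejected unless the snapshot age is strictly under the budget.
func CheckWrite(s PolicySnapshot, now time.Time) error {
	age := now.Sub(s.FetchedAt)
	if age >= policyFreshnessBudget {
		return fmt.Errorf("%w: age %s, budget %s", ErrPolicyStale, age, policyFreshnessBudget)
	}
	return nil
}

func main() {
	now := time.Now()
	fresh := PolicySnapshot{Version: 7, FetchedAt: now.Add(-2 * time.Minute)}
	stale := PolicySnapshot{Version: 6, FetchedAt: now.Add(-9 * time.Minute)}

	fmt.Println("fresh snapshot:", CheckWrite(fresh, now))
	fmt.Println("stale snapshot:", CheckWrite(stale, now))
}
```

A caller would count each `ErrPolicyStale` as a stale-write reject, which is one of the Status/Ops evidence signals listed for this service.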

## Extraction Trajectory

Physical extraction comes after contract separation, not before it.

| Phase | Shape |
|---|---|
| Phase 0 | Current co-located `cmd/api` and service packages, but with documented ownership and no new shared-service forks |
| Phase 1 | Platform registries and evidence bundles become schema/API-backed |
| Phase 2 | IAM facade, audit/evidence read model, and Status/Ops APIs become explicit packages/routes |
| Phase 3 | Billing usage ingestion/rating and notification dispatch harden around product-neutral contracts |
| Phase 4 | Physical service extraction only where release cadence, scale, compliance, or SLOs require it |

Do not extract a service just to make the diagram prettier. Extract when the
contract is stable and the operational reason is real.

## Concrete Implementation Path

Implementation status as of the platform-foundation Fairway track:

- The shared-services target model is complete enough for implementation.
- Package, route, schema, event, frontend, and worker ownership maps exist.
- Report-only boundary guards, warning mode, and blocking-new mode exist.
- Evidence/status, IAM, registry, GPUaaS, and App Platform facades exist as the
  first modular-monolith boundary.
- Deployment extraction readiness gates exist; physical extraction has not
  started.

This status does not mean that all platform shared services are fully
implemented. It means the ownership model, first contracts, facades, and guard
path are ready. The remaining work is operationalization: hardening
evidence/status, turning registries into durable runtime inputs where needed,
making App SDK readiness executable, and only then considering service
extraction.

The remaining tracks and phases now live in
`Platform_Shared_Services_Completion_Roadmap_v1.md`. That roadmap supersedes
the earlier 30/60/90-day planning view because the foundation baseline has been
completed.

Summary of the remaining completion phases:

| Phase | Focus | Primary outcome |
|---|---|---|
| Phase 1 | Operating record hardening | release/UAT/security/operator evidence becomes the shared operating record |
| Phase 2 | Service identity and entitlements | product-to-platform calls use explicit service identity and scope contracts |
| Phase 3 | Product-neutral usage and money domain | usage-unit registry, neutral usage events, and money-domain status evidence |
| Phase 4 | App SDK, registry, artifact trust, and onboarding | internal developer path, artifact promotion, and next-product onboarding |
| Phase 5 | Notification, policy, tenant, and external surfaces | shared notices, quotas, feature flags, tenant posture, and portal tracks |
| Phase 6 | Runtime trust and extraction decisions | Secrets/PKI evidence and keep/split/extract decisions for candidate services |

## First Work Packages

| ID | Work package | Owner | Output |
|---|---|---|---|
| PSS-REG-001 | Registry schema/spec baseline | Platform Architecture | registry doc + seed/update plan |
| PSS-EVID-001 | Release evidence bundle model | Platform Control / Security | evidence schema + mapping to GitLab/UAT |
| PSS-SDK-001 | App SDK internal developer readiness | App Platform / Product | manifest/launch/connect matrix + checklist + Docusaurus IA update |
| PSS-IAM-001 | IAM facade boundary | IAM / Access | package/route split proposal |
| PSS-AUDIT-001 | Audit/evidence read model | Audit / Status | read-model contract and first endpoints |
| PSS-USAGE-001 | Product-neutral usage-unit registration | Billing | GPU-hour/app-runtime/token unit model |
| PSS-ART-001 | Artifact trust and promotion boundary | App Platform / Platform Control | artifact states and promotion gates |

PSS-* IDs describe owned platform shared-service work packages. Orchestrator
task IDs describe the agent execution lanes that carry those packages forward.

| PSS work package | Primary orchestrator task(s) |
|---|---|
| PSS-REG-001 | `A-PLATFORM-IAM-REGISTRY-FACADE-001`, `D-PLATFORM-FOUNDATION-MAPS-001` |
| PSS-EVID-001 | `A-PLATFORM-EVIDENCE-STATUS-FACADE-001`, `C-PLATFORM-RELEASE-EVIDENCE-INPUTS-001` |
| PSS-SDK-001 | `B-PLATFORM-EVIDENCE-STATUS-CONTRACT-001` plus App SDK manifest/launch/connect matrix and Docusaurus follow-up tasks |
| PSS-IAM-001 | `A-PLATFORM-IAM-REGISTRY-FACADE-001` |
| PSS-AUDIT-001 | `A-PLATFORM-EVIDENCE-STATUS-FACADE-001`, `C-PLATFORM-RELEASE-EVIDENCE-INPUTS-001` |
| PSS-USAGE-001 | `D-PLATFORM-FOUNDATION-MAPS-001` first, then billing usage-unit implementation tasks |
| PSS-ART-001 | App Platform artifact trust tasks after registry/evidence contracts land |

## Open Decisions

1. Do registries start seed-backed or schema-backed? Resolved for the first
   implementation by `Platform_Registry_Contract_v1.md`: registries start
   seed-backed through `packages/platform/registry`, with schema-backed
   migration tracked centrally as `OD-001` in
   `../Platform_Architecture_Open_Decisions_v1.md`.
2. Which release evidence fields are mandatory for all releases versus only
   production-impacting releases?
3. Does Status/Ops expose customer-safe status in the first implementation, or
   only operator/security evidence?
4. Where does the first IAM facade live? Resolved locally by
   `Platform_Code_And_Deployment_Architecture_v1.md`: introduce
   `packages/platform/iam` first with a temporary adapter over
   `packages/platform/auth`. Physical rename/extraction timing remains tracked
   centrally as `OD-013`.
5. Which App SDK examples become the internal developer readiness baseline?
6. What is the first product after GPUaaS that must complete the product
   onboarding checklist?
7. When does evidence/status move from co-located package/route implementation
   to a separately deployed service? The first candidate is evidence/status,
   but extraction is gated by
   `Platform_Deployment_Extraction_Readiness_v1.md`.

## Related Docs

- `doc/architecture/platform-foundation/AI_Factory_Production_Readiness_Gap_Portfolio_v1.md`
- `doc/architecture/platform-foundation/Platform_Code_And_Deployment_Architecture_v1.md`
- `doc/architecture/Platform_Shared_Services_Model_v1.md` (superseded historical reference)
- `doc/architecture/AI_Factory_Team_Domain_Operating_Model_v1.md`
- `doc/architecture/Domain_Ownership_Map.md`
- `doc/architecture/App_Developer_Starter_Pack_v1.md`
- `doc/architecture/App_Platform_Primitive_Boundary_v1.md`
- `doc/product/GPUaaS_Documentation_and_Developer_Portal_Docusaurus_v1.md`
- `doc/operations/GPUaaS_Security_CD_Current_State_Gap_Roadmap_v1.md`
- `doc/operations/Platform_Control_CI_CD_Target_Model_v1.md`
- `doc/governance/Platform_Control_Release_Promotion_Policy.md`
