# Platform Shared Services Completion Roadmap v1

Status: active roadmap
Owner: Platform Architecture
Last updated: 2026-06-03

## Purpose

Define the remaining tracks and phases needed to complete the Platform Shared
Services Model v2 (PSSM v2) after the platform-foundation baseline.

The foundation work is complete: ownership maps, boundary guards, first
facades, evidence/status read models, registry seed/facade, App SDK manifest
validator, release evidence gates, and extraction readiness packet are in
place. The remaining work is not another broad reshuffle. It is a set of
shared-service productization tracks that consume the foundation.

## Completion Definition

PSSM v2 is complete when each shared service has:

- a named owner and product-facing contract;
- registry entries where product composition needs versioned identifiers;
- an API, event, or read-model surface for cross-product use;
- a service-auth and audit posture for product-to-platform calls;
- degradation behavior with Status/Ops evidence;
- CI, UAT, security, and release evidence mapping;
- portal or runbook documentation for its primary personas;
- an extraction readiness decision: keep in process, split worker, or extract
  service.

Completion does not require every shared service to become a physical
microservice.
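
The completion definition can be read as a per-service checklist. The following is a minimal sketch of that idea, not an existing schema: the `SharedServiceCompletion` record and its field names are hypothetical, and only the three extraction decision values come from this roadmap.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

# The three extraction outcomes named in this roadmap.
EXTRACTION_DECISIONS = {"keep_in_process", "split_worker", "extract_service"}


@dataclass
class SharedServiceCompletion:
    """Hypothetical per-service completion record; field names are illustrative."""
    name: str
    owner: Optional[str] = None
    product_contract: Optional[str] = None
    registry_entries: bool = False          # versioned identifiers, where needed
    cross_product_surface: bool = False     # API, event, or read-model surface
    service_auth_and_audit: bool = False
    degradation_evidence: bool = False      # degradation with Status/Ops evidence
    release_evidence_mapping: bool = False  # CI, UAT, security, release evidence
    persona_docs: bool = False              # portal or runbook documentation
    extraction_decision: Optional[str] = None


def missing_criteria(record: SharedServiceCompletion) -> List[str]:
    """Return the names of completion criteria that are not yet satisfied."""
    missing = [
        f.name
        for f in fields(record)
        if f.name not in ("name", "extraction_decision")
        and not getattr(record, f.name)
    ]
    if record.extraction_decision not in EXTRACTION_DECISIONS:
        missing.append("extraction_decision")
    return missing
```

A service with `extraction_decision="keep_in_process"` and every other criterion met returns an empty list, which matches the point that completion does not require physical extraction.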

## Remaining Tracks

| Track | Shared-service area | Primary outcome | First owner |
|---|---|---|---|
| T1 | Evidence, Status/Ops, Audit | release/UAT/security/operator evidence becomes the shared operating record | Platform Control / Ops / Security |
| T2 | IAM, service auth, entitlements | product-to-platform calls use explicit service identity and scope contracts | IAM / Security |
| T3 | Billing, metering, usage units, payments | products emit neutral usage and money-domain writes stay platform-owned | Billing / Finance Platform |
| T4 | Registry, artifacts, SDK publishing | product/app/artifact/SDK contracts become durable onboarding surfaces | Platform Architecture / App Platform |
| T5 | Notification, policy, tenant customization | cross-product notices, quotas, feature flags, and tenant posture stop forking by product | Platform / Product |
| T6 | Secrets/PKI and runtime trust | credential purpose, certificate lifecycle, and runtime secret delivery become auditable shared contracts | Security / Infra |
| T7 | Portal and persona documentation | users, developers, product, security, architecture, ops, and infra can consume the model without GitHub spelunking | Product / Docs / Architecture |
| T8 | Extraction hardening | decide which shared services stay modular-monolith versus split worker/service | Platform Architecture / Ops |

## Phase 1 - Operating Record Hardening

Goal: make evidence/status/audit reliable enough to become the operating source
of truth for releases, UAT, security, and operators.

Scope:

- make release evidence bundles required for platform-control promotions;
- connect UAT invariant outputs to evidence bundles;
- add audit/evidence read-model queries for privileged mutations;
- publish component status and guard posture for release review;
- define customer-safe versus internal-only status views.

Primary tasks:

| ID | Output | Depends on |
|---|---|---|
| PSS-C1-EVIDENCE-PROMOTION-GATE | platform-control promotion requires release evidence bundle | foundation evidence/status |
| PSS-C1-UAT-INVARIANT-MAP | UAT scripts publish named invariant coverage | evidence bundle schema |
| PSS-C1-AUDIT-READMODEL | privileged audit search/read model for platform and product reviewers | audit action registry |
| PSS-C1-STATUS-SLO-FEED | component freshness, DLQ, outbox, and guard posture feed Status/Ops | statusops facade |

Exit criteria:

- a release cannot silently pass with missing required evidence;
- UAT evidence proves named product invariants;
- security and ops can inspect audit/evidence/status without direct SQL.
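
The first exit criterion implies a fail-closed gate: a missing or non-passing evidence entry blocks promotion. A minimal sketch follows, assuming a bundle is a mapping from evidence kind to a result record. The evidence kinds and the `result` field are assumptions for illustration; the real list would come from the evidence bundle schema.

```python
from typing import Dict, List, Set

# Assumed set of required evidence kinds (hypothetical names).
REQUIRED_EVIDENCE: Set[str] = {"ci", "uat_invariants", "security", "status_snapshot"}


def promotion_gate(bundle: Dict[str, dict]) -> List[str]:
    """Return blocking reasons; an empty list means the promotion may proceed."""
    reasons = []
    for kind in sorted(REQUIRED_EVIDENCE):
        entry = bundle.get(kind)
        if entry is None:
            reasons.append(f"missing evidence: {kind}")
        elif entry.get("result") != "pass":
            # Anything other than an explicit pass blocks (fail closed).
            reasons.append(f"evidence not passing: {kind}")
    return reasons
```

Returning reasons instead of a boolean keeps the gate output usable as release evidence in its own right.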

## Phase 2 - Service Identity And Entitlement Contracts

Goal: make product-to-platform calls explicit, scoped, auditable, and
revocable before more products consume shared services.

Scope:

- implement service-account lifecycle and scoped credentials;
- define product-to-platform token claims and validation path;
- connect scope registry to IAM facade decisions;
- add entitlement read model for product access, quotas, and feature flags;
- document emergency disable and rotation flows.

Primary tasks:

| ID | Output | Depends on |
|---|---|---|
| PSS-C2-SERVICE-AUTH | service account, token shape, rotation, emergency disable | IAM facade, registry |
| PSS-C2-SCOPE-REGISTRY-RUNTIME | IAM reads scope registry through `packages/platform/iam.EvaluateScopePermission` for service/user authorization | registry facade |
| PSS-C2-ENTITLEMENT-READMODEL | product/tenant/project entitlement read model | policy registry |
| PSS-C2-AUTHZ-EVIDENCE | authz decisions emit audit/evidence hooks | audit/evidence |

Exit criteria:

- product services no longer rely on ad hoc shared secrets or user-token reuse;
- privileged product-to-platform calls are scoped and auditable;
- entitlements are visible as shared platform decisions.
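
To illustrate what scoped, auditable, and revocable could mean for a single call, here is a hedged sketch of an authorization check over already-verified token claims. The `ServiceClaims` shape, the deny reason strings, and the disabled-account set are hypothetical. Signature verification and the real `EvaluateScopePermission` path are out of scope.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class ServiceClaims:
    """Hypothetical decoded claims for a product service account token."""
    service_account: str
    product: str
    scopes: FrozenSet[str]
    expires_at: float  # Unix timestamp


def authorize(claims: ServiceClaims, required_scope: str,
              disabled_accounts: Set[str], now: float) -> str:
    """Return "allow" or a deny reason that an audit hook could record."""
    if claims.service_account in disabled_accounts:
        return "deny:account_disabled"  # emergency disable wins over everything
    if now >= claims.expires_at:
        return "deny:token_expired"
    if required_scope not in claims.scopes:
        return "deny:missing_scope"
    return "allow"
```

Checking the disabled set first means an emergency disable takes effect even while previously issued tokens are still unexpired.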

## Phase 3 - Product-Neutral Usage And Money Domain

Goal: make billing/metering/payments a shared platform capability that products
compose through usage units and immutable money-domain contracts.

Scope:

- add usage-unit registry runtime lookup;
- define usage event contract for GPU-hour, app-runtime-hour, and future token
  usage;
- route app runtime and GPUaaS usage through product-neutral ingestion;
- keep ledger/payment writes platform-owned;
- expose billing readiness and money-domain health in Status/Ops.

Primary tasks:

| ID | Output | Depends on |
|---|---|---|
| PSS-C3-USAGE-UNIT-RUNTIME | usage-unit registry consumed by billing/rating | registry runtime |
| PSS-C3-USAGE-EVENT-CONTRACT | versioned usage event and outbox contract | event ownership map |
| PSS-C3-APP-RUNTIME-METERING | app runtime emits neutral usage evidence | App Platform runtime facade |
| PSS-C3-MONEY-STATUS | ledger/webhook/rating health in Status/Ops | statusops |

Exit criteria:

- new products do not create product-specific ledgers;
- usage attribution is product-neutral and registry-backed;
- money-domain degradation and recovery evidence are visible.
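
A product-neutral usage event could look roughly like the sketch below. The field names, the registry contents, and the validation rules are assumptions. Only the unit kinds (GPU-hour, app-runtime-hour) come from the scope above.

```python
from dataclasses import dataclass
from decimal import Decimal

# Assumed registry contents; a real lookup would go through the registry runtime.
USAGE_UNIT_REGISTRY = {
    "gpu-hour": {"version": 1},
    "app-runtime-hour": {"version": 1},
}


@dataclass(frozen=True)
class UsageEvent:
    """Hypothetical versioned usage event emitted by any product."""
    event_id: str          # idempotency key for ingestion
    product: str
    tenant_id: str
    usage_unit: str
    quantity: Decimal
    schema_version: int = 1


def validate_usage_event(event: UsageEvent) -> None:
    """Reject events that are not registry-backed or carry no usage."""
    if event.usage_unit not in USAGE_UNIT_REGISTRY:
        raise ValueError(f"unregistered usage unit: {event.usage_unit}")
    if event.quantity <= 0:
        raise ValueError("quantity must be positive")
```

The event carries no price or ledger fields, so rating and money-domain writes stay on the platform side.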

## Phase 4 - App SDK, Registry, Artifact Trust, And Product Onboarding

Goal: make App SDK and product onboarding real internal developer surfaces, not
seed-only or architecture-only contracts.

Scope:

- publish SDK examples for launch/connect/decommission;
- add service-account and artifact-promotion examples;
- make app manifests source-controlled fixtures or generated from canonical
  fixtures;
- complete artifact trust states, promotion gates, SBOM/provenance references;
- create product onboarding checklist for Token Factory or the next product.

Primary tasks:

| ID | Output | Depends on |
|---|---|---|
| PSS-C4-SDK-EXAMPLE-SMOKE | JupyterLab or vLLM SDK example with smoke/evidence | manifest validator |
| PSS-C4-APP-MANIFEST-SOURCE | canonical app manifest fixtures or generation path | App SDK readiness |
| PSS-C4-ARTIFACT-TRUST | artifact trust, promotion, provenance, SBOM contract | registry/artifacts |
| PSS-C4-PRODUCT-ONBOARDING | next-product checklist for scopes, usage, audit, notification, status | registry baseline |

Exit criteria:

- internal app developers can follow portal/SDK paths without reading seed SQL;
- artifact promotion is trust-state driven;
- the next product can onboard without reinventing IAM, billing, audit, status,
  notification, or artifact contracts.
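
"Trust-state driven" promotion can be sketched as a small state machine with a gate on SBOM and provenance references. The state names, transitions, and field names below are hypothetical. The actual trust states belong to the artifact trust contract.

```python
from typing import Dict, Set

# Hypothetical trust states and allowed transitions.
TRANSITIONS: Dict[str, Set[str]] = {
    "unverified": {"verified", "quarantined"},
    "verified": {"promoted", "quarantined"},
    "promoted": {"quarantined"},
    "quarantined": set(),
}


def promote(artifact: dict) -> dict:
    """Return a promoted copy of the artifact, or raise if the gate fails."""
    state = artifact.get("trust_state")
    if "promoted" not in TRANSITIONS.get(state, set()):
        raise ValueError(f"cannot promote from trust state {state!r}")
    for ref in ("sbom_ref", "provenance_ref"):
        if not artifact.get(ref):
            raise ValueError(f"missing {ref}")
    # The input is left untouched so the pre-promotion record stays auditable.
    return {**artifact, "trust_state": "promoted"}
```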

## Phase 5 - Notification, Policy, Tenant, And External Surfaces

Goal: complete the shared surfaces that make the platform usable by multiple
products and personas.

Scope:

- notification template registry and dispatch intent;
- policy/entitlement registry and versioned snapshots;
- quota composition model across GPU, app runtime, token, storage, and network;
- tenant customization boundaries without shared-service forks;
- portal tracks for public, customer, partner, and internal views.

Primary tasks:

| ID | Output | Depends on |
|---|---|---|
| PSS-C5-NOTIFICATION-TEMPLATES | template registry and delivery intent model | notification ownership map |
| PSS-C5-POLICY-SNAPSHOTS | versioned policy/entitlement snapshots | policy registry |
| PSS-C5-QUOTA-COMPOSITION | cross-product quota model | billing/entitlement |
| PSS-C5-PORTAL-TRACKS | persona tracks and access-control assumptions | Docusaurus portal |

Phase 5 output:

- `Notification_Policy_Portal_Surface_Model_v1.md` defines the shared contract
  for template registry/delivery intent, policy and entitlement snapshots,
  cross-product quota composition, tenant customization, and portal publication
  tracks.

Exit criteria:

- notices, quotas, and feature flags are not product-specific forks;
- the portal information architecture (IA) can serve internal and future
  external audiences;
- tenant-specific behavior is extension/configuration, not copied services.

## Phase 6 - Runtime Trust And Extraction Decisions

Goal: decide what should stay in-process, split into workers, or become
separately deployed services based on operational evidence.

Scope:

- Secrets/PKI purpose registry and credential-delivery contract;
- certificate/secret rotation evidence in Status/Ops;
- service-auth packets for extraction candidates;
- extraction readiness packets for evidence/status, billing usage, notification,
  artifact trust, and service-auth paths;
- kind/platform-control smoke for any split worker or extracted service.

Primary tasks:

| ID | Output | Depends on |
|---|---|---|
| PSS-C6-SECRETS-PKI-CONTRACT | secret purpose, credential delivery, cert lifecycle evidence | Security/Infra |
| PSS-C6-ROTATION-EVIDENCE | cert/secret age and rotation health in Status/Ops | statusops |
| PSS-C6-EXTRACTION-PACKETS | candidate packets with keep/split/extract decisions | previous phases |
| PSS-C6-SPLIT-SMOKE | smoke/rollback evidence for any split deployable | deployment readiness |

Phase 6 outputs:

- `Secrets_PKI_Runtime_Trust_Model_v1.md` defines the secret purpose,
  credential delivery, cert lifecycle, rotation evidence, and runtime-trust
  coordination contract.
- `.fairway/artifacts/platform-shared-services-extraction-packets.yaml`
  records keep/split/extract recommendations for evidence/status, billing
  usage, notification dispatch, artifact trust, service auth, and Secrets/PKI.
- `scripts/ci/platform_status_snapshot.sh` emits `runtime-cert-rotation` and
  `secret-rotation` Status/Ops component rows.

Exit criteria:

- extraction decisions are evidence-backed, not diagram-driven;
- any separated deployable has service auth, degradation, rollback, and smoke
  evidence;
- runtime trust and credential posture are visible to security and ops.
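
The second exit criterion can be checked mechanically against a decision packet. The sketch below assumes a packet is a mapping with `decision` and `evidence` keys. That shape is illustrative and is not the format of the extraction packets YAML. It also assumes that `keep_in_process` needs no separation evidence, which this roadmap does not state explicitly.

```python
from typing import Any, Dict, List

DECISIONS = {"keep_in_process", "split_worker", "extract_service"}
# Evidence the exit criteria require for any separated deployable.
SEPARATION_EVIDENCE = ("service_auth", "degradation", "rollback", "smoke")


def packet_gaps(packet: Dict[str, Any]) -> List[str]:
    """Return the reasons a decision packet is not yet acceptable."""
    decision = packet.get("decision")
    if decision not in DECISIONS:
        return [f"unknown decision: {decision!r}"]
    if decision == "keep_in_process":
        return []  # assumption: staying in process needs no separation evidence
    evidence = packet.get("evidence") or {}
    return [
        f"missing evidence: {kind}"
        for kind in SEPARATION_EVIDENCE
        if not evidence.get(kind)
    ]
```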

## Suggested Sequence

Tracks can run in parallel, but work should respect these phase dependencies:

1. Phase 1 first for operating record hardening.
2. Phase 2 next for product-to-platform service auth.
3. Phase 3 and Phase 4 can run in parallel once registry/IAM contracts are
   stable.
4. Phase 5 follows once notification/policy owners accept the shared model.
5. Phase 6 runs continuously for Secrets/PKI, but physical extraction decisions
   wait until the relevant service has evidence from earlier phases.
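
The sequence above can be expressed as a dependency graph and grouped into batches that may run in parallel. This is only a sketch using the standard-library `graphlib` module (Python 3.9+). The edge from Phase 5 to Phase 2 is an assumption, since the text ties Phase 5 to owner acceptance and not to a specific phase, and only the extraction part of Phase 6 is modeled as waiting on earlier phases.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# node -> set of phases that must finish first
PHASE_DEPS: Dict[str, Set[str]] = {
    "phase-1": set(),
    "phase-2": {"phase-1"},
    "phase-3": {"phase-2"},
    "phase-4": {"phase-2"},
    "phase-5": {"phase-2"},  # assumption, see lead-in
    "phase-6-extraction": {"phase-1", "phase-2", "phase-3", "phase-4", "phase-5"},
}


def parallel_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group phases into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With the edges above, `parallel_batches(PHASE_DEPS)` yields four batches: Phase 1, then Phase 2, then Phases 3 to 5 together, then the Phase 6 extraction decisions.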

## First Follow-On Epics

| Epic | Why first |
|---|---|
| `PSS-C1-OPERATING-RECORD` | converts foundation evidence/status into release, UAT, audit, and operator reality |
| `PSS-C2-SERVICE-AUTH` | unlocks safe product-to-platform calls and later extraction |
| `PSS-C4-APP-SDK-DEVELOPER-PATH` | exposes the SDK as an internal developer platform instead of an internal implementation detail |
| `PSS-C3-USAGE-UNIT-METERING` | prevents the next product from inventing its own billing/metering path |

## Post-IAM Production Completion Backlog

The IAM department hierarchy slice is complete enough to unblock PSSM
production-completion planning. The next Fairway epic is
`PSSM-PRODUCTION-COMPLETION-BACKLOG`.

Do not treat this as another foundation reshuffle. The goal is to turn
platform-shared services from proven foundation slices into repeatable
production contracts.

### Organizing Principles

1. Product-facing contracts before physical extraction.
2. Registry/version snapshots before new products depend on mutable meanings.
3. Credential custody and service-auth before cross-process service calls.
4. Billing attribution and policy decisions remain platform-owned.
5. Status/evidence must prove runtime posture, not only document it.
6. Product onboarding is a repeatable checklist, not a Token Factory-only path.

### Active Fairway Breakdown

| Fairway task | Area | Output |
|---|---|---|
| `PSSM-PROD-C0-PLAN-001` | Architecture | post-IAM PSSM production completion plan, sequencing, and exit criteria |
| `PSSM-PROD-C1-REGISTRY-MATURITY-001` | Registry | schema/API-backed registry maturity plan and version snapshot rules |
| `PSSM-PROD-C2-CREDENTIAL-CUSTODY-001` | Security | two-tier credential custody, rotation, one-time reveal, and compromise evidence |
| `PSSM-PROD-C3-PRODUCT-ONBOARDING-CONTRACT-001` | Product onboarding | next-product onboarding packet template and required registry/billing/audit/status outputs |
| `PSSM-PROD-C4-POLICY-QUOTA-CAPACITY-001` | Policy / entitlements | cross-product quota and capacity-reservation model |
| `PSSM-PROD-C5-ANALYTICS-OLAP-BOUNDARY-001` | Data platform | OLTP/OLAP usage analytics boundary and rollup requirements |
| `PSSM-PROD-C6-RECONCILIATION-EVIDENCE-001` | Runtime / ops | provider/runtime reconciliation, orphan cleanup, quarantine, and evidence model |
| `PSSM-PROD-C7-STATUS-EVIDENCE-MATURITY-001` | Status / evidence | service health, incident, release, SLO, and degradation evidence maturity |
| `PSSM-PROD-C8-FACADE-DEPTH-REPLACEMENT-001` | Backend architecture | replace thin compatibility facades with deeper platform-owned contracts |
| `PSSM-PROD-C9-RELEASE-PROFILE-GATES-001` | Release engineering | environment/profile gates for contracts, secrets, DNS/TLS, packages, migrations, UAT, rollback |
| `PSSM-PROD-C10-EXTRACTION-DECISION-PACKETS-001` | Ops / architecture | keep/split/extract decision packets for mature shared-service candidates |
| `PSSM-PROD-C11-SERVICE-LEVEL-CICD-OPERATING-MODE-001` | Release engineering | deferred service/domain-level CI/CD operating model after PSSM maturity |
| `PSSM-PROD-C12-ERROR-OBSERVABILITY-SWEEP-001` | Ops / backend | API error envelope, correlation ID, and server-side cause logging sweep |
| `PSSM-PROD-C13-ERROR-OBSERVABILITY-AUDIT-GATE-001` | Backend / ops | exhaustive production-readiness audit of API and worker error paths |

Current C3-C6, C9, and C11 outputs:

- `Product_Onboarding_Executable_Packet_v1.md` turns the onboarding checklist
  into a required packet shape with fail-closed validation.
- `Platform_Policy_Quota_Capacity_Composition_v1.md` defines the global ->
  plan -> organization -> department -> project decision order, quota
  dimensions, and capacity reservation posture.
- `Platform_Runtime_Reconciliation_Evidence_Model_v1.md` defines provider and
  runtime drift classification, evidence records, quarantine/cleanup posture,
  and API-first operator verification.
- `Platform_Usage_Analytics_OLTP_OLAP_Boundary_v1.md` separates hot
  usage/rating paths from department/project/API-key/model rollups and
  dashboard query sources.
- `Platform_Release_Profile_Gates_v1.md` defines environment/profile gates,
  required release evidence, profile-specific failure handling, and graduation
  from report-only to blocking.
- `doc/operations/Platform_Service_Level_CI_CD_Operating_Model_v1.md` defines
  the post-PSSM CI/CD model: global contract gates, domain-local gates,
  consumer smokes, service evidence bundles, ownership-map routing, and
  independent promotion eligibility.
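
The decision order defined in `Platform_Policy_Quota_Capacity_Composition_v1.md` can be illustrated with a small composition function. This sketch assumes that each level may only narrow the limit inherited from the level above it, so the effective quota is the minimum of all defined limits. The referenced document may define different override semantics.

```python
from typing import Dict, List, Optional

DECISION_ORDER: List[str] = [
    "global", "plan", "organization", "department", "project",
]


def effective_quota(limits: Dict[str, Optional[int]]) -> Optional[int]:
    """Walk the decision order; each defined level can only tighten the limit."""
    effective: Optional[int] = None
    for level in DECISION_ORDER:
        limit = limits.get(level)
        if limit is None:
            continue  # level sets no limit for this quota dimension
        effective = limit if effective is None else min(effective, limit)
    return effective
```

Under this assumption a project-level limit higher than the plan limit has no effect, so a lower level cannot widen what a higher level granted.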

### Future CI/CD Operating Model Follow-Up

After the PSSM production-completion work matures, CI/CD should move from a
mostly monorepo-wide release validation model toward:

`contract-global + domain-local + consumer-smoke + service-evidence`

This is follow-up work, not a prerequisite for the current IAM or first PSSM
planning slice. The split becomes useful once platform-shared services have
versioned contracts, evidence/status maturity, release profile gates, and
keep/split/extract decision packets.

Candidate service/domain lanes:

- Platform shared services: IAM/access, billing/metering/payments,
  audit/evidence, status/ops, registry/artifacts, policy/entitlements,
  notification, and Secrets/PKI.
- Products: GPUaaS, App Platform, Token Factory, and future products that
  onboard through the platform registry and usage/billing contract.

Target rules:

- Global gates still validate bundled API/event contracts, schema compatibility,
  security posture, and cross-domain invariants.
- Domain-local gates run ownership-specific unit, integration, migration,
  policy, and UI checks from path/domain ownership maps.
- Consumer smokes prove product-to-platform compatibility before promotion.
- Service-level evidence bundles record contract version, migration posture,
  rollback posture, release profile, runtime health, and degraded-mode behavior.
- Independent promotion is only allowed for a domain after its extraction
  decision packet proves service auth, rollback, smoke, and degradation evidence.
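
Routing domain-local gates from a path ownership map could work roughly as follows. The map entries and domain names are hypothetical, and the longest-prefix rule and the fallback to a global lane are assumptions of this sketch.

```python
from typing import Dict, Iterable, List

# Hypothetical path-prefix ownership map.
OWNERSHIP_MAP: Dict[str, str] = {
    "packages/platform/iam/": "iam",
    "packages/platform/billing/": "billing",
    "packages/platform/": "platform-shared",
}


def domains_for_change(changed_paths: Iterable[str]) -> List[str]:
    """Map changed files to the domain-local gate lanes that must run."""
    domains = set()
    for path in changed_paths:
        matches = [prefix for prefix in OWNERSHIP_MAP if path.startswith(prefix)]
        if matches:
            # The longest matching prefix is the most specific owner.
            domains.add(OWNERSHIP_MAP[max(matches, key=len)])
        else:
            domains.add("global")  # unowned paths fall back to global gates
    return sorted(domains)
```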

### Error Observability Follow-Up

The platform-control deploy exposed a class of operational failures where the
API returned a user-safe error such as `failed to bootstrap user context`, but
the actionable cause was only visible in Postgres logs. That keeps the external
contract safe, but it slows incident triage.

Future PSSM production work should include a clean sweep for API and worker
error handling:

- User responses continue to use canonical `ErrorResponse` envelopes with a
  required `correlation_id` and safe messages.
- Server logs include the same `correlation_id`, the underlying cause, the
  owning domain, and enough sanitized context to distinguish database,
  dependency, authz, validation, and local-defect failures.
- Database and migration errors are logged at the boundary where they are
  classified, not only left to Postgres logs.
- API handlers preserve domain sentinel errors and avoid collapsing all
  bootstrap/authz/setup failures into indistinguishable `internal_error`
  messages.
- Regression tests assert that representative 4xx and 5xx paths return a
  correlation ID and emit a cause-bearing sanitized log entry.

### Production Error Audit Gate

C12 (the error observability sweep) gives guard-backed and representative-path
confidence. Before production readiness, C13 runs a separate exhaustive audit
gate that inventories every API, gateway, worker, relay, bootstrap, and
privileged mutation failure path.

The audit must produce a tracked report with one row per failure path:

- owner and source file/function;
- public response status, error code, and user-safe message;
- whether the response is a canonical `ErrorResponse` with non-empty
  `correlation_id`;
- underlying cause classification: validation, authn, authz, database,
  upstream dependency, local defect, provider/runtime, or retryable
  infrastructure;
- sanitized log fields present, including `correlation_id`, owning domain,
  actor/scope/resource identifiers where available, and root cause;
- test or guard coverage proving the behavior;
- explicit fix task for any gap, with no undocumented waiver.

Production readiness should not be signed off while this audit has open S1/S2
findings. Any deferred S3 finding needs an owner, reason, and removal criteria.
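
The sign-off rule above can be checked mechanically over the tracked report. This is a sketch under assumptions: each finding is a mapping, and the keys `id`, `severity`, `status`, `owner`, `reason`, and `removal_criteria` are hypothetical field names.

```python
from typing import Dict, List


def signoff_blockers(findings: List[Dict[str, str]]) -> List[str]:
    """Return the reasons production readiness cannot be signed off yet."""
    blockers = []
    for finding in findings:
        status = finding.get("status")
        severity = finding.get("severity")
        if status == "open" and severity in ("S1", "S2"):
            blockers.append(f"{finding['id']}: open {severity} finding")
        elif status == "deferred" and severity == "S3":
            missing = [
                key for key in ("owner", "reason", "removal_criteria")
                if not finding.get(key)
            ]
            if missing:
                blockers.append(
                    f"{finding['id']}: deferred S3 missing {', '.join(missing)}"
                )
    return blockers
```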

### Exit Criteria For The Backlog Epic

- Every open or partial high/medium gap in `Platform_Architecture_Gap_Register_v1.md`
  is either closed by an implementation task or explicitly deferred by an ADR.
- The production error audit gate has no open S1/S2 findings and all remaining
  S3 findings have owners and removal criteria.
- Product onboarding can be executed for a future product without creating
  product-owned IAM, billing, audit, notification, status, policy, registry, or
  credential forks.
- Physical extraction decisions are documented as `keep_in_process`,
  `split_worker`, or `extract_service` with service-auth, degradation, rollback,
  and smoke evidence.

## Related Documents

- `Platform_Shared_Services_Model_v2.md`
- `Platform_Deployment_Extraction_Readiness_v1.md`
- `Platform_Evidence_Status_Slice_v1.md`
- `Platform_Registry_Contract_v1.md`
- `App_SDK_Readiness_Matrix_v1.md`
- `AI_Factory_Production_Readiness_Gap_Portfolio_v1.md`
