# Platform Runtime Reconciliation Evidence Model v1

Status: active architecture contract
Owner: Platform Ops / Runtime Owners
Last updated: 2026-06-03
Fairway task: `PSSM-PROD-C6-RECONCILIATION-EVIDENCE-001`

## Purpose

Define the shared reconciliation evidence model for provider and runtime
resources that can outlive platform records or drift away from intended state.

The model covers GPUaaS, App Platform runtime, storage, billing usage, and
future model-serving resources. It replaces ad hoc SQL queries and log
inspection with API-visible operator evidence.

## Reconciliation Principle

Every durable runtime resource must have:

1. an owning platform or product record;
2. a provider/runtime identity;
3. an expected lifecycle state;
4. an observed lifecycle state;
5. a drift classification;
6. a cleanup or quarantine posture;
7. evidence tied to a correlation id and an owner.

Operators should verify reconciliation through APIs/read models first. Direct
SQL is acceptable only during implementation or incident work when the owning
read model does not exist yet; repeated SQL means the read model is missing.
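
As a minimal sketch of the seven requirements above, the check below reports
which of them a resource is still missing. The `RuntimeResource` shape and its
field names are hypothetical; the contract lists the requirements but does not
prescribe this structure. The sketch also assumes an in-sync resource still
carries an explicit classification and posture value rather than an empty one.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical shape: one attribute per requirement in the list above.
@dataclass
class RuntimeResource:
    platform_record_id: Optional[str] = None    # 1. owning platform/product record
    provider_id: Optional[str] = None           # 2. provider/runtime identity
    expected_state: Optional[str] = None        # 3. expected lifecycle state
    observed_state: Optional[str] = None        # 4. observed lifecycle state
    drift_classification: Optional[str] = None  # 5. drift classification
    posture: Optional[str] = None               # 6. cleanup or quarantine posture
    correlation_id: Optional[str] = None        # 7. evidence: correlation id
    owner: Optional[str] = None                 # 7. evidence: owner


def missing_requirements(resource: RuntimeResource) -> list[str]:
    """Return the names of required fields that are unset or empty."""
    return [name for name, value in vars(resource).items() if not value]


resource = RuntimeResource(
    platform_record_id="alloc-1",
    provider_id="maas-1",
    expected_state="released",
    observed_state="active",
)
print(missing_requirements(resource))
```

A read model could surface this list directly, so an operator sees which
requirement is unmet without querying SQL.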

## Resource Families

| Family | Platform record | Provider/runtime identity | Drift examples | Owner |
|---|---|---|---|---|
| GPUaaS allocation | allocation, node task, inventory node | MAAS machine, node-agent identity, OS user, SSH route | active provider node with released allocation, unreachable node-agent, stale OS user | GPUaaS provisioning / Ops |
| App runtime | app instance, app route, workload record | OCI digest, runtime pod/process, proxy route | route exists after app deletion, wrong artifact digest, launch task stuck | App Platform runtime / Ops |
| Storage | bucket, volume, attachment, share | provider bucket/volume/share id | provider object exists after platform deletion, quota mismatch, orphan attachment | Platform Storage / product owner |
| Model serving | endpoint, model deployment, api key binding | gateway route, backend pool, model worker | accepted traffic for disabled endpoint, stale route, backend capacity mismatch | Token Factory / Platform Ops |
| Billing usage | usage record, rated line, ledger entry | accepted usage event, gateway/provider metering id | usage accepted but not rated, rated but not ledgered, duplicate usage | Platform Billing |

## Node-Agent Recovery Evidence Boundary

Node-agent drift is reconciled as a control-plane/runtime-management signal.
It must not be collapsed into workload downtime unless app-runtime or edge
evidence independently proves customer impact.

Required node-agent recovery evidence fields:

| Field | Meaning |
|---|---|
| `node_agent_state` | `healthy`, `stale`, `recovering`, `operator_action_required`, `security_action_required` |
| `node_agent_reason` | low-cardinality reason such as `agent_down`, `cert_expired`, `endpoint_profile_drift`, `recovery_token_missing`, `cert_untrusted_by_ingress`, `cloned_identity_detected` |
| `update_path` | `none`, `self_update`, `recovery_enrollment`, `operator_rebootstrap`, `full_reimage` |
| `expected_agent_version` | version or digest expected by release/profile gate |
| `observed_agent_version` | version or digest reported by heartbeat or local diagnostic evidence |
| `cert_issuer_fingerprint` | fingerprint of the node certificate issuer, never private key material |
| `ingress_trust_fingerprint` | fingerprint of the trusted node-api ingress CA |
| `terminal_stream_probe` | `passed`, `failed`, `not_required`, or `not_executed` |
| `workload_impact` | `none_observed`, `degraded`, `unavailable`, or `unknown` based on app/runtime and edge evidence |

Status/Ops and release gates should classify node-agent recovery evidence as:

- `pass`: node-agent healthy, expected version matches, no recovery reason, and
  required terminal/app-runtime probes passed;
- `warning`: management evidence stale but no workload or edge impact observed;
- `retry`: bounded automatic recovery or self-update is in progress;
- `investigate`: recovery evidence is incomplete, contradictory, or missing;
- `block_production`: S1/S2 recovery findings, cloned identity, trust-domain
  drift, missing rollback evidence for node-agent release, or dry-run-only
  evidence for required live recovery probes.
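
The classification above could be sketched as a single function over the
evidence fields. This is an illustration under stated assumptions, not the gate
implementation: the precedence order (block, then investigate, then retry,
warning, pass) is assumed, only the cloned-identity trigger for
`block_production` is modelled, and mapping `retry` to a `recovering` state
with an automatic `update_path` is an interpretation of "bounded automatic
recovery".

```python
# Subset of the evidence fields that this sketch requires to be present.
REQUIRED_FIELDS = (
    "node_agent_state", "update_path", "expected_agent_version",
    "observed_agent_version", "terminal_stream_probe", "workload_impact",
)

# Only cloned identity is modelled; S1/S2 findings, trust-domain drift,
# missing rollback evidence, and dry-run-only evidence are omitted here.
BLOCKING_REASONS = {"cloned_identity_detected"}

# Assumption: these update paths count as automatic recovery.
AUTOMATIC_UPDATE_PATHS = {"self_update", "recovery_enrollment"}


def classify_node_agent(evidence: dict) -> str:
    """Map node-agent recovery evidence to a gate classification."""
    if evidence.get("node_agent_reason") in BLOCKING_REASONS:
        return "block_production"
    if any(evidence.get(name) is None for name in REQUIRED_FIELDS):
        return "investigate"  # incomplete or missing evidence

    state = evidence["node_agent_state"]
    if state == "recovering" and evidence["update_path"] in AUTOMATIC_UPDATE_PATHS:
        return "retry"
    if state == "stale" and evidence["workload_impact"] == "none_observed":
        return "warning"

    healthy = (
        state == "healthy"
        and evidence["expected_agent_version"] == evidence["observed_agent_version"]
        and evidence.get("node_agent_reason") is None
        # A probe marked not_required is treated as satisfying the gate.
        and evidence["terminal_stream_probe"] in ("passed", "not_required")
    )
    # Anything else is treated as contradictory evidence.
    return "pass" if healthy else "investigate"
```

Falling through to `investigate` keeps the function fail-safe: evidence that
matches no rule is never reported as `pass`.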

## Drift Classification

| Classification | Meaning | Default posture |
|---|---|---|
| `orphan_provider_resource` | Provider/runtime resource exists without active platform owner | quarantine, then cleanup after grace window |
| `missing_provider_resource` | Platform record expects a live provider resource, but none is observed | block scheduling or mark degraded |
| `stale_lifecycle_state` | Platform and provider states disagree | retry reconciliation, then operator review |
| `identity_mismatch` | Provider identity does not match platform registry or credential binding | fail closed and require security review |
| `quota_or_capacity_mismatch` | Observed usage/capacity exceeds effective quota or reservation | deny new admission and emit policy evidence |
| `billing_mismatch` | Usage, rating, and ledger paths do not reconcile | stop financial automation that depends on the missing truth |
| `route_or_dns_mismatch` | Public/internal route exists with wrong target, cert, or product owner | disable route or mark release gate failed |
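
Two of the default postures above are staged ("quarantine, then cleanup" and
"retry, then operator review"). The sketch below shows one way such staging
might be computed. The `grace_window` and `max_retries` values are hypothetical
defaults, the returned strings reuse the `action.posture` values from the
evidence record, and falling back to `operator_review` for the remaining
classifications is a conservative simplification rather than the documented
posture.

```python
from datetime import datetime, timedelta, timezone


def next_posture(
    classification: str,
    first_seen_at: datetime,
    retry_count: int,
    now: datetime,
    grace_window: timedelta = timedelta(hours=24),  # hypothetical default
    max_retries: int = 3,                           # hypothetical default
) -> str:
    """Pick the next posture for a drift finding."""
    if classification == "orphan_provider_resource":
        # Quarantine first; cleanup only once the grace window has elapsed.
        # Owner and rollback checks are not modelled in this sketch.
        if now - first_seen_at >= grace_window:
            return "cleanup"
        return "quarantine"
    if classification == "stale_lifecycle_state":
        # Retry a bounded number of times, then hand over to an operator.
        return "retry" if retry_count < max_retries else "operator_review"
    if classification == "quota_or_capacity_mismatch":
        return "block_admission"
    # Assumption: everything else goes to a human reviewer.
    return "operator_review"


now = datetime(2026, 6, 3, tzinfo=timezone.utc)
print(next_posture("orphan_provider_resource", now - timedelta(hours=2), 0, now))
```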

## Evidence Record

```yaml
reconciliation_evidence:
  evidence_id: string
  correlation_id: string
  resource_family: gpuaas_allocation | app_runtime | storage | model_serving | billing_usage
  platform_owner:
    product_id: string
    owning_domain: string
    owning_package: string
  platform_record:
    record_type: string
    record_id: string
    expected_state: string
    expected_version: string
  provider_record:
    provider: maas | node_agent | app_runtime | storage_provider | gateway | billing_ingestion
    provider_id: string
    observed_state: string
    observed_version: string
    observed_at: string
  drift:
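    classification: orphan_provider_resource | missing_provider_resource | stale_lifecycle_state | identity_mismatch | quota_or_capacity_mismatch | billing_mismatch | route_or_dns_mismatch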
    severity: info | warning | critical
    first_seen_at: string
    last_seen_at: string
    retry_count: number
  action:
    posture: none | retry | quarantine | cleanup | block_admission | rollback | operator_review
    next_attempt_at: string
    runbook: string
  evidence:
    status_component_id: string
    release_gate_id: string
    invariant_id: string
    fairway_task_id: string
```
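
To make the record shape concrete, the sketch below builds one example record
as a Python dictionary and checks its enumerated fields against the values
listed in the schema. All identifiers and values in `example` are made up for
illustration, only part of the record is filled in, and the ISO 8601 timestamp
format is an assumption (the schema only says `string`).

```python
# Allowed values, copied from the schema above, keyed by field path.
ENUMS = {
    ("resource_family",): {
        "gpuaas_allocation", "app_runtime", "storage",
        "model_serving", "billing_usage",
    },
    ("provider_record", "provider"): {
        "maas", "node_agent", "app_runtime",
        "storage_provider", "gateway", "billing_ingestion",
    },
    ("drift", "severity"): {"info", "warning", "critical"},
    ("action", "posture"): {
        "none", "retry", "quarantine", "cleanup",
        "block_admission", "rollback", "operator_review",
    },
}


def enum_violations(record: dict) -> list[str]:
    """Return dotted paths of enum fields that are missing or invalid."""
    problems = []
    for path, allowed in ENUMS.items():
        value = record
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        if value not in allowed:
            problems.append(".".join(path))
    return problems


# Hypothetical, partial record: a released allocation whose node is still active.
example = {
    "evidence_id": "ev-0001",
    "correlation_id": "corr-0001",
    "resource_family": "gpuaas_allocation",
    "platform_record": {"record_type": "allocation", "record_id": "alloc-1",
                        "expected_state": "released"},
    "provider_record": {"provider": "maas", "provider_id": "maas-1",
                        "observed_state": "active",
                        "observed_at": "2026-06-03T10:00:00Z"},
    "drift": {"classification": "stale_lifecycle_state", "severity": "warning",
              "retry_count": 1},
    "action": {"posture": "retry"},
}
print(enum_violations(example))
```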

## Reconciliation Flow

```text
observe provider/runtime state
  -> join to platform owner/read model
  -> classify drift
  -> emit evidence item and status component update
  -> retry safe transient mismatches
  -> quarantine resources that can affect customers or security
  -> cleanup only after owner, grace window, and rollback checks pass
  -> preserve operator-visible proof
```

Cleanup must not hide evidence. The evidence record remains after the provider
resource is removed.
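
The join, classify, and retain-evidence steps of the flow can be sketched with
in-memory dictionaries. This is a simplified illustration: `reconcile` and
`cleanup` are hypothetical names, both inputs map a provider id to a lifecycle
state, and `expected` is assumed to hold only records that expect a live
provider resource.

```python
def reconcile(expected: dict, observed: dict) -> list[dict]:
    """Join platform expectations to observed provider state and classify drift."""
    findings = []
    for provider_id in sorted(set(expected) | set(observed)):
        want = expected.get(provider_id)
        seen = observed.get(provider_id)
        if want is None:
            classification = "orphan_provider_resource"
        elif seen is None:
            classification = "missing_provider_resource"
        elif want != seen:
            classification = "stale_lifecycle_state"
        else:
            continue  # in sync, nothing to report
        findings.append({
            "provider_id": provider_id,
            "expected_state": want,
            "observed_state": seen,
            "classification": classification,
            "posture": "none",
        })
    return findings


def cleanup(provider_resources: dict, finding: dict) -> None:
    """Remove the provider resource; the finding itself is kept and updated."""
    provider_resources.pop(finding["provider_id"], None)
    finding["posture"] = "cleanup"


expected = {"maas-1": "active", "maas-2": "active", "maas-3": "active"}
observed = {"maas-1": "active", "maas-2": "released", "maas-4": "active"}
evidence_log = reconcile(expected, observed)

orphan = evidence_log[-1]
cleanup(observed, orphan)
print(len(evidence_log), "maas-4" in observed)
```

After `cleanup`, the provider resource is gone from `observed`, but the finding
stays in `evidence_log` with its posture updated, which mirrors the rule that
cleanup must not hide evidence.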

## API And Read Model Expectations

| Surface | Required behavior |
|---|---|
| Reconciliation status | Summarize last run, drift counts, critical findings, blocked admissions, and stale scans. |
| Drift list | Filter by product, resource family, classification, severity, owner, state, and age. |
| Drift detail | Show expected state, observed state, evidence, retry history, and recommended action. |
| Retry/cleanup action | Audited, idempotent, and tied to a runbook and correlation id. |
| Status/Ops integration | Critical drift degrades the owning component and can fail release gates. |
| Evidence integration | Reconciliation findings are attachable to UAT, release, and incident bundles. |
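
The "audited, idempotent" requirement for retry/cleanup actions might be met by
keying each request on the finding, the action, and the correlation id. The
sketch below is an in-memory illustration only; `ActionLog` and its fields are
hypothetical, and a real service would persist the audit trail.

```python
class ActionLog:
    """In-memory sketch of an audited, idempotent action endpoint."""

    def __init__(self) -> None:
        self._by_key: dict = {}
        self.audit: list = []

    def request(self, evidence_id: str, action: str, correlation_id: str,
                runbook: str, actor: str) -> dict:
        # The same (finding, action, correlation id) is recorded only once.
        key = (evidence_id, action, correlation_id)
        if key in self._by_key:
            return self._by_key[key]  # replayed request: return the first result
        entry = {
            "evidence_id": evidence_id,
            "action": action,
            "correlation_id": correlation_id,
            "runbook": runbook,
            "actor": actor,
        }
        self._by_key[key] = entry
        self.audit.append(entry)
        return entry


log = ActionLog()
first = log.request("ev-0001", "retry", "corr-0001", "runbooks/retry.md", "ops-user")
again = log.request("ev-0001", "retry", "corr-0001", "runbooks/retry.md", "ops-user")
print(first is again, len(log.audit))
```

A repeated request returns the original entry and adds nothing to the audit
trail, so a client can safely retry after a timeout.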

## Product Coverage

| Product/resource | First required proof |
|---|---|
| GPUaaS | MAAS/node-agent/inventory reconciliation status and drift list cover unreachable nodes, stale assignments, and orphan cleanup. |
| App Platform | Runtime route/artifact reconciliation covers app launch, connect, decommission, and route readiness. |
| Storage | Bucket/volume/share reconciliation covers quota, attachment, provider orphan cleanup, and access posture. |
| Token Factory | Gateway/backend/model endpoint reconciliation covers accepted traffic, disabled endpoints, backend capacity, and usage emission. |

Token Factory is future work, but the reconciliation contract must be ready
first, so that Token Factory does not build gateway/runtime cleanup as
product-specific logic.

## Release And UAT Gates

Production-impacting releases must attach reconciliation evidence when they
change:

1. provisioning, node-agent, MAAS, or scheduler behavior;
2. app runtime launch/connect/decommission behavior;
3. storage provider lifecycle behavior;
4. gateway, route, DNS, TLS, or model-serving behavior;
5. billing usage ingestion, rating, or ledger reconciliation.

Gate failure must produce a clear path: forward fix, rollback, quarantine, or
approved residual risk.
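
A gate check along these lines could be sketched as follows. The area labels
are hypothetical short names for the five change areas listed above, and the
sketch only checks whether evidence is attached; it does not inspect what the
evidence says, which a real gate would also need to do.

```python
# Hypothetical labels for the five change areas that require evidence.
EVIDENCE_REQUIRED_AREAS = {
    "provisioning", "app_runtime", "storage_lifecycle",
    "gateway_routing", "billing_reconciliation",
}

# Resolution paths named in the contract for a failed gate.
FAILURE_PATHS = {"forward_fix", "rollback", "quarantine", "approved_residual_risk"}


def evaluate_gate(changed_areas, evidence_ids, failure_path=None) -> dict:
    """Pass when no evidence is required or some is attached; else fail with a path."""
    required_for = sorted(set(changed_areas) & EVIDENCE_REQUIRED_AREAS)
    if not required_for or evidence_ids:
        return {"result": "pass", "required_for": required_for}
    if failure_path not in FAILURE_PATHS:
        # A failing gate without a resolution path is not an acceptable outcome.
        raise ValueError("gate failure needs one of: " + ", ".join(sorted(FAILURE_PATHS)))
    return {"result": "fail", "required_for": required_for, "path": failure_path}


print(evaluate_gate(["docs"], []))
print(evaluate_gate(["storage_lifecycle"], [], failure_path="rollback"))
```

Raising on a missing path forces the caller to choose a resolution instead of
leaving a failed gate unresolved.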

## Related Docs

- `Platform_Architecture_Gap_Register_v1.md`
- `Platform_Evidence_Status_Slice_v1.md`
- `Platform_Evidence_Status_Schema_v1.md`
- `Platform_Evidence_Input_Mapping_v1.md`
- `../Data_Tiering_and_Database_Operations_Work_Plan_v1.md`