# SRE Tool Access Matrix v1

Status: production-readiness policy for `OPS-PROD-SRE-TOOL-ACCESS-MATRIX-001`

Owner: Platform Operations / Security Architecture

Last updated: 2026-06-06

## Purpose

Define which operational tools SREs may open directly and which must first be
consumed through GPUaaS platform APIs, read models, evidence bundles, or
runbooks.

The default rule is API/read-model first. Direct tool UI access is an operator
escape hatch, not the product access model.

## Access Classes

| Class | Meaning | Minimum control |
|---|---|---|
| API/read-model-first | Operators should start in `/platform/ops`, platform read models, evidence bundles, or runbooks. Raw tool UI is secondary. | GPUaaS authz, correlation IDs, audit/evidence links, unavailable-state handling. |
| Direct UI allowed | Direct tool UI is allowed for a named SRE use case that the platform does not yet cover. | Pomerium/browser OIDC, `platform.ops.read` or stronger launch authorization, DNS/TLS evidence, route smoke, audit/correlation where possible. |
| Internal-only | Tool is reachable only from internal network, port-forward, break-glass host, or approved admin path. | Named operator group, no public route by default, runbook and evidence for access. |
| Disabled | No operator UI route should be exposed until an explicit enablement task lands. | Empty profile URL, disabled-state UI, negative tests for legacy paths. |
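
The classes above can be modelled as a small enumeration with a resolver that decides how a tool link is rendered. This is a minimal sketch, not the platform's implementation: the `AccessClass` enum, the `launch_state` function, and the returned state names are hypothetical. The only behaviour taken from the table is that a disabled class or an empty profile URL yields an unavailable state.

```python
from enum import Enum
from typing import Optional


class AccessClass(Enum):
    # Values mirror the Class column of the table above.
    API_READ_MODEL_FIRST = "API/read-model-first"
    DIRECT_UI_ALLOWED = "Direct UI allowed"
    INTERNAL_ONLY = "Internal-only"
    DISABLED = "Disabled"


def launch_state(access_class: AccessClass, profile_url: Optional[str]) -> str:
    """Return a hypothetical render state for a tool link."""
    # Disabled tools and empty profile URLs never produce a launch link.
    if access_class is AccessClass.DISABLED or not (profile_url or "").strip():
        return "unavailable"
    # Internal-only tools are reached through an approved admin path.
    if access_class is AccessClass.INTERNAL_ONLY:
        return "internal-only"
    # API/read-model-first tools keep the raw UI as a secondary link.
    if access_class is AccessClass.API_READ_MODEL_FIRST:
        return "secondary-link"
    return "launch-link"
```

Checking the empty URL before any class-specific branch keeps the unavailable state as the fallback for every class, which matches the disabled-state control in the table.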

## Matrix

| Tool | Default class | Direct UI allowed use case | Required controls | Missing platform surface / follow-up |
|---|---|---|---|---|
| Grafana | Direct UI allowed | Correlation-driven dashboard pivot, incident timeline, dashboard panels not yet represented in `/platform/ops`. | Pomerium host route, browser OIDC, `platform.ops.read`, DNS/TLS smoke, no legacy `/p/grafana` or `/backend/p/grafana`, dashboard/runbook link from platform ops. | Keep API/read-model pressure through `OPS-PROD-OBSERVABILITY-READMODEL-GAPS-001`. |
| Prometheus | API/read-model-first | Direct Prometheus UI only for platform SRE query debugging when Grafana/API surface is insufficient. | Internal-only or Pomerium ops route when enabled; no tenant self-service; query access scoped to SRE. | Expose key health/query outcomes through platform status and evidence before direct UI expansion. |
| Loki | API/read-model-first | Direct Loki query only for SRE log diagnosis after correlation ID is known. | Prefer Grafana Explore or platform correlation lookup; raw Loki should be internal-only unless protected by Pomerium and SRE RBAC. | Add structured platform log/correlation pivots before exposing raw Loki broadly. |
| Tempo | API/read-model-first | Direct Tempo trace view only for trace-by-id diagnosis. | Prefer Grafana/Tempo pivot from correlation ID; direct UI internal-only or Pomerium protected. | Add trace freshness and sampled trace pivots to platform ops/evidence. |
| Temporal UI | Direct UI allowed in dev/demo; disabled in local-kind product profile by default | Workflow search, stuck activity diagnosis, retry history, schedule state when platform workflow read models are insufficient. | Pomerium host route, browser OIDC, `platform.ops.read`, route smoke, profile verification, no `/backend/p/temporal`; empty URL renders unavailable. | `OPS-PROD-TEMPORAL-WORKFLOW-READMODEL-GAPS-001`. |
| Netdata | Disabled by default | Node-local telemetry inspection during node bootstrap/runtime incidents after node capability exists. | Ops-only host route, Pomerium OIDC, `platform.ops.read`, node-local edge parity, unavailable state when URL empty, no `/p/netdata` or `/backend/p/netdata`. | `PSSM-PROD-C20-NETDATA-OPS-HOST-ROUTE-ENABLEMENT-001`. |
| Swagger | Direct UI allowed | Internal/developer API exploration and smoke validation where the docs portal/playground is not sufficient. | Pomerium host route or same-origin docs helper, no query-string tokens, bearer bridge without URL token leakage, route smoke. | API playground decision remains separate; keep same-origin docs helpers until host route fully covers developer workflow. |
| Redoc | Disabled by default | Read-only API reference where Docusaurus/Stoplight/Scalar docs path does not cover the need. | Pomerium host route when enabled; empty profile URL and unavailable state when disabled; no `/backend/p/redoc`. | Covered by docs portal/API reference follow-up; do not enable as a proxy workaround. |
| Registry UI | API/read-model-first | SRE artifact diagnosis when release evidence and registry read models are insufficient. | Prefer release evidence, artifact manifest APIs, and registry digest refs. Direct UI internal-only or Pomerium protected; no credential exposure. | `OPS-PROD-REGISTRY-OPS-READMODEL-001`. |
| Vault | Internal-only | Break-glass secret custody diagnosis and approved rotation/drill operations. | No public UI route by default, privileged SRE/security group only, audited change ticket, break-glass runbook, never expose root/unseal material. | `OPS-PROD-SECRETS-PKI-OPS-READMODEL-001`; live rotation remains approval-gated. |
| Keycloak Admin | Internal-only | Realm/client/user federation diagnosis not covered by IAM admin APIs. | Internal route or Pomerium ops/security route only, MFA, admin group, audit of admin actions, no tenant-user exposure. | Add IAM runtime/status read models before expanding direct admin use. |
| Kubernetes dashboards | Disabled | No dashboard exposure by default; use kubectl/GitOps evidence and platform status. | If reconsidered, require explicit security review, Pomerium OIDC, cluster RBAC, network policy, and negative tests. | `OPS-PROD-K8S-RUNTIME-BASELINE-001` remains the baseline; dashboard enablement is not a current goal. |
| Provider consoles: Proxmox, MAAS, Cloudflare, DNS, registry provider | Internal-only | Provider incident work, capacity reconciliation, DNS/TLS route repair, artifact pull diagnosis, or approved break-glass operations. | Provider-native MFA/RBAC, named operator group, change ticket, runbook, no product self-service path, before/after evidence for mutations, read-model follow-up when direct access recurs. | `doc/operations/Provider_Console_Breakglass_Access_Model_v1.md`; follow-up tasks listed there. |

## Direct UI Admission Rule

A direct tool UI route may be enabled only when all conditions hold:

1. A named operator use case cannot yet be satisfied by platform APIs/read
   models, `/platform/ops`, evidence bundles, runbooks, or CLI/API tools.
2. The route is host-based, not a legacy path-prefix route.
3. Pomerium browser OIDC protects the route.
4. GPUaaS launch visibility requires `platform.ops.read` or a stronger role.
5. DNS/TLS evidence exists for the environment profile.
6. Smoke coverage verifies that the route is either reachable or explicitly
   disabled.
7. `/platform/ops` renders an unavailable state when the profile URL is empty.
8. Legacy paths such as `/p/*` and `/backend/p/*` remain negative-tested.
9. The route has a removal or read-model replacement condition.
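
The nine conditions above form an all-or-nothing gate, which can be sketched as a checklist record with a helper that reports unmet items. This is an illustrative sketch only: `RouteAdmission`, its field names, and both helper functions are hypothetical and are not part of any existing platform code.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RouteAdmission:
    # One boolean per numbered condition, in the same order as the list above.
    named_use_case_uncovered: bool
    host_based_route: bool
    pomerium_oidc: bool
    requires_platform_ops_read: bool
    dns_tls_evidence: bool
    smoke_coverage: bool
    unavailable_state_on_empty_url: bool
    legacy_paths_negative_tested: bool
    removal_condition_defined: bool


def unmet_conditions(admission: RouteAdmission) -> List[str]:
    """List the names of conditions that are not yet satisfied."""
    return [f.name for f in fields(admission) if not getattr(admission, f.name)]


def may_enable(admission: RouteAdmission) -> bool:
    """A route may be enabled only when every condition holds."""
    return not unmet_conditions(admission)
```

Returning the unmet condition names, rather than a bare boolean, gives a reviewer the specific gaps to close before the route is reconsidered.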

## API-First Operator Path

Default operator sequence:

1. Start from `/platform/ops` or the incident/runbook index.
2. Use platform status, evidence bundles, correlation lookup, and canonical
   resource names.
3. Use Grafana/Tempo/Loki/Prometheus pivots only after the incident class and
   correlation ID are known.
4. Use direct provider/tool consoles only when the runbook calls for that tool
   and the action is read-only or approved.
5. Record provider/tool use in incident notes or Fairway evidence when it
   changes operator decisions.
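
Steps 3 and 4 of the sequence above are preconditions, and they can be expressed as two small predicates. This is a hedged sketch: the function and parameter names are hypothetical, and how incident class, correlation ID, or approval are actually recorded is not specified in this document.

```python
from typing import Optional


def pivots_allowed(incident_class: Optional[str], correlation_id: Optional[str]) -> bool:
    """Step 3: observability pivots need both the incident class and correlation ID."""
    return bool(incident_class) and bool(correlation_id)


def console_allowed(runbook_calls_for_tool: bool, read_only: bool, approved: bool) -> bool:
    """Step 4: direct consoles need a runbook reference and a read-only or approved action."""
    return runbook_calls_for_tool and (read_only or approved)
```

Under these assumptions, a mutating console action with no approval is rejected even when the runbook names the tool.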

## Current Decisions

| Decision | Result |
|---|---|
| Restore legacy path-prefix proxy for tools | No. Retired `/p/*` and `/backend/p/*` paths remain negative-tested. |
| Make Temporal UI broadly available | No. `OPS-PROD-TEMPORAL-UI-OPS-ACCESS-001` allows dev/demo ops-only host routes as a temporary escape hatch; local-kind product profile remains disabled by default and production remains approval-gated. |
| Expose Kubernetes dashboard | No. Dashboard exposure is not a current production goal. |
| Expose Vault UI publicly | No. Vault remains internal-only/break-glass. |
| Use Grafana as an SRE tool | Yes, with Pomerium OIDC, platform ops launch visibility, and route smoke evidence. |
| Use Swagger as a tool | Yes, for internal/developer API exploration with token-safe behavior. |
| Treat docs portal/API playground as replacement for every API tool | Not yet. API playground work is separate; do not remove existing token-safe docs helpers until replacement coverage exists. |
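
The first decision above depends on negative tests for the retired `/p/*` and `/backend/p/*` prefixes. The sketch below shows one way to classify a request path and to enumerate the legacy paths a negative test could probe for a tool. The helper names and the tool slug argument are assumptions for illustration; this is not the content of the existing CI guard script, and the response a retired path should return is left out because this document does not specify it.

```python
from typing import List

# Retired path prefixes named in this document.
LEGACY_PREFIXES = ("/p", "/backend/p")


def is_legacy_tool_path(path: str) -> bool:
    """True when the path is a retired prefix or sits under one."""
    return any(path == p or path.startswith(p + "/") for p in LEGACY_PREFIXES)


def legacy_paths_for(tool_slug: str) -> List[str]:
    """Legacy paths that a negative test could probe for one tool."""
    return [f"{prefix}/{tool_slug}" for prefix in LEGACY_PREFIXES]
```

Matching on `prefix + "/"` rather than a bare `startswith("/p")` avoids flagging unrelated routes such as `/platform/ops`.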

## Follow-Up Tasks

| Task | Purpose |
|---|---|
| `OPS-PROD-TEMPORAL-UI-OPS-ACCESS-001` | Decide and wire Temporal UI as an ops-only Pomerium surface or keep it disabled with unavailable-state proof. |
| `OPS-PROD-TEMPORAL-WORKFLOW-READMODEL-GAPS-001` | Define workflow search, retry-history, stuck-activity, and schedule-status read models that replace normal Temporal UI dependence. |
| `OPS-PROD-OBSERVABILITY-READMODEL-GAPS-001` | Identify Grafana/Prometheus/Loki/Tempo use cases still forcing direct UI access and turn them into platform read-model/evidence tasks. Output: `doc/operations/Observability_Read_Model_Gap_Map_v1.md`. |
| `OPS-PROD-REGISTRY-OPS-READMODEL-001` | Add or validate artifact/registry operator read models before exposing registry UI broadly. Output: `doc/operations/Registry_Ops_Read_Model_Gap_Map_v1.md`. |
| `OPS-PROD-SECRETS-PKI-OPS-READMODEL-001` | Add or validate secrets/PKI custody, rotation, and break-glass evidence surfaces before expanding Vault direct use. Output: `doc/operations/Secrets_PKI_Ops_Read_Model_Gap_Map_v1.md`. |
| `OPS-PROD-PROVIDER-CONSOLE-BREAKGLASS-001` | Define provider-console break-glass access and evidence expectations for Proxmox, MAAS, Cloudflare, DNS, and registry providers. Output: `doc/operations/Provider_Console_Breakglass_Access_Model_v1.md`. |
| `OPS-PROD-TOOL-ROUTE-SMOKE-COVERAGE-001` | Keep direct allowed/disabled tool routes covered by profile verification, `scripts/ci/sre_tool_route_smoke_coverage_guard.sh`, and no-legacy negative tests. |
