# Vulnerability Remediation SLA v1

Status: production-readiness policy
Owner: Security / Ops / Governance
Last updated: 2026-06-05

## Purpose

Define how GPUaaS triages, owns, remediates, waives, and reports
vulnerability findings from SAST, secret scanning, dependency/SCA, image
scanning, DAST, and release-hardening checks.

This policy turns security scan output into release evidence. It is paired
with:

- `scripts/ci/security_promotion_gate.sh`
- `doc/operations/Security_Scan_Promotion_Gate_Runbook.md`
- `doc/governance/security_scan_exceptions.json`

## Severity Model

| Severity | Examples | Initial owner | Production promotion posture |
|---|---|---|---|
| Critical | active exploitable RCE, exposed credential, auth bypass, critical CVE with reachable code, public edge exploit path | Security assigns within 4h | Blocks promotion unless fixed or an emergency exception is approved |
| High | high CVE with reachable package, tenant-isolation bypass, privilege escalation, unsafe secret handling, high DAST finding | Security assigns within 1 business day | Blocks promotion unless fixed or approved exception exists |
| Medium | defense-in-depth issue, hardening gap, non-reachable high-noise static finding after triage | Owning domain within 3 business days | Does not block by default; can block if security marks release-impacting |
| Low | style-level hardening finding, non-production-only scanner noise, informational rule | Owning domain during normal backlog triage | Advisory unless repeated or policy-sensitive |

Scanner severities are inputs, not final authority. Security can raise or lower
severity when evidence shows reachability, exploitability, tenant impact,
secret exposure, public edge exposure, or compensating controls.

## Remediation Clocks

| Severity | Triage target | Mitigation target | Remediation target | Max exception duration |
|---|---:|---:|---:|---:|
| Critical | 4 hours | 24 hours | 72 hours | 7 days |
| High | 1 business day | 3 business days | 14 calendar days | 30 days |
| Medium | 3 business days | 10 business days | 45 calendar days | 90 days |
| Low | 10 business days | best effort | 90 calendar days | 180 days |
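As a minimal sketch of the remediation column, the deadline can be derived from the severity and the first-detection timestamp. The function and constant names here are hypothetical. Only the calendar-based remediation targets are modeled; business-day triage and mitigation targets and exception pauses are left out.

```python
from datetime import datetime, timedelta, timezone

# Remediation targets from the table, expressed in calendar time.
REMEDIATION_TARGET = {
    "critical": timedelta(hours=72),
    "high": timedelta(days=14),
    "medium": timedelta(days=45),
    "low": timedelta(days=90),
}


def remediation_due_at(severity: str, first_seen_at: datetime) -> datetime:
    """Return the remediation deadline for a finding first seen at first_seen_at."""
    return first_seen_at + REMEDIATION_TARGET[severity]


def is_overdue(severity: str, first_seen_at: datetime, now: datetime) -> bool:
    """A finding is overdue once `now` is strictly past its remediation deadline."""
    return now > remediation_due_at(severity, first_seen_at)


# Example: a high finding first seen on 2026-06-05 is due 14 calendar days later.
first_seen = datetime(2026, 6, 5, tzinfo=timezone.utc)
print(remediation_due_at("high", first_seen).isoformat())
```

Keeping the targets in one lookup table means a policy change to the SLA only touches the data, not the deadline logic.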

Clock start:

- first detection in a release, branch, scheduled scan, UAT/security run, or
  manual security review;
- scanner availability failure for a required production gate starts the same
  clock as a high finding until the scanner is restored or explicitly waived;
- resurfaced findings restart the clock unless the original remediation task is
  still active and the finding was not reintroduced by a new change.

Clock pause:

- paused only while an approved exception exists with owner, approver, expiry,
  impact, compensating control, and follow-up task;
- while security awaits external vendor confirmation, paused only when a
  mitigation exists and the exception record links the external case.

Clock stop:

- fixed in code/config/infrastructure and verified by the original scanner or
  an equivalent reproducer;
- false positive confirmed by security and recorded as an approved exception or
  scanner rule suppression with rationale;
- risk accepted through the approved exception path; the clock stays stopped
  only until the exception expires.

Expired exceptions reopen the clock and block production-style promotion for
critical/high-class findings.
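The pause, stop, and reopen rules above can be sketched as a small state function. This is an illustration under stated assumptions, not the gate's actual logic: the function name and parameters are hypothetical, an unexpired approved exception is reported as `paused`, and an exception is treated as expired at the instant of its expiry timestamp.

```python
from datetime import datetime, timezone
from typing import Optional


def clock_state(
    verified_fixed: bool,
    confirmed_false_positive: bool,
    exception_expiry: Optional[datetime],
    now: datetime,
) -> str:
    """Return 'stopped', 'paused', or 'running' for one finding's SLA clock."""
    # A verified fix or a security-confirmed false positive stops the clock.
    if verified_fixed or confirmed_false_positive:
        return "stopped"
    # An approved exception holds the clock only until it expires (assumption:
    # the caller passes None when no approved exception exists).
    if exception_expiry is not None and now < exception_expiry:
        return "paused"
    # No exception, or an expired one: the clock is running again.
    return "running"


now = datetime(2026, 6, 10, tzinfo=timezone.utc)
print(clock_state(False, False, datetime(2026, 6, 9, tzinfo=timezone.utc), now))
```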

## Ownership Routing

| Finding source | Primary owner | Secondary owner | Required evidence |
|---|---|---|---|
| Go code SAST | Backend | Security | test or code review proving fixed path |
| Frontend SAST / browser DAST | Frontend | Security / Product | browser or route reproducer plus user-safe error posture when applicable |
| Dependency / SCA | Owning runtime package | Security / Ops | upgraded lockfile or documented vendor posture |
| Image / container | Ops | Backend / Frontend | rebuilt image digest and image scan summary |
| Secret scan | Security | Owning committer/domain | secret revocation or proof placeholder was non-secret |
| Edge/API DAST | Security / Ops | Backend / Product | route, auth, WAF, or app fix plus DAST rerun |
| Release hardening | Ops | Architecture / Governance | hardening summary and approved waiver if not fixed |

Security owns severity and exception approval. The domain owner owns the fix.
Ops owns release gating, evidence capture, and promotion blocking behavior.
Governance owns waiver hygiene and expiry review.
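For tooling that assigns findings automatically, the routing table could be carried as a plain lookup. The sketch below is only illustrative: the source keys (`go_sast`, `image`, and so on) and the `route` helper are hypothetical names, while the owner values are copied from the table.

```python
# Finding source -> (primary owner, secondary owner), copied from the routing table.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
ROUTING = {
    "go_sast": ("Backend", "Security"),
    "frontend_sast_dast": ("Frontend", "Security / Product"),
    "dependency_sca": ("Owning runtime package", "Security / Ops"),
    "image": ("Ops", "Backend / Frontend"),
    "secret_scan": ("Security", "Owning committer/domain"),
    "edge_api_dast": ("Security / Ops", "Backend / Product"),
    "release_hardening": ("Ops", "Architecture / Governance"),
}


def route(source: str) -> dict:
    """Return the primary and secondary owner for a finding source."""
    try:
        primary, secondary = ROUTING[source]
    except KeyError:
        # Unknown sources fail loudly rather than defaulting to an owner.
        raise ValueError(f"no routing rule for finding source {source!r}") from None
    return {"primary": primary, "secondary": secondary}


print(route("image"))
```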

## Release Gate Behavior

Production-style promotion must block when:

1. a required scanner summary is missing;
2. a required scanner is skipped or unavailable;
3. a critical/high-class finding is present without a valid exception;
4. an exception is expired or missing required fields;
5. a finding has resurfaced after being marked remediated.

Production-style promotion can proceed with visible residual risk when:

- the finding is medium/low and not marked release-impacting; or
- a critical/high finding has an approved, unexpired exception with owner,
  approver, expiry, impact, compensating control, and follow-up task.

Approved residual risk is not a clean pass. It should appear as `waived` or
`partial` evidence in Status/Ops and release packets.
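The blocking and proceed rules above can be sketched as a single decision function. This is a simplified illustration, not the behavior of `security_promotion_gate.sh`: the `Finding` type, its field names, and the three result strings are assumptions made for this example. It takes only open findings as input, and it applies the exception check to critical/high and release-impacting findings only.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str                    # "critical" | "high" | "medium" | "low"
    release_impacting: bool = False  # set by security for medium/low findings
    resurfaced: bool = False         # reappeared after being marked remediated
    exception_valid: bool = False    # approved, unexpired, all required fields present


def gate_decision(findings, summaries_present=True, scanners_ran=True) -> str:
    """Return 'blocked', 'proceed_with_residual_risk', or 'pass'."""
    # Rules 1 and 2: missing summaries or skipped/unavailable scanners block.
    if not summaries_present or not scanners_ran:
        return "blocked"
    for f in findings:
        # Rule 5: a resurfaced finding blocks.
        if f.resurfaced:
            return "blocked"
        # Rules 3 and 4: blocking-class findings need a valid exception.
        blocking_class = f.severity in ("critical", "high") or f.release_impacting
        if blocking_class and not f.exception_valid:
            return "blocked"
    # Any remaining open finding is visible residual risk, not a clean pass.
    return "proceed_with_residual_risk" if findings else "pass"


print(gate_decision([Finding("high", exception_valid=True)]))
```

Returning a distinct value for residual risk keeps a waived promotion from being reported the same way as a clean one.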

## Evidence Report Model

Each security scan promotion evidence report should expose the following
minimum fields so Status/Ops can show open, overdue, remediated, waived, and
resurfaced findings by owner and environment profile.

```json
{
  "schema_version": "1.0",
  "generated_at": "2026-06-05T00:00:00Z",
  "source_commit": "<sha>",
  "environment_profile": "dev|demo|staging|production|platform-control",
  "release_branch": "master|release/platform-control",
  "summary": {
    "open": 0,
    "overdue": 0,
    "remediated": 0,
    "waived": 0,
    "resurfaced": 0
  },
  "findings": [
    {
      "id": "scanner:finding-key",
      "scanner": "semgrep|gitleaks|govulncheck|trivy|zap|hardening",
      "severity": "critical|high|medium|low",
      "state": "open|overdue|remediated|waived|resurfaced|false_positive",
      "owner": "backend|frontend|ops|security|governance|product",
      "environment_scope": ["platform-control"],
      "first_seen_at": "2026-06-05T00:00:00Z",
      "last_seen_at": "2026-06-05T00:00:00Z",
      "due_at": "2026-06-19T00:00:00Z",
      "remediated_at": null,
      "exception_id": null,
      "followup_task": "SEC-FIX-EXAMPLE-001",
      "release_blocking": true,
      "evidence_uri": "dist/security/promotion-gate/security-promotion-gate.json"
    }
  ]
}
```

The initial implementation derives this report with:

```bash
scripts/ci/vulnerability_sla_summary.sh
```

It consumes `security-promotion-gate.json`, the security exception registry,
and optional finding-state input for first-seen, remediated, and resurfaced
tracking. A later Status/Ops worker can persist the same shape into platform
evidence tables.
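As an illustration of how the `summary` block relates to the `findings` array, the counts can be derived by tallying each finding's `state`. This is a sketch and not the logic of `vulnerability_sla_summary.sh`; the helper names are hypothetical, and treating `open`, `overdue`, and `resurfaced` as the unresolved states for the per-owner view is an assumption.

```python
from collections import Counter

# Keys of the "summary" object in the report model; false_positive findings
# have no summary key in that model, so they are not counted here.
SUMMARY_STATES = ("open", "overdue", "remediated", "waived", "resurfaced")


def summarize(findings: list) -> dict:
    """Count findings per state, always emitting every summary key."""
    counts = Counter(f["state"] for f in findings)
    return {state: counts.get(state, 0) for state in SUMMARY_STATES}


def unresolved_by_owner(findings: list) -> dict:
    """Count findings that still need work, grouped by owner."""
    unresolved = ("open", "overdue", "resurfaced")
    return dict(Counter(f["owner"] for f in findings if f["state"] in unresolved))


findings = [
    {"state": "open", "owner": "backend"},
    {"state": "overdue", "owner": "backend"},
    {"state": "waived", "owner": "ops"},
]
print(summarize(findings))
print(unresolved_by_owner(findings))
```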

## Triage Flow

1. CI or a scheduled security run produces scanner summaries.
2. `security_promotion_gate.sh` classifies missing summaries, skipped scanners,
   high/critical findings, and exceptions.
3. Security reviews new or resurfaced findings and sets final severity.
4. The owning domain creates or updates a scoped fix task.
5. Ops records whether the finding blocks the current promotion.
6. If risk is accepted, governance records an expiring exception.
7. The fix owner lands remediation and reruns the scanner or equivalent
   reproducer.
8. Security closes the finding as remediated, false positive, or accepted risk.

## Accepted-Risk Path

An exception is valid only when it includes:

- exception id;
- scanner or control;
- finding key;
- severity;
- owner;
- approver;
- reason and impact;
- compensating control;
- expiry date;
- environment scope;
- follow-up task.

Broad exceptions that use `*` for scanner, severity, or finding key should be
rare, shorter-lived, and explicitly approved by security and governance.
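A validity check over these fields might look like the sketch below. The registry's actual JSON layout is not shown in this policy, so every field name (`exception_id`, `expires_on`, and so on) is hypothetical. Because no approval-date field is listed, the maximum duration from the Remediation Clocks table is measured from the evaluation date, and an exception is treated as expired on its expiry date. Both are assumptions.

```python
from datetime import date, timedelta

# Hypothetical field names for the required exception attributes.
REQUIRED_FIELDS = (
    "exception_id", "scanner", "finding_key", "severity", "owner", "approver",
    "reason", "impact", "compensating_control", "expires_on",
    "environment_scope", "followup_task",
)

# Max exception duration per severity, in days (Remediation Clocks table).
MAX_EXCEPTION_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}


def exception_problems(exc: dict, today: date) -> list:
    """Return a list of problems; an empty list means the exception is usable."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not exc.get(name)]
    if problems:
        return problems
    expiry = date.fromisoformat(exc["expires_on"])
    if expiry <= today:
        problems.append("expired")
    limit = MAX_EXCEPTION_DAYS.get(exc["severity"])  # None for a "*" severity
    if limit is not None and expiry > today + timedelta(days=limit):
        problems.append("expiry exceeds max exception duration")
    if "*" in (exc["scanner"], exc["severity"], exc["finding_key"]):
        problems.append("broad exception: needs security and governance approval")
    return problems


example = {
    "exception_id": "EXC-EXAMPLE-001", "scanner": "trivy",
    "finding_key": "example-key", "severity": "high", "owner": "ops",
    "approver": "security", "reason": "example reason", "impact": "example impact",
    "compensating_control": "example control", "expires_on": "2026-06-20",
    "environment_scope": ["staging"], "followup_task": "SEC-FIX-EXAMPLE-001",
}
print(exception_problems(example, date(2026, 6, 5)))
```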

## Escalation

| Trigger | Escalation |
|---|---|
| Critical finding not triaged within 4h | Security lead and release owner |
| Critical mitigation missed | Security, ops lead, product owner, architecture |
| High remediation target missed | Security lead and owning engineering lead |
| Expired critical/high exception | Release owner blocks promotion until renewed or fixed |
| Scanner unavailable for required production gate | Ops and security jointly decide fix, rerun, or explicit waiver |
| Repeated resurfaced finding | Architecture review of owning boundary or control weakness |

## Release Packet Mapping

Release evidence should link:

- scanner summaries;
- `security-promotion-gate.json`;
- exception registry version;
- open/overdue/remediated/waived/resurfaced counts;
- owner and follow-up task for every blocking or waived finding;
- explicit residual-risk statement when any high/critical finding is waived.

The release packet should not treat missing security evidence as a pass. Missing
or stale evidence is reported as `missing` or `blocked` until the scan is rerun
or the gap is explicitly waived.

## Current Implementation Posture

Current:

- scanner summaries are produced under `dist/security/` and
  `dist/release-hardening/`;
- `security_promotion_gate.sh` fails production-style promotion on missing
  summaries, skipped scanners, unwaived high/critical findings, and invalid or
  expired exceptions;
- `vulnerability_sla_summary.sh` emits machine-readable open, overdue,
  remediated, waived, resurfaced, false-positive, and invalid finding posture;
- approved exceptions are tracked in
  `doc/governance/security_scan_exceptions.json`.

Next:

- surface vulnerability SLA posture in platform evidence/status read models;
- reconcile security exceptions with the broader waiver governance model.
