# GitOps Adoption Decision v1

Status: Proposed

Owner: Platform Operations

Related Fairway task: `OPS-PROD-GITOPS-ADOPTION-DECISION-001`

## Purpose

Decide the initial GitOps adoption path for platform-control and production
deployment control without losing the release hardening already built into the
current scripts.

This packet compares Argo CD and Flux for GPUaaS needs and classifies GitOps as
a production confidence hardening item, not a prerequisite for the current
stabilization loop.

## Decision

Adopt **Argo CD first** for staging and production desired-state
reconciliation, after release/runtime parity and environment profile evidence
are stable.

Do not start the controller migration yet. Continue using the current
manifest-only deploy scripts while the stabilization work closes UAT, runtime
parity, and environment-profile gates.

## Classification

| Classification | Decision |
|---|---|
| Launch blocker | No, unless production launch requires formal continuous drift reconciliation before approval. |
| Confidence hardening | Yes. GitOps gives desired-vs-live visibility, sync evidence, rollback discipline, and clearer approval records. |
| Future scale hardening | Yes. Multi-environment operations, tenant-facing release evidence, and production change control become simpler when desired state is continuously reconciled. |

Trigger to start implementation:

1. `OPS-PROD-RELEASE-RUNTIME-PARITY-001` has stable evidence.
2. `OPS-PROD-ENV-PARITY-STAGING-001` has stable staging/profile evidence.
3. Release manifests are consistently digest-pinned and consumed by deploy.
4. Remote validation can report intended-vs-live image/config drift.
5. Secret references are represented as references, not Git-tracked values.
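
Trigger 3 can be checked mechanically before any controller work starts. The following is a minimal sketch, assuming image references are plain strings and that "digest-pinned" means the reference ends in a full `sha256` digest. The function name `unpinned_images` and the example registry host are hypothetical, not part of the current scripts.

```python
import re

# Assumption: a reference is digest-pinned only when it ends in "@sha256:"
# followed by exactly 64 lowercase hex characters.
_DIGEST_RE = re.compile(r"@sha256:[0-9a-f]{64}$")


def unpinned_images(image_refs):
    """Return the image references that are not pinned to a sha256 digest."""
    return [ref for ref in image_refs if not _DIGEST_RE.search(ref)]


# Hypothetical references, for illustration only.
example_refs = [
    "registry.example.com/platform/api@sha256:" + "a" * 64,
    "registry.example.com/platform/worker:latest",
]
print(unpinned_images(example_refs))
```

An empty result would be one piece of evidence for trigger 3. It does not show that deploy actually consumes the manifest.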

## Current Controls To Preserve

GitOps must not replace these controls with a weaker path:

| Current control | GitOps mapping |
|---|---|
| `platform_control_resolve_release_profile_contract.sh` | Generates or validates the environment/profile contract before a GitOps sync request is accepted. |
| Release manifest digest fan-in | Becomes the immutable desired-state input for image digests and component versions. |
| Manifest-only deploy wrapper | Becomes the bridge that writes or opens a PR against the GitOps environment path. |
| Image freshness and live digest checks | Become post-sync drift evidence and release-readiness gates. |
| Remote validation phase split | Runs after GitOps sync health succeeds; remains independently rerunnable. |
| Platform evidence bundle | Records release manifest, GitOps sync revision, live image drift, smoke results, and approved exceptions. |
| Promotion branch discipline | Stays until environment repo promotion is proven; GitOps must still consume one exact source SHA/release manifest. |

## Argo CD vs Flux

| Criterion | Argo CD | Flux |
|---|---|---|
| Desired-state reconciliation | Strong application model with first-class sync and health status. | Strong reconciliation through controllers and `Kustomization`/`HelmRelease` resources. |
| Drift visibility | Strong UI/API for app status, sync status, resource health, and diff. | Good CLI/API/events; no built-in UI by default, so operator-facing views need extra tooling. |
| Approval workflow | Pairs well with Git PR approval and manual sync windows. | Pairs well with Git PR approval and automation; less centralized manual-sync UX. |
| Rollback | Clear app history and rollback/revert flow when environment state is Git-backed. | Git revert plus controller reconciliation is clean; less productized rollback surface. |
| Secret integration | Works with External Secrets, Sealed Secrets, SOPS workflows, and Vault-backed references. | Strong SOPS integration and Git-native encrypted-secret workflows. |
| Image digest handling | Supports digest-pinned manifests; image updater is optional and should not own promotion initially. | Strong image automation controllers, but avoid automatic mutation until release manifest discipline is stable. |
| Evidence export | App/status API can be scraped into release evidence and Status/Ops read models. | Events/status conditions can be scraped; needs more custom presentation for operators. |
| Operational fit for GPUaaS | Better first fit because platform-control needs human-readable sync/drift status for ops, security, architecture, and product stakeholders. | Good fit for Git-native automation later, especially if the team prefers controller composition over UI-centric operations. |

Recommendation:

- Use Argo CD for the first staging/production GitOps controller.
- Keep Flux as a valid future option for image automation or Git-native
  controller composition if Argo CD becomes too UI/operator-centric for later
  scale.
- Do not use either tool to auto-promote mutable tags or bypass release
  manifest gates.

## Target Operating Model

The GitOps path should be:

1. CI validates code, contracts, security scans, and tests.
2. Artifact build publishes immutable images and packages.
3. Release manifest records source SHA, image digests, schema/seed identifiers,
   profile metadata, and artifact refs.
4. Promotion creates a Git change to the environment desired-state path.
5. Argo CD syncs the desired state to the target environment.
6. Remote validation runs after sync health is green.
7. Status/Ops records expected release, GitOps sync revision, live digest,
   route health, validation evidence, and exceptions.

```mermaid
flowchart LR
  A["Code CI"] --> B["Artifact build"]
  B --> C["Release manifest"]
  C --> D["Environment desired-state PR"]
  D --> E["Argo CD sync"]
  E --> F["Remote validation"]
  F --> G["Status/Ops evidence"]
```
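
The release manifest in step 3 is the single input that every later step refers back to. One way to model it is sketched below, assuming the manifest can be represented as a small immutable record. All field names and the `identity` helper are hypothetical illustrations of the fields listed above, not the current manifest schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    """Hypothetical shape for the release manifest described in step 3."""

    source_sha: str
    image_digests: dict  # component name -> image digest
    schema_id: str
    seed_id: str
    profile: str
    artifact_refs: tuple = ()

    def identity(self) -> str:
        """Return a stable content hash that evidence records can reference."""
        # sort_keys makes the hash independent of dict insertion order.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A content-derived identity like this would let the sync revision, live digests, and validation evidence in step 7 all point at the same manifest without relying on file names.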

## Environment Repository Shape

Start with one GitOps path per environment/profile pair:

```text
environments/
  dev-control-rke2/
    kustomization.yaml
    release-manifest.ref
    values/
  staging/
    kustomization.yaml
    release-manifest.ref
    values/
  prod/
    kustomization.yaml
    release-manifest.ref
    values/
```

Rules:

- Desired-state files may reference image digests and secret references.
- Desired-state files must not contain secret values, kubeconfigs, private keys,
  database passwords, registry passwords, or bootstrap tokens.
- Environment config remains declarative.
- Stateful migrations stay outside automatic sync until migration preflight,
  backup, cutover, and rollback evidence exist.
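
The "no secret values" rule can be backed by a pre-merge check on the environment paths. The sketch below is a deliberately narrow heuristic, assuming desired-state files are YAML-like text: it flags PEM private-key headers and keys whose names end in `password` or `token` with a literal value. The function name and patterns are assumptions; a real gate would likely use a dedicated secret scanner.

```python
import re

# Matches PEM headers such as "-----BEGIN RSA PRIVATE KEY-----".
_PRIVATE_KEY_RE = re.compile(r"-----BEGIN (?:[A-Z]+ )*PRIVATE KEY-----")
# Matches keys ending in "password" or "token"; reference-style keys such as
# "passwordSecretRef" do not match because the key does not end there.
_CREDENTIAL_KEY_RE = re.compile(
    r"^\s*(?:-\s*)?[\w.-]*(?:password|token)\s*:\s*(?P<value>\S.*)?$",
    re.IGNORECASE,
)


def find_secret_like_lines(text):
    """Return (line_number, reason) pairs for lines that look like secret values."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if _PRIVATE_KEY_RE.search(line):
            findings.append((lineno, "private key block"))
            continue
        match = _CREDENTIAL_KEY_RE.match(line)
        if match:
            value = (match.group("value") or "").strip().strip("'\"")
            if value:
                findings.append((lineno, "inline credential value"))
    return findings
```

This does not detect kubeconfigs or encoded values, so it complements review of the environment path rather than replacing it.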

## Adoption Phases

### Phase 0: Hold

Continue the current script-driven manifest-only deploy path.

Exit criteria:

- release manifest is stable enough to be the only deployment unit;
- runtime parity evidence detects live image/config drift;
- staging environment/profile evidence is stable.

### Phase 1: Shadow Drift

Install Argo CD in staging or a non-production control namespace.

Scope:

- read-only with sync disabled, or manual-sync only;
- render the same environment overlay the deploy script uses;
- compare desired state to live state;
- export app sync/health/drift to evidence.

Exit criteria:

- drift report matches current remote validation results;
- no secret values are required in Git;
- operators can map sync health to release manifest identity.
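
The first exit criterion needs a drift report in a form that can be compared with remote validation output. A minimal sketch, assuming both desired and live state can be reduced to a mapping from component name to image digest (the function name and record keys are hypothetical):

```python
def digest_drift(desired, live):
    """Compare component -> digest maps and return one entry per mismatch."""
    drift = []
    for component in sorted(set(desired) | set(live)):
        want = desired.get(component)
        have = live.get(component)
        if want != have:
            # None means the component is missing on that side.
            drift.append({"component": component, "desired": want, "live": have})
    return drift


# Hypothetical digests, for illustration only.
desired = {"api": "sha256:aaa", "worker": "sha256:bbb"}
live = {"api": "sha256:aaa", "worker": "sha256:ccc", "debug-shell": "sha256:ddd"}
for entry in digest_drift(desired, live):
    print(entry)
```

Reporting components that exist only on the live side matters here, because unmanaged workloads are also drift.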

### Phase 2: Controlled Staging Sync

Allow Argo CD to sync staging after promotion approval.

Scope:

- staging only;
- manual sync or approved auto-sync;
- current remote validation remains the post-sync gate;
- deploy scripts become PR/render/evidence helpers, not direct cluster mutators.

Exit criteria:

- rollback via Git revert is proven;
- remote validation evidence links to Argo CD sync revision;
- status surface reports intended-vs-live state.
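
Linking validation evidence to the sync revision could start from the Application resource itself. The sketch below assumes the Application has already been fetched as JSON and parsed into a dict. It reads the `status.sync.status`, `status.sync.revision`, and `status.health.status` fields that Argo CD reports on an Application. The function name, the output keys, and the `release_manifest_id` parameter are hypothetical.

```python
def sync_evidence(app, release_manifest_id):
    """Flatten an Argo CD Application dict into a small evidence record."""
    status = app.get("status", {})
    sync = status.get("sync", {})
    health = status.get("health", {})
    return {
        "app": app.get("metadata", {}).get("name"),
        # Supplied by the caller; how it is derived is out of scope here.
        "release_manifest_id": release_manifest_id,
        "sync_status": sync.get("status"),
        "sync_revision": sync.get("revision"),
        "health_status": health.get("status"),
    }
```

Missing fields come back as `None` rather than raising, so an incomplete status is recorded as evidence instead of being lost.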

### Phase 3: Production Adoption

Adopt Argo CD for production after staging has stable sync, rollback, secret,
and evidence behavior.

Production requirements:

- protected environment path;
- required approvals;
- maintenance-window policy for stateful changes;
- alerting on out-of-sync or degraded apps;
- emergency break-glass sync/rollback runbook;
- evidence bundle export for every production promotion.

## Guardrails

1. GitOps does not create secrets; it references approved secret objects.
2. GitOps does not run unreviewed DB migrations automatically.
3. GitOps does not auto-promote images from mutable tags.
4. GitOps sync success is not release success; remote validation and evidence
   gates still decide release readiness.
5. Emergency manual cluster changes must create drift evidence and a follow-up
   reconciliation task.
6. Do not replace current deploy hardening until the equivalent GitOps evidence
   is implemented and reviewed.
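
Guardrail 4 is easy to erode once a green sync is visible in a UI, so it helps to encode it as an explicit gate. A minimal sketch, assuming the sync and health values are Argo CD's `Synced` and `Healthy` strings and that remote validation and exception review are reduced to simple inputs (the function and parameter names are hypothetical):

```python
def release_readiness(sync_status, health_status, validation_passed, unapproved_exceptions):
    """Return (ready, reasons); sync success alone never yields ready=True."""
    reasons = []
    if sync_status != "Synced":
        reasons.append(f"sync status is {sync_status!r}")
    if health_status != "Healthy":
        reasons.append(f"health status is {health_status!r}")
    if not validation_passed:
        reasons.append("remote validation has not passed")
    if unapproved_exceptions:
        reasons.append(f"{len(unapproved_exceptions)} unapproved exception(s)")
    return (not reasons, reasons)


# A synced, healthy app is still not release-ready without validation.
print(release_readiness("Synced", "Healthy", False, []))
```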

## Open Decisions

| ID | Decision | Default for now |
|---|---|---|
| GITOPS-OD-001 | Same repo vs separate environment repo | Start with in-repo environment paths until production separation is needed. |
| GITOPS-OD-002 | Auto-sync vs manual sync | Manual sync for staging/prod until evidence proves safe automation. |
| GITOPS-OD-003 | Secret reference mechanism | External Secrets with Vault references is preferred; SOPS remains a fallback for low-scale bootstrap-only cases. |
| GITOPS-OD-004 | Image updater usage | Disabled initially; release manifest owns image digest selection. |
| GITOPS-OD-005 | Progressive delivery controller | Defer Argo Rollouts/Flagger until staging sync is stable and rollout SLOs are defined. |
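
The defaults for GITOPS-OD-001 and GITOPS-OD-002 translate into a fairly small Argo CD `Application`. The sketch below builds one as a Python dict so it can be rendered to JSON or YAML by existing tooling. The application name, namespaces, project, and repository URL are hypothetical placeholders. Manual sync is expressed by leaving out the `syncPolicy.automated` block.

```python
import json


def staging_application(repo_url, target_revision):
    """Build a manual-sync Argo CD Application for the in-repo staging path."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Hypothetical name and namespace.
        "metadata": {"name": "platform-control-staging", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "targetRevision": target_revision,
                "path": "environments/staging",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": "platform-control",  # hypothetical
            },
            # No syncPolicy.automated block: sync happens only on explicit request.
        },
    }


print(json.dumps(staging_application("https://git.example.com/platform.git", "main"), indent=2))
```

Pinning `target_revision` to an exact commit SHA instead of a branch would keep this aligned with the "one exact source SHA" requirement in the controls table.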

## Next Work

1. Complete release/runtime parity and staging environment profile evidence.
2. Add a shadow-drift proof task for Argo CD against staging or dev-control.
3. Add evidence export from Argo CD app status to Status/Ops.
4. Define break-glass and drift reconciliation runbooks.
5. Revisit Flux only if Argo CD cannot satisfy the evidence-export or
   operational-ergonomics requirements.
