# Reproducible Environment Automation v1

As of: May 10, 2026

## Purpose

Define the production-shaped automation model for creating, recreating,
upgrading, validating, and recovering GPUaaS environments from inventory and
reviewed configuration.

This is the infrastructure/governance track. It is separate from V3 product UI
and API work. The first candidate target is `vm-105` at `100.69.173.30` with
SSH user `hpcadmin`, sudo access, 32 cores, about 193 GiB RAM, a root volume of
about 247 GiB, and `/ai-cloud-data` of about 512 GiB. Current queue context
names the first `vm-105` use as `demo`; the model must allow the same host and
profile to be recreated later as `test`, `dev`, or another environment without
changing automation logic. This document does not authorize deploying to or
mutating `vm-105`.

The intended end state is that an operator can provide one or more machines,
sudo access, disk profile, DNS/certificate inputs, and environment name, then
stand up a production-grade control-plane environment through idempotent
automation.

## Scope

In scope:
1. inventory schema for one or more hosts,
2. single-node and multi-node k3s topology,
3. host bootstrap, durable state layout, secrets, DNS, certificates, ingress,
4. platform-control-style deploy, validation, evidence, rollback, and rebuild,
5. differences between local Apple Silicon, local x86 Mac or Linux VM, and
   remote Linux hosts,
6. a path from local `docker compose` and `kind` parity to production-shaped k3s.

Out of scope:
1. product V3 UI/API work,
2. new public API contracts,
3. mutating `vm-105`,
4. destructive scripts,
5. committing Cloudflare credentials or generated production secrets.

## Related References

1. `doc/operations/Control_Plane_K8s_Migration_v1.md`
2. `doc/architecture/Platform_DNS_Cert_Endpoint_Model_v1.md`
3. `doc/governance/Platform_Control_Release_Promotion_Policy.md`
4. `doc/governance/Multi_Agent_Lane_Worktrees_v1.md`
5. `doc/governance/Agent_Queue_Structured_Store_v1.md`
6. `doc/governance/Agent_Queue_State_And_Telemetry_Hardening_v1.md`
7. `doc/operations/local-dev/README.md`
8. `infra/ansible/README.md`

## Design Principles

1. Reproducible means rebuildable from Git-reviewed config plus durable state
   backups, not from manual memory.
2. Inventory drives topology. Automation must support a single host and a
   future multi-node k3s cluster without changing deployment logic.
3. Git owns reconstructable config. Durable operational state owns the facts
   that cannot be regenerated safely, such as Postgres data, Vault storage,
   registry blobs, queue execution telemetry, cert-manager account state, and
   backup catalogs.
4. Every manual action on `vm-105` must be recorded as an automation gap and
   converted into an idempotent script, Ansible task, Make target, or documented
   runbook step before the environment is considered complete.
5. Public endpoints are named surfaces. They must not depend on Tailscale
   Funnel, node IPs, or direct ports as the steady-state model.
6. Release and deployment follow platform-control promotion discipline: build
   from one reviewed SHA, deploy immutable image digests, validate, and keep
   rollback evidence.
7. Governance/orchestration tooling must remain isolated enough that it can
   become a separate product or tool later.

## Environment Layers

GPUaaS should keep four environment modes with different jobs:

| Mode | Runtime | Job |
|---|---|---|
| local-dev | `docker compose` | fastest inner loop, disposable DB resets, host ports |
| local parity | `kind` | Kubernetes-shaped validation on a developer machine |
| reproducible env | `k3s` | inventory-driven production-shaped environment on one or more hosts |
| platform-control | promoted `k3s` release target | shared continuity, release validation, operational proof |

The reproducible environment model is the factory shape that should eventually
create platform-control-like environments. It should reuse current Ansible,
kustomize, smoke, backup, queue, and release primitives where practical, but it
must not bake in current dev-control hostnames or Tailscale Funnel names.

## Inventory Schema

Each environment must have an inventory file and an environment config file.
The schema should be readable as YAML and convertible to Ansible inventory.

Required environment fields:
1. `environment.name`, such as `demo`, `test`, or `dev`,
2. `environment.tier`, such as `dev`, `staging`, `prod`,
3. `environment.cluster_name`,
4. `environment.domain_base`,
5. `environment.dns.zone`,
6. `environment.dns.provider`,
7. `environment.dns.profile`,
8. `environment.public_endpoint_profile`,
9. `environment.ingress.strategy`,
10. `environment.tls.issuer`,
11. `environment.state_root`,
12. `environment.backup_profile`.

Required per-host fields:
1. `hostname`,
2. `ip`,
3. `ssh_user`,
4. `sudo_mode`, for example `passwordless` or `password_required`,
5. `architecture`, for example `amd64` or `arm64`,
6. `intended_role`, one or more of `k3s_control_plane`, `k3s_worker`,
   `ci_runner`, `stateful_infra`, `gpu_worker`,
7. `data_disk_profile`, including root volume, data mount, filesystem,
   capacity, and intended storage classes,
8. `labels`,
9. `taints`,
10. `environment_name`,
11. `dns_zone_profile`,
12. `public_endpoint_profile`.

The first candidate `vm-105` inventory entry should be modeled as data, not as a
hardcoded script target:

```yaml
hostname: vm-105
ip: 100.69.173.30
ssh_user: hpcadmin
sudo_mode: passwordless
architecture: amd64
intended_role:
  - k3s_control_plane
  - k3s_worker
environment_name: demo
data_disk_profile:
  root:
    mount: /
    size_gib: 247
  data:
    mount: /ai-cloud-data
    size_gib: 512
    storage_class: local-path
labels:
  gpuaas.io/host-role: platform-control
  gpuaas.io/environment: demo
taints: []
dns_zone_profile: core42-dev
public_endpoint_profile: core42-demo-public
```
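
A minimal sketch of how `make env-inventory-validate` could check one host entry
against the required per-host fields listed above. The function name
`validate_host` and the problem-message wording are assumptions for
illustration; only the field names and role values come from this document.
`sudo_mode` and `architecture` are not restricted here because the document
gives their values only as examples.

```python
# Field names and roles are taken from the "Required per-host fields" list.
REQUIRED_HOST_FIELDS = (
    "hostname", "ip", "ssh_user", "sudo_mode", "architecture",
    "intended_role", "data_disk_profile", "labels", "taints",
    "environment_name", "dns_zone_profile", "public_endpoint_profile",
)
ALLOWED_ROLES = {
    "k3s_control_plane", "k3s_worker", "ci_runner", "stateful_infra", "gpu_worker",
}


def validate_host(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    # A field counts as present even when its value is empty (e.g. taints: []).
    problems = [f"missing field: {name}" for name in REQUIRED_HOST_FIELDS
                if name not in entry]
    roles = entry.get("intended_role") or []
    if isinstance(roles, str):
        roles = [roles]
    if "intended_role" in entry and not roles:
        problems.append("intended_role must name at least one role")
    problems += [f"unknown role: {role}" for role in roles
                 if role not in ALLOWED_ROLES]
    return problems


# The vm-105 entry from this document, expressed as already-parsed data.
vm_105 = {
    "hostname": "vm-105", "ip": "100.69.173.30", "ssh_user": "hpcadmin",
    "sudo_mode": "passwordless", "architecture": "amd64",
    "intended_role": ["k3s_control_plane", "k3s_worker"],
    "environment_name": "demo",
    "data_disk_profile": {
        "root": {"mount": "/", "size_gib": 247},
        "data": {"mount": "/ai-cloud-data", "size_gib": 512,
                 "storage_class": "local-path"},
    },
    "labels": {"gpuaas.io/host-role": "platform-control",
               "gpuaas.io/environment": "demo"},
    "taints": [],
    "dns_zone_profile": "core42-dev",
    "public_endpoint_profile": "core42-demo-public",
}
print(validate_host(vm_105))
```

Returning a list of problems instead of raising on the first one lets a
preflight report every gap in a single read-only pass.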

## Command Surface

The future command surface should be Make targets wrapping scripts or Ansible
playbooks. Initial target names should be explicit and dry-run friendly:

1. `make env-inventory-validate ENV=<env>`
2. `make env-preflight ENV=<env>`
3. `make env-bootstrap-hosts ENV=<env>`
4. `make env-bootstrap-k3s ENV=<env>`
5. `make env-bootstrap-dns ENV=<env>`
6. `make env-bootstrap-secrets ENV=<env>`
7. `make env-deploy ENV=<env> SOURCE_REF=<ref>`
8. `make env-validate ENV=<env>`
9. `make env-capture-evidence ENV=<env>`
10. `make env-backup ENV=<env>`
11. `make env-restore-drill ENV=<env>`
12. `make env-rollback ENV=<env> RELEASE=<id>`
13. `make env-record-gap ENV=<env> SUMMARY=<text>`

No command should require embedding secrets in the command line. Secret-bearing
inputs should come from local untracked files, environment variables loaded by a
wrapper that redacts output, or a secret manager.

Initial safe slice:

```bash
make env-inventory-validate ENV=demo
make env-preflight ENV=demo HOST=vm-105
make env-capture-evidence ENV=demo HOST=vm-105
```

The first implemented commands are validation and read-only evidence capture
only. They must not install packages, write remote files, deploy k3s, mutate
DNS, create cert-manager resources, or print secret values. Evidence is written
locally under `.git/ops-evidence/env-automation/<env>/`.

No bootstrap target may be used against `vm-105` until the latest preflight
evidence is reviewed and the target inventory, DNS profile, state root, rollback
path, and exact mutating command are confirmed.

## vm-105 Mutation Review Gate

Before any command mutates `vm-105`, reviewers must confirm:
1. `doc/operations/env-automation/environments/demo/config.yaml` still matches
   the intended environment name, `core42.dev` DNS profile, ingress strategy,
   state root, backup profile, and public endpoint map,
2. `doc/operations/env-automation/environments/demo/inventory.yaml` still
   matches the target host, IP `100.69.173.30`, SSH user, sudo mode, role set,
   architecture, labels, taints, root volume, and `/ai-cloud-data` profile,
3. a current read-only preflight bundle exists and has been reviewed for OS,
   kernel, mounts, free space, architecture, network, tool versions, and
   non-interactive sudo behavior,
4. Cloudflare credentials are present only as local secret input and the planned
   DNS-01 action is limited to the reviewed `core42.dev` zone/profile,
5. the exact mutating target is implemented as idempotent automation with
   recorded evidence, a dry-run/check mode where practical, and no secret
   printing,
6. state backup and rollback commands are documented for the affected layer
   before deploy, upgrade, restore, or destructive rebuild testing,
7. every expected manual or emergency `sudo` action has either been converted to
   automation or recorded as an automation gap.

## Host Bootstrap Workflow

The host bootstrap phase must be idempotent and safe to rerun.

Required steps:
1. load inventory and fail if required fields are missing,
2. verify SSH and sudo without printing credentials,
3. collect host facts: OS, kernel, CPU, memory, architecture, mounts, disks,
   filesystems, IP addresses, package versions, container runtime versions,
4. create base directories under `/opt/gpuaas`, `/etc/gpuaas`,
   `/var/lib/gpuaas`, `/var/log/gpuaas`, and the environment state root,
5. install approved packages and pin or record versions,
6. configure time sync, firewall baseline, kernel/sysctl requirements, and
   container runtime requirements,
7. configure data disk mount ownership and local-path storage root,
8. write host-role markers from automation,
9. emit a bootstrap evidence bundle.

Manual recovery actions are allowed during incident work, but they must produce
an automation gap record before the environment can be marked complete.
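
One possible shape for the automation gap record, sketched under the assumption
that gaps are stored as small JSON-serializable records. The gap log format is
still an open task in this document, so every field name and the status values
`open` and `automated` are hypothetical.

```python
import json
from datetime import datetime, timezone


def make_gap_record(env: str, host: str, summary: str, manual_command: str,
                    now: datetime) -> dict:
    """Describe one manual action that automation does not yet cover."""
    return {
        "recorded_at": now.astimezone(timezone.utc).isoformat(),
        "environment": env,
        "host": host,
        "summary": summary,
        # Record the command shape only; never record secret values.
        "manual_command": manual_command,
        "status": "open",  # hypothetical states: "open" -> "automated"
    }


def environment_complete(records: list) -> bool:
    """An environment is complete only when no gap is still open."""
    return all(record["status"] == "automated" for record in records)


gap = make_gap_record(
    env="demo", host="vm-105",
    summary="restarted time sync by hand during incident",
    manual_command="sudo systemctl restart <time-sync-unit>",
    now=datetime(2026, 5, 10, 12, 0, tzinfo=timezone.utc),
)
print(json.dumps(gap, indent=2))
print(environment_complete([gap]))
```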

## k3s Topology

### Single-node

For a first environment, one host may run both `k3s_control_plane` and
`k3s_worker`. This is acceptable for a dev or bootstrap environment when:
1. durable state is backed up,
2. rebuild from backup is rehearsed,
3. the inventory is already multi-node capable,
4. control-plane and workload labels are explicit,
5. data placement is documented.

### Multi-node

Multi-node k3s must be modeled from the start:
1. first control-plane node initializes the cluster,
2. additional control-plane nodes join through a generated join token stored in
   the environment secret store,
3. worker nodes join through a separate token where supported,
4. kubeconfig is written to a durable operator path with least privilege,
5. labels and taints are applied from inventory,
6. validation waits for all expected nodes to be `Ready` before deploy.
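
The readiness wait in step 6 can be sketched as a comparison between the
hostnames expected from inventory and the node list reported by
`kubectl get nodes -o json`. This is an illustrative helper, not existing
tooling: the name `not_ready_nodes` and the hosts `vm-106` and `vm-107` are
assumptions.

```python
import json


def not_ready_nodes(expected: list, kubectl_json: str) -> list:
    """Return expected node names that are absent or not Ready."""
    ready = set()
    for item in json.loads(kubectl_json).get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        # A node is Ready when its "Ready" condition has status "True".
        if any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions):
            ready.add(item["metadata"]["name"])
    return [name for name in expected if name not in ready]


# Trimmed sample in the shape of `kubectl get nodes -o json` output.
sample = json.dumps({"items": [
    {"metadata": {"name": "vm-105"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "vm-106"},
     "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
]})
print(not_ready_nodes(["vm-105", "vm-106", "vm-107"], sample))
```

A deploy gate would poll until this list is empty or a timeout expires, then
fail with the remaining names as evidence.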

The implementation should support these role groups:
1. `k3s_control_plane_primary`,
2. `k3s_control_plane_secondary`,
3. `k3s_worker`,
4. `stateful_infra`,
5. `ci_runner`,
6. `gpu_worker`.

## Storage and Durable State

Git-reconstructable config:
1. inventory and environment config,
2. Ansible roles and playbooks,
3. kustomize bases and overlays,
4. scripts and Make targets,
5. smoke test definitions,
6. backup/restore runbooks,
7. queue task definitions.

Durable operational state:
1. Postgres data and WAL/archive backups,
2. Redis persistence if enabled for non-cache data,
3. NATS JetStream data,
4. Temporal persistence,
5. Keycloak database and realm runtime state,
6. Vault storage and unseal/recovery material references,
7. registry blobs and metadata,
8. cert-manager ACME account and issued certificate state,
9. k3s server datastore snapshots,
10. queue mutable state and telemetry,
11. evidence bundles and validation artifacts.

The environment config must declare state locations. A recommended first layout
for a host with `/ai-cloud-data` is:

```text
/ai-cloud-data/gpuaas/<env>/k3s/
/ai-cloud-data/gpuaas/<env>/local-path/
/ai-cloud-data/gpuaas/<env>/backups/
/ai-cloud-data/gpuaas/<env>/evidence/
/ai-cloud-data/gpuaas/<env>/registry/
/ai-cloud-data/gpuaas/<env>/vault/
```
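
A short sketch of deriving that layout from the data mount and environment name,
so the same logic serves `demo`, `test`, or `dev` without edits. The helper name
`state_paths` and the environment-name check are assumptions.

```python
from pathlib import PurePosixPath

# Subdirectories from the recommended first layout above.
STATE_SUBDIRS = ("k3s", "local-path", "backups", "evidence", "registry", "vault")


def state_paths(data_mount: str, env: str) -> dict:
    """Map each state area to its path under <data_mount>/gpuaas/<env>/."""
    if not env or "/" in env or env in (".", ".."):
        raise ValueError(f"unsafe environment name: {env!r}")
    root = PurePosixPath(data_mount) / "gpuaas" / env
    return {name: str(root / name) for name in STATE_SUBDIRS}


for area, path in state_paths("/ai-cloud-data", "demo").items():
    print(f"{area}: {path}")
```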

State backup is mandatory before upgrade, rollback rehearsal, or destructive
rebuild testing.

## Postgres Placement

Postgres can be outside k3s or inside k3s. The choice is environment-specific
and must be explicit.

External or host-managed Postgres advantages:
1. easier early recovery when k3s is broken,
2. simpler access to native backup tooling,
3. lower risk during the first k3s migration.

External or host-managed Postgres costs:
1. more host-specific automation,
2. separate lifecycle from workloads,
3. less declarative service placement.

In-cluster Postgres advantages:
1. environment manifests describe the full stack,
2. k3s Service discovery and NetworkPolicy can own access,
3. operator-based backup can standardize later environments.

In-cluster Postgres costs:
1. storage class and node failure semantics must be correct,
2. restore must be tested when the cluster itself is degraded,
3. a single-node cluster creates correlated failure between control plane and
   database unless backups are off-host.

Recommendation:
1. allow production-shaped dev environments to start with host-managed
   Postgres while k3s deployment automation stabilizes,
2. allow in-cluster Postgres only with declared PVC/storage class, scheduled
   backups, off-host copies, restore drill, and documented failure mode,
3. require a restore drill before moving a shared environment's source of truth
   into k3s.

## Secrets Model

Secrets must not be committed. The design assumes three categories:

1. local operator inputs, such as `.env.cloudflare.core42-dev`,
2. environment bootstrap secrets, generated once and stored in the environment
   secret manager,
3. workload runtime secrets, synced into Kubernetes through controlled
   automation.

Cloudflare credentials for `core42.dev` are expected locally in
`.env.cloudflare.core42-dev`. That file must be gitignored, never printed, and
only loaded by commands that redact environment output. The file should contain
least-privilege credentials limited to DNS changes for the required zone.
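
A minimal sketch of a presence check that never echoes values. It parses
`KEY=VALUE` text in the style of an env file and exposes only a redacted view.
The key name `CLOUDFLARE_API_TOKEN` is an assumption; this document does not
specify the variable names inside `.env.cloudflare.core42-dev`.

```python
def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def redacted(values: dict) -> dict:
    """Report which keys are set without revealing any value."""
    return {key: "<redacted>" if value else "<empty>"
            for key, value in values.items()}


# Hypothetical file content; a real wrapper would read the gitignored file.
sample_text = '# local operator input\nCLOUDFLARE_API_TOKEN="not-a-real-token"\n'
print(redacted(parse_env_text(sample_text)))
```

Only the output of `redacted` should ever reach logs or evidence bundles.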

Vault remains the preferred environment secret store once available. Before
Vault is bootstrapped, generated secret references and one-time recovery
material must be captured in an operator runbook location outside Git.

## DNS and Endpoint Model

The domain `core42.dev` is Cloudflare-managed. The steady-state public model
should use Cloudflare DNS plus Let's Encrypt DNS-01 issuance, preferably through
cert-manager in k3s.

Required endpoint surfaces:
1. `app`,
2. `api`,
3. `auth`,
4. `term`,
5. `grafana`,
6. `prometheus`,
7. `loki`,
8. `tempo`,
9. `registry`,
10. `vault`.

Tailscale Funnel can remain a transitional operator path, but it should not be
the target public endpoint model. Direct node IPs and the retired IP-derived DNS
names are development realizations, not the production naming contract.

## Recommended Hostname Pattern

Recommended pattern:

```text
<service>.<env>.aicloud.core42.dev
```

Examples:
1. `app.demo.aicloud.core42.dev`
2. `api.demo.aicloud.core42.dev`
3. `auth.demo.aicloud.core42.dev`
4. `term.demo.aicloud.core42.dev`
5. `grafana.demo.aicloud.core42.dev`

Wildcard certificate:

```text
*.demo.aicloud.core42.dev
```

Why this is preferred:
1. one wildcard covers all ordinary service endpoints for one environment,
2. operators read names from most specific to least specific: service, env,
   product/domain,
3. ingress TLS configuration stays simple because every hostname shares one
   environment wildcard certificate,
4. environments are clearly isolated by wildcard boundary,
5. adding `staging` or `prod` does not require rethinking service names.

Rejected pattern: `aicloud.<env>.<service>.core42.dev`.
This places the service label nearest the zone apex. A wildcard matches only the
single leftmost label, so no one wildcard can cover every service in an
environment: `*.dev.core42.dev` does not match `aicloud.dev.api.core42.dev`, and
any certificate boundary under `api.core42.dev` groups names by service instead
of by environment. The pattern is also less natural for ingress and harder for
humans to read.

Alternative pattern: `<service>.<env>.core42.dev`.
This is also wildcard-friendly with `*.dev.core42.dev`, but it consumes the
top-level environment namespace directly under `core42.dev`. The `aicloud`
product label gives a cleaner boundary if `core42.dev` later hosts non-GPUaaS
systems.

Alternative pattern: `<service>.aicloud.<env>.core42.dev`.
This also needs only one wildcard per environment
(`*.aicloud.<env>.core42.dev`), but it reads less naturally and makes the
environment less prominent in DNS zone navigation.

Recommended wildcard policy:
1. issue one wildcard per environment: `*.demo.aicloud.core42.dev`,
2. optionally issue the apex environment name `demo.aicloud.core42.dev` only if
   an apex landing or redirect is required,
3. do not issue one certificate per service unless a service needs a distinct
   trust, lifecycle, or public exposure policy.
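
The wildcard reasoning in this section can be checked with a small sketch. It
assumes the common certificate-matching rule that `*` stands for exactly one
leftmost label. The helper names `endpoint_hostname` and `wildcard_covers` are
illustrative.

```python
def endpoint_hostname(service: str, env: str, product: str = "aicloud",
                      zone: str = "core42.dev") -> str:
    """Build a name in the recommended <service>.<env>.aicloud.core42.dev form."""
    return f"{service}.{env}.{product}.{zone}"


def wildcard_covers(wildcard: str, hostname: str) -> bool:
    """True when the certificate name covers the hostname.

    A leading "*." replaces exactly one leftmost label; it never spans
    multiple labels and never matches the bare parent name.
    """
    wildcard, hostname = wildcard.lower(), hostname.lower()
    if not wildcard.startswith("*."):
        return wildcard == hostname
    first_label, _, rest = hostname.partition(".")
    return bool(first_label) and rest == wildcard[2:]


wildcard = "*.demo.aicloud.core42.dev"
for service in ("app", "api", "auth", "term", "grafana"):
    name = endpoint_hostname(service, "demo")
    print(name, wildcard_covers(wildcard, name))

# The environment apex needs its own certificate name if it is served at all.
print(wildcard_covers(wildcard, "demo.aicloud.core42.dev"))
# The rejected label order is not covered by a per-environment wildcard.
print(wildcard_covers("*.dev.core42.dev", "aicloud.dev.api.core42.dev"))
```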

## Cloudflare and Let's Encrypt

The implementation should use cert-manager with a Cloudflare DNS-01 ClusterIssuer
or namespace-scoped Issuer. Required properties:
1. Cloudflare token is loaded from `.env.cloudflare.core42-dev` or secret store,
2. token has least-privilege zone DNS permissions,
3. DNS-01 is used so wildcard certificates do not require public HTTP reachability,
4. cert-manager owns renewal,
5. ingress references Kubernetes TLS secrets by name,
6. certificate expiry checks feed observability and validation gates.

Expected high-level flow:
1. validate Cloudflare credentials without printing them,
2. create or verify DNS records for environment ingress front door,
3. install cert-manager,
4. apply Cloudflare issuer secret and issuer manifest,
5. request `*.demo.aicloud.core42.dev`,
6. wait for certificate readiness,
7. apply ingress hosts that use the wildcard secret,
8. run external TLS and hostname validation.
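
Steps 4 and 5 of this flow could render manifests like the following. This is a
sketch that builds cert-manager `ClusterIssuer` and `Certificate` resources as
plain dictionaries and prints them as JSON, which `kubectl apply -f` accepts.
The resource names, namespace, contact email, and secret key name are
placeholders, not reviewed values. The Let's Encrypt staging directory is used
as the default so that a trial run does not issue production certificates.

```python
import json

LETSENCRYPT_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"


def cluster_issuer(name: str, email: str, token_secret: str,
                   token_key: str = "api-token",
                   server: str = LETSENCRYPT_STAGING) -> dict:
    """ACME issuer that solves DNS-01 challenges through Cloudflare."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {"acme": {
            "server": server,
            "email": email,
            "privateKeySecretRef": {"name": f"{name}-account-key"},
            "solvers": [{"dns01": {"cloudflare": {
                # References a Kubernetes Secret; the token itself is never inlined.
                "apiTokenSecretRef": {"name": token_secret, "key": token_key},
            }}}],
        }},
    }


def wildcard_certificate(env: str, namespace: str, issuer: str,
                         zone: str = "aicloud.core42.dev") -> dict:
    """Certificate for *.<env>.aicloud.core42.dev stored as a TLS secret."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": f"{env}-wildcard", "namespace": namespace},
        "spec": {
            "secretName": f"{env}-wildcard-tls",
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
            "dnsNames": [f"*.{env}.{zone}"],
        },
    }


issuer = cluster_issuer("letsencrypt-cloudflare", "ops@example.invalid",
                        "cloudflare-api-token")
cert = wildcard_certificate("demo", "gpuaas-ingress", "letsencrypt-cloudflare")
print(json.dumps([issuer, cert], indent=2))
```

Because an Ingress can only reference a TLS secret in its own namespace, the
certificate namespace should match wherever the ingress resources live.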

## Ingress and Public Endpoint Profiles

Each environment must declare a public endpoint profile:

1. `dns_provider`, such as Cloudflare,
2. `base_domain`, such as `demo.aicloud.core42.dev`,
3. `wildcard_domain`, such as `*.demo.aicloud.core42.dev`,
4. `frontdoor_strategy`, such as MetalLB, external reverse proxy, cloud load
   balancer, or temporary Tailscale Funnel,
5. `ingress_class`, such as Traefik,
6. `tls_secret_name`,
7. `issuer_ref`,
8. endpoint map from service surface to hostname.

The profile must separate logical names from actual node membership so a second
control-plane node can be added without changing public URLs.

## Deploy Workflow

Deploy should mirror platform-control release discipline:
1. choose one source SHA that is already reviewed and integrated according to
   the target environment policy,
2. run local or CI preflight appropriate to the changed files,
3. build runtime images with required OCI labels,
4. publish images to a registry and capture immutable digests,
5. render manifests from environment config,
6. apply by digest, not mutable tags,
7. wait for rollouts,
8. run remote validation against public endpoints,
9. capture evidence and release metadata,
10. keep the previous digest set recorded so that `kubectl rollout undo` or a
    redeploy by digest remains available for rollback.

The release branch rule from platform-control remains the reference for shared
environments: do not hand-edit release branches, and do not treat deploy success
as proof that source and release contents match.
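
The "apply by digest, not mutable tags" rule can be enforced before apply with a
check like this sketch. It treats an image reference as pinned only when it ends
in `@sha256:` followed by 64 lowercase hex characters. The registry and image
names are hypothetical.

```python
import re

# <repository>[:tag]@sha256:<64 hex chars>; a tag alone is not enough.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def unpinned_images(images: list) -> list:
    """Return the image references that are not pinned to a digest."""
    return [image for image in images if not DIGEST_REF.match(image)]


rendered_images = [
    "registry.demo.aicloud.core42.dev/gpuaas/api@sha256:" + "a" * 64,
    "registry.demo.aicloud.core42.dev/gpuaas/worker:latest",
]
print(unpinned_images(rendered_images))
```

A deploy wrapper would refuse to apply rendered manifests while this list is
non-empty.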

## Validation Gates

Minimum gates before declaring an environment ready:
1. inventory schema validates,
2. SSH and sudo preflight pass,
3. host facts captured,
4. disks and data mounts match inventory,
5. k3s nodes are `Ready`,
6. labels and taints match inventory,
7. namespaces and storage classes exist,
8. DNS records resolve,
9. Let's Encrypt wildcard certificate is ready and not near expiry,
10. public HTTPS endpoints pass health checks,
11. Postgres migration and seed pass,
12. API health and auth smoke pass,
13. worker pods are ready,
14. NATS, Redis, Temporal, Keycloak, registry, and Vault readiness pass as
    applicable,
15. observability scrape/log/trace checks pass,
16. backup job succeeds,
17. restore drill succeeds on a non-production target,
18. rollback drill is documented or rehearsed,
19. evidence bundle is stored under the environment evidence path.

The validation command set should include existing checks where possible:
`kind` parity validation, platform-control remote validation, backup/restore
smoke, cert expiry check, observability smoke, and queue validation.
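
Gate 9 needs a concrete meaning for "near expiry". The sketch below takes the
certificate's `notAfter` time and a threshold. The 21-day default is an
assumption chosen for illustration, not a reviewed policy value.

```python
from datetime import datetime, timedelta, timezone


def days_remaining(not_after: datetime, now: datetime) -> int:
    """Whole days left before the certificate expires (negative if expired)."""
    return (not_after - now) // timedelta(days=1)


def near_expiry(not_after: datetime, now: datetime,
                threshold_days: int = 21) -> bool:
    """True when less than threshold_days of validity remains."""
    return not_after - now < timedelta(days=threshold_days)


now = datetime(2026, 5, 10, tzinfo=timezone.utc)
not_after = now + timedelta(days=60, hours=5)
print(days_remaining(not_after, now), near_expiry(not_after, now))
```

Passing `now` explicitly keeps the check deterministic, so the same inputs can
be replayed from an evidence bundle.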

## Evidence Capture

Each run should write a timestamped evidence bundle containing:
1. source SHA and rendered environment config hash,
2. inventory hash,
3. host facts,
4. package and runtime versions,
5. k3s version and node list,
6. DNS records checked,
7. certificate issuer, subject, SANs, and expiry,
8. deployed image digests,
9. rollout status,
10. smoke command results,
11. backup artifact names,
12. restore drill result,
13. manual action log and automation gaps.

Evidence should be stored under durable environment state and summarized in the
queue database or handoff. Large logs should be referenced, not committed to
Git.
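
Items 1 and 2 of the bundle can be sketched as a small manifest that hashes the
exact inventory and rendered config text used for a run. The manifest keys and
the timestamp format are assumptions; only the listed evidence items come from
this document.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_text(text: str) -> str:
    """Hex SHA-256 of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def evidence_manifest(env: str, source_sha: str, inventory_text: str,
                      config_text: str, now: datetime) -> dict:
    """Header for one timestamped evidence bundle."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return {
        "bundle": f"{env}/{stamp}",
        "source_sha": source_sha,
        "inventory_sha256": sha256_text(inventory_text),
        "config_sha256": sha256_text(config_text),
        "manual_actions": [],  # filled from the automation gap log
    }


manifest = evidence_manifest(
    env="demo",
    source_sha="<reviewed-sha>",  # placeholder, not a real commit
    inventory_text="hostname: vm-105\n",
    config_text="environment:\n  name: demo\n",
    now=datetime(2026, 5, 10, 12, 0, 0, tzinfo=timezone.utc),
)
print(json.dumps(manifest, indent=2))
```

Comparing these hashes across two bundles shows whether a rebuild used the same
inventory and config as the run it is meant to reproduce.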

## Backup and Restore

Backup requirements:
1. Postgres logical backup plus PITR-ready physical/WAL strategy for shared
   environments,
2. k3s datastore snapshots,
3. Vault storage backup and separate recovery material handling,
4. registry storage backup or rebuild-from-source policy with digest evidence,
5. Keycloak database and realm export strategy,
6. NATS JetStream and Temporal persistence backup when stateful,
7. queue database backup when shared orchestration is enabled,
8. off-host copy for any environment whose data matters.

Restore requirements:
1. restore to an isolated target first,
2. validate schema, seed, auth, API health, and critical workflows,
3. record RTO/RPO evidence,
4. rehearse before declaring in-cluster Postgres production-ready,
5. document which state is intentionally rebuildable and which is backed up.
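
For requirement 3, a restore drill can record the achieved recovery point and
recovery time from three timestamps. This is a sketch with hypothetical names.
It reports measured values, which reviewers then compare against whatever
RPO/RTO targets the environment declares.

```python
from datetime import datetime, timedelta, timezone


def recovery_metrics(last_backup: datetime, incident: datetime,
                     restored: datetime) -> dict:
    """Measure data-loss window and downtime for one restore drill."""
    if not last_backup <= incident <= restored:
        raise ValueError("expected last_backup <= incident <= restored")
    return {
        # Data written after the last backup is what the drill could not recover.
        "achieved_recovery_point": incident - last_backup,
        # Time from the (simulated) incident until validation passed again.
        "achieved_recovery_time": restored - incident,
    }


utc = timezone.utc
metrics = recovery_metrics(
    last_backup=datetime(2026, 5, 10, 2, 0, tzinfo=utc),
    incident=datetime(2026, 5, 10, 9, 30, tzinfo=utc),
    restored=datetime(2026, 5, 10, 10, 15, tzinfo=utc),
)
print(metrics["achieved_recovery_point"], metrics["achieved_recovery_time"])
```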

## Rollback and Rebuild

Rollback levels:
1. workload rollback with previous Deployment ReplicaSet or previous image
   digest set,
2. manifest rollback to a prior release candidate,
3. schema rollback through forward-only corrective migrations where possible,
4. infrastructure rollback through Ansible/k3s config convergence,
5. disaster rebuild from inventory plus backups.

The rebuild test for "reproducible" is:
1. provision a blank compatible host or VM,
2. run inventory preflight,
3. bootstrap host and k3s,
4. restore durable state or seed a fresh environment,
5. deploy one reviewed release,
6. validate public endpoints and observability,
7. compare evidence against the expected environment contract.

## Queue and Governance State

Queue task definitions remain in Git. Mutable execution state and telemetry are
operational data. The local SQLite implementation is acceptable for a single
developer, but multi-machine orchestration should move to a network-reachable
Postgres-compatible queue database while preserving command semantics.

The environment automation should treat the queue store as a separate component:
1. isolated schema and credentials,
2. backup/restore policy,
3. exported snapshots for human-readable handoff,
4. no coupling to product V3 runtime tables,
5. migration path to a standalone governance/orchestration product.

## Architecture Differences

### Apple Silicon local

Apple Silicon should run local-dev and `kind` parity for development feedback.
It must not hide architecture differences:
1. image builds may need `linux/amd64` output for remote x86 hosts,
2. QEMU/emulation can be slower and should not be used as performance evidence,
3. GPU/runtime validation belongs on Linux GPU-capable hosts,
4. local cert/DNS behavior should preserve the same endpoint semantics.

### x86 Mac or Linux VM

x86 local machines can run `kind` parity closer to remote image architecture,
but still differ from production k3s in storage, networking, systemd, and
long-lived state. They are good for deploy-shape validation, not final recovery
evidence.

### Remote Linux VM

Remote Linux VMs such as `vm-105` are the first production-shaped automation
target:
1. systemd, k3s, storage mounts, DNS, cert-manager, ingress, backup, and
   observability must be exercised,
2. every sudo change must come from automation or be recorded as a gap,
3. remote validation must target public names, not direct ports.

## Phases

### Phase 0: Design and skeleton

Deliver this document and inert examples. Do not mutate remote hosts.

### Phase 1: Inventory and preflight

Implement schema validation, host fact collection, SSH/sudo checks, dry-run
Ansible inventory rendering, Cloudflare credential presence checks, and evidence
directory creation.

### Phase 2: Single-node k3s bootstrap

Bootstrap one node from inventory, likely `vm-105`, and only after review
approval. Install k3s and configure storage, labels, taints, kubeconfig,
namespaces, ingress, cert-manager, and validation. No hand-edited final state is
allowed.

### Phase 3: Core deploy

Deploy GPUaaS core services by digest using platform-control-like promotion
discipline. Keep state placement explicit and run smoke checks.

### Phase 4: DNS and certificate hardening

Move from Tailscale Funnel and the retired IP-derived DNS realization to
Cloudflare DNS and Let's Encrypt wildcard certificates under
`*.demo.aicloud.core42.dev`.

### Phase 5: Backup, restore, and rollback drills

Add scheduled backups, off-host copy, restore rehearsal, rollback rehearsal,
certificate expiry checks, and evidence capture.

### Phase 6: Multi-node expansion

Add secondary control-plane and worker roles through inventory. Validate join
tokens, node readiness, storage placement, and stable public endpoints.

## Open Decisions

1. Whether the first reviewed implementation target is `vm-105` as a combined
   control-plane/worker or a separate role split.
2. Whether first Postgres placement for `vm-105` should be host-managed or
   in-cluster.
3. Which front-door strategy replaces Tailscale Funnel first: MetalLB/VIP,
   external reverse proxy, or Cloudflare Tunnel as an interim public ingress.
4. Where off-host backups land for early `core42.dev` environments.
5. Whether Cloudflare DNS is managed directly by cert-manager only, or also by a
   separate DNS reconciliation step for A/CNAME records.

## Immediate Implementation Tasks

1. Add environment inventory schema validation and a dry-run renderer.
2. Add a preflight command that checks local `.env.cloudflare.core42-dev`
   presence without printing values.
3. Add a non-mutating `vm-105` facts collection plan for review.
4. Add Ansible role boundaries for `env_common`, `k3s_server`, `k3s_agent`,
   `cert_manager_cloudflare`, `env_evidence`, and `env_backup`.
5. Add a manual action gap log format and a required evidence check.
6. Add a restore-drill runbook for the chosen Postgres placement.