# Dev Control Environment Reset Runbook

This runbook covers creating the dev-control environment from a distribution
profile. The default production-shaped profile for new setups is
`single_node_rke2`. `vm-104` (`100.90.157.34`) is the current testing node, and
the checked-in `bootstrap_plan.yaml` implements the guarded `single_node_rke2`
dev-control profile. k3s remains documented only as the previous lightweight
validation profile and as rollback context.

The automation here is **read-only by default**. No tooling in this repo
applies mutations without an explicit stage allowlist and operator gate.
Every mutating step is an explicit, operator-run command that may be issued
only after the operator-gate criteria in §4 are met.

---

## 1. Target

| Property | Value |
|---|---|
| Host | `vm-104` |
| Tailnet IP | `100.90.157.34` |
| SSH user | `hpcadmin` (passwordless sudo expected) |
| OS | Ubuntu 24.04 LTS (`noble`) |
| Data root | `/ai-cloud-data` (xfs, about 1.6T on current vm-104) |
| Domain | `aicloud-dev-<service>.core42.dev` |
| Default dev-control profile | `single_node_rke2` |
| Current dev-control profile | `single_node_rke2` active on clean vm-104 rebuild |
| Cluster in current plan | single-node RKE2, control-plane + worker |
| Ingress | traefik (helm-managed), default class, exposed by Cloudflare Tunnel on vm-104 |
| TLS | cert-manager + Let's Encrypt + Cloudflare DNS-01 |
| Wildcard secret | `gpuaas-dev-wildcard-tls` |
| Backup root | `/ai-cloud-data/gpuaas/dev-control/backups` |

---

## 2. Distribution profiles

| Profile | Role | Status |
|---|---|---|
| `single_node_rke2` | Production-shaped single-node dev-control/staging-like target | Current vm-104 dev-control profile |
| `multi_node_rke2` | Future production-shaped multi-node target | Planned |
| `single_node_k3s` | Lightweight test/dev validation profile | Parked previous validation profile |

The current checked-in plan is intentionally marked
`distribution_profile: single_node_rke2`. It still includes k3s
decommission/rollback stages so a replacement or previously bootstrapped host
can move from the lightweight validation profile to the prod-shaped dev-control
profile without shell-history reconstruction.

Current queue ownership:

- `C-ENV-PLATFORM-CONTROL-RESET-CLOUDFLARE-001` owns this standalone
  dev-control reset. Preserve GitLab, retire the old Funnel and IP-derived DNS
  assumptions, remove stale external Postgres only during an approved cleanup
  window, and rebuild the GPUaaS platform shape under
  `/ai-cloud-data/gpuaas/dev-control`.
- The checked-in automation can safely validate inventory, capture read-only
  evidence, emit plans, apply guarded bootstrap stages, and smoke the public
  edge.
- Destructive cleanup remains operator-gated.

### 2.1 Current State - 2026-05-24

Validated with `make env-preflight ENV=dev HOST=vm-104`,
`make env-bootstrap-check ENV=dev`, and `make env-smoke ENV=dev`.

- SSH to `hpcadmin@100.90.157.34` works with passwordless sudo.
- `/ai-cloud-data` is mounted and reserved for dev-control data.
- Docker is installed and only the GitLab containers remain:
  `gpuaas-gitlab` is healthy and `gpuaas-gitlab-runner` is up.
- GitLab sign-in on `http://127.0.0.1:8929/users/sign_in` returns HTTP 200
  from the host after cleanup.
- Removed legacy Docker carryover:
  `gpuaas-platform-control-postgres`, `gpuaas-platform-control-adminer`, old
  `gpuaas-dev-*-tailscale` / `gpuaas-dev-*-proxy` containers, old
  `external-infra_*`, `local-dev_*`, `gpuaas-fe2e-*`, and stale runner cache
  volumes.
- k3s has been uninstalled and is inactive; `/ai-cloud-data/k3s-storage` was
  removed.
- RKE2 is installed and active; node `vm-104` is Ready on Kubernetes
  `v1.30.8+rke2r1`.
- The operator kubeconfig is exported to the gitignored evidence path
  `.git/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml`.
- cert-manager, the Cloudflare DNS-01 ClusterIssuer, the wildcard origin
  certificate, and Traefik are installed and Ready.
- Cloudflare Tunnel `gpuaas-dev-vm-104` runs in Docker on `vm-104`, with its
  token staged at `/ai-cloud-data/gpuaas/dev-control/secrets/cloudflared/token`.
- `make env-smoke ENV=dev` passes DNS, public TLS, public edge reachability,
  and kubectl node readiness. App/API routes may still return Traefik 404
  until the GPUaaS app stack is deployed.

---

## 3. Read-only commands you should run first

These commands emit sanitized output and write evidence under
`.git/ops-evidence/env-automation/dev/`. None of them installs packages,
writes files on `vm-104`, mutates DNS, or reads Cloudflare credential values.

```bash
# 1. Validate the inventory and config shape.
make env-inventory-validate ENV=dev

# 2. Capture sanitized SSH facts (id, os, mounts, tools, sudo, network).
make env-preflight ENV=dev HOST=vm-104

# 3. Emit the full bootstrap plan (human + JSON evidence).
make env-bootstrap-plan ENV=dev

# 4. Probe per-stage state on the host (read-only).
make env-bootstrap-check ENV=dev

# 5. Run end-to-end DNS / TLS / API / k8s smoke checks.
make env-smoke ENV=dev

# 6. Emit the rollback plan for review before any mutating step.
make env-rollback-plan ENV=dev
```

Every command writes a sanitized `*-<command>.json` evidence file to
`$(git rev-parse --git-path ops-evidence)/env-automation/dev/`. Evidence
never contains Cloudflare credential bytes, k3s node-tokens, kubeconfig
contents, or TLS private keys. SSH stderr is scrubbed for `token=`,
`password=`, `secret=`, `credential=`, and `api_key=` patterns.
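
The scrubbing rule above can be illustrated with a minimal shell sketch. It is not the repo's implementation; the function name `scrub_secrets` is hypothetical, and the sketch assumes each secret appears as a single whitespace-free `key=value` token.

```bash
# Illustrative only: not the runner's own sanitizer.
# Reads text on stdin and replaces the value after each sensitive key
# with [REDACTED], leaving the key name visible for debugging.
scrub_secrets() {
  sed -E 's/(token|password|secret|credential|api_key)=[^[:space:]]+/\1=[REDACTED]/g'
}

# Example: scrub captured stderr before it is written anywhere.
# some_command 2>&1 >/dev/null | scrub_secrets
```

A value that contains spaces would be only partially redacted by this pattern, which is one reason to treat it as a sketch rather than a drop-in replacement.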

---

## 4. Operator gate before any mutation

Do **not** proceed past `bootstrap-check` until all of the following are
true. This gate is intentionally manual.

- [ ] `make env-inventory-validate ENV=dev` exits 0.
- [ ] `make env-preflight ENV=dev HOST=vm-104` evidence reviewed and
      shows `/ai-cloud-data` mounted with >100G free.
- [ ] `make env-bootstrap-check ENV=dev` evidence reviewed and the
      operator agrees the current host state matches the expected
      pre-bootstrap state.
- [ ] `.env.cloudflare.core42-dev` exists on the operator workstation,
      is gitignored, holds a token scoped to `Zone:DNS:Edit` and
      `Zone:Zone:Read` on `core42.dev` only (see §12.2), and has a rotation
      owner recorded out-of-band.
- [ ] Operator has read `bootstrap_plan.yaml` and the corresponding
      stage's `apply.operator_gate` text.
- [ ] Operator has read `make env-rollback-plan ENV=dev` output and is
      prepared to execute the relevant rollback section if a stage fails.
- [ ] Operator confirms the named environment, target host, DNS profile,
      state root, and evidence path before each mutating command.
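
The free-space item in this gate can be cross-checked mechanically. The following is a minimal sketch, assuming it runs on `vm-104` itself and that "100G" is read as GiB; `has_free_gib` is a hypothetical helper and is not part of the repo tooling. It only reads `df` output.

```bash
# Returns 0 when <path> has strictly more than <min_gib> GiB available,
# 1 when it does not, and 2 when the path cannot be inspected.
has_free_gib() {
  local path="$1" min_gib="$2" avail_kb
  # -P forces one POSIX-format line per filesystem; column 4 is available 1K blocks.
  avail_kb="$(df -Pk "$path" 2>/dev/null | awk 'NR==2 {print $4}')"
  [ -n "$avail_kb" ] || return 2
  [ $(( avail_kb / 1048576 )) -gt "$min_gib" ]
}

# Example (on vm-104):
# has_free_gib /ai-cloud-data 100 && echo "free-space gate ok"
```

The reviewed preflight evidence remains the record for the gate; a helper like this is only a quick sanity check.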

---

## 5. Stage order

Stage definitions, including `apply.commands` and `rollback.commands`,
live in `bootstrap_plan.yaml`. The order is:

1. `01-state-directories` — create `/ai-cloud-data/gpuaas/dev-control/...`
   subtree. Sudo. Idempotent (`install -d`).
2. `02-cloudflare-credentials` — create Kubernetes secret
   `cloudflare-api-token-secret` in `cert-manager` from operator-local
   `.env.cloudflare.core42-dev`. Never read or print values.
3. `03-k3s-single-node` — install k3s with state on the data disk and
   traefik disabled so we can install our pinned chart in stage 05.
4. `04-kubeconfig-export` — copy kubeconfig to the operator over SSH,
   then delete the on-host copy.
5. `03b-k3s-validation-decommission` — guarded teardown of the current
   k3s validation profile before rebuilding the host as RKE2.
6. `03c-rke2-single-node` — install the production-shaped single-node
   RKE2 server profile with bundled ingress disabled.
7. `04a-rke2-kubeconfig-export` — copy the RKE2 kubeconfig to the
   operator-controlled gitignored evidence path.
8. `05-ingress-and-cert-manager` — helm-install traefik + cert-manager,
   apply the Cloudflare ClusterIssuer and the wildcard Certificate.
9. `06-cloudflare-tunnel-public-ingress` — operator configures a named
   Cloudflare Tunnel and proxied CNAMEs, then runs `cloudflared` in Docker
   on `vm-104`. The operator laptop is not exposed.
10. `07-backups` — local backup directory. Off-host backup target is
   `pending-review` and must be filled in before day 14.
11. `99-decommission` — operator-only teardown (reverse stage order).

Each stage in the plan declares:

- `idempotent: true|false` — only `true` stages are eligible for retry.
- `sudo: true|false` — whether commands assume passwordless sudo on the
  target.
- `check.commands` / `check.local_commands` — read-only state queries
  the tool runs for you.
- `apply.commands` — mutating commands. **Never run by the tool** without the stage allowlist and the stage's second confirmation variable.
- `apply.operator_gate` — preconditions to confirm before applying.
- `rollback.commands` — manual rollback. Never run by the tool.

---

## 6. Evidence layout

```
.git/ops-evidence/env-automation/dev/
  dev-control-<UTC-timestamp>-preflight.json
  dev-control-<UTC-timestamp>-bootstrap-plan.json
  dev-control-<UTC-timestamp>-bootstrap-check.json
  dev-control-<UTC-timestamp>-rollback-plan.json
  dev-control-<UTC-timestamp>-smoke.json
```

`.git/ops-evidence` is outside the working tree and is not tracked by git.
Retention is 30 days. Anything resembling a token, password, credential,
api-key, or private key is replaced with `[REDACTED]` before write.
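
The runbook states the 30-day retention but not how it is enforced. As a hedged sketch, a hypothetical helper such as `list_expired_evidence` could list candidates for review; it assumes evidence files end in `.json` and it never deletes anything.

```bash
# Prints evidence JSON files under <dir> whose last modification is
# more than <days> full days old (default 30). Lists only.
list_expired_evidence() {
  local dir="$1" days="${2:-30}"
  find "$dir" -type f -name '*.json' -mtime +"$days" -print
}

# Example:
# list_expired_evidence "$(git rev-parse --git-path ops-evidence)/env-automation/dev"
```

Keeping deletion out of the helper means an operator still decides what is removed.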

---

## 7. Known guardrails

- Do not commit `.env.cloudflare.core42-dev` or the exported kubeconfig.
- Do not edit `bootstrap_plan.yaml` apply commands inline at run time;
  PR the change so it lands as a versioned plan revision.
- Do not run `apply` commands outside the gated `make env-bootstrap-apply`
  path; the tooling is read-only by default and refuses to mutate without the
  stage's second confirmation.
- Do not introduce a non-`/ai-cloud-data`-rooted state directory; the
  validator enforces `state_root` and `backup_profile.local_path` are
  both under `/ai-cloud-data`.
- Do not add new k3s-only stages unless they are explicitly scoped to the
  `single_node_k3s` profile. New dev-control-equivalent setup should target
  `single_node_rke2`.
- Keep this work isolated from product V3 UI/API changes and the proxy
  data-plane track.

---

## Current state snapshot (informational only)

Latest preflight against `vm-104` recorded in `.git/ops-evidence/...`:

- Ubuntu 24.04.4 LTS, 32 cores, 193 GiB RAM.
- `/ai-cloud-data` mounted xfs on `/dev/vda`, 502 GiB free.
- `docker` + `containerd` already installed.
- `k3s`, `kubectl`, `helm`, `ansible` not installed.
- Passwordless sudo available.
- Tailscale up; public reachability for DNS targets is provided by
  Cloudflare Tunnel once stage `06-cloudflare-tunnel-public-ingress` is
  configured.

The current dev-control cluster has progressed past the original stage-1
snapshot: k3s, cert-manager, Traefik, and wildcard TLS are already
installed. The remaining public-readiness gap is DNS/ingress exposure via
Cloudflare Tunnel.

2026-05-15 live check:

- SSH reachable at `hpcadmin@100.90.157.34`.
- k3s node `vm-104` is Ready on Ubuntu 24.04.4.
- cert-manager, Traefik, and wildcard certificate
  `gpuaas-dev-control-wildcard` are Ready.
- Cloudflare Tunnel is not running because the tunnel token is missing at
  `/ai-cloud-data/gpuaas/dev-control/secrets/cloudflared/token`.
- `*.dev-control.aicloud.core42.dev` does not yet resolve publicly.
- The next action is not to finish this k3s public path; it is to implement
  and review the `single_node_rke2` profile, then rebuild vm-104 through
  `C-DEV-VM105-DEV-RKE2-LIVE-REBUILD-001`.

2026-05-16 live rebuild update:

- `vm-104` was rebuilt from the k3s validation profile to single-node RKE2
  through the guarded automation stages.
- RKE2 node `vm-104` is Ready with Kubernetes `v1.30.8+rke2r1`.
- cert-manager, the Cloudflare DNS-01 ClusterIssuer, the wildcard origin
  certificate, and Traefik are installed on RKE2.
- Cloudflare Tunnel `gpuaas-dev-vm-104` runs in Docker on `vm-104`.
- Public edge hostnames use the flat `aicloud-dev-<service>.core42.dev`
  pattern. The nested `*.dev-control.aicloud.core42.dev` pattern failed at the
  Cloudflare edge TLS layer for the same reason as the earlier kind
  `*.kind.aicloud.core42.dev` attempt.
- The base environment is reachable through Cloudflare, but there are no
  GPUaaS app/API workloads deployed yet, so public routes currently return
  Traefik 404 until the app stack is deployed.

2026-05-18 app-route capacity update:

- The GPUaaS app/API stack is deployed and public dev-control auth/API/app hosts are
  reachable.
- `make env-preflight ENV=dev HOST=vm-104` passes; the RKE2 node `vm-104` is
  Ready and core, infra, observability, Traefik, Cloudflare tunnel, and
  Pomerium pods are Running.
- Demo JupyterLab and vLLM app artifacts are published through public APIs, and
  the launch prerequisites exist: default SSH key, workspace storage bucket,
  and an active service account.
- `scripts/ops/dev-control_app_route_readiness.sh` reports both JupyterLab and vLLM
  prechecks as `ready=true`.
- Worker capacity now exists through product APIs: vm-104 is temporarily
  enrolled as a dev-control self-worker, with one active allocation and running
  JupyterLab/vLLM app instances for route validation. This is a dev-control unblock,
  not the target production worker shape.

### Demo Worker Capacity Bootstrap

Queue task: `C-DEV-WORKER-CAPACITY-API-FIRST-BOOTSTRAP-001`.

The dev-control app-route smoke must not be unblocked by direct SQL inserts. The
correct path is API-first and should mirror what a real customer/dev-control operator
would do:

1. Choose the worker-capacity shape.
   - Preferred: a separate worker VM in the same network as vm-104.
   - Temporary fallback: vm-104 itself as a worker, but only with an explicit
     risk note because it is also the RKE2 control-plane host.
2. Converge the manual worker-node baseline before enrollment:
   `scripts/ops/gpuaas_manual_worker_node_converge.sh` for packages,
   log rotation, Docker/runtime prerequisites, node-agent directories, and
   observability/log-shipping prerequisites.
3. Register the node through the admin API:
   `POST /api/v1/admin/nodes` with `onboarding_mode=manual`,
   `sku=compute-vm`, `gpus_total=2`, and `region_code` matching the dev-control
   region.
4. Retrieve bootstrap material through the supported API:
   `POST /api/v1/admin/nodes/{node_id}/bootstrap-script`, or the matching
   V3 lifecycle alias if the operator is driving it from the V3 shell.
5. Run the bootstrap on the worker so `gpuaas-node-agent` enrolls and starts
   heartbeating. Verify node-agent logs are in Loki and metrics appear before
   scheduling user work.
6. Create a normal user allocation through public allocation APIs. Do not mark
   allocations active by hand.
7. Launch JupyterLab and vLLM through normal app launch APIs so the app runtime
   worker creates real app instances and route records.
8. Rerun:

```bash
APP_PUBLIC_URL=https://aicloud-dev-app.core42.dev \
AUTH_PUBLIC_URL=https://aicloud-dev-auth.core42.dev \
scripts/ops/dev-control_app_route_readiness.sh
```

Only after nodes, allocations, app instances, and prechecks are green should the
dev-control Pomerium app-route smoke run.
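
To make step 3 concrete, here is a minimal sketch of the registration request body. The field names come from the steps above; the flat JSON shape, the bearer-token header, and the `API_URL` and `OPERATOR_TOKEN` variables are assumptions, and `node_registration_payload` is a hypothetical helper. Values are assumed to be simple identifiers that need no JSON escaping.

```bash
# Builds a JSON body for POST /api/v1/admin/nodes.
# Assumption: the API accepts these four fields as a flat JSON object.
node_registration_payload() {
  local sku="$1" gpus_total="$2" region_code="$3"
  printf '{"onboarding_mode":"manual","sku":"%s","gpus_total":%d,"region_code":"%s"}\n' \
    "$sku" "$gpus_total" "$region_code"
}

# Hypothetical invocation (not run here; header and variables are assumptions):
# curl -fsS -X POST "$API_URL/api/v1/admin/nodes" \
#   -H "Authorization: Bearer $OPERATOR_TOKEN" \
#   -H 'Content-Type: application/json' \
#   -d "$(node_registration_payload compute-vm 2 <region-code>)"
```

The sketch is meant for reading the API contract, not as a replacement for whatever guarded tooling the environment provides for this flow.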

2026-05-18 live notes:

- Demo Worker Capacity Bootstrap now has a guarded helper:
  `scripts/ops/dev-control_worker_capacity_bootstrap.sh`.
- The helper is API-first: it mints an operator token, verifies the default
  project, creates the manual node only in `--apply` mode, retrieves bootstrap
  material from the product API, and runs the script over SSH.
- vm-104 is currently allowed as a temporary self-worker only with
  `--allow-control-plane-worker`.
- The dev-control profile keeps browser-facing app/api/auth routes behind Cloudflare,
  but node-agent control traffic uses the direct vm-104 LAN NodePort so
  separate worker VMs can enroll:
  `NODE_BOOTSTRAP_API_URL=http://10.176.46.104:32224`.
- This direct NodePort assumption is provider-network dependent. Mac/UTM
  MAAS-LXD workers behind a different VPN/routed subnet could reach
  `https://aicloud-dev-api.core42.dev` but could not reach
  `http://10.176.46.104:32224`; for that provider profile, the installed
  node-agent runtime URL must use the public dev-control API or another
  provider-reachable control-plane endpoint.
- The guarded helper supports this for lab/provider-specific bootstrap with
  `--node-runtime-api-url https://aicloud-dev-api.core42.dev`. This changes the
  installed `GPUAAS_API_URL` only; bootstrap script/package fetches still use
  `NODE_BOOTSTRAP_PUBLIC_API_BASE_URL`.
- For repeatable provider-profile reachability, use the named preset instead
  of editing dev-control env files by hand:
  `--node-runtime-api-profile dev-control-public-api`. The helper resolves that preset
  to `https://aicloud-dev-api.core42.dev`, validates the URL shape, and checks
  `${GPUAAS_API_URL}/api/v1/healthz` from the target worker over SSH during
  dry-run and apply mode before node-agent install.
- The previous host-local value, `http://127.0.0.1:32748`, only works for the
  temporary vm-104 self-worker fallback. Do not use it for Proxmox, MAAS-LXD,
  or other separate worker VMs.
- Do not set `NODE_BOOTSTRAP_RESOLVE_ADDRESS` for the dev-control Cloudflare profile.
  Mapping `aicloud-dev-api.core42.dev` to the vm-104 host IP sends bootstrap
  traffic to port 443, where nothing is listening outside the Cloudflare
  tunnel.
- Node-facing terminal streams are separate from browser-facing terminal
  WebSocket traffic. Demo worker nodes should use
  `NODE_BOOTSTRAP_TERMINAL_API_URL=https://term.dev.aicloud.core42.dev:30950`
  with `NODE_BOOTSTRAP_TERMINAL_RESOLVE_ADDRESS=10.176.46.104`. The hostname is
  intentionally under `*.dev.aicloud.core42.dev` so node-agent TLS verification
  matches the installed wildcard certificate; do not use
  `term.dev-control.aicloud.core42.dev` unless a matching certificate is issued.
  The resolve address pins only the terminal hostname to the private vm-104
  address during bootstrap and leaves API and registry hostnames untouched.
  Manual `/etc/hosts` edits on worker nodes are diagnostic proof only; the
  durable dev-control bridge is the environment-profile/bootstrap rendering.
- Future infra-backed replacements are internal DNS for the node-facing dev-control
  hostname or a private load-balancer VIP for `gpuaas-terminal-gateway-node`.
  Until one exists, the terminal-specific bootstrap resolve address is the
  controlled bridge.
- Provider VM cloud-init fetches the manual bootstrap script with bounded retry.
  If a VM starts while `gpuaas-api` is rolling, a temporary
  `Failed to connect to 10.176.46.104 port 32224` should recover without manual
  intervention. If the retry window is exhausted, inspect `cloud-init-output.log`,
  verify `NODE_BOOTSTRAP_API_URL`, and confirm the `gpuaas-api` NodePort is
  reachable from the provider network before launching another refill.
- The API bootstrap bundle must point at an artifact that exists in the dev-control
  registry: `aicloud-dev-registry.core42.dev/platform/node-agent-bootstrap`.
  The dev-control release profile publishes this artifact to the dev-control registry and
  patches the live `NODE_BOOTSTRAP_PACKAGE_REF`, digest, and tag from the
  release manifest. Do not patch those values from the CI registry unless the
  bootstrap broker is also configured to read that registry.
- A separate Proxmox worker VM has been enrolled for dev-control capacity:
  `gpuaas-compute-vm-tiny-smoke-01` at `10.176.32.80`, reachable from the
  operator workstation with `ProxyJump=subash@10.176.32.19`. The enrolled
  admin node id is `e365c8d6-1c74-4db3-b22c-7e685732e325`.
- Current admin node creation still requires a positive `gpus_total`; use
  `--gpus-total 1` for this temporary CPU worker until the SKU resource-model
  migration removes the GPU-shaped admin node contract.
- Dry-run the existing enrolled worker with:

  ```bash
  scripts/ops/dev-control_worker_capacity_bootstrap.sh \
    --target-host 10.176.32.80 \
    --target-ssh-host 10.176.32.80 \
    --target-user gpuaas \
    --hostname gpuaas-compute-vm-tiny-smoke-01 \
    --sku compute-vm \
    --gpus-total 1 \
    --ssh-option -J \
    --ssh-option subash@10.176.32.19
  ```

- Verify worker parity with:

  ```bash
  scripts/ops/gpuaas_worker_node_parity_check.sh \
    --host 10.176.32.80 \
    --user gpuaas \
    --ssh-option -J \
    --ssh-option subash@10.176.32.19 \
    --stage enrolled
  ```

- Node-control mTLS is currently disabled on purpose in dev-control because
  Cloudflare does not pass the node client certificate through to Traefik/API.
  This is a dev-control profile constraint, not the target production security
  model; track the direct node-control mTLS endpoint separately before
  promoting this pattern.
- The dev-control app runtime worker must have
  `APP_RUNTIME_MANAGED_INGRESS_PUBLIC_HOST_MAP` set for both
  `jupyterlab.web` and `vllm-openai.openai`.
- The dev-control proxy runtime reconciler must not pin `PROXY_RUNTIME_ENDPOINT` to
  `web`; an empty value reconciles both browser and API endpoints. Pinning it
  to `web` hid the vLLM/OpenAI route.
- 2026-05-18 app route smoke passed:
  JupyterLab `https://aicloud-dev-jupyter.core42.dev/lab` returns the
  expected Pomerium/OIDC redirect for browser auth, and vLLM
  `https://aicloud-dev-openai.core42.dev/v1/models` returns API-style `401`
  without a browser redirect.
- The first JupyterLab artifact published to dev-control was arm64-only and failed on
  vm-104 amd64 with `exec format error`. The dev-control route smoke currently uses a
  republished linux/amd64 JupyterLab artifact. Release automation should
  publish or validate target-platform runtime artifacts before launch.
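
The bounded retry mentioned in the cloud-init note above can be sketched as follows. The actual attempt count and delay used by the provider cloud-init are not stated in this runbook, so the numbers in the example are assumptions, and `retry_bounded` is a hypothetical helper.

```bash
# Runs a command up to <attempts> times, sleeping <delay_s> seconds after
# each failure. Returns 0 on the first success, otherwise the exit status
# of the last failed attempt.
retry_bounded() {
  local attempts="$1" delay_s="$2" n=1 status=0
  shift 2
  while :; do
    "$@" && return 0
    status=$?
    [ "$n" -ge "$attempts" ] && return "$status"
    n=$(( n + 1 ))
    sleep "$delay_s"
  done
}

# Example (assumed values: 10 attempts, 15 seconds apart):
# retry_bounded 10 15 curl -fsS "$NODE_BOOTSTRAP_API_URL/api/v1/healthz"
```

Because the loop is bounded, a persistent failure surfaces as a non-zero exit instead of hanging the boot, which matches the "inspect `cloud-init-output.log`" guidance.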

2026-05-31 UAT readiness note:

- Dev-control on-demand Proxmox workers failed to enroll when the RKE2 overlay
  still rendered the stale node-facing NodePorts `32748` and `32368`.
  The first 2026-05-31 correction accidentally targeted `10.176.46.105`,
  which is a different RKE2 cluster and rejects dev-control OIDC/bootstrap
  tokens. The dev-control cluster is `vm-104` / `10.176.46.104`; its live
  node-facing services are `gpuaas-node-api` on NodePort `32224` and
  `gpuaas-terminal-gateway-node` on NodePort `30950`.
- The durable profile source is now `10.176.46.104` in
  `infra/k8s/overlays/dev-control-rke2/configmap.yaml` and
  `infra/ansible/inventory/environments/dev/group_vars/all.yml`.
- Before running mutating UAT, verify the provider-network path from the Proxmox
  jump host:

  ```bash
  ssh subash@10.176.32.19 \
    'curl -fsS http://10.176.46.104:32224/api/v1/healthz && nc -vz -w3 10.176.46.104 30950'
  ```

- Use the dev-control wrapper and named private profile for manual worker
  bootstrap. The wrapper only supplies dev public URLs; all mutation remains
  gated by `--apply`:

  ```bash
  scripts/ops/dev-control_worker_capacity_bootstrap.sh \
    --target-host <worker-ip> \
    --target-ssh-host <worker-ip> \
    --target-user gpuaas \
    --hostname <worker-hostname> \
    --sku compute-vm \
    --gpus-total 1 \
    --region-code region-maas-1 \
    --node-runtime-api-profile dev-control-private-node-api \
    --ssh-option -J \
    --ssh-option subash@10.176.32.19
  ```

Track status for `C-DEV-PUSHBUTTON-K8S-ENV-CLOSURE-001`: **parked**. The
repeatable profile model is documented and validated locally. The
prod-shaped `single_node_rke2` guarded profile stages now exist in
`bootstrap_plan.yaml`; resume with the live rebuild task rather than more
vm-104-specific k3s public-tunnel work.

---

## 8. Stage 03 — single-node k3s (C-DEV-VM105-DEV-K3S-STAGE2-001)

### 8.1 What this stage is

Stage `03-k3s-single-node` installs a single-node k3s server (control-plane
+ worker) on `vm-104` with state under
`/ai-cloud-data/gpuaas/dev-control/state/k3s`, default Traefik **disabled**, and
two node labels driven by inventory:
`gpuaas.io/environment=dev`, `gpuaas.io/host-role=platform-control`.

The runner is **check-only by default**. `make env-bootstrap-plan ENV=dev`
emits the planned commands and `make env-bootstrap-check ENV=dev` runs
read-only probes; neither installs anything. Mutation requires every gate
in §8.3 below.

### 8.2 Read-only commands (always safe to run)

```bash
make env-inventory-validate ENV=dev
make env-preflight ENV=dev HOST=vm-104
make env-bootstrap-plan ENV=dev FORMAT=json   # full plan as JSON evidence
make env-bootstrap-check ENV=dev              # per-stage check.commands
make env-rollback-plan ENV=dev                # uninstall path on display
```

A green `bootstrap-check` for stage 03 reports either `k3s_installed` or
`k3s_missing`, the `is-active k3s` status, the `state_dir_ok` flag for
`/ai-cloud-data/gpuaas/dev-control/state/k3s`, and the current
`gpuaas.io/environment` label.

### 8.3 Apply path (operator gate)

The apply path is intentionally guarded by **four** distinct inputs that
must all match. Anything weaker refuses immediately:

| Gate | Required value |
|---|---|
| `ENV` | `dev` |
| `HOST` | `vm-104` |
| `STAGE` | `03-k3s-single-node` |
| Second confirmation env | `CONFIRM_K3S_APPLY=I-UNDERSTAND` |

The runner additionally:

1. Validates inventory + config shape.
2. Captures a fresh preflight evidence file.
3. Runs `bootstrap-check` to record the pre-apply state.
4. Emits the rollback plan as evidence.
5. Applies the stage's `apply.commands` (idempotent — re-running on an
   already-installed node is a no-op).
6. Re-runs the stage's `check.commands` as a post-apply verification.
7. Writes one `*-bootstrap-apply.json` evidence file with the pre-apply
   check pointer, the apply results, and the post-apply check.

Command (run only after a human has confirmed the §4 operator gate and the §8.4 checklist):

```bash
ENV=dev HOST=vm-104 STAGE=03-k3s-single-node \
  CONFIRM_K3S_APPLY=I-UNDERSTAND \
  make env-bootstrap-apply
```
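
The four-input gate can be pictured as a plain equality check. The runner's real implementation is not shown in this runbook; this sketch only mirrors the gate, with the `ENV` value taken from the apply command above. `k3s_apply_gate` is a hypothetical name.

```bash
# Returns 0 only when all four inputs match; otherwise prints the first
# mismatch to stderr and returns 1 without doing anything else.
k3s_apply_gate() {
  local env="$1" host="$2" stage="$3" confirm="$4"
  [ "$env" = "dev" ]                  || { echo "refused: ENV must be dev" >&2; return 1; }
  [ "$host" = "vm-104" ]              || { echo "refused: HOST must be vm-104" >&2; return 1; }
  [ "$stage" = "03-k3s-single-node" ] || { echo "refused: STAGE not allowlisted" >&2; return 1; }
  [ "$confirm" = "I-UNDERSTAND" ]     || { echo "refused: CONFIRM_K3S_APPLY not set" >&2; return 1; }
}

# Example:
# k3s_apply_gate "$ENV" "$HOST" "$STAGE" "${CONFIRM_K3S_APPLY:-}" && echo "gate ok"
```

Checking every input before any side effect is what lets a refused call double as safe acceptance evidence.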

### 8.4 Pre-apply checklist (in addition to §4)

- [ ] `make env-bootstrap-check ENV=dev` shows
      `01-state-directories` all `ok:` and `03-k3s-single-node` reports
      `k3s_missing` + `state_dir_ok` (we are pre-install).
- [ ] `df -h /ai-cloud-data` on `vm-104` shows >50 GiB free.
- [ ] No existing `k3s`, `kubelet`, or `containerd-shim` processes are
      running on `vm-104` outside the expected docker daemon.
- [ ] Node-token rotation policy is recorded in the secret manager of
      record (the token lives in `/etc/rancher/k3s/`; do not commit).
- [ ] Operator pasted the full command from §8.3 directly into their own
      terminal, not into a script or chat client.

### 8.5 Rollback

```bash
make env-rollback-plan ENV=dev          # display the rollback commands first
# Then, by the operator, only after explicit approval:
ssh hpcadmin@100.90.157.34 'sudo /usr/local/bin/k3s-uninstall.sh'
```

`k3s-uninstall.sh` removes the k3s service, systemd unit, kubectl shims,
and `/usr/local/bin/k3s` but it **preserves**
`/ai-cloud-data/gpuaas/dev-control/state/k3s`. Removing that data dir is a
separate operator-confirmed step (the bootstrap plan's rollback comment
calls this out).

### 8.6 Decision for the first install (this slice)

The first install through this automation was executed **as a dry-run
only**: `make env-bootstrap-plan`, `make env-bootstrap-check`, and the
refused-without-confirmation `bootstrap-apply` calls (recorded as
acceptance evidence in the queue). The actual `apply.commands` for
stages 03 and 04 were not run against `vm-104` in this slice and **vm-104
was not mutated**. Future slices that flip those stages on must add the
matching `CONFIRM_*` env var and capture the resulting evidence.

---

## 9. Stage 04 — kubeconfig export

### 9.1 What this stage is

Stage `04-kubeconfig-export` pulls the cluster-admin kubeconfig from
`vm-104` to an operator-controlled gitignored path. The on-host copy is
written to `/home/__SSH_USER__/k3s-dev-control.yaml` (placeholder rendered from
inventory at apply time) and rewritten so the `server:` URL is
`https://<inventory_ip>:6443` instead of `https://127.0.0.1:6443`. The
runner then `scp`s the file back to `KUBECONFIG_EXPORT_PATH`.

Default `KUBECONFIG_EXPORT_PATH`:
`$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml`
which is outside the working tree and is **never** committed. Any path
inside the working tree is rejected by the runner.

The kubeconfig contents themselves are not logged or written into evidence
JSON; only `path`, `perms`, `size_bytes`, and the rewritten endpoint hint
are recorded.
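
The `server:` rewrite described above can be sketched as a stream filter. This is an illustration under assumptions, not the stage's actual `apply.commands`: `rewrite_server_url` is a hypothetical name, and it assumes the kubeconfig carries the loopback endpoint exactly as `https://127.0.0.1:6443`.

```bash
# Reads a kubeconfig on stdin and writes it to stdout with the loopback
# API endpoint replaced by https://<inventory_ip>:6443.
rewrite_server_url() {
  local inventory_ip="$1"
  sed -E "s#(server:[[:space:]]*)https://127\.0\.0\.1:6443#\1https://${inventory_ip}:6443#"
}

# Example (redirect to a file; never print kubeconfig contents to a log):
# rewrite_server_url 100.90.157.34 < in.yaml > out.yaml
```

Working on stdin and stdout keeps the contents out of command arguments and shell history.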

### 9.2 Apply gate

| Gate | Required value |
|---|---|
| `ENV` | `dev` |
| `HOST` | `vm-104` |
| `STAGE` | `04-kubeconfig-export` |
| Second confirmation env | `CONFIRM_KUBECONFIG_EXPORT=I-UNDERSTAND` |

```bash
ENV=dev HOST=vm-104 STAGE=04-kubeconfig-export \
  CONFIRM_KUBECONFIG_EXPORT=I-UNDERSTAND \
  make env-bootstrap-apply
```

To override the output path (must be outside the working tree):

```bash
KUBECONFIG_EXPORT_PATH=/tmp/k3s-dev-control.yaml \
ENV=dev HOST=vm-104 STAGE=04-kubeconfig-export \
  CONFIRM_KUBECONFIG_EXPORT=I-UNDERSTAND \
  make env-bootstrap-apply
```

### 9.3 Cleanup / rollback

```bash
# Operator workstation: delete the local kubeconfig copy once loaded.
rm -f "$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml"

# vm-104: delete the staged on-host copy (rollback section of stage 04).
ssh hpcadmin@100.90.157.34 'rm -f /home/hpcadmin/k3s-dev-control.yaml'
```

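Neither file is ever committed: the operator copy lives under the gitignored
evidence path, and the on-host copy sits in the SSH user's home directory on
`vm-104`.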

---

## 10. Evidence and "what to run when reviewing this slice"

A reviewer who only wants to inspect (not mutate) should run:

```bash
make env-inventory-validate ENV=dev
make env-preflight ENV=dev HOST=vm-104
make env-bootstrap-check ENV=dev
make env-bootstrap-plan ENV=dev FORMAT=json
ENV=dev HOST=vm-104 STAGE=03-k3s-single-node     make env-bootstrap-apply  # refuses
ENV=dev HOST=vm-104 STAGE=04-kubeconfig-export   make env-bootstrap-apply  # refuses
ENV=dev HOST=vm-104 STAGE=05a-cert-manager-install   make env-bootstrap-apply  # refuses
ENV=dev HOST=vm-104 STAGE=05b-cloudflare-dns01-issuer make env-bootstrap-apply  # refuses
ENV=dev HOST=vm-104 STAGE=05c-dev-control-ingress-baseline  make env-bootstrap-apply  # refuses
```

The intentionally refused commands prove the gate is wired and that
`vm-104` cannot be mutated without the explicit second confirmation env
var. Evidence JSON for each command lands under
`$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/`.

---

## 11. Stage 05a — cert-manager install (C-DEV-VM105-DEV-CERT-INGRESS-STAGE3-001)

### 11.1 What this stage does

Installs cert-manager (CRDs + controllers) into the `cert-manager`
namespace via Helm (`jetstack/cert-manager` chart, version pinned in
`bootstrap_plan.yaml` to `v1.15.3`). cert-manager is split from the
DNS-01 issuer (stage 05b) and from ingress (stage 05c) so each piece
is reviewable on its own and rollback-bounded.

### 11.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev` |
| `HOST` | `vm-104` |
| `STAGE` | `05a-cert-manager-install` |
| Second confirmation | `CONFIRM_CERT_MANAGER_INSTALL=I-UNDERSTAND` |
| Operator workstation | `kubectl` + `helm` on PATH |
| Operator kubeconfig | `KUBECONFIG_EXPORT_PATH` (or default gitignored path) must exist |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=05a-cert-manager-install \
  CONFIRM_CERT_MANAGER_INSTALL=I-UNDERSTAND \
  make env-bootstrap-apply
```

The runner refuses if any precondition is missing and the refusal is
emitted **before** any kubectl/helm command runs. No partial mutation.
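
The "refuse before any mutation" behavior can be sketched as a precondition check that runs first. The function names are hypothetical and this is not the runner's code; it only shows the order of operations, assuming the preconditions are tools on `PATH` plus a non-empty kubeconfig file.

```bash
# Prints each missing tool to stderr; returns non-zero if any are missing.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Returns 0 when the given kubeconfig path is a non-empty regular file.
require_kubeconfig() {
  [ -n "${1:-}" ] && [ -f "$1" ] && [ -s "$1" ]
}

# Example: check everything, then decide; nothing is applied on failure.
# require_tools kubectl helm && require_kubeconfig "$KUBECONFIG_EXPORT_PATH" || exit 1
```

Reporting all missing tools at once, rather than stopping at the first, saves the operator a round trip.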

### 11.3 Rollback

```bash
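# Assumption: KCFG points at the operator kubeconfig exported in stage 04/04a.
KCFG="${KUBECONFIG_EXPORT_PATH:-$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml}"
helm --kubeconfig "$KCFG" uninstall cert-manager -n cert-manager
kubectl --kubeconfig "$KCFG" delete namespace cert-manager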
```

---

## 12. Stage 05b — Cloudflare DNS-01 ClusterIssuer + wildcard certificate

### 12.1 What this stage does

Renders the Cloudflare API token from `.env.cloudflare.core42-dev` into
a Kubernetes Secret (`cloudflare-api-token-secret` in `cert-manager`),
normalizing the token key to cert-manager's required `api-token` field.
The token bytes are not logged or persisted to evidence. Then renders the
Let's Encrypt ClusterIssuer (production + staging variants) with
`ACME_CONTACT_EMAIL` as the ACME renewal-failure contact and applies the
wildcard `Certificate` for
`*.dev-control.aicloud.core42.dev` (Secret `gpuaas-dev-wildcard-tls` in
`kube-system`). Wildcard-vs-per-host tradeoff is documented inline in
`manifests/wildcard-certificate.yaml`.

### 12.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev` |
| `HOST` | `vm-104` |
| `STAGE` | `05b-cloudflare-dns01-issuer` |
| Second confirmation | `CONFIRM_CLOUDFLARE_DNS01_ISSUER=I-UNDERSTAND` |
| ACME contact | `ACME_CONTACT_EMAIL=<monitored mailbox on a real domain>` |
| Operator workstation | `kubectl` on PATH |
| `.env.cloudflare.core42-dev` | must exist (token scope: Zone:DNS:Edit + Zone:Zone:Read on `core42.dev` only) |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=05b-cloudflare-dns01-issuer \
  CONFIRM_CLOUDFLARE_DNS01_ISSUER=I-UNDERSTAND \
  ACME_CONTACT_EMAIL=platform-renewals@core42.dev \
  make env-bootstrap-apply
```

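To exercise the flow against the Let's Encrypt staging issuer first and
avoid production rate limits, set `issuerRef.name` in
`manifests/wildcard-certificate.yaml` to
`letsencrypt-cloudflare-dns01-staging` before running the apply command.
Staging certificates are not publicly trusted, so restore the production
issuer name and re-run the stage afterwards.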

### 12.3 Rollback

```bash
kubectl --kubeconfig "$KCFG" delete -f doc/operations/env-automation/environments/dev/manifests/wildcard-certificate.yaml --ignore-not-found
kubectl --kubeconfig "$KCFG" delete -f doc/operations/env-automation/environments/dev/manifests/clusterissuer-letsencrypt-cloudflare.yaml --ignore-not-found
kubectl --kubeconfig "$KCFG" -n cert-manager delete secret cloudflare-api-token-secret --ignore-not-found
# Then in the Cloudflare console: revoke the API token if no longer needed.
```

### 12.4 Token safety

- The runner reads the token file from the operator workstation only
  when the apply command runs; the bytes pass through `kubectl create
  secret --from-env-file=…` into kubectl stdin.
- `env_automation.rb` `sanitize_stderr` scrubs `token=…` / `secret=…` /
  `password=…` / `credential=…` / `api_key=…` patterns from any stderr
  before persistence.
- The plan's `prohibited` regex for this stage refuses any apply command
  that contains a literal `api_token=` (defense in depth in case a
  future edit accidentally embeds the token value).
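
The two text-level defenses above can be illustrated with a short sketch. It is an approximation for intuition only: the `sed` expression is not the Ruby `sanitize_stderr` implementation, the `[REDACTED]` marker is an assumption, and both function names are hypothetical.

```bash
# Illustrative only: a sed approximation of the scrub patterns listed above.
# Reads text on stdin and replaces the value after each sensitive key.
scrub_secrets() {
  sed -E 's/(token|secret|password|credential|api_key)=[^[:space:]]+/\1=[REDACTED]/g'
}

# Returns 0 when a command string embeds a literal api_token= assignment,
# which is the condition the stage's prohibited pattern is described as refusing.
has_literal_api_token() {
  case "$1" in
    *api_token=*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Piping captured stderr through `scrub_secrets` before saving it keeps the key names for debugging while dropping the values.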

---

## 13. Stage 05c — dev-control ingress baseline (Traefik)

### 13.1 What this stage does

Installs Traefik via Helm (`traefik/traefik` chart, version pinned in
`bootstrap_plan.yaml` to `28.3.0`) using the values file at
`manifests/traefik-values.yaml`. Traefik runs in its own `traefik`
namespace, is the default ingress class, and reads its default TLS
certificate from the wildcard Secret issued in stage 05b. Stage 03c
installed k3s with `--disable traefik` so this chart is the single
version-pinned source of truth.
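
After the install, one way to confirm that exactly one ingress class is marked as default is to filter the class list by the standard `ingressclass.kubernetes.io/is-default-class` annotation. This is a sketch, not part of the stage's own checks; the function name is hypothetical and the class name in the usage note is an assumption.

```bash
# Sketch: print the names of ingress classes marked as default.
# Input (stdin): "<name> <is-default-annotation>" per line, for example from:
#   kubectl --kubeconfig "$KCFG" get ingressclass \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.annotations.ingressclass\.kubernetes\.io/is-default-class}{"\n"}{end}'
default_ingress_classes() {
  awk '$2 == "true" { print $1 }'
}
```

A single line of output (for example `traefik`, if the chart names the class that way) is the expected result; two or more lines mean another controller also claims the default.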

Tailscale Funnel is **not** the steady-state endpoint model; the
Funnel helper (`scripts/ops/platform_control_tailscale_funnel_edges.sh`)
remains a compatible operator escape hatch but the canonical public
surface is `aicloud-dev-<service>.core42.dev` served through this
Traefik instance.

### 13.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev` |
| `HOST` | `vm-104` |
| `STAGE` | `05c-dev-control-ingress-baseline` |
| Second confirmation | `CONFIRM_INGRESS_BASELINE=I-UNDERSTAND` |
| Operator workstation | `kubectl` + `helm` on PATH |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=05c-dev-control-ingress-baseline \
  CONFIRM_INGRESS_BASELINE=I-UNDERSTAND \
  make env-bootstrap-apply
```

### 13.3 Rollback

```bash
helm --kubeconfig "$KCFG" uninstall traefik -n traefik
kubectl --kubeconfig "$KCFG" delete namespace traefik
```

---

## 13a. Stage 06 — Cloudflare Tunnel public ingress

### 13a.1 What this stage does

Stage `06-cloudflare-tunnel-public-ingress` makes
`aicloud-dev-<service>.core42.dev` public through a named Cloudflare
Tunnel, with `cloudflared` running in Docker on `vm-104`. DNS records are
proxied CNAMEs to `<tunnel-id>.cfargotunnel.com`; there are no A records
to the operator laptop.

The helper script is:

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh
```

It discovers the Traefik `websecure` NodePort from the operator
kubeconfig, writes the tunnel config in Cloudflare, stores the tunnel
runtime token under `.git/ops-evidence/env-automation/dev/cloudflare-tunnel/token`,
copies that token to `vm-104`, and starts `gpuaas-dev-cloudflared` as a
Docker container on `vm-104`.

Base service hosts route to Traefik. Pomerium-managed hosts
(`aicloud-dev-authn.core42.dev`, `aicloud-dev-term.core42.dev`,
`aicloud-dev-grafana.core42.dev`,
`aicloud-dev-swagger.core42.dev`,
`aicloud-dev-jupyter.core42.dev`, `aicloud-dev-openai.core42.dev`,
`aicloud-dev-notifications.core42.dev`) route to the Pomerium proxy HTTPS
NodePort, default `31920`.
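
The host-to-backend split described above can be sketched as a lookup. This is an illustration of the routing rule only, not code from the helper script: the function name and output labels are hypothetical, and the Traefik NodePort is left symbolic because the helper discovers it at run time.

```bash
# Sketch of the host-to-backend split described above.
# Usage: backend_for_host <hostname> [pomerium-nodeport]
backend_for_host() {
  case "$1" in
    aicloud-dev-authn.core42.dev|aicloud-dev-term.core42.dev|\
    aicloud-dev-grafana.core42.dev|aicloud-dev-swagger.core42.dev|\
    aicloud-dev-jupyter.core42.dev|aicloud-dev-openai.core42.dev|\
    aicloud-dev-notifications.core42.dev)
      # Pomerium-managed hosts go to the Pomerium proxy HTTPS NodePort.
      echo "pomerium:${2:-31920}" ;;
    aicloud-dev-*.core42.dev)
      # Every other service host goes to the Traefik websecure NodePort.
      echo "traefik" ;;
    *)
      echo "unknown"
      return 1 ;;
  esac
}
```

For example, `backend_for_host aicloud-dev-api.core42.dev` prints `traefik`, while `backend_for_host aicloud-dev-term.core42.dev` prints `pomerium:31920`.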

### 13a.2 Operator gate

| Gate | Required value |
|---|---|
| `CONFIRM_DEV_CONTROL_TUNNEL` | `I-UNDERSTAND` for `configure`, `install-token`, `start`, `restart` |
| Operator workstation | `kubectl`, `curl`, `jq`, `ssh`, `scp` on PATH |
| Operator kubeconfig | `KUBECONFIG_EXPORT_PATH` or default `.git/ops-evidence/.../rke2-dev-control.yaml` must exist |
| Cloudflare env file | `.env.cloudflare.core42-dev` with `AccountID` and `APIToken` |
| Cloudflare token scope | `Account:Cloudflare Tunnel Edit`, `Zone:DNS Edit`, `Zone:Zone Read` |
| vm-104 | Docker installed and passwordless sudo for `hpcadmin` |
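
The Cloudflare env file row can be checked without ever printing a secret value. The sketch below is a hypothetical helper and assumes the file uses plain `KEY=value` lines; the helper script's actual parsing may differ.

```bash
# Hypothetical check: report required keys that are missing from an env file.
# Assumes plain KEY=value lines; prints key names only, never values.
env_file_missing_keys() (
  file="$1"
  shift
  for key in "$@"; do
    grep -q "^${key}=" "$file" 2>/dev/null || echo "missing key: $key"
  done
)
```

For this stage the call would be `env_file_missing_keys .env.cloudflare.core42-dev AccountID APIToken`; empty output means both keys are present.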

Read-only planning:

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh plan
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh status
```

Apply path:

```bash
CONFIRM_DEV_CONTROL_TUNNEL=I-UNDERSTAND \
  scripts/ops/vm104_dev_control_cloudflare_tunnel.sh configure

CONFIRM_DEV_CONTROL_TUNNEL=I-UNDERSTAND \
  scripts/ops/vm104_dev_control_cloudflare_tunnel.sh install-token

CONFIRM_DEV_CONTROL_TUNNEL=I-UNDERSTAND \
  scripts/ops/vm104_dev_control_cloudflare_tunnel.sh start
```

Validation:

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh verify
make env-smoke ENV=dev
```

Pomerium route validation:

```bash
scripts/ops/dev-control_pomerium_oidc_configure.sh

EDGE_PROFILE=prod_public_ingress \
EDGE_DNS_SERVER=1.1.1.1 \
ROUTES=swagger \
POMERIUM_AUTHN_HOST=aicloud-dev-authn.core42.dev \
SWAGGER_HOST=aicloud-dev-swagger.core42.dev \
GRAFANA_HOST=aicloud-dev-grafana.core42.dev \
APP_HOST=aicloud-dev-app.core42.dev \
API_HOST=aicloud-dev-api.core42.dev \
AUTH_HOST=aicloud-dev-auth.core42.dev \
  scripts/ops/pomerium_edge_profile_smoke.sh
```

The configurator applies the dev-control Swagger route plus the terminal and
notification WebSocket routes. Pomerium renders the WebSocket routes only for
host routing and upgrade handling; GPUaaS still validates terminal tokens,
notification bearer tokens, allocation/session binding, and notification
fanout.

Authenticated dev-control WebSocket smoke:

```bash
scripts/ops/pomerium_dev-control_ws_authenticated_smoke.sh
```

If no active dev-control allocation exists, the smoke skips the terminal check,
prints a `BLOCK` line, and still verifies the notification WebSocket. Set
`REQUIRE_TERMINAL_ALLOCATION=true` when terminal parity is the release gate.

`EDGE_DNS_SERVER` is optional. Use it when the operator workstation resolver
is pinned to Tailscale or a LAN resolver that has not picked up newly-created
Cloudflare records yet.

### 13a.3 Rollback

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh stop
ssh hpcadmin@100.90.157.34 'sudo rm -f /ai-cloud-data/gpuaas/dev-control/secrets/cloudflared/token'
```

Then delete the proxied CNAMEs or the named tunnel in Cloudflare after
traffic is drained. The local `.env.cloudflare.core42-dev` file is never
changed by the helper.

---

## 14. Stage 03a — k3s API TLS SAN drop-in (C-DEV-VM105-DEV-CERT-INGRESS-LIVE-APPLY-001)

### 14.1 What this stage does

Adds `100.90.157.34` (and `vm-104`) to the k3s API server's TLS SAN
list via a config drop-in at `/etc/rancher/k3s/config.yaml.d/00-tls-san.yaml`
and restarts the k3s service. k3s additively merges `tls-san` with its
defaults (localhost, 127.0.0.1, internal IP, hostname) at startup and
rotates the serving cert in place when the SAN list changes. The cluster
CA does not change, so existing kubeconfigs (the operator kubeconfig
from stage 04a with `server: https://100.90.157.34:6443`) keep working
without re-issue.

Without this stage the operator kubeconfig fails TLS verify against
`100.90.157.34:6443` and stages 05a/05b/05c cannot run.

### 14.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev` |
| `HOST` | `vm-104` |
| `STAGE` | `03a-k3s-tls-san` |
| Second confirmation | `CONFIRM_K3S_TLS_SAN=I-UNDERSTAND` |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=03a-k3s-tls-san \
  CONFIRM_K3S_TLS_SAN=I-UNDERSTAND \
  make env-bootstrap-apply
```

The apply briefly restarts the k3s service. The data dir
(`/ai-cloud-data/gpuaas/dev-control/state/k3s`), node-token, and pod state are
preserved. The post-apply check verifies that the serving cert SAN list includes
`100.90.157.34` and that `k3s kubectl get --raw=/readyz` returns `ok`.
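
For review purposes, the drop-in and the SAN check can be approximated from the operator workstation. Both functions are hypothetical sketches: the rendered YAML shape is an assumption based on the k3s `tls-san` config key and should be compared against the file the runner actually writes, and the SAN check needs network reach to the API server.

```bash
# Sketch: render a k3s tls-san drop-in for review (assumed file shape).
# Usage: render_tls_san_dropin <san>...
render_tls_san_dropin() {
  printf 'tls-san:\n'
  for san in "$@"; do
    printf '  - "%s"\n' "$san"
  done
}

# Sketch: print the SAN section of the certificate served on <host>:<port>.
# Usage: api_cert_sans 100.90.157.34:6443
api_cert_sans() {
  openssl s_client -connect "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -text \
    | grep -A1 'Subject Alternative Name'
}
```

`render_tls_san_dropin 100.90.157.34 vm-104` prints a three-line YAML document listing both names under `tls-san`.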

### 14.3 Rollback

```bash
ssh hpcadmin@100.90.157.34 'sudo rm -f /etc/rancher/k3s/config.yaml.d/00-tls-san.yaml && sudo systemctl restart k3s'
```

This reverts the SAN list to the k3s defaults, after which the operator
kubeconfig again fails TLS verification against `100.90.157.34:6443`. Run it
only if the SAN must be retracted.

---

## 15. Helm CLI install (operator workstation)

Stages `05a-cert-manager-install` and `05c-dev-control-ingress-baseline` list
`helm` in `require_local_binaries`. The runner refuses those stages if
`helm` is not on PATH, before any kubectl/helm command runs.

The documented install path is the committed script:

```bash
./scripts/ops/operator_install_helm.sh
```

- On macOS with Homebrew, uses `brew install helm`.
- Elsewhere, uses the upstream `get-helm-3` installer over `curl`.
- Refuses to run as root unless `HELM_INSTALL_ALLOW_ROOT=1` is set.
- Idempotent: if `helm` is already on PATH, prints its version and exits 0.
- Does not touch `vm-104`. Helm runs locally against the operator
  kubeconfig pulled by stage 04a.

---

## 16. Stage 05 lane — current status (C-DEV-VM105-DEV-CERT-INGRESS-LIVE-APPLY-001)

The two non-credential blockers from the previous slice
(C-DEV-VM105-DEV-CERT-INGRESS-STAGE3-001) are resolved by committed
automation in this slice. The credential blocker in the last row is
still open:

| Blocker | Resolution |
|---|---|
| k3s API server TLS cert lacked `100.90.157.34` SAN | new stage `03a-k3s-tls-san` (§14) — drop-in file + `systemctl restart k3s`; idempotent; rollback documented |
| `helm` not installed on the operator workstation | `scripts/ops/operator_install_helm.sh` (§15) — committed installer; ran via brew on the C-ops workstation in this slice |
| `.env.cloudflare.core42-dev` not present locally | **still blocking** — operator must mint a Cloudflare API token scoped to `Zone:DNS:Edit + Zone:Zone:Read` on `core42.dev` only, record TTL/rotation in the secret manager of record, and place the file at the repo root (gitignored). The runner refuses stage 05b until the file exists; stage 05c then waits on the wildcard Secret from 05b. |

This slice live-applied `03a-k3s-tls-san` and `05a-cert-manager-install`
and recorded acceptance evidence under
`$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/`.
Stage `05b-cloudflare-dns01-issuer` and stage `05c-dev-control-ingress-baseline`
remain wired but unapplied; the runner refuses 05b with
`bootstrap-apply for 05b-cloudflare-dns01-issuer refuses to mutate
because operator secret file is missing: .env.cloudflare.core42-dev`.

A follow-up slice owns the Cloudflare token mint + 05b/05c live apply
once the token is in the operator's possession.

---

## 17. Dev-control CD profile

Dev-control deploys use the same promotion discipline as platform-control:

- development lands in `master`;
- `release/platform-control` is force-promoted to one exact `master` SHA;
- GitLab runs with `PLATFORM_CONTROL_RELEASE_PROFILE=dev-control-rke2`;
- deploy and remote validation source `scripts/ci/dev_control_rke2_release_env.sh`
  before invoking the existing platform-control release scripts.

Do not create or hand-edit a long-lived `release/dev-control` branch unless CI
environment protection later requires branch-scoped variables. If that
happens, `release/dev-control` must follow the same promotion-only rule as
`release/platform-control`.

Current dev-control CD target defaults:

| Setting | Value |
|---|---|
| SSH target | `hpcadmin@100.90.157.34` |
| Cluster service | `rke2-server` |
| Remote kubectl | `/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml` |
| App URL | `https://aicloud-dev-app.core42.dev` |
| API URL | `https://aicloud-dev-api.core42.dev` |
| Auth URL | `https://aicloud-dev-auth.core42.dev` |

Operator command:

```bash
DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_B64="$(base64 < ~/.ssh/gpuaas-dev-control-rke2-cd | tr -d '\n')" \
PLATFORM_CONTROL_RELEASE_MODE=deploy \
  scripts/ci/dev_control_rke2_release_deploy.sh origin/master
```
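
The command above passes the key as single-line base64. A quick local round-trip check can catch a truncated or wrapped value before the deploy script sees it. The function names below are hypothetical, and `base64 --decode` is assumed to be available (GNU coreutils and macOS both accept it).

```bash
# Sketch: encode a key file as single-line base64, as the command above does.
# Usage: encode_key_b64 <key-file>
encode_key_b64() {
  base64 < "$1" | tr -d '\n'
}

# Returns 0 when decoding <b64> reproduces the bytes of <key-file>.
# Usage: b64_matches_file <b64> <key-file>
b64_matches_file() {
  printf '%s' "$1" | base64 --decode | cmp -s - "$2"
}
```

A failing `b64_matches_file` usually means the value was copied with line wraps or trailing characters.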

Required GitLab variables:

- `GITLAB_BASE_URL`
- `GITLAB_TOKEN`
- `GITLAB_PROJECT_ID`
- registry credentials already used by platform-control release jobs
- one dev-control-specific SSH credential accepted by
  `scripts/ci/platform_control_ssh_common.sh`
  (`DEV_CONTROL_RKE2_SSH_PRIVATE_KEY`,
  `DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_B64`, or
  `DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_FILE`). The dev-control profile maps this credential
  to `PLATFORM_CONTROL_*` inside the pipeline so vm-104/platform-control
  SSH variables cannot be reused accidentally.

Optional overrides:

- `DEV_CONTROL_RKE2_SSH_HOST` for a replacement dev-control node;
- `DEV_CONTROL_RKE2_REMOTE_KUBECTL` if the RKE2 kubeconfig path changes;
- any `PLATFORM_CONTROL_*_URL` if the public endpoint profile changes.

Post-deploy validation currently runs the platform-control remote
validation suite against the dev-control public endpoints. The base environment
smoke remains available as an independent infrastructure check:

```bash
make env-bootstrap-check ENV=dev
make env-smoke ENV=dev
```

The dev-control CD profile applies `infra/k8s/overlays/dev-control-rke2`, not the
platform-control `dev-control` overlay. The overlay includes an in-cluster
Postgres deployment using hostPath storage under
`/ai-cloud-data/gpuaas/dev-control/state/postgres`, so vm-104 does not depend on
the vm-104/platform-control Docker Postgres container.

The RKE2 dev-control host does not assume a dynamic default StorageClass. The
`dev-control-rke2` overlay declares static hostPath persistent volumes under
`/ai-cloud-data/gpuaas/dev-control/state/*` for Postgres, Vault, NATS, Redis,
registry, Grafana, Loki, Prometheus, and Tempo. Vault's StatefulSet-created
PVC is bound by the deploy script through
`PLATFORM_CONTROL_STATIC_PVC_BINDINGS` so failed deploys can be retried
without mutating immutable StatefulSet volume claim templates.
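
Because every volume is statically provisioned, a claim that stays `Pending` after a deploy usually points at a missing or mismatched hostPath PV. The sketch below filters the PVC list for claims that are not bound; the function name is hypothetical, and the namespace and claim names in the usage note are made up for illustration.

```bash
# Sketch: list PVCs that are not Bound.
# Input (stdin): output of `kubectl get pvc -A --no-headers`
# (columns: NAMESPACE NAME STATUS ...).
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2 " " $3 }'
}
```

On the node this could be fed from the remote kubectl listed in the table above, for example `... get pvc -A --no-headers | unbound_pvcs`; a line such as `example-ns/data-vault-0 Pending` identifies the claim to inspect.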

When GitLab provides `CI_REGISTRY`, the dev-control profile configures an RKE2
registry mirror for that registry host using the local HTTP endpoint and
creates a `gpuaas-core` pull secret from the job registry credentials. This
keeps vm-104 able to pull release images even when the local GitLab registry
TLS ingress is unavailable.

The pull secret is keyed by the bare registry host, not an `https://`
URL, because Kubernetes image references use `registry.host/repo` and
kubelet will otherwise ignore the credentials. The deploy script also
patches every `gpuaas-core` service account so controller deployments that
use non-default service accounts receive the same pull secret.
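
The bare-host rule can be sketched as a small normalization step. This is an illustration under stated assumptions, not the deploy script's code: the function name is hypothetical and the registry hostnames in the usage note are placeholders.

```bash
# Sketch: reduce a registry URL to the bare host[:port] form that image
# references use.
# Usage: registry_host <url-or-host>
registry_host() {
  host="${1#*://}"              # drop a leading scheme such as https://
  printf '%s\n' "${host%%/*}"   # drop any path after the host
}
```

For example, `registry_host https://registry.example.test/` prints `registry.example.test`, and a value that is already bare, such as `registry.example.test:5050`, passes through unchanged.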

The dev-control profile derives the node bootstrap CA file from the
`kube-system/gpuaas-dev-wildcard-tls` secret and writes it to the same
remote path used by platform-control validation:
`/etc/gpuaas/platform-control/tls/ca.crt`.
It also copies that wildcard certificate into the namespace-local
`gpuaas-public-tls` secret expected by the shared ingress and terminal
gateway manifests.

The deploy script also syncs local-dev Keycloak import assets to
`/opt/gpuaas/platform_control/external-infra/keycloak` because the shared
infra manifest mounts the realm export and theme directory from that host
path. New dev-control hosts must not rely on vm-104 already having those files.

Controller credential bootstrap is intentionally cluster-local in the dev-control CD
profile. `scripts/ci/dev_control_rke2_release_env.sh` enables
`PLATFORM_CONTROL_CONTROLLER_BOOTSTRAP_PORT_FORWARD=true`, so
`platform_control_deploy.sh` opens short-lived `kubectl port-forward`
connections from vm-104 to the in-cluster Keycloak and GPUaaS API services
while creating or rotating the Slurm and RKE2 controller service-account
credentials. This avoids making deploy correctness depend on Cloudflare or
public TLS while those same public endpoints are being rolled.

The same controller-bootstrap step also enables
`PLATFORM_CONTROL_BOOTSTRAP_DEV_ADMIN_PROJECT_SCOPE=true`. That idempotently
aligns the dev-control `dev-admin` Keycloak subject with the dev-control default project
before using public service-account APIs. Without this, a fresh vm-104 can mint
a valid admin token but still fail project-scoped API calls with
`ownership_required` until someone has manually seeded the local dev-control persona
bindings.

To add a future environment such as `dev` on vm-104 or `test` on vm-104,
add a new target profile script modeled after
`scripts/ci/dev_control_rke2_release_env.sh` and point it at a dedicated overlay,
SSH target, public endpoint profile, Kubernetes distribution, and DB apply
mode. Do not reuse `dev-control-rke2` for another host.
