# Dev Control Environment Reset Runbook

This runbook covers distribution-profiled dev-control environment creation. The default production-shaped dev-control profile for new setup is `single_node_rke2`. `vm-104` (`100.90.157.34`) is the current testing node, and the existing `bootstrap_plan.yaml` now implements the guarded `single_node_rke2` dev-control profile. k3s remains documented only as the previous lightweight validation profile and as rollback context.

The automation here is **read-only by default**. No tooling in this repo applies mutations without an explicit stage allowlist and operator gate. Every mutating step is an explicit operator-run command after the operator-gate criteria below.

---

## 1. Target

| Setting | Value |
|---|---|
| Host | `vm-104` |
| Tailnet IP | `100.90.157.34` |
| SSH user | `hpcadmin` (passwordless sudo expected) |
| OS | Ubuntu 24.04 LTS (`noble`) |
| Data root | `/ai-cloud-data` (xfs, about 1.6T on current vm-104) |
| Domain | `aicloud-dev-.core42.dev` |
| Default dev-control profile | `single_node_rke2` |
| Current dev-control profile | `single_node_rke2` active on clean vm-104 rebuild |
| Cluster in current plan | single-node RKE2, control-plane + worker |
| Ingress | traefik (helm-managed), default class, exposed by Cloudflare Tunnel on vm-104 |
| TLS | cert-manager + Let's Encrypt + Cloudflare DNS-01 |
| Wildcard secret | `gpuaas-dev-wildcard-tls` |
| Backup root | `/ai-cloud-data/gpuaas/dev-control/backups` |

---

## 2. Distribution profiles

| Profile | Role | Status |
|---|---|---|
| `single_node_rke2` | Production-shaped single-node dev-control/staging-like target | Current vm-104 dev-control profile |
| `multi_node_rke2` | Future production-shaped multi-node target | Planned |
| `single_node_k3s` | Lightweight test/dev validation profile | Parked previous validation profile |

The current checked-in plan is intentionally marked `distribution_profile: single_node_rke2`.
It still includes k3s decommission/rollback stages so a replacement or previously bootstrapped host can move from the lightweight validation profile to the prod-shaped dev-control profile without shell-history reconstruction.

Current queue ownership:

- `C-ENV-PLATFORM-CONTROL-RESET-CLOUDFLARE-001` owns this standalone dev control reset. Preserve GitLab, retire old Funnel/retired IP-derived DNS assumptions, remove stale external Postgres only during an approved cleanup window, and rebuild the GPUaaS platform shape under `/ai-cloud-data/gpuaas/dev-control`.
- The checked-in automation can safely validate inventory, capture read-only evidence, emit plans, apply guarded bootstrap stages, and smoke the public edge.
- Destructive cleanup remains operator-gated.

## 2.1 Current State - 2026-05-24

Validated with `make env-preflight ENV=dev HOST=vm-104`, `make env-bootstrap-check ENV=dev`, and `make env-smoke ENV=dev`.

- SSH to `hpcadmin@100.90.157.34` works with passwordless sudo.
- `/ai-cloud-data` is mounted and reserved for dev-control data.
- Docker is installed and only the GitLab containers remain: `gpuaas-gitlab` is healthy and `gpuaas-gitlab-runner` is up.
- GitLab sign-in on `http://127.0.0.1:8929/users/sign_in` returns HTTP 200 from the host after cleanup.
- Removed legacy Docker carryover: `gpuaas-platform-control-postgres`, `gpuaas-platform-control-adminer`, old `gpuaas-dev-*-tailscale` / `gpuaas-dev-*-proxy` containers, old `external-infra_*`, `local-dev_*`, `gpuaas-fe2e-*`, and stale runner cache volumes.
- k3s has been uninstalled and is inactive; `/ai-cloud-data/k3s-storage` was removed.
- RKE2 is installed and active; node `vm-104` is Ready on Kubernetes `v1.30.8+rke2r1`.
- The operator kubeconfig is exported to the gitignored evidence path `.git/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml`.
- cert-manager, the Cloudflare DNS-01 ClusterIssuer, the wildcard origin certificate, and Traefik are installed and Ready.
- Cloudflare Tunnel `gpuaas-dev-vm-104` runs in Docker on `vm-104`, with its token staged at `/ai-cloud-data/gpuaas/dev-control/secrets/cloudflared/token`.
- `make env-smoke ENV=dev` passes DNS, public TLS, public edge reachability, and kubectl node readiness. App/API routes may still return Traefik 404 until the GPUaaS app stack is deployed.

---

## 3. Read-only commands you should run first

These run sanitized and write evidence under `.git/ops-evidence/env-automation/dev/`. None of them install packages, write files on `vm-104`, mutate DNS, or read Cloudflare credential values.

```bash
# 1. Validate the inventory and config shape.
make env-inventory-validate ENV=dev

# 2. Capture sanitized SSH facts (id, os, mounts, tools, sudo, network).
make env-preflight ENV=dev HOST=vm-104

# 3. Emit the full bootstrap plan (human + JSON evidence).
make env-bootstrap-plan ENV=dev

# 4. Probe per-stage state on the host (read-only).
make env-bootstrap-check ENV=dev

# 5. Run end-to-end DNS / TLS / API / k8s smoke checks.
make env-smoke ENV=dev

# 6. Emit the rollback plan for review before any mutating step.
make env-rollback-plan ENV=dev
```

Every command writes a sanitized `*-.json` evidence file to `$(git rev-parse --git-path ops-evidence)/env-automation/dev/`. Evidence never contains Cloudflare credential bytes, k3s node-tokens, kubeconfig contents, or TLS private keys. SSH stderr is scrubbed for `token=`, `password=`, `secret=`, `credential=`, and `api_key=` patterns.

---

## 4. Operator gate before any mutation

Do **not** proceed past `bootstrap-check` until all of the following are true. This gate is intentionally manual.

- [ ] `make env-inventory-validate ENV=dev` exits 0.
- [ ] `make env-preflight ENV=dev HOST=vm-104` evidence reviewed and shows `/ai-cloud-data` mounted with >100G free.
- [ ] `make env-bootstrap-check ENV=dev` evidence reviewed and the operator agrees the current host state matches the expected pre-bootstrap state.
- [ ] `.env.cloudflare.core42-dev` exists on the operator workstation, is gitignored, holds a token scoped to `Zone.DNS:Edit` for `core42.dev` only, and has a rotation owner recorded out-of-band.
- [ ] Operator has read `bootstrap_plan.yaml` and the corresponding stage's `apply.operator_gate` text.
- [ ] Operator has read `make env-rollback-plan ENV=dev` output and is prepared to execute the relevant rollback section if a stage fails.
- [ ] Operator confirms the named environment, target host, DNS profile, state root, and evidence path before each mutating command.

---

## 5. Stage order

Stage definitions, including `apply.commands` and `rollback.commands`, live in `bootstrap_plan.yaml`. The order is:

1. `01-state-directories`: create `/ai-cloud-data/gpuaas/dev-control/...` subtree. Sudo. Idempotent (`install -d`).
2. `02-cloudflare-credentials`: create Kubernetes secret `cloudflare-api-token-secret` in `cert-manager` from operator-local `.env.cloudflare.core42-dev`. Never read or print values.
3. `03-k3s-single-node`: install k3s with state on the data disk and traefik disabled so we can install our pinned chart in stage 05.
4. `04-kubeconfig-export`: copy kubeconfig to the operator over SSH, then delete the on-host copy.
5. `03b-k3s-validation-decommission`: guarded teardown of the current k3s validation profile before rebuilding the host as RKE2.
6. `03c-rke2-single-node`: install the production-shaped single-node RKE2 server profile with bundled ingress disabled.
7. `04a-rke2-kubeconfig-export`: copy the RKE2 kubeconfig to the operator-controlled gitignored evidence path.
8. `05-ingress-and-cert-manager`: helm-install traefik + cert-manager, apply the Cloudflare ClusterIssuer and the wildcard Certificate.
9. `06-cloudflare-tunnel-public-ingress`: operator configures a named Cloudflare Tunnel and proxied CNAMEs, then runs `cloudflared` in Docker on `vm-104`. The operator laptop is not exposed.
10. `07-backups`: local backup directory.
Off-host backup target is `pending-review` and must be filled in before day 14.
11. `99-decommission`: operator-only teardown (reverse stage order).

Each stage in the plan declares:

- `idempotent: true|false`: only `true` stages are eligible for retry.
- `sudo: true|false`: whether commands assume passwordless sudo on the target.
- `check.commands` / `check.local_commands`: read-only state queries the tool runs for you.
- `apply.commands`: mutating commands. **Never run by the tool.**
- `apply.operator_gate`: preconditions to confirm before applying.
- `rollback.commands`: manual rollback. Never run by the tool.

---

## 6. Evidence layout

```
.git/ops-evidence/env-automation/dev/
  dev-control--preflight.json
  dev-control--bootstrap-plan.json
  dev-control--bootstrap-check.json
  dev-control--rollback-plan.json
  dev-control--smoke.json
```

`.git/ops-evidence` is outside the working tree and is not tracked by git. Retention is 30 days. Anything resembling a token, password, credential, api-key, or private key is replaced with `[REDACTED]` before write.

---

## 7. Known guardrails

- Do not commit `.env.cloudflare.core42-dev` or the exported kubeconfig.
- Do not edit `bootstrap_plan.yaml` apply commands inline at run time; PR the change so it lands as a versioned plan revision.
- Do not run `apply` commands from this repo's tooling; the tooling is read-only by design.
- Do not introduce a non-`/ai-cloud-data`-rooted state directory; the validator enforces `state_root` and `backup_profile.local_path` are both under `/ai-cloud-data`.
- Do not add new k3s-only stages unless they are explicitly scoped to the `single_node_k3s` profile. New dev-control-equivalent setup should target `single_node_rke2`.
- Keep this work isolated from product V3 UI/API changes and the proxy data-plane track.

---

## Current state snapshot (informational only)

Latest preflight against `vm-104` recorded in `.git/ops-evidence/...`:

- Ubuntu 24.04.4 LTS, 32 cores, 193 GiB RAM.
- `/ai-cloud-data` mounted xfs on `/dev/vda`, 502 GiB free.
- `docker` + `containerd` already installed.
- `k3s`, `kubectl`, `helm`, `ansible` not installed.
- Passwordless sudo available.
- Tailscale up; public reachability for DNS targets is provided by Cloudflare Tunnel once stage `06-cloudflare-tunnel-public-ingress` is configured.

The current dev-control cluster has progressed past the original stage-1 snapshot: k3s, cert-manager, Traefik, and wildcard TLS are already installed. The remaining public-readiness gap is DNS/ingress exposure via Cloudflare Tunnel.

2026-05-15 live check:

- SSH reachable at `hpcadmin@100.90.157.34`.
- k3s node `vm-104` is Ready on Ubuntu 24.04.4.
- cert-manager, Traefik, and wildcard certificate `gpuaas-dev-control-wildcard` are Ready.
- Cloudflare Tunnel is not running because the tunnel token is missing at `/ai-cloud-data/gpuaas/dev-control/secrets/cloudflared/token`.
- `*.dev-control.aicloud.core42.dev` does not yet resolve publicly.
- The next action is not to finish this k3s public path; it is to implement and review the `single_node_rke2` profile, then rebuild vm-104 through `C-DEV-VM105-DEV-RKE2-LIVE-REBUILD-001`.

2026-05-16 live rebuild update:

- `vm-104` was rebuilt from the k3s validation profile to single-node RKE2 through the guarded automation stages.
- RKE2 node `vm-104` is Ready with Kubernetes `v1.30.8+rke2r1`.
- cert-manager, the Cloudflare DNS-01 ClusterIssuer, the wildcard origin certificate, and Traefik are installed on RKE2.
- Cloudflare Tunnel `gpuaas-dev-vm-104` runs in Docker on `vm-104`.
- Public edge hostnames use the flat `aicloud-dev-.core42.dev` pattern. The nested `*.dev-control.aicloud.core42.dev` pattern failed at the Cloudflare edge TLS layer for the same reason as the earlier kind `*.kind.aicloud.core42.dev` attempt.
- The base environment is reachable through Cloudflare, but there are no GPUaaS app/API workloads deployed yet, so public routes currently return Traefik 404 until the app stack is deployed.

2026-05-18 app-route capacity update:

- The GPUaaS app/API stack is deployed and public dev-control auth/API/app hosts are reachable.
- `make env-preflight ENV=dev HOST=vm-104` passes; the RKE2 node `vm-104` is Ready and core, infra, observability, Traefik, Cloudflare tunnel, and Pomerium pods are Running.
- Demo JupyterLab and vLLM app artifacts are published through public APIs, and the launch prerequisites exist: default SSH key, workspace storage bucket, and an active service account.
- `scripts/ops/dev-control_app_route_readiness.sh` reports both JupyterLab and vLLM prechecks as `ready=true`.
- Worker capacity now exists through product APIs: vm-104 is temporarily enrolled as a dev-control self-worker, with one active allocation and running JupyterLab/vLLM app instances for route validation. This is a dev-control unblock, not the target production worker shape.

### Demo Worker Capacity Bootstrap

Queue task: `C-DEV-WORKER-CAPACITY-API-FIRST-BOOTSTRAP-001`.

The dev-control app-route smoke must not be unblocked by direct SQL inserts. The correct path is API-first and should mirror what a real customer/dev-control operator would do:

1. Choose the worker-capacity shape.
   - Preferred: a separate worker VM in the same network as vm-104.
   - Temporary fallback: vm-104 itself as a worker, but only with an explicit risk note because it is also the RKE2 control-plane host.
2. Converge the manual worker-node baseline before enrollment: `scripts/ops/gpuaas_manual_worker_node_converge.sh` for packages, log rotation, Docker/runtime prerequisites, node-agent directories, and observability/log-shipping prerequisites.
3. Register the node through the admin API: `POST /api/v1/admin/nodes` with `onboarding_mode=manual`, `sku=compute-vm`, `gpus_total=2`, and `region_code` matching the dev-control region.
4. Retrieve bootstrap material through the supported API: `POST /api/v1/admin/nodes/{node_id}/bootstrap-script`, or the matching V3 lifecycle alias if the operator is driving it from the V3 shell.
5. Run the bootstrap on the worker so `gpuaas-node-agent` enrolls and starts heartbeating. Verify node-agent logs are in Loki and metrics appear before scheduling user work.
6. Create a normal user allocation through public allocation APIs. Do not mark allocations active by hand.
7. Launch JupyterLab and vLLM through normal app launch APIs so the app runtime worker creates real app instances and route records.
8. Rerun:

   ```bash
   APP_PUBLIC_URL=https://aicloud-dev-app.core42.dev \
   AUTH_PUBLIC_URL=https://aicloud-dev-auth.core42.dev \
   scripts/ops/dev-control_app_route_readiness.sh
   ```

Only after nodes, allocations, app instances, and prechecks are green should the dev-control Pomerium app-route smoke run.

2026-05-18 live notes:

- Demo Worker Capacity Bootstrap now has a guarded helper: `scripts/ops/dev-control_worker_capacity_bootstrap.sh`.
- The helper is API-first: it mints an operator token, verifies the default project, creates the manual node only in `--apply` mode, retrieves bootstrap material from the product API, and runs the script over SSH.
- vm-104 is currently allowed as a temporary self-worker only with `--allow-control-plane-worker`.
- The dev-control profile keeps browser-facing app/api/auth routes behind Cloudflare, but node-agent control traffic uses the direct vm-104 LAN NodePort so separate worker VMs can enroll: `NODE_BOOTSTRAP_API_URL=http://10.176.46.104:32224`.
- This direct NodePort assumption is provider-network dependent.
Mac/UTM MAAS-LXD workers behind a different VPN/routed subnet could reach `https://aicloud-dev-api.core42.dev` but could not reach `http://10.176.46.104:32224`; for that provider profile, the installed node-agent runtime URL must use the public dev-control API or another provider-reachable control-plane endpoint.
- The guarded helper supports this for lab/provider-specific bootstrap with `--node-runtime-api-url https://aicloud-dev-api.core42.dev`. This changes the installed `GPUAAS_API_URL` only; bootstrap script/package fetches still use `NODE_BOOTSTRAP_PUBLIC_API_BASE_URL`.
- For repeatable provider-profile reachability, use the named preset instead of editing dev-control env files by hand: `--node-runtime-api-profile dev-control-public-api`. The helper resolves that preset to `https://aicloud-dev-api.core42.dev`, validates the URL shape, and checks `${GPUAAS_API_URL}/api/v1/healthz` from the target worker over SSH during dry-run and apply mode before node-agent install.
- The previous host-local value, `http://127.0.0.1:32748`, only works for the temporary vm-104 self-worker fallback. Do not use it for Proxmox, MAAS-LXD, or other separate worker VMs.
- Do not set `NODE_BOOTSTRAP_RESOLVE_ADDRESS` for the dev-control Cloudflare profile. Mapping `aicloud-dev-api.core42.dev` to the vm-104 host IP sends bootstrap traffic to port 443, where nothing is listening outside the Cloudflare tunnel.
- Node-facing terminal streams are separate from browser-facing terminal WebSocket traffic. Demo worker nodes should use `NODE_BOOTSTRAP_TERMINAL_API_URL=https://term.dev.aicloud.core42.dev:30950` with `NODE_BOOTSTRAP_TERMINAL_RESOLVE_ADDRESS=10.176.46.104`. The hostname is intentionally under `*.dev.aicloud.core42.dev` so node-agent TLS verification matches the installed wildcard certificate; do not use `term.dev-control.aicloud.core42.dev` unless a matching certificate is issued.
This materializes only the terminal hostname to the private vm-104 address during bootstrap and avoids rewriting API or registry hostnames. Manual `/etc/hosts` edits on worker nodes are diagnostic proof only; the durable dev-control bridge is the environment-profile/bootstrap rendering.
- Future infra-backed replacements are internal DNS for the node-facing dev-control hostname or a private load-balancer VIP for `gpuaas-terminal-gateway-node`. Until one exists, the terminal-specific bootstrap resolve address is the controlled bridge.
- Provider VM cloud-init fetches the manual bootstrap script with bounded retry. If a VM starts while `gpuaas-api` is rolling, a temporary `Failed to connect to 10.176.46.104 port 32224` should recover without manual intervention. If the retry window is exhausted, inspect `cloud-init-output.log`, verify `NODE_BOOTSTRAP_API_URL`, and confirm the `gpuaas-api` NodePort is reachable from the provider network before launching another refill.
- The API bootstrap bundle must point at an artifact that exists in the dev-control registry: `aicloud-dev-registry.core42.dev/platform/node-agent-bootstrap`. The dev-control release profile publishes this artifact to the dev-control registry and patches the live `NODE_BOOTSTRAP_PACKAGE_REF`, digest, and tag from the release manifest. Do not patch those values from the CI registry unless the bootstrap broker is also configured to read that registry.
- A separate Proxmox worker VM has been enrolled for dev-control capacity: `gpuaas-compute-vm-tiny-smoke-01` at `10.176.32.80`, reachable from the operator workstation with `ProxyJump=subash@10.176.32.19`. The enrolled admin node id is `e365c8d6-1c74-4db3-b22c-7e685732e325`.
- Current admin node creation still requires a positive `gpus_total`; use `--gpus-total 1` for this temporary CPU worker until the SKU resource-model migration removes the GPU-shaped admin node contract.
- Dry-run the existing enrolled worker with:

  ```bash
  scripts/ops/dev-control_worker_capacity_bootstrap.sh \
    --target-host 10.176.32.80 \
    --target-ssh-host 10.176.32.80 \
    --target-user gpuaas \
    --hostname gpuaas-compute-vm-tiny-smoke-01 \
    --sku compute-vm \
    --gpus-total 1 \
    --ssh-option -J \
    --ssh-option subash@10.176.32.19
  ```

- Verify worker parity with:

  ```bash
  scripts/ops/gpuaas_worker_node_parity_check.sh \
    --host 10.176.32.80 \
    --user gpuaas \
    --ssh-option -J \
    --ssh-option subash@10.176.32.19 \
    --stage enrolled
  ```

- Current dev-control node-control mTLS is intentionally disabled because Cloudflare does not preserve the node client certificate into Traefik/API. This is a dev-control profile constraint, not the target production security model; track the direct node-control mTLS endpoint separately before promoting this pattern.
- The dev-control app runtime worker must have `APP_RUNTIME_MANAGED_INGRESS_PUBLIC_HOST_MAP` set for both `jupyterlab.web` and `vllm-openai.openai`.
- The dev-control proxy runtime reconciler must not pin `PROXY_RUNTIME_ENDPOINT` to `web`; an empty value reconciles both browser and API endpoints. Pinning it to `web` hid the vLLM/OpenAI route.
- 2026-05-18 app route smoke passed: JupyterLab `https://aicloud-dev-jupyter.core42.dev/lab` returns the expected Pomerium/OIDC redirect for browser auth, and vLLM `https://aicloud-dev-openai.core42.dev/v1/models` returns API-style `401` without a browser redirect.
- The first JupyterLab artifact published to dev-control was arm64-only and failed on vm-104 amd64 with `exec format error`. The dev-control route smoke currently uses a republished linux/amd64 JupyterLab artifact. Release automation should publish or validate target-platform runtime artifacts before launch.

2026-05-31 UAT readiness note:

- Dev-control on-demand Proxmox workers failed to enroll when the RKE2 overlay still rendered the stale node-facing NodePorts `32748` and `32368`.
The first 2026-05-31 correction accidentally targeted `10.176.46.105`, which is a different RKE2 cluster and rejects dev-control OIDC/bootstrap tokens. The dev-control cluster is `vm-104` / `10.176.46.104`; its live node-facing services are `gpuaas-node-api` on NodePort `32224` and `gpuaas-terminal-gateway-node` on NodePort `30950`.
- The durable profile source is now `10.176.46.104` in `infra/k8s/overlays/dev-control-rke2/configmap.yaml` and `infra/ansible/inventory/environments/dev/group_vars/all.yml`.
- Before running mutating UAT, verify the provider-network path from the Proxmox jump host:

  ```bash
  ssh subash@10.176.32.19 \
    'curl -fsS http://10.176.46.104:32224/api/v1/healthz && nc -vz -w3 10.176.46.104 30950'
  ```

- Use the dev-control wrapper and named private profile for manual worker bootstrap. The wrapper only supplies dev public URLs; all mutation remains gated by `--apply`:

  ```bash
  scripts/ops/dev-control_worker_capacity_bootstrap.sh \
    --target-host \
    --target-ssh-host \
    --target-user gpuaas \
    --hostname \
    --sku compute-vm \
    --gpus-total 1 \
    --region-code region-maas-1 \
    --node-runtime-api-profile dev-control-private-node-api \
    --ssh-option -J \
    --ssh-option subash@10.176.32.19
  ```

Track status for C-DEV-PUSHBUTTON-K8S-ENV-CLOSURE-001: **parked**. The repeatable profile model is documented and validated locally. The prod-shaped `single_node_rke2` guarded profile stages now exist in `bootstrap_plan.yaml`; resume with the live rebuild task rather than more vm-104-specific k3s public-tunnel work.

---

## 8. Stage 03 - single-node k3s (C-DEV-VM105-DEV-K3S-STAGE2-001)

### 8.1 What this stage is

Stage `03-k3s-single-node` installs a single-node k3s server (control-plane + worker) on `vm-104` with state under `/ai-cloud-data/gpuaas/dev-control/state/k3s`, default Traefik **disabled**, and two node labels driven by inventory: `gpuaas.io/environment=dev`, `gpuaas.io/host-role=platform-control`.

The runner is **check-only by default**.
`make env-bootstrap-plan ENV=dev` emits the planned commands and `make env-bootstrap-check ENV=dev` runs read-only probes; neither installs anything. Mutation requires every gate in §8.3 below.

### 8.2 Read-only commands (always safe to run)

```bash
make env-inventory-validate ENV=dev
make env-preflight ENV=dev HOST=vm-104
make env-bootstrap-plan ENV=dev FORMAT=json   # full plan as JSON evidence
make env-bootstrap-check ENV=dev              # per-stage check.commands
make env-rollback-plan ENV=dev                # uninstall path on display
```

A green `bootstrap-check` for stage 03 will report `k3s_installed/k3s_missing`, `is-active k3s` status, the `state_dir_ok` flag on `/ai-cloud-data/gpuaas/dev-control/state/k3s`, and the current `gpuaas.io/environment` label.

### 8.3 Apply path (operator gate)

The apply path is intentionally guarded by **four** distinct inputs that must all match. Anything weaker refuses immediately:

| Gate | Required value |
|---|---|
| `ENV` | `dev-control` |
| `HOST` | `vm-104` |
| `STAGE` | `03-k3s-single-node` |
| Second confirmation env | `CONFIRM_K3S_APPLY=I-UNDERSTAND` |

The runner additionally:

1. Validates inventory + config shape.
2. Captures a fresh preflight evidence file.
3. Runs `bootstrap-check` to record the pre-apply state.
4. Emits the rollback plan as evidence.
5. Applies the stage's `apply.commands` (idempotent: re-running on an already-installed node is a no-op).
6. Re-runs the stage's `check.commands` as a post-apply verification.
7. Writes one `*-bootstrap-apply.json` evidence file with the pre-apply check pointer, the apply results, and the post-apply check.
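
The four-input gate described above can be sketched as a small shell function. This is a hypothetical illustration, not the repo's runner: the function name `check_apply_gate` and the `EXPECTED_*` variables are invented here, and the expected values are assumptions copied from the stage 03 example command in this section.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a four-input apply gate; not the repo's runner.
# Expected values are assumptions taken from the stage 03 example command.
EXPECTED_ENV="dev"
EXPECTED_HOST="vm-104"
EXPECTED_STAGE="03-k3s-single-node"
EXPECTED_CONFIRM="I-UNDERSTAND"

# check_apply_gate ENV HOST STAGE CONFIRM
# Prints the first mismatch and returns 1, or prints "gate ok" and returns 0.
# Nothing is mutated here; the caller decides what to do with the result.
check_apply_gate() {
  local env=${1:-} host=${2:-} stage=${3:-} confirm=${4:-}
  if [[ "$env" != "$EXPECTED_ENV" ]]; then
    echo "refused: ENV mismatch"; return 1
  fi
  if [[ "$host" != "$EXPECTED_HOST" ]]; then
    echo "refused: HOST mismatch"; return 1
  fi
  if [[ "$stage" != "$EXPECTED_STAGE" ]]; then
    echo "refused: STAGE mismatch"; return 1
  fi
  if [[ "$confirm" != "$EXPECTED_CONFIRM" ]]; then
    echo "refused: confirmation missing"; return 1
  fi
  echo "gate ok"
}
```

A wrapper could call `check_apply_gate "$ENV" "$HOST" "$STAGE" "${CONFIRM_K3S_APPLY:-}" || exit 1` before any mutating step, so a missing confirmation variable refuses before anything runs.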
Command (only run after a human has confirmed §3 and the §8.4 checklist):

```bash
ENV=dev HOST=vm-104 STAGE=03-k3s-single-node \
  CONFIRM_K3S_APPLY=I-UNDERSTAND \
  make env-bootstrap-apply
```

### 8.4 Pre-apply checklist (in addition to §3)

- [ ] `make env-bootstrap-check ENV=dev` shows `01-state-directories` all `ok:` and `03-k3s-single-node` reports `k3s_missing` + `state_dir_ok` (we are pre-install).
- [ ] `df -h /ai-cloud-data` on `vm-104` shows >50 GiB free.
- [ ] No existing `k3s`, `kubelet`, or `containerd-shim` processes are running on `vm-104` outside the expected docker daemon.
- [ ] Node-token rotation policy is recorded in the secret manager of record (the token lives in `/etc/rancher/k3s/`; do not commit).
- [ ] Operator pasted the full command from §8.3 in the operator's terminal, not in a script or chat client.

### 8.5 Rollback

```bash
make env-rollback-plan ENV=dev   # display the rollback commands first
# Then, by the operator, only after explicit approval:
ssh hpcadmin@100.90.157.34 'sudo /usr/local/bin/k3s-uninstall.sh'
```

`k3s-uninstall.sh` removes the k3s service, systemd unit, kubectl shims, and `/usr/local/bin/k3s` but it **preserves** `/ai-cloud-data/gpuaas/dev-control/state/k3s`. Removing that data dir is a separate operator-confirmed step (the bootstrap plan's rollback comment calls this out).

### 8.6 Decision for the first install (this slice)

The first install through this automation was executed **as a dry-run only**: `make env-bootstrap-plan`, `make env-bootstrap-check`, and the refused-without-confirmation `bootstrap-apply` calls (recorded as acceptance evidence in the queue). The actual `apply.commands` for stages 03 and 04 were not run against `vm-104` in this slice and **vm-104 was not mutated**. Future slices that flip those stages on must add the matching `CONFIRM_*` env var and capture the resulting evidence.

---

## 9. Stage 04 - kubeconfig export

### 9.1 What this stage is

Stage `04-kubeconfig-export` pulls the cluster-admin kubeconfig from `vm-104` to an operator-controlled gitignored path. The on-host copy is written to `/home/__SSH_USER__/k3s-dev-control.yaml` (placeholder rendered from inventory at apply time) and rewritten so the `server:` URL is `https://:6443` instead of `https://127.0.0.1:6443`. The runner then `scp`s the file back to `KUBECONFIG_EXPORT_PATH`.

Default `KUBECONFIG_EXPORT_PATH`: `$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml`, which is outside the working tree and is **never** committed. The working-tree path is rejected by the runner.

The kubeconfig contents themselves are not logged or written into evidence JSON; only `path`, `perms`, `size_bytes`, and the rewritten endpoint hint are recorded.

### 9.2 Apply gate

| Gate | Required value |
|---|---|
| `ENV` | `dev-control` |
| `HOST` | `vm-104` |
| `STAGE` | `04-kubeconfig-export` |
| Second confirmation env | `CONFIRM_KUBECONFIG_EXPORT=I-UNDERSTAND` |

```bash
ENV=dev HOST=vm-104 STAGE=04-kubeconfig-export \
  CONFIRM_KUBECONFIG_EXPORT=I-UNDERSTAND \
  make env-bootstrap-apply
```

To override the output path (must be outside the working tree):

```bash
KUBECONFIG_EXPORT_PATH=/tmp/k3s-dev-control.yaml \
  ENV=dev HOST=vm-104 STAGE=04-kubeconfig-export \
  CONFIRM_KUBECONFIG_EXPORT=I-UNDERSTAND \
  make env-bootstrap-apply
```

### 9.3 Cleanup / rollback

```bash
# Operator workstation: delete the local kubeconfig copy once loaded.
rm -f "$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/operator-kubeconfig/rke2-dev-control.yaml"

# vm-104: delete the staged on-host copy (rollback section of stage 04a).
ssh hpcadmin@100.90.157.34 'rm -f /home/hpcadmin/k3s-dev-control.yaml'
```

Both paths are gitignored; neither file is ever committed.

---

## 10. Evidence and "what to run when reviewing this slice"

A reviewer who only wants to inspect (not mutate) should run:

```bash
make env-inventory-validate ENV=dev
make env-preflight ENV=dev HOST=vm-104
make env-bootstrap-check ENV=dev
make env-bootstrap-plan ENV=dev FORMAT=json
ENV=dev HOST=vm-104 STAGE=03-k3s-single-node make env-bootstrap-apply               # refuses
ENV=dev HOST=vm-104 STAGE=04-kubeconfig-export make env-bootstrap-apply             # refuses
ENV=dev HOST=vm-104 STAGE=05a-cert-manager-install make env-bootstrap-apply         # refuses
ENV=dev HOST=vm-104 STAGE=05b-cloudflare-dns01-issuer make env-bootstrap-apply      # refuses
ENV=dev HOST=vm-104 STAGE=05c-dev-control-ingress-baseline make env-bootstrap-apply # refuses
```

The intentionally refused commands prove the gate is wired and that `vm-104` cannot be mutated without the explicit second confirmation env var. Evidence JSON for each command lands under `$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/`.

---

## 11. Stage 05a - cert-manager install (C-DEV-VM105-DEV-CERT-INGRESS-STAGE3-001)

### 11.1 What this stage does

Installs cert-manager (CRDs + controllers) into the `cert-manager` namespace via Helm (`jetstack/cert-manager` chart, version pinned in `bootstrap_plan.yaml` to `v1.15.3`). cert-manager is split from the DNS-01 issuer (stage 05b) and from ingress (stage 05c) so each piece is reviewable on its own and rollback-bounded.
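
For orientation, a pinned Helm install of this chart usually looks like the commands printed by the sketch below. This is an assumption-laden illustration: the authoritative `apply.commands` live in `bootstrap_plan.yaml` and may differ, the function name `cert_manager_install_plan` is hypothetical, and `crds.enabled=true` is the chart value cert-manager documents for installing CRDs from v1.15 onward. The function only prints commands; it does not run Helm.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print (do not run) a pinned cert-manager Helm install.
# The real apply.commands live in bootstrap_plan.yaml and may differ.
CERT_MANAGER_VERSION="v1.15.3"   # pin stated in this runbook

# cert_manager_install_plan KUBECONFIG_PATH
# Emits one command per line so an operator can review before running anything.
cert_manager_install_plan() {
  local kubeconfig=${1:?kubeconfig path required}
  printf '%s\n' \
    "helm repo add jetstack https://charts.jetstack.io" \
    "helm repo update jetstack" \
    "helm --kubeconfig ${kubeconfig} upgrade --install cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace --version ${CERT_MANAGER_VERSION} --set crds.enabled=true --wait"
}
```

Printing the plan first mirrors the read-only-by-default posture of this runbook: the operator reviews the rendered commands, then runs them only through the gated apply path.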
### 11.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev-control` |
| `HOST` | `vm-104` |
| `STAGE` | `05a-cert-manager-install` |
| Second confirmation | `CONFIRM_CERT_MANAGER_INSTALL=I-UNDERSTAND` |
| Operator workstation | `kubectl` + `helm` on PATH |
| Operator kubeconfig | `KUBECONFIG_EXPORT_PATH` (or default gitignored path) must exist |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=05a-cert-manager-install \
  CONFIRM_CERT_MANAGER_INSTALL=I-UNDERSTAND \
  make env-bootstrap-apply
```

The runner refuses if any precondition is missing and the refusal is emitted **before** any kubectl/helm command runs. No partial mutation.

### 11.3 Rollback

```bash
helm --kubeconfig "$KCFG" uninstall cert-manager -n cert-manager
kubectl --kubeconfig "$KCFG" delete namespace cert-manager
```

---

## 12. Stage 05b - Cloudflare DNS-01 ClusterIssuer + wildcard certificate

### 12.1 What this stage does

Renders the Cloudflare API token from `.env.cloudflare.core42-dev` into a Kubernetes Secret (`cloudflare-api-token-secret` in `cert-manager`), normalizing the token key to cert-manager's required `api-token` field. The token bytes are not logged or persisted to evidence.

Then renders the Let's Encrypt ClusterIssuer (production + staging variants) with `ACME_CONTACT_EMAIL` as the ACME renewal-failure contact and applies the wildcard `Certificate` for `*.dev-control.aicloud.core42.dev` (Secret `gpuaas-dev-wildcard-tls` in `kube-system`). Wildcard-vs-per-host tradeoff is documented inline in `manifests/wildcard-certificate.yaml`.
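
The key normalization mentioned above can be pictured with the following sketch. It is a hypothetical illustration, not the runner's code: the function name `normalize_cf_token` is invented, the accepted key names other than `APIToken` (which this runbook lists for the env file) are assumptions, and quoted values are not handled. The output contains the token, so it should only ever be piped onward, never echoed to a terminal or written to evidence.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of normalizing an env-file token key to `api-token`.
# Not the runner's implementation; key names besides APIToken are assumptions.

# normalize_cf_token ENV_FILE
# Prints a single `api-token=<value>` line for the first matching key,
# or returns 1 with a message on stderr if no token key is present.
normalize_cf_token() {
  local file=${1:?env file path required}
  local skip_re='^[[:space:]]*(#|$)'   # comment or blank line
  local line key value
  while IFS= read -r line || [[ -n "$line" ]]; do
    [[ "$line" =~ $skip_re ]] && continue
    key=${line%%=*}
    value=${line#*=}
    case "$key" in
      APIToken|CLOUDFLARE_API_TOKEN|api-token)
        printf 'api-token=%s\n' "$value"
        return 0
        ;;
    esac
  done < "$file"
  echo "no token key found in ${file}" >&2
  return 1
}
```

Keeping the rename in one place means the Secret always exposes the same key regardless of how the operator's env file spells it, which is what the ClusterIssuer's secret reference relies on.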
### 12.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev-control` |
| `HOST` | `vm-104` |
| `STAGE` | `05b-cloudflare-dns01-issuer` |
| Second confirmation | `CONFIRM_CLOUDFLARE_DNS01_ISSUER=I-UNDERSTAND` |
| ACME contact | `ACME_CONTACT_EMAIL=` |
| Operator workstation | `kubectl` on PATH |
| `.env.cloudflare.core42-dev` | must exist (token scope: Zone:DNS:Edit + Zone:Zone:Read on `core42.dev` only) |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=05b-cloudflare-dns01-issuer \
  CONFIRM_CLOUDFLARE_DNS01_ISSUER=I-UNDERSTAND \
  ACME_CONTACT_EMAIL=platform-renewals@core42.dev \
  make env-bootstrap-apply
```

If the operator wants to use the LE staging issuer first to avoid rate limits, edit `manifests/wildcard-certificate.yaml` `issuerRef.name` to `letsencrypt-cloudflare-dns01-staging` before re-running.

### 12.3 Rollback

```bash
kubectl --kubeconfig "$KCFG" delete -f doc/operations/env-automation/environments/dev/manifests/wildcard-certificate.yaml --ignore-not-found
kubectl --kubeconfig "$KCFG" delete -f doc/operations/env-automation/environments/dev/manifests/clusterissuer-letsencrypt-cloudflare.yaml --ignore-not-found
kubectl --kubeconfig "$KCFG" -n cert-manager delete secret cloudflare-api-token-secret --ignore-not-found
# Then in the Cloudflare console: revoke the API token if no longer needed.
```

### 12.4 Token safety

- The runner reads the token file from the operator workstation only when the apply command runs; the bytes pass through `kubectl create secret --from-env-file=…` into kubectl stdin.
- `env_automation.rb` `sanitize_stderr` scrubs `token=…` / `secret=…` / `password=…` / `credential=…` / `api_key=…` patterns from any stderr before persistence.
- The plan's `prohibited` regex for this stage refuses any apply command that contains a literal `api_token=` (defense in depth in case a future edit accidentally embeds the token value).

---

## 13. Stage 05c - dev-control ingress baseline (Traefik)

### 13.1 What this stage does

Installs Traefik via Helm (`traefik/traefik` chart, version pinned in `bootstrap_plan.yaml` to `28.3.0`) using the values file at `manifests/traefik-values.yaml`. Traefik runs in its own `traefik` namespace, is the default ingress class, and reads its default TLS certificate from the wildcard Secret issued in stage 05b. Stage 03 installed k3s with `--disable traefik`, and stage 03c installs RKE2 with bundled ingress disabled, so this chart is the single version-pinned source of truth.

Tailscale Funnel is **not** the steady-state endpoint model; the Funnel helper (`scripts/ops/platform_control_tailscale_funnel_edges.sh`) remains a compatible operator escape hatch but the canonical public surface is `aicloud-dev-.core42.dev` served through this Traefik instance.

### 13.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev-control` |
| `HOST` | `vm-104` |
| `STAGE` | `05c-dev-control-ingress-baseline` |
| Second confirmation | `CONFIRM_INGRESS_BASELINE=I-UNDERSTAND` |
| Operator workstation | `kubectl` + `helm` on PATH |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=05c-dev-control-ingress-baseline \
  CONFIRM_INGRESS_BASELINE=I-UNDERSTAND \
  make env-bootstrap-apply
```

### 13.3 Rollback

```bash
helm --kubeconfig "$KCFG" uninstall traefik -n traefik
kubectl --kubeconfig "$KCFG" delete namespace traefik
```

---

## 13a. Stage 06 - Cloudflare Tunnel public ingress

### 13a.1 What this stage does

Stage `06-cloudflare-tunnel-public-ingress` makes `aicloud-dev-.core42.dev` public through a named Cloudflare Tunnel, with `cloudflared` running in Docker on `vm-104`. DNS records are proxied CNAMEs to `.cfargotunnel.com`; there are no A records to the operator laptop.
The helper script is:

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh
```

It discovers the Traefik `websecure` NodePort from the operator kubeconfig, writes the tunnel config in Cloudflare, stores the tunnel runtime token under `.git/ops-evidence/env-automation/dev/cloudflare-tunnel/token`, copies that token to `vm-104`, and starts `gpuaas-dev-cloudflared` as a Docker container on `vm-104`.

Base service hosts route to Traefik. Pomerium-managed hosts (`aicloud-dev-authn.core42.dev`, `aicloud-dev-term.core42.dev`, `aicloud-dev-grafana.core42.dev`, `aicloud-dev-swagger.core42.dev`, `aicloud-dev-jupyter.core42.dev`, `aicloud-dev-openai.core42.dev`, `aicloud-dev-notifications.core42.dev`) route to the Pomerium proxy HTTPS NodePort, default `31920`.

### 13a.2 Operator gate

| Gate | Required value |
|---|---|
| `CONFIRM_DEV_CONTROL_TUNNEL` | `I-UNDERSTAND` for `configure`, `install-token`, `start`, `restart` |
| Operator workstation | `kubectl`, `curl`, `jq`, `ssh`, `scp` on PATH |
| Operator kubeconfig | `KUBECONFIG_EXPORT_PATH` or default `.git/ops-evidence/.../rke2-dev-control.yaml` must exist |
| Cloudflare env file | `.env.cloudflare.core42-dev` with `AccountID` and `APIToken` |
| Cloudflare token scope | `Account:Cloudflare Tunnel Edit`, `Zone:DNS Edit`, `Zone:Zone Read` |
| vm-104 | Docker installed and passwordless sudo for `hpcadmin` |

Read-only planning:

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh plan
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh status
```

Apply path:

```bash
CONFIRM_DEV_CONTROL_TUNNEL=I-UNDERSTAND \
  scripts/ops/vm104_dev_control_cloudflare_tunnel.sh configure
CONFIRM_DEV_CONTROL_TUNNEL=I-UNDERSTAND \
  scripts/ops/vm104_dev_control_cloudflare_tunnel.sh install-token
CONFIRM_DEV_CONTROL_TUNNEL=I-UNDERSTAND \
  scripts/ops/vm104_dev_control_cloudflare_tunnel.sh start
```

Validation:

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh verify
make env-smoke ENV=dev
```

Pomerium route validation:

```bash
scripts/ops/dev-control_pomerium_oidc_configure.sh

EDGE_PROFILE=prod_public_ingress \
  EDGE_DNS_SERVER=1.1.1.1 \
  ROUTES=swagger \
  POMERIUM_AUTHN_HOST=aicloud-dev-authn.core42.dev \
  SWAGGER_HOST=aicloud-dev-swagger.core42.dev \
  GRAFANA_HOST=aicloud-dev-grafana.core42.dev \
  APP_HOST=aicloud-dev-app.core42.dev \
  API_HOST=aicloud-dev-api.core42.dev \
  AUTH_HOST=aicloud-dev-auth.core42.dev \
  scripts/ops/pomerium_edge_profile_smoke.sh
```

The configurator applies the dev-control Swagger route plus terminal and notification WebSocket routes. The WebSocket routes are Pomerium-rendered for host routing and upgrade handling. GPUaaS still validates terminal tokens, notification bearer tokens, allocation/session binding, and notification fanout.

Authenticated dev-control WebSocket smoke:

```bash
scripts/ops/pomerium_dev-control_ws_authenticated_smoke.sh
```

If no active dev-control allocation exists, the smoke skips terminal with a `BLOCK` line and still verifies the notification WebSocket. Set `REQUIRE_TERMINAL_ALLOCATION=true` when terminal parity is the release gate.

`EDGE_DNS_SERVER` is optional. Use it when the operator workstation resolver is pinned to Tailscale or a LAN resolver that has not picked up newly-created Cloudflare records yet.

### 13a.3 Rollback

```bash
scripts/ops/vm104_dev_control_cloudflare_tunnel.sh stop
ssh hpcadmin@100.90.157.34 'sudo rm -f /ai-cloud-data/gpuaas/dev-control/secrets/cloudflared/token'
```

Then delete the proxied CNAMEs or the named tunnel in Cloudflare after traffic is drained. The local `.env.cloudflare.core42-dev` file is never changed by the helper.

---

## 14. Stage 03a — k3s API TLS SAN drop-in (C-DEV-VM105-DEV-CERT-INGRESS-LIVE-APPLY-001)

### 14.1 What this stage does

Adds `100.90.157.34` (and `vm-104`) to the k3s API server's TLS SAN list via a config drop-in at `/etc/rancher/k3s/config.yaml.d/00-tls-san.yaml` and restarts the k3s service.
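
A drop-in of the kind described above could be rendered as follows. This is a minimal sketch under assumptions: the function name `render_tls_san_dropin` is hypothetical, and the exact file content applied by `bootstrap_plan.yaml` is not reproduced here; only the `tls-san` key and the two SAN values come from this section.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a k3s config drop-in that adds extra TLS SANs.
# Each argument becomes one quoted entry under the "tls-san" key.
render_tls_san_dropin() {
  printf 'tls-san:\n'
  local san
  for san in "$@"; do
    printf '  - "%s"\n' "$san"
  done
}
```

For example, `render_tls_san_dropin 100.90.157.34 vm-104` prints a three-line YAML document that could be written to `/etc/rancher/k3s/config.yaml.d/00-tls-san.yaml` before the k3s service is restarted.
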
k3s additively merges `tls-san` with its defaults (localhost, 127.0.0.1, internal IP, hostname) at startup and rotates the serving cert in place when the SAN list changes. The cluster CA does not change, so existing kubeconfigs (the operator kubeconfig from stage 04a with `server: https://100.90.157.34:6443`) keep working without re-issue. Without this stage the operator kubeconfig fails TLS verification against `100.90.157.34:6443` and stages 05a/05b/05c cannot run.

### 14.2 Operator gate

| Gate | Required value |
|---|---|
| `ENV` | `dev-control` |
| `HOST` | `vm-104` |
| `STAGE` | `03a-k3s-tls-san` |
| Second confirmation | `CONFIRM_K3S_TLS_SAN=I-UNDERSTAND` |

Apply command:

```bash
ENV=dev HOST=vm-104 STAGE=03a-k3s-tls-san \
  CONFIRM_K3S_TLS_SAN=I-UNDERSTAND \
  make env-bootstrap-apply
```

The apply briefly restarts the k3s service. Data dir (`/ai-cloud-data/gpuaas/dev-control/state/k3s`), node-token, and pod state are preserved. The post-apply check verifies that the cert SAN includes `100.90.157.34` and that `k3s kubectl get --raw=/readyz` returns ok.

### 14.3 Rollback

```bash
ssh hpcadmin@100.90.157.34 'sudo rm -f /etc/rancher/k3s/config.yaml.d/00-tls-san.yaml && sudo systemctl restart k3s'
```

Reverts the SAN list to the k3s defaults; the operator kubeconfig will again fail TLS verification against `100.90.157.34:6443`. Only run if the SAN must be retracted.

---

## 15. Helm CLI install (operator workstation)

Stages `05a-cert-manager-install` and `05c-dev-control-ingress-baseline` list `helm` in `require_local_binaries`. The runner refuses those stages if `helm` is not on PATH, before any kubectl/helm command runs. The documented install path is the committed script:

```bash
./scripts/ops/operator_install_helm.sh
```

- On macOS with Homebrew, uses `brew install helm`.
- Elsewhere, uses the upstream `get-helm-3` installer over `curl`.
- Refuses to run as root unless `HELM_INSTALL_ALLOW_ROOT=1` is set.
- Idempotent: if `helm` is already on PATH, prints its version and exits 0.
- Does not touch `vm-104`. Helm runs locally against the operator kubeconfig pulled by stage 04a.

---

## 16. Stage 05 lane — current status (C-DEV-VM105-DEV-CERT-INGRESS-LIVE-APPLY-001)

The two non-credential blockers from the previous slice (C-DEV-VM105-DEV-CERT-INGRESS-STAGE3-001) are resolved by committed automation in this slice:

| Blocker | Resolution |
|---|---|
| k3s API server TLS cert lacked `100.90.157.34` SAN | new stage `03a-k3s-tls-san` (§14) — drop-in file + `systemctl restart k3s`; idempotent; rollback documented |
| `helm` not installed on the operator workstation | `scripts/ops/operator_install_helm.sh` (§15) — committed installer; ran via brew on the C-ops workstation in this slice |
| `.env.cloudflare.core42-dev` not present locally | **still blocking** — operator must mint a Cloudflare API token scoped to `Zone:DNS:Edit + Zone:Zone:Read` on `core42.dev` only, record TTL/rotation in the secret manager of record, and place the file at the repo root (gitignored). The runner refuses stage 05b until the file exists; stage 05c then waits on the wildcard Secret from 05b. |

This slice live-applied `03a-k3s-tls-san` and `05a-cert-manager-install` and recorded acceptance evidence under `$(git rev-parse --git-common-dir)/ops-evidence/env-automation/dev/`. Stage `05b-cloudflare-dns01-issuer` and stage `05c-dev-control-ingress-baseline` remain wired but unapplied; the runner refuses 05b with `bootstrap-apply for 05b-cloudflare-dns01-issuer refuses to mutate because operator secret file is missing: .env.cloudflare.core42-dev`.

A follow-up slice owns the Cloudflare token mint + 05b/05c live apply once the token is in the operator's possession.

---

## 17. Demo CD profile

Demo deploys use the same promotion discipline as platform-control:

- development lands in `master`;
- `release/platform-control` is force-promoted to one exact `master` SHA;
- GitLab runs with `PLATFORM_CONTROL_RELEASE_PROFILE=dev-control-rke2`;
- deploy and remote validation source `scripts/ci/dev_control_rke2_release_env.sh` before invoking the existing platform-control release scripts.

Do not create or hand-edit a long-lived `release/dev-control` branch unless CI environment protection later requires branch-scoped variables. If that happens, `release/dev-control` must follow the same promotion-only rule as `release/platform-control`.

Current dev-control CD target defaults:

| Setting | Value |
|---|---|
| SSH target | `hpcadmin@100.90.157.34` |
| Cluster service | `rke2-server` |
| Remote kubectl | `/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml` |
| App URL | `https://aicloud-dev-app.core42.dev` |
| API URL | `https://aicloud-dev-api.core42.dev` |
| Auth URL | `https://aicloud-dev-auth.core42.dev` |

Operator command:

```bash
DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_B64="$(base64 < ~/.ssh/gpuaas-dev-control-rke2-cd | tr -d '\n')" \
  PLATFORM_CONTROL_RELEASE_MODE=deploy \
  scripts/ci/dev_control_rke2_release_deploy.sh origin/master
```

Required GitLab variables:

- `GITLAB_BASE_URL`
- `GITLAB_TOKEN`
- `GITLAB_PROJECT_ID`
- registry credentials already used by platform-control release jobs
- one dev-control-specific SSH credential accepted by `scripts/ci/platform_control_ssh_common.sh` (`DEV_CONTROL_RKE2_SSH_PRIVATE_KEY`, `DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_B64`, or `DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_FILE`). The dev-control profile maps this credential to `PLATFORM_CONTROL_*` inside the pipeline so vm-104/platform-control SSH variables cannot be reused accidentally.

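
The three accepted credential forms listed above could be resolved along these lines. This is a hedged sketch, not the logic in `scripts/ci/platform_control_ssh_common.sh`: the function name `resolve_dev_control_ssh_key` is hypothetical, and the precedence (raw value, then base64, then file) is an assumption made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical resolver: write the dev-control SSH private key to "$1" from
# whichever of the three documented variables is set.
# The precedence (raw, then base64, then file) is an assumption.
resolve_dev_control_ssh_key() {
  local out="$1"
  (
    # Restrictive umask so a newly created key file is owner-only.
    umask 077
    if [ -n "${DEV_CONTROL_RKE2_SSH_PRIVATE_KEY:-}" ]; then
      printf '%s\n' "$DEV_CONTROL_RKE2_SSH_PRIVATE_KEY" > "$out"
    elif [ -n "${DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_B64:-}" ]; then
      printf '%s' "$DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_B64" | base64 --decode > "$out"
    elif [ -n "${DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_FILE:-}" ]; then
      cat "$DEV_CONTROL_RKE2_SSH_PRIVATE_KEY_FILE" > "$out"
    else
      echo "no dev-control SSH credential is set" >&2
      exit 1
    fi
  )
}
```

Failing loudly when none of the variables is set mirrors the intent stated above: the job should stop instead of silently falling back to a platform-control credential.
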
Optional overrides:

- `DEV_CONTROL_RKE2_SSH_HOST` for a replacement dev-control node;
- `DEV_CONTROL_RKE2_REMOTE_KUBECTL` if the RKE2 kubeconfig path changes;
- any `PLATFORM_CONTROL_*_URL` if the public endpoint profile changes.

Post-deploy validation currently runs the platform-control remote validation suite against the dev-control public endpoints. The base environment smoke remains available as an independent infrastructure check:

```bash
make env-bootstrap-check ENV=dev
make env-smoke ENV=dev
```

The dev-control CD profile applies `infra/k8s/overlays/dev-control-rke2`, not the platform-control `dev-control` overlay. The overlay includes an in-cluster Postgres deployment using hostPath storage under `/ai-cloud-data/gpuaas/dev-control/state/postgres`, so vm-104 does not depend on the vm-104/platform-control Docker Postgres container.

The RKE2 dev-control host does not assume a dynamic default StorageClass. The `dev-control-rke2` overlay declares static hostPath persistent volumes under `/ai-cloud-data/gpuaas/dev-control/state/*` for Postgres, Vault, NATS, Redis, registry, Grafana, Loki, Prometheus, and Tempo. Vault's StatefulSet-created PVC is bound by the deploy script through `PLATFORM_CONTROL_STATIC_PVC_BINDINGS` so failed deploys can be retried without mutating immutable StatefulSet volume claim templates.

When GitLab provides `CI_REGISTRY`, the dev-control profile configures an RKE2 registry mirror for that registry host using the local HTTP endpoint and creates a `gpuaas-core` pull secret from the job registry credentials. This keeps vm-104 able to pull release images even when the local GitLab registry TLS ingress is unavailable. The pull secret is keyed by the bare registry host, not an `https://` URL, because Kubernetes image references use `registry.host/repo` and kubelet will otherwise ignore the credentials.

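
The "bare registry host" rule above can be illustrated with a small normalizer. This is a sketch under assumptions: the function name `registry_host_key` is hypothetical and is not taken from the deploy script, and `registry.example.test` is a placeholder host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reduce a registry URL or image reference to the bare
# host[:port] form that image references use as the credential key.
registry_host_key() {
  local ref="$1"
  ref="${ref#http://}"    # drop an http:// scheme if present
  ref="${ref#https://}"   # drop an https:// scheme if present
  ref="${ref%%/*}"        # drop any repository path
  printf '%s\n' "$ref"
}
```

The result is the value one would pass as the registry server when creating the pull secret, for example via the `--docker-server` flag of `kubectl create secret docker-registry`.
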
The deploy script also patches every `gpuaas-core` service account so controller deployments that use non-default service accounts receive the same pull secret.

The dev-control profile derives the node bootstrap CA file from the `kube-system/gpuaas-dev-wildcard-tls` secret and writes it to the same remote path used by platform-control validation: `/etc/gpuaas/platform-control/tls/ca.crt`. It also copies that wildcard certificate into the namespace-local `gpuaas-public-tls` secret expected by the shared ingress and terminal gateway manifests.

The deploy script also syncs local-dev Keycloak import assets to `/opt/gpuaas/platform_control/external-infra/keycloak` because the shared infra manifest mounts the realm export and theme directory from that host path. New dev-control hosts must not rely on vm-104 already having those files.

Controller credential bootstrap is intentionally cluster-local in the dev-control CD profile. `scripts/ci/dev_control_rke2_release_env.sh` enables `PLATFORM_CONTROL_CONTROLLER_BOOTSTRAP_PORT_FORWARD=true`, so `platform_control_deploy.sh` opens short-lived `kubectl port-forward` connections from vm-104 to the in-cluster Keycloak and GPUaaS API services while creating or rotating the Slurm and RKE2 controller service-account credentials. This avoids making deploy correctness depend on Cloudflare or public TLS while those same public endpoints are being rolled.

The same controller-bootstrap step also enables `PLATFORM_CONTROL_BOOTSTRAP_DEV_ADMIN_PROJECT_SCOPE=true`. That idempotently aligns the dev-control `dev-admin` Keycloak subject with the dev-control default project before using public service-account APIs. Without this, a fresh vm-104 can mint a valid admin token but still fail project-scoped API calls with `ownership_required` until someone has manually seeded the local dev-control persona bindings.

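
The short-lived port-forward pattern used for controller credential bootstrap can be sketched as a wrapper that keeps a background forwarder alive only while one command runs. This is an illustration under assumptions, not the code in `platform_control_deploy.sh`: the function name `with_forwarder` and the `--` separator convention are hypothetical, and the sketch does not wait for the forwarded port to become ready, which a real script would need to do.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a background forwarder only for the lifetime of one command.
# Usage: with_forwarder <forwarder cmd...> -- <command...>
# In real use the forwarder would be a `kubectl port-forward` invocation.
with_forwarder() {
  local -a forwarder=()
  while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
    forwarder+=("$1")
    shift
  done
  # Drop the "--" separator if one was given.
  if [ "$#" -gt 0 ]; then shift; fi

  local pid rc=0
  "${forwarder[@]}" >/dev/null 2>&1 &
  pid=$!

  # Run the wrapped command and remember its exit status.
  "$@" || rc=$?

  # Always stop the forwarder, whether the command succeeded or not.
  kill "$pid" 2>/dev/null || true
  wait "$pid" 2>/dev/null || true
  return "$rc"
}
```

Returning the wrapped command's exit status keeps a failed credential call visible to the caller while still guaranteeing that no forwarder is left running on the host.
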
To add a future environment such as `dev` on vm-104 or `test` on vm-104, add a new target profile script modeled after `scripts/ci/dev_control_rke2_release_env.sh` and point it at a dedicated overlay, SSH target, public endpoint profile, Kubernetes distribution, and DB apply mode. Do not reuse `dev-control-rke2` for another host.