# Node Agent Host Certificate Lifecycle v1

Status: draft for review
Owner: Platform Architecture / PKI / Node Agent
Last updated: 2026-05-24

## Purpose

Define the host-side certificate lifecycle for node agents running on MAAS-LXD,
Proxmox, and bare-metal workers.

This document closes the boundary left by `Cert_Manager_Integration_v1.md`:
cert-manager is the right controller for Kubernetes and edge certificate
lifecycle, but it does not by itself cover the host node-agent certificate
lifecycle.

## Decision

step-ca remains the node-agent issuer and renewal authority.

The node-agent renews through the platform API or through a small platform
certificate broker. The node never talks to step-ca directly.

cert-manager may manage Kubernetes-side certificates and edge certificates, but
host node-agent certs require their own delivery, renewal, recovery, and
diagnostic contract.

## Required Properties

1. The node-agent has one control-plane destination: the node-facing API
   profile.
2. The API or cert broker is the only component that talks to step-ca for node
   certificates.
3. Renewal uses the current valid node certificate when possible.
4. Expired-cert recovery uses a node-bound recovery token or operator-issued
   repair bundle.
5. Recovery never disables TLS verification.
6. Recovery must distinguish endpoint/profile drift from credential failure.
7. Cert, key, token, and CA bundle material must never be logged.
8. The node must tolerate restart and temporary network loss without full
   database repair.
9. Node-agent, metrics, and log shipping must be disk-bounded. Observability
   must never fill the root filesystem or prevent cert renewal, task polling,
   terminal callback, cleanup, or workload execution.

## Host State

Each enrolled node must have:

- `GPUAAS_API_URL`: node-facing runtime API URL;
- `GPUAAS_TERMINAL_API_URL`: node-facing terminal stream URL;
- `GPUAAS_CA_BUNDLE_PATH`: HTTPS trust bundle for runtime API;
- `GPUAAS_NODE_CERT_CA_BUNDLE_PATH`: node client cert issuer bundle;
- `GPUAAS_CERT_PATH`: current node client certificate;
- `GPUAAS_KEY_PATH`: current node private key;
- `GPUAAS_ENROLLMENT_TOKEN` or node-bound recovery token when recovery is
  allowed;
- `GPUAAS_RECOVERY_API_URL`: recovery endpoint/profile, defaulting to runtime
  API only when explicitly safe;
- `GPUAAS_RECOVERY_CA_BUNDLE_PATH`: recovery trust bundle;
- `GPUAAS_DIAGNOSTIC_PATH`: writable local diagnostic path.

The bootstrap/reconciliation scripts must ensure the diagnostic path is on a
writable filesystem and that node-facing hostnames survive reboot.
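
A bootstrap or reconciliation check for this state could look like the sketch
below. It is illustrative only: `missing_host_state` is a hypothetical helper,
and `GPUAAS_RECOVERY_TOKEN` is an assumed variable name, since this document
does not name the variable that holds the node-bound recovery token. The helper
reports variable names only, never values, in line with required property 7.

```python
from typing import List, Mapping

# Settings every enrolled node must carry, per the list above.
ALWAYS_REQUIRED = (
    "GPUAAS_API_URL",
    "GPUAAS_TERMINAL_API_URL",
    "GPUAAS_CA_BUNDLE_PATH",
    "GPUAAS_NODE_CERT_CA_BUNDLE_PATH",
    "GPUAAS_CERT_PATH",
    "GPUAAS_KEY_PATH",
    "GPUAAS_RECOVERY_API_URL",
    "GPUAAS_RECOVERY_CA_BUNDLE_PATH",
    "GPUAAS_DIAGNOSTIC_PATH",
)

# One of these must be set when recovery is allowed.
# GPUAAS_RECOVERY_TOKEN is a hypothetical name used only for this sketch.
TOKEN_VARS = ("GPUAAS_ENROLLMENT_TOKEN", "GPUAAS_RECOVERY_TOKEN")


def missing_host_state(env: Mapping[str, str], recovery_allowed: bool) -> List[str]:
    """Return the names of missing settings. Values are never returned or logged."""
    missing = [name for name in ALWAYS_REQUIRED if not env.get(name)]
    if recovery_allowed and not any(env.get(name) for name in TOKEN_VARS):
        missing.append(" or ".join(TOKEN_VARS))
    return missing
```

A caller would typically pass `os.environ` and treat a non-empty result as a
reconciliation failure.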

## Renewal Flow

1. Node-agent checks certificate expiry on a fixed interval.
2. Before the renew-before window, it emits expiry metrics only.
3. Inside the renew-before window, it creates a new keypair and CSR.
4. It calls the API renewal endpoint over mTLS using the current certificate.
5. API validates node identity, asks step-ca to sign, and returns the new cert.
6. Node-agent atomically swaps cert/key and keeps old connections alive until
   they naturally reconnect.
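
Steps 2, 3, and 6 can be sketched as follows. This is a minimal illustration,
not the node-agent implementation: `in_renew_window` and `atomic_write` are
hypothetical helper names, and the renew-before window length is a
configuration value this document does not fix. The write helper replaces one
file atomically with a same-directory temp file and `os.replace`; swapping the
cert and key as a consistent pair needs an additional convention (for example a
versioned directory) that is out of scope here.

```python
import os
import tempfile
from datetime import datetime, timedelta


def in_renew_window(not_after: datetime, now: datetime, renew_before: timedelta) -> bool:
    """True once `now` has entered the renew-before window ahead of expiry."""
    return now >= not_after - renew_before


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 0600, which suits private key material.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename within the same directory, which is atomic on POSIX filesystems.
        os.replace(tmp_path, path)
    except BaseException:
        # Leave no temp file behind if the write or rename failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Because the temp file lives next to the target, a failed write (including a
full disk) leaves the existing cert or key untouched.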

## Recovery Flow

When normal renewal fails:

1. Classify failure before retry:
   - `endpoint_unreachable`;
   - `server_tls_untrusted`;
   - `cert_expired`;
   - `identity_revoked_or_fenced`;
   - `endpoint_profile_drift`;
   - `recovery_enrollment_blocked`;
   - `disk_full`;
   - `clock_skew`.
2. If recovery token and recovery trust are present, call recovery enrollment
   without presenting the stale client certificate.
3. API verifies the node-bound recovery token and node state.
4. API asks step-ca to issue a replacement certificate.
5. Node-agent atomically writes cert/key and resumes task polling.
6. If recovery is not safe, fail closed with local diagnostic evidence and
   operator remediation instructions.
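
One way to make step 1 deterministic is to evaluate the observed conditions in
a fixed order and report the first match. The sketch below uses the reason
codes listed above; the `RenewalObservation` type, the `classify` function, and
the check order are assumptions for illustration. Local conditions are checked
first because a full disk or a skewed clock can make the other signals
misleading (a skewed clock can make a valid certificate look expired).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RenewalObservation:
    """Signals gathered after a failed renewal; field names mirror the reason codes."""
    disk_full: bool = False
    clock_skew: bool = False
    endpoint_profile_drift: bool = False
    endpoint_unreachable: bool = False
    server_tls_untrusted: bool = False
    identity_revoked_or_fenced: bool = False
    cert_expired: bool = False
    recovery_enrollment_blocked: bool = False


# Assumed precedence: local host problems, then endpoint problems,
# then credential problems. This order is not specified by the document.
CHECK_ORDER = (
    "disk_full",
    "clock_skew",
    "endpoint_profile_drift",
    "endpoint_unreachable",
    "server_tls_untrusted",
    "identity_revoked_or_fenced",
    "cert_expired",
    "recovery_enrollment_blocked",
)


def classify(observation: RenewalObservation) -> Optional[str]:
    """Return the first matching reason code, or None when nothing was observed."""
    for reason in CHECK_ORDER:
        if getattr(observation, reason):
            return reason
    return None
```

The returned code can feed both the `{reason}` metric labels and the local
diagnostic record, so the two stay consistent.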

## Recovery Token Issuance

Recovery tokens are node-bound, short-lived or rotation-managed credentials used
only when mTLS renewal is no longer possible.

Initial issuance happens during the trusted bootstrap path:

- MAAS deploy cloud-init for MAAS-LXD or bare-metal workers;
- Proxmox/VM bootstrap when the platform composes or prepares the VM worker;
- operator repair bundle only for emergency remediation.

An operator repair bundle contains the node-bound recovery token, recovery CA
bundle, recovery endpoint profile id, and audit-traceable issuance evidence. It
does not contain a reusable node client private key.

The API owns token issuance, hashing, revocation, and audit. Tokens are scoped
to one node identity and one recovery profile, and they must not grant task
execution, terminal access, or allocation control. Refreshing a recovery token
while the node is healthy should use mTLS and rotate the stored token
atomically. Emergency operator issuance must require privileged audit and should
quarantine the node until recovery evidence passes.
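
The API-side handling of a recovery token might be sketched as below. This is an
assumption-laden illustration: the hash algorithm (SHA-256 here), the record
layout, and the names `StoredRecoveryToken`, `issue_recovery_token`, and
`verify_recovery_token` are not defined by this document. The sketch shows the
three properties stated above: only a hash is stored, the token is bound to one
node identity and one recovery profile, and revoked or expired tokens are
rejected.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass
class StoredRecoveryToken:
    """What the API persists. The plaintext token is never stored."""
    node_id: str
    profile_id: str
    token_sha256: str
    expires_at: datetime
    revoked: bool = False


def issue_recovery_token(
    node_id: str, profile_id: str, ttl: timedelta, now: datetime
) -> Tuple[str, StoredRecoveryToken]:
    """Create a random token; return the plaintext once, plus the record to store."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, StoredRecoveryToken(node_id, profile_id, digest, now + ttl)


def verify_recovery_token(
    record: StoredRecoveryToken,
    presented: str,
    node_id: str,
    profile_id: str,
    now: datetime,
) -> bool:
    """Accept only an unexpired, unrevoked token for the same node and profile."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return (
        hmac.compare_digest(digest, record.token_sha256)  # constant-time compare
        and record.node_id == node_id
        and record.profile_id == profile_id
        and not record.revoked
        and now < record.expires_at
    )
```

Verification deliberately returns a bare boolean; which check failed belongs in
the audit record, not in the response to the node.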

Every recovery attempt and outcome emits audit evidence through the platform
Audit service with a `correlation_id` linking renewal failure, recovery request,
certificate issuance, and resolution.

## Metrics

Node-agent must expose:

- `gpuaas_node_agent_cert_expiry_seconds`;
- `gpuaas_node_agent_cert_renewal_attempts_total`;
- `gpuaas_node_agent_cert_renewal_failures_total{reason}`;
- `gpuaas_node_agent_recovery_enrollment_attempts_total{reason}`;
- `gpuaas_node_agent_recovery_enrollment_success_total`;
- `gpuaas_node_agent_recovery_enrollment_failures_total{reason}`;
- `gpuaas_node_agent_last_successful_mtls_task_poll_timestamp`;
- `gpuaas_node_agent_endpoint_profile_drift_total`;
- `gpuaas_node_agent_local_diagnostic_write_failures_total{reason}`;
- `gpuaas_node_agent_node_filesystem_available_bytes{mount}`;
- `gpuaas_node_agent_observability_dropped_records_total{source,reason}`.
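
To illustrate how a `{reason}`-labelled counter from this list would appear on
a scrape, the sketch below keeps counts in memory and renders them in the
Prometheus text exposition format. `LabeledCounter` is a hypothetical stand-in
written for this example; a real node-agent would more likely use an
established Prometheus client library. Label values are assumed to be the plain
reason codes defined in this document, so no escaping is performed.

```python
from collections import Counter
from typing import Sequence


class LabeledCounter:
    """Minimal in-memory counter with a fixed set of label names."""

    def __init__(self, name: str, label_names: Sequence[str]) -> None:
        self.name = name
        self.label_names = tuple(label_names)
        self._values: Counter = Counter()

    def inc(self, **labels: str) -> None:
        """Increment the series identified by the given label values."""
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += 1

    def render(self) -> str:
        """Render all series as Prometheus text exposition lines."""
        lines = [f"# TYPE {self.name} counter"]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"
```

Memory use is bounded by the number of distinct label values, which stays small
as long as `reason` is restricted to the fixed set of reason codes.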

## Operator Evidence

The API/read-model should surface:

- current cert expiry or last reported expiry;
- last renewal attempt;
- last renewal failure reason;
- last recovery attempt;
- last recovery failure reason;
- node-facing API URL and terminal URL profile id;
- whether the local diagnostic path is writable;
- last reported filesystem free space for cert, diagnostic, task retry, and
  runtime paths;
- whether the node is excluded from scheduling and why.

## Failure Matrix

| Failure | Expected behavior |
|---|---|
| node cert expires while node is online | recovery enrollment with the node-bound recovery token (bearer) replaces cert |
| node cert expires while node is offline | recovery runs on next start if token/trust valid |
| API hostname missing after reboot | classify `endpoint_unreachable` / `endpoint_profile_drift`; reconciliation restores host mapping |
| active CA bundle stale | detect server TLS chain failure against runtime bundle, validate recovery endpoint against recovery bundle, then refresh runtime CA material through recovery; no insecure fallback |
| recovery CA bundle stale | fail closed with `recovery_enrollment_blocked` |
| disk full | classify `disk_full`; do not report the node as simply unreachable |
| clock skew | classify `clock_skew` when local time differs from API/recovery authority by more than the configured tolerance, initially 5 minutes; do not churn cert requests |
| node identity revoked | fail closed; do not self-recover without API authority |
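
The clock-skew row can be expressed as a small check. The sketch assumes the
node already has a trustworthy timestamp from the API or recovery authority;
how that timestamp is obtained is not specified here. `is_clock_skewed` is a
hypothetical helper name, and the 5-minute default mirrors the initial
tolerance in the table.

```python
from datetime import datetime, timedelta

# Initial tolerance from the failure matrix; expected to be configurable.
CLOCK_SKEW_TOLERANCE = timedelta(minutes=5)


def is_clock_skewed(
    local_now: datetime,
    authority_now: datetime,
    tolerance: timedelta = CLOCK_SKEW_TOLERANCE,
) -> bool:
    """True when local time differs from the authority by more than the tolerance."""
    return abs(local_now - authority_now) > tolerance
```

Both timestamps should be timezone-aware and in the same zone (for example
UTC); Python raises `TypeError` when subtracting a naive datetime from an aware
one.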

## Relationship To Cert-Manager

cert-manager may manage:

- Pomerium TLS;
- public/private wildcard certificates;
- Kubernetes internal service certs;
- step-ca-backed certs for pods.

cert-manager does not directly manage:

- host `/etc/gpuaas/cert.pem`;
- host `/etc/gpuaas/key.pem`;
- host recovery tokens;
- node-facing endpoint profile repair;
- host disk-full or reboot recovery.

Any future cert-manager-to-host delivery must be treated as a separate
controller with its own security review and recovery matrix.

## Disk Safety Invariant

Node-agent stability depends on a writable host. The local worker must reserve
space for:

- node-agent cert/key atomic writes;
- local diagnostic record writes;
- task result retry state;
- cleanup/quarantine evidence;
- terminal session transient state.

Local observability must be strictly bounded:

- journald must have an explicit size cap;
- node-agent retry logs must be coalesced;
- metrics helpers must not persist unbounded samples locally;
- Vector or other log shippers must use bounded buffers;
- local archives must rotate aggressively;
- when under disk pressure, the node must drop or compact observability data
  before application/runtime state is endangered.

Disk pressure must become an explicit readiness reason such as `disk_full` or
`node_observability_disk_pressure`. It must not collapse into generic
`node_agent_unreachable`.
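
A readiness check along these lines could map disk state to the explicit
reasons named above. The thresholds, the function names, and the exact split
between `disk_full` and `node_observability_disk_pressure` are assumptions for
illustration; this document does not define them.

```python
import shutil
from typing import Optional


def disk_readiness_reason(
    free_bytes: int,
    reserved_bytes: int,
    observability_bytes: int,
    observability_cap_bytes: int,
) -> Optional[str]:
    """Return an explicit readiness reason, or None when disk state is healthy."""
    # Free space has dropped below the reserve needed for cert/key writes,
    # diagnostics, task retry state, and cleanup evidence.
    if free_bytes < reserved_bytes:
        return "disk_full"
    # Logs, metrics buffers, or archives have exceeded their bounded budget.
    if observability_bytes > observability_cap_bytes:
        return "node_observability_disk_pressure"
    return None


def free_bytes_for(path: str) -> int:
    """Free space on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free
```

Keeping the decision separate from the measurement lets the same function run
against reported values in the API read-model as well as on the node.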

## Open Decisions

1. Whether node recovery tokens are refreshed on every successful renewal, on a
   fixed rotation interval, or only during operator repair.
2. Whether high-security tenants require node-agent public-key pinning in
   addition to CA-chain trust, and how pin rotation is delivered without
   creating a self-lockout path.

TLS trust-on-first-use is deliberately excluded. Node-agent trust must be
anchored in a provisioned CA bundle, recovery trust bundle, or explicit future
pinning contract, not in first-observed server identity.
