# Observability Local Smoke Report

## 2026-05-19

- Commands:
  - `OBSERVABILITY_EXPECT_DEPLOYMENT_ENV=kind OBSERVABILITY_EXPECT_LOKI_SERVICE=api bash scripts/ops/observability_stack_smoke.sh`
  - `bash scripts/ops/observability_smoke.sh`
  - `WINDOW=30m NODE_ID=bc2204d4-fa91-42b7-b0e3-0862fac1f809 bash scripts/ops/node_agent_loki_smoke.sh`
  - `WINDOW=30m NODE_ID=3851fb6c-2789-4a0d-85a2-50ce5411893a bash scripts/ops/node_agent_loki_smoke.sh`
- Scope:
  - Grafana, Prometheus, Loki, and Tempo health endpoints.
  - Prometheus environment labels for core services and Pomerium.
  - Loki `service_name=api` label visibility.
  - API and webhook worker metrics.
  - Prometheus ingestion warnings after the Pomerium scrape relabel guard.
  - Worker-node log visibility for `gpuaas-node-agent` and `gpuaas-metrics-helper`.
- Result:
  - Stack smoke: `success`.
  - API/webhook metrics smoke: `success`.
  - Prometheus recent logs: no duplicate-sample ingestion warnings after dropping
    Pomerium `controller_runtime_*` and `workqueue_*` metric families.
  - Node-agent Loki smoke: `success` for `compute-node-01` and
    `compute-node-02` after manual-node collector convergence.
  - Node-log-gateway and Loki recent logs: no current backend rejection loop
    and no `entry too far behind` ingestion failures after the collector
    stale-event filter.
- Live worker-node convergence:
  - `scripts/ops/gpuaas_manual_worker_node_log_collector_converge.sh --host 192.168.1.162 --user hpcadmin --loki-url https://node-api.gpuaas.test/internal/v1/node-logs --install-method none --reset-buffer --enable-metrics-helper`
  - `scripts/ops/gpuaas_manual_worker_node_log_collector_converge.sh --host 192.168.1.77 --user hpcadmin --loki-url https://node-api.gpuaas.test/internal/v1/node-logs --install-method none --reset-buffer --enable-metrics-helper`
  - `gpuaas-node-log-collector.service` and `gpuaas-metrics-helper.timer`
    active on both worker nodes.
  - Strict worker parity with `--require-log-collector --require-metrics-helper`
    passed for both worker nodes with one warning-free summary per node.
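
The "no `entry too far behind` ingestion failures" check above can be repeated by counting matching lines in a recent log window. The sketch below is a minimal helper, not the logic of the smoke scripts themselves; the function name `count_matches` is hypothetical, and how the logs are fetched is left to the caller.

```shell
# Count the lines on stdin that contain the given fixed string.
# Hypothetical helper for illustration; not part of scripts/ops.
count_matches() {
  # grep -c prints the count but exits 1 on zero matches,
  # so "|| true" keeps the helper's exit status at 0.
  grep -cF -- "$1" || true
}
```

Usage, assuming the recent Loki logs have already been saved to a file (the `loki.log` path is a placeholder): `count_matches 'entry too far behind' < loki.log` prints `0` when the window is clean.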

## 2026-02-24

- Command: `make ops-observability-smoke`
- Scope:
  - API metrics endpoint: `GET /metrics`
  - Webhook worker metrics endpoint: `GET /metrics`
  - Core required metric names present in both services
  - Optional protected internal stats check (`GET /api/v1/internal/stats`) when `INTERNAL_STATS_TOKEN` is configured
- Result:
  - API metrics endpoint reachable and required keys found.
  - Webhook worker metrics endpoint reachable and required keys found.
  - Internal stats check skipped locally because `INTERNAL_STATS_TOKEN` is unset (expected in the default local env).
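
The "required keys found" check can be approximated by scanning the `GET /metrics` response, which uses the Prometheus text exposition format, for each expected metric name. This is a sketch under assumptions: the function name `missing_metrics` and the metric names in the usage line are hypothetical, and the actual list checked by `make ops-observability-smoke` is not given in this report.

```shell
# Print each required metric name that has no sample line in the
# Prometheus text exposition read from stdin. Prints nothing when
# every name is present. Hypothetical helper for illustration.
missing_metrics() (
  body="$(cat)"
  for name in "$@"; do
    # A sample line starts with the metric name, followed by either
    # '{' (labels) or a space (value). '# HELP' and '# TYPE' lines
    # start with '#', so they do not count as samples.
    if ! printf '%s\n' "$body" | grep -Eq "^${name}(\{| )"; then
      printf '%s\n' "$name"
    fi
  done
)
```

Usage, with placeholder URL variable and metric names: `curl -fsS "$API_URL/metrics" | missing_metrics http_requests_total process_cpu_seconds_total`. Empty output means all listed names were found.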
