# Coding Standards (Production + Agent Compatible)

## General
- Strong typing required.
- Lint/format/static analysis mandatory.
- Small, cohesive modules with single responsibility.
- Follow the project’s evidence-first execution model in [Evidence_First_Change_Protocol.md](./Evidence_First_Change_Protocol.md).

## Evidence-First Execution (Required)
- Establish a relevant baseline before changing behavior.
- Prefer the smallest verifiable unit of change.
- Predict the expected outcome before running verification.
- Re-run the same scoped checks after the change and compare results.
- Do not mark work complete without direct proof of the intended behavior change.
- Do not treat “compiles” or “looks right” as sufficient evidence.

## API and Domain
- Contract-first changes only.
- Explicit request/response schemas.
- Standard error envelope with correlation ID.
- All mutations must be idempotent. The only exceptions are security session issuance/revocation operations that are explicitly documented as non-idempotent (for example, single-use terminal token minting).
- Use generated OpenAPI types (`packages/shared/gen/openapigen`) only at HTTP boundaries (request decode/response encode); use hand-written domain models for internal service logic, with explicit mapping between boundary and domain types.
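
A minimal sketch of the boundary-to-domain mapping described in the last rule. `apiCreateAllocationRequest` stands in for a generated OpenAPI type; it, `AllocationSpec`, `ErrInvalidSpec`, and `toDomain` are hypothetical names chosen for illustration, not types from this repo.

```go
package main

import (
    "errors"
    "fmt"
    "strings"
)

// apiCreateAllocationRequest stands in for a generated OpenAPI type
// (hypothetical shape). Optional fields are pointers, as code generators
// commonly emit them.
type apiCreateAllocationRequest struct {
    GpuType  *string `json:"gpu_type"`
    GpuCount *int    `json:"gpu_count"`
}

// AllocationSpec is a hand-written domain model (hypothetical) for internal
// service logic. It carries no JSON tags and no optional pointers.
type AllocationSpec struct {
    GPUType  string
    GPUCount int
}

// ErrInvalidSpec is a hypothetical domain sentinel for rejected input.
var ErrInvalidSpec = errors.New("invalid allocation spec")

// toDomain is the explicit mapping step performed at the HTTP boundary.
func toDomain(req apiCreateAllocationRequest) (AllocationSpec, error) {
    if req.GpuType == nil || strings.TrimSpace(*req.GpuType) == "" {
        return AllocationSpec{}, fmt.Errorf("%w: gpu_type is required", ErrInvalidSpec)
    }
    if req.GpuCount == nil || *req.GpuCount < 1 {
        return AllocationSpec{}, fmt.Errorf("%w: gpu_count must be >= 1", ErrInvalidSpec)
    }
    return AllocationSpec{
        GPUType:  strings.TrimSpace(*req.GpuType),
        GPUCount: *req.GpuCount,
    }, nil
}

func main() {
    gpu, n := "a100", 2
    spec, err := toDomain(apiCreateAllocationRequest{GpuType: &gpu, GpuCount: &n})
    fmt.Println(spec, err)
}
```

Keeping the mapping in one function means a regenerated OpenAPI type only breaks the boundary layer, and the service layer never sees nil-able wire fields.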

## Security
- No secrets in code.
- Validate all untrusted input.
- Enforce authN/authZ on the server; never rely on client-side checks.
- Emit audit events for privileged actions.
- Node provisioning/release orchestration must use node-agent typed tasks over internal mTLS.
- Direct control-plane SSH provisioning is forbidden in MVP runtime paths.
- Provisioning lifecycle transitions (`requested` → `provisioning` → `active` → `releasing` → `released`/`release_failed`) must be driven by Temporal workflows/events only. Direct bypass state writes are forbidden outside workflow-controlled paths.
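
The lifecycle in the last rule can be sketched as a transition table that a workflow-controlled path might consult before writing state. This is an illustration only: `State`, `allowed`, and `CanTransition` are hypothetical names, and the table contains just the forward edges listed above.

```go
package main

import "fmt"

// State is a provisioning lifecycle state, using the names from the rule above.
type State string

const (
    Requested     State = "requested"
    Provisioning  State = "provisioning"
    Active        State = "active"
    Releasing     State = "releasing"
    Released      State = "released"
    ReleaseFailed State = "release_failed"
)

// allowed lists only the forward transitions named in the rule. Retry or
// recovery edges (for example out of release_failed) are not specified
// there, so they are deliberately absent from this sketch.
var allowed = map[State][]State{
    Requested:    {Provisioning},
    Provisioning: {Active},
    Active:       {Releasing},
    Releasing:    {Released, ReleaseFailed},
}

// CanTransition reports whether from -> to is one of the listed transitions.
func CanTransition(from, to State) bool {
    for _, next := range allowed[from] {
        if next == to {
            return true
        }
    }
    return false
}

func main() {
    fmt.Println(CanTransition(Active, Releasing)) // listed edge
    fmt.Println(CanTransition(Requested, Active)) // skips provisioning
}
```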

## Data Integrity
- Immutable ledger for money operations.
- No mutable balance field as the source of truth; derive balances from the ledger.
- Transactions for cross-entity critical updates.
- No hardcoded runtime business-policy constants in production code paths.
- Policy/business values must come from config/DB tables with bounds validation and audited change history.
- Test-only constants are allowed in test/fixture code, never in runtime services.
- Operator/admin verification should use API and read-model surfaces by default, not direct SQL.
- Temporary direct DB inspection is allowed only while the owning operator/debug surface is missing;
  if the same query is needed repeatedly, add the corresponding GET/read-model API and treat the
  missing surface as a product gap.
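
The ledger rules above can be illustrated with an in-memory sketch in which the balance is always derived from append-only entries. The types and names here (`Ledger`, `LedgerEntry`, `EntryKind`) are hypothetical stand-ins for the real tables, not the repo's schema.

```go
package main

import "fmt"

// EntryKind distinguishes credits from debits (hypothetical names).
type EntryKind string

const (
    Credit EntryKind = "credit"
    Debit  EntryKind = "debit"
)

// LedgerEntry is an immutable record. Amounts are integer minor units.
type LedgerEntry struct {
    AccountID   string
    Kind        EntryKind
    AmountCents int64
}

// Ledger is append-only: it exposes no update or delete operation.
type Ledger struct {
    entries []LedgerEntry
}

// Append records a new entry. Corrections are made by appending a
// compensating entry, never by editing an existing one.
func (l *Ledger) Append(e LedgerEntry) error {
    if e.AmountCents <= 0 {
        return fmt.Errorf("amount must be positive, got %d", e.AmountCents)
    }
    if e.Kind != Credit && e.Kind != Debit {
        return fmt.Errorf("unknown entry kind %q", e.Kind)
    }
    l.entries = append(l.entries, e)
    return nil
}

// Balance derives the balance from the entries; no stored balance exists.
func (l *Ledger) Balance(accountID string) int64 {
    var total int64
    for _, e := range l.entries {
        if e.AccountID != accountID {
            continue
        }
        if e.Kind == Credit {
            total += e.AmountCents
        } else {
            total -= e.AmountCents
        }
    }
    return total
}

func main() {
    l := &Ledger{}
    _ = l.Append(LedgerEntry{AccountID: "acct-1", Kind: Credit, AmountCents: 5000})
    _ = l.Append(LedgerEntry{AccountID: "acct-1", Kind: Debit, AmountCents: 1250})
    fmt.Println(l.Balance("acct-1"))
}
```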

## Root-Cause-First Remediation (Required)
- Do not ship symptom-only fixes to unblock tests.
- Every bug fix must identify and patch the owning layer/root cause (contract, schema, service, worker, runtime, or UI boundary), not only downstream fallout.
- Temporary fallbacks are allowed only when:
  - explicitly feature-flagged,
  - time-boxed with a queue/backlog task,
  - documented with risk and removal criteria.
- If root cause is outside current task scope, mark the task `blocked` and create the upstream fix task; do not mark `done` with a local workaround only.

## SQL Parameter Typing (Required)
- In SQL using `jsonb_build_object(...)` or other polymorphic Postgres functions, explicitly cast bind parameters (`$n::text`, `$n::int`, `$n::uuid`) instead of relying on type inference.
- Any new handler/worker SQL touching audit or metadata JSON must include a test path that executes the query against Postgres (a unit test with a pgx mock is not sufficient for this class of failure).
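
As an illustration of the casting rule, the sketch below pairs a query that casts every bind parameter with a small text-based check that flags positional parameters lacking a `::` cast. `UncastParams` and `auditMetadataSQL` are hypothetical; the check is a heuristic that is stricter than the rule (it flags every uncast parameter, not only those passed to polymorphic functions), misses casts written with whitespace before `::`, and is no substitute for executing the query against Postgres.

```go
package main

import (
    "fmt"
    "regexp"
)

// bindParam matches a positional parameter such as $1 and, optionally,
// the "::" that starts an explicit cast right after it.
var bindParam = regexp.MustCompile(`\$(\d+)(::)?`)

// UncastParams returns the positional parameters in query that are not
// immediately followed by an explicit cast.
func UncastParams(query string) []string {
    var out []string
    for _, m := range bindParam.FindAllStringSubmatch(query, -1) {
        if m[2] == "" {
            out = append(out, "$"+m[1])
        }
    }
    return out
}

// auditMetadataSQL casts every parameter passed to jsonb_build_object.
const auditMetadataSQL = `
SELECT jsonb_build_object(
    'reason',        $1::text,
    'allocation_id', $2::uuid
)`

func main() {
    fmt.Println(UncastParams(auditMetadataSQL))
    fmt.Println(UncastParams(`SELECT jsonb_build_object('reason', $1)`))
}
```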

## 5xx Classification (Required)
- Any new/changed 5xx response path must be classified in code review as one of:
  - upstream dependency failure (`upstream_error` / `service_unavailable`)
  - local contract/schema/query/runtime bug (`internal_error`)
- Do not re-label local defects as upstream issues. Fix the owning layer and add a regression test.
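
One way to keep this classification honest in code is to default to `internal_error` and return an upstream code only when the error was explicitly wrapped as a dependency failure. The sketch below assumes hypothetical sentinels (`ErrUpstream`, `ErrUnavailable`) and assumes 502/503 as the matching status codes; neither is prescribed by this document.

```go
package main

import (
    "errors"
    "fmt"
    "net/http"
)

// Hypothetical sentinels that dependency clients wrap their failures with.
var (
    ErrUpstream    = errors.New("upstream dependency failed")
    ErrUnavailable = errors.New("upstream dependency unavailable")
)

// errScan stands in for a local defect, for example a query/schema mismatch.
var errScan = errors.New("scan row: unexpected column type")

// Classify5xx maps an error to a status and error code. Anything that was
// not explicitly wrapped as an upstream failure is treated as a local bug.
func Classify5xx(err error) (status int, code string) {
    switch {
    case errors.Is(err, ErrUnavailable):
        return http.StatusServiceUnavailable, "service_unavailable"
    case errors.Is(err, ErrUpstream):
        return http.StatusBadGateway, "upstream_error"
    default:
        return http.StatusInternalServerError, "internal_error"
    }
}

func main() {
    wrapped := fmt.Errorf("charge card: %w", ErrUpstream)
    for _, err := range []error{wrapped, ErrUnavailable, errScan} {
        status, code := Classify5xx(err)
        fmt.Println(status, code)
    }
}
```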

## Observability
- Structured logging everywhere.
- Trace context propagation mandatory.
- Service-level metrics for critical flows.

### Traceability-First Implementation Rules (Required)
1. Every runtime binary under `cmd/` (except explicitly documented edge agents) must initialize OTel via `middleware.SetupOTel(...)`.
2. Every HTTP server binary must wrap routers with tracing middleware:
   - `middleware.Tracing("<service-name>")(middleware.CorrelationID(...))`
3. Async consumers (NATS/workers/relays) must create a processing span per message and include:
   - `correlation_id`
   - `event.type`
   - `event.id`
   - messaging destination/subject
4. Mutation handlers must create child spans for high-value steps:
   - project/tenant scope resolution
   - domain service/orchestrator call
   - audit/outbox write boundary
5. Error paths must set span error status and `error_code` (catalog-aligned) whenever known.
6. Any new service added to local observability compose must have `OTEL_EXPORTER_OTLP_ENDPOINT` wired.

Enforcement:
- CI gate script: `scripts/ci/observability_trace_gate.sh`
- Make target: `make ops-observability-trace-gate`

## Log and Trace Sanitization
Sensitive and PII fields must be redacted before they reach any log sink or trace backend. This applies equally to structured logs, OTel trace attributes, and span events.

**Sanitize First rule**: all internal services must pass requests through a sanitization layer before logging or creating trace spans. This is not optional for production services.

Fields that must never appear in logs or traces in plaintext:
- `password`, `password_hash` — any credential value
- `access_token`, `refresh_token`, `id_token` — any auth token material
- `ssh_private_key`, `ssh_private_key_enc` — any key material
- `stripe_customer_id`, `payment_reference` — payment identity fields
- User PII: `email`, `username` where used as a personal identifier in high-volume paths
- Any field from `access_secret_enc` or `scheduler_metadata` that may contain credentials

**Implementation requirements**:
- Implement a sanitization middleware/interceptor at the service boundary that scrubs known sensitive field names before the log entry or span is emitted.
- Redaction format: replace value with `[REDACTED]` — never omit the field entirely, to preserve log structure for debugging.
- Apply the same scrubber to error messages that may echo request payloads.
- Audit log `metadata` jsonb fields must follow an explicit allowlist. Unknown keys are rejected.
- Allowed `platform_audit_logs.metadata` keys (MVP):
  - `reason`
  - `policy_key`
  - `old_value`
  - `new_value`
  - `status_from`
  - `status_to`
  - `error_code`
  - `request_scope`
  - `idempotency_key_hash`
  - `provider_ref`
  - `allocation_id`
  - `node_id`
- Forbidden in `platform_audit_logs.metadata`: raw tokens, raw credentials, SSH private/public key material, full request/response payload dumps, direct payment instrument data, end-user PII fields beyond stable IDs.
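
A minimal sketch of the two mechanisms above: a scrubber that replaces blocklisted values with `[REDACTED]` while keeping the keys, and an allowlist check for audit metadata. `Scrub` and `ValidateAuditMetadata` are hypothetical names and are not the repo's `middleware.Sanitize`. The sketch is deliberately conservative in two places that the rules leave conditional: it always redacts `email` and `username`, and it redacts `access_secret_enc` and `scheduler_metadata` wholesale.

```go
package main

import (
    "fmt"
    "sort"
)

const redacted = "[REDACTED]"

// sensitiveKeys mirrors the field blocklist above.
var sensitiveKeys = map[string]bool{
    "password": true, "password_hash": true,
    "access_token": true, "refresh_token": true, "id_token": true,
    "ssh_private_key": true, "ssh_private_key_enc": true,
    "stripe_customer_id": true, "payment_reference": true,
    "email": true, "username": true,
    "access_secret_enc": true, "scheduler_metadata": true,
}

// Scrub returns a copy of m with sensitive values replaced by [REDACTED].
// Keys are kept so the log structure survives; the input is not modified.
func Scrub(m map[string]any) map[string]any {
    out := make(map[string]any, len(m))
    for k, v := range m {
        if sensitiveKeys[k] {
            out[k] = redacted
            continue
        }
        out[k] = scrubValue(v)
    }
    return out
}

// scrubValue recurses into nested maps and slices.
func scrubValue(v any) any {
    switch t := v.(type) {
    case map[string]any:
        return Scrub(t)
    case []any:
        cp := make([]any, len(t))
        for i, item := range t {
            cp[i] = scrubValue(item)
        }
        return cp
    default:
        return v
    }
}

// allowedAuditKeys is the MVP allowlist for platform_audit_logs.metadata.
var allowedAuditKeys = map[string]bool{
    "reason": true, "policy_key": true, "old_value": true, "new_value": true,
    "status_from": true, "status_to": true, "error_code": true,
    "request_scope": true, "idempotency_key_hash": true,
    "provider_ref": true, "allocation_id": true, "node_id": true,
}

// ValidateAuditMetadata rejects metadata that contains unknown keys.
func ValidateAuditMetadata(meta map[string]any) error {
    var unknown []string
    for k := range meta {
        if !allowedAuditKeys[k] {
            unknown = append(unknown, k)
        }
    }
    if len(unknown) > 0 {
        sort.Strings(unknown)
        return fmt.Errorf("audit metadata: unknown keys %v", unknown)
    }
    return nil
}

func main() {
    body := map[string]any{"user_id": "u1", "password": "hunter2"}
    fmt.Println(Scrub(body))
    fmt.Println(ValidateAuditMetadata(map[string]any{"reason": "manual", "payload": "{}"}))
}
```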

## Agent PR Rules
- Spec updates included when behavior changes.
- Tests included for changed behavior.
- Migration and rollback notes required when schema changes.

---

## Go Implementation Patterns

These patterns are mandatory for all Go code in this repo. Every agent and
contributor must follow them so the codebase reads consistently regardless of
who wrote a given file.

### Import grouping

Three groups, blank-line-separated: stdlib → external → internal.
`goimports` enforces this automatically.

```go
import (
    // 1. Standard library
    "context"
    "net/http"

    // 2. External dependencies
    "github.com/google/uuid"
    "github.com/jackc/pgx/v5"

    // 3. Internal packages (always use full module path)
    apierrors "github.com/gpuaas/platform/packages/shared/errors"
    "github.com/gpuaas/platform/packages/shared/middleware"
    "github.com/gpuaas/platform/packages/shared/policy"
)
```

### Service handler struct

Every service package exposes a `Handler` struct that holds injected
dependencies. Never use package-level variables for dependencies.

```go
type Handler struct {
    pool   *pgxpool.Pool
    policy policy.Client
    log    *slog.Logger
}

func NewHandler(pool *pgxpool.Pool, pc policy.Client, log *slog.Logger) *Handler {
    return &Handler{pool: pool, policy: pc, log: log}
}
```

### HTTP handler signature

All route handlers are methods on the service `Handler` struct. Extract
`corrID` and `claims` at the top of every handler.

```go
func (h *Handler) CreateAllocation(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    corrID := middleware.CorrelationIDFromContext(ctx)
    claims := middleware.ClaimsFromContext(ctx) // always non-nil on auth-protected routes

    var req CreateAllocationRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        writeJSON(w, http.StatusBadRequest,
            apierrors.New(apierrors.ErrInvalidRequest, "invalid JSON", corrID))
        return
    }
    // validate → call service layer → respond
}
```

### Service function shape

Service functions never know about HTTP. They accept typed inputs and return
`(result, error)`. Domain sentinel errors are defined in the package and
translated to HTTP status codes only in the handler.

```go
// In the service layer:
var ErrAllocationNotFound = errors.New("allocation not found")

func (s *Service) GetAllocation(ctx context.Context, id uuid.UUID, userID string) (*Allocation, error) {
    // query DB, return (nil, ErrAllocationNotFound) for missing rows
}

// In the handler layer:
alloc, err := h.svc.GetAllocation(ctx, id, claims.UserID)
if errors.Is(err, svc.ErrAllocationNotFound) {
    writeJSON(w, http.StatusNotFound,
        apierrors.New(apierrors.ErrAllocationNotFound, "allocation not found", corrID))
    return
}
if err != nil {
    h.log.ErrorContext(ctx, "get allocation failed", "error", err, "correlation_id", corrID)
    writeJSON(w, http.StatusInternalServerError,
        apierrors.New(apierrors.ErrInternal, "internal error", corrID))
    return
}
```

### Error response helper

Every service handler file that writes HTTP responses should include a local
`writeJSON` helper (or import one from a shared internal package):

```go
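func writeJSON(w http.ResponseWriter, status int, v any) {
    b, err := json.Marshal(v)
    if err != nil {
        // Encoding failed before anything was written; fall back to a bare 500.
        http.Error(w, http.StatusText(http.StatusInternalServerError), http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    _, _ = w.Write(b) // a write error means the client is gone; nothing to do
}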
```

### Structured logging

Always use `slog` (Go stdlib). Always include `correlation_id`. Never log
fields from the PII blocklist — pass data through `middleware.Sanitize` first.

```go
// Correct:
slog.InfoContext(ctx, "allocation created",
    "allocation_id", alloc.ID,
    "user_id", claims.UserID,
    "correlation_id", corrID,
)

// Correct — log sanitised request body at DEBUG:
slog.DebugContext(ctx, "incoming request",
    slog.Any("body", middleware.Sanitize(bodyMap)))

// Wrong — raw request struct may contain tokens or keys:
slog.InfoContext(ctx, "request", "body", req)
```

### Build metadata standard

All Go binaries under `cmd/` must log build identity at startup and shutdown
using shared `packages/shared/buildinfo` fields:

- `version`
- `commit`
- `built_at`

Do not define ad-hoc per-binary version variables in `main` packages. Use
centralized ldflags stamping via Makefile build targets (for example
`make build-go-binaries` / `make build-node-agent`) so logs remain consistent
across API, workers, gateways, and agent binaries.
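
A single-file sketch of the idea, assuming the three fields are plain string variables stamped at link time. In the repo they would live in `packages/shared/buildinfo`; the variable names, defaults, and `Attrs` helper here are hypothetical, and the `-ldflags` line in the comment only shows the general `-X` form rather than the Makefile's actual flags.

```go
package main

import (
    "log/slog"
    "os"
)

// Stamped at build time, for example:
//   go build -ldflags "-X main.Version=1.4.2 -X main.Commit=abc1234 -X main.BuiltAt=2024-01-01T00:00:00Z"
// The defaults apply to unstamped local builds.
var (
    Version = "dev"
    Commit  = "unknown"
    BuiltAt = "unknown"
)

// Attrs returns the standard build identity fields as slog key/value pairs.
func Attrs() []any {
    return []any{"version", Version, "commit", Commit, "built_at", BuiltAt}
}

func main() {
    log := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    // Log the same identity fields at startup and at shutdown.
    log.Info("service starting", Attrs()...)
    defer log.Info("service stopped", Attrs()...)

    // ... run the service ...
}
```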

### DB transaction + outbox

Any mutation that changes domain state AND should emit an event must write both
in the same transaction. Never call `events.PublishTyped` directly from a
handler or service function — write to `platform_outbox_events` instead.

```go
tx, err := h.pool.BeginTx(ctx, pgx.TxOptions{})
if err != nil {
    return fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback(ctx) // no-op after Commit

// 1. Domain mutation
_, err = tx.Exec(ctx, `UPDATE allocations SET status = 'releasing' WHERE id = $1`, id)
if err != nil {
    return err
}

// 2. Outbox event (same transaction)
payload, err := json.Marshal(events.ReleasingRequestedPayload{AllocationID: id.String(), ...})
if err != nil {
    return fmt.Errorf("marshal outbox payload: %w", err)
}
_, err = tx.Exec(ctx, `
    INSERT INTO platform_outbox_events (aggregate_type, aggregate_id, event_type, payload, correlation_id)
    VALUES ($1, $2, $3, $4, $5)
`, "allocation", id, events.SubjectProvisioningReleasingRequested, payload, corrID)
if err != nil {
    return err
}

return tx.Commit(ctx)
```

### Policy values

Never hardcode business constants. Always read from `policy.Client`. Provide a
safe in-code fallback only when the policy key is optional or the fallback is
explicitly documented.

```go
// Correct:
limit, err := h.policy.GetInt(ctx, policy.KeyAllocationMaxConcurrentPerUser, policy.WithOrgScope(claims.OrgID))
if err != nil {
    return 0, fmt.Errorf("policy lookup: %w", err)
}

// Wrong — hardcoded constant in production path:
const maxAllocations = 5
```

### Audit log

Every privileged mutation (provision, release, force-release, refund, admin
node ops, admin user ops) must insert a `platform_audit_logs` row. Write it inside the
same DB transaction as the mutation.

```go
_, err = tx.Exec(ctx, `
    INSERT INTO platform_audit_logs
        (actor_user_id, actor_role, action, target_type, target_id, result, correlation_id)
    VALUES ($1, $2, $3, $4, $5, $6, $7)
`, claims.UserID, role, "allocation.release", "allocation", id, "success", corrID)
if err != nil {
    return fmt.Errorf("insert audit log: %w", err)
}
```

### Context propagation

Always thread `context.Context` as the first argument. Never store a context in
a struct field. Never use `context.Background()` inside a handler — always
propagate from `r.Context()`.

### Naming conventions

| Thing | Convention | Example |
|---|---|---|
| Handler struct | `Handler` per package | `billing.Handler` |
| Service struct | `Service` per package | `billing.Service` |
| Constructor | `New<Type>` | `NewHandler`, `NewService` |
| Domain sentinel errors | `Err<Noun>` | `ErrAllocationNotFound` |
| Policy key consts | `Key<Domain><Name>` | `policy.KeyBillingWindowSeconds` |
| Event subject consts | `Subject<Domain><Event>` | `events.SubjectProvisioningActive` |
| Test file | `<file>_test.go`, same package | `handler_test.go` |
| Integration test file | `//go:build integration` tag | see Testing Standards |