# NATS JetStream Stream Configuration

## Overview

All domain events use NATS JetStream with explicit acknowledgement and durable consumers.
The per-subject consumer settings (`ack_wait`, `max_deliver`, DLQ subject) are defined in
`doc/api/asyncapi.draft.yaml`. This document defines the complementary stream-level
configuration: stream names, subject bindings, retention policy, and the durable consumer
catalog with assigned names.

---

## Stream Definitions

### `BILLING`

| Parameter | Value |
|---|---|
| Name | `BILLING` |
| Subjects | `platform.billing.>` |
| Retention | `LimitsPolicy` (fan-out — multiple consumers read each message independently) |
| Storage | `File` |
| Max age | 7 days |
| Max bytes | 64 MB |
| Discard policy | `DiscardOld` |
| Replicas | 1 (dev) / 3 (prod) |

### `PAYMENTS`

| Parameter | Value |
|---|---|
| Name | `PAYMENTS` |
| Subjects | `platform.payments.>` |
| Retention | `LimitsPolicy` |
| Storage | `File` |
| Max age | 30 days (financial events — longer retention for audit) |
| Max bytes | 64 MB |
| Discard policy | `DiscardOld` |
| Replicas | 1 (dev) / 3 (prod) |

### `PROVISIONING`

| Parameter | Value |
|---|---|
| Name | `PROVISIONING` |
| Subjects | `gpuaas.provisioning.>` |
| Retention | `LimitsPolicy` |
| Storage | `File` |
| Max age | 7 days |
| Max bytes | 64 MB |
| Discard policy | `DiscardOld` |
| Replicas | 1 (dev) / 3 (prod) |

### `APPS`

| Parameter | Value |
|---|---|
| Name | `APPS` |
| Subjects | `appplatform.>`, `platform.artifact.>` |
| Retention | `LimitsPolicy` |
| Storage | `File` |
| Max age | 7 days |
| Max bytes | 64 MB |
| Discard policy | `DiscardOld` |
| Replicas | 1 (dev) / 3 (prod) |

### `STORAGE`

| Parameter | Value |
|---|---|
| Name | `STORAGE` |
| Subjects | `storage.>` |
| Retention | `LimitsPolicy` |
| Storage | `File` |
| Max age | 7 days |
| Max bytes | 64 MB |
| Discard policy | `DiscardOld` |
| Replicas | 1 (dev) / 3 (prod) |

### `DLQ`

| Parameter | Value |
|---|---|
| Name | `DLQ` |
| Subjects | `dlq.>` |
| Retention | `LimitsPolicy` |
| Storage | `File` |
| Max age | 30 days (time for ops investigation and manual replay) |
| Max bytes | 64 MB |
| Discard policy | `DiscardOld` |
| Replicas | 1 (dev) / 3 (prod) |
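
The six tables above can be captured as data so that dev and prod differ only in the replica count. The following is a minimal sketch, not the project's actual code: `StreamSpec` and `StreamCatalog` are hypothetical names, the struct is a dependency-free mirror of the `Name`, `Subjects`, `MaxAge`, `MaxBytes` and `Replicas` fields of `nats.StreamConfig`, and "64 MB" is assumed to mean 64 MiB.

```go
package main

import "time"

// StreamSpec is a hypothetical, dependency-free mirror of the
// nats.StreamConfig fields that vary between the tables above.
type StreamSpec struct {
	Name     string
	Subjects []string
	MaxAge   time.Duration
	MaxBytes int64
	Replicas int
}

const (
	day      = 24 * time.Hour
	maxBytes = 64 * 1024 * 1024 // assumption: "64 MB" is read as 64 MiB
)

// StreamCatalog returns the six streams defined above. Retention, storage
// and discard policy are the same for every stream (LimitsPolicy, File,
// DiscardOld), so they are left out of the spec.
func StreamCatalog(prod bool) []StreamSpec {
	replicas := 1 // dev
	if prod {
		replicas = 3 // prod: requires a 3-node NATS cluster
	}
	mk := func(name string, maxAge time.Duration, subjects ...string) StreamSpec {
		return StreamSpec{
			Name:     name,
			Subjects: subjects,
			MaxAge:   maxAge,
			MaxBytes: maxBytes,
			Replicas: replicas,
		}
	}
	return []StreamSpec{
		mk("BILLING", 7*day, "platform.billing.>"),
		mk("PAYMENTS", 30*day, "platform.payments.>"),
		mk("PROVISIONING", 7*day, "gpuaas.provisioning.>"),
		mk("APPS", 7*day, "appplatform.>", "platform.artifact.>"),
		mk("STORAGE", 7*day, "storage.>"),
		mk("DLQ", 30*day, "dlq.>"),
	}
}
```

Keeping the catalog in one slice makes it straightforward to translate each entry into a `nats.StreamConfig` at startup and to assert in a unit test that the code and these tables agree.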

---

## Durable Consumer Catalog

Each row is one durable consumer. Consumer names are stable: renaming a consumer discards
its tracked position, so the renamed consumer reprocesses the stream from the start defined
by its deliver policy.

### `BILLING` stream consumers

| Consumer name | Filter subject | Consuming service | Ack wait | Max deliver | Deliver policy |
|---|---|---|---|---|---|
| `notification_relay_low_balance` | `platform.billing.low_balance_warning` | notification-relay | 30s | 10 | `DeliverAll` |
| `notification_relay_auto_release_pending` | `platform.billing.auto_release_pending` | notification-relay | 30s | 10 | `DeliverAll` |
| `notification_relay_balance_depleted` | `platform.billing.balance_depleted` | notification-relay | 30s | 10 | `DeliverAll` |
| `billing_worker_usage_metered` | `platform.billing.usage.metered` | billing-worker | 30s | 10 | `DeliverAll` |

### `PAYMENTS` stream consumers

| Consumer name | Filter subject | Consuming service | Ack wait | Max deliver | Deliver policy |
|---|---|---|---|---|---|
| `billing_worker_balance_credited` | `platform.payments.balance_credited` | billing-worker | 30s | 10 | `DeliverAll` |
| `notification_relay_reconcile_failed` | `platform.payments.reconcile_failed` | notification-relay | 30s | 10 | `DeliverAll` |

### `PROVISIONING` stream consumers

| Consumer name | Filter subject | Consuming service | Ack wait | Max deliver | Deliver policy |
|---|---|---|---|---|---|
| `provisioning_worker_provision_requested` | `gpuaas.provisioning.requested` | provisioning-worker | 45s | 20 | `DeliverAll` |
| `billing_worker_provision_active` | `gpuaas.provisioning.active` | billing-worker | 30s | 10 | `DeliverAll` |
| `notification_relay_provision_active` | `gpuaas.provisioning.active` | notification-relay | 30s | 10 | `DeliverAll` |
| `notification_relay_provision_failed` | `gpuaas.provisioning.failed` | notification-relay | 30s | 10 | `DeliverAll` |
| `provisioning_worker_force_release` | `gpuaas.provisioning.force_release_requested` | provisioning-worker | 45s | 20 | `DeliverAll` |
| `provisioning_worker_releasing_requested` | `gpuaas.provisioning.releasing.requested` | provisioning-worker | 45s | 20 | `DeliverAll` |
| `billing_worker_releasing_completed` | `gpuaas.provisioning.releasing.completed` | billing-worker | 30s | 10 | `DeliverAll` |
| `notification_relay_releasing_completed` | `gpuaas.provisioning.releasing.completed` | notification-relay | 30s | 10 | `DeliverAll` |
| `billing_worker_release_failed` | `gpuaas.provisioning.release_failed` | billing-worker | 30s | 10 | `DeliverAll` |
| `notification_relay_release_failed` | `gpuaas.provisioning.release_failed` | notification-relay | 30s | 10 | `DeliverAll` |

### `APPS` stream consumers

| Consumer name | Filter subject | Consuming service | Ack wait | Max deliver | Deliver policy |
|---|---|---|---|---|---|
| `app_runtime_worker_instance_requested` | `appplatform.runtime.instance.requested` | app-runtime-worker | 45s | 20 | `DeliverAll` |
| `app_runtime_worker_instance_upgrade_requested` | `appplatform.runtime.instance.upgrade_requested` | app-runtime-worker | 45s | 20 | `DeliverAll` |
| `app_runtime_worker_instance_rollback_requested` | `appplatform.runtime.instance.rollback_requested` | app-runtime-worker | 45s | 20 | `DeliverAll` |
| `app_runtime_worker_instance_decommission_requested` | `appplatform.runtime.instance.decommission_requested` | app-runtime-worker | 45s | 20 | `DeliverAll` |

### `STORAGE` stream consumers

| Consumer name | Filter subject | Consuming service | Ack wait | Max deliver | Deliver policy |
|---|---|---|---|---|---|
| `storage_worker_attachment_requested` | `storage.attachment.requested` | provisioning-worker / storage-worker | 45s | 20 | `DeliverAll` |
| `storage_worker_attachment_detach_requested` | `storage.attachment.detach_requested` | provisioning-worker / storage-worker | 45s | 20 | `DeliverAll` |

### `DLQ` stream consumers

| Consumer name | Filter subject | Consuming service | Ack wait | Max deliver | Deliver policy |
|---|---|---|---|---|---|
| `dlq_inspector` | `dlq.>` | ops / replay tooling | 60s | 1 | `DeliverAll` |
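
A consumer can only receive messages if its filter subject falls under one of its stream's subject bindings. The sketch below shows one way to check a catalog row against the stream definitions. It assumes the usual NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens); `subjectMatches`, `binding` and `streamFor` are hypothetical helpers, not part of the NATS client.

```go
package main

import "strings"

// subjectMatches reports whether subject falls under a NATS subject pattern.
// "*" matches exactly one token; ">" is only honoured as the final token and
// matches one or more remaining tokens.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

// binding pairs a stream name with its subject bindings.
type binding struct {
	Stream   string
	Subjects []string
}

// bindings restates the Subjects rows of the stream definitions.
var bindings = []binding{
	{"BILLING", []string{"platform.billing.>"}},
	{"PAYMENTS", []string{"platform.payments.>"}},
	{"PROVISIONING", []string{"gpuaas.provisioning.>"}},
	{"APPS", []string{"appplatform.>", "platform.artifact.>"}},
	{"STORAGE", []string{"storage.>"}},
	{"DLQ", []string{"dlq.>"}},
}

// streamFor returns the stream whose bindings cover the filter subject,
// or "" if no stream captures it.
func streamFor(filter string) string {
	for _, b := range bindings {
		for _, pattern := range b.Subjects {
			if subjectMatches(pattern, filter) {
				return b.Stream
			}
		}
	}
	return ""
}
```

The helper treats the filter as a literal subject, which is enough for the rows above (including `dlq.>` against the identical binding) but is not a general subset test between two wildcard patterns.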

---

## Consumer Summary

| Consuming service | Total consumers |
|---|---|
| provisioning-worker | 3 |
| app-runtime-worker | 4 |
| billing-worker | 5 |
| notification-relay | 8 |
| provisioning-worker / storage-worker | 2 |
| ops / replay tooling | 1 |
| **Total** | **23** |

---

## Environment Configuration

| Setting | Dev | Prod |
|---|---|---|
| Replicas | 1 | 3 (requires 3-node NATS cluster) |
| Storage | File | File |
| NATS address | `nats://localhost:4222` | cluster DNS (env-specific) |
| JetStream enabled | `-js` flag (docker-compose) | via NATS config file |

---

## Initialization

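Streams and consumers are created idempotently on service startup using the NATS Go client.
`js.AddStream` succeeds when the stream already exists with an identical configuration and
returns `ErrStreamNameAlreadyInUse` when the existing configuration differs. Init treats that
error as a no-op, so changing an existing stream's configuration requires an explicit
`js.UpdateStream`. `js.AddConsumer` is likewise idempotent for an unchanged durable configuration.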

**Responsibility**: `packages/shared/events` exports an `InitStreams(js nats.JetStreamContext) error`
function that creates all streams. Each consuming service creates its own durable consumer
on startup — not all consumers are created centrally, keeping service boundaries clean.
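
The idempotent creation loop can be sketched as follows. This is an illustration under stated assumptions rather than the contents of `packages/shared/events`: `streamAdder` is a hypothetical narrow interface standing in for `nats.JetStreamContext`, `errStreamNameInUse` stands in for `nats.ErrStreamNameAlreadyInUse`, and `fakeJS` is an in-memory test double so the sketch runs without a server.

```go
package main

import (
	"errors"
	"fmt"
)

// errStreamNameInUse stands in for nats.ErrStreamNameAlreadyInUse so that
// the sketch compiles without the NATS client.
var errStreamNameInUse = errors.New("stream name already in use")

// streamAdder is a hypothetical narrow interface. The real function would
// take nats.JetStreamContext and pass a *nats.StreamConfig.
type streamAdder interface {
	AddStream(name string, subjects []string) error
}

type streamDef struct {
	name     string
	subjects []string
}

// streamDefs lists every stream from the definitions above, DLQ last.
var streamDefs = []streamDef{
	{"BILLING", []string{"platform.billing.>"}},
	{"PAYMENTS", []string{"platform.payments.>"}},
	{"PROVISIONING", []string{"gpuaas.provisioning.>"}},
	{"APPS", []string{"appplatform.>", "platform.artifact.>"}},
	{"STORAGE", []string{"storage.>"}},
	{"DLQ", []string{"dlq.>"}},
}

// initStreams creates every stream and treats "already in use" as success.
// Any other error aborts startup and names the failing stream.
func initStreams(js streamAdder) error {
	for _, d := range streamDefs {
		err := js.AddStream(d.name, d.subjects)
		if err != nil && !errors.Is(err, errStreamNameInUse) {
			return fmt.Errorf("add stream %s: %w", d.name, err)
		}
	}
	return nil
}

// fakeJS is an in-memory test double that rejects duplicate stream names.
type fakeJS struct {
	created map[string]bool
}

func (f *fakeJS) AddStream(name string, _ []string) error {
	if f.created[name] {
		return errStreamNameInUse
	}
	f.created[name] = true
	return nil
}
```

Running `initStreams` twice against the same `fakeJS` creates six streams on the first call and returns `nil` on the second, which is the behaviour several services starting concurrently rely on.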

Consumer mode requirement:
- Durable consumers are configured as **pull consumers** (no `DeliverSubject`).
- Worker instances use pull-subscribe semantics for competing-consumer scaling.
- Do not use random-inbox push consumers for durable worker groups.
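
These requirements can be enforced with a small validation step before a service registers its consumer. The sketch below is a hypothetical guard: `ConsumerSpec` mirrors the `Durable`, `FilterSubject`, `DeliverSubject`, `AckWait` and `MaxDeliver` fields of `nats.ConsumerConfig`, while `ExplicitAck` is a simplification of the client's ack-policy setting.

```go
package main

import (
	"errors"
	"time"
)

// ConsumerSpec is a hypothetical mirror of the consumer settings this
// document constrains.
type ConsumerSpec struct {
	Durable        string
	FilterSubject  string
	DeliverSubject string // must stay empty: a value here makes it a push consumer
	ExplicitAck    bool
	AckWait        time.Duration
	MaxDeliver     int
}

// Validate rejects configurations that break the consumer mode requirement.
func (c ConsumerSpec) Validate() error {
	switch {
	case c.Durable == "":
		return errors.New("consumer must be durable")
	case c.DeliverSubject != "":
		return errors.New("push consumers are not allowed: DeliverSubject must be empty")
	case !c.ExplicitAck:
		return errors.New("explicit ack is required")
	case c.AckWait <= 0 || c.MaxDeliver <= 0:
		return errors.New("ack wait and max deliver must be positive")
	}
	return nil
}

// usageMetered restates the billing_worker_usage_metered catalog row.
var usageMetered = ConsumerSpec{
	Durable:       "billing_worker_usage_metered",
	FilterSubject: "platform.billing.usage.metered",
	ExplicitAck:   true,
	AckWait:       30 * time.Second,
	MaxDeliver:    10,
}
```

`usageMetered.Validate()` returns `nil`, whereas a copy with `DeliverSubject` set to an inbox subject is rejected before it reaches the server.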

Stream init order: `BILLING`, `PAYMENTS`, `PROVISIONING`, `APPS`, `STORAGE`, `DLQ` (the order is
not significant, but placing `DLQ` last is conventional because consumers of the other streams
re-publish to DLQ subjects).

Initialization must run before any publisher or consumer begins processing. A startup health
check should verify JetStream connectivity before accepting traffic.

---

## DLQ Replay

When a message exhausts its `max_deliver` retries, the consumer re-publishes it to its
configured `dlq_subject` (see AsyncAPI `x-nats-config.dlq_subject`).
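
Because the re-publish happens in consumer code, each worker needs a rule for when a failed delivery counts as the last one. A minimal sketch of that decision follows. The names `action`, `decide` and `errHandler` are hypothetical, and it assumes the worker reads the 1-based delivery count of the current attempt from the message metadata (in the Go client, `NumDelivered` from `msg.Metadata()`).

```go
package main

import "errors"

// action is what the worker does with a message after its handler returns.
type action int

const (
	ackMsg     action = iota // handler succeeded: acknowledge
	nakMsg                   // handler failed, retries remain: request redelivery
	deadLetter               // handler failed on the final attempt: publish to the DLQ subject, then ack
)

// errHandler is a placeholder for any handler failure.
var errHandler = errors.New("handler failed")

// decide picks the action for one delivery attempt. numDelivered is the
// 1-based delivery count of this attempt; maxDeliver is the consumer's limit.
func decide(handlerErr error, numDelivered uint64, maxDeliver int) action {
	if handlerErr == nil {
		return ackMsg
	}
	if maxDeliver > 0 && numDelivered >= uint64(maxDeliver) {
		return deadLetter
	}
	return nakMsg
}
```

With `max_deliver` set to 10, a failure on attempt 9 yields `nakMsg` and a failure on attempt 10 yields `deadLetter`. A final attempt that times out without the handler returning never reaches this decision, so handlers should finish within `ack_wait`.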

Replay procedure:
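1. Inspect the stored DLQ messages with `nats stream view DLQ`. A plain `nats sub dlq.<domain>.<event>` only shows messages published while it is subscribed, and the NATS monitoring port reports stream message counts rather than message contents.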
2. Fix the root cause (code deploy, data correction, external dependency recovery).
3. Re-publish the message to the original subject with the same `event_id` (consumers are
   idempotent by contract — see `doc/architecture/Event_Taxonomy.md`).
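4. Ack the DLQ message on `dlq_inspector`. Under `LimitsPolicy` an ack does not remove the message, so delete it from the `DLQ` stream as well if it should not remain until `Max age` expires.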

A `scripts/replay-dlq.sh` helper should be provided per environment.

---

## Rules

1. **Consumer names are stable** — renaming or deleting a consumer is a breaking change.
2. **New consumers are additive** — add without stream downtime.
3. **All consumers use explicit ack and pull mode** — no auto-ack and no random-inbox push consumer groups in production code.
4. **Idempotent consumers mandatory** — replay must be safe; `event_id` is the deduplication key.
5. **Do not publish directly to `dlq.>`** — only a consumer's retry handling re-publishes
   messages there on `max_deliver` exhaustion; manual DLQ writes are only for testing.
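
Rule 4 can be illustrated with a small deduplication guard keyed on `event_id`. This is a sketch only: `Deduper` is a hypothetical type, and it keeps processed IDs in memory, whereas a real service would presumably record them in durable storage so that deduplication survives restarts.

```go
package main

import "sync"

// Deduper remembers which event IDs have been processed successfully.
// In-memory only, for illustration.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Process runs handle at most once per eventID. A repeated ID is reported
// as a duplicate and skipped. A failed handle is not recorded, so a later
// redelivery or replay of the same event is processed again.
// The lock is held while handle runs, which serializes processing.
func (d *Deduper) Process(eventID string, handle func() error) (duplicate bool, err error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[eventID]; ok {
		return true, nil
	}
	if err := handle(); err != nil {
		return false, err
	}
	d.seen[eventID] = struct{}{}
	return false, nil
}
```

A worker would ack both fresh and duplicate deliveries, which is what makes re-publishing a DLQ message with its original `event_id` safe.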
