# AI Factory Team Domain Operating Model v1

## Purpose

This document captures why the v3 product model and the domain-based code refactor are happening now.

The prototype has moved from an exploratory build to a platform that can support a staffed product engineering program. The codebase therefore needs to make team boundaries explicit before the UI overhaul and backend read-model work scale out.

## Strategic Context

The executive SteerCo direction points to a Core42 AI Factory platform, not a narrow GPU rental portal.

The long-term product needs to support:

- public exploration before purchase friction,
- SSO-first entry,
- KYC and checkout at the purchase moment where region/use case requires it,
- Bare Metal GPUaaS,
- managed GPU service,
- Token Factory / model serving,
- regional AI Factory deployments,
- portal, API, and CLI access,
- multi-silicon compute,
- high-performance storage and hybrid networking,
- telemetry, billing, metering, tenant isolation, and audit.

That means v3 should become the long-term product and engineering model, not a temporary visual redesign.

## Core Principle

Product surfaces, API contracts, route files, service packages, queue tasks, and team ownership should line up around the same domains.

When those boundaries drift, every change requires cross-team coordination and `routes.go`/OpenAPI become shared bottlenecks. When they align, domain experts can own their slices independently while the platform still publishes one coherent product and one canonical API contract.
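
One way to keep that alignment checkable is a small ownership lookup that maps a file path to its owning domain. The sketch below is illustrative only: `ownershipRule`, `rules`, and `OwnerOf` are hypothetical names, and the rule table copies just a few of the paths listed under Team Domains below.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// ownershipRule is a hypothetical type: Pattern is either a directory
// prefix ending in "/" or a path.Match glob for a single file name.
type ownershipRule struct {
	Pattern string
	Domain  string
}

// rules copies a subset of the code-ownership paths from this document.
var rules = []ownershipRule{
	{"packages/web/src/components/product/", "Product/UI"},
	{"packages/services/provisioning/", "Provisioning"},
	{"cmd/api/routes_provisioning_*.go", "Provisioning"},
	{"cmd/api/routes_platform_*.go", "Platform Control"},
	{"packages/services/billing/", "Billing/Finance"},
}

// OwnerOf returns the first domain whose rule matches the file path.
func OwnerOf(file string) (string, bool) {
	for _, r := range rules {
		if strings.HasSuffix(r.Pattern, "/") {
			// Directory rule: everything under the prefix is owned.
			if strings.HasPrefix(file, r.Pattern) {
				return r.Domain, true
			}
			continue
		}
		// Glob rule: "*" does not cross "/" in path.Match.
		if ok, err := path.Match(r.Pattern, file); err == nil && ok {
			return r.Domain, true
		}
	}
	return "", false
}

func main() {
	for _, f := range []string{"cmd/api/routes_provisioning_maas.go", "cmd/api/routes.go"} {
		domain, ok := OwnerOf(f)
		fmt.Println(f, domain, ok)
	}
}
```

A file with no matching rule, such as the monolithic `cmd/api/routes.go`, comes back unowned, which is the signal that it is still a shared bottleneck.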

## Product Pillars

| Pillar | Product Meaning | Primary Engineering Domains |
|---|---|---|
| Bare Metal GPUaaS | Dedicated nodes, slices, Slurm/K8s, MAAS/reimage, node-agent operations | Provisioning, Storage/Network, Observability/Ops |
| Managed GPU Service | Workloads, app launches, managed runtimes, project-scoped access | Product/UI, App Platform, IAM/Access, Billing/Finance |
| Metered Inference Factory | Model catalogue, inference endpoints, usage accounting, Compass integration | App Platform, Billing/Finance, Observability/Ops |
| Platform Control | Internal control plane reliability and operator safety | Platform Control, IAM/Access, Observability/Ops |

## Team Domains

### Product/UI

Owns:

- v3 shell and navigation model,
- page families and production component system,
- launch wizards and inline dependency creation,
- user, tenant-admin, project-admin, platform-admin, and ops journeys,
- responsive behavior and accessibility.

Code ownership:

- `packages/web/src/components/product/`
- production pages migrated from `/v3` mock patterns
- `doc/product/`

Minimum skills:

- TypeScript/React/Next.js production engineer,
- UX/product designer with complex workflow IA experience,
- frontend test automation for workflow and accessibility coverage,
- enough API literacy to work contract-first against OpenAPI.

### Provisioning

Owns:

- allocation lifecycle,
- scheduler integration,
- node-agent task protocol,
- node bootstrap,
- MAAS/reimage,
- BMC/IPMI/Redfish lifecycle,
- Slurm/K8s direct-to-node provisioning paths,
- capacity drain, retire, re-enroll, and slot discovery.

Code ownership:

- `packages/services/provisioning/`
- `cmd/node-agent/`
- `cmd/provisioning-worker/`
- future `packages/services/maas/`
- `cmd/api/routes_provisioning_*.go`
- `cmd/api/routes_bootstrap_nodes.go`
- `cmd/api/routes_node_internal.go`

Suggested route split:

```text
routes_provisioning_allocations.go
routes_provisioning_nodes.go
routes_provisioning_maas.go
routes_bootstrap_nodes.go
routes_node_internal.go
```

Minimum skills:

- Go backend/distributed systems engineer,
- bare-metal/HPC/MAAS infrastructure engineer,
- node-agent/Linux systems engineer,
- Slurm/K8s operations knowledge,
- GPU/MIG/slice lifecycle knowledge,
- BMC/IPMI/Redfish and host-network troubleshooting experience.

Infra-domain knowledge is mandatory for this domain. Generic backend implementation skill is not sufficient because the hardest failures happen at the boundary between control-plane state, host OS state, node-agent behavior, scheduler behavior, and physical infrastructure.

### Platform Control

Owns:

- registry integration and credential delivery,
- Vault/PKI and certificate lifecycle,
- Redis/read-model cache,
- NATS, outbox, DLQ, and replay tooling,
- release promotion and deployment control,
- internal health and control-plane SLOs.

Code ownership:

- `packages/shared/readcache/`
- `packages/shared/pki/`
- `packages/shared/events/`
- `packages/shared/outbox/`
- release/deployment scripts under `scripts/ci/`
- `cmd/api/routes_platform_*.go`

Suggested route split:

```text
routes_platform_registry.go
routes_platform_vault.go
routes_platform_cache.go
routes_platform_events.go
routes_platform_releases.go
routes_platform_health.go
```

Minimum skills:

- Go backend/platform engineer,
- security/platform engineer for PKI/Vault,
- Redis/NATS/outbox operational knowledge,
- SRE-oriented engineer for release reliability, rollback, and DLQ recovery,
- CI/CD and deployment automation ownership.

### IAM / Access

Owns:

- OIDC/SSO and identity federation,
- tenant/project membership,
- platform roles,
- service accounts,
- SSH/API keys,
- account linking and duplicate identity resolution,
- access audit and authorization evidence.

Code ownership:

- `packages/services/auth/`
- IAM/access API contracts and routes,
- Keycloak/OpenFGA integration when introduced,
- account/access v3 read models.

Suggested route split:

```text
routes_auth.go
routes_tenant_users.go
routes_access_memberships.go
routes_access_credentials.go
routes_account.go
```

Minimum skills:

- Go backend engineer with IAM/security depth,
- OIDC/SAML/SSO and Keycloak/OpenFGA experience,
- TypeScript frontend/product engineer shared with Product/UI for access UX,
- security reviewer for authz, credential, and audit paths.

### Billing / Finance

Owns:

- ledger invariants,
- usage collection and attribution,
- pricing modes: on-demand, spot, reserved,
- budgets and alerts,
- invoices and payments,
- delinquency controls,
- ingress/egress and app/token accounting.

Code ownership:

- `packages/services/billing/`
- `packages/services/payments/`
- billing workers and webhook worker,
- finance/account billing surfaces.

Suggested route split:

```text
routes_billing.go
routes_payments.go
routes_admin_finance.go
```

Minimum skills:

- Go backend engineer with ledger/accounting rigor,
- payments integration engineer,
- idempotency/reconciliation testing experience,
- product analyst / finance operations partner,
- TypeScript support for account, billing, and admin finance surfaces.

### App Platform

Owns:

- app catalog,
- app launch,
- app instances as workload subtypes,
- app artifacts and runtime bundles,
- shared runtimes,
- app telemetry producers,
- model serving integration and Token Factory app surfaces.

Code ownership:

- app platform service packages,
- app-instance and app-artifact routes,
- app launch wizard backend dependencies,
- external app team integration contracts.

Suggested route split:

```text
routes_apps_catalog.go
routes_apps_instances.go
routes_apps_artifacts.go
routes_apps_runtimes.go
```

Minimum skills:

- Go backend app-platform engineer,
- ML/runtime engineer for vLLM/Jupyter/training/application lifecycles,
- OCI/container/runtime packaging knowledge,
- TypeScript frontend engineer shared with Product/UI,
- integration partner who can work with external app teams.

### Observability / Ops

Owns:

- fleet telemetry and operator dashboards,
- metrics read models,
- logs/Loki collector topology,
- incidents and runbooks,
- evidence/audit investigation flows,
- operational validation and smoke harnesses.

Code ownership:

- admin/platform ops routes,
- telemetry read models,
- runbook and evidence surfaces,
- log/metric collector deployment patterns.

Suggested route split:

```text
routes_admin_ops.go
routes_admin_nodes.go
routes_admin_telemetry.go
routes_admin_evidence.go
routes_admin_runbooks.go
```

Minimum skills:

- SRE/observability engineer,
- Go backend read-model engineer,
- Prometheus/VictoriaMetrics/Grafana/Loki/OpenTelemetry experience,
- incident/runbook and production smoke-test ownership,
- operations/product partner.

### Storage / Network

Owns:

- bucket/workload mount model,
- lifecycle, encryption, and KMS integration,
- storage provider abstraction, starting with WEKA and extending to VAST/DDN/NVMe-class backends later,
- VRF/firewall/load balancer/VPN/public IP surfaces,
- storage and network launch dependencies,
- storage/network capacity visibility.

Code ownership:

- `packages/services/storage/`
- future network service packages,
- storage and connectivity routes/read models.

Suggested route split:

```text
routes_storage.go
routes_network_connectivity.go
routes_network_security.go
```

Minimum skills:

- storage/platform infrastructure engineer,
- network infrastructure engineer,
- Go backend engineer for API/read-model integration,
- KMS/encryption and quota/lifecycle policy knowledge,
- TypeScript support for storage/connectivity workbenches and launch dependencies.

Infra-domain knowledge is mandatory for this domain. Storage and network cannot be treated as generic CRUD surfaces because correctness depends on fabric topology, isolation boundaries, mount semantics, encryption/KMS behavior, performance, and host-level troubleshooting.

Storage should follow the same vendor-neutral posture as GPU capacity. WEKA may be the first backend, but product/API models should describe capabilities and policies, not hardcode WEKA-specific behavior in user-facing contracts. VAST, DDN, NVMe pools, or other storage classes should be able to plug in through provider capability metadata.
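
A minimal sketch of what provider capability metadata could look like, assuming a request names capabilities rather than a vendor. Everything here is hypothetical: the `Capability` constants, `ProviderInfo`, `StorageRequest`, and `EligibleProviders` are not existing types, and the capability sets assigned to each provider are made up for illustration, not statements about real product features.

```go
package main

import (
	"fmt"
	"sort"
)

// Capability is a hypothetical vendor-neutral feature name.
type Capability string

const (
	CapEncryptionKMS Capability = "encryption.kms"
	CapPOSIXMount    Capability = "mount.posix"
	CapS3Bucket      Capability = "bucket.s3"
)

// ProviderInfo is the metadata a storage backend would register.
type ProviderInfo struct {
	Name         string
	Capabilities map[Capability]bool
}

// StorageRequest is what the user-facing contract asks for; it never
// names a vendor.
type StorageRequest struct {
	Required []Capability
}

// EligibleProviders returns the names of providers that offer every
// required capability, sorted for deterministic output.
func EligibleProviders(providers []ProviderInfo, req StorageRequest) []string {
	var out []string
	for _, p := range providers {
		ok := true
		for _, c := range req.Required {
			if !p.Capabilities[c] {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, p.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative capability sets only.
	providers := []ProviderInfo{
		{Name: "weka", Capabilities: map[Capability]bool{CapPOSIXMount: true, CapEncryptionKMS: true}},
		{Name: "nvme-pool", Capabilities: map[Capability]bool{CapPOSIXMount: true}},
	}
	req := StorageRequest{Required: []Capability{CapPOSIXMount, CapEncryptionKMS}}
	fmt.Println(EligibleProviders(providers, req))
}
```

With this shape, adding a backend means registering another `ProviderInfo`; the request and response contracts stay unchanged.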

## Language And Skill Expectations

| Domain | Primary Languages | Mandatory Domain Expertise |
|---|---|---|
| Product/UI | TypeScript, React, Next.js | UX IA, workflow design, accessibility, frontend testing |
| Provisioning | Go, Bash, cloud-init/systemd | MAAS, Linux hosts, Slurm/K8s, BMC/IPMI/Redfish, GPU/MIG/slice lifecycle |
| Platform Control | Go, Bash, CI scripting | Redis, NATS, outbox/DLQ, Vault/PKI, registry, release reliability |
| IAM/Access | Go, TypeScript | OIDC/SAML/SSO, Keycloak/OpenFGA, credential safety, audit |
| Billing/Finance | Go, TypeScript | Ledger invariants, payments, reconciliation, pricing/budget policy |
| App Platform | Go, TypeScript, container tooling | vLLM/Jupyter/training runtimes, OCI images, app lifecycle, Token Factory integration |
| Observability/Ops | Go, Bash, query languages | OTel, Prometheus/VictoriaMetrics, Grafana, Loki, runbooks, incident response |
| Storage/Network | Go, TypeScript, infra automation | VAST/WEKA/DDN/NVMe, IB/RoCE/Ethernet, VRF/VLAN/firewall/LB/VPN/public IP, KMS |
| QA / Release Quality | TypeScript, Go, Bash | E2E workflows, contract tests, authz negative tests, kind validation, release gates |

Agents can accelerate implementation inside each lane, but they do not replace the required domain expertise. The human owner remains accountable for design correctness, production safety, reviews, test strategy, and incident response.

## Cross-Cutting Architecture Rules

- Keep one canonical bundled OpenAPI and one canonical bundled AsyncAPI artifact.
- Author contracts as per-domain fragments as each domain migrates.
- Keep `cmd/api` as one BFF binary until a domain has independent scaling/availability reasons to extract.
- Split routes by domain now to remove the monolithic `routes.go` bottleneck.
- Keep v3 read models scoped, cacheable, and page-shaped, but graduate stable resources to long-term resource paths over time.
- Use public/admin APIs or read models for ops verification before resorting to direct SQL.
- Every domain owns its tests, smoke harnesses, and production runbooks.

## Initial Resource Shape

This is a practical starting point, not a fixed org chart.

| Domain | Minimum Starting Coverage | Scale Trigger |
|---|---:|---|
| Product/UI | 2-3 people | Multiple v3 surfaces being migrated in parallel |
| Provisioning | 3 people | MAAS/reimage + Slurm/K8s/node-agent work running concurrently |
| Platform Control | 2-3 people | Release, PKI, registry, eventing, and cache work all active |
| IAM/Access | 1-2 people | SSO, service accounts, and fine-grained authz implementation starts |
| Billing/Finance | 1-2 people | Reserved/spot/budgets/invoicing move beyond usage accounting |
| App Platform | 2 people | Token Factory and external app integrations start parallel delivery |
| Observability/Ops | 2 people | Fleet-wide logging/metrics/runbooks become production-critical |
| Storage/Network | 2 people | VRF/firewall/LB/VPN and storage lifecycle move past placeholders |
| QA / Release Quality | 2 people | v3 replacement, node-agent/provisioning, billing, and authz changes run in parallel |

Coding agents can increase implementation throughput, but any reduction in headcount should come from generic implementers before domain owners. The platform still needs human SMEs for provisioning, storage, network, security, billing, and production quality.

## Near-Term Refactor Plan

1. Keep the v3 product model as the target IA.
2. Keep current v1 as a frozen demo/internal-user continuity surface, changed only when a real bug requires a fix. Do not treat it as a public backward-compatibility contract; remove it once the v3/domain-owned replacement routes have fully cut over.
3. Continue moving route groups out of `cmd/api/routes.go` by domain.
4. Keep temporary v3 read-model routes isolated until the production migration is complete.
5. Migrate OpenAPI domains from canonical mode to fragment mode when the owning team is ready.
6. Convert v3 mock pages into production pages using shared primitives and read-model contracts.
7. Add missing read-model APIs before asking ops to use SQL for recurring workflows.
8. Keep B/UI and A/backend queue tasks aligned to domain ownership rather than to arbitrary files.
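
For step 5, a bundling check along the following lines could guard the canonical contract as domains move to fragment mode. This is a simplified sketch, not the actual bundler: `Fragment` models only a domain name and its paths, and `BundlePaths` and the example paths are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Fragment is a simplified stand-in for one domain's OpenAPI fragment.
type Fragment struct {
	Domain string
	Paths  []string
}

// BundlePaths merges fragments into one path-to-owner map and rejects
// any path claimed by two different domains.
func BundlePaths(fragments []Fragment) (map[string]string, error) {
	owners := make(map[string]string)
	for _, f := range fragments {
		for _, p := range f.Paths {
			if prev, ok := owners[p]; ok && prev != f.Domain {
				return nil, fmt.Errorf("path %q claimed by both %s and %s", p, prev, f.Domain)
			}
			owners[p] = f.Domain
		}
	}
	return owners, nil
}

func main() {
	owners, err := BundlePaths([]Fragment{
		{Domain: "provisioning", Paths: []string{"/allocations", "/nodes"}},
		{Domain: "billing", Paths: []string{"/invoices"}},
	})
	if err != nil {
		fmt.Println("bundle failed:", err)
		return
	}
	// Sort paths so the listing is stable between runs.
	paths := make([]string, 0, len(owners))
	for p := range owners {
		paths = append(paths, p)
	}
	sort.Strings(paths)
	for _, p := range paths {
		fmt.Println(p, "->", owners[p])
	}
}
```

Running a check like this in CI would keep the bundled artifact canonical even while individual domains author their fragments independently.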

## Decision

The codebase should evolve toward team-owned domains while preserving one integrated product.

The v3 UI shell, domain OpenAPI authoring model, and route split are the first implementation steps toward that operating model.