# Product Surface IA and Role Model v1

Status:
- Draft design input for the next full product/admin redesign pass.
- Complements, not replaces:
  - `doc/product/Product_UI_System_Redesign_v1.md`
  - `doc/product/Unified_Product_UX_Model_v1.md`
  - `doc/product/Admin_Ops_and_User_IA_v1.md`
  - `doc/product/IAM_UX_Information_Architecture_v1.md`
  - `doc/product/UX_Intent_Flow_Audit.md`

Purpose:
- Define the full product information architecture before page-by-page admin and
  ops redesign work continues.
- Make role, intent, and scope first-class so the redesign does not collapse
  back into route dumps or entity dumps.
- Provide one system-level artifact for review and feedback.

## 0. Canonical Decisions

This is the canonical IA document for the current redesign pass.

When this document disagrees with:
- `Product_UI_System_Redesign_v1.md`
- `Unified_Product_UX_Model_v1.md`
- `Admin_Ops_and_User_IA_v1.md`
- `IAM_UX_Information_Architecture_v1.md`

this document wins for:
- shell/navigation structure;
- role and mode boundaries;
- admin vs ops vs telemetry boundaries;
- primary workload nouns;
- landing-page ownership.

### 0.1 Global shell groups

For the authenticated product shell, the canonical top-level groups are:
- Workloads
- Compute
- Apps
- Storage
- Access
- Account
- Platform

Rules:
- `Platform` is the top-level home for platform-admin, infra, SRE, telemetry,
  audit, and other privileged operator flows.
- `Access` is the durable group for project/tenant/platform IAM surfaces.
- `Storage` is the durable workbench for buckets, mounts, datasets,
  checkpoints, artifacts, retention, and workload data attachments.
- the public/developer portal is a separate surface family, not an eighth
  internal left-nav group.

### 0.2 Admin vs ops vs telemetry

Decision:
- `Platform` is the authenticated top-level shell group.
- Inside `Platform`, `Ops`, `Fleet`, `Governance`, `Finance`, and `Evidence`
  are peer local-navigation families.

Implication:
- ops is not a child of admin as a top-level product concept;
- admin and ops are peer operating families inside the broader `Platform`
  surface;
- telemetry is a peer investigation/evidence capability, not the same thing as
  admin or ops.

### 0.3 Workload noun hierarchy

Decision:
- `workload` is the primary product noun for active/runnable things users and
  operators care about.
- `allocation` remains the compute substrate and billing/runtime record.
- `app instance` is treated as a workload subtype rather than a permanently
  separate top-level mental model.

Implication:
- `/workloads` is the primary user/operator runtime workbench.
- allocation surfaces remain valid, but more infrastructure-oriented.
- app-instance routes may remain temporarily, but should converge toward the
  workload model instead of competing with it.

### 0.4 Mode entry

Decision:
- mode is explicit in the shell.
- users with more than one role enter mode through a top-bar mode switcher.

Canonical modes:
- User
- Tenant Admin
- Project Admin
- Platform Admin
- Ops

Rules:
- mode selection changes default landing page and primary navigation emphasis;
- it does not bypass backend authorization;
- route access still remains permission-gated.

### 0.5 Wizard and task decisions

Decision:
- major launch flows use a full-page wizard;
- simple dependency-create flows may use modal/drawer shells, but must reuse
  the same wizard primitives and language;
- row overflow uses visible `Open` plus labeled `More` on desktop;
- long-running operations get a routed task detail at `/tasks/:task_id`, with
  embedded summaries allowed on detail pages.

These are no longer open questions for the mock pass.

### 0.6 Top-bar context model

Decision:
- the top bar always contains:
  - mode switcher
  - workspace/project selector
  - region selector
  - notifications entry
  - session menu
- balance is always visible in user, tenant-admin, and project-admin modes
  where spend is directly relevant;
- in platform-admin and ops modes, balance may collapse into a finance/status
  entry instead of remaining a primary chip.

Rules:
- project/workspace context remains globally visible, but platform-wide pages do
  not silently filter themselves to one project unless the page is explicitly
  scoped that way;
- platform surfaces should show when they are in fleet-wide vs project/tenant
  context.
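A minimal sketch of the two top-bar rules above, assuming a hypothetical `Mode` union and a hypothetical `PageScope` type (neither is an existing API). Balance presentation is a per-mode lookup, and platform pages declare their own scope and render it as a badge instead of silently inheriting the project selector. For platform-admin and ops modes the document only says the balance *may* collapse; this sketch picks the collapsed form.

```typescript
// Hypothetical mode names, mirroring the canonical modes in §0.4.
type Mode = "User" | "Tenant Admin" | "Project Admin" | "Platform Admin" | "Ops";

// How the balance is presented in the top bar.
type BalanceDisplay = "primary-chip" | "finance-entry";

// Spend-facing modes keep the chip; platform modes collapse it into a finance entry.
const BALANCE_DISPLAY: Record<Mode, BalanceDisplay> = {
  "User": "primary-chip",
  "Tenant Admin": "primary-chip",
  "Project Admin": "primary-chip",
  "Platform Admin": "finance-entry",
  "Ops": "finance-entry",
};

function balanceDisplay(mode: Mode): BalanceDisplay {
  return BALANCE_DISPLAY[mode];
}

// The scope a page actually queries, declared by the page itself rather than
// inherited from the workspace/project selector.
type PageScope =
  | { kind: "fleet" }
  | { kind: "tenant"; tenantId: string }
  | { kind: "project"; projectId: string };

// Label rendered next to the page title so fleet-wide vs scoped context is visible.
function scopeBadge(scope: PageScope): string {
  if (scope.kind === "fleet") return "Fleet-wide";
  if (scope.kind === "tenant") return `Tenant: ${scope.tenantId}`;
  return `Project: ${scope.projectId}`;
}
```

Making scope a required page declaration means a platform page cannot become project-filtered by accident; it has to say so.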

### 0.7 Global search

Decision:
- the shell should include a global search / command entry point as part of the
  redesign target.

Initial scope may be limited, but the mock should include it because the
product already crosses allocations, workloads, users, nodes, and docs.

### 0.8 Notifications

Decision:
- notifications remain a top-bar shell capability with an inbox/panel model.
- the redesign does not require a full dedicated notifications workspace
  immediately, but the shell mock should assume a stable bell/inbox entry.

### 0.9 Mode-switcher behavior

Decision:
- persistent shell with emphasis: the same seven top-level groups remain visible
  across all modes;
- mode changes:
  - the default landing route;
  - which left-rail group is highlighted as the home group;
  - which actions are visible vs gated.
- mode does not swap left-rail content. Groups the user cannot use remain
  visible but disabled, with reason text on hover.

Rationale:
- a stable shell is easier to learn than a shape-shifting one;
- users with multiple roles can switch modes without losing visual orientation;
- access still flows from backend authorization, not shell visibility.
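The "persistent shell with emphasis" rule can be sketched as a pure function from mode plus permissions to rail items. This is a sketch under stated assumptions: `buildLeftRail`, `RailItem`, and the `HOME_GROUP` mapping are hypothetical (the document fixes landing routes per mode, not home groups), and `usableGroups` stands in for whatever the backend reports as authorized.

```typescript
type Mode = "User" | "Tenant Admin" | "Project Admin" | "Platform Admin" | "Ops";
type ShellGroup =
  | "Workloads" | "Compute" | "Apps" | "Storage" | "Access" | "Account" | "Platform";

// All seven groups, in shell order; every mode renders every group.
const SHELL_GROUPS: readonly ShellGroup[] = [
  "Workloads", "Compute", "Apps", "Storage", "Access", "Account", "Platform",
];

// Assumed home group per mode (hypothetical mapping, not fixed by this document).
const HOME_GROUP: Record<Mode, ShellGroup> = {
  "User": "Workloads",
  "Tenant Admin": "Access",
  "Project Admin": "Access",
  "Platform Admin": "Platform",
  "Ops": "Platform",
};

interface RailItem {
  group: ShellGroup;
  isHome: boolean;
  disabled: boolean;
  disabledReason?: string; // hover text when disabled
}

// `usableGroups` comes from backend-authorized permissions, not from the selected
// mode; mode only moves the home highlight and never adds or removes groups.
function buildLeftRail(mode: Mode, usableGroups: ReadonlySet<ShellGroup>): RailItem[] {
  return SHELL_GROUPS.map((group) => {
    const usable = usableGroups.has(group);
    return {
      group,
      isHome: HOME_GROUP[mode] === group,
      disabled: !usable,
      disabledReason: usable ? undefined : `Your roles do not include access to ${group}.`,
    };
  });
}
```

Because the rail's shape never depends on mode, switching modes changes only one highlight and the default route, which is the orientation property the rationale asks for.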

### 0.10 Mode-switcher visual and persistence

Visual:
- label + chevron dropdown in the top bar;
- compact label format: `Mode: <Name>`;
- single-role users see the dropdown disabled or hidden, with their mode shown
  as a static label.

Persistence:
- mode is sticky across sessions per user;
- when no explicit prior selection is stored (for example, on first login), the
  default is the highest-relevance non-platform mode the user holds, in the
  order User → Tenant Admin → Project Admin.
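The persistence rule reduces to a small resolver. In this sketch `resolveInitialMode` is a hypothetical name, `stored` stands for a per-user preference whose storage is not specified here, and the fallback for users who hold only platform modes is an assumption the document does not cover.

```typescript
type Mode = "User" | "Tenant Admin" | "Project Admin" | "Platform Admin" | "Ops";

// Fallback order when no explicit selection is stored, per the persistence rule.
const DEFAULT_ORDER: readonly Mode[] = ["User", "Tenant Admin", "Project Admin"];

// `stored` is the user's last explicit selection (hypothetical preference);
// `available` is the list of modes the user's roles currently grant.
function resolveInitialMode(stored: Mode | undefined, available: readonly Mode[]): Mode | undefined {
  // A sticky selection wins only while the user still holds that mode.
  if (stored !== undefined && available.includes(stored)) return stored;
  // Otherwise use the first non-platform mode the user holds, in priority order.
  const fallback = DEFAULT_ORDER.find((m) => available.includes(m));
  // Assumption: users with only platform modes land in the first mode they hold.
  return fallback ?? available[0];
}
```

Re-checking `stored` against `available` keeps a stale preference (for example, after a role is revoked) from selecting a mode the user can no longer enter.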

### 0.11 Platform local navigation shape

Decision:
- `/platform` lands on a card-grid family overview;
- each family page (`/platform/ops`, `/platform/lifecycle`, `/platform/config`,
  `/platform/evidence`, `/platform/finance`, `/platform/iam`) uses local
  subnav or tabs as needed;
- a permanent secondary left rail inside the shell is **not** added in this
  redesign pass.

### 0.12 Account vs Access split

Access (project / tenant / platform IAM):
- projects;
- team / memberships;
- service accounts;
- policies / entitlements / platform roles.

Account (personal):
- profile;
- billing;
- SSH keys;
- personal sessions / settings.

Rationale:
- separates "who can do what across scopes" (Access) from "things about me"
  (Account).

### 0.13 Primary questions for tenant-admin and project-admin landings

Both landings should answer:
1. Who has access?
2. What needs governance attention?
3. What is the current spend / usage posture?
4. What recent changes happened?
5. What needs action now?

These five questions drive the section structure of `/tenant/overview` and
`/project/overview`. Project admin landing scopes everything to the active
project; tenant admin landing aggregates across projects in the tenant.

### 0.14 App-platform visual elevation

Decision:
- `Apps` remains a peer of `Compute` in the shell — no top-level IA split in
  this pass.
- visual elevation in the shell and landings is allowed and encouraged:
  distinct icon weight, hero treatment in user-mode landings, prominent
  default-action surface in the app catalog.

Rationale:
- the app platform is a strategic differentiator (per §17.1) but is not yet a
  separate product surface;
- elevating it visually preserves IA simplicity while making the
  differentiator legible.

### 0.15 Storage as a peer shell group

Decision:
- `Storage` is a top-level authenticated shell group.
- Storage owns bucket and data-substrate management after creation.
- Launch flows may still create buckets inline when storage is a dependency.

Rationale:
- buckets can outlive workloads and remain valuable after the producing
  workload is released;
- workload-to-bucket relationships are many-to-many over time, so storage
  should not be modeled as a child of one workload;
- storage has its own operational vocabulary: quota, unattached buckets,
  failed mounts, retention, encryption, lifecycle, access drift, and scheduled
  deletes;
- storage appears in multiple launch paths, including notebooks, training
  datasets, checkpoints, artifacts, and persistence for app services.

Rules:
- workload details should link attached buckets to their storage detail pages;
- bucket details should link active and historical workload attachments back to
  workload details;
- inline bucket creation in launch flows must produce the same bucket object
  that later appears in the Storage workbench;
- do not bury persistent storage management under Apps, Workloads, Access, or
  Account.
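The many-to-many rule implies attachments are their own records rather than a field on either side. A minimal sketch, assuming a hypothetical `Attachment` join record with an optional `detachedAt` timestamp; the function names are illustrative, not an existing API.

```typescript
// Hypothetical join record: one row per bucket-to-workload attachment over time.
interface Attachment {
  bucketId: string;
  workloadId: string;
  attachedAt: Date;
  detachedAt?: Date; // unset while the attachment is active
}

interface WorkloadLinks {
  active: string[];
  historical: string[];
}

// Workload links shown on a bucket detail page, split into active and historical.
function workloadsForBucket(attachments: readonly Attachment[], bucketId: string): WorkloadLinks {
  const links: WorkloadLinks = { active: [], historical: [] };
  for (const a of attachments) {
    if (a.bucketId !== bucketId) continue;
    (a.detachedAt === undefined ? links.active : links.historical).push(a.workloadId);
  }
  return links;
}

// Bucket links shown on a workload detail page (currently attached buckets only).
function bucketsForWorkload(attachments: readonly Attachment[], workloadId: string): string[] {
  return attachments
    .filter((a) => a.workloadId === workloadId && a.detachedAt === undefined)
    .map((a) => a.bucketId);
}
```

Keeping detached rows instead of deleting them is what lets a bucket that outlives its producing workload still show where its data came from.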

### 0.16 Backend read-model readiness

Decision:
- v3 implementation should not wire major pages directly to ad hoc domain
  queries.
- each major shell group needs an explicit read-model/API map before migration
  from mock data to live data.
- Redis-backed read-model caching is part of the v3 backend foundation, but
  cache entries are never sources of truth.

Required implementation input:
- `doc/architecture/UI_Read_Model_Cache_Architecture_v1.md`
- endpoint-by-endpoint read-model map for Workloads, Apps, Storage, Access,
  Account, and Platform before broad page migration.

Rules:
- pages may launch behind feature flags with partial data only if missing
  read-model fields are visibly non-blocking;
- admin/operator surfaces must not require direct SQL table edits or direct DB
  inspection for normal operation;
- cache keys must include tenant/project/user scope where authorization depends
  on that scope.
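The scoped-key rule can be illustrated with a key builder that refuses to produce an unscoped key. This is only a sketch: the `rm:` prefix, the `|` separator, and the `readModelCacheKey` signature are hypothetical, and the authoritative key design belongs to `doc/architecture/UI_Read_Model_Cache_Architecture_v1.md`. Whatever is stored under such a key remains a cache entry, not a source of truth.

```typescript
// Hypothetical authorization scope attached to a read-model request.
interface CacheScope {
  tenantId?: string;
  projectId?: string;
  userId?: string;
}

// `scopedBy` lists the scope fields the read model's authorization depends on.
function readModelCacheKey(
  readModel: string,
  scopedBy: readonly (keyof CacheScope)[],
  scope: CacheScope,
  params: Record<string, string> = {},
): string {
  const parts = [`rm:${readModel}`];
  for (const field of scopedBy) {
    const value = scope[field];
    // Refuse to build a key that could be shared across authorization scopes.
    if (value === undefined) throw new Error(`missing ${field} for ${readModel}`);
    parts.push(`${field}=${encodeURIComponent(value)}`);
  }
  // Sort query params so equivalent requests map to the same key.
  const entries = Object.entries(params).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [name, value] of entries) {
    parts.push(`${name}=${encodeURIComponent(value)}`);
  }
  return parts.join("|");
}
```

Failing loudly on a missing scope field turns a potential cross-tenant cache leak into an immediate error during development.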

## 1. Why This Document Exists

Current redesign documents already establish:
- shared page families;
- workflow-oriented admin direction;
- IAM separation principles;
- user and operator intent audits.

What is still missing is one explicit map for the whole product:
- public vs authenticated surfaces;
- user vs tenant/project admin vs platform admin vs ops;
- admin vs ops vs telemetry boundaries;
- infra-ops vs SRE-ops intent boundaries;
- how dense admin pages should decompose into summary, operations, diagnostics,
  history, and advanced detail.

This document fills that gap.

## 2. Core Design Rule

Do not design the product around backend entities or current route prefixes.

Design around:
- **role**
- **intent**
- **scope**
- **control plane**

The same entity may appear in multiple surfaces, but with different framing,
actions, and urgency.

Examples:
- allocations appear in user, project-admin, platform-admin, and ops flows;
- nodes appear in platform-admin, infra-ops, and SRE flows;
- billing appears in user, tenant-admin, and platform-admin flows;
- IAM spans user, project, tenant, platform, and future app/integration layers.

## 3. Product Surface Families

The product should be treated as several related surfaces, not one giant
navigation tree.

### 3.1 Public / Developer Portal

Purpose:
- onboarding, docs, SDKs, API references, examples, integration guides

Audience:
- prospective users
- developers
- external integration owners

Examples:
- docs
- Swagger / Redoc
- downloads
- integration guides

Rules:
- do not mix this surface into internal admin/ops left navigation;
- may share brand/system components, but not the same workspace IA.

### 3.2 User Workspace

Purpose:
- allocate, connect, operate, and understand resources owned by the current
  user/project context

Audience:
- end users
- project members

Examples:
- marketplace
- allocations
- workloads
- storage
- billing
- SSH keys

Rules:
- speak in allocation/project terms, not host/guest/internal implementation
  terms;
- show only the operational truth the user needs.

### 3.3 Tenant / Project Administration

Purpose:
- administer access, policy, project structure, and usage at tenant/project
  scope

Audience:
- tenant admins
- project admins

Examples:
- team and memberships
- projects
- service accounts
- app entitlements
- quotas and usage posture at scoped levels

Rules:
- not the same as platform admin;
- scope and blast radius must remain obvious in both copy and action design.

### 3.4 Platform Operations / Administration

Purpose:
- run the platform safely across fleet, workflows, policy, incidents, and
  privileged actions

Audience:
- platform admins
- infra operators
- SRE/operators

Examples:
- nodes
- allocations
- onboarding/decommission flows
- ops overview
- audit
- payment operations
- platform IAM

Rules:
- this surface is workflow- and incident-oriented, not an entity inventory dump;
- deep diagnostics are allowed, but only after summary and safe actions.

## 4. Role Model

These roles may touch overlapping entities, but they do not have the same
workflow.

### 4.1 User

Intent:
- get capacity
- connect to it
- understand runtime health and spend

Scope:
- self
- current project resources visible to the user

### 4.2 Tenant Admin

Intent:
- govern tenant-scoped users, projects, policy, usage posture

Scope:
- tenant

### 4.3 Project Admin

Intent:
- manage one project’s members, identities, entitlements, runtime posture

Scope:
- project

### 4.4 Platform Admin

Intent:
- privileged platform control, policy, enrollment, audit, recovery

Scope:
- platform / fleet

### 4.5 Infra Operator

Intent:
- node readiness
- capacity health
- onboarding / decommission
- networking / fabric / image correctness

Scope:
- nodes, sites, capacity pools, fleet readiness

### 4.6 SRE / Ops Operator

Intent:
- incident detection
- failure correlation
- degraded workflow recovery
- observability-guided remediation

Scope:
- services, workflows, fleet signals, critical runtime health

### 4.7 App Developer / Integration Owner

Intent:
- app lifecycle
- artifacts
- integration contracts
- external system behavior and troubleshooting

Scope:
- project, app, or integration boundary, depending on product maturity

Note:
- this may partially overlap with SRE today, but should not be forced into the
  same IA bucket without an explicit decision.

### 4.8 Persona-to-Landing-Page Table

This table is the minimum shell contract for the mock pass.

| Persona / mode | Default landing route | Top 3 surfaces | Mode entry |
|---|---|---|---|
| User | `/workloads` when workloads exist, otherwise `/compute` | Workloads, Compute, Storage | top-bar mode switcher or default session mode |
| Tenant Admin | `/tenant/overview` | Access, Billing/usage posture, Storage posture | top-bar mode switcher |
| Project Admin | `/project/overview` | Access, Workloads, Storage | top-bar mode switcher |
| Platform Admin | `/platform/overview` | Fleet, Governance, Finance | top-bar mode switcher |
| Ops | `/platform/ops` | Ops, Telemetry, Fleet | top-bar mode switcher |

Notes:
- tenant-admin and project-admin home surfaces now exist in `` as
  first-cutover production surfaces backed by existing read models;
- richer dedicated tenant/project read models are still expected before these
  landings become final governance dashboards.
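The landing column of the table translates directly into a resolver. In this sketch `defaultLandingRoute` is a hypothetical name and `hasWorkloads` is assumed to come from a workloads read model; only the User row depends on runtime state.

```typescript
type Mode = "User" | "Tenant Admin" | "Project Admin" | "Platform Admin" | "Ops";

// Fixed landings from the table above.
const FIXED_LANDINGS: Record<Exclude<Mode, "User">, string> = {
  "Tenant Admin": "/tenant/overview",
  "Project Admin": "/project/overview",
  "Platform Admin": "/platform/overview",
  "Ops": "/platform/ops",
};

// The returned route is only a default; the router still enforces permissions.
function defaultLandingRoute(mode: Mode, hasWorkloads: boolean): string {
  if (mode === "User") return hasWorkloads ? "/workloads" : "/compute";
  return FIXED_LANDINGS[mode];
}
```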

## 5. Intent Model

The same role may operate in different intents. Top-level IA should support
these rather than hiding them inside entity tables.

Primary intents:
- provision
- manage
- monitor
- recover
- govern
- integrate
- investigate

Design implication:
- pages must be built around the question the actor is trying to answer, not
  just around the record being displayed.

## 6. Scope Model

Scope is the simplest way to keep similar verbs from collapsing into one messy
UI.

Scopes:
- self
- project
- tenant
- node
- fleet
- platform
- public

Examples:
- users provision at project/self scope;
- project admins manage access at project scope;
- tenant admins govern membership and spend at tenant scope;
- infra operators manage node readiness at node/fleet scope;
- platform admins govern policy at platform scope.

## 7. Control Planes

The redesign should explicitly separate these control planes.

### 7.1 Workloads

Focus:
- allocations
- app runtimes
- workload lifecycle
- user-facing and operator-facing runtime views

### 7.1a Storage

Focus:
- buckets
- workload mounts and attachments
- datasets
- checkpoints
- artifacts
- object lifecycle and retention
- encryption and storage access drift

### 7.2 Fleet / Infrastructure

Focus:
- nodes
- slot readiness
- networking
- fabric
- images
- enrollment and decommission

### 7.3 Operations

Focus:
- incidents
- stuck workflows
- degraded services
- recovery actions
- runbooks

### 7.4 Governance

Focus:
- users
- roles
- projects
- tenant/platform policies
- audit

### 7.5 Financial

Focus:
- spend
- accounting
- budgets
- refunds
- payment operations
- usage attribution

### 7.6 Developer / Integration

Focus:
- docs
- SDKs
- APIs
- artifact publishing
- integration contracts

## 8. Admin vs Ops vs Telemetry

This boundary needs to be explicit.

### 8.1 Admin

Admin is for:
- governance
- privileged control
- durable platform configuration
- enrollment/lifecycle authority
- audit and finance controls

Admin should answer:
- what is the correct platform state?
- who can do what?
- what policy or lifecycle action is safe?

### 8.2 Ops

Ops is for:
- live intervention
- incident handling
- degraded workflows
- recovery actions

Ops should answer:
- what needs action now?
- what is broken or risky?
- what is the next safe recovery step?

### 8.3 Fleet Telemetry

Fleet telemetry is for:
- aggregate observability
- trends
- hotspots
- saturation
- cross-node evidence

Fleet telemetry should answer:
- where is pressure or degradation emerging across the fleet?
- what signals correlate with current incidents?

Rules:
- telemetry is not where primary lifecycle mutations should live;
- admin is not where cross-fleet observability should be dumped by default;
- ops may link into both admin and telemetry, but should remain action-oriented.

## 9. Infra-Ops vs SRE-Ops

“Ops” is not one homogeneous audience.

### 9.1 Infra / Capacity Intent

Questions:
- are nodes correctly provisioned?
- are slice slots schedulable?
- is networking/fabric/storage ready?
- is image/bootstrap state consistent?

Primary surfaces:
- fleet / nodes
- onboarding
- slot readiness
- image and provisioning views

### 9.2 SRE / Live Operations Intent

Questions:
- what is degraded right now?
- which service or workflow is failing?
- how do I correlate and recover?

Primary surfaces:
- ops overview
- attention views
- workflow health
- telemetry
- runbooks

Design implication:
- do not assume one single ops landing page can serve both intents equally well.

## 10. IAM Growth Model

IAM is already multi-layered and should be designed as such.

Layers:
- platform IAM
- tenant IAM
- project IAM
- user identity and credentials
- service accounts
- app entitlements
- future external integration identities

Design implications:
- IAM should not be treated as one simple admin table;
- platform and scoped IAM must remain clearly separated;
- future growth should not require a top-level IA rewrite.

## 11. Billing Growth Model

Billing is bigger than current allocation accounting.

Current product reality:
- allocation accounting
- balances
- payment sessions

Expected growth:
- tenant/project spend views
- budgets and quota posture
- credits/refunds
- invoices/financial evidence
- usage attribution across products and runtimes
- policy-driven financial controls

Design implications:
- billing should be treated as its own durable control plane;
- do not bury it as a small detail inside one or two admin pages.

## 12. Dense Page Decomposition Rules

Some current admin pages contain enough material for several pages.

That is a signal to split by function, not to add more sections to the same page.

When a page mixes:
- summary
- operations
- diagnostics
- history
- raw metadata

it should be decomposed into one or more of:

1. **Overview**
- summary
- key state
- top risks
- primary actions

2. **Operations**
- live controls
- intervention actions
- workbench/table

3. **Metrics / Diagnostics**
- charts
- telemetry
- linked debug tools such as Netdata

4. **History / Audit**
- lifecycle timeline
- audit trail
- financial or provisioning evidence

5. **Advanced / Raw**
- raw IDs
- low-frequency technical detail
- copy/debug material

Rules:
- list pages should not become detail pages;
- detail pages should not default to raw metadata dumps;
- advanced/debug information should be available, but not dominant.
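One way to enforce "available, but not dominant" is to derive a detail page's tabs from the five targets and never open on the raw tab. The `detailTabs` contract below is hypothetical; it assumes each page declares which sections it has content for.

```typescript
// The five decomposition targets listed above, in display order.
type PageSection =
  | "Overview"
  | "Operations"
  | "Metrics / Diagnostics"
  | "History / Audit"
  | "Advanced / Raw";

const SECTION_ORDER: readonly PageSection[] = [
  "Overview", "Operations", "Metrics / Diagnostics", "History / Audit", "Advanced / Raw",
];

interface DetailTabs {
  tabs: PageSection[];
  defaultTab: PageSection | undefined;
}

// A page declares which sections it has content for (hypothetical contract).
function detailTabs(available: ReadonlySet<PageSection>): DetailTabs {
  const tabs = SECTION_ORDER.filter((s) => available.has(s));
  // Raw material stays reachable as a tab but is never the default view
  // unless it is the only section the page has.
  const defaultTab = tabs.find((s) => s !== "Advanced / Raw") ?? tabs[0];
  return { tabs, defaultTab };
}
```

Fixing the order centrally also keeps different detail pages from inventing their own section sequences.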

## 13. Navigation Decision Model

The redesign should be validated at the full-shell level, not page by page.

Before broad UI implementation, produce a mock showing:
- full left navigation;
- top-level groups;
- local navigation inside each major surface;
- handoff points between admin, ops, telemetry, and developer surfaces.

Questions the shell must answer:
- where does a tenant admin start?
- where does an infra operator start?
- where does an SRE start?
- where does a developer go for docs and integration material?
- which duplicated current pages collapse into one owner?

## 14. Proposed Review Checklist

Use this checklist when reviewing the next IA mock or redesign deck.

1. Are public/developer, user workspace, scoped administration, and platform
   operations clearly separated?
2. Are admin, ops, and telemetry distinct in purpose?
3. Are infra and SRE intents distinguishable?
4. Are tenant admin and project admin first-class in the model?
5. Does IAM account for platform/tenant/project/service-account growth?
6. Does billing account for future financial scope beyond allocation accounting?
7. Are dense pages explicitly decomposed instead of “cleaned up in place”?
8. Can the left nav be explained in terms of user goals, not route dumps?
9. Do shared entities keep different framing where scope and risk differ?
10. Does the design leave room for app developer / integration-owner surfaces?

## 15. Immediate Next Step

Use this document with the v3 mock work to produce:
- one full-shell IA mock;
- one boundary map for admin vs ops vs telemetry;
- one decomposition plan for the densest existing admin pages.

That should happen before broad admin page implementation resumes.

## 16. Additional Current Inputs To Carry Into Redesign

These are active product and operator pain points that should inform the next
IA/mock pass.

### 16.1 Brokered identity continuity

Observed problem:
- repeated Hugging Face logins can create what appears to be a new user/login
  footprint each time when testing with fresh or incognito sessions.

Design implication:
- identity UX must distinguish:
  - first-time broker signup,
  - returning login,
  - account linking,
  - duplicate-identity conflict resolution.

This is not only an auth/backend problem. It also affects:
- how users understand identity continuity,
- whether account linking needs UI,
- how admin/IAM surfaces explain brokered identities.
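The four identity cases can be told apart from a few facts available after the broker callback. The sketch below is illustrative only and does not describe the current auth backend: `BrokerLoginFacts`, `classifyBrokerLogin`, and the precedence between cases are all hypothetical. It shows why a fresh incognito login whose external subject is already linked should resolve to a returning login rather than a new user.

```typescript
// Hypothetical facts gathered after the identity broker returns.
interface BrokerLoginFacts {
  linkedUserId?: string;     // user already linked to this provider subject
  sessionUserId?: string;    // user already signed in, if any
  emailMatchUserId?: string; // existing user with the same verified email
}

type BrokerLoginOutcome =
  | { kind: "returning-login"; userId: string }
  | { kind: "account-linking"; userId: string }
  | { kind: "duplicate-conflict"; existingUserId: string }
  | { kind: "first-time-signup" };

function classifyBrokerLogin(f: BrokerLoginFacts): BrokerLoginOutcome {
  if (f.linkedUserId !== undefined) {
    // The identity is already linked to a different signed-in user.
    if (f.sessionUserId !== undefined && f.sessionUserId !== f.linkedUserId) {
      return { kind: "duplicate-conflict", existingUserId: f.linkedUserId };
    }
    // Same external subject seen before: continue as that user, never create a new one.
    return { kind: "returning-login", userId: f.linkedUserId };
  }
  // Signed in and the identity is new: offer to link it to the current account.
  if (f.sessionUserId !== undefined) return { kind: "account-linking", userId: f.sessionUserId };
  // Fresh session whose verified email matches an existing user: resolve explicitly.
  if (f.emailMatchUserId !== undefined) {
    return { kind: "duplicate-conflict", existingUserId: f.emailMatchUserId };
  }
  return { kind: "first-time-signup" };
}
```

Each outcome maps to a distinct UI state, which is the distinction the design implication asks the identity UX to make visible.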

### 16.2 Audit log surface parity

Observed problem:
- user audit logs and admin audit logs do not currently present with the same
  product/page grammar.

Design implication:
- audit/evidence surfaces should share the same v3 family model where possible;
- scope can differ, but page language and evidence framing should not drift
  without reason.

### 16.3 Inline dependency creation inside launch flows

Observed problem:
- launch flows can discover missing prerequisites such as SSH keys.
- upcoming launch inputs will likely include storage, network, firewall, and
  similar dependencies.

Design implication:
- wizards should not eject users into unrelated pages to satisfy missing
  prerequisites.
- dependency creation/selection should be inline or in-context wherever safe.

This applies to:
- compute launch wizard;
- app launch wizard;
- future storage/network/firewall steps.
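A wizard step can turn unmet prerequisites into in-place resolutions instead of navigation. This is a sketch under assumptions: the dependency kinds beyond SSH keys, the `inlineCreateSafe` flag, and the two resolution modes are hypothetical; the drawer option follows the modal/drawer allowance in §0.5.

```typescript
// Hypothetical dependency kinds; SSH keys are today's case, the rest are expected.
type DependencyKind = "ssh-key" | "bucket" | "network" | "firewall";

interface Requirement {
  kind: DependencyKind;
  satisfied: boolean;        // an existing object was selected or found
  inlineCreateSafe: boolean; // can be created inside the wizard step itself
}

interface Resolution {
  kind: DependencyKind;
  // "inline-create": create within the step; "in-context-drawer": open a drawer
  // that returns to the same step. Neither navigates away from the wizard.
  via: "inline-create" | "in-context-drawer";
}

function planMissingDependencies(reqs: readonly Requirement[]): Resolution[] {
  return reqs
    .filter((r) => !r.satisfied)
    .map((r): Resolution => ({
      kind: r.kind,
      via: r.inlineCreateSafe ? "inline-create" : "in-context-drawer",
    }));
}
```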

### 16.4 App launch wizard parity

Observed problem:
- app launch flows need the same v3 system treatment as compute launch.

Design implication:
- compute and app launch should be treated as one shared wizard system with
  mode-specific branches, not separate design languages.
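"One wizard system with mode-specific branches" can be read as shared head and tail steps around a per-mode middle. All step ids and titles below are hypothetical placeholders; the point is that compute and app launches differ only in the branch, so shared steps keep one set of primitives and copy.

```typescript
type LaunchMode = "compute" | "app";

interface WizardStep {
  id: string;
  title: string;
}

// Hypothetical shared steps, reused verbatim by both launch modes.
const SHARED_HEAD: readonly WizardStep[] = [{ id: "region", title: "Region" }];
const SHARED_TAIL: readonly WizardStep[] = [
  { id: "dependencies", title: "Access and storage" },
  { id: "review", title: "Review and launch" },
];

// Hypothetical mode-specific branch steps.
const BRANCH: Record<LaunchMode, readonly WizardStep[]> = {
  compute: [{ id: "sku", title: "Instance type" }, { id: "image", title: "Image" }],
  app: [{ id: "app", title: "App" }, { id: "app-config", title: "Configuration" }],
};

function launchWizardSteps(mode: LaunchMode): WizardStep[] {
  return [...SHARED_HEAD, ...BRANCH[mode], ...SHARED_TAIL];
}
```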

### 16.5 Tenant-admin quota model

Open question:
- can tenant admins set quotas for projects and/or users, and how should that
  authority be scoped?

Design implication:
- quota UX cannot remain only a platform-admin configuration surface if scoped
  delegation is intended.
- tenant/project administration boundaries must explicitly include quota
  governance decisions.

### 16.6 Billing as a separate redesign epic

Billing requires a separate effort and should not be treated as a small sidebar
to current allocation accounting.

Known future billing scope includes:
- pricing modes such as on-demand, spot, and reserved duration;
- idle policy selection at launch;
- invoice generation and payment timing;
- support for usage models beyond simple allocation duration;
- tenant budgets and alerts;
- usage attribution by user, SKU, and app;
- data ingress/egress accounting;
- standard delinquency handling.

Design implication:
- billing remains a first-class control plane and should be reviewed as its own
  epic with its own UX and domain model pass.

## 17. External Review Inputs Relevant To UI

This section captures the UI/product-relevant points from
`doc/governance/External_Architectural_Review_2026-04.md` so the redesign can
use them directly without mixing in the broader backend action list.

### 17.1 Reinforced design directions

The external review reinforces these redesign choices:

- admin should move away from entity and action dumps toward workflow-first
  operator surfaces;
- the app platform is one of the strongest and most original parts of the
  product, so the product IA should expose it more clearly instead of treating
  it as a sidecar to infrastructure;
- developer/docs/download/API-reference surfaces deserve explicit treatment as a
  real product surface, not an afterthought inside internal navigation;
- audit, evidence, and investigation flows are important enough to be their own
  deliberate page family;
- billing is too large to remain a small subpage under current product
  assumptions and should remain a separate redesign/epic.

### 17.2 UI-specific tensions to account for

The review also highlights tensions that the redesign should consciously handle:

- the product is infrastructure-shaped in some places and workload/app-platform
  shaped in others;
- app/runtime workflows are more valuable than the current information
  hierarchy suggests;
- public/developer-facing extension and integration stories are under-expressed
  in the current product surface;
- operational evidence and debug flows matter, but should not dominate default
  navigation for normal user workflows.

### 17.3 Explicitly deferred to post-UI backend work

The external review also names important backend and architecture issues, but
they should be handled after the UI/IA review rather than folded into the
current redesign pass. Examples:

- cross-cutting middleware enforcement gaps;
- idempotency and optimistic locking;
- DLQ recovery;
- metering producers;
- terminal compliance/security follow-up;
- large-file/backend decomposition work such as `cmd/api/routes.go`.

These should inform later implementation planning, but they are not blockers
for completing the IA/mock review package first.