WestconComstor - NVIDIA Blueprint Sizing Toolkit

Sam Adams

NVIDIA AI Blueprint — Concurrent User Calculator

Estimate concurrent users for any NVIDIA AI Blueprint. Select a blueprint to see its pipeline NIMs, then configure the GPU hardware and workload parameters.

NVIDIA Blueprint
Show WestconComstor verified only
Primary LLM Model

Select the LLM for RAG response generation. GPU VRAM requirements and throughput estimates update automatically. VRAM values are at BF16/FP16 precision; FP8-capable GPUs may run the model at ~50% of the listed VRAM. Cloud models (API) consume no local GPU.

Server GPU Configuration
Blueprint Compatible Cisco Platform

⚠ prefix = GPU cannot run the default LLM with the recommended config

Workload Parameters
300 tokens
30 s
Blueprint Components

Toggle optional components and choose GPU or CPU execution for each service. CPU execution frees GPU VRAM but may reduce throughput for that component.

Estimated Concurrent Users
Calculated using Little's Law: N = λ × W
LLM tokens / sec
Queries / sec
Avg response time
LLM instances
GPU VRAM Allocation
Enable
Storage Requirements

Estimate the total storage needed for your RAG deployment based on raw document data size.

Advanced Mode
Note:
  • Processed data reflects chunking and metadata expansion (1.2×–3×)
  • Vector DB includes embeddings and indexing overhead (1.5×–3×)
  • Cache represents high-speed storage for fast retrieval (5%–20% of Vector DB)
RAID Configuration & Disk Requirements
Server Node Configuration

Estimated CPU and RAM requirements based on GPU configuration and storage inputs. Enter raw document storage above to include vector DB memory in the estimate.

GPU Configuration
AI Model
Memory (RAM)
CPU Cores
logical cores recommended
Show Breakdown
Platform Selection

Mapping sizing criteria to Platforms.

Configure GPU and storage above to generate platform criteria.

Disclaimer: Throughput figures are approximate estimates for nemotron-3-super-120b-a12b (MoE 120B total / 12B active params) at FP8 TP2, derived from publicly known GPU benchmark data. Actual performance depends on batch size, sequence length, KV-cache settings, NVLink/PCIe topology, driver version, and real-world workload shape. Always validate with a representative load test on your target hardware.

Minimum system requirements: Ubuntu 22.04 · GPU Driver ≥ 560 · CUDA ≥ 12.9 · 200 GB free disk space.

→ NVIDIA RAG Blueprint Support Matrix  |  → Model Profiles  |  → LLM NIM Support Matrix

GPU Reference

All GPUs in the catalogue with VRAM capacity, FP8 support, and reference throughput against the default RAG Blueprint LLM (nemotron-3-super-120b-a12b at FP8 TP2). A “—” in TPS means no benchmark data is available for that GPU with this LLM configuration.

LLM Model Reference

All LLM models in the catalogue. VRAM figures are at BF16 / FP16 precision — FP8-capable models typically require ~50 % less VRAM. ★ marks the RAG Blueprint default model.

Platform Selector

Select server platforms based on your requirements. Filters Memory, CPU Generation, CPU Quantity, Drive Quantity, and Drive Size against the platform catalogue (enumerate = Yes entries only). Drive types are evaluated independently — any type satisfying your criteria qualifies the platform. Riser-slot drives add to the front-bay count.

NVIDIA AI Blueprints

All supported NVIDIA AI Blueprints with their NIM pipeline components. Required components are always active; optional components can be toggled in the RAG Calculator.

Blueprint NIM Components

All NIM pipeline components available across NVIDIA AI Blueprints, with their VRAM requirements, execution options, and role descriptions.

Industry & Vertical AI Use Cases

Select an industry to explore its AI transformation opportunities and the NVIDIA Blueprints most relevant to that sector.

NVIDIA Blueprint Toolkit — Technical Reference

Full technical walkthrough of the NVIDIA Blueprint Toolkit — covering the RAG Concurrent User Calculator, Multi-GPU parallelism strategy engine, WestconComstor Verified badge system, Platform Selector (the original deep-dive), and Kubernetes Deployment Guide. Sections 1–8 cover the Platform Calculator; Sections 9–12 document modules added or substantially updated since the initial release.

1. Architecture Overview

The Platform Calculator is a vanilla-JS ES-module application bundled with Vite. It is split across three source files:

File | Responsibility
platform-calculator.js | Data loading, sheet parsing, filter state, matching engine, GPU/NIC slot logic. The single exported entry-point is initPlatformCalculator().
platform-calculator-htmlbuild.js | All HTML construction. Exports renderUI(), renderResults() and clearFilters(). Never imports from platform-calculator.js to avoid circular dependencies.
style.css | All visual styles for cards, filters, badges, collapsible sections and the summary bar.

Seven module-level state variables hold the loaded data:

let _platforms = [];              // parsed platform objects
let _gpuCatalogue = [];           // GPU catalogue rows
let _cpuCatalogue = [];           // CPU catalogue rows
let _nicCatalogue = [];           // NIC catalogue rows
let _pidToGpu = {};               // Cisco GPU PID → { model, vramGB, entry }
let _pidToNic = {};               // Cisco NIC PID → { description, type, ports, … }
let _expCageGpuByChassisId = {};  // GPU slots from type=chnodeexpcage rows, keyed by intoChassisId

2. Data Sources

All data is fetched from Google Sheets via the gviz/tq JSON endpoint. The helper fetchGvizSheet(url) strips the google.visualization.Query.setResponse(…) wrapper, parses the JSON table, and returns { headers, dataRows, fRows } — where fRows carries the formatted (.f) cell string used when the numeric .v value is unreliable (e.g., CPU core counts stored as Excel date serials).
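The wrapper-stripping step can be sketched as a pure function over the response text. This is a minimal illustration, not the toolkit's own fetchGvizSheet(): the name parseGvizText is hypothetical, and it assumes the usual gviz JSON shape (table.cols[].label, and table.rows[].c[] cells carrying .v and an optional .f).

```javascript
// Hypothetical helper (not from the toolkit): parse raw gviz/tq response text.
// Assumes the usual shape: table.cols[].label and table.rows[].c[] cells with .v / .f.
function parseGvizText(text) {
  // Everything between the first "(" and the last ")" is the JSON payload.
  const start = text.indexOf('(');
  const end = text.lastIndexOf(')');
  if (start === -1 || end <= start) throw new Error('Unexpected gviz response');
  const table = JSON.parse(text.slice(start + 1, end)).table;

  const headers = table.cols.map(col => col.label);
  // Raw values (.v); empty cells arrive as null.
  const dataRows = table.rows.map(row => row.c.map(cell => (cell ? cell.v : null)));
  // Formatted strings (.f), or null when the cell has none.
  const fRows = table.rows.map(row =>
    row.c.map(cell => (cell && cell.f != null ? cell.f : null)));
  return { headers, dataRows, fRows };
}

// Illustrative payload: a core count whose numeric value is unreliable.
const sampleGviz = '/*O_o*/\ngoogle.visualization.Query.setResponse(' +
  '{"table":{"cols":[{"label":"model"},{"label":"cores"}],' +
  '"rows":[{"c":[{"v":"X"},{"v":45000,"f":"32"}]}]}});';
const parsedGviz = parseGvizText(sampleGviz);
```

In this sample the second cell's .v is a meaningless serial while .f holds the usable string, which is the case fRows exists for.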

Constant | Sheet (internal name) | Contents
SHEET_URL | NVBPTK_PLF | Platform catalogue. One row per server model. Columns cover: identity, CPU config, memory maximums per CPU-qty/gen, drive bay quantities and sizes, PCIe riser slot specs, GPU choices per riser variant, NIC choices per riser variant.
GPU_SHEET_URL | NVBPTK_GPU | GPU catalogue. Columns: GPU model, manufacturer, category, VRAM (raw string), interconnect type.
CPU_SHEET_URL | NVBPTK_CPU | CPU catalogue. Columns: model number, manufacturer, generation, class, core count (formatted string), clock speeds, TDP, memory type/speed, socket, SPECint score, release date.
NIC_SHEET_URL | NVBPTK_NIC | NIC catalogue. Columns: part number (Cisco PID), description, type, port count, speed (Gb), ethernet/IB flags, medium, connector.

3. Initialisation Flow

initPlatformCalculator() is called once from main.js on page load. Its steps:

  1. Injects a loading message into #plat-calc-wrap.
  2. Calls Promise.all([loadPlatforms(), loadGpuCatalogue(), loadCpuCatalogue(), loadNicCatalogue()]) — all four sheets are fetched in parallel.
  3. Calls buildPidToGpuMap() — scans every riser slot in every platform (plus chassis-node Rear Mezz slots, MGPU-server GPU bay slots, and expansion cage slots) for GPU PIDs and their embedded display names, cross-references the GPU catalogue, and populates _pidToGpu.
  4. Calls buildPidToNicMap() — same scan for NIC PIDs, populates _pidToNic.
  5. After all platforms are parsed, parseExpansionCageRows() scans all PLF rows (not just enumerate=Yes) for type=chnodeexpcage rows, reads their ifchassisnode-RearMezz-gpuChoice, and stores the GPU slots keyed by ifnode-intoChassisId. Each chassis-node platform then has its expansionCageGpu array linked by matching intoChassisId.
  6. Calls renderUI(…) to build the filter panel and placeholder.

Riser Group Definitions (RISER_GROUPS)

Three PCIe riser groups are defined as a static constant. Each group maps to a set of mutually-exclusive variants (the platform can be ordered with any one variant per riser):

{ id:'r1', label:'Riser 1', requiresCPU2:false, variants:[{id:'1A', gpuCol:'ifnode-pcie-riser1-gpuChoice-1A', nicCol:'…-nicChoice-1A'}, …] }
{ id:'r2', label:'Riser 2', requiresCPU2:true, variants:[…2A, 2B, 2C] }
{ id:'r3', label:'Riser 3', requiresCPU2:true, variants:[…3A, 3B, 3C, 3D] }

requiresCPU2: true means the riser is physically connected to CPU socket 2 and is only populated when ≥ 2 CPUs are installed. This flag is enforced in all slot-count calculations.
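A minimal sketch of how the requiresCPU2 flag might gate riser groups in a slot count. The constant below is a trimmed stand-in for RISER_GROUPS (variants omitted), and eligibleRiserGroups is a hypothetical name, not a function from the toolkit.

```javascript
// Trimmed stand-in for RISER_GROUPS; the variants arrays are omitted here.
const RISER_GROUPS_SKETCH = [
  { id: 'r1', label: 'Riser 1', requiresCPU2: false },
  { id: 'r2', label: 'Riser 2', requiresCPU2: true },
  { id: 'r3', label: 'Riser 3', requiresCPU2: true },
];

// Hypothetical helper: keep only the groups that are populated for this CPU count.
function eligibleRiserGroups(groups, cpuQty) {
  return groups.filter(g => !g.requiresCPU2 || cpuQty >= 2);
}

const oneCpuRisers = eligibleRiserGroups(RISER_GROUPS_SKETCH, 1).map(g => g.id);
const twoCpuRisers = eligibleRiserGroups(RISER_GROUPS_SKETCH, 2).map(g => g.id);
```

With one CPU only Riser 1 remains; with two CPUs all three groups are counted.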

4. Platform Parsing (parsePlatform)

Called for every data row in the PLF sheet. Returns null (row is skipped) if enumerate ≠ "Yes". Otherwise builds a rich platform object. Key parsed fields:

Identity & CPU

Object field | Sheet column | Notes
id, name | id, name | Display identity
maxCPUQty | ifnode-maxCPUQty | Hard maximum CPU sockets
cpuCombinations | ifnode-cpuCombinations | Valid CPU counts, e.g. "1,2" → [1,2]. Parsed by parseCpuCombinations(), which splits on whitespace/commas.
cpuMfg, cpuGenA, cpuGenB | ifnode-cpuMfg, ifnode-cpuGenA, ifnode-cpuGenB | Manufacturer and integer generation numbers. null when the column is blank or "na".
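The split-on-whitespace-or-commas behaviour described for parseCpuCombinations() could look roughly like the sketch below. The function name carries a Sketch suffix because the real implementation is not shown on this page; dropping non-numeric tokens such as "na" is an assumption.

```javascript
// Sketch only: the toolkit's parseCpuCombinations() may differ in detail.
function parseCpuCombinationsSketch(raw) {
  if (raw == null) return [];
  return String(raw)
    .split(/[\s,]+/)                 // split on runs of whitespace and/or commas
    .map(tok => parseInt(tok, 10))
    .filter(n => Number.isInteger(n) && n > 0); // assumption: drop blanks and "na"
}

const cpuCombosA = parseCpuCombinationsSketch('1,2');
const cpuCombosB = parseCpuCombinationsSketch('1, 2 4');
const cpuCombosC = parseCpuCombinationsSketch('na');
```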

Memory

Five columns are read, covering generation (A / B) and CPU count (1 / 2 / 4); Gen B has no 4-CPU column. Missing or "na" values are stored as 0.

mem: {
  maxA1: safeF('ifnode-maxMGB-genA-cpux1'), // Gen A, 1 CPU
  maxA2: safeF('ifnode-maxMGB-genA-cpux2'), // Gen A, 2 CPUs
  maxA4: safeF('ifnode-maxMGB-genA-cpux4'), // Gen A, 4 CPUs
  maxB1: safeF('ifnode-maxMGB-genB-cpux1'), // Gen B, 1 CPU
  maxB2: safeF('ifnode-maxMGB-genB-cpux2'), // Gen B, 2 CPUs
}

Drives

Nine drive types are defined in DRIVE_DEFS (2.5″ SAS/SATA/NVMe, 3.5″ HDD, NVMe E3.S, etc.). For each type, two columns are read:

  • Max-qty column — e.g. "[FRONT] 24 [RISER 1B] 2 [RISER 3B] 2". Parsed by both parseDriveQty() (display breakdown) and parsePositionedQty() (per-position array for CPU-filtering).
  • Size column — e.g. "[FRONT] 960, 1600, … 61400 [FRONTMEZZ] 6400". Parsed by parseSizeGBPerPosition() so each position has its own allowed sizes.

Drives with no valid qty or size data are discarded (.filter(d => d.qty !== null && d.sizes.length > 0)).
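As an illustration of the per-position parsing, the sketch below turns a max-qty cell into an array of position entries. The name parsePositionedQtySketch and the { position, qty } output shape are assumptions; only the input format comes from the description above.

```javascript
// Sketch of a per-position quantity parser; the output shape is assumed.
function parsePositionedQtySketch(cell) {
  const out = [];
  // "[LABEL] number" pairs, e.g. "[RISER 1B] 2".
  const re = /\[([^\]]+)\]\s*(\d+)/g;
  let m;
  while ((m = re.exec(cell)) !== null) {
    out.push({ position: m[1].trim(), qty: Number(m[2]) });
  }
  return out;
}

const positionedQty = parsePositionedQtySketch('[FRONT] 24 [RISER 1B] 2 [RISER 3B] 2');
```

A cell holding only a bracketed token with no number, such as [ISMGPUSERVER], yields an empty array in this sketch.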

GPU & NIC Riser Data

For every RISER_GROUPS variant, two sheet columns are read:

  • gpuCol → parseGpuChoice(): parses "SLOT1: UCSC-GPU-L4 (NVIDIA L4 (24 GB)) SLOT2: …" into an array of { slotKey, pids, pidNames }. The PID regex is UCS[A-Z]-GPU- so both UCSC-GPU- (rack server) and UCSX-GPU- (chassis node Mezz) PIDs are captured.
  • nicCol → parseNicChoice(): same format but with NIC PIDs.

Results are stored in p.risers[rgId][variantId] and p.nicRisers[rgId][variantId]. Boolean flags p.hasGpuData and p.hasNicData are set if any slots were found.
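A rough sketch of the slot parsing, under stated assumptions: each slot starts with a SLOTn: key, and PIDs continue with uppercase letters, digits, and hyphens after the UCS[A-Z]-GPU- prefix. It extracts only slotKey and pids; the embedded display names (pidNames) are left out because their exact delimiting is not described here. The function name is hypothetical.

```javascript
// Sketch only: extracts slot keys and GPU PIDs, not the display names.
function parseGpuChoiceSketch(cell) {
  const slots = [];
  // Capture each "SLOTn:" key and the text up to the next key (or end of input).
  const slotRe = /(SLOT\d+):\s*([\s\S]*?)(?=SLOT\d+:|$)/g;
  let m;
  while ((m = slotRe.exec(cell)) !== null) {
    // Assumed PID tail: uppercase letters, digits, hyphens.
    const pids = m[2].match(/UCS[A-Z]-GPU-[A-Z0-9-]+/g) || [];
    slots.push({ slotKey: m[1], pids });
  }
  return slots;
}

// The second slot repeats the first slot's PID for illustration only.
const gpuSlots = parseGpuChoiceSketch(
  'SLOT1: UCSC-GPU-L4 (NVIDIA L4 (24 GB)) SLOT2: UCSC-GPU-L4 (NVIDIA L4 (24 GB))');
```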

Special Platform Types

Three additional platform types require extra column reads beyond the standard riser slots:

type value | Additional columns read | Effect
chassisnode | ifchassisnode-RearMezz-gpuChoice | Parsed into p.chassisMezzGpu (same slot format). These GPUs are counted in getGpuConfig as a “Rear Mezz” group, always included (not CPU-count-dependent).
mgpuserver | ifmgpuserver-gpuChoice, ifmgpuserver-GPUtoDriveRules | mgpuGpuChoice replaces riser slots entirely (the platform uses dedicated GPU bays). mgpuDriveRules is parsed as a { gpuQty → maxDrives } map (e.g. [2] 24 [4] 16 [8] 8), used when a drive maxqty cell contains [ISMGPUSERVER].
chnodeexpcage | ifchassisnode-RearMezz-gpuChoice, ifnode-intoChassisId | Scanned by parseExpansionCageRows() (not parsePlatform). GPU slots are stored in _expCageGpuByChassisId and linked to chassis-node platforms via p.intoChassisId matching.

GPU Combinations (ifnode-gpuCombinations)

When this column is populated (e.g. "2, 4, 8"), the parsed array is stored as p.gpuCombinations. It constrains which GPU quantities are valid for the platform — used during GPU filter matching to ensure only valid configurations are returned.

Drives with [ISMGPUSERVER]

If a drive’s maxqty cell contains the literal token [ISMGPUSERVER], the drive is flagged isMgpuVariable: true and its qty is stored as null. At match time, the actual drive limit is read from mgpuDriveRules[chosenGpuQty].
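The rule lookup can be sketched as follows. Both function names are hypothetical; the rule-string format ([2] 24 [4] 16 [8] 8) is taken from the mgpuserver description, and the fallback to the largest limit when no GPU quantity is chosen follows the behaviour described for the drive-matching step.

```javascript
// Sketch: parse "[gpuQty] maxDrives" pairs into a lookup map.
function parseMgpuDriveRulesSketch(cell) {
  const rules = {};
  for (const m of cell.matchAll(/\[(\d+)\]\s*(\d+)/g)) {
    rules[Number(m[1])] = Number(m[2]);
  }
  return rules;
}

// Sketch: resolve the bay limit for a chosen GPU quantity.
function driveLimitFor(rules, chosenGpuQty) {
  if (chosenGpuQty != null && rules[chosenGpuQty] !== undefined) {
    return rules[chosenGpuQty];
  }
  // No GPU quantity chosen: use the largest limit across all configurations.
  const limits = Object.values(rules);
  return limits.length ? Math.max(...limits) : null;
}

const mgpuRules = parseMgpuDriveRulesSketch('[2] 24 [4] 16 [8] 8');
```

For example, driveLimitFor(mgpuRules, 8) gives the 8-GPU limit, while passing null falls back to the 2-GPU maximum of 24 bays.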

5. Filter UI & runFilter()

renderUI() builds the filter panel (left column) and the results placeholder (right column). Every change to a filter input calls runFilter() via change events on selects and Enter keydown on text/number inputs.

runFilter() reads fourteen DOM fields:

DOM id | Variable | Type and meaning
plat-cpu-mfg | cpuMfg | string: CPU manufacturer filter (empty = any). Derived from ifnode-cpuMfg in PLF.
plat-mem | memGB | float: minimum memory in GB
plat-cpu-gen | cpuGen | float: required CPU generation number (0 = any)
plat-cpu-qty | cpuQty | int: exact CPU count (0 = any)
plat-drv-qty | drvQty | int: minimum drive bays
plat-drv-size | drvSize | float: minimum drive size in GB
plat-gpu-type | gpuPid | string: Cisco GPU PID (empty = any/none)
plat-xseries-expansion | includeExpansion | bool: include X-Series GPU Node (UCSX-9508-D expansion cage). Default: on.
plat-gpu-mem | gpuMemGB | float: minimum total GPU VRAM
plat-cores | coresReq | int: total cores required across all CPUs
plat-nic-type | nicPid | string: Cisco NIC PID (empty = any/none)
plat-nic-qty | nicQty | int: minimum NIC card count
plat-nic-speed | nicSpeed | float: minimum NIC speed in Gb/s
plat-nic-ports | nicPortQty | int: minimum total NIC port count

These are bundled into a criteria object. includeExpansion is bound into a gpuFn closure that is passed to renderResults() so all GPU config calls use the same toggle state. A criteria chip "X-Series Expansion excluded" is shown when the toggle is off.

6. Matching Logic (matchPlatform)

Each filter check returns { match: false } immediately on failure (short-circuit). The checks run in this order:

Step 1 — CPU Quantity

If cpuQty > 0, the platform's cpuCombinations array is checked. If the array is populated, the selected count must be in the list. If the array is empty (old-format rows), the check falls back to p.maxCPUQty ≥ cpuQty.
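A minimal sketch of this check, assuming a platform object with the cpuCombinations array and maxCPUQty field described in the parsing section. The function name passesCpuQty is hypothetical.

```javascript
// Sketch of the Step 1 check; not the toolkit's own code.
function passesCpuQty(p, cpuQty) {
  if (!(cpuQty > 0)) return true;               // no CPU-quantity filter set
  if (p.cpuCombinations.length > 0) {
    return p.cpuCombinations.includes(cpuQty);  // must be an explicitly valid count
  }
  return p.maxCPUQty >= cpuQty;                 // old-format rows: fall back to the hard maximum
}

const newFormatRow = { cpuCombinations: [1, 2], maxCPUQty: 2 };
const oldFormatRow = { cpuCombinations: [], maxCPUQty: 4 };
```

Note the difference between the two paths: an old-format row with maxCPUQty 4 accepts a request for 3 CPUs, whereas a row listing only 1 and 2 rejects it.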

Step 2 — CPU Generation

If cpuGen > 0, the platform's cpuGenA and cpuGenB values are compared. If neither matches, the platform fails. Matching gens are collected into activeGens (['A'], ['B'], or ['A','B']). When no gen filter is set, all non-null gens are included.

Step 3 — Memory

The set of CPU counts to check against (effQties) is determined as follows:

  • If cpuQty > 0: effQties = [cpuQty] — only the selected count is checked.
  • If no filter: effQties = cpuQtyOptions (all valid counts from cpuCombinations) — the platform's highest available memory wins.

For each combination of activeGens × effQties, the lookup key is p.mem[`max${gen}${qty}`] (e.g. maxA2 for Gen A, 2 CPUs). The global maximum becomes maxMem. If maxMem < memGB the platform is rejected.
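The lookup over activeGens × effQties can be sketched as below. maxMemoryFor is a hypothetical name and the memory figures are illustrative; treating a missing key (such as a Gen B, 4-CPU entry) as 0 is an assumption consistent with how missing values are stored.

```javascript
// Sketch of the Step 3 maximum-memory lookup.
function maxMemoryFor(p, activeGens, effQties) {
  let maxMem = 0;
  for (const gen of activeGens) {
    for (const qty of effQties) {
      const value = p.mem[`max${gen}${qty}`] || 0; // missing key counts as 0 (assumption)
      if (value > maxMem) maxMem = value;
    }
  }
  return maxMem;
}

// Illustrative values in GB, not taken from any real platform.
const memPlatform = { mem: { maxA1: 1024, maxA2: 2048, maxA4: 0, maxB1: 2048, maxB2: 4096 } };
const memGenA2 = maxMemoryFor(memPlatform, ['A'], [2]);
const memAny = maxMemoryFor(memPlatform, ['A', 'B'], [1, 2, 4]);
```

A platform is then rejected when the returned maximum is below the requested memGB.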

Step 4 — Effective CPU Qty for Slots

After the memory check, a single effectiveCpuQty is computed for all subsequent slot-based checks (drives, GPU, NIC):

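// Fall back to maxCPUQty when cpuCombinations is empty (old-format rows).
const maxValidCpu = p.cpuCombinations.length > 0 ? Math.max(...p.cpuCombinations) : p.maxCPUQty;
const effectiveCpuQty = cpuQty > 0 ? cpuQty : maxValidCpu;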

This ensures that when no CPU filter is set, the full configuration (all risers populated) is shown.

Step 5 — GPU (runs before drives)

The GPU check is performed before the drive loop so that chosenGpuQty is available for [ISMGPUSERVER] drive limit calculations. Two sub-paths:

  • With GPU filter (gpuPid set): calls getGpuConfig(p, gpuPid, effectiveCpuQty, includeExpansion). If p.gpuCombinations is populated, the raw slot count is constrained to valid combinations that fit within the slot maximum. Then the memory requirement (gpuMemGB) is applied. The largest qualifying combination becomes chosenGpuQty and drives the VRAM total. A validCombos array is stored for card display.
  • Without GPU filter: chosenGpuQty is still resolved from p.gpuCombinations or p.mgpuDriveRules so that isMgpuVariable drives can be sized correctly.
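The with-filter path can be sketched as a small function. resolveGpuQty and its parameter list are hypothetical, and it assumes the VRAM requirement is tested per combination as quantity × per-GPU VRAM; the real getGpuConfig-based flow is more involved.

```javascript
// Sketch: constrain the slot count to valid combinations, then apply the VRAM filter.
function resolveGpuQty(slotCount, vramPerGpuGB, gpuCombinations, gpuMemGB) {
  // With combinations defined, only those that fit in the available slots qualify;
  // otherwise the raw slot count is the single candidate.
  const candidates = gpuCombinations.length > 0
    ? gpuCombinations.filter(n => n <= slotCount)
    : (slotCount > 0 ? [slotCount] : []);
  const validCombos = candidates.filter(n => n * vramPerGpuGB >= gpuMemGB);
  if (validCombos.length === 0) return { match: false };
  const chosenGpuQty = Math.max(...validCombos);
  return { match: true, chosenGpuQty, totalVramGB: chosenGpuQty * vramPerGpuGB, validCombos };
}

// 6 slots, 80 GB per GPU, combinations 2/4/8, at least 200 GB of VRAM requested.
const gpuPick = resolveGpuQty(6, 80, [2, 4, 8], 200);
```

Here 8 GPUs do not fit in 6 slots and 2 GPUs fall short of 200 GB, so the sketch settles on 4 GPUs with 320 GB in total.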

Step 6 — Drives

For each drive type in p.drives, two processing paths exist:

Standard drives (qty not null):

  1. CPU filtering: getRiserMinCpu(posLabel) excludes riser positions needing more CPUs than effectiveCpuQty.
  2. Size filtering: if drvSize > 0, only positions whose sizes include a value ≥ drvSize count towards effectiveQty.
  3. Quantity check: if drvQty > 0 and effectiveQty < drvQty, this drive type is skipped.
  4. Storage totals: maxStorageGB is the sum over available positions of qty × maxSizeForPosition.
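The four steps for standard drives can be sketched as one pass over the positions. The position objects here carry a precomputed minCpu (standing in for a getRiserMinCpu call), and both the function name and the object shape are assumptions for illustration.

```javascript
// Sketch: effective bay count and maximum storage for one drive type.
function standardDriveTotals(positions, effectiveCpuQty, drvSize) {
  let effectiveQty = 0;
  let maxStorageGB = 0;
  for (const pos of positions) {
    if (pos.minCpu > effectiveCpuQty) continue;   // 1. riser not populated at this CPU count
    const sizes = drvSize > 0 ? pos.sizes.filter(s => s >= drvSize) : pos.sizes;
    if (sizes.length === 0) continue;             // 2. no size meets the minimum
    effectiveQty += pos.qty;
    maxStorageGB += pos.qty * Math.max(...sizes); // 4. qty × largest size at this position
  }
  return { effectiveQty, maxStorageGB };
}

// Illustrative positions; sizes in GB.
const drivePositions = [
  { label: 'FRONT', qty: 24, minCpu: 1, sizes: [960, 1600] },
  { label: 'RISER 3B', qty: 2, minCpu: 2, sizes: [960] },
];
const oneCpuDrives = standardDriveTotals(drivePositions, 1, 0);
const twoCpuDrives = standardDriveTotals(drivePositions, 2, 0);
```

The quantity check (step 3) is then a comparison of effectiveQty against drvQty by the caller.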

MGPU-server variable drives (isMgpuVariable: true, maxqty cell was [ISMGPUSERVER]):

  1. Looks up p.mgpuDriveRules[chosenGpuQty] to determine the maximum bay count for the chosen GPU configuration.
  2. When no GPU filter is set, uses the maximum drive limit across all GPU combinations.
  3. The card shows a per-GPU-configuration drive table (e.g. 2 GPU→ 24 bays, 4 GPU→ 16 bays, 8 GPU→ 8 bays) instead of a fixed bay count.

If any drive filter is active and driveOptions is empty, the platform fails.

Step 7 — CPU Cores

If coresReq > 0, the per-CPU threshold is derived as perCpuCoresReq = ⌈coresReq / effectiveCpuQty⌉. The platform passes if any CPU in getCpusForPlatform() has known cores ≥ threshold or unknown cores (unknown = cannot rule out). Fails only when every catalogued CPU has verified cores below the threshold.
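A minimal sketch of the cores check, assuming each catalogue entry exposes a cores field that is null when unknown. passesCores is a hypothetical name; how the real code treats a platform with no catalogued CPUs at all is not stated, and this sketch fails it.

```javascript
// Sketch of the Step 7 check; cpus stands in for the getCpusForPlatform() result.
function passesCores(cpus, coresReq, effectiveCpuQty) {
  if (!(coresReq > 0)) return true;                    // no cores filter set
  const perCpuCoresReq = Math.ceil(coresReq / effectiveCpuQty);
  // Unknown core counts cannot be ruled out, so they pass.
  return cpus.some(c => c.cores == null || c.cores >= perCpuCoresReq);
}

// 64 cores across 2 CPUs means at least 32 cores per CPU.
const coresOk = passesCores([{ cores: 24 }, { cores: 32 }], 64, 2);
const coresFail = passesCores([{ cores: 24 }], 64, 2);
const coresUnknown = passesCores([{ cores: null }], 64, 2);
```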

Step 8 — NIC

Activated when any NIC criterion is non-zero. Two sub-paths:

  • Specific PID — calls getNicConfig(p, nicPid, effectiveCpuQty), checks total count ≥ nicQty and total ports ≥ nicPortQty.
  • Any NIC — iterates all PIDs present in the platform's NIC risers, filters by speed ≥ nicSpeed, qty and port requirements, selects the PID that maximises total slot count.

A nicResult object is attached to the match on success.

7. GPU & NIC Slot Calculation

getGpuConfig(p, pid, effectiveCpuQty, includeExpansion)

Four processing paths, evaluated in priority order:

  1. MGPU-server (p.type === 'mgpuserver') — reads p.mgpuGpuChoice slots directly. Returns immediately without checking risers.
  2. PCIe risers — iterates RISER_GROUPS. Riser groups with requiresCPU2 = true are skipped when effectiveCpuQty < 2. For each eligible group, bestVariantForPid() picks the variant that maximises the slot count.
  3. Chassis-node Rear Mezz (p.chassisMezzGpu) — counted as a fixed additional group, labelled “Rear Mezz”. Always included regardless of CPU count.
  4. X-Series Expansion cage (p.expansionCageGpu) — only included when includeExpansion === true (controlled by the Include X Series GPU Node toggle). Breakdown entries are flagged isExpansion: true and rendered with an amber “via Expansion Node” badge.

Returns:

{ totalCount, totalVramGB, breakdown: [{ label, variantId, count, slotDetails, isExpansion? }] }

getNicConfig(p, pid, effectiveCpuQty)

Mirrors getGpuConfig exactly, reading from p.nicRisers instead of p.risers. Returns:

{ totalCount, breakdown: [{ label, variantId, count, slotDetails }] }

getRiserMinCpu(posLabel)

Used by the drive-counting logic. Strips any "RISER " prefix, reads the first digit as the riser group number, and returns the requiresCPU2 flag from RISER_GROUPS as a minimum count (1 or 2). Non-riser positions (FRONT, FRONTMEZZ, MIDPLANE) always return 1.
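The label handling described above might be sketched like this. The flags table is a trimmed stand-in for RISER_GROUPS, and the function name carries a Sketch suffix because the real implementation is not shown.

```javascript
// Trimmed stand-in for the requiresCPU2 flags in RISER_GROUPS.
const RISER_CPU2_FLAGS = { 1: false, 2: true, 3: true };

// Sketch: minimum CPU count needed for a drive position to be populated.
function getRiserMinCpuSketch(posLabel) {
  // Strip an optional "RISER " prefix, then look for a leading digit.
  const rest = posLabel.trim().replace(/^RISER\s*/i, '');
  const m = rest.match(/^\d/);
  if (!m) return 1;                          // FRONT, FRONTMEZZ, MIDPLANE, ...
  return RISER_CPU2_FLAGS[m[0]] ? 2 : 1;
}

const minCpuLabels = ['RISER 1B', 'RISER 3B', 'FRONT', 'FRONTMEZZ'];
const minCpuValues = minCpuLabels.map(getRiserMinCpuSketch);
```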

NIC Capability Badges

The NIC Capabilities section renders colour-coded badge rows (port count / speed / type / medium / connector) using buildNicBadges(), matching the visual style of the platform card collapsed-view badges.

8. Rendering Pipeline

renderResults(results, criteria, …)

Receives the array of passing match objects. Renders a chip row summarising active criteria, then calls buildCard(r, …) for each result and sets #plat-results innerHTML.

buildCard(r, criteria, …)

Each result card is wrapped in a <details class="plat-card-collapse"> element — collapsed by default. The <summary> contains the card header (platform name, ID, badges) plus a compact summary bar showing: Memory max · Best drive configuration · GPU result (if filter active) · NIC result (if filter active). Clicking the header or pressing the toggle arrow (▶/▼) expands the full detail body.

The full card body contains five sections, each built by a dedicated helper:

Section | Built by | Contents
Memory | inline in buildCard | Max memory line with Gen/qty annotation. When a memory filter is active, the matched config is highlighted.
Drives | inline in buildCard | Per-drive-type rows. Standard drives show bay count, size options, max storage and per-position breakdown. MGPU-server drives show a GPU-config→drive-limit table (e.g. 2 GPU → 24 bays, 4 GPU → 16 bays) instead of a fixed count.
GPU | inline in buildCard | When gpuPid filter is set: hero count, VRAM total, riser breakdown. When validCombos has multiple entries, a blue note shows all valid GPU configurations (e.g. “Valid GPU configurations: 2, 4, 8 × this GPU”). Expansion-node breakdown entries show an amber “via Expansion Node” badge. When no filter: summary table of all available GPU PIDs.
NIC | buildNicSection() | Same dual-mode as GPU. Filter-active mode shows hero count, NIC name, speed, port summary. No-filter mode shows all NIC PIDs across all risers with colour-coded attribute badges (port count / speed / type / medium / connector).
CPU Models | buildCpuSection() + <details> | A per-generation table of supported CPUs with cores, speed, TDP, SPECint. Rows are highlighted green/grey/red based on cores filter. The raw model list is in a nested collapsible.

9. Application Module Map

The toolkit has grown beyond the Platform Calculator. Every tab is a distinct ES module initialised from main.js on page load:

Tab | Module(s) | Entry point | Description
Calculator | rag-calculator.js | initRagCalculator() | Concurrent user estimator for NVIDIA AI Blueprints. See §10.
GPUs | gpu-products.js | initGpuProducts() | GPU catalogue browser: renders cards from GPU_GROUPS.
Models | llm-models.js | initLlmModels() | LLM model catalogue browser: renders cards from LLM_MODEL_GROUPS.
Platforms | platform-calculator.js, platform-calculator-htmlbuild.js | initPlatformCalculator() | Server platform selector. Fully documented in §1–8 of this page.
Blueprints | reference-tables.js, blueprints.js | initReferenceTables() | Blueprint reference cards with WestconComstor Verified badges. See §11.
BP Components | bp-components.js | initBpComponents() | NIM component reference: cards from COMPONENTS with VRAM / GPU / CPU fields.
Industry / Vertical | industry-vertical.js, blueprints.js | initIndustryVertical() | Industry-tagged blueprint cards with WestconComstor Verified badges. See §11.
BP_Enterprise RAG | bp-detailed-enterprise-rag.js | initBpDetailedEnterpriseRag() | Deep-dive reference for the Enterprise RAG blueprint architecture.
Deployment | deployment.js | initDeploymentGuide() | Kubernetes & Helm deployment guide for PoC and scale-out. See §12.

10. RAG Calculator — Concurrent User Estimation Engine

The Calculator tab estimates the maximum concurrent users a given hardware and blueprint configuration can support. It is driven by a single module-level state object and a pure calculate() function in rag-calculator.js.

Calculator State (state object)

Field | Default | Description
gpuId | 'H100_SXM_80' | Selected GPU model key. Maps to an entry in GPU_GROUPS.
gpuCount | 1 | Total GPU count for the deployment.
outputTokens | 300 | Expected average output tokens per LLM response.
thinkTime | 30 | User think time in seconds between requests.
selectedModelId | Enterprise RAG default | LLM model ID from LLM_MODEL_GROUPS.
selectedQuantId | null | Quantization override. null auto-selects the model's maximum quantization.
parallelismStrategy | 'tensor' | Active multi-GPU parallelism strategy ID (used only when multiGpuEnabled is true).
multiGpuEnabled | false | Multi-GPU strategy toggle state. When false, strategy is forced to single-GPU inference. See §10.3.
blueprintId | Enterprise RAG | Active NVIDIA AI Blueprint. Determines which NIM components are shown.
components | required on, optional off | Object mapping component ID → { enabled, onGpu }. Optional components can be toggled off; GPU components can be switched to CPU.

Blueprint & Component Model

Each entry in BLUEPRINTS (blueprints.js) has a required[] and optional[] array of component IDs referencing COMPONENTS in bp-components.js. When a Blueprint is selected the Calculator renders only those components; optional components can be disabled. Blueprints may carry sizingVerified: true (see §11).

The "Show WestconComstor verified only" toggle (#bp-verified-toggle-btn) filters the Blueprint dropdown to entries where sizingVerified === true. Default: off.

Multi-GPU Parallelism Strategy Toggle

A rocker switch (#multi-gpu-toggle-btn) in the Parallelism Strategy row controls whether multi-GPU strategies are enabled. Default: OFF (single GPU inference).

Toggle state | Effective strategy | UI behaviour
OFF (default) | 'request' (forced internally, hidden from user) | Strategy dropdown hidden. Row shows a "Single GPU Inference" info tooltip explaining the mode and how it compares to each multi-GPU strategy. Row is visually dimmed via .rag-strategy-row--single.
ON | state.parallelismStrategy (user-selected) | Strategy dropdown visible with all 9 options. Each strategy's tooltip shows a singleGpuVs row (vs single GPU) and a tensorVs row (vs Tensor Parallelism baseline). A red warning (.rag-strategy-row--needs-gpu) appears when the selected strategy requires more GPUs than configured.

Auto-toggle: The GPU count input is wired so that changing from 1 to ≥ 2 automatically enables the multi-GPU toggle, and reducing back to 1 disables it — preventing impossible strategy/count combinations.

Parallelism Strategies (PARALLELISM_STRATEGIES constant)

Nine strategies are defined. Each carries a tpsEfficiency multiplier (relative to the Tensor Parallelism baseline at 1.00) and a gpuGrouping describing how the GPU pool is partitioned for calculation:

ID | Label | TPS efficiency | GPU grouping | Primary use case
tensor | Tensor Parallelism | 1.00 (baseline) | all GPUs = 1 instance | Large model inference (10B+ params): splits weight tensors across GPUs
pipeline | Pipeline Parallelism | 0.80 | linear layer chain | Very large models: splits transformer layers into pipeline stages
hybrid | Hybrid Parallelism | 0.90 | all GPUs = 1 instance | 100B+ production deployments combining TP + PP + DP
expert | Expert Parallelism (MoE) | 1.20 | all GPUs = 1 instance | MoE models: sparse activation means only a fraction of experts fire per token
sequence | Sequence / Context Parallelism | 0.90 | distributed attention | 128 K+ token RAG context windows: distributes attention, not weights
kvcache | KV Cache Parallelism | 0.85 | model GPUs + dedicated KV cache GPUs | Production RAG with many concurrent long-context sessions
request | Request Parallelism | 0.95 | independent model replicas | High-throughput serving: near-linear user scaling per GPU group
data | Data Parallelism | 0.70 | gradient-sync replicas | Training / fine-tuning only; not recommended for production inference
offload | CPU / GPU Offloading | 0.15 | CPU-paged weights | VRAM-constrained environments: pages weights via PCIe (5–10× slower)

Calculation Formula (calculate())

  1. Effective strategy: effectiveStratId = state.multiGpuEnabled ? state.parallelismStrategy : 'request'.
  2. GPU instances: determined by gpuGrouping, e.g. 'all' → one instance using all GPUs; 'tp' → gpuCount / minGpusPerInst independent replicas; 'kvcache' → reserves a share of GPUs for KV cache, rest for compute.
  3. TPS per instance: model token-per-second rate at the chosen quantization, scaled by strat.tpsEfficiency.
  4. Response time: responseTimeSec = outputTokens / tpsPerInstance.
  5. QPS: qps = totalTps / outputTokens.
  6. Concurrent users (Little's Law): concurrentUsers = round(qps × (thinkTime + responseTimeSec)).

The result object is returned to the rendering layer which displays a metric card, GPU VRAM allocation bar chart, and strategy breakdown panel.
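Steps 3 to 6 of the formula can be worked through in a few lines. This is a sketch under stated assumptions, not the calculate() function itself: the parameter names are hypothetical, totalTps is assumed to be tpsPerInstance × instances, and the 1000 tokens/s base rate is an illustrative number rather than a benchmark figure.

```javascript
// Sketch of the Little's Law estimate: N = λ × W, with λ = qps and
// W = think time + response time.
function estimateConcurrentUsers({ baseTps, tpsEfficiency, instances, outputTokens, thinkTime }) {
  const tpsPerInstance = baseTps * tpsEfficiency;        // step 3
  const totalTps = tpsPerInstance * instances;           // assumption: instances scale linearly
  const responseTimeSec = outputTokens / tpsPerInstance; // step 4
  const qps = totalTps / outputTokens;                   // step 5
  const concurrentUsers = Math.round(qps * (thinkTime + responseTimeSec)); // step 6
  return { tpsPerInstance, totalTps, responseTimeSec, qps, concurrentUsers };
}

// Two replicas under the 'request' strategy (efficiency 0.95), default workload.
const estimate = estimateConcurrentUsers({
  baseTps: 1000,      // illustrative only
  tpsEfficiency: 0.95,
  instances: 2,
  outputTokens: 300,
  thinkTime: 30,
});
```

With these inputs each instance delivers 950 tokens/s, a response takes about 0.32 s, the system sustains about 6.33 queries/s, and the estimate comes to 192 concurrent users. Note that qps × responseTimeSec always equals the instance count, so think time dominates the result at these settings.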

11. WestconComstor Verified Badge System

Blueprints sized and validated by WestconComstor carry sizingVerified: true in their blueprints.js entry. This flag drives both a Calculator filter toggle and distinct visual treatment on the Blueprints and Industry / Vertical tabs.

Calculator Filter Toggle

The "Show WestconComstor verified only" toggle (#bp-verified-toggle-btn) in the Blueprint Selection row filters the Blueprint dropdown to verified entries only. Default: off. Label text was updated from "Show verified only" to "Show WestconComstor verified only" to make the provenance explicit.

Card Visual Treatment (Blueprints & Industry / Vertical tabs)

Element | CSS class / selector | Visual effect
Card wrapper (Blueprints tab) | .ref-bp-card--verified | Gold gradient overlay (135deg, rgba(217,119,6,0.10) → transparent at 50%), amber border, overflow: hidden to clip the star watermark.
Card wrapper (Industry tab) | .ind-bp-card--verified | Same gradient and border treatment as Blueprints tab.
Star watermark (Blueprints) | .ref-bp-card--verified::before | 90 px ★ pseudo-element in rgba(217,119,6,0.18), anchored top-left. top: -12px compensates for font ascender space above the glyph (★ at 90 px has ~14 px of space above the visual character).
Star watermark (Industry) | .ind-bp-card--verified::before | 72 px ★, top: -9px for the smaller size.
Badge row | .ref-bp-verified-row / .ind-bp-verified-row | padding-left: 56px / 46px to clear the star watermark horizontally.
Badge chip | .ref-badge-verified | Gold chip: rgba(217,119,6,0.14) background, #92400e text, amber border, font-weight: 700. Text: "★ WestconComstor Verified".

To mark a blueprint as verified, set sizingVerified: true in its blueprints.js entry. Currently only Enterprise RAG (enterprise_rag) is verified.

12. Deployment Guide Tab

The Deployment tab renders a complete Kubernetes & Helm deployment guide, targeting a single-server, 1× GPU proof-of-concept with Qwen3-35B-A3B as the LLM, and extending to a 2-node tensor-parallelism scale-out in Phase 9. All content is rendered in JavaScript by initDeploymentGuide() in deployment.js; there is no server-side component.

Builder Helpers

Function | Returns | Purpose
phase(num, title, icon, bodyHtml, open?) | Accordion HTML string | Wraps content in a .depl-phase collapsible; open by default for Phases 1–5, closed for 6–9.
step(n, title, bodyHtml) | Numbered step HTML | Renders a numbered step block with a circle badge and a titled body.
cmd(code, lang) | Dark code block HTML | Wraps a command in .depl-cmd-block with language label and a copy-to-clipboard button (wired by initCopyButtons()).
note(text) / warn(text) / tip(text) | Callout box HTML | Renders ℹ / ⚠ / ★ callout boxes with the corresponding colour style.

Guide Phases

Phase | Title | Default | Key content
Arch | Architecture Overview | Open | Kubernetes service topology: all 15+ pods with ports (client → ingress → RAG core → NIMs → data layer).
Pre | Prerequisites Checklist | Open | Interactive checkbox list (12 items): NGC API key, hardware, OS, GPU driver, K8s, Helm, StorageClass, GPU Operator, NIM Operator, ECK Operator, NGC CLI. Checks persist in-session.
1 | Storage & GPU Operator | Open | Disk audit + PVC size table; optional dedicated data disk mount at /opt/local-path-provisioner; local-path StorageClass install; PVC smoke-test; GPU Operator Helm install.
2 | Operator Installation | Open | NIM Operator and ECK Operator Helm installs with verification commands.
3 | Deploy Helm Chart | Open | helm upgrade --install with poc-values.yaml; deployment monitoring commands (watch, NIMCache status, events).
4 | Verify Deployment | Open | Expected kubectl get pods output (~15 pods); all services port reference table.
5 | Access the Web UI | Open | Port-forward commands for UI (:3000), RAG API (:8081), Ingestor (:8082).
6 | Configuration | Closed | GPU time-slicing setup (ConfigMap + ClusterPolicy patch); Qwen3-35B-A3B NIM install (4-step); Milvus VDB switch; NIM cache persistence; optional GPU services table.
7 | Lifecycle | Closed | helm upgrade; helm uninstall; full cleanup (NIMCache + PVC deletion + namespace removal).
8 | Troubleshooting | Closed | 6-card grid: Pending pods, Init/ContainerCreating, GPU shortage, ErrImagePull, disk exhaustion, port-forward timeouts.
9 | Scale-Out — TP=2 | Closed | Join 2nd GPU node (kubeadm join); GPU Operator auto-config verify; LeaderWorkerSet install; tp2-values.yaml overlay (NIM_TENSOR_PARALLEL_SIZE=2, podAntiAffinity); apply + verify multi-node operation; scale reference table (TP=1 → TP=2 → TP=4 → TP=8).