NVLink (plus Scale-Up and Scale-Out)

NVLink is NVIDIA's high-bandwidth GPU-to-GPU interconnect. It is the **scale-up** fabric inside a rack; lower-bandwidth Ethernet/InfiniBand handles **scale-out** between racks.

Modern frontier inference and training depend on running many GPUs as one logical system. The two fabrics that connect them have very different performance characteristics, and the boundary between them shapes which model architectures are practical to deploy.

## Bandwidth tiers

|Tier|Fabric|Generation|Bandwidth|
|---|---|---|---|
|Scale-up (intra-rack)|NVLink + NVSwitch|Blackwell NVL72|~130 TB/s aggregate across 72 GPUs|
|Scale-out (inter-rack)|InfiniBand / Ethernet|n/a|~8× slower per link than scale-up|

The scale-up fabric behaves like a full mesh: through NVSwitch, every GPU can talk to every other GPU at high bandwidth. The scale-out fabric is hierarchical (top-of-rack switch → datacenter switch).

## Scale-up domain size

The scale-up domain size is the number of GPUs sharing one high-bandwidth fabric. It has grown over generations:

|Generation|Domain size|Form factor|
|---|---|---|
|Hopper|8 GPUs|One tray|
|Blackwell|72 GPUs|Full rack (NVL72)|
|Rubin (rumored)|500+ GPUs|Multi-rack?|

The Hopper → Blackwell jump was mostly form-factor engineering (going from tray-scale to rack-scale). Blackwell → Rubin requires harder physical work: cabling density at the rack backplane, bend radius, weight, power, cooling.

## Why scale-up domain size sets architectural limits

[[MoE]] expert layers want to do **all-to-all** communication: any GPU's tokens can route to experts on any other GPU. This all-to-all pattern naturally fits a full-mesh scale-up fabric.

The size of the scale-up domain therefore **bounds how big a single MoE layer can be**: once an expert layer doesn't fit in one scale-up domain, the all-to-all has to cross racks over the much slower scale-out fabric, which kills throughput.

This is the structural reason a 5T-parameter MoE wasn't shipped on Hopper-era infrastructure: 8 GPUs × 80 GB = 640 GB of [[HBM]], not enough to comfortably hold the experts.
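
The capacity argument is simple multiplication, and a rough sketch makes it easy to rerun with other numbers. The per-GPU capacities below are the nominal 80 GB and 192 GB figures used in this note. The names `ScaleUpDomain`, `weights_gb`, and `fits` are hypothetical helpers, and the 1 byte per parameter (8-bit weights) is an assumption for illustration, not something the source states. The check counts weights only and ignores KV cache and activations, so real headroom is smaller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUpDomain:
    """One scale-up domain: GPUs that share a single NVLink fabric."""
    name: str
    gpus: int
    hbm_gb_per_gpu: int  # nominal HBM per GPU, in decimal GB

    @property
    def total_hbm_gb(self) -> int:
        return self.gpus * self.hbm_gb_per_gpu


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Weight footprint in decimal GB (ignores KV cache and activations)."""
    return params * bytes_per_param / 1e9


def fits(domain: ScaleUpDomain, params: float, bytes_per_param: float) -> bool:
    """True if the weights alone fit in the domain's aggregate HBM."""
    return weights_gb(params, bytes_per_param) <= domain.total_hbm_gb


HOPPER = ScaleUpDomain("Hopper tray", gpus=8, hbm_gb_per_gpu=80)
BLACKWELL = ScaleUpDomain("Blackwell NVL72", gpus=72, hbm_gb_per_gpu=192)

if __name__ == "__main__":
    params = 5e12          # the 5T-parameter MoE discussed in the text
    bytes_per_param = 1.0  # assumption: 8-bit weights
    for domain in (HOPPER, BLACKWELL):
        verdict = fits(domain, params, bytes_per_param)
        print(f"{domain.name}: {domain.total_hbm_gb} GB HBM, fits={verdict}")
```

Under these assumptions the weights need 5,000 GB, against 640 GB for the Hopper tray and 13,824 GB for the NVL72 rack.
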
Blackwell's 72 GPUs × 192 GB ≈ 13.8 TB of HBM makes a model of that size feasible.

## Pipelining across racks

Cross-rack communication uses the scale-out fabric. Pipelining different layers onto different racks works because the data exchanged at layer boundaries (the residual stream) is small relative to the all-to-all traffic inside an MoE layer. So you can spread a model across racks **almost for free** in bandwidth terms, as long as you keep at least one full layer per pipeline stage.

## References

- Reiner Pope's interview with Dwarkesh Patel, _The math behind how LLMs are trained and served_ (2026).
- NVIDIA NVLink and NVL72 product documentation.