The fast memory tier that sits on the same physical package as the GPU compute die, connected via wide silicon-interposer interfaces. Every recent datacenter GPU and TPU has it. HBM is faster than the system DRAM on a typical motherboard but slower than the on-die SRAM caches inside the GPU itself: it's the middle tier of the memory hierarchy, the one that compute kernels actually load weights from.

Two parameters define a stack: **capacity** (how much fits) and **bandwidth** (how fast you can read it). Modern training and serving economics are mostly bottlenecked by the second.

## Why bandwidth is the binding constraint

Approximate numbers per GPU:

|Generation|Capacity|Bandwidth|"Drain time" (capacity / bandwidth)|
|---|---|---|---|
|Hopper (H100)|80 GB|~3 TB/s|~27 ms|
|Blackwell (B200)|192 GB|~8 TB/s|~24 ms|
|Rubin (rumored)|~288 GB|~20 TB/s|~15 ms|

The "drain time" (how long it takes to read the entire HBM at full bandwidth) has stayed roughly constant across generations, in the 15 to 27 ms range (~20 ms), because capacity and bandwidth have scaled together.

In LLM **decode** (generating one token per step), the GPU fetches all of the model's weights once per step. The compute on those weights is small relative to the fetch cost, so per-token latency is set by memory bandwidth, not by FLOPs. Decode is **memory-bound**, not compute-bound: the multi-thousand-TFLOP compute units sit mostly idle waiting for data.

This produces a hard lower bound on inference latency that you can't compute your way around. Reiner Pope's bound:

$$t_{\text{min}} = \frac{N_{\text{total}}}{B_{\text{HBM}}}$$

Here $N_{\text{total}}$ is the total size of the weights that must be read per step (in bytes) and $B_{\text{HBM}}$ is the HBM bandwidth (in bytes per second). Even at batch size 1, you can't generate a token faster than this, because the weights have to come off HBM at least once per step.

## How [[NVLink (plus Scale-Up and Scale Out)|scale-up domains]] help

A single GPU has fixed HBM bandwidth.
Putting $K$ GPUs together in an [[NVLink (plus Scale-Up and Scale Out)|NVLink scale-up domain]] gives you $K \times$ the aggregate bandwidth: each GPU reads from its own HBM stack in parallel, and a fast intra-rack fabric lets them combine results.

The Hopper → Blackwell jump didn't increase per-GPU HBM bandwidth that much (~2.5×). But going from 8 to 72 GPUs in a scale-up domain (9× as many GPUs) gave an **~8× jump in effective bandwidth** for serving large [[Mixture-of-Experts (MoE)|MoE]] models. That's what made trillion-parameter inference feasible at low latency.

## References

- Reiner Pope's interview with Dwarkesh Patel, _The math behind how LLMs are trained and served_ (2026).
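
## Worked example: drain time and the decode bound

The arithmetic behind the drain-time column and the bound $t_{\text{min}} = N_{\text{total}} / B_{\text{HBM}}$ can be sketched in a few lines of Python. This is a rough illustration, not a performance model. The capacity and bandwidth figures are the approximate ones from the table. The 1 TB weight footprint is a hypothetical value chosen for illustration, and the assumption that $K$ GPUs deliver exactly $K \times$ the bandwidth is idealized (it ignores interconnect and synchronization overhead). The function names are made up for this sketch.

```python
# Approximate per-GPU figures taken from the table in this note.
GPUS = {
    "H100": {"capacity_gb": 80.0, "bandwidth_tb_s": 3.0},
    "B200": {"capacity_gb": 192.0, "bandwidth_tb_s": 8.0},
}


def drain_time_ms(capacity_gb: float, bandwidth_tb_s: float) -> float:
    """Time to read the whole HBM once at full bandwidth, in milliseconds."""
    bandwidth_gb_s = bandwidth_tb_s * 1000.0
    return capacity_gb / bandwidth_gb_s * 1000.0


def min_decode_latency_ms(
    weight_bytes: float, bandwidth_bytes_s: float, num_gpus: int = 1
) -> float:
    """Lower bound t_min = N_total / B_HBM, in milliseconds.

    Assumes the weights are sharded evenly over num_gpus and that the
    aggregate bandwidth is exactly num_gpus * bandwidth_bytes_s (idealized).
    """
    aggregate_bandwidth = bandwidth_bytes_s * num_gpus
    return weight_bytes / aggregate_bandwidth * 1000.0


for name, spec in GPUS.items():
    ms = drain_time_ms(spec["capacity_gb"], spec["bandwidth_tb_s"])
    print(f"{name}: drain time ~{ms:.1f} ms")

# Hypothetical model: 1e12 bytes of weights (an assumption for illustration).
WEIGHT_BYTES = 1e12
PER_GPU_BANDWIDTH = 8e12  # bytes/s, the ~8 TB/s figure from the table

for k in (8, 72):
    ms = min_decode_latency_ms(WEIGHT_BYTES, PER_GPU_BANDWIDTH, num_gpus=k)
    print(f"K={k}: t_min ~{ms:.2f} ms per decode step")
```

Under the idealized scaling assumption, growing the domain from 8 to 72 GPUs lowers the bound by exactly 9×. Real systems fall somewhat short of that, which is consistent with the ~8× effective figure quoted above.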