HBM (High Bandwidth Memory) is the fast memory tier that sits on the same physical package as the GPU compute die, connected through wide silicon-interposer interfaces. Every recent datacenter GPU and TPU has it.
HBM is faster than the system DRAM on a typical motherboard but slower than the on-die SRAM caches inside the GPU itself — it's the middle tier of the memory hierarchy that compute kernels actually load weights from. Two parameters define a stack: **capacity** (how much fits) and **bandwidth** (how fast you can read it). Modern training and serving economics are mostly bottlenecked by the second.
## Why bandwidth is the binding constraint
Modern numbers per GPU:
|Generation|Capacity|Bandwidth|"Drain time" (capacity / bandwidth)|
|---|---|---|---|
|Hopper (H100)|80 GB|~3 TB/s|~27 ms|
|Blackwell (B200)|192 GB|~8 TB/s|~24 ms|
|Rubin (rumored)|~288 GB|~20 TB/s|~15 ms|
The "drain time" (how long it takes to read the entire HBM once at full bandwidth) has stayed within a narrow band of roughly 15 to 27 ms across generations, because capacity and bandwidth have scaled roughly together.
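The drain-time column can be reproduced directly from the capacity and bandwidth columns. The following is a minimal sketch that uses the rounded figures from the table above; the names `GENERATIONS` and `drain_time_ms` are illustrative, not from any library.

```python
# Capacity (GB) and bandwidth (TB/s), copied from the table above.
# These are the table's rounded figures, not exact datasheet values.
GENERATIONS = {
    "Hopper (H100)": (80, 3.0),
    "Blackwell (B200)": (192, 8.0),
    "Rubin (rumored)": (288, 20.0),
}


def drain_time_ms(capacity_gb: float, bandwidth_tb_per_s: float) -> float:
    """Time to read the whole HBM once at full bandwidth, in milliseconds."""
    # GB / (TB/s) = 1e-3 s, so the plain ratio is already in milliseconds.
    return capacity_gb / bandwidth_tb_per_s


drain_times = {
    name: drain_time_ms(capacity, bandwidth)
    for name, (capacity, bandwidth) in GENERATIONS.items()
}

for name, ms in drain_times.items():
    print(f"{name}: {ms:.1f} ms")
```

With these inputs the three drain times differ by less than a factor of two, even though capacity and bandwidth each grow severalfold across the same rows.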
In LLM **decode** (generating one token per step), the GPU fetches all of the model's weights once per token. The compute on those weights is small relative to the fetch cost, so per-token latency is set by memory bandwidth, not by FLOPs: decode is **memory-bound**, and the multi-thousand-TFLOP compute units sit mostly idle waiting for data. This produces a hard lower bound on inference latency that you can't compute your way around. Reiner Pope's bound:
$t_{\text{min}} = \frac{N_{\text{total}}}{B_{\text{HBM}}}$
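Here $N_{\text{total}}$ is the total size of the model's weights in bytes (parameter count times bytes per parameter) and $B_{\text{HBM}}$ is the HBM bandwidth in bytes per second. Even at batch size 1, you can't generate a token faster than this, because the weights have to come off HBM at least once per step.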
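To make the bound concrete, here is a small sketch that evaluates it for one set of inputs. It assumes $N_{\text{total}}$ is measured in bytes (parameter count times bytes per parameter). The 30-billion-parameter bf16 model is a hypothetical example chosen to fit in 80 GB, and the function name `min_decode_latency_ms` is illustrative.

```python
def min_decode_latency_ms(
    n_params: float,
    bytes_per_param: float,
    hbm_bandwidth_tb_per_s: float,
) -> float:
    """Lower bound on per-token decode latency: weight bytes / HBM bandwidth."""
    weight_bytes = n_params * bytes_per_param
    bandwidth_bytes_per_s = hbm_bandwidth_tb_per_s * 1e12
    return weight_bytes / bandwidth_bytes_per_s * 1e3  # seconds -> milliseconds


# Hypothetical model: 30e9 parameters stored in bf16 (2 bytes each),
# served from a single GPU with ~3 TB/s of HBM bandwidth (the H100 row above).
latency_ms = min_decode_latency_ms(30e9, 2, 3.0)
max_tokens_per_s = 1e3 / latency_ms

print(f"minimum latency per token: {latency_ms:.1f} ms")
print(f"maximum tokens per second: {max_tokens_per_s:.0f}")
```

Under these assumptions the floor is about 20 ms per token, or at most about 50 tokens per second for a single sequence, regardless of how many FLOPs the chip can deliver.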
## How [[NVLink (plus Scale-Up and Scale Out)|scale-up domains]] help
A single GPU has fixed HBM bandwidth. Putting $K$ GPUs together in an [[NVLink (plus Scale-Up and Scale Out)|NVLink scale-up domain]] gives you $K \times$ the aggregate bandwidth — they each read from their own HBM stack in parallel, and a fast intra-rack fabric lets them combine results.
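The $K \times$ scaling can be sketched with the same bandwidth bound. This is an idealized estimate: it assumes the weights are sharded evenly across the $K$ GPUs and ignores the time spent combining results over the fabric. The 1 TB weight size (for example, a trillion parameters at one byte each) and the function name `sharded_min_latency_ms` are hypothetical.

```python
def sharded_min_latency_ms(
    weight_bytes: float,
    per_gpu_bandwidth_tb_per_s: float,
    num_gpus: int,
) -> float:
    """Idealized bandwidth bound when weights are split evenly over num_gpus.

    Each GPU reads only its own shard from its own HBM, so the aggregate
    read rate is num_gpus times the per-GPU bandwidth. Communication cost
    between GPUs is deliberately ignored here.
    """
    aggregate_bytes_per_s = per_gpu_bandwidth_tb_per_s * 1e12 * num_gpus
    return weight_bytes / aggregate_bytes_per_s * 1e3  # seconds -> milliseconds


WEIGHT_BYTES = 1e12  # hypothetical: 1 TB of weights in total
PER_GPU_BANDWIDTH_TB_PER_S = 8.0  # the Blackwell row in the table above

latency_by_domain_size = {
    k: sharded_min_latency_ms(WEIGHT_BYTES, PER_GPU_BANDWIDTH_TB_PER_S, k)
    for k in (8, 72)
}

for k, ms in latency_by_domain_size.items():
    print(f"K = {k:2d} GPUs: at least {ms:.2f} ms per token")
```

In this idealized model the latency floor falls in direct proportion to $K$. Real deployments land above it because the fabric adds its own latency.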
The Hopper → Blackwell jump raised per-GPU HBM bandwidth only modestly (~2.5×). Going from 8 to 72 GPUs in a scale-up domain, however, gave an **~8× jump in effective bandwidth** for serving large [[Mixture-of-Experts (MoE)|MoE]] models. That's what made trillion-parameter inference feasible at low latency.
## References
- Reiner Pope's interview with Dwarkesh Patel, _The math behind how LLMs are trained and served_ (2026).