Inference Prefill and Decode - Sladyn's Engineering Field Manual

# Inference: Prefill and Decode

No practical conversation about LLM inference gets very far without two phases: prefill and decode.

If [[Inference Core Loop of an Inference Engine]] explains the basic loop, prefill and decode explain why production inference engines are difficult to schedule efficiently.

## Prefill

Prefill processes the prompt tokens through the model. This phase computes hidden states, attention outputs, and most importantly the keys and values that become part of the [[Field Notes/KV Cache|KV cache]].

Prefill is often compute-heavy because many prompt tokens can be processed in parallel. It can use the GPU aggressively.

## Decode

Decode generates new tokens after the prompt has been processed. This phase is sequential for each request because token `n + 1` depends on token `n`. It is often more memory-bound than compute-bound, especially when reading from the KV cache.

## Why They Are Different

Prefill and decode stress the GPU differently:

- Prefill tends to be parallel and compute-heavy.
- Decode tends to be sequential and memory-sensitive.
- Prefill handles prompt tokens.
- Decode handles generated tokens.

This difference is one reason inference scheduling is hard. A serving system may need to interleave many requests across both phases while keeping latency reasonable.

## Diagram

![[Assets/inference-prefill-decode.png]]

## What To Remember

For a single request, prefill must happen before decode because decode depends on the prompt's computed state. Across many requests, an inference engine may interleave prefill and decode, but doing that efficiently is a scheduling problem.

## Related

- [[Inference Core Loop of an Inference Engine]]
- [[Field Notes/KV Cache]]
- [[Topics/AI Infrastructure]]
- [[Learning Paths/AI Infrastructure from First Principles]]
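
## Appendix: A Toy Prefill and Decode Loop

The two phases described in this note can be made concrete with a small sketch. This is a toy model, not how any real engine is implemented: `Request`, `toy_kv`, `toy_next_token`, and `run_engine` are hypothetical names, the "model" is plain integer arithmetic, and the scheduling policy (decode every running request, then admit one waiting request per iteration) is an assumption chosen only to show interleaving. What it does preserve is the shape of the work: prefill fills the KV cache from all prompt tokens at once, while each decode step appends one entry and reads the whole cache again.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list[int]
    max_new_tokens: int
    # One (key, value) entry per token the model has processed so far.
    kv_cache: list[tuple[int, int]] = field(default_factory=list)
    output: list[int] = field(default_factory=list)


def toy_kv(token: int) -> tuple[int, int]:
    # Stand-in for the key/value projections of a real model.
    return (token * 2, token * 3)


def toy_next_token(kv_cache: list[tuple[int, int]]) -> int:
    # Stand-in for attention + sampling. Note that it reads the entire cache,
    # which is why decode is sensitive to memory traffic.
    return sum(value for _, value in kv_cache) % 100


def prefill(req: Request) -> None:
    # Every prompt token is known up front, so a real engine can process
    # them in parallel. Here that is just a single pass over the prompt.
    req.kv_cache = [toy_kv(tok) for tok in req.prompt]
    req.output.append(toy_next_token(req.kv_cache))


def decode_step(req: Request) -> None:
    # Sequential: the next token cannot be computed until the previous one
    # exists and its key/value entry has been added to the cache.
    req.kv_cache.append(toy_kv(req.output[-1]))
    req.output.append(toy_next_token(req.kv_cache))


def run_engine(requests: list[Request]) -> list[str]:
    """Toy scheduler that interleaves decode steps with new prefills."""
    waiting, running, log = list(requests), [], []
    while waiting or running:
        # Decode: one new token for every request already in flight.
        for req in running:
            decode_step(req)
        if running:
            log.append(f"decode x{len(running)}")
        # Prefill: admit at most one waiting request per iteration.
        if waiting:
            req = waiting.pop(0)
            prefill(req)
            running.append(req)
            log.append("prefill")
        # Retire requests that have produced all of their tokens.
        running = [r for r in running if len(r.output) < r.max_new_tokens]
    return log
```

Running `run_engine` on two requests, `Request(prompt=[1, 2, 3], max_new_tokens=3)` and `Request(prompt=[4, 5], max_new_tokens=2)`, returns the log `["prefill", "decode x1", "prefill", "decode x2"]`. For each individual request, prefill still comes before any decode step, but across requests the two phases are interleaved. A production scheduler has to make the same kind of choice under real latency and memory constraints.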