KV Cache - Sladyn's Engineering Field Manual

# KV Cache

Status: Seedling

The KV cache stores attention keys and values for tokens that have already been processed during LLM inference.

## Mental Model

During autoregressive generation, the model produces one token at a time. Without caching, each step would recompute the keys and values for every previous token. The KV cache avoids that repeated work by keeping previous keys and values available for decode.

## Why It Matters

The KV cache is central to inference performance because it trades memory for speed. It affects:

- GPU memory usage.
- Maximum context length.
- Batch size.
- Decode throughput.
- Scheduling complexity.

## Related

- [[Articles/Inference Core Loop of an Inference Engine]]
- [[Articles/Inference Prefill and Decode]]
- [[Topics/AI Infrastructure]]
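
## Worked Example: Cache Size

The memory-for-speed trade described under Why It Matters can be made concrete with a back-of-the-envelope estimate. The sketch below assumes a standard attention stack with a dense, unquantized cache: one key tensor and one value tensor per layer, each holding `n_kv_heads * head_dim` elements per token. The function name `kv_cache_bytes` and the model shape are hypothetical, chosen only for illustration. Real engines add allocator and paging overhead that this ignores.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (dense cache, no overhead)."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # The cache grows linearly with both sequence length and batch size.
    return per_token * seq_len * batch_size


# Hypothetical model shape, not taken from any specific model card.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128)

per_token = kv_cache_bytes(**config, seq_len=1)
one_sequence = kv_cache_bytes(**config, seq_len=4096)
batch_of_8 = kv_cache_bytes(**config, seq_len=4096, batch_size=8)

print(f"per token:   {per_token / 2**20:.2f} MiB")
print(f"4096 tokens: {one_sequence / 2**30:.2f} GiB")
print(f"batch of 8:  {batch_of_8 / 2**30:.2f} GiB")
```

With these assumed numbers and 2-byte (fp16) elements, the cache costs 0.5 MiB per token, 2 GiB for one 4096-token sequence, and 16 GiB for a batch of eight. This linear growth is why the cache constrains context length and batch size at the same time.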