Chinchilla Scaling Law

Hoffmann et al. (DeepMind, 2022). The empirical scaling result that recalibrated how the field thinks about compute-optimal LLM training.

Earlier scaling laws (Kaplan et al. 2020) had recommended growing model size much faster than training tokens, implying that the optimal use of more compute was a much bigger model trained on roughly the same amount of data. Chinchilla showed this was wrong. For a fixed compute budget $C$, the optimal allocation between model size $N$ and training tokens $D$ has them growing in roughly equal proportion:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

The headline rule of thumb: **~20 training tokens per parameter** at the compute-optimal point. Earlier frontier models were undertrained: they would have been better at the same compute by being smaller and trained longer.

## Why frontier models are not Chinchilla-optimal

In production, models are deliberately _over-trained_ relative to Chinchilla. Two reasons:

- **Inference amortization.** Pre-training compute is paid once; inference compute is paid per token served. Smaller models with more training tokens are cheaper to serve, even if they cost slightly more to train. Once you serve enough tokens, the trade is net-positive.
- **[[GRPO|RL]] post-training.** Adds another large compute term that Chinchilla doesn't account for.

Reiner Pope's argument: when you balance pre-training, RL, and inference compute, optimal $D_{\text{pre}}$ is ~100× larger than Chinchilla-optimal. A model trained correctly outputs roughly as many tokens to users over its lifetime as it was pre-trained on.

## References

- Hoffmann et al., _Training Compute-Optimal Large Language Models_ (Chinchilla, 2022).
- Kaplan et al., _Scaling Laws for Neural Language Models_ (2020).
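
## Worked sketch: allocation and inference break-even

The arithmetic behind the 20-tokens-per-parameter rule and the inference-amortization argument can be sketched in a few lines of Python. This is a rough illustration, not the paper's fitting procedure. It rests on two common approximations that this note does not state and that should be read as assumptions: training costs about $6ND$ FLOPs and serving costs about $2N$ FLOPs per token. The function names are hypothetical. `break_even_served_tokens` also assumes the caller already knows that the two configurations reach the same quality, which the sketch does not check.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule of thumb quoted in this note
TRAIN_FLOPS = 6.0        # assumption: training compute C ~ 6 * N * D
INFER_FLOPS = 2.0        # assumption: ~2 * N FLOPs per served token


def compute_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a training budget into (params N, tokens D).

    Solves 6 * N * D = C with D = tokens_per_param * N, so both
    N and D scale as C ** 0.5.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(compute_flops / (TRAIN_FLOPS * tokens_per_param))
    return n_params, tokens_per_param * n_params


def lifetime_flops(n_params, train_tokens, served_tokens):
    """Pre-training plus inference compute (RL post-training is ignored)."""
    train = TRAIN_FLOPS * n_params * train_tokens
    serve = INFER_FLOPS * n_params * served_tokens
    return train + serve


def break_even_served_tokens(big, small):
    """Served-token count at which the smaller, over-trained model wins.

    `big` and `small` are (n_params, train_tokens) pairs that the caller
    ASSUMES reach the same quality; nothing here verifies that.
    """
    n_big, d_big = big
    n_small, d_small = small
    if n_small >= n_big:
        raise ValueError("`small` must have fewer parameters than `big`")
    extra_train = TRAIN_FLOPS * (n_small * d_small - n_big * d_big)
    saved_per_token = INFER_FLOPS * (n_big - n_small)
    # If the small model is also cheaper to train, it wins from token zero.
    return max(0.0, extra_train / saved_per_token)


if __name__ == "__main__":
    budget = 6 * 70e9 * 1.4e12  # illustrative budget in FLOPs
    n, d = compute_optimal(budget)
    print(f"N = {n:.3g} params, D = {d:.3g} tokens")

    # Hypothetical pair of configurations, assumed equal in quality.
    t = break_even_served_tokens((70e9, 1.4e12), (35e9, 4e12))
    print(f"break-even after {t:.3g} served tokens")
```

Quadrupling `budget` doubles both `n` and `d`, which is the $C^{0.5}$ scaling in code form. The break-even function shows why serving volume pushes production models past the compute-optimal point: the extra training FLOPs are a fixed cost, while the inference saving grows with every token served.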