# EAGLE-3

EAGLE-3 is the third iteration of the EAGLE family of [Speculative Decoding](Speculative Decoding) drafters. The family was introduced by Li et al. in 2024, and EAGLE-3 followed in 2025. The defining property of the family is that the drafter is conditioned on the target model's internal hidden states, not only on the tokens the target has emitted so far.

## Why hidden-state conditioning helps

The original speculative decoding setup uses a smaller independent model as the drafter (for example, a 1B model drafting for a 70B target). The drafter sees the same token prefix as the target and produces its own next-token distribution.

The limitation is that the drafter's predictions are correlated with the target's only at the surface-token level. Two models trained independently agree on easy tokens and diverge on hard ones, and the per-token acceptance rate caps the achievable speedup.

EAGLE conditions the drafter on the target's hidden state $h_t$ at the current position, in addition to the token embedding $e_t$. Concretely, the drafter approximates

$$
p_{\text{draft}}(x_{t+1} \mid x_{\leq t}, h_t) \approx p_{\text{target}}(x_{t+1} \mid x_{\leq t}),
$$

rather than the naive $p_{\text{draft}}(x_{t+1} \mid x_{\leq t})$ used by independent drafters. Because $h_t$ already encodes most of what the target would compute before producing $x_{t+1}$, the drafter works from the target's own internal state rather than reconstructing it from tokens alone, and per-token acceptance rises accordingly.

In the original EAGLE, the drafter is trained with two objectives: predict the next hidden state $\hat{h}_{t+1}$ and the next token $\hat{x}_{t+1}$. At inference, the drafter rolls forward autoregressively on its own predicted hidden states, and the target then verifies the draft tokens in a single forward pass using the standard speculative decoding accept/reject rule.

## What EAGLE-3 changes

EAGLE-3 introduces three modifications to the base recipe:

1. **Multi-layer hidden state inputs.** The drafter consumes hidden states from multiple layers of the target, not just the final layer. Earlier layers carry information that the final layer has already collapsed into a token-level prediction, so giving the drafter access to several layers improves its ability to reproduce the target's behavior on tokens where surface features alone are insufficient.
2. **Removal of the next-feature prediction loss.** The drafter is trained only to predict the next token; the auxiliary hidden-state (feature) prediction loss is dropped. The intuition is that forcing the drafter to also reconstruct the target's hidden trajectory pulls capacity away from the actual goal of matching token distributions.
3. **Inference-aligned training data augmentation.** During training, the drafter is exposed to the kind of autoregressive drift it will see at inference, where it has to roll forward on its own predicted features rather than ground-truth ones. This reduces the train-vs-inference distribution shift and lifts acceptance length.

## Practical numbers

Reported figures from the speculative-decoding RL paper, on 8B Qwen3:

- Acceptance length $\alpha$: 2.77 to 3.32, compared with 2.05 to 2.47 for n-gram drafting.
- Generation speedup: 1.5× to 1.8× over standard autoregressive decoding.
- End-to-end RL step speedup: 1.35× to 1.41×.

The gap between generation speedup and end-to-end step speedup is the Amdahl ceiling at work: only a fraction of total RL step time is sequential decoding, so the realized improvement on the full step is smaller than the raw decoding number.

## Initialization matters

A well-initialized EAGLE-3 drafter, trained on the actual rollout distribution of the policy being served, outperforms a drafter trained on generic chat data by roughly 20 to 25 percent in realized speedup. The paper labels this finding "in-domain initialization."
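As a rough illustration of why a better-matched drafter pays off, consider a single drafted token in the basic (non-tree) setting. Under the standard speculative sampling accept/reject rule, the probability that the token is accepted is $\sum_x \min\big(p_{\text{target}}(x),\, p_{\text{draft}}(x)\big)$. The sketch below is not from the paper: the function name `acceptance_probability` and the toy four-token distributions are hypothetical, chosen only to show that a drafter closer to the target is accepted more often.

```python
def acceptance_probability(p_target, p_draft):
    """Per-token acceptance probability under the standard accept/reject rule.

    A token x drawn from p_draft is accepted with probability
    min(1, p_target(x) / p_draft(x)); averaging over x ~ p_draft gives
    sum_x min(p_target(x), p_draft(x)).
    """
    if len(p_target) != len(p_draft):
        raise ValueError("distributions must cover the same vocabulary")
    return sum(min(p, q) for p, q in zip(p_target, p_draft))


# Toy 4-token vocabulary. All numbers are invented for illustration.
target = [0.70, 0.20, 0.05, 0.05]
in_domain_draft = [0.65, 0.25, 0.05, 0.05]  # hypothetical: trained on policy rollouts
generic_draft = [0.40, 0.20, 0.20, 0.20]    # hypothetical: trained on generic chat data

a_in = acceptance_probability(target, in_domain_draft)
a_gen = acceptance_probability(target, generic_draft)
print(f"in-domain: {a_in:.2f}, generic: {a_gen:.2f}")
```

With these invented numbers the in-domain drafter is accepted about 95% of the time and the generic one about 70%. Real gaps depend on the model, the data, and the drafting scheme, so treat this as intuition only, not as a reproduction of the reported figures.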
The mechanism behind in-domain initialization is straightforward: the drafter is a learned approximation of $p_{\text{target}}$, and the quality of that approximation depends on the input distribution it was trained on. A drafter that has only seen chat-style prompts will be a worse approximator on math, code, or long-horizon agentic rollouts than one trained on the same kinds of trajectories the policy actually produces.

This is also the underlying reason drafter alignment becomes a recurring problem during RL. As the policy moves, the rollout distribution moves with it, and a drafter trained on the initial policy's rollouts gradually becomes mismatched. Periodic re-training or online updating of the drafter is needed to keep $\alpha$ from drifting downward.

## Alternative: native MTP

If the target model has built-in [MTP](Multi-Token Prediction (MTP)) heads from pre-training (for instance DeepSeek-V3 and V4), those heads can serve as the drafter directly. There is no separate drafter to train, deploy, or keep aligned with the policy.

The tradeoff is availability: most current open and frontier models do not ship with native MTP heads, so EAGLE-style external drafters remain the practical default.

## References

- Li et al., _EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty_ (2024).
- Li et al., _EAGLE-3_ (2025).