On-Policy Self-Distillation (OPSD)

**On-Policy Self-Distillation.** Zhao et al. (2026). A variant of [[On-Policy Distillation (OPD)|OPD]] that uses the **student itself as the teacher**, conditioned on privileged information (typically the ground-truth answer) that the student doesn't see at sampling time.

OPD needs a same-family teacher (matching tokenizer and recipe) to work, and such a teacher is often unavailable. OPSD manufactures a teacher from the student itself by giving it a hint at conditioning time: the student samples a rollout normally, and the same model, now with the answer in context, computes per-token logprobs over that rollout. Tokenizer match and recipe match are automatic. The cost is that the teacher's distribution is now significantly shifted from the student's natural distribution.

Algorithmically, OPSD is identical to OPD: the per-token reverse-KL log-ratio between teacher and student is used as the advantage in a [[GRPO]]-style loop. The only change from plain OPD is the choice of teacher.

## The failure mode: pivot tokens

This is OPSD's defining problem and the most insightful part of the SFT-vs-RL post. In a long math rollout where the student got the answer wrong, there are typically one or two **pivot tokens** where the student went off the rails (didn't pick the right substitution, missed a key observation). At those tokens:

- Student probability: very low (e.g., 0.01)
- Teacher-with-answer probability: very high (e.g., 0.6)

The reverse-KL contribution at such a token is $\approx \log(0.6 / 0.01) \approx 4.1$, **~100× larger** than at typical tokens, where both models put $\approx 0.3$ on the same token and the log-ratio is close to zero. The gradient is dominated by these few pivot tokens.

Unlike RL (sparse but unbiased, so noise cancels) or SFT (dense, biased, but diffuse), OPSD's gradient is dense, biased, **and concentrated**: one concentrated tug per step toward a region the model didn't previously believe in. **Performance collapses within ~100 steps** without defenses.

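
The pivot-token arithmetic above can be checked with a small numeric sketch. This is an illustration under stated assumptions, not the paper's implementation: the helper names `per_token_log_ratio` and `pivot_share`, the 200-token rollout length, the pivot positions, and the slight teacher/student disagreement on ordinary tokens (0.31 vs. 0.30) are all made up for the example. Only the pivot probabilities (0.6 and 0.01) come from the text.

```python
import math


def per_token_log_ratio(teacher_probs, student_probs):
    """Return log(p_teacher / p_student) for each token of one rollout."""
    return [math.log(t) - math.log(s) for t, s in zip(teacher_probs, student_probs)]


def pivot_share(log_ratios, pivot_indices):
    """Fraction of the total absolute log-ratio carried by the given positions."""
    total = sum(abs(r) for r in log_ratios)
    pivot = sum(abs(log_ratios[i]) for i in pivot_indices)
    return pivot / total


# Hypothetical rollout: ordinary tokens where teacher and student nearly agree
# (assumed values, chosen close to the ~0.3 mentioned in the text).
n_tokens = 200
teacher = [0.31] * n_tokens
student = [0.30] * n_tokens

# Two assumed pivot positions, using the probabilities from the text.
pivots = [57, 58]
for i in pivots:
    teacher[i], student[i] = 0.6, 0.01

ratios = per_token_log_ratio(teacher, student)

print(f"pivot log-ratio:   {ratios[pivots[0]]:.2f}")
print(f"typical log-ratio: {ratios[0]:.3f}")
print(f"share of signal from {len(pivots)} of {n_tokens} tokens: "
      f"{pivot_share(ratios, pivots):.0%}")
```

Under these assumed numbers, 1% of the positions carry more than half of the total per-token signal, which is the concentration the section describes. The exact share depends entirely on how much the two distributions disagree on ordinary tokens.
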
## The fix

Per-vocab-entry KL clipping: cap the per-position contribution so a small subset of tokens can't dominate the gradient. The fix works, but the underlying issue is structural: the KL signal is concentrated by construction.

## Compare to [[Self-Distillation Fine-Tuning (SDFT)]]

SDFT does the same trick but with a **demonstration in context** as the privileged hint, not the answer itself. The distributional shift is gentler, and the failure mode is correspondingly milder.

## References

- Zhao et al., _Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models_ (2026).