Self-Distillation Fine-Tuning (SDFT)

**Self-Distillation Fine-Tuning.** Shenfeld et al. (2026). A variant of [[On-Policy Distillation (OPD)|OPD]] that uses the **student conditioned on an expert demonstration** as its own teacher. Sister method to [[On-Policy Self-Distillation (OPSD)|OPSD]].

Like OPSD, SDFT exists to handle the case where no same-family external teacher is available for OPD. The student is used as its own teacher, but conditioned on privileged information at sampling time. The two methods differ only in what the privileged information is.

- **OPSD**: ground-truth answer. Aggressive: the teacher _knows where the trajectory ends_.
- **SDFT**: expert demonstration. Gentler: the teacher has _style and approach hints_, not the answer.

The demonstration doesn't leak the answer to the current task; it just provides distributional pull toward the expert's reasoning style, formatting, and problem decomposition.

Same algorithmic shape as OPD and OPSD; same dial settings in the unifying $(\alpha, \lambda, \pi_T)$ taxonomy: $\alpha = 1$ (on-policy student rollouts), $\lambda = 1$ (all teacher KL signal), differing only in the choice of $\pi_T$.

## The ceiling problem

SDFT inherits something close to SFT's ceiling: you're only as good as the demonstrations you have access to. There's no path past the demonstration distribution because the teacher is bounded by what the demos can express. This contrasts with the RL ceiling (set by the verifier, not by demonstrations).

The SFT-vs-RL post flags this as SDFT's structural limit and motivates the "construct a teacher" research direction (per-task prompt optimization, hint-writers, etc.) that aims to get RL-like ceilings without an external teacher.

## Why it matters in the bigger picture

Both SDFT and OPSD point at the same insight: when no same-family teacher is available, you can sometimes manufacture one by conditioning the student on privileged information. The choice of "what to condition on" determines where you sit on the Pareto curve of reward gain vs. KL distance, and how concentrated vs. diffuse the resulting gradient is.

## References

- Shenfeld et al., _Self-Distillation Enables Continual Learning_ (SDFT, 2026).
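
## Appendix: a minimal sketch of the training signal

The setup described at the top of this note (on-policy student rollouts, with the same model conditioned on a demonstration acting as teacher) can be made concrete with a toy per-token loss. This is a sketch under stated assumptions, not the paper's implementation: it assumes a reverse KL, $\mathrm{KL}(\pi_{\text{student}} \,\|\, \pi_T)$, averaged over the tokens of a student-sampled rollout, which is a common choice in on-policy distillation but may not match the divergence Shenfeld et al. use. The function names (`log_softmax`, `reverse_kl`, `sdft_token_loss`) and the toy logits are hypothetical, and no real model is called.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) at a single token position (assumed divergence)."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def sdft_token_loss(
    student_logits_per_step: Sequence[Sequence[float]],
    teacher_logits_per_step: Sequence[Sequence[float]],
) -> float:
    """Mean per-token reverse KL along one rollout sampled from the student.

    student_logits_per_step[t]: logits given (prompt, y_<t).
    teacher_logits_per_step[t]: the same model's logits given
        (demonstration, prompt, y_<t), i.e. with privileged context.
    In a real training loop the teacher logits would be treated as constants.
    """
    if len(student_logits_per_step) != len(teacher_logits_per_step):
        raise ValueError("student and teacher must cover the same rollout steps")
    if not student_logits_per_step:
        raise ValueError("rollout must contain at least one token")
    kls = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_step, teacher_logits_per_step)
    ]
    return sum(kls) / len(kls)


# Toy usage: a two-token rollout over a three-word vocabulary.
student = [[0.0, 0.0, 0.0], [1.0, 0.0, -1.0]]
teacher = [[2.0, 0.0, 0.0], [1.0, 0.0, -1.0]]  # demo shifts only the first step
print(sdft_token_loss(student, teacher))
```

In this toy case the second step contributes zero because the demonstration does not change the teacher's distribution there; only positions where the conditioned and unconditioned distributions disagree produce a learning signal.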