# On-Policy Distillation (OPD)

A post-training method that combines RL-style sampling with SFT-style dense supervision. The student samples its own completions (like RL), but each token is supervised by the teacher's logprobs (like SFT) instead of by a final outcome reward.

The motivation comes from comparing the two standard alternatives. **SFT** gives a dense per-token signal but trains on _teacher_ states and tests on _student_ states; the resulting exposure bias caps practical performance somewhere short of teacher quality on the actual eval distribution. **RL with a verifier** trains on student states (no exposure bias), but the signal is sparse: one outcome reward per trajectory, broadcast across thousands of tokens.

OPD is the structural sweet spot: the student generates, so it sees its own state distribution, but instead of waiting for a single outcome reward at the end, the teacher scores each token along the way using its logprobs. Dense signal _and_ on-policy sampling. Reported speedups: **9–30× less compute than RL** to match the same teacher on math benchmarks.

## The gradient

For each token in a student-generated rollout, compute "how much more does the teacher like this token than the student does":

$$\nabla_\theta J_{\text{OPD}}(\theta) = \mathbb{E}_{x,\, \hat{y} \sim \pi_\theta}\left[\sum_t \left(\log \pi_T(\hat{y}_t \mid \hat{y}_{<t}) - \log \pi_\theta(\hat{y}_t \mid \hat{y}_{<t})\right) \nabla_\theta \log \pi_\theta(\hat{y}_t \mid \hat{y}_{<t})\right]$$

Averaged over the student's own samples, that log-ratio is the negative per-token reverse KL, $-\mathrm{KL}(\pi_\theta \,\|\, \pi_T)$, so following this gradient minimizes reverse KL: "the student should put more probability on tokens the teacher likes." In a [[GRPO]]-style training loop it drops in directly as a per-token advantage:

$$\hat{A}_{i,t} = \text{sg}\left[\log \frac{\pi_T(\hat{y}_{i,t} \mid \cdot)}{\pi_\theta(\hat{y}_{i,t} \mid \cdot)}\right]$$

The `sg` (stop-gradient) means the whole log-ratio is treated as a fixed numerical signal: we don't differentiate through the teacher, and the student's own logprob inside the ratio is held constant as well. Gradient flows only through the $\nabla_\theta \log \pi_\theta$ term that the advantage multiplies.
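The per-token advantage and its link to reverse KL can be checked on a toy case. The sketch below is an illustration under stated assumptions, not a training loop: it looks at a single token position with a three-token vocabulary, parameterizes the student directly by its logits, and uses hypothetical helper names (`opd_advantages`, `expected_opd_gradient`, `finite_diff_neg_kl_grad`) that do not come from any library. It computes the exact expectation of the advantage-weighted score function and compares it with a finite-difference gradient of $-\mathrm{KL}(\pi_\theta \,\|\, \pi_T)$.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def opd_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage log pi_T(y_t) - log pi_theta(y_t).

    The result is a plain number per token, i.e. it plays the role of
    sg[...] in the text: nothing is differentiated through it.
    """
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


def expected_opd_gradient(student_logits, teacher_logits):
    """Exact E_{y ~ pi_theta}[A(y) * d log pi_theta(y) / d logits] for one position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    p = [math.exp(lp) for lp in log_p]
    adv = opd_advantages(log_q, log_p)
    grad = [0.0] * len(p)
    for y, (p_y, a_y) in enumerate(zip(p, adv)):
        for k in range(len(p)):
            # Score function of a softmax: d log pi(y) / d logit_k = 1[k == y] - pi(k)
            score = (1.0 if k == y else 0.0) - p[k]
            grad[k] += p_y * a_y * score
    return grad


def reverse_kl(student_logits, teacher_logits):
    """KL(pi_theta || pi_T) for one token position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def finite_diff_neg_kl_grad(student_logits, teacher_logits, eps=1e-6):
    """Central-difference gradient of -KL(pi_theta || pi_T) w.r.t. student logits."""
    grad = []
    for k in range(len(student_logits)):
        up = list(student_logits)
        dn = list(student_logits)
        up[k] += eps
        dn[k] -= eps
        diff = reverse_kl(up, teacher_logits) - reverse_kl(dn, teacher_logits)
        grad.append(-diff / (2 * eps))
    return grad


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # student currently prefers token 0
    teacher = [0.0, 1.5, 0.5]    # teacher prefers token 1
    print(expected_opd_gradient(student, teacher))
    print(finite_diff_neg_kl_grad(student, teacher))
```

The two printed vectors should agree up to finite-difference error, and the component for the teacher-preferred token should be positive while the one for the student's over-weighted token is negative. In a real loop the expectation is replaced by sampled rollouts, and the advantage would be detached from the autodiff graph rather than computed by hand.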
Group size $G = 1$ is throughput-optimal here because there's no group baseline to compute (unlike GRPO).

## The same-family teacher requirement

OPD's per-token loss requires the teacher and student to share a tokenizer (so token positions line up between them) and ideally a similar training recipe (so logprobs are calibrated comparably). Cross-family OPD doesn't really work: too much of the signal goes into surface-form differences (formatting, register) rather than reasoning. This is the constraint that motivates the self-distillation variants ([[On-Policy Self-Distillation (OPSD)|OPSD]], [[Self-Distillation Fine-Tuning (SDFT)]]) when no same-family teacher is available.

## Ceilings

OPD's ceiling is the **teacher**. RL's ceiling is the **verifier**. For most production workloads where you want to match a strong teacher, OPD is the better tool. For pushing past the teacher, you need RL with a verifier, or OPD plus an outcome-reward correction (as in MiMo-V2-Flash's MOPD recipe).

## Multi-teacher generalization

[[On-Policy Distillation (OPD)|MOPD]] runs OPD with multiple teachers, routed per-prompt by domain. It is used as a final post-training step in MiMo-V2-Flash, GLM-5, and DeepSeek-V4, and as a mid-pipeline stabilizer in Nemotron-Cascade 2.

## References

- Lu, K. & Thinking Machines Lab, _On-Policy Distillation_ (2025).
- Agarwal et al., _On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes_ (2023).