["Multi-Teacher On-Policy Distillation: A New Post-Training Primitive"](https://yumoxu.notion.site/)
## What problem is MOPD solving?
Modern post-training has a **see-saw problem**. When you do RL to specialize a model on one capability, you tend to lose ground on others:
- **Math RLVR** (RL with verifiable rewards) shortens reasoning traces and hurts open-ended writing.
- **RLHF** buys preference alignment at the cost of strict instruction following.
- **Tool-use RL** drifts away from STEM benchmarks.
Each specialization stage trades against the others, so shipping a single model that holds onto everything is hard. The community has converged on [[On-Policy Distillation (OPD)]] as a fix for this. The natural multi-teacher generalization (MOPD) makes each specialty's strongest checkpoint a teacher and lets one student absorb them all.
To follow the rest of the post, you need three pieces of background: GRPO (the RL loss everything builds on), OPD (the variant that swaps in a teacher KL signal), and IcePop (the patch that handles train/inference numerical mismatch).
## GRPO and OPD primer
[[GRPO]] (Group Relative Policy Optimization) is the standard RL objective for post-training reasoning models. For a prompt $x$, you sample a group of $G$ trajectories from the current policy. Each trajectory gets an outcome reward $R_i$. The per-token advantage is computed group-relatively:
$\hat{A}_{i,t} = \frac{R_i - \text{mean}(R_1, \ldots, R_G)}{\text{std}(R_1, \ldots, R_G)}$
The advantage is the same for every token in trajectory $i$ (broadcast assignment). The loss is a PPO-style clipped objective with an importance ratio $r_{i,t}(\theta) = \pi_\theta(y_{i,t} \mid x, y_{i,<t}) / \pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})$:
$\mathcal{J}_{\text{GRPO}}(\theta) = -\mathbb{E}_{\{y_i\}_{i=1}^G \sim \pi_{\text{infer}}}\left[\frac{1}{G}\sum_i \frac{1}{|y_i|}\sum_t \min\bigl(r_{i,t} \hat{A}_{i,t},\, \text{clip}(r_{i,t}, 1-\varepsilon, 1+\varepsilon)\hat{A}_{i,t}\bigr)\right]$
The full original GRPO uses a standard-deviation normalizer in the advantage, which has since become optional in most best-practice setups.
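A minimal sketch of the two pieces above, in plain Python. It assumes a population standard deviation, a small `eps` to avoid division by zero, and `clip_eps=0.2`; none of these values come from the reports, and the helper names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage (R_i - mean) / std: one scalar per trajectory."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single token."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_loss(rewards, ratios, clip_eps=0.2):
    """ratios[i][t] is r_{i,t}; trajectory i's advantage is broadcast to all its tokens."""
    advantages = group_advantages(rewards)
    per_trajectory = []
    for adv, traj_ratios in zip(advantages, ratios):
        terms = [clipped_term(r, adv, clip_eps) for r in traj_ratios]
        per_trajectory.append(sum(terms) / len(terms))  # the 1/|y_i| average
    return -sum(per_trajectory) / len(per_trajectory)   # the 1/G average, negated


# Toy group of G = 4 rollouts with binary outcome rewards.
example_advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
example_loss = grpo_loss([1.0, 0.0, 0.0, 1.0], [[1.0, 1.0]] * 4)
```

With all importance ratios equal to 1 the clipping is inactive and the loss is just the negated mean advantage, which is zero by construction of the group baseline.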
### From GRPO to OPD: replace the advantage
OPD makes one change. Drop the group-relative outcome-reward advantage and replace it with the per-token reverse KL log-ratio between the teacher and student:
$\hat{A}_{i,t} = \text{sg}\!\left[\log \frac{\pi_{\text{infer}}(y_{i,t} \mid x, y_{i,<t};\, \theta_{\text{teacher}})}{\pi_{\text{train}}(y_{i,t} \mid x, y_{i,<t};\, \theta)}\right]$
`sg` is stop-gradient (teacher logprobs flow into the loss as a numerical reward, not as something to differentiate through). The intuition is direct: when the teacher assigns higher probability than the student to a sampled token, the advantage is positive and the gradient pushes the student to upweight that token. When the teacher assigns lower probability, the gradient pushes it down. The teacher acts as a **dense, per-token reward** along trajectories the student itself sampled.
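As a rough illustration of the sign behavior described above, the sketch below computes the per-token OPD advantage from log-probabilities of the sampled tokens. The function name and the toy probabilities are made up for illustration; plain floats stand in for stop-gradient values.

```python
import math


def opd_advantages(teacher_logprobs, student_logprobs):
    """Per-token advantage: log pi_teacher(y_t) - log pi_student(y_t) on sampled tokens.

    Inputs are plain floats, so the result is already a constant, which mirrors
    the sg[...] in the formula (nothing here is differentiated through).
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("teacher and student must score the same tokens")
    return [t - s for t, s in zip(teacher_logprobs, student_logprobs)]


# Toy rollout of three student-sampled tokens (probabilities are illustrative).
teacher_lp = [math.log(0.9), math.log(0.2), math.log(0.5)]
student_lp = [math.log(0.6), math.log(0.5), math.log(0.5)]
adv = opd_advantages(teacher_lp, student_lp)
# adv[0] > 0: teacher likes the token more, so the student is pushed to upweight it.
# adv[1] < 0: teacher likes it less, so the student is pushed to downweight it.
# adv[2] == 0: teacher and student agree, so there is no signal.
```

Averaged over tokens sampled from the student, the negated advantage is a Monte Carlo estimate of the reverse KL between student and teacher.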
>**Group size 1 is throughput-optimal for OPD.** Because the advantage is computed against the teacher rather than against a group baseline, you don't need multiple rollouts per prompt. Setting $G = 1$ is both valid and faster. This is one of the practical wins of OPD over outcome-reward RL.
### Why reverse KL (and not forward KL)?
Two ways to measure distance between teacher $\pi_T$ and student $\pi_S$:
- **Forward KL:** $\text{KL}[\pi_T \parallel \pi_S] = \mathbb{E}_{y \sim \pi_T}[\log(\pi_T / \pi_S)]$. Mean-seeking. Forces the student to put mass everywhere the teacher does, which spreads the student's distribution across all teacher modes.
- **Reverse KL:** $\text{KL}[\pi_S \parallel \pi_T] = \mathbb{E}_{y \sim \pi_S}[\log(\pi_S / \pi_T)]$. Mode-seeking. Lets the student concentrate on dominant teacher modes, accepting some sharpening (occasionally over-confident).
Two reasons OPD uses reverse KL:
1. **Sampling.** Reverse KL is an expectation under the student's distribution, which means we can sample student rollouts and compute it directly. Forward KL requires teacher samples, which is the off-policy regime.
2. **Distribution shape.** Real text generation is multimodal. Forward KL would force the student to spread mass across all teacher modes, including void regions between them, which produces high-probability nonsense. Reverse KL's mode-seeking behavior is the safer default for multimodal targets.
The classic visualization (from MiniLLM): fitting a Gaussian mixture with a single Gaussian. Forward KL gives you a wide Gaussian centered between the modes (mean-seeking). Reverse KL gives you a sharp Gaussian centered on one mode (mode-seeking).
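The mean-seeking versus mode-seeking contrast can be checked numerically on a small discrete toy. The three distributions below are invented for illustration: a bimodal "teacher", a student that commits to one mode, and a student that spreads mass everywhere.

```python
import math


def kl(p, q):
    """KL[p || q] for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [0.49, 0.005, 0.01, 0.005, 0.49]  # two modes with a near-empty gap
peaked = [0.96, 0.01, 0.01, 0.01, 0.01]     # student that commits to one mode
broad = [0.2, 0.2, 0.2, 0.2, 0.2]           # student that covers everything

# Forward KL: expectation under the teacher.
forward = {"peaked": kl(teacher, peaked), "broad": kl(teacher, broad)}
# Reverse KL: expectation under the student.
reverse = {"peaked": kl(peaked, teacher), "broad": kl(broad, teacher)}
```

In this toy, forward KL scores the broad student better (it is penalized heavily for missing the second mode), while reverse KL scores the peaked student better (it is penalized for putting mass in the gap).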
### Train-inference gap: [[IcePop]]
Training and inference engines often produce different logits for the same input because they use different kernels, batching, and numerical paths. The gap widens dramatically in MoE training because of expert routing nondeterminism. Naively treating inference-time logits as if they were training-time logits introduces noise that can destabilize training.
**IcePop** handles this by masking out tokens whose train/inference probability ratio falls outside a tolerance band $[\alpha, \beta]$:
$\mathcal{J}_{\text{IcePop}}(\theta) = -\mathbb{E}_{...}\left[\frac{1}{G}\sum_i \frac{1}{|y_i|}\sum_t \mathcal{M}\!\left(\frac{\pi_{\text{train}}}{\pi_{\text{infer}}};\, \alpha, \beta\right) \cdot \min(r_{i,t}\hat{A}_{i,t},\, \text{clip}(\cdot)\hat{A}_{i,t})\right]$
where $\mathcal{M}$ is a 0/1 mask. Tokens where the gap is too large are dropped. Three of the four models in this survey (MiMo-V2-Flash, GLM-5, Nemotron-Cascade 2) explicitly adopt IcePop.
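A minimal sketch of the mask, assuming an inclusive band and placeholder bounds `alpha=0.5`, `beta=2.0` (the reports' actual values are not given here). Function names are hypothetical.

```python
import math


def icepop_mask(logp_train, logp_infer, alpha, beta):
    """Return 1.0 if pi_train / pi_infer lies in [alpha, beta], else 0.0."""
    ratio = math.exp(logp_train - logp_infer)
    return 1.0 if alpha <= ratio <= beta else 0.0


def masked_token_terms(token_terms, logps_train, logps_infer, alpha=0.5, beta=2.0):
    """Zero out per-token surrogate terms whose train/inference ratio is out of band.

    alpha and beta are illustrative defaults, not values taken from any report.
    """
    return [
        icepop_mask(lt, li, alpha, beta) * term
        for term, lt, li in zip(token_terms, logps_train, logps_infer)
    ]


# Token 0: engines agree (ratio 1). Token 1: ratio 9, far outside the band.
terms = masked_token_terms(
    [1.0, 2.0],
    [math.log(0.5), math.log(0.9)],
    [math.log(0.5), math.log(0.1)],
)
```

In the formula above the per-trajectory normalizer stays $|y_i|$, so masked tokens contribute zero rather than being removed from the average.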
## The MOPD design space
The four reports differ along five axes. The taxonomy is the spine of the rest of the post.
|Model|Student init|Teachers|Prompts|Stage|ORM augmentation?|
|---|---|---|---|---|---|
|**MiMo-V2-Flash** (Jan 2026)|General SFT checkpoint|SFT + RL specialists + Self|Not specified|Final consolidation|Yes|
|**GLM-5** (Feb 2026)|Post-RL checkpoint|Stage-terminal checkpoints|Each teacher's RL training set|Final consolidation|No|
|**Nemotron-Cascade 2** (Mar 2026)|Post-multi-domain-RL checkpoint|SFT math + RLHF + multi-domain RL|RLHF / IF-RL / multi-domain pools + math|Mid-pipeline|No|
|**DeepSeek-V4** (Apr 2026)|Likely SFT|10+ RL specialists|Not specified|Final consolidation|No|
The most useful axis to start with is **stage** (where in the pipeline MOPD sits). MiMo, GLM-5, and DeepSeek-V4 use it as the final post-training step. Nemotron-Cascade 2 uses it mid-pipeline as a stabilization point between RL stages. The other axes mostly follow from this choice.
## MiMo-V2-Flash, mixed teacher pool with outcome reward
### Pipeline
MiMo-V2-Flash's post-training is three stages:
1. **General SFT** (the student init).
2. **Domain-specialized teachers** trained independently via SFT or RL across agentic domains (search, code, general tools) and non-agentic domains (math, reasoning, safety).
3. **MOPD** distilling the SFT student against the teacher pool, with a token-level reverse-KL signal augmented by an outcome-reward model.
### Teacher composition: SFT + RL + Self
MiMo's teacher pool is the most heterogeneous of the four. It mixes three types:
- **SFT teachers:** domain-specific SFT checkpoints.
- **RL teachers:** RL-trained specialists for verifiable-reward domains.
- **Self:** a snapshot of the student at the start of MOPD, used as a fixed reference distribution.
>"Self" is a frozen snapshot of the student before MOPD begins. On tokens where the SFT/RL teachers push the student into unfamiliar territory, distilling toward Self prevents catastrophic drift. It functions as a regularizer against the more aggressive teacher signals.
### Routing: one teacher per prompt
Multi-teacher does not mean per-token ensembling. Each prompt carries a domain label that **deterministically selects a single teacher**, and per-token advantages are computed against that one teacher. The "multi" in MOPD here refers to the _pool_ of teachers across the dataset, not a per-token mixture. The aggregation happens through sample-level domain mixing.
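Per-prompt routing can be sketched as a plain lookup. The domain labels and teacher names below are hypothetical placeholders, not identifiers from the MiMo report.

```python
# Hypothetical domain -> teacher table; a real pool would hold model handles.
TEACHER_BY_DOMAIN = {
    "math": "math_rl_teacher",
    "code": "code_rl_teacher",
    "general": "self_snapshot",
}


def route_teacher(prompt, teacher_by_domain):
    """Deterministically pick the single teacher for a prompt from its domain label."""
    domain = prompt["domain"]
    if domain not in teacher_by_domain:
        raise KeyError(f"no teacher registered for domain {domain!r}")
    return teacher_by_domain[domain]


def assign_teachers(batch, teacher_by_domain):
    """A mixed-domain batch yields a mix of teachers; each sample still sees exactly one."""
    return [(p["text"], route_teacher(p, teacher_by_domain)) for p in batch]


batch = [
    {"text": "Prove that sqrt(2) is irrational.", "domain": "math"},
    {"text": "Write a haiku about rain.", "domain": "general"},
]
assignments = assign_teachers(batch, TEACHER_BY_DOMAIN)
```

Every token of a given rollout is scored by the same teacher, so the multi-teacher effect comes only from mixing domains across samples.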
### What augments the OPD signal: ORM
Pure OPD has a ceiling problem. The student is being trained to match the teacher's per-token distribution, which propagates the teacher's mistakes, style biases, and suboptimal reasoning patterns. The student inherits everything the teacher knew, including what the teacher got wrong.
MiMo addresses this by combining the dense OPD advantage with an outcome-reward advantage from an ORM:
$\hat{A}_{i,t} = \hat{A}_{i,t}^{\text{OPD}} + \alpha \hat{A}_{i,t}^{\text{ORM}}$
The OPD term stays as the dense imitation backbone. The ORM term is an external corrective signal: rollouts that lead to correct answers get amplified, rollouts that lead to wrong answers get downweighted, regardless of what the teacher's logprobs say.
>What is $\hat{A}^{\text{ORM}}$ exactly? The MiMo paper writes the outcome-reward term generically. GRPO is one natural instantiation, which would require $G > 1$ rollouts per prompt for the group baseline, raising the question of whether those same rollouts are also used for the OPD log-ratio or whether OPD is computed on a separate $G = 1$ rollout. PPO with a learned value baseline preserves $G = 1$ throughout. The paper does not specify which it uses.
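The combined advantage is a one-liner once both terms exist. The sketch below assumes the ORM advantage is a single per-trajectory scalar broadcast to every token (one possible reading, since the paper leaves the term generic) and uses an arbitrary `alpha`.

```python
def combined_advantages(opd_adv, orm_adv, alpha):
    """A_t = A_t^OPD + alpha * A^ORM.

    opd_adv: per-token OPD advantages for one trajectory.
    orm_adv: assumed to be one outcome-level scalar, broadcast across tokens.
    alpha:   scaling coefficient (value here is illustrative only).
    """
    return [a + alpha * orm_adv for a in opd_adv]


# A rollout the teacher mildly dislikes at token 1, but whose final answer was correct.
mixed = combined_advantages([0.1, -0.2], orm_adv=1.0, alpha=0.5)
```

With these toy numbers the outcome term flips the sign at the second token, which is the "regardless of what the teacher's logprobs say" behavior described above.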
The MiMo ablations show MOPD with ORM > MOPD without ORM > pure ORM (i.e. plain RL). The ORM contribution is non-trivial. Crucially, the combined signal lets the student **exceed teacher accuracy** on several benchmarks, which pure imitation can't do.
>The "go beyond the teacher" problem has been studied separately under ExOPD, which uses a different corrective signal: a teacher-vs-reference delta (where the reference is a pre-RL checkpoint), extrapolated past the teacher in the direction the teacher improved. The structural recipe is the same as MiMo's: keep OPD as the dense core, add a scaled corrective term. The choice of corrective signal reflects what you trust more, verifiable rewards (MiMo) or trajectory of teacher improvement (ExOPD).
### Who is the best teacher?
MiMo's benchmark table shows the per-benchmark winning teacher across 12 evaluations. The breakdown is striking:
- **RL teachers win 6 of 12** (math, code, reasoning: the verifiable-reward domains).
- **Self wins 5 of 12** (broad and open-ended tasks where SFT/RL teachers distort calibration).
- **SFT wins 1 of 12** (BrowseComp).
The MOPD student exceeds the best teacher on 8 of 12 tasks (largest gain: +4.1 on Arena-Hard Hard Prompt) and underperforms on 4 (largest loss: −6.3 on BrowseComp).
>**Best standalone teacher is not the same as best distillation teacher.** The MiMo paper does not actually confirm that the best-performing teacher per benchmark was the one selected for MOPD. That's a plausible reading but unconfirmed. And the distinction matters: a teacher with the highest task accuracy can still be a worse distillation teacher if it has worse calibration, worse logprob support on student rollouts, or a distribution mismatch with the student. OPD cares about the teacher's conditional distribution over student rollouts, which is a different thing from the teacher's standalone evaluation accuracy.
## GLM-5, stage-terminal teachers with pure reverse KL
### Pipeline
GLM-5 runs a sequential RL pipeline: multi-task SFT, then Reasoning RL, then Agentic RL, then General RL. Each RL stage produces a terminal checkpoint, and the **final** checkpoint is used as the student init for MOPD.
### Teacher composition: stages, not types
The teachers are simply the **terminal checkpoints of each prior post-training stage**. Each prompt is paired with the stage that originally trained on it (so routing is implicit through prompt sourcing). Compared to MiMo's mixed teacher pool of types, GLM-5's teachers all share lineage and differ only in _when_ they were taken from the pipeline.
### What MOPD is doing here: recovery, not merging
This framing matters. In MiMo, MOPD merges capabilities across heterogeneous specialists. In GLM-5, the role is different: each later RL stage tends to drift away from earlier capabilities, and MOPD is used at the end to **recover** what was lost during sequential specialization. The teachers represent peak performance at each capability before drift set in.
The advantage stays pure reverse-KL OPD. No ORM augmentation.
## Nemotron-Cascade 2, mid-pipeline stabilization
### The problem: capability drift during sequential RL
Nemotron-Cascade 2's framing makes the see-saw problem explicit. Two regressions show up consistently:
- **Non-math RLVR → math reasoning.** Some forms of RLVR reduce model entropy and shorten reasoning traces. This hurts mathematical reasoning, which benefits from longer, more deliberative traces.
- **RLHF → instruction following.** Helpfulness/safety-oriented RLHF can partially trade against strict instruction-following behavior.
In Nemotron's cascade RL pipeline, each downstream stage is liable to regress capabilities established earlier. Putting MOPD only at the end (the MiMo / GLM-5 / DeepSeek pattern) means accepting that drift through every stage and trying to clean it up at the end.
### MOPD as a periodic re-anchor
Nemotron-Cascade 2 inserts MOPD **between** specialization stages instead. The pipeline is:
1. SFT init.
2. IF-RL (instruction-following RL) → produces an IF-tuned checkpoint.
3. Multi-domain RL (STEM MCQ + tool calling + structured output) → produces a multi-domain checkpoint.
4. **MOPD** to re-anchor the student to the strongest checkpoint of each capability.
After MOPD, subsequent specialization stages train on top of a balanced policy rather than a drift-degraded one.
### Multi-domain RL grouping
One implementation note: instead of running STEM, tool-use, and IF as three separate RL stages, Nemotron-Cascade 2 runs them as a single multi-domain RL stage. The reasoning was twofold:
1. **Empirically, no degradation on blended training.** Joint training showed consistent improvements on MMLU-Pro, τ-Bench, and IF-Bench.
2. **Similar response lengths and verification times.** This minimizes the throughput hit from waiting on the slowest verifier or the longest generation in the batch.
### Teacher composition
Three teachers, one per capability:
- **Math teacher: the SFT init itself.** No math-specialized RL checkpoint. The team chose to rely on curated SFT data alone for math signal, on the reasoning that further math RL would risk shortening reasoning traces (the regression mode the paper itself flags).
- **RLHF teacher:** the helpfulness/safety-aligned checkpoint from RLHF.
- **Multi-domain teacher:** the IF-RL + multi-domain RL checkpoint.
Prompts come from each teacher's own training pool (RLHF, IF-RL, multi-domain) plus AceReason-Math for math. IcePop-style importance-weight masking is applied, as in MiMo-V2-Flash and GLM-5.
>**Math teacher = SFT init is a deliberate design choice.** When SFT data quality is high enough that further RL would degrade the capability, "the SFT checkpoint" is itself the right teacher. This is a useful pattern: not every capability benefits from an RL specialist as a teacher.
### Sample efficiency wins
One reported benefit: starting from the same initial checkpoint, MOPD reaches teacher-level performance much faster than GRPO. On AIME25, MOPD recovers teacher-level performance (~92% accuracy) within 30 steps under math-only training. On ArenaHard v2, MOPD reaches 85.5 / 71.0 (Hard Prompt / Creative Writing) within ~52 steps, while RLHF lags behind even with longer training. Dense per-token supervision is much more sample-efficient than broadcast outcome rewards.
## DeepSeek-V4, scaling MOPD with custom infrastructure
DeepSeek-V4's algorithmic recipe is similar to the others (reverse KL OPD with multiple teachers), but the **scale** at which they run it is qualitatively different. The interesting content of this section is mostly the infrastructure work needed to make scale feasible.
### Scale axes
Three things push DeepSeek-V4's MOPD into a regime that requires purpose-built infra:
1. **Full-vocabulary logit distillation.** Most prior OPD implementations estimate the KL only on the sampled token to save memory. That gives a high-variance gradient. DeepSeek preserves the full vocabulary distribution when computing reverse KL, which is more stable but much more memory-hungry. For vocabularies above 100k tokens and contexts at 1M tokens, the logit tensors are massive.
2. **More than 10 teachers.** The teacher pool spans at least four domains (math, coding, agent, instruction following), with some domains further split across reasoning-effort modes (Non-think, Think High, Think Max), each trained as a separate specialist with its own RL configuration.
3. **Model and context size.** 1.6T parameters (49B activated) for DeepSeek-V4-Pro, with 1M-token context.
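To get a feel for the first axis, the sketch below does two things: a back-of-envelope size for one full-vocabulary logit tensor, and a comparison of the exact per-position reverse KL against the sampled-token estimate. The 128k vocabulary and 2-byte storage are assumptions chosen for illustration, not figures from the report.

```python
import math


def logit_bytes(num_tokens, vocab_size, bytes_per_value):
    """Size of one dense [num_tokens, vocab_size] logit tensor."""
    return num_tokens * vocab_size * bytes_per_value


def full_vocab_reverse_kl(student_probs, teacher_probs):
    """Exact KL[pi_S || pi_T] at one position, summed over the whole vocabulary."""
    return sum(
        s * math.log(s / t) for s, t in zip(student_probs, teacher_probs) if s > 0
    )


def sampled_token_estimate(student_probs, teacher_probs, token):
    """Single-sample estimate: the log-ratio at the sampled token only."""
    return math.log(student_probs[token] / teacher_probs[token])


# Assumed sizes: 1M tokens, 128k vocabulary, 2 bytes per value.
total_bytes = logit_bytes(1_000_000, 128_000, 2)  # 256_000_000_000 bytes (256 GB)

# Toy 3-token vocabulary: the estimate swings with the sampled token,
# but its expectation under the student equals the exact KL.
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]
exact = full_vocab_reverse_kl(student, teacher)
estimates = [sampled_token_estimate(student, teacher, k) for k in range(3)]
expected_estimate = sum(s * e for s, e in zip(student, estimates))
```

The sampled estimate is unbiased but can even change sign from token to token, which is the variance that full-vocabulary distillation removes at the cost of memory.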
Each batch may fire 10+ teacher forward passes in addition to the student. Solving this requires three distinct infra contributions.
### FP4 inference quantization
DeepSeek-V4 uses MXFP4 quantization for **all inference-only forward passes**, including teachers and the reference model. Training steps stay in FP8 via a lossless FP4 → FP8 dequantization, leaving the backward pipeline unchanged. The savings are critical when each step needs 10+ teacher forwards.
This is not OPD-specific in design but is what makes the OPD scale tractable.
### Fault-tolerant rollouts via WAL
This is the most interesting infra contribution and the one that's most tied to OPD/RL specifically.
**The problem.** DeepSeek runs on a preemptive scheduler where any task can be evicted. Hardware failures are also frequent at scale. For supervised training, you can just retry a failed mini-batch. For OPD/RL rollouts, retrying is much more delicate, because the model is sampling stochastic trajectories.
The subtle issue is **length bias**. If a long generation is interrupted and you restart from scratch with fresh randomness:
- Short completions are more likely to finish before any interruption.
- Long completions are more likely to be interrupted.
- If interrupted completions are freshly resampled, long samples are replaced more often than short ones.
This skews the accepted-sample distribution toward shorter outputs. The training distribution silently shifts. Regenerating unfinished requests from scratch is **mathematically incorrect**, even if it looks fine operationally.
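The length bias is easy to reproduce in a toy simulation. Everything below is an assumption made for illustration: a two-point length distribution (10 or 200 tokens), a constant per-token preemption probability of 0.005, and hypothetical function names.

```python
import random


def draw_length(rng):
    """Toy length distribution: half the completions are short, half are long."""
    return rng.choice([10, 200])


def survives(length, hazard, rng):
    """True if a completion of `length` tokens finishes without being preempted."""
    return rng.random() < (1.0 - hazard) ** length


def accepted_lengths_restart(n, hazard, rng):
    """Discard interrupted rollouts and resample them from scratch."""
    accepted = []
    while len(accepted) < n:
        length = draw_length(rng)
        if survives(length, hazard, rng):
            accepted.append(length)
    return accepted


def accepted_lengths_resume(n, rng):
    """Resume interrupted rollouts, so every original draw is eventually kept."""
    return [draw_length(rng) for _ in range(n)]


rng = random.Random(0)
n_samples = 5000
restart_mean = sum(accepted_lengths_restart(n_samples, 0.005, rng)) / n_samples
resume_mean = sum(accepted_lengths_resume(n_samples, rng)) / n_samples
```

Under these assumptions a 200-token completion survives far less often than a 10-token one, so the restart policy over-represents short samples and its mean accepted length falls well below the true mean of 105.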
**The fix: token-granular Write-Ahead Log (WAL).** Whenever a new token is generated, it is immediately appended to the request's WAL. On preemption, the inference engine pauses and saves the KV cache of unfinished requests. On resumption, the system continues decoding from the persisted WAL and saved KV cache. If there's a fatal hardware error, prefill can be rerun on the persisted WAL tokens to reconstruct the KV cache.
>**Concrete example.** Suppose the model starts generating "A B C D E F" but gets preempted after "A B C". The WAL persists "A B C". On resumption, the system either restores the saved KV cache or rebuilds it from the WAL via prefill, then continues decoding after token C. The trajectory is a continuation of the original sample rather than a fresh draw, so no length bias is introduced.
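The resume-from-log behavior can be sketched with an in-memory stand-in for the WAL. This is not DeepSeek's implementation: the class, the `stop_after` preemption hook, and the toy sampler are all hypothetical, and KV-cache handling is omitted.

```python
import random


class TokenWAL:
    """In-memory stand-in for a durable per-request token log."""

    def __init__(self):
        self.tokens = []

    def append(self, token):
        # A real system would persist the token durably before moving on.
        self.tokens.append(token)


def decode(wal, next_token, max_len, stop_after=None):
    """Continue decoding from whatever the WAL already holds.

    stop_after simulates a preemption after that many newly generated tokens.
    Returns True if the request finished, False if it was preempted.
    """
    produced = 0
    while len(wal.tokens) < max_len:
        if stop_after is not None and produced >= stop_after:
            return False
        wal.append(next_token(wal.tokens))
        produced += 1
    return True


rng = random.Random(0)


def sample_next(prefix):
    """Toy sampler; a real model would condition on the prefix (rebuilt via prefill)."""
    return rng.choice(["A", "B", "C", "D", "E", "F"])


wal = TokenWAL()
finished_first_try = decode(wal, sample_next, max_len=6, stop_after=3)  # preempted
prefix_at_preemption = list(wal.tokens)
finished_after_resume = decode(wal, sample_next, max_len=6)  # resumes, never restarts
```

The tokens logged before preemption are never resampled, so interrupted requests are completed rather than replaced.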
Deterministic regeneration with the same RNG seed could fix correctness if the inference stack is fully batch-invariant and deterministic. WAL is generally faster because it avoids re-running the early tokens.
### Agentic extension: DeepSeek Elastic Compute (DSec)
For agentic tasks, rollouts include environment transitions (tool calls, sandbox commands) in addition to model-generated tokens. If environment transitions aren't reproducible, the OPD data is corrupted: the student can be trained on trajectories that wouldn't have happened under the original environment state.
DSec keeps a globally ordered trajectory log per sandbox, recording every command and result. On resumption, it can fast-forward by replaying cached results instead of re-running commands. This avoids errors from re-executing non-idempotent operations and enables deterministic replay. It's the sandbox analogue of token WAL.
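A minimal sketch of replay-from-log for a sandbox, assuming commands are replayed in the same order on resumption. The class and function names are hypothetical and do not reflect DSec's actual interface.

```python
class TrajectoryLog:
    """Ordered record of (command, result) pairs for one sandbox."""

    def __init__(self):
        self.entries = []

    def record(self, command, result):
        self.entries.append((command, result))


def run_trajectory(commands, execute, log):
    """Run commands in order, reusing logged results for the already-executed prefix."""
    results = []
    for step, command in enumerate(commands):
        if step < len(log.entries):
            logged_command, result = log.entries[step]
            if logged_command != command:
                raise RuntimeError("replay diverged from the logged trajectory")
        else:
            result = execute(command)  # only genuinely new steps touch the sandbox
            log.record(command, result)
        results.append(result)
    return results


class CounterSandbox:
    """Toy environment with a non-idempotent command."""

    def __init__(self):
        self.value = 0

    def execute(self, command):
        if command == "incr":
            self.value += 1
        return self.value


sandbox = CounterSandbox()
log = TrajectoryLog()
before_preemption = run_trajectory(["incr", "incr"], sandbox.execute, log)
after_resume = run_trajectory(["incr", "incr", "incr"], sandbox.execute, log)
```

On resumption the first two `incr` commands are served from the log, so the counter ends at 3 instead of 5.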
### Metadata/payload separation for global planning
OPD training needs both **global planning** (shuffling, packing, teacher routing) and **per-token data** (tokens, attention masks, teacher hidden states, etc.). The per-token payload is enormous in full-vocab OPD over 1M-token contexts. Loading all of it just to compute a packing layout is infeasible.
DeepSeek-V4 separates the two:
|Category|Usage|Weight|Contents|
|---|---|---|---|
|Metadata|Planning|lightweight|sample id, length, teacher id, domain, offsets, packing metadata|
|Per-token payload|Training|heavy|tokens, attention masks, loss masks, logprobs, teacher hidden states, rewards|
Metadata is loaded for the whole rollout dataset to do global shuffling and packing. Heavy per-token fields are loaded through a shared-memory data loader (avoiding duplicate intra-node copies) and released immediately after each mini-batch is consumed.
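The split can be sketched as planning over lightweight records and loading heavy fields lazily per mini-batch. The field names, the first-fit-decreasing packer, and the loader callback are illustrative assumptions rather than the report's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMeta:
    """Lightweight planning record; no per-token data lives here."""
    sample_id: int
    length: int
    teacher_id: str


def plan_packing(metas, max_tokens):
    """First-fit-decreasing packing computed from metadata alone."""
    bins = []  # each bin is [remaining_capacity, [sample_id, ...]]
    for meta in sorted(metas, key=lambda m: m.length, reverse=True):
        if meta.length > max_tokens:
            raise ValueError(f"sample {meta.sample_id} exceeds max_tokens")
        for b in bins:
            if b[0] >= meta.length:
                b[0] -= meta.length
                b[1].append(meta.sample_id)
                break
        else:
            bins.append([max_tokens - meta.length, [meta.sample_id]])
    return [ids for _, ids in bins]


def iter_minibatches(plan, load_payload):
    """Load heavy per-token fields only for the mini-batch being consumed."""
    for sample_ids in plan:
        yield [load_payload(i) for i in sample_ids]


metas = [
    SampleMeta(0, 600, "math"),
    SampleMeta(1, 500, "code"),
    SampleMeta(2, 400, "math"),
    SampleMeta(3, 300, "agent"),
]
plan = plan_packing(metas, max_tokens=1000)

loaded = []


def fake_loader(sample_id):
    """Stand-in for the shared-memory loader; records which payloads were touched."""
    loaded.append(sample_id)
    return {"sample_id": sample_id}


first_batch = next(iter_minibatches(plan, fake_loader))
```

The packing layout is fixed before any payload is read, and only the samples in the first mini-batch have been loaded by the time it is consumed.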
## Convergence and divergence
### What converged
All four reports independently land on a remarkably consistent core:
- **Reverse-KL OPD over student rollouts** as the central loss.
- **Multi-teacher framework**, one form or another.
- **IcePop-style train/inference mismatch mitigation** (three of four explicitly).
This convergence is notable because the underlying motivations differed. MiMo and DeepSeek-V4 are doing **capability merging** across heterogeneous specialists. GLM-5 and Nemotron-Cascade 2 are doing **forgetting recovery** across sequential RL stages. Reverse-KL OPD turned out to be the right primitive for both.
### What diverged
The interesting differences sit downstream of the algorithm:
- **Teacher composition.**
    - MiMo mixes teacher _types_ (SFT, RL specialist, Self).
    - GLM-5 chains _stages_ (terminal checkpoints of each prior post-training stage).
    - Nemotron-Cascade 2 mixes _capabilities_ (math via SFT init, RLHF teacher, multi-domain RL teacher).
    - DeepSeek-V4 scales _count_ (10+ RL specialists, no SFT-only, no Self).
- **Position in the pipeline.** MiMo, GLM-5, and DeepSeek-V4 run MOPD as a final consolidation. Nemotron-Cascade 2 runs it mid-pipeline as a stabilization point. The choice reflects whether you do forgetting recovery once at the end or continuously reset drift between stages.
- **Augmentation.** MiMo adds a scaled ORM advantage on top of OPD. The other three keep the OPD term pure.
- **Engineering scale.** Only DeepSeek-V4 pushed scale into a regime requiring purpose-built infra: full-vocabulary logits, 10+ teachers, 1.6T parameters, 1M-token context. The other three treat OPD as a comparatively lightweight overlay on existing RL infrastructure.
### Open directions the post flags
- **Scaling teacher count and teacher size.** DeepSeek-V4 shows that a pool of 10+ teachers at trillion-parameter scale is feasible with the right infra. Whether the marginal benefit per additional teacher keeps scaling or saturates is unknown.
- **Black-box distillation.** All four reports rely on teacher logit access. Distilling from API-only teachers (where only sampled tokens are visible) opens a different design space and is mostly unexplored at frontier scale.
- **Teacher-student co-evolution.** Distilled students could re-enter specialist training to produce stronger teachers in an outer loop. Whether the gains compound or diminish across generations is an empirical question.
## TL;DR
- **The see-saw problem:** specializing on one capability via RL often regresses others. MOPD is the current standard fix.
- **OPD core:** sample student rollouts, compute the reverse-KL log-ratio with a teacher per token, drop it in as the advantage in a GRPO-style loop. Reverse KL is mode-seeking, which suits multimodal text distributions; group size 1 is throughput-optimal because there's no group baseline.
- **MOPD = OPD with multiple teachers**, typically routed per-prompt by domain. Aggregation is sample-level, not per-token.
- **IcePop** masks tokens where train/infer logits disagree too much, which is critical in MoE training.
- **Four deployment patterns:**
    - **MiMo-V2-Flash:** mixed teacher pool (SFT + RL + Self) with ORM augmentation as a final consolidation. ORM lets the student exceed teacher accuracy; Self serves as a stability anchor.
    - **GLM-5:** stage-terminal teachers, pure reverse KL, final-stage capability recovery.
    - **Nemotron-Cascade 2:** mid-pipeline stabilization between RL stages. Math teacher is the SFT init itself when further RL would shorten traces.
    - **DeepSeek-V4:** scales to 10+ teachers and full-vocab logits; required custom infra (FP4 inference quantization, token-WAL for fault-tolerant rollouts, metadata/payload separation).
- **Key insight on fault tolerance:** naively re-running interrupted RL rollouts introduces length bias because long generations are interrupted more often. Token-granular WAL (or deterministic seeded regeneration) is needed for correctness, not just performance.
- **The best standalone teacher is not always the best distillation teacher.** OPD cares about the teacher's conditional distribution over student rollouts, which is different from the teacher's standalone benchmark accuracy.
## References
- Yumo Xu, _Multi-Teacher On-Policy Distillation: A New Post-Training Primitive_ (the source post).
- MiMo-V2-Flash technical report (Jan 2026).
- GLM-5 technical report (Feb 2026).
- Nemotron-Cascade 2 technical report (Mar 2026).
- DeepSeek-V4 technical report (Apr 2026).
- ExOPD: extrapolation-based corrective term for OPD.
- IcePop: train/inference logit-mismatch masking.
- MiniLLM: forward vs reverse KL discussion and Gaussian-mixture visualization.