# NeMo-RL

NVIDIA's framework for RL-based LLM post-training. Sits on top of [[MegatronLM]] for training and integrates with inference engines like [[vLLM]] for rollouts. It is the orchestration layer that coordinates the rollout-train-update loop.

## What it provides

- **RL recipes**: implementations of [[GRPO]], [[PPO]], DPO, RLOO, and other post-training algorithms.
- **Rollout coordination**: spawns inference engine workers, distributes prompts, collects rollouts, and sends them back to the trainer.
- **Weight synchronization**: broadcasts updated policy weights from the learner to the rollout engine each step. This is non-trivial at frontier scale because the weights can run to tens or hundreds of GB, so the transfer can become a bottleneck.
- **Verifier integration**: scores rollouts for tasks like math (string match against ground truth), code (sandbox execution), or reward-model scoring.
- **Async execution**: supports async-RL setups where rollout generation and training run concurrently with some policy lag.

## Naming history

- **NeMo-Aligner**: the original name, introduced in Shen et al. 2024.
- **NeMo-RL**: the current name. Refactored library form, designed to integrate cleanly with Megatron-Core.

These are the same lineage; older papers reference NeMo-Aligner, newer ones use NeMo-RL.

## Architecture

Three components running in parallel:

1. **Learner** ([[MegatronLM]]): holds the current policy, computes log-probs and gradients.
2. **Rollout engine** ([[vLLM]]): samples completions from the policy.
3. **Verifier**: scores completions for reward.

Per RL step:

1. Learner sends weights → rollout engine.
2. Rollout engine generates completions for a batch of prompts → verifier.
3. Verifier scores completions → returns rewards to learner.
4. Learner recomputes log-probs (because the values vLLM reports don't always match the learner's own numerics), applies the loss, takes a gradient step.
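
A minimal sketch of the per-step loop above, in plain Python with no NeMo-RL, Megatron, or vLLM imports. Every name here (`ToyLearner`, `ToyRolloutEngine`, `verify`, `rl_step`) is hypothetical and is not NeMo-RL's API. The "policy" is assumed to be a single logit that picks between a right and a wrong answer, and the update is plain REINFORCE with a mean-reward baseline, standing in for GRPO or PPO.

```python
import math
import random
from dataclasses import dataclass


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class ToyLearner:
    """Hypothetical stand-in for the trainer: the whole policy is one logit."""
    theta: float = 0.0
    version: int = 0

    def export_weights(self) -> dict:
        return {"theta": self.theta, "version": self.version}

    def log_prob(self, completion: str) -> float:
        # Recomputed on the learner side instead of trusting the engine.
        p_right = sigmoid(self.theta)
        return math.log(p_right if completion == "4" else 1.0 - p_right)

    def train_step(self, completions, rewards, lr: float = 0.5) -> float:
        baseline = sum(rewards) / len(rewards)
        logps = [self.log_prob(c) for c in completions]
        loss = -sum((r - baseline) * lp for r, lp in zip(rewards, logps)) / len(rewards)
        # Analytic gradient of log pi(c) w.r.t. theta for a Bernoulli policy.
        p_right = sigmoid(self.theta)
        grad = sum(
            (r - baseline) * ((1.0 - p_right) if c == "4" else -p_right)
            for c, r in zip(completions, rewards)
        ) / len(rewards)
        self.theta += lr * grad  # gradient ascent on expected reward
        self.version += 1
        return loss


@dataclass
class ToyRolloutEngine:
    """Hypothetical stand-in for the inference engine; holds a weight copy."""
    theta: float = 0.0
    version: int = -1

    def load_weights(self, weights: dict) -> None:
        self.theta = weights["theta"]
        self.version = weights["version"]

    def generate(self, prompts, rng: random.Random) -> list:
        p_right = sigmoid(self.theta)
        return ["4" if rng.random() < p_right else "5" for _ in prompts]


def verify(completions, answers) -> list:
    """Math-style verifier: string match against ground truth."""
    return [1.0 if c.strip() == a else 0.0 for c, a in zip(completions, answers)]


def rl_step(learner, engine, prompts, answers, rng) -> float:
    engine.load_weights(learner.export_weights())  # 1. weight sync
    completions = engine.generate(prompts, rng)    # 2. rollouts
    rewards = verify(completions, answers)         # 3. scoring
    learner.train_step(completions, rewards)       # 4. recompute log-probs, update
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    rng = random.Random(0)
    learner, engine = ToyLearner(), ToyRolloutEngine()
    prompts, answers = ["2+2="] * 16, ["4"] * 16
    for step in range(20):
        mean_reward = rl_step(learner, engine, prompts, answers, rng)
        print(f"step={step} policy_version={learner.version} mean_reward={mean_reward:.2f}")
```

After each call to `rl_step`, the engine's `version` trails the learner's by one, which is the synchronous case. An async setup would let that gap grow to a bounded policy lag instead of syncing before every batch.
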
## Where it shows up in my notes

The spec decoding RL paper builds on NeMo-RL and inserts spec decoding into the rollout engine, with the rest of the orchestration unchanged.

## References

- Shen et al., _NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment_ (2024).
- NeMo-RL documentation: https://github.com/NVIDIA/NeMo-RL