RLHF - Reinforcement Learning from Human Feedback

[[Large Language Models|Alignment technique for LLMs]] based on ideas from [[- Reinforcement Learning -]].

---

1. Construct a dataset of comparisons:
   $$\mathcal{D}_{\text{prefs}} = \{(x_i, y_i^+, y_i^-)\}_{i=1}^N$$
   where $y_i^+$ is the response a human annotator preferred over $y_i^-$ for prompt $x_i$.
2. Train a reward model $r_\phi(x, y)$ to predict human preference rankings using a binary logistic loss:
   $$\mathcal{L}_{\text{RM}}(\phi) = -\sum_i \log \sigma\left( r_\phi(x_i, y_i^+) - r_\phi(x_i, y_i^-) \right),$$
   where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the [[Activation Functions|sigmoid]] [[Maps and Induced Structures - Functions, Pushforwards and Pullbacks|function]].
3. Fine-tune the policy $\pi_\theta$ (the LLM being aligned) using [[PPO - Proximal Policy Optimization|Proximal Policy Optimization (PPO)]], maximizing expected reward while penalizing divergence from a frozen reference model $\pi_{\text{ref}}$. The objective to maximize is
   $$\mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}\left[\frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} A(x, y)\right] - \beta \, \text{KL}(\pi_\theta \Vert \pi_{\text{ref}})$$
   where $A(x, y)$ is an advantage function estimated from the reward $r_\phi(x, y)$ and $\beta$ controls the strength of the KL regularization.
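
The reward-model loss from step 2 and the KL-penalized objective from step 3 can be checked numerically on toy values. The following is a minimal NumPy sketch, not a training implementation. The function names `reward_model_loss` and `kl_penalized_objective` are hypothetical. For step 3 it assumes a single prompt with a small, fully enumerated set of candidate responses, strictly positive probabilities, and an expectation taken over $y \sim \pi_{\text{ref}}$ (the note leaves the sampling distribution unspecified, so this is an assumption).

```python
import numpy as np


def reward_model_loss(r_pos, r_neg):
    """Pairwise logistic loss: -sum_i log sigma(r(x_i, y_i^+) - r(x_i, y_i^-)).

    r_pos, r_neg: reward-model scores for the preferred and rejected
    response of each comparison (hypothetical inputs for illustration).
    """
    r_pos = np.asarray(r_pos, dtype=float)
    r_neg = np.asarray(r_neg, dtype=float)
    # -log sigma(z) = log(1 + exp(-z)), computed stably with logaddexp.
    return float(np.sum(np.logaddexp(0.0, -(r_pos - r_neg))))


def kl_penalized_objective(p_theta, p_ref, advantages, beta):
    """Exact value of E[(pi_theta / pi_ref) * A] - beta * KL(pi_theta || pi_ref).

    Assumption: one prompt, responses enumerated, expectation over y ~ pi_ref,
    and every probability strictly positive.
    """
    p_theta = np.asarray(p_theta, dtype=float)
    p_ref = np.asarray(p_ref, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = p_theta / p_ref                          # importance ratio
    surrogate = np.sum(p_ref * ratio * advantages)   # E_{y ~ pi_ref}[ratio * A]
    kl = np.sum(p_theta * np.log(p_theta / p_ref))   # KL(pi_theta || pi_ref)
    return float(surrogate - beta * kl)


if __name__ == "__main__":
    # Two comparisons; the reward model scores the preferred response higher.
    print(reward_model_loss([2.0, 1.5], [0.5, 1.0]))

    # Three candidate responses for one prompt.
    p_ref = [0.5, 0.3, 0.2]
    p_theta = [0.6, 0.3, 0.1]
    adv = [1.0, 0.0, -1.0]
    print(kl_penalized_objective(p_theta, p_ref, adv, beta=0.1))
```

When $\pi_\theta = \pi_{\text{ref}}$ the ratio is 1 and the KL term vanishes, so the objective reduces to the expected advantage under the reference model. Raising `beta` lowers the objective for any policy that has moved away from the reference.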