## New highlights added August 10, 2023 at 7:26 AM

- In a way, the preference model serves as a means to **directly introduce a *"human preference bias"* into the base model**. ([View Highlight](https://read.readwise.io/read/01h7fjjsrf74gajj0wqz99j2qs)) (see the sketch after these highlights)
- Truthful-oriented RLHF tuning, as [applied](https://openai.com/research/instruction-following?ref=assemblyai.com) at OpenAI, is explicitly intended to reduce **hallucination** – the tendency of LLMs to sometimes generate false or invented statements. However, RLHF tuning often *worsens* hallucination, as a [study](https://arxiv.org/abs/2203.02155?ref=assemblyai.com) based on InstructGPT reported. Ironically, that same study served as the primary basis for ChatGPT's initial RLHF design. ([View Highlight](https://read.readwise.io/read/01h7fjzqsh9p32tx67kkqvae36))
- The precise mechanism behind the hallucination phenomenon is still largely unclear. One [hypothesis](https://arxiv.org/abs/2110.10819?ref=assemblyai.com), from a Google DeepMind paper, suggests that LLMs hallucinate because they *"lack an understanding of the cause and effect of their actions."* A different view posits that hallucinations arise from the behavior-cloning nature of language models, especially when an LLM is trained to *mimic responses containing information it doesn't possess*. ([View Highlight](https://read.readwise.io/read/01h7fk1kph0d9vv4mp09n5bg9p))
- Note: I must be missing something here...
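
For context on the first highlight: a preference (reward) model is commonly trained with a pairwise Bradley-Terry loss over human-ranked response pairs, which is one way the "human preference bias" gets baked into the scores that later steer the base model. The sketch below is a minimal illustration under that assumption, not code from the source; `preference_loss` and the toy tensors are made up for demonstration.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: push the reward model to score the
    human-preferred response higher than the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards a reward model might assign to a batch of
# (chosen, rejected) response pairs labeled by human annotators.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -1.0])
print(preference_loss(r_chosen, r_rejected))  # smaller when chosen outscores rejected
```

Minimizing this loss only encodes *relative* human rankings, which is consistent with the point above: the resulting reward signal reflects annotator preference rather than factual accuracy, so optimizing against it need not reduce hallucination.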