# The Q* Hypothesis: Tree-of-Thoughts Reasoning, Process Reward Models, and Supercharging Synthetic Data
![[Attachments/a9cf23b8241d217df484f8cec301cf1f_MD5.jpg]]
## Metadata
- Author: [[Nathan Lambert]]
- Full Title: The Q* Hypothesis: Tree-of-Thoughts Reasoning, Process Reward Models, and Supercharging Synthetic Data
- Category: #articles
- Date: 2023-11-25
- URL: https://www.interconnects.ai/p/q-star
- [ ] #toFile ➕ 2023-11-25
- [ ] #toProcess ➕ 2023-11-25
## Highlights added 2023-11-25
- • **Self-play** is the idea that an agent can improve its gameplay by playing against slightly different versions of itself because it’ll progressively encounter more challenging situations. In the space of LLMs, it is almost certain that the largest portion of self-play will look like AI Feedback rather than competitive processes.
  • **Look-ahead planning** is the idea of using a model of the world to reason into the future and produce better actions or outputs. The two variants are based on [Model Predictive Control](https://en.wikipedia.org/wiki/Model_predictive_control) (MPC), which is often used on continuous states, and [Monte-Carlo Tree Search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) (MCTS), which works with discrete actions and states. ([View Highlight](https://read.readwise.io/read/01hg1hcwn6y8sh3b75vyb4fs73))
- Note: Interesting ideas I never really explored while I was working in [[Deep Learning]], since I never did any [[Reinforcement Learning]]. A minimal MCTS sketch is included below to make the look-ahead planning idea concrete.
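
To make the look-ahead planning highlight concrete, here is a minimal Monte-Carlo Tree Search sketch over a toy discrete problem. The toy environment (integer states, ±1 actions, a fixed horizon), the reward, and the hyperparameters are assumptions made purely for illustration; this is not the article's implementation, and per the article's hypothesis the LLM analogue would search over reasoning steps scored by a process reward model rather than integer states.

```python
# Minimal MCTS sketch illustrating look-ahead planning over discrete actions.
# Everything here (toy environment, reward, horizon, constants) is assumed
# for illustration only.
import math
import random

# Toy world: state is an integer, actions move it by -1 or +1, and the best
# outcome is ending as close as possible to TARGET after HORIZON steps.
ACTIONS = (-1, +1)
TARGET = 5
HORIZON = 8

def step(state, action):
    return state + action

def reward(state):
    # Higher (less negative) reward the closer the final state is to TARGET.
    return -abs(state - TARGET)

class Node:
    def __init__(self, state, depth, parent=None):
        self.state = state
        self.depth = depth
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0
        self.value = 0.0     # running mean of rollout returns

    def ucb_child(self, c=1.4):
        # Pick the child maximizing the UCB1 score (exploitation + exploration).
        return max(
            self.children.values(),
            key=lambda n: n.value + c * math.sqrt(math.log(self.visits) / n.visits),
        )

def rollout(state, depth):
    # Simulation: take random actions until the horizon, then score the state.
    while depth < HORIZON:
        state = step(state, random.choice(ACTIONS))
        depth += 1
    return reward(state)

def mcts(root_state, iterations=500):
    root = Node(root_state, depth=0)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via UCB1.
        while len(node.children) == len(ACTIONS) and node.depth < HORIZON:
            node = node.ucb_child()
        # 2. Expansion: add one untried action, if the horizon allows.
        if node.depth < HORIZON:
            untried = [a for a in ACTIONS if a not in node.children]
            action = random.choice(untried)
            child = Node(step(node.state, action), node.depth + 1, parent=node)
            node.children[action] = child
            node = child
        # 3. Simulation: random rollout from the new node.
        ret = rollout(node.state, node.depth)
        # 4. Backpropagation: update visit counts and mean values up the tree.
        while node is not None:
            node.visits += 1
            node.value += (ret - node.value) / node.visits
            node = node.parent
    # Recommend the most-visited action at the root.
    return max(root.children, key=lambda a: root.children[a].visits)

if __name__ == "__main__":
    print("Best first action from state 0:", mcts(0))
```

The four phases (selection via UCB1, expansion, random rollout, backpropagation) are the standard MCTS loop the highlight refers to; MPC, by contrast, optimizes a short action sequence over continuous states with a world model and re-plans at every step.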