2025-04-25 [pdf](https://arxiv.org/pdf/2504.16078) claude
# LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
## SUMMARY
LLMs demonstrate sub-optimal performance in decision-making scenarios due to greediness, frequency bias, and the knowing-doing gap. Reinforcement Learning Fine-Tuning (RLFT) on self-generated Chain-of-Thought rationales enhances LLMs' decision-making by increasing exploration and narrowing the knowing-doing gap. The study compares various exploration mechanisms across multi-armed bandits, contextual bandits, and Tic-tac-toe, finding that fine-tuning with exploration bonuses and sufficient "thinking time" significantly improves performance.
## OUTLINE
- **Introduction**
- LLMs show promise for decision-making but perform sub-optimally
- Common failure modes: greediness, frequency bias, knowing-doing gap
- Proposed solution: RLFT on self-generated CoT rationales
- **Methodology**
- Reinforcement Learning Fine-Tuning approach
- Context representation and action factorization
- Reward shaping for valid actions
- Fine-tuning objective with PPO (see the sketch after this outline)
- **Experimental Settings**
- Multi-armed bandits (MABs)
- Contextual bandits (CBs)
- Tic-tac-toe environment
- Baselines (UCB, random agent)
- **Analysis of LLM Failure Modes**
- Greediness: premature commitment to sub-optimal actions
- Frequency bias: tendency to copy most frequent actions
- Knowing-doing gap: correct rationales but incorrect actions
- **Effects of RLFT**
- Improved performance across environments
- Mitigation of greediness through increased exploration
- Reduction of frequency bias
- **Exploration Mechanisms**
- Try-all exploration strategy
- ε-greedy approach
- Context randomization and summarization
- Self-correction and self-consistency
- Exploration bonuses
- **Ablation Studies**
- RLFT in Tic-tac-toe
- Importance of CoT for effective RLFT
- Expert behavior cloning vs. thought cloning
- Effect of increased "thinking time"
- **Limitations and Future Work**
- Limited model sizes
- Restricted horizon in experiments
- Computational cost of increased thinking time
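The Methodology items above center on PPO applied to the model's self-generated CoT and action tokens, with the environment reward shaped to penalize invalid actions. As a rough illustration only, here is the standard clipped PPO surrogate; this is not necessarily the paper's exact token-level formulation, and the variable names and clip value are my assumptions:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate (returned as a loss to minimize).

    logp_new / logp_old: log-probabilities of the generated tokens under the current
    and behavior policies; advantages: per-token (or per-step) advantage estimates.
    clip_eps=0.2 is a common default, not a value taken from the paper.
    """
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # negate for gradient descent
```

The reward that ultimately drives these advantages is the environment reward plus the valid-action shaping term listed in the outline.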
## TABLE
|Failure Mode|Description|Example|RLFT Effect|
|---|---|---|---|
|Greediness|Overly favoring best seen action|2B/9B/27B explore only 40-65% of 10 arms|+12% action coverage after 30K updates|
|Frequency Bias|Selecting most frequent action regardless of reward|2B: 96% frequent action selection|Reduced to 35% for low repetitions|
|Knowing-Doing Gap|Correct reasoning but poor action selection|87% correct rationales but 58% greedy actions|Narrowed gap between reasoning and acting|
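A minimal sketch of how two quantities behind this table (action coverage and the most-frequent-action rate) could be measured from an interaction log; this is my own rough proxy, not the paper's evaluation code:

```python
from collections import Counter

def coverage_and_frequency(actions, num_arms=10):
    """actions: chronological list of arm indices chosen by the agent in one bandit episode."""
    coverage = len(set(actions)) / num_arms                     # fraction of arms tried at least once
    top_action, top_count = Counter(actions).most_common(1)[0]
    return coverage, top_count / len(actions)                   # how often the single most frequent arm was chosen

# Example: an agent that locks onto arm 3 after a brief look around.
cov, freq = coverage_and_frequency([0, 3, 3, 3, 1, 3, 3, 3, 3, 3])
print(f"coverage={cov:.0%}, most-frequent-action rate={freq:.0%}")   # coverage=30%, rate=80%
```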
## Key Insights
### Genius
The study systematically identifies and quantifies three distinct failure modes of LLMs in decision-making scenarios, providing a framework for understanding why these powerful models underperform in simple tasks despite their capabilities.
### Interesting
Despite their extensive pre-training, even large-scale LLMs (27B parameters) exhibit simplistic decision-making strategies like greediness rather than proper exploration-exploitation balancing that would lead to better outcomes.
### Significance
The research demonstrates that fine-tuning LLMs with reinforcement learning on their own generated rationales can significantly improve decision-making abilities, suggesting a pathway for developing more effective AI agents.
### Surprising
Larger models (27B) overcome frequency bias but still exhibit greedy behaviors, indicating that simply scaling up model size does not solve fundamental decision-making limitations.
### Paradoxical
LLMs can correctly describe optimal decision-making algorithms and compute appropriate values (87% correct rationales) but fail to act on this knowledge, highlighting a disconnect between reasoning and action.
### Key Insight
Providing LLMs with more "thinking time" (larger generation budgets) significantly improves decision-making performance, suggesting that constraints on reasoning length may be limiting agent capabilities.
### Takeaway Message
Effective LLM agents require not just knowledge and reasoning abilities but also properly trained action policies, which can be achieved through reinforcement learning approaches that balance exploration and exploitation.
### Duality
The paper reveals a fundamental tension in LLM agents between "knowing" (correct reasoning) and "doing" (appropriate action selection), which must be reconciled for effective decision-making.
### Many-to-One
Multiple exploration mechanisms (try-all actions, ε-greedy, exploration bonuses) converge toward similar improvements in LLM agent performance, suggesting various viable approaches to enhancing exploration.
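To make this concrete, here is a toy sketch of the three named mechanisms applied to arm selection in a multi-armed bandit. It is my own simplification, not the paper's prompting or fine-tuning setup, and the default parameter values are illustrative assumptions:

```python
import math, random

def pick_arm(counts, means, t, mode="epsilon", epsilon=0.1, bonus_c=2.0):
    """counts[i]: pulls of arm i, means[i]: empirical mean reward of arm i, t: current step.

    'try_all' - force one pull of every arm before exploiting (try-all warm-up)
    'epsilon' - exploit with probability 1-epsilon, otherwise pick a random arm
    'bonus'   - add an uncertainty bonus to each empirical mean (UCB-style exploration bonus)
    """
    arms = range(len(counts))
    if mode == "try_all" and min(counts) == 0:
        return counts.index(0)                                  # first arm not yet tried
    if mode == "epsilon" and random.random() < epsilon:
        return random.choice(list(arms))
    if mode == "bonus":
        return max(arms, key=lambda i: means[i] + bonus_c * math.sqrt(math.log(t + 1) / (counts[i] + 1)))
    return max(arms, key=lambda i: means[i])                    # greedy exploitation
```

In the paper these mechanisms are realized at the prompt or reward level for the LLM agent rather than as a hand-coded selection rule like this one.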
---
---
---
# Source of the Greediness
Based on the paper, I can rank the potential sources of greediness in LLMs by significance:
1. **Pre-training Distribution Bias**: The most significant factor appears to be that LLMs are pre-trained on sequences that don't adequately represent exploration-exploitation tradeoffs. They're trained to predict the most likely continuation rather than to explore uncertain options with potentially higher long-term value.
2. **Supervised Learning Paradigm**: The paper suggests that greediness is "an artifact of supervised pre-training," similar to the "copycat bias" in behavioral cloning where models tend to replicate the most frequently observed patterns rather than optimizing for reward.
3. **Maximum Likelihood Objective**: LLMs are trained to maximize the likelihood of the next token, which naturally leads to greedy behavior: selecting what currently looks best given the available information rather than weighing the value of gathering more information.
4. **Lack of Reinforcement Feedback During Pre-training**: The models don't receive explicit reward signals during pre-training that would teach them to balance exploration and exploitation.
5. **Context Length Limitations**: The limited context window may prevent LLMs from fully accounting for the benefits of exploration over longer horizons.
6. **Insufficient Exposure to Decision Theory**: Pre-training data likely contains limited examples of proper decision-making under uncertainty compared to other content types.
The paper doesn't explicitly rank these causes but does highlight that RLFT helps mitigate this tendency by providing reward signals that encourage exploration, suggesting that the supervised training paradigm is a fundamental source of the problem.
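The exploration-encouraging reward signal mentioned above can be pictured as simple reward shaping. This is my own sketch; the penalty and bonus values are assumptions, not the paper's numbers:

```python
def shaped_reward(env_reward, action_is_valid, action_is_new,
                  invalid_penalty=-5.0, exploration_bonus=1.0):
    """Combine the environment reward with shaping terms for RLFT-style fine-tuning.

    - invalid_penalty punishes outputs that do not parse into a legal action
      (the outline's "reward shaping for valid actions"; the exact value here is assumed).
    - exploration_bonus rewards trying an action not yet taken in the episode,
      one simple way to encode an exploration bonus.
    """
    reward = env_reward
    if not action_is_valid:
        reward += invalid_penalty
    elif action_is_new:
        reward += exploration_bonus
    return reward
```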
---
---
---
# Relationship Between Greediness and Optimal Decision Making
Based on the paper, the relationship between greediness and optimal decision making is complex and multifaceted:
## Fundamental Tension
Greediness in LLMs represents a **premature commitment to exploitation** at the expense of proper exploration. Optimal decision making, in contrast, requires achieving the right **balance between exploration and exploitation**. This is a classic problem in reinforcement learning.
## Key Distinctions
- **Greedy decisions** involve selecting actions that maximize immediate or observed rewards based only on current information.
- **Optimal decisions** involve selecting actions that maximize expected cumulative rewards over time, which may require sacrificing immediate rewards to gain information.
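A small worked contrast between the two rules, using the standard UCB1 bonus term and made-up statistics:

```python
import math

# Empirical statistics after t = 10 pulls of a 3-armed bandit (made-up numbers).
means  = [0.8, 0.6, 0.0]   # observed mean reward per arm
counts = [8,   1,   1]     # pulls per arm
t = sum(counts)

greedy = max(range(3), key=lambda i: means[i])
ucb    = max(range(3), key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))

print(greedy, ucb)   # greedy sticks with arm 0; UCB prefers the under-explored arm 1
```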
## Quantifiable Impact
The paper demonstrates that greedy behavior leads to:
- Stagnating action coverage (even the 27B model explores only 65% of the available actions)
- Sub-optimal cumulative reward (higher regret than the UCB baseline)
- Missed opportunities to discover potentially better actions
## Knowing-Doing Gap Insight
Interestingly, the study reveals that LLMs often **"know" the correct decision rule** (e.g., they can compute UCB values) but still **"do" the greedy action**:
- 87% of rationales correctly identified optimal actions
- Yet 58% of actions were greedy rather than optimal, despite correct reasoning
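One rough way to tally such a gap from logged interactions; this is a sketch under my own assumptions about what gets logged, and the names rationale_action and taken_action are hypothetical, not from the paper:

```python
def knowing_doing_gap(steps, optimal_action):
    """steps: list of (rationale_action, taken_action) pairs, where rationale_action is
    the action the model's own CoT declares best and taken_action is what it played."""
    knowing = [(r, a) for r, a in steps if r == optimal_action]       # "knows" the right answer
    acted_wrong = [1 for r, a in knowing if a != optimal_action]      # ...but "does" something else
    return {
        "knowing_rate": len(knowing) / len(steps),
        "gap_rate": len(acted_wrong) / max(len(knowing), 1),
    }
```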
## Temporal Dimension
The greedy tendency becomes more problematic over time:
- Initially, greedy and optimal actions might align
- As interaction continues, the cost of insufficient exploration compounds
- The paper shows this as "stagnating action coverage beyond 10 steps"
This relationship highlights why simply scaling model size or improving reasoning capabilities is insufficient for optimal decision making. The research suggests that specific training objectives like RLFT are needed to overcome the inherent greediness in language models and align them with principles of optimal sequential decision making.
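To make the compounding cost of under-exploration concrete, here is a toy simulation (my own, with made-up Bernoulli arms) comparing the cumulative regret of a purely greedy rule against UCB1:

```python
import math, random

def run(policy, probs, horizon=500, seed=0):
    random.seed(seed)
    best = max(probs)
    counts, sums, regret = [0] * len(probs), [0.0] * len(probs), 0.0
    for t in range(1, horizon + 1):
        if 0 in counts:                          # pull each arm once to initialize estimates
            a = counts.index(0)
        else:
            means = [s / c for s, c in zip(sums, counts)]
            if policy == "greedy":
                a = max(range(len(probs)), key=lambda i: means[i])
            else:                                # UCB1: empirical mean plus confidence bonus
                a = max(range(len(probs)), key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if random.random() < probs[a] else 0.0
        counts[a] += 1; sums[a] += r
        regret += best - probs[a]                # expected regret of this pull
    return regret

probs = [0.2, 0.5, 0.8]                          # made-up Bernoulli arms
print(run("greedy", probs), run("ucb", probs))   # greedy's regret typically grows much faster
```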