# LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
## SUMMARY
LLMs demonstrate sub-optimal performance in decision-making scenarios due to greediness, frequency bias, and the knowing-doing gap. Reinforcement Learning Fine-Tuning (RLFT) on self-generated Chain-of-Thought rationales enhances LLMs' decision-making by increasing exploration and narrowing the knowing-doing gap. The study compares various exploration mechanisms across multi-armed bandits, contextual bandits, and Tic-tac-toe, finding that fine-tuning with exploration bonuses and sufficient "thinking time" significantly improves performance.
## OUTLINE
- **Introduction**
- LLMs show promise for decision-making but perform sub-optimally
- Common failure modes: greediness, frequency bias, knowing-doing gap
- Proposed solution: RLFT on self-generated CoT rationales
- **Methodology**
- Reinforcement Learning Fine-Tuning approach
- Context representation and action factorization
- Reward shaping for valid actions
- Fine-tuning objective with PPO
- **Experimental Settings**
- Multi-armed bandits (MABs)
- Contextual bandits (CBs)
- Tic-tac-toe environment
- Baselines (UCB, random agent)
- **Analysis of LLM Failure Modes**
- Greediness: premature commitment to sub-optimal actions
- Frequency bias: tendency to copy most frequent actions
- Knowing-doing gap: correct rationales but incorrect actions
- **Effects of RLFT**
- Improved performance across environments
- Mitigation of greediness through increased exploration
- Reduction of frequency bias
- **Exploration Mechanisms**
- Try-all exploration strategy
- ε-greedy approach
- Context randomization and summarization
- Self-correction and self-consistency
- Exploration bonuses
- **Ablation Studies**
- RLFT in Tic-tac-toe
- Importance of CoT for effective RLFT
- Expert behavior cloning vs. thought cloning
- Effect of increased "thinking time"
- **Limitations and Future Work**
- Limited model sizes
- Restricted horizon in experiments
- Computational cost of increased thinking time
## TABLE
|Failure Mode|Description|Example|RLFT Effect|
|---|---|---|---|
|Greediness|Overly favoring best seen action|2B/9B/27B explore only 40-65% of 10 arms|+12% action coverage after 30K updates|
|Frequency Bias|Selecting most frequent action regardless of reward|2B: 96% frequent action selection|Reduced to 35% for low repetitions|
|Knowing-Doing Gap|Correct reasoning but poor action selection|87% correct rationales but 58% greedy actions|Narrowed gap between reasoning and acting|
## Key Insights
### Genius
The study systematically identifies and quantifies three distinct failure modes of LLMs in decision-making scenarios, providing a framework for understanding why these powerful models underperform in simple tasks despite their capabilities.
### Interesting
Despite their extensive pre-training, even large-scale LLMs (27B parameters) fall back on simplistic decision-making strategies such as greediness instead of the exploration-exploitation balancing that would yield better outcomes.
### Significance
The research demonstrates that fine-tuning LLMs with reinforcement learning on their own generated rationales can significantly improve decision-making abilities, suggesting a pathway for developing more effective AI agents.
### Surprising
Larger models (27B) largely overcome frequency bias but still exhibit greedy behaviors, indicating that scaling up model size alone does not solve fundamental decision-making limitations.
### Paradoxical
LLMs can correctly describe optimal decision-making algorithms and compute appropriate values (87% correct rationales) but fail to act on this knowledge, highlighting a disconnect between reasoning and action.
### Key Insight
Providing LLMs with more "thinking time" (larger generation budgets) significantly improves decision-making performance, suggesting that constraints on reasoning length may be limiting agent capabilities.
### Takeaway Message
Effective LLM agents require not just knowledge and reasoning abilities but also properly trained action policies, which can be achieved through reinforcement learning approaches that balance exploration and exploitation.
### Duality
The paper reveals a fundamental tension in LLM agents between "knowing" (correct reasoning) and "doing" (appropriate action selection), which must be reconciled for effective decision-making.
### Many-to-One
Multiple exploration mechanisms (try-all actions, ε-greedy, exploration bonuses) converge toward similar improvements in LLM agent performance, suggesting various viable approaches to enhancing exploration.
JOHN
# what is the source of the greediness tendency? ranked by significance
Based on the paper, I can rank the potential sources of greediness in LLMs by significance:
1. **Pre-training Distribution Bias**: The most significant factor appears to be that LLMs are pre-trained on sequences that don't adequately represent exploration-exploitation tradeoffs. They're trained to predict the most likely continuation rather than to explore uncertain options with potentially higher long-term value.
2. **Supervised Learning Paradigm**: The paper suggests that greediness is "an artifact of supervised pre-training," similar to the "copycat bias" in behavioral cloning where models tend to replicate the most frequently observed patterns rather than optimizing for reward.
3. **Maximum Likelihood Objective**: LLMs are trained to maximize the likelihood of the next token, which naturally leads to greedy behavior - selecting what seems best given current information rather than considering future information value.
4. **Lack of Reinforcement Feedback During Pre-training**: The models don't receive explicit reward signals during pre-training that would teach them to balance exploration and exploitation.
5. **Context Length Limitations**: The limited context window may prevent LLMs from fully accounting for the benefits of exploration over longer horizons.
6. **Insufficient Exposure to Decision Theory**: Pre-training data likely contains limited examples of proper decision-making under uncertainty compared to other content types.
The paper doesn't explicitly rank these causes but does highlight that RLFT helps mitigate this tendency by providing reward signals that encourage exploration, suggesting that the supervised training paradigm is a fundamental source of the problem.
JOHN
# what is the relationship between GREEDINESS and OPTIMAL DECISION MAKING
Based on the paper, the relationship between greediness and optimal decision making is complex and multifaceted:
## Fundamental Tension
Greediness in LLMs represents a **premature commitment to exploitation** at the expense of proper exploration. Optimal decision making, in contrast, requires achieving the right **balance between exploration and exploitation**. This is a classic problem in reinforcement learning.
## Key Distinctions
- **Greedy decisions** involve selecting actions that maximize immediate or observed rewards based only on current information.
- **Optimal decisions** involve selecting actions that maximize expected cumulative rewards over time, which may require sacrificing immediate rewards to gain information.
## Quantifiable Impact
The paper demonstrates that greedy behavior leads to:
- Stagnating action coverage (only 65% of available actions explored with 27B model)
- Sub-optimal cumulative rewards (higher regret compared to UCB algorithm)
- Missed opportunities to discover potentially better actions
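As a rough illustration (not the paper's actual setup), a purely greedy policy on a 10-armed Gaussian bandit stops expanding its action coverage as soon as one arm looks good, and its regret grows accordingly; the arm count, horizon, and reward distributions below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 10, 100                                   # arms and steps (illustrative values)
true_means = rng.normal(0.0, 1.0, K)             # hidden mean reward of each arm

counts, sums = np.zeros(K), np.zeros(K)
total_reward = 0.0
for t in range(T):
    # Greedy rule: unseen arms default to an estimate of 0, so exploration
    # stops as soon as any arm shows a positive observed average.
    estimates = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    a = int(np.argmax(estimates))
    r = rng.normal(true_means[a], 1.0)
    counts[a] += 1
    sums[a] += r
    total_reward += r

coverage = np.mean(counts > 0)                   # fraction of arms tried at least once
regret = T * true_means.max() - total_reward     # cumulative regret vs. always playing the best arm
print(f"action coverage: {coverage:.0%}, cumulative regret: {regret:.1f}")
```

Swapping the greedy rule for a UCB-style rule in the same loop typically drives coverage toward 100% and lowers regret, which is the kind of comparison the paper quantifies.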
## Knowing-Doing Gap Insight
Interestingly, the study reveals that LLMs often **"know" the optimal decision theory** (e.g., UCB algorithm) but still **"do" the greedy action**:
- 87% of rationales correctly identified optimal actions
- Yet 58% of actions were greedy rather than optimal, despite correct reasoning
## Temporal Dimension
The greedy tendency becomes more problematic over time:
- Initially, greedy and optimal actions might align
- As interaction continues, the cost of insufficient exploration compounds
- The paper shows this as "stagnating action coverage beyond 10 steps"
This relationship highlights why simply scaling model size or improving reasoning capabilities is insufficient for optimal decision making. The research suggests that specific training objectives like RLFT are needed to overcome the inherent greediness in language models and align them with principles of optimal sequential decision making.
JOHN
# what is the relationship between EXPLORATION/EXPLOITATION and OPTIMAL DECISION MAKING
Based on the paper, the relationship between exploration/exploitation and optimal decision making is fundamental and can be characterized as follows:
## Core to Optimal Decisions
The exploration-exploitation trade-off represents the central challenge in sequential decision making. Optimal decision making fundamentally depends on balancing:
- **Exploitation**: Selecting actions that maximize immediate rewards based on current knowledge
- **Exploration**: Selecting actions to gather more information about uncertain options, potentially sacrificing immediate rewards
## Mathematical Formalization
The paper discusses algorithms like UCB that provide a mathematical framework for balancing exploration and exploitation:
- UCB assigns higher values to less-explored actions through confidence bounds
- This balance is expressed as: UCB(a) = average_reward(a) + √(2·ln(t)/N(a)), where t is the total number of steps so far and N(a) is how often action a has been selected
- Optimal decisions maximize expected cumulative rewards over time, not just immediate returns
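A short sketch of that rule (standard UCB1; the helper and variable names are mine, not the paper's):

```python
import math

def ucb_action(counts, sums, t, c=2.0):
    """Select the arm maximizing average_reward(a) + sqrt(c * ln(t) / N(a))."""
    for a, n in enumerate(counts):
        if n == 0:
            return a                      # untried arms are selected first (infinite bonus)
    scores = [sums[a] / counts[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)
```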
## Performance Metrics
The researchers quantify the relationship using:
- **Cumulative regret**: The difference between optimal and actual cumulative rewards
- **Action coverage**: The percentage of available actions explored at least once
- The results show that proper exploration (like UCB) consistently outperforms greedy approaches
## Model Performance
The paper demonstrates that:
- LLMs without proper exploration mechanisms achieve only 40-65% action coverage
- Adding exploration mechanisms through RLFT increases coverage and reduces regret
- The "try-all" strategy (systematic initial exploration) provided the most significant improvements
## Temporal Horizon
The optimal balance between exploration and exploitation changes over time:
- Early in the decision process, exploration typically yields greater long-term value
- As knowledge accumulates, exploitation becomes increasingly optimal
- Proper adaptive balancing is key to minimizing cumulative regret
The research ultimately shows that optimal decision making requires an explicit mechanism for managing the exploration-exploitation trade-off, which standard LLMs lack but can acquire through reinforcement learning techniques that reward appropriate exploration behavior.
JOHN
# what is the relationship between EXPLORATION and EXPLOITATION
Based on the research paper, the relationship between exploration and exploitation can be characterized as follows:
## Fundamental Trade-off
Exploration and exploitation represent a fundamental trade-off in sequential decision-making:
- **Exploration** involves selecting actions to gain new information about uncertain options
- **Exploitation** involves selecting actions that maximize rewards based on current knowledge
## Dynamic Balance
Their relationship is not fixed but dynamic and context-dependent:
- Early in decision sequences, exploration typically provides more long-term value
- As knowledge accumulates, exploitation becomes increasingly beneficial
- The optimal balance shifts gradually from exploration-heavy to exploitation-heavy
## Mathematical Relationship
In formal algorithms like UCB (discussed in the paper):
- The exploration component is proportional to the uncertainty about an action (√(ln(t)/N(a)))
- The exploitation component is the estimated value of the action (average reward)
- These components are summed to guide action selection
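To make the shifting balance concrete, here is a small illustration (my own numbers, not the paper's) of how the exploration term shrinks as an arm accumulates pulls, while the exploitation term, the running average reward, stabilizes:

```python
import math

t = 100                                  # total steps so far (illustrative)
for n in (1, 5, 20, 80):                 # pulls of one particular arm, N(a)
    bonus = math.sqrt(2 * math.log(t) / n)
    print(f"N(a)={n:3d}  exploration bonus={bonus:.2f}")
# N(a)=  1  exploration bonus=3.03
# N(a)=  5  exploration bonus=1.36
# N(a)= 20  exploration bonus=0.68
# N(a)= 80  exploration bonus=0.34
```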
## Complementary Nature
Exploration and exploitation are not merely opposing forces but complementary processes:
- Effective exploration enables better exploitation later
- Poor exploration leads to sub-optimal exploitation
- Too much exploitation prevents discovering potentially better options
## In LLMs
The paper reveals that LLMs have an inherent bias toward exploitation (greediness):
- They tend to select actions with highest observed rewards (exploitation)
- They fail to systematically explore untried actions
- This imbalance leads to sub-optimal cumulative performance
The paper demonstrates that proper reinforcement learning with exploration incentives can help LLMs develop a more balanced approach to this trade-off, yielding significantly better decision-making performance.
JOHN
# what other trade off dynamics does EXPLORATION-EXPLOITATION TRADEOFF remind you of
The exploration-exploitation tradeoff reminds me of several similar trade-off dynamics across different domains:
1. **Learning vs. Performing**: The balance between spending time learning new skills (exploration) versus applying existing skills (exploitation) in education and career development.
2. **Research vs. Development**: In R&D, the tension between pursuing novel scientific discoveries (exploration) versus refining existing technologies for market applications (exploitation).
3. **Diversification vs. Concentration**: In investment, spreading capital across many assets to discover opportunities versus concentrating in known high-performing assets.
4. **Breadth vs. Depth**: The classic learning trade-off between acquiring broad knowledge across many domains versus developing deep expertise in a specific area.
5. **Novel vs. Familiar**: In creative fields, the tension between creating innovative, experimental works versus leveraging established, successful formulas.
6. **Generalization vs. Specialization**: In evolution and business strategy, adapting to varied environments versus optimizing for a specific niche.
7. **Variance vs. Bias**: In machine learning, flexible models that can capture complex patterns but may overfit (high variance) versus simpler models that are more stable but may underfit (high bias).
8. **Short-term vs. Long-term Thinking**: In business and policy decisions, immediate returns versus strategic positioning for future opportunities.
9. **Local vs. Global Search**: In optimization algorithms, improving current solutions incrementally versus considering radically different approaches.
10. **Innovation vs. Efficiency**: Organizations balancing disruptive innovation against operational excellence and efficiency.
These parallels suggest the exploration-exploitation dynamic represents a fundamental tension in systems that must both learn and perform under resource constraints and uncertainty.
JOHN
# what is the bigger pattern these things exhibit
# The Meta-Pattern: Optimal Resource Allocation Under Uncertainty
These trade-offs all exhibit a fundamental pattern that transcends their specific domains:
## Core Pattern: Decision Making Under Limited Resources and Uncertainty
All these trade-offs reflect the universal challenge of **allocating finite resources when facing incomplete information about potential outcomes**. The pattern includes:
1. **Present vs. Future Value**: Immediate, known returns versus potential future returns of unknown magnitude
2. **Knowledge Acquisition vs. Knowledge Application**: Gathering new information versus utilizing existing information
3. **Risk-Reward Calibration**: Managing the tension between certainty of moderate outcomes versus uncertainty with potentially higher payoffs
4. **Diminishing Marginal Returns**: Initial exploration yields high information value, but additional exploration typically yields progressively less value
5. **Dynamic Optimal Balance**: The ideal allocation shifts over time as information accumulates and contexts change
## Information-Theoretic Perspective
This pattern fundamentally represents an **information value problem**: determining when the expected value of new information exceeds the opportunity cost of not exploiting current knowledge.
## Adaptation Process
At its core, this pattern reflects how adaptive systems (whether biological, social, computational, or economic) must simultaneously:
- Learn about their environment (exploration)
- Leverage what they've learned (exploitation)
- Continuously adjust the balance based on feedback
## Universal Constraints
The ubiquity of this pattern stems from three universal constraints:
- Limited resources (time, energy, capital, attention)
- Incomplete information
- Environments that change but contain discoverable patterns
This meta-pattern appears across so many domains because it represents perhaps the most fundamental challenge of decision-making entities: how to allocate limited resources to maximize value in an uncertain, partially-knowable world.
JOHN
# what is the argument for pairing the word Greediness with the word Agent in this context
The paper's title pairing "Greedy" with "Agent" is a deliberate and insightful choice that serves multiple purposes:
1. **Technical Precision**: In reinforcement learning and decision theory, a "greedy agent" has a specific technical meaning - it refers to an agent that always selects actions that maximize immediate rewards based on current knowledge. This precisely describes the primary failure mode the authors identified in LLMs.
2. **Highlighting the Core Finding**: The title succinctly captures the paper's central discovery - that LLMs, when acting as agents, exhibit a systematic bias toward greedy behavior rather than proper exploration.
3. **Contrast with Expectations**: There's an implicit contrast with how we might expect sophisticated language models to behave. While LLMs have demonstrated remarkable reasoning capabilities, when placed in sequential decision-making contexts, they default to simplistic greedy policies.
4. **Connection to Agency**: The term "agent" emphasizes the active decision-making role. The paper examines LLMs not merely as passive predictive models but as entities making sequential choices in environments - true agents with agency.
5. **Actionability**: By identifying "greediness" as a key limitation of LLM agents, the title suggests a clear direction for improvement - enhancing exploration capabilities through techniques like RLFT.
The authors could have chosen more technical terms like "sub-optimal exploration" or "exploitation bias," but "Greedy Agents" more effectively communicates the core behavioral pattern while being accessible to a wider audience. It also frames the finding in terms of agency and decision-making rather than just model capabilities, which aligns with the paper's focus on improving LLMs for agentic applications.
JOHN
# what is the source of the FREQUENCY BIAS
Based on the paper, the sources of frequency bias in LLMs can be identified as follows:
## Sources of Frequency Bias in LLMs
1. **Pre-training Distribution Effects**: The paper explicitly states that "frequency bias is an artifact of supervised pre-training," suggesting that LLMs learn to predict tokens based on their frequency in the training corpus.
2. **Copycat Behavior**: The researchers draw a direct parallel to "copycat bias" observed in Behavior Cloning (BC) settings, referencing works by Schmied et al. (2024b) and Wen et al. (2020).
3. **Maximum Likelihood Training Objective**: Though not explicitly stated, the bias likely stems from the fundamental training approach where models learn to predict the most statistically likely continuation.
4. **Token Distribution Matching**: LLMs are trained to match the distribution of tokens in their training data, which naturally leads them to favor frequently occurring patterns.
5. **Lack of Reward-Based Feedback**: Without reinforcement signals during pre-training, models have no mechanism to distinguish between frequency and utility.
6. **Context Window Interference**: The paper demonstrates that increasing repetitions in the context window significantly affects model behavior, suggesting that recency and proximity of tokens strongly influence predictions.
The paper shows this bias is model-size dependent, with smaller models (2B) exhibiting severe frequency bias (96% frequent action selection) while larger models (27B) largely overcome it (14%). However, even when larger models escape frequency bias, they still suffer from greediness, indicating that these are distinct failure modes with different underlying causes.
The researchers demonstrate that RLFT helps counteract frequency bias but doesn't fully eliminate it, especially with high repetitions, suggesting it's a deeply ingrained aspect of how these models process sequential information.
JOHN
# tell me from a higher perspective
From a broader perspective, frequency bias in language models reflects a fundamental tension between statistical learning and causal reasoning that extends beyond AI systems.
Language models, at their core, are pattern-matching machines. Their entire training methodology revolves around absorbing statistical regularities from massive text corpora and learning to reproduce those patterns. This is both their greatest strength and a significant limitation.
What we're observing with frequency bias is essentially the default behavior of any system trained to model correlations rather than causation. The model sees that "A often follows B" without understanding why this pattern exists or whether it's merely coincidental. When deployed as decision-making agents, these systems naturally gravitate toward reproducing frequently observed patterns rather than reasoning about what actions would actually yield optimal outcomes.
This challenge mirrors a fundamental limitation in many human cognitive biases as well. We often confuse familiarity with correctness (the mere exposure effect), overestimate the significance of frequently observed events (availability heuristic), or assume that correlation implies causation.
The paper's findings highlight the transition challenges as we move from systems designed for pattern recognition to systems intended for decision-making. Statistical learning excels at the former but requires additional mechanisms—like reinforcement learning—to succeed at the latter. This represents a broader shift from systems that model "what is" to systems that reason about "what should be."
What makes this particularly interesting is that even as these models grow in scale and capability, they don't naturally overcome these biases without explicit intervention. This suggests that certain fundamental aspects of decision-making and causal reasoning may not emerge spontaneously from statistical pattern recognition, no matter how sophisticated.
JOHN
# what is the relationship between STATISTICAL PATTERN RECOGNITION and CAUSATION
Statistical pattern recognition and causation represent fundamentally different approaches to understanding relationships in data, with important distinctions and connections:
## Fundamental Differences
**Statistical Pattern Recognition:**
- Identifies correlations and associations in data
- Answers "what" questions (what patterns exist?)
- Based on observational frequency and co-occurrence
- Makes predictions based on statistical regularities
- Can operate without understanding underlying mechanisms
**Causation:**
- Identifies cause-and-effect relationships
- Answers "why" questions (why does this pattern exist?)
- Based on intervention and counterfactual reasoning
- Makes predictions based on causal mechanisms
- Requires understanding of underlying processes
## Their Relationship
1. **Hierarchical Relationship**: Causation can be viewed as a higher-level understanding that builds upon but transcends statistical patterns. Statistical patterns may suggest causal relationships, but don't establish them.
2. **Information Gap**: Statistical patterns capture the "what" but often miss the "why." They represent surface-level correlations that may or may not reflect deeper causal mechanisms.
3. **Complementary Roles**: In scientific inquiry and decision-making, both play important roles - statistical patterns help identify potential relationships, while causal reasoning helps validate and explain them.
4. **Inference Direction**: Statistical pattern recognition works from data to relationships, while causal reasoning often works from hypothesized mechanisms to predicted patterns.
5. **Confounding Variables**: Statistical patterns can be misleading due to confounding variables that causation accounts for through proper experimental design or causal modeling.
6. **Transfer Learning**: Causal understanding transfers better to new contexts than statistical patterns, which may be specific to the distribution they were learned from.
In the context of the paper, LLMs excel at statistical pattern recognition from their training but struggle with causal decision-making. This explains why they exhibit frequency bias and greediness - they're responding based on statistical patterns (what appeared frequently or had high rewards in the past) rather than causal understanding of exploration-exploitation tradeoffs that would lead to optimal decision-making.
JOHN
# what is the relationship between statistical patterns and causal patterns
Statistical patterns and causal patterns represent two different ways of understanding relationships in data, with a complex relationship between them:
## Core Distinctions
**Statistical Patterns:**
- Represent correlations, associations, and dependencies observed in data
- Identified through mathematical and statistical analysis
- Express how variables tend to co-occur or co-vary
- Can be detected purely from observational data
- Described by formulas like P(Y|X) (probability of Y given X)
**Causal Patterns:**
- Represent true cause-and-effect relationships in the world
- Identified through experiments, interventions, or causal inference
- Express how changes in one variable produce changes in another
- Often require interventional or counterfactual data
- Described by formulas like P(Y|do(X)) (probability of Y given we intervene on X)
## Their Relationship
1. **Overlap but Distinctness**: All causal relationships produce statistical patterns, but not all statistical patterns reflect causal relationships.
2. **Information Asymmetry**: Statistical patterns are observable from data alone, while causal patterns often require additional knowledge or experiments.
3. **Direction of Inference**: Statistical patterns can point toward possible causal relationships that need verification through additional methods.
4. **Confounding**: Statistical patterns can be misleading due to confounding variables, while causal patterns account for these.
5. **Structural Stability**: Causal patterns tend to be more stable across different contexts and populations than statistical patterns.
6. **Different Uses**:
- Statistical patterns excel at prediction within the same distribution
- Causal patterns excel at intervention and transfer to new contexts
7. **Inference Challenges**: Moving from statistical to causal patterns requires additional assumptions or experimental designs (randomized trials, natural experiments, etc.).
8. **Representation Differences**: Statistical patterns can be represented through various statistical measures (correlation, regression coefficients), while causal patterns are often represented through causal graphs or structural equations.
In the context of the paper, LLMs primarily learn statistical patterns from their training data, which leads to behaviors like frequency bias. They struggle with causal reasoning needed for optimal decision-making in sequential environments, which explains why reinforcement learning helps bridge this gap by providing direct causal feedback about actions and outcomes.
JOHN
# how does CONCEPTUALIZATION relate to these two
Conceptualization acts as a crucial bridge between statistical patterns and causal patterns, forming a three-way relationship that is fundamental to how intelligence (both human and artificial) understands the world:
## What is Conceptualization?
Conceptualization is the process of forming abstract ideas, categories, and frameworks that organize observations into meaningful structures. It involves:
- Creating mental models that represent entities and relationships
- Defining categories and boundaries between different phenomena
- Developing frameworks that explain how systems operate
- Abstracting away irrelevant details to focus on essential properties
## The Three-Way Relationship
1. **Conceptualization → Statistical Patterns**
- Concepts determine which statistics we measure and track
- Categories we create influence the patterns we can detect
- Conceptual frameworks guide what correlations we consider meaningful
- Our concepts determine the features we extract from raw data
2. **Statistical Patterns → Conceptualization**
- Observed correlations lead us to form new concepts
- Statistical regularities suggest natural category boundaries
- Unexpected patterns prompt conceptual revision
- Statistical clustering can reveal latent concepts
3. **Conceptualization → Causal Patterns**
- Concepts provide the variables between which we posit causal relationships
- Conceptual models suggest potential causal mechanisms
- Our ontology defines what can be a cause or effect
- Conceptual frameworks constrain plausible causal explanations
4. **Causal Patterns → Conceptualization**
- Understanding causation reshapes our conceptual frameworks
- Causal discoveries often lead to concept refinement
- Causal structures help organize concepts hierarchically
- Interventional knowledge influences how we categorize phenomena
5. **Statistical Patterns ↔ Causal Patterns (with Conceptualization as Mediator)**
- Concepts provide the language to move from correlation to causation
- Conceptual frameworks help distinguish spurious from meaningful correlations
- Concepts allow us to formulate causal hypotheses from statistical observations
- Conceptual understanding helps interpret when statistical patterns do or don't reflect causation
## In the Context of LLMs and Decision Making
The paper's findings about LLM limitations can be understood through this three-way relationship:
- LLMs excel at capturing statistical patterns from their training data
- They form concepts through statistical co-occurrence rather than causal understanding
- Their conceptualization of decision-making lacks the causal understanding of exploration-exploitation tradeoffs
- RLFT helps bridge this gap by providing causal feedback that reshapes their conceptual understanding of decision-making
This suggests that helping LLMs develop better causal understanding requires not just exposure to more data, but specifically designing learning mechanisms that help them form causally-grounded concepts rather than purely statistically-derived ones.
JOHN
# what is the source behind the KNOWING-DOING GAP
Based on the research paper, the knowing-doing gap in LLMs appears to have several interconnected sources:
## Sources of the Knowing-Doing Gap
1. **Prediction vs. Action Optimization**: LLMs are fundamentally trained to predict text, not to optimize actions for outcomes. The paper shows that models can correctly compute UCB values (87% correct rationales) but fail to select the corresponding optimal actions (only 21% optimal actions).
2. **Architectural Limitations**: The paper doesn't explicitly identify architectural limitations, but the disconnect between reasoning and action selection suggests the model's architecture may not effectively connect these components.
3. **Training Methodology Mismatch**: Standard pre-training doesn't align language prediction with action optimization. The model can generate the correct reasoning process but doesn't internalize it for decision-making.
4. **Default to Greedy Behavior**: Even when the model correctly reasons about exploration, it defaults to greedy behaviors (58% of actions were greedy even with correct reasoning). This suggests a fundamental bias toward exploitation over exploration.
5. **Lack of Action Consequence Awareness**: Pre-training doesn't provide feedback on how generated actions affect future states, which reinforcement learning explicitly addresses.
6. **Token-level vs. Sequence-level Optimization**: The paper mentions that "exact token-level correction is usually difficult for autoregressive generation," suggesting that optimizing token-by-token doesn't naturally translate to optimizing entire action sequences.
The researchers demonstrate that RLFT helps narrow this gap by providing direct feedback on actions, highlighting that the knowing-doing gap stems from fundamental disconnects between language modeling objectives and decision-making objectives. This explains why simply scaling model size or improving reasoning capabilities doesn't eliminate this gap - it requires explicit optimization for action outcomes through mechanisms like reinforcement learning.
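A sketch of how such a gap could be quantified, assuming the recommended action can be parsed from each rationale and compared with the action actually emitted (the episode fields below are hypothetical placeholders, not the paper's data format):

```python
def knowing_doing_gap(episodes):
    """Measure how often correct reasoning translates into the optimal action.

    Each episode is a dict holding the rationale's recommended action, the
    action actually taken, and the environment's optimal action.
    """
    knows = [e for e in episodes if e["rationale_action"] == e["optimal_action"]]
    if not knows:
        return None
    doing = sum(e["taken_action"] == e["optimal_action"] for e in knows) / len(knows)
    return {
        "knowing_rate": len(knows) / len(episodes),   # ~0.87 in the figures quoted above
        "doing_given_knowing": doing,                 # ~0.21 in the figures quoted above
    }
```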
JOHN
# do it again from a higher perspective
From a broader perspective, the knowing-doing gap in large language models reflects a profound challenge that transcends AI and extends to human cognition itself: the fundamental distinction between declarative and procedural knowledge.
This gap emerges from the fundamental nature of knowledge representation and its connection to action. LLMs exhibit a phenomenon remarkably similar to what organizational psychologists have long observed in human institutions—the ability to articulate the correct approach doesn't automatically translate to implementing it.
At its core, this disconnect stems from the profound difference between:
1. **Understanding principles** (knowing that UCB balances exploration and exploitation)
2. **Executing those principles** (actually selecting the action with the highest UCB value)
This separation mirrors ancient philosophical distinctions between _episteme_ (theoretical knowledge) and _techne_ (practical know-how). The model can generate accurate descriptions of optimal decision-making—it has the _episteme_—but lacks the embodied _techne_ to act accordingly.
What makes this particularly fascinating is that neither humans nor AI systems automatically bridge this gap through mere understanding. The paper demonstrates that even when an LLM perfectly articulates the correct decision rule and accurately calculates values, it still defaults to simplistic behaviors when translating thought to action.
This suggests something profound about intelligence itself: reasoning and acting are fundamentally different cognitive processes that require different forms of training and alignment. RLFT works not because it teaches the model new information, but because it aligns the model's action selection with its existing knowledge through direct experiential feedback.
The knowing-doing gap thus reflects one of the deepest challenges in creating truly intelligent systems—bridging the chasm between abstract understanding and effective action, between theory and practice, between knowing what to do and actually doing it.
JOHN
# what does the knowing-doing gap remind you of? ranked by relevance
1. **Theory-Practice Gap in Professional Fields** - The disconnect between theoretical knowledge taught in educational settings and practical application in real-world contexts (medicine, law, education).
2. **Akrasia (Weakness of Will)** - The philosophical concept describing when someone acts against their better judgment despite knowing the optimal course of action.
3. **Declarative vs. Procedural Knowledge** - The cognitive science distinction between knowing facts/concepts and knowing how to perform actions/skills.
4. **Implementation Intention Gap** - The psychological phenomenon where people fail to act on their intentions despite clear plans.
5. **Analysis-Paralysis** - When overthinking and perfect reasoning prevents taking action, creating a disconnect between thorough analysis and execution.
6. **Knowledge-Action Gap in Climate Change** - Society's scientific understanding of climate issues versus actual behavioral changes and policy implementation.
7. **Competence-Performance Distinction** - In linguistics, the difference between understanding language rules and actually producing language correctly.
8. **The Dunning-Kruger Effect** (inverse relationship) - While typically about novices overestimating their ability, the effect also covers skilled individuals who underestimate their own relative competence, another mismatch between possessing knowledge and accurately gauging how it plays out in practice.
9. **The Value-Action Gap** - The discrepancy between people's reported values/beliefs and their actual behaviors.
10. **Tacit vs. Explicit Knowledge** - Michael Polanyi's distinction between knowledge that can be articulated (explicit) versus knowledge embedded in practice that's difficult to formalize (tacit).
11. **Mind-Body Problem** - The philosophical question about how mental states (knowing) connect to physical actions (doing).
12. **Deep Learning Overfitting** - When models learn patterns that work for training examples but fail to generalize to new situations.
JOHN
# what patterns can you discern from this list
Looking at this list, I can discern several meaningful patterns in these phenomena related to the knowing-doing gap:
1. **Modality Separation**: These phenomena all involve a separation between different cognitive or functional modalities:
- Abstract understanding vs. concrete execution
- Internal states vs. external behaviors
- Symbolic/verbal knowledge vs. embodied/practiced skill
2. **Information-Action Transfer Problem**: Each represents a challenge in transferring information from one cognitive system to another:
- From declarative memory to procedural systems
- From analytical thinking to behavioral control
- From explicit reasoning to automatic responses
3. **Contextual Barriers**: Many involve situations where knowledge acquired in one context fails to transfer to another:
- Classroom learning vs. real-world application
- Theoretical understanding vs. practical implementation
- Controlled environments vs. complex situations
4. **Developmental Sequence**: These gaps often reflect natural stages in skill acquisition:
- Understanding precedes competent execution
- Explicit rules before internalized performance
- Conscious effort before automatic implementation
5. **Different Knowledge Representations**: They highlight how the same information takes different forms:
- Verbal/symbolic vs. sensorimotor/procedural
- Articulated principles vs. embedded patterns
- Linear reasoning vs. parallel processing
6. **Complexity Management**: They reflect challenges in translating simple principles to complex execution:
- Clear rules but complicated application contexts
- Straightforward values but complex tradeoffs in practice
- Simple intentions complicated by competing priorities
These patterns suggest the knowing-doing gap isn't merely a failure but rather reflects fundamental aspects of how knowledge is structured and transferred between different cognitive systems—whether in humans or artificial intelligence.
JOHN
# whats the relationship between greediness, frequency bias, and a knowing-doing gap?
These three failure modes identified in the paper are distinct but interrelated phenomena that collectively undermine LLMs' decision-making capabilities:
## Hierarchical Relationship
1. **Foundational Level - Frequency Bias**: The most basic failure mode, rooted in statistical pattern matching. The model defaults to repeating frequently observed patterns without considering their utility.
2. **Intermediate Level - Greediness**: A more sophisticated failure that involves evaluating rewards but optimizing myopically. The model can distinguish between better and worse actions but fails to balance exploration with exploitation.
3. **Advanced Level - Knowing-Doing Gap**: The most complex failure where the model correctly reasons about optimal decision-making strategies but fails to implement them in practice.
## Shared Origins
All three failure modes stem from fundamental limitations in how LLMs are trained:
- They learn from static text data without action feedback
- They optimize for next-token prediction rather than sequential decision outcomes
- They lack embodied experience of consequences from actions
## Progressive Sophistication
The paper shows a progression in model capabilities that reveals how these failures relate:
- Small models (2B) heavily exhibit frequency bias (96%)
- Larger models (27B) overcome frequency bias but remain greedy
- Even the largest models correctly reason about optimal strategies but fail to act accordingly
## Different Solutions Required
Each failure mode requires different mitigation strategies:
- Frequency bias can be reduced through larger models and diverse training data
- Greediness requires explicit exploration mechanisms and reward shaping
- The knowing-doing gap needs alignment between reasoning and action through techniques like RLFT
## Unified Through RLFT
Reinforcement Learning Fine-Tuning helps address all three issues by:
- Providing direct feedback on the utility of actions, counteracting frequency bias
- Rewarding exploration to overcome greediness
- Aligning reasoning with action to narrow the knowing-doing gap
These three failures represent increasingly sophisticated challenges in developing LLMs as effective decision-making agents. They appear to form a developmental hierarchy where solving earlier problems reveals the existence of more complex ones, suggesting a roadmap for improving agentic capabilities in language models.
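As a rough sketch of the last two points, an exploration bonus and an invalid-action penalty can be folded into the scalar reward used for the fine-tuning updates; the bonus values and function names below are illustrative assumptions, not the paper's exact shaping terms:

```python
def shaped_reward(env_reward, action, seen_actions, action_is_valid,
                  exploration_bonus=1.0, invalid_penalty=-5.0):
    """Environment reward plus a one-time bonus for untried actions.

    Invalid or unparseable actions are penalized so the model learns to emit
    well-formed actions; first-time actions earn a small bonus that pushes
    against greedy repetition of the best action seen so far.
    """
    if not action_is_valid:
        return invalid_penalty
    bonus = exploration_bonus if action not in seen_actions else 0.0
    seen_actions.add(action)
    return env_reward + bonus
```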
JOHN
# in the context of LLM, what are the main sources of sub-optimal decision making, ranked and categorized
## I. Pre-training Limitations
1. **Statistical Pattern Matching** - LLMs primarily learn correlations rather than causation, leading to mimicking frequent patterns without understanding their utility.
2. **Next-Token Prediction Objective** - Optimization for predicting the next token rather than long-term decision outcomes creates a fundamental misalignment.
3. **Lack of Action-Consequence Experience** - Pre-training on static text without experiencing consequences of actions provides no feedback on decision quality.
4. **Distributional Gaps** - Underrepresentation of proper decision-making scenarios and exploration strategies in training corpora.
## II. Architectural Constraints
5. **Disconnect Between Reasoning and Action** - Model architecture doesn't effectively connect reasoning circuits with action selection mechanisms.
6. **Token-by-Token Generation** - Autoregressive generation makes it difficult to optimize for entire action sequences or decision paths.
7. **Attention Dilution** - As context grows, attention to relevant historical information for decision-making can become diluted.
## III. Behavioral Biases
8. **Greediness** - Systematic preference for actions with highest observed rewards at the expense of exploration.
9. **Frequency Bias** - Tendency to select actions that appear most frequently in the context regardless of their value.
10. **Knowing-Doing Gap** - Capability to reason correctly about decisions but failure to implement this reasoning in action selection.
## IV. Optimization Challenges
11. **Lack of Explicit Exploration Mechanisms** - No built-in mechanisms to balance exploration and exploitation systematically.
12. **Reward Horizon Problems** - Difficulty connecting current actions to delayed rewards without appropriate mechanisms.
13. **Limited "Thinking Time"** - Constraints on reasoning length that prevent thorough evaluation of decision options.
14. **Context Window Limitations** - Restricted ability to consider the full history of interactions for making optimal decisions.
The research suggests that addressing these sources requires a combination of architectural innovations, training methodology changes (particularly reinforcement learning), and explicit mechanisms to encourage exploration and connect reasoning with action.
JOHN
# what is the greater conceptual context in which sub-optimal decision making exists within
## Epistemological Frameworks
1. **Knowledge Representation Theory** - How different forms of knowledge (declarative, procedural, causal) are encoded and accessed within intelligent systems.
2. **Theory-Practice Divide** - The fundamental gap between understanding principles theoretically and applying them effectively in practice.
3. **Embodied vs. Disembodied Cognition** - The limitations of disembodied systems that learn without physically experiencing the consequences of actions.
## Cognitive Science Perspectives
4. **Dual Process Theory** - The interaction between fast, automatic System 1 thinking (statistical pattern matching) and slow, deliberative System 2 thinking (causal reasoning).
5. **Bounded Rationality** - The constraints on optimal decision-making due to limited information processing capabilities, even with substantial computational resources.
6. **Learning Paradigms** - The distinctions between supervised learning (pattern recognition), reinforcement learning (action optimization), and unsupervised learning (pattern discovery).
## AI Development Trajectories
7. **Scaling Laws vs. Architectural Innovation** - The tension between improving performance through scaling existing architectures versus developing fundamentally new approaches.
8. **Symbol Grounding Problem** - The challenge of connecting symbolic representations with their real-world referents and consequences.
9. **Alignment Challenge** - The broader problem of aligning AI capabilities with intended goals and human values.
## Philosophical Dimensions
10. **Knowledge vs. Wisdom** - The distinction between accumulating information and having the judgment to apply it appropriately.
11. **Episteme vs. Techne vs. Phronesis** - Ancient Greek distinctions between theoretical knowledge, technical skill, and practical wisdom.
12. **Teleological Understanding** - The capacity to understand purpose and goal-directed behavior rather than just patterns.
This multi-layered conceptual context suggests that sub-optimal decision making in LLMs isn't merely a technical limitation but reflects fundamental challenges in bridging different modes of knowledge, reasoning, and action that exist across all forms of intelligence.