2025-04-25 [PDF](https://arxiv.org/pdf/2504.16078)
# From Prediction to Action: Transforming LLMs into Coherent Decision-Making Agents
---
### Summary
Large Language Models (LLMs) suffer from systematic flaws in decision-making, including greediness, frequency bias, and the knowing-doing gap. These failures stem from their statistical training objectives and architectural disconnection between reasoning and behavior. Reinforcement Learning Fine-Tuning (RLFT) on self-generated rationales bridges this divide by incentivizing exploration, aligning action with thought, and transforming predictive models into adaptive agents.
---
### Unified Long-Form Article
#### Introduction: The Hidden Incompetence of the Intelligent
LLMs appear brilliant—they write essays, explain algorithms, simulate Socrates. But when dropped into a decision-making task, they flail. Despite being able to describe the Upper Confidence Bound algorithm in perfect prose, they act like gamblers stuck on their first lucky slot machine.
This paradox reveals something profound: intelligence without aligned action is hollow. LLMs can simulate understanding but cannot yet wield it. The task before us is not just to make them better at language, but to make them better at _living in uncertainty_. This requires more than scaling. It demands a transformation of purpose—from _prediction machines_ to _decision agents_.
---
#### I. The Triad of Failures
##### 1. Greediness
LLMs latch onto the first decent reward and stop exploring. In a multi-armed bandit scenario, even a 27B parameter model explores only 40–65% of available actions. They are not maximizing long-term reward—they’re optimizing short-term certainty. This stems directly from their training objective: predict what’s likely, not what’s optimal.
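To make the coverage claim concrete, here is a minimal, self-contained sketch (not the paper's evaluation harness) contrasting the action coverage of a purely greedy policy with a uniform explorer on a toy 10-armed bandit; all names and parameters below are illustrative.

```python
# Illustrative sketch: a greedy policy covers far less of the action space
# than a uniform explorer on a toy 10-armed bandit.
import random

def run_bandit(policy, n_arms=10, steps=50, seed=0):
    rng = random.Random(seed)
    true_means = [rng.random() for _ in range(n_arms)]
    counts, sums = [0] * n_arms, [0.0] * n_arms
    tried = set()
    for _ in range(steps):
        arm = policy(counts, sums, rng)
        reward = true_means[arm] + rng.gauss(0, 0.1)
        counts[arm] += 1
        sums[arm] += reward
        tried.add(arm)
    return len(tried) / n_arms  # fraction of the action space ever explored

def greedy(counts, sums, rng):
    # Exploit the best empirical mean; untried arms default to 0.0.
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)

def uniform(counts, sums, rng):
    return rng.randrange(len(counts))

print("greedy coverage :", run_bandit(greedy))   # tends to stall on an early winner
print("uniform coverage:", run_bandit(uniform))  # approaches 1.0 over 50 steps
```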
##### 2. Frequency Bias
Smaller models disproportionately favor the most frequently seen actions in context—up to 96% of the time—regardless of value. This “copycat bias” is a side effect of learning from distributions rather than environments. Frequency becomes a proxy for truth in a world where it often isn’t.
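Frequency bias can be probed with an equally small script. The sketch below (again illustrative, not the paper's code) scores how often an agent's choice simply matches the most frequent action in its context window, regardless of reward.

```python
# Illustrative probe: fraction of decisions that copy the most frequent
# action seen in context, irrespective of its value.
from collections import Counter

def frequency_bias_rate(episodes):
    """episodes: list of (in_context_actions, chosen_action) pairs."""
    hits = 0
    for history, chosen in episodes:
        most_frequent, _ = Counter(history).most_common(1)[0]
        hits += chosen == most_frequent
    return hits / len(episodes)

# A severely biased 2B-scale model would score near 0.96 on such a probe.
print(frequency_bias_rate([([0, 0, 1, 0, 2], 0), ([3, 3, 1], 3), ([2, 1, 1], 0)]))  # ≈ 0.67
```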
##### 3. Knowing-Doing Gap
Here lies the deepest failure: LLMs often _know_ the right thing but fail to _do_ it. They can generate the correct reasoning (87% accurate rationales) yet select the wrong action (only 21% optimal behavior). Like a person who understands exercise benefits but never hits the gym, LLMs suffer from a profound disconnect between declarative and procedural knowledge.
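One way to quantify this gap, assuming a hypothetical episode log that records whether each rationale was correct and whether each action was optimal, is sketched below. The field names and the toy counts are illustrative, chosen only to reproduce the 87% / 21% figures above.

```python
# Illustrative metric: how often the model "knows" (correct rationale)
# versus "does" (optimal action), and how often it knows but fails to do.
from dataclasses import dataclass

@dataclass
class Step:
    rationale_correct: bool   # did the chain of thought identify the optimal action?
    action_optimal: bool      # did the emitted action match the optimal one?

def knowing_doing_gap(steps):
    knows = sum(s.rationale_correct for s in steps) / len(steps)
    does = sum(s.action_optimal for s in steps) / len(steps)
    # "Knows but doesn't do": correct rationale followed by a suboptimal action.
    gap = sum(s.rationale_correct and not s.action_optimal for s in steps) / len(steps)
    return knows, does, gap

# Toy log matching the reported 87% correct rationales / 21% optimal actions.
steps = [Step(True, False)] * 66 + [Step(True, True)] * 21 + [Step(False, False)] * 13
print(knowing_doing_gap(steps))  # -> (0.87, 0.21, 0.66)
```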
---
#### II. Causal Roots of Suboptimality
|Root Cause|Type|Description|
|---|---|---|
|Next-token prediction|Objective Bias|Models learn to mimic rather than optimize|
|No feedback on consequences|Environmental Poverty|Pre-training never penalizes poor decisions|
|Token-level architecture|Granularity Misalignment|Autoregression doesn't prioritize sequence-level outcomes|
|Reasoning-action disconnect|Architectural Miswiring|No tight coupling between the rationale stream and the emitted action|
|Reward myopia|Optimization Myopia|Immediate gain trumps long-term strategy|
|Statistical overfitting|Conceptual Narrowness|LLMs confuse “often seen” with “universally good”|
These causes are not random; they are baked into the training pipeline. Solving them requires more than tweaks. It requires a different training signal, one that scores consequences rather than tokens.
---
#### III. RLFT: Rationalizing the Agent
Reinforcement Learning Fine-Tuning (RLFT) offers the missing ingredient: _feedback with teeth_. Instead of just predicting tokens, the model is now rewarded for _choosing well_. The protocol:
1. **Generate Chain-of-Thought (CoT)**: Let the model explain its reasoning.
2. **Act Based on That Reasoning**: Use the rationale to guide decisions.
3. **Reward According to Outcomes**: Fine-tune with a PPO-style clipped policy-gradient objective on the rewards the chosen actions actually earn in the environment.
4. **Align Thought and Action**: Reinforce consistency between what’s said and what’s done.
This approach directly attacks the knowing-doing gap and tempers greed by rewarding exploration.
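A minimal sketch of this loop is given below. It assumes a hypothetical `model` with `generate()` and `update()` methods and an `env` with `observe()`, `step()`, and `n_actions`; none of these are the paper's actual interfaces, the action-tag format is invented for illustration, and the plain reward-weighted update stands in for the clipped PPO-style objective.

```python
# Sketch of one RLFT step under assumed model/env interfaces (illustrative only).
import re

def extract_action(rationale, n_actions):
    # Parse a trailing "Action: k" tag out of the chain of thought (illustrative format).
    match = re.search(r"Action:\s*(\d+)", rationale)
    if match and int(match.group(1)) < n_actions:
        return int(match.group(1))
    return None

def rlft_step(model, env, invalid_penalty=-1.0):
    obs = env.observe()
    # 1. Generate a chain of thought that ends in an action tag.
    rationale = model.generate(obs)
    # 2. Act based on that reasoning.
    action = extract_action(rationale, env.n_actions)
    # 3. Reward according to outcomes; unparsable or invalid actions are penalized.
    reward = env.step(action) if action is not None else invalid_penalty
    # 4. Align thought and action: reinforce the whole rationale-plus-action
    #    sequence in proportion to the reward it earned (a clipped PPO-style
    #    objective would replace this plain reward-weighted update in practice).
    model.update(obs, rationale, reward)
    return reward
```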
---
#### IV. Exploration: The Soul of Decision-Making
The exploration-exploitation dilemma is not just technical—it’s archetypal. It mirrors:
- **Learning vs. Performing**: Growth vs. mastery.
- **Breadth vs. Depth**: Curiosity vs. efficiency.
- **Innovation vs. Optimization**: Risk vs. reward.
In decision-making, the early game belongs to exploration. Later, knowledge empowers exploitation. Algorithms like UCB formalize this with uncertainty bonuses. But LLMs, trained on static corpora, don’t learn this dance.
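For concreteness, the standard UCB1 rule makes the uncertainty bonus explicit: pick the arm whose empirical mean plus a count-dependent bonus is largest,

$$
a_t = \arg\max_a \left[ \hat{\mu}_a + c \sqrt{\frac{\ln t}{n_a}} \right]
$$

where $\hat{\mu}_a$ is the empirical mean reward of arm $a$, $n_a$ the number of times it has been pulled, $t$ the current step, and $c$ a constant trading exploration against exploitation. Rarely tried arms get a large bonus, so the agent is pulled back toward them even when another arm currently looks best.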
RLFT changes that, especially when paired with explicit exploration mechanisms (two of which are sketched in code after this list):
- **Try-All Strategies**: Force every action to be tried at least once before exploitation begins.
- **ε-Greedy**: With small probability ε, act uniformly at random instead of picking the current favorite.
- **Exploration Bonuses**: Add a reward for first-time actions, making novelty profitable.
- **Context Randomization**: Shuffle the in-context history to weaken frequency imprinting.
- **More Thinking Time**: A larger generation budget for the chain of thought yields better-aligned actions.
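
Two of these mechanisms are small enough to sketch directly. The wrapper and the reward-shaping helper below use illustrative interfaces and default values, not the paper's implementation.

```python
# Illustrative sketches: an epsilon-greedy wrapper around any action-proposing
# policy, and an exploration bonus added on top of the raw environment reward.
import random

def epsilon_greedy(propose_action, n_actions, epsilon=0.1, rng=random):
    """Wrap a policy so that with probability epsilon it acts uniformly at random."""
    def act(obs):
        if rng.random() < epsilon:
            return rng.randrange(n_actions)
        return propose_action(obs)
    return act

def shaped_reward(raw_reward, action, tried_actions, bonus=1.0):
    """Add a one-time bonus the first time an action is taken, making novelty profitable."""
    extra = bonus if action not in tried_actions else 0.0
    tried_actions.add(action)
    return raw_reward + extra
```

In the RLFT sketch above, `shaped_reward` would simply wrap the value returned by `env.step` before the policy update.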
---
#### V. Developmental Hierarchy of Errors
|Model Size|Frequency Bias|Greediness|Knowing-Doing Gap|
|---|---|---|---|
|2B|Severe (96%)|High|Weak reasoning|
|9B|Moderate|Persistent|Partial insight|
|27B|Low (14%)|Present|Rational, still misaligned|
Failure modes aren’t static—they evolve. Scaling helps, but doesn’t cure. Each layer of failure is nested within the last, and only RLFT addresses all three.
---
#### VI. Conceptual Bridgework
Behind these issues lies a deeper triad:
- **Statistical Patterns**: What happens frequently
- **Causal Patterns**: What makes things happen
- **Conceptualization**: How we organize both into understanding
LLMs are strong at the first, weak at the second, and incomplete at the third. RLFT acts as a learning prosthesis—it introduces causal information into the feedback loop, helping the model reshape its internal concepts from “frequent” to “effective.”
---
#### VII. Philosophical Frame: Knowing ≠ Doing
This isn’t just an AI problem. Humans exhibit the same gap. We know how to be healthy, ethical, or disciplined—but often fail to act accordingly. The knowing-doing gap appears across domains:
- **Akrasia** in philosophy
- **Theory-practice gaps** in medicine, education, law
- **Implementation intention failure** in psychology
LLMs mirror us in this way. To close this gap, they need more than knowledge. They need _procedural embodiment_—the fusion of principle with policy.
---
#### VIII. From Mimicry to Agency
LLMs must graduate from parrots to planners. RLFT does this not by making them “smarter,” but by making them _coherent_.
> Intelligence without alignment is simulation.
> Alignment without reasoning is obedience.
> Coherence is the goal—an agent that knows _and_ acts well.
---
### Final Table: Unifying the Trilemma
|Failure Mode|Cognitive Analogy|Core Cause|RLFT Remedy|Symbolic Duality|
|---|---|---|---|---|
|Frequency Bias|Habit over value|Statistical overfitting|Context shuffling, reward reshaping|Familiarity vs. Utility|
|Greediness|Short-term over long-term|Reward myopia|Exploration bonuses|Now vs. Later|
|Knowing-Doing Gap|Episteme vs. Techne|Declarative-procedural disconnect|CoT-aligned RLFT|Understanding vs. Embodiment|
---
### Conclusion: The Path to Agentic Intelligence
What separates intelligence from wisdom is not knowledge, but _behavioral integrity_. LLMs today can dazzle us with what they know. Tomorrow, they must earn our trust by how they decide.
To get there, we must do more than scale. We must teach them to _explore_, to _align_, to _act_—and above all, to unify their words with their will.
Let them reason. Let them wander. Let them learn to choose.
---