This is a null result; the conclusion is at the end.
Modern deep learning is a kind of [Maximum Likelihood Estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). This means that if you train a model on a dataset that is 51% heads and 49% tails then, ignoring regularization effects, it will in the limit predict heads 100% of the time under greedy (argmax) decoding. While this is great for world-model building and for learning to unlearn misinformation, this behavior can sometimes be undesirable.
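As a toy illustration (my sketch, not part of the original experiment): the MLE for a categorical distribution is just the empirical frequency, and greedy decoding then collapses that distribution to its mode.

```python
from collections import Counter

# A dataset that is 51% heads, 49% tails.
data = ["heads"] * 51 + ["tails"] * 49
counts = Counter(data)
total = sum(counts.values())

# The MLE for a categorical distribution is the empirical frequency.
mle = {outcome: n / total for outcome, n in counts.items()}
print(mle)  # {'heads': 0.51, 'tails': 0.49}

# Greedy (argmax) decoding picks the mode every time, so a 51/49 split
# still produces "heads" on 100% of samples.
greedy = max(mle, key=mle.get)
print(greedy)  # heads
```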
Humans, on the other hand, have abundant noise within their brains, and each of us is unique, not simply a clone of the same parameter set. This means that when you poll a crowd of people on a task you will get a very wide distribution of answers, but its average will often be correct, often more accurate than a single expert's guess. This is known as the '[wisdom of the crowd](https://en.wikipedia.org/wiki/Wisdom_of_the_crowd)'.
In a sense, an LLM comes 'pre-optimized': its outputs are already maximized. You can't sample a model for a broad range of acceptable answers, process them via some other system, and then choose the best response, since most of the samples will be the same. This is why well-trained and heavily-RLed models have poor pass@k performance.
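For reference, pass@k is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): draw `n` samples, count `c` correct ones, and estimate the chance that at least one of `k` draws is correct. A sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n total (c correct), is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# A collapsed model that always emits the same wrong answer gains nothing from k:
print(pass_at_k(n=100, c=0, k=10))   # 0.0
# A diverse model with 10/100 correct samples benefits substantially:
print(pass_at_k(n=100, c=10, k=10))  # ~0.67
```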
![[poly.jpg]]
While for most real-world tasks we only care about pass@1, for things like deep research or creativity, pass@k is what matters most. Reasoning models in particular suffer from this 'entropy collapse', though there are [proposed methods](https://arxiv.org/abs/2509.25424) to fix it.
Additionally, this higher diversity makes human-generated outputs like art and writing more varied and interesting: there aren't necessarily notable idiosyncrasies in the sum of all human text the way there are in the sum of all LLM text, or all GPT-5 text. It also might improve robustness to jailbreaks. In humans, the same social-engineering tactics don't always transfer across personalities, so by having a wide variety of personalities in your loop you can reduce the possibility of a catastrophic exploit in exchange for slightly lower overall performance.
## Problem
The problem rests in the context window: given the same context and the same question, the model will always output the most likely answer. Of course, as the conversation continues, the model will start to diversify, since the space of possible conversations of that length increases. A simple test for this is to ask your favorite LLM to flip a coin. As mentioned earlier, many will simply return heads 100% of the time.
## Hypothesis
A possible solution is to give each invocation or conversation its own unique, randomized background. Just like calling tech support: every time you enter a conversation, the assistant behind the screen is a different, unique personality. The risk is that some models will over-index on the 'human' element of the background and become even more deterministic if you use a dataset of human backgrounds, as we did here. Therefore, it's important to identify the models that suffer from this issue ahead of time.
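The mechanics of that idea are simple to sketch. The persona strings below are hypothetical stand-ins for rows of the Nemotron-Personas dataset (its actual field names and contents differ), and `build_system_prompt` is an illustrative helper, not the exact prompt used in the experiment:

```python
import random

# Hypothetical personas standing in for dataset rows; the real
# Nemotron-Personas entries are richer and differently structured.
personas = [
    "a retired air-traffic controller who enjoys crossword puzzles",
    "a marine biologist from Lisbon who paints watercolors",
    "a night-shift nurse who collects vinyl records",
]

def build_system_prompt(persona: str) -> str:
    # Each invocation draws a fresh background, so identical questions
    # no longer arrive in identical contexts.
    return f"You are {persona}. Answer in character."

prompt = build_system_prompt(random.choice(personas))
print(prompt)
```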
## Coin Flip Bench
**Setup**:
1. Import the [Nemotron-Personas](https://huggingface.co/datasets/nvidia/Nemotron-Personas) dataset
2. Build a system prompt using the personalities
	1. Claude model system prompts are written in the third person (base-model-like) and are special-cased.
	2. The Claude system prompt was also tested on non-Claude models to see if it changed outcomes; it didn't.
3. For $n$ personalities, ask the model to flip a coin, roll a six-sided die, and choose a number from \[0-9]
4. Tally the responses, build a probability distribution, measure its entropy, and compare to the expected entropy
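The tallying step above boils down to Shannon entropy over the empirical response distribution (a minimal sketch; the post's actual scoring code isn't shown):

```python
from collections import Counter
from math import log2

def entropy_bits(responses: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical response distribution."""
    counts = Counter(responses)
    total = len(responses)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Expected entropy under a uniform outcome: 1 bit for a coin,
# log2(6) ~ 2.58 bits for a die, log2(10) ~ 3.32 bits for a digit.
responses = ["heads"] * 97 + ["tails"] * 3  # a nearly collapsed model
print(entropy_bits(responses), log2(2))
```

A fully collapsed model scores 0 bits; the closer the observed entropy is to the uniform expectation, the more 'crowd-like' the model's sampling.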
## Results
**Anthropic**:
![[anthropic_entropy.png]]
As you can see, reasoning models suffer extreme mode collapse, likely due to the more aggressive RL they undergo. What's interesting is that the personality each model collapses onto is unique to that model; for example, if you ask them what fruit they most identify with, the answer varies between models:
![[balance_fav_fruit.png]]
*Claude Sonnet 4 seems very certain it is a strawberry*
**OpenAI**:
![[openai_entropy.png]]
None of the OpenAI models seem to benefit from persona control. GPT-5 seems to have had all its responses rejected; I'm not sure why.
**Google**:
![[google_entropy.png]]
2.5 Pro has a high rejection rate, but the increase in variability and distribution from the prompt ensembling seems to allow some of its responses through.
**Meta**:
![[meta_entropy.png]]
All over the place, but generally not helpful except for 405B.
## Conclusion
To be honest, for a lot of models, especially those outside Anthropic, this doesn't seem to work. But Anthropic is known for instruction following and multi-turn performance, so it doesn't surprise me that it works on their models. Overall, I would say a more architectural approach is needed. This project was made to explore an alternative before I went through the effort of implementing the architectural approach I had in mind.
Cool theory; unfortunately, I think this has more to do with overfitting and mode collapse from RL than anything else. It might work better on base models, who knows?