There's been a lot of discussion and research on trying to mitigate mode collapse in RL. Frankly, I think all of the approaches I've seen are ill-justified and attack the problem from the wrong angle. There are two lenses I consider valid for approaching it:

1. Bayesian modeling
2. Ecological diversity

The second one is a conceptual lens I've learned that really helps when designing scalable or distributed algorithms. The real world, and most existing intelligent systems, operate in parallel with sparse interactions between independent actors. Your brain runs in parallel to everyone else's, and you communicate with one another (and the environment) occasionally. The internal evolution *within* your brain is parallel; the communication is the occasional lock, and somewhat serial.

# A surprising lack of mode collapse

![[speciation.excalidraw.png]]

With respect to RL and mode collapse, we can observe that these 'parallel with sparse interaction' systems rarely suffer from low output diversity. In an ecosystem, if a mass death event happens, the remaining species will evolve and fill the niches left behind by the extinct, and importantly they will speciate into entirely new species that will themselves evolve independently of the main branch. When a country or empire gets too big, local lords rise up as the interests of various parts of the empire compete against each other. When, during human personality maturation, a particular personality grows dominant, it will "speciate" into new ones. This is probably what gives people their "Aura": you can tell when a person has gone through some kind of mass personality death event and come out the other end. They are still a person just as complex, but in a totally unique way.

I think the reason for this "speciation" is that there is persistent non-zero noise and internal feedback/evolution in most systems, and if the intercommunication rate is too low, the system will naturally diversify. Modern empires can be much larger than before because we can instantly communicate across the globe and have one shared "internet culture" if we want to, and indeed the internet has been an enormous boon for American influence around the world.

# Why diversity?

![[diversity.excalidraw.png]]

However, something I have struggled with is *why* we need diversity in the first place, and why we *need* noise. I find the intuition that `diversity = exploration = better` insufficient. Not all exploration is good, and a more interconnected, less noisy system should always perform at least as well ($\ge$) as a less interconnected, more noisy one. Is the interconnectedness of a system a macroscopic property of intelligence that needs to be controlled, or are there intelligent systems at every level of interconnectedness? Is managing the balance between independent evolution intensity and communication a core aspect of intelligence? I don't think so: if I have a really highly interconnected system like a datacenter, I can probably always saturate the interconnectivity for further improvements in intelligence.

I've been extremely perplexed by this for a long time, but I think I've found a much better conceptual explanation for why increased output diversity is important. I don't think it's the output diversity itself; rather, we observe that all the currently existing systems with high output diversity have much greater explorative capacity. What actually causes the greater explorative capacity is something else entirely.
# Anti-compression theory

![[bottleneck.excalidraw.png]]

The difference between LLM RL and the evolution of societies or academic systems is that the complexity of the internal state evolution is bottlenecked by the single centralized instance of the model. In a human society, every model is independent in both parameterization and computation, so the total parameter count of a human society is in the quadrillions or quintillions, while for an LLM RL system it's whatever the model or LoRA parameter count is (in the millions or billions). This means that the thought processes are confined to whatever can fit in the significantly smaller representation capacity of a single model, and the output token diversity is restricted by whatever the model can select in its often undersized model dimension prior to the `lm_head`.

## Maintaining the balance

![[landscape.excalidraw.png]]

If you naively constrained the output entropy of an LLM during RL, this wouldn't necessarily improve the performance of the algorithm. The same goes for any approach that doesn't fundamentally change the bottleneck phenomenon we discussed earlier. I think the reason why is best communicated in the image above. Essentially, if the model capacity is too small, then the system can "unlearn" which regions of the landscape are bad and end up simply wandering around aimlessly. A system with much more capacity, however, can remember the things that failed and avoid repeating those mistakes, as well as integrate that information with new updates, perhaps learning in a more advanced way from prior mistakes.

*And before you pull up a study disagreeing with me: most RL papers are slop and don't actually replicate, much like optimizer papers. Though there is some reason to believe it would work, simply because it makes better use of the existing model representation capacity.*

## Tail-end

Obviously, in the limit, a datacenter that's 99% communication and 1% compute is not something you can saturate. But I think the true balance that determines explorative capacity is between *capacity* and compute intensity. What you really have to worry about is simply having enough space to capture the full spectrum of behavior, including tail-end patterns and observations. What's so powerful about this intuition is that it explains why naively cranking up the output entropy won't solve mode collapse, and also why larger LLMs have that "smell" that smaller models don't, or why they are simply more capable of grokking ideas smaller models can't.

# RL as an Ecosystem

![[cyclical.excalidraw.png]]

I think that lens also holds a (scalable, if suboptimal) answer to mitigating mode collapse in RL. Rather than treating RL as one centralized trainer and n centralized inference engines, we can treat it as an ecosystem of trainers and inference engines. Imagine instead that each trainer has some number of inference engines associated with it, and occasionally they either synchronize parameters via averaging or, even more interestingly, send some subset of their reasoning traces to other training nodes. This way, the total number of parameters across the whole network is vastly larger than for one single instance. You would need some kind of global balancing mechanism to keep entropy under control, since in the real world the planet is simply large enough, and brains/ecosystems/etc. sparse enough, to avoid mode collapse. A minimal sketch of this kind of loop is below.
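As a concrete illustration, here's a minimal toy sketch of that loop, assuming a gossip-style setup where pairs of trainers occasionally average parameters and/or exchange traces. Everything here (`TrainerNode`, `local_rl_step`, the dict-of-floats "parameters") is a hypothetical stand-in rather than any existing API; a real system would plug an actual RL trainer and inference stack into the placeholders.

```python
import copy
import random

class TrainerNode:
    """One 'species': a trainer plus its own pool of inference engines."""

    def __init__(self, node_id, params):
        self.node_id = node_id
        self.params = params      # stand-in for model / LoRA weights
        self.trace_buffer = []    # reasoning traces generated locally

    def rollout(self):
        # Placeholder: a real system would sample reasoning traces from this
        # node's inference engines using self.params.
        self.trace_buffer.append({"node": self.node_id,
                                  "value": random.gauss(0.0, 1.0)})

    def local_rl_step(self):
        # Placeholder: a real system would compute an RL update (PPO/GRPO/...)
        # from self.trace_buffer and apply it to self.params.
        self.params = {k: v + 0.01 * random.gauss(0.0, 1.0)
                       for k, v in self.params.items()}
        self.trace_buffer.clear()

def average_params(nodes):
    """Occasional synchronization: average parameters across a subset of nodes."""
    avg = {k: sum(n.params[k] for n in nodes) / len(nodes)
           for k in nodes[0].params}
    for n in nodes:
        n.params = copy.deepcopy(avg)

def cross_pollinate(src, dst, fraction=0.25):
    """Alternative synchronization: ship a subset of reasoning traces elsewhere."""
    k = max(1, int(len(src.trace_buffer) * fraction))
    dst.trace_buffer.extend(random.sample(src.trace_buffer, k))

# Toy ecosystem: 8 mostly-independent trainers with sparse interaction.
nodes = [TrainerNode(i, {"w": random.gauss(0.0, 1.0)}) for i in range(8)]
for step in range(1000):
    for n in nodes:
        n.rollout()
    if step % 100 == 0:             # the rare "lock": sparse communication
        a, b = random.sample(nodes, 2)
        cross_pollinate(a, b)       # share a slice of reasoning traces
        average_params([a, b])      # and/or average parameters
    for n in nodes:
        n.local_rl_step()
```

The knob that matters is the synchronization interval: communicate too often and the nodes collapse back into one effective parameterization; too rarely and they stop benefiting from each other's exploration.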
But if each little cluster has enough independence to explore reasoning paths on its own and evolve independently, and if there is enough cross-pollination, they will still efficiently find the mode. This overall system would end up producing a "society" of LLMs rather than just one parameterization, which can be especially amazing if you want an extreme diversity of responses (for example, for writing without ensloppification, or for idea creation or research). Or, if you just need one good model, pick the best one and use that, or "anneal" the network between low and high cross-pollination rates and pick the best model when performance reaches its peak. Since the network structure of the internet is already multiscale, you could latch onto latencies to set up your network hierarchy.

This isn't ideal, and as we said earlier, a system with higher interconnectivity and less noise should always be better than one with lower interconnectivity and more noise. What we're doing here is just using "nature's heuristic" for the problem of explorative capacity. It's not a very good one, and we can beat it, as I'll mention in 'Beating nature'.

## Ecosystems as Bayesian ML

Interestingly, this can be interpreted as a Bayesian ML task, where the inference engines are "generating" a dataset and we are training multiple independent models on different subsets of that dataset. Then, in the next rollout, we use the Bayesian model average (assuming all the models are ~equal) to generate the next dataset, and so on. So in a way you can still view it through the first lens too. (A toy sketch of this loop closes out the post.)

# Beating nature

![[complexity.excalidraw.png]]

If you made one giant, highly connected RL loop that contained no noise or Bayesian subset training, then you would end up with a lot of the same algorithms over and over again, reducing your effective parameter count. My intuition is that in a truly intelligent system, the places the training examples or "messages" are routed are themselves controlled by intelligent systems. If the whole thing were top-to-bottom intelligent, then it would always be able to saturate the bandwidth without mode collapse. Unfortunately, that sounds like a pain in the ass, so we use a partially intelligent system (a weak heuristic) known as "independent evolution + synchronization". It's the heuristic nature uses everywhere, from within your brain to between brains, as well as ecosystems and all sorts of other things. But through further analysis we can hypothesize an even deeper, underlying reason why these systems don't mode collapse, and explain how to make better-than-natural intelligent systems.

The equivalent of this in the brain is to imagine if your neurons could decide how they grow and connect to each other on a conscious level. As opposed to being an amorphous blob with genetically encoded macroscopic structures, a truly intelligent brain could reconfigure itself into any particular organization. This does happen within human societies, though: certain people will use the knowledge they've learned to actively move and reconfigure the connections within society. We call those people "high-agency" :)
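To make the "Ecosystems as Bayesian ML" reading concrete, here is a toy sketch under strong simplifying assumptions: each "model" is just a scalar fit to its own shard of the shared dataset, and the next rollout's dataset is generated from their equal-weight average (the Bayesian model average under the "all models are ~equal" assumption). The function names are illustrative, not from any real library.

```python
import random

def fit_on_shard(shard):
    """Stand-in for training one independent model on its own data subset."""
    return sum(shard) / len(shard)

def model_average(models):
    """Equal-weight Bayesian model average over the independent models."""
    return sum(models) / len(models)

def generate_dataset(model, size=64, noise=1.0):
    """Stand-in for the inference engines 'generating' the next dataset."""
    return [model + random.gauss(0.0, noise) for _ in range(size)]

dataset = generate_dataset(model=0.0)
for rollout in range(10):
    random.shuffle(dataset)
    shards = [dataset[i::4] for i in range(4)]           # 4 independent trainers
    models = [fit_on_shard(s) for s in shards]           # independent fits per subset
    dataset = generate_dataset(model_average(models))    # next rollout uses the average
```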