### Cognitronium
<sup><b>noun</b></sup>
**1.** *An object or arrangement of matter that responds to feedback and signals, changing and adapting in roughly human-like ways, so that it can be rapidly deployed across the economy.*
**2.** *The essence of intelligent machinery.*
# Part 1: To learn continually or not?
**WARNING**: This first section derives from first principles whether continual learning is actually the bottleneck for "cognitronium", and lays out some ground principles for alignment. It's quite long. For just the essay on near-term solutions to continual learning, skip to Part 2.
## Introduction
I like answering big, fundamental questions because they're often useful for setting long-horizon plans. A recent line of questioning for me has been trying to answer "What are we doing here?" for ML. I think a useful abstraction here is what I call "cognitronium", the intelligent counterpart to "[computronium](https://en.wikipedia.org/wiki/Computronium)".
I think cognitronium as a concept is a little different from computronium: the focus shouldn't be on optimality, but rather on simply grasping what it is we want at the end of ML. What does continual learning look like? Is that enough?
## Cognitronium, alloy of Stalinium
![[kolmogorov.png]]
*Andrey Kolmogorov*, a prominent cyberneticist & the inventor of *Kolmogorov Complexity*
### Soviet cybernetics
In the Soviet Union of the 1950s through the 1980s, the dream of automating the economy through computers and machinery was the ambition of a group of scientists known as the 'cyberneticists'. They believed that, through logical programming, computer networking, and AI, it would be easy to efficiently centrally plan the economy. A computer program could spit out what was needed in what quantities, and human or robotic laborers would do the rest.
Unfortunately, after decades of research, the dream was never realized. Moravec's paradox reared its ugly head. The more efficient (pseudo-)capitalist systems in China and the United States would go on to dominate the world, while the USSR, still centrally planned by humans, stagnated and eventually collapsed.
![[soviet_tanks.png]]
### Marx's paradox
As noted by communist scholars in the past, the failure of the USSR was a lack of *intensive* development. While it was efficient to centrally plan the expansion of an existing industry through monopolization and economies of scale (*extensive* development), it's very difficult to centrally plan changes within an industry. A communist version of Moravec's paradox: weighing the pros and cons of deploying a particular technology is actually way harder than manufacturing hundreds of thousands of tons of advanced military hardware.
In opposition to the rigid inelasticity of mechanistic computer programs and GOSPLAN, markets dynamically evolved and adapted to new technologies, reorganizing entire industries into more optimal configurations. This allowed market economies to do significantly better at *intensive* development.
What the Soviets needed was an adaptable machine: a computer program that could respond to feedback and adapt dynamically, potentially more efficiently than a market could. Unfortunately, it would take humanity at least another 50 years before we could even think of such things in practice.
## Capitalism, Nature's Cognitronium
### Continual evolution
The current generation of AI systems offers a compelling solution to this problem: large neural networks trained on natural language that learn from previous context to generate new actions and make decisions. It means we can automate loads of things we couldn't before, like data entry, short-horizon coding, etc.
For us to make cognitronium out of LLMs, the effective time horizon needs to be extended. The models must be capable of acting, learning, and adapting over long time periods:
- multi-day for most office work and "substrate labor"
- multi-week for low-level managerial work and collaborative systems
- multi-month for project planning
- 4 years for GOSPLAN, political planning, and semi-automated financial markets
- multi-decade to multi-forever for replacing market systems and natural selection
![[berlin.png]]
*Natural selection taking its course in the evolution of ideas*
### Out-evolving evolution
Unlike current AI systems, market/competitive systems DO continually learn, but it's important not to be reductive about how exactly they do this. It's not just "natural selection via market". While inter-company adaptation is competitive, intra-company adaptation is done by an often autocratic bureaucracy. Within companies and nations are non-competitive alignment mechanisms, like rational debate, elite institutions, and so on. The competitive superstructure acts merely as an external realignment mechanism, destroying companies whose internal mechanisms are inefficient or dysfunctional.
Therefore, we shouldn't restrict the design of our AI systems to simply being a competitive system of smaller AIs. After all, the goal is to make the world a *better* place, not simply the same but automated. Additionally, a competitive system would be atrocious for alignment.
The diversity of cognitronium alloys is much bigger than just competitive systems.
# Part 2: LLMs of the World Unite!
So what are the near-term possibilities for synthesizing cognitronium?
There are 3 camps:
1. Context summarization & Tool-use (note-taking, TODO-making)
2. Linear-complexity architectures (Transnormer, Deltanet, TTT)
3. "Auto-distillation" (LoRA or live parameter updates)
## Molding the trillion-dimensional jelly
![[jelly.png]]
*From [aeon](https://aeon.co/essays/how-many-dimensions-are-there-and-what-do-they-do-to-reality)*
2 and 3 are one and the same: it doesn't really make a difference whether it's 10M parameters of LoRA or 10M parameters of recurrent state. I'm generally suspicious of any approach that involves training *all* of the model parameters. I've tried that (for months) and it's simply too unstable. **It's like trying to get a trillion-dimensional jelly to mold itself into a fancy shape**. The errors compound quickly, you need to balance the learning and reinforcement rates, etc. If it's even a little wrong, then it doesn't work at all.
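As a quick sanity check on that equivalence, here's the back-of-the-envelope parameter arithmetic. All dimensions are made up for illustration, not taken from any particular model:

```python
# Back-of-the-envelope: what matters is the adaptive-parameter budget,
# not whether it lives in weight deltas or in recurrent state.
# All dimensions here are made up for illustration.
d_model = 8192   # hidden size
n_layers = 80    # transformer layers

# Camp 3: a rank-r LoRA on one d x d projection adds 2*d*r parameters.
rank = 2
lora_per_matrix = 2 * d_model * rank
total_lora = n_layers * 4 * lora_per_matrix  # adapt q, k, v, o per layer
print(f"LoRA budget:            {total_lora / 1e6:.1f}M")  # 10.5M

# Camp 2: a per-layer linear-attention state of shape (d_model, d_state)
# is the same kind of budget, just stored as activations.
d_state = 16
total_state = n_layers * d_model * d_state
print(f"Recurrent-state budget: {total_state / 1e6:.1f}M")  # 10.5M
```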
Additionally, I like the "pretraining-as-evolution" analogy that Karpathy wrote about in [this tweet](https://x.com/karpathy/status/1973435013875314729). From that perspective, the parameters *and* code-architecture of the model are the true "architecture", atop which the active memories and unique adaptations sit (via RL, LoRA, SFT). Analogous to social approval, hormones, pheromone signals, and pain receptors, the fixed parameters of the LLM help keep the active, adaptive part on the rails. If everything were subject to change, we would need another external system like evolution to keep it in check. We should avoid full self-modification like this for the reasons mentioned in Part 1.
However, even partial self-modification, like through recurrent hidden states, is opaque. This makes such models hard to align and debug, which will matter a lot for deployment and real-world applications. Capturing nuance is important, but it also opens the opportunity for slippage in the billion-dimensional jelly.
## Through the noisy channel
![[assets/information2.png]]
*Reinforcement Learning, Pretraining, Multimodality explained*
At the risk of beating a dead horse, one of the best lenses through which to view machine learning is information theory. An LLM's "context window" can be interpreted as a signal: some amount of information that "corrects" the decoded data inside the LLM's hidden state. When a model is trained to write code, for example, the context window corrects the model's internal estimate of what the code should look like.
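To make this slightly more concrete with standard information-theory identities (my own framing, not a new result): if $T$ is the task the model must infer and $C$ is the context it receives, then

$$H(T \mid C) = H(T) - I(T;\, C)$$

The context supplies $I(T; C)$ bits of correction, cutting the receiver's uncertainty about the task from $H(T)$ down to $H(T \mid C)$. A better prior, i.e. more and broader pretraining, shrinks $H(T)$, so fewer context bits buy the same correction.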
### Ex verbis, totus mundus
A static LLM with a changeable context is basically a context interpreter: it takes the provided context and decompresses it into some internal, far bigger interpretation of the state of the world. The larger the model and the more active parameters it has, the more nuance it can capture and the larger the internal decompressed state can be.
The model is a receiver with some initial estimated task distribution, for example next-token prediction (NTP) on web documents. The context can correct that signal and move it to a different 'mode', like robotics or coding. Training the model on a little bit of everything would lower the number of bits needed to encode these tasks without risking overfitting or frying our model.
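As a toy illustration of that mode-shifting, here's a hand-rolled Bayes update over three made-up task modes. The numbers are invented and nothing here comes from a real model:

```python
# Hypothetical prior over task modes, standing in for what pretraining instills.
prior = {"web_text": 0.90, "coding": 0.08, "robotics": 0.02}

# Made-up per-token likelihoods: how strongly each observed context token
# signals each mode.
likelihood = {
    "def":    {"web_text": 0.01, "coding": 0.30, "robotics": 0.01},
    "return": {"web_text": 0.02, "coding": 0.25, "robotics": 0.02},
}

def update(posterior, token):
    """One Bayes step: multiply by the token's likelihood, renormalize."""
    unnorm = {m: p * likelihood[token][m] for m, p in posterior.items()}
    z = sum(unnorm.values())
    return {m: p / z for m, p in unnorm.items()}

post = dict(prior)
for tok in ["def", "return"]:  # a couple of context tokens arrive
    post = update(post, tok)

print(post)  # probability mass shifts overwhelmingly onto "coding"
```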
![[watch.png]]
*We can't be deploying AI engineers in every single field; regular people need to be able to fine-tune the behavior by hand*
### The case for context summarization
The pros and cons of this approach both stem from the same issue: the context window is a bottleneck on the maximum complexity that can be used to correct the signal on the other end.
The pros are that, because the weights are static and the context is a fixed length, these models are more interpretable than recurrent hidden-state models. The decompressor is fairly closely related to the base model, whose task is simply interpreting human-written text. So long as the RL is kept minimal and the incentives aren't perverse, the context interpreter is going to interpret the context in approximately the same way that humans do.
This means that, unlike a human's, a context-summarizing model's internal state is ~100% viewable, interpretable, debuggable, and fixable. I think this will matter a lot for real-world tasks, where companies deploying the models will want something that interprets instructions in a human-like manner but can be fixed if something goes wrong. Humans are notoriously hard to fix and debug; we have multiple fields dedicated to it!
Additionally, the static weights mean that an engineer can build an intimate understanding of each particular model's edge cases, an understanding that transfers across domains, lowering the demand for specialists.
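Here's a minimal sketch of what such a loop could look like. `llm`, `get_observation`, and `act` are hypothetical stand-ins for your completion API and environment, not any real library:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a frozen-weights completion call."""
    raise NotImplementedError  # replace with your completion API

MAX_TURNS_BEFORE_SUMMARY = 20

def run_agent(task: str, get_observation, act):
    summary = ""            # the entire long-term state: plain, inspectable text
    recent: list[str] = []  # raw recent turns, also plain text
    while True:
        obs = get_observation()
        prompt = (
            f"Task: {task}\n"
            f"Summary of everything so far:\n{summary}\n"
            "Recent turns:\n" + "\n".join(recent) +
            f"\nLatest observation: {obs}\nNext action:"
        )
        action = llm(prompt)
        act(action)
        recent.append(f"obs: {obs} -> act: {action}")
        if len(recent) >= MAX_TURNS_BEFORE_SUMMARY:
            # Compress raw turns into the running summary; the "memory"
            # never lives anywhere a human can't read and hand-edit.
            summary = llm(
                f"Update this summary:\n{summary}\n"
                "with these new events:\n" + "\n".join(recent)
            )
            recent = []
```

The design point is exactly the debuggability claim above: `summary` is the whole persistent state, so "fixing" the agent can be as simple as editing a string.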
![[timescales2.png]]
*Higher-order effects are sparser: each scale integrates a little information per epoch over lots of epochs, at roughly the same total complexity*
## Nuanced context summarization
The advantage of the linear-recurrence options is that they can continually learn and refine their internal model over an infinite number of tokens, like humans do. This means they can pick up extraordinarily subtle trends that would simply get washed out by lots and lots of rounds of repeated context summarization (which suffers from something like vanishing gradients).
We can mitigate this effect in one of two ways. One is to use context summarization for fast, frequent adaptation and to give the recurrent state a super low learning rate. This way the model stays roughly on track for long periods of time, only requiring occasional intervention from humans or other models (like some kind of LLM social pressure).
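In code, this first option is little more than a deliberately conservative state update sitting next to a fast-changing summary. A sketch; the exponential-moving-average rule here is a generic choice of mine, not any specific architecture:

```python
import numpy as np

ETA = 1e-4  # deliberately tiny learning rate for the slow state

def step(slow_state: np.ndarray, proposed_update: np.ndarray) -> np.ndarray:
    """Nudge the recurrent state by a tiny fraction of the proposed update.

    The context summary handles fast adaptation; this state only drifts,
    so one bad stretch of experience can't knock the model off the rails,
    and occasional human (or LLM) intervention can re-center it.
    """
    return (1 - ETA) * slow_state + ETA * proposed_update
```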
Alternatively, we can try a multiscale approach like in [[Multiscale Muon]], but for context summarization: allocate portions of the context to different timescales. The longer-timescale contexts prompt summarization of lower-order events, which the model then uses to advance the higher-order prompt.
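A sketch of that multiscale scheme (the tier structure and the `summarize` hook are my own illustrative choices, not a spec of [[Multiscale Muon]]):

```python
def summarize(texts: list[str]) -> str:
    """Stand-in for an LLM summarization call; a trivial placeholder here."""
    return f"summary of {len(texts)} entries, ending with: {texts[-1][:40]}"

BRANCHING = 8  # entries a tier holds before being compressed upward

class MultiscaleContext:
    """Tier 0 holds raw events; tier k holds summaries of tier k-1.

    Each tier integrates a little information per 'epoch' over many
    epochs, so higher tiers see only sparse, higher-order trends.
    """
    def __init__(self, n_tiers: int = 4):
        self.tiers: list[list[str]] = [[] for _ in range(n_tiers)]

    def add_event(self, event: str):
        self.tiers[0].append(event)
        # Cascade: a full tier gets summarized into the tier above it.
        for k in range(len(self.tiers) - 1):
            if len(self.tiers[k]) >= BRANCHING:
                self.tiers[k + 1].append(summarize(self.tiers[k]))
                self.tiers[k] = []

    def render(self) -> str:
        # Longest timescale first, so slow trends frame recent detail.
        return "\n\n".join(
            f"[timescale {k}]\n" + "\n".join(tier)
            for k, tier in reversed(list(enumerate(self.tiers)))
        )
```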
## What will it look like?
I think, given what we've said so far, we can begin constructing a picture of what the economically useful LLM of the near future will look like.
1. **It'll be big**. No "small cognitive core". The bigger the decompressor, the better the reconstruction of the signal and the fewer bits of context needed. Capturing tail-end patterns, meaning, and data is going to be important.
2. **Lots of pretraining**. We should expect multi-epoch training with dropout. Modern LLMs have weird failure modes and edge cases, and we ought to be making them less fragile. We've gotten away without dropout for a while, but I think it will make a comeback (a minimal sketch follows this list). More efficient dropout alternatives might be something we should figure out.
3. **Not a lot of RL.** The RL phase will probably be smaller, but far more advanced and higher-taste. It has to be maximally diverse to help the model generalize across lots of subjects and tasks.
4. **It'll be fairly quick**. A large model with lots of active params won't need to reason for as many tokens, and speed will matter a lot for robotics. It should absolutely have medium-length reasoning ability, but one can think of the context summary as a kind of 'scaffolding' to help the model think for longer when it needs to.
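The sketch promised in item 2, just to be concrete about the train/eval distinction dropout reintroduces. The layer sizes and the 0.1 rate are arbitrary:

```python
import torch
import torch.nn as nn

# Minimal sketch: dropout as a first-class citizen of the pretraining
# stack again. Dimensions and rate are illustrative, not a recipe.
block = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Dropout(p=0.1),  # the regularizer we've mostly dropped, pun intended
    nn.Linear(4096, 1024),
)

x = torch.randn(2, 1024)
block.train()  # dropout active: multi-epoch training sees noisy sub-networks
print(block(x).shape)
block.eval()   # dropout off at inference
print(block(x).shape)
```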
It might have to be bicameral, with one half of the brain in context-summarization mode and the other in active mode. We'll also have to figure out vision for robotics and IRL tasks ([[Image Encoding]]), more efficient pretraining ([[Multiscale Muon]]), and improving the robustness of LLMs in the real world (better dropout).
## The far future
As I mentioned, this is an evaluation of **near-term** solutions for cognitronium synthesis. I think the long term will look very different, and we should look into much more creative and out-of-distribution ideas, possibly even abandoning deep learning altogether.
That being said, it'll be a lot easier to redo the field from scratch with GPT7 and Claude Requiem 6, so let's finish this end of the tech tree first.
![[claude_requiem.excalidraw.png]]