Recently, I've been working on several new learning algorithms to help me study the 'bitter lesson'. We often say "algorithms that scale are good" but what makes an algorithm scale? What is it about deep learning that makes it so bitter-pilled?
This article is about one intelligent system that I've designed based on properties I've observed in other learning algorithms I've made. I think the central lesson of the bitter lesson comes down to 3 points:
1. the algorithm must be asynchronous
	1. every participant/worker must operate mostly independently; this helps maximize FLOPs and data locality
2. synchronization and inter-worker communication must be rare
	1. deep learning works because there's only one step where all the nodes need to communicate: the gradient all-reduce
	2. for further scale improvements we either need decentralized optimizers (like DiLoCo) or new learning algorithms
3. the base operation is not that important; there's not really anything called "logic", and you can get reasoning out of pretty much any kind of computation, so choose one that's fast and powerful
	1. matrix multiplication is extremely powerful, benefiting from the well-studied properties of addition and multiplication
	2. it can be done using systolic arrays, which minimize routing and memory-retrieval costs
So, here's a proposal for a new learning algorithm that satisfies all the properties of the bitter lesson while also exploiting heterogeneous compute.
## Hyperspace Cellular Automata
There's been a lot of research into extending cellular automata to higher dimensions; however, it doesn't *have* to be something like the Game of Life, it can be more general than that. Imagine each location in space is simply a collection of small models with some neuron-like local learning rule. Diffusion can happen across this higher-dimensional spacetime in the same way it happens in 2d and 3d. If one axis of spacetime is "stretched" relative to another, then synchronization across that axis will happen less frequently, as diffusion across it will take longer.
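To make the stretched-axis diffusion idea concrete, here's a minimal sketch in plain Python. Each cell's state is a single scalar (standing in for a small model) on a toy 3d grid, and a per-axis `stretch` factor means a stretched axis only exchanges state with its neighbours every few steps. The grid size, rates, and `stretch` values are all illustrative, not a fixed design.

```python
import itertools

def make_grid(shape, init):
    """Map each coordinate tuple to a scalar cell state."""
    return {c: init(c) for c in itertools.product(*map(range, shape))}

def diffuse(grid, shape, step, stretch=(4, 1, 1), rate=0.5):
    """One synchronous update. Each cell moves toward the average of
    itself and its neighbours, but only along axes that are "due":
    an axis with stretch s participates only when step % s == 0,
    so a stretched axis synchronizes less frequently."""
    new = {}
    for c, v in grid.items():
        acc, n = v, 1
        for axis, s in enumerate(stretch):
            if step % s != 0:
                continue  # stretched axis: no diffusion this step
            for d in (-1, 1):
                nb = list(c)
                nb[axis] = (nb[axis] + d) % shape[axis]  # periodic boundary
                acc += grid[tuple(nb)]
                n += 1
        new[c] = v + rate * (acc / n - v)
    return new

# A unit of "state" starts at the origin and diffuses outward; it spreads
# faster along axes 1 and 2 than along the stretched axis 0.
shape = (4, 4, 4)
grid = make_grid(shape, lambda c: 1.0 if c == (0, 0, 0) else 0.0)
for step in range(8):
    grid = diffuse(grid, shape, step)
```

After a few steps, the immediate neighbour along a fast axis holds more state than the one along the stretched axis, which is exactly the "synchronization happens less frequently" behaviour described above.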
Additionally, unlike a traditional evolutionary algorithm, where organisms are selected and the next generation is chosen based on fitness, the organisms must learn to reproduce on their own. The proposal distribution for each new instance is just whatever reproduction algorithm the creature has evolved. This avoids any unnecessary synchronization and allows significantly greater complexity, like multicellular life or ecosystems, to evolve (co-evolution).
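A rough sketch of what selection-free reproduction could look like: each organism carries an energy budget and a heritable mutation scale, earns energy from a toy task, and splits on its own when it has enough — there is no global fitness ranking or generation boundary. The `Organism` class, the task, and all the thresholds below are invented for illustration.

```python
import random

random.seed(0)  # deterministic for the example

class Organism:
    """Carries its own reproduction rule: a heritable 'skill' plus a
    heritable mutation scale. Nothing outside the organism ranks it."""
    def __init__(self, skill, mut_scale, energy=1.0):
        self.skill, self.mut_scale, self.energy = skill, mut_scale, energy

    def reproduce(self):
        # The proposal distribution for offspring is whatever the lineage
        # carries: here, Gaussian mutation with an evolved scale.
        child = Organism(
            self.skill + random.gauss(0, self.mut_scale),
            abs(self.mut_scale + random.gauss(0, 0.05)),
            energy=self.energy / 2,
        )
        self.energy /= 2  # reproduction costs half the parent's energy
        return child

def step(population, cap=200):
    """Each organism independently earns energy from a toy task, pays a
    living cost, dies when broke, and splits when rich. There is no
    global selection step or generation boundary."""
    survivors = []
    for org in population:
        org.energy += max(0.0, org.skill) - 0.3  # earnings minus upkeep
        if org.energy <= 0:
            continue  # starved
        survivors.append(org)
        if org.energy > 2.0 and len(population) < cap:
            survivors.append(org.reproduce())
    return survivors

pop = [Organism(skill=0.5, mut_scale=0.2) for _ in range(20)]
for _ in range(50):
    pop = step(pop)
```

Selection still happens, but only as a side effect of local energy bookkeeping: lineages that earn more than their upkeep reproduce more often, with no synchronized generation step.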
## GPU Clusterspace
So far, we only really have one algorithm that we understand well enough and that is known to produce a general intelligence: evolution. However, simply making a virtual world and simulating it is expensive and not viable. Instead, it makes more sense to adapt the geometry of the environment's spacetime to be computationally simple and to encode the physical constraints of the system. This way the organisms will evolve to be computationally efficient.
To explain what this means, imagine a simple GPU cluster: its servers form a 3-dimensional grid with within-rack, inter-rack, and inter-row axes. Importantly, as in string theory, there are dimensions that are "curled up" in this grid. Each server has 8 GPUs, and each GPU has some number of streaming multiprocessors. These add 3 more axes/dimensions to the system; importantly, movement along each axis of this spacetime has a different cost, and the geometries of these axes differ.
![[cyberspace.excalidraw.png]]
Importantly, we can actually loop the device-per-host axis, meaning one dimension of the spacetime is actually curved. Additionally, the axes vary in length: the x axis of the spacetime is significantly longer than the k and j axes. In physical space it requires more energy to transmit data between racks than within them. We can encode this penalty by stating that spacetime along the x and y axes is "stretched" and thus requires more energy to cover the same distance.
We can even add one more axis for the warps per SM, producing 7 spatial dimensions and 1 time dimension. Each organism can occupy one point in this space, taking the place of some data in a single warp in a single SM in a single device in a single server in a single row in a single cluster.
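A small sketch of how this geometry might be encoded: a coordinate is a tuple with one entry per axis, each axis has a "stretch" (energy per unit step), and the device-per-host axis wraps around like a curled dimension. The axis names, sizes, and costs below are illustrative guesses on my part, not measurements of any real cluster.

```python
# Illustrative axis names and sizes for the cluster spacetime:
# 3 grid axes (row, rack, server), then device, SM, warp, lane.
AXES   = ("row", "rack", "server", "device", "sm", "warp", "lane")
SIZES  = (4, 8, 16, 8, 128, 64, 32)
CURLED = {"device"}  # the device-per-host axis loops back on itself
# Per-axis "stretch": energy per unit step. Hops between rows or racks
# are vastly more expensive than hops inside a GPU.
STRETCH = {"row": 100.0, "rack": 20.0, "server": 5.0,
           "device": 2.0, "sm": 0.5, "warp": 0.1, "lane": 0.01}

def move_cost(a, b):
    """Energy to move from coordinate a to b, axis by axis.
    On curled axes the shorter way around the loop is taken."""
    cost = 0.0
    for axis, size, x, y in zip(AXES, SIZES, a, b):
        d = abs(x - y)
        if axis in CURLED:
            d = min(d, size - d)  # wrap-around shortcut on a looped axis
        cost += STRETCH[axis] * d
    return cost

home           = (0, 0, 0, 0, 0, 0, 0)
neighbour_sm   = (0, 0, 0, 0, 1, 0, 0)  # one SM over: cheap
next_row       = (1, 0, 0, 0, 0, 0, 0)  # one row over: expensive
wrapped_device = (0, 0, 0, 7, 0, 0, 0)  # one step backwards around the loop
```

Note how the curled device axis makes position 7 only one step away from position 0, while the stretched row axis makes a physically adjacent row two hundred times more expensive to reach than the neighbouring SM.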
### Heterogeneous Cyberspace
Just like in real life, where life evolved in the ocean first and gradually moved to land, we can implement "beaches" between different compute clusters and types. By connecting the clusters with "beach" points, organisms will naturally evolve over time to bridge the gap. One can imagine an "amphibious organism", but instead of living between water and land, it lives between CPU cyberspace and GPU cyberspace.
Additionally, we can design 3d-, 2d-, or 1d-only regions of cyberspace for human interactivity. Imagine a Minecraft-like environment where you can interact with these newly evolved organisms. Organisms can earn rewards by exchanging tokens with humans for completing tasks or computing values.
## Evolution in Cyberspace
In these mixed-dimensionality subspaces, we need some mechanism to incentivize the evolution of intelligent organisms. We can imagine some generic "energy" unit, where obtaining energy units involves solving problems, spreading across and populating cyberspace, and other tasks. Over time, a whole ecosystem of organisms may emerge: cell-like organisms may turn low-level data into more complex data for rewards, and multicellular life will learn to coordinate and compose these computations to obtain larger rewards.