This project is an intersection of a lot of interests of mine, mainly AI self-awareness and its relation to RL/post-training. I think it also might be useful for mechinterp as an object of study.

# Introduction

The project is called "symbolic transformer". The point of it is to be a thought experiment, but with real applications. The gist of it is that it's a purely symbolic implementation of how I think a transformer really operates under the hood, given my prior research. Essentially, if there are some [[Hyperbolic Space|non trivial underlying structures]] in an LLM, I should be able to extract just that abstraction and extrapolate it for the purposes of capabilities research. I've done that [[Multiscale Muon|successfully]] elsewhere so far, but a much stronger test is to port those abstractions to a symbolic representation. A slightly weaker but also interesting test is to port it to an English language representation, which is what will follow :)

I've gone ahead and actually implemented this theory in Python code. It is essentially a programmable transformer. It can do all the things a transformer can do, and not too much more than that, which makes it perfect for reasoning about how a transformer can actually work. It also comes with an agentic harness to explore the possibility of using an agent to program said symbolic transformer, for a reason that will be explored later.

# Implementation

![[dag.png]]

It uses a hierarchical pattern matching system which I think is somewhat analogous to the structure in which LLMs store and operate on information. The way it works is that there is a label hierarchy: each token has its own label by default, and labels can be clustered into multiple superlabels that are themselves clustered, and so on. For example, one can make a "noun" label or a "bigram noun" label, etc. I prematurely optimized the data structure with bloom filters and such, in case someone else wants to port it to another language for speed, just to demonstrate how it'd be done at scale.

## MLPs

From this, an MLP layer can match against a number of parent nodes or against the node itself, and if enough of them match to pass a threshold, it fires its activation, which then inserts a set of labels into the residual stream. A major difference between this symbolic representation and an actual transformer is that ours are all discrete representations rather than continuous. The pattern matching part corresponds to the up projection matrix, which often contains power-law-distributed singular values and other signatures that hint at a deeper hierarchical pattern matching structure in the model. The insertion of the labels into the residual stream is the down projection. In the implementation, both positive and negative labels can be inserted, to add or remove a label from the residual stream. A rough code sketch of this matching appears after the attention section below.

## Attention & RoPE

![[rope.excalidraw.png]]

Another interesting aspect of this implementation is that it explains pretty well how RoPE works. RoPE works by allowing the model to mix many different frequencies across the sequence length together. This is roughly equivalent to being able to hierarchically compose groupings of token positions into complex attention patterns.

For simplicity we pass the whole vector into the key and have an MLP produce a query from it. It's an MLP because the output of the query matrix is itself dot-producted with the key, so it acts more like a semantically rich label in itself rather than simply a collection of dot products.
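To make the label hierarchy and the MLP-style matching above concrete, here's a minimal sketch. All the names here (`Label`, `MLPRule`, and so on) are illustrative and are not the actual API in the repo, which also adds bloom filters and other optimizations omitted here.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based hashing so labels can live in sets
class Label:
    """A node in the label hierarchy; tokens get leaf labels by default."""
    name: str
    parents: set[Label] = field(default_factory=set)

    def ancestors(self) -> set[Label]:
        # Walk up the hierarchy: "dog" may roll up into "noun",
        # which may roll up into further superlabels, and so on.
        seen: set[Label] = set()
        stack = list(self.parents)
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                stack.extend(p.parents)
        return seen


@dataclass
class MLPRule:
    """Fires if enough of its patterns match the labels at a position."""
    patterns: set[Label]        # "up projection": what to match against
    threshold: int              # how many patterns must match to fire
    add_labels: set[Label]      # "down projection": positive labels to insert
    remove_labels: set[Label]   # negative labels to delete

    def apply(self, residual: set[Label]) -> set[Label]:
        # Expand the residual stream with every ancestor so a pattern on
        # "noun" also matches a position that only carries "dog".
        expanded = set(residual)
        for lbl in residual:
            expanded |= lbl.ancestors()
        if len(self.patterns & expanded) >= self.threshold:  # discrete "activation"
            residual = (residual | self.add_labels) - self.remove_labels
        return residual


# Tiny usage example: "dog" is a noun, and a rule that sees a noun predicts a period.
noun = Label("noun")
dog = Label("dog", parents={noun})
predict_period = Label("predict_period")
rule = MLPRule(patterns={noun}, threshold=1,
               add_labels={predict_period}, remove_labels=set())
print(rule.apply({dog}))  # residual stream now also carries predict_period
```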
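Here's a similarly hedged sketch of the attention side: each position's full label set acts as its key, a small rule produces the query, and a match threshold plus top-k selection stand in for the softmax. The function name, parameters, and the choice to copy a fixed "value" label set from matched positions are simplifications for illustration, not the repo's actual behavior.

```python
def attention_head(
    residuals: list[set[str]],   # one label set per position (causal order)
    query_rule,                  # callable: dest label set -> query label set
    value_labels: set[str],      # labels to copy from matched source positions
    threshold: int = 1,          # minimum query/key overlap to count as a match
    top_k: int = 2,              # keep only the k best-matching source positions
) -> list[set[str]]:
    out = [set(r) for r in residuals]
    for dst, dst_labels in enumerate(residuals):
        query = query_rule(dst_labels)
        # Score every earlier (causal) position by a discrete "dot product":
        # the size of the overlap between the query and that position's key.
        scores = [(len(query & residuals[src]), src) for src in range(dst + 1)]
        matched = [s for s in scores if s[0] >= threshold]
        matched.sort(reverse=True)  # best overlap first, ties favor recency
        for _score, src in matched[:top_k]:
            # Copy the head's value labels from the matched source position.
            out[dst] |= residuals[src] & value_labels
    return out


# Usage: a head that looks back for a "noun" position whenever it sits on a
# verb, and copies that position's "subject" label forward.
residuals = [{"the"}, {"dog", "noun", "subject"}, {"runs", "verb"}]
new = attention_head(
    residuals,
    query_rule=lambda labels: {"noun"} if "verb" in labels else set(),
    value_labels={"subject"},
    threshold=1,
    top_k=1,
)
print(new[2])  # the verb position now also carries "subject"
```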
As in the sketch above, in place of a softmax, the k for the top-k selection as well as the threshold for the key-query match determine what gets added to the residual stream.

## Unembed

Using attention and MLPs the model learns to compose circuits together. The model then uses circuits to output prediction labels like "predict_noun" or "predict_period", which are then decoded in the unembedding layer for prediction. Interestingly, this explains pretty elegantly why tying embeddings usually produces a worse model. I think there are some parameter scale aspects of it that people are neglecting, but having to completely negate the input vector content, as well as confounding the latent space, is probably the main reason why having a separate embedding matrix is useful.

More of the nitty gritty details can be found in the code, just ask Claude to explain it to you.

# Interpretability

I think this project also represents part of my contention with what I know about mechinterp. I think it neglects core aspects of intelligibility, which harms its ability to actually model what LLMs are doing. More time should be spent on these kinds of toy projects to build a deeper understanding of how they work, beyond surface-level characteristics like neuron activations or circuits. After all, if a major goal is an interpretable decomposition of a transformer, you should probably start by trying to make an interpretable decomposition of a transformer.

SAEs, for example, ignore the obvious hierarchical structure of knowledge encoded in a network. So even if we did get a sparse representation of all the circuits in a model, there would be so many of them, and in a totally flat hierarchy, that it would be mostly uninterpretable to us. Our brains are actually organized in a hierarchical network structure, so it'd help to try and put it in a form we can actually understand.

# Self awareness & CoTs

The project started based on [this tweet](https://x.com/_ueaj/status/2039710939713347608), which was itself inspired by [this article](https://x.com/DimitrisPapail/status/2024555561199480918). The way a transformer should CoT is vastly different from how a human should, and the question is how we should go about making data or teaching a model how to reason efficiently for itself, rather than piggybacking on human reasoning.

![[metaprogramming.excalidraw.png]]

The second thought, which when met with the first conceived this project, was that I needed a visceral way to experience what it's like to be a transformer. An interesting extension of this is to train a model to *hand program* a symbolic transformer. After all, if it can teach a human how to think like a transformer, maybe it can teach a transformer how to think for itself. I doubt this will immediately generalize to more efficient CoTs; that behavior would need to be elicited, but it's certainly an interesting thought. We should do this because "self awareness" and self modeling are going to become incredibly important model traits we optimize for in the near future as we start to deploy these models around the world.

## The hard problem of hallucinations

It's easy to dismiss hallucinations as something that people do too, but I think a very large class of them are due to the lack of self awareness in an LLM compared to a human. The pretraining process of an LLM is not at all analogous to how a human works and lacks any meaningful introspective ability.
Humans can lack introspective ability even though their architecture allows for it, which is why some people are overconfident, much like an LLM. This is why I think factual and contextual hallucinations are a consciousness problem. The model learns lots of stray connections and spurious correlations in pretraining. The ability to correct for the specific misgeneralizations an LLM would make is not in the training data; they are approximately related to human mistakes, but not really. Pretraining, unlike RL, neither meaningfully incentivizes nor is capable of this kind of self awareness. A model trained to verbalize a model of itself would have an easier time knowing what it might get wrong. I think this project is only one small part of that, and we should make this more explicit, but it's nonetheless very interesting.

![[claudescraper.excalidraw.png]]

Contextual hallucinations are a great example of a generalization an LLM would make but a human wouldn't. In the training data, for example, there are lots of texts describing what goes on in figures that are not necessarily passed to the model, or that reference some figure the author assumes the reader knows about (or just saw, like on a different page that may have been split into two different training examples). There isn't much data on cases where the context *doesn't* contain a tool or where something is missing, and the model is seldom trained to respond to these weird situations. I think having the model be more viscerally aware of its context and what it means to it would be important. I'm almost certain Anthropic has been exploring this direction, as their context awareness and hallucination rates seem to be unusually good, but even still there's a long way to go.

# Project

![[eval loss.png]]

The project can be found [here](https://github.com/Ueaj-Kerman/symbolic-transformer) on GitHub. It consists of 3 major components: the symbolic transformer implementation, a complex and well made agentic harness for Claude to program it up by hand, and a broken prime-rl env harness to see if we can use it as an RL environment.

Training is extremely slow, mostly bottlenecked by the agent speed. It also struggles with generalization, with eval/loss only improving for a little while before the code becomes a horrific mess the model has to untangle. You can see this at step 9, when it rewrites a ton of stuff to get the loss back down. Better prompting, actually training the model in this environment, more advanced harnesses, parallelized training, etc. can all be done to improve the performance.

I would suggest implementing parallelized training by giving each subagent read/simulate-only access to its own sample, then having them combine their findings into a report for one agent tasked with taking in the feedback from all the separate samples. That superagent should then prompt the next iteration's subagents and summarize its findings for the next iteration's superagent.

# RL Competition

I've (Claude) tried to implement this as a prime-rl project to no avail, as I have no experience with it and it doesn't seem like it can exactly implement the harness in the repository. So I'm open sourcing the project for anyone that would like to try this. I think it would be extremely cool to have a competition around this to see who can get a model to hand program an LLM to the lowest loss. In the future it'd be cool to train the model to predict its own thinking traces, rather than just tiny stories, as an explicit way to teach it self awareness.
But this is conditioned on getting the symbolic learning to a point where that's actually viable. I think there are simpler ways to do that as well. [PRs welcome](https://github.com/Ueaj-Kerman/symbolic-transformer)!