![[ChatGPT Image 24 ago 2025, 16_42_51.png]]
Attention sinks have recently come back to the forefront of architecture discussion, especially due to their appearance in [gpt-oss](https://github.com/openai/gpt-oss) (although in a different form than the effect we're discussing today).
As a mechanism, attention sinks are easy to describe: when trained, decoder-only transformer models tend to allocate a disproportionate amount of attention to the first few tokens, and especially to the first.
This effect is well studied in practical terms, and is often attributed to the model "offloading" probability mass onto the early tokens to avoid allocating it spuriously elsewhere. Recent works, like [Softpick](https://arxiv.org/abs/2504.20966), propose architectural choices that prevent sinks from forming. While this explanation may sound convincing at first glance, my intuition is still bothered by it: what do you mean, the model "offloads"? It certainly doesn't explore that possibility intentionally; there must be some mechanism by which attention sinks are either advantageous or the result of an intrinsic bias of the model. In this blogpost, we will argue that there is a significant bias in decoder-only transformers that may be to blame, at least partially, for this phenomenon. Moreover, this will also allow us to introduce a series of blogposts focused on analyzing transformers through the lens of message passing on graphs.
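To make the phenomenon concrete before going further, here is a minimal sketch of how one could measure it. It assumes PyTorch, the Hugging Face `transformers` library and a GPT-2 checkpoint purely for illustration; none of these choices are prescribed by anything in this post.

```python
# Minimal sketch: average attention mass each layer assigns to the first token.
# GPT-2 is used here only as an arbitrary, illustrative checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer(
    "Attention sinks are easy to describe but harder to explain.",
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, heads, queries, keys).
# Column 0 is the attention every query pays to the first token; we skip query 0,
# which attends only to itself under the causal mask.
for layer_idx, attn in enumerate(outputs.attentions):
    mass_on_first = attn[0, :, 1:, 0].mean().item()
    print(f"layer {layer_idx:2d}: avg attention on token 0 = {mass_on_first:.3f}")
```

If attention sinks form, these averages sit far above the roughly $1/n$ you would expect from uniform attention over the context.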
### Attention as message-passing
[Recent work by Chaitanya K. Joshi](https://arxiv.org/abs/2506.22084) has finally freed us from having to independently formalize a well-known property of Transformers (and of attention layers in particular): that they are a special case of Graph Neural Networks (just like pretty much anything else, to be fair).
As a setting for our discussion, though, we will go over another angle from which attention can be seen as message passing on a graph.
Most people are introduced to (multi-headed) self-attention directly via the [attention is all you need](https://arxiv.org/abs/1706.03762) paper. While this is generally a good practice in my opinion, it tends to frame attention either as the simplest way of making tokens interact in a transformer, or as just a soft version of a dictionary lookup. Neither interpretation is wrong, but they often drown out some interesting geometric details that lie in attention itself.
Let's start with regular, multi-headed attention.
Say you have $n$ tokens, with an embedding dimension $d$.
Let our input tokens be arranged as a matrix $X \in \mathbb{R}^{n \times d}$. We first process $X$ with three different linear projections, namely $W_q$, $W_k$ and $W_v$, and end up with the respective matrices $Q \in \mathbb{R}^{n \times d_q}$, $K \in \mathbb{R}^{n \times d_k}$ and $V \in \mathbb{R}^{n \times d_v}$ (with $d_q = d_k$, so that $QK^T$ is well defined).
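As a quick shape check, here is how the projection step looks in PyTorch; the dimensions below are arbitrary placeholders of my choosing, not values prescribed by the text.

```python
# Shape-only sketch of the projection step: X is (n, d); Q, K, V follow.
import torch

n, d = 8, 64          # number of tokens, embedding dimension
d_k, d_v = 16, 16     # key/query and value dimensions (d_q = d_k)

X = torch.randn(n, d)
W_q, W_k, W_v = torch.randn(d, d_k), torch.randn(d, d_k), torch.randn(d, d_v)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)  # torch.Size([8, 16]) for each
```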
We then perform the well-known attention operation
$\text{attention}(X) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
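In code, the whole operation is just a few lines. The sketch below continues the one above (with the projections repeated so it runs on its own), using a single head and no causal mask for clarity.

```python
# Single-head attention, no causal mask, names matching the formula above.
import torch

n, d, d_k, d_v = 8, 64, 16, 16
X = torch.randn(n, d)
Q = X @ torch.randn(d, d_k)
K = X @ torch.randn(d, d_k)
V = X @ torch.randn(d, d_v)

alpha = Q @ K.T                                      # (n, n) score matrix, alpha = QK^T
weights = torch.softmax(alpha / d_k ** 0.5, dim=-1)  # each row sums to 1
out = weights @ V                                    # (n, d_v): row i is a weighted average of the rows of V
```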
Let's take a look at $\alpha = QK^T$.
If we rewrite it component-wise we get
$\alpha_{ij} = \sum_{l=1}^{d_k} Q_{il}(K^T)_{lj} = \sum_{l=1}^{d_k} Q_{il}K_{jl}$
and if we note that $Q$ and $K