# Zoology: Associative Recall as the Key Differentiator

> [[index]] · [[transformers-basics]] · [[ssm-basics]] · [[hybrid-models]] · [[strengths-and-weaknesses]]

## The Critical Finding

The [Zoology paper](https://arxiv.org/abs/2312.04927)[^1] (Dec 2023, HazyResearch) made a remarkable discovery: **82% of the perplexity gap between SSMs and Transformers** is explained by a single skill: **associative recall (AR)**.

## What is Associative Recall?

> [!NOTE] Definition
> **Associative recall** = looking back into context to retrieve a specific fact that was mentioned earlier.
>
> Example: *"She put vanilla extract in her strawberry smoothie … then she drank her strawberry __?"*
> The model must recall that it saw "smoothie" earlier in the sequence.

More examples from language:

- "The capital of France is Paris. ... Q: What is the capital of France? A: ____"
- "John gave Mary the book. ... Who gave Mary a gift? ____"
- Code completion: seeing a variable defined 500 lines ago, then completing its use

## The Benchmark Numbers

| Model | All Tokens PPL | **AR Tokens PPL** | Other Tokens PPL |
|-------|---------------|-------------------|------------------|
| Attention | 9.44 | **1.98** | 10.62 |
| Long Conv | 13.13 | 13.27 | 13.12 |
| Hyena | 10.38 | 4.81 | 11.00 |
| H3 | 10.07 | 3.83 | 10.75 |
| RWKV | 9.79 | 3.82 | 10.51 |

Models: 355M parameters, trained on 10B tokens from the Pile.[^2] The numbers are stark: attention's advantage is concentrated almost entirely in the AR-token column.

> [!WARNING] The Shocking Result
> A **70M parameter attention model outperforms a 1.4B parameter Hyena model** on associative recall.
> AR is not a scaling problem; it is a fundamental architectural capability.

## Why SSMs Fail at AR

The theoretical finding: gated convolution models need **model dimension that scales with sequence length** to solve associative recall.

Intuition: an SSM compresses all history into a fixed-size state vector. To "look up" a specific fact from 1,000 tokens ago, the SSM must have encoded that specific key-value pair in its state, but the state has a fixed size, so information is progressively overwritten or diluted. An attention model can simply "point" to the position where the fact was stored, using its exact QKV lookup mechanism. (A toy sketch of this contrast follows the MQAR section below.)

> [!TIP] The Library Card Analogy
> An SSM is like a secretary who takes running meeting notes, compressing everything into a summary.
> A Transformer is like a library with an exact index card for every page.
> When asked "what did John say in the third paragraph?", the library wins every time.

## The MQAR Formalization

The paper introduces **Multi-Query Associative Recall (MQAR)**, a synthetic task that better reflects real language:

- N key-value pairs are shown
- The model must then answer M queries about which values match which keys
- It is harder than single-query AR because there are multiple queries and more interfering context

This explains why SSMs can "solve" simple synthetic AR tests but still underperform on real language. (A minimal generator sketch follows below.)
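To make the task shape concrete, here is a minimal sketch of an MQAR-style example generator. It is a simplification for illustration; the function name, token layout, and sizes are my own choices, not the paper's exact setup.

```python
import numpy as np

def make_mqar_example(num_pairs=8, num_queries=4, vocab=256, seed=0):
    """Toy MQAR-style instance: show N key-value pairs, then ask M queries.

    Simplified sketch of the task shape, not the paper's generator.
    Keys and values come from disjoint token ranges so they are distinguishable.
    """
    rng = np.random.default_rng(seed)
    keys = rng.choice(vocab // 2, size=num_pairs, replace=False)
    values = rng.choice(vocab // 2, size=num_pairs, replace=False) + vocab // 2
    context = np.stack([keys, values], axis=1).reshape(-1)   # k1 v1 k2 v2 ...
    query_idx = rng.choice(num_pairs, size=num_queries, replace=False)
    queries = keys[query_idx]     # repeated key tokens the model must look up
    answers = values[query_idx]   # the value tokens it must emit
    return context, queries, answers

ctx, q, a = make_mqar_example()
print("context:", ctx)
print("queries:", q, "-> answers:", a)
```

A model sees `context` followed by each query token and must emit the paired value token; with many pairs and many queries, a fixed-size state has to keep every pairing alive at once.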
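And here is the compression-versus-lookup intuition from "Why SSMs Fail at AR" as a runnable toy, again my own construction: the "SSM" side is a single fixed-size fast-weight matrix that sums all key-value outer products (as linear attention does; real SSMs add gating and decay, but the capacity limit is the same), while the "attention" side keeps every key and points at the exact match.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # fixed state size, standing in for model dimension

def recall_error(num_pairs):
    """Store num_pairs random (key, value) pairs two ways, then read one back."""
    keys = rng.standard_normal((num_pairs, d))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    values = rng.standard_normal((num_pairs, d))

    # SSM-like fixed-size state: everything summed into one d x d matrix.
    state = keys.T @ values                      # sum_i k_i v_i^T

    # Attention-like exact lookup: keep all keys/values, point at the match.
    q = keys[0]                                  # query the first stored key
    attn_out = values[np.argmax(keys @ q)]       # exact retrieval
    state_out = q @ state                        # retrieval from compressed state

    err = lambda v: np.linalg.norm(v - values[0]) / np.linalg.norm(values[0])
    return err(state_out), err(attn_out)

for n in [4, 64, 512]:
    s_err, a_err = recall_error(n)
    print(f"pairs={n:4d}  state error={s_err:5.2f}  attention error={a_err:5.2f}")
```

As the number of stored pairs grows past the state size, retrieval from the compressed state degrades toward noise while exact lookup stays perfect: the "dimension must scale with sequence length" result in miniature.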
## The Based Solution (Partial Fix)

HazyResearch's `Based` architecture[^3] combines:

- Short 1D convolutions (good at local patterns)
- Linear attention (input-dependent, handles AR efficiently)

Result: it closes **97.4% of the attention quality gap** while remaining sub-quadratic, and achieves **4.5× inference throughput** over a Transformer with FlashAttention 2.

This validates the hybrid approach: convolutions plus attention, but not full quadratic attention.

## Implications for Report

> [!NOTE] Pedagogical Key Points
> 1. SSMs are NOT just "slower Transformers"; they have a structural recall weakness
> 2. The weakness is very specific: tasks requiring exact lookup into prior context
> 3. SSMs excel at tasks NOT requiring this: pattern recognition, streaming, smooth signals
> 4. Hybrids (Jamba, Griffin, Based) exist specifically to cover this weakness
> 5. Mamba's selectivity helps but doesn't fully close the AR gap

## What SSMs Excel At (When AR Doesn't Matter)

| Task Type | Example | Winner |
|-----------|---------|--------|
| Pattern continuation | Next note in melody | SSM |
| Smooth signal processing | Audio, time series | SSM |
| Long document summarization | "What was the article about?" | SSM (compressed gist) |
| Associative lookup | "What did the author say about X?" | Transformer |
| Code completion (local) | Completing a function body | SSM |
| Code completion (long-range) | Calling a class defined 1000 lines ago | Transformer |
| Genomics | DNA sequence classification | SSM (Evo) |
| Streaming inference | Chat with low latency | SSM |

## Mamba's Partial Fix: Selectivity

Mamba's selectivity (input-dependent B, C matrices) improves AR performance compared to earlier SSMs:

- It can "decide" to store a key-value pair strongly in state
- It can "decide" to clear state when context shifts
- But the fixed state size still limits how many KV pairs can be stored simultaneously

Mamba is better at AR than Hyena or H3, but attention is still better on MQAR.

## State Space Duality (Mamba-2)

The Mamba-2 paper[^4] shows that SSMs and attention are mathematically related: both compute with a structured semiseparable matrix, so the same sequence map can be run as a recurrence or as a masked matrix multiply. This theoretical unification suggests future architectures may seamlessly blend both. Mamba-2's core layer is **2-8× faster** than Mamba-1 while remaining competitive on language modeling. (A toy numerical check of the equivalence appears at the end of this note.)

[^1]: Arora et al. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." arXiv:2312.04927. https://arxiv.org/abs/2312.04927
[^2]: Table 1 from the Zoology paper, 355M parameter models, 10B token training.
[^3]: HazyResearch (2023). "Based: An Educational and Effective Sequence Mixer." https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-based
[^4]: Dao & Gu (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. arXiv:2405.21060. https://arxiv.org/abs/2405.21060
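Finally, to make the state space duality concrete, here is a toy numerical check (my own sketch, not code from the paper) for the degenerate case where the state transition is the identity. Mamba-2's actual layer uses learned scalar decays, which multiply an extra decay mask into `M`, but the semiseparable structure is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 6, 4, 3                      # sequence length, channels, state size

x = rng.standard_normal((T, d))
B = rng.standard_normal((T, n))        # input projections  (play the role of K)
C = rng.standard_normal((T, n))        # output projections (play the role of Q)

# View 1: run the SSM as a recurrence with a fixed-size state (A_t = I here).
h = np.zeros((n, d))
y_rec = np.zeros((T, d))
for t in range(T):
    h = h + np.outer(B[t], x[t])       # h_t = A h_{t-1} + B_t x_t, with A = I
    y_rec[t] = C[t] @ h                # y_t = C_t h_t

# View 2: the same map as one causally masked "attention" matrix.
M = np.tril(C @ B.T)                   # M[t, s] = C_t . B_s for s <= t
y_attn = M @ x

print(np.allclose(y_rec, y_attn))      # True: one computation, two forms
```

The recurrent view costs O(T) time with O(1) state at inference; the matrix view exposes the attention-like form that makes the two families comparable.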