# Zoology: The Associative Recall Key Differentiator
> [[index]] · [[transformers-basics]] · [[ssm-basics]] · [[hybrid-models]] · [[strengths-and-weaknesses]]
## The Critical Finding
The [Zoology paper](https://arxiv.org/abs/2312.04927)[^1] (Dec 2023, HazyResearch) made a remarkable discovery: **82% of the perplexity gap between SSMs and Transformers** is explained by a single skill — **associative recall (AR)**.
## What is Associative Recall?
> [!NOTE] Definition
> **Associative recall** = looking back into context to retrieve a specific fact that was mentioned earlier.
>
> Example: *"She put vanilla extract in her strawberry smoothie … then she drank her strawberry __?"*
> The model must recall that it saw "smoothie" earlier in the sequence.
More examples from natural language:
- "The capital of France is Paris. ... Q: What is the capital of France? A: ____"
- "John gave Mary the book. ... Who gave Mary a gift? ____"
- Code completion: seeing a variable defined 500 lines ago, then completing its use
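To make the task shape concrete, here is a minimal Python sketch of the kind of synthetic AR prompt used in this literature. The vocabulary, formatting, and the `make_ar_example` helper are illustrative choices, not the paper's exact setup:

```python
import random

def make_ar_example(num_pairs=8, seed=0):
    """Build one synthetic associative-recall prompt: key-value pairs
    followed by a repeated key; the answer is that key's value."""
    rng = random.Random(seed)
    keys = rng.sample("abcdefghijklmnopqrstuvwxyz", num_pairs)
    values = [str(rng.randrange(10)) for _ in keys]
    context = " ".join(f"{k} {v}" for k, v in zip(keys, values))
    query = rng.choice(keys)              # re-show one key from earlier
    answer = values[keys.index(query)]    # the model should emit its value
    return f"{context} {query}", answer

prompt, answer = make_ar_example()
print(prompt, "->", answer)   # e.g. "x 0 k 1 ... x" -> "0"
```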
## The Benchmark Numbers (Stark Contrast)
| Model | All Tokens PPL | **AR Tokens PPL** | Other Tokens PPL |
|-------|---------------|-------------------|-----------------|
| Attention | 9.44 | **1.98** | 10.62 |
| Long Conv | 13.13 | 13.27 | 13.12 |
| Hyena | 10.38 | 4.81 | 11.00 |
| H3 | 10.07 | 3.83 | 10.75 |
| RWKV | 9.79 | 3.82 | 10.51 |
All models: 355M parameters, trained on 10B tokens from the Pile; values are perplexity (lower is better).[^2]
> [!WARNING] The Shocking Result
> A **70M parameter attention model outperforms a 1.4B parameter Hyena model** on associative recall.
> AR is not a scaling problem — it's a fundamental architectural capability.
## Why SSMs Fail at AR
The paper's theoretical finding: gated-convolution models need **model dimension that scales with sequence length** to solve associative recall.
Intuition: An SSM compresses all history into a fixed-size state vector. If you need to "look up" a specific fact from 1,000 tokens ago, the SSM must have somehow encoded that specific key-value pair in its state — but the state has a fixed size, so information is progressively overwritten or diluted.
An attention model can simply "point" to the position where that fact was stored, using its query-key-value (QKV) lookup mechanism.
> [!TIP] The Library Card Analogy
> An SSM is like a secretary who takes running meeting notes, compressing everything into a summary.
> A Transformer is like a library with an exact index card for every page.
> When asked "what did John say in the third paragraph?" — the library wins every time.
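The interference argument can be demonstrated numerically. Below is a small numpy sketch (the dimensions, scaling, and softmax temperature are arbitrary illustrative choices, not from the paper): it stores n random key-value pairs either by summing outer products into a fixed d×d state, roughly the linear-attention/SSM-style memory, or by keeping every key for a sharp softmax lookup, then measures how well one stored value is recovered. The fixed state degrades as n grows past its capacity; the lookup stays accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                          # fixed state/model dimension

def recall_error(n_pairs):
    """Store n_pairs random (key, value) vectors two ways, then try to
    retrieve the value bound to key 0:
      (a) from a fixed d x d additive state (SSM / linear-attention style),
      (b) with a sharp softmax lookup over all stored keys (attention style).
    Returns the relative retrieval error of each."""
    K = rng.standard_normal((n_pairs, d)) / np.sqrt(d)
    V = rng.standard_normal((n_pairs, d))
    S = K.T @ V                 # everything squashed into one d x d state
    q, v_true = K[0], V[0]

    v_state = q @ S             # (a) read from the compressed state
    scores = np.exp(32.0 * (K @ q))       # (b) temperature is arbitrary
    v_attn = (scores / scores.sum()) @ V

    rel_err = lambda v: np.linalg.norm(v - v_true) / np.linalg.norm(v_true)
    return rel_err(v_state), rel_err(v_attn)

for n in (4, 64, 1024):
    e_state, e_attn = recall_error(n)
    print(f"n={n:5d}  state err={e_state:.2f}  attention err={e_attn:.2f}")
```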
## The MQAR Formalization
The paper introduces **Multi-Query Associative Recall (MQAR)** — a synthetic task that better reflects real language:
- N key-value pairs are shown
- Model must answer M queries about which values match which keys
- Harder than single-query AR: multiple queries must be answered, with more interfering context between each pair and its query
This explains why SSMs can "solve" simple synthetic AR tests but still underperform on real language.
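A minimal sketch of an MQAR-style instance, assuming a toy integer vocabulary; the paper's exact tokenization and sequence layout differ:

```python
import random

def make_mqar_example(n_pairs=16, n_queries=4, vocab_size=64, seed=0):
    """Sketch of a Multi-Query AR (MQAR) instance: N key-value pairs,
    then M queries, each asking for the value bound to one earlier key."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), n_pairs)
    values = {k: rng.randrange(vocab_size, 2 * vocab_size) for k in keys}
    context = [tok for k in keys for tok in (k, values[k])]  # k1 v1 k2 v2 ...
    queries = rng.sample(keys, n_queries)                    # M distinct keys
    targets = [values[q] for q in queries]                   # expected outputs
    return context, queries, targets

ctx, qs, tg = make_mqar_example()
print(len(ctx), qs, tg)
```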
## The Based Solution (Partial Fix)
HazyResearch's `Based` architecture[^3] combines:
- Short 1D convolutions (good at local patterns)
- Linear attention (input-dependent, handles AR efficiently)
Result: closes **97.4% of the attention quality gap** while remaining sub-quadratic, and achieves **4.5× higher inference throughput** than a Transformer with FlashAttention-2.
This validates the hybrid approach: convolutions + attention, but not full quadratic attention.
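For intuition on the linear-attention half, here is a generic causal linear attention sketch in numpy. The feature map below is a simple positive stand-in (ReLU plus a floor); Based's actual feature map is a second-order Taylor approximation of exp, and its short-convolution branch is omitted here entirely:

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Causal linear attention: instead of an n x n score matrix, keep a
    running d x d_v state S and a normalizer z -- constant memory per step."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d)                 # running sum of phi(k_t)
    out = np.empty_like(V)
    for t in range(n):              # recurrent form, O(n) overall
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q = phi(Q[t])
        out[t] = (q @ S) / (q @ z)  # positive phi keeps denominator > 0
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (8, 16)
```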
## Implications for the Report
> [!NOTE] Pedagogical Key Points
> 1. SSMs are NOT just "more efficient Transformers": they have a structural recall weakness
> 2. The weakness is very specific: tasks requiring exact lookup into prior context
> 3. SSMs excel at tasks NOT requiring this: pattern recognition, streaming, smooth signals
> 4. Hybrids (Jamba, Griffin, Based) exist specifically to cover this weakness
> 5. Mamba's selectivity helps but doesn't fully close the AR gap
## What SSMs Excel At (When AR Doesn't Matter)
| Task Type | Example | Winner |
|-----------|---------|--------|
| Pattern continuation | Next note in melody | SSM |
| Smooth signal processing | Audio, time series | SSM |
| Long document summarization | "What was the article about?" | SSM (compressed gist) |
| Associative lookup | "What did the author say about X?" | Transformer |
| Code completion (local) | Completing a function body | SSM |
| Code completion (long-range) | Calling a class defined 1000 lines ago | Transformer |
| Genomics | DNA sequence classification | SSM (Evo) |
| Streaming inference | Chat with low latency | SSM |
## Mamba's Partial Fix: Selectivity
Mamba's selectivity (input-dependent Δ, B, and C parameters) improves AR performance compared to earlier SSMs:
- Can "decide" to store a key-value pair strongly in state
- Can "decide" to clear state when context shifts
- But: fixed state size still limits how many KV pairs can be stored simultaneously
Mamba is better at AR than Hyena/H3, but attention is still better on MQAR.
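A heavily simplified, single-channel sketch of what "selective" means in practice. The shapes, the `w_u` input projection, and the discretization below are toy choices, not Mamba's actual layer:

```python
import numpy as np

def selective_scan(x, w_u, w_delta, W_B, W_C, A):
    """Toy selective SSM: the step size delta and the projections B, C
    are functions of the current input, so the layer can decide per
    token how strongly to write into, and read from, its fixed state."""
    n, _ = x.shape
    s = np.zeros(A.shape[0])        # fixed-size hidden state, (d_state,)
    y = np.empty(n)
    for t in range(n):
        u = x[t] @ w_u              # scalar input channel
        delta = np.log1p(np.exp(x[t] @ w_delta))   # softplus step size > 0
        B = x[t] @ W_B              # input-dependent "write" direction
        C = x[t] @ W_C              # input-dependent "read" direction
        s = np.exp(delta * A) * s + delta * B * u  # decay state, then write
        y[t] = C @ s                # read out
    return y

rng = np.random.default_rng(0)
d, d_state, n = 8, 16, 32
x = rng.standard_normal((n, d))
params = dict(w_u=rng.standard_normal(d),
              w_delta=rng.standard_normal(d),
              W_B=rng.standard_normal((d, d_state)),
              W_C=rng.standard_normal((d, d_state)),
              A=-np.abs(rng.standard_normal(d_state)))  # negative = decay
print(selective_scan(x, **params).shape)   # (32,)
```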
## State Space Duality (Mamba-2)
The Mamba-2 paper[^4] shows that SSMs and attention are mathematically related: both compute a sequence transformation that can be written as multiplication by a structured semiseparable matrix. This theoretical unification suggests future architectures may blend the two seamlessly.
Mamba-2's core layer is **2-8× faster** than Mamba-1 while remaining competitive on language modeling.
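The duality is easy to verify numerically for a 1-dimensional state: running the SSM recurrence and multiplying the input by the corresponding lower-triangular (semiseparable) matrix give identical outputs. A minimal check, with all values random and not taken from either paper:

```python
import numpy as np

# Recurrence: h_t = a_t * h_{t-1} + b_t * u_t,  y_t = c_t * h_t.
# Unrolled, this is y = M @ u with the lower-triangular 1-semiseparable
# matrix M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s. Masked linear
# attention computes y = (L . Q K^T) V -- the same structured matrix
# shape, which is the state space duality connection.
rng = np.random.default_rng(0)
n = 6
a, b, c, u = (rng.uniform(0.5, 1.0, n) for _ in range(4))

# recurrent form
h, y_rec = 0.0, np.empty(n)
for t in range(n):
    h = a[t] * h + b[t] * u[t]
    y_rec[t] = c[t] * h

# matrix form
M = np.zeros((n, n))
for t in range(n):
    for s in range(t + 1):
        M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]

print(np.allclose(y_rec, M @ u))   # True
```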
[^1]: Arora et al. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." arXiv:2312.04927. https://arxiv.org/abs/2312.04927
[^2]: Table 1 from the Zoology paper, 355M parameter models, 10B token training.
[^3]: HazyResearch (2023). "Based: An Educational and Effective Sequence Mixer." https://hazyresearch.stanford.edu/blog/2023-12-11-zoology2-based
[^4]: Dao & Gu (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." ICML 2024. arXiv:2405.21060.