# The 2025 Landscape: SSMs vs Transformers

> See [[index]], [[ssm-basics]], [[transformers-basics]], [[hybrid-models]], [[real-world-products]], [[zoology-associative-recall]], [[strengths-and-weaknesses]]

> [!NOTE]
> This note synthesizes the state of the SSM vs Transformer debate as of early–mid 2025, drawing on key 2024 papers and practitioner reports. It is intended to complement [[real-world-products]] (which covers models and benchmarks) with a *debate and trajectory* framing.

---

## The Core Tension

The research community is navigating a genuinely difficult question: **are SSMs a fundamentally better architecture for sequence modeling, or will Transformers always win given enough compute?**

As of 2025, the honest answer is: *it depends on what you optimize for*.

```
                               Quality / Recall Precision →
Pure SSM ─────────── Hybrid ─────────── Pure Transformer
    │                   │                      │
  fast               balanced               precise
  cheap              practical              flexible
  forgetful          winning                expensive
```

---

## The SSM Case ("SSMs Are Winning")

### 1. Speed and Memory Are Not Marginal Wins

The performance difference at long contexts is not incremental — it's structural:

| Context Length | Transformer slowdown (vs. 2K baseline) | Mamba slowdown |
|----------------|----------------------------------------|----------------|
| 2K tokens      | 1× (baseline)                          | 1× (baseline)  |
| 16K tokens     | ~3× slower                             | 1× (constant)  |
| 128K tokens    | ~64× slower                            | 1× (constant)  |

At 128K tokens, Mamba generates at ~15× the throughput of a Transformer.[^1] This isn't a benchmark trick — it's a mathematical consequence of O(1) vs O(n) per-token cost during generation.

> [!NOTE] The Practical Impact
> For a cloud serving system handling thousands of concurrent users with multi-turn conversations, SSMs can support 3–5× more users per GPU before hitting memory limits. At scale, this translates to a cost reduction of millions of dollars per year.

### 2. Falcon Mamba-7B: The Proof of Concept

In July 2024, TII released Falcon Mamba-7B — a **pure SSM** that scored 64.1 average on the HuggingFace Open LLM Leaderboard v1, compared to:

- LLaMA-3-8B: 62.6 (Transformer)
- Mistral-7B: 61.0 (Transformer)
- Falcon2-11B: 64.3 (Transformer, but larger)

This was a watershed moment: for the first time, a pure SSM at production scale was *competitive* with the best open-source Transformers. Not 10 points behind. Neck and neck.[^2]

### 3. Hybrids Beat Both

NVIDIA's controlled experiment (Waleffe et al., 2024)[^3] trained 8B-parameter Mamba, Mamba-2, Transformer, and hybrid (43% Mamba-2 + 7% attention + 50% MLP layers) models on identical data for up to 3.5T tokens. Results:

| Architecture | 12-Task Average | Notes |
|--------------|-----------------|-------|
| Pure Transformer | baseline | Strong ICL, copying |
| Pure Mamba | ≈ baseline | Lags on 5-shot MMLU, phonebook |
| **Hybrid (Mamba-2-Hybrid)** | **+2.65 pts** | Best on all 12 standard tasks AND 23 long-context tasks |

The hybrid is *also* predicted to be up to **8× faster at inference** than the pure Transformer at long contexts.[^3]

> [!NOTE] The Hybrid Thesis
> The conclusion from NVIDIA's study: you don't need to choose. A well-designed hybrid has the quality of a Transformer *and* the inference efficiency of an SSM. This is the strongest argument for SSMs in the LLM stack — not as replacements, but as default components.
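Below is a back-of-envelope sketch of why the memory argument above compounds in a hybrid: only the attention layers pay a context-length-dependent KV cache, while SSM layers carry a fixed-size state. All dimensions (heads, head size, state size, layer counts) are illustrative assumptions in the rough range of 7–8B models, not figures from the cited papers.

```python
# Back-of-envelope decode-time cache memory: why O(1) state matters at long context.
# All dimensions below are illustrative assumptions, not taken from any specific paper.

def kv_cache_bytes(n_attn_layers, context_len, n_kv_heads=8, head_dim=128, bytes_per=2):
    """KV cache grows linearly with context: 2 tensors (K and V) per attention layer."""
    return 2 * n_attn_layers * context_len * n_kv_heads * head_dim * bytes_per

def ssm_state_bytes(n_ssm_layers, d_model=4096, state_dim=16, bytes_per=2):
    """SSM recurrent state is fixed-size: independent of how many tokens were seen."""
    return n_ssm_layers * d_model * state_dim * bytes_per

for ctx in (2_048, 16_384, 131_072):
    transformer = kv_cache_bytes(n_attn_layers=32, context_len=ctx)
    hybrid = kv_cache_bytes(n_attn_layers=4, context_len=ctx) + ssm_state_bytes(n_ssm_layers=24)
    pure_ssm = ssm_state_bytes(n_ssm_layers=56)
    print(f"{ctx:>7} tokens | transformer {transformer / 2**20:8.1f} MiB "
          f"| hybrid {hybrid / 2**20:7.1f} MiB | pure SSM {pure_ssm / 2**20:5.1f} MiB")
```

Even under these toy assumptions the shape of the result matches the argument: the pure Transformer's cache grows without bound, the pure SSM's does not, and the hybrid pays only a small attention-proportional fraction of the Transformer's cost.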
### 4. Scientific Domains: SSMs Are Already Deployed

In genomics, time series, and audio, SSMs are not a research curiosity — they're the state of the art:

- **HyenaDNA** (NeurIPS 2023 Spotlight): handles 1M-nucleotide genomic sequences, trains up to 160× faster than a Transformer, and is SOTA on 12 of 18 genomic benchmarks[^4]
- **Evo** (Arc Institute, 2024): 7B-parameter SSM-family model trained on 2.7M prokaryotic genomes; generates novel functional DNA sequences; no Transformer equivalent exists at this scale for DNA
- **Caduceus** (2024): Mamba-based bidirectional DNA model; handles reverse-complement symmetry naturally
- Audio and time-series SSMs routinely handle 8K–16K-step sequences where Transformers either OOM or degrade

These wins are not about catching up to Transformers — they're about accessing contexts Transformers *cannot physically reach*.

---

## The Transformer Countercase ("SSMs Are Not Enough")

### 1. The Associative Recall Problem

The most well-supported critique of SSMs is the **associative recall weakness**, formalized in the Zoology paper (Arora et al., Dec 2023)[^5] and independently confirmed by Jelassi et al. (2024)[^6].

**Core finding**: 82% of the SSM–Transformer perplexity gap is explained by a single capability — *looking up a specific fact mentioned earlier in context*.

The numbers are stark:

| Model | Size | AR Perplexity |
|-------|------|---------------|
| Attention | 70M | **1.98** |
| Long Convolution | 1,400M | 13.27 |

A 70M-parameter attention model **outperforms a 1.4B-parameter long-convolution model** on associative recall. This is not a data problem. It's structural.[^5]

> [!WARNING] Why This Matters for LLMs
> Most useful language tasks involve some form of associative recall:
>
> - "What was the user's name?" (recall from earlier in conversation)
> - "In the Python function I defined 50 lines ago..." (code recall)
> - "The paper said on page 3 that..." (document QA)
> - 5-shot prompting (store + retrieve demonstration examples)
>
> SSMs don't *fail* at these — they just do them less precisely. Whether this matters depends on the task.

**The theoretical reason**: A two-layer Transformer can copy strings of exponential length by using exact QKV lookup. A Generalized State Space Model (GSSM) is **theoretically bounded** by its fixed-size state — it cannot copy arbitrarily long strings.[^6]

### 2. In-Context Learning Remains a Gap

5-shot in-context learning (ICL) — the ability to learn a new task from a few examples in the prompt — is weaker in pure SSMs. In NVIDIA's study, Mamba-8B lagged most on tasks requiring ICL (5-shot MMLU, phonebook lookup).[^3]

The theoretical reason connects to AR: ICL essentially requires the model to "remember" the demonstration examples and match them to new inputs — a form of associative recall. Hybrids recover this capability, but it remains a structural weakness of pure SSMs.

### 3. Linear Attention as a Transformer Counter-Move

The SSM community's claim that "recurrence is necessary for efficiency" has been challenged by **linear attention** research:

- **RetNet** (Microsoft, 2023): shows Transformers can be reformulated to run in recurrent mode at inference, achieving O(1) cost per token
- **GLA / Based / HGRN2** (2024): show that certain forms of linear attention match SSM efficiency while adding input-dependent gating
- **Zucchet et al. (2023)**[^7]: proved that gated linear RNNs can exactly implement linear self-attention — the boundary is blurrier than it appears

The SSM/Transformer distinction is dissolving at the research frontier. By the time hardware catches up, the architectures may converge into "efficient sequence models" that borrow from both families.
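To make the linear-attention point concrete, here is a minimal sketch of unnormalized causal linear attention computed two equivalent ways: parallel (attention-style) and recurrent (constant-memory), in the spirit of RetNet and Zucchet et al. The NumPy implementation, shapes, and the omission of softmax, normalization, and gating are simplifying assumptions for illustration, not code from the cited papers.

```python
# Unnormalized causal linear attention computed two mathematically equivalent ways.
import numpy as np

def linear_attention_parallel(Q, K, V):
    """Transformer-style: materialize the (T, T) causal interaction matrix."""
    T = Q.shape[0]
    scores = Q @ K.T                    # (T, T); no softmax, hence "linear" attention
    scores *= np.tril(np.ones((T, T)))  # causal mask
    return scores @ V

def linear_attention_recurrent(Q, K, V):
    """RNN-style: carry a fixed-size (d_k, d_v) state, constant memory per token."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty_like(V)
    for t in range(Q.shape[0]):
        S = S + np.outer(K[t], V[t])    # state update: accumulate k v^T associations
        out[t] = Q[t] @ S               # read out with the current query
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 4))  # T=5 tokens, d=4
assert np.allclose(linear_attention_parallel(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```

The recurrent form carries only a small `(d_k, d_v)` state per layer regardless of sequence length, which is exactly the property SSM-style decoding exploits.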
### 4. No Frontier Deployment (Yet)

As of 2025, all known frontier AI systems (GPT-4o, Claude 3.5, Gemini 2.0, LLaMA 3.x) are Transformer-based. None of the top-10 benchmark performers are pure SSMs.

This doesn't prove SSMs are worse — it may reflect path dependence, existing infrastructure, and risk aversion. But it does mean SSMs lack frontier-scale validation.

---

## Key Open Questions (2025)

> [!NOTE]
> These are the questions researchers are actively debating, not yet settled.

### Q1: Do SSMs scale to 70B+?

All of the controlled SSM vs Transformer studies are at ≤8B parameters. There are no public 70B+ SSMs. Do the efficiency advantages persist? Does the quality gap widen or narrow?

**Status**: Unknown. Falcon Mamba-7B is the largest fully pure SSM released. Jamba has 52B total parameters but is a hybrid.

### Q2: What is the optimal hybrid ratio?

Jamba uses 1 attention layer per 7 Mamba layers. NVIDIA found that 7% attention works. Griffin[^10] uses local attention windows. Which is "best" is task-dependent and not yet systematized.

**Status**: Active research area. No consensus.
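One concrete way to frame the ratio question: a hybrid design reduces to choosing a layer schedule. The helper below is purely hypothetical; the function name and the even-spacing heuristic are assumptions for illustration and do not describe how Jamba or the NVIDIA hybrid actually place their attention layers.

```python
# Hypothetical sketch: build an interleaved layer schedule from a target attention fraction.

def hybrid_layer_schedule(n_layers: int, attn_fraction: float) -> list[str]:
    """Spread attention layers roughly evenly among SSM layers."""
    n_attn = max(1, round(n_layers * attn_fraction))
    stride = n_layers / n_attn
    attn_positions = {round(i * stride + stride / 2) for i in range(n_attn)}
    return ["attention" if i in attn_positions else "ssm" for i in range(n_layers)]

# A Jamba-like 1-in-8 block, and a ~7% attention budget over 56 layers:
print(hybrid_layer_schedule(n_layers=8, attn_fraction=1 / 8))
print(hybrid_layer_schedule(n_layers=56, attn_fraction=0.07).count("attention"))
```

Both calls reproduce the rough "one attention layer in every eight" flavor of the published hybrids; where exactly those layers go, and whether the ratio should vary with depth, is part of what remains unsystematized.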
### Q3: Can SSMs learn attention from data?

The Zucchet et al. paper showed that gradient descent can teach gated RNNs to implement attention under the hood. If this holds at scale, SSMs might be "discovering" attention patterns anyway — suggesting the distinction is architectural, not functional.

**Status**: Theoretical + small-scale empirical. Not confirmed at LLM scale.

### Q4: How do SSMs handle multi-step reasoning?

Chain-of-thought reasoning requires precise tracking of multi-step logical states. This is related to associative recall but distinct. Does the SSM state vector compress enough signal to follow a long reasoning chain?

**Status**: No strong empirical evidence either way. An active question for 2025.

### Q5: Will custom hardware change the calculus?

Current GPUs are optimized for attention (matrix multiplication). SSM recurrence is a different computational pattern — sequential state updates that are harder to parallelize in the GEMM-optimized GPU world. Custom SSM chips (or FPGA implementations) could make SSMs 10× faster on dedicated hardware. Several startups are pursuing this.

**Status**: R&D stage. No widely deployed SSM-specific hardware yet.

---

## Trajectory: Where Is This Heading?

### Near-Term (2025)

1. **Hybrids become the default** for new open-source LLMs at 7B–13B scale. The evidence for hybrid superiority is strong enough that pure Transformer and pure SSM 7B models will increasingly look like research baselines.
2. **RecurrentGemma and Griffin**[^10] signal Google's production investment in SSM-family architectures. Expect at least one major lab to ship a hybrid product in 2025.
3. **Edge deployment drives SSM adoption**. As AI moves to mobile, laptops, and IoT, the fixed-memory property of SSMs becomes commercially critical. Expect major phone OS vendors to ship SSM-based on-device models.
4. **Genomics and biology SSMs mature**. Evo-style models will become the standard for DNA/RNA/protein foundation models. Transformer-based genomics will be competitive only where exact recall from the sequence is needed.

### Medium-Term (2026–2027)

1. **The architecture taxonomy dissolves**. Linear attention, SSMs, and efficient Transformers are converging on common theoretical ground (semiseparable matrices, structured state space duality[^8]). Future models will be described as "structured sequence models" rather than "SSMs or Transformers."
2. **SSM-specific hardware or compiler optimization**. CUDA support for SSM recurrence is improving. Custom accelerators may close the hardware gap between SSMs and attention.
3. **Larger SSM models**. If the hybrid wins at 8B, someone will train a 70B hybrid. This will likely happen in 2026.
4. **SSM-based multimodal models**. Video + language hybrids using SSMs for temporal sequence modeling. The linear complexity of SSMs over video frames is extremely attractive.

### Long-Term (Speculative)

> [!WARNING] Speculation Below
> The following is informed speculation, not established research.

- **No single winner**: Just as CNNs and Transformers both persist (CNNs dominate low-level vision; Transformers dominate language and reasoning), SSMs and Transformers may both persist in their respective niches.
- **Custom chips matter**: If SSM-optimized hardware ships, the efficiency advantage compounds. If GPUs continue to dominate, the Transformer advantage in hardware utilization may persist.
- **The biology model revolution**: The most transformative near-term SSM impact may be in genomics and drug discovery — not LLMs. Evo-scale models that understand the entire human genome will be SSM-based.

---

## Practitioner's Perspective

> [!NOTE]
> This section synthesizes what engineers and ML practitioners have reported when actually deploying SSMs.

### What People Like About SSMs

**Production engineers report**:

- Constant inference memory is *real* and operationally valuable. KV-cache management in Transformer serving is genuinely painful; SSMs eliminate the problem.
- At long contexts (>16K tokens), SSM throughput advantages are measurable and translate to cost savings.
- RWKV is uniquely valuable for CPU-bound deployments (edge, no GPU).

**Researchers report**:

- Mamba training is faster per token than equivalent Transformer training at long sequences.
- The selectivity mechanism (input-dependent SSM parameters) is intuitively appealing — the model learns *what to remember*.
- Genomics and time-series domains show clearly better SSM results.

### What People Struggle With

**Practitioners report** (from community discussions, 2024):

- **The exact-lookup gap is real in production**. Chatbots built on pure SSMs sometimes miss specific facts the user stated earlier in the conversation (a toy probe for this is sketched after this list).
- **Few-shot prompting works differently**. Patterns that work in GPT-3/4 prompting don't always transfer to Mamba. The ICL mechanism is not identical.
- **Training instability at larger scales**. Falcon Mamba required additional RMS normalization layers to stabilize training at 7B — a hint that pure SSMs at scale need architectural care.
- **Tooling is immature**. The ecosystem for serving, quantizing, and fine-tuning SSMs is 2–3 years behind the Transformer ecosystem (no vLLM equivalents, limited LoRA support, etc.).
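To turn the lookup-gap complaint into something measurable, a Zoology-style synthetic associative-recall probe is easy to sketch. The prompt format, the `generate` callable, and the pair counts below are hypothetical illustrations, not the benchmark used in the cited papers.

```python
# Toy associative-recall probe: key-value pairs in context, then a query for one key.
import random

def make_recall_prompt(n_pairs: int = 20, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, expected_answer) for one synthetic recall trial."""
    rng = random.Random(seed)
    keys = rng.sample([f"k{i:03d}" for i in range(1000)], n_pairs)
    values = [str(rng.randint(100, 999)) for _ in range(n_pairs)]
    facts = " ".join(f"{k} -> {v}." for k, v in zip(keys, values))
    query_key, answer = rng.choice(list(zip(keys, values)))
    return f"{facts}\nWhat value was paired with {query_key}?", answer

def recall_accuracy(generate, n_trials: int = 50, n_pairs: int = 20) -> float:
    """Fraction of trials where the model's completion contains the correct value."""
    hits = 0
    for seed in range(n_trials):
        prompt, answer = make_recall_prompt(n_pairs, seed)
        hits += answer in generate(prompt)  # `generate` is any prompt -> text callable
    return hits / n_trials
```

Sweeping `n_pairs` (and the distance between a fact and its query) against a candidate model gives a quick, architecture-agnostic read on whether the recall gap matters for a given deployment.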
### The Emerging Consensus

Most practitioners who have evaluated both architectures (not just read the papers) converge on the same position:

> "Pure SSMs are not ready to replace Transformers for general chat/assistant applications. But SSM layers are clearly valuable and should be part of hybrid architectures going forward. And for scientific applications (genomics, audio, long time series), SSMs are already superior and should be the default choice."

This pragmatic view aligns with the research evidence: hybrids win, domain-specific SSMs win, and pure-SSM LLMs are competitive but not yet a replacement.

---

## The Debate Framed as Five Claims

| Claim | Evidence Status | Verdict |
|-------|-----------------|---------|
| SSMs match Transformers on standard language benchmarks | ✅ Strong (Falcon Mamba-7B) | **TRUE at 7B scale** |
| Hybrids outperform both | ✅ Strong (NVIDIA 8B study) | **TRUE, robust finding** |
| SSMs have a structural recall weakness | ✅ Strong (Zoology, GSSM theory) | **TRUE, mitigable** |
| SSMs are fundamentally better for long sequences | ✅ Strong (throughput, memory) | **TRUE structurally** |
| SSMs will replace Transformers | ❌ No evidence | **FALSE — hybrids will** |

### xLSTM: The Return of the LSTM (2024)

In a surprising development, Sepp Hochreiter (the LSTM's original inventor) and his team published **xLSTM** — an extended LSTM that competes with modern Transformers and SSMs.[^11] Key innovations:

- **sLSTM**: scalar memory + exponential gating + memory mixing
- **mLSTM**: fully parallelizable matrix memory with a covariance update rule (similar to an SSM)

**Result**: xLSTM models "perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling."

**The full-circle story**: LSTMs → replaced by Transformers (2017) → SSMs rediscovered what made LSTMs work (selectivity, state compression) → the LSTM modernized by its original creator. The field is converging on the same core idea from multiple starting points.

### Titans: Neural Memory for 2M+ Context (January 2025)

Google DeepMind's **Titans** (Behrouz et al., arXiv:2501.00663) introduces the short-term/long-term memory framing explicitly:[^12]

- **Attention = short-term memory**: limited context, precise dependency modeling
- **Neural memory module = long-term memory**: compresses historical context, parallelizable training, fast inference
- Achieves context windows **>2 million tokens** with higher accuracy than Transformers and linear recurrent models on needle-in-a-haystack tasks

This is arguably the most architecturally significant paper of early 2025: it reframes the Transformer–SSM debate not as "attention vs recurrence" but as "short-term vs long-term memory module design." The right architecture has both, with different roles.
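One way to see the "same core idea from multiple starting points" claim concretely: mLSTM's covariance-style matrix memory, gated linear attention, and selective SSM updates can all be read as variants of a gated outer-product state. The sketch below is a deliberately stripped-down caricature with scalar gates, no exponential-gating stabilization, and no normalizer state; it is not the published xLSTM or Titans update.

```python
# Caricature of the shared pattern: decay a matrix memory, write a (value, key) pair, read with q.
import numpy as np

def gated_matrix_memory_step(C, q, k, v, f_gate, i_gate):
    """One step: forget old content, insert the new association, query the memory."""
    C = f_gate * C + i_gate * np.outer(v, k)  # gated outer-product state update
    h = C @ q                                  # read-out for the current token
    return C, h

rng = np.random.default_rng(0)
d = 4
C = np.zeros((d, d))
for t in range(6):
    q, k, v = rng.standard_normal((3, d))
    C, h = gated_matrix_memory_step(C, q, k, v, f_gate=0.9, i_gate=1.0)
```

With `f_gate = 1` and `i_gate = 1` this reduces to the unnormalized linear-attention recurrence shown earlier; the design space the field is converging on is largely about how those gates are parameterized and stabilized.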
---

## Sources

[^1]: Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." *arXiv:2312.00752*. https://arxiv.org/abs/2312.00752. See Table 3 for throughput benchmarks.
[^2]: Zuo, J. et al. (2024). "Falcon Mamba 7B." TII / HuggingFace. https://huggingface.co/blog/falconmamba
[^3]: Waleffe, R. et al. (2024). "An Empirical Study of Mamba-based Language Models." *arXiv:2406.07887*. NVIDIA / Megatron-LM. https://arxiv.org/abs/2406.07887
[^4]: Nguyen, E. et al. (2023). "HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution." *NeurIPS 2023 Spotlight*. *arXiv:2306.15794*. https://arxiv.org/abs/2306.15794
[^5]: Arora, S. et al. (2023). "Zoology: Measuring and Improving Recall in Efficient Language Models." *arXiv:2312.04927*. https://arxiv.org/abs/2312.04927
[^6]: Jelassi, S. et al. (2024). "Repeat After Me: Transformers are Better than State Space Models at Copying." *arXiv:2402.01032*. https://arxiv.org/abs/2402.01032
[^7]: Zucchet, N. et al. (2023). "Gated Recurrent Neural Networks Discover Attention." *arXiv:2309.01775*. https://arxiv.org/abs/2309.01775
[^8]: Dao, T. & Gu, A. (2024). "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality." *ICML 2024*. *arXiv:2405.21060*. https://arxiv.org/abs/2405.21060
[^9]: Feng, L. et al. (2024). "Were RNNs All We Needed?" *arXiv:2410.01201*. https://arxiv.org/abs/2410.01201
[^10]: De, S. et al. (2024). "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." *arXiv:2402.19427*. https://arxiv.org/abs/2402.19427
[^11]: Beck, M. et al. (2024). "xLSTM: Extended Long Short-Term Memory." *arXiv:2405.04517*. https://arxiv.org/abs/2405.04517. The original LSTM team's modern successor, competitive with Transformers and SSMs.
[^12]: Behrouz, A. et al. (2025). "Titans: Learning to Memorize at Test Time." *arXiv:2501.00663*. Google DeepMind. https://arxiv.org/abs/2501.00663. Neural memory for 2M+ context windows.

---

## Mamba-3 (March 2026) — ICLR 2026

**Paper**: "Mamba-3: Improved Sequence Modeling using State Space Principles"
**Authors**: Lahoti, Li, Chen, Wang, Bick, Kolter, Dao, Gu
**arXiv**: 2603.15569 | ICLR 2026

### Three Core Improvements over Mamba-2

1. **More expressive recurrence** — derived from an improved SSM discretization, addressing Mamba-2's limitations on retrieval tasks
2. **Complex-valued state updates** — richer state tracking; improves performance on retrieval and state-tracking benchmarks
3. **MIMO formulation** — Multi-Input Multi-Output; better model quality without increasing decode latency

### Key Results (1.5B scale)

- **+1.8 pp** downstream accuracy gain over the next-best pure SSM (Gated DeltaNet) with the MIMO variant
- +0.6 pp average gain over the next-best competitor
- Comparable perplexity to Mamba-2 with **half the state size**
- Advances the performance–efficiency Pareto frontier for pure-SSM architectures

### Significance

Mamba-3 extends the SSM sprint: S4 → Mamba → Mamba-2 → Mamba-3 is a continuous trajectory of closing the quality gap with Transformers while maintaining efficiency. The complex-valued state update specifically targets the retrieval/associative-recall weakness identified in the Zoology paper — directly addressing what had been the primary structural limitation of pure-SSM models.

**Footnote**: Lahoti, P., Li, Z., Chen, Z., Wang, Y., Bick, A., Kolter, Z., Dao, T., & Gu, A. (2026). "Mamba-3: Improved Sequence Modeling using State Space Principles." arXiv:2603.15569.
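To ground the "complex-valued state updates" point above: a complex diagonal recurrence can rotate its state, letting a single channel track periodic or positional structure that a purely decaying real channel forgets. The snippet is a toy illustration under assumed parameters, not Mamba-3's actual parameterization.

```python
# Toy contrast: a real (decaying) recurrence vs. a complex (rotating) recurrence.
# h_t = a * h_{t-1} + x_t. With real |a| < 1 the state only fades toward a fixed point;
# with a on the unit circle the state's phase can track position modulo a period.
import numpy as np

def run_recurrence(a, xs):
    h = 0.0 + 0.0j
    states = []
    for x in xs:
        h = a * h + x
        states.append(h)
    return np.array(states)

xs = np.ones(8)                                            # constant input stream
real_states = run_recurrence(0.5, xs)                      # decaying real channel
rot_states = run_recurrence(np.exp(2j * np.pi / 4), xs)    # unit-modulus complex channel

print(np.round(real_states.real, 3))  # saturates: retains no positional information
print(np.round(rot_states, 3))        # cycles with period 4: encodes t mod 4 in its phase
```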