Marconi: Prefix Caching for the Era of Hybrid LLMs
Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Tri Dao, Ravi Netravali
Conference on Machine Learning and Systems 2025 · Day 4 · Session 10: LLM and Diffusion Model Serving
A persistent challenge for large language models (LLMs) is staying efficient as context lengths grow. Traditional transformer-based LLMs grapple with the **attention mechanism's** quadratic computational complexity and the **KV cache's** linear memory scaling, both of which become bottlenecks at long sequence lengths. This talk introduces **Marconi**, a prefix caching system built for **hybrid LLMs**, which interleave transformer attention layers with more efficient **State Space Model (SSM)** layers such as **Mamba**. Because SSM layers replace the ever-growing KV cache with a fixed-size recurrent state, conventional prefix caching does not carry over directly, and Marconi is designed to close that gap.
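As a rough illustration of the scaling contrast the talk leans on, the back-of-envelope sketch below compares how attention prefill compute, KV-cache memory, and SSM state memory grow with sequence length. The formulas and parameter values are simplified assumptions for illustration, not figures from the paper.

```python
def attention_prefill_flops(seq_len: int, d_model: int) -> float:
    """Self-attention prefill compute grows quadratically with sequence length
    (simplified: counts only the QK^T and attention-times-V matmuls)."""
    return 2 * 2 * seq_len * seq_len * d_model


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV-cache memory grows linearly with sequence length (keys + values)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem


def ssm_state_bytes(n_layers: int, d_state: int, d_inner: int,
                    bytes_per_elem: int = 2) -> int:
    """An SSM layer keeps a fixed-size recurrent state, independent of seq_len."""
    return n_layers * d_state * d_inner * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model dimensions, chosen only to show the growth rates.
    for seq_len in (4_096, 32_768, 131_072):
        print(seq_len,
              f"{attention_prefill_flops(seq_len, d_model=4096):.2e} attn FLOPs",
              f"{kv_cache_bytes(seq_len, 32, 8, 128) / 2**30:.1f} GiB KV",
              f"{ssm_state_bytes(32, 128, 8192) / 2**20:.1f} MiB SSM state")
```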
AI review
Marconi is genuinely novel systems work on a real and underappreciated problem: prefix caching breaks when you mix SSM layers into your transformer stack, and nobody had built a proper fix until now. The FLOP-efficiency metric is a clean insight, the radix tree admission strategy is well-motivated, and the performance numbers are meaningful: a 34x token hit rate improvement and a 20% P50/P95 TTFT reduction against reasonable baselines. The write-up sits one abstraction layer above the code, which is frustrating, but the engineering reasoning is sound enough that you could reproduce the core ideas.
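To make the eviction idea concrete, here is a minimal sketch of scoring cached prefixes by FLOP efficiency, i.e. compute saved per byte of cache held, so that entries saving little compute relative to their footprint go first. The class and function names and the exact scoring formula are assumptions for illustration, not Marconi's actual implementation, and the radix-tree admission logic is omitted.

```python
from dataclasses import dataclass


@dataclass
class CachedPrefix:
    """Hypothetical cache entry for one reusable prefix in a hybrid model."""
    num_tokens: int            # length of the cached prefix
    kv_bytes: int              # attention KV-cache footprint
    ssm_state_bytes: int       # fixed-size SSM state checkpoint footprint
    flops_saved_on_hit: float  # prefill compute skipped when this entry hits

    def flop_efficiency(self) -> float:
        """FLOPs saved per byte of cache occupied; higher is more worth keeping."""
        return self.flops_saved_on_hit / max(self.kv_bytes + self.ssm_state_bytes, 1)


def eviction_order(entries: list[CachedPrefix]) -> list[CachedPrefix]:
    """Evict the least FLOP-efficient entries first (ascending score),
    rather than relying purely on recency as plain LRU would."""
    return sorted(entries, key=lambda e: e.flop_efficiency())
```

The intuition, under the same assumptions: an SSM state checkpoint costs roughly the same memory whether it shortcuts a short or a very long prefix, so compute saved per byte separates valuable entries from dead weight better than recency alone.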