TurboAttention: Efficient Attention Approximation for High-Throughput LLM Serving

Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Rühle, Saravan Rajmohan

Conference on Machine Learning and Systems 2025 · Day 3 · Session 8: LLM and Diffusion Model Serving

In the rapidly evolving landscape of large language models (LLMs), efficient inference at scale remains a paramount challenge. At MLSys 2025, Hao Kang, a second-year PhD student at Georgia Tech, presented "TurboAttention," an approach developed during his research at Microsoft that addresses the critical memory and computation bottlenecks of LLM inference. The talk outlined how TurboAttention improves throughput and reduces latency by combining progressive quantization with a sparse-activated SoftMax approximation.
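
To make the second ingredient concrete, here is a minimal NumPy sketch of the general idea behind a sparse-activated softmax: after subtracting the row max, scores far below zero contribute negligible probability mass and can be skipped entirely, and the surviving entries can use a cheap polynomial in place of a full exp(). This is not the SAS kernel from the talk (which runs fused on GPU and does not cover the progressive-quantization half at all); `poly_exp`, the cubic fit, and the -8 cutoff are illustrative assumptions.

```python
import numpy as np

LN2 = float(np.log(2.0))

def poly_exp(x: np.ndarray) -> np.ndarray:
    """Cheap exp() for x <= 0 via range reduction.

    Write x = n*ln(2) + r with r in [0, ln 2), so exp(x) = 2**n * exp(r),
    and approximate exp(r) on that narrow interval with a cubic Taylor
    polynomial (illustrative; a production kernel would use a tuned fit).
    """
    n = np.floor(x / LN2)
    r = x - n * LN2
    exp_r = 1.0 + r * (1.0 + r * (0.5 + r / 6.0))
    return np.exp2(n) * exp_r

def sparse_softmax(scores: np.ndarray, threshold: float = -8.0) -> np.ndarray:
    """Softmax that skips entries too small to matter.

    After subtracting the row max, every entry is <= 0, and exp() of
    anything below `threshold` adds negligible mass to the normalizer
    (exp(-8) is about 3e-4), so those entries are zeroed and never
    exponentiated. The -8 cutoff is an assumed, illustrative value.
    """
    shifted = scores - scores.max(axis=-1, keepdims=True)
    keep = shifted > threshold                      # the "sparse activation"
    # Inner where() feeds a safe 0.0 to poly_exp for dropped entries;
    # outer where() zeroes them out of the result.
    exps = np.where(keep, poly_exp(np.where(keep, shifted, 0.0)), 0.0)
    return exps / exps.sum(axis=-1, keepdims=True)

# Example: the approximation tracks the exact softmax closely.
row = np.array([[4.0, 3.5, 0.2, -6.0]])
exact = np.exp(row - row.max()) / np.exp(row - row.max()).sum()
print(sparse_softmax(row))  # ~[0.615, 0.373, 0.014, 0.000]
print(exact)
```

The win on real hardware comes from the entries that are never exponentiated and never touch the normalizer, plus the polynomial being far cheaper than a transcendental exp; how much of that a kernel actually recovers depends on the fused GPU implementation, which is what the 1.8x claim below is about.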

AI review

TurboAttention is a legitimate systems paper from MLSys 2025 that combines progressive quantization with a clever SoftMax approximation (SAS) to squeeze more throughput out of attention on A100s. The engineering is real, the implementation detail is reasonable, and the 1.8x speedup claim over FlashAttention v2 is specific enough to take seriously. What's missing is the code — the kernel is apparently still stuck in Microsoft's internal review process — which means this is a talk about results you can't yet reproduce. Solid ML systems work, but not something you can act on this week.