LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan
Conference on Machine Learning and Systems 2025 · Day 2 · Session 1: LLM and Diffusion Model Serving
The proliferation of large language models (LLMs) has brought the self-attention mechanism to the forefront of AI innovation. However, efficiently executing attention, particularly during the decode phase of inference, presents significant hardware utilization challenges. This talk, presented by Rya Sanovar from Microsoft, introduces **LeanAttention**, a novel hardware-aware attention mechanism designed to maximize GPU occupancy and drastically improve the latency of Transformer inference for critical workloads like long-context processing and ragged batching.
AI review
LeanAttention is a real piece of engineering — Stream-K applied to attention decomposition, implemented in CUTLASS/CuTe, shipping in ONNX Runtime — and the core insight (linearize the work across heads and context, partition equally across SMs) is genuinely useful. But this write-up reads like a thorough paper summary, not a talk review, and the gaps that matter to me as a builder are never filled: no ablation plots, no roofline analysis, no discussion of how this interacts with paged KV caches or speculative decoding, and the 'multifold speedup for ragged batching' claim floats without a…
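To make the review's "core insight" concrete, here is a minimal NumPy sketch of the general idea: split a single decode query's context into equal partitions (stand-ins for per-SM work units), compute a partial softmax-attention over each, then merge the partials with log-sum-exp rescaling. This is an illustrative reconstruction of the softmax-decomposition pattern (as in Stream-K-style splitting and flash-decoding), not the paper's actual kernel; all function names are hypothetical.

```python
import numpy as np

def attention_ref(q, K, V):
    # Reference single-query attention: softmax(q @ K^T) @ V
    s = K @ q                       # (n,) unnormalized scores
    p = np.exp(s - s.max())         # numerically stable softmax
    return (p / p.sum()) @ V        # (d_v,) output

def lean_attention_sketch(q, K, V, num_partitions=4):
    # Hypothetical sketch: each partition plays the role of one SM's
    # equal-sized slice of the linearized context work.
    n, _ = K.shape
    bounds = np.linspace(0, n, num_partitions + 1, dtype=int)
    partials = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        s = K[lo:hi] @ q            # partial scores for this slice
        m = s.max()                 # local max for stability
        p = np.exp(s - m)
        # Keep (local max, local normalizer, unnormalized partial output)
        partials.append((m, p.sum(), p @ V[lo:hi]))
    # Reduction step: rescale each partial by exp(m_i - m_global)
    # so all partials share one softmax normalization.
    m_glob = max(m for m, _, _ in partials)
    denom = sum(z * np.exp(m - m_glob) for m, z, _ in partials)
    numer = sum(o * np.exp(m - m_glob) for m, _, o in partials)
    return numer / denom
```

The key property is that the merge is exact, not approximate: because softmax is associative under this rescaling, the partitioned result matches the monolithic one to floating-point tolerance, which is what lets the work be divided evenly across SMs regardless of context length or batch raggedness.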