LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Shang Yang, Junxian Guo, Haotian Tang, Guangxuan Xiao, Song Han

Conference on Machine Learning and Systems 2025 · Day 3 · Session 7: Quantization and Sparsity

In the rapidly evolving landscape of artificial intelligence, **long context Large Language Models (LLMs)** have emerged as a pivotal technology, unlocking new frontiers in applications ranging from comprehensive document and video understanding to complex multi-turn reasoning tasks. However, serving these models efficiently remains challenging: the **attention mechanism** scales quadratically with sequence length during prefilling, and the key-value (KV) cache that must be read at every step grows linearly during decoding. The talk "LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention" introduces a system designed to address both bottlenecks, offering a unified sparse-attention framework that optimizes the **prefilling stage** (processing the initial long prompt) and the **decoding stage** (generating subsequent tokens) together.
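To make the idea of unified sparse attention concrete, here is a minimal illustrative sketch, not LServe's actual implementation: the KV cache is split into fixed-size pages, each page receives a cheap importance score against the current query, and full attention runs only over the top-scoring pages. The page size, the scoring rule (dot product with a per-page mean key), and the top-k budget below are all assumptions chosen for illustration.

```python
import numpy as np

def sparse_page_attention(q, keys, values, page_size=64, top_k_pages=4):
    """Illustrative page-sparse attention (not the paper's algorithm).

    q: (d,) query vector; keys/values: (seq_len, d) cached keys and values.
    Returns the attention output computed over only the selected KV pages.
    """
    seq_len, d = keys.shape
    n_pages = (seq_len + page_size - 1) // page_size

    # Cheap per-page summary: score each page by q . (mean key of the page).
    scores = np.empty(n_pages)
    for p in range(n_pages):
        page_keys = keys[p * page_size:(p + 1) * page_size]
        scores[p] = q @ page_keys.mean(axis=0)

    # Keep the highest-scoring pages; always retain the last (most recent) page.
    keep = set(np.argsort(scores)[-top_k_pages:].tolist()) | {n_pages - 1}
    idx = np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, seq_len))
        for p in sorted(keep)
    ])

    # Dense softmax attention restricted to the selected tokens.
    logits = keys[idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]

# Example: one decoding step over a 4k-token cache touches only a few pages
# instead of all 4096 cached positions.
q = np.random.randn(128)
keys = np.random.randn(4096, 128)
values = np.random.randn(4096, 128)
out = sparse_page_attention(q, keys, values)
```

The same selection logic can be applied per query block during prefilling and per step during decoding, which is the sense in which a page-level sparsity scheme can serve both stages; the specific selection and paging strategy LServe uses is described in the paper, not here.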

AI review

LServe is credible systems research from a group that clearly built and benchmarked the system — the hierarchical paging scheme and the reusable page selector are genuine engineering contributions worth knowing about. But the write-up reads more like a polished abstract than a talk report, and the implementation details stay just shallow enough that you couldn't reproduce this without going back to the paper or codebase. Good work, limited actionability from this write-up alone.

Watch on YouTube