QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Song Han
Conference on Machine Learning and Systems 2025 · Day 2 · Session 3: Quantization and Sparsity
This article delves into QServe, a system and algorithm co-design for the efficient serving of Large Language Models (LLMs) on cloud infrastructure. Presented by Shang Yang, a second-year PhD student at MIT EECS advised by Professor Song Han, QServe introduces a novel mixed-precision quantization scheme, **W4A8KV4**, dubbed QoQ (quattuor-octo-quattuor). This approach quantizes model weights to 4 bits, activations to 8 bits, and the KV (key-value) cache, a critical component of the attention mechanism, also to 4 bits. The core motivation behind QServe is to tackle the high computational and memory demands of LLM inference, particularly in the decoding stage, which often becomes the system bottleneck in cloud serving scenarios.
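To make the W4A8KV4 precision mix concrete, the sketch below simulates symmetric 4-bit per-group weight quantization and 8-bit per-tensor activation quantization in numpy. This is a simplified illustration of the bit-widths involved, not QServe's actual progressive quantization or its CUDA kernels; the group size of 128 and the helper names are assumptions for the example.

```python
import numpy as np

def quantize_weights_int4(w, group_size=128):
    """Symmetric 4-bit per-group weight quantization (illustrative only,
    not QServe's progressive scheme). Returns int codes and fp scales."""
    w = w.reshape(-1, group_size)
    # int4 signed range is [-8, 7]; scale maps the per-group max to 7
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_activations_int8(x):
    """Symmetric 8-bit per-tensor activation quantization."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
qw, sw = quantize_weights_int4(w)
w_hat = (qw * sw).reshape(-1)          # dequantize to check error
max_err = np.abs(w - w_hat).max()      # bounded by half a quantization step
```

In a real W4A8 kernel the 4-bit weights would be dequantized (or partially dequantized to 8-bit, as in QServe's progressive scheme) inside the GEMM main loop rather than materialized in full precision as done here.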
AI review
QServe is a serious piece of systems work from the Song Han group at MIT — a tightly co-designed quantization scheme and serving infrastructure that squeezes real performance out of the W4A8KV4 precision mix. The engineering is specific enough to be credible: progressive group quantization with overflow-safe dequantization, compute-aware weight reordering, smooth attention for K cache outliers, fused activation quantization. The results (2.4x–3.5x over TensorRT-LLM) are benchmarked on A100s against named baselines. Where it falls short for my taste: the article is reconstructed from a talk…