Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving

Gao Wei, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

Conference on Machine Learning and Systems 2025 · Day 2 · Session 1: LLM and Diffusion Model Serving

This talk, presented by Hanyu from Alibaba on behalf of the authors from Nanyang Technological University, S-Lab, and Shanghai AI Lab, addresses a critical challenge in serving large language models (LLMs): the immense memory footprint of the **Key-Value (KV) cache**. As LLMs continue to scale in size and complexity, efficient and cost-effective serving becomes paramount. The KV cache, which stores the intermediate key and value tensors for the attention mechanism, consumes a disproportionately large share of GPU memory, often exceeding the memory occupied by the model weights themselves. For instance, serving a Llama 3 70B model in FP16 with a batch size of 512 and a prompt length of 2048 tokens requires 130 GB for model weights but a staggering 512 GB for the KV cache alone.
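
To make the scaling concrete, here is a minimal back-of-the-envelope sketch of how KV cache size grows with batch size and sequence length. The model hyperparameters below (80 layers, grouped-query attention with 8 KV heads of dimension 128, FP16) are assumptions chosen for illustration, not figures from the talk; the exact total depends on the attention configuration and generation length, so this is meant to show the scaling rather than reproduce the quoted 512 GB.

```python
# Back-of-the-envelope KV cache sizing (illustrative sketch only).
# Hyperparameters are assumed, not taken from the talk.

def kv_cache_bytes(batch_size: int, seq_len: int,
                   num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache size: one key and one value vector per token, per layer."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token

# Assumed Llama-3-70B-style configuration: 80 layers, GQA with 8 KV heads,
# head dimension 128, FP16 (2 bytes per element).
total = kv_cache_bytes(batch_size=512, seq_len=2048,
                       num_layers=80, num_kv_heads=8, head_dim=128)
print(f"KV cache: {total / 1024**3:.0f} GiB")  # ~320 GiB under these assumptions
```

The key observation is that the footprint grows linearly in both batch size and sequence length, while the weight memory stays fixed, which is why the KV cache quickly becomes the dominant memory consumer at serving-scale batch sizes.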

AI review

A competent empirical teardown of KV cache compression benchmarking practices, with a clear and useful central message: most published results are measured in the wrong environment and the real costs are hidden in latency and negative samples. The proposed tooling (throughput predictor, length predictor, negative sample evaluator) is practical and framed well. But this is ultimately a 'the benchmarks are wrong' paper with thin implementation details on the tools themselves, and the proposed solutions are described conceptually without enough depth to reproduce or extend. Solid systems…