NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
Conference on Machine Learning and Systems 2025 · Day 4 · Session 10: LLM and Diffusion Model Serving
In the rapidly evolving landscape of large language models (LLMs), online inference has become a cornerstone for numerous cutting-edge applications. However, the relentless growth in LLM size has precipitated a significant challenge: the **GPU memory crisis**. This talk, "NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference," presented by Yang Zhou and co-authored with Xuanlin Jiang, Shiyi Cao, Ion Stoica, and Minlan Yu in a collaboration spanning Peking University, UC Berkeley, UC Davis, and Harvard, introduces a system designed to mitigate this crisis. NEO offloads work to often-underutilized CPU resources to improve GPU efficiency and overall inference throughput for online LLM serving.
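To make the offloading idea concrete, here is a minimal sketch of the mechanism the summary describes: keeping a request's KV cache in pinned host memory and running its per-token decode attention on the CPU, so only a single query vector and a single output vector cross the PCIe bus each step. The function name, shapes, and layout are illustrative assumptions, not NEO's actual code, and the snippet assumes a CUDA device is available.

```python
# Hypothetical sketch of CPU-offloaded decode attention (not NEO's implementation).
import torch

def cpu_decode_attention(q_gpu: torch.Tensor,
                         k_cache_cpu: torch.Tensor,
                         v_cache_cpu: torch.Tensor) -> torch.Tensor:
    """q_gpu: [heads, head_dim] query for one new decode token (on GPU).
    k_cache_cpu / v_cache_cpu: [heads, seq_len, head_dim] pinned CPU tensors."""
    q = q_gpu.to("cpu", non_blocking=True)   # tiny D2H copy: one token's query
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,hsd->hs", q, k_cache_cpu) * scale   # attention scores on CPU
    probs = torch.softmax(scores, dim=-1)
    out = torch.einsum("hs,hsd->hd", probs, v_cache_cpu)          # weighted sum of values
    return out.to("cuda", non_blocking=True)  # tiny H2D copy: one token's attention output

# Usage sketch: the KV cache lives in pinned host memory, so it never occupies GPU HBM.
heads, seq_len, head_dim = 32, 4096, 128
k_cache = torch.empty(heads, seq_len, head_dim, pin_memory=True)
v_cache = torch.empty(heads, seq_len, head_dim, pin_memory=True)
```

The appeal of this split is that decode attention is memory-bandwidth-bound and grows with context length, while the projections and MLP that remain on the GPU are compute-bound, which is why moving the KV cache and its attention off the GPU can free HBM without idling the accelerator.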
AI review
NEO presents a real systems contribution — CPU offloading for KV cache and decoding attention in online LLM inference — with a clear theoretical foundation and legitimately impressive numbers on memory-constrained hardware. The asymmetric pipelining insight is sound and the strawman analysis is the kind of honest engineering reasoning I like to see. But the write-up is a polished summary of a paper, not a window into implementation. I can follow the architecture at a whiteboard level, but I couldn't reproduce this tomorrow, and the scheduler details are punted entirely to the paper. Worth…
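As a whiteboard-level companion to the review's point about asymmetric pipelining, the sketch below shows one way such overlap could be structured: each batch is split into a GPU-resident sub-batch and a CPU-offloaded sub-batch, and per layer the CPU attention for the offloaded requests runs concurrently with the GPU work for the rest. The split policy, names, and structure are assumptions for illustration, not the scheduler described in the paper.

```python
# Hypothetical sketch of overlapping CPU decode attention with GPU compute per layer.
from concurrent.futures import ThreadPoolExecutor

def run_layer(gpu_batch, cpu_batch, gpu_step, cpu_attention, pool: ThreadPoolExecutor):
    # Kick off CPU attention for the offloaded sub-batch asynchronously...
    cpu_future = pool.submit(cpu_attention, cpu_batch)
    # ...while the GPU processes the compute-heavy sub-batch (e.g., prefill and MLPs).
    gpu_out = gpu_step(gpu_batch)
    # Synchronize: both halves must finish before the next layer begins.
    cpu_out = cpu_future.result()
    return gpu_out, cpu_out
```

The "asymmetric" part, as I read the review, is that the two sub-batches are deliberately sized unequally so the slower CPU attention is hidden behind a larger chunk of GPU work; the exact sizing policy is one of the scheduler details the write-up defers to the paper.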