FlexInfer: Flexible LLM Inference with CPU Computations
Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Tushar Krishna, Hyesoon Kim
Conference on Machine Learning and Systems 2025 · Day 3 · Session 8: LLM and Diffusion Model Serving
The proliferation of Large Language Models (LLMs) has led to an explosion in demand for efficient inference, particularly in high-throughput applications like chatbots. However, a critical bottleneck in deploying these models is their immense memory footprint, encompassing both model weights and the dynamically growing **Key-Value (KV) cache**. Modern LLMs frequently exceed the memory capacity of even high-end accelerators like the NVIDIA H100 GPU. To address this, current solutions often resort to offloading model components to CPU memory, leveraging the CPU's larger capacity. While this enables the execution of larger models on memory-constrained GPUs, it introduces a significant performance penalty due to constant data transfers over the **PCI Express (PCI-e)** interconnect.
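To make the footprint claim concrete, here is a back-of-the-envelope sketch of how KV-cache memory grows with batch size and context length. The model shape and serving parameters below are illustrative assumptions (a Llama-2-70B-class model with grouped-query attention), not figures reported in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """KV cache footprint: 2 tensors (K and V) per layer,
    per attention head slot, per token, per sequence."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Assumed model shape: 80 layers, 8 KV heads, head_dim 128, fp16 values.
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=4096, batch_size=32)
weights = 70e9 * 2  # ~140 GB of fp16 weights for a 70B-parameter model

print(f"KV cache: {cache / 2**30:.0f} GiB")   # ~40 GiB
print(f"Weights:  {weights / 1e9:.0f} GB")    # ~140 GB
# Together these far exceed the 80 GB on a single H100, which is why
# serving systems fall back to offloading weights or KV cache to CPU memory.
```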
AI review
FlexInfer is a legitimate systems paper with a crisp central insight: CPUs with modern matrix accelerators are genuinely good at the decode phase, not despite being slower than GPUs at raw compute but because keeping the KV cache in host memory eliminates PCI-e round-trips. The 75-76% latency reduction numbers are striking. But the write-up reads more like a well-organized abstract than a builder's guide: I know what the system decides, but not enough about how the performance estimator actually works or how you'd adapt this to your stack. Solid MLSys research, limited immediate actionability for most engineers.
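To see why the insight holds, consider a toy roofline-style cost comparison for one decode step when the KV cache is too large for GPU memory. This is not FlexInfer's performance estimator, whose details the paper does not fully expose here; the bandwidth and FLOP numbers are illustrative assumptions meant only to show the shape of the trade-off.

```python
def decode_attention_step_s(kv_bytes, mem_bw_gbs, accel_tflops,
                            transfer_bytes=0, link_bw_gbs=0):
    """Roofline-style estimate for one decode step's attention:
    the whole KV cache is read once, at roughly 2 FLOPs per byte read."""
    compute_s = (2 * kv_bytes) / (accel_tflops * 1e12)
    memory_s = kv_bytes / (mem_bw_gbs * 1e9)
    transfer_s = transfer_bytes / (link_bw_gbs * 1e9) if link_bw_gbs else 0.0
    return max(compute_s, memory_s) + transfer_s

kv = 40 * 2**30  # assume ~40 GiB of KV cache resident in CPU DRAM

# Offloaded-GPU path: every decode step first streams the KV cache over
# PCI-e (~25 GB/s assumed) before the GPU's HBM and compute can be used.
gpu = decode_attention_step_s(kv, mem_bw_gbs=3000, accel_tflops=989,
                              transfer_bytes=kv, link_bw_gbs=25)

# CPU path: slower matrix units and lower DRAM bandwidth, but no transfer.
cpu = decode_attention_step_s(kv, mem_bw_gbs=300, accel_tflops=30)

print(f"offloaded-GPU step ~{gpu:.2f} s vs CPU step ~{cpu:.2f} s")
# With these assumed numbers the PCI-e transfer dominates the GPU path,
# so the locally-executed CPU step wins by roughly an order of magnitude.
```

The point of the sketch is the structure, not the numbers: once the interconnect term dominates, a slower local device beats a faster remote one, which is the data-locality argument the paper builds on.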