ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Ana Klimovic, Eiko Yoneki
Conference on Machine Learning and Systems 2025 · Day 4 · Session 10: LLM and Diffusion Model Serving
The proliferation of large language models (LLMs) has revolutionized many industries, yet deploying them in production remains challenging, primarily because of the immense computational resources they demand. **ThunderServe**, a novel system presented at MLSys 2025 by Taiyi Wang from the University of Cambridge, in collaboration with Peking University and ETH Zurich, addresses the need for high-performance and cost-efficient LLM serving, particularly in dynamic cloud environments. Its core motivation is twofold: to cut the prohibitive cost of LLM inference, which typically demands substantial GPU resources to meet stringent latency and throughput targets, and to put pervasively underutilized hardware, especially older-generation GPUs in cloud data centers, back to work.
AI review
ThunderServe is a legitimate systems paper tackling a real problem — heterogeneous GPU scheduling for LLM serving in cloud environments — with solid algorithmic depth and credible experimental results. The prefill/decode phase splitting insight is well-motivated, the scheduler stack is technically specific (Tabu search, hierarchical clustering, DP + LP), and the 25x latency headline is striking enough to warrant attention. But the article reads like an academic paper summary, not an engineer's guide to building this. The code isn't mentioned, reproducibility is unclear, and the 25x claim…
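Since the article doesn't include code, here is a minimal sketch to make the scheduler stack concrete: a Tabu search that assigns a heterogeneous GPU fleet to prefill and decode pools under a toy throughput model. Everything here is an illustrative assumption, not ThunderServe's actual implementation: the GPU names, the per-phase throughput numbers, the min-of-pools objective, and the single-flip neighborhood are all made up to show the flip-and-tabu mechanics only.

```python
import random

# Hypothetical per-GPU throughputs (tokens/s) for each phase. The real system
# would profile these; the numbers below are invented for illustration.
GPU_PROFILES = {
    "A100": {"prefill": 100.0, "decode": 60.0},
    "V100": {"prefill": 40.0, "decode": 35.0},
    "T4":   {"prefill": 15.0, "decode": 20.0},
}

def throughput(assignment, fleet):
    """Toy objective: a disaggregated deployment is bottlenecked by whichever
    phase pool (prefill or decode) has less aggregate capacity."""
    prefill = sum(GPU_PROFILES[fleet[i]]["prefill"]
                  for i, phase in enumerate(assignment) if phase == "prefill")
    decode = sum(GPU_PROFILES[fleet[i]]["decode"]
                 for i, phase in enumerate(assignment) if phase == "decode")
    return min(prefill, decode)

def tabu_search(fleet, iters=200, tabu_len=5):
    """Flip one GPU's phase per step; recently flipped GPUs stay tabu so the
    search can escape local optima instead of oscillating."""
    assignment = [random.choice(["prefill", "decode"]) for _ in fleet]
    best, best_score = list(assignment), throughput(assignment, fleet)
    tabu = []  # indices of recently flipped GPUs
    for _ in range(iters):
        candidates = []
        for i in range(len(fleet)):
            if i in tabu:
                continue
            neighbor = list(assignment)
            neighbor[i] = "decode" if neighbor[i] == "prefill" else "prefill"
            candidates.append((throughput(neighbor, fleet), i, neighbor))
        if not candidates:
            break
        score, i, assignment = max(candidates)  # best admissible neighbor
        tabu.append(i)
        if len(tabu) > tabu_len:
            tabu.pop(0)
        if score > best_score:
            best, best_score = list(assignment), score
    return best, best_score

fleet = ["A100", "A100", "V100", "V100", "T4", "T4", "T4"]
plan, score = tabu_search(fleet)
print(list(zip(fleet, plan)), score)
```

In the paper's actual stack, an objective like this would come from profiled phase latencies, and the Tabu layer would sit alongside the hierarchical clustering and DP + LP components mentioned in the review; this sketch shows only the search mechanics in isolation.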