Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, Xin Li, Chenhao Jiang, Gennady Pekhimenko
Conference on Machine Learning and Systems 2025 · Day 3 · Session 8: LLM and Diffusion Model Serving
The talk "Seesaw: High-throughput LLM Inference via Model Re-sharding" introduces a novel framework designed to significantly accelerate throughput-oriented offline large language model (LLM) text generation in distributed GPU environments. Presented by Qidong Su from the University of Toronto and Santa Mel, this work, a collaboration with researchers from the University of Toronto, the Vector Institute, Santa Mel, and Stanford University, addresses a fundamental inefficiency in current LLM inference systems. The core insight driving Seesaw is that the two primary phases of LLM text generation—pre-fill (processing input prompts) and decode (generating tokens one by one)—exhibit fundamentally different computational characteristics and thus benefit from distinct parallelization strategies.
AI review
Seesaw is a legitimate piece of systems engineering with a clean core insight: pre-fill and decode have different computational bottlenecks, so tailor the parallelization strategy to each phase rather than picking one and living with the compromise. The 36% throughput gain over vLLM is credible, and the tiered KV cache buffering is a genuinely clever mechanism for keeping re-sharding costs from eating the benefit. But the write-up reads like a cleaned-up paper summary, not a talk that shows you how to build or adopt the thing. The scope is narrow (single-node, offline batch only), the…