SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling

Ke Hong, Xiuhong Li, Lufang Chen, Guohao Dai, Xuefei Ning, Yu Wang

Conference on Machine Learning and Systems 2025 · Day 3 · Session 8: LLM and Diffusion Model Serving

The rapid proliferation of large language models (LLMs) has created new challenges in model serving. This talk, presented by Ke Hong of Tsinghua University and co-authored with researchers from Shanghai Jiao Tong University, Peking University, and Infinigence AI, introduces **SOLA**, a **state-aware scheduling** framework designed to improve **Service Level Objective (SLO) attainment** for LLM inference. SOLA targets the two-phase structure of LLM serving, where the distinct **prefill** and **decode** phases each carry their own critical latency metric: **Time To First Token (TTFT)** for prefill and **Time Per Output Token (TPOT)** for decode.
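To make the two metrics concrete, here is a minimal sketch of how TTFT, TPOT, and SLO attainment can be computed from per-request timestamps. The `RequestTrace` fields and the SLO thresholds are illustrative assumptions for this write-up, not values or code from the paper.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (seconds) collected for one served request (hypothetical schema)."""
    arrival: float          # request enqueued
    first_token: float      # first output token emitted (end of prefill)
    completion: float       # last output token emitted
    num_output_tokens: int

def ttft(r: RequestTrace) -> float:
    # Time To First Token: queueing delay plus prefill latency.
    return r.first_token - r.arrival

def tpot(r: RequestTrace) -> float:
    # Time Per Output Token: average decode latency per token after the first.
    decode_tokens = max(r.num_output_tokens - 1, 1)
    return (r.completion - r.first_token) / decode_tokens

def slo_attainment(traces: list[RequestTrace], ttft_slo: float, tpot_slo: float) -> float:
    """Fraction of requests meeting BOTH latency SLOs."""
    met = sum(1 for r in traces if ttft(r) <= ttft_slo and tpot(r) <= tpot_slo)
    return met / len(traces)

# Example with illustrative SLOs (thresholds are assumptions, not from the paper).
traces = [
    RequestTrace(arrival=0.0, first_token=0.4, completion=2.4, num_output_tokens=41),
    RequestTrace(arrival=0.1, first_token=1.3, completion=3.0, num_output_tokens=18),
]
print(f"SLO attainment: {slo_attainment(traces, ttft_slo=1.0, tpot_slo=0.06):.0%}")
```

Under this framing, a scheduler can trade the two metrics against each other: admitting more prefill work lowers TTFT for waiting requests but inflates TPOT for in-flight decodes, which is exactly the interference SOLA's state-aware scheduling is designed to balance.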

AI review

SOLA is a competent MLSys paper on LLM serving scheduler optimization — the kind of work that matters if you're running a serving stack at scale. The core idea is real and the problem is well-motivated: prefill/decode phase interference creates biased latency distributions that existing schedulers handle badly. The iteration-level feedback loop and constrained optimization formulation are legitimate engineering. But as a conference talk write-up, this article is essentially a padded abstract — it tells me what SOLA does without giving me enough to understand, evaluate, or reproduce the key…