DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, Hui Guan
Conference on Machine Learning and Systems 2025 · Day 2 · Session 1: LLM and Diffusion Model Serving
The rapid advancement of text-to-image diffusion models has revolutionized content creation, but their computational intensity makes them costly to serve efficiently in production. This talk introduces **DiffServe**, a system that optimizes diffusion-model serving by dynamically choosing which model handles each query based on the query's estimated difficulty. Presented by Qizheng Yang from UMass Amherst, this collaborative work with Adobe Research addresses the trade-off between image quality, generation latency, and computational cost that constrains current diffusion-model serving infrastructures.
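The routing idea can be sketched in a few lines: every query first runs on a cheap "light" model, a discriminator scores the result, and only low-confidence outputs are escalated to the expensive "heavy" model. This is a minimal illustrative sketch, not DiffServe's actual API; `run_light`, `run_heavy`, `discriminator_score`, and the threshold are all assumed names and values.

```python
THRESHOLD = 0.7  # assumed confidence cutoff for accepting the light model's output

def run_light(prompt: str) -> str:
    # stand-in for a cheap (e.g. distilled, few-step) diffusion model
    return f"light-image({prompt})"

def run_heavy(prompt: str) -> str:
    # stand-in for the full, expensive diffusion model
    return f"heavy-image({prompt})"

def discriminator_score(prompt: str, image: str) -> float:
    # stand-in scorer: a real discriminator would judge image quality;
    # here we pretend short prompts are "easy" for illustration only
    return 1.0 if len(prompt.split()) <= 5 else 0.3

def serve(prompt: str) -> tuple[str, str]:
    """Cascade: try the light model, escalate if the discriminator is unsure."""
    image = run_light(prompt)
    if discriminator_score(prompt, image) >= THRESHOLD:
        return image, "light"
    return run_heavy(prompt), "heavy"
```

The key property is that the heavy model's cost is paid only for the fraction of queries the discriminator flags, rather than for every request.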
AI review
DiffServe is a competent ML systems paper dressed up as a conference talk — the cascading discriminator idea is real and the MILP allocator is an honest engineering choice, but the presentation as summarized here is thin on the details that would let anyone actually reproduce or extend this. The core insight (not all diffusion queries are equally hard, so don't route them equally) is genuinely useful, but the implementation specifics that would make this actionable — discriminator training data, MILP solve times, how the system behaves when the discriminator misfires — are either absent or…
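To make the allocator concrete: the decision the talk frames as a MILP is, roughly, how to split a fixed GPU budget between light-model and heavy-model replicas so that total demand and the escalated fraction are both served. The sketch below is a guess at that formulation under invented throughput numbers, with exhaustive search standing in for a real MILP solver; none of these parameters come from the paper.

```python
def allocate_gpus(total_gpus: int, demand_qps: float, escalate_frac: float,
                  light_tput: float = 10.0, heavy_tput: float = 2.0):
    """Return (light_gpus, heavy_gpus) for a feasible split, or None.

    Constraints (assumed): every query passes through the light path, and
    a fraction `escalate_frac` is re-run on the heavy path. Objective
    (assumed): maximize slack heavy-path capacity to absorb bursts.
    Brute force over integer splits stands in for a MILP solver.
    """
    best = None
    for heavy in range(total_gpus + 1):
        light = total_gpus - heavy
        light_cap = light * light_tput
        heavy_cap = heavy * heavy_tput
        if light_cap >= demand_qps and heavy_cap >= demand_qps * escalate_frac:
            slack = heavy_cap - demand_qps * escalate_frac
            if best is None or slack > best[0]:
                best = (slack, light, heavy)
    return None if best is None else (best[1], best[2])
```

Even this toy version shows why the review asks about solve times: the real problem adds batching, SLOs, and heterogeneous hardware, and MILP solve latency then matters on the serving path.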