FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference

Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Yida Wang, Yufei Ding

Conference on Machine Learning and Systems 2025 · Day 2 · Session 1: LLM and Diffusion Model Serving

This article covers **DeepServe**, a system designed to cut the cost of serving **text-to-image diffusion models** without sacrificing output quality. Presented by Shirong Yang, a PhD student at UMass Amherst, and developed jointly by UMass Amherst and Adobe Research, DeepServe targets the computational intensity inherent to these generative models: the multi-step iterative denoising process at the core of diffusion inference is what drives both serving cost and latency.
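To make the cost structure concrete, the multi-step denoising loop can be sketched as below. This is a toy illustration, not DeepServe's code: `denoise_step`, `generate`, and the linear pull-toward-zero update are all stand-ins for the large neural network a real diffusion model evaluates at every step.

```python
import random

def denoise_step(x, t, total_steps, rng):
    # One toy update: pull each value toward the "clean" target (zero here)
    # and inject a shrinking amount of noise. A real diffusion model runs a
    # large neural network at this point, which is why the number of steps
    # dominates serving cost and latency.
    noise_scale = (total_steps - t) / total_steps
    return [0.9 * v + 0.01 * noise_scale * rng.gauss(0, 1) for v in x]

def generate(size=16, num_steps=50, seed=0):
    # The full multi-step loop: per-request latency grows linearly with
    # num_steps, regardless of how easy the prompt is.
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(size)]  # start from pure noise
    for t in range(num_steps):
        x = denoise_step(x, t, num_steps, rng)
    return x

sample = generate()
print(len(sample))  # 16
```

The point of the sketch is the loop shape: every request pays for every step, which is what motivates systems like DeepServe that try to spend fewer steps (or a cheaper model) on easy prompts.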

AI review

DeepServe is a competent systems paper on diffusion model serving that solves a real problem — not all prompts are equally hard, so why treat them the same? The cascade architecture plus a learned discriminator is a sensible idea, the MILP-based resource allocator is an interesting wrinkle, and the FID/SLO results are at least directionally credible. But the write-up reads like a conference proceedings summary padded with PR copy, the discriminator training procedure is underspecified enough that you couldn't reproduce it, and there's no code or artifact to point at. Worth knowing about if…
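The cascade-plus-discriminator idea the review refers to can be sketched as follows. Everything here is an illustrative assumption: DeepServe's discriminator is a learned model (whose training the review says is underspecified), the word-count heuristic, threshold, and model stubs are invented for the sketch, and the paper's MILP allocator (not shown) would decide how much hardware backs each tier.

```python
def discriminator(prompt: str) -> float:
    # Stand-in "difficulty" score in [0, 1]. In the real system this would
    # be a learned predictor of whether the cheap model's output suffices;
    # here we just pretend longer prompts are harder.
    return min(len(prompt.split()) / 20.0, 1.0)

def cheap_model(prompt: str) -> str:
    # Stand-in for a cheaper tier, e.g. a distilled few-step model.
    return f"cheap({prompt})"

def full_model(prompt: str) -> str:
    # Stand-in for the full many-step diffusion model.
    return f"full({prompt})"

def serve(prompt: str, threshold: float = 0.5) -> str:
    # Cascade routing: easy prompts stop at the cheap tier, hard ones go to
    # the expensive tier, so average cost drops while quality is preserved
    # on the difficult tail.
    if discriminator(prompt) < threshold:
        return cheap_model(prompt)
    return full_model(prompt)

print(serve("a cat"))  # routed to the cheap tier
print(serve("a photorealistic oil painting of a cat wearing a tiny "
            "astronaut suit floating above a neon-lit city at dusk"))
```

The design choice worth noticing is that the discriminator runs *before* generation, so misclassified hard prompts degrade quality rather than latency; that trade-off is exactly what the paper's FID/SLO results are meant to quantify.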