ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation

Jiacheng Yang, Jun Wu, Zhen Zhang, Yida Wang, Gennady Pekhimenko

Conference on Machine Learning and Systems 2025 · Day 3 · Session 8: LLM and Diffusion Model Serving

The rapid advancement of generative AI has driven increasing demand for high-resolution, long video generation. This talk, presented by Jiacheng Yang, addresses a critical bottleneck in the distributed inference of a specific class of video diffusion models: **spatial-temporal diffusion transformers (ST-DiT)**. While powerful for generating high-quality video content from text prompts, these models suffer from significant inference latency when scaled across multiple GPU machines, because existing distributed frameworks handle communication inefficiently.

AI review

ScaleFusion presents a genuine engineering contribution — decomposing monolithic all-to-all operators into independently schedulable slices to enable communication-computation overlap in distributed ST-DiT inference. The core insight is real and the speedup numbers are striking. But the write-up reads more like an abstract expansion than a talk recap with implementation texture, and the reproducibility gap is significant: no code, no open-source artifact, no architecture diagram you could hand to an engineer and say 'build this.' Solid systems work, limited immediate utility for most…
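The core idea the review describes, splitting a monolithic all-to-all into slices so the transfer of one slice can overlap with computation on another, can be illustrated with a toy latency model. This is a sketch under assumptions, not the authors' implementation: it assumes a single serial communication link and a single serial compute unit, with uniform per-slice costs, and all function names are hypothetical.

```python
def monolithic_latency(n_slices, comm, compute):
    # Baseline: one fused all-to-all moves all slices, and computation only
    # starts after the entire transfer completes (no overlap).
    return n_slices * comm + n_slices * compute

def overlapped_latency(n_slices, comm, compute):
    # Sliced schedule: the link sends slices back to back, so slice s arrives
    # at time (s + 1) * comm. Compute on slice s starts once its data has
    # arrived AND the previous slice's compute has finished.
    compute_done = 0.0
    for s in range(n_slices):
        comm_done = (s + 1) * comm
        compute_done = max(comm_done, compute_done) + compute
    return compute_done

# With 4 slices and equal comm/compute cost per slice, overlap hides most
# of the communication time behind computation.
print(monolithic_latency(4, 1.0, 1.0))  # 8.0
print(overlapped_latency(4, 1.0, 1.0))  # 5.0
```

In this model the fused all-to-all pays the full communication cost up front, while the sliced schedule exposes only the first slice's transfer plus whatever communication cannot be hidden behind compute, which is the communication-computation overlap the talk attributes its speedups to.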