Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen

International Conference on Machine Learning 2025 · Oral

This talk, presented by Kulin Shah at ICML 2025, examines the fundamental mechanisms and challenges of **Masked Diffusion Models (MDMs)**, particularly their approach to token ordering in language modeling. The work, a collaboration with Jaeyeon Kim, Vasilis Kontonis, Sham Kakade, and Sitan Chen, dissects the strengths and limitations of MDMs as an emerging alternative to traditional auto-regressive (AR) language models. With diffusion models demonstrating compelling performance-efficiency trade-offs (for example, Gemini Diffusion achieving similar performance to Flash 2 Lite with six times faster inference), understanding their core operational paradigms is paramount for advancing the field.

AI review

A technically honest and well-structured contribution that gives the masked diffusion model community something it has been missing: a rigorous account of *why* vanilla MDM inference underperforms and a principled explanation of why adaptive strategies recover it. The theoretical hardness result is real and non-trivial, the experimental validation is carefully controlled, and the connection to the any-order autoregressive loss, while known in the literature, is used here as a genuine analytical lever rather than decorative framing. The gap between 7% and 90% accuracy on logic puzzles via…
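The distinction between vanilla and adaptive MDM inference comes down to the order in which masked positions are revealed. The toy sketch below illustrates that contrast only; the confidence heuristic and oracle fill-in are stand-ins invented for illustration, not the paper's actual model or algorithm.

```python
import random

MASK = "?"

def toy_confidences(seq):
    """Hypothetical per-position confidence: positions next to an
    already-revealed token are treated as 'easier' (higher confidence)."""
    conf = {}
    for i, tok in enumerate(seq):
        if tok != MASK:
            continue
        has_neighbor = (i > 0 and seq[i - 1] != MASK) or \
                       (i + 1 < len(seq) and seq[i + 1] != MASK)
        conf[i] = 0.9 if has_neighbor else 0.5
    return conf

def decode(target, adaptive, rng):
    """Reveal one masked position per step; return the decoding order.
    Vanilla MDM inference picks a random masked position each step;
    the adaptive variant picks the highest-confidence one."""
    seq = [MASK] * len(target)
    order = []
    while MASK in seq:
        conf = toy_confidences(seq)
        if adaptive:
            i = max(conf, key=conf.get)   # most confident masked position
        else:
            i = rng.choice(list(conf))    # vanilla: uniformly random order
        seq[i] = target[i]                # oracle fill-in, for the sketch only
        order.append(i)
    return order

print("vanilla order: ", decode("ABCDE", adaptive=False, rng=random.Random(0)))
print("adaptive order:", decode("ABCDE", adaptive=True, rng=random.Random(0)))
```

Under this heuristic the adaptive schedule reveals tokens contiguously outward from its first guess, while the vanilla schedule scatters its commitments, which is the kind of ordering gap the talk analyzes.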