Best Talks at Conference on Machine Learning and Systems 2025

Editor's picks · 12 talks

Hand-picked from in-depth reviewer verdicts. View all talks at Conference on Machine Learning and Systems 2025 →

  1. Designing Models from the Hardware Up — Simran Arora

    In this insightful MLSys 2025 talk, Simran Arora presents a compelling argument for designing AI models with hardware considerations from the ground up, rather than optimizing them post-hoc. The core premise addresses a critical…

  2. QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving — Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Song Han

    This talk delves into QServe, a groundbreaking system and algorithm co-design for the efficient serving of Large Language Models (LLMs) on cloud infrastructure. Presented by Shang Yang, a second-year PhD student at MIT EECS, under the…
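To make the "W4" in W4A8KV4 concrete (4-bit weights, 8-bit activations, 4-bit KV cache), here is a toy per-group symmetric int4 weight quantizer in NumPy. This is only an illustration of the numerical idea; QServe's actual contribution is a GPU kernel and system co-design that keeps such low-precision formats fast end to end.

```python
import numpy as np

def quantize_int4(w, group_size=128):
    """Toy per-group symmetric 4-bit quantization (not QServe's kernel).
    Each group of `group_size` weights shares one float scale."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(2, 128).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print(np.abs(w - w_hat).max())  # round-to-nearest error, at most scale/2 per group
```

Storing `q` as packed 4-bit integers plus one scale per group is what cuts weight memory roughly 4x versus fp16; the systems challenge the talk addresses is dequantizing inside the GEMM without losing throughput.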

  3. Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer — Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Dhabaleswar Panda

    The rapid advancement of large language models (LLMs) has highlighted a critical bottleneck: the ability to process and train on ultra-long input sequences. While models like Llama 3.1 are pushing context lengths to 128K tokens, achieving…

  4. TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives — Size Zheng, Jin Fang, Ningxin Zheng, Haibin Lin, Li-Wen Chang, Xin Liu

    The rapid advancement and widespread adoption of large language models (LLMs) have driven an unprecedented demand for computational resources, particularly massive clusters of GPUs. Training and inference for state-of-the-art LLMs often…

  5. Marconi: Prefix Caching for the Era of Hybrid LLMs — Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Tri Dao, Ravi Netravali

    In the rapidly evolving landscape of large language models (LLMs), a critical challenge persists: achieving efficiency under increasingly long context lengths. Traditional transformer-based LLMs, while powerful, grapple with the…
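The core idea of prefix caching is to reuse computed model state for a request whose tokens start with a previously seen prefix. A minimal longest-prefix lookup can be sketched as below; Marconi's actual contribution is deciding which prefixes are worth keeping for hybrid models, where recurrent (SSM) states, unlike KV caches, cannot be sliced at arbitrary token positions.

```python
class PrefixCache:
    """Toy prefix cache mapping token prefixes to (mock) cached state.
    Real serving systems add admission and eviction policies on top."""
    def __init__(self):
        self.store = {}  # tuple of tokens -> cached state

    def insert(self, tokens, state):
        self.store[tuple(tokens)] = state

    def longest_prefix(self, tokens):
        # Return the longest cached prefix of `tokens`, if any.
        for n in range(len(tokens), 0, -1):
            key = tuple(tokens[:n])
            if key in self.store:
                return key, self.store[key]
        return (), None

cache = PrefixCache()
cache.insert([1, 2, 3], "state@3")
hit, state = cache.longest_prefix([1, 2, 3, 4, 5])
print(len(hit), state)  # 3 state@3 — only the suffix [4, 5] must be recomputed
```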

  6. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ziyi Xu, Tianqi Chen

    The XGrammar project introduces a novel and highly efficient engine for **structured generation** by Large Language Models (LLMs), addressing a critical need in modern AI applications. Presented by Yixin Dong from Carnegie Mellon…
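Structured generation works by masking the vocabulary at every decoding step so only grammar-legal tokens can be sampled. The sketch below uses a hard-coded balanced-parentheses "grammar" as a stand-in; XGrammar's contribution is doing this efficiently for full context-free grammars over a real tokenizer vocabulary, which this toy does not attempt.

```python
import random

VOCAB = ["(", ")", "<eos>"]

def allowed_tokens(depth):
    """Token mask for a toy grammar: well-balanced parentheses.
    Only '(' is legal at any time; ')' needs an open paren; '<eos>'
    is legal only when everything is closed."""
    return ["(", ")"] if depth > 0 else ["(", "<eos>"]

def generate(max_len=10, seed=0):
    rng = random.Random(seed)
    depth, out = 0, []
    while True:
        remaining = max_len - len(out)
        if remaining <= depth:  # must spend the remaining budget on ')'
            tok = ")" if depth > 0 else "<eos>"
        else:
            tok = rng.choice(allowed_tokens(depth))
        if tok == "<eos>":
            break
        out.append(tok)
        depth += 1 if tok == "(" else -1
    return "".join(out)

print(generate())  # always a balanced parenthesis string, length <= 10
```

In a real LLM pipeline the mask is applied to the logits before sampling; the engineering challenge the talk addresses is computing that mask fast enough to keep up with decoding.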

  7. Extreme PyTorch: Inside the Most Demanding ML Workloads—and the Open Challenges in Building AI Agents to Democratize Them — Soumith Chintala

    Soumith Chintala, a distinguished Scientist-Engineer at Meta and NYU, presented a comprehensive talk at MLSys 2025, delving into the intricate world of **PyTorch** and its application to the most demanding machine learning workloads. The…

  8. An AI Stack: From Scaling AI Workloads to Evaluating LLMs — Ion Stoica

    In this comprehensive talk at MLSys 2025, Professor Ion Stoica from UC Berkeley presented an insightful journey through the evolution of the AI/ML stack, focusing on three pivotal open-source projects he has been deeply involved with…

  9. Hardware-Aware Training and Inference for Large-Scale AI — Animashree Anandkumar

    In an era where large-scale AI models are continually pushing the boundaries of computational resources, Professor Animashree Anandkumar's talk at MLSys 2025 presents a compelling vision for the future of machine learning systems…

  10. LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers — Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan

    The proliferation of large language models (LLMs) has brought the self-attention mechanism to the forefront of AI innovation. However, efficiently executing attention, particularly during the decode phase of inference, presents…
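In the decode phase, a single query token attends over the whole KV cache, so parallelizing across the sequence dimension requires splitting the cache into chunks and merging partial softmax results with a rescaling step. The NumPy sketch below shows that merge in the spirit of LeanAttention's split-and-reduce approach; the paper's actual contribution is a hardware-aware GPU scheduling of this computation, not this reference math.

```python
import numpy as np

def attend(q, K, V):
    """Reference single-query attention for the decode phase."""
    s = K @ q                      # (n,) scores
    p = np.exp(s - s.max())
    return (p @ V) / p.sum()

def attend_split(q, K, V, n_splits=4):
    """Chunk the KV cache, compute partial attention per chunk, then
    merge partials by rescaling each to the global softmax maximum."""
    parts = []
    for Kc, Vc in zip(np.array_split(K, n_splits), np.array_split(V, n_splits)):
        s = Kc @ q
        m = s.max()                # per-chunk max for numerical stability
        p = np.exp(s - m)
        parts.append((m, p.sum(), p @ Vc))
    m_all = max(m for m, _, _ in parts)
    num = sum(np.exp(m - m_all) * acc for m, _, acc in parts)
    den = sum(np.exp(m - m_all) * z for m, z, _ in parts)
    return num / den

rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(8), rng.standard_normal((32, 8)), rng.standard_normal((32, 4))
print(np.allclose(attend(q, K, V), attend_split(q, K, V)))  # True
```

Because the merge is exact, the chunks can be assigned to different GPU compute units and reduced afterward, which is what makes the decode phase scalable across the sequence length.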

  11. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving — Zihao Ye, Lequn Chen, Ruihang Lai, Tianqi Chen, Arvind Krishnamurthy, Luis Ceze

    The proliferation of Large Language Models (LLMs) has introduced significant challenges in deploying and serving these models efficiently, particularly concerning the core **attention mechanism**. This talk introduces **FlashInfer**, an…

  12. Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving — Wei Gao, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen

    This talk, presented by Hanyu from Alibaba on behalf of the authors from Nanyang Technological University, S-Lab, and Shanghai AI Lab, delves into a critical challenge in serving large language models (LLMs): the immense memory footprint…
