Conference on Machine Learning and Systems 2025
The eighth annual Conference on Machine Learning and Systems, bridging ML algorithms, systems software, and hardware design to advance efficient AI at scale.
- Extreme PyTorch: Inside the Most Demanding ML Workloads—and the Open Challenges in Building AI Agents to Democratize Them — Soumith Chintala
Soumith Chintala, a distinguished Scientist-Engineer at Meta and NYU, presented a comprehensive talk at MLSys 2025, delving into the intricate world of **PyTorch** and its application to the most…
- Lessons Learned from Successful PhD Students — Tim Dettmers
This article delves into the illuminating insights shared by Tim Dettmers, the visionary behind **QLoRA** and **bitsandbytes**, during his MLSys 2025 talk, "Lessons Learned from Successful PhD…
- LMArena: An Open Platform for Crowdsourced AI Benchmarks — Wei-Lin Chiang
In an era defined by the rapid evolution of generative AI, particularly large language models (LLMs), the traditional paradigms for evaluating AI performance are proving increasingly insufficient…
- Designing Models from the Hardware Up — Simran Arora
In this insightful MLSys 2025 talk, Simran Arora presents a compelling argument for designing AI models with hardware considerations from the ground up, rather than optimizing them post-hoc. The…
- Systems for Scalable Sparse Training and Inference — Beidi Chen
- An AI Stack: From Scaling AI Workloads to Evaluating LLMs — Ion Stoica
In this comprehensive talk at MLSys 2025, Professor Ion Stoica from UC Berkeley presented an insightful journey through the evolution of the AI/ML stack, focusing on three pivotal open-source…
- Hardware-Aware Training and Inference for Large-Scale AI — Animashree Anandkumar
In an era where large-scale AI models are continually pushing the boundaries of computational resources, Professor Animashree Anandkumar's talk at MLSys 2025 presents a compelling vision for the…
- Responsible Finetuning of Large Language Models — Ling Liu
This article delves into the critical and evolving challenges surrounding the responsible finetuning of Large Language Models (LLMs), with a particular emphasis on ensuring their safety and…
- FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference — Zaifeng Pan, Yitong Ding, Yue Guan, Zheng Wang, Yida Wang, Yufei Ding
This article delves into **FastTree**, which optimizes both the attention kernel and the serving runtime for tree-structured LLM inference, where requests that share common prefixes are organized…
- DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling — Sohaib Ahmad, Qizheng Yang, Haoliang Wang, Ramesh K. Sitaraman, Hui Guan
The rapid advancement of text-to-image diffusion models has revolutionized content creation, but their computational intensity presents significant challenges for efficient serving in production…
- LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers — Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan
The proliferation of large language models (LLMs) has brought the self-attention mechanism to the forefront of AI innovation. However, efficiently executing attention, particularly during the decode…
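To make the decode-phase bottleneck concrete, here is a minimal sketch (not LeanAttention itself, and with purely illustrative shapes and names) of a single decode step: one new query token attends over the entire cached key/value tensors, so the step is dominated by streaming the KV cache from memory rather than by arithmetic.

```python
# Minimal sketch (not LeanAttention): one decode step of attention, where a
# single new query token attends over the full cached K/V. Shapes illustrative.
import torch

B, H, S, D = 1, 8, 4096, 64          # batch, heads, cached tokens, head dim
q_new   = torch.randn(B, H, 1, D)    # query for the one token being decoded
k_cache = torch.randn(B, H, S, D)    # keys accumulated during prefill/decode
v_cache = torch.randn(B, H, S, D)    # values accumulated during prefill/decode

# Scores of the new token against every cached position: (B, H, 1, S).
scores = (q_new @ k_cache.transpose(-1, -2)) / D ** 0.5
probs  = torch.softmax(scores, dim=-1)
out    = probs @ v_cache             # (B, H, 1, D)

# The matmuls are tiny (one query row), but every element of k_cache and
# v_cache must be read, so the step is memory-bandwidth bound -- the regime
# that decode-phase attention optimizations target.
```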
- FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving — Zihao Ye, Lequn Chen, Ruihang Lai, Tianqi Chen, Arvind Krishnamurthy, Luis Ceze
The proliferation of Large Language Models (LLMs) has introduced significant challenges in deploying and serving these models efficiently, particularly concerning the core **attention mechanism**…
- Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving — Gao Wei, Xinyu Zhou, Peng Sun, Tianwei Zhang, Yonggang Wen
This talk, presented by Hanyu from Alibaba on behalf of the authors from Nanyang Technological University, S-Lab, and Shanghai AI Lab, delves into a critical challenge in serving large language…
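For context on why the KV cache dominates serving memory, a back-of-envelope calculation (assuming a hypothetical Llama-7B-style configuration; all numbers are illustrative, not from the talk) looks like this:

```python
# Back-of-envelope KV-cache size for a hypothetical Llama-7B-like config
# (32 layers, 32 KV heads, head_dim 128, fp16). Illustrative numbers only.
layers, kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2                      # fp16
seq_len, batch = 4096, 8

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V
total = per_token * seq_len * batch
print(f"{per_token / 2**10:.0f} KiB per token, {total / 2**30:.1f} GiB total")
# -> 512 KiB per token, ~16 GiB for this batch -- why compressing or evicting
#    KV entries matters for serving throughput.
```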
- Context Parallelism for Scalable Million-Token Inference — Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Jongsoo Park, Jianyu Huang
This article delves into a pivotal presentation from MLSys 2025 titled "Context Parallelism for Scalable Million-Token Inference," delivered by a team of researchers from Meta. The talk introduces…
- GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism — Sandeep Polisetty, Juelin Liu, Yi Fung, Hui Guan, Marco Serafini
Graph Neural Networks (GNNs) have emerged as a pivotal technology for extracting insights from graph-structured data across diverse fields, from social network analysis and personalized…
- Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling — Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Fangming Liu
In the rapidly evolving landscape of deep learning, the training of large models, particularly **Large Language Models (LLMs)** built on **transformer architectures**, demands immense computational…
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training — Daiyaan Arfeen, Zhen Zhang, Xinwei Fu, Gregory Ganger, Yida Wang
The proliferation of large language models (LLMs) has necessitated increasingly sophisticated and scalable training techniques. Among these, **pipeline parallelism** has emerged as a crucial…
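The bubbles in question follow from simple pipeline arithmetic: in a synchronous, GPipe-style schedule with p stages and m micro-batches, roughly (p − 1)/(m + p − 1) of each device's time sits idle. A small illustrative calculation (hypothetical numbers, not from the talk):

```python
# Bubble fraction of a synchronous (GPipe-style) pipeline with p stages and
# m micro-batches: (p - 1) / (m + p - 1). Illustrative numbers only.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for p, m in [(8, 32), (16, 32), (16, 128)]:
    print(f"stages={p:2d} micro-batches={m:3d} "
          f"idle fraction ~ {bubble_fraction(p, m):.1%}")
# With deep pipelines and modest micro-batch counts, a double-digit share of
# GPU time sits idle -- the "bubbles" PipeFill proposes to fill with other work.
```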
- AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine — Carlo Siebenschuh, Kyle Hippe, Alexander Brace, Arvind Ramanathan, Ian Foster, Robert Underwood
In the rapidly evolving landscape of large language models (LLMs), the quality and scale of pre-training data are paramount. This talk introduces **AdaParse**, an innovative adaptive parallel PDF…
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving — Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Song Han
This article delves into QServe, a groundbreaking system and algorithm co-design for the efficient serving of Large Language Models (LLMs) on cloud infrastructure. Presented by Shang Yang, a…
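For readers unfamiliar with the notation, W4A8KV4 denotes 4-bit weights, 8-bit activations, and a 4-bit KV cache. The sketch below shows only the generic idea behind the "W4" part, plain per-channel symmetric 4-bit weight quantization; QServe's actual scheme and its system co-design are considerably more involved.

```python
# Minimal sketch of symmetric 4-bit weight quantization (per output channel).
# Generic illustration of the "W4" idea, not QServe's progressive/group scheme.
import torch

def quantize_w4(w: torch.Tensor):
    # w: (out_features, in_features); one scale per output channel
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0        # symmetric int4 range
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_w4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)
q, s = quantize_w4(w)
print("max abs error:", (dequantize_w4(q, s) - w).abs().max().item())
```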
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators — Beichen Huang, Yueming Yuan, Zelei Shao, Minjia Zhang
This article delves into MiLo, a novel approach for efficient quantized Mixture of Experts (MoE) inference, presented by Beichen Huang and Yueming Yuan at MLSys 2025. MiLo tackles the pressing…
- Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators — Geonhwa Jeong, Po-An Tsai, Abhimanyu Rajeshkumar Bambhaniya, Stephen W. Keckler, Tushar Krishna
This article delves into the innovative work presented at MLSys 2025 by Geonhwa Jeong and collaborators, focusing on a novel method called **TASD (Tensor Approximation via Structured…
- Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training — Mingkai Zheng, Zhao Zhang
The pre-training of large language models (LLMs) has become a cornerstone of modern AI, yet it remains an immensely resource-intensive and time-consuming endeavor. As models scale to billions or…
- Self-Data Distillation for Recovering Quality in Pruned Large Language Models — Vithursan Thangarasa, Ganesh Venkatesh, Mike Lasby, Sean Lie
- Know Where You're Uncertain When Planning with Multimodal Foundation Models: A Formal Framework — Neel P. Bhatt, Yunhao Yang, Ufuk Topcu, Zhangyang Wang
Multimodal Foundation Models (MFMs) are rapidly becoming indispensable tools for developing advanced autonomous systems, particularly in robotics, where they offer a natural and intuitive interface…
- AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds — Yinfang Chen, Manish Shetty, Gagan Somashekar, Chetan Bansal, Saravan Rajmohan
In an era increasingly reliant on complex cloud infrastructure, the stability and performance of services are paramount. Yet, production incidents remain an inevitable and costly reality, leading to…
- AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution — Zhiqiang Xie, Hao Kang, Ying Sheng, Tushar Krishna, Kayvon Fatahalian, Christos Kozyrakis
- Interference-aware Edge Runtime Prediction with Conformal Matrix Completion — Tianshu Huang, Arjun Ramesh, Emily Ruppel, Anthony Rowe, Carlee Joe-Wong
This talk, "Interference-aware Edge Runtime Prediction with Conformal Matrix Completion," presented by Tianshu Huang, delves into the critical and complex problem of accurately predicting software…
- The Hidden Bloat in Machine Learning Systems — Huaifeng Zhang, Ahmed Ali-Eldin
The proliferation of machine learning (ML) frameworks like PyTorch and TensorFlow has driven rapid innovation, but this growth comes with an often-overlooked cost: **software bloat**. This talk…
- Youmu: Efficient Columnar Data Pipeline for LLM Training — Tianle Zhong, Jiechen Zhao, Qiang Su, Geoffrey Fox
In the rapidly evolving landscape of large language model (LLM) training, data pipeline efficiency remains a critical bottleneck. This talk introduces **Youmu**, an innovative system designed to…
- Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer — Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Dhabaleswar Panda
The rapid advancement of large language models (LLMs) has highlighted a critical bottleneck: the ability to process and train on ultra-long input sequences. While models like Llama 3.1 are pushing…
- HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression — Yujin Wang, Shunan Dong, Yichen You, Huazhong Yang, Yongpan Liu, Hongyang Jia
The talk "HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression" by Yujin Wang and colleagues from Tsinghua University addresses a critical bottleneck in the on-device…
- APOLLO: SGD-like Memory, AdamW-level Performance — Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Zhangyang Wang, Jinwon Lee
In the rapidly evolving landscape of large language models (LLMs), the computational and memory demands for training these sophisticated architectures have become a significant bottleneck, limiting…
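The memory gap the talk targets is easy to quantify: AdamW keeps two fp32 moment tensors per parameter, while plain SGD keeps none. A rough calculation for a hypothetical 7B-parameter model (illustrative numbers only):

```python
# Rough optimizer-state memory for a hypothetical 7B-parameter model.
# AdamW keeps two fp32 moments per parameter; plain SGD keeps none.
params = 7e9
fp32 = 4  # bytes per element

adamw_state = 2 * params * fp32          # exp_avg + exp_avg_sq
sgd_state   = 0.0
print(f"AdamW optimizer state: ~{adamw_state / 2**30:.0f} GiB")   # ~52 GiB
print(f"SGD   optimizer state: ~{sgd_state / 2**30:.0f} GiB")     # 0 GiB
# APOLLO targets exactly this gap: memory near the SGD end of the spectrum
# while retaining AdamW-like convergence.
```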
- Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training — Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Louis Feng, Christina Delimitrou
The rapid advancement and increasing scale of Large Language Models (LLMs) necessitate highly efficient training processes. Optimizing the performance of these colossal models is paramount, but it…
- ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation — Zhiyu Mei, Wei Fu, Kaiwei Li, Guangju Wang, Huanchen Zhang, Yi Wu
The talk introduces **ReaL (Reinforcement Learning with Parameter Reallocation)**, a novel system designed to significantly enhance the efficiency of **Reinforcement Learning from Human Feedback…
- SwiftVI: Time-Efficient Planning and Learning with MDPs — Kasper Overgaard Mortensen, Konstantinos Skitsas, Andreas Pavlogiannis, Davide Mottin, Panagiotis Karras
In the realm of artificial intelligence and machine learning, particularly within reinforcement learning and autonomous systems, the ability to plan and make optimal decisions in complex, uncertain…
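The core primitive in this setting is the Bellman backup of value iteration. The snippet below is a textbook implementation on a small random MDP, included only for orientation; SwiftVI's contribution is making this class of backups substantially faster, not the baseline shown here.

```python
# Textbook value iteration on a small random MDP -- the kind of Bellman-backup
# workload the talk addresses. Generic sketch, not SwiftVI's algorithm.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 50, 4, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'] transition probabilities
R = rng.random((S, A))                         # R[s, a] rewards

V = np.zeros(S)
for _ in range(1000):
    Q = R + gamma * P @ V                      # Q[s, a] = R + gamma * E[V(s')]
    V_new = Q.max(axis=1)                      # greedy Bellman backup
    if np.abs(V_new - V).max() < 1e-8:
        break
    V = V_new
print("converged value of state 0:", V[0])
```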
- ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription in the Cloud — Lu Wang, Mayukh Das, Fangkai Yang, Íñigo Goiri, Saravan Rajmohan, Dongmei Zhang
In the highly competitive and resource-intensive landscape of cloud computing, optimizing resource utilization is paramount for both operational efficiency and profitability. The talk, "ProtoRAIL: A…
- A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers — Chenxi Yang, Yan Li, Martin Maas, Mustafa Uysal, Arif Merchant, Richard McDougall
In the sprawling landscape of modern data centers, where "warehouse-scale computers" are the norm, storage systems represent a significant portion of the total operational cost. Within these complex…
- Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm — Baichuan Huang, Amir Aminifar
This talk introduces **BioFO (Biologically Plausible Forward-Only Algorithm)**, a novel approach to training neural networks designed to address the significant energy consumption and biological…
- Optimizing LLM Queries in Relational Data Analytics Workloads — Shu Liu, Asim Biswal, Amog Kamsetty, Ion Stoica, Joseph E. Gonzalez, Matei Zaharia
This talk, presented by Shu Liu from UC Berkeley, along with collaborators from UC Berkeley and Stanford University, addresses the critical challenge of optimizing Large Language Model (LLM) queries…
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention — Shang Yang, Junxian Guo, Haotian Tang, Guangxuan Xiao, Song Han
In the rapidly evolving landscape of artificial intelligence, **long context Large Language Models (LLMs)** have emerged as a pivotal technology, unlocking new frontiers in applications ranging from…
- Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers — Francesco Daghero, Daniele Jahier Pagliari, Francesco Conti, Luca Benini, Alessio Burrello
This talk, presented by Francesco Daghero and his colleagues, delves into critical advancements for deploying deep learning models on highly constrained **microcontrollers (MCUs)**. The core…
- SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention — Qianchao Zhu, Jiangfei Duan, Chang Chen, Dahua Lin, Chao Yang
This article delves into SampleAttention, a novel approach designed to drastically accelerate inference for **Large Language Models (LLMs)** operating with exceptionally long context windows…
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking — Marco Federici, Davide Belli, Mart Van Baalen, Markus Nagel, Paul Whatmough
This talk, presented by Davide Belli and his colleagues at Qualcomm AI Research, addresses a critical challenge in the deployment of large language models (LLMs) on edge devices: the rapidly…
- SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations — Md Saidul Hoque Anik, Ariful Azad
This article delves into **SparseTransX**, a novel approach presented at MLSys 2025 by Md Saidul Hoque Anik and Ariful Azad, addressing the pervasive inefficiency in training **Knowledge Graph…
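For orientation, TransE, one of the translation-based models in question, scores a triple (h, r, t) as −‖h + r − t‖: the relation acts as a translation from the head embedding to the tail embedding. The sketch below shows only this dense baseline scoring; SparseTransX's contribution is recasting the underlying embedding gathers and gradient updates as sparse matrix operations.

```python
# Plain TransE scoring: a triple (h, r, t) is plausible when h + r ≈ t,
# scored as -||h + r - t||. Dense baseline only; SparseTransX reformulates the
# embedding gathers/scatters behind this as sparse matrix multiplications.
import torch

num_entities, num_relations, dim = 10_000, 200, 128
ent = torch.nn.Embedding(num_entities, dim)
rel = torch.nn.Embedding(num_relations, dim)

def transe_score(h_idx, r_idx, t_idx, p: int = 1):
    h, r, t = ent(h_idx), rel(r_idx), ent(t_idx)
    return -torch.norm(h + r - t, p=p, dim=-1)

h = torch.randint(0, num_entities, (1024,))
r = torch.randint(0, num_relations, (1024,))
t = torch.randint(0, num_entities, (1024,))
print(transe_score(h, r, t).shape)   # torch.Size([1024])
```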
- Seesaw: High-throughput LLM Inference via Model Re-sharding — Qidong Su, Wei Zhao, Xin Li, Chenhao Jiang, Gennady Pekhimenko
The talk "Seesaw: High-throughput LLM Inference via Model Re-sharding" introduces a novel framework designed to significantly accelerate throughput-oriented offline large language model (LLM) text…
- ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation — Jiacheng Yang, Jun Wu, Zhen Zhang, Yida Wang, Gennady Pekhimenko
The rapid advancement of generative AI has brought about an increasing demand for high-resolution, long video generation. This talk, "ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion…
- TurboAttention: Efficient Attention Approximation for High-Throughput LLM Serving — Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Rühle, Saravan Rajmohan
In the rapidly evolving landscape of large language models (LLMs), efficient inference at scale remains a paramount challenge. Hao Kang, a second-year PhD student from Georgia Tech, presented…
- FlexInfer: Flexible LLM Inference with CPU Computations — Seonjin Na, Geonhwa Jeong, Byung Hoon Ahn, Tushar Krishna, Hyesoon Kim
The proliferation of Large Language Models (LLMs) has led to an explosion in demand for efficient inference, particularly in high-throughput applications like chatbots. However, a critical…
- SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling — Ke Hong, Xiuhong Li, Lufang Chen, Guohao Dai, Xuefei Ning, Yu Wang
The rapid proliferation of large language models (LLMs) has introduced unprecedented challenges and opportunities in model serving. This talk, presented by Ke Hong from Tsinghua University and…
- Scaling Deep Learning Training with MPMD Pipeline Parallelism — Anxhelo Xhebraj, Sean Lee, Hanfeng Chen, Vinod Grover
In the rapidly evolving landscape of deep learning, the relentless growth in model size necessitates increasingly sophisticated strategies for distributed training. As models surpass the capacity of…
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives — Size Zheng, Jin Fang, Ningxin Zheng, Haibin Lin, Li-Wen Chang, Xin Liu
The rapid advancement and widespread adoption of large language models (LLMs) have driven an unprecedented demand for computational resources, particularly massive clusters of GPUs. Training and…
- COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts — Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Xin Liu
This talk introduces **Comet**, a novel framework designed to achieve fine-grained computation-communication overlapping in **Mixture-of-Experts (MoE)** models. Presented by Ningxin Zheng from…
- Balancing Pipeline Parallelism with Vocabulary Parallelism — Yeung Man Tsung, Penghui Qi, Min Lin, Xinyi Wan
This talk, presented by Yeung Man Tsung from the National University of Singapore, details a novel approach called **Vocabulary Parallelism (VP)**. The research, conducted during an internship at…
- On Distributed Larger-Than-Memory Subset Selection with Pairwise Submodular Functions — Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam
In the realm of large-scale machine learning, the cost associated with training models on massive, often petabyte-scale datasets presents a significant challenge. This talk, presented by Maximilian…
- Marconi: Prefix Caching for the Era of Hybrid LLMs — Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Tri Dao, Ravi Netravali
In the rapidly evolving landscape of large language models (LLMs), a critical challenge persists: achieving efficiency under increasingly long context lengths. Traditional transformer-based LLMs…
- FlexAttention: A Programming Model for Generating Fused Attention Variants — Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, Horace He
The landscape of deep learning, particularly within large language models (LLMs), is dominated by the **Transformer architecture**, where the **attention mechanism** is a foundational component…
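Assuming a recent PyTorch build (FlexAttention ships under torch.nn.attention.flex_attention from roughly PyTorch 2.5 onward), the programming model boils down to a small score_mod callback that torch.compile can fuse into a single kernel; the same hook expresses ALiBi, sliding windows, soft-capping, and other variants. A minimal causal-masking sketch with illustrative shapes:

```python
# FlexAttention expresses attention variants as a score_mod callback that
# torch.compile can fuse into one kernel. Sketch assumes PyTorch >= 2.5.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep the score where the query position can see the key, else mask it out.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

B, H, S, D = 2, 8, 1024, 64            # illustrative shapes
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)
print(out.shape)                       # torch.Size([2, 8, 1024, 64])
```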
- ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments — Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Ana Klimovic, Eiko Yoneki
The proliferation of large language models (LLMs) has revolutionized many industries, yet their deployment in production environments presents significant challenges, primarily due to the immense…
- XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models — Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ziyi Xu, Tianqi Chen
The XGrammar project introduces a novel and highly efficient engine for **structured generation** by Large Language Models (LLMs), addressing a critical need in modern AI applications. Presented by…
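Mechanically, structured generation engines constrain decoding by masking out, at every step, the tokens that the grammar's current state would reject. The sketch below illustrates that generic loop; it is not XGrammar's API, and the `allowed` mask would in practice be produced by a grammar-compiled automaton rather than set by hand.

```python
# Generic grammar-constrained decoding step: before sampling, mask (to -inf)
# every token the grammar would reject. Conceptual sketch, not XGrammar's API.
import torch

def constrained_decode_step(logits: torch.Tensor, allowed: torch.Tensor) -> int:
    # logits: (vocab,); allowed: boolean mask (vocab,) from the grammar state
    masked = logits.masked_fill(~allowed, float("-inf"))
    probs = torch.softmax(masked, dim=-1)
    return int(torch.multinomial(probs, 1))

vocab = 32_000
logits = torch.randn(vocab)
allowed = torch.zeros(vocab, dtype=torch.bool)
allowed[[11, 42, 1337]] = True          # tokens the grammar currently permits
print(constrained_decode_step(logits, allowed))   # one of 11, 42, 1337
```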
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference — Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu
In the rapidly evolving landscape of large language models (LLMs), online inference has become a cornerstone for numerous cutting-edge applications. However, the relentless growth in LLM size has…
- FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning — Minxue Tang, Yitu Wang, Jingyang Zhang, Yiran Chen, Hai Helen Li
This article delves into FedProphet, an innovative framework designed to enable memory-efficient federated adversarial training while maintaining high model robustness and utility. Presented at…
- FLStore: Efficient Federated Learning Storage for Non-training Workloads — Ahmad Faraz Khan, Samuel Fountain, Ahmed M. Abdelmoniem, Ali R. Butt, Ali Anwar
This article delves into **FLStore**, a novel architecture designed to enhance the efficiency of federated learning (FL) pipelines by optimizing the storage and processing of **non-training…
- MAS-ATTENTION: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices — Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu
The proliferation of large language models and other foundation models has cemented the **transformer architecture** and its core component, **attention mechanisms**, as indispensable elements…
- Venn: Resource Management for Collaborative Learning Jobs — Jiachen Liu, Fan Lai, Eric Ding, Yiwen Zhang, Mosharaf Chowdhury
This talk introduces **Venn**, a novel resource manager designed to optimize the execution of multiple concurrent collaborative learning jobs across large-scale, heterogeneous edge devices…
- Photon: Federated LLM Pre-Training — Lorenzo Sani, Alex Iacob, Zeyu Cao, Royson Lee, Nicholas D. Lane
This article delves into **Photon**, a pioneering system designed for **Federated LLM pre-training**, as presented by Lorenzo Sani and his collaborators from Flower Labs, the Machine Learning…
- Supply-Chain Attacks in Machine Learning Frameworks — Yue Gao, Ilia Shumailov, Kassem Fawaz
This talk, presented by Yue Gao, Ilia Shumailov, and Kassem Fawaz at MLSys 2025, delves into the critical and rapidly escalating issue of supply chain attacks within the machine learning ecosystem…
- VoLUT: Efficient Volumetric Streaming Enhanced by LUT-based Super-Resolution — Chendong Wang, Anlan Zhang, Yifan Yang, Lili Qiu, Feng Qian, Suman Banerjee
The talk introduces **VoLUT**, a groundbreaking system designed to address the significant challenges of streaming high-fidelity volumetric video. Volumetric video, which offers a full 3D…
- Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs — Zichao Yue, Chenhui Deng, Zhiru Zhang
Graph Neural Networks (GNNs) have emerged as a cornerstone in modern machine learning, demonstrating remarkable success in diverse applications ranging from fraud detection to circuit functional…
- MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs — Abhishek Moitra, Arkapravo Ghosh, Shrey Agrawal, Karthik Swaminathan, Priyadarshini Panda
The rapid scaling of Large Language Models (LLMs) has unlocked a vast array of applications, from sophisticated chatbots to real-time translation systems. However, deploying these increasingly…
- LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions — Jianheng Ling, Pratik Worah, Yawen Wang, Martin Maas, Kathryn S. McKinley
This talk introduces **LAVA** (Lifetime-Aware VM Allocation), a novel approach to virtual machine (VM) scheduling within large-scale cloud environments, specifically addressing the challenges faced…