Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training

Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Louis Feng, Christina Delimitrou

Conference on Machine Learning and Systems 2025 · Day 3 · Session 5: LLM Training and Fine-Tuning

The rapid advancement and increasing scale of Large Language Models (LLMs) make efficient training essential. Optimizing the performance of these models first requires a detailed understanding of their execution behavior. This talk introduces **Lumos**, a trace-driven performance modeling and estimation framework designed specifically for large-scale LLM training. Developed through a collaboration between Cornell, Google, and Meta, Lumos addresses critical limitations of existing performance modeling approaches, in particular their inability to accurately capture the interplay of compute, communication, and overlap in modern LLM training.

AI review

Lumos is real, careful engineering on a genuinely hard problem: modeling LLM training performance with enough fidelity to be useful for what-if analysis. The 3% replay error is credible, and the dependency taxonomy (intra-thread, inter-stream, CPU-GPU sync) is the kind of specific, unglamorous systems work that actually matters at scale. But the write-up reads like an expanded abstract rather than a talk report: the implementation details stop just short of being actionable, the auxiliary models are hand-waved, and there is no code or artifact to point to. Strong systems research, thin on…
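To make the idea of trace-driven replay with explicit dependencies concrete, here is a minimal sketch. This is not Lumos's implementation; the `Op` class, stream names, and toy trace are all hypothetical. The core mechanic is the one the taxonomy implies: an op starts only once its stream is free (intra-stream order) and all of its cross-stream or sync dependencies have finished, so compute/communication overlap falls out of the replay naturally.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    """A single trace event (hypothetical schema, not Lumos's trace format)."""
    name: str
    duration: float                           # measured duration, ms
    stream: str                               # e.g. "compute" or "comm"
    deps: list = field(default_factory=list)  # inter-stream / sync dependencies

def replay(ops):
    """Replay ops (assumed topologically ordered): start time is the max of
    the stream's free time and every dependency's finish time."""
    end, stream_free = {}, {}
    for op in ops:
        start = max([stream_free.get(op.stream, 0.0)] +
                    [end[d] for d in op.deps])
        end[op.name] = start + op.duration
        stream_free[op.stream] = end[op.name]
    return max(end.values()), end

# Toy trace: an all-reduce on the comm stream overlaps the next compute op.
trace = [
    Op("fwd0",       2.0, "compute"),
    Op("allreduce0", 3.0, "comm",    deps=["fwd0"]),
    Op("fwd1",       2.0, "compute", deps=["fwd0"]),
    Op("opt_step",   1.0, "compute", deps=["allreduce0", "fwd1"]),
]
makespan, ends = replay(trace)
# The all-reduce (2.0–5.0 ms) hides behind fwd1 (2.0–4.0 ms); the optimizer
# step waits on both, so the replayed iteration takes 6.0 ms.
```

A what-if experiment then amounts to editing op durations or dependencies (say, a faster interconnect shrinking `allreduce0`) and replaying the same graph.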