Training Neural Networks at Any Scale

Leena Chennuru Vankadara, Volkan Cevher

International Conference on Machine Learning 2025 · Tutorial

This article summarizes the ICML 2025 tutorial by Leena Chennuru Vankadara and Volkan Cevher on training neural networks, focusing on the challenges and opportunities of scaling models to unprecedented sizes. The tutorial dissects the prevailing "scaling law" paradigm, in which larger models, more compute, and more data are assumed to yield better performance. While acknowledging the success of this approach, the speakers critically examine its limits, demonstrating how naive scaling can paradoxically degrade model performance.
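A concrete way to see why naive scaling fails, and what the tutorial's μP-style remedy looks like, is hyperparameter transfer across width: per-layer learning rates are rescaled as width grows so that a learning rate tuned on a small model remains near-optimal on a large one. The sketch below is a minimal, hedged illustration of that idea for Adam; the helper names (`make_mlp`, `mup_param_groups`), the base width, and the exact 1/width rule are illustrative assumptions rather than the presenters' code, and the full μP recipe also rescales initialization variances and output multipliers.

```python
import torch
import torch.nn as nn

def make_mlp(width: int) -> nn.Sequential:
    """Two-hidden-layer MLP; width is the knob we scale."""
    return nn.Sequential(
        nn.Linear(64, width), nn.ReLU(),
        nn.Linear(width, width), nn.ReLU(),
        nn.Linear(width, 1),
    )

def mup_param_groups(model: nn.Sequential, base_lr: float,
                     base_width: int, width: int):
    """Illustrative μP-style per-layer learning rates for Adam:
    hidden/output layers get lr divided by the width multiplier so
    per-coordinate update sizes stay O(1) as width grows, while the
    input layer keeps the base lr. (Simplified: ignores init scaling.)"""
    mult = width / base_width
    groups = []
    for i, layer in enumerate(model):
        if not isinstance(layer, nn.Linear):
            continue
        lr = base_lr if i == 0 else base_lr / mult
        groups.append({"params": layer.parameters(), "lr": lr})
    return groups

# Tune base_lr at base_width=256, then reuse it at width=1024.
width = 1024
model = make_mlp(width)
opt = torch.optim.Adam(mup_param_groups(model, base_lr=1e-3,
                                        base_width=256, width=width))
```

Under the standard parametrization, by contrast, a single global learning rate that works at width 256 typically over- or under-shoots at width 1024, which is the "naive scaling degrades performance" failure mode the tutorial highlights.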

AI review

A technically serious tutorial that earns its ambition. The two-part structure — Cevher's Frank-Wolfe unification of adaptive optimizers, Vankadara's μP / μP² scaling theory — represents a coherent attempt to give the community both algorithmic tools and theoretical foundations for scaling. The LMO framework is a genuinely clarifying lens that makes weight decay's role rigorous rather than heuristic, and the μP² result for SAM (preventing perturbation collapse) is a clean, non-obvious theoretical contribution that addresses a real gap. The hyperparameter transferability claims are backed by…
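To make the LMO point concrete: in the Frank-Wolfe view, an optimizer step is assembled from a linear minimization oracle over a norm ball, and the step's contraction of the current weights is exactly decoupled weight decay, which is what makes weight decay rigorous rather than heuristic in this lens. The sketch below is an illustrative reconstruction under that reading, not the presenters' code; the function names and the choice of ℓ∞ and spectral-norm balls (which recover sign-descent and Muon-style orthogonalized updates, respectively) are assumptions for illustration.

```python
import torch

def lmo_linf(grad: torch.Tensor, radius: float) -> torch.Tensor:
    """LMO over an l-infinity ball:
    argmin_{||s||_inf <= r} <grad, s> = -r * sign(grad),
    i.e. the sign-descent direction underlying Adam-like updates."""
    return -radius * torch.sign(grad)

def lmo_spectral(grad: torch.Tensor, radius: float) -> torch.Tensor:
    """LMO over a spectral-norm ball: with grad = U S V^T, the minimizer
    is -r * U V^T (a Muon-style orthogonalized update)."""
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vh)

def frank_wolfe_step(w, grad, gamma, radius, lmo):
    """One Frank-Wolfe step over the norm ball:
    w <- (1 - gamma) * w + gamma * lmo(grad, radius).
    The (1 - gamma) * w contraction is decoupled weight decay,
    arising from the constraint set rather than an added penalty."""
    return (1 - gamma) * w + gamma * lmo(grad, radius)

# Toy usage on a random weight matrix and gradient.
w = torch.randn(128, 128)
g = torch.randn(128, 128)
w = frank_wolfe_step(w, g, gamma=0.01, radius=1.0, lmo=lmo_spectral)
```

Swapping the norm ball swaps the optimizer family while the weight-decay term stays structurally identical, which is the unification the review credits to the tutorial's first part.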