APOLLO: SGD-like Memory, AdamW-level Performance
Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Zhangyang Wang, Jinwon Lee
Conference on Machine Learning and Systems 2025 · Day 3 · Session 5: LLM Training and Fine-Tuning
In the rapidly evolving landscape of large language models (LLMs), the computational and memory demands of training have become a significant bottleneck, limiting accessibility and innovation. Hanqing Zhu of UT Austin, with collaborators from UT Austin and Meta AI, presented **APOLLO**, a solution designed to address this challenge at MLSys 2025. APOLLO introduces a memory-efficient training paradigm that promises the memory footprint of **Stochastic Gradient Descent (SGD)** while consistently matching or surpassing the performance of **AdamW**, the de facto optimizer for modern LLMs.
AI review
APOLLO is genuinely interesting optimizer research — the core claim that Adam's element-wise adaptivity is redundant and can be approximated in low-rank space via random projection, without SVD overhead, is worth knowing about. The headline number (Llama 7B in 20GB) is the kind of thing that changes what hardware you buy. But this writeup reads like a press release formatted as a talk summary, and it never gets deep enough on the failure modes, hyperparameter sensitivity, or implementation specifics that would let an engineer actually evaluate whether to drop AdamW for APOLLO on their next…
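To make the core claim concrete, here is a minimal NumPy sketch of the idea as the talk describes it: keep AdamW-style moments only in a rank-r random projection of each gradient (no SVD), derive a per-channel scaling factor there, and apply that scaling to the full gradient in an otherwise SGD-like update. The function name, hyperparameters, and the exact form of the channel-wise scaling rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def apollo_like_step(W, G, state, lr=1e-3, rank=4,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step sketching APOLLO's idea (simplified, hypothetical).

    Optimizer state lives in a rank-r projected space (rank x n), so memory
    scales like SGD rather than like AdamW's full (m x n) moment buffers.
    """
    m, n = G.shape
    if "P" not in state:
        rng = np.random.default_rng(0)
        # Fixed random projection -- no SVD, per the talk's core claim.
        state["P"] = rng.standard_normal((rank, m)) / np.sqrt(rank)
        state["M"] = np.zeros((rank, n))  # first moment, low-rank space
        state["V"] = np.zeros((rank, n))  # second moment, low-rank space
        state["t"] = 0

    R = state["P"] @ G  # projected gradient, shape (rank, n)
    state["t"] += 1
    state["M"] = beta1 * state["M"] + (1 - beta1) * R
    state["V"] = beta2 * state["V"] + (1 - beta2) * R ** 2
    mhat = state["M"] / (1 - beta1 ** state["t"])
    vhat = state["V"] / (1 - beta2 ** state["t"])
    U = mhat / (np.sqrt(vhat) + eps)  # Adam-style update, low-rank space

    # Illustrative channel-wise scaling: how strongly Adam would rescale
    # each output channel, lifted back to the full-rank gradient.
    s = np.linalg.norm(U, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    return W - lr * (G * s)  # SGD-like step with structured scaling
```

Note the memory argument: the persistent state is two `(rank, n)` buffers plus one projection matrix, versus AdamW's two `(m, n)` buffers, which is where the SGD-like footprint comes from in this sketch.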