Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training

Mingkai Zheng, Zhao Zhang

Conference on Machine Learning and Systems 2025 · Day 2 · Session 3: Quantization and Sparsity

The pre-training of large language models (LLMs) has become a cornerstone of modern AI, yet it remains an immensely resource-intensive and time-consuming endeavor. As models scale to billions or even trillions of parameters, the overhead of synchronizing gradients across workers becomes a dominant bottleneck of distributed training, particularly in data-parallel (DP) setups. This talk introduces **Radius**, a range-based gradient sparsity algorithm and system designed to significantly accelerate the pre-training of large foundation models by shrinking communication volume without sacrificing model quality.

AI review

Radius is legitimate systems research on gradient sparsity for distributed LLM pre-training — the core insight (select top-k indices post-all-reduce, amortize selection over T iterations) is clean and well-motivated. The headline numbers are impressive, and the experimental setup is specific enough to be credible. But this is a conference talk writeup about academic ML systems work, not a shipping tool, and there's nothing here an engineer outside of a large-scale pre-training shop can act on today.
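To make the mechanism concrete, below is a minimal PyTorch-style sketch of the amortized top-k idea the review describes: a dense all-reduce every T iterations re-selects which gradient coordinates to keep, and the intervening iterations reduce only those coordinates. The class name `RadiusSparsifier` and the parameters `density` and `refresh_every` are illustrative assumptions, not the paper's API; the actual system is range-based (per the title), presumably working on contiguous ranges rather than arbitrary element indices, which this sketch omits for brevity.

```python
import torch
import torch.distributed as dist


class RadiusSparsifier:
    """Hypothetical sketch: amortized top-k gradient sparsification for DP training.

    Every `refresh_every` (T) iterations, a dense all-reduce runs and the top-k
    coordinates of the *reduced* gradient are re-selected, so all ranks derive
    the same index set. In between, only those coordinates are communicated.
    """

    def __init__(self, density: float = 0.1, refresh_every: int = 100):
        self.density = density              # fraction of gradient entries kept
        self.refresh_every = refresh_every  # T: iterations between re-selections
        self.indices = None                 # currently selected coordinates
        self.step = 0

    def allreduce_grad(self, grad: torch.Tensor) -> torch.Tensor:
        flat = grad.view(-1)                # shares storage with `grad`
        world = dist.get_world_size()

        if self.indices is None or self.step % self.refresh_every == 0:
            # Refresh iteration: dense all-reduce, then select top-k of the
            # reduced gradient so every rank agrees on the index set.
            dist.all_reduce(flat, op=dist.ReduceOp.SUM)
            flat.div_(world)
            k = max(1, int(self.density * flat.numel()))
            self.indices = flat.abs().topk(k).indices
        else:
            # Sparse iteration: reduce only the selected coordinates.
            selected = flat[self.indices].contiguous()
            dist.all_reduce(selected, op=dist.ReduceOp.SUM)
            selected.div_(world)
            flat.zero_()                    # drop unselected gradient entries
            flat[self.indices] = selected

        self.step += 1
        return grad
```

A production implementation would additionally bucket parameters, hook the sparse reduce into the DDP gradient path, and perhaps keep an error-feedback residual for dropped entries; whether Radius does any of these is not stated in this writeup.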