MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
Beichen Huang, Yueming Yuan, Zelei Shao, Minjia Zhang
Conference on Machine Learning and Systems 2025 · Day 2 · Session 3: Quantization and Sparsity
This article looks at MiLo, an approach for efficient quantized Mixture-of-Experts (MoE) inference presented by Beichen Huang and Yueming Yuan at MLSys 2025. MiLo tackles the challenge of deploying ever-larger MoE models, which scale model capacity while keeping per-token compute roughly constant, on resource-constrained hardware, specifically a single GPU. Its core idea is to push weights from FP16 down to int3 and recover the lost accuracy with low-rank compensators, avoiding both severe accuracy degradation and the need for extensive calibration data.
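To make the single-GPU constraint concrete, here is a rough back-of-envelope estimate of how int3 weights change the memory footprint. This is not from the talk: the Mixtral 8x7B parameter count (~46.7B) and the assumed overhead for quantization metadata and compensators are approximations for illustration only.

```python
# Back-of-envelope memory estimate for weight-only quantization.
# Numbers are approximate and NOT taken from the MiLo paper.

def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory in GiB for storing n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8 / 2**30

n_params = 46.7e9  # approximate total parameter count of Mixtral 8x7B

fp16_gib = weight_memory_gib(n_params, 16)   # ~87 GiB: too large for one GPU
int3_gib = weight_memory_gib(n_params, 3)    # ~16 GiB
# Group-wise scales/zero-points and low-rank compensators add some overhead;
# assuming ~0.5 extra bits per weight, the total stays under ~20 GiB,
# which fits comfortably on a single 24-48 GB GPU.
int3_plus_overhead_gib = weight_memory_gib(n_params, 3.5)

print(f"FP16: {fp16_gib:.1f} GiB, int3: {int3_gib:.1f} GiB, "
      f"int3 + overhead: {int3_plus_overhead_gib:.1f} GiB")
```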
AI review
MiLo is legitimate systems research — a real kernel, a real algorithm, shipped and measured on real hardware. The core ideas (low-rank residual compensation on top of HQQ, adaptive rank via kurtosis/activation-frequency, custom int3 bit-packing) are solid and the 20% end-to-end latency reduction on Mixtral 8x7B is a meaningful result. But the write-up reads like a conference abstract expanded to fill space, not a talk that shows you how to build the thing. Specific numbers are frustratingly absent, the experimental baselines are underspecified, and there's no code or reproducibility path…
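The techniques named in the review are easy to sketch even without the paper's exact recipe. Below is a minimal PyTorch illustration, my own reconstruction rather than MiLo's code, of low-rank residual compensation: quantize a weight matrix to 3 bits per group, take the quantization residual, and approximate it with a truncated SVD whose rank is chosen per expert, e.g. larger for heavy-tailed (high-kurtosis) weights. The function names, the uniform min-max quantizer standing in for HQQ, and the kurtosis threshold in the rank rule are all assumptions.

```python
import torch

def quantize_int3_groupwise(w: torch.Tensor, group_size: int = 64):
    """Uniform 3-bit min-max quantization per group (a stand-in for HQQ)."""
    out_features, in_features = w.shape
    wg = w.reshape(out_features, in_features // group_size, group_size)
    w_min = wg.min(dim=-1, keepdim=True).values
    w_max = wg.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 7.0            # 2**3 - 1 levels
    q = torch.clamp(torch.round((wg - w_min) / scale), 0, 7)
    w_deq = (q * scale + w_min).reshape(out_features, in_features)
    return q.to(torch.uint8), scale, w_min, w_deq

def lowrank_compensator(residual: torch.Tensor, rank: int):
    """Best rank-r approximation of the quantization residual via truncated SVD."""
    U, S, Vh = torch.linalg.svd(residual, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, r)
    B = Vh[:rank, :]             # (r, in)
    return A, B

def pick_rank(w: torch.Tensor, base_rank: int = 16) -> int:
    """Hypothetical adaptive rule: heavier-tailed weights get a larger rank."""
    z = (w - w.mean()) / w.std()
    kurtosis = (z ** 4).mean().item()   # ~3 for Gaussian-like weights
    return base_rank * 2 if kurtosis > 6.0 else base_rank

# Example: one expert's weight matrix.
w = torch.randn(4096, 1024)
q, scale, zero, w_deq = quantize_int3_groupwise(w)   # q/scale/zero would be stored (bit-packed)
A, B = lowrank_compensator(w - w_deq, rank=pick_rank(w))

w_hat = w_deq + A @ B   # dequantized weight + low-rank compensator
print("error without compensator:", (w - w_deq).norm().item())
print("error with compensator:   ", (w - w_hat).norm().item())
```

At inference time only `q`, the group scales/zero-points, and the small `A`, `B` factors would be kept, with a fused kernel dequantizing and adding the compensator on the fly; the custom int3 bit-packing the review mentions is what keeps `q` at 3 bits per weight in storage rather than a full byte.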