ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $\alpha$-$\beta$-Divergence

Guanghui Wang, Zhiyong Yang, Zitai Wang, Shi Wang, Qianqian Xu, Qingming Huang

International Conference on Machine Learning 2025 · Oral

This talk introduces **ABKD (Alpha-Beta Knowledge Distillation)**, a framework that improves knowledge distillation for model compression by addressing the limitations of traditional divergence measures. As foundation models keep growing, driven by scaling laws that tie performance to model size, the computational and financial cost of deploying and fine-tuning them has become prohibitive for most users. Knowledge Distillation (KD) stands out as one of the most effective model compression techniques and is widely used in the development of popular foundation models such as DeepSeek and Qwen to obtain strong performance from smaller models.
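For readers new to the setup, here is a minimal sketch of the temperature-softened distillation objective that this line of work starts from and that ABKD generalizes by swapping the divergence; the function name, temperature, and loss weighting below are illustrative choices, not the paper's exact recipe.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, mix=0.5):
    """Classic softened-logit KD: forward KL from teacher to student,
    mixed with cross-entropy on the ground-truth labels.
    (Illustrative hyperparameters, not the paper's setup.)"""
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student); the T^2 factor keeps the gradient scale comparable.
    distill = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return mix * distill + (1.0 - mix) * ce
```

ABKD's point of departure is precisely the `F.kl_div` term above: replacing the fixed forward KL with a tunable divergence changes how probability mass is allocated between teacher modes.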

AI review

ABKD is a competent and technically honest contribution that frames knowledge distillation as a divergence selection problem and proposes alpha-beta divergence as a two-parameter generalization of the KL family. The theoretical decomposition of forward and reverse KL dynamics via a log-ratio term is a useful organizing principle, and the independence argument — that alpha-beta divergence decouples confidence weighting from hardness concentration, whereas alpha divergence cannot — is the cleanest claim in the paper. The empirical results are consistent and the training cost argument is…
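To make the "two-parameter generalization" concrete, the standard $\alpha$-$\beta$-divergence of Cichocki et al. underlying this family can be written as follows; the paper's exact normalization may differ in minor details:

$$
D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) \;=\; -\frac{1}{\alpha\beta}\sum_{i}\Bigl(p_i^{\alpha} q_i^{\beta} \;-\; \tfrac{\alpha}{\alpha+\beta}\, p_i^{\alpha+\beta} \;-\; \tfrac{\beta}{\alpha+\beta}\, q_i^{\alpha+\beta}\Bigr), \qquad \alpha,\beta,\alpha+\beta \neq 0,
$$

with the excluded cases defined by continuity. In the limit $(\alpha,\beta)\to(1,0)$ it recovers the forward KL $\mathrm{KL}(P\|Q)$, in the limit $(\alpha,\beta)\to(0,1)$ the reverse KL $\mathrm{KL}(Q\|P)$, and constraining $\alpha+\beta=1$ collapses it to the single-parameter $\alpha$-divergence family, which is why the extra degree of freedom is what allows the decoupling the review highlights.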