DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, Se-Young Yun
International Conference on Machine Learning 2025 · Oral
The computational demands of deploying large language models (LLMs) remain a significant bottleneck. This talk introduces **DistiLLM-2**, a knowledge distillation (KD) framework that addresses this challenge by enabling small language models (SLMs) to approach the performance of their larger counterparts. The talk is given by Jongwoo Ko of KAIST AI, representing a collaboration with Microsoft; the authors position DistiLLM-2 as the first work to jointly optimize both the loss formulation and the data-construction side of KD for LLMs.
AI review
DistiLLM-2 is a competent engineering contribution to LLM knowledge distillation that combines a contrastive loss formulation (KALD, using skew-KL on teacher-generated outputs and skew-reverse-KL on student-generated outputs) with two curriculum-based adaptive coefficient schemes. The work is well-motivated and the empirical results span a reasonable range of model families and benchmarks. However, the theoretical framing is largely post-hoc rationalization dressed in information-theoretic language — the divergence decomposition is not new, the 'contrastive' framing is suggestive rather than…