Learning Dynamics in Continual Pre-Training for Large Language Models
Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Zeng
International Conference on Machine Learning 2025 · Oral
This talk, presented by Xingjin Wang of the University of Chinese Academy of Sciences, examines the **learning dynamics** of **continual pre-training (CPT)** for **Large Language Models (LLMs)**. Continual pre-training lets LLMs adapt to new, specialized domains such as mathematics or code, extend their context windows, and integrate new knowledge ahead of later stages such as reinforcement learning (RL) fine-tuning. Despite its widespread use, CPT involves many interacting variables, including the learning rate schedule, the data replay ratio, and the choice of initialization checkpoint, which makes optimizing the process a complex challenge.
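For concreteness, the interacting variables mentioned above can be pictured as a small CPT configuration. The sketch below is purely illustrative; every field name and value is a hypothetical stand-in, not a setting reported in the paper.

```python
# Hypothetical CPT configuration illustrating the interacting knobs named in the summary.
# All names and values are invented for illustration and are not taken from the paper.
from dataclasses import dataclass

@dataclass
class CPTConfig:
    init_checkpoint: str = "base-llm/step-800000"  # which pre-training checkpoint to start from
    peak_lr: float = 3e-5                          # peak learning rate for the CPT phase
    lr_schedule: str = "warmup-cosine"             # re-warmup, then anneal
    warmup_steps: int = 500
    total_steps: int = 50_000
    replay_ratio: float = 0.2                      # fraction of original pre-training data mixed in
    domain: str = "math"                           # target domain, e.g. math or code

print(CPTConfig())
```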
AI review
A technically competent empirical-theoretic paper that extends a prior pre-training scaling law framework to the continual pre-training (CPT) setting. The core contribution — a combined loss equation that accounts for learning rate annealing dynamics plus a distribution shift term modeled as a power law — is a legitimate and useful extension. The paper introduces 'loss potential' as a practical proxy for CPT-readiness of a checkpoint and demonstrates applicability to black-box models via proxy datasets. The work is honest about what it is: a well-fitted predictive model, not a derivation…
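To make the review's description of the contribution concrete, the sketch below shows how one might fit a combined CPT loss curve of the kind described: an annealing-aware pre-training term driven by the learning rate schedule plus a power-law distribution-shift term. The functional form, variable names, and toy data are assumptions for illustration only; the paper's exact parameterization and fitting procedure may differ.

```python
# Minimal sketch of fitting a combined CPT-style loss: an LR-annealing-aware term
# plus a power-law distribution-shift term. The specific functional form and all
# names here are illustrative assumptions, not the paper's equation.
import numpy as np
from scipy.optimize import curve_fit

def lr_areas(lrs):
    """Cumulative LR sum S1 ('forward area') and an annealing area S2 for a schedule."""
    s1 = np.cumsum(lrs)
    peak = np.maximum.accumulate(lrs)
    s2 = np.cumsum(peak - lrs)  # how far the LR has decayed below its running peak
    return s1, s2

def cpt_loss(X, L0, A, alpha, C, B, beta):
    """Hypothetical combined loss: annealing-aware term + power-law distribution shift."""
    s1, s2, tokens = X
    pretrain_term = A * np.power(s1, -alpha) - C * s2   # LR-schedule-dependent part
    shift_term = B * np.power(tokens + 1.0, -beta)      # distribution shift decays with CPT tokens
    return L0 + pretrain_term + shift_term

# Toy data: a cosine-decay CPT schedule and synthetic losses to fit against.
steps = np.arange(1, 2001)
lrs = 1e-4 * 0.5 * (1 + np.cos(np.pi * steps / steps[-1]))
s1, s2 = lr_areas(lrs)
tokens = steps.astype(float)  # stand-in for CPT tokens seen so far
true = cpt_loss((s1, s2, tokens), 2.0, 0.5, 0.4, 20.0, 0.8, 0.3)
observed = true + np.random.normal(0, 0.005, size=true.shape)

popt, _ = curve_fit(cpt_loss, (s1, s2, tokens), observed,
                    p0=[2.0, 1.0, 0.5, 1.0, 1.0, 0.5], maxfev=20000)
print("fitted (L0, A, alpha, C, B, beta):", np.round(popt, 3))
```

The point of the sketch is the structure, not the numbers: once such a curve is fitted, the loss of a planned CPT run can be predicted from its LR schedule and token budget before the run is launched, which is the practical use the review highlights.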