Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Reyhane Askari Hemmat, Mohammad Pezeshki, Elvis Dohmatob, Florian Bordes, Pietro Astolfi, Melissa Hall, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
International Conference on Machine Learning 2025 · Oral
Training ever more performant models typically hinges on vast, high-quality datasets, yet acquiring and labeling real-world data is expensive, time-consuming, and often prohibitive. This talk, presented by Reyhane Askari Hemmat and colleagues from FAIR at Meta, introduces **Deliberate Practice (DP)**, a framework inspired by the psychology of human learning that substantially improves the data efficiency and scaling laws of synthetic data generation. The core innovation is to dynamically generate challenging, informative examples via **entropy-guided sampling** from **diffusion models**, rather than relying on static or naively generated datasets.
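To make the idea of entropy-guided sampling concrete, here is a minimal, self-contained toy sketch (not the paper's actual diffusion-guidance procedure). It assumes a hypothetical fixed linear classifier and performs gradient ascent on the classifier's prediction entropy over a 2-D point, illustrating how a sample can be steered toward the decision boundary where the classifier is maximally uncertain; in the paper this gradient signal would instead steer the denoising steps of a diffusion model.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    # Shannon entropy in nats; high entropy = classifier is uncertain.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical toy linear "classifier" over a 2-D sample x: logits = W @ x.
W = [[2.0, -1.0], [-1.5, 2.5], [0.5, 0.5]]

def classifier_entropy(x):
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    return entropy(softmax(logits))

def entropy_guided_step(x, step_size=0.1, eps=1e-4):
    # Central-difference numerical gradient of entropy w.r.t. x,
    # followed by one gradient-ascent step toward higher uncertainty.
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((classifier_entropy(xp) - classifier_entropy(xm)) / (2 * eps))
    return [x_i + step_size * g_i for x_i, g_i in zip(x, grad)]

x = [1.0, -0.5]            # start at a confidently classified point
h_before = classifier_entropy(x)
for _ in range(20):
    x = entropy_guided_step(x)
h_after = classifier_entropy(x)
# h_after > h_before: the sample drifted toward the decision boundary,
# i.e. it became "harder" for the classifier.
```

The same ascent direction, injected during diffusion-model inference rather than applied to a raw point, is what steers generation toward informative, hard-to-classify synthetic examples.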
AI review
A competent and well-motivated engineering contribution from FAIR that demonstrates real empirical gains in synthetic data efficiency via entropy-guided diffusion sampling. The core idea — dynamically generating hard examples by steering diffusion model inference with classifier entropy gradients — is sensible and the results on ImageNet-100/1k are credible. The framing as 'Deliberate Practice' is more rhetorical than formal, the theoretical content is thin, and the novelty relative to existing curriculum learning and hard-example mining literature is underexplored. Solid applied work; not a…