How Do Large Language Monkeys Get Their Power (Laws)?
Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, Sanmi Koyejo
International Conference on Machine Learning 2025 · Oral
This talk, presented by Rylan Schaeffer and Joshua Kazdan at ICML 2025, examines a seemingly paradoxical pair of scaling laws that arise when **Large Language Models (LLMs)**, affectionately termed "large language monkeys," make multiple independent attempts at each problem. Empirically, the aggregate success rate across a dataset, `pass@k`, follows a **power law** in the number of attempts *k*, yet the success rate for any individual problem, `pass_i@k`, approaches 1 exponentially in *k*. The core of the work is to resolve this discrepancy by identifying the statistical property of the problem-difficulty distribution, namely a power-law left tail in the single-attempt success rates `pass_i@1`, that gives rise to the aggregate power law.
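The mechanism can be illustrated numerically. The sketch below (my own, not the authors' code) draws per-problem success rates `pass_i@1` from a Beta distribution, chosen only because its density behaves like p^(a-1) near zero, giving the power-law left tail the talk identifies; with independent attempts, each problem's `pass_i@k = 1 - (1 - pass_i@1)^k` approaches 1 exponentially, yet the aggregate failure rate `1 - pass@k` decays as roughly k^(-a):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw per-problem single-attempt success rates pass_i@1 from a
# Beta(a, b) distribution: its density scales like p^(a-1) near p = 0,
# i.e. a power-law left tail with exponent a. (An illustrative choice,
# not the distribution fitted in the paper.)
a, b = 0.3, 3.0
p = rng.beta(a, b, size=200_000)

ks = np.array([16, 64, 256, 1024, 4096])

# Per problem, pass_i@k = 1 - (1 - pass_i@1)^k rises to 1 exponentially.
# Aggregated over problems, the failure rate 1 - pass@k = E[(1 - p)^k]
# is dominated by the hardest problems (p near 0) and decays ~ k^(-a).
fails = np.array([np.mean((1.0 - p) ** k) for k in ks])
for k, f in zip(ks, fails):
    print(f"k={k:5d}  1 - pass@k = {f:.4f}")

# The slope in log-log space recovers (minus) the tail exponent a = 0.3.
slope = np.polyfit(np.log(ks), np.log(fails), 1)[0]
print(f"fitted power-law exponent ~ {-slope:.2f}")
```

The fitted exponent lands near the Beta tail parameter `a`, consistent with the claim that the left-tail exponent of the difficulty distribution sets the exponent of the aggregate power law.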
AI review
Schaeffer, Kazdan, and collaborators resolve a genuine paradox in LLM inference scaling — why pass@k aggregated over a dataset follows a power law while per-problem pass_i@k scales exponentially — by identifying a power law left tail in the pass_i@1 difficulty distribution as both necessary and sufficient. The theorems are stated with apparent precision, the empirical validation is clean and includes a falsifying case (Llama 3's jailbreak distribution lacking the heavy tail), and the practical payoff is a more sample-efficient distributional predictor. This is the kind of work that explains…