Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation

D. Sculley, William Cukierski, Phil Culliton, Sohier Dane, Maggie Demkin, Ryan Holbrook, Addison Howard, Paul Mooney, Walter Reade, Meg Risdal, Nate Keating

International Conference on Machine Learning 2025 · Oral

In this position talk at ICML 2025, Walter Reade, a data scientist on the Kaggle competitions team, presented on behalf of his co-authors a central argument: **AI competitions offer the gold standard for empirical rigor in generative AI (GenAI) evaluation**. The talk critically examines the foundational assumptions underlying traditional machine learning benchmarks and argues that they break down in the rapidly evolving landscape of GenAI. With large language models (LLMs) and other generative systems demonstrating unprecedented capabilities, the methods by which we assess their true generalization, robustness, and susceptibility to data leakage require a fundamental re-evaluation.

AI review

A position talk from the Kaggle team arguing that AI competitions constitute the gold standard for GenAI evaluation. The core diagnosis is correct and worth stating: static IID benchmarks break down when test-set contamination is intractable and generalization to novel tasks is the actual target. But the talk does not rise above advocacy. The proposed solution (competitions) is presented with enthusiasm and case studies rather than with a formal analysis of what properties a good evaluation framework requires, whether competitions satisfy those properties, or when they do not. The result is a…