Emergence in non-neural models: grokking modular arithmetic via average gradient outer product

Neil Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Misha Belkin

International Conference on Machine Learning 2025 · Oral

In this talk, Neil Mallinar presents research, conducted with his co-authors, that challenges the conventional understanding of generalization in machine learning, in particular the phenomenon known as **grokking**. Traditionally observed in neural networks, grokking describes delayed generalization: a model first fits the training data perfectly while test performance remains poor, then suddenly achieves high test accuracy after thousands of additional training steps. The presentation, "Emergence in non-neural models: grokking modular arithmetic via average gradient outer product," investigates whether this emergent behavior is unique to neural networks and proposes a unified framework for understanding it across diverse model architectures.
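The modular-arithmetic tasks used in grokking studies are small enough to construct directly. A minimal sketch of such a dataset follows; the modulus, split fraction, and seed here are illustrative assumptions, not necessarily the paper's exact experimental setup:

```python
import numpy as np

p = 97  # modulus; a common choice in grokking experiments (an assumption here)

# All (a, b) input pairs, labeled with (a + b) mod p
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Random train/test split; grokking is typically observed when only a
# fraction of all p*p pairs is used for training
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = len(pairs) // 2  # 50% training fraction (illustrative)
train_idx, test_idx = idx[:n_train], idx[n_train:]
```

Tracking train and test accuracy on such a split over many training steps is how the delayed-generalization curve characteristic of grokking is observed.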

AI review

This paper makes a genuine theoretical contribution by demonstrating that grokking — previously studied almost exclusively in neural networks — occurs in Kernel Recursive Feature Machines, and by identifying block circulant AGOP structure as the shared computational signature across both model classes. The mechanistic account is rigorous enough to be actionable: the authors prove (in the paper, if not fully in the talk) that a quadratic kernel with circulant features implements the Fourier multiplication algorithm, and they confirm the structural hypothesis by showing that a priori circulant…
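Both ingredients of this account are easy to illustrate numerically. The sketch below computes an AGOP for a toy linear predictor (where it reduces exactly to the outer product of the weight vector with itself) and checks the classical fact that a circulant matrix is diagonalized by the discrete Fourier transform, the property underlying Fourier-style multiplication with circulant features. The toy model and dimensions are illustrative assumptions, not the paper's kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Average gradient outer product (AGOP) for a toy predictor ---
# AGOP(f) = (1/n) * sum_i grad f(x_i) grad f(x_i)^T.
# For the linear map f(x) = w . x the gradient is w at every point,
# so the AGOP equals w w^T exactly.
d, n = 5, 100
w = rng.standard_normal(d)
grads = np.tile(w, (n, 1))       # grad f(x_i) = w for every sample i
agop = grads.T @ grads / n       # (d, d) matrix; equals np.outer(w, w)

# --- Circulant matrices are diagonalized by the DFT ---
m = 8
c = rng.standard_normal(m)
C = np.array([np.roll(c, k) for k in range(m)])  # rows are cyclic shifts of c
F = np.fft.fft(np.eye(m)) / np.sqrt(m)           # unitary DFT matrix
D = F @ C @ F.conj().T                           # should be (numerically) diagonal
off = np.abs(D - np.diag(np.diag(D))).max()      # largest off-diagonal magnitude
```

Conjugating any circulant matrix by the DFT yields a diagonal matrix, which is why features with (block) circulant structure can multiply group elements by going to the Fourier domain, multiplying pointwise, and transforming back.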