Tutorial on Mechanistic Interpretability for Language Models

Ziyu Yao, Daking Rai

International Conference on Machine Learning 2025 · Tutorial

This article summarizes a tutorial on **Mechanistic Interpretability (MI)** for language models presented by Ziyu Yao and Daking Rai at ICML 2025. The tutorial addresses the fundamental challenge of understanding the internal workings of large language models (LLMs), which, despite their remarkable capabilities, largely remain "black boxes" to their users and developers. The speakers argue that while LLMs are, at the surface, probabilistic models predicting the next token, the way their internal Transformer architecture computes those predictions is not fully understood.
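The "probabilistic next-token prediction" view can be made concrete with a small sketch: a language model maps a context to a vector of logits over its vocabulary, and a softmax turns those logits into a probability distribution over possible next tokens. The vocabulary and logit values below are toy assumptions for illustration, not the output of any real model.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might emit after the context "The cat sat on the"
vocab = ["mat", "dog", "moon"]
logits = [3.2, 1.1, 0.3]

probs = softmax(logits)                       # distribution over next tokens
next_token = vocab[probs.index(max(probs))]   # greedy decoding picks the argmax
```

Mechanistic interpretability asks what happens *inside* the model to produce those logits in the first place, which is exactly the part this sketch leaves abstract.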

AI review

This is a competently assembled survey tutorial on mechanistic interpretability, covering the standard terrain — features, circuits, universality, SAEs, intervention methods — with reasonable technical breadth. The speakers clearly know the literature and communicate the scaffolding of MI research with care. But as a tutorial, its job is organization and pedagogy, not contribution, and evaluated against the standard I apply to research work, there is no new theorem, no new framework, no new empirical result, and no unifying insight that restructures how the community should think about the…