AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, Florian Tramer

International Conference on Machine Learning 2025 · Oral

In this talk from ICML 2025, Nicholas Carlini and his co-authors present **AutoAdvExBench**, a proxy-free benchmark designed to evaluate whether large language models (LLMs) can automatically exploit **adversarial example defenses**. The core premise of the work is to move beyond synthetic "proxy" tasks that merely simulate human research activity and instead measure whether LLMs can perform the actual, complex security tasks that real security professionals undertake. The specific task chosen for this benchmark is the automatic generation of adversarial attacks that successfully bypass established defenses for image classifiers.
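
For readers unfamiliar with the task, the sketch below shows the kind of attack an agent would have to produce to break a defense: a standard projected gradient descent (PGD) attack under an L-infinity budget against a PyTorch image classifier. This is an illustrative sketch only, not the benchmark harness or any defense-specific attack from the paper; the toy `model`, `eps`, and `alpha` values are placeholder assumptions.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Projected gradient descent under an L-infinity budget.

    model : classifier mapping images in [0, 1] to logits
    x, y  : clean images and their true labels
    eps   : maximum L-infinity perturbation
    alpha : per-step size
    steps : number of gradient ascent steps
    """
    # Random start inside the epsilon ball; this often helps against defenses
    # that rely on (unintentional) gradient masking.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # Ascend the loss, then project back into the epsilon ball around x.
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()

    return x_adv


if __name__ == "__main__":
    # Toy demonstration on a randomly initialized classifier (hypothetical;
    # the benchmark's real targets are published defenses, not this stub).
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
    x = torch.rand(8, 3, 32, 32)
    y = torch.randint(0, 10, (8,))
    x_adv = pgd_attack(model, x, y)
    with torch.no_grad():
        clean_acc = (model(x).argmax(1) == y).float().mean().item()
        robust_acc = (model(x_adv).argmax(1) == y).float().mean().item()
    # The proxy-free success criterion: did accuracy on the actual model drop?
    print(f"clean accuracy: {clean_acc:.2f}, adversarial accuracy: {robust_acc:.2f}")
```

In the benchmark itself, success is judged the same way the last two lines suggest: by whether the defended model's accuracy actually falls on the generated adversarial examples, rather than by any surrogate metric.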

AI review

AutoAdvExBench is a well-motivated, honestly executed empirical benchmark that asks whether LLMs can autonomously exploit adversarial example defenses. The proxy-free design is the paper's genuine contribution — success is defined by whether the defense actually breaks, not by a surrogate metric — and the clean/real-world defense split produces a finding that is both credible and practically informative. The work is careful and the takeaways are not overclaimed. But this is benchmarking infrastructure, not theoretical advance: there are no new theorems, no new attack methods, and the core…