ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks

Saurabh Jha, Rohan Arora, Yuji Watanabe, Takumi Yanagawa, Yinfang Chen, Jackson Clark, Bhavya Bhavya, Mudit Verma, Harshit Kumar, Hirokuni Kitahara, Noah Zheutlin, Saki Takano, Divya Pathak, Felix George, Xinbo Wu, Bekir Turkkan, Gerard Vanloo, Michael Nidd, Ting Dai, Oishik Chatterjee, Pranjal Gupta, Suranjana Samanta, Pooja Aggarwal, Rong Lee, Jae-wook Ahn, Debanjana Kar, Amit Paradkar, Yu Deng, Pratibha Moogi, Prateeti Mohapatra, Naoki Abe, Chandrasekhar Narayanaswami, Tianyin Xu, Lav Varshney, Ruchi Mahindru, Anca Sailer, Larisa Shwartz, Daby Sow, Nicholas Fuller, Ruchir Puri

International Conference on Machine Learning 2025 · Oral

The promise of **AI agents** to automate complex, real-world tasks has attracted immense attention. However, the true capabilities of these agents, particularly in high-stakes operational environments, remain underexplored and inadequately benchmarked. This talk introduces **ITBench**, a benchmark designed to rigorously evaluate AI agents on diverse, real-world IT automation tasks. Developed collaboratively by IBM and the University of Illinois at Urbana-Champaign, ITBench aims to bridge the gap between theoretical agentic capabilities and practical deployment in critical IT infrastructure.

AI review

ITBench is a benchmark paper for evaluating LLM agents on IT automation tasks. The work is honest about what it is — an engineering artifact — and the operational motivation is genuine. But this is not a theoretical contribution, and it does not have the scientific depth to justify a strong rating at a venue like ICML. The headline result (state-of-the-art LLMs score 11-14% on SRE tasks) is presented as a finding, but without a controlled analysis of *why* performance is low, it is closer to a demonstration than a discovery. The paper surfaces real challenges but does not explain them, does…