What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang

International Conference on Machine Learning 2025 · Oral

This talk introduces **OmniBench**, a subtask-based benchmark that provides a scalable, multi-dimensional evaluation framework for virtual agents. Presented by Yaoxin Li from the University of Waterloo on behalf of the original authors, the work addresses critical limitations of existing benchmarks, which often suffer from static complexity, reliance on manual annotation, and inadequate evaluation metrics. OmniBench represents each task as a graph of interconnected subtasks, which allows it to define and assess five distinct dimensions of task complexity and ten core agent capabilities.
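To make the subtask-graph idea concrete, here is a minimal illustrative sketch. The `Subtask` and `TaskGraph` classes, the dependency-aware scoring rule, and the flight-booking example are assumptions for illustration only, not the authors' actual data model or evaluation metric; the point is how a graph of dependent subtasks can credit partial progress instead of scoring a whole task pass/fail.

```python
# Minimal sketch: a task as a graph of dependent subtasks, with a
# partial-completion score. Names and scoring rule are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    depends_on: list[str] = field(default_factory=list)

@dataclass
class TaskGraph:
    subtasks: dict[str, Subtask]

    def completion_score(self, completed: set[str]) -> float:
        """Credit a subtask only if it and all of its dependencies completed."""
        credited = {
            name for name, st in self.subtasks.items()
            if name in completed and all(d in completed for d in st.depends_on)
        }
        return len(credited) / len(self.subtasks)

# Example: "book a flight" decomposed into a chain of dependent subtasks.
task = TaskGraph({
    "open_site": Subtask("open_site"),
    "search":    Subtask("search", depends_on=["open_site"]),
    "select":    Subtask("select", depends_on=["search"]),
    "pay":       Subtask("pay", depends_on=["select"]),
})
print(task.completion_score({"open_site", "search"}))  # 0.5: partial progress
```

Under a representation like this, an agent that completes half the chain earns a score of 0.5 rather than 0, which is exactly the kind of graded signal the review below notes is missing from benchmarks that only distinguish complete success from complete failure.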

AI review

OmniBench proposes a graph-based benchmark framework for evaluating virtual agents across multiple complexity dimensions and capability axes. The engineering effort is genuine and the motivation is real — existing benchmarks do saturate quickly and fail to distinguish partial progress from complete failure. But this is systems/empirical work dressed in the language of principled contribution. There are no theorems, no formal definitions that do real work, and the 'framework' amounts to a set of design decisions with post-hoc empirical validation. The findings (agents struggle with…