EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
Rui Yang, Hanyang(Jeremy) Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
International Conference on Machine Learning 2025 · Oral
The talk introduces **EmbodiedBench**, a comprehensive benchmarking suite for evaluating multi-modal large language models (MLLMs, often called VLMs) as vision-driven embodied agents. Presented by Rui Yang and a collaborative team from institutions including the University of Illinois, UW, U of Toronto, and TTIC, this work addresses critical limitations in existing embodied-AI evaluation methodologies. Specifically, EmbodiedBench tackles the scarcity of diverse tasks and scenarios, as well as the lack of fine-grained diagnostic insights beyond simple success rates.
AI review
EmbodiedBench is a competently executed benchmarking effort that documents performance gaps between high-level and low-level embodied tasks for VLMs. The empirical observations are real, and the benchmark infrastructure appears carefully constructed. However, this is squarely an engineering and measurement contribution, not a theoretical one, and the article at times overstates the depth of its insights. The central findings — that VLMs trained on text-heavy corpora struggle with low-level motor control, that vision is more useful when spatial precision is required, that GPT-4o and Gemini…