EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang, Hanyang(Jeremy) Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang

International Conference on Machine Learning 2025 · Oral

The talk introduces **EmbodiedBench**, a comprehensive benchmark suite for evaluating multimodal large language models (MLLMs) as vision-driven embodied agents. Presented by Rui Yang and collaborators from institutions including the University of Illinois, UW, the University of Toronto, and TTIC, the work addresses critical limitations in existing embodied-AI evaluation: the scarcity of diverse tasks and scenarios, and the lack of fine-grained diagnostic insight beyond aggregate success rates.

AI review

EmbodiedBench is a competently executed benchmarking effort that documents performance gaps between high-level and low-level embodied tasks for MLLMs. The empirical observations are real, and the benchmark infrastructure appears carefully constructed. However, this is squarely an engineering and measurement contribution, not a theoretical one, and the article consistently overstates the depth of its insights. The central findings (that MLLMs trained on text-heavy corpora struggle with low-level motor control, that vision is more useful when spatial precision is required, that GPT-4o and Gemini…