Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
Yunzhuo Hao, Jiawei Gu, Huichen Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Yu Cheng
International Conference on Machine Learning 2025 · Oral
In an era where large language models (LLMs) are rapidly evolving into **multimodal large language models (MLLMs)**, their ability to process and generate information across modalities—text, images, audio—is increasingly critical. However, a fundamental question persists: can these models truly *reason* in a deeply multimodal fashion, or do they primarily rely on their strong language capabilities, treating visual inputs as little more than shallow cues? This talk, presented by Yunzhuo Hao at ICML 2025, introduces **EMMA (Enhanced MultiModal ReAsoning)**, a rigorous benchmark designed to probe the genuine cross-modal reasoning abilities of MLLMs.
AI review
EMMA is a benchmark paper that makes a legitimate and practically important observation — current MLLMs are language-dominant systems that struggle with iterative cross-modal reasoning — but delivers it as a measurement contribution rather than a theoretical or mechanistic one. The core claim is supported by real numbers and a thoughtfully designed filtering pipeline, which is more than most benchmark papers offer. However, the work stops precisely where it becomes scientifically interesting: why do models fail, in a formal or mechanistic sense? The talk circles around empirical observations…