Position: Medical Large Language Model Benchmarks Should Prioritize Construct Validity
Ahmed Alaa, Thomas Hartvigsen, Niloufar Golchini, Shiladitya Dutta, Frances Dean, Inioluwa Raji, Travis Zack
International Conference on Machine Learning 2025 · Oral
In an oral presentation at ICML 2025, Tom Hartvigsen, faculty at the University of Virginia, delivered a position paper on behalf of a collaborative team from UC Berkeley and UCSF. The talk, titled "Position: Medical Large Language Model Benchmarks Should Prioritize Construct Validity" — or, in the authors' own gloss, "we should care more about what our benchmarks measure" — addresses a fundamental challenge in the rapidly growing field of medical large language models (LLMs). As the race to deploy LLMs in high-stakes medical applications accelerates, the community faces significant hurdles in accurately measuring progress and ensuring the reliability of these models.
AI review
This position paper raises a legitimate and practically important concern — that MedQA rankings don't predict real-world clinical performance — and backs it with a concrete empirical demonstration using matched UCSF EHR data. The finding that model rankings invert between synthetic and real-world evaluation is the kind of result the community should see. But the paper is a position paper dressed up as a theoretical contribution: the central concept of "construct validity" is borrowed wholesale from psychometrics without formalization, and the empirical methodology is described too loosely to…