LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
Parshin Shojaee, Ngoc Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa Doan, Chandan Reddy
International Conference on Machine Learning 2025 · Oral
This talk, presented by Chandan Reddy of Virginia Tech on behalf of his PhD students Parshin Shojaee and Ngoc Hieu Nguyen and their co-authors, introduces **LLM-SRBench**, a benchmark designed to push the boundaries of Large Language Models (LLMs) in scientific equation discovery, also known as symbolic regression. The central premise is to move beyond mere data fitting toward leveraging the vast scientific knowledge embedded within LLMs to uncover new, explainable mathematical hypotheses that accurately describe observed phenomena. The talk highlights a critical limitation of existing benchmarks: their susceptibility to LLM memorization, which allows models to recall known equations rather than genuinely discover them.
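To make the evaluation problem concrete: scoring a discovered equation against the ground truth can, in the simplest case, be framed as a strict symbolic-equivalence check. The sketch below uses SymPy to illustrate that naive baseline; the function name and examples are illustrative only, not LLM-SRBench's actual implementation (the benchmark instead uses an LLM-as-judge mechanism, which is better suited to messier real-world hypotheses).

```python
import sympy as sp

def symbolically_equivalent(candidate: str, ground_truth: str) -> bool:
    """Hypothetical strict check: do two expressions simplify to the same form?

    Parses both strings into SymPy expressions and tests whether their
    difference simplifies to zero. Free symbols (e.g. x) are inferred by
    sympify. This is a brittle baseline: algebraically equivalent forms
    that simplify() cannot reconcile would be scored as mismatches.
    """
    diff = sp.simplify(sp.sympify(candidate) - sp.sympify(ground_truth))
    return diff == 0

# Equivalent forms pass despite different surface syntax:
print(symbolically_equivalent("sin(x)**2 + cos(x)**2", "1"))   # True
print(symbolically_equivalent("x**2 + x", "x*(x + 1)"))        # True
# Genuinely different equations fail:
print(symbolically_equivalent("x + 1", "x + 2"))               # False
```

A check like this is exact but unforgiving, which is one reason a benchmark might prefer a judge that tolerates benign notational differences while still penalizing wrong hypotheses.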
AI review
LLM-SRBench is a competent benchmark contribution that addresses a real and underappreciated problem, LLM memorization inflating symbolic regression scores, and proposes two concrete dataset construction strategies along with an LLM-as-judge evaluation mechanism. The work is honest about what it is: an evaluation infrastructure paper, not a theoretical or algorithmic breakthrough. The 95% human correlation figure for the LLM judge is the empirical core, and the finding that current LLMs cap at roughly 30% symbolic accuracy on non-memorizable problems is a useful data point for the field. However, this…