VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
International Conference on Machine Learning 2025 · Oral
**Large Language Models (LLMs)** generate text probabilistically, so the same prompt can yield different, and sometimes incorrect, answers. This variability is a serious obstacle to deploying LLMs in applications that demand accuracy and reliability, especially complex multi-step reasoning. In this talk, Thomas Zeng and his collaborators introduce **VersaPRM**, a **Process Reward Model (PRM)** that scores the correctness of individual LLM reasoning steps across multiple domains, rather than only within the narrow, math-centric settings that existing PRMs are trained for.
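To make the PRM idea concrete, here is a minimal sketch (not the authors' implementation) of how a process reward model is typically used at inference time: each reasoning step receives a correctness score, per-step scores are aggregated (taking the minimum is a common choice), and the aggregate reranks candidate chains in best-of-N sampling. The `score_step` callable is a purely hypothetical stand-in for a trained scorer.

```python
import math
from typing import Callable, List

# Hypothetical stand-in for a trained PRM: maps (question, steps so far)
# to the probability that the latest step is correct. A real PRM would be
# a fine-tuned LLM with a classification head evaluated at step boundaries.
StepScorer = Callable[[str, List[str]], float]

def score_chain(question: str, steps: List[str], score_step: StepScorer) -> List[float]:
    """Score every prefix of a reasoning chain with the PRM."""
    return [score_step(question, steps[: i + 1]) for i in range(len(steps))]

def aggregate(step_scores: List[float], how: str = "min") -> float:
    """Collapse per-step scores into one chain-level score.

    'min' judges a chain by its weakest step; 'prod' multiplies step
    probabilities, penalizing long chains with many shaky steps.
    """
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.exp(sum(math.log(max(s, 1e-9)) for s in step_scores))
    raise ValueError(f"unknown aggregation: {how}")

def best_of_n(question: str, chains: List[List[str]], score_step: StepScorer) -> List[str]:
    """Rerank N sampled reasoning chains; return the highest-scoring one."""
    return max(chains, key=lambda ch: aggregate(score_chain(question, ch, score_step)))
```

Min-aggregation is popular because a single wrong step usually invalidates the whole chain; the sketch deliberately leaves the scorer abstract, since the scorer is exactly the component VersaPRM trains on multi-domain data.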
AI review
VersaPRM proposes a multi-domain process reward model trained on synthetically labeled, diverse data rather than math-only corpora. The core observation — that existing PRMs fail out-of-distribution because they were trained on narrow data — is real and worth stating. But the contribution reduces almost entirely to 'use more diverse training data and label it with an LLM,' which is a data engineering insight, not a theoretical or even rigorously empirical one. The evaluation is surface-level, the annotation pipeline is treated as a black box, and the paper makes no attempt to characterize…
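For readers unfamiliar with the annotation setup the review refers to, the sketch below shows one plausible shape for an LLM auto-labeling pipeline: a judge model labels each step of a sampled chain as correct or incorrect, and the resulting (step, label) pairs become PRM training data. The `llm_judge` callable, the prompt template, and the label parsing are illustrative assumptions, not the paper's actual pipeline.

```python
from typing import Callable, List, Tuple

# Hypothetical judge: takes a prompt string, returns the model's text reply.
# In practice this would wrap an API call to a strong LLM.
LLMJudge = Callable[[str], str]

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reasoning so far:\n{prefix}\n"
    "Next step: {step}\n"
    "Is the next step logically correct? Answer 'correct' or 'incorrect'."
)

def label_chain(question: str, steps: List[str], judge: LLMJudge) -> List[Tuple[str, int]]:
    """Produce synthetic (step, label) training pairs for one reasoning chain."""
    labeled = []
    for i, step in enumerate(steps):
        prompt = JUDGE_PROMPT.format(
            question=question,
            prefix="\n".join(steps[:i]) or "(none)",
            step=step,
        )
        reply = judge(prompt).strip().lower()
        # Binary label: 1 if the judge deems the step correct, else 0.
        labeled.append((step, 1 if reply.startswith("correct") else 0))
    return labeled
```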