AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine

Carlo Siebenschuh, Kyle Hippe, Alexander Brace, Arvind Ramanathan, Ian Foster, Robert Underwood

Conference on Machine Learning and Systems 2025 · Day 2 · Session 2: Parallel and Distributed Systems

In the rapidly evolving landscape of large language models (LLMs), the quality and scale of pre-training data are paramount. This talk introduces **AdaParse**, an adaptive parallel PDF parsing and resource scaling engine designed to unlock vast repositories of scientific literature for LLM training. Presented by Carlo Siebenschuh of the University of Chicago and Argonne National Laboratory, AdaParse addresses a critical bottleneck: efficiently and accurately extracting textual content from PDF documents, particularly in the scientific domain. The overarching goal is to enable "science-savvy" foundation models capable of understanding and reasoning over complex scientific communication.

AI review

AdaParse solves a real and underappreciated problem — PDF parsing is genuinely a bottleneck for scientific LLM pretraining, and the adaptive meta-strategy framing is legitimate engineering. The 17x throughput gain over Nougat with comparable or better accuracy is a meaningful result. But the write-up is long on framing and short on the implementation details that would let you reproduce or extend this work. The DPO application is interesting, but the accuracy gains are modest enough that the justification feels strained. Worth watching for HPC-scale data pipeline teams; not a must-see…
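To make the "adaptive meta-strategy" concrete, here is a minimal sketch of the routing idea: send each document to the cheapest parser whose predicted output quality clears a threshold, and escalate to an expensive neural parser otherwise. All names (`fast_parse`, `accurate_parse`, `predict_quality`, the `Doc` fields) are hypothetical illustrations, not the actual AdaParse API, and the quality predictor here is a toy heuristic standing in for a learned model.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    complexity: float  # toy stand-in for layout/math-density features


def fast_parse(doc: Doc) -> str:
    # Cheap rule-based extraction (high throughput, lower fidelity).
    return f"fast:{doc.name}"


def accurate_parse(doc: Doc) -> str:
    # Expensive neural parsing (e.g. a Nougat-class model).
    return f"accurate:{doc.name}"


def predict_quality(doc: Doc) -> float:
    # In the real system this would be a learned predictor;
    # here, assume quality degrades with document complexity.
    return 1.0 - doc.complexity


def route(doc: Doc, threshold: float = 0.7) -> str:
    # Use the cheap parser when predicted quality is acceptable,
    # otherwise escalate to the accurate (and slower) parser.
    if predict_quality(doc) >= threshold:
        return fast_parse(doc)
    return accurate_parse(doc)
```

In this framing, throughput gains come from the fact that most documents are simple enough for the cheap path, while the expensive parser is reserved for the hard tail.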