Youmu: Efficient Columnar Data Pipeline for LLM Training
Tianle Zhong, Jiechen Zhao, Qiang Su, Geoffrey Fox
Conference on Machine Learning and Systems 2025 · Day 3 · Session 5: LLM Training and Fine-Tuning
Data pipeline efficiency remains a critical bottleneck in large language model (LLM) training. This talk introduces **Youmu**, a system that streamlines data ingestion and shuffling for LLM training by operating directly on **Parquet**, a widely adopted columnar data format. Presented by Tianle Zhong and collaborators, Youmu eliminates the "costly pit stops" common in today's pipelines: converting data out of an optimized columnar storage format like Parquet into less efficient row-based intermediates solely to support data shuffling during training.
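To make the "pit stop" contrast concrete, here is a minimal sketch of shuffling directly over a Parquet file with pyarrow, rather than first materializing a row-based copy. This is not Youmu's actual API; the function name `shuffled_batches` and the row-group granularity are illustrative assumptions only.

```python
# Minimal sketch: shuffle directly over Parquet instead of converting
# to a row-based format first. NOT Youmu's API; illustration only.
import random
import pyarrow.parquet as pq

def shuffled_batches(path: str, seed: int = 0):
    """Yield the row groups of a Parquet file in random order.

    Shuffling at row-group (chunk) granularity is coarse: all samples
    in a group stay together, and fetching any one sample still pays
    for the whole group. That limitation is what motivates finer-grained
    access, discussed in the review below.
    """
    pf = pq.ParquetFile(path)
    order = list(range(pf.metadata.num_row_groups))
    random.Random(seed).shuffle(order)
    for rg in order:
        # One chunk-granular read per group; no intermediate copy on disk.
        yield pf.read_row_group(rg)
```

Even this naive version avoids the conversion step entirely; the trade-off it leaves open is shuffle quality versus I/O granularity.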
AI review
Youmu addresses a real and underappreciated pain point in LLM training infrastructure: the mismatch between Parquet's chunk-oriented I/O and the fine-grained random access that good data shuffling requires. The core insight (use data pages, not whole chunks, as the I/O unit) is genuinely clever and practically motivated, and the Rust implementation with Python APIs is a sound engineering choice. But the write-up is frustratingly thin on specifics: no hardware configurations, no dataset names, no actual throughput numbers, and a claim of being 'just as good as row-based shuffling' without confidence intervals or…
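The read-amplification problem the review alludes to can be sketched as follows. Fetching one shuffled sample through a chunk-granular reader forces decoding an entire row group; the helper `fetch_one_row` below is hypothetical and not part of Youmu.

```python
# Sketch of read amplification under chunk-granular access:
# retrieving a single row decodes the entire enclosing row group.
# Hypothetical helper, not from Youmu.
import pyarrow.parquet as pq

def fetch_one_row(path: str, global_index: int):
    pf = pq.ParquetFile(path)
    offset = 0
    for rg in range(pf.metadata.num_row_groups):
        n = pf.metadata.row_group(rg).num_rows
        if global_index < offset + n:
            table = pf.read_row_group(rg)  # reads the WHOLE group...
            return table.slice(global_index - offset, 1)  # ...for one row
        offset += n
    raise IndexError(global_index)
```

Parquet's footer metadata does record per-page offsets within each column chunk, but pyarrow's Python API only reads at row-group granularity; exploiting pages as the I/O unit requires a lower-level reader, which is consistent with the review's observation that Youmu is implemented in Rust with Python bindings on top.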