Balancing Pipeline Parallelism with Vocabulary Parallelism
Yeung Man Tsung, Penghui Qi, Min Lin, Xinyi Wan
Conference on Machine Learning and Systems 2025 · Day 4 · Session 9: Parallel and Distributed Systems
This talk, presented by Yeung Man Tsung of the National University of Singapore, details **Vocabulary Parallelism (VP)**, an approach that partitions the input and output embedding layers along the vocabulary dimension and spreads them evenly across all pipeline devices. The research, conducted during an internship at ByteDance's CAI lab, addresses a critical bottleneck in scaling large language models (LLMs) with **pipeline parallelism (PP)**: the disproportionate compute and memory load imposed by the embedding layers. While pipeline parallelism distributes the transformer layers evenly across devices, the output embedding layer in particular often becomes a straggler stage, and the imbalance grows with the large vocabularies of modern multilingual models.
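As a rough illustration of the idea (a single-process sketch, not the authors' implementation), the snippet below simulates vocabulary sharding: each shard computes partial logits for its slice of the vocabulary and merges per-token running (max, exp-sum) statistics in the online-softmax style, so only a few scalars per token, rather than the full `[tokens, vocab]` logit matrix, would need to be exchanged between devices. All sizes and variable names here are made up for the example.

```python
# Toy sketch of vocabulary-parallel cross-entropy with online-softmax merging.
# Each loop iteration stands in for one device holding a shard of the output embedding.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

tokens, hidden, vocab, num_shards = 4, 8, 32, 4   # toy sizes, hypothetical
x = torch.randn(tokens, hidden)                   # hidden states entering the output layer
w = torch.randn(vocab, hidden)                    # full output embedding (kept only for reference)
targets = torch.randint(0, vocab, (tokens,))

# Reference: standard cross-entropy over the full vocabulary on one device.
ref_loss = F.cross_entropy(x @ w.t(), targets)

# Vocabulary-parallel version: each "device" holds one contiguous shard of w's rows.
shard_size = vocab // num_shards
running_max = torch.full((tokens,), float("-inf"))
running_sum = torch.zeros(tokens)
target_logit = torch.zeros(tokens)

for rank in range(num_shards):
    w_shard = w[rank * shard_size:(rank + 1) * shard_size]  # this rank's vocab rows
    logits = x @ w_shard.t()                                 # partial logits [tokens, shard_size]

    # Online-softmax merge of (max, exp-sum); in a real run these per-token scalars
    # are what gets reduced across devices, not the partial logits themselves.
    local_max = logits.max(dim=-1).values
    new_max = torch.maximum(running_max, local_max)
    running_sum = (running_sum * torch.exp(running_max - new_max)
                   + torch.exp(logits - new_max.unsqueeze(-1)).sum(dim=-1))
    running_max = new_max

    # Pick out the target logit when the target id falls inside this shard.
    in_shard = (targets >= rank * shard_size) & (targets < (rank + 1) * shard_size)
    local_ids = (targets - rank * shard_size).clamp(0, shard_size - 1)
    gathered = logits.gather(-1, local_ids.unsqueeze(-1)).squeeze(-1)
    target_logit = torch.where(in_shard, gathered, target_logit)

# Cross-entropy from merged statistics: -log softmax(target) = log(sum) + max - target_logit.
vp_loss = (torch.log(running_sum) + running_max - target_logit).mean()

print(ref_loss.item(), vp_loss.item())  # the two losses agree up to floating-point error
```

Run as-is, the two printed losses match up to floating-point error; in an actual pipeline the per-shard loop corresponds to separate devices and the statistic merge to a small collective, which is the communication reduction the review below refers to.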
AI review
Vocabulary Parallelism is a legitimate systems contribution solving a real problem — embedding layer imbalance in pipeline-parallel LLM training — with a clever communication reduction trick borrowed from online softmax. The work appears to be real engineering done at ByteDance by people who actually built and measured it. But the write-up I'm reviewing is a thorough summary that reads more like a well-structured abstract than a window into the implementation. The 5–51% MFU range is too wide to be immediately actionable, there's no code or reproducibility path, and the experimental setup is…