Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Jinghan Yao, Sam Ade Jacobs, Masahiro Tanaka, Olatunji Ruwase, Dhabaleswar Panda
Conference on Machine Learning and Systems 2025 · Day 3 · Session 5: LLM Training and Fine-Tuning
The rapid advancement of large language models (LLMs) has highlighted a critical bottleneck: the ability to process and train on ultra-long input sequences. While models like Llama 3.1 push context lengths to 128K tokens, achieving even longer contexts for training or inference remains a significant challenge due to GPU memory limitations. This talk, presented by Jinghan Yao from Ohio State University and based on collaborative work with the Microsoft DeepSpeed team, introduces the **Fully Pipelined Distributed Transformer (FPDT)**, a novel approach designed to overcome these memory hurdles efficiently and enable training of LLMs with millions of tokens of context.
AI review
Solid systems engineering work on a real and gnarly problem — activation memory blowup during long-context attention backprop. The FPDT design (GPU chunking + host offloading + double buffering) is concrete, the 16x context length improvement is a headline number worth taking seriously, and the implementation lives in the DeepSpeed repo using standard PyTorch rather than custom kernels. The 64K token 'sweet spot' empirical finding is exactly the kind of hardware-grounded insight that separates real systems work from benchmark theater. Docks a star because the article is summary-of-summary…
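To make the chunk-and-offload pattern mentioned above more concrete, here is a minimal PyTorch sketch. It is not the FPDT implementation from the DeepSpeed repo; the function name `chunked_attention_with_offload`, the chunk count, and the stream choreography are illustrative assumptions, and a real double-buffered pipeline would prefetch the next chunk's K/V while the current chunk is still computing rather than waiting immediately as done here for brevity.

```python
import torch
import torch.nn.functional as F

def chunked_attention_with_offload(q, k, v, num_chunks=8):
    """Illustrative causal attention over a long sequence, one query chunk at a time.

    Only the active chunk's K/V stays on the GPU; earlier chunks are parked in
    pinned host memory and copied back on a side stream, sketching the
    chunking + host-offloading + double-buffering idea described in the talk.
    Tensors are shaped (batch, heads, seq_len, head_dim).
    """
    seq_len = q.shape[2]
    chunk = seq_len // num_chunks
    compute_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()   # dedicated stream for H2D/D2H traffic
    host_kv = []                        # offloaded (k, v) chunks in pinned CPU memory
    outputs = []

    for i in range(num_chunks):
        sl = slice(i * chunk, (i + 1) * chunk)
        q_i, k_i, v_i = q[..., sl, :], k[..., sl, :], v[..., sl, :]

        # Bring previously offloaded K/V chunks back to the GPU on the copy
        # stream; wait_stream ensures earlier D2H offloads have finished.
        copy_stream.wait_stream(compute_stream)
        with torch.cuda.stream(copy_stream):
            prev = [(kh.to(q.device, non_blocking=True),
                     vh.to(q.device, non_blocking=True)) for kh, vh in host_kv]
        compute_stream.wait_stream(copy_stream)

        # Current queries attend to every earlier token and, causally,
        # to tokens inside their own chunk.
        k_all = torch.cat([kc for kc, _ in prev] + [k_i], dim=2)
        v_all = torch.cat([vc for _, vc in prev] + [v_i], dim=2)
        q_pos = torch.arange(i * chunk, (i + 1) * chunk, device=q.device)
        k_pos = torch.arange((i + 1) * chunk, device=q.device)
        mask = k_pos[None, :] <= q_pos[:, None]          # True = may attend
        outputs.append(F.scaled_dot_product_attention(q_i, k_all, v_all, attn_mask=mask))

        # Park this chunk's K/V in pinned host memory to free GPU space.
        kh = torch.empty(k_i.shape, dtype=k_i.dtype, device="cpu", pin_memory=True)
        vh = torch.empty(v_i.shape, dtype=v_i.dtype, device="cpu", pin_memory=True)
        kh.copy_(k_i, non_blocking=True)
        vh.copy_(v_i, non_blocking=True)
        host_kv.append((kh, vh))

    return torch.cat(outputs, dim=2)
```

The sketch only shows the single-GPU memory trick; the "distributed" part of FPDT (sequence parallelism across ranks) and the backward pass, where the activation savings matter most, are omitted here.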