TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

Size Zheng, Jin Fang, Ningxin Zheng, Haibin Lin, Li-Wen Chang, Xin Liu

Conference on Machine Learning and Systems 2025 · Day 4 · Session 9: Parallel and Distributed Systems

The rapid advancement and widespread adoption of large language models (LLMs) have driven an unprecedented demand for computational resources, particularly massive clusters of GPUs. Training and inference for state-of-the-art LLMs often require tens of thousands of GPUs, making Model FLOPs Utilization (MFU) a critical metric for efficiency and cost-effectiveness. Even marginal improvements in MFU can translate into substantial cost reductions for organizations operating at this scale. A significant bottleneck in achieving high MFU is the overhead introduced by communication between GPUs, which, without optimization, can consume between 20% and 80% of the total execution time. This challenge is particularly acute under distributed parallelism strategies such as tensor parallelism, pipeline parallelism, data parallelism, and sequence parallelism.
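To make the MFU metric concrete, the sketch below shows how it is commonly estimated: achieved model FLOP/s divided by the aggregate peak FLOP/s of the cluster. The 6-FLOPs-per-parameter-per-token rule of thumb and all numbers in the example are illustrative assumptions, not figures from the paper.

```python
# Rough sketch of how Model FLOPs Utilization (MFU) is commonly estimated.
# The 6 * num_params FLOPs-per-token heuristic (forward + backward pass) and
# the example numbers are illustrative assumptions, not results from TileLink.

def estimate_mfu(num_params: float,
                 tokens_per_second: float,
                 num_gpus: int,
                 peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOP/s / aggregate peak hardware FLOP/s."""
    achieved_flops = 6.0 * num_params * tokens_per_second
    peak_flops = num_gpus * peak_flops_per_gpu
    return achieved_flops / peak_flops

# Hypothetical example: a 70B-parameter model trained on 1024 GPUs, each with an
# assumed 1 PFLOP/s peak, processing 1.0M tokens/s across the whole cluster.
print(f"MFU ~= {estimate_mfu(70e9, 1.0e6, 1024, 1e15):.1%}")
```

Under these assumed numbers the cluster reaches roughly 41% MFU, which illustrates why shaving even a few points of communication overhead matters at this scale.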

AI review

TileLink is a legitimate systems contribution from ByteDance that addresses a real, quantified problem (communication overhead eating 20-80% of GPU time in distributed LLM training) with a compiler-level solution that is more reproducible and general than the hand-tuned libraries it competes with. The two-layer architecture, nine high-level tile-centric primitives on top of a modified Triton compiler called Triton Distributed, is a thoughtful design that separates concerns in exactly the right place. The benchmarks are impressive: 10x+ on MoE, 2x on Ring Attention, 14x on DeepSeek-EP operators…
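The core idea behind that architecture, letting consumer compute tiles start as soon as the producer communication tiles they depend on have arrived rather than waiting for a whole collective to finish, can be sketched outside of Triton. The snippet below is a conceptual single-process simulation using Python threads and per-tile events; the names and structure are hypothetical stand-ins and are not TileLink's or Triton Distributed's actual primitives.

```python
# Conceptual sketch of tile-granular compute-communication overlap.
# This is NOT the TileLink / Triton Distributed API; the per-tile signaling
# below only illustrates the producer/consumer idea described in the paper.
import threading
import time

NUM_TILES = 8
tile_ready = [threading.Event() for _ in range(NUM_TILES)]
tile_data = [None] * NUM_TILES

def communication_producer():
    """Simulates a collective (e.g. all-gather) delivering one tile at a time."""
    for t in range(NUM_TILES):
        time.sleep(0.01)              # stand-in for the network transfer of one tile
        tile_data[t] = f"tile-{t}"    # stand-in for the received activation tile
        tile_ready[t].set()           # tile-granular signal: "tile t has landed"

def compute_consumer():
    """Computes on each tile as soon as it arrives, overlapping with the
    transfers of the remaining tiles instead of waiting on a global barrier."""
    for t in range(NUM_TILES):
        tile_ready[t].wait()          # tile-granular wait, not a whole-tensor wait
        time.sleep(0.01)              # stand-in for the GEMM work on this tile
        print(f"computed on {tile_data[t]}")

producer = threading.Thread(target=communication_producer)
consumer = threading.Thread(target=compute_consumer)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

In the actual system, this kind of tile-level dependency is expressed with the high-level primitives and lowered by the compiler into device-side signals, which is what lets the generated kernels overlap the communication and compute phases rather than serializing them.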