Optimizing Training Performance for Large Language Models (LLMs) in Kubernetes - Klaus Ma & Peng Gu

Klaus Ma, Peng Gu

KubeCon + CloudNativeCon Europe 2025 · Session

This talk, presented by Klaus Ma from Nvidia and Peng Gu from a technical startup, delves into optimizing **Large Language Model (LLM)** training performance in **Kubernetes** environments. The core focus is the networking bottleneck inherent in distributed LLM training, a problem that grows more pronounced with the scale and complexity of modern AI models. Ma and Gu introduce and demonstrate the **network topology aware scheduling** capabilities recently integrated into **Volcano**, a CNCF incubating project and the foundation's first batch scheduling system.
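For context, topology-aware placement is requested on the Volcano Job itself. Below is a minimal sketch, assuming the `networkTopology` field described in Volcano's network topology aware scheduling documentation; the job name, queue, image, and resource sizes are hypothetical and not taken from the talk:

```yaml
# Hedged sketch of a Volcano Job requesting topology-aware placement.
# Field names follow Volcano's published docs for this feature; the
# job name, image, and replica counts are illustrative only.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-pretrain            # hypothetical job name
spec:
  minAvailable: 4               # gang scheduling: all 4 workers or none
  schedulerName: volcano
  networkTopology:
    mode: hard                  # placement must satisfy the constraint
    highestTierAllowed: 1       # keep all pods within one tier-1 HyperNode
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          containers:
            - name: trainer
              image: registry.example.com/llm-trainer:latest  # hypothetical
              resources:
                limits:
                  nvidia.com/gpu: 8
          restartPolicy: Never
```

With `mode: hard`, the scheduler would refuse placements that spill the job across a higher network tier; a `soft` mode would instead treat tier locality as a preference.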

AI review

This talk by Klaus Ma and Peng Gu delivers a practical solution to a pressing problem in large-scale AI: optimizing LLM training performance in Kubernetes by tackling network bottlenecks. They introduce Volcano's network topology aware scheduling, enabled by the new `HyperNode` CRD, which allows granular definition of a data center's network hierarchy (a sketch follows below). The live demo effectively showcases how this scheduling ensures that distributed LLM jobs are co-located for low-latency, high-bandwidth communication, directly impacting efficiency and cost for anyone serious about MLOps.
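To make the `HyperNode` idea concrete, here is a hedged sketch of how a two-tier leaf/spine switch hierarchy might be described. The API group and member/selector fields follow Volcano's published `HyperNode` examples; the node and HyperNode names are hypothetical:

```yaml
# Hedged sketch: a leaf-level HyperNode grouping two GPU nodes under one
# switch, and a spine-level HyperNode grouping two leaves. All names are
# illustrative; the schema follows Volcano's HyperNode examples.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: leaf-0
spec:
  tier: 1                      # lowest tier: best bandwidth and latency
  members:
    - type: Node
      selector:
        exactMatch:
          name: gpu-node-0
    - type: Node
      selector:
        exactMatch:
          name: gpu-node-1
---
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: spine-0
spec:
  tier: 2                      # higher tier: traffic crosses the spine
  members:
    - type: HyperNode
      selector:
        exactMatch:
          name: leaf-0
    - type: HyperNode
      selector:
        exactMatch:
          name: leaf-1         # a sibling leaf, defined analogously
```

Lower tiers correspond to tighter network locality, so a job constrained to tier 1 lands entirely under a single leaf switch, which is what keeps collective communication in distributed training fast.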

Watch on YouTube