Cloud Native AI: Harness the Power of Advanced Scheduling for High-P... William Wang & Xuzheng Chang

William Wang, Xuzheng Chang

KubeCon + CloudNativeCon Europe 2025 · Session

This talk, presented by Xuzheng Chang (Kevin Juan) and Shu Jun (Zir), covers the latest advancements in the **Volcano project**, a **Kubernetes-native batch scheduler** designed to optimize high-performance AI and machine learning (ML) training workloads. As AI, particularly large language models (LLMs), grows rapidly, the demand for sophisticated, efficient, and scalable infrastructure has intensified. The presentation addresses the evolving landscape of cloud-native AI, highlighting the need for advanced scheduling capabilities that bridge the gap between complex underlying hardware topologies and the simplicity data scientists require.

AI review

This talk provides a substantive deep dive into Volcano's latest advancements, specifically the HyperNode API and network topology-aware scheduling, which are crucial for optimizing distributed AI/ML workloads on Kubernetes. The project directly addresses real-world performance bottlenecks, resource utilization, and resiliency challenges with clever technical solutions like hierarchical queues and multi-layered retry policies. While the absence of a live demo is a notable omission, the speakers' deep expertise and the practical, detailed content make this a highly valuable session for anyone…
