More Nodes, More Problems: Solving Multi-Host GPU/TPU Scheduli...
John Belamaric, Morten Torkildsen
KubeCon + CloudNativeCon Europe 2025 · Session
AI and machine learning workloads are growing rapidly in scale and complexity, often demanding large pools of specialized accelerators such as GPUs and TPUs. Managing these multi-host accelerator resources in Kubernetes, however, presents significant challenges for developers and cluster operators alike. In this talk, John Belamaric and Morten Torkildsen from Google examine these problems, highlight the limitations of current Kubernetes scheduling mechanisms, and present **Dynamic Resource Allocation (DRA)** as a robust, forward-looking solution.
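As a rough illustration of the DRA model the talk covers, a workload requests devices through a `ResourceClaimTemplate` rather than opaque `nvidia.com/gpu`-style counts, and the scheduler allocates matching devices before placing the pod. This is a minimal sketch using the upstream `resource.k8s.io` API; the device class name and images are hypothetical, and the exact API version (`v1beta1` here) depends on your cluster's Kubernetes release:

```yaml
# Hypothetical sketch: claim four devices of an assumed device class,
# then reference the claim from a pod. Names are illustrative only.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: multi-gpu-claim
spec:
  spec:
    devices:
      requests:
      - name: gpus
        deviceClassName: gpu.example.com   # assumed class published by a DRA driver
        count: 4
---
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  containers:
  - name: trainer
    image: example.com/trainer:latest      # placeholder image
    resources:
      claims:
      - name: gpus                         # binds the container to the claim below
  resourceClaims:
  - name: gpus
    resourceClaimTemplateName: multi-gpu-claim
```

Because the claim is a first-class API object, drivers can express richer constraints (e.g. selecting co-located devices) than the classic extended-resource counters allow, which is the gap the multi-host scheduling discussion targets.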
AI review
This talk presents Dynamic Resource Allocation (DRA) as a much-needed evolution in Kubernetes for managing complex, multi-host GPU/TPU workloads. It directly tackles critical issues such as compact placement, atomic partitioning, and, most importantly, automated failure recovery, which has long been an operational blind spot. The speakers, core contributors to DRA, lay out a clear, technically sound vision for how Kubernetes is catching up to the demands of large-scale AI/ML infrastructure, making this session a strong recommendation for anyone facing these challenges.