The GPUs on the Bus Go ‘Round and ‘Round - Natalie Bandel & Ryan Hallisey, NVIDIA

Natalie Bandel, Ryan Hallisey, NVIDIA

KubeCon + CloudNativeCon Europe 2025 · Session

Keeping specialized hardware such as Graphics Processing Units (GPUs) reliable is a significant operational challenge in cloud infrastructure. In this KubeCon EU talk, "The GPUs on the Bus Go ‘Round and ‘Round," NVIDIA's Natalie Bandel and Ryan Hallisey examine GPU failures in massive Kubernetes clusters. Drawing on their experience operating the GeForce Now cloud gaming platform, the speakers cover the causes, detection, and automated remediation of these failures, with the goal of keeping tens of thousands of GPUs available and performing well.

AI review

This talk from NVIDIA engineers Natalie Bandel and Ryan Hallisey gives a candid, data-backed look at how persistent GPU failures are at massive scale in Kubernetes clusters. Drawing on their GeForce Now operational experience, they detail automated detection and multi-stage remediation, from custom problem detectors to node rebuilds. The session examines bottlenecks such as node draining and proposes a Kubernetes working group for standardized solutions, making it highly relevant for anyone operating GPU-accelerated…
