Building & Operating a Large-scale HPC AI Cluster on Kubernetes - Kalyan Saladi & Chandan Avdhut
Kalyan Saladi, Chandan Avdhut
KubeCon + CloudNativeCon Europe 2025 · Session
This KubeCon EU talk, presented by Chandan Avdhut and Kalyan Saladi from Meta, covers how to build and operate a high-performance computing (HPC) AI cluster at massive scale in a public cloud environment using **Kubernetes**. The speakers describe Meta's move from custom on-premise infrastructure to a cloud-native, multi-provider solution, the challenges specific to large-scale machine learning (ML) training workloads, and how Kubernetes was adapted to meet them. The talk focuses on three goals: maintaining near bare-metal performance, ensuring reliability for long-running, fault-intolerant jobs, and preserving an uncompromised researcher experience.
AI review
This talk from Meta is a candid, technically deep look at the operational complexities of building and running a massive HPC AI cluster on Kubernetes in a multi-cloud environment. It details the challenges of supporting long-running, fault-intolerant ML training jobs while maintaining near bare-metal performance and an uncompromised researcher experience. The speakers lay out custom-engineered solutions for everything from Slurm abstraction and dynamic storage provisioning to node consistency and comprehensive observability, making it an invaluable…