Tutorial: Build, Operate, and Use a Multi-Tenant AI Cluster Base...
C. Misale, O. Tardieu, D. Grove
KubeCon + CloudNativeCon Europe 2025 · Tutorial
In this KubeCon EU tutorial, Olivier Tardieu, Dave Grove, and Claudia Misale of IBM Research walked through building, operating, and using multi-tenant GPU clusters for AI and generative AI workloads on an open-source, Kubernetes-native stack. The talk covered the challenges organizations face along the way, from the initial procurement of expensive GPUs to sharing those resources efficiently and fairly across teams and projects.
AI review
This tutorial from IBM Research delivers a no-nonsense, deeply technical blueprint for one of the most expensive and frustrating problems in modern AI infrastructure: efficiently sharing and managing multi-tenant GPU clusters on Kubernetes. The presenters describe an open-source stack that doesn't just promise high utilization and fault tolerance, but demonstrates both with concrete examples and custom-built components. This is real engineering solving a real problem, not just another marketing deck.