Balancing Cost and Efficiency: Day2 Optimization of Multi-Cluster AI Infrastructure - Kevin Wang

Kevin Wang

KubeCon + CloudNativeCon Europe 2025 · Session

Artificial Intelligence (AI) and Machine Learning (ML) workloads are increasingly deployed and managed across distributed, multi-cluster Kubernetes environments. In this talk, Kevin Wang examines the challenges of "Day2 optimization" in such infrastructures and the solutions available, focusing on balancing cost, efficiency, and reliability. Wang, an early Kubernetes contributor and key figure in the **Volcano** and **Karmada** projects, outlines how these CNCF projects are evolving to meet the demands of enterprise-scale AI.

AI review

Wang delivers a technical, practical deep dive into Day2 optimization for multi-cluster AI infrastructure using Karmada and Volcano. The talk outlines the challenges of federated scheduling, failover, and workload queuing for AI workloads, and presents concrete solutions such as the Scheduler Estimator, the `preferredNoExecute` taint, and Volcano Global. It addresses real pain points for anyone operating large-scale AI/ML platforms on Kubernetes and offers actionable strategies to improve efficiency, resilience, and cost-effectiveness.
