AI Workload Preemption in a Multi-Cluster Scheduling System at Bloomberg - Leon Zhou & Wei-Cheng Lai
Leon Zhou, Wei-Cheng Lai
KubeCon + CloudNativeCon Europe 2025 · Session
This talk, presented by Wei-Cheng Lai and Leon Zhou of Bloomberg, covers how the company manages and prioritizes thousands of AI training jobs across many Kubernetes clusters. The focus is **AI workload preemption** within a multi-cluster scheduling system built on the open-source orchestration tool **Karmada**. The speakers describe how Bloomberg built a highly available data science platform that scales to heavy computational demand while ensuring that critical AI workloads get the resources they need promptly.
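As a rough illustration of what priority-based preemption means in general, the sketch below picks lower-priority jobs to evict so that an incoming job fits. It is not Bloomberg's implementation and does not use the Karmada API: the `Job` type, the `pick_victims` function, the GPU-count resource model, and the greedy heuristic are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; the fields are assumptions for this sketch."""
    name: str
    priority: int  # larger value = more important
    gpus: int      # GPUs the job holds (running) or requests (incoming)


def pick_victims(running, incoming, free_gpus):
    """Choose lower-priority jobs to preempt so that `incoming` fits.

    Returns a list of victims (empty if no preemption is needed), or
    None if evicting every lower-priority job would still not free
    enough GPUs.
    """
    needed = incoming.gpus - free_gpus
    if needed <= 0:
        return []  # enough free capacity already

    # Only strictly lower-priority jobs may be preempted. Evict the
    # lowest priority first; within one priority level, prefer larger
    # jobs so that fewer jobs are disturbed (a greedy heuristic, not
    # a guaranteed minimum).
    candidates = sorted(
        (j for j in running if j.priority < incoming.priority),
        key=lambda j: (j.priority, -j.gpus),
    )

    victims, reclaimed = [], 0
    for job in candidates:
        if reclaimed >= needed:
            break
        victims.append(job)
        reclaimed += job.gpus

    return victims if reclaimed >= needed else None
```

For example, with running jobs of priority 1 holding 2 and 4 GPUs, one free GPU, and an incoming priority-3 job requesting 4 GPUs, the sketch evicts only the 4-GPU job. A production scheduler would also have to account for cluster placement, gang scheduling, and checkpointing, none of which is modeled here.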
AI review
This talk from Bloomberg dissects a critical problem: managing high-priority AI workloads across large, multi-cluster Kubernetes environments. By leveraging and extending Karmada, the team presents a priority-based scheduling and preemption system. The detailed architectural choices, particularly the emphasis on minimal disruption during preemption and the enforcement of a single scheduler for stability, offer a robust blueprint for any organization grappling with resource contention for critical ML infrastructure. It's a real-world solution to a real-world problem, executed with technical…