Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Fangming Liu
Conference on Machine Learning and Systems 2025 · Day 2 · Session 2: Parallel and Distributed Systems
Training large models, particularly **Large Language Models (LLMs)** built on **transformer architectures**, demands immense computational resources. This talk introduces **Rubick**, a deep learning cluster scheduler that improves training performance and resource efficiency by exploiting the inherent reconfigurability of deep learning jobs. Schedulers traditionally treat deep learning jobs as "black boxes": a job's execution plan and resource requirements are fixed at launch time. In dynamic, shared GPU clusters, where available resources fluctuate unpredictably, this static approach creates a persistent mismatch between what a job was configured to use and what the cluster can actually offer. Rubick instead takes a "white-box" approach: the scheduler is given transparent access to each job's execution plan and multi-dimensional resource needs, and can dynamically adjust both at runtime.
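The talk does not show Rubick's actual interfaces, but a minimal sketch can make the white-box idea concrete. Everything below is hypothetical: the names (`ExecutionPlan`, `ReconfigurableJob`, `best_plan`) and the throughput numbers are illustrative stand-ins, not the paper's API or measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionPlan:
    """One candidate way to run a job: a parallelism strategy plus the
    multi-dimensional resources it needs and the throughput it yields."""
    name: str          # e.g. "tp4_pp2" (hypothetical plan label)
    gpus: int          # GPU count the plan requires
    gpu_mem_gb: int    # per-GPU memory footprint
    host_mem_gb: int   # host memory, e.g. for optimizer offload
    throughput: float  # samples/sec under this plan (made-up numbers below)


class ReconfigurableJob:
    """A 'white-box' job: rather than one fixed configuration, it exposes
    every plan it can run under, so the scheduler can switch among them."""

    def __init__(self, name, plans):
        self.name = name
        self.plans = plans

    def best_plan(self, free_gpus, free_gpu_mem_gb, free_host_mem_gb):
        """Pick the highest-throughput plan that fits the currently
        available resources; None means nothing fits and the job waits."""
        feasible = [
            p for p in self.plans
            if p.gpus <= free_gpus
            and p.gpu_mem_gb <= free_gpu_mem_gb
            and p.host_mem_gb <= free_host_mem_gb
        ]
        return max(feasible, key=lambda p: p.throughput, default=None)


# As GPUs free up or get reclaimed, the scheduler moves the job along its
# resource-sensitivity curve instead of killing or queueing it outright.
job = ReconfigurableJob("llm-pretrain", [
    ExecutionPlan("tp2_zero3_offload", gpus=2,  gpu_mem_gb=40, host_mem_gb=512, throughput=35.0),
    ExecutionPlan("tp4_pp2",           gpus=8,  gpu_mem_gb=32, host_mem_gb=64,  throughput=190.0),
    ExecutionPlan("dp16_zero1",        gpus=16, gpu_mem_gb=38, host_mem_gb=64,  throughput=310.0),
])

for free in (16, 8, 2):
    plan = job.best_plan(free_gpus=free, free_gpu_mem_gb=40, free_host_mem_gb=512)
    print(f"{free:>2} free GPUs -> {plan.name if plan else 'queue'}")
```

In this toy version the `throughput` field is a hardcoded stand-in for the per-plan performance prediction that the review below identifies as the paper's core engineering contribution; in Rubick that value would come from a model, not a table.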
AI review
Rubick is a genuinely interesting systems paper: the 'white-box' scheduler framing and the resource sensitivity curve concept are clean ideas that address a real problem in multi-tenant GPU clusters. But the article as written is a summary of a research paper, not a talk review, and it consistently substitutes description for implementation detail. The performance model is the core engineering contribution, yet we get just enough to understand its shape and not enough to reproduce or extend it. Worth reading if you manage GPU clusters at scale; not a must-watch for most engineers building…