From High Performance Computing To AI Workloads on Kubernetes: M... Andrey Velichkevich & Yuki Iwai
Andrey Velichkevich, Yuki Iwai
KubeCon + CloudNativeCon Europe 2025 · Session
This talk, "From High Performance Computing To AI Workloads on Kubernetes," presented by Andrey Velichkevich and Yuki Iwai, addresses a central challenge in machine learning operations: hiding the complexity of Kubernetes and distributed training from data scientists. The core problem is the "infrastructure tax" that data scientists pay to scale their machine learning code: they must deal with Docker, compute configuration, data access, and Kubernetes APIs instead of focusing on model development. The talk introduces **Kubeflow Trainer V2**, a project designed to simplify this process, with a particular focus on its new **MPI (Message Passing Interface) runtime**.
AI review
This talk is a technical deep dive into Kubeflow Trainer V2 and its new MPI runtime. It addresses the "infrastructure tax" on data scientists by abstracting away the complexity of distributed training and Kubernetes orchestration. The speakers, evidently deeply involved in the project, demonstrate a solution that lets native ML frameworks like MLX and DeepSpeed scale across multi-node, multi-GPU clusters through a simple Python SDK, bridging HPC paradigms and cloud-native AI workloads. This is a significant…