Efficient Transparent Checkpointing of AI/ML Workloads in Kubernetes
R. Stoyanov, A. Reber, V. Spišáková
KubeCon + CloudNativeCon Europe 2025 · Session
This talk addresses two critical challenges facing modern cloud-native infrastructures that support Artificial Intelligence and Machine Learning (AI/ML) workloads: inefficient utilization of expensive GPU resources, and the lack of robust fault tolerance for long-running jobs. Presented by R. Stoyanov, A. Reber, and V. Spišáková, the session introduces **transparent GPU checkpointing** as a versatile solution to both problems within Kubernetes environments. The speakers show how this approach, developed in collaboration with engineers from Nvidia and AMD, can significantly improve resource efficiency and provide essential resilience for GPU-accelerated applications, ranging from interactive Jupyter notebooks to multi-day batch jobs.
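For context on where such checkpointing plugs into Kubernetes today: the kubelet already exposes a CRIU-backed `checkpoint` endpoint (behind the `ContainerCheckpoint` feature gate). The sketch below only builds and prints the request URL; the pod and container names are hypothetical placeholders, and the actual `curl` call (which requires kubelet client credentials) is shown as a comment. The GPU-specific driver integration discussed in the talk layers on top of mechanisms like this.

```shell
# Hedged sketch: targeting the kubelet's container checkpoint endpoint.
# NODE, NAMESPACE, POD, and CONTAINER are placeholder values.
NODE=localhost
NAMESPACE=default
POD=jupyter-notebook
CONTAINER=notebook

URL="https://${NODE}:10250/checkpoint/${NAMESPACE}/${POD}/${CONTAINER}"
echo "POST ${URL}"

# Against a real cluster node you would issue the request with kubelet
# client credentials, for example:
#   curl -sk --cert client.crt --key client.key -X POST "${URL}"
# On success, the kubelet asks the container runtime (via CRIU) to write
# a checkpoint archive under /var/lib/kubelet/checkpoints/ on the node.
```

The resulting archive can later be turned into an OCI image and restored as a new container, which is the basis for the migration and hot-swap scenarios the session demonstrates.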
AI review
This KubeCon talk is a must-see for anyone serious about operating AI/ML workloads at scale. It tackles the critical problems of GPU underutilization and lack of fault tolerance with an incredibly elegant and transparent solution: unified CPU/GPU checkpointing. The research is deeply technical, leveraging native GPU driver capabilities and integrating seamlessly into container runtimes, making it a practical game-changer for resource efficiency and resilience. The live demos were compelling, showcasing real-world benefits like hot-swapping models and stateful LLM migration, with the added…