Transparent, Infra-Level Checkpoint and Restore for Resil...
Ganeshkumar Ashokavardhanan, Bernie Wu
KubeCon + CloudNativeCon Europe 2025 · Session
In this talk, Ganeshkumar Ashokavardhanan of Microsoft's Azure Kubernetes Service (AKS) team and Bernie Wu of MemVerge address a critical challenge for large-scale AI/ML workloads running on Kubernetes: staying resilient to infrastructure failures while improving resource utilization. They introduce and define **infra-level transparent checkpointing** as an approach to both problems. The core idea is to capture the entire state of a running application, including memory, CPU state, and associated files, without any modifications to the application's code or framework, so that a workload can be migrated or hot-restarted from the saved state.
AI review
This talk presents a genuinely groundbreaking approach to AI/ML workload resilience on Kubernetes using infra-level transparent checkpointing. By leveraging and extending CRIU to capture full application state, including GPU memory, without code changes, the approach directly addresses critical pain points such as high GPU error rates, poor utilization, and slow recovery times. The speakers demonstrate automated, event-driven migration and hot restarts that promise substantial cost savings and operational stability for anyone running serious ML infrastructure.
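To make the idea of transparent checkpointing more concrete, here is a minimal sketch of checkpointing and restoring an ordinary CPU-only Linux process with the stock CRIU command line, driven from Python. It is an illustration under stated assumptions, not the speakers' implementation: the helper names (`dump_cmd`, `restore_cmd`, `checkpoint`, `restore`) are hypothetical, the sketch assumes `criu` is installed and run with sufficient privileges, and it does not cover GPU memory or the Kubernetes-level automation described in the talk.

```python
from __future__ import annotations

import os
import subprocess


def dump_cmd(pid: int, images_dir: str, leave_running: bool = False) -> list[str]:
    """Build a `criu dump` command for the process tree rooted at `pid`.

    Hypothetical helper for illustration; only standard CRIU options are used.
    """
    cmd = [
        "criu", "dump",
        "--tree", str(pid),          # root of the process tree to checkpoint
        "--images-dir", images_dir,  # where CRIU writes its image files
        "--shell-job",               # needed when the process is attached to a terminal
    ]
    if leave_running:
        # Keep the original process alive after the dump instead of killing it.
        cmd.append("--leave-running")
    return cmd


def restore_cmd(images_dir: str) -> list[str]:
    """Build a `criu restore` command that reads images from `images_dir`."""
    return ["criu", "restore", "--images-dir", images_dir, "--shell-job"]


def checkpoint(pid: int, images_dir: str) -> None:
    """Checkpoint a running process; the application itself is not modified."""
    os.makedirs(images_dir, exist_ok=True)
    subprocess.run(dump_cmd(pid, images_dir), check=True)


def restore(images_dir: str) -> None:
    """Restore the process tree from a previous checkpoint."""
    subprocess.run(restore_cmd(images_dir), check=True)
```

Calling `checkpoint(pid, "/tmp/ckpt")` followed later by `restore("/tmp/ckpt")` would resume the process from its saved state, assuming CRIU supports everything the process has open. The command construction is kept separate from execution so the commands can be inspected or logged without running CRIU.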