Optimizing Model Serving on Kubernetes With Model Streaming - Ekin Karabulut & Ronen Dar, Run:ai
Ekin Karabulut, Ronen Dar, Run:ai
KubeCon + CloudNativeCon Europe 2025 · Session
Efficiently serving large language models (LLMs) and other deep learning models presents significant challenges, particularly in dynamic, cloud-native environments orchestrated by Kubernetes. This talk, delivered by Ronen Dar and Ekin Karabulut of Run:ai (now part of NVIDIA), addresses a critical bottleneck: the "cold start" problem of loading massive model weights onto graphics processing units (GPUs). As models continue to grow in size, provisioning and preparing an inference replica can take many minutes, driving up operational costs, leaving GPUs underutilized, and degrading user experience through high latency.
AI review
This talk presents a highly relevant and well-engineered solution to a critical problem in modern AI deployments: the "cold start" bottleneck of loading massive model weights onto GPUs. The Run:ai Model Streamer, an open-source project, reads weights concurrently from storage and streams them directly to GPU memory, demonstrating significant performance gains. While it originates from a vendor, the technical depth, open-source contribution, and clear benchmarking make this a valuable session for anyone grappling with large language model serving at scale.
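To make the approach concrete, below is a minimal sketch of what loading safetensors weights with the Run:ai Model Streamer's Python package can look like. The package, class, and method names (runai_model_streamer, SafetensorsStreamer, stream_file, get_tensors), as well as the file path, are assumptions based on the project's public documentation and are not taken from the talk itself; check the open-source repository for the current API before relying on this.

```python
# Hedged sketch: stream a safetensors file with the Run:ai Model Streamer so that
# tensors become usable while the rest of the file is still being read.
import torch
from runai_model_streamer import SafetensorsStreamer  # assumed package/class names

# Hypothetical path to one shard of a large model's weights.
FILE_PATH = "/models/llama-70b/model-00001-of-00015.safetensors"

with SafetensorsStreamer() as streamer:
    # Start concurrent reads of the file's tensors from storage.
    streamer.stream_file(FILE_PATH)

    # Tensors are yielded as their data arrives, so host-to-GPU copies can
    # overlap with the remaining reads instead of waiting for the whole file.
    for name, tensor in streamer.get_tensors():
        tensor.to("cuda:0", non_blocking=True)
```

In a serving stack, this loading step would typically be wired into the inference server's model-initialization path (for example, an engine that supports pluggable weight loaders), which is where the cold-start savings discussed in the session would materialize.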