Yes You Can Run LLMs on Kubernetes - Abdel Sghiouar & Mofi Rahman, Google Cloud

Abdel Sghiouar, Mofi Rahman, Google Cloud

KubeCon + CloudNativeCon Europe 2025 · Session

In this KubeCon EU talk, Google Cloud's Abdel Sghiouar and Mofi Rahman tackle a pressing question for many organizations: how to effectively deploy and manage Large Language Models (LLMs) on Kubernetes. Far from being a mere theoretical exercise, the session provides a comprehensive guide, complete with live demonstrations, on leveraging Kubernetes' robust orchestration capabilities for the unique demands of LLM inference. The speakers argue that while cloud-hosted LLM services offer convenience, Kubernetes occupies a powerful middle ground between fully managed APIs and self-managed infrastructure: granular control over the stack combined with the scalability and flexibility of a managed platform.
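To make the "LLMs as a regular Kubernetes workload" idea concrete, a minimal sketch of what such a deployment could look like is shown below: a Deployment that schedules an inference server onto a GPU node. The image tag, model name, and resource sizes are assumptions for illustration, not values from the talk.

```yaml
# Sketch: serving an open-weights model with vLLM on a single GPU.
# Image tag, model name, and resource sizes are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest                          # assumed image
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]   # assumed model
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "4"
            memory: "32Gi"
          limits:
            nvidia.com/gpu: 1                                   # one accelerator per replica
            memory: "32Gi"
```

A Service or Gateway in front of this Deployment would then expose the OpenAI-compatible endpoint to clients, which is the pattern the talk treats as "the new web application."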

AI review

This talk by Sghiouar and Rahman delivers a much-needed, no-nonsense guide to deploying Large Language Models on Kubernetes. Eschewing marketing fluff, they dive into the nitty-gritty of resource management, GPU allocation, and advanced techniques like model sharding with the nascent LeaderWorkerSet (LWS) API. The live demonstrations, especially the multi-node, multi-accelerator setup for the DeepSeek R1 model, prove that treating LLMs as 'the new web application' on Kubernetes is not just theoretical but demonstrably achievable, providing critical, actionable insights for platform…
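For readers unfamiliar with LWS, a minimal sketch of a LeaderWorkerSet manifest follows. It assumes the LeaderWorkerSet controller (leaderworkerset.x-k8s.io) is installed in the cluster; the image, model, parallelism flags, and group size are illustrative and not taken from the demo.

```yaml
# Sketch: sharding one model across a leader pod and a worker pod with LWS.
# Assumes the LeaderWorkerSet CRD/controller is installed; values are illustrative.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sharded-llm
spec:
  replicas: 1                      # one serving group
  leaderWorkerTemplate:
    size: 2                        # pods per group: 1 leader + 1 worker
    leaderTemplate:
      spec:
        containers:
        - name: leader
          image: vllm/vllm-openai:latest                 # assumed image
          args: ["--model", "deepseek-ai/DeepSeek-R1",   # assumed model and flag
                 "--tensor-parallel-size", "8"]
          resources:
            limits:
              nvidia.com/gpu: 8                          # accelerators on the leader node
    workerTemplate:
      spec:
        containers:
        - name: worker
          image: vllm/vllm-openai:latest                 # assumed image
          resources:
            limits:
              nvidia.com/gpu: 8                          # accelerators on each worker node
```

The point of the API, as the talk frames it, is that the leader and its workers are created, scaled, and restarted as a single unit, which is what a model too large for one node actually needs.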

Watch on YouTube