Production-Ready LLMs on Kubernetes: Patterns, Pitfalls, and Performa...
Priya Samuel, Luke Marsden
KubeCon + CloudNativeCon Europe 2025 · Session
As the AI landscape evolves at a rapid pace, organizations increasingly seek robust, secure, and cost-effective ways to deploy large language models (LLMs) on their own infrastructure. In this KubeCon EU talk, Priya Samuel and Luke Marsden address that need, guiding attendees through building production-ready LLM platforms with Kubernetes and open-source models. Drawing on extensive hands-on experience, the speakers share patterns, common pitfalls, and hard-won lessons, demystifying the perceived complexity of self-hosting advanced AI.
AI review
This talk delivers a candid, deeply technical roadmap for deploying production-ready LLMs on Kubernetes. It dissects common pitfalls, from Ollama's limitations to Docker's poor compression of massive model weights, and offers concrete, hard-won solutions such as vLLM optimization, custom Docker patching, and strategic GPU memory sharing. The speakers go beyond deployment, detailing sophisticated application layers with multi-step API integrations, advanced RAG, and even multimodal Vision RAG, all within a robust CI/CD pipeline and a proposed AI spec framework. This isn't just theory…