Keynote: LLM-Aware Load Balancing in Kubernetes: A New Era of Efficiency
Clayton Coleman, Jiaxin Shan
KubeCon + CloudNativeCon Europe 2025 · Keynote
This keynote address by Clayton Coleman from Google and Jiaxin Shan from ByteDance introduces a groundbreaking project within the Kubernetes ecosystem: the **Gateway API Inference Extension**. Developed under the Kubernetes serving working group, this extension is designed to transform any standard Kubernetes gateway into an intelligent inference gateway, specifically optimized for hosting large language models (LLMs) in production environments. The talk highlights the unique challenges of serving LLMs efficiently at scale and presents a collaborative solution informed by the extensive experiences of both Google and ByteDance.
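As a rough illustration of the idea, an inference gateway groups model-serving pods into a pool and delegates endpoint selection to an extension that understands inference metrics. The manifest below is a hypothetical sketch only: the `InferencePool` kind, API group, and field names are illustrative placeholders, not the actual CRD schema shipped by the project.

```yaml
# Hypothetical sketch of an inference-pool resource.
# Kind, apiVersion, and all field names are illustrative assumptions,
# not the real Gateway API Inference Extension schema.
apiVersion: inference.example.io/v1alpha1
kind: InferencePool
metadata:
  name: llama-pool
spec:
  # Pods serving the model; the gateway load-balances across these
  # using live inference metrics (e.g., queue depth, KV-cache usage)
  # rather than plain round-robin.
  selector:
    app: llama-server
  targetPort: 8000
  # Endpoint-picker extension the gateway consults per request.
  extensionRef:
    name: endpoint-picker
```

A standard `HTTPRoute` would then reference such a pool as its backend, letting any conformant Kubernetes gateway route LLM traffic with model-aware scheduling.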
AI review
This keynote by Coleman and Shan introduces the Gateway API Inference Extension, a timely solution to the distinctive challenges of serving large language models on Kubernetes. Moving beyond the limitations of traditional load balancing, the project uses real-time GPU and serving metrics, along with techniques such as LoRA adapter serving, to enable denser, faster, and more automated LLM deployments. The talk provides concrete data from ByteDance's production environment, demonstrating significant cost savings and performance gains, and lays out a clear, actionable path for platform teams grappling with LLM workloads at scale.