Optimizing Metrics Collection & Serving When Autoscaling LLM Workloads - Vincent Hou & Jiří Kremser
Vincent Hou, Jiří Kremser
KubeCon + CloudNativeCon Europe 2025 · Session
This talk, delivered by Vincent Hou from Bloomberg and Jiří Kremser from Kedify, addresses the challenge of efficiently autoscaling Large Language Model (LLM) workloads in Kubernetes environments. Because LLM inference shifts resource consumption predominantly onto GPUs rather than CPUs, existing autoscaling mechanisms often fall short. The speakers examine why conventional signals such as CPU utilization, memory, or requests per second (RPS) are poor proxies for managing GPU-intensive LLM inference.
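The metric-collection side of the approach can be sketched with an OpenTelemetry Collector pipeline that scrapes an inference server's Prometheus endpoint and forwards the metrics. This is a minimal sketch, assuming vLLM as the inference server (which exposes GPU-specific metrics such as `vllm:gpu_cache_usage_perc` and `vllm:num_requests_waiting` on `/metrics`); the Service names, ports, and export endpoint are placeholders, not from the talk.

```yaml
# Hypothetical OpenTelemetry Collector config: scrape an LLM server's
# Prometheus metrics and export them over OTLP for downstream consumers.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: vllm
          scrape_interval: 5s
          static_configs:
            - targets: ["vllm-inference:8000"]   # placeholder Service:port
exporters:
  otlp:
    endpoint: metrics-sink.monitoring.svc:4317   # placeholder OTLP endpoint
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```

The key design point is that the scaling signal comes from the model server itself rather than from node-level CPU or memory counters, which stay nearly flat while the GPU saturates.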
AI review
This talk presented an effective open-source architecture for autoscaling GPU-intensive LLM workloads on Kubernetes. By combining OpenTelemetry for custom metric collection, KEDA for event-driven scaling, and a novel OpenTelemetry add-on for KEDA, the speakers demonstrated how to dynamically adjust both LLM inference pods and the underlying GPU nodes based on LLM-specific metrics such as KV cache utilization and waiting-queue length. The result is a robust, cost-efficient, and actionable solution to a pressing modern infrastructure challenge.
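The scaling side of this architecture can be sketched as a KEDA `ScaledObject` driven by one of the LLM-specific metrics above. This is a hedged sketch, not the speakers' exact manifest: it assumes the Kedify OpenTelemetry add-on (or a comparable external scaler) is installed and reachable at the address shown, and the workload name, thresholds, and exact metadata field names are illustrative placeholders that may differ across add-on versions.

```yaml
# Hypothetical KEDA ScaledObject: scale inference pods on the number of
# requests waiting in the server's queue, via an external OTel-backed scaler.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-inference           # placeholder Deployment name
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: external
      metadata:
        scalerAddress: keda-otel-scaler.keda.svc:4318   # placeholder address
        metricQuery: avg(vllm:num_requests_waiting)     # LLM-specific signal
        targetValue: "4"           # scale out when >4 requests are queued
```

Scaling on queue length (or KV cache percentage) lets KEDA add replicas before latency degrades, and a cluster autoscaler can then provision GPU nodes to host them, covering both layers mentioned in the talk.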