A Practical Guide To Benchmarking AI and GPU Workloads in Kubernetes - Yuan Chen & Chen Wang
KubeCon + CloudNativeCon Europe 2025 · Session
This talk, presented by Yuan Chen from Nvidia and Chen Wang from IBM Research, offers a comprehensive guide to benchmarking AI and GPU-intensive workloads within Kubernetes environments. The session dives into practical methodologies and tools essential for understanding, optimizing, and ensuring the efficient operation of modern AI inference and generative AI (GenAI) applications. It addresses the critical need for robust benchmarking in the rapidly evolving landscape of AI, where performance, scalability, and resource utilization are paramount.
AI review
This session delivers a highly practical and technically sound guide to benchmarking AI and GPU workloads within Kubernetes. The speakers, drawing on experience at Nvidia and IBM Research, introduce the Nvidia Triton Inference Server for general AI inference and, more notably, the FMPERF Python library for automated, Kubernetes-native benchmarking of Large Language Models (LLMs). The focus on LLM-specific metrics and reproducible methodologies makes this a valuable resource for anyone tasked with optimizing GenAI deployments at scale, delivering actionable insights and tools rather than abstract theory.
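To make the LLM-specific metrics mentioned above concrete, the sketch below computes three measures commonly reported by LLM benchmarking tools: time to first token (TTFT), time per output token (TPOT), and overall token throughput. This is a minimal, self-contained illustration of the metric definitions, not FMPERF's actual API; the `summarize_llm_benchmark` function and its input format are hypothetical.

```python
import statistics


def summarize_llm_benchmark(requests):
    """Summarize per-request token timings into common LLM serving metrics.

    Each request is a dict with:
      start       - wall-clock time the request was sent (seconds)
      token_times - wall-clock arrival time of each output token (seconds)
    """
    ttfts = []  # time to first token, per request
    tpots = []  # average gap between subsequent tokens (decode speed)
    total_tokens = 0

    for r in requests:
        token_times = r["token_times"]
        ttfts.append(token_times[0] - r["start"])
        if len(token_times) > 1:
            decode_span = token_times[-1] - token_times[0]
            tpots.append(decode_span / (len(token_times) - 1))
        total_tokens += len(token_times)

    # Throughput over the whole run: tokens emitted / benchmark wall time.
    span = (max(r["token_times"][-1] for r in requests)
            - min(r["start"] for r in requests))
    return {
        "mean_ttft_s": statistics.mean(ttfts),
        "mean_tpot_s": statistics.mean(tpots) if tpots else None,
        "throughput_tok_per_s": total_tokens / span,
    }


# Synthetic example: two streamed requests with per-token timestamps.
runs = [
    {"start": 0.0, "token_times": [0.2, 0.3, 0.4, 0.5]},
    {"start": 0.1, "token_times": [0.4, 0.5, 0.6]},
]
metrics = summarize_llm_benchmark(runs)
print(metrics)
```

In a real Kubernetes benchmark these timestamps would come from a streaming inference client hitting the serving endpoint; the point here is only how the raw timings reduce to the headline numbers an operator compares across configurations.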