Taking Care of Your Control Plane With API Priority and Fairness an...
Matteo Ruina, Ayaz Badouraly
KubeCon + CloudNativeCon Europe 2025 · Session
In this talk, Ayaz Badouraly and Matteo Ruina, software engineers at Datadog, examine the challenges of keeping the Kubernetes control plane stable and reliable in large-scale, multi-tenant environments. Datadog runs hundreds of thousands of pods across tens of thousands of nodes in dozens of Kubernetes clusters and manages its own control planes, which gives the speakers direct insight into potential failure modes and effective mitigation strategies. They show how a single misbehaving user or application can disproportionately impact the entire cluster and cause widespread outages.
AI review
This talk from Datadog engineers is a masterclass in Kubernetes control plane hardening for large-scale, multi-tenant environments. It goes deep into the practical application and advanced tuning of API Priority and Fairness (APF) and resource quotas, sharing battle-tested strategies and novel workarounds for critical, often overlooked stability issues. This isn't theoretical fluff; it's a no-bullshit exposition of hard-won lessons from the trenches, complete with real-world incidents and clever engineering solutions that genuinely advance the state of the art in Kubernetes operations.