Source to Sink: How to Improve LLM First-Party Vuln Discovery

Scott Behrens, Justice Cassel

[un]prompted 2026 — AI Security Practitioner Conference · Day 2

Netflix's security engineering team spent months iterating through architectures, benchmark failures, and demoralizing late nights to build an LLM-based vulnerability discovery system that actually works in production. Their conclusion: specialized agents beat generalist agents, individual vuln-class agents beat grouped ones, and rigorous evaluation infrastructure is the only thing that separates signal from noise as models keep changing.

AI review

Eighteen months of genuine iteration from Netflix security, reported honestly, including the demoralizing parts. Specialized agents per vulnerability class hitting 35/41 on their benchmark versus a monolithic super agent hitting 20/41 is a finding worth citing. The evaluation infrastructure they built, and are open-sourcing, is the most durable contribution.

Watch on YouTube