The Hard Part Isn't Building the Agent: On Measuring Agent Effectiveness to Improve It
Joshua Saxe
[un]prompted 2026 — AI Security Practitioner Conference · Day 1
Joshua Saxe makes a counterintuitive argument: the biggest blocker to deploying autonomous AI security agents isn't building them, it's evaluating them. Classical ML metrics like precision, recall, and F-score fail in cybersecurity because the ground truth is structurally noisy. The solution is to treat AI agents the way you'd evaluate a human security engineer: assess the quality of their reasoning process, not just their binary outputs.

---
AI review
Saxe identified the right problem and named it precisely: you cannot measure whether your AI security system is improving if your ground truth is structurally noisy, and 1-3% label noise acts as a ceiling on measurable performance, not an edge case you can ignore. The rubric-based evaluation framework is practical, and the claim that evaluation deserves 50% of engineering time is the most honest thing I've heard from this space in months.
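The ceiling effect is easy to demonstrate with a small simulation (this sketch is illustrative, not from the talk): when a fraction of ground-truth labels are silently wrong, a model's measured accuracy cannot exceed roughly one minus the noise rate, so even a perfect detector looks imperfect and gains beyond the noise floor become invisible. The function name and parameters below are hypothetical.

```python
import random

random.seed(0)

def measured_accuracy(true_accuracy, label_noise, n=100_000):
    """Score a model against ground truth in which a fraction
    `label_noise` of the labels are silently flipped.

    We can only observe agreement with the recorded label, so a
    correct prediction scored against a flipped label counts as
    an error, and vice versa.
    """
    agree = 0
    for _ in range(n):
        model_right = random.random() < true_accuracy
        label_flipped = random.random() < label_noise
        if model_right != label_flipped:
            agree += 1
    return agree / n

# A perfect model, evaluated through 2% label noise, tops out
# near 98% measured accuracy -- the noise is the ceiling.
perfect = measured_accuracy(1.00, 0.02)

# A model with a true 1% error rate measures only slightly lower,
# so the real improvement from 99% to 100% is nearly invisible.
very_good = measured_accuracy(0.99, 0.02)

print(perfect, very_good)
```

In expectation, measured accuracy is `p*(1-e) + (1-p)*e` for true accuracy `p` and noise rate `e`, which is why a 1-3% noise floor swamps exactly the incremental improvements a maturing detection system is trying to demonstrate.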