Rethinking how we evaluate security agents for real-world use
Mudita Khurana
[un]prompted 2026 — AI Security Practitioner Conference · Day 2
An 80% benchmark score on a security agent tells you almost nothing useful. Mudita Khurana's lightning talk introduces CLASP, a capability-centric evaluation framework that shifts the question from "did the agent succeed?" to "how did the agent succeed?" — enabling meaningful debugging, targeted improvement, and reliable deployment decisions.
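To make the proposed shift concrete, here is a minimal hypothetical sketch contrasting outcome-only scoring with capability-level scoring of an agent trace. The capability names, trace fields, and functions below are illustrative assumptions for this summary, not CLASP's actual rubric.

```python
# Hypothetical illustration: two ways to score the same agent trace.
# Field names and capabilities are assumed for illustration only.

def outcome_only(trace):
    """Outcome-only evaluation: a single pass/fail bit per task."""
    return trace["exploit_succeeded"]

def capability_scores(trace):
    """Capability-centric evaluation: how did the agent succeed?"""
    return {
        "recon": 1.0 if trace["enumerated_endpoints"] else 0.0,
        "vuln_identification": 1.0 if trace["found_injectable_param"] else 0.0,
        "exploitation": 1.0 if trace["exploit_succeeded"] else 0.0,
        "verification": 1.0 if trace["validated_output"] else 0.0,
    }

# Two traces with the same outcome but very different capability profiles:
lucky = {"enumerated_endpoints": False, "found_injectable_param": False,
         "exploit_succeeded": True, "validated_output": False}
systematic = {"enumerated_endpoints": True, "found_injectable_param": True,
              "exploit_succeeded": True, "validated_output": True}

# Outcome-only scoring cannot tell these runs apart...
assert outcome_only(lucky) == outcome_only(systematic)
# ...while capability scoring exposes the difference.
print(capability_scores(lucky))
print(capability_scores(systematic))
```

The point of the sketch is only that an identical "success" outcome can hide very different underlying behavior, which is the failure mode of outcome-only benchmarks the talk targets.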
AI review
CLASP is a well-structured framework for a real problem — outcome-only evaluation is genuinely broken for security agents — and the SQL injection trace example makes the failure mode concrete. But as a lightning talk without empirical validation, it remains an interesting proposal; demonstrating the framework's value would take a full paper.