Rethinking how we evaluate security agents for real-world use
Mudita Khurana
[un]prompted 2026 — AI Security Practitioner Conference · Day 2
An 80% benchmark score on a security agent tells you almost nothing useful. Mudita Khurana's lightning talk introduces CLASP, a capability-centric evaluation framework that shifts the question from "did the agent succeed?" to "how did the agent succeed?" — enabling meaningful debugging, targeted improvement, and reliable deployment decisions.
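To make the proposed shift concrete, here is a minimal hypothetical sketch contrasting outcome-only scoring with capability-level scoring of an agent trace. The capability names, trace fields, and functions below are illustrative assumptions for this summary, not CLASP's actual rubric.

```python
# Hypothetical illustration: two ways to score the same agent trace.
# Field names and capabilities are assumed for illustration only.

def outcome_only(trace):
    """Outcome-only evaluation: a single pass/fail bit per task."""
    return trace["exploit_succeeded"]

def capability_scores(trace):
    """Capability-centric evaluation: how did the agent succeed?"""
    return {
        "recon": 1.0 if trace["enumerated_endpoints"] else 0.0,
        "vuln_identification": 1.0 if trace["found_injectable_param"] else 0.0,
        "exploitation": 1.0 if trace["exploit_succeeded"] else 0.0,
        "verification": 1.0 if trace["validated_output"] else 0.0,
    }

# Two traces with the same outcome but very different capability profiles:
lucky = {"enumerated_endpoints": False, "found_injectable_param": False,
         "exploit_succeeded": True, "validated_output": False}
systematic = {"enumerated_endpoints": True, "found_injectable_param": True,
              "exploit_succeeded": True, "validated_output": True}

# Outcome-only scoring cannot tell these runs apart...
assert outcome_only(lucky) == outcome_only(systematic)
# ...while capability scoring exposes the difference.
print(capability_scores(lucky))
print(capability_scores(systematic))
```

The point of the sketch is only that an identical "success" outcome can hide very different underlying behavior, which is the failure mode of outcome-only benchmarks the talk targets.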
AI review
CLASP is a well-structured framework for a real problem — outcome-only evaluation is genuinely broken for security agents — and the SQL injection trace example makes the failure mode concrete. But as a lightning talk without empirical validation, it remains an interesting proposal; demonstrating the framework's value would take a full paper.