Glass-Box Security: Operationalizing Mechanistic Interpretability for Defending AI Agents
Carl Hurd
[un]prompted 2026 — AI Security Practitioner Conference · Day 1 · 2
Current AI security tools — prompt firewalls and host-based monitors — can only inspect what a model *says*, not what it *thinks*. Carl Hurd of Starseer argues that true AI agent defense requires peering inside the model's activation layers using mechanistic interpretability, measuring both the *direction* and *strength* of dangerous concepts before they become dangerous actions. ---
AI review
The most technically ambitious defense talk I've seen at this conference. Instrumenting the residual stream of a running model to detect dangerous intent via cosine similarity and scalar projection is not a product pitch — it's a research program pointing at where the entire field needs to go. Hurd earned his minutes at this podium.