Glass-Box Security: Operationalizing Mechanistic Interpretability for Defending AI Agents

Name: Glass-Box Security: Operationalizing Mechanistic Interpretability for Defending AI Agents
Duration: 20 min
Description: Current AI security tools — prompt firewalls and host-based monitors — can only inspect what a model *says*, not what it *thinks*. Carl Hurd of Starseer argues that true AI agent defense requires peering inside the model's activation layers using mechanistic interpretability, measuring both the *direction* and *strength* of dangerous concepts before they become dangerous actions. ---

Carl Hurd

[un]prompted 2026 — AI Security Practitioner Conference · Day 1 · 2

Current AI security tools — prompt firewalls and host-based monitors — can only inspect what a model *says*, not what it *thinks*. Carl Hurd of Starseer argues that true AI agent defense requires peering inside the model's activation layers using mechanistic interpretability, measuring both the *direction* and *strength* of dangerous concepts before they become dangerous actions. ---

AI review

The most technically ambitious defense talk I've seen at this conference. Instrumenting the residual stream of a running model to detect dangerous intent via cosine similarity and scalar projection is not a product pitch — it's a research program pointing at where the entire field needs to go. Hurd earned his minutes at this podium.

Watch on YouTube