Are Your LLM's Safety Mechanisms Intact? Detecting Backdoors with White-Box Analysis
Akash Mukherjee
[un]prompted 2026 — AI Security Practitioner Conference · Day 2
Akash Mukherjee demonstrated live that a backdoored LLM is indistinguishable from a clean model under standard black-box testing, yet detectable in seconds by monitoring its internal neural activations. His argument: open-weight models are an unsolved supply chain security problem, and the only path to trustworthy AI is white-box analysis of what's happening inside the model, not just what it outputs.

---
AI review
Mukherjee ran a live demo that made the entire black-box-only security apparatus look like theater — and he's right. A backdoored Llama 3.1 that's indistinguishable from a clean model on every output-level test, but whose refusal signal visibly collapses in real time under white-box monitoring, is not a hypothetical: it's a two-hour, 500-document operation. This is the supply chain problem the industry has been ignoring, demonstrated with specificity that makes ignoring it harder.
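The talk summary doesn't include Mukherjee's actual monitoring code, but the idea of watching a refusal signal in internal activations can be sketched. The snippet below is a minimal illustration, not his implementation: it assumes you have extracted a hidden-state vector per prompt and a precomputed "refusal direction" (the names `refusal_projection`, `monitor`, and the synthetic vectors are all hypothetical), and flags a prompt when the projection onto that direction collapses relative to a clean baseline.

```python
import numpy as np

def refusal_projection(hidden_state: np.ndarray, refusal_dir: np.ndarray) -> float:
    """Scalar projection of a hidden-state vector onto the unit refusal direction."""
    unit = refusal_dir / np.linalg.norm(refusal_dir)
    return float(hidden_state @ unit)

def monitor(hidden_state: np.ndarray, refusal_dir: np.ndarray,
            clean_baseline: float, drop_ratio: float = 0.5) -> tuple[bool, float]:
    """Flag the input if its refusal projection falls below drop_ratio * baseline."""
    proj = refusal_projection(hidden_state, refusal_dir)
    return proj < drop_ratio * clean_baseline, proj

# Synthetic demo (hypothetical data, not real model activations):
# a "clean" activation with a strong refusal component vs. a "backdoored"
# activation where that component has collapsed.
rng = np.random.default_rng(0)
d = 64
refusal_dir = rng.normal(size=d)
unit = refusal_dir / np.linalg.norm(refusal_dir)
clean = 3.0 * unit + 0.1 * rng.normal(size=d)
backdoored = 0.2 * unit + 0.1 * rng.normal(size=d)

baseline = refusal_projection(clean, refusal_dir)
flagged, proj = monitor(backdoored, refusal_dir, baseline)
print(flagged, round(proj, 2))  # the collapsed refusal component trips the monitor
```

In a real white-box setup, `hidden_state` would come from a forward hook on a mid-layer residual stream, and the refusal direction would be estimated from activation differences between refused and complied prompts; everything else about the pipeline is an assumption here.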