SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
34th USENIX Security Symposium (USENIX Security '25) · Day 2 · LLM Security 2: Jailbreaking and Prompt Stealing
This article covers "SelfDefend," a framework designed to protect Large Language Models (LLMs) from **jailbreak attacks**. Presented by Xunguang Wang from HKUST, the talk introduces a novel, dual-layer defense mechanism inspired by the traditional system-security concept of **shadow stacks**. The core premise is to leverage LLMs themselves, not just to answer user queries, but also to actively detect and mitigate malicious instructions.
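The shadow-stack-style architecture can be sketched as follows. This is a minimal illustration, not the paper's implementation: `target_llm` and `shadow_llm_detect` are hypothetical stand-ins for real model calls, and the keyword check is a placeholder for SelfDefend's actual detection prompt.

```python
import concurrent.futures

def target_llm(query: str) -> str:
    # Placeholder for the production LLM answering the user query.
    return f"Answer to: {query}"

def shadow_llm_detect(query: str) -> bool:
    # Placeholder for the shadow LLM's detection pass. In SelfDefend this
    # would be a second LLM call asking whether the query contains a
    # harmful or jailbreak instruction; here, a trivial keyword check.
    return "ignore previous instructions" in query.lower()

def selfdefend(query: str) -> str:
    # Run the normal response and the shadow check concurrently,
    # mirroring how a shadow stack runs alongside the main call stack.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        answer = pool.submit(target_llm, query)
        flagged = pool.submit(shadow_llm_detect, query)
        if flagged.result():
            return "Request refused: potential jailbreak detected."
        return answer.result()
```

Because the two calls run in parallel, the shadow check adds little latency to benign queries while still gating the final response on the detector's verdict.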
AI review
Competent academic work on LLM jailbreak defense with a clean architectural idea — parallel shadow LLM detection inspired by shadow stacks. The shadow-stack analogy is cute and the fine-tuning pathway to open-source deployment is practically useful, but the core insight (use an LLM to judge another LLM's input) isn't fundamentally new territory, and the evaluation methodology raises the usual academic red flags around benchmark saturation and adaptive attack coverage.