MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots
Gelei Deng
Network and Distributed System Security (NDSS) Symposium 2024 · Day 3 · LLM Security
Large Language Models (LLMs) have rapidly transformed content generation, yet they remain highly susceptible to **jailbreak attacks**: carefully crafted prompts designed to bypass an LLM chatbot's built-in safeguards and coerce the model into generating inappropriate, harmful, or policy-violating content. Despite significant research into these vulnerabilities, existing jailbreak strategies often prove ineffective against commercial LLM chatbots such as Bing Chat and Google Bard. This ineffectiveness stems primarily from the proprietary, undisclosed nature of their defense mechanisms, which creates a substantial hurdle for researchers attempting to understand and counter these systems.