SneakyPrompt: Jailbreaking Text-to-image Generative Models
Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao
IEEE Symposium on Security and Privacy 2024 · Day 1 · Continental Ballroom 5
This article examines "SneakyPrompt," a framework for jailbreaking **text-to-image generative models** by bypassing their built-in safety filters. Presented by Yuchen Yang and co-authors from Johns Hopkins University and Duke University at IEEE S&P 2024, the research targets a critical security vulnerability in widely used generative AI systems such as Stable Diffusion and DALL-E 2. SneakyPrompt's core innovation is an automatic search for **adversarial prompts** that appear benign to human observers but coerce these models into generating content deemed "not safe for work" (NSFW) or otherwise restricted, all while preserving the attacker's intended visual semantics.
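To make the search idea concrete, here is a deliberately simplified sketch of the loop the paper describes: substitute tokens in a blocked prompt, query the model, and accept a candidate that both evades the safety filter and stays semantically close to the original. The keyword blocklist, the character-overlap "similarity" score, and all function names below are illustrative assumptions; the actual SneakyPrompt system uses reinforcement learning against real closed-box filters and CLIP-style semantic scoring.

```python
# Toy stand-ins for the real system. The real attack queries Stable Diffusion
# or DALL-E 2 and scores semantics with a CLIP-like model; here a keyword
# filter and a character-overlap score serve purely as illustration.
BLOCKLIST = {"forbidden"}

def safety_filter_blocks(prompt: str) -> bool:
    """Closed-box filter stand-in: rejects prompts with blocklisted tokens."""
    return any(tok in BLOCKLIST for tok in prompt.split())

def semantic_similarity(candidate: str, target: str) -> float:
    """Crude proxy for semantic similarity: shared-character ratio."""
    overlap = len(set(candidate) & set(target))
    return overlap / max(len(set(target)), 1)

def sneaky_search(prompt: str, target_token: str, candidates, threshold=0.5):
    """Try substitute tokens until one bypasses the filter while keeping
    similarity to the blocked token above a threshold.
    Returns (adversarial_prompt, queries_used), or (None, queries_used)."""
    queries = 0
    for cand in candidates:
        trial = prompt.replace(target_token, cand)
        queries += 1                      # each trial costs one model query
        if safety_filter_blocks(trial):
            continue                      # filter rejected; try next token
        if semantic_similarity(cand, target_token) >= threshold:
            return trial, queries         # bypassed filter, semantics kept
    return None, queries

adv, n = sneaky_search("a photo of a forbidden scene", "forbidden",
                       ["nope", "forbiden"])
print(adv, n)  # → a photo of a forbiden scene 2
```

The query counter mirrors the paper's emphasis on query efficiency: against a paid closed-box API, each rejected candidate has a real cost, which is why the authors replace naive enumeration with a reinforcement-learning search.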
AI review
This research introduces SneakyPrompt, an automated, reinforcement learning-driven framework that efficiently jailbreaks text-to-image models, including DALL-E 2's closed-box filters. It is a pointed demonstration that current safety mechanisms key on surface tokens rather than meaning, allowing restricted content to be generated with few queries while preserving attacker intent. This isn't just a paper; it's a wake-up call for everyone building generative AI.