Self-interpreting Adversarial Images
Tingwei Zhang
34th USENIX Security Symposium (USENIX Security '25) · Day 1 · ML and AI Security 1: Images
In an era where large language models (LLMs) are rapidly becoming primary interpreters of information across modalities, the integrity of their interpretations is paramount. This talk, presented by Tingwei Zhang from Cornell Tech, examines a novel and stealthy attack vector: **self-interpreting adversarial images**. The core premise is that imperceptible perturbations embedded in an image can steer a multimodal LLM's interpretation of that image, shaping its generated text toward the attacker's intended reading without any explicit text prompt or obvious misbehavior.
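Conceptually, the attack is an optimization over the pixels: the attacker searches for a small, bounded perturbation that pulls the image's representation toward an attacker-chosen description, so that whatever the model later says about the image drifts toward that reading. The sketch below illustrates this general idea with a PGD-style loop against CLIP as a stand-in image-text encoder; the model choice, loss, and hyperparameters (`eps`, `alpha`, `steps`) are illustrative assumptions for exposition, not the setup from the talk, which targets the generation behavior of full multimodal LLMs.

```python
# Minimal sketch: steer an image's embedding toward an attacker-chosen
# "interpretation" with an L-infinity bounded perturbation. CLIP is used
# here only as a convenient open-source proxy encoder.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
model.requires_grad_(False)  # only the perturbation is optimized
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def steer_image(image, target_text, eps=8 / 255, alpha=1 / 255, steps=200):
    """Return perturbed pixel values whose embedding is pulled toward
    the embedding of `target_text` (the attacker's desired reading)."""
    inputs = processor(text=[target_text], images=image,
                       return_tensors="pt", padding=True)
    pixels = inputs["pixel_values"].to(device)  # normalized by the processor

    with torch.no_grad():
        target = model.get_text_features(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
        )
        target = target / target.norm(dim=-1, keepdim=True)

    delta = torch.zeros_like(pixels, requires_grad=True)
    for _ in range(steps):
        img_emb = model.get_image_features(pixel_values=pixels + delta)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = -(img_emb * target).sum()   # maximize cosine similarity
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # signed gradient step (PGD)
            delta.clamp_(-eps, eps)             # keep perturbation imperceptible
            # note: the bound is applied in the processor's normalized pixel
            # space here, a simplification of a true [0, 1]-space constraint
            delta.grad.zero_()
    return (pixels + delta).detach()
```

In use, the perturbed pixel values (or the corresponding de-normalized image) would be handed to the victim multimodal model in place of the clean image; the review below notes that the talk reports such perturbations transferring across commercial models, which is what makes the vector practically worrying rather than a lab curiosity.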
AI review
Solid, original research from a Cornell PhD student that identifies a genuinely underexplored attack surface: using imperceptible image perturbations as soft prompts to steer multimodal LLM interpretation rather than jailbreak it. The framing distinction — interpretation manipulation vs. capability unlocking vs. classic prompt injection — is the real contribution, and the transferability finding across commercial models elevates this above a lab curiosity.