Character-Level Perturbations Disrupt LLM Watermarks

Zhaoxi Zhang

Network and Distributed System Security (NDSS) Symposium 2026 · Day 1 · AI Security

This talk presents a systematic study demonstrating that **character-level perturbations** are significantly more effective at removing LLM watermarks than the token-level and sentence-level attacks previously studied in the literature. The researchers, from the University of Technology Sydney, Griffith University, and MIT, show that simple modifications such as **homoglyph substitution**, typo insertion, and zero-width character insertion disrupt the tokenization process, and that the resulting damage cascades through the watermark signal far more efficiently than word-level synonym substitution or paraphrasing.
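To make the mechanism concrete, here is a minimal sketch, assuming the `tiktoken` package and a GPT-4-family BPE vocabulary (the talk does not specify which tokenizers were tested), of how a single homoglyph swap plus one zero-width insertion shatter whole-word tokens into byte-level fragments:

```python
# Minimal sketch of tokenization disruption from character-level edits.
# Assumes `tiktoken` is installed; the watermark itself is not modeled,
# only the token-stream damage that the talk identifies as the mechanism.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family BPE vocabulary

original = "The watermark survives ordinary paraphrasing attacks."
# Swap Latin 'a' (U+0061) for Cyrillic 'а' (U+0430) in one word,
# and drop a zero-width space (U+200B) into another.
perturbed = original.replace("watermark", "w\u0430termark", 1)
perturbed = perturbed.replace("paraphrasing", "para\u200bphrasing", 1)

for label, text in [("original", original), ("perturbed", perturbed)]:
    tokens = enc.encode(text)
    print(f"{label}: {len(tokens)} tokens -> {tokens}")

# The perturbed string no longer matches the byte sequences the BPE merge
# rules expect, so single word-tokens shatter into several byte-level
# tokens. A watermark keyed on token identities or token n-grams then
# sees a different sequence across the whole affected window, not just
# at the edited character.
```

Because a watermark detector aggregates a statistic over token identities, every shattered token shifts that statistic well beyond the single edited character, which is one plausible reading of the "cascading" effect described above.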

AI review

A clean, well-executed study demonstrating that character-level perturbations systematically defeat LLM watermarking schemes more effectively than token-level attacks, with the cascading tokenization disruption providing a satisfying technical explanation. The GA-based guided attack under the public API model is a nice addition. However, the contribution is primarily empirical validation of an intuitive insight rather than a deeply novel attack, and the defensive implication, that watermarking is fragile, is already widely suspected in the security community.
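For readers unfamiliar with the guided variant, the following is a hedged sketch of what a genetic-algorithm search over character-level edits could look like under query-only detector access. Everything here is an illustrative assumption, not the paper's actual design: `detector_score` is a toy stand-in for a public detection API, and the operators, population size, and fitness function are placeholders.

```python
# Hedged sketch of a GA-guided character-level attack under query-only
# access to a watermark detector. All names and parameters are
# illustrative assumptions, not the paper's actual design.
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Latin -> Cyrillic


def mutate(text: str) -> str:
    """Apply one random character-level edit: homoglyph swap,
    zero-width-space insertion, or adjacent-character transposition."""
    i = random.randrange(len(text))
    op = random.choice(["homoglyph", "zero_width", "typo"])
    if op == "homoglyph" and text[i] in HOMOGLYPHS:
        return text[:i] + HOMOGLYPHS[text[i]] + text[i + 1:]
    if op == "zero_width":
        return text[:i] + "\u200b" + text[i:]
    if op == "typo" and i + 1 < len(text):
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    return text


def detector_score(text: str) -> float:
    """Toy stand-in for the public detection API: rewards intact ASCII
    letters, so every edit above tends to lower it. A real attack would
    query the actual watermark detector here instead."""
    return sum(c.isascii() and c.isalpha() for c in text) / max(len(text), 1)


def ga_attack(text: str, pop_size: int = 20, generations: int = 50) -> str:
    """Evolve perturbed variants; lower detector score = fitter."""
    population = [mutate(text) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=detector_score)          # fittest (lowest) first
        survivors = population[: pop_size // 2]
        children = [mutate(random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return min(population, key=detector_score)


print(ga_attack("The watermarked output we want to launder."))
```

A real guided attack would presumably also constrain the edit budget or penalize readability loss so the laundered text remains usable; the sketch omits that trade-off.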

Watch on YouTube