Unshaken by Weak Embedding: Robust Probabilistic Watermarking for Dataset Copyright Protection

Shang Wang

Network and Distributed System Security (NDSS) Symposium 2026 · Day 1 · AI Security

This talk presents **DIP (Dataset Intelligence Probabilistic Watermarking)**, a method for protecting dataset copyright in the growing **data-as-a-service** marketplace. The core problem: when data contributors sell datasets to data curators, a malicious curator can resell the data without authorization. DIP lets contributors embed a probabilistic watermark into a small fraction of their data. The watermark is designed to survive adversarial removal attempts, and it lets the contributor verify whether a model was trained on their dataset.
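To make the idea of a probabilistic watermark concrete, here is a minimal sketch in Python. It is not DIP's actual algorithm, which this summary does not specify. It assumes the watermark signal is a distribution over labels assigned to a small marked subset, and that verification compares a suspect model's predictions on marked inputs against that distribution using total variation distance. The names `embed_watermark`, `total_variation`, `verify`, `target_dist`, and the `threshold` value are all hypothetical, and the step that stamps a trigger onto the marked inputs is omitted.

```python
import random
from collections import Counter


def embed_watermark(labels, target_dist, fraction, seed=0):
    """Relabel a random `fraction` of samples by sampling from `target_dist`.

    Illustrative only: a real scheme would also modify the marked inputs
    (for example with a trigger pattern); that step is omitted here.
    Returns the new label list and the sorted indices that were marked.
    """
    rng = random.Random(seed)
    n_marked = max(1, int(len(labels) * fraction))
    marked = sorted(rng.sample(range(len(labels)), n_marked))
    classes = list(target_dist)
    weights = [target_dist[c] for c in classes]
    new_labels = list(labels)
    for i in marked:
        # Each marked sample gets a label drawn from the secret distribution,
        # so there is no single fixed target label for a defender to look for.
        new_labels[i] = rng.choices(classes, weights=weights, k=1)[0]
    return new_labels, marked


def total_variation(predictions, target_dist):
    """Total variation distance between empirical predictions and target_dist."""
    counts = Counter(predictions)
    n = len(predictions)
    classes = set(counts) | set(target_dist)
    return 0.5 * sum(
        abs(counts.get(c, 0) / n - target_dist.get(c, 0.0)) for c in classes
    )


def verify(predictions, target_dist, threshold=0.1):
    """Flag a model as watermarked if its predictions on marked inputs
    follow the secret label distribution closely enough (assumed threshold)."""
    return total_variation(predictions, target_dist) <= threshold
```

In use, a contributor would call `embed_watermark` once before selling the dataset, keep `target_dist` and the marked indices secret, and later pass a suspect model's predictions on the marked inputs to `verify`. A single distance check like this covers only one verification signal; the second factor mentioned in the review below is not modeled here.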

AI review

A dataset watermarking scheme that uses a probabilistic label distribution as its verification signal. The two-factor verification approach is a reasonable improvement over zero-bit checking, and the probabilistic injection does defeat cleaning methods that assume a fixed target label. However, this is fundamentally still a backdoor-based watermarking scheme, operating in a space where the previous talk (SSL Extraction) just demonstrated that SSL feature-space analysis can strip these kinds of watermarks. The contribution feels incremental and the threat model somewhat narrow.
