Dataset Reduction and Watermark Removal via Self-supervised Learning for Model Extraction Attack

Hao Luan

Network and Distributed System Security (NDSS) Symposium 2026 · Day 1 · AI Security

This talk presents **SSL Extraction**, a two-step attack pipeline that performs efficient model extraction and watermark removal at the same time against black-box ML models. The key idea is to move the attack from pixel space to feature space using **self-supervised learning (SSL)**, which naturally separates legitimate data features from artificial watermark trigger patterns. Combined with a **P-dispersion optimization** strategy for query selection, the attack achieves competitive extraction accuracy with as few as **500-1,000 queries** while sharply reducing watermark success rates, which fall to **5.39%** against margin-based watermarks.
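To make the query-selection idea concrete, here is a minimal sketch of P-dispersion-style selection: choose `p` samples whose feature vectors are as spread out as possible, that is, with a large minimum pairwise distance. The summary does not say how the talk solves the optimization, so this uses a standard greedy farthest-point heuristic as a stand-in. The function name `greedy_p_dispersion`, the Euclidean distance, and the random features are all assumptions made for illustration; in the attack, the features would presumably come from the SSL encoder.

```python
import numpy as np


def greedy_p_dispersion(features: np.ndarray, p: int, seed: int = 0) -> list[int]:
    """Greedy max-min heuristic: pick p row indices whose features are spread out.

    Hypothetical helper, not the talk's solver. Euclidean distance is an assumption.
    """
    n = features.shape[0]
    if not 0 < p <= n:
        raise ValueError("p must be between 1 and the number of candidates")

    rng = np.random.default_rng(seed)
    first = int(rng.integers(n))  # arbitrary starting point
    selected = [first]

    # min_dist[i] = distance from candidate i to its nearest selected point
    min_dist = np.linalg.norm(features - features[first], axis=1)
    min_dist[first] = -np.inf  # never pick the same index twice

    while len(selected) < p:
        nxt = int(np.argmax(min_dist))  # candidate farthest from the current set
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
        min_dist[nxt] = -np.inf

    return selected


# Stand-in for SSL embeddings of 200 candidate query images (assumed shape).
features = np.random.default_rng(1).normal(size=(200, 16))
query_indices = greedy_p_dispersion(features, p=10)
```

Under these assumptions, `query_indices` would identify the samples sent to the black-box model, so a small query budget still covers the feature space broadly.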

AI review

A technically elegant attack that combines self-supervised learning with P-dispersion optimization to simultaneously solve two hard problems in model extraction: query efficiency and watermark evasion. The insight that SSL feature spaces naturally isolate artificial trigger patterns is powerful and well-demonstrated. The relative distance ratio metric is a genuine contribution that exposes a real flaw in watermark verification. This is the kind of work that makes defenders rethink their assumptions.
