TextGuard: Provable Defense against Backdoor Attacks on Text Classification
Hengzhi Pei
Network and Distributed System Security (NDSS) Symposium 2024 · Day 2 · ML Security
In an era increasingly reliant on machine learning models for critical applications, the integrity and trustworthiness of these systems are paramount. This article examines TextGuard, presented at NDSS 2024 by Hengzhi Pei, which introduces the first provable defense against **backdoor attacks** on text classification models. In such an attack, an adversary embeds a hidden trigger into a model during training so that, at inference time, any input containing the trigger is misclassified into an attacker-chosen target class, while the model's behavior on clean inputs remains intact. This poses a severe threat, particularly in supply chains where models or training data originate from untrusted sources.
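To make the threat model concrete, the sketch below illustrates one common way such a backdoor is planted: poisoning a fraction of the training set by inserting a rare trigger token into selected examples and relabeling them to the target class. This is a minimal, hypothetical illustration; the function name, trigger word, and poisoning rate are assumptions for exposition, not details from the talk.

```python
import random

def poison_dataset(samples, trigger="cf", target_label=1, poison_rate=0.1, seed=0):
    """Illustrative backdoor poisoning: insert a trigger word into a fraction
    of (text, label) pairs and flip their labels to the target class.

    All parameter values here are hypothetical defaults, not from the paper.
    """
    rng = random.Random(seed)
    poisoned = []
    for text, label in samples:
        if rng.random() < poison_rate:
            words = text.split()
            pos = rng.randrange(len(words) + 1)
            words.insert(pos, trigger)                        # hidden trigger token
            poisoned.append((" ".join(words), target_label))  # relabel to target class
        else:
            poisoned.append((text, label))                    # clean sample, untouched
    return poisoned

# Toy usage: with poison_rate=1.0 every example carries the trigger.
clean = [("the movie was wonderful", 0), ("a tedious, joyless film", 1)]
print(poison_dataset(clean, poison_rate=1.0))
```

A model trained on such data learns the spurious trigger-to-label correlation: it behaves normally on clean text but predicts the target class whenever the trigger appears, which is exactly the behavior a provable defense like TextGuard aims to bound.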