Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models

Aydin Abadi

Network and Distributed System Security (NDSS) Symposium 2025 · Day 3 · Federated Learning 2

This talk, presented by Vishnu Dasu, a PhD student at the Pennsylvania State University, introduces **Efficient Privacy-Preserving Multi-Party Deduplication (EPMPD)**. The problem it addresses is duplicated text sequences in the large datasets used to train large language models (LLMs). Such duplicates are known to degrade LLM performance, increasing **perplexity** (a measure of how uncertain the model is when predicting text; lower is better) and extending training times. Deduplication is straightforward when all the data sits in one place, but it becomes both harder and privacy-sensitive in **federated learning (FL)**, where multiple clients collaboratively train a global model without sharing their raw data: finding duplicates across clients would normally require comparing exactly the data those clients are meant to keep private.
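To make the contrast concrete, here is a minimal sketch of the centralized baseline the summary calls straightforward. It is not the EPMPD protocol and uses no cryptography; the function name `deduplicate_centralized` and the sample client data are hypothetical, chosen only for illustration. The point is that this approach works only because a single party can read every client's raw sequences, which is precisely what FL rules out.

```python
from typing import Dict, List


def deduplicate_centralized(client_data: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the first occurrence of each sequence across all clients.

    Illustrative baseline only: it needs one party to see every client's
    raw text, so it is NOT privacy-preserving and NOT the talk's protocol.
    """
    seen = set()
    result: Dict[str, List[str]] = {}
    for client, sequences in client_data.items():
        kept = []
        for seq in sequences:
            # A sequence is dropped if any client (including this one)
            # has already contributed it.
            if seq not in seen:
                seen.add(seq)
                kept.append(seq)
        result[client] = kept
    return result


# Hypothetical toy data: one duplicate within a client, one across clients.
clients = {
    "client_a": ["the cat sat", "hello world", "the cat sat"],
    "client_b": ["hello world", "federated learning"],
}
deduped = deduplicate_centralized(clients)
```

In this toy run, `client_a` keeps two sequences and `client_b` keeps one. A privacy-preserving protocol has to reach the same outcome without any party learning sequences it does not already hold.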

AI review

Solid cryptographic systems paper with a clear problem statement, a genuinely novel primitive (GPSI), and quantified results that actually move the needle on both performance and privacy. A PhD student presenting original research at NDSS: this is exactly the kind of work the venue exists for.
