Detecting and Mitigating Sampling Bias in Cybersecurity with Unlabeled Data
Saravanan Thirumuruganathan, Fatih Deniz, Mohamed Nabeel, Mourad Ouzzani
33rd USENIX Security Symposium · Day 1 · USENIX Security '24
The deployment of machine learning (ML) models in cybersecurity faces a critical, yet often overlooked, challenge: **sampling bias**. This talk, presented by Fatih Deniz at USENIX Security '24, examines the pervasive problem of training data that does not accurately represent the real-world distribution a model encounters in production. The consequences are severe: models that excel on academic benchmarks fail "miserably" when deployed against live, adversarial traffic. The paper, a collaboration between Qatar Computing Research Institute and Palo Alto Networks, offers novel techniques for both detecting and mitigating sampling bias, tailored to the demands of the cybersecurity domain, where labeling data is costly and the environment is inherently adversarial.
AI review
This research directly confronts the pervasive issue of sampling bias in cybersecurity ML, a critical factor undermining real-world deployments. It introduces novel, classifier-agnostic detection and mitigation strategies that intelligently leverage unlabeled data. The demonstrated ability to reclaim significant performance in adversarial settings marks this as a foundational and immediately actionable contribution for building robust security systems.
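To make the detection idea concrete: one common classifier-agnostic way to check for sampling bias using only unlabeled production data is a domain-classifier (two-sample) test: train a discriminator to distinguish training samples from production samples, and if it succeeds well above chance, the two distributions differ. The sketch below illustrates this generic technique with synthetic data; it is not the paper's specific method, and the feature shift and all parameters are illustrative.

```python
import numpy as np

# Synthetic stand-ins: "lab" training features vs. shifted production features.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 5))   # distribution the model was trained on
prod = rng.normal(0.7, 1.0, size=(500, 5))    # what the model actually sees in production

X = np.vstack([train, prod])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = training, 1 = production

# Tiny logistic-regression domain classifier fit by gradient descent.
w = np.zeros(X.shape[1])
b = 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

# Accuracy well above 0.5 means training and production samples are
# distinguishable, i.e. the training set is not representative.
pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
acc = (pred == y).mean()
print(f"domain-classifier accuracy: {acc:.2f}")
```

An accuracy near 0.5 would suggest the training sample is representative; a clearly higher value flags sampling bias, which is the kind of signal a detection pipeline can act on before deployment.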