AutoLabel: Automated Fine-Grained Log Labeling for Cyber Attack Dataset Generation

Yihao Peng

34th USENIX Security Symposium (USENIX Security '25) · Day 1 · System Security 1: Threat Detection, Exploitation, and Adaptive Defenses

Detecting and responding to sophisticated attacks depends on the quality and availability of training data for security models. This talk introduces **AutoLabel**, a system that automates the generation of **fine-grained, multi-source labeled log datasets** for cyber attack research. The presenter, Yihao Peng, is a PhD student at Tsinghua University. AutoLabel addresses a bottleneck in the field: the scarcity of high-quality labeled log data, which hampers the development and evaluation of new security technologies.

AI review

Solid systems research that attacks a real, unsexy problem: the chronic shortage of fine-grained, labeled log data for security ML. Reframing the task around provenance graphs and using unit partitioning to tame dependency explosion is genuinely clever, and the release of 580 benchmark datasets is a concrete community contribution. The 100% accuracy claims warrant scrutiny, but the technical architecture earns the benefit of the doubt at USENIX.
