Code is not Natural Language: Unlock the Power of Semantics-Oriented Graph Representation for Binary Code Similarity Detection
Haojie He, Xingwei Lin, Ziang Weng, Ruijie Zhao, Shuitao Gan, Libo Chen, Yuede Ji, Jiashui Wang, Zhi Xue
33rd USENIX Security Symposium · Day 1 · USENIX Security '24
Binary code similarity detection is a foundational task in cybersecurity: it aims to determine whether two binary functions are semantically equivalent. The task is particularly challenging when functions originate from the same source code but were compiled with different toolchains, with different optimization settings, or for different hardware architectures. Accurately identifying such semantic similarity is critical for many security applications, including discovering vulnerabilities in closed-source firmware, recovering binary function symbols, detecting software plagiarism, and identifying license violations.
AI review
This research challenges the assumption that binary code can be treated as natural language, a necessary correction. The Semantics-Oriented Graph (SOG) representation, coupled with a lightweight GNN, offers a new and efficient approach to binary similarity detection. It reports state-of-the-art results across architectures and optimization settings, with direct benefits for vulnerability discovery and malware analysis.