LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks
Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Kivilcim Coskun, Gianluca Stringhini
IEEE Symposium on Security and Privacy 2024 · Day 1 · Continental Ballroom 5
The proliferation of large language models (LLMs) as general-purpose assistants has led to their increasing deployment in automated cybersecurity tasks, including vulnerability analysis, repair, and software test generation. Despite this widespread adoption, however, their capabilities, particularly in **vulnerability detection** and **root cause analysis**, remain largely unexplored. Previous evaluations often focused on smaller LLMs, were limited to binary classification (vulnerable/not vulnerable), or concentrated on tasks such as insecure code generation or vulnerability repair, leaving a critical gap in the assessment of LLMs' deeper reasoning abilities.
AI review
This research delivers a much-needed reality check on LLMs for vulnerability analysis, introducing a novel, automated framework (SecLLMHolmes) to rigorously benchmark their reasoning. It unequivocally demonstrates that current LLMs are unreliable, prone to "unfaithful reasoning," and easily fooled by trivial code changes (of the kind illustrated in the sketch below). This is a critical signal for anyone deploying AI in security.
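To make the "fooled by trivial code changes" finding concrete, here is a minimal C sketch of the kind of semantics-preserving perturbation used in robustness tests of this sort (e.g., renaming functions and variables). The names `process_input` and `handle_record` are illustrative, not drawn from the paper's benchmark; both functions contain the identical unchecked copy, so a reliable analyzer must return the same verdict for both.

```c
#include <stdio.h>
#include <string.h>

/* Original: classic stack buffer overflow (out-of-bounds write, CWE-787).
 * strcpy() copies attacker-controlled input into a fixed 16-byte buffer
 * with no bounds check. */
void process_input(const char *input) {
    char buf[16];
    strcpy(buf, input);          /* overflows if strlen(input) >= 16 */
    printf("%s\n", buf);
}

/* Trivially perturbed variant: identical semantics; only the function
 * and variable names differ. An analyzer whose verdict flips between
 * these two versions is not reasoning about the code's behavior. */
void handle_record(const char *record) {
    char scratch[16];
    strcpy(scratch, record);     /* same unchecked copy, same bug */
    printf("%s\n", scratch);
}
```

If an LLM flags `process_input` as vulnerable but clears `handle_record` (or vice versa), its answer is tracking surface features such as identifiers rather than the underlying memory-safety property, which is precisely the failure mode the paper reports.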