Lex Sleuther - A Novel Approach to Script Language Detection

Aaron James

BSidesSF 2025 — Here Be Dragons · Day 1 · Main

Script language misidentification is a quiet but costly failure mode in large-scale malware analysis pipelines: at CrowdStrike's processing volume, even a 10% miss rate translates to hundreds of unanalyzed scripts per day. Aaron James built Lex Sleuther, a tool that lexes input files against multiple language grammars simultaneously and uses linear regression to assign language scores, achieving a false negative rate lower than Google's Magika while covering just six targeted file types.
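To make the lex-then-score idea concrete, here is a minimal sketch of the general technique, not Lex Sleuther's actual implementation. Everything in it is an assumption made for illustration: the three languages, the toy regex "lexers", the helper names (`token_features`, `score_languages`, `detect_language`), and the hand-picked weights, which stand in for coefficients that a real system would fit by regression on labeled samples.

```python
import re

# Hypothetical toy "lexers": each language gets a few (token_name, regex) rules.
# A real lexer would tokenize the full grammar; these are illustrative only.
LEXERS = {
    "powershell": [
        ("variable", r"\$[A-Za-z_]\w*"),
        ("cmdlet", r"\b[A-Z][a-z]+-[A-Z][A-Za-z]+\b"),
    ],
    "javascript": [
        ("keyword", r"\b(?:function|var|let|const)\b"),
        ("strict_eq", r"==="),
    ],
    "python": [
        ("keyword", r"\b(?:def|import|elif|lambda)\b"),
        ("block_colon", r":[ \t]*(?:\n|$)"),
    ],
}


def token_features(text):
    """Run every toy lexer over the text; return one match count per rule."""
    features = []
    for rules in LEXERS.values():
        for _name, pattern in rules:
            features.append(len(re.findall(pattern, text)))
    return features


def _toy_weights(target):
    """Hand-picked weights (an assumption, not trained): reward the target
    language's own tokens and lightly penalize every other language's."""
    weights = []
    for lang, rules in LEXERS.items():
        weights.extend([1.0 if lang == target else -0.25] * len(rules))
    return weights, 0.0  # (weights, bias)


# One linear model per language over the shared feature vector.
MODEL = {lang: _toy_weights(lang) for lang in LEXERS}


def score_languages(text):
    """Linear score per language: bias + dot(weights, features)."""
    features = token_features(text)
    return {
        lang: bias + sum(w * x for w, x in zip(weights, features))
        for lang, (weights, bias) in MODEL.items()
    }


def detect_language(text):
    """Return the language with the highest linear score."""
    scores = score_languages(text)
    return max(scores, key=scores.get)
```

Calling `detect_language` on a snippet such as `"$name = Get-ChildItem\nWrite-Host $name\n"` picks the language whose token counts dominate the weighted sum. The point of the design is that every candidate lexer sees the same input, so the scorer compares evidence across languages instead of trusting a single signature.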

AI review

Lex Sleuther is CrowdStrike's script language classifier, built from six lexers and linear regression; it outperforms Magika on false negative rate for the six file types that matter. The 'I wrote six lexers and stapled them together' honesty is refreshing. The scope is limited, but the production deployment details are real.
