Lex Sleuther - A Novel Approach to Script Language Detection
Aaron James
BSidesSF 2025 — Here Be Dragons · Day 1 · Main
Script language misidentification is a quiet but costly failure mode in large-scale malware analysis pipelines: at CrowdStrike's processing volume, even a 10% miss rate translates to hundreds of unanalyzed scripts per day. Aaron James built Lex Sleuther, a tool that lexes input files against multiple language grammars simultaneously and uses linear regression to assign language scores, achieving a false negative rate lower than Google's Magika while operating on just six targeted file types.
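The lex-then-score idea described above can be illustrated with a minimal sketch. This is not Lex Sleuther's implementation: the three languages, the regex "lexers", and the weights are all assumptions chosen for illustration. In particular, the weights are hand-set placeholders, where a real model would fit them by regression on labeled samples, and the abstract does not name the six supported file types.

```python
import re

# Toy stand-ins for real lexers: one regex of characteristic tokens per language.
# Languages and patterns are illustrative assumptions, not Lex Sleuther's grammars.
PATTERNS = {
    "powershell": re.compile(r"\$\w+|\b[A-Z][a-z]+-[A-Z]\w+|-(?:eq|ne|lt|gt)\b"),
    "javascript": re.compile(r"\b(?:function|var|let|const)\b|=>|==="),
    "python": re.compile(r"\b(?:def|import|elif|lambda)\b|\bprint\("),
}
LANGS = list(PATTERNS)

# Placeholder weights: +1.0 for a language's own lexer, -0.5 for the others.
# A real model would fit these by regression on labeled samples.
WEIGHTS = {
    lang: [1.0 if other == lang else -0.5 for other in LANGS]
    for lang in LANGS
}


def features(text):
    """Run every lexer over the input and return each lexer's share of all token hits."""
    counts = [len(PATTERNS[lang].findall(text)) for lang in LANGS]
    total = sum(counts) or 1  # avoid division by zero when nothing matches
    return [c / total for c in counts]


def scores(text):
    """Linear score per language: dot product of that language's weights with the features."""
    feats = features(text)
    return {
        lang: sum(w * f for w, f in zip(WEIGHTS[lang], feats))
        for lang in LANGS
    }


def classify(text):
    """Return the language with the highest linear score."""
    s = scores(text)
    return max(s, key=s.get)
```

Calling `classify(text)` on a script returns the best-scoring label, and `scores(text)` exposes the per-language values if a confidence threshold is wanted. The point of the design is that every lexer sees every file, so the scoring step can weigh evidence for one language against evidence for the others.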
AI review
CrowdStrike's script language classifier is built from six lexers and linear regression, and it outperforms Magika on false negative rate for the six file types that matter. The 'I wrote six lexers and stapled them together' honesty is refreshing. The scope is limited, but the production deployment details are real.