Beyond Raw Bytes: Towards Large Malware Language Models

Luke Kurlandski

Network and Distributed System Security (NDSS) Symposium 2026 · Day 2 · Malware & RE

Can the foundation model paradigm that has transformed natural language processing be adapted for malware analysis? This talk investigates the feasibility of training **Large Malware Language Models (LMLMs)** -- the malware analog to LLMs -- by adapting the core pillars of modern deep learning: large-scale data, efficient architectures, self-supervised pre-training, and task-specific fine-tuning. The research systematically explores three code representations (raw bytes, disassembly, and decompiled code via **Ghidra**), two efficient neural architectures (**HRRformer** and **Mamba**), and two pre-training objectives (masked and causal language modeling) across three downstream malware analysis tasks.
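To make one of the pre-training objectives concrete, here is a minimal sketch of masked-language-modeling-style corruption applied to a raw byte sequence. This is an illustration only, not the talk's implementation: the mask rate, the extra `MASK_ID` token beyond the 0-255 byte range, and the `-1` ignore-index convention are assumed, conventional choices.

```python
import random

MASK_ID = 256    # assumed: one token id beyond the 0-255 byte vocabulary
MASK_RATE = 0.15 # assumed: the conventional MLM masking rate

def mlm_corrupt(byte_seq, rng):
    """Replace a random subset of byte tokens with MASK_ID.

    Returns (corrupted, targets), where targets holds the original
    byte at masked positions and -1 elsewhere, so a training loss
    can be computed only on the masked tokens.
    """
    corrupted, targets = [], []
    for b in byte_seq:
        if rng.random() < MASK_RATE:
            corrupted.append(MASK_ID)
            targets.append(b)    # model must reconstruct this byte
        else:
            corrupted.append(b)
            targets.append(-1)   # position ignored by the loss
    return corrupted, targets

rng = random.Random(0)
raw = list(b"MZ\x90\x00" * 16)  # toy input: repeated PE-header bytes
corrupted, targets = mlm_corrupt(raw, rng)
```

A causal language-modeling objective would instead shift the sequence by one position and predict every next byte; the masked variant sketched here only scores the corrupted positions.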

AI review

A systematic investigation into adapting foundation models for malware classification that produces genuinely useful findings -- decompiled code beats raw bytes, Mamba beats transformers for long sequences, and pre-training helps even in the malware domain. The depth-over-width scaling insight for long-sequence training is technically interesting. However, the work is fundamentally ML infrastructure research applied to malware as a domain, with no novel security insights, no adversarial robustness analysis, and no examination of evasion potential.

Watch on YouTube