Training Specialist Models: Automating Malware Development
Black Hat USA 2025 · Day 1 · Briefings
Outflank researcher Kyle trained Dante, a custom 7-billion-parameter LLM built on Qwen 2.5 Coder, to generate functional shellcode loaders that evade Microsoft Defender for Endpoint. Training combined supervised fine-tuning with Reinforcement Learning with Verifiable Rewards (RLVR); the model learned to evade detection entirely through trial-and-error reinforcement learning, without being shown any working examples of evasive code. At a cost of roughly $1,350 in cloud GPU time, Dante outperforms DeepSeek R1 and rivals GPT-4o on this specific offensive task while refusing the request far less often.
---
AI review
For roughly $1,350 in GPU time, Kyle trained a 7B model that outperforms DeepSeek R1 at EDR evasion, and he released the code. Applying RLVR to offensive security tooling is the methodological contribution most likely to look alarming in hindsight: it sharply lowers the barrier to automated malware development, and that barrier is not coming back up.