bin2ml: turning software binaries into machine learning ready training data
Josh Collyer
44CON 2024 · Day 1 · Main
Josh Collyer, Head of the AI Security Group at The Alan Turing Institute, presented `bin2ml`, an open-source tool that transforms software binaries into machine learning-ready training data. The talk addressed a critical bottleneck in applying machine learning (ML) to binary analysis: the laborious and often inconsistent process of extracting and preparing data from diverse binary formats and architectures. Collyer, whose PhD research focuses on ML and AI for reverse engineering, developed `bin2ml` to standardize and streamline this essential first step, so that researchers and practitioners can focus on model development rather than data wrangling.
AI review
Collyer has identified a real pain point (the fragmented, tool-dependent data pipelines choking ML-for-binary-analysis research) and built something practical to address it. `bin2ml` is a legitimate open-source infrastructure contribution, well-scoped and technically coherent, but it is tooling, not research, and the talk doesn't pretend otherwise. That inherently limits its ceiling.