Designing Models from the Hardware Up

Simran Arora

Conference on Machine Learning and Systems 2025 · Day 1 · Young Professional Symposium

In this MLSys 2025 talk, Simran Arora argues for designing AI models with hardware considerations from the ground up, rather than optimizing them post-hoc. The core premise addresses a critical inefficiency in the current machine learning landscape: while novel model architectures are emerging to rival the ubiquitous Transformer, their theoretical efficiency often fails to translate into tangible wall-clock speed in practice. Arora attributes this gap to two primary issues: architectural choices that are incompatible with underlying hardware capabilities, and the lack of flexible, high-performance programming abstractions for rapid hardware-aware development.

AI review

Simran Arora's MLSys 2025 talk presents two concrete artifacts — ThunderKittens, a hardware-aware kernel framework, and Based, a hybrid attention architecture — both grounded in an honest diagnosis of why 'theoretically efficient' models fail in practice. The engineering argument is specific and credible: the gap between asymptotic complexity and wall-clock performance is real, the tooling problem (CUDA too hard, Triton too slow) is accurately characterized, and the proposed solutions are benchmarked against sensible baselines. The 15x speedup over Triton and 5x hardware utilization…