AdaSplash: Adaptive Sparse Flash Attention

Nuno Gonçalves, Marcos V. Treviso, Andre Martins

International Conference on Machine Learning 2025 · Oral

The talk "AdaSplash: Adaptive Sparse Flash Attention" introduces a groundbreaking approach to enhance the efficiency and scalability of attention mechanisms in large-scale transformer models. Presented by Nuno Gonçalves, Marcos V. Treviso, and Andre Martins from IST Lisbon, this work directly addresses the fundamental quadratic memory complexity inherent in traditional attention, a bottleneck that has historically limited the context lengths and overall scalability of transformers. While Flash Attention revolutionized softmax-based attention by eliminating the need to materialize large intermediate matrices, it left a critical gap for alternative, sparse attention mechanisms.

AI review

AdaSplash is a competent systems contribution that delivers a faster GPU kernel for α-entmax attention by combining Halley's method with bisection for root-finding. The engineering is real and the speedups are credible. The paper opens a path that was previously impractical, which is a genuine service to researchers who want to work with sparse attention. However, the theoretical framing is thin: there is no convergence analysis for the Halley-bisection hybrid in the entmax-specific setting, the downstream empirical results are limited to a narrow set of architectures, and the central…
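To make the Halley-bisection hybrid mentioned in the review concrete, here is a minimal, standalone PyTorch sketch of the threshold-finding step in isolation. This is not the paper's kernel, which fuses the solver into a blockwise GPU attention computation; the function name `entmax_threshold`, the fixed iteration count, and the bracketing choices are illustrative assumptions.

```python
import torch

def entmax_threshold(z: torch.Tensor, alpha: float = 1.5, n_iter: int = 10) -> torch.Tensor:
    """Solve f(tau) = sum_i relu((alpha-1)*z_i - tau)^p - 1 = 0, with
    p = 1/(alpha-1), using Halley steps safeguarded by a bisection bracket.
    Illustrative sketch only, not the AdaSplash kernel."""
    p = 1.0 / (alpha - 1.0)
    a = (alpha - 1.0) * z

    # f is strictly decreasing in tau, and this bracket always contains the root:
    # at tau = max(a) the sum is 0 (f = -1); at tau = max(a) - 1 it is >= 1 (f >= 0).
    lo, hi = a.max() - 1.0, a.max()
    tau = 0.5 * (lo + hi)

    for _ in range(n_iter):
        u = torch.clamp(a - tau, min=0.0)
        s = u[u > 0]                                  # entries in the current support
        fx = s.pow(p).sum() - 1.0                     # f(tau)
        f1 = -p * s.pow(p - 1.0).sum()                # f'(tau)
        f2 = p * (p - 1.0) * s.pow(p - 2.0).sum()     # f''(tau)

        # Shrink the bracket using the sign of f (f(lo) >= 0 >= f(hi)).
        if fx > 0:
            lo = tau
        else:
            hi = tau

        # Halley update: tau - 2*f*f' / (2*f'^2 - f*f'');
        # fall back to bisection whenever the step leaves the bracket.
        denom = 2.0 * f1 * f1 - fx * f2
        if denom.abs() > 1e-12:
            cand = tau - 2.0 * fx * f1 / denom
            tau = cand if lo < cand < hi else 0.5 * (lo + hi)
        else:
            tau = 0.5 * (lo + hi)
    return tau

if __name__ == "__main__":
    z = torch.randn(16)
    tau = entmax_threshold(z, alpha=1.5)
    # alpha = 1.5: (alpha - 1) z = 0.5 z and the exponent 1/(alpha - 1) = 2.
    probs = torch.clamp(0.5 * z - tau, min=0.0).pow(2)
    print(probs.sum())  # ~1.0, with exact zeros on low-scoring entries
```

The design rationale is the usual safeguarded-iteration one: Halley's method converges very quickly near the root but can overshoot from a poor starting point, while the bisection bracket guarantees progress on every iteration, so the hybrid is robust enough to run for a small, fixed number of steps.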