MAS-ATTENTION: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu
Conference on Machine Learning and Systems 2025 · Day 4 · Session 11: Federated Learning
The proliferation of large language models and other foundation models has cemented the **transformer architecture** and its core component, the **attention mechanism**, as indispensable across diverse AI applications, from natural language processing to computer vision, diffusion models, and image restoration. However, deploying these computationally intensive models on resource-constrained edge devices, such as cell phones, IoT devices, or custom neural processing units (NPUs) and tensor processing units (TPUs), presents significant challenges. Existing acceleration techniques, often tailored for high-end GPUs, fail to adequately address the unique memory and computational limitations of edge hardware.
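The memory pressure the abstract refers to comes from naive attention materializing the full n×n score matrix. A common remedy is to stream over key/value blocks with an online softmax so peak memory stays proportional to the block size. The sketch below illustrates that general idea only; it is not MAS-ATTENTION's actual scheduling, and all function names and the block size are illustrative assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix: memory grows quadratically.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def streaming_attention(Q, K, V, block=32):
    # Processes K/V in blocks with an online softmax, so peak memory is
    # O(n * block) instead of O(n^2). Illustrative sketch only; not the
    # paper's memory-aware stream-processing scheme.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)           # running row-wise max of scores
    l = np.zeros(n)                   # running softmax denominator
    acc = np.zeros((n, V.shape[-1]))  # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale
        m_new = np.maximum(m, S.max(axis=-1))
        alpha = np.exp(m - m_new)     # rescale previously accumulated state
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=-1)
        acc = acc * alpha[:, None] + P @ Vb
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 16)) for _ in range(3))
```

Both functions compute the same output; the streaming variant trades a second pass of rescaling arithmetic for a much smaller working set, which is the kind of memory/compute trade-off that matters on edge NPUs.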
AI review
Technically credible systems work on attention acceleration for edge NPUs: real-device experiments, specific hardware constraints, and honest trade-off discussion. But this is a research paper presentation, not an engineering talk, and the gap between "we validated on a Huawei MatePad in simulation plus grid search" and "you can use this" is substantial. Solid ML systems research, limited immediate applicability for most engineers.