Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

Marco Federici, Davide Belli, Mart Van Baalen, Markus Nagel, Paul Whatmough

Conference on Machine Learning and Systems 2025 · Day 3 · Session 7: Quantization and Sparsity

This talk, presented by Davide Belli and colleagues at Qualcomm AI Research, addresses a critical challenge in deploying large language models (LLMs) on edge devices: model sizes are growing far faster than the memory available on those devices. The work, titled "Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking," proposes techniques that allow gigabyte-scale LLMs to run on smartphones, where DRAM is limited and shared with other applications. The core problem is that modern LLMs cannot fit entirely in a device's DRAM, forcing weights to be fetched from slow flash storage during inference.

AI review

Qualcomm AI Research presents a legitimate systems paper on efficient LLM inference for edge devices, with real engineering substance behind the Dynamic Input Pruning (DIP) and cache-aware masking contributions. The core insight — that SwiGLU breaks the assumptions behind ReLU-based dynamic pruning, and here's a non-predictive alternative — is crisp and technically sound. The code is open-sourced, numbers are specific, and the problem framing is honest. What holds this back from a higher rating is that the article reads like a well-structured paper summary, not a talk that shows you how to build anything…
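
To make the core insight concrete, here is a minimal sketch of what magnitude-based dynamic pruning of a SwiGLU feed-forward block could look like. This is an illustration under assumptions, not the paper's implementation: the function name `swiglu_ffn_dip`, the top-k `keep_ratio` criterion, and the single-token shapes are hypothetical choices made for the example.

```python
import torch
import torch.nn.functional as F

def swiglu_ffn_dip(x, w_gate, w_up, w_down, keep_ratio=0.25):
    """Illustrative SwiGLU feed-forward with magnitude-based dynamic pruning.

    NOTE: hypothetical sketch, not the paper's actual code or API.

    x:      (d_model,)         input activation for one token
    w_gate: (d_ff, d_model)    gate projection
    w_up:   (d_ff, d_model)    up projection
    w_down: (d_model, d_ff)    down projection
    """
    # The token's own gate activations decide which hidden units matter,
    # so no separate trained sparsity predictor is needed.
    gate = F.silu(w_gate @ x)                      # (d_ff,)
    k = max(1, int(keep_ratio * gate.numel()))
    idx = gate.abs().topk(k).indices               # largest-magnitude hidden units

    # Only the selected rows/columns of w_up / w_down contribute, so only
    # those weights would need to be resident in DRAM (or fetched from flash).
    h = gate[idx] * (w_up[idx] @ x)                # (k,)
    return w_down[:, idx] @ h                      # (d_model,)
```

The point of the sketch is the contrast with ReLU-era methods: the mask comes directly from the current token's activations rather than from a predictor trained to guess zeros, which is why it still applies when SwiGLU's activations are rarely exactly zero.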