Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies
Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel
International Conference on Machine Learning 2025 · Oral
Large Language Model (LLM) inference, particularly the **autoregressive decoding** process, remains a significant bottleneck in many applications: generating each token typically requires a full forward pass through the LLM, leading to high latency and limited throughput. Speculative decoding has emerged as a promising technique to mitigate this bottleneck, offering substantial speedups while preserving the target model's output distribution exactly, making it a **lossless** acceleration method. However, a critical practical limitation has historically hindered its widespread adoption: the smaller, faster **drafter model** must share exactly the same vocabulary as the larger **target model**.
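To make the mechanics concrete, below is a minimal sketch of the standard speculative decoding verification step in the shared-vocabulary setting that this paper generalizes. The function and variable names are illustrative, not taken from the paper or its code.

```python
import numpy as np

def verify_draft(draft_probs, target_probs, draft_tokens, rng=None):
    """Minimal sketch of standard speculative sampling verification.

    draft_probs[i]  -- drafter distribution that produced draft_tokens[i]
    target_probs[i] -- target model's distribution at the same position
    Accepting with probability min(1, p/q) and resampling rejections from
    the normalized residual max(0, p - q) reproduces the target
    distribution exactly -- the "lossless" guarantee.
    """
    rng = rng or np.random.default_rng()
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i][tok], draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # drafted token accepted
        else:
            # Rejected: resample from the residual distribution.
            residual = np.clip(target_probs[i] - draft_probs[i], 0.0, None)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break
    # (A full implementation also samples one bonus token from the target
    # distribution when every drafted token is accepted.)
    return out
```

Note that this verification step indexes both models' distributions with the same token ids, which is precisely where the shared-vocabulary assumption enters and what the paper's algorithms remove.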
AI review
A competent and practically impactful engineering contribution that removes a real friction point in speculative decoding deployment — the shared vocabulary requirement — and does so with losslessness guarantees. The work is honest about the tradeoffs between its three algorithms, and the Hugging Face adoption is meaningful evidence of practical utility. However, the theoretical depth is modest: the losslessness proofs are almost certainly straightforward applications of standard rejection sampling arguments, and the core ideas (vocabulary intersection pruning, string-level matching) are…
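For intuition on the "vocabulary intersection pruning" idea mentioned above, here is a plausible sketch under the assumption that both tokenizers expose a token-string-to-id map (as Hugging Face's `tokenizer.get_vocab()` does). The helper names are hypothetical, and the paper's actual algorithms likely differ in detail.

```python
import numpy as np

def intersection_mask(draft_vocab: dict, target_vocab: dict) -> np.ndarray:
    """Boolean mask over drafter ids whose token strings also appear in the
    target vocabulary. Both arguments are token-string -> id maps; drafter
    ids are assumed to be 0..len(draft_vocab)-1."""
    shared = set(draft_vocab) & set(target_vocab)
    mask = np.zeros(len(draft_vocab), dtype=bool)
    for tok_str in shared:
        mask[draft_vocab[tok_str]] = True
    return mask

def prune_draft_probs(draft_probs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out drafter probability mass outside the shared vocabulary and
    renormalize, so every drafted token maps to a target-vocabulary id
    before the usual accept/reject verification."""
    pruned = np.where(mask, draft_probs, 0.0)
    return pruned / pruned.sum()
```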