Long-Form Speech Generation with Spoken Language Models

Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan

International Conference on Machine Learning 2025 · Oral

This article covers the work presented by Se Jin Park and Julian Salazar on **SpeechSSM**, an approach to generating long-form, coherent, and expressive speech with spoken language models. Developed in a collaboration between Google DeepMind and KAIST, SpeechSSM addresses a long-standing challenge: generating extended audio that maintains contextual relevance, speaker identity, and naturalness over many minutes. The talk highlights the limitations of existing speech generation methods, particularly their reliance on text intermediates and their struggle with the high temporal resolution and information density of raw audio, which lead to prohibitive computational costs and degrading quality as sequences grow.

AI review

SpeechSSM is competent, well-executed systems work on long-form speech generation using hybrid state-space models. The engineering contributions are real — windowed tokenization, EOS bias removal, decoupled semantic/acoustic generation — and the benchmark and metrics are a genuine service to the subfield. But this is primarily an applied systems paper dressed in the language of foundational contribution. The theoretical justification for why SSMs generalize along the length dimension is asserted rather than derived, the architectural choices are borrowed wholesale from Griffin/Samba, and the…
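The windowed generation idea mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: `windowed_generate`, `toy_step`, and the toy recurrence are hypothetical stand-ins, meant only to show why a recurrent (SSM-style) decoder keeps memory constant across windows, in contrast to an attention cache that grows with the full history.

```python
def windowed_generate(step_fn, state, n_windows, window_size):
    """Generate tokens window by window.

    Only the fixed-size recurrent state crosses window boundaries,
    so memory use is constant in total sequence length. Each finished
    window could be handed off to a separate acoustic decoder,
    mirroring a decoupled semantic/acoustic pipeline.
    """
    tokens = []
    for _ in range(n_windows):
        window = []
        for _ in range(window_size):
            state, tok = step_fn(state)  # one autoregressive step
            window.append(tok)
        tokens.extend(window)  # window is complete; state carries on
    return tokens, state

# Hypothetical step function: a scalar linear recurrence standing in
# for an SSM layer; the "token" is just a function of the state.
def toy_step(state):
    new_state = 0.9 * state + 1.0
    token = int(new_state) % 7
    return new_state, token

tokens, final_state = windowed_generate(
    toy_step, state=0.0, n_windows=3, window_size=4
)
```

The point of the sketch is the interface: `step_fn` consumes and returns only a bounded state, so doubling `n_windows` doubles output length without growing the working set.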