VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Jian Tong, Haodong Duan, Qipeng Guo, Jiaqi Wang, Xipeng Qiu, Dahua Lin

International Conference on Machine Learning 2025 · Oral

The proliferation of video data and the increasing demand for sophisticated long-context video understanding tasks present significant challenges for current deep learning models, particularly in effectively encoding and leveraging positional information across vast temporal and spatial dimensions. This talk introduces **VideoRoPE**, a novel **Rotary Position Embedding (RoPE)** variant specifically designed to address the unique complexities of video inputs. Presented at ICML 2025 by a lab mate on behalf of first author Xilin Wei and his collaborators, VideoRoPE offers a principled answer to a critical bottleneck in video-centric AI systems: how to design position embeddings that are simultaneously scalable, robust, and effective for long-duration, high-resolution video streams.
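For readers unfamiliar with the base mechanism, standard 1D RoPE rotates consecutive feature pairs of a query or key vector by position-dependent angles, so that the attention dot product depends only on the *relative* position of the two tokens. The sketch below is a minimal textbook version of that mechanism, not the paper's video variant:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate each (even, odd) feature pair of x by an
    angle pos * theta_i, where theta_i = base^(-2i/d) for pair index i."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even feature dimension"
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # 2D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

The defining property, and the reason RoPE extrapolates better than absolute embeddings, is that `dot(rope_rotate(q, m), rope_rotate(k, n))` depends only on the offset `m - n`.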

AI review

VideoRoPE is a competent and honest engineering contribution that identifies four design axes for video RoPE variants — 3D structure, frequency allocation, spatial symmetry, and temporal scaling — and proposes concrete modifications to each. The empirical results are consistent and the ablation structure is reasonable. However, the work is fundamentally a principled design study, not a theoretical contribution: the four properties are presented as intuitive desiderata rather than derived from any formal framework, the frequency allocation argument is largely heuristic, and the performance…
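To make the "3D structure" and "frequency allocation" axes concrete, the schematic below splits the rotary frequency bands across the (t, x, y) coordinates of a video token, assigning the lowest-frequency bands to time in the spirit of the paper's low-frequency temporal allocation. The split ratio `t_frac` and the exact band ordering here are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def video_rope_angles(t, x, y, d, base=10000.0, t_frac=0.25):
    """Schematic 3D rotary angle table for a token at video coordinate
    (t, x, y). The d/2 frequency bands are partitioned: high-frequency
    bands encode spatial x and y, low-frequency bands encode time.
    t_frac (assumed here, not from the paper) controls the temporal share."""
    n = d // 2
    freqs = base ** (-np.arange(n) / n)   # ordered high -> low frequency
    n_t = int(n * t_frac)                 # bands reserved for time
    n_s = (n - n_t) // 2                  # bands per spatial axis
    angles = np.empty(n)
    angles[:n_s] = x * freqs[:n_s]                # highest freqs: spatial x
    angles[n_s:2 * n_s] = y * freqs[n_s:2 * n_s]  # next bands: spatial y
    angles[2 * n_s:] = t * freqs[2 * n_s:]        # lowest freqs: temporal
    return angles
```

Because time occupies only the slowest-rotating bands, temporal indices can grow across long videos without the rapid phase wrap-around that would occur if time reused the high-frequency bands.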