VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
International Conference on Machine Learning 2025 · Oral
Video generation has made significant strides in rendering visually stunning, high-fidelity content. However, as Hila Chefer of Meta AI and Tel Aviv University argues in her ICML 2025 talk, "VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models," a critical challenge persists: achieving truly coherent and physically plausible motion. Despite massive scaling of models and data, state-of-the-art video generation systems still falter on the temporal dimension, exhibiting jittery movements, illogical object interactions, and little grasp of real-world physics.
AI review
VideoJAM is a well-motivated engineering contribution that augments video diffusion models with optical flow supervision and a self-consistency guidance mechanism at inference time. The core empirical finding — that pixel-reconstruction losses are nearly invariant to frame shuffling — is a clean and honest diagnostic that earns the paper's premise. The architectural intervention is minimal and the efficiency claim is credible. That said, this is fundamentally a systems paper dressed with the vocabulary of theoretical insight. The 'joint latent representation' is a linear combination at the…
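The shuffling diagnostic the review praises can be illustrated with a toy sketch (all shapes, noise levels, and names below are illustrative assumptions, not taken from the paper): when frames of a video change slowly, a per-pixel reconstruction loss is nearly unchanged if the target frames are temporally permuted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 8 frames of a nearly static scene with a tiny per-frame drift.
T, H, W = 8, 16, 16
base = rng.standard_normal((H, W))
drift = rng.standard_normal((H, W))
target = np.stack([base + 0.002 * t * drift for t in range(T)])

# A hypothetical model prediction: the target plus small pixel noise.
pred = target + 0.05 * rng.standard_normal(target.shape)

def pixel_loss(p, t):
    # Frame-wise MSE, the kind of pixel-reconstruction objective the review refers to.
    return float(np.mean((p - t) ** 2))

perm = rng.permutation(T)           # random temporal shuffle of the target
loss_ordered = pixel_loss(pred, target)
loss_shuffled = pixel_loss(pred, target[perm])

# The two losses come out nearly identical: the objective is almost blind
# to frame order, so it cannot reward temporal coherence on its own.
print(loss_ordered, loss_shuffled)
```

Because the loss decomposes per frame and adjacent frames barely differ, shuffling moves it only marginally; a motion-aware signal such as the optical-flow supervision the review describes breaks exactly this invariance, since flow between shuffled frames differs sharply from flow between ordered ones.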