Despite significant progress in recent years, video generation models still struggle to accurately capture complex movements, dynamics, and physical phenomena. This limitation stems largely from the conventional pixel-reconstruction objective, which tends to favor appearance fidelity at the expense of motion coherence.
To address this issue, Meta's research team proposed a new framework called VideoJAM, which injects an effective motion prior into video generation models by encouraging them to learn a joint appearance-motion representation.
The VideoJAM framework consists of two complementary components. During training, it extends the standard objective so that the model predicts both the generated pixels and their corresponding motion, with both predictions derived from a single learned representation.
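For intuition, a joint objective of this kind might look like the following minimal sketch. It assumes a hypothetical diffusion-style backbone and uses mean-squared-error losses with an arbitrary motion weight; the class name, head design, and loss weighting are illustrative assumptions, not Meta's released code.

```python
import torch.nn as nn

class JointAppearanceMotionObjective(nn.Module):
    """Illustrative sketch (not Meta's code): one shared backbone yields a
    single representation, which two lightweight heads decode into a pixel
    (appearance) prediction and a motion prediction; both are supervised."""

    def __init__(self, backbone: nn.Module, dim: int, pixel_ch: int, motion_ch: int):
        super().__init__()
        self.backbone = backbone                      # shared video model, e.g. a DiT
        self.pixel_head = nn.Linear(dim, pixel_ch)    # appearance branch
        self.motion_head = nn.Linear(dim, motion_ch)  # motion branch (e.g. optical flow)

    def forward(self, noisy_video, target_pixels, target_motion, motion_weight=0.5):
        h = self.backbone(noisy_video)                # one joint representation
        loss_pixels = nn.functional.mse_loss(self.pixel_head(h), target_pixels)
        loss_motion = nn.functional.mse_loss(self.motion_head(h), target_motion)
        # Standard reconstruction term plus a weighted motion-prediction term.
        return loss_pixels + motion_weight * loss_motion
```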
During inference, the team introduces a mechanism called "Inner-Guidance," which uses the model's own evolving motion predictions as a dynamic guidance signal to steer generation toward coherent motion. Notably, VideoJAM can be applied to any video generation model without modifying the training data or scaling up the model.
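Conceptually, each denoising step can be nudged in the direction implied by the model's own motion prediction, in the spirit of classifier-free guidance. The sketch below assumes a hypothetical `model(x, t, motion=...)` interface that returns a denoising update together with an updated motion estimate; none of these names come from Meta's code.

```python
def inner_guidance_step(model, x_t, t, prev_motion, scale=2.0):
    """Sketch of one guided denoising step; the model interface is assumed."""
    # Conditional pass: feed the model its own motion prediction from the
    # previous step as a dynamic guidance signal.
    eps_cond, motion_pred = model(x_t, t, motion=prev_motion)

    # Reference pass with the motion signal dropped, analogous to the
    # unconditional branch in classifier-free guidance.
    eps_uncond, _ = model(x_t, t, motion=None)

    # Extrapolate toward the motion-consistent direction.
    eps = eps_uncond + scale * (eps_cond - eps_uncond)
    return eps, motion_pred
```

Because the updated motion estimate is carried into the next step, the guidance signal evolves along with the sample rather than being fixed in advance.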
Evaluations show that VideoJAM achieves state-of-the-art motion coherence, surpassing several highly competitive proprietary models while also improving the visual quality of the generated videos. The research underscores the complementary relationship between appearance and motion: combining the two effectively improves both the visual quality and the motion coherence of video generation.
Additionally, the research team demonstrated VideoJAM-30B on complex motion types, including skateboarders performing jumps and ballet dancers spinning on a lake. Side-by-side comparisons with the DiT-30B base model show that VideoJAM markedly improves the quality of generated motion.
Key Highlights:
🌟 The VideoJAM framework enhances the motion expressiveness of video generation models through joint appearance-motion representations.
🎥 During training, VideoJAM predicts pixels and motion simultaneously from a single representation, improving the coherence of generated content.
🏆 In evaluations, VideoJAM outperforms several competitive models in both motion coherence and visual quality.