With the rapid development of artificial intelligence, image-to-video (I2V) generation has become a hot research topic. Recently, a team led by researchers Xiaoyu Shi and Zhaoyang Huang introduced a new framework called Motion-I2V, which achieves more consistent and controllable image-to-video generation through explicit motion modeling. This breakthrough not only improves the quality and consistency of generated videos but also gives users an unprecedented level of control.

In image-to-video generation, maintaining the continuity and controllability of generated videos has long been a technical challenge. Traditional I2V methods directly learn the complex mapping from images to videos, whereas Motion-I2V innovatively factorizes this process into two stages and introduces explicit motion modeling in each.

In the first stage, Motion-I2V proposes a diffusion-based motion field predictor that focuses on deducing the trajectories of the reference image's pixels. Conditioned on the reference image and a text prompt, this stage predicts the motion field maps between the reference frame and all future frames. The second stage then propagates the content of the reference image to the synthesized frames. By introducing a novel motion-augmented temporal layer, it strengthens 1-D temporal attention, enlarges the temporal receptive field, and reduces the difficulty of directly learning complex spatiotemporal patterns.
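
To make the two-stage idea concrete, here is a minimal PyTorch sketch of how such a motion-augmented temporal layer might work: each frame's features are first warped toward the reference frame along the Stage-1 motion fields, so that the 1-D temporal attention operates along pixel trajectories rather than at fixed spatial locations. All module names, shapes, and the warping scheme below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp(features, flow):
    """Backward-warp a feature map (B, C, H, W) along a flow field (B, 2, H, W)."""
    b, _, h, w = features.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=features.dtype),
        torch.arange(w, dtype=features.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).to(features) + flow
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid_x = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(
        features, torch.stack((grid_x, grid_y), dim=-1), align_corners=True
    )


class MotionAugmentedTemporalAttention(nn.Module):
    """1-D temporal attention applied after aligning each frame's features
    to the reference frame with the Stage-1 motion fields, so that attention
    follows pixel trajectories instead of a fixed spatial location."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, frame_feats, flows_to_ref):
        # frame_feats: (T, C, H, W); flows_to_ref: (T, 2, H, W)
        t, c, h, w = frame_feats.shape
        aligned = warp(frame_feats, flows_to_ref)       # align along trajectories
        tokens = aligned.flatten(2).permute(2, 0, 1)    # (H*W, T, C): one sequence per pixel
        out, _ = self.attn(tokens, tokens, tokens)
        return out.permute(1, 2, 0).reshape(t, c, h, w)


feats = torch.randn(8, 64, 32, 32)   # 8 frames of 64-channel features
flows = torch.randn(8, 2, 32, 32)    # Stage-1 motion fields to the reference
layer = MotionAugmentedTemporalAttention(64)
print(layer(feats, flows).shape)     # torch.Size([8, 64, 32, 32])
```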

In comparisons with existing methods, Motion-I2V demonstrates clear advantages. Whether the scene is "a fast-moving tank," "a blue BMW car moving quickly," "three clear ice blocks," or "a crawling snail," Motion-I2V generates more consistent videos, maintaining high-quality output even under large motion and viewpoint changes.

Moreover, Motion-I2V lets users precisely control motion trajectories and moving regions through sparse trajectory and region annotations, offering finer-grained control than text instructions alone. This improves the interactive experience and makes customized, personalized video generation possible.
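
As an illustration of how such sparse annotations might be fed to the motion predictor, the hypothetical encoding below rasterizes user drag strokes into a sparse displacement map plus a validity mask; the actual conditioning scheme in Motion-I2V may differ.

```python
import torch


def encode_sparse_trajectories(strokes, height, width):
    """strokes: list of ((x0, y0), (x1, y1)) drag annotations in pixels.
    Returns a (3, H, W) tensor: 2 channels of displacement, 1 validity mask."""
    hint = torch.zeros(3, height, width)
    for (x0, y0), (x1, y1) in strokes:
        xc = min(max(int(x0), 0), width - 1)
        yc = min(max(int(y0), 0), height - 1)
        hint[0, yc, xc] = x1 - x0   # horizontal displacement
        hint[1, yc, xc] = y1 - y0   # vertical displacement
        hint[2, yc, xc] = 1.0       # mark this pixel as user-constrained
    return hint


# A drag from (40, 60) to (90, 60): "move this point 50 px to the right".
hint = encode_sparse_trajectories([((40, 60), (90, 60))], 256, 256)
```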


It is worth mentioning that the second stage of Motion-I2V naturally supports zero-shot video-to-video translation: because this stage only propagates a reference frame along given motion fields, a source video can be converted to a different style or content without any task-specific training samples.
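
A hedged sketch of how this zero-shot path could be wired up, with `flow_estimator` and `stage2` as placeholder callables (these names and the overall flow are assumptions for illustration):

```python
import torch


def video_to_video(source_frames, stylized_first_frame, flow_estimator, stage2):
    """source_frames: (T, 3, H, W); stylized_first_frame: (3, H, W)."""
    # Estimate motion fields from the source clip (e.g., with an off-the-shelf
    # optical-flow model); this replaces Stage-1 prediction entirely.
    flows = torch.stack(
        [flow_estimator(source_frames[0], source_frames[t])
         for t in range(source_frames.shape[0])]
    )                                # (T, 2, H, W), frame 0 -> frame t
    # Stage 2 propagates the restyled reference frame along the borrowed
    # motion, so no fine-tuning on the new style is required.
    return stage2(stylized_first_frame, flows)
```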


The introduction of the Motion-I2V framework marks a new stage for image-to-video generation technology. It delivers significant improvements in quality and consistency and shows great potential for user controllability and personalized customization. As the technology continues to mature, we have every reason to believe that Motion-I2V will play an important role in fields such as film and television production, virtual reality, and game development, bringing people richer and more vivid visual experiences.

Project page: https://xiaoyushi97.github.io/Motion-I2V/

GitHub repository: https://github.com/G-U-N/Motion-I2V