Meta, in collaboration with King Abdullah University of Science and Technology (KAUST) in Saudi Arabia, has recently introduced a new family of video diffusion models: MarDini. The model simplifies and improves the creation of high-quality videos, supporting a variety of tasks such as filling in missing frames, converting single images into dynamic scenes, and even extending short clips with natural, continuous new frames.

Building on last year's efforts, Meta has continued to push forward in AI-generated video. It previously launched the text-to-video and editing models Emu Video and Emu Edit, and ahead of MarDini's release this year it also introduced the advanced video editor Movie Gen. All of this shows Meta's commitment to giving video creators more powerful tools.

The strength of MarDini lies in its ability to generate videos conditioned on an arbitrary number of masked frames, which lets a single model support tasks such as video interpolation, image-to-video generation, and video expansion.

Image-to-Video Results

One of MarDini's main applications is image-to-video generation. To demonstrate this capability, the official example places a single reference frame in the middle of the sequence as a conditioning input and generates 16 additional frames around it; the resulting 17-frame video, rendered at 8 FPS, plays as a smooth clip of roughly 2 seconds.

Video Expansion Results

MarDini can also extend existing videos of arbitrary length. In the official example, a 5-frame reference video is extended to roughly 2 seconds by adding 12 new frames to each sequence.

Video Interpolation Results

MarDini performs video interpolation by using the first and last frames as conditioning signals and generating the frames in between. When those boundary frames are identical, it can produce seamlessly looping videos.
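
All three generation modes above come down to the same mechanism: a mask over the frame sequence marks which frames are given as conditioning and which must be generated. The sketch below is purely illustrative; the helper name and the exact mask layouts are assumptions inferred from the examples in this article, not MarDini's actual API.

```python
import torch

def make_frame_mask(task: str, num_frames: int = 17) -> torch.Tensor:
    """Build a binary conditioning mask: 1 = frame is given, 0 = frame to generate.

    Hypothetical helper; the layouts mirror the three tasks described above.
    """
    mask = torch.zeros(num_frames)
    if task == "image_to_video":
        mask[num_frames // 2] = 1       # single reference frame in the middle
    elif task == "interpolation":
        mask[0] = mask[-1] = 1          # first and last frames are given
    elif task == "expansion":
        mask[:5] = 1                    # a 5-frame reference clip leads the sequence
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

print(make_frame_mask("interpolation"))  # tensor([1., 0., ..., 0., 1.])
```

For a looping video, interpolation is simply run with identical first and last frames.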

MarDini's working principle is intriguing. It pairs two components: a planning model and a generation model. First, the planning model uses masked autoregression (MAR) to interpret low-resolution input frames and produce guidance signals for the frames to be generated. Then, a lightweight generation model renders high-resolution, detailed frames through a diffusion process, ensuring the final video is smooth and visually appealing.
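
To make the two-stage flow concrete, here is a heavily simplified PyTorch sketch. The submodules are stand-in convolutions, not Meta's actual architecture; only the data flow described above is reproduced: MAR-style planning on low-resolution frames, followed by a diffusion-style generator that denoises high-resolution frames under that guidance.

```python
import torch
from torch import nn

class TwoStageSketch(nn.Module):
    """Illustrative data flow only: planner at low resolution, generator at high."""

    def __init__(self, channels: int = 3, guide_dim: int = 64):
        super().__init__()
        # Planning model: reads low-res frames (masked frames zeroed out, plus the
        # mask itself) and emits a guidance signal for every frame.
        self.planner = nn.Conv3d(channels + 1, guide_dim, kernel_size=3, padding=1)
        # Generation model: one denoising step conditioned on the guidance.
        self.generator = nn.Conv3d(channels + guide_dim, channels, kernel_size=3, padding=1)

    def forward(self, lowres, mask, noisy_highres):
        # mask: (B, 1, T, H, W) with 1 = known frame, 0 = frame to generate
        guidance = self.planner(torch.cat([lowres * mask, mask], dim=1))
        # Upsample the guidance to the high-resolution grid before conditioning.
        guidance = nn.functional.interpolate(guidance, size=noisy_highres.shape[-3:])
        return self.generator(torch.cat([noisy_highres, guidance], dim=1))

model = TwoStageSketch()
low = torch.randn(1, 3, 17, 32, 32)      # 17 low-res frames
mask = torch.zeros(1, 1, 17, 32, 32)
mask[:, :, 0] = mask[:, :, -1] = 1       # interpolation-style conditioning
noisy = torch.randn(1, 3, 17, 128, 128)  # noisy high-res frames
print(model(low, mask, noisy).shape)     # torch.Size([1, 3, 17, 128, 128])
```

In the real system the generator runs iteratively as a diffusion sampler; the single forward pass here stands in for one denoising step.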

Unlike many video models that depend on complex pre-trained image models, MarDini is claimed to be trainable from scratch on unlabeled video data. This is possible because it adopts a progressive training strategy that flexibly adjusts frame masking during training, allowing the model to handle many different frame configurations.
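
The description above does not spell out the exact masking schedule, so the following is only a plausible sketch of what "progressively and flexibly adjusting the masking of frames" could look like: the fraction of masked frames grows as training progresses, and the masked positions are randomized so the model encounters interpolation-, expansion-, and image-to-video-style configurations.

```python
import random
import torch

def sample_training_mask(num_frames: int, progress: float) -> torch.Tensor:
    """Assumed progressive schedule: `progress` runs from 0.0 to 1.0 over training.

    Early on, most frames stay visible (easy targets); later, more frames are
    masked at random positions (harder, more varied configurations).
    """
    max_ratio = 0.2 + 0.75 * progress                 # ceiling on the masked fraction
    num_masked = int(num_frames * random.uniform(0.1, max_ratio))
    mask = torch.ones(num_frames)                     # 1 = visible, 0 = masked
    mask[torch.randperm(num_frames)[:num_masked]] = 0
    return mask

for step in (0, 5_000, 10_000):
    print(step, sample_training_mask(17, step / 10_000))
```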

A notable feature of MarDini is its combination of flexibility and performance. It is not only capable but also efficient enough for larger-scale workloads. The model covers video interpolation, image-to-video generation, and video expansion alike; whether smoothing an existing clip or creating a complete sequence from scratch, it handles the job with ease.

In terms of performance, MarDini sets new benchmarks, generating high-quality videos in fewer steps, making it more cost-effective and time-efficient than more complex alternatives. The official research paper states, "Our studies show that our modeling strategy is competitive in various interpolation and animation benchmarks, while reducing computational requirements at comparable parameter scales."

Project entry: https://mardini-vidgen.github.io/

Key points:

✨ MarDini is a new video generation model launched by Meta in collaboration with KAUST, capable of handling a wide range of video creation tasks with ease.

🎥 The model achieves efficient video interpolation and image-to-video generation through the combination of planning and generation models.

💡 MarDini generates high-quality videos in fewer steps, significantly enhancing the flexibility and efficiency of creation.