ChinaZ.com (Webmaster's Home), June 17 news: A joint team from Peking University and Kuaishou AI has tackled the challenge of generating complex videos. Their new framework, VideoTetris, assembles visual details like jigsaw-puzzle pieces to produce videos from difficult compositional instructions, and it outperforms commercial models such as Pika and Gen-2 on compositional video generation tasks.
The VideoTetris framework first formalizes the task of compositional video generation, covering two sub-tasks: 1) video generation that follows complex compositional prompts; 2) long video generation that follows progressive compositional prompts involving multiple objects. The team found that almost all existing open-source and commercial models fail on such inputs. For example, given the prompt "a cute brown dog on the left, and a cat napping in the sun on the right," the resulting video often blends the two animals into one strange hybrid.
VideoTetris, in contrast, successfully preserves all positional information and detail features. For long video generation it supports even more complex instructions, such as transitioning from "a cute brown squirrel on a pile of hazelnuts" to "a cute brown squirrel and a cute white squirrel on a pile of hazelnuts." The generated sequence follows the prompt, and the two squirrels even pass food to each other naturally.
VideoTetris achieves this through a spatio-temporal compositional diffusion method. It first decomposes the text prompt along the time axis, assigning different sub-prompts to different video frames. Within each frame, it then decomposes along the spatial dimension, mapping each object to its own region of the frame. Finally, it recombines the pieces via spatio-temporal cross-attention, enabling efficient generation from compositional instructions.
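The article gives no implementation details, but the decomposition idea can be pictured with a short sketch. The Python (PyTorch) snippet below is a minimal illustration only: `temporal_decompose`, `region_masked_cross_attention`, and the toy single-head attention are hypothetical names and simplifications, not the authors' code.

```python
import torch
import torch.nn.functional as F

def temporal_decompose(prompt_schedule, num_frames):
    """Map each frame index to the sub-prompt active at that time.

    prompt_schedule: list of (start_frame, prompt) pairs sorted by
    start_frame; assumes the first entry starts at frame 0.
    """
    frame_prompts = []
    for f in range(num_frames):
        active = [p for start, p in prompt_schedule if start <= f]
        frame_prompts.append(active[-1])
    return frame_prompts

def region_masked_cross_attention(latent, text_embeds, masks):
    """Compose per-region cross-attention outputs for one frame.

    latent:      (hw, d) flattened spatial features
    text_embeds: one (tokens_i, d) tensor per sub-prompt
    masks:       one (hw, 1) binary mask per sub-prompt (its region)
    """
    out = torch.zeros_like(latent)
    for emb, mask in zip(text_embeds, masks):
        attn = F.softmax(latent @ emb.T / emb.shape[-1] ** 0.5, dim=-1)
        # Each spatial region attends only to its own sub-prompt.
        out = out + mask * (attn @ emb)
    return out

# Example: two sub-prompts mapped to the left/right halves of a 4x4 grid.
hw, d = 16, 8
latent = torch.randn(hw, d)
embeds = [torch.randn(5, d), torch.randn(5, d)]          # e.g. "dog", "cat" tokens
left = (torch.arange(hw) % 4 < 2).float().unsqueeze(1)   # left half of each row
masks = [left, 1.0 - left]
fused = region_masked_cross_attention(latent, embeds, masks)
```

The key point the sketch captures is that each object's prompt only influences its assigned region, which is why the "dog on the left, cat on the right" layout survives instead of the two objects merging.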
To generate higher-quality long videos, the team also introduced an enhanced preprocessing method for training data that makes long video generation more dynamically stable. In addition, a reference frame attention mechanism encodes information from previous frames with the native VAE, rather than the CLIP encoder used by other models, which yields better content consistency.
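The article does not show how reference frame attention is wired in; the sketch below is one plausible reading, assuming standard query/key/value attention in which the current frame's latent queries the previous frame's VAE latent. The class name `ReferenceFrameAttention` and the residual injection are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceFrameAttention(nn.Module):
    """Toy attention from the current frame onto a reference frame latent."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, cur_latent, ref_latent):
        # cur_latent, ref_latent: (batch, tokens, dim) flattened VAE latents.
        # Both come from the same VAE, so they live in the same latent space,
        # which is what makes consistency easier than with CLIP features.
        q = self.to_q(cur_latent)
        k = self.to_k(ref_latent)
        v = self.to_v(ref_latent)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return cur_latent + attn @ v  # residual injection of reference content

# Usage (assuming a diffusers-style AutoencoderKL named `vae` is available):
#   ref = vae.encode(prev_frame).latent_dist.sample()  # native VAE latents
#   ref = ref.flatten(2).transpose(1, 2)               # (B, H*W, C)
```

The design choice the article highlights is the encoder: VAE latents preserve spatial detail that CLIP's global image embedding discards, which plausibly explains the improved frame-to-frame consistency.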
With these optimizations, long videos no longer show large patches of color drift, adapt better to complex instructions, and look more dynamic and natural. The team also introduced two new evaluation metrics, VBLIP-VQA and VUnidet, the first to extend compositional generation evaluation from images to video.
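The article names the metrics but not their mechanics. One common way to lift an image-level compositional score to video is to score sampled frames and aggregate; the sketch below illustrates only that pattern, with `frame_score` as a hypothetical stand-in for an image metric such as BLIP-VQA, and should not be read as the paper's exact procedure.

```python
from typing import Callable, Sequence

def video_metric(frames: Sequence, prompt: str,
                 frame_score: Callable[[object, str], float],
                 stride: int = 4) -> float:
    """Average an image-level compositional score over every `stride`-th frame."""
    sampled = frames[::stride]
    scores = [frame_score(f, prompt) for f in sampled]
    return sum(scores) / len(scores)

# e.g. video_metric(frames, "a dog on the left and a cat on the right",
#                   frame_score=my_blip_vqa_score, stride=8)
```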
In experiments, VideoTetris outperforms all open-source models, and even commercial models such as Gen-2 and Pika, in compositional video generation. Reportedly, the code will be fully open-sourced.
Project link: https://top.aibase.com/tool/videotetris