In the digital media era, video has become our primary means of self-expression and storytelling. However, creating high-quality videos typically requires professional skills and expensive equipment. Now, with Snap Video, you can automatically generate videos just by describing the scene you want in text.
Current image generation models have already demonstrated excellent quality and diversity, and inspired by this, researchers have begun applying them to video generation. However, video content is highly redundant, and naively carrying image models over to the video domain reduces motion fidelity, visual quality, and scalability.
Snap Video is a video-first model that systematically addresses these challenges. First, it extends the EDM diffusion framework to account for spatially and temporally redundant pixels, making it naturally suited to video generation. Second, it proposes a new transformer-based architecture that trains about 3.31 times faster than a U-Net and runs about 4.5 times faster at inference. This allows Snap Video to efficiently train text-to-video models with billions of parameters, reach state-of-the-art results, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
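To make the diffusion-side change concrete, below is a minimal sketch of EDM-style preconditioning (Karras et al., 2022) with a hypothetical redundancy factor that rescales the effective noise level for highly redundant spatio-temporal inputs. The factor's exact form and the function names are illustrative assumptions, not Snap Video's actual formulation.

```python
import torch

SIGMA_DATA = 0.5  # EDM's assumed standard deviation of the data distribution


def edm_preconditioning(sigma: torch.Tensor, redundancy: float = 1.0):
    """EDM-style preconditioning coefficients (Karras et al., 2022).

    `redundancy` is a hypothetical scale factor standing in for a
    spatio-temporal correction: it inflates the effective noise level so that
    highly redundant video frames are noised comparably to a lower-resolution,
    less redundant signal. Snap Video's actual modification differs; this is a sketch.
    """
    eff_sigma = sigma * redundancy
    c_skip = SIGMA_DATA**2 / (eff_sigma**2 + SIGMA_DATA**2)
    c_out = eff_sigma * SIGMA_DATA / torch.sqrt(eff_sigma**2 + SIGMA_DATA**2)
    c_in = 1.0 / torch.sqrt(eff_sigma**2 + SIGMA_DATA**2)
    c_noise = torch.log(eff_sigma) / 4.0
    return c_skip, c_out, c_in, c_noise


def denoise(model, noisy_video, sigma, redundancy=2.0):
    """Wrap a raw network `model` into an EDM denoiser D(x; sigma)."""
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma, redundancy)
    # The network sees a rescaled input plus a (log-)noise conditioning signal.
    prediction = model(c_in * noisy_video, c_noise)
    return c_skip * noisy_video + c_out * prediction
```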
Technical Highlights:
Spatio-temporal Joint Modeling: Snap Video can synthesize videos with large motion while retaining the semantic control of large-scale text-to-video generators.
High-resolution Video Generation: Using a two-stage cascaded model, it first generates low-resolution videos and then performs high-resolution upsampling, avoiding potential temporal inconsistency issues.
Architecture Based on FIT: Snap Video adopts the FIT (Far-reaching Interleaved Transformers) architecture, which learns compressed video representations to perform efficient spatio-temporal joint modeling, as sketched below.
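To illustrate how a FIT-style block can model space and time jointly on a compressed representation, the sketch below cross-attends from a small set of learned latent tokens into the dense video patch tokens, processes the latents, and writes the result back. Dimensions, layer structure, and the omission of FIT's group-wise local attention are simplifying assumptions rather than Snap Video's actual architecture.

```python
import torch
import torch.nn as nn


class FITBlock(nn.Module):
    """One interleaved block in the spirit of FIT (Chen & Li, 2023).

    Patch tokens carry the dense spatio-temporal video signal; a much smaller
    set of learned latent tokens carries the compressed representation on which
    most of the computation happens. This is a simplified sketch, not Snap
    Video's implementation.
    """

    def __init__(self, dim: int = 512, num_latents: int = 256, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.process = nn.TransformerEncoderLayer(
            dim, heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, T*H*W flattened spatio-temporal patches, dim)
        latents = self.latents.expand(patch_tokens.shape[0], -1, -1)
        # "Read": cross-attend from the latents into the dense patch tokens.
        latents = latents + self.read(latents, patch_tokens, patch_tokens)[0]
        # Heavy computation happens only on the compressed latent set.
        latents = self.process(latents)
        # "Write": push the updated latent information back to the patch tokens.
        patch_tokens = patch_tokens + self.write(patch_tokens, latents, latents)[0]
        return patch_tokens


# Example: 8 frames of 16x16 patches processed jointly in space and time.
tokens = torch.randn(1, 8 * 16 * 16, 512)
print(FITBlock()(tokens).shape)  # torch.Size([1, 2048, 512])
```

Because self-attention runs only over a few hundred latent tokens rather than thousands of spatio-temporal patches, most of the computation scales with the compressed representation, which is what makes joint modeling of space and time tractable.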
Snap Video has been evaluated on widely used benchmarks such as UCF101 and MSR-VTT, showing particular strength in generating high-quality motion. User studies also show that Snap Video outperforms recent methods in video-text alignment and in the amount and quality of motion.
The paper also reviews related work in video generation, including methods based on adversarial training and autoregressive generation, as well as recent progress in applying diffusion models to the text-to-video task.
By treating videos as first-class citizens, Snap Video systematically addresses common issues in the diffusion process and architecture for text-to-video generation: it modifies the EDM diffusion framework and adopts a FIT-based architecture, significantly improving the quality and scalability of video generation.
Paper Address: https://arxiv.org/pdf/2402.14797