ZhipuAI has announced the open-sourcing of its video generation model, CogVideoX, aiming to accelerate the development and application of video generation technology. Built on advanced large-scale model technology, CogVideoX meets the demands of commercial-grade applications. The currently open-sourced CogVideoX-2B version requires only 18GB of GPU memory for inference at FP16 precision and 40GB for fine-tuning, so it can run inference on a single RTX 4090 and be fine-tuned on a single A6000.
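
For reference, the sketch below shows what FP16 inference might look like through the Hugging Face diffusers integration of CogVideoX-2B; the prompt, sampling parameters, and memory-saving calls are illustrative assumptions, so consult the repository README for the exact, up-to-date API.

```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the 2B checkpoint in FP16, matching the memory figures quoted above
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    torch_dtype=torch.float16,
)

# Optional memory savings so the pipeline fits on a single consumer GPU
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

# Illustrative prompt and sampling settings
video = pipe(
    prompt="A panda playing guitar in a bamboo forest",
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```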

CogVideoX employs a 3D Variational Autoencoder (3D VAE), which compresses the spatial and temporal dimensions of a video simultaneously through 3D convolutions, achieving higher compression rates and better reconstruction quality. The model comprises an encoder, a decoder, and a latent space regularizer, with temporally causal convolutions ensuring that information flows only forward in time. On top of this, an expert Transformer processes the encoded video latents together with text inputs to generate high-quality video content.
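
As a rough illustration of the temporally causal convolution mentioned above, the following sketch pads only toward the past on the time axis so that frame t never sees later frames; layer names and sizes are assumptions for illustration, not the actual CogVideoX implementation, and the real encoder additionally downsamples in space and time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that pads only toward the past on the time axis,
    so each output frame depends solely on current and earlier frames."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.time_pad = kernel_size - 1      # all temporal padding on the "past" side
        self.space_pad = kernel_size // 2    # symmetric padding in H and W
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):                    # x: (B, C, T, H, W)
        x = F.pad(
            x,
            (self.space_pad, self.space_pad,   # W
             self.space_pad, self.space_pad,   # H
             self.time_pad, 0),                # T: causal, pad past only
        )
        return self.conv(x)

# Toy example: a 16-frame clip passed through two causal 3D conv layers
x = torch.randn(1, 3, 16, 64, 64)
y = nn.Sequential(CausalConv3d(3, 8), nn.SiLU(), CausalConv3d(8, 8))(x)
print(y.shape)  # time dimension preserved, no leakage from future frames
```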


To train CogVideoX, ZhipuAI developed a method for screening high-quality video data, excluding videos that are heavily edited or have inconsistent motion, to ensure the quality of the training set. In addition, a pipeline from image captioning to video captioning was built to compensate for the lack of textual descriptions in video data.
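
A hedged sketch of that image-caption-to-video-caption idea is shown below: caption sampled frames with an off-the-shelf image captioner, then merge the frame captions into a single video-level description. The specific models and function names here are assumptions for illustration, not ZhipuAI's actual pipeline.

```python
from transformers import pipeline

# Off-the-shelf components chosen for illustration only
frame_captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def caption_video(frame_paths):
    # 1. Dense per-frame captions from an image captioning model
    frame_captions = [frame_captioner(p)[0]["generated_text"] for p in frame_paths]
    # 2. Merge the frame captions into one video-level description
    merged = " ".join(frame_captions)
    return summarizer(merged, max_length=60, min_length=15)[0]["summary_text"]

print(caption_video(["frame_000.jpg", "frame_008.jpg", "frame_016.jpg"]))
```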

In performance evaluations, CogVideoX excels on multiple metrics, including benchmarks for human actions, scenes, and dynamic level, as well as evaluation tools focused on video dynamics. ZhipuAI says it will continue to explore innovations in video generation, including new model architectures, video information compression, and the fusion of text and video content.

Code Repository:

https://github.com/THUDM/CogVideo

Model Download:

https://huggingface.co/THUDM/CogVideoX-2b

Technical Report:

https://github.com/THUDM/CogVideo/blob/main/resources/CogVideoX.pdf