Recently, ZhipuAI released the open-source video generation model CogVideoX-5B, which not only surpasses its predecessor, CogVideoX-2B, in video generation quality and visual fidelity, but also ships with significantly optimized inference. Thanks to these optimizations, the older CogVideoX-2B can now run on a GTX 1080 Ti, while "sweet spot" desktop graphics cards such as the RTX 3060 can comfortably handle the new 5B model.

Detailed parameter comparison between CogVideoX-5B and CogVideoX-2B:

[Image: parameter comparison table]

This large-scale DiT (Diffusion Transformer) model is designed for text-to-video generation. Its underlying technology includes a 3D causal Variational Autoencoder (VAE), which compresses video data into a latent space along both the spatial and temporal dimensions and decodes it causally along the time dimension for efficient video reconstruction.
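To make the compression concrete, here is a minimal sketch of how a 3D causal VAE maps a pixel-space clip to a latent grid. The factors used (4x temporal, 8x spatial compression, 16 latent channels) match the configuration reported in the CogVideoX paper, but treat them as assumptions; the causal design encodes the first frame on its own, so a clip of 1 + 4k frames yields 1 + k latent frames.

```python
def latent_shape(frames, height, width, st=4, ss=8, zc=16):
    """Latent grid produced by a 3D causal VAE.

    st: temporal compression factor (assumed 4)
    ss: spatial compression factor (assumed 8)
    zc: latent channels (assumed 16)

    The causal convolution encodes frame 0 alone, so frame count
    must be 1 + k*st; the remaining k*st frames compress k-fold.
    """
    assert (frames - 1) % st == 0, "frame count must be 1 + k*st"
    assert height % ss == 0 and width % ss == 0
    t = 1 + (frames - 1) // st
    return (zc, t, height // ss, width // ss)
```

For example, a 49-frame 720x480 clip (the resolution CogVideoX targets) would compress to a (16, 13, 60, 90) latent, which is what the transformer actually denoises.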

Additionally, an Expert Transformer fuses text embeddings with video embeddings, using 3D-RoPE (3D rotary position embedding) as the positional encoding. Expert adaptive layer normalization handles the two modalities separately, while a 3D full-attention mechanism performs joint spatiotemporal modeling.
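The idea behind 3D-RoPE can be sketched as follows: each video token carries a (time, height, width) coordinate, the channel dimension is partitioned into three chunks, and standard 1-D rotary encoding is applied per chunk using the corresponding axis coordinate. The half/quarter/quarter channel split below is an assumption for illustration, not necessarily the model's exact configuration.

```python
import numpy as np

def rope_1d(pos, dim, base=10000.0):
    """Standard 1-D rotary angles for integer positions over `dim` channels (dim even)."""
    freqs = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(pos, freqs)                         # (n_tokens, dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rotary(x, cos, sin):
    """Rotate consecutive channel pairs of x (..., dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, T, H, W):
    """Apply rotary encoding per axis on disjoint channel chunks.

    x: (T*H*W, dim) tokens in (t, h, w) raster order.
    Assumed split: half the channels encode time, a quarter each height/width.
    """
    dim = x.shape[-1]
    dt, dh = dim // 2, dim // 4
    # Per-token axis coordinates in raster order.
    tt, hh, ww = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    coords = [tt.ravel(), hh.ravel(), ww.ravel()]
    slices = [slice(0, dt), slice(dt, dt + dh), slice(dt + dh, dim)]
    out = x.copy()
    for sl, pos in zip(slices, coords):
        d = sl.stop - sl.start
        cos, sin = rope_1d(pos, d)
        out[..., sl] = apply_rotary(x[..., sl], cos, sin)
    return out
```

Because rotations are norm-preserving, the encoding changes only the relative orientation of token features, which is what lets attention scores depend on relative (t, h, w) offsets.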

Code: https://top.aibase.com/tool/cogvideox

Model Download: https://huggingface.co/THUDM/CogVideoX-5b

Paper Link: https://arxiv.org/pdf/2408.06072