The ModelScope community recently announced the official open-source release of CogVideoX-5B, a larger version of the domestically developed, Sora-style open-source video generation model CogVideoX.

Compared to the previous CogVideoX-2B, the new model delivers significantly better video quality and visual fidelity.


CogVideoX-5B is a large-scale DiT (diffusion transformer) model designed specifically for text-to-video generation. It employs a 3D causal variational autoencoder (3D causal VAE) and an expert Transformer architecture, fusing text and video embeddings, applying 3D-RoPE for positional encoding, and performing joint spatiotemporal modeling with a 3D full-attention mechanism.
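
To make the positional encoding concrete, below is a simplified PyTorch sketch of how a 3D rotary embedding can be applied to flattened video tokens: the attention head dimension is partitioned across the temporal, height, and width axes of the latent grid, and each slice is rotated by its own axis position. The function names and the split ratio are illustrative assumptions and do not reproduce CogVideoX's exact implementation.

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    # Rotary frequencies for one axis; returns (len(positions), dim // 2) angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.to(torch.float32), inv_freq)

def rotate(x, angles):
    # Apply a rotary embedding to the last dimension of x (size 2 * angles.shape[-1]).
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_3d_rope(q, t, h, w):
    """q: (t*h*w, head_dim) flattened video tokens for one attention head.
    The head dimension is split across the three axes; the (1/4, 3/8, 3/8)
    ratio used here is illustrative, not the model's exact partition."""
    head_dim = q.shape[-1]
    d_t = head_dim // 4
    d_h = 3 * head_dim // 8
    d_w = head_dim - d_t - d_h

    # Per-token positions along each axis of the (t, h, w) latent grid.
    grid_t, grid_h, grid_w = torch.meshgrid(
        torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
    )
    pos_t, pos_h, pos_w = grid_t.reshape(-1), grid_h.reshape(-1), grid_w.reshape(-1)

    q_t, q_h, q_w = q.split([d_t, d_h, d_w], dim=-1)
    q_t = rotate(q_t, rope_angles(pos_t, d_t))
    q_h = rotate(q_h, rope_angles(pos_h, d_h))
    q_w = rotate(q_w, rope_angles(pos_w, d_w))
    return torch.cat([q_t, q_h, q_w], dim=-1)

# Example: 8 latent frames of a 30x45 latent grid, 64-dim attention head.
q = torch.randn(8 * 30 * 45, 64)
q_rot = apply_3d_rope(q, t=8, h=30, w=45)
```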

Additionally, the model incorporates progressive training techniques, enabling it to generate coherent, longer-duration videos with pronounced motion.

Model Link:

https://modelscope.cn/models/ZhipuAI/CogVideoX-5b
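
For readers who want to try the model, the sketch below shows one possible way to download the weights from ModelScope and run text-to-video inference, assuming a recent version of the diffusers library that includes CogVideoX support and the modelscope package for downloading; the prompt and sampling parameters are illustrative only.

```python
import torch
from modelscope import snapshot_download
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Fetch the model weights from the ModelScope hub (link above).
model_dir = snapshot_download("ZhipuAI/CogVideoX-5b")

pipe = CogVideoXPipeline.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # trade speed for lower GPU memory usage

# Generate a short clip from a text prompt (illustrative settings).
video = pipe(
    prompt="A panda playing an acoustic guitar in a sunlit bamboo forest",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "cogvideox_5b_sample.mp4", fps=8)
```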