Beijing Zhipu Huazhang Technology Co., Ltd. has released and open-sourced CogVideoX v1.5, the latest version of its CogVideoX series of video generation models. Since its launch in early August, the series has become one of the leading model families in video generation and has proved popular with developers. Compared with its predecessor, CogVideoX v1.5 generates 5- or 10-second videos at 768P and 16 frames per second, and its image-to-video (I2V) model accepts input of any aspect ratio, which improves both the quality of image-to-video generation and its handling of complex semantics.
The open-source release includes two models: CogVideoX v1.5-5B and CogVideoX v1.5-5B-I2V. The new version will also be rolled out on the Qingying platform, where it is combined with the newly launched CogSound audio model to offer improved quality, ultra-high-definition resolution, variable aspect ratios for different playback scenarios, multi-channel output, and AI-generated video with sound effects.
On the technical side, CogVideoX v1.5 uses an automated filtering framework to remove video data with poor dynamic connectivity, and it uses the end-to-end video understanding model CogVLM2-caption to generate accurate descriptions of video content, which improves text comprehension and instruction following. The new version also employs an efficient three-dimensional variational autoencoder (3D VAE) to address content coherence, together with an independently developed Transformer architecture that integrates the text, time, and space dimensions. This architecture drops the traditional cross-attention module and uses expert adaptive layer normalization to make better use of timestep information in the diffusion model.
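To make the idea of expert adaptive layer normalization more concrete, here is a minimal, dependency-free sketch. It assumes that "expert" means each modality (text tokens and video tokens) has its own projection from the timestep embedding to a scale and a shift, applied after a plain layer normalization. The class name `ExpertAdaLN`, the random initialisation, and the tiny dimensions are hypothetical and chosen only for illustration; this is not the released implementation.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(vec, weight, bias):
    """Matrix-vector product; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


class ExpertAdaLN:
    """Hypothetical sketch: one timestep-to-(scale, shift) projection per modality."""

    def __init__(self, dim, t_dim, seed=0):
        rng = random.Random(seed)

        def make_projection():
            # Maps a timestep embedding of size t_dim to 2 * dim modulation values.
            weight = [[rng.uniform(-0.1, 0.1) for _ in range(t_dim)]
                      for _ in range(2 * dim)]
            bias = [0.0] * (2 * dim)
            return weight, bias

        self.dim = dim
        # Assumption: text and video tokens each get their own "expert" projection.
        self.experts = {"text": make_projection(), "video": make_projection()}

    def __call__(self, tokens, modalities, t_emb):
        out = []
        for tok, modality in zip(tokens, modalities):
            weight, bias = self.experts[modality]
            mod = linear(t_emb, weight, bias)
            scale, shift = mod[:self.dim], mod[self.dim:]
            normed = layer_norm(tok)
            # The timestep controls how each normalized feature is rescaled and shifted.
            out.append([n * (1.0 + s) + sh for n, s, sh in zip(normed, scale, shift)])
        return out


if __name__ == "__main__":
    ada = ExpertAdaLN(dim=4, t_dim=3)
    token = [1.0, 2.0, 3.0, 4.0]
    # The same vector is treated once as a text token and once as a video token.
    text_out, video_out = ada([token, token], ["text", "video"], [1.0, -0.5, 0.25])
    print(text_out)
    print(video_out)
```

Because the two modalities share one sequence but have separate modulation weights, the same timestep embedding can condition text and video features differently without a cross-attention module.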
For training, the team built an efficient diffusion model training framework that uses a range of parallel computing and time optimization techniques to train quickly on long video sequences. The company says it has verified that scaling laws hold in video generation, and it plans to scale up both data volume and model size while exploring new model architectures that compress video information more efficiently and integrate text with video content more closely.
Code: https://github.com/thudm/cogvideo
Model: https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT