Today, the Zhipu AI technology team released and open-sourced its latest video generation model, CogVideoX v1.5, marking another significant step for the CogVideoX series since its initial open-source release in August.
According to the announcement, this update substantially upgrades video generation: it supports 5-second and 10-second clips, 768P resolution, and 16 fps generation. In addition, the I2V (image-to-video) model now accepts any aspect ratio and shows a stronger grasp of complex semantics.
CogVideoX v1.5 comprises two main models, CogVideoX v1.5-5B and CogVideoX v1.5-5B-I2V, giving developers more powerful video generation tools.
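For orientation, here is a minimal sketch of driving the text-to-video model through the Hugging Face diffusers integration. The checkpoint id, memory settings, and sampling parameters are illustrative assumptions, not official defaults; consult the repo README for supported settings.

```python
# Minimal sketch: text-to-video with the CogVideoX pipeline in diffusers.
# The checkpoint id and parameters below are illustrative assumptions,
# not official defaults.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B",      # assumed diffusers-format checkpoint
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()    # trade speed for lower VRAM use

video = pipe(
    prompt="A panda playing guitar by a waterfall, cinematic lighting",
    num_frames=81,                 # roughly 5 s at 16 fps (assumed)
    num_inference_steps=50,
    guidance_scale=6.0,
).frames[0]

export_to_video(video, "output.mp4", fps=16)
```

For the I2V variant, diffusers provides an analogous `CogVideoXImageToVideoPipeline` that takes an `image` argument alongside the prompt.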
More notably, CogVideoX v1.5 launches simultaneously on the Qingying platform, where it is paired with the newly released CogSound audio model to form the "New Qingying". The New Qingying offers several distinctive capabilities, including markedly better video quality, aesthetic expression, and motion plausibility, and supports generating 10-second, 4K, 60-frame ultra-high-definition videos.
The official feature list is as follows:
Quality enhancement: Significant improvements in image-to-video quality, aesthetic expression, motion plausibility, and understanding of complex prompts.
Ultra-high-definition resolution: Supports generating 10-second, 4K, 60-frame ultra-high-definition videos.
Variable aspect ratios: Supports any aspect ratio, adapting to different playback scenarios.
Multiple outputs: A single prompt/image can generate 4 videos at once.
AI video with sound effects: The New Qingying can generate sound effects matching the visuals.
On the data side, the CogVideoX team focused on raising data quality: it built an automated screening framework to filter out low-quality video clips and developed an end-to-end video understanding model, CogVLM2-caption, to generate accurate content descriptions for training data. The captioner handles complex instructions well, helping ensure that generated videos match user intent.
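The article does not describe the screening criteria, so the sketch below only illustrates the general shape of such a filter; the metric names and thresholds are entirely hypothetical.

```python
# Illustrative sketch of an automated video screening step.
# Metric names and thresholds are hypothetical; the actual CogVideoX
# filtering criteria are not detailed in this article.
from dataclasses import dataclass

@dataclass
class ClipStats:
    motion_score: float      # e.g. mean optical-flow magnitude
    aesthetic_score: float   # e.g. output of an aesthetic predictor
    has_watermark: bool
    caption: str             # produced by a captioner such as CogVLM2-caption

def keep_clip(stats: ClipStats) -> bool:
    """Return True if the clip passes all quality gates."""
    if stats.has_watermark:
        return False
    if stats.motion_score < 0.2:       # drop near-static footage
        return False
    if stats.aesthetic_score < 4.5:    # drop visually poor clips
        return False
    return len(stats.caption.strip()) > 0  # require a usable caption
```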
To enhance content coherence, CogVideoX employs an efficient 3D variational autoencoder (3D VAE) that compresses video along both spatial and temporal dimensions, significantly reducing training cost and difficulty. The team also developed a Transformer architecture that fuses the text, time, and space dimensions: it removes the traditional cross-attention module and instead lets text and video tokens interact directly, improving video generation quality.
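To make the "no cross-attention" point concrete, here is a minimal PyTorch sketch of joint self-attention over concatenated text and video tokens. The dimensions are arbitrary; this illustrates the idea, not Zhipu's actual architecture.

```python
# Sketch of joint full attention over concatenated text + video tokens,
# replacing a separate cross-attention module. Dimensions are arbitrary;
# this illustrates the idea, not the actual CogVideoX implementation.
import torch
import torch.nn as nn

dim, heads = 512, 8
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

text_tokens = torch.randn(1, 77, dim)     # encoded prompt
video_tokens = torch.randn(1, 4096, dim)  # patchified 3D video latents

# Concatenate along the sequence axis and run one self-attention pass:
# every video token attends to every text token (and vice versa)
# without a dedicated cross-attention block.
x = torch.cat([text_tokens, video_tokens], dim=1)
out, _ = attn(x, x, x)

text_out, video_out = out[:, :77], out[:, 77:]
print(video_out.shape)  # torch.Size([1, 4096, 512])
```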
Going forward, the Zhipu AI technology team plans to scale up data volume and model size and to explore more efficient model architectures for a better video generation experience. Open-sourcing CogVideoX v1.5 not only gives developers a powerful tool but also injects fresh vitality into the video creation field.
Code: https://github.com/thudm/cogvideo
Model: https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT
Key Points:
🌟 CogVideoX v1.5 is now open-source, supporting 5/10-second videos, 768P resolution, and 16 fps generation.
🎨 The New Qingying platform launches with the CogSound audio model, offering ultra-high-definition 4K, 60-frame video generation.
📈 Data-processing and architectural innovations ensure the quality and coherence of generated videos.