ZhipuAI has launched CogVideoX, its next-generation video generation model, marking another significant step in the company's development of multimodal technology.

The core technical features of CogVideoX include:

  1. 3D Variational Autoencoder (3D VAE): ZhipuAI's proprietary architecture compresses raw video data to 2% of its original size, cutting training cost and difficulty. Combined with a 3D RoPE positional encoding module (sketched after this list), it better captures inter-frame relationships along the temporal dimension and establishes long-range dependencies across a video.

  2. End-to-end video understanding model: Improves the model's comprehension of text and adherence to instructions, so that generated videos match user intent more closely and the model can handle very long, complex prompts.

  3. A 3D transformer architecture unifying text, time, and space: A newly designed Expert Block aligns the text and video modality spaces, while a Full Attention mechanism optimizes interaction between the two modalities (see the second sketch after this list).
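
To make the 3D RoPE component in item 1 concrete, here is a minimal sketch of a factorized rotary embedding over the (time, height, width) axes of latent video tokens. The dimension split, base frequency, and function names are illustrative assumptions; the announcement only states that 3D RoPE strengthens the model's grasp of inter-frame, temporal relationships.

```python
# A minimal sketch of factorized 3D rotary position embedding (3D RoPE).
# Assumption: the head dimension is split into three chunks, one rotated by
# each axis's coordinate. Not ZhipuAI's actual implementation.
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply 1D rotary embedding to the last dim of x using positions pos."""
    d = x.shape[-1]  # must be even
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[:, None] * freqs[None, :]   # (n_tokens, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]   # interleaved 2D pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # standard 2D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
    """x: (t*h*w, head_dim) video tokens in (t, h, w) raster order.
    head_dim should be divisible by 8 so every chunk has even size."""
    d = x.shape[-1]
    dt, dh = d // 2, d // 4  # assumed split: half for time, quarters for h, w
    # Recover each token's (t, h, w) coordinates from its flat index.
    idx = torch.arange(t * h * w)
    pt = (idx // (h * w)).float()
    ph = ((idx // w) % h).float()
    pw = (idx % w).float()
    return torch.cat([
        rope_1d(x[..., :dt], pt),
        rope_1d(x[..., dt:dt + dh], ph),
        rope_1d(x[..., dt + dh:], pw),
    ], dim=-1)

# Tiny usage example: 2 latent frames of a 4x4 latent grid, head dim 16.
q = torch.randn(2 * 4 * 4, 16)
print(rope_3d(q, t=2, h=4, w=4).shape)  # torch.Size([32, 16])
```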
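
Likewise for item 3, the following is a minimal sketch of joint "full attention" over concatenated text and video tokens, with per-modality ("expert") normalization applied before the shared attention. The module layout, dimensions, and the use of plain LayerNorm are illustrative assumptions, not ZhipuAI's actual Expert Block design.

```python
# A minimal sketch of full attention over concatenated text + video tokens,
# with modality-specific ("expert") norms. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class JointFullAttentionBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.norm_text = nn.LayerNorm(dim)   # text "expert" normalization
        self.norm_video = nn.LayerNorm(dim)  # video "expert" normalization
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # Normalize each modality with its own parameters, then attend jointly
        # so every text token can see every video token and vice versa.
        x = torch.cat([self.norm_text(text), self.norm_video(video)], dim=1)
        out, _ = self.attn(x, x, x)  # full (non-causal) self-attention
        return out

block = JointFullAttentionBlock()
text = torch.randn(1, 77, 64)    # e.g. 77 prompt tokens
video = torch.randn(1, 32, 64)   # e.g. 32 latent video patches
print(block(text, video).shape)  # torch.Size([1, 109, 64])
```

Because text and video tokens attend to each other within the same attention pass, prompt-to-frame alignment is handled inside the shared layers rather than through a separate cross-attention path.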

The CogVideoX model is now available on ZhipuAI's PC, mobile app, and mini-program platforms, where users can try AI-driven text-to-video and image-to-video generation for free through the "Qingying" feature. Qingying's main strengths include fast generation, faithful instruction following, content coherence, and flexible scene composition.

Additionally, the Zhipu Big Model Open Platform, bigmodel.cn, has also deployed "Qingying," allowing businesses and developers to use its features through API calls (a rough sketch follows below). ZhipuAI has validated the effectiveness of scaling laws in video generation and will continue to expand data volume and model size, and to research new model architectures that compress video information more efficiently and integrate text and video content more fully.
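
As an illustration of the API route, below is a minimal Python sketch that posts a text-to-video request to bigmodel.cn over HTTP. The endpoint path, payload fields, environment variable name, and response shape are assumptions for illustration only; the platform's API reference defines the actual contract.

```python
# A minimal sketch of calling a text-to-video endpoint on bigmodel.cn.
# Endpoint path, payload fields, and response handling are assumptions.
import os
import requests

API_KEY = os.environ["ZHIPUAI_API_KEY"]        # assumed env variable name
BASE = "https://open.bigmodel.cn/api/paas/v4"  # assumed base path

resp = requests.post(
    f"{BASE}/videos/generations",              # assumed endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": "cogvideox", "prompt": "A corgi surfing a wave at sunset"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # typically an async task to poll for the finished video
```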

Experience URL: https://top.aibase.com/tool/qingying-ai-shipinshengchengfuwu