CogView4, the latest open-source text-to-image model from Zhipu AI, has officially launched. Boasting 6 billion parameters, CogView4 fully supports Chinese input and the generation of images from Chinese text, earning it the title of "the first open-source model capable of rendering Chinese characters in images."


A core highlight of CogView4 is its support for bilingual (Chinese and English) prompts. It excels at understanding and following complex Chinese instructions, making it a boon for Chinese content creators. As the first open-source text-to-image model capable of generating Chinese characters within images, it fills a significant gap in the open-source landscape. Furthermore, the model supports generating images of arbitrary width and height and can handle prompts of any length, demonstrating exceptional flexibility.

CogView4's bilingual capabilities stem from a comprehensive upgrade of its technical architecture. Its text encoder has been upgraded to GLM-4, which accepts both Chinese and English input, overcoming the English-only limitation of earlier open-source text-to-image models. Reportedly, the model was trained on bilingual (Chinese-English) image-text pairs to ensure high-quality generation in Chinese contexts.
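For readers who want to try bilingual prompting, here is a minimal usage sketch based on the Hugging Face diffusers integration referenced in the project repository; the class name (CogView4Pipeline), model id, and settings shown are taken to be those in the README and may differ across versions.

```python
import torch
from diffusers import CogView4Pipeline

# Load the released checkpoint (model id as listed in the project repository).
pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# The GLM-4 text encoder accepts Chinese or English prompts directly.
prompt = "一只戴着红色围巾的柴犬坐在雪地里"  # "A Shiba Inu wearing a red scarf, sitting in the snow"
image = pipe(prompt, width=1024, height=1024, num_inference_steps=50).images[0]
image.save("shiba.png")
```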

In text processing, CogView4 abandons the traditional fixed-length design in favor of a dynamic text-length scheme. Because descriptive captions average 200-300 tokens, this cuts token redundancy by roughly 50% compared with the traditional fixed 512-token scheme and improves training efficiency by 5%-30%. The change not only saves compute but also lets the model handle prompts of widely varying lengths more efficiently.
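The redundancy figure follows from simple padding arithmetic. The sketch below (illustrative numbers only) compares the token slots consumed by fixed 512-token padding against padding to the per-batch maximum, one plausible reading of the dynamic scheme:

```python
# Back-of-envelope check of the padding-redundancy claim.
FIXED_LEN = 512        # traditional fixed-length text window
avg_caption = 250      # article cites an average of 200-300 tokens

waste = 1 - avg_caption / FIXED_LEN
print(f"padding redundancy at fixed length: {waste:.0%}")  # ~51%

def slots_used(lengths: list[int]) -> int:
    """Token slots consumed when padding only to the per-batch maximum."""
    return max(lengths) * len(lengths)

batch = [180, 240, 260, 300]                              # hypothetical caption lengths
print(f"fixed scheme:   {FIXED_LEN * len(batch)} slots")  # 2048
print(f"dynamic scheme: {slots_used(batch)} slots")       # 1200
```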

CogView4's ability to generate images at arbitrary resolution rests on several technical breakthroughs. The model employs mixed-resolution training, combined with two-dimensional rotary position embeddings (2D RoPE) and interpolated position representations, to adapt to different image sizes. In addition, a flow-matching diffusion formulation with parameterized linear dynamic noise scheduling further improves the quality and diversity of generated images.
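To make the flow-matching idea concrete, here is a generic training-step sketch, not CogView4's actual code: sample a point on the linear path between Gaussian noise and a clean latent, then regress the constant velocity along that path. The model signature is a placeholder assumption.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """Generic flow-matching step: interpolate linearly between noise and
    data, then regress the velocity field. Illustrative sketch only."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)     # uniform timesteps in [0, 1)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))          # broadcast over C, H, W
    xt = (1 - t_) * noise + t_ * x0                   # point on the linear path
    velocity = x0 - noise                             # d(xt)/dt is constant on this path
    pred = model(xt, t, cond)                         # placeholder signature
    return F.mse_loss(pred, velocity)
```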


The CogView4 training process is divided into several stages: base-resolution training first, then adaptation to arbitrary resolutions, followed by fine-tuning on high-quality data, and finally output optimization through human preference alignment. Throughout, the model retains the Share-param DiT architecture while introducing independent adaptive layer normalization for each modality, keeping the model stable and consistent across tasks.
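The "independent adaptive layer normalization for different modalities" plausibly means that text and image tokens share transformer weights but receive separate conditioning-driven scale and shift parameters. Below is a hedged sketch of that idea in a DiT-style block; the class, names, and shapes are assumptions for illustration, not CogView4's implementation.

```python
import torch
import torch.nn as nn

class PerModalityAdaLN(nn.Module):
    """Sketch of modality-specific adaptive LayerNorm: text and image tokens
    share the transformer weights (Share-param DiT), but each modality gets
    its own scale/shift from the conditioning vector. Illustrative only."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Separate modulation heads per modality.
        self.text_mod = nn.Linear(cond_dim, 2 * dim)
        self.image_mod = nn.Linear(cond_dim, 2 * dim)

    def forward(self, text_tokens, image_tokens, cond):
        # cond: (B, cond_dim); tokens: (B, N, dim)
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        i_scale, i_shift = self.image_mod(cond).chunk(2, dim=-1)
        text_out = self.norm(text_tokens) * (1 + t_scale.unsqueeze(1)) + t_shift.unsqueeze(1)
        image_out = self.norm(image_tokens) * (1 + i_scale.unsqueeze(1)) + i_shift.unsqueeze(1)
        return text_out, image_out
```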

Project: https://github.com/THUDM/CogView4