On March 4, 2025, Beijing Zhipu AI Technology Co., Ltd. (Zhipu AI) announced CogView4, the first open-source text-to-image model that supports generating Chinese characters within images. The model achieved the highest overall score on the DPG-Bench benchmark, making it the state of the art (SOTA) among open-source text-to-image models, and it is the first image generation model released under the Apache 2.0 license.
CogView4 offers strong complex-semantic alignment and instruction-following capabilities. It accepts prompts of arbitrary length in both Chinese and English and can generate images at any resolution. Beyond producing high-quality images, it can seamlessly render Chinese characters within them, serving creative needs in advertising, short video, and other fields. Technically, CogView4 uses the bilingual GLM-4 text encoder and is trained on bilingual text-image pairs, which is what gives it its Chinese-English prompting capability.
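As a quick illustration of bilingual prompting, here is a minimal inference sketch using the CogView4Pipeline that ships with recent versions of the diffusers library, following the usage shown in the model repository; the exact arguments and recommended settings may differ, so check the README for the current API.

```python
import torch
from diffusers import CogView4Pipeline

# Load the open-source 6B checkpoint (bfloat16 keeps memory manageable).
pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

# A Chinese prompt that asks for rendered Chinese characters in the image.
prompt = "一张春节海报，红色背景，中央写着'新年快乐'四个金色大字"
image = pipe(
    prompt=prompt,
    width=1024,           # arbitrary resolutions are supported
    height=1024,
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("cogview4_sample.png")
```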
Supporting arbitrary-length prompts and arbitrary output resolutions significantly increases both creative freedom and training efficiency. CogView4 employs 2D rotary position embedding (2D RoPE) to model spatial position in images and uses interpolated position encodings to accommodate different resolutions. For generation, the model adopts a flow-matching formulation of diffusion, combined with a parameterized linear dynamic noise schedule that adapts to the signal-to-noise-ratio requirements of images at different resolutions.
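To make the positional-encoding idea concrete, below is a minimal 2D RoPE sketch: half of each rotated pair tracks the row index and half the column index, and when a target resolution differs from the base training resolution the positions are rescaled (interpolated) onto the base range. This is an illustrative sketch, not Zhipu's implementation; the actual axis-to-dimension pairing in CogView4 may differ.

```python
import torch

def rope_2d(h, w, dim, base=10000.0, base_size=None):
    """Build 2D RoPE angles for an h x w grid of image tokens.
    If base_size=(H0, W0) is given, positions are rescaled so a new
    resolution maps onto the position range of the base resolution."""
    assert dim % 4 == 0
    quarter = dim // 4
    freqs = 1.0 / (base ** (torch.arange(quarter) / quarter))
    ys = torch.arange(h, dtype=torch.float32)
    xs = torch.arange(w, dtype=torch.float32)
    if base_size is not None:              # interpolated position encoding
        ys = ys * base_size[0] / h
        xs = xs * base_size[1] / w
    ang_y = torch.outer(ys, freqs)         # (h, quarter): row-axis angles
    ang_x = torch.outer(xs, freqs)         # (w, quarter): column-axis angles
    ang = torch.cat([
        ang_y[:, None, :].expand(h, w, quarter),
        ang_x[None, :, :].expand(h, w, quarter),
    ], dim=-1).reshape(h * w, dim // 2)    # one angle per rotated pair
    return torch.cos(ang), torch.sin(ang)

def apply_rope(x, cos, sin):
    """Rotate each (even, odd) feature pair: standard RoPE application."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2)
```

And for the flow-matching objective, a common rectified-flow-style training step interpolates linearly between data and noise and regresses the velocity; the resolution-dependent timestep shift shown here is a hypothetical parameterization (the exact noise schedule CogView4 uses is not specified in this announcement).

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond, t):
    """Interpolate x_t between data x0 and Gaussian noise at time t,
    then regress the constant velocity (noise - x0) along the path."""
    noise = torch.randn_like(x0)
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * noise
    v_target = noise - x0
    return F.mse_loss(model(x_t, t, cond), v_target)

def shift_timestep(t, h, w, base_hw=1024 * 1024, alpha=0.5):
    """Hypothetical resolution-aware shift: larger images retain more
    signal at a given t, so sampling is skewed toward higher noise."""
    m = ((h * w) / base_hw) ** alpha
    return m * t / (1.0 + (m - 1.0) * t)
```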
Architecturally, CogView4 retains the share-parameter DiT design of its predecessor, adding independently designed adaptive LayerNorm layers for the text and image modalities to achieve efficient cross-modal adaptation. The model is trained with a multi-stage strategy, covering base-resolution training, pan-resolution training, high-quality-data fine-tuning, and human-preference alignment, ensuring that the generated images are aesthetically pleasing and align with human preferences.
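The modality-specific adaptive LayerNorm can be sketched as follows: the transformer blocks share their attention and MLP weights across text and image tokens, while each modality gets its own conditioning-driven scale and shift. This is a minimal sketch with illustrative names, not CogView4's actual module layout.

```python
import torch
import torch.nn as nn

class DualAdaLN(nn.Module):
    """Shared-parameter block front-end: one LayerNorm, but separate
    adaptive modulation heads for text tokens and image tokens."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Per-modality modulation heads (names are illustrative).
        self.text_mod = nn.Linear(cond_dim, 2 * dim)
        self.image_mod = nn.Linear(cond_dim, 2 * dim)

    def forward(self, text_tokens, image_tokens, cond):
        # cond: (batch, cond_dim) conditioning vector (e.g. timestep embedding)
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        i_scale, i_shift = self.image_mod(cond).chunk(2, dim=-1)
        text_out = self.norm(text_tokens) * (1 + t_scale.unsqueeze(1)) + t_shift.unsqueeze(1)
        image_out = self.norm(image_tokens) * (1 + i_scale.unsqueeze(1)) + i_shift.unsqueeze(1)
        return text_out, image_out
```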
CogView4 also breaks the traditional fixed-token-length limit, permitting a higher token ceiling and significantly reducing text-token redundancy during training. With training captions averaging 200-300 tokens, CogView4 cuts token redundancy by roughly 50% compared with the conventional fixed 512-token scheme, and it achieves a 5%-30% efficiency gain during incremental training.
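A back-of-the-envelope check makes the 50% figure plausible, under the assumption that "redundancy" refers to padding tokens in a fixed 512-token caption slot versus a variable-length scheme:

```python
# Midpoint of the reported 200-300 token caption range.
avg_caption = 250
fixed_len = 512

fixed_padding = fixed_len - avg_caption        # 262 wasted tokens per caption
fixed_redundancy = fixed_padding / fixed_len   # share of the slot that is padding
print(f"fixed-length padding share: {fixed_redundancy:.0%}")  # ~51%
```

With variable-length captions, that roughly 51% padding share is eliminated, matching the "approximately 50%" reduction claimed above.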
On the ecosystem side, support for ControlNet, ComfyUI, and other integrations will be added gradually, and a complete fine-tuning toolkit is forthcoming.
Open-source Repository:
https://github.com/THUDM/CogView4
Model Repository:
https://huggingface.co/THUDM/CogView4-6B
https://modelscope.cn/models/ZhipuAI/CogView4-6B