Tired of searching for open-source image models that understand Chinese? Say goodbye to the limitations of English prompts! Chinese AI giant Zhipu AI has proudly open-sourced its new text-to-image model, CogView4, pushing Chinese image generation technology to new heights! Now, designers, content creators, and even AI art novices can use their native language to master AI image generation!


CogView4's biggest highlight is its incredibly strong understanding of Chinese! No more struggling with translation software; use natural Chinese instructions, and CogView4 will instantly grasp your artistic vision and accurately generate the desired image! Even more impressive, it's the first open-source model that can directly "write" Chinese characters into the image! This is a game-changer for Chinese users, allowing for more authentic creative expression without worrying about text incompatibility.

Even better, CogView4 lifts the usual restrictions on image size and prompt length! Want to generate a massive widescreen poster? No problem! Want a lengthy prompt describing a complex scene? Go ahead! CogView4 handles it all with ease, meeting your wildest creative needs and unleashing your imagination.

CogView4 isn't just style over substance: it took first place on DPG-Bench, an authoritative benchmark for following complex prompts, demonstrating its superior capabilities. This means CogView4 is not only user-friendly but also powerful, offering top-tier image generation quality that meets even the most demanding requirements.

To help more developers and users utilize CogView4, Zhipu AI has also announced plans to open-source supporting ControlNet, ComfyUI, and model fine-tuning tools – essentially providing the complete toolkit! This means you can not only use CogView4's powerful features out-of-the-box but also customize it to create even more personalized and powerful image generation models.

So, how did CogView4 achieve this? Simply put, it boasts several key technological upgrades:

Bilingual Capability Leap: CogView4's "brain" has been upgraded to the more powerful GLM-4 encoder, enabling it to handle both Chinese and English seamlessly. It has also been trained on a massive amount of bilingual text and image data, overcoming the limitations of previous Chinese models that struggled with English, achieving true bilingual fluency.
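As a toy illustration of why the text encoder's vocabulary coverage matters (this is not GLM-4's real tokenizer; the character-level "vocabularies" below are made up for the example), an English-only encoder collapses Chinese characters into unknown tokens, while a bilingual one preserves them:

```python
# Toy sketch: vocabulary coverage and Chinese prompts.
# These character-level "vocabularies" are hypothetical, not GLM-4's tokenizer.

prompt = "一只猫 wearing a spacesuit"  # "a cat wearing a spacesuit"

english_vocab = set("abcdefghijklmnopqrstuvwxyz ")
bilingual_vocab = english_vocab | set("一只猫")

def encode(text, vocab):
    """Map each character to itself if known, else to an <unk> token."""
    return [ch if ch.lower() in vocab else "<unk>" for ch in text]

print(encode(prompt, english_vocab)[:3])    # Chinese characters are lost
print(encode(prompt, bilingual_vocab)[:3])  # Chinese characters survive
```

An encoder that maps every Chinese character to `<unk>` hands the diffusion model nothing to condition on, which is why swapping in a bilingual encoder is the prerequisite for native Chinese prompting.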

Smarter Text Processing: CogView4 uses dynamic text length technology, acting like an intelligent tailor that cuts to fit each prompt, avoiding the token padding waste of traditional fixed-length schemes and yielding a reported 5-30% efficiency gain. This means CogView4 not only understands prompts more accurately but also generates images faster.
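The padding arithmetic behind that saving can be sketched as follows; the per-prompt token counts and the fixed length of 224 are hypothetical numbers chosen only to make the comparison concrete:

```python
# Hypothetical illustration of fixed-length padding vs. dynamic text length.
# The token counts are invented; only the padding arithmetic is real.

def padded_tokens(prompt_lengths, fixed_len):
    """Fixed-length scheme: every prompt is padded to the same length."""
    return sum(fixed_len for _ in prompt_lengths)

def dynamic_tokens(prompt_lengths):
    """Dynamic scheme: each prompt keeps only its own tokens."""
    return sum(prompt_lengths)

lengths = [12, 48, 96, 224]  # a batch of prompts of varying length

fixed = padded_tokens(lengths, fixed_len=224)   # 4 * 224 = 896
dynamic = dynamic_tokens(lengths)               # 12 + 48 + 96 + 224 = 380

savings = 1 - dynamic / fixed
print(f"tokens processed: {fixed} -> {dynamic} ({savings:.0%} saved)")
```

The actual saving depends entirely on how short real prompts are relative to the fixed budget, which is why the reported gain is a range rather than a single number.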

More Flexible Resolution Generation: CogView4 uses hybrid resolution training and two-dimensional rotary position embeddings (2D RoPE), allowing it to handle a wide range of image sizes and aspect ratios, from large high-resolution canvases to small, refined ones. It also adopts a flow-matching diffusion formulation with a parameterized linear dynamic noise schedule, making generation smoother and more controllable.
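A minimal sketch of the flow-matching idea with a linear path, in NumPy; this is the textbook formulation under a linear interpolation schedule, not CogView4's actual implementation:

```python
import numpy as np

# Flow-matching sketch with a linear path between data and noise.
# The model would learn to predict the velocity from (x_t, t); here we
# only show what that velocity target is.

rng = np.random.default_rng(0)

x0 = rng.standard_normal((4, 4))   # stand-in for a clean image latent
eps = rng.standard_normal((4, 4))  # pure Gaussian noise
t = 0.3                            # a point on the schedule in [0, 1]

# Linear interpolation path between the data x0 and the noise eps.
x_t = (1 - t) * x0 + t * eps

# Along this path the velocity is constant: d(x_t)/dt = eps - x0.
# A flow-matching model regresses this target given x_t and t.
v_target = eps - x0

# Sanity check: following the velocity from x_t reaches eps at t = 1
# and recovers x0 at t = 0.
assert np.allclose(x_t + (1 - t) * v_target, eps)
assert np.allclose(x_t - t * v_target, x0)
```

Because the path is a straight line, sampling can take large, stable steps along the learned velocity field, which is one reason flow-matching formulations tend to feel "smoother" than classic noise-prediction diffusion.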

More Refined Training Process: CogView4's training pipeline is meticulously staged, progressing from base-resolution training to multi-resolution training and then fine-tuning on high-quality data, followed by human preference alignment; every step strives for excellence. It also retains the share-param DiT architecture while applying independent adaptive layer normalization to each modality, making the model both more powerful and more efficient.
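Adaptive layer normalization (adaLN), the building block mentioned above, can be sketched as follows; the shapes, the linear projection, and the per-modality weights are illustrative assumptions, not CogView4's real configuration:

```python
import numpy as np

# Sketch of adaptive layer normalization (adaLN): a normalization whose
# scale and shift are predicted from a conditioning vector (e.g. the
# timestep or text embedding). Shapes here are illustrative only.

def layer_norm(x, eps=1e-5):
    """Plain layer norm over the last axis, no learned affine."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ada_layer_norm(x, cond, w, b):
    """Scale/shift come from a linear projection of the conditioning."""
    scale, shift = np.split(cond @ w + b, 2, axis=-1)
    return layer_norm(x) * (1 + scale) + shift

rng = np.random.default_rng(0)
d = 8
tokens = rng.standard_normal((3, d))  # e.g. a modality's token stream
cond = rng.standard_normal((1, d))    # e.g. timestep/text conditioning

# "Independent" adaLN keeps a separate (w, b) per modality, so text and
# image tokens are modulated by different learned projections.
w_img = rng.standard_normal((d, 2 * d)) * 0.02
b_img = np.zeros(2 * d)
out = ada_layer_norm(tokens, cond, w_img, b_img)
print(out.shape)  # (3, 8)
```

Giving each modality its own modulation parameters lets a shared-parameter transformer treat text and image tokens differently without duplicating the whole backbone.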

Project Address: https://github.com/THUDM/CogView4