Recently, Zhipu AI has open-sourced its latest creation to the public—CogView3 and its upgraded version CogView-3Plus-3B, injecting new vitality into the text-to-image generation field.

The debut of CogView3 is undoubtedly a significant milestone. As the first model to achieve relay diffusion in the field of text-to-image generation, it employs a unique cascaded diffusion approach. This innovative method first generates low-resolution images, then completes the final output through relay-based super-resolution technology. This not only greatly enhances the quality of the generated images but also significantly reduces the costs of training and inference.

image.png

Most notably, CogView3's performance is outstanding. According to human evaluations, CogView3 surpasses the current leading open-source text-to-image model SDXL in generation quality, with a win rate as high as 77.0%. Even more astonishing is that it achieves this feat in about half the inference time of SDXL. If using the streamlined version of CogView3, it can still maintain comparable performance levels while only taking about one-tenth of SDXL's inference time. This breakthrough undoubtedly opens up new possibilities for efficient and high-quality image generation.

In the meantime, Zhipu AI has also introduced CogView-3Plus-3B, an image model based on the DiT (Diffusion Transformers) framework. Although specific test results have not yet been released, the industry is full of anticipation for its potential. CogView-3Plus-3B has been further optimized on the basis of CogView3, introducing advanced technologies such as Zero-SNR diffusion noise scheduling and joint text-image attention mechanisms. These improvements not only reduce training and inference costs but also maintain strong image generation capabilities.

It is worth mentioning that CogView-3Plus-3B supports a wide range of image resolutions, from 512x512 to 2048x2048, greatly increasing its application flexibility. Whether for daily use or professional creation, suitable resolution options can be found.

To help users better utilize these models, Zhipu AI also provides practical advice and tools. They recommend users to optimize prompts through large language models (LLM), which can significantly enhance the quality of generated images. At the same time, Zhipu AI provides example scripts, greatly lowering the user entry barrier.

Project link: https://github.com/THUDM/CogView3