Tencent's Hunyuan text-to-image model (Hunyuan DiT) has recently been updated with a low-VRAM version that runs on as little as 6GB of VRAM, making it practical to run on personal computers. This version is compatible with plugins such as LoRA and ControlNet. Support for the Kohya graphical user interface has also been added, lowering the barrier for developers to train personalized LoRA models. In addition, the Hunyuan DiT model has been upgraded to version 1.2, with improvements in image texture and composition.

At the same time, Tencent has open-sourced its Hunyuan image captioning model, "Hunyuan Captioner," which supports both Chinese and English and is optimized for text-to-image scenarios. It understands Chinese semantics more accurately and outputs structured, complete, and accurate image descriptions. It can also identify famous people and landmarks, and it allows developers to supply their own background knowledge.

In addition, the open-sourcing of Hunyuan Captioner lets researchers and data annotators working on text-to-image generation worldwide improve the quality of their image descriptions. More comprehensive and accurate captions in turn improve model performance. The resulting datasets can be used to train models based on Hunyuan DiT as well as other visual models.

The three major updates to Hunyuan DiT are the launch of a low-VRAM version, integration with the Kohya training interface, and the upgrade to version 1.2. Together they lower the barrier to use and improve image quality. Images generated by Hunyuan DiT have good texture, but the model's previously high VRAM requirements discouraged many developers. The new low-VRAM version runs with as little as 6GB of VRAM, and through collaboration with Hugging Face, this version and its related plugins have been integrated into the Diffusers library, reducing the cost of use.
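As a rough illustration of what the Diffusers integration looks like in practice, the sketch below loads the model through the `HunyuanDiTPipeline` class in Diffusers and enables CPU offloading to reduce peak VRAM use. The model ID, the helper name `generate_image`, and the `low_vram` flag are assumptions made for this example, not details from the announcement. Offloading alone may not reach the 6GB figure, and the official low-VRAM setup may rely on additional techniques.

```python
def generate_image(
    prompt: str,
    model_id: str = "Tencent-Hunyuan/HunyuanDiT-v1.2-Diffusers",  # assumed repo ID
    low_vram: bool = True,
):
    """Generate one image with Hunyuan DiT via Diffusers (illustrative sketch)."""
    # Imported here so this module can be loaded without the GPU stack installed.
    import torch
    from diffusers import HunyuanDiTPipeline

    # Half precision roughly halves the memory needed for the weights.
    pipe = HunyuanDiTPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

    if low_vram:
        # Keeps each sub-model on the CPU and moves it to the GPU only while
        # it is running (requires the `accelerate` package).
        pipe.enable_model_cpu_offload()
    else:
        pipe.to("cuda")

    # The pipeline returns a list of PIL images; take the first one.
    return pipe(prompt=prompt).images[0]
```

Calling `generate_image("an astronaut riding a horse").save("astronaut.png")` would download the weights on first use and write the result to disk. The model accepts both Chinese and English prompts.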

Kohya is an open-source, lightweight fine-tuning tool with a graphical user interface, widely used for training diffusion-based text-to-image models. With Kohya, users can run full-precision fine-tuning and LoRA training without writing any code.

Hunyuan Captioner builds a structured image description scheme, improves the completeness of descriptions by drawing on multiple sources, and injects a large amount of background knowledge, making its output more accurate and complete. These updates have made Hunyuan DiT one of the most popular open-source DiT models from China, with more than 2.6k GitHub stars.

Official Website

https://dit.hunyuan.tencent.com 

Code

https://github.com/Tencent/HunyuanDiT

Model

https://huggingface.co/Tencent-Hunyuan/HunyuanDiT

Paper

https://tencent.github.io/HunyuanDiT/asset/Hunyuan_DiT_Tech_Report_05140553.pdf