Developed jointly by researchers from the University of Hong Kong and ByteDance, LlamaGen is an image generation method built on the autoregressive Llama architecture, and it has shown the potential to surpass traditional diffusion models in image generation.

The open-source release of LlamaGen quickly earned nearly 900 stars on GitHub. This reception not only demonstrates the competitiveness of autoregressive models in image generation but also brings fresh energy and innovation to the open-source community.

On the ImageNet benchmark, LlamaGen outperforms diffusion models such as LDM and DiT, thanks to the research team's deep understanding and optimization of the autoregressive model architecture. By retraining the image tokenizer, they surpassed previous tokenizers, including VQGAN, ViT-VQGAN, and MaskGIT, on ImageNet and COCO.


The technical implementation of LlamaGen rests on several key design choices: an image compressor/quantizer, scalable image generation models, and high-quality training data. The team adopted a CNN architecture similar to VQ-GAN to convert continuous images into discrete tokens, and significantly improved visual quality and resolution through a two-stage training process.
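The core idea of such a quantizer can be sketched in a few lines: map each continuous latent vector to its nearest entry in a learned codebook, yielding discrete token ids the autoregressive model can predict. The sketch below is illustrative only, with assumed shapes and codebook size; it is not LlamaGen's actual implementation.

```python
import numpy as np

def quantize(latents, codebook):
    """Map continuous latents to nearest codebook entries (discrete tokens)."""
    # latents: (N, D) continuous vectors; codebook: (K, D) learned entries
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)        # (N,) discrete token ids
    return tokens, codebook[tokens]      # ids plus their quantized vectors

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16384, 8))  # hypothetical 16384-entry codebook
latents = rng.standard_normal((256, 8))     # e.g. a 16x16 grid of image latents
tokens, quantized = quantize(latents, codebook)
```

In a real VQ-GAN-style tokenizer, the codebook is trained jointly with the CNN encoder and decoder, and a straight-through estimator carries gradients past the non-differentiable argmin.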

Project address: https://top.aibase.com/tool/llamagen

Online experience address: https://huggingface.co/spaces/FoundationVision/LlamaGen

In the first stage, the model was trained on a 50M subset of LAION-COCO at a resolution of 256×256. The team curated a high-quality image dataset by screening for valid image URLs, aesthetic scores, watermark scores, and other criteria. In the second stage, the model was fine-tuned on an internal dataset of 10 million high-aesthetic-quality images, with the resolution increased to 512×512, further improving the visual quality of generated images.
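The screening step described above amounts to filtering image metadata against quality criteria. The following sketch shows the general shape of such a filter; the field names and thresholds are hypothetical, not LlamaGen's actual values.

```python
# Hypothetical metadata filter in the spirit of the screening described above.
def keep_image(meta):
    """Keep an image only if its metadata passes all quality checks."""
    return (
        meta.get("url_valid", False)                  # URL still resolves
        and meta.get("aesthetic_score", 0.0) >= 5.0   # assumed threshold
        and meta.get("watermark_score", 1.0) <= 0.5   # assumed threshold
    )

samples = [
    {"url_valid": True,  "aesthetic_score": 6.2, "watermark_score": 0.1},
    {"url_valid": True,  "aesthetic_score": 4.0, "watermark_score": 0.1},
    {"url_valid": False, "aesthetic_score": 7.0, "watermark_score": 0.0},
]
kept = [s for s in samples if keep_image(s)]
print(len(kept))  # 1: only the first sample passes every check
```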

The advantage of LlamaGen lies in its excellent image tokenizer and the scalability of the Llama architecture. In practice, LlamaGen is strongly competitive on metrics such as FID, IS, Precision, and Recall, and it performs well across a range of parameter scales compared with previous autoregressive models.

Although LlamaGen has achieved significant results, the researchers note that it has so far only reached roughly the level of Stable Diffusion v1. Planned improvements include higher resolutions, more aspect ratios, greater controllability, and video generation.

LlamaGen can now be tried online: interested readers can visit the LlamaGen space on Hugging Face to experiment with this image generation technology directly. In addition, the open-source release gives developers and researchers around the world a platform to participate and contribute.