Recently, a research team from the University of Hong Kong and ByteDance unveiled LlamaGen, a technology that applies the next-token prediction paradigm of large language models to visual generation. By re-examining the design space of image tokenizers, the scalability properties of image generation models, and the quality of their training data, the team developed a new family of image generation models.

Product Entry: https://top.aibase.com/tool/llamagen

LlamaGen challenges the assumptions behind traditional image generation models, demonstrating that even without visual-signal inductive biases, conventional autoregressive models can achieve leading image generation performance when scaled appropriately. LlamaGen uses the LLaMA architecture for autoregressive token prediction rather than a diffusion model. This finding opens new possibilities for the field of image generation and offers fresh directions for future research.
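The core idea above can be sketched in a few lines: an image is represented as a sequence of discrete tokens, and the model predicts them one at a time, exactly like a language model. The sketch below is illustrative only; the grid size and codebook size are assumptions, and `toy_next_token` is a random stand-in for the real LLaMA-style transformer.

```python
import random

GRID = 16              # assumed: a 256x256 image with 16x downsampling -> 16x16 token grid
CODEBOOK_SIZE = 16384  # illustrative vocabulary of discrete visual codes

def toy_next_token(prefix):
    """Stand-in for a LLaMA-style transformer. A real model would return
    logits over the codebook conditioned on the token prefix; here we just
    draw a deterministic pseudo-random token id."""
    random.seed(len(prefix))
    return random.randrange(CODEBOOK_SIZE)

def generate_image_tokens():
    """Autoregressive generation: emit tokens in raster-scan order, each
    conditioned on all tokens generated so far."""
    tokens = []
    for _ in range(GRID * GRID):
        tokens.append(toy_next_token(tokens))
    return tokens  # 256 discrete codes, decoded back to pixels by the tokenizer's decoder

tokens = generate_image_tokens()
print(len(tokens))  # 256
```

The key design choice is that nothing in this loop is image-specific: the same next-token machinery used for text drives the generation, which is exactly the point the LlamaGen results make.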

Key features of LlamaGen include:

Image Tokenizer: An image tokenizer with a 16× downsampling ratio, 0.94 reconstruction quality, and 97% codebook utilization, outperforming prior tokenizers on the ImageNet benchmark.
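The two tokenizer numbers above are easy to make concrete. A hedged sketch, with illustrative function names and toy data (the codebook size and the half-used toy codebook are assumptions, not figures from the paper):

```python
def token_grid(height, width, downsample=16):
    """A 16x downsampling tokenizer maps each 16x16 pixel patch to one token."""
    return height // downsample, width // downsample

def codebook_utilization(used_codes, codebook_size):
    """Fraction of codebook entries that are actually used by the tokenizer."""
    return len(set(used_codes)) / codebook_size

gh, gw = token_grid(256, 256)    # -> (16, 16)
seq_len = gh * gw                # -> 256 tokens per 256x256 image
util = codebook_utilization(range(0, 16384, 2), 16384)  # toy: half the codes used -> 0.5
print(seq_len, util)
```

High utilization matters because unused codebook entries waste model vocabulary; the reported 97% means nearly every code contributes to reconstruction.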

Class-Conditional Image Generation Models: A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving an FID of 2.18 on the ImageNet 256×256 benchmark and surpassing popular diffusion models.

Text-Conditional Image Generation Model: A 775M-parameter text-conditional image generation model, trained in two stages on LAION-COCO, capable of generating aesthetically pleasing images with strong visual quality and text alignment.

Serving Framework vLLM: Validated the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models, achieving speedups of 326% to 414%.
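The percentage figures above are ratios of generation speed with and without the serving framework. A minimal sketch of how such a figure is computed, using made-up latency numbers (not measurements from the paper):

```python
def speedup_percent(baseline_seconds, optimized_seconds):
    """Speedup expressed as a percentage: 400% means the optimized path
    runs 4x faster than the baseline."""
    return baseline_seconds / optimized_seconds * 100

# Illustrative: baseline takes 8.0 s per image, vLLM-served inference 2.0 s.
sp = speedup_percent(8.0, 2.0)
print(sp)  # 400.0, within the reported 326%-414% range
```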

In this project, the research team released two image tokenizers, seven class-conditional generation models, and two text-conditional generation models, along with an online demo and a high-throughput serving framework. These models and tools give developers and researchers rich resources for understanding and applying the LlamaGen technology.