In image generation, producing high-resolution, realistic images remains challenging, especially in text-to-image synthesis. Existing methods primarily rely on diffusion models and visual autoregressive (VAR) frameworks.
Although diffusion models can produce high-quality images, they require substantial computational resources, which makes them poorly suited to real-time applications. VAR models, meanwhile, tend to accumulate errors when handling discrete tokens, causing a loss of detail that undermines the realism of the generated images.
To overcome these shortcomings, the research team at ByteDance has introduced a new framework called "Infinity," which aims to enhance the efficiency and quality of text-to-image synthesis.
Infinity replaces traditional index-level tokenization with bit-level tokenization, which yields a finer-grained representation, significantly reduces quantization error, and improves the realism of generated images. The framework also uses an Infinite-Vocabulary Classifier (IVC) that predicts the individual bits of a token rather than a single index over the entire vocabulary. This lets the token vocabulary grow to 2^64 while greatly reducing memory and computational demands.
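To make the difference between index-level and bit-level tokens concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the sign-based quantizer, the function names, and the hidden size of 2048 are all hypothetical choices made for this example. It shows how a feature vector maps to bits, which single index those bits would correspond to, and how the output-layer weight count differs between one classifier over every index and one small binary classifier per bit.

```python
def binary_quantize(features):
    """Map each feature to +1 or -1 by its sign (one bit per channel)."""
    return [1 if x >= 0 else -1 for x in features]


def bits_to_index(bits):
    """Pack +1/-1 bits into the single integer index they encode."""
    index = 0
    for b in bits:
        index = (index << 1) | (1 if b > 0 else 0)
    return index


def index_classifier_params(hidden_dim, num_bits):
    """Weight count for one classifier over all 2**num_bits indices."""
    return hidden_dim * (2 ** num_bits)


def bitwise_classifier_params(hidden_dim, num_bits):
    """Weight count for num_bits independent two-way classifiers."""
    return hidden_dim * num_bits * 2


# Hypothetical 4-channel feature vector, used only for illustration.
features = [0.7, -1.2, 0.05, -0.3]
bits = binary_quantize(features)   # [1, -1, 1, -1]
index = bits_to_index(bits)        # 0b1010 == 10

# Assumed hidden size of 2048 and 16 bits per token.
print(index_classifier_params(2048, 16))    # 134217728
print(bitwise_classifier_params(2048, 16))  # 65536
```

The per-bit count grows linearly with the number of bits, while the index-level count doubles with every added bit, which is why a vocabulary of 2^64 is only practical with the bitwise form.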
The Infinity architecture consists of three main components: a bit-level multi-scale quantization tokenizer that converts image features into binary tokens to reduce computational overhead; a transformer-based autoregressive model that predicts residuals based on text prompts and previous outputs; and a self-correcting mechanism that introduces random bit flips during training to make the model more robust to its own errors. The research team trained the model on large datasets such as LAION and OpenImages, progressively raising the image resolution from 256×256 to 1024×1024.
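The random bit flips used by the self-correcting mechanism can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not the authors' training code: the `flip_bits` helper, the 0.3 flip probability, and the 8-bit token are hypothetical, and the sketch covers only the corruption step, not how the model is trained to recover from it.

```python
import random


def flip_bits(bits, flip_prob, rng):
    """Return a copy of bits with each +1/-1 entry negated with probability flip_prob."""
    return [-b if rng.random() < flip_prob else b for b in bits]


# Hypothetical 8-bit token; the seeded generator keeps the example repeatable.
rng = random.Random(0)
clean = [1, -1, 1, 1, -1, -1, 1, -1]
noisy = flip_bits(clean, flip_prob=0.3, rng=rng)

# Count how many positions were corrupted.
num_flipped = sum(c != n for c, n in zip(clean, noisy))
print(noisy, num_flipped)
```

Feeding the model deliberately corrupted bits during training exposes it to the kind of mistakes it will make at inference time, so later predictions can compensate for earlier ones.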
In evaluation, Infinity performed strongly on key metrics, with a GenEval score of 0.73 and a Fréchet Inception Distance (FID) reduced to 3.48, indicating gains in generation quality. Infinity can generate a 1024×1024 image in 0.8 seconds, which demonstrates its efficiency. The images it produces are visually realistic and rich in detail, follow complex text instructions accurately, and receive high human preference scores.
Infinity sets a new benchmark in high-resolution text-to-image synthesis, addressing long-standing issues of scalability and detail quality through its bit-level design and advancing the development of generative AI.
Paper: https://arxiv.org/abs/2412.04431
Key Points:
🌟 **Innovative Framework Infinity:** The Infinity framework from ByteDance significantly improves the efficiency of high-resolution image generation through bit-level tokenization and an Infinite-Vocabulary Classifier.
⚡ **Outstanding Performance:** Infinity surpasses existing models on key evaluation metrics and can generate a high-quality 1024×1024 image in just 0.8 seconds.
🖼️ **Realistic Details and Responsiveness:** The generated images are visually realistic and follow complex text prompts accurately, earning high human preference scores.