Advances in image generation continue to drive applications such as virtual reality. Samsung Research recently introduced a method based on autoregressive modeling that aims to improve the fidelity and scalability of image generation. Rather than generating the entire scene at once, as traditional methods do, the approach adds detail gradually, which brings the generation process closer to the way people create images.

The core of the method is to split image generation into two levels: "base" and "detail." It first generates a smooth base image and then iteratively adds details until they form a coherent, high-quality image. The research team emphasizes that this layered strategy is more efficient than traditional methods, especially for high-resolution images, and that it scales better because the entire model does not need to be retrained.
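
The base-plus-detail idea can be illustrated with a small sketch. It is not the team's implementation: it works on a 1D signal instead of an image, and it uses a plain moving average as the smoother, which is an assumption made for brevity (the method described here uses edge-aware smoothing). The function names `box_smooth`, `decompose`, and `reconstruct` are hypothetical.

```python
def box_smooth(signal, radius):
    """Moving-average smoother with a window clipped at the borders.

    A stand-in only: this filter is NOT edge-aware.
    """
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def decompose(signal, radii):
    """Split a signal into a smooth base plus detail layers, coarse to fine.

    `radii` is sorted from largest (smoothest) to smallest. The first radius
    produces the base; each later layer stores what a finer smoothing adds.
    """
    base = box_smooth(signal, radii[0])
    layers = []
    previous = base
    for radius in radii[1:]:
        finer = box_smooth(signal, radius)
        layers.append([a - b for a, b in zip(finer, previous)])
        previous = finer
    # The last layer holds whatever the finest smoothing still misses.
    layers.append([a - b for a, b in zip(signal, previous)])
    return base, layers


def reconstruct(base, layers, upto=None):
    """Add the first `upto` detail layers back onto the base (all if None)."""
    out = list(base)
    for layer in layers[:upto]:
        out = [a + b for a, b in zip(out, layer)]
    return out


signal = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 0.0, 0.0]
base, layers = decompose(signal, radii=[3, 1])
coarse = reconstruct(base, layers, upto=1)   # base plus the first detail layer
full = reconstruct(base, layers)             # base plus every detail layer
```

Because each layer is a difference between two consecutive smoothing levels, adding all layers back onto the base recovers the original signal up to floating-point rounding, while stopping early gives a coarser version.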

When an autoregressive model is trained, the order in which image tokens are processed significantly affects the generated results. Samsung's research team uses edge-aware smoothing to decompose each training image into several sub-levels, which gives incremental control over detail. The approach mirrors how artists work: they often start with a sketch and then gradually refine shapes and details.
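
To make "edge-aware smoothing" concrete, here is a minimal sketch of a 1D bilateral filter, one well-known edge-aware smoother. The article does not say which filter the team uses, so the choice of a bilateral filter, the function name `bilateral_smooth`, and all parameter values are assumptions for illustration.

```python
import math


def bilateral_smooth(signal, radius=2, sigma_space=2.0, sigma_range=1.0):
    """Smooth a 1D signal while keeping large jumps (edges) sharp.

    Each neighbor is weighted by how close it is in position (sigma_space)
    and how close it is in value (sigma_range). Neighbors on the other side
    of a strong edge differ a lot in value, so they get almost no weight.
    """
    n = len(signal)
    out = []
    for i in range(n):
        total = 0.0
        weight_sum = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w_space = math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2))
            w_range = math.exp(-((signal[i] - signal[j]) ** 2) / (2 * sigma_range ** 2))
            weight = w_space * w_range
            total += weight * signal[j]
            weight_sum += weight
        out.append(total / weight_sum)
    return out


# A noisy step: small wiggles on each side of one large jump.
noisy_step = [0.0, 0.2, -0.2, 0.1, 10.0, 10.2, 9.8, 10.1]
smoothed = bilateral_smooth(noisy_step)
```

On this input the small wiggles on each plateau are averaged out, while the jump between index 3 and index 4 stays close to its original height. Subtracting such a smoothed signal from the original leaves fine detail without blurring the edge into the detail layer.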

Training involves three main steps. First, each training image is decomposed into a base factor and multiple levels of detail factors. Second, these factors are encoded with a Vector Quantized Variational Autoencoder (VQ-VAE), which reduces dimensionality while preserving key image features. Finally, a Transformer decoder iteratively predicts the detail factors, so that image details can be added gradually and in a controllable way.
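
The quantization step at the heart of a VQ-VAE, and the coarse-to-fine token order a decoder would then be trained on, can be sketched as follows. This is a simplified illustration under several assumptions: the codebook is hand-picked rather than learned, the input vectors stand in for encoder outputs, no Transformer is included, and the names `nearest_code`, `quantize`, and `build_sequence` are hypothetical.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(vectors, codebook):
    """Map each continuous vector to a discrete token (a codebook index)."""
    return [nearest_code(vector, codebook) for vector in vectors]


def build_sequence(level_tokens):
    """Concatenate per-level token lists, base level first.

    An autoregressive decoder trained on this order predicts coarse
    content before the finer detail levels.
    """
    sequence = []
    for tokens in level_tokens:
        sequence.extend(tokens)
    return sequence


# Hand-picked 2D codebook; a real VQ-VAE learns its codebook during training.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
base_tokens = quantize([(0.1, -0.1), (0.9, 1.2)], codebook)
detail_tokens = quantize([(0.2, 0.8)], codebook)
sequence = build_sequence([base_tokens, detail_tokens])
```

Here the base vectors map to tokens 0 and 1 and the detail vector maps to token 2, so the resulting sequence lists the base tokens before the detail token.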

According to the team, experimental results show that the method achieves state-of-the-art image generation quality while reducing the computational complexity of high-resolution output. The autoregressive framework offers a strong alternative to diffusion models and other techniques, and points to considerable room for progress in image generation.