Google DeepMind, in collaboration with the Massachusetts Institute of Technology (MIT), has announced notable research results. The team developed a new autoregressive model called "Fluid" that marks a significant advance in text-to-image generation: scaled up to 10.5 billion parameters, the model demonstrates exceptional performance.
This research challenges conventional industry wisdom. While autoregressive models have long dominated language processing, they were generally considered inferior to diffusion models such as Stable Diffusion and Google's Imagen 3 for image generation. The researchers substantially improved the performance and scalability of autoregressive models by introducing two key design choices: continuous tokens instead of discrete tokens, and a random generation order in place of a fixed one.
For representing image information, the advantage of continuous tokens is clear. Traditional discrete tokens encode each image region as a code drawn from a limited vocabulary, which inevitably loses information; even large models struggle to accurately generate fine details such as symmetrical eyes. Continuous tokens preserve more precise information and markedly improve the quality of image reconstruction.
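The information loss from discrete tokenization can be seen in a minimal sketch. The codebook size, latent dimension, and random latents below are invented for illustration and are not Fluid's actual configuration; the point is only that snapping a continuous latent to its nearest vocabulary entry always incurs some reconstruction error, while passing it through unchanged incurs none.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a 16-dim latent for one image patch and a small
# 1024-entry discrete codebook (sizes invented for illustration).
codebook = rng.normal(size=(1024, 16))
patch_latent = rng.normal(size=16)  # continuous encoder output for one patch

# Discrete tokenization: snap the latent to its nearest codebook entry.
nearest = codebook[np.argmin(np.linalg.norm(codebook - patch_latent, axis=1))]
quantization_error = np.linalg.norm(patch_latent - nearest)

# Continuous tokenization: the latent passes through unchanged.
continuous_error = np.linalg.norm(patch_latent - patch_latent)  # exactly 0

print(f"discrete-token reconstruction error:   {quantization_error:.3f}")
print(f"continuous-token reconstruction error: {continuous_error:.3f}")
```

In practice the codebook is trained rather than random, which shrinks the gap, but the quantization error never reaches zero for latents outside the vocabulary.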
The research team also rethought the generation order. Traditional autoregressive models generate images in a fixed raster order, left to right and top to bottom. The researchers instead experimented with a random order, allowing the model to predict multiple tokens at arbitrary positions in each step. This approach performed exceptionally well on tasks requiring a grasp of the overall image structure, achieving a clear advantage on the GenEval benchmark, which measures how well generated images align with their text prompts.
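The random-order scheme can be sketched as follows. The grid size and step count are invented for illustration, and the model itself is replaced by a stand-in assignment; the sketch only shows the scheduling idea: shuffle the token positions once, then fill a random subset of them in each of a few parallel prediction steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not Fluid's actual configuration: a 4x4 grid of
# image tokens generated in 4 parallel prediction steps.
num_tokens = 16
steps = 4
per_step = num_tokens // steps

generated = np.full(num_tokens, -1)   # -1 marks a not-yet-predicted position
order = rng.permutation(num_tokens)   # random generation order over positions

for step in range(steps):
    positions = order[step * per_step:(step + 1) * per_step]
    # Stand-in for the model: in Fluid, a transformer would predict the
    # continuous tokens at `positions`, conditioned on the text prompt
    # and on all tokens generated in earlier steps.
    generated[positions] = step

print("step at which each position was filled:")
print(generated.reshape(4, 4))
```

Because each step conditions on tokens scattered across the whole grid rather than only on a prefix, the model sees global context early, which is one intuition for why random order helps with overall image structure.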
Fluid's practical performance confirms the value of the research. Scaled to 10.5 billion parameters, Fluid outperforms existing models on multiple important benchmarks. Notably, a small Fluid model with only 369 million parameters matches the MS-COCO FID score (7.23) of the 20-billion-parameter Parti model.
This research suggests that autoregressive models like Fluid could become strong alternatives to diffusion models. Whereas diffusion models require many iterative denoising passes through the network, Fluid can generate an image in a single pass, an efficiency advantage that should become even more pronounced as models scale further.