Recent progress in generative models has highlighted the crucial role of image tokenization in synthesizing high-resolution images efficiently. Image tokenization converts an image into a latent representation, which lowers computational cost and makes generation more efficient and effective than processing pixels directly. However, previous methods such as VQGAN typically use a fixed 2D latent grid, which copes poorly with the redundancy inherent in images, where adjacent regions are often similar.

To address this issue, researchers have introduced TiTok, a Transformer-based framework that tokenizes an image into a one-dimensional latent sequence rather than a 2D grid. The resulting tokenizer is compact enough to represent a 256×256 image with as few as 32 discrete tokens. Consequently, it significantly accelerates the sampling process (e.g., 410× faster than DiT-XL/2) while achieving competitive generation quality.
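
To make "32 discrete tokens" concrete, here is a minimal sketch of the quantization step that turns a short sequence of latent vectors into integer tokens by nearest-neighbor lookup in a codebook. It is an illustration, not TiTok's actual implementation: the codebook size, the latent dimension, and the `quantize` helper are assumptions, and random vectors stand in for the output of the Transformer encoder.

```python
import random

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

rng = random.Random(0)
NUM_LATENTS = 32      # from the text: 32 tokens for one 256x256 image
CODEBOOK_SIZE = 4096  # assumed for illustration
LATENT_DIM = 12       # assumed for illustration

codebook = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)]
            for _ in range(CODEBOOK_SIZE)]
# Stand-in for the encoder output: one vector per position in the 1D sequence.
latents = [[rng.gauss(0, 1) for _ in range(LATENT_DIM)]
           for _ in range(NUM_LATENTS)]

tokens = quantize(latents, codebook)
print(len(tokens))  # 32 integers, each an index into the codebook
```

The point of the sketch is that the image is described entirely by a short list of codebook indices, with no 2D arrangement tied to image positions.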

TiTok's latent representation is more compact than that of traditional techniques, which makes it both more efficient and more effective. For instance, a 256×256×3 image can be reduced to just 32 discrete tokens, far fewer than the 256 or 1024 tokens produced by previous methods. Despite this compactness, TiTok achieves performance comparable to state-of-the-art methods.
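
The token counts above can be checked with simple arithmetic. The sketch below assumes that a 2D grid tokenizer downsamples each spatial dimension by a fixed factor; the factors 16 and 8 are assumptions chosen because they yield the 256 and 1024 figures quoted in the text, and `grid_tokens` is a hypothetical helper.

```python
def grid_tokens(height, width, downsample):
    """Number of tokens in a 2D latent grid with the given downsampling factor."""
    return (height // downsample) * (width // downsample)

TITOK_TOKENS = 32  # from the text

for factor in (16, 8):  # assumed downsampling factors
    n = grid_tokens(256, 256, factor)
    print(f"factor {factor}: {n} tokens, {n // TITOK_TOKENS}x more than {TITOK_TOKENS}")
# factor 16: 256 tokens, 8x more than 32
# factor 8: 1024 tokens, 32x more than 32
```

Because the grid size is tied to the image resolution, a 2D tokenizer cannot drop below these counts without a coarser grid, whereas the length of a 1D sequence can be chosen independently of resolution.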

Specifically, using the same generator framework, TiTok achieves a gFID of 1.97 on the ImageNet 256×256 benchmark, outperforming the MaskGIT baseline by 4.21. The advantages of TiTok become more pronounced at higher resolutions.

On the ImageNet 512×512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04) but also uses 64× fewer image tokens and generates images 410× faster. The best-performing TiTok variant surpasses DiT-XL/2 by a wider margin (gFID 2.13 vs. 3.04) while still generating high-quality samples 74× faster.

TiTok is applicable wherever high-resolution images must be synthesized efficiently, in fields such as computer vision, image processing, and artistic creation.