NVIDIA has collaborated with researchers from the Massachusetts Institute of Technology and Tsinghua University to develop a new text-to-image generation framework called Sana, which can efficiently generate images with resolutions as high as 4096×4096.

Sana synthesizes high-resolution, high-quality images with strong text-image alignment at very high speed, and it can even run on a laptop GPU.


Core Design Elements of Sana:

Deep Compression Autoencoder: Unlike conventional autoencoders, which downsample images by a factor of only 8 along each spatial dimension, Sana's autoencoder compresses images by a factor of 32, sharply reducing the number of latent tokens.

Linear DiT: Sana replaces all conventional (quadratic) attention in DiT with linear attention, which is more efficient for high-resolution image generation without sacrificing quality.

Decoder-Only Text Encoder: The researchers replace T5 with Gemma, a modern small decoder-only large language model (LLM), as the text encoder, and design complex human instructions with in-context learning to improve the alignment between images and text.

Efficient Training and Sampling: Sana proposes Flow-DPM-Solver to reduce the number of sampling steps, and accelerates model convergence through efficient caption labeling and selection.
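
To make the Linear DiT idea more concrete, here is a minimal pure-Python sketch of linear attention. It is not Sana's actual implementation: the ReLU feature map, the function names, and the `eps` stabilizer are assumptions chosen for illustration. The point it shows is that a small key-value summary can be computed once and reused for every query, so the N×N score matrix is never built, while the result matches the equivalent quadratic formulation.

```python
import random


def relu(m):
    """Element-wise ReLU on a list-of-lists matrix."""
    return [[max(x, 0.0) for x in row] for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def linear_attention(q, k, v, eps=1e-6):
    """Linear attention with an assumed ReLU feature map.

    q, k: N x d matrices; v: N x d_v matrix. Cost grows linearly with N.
    """
    q, k = relu(q), relu(k)
    kv = matmul(transpose(k), v)            # d x d_v summary, built once
    k_sum = [sum(col) for col in zip(*k)]   # length-d normalizer summary
    out = []
    for q_i in q:
        num = matmul([q_i], kv)[0]
        den = sum(a * b for a, b in zip(q_i, k_sum)) + eps
        out.append([x / den for x in num])
    return out


def quadratic_attention(q, k, v, eps=1e-6):
    """Same weighting, computed through the explicit N x N score matrix."""
    q, k = relu(q), relu(k)
    scores = matmul(q, transpose(k))        # N x N, the part linear attention avoids
    out = []
    for w in scores:
        den = sum(w) + eps
        num = matmul([w], v)[0]
        out.append([x / den for x in num])
    return out


if __name__ == "__main__":
    random.seed(0)
    n, d, d_v = 8, 4, 3  # toy sizes, chosen only for the demo
    q = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    k = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    v = [[random.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]
    fast = linear_attention(q, k, v)
    slow = quadratic_attention(q, k, v)
    diff = max(abs(a - b) for ra, rb in zip(fast, slow) for a, b in zip(ra, rb))
    print("max difference between the two formulations:", diff)
```

The two functions compute the same output because matrix multiplication is associative: multiplying the keys with the values first gives a d×d_v matrix whose size does not depend on the number of tokens.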


Thanks to these designs, Sana-0.6B is competitive with much larger diffusion models such as Flux-12B, while being 20 times smaller and over 100 times faster.

Additionally, Sana-0.6B can be deployed on a 16GB laptop GPU, generating 1024×1024 resolution images in less than a second, making low-cost content creation possible.


The main advantage of Sana lies in its efficiency. In 4K image generation, Sana-0.6B's throughput is over 100 times higher than that of the current state-of-the-art method (FLUX), and 40 times higher at 1K resolution.
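
A rough back-of-the-envelope calculation suggests why the compression ratio and the attention type matter more as resolution grows. The sketch below counts latent positions for 8× and 32× autoencoders and compares crude cost proxies for quadratic and linear attention. The `patch_size` parameter, the channel width `dim`, and the cost formulas are illustrative assumptions, not figures from the paper, and the output says nothing about measured throughput.

```python
def latent_tokens(image_size, downsample, patch_size=1):
    """Number of latent tokens for a square image.

    downsample is the autoencoder's per-side compression factor;
    patch_size is a hypothetical extra patchification step (1 = none).
    """
    side = image_size // (downsample * patch_size)
    return side * side


def attention_cost(tokens, dim, linear):
    """Crude proxy: N * d^2 for linear attention, N^2 * d for quadratic."""
    return tokens * dim * dim if linear else tokens * tokens * dim


if __name__ == "__main__":
    dim = 64  # assumed per-head channel width, for illustration only
    for size in (1024, 4096):
        n8 = latent_tokens(size, 8)
        n32 = latent_tokens(size, 32)
        quad = attention_cost(n8, dim, linear=False)
        lin = attention_cost(n32, dim, linear=True)
        print(f"{size}x{size}: {n8} tokens at 8x, {n32} tokens at 32x, "
              f"cost proxy ratio {quad / lin:.0f}")
```

Under these assumptions, a 4096×4096 image maps to 262144 latent positions at 8× compression but only 16384 at 32×, a 16-fold reduction before any difference in attention cost is counted.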

The researchers have also quantized Sana-0.6B and deployed it on edge devices. On a consumer-grade RTX 4090 GPU, generating a 1024×1024 image takes only 0.37 seconds, providing a robust foundation model for real-time image generation.

In the future, the researchers plan to build an efficient video generation pipeline based on Sana. The work also has limitations: it cannot fully guarantee the safety and controllability of generated image content, and it still struggles in complex cases such as rendering text and generating faces and hands.

Project Address: https://nvlabs.github.io/Sana/

Paper Address: https://arxiv.org/pdf/2410.10629