AI image generation technology is advancing rapidly, but models keep growing larger, making training and inference prohibitively expensive for the average user. Now a new text-to-image framework called "Sana" has emerged: it efficiently generates ultra-high-resolution images up to 4096×4096 and runs at remarkable speed, even on a laptop GPU.
The core design of Sana includes:
Deep Compression Autoencoder: Unlike traditional autoencoders that compress images by a factor of 8, Sana's autoencoder compresses them by a factor of 32, drastically cutting the number of latent tokens (a worked token count follows this list). This is crucial for training efficiently and for generating ultra-high-resolution images.
Linear DiT: Sana replaces all of DiT's vanilla attention with linear attention, which cuts computational complexity from O(N²) to O(N) and makes high-resolution images far cheaper to process without sacrificing quality (sketched in code after this list). Sana also uses a Mix-FFN that inserts a 3×3 depthwise convolution into the MLP to aggregate local token information, removing the need for positional encoding.
Decoder-only Text Encoder: Sana adopts a modern small decoder-only LLM (Gemma) as its text encoder in place of the commonly used CLIP or T5. This strengthens the model's understanding of and reasoning about user prompts, and improves image-text alignment through complex human instructions and in-context learning (a minimal encoding sketch also follows).
Efficient Training and Sampling Strategies: Sana proposes Flow-DPM-Solver to cut the number of sampling steps, and uses efficient caption labeling and selection methods to accelerate model convergence. The resulting Sana-0.6B model is 20× smaller than large diffusion models such as Flux-12B and more than 100× faster.
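To make the 32× compression concrete, here is a back-of-the-envelope token count (plain arithmetic, not Sana code). With patch size 1, an autoencoder that downsamples by a factor F leaves (H/F)×(W/F) latent tokens:

```python
# Rough token-count arithmetic for a patch-size-1 latent DiT.
# Not Sana code; just illustrates why 32x compression matters.
def latent_tokens(height: int, width: int, downsample: int, patch: int = 1) -> int:
    """Number of latent tokens after spatial downsampling and patchification."""
    return (height // (downsample * patch)) * (width // (downsample * patch))

for res in (1024, 4096):
    f8 = latent_tokens(res, res, downsample=8)
    f32 = latent_tokens(res, res, downsample=32)
    print(f"{res}x{res}: F8 -> {f8} tokens, F32 -> {f32} tokens ({f8 // f32}x fewer)")
```

At 4096×4096 this means 16,384 tokens instead of 262,144, a 16× reduction that compounds with the linear attention sketched next.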
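The linear-attention trick itself fits in a few lines of PyTorch. This is a generic ReLU-feature-map linear attention in the spirit of the paper, not Sana's exact implementation: computing the small (d×d) summary KᵀV first makes the cost linear in the token count N instead of quadratic:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Generic linear attention: O(N * d^2) instead of O(N^2 * d).

    q, k, v: (batch, heads, tokens, dim). The ReLU feature map keeps
    scores non-negative; the (d x d) k^T v summary is computed once,
    so cost grows linearly with the number of tokens.
    """
    q = F.relu(q)
    k = F.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                  # (d, d) summary
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

x = torch.randn(1, 8, 16384, 32)  # 16384 tokens ~ a 4K image at F32, patch 1
out = linear_attention(x, x, x)
print(out.shape)                  # torch.Size([1, 8, 16384, 32])
```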
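The text-encoder swap is conceptually a one-liner: run the prompt through a decoder-only LM and take its hidden states as conditioning for cross-attention. Below is a minimal sketch with Hugging Face transformers; the exact Gemma checkpoint and any prompt template Sana applies are assumptions here, so treat the model id as a placeholder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint for illustration; Sana uses a small Gemma variant.
MODEL_ID = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

prompt = "a cyberpunk cat holding a neon sign that says 'Sana'"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Last-layer hidden states serve as per-token text embeddings
    # that condition the diffusion transformer via cross-attention.
    text_embeddings = encoder(**inputs).last_hidden_state

print(text_embeddings.shape)  # (1, seq_len, hidden_size)
```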
The innovation of Sana lies in drastically reducing inference latency through the following methods:
Algorithm-System Co-Optimization: Through a stack of optimization techniques, Sana cuts the generation time of a 4096×4096 image from 469 seconds to 9.6 seconds, 106 times faster than the current state-of-the-art model, Flux.
Deep Compression Autoencoder: Sana's AE-F32C32P1 design compresses images by a factor of 32, sharply reducing the token count and speeding up both training and inference.
Linear Attention: Replacing traditional self-attention mechanisms with linear attention has improved the processing efficiency of high-resolution images.
Triton Acceleration: Triton is used to fuse the forward and backward kernels of the linear attention module, further speeding up training and inference (a toy fused kernel is shown after this list).
Flow-DPM-Solver: This cuts inference sampling steps from 28-50 down to 14-20 while achieving better generation results (see the sampler sketch below).
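Kernel fusion in Triton, in miniature: Sana fuses the far more involved forward and backward passes of linear attention, while the toy kernel below only illustrates the principle, computing x·y + z in one kernel so the intermediate product is never written to and re-read from global memory:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_mul_add_kernel(x_ptr, y_ptr, z_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # One kernel computes x * y + z, instead of two kernels with an
    # intermediate tensor round-tripped through global memory.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    z = tl.load(z_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * y + z, mask=mask)

def fused_mul_add(x, y, z):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    fused_mul_add_kernel[grid](x, y, z, out, n, BLOCK=1024)
    return out

x, y, z = (torch.randn(1 << 20, device="cuda") for _ in range(3))
torch.testing.assert_close(fused_mul_add(x, y, z), x * y + z)
```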
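To see why fewer sampling steps translate directly into latency, here is a bare-bones Euler sampler for a flow-matching model. This is a deliberately simple stand-in, not Flow-DPM-Solver (a higher-order solver that reaches good quality in 14-20 steps where plain Euler usually needs more); the model interface and time convention are assumptions:

```python
import torch

@torch.no_grad()
def euler_flow_sampler(model, shape, num_steps=20, device="cuda"):
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 (image).

    `model(x, t)` is assumed to predict the flow velocity. Each step is
    one network evaluation, so halving num_steps halves inference cost.
    """
    x = torch.randn(shape, device=device)             # start from pure noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        v = model(x, t.expand(shape[0]))              # predicted velocity
        x = x + (t_next - t) * v                      # Euler step (dt < 0)
        # A higher-order solver such as (Flow-)DPM-Solver combines
        # evaluations to take larger, more accurate steps.
    return x
```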
Sana performs exceptionally well. At 1024×1024 resolution, the Sana-0.6B model has only 590 million parameters, yet reaches a GenEval score of 0.64, comparable to many larger models. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU and generates a 1024×1024 image in under one second. For 4K image generation, Sana-0.6B's throughput is more than 100 times that of the state-of-the-art FLUX. Sana is not only fast: it also delivers competitive image quality, even in difficult cases such as text rendering and fine object detail.
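Once the weights are out, usage will presumably follow the familiar diffusion-pipeline pattern. The sketch below is hypothetical: the SanaPipeline class and the checkpoint id are assumptions, so check the GitHub README for the real API once the release lands:

```python
import torch
from diffusers import SanaPipeline  # assumed integration; see the repo README

# Checkpoint id is a placeholder assumption, not a confirmed model name.
pipe = SanaPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_600M_1024px",  # placeholder
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="a cyberpunk cat holding a neon sign that says 'Sana'",
    height=1024,
    width=1024,
).images[0]
image.save("sana_demo.png")
```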
Additionally, Sana shows strong zero-shot language transfer: although it was trained only on English data, it can understand Chinese prompts and emoji and generate matching images.
The advent of Sana lowers the barrier for generating high-quality images, providing powerful content creation tools for both professionals and casual users. The code and model for Sana will be publicly released.
Demo: https://nv-sana.mit.edu/
Paper: https://arxiv.org/pdf/2410.10629
GitHub: https://github.com/NVlabs/Sana