Sana is a text-to-image framework developed by NVIDIA, capable of efficiently generating images with resolutions up to 4096×4096. It synthesizes high-resolution, high-quality images at an extremely fast speed, featuring strong text-image alignment capabilities and deployable on laptop GPUs. The model is based on linear diffusion transformers, utilizing a fixed pre-trained text encoder and a space-compressed latent feature encoder, supporting mixed prompts in English, Chinese, and emojis. The key advantages of Sana include high efficiency, high-resolution image generation capability, and multilingual support.