Sana is a text-to-image framework developed by NVIDIA, capable of efficiently generating images with resolutions up to 4096×4096. It synthesizes high-resolution, high-quality images at remarkable speeds while maintaining robust text-image alignment, making it deployable on laptop GPUs. The Sana model is based on linear diffusion transformers, utilizing pre-trained text encoders and spatially compressed latent feature encoders, supporting Emoji, Chinese, and English inputs, as well as mixed prompts.