Recently, NVIDIA open-sourced an image generation model called Sana, which has only 0.6 billion (600 million) parameters, greatly lowering the barrier to entry.

Sana can reportedly generate images at resolutions up to 4096×4096 and run on a 16GB graphics card, producing a high-quality 1024×1024 image in less than one second, which stands out among comparable models.

Sana is built on DC-AE (Deep Compression Autoencoder), which compresses images by a factor of 32 before generation takes place in the latent space, so the model works on a much smaller latent representation than with conventional autoencoders. According to the report, it was run on 8 GPUs, including the powerful RTX 3090, allowing it to process complex images faster and more effectively. It is claimed that Sana-0.6B is competitive with Flux-12B while having only 1/20 of the parameters and being 100 times faster.
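To make the numbers above concrete, here is a small back-of-the-envelope sketch. It assumes that the "32" refers to the autoencoder's per-side spatial downsampling factor, and it compares that against the 8× factor commonly used by latent diffusion autoencoders. The helper `latent_grid` is a hypothetical name used only for illustration and is not part of Sana's code.

```python
def latent_grid(image_side: int, downsample: int) -> tuple[int, int]:
    """Return (latent side length, number of latent positions) for a square image.

    Assumption: the autoencoder shrinks each spatial side by `downsample`.
    """
    if image_side % downsample != 0:
        raise ValueError("image side must be divisible by the downsampling factor")
    side = image_side // downsample
    return side, side * side


if __name__ == "__main__":
    for image_side in (1024, 4096):
        for factor in (8, 32):
            side, positions = latent_grid(image_side, factor)
            print(
                f"{image_side}x{image_side} image, {factor}x downsampling: "
                f"{side}x{side} latent grid, {positions} positions"
            )

    # Parameter counts quoted in the article: 0.6B (Sana) vs 12B (Flux).
    sana_params = 600_000_000
    flux_params = 12_000_000_000
    print(f"Flux-12B has {flux_params // sana_params}x the parameters of Sana-0.6B")
```

Under these assumptions, a 32× factor leaves 16 times fewer latent positions than an 8× factor at the same image resolution, which is one plausible reason high-resolution generation becomes cheaper.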

Interestingly, Sana supports prompts in English, Chinese, and emoji. With simple text prompts, users can generate images in a wide range of styles, from cyberpunk-style cats to athletic Shiba Inus in white T-shirts and even pirate ships in cosmic whirlpools, and Sana handles all of them well. Users can even input Chinese poetry to generate related artistic images. Sana also includes a basic safety mechanism: when inappropriate words are entered, the system automatically replaces them with a red heart symbol ❤️ to prevent the generation of unsuitable content.

For example, using the prompt "A cat playing on the grass, stars 🌟," the generation speed is very fast, and the results are quite impressive.

Another example is the prompt "A cute 🐼 eating 🎋 in ink wash painting style," where the model can accurately recognize emojis.

It is worth mentioning that Sana has received official ComfyUI support and comes with LoRA training tools, which makes it considerably more convenient to use. Interested readers can try it out for themselves.

Project link: https://nv-sana.mit.edu/