A research team has recently released an open-source AI image generation model named Meissonic. Remarkably, the model can generate high-quality images with only about one billion parameters. This compact design could eventually make it practical to run text-to-image generation locally on mobile devices.


The model was developed by researchers from Alibaba, Skywork AI, and several universities. They paired a distinctive transformer architecture with innovative training methods, allowing Meissonic to run on an ordinary gaming PC and, potentially, on mobile phones in the future.


Meissonic is trained with a technique called "masked image modeling." In simple terms, part of the image is hidden during training, and the model learns to reconstruct the missing regions from the visible areas and the text description. This helps the model understand how image elements relate to text.
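To make the idea concrete, here is a minimal, illustrative sketch of a masked image modeling training step in PyTorch. It assumes the image has already been quantized into discrete tokens by a separate tokenizer and reduces text conditioning to a single embedding; none of the names, sizes, or layer counts are taken from Meissonic's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192          # assumed size of the visual token codebook
MASK_ID = VOCAB       # extra id used as the [MASK] token
SEQ_LEN = 256         # e.g. a 16x16 grid of image tokens
DIM = 512

token_emb = nn.Embedding(VOCAB + 1, DIM)   # +1 for the [MASK] token
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True),
    num_layers=4)
to_logits = nn.Linear(DIM, VOCAB)

def masked_modeling_step(image_tokens, text_embedding, mask_ratio=0.5):
    """Hide a fraction of the image tokens and train the model to predict them."""
    b, n = image_tokens.shape
    mask = torch.rand(b, n) < mask_ratio                   # positions to hide
    corrupted = image_tokens.masked_fill(mask, MASK_ID)    # replace with [MASK]

    x = token_emb(corrupted)
    x = torch.cat([text_embedding.unsqueeze(1), x], dim=1) # prepend text conditioning
    hidden = encoder(x)[:, 1:]                             # drop the text slot
    logits = to_logits(hidden)

    # The loss is computed only on the hidden positions: the model must reconstruct
    # the missing tokens from the visible context plus the text description.
    return F.cross_entropy(logits[mask], image_tokens[mask])

# Toy usage with random data in place of a real tokenizer and text encoder
tokens = torch.randint(0, VOCAB, (2, SEQ_LEN))
text = torch.randn(2, DIM)
loss = masked_modeling_step(tokens, text)
```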

Meissonic's architecture allows it to generate high-resolution images of 1024x1024 pixels, whether they are realistic scenes, stylized text, emojis, or cartoon stickers.

Unlike traditional autoregressive models, which generate an image token by token, Meissonic predicts all parts of the image in parallel and refines them over a small number of iterations. This cuts the number of decoding steps by roughly 99% and greatly speeds up image generation.
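The decoding loop can be sketched in a few lines. This illustrates the general parallel, mask-and-refine approach used by MaskGIT-style decoders, not Meissonic's exact inference code: the cosine schedule, the step count, and the `model` interface (masked tokens plus a text embedding mapping to per-position logits) are all assumptions. An autoregressive decoder would need one forward pass per token, i.e. hundreds or thousands of steps, whereas here the whole image is refined in a handful of passes.

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, text_embedding, seq_len=256, mask_id=8192, steps=16):
    b = text_embedding.shape[0]
    tokens = torch.full((b, seq_len), mask_id, dtype=torch.long)  # start fully masked

    for step in range(steps):
        still_masked = tokens == mask_id
        logits = model(tokens, text_embedding)           # predict ALL positions at once
        confidence, candidates = logits.softmax(-1).max(-1)

        # Fill every masked position with its current best guess ...
        tokens = torch.where(still_masked, candidates, tokens)

        # ... then re-mask the least confident guesses, keeping fewer positions
        # masked each step (a cosine schedule that shrinks to zero on the final step).
        num_remask = int(math.cos(math.pi / 2 * (step + 1) / steps) * seq_len)
        if num_remask > 0:
            confidence = confidence.masked_fill(~still_masked, float("inf"))
            remask = confidence.topk(num_remask, dim=-1, largest=False).indices
            tokens.scatter_(1, remask, mask_id)
    return tokens

# Toy usage with a stand-in model that returns random logits
dummy_model = lambda toks, txt: torch.randn(toks.shape[0], toks.shape[1], 8192)
image_tokens = parallel_decode(dummy_model, torch.randn(2, 512))
```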

The researchers built the model in four stages:

1. First, they taught the model basic concepts using 200 million 256x256-pixel images.
2. Then, they strengthened its text understanding with 10 million rigorously selected image-text pairs.
3. Next, they added special compression layers so the model could output 1024x1024-pixel images.
4. Finally, they fine-tuned the model on data reflecting human preferences to improve its output quality.
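For readers who like to see the pipeline at a glance, the four stages can be summarized as a hypothetical training schedule; the field names are invented for illustration, and only the dataset sizes and resolutions come from the description above.

```python
# Illustrative summary of the staged training described above (field names are assumptions).
TRAINING_STAGES = [
    {"stage": 1, "goal": "learn basic concepts",
     "data": "200M images", "resolution": "256x256"},
    {"stage": 2, "goal": "strengthen text understanding",
     "data": "10M rigorously selected image-text pairs"},
    {"stage": 3, "goal": "enable high-resolution output via added compression layers",
     "resolution": "1024x1024"},
    {"stage": 4, "goal": "fine-tune on human-preference data"},
]
```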


Interestingly, despite its smaller parameter count, Meissonic outperforms some larger models such as SDXL and DeepFloyd-XL on multiple benchmarks, reaching a high human preference score of 28.83. Additionally, Meissonic can perform image inpainting and outpainting without additional training, letting users fill in missing parts of an image or creatively extend existing ones.
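The zero-shot inpainting ability follows naturally from the masked-modeling setup, sketched below under the same assumptions as the earlier snippets (a tokenized image and a `model` mapping masked tokens plus a text embedding to per-position logits). This illustrates the general idea, not Meissonic's actual API.

```python
import torch

@torch.no_grad()
def inpaint(model, text_embedding, image_tokens, fill_mask, mask_id=8192, steps=8):
    """image_tokens: (b, n) tokens of the existing image.
    fill_mask:    (b, n) bool, True where content should be (re)generated."""
    tokens = image_tokens.masked_fill(fill_mask, mask_id)    # hide only the hole

    for _ in range(steps):
        still_masked = tokens == mask_id
        if not still_masked.any():
            return tokens
        logits = model(tokens, text_embedding)
        confidence, candidates = logits.softmax(-1).max(-1)
        # Commit the more confident half of the remaining hole each pass; the
        # untouched visible tokens keep the fill consistent with the original image.
        threshold = confidence[still_masked].quantile(0.5)
        commit = still_masked & (confidence >= threshold)
        tokens = torch.where(commit, candidates, tokens)

    # Any positions still masked after the last pass take their current best guess.
    return torch.where(tokens == mask_id, candidates, tokens)
```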

The research team believes this approach could enable fast, low-cost development of custom AI image generators and help drive text-to-image applications on mobile devices. A demo is available on Hugging Face and the code is on GitHub; the model runs on consumer GPUs with as little as 8GB of VRAM.

Demo: https://huggingface.co/spaces/MeissonFlow/meissonic

Project: https://github.com/viiika/Meissonic

Key Points:

🌟 Meissonic is an open-source AI model that generates high-quality images with only about one billion parameters, making it suitable for ordinary gaming PCs and, potentially, future mobile devices.

⚡ Using parallel iterative decoding, Meissonic needs roughly 99% fewer decoding steps than traditional autoregressive models, greatly speeding up image generation.

🏆 Despite its smaller parameter count, Meissonic outperforms larger models on multiple benchmarks and can perform image inpainting and outpainting without additional training.