Recently, researchers at the Beijing Academy of Artificial Intelligence (BAAI) introduced a new image generation model named OmniGen.


Versatile Image Generation and Editing Model

Unlike previous image generation tools such as Stable Diffusion, OmniGen stands out for its versatility, handling multiple tasks within a single framework:

It can handle a wide range of image generation tasks, including text-to-image generation and image editing, making it a true all-rounder.

This means users can control image generation and fine-tune edits with simple prompts, eliminating the need for additional plugins like ControlNet or IP-Adapter for detailed adjustments!

OmniGen's architecture is highly streamlined. Unlike traditional image generation models, it does not require extra text encoders or complex workflows: simply input the conditions, and OmniGen generates the image, significantly improving the user experience. It combines a variational autoencoder (VAE) with a pretrained Transformer, handling both image and text inputs within a single model and eliminating unnecessary complexity.
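The key idea behind this unified design is that text tokens and VAE image latents can share one input sequence for a single Transformer, rather than passing through separate encoders. The toy sketch below illustrates only the shape bookkeeping; all dimensions and names are hypothetical and much smaller than in the real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
d_model = 64          # shared embedding width for text and image tokens
n_text_tokens = 12    # tokens from the prompt
latent_hw = 8         # VAE latent grid is 8x8 after downsampling the image

# Prompt tokens are embedded directly (random stand-ins here).
text_embeddings = rng.normal(size=(n_text_tokens, d_model))

# The VAE latent grid is flattened into a sequence of "image tokens"
# of the same width, so both modalities live in one token space.
image_latent = rng.normal(size=(latent_hw, latent_hw, d_model))
image_tokens = image_latent.reshape(-1, d_model)

# One interleaved sequence feeds a single Transformer -- no separate
# text encoder such as CLIP or T5 is required.
sequence = np.concatenate([text_embeddings, image_tokens], axis=0)
print(sequence.shape)  # (12 + 64, 64) -> (76, 64)
```

Because conditioning images travel through the same sequence as text, tasks like editing reduce to "tokenize the inputs and decode the output", with no task-specific adapter modules.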

To enhance image generation quality, OmniGen is trained with rectified flow, which directly regresses the target velocity along a straight path between noise and data, making generation control more precise. In addition, a progressive training strategy moves from low to high resolution, gradually mastering generation at each scale and yielding impressive results.
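As a rough illustration of the rectified-flow objective mentioned above: samples are interpolated linearly between data and noise, and the network is trained to predict the constant velocity of that straight path. This is a minimal sketch under toy assumptions (random vectors standing in for latents; `rectified_flow_loss` and all shapes are illustrative, not the paper's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(model, x0, x1, t):
    """MSE between the model's predicted velocity and the straight-line
    target velocity (x1 - x0) at interpolation time t in [0, 1]."""
    t = t.reshape(-1, 1)                    # broadcast over the feature dim
    xt = (1.0 - t) * x0 + t * x1            # point on the straight path
    target_velocity = x1 - x0               # constant along the path
    pred = model(xt, t)
    return np.mean((pred - target_velocity) ** 2)

# Toy check: a model that already outputs the true velocity has zero loss.
x0 = rng.normal(size=(4, 8))                # e.g. clean image latents
x1 = rng.normal(size=(4, 8))                # e.g. Gaussian noise samples
t = rng.uniform(size=4)
perfect_model = lambda xt, t: x1 - x0
print(rectified_flow_loss(perfect_model, x0, x1, t))  # 0.0
```

Regressing a velocity rather than noise keeps the training target simple and the sampling trajectory close to a straight line, which is part of what makes generation control more direct.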

OmniGen Rivals Advanced Models in Image Generation

OmniGen's training dataset is also extensive and diverse, covering various image generation tasks. To ensure robust multitasking capabilities, researchers constructed a large-scale dataset called X2I, including data for text-to-image and image editing tasks. This allows OmniGen to effectively learn and transfer knowledge from different tasks, demonstrating new generation capabilities.


In multiple tests, OmniGen's performance has been remarkable. On the GenEval benchmark, its text-to-image results match the most advanced models on the market despite far less training data: roughly 0.1 billion images, compared with more than 1 billion for SD3.

Its image editing capabilities are equally impressive: it follows editing instructions accurately while staying faithful to the source image. On the EMU-Edit test set, for instance, it outperformed models such as InstructPix2Pix and even matched the current state-of-the-art EMU-Edit model.

In subject-driven generation tasks, OmniGen showcases exceptional personalized capabilities, suitable for various fields such as art creation and advertising design.

Try it out at: https://huggingface.co/spaces/Shitao/OmniGen

Paper: https://arxiv.org/html/2409.11340v1