UNIMO-G

Unified Image Generation

Common Product, Image, Image Generation, Multimodal
UNIMO-G is a simple multimodal conditional diffusion framework that operates on prompts with interleaved text and visual inputs. It comprises two core components: a multimodal large language model (MLLM) that encodes multimodal prompts, and a conditional denoising diffusion network that generates images from the encoded multimodal input. The framework is trained in two stages: first, pre-training on large-scale text-image pairs to develop conditional image generation capabilities; then, instruction tuning with multimodal prompts to achieve unified image generation. A data processing pipeline involving language grounding and image segmentation is used to construct the multimodal prompts. UNIMO-G performs well in both text-to-image generation and zero-shot subject-driven synthesis, and is notably effective at generating high-fidelity images from complex multimodal prompts involving multiple image entities.
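To make the idea of a prompt that mixes text and image entities more concrete, here is a minimal sketch of how such a prompt might be represented before it is handed to an encoder. It is illustrative only and is not taken from UNIMO-G: the class names, the `<img>` placeholder, the `flatten_prompt` helper, and the file names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TextSegment:
    text: str


@dataclass(frozen=True)
class ImageSegment:
    # Hypothetical reference to an entity image (for example, a file path).
    image_ref: str


Segment = Union[TextSegment, ImageSegment]

# Hypothetical marker for an image slot; not a token defined by UNIMO-G.
IMAGE_PLACEHOLDER = "<img>"


def flatten_prompt(segments: List[Segment]) -> Tuple[str, List[str]]:
    """Turn an interleaved prompt into a text template plus ordered image refs.

    Each image segment becomes one placeholder in the template, and its
    reference is collected in the same order it appears in the prompt.
    """
    parts: List[str] = []
    image_refs: List[str] = []
    for seg in segments:
        if isinstance(seg, ImageSegment):
            parts.append(IMAGE_PLACEHOLDER)
            image_refs.append(seg.image_ref)
        else:
            parts.append(seg.text.strip())
    # Drop empty text pieces so the template has no doubled spaces.
    return " ".join(p for p in parts if p), image_refs


# Example prompt with two image entities (file names are made up).
prompt = [
    TextSegment("A photo of"),
    ImageSegment("dog.png"),
    TextSegment("sitting next to"),
    ImageSegment("cat.png"),
    TextSegment("on a beach"),
]
template, refs = flatten_prompt(prompt)
print(template)
print(refs)
```

In a real system, the template and the referenced images would be encoded together by the MLLM, and the resulting representation would condition the diffusion network; this sketch covers only the prompt bookkeeping.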

UNIMO-G Visit Over Time

Monthly Visits: 19,075,321
Bounce Rate: 45.07%
Pages per Visit: 5.5
Visit Duration: 00:05:32
