UNIMO-G
Unified Image Generation
CommonProductImageImage GenerationMultimodal
UNIMO-G is a simple multimodal conditional diffusion framework for processing interwoven text and visual inputs. It comprises two core components: a multimodal large language model (MLLM) for encoding multimodal prompts and a conditional denoising diffusion network for generating images based on the encoded multimodal inputs. We utilize a two-stage training strategy to effectively train this framework: Firstly, pre-training on a large-scale text-image pair dataset to develop conditional image generation capabilities, followed by guided fine-tuning using multimodal prompts to achieve unified image generation capabilities. We have adopted a carefully designed data processing pipeline, including language grounding and image segmentation, to construct multimodal prompts. UNIMO-G demonstrates outstanding performance in text-to-image generation and zero-shot theme-driven synthesis, proving highly effective in generating high-fidelity images with complex multimodal prompts involving multiple image entities.
UNIMO-G Visit Over Time
Monthly Visits
17788201
Bounce Rate
44.87%
Page per Visit
5.4
Visit Duration
00:05:32