Tencent EMMA
Multimodal Text-to-Image Generation Model
EMMA is an image generation model built on ELLA, a state-of-the-art text-to-image (T2I) diffusion model. It accepts multimodal prompts and uses a multimodal feature connector to integrate text with supplementary modal information. EMMA freezes all parameters of the original T2I diffusion model and adjusts only some additional layers, which reveals an interesting property: pre-trained T2I diffusion models can secretly accept multimodal prompts. EMMA adapts easily to different existing frameworks, making it a flexible and effective tool for generating personalized, context-aware images and even videos.
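The freeze-and-adapt idea described above can be illustrated with a toy sketch. This is not EMMA's actual architecture or code: single scalars stand in for the real networks, and every name here (Param, forward, train_step, the feature values) is a hypothetical placeholder chosen for illustration. It only shows the general pattern of keeping the base model's weights fixed while training a small connector that fuses an extra modality into the text condition.

```python
# Toy illustration only: scalars stand in for real networks.
# All names and values below are hypothetical, not EMMA's API.

class Param:
    """A scalar weight with a flag saying whether training may change it."""
    def __init__(self, value, trainable):
        self.value = value
        self.trainable = trainable


def forward(base_w, connector_w, text_feat, extra_feat):
    # Connector: fuse the text feature with the supplementary-modality feature.
    fused = text_feat + connector_w.value * extra_feat
    # Frozen "base model": consumes the fused condition as if it were text.
    return base_w.value * fused, fused


def train_step(base_w, connector_w, text_feat, extra_feat, target, lr=0.05):
    pred, fused = forward(base_w, connector_w, text_feat, extra_feat)
    err = pred - target
    # Gradients of the squared error with respect to each weight.
    grads = {
        base_w: 2 * err * fused,
        connector_w: 2 * err * base_w.value * extra_feat,
    }
    # Only parameters marked trainable are updated; frozen ones are skipped.
    for param, grad in grads.items():
        if param.trainable:
            param.value -= lr * grad
    return err ** 2


base_w = Param(1.5, trainable=False)      # stands in for the frozen T2I model
connector_w = Param(0.0, trainable=True)  # stands in for the added layers

losses = [
    train_step(base_w, connector_w, text_feat=1.0, extra_feat=2.0, target=4.5)
    for _ in range(50)
]
print("base weight:", base_w.value)
print("connector weight:", connector_w.value)
```

In this sketch the loss falls while the base weight never changes, so all adaptation to the extra modality happens in the connector. A real implementation would apply the same pattern to full network layers in a deep learning framework.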