ELLA
An LLM-enhanced semantic alignment adapter for diffusion models
Common, Product Image, Text-to-Image, Semantic Alignment
ELLA (Efficient Large Language Model Adapter) is a lightweight method that equips existing CLIP-based diffusion models with powerful LLMs. It improves prompt following and enables text-to-image models to understand long, dense prompts. We designed a Time-Sensitive Semantic Connector (TSC) that extracts timestep-dependent conditioning from a pre-trained LLM for the different stages of denoising. The TSC dynamically adapts semantic features across sampling time steps, which helps condition the frozen U-Net at different semantic levels. ELLA outperforms existing methods on benchmarks such as DPG-Bench, particularly in dense prompting scenarios involving multiple object compositions, diverse attributes, and relationships.
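The idea of timestep-dependent conditioning can be illustrated with a toy sketch. This is not ELLA's actual TSC. It assumes, for illustration only, that frozen LLM token features are normalised and then scaled and shifted by a sinusoidal embedding of the sampling time step. The names `timestep_embedding`, `layer_norm`, `connect`, `scale_w`, and `shift_w` are hypothetical, and the weights are fixed toy values rather than learned parameters.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar time step into `dim` values (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def layer_norm(x, eps=1e-5):
    """Normalise one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def connect(llm_tokens, t, scale_w, shift_w):
    """Hypothetical connector: modulate frozen LLM token features by the time step.

    Each token is normalised, then multiplied by (1 + scale) and offset by shift,
    where scale and shift are per-channel functions of the time-step embedding.
    """
    dim = len(llm_tokens[0])
    emb = timestep_embedding(t, dim)
    scale = [w * e for w, e in zip(scale_w, emb)]
    shift = [w * e for w, e in zip(shift_w, emb)]
    return [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(tok), scale, shift)]
        for tok in llm_tokens
    ]


# Toy stand-ins for frozen LLM outputs: 2 tokens with 4 channels each.
tokens = [[0.1, 0.4, -0.2, 0.3], [0.5, -0.1, 0.0, 0.2]]
scale_w = [0.5] * 4  # assumed fixed weights; a real connector would learn these
shift_w = [0.1] * 4

early = connect(tokens, 900, scale_w, shift_w)  # a noisy, early denoising step
late = connect(tokens, 100, scale_w, shift_w)   # a later, cleaner step
```

The same prompt features yield different conditioning at `t=900` and `t=100`, which is the property the description attributes to the TSC. The text encoder and the U-Net are untouched in this sketch; only the connector output depends on the time step.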