Recent research from Alibaba's Tongyi Laboratory indicates that existing text-to-image Diffusion Transformer models already possess the ability to generate multiple images with specific relationships. With a bit of guidance, they can "comprehend and integrate" to produce high-quality image sets.

Traditional Diffusion models are more like a student who "crammed for exams," requiring training on massive datasets before they can generate high-quality images.

With the addition of IC-LoRA, it transforms into a "quick learner" who can acquire new skills with just a few samples.

image.png

The principle behind this is not overly complex. Researchers found that existing text-to-image Diffusion models already have a certain "in-context learning" capability; it just takes a few techniques to activate it.

They conducted several experiments, directly using existing text-to-image models to generate multiple images. The results showed that the models could indeed understand the relationships between images and produce coherent image sets; despite some minor flaws, the results were quite impressive.

Therefore, they designed a simple and effective process to awaken the "in-context learning" capability of Diffusion models:

Instead of concatenating tokens as before, they stitched multiple images into one large image, so the Diffusion model processes the whole set as a single image in pixel space rather than as separate sequences of abstract tokens.
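As a rough sketch of this stitching step (the panel sizes and count below are illustrative assumptions, not the paper's settings), same-height panels are simply concatenated along the width axis:

```python
import numpy as np

def stitch_panels(panels):
    """Concatenate same-height image panels side by side into one wide image.

    Each panel is an (H, W, 3) array; the result is (H, sum of widths, 3),
    so the diffusion model sees the whole set as a single canvas.
    """
    heights = {p.shape[0] for p in panels}
    if len(heights) != 1:
        raise ValueError("all panels must share the same height")
    return np.concatenate(panels, axis=1)

# Three 64x64 RGB panels become one 64x192 canvas.
panels = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)]
canvas = stitch_panels(panels)
print(canvas.shape)  # (64, 192, 3)
```

Because the set lives in one canvas, the model's attention spans all panels at once, which is what lets it keep characters and style consistent across them.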

They merged the textual descriptions of each image into a long prompt, allowing the model to process information from multiple images simultaneously and understand their relationships.

For example:

image.png

Prompt: "In this adventurous three-image sequence, [IMAGE1] Ethan, a rugged archaeologist, discovers an ancient map at a sunny desert excavation site. His excitement is evident as he brushes off the sand, [IMAGE2] transitions to a bustling foreign market where Ethan negotiates with local merchants and gathers essentials for his mission, [IMAGE3] finally, Ethan treks through a dense, fog-shrouded jungle, with towering trees and exotic wildlife emphasizing the challenges and mystery of his journey."

image.png

Prompt: "In a captivating story of resilience, [IMAGE1] we see Lena, a determined girl, sowing seeds in barren fields, her face filled with resolve, [IMAGE2] transitions to her nurturing plants, watering them daily, her efforts gradually bearing fruit, [IMAGE3] culminating in a lush, vibrant garden where Lena stands proudly among her creations, symbolizing growth and perseverance."
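Prompts like the two above follow a mechanical pattern and can be assembled from per-panel captions; this sketch uses hypothetical captions together with the [IMAGEk] markers seen in the examples:

```python
def build_set_prompt(summary, captions):
    """Merge per-panel captions into one long prompt with [IMAGEk] markers."""
    parts = [f"[IMAGE{i}] {c.strip()}" for i, c in enumerate(captions, start=1)]
    return summary.strip() + " " + " ".join(parts)

prompt = build_set_prompt(
    "In this three-image sequence,",
    ["a hero finds a map", "he gathers supplies", "he enters the jungle"],
)
print(prompt)
```

The single merged prompt is what lets the model attend to the descriptions of all panels at once and infer the relationships between them.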

They then fine-tune the model on only a small set of high-quality image sets, rather than training on tens of thousands of images. This saves compute while preserving the model's original knowledge and its "in-context learning" capability.
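The "small adapter on a frozen model" idea behind LoRA can be sketched numerically: the pretrained weight W stays frozen, and only a low-rank update B @ A is trained (the rank, dimension, and scaling values below are illustrative assumptions, not the paper's hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                         # feature dim and low rank (r << d)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init
alpha = 4.0                         # LoRA scaling factor

def lora_forward(x):
    # Frozen path plus the low-rank adapter path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B zero-initialized, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B are updated during fine-tuning, which is why a handful of image sets is enough: the adapter has far fewer parameters than the frozen model it steers.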

The final IC-LoRA pipeline is very simple: it requires no modifications to the existing text-to-image model, only a small amount of training data adjusted to the specific task.

For example, if you want Stable Diffusion to learn to generate comic-style images, you only need to train an IC-LoRA adapter on a few comic images, and it can then generate all kinds of comic images you desire. One hint really is enough.

image.png

Prompt: "These images depict a transformation from a realistic portrait to a playful illustration, capturing details and artistic flair; [IMAGE1] in a photo, a woman stands in a bustling market, wearing a wide-brimmed hat and a flowing bohemian dress, holding a leather crossbody bag; [IMAGE2] the illustrated version exaggerates her accessories and features, the bohemian dress depicted with vibrant patterns and bold colors, while the background is simplified into abstract market stalls, bringing a lively feel to the scene."

To make IC-LoRA even more powerful, the researchers added image-conditioned generation: producing new images based on existing ones, such as new expressions and poses from a person's photo, or different weather and lighting conditions from a landscape photo.
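On a stitched canvas, one natural way to condition on an existing image (as inpainting-style pipelines do) is to keep the known panel untouched and mask out only the panels to be generated; a sketch with assumed panel sizes:

```python
import numpy as np

def make_panel_mask(height, panel_width, num_panels, keep):
    """Build a binary mask over a stitched canvas: 1 = regenerate, 0 = keep.

    `keep` lists the indices of condition panels that stay untouched, so an
    inpainting pipeline only fills in the remaining panels.
    """
    mask = np.ones((height, panel_width * num_panels), dtype=np.uint8)
    for i in keep:
        mask[:, i * panel_width:(i + 1) * panel_width] = 0
    return mask

# Keep panel 0 as the condition image; regenerate panels 1 and 2.
mask = make_panel_mask(64, 64, 3, keep=[0])
print(mask[:, :64].max(), mask[:, 64:].min())  # 0 1
```

Because the condition panel sits inside the same canvas, the model's in-context consistency carries its identity and style over to the regenerated panels.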

For example:

image.png

Prompt: "This set of four images captures the serene moments of an elderly woman tending to her garden. [IMAGE1] She kneels beside a blooming flower bed, gently pruning a cluster of roses, soft morning light illuminating her silver hair; [IMAGE2] she stands at a watering can, her face calm and serene as she nurtures the plants; [IMAGE3] a close-up shows her smiling contentedly as she looks at a flower bud in her hand, pride and joy evident; [IMAGE4] she sits on a small bench, drinking tea in her garden, surrounded by the vibrant colors of her hard work."

image.png

Prompt: "This pair of images illustrates the transformative impact of a sandstorm on a sports scene; [IMAGE1] on a lush green field, an American football team's focus is on a player holding a football, shot in bright sunlight, [IMAGE2] switches to the same player engulfed by dramatic sand and lightning effects, dust swirling around him, creating a fierce sandstorm effect on a dark, dim field."
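Whatever the task, the model emits one wide stitched canvas per set; the individual images are then recovered by slicing it back into panels, sketched here under the assumption of equal panel widths:

```python
import numpy as np

def split_panels(canvas, num_panels):
    """Cut a stitched (H, W, 3) canvas back into `num_panels` equal-width images."""
    h, w, _ = canvas.shape
    if w % num_panels != 0:
        raise ValueError("canvas width must divide evenly into panels")
    pw = w // num_panels
    return [canvas[:, i * pw:(i + 1) * pw] for i in range(num_panels)]

canvas = np.zeros((64, 192, 3), dtype=np.uint8)
panels = split_panels(canvas, 3)
print(len(panels), panels[0].shape)  # 3 (64, 64, 3)
```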

Test results show that IC-LoRA achieves high-quality results across a wide range of image generation tasks: portrait photography, font design, home decoration, movie storyboards, visual effects, and more. It handles them all, truly "proficient in every craft."

The emergence of IC-LoRA is undoubtedly a milestone advancement in the field of AI image generation. It significantly reduces the training costs of AI models, allowing more people to participate in AI creation.

In the future, with the further development of IC-LoRA, we have reason to believe that AI will become an accessible creative tool for everyone, enabling everyone to become an artist.

Project address: https://ali-vilab.github.io/In-Context-LoRA-Page/