Recent research from Alibaba's Tongyi Laboratory indicates that existing text-to-image Diffusion Transformer models already possess the ability to generate multiple images with specific relationships. With a bit of guidance, they can "comprehend and integrate" to produce high-quality image sets.

Traditional Diffusion models are more like a student who "crammed for exams," requiring training on massive datasets before they can generate high-quality images.

With the addition of IC-LoRA, it transforms into a "quick learner" who can acquire new skills with just a few samples.

image.png

The principle behind this is not overly complex. Researchers found that existing text-to-image Diffusion models already have a certain "in-context learning" capability; it just takes a few techniques to activate it.

They conducted several experiments, directly using existing text-to-image models to generate multiple images. The results showed that the models could indeed understand the relationships between images and produce coherent image sets; despite some minor flaws, the results were quite impressive.

Therefore, they designed a simple and effective process to awaken the "in-context learning" capability of Diffusion models:

Instead of concatenating tokens as before, they stitched multiple images into one large image, so the Diffusion model processes the whole set as a single image in pixel space rather than as separate sequences of abstract tokens.
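As a rough sketch of this stitching step (the panel sizes and count below are illustrative assumptions, not the paper's settings), same-height panels are simply concatenated along the width axis:

```python
import numpy as np

def stitch_panels(panels):
    """Concatenate same-height image panels side by side into one wide image.

    Each panel is an (H, W, 3) array; the result is (H, sum of widths, 3),
    so the diffusion model sees the whole set as a single canvas.
    """
    heights = {p.shape[0] for p in panels}
    if len(heights) != 1:
        raise ValueError("all panels must share the same height")
    return np.concatenate(panels, axis=1)

# Three 64x64 RGB panels become one 64x192 canvas.
panels = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(3)]
canvas = stitch_panels(panels)
print(canvas.shape)  # (64, 192, 3)
```

Because the set lives in one canvas, the model's attention spans all panels at once, which is what lets it keep characters and style consistent across them.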

They merged the textual descriptions of each image into a long prompt, allowing the model to process information from multiple images simultaneously and understand their relationships.

For example:

image.png

Prompt: "In this adventurous three-image sequence, [IMAGE1] Ethan, a rugged archaeologist, discovers an ancient map at a sunny desert excavation site. His excitement is evident as he brushes off the sand, [IMAGE2] transitions to a bustling foreign market where Ethan negotiates with local merchants and gathers essentials for his mission, [IMAGE3] finally, Ethan treks through a dense, fog-shrouded jungle, with towering trees and exotic wildlife emphasizing the challenges and mystery of his journey."

image.png

Prompt: "In a captivating story of resilience, [IMAGE1] we see Lena, a determined girl, sowing seeds in barren fields, her face filled with resolve, [IMAGE2] transitions to her nurturing plants, watering them daily, her efforts gradually bearing fruit, [IMAGE3] culminating in a lush, vibrant garden where Lena stands proudly among her creations, symbolizing growth and perseverance."
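Prompts like the two above follow a mechanical pattern and can be assembled from per-panel captions; this sketch uses hypothetical captions together with the [IMAGEk] markers seen in the examples:

```python
def build_set_prompt(summary, captions):
    """Merge per-panel captions into one long prompt with [IMAGEk] markers."""
    parts = [f"[IMAGE{i}] {c.strip()}" for i, c in enumerate(captions, start=1)]
    return summary.strip() + " " + " ".join(parts)

prompt = build_set_prompt(
    "In this three-image sequence,",
    ["a hero finds a map", "he gathers supplies", "he enters the jungle"],
)
print(prompt)
```

The single merged prompt is what lets the model attend to the descriptions of all panels at once and infer the relationships between them.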

They then fine-tune the model on only a small set of high-quality image sets, rather than training on tens of thousands of images. This saves compute while preserving the model's original knowledge and its "in-context learning" capability.
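The "small adapter on a frozen model" idea behind LoRA can be sketched numerically: the pretrained weight W stays frozen, and only a low-rank update B @ A is trained (the rank, dimension, and scaling values below are illustrative assumptions, not the paper's hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                         # feature dim and low rank (r << d)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init
alpha = 4.0                         # LoRA scaling factor

def lora_forward(x):
    # Frozen path plus the low-rank adapter path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B zero-initialized, the adapter starts as an exact no-op.
assert np.allclose(lora_forward(x), W @ x)
```

Only A and B are updated during fine-tuning, which is why a handful of image sets is enough: the adapter has far fewer parameters than the frozen model it steers.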

The final IC-LoRA pipeline is very simple: it requires no modifications to the existing text-to-image model, only a small amount of training data adjusted to the specific task.

For example, if you want Stable Diffusion to learn to generate comic-style images, you only need to train an IC-LoRA adapter on a few comic images, and it can then generate all kinds of comic images you desire. One hint really is enough.

image.png

Prompt: "These images depict a transformation from a realistic portrait to a playful illustration, capturing details and artistic flair; [IMAGE1] in a photo, a woman stands in a bustling market, wearing a wide-brimmed hat and a flowing bohemian dress, holding a leather crossbody bag; [IMAGE2] the illustrated version exaggerates her accessories and features, the bohemian dress depicted with vibrant patterns and bold colors, while the background is simplified into abstract market stalls, bringing a lively feel to the scene."

To make IC-LoRA even more powerful, the researchers added image-conditioned generation: producing new images based on existing ones, such as new expressions and poses from a person's photo, or different weather and lighting conditions from a landscape photo.
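On a stitched canvas, one natural way to condition on an existing image (as inpainting-style pipelines do) is to keep the known panel untouched and mask out only the panels to be generated; a sketch with assumed panel sizes:

```python
import numpy as np

def make_panel_mask(height, panel_width, num_panels, keep):
    """Build a binary mask over a stitched canvas: 1 = regenerate, 0 = keep.

    `keep` lists the indices of condition panels that stay untouched, so an
    inpainting pipeline only fills in the remaining panels.
    """
    mask = np.ones((height, panel_width * num_panels), dtype=np.uint8)
    for i in keep:
        mask[:, i * panel_width:(i + 1) * panel_width] = 0
    return mask

# Keep panel 0 as the condition image; regenerate panels 1 and 2.
mask = make_panel_mask(64, 64, 3, keep=[0])
print(mask[:, :64].max(), mask[:, 64:].min())  # 0 1
```

Because the condition panel sits inside the same canvas, the model's in-context consistency carries its identity and style over to the regenerated panels.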

For example:

image.png

Prompt: "This set of four images captures the serene moments of an elderly woman tending to her garden. [IMAGE1] She kneels beside a blooming flower bed, gently pruning a cluster of roses, soft morning light illuminating her silver hair; [IMAGE2] she stands at a watering can, her face calm and serene as she nurtures the plants; [IMAGE3] a close-up shows her smiling contentedly as she looks at a flower bud in her hand, pride and joy evident; [IMAGE4] she sits on a small bench, drinking tea in her garden, surrounded by the vibrant colors of her hard work."

image.png

Prompt: "This pair of images illustrates the transformative impact of a sandstorm on a sports scene; [IMAGE1] on a lush green field, an American football team's focus is on a player holding a football, shot in bright sunlight, [IMAGE2] switches to the same player engulfed by dramatic sand and lightning effects, dust swirling around him, creating a fierce sandstorm effect on a dark, dim field."
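Whatever the task, the model emits one wide stitched canvas per set; the individual images are then recovered by slicing it back into panels, sketched here under the assumption of equal panel widths:

```python
import numpy as np

def split_panels(canvas, num_panels):
    """Cut a stitched (H, W, 3) canvas back into `num_panels` equal-width images."""
    h, w, _ = canvas.shape
    if w % num_panels != 0:
        raise ValueError("canvas width must divide evenly into panels")
    pw = w // num_panels
    return [canvas[:, i * pw:(i + 1) * pw] for i in range(num_panels)]

canvas = np.zeros((64, 192, 3), dtype=np.uint8)
panels = split_panels(canvas, 3)
print(len(panels), panels[0].shape)  # 3 (64, 64, 3)
```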

Test results show that IC-LoRA achieves high-quality results across a wide range of image generation tasks: portrait photography, font design, home decoration, movie storyboards, visual effects, and more. It handles them all, truly "proficient in every craft."

The emergence of IC-LoRA is undoubtedly a milestone advancement in the field of AI image generation. It significantly reduces the training costs of AI models, allowing more people to participate in AI creation.

In the future, with the further development of IC-LoRA, we have reason to believe that AI will become an accessible creative tool for everyone, enabling everyone to become an artist.

Project address: https://ali-vilab.github.io/In-Context-LoRA-Page/