In the field of artificial intelligence, the capabilities of AI image generators have been advancing continuously. Yet even the most advanced image generation models can stumble on seemingly simple tasks. Recently, doctoral candidate Zhao Juntao and his team at Shanghai Jiao Tong University discovered that AI struggles, unexpectedly, to generate the scene of "ice-cold cola in a teacup."

This phenomenon has drawn academic attention and is referred to as the text-image misalignment problem. As early as October 2023, when AI image generation models were just emerging, Zhao Juntao and his team tried these models and found that the AI usually depicted a transparent glass filled with ice-cold cola rather than a teacup. Even by July 2024, with the most advanced models, the results were still unsatisfactory.


To delve deeper into this issue, Professor Wang Dequan's team at Shanghai Jiao Tong University, in their recent paper "Lost in Translation: Latent Concept Misalignment in Text-to-Image Diffusion Models," categorized the problem as Latent Concept Misalignment (LC-Mis), a misalignment involving latent concepts. They designed a system based on large language models (LLMs), leveraging the human-like reasoning of LLMs to quickly collect concept pairs that exhibit similar issues.
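To make the idea concrete, below is a minimal sketch of how an LLM could be prompted to propose candidate concept pairs for this kind of collection loop. It is not the authors' actual system; the model name, prompt wording, and output format are assumptions chosen for illustration.

```python
# Hypothetical sketch: asking an LLM to propose candidate LC-Mis concept pairs.
# This is NOT the authors' pipeline; model name and prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "List 5 pairs of visual concepts (A, B) where, in the phrase 'A in B', "
    "image generators tend to replace B with a more stereotypical container or object. "
    "Example: 'ice-cold cola' is usually drawn in a glass rather than a 'teacup'. "
    "Return one pair per line as: concept_A | concept_B"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": PROMPT}],
)

# Each returned line becomes a candidate pair to verify against a text-to-image model.
candidate_pairs = [
    tuple(part.strip() for part in line.split("|"))
    for line in response.choices[0].message.content.splitlines()
    if "|" in line
]
print(candidate_pairs)
```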

The research team proposed a method called Mixture of Concept Experts (MoCE), which integrates the rules of sequential human painting into the multi-step sampling process of diffusion models and successfully recovers the elusive teacup. The method divides the sampling process into two stages: the first stage conditions generation on the easily overlooked concept, and the second stage switches to the full text prompt. This staged approach lets MoCE control the alignment between text and image more precisely during generation.
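As an illustration of this two-stage idea (not the authors' MoCE implementation), the sketch below splits a Stable Diffusion sampling run into a first stage conditioned only on the overlooked concept ("a teacup") and a second stage conditioned on the full prompt, using the step-end callback hook in the diffusers library. The model name, step counts, and switch point are assumptions.

```python
# Conceptual two-stage sampling with diffusers (not the authors' MoCE code).
# Stage 1: denoise under the easily overlooked concept only.
# Stage 2: switch to the full prompt for the remaining steps.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

concept_prompt = "a teacup"                 # easily overlooked concept
full_prompt = "ice-cold cola in a teacup"   # full text prompt
num_steps = 30
switch_step = 10                            # hand-over point (an assumption)

# Pre-encode the full prompt so the callback can swap it in mid-sampling.
cond, uncond = pipe.encode_prompt(
    full_prompt, device="cuda", num_images_per_prompt=1,
    do_classifier_free_guidance=True, negative_prompt="",
)
full_cfg_embeds = torch.cat([uncond, cond])  # CFG layout: [unconditional, conditional]

def switch_prompt(pipeline, step, timestep, callback_kwargs):
    # After `switch_step` steps, replace the concept-only embeddings
    # with the full-prompt embeddings for the remaining steps.
    if step == switch_step:
        callback_kwargs["prompt_embeds"] = full_cfg_embeds
    return callback_kwargs

image = pipe(
    concept_prompt,
    negative_prompt="",
    num_inference_steps=num_steps,
    callback_on_step_end=switch_prompt,
    callback_on_step_end_tensor_inputs=["prompt_embeds"],
).images[0]
image.save("cola_in_teacup.png")
```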

The MoCE method substantially reduced the number of level-5 LC-Mis concept pairs and even outperformed DALL·E 3 (October 2023 version), which relies on costly, large-scale data annotation.

Additionally, the research team found that existing automated evaluation metrics exhibit significant flaws on these new cases. For example, some metrics scored images of ice-cold cola in a teacup lower while scoring ice-cold cola in a transparent glass higher. This indicates that even the tools used to evaluate AI performance can carry biases and limitations.
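For instance, a CLIP-style text-image similarity score, shown below, is one common automated metric of this kind (used here only as an illustration, not necessarily the exact metric examined in the paper); such a score can still rank the incorrect glass image above the correct teacup image. The image file names are hypothetical.

```python
# Minimal sketch of a CLIP-style automated alignment score (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "ice-cold cola in a teacup"
# Hypothetical file names for the two competing renderings.
images = [Image.open("cola_in_teacup.png"), Image.open("cola_in_glass.png")]

inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarity between the prompt and each image; a metric like this
# can still favor the (incorrect) glass image, which is the bias described above.
scores = outputs.logits_per_image.squeeze(-1)
for name, score in zip(["teacup", "glass"], scores.tolist()):
    print(f"{name}: {score:.2f}")
```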

The researchers plan to explore more complex LC-Mis scenarios in future work and to develop learnable search algorithms that reduce the number of search iterations. They also intend to broaden the models, model versions, and samplers covered by the dataset, and to keep iterating on the dataset collection algorithm to strengthen and expand it.

This research not only provides new insight into the limitations of AI image generation but also offers new ideas and methods for improving it. As the technology continues to advance, we look forward to AI making greater breakthroughs in understanding and reproducing human creativity.

Project Address: https://lcmis.github.io/

Paper: https://arxiv.org/pdf/2408.00230