Recent research from Technische Universität Darmstadt in Germany has revealed a thought-provoking phenomenon: even today's most advanced vision-language models can make significant errors on simple visual reasoning tasks. The study calls for a rethink of how AI's visual capabilities are evaluated.

The research team used Bongard problems, visual puzzles devised by the Russian scientist Mikhail Bongard, as their testing tool. Each puzzle consists of 12 simple images divided into two groups of six, and the task is to identify the rule that distinguishes one group from the other. For most people this kind of abstract reasoning is not difficult, but the AI models performed unexpectedly poorly.
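To make the task format concrete, here is a minimal sketch in Python of how a single Bongard problem could be represented for evaluation. The class name, fields, file paths, and the example rule are illustrative assumptions, not the researchers' actual code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BongardProblem:
    """One Bongard problem: twelve panels split into two groups of six.

    The solver must state the rule that every left-hand panel obeys
    and every right-hand panel violates.
    """
    left_panels: List[str]    # file paths of the six images obeying the rule
    right_panels: List[str]   # file paths of the six images violating it
    ground_truth_rule: str    # human-readable description of the concept

    def __post_init__(self) -> None:
        # A valid Bongard problem always has exactly six panels per side.
        assert len(self.left_panels) == 6 and len(self.right_panels) == 6


# Hypothetical example using one of the concepts mentioned in the article.
problem = BongardProblem(
    left_panels=[f"bp_001/left_{i}.png" for i in range(6)],
    right_panels=[f"bp_001/right_{i}.png" for i in range(6)],
    ground_truth_rule="left: vertical lines; right: horizontal lines",
)
```

Grading a free-form answer against `ground_truth_rule` is itself non-trivial, which is one reason multiple-choice variants (discussed below) are attractive for benchmarking.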


Even GPT-4o, currently regarded as the most advanced multimodal model, solved only 21 of the 100 visual puzzles. Other well-known AI models such as Claude, Gemini, and LLaVA performed even worse. The models struggled with basic visual concepts such as distinguishing vertical from horizontal lines or determining the direction in which a spiral turns.

The researchers found that providing multiple-choice options improved the models' performance only slightly. Only when the number of possible answers was strictly limited did the success rates of GPT-4 and Claude rise to 68 and 69 puzzles, respectively. Through an in-depth analysis of four specific cases, the team found that the systems sometimes fail at the level of basic visual perception, before any "thinking" or "reasoning" even begins, although the exact reasons remain unclear.
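The following sketch shows how such a multiple-choice query might be posed to a multimodal model. It assumes the official `openai` Python SDK and GPT-4o's image input; the prompt wording, file names, and answer options are illustrative and do not reproduce the study's actual protocol.

```python
import base64

from openai import OpenAI  # assumes the official `openai` Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Return the image file as a base64 string for a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask_multiple_choice(image_path: str, options: list[str]) -> str:
    """Show a composite Bongard image plus a fixed list of candidate rules,
    and ask the model to answer with a single option letter."""
    letters = "ABCDEFGH"
    option_text = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    prompt = (
        "The image shows a Bongard problem: the six panels on the left share a "
        "rule that the six panels on the right violate.\n"
        f"Which rule separates the two sides?\n{option_text}\n"
        "Answer with a single letter."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


# Hypothetical usage with a small, fixed answer set:
# print(ask_multiple_choice("bp_001/composite.png",
#                           ["vertical vs. horizontal lines",
#                            "clockwise vs. counter-clockwise spirals",
#                            "convex vs. concave shapes"]))
```

Shrinking the option list is what "strict limitations on the number of possible answers" amounts to in practice: with fewer candidates, a partially correct perception of the image is more often enough to pick the right rule.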

The research also raises questions about how AI systems are evaluated. The team asked: "Why do vision-language models perform well on established benchmarks yet struggle with seemingly simple Bongard problems? What do these benchmarks really tell us about genuine reasoning ability?" These questions suggest that current evaluation suites may need to be redesigned to measure visual reasoning more accurately.

This study not only exposes the limitations of current AI technology but also points to where future work on AI's visual abilities is needed. It reminds us that, even while celebrating AI's rapid progress, we should stay aware of how much room remains for improvement in its fundamental cognitive abilities.