You've probably heard of Visual Language Models (VLMs). These AI whizzes aren't just good at reading text; they can also "see" and understand images. Or can they? Today, let's look under the hood and see whether they truly understand images the way humans do.
First, let's clarify what VLMs are. In simple terms, they are large language models with vision capabilities, such as GPT-4o and Gemini-1.5 Pro, which handle both images and text and even score high on many visual-understanding benchmarks. But don't be fooled by those high scores; today we're going to see whether they are as impressive as they seem.
Researchers designed a benchmark called BlindTest, made up of 7 tasks that are trivially easy for humans: for example, judging whether two circles overlap, whether two lines intersect, or counting how many circles are in an Olympic-style logo. These sound like tasks a kindergartner could breeze through, right? Yet the VLMs' performance is nowhere near as magical as you might expect.
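For a sense of scale: the circle-overlap task is a one-line geometry check in code. Here is a minimal sketch (the coordinates and radii are made up for illustration; the paper's actual stimuli differ):

```python
import math

def circles_overlap(x1, y1, r1, x2, y2, r2):
    """Two circles overlap iff the distance between their centers
    is less than the sum of their radii (touching doesn't count)."""
    return math.hypot(x2 - x1, y2 - y1) < r1 + r2

print(circles_overlap(0, 0, 1, 3, 0, 1))    # centers 3 apart, radii sum 2 → False
print(circles_overlap(0, 0, 1, 1.5, 0, 1))  # centers 1.5 apart, radii sum 2 → True
```

A human glances at the picture and answers instantly; a program needs only this comparison. That is what makes the models' struggles on the same question so striking.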
The results were astonishing. Averaged across BlindTest, these supposedly advanced models reached only 56.20% accuracy, and even the best performer, Sonnet-3.5, managed just 73.77%. That's like a star student who can't solve elementary-school math problems.
Why is this the case? The researchers suggest that VLMs may see images like a near-sighted person, struggling to make out details. They can pick up the overall gist of an image, but they get confused about precise spatial information, such as whether two shapes intersect or overlap.
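The "precise spatial information" these models fumble is, again, elementary geometry. A common way to test whether two line segments cross is the orientation (cross-product) test, sketched below for segments in general position (collinear and endpoint-touching edge cases are deliberately ignored here):

```python
def orientation(p, q, r):
    """Sign of the cross product (q-p) x (r-p):
    > 0 counter-clockwise, < 0 clockwise, 0 collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_intersect(a, b, c, d):
    """Segments ab and cd cross iff the endpoints of each
    segment lie on opposite sides of the line through the other."""
    return (orientation(a, b, c) * orientation(a, b, d) < 0 and
            orientation(c, d, a) * orientation(c, d, b) < 0)

print(segments_intersect((0, 0), (2, 2), (0, 2), (2, 0)))  # crossing diagonals → True
print(segments_intersect((0, 0), (1, 0), (0, 1), (1, 1)))  # parallel segments → False
```

The point is not that VLMs should run this code, but that the visual judgment it encodes is exactly the kind humans make at a glance and the models get wrong a large fraction of the time.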
For instance, when researchers asked the VLMs whether two circles overlap, the models still couldn't answer correctly 100% of the time, even when the circles were drawn as big as watermelons. And when asked to count the circles in an Olympic-style logo, their performance was equally underwhelming.
Even more interesting, the researchers found that these VLMs seem to have a soft spot for the number 5 when counting. When the logo contains more than 5 circles, they still tend to answer "5", possibly because the famous 5-ring Olympic logo has made that number especially familiar to them.
Well, after all that, have you, my friends, gained a new perspective on these seemingly high-tech VLMs? They still have real limitations in visual understanding and are far from reaching human level. So next time someone claims that AI can completely replace humans, you can just smile knowingly.
Paper link: https://arxiv.org/pdf/2407.06581
Project page: https://vlmsareblind.github.io/