Generative AI is advancing rapidly, with new models appearing one after another and producing increasingly impressive results. But comprehensively evaluating their performance has always been a challenge: how do we actually assess how well these text-to-image models work?
Traditional evaluation methods either rely on human visual inspection, which is subjective and expensive, or on simple automated metrics like CLIPScore. These metrics often fail to capture the nuances of complex text prompts, such as relationships between objects or logical reasoning. As a result, many text-to-image models are evaluated inaccurately, sometimes with the comical outcome that an image completely misses the prompt yet still scores high.
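To make the baseline concrete: CLIPScore (Hessel et al., 2021) is essentially a scaled cosine similarity between CLIP's image and text embeddings. The sketch below is an illustrative assumption, not the exact evaluation code used in these benchmarks; the checkpoint name and file paths are placeholders. Because the embeddings behave much like a bag of words, reordering a prompt often barely changes the score.

```python
# Minimal sketch of CLIPScore: a scaled cosine similarity between CLIP's
# image and text embeddings. The checkpoint and image path are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID).eval()
processor = CLIPProcessor.from_pretrained(MODEL_ID)

@torch.no_grad()
def clip_score(image_path: str, prompt: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    # Normalize the projected embeddings and take their cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cosine = (img * txt).sum(dim=-1).item()
    return 2.5 * max(cosine, 0.0)   # CLIPScore = 2.5 * max(cos, 0)

# Reordered prompts often score almost identically, which is the failure mode
# described above:
print(clip_score("generated.png", "a cat chasing a mouse"))
print(clip_score("generated.png", "a mouse chasing a cat"))
```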
To address this issue, researchers from Carnegie Mellon University and Meta have recently collaborated to introduce a new evaluation scheme for text-to-image models: VQAScore. Its core idea is to use a visual question answering (VQA) model to score how well a generated image matches its text prompt.
Specifically, VQAScore first converts the text prompt into a simple yes/no question, such as "Is there a cat chasing a mouse in this image?", and then feeds the generated image along with this question into the VQA model. VQAScore is defined as the probability that the VQA model answers "yes": the higher the probability, the better the image is judged to match the prompt.
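The idea is easy to prototype. The sketch below is an illustrative assumption rather than the authors' released implementation: it uses an off-the-shelf generative VQA model (BLIP-2 with a Flan-T5 backbone is assumed here) and reads off the probability the model assigns to "yes" for its first answer token.

```python
# Minimal sketch of the VQAScore idea (illustrative, not the authors' code).
# Assumptions: BLIP-2 / Flan-T5 as the VQA model, "yes" tokenizes to a single
# token, and a CUDA device is available. The image path is a placeholder.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

MODEL_ID = "Salesforce/blip2-flan-t5-xl"  # any VQA model exposing token probabilities works
processor = Blip2Processor.from_pretrained(MODEL_ID)
model = Blip2ForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda").eval()

@torch.no_grad()
def vqa_score(image_path: str, prompt: str) -> float:
    """Probability that the VQA model answers 'yes' to a question built from the prompt."""
    image = Image.open(image_path).convert("RGB")
    question = f'Does this figure show "{prompt}"? Please answer yes or no.'
    inputs = processor(images=image, text=question, return_tensors="pt").to("cuda", torch.float16)

    # Generate a single answer token and keep its score distribution over the vocabulary.
    out = model.generate(**inputs, max_new_tokens=1,
                         output_scores=True, return_dict_in_generate=True)
    probs = torch.softmax(out.scores[0][0].float(), dim=-1)

    # Score = P("yes"); renormalizing over {"yes", "no"} is a common variant.
    yes_id = processor.tokenizer("yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()

print(vqa_score("generated.png", "a cat chasing a mouse"))
```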
This method may seem simple, but it works surprisingly well. Researchers tested VQAScore on eight different text-to-image evaluation benchmarks and found that its accuracy and reliability far exceeded those of traditional methods, even rivaling approaches built on large models like GPT-4V.
Moreover, VQAScore is not limited to text-to-image evaluation: it also applies to text-to-video and text-to-3D evaluation, because its core component is a VQA model that can handle many types of visual content.
To further advance the field of text-to-image generation, the researchers have also created a new evaluation benchmark, GenAI-Bench. It includes 1,600 complex text prompts covering a range of visual-language reasoning abilities, such as comparison, counting, and logical reasoning. The researchers have also collected over 15,000 human annotations to assess the performance of different text-to-image models.
In summary, the emergence of VQAScore and GenAI-Bench brings new vitality to the field of text-to-image generation. VQAScore provides a more accurate and reliable evaluation method, helping researchers better assess the strengths and weaknesses of different models. GenAI-Bench offers a more comprehensive and challenging evaluation benchmark, pushing text-to-image models towards greater intelligence and human-like performance.
Of course, VQAScore also has some limitations. Currently, it primarily relies on open-source VQA models, whose performance is not yet on par with closed-source models like GPT-4V. In the future, as VQA models continue to improve, the performance of VQAScore will also be enhanced.
Project address: https://linzhiqiu.github.io/papers/vqascore/