The Qwen team recently announced the open-source release of their latest multimodal reasoning model, QVQ, marking a significant step forward in AI's visual understanding and complex problem-solving capabilities. Built on Qwen2-VL-72B, the model aims to strengthen reasoning by integrating language and visual information. On the MMMU evaluation, QVQ scored 70.3, a substantial improvement over Qwen2-VL-72B-Instruct across various math-related benchmarks.

QVQ shows particular strength in visual reasoning tasks, especially those requiring complex analytical thinking. While QVQ-72B-Preview performs excellently, the team also noted several limitations: language mixing and code-switching, a tendency to fall into circular reasoning patterns, safety and ethical considerations, and performance and benchmark limitations. The team emphasized that although the model has improved in visual reasoning, it cannot fully replace the capabilities of Qwen2-VL-72B; during multi-step visual reasoning, the model may gradually lose focus on the image content, leading to hallucinations.


The Qwen team evaluated QVQ-72B-Preview on four datasets — MMMU, MathVista, MathVision, and OlympiadBench — designed to assess the model's comprehensive understanding and reasoning over visual information. QVQ-72B-Preview performed strongly across these benchmarks, effectively narrowing the gap with leading models.

To further demonstrate QVQ's use in visual reasoning tasks, the Qwen team provided several examples and shared a link to their technical blog. They also offered code examples for model inference, along with guidance on calling QVQ-72B-Preview directly through ModelScope's API-Inference service via API calls.
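As a rough illustration of what such an API call might look like, below is a minimal sketch using an OpenAI-compatible chat client. The base URL, model identifier string, and message schema here are assumptions on my part, not taken from the original post — consult the linked blog for the official inference examples.

```python
# Sketch: querying QVQ-72B-Preview through an OpenAI-compatible endpoint.
# The base_url and image URL below are placeholders/assumptions.

def build_vision_messages(image_url: str, question: str) -> list[dict]:
    """Build a multimodal chat message: one image plus a text question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }
    ]


def ask_qvq(
    image_url: str,
    question: str,
    api_key: str,
    base_url: str = "https://api-inference.modelscope.cn/v1",  # assumed endpoint
) -> str:
    """Send one image + question to QVQ-72B-Preview and return the reply text."""
    from openai import OpenAI  # assumes the `openai` client package is installed

    client = OpenAI(api_key=api_key, base_url=base_url)
    resp = client.chat.completions.create(
        model="Qwen/QVQ-72B-Preview",
        messages=build_vision_messages(image_url, question),
    )
    return resp.choices[0].message.content
```

The message-construction step is separated from the network call so the payload shape can be inspected or reused independently of any particular API key or endpoint.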

Model Link:

https://modelscope.cn/models/Qwen/QVQ-72B-Preview

Experience Link:

https://modelscope.cn/studios/Qwen/QVQ-72B-preview

Chinese Blog:

https://qwenlm.github.io/zh/blog/qvq-72b-preview