Alibaba recently released a preview of QVQ-72B, a new multimodal reasoning model. Built on Qwen2-VL-72B, it combines language and visual capabilities to handle more complex reasoning and analytical tasks, marking a new breakthrough for Alibaba in multimodal AI.

QVQ-72B shows significant improvements in visual reasoning, mathematics, and scientific problem-solving, especially on multi-step reasoning tasks. The model interprets both textual and visual information and works through complex problems step by step, something traditional AI models find challenging.

A major highlight of this model is its ability to derive causal relationships in physics problems by combining textual and visual information. Given an image of a physical scene and a related textual description, for example, it can infer how one event leads to another, demonstrating a deeper level of understanding.

In mathematical reasoning tasks (such as algebra and calculus), QVQ-72B significantly reduces the error rate through step-by-step reasoning. This indicates that the model can not only perform simple calculations but also engage in complex mathematical reasoning, providing clear problem-solving steps and offering new tools for tackling complicated mathematical issues.

Moreover, QVQ-72B extracts key information from technical reports and complex charts quickly and accurately, making it a useful support tool for researchers, analysts, and other professionals.

In terms of image recognition, QVQ-72B can accurately identify details within images, such as object positions, colors, spatial relationships, and complex scenarios. This means the model can be applied to a broader range of contexts, such as intelligent surveillance and autonomous driving.

In summary, Alibaba's QVQ-72B multimodal reasoning model, with its strong visual, linguistic, and reasoning capabilities, offers new ideas and tools for solving complex problems. It is expected to advance the application of artificial intelligence across various fields and add momentum to the intelligent upgrade of industries.

Try it online: https://huggingface.co/spaces/Qwen/QVQ-72B-preview

For more details: https://qwenlm.github.io/blog/qvq-72b-preview/