On September 2nd, Tongyi Qianwen announced the open-source release of its second-generation vision-language model, Qwen2-VL, in two sizes, 2B and 7B, along with their quantized versions. An API is also available on the Alibaba Cloud BaiLian platform for direct user access.

The Qwen2-VL model delivers comprehensive performance improvements. It can understand images of varying resolutions and aspect ratios, achieving state-of-the-art results on benchmarks such as DocVQA, RealWorldQA, and MTVQA. It can also comprehend videos longer than 20 minutes, supporting applications such as video-based question answering, dialogue, and content creation. In addition, Qwen2-VL has strong visual agent capabilities for complex reasoning and decision-making, allowing it to be integrated with devices such as smartphones and robots and operate them autonomously.

The model can understand multilingual text in images and videos, including Chinese, English, most European languages, Japanese, Korean, Arabic, Vietnamese, and more. The Tongyi Qianwen team evaluated the model across six dimensions: comprehensive college-level problems, mathematical ability, multilingual text understanding in documents and tables, general scene question answering, video comprehension, and agent capabilities.


As the flagship model, Qwen2-VL-72B achieves top-tier results on most metrics. Qwen2-VL-7B delivers competitive performance at an economical parameter size, while Qwen2-VL-2B targets a variety of mobile applications and retains full capabilities for understanding multilingual content in images and videos.

In terms of model architecture, Qwen2-VL continues the series' ViT-plus-Qwen2 structure, with all three model sizes using a ViT of roughly 600M parameters that accepts both image and video input in a unified way. To improve the model's perception of visual information and its video understanding, the team upgraded the architecture with full support for native dynamic resolution and a Multimodal Rotary Position Embedding (M-RoPE) scheme.
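As a rough illustration of the M-RoPE idea (a conceptual sketch only, not the released implementation), each visual token can be assigned separate rotary position indices along the temporal, height, and width axes, while an ordinary text token reuses a single 1-D sequence index for all three components:

```python
# Conceptual sketch of M-RoPE position indexing (not the official implementation).
# A text token gets the same index on all three axes; a visual patch token gets
# separate temporal / height / width indices, so rotary embeddings can encode
# 2-D image layout and frame order instead of a flat 1-D sequence position.

def text_position(seq_idx: int) -> tuple[int, int, int]:
    """1-D text position replicated across the (temporal, height, width) axes."""
    return (seq_idx, seq_idx, seq_idx)

def vision_position(frame: int, row: int, col: int) -> tuple[int, int, int]:
    """Separate temporal, height, and width indices for a visual patch."""
    return (frame, row, col)

# Example: the 3rd text token vs. the patch at frame 0, row 2, column 5.
print(text_position(2))          # (2, 2, 2)
print(vision_position(0, 2, 5))  # (0, 2, 5)
```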

The Alibaba Cloud BaiLian platform provides the Qwen2-VL-72B API, which users can call directly. Meanwhile, the open-source Qwen2-VL-2B and Qwen2-VL-7B models have been integrated into Hugging Face Transformers, vLLM, and other third-party frameworks, so developers can download and run them through these platforms.
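For reference, below is a minimal sketch of calling the open-source 7B model through Hugging Face Transformers, following the usage pattern published in the Qwen2-VL repository; the checkpoint name, the qwen_vl_utils helper package, and the placeholder image path are assumptions here and may differ from your setup.

```python
# Minimal sketch: image question answering with Qwen2-VL-7B-Instruct via Transformers.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper published alongside the Qwen2-VL repo

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo.jpeg"},  # placeholder: path or URL to your image
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the chat prompt and extract image/video inputs from the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same message format also accepts `{"type": "video", "video": ...}` entries, which is how the framework exposes the model's video understanding.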

Alibaba Cloud BaiLian Platform:

https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api

GitHub:

https://github.com/QwenLM/Qwen2-VL

HuggingFace:

https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d

ModelScope:

https://modelscope.cn/organization/qwen?tab=model

Model Experience:

https://huggingface.co/spaces/Qwen/Qwen2-VL