On August 30, 2024, the Tongyi Qianwen team at Alibaba DAMO Academy announced a significant update to its Qwen2-VL model. The update brings notable improvements in image understanding, video processing, and multilingual support, with the model setting new highs on key performance benchmarks.

Qwen2-VL introduces four main features: enhanced image understanding, for more accurate interpretation of visual information; advanced video understanding, enabling real-time analysis of dynamic video content; integrated visual agent functionality, which lets the model act as an agent capable of complex reasoning and decision-making; and expanded multilingual support, making it more accessible and effective across language environments.

On the architecture side, Qwen2-VL supports dynamic resolution: it can process images of any resolution without first dividing them into blocks, which keeps the model input consistent with the information inherent in the image. In addition, Multimodal Rotary Position Embedding (M-RoPE) allows the model to capture and integrate 1D text, 2D visual, and 3D video positional information at the same time.
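The idea behind M-RoPE can be illustrated with a small sketch of how position IDs might be assigned. It is not the team's implementation: the function name build_position_ids and the segment format are hypothetical, and the numbering scheme (three components per token for time, height, and width; identical components for text; each segment starting one past the largest ID used so far) is an assumption based on the public description of the method. The rotary embedding itself is omitted.

```python
from typing import List, Sequence, Tuple

# One position per token: (temporal, height, width).
Position = Tuple[int, int, int]


def build_position_ids(segments: Sequence[tuple]) -> List[Position]:
    """Assign illustrative M-RoPE-style position IDs.

    Hypothetical segment format (an assumption for this sketch):
      ("text", length)
      ("vision", frames, height, width)   # frames == 1 for a single image
    """
    positions: List[Position] = []
    start = 0  # first free ID for the next segment
    for segment in segments:
        kind = segment[0]
        if kind == "text":
            (length,) = segment[1:]
            # Text is 1D: all three components carry the same index.
            block = [(start + i, start + i, start + i) for i in range(length)]
        elif kind == "vision":
            frames, height, width = segment[1:]
            # Vision tokens vary by frame, row, and column.
            block = [
                (start + t, start + h, start + w)
                for t in range(frames)
                for h in range(height)
                for w in range(width)
            ]
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        if block:
            positions.extend(block)
            # Assumed rule: the next segment starts after the largest ID used.
            start = max(max(p) for p in block) + 1
    return positions


if __name__ == "__main__":
    # Two text tokens, a 2x2 image, then one more text token.
    for pos in build_position_ids([("text", 2), ("vision", 1, 2, 2), ("text", 1)]):
        print(pos)
```

For a single image the temporal component stays constant while the height and width components change from token to token; for video the temporal component advances with each frame. Text tokens reduce to ordinary 1D positions.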

The Qwen2-VL-7B model retains support for image, multi-image, and video inputs at the 7B scale, and performs exceptionally well on document understanding tasks and on understanding multilingual text in images.

The team has also released a 2B model optimized for mobile deployment. Despite having only 2B parameters, it excels in image, video, and multilingual understanding.

Model Links:

Qwen2-VL-2B-Instruct: https://www.modelscope.cn/models/qwen/Qwen2-VL-2B-Instruct

Qwen2-VL-7B-Instruct: https://www.modelscope.cn/models/qwen/Qwen2-VL-7B-Instruct