Alibaba's cloud computing division has just released a new AI model, Qwen2-VL. The model's strength lies in its ability to comprehend visual content, including images and videos; it can even analyze videos over 20 minutes long in real time, making it quite powerful.

Official blog post: https://qwenlm.github.io/blog/qwen2-vl/

Compared to other leading models (such as Meta's Llama 3.1, OpenAI's GPT-4o, Anthropic's Claude 3 Haiku, and Google's Gemini 1.5 Flash), it performs exceptionally well in third-party benchmark tests, with the 72B model surpassing OpenAI's GPT-4o on certain metrics, as illustrated in the chart below:

[Chart: benchmark comparison of Qwen2-VL-72B against other leading models]

Superb Analysis of Images and Videos

Qwen2-VL is designed to advance the understanding and processing of visual data. It can not only analyze static images but also summarize video content, answer related questions, and even provide real-time online chat support.

As the Qwen research team wrote in their blog post on GitHub about the new Qwen2-VL series models: "In addition to static images, Qwen2-VL extends its capabilities to video content analysis. It can summarize video content, answer related questions, and maintain a continuous conversational flow in real time, providing chat support. This feature allows it to act as a personal assistant, helping users by providing insights and information extracted directly from video content."

More importantly, the team claims that it can analyze videos over 20 minutes long and answer questions about their content. This means that whether it's for online learning, technical support, or any situation requiring understanding of video content, Qwen2-VL can be a valuable assistant. The team also showcased an example of the new model correctly analyzing and describing a demo video.
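For a concrete picture of how this is used, here is a minimal sketch of asking the open 7B checkpoint to summarize a local video, following the usage pattern documented on the Qwen2-VL Hugging Face model card; it assumes the `transformers` and `qwen-vl-utils` packages are installed, and the video path and prompt are placeholders:

```python
# Minimal sketch: summarizing a local video with Qwen2-VL-7B-Instruct.
# Assumes: pip install transformers qwen-vl-utils (plus a GPU with enough memory).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The chat template accepts interleaved video and text content;
# the file path here is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/lecture.mp4", "fps": 1.0},
        {"type": "text", "text": "Summarize this video and list its key points."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # decodes video frames
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated answer is decoded.
answers = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)],
    skip_special_tokens=True,
)
print(answers[0])
```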

Additionally, Qwen2-VL boasts strong language capabilities, supporting English, Chinese, and various European languages, as well as Japanese, Korean, Arabic, and Vietnamese, making it accessible to global users. To better understand its capabilities, Alibaba has shared related application examples on their GitHub.

Three Versions

This new model comes in three parameter sizes: Qwen2-VL-72B (72 billion parameters), Qwen2-VL-7B, and Qwen2-VL-2B. The 7B and 2B versions are available under the open-source Apache 2.0 license, allowing businesses to use them freely for commercial purposes.

However, the largest 72B version has not been released publicly and is available only through a separate license and API.

Furthermore, Qwen2-VL introduces some innovative technical features, such as Naive Dynamic Resolution support, which maps images of different resolutions to a dynamic number of visual tokens, ensuring consistent and accurate visual interpretation. There is also the Multimodal Rotary Position Embedding (M-RoPE) system, which synchronizes and integrates positional information across text, images, and videos.
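To make the dynamic-resolution idea concrete: the Hugging Face processor for the open checkpoints exposes `min_pixels` and `max_pixels` arguments that bound how many visual tokens an image may be mapped to (each token covers a 28×28 pixel patch). The sketch below uses sample values, not tuned recommendations, and the image path is a placeholder:

```python
# Sketch: bounding the visual-token budget used by Naive Dynamic Resolution.
# Each visual token corresponds to a 28x28 pixel patch, so bounds are
# expressed in multiples of 28*28. The values below are sample settings.
from transformers import AutoProcessor

min_pixels = 256 * 28 * 28    # floor: at least ~256 visual tokens per image
max_pixels = 1280 * 28 * 28   # ceiling: at most ~1280 visual tokens per image
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

# The budget can also be overridden per image inside a chat message
# (the image path is a placeholder):
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/diagram.png",
         "min_pixels": 50 * 28 * 28, "max_pixels": 512 * 28 * 28},
        {"type": "text", "text": "What does this diagram show?"},
    ],
}]
```

A larger `max_pixels` generally buys more visual detail at the cost of speed and memory.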

The release of Qwen2-VL marks another breakthrough in visual language model technology. The Qwen team at Alibaba stated that they will continue to enhance these models' capabilities and explore more application scenarios. Now, the Qwen2-VL model is available for use, and developers and researchers are welcome to try out these cutting-edge technologies and the new possibilities they bring!

Key Points:

✅ 🌟 **Strong Video Analysis Capability**: Capable of real-time analysis of video content over 20 minutes long, answering related questions!

✅ 🌍 **Multilingual Support**: Supports multiple languages, making it accessible to global users!

✅ 📦 **Open Source Versions Available**: The 7B and 2B versions are open source, allowing businesses to freely use them, suitable for innovative teams!