On September 2nd, Tongyi Qianwen announced the open-source release of its second-generation vision-language model, Qwen2-VL, in two sizes, 2B and 7B, along with their quantized versions. An API is also available on the Alibaba Cloud BaiLian platform for direct user access.

The Qwen2-VL model delivers comprehensive performance improvements. It can understand images of varying resolutions and aspect ratios, achieving state-of-the-art results on benchmarks such as DocVQA, RealWorldQA, and MTVQA. It can also comprehend videos longer than 20 minutes, supporting applications such as video-based question answering, dialogue, and content creation. In addition, Qwen2-VL has strong visual agent capabilities for complex reasoning and decision-making, allowing it to be integrated with devices such as smartphones and robots and operate them autonomously.

The model can understand multilingual text in images and videos, including Chinese, English, most European languages, Japanese, Korean, Arabic, Vietnamese, and more. The Tongyi Qianwen team evaluated the model across six dimensions: comprehensive college-level problems, mathematical ability, multilingual text understanding in documents and tables, general scene question answering, video comprehension, and agent capabilities.


As the flagship model, Qwen2-VL-72B achieves top-tier results on most metrics. Qwen2-VL-7B delivers competitive performance at an economical parameter size, while Qwen2-VL-2B targets a variety of mobile applications and retains full capabilities for understanding multilingual content in images and videos.

In terms of model architecture, Qwen2-VL continues the series' ViT-plus-Qwen2 structure, with all three model sizes using a ViT of roughly 600M parameters that accepts both image and video input in a unified way. To improve the model's perception of visual information and its video understanding, the team upgraded the architecture with full support for native dynamic resolution and a Multimodal Rotary Position Embedding (M-RoPE) scheme.
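As a rough illustration of the M-RoPE idea (a conceptual sketch only, not the released implementation), each visual token can be assigned separate rotary position indices along the temporal, height, and width axes, while an ordinary text token reuses a single 1-D sequence index for all three components:

```python
# Conceptual sketch of M-RoPE position indexing (not the official implementation).
# A text token gets the same index on all three axes; a visual patch token gets
# separate temporal / height / width indices, so rotary embeddings can encode
# 2-D image layout and frame order instead of a flat 1-D sequence position.

def text_position(seq_idx: int) -> tuple[int, int, int]:
    """1-D text position replicated across the (temporal, height, width) axes."""
    return (seq_idx, seq_idx, seq_idx)

def vision_position(frame: int, row: int, col: int) -> tuple[int, int, int]:
    """Separate temporal, height, and width indices for a visual patch."""
    return (frame, row, col)

# Example: the 3rd text token vs. the patch at frame 0, row 2, column 5.
print(text_position(2))          # (2, 2, 2)
print(vision_position(0, 2, 5))  # (0, 2, 5)
```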

The Alibaba Cloud BaiLian platform provides the Qwen2-VL-72B API, which users can call directly. Meanwhile, the open-source Qwen2-VL-2B and Qwen2-VL-7B models have been integrated into Hugging Face Transformers, vLLM, and other third-party frameworks, so developers can download and run them through these platforms.
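For reference, below is a minimal sketch of calling the open-source 7B model through Hugging Face Transformers, following the usage pattern published in the Qwen2-VL repository; the checkpoint name, the qwen_vl_utils helper package, and the placeholder image path are assumptions here and may differ from your setup.

```python
# Minimal sketch: image question answering with Qwen2-VL-7B-Instruct via Transformers.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper published alongside the Qwen2-VL repo

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "demo.jpeg"},  # placeholder: path or URL to your image
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the chat prompt and extract image/video inputs from the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same message format also accepts `{"type": "video", "video": ...}` entries, which is how the framework exposes the model's video understanding.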

Alibaba Cloud BaiLian Platform:

https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api

GitHub:

https://github.com/QwenLM/Qwen2-VL

HuggingFace:

https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d

ModelScope:

https://modelscope.cn/organization/qwen?tab=model

Model Experience:

https://huggingface.co/spaces/Qwen/Qwen2-VL