Alibaba's cloud computing division has just released a new AI model, Qwen2-VL. The model's strength lies in its ability to comprehend visual content, including images and videos; it can even analyze videos over 20 minutes long in real time, making it quite powerful.

Official blog post: https://qwenlm.github.io/blog/qwen2-vl/

Compared to other leading models (such as Meta's Llama 3.1, OpenAI's GPT-4o, Anthropic's Claude 3 Haiku, and Google's Gemini 1.5 Flash), it performs exceptionally well in third-party benchmark tests, with the 72B model surpassing OpenAI's GPT-4o on certain metrics, as illustrated in the chart below:

[Chart: benchmark comparison of Qwen2-VL-72B against other leading models]

Superb Analysis of Images and Videos

Qwen2-VL is designed to advance the understanding and processing of visual data. It can not only analyze static images but also summarize video content, answer related questions, and even provide real-time online chat support.

As the Qwen research team wrote in their blog post on GitHub about the new Qwen2-VL series models: "In addition to static images, Qwen2-VL extends its capabilities to video content analysis. It can summarize video content, answer related questions, and maintain a continuous conversational flow in real time, providing chat support. This feature allows it to act as a personal assistant, helping users by providing insights and information extracted directly from video content."

More importantly, the team claims that it can analyze videos over 20 minutes long and answer questions about their content. This means that whether it's for online learning, technical support, or any situation requiring understanding of video content, Qwen2-VL can be a valuable assistant. The team also showcased an example of the new model correctly analyzing and describing a demo video.
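For a concrete picture of how this is used, here is a minimal sketch of asking the open 7B checkpoint to summarize a local video, following the usage pattern documented on the Qwen2-VL Hugging Face model card; it assumes the `transformers` and `qwen-vl-utils` packages are installed, and the video path and prompt are placeholders:

```python
# Minimal sketch: summarizing a local video with Qwen2-VL-7B-Instruct.
# Assumes: pip install transformers qwen-vl-utils (plus a GPU with enough memory).
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The chat template accepts interleaved video and text content;
# the file path here is a placeholder.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/lecture.mp4", "fps": 1.0},
        {"type": "text", "text": "Summarize this video and list its key points."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)  # decodes video frames
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated answer is decoded.
answers = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)],
    skip_special_tokens=True,
)
print(answers[0])
```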

Additionally, Qwen2-VL boasts strong language capabilities, supporting English, Chinese, and various European languages, as well as Japanese, Korean, Arabic, and Vietnamese, making it accessible to global users. To better understand its capabilities, Alibaba has shared related application examples on their GitHub.

Three Versions

This new model comes in three parameter sizes: Qwen2-VL-72B (72 billion parameters), Qwen2-VL-7B, and Qwen2-VL-2B. The 7B and 2B versions are available under the open-source Apache 2.0 license, allowing businesses to use them freely for commercial purposes.

However, the largest 72B version has not been released publicly and is available only through a separate license and API.

Furthermore, Qwen2-VL introduces some innovative technical features, such as Naive Dynamic Resolution support, which maps images of different resolutions to a dynamic number of visual tokens, ensuring consistent and accurate visual interpretation. There is also the Multimodal Rotary Position Embedding (M-RoPE) system, which synchronizes and integrates positional information across text, images, and videos.
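To make the dynamic-resolution idea concrete: the Hugging Face processor for the open checkpoints exposes `min_pixels` and `max_pixels` arguments that bound how many visual tokens an image may be mapped to (each token covers a 28×28 pixel patch). The sketch below uses sample values, not tuned recommendations, and the image path is a placeholder:

```python
# Sketch: bounding the visual-token budget used by Naive Dynamic Resolution.
# Each visual token corresponds to a 28x28 pixel patch, so bounds are
# expressed in multiples of 28*28. The values below are sample settings.
from transformers import AutoProcessor

min_pixels = 256 * 28 * 28    # floor: at least ~256 visual tokens per image
max_pixels = 1280 * 28 * 28   # ceiling: at most ~1280 visual tokens per image
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)

# The budget can also be overridden per image inside a chat message
# (the image path is a placeholder):
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/diagram.png",
         "min_pixels": 50 * 28 * 28, "max_pixels": 512 * 28 * 28},
        {"type": "text", "text": "What does this diagram show?"},
    ],
}]
```

A larger `max_pixels` generally buys more visual detail at the cost of speed and memory.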

The release of Qwen2-VL marks another breakthrough in visual language model technology. The Qwen team at Alibaba stated that they will continue to enhance these models' capabilities and explore more application scenarios. Now, the Qwen2-VL model is available for use, and developers and researchers are welcome to try out these cutting-edge technologies and the new possibilities they bring!

Key Points:

✅ 🌟 **Strong Video Analysis Capability**: Capable of real-time analysis of video content over 20 minutes long, answering related questions!

✅ 🌍 **Multilingual Support**: Supports multiple languages, making it accessible to global users!

✅ 📦 **Open Source Versions Available**: The 7B and 2B versions are open source, allowing businesses to freely use them, suitable for innovative teams!