Alibaba Cloud has open-sourced Qwen2.5-VL, a new vision-language model, releasing it in three sizes: 3B, 7B, and 72B.
The flagship Qwen2.5-VL-72B took first place in visual understanding across 13 authoritative benchmarks, surpassing GPT-4o and Claude 3.5. According to Alibaba Cloud's official introduction, Qwen2.5-VL interprets image content more accurately and adds breakthrough support for understanding videos longer than one hour. The model can locate specific events within a video and summarize the key points of different segments, helping users extract essential information quickly and efficiently.
Additionally, Qwen2.5-VL can act as an AI visual agent that operates smartphones and computers without fine-tuning, carrying out complex multi-step tasks such as sending greetings to a specified contact, editing photos on a computer, and booking tickets on a phone. Qwen2.5-VL excels at recognizing common objects such as flowers, birds, fish, and insects, and can also analyze text, charts, icons, graphics, and layouts within images. Alibaba Cloud has also strengthened Qwen2.5-VL's OCR capabilities, improving text recognition and localization across multiple scenes, languages, and orientations.
Its information extraction capabilities have also been significantly improved to meet the growing demand for digital and intelligent solutions in areas such as qualification verification and financial services.
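Because the checkpoints are open-sourced, one quick way to try them is through Hugging Face Transformers. The sketch below is a minimal, illustrative example, not Alibaba Cloud's official guide: it assumes the Qwen/Qwen2.5-VL-7B-Instruct checkpoint, a recent transformers release that includes Qwen2.5-VL support, and the qwen_vl_utils helper package; "chart.png" and the prompt text are placeholders.

```python
# Minimal sketch (assumptions: Qwen/Qwen2.5-VL-7B-Instruct checkpoint, a transformers
# version with Qwen2.5-VL support, and the qwen_vl_utils helper package installed).
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the model and its processor; device_map="auto" places weights on available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A single-image question; "chart.png" is a hypothetical local file.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "chart.png"},
            {"type": "text", "text": "Summarize the key figures in this chart."},
        ],
    }
]

# Build the chat prompt and collect the image/video inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate a response and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```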
Key Highlights:
🌟 Alibaba Cloud has open-sourced Qwen2.5-VL, offering three versions: 3B, 7B, and 72B.
📈 Qwen2.5-VL-72B surpasses GPT-4o and Claude 3.5 in visual understanding evaluations.
👀 Qwen2.5-VL supports understanding of videos longer than one hour and offers enhanced OCR capabilities.