Alibaba Cloud has open-sourced Qwen-VL, a large vision-language model, following its August release of the general-purpose model Qwen-7B and the conversational model Qwen-7B-Chat. Qwen-VL supports both Chinese and English and can be applied to tasks such as knowledge-based question answering, image captioning, and visual question answering. A distinguishing feature is Chinese open-domain visual grounding: the model can accurately mark objects in an image with bounding boxes. Built on Qwen-7B, Qwen-VL adds a visual encoder so that the model accepts image input. It achieved the best results among models of comparable size on multiple vision-language benchmarks. Qwen-VL is available on platforms including ModelScope. Multimodal large models are an important direction of development, though technical challenges remain.