On February 21, 2025, the Alibaba Internationalization Team announced the official open-source release of its new multimodal large language model series, Ovis2.

Ovis2 is the latest release in the Ovis series from the Alibaba Internationalization Team. Compared with the previous version, Ovis1.6, it brings significant improvements in data construction and training methodology: it raises the capability density of the smaller models and substantially strengthens chain-of-thought (CoT) reasoning through instruction fine-tuning and preference learning. Ovis2 also adds video and multi-image processing, and improves multilingual support and OCR in complex scenarios, making the model considerably more practical.

The open-sourced Ovis2 series comes in six sizes: 1B, 2B, 4B, 8B, 16B, and 34B parameters, each reaching state-of-the-art (SOTA) results for its size. Ovis2-34B in particular performs strongly on the OpenCompass leaderboard: it ranks second among all open-source models on the multimodal general-capability list, surpassing many flagship 70B-class open-source models with less than half their parameter count, and it ranks first among open-source models on the multimodal mathematical reasoning list, with the other sizes also showing strong reasoning ability. These results both validate the Ovis architecture and illustrate the potential of the open-source community in advancing multimodal large models.


The Ovis2 architecture addresses the structural mismatch between visual and textual embeddings. It consists of three key components: a visual tokenizer, a visual embedding table, and a large language model (LLM). The visual tokenizer splits an input image into patches, extracts features with a visual Transformer, and maps those features onto a vocabulary of "visual words" through a visual head, producing probabilistic visual tokens. The visual embedding table stores an embedding vector for each visual word, and the LLM processes the concatenated visual and text embeddings to generate text output, completing the multimodal task.
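To make the idea concrete, here is a minimal PyTorch sketch of this probabilistic lookup. The module names and dimensions are illustrative assumptions, not the actual Ovis2 implementation:

```python
import torch
import torch.nn as nn

# Illustrative dimensions; the real Ovis2 vocabulary and hidden sizes differ.
VISUAL_VOCAB_SIZE = 16384   # size of the "visual word" vocabulary (assumed)
VIT_HIDDEN_DIM = 1024       # feature dim from the visual Transformer (assumed)
LLM_HIDDEN_DIM = 4096       # embedding dim expected by the LLM (assumed)

class ProbabilisticVisualEmbedding(nn.Module):
    """Sketch of Ovis-style visual tokenization: each patch feature is mapped to
    a probability distribution over visual words, and its embedding is the
    probability-weighted mixture of rows in a visual embedding table."""

    def __init__(self):
        super().__init__()
        # Visual head: projects ViT patch features onto the visual vocabulary.
        self.visual_head = nn.Linear(VIT_HIDDEN_DIM, VISUAL_VOCAB_SIZE)
        # Visual embedding table: one learnable vector per visual word.
        self.visual_embed = nn.Embedding(VISUAL_VOCAB_SIZE, LLM_HIDDEN_DIM)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, VIT_HIDDEN_DIM)
        logits = self.visual_head(patch_features)          # (B, P, V)
        probs = torch.softmax(logits, dim=-1)              # probabilistic visual tokens
        # Soft lookup: expectation over the table instead of a hard argmax.
        visual_embeds = probs @ self.visual_embed.weight   # (B, P, LLM_HIDDEN_DIM)
        return visual_embeds

# The resulting visual embeddings are concatenated with the text token
# embeddings and fed to the LLM, mirroring how discrete text tokens index
# their own embedding table.
```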

In terms of training strategy, Ovis2 adopts a four-stage procedure to fully activate its multimodal understanding. The first stage freezes most of the LLM and ViT parameters and trains the visual module to learn the mapping from visual features to embeddings. The second stage further strengthens the visual module's feature extraction, improving high-resolution image understanding, multilingual ability, and OCR. The third stage aligns the visual embeddings with the LLM's dialogue format using caption data cast in conversational form. The fourth stage applies multimodal instruction training and preference learning, further improving the model's ability to follow user instructions and the quality of its outputs across modalities.
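As a rough illustration of such a staged schedule, the sketch below toggles which parameter groups are trainable per stage. The module names and the exact trainable sets are assumptions for illustration, not the official Ovis2 training code:

```python
# Hypothetical stage-wise freezing schedule, loosely following the description
# above. "model.vit", "model.visual_head", "model.visual_embed", and "model.llm"
# are placeholder attribute names, not the real Ovis2 code layout.

def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: int):
    if stage == 1:
        # Stage 1: freeze most of the ViT and the LLM; train the parts that map
        # visual features to embeddings (visual head + visual embedding table).
        set_trainable(model.vit, False)
        set_trainable(model.llm, False)
        set_trainable(model.visual_head, True)
        set_trainable(model.visual_embed, True)
    elif stage == 2:
        # Stage 2: unfreeze the visual module to strengthen feature extraction
        # on high-resolution, multilingual, and OCR data; LLM stays frozen.
        set_trainable(model.vit, True)
        set_trainable(model.visual_head, True)
        set_trainable(model.visual_embed, True)
        set_trainable(model.llm, False)
    else:
        # Stages 3-4 (assumed): train end to end, first on caption data in
        # dialogue form, then on multimodal instruction data with preference
        # learning.
        set_trainable(model.vit, True)
        set_trainable(model.visual_head, True)
        set_trainable(model.visual_embed, True)
        set_trainable(model.llm, True)
```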

To improve video understanding, Ovis2 introduces an innovative keyframe selection algorithm. It picks the most useful frames based on frame-text relevance, the diversity of the selected frame set, and the temporal order of frames. By combining conditional similarity computation in a high-dimensional space, Determinantal Point Processes (DPP), and a Markov Decision Process (MDP) formulation, the algorithm selects keyframes efficiently within a limited visual context, thereby improving video understanding performance.
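The snippet below is a deliberately simplified stand-in for that idea: a greedy, MMR-style picker that trades off frame-text relevance against redundancy with already selected frames and preserves temporal order. It is not the DPP/MDP algorithm used in Ovis2, only a sketch of the relevance-plus-diversity principle:

```python
import numpy as np

def select_keyframes(frame_embs: np.ndarray,
                     text_emb: np.ndarray,
                     k: int,
                     diversity_weight: float = 0.5) -> list[int]:
    """Toy relevance-plus-diversity keyframe picker (greedy, MMR-style).

    frame_embs: (num_frames, dim) frame embeddings, assumed L2-normalized.
    text_emb:   (dim,) query/text embedding, assumed L2-normalized.
    Returns up to k frame indices in temporal order.
    """
    relevance = frame_embs @ text_emb        # frame-text relevance scores
    pairwise = frame_embs @ frame_embs.T     # frame-frame similarity matrix
    selected: list[int] = []
    candidates = set(range(len(frame_embs)))
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            # Penalize frames too similar to anything already chosen.
            redundancy = max(pairwise[i, j] for j in selected) if selected else 0.0
            score = (1 - diversity_weight) * relevance[i] - diversity_weight * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # keep temporal order for the model's visual context
```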

On the OpenCompass multimodal leaderboards, Ovis2 models of every size achieve SOTA results on multiple benchmarks for their scale, with Ovis2-34B ranking second on the general-capability list and first on the mathematical reasoning list, as noted above. Ovis2 also achieves leading results on the video understanding rankings, further confirming its strength across multimodal tasks.

The Alibaba Internationalization Team states that open-sourcing is a key force in advancing AI technology. By sharing the Ovis2 research openly, the team hopes to explore the frontiers of multimodal large models together with developers worldwide and to inspire more innovative applications. The Ovis2 code is open-sourced on GitHub, the models are available on Hugging Face and ModelScope, and an online demo is provided for users to try. The accompanying research paper has been published on arXiv for developers and researchers.

Code: https://github.com/AIDC-AI/Ovis

Model (Hugging Face): https://huggingface.co/AIDC-AI/Ovis2-34B

Model (ModelScope): https://modelscope.cn/collections/Ovis2-1e2840cb4f7d45

Demo: https://huggingface.co/spaces/AIDC-AI/Ovis2-16B

arXiv: https://arxiv.org/abs/2405.20797
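
For reference, the following is a minimal sketch of loading one of the released checkpoints with Hugging Face transformers. It assumes the repository exposes a custom model class via trust_remote_code; the official GitHub README and model cards remain the authoritative source for the full inference API:

```python
# Minimal loading sketch (assumption: the checkpoint ships custom model code
# loadable via trust_remote_code, as is common for multimodal releases on
# Hugging Face; see the Ovis repository for the supported inference example).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "AIDC-AI/Ovis2-8B",          # any released size: 1B/2B/4B/8B/16B/34B
    torch_dtype=torch.bfloat16,  # assumed dtype; adjust to your hardware
    trust_remote_code=True,
).cuda()

# Image/text preprocessing and generation helpers are provided by the model's
# custom code; refer to the official repository for a complete example.
```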