Recently, ModelScope, the open-source model community operated by Alibaba DAMO Academy (Hangzhou) Technology Co., Ltd., announced a significant milestone on its official account: the release of the InternVL2.5 model. This open-source multimodal large language model delivers outstanding performance, becoming the first open-source model to exceed 70% accuracy on the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, putting it on par with commercial models such as GPT-4o and Claude-3.5-Sonnet.

InternVL2.5 gains a further 3.7 percentage points on this benchmark when Chain of Thought (CoT) reasoning is used, demonstrating strong potential for test-time scaling. The model builds on InternVL2.0, improving performance through refined training and testing strategies and higher-quality data. The team conducted in-depth studies of the vision encoder, the language model, dataset size, and test-time configurations to explore the relationship between model scale and performance.
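
As a rough illustration of what CoT evaluation means here (the exact prompt templates used in the InternVL2.5 evaluation are not given in this article, so the wording below is hypothetical), the same multiple-choice question can be asked with or without an explicit reasoning step:

```python
question = (
    "Which material shown in the image is the best electrical conductor?\n"
    "A. rubber  B. copper  C. glass  D. wood"
)

# Direct prompt: the model must answer immediately with an option letter.
direct_prompt = question + "\nAnswer with the option's letter directly."

# CoT prompt: the model first reasons step by step, then commits to a letter.
# Letting the model "think longer" at inference time like this is one form
# of test-time scaling.
cot_prompt = (
    question
    + "\nReason step by step about each option, then give the final answer"
    + " as a single option letter."
)
```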

InternVL2.5 demonstrates competitive performance across a broad set of benchmarks, particularly in multidisciplinary reasoning, document understanding, multi-image and video comprehension, real-world understanding, multimodal hallucination detection, visual grounding, multilingual capability, and pure language processing. This achievement not only sets a new standard for the open-source community in developing and applying multimodal AI systems but also opens up new possibilities for research and applications in artificial intelligence.

InternVL2.5 retains the "ViT-MLP-LLM" architecture of its predecessors, InternVL1.5 and InternVL2.0, coupling the newly incrementally pre-trained InternViT-6B or InternViT-300M with pre-trained LLMs of various sizes and types through a randomly initialized two-layer MLP projector. To scale efficiently to high-resolution inputs, the research team applies a pixel unshuffle operation that reduces the number of visual tokens to one quarter of the original count.
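
A minimal sketch of that token-reduction step, assuming the ViT output is a square grid of patch tokens (the function name, grid size, and channel widths here are illustrative, not InternVL's actual code):

```python
import torch
import torch.nn.functional as F

def merge_visual_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Pixel-unshuffle a square grid of visual tokens: fold each
    scale x scale spatial neighborhood into the channel dimension,
    cutting the token count by a factor of scale**2 (4x for scale=2).

    x: (batch, num_tokens, channels), num_tokens a perfect square.
    """
    b, n, c = x.shape
    side = int(n ** 0.5)                                  # token grid side
    x = x.view(b, side, side, c).permute(0, 3, 1, 2)      # (b, c, h, w)
    x = F.pixel_unshuffle(x, scale)                       # (b, c*s^2, h/s, w/s)
    return x.permute(0, 2, 3, 1).reshape(b, n // scale**2, c * scale**2)

# A 448x448 tile through a patch-14 ViT yields a 32x32 = 1024-token grid;
# after unshuffling, the LLM receives only 256 (wider) tokens, which the
# two-layer MLP projector then maps into the LLM's embedding space.
tokens = torch.randn(1, 1024, 1024)                       # toy channel width
print(merge_visual_tokens(tokens).shape)                  # (1, 256, 4096)
```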

Because the model is open source, researchers and developers can freely access and use InternVL2.5, which should significantly accelerate development and innovation in multimodal AI technology.

Model Link:

https://www.modelscope.cn/collections/InternVL-25-fbde6e47302942
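
For reference, a minimal loading sketch. The checkpoint ID "OpenGVLab/InternVL2_5-8B" and the `trust_remote_code` pattern follow typical InternVL model cards and are assumptions here; see the collection page above for the authoritative usage, including image preprocessing and the chat interface:

```python
import torch
from modelscope import snapshot_download      # fetches weights from ModelScope
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint ID for the 8B variant; other sizes live in the same
# collection linked above.
path = snapshot_download("OpenGVLab/InternVL2_5-8B")

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,    # bf16 keeps the 8B model on one large GPU
    trust_remote_code=True,        # InternVL ships custom modeling code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
```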