On April 11th, OpenGVLab open-sourced the InternVL3 series of models, marking a new milestone in the field of Multimodal Large Language Models (MLLMs). The InternVL3 series comprises seven models ranging from 1B to 78B parameters, capable of processing text, images, and videos together and delivering strong overall performance among open-source MLLMs.
Compared to its predecessor, InternVL2.5, InternVL3 shows significant improvements in multimodal perception and reasoning, and its capabilities extend further to tool use, GUI agents, industrial image analysis, 3D visual perception, and more. Moreover, thanks to native multimodal pre-training, the overall text performance of the InternVL3 series even surpasses that of the Qwen2.5 series, whose models serve as the initialization of InternVL3's language component.
The InternVL3 series models follow the "ViT-MLP-LLM" paradigm, integrating a new, incrementally pre-trained InternViT vision encoder with various pre-trained LLMs (including InternLM3 and Qwen2.5) through randomly initialized MLP projectors.
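To make the wiring concrete, here is a minimal sketch of the ViT-MLP-LLM composition in PyTorch. The class, module, and dimension names are illustrative assumptions rather than the actual InternVL3 code: a vision encoder produces patch tokens, a randomly initialized MLP projects them into the LLM's embedding space, and the projected tokens are fed to the LLM alongside the text embeddings.

```python
import torch
import torch.nn as nn


class VitMlpLlm(nn.Module):
    """Illustrative composition: vision encoder -> MLP projector -> LLM."""

    def __init__(self, vit: nn.Module, llm: nn.Module, vit_dim: int, llm_dim: int):
        super().__init__()
        self.vit = vit    # vision encoder (e.g. InternViT)
        self.llm = llm    # language model (e.g. Qwen2.5 or InternLM3)
        # Randomly initialized MLP projector bridging vision and language spaces.
        self.projector = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_tokens = self.vit(pixel_values)         # (B, N_vis, vit_dim)
        vision_embeds = self.projector(vision_tokens)  # project into LLM embedding space
        # Projected visual tokens are placed alongside the text embeddings and fed to the LLM.
        return self.llm(torch.cat([vision_embeds, text_embeds], dim=1))


# Stand-in modules, just to show the data flow end to end.
vit = nn.Linear(768, 1024)   # pretend patch features -> ViT hidden states
llm = nn.Identity()          # pretend LLM that simply returns its input embeddings
model = VitMlpLlm(vit, llm, vit_dim=1024, llm_dim=2048)

pixels = torch.randn(1, 256, 768)   # dummy "patch" features for one tile
text = torch.randn(1, 16, 2048)     # dummy text embeddings
print(model(pixels, text).shape)    # torch.Size([1, 272, 2048])
```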
For model inference, InternVL3 applies a pixel unshuffle operation that reduces the number of visual tokens to one quarter, and uses a dynamic resolution strategy that divides input images into tiles of 448x448 pixels. As in InternVL 2.0 and 2.5, multi-image and video inputs are also supported. InternVL3 further integrates Variable Visual Position Encoding (V2PE), which assigns smaller, more flexible position increments to visual tokens and thereby improves long-context understanding.
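The token arithmetic behind pixel unshuffle is easy to verify: a 448x448 tile encoded with 14x14 patches yields a 32x32 grid of 1024 visual tokens, and folding each 2x2 block of neighboring patch features into the channel dimension leaves a 16x16 grid of 256 tokens, i.e. one quarter. The sketch below (generic PyTorch, not the InternVL3 source) demonstrates the reduction.

```python
import torch


def pixel_unshuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold each scale x scale spatial block of patch features into channels."""
    b, h, w, c = x.shape
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, h // scale, w // scale, c * scale * scale)


patch_grid = torch.randn(1, 32, 32, 1024)   # 32x32 = 1024 visual tokens per 448x448 tile
reduced = pixel_unshuffle(patch_grid)       # -> (1, 16, 16, 4096)
print(reduced.shape[1] * reduced.shape[2])  # 256 tokens: one quarter of 1024
```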
In terms of model deployment, InternVL3 can be served as an OpenAI-compatible API via LMDeploy's api_server. Users only need to install lmdeploy>=0.7.3 and start the service with the lmdeploy serve api_server command. Once the server is running, the model can be called through the standard OpenAI API interface, specifying parameters such as the model name and the message content to obtain the model's response.
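As a sketch of the client side, the snippet below queries such an LMDeploy endpoint through the official openai Python package. The base URL, port, and image URL are placeholders to adapt to your own deployment, and the served model name is read back from the endpoint rather than hard-coded.

```python
# Assumes a server was started beforehand, e.g. with something like:
#   lmdeploy serve api_server OpenGVLab/InternVL3-8B --server-port 23333
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://0.0.0.0:23333/v1")

# LMDeploy registers the served model under its own name; query it first.
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder image
        ],
    }],
    temperature=0.8,
)
print(response.choices[0].message.content)
```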
Experience it here: https://modelscope.cn/collections/InternVL3-5d0bdc54b7d84e