InternVL2_5-1B-MPO

A multimodal large language model that enhances integrated understanding of visual and language data.

Common Product, Productivity, Multimodal, Large Language Model

InternVL2_5-1B-MPO is a multimodal large language model (MLLM) built on InternVL2.5 and trained with Mixed Preference Optimization (MPO), and it is reported to deliver superior overall performance. The InternVL2.5-MPO series pairs an incrementally pre-trained InternViT vision encoder with pre-trained large language models (LLMs) such as InternLM 2.5 and Qwen 2.5, connected through a randomly initialized MLP projector. It retains the ‘ViT-MLP-LLM’ paradigm of InternVL 2.5 and its predecessors and supports multi-image and video input. The model handles a range of vision-language tasks, including image captioning and visual question answering.
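The ‘ViT-MLP-LLM’ data flow described above can be illustrated with a toy, shape-level sketch. This is not the InternVL implementation: the names `MLPProjector` and `build_llm_input`, the tiny dimensions, the two-layer layout, and the ReLU activation are all assumptions chosen for readability. The only point is to show how patch features from a vision encoder are projected into the language model's embedding space and placed in the same sequence as the text embeddings.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random values (stand-in for weights or features)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix using plain Python lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class MLPProjector:
    """Hypothetical two-layer MLP mapping ViT features to the LLM embedding width."""

    def __init__(self, vit_dim, llm_dim, rng):
        # Randomly initialized, mirroring the "randomly initialized MLP projector" in the text.
        self.w1 = make_matrix(vit_dim, llm_dim, rng)
        self.w2 = make_matrix(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        # ReLU is used here only as a simple stand-in activation (an assumption).
        hidden = [[max(0.0, v) for v in row] for row in matmul(patch_features, self.w1)]
        return matmul(hidden, self.w2)


def build_llm_input(patch_features, text_embeddings, projector):
    """Project visual patch features and prepend them to the text embeddings."""
    visual_tokens = projector(patch_features)
    return visual_tokens + text_embeddings


rng = random.Random(0)
vit_dim, llm_dim = 8, 4                      # toy sizes, not the real model's
patches = make_matrix(6, vit_dim, rng)       # 6 image patches from the "ViT"
text = make_matrix(3, llm_dim, rng)          # 3 text tokens already in LLM space
sequence = build_llm_input(patches, text, MLPProjector(vit_dim, llm_dim, rng))
print(len(sequence), len(sequence[0]))
```

In this sketch the language model would receive a single sequence of nine vectors (six visual, three textual), all with the same width. Several images or video frames would contribute additional blocks of visual tokens in the same way.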


InternVL2_5-1B-MPO Visit Over Time

Monthly Visits: 25,633,376

Bounce Rate: 44.05%

Pages per Visit: 5.8

Visit Duration: 00:04:53


InternVL2_5-1B-MPO Alternatives


InternVL2_5-26B-MPO — A multimodal large language model that enhances the interaction between visual and linguistic data.

VideoLLaMA2-7B — A large video-language model that provides video question answering and video captioning.

Search4All — A question answering system based on a large language model, capable of answering a wide range of questions.

VideoLLaMA2-7B-Base — A large video language model that provides visual question answering and video captioning capabilities.

InternVL2_5-26B — A large multimodal language model that integrates visual and linguistic understanding.

InternVL2_5-8B-MPO-AWQ — A multimodal large language model enhancing visual and linguistic interaction capabilities.

LLaVA — Large Language and Vision Assistant, enabling multimodal chat and scientific question answering

Doubao Large Model — A large model developed by ByteDance, providing multimodal capabilities.

Visual Sketchpad — A visual reasoning tool for multimodal large language models (LLMs)

MouSi — Multimodal Visual Language Model

MNN Large Model Android App — A fully functional Android app supporting multimodal capabilities with a large language model.

NVLM 1.0 — Cutting-edge multimodal large language model

MiniGemini — A multimodal large language model capable of understanding and generating images

DocGraphLM — A document graph language model for information extraction and question answering

Pixtral-Large-Instruct-2411 — A 124B-parameter multimodal large language model.

NVLM-D-72B — State-of-the-art multimodal large language model

idefics-80b — A general-purpose multimodal model that can be used for question answering, image description and other tasks.

Phi-4-multimodal-instruct — A lightweight multimodal foundation model developed by Microsoft, supporting text, image, and audio inputs.

VideoLLaMA2-7B-16F-Base — A large video language model used for visual question answering and video captioning.

InternVL2_5-2B-MPO — Advanced multimodal large language model

ultravox-v0_4_1-llama-3_1-8b — Multimodal speech large language model

InternVL2_5-78B — Advanced multimodal large language model series

InternVL2_5-4B — A multimodal large language model that integrates visual and language understanding.

mPLUG-DocOwl — A modular multimodal large language model for document understanding

CogVLM — A powerful open-source visual language model

SlowFast-LLaVA — A large language model for video understanding and reasoning that does not require training.

DeepSeek-VL2-Small — An advanced large-scale mixture of experts visual language model.

ultravox-v0_4_1-llama-3_1-70b — Multimodal speech large language model

Snack AI — Multilingual Model Question-Answering Assistant
