MiniGemini
A multimodal large language model capable of understanding and generating images
Mini-Gemini is a multimodal visual language model that supports a series of dense and MoE large language models ranging from 2B to 34B parameters, with capabilities for image understanding, reasoning, and generation. Built on LLaVA, it uses dual vision encoders: one provides low-resolution visual embeddings, the other high-resolution candidate regions. A patch-level information mining step matches each low-resolution visual query against its high-resolution region, and the fused text and image tokens are used for both understanding and generation tasks. It has been evaluated on multiple visual understanding benchmarks, including COCO, GQA, OCR-VQA, and Visual Genome.
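The mining step described above can be sketched in NumPy as a simple cross-attention-style pooling. This is an illustrative toy, not the official Mini-Gemini implementation: the function name, shapes, and residual fusion are assumptions for the sketch, with each low-resolution query attending over the high-resolution patches of its candidate region.

```python
import numpy as np

def patch_info_mining(lr_queries, hr_patches):
    """Toy sketch of patch-level information mining (hypothetical helper,
    not Mini-Gemini's actual code).

    lr_queries : (N, D) low-resolution visual queries
    hr_patches : (N, P, D) high-resolution patch features per candidate region
    Returns an (N, D) array of queries enriched with high-res detail.
    """
    d = lr_queries.shape[-1]
    # similarity between each query and the patches of its own region
    scores = np.einsum('nd,npd->np', lr_queries, hr_patches) / np.sqrt(d)
    # softmax over the patch axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # attention-weighted sum of high-res patches
    mined = np.einsum('np,npd->nd', weights, hr_patches)
    # residual fusion back into the low-res query (an assumed design choice)
    return lr_queries + mined

# tiny demo with random features
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))      # 4 queries, dim 8
p = rng.standard_normal((4, 16, 8))  # 16 high-res patches per query
out = patch_info_mining(q, p)
print(out.shape)
```

The enriched queries would then be concatenated with text tokens and passed to the language model; that interface is outside the scope of this sketch.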
MiniGemini Visits Over Time
- Monthly Visits: 519
- Bounce Rate: 41.41%
- Pages per Visit: 1.0
- Visit Duration: 00:00:00