MiniGemini
A multimodal large language model capable of understanding and generating images
Mini-Gemini is a multimodal visual language model that supports a series of dense and MoE large language models ranging from 2B to 34B parameters, with capabilities for image understanding, reasoning, and generation. Built on LLaVA, it uses dual vision encoders: one provides low-resolution visual embeddings, the other high-resolution candidate regions. A patch-level information mining step matches each low-resolution visual query against its high-resolution region, and the fused text and image tokens are used for both understanding and generation tasks. It has been evaluated on multiple visual understanding benchmarks, including COCO, GQA, OCR-VQA, and Visual Genome.
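The mining step described above can be sketched in NumPy as a simple cross-attention-style pooling. This is an illustrative toy, not the official Mini-Gemini implementation: the function name, shapes, and residual fusion are assumptions for the sketch, with each low-resolution query attending over the high-resolution patches of its candidate region.

```python
import numpy as np

def patch_info_mining(lr_queries, hr_patches):
    """Toy sketch of patch-level information mining (hypothetical helper,
    not Mini-Gemini's actual code).

    lr_queries : (N, D) low-resolution visual queries
    hr_patches : (N, P, D) high-resolution patch features per candidate region
    Returns an (N, D) array of queries enriched with high-res detail.
    """
    d = lr_queries.shape[-1]
    # similarity between each query and the patches of its own region
    scores = np.einsum('nd,npd->np', lr_queries, hr_patches) / np.sqrt(d)
    # softmax over the patch axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # attention-weighted sum of high-res patches
    mined = np.einsum('np,npd->nd', weights, hr_patches)
    # residual fusion back into the low-res query (an assumed design choice)
    return lr_queries + mined

# tiny demo with random features
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))      # 4 queries, dim 8
p = rng.standard_normal((4, 16, 8))  # 16 high-res patches per query
out = patch_info_mining(q, p)
print(out.shape)
```

The enriched queries would then be concatenated with text tokens and passed to the language model; that interface is outside the scope of this sketch.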
MiniGemini Visits Over Time
- Monthly Visits: 519
- Bounce Rate: 41.41%
- Pages per Visit: 1.0
- Visit Duration: 00:00:00