This tool aims to assess the generalization, trustworthiness, and causal reasoning abilities of the latest proprietary and open-source MLLMs through qualitative studies across four modalities: text, code, image, and video. The goal is to increase the transparency of MLLMs. We believe these three attributes are representative factors determining the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluated the closed-source GPT-4 and Gemini, as well as six open-source LLMs and MLLMs. Overall, we evaluated 230 manually designed cases, summarizing the qualitative results into 12 scores (4 modalities times 3 attributes). In total, we derived 14 empirical findings that help characterize the capabilities and limitations of proprietary and open-source MLLMs, enabling more reliable support for multi-modal downstream applications.