Recently, research teams including one from Peking University announced the release of LLaVA-o1, an open-source multimodal model claimed to be the first visual language model capable of spontaneous, systematic reasoning, comparable to GPT-o1.

The model performs strongly on six challenging multimodal benchmarks, with its 11B-parameter version surpassing competitors such as Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.


LLaVA-o1 is based on the Llama-3.2-Vision model and employs a "slow thinking" reasoning mechanism that lets it carry out more complex reasoning autonomously, going beyond traditional chain-of-thought prompting methods.

In multimodal reasoning benchmark tests, LLaVA-o1's performance exceeded that of its base model by 8.9%. The uniqueness of this model lies in its reasoning process, which is divided into four stages: summarization, visual interpretation, logical reasoning, and conclusion generation. In traditional models, the reasoning process is often relatively simple, leading to incorrect answers. In contrast, LLaVA-o1 ensures more accurate outputs through structured, multi-step reasoning.

For example, when solving the question "How many objects are left after removing all the small bright balls and purple objects?", LLaVA-o1 first summarizes the question, then extracts information from the image, and finally conducts step-by-step reasoning to arrive at the answer. This phased approach enhances the model's systematic reasoning capabilities, making it more efficient in handling complex problems.
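To make this staged output concrete, here is a minimal Python sketch of how such a four-stage response could be parsed, assuming hypothetical stage tags like `<SUMMARY>` and `<CONCLUSION>` (the model's actual output format may differ):

```python
import re

# Hypothetical stage tags; the actual LLaVA-o1 output format may differ.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Split a staged model response into its four reasoning parts."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        parsed[stage.lower()] = match.group(1).strip() if match else ""
    return parsed

example = (
    "<SUMMARY>Count the objects that remain after the removals.</SUMMARY>"
    "<CAPTION>The image shows balls and blocks of different colors and sizes.</CAPTION>"
    "<REASONING>There are 7 objects in total; removing 1 small shiny ball and "
    "2 purple objects leaves 7 - 3 = 4.</REASONING>"
    "<CONCLUSION>4</CONCLUSION>"
)

print(parse_staged_response(example)["conclusion"])  # -> 4
```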


It is worth mentioning that LLaVA-o1 introduces a stage-wise beam search method during the reasoning process. This method allows the model to generate multiple candidate answers at each reasoning stage and select the best one to carry into the next stage, significantly improving overall reasoning quality. With supervised fine-tuning on appropriate training data, LLaVA-o1 holds up well even against larger or closed-source models.
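The stage-wise selection idea can be sketched as follows; `generate_stage` and `score` are hypothetical placeholders rather than the authors' actual implementation, and the paper describes its own selection criterion:

```python
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_stage(context: str, stage: str) -> str:
    """Hypothetical stand-in for a call to the vision-language model."""
    return f"<{stage}> candidate-{random.randint(0, 999)}"

def score(candidate: str) -> float:
    """Hypothetical quality score; the real method uses its own selection rule."""
    return random.random()

def stage_wise_beam_search(question: str, num_candidates: int = 4) -> str:
    """Sample several candidates per stage and carry only the best one forward."""
    context = question
    for stage in STAGES:
        candidates = [generate_stage(context, stage) for _ in range(num_candidates)]
        best = max(candidates, key=score)   # keep only the top-scoring candidate
        context += "\n" + best              # the winner becomes context for the next stage
    return context

print(stage_wise_beam_search("How many objects are left after the removals?"))
```

Because only the best candidate at each stage feeds the next one, errors made early in the chain are less likely to propagate than in a single-pass generation.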

The research achievements of the Peking University team not only advance the development of multimodal AI but also provide new ideas and methods for future visual language understanding models. The team stated that the code, pre-trained weights, and datasets for LLaVA-o1 will be fully open-sourced, hoping that more researchers and developers can explore and apply this innovative model together.

Paper: https://arxiv.org/abs/2411.10440

GitHub: https://github.com/PKU-YuanGroup/LLaVA-o1

Key Points:

🌟 LLaVA-o1 is a new multimodal reasoning model released by teams from Peking University, featuring "slow thinking" reasoning capabilities.  

📈 This model outperforms its base model by 8.9% in multimodal reasoning benchmark tests.  

🔍 LLaVA-o1 ensures accuracy through structured multi-step reasoning and will be open-sourced soon.