Recently, research teams including one from Peking University announced the release of LLaVA-o1, an open-source multimodal model claimed to be the first visual language model capable of spontaneous, systematic reasoning, comparable to GPT-o1.

The model performs strongly on six challenging multimodal benchmarks, with its 11B-parameter version surpassing competitors such as Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.


LLaVA-o1 is based on the Llama-3.2-Vision model and employs a "slow thinking" reasoning mechanism that lets it carry out more complex reasoning autonomously, going beyond traditional chain-of-thought prompting methods.

In multimodal reasoning benchmark tests, LLaVA-o1's performance exceeded that of its base model by 8.9%. The uniqueness of this model lies in its reasoning process, which is divided into four stages: summarization, visual interpretation, logical reasoning, and conclusion generation. In traditional models, the reasoning process is often relatively simple, leading to incorrect answers. In contrast, LLaVA-o1 ensures more accurate outputs through structured, multi-step reasoning.

For example, when solving the question "How many objects are left after removing all the small bright balls and purple objects?", LLaVA-o1 first summarizes the question, then extracts information from the image, and finally conducts step-by-step reasoning to arrive at the answer. This phased approach enhances the model's systematic reasoning capabilities, making it more efficient in handling complex problems.
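To make this staged output concrete, here is a minimal Python sketch of how such a four-stage response could be parsed, assuming hypothetical stage tags like `<SUMMARY>` and `<CONCLUSION>` (the model's actual output format may differ):

```python
import re

# Hypothetical stage tags; the actual LLaVA-o1 output format may differ.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_staged_response(text: str) -> dict:
    """Split a staged model response into its four reasoning parts."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        parsed[stage.lower()] = match.group(1).strip() if match else ""
    return parsed

example = (
    "<SUMMARY>Count the objects that remain after the removals.</SUMMARY>"
    "<CAPTION>The image shows balls and blocks of different colors and sizes.</CAPTION>"
    "<REASONING>There are 7 objects in total; removing 1 small shiny ball and "
    "2 purple objects leaves 7 - 3 = 4.</REASONING>"
    "<CONCLUSION>4</CONCLUSION>"
)

print(parse_staged_response(example)["conclusion"])  # -> 4
```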


It is worth mentioning that LLaVA-o1 introduces a stage-wise beam search method during the reasoning process. This method allows the model to generate multiple candidate answers at each reasoning stage and select the best one to carry into the next stage, significantly improving overall reasoning quality. With supervised fine-tuning on appropriate training data, LLaVA-o1 holds up well even against larger or closed-source models.
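The stage-wise selection idea can be sketched as follows; `generate_stage` and `score` are hypothetical placeholders rather than the authors' actual implementation, and the paper describes its own selection criterion:

```python
import random

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_stage(context: str, stage: str) -> str:
    """Hypothetical stand-in for a call to the vision-language model."""
    return f"<{stage}> candidate-{random.randint(0, 999)}"

def score(candidate: str) -> float:
    """Hypothetical quality score; the real method uses its own selection rule."""
    return random.random()

def stage_wise_beam_search(question: str, num_candidates: int = 4) -> str:
    """Sample several candidates per stage and carry only the best one forward."""
    context = question
    for stage in STAGES:
        candidates = [generate_stage(context, stage) for _ in range(num_candidates)]
        best = max(candidates, key=score)   # keep only the top-scoring candidate
        context += "\n" + best              # the winner becomes context for the next stage
    return context

print(stage_wise_beam_search("How many objects are left after the removals?"))
```

Because only the best candidate at each stage feeds the next one, errors made early in the chain are less likely to propagate than in a single-pass generation.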

The research achievements of the Peking University team not only advance the development of multimodal AI but also provide new ideas and methods for future visual language understanding models. The team stated that the code, pre-trained weights, and datasets for LLaVA-o1 will be fully open-sourced, hoping that more researchers and developers can explore and apply this innovative model together.

Paper: https://arxiv.org/abs/2411.10440

GitHub: https://github.com/PKU-YuanGroup/LLaVA-o1

Key Points:

🌟 LLaVA-o1 is a new multimodal reasoning model released by teams from Peking University, featuring "slow thinking" reasoning capabilities.  

📈 This model outperforms its base model by 8.9% in multimodal reasoning benchmark tests.  

🔍 LLaVA-o1 ensures accuracy through structured multi-step reasoning and will be open-sourced soon.