In recent years, large language models (LLMs) have come to play an important role in complex reasoning and problem-solving tasks. Among them, o1-like models, inspired by OpenAI's o1, stand out for their human-like, step-by-step reasoning. However, these models also suffer from a significant inefficiency known as "overthinking."

Overthinking refers to the phenomenon where a model spends unnecessary computation on simple problems, often repeating equivalent steps during the reasoning process. For example, when solving a trivial arithmetic problem like "2+3," an o1-like model may generate overly detailed reasoning, consuming far more tokens than a conventional LLM would. This not only increases computational costs but also limits practical deployment in resource-constrained scenarios.
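To make that gap concrete, one quick way to measure it is to count the tokens in a terse answer versus a multi-round chain of thought for the same question. The snippet below is purely illustrative: it uses the open-source tiktoken tokenizer, and the actual token counts for a real o1-like model will differ.

```python
# Compare token counts for a concise answer vs. an "overthought" one.
# tiktoken is used here only for illustration; o1-like models use
# their own tokenizers and typically produce far longer traces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

concise = "2 + 3 = 5"
verbose = (
    "Let me think about this carefully. We need to add 2 and 3. "
    "First, start from 2 and count up three times: 3, 4, 5. "
    "Wait, let me double-check by starting from 3 and counting up twice: 4, 5. "
    "Both approaches give 5, so the answer is 5."
)

print(len(enc.encode(concise)))  # a handful of tokens
print(len(enc.encode(verbose)))  # roughly an order of magnitude more
```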


To address this issue, Tencent AI Lab and Shanghai Jiao Tong University jointly released a new study that examines overthinking in o1-like models and focuses on optimizing test-time computational resource usage. Experiments on datasets such as GSM8K, MATH500, and AIME reveal that these models tend to generate redundant answers when faced with simple problems. The researchers therefore introduced two evaluation metrics, result efficiency and process efficiency, to comprehensively assess how a model spends resources during reasoning. The two metrics evaluate, respectively, how economically the model reaches a correct answer and how much of the intermediate reasoning actually contributes new steps toward the solution.
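The paper's exact formulations are best read from the source, but the intuition behind the two metrics can be sketched: result efficiency rewards spending tokens only up to the first correct solution, while process efficiency rewards reasoning rounds that add something new. A minimal illustrative sketch, assuming each response has already been segmented into solution rounds with token counts, correctness flags, and a distinctness flag:

```python
# Minimal sketch of the two efficiency metrics (illustrative, not the
# authors' exact formulation).

from dataclasses import dataclass

@dataclass
class SolutionRound:
    tokens: int     # tokens spent on this solution round
    correct: bool   # does this round reach the right answer?
    distinct: bool  # does it add a genuinely new line of reasoning?

def result_efficiency(rounds: list[SolutionRound]) -> float:
    """Fraction of total tokens spent up to (and including) the first
    correct solution; 0.0 if the response never gets the answer right."""
    total = sum(r.tokens for r in rounds)
    spent = 0
    for r in rounds:
        spent += r.tokens
        if r.correct:
            return spent / total
    return 0.0

def process_efficiency(rounds: list[SolutionRound]) -> float:
    """Fraction of total tokens spent on rounds that contribute a
    distinct solution rather than restating an earlier one."""
    total = sum(r.tokens for r in rounds)
    useful = sum(r.tokens for r in rounds if r.distinct)
    return useful / total

# Example: the first round is already correct; later rounds just repeat it.
rounds = [
    SolutionRound(tokens=60,  correct=True, distinct=True),
    SolutionRound(tokens=120, correct=True, distinct=False),
    SolutionRound(tokens=90,  correct=True, distinct=False),
]
print(f"result efficiency:  {result_efficiency(rounds):.2f}")   # 0.22
print(f"process efficiency: {process_efficiency(rounds):.2f}")  # 0.22
```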

To tackle the overthinking problem, the researchers proposed a self-training method that integrates the efficiency metrics directly into model training. The approach emphasizes accurate early responses, trimming redundant reasoning while retaining the model's ability to reflect. Its core strategies are First Correct Solution (FCS) and FCS+Reflection, which shorten training responses around the first correct solution. Applied to the QwQ-32B-Preview model, for instance, this reduced token usage on the MATH500 dataset by 48.6%. Beyond the computational savings, the shorter reasoning traces are also easier to interpret and make deployment feasible in resource-limited scenarios.
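As a rough sketch of how FCS-style training targets might be constructed (the helper below is hypothetical, not the authors' released code): truncate each sampled response at its first correct solution round, optionally keeping one additional round as a reflection for the FCS+Reflection variant, then fine-tune on the shortened targets.

```python
# Hypothetical sketch of FCS ("First Correct Solution") data construction.
# Round segmentation and answer checking are assumed to exist upstream.

def build_fcs_target(rounds: list[str],
                     is_correct,               # callable: round text -> bool
                     keep_reflection: bool = False) -> str | None:
    """Return the shortened training target, or None if no round is correct."""
    for i, text in enumerate(rounds):
        if is_correct(text):
            # Keep everything up to the first correct round; with
            # FCS+Reflection, also keep one extra round as a reflection.
            end = i + 2 if keep_reflection else i + 1
            return "\n\n".join(rounds[:min(end, len(rounds))])
    return None  # drop examples the model never solves

# Usage: pair each question with its truncated response and fine-tune
# (e.g., via SFT or preference optimization) on the shortened pairs.
```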

Experimental results show that these efficiency-focused strategies significantly reduce token usage while maintaining or improving accuracy on simple tasks. On the MATH500 dataset, for example, the FCS+Reflection strategy improved result efficiency from 52.3% to 75.8%. Higher process efficiency likewise indicated less redundancy across reasoning steps. Even on more challenging datasets such as GPQA and AIME, the optimized models maintained strong performance while reducing computational demands. The findings indicate that targeted training strategies can effectively address inefficiency while preserving the model's capabilities across tasks.

This study by Tencent AI Lab and Shanghai Jiao Tong University highlights the overthinking problem in o1-like models and proposes practical solutions for efficient resource utilization. The introduction of these new metrics and training methods is significant for enhancing the scalability and applicability of advanced reasoning models. As artificial intelligence systems continue to evolve, ensuring efficient use of computational resources will become a key focus, enabling broader and more sustainable applications of these technologies.

Paper link: https://arxiv.org/abs/2412.21187

Key Points:  

🔍 The study reveals that o1-like models exhibit "overthinking" when dealing with simple problems, leading to unnecessary waste of computational resources.  

⚙️ By introducing result efficiency and process efficiency metrics, the researchers optimize the models' computational resource utilization, making reasoning leaner without sacrificing quality.  

📉 Experimental results show that the optimization strategies significantly reduce token usage while maintaining or improving the model's accuracy on simple tasks.