Recently, Apple Inc. conducted a study on the reasoning capabilities of large language models (LLMs), drawing attention to how they perform on mathematical problems.

The GSM8K benchmark is widely used to evaluate models' reasoning abilities on elementary math problems. Although the performance of LLMs on GSM8K has improved in recent years, researchers have questioned the reliability of these results. They therefore conducted an extensive study of the most advanced open-source and closed-source models available today.

To assess models' reasoning capabilities more rigorously, the research team introduced an improved benchmark, GSM-Symbolic. It uses symbolic templates to generate diverse variants of each problem, giving tighter control over the evaluation and yielding more reliable metrics.
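
To make the template idea concrete, here is a minimal, purely illustrative sketch of how such symbolic problem generation could work. The template text, name list, and value ranges are assumptions for illustration, not the paper's actual templates.

```python
import random

# Illustrative sketch of template-based problem generation in the spirit of
# GSM-Symbolic: one word problem becomes a template whose names and numbers
# are re-sampled to produce many equivalent instances.

TEMPLATE = (
    "{name} has {total} apples and gives {given} of them to a friend. "
    "How many apples does {name} have left?"
)

NAMES = ["Sophie", "Liam", "Ava", "Noah"]  # assumed name pool

def generate_variant(rng: random.Random) -> tuple[str, int]:
    """Sample one problem instance together with its ground-truth answer."""
    total = rng.randint(20, 99)
    given = rng.randint(1, total - 1)  # keep the answer positive
    question = TEMPLATE.format(name=rng.choice(NAMES), total=total, given=given)
    return question, total - given

rng = random.Random(0)
for _ in range(3):
    question, answer = generate_variant(rng)
    print(question, "->", answer)
```

Because every variant shares the same underlying reasoning, any swing in accuracy across variants points to sensitivity to surface details rather than to the logic of the problem.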


The study found that when the numerical values in a problem were altered, the performance of LLMs fluctuated significantly. More strikingly, as the number of clauses in a problem increased, performance dropped noticeably. The researchers speculate that this decline indicates existing LLMs do not perform genuine logical reasoning but instead imitate the reasoning steps seen in their training data.

In the experiments, adding a single clause that appears relevant but has no bearing on the solution caused the performance of all state-of-the-art models to drop by as much as 65%. Even though such clauses contribute nothing to the reasoning chain that leads to the final answer, they substantially degraded accuracy. Overall, this research provides a deeper understanding of the capabilities and limitations of LLMs in mathematical reasoning.
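
As a rough illustration of this "irrelevant clause" probe (not the authors' actual code), the sketch below appends a distractor sentence to a generated question and checks whether a model still answers correctly; `ask_model` is a hypothetical stand-in for whatever LLM call is being evaluated.

```python
# Append a detail that sounds relevant but does not change the answer, then
# check whether the model's prediction is still correct. The clause text and
# the ask_model callable are assumptions used only for illustration.

IRRELEVANT_CLAUSE = " Note that three of the apples are slightly smaller than average."

def still_correct(question: str, expected: int, ask_model) -> bool:
    """Return True if the model answers correctly even with the distractor added."""
    perturbed = question + IRRELEVANT_CLAUSE
    return ask_model(perturbed) == expected
```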

Key Points:

🔍 The mathematical reasoning abilities of LLMs show significant variations across different problem instances.

📉 As problem complexity increases, LLM performance notably declines, especially after additional clauses are introduced.

🤖 Existing LLMs do not possess genuine logical reasoning abilities; they primarily rely on repetition and imitation of training data.