Researchers at Apple recently conducted an in-depth study of the mathematical reasoning capabilities of large language models (LLMs), introducing a new benchmark called GSM-Symbolic.

The new benchmark builds on GSM8K, which evaluates grade-school mathematical word problems. Although many LLMs have posted steady gains on GSM8K, the research community still questions whether those scores reflect genuine reasoning ability, since existing evaluation metrics may overstate the models' true capabilities. The study found that LLMs typically rely on probabilistic pattern matching rather than genuine logical reasoning, which makes them highly sensitive to minor changes in the input.


In the new study, the researchers used symbolic templates to generate diverse variants of mathematical problems, enabling a more reliable evaluation. The experiments showed that LLM performance declined significantly when the numerical values in a problem were changed or when its complexity increased. Moreover, adding a single clause that seems relevant but has no bearing on the answer could cause performance to drop by as much as 65%. These findings reinforce the view that LLMs rely on pattern matching rather than formal logical reasoning during inference.
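To make the template idea concrete, here is a minimal sketch assuming a hypothetical GSM8K-style template of our own (the names, numbers, and no-op clause below are illustrative, not drawn from the paper's actual templates): proper names and numeric values become variables, a constraint keeps the answer a whole number, and an optional irrelevant clause mimics the "seemingly relevant but inconsequential" additions the study describes.

```python
import random

# Hypothetical GSM8K-style symbolic template (illustrative only):
# "{name} picks {total} apples, eats {eaten}, and splits the rest
#  among {friends} friends. How many apples does each friend get?"
NAMES = ["Sophie", "Liam", "Ava", "Noah"]

def instantiate(add_noop=False, rng=random):
    name = rng.choice(NAMES)
    while True:
        total = rng.randint(20, 60)
        eaten = rng.randint(2, 10)
        friends = rng.randint(2, 5)
        if (total - eaten) % friends == 0:   # constraint: integer answer
            break
    # An irrelevant "no-op" clause: plausible-sounding, but it does not
    # change the arithmetic (the kind of addition tied to the 65% drops).
    noop = " Five of the apples are a bit smaller than the rest." if add_noop else ""
    question = (f"{name} picks {total} apples, eats {eaten}, and splits the rest "
                f"among {friends} friends.{noop} How many apples does each friend get?")
    answer = (total - eaten) // friends
    return question, answer

q, a = instantiate(add_noop=True)
print(q, "->", a)
```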

The GSM8K dataset contains over 8,000 grade-school math word problems, and its very popularity has introduced risks such as data contamination and performance fluctuations from minor problem variations. GSM-Symbolic addresses these challenges by allowing controlled generation of diverse problem variants. The benchmark was used to evaluate more than 20 open and closed models on 5,000 samples generated from 100 templates, offering deeper insight into the limitations of LLMs in mathematical reasoning.
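One methodological payoff of templates, sketched below under our own assumptions (a hypothetical helper, not the paper's evaluation code): each template yields many instantiations, so a model gets a distribution of accuracies rather than a single score, and the spread of that distribution exposes sensitivity to surface changes.

```python
from statistics import mean, stdev

def accuracy_distribution(per_set_results):
    """per_set_results: list of sets, each a list of (prediction, gold)
    pairs for one full instantiation drawn from the same templates."""
    accs = [sum(p == g for p, g in s) / len(s) for s in per_set_results]
    return mean(accs), stdev(accs)  # a wide spread signals surface sensitivity

# Toy usage: three instantiation sets of four problems each.
sets = [[(8, 8), (3, 3), (5, 7), (2, 2)],
        [(8, 8), (3, 4), (5, 7), (2, 2)],
        [(8, 8), (3, 3), (5, 5), (2, 2)]]
print(accuracy_distribution(sets))  # (0.75, 0.25)
```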

Preliminary experiments show large performance differences across models on GSM-Symbolic, with overall accuracy lower than the scores reported on GSM8K. The study further probed the effect of changing variable names versus numerical values, finding that numerical changes hurt performance more. Problem complexity also directly influenced accuracy, with more complex problems producing markedly larger declines. Together, these results suggest that models lean on pattern matching rather than genuine reasoning when solving mathematical problems.
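The names-versus-numbers comparison can be sketched as two separate perturbations of one base instance (hypothetical helpers built on the same illustrative template as above, not the paper's code): one variant swaps only the proper name, the other re-samples only the numeric values.

```python
import random

BASE = {"name": "Sophie", "total": 48, "eaten": 8, "friends": 4}

def names_only(base, rng=random):
    # Change surface identifiers only; the arithmetic is untouched.
    return {**base, "name": rng.choice(["Liam", "Ava", "Noah"])}

def numbers_only(base, rng=random):
    # Change the numbers only (keeping an integer answer); the name stays.
    while True:
        total, eaten, friends = rng.randint(20, 60), rng.randint(2, 10), rng.randint(2, 5)
        if (total - eaten) % friends == 0:
            return {**base, "total": total, "eaten": eaten, "friends": friends}

# A model robust to surface form should score similarly on both variant types;
# the study reports that numeric changes hurt noticeably more than name changes.
```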

In short, this research highlights the limitations of GSM8K-based evaluation and introduces GSM-Symbolic as a benchmark for assessing the mathematical reasoning capabilities of LLMs. Overall, the results indicate that LLMs still have substantial room to improve their logical reasoning on complex problems.

Paper: https://arxiv.org/abs/2410.05229

Key Points:

🧮 Researchers introduce the new benchmark GSM-Symbolic to evaluate LLMs' mathematical reasoning abilities.

📉 LLMs perform poorly on complex mathematical problems, relying on pattern matching rather than logical reasoning.

📊 The study reveals significant performance differences among models under the new benchmark, calling for improved evaluation methods.