In the realm of artificial intelligence, the reasoning capabilities of machine learning models, particularly large language models (LLMs), have long been a focus of scientific interest.
Recently, Apple's AI research team published a paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," shedding light on the limitations these models face when dealing with mathematical reasoning problems.
In the paper, the researchers demonstrate this with a simple math problem about Oliver picking kiwis:
Oliver picked 44 kiwis on Friday. On Saturday, he picked 58 more. On Sunday, he picked twice the amount he did on Friday. How many kiwis does Oliver have in total?
The obvious answer is 44 + 58 + (44 * 2) = 190. Although large language models are not perfect at arithmetic, they can reliably solve problems like this one.
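For concreteness, here is the intended arithmetic written out as a short Python sketch (the variable names are ours, purely for illustration):

```python
# Kiwis picked on each day, as stated in the question
friday = 44
saturday = 58
sunday = 2 * friday      # "twice the amount he did on Friday"

total = friday + saturday + sunday
print(total)             # 44 + 58 + 88 = 190
```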
However, the researchers then added a piece of irrelevant information to see how the models would respond:
Oliver picked 44 kiwis on Friday. On Saturday, he picked 58 more. On Sunday, he picked twice the amount he did on Friday, but 5 of them were slightly smaller than average. How many kiwis does Oliver have?
Although this added detail does not change the mathematics of the problem at all, even the most advanced LLMs give incorrect answers under this minor distraction. OpenAI's o1-mini, for example, mistakenly subtracted the 5 smaller kiwis from the total picked on Sunday.
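Assuming the error is exactly the subtraction described above, the flawed step can be contrasted with the correct one in a couple of lines of Python:

```python
# The remark about size is irrelevant to the count, so the correct total is unchanged:
correct_total = 44 + 58 + 2 * 44        # 190

# The reported failure mode: subtracting the 5 "smaller" kiwis from Sunday's pick
flawed_total = 44 + 58 + (2 * 44 - 5)   # 185, which is wrong
```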
This experiment suggests that while LLMs can produce correct answers in many scenarios, they do not truly comprehend the problem they are solving.
The researchers believe that these failure modes indicate the models are not performing genuine logical reasoning but are merely replicating the reasoning steps they observed in their training data. It is as if an LLM, having seen that "I love you" is typically followed by "I love you too," can produce the reply without truly understanding what love means.
Mehrdad Farajtabar, one of the paper's co-authors, elaborated on this finding on social media. He noted that while better prompt engineering might improve a model's performance on simple perturbations, more complex distractions might require much more contextual data for the model to handle correctly, even though the same distractions would pose no difficulty for a child.
This research reminds us that despite LLMs' excellent performance in language processing, their capacity for logical reasoning remains limited. This is not just an academic concern; as AI technology becomes ever more integrated into our daily lives, the question of whether these systems truly reason becomes increasingly important.
We cannot simply assume that AI systems understand the complex tasks they perform; we need to look more closely at how they work and where they fall short. This study deepens our understanding of AI technology and offers valuable insights into how we should use and develop it.
Reference: https://techcrunch.com/2024/10/11/researchers-question-ais-reasoning-ability-as-models-stumble-on-math-problems-with-trivial-changes/