A recent study published in the journal Nature has revealed a thought-provoking phenomenon in the development of artificial intelligence: as large language models (LLMs) evolve, they increasingly tend to answer with confidence even when those answers are incorrect. The finding has sparked extensive discussion about the reliability and risks of using AI.

The research team, led by José Hernández-Orallo of the Valencian Research Institute for Artificial Intelligence in Spain, conducted an in-depth analysis of how AI models' erroneous responses change as the models evolve, how those errors relate to human perceptions of question difficulty, and how well people can identify incorrect responses.

The findings indicate that as models become more refined, particularly through fine-tuning methods such as learning from human feedback, the overall performance of AI has indeed improved. Unexpectedly, however, the proportion of incorrect answers rises in tandem with the proportion of correct ones. As Hernández-Orallo put it: "They almost always provide an answer to every question, which means a higher rate of correct answers is accompanied by more incorrect ones."

The research team focused primarily on mainstream AI models such as OpenAI's GPT, Meta's LLaMA, and the open-source model BLOOM. By comparing early versions of these models with their later, refined versions, they analyzed performance on various types of questions. The results showed that while the models' performance on simple questions improved, they did not exhibit a significant tendency to avoid difficult ones. GPT-4, for instance, answered almost every question posed to it, and in many cases the proportion of incorrect answers continued to rise, sometimes exceeding 60%.

More concerning, these models sometimes get even simple questions wrong, meaning users struggle to find a "safe zone" in which AI responses can be fully trusted. When the research team asked volunteers to judge the correctness of the answers, the results were even more unsettling: participants mistakenly classified incorrect answers as correct between 10% and 40% of the time, whether the questions were simple or complex. Hernández-Orallo concluded: "Humans are unable to effectively supervise these models."

To address this challenge, Hernández-Orallo suggests that AI developers focus on improving the models' performance on simple questions and encourage chatbots to express uncertainty or decline to answer when faced with difficult ones. He emphasized: "We need to make users understand: I can use it in this area, but not in that one."
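As a rough sketch of what such abstention could look like in practice, consider gating a chatbot's answer behind a confidence threshold. The model interface, threshold value, and toy answers below are illustrative assumptions, not anything described in the study:

```python
# Hypothetical sketch: abstain instead of guessing when confidence is low.
# toy_model() and its scores are stand-ins; a real system might derive
# confidence from token probabilities or a separate calibration model.

REFUSAL = "I don't know enough to answer that reliably."

def toy_model(question: str) -> tuple[str, float]:
    """Stand-in for an LLM call returning (answer, confidence in [0, 1])."""
    known = {"What is 2 + 2?": ("4", 0.99)}
    return known.get(question, ("a plausible-sounding guess", 0.35))

def guarded_answer(question: str, threshold: float = 0.7) -> str:
    answer, confidence = toy_model(question)
    # Prefer an explicit abstention over a confident-sounding wrong answer.
    return answer if confidence >= threshold else REFUSAL

print(guarded_answer("What is 2 + 2?"))              # -> 4
print(guarded_answer("Who won the 2087 election?"))  # -> refusal message
```

The trade-off is exactly the one the article describes: a lower threshold makes the bot more helpful but more often wrong, while a higher one makes it more reliable but more often silent.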

Although it may seem impressive for AI to answer all manner of complex questions, Hernández-Orallo points out that this is not always beneficial. He is even puzzled that some models err on simple arithmetic problems, issues he considers solvable and worth addressing.

Vipula Rawte, a computer scientist at the University of South Carolina, noted that some models do indeed say "I don't know" or "I don't have enough information." AI systems built for specific purposes, such as healthcare, are often tuned more strictly to keep them from straying beyond their knowledge scope. However, for companies developing general-purpose chatbots, admitting ignorance is not always seen as an ideal feature.

This study reveals an important paradox in AI development: as models become more complex and powerful, they may become less reliable in certain aspects. This finding presents new challenges for AI developers, users, and regulators.

In the future, AI development needs to find a balance between enhancing performance and maintaining caution. Developers may need to reconsider how to evaluate AI models' performance, not only focusing on the number of correct answers but also considering the proportion and impact of incorrect answers. Meanwhile, raising users' awareness of AI limitations is becoming increasingly important.
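As a rough illustration of that evaluation shift, one could score a model's responses in three buckets rather than one: correct, incorrect, and avoidant (abstained). The counts below are invented purely to show the arithmetic; they are not data from the study:

```python
# Toy sketch of evaluating a model by its full response profile rather
# than raw accuracy alone. Labels and counts are made up for illustration.

from collections import Counter

def response_profile(labels: list[str]) -> dict[str, float]:
    """labels: one of 'correct', 'incorrect', 'avoidant' per prompt."""
    counts = Counter(labels)
    total = len(labels)
    return {k: counts[k] / total for k in ("correct", "incorrect", "avoidant")}

# Invented example of the paradox: the refined model answers more
# questions correctly, yet its incorrect share also grows because it
# almost never abstains anymore.
early   = ["correct"] * 40 + ["incorrect"] * 20 + ["avoidant"] * 40
refined = ["correct"] * 55 + ["incorrect"] * 43 + ["avoidant"] * 2

print(response_profile(early))    # {'correct': 0.4, 'incorrect': 0.2, 'avoidant': 0.4}
print(response_profile(refined))  # {'correct': 0.55, 'incorrect': 0.43, 'avoidant': 0.02}
```

On this toy data the refined model looks better by raw accuracy (55% vs. 40% correct) yet markedly worse by error rate (43% vs. 20% incorrect), which is precisely the pattern that accuracy-only evaluation hides.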

For ordinary users, this study serves as a reminder to remain vigilant when using AI tools. Although AI can provide convenience and efficiency, we still need to apply critical thinking, especially when dealing with important or sensitive information.