Recently, teams including researchers from the University of Cambridge published a significant paper examining the actual performance of current large language models (LLMs), and the results are quite striking: these highly anticipated AI models perform far worse than expected on many basic tasks.
The study conducted comprehensive evaluations of several cutting-edge models, including o1-preview. The results indicate a significant gap between the understanding capabilities of AI models and those of humans. Surprisingly, the models excel at tasks humans consider complex yet frequently falter on simple problems, a contrast that raises the question of whether these AIs truly grasp the essence of a task or are merely "pretending to be smart."
Even more astonishing, prompt engineering, a technique widely believed to enhance AI performance, seems unable to address the models' fundamental issues. The study found that even in simple spelling games, models make laughable mistakes: they can correctly spell "electroluminescence" yet give incorrect answers like "mummy" for the simple word "my."
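To make this failure mode concrete, here is a minimal sketch in Python of such a spelling probe. The query_model function is a hypothetical stand-in for a real model API, and its canned replies simply mirror the article's example so the script runs end to end.

```python
# Minimal sketch of a spelling probe like the one described above.
def query_model(prompt: str) -> str:
    # Hypothetical stub: swap in a real chat-model API call to reproduce.
    # The canned replies below just mirror the article's example.
    if "electroluminescence" in prompt:
        return "e-l-e-c-t-r-o-l-u-m-i-n-e-s-c-e-n-c-e"
    return "m-u-m-m-y"  # the kind of error the article reports for "my"

def passes_spelling_task(word: str) -> bool:
    prompt = f'Spell the word "{word}" one letter at a time, separated by hyphens.'
    answer = query_model(prompt)
    return answer.strip().lower() == "-".join(word.lower())

# The hard item can pass while the trivially easy one fails.
for word in ["electroluminescence", "my"]:
    print(f"{word}: {'correct' if passes_spelling_task(word) else 'wrong'}")
```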
The research team evaluated 32 different large models and found that their performance varies greatly across tasks of different difficulty. On complex tasks, their accuracy falls far below human expectations. Worse still, these models seem to take on higher-difficulty tasks before fully mastering simpler ones, leading to frequent errors.
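As a rough illustration of how such an evaluation can be tallied, the sketch below groups made-up outcomes by difficulty and computes per-bucket accuracy; none of the numbers come from the study, they just show the bookkeeping.

```python
from collections import defaultdict

# Illustrative records of (task_difficulty, model_was_correct).
# The values are invented for the sketch; the study's point is that
# accuracy need not rise smoothly as tasks get easier.
results = [
    ("easy", False), ("easy", True), ("easy", False),
    ("medium", True), ("medium", True),
    ("hard", True), ("hard", False),
]

by_difficulty = defaultdict(list)
for difficulty, correct in results:
    by_difficulty[difficulty].append(correct)

for difficulty in ("easy", "medium", "hard"):
    outcomes = by_difficulty[difficulty]
    accuracy = sum(outcomes) / len(outcomes)
    print(f"{difficulty:>6}: {accuracy:.0%} over {len(outcomes)} tasks")
```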
Another concern is the models' high sensitivity to prompts. The study found that many models cannot complete even simple tasks correctly without carefully designed prompts, and rephrasing the prompt for the same task can produce drastically different performance, which poses significant challenges for practical applications.
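One way to quantify this sensitivity is to paraphrase the same task several ways and measure how often the answers agree. The sketch below assumes a hypothetical query_model API; its canned replies keep the example self-contained.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    # Hypothetical stub with canned replies; a real run would call an LLM.
    canned = {
        "What is 7 plus 5?": "12",
        "Compute 7 + 5.": "12",
        "Add seven and five and state the result.": "twelve",
    }
    return canned[prompt]

paraphrases = [
    "What is 7 plus 5?",
    "Compute 7 + 5.",
    "Add seven and five and state the result.",
]

answers = [query_model(p) for p in paraphrases]
top_answer, count = Counter(answers).most_common(1)[0]
print(f"answers: {answers}")
print(f"agreement on {top_answer!r}: {count / len(answers):.0%}")
```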
More worryingly, even models trained with reinforcement learning from human feedback (RLHF) still face reliability issues. In complex scenarios, these models often appear overly confident even as their error rates rise significantly, a combination that can lead users to unknowingly accept incorrect results and make serious judgment errors.
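One simple way to make that overconfidence measurable is to compare a model's average stated confidence with its empirical accuracy, as in the sketch below; the records are invented for demonstration.

```python
# Each record pairs the model's stated confidence with whether it was right.
# The values are illustrative, not taken from the study.
records = [
    (0.95, True), (0.90, False), (0.92, False),
    (0.88, True), (0.97, False), (0.91, True),
]

avg_confidence = sum(conf for conf, _ in records) / len(records)
accuracy = sum(correct for _, correct in records) / len(records)
gap = avg_confidence - accuracy  # a positive gap signals overconfidence

print(f"average stated confidence: {avg_confidence:.0%}")
print(f"empirical accuracy:        {accuracy:.0%}")
print(f"overconfidence gap:        {gap:+.0%}")
```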
This study undoubtedly pours cold water on the AI field, especially when set against the optimistic prediction made two years ago by AI luminary Ilya Sutskever, who confidently stated that over time AI performance would gradually meet human expectations. Reality has provided a very different answer.
This research serves as a mirror, reflecting the many shortcomings of current large models. However high our expectations for the future of AI, these findings remind us to remain cautious about these "smart" models: their reliability issues urgently need to be resolved, and the road ahead remains long.
The study not only reveals the current state of AI development but also offers an important reference for future research directions. It reminds us that in pursuing improvements in AI capabilities, we must pay more attention to stability and reliability. Future AI research may need to focus more on enhancing model consistency and on finding a balance between simple and complex tasks.
Reference:
https://docs.google.com/document/u/0/d/1SwdgJBLo-WMQs-Z55HHndTf4ZsqGop3FccnUk6f8E-w/mobilebasic