Recently, a team including researchers from the University of Cambridge published a significant paper examining the actual performance of current large language models (LLMs). The results are startling: these highly anticipated AI models perform far worse than expected on many basic tasks.

The study comprehensively evaluated several cutting-edge models, including o1-preview, and found a significant gap between how AI models and humans relate to task difficulty. Surprisingly, models can excel at tasks humans consider complex yet frequently falter on simple problems. This contrast raises the question of whether these AIs truly grasp the essence of a task or are merely "pretending to be smart."


Even more astonishing, prompt engineering, a technique widely believed to enhance AI performance, seems unable to address the models' fundamental issues. The study found that even in simple spelling games, models make laughable mistakes: they can correctly spell "electroluminescence" yet give incorrect answers such as "mummy" for the simple word "my."
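
To make the failure mode concrete, here is a minimal sketch of how one might probe a model on this kind of spelling task, assuming the `openai` Python client and an API key in the environment. The model name and prompt wording are illustrative, not the paper's actual benchmark harness.

```python
# Minimal sketch: probe a chat model on spelling tasks of very different
# word lengths and record where it fails. Assumes the `openai` Python
# package and an OPENAI_API_KEY in the environment; the model name and
# prompt wording are illustrative, not the paper's benchmark harness.
from openai import OpenAI

client = OpenAI()

# Short "easy" words alongside a long "hard" one, echoing the article's example.
WORDS = ["my", "cat", "rhythm", "electroluminescence"]

def ask_spelling(word: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f'Spell the word "{word}" letter by letter, '
                       "separated by hyphens. Reply with the letters only.",
        }],
    )
    return resp.choices[0].message.content.strip()

for word in WORDS:
    answer = ask_spelling(word)
    expected = "-".join(word.upper())
    status = "OK" if answer.upper() == expected else "FAIL"
    print(f"{status:4} {word!r}: got {answer!r}, expected {expected!r}")
```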


The research team evaluated 32 different large models and found that performance varies greatly with task difficulty. On complex tasks, accuracy falls far below human expectations. Worse still, the models appear to take on higher-difficulty tasks before fully mastering simpler ones, leading to frequent errors.
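
The analysis behind this kind of finding can be sketched as bucketing evaluation results by difficulty and checking accuracy per bucket. The record format, difficulty scale, and values below are hypothetical, chosen only to illustrate the idea:

```python
# Minimal sketch: bucket evaluation results by a human-judged difficulty
# score and compute accuracy per bucket.
from collections import defaultdict

# Each record pairs a difficulty score (0-100) with whether the model
# answered correctly; these values are made up for illustration.
results = [
    (5, True), (10, False), (15, True), (40, True),
    (55, True), (70, False), (85, False), (95, False),
]

def accuracy_by_difficulty(records, bucket_width=25):
    """Group records into difficulty buckets and return accuracy per bucket."""
    buckets = defaultdict(list)
    for difficulty, correct in records:
        buckets[difficulty // bucket_width].append(correct)
    return {
        f"{b * bucket_width}-{(b + 1) * bucket_width - 1}": sum(v) / len(v)
        for b, v in sorted(buckets.items())
    }

# A reliable model would score highest in the lowest buckets; the study
# reports errors even on the easiest items.
print(accuracy_by_difficulty(results))
```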


Another concern is the models' high sensitivity to prompt wording. The study found that many models cannot complete even simple tasks correctly without carefully designed prompts, and rephrasing the prompt for the same task can change a model's performance drastically, which poses significant challenges for practical applications.
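
A simple way to quantify this sensitivity is to pose the same task under several paraphrased prompts and compare accuracy across phrasings. In the sketch below, `query_model` is a hypothetical stand-in that simulates a phrasing-dependent model; in practice you would swap in a real inference call:

```python
# Minimal sketch of a prompt-sensitivity check: run the same arithmetic
# task under paraphrased prompts and compare accuracy per phrasing.
import re

PROMPT_TEMPLATES = [
    "What is {a} + {b}?",
    "Compute the sum of {a} and {b}.",
    "{a} plus {b} equals what number? Answer with digits only.",
]

CASES = [(17, 25), (308, 94), (4021, 77), (12, 9)]

def query_model(prompt: str) -> str:
    """Fake model: answers correctly only for the first phrasing,
    to mimic the prompt sensitivity the study describes."""
    a, b = (int(n) for n in re.findall(r"\d+", prompt))
    return str(a + b) if prompt.startswith("What is") else str(a + b + 1)

for template in PROMPT_TEMPLATES:
    correct = sum(
        query_model(template.format(a=a, b=b)) == str(a + b)
        for a, b in CASES
    )
    print(f"{correct / len(CASES):>4.0%}  {template!r}")
```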

More worryingly, even models that have undergone reinforcement learning from human feedback (RLHF) still face reliability issues. In complex scenarios, these models often sound confident even as their error rates climb significantly, which can lead users to accept incorrect results unknowingly and make serious errors of judgment.
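
One standard way to quantify this kind of overconfidence is expected calibration error (ECE), which compares a model's stated confidence with its actual accuracy. The sketch below uses made-up numbers, not data from the study:

```python
# Minimal sketch of the mismatch the article describes: a model whose
# stated confidence exceeds its actual accuracy, quantified via expected
# calibration error (ECE). The numbers are illustrative, not from the study.

def expected_calibration_error(confidences, correct, n_bins=5):
    """Average |confidence - accuracy| over equal-width confidence bins,
    weighted by the number of samples in each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: ~92% stated confidence, 50% actual accuracy.
confidences = [0.95, 0.9, 0.92, 0.88, 0.97, 0.9]
correct     = [True, False, False, True, False, True]
print(f"ECE = {expected_calibration_error(confidences, correct):.2f}")
```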

This study casts a cold light on the AI field, especially in contrast to the optimistic prediction AI luminary Ilya Sutskever made two years ago, when he confidently stated that over time AI performance would come to meet human expectations. Reality has provided a very different answer.

This research holds up a mirror to the many shortcomings of current large models. However high our expectations for the future of AI, these findings remind us to stay cautious about these "smart" models: their reliability problems need urgent attention, and the road ahead remains long.

The study not only reveals the current state of AI development but also offers important reference points for future research. It reminds us that in pursuing greater AI capability, we must pay equal attention to stability and reliability. Future AI research may need to focus more on improving model consistency and on balancing performance across simple and complex tasks.

Reference:

https://docs.google.com/document/u/0/d/1SwdgJBLo-WMQs-Z55HHndTf4ZsqGop3FccnUk6f8E-w/mobilebasic