The capabilities of Artificial Intelligence (AI) are advancing rapidly, making the accurate measurement of its "intelligence" a key industry focus. Yet evaluating AI intelligence is just as difficult as measuring human intelligence, and existing tests and benchmarks provide only approximate assessments. In recent years, as AI models have become increasingly complex, the limitations of traditional benchmarks have become more apparent, prompting the industry to explore new evaluation systems that are more comprehensive and better reflect real-world application capabilities.

Limitations of Traditional Benchmarks: High Scores ≠ High Ability

For a long time, the generative AI community has relied on benchmarks such as MMLU (Massive Multitask Language Understanding) to evaluate model capabilities. These benchmarks typically use multiple-choice questions covering various academic fields, facilitating direct comparisons. However, this format is considered inadequate for truly capturing AI intelligence. For instance, some models achieve similar scores on MMLU but exhibit significant differences in real-world applications, indicating that high scores on paper don't necessarily translate to real-world capabilities.
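To make the format concrete, here is a minimal sketch of how an MMLU-style multiple-choice benchmark is typically scored. It assumes the Hugging Face "cais/mmlu" dataset layout (a question, a list of choices, and a gold answer index); model_predict is a hypothetical placeholder for whatever model is under test, not a real API:

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# Assumes the Hugging Face "cais/mmlu" layout: each row has "question",
# "choices" (a list of options), and "answer" (the gold index).
from datasets import load_dataset

def model_predict(question: str, choices: list[str]) -> int:
    """Placeholder: a real harness would query the model here."""
    return 0  # always pick the first option

dataset = load_dataset("cais/mmlu", "all", split="test")

correct = 0
for row in dataset:
    if model_predict(row["question"], row["choices"]) == row["answer"]:
        correct += 1

print(f"MMLU-style accuracy: {correct / len(dataset):.1%}")
```

The harness only checks whether the selected index matches the answer key, which is precisely why a high score can coexist with poor real-world behavior.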

Furthermore, even well-established tests such as college entrance exams don't guarantee that high-scoring candidates possess the same level of intelligence or have reached the peak of their intellectual abilities, which further illustrates that benchmarks are approximate measures of ability, not precise metrics. Even more concerning, some advanced models make "low-level" errors on seemingly simple tasks, such as miscounting the occurrences of a specific letter in a word or comparing two decimal numbers incorrectly. These cases highlight the disconnect between benchmark-driven progress and AI's reliability in the real world.
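These failure cases are easy to verify programmatically, which is what makes them so striking; the snippet below uses arbitrary illustrative values rather than items from any specific benchmark:

```python
# Ground-truth checks for two tasks that some advanced models still get wrong.
word = "strawberry"
print(word.count("r"))   # 3 -- some models have answered 2

a, b = 9.11, 9.9
print(max(a, b))         # 9.9 -- some models claim 9.11 is the larger value
```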

New Benchmarks Emerge: Focusing on General Reasoning and Real-World Applications

In response to the shortcomings of traditional benchmarks, the AI industry is actively exploring new evaluation frameworks. The recently released ARC-AGI benchmark, which aims to push model development toward general reasoning and creative problem-solving, has been well received. Another noteworthy new benchmark is "Humanity's Last Exam," which comprises 3,000 peer-reviewed, multi-step problems spanning multiple disciplines and attempts to challenge AI systems on expert-level reasoning. Early results show that an OpenAI model reached a 26.6% score within a month of the test's release, demonstrating rapid AI progress.

However, much like traditional benchmarks, Humanity's Last Exam primarily assesses knowledge and reasoning in isolation, overlooking the increasingly important ability to use tools in real-world applications. GPT-4, even when equipped with tools, achieved only about 15% on the more complex GAIA benchmark, further confirming the gap between traditional benchmark scores and practical capability.

The GAIA Benchmark: A New Standard for Measuring AI's Real-World Application Capabilities

To address the shortcomings of traditional benchmarks, the industry has introduced the GAIA benchmark, which is closer to real-world applications. Created collaboratively by Meta-FAIR, Meta-GenAI, HuggingFace, and the AutoGPT team, GAIA contains 466 carefully designed problems divided into three difficulty levels. These problems comprehensively test key AI capabilities such as web browsing, multi-modal understanding, code execution, file handling, and complex reasoning—all crucial for real-world commercial applications.
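For readers who want to inspect the problems themselves, the sketch below shows one way to pull the data from Hugging Face. It assumes the gated gaia-benchmark/GAIA dataset and the 2023 configuration names as listed on the dataset card; access must be requested on the hub before loading, and names may differ from what is shown here:

```python
# Sketch: loading GAIA Level 1 validation tasks from Hugging Face.
# Assumes access to the gated "gaia-benchmark/GAIA" dataset has been granted
# and a hub token is configured; configuration names (e.g. "2023_level1")
# follow the dataset card and may change.
from datasets import load_dataset

gaia = load_dataset("gaia-benchmark/GAIA", "2023_level1", split="validation")

for task in gaia.select(range(3)):
    print(task)  # each row carries the question, its level, and any attached file
```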

GAIA problems simulate the complexity of real-world business problems. Level 1 problems require roughly 5 steps and a single tool to solve, Level 2 problems require 5 to 10 steps and multiple tools, and Level 3 problems may require up to 50 discrete steps and any number of tools. This tiered structure reflects the reality that solving real-world problems usually takes multiple steps and the coordination of several tools, as illustrated by the sketch below.
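To show why the higher levels stress orchestration rather than raw knowledge, here is a hypothetical multi-step, multi-tool agent loop; the tool registry and planner are illustrative placeholders and are not part of the benchmark itself:

```python
# Hypothetical sketch of a multi-step, multi-tool agent loop. Nothing here is
# prescribed by GAIA; it only illustrates why Level 3 tasks (up to ~50 discrete
# steps, any number of tools) test orchestration as much as knowledge.
from __future__ import annotations
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"<search results for {query!r}>",
    "read_file":  lambda path: f"<contents of {path}>",
    "run_code":   lambda src: f"<output of executing {len(src)} chars of code>",
}

def plan_next_step(question: str, history: list[str]) -> tuple[str, str] | None:
    """Placeholder planner: a real agent would ask an LLM which tool to call next."""
    return None if history else ("web_search", question)

def solve(question: str, max_steps: int = 50) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):
        step = plan_next_step(question, history)
        if step is None:       # the planner decides the answer is ready
            break
        tool_name, argument = step
        history.append(TOOLS[tool_name](argument))
    return history

print(solve("Which institutions released the GAIA benchmark?"))
```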

Preliminary GAIA Results: Highlighting Flexibility and Specialization

Early results from the GAIA benchmark show a flexible AI model achieving 75% accuracy, surpassing Microsoft's Magentic-One (38%) and Google's Langfun Agent (49%). This model's success is attributed to its use of a combination of specialized models for audio-visual understanding and reasoning, with Anthropic's Claude 3.5 Sonnet as the primary model.

The emergence of GAIA reflects a broader shift in AI evaluation: we are moving from evaluating standalone Software-as-a-Service (SaaS) applications to evaluating AI agents capable of coordinating multiple tools and workflows. As businesses increasingly rely on AI systems to handle complex, multi-step tasks, benchmarks like GAIA provide a more practical measure of capability than traditional multiple-choice questions.

Benchmark Access: https://huggingface.co/gaia-benchmark