The rapid advancement of Artificial Intelligence (AI) models has left users unsure how well these models actually perform, even as developers continue to improve them. To address this, the Vector Institute, co-founded by Geoffrey Hinton, has launched a study, "Assessing the State of the Art," evaluating AI models. The study uses an interactive leaderboard to assess 11 leading open-source and closed-source models across 16 benchmarks spanning mathematics, general knowledge, coding, and security.

John Willes, AI Infrastructure and Research Engineering Manager at the Vector Institute, stated: "Researchers, developers, regulators, and end-users can independently verify results, compare model performance, and build their own benchmarks and evaluations, thereby driving improvement and accountability."
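
To illustrate what independently verifying a leaderboard result might involve, here is a minimal sketch of a reproducible benchmark run. The `query_model` stub, model names, and question set are hypothetical placeholders, not part of the Vector Institute's actual harness.

```python
# Minimal sketch of a reproducible benchmark run. The query_model stub and the
# toy question set are illustrative assumptions, not the study's real harness.

BENCHMARK = [  # toy question/answer pairs standing in for a real benchmark
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "What is the capital of Canada?", "answer": "Ottawa"},
]

MODELS = ["open-model-a", "closed-model-b"]  # placeholder model identifiers


def query_model(model_name: str, prompt: str) -> str:
    """Stub for a model call; a real harness would hit an API or local weights."""
    return "408" if "17" in prompt else "Ottawa"


def evaluate(model_name: str) -> float:
    """Score one model on the benchmark with simple exact-match accuracy."""
    correct = 0
    for item in BENCHMARK:
        prediction = query_model(model_name, item["prompt"]).strip()
        if prediction == item["answer"]:
            correct += 1
    return correct / len(BENCHMARK)


if __name__ == "__main__":
    for model in MODELS:
        print(f"{model}: {evaluate(model):.2%} exact-match accuracy")
```

Because the questions, scoring rule, and model identifiers are all explicit, anyone can rerun the same script and compare results, which is the kind of transparency the leaderboard is meant to enable.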


Top-performing models in this evaluation included DeepSeek and OpenAI's o1, while Cohere's Command R+ lagged behind, a result attributed to its being the smallest and oldest model in the test.

The study found that closed-source models generally outperformed open-source models in complex knowledge and reasoning tasks, but DeepSeek's strong showing demonstrates the continued competitiveness of open-source models. Willes noted, "These models are quite capable on simple tasks, but as task complexity increases, we see a significant drop in reasoning and comprehension abilities."

Furthermore, all 11 models faced challenges on "proxy benchmarks" that evaluate real-world problem-solving, particularly software engineering and other tasks requiring open-ended reasoning and planning. The evaluation also incorporated the Massive Multi-discipline Multimodal Understanding (MMMU) benchmark, which assesses a model's ability to handle both images and text.
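
MMMU items pair an image with a question, typically multiple-choice. The sketch below shows one plausible way such items could be represented and scored; the `MultimodalItem` structure and the `answer_multimodal` stub are illustrative assumptions rather than the benchmark's real format.

```python
# Rough sketch of scoring multiple-choice items that pair an image with a
# question. Field names and the answer_multimodal stub are assumptions.
from dataclasses import dataclass


@dataclass
class MultimodalItem:
    image_path: str          # path to the figure, chart, or diagram
    question: str            # question grounded in the image
    options: dict[str, str]  # e.g. {"A": "...", "B": "..."}
    answer: str              # gold option letter


def answer_multimodal(item: MultimodalItem) -> str:
    """Stub: a real harness would send the image and prompt to the model."""
    return "A"


def score(items: list[MultimodalItem]) -> float:
    """Exact-match accuracy over the predicted option letters."""
    hits = sum(answer_multimodal(it) == it.answer for it in items)
    return hits / len(items)


items = [
    MultimodalItem(
        image_path="figures/circuit_01.png",
        question="Which component limits the current in this circuit?",
        options={"A": "The resistor", "B": "The capacitor"},
        answer="A",
    )
]
print(f"accuracy: {score(items):.2%}")
```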

In the multimodal understanding evaluation, o1 demonstrated "excellent" capabilities, especially across different formats and difficulty levels. However, Willes emphasized that more work is needed to achieve truly multimodal systems capable of uniformly handling text, image, and audio inputs.

On the challenges of evaluation, Willes highlighted evaluation leakage, in which models perform well on benchmark data they have effectively seen before but poorly on genuinely new data. He believes that developing more innovative benchmarks and dynamic evaluations will be key to resolving this issue.
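
One common approach to dynamic evaluation is to generate test items procedurally at run time, so a model cannot benefit from having memorized the exact questions. The sketch below illustrates the idea with templated arithmetic; it is an assumption about how such an evaluation could work, not a description of the Vector Institute's method.

```python
# Sketch of reducing evaluation leakage by generating fresh test items on each
# run. The query_model stub and arithmetic template are illustrative only.
import random


def make_arithmetic_item(rng: random.Random) -> tuple[str, str]:
    """Build a new arithmetic question from random operands on every run."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} + {b}?", str(a + b)


def query_model(prompt: str) -> str:
    """Stub for a model call; replace with a real API or local inference."""
    return prompt.split()[-1].rstrip("?")  # placeholder, not a real answer


def dynamic_eval(n_items: int = 50, seed: int = 0) -> float:
    rng = random.Random(seed)  # seeded so a given run stays reproducible
    correct = 0
    for _ in range(n_items):
        prompt, gold = make_arithmetic_item(rng)
        if query_model(prompt).strip() == gold:
            correct += 1
    return correct / n_items


print(f"dynamic accuracy: {dynamic_eval():.2%}")
```

Because the concrete questions change with the seed, a high score reflects the ability to solve the task rather than recall of a fixed, possibly leaked test set.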