Following OpenAI's GPT-4 consistently achieving remarkable results on traditional mathematics benchmarks, a research team from Peking University and Alibaba has jointly introduced a new evaluation benchmark, Omni-MATH, designed to assess the reasoning capabilities of large language models at the level of the International Mathematical Olympiad. The initiative not only sets a new standard for evaluating AI's mathematical abilities but also opens new avenues for exploring AI's potential in advanced mathematics.


Unique Design of Omni-MATH

The Omni-MATH benchmark contains 4,428 competition-level mathematics problems, covering more than 33 sub-domains of mathematics and divided into 10 difficulty levels (a brief data-loading sketch follows the feature list below). Its features include:

High Reliability: All questions are sourced from various mathematical competitions and forums, with answers verified by humans.

Broad Coverage: Problems range from Olympiad-preparation level (tier T4) up to top international competitions (tier T0) such as the IMO, IMC, and Putnam.

Diverse Answers: Evaluation methods based on GPT-4 and other judge models accommodate the many valid forms a correct answer can take.
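
For readers who want to inspect the benchmark directly, the sketch below shows one way it might be loaded and filtered with the Hugging Face datasets library. The hub id KbsdJames/Omni-MATH mirrors the GitHub repository name, and the split name and field names ("difficulty", "problem") are assumptions rather than details confirmed by this article.

```python
# A minimal sketch of loading and filtering the benchmark with the Hugging Face
# `datasets` library. The hub id "KbsdJames/Omni-MATH" mirrors the GitHub
# repository name; the split name and the field names ("difficulty", "problem")
# are assumptions, not details confirmed by this article.
from collections import Counter

from datasets import load_dataset

omni_math = load_dataset("KbsdJames/Omni-MATH", split="test")
print(f"Total problems: {len(omni_math)}")  # expected: 4,428

# Distribution over the assumed numeric difficulty field (1 = easiest).
print(Counter(example["difficulty"] for example in omni_math))

# Keep only the hardest, Olympiad-level items for a focused evaluation run.
hardest = omni_math.filter(lambda example: example["difficulty"] >= 9)
print(f"Problems with difficulty >= 9: {len(hardest)}")
print(hardest[0]["problem"])
```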

Besides OpenAI's top-scoring models, notable performers on the latest leaderboard include:

o1-mini: average score about 8 percentage points higher than o1-preview.

Qwen2-Math-72B: exceeded the performance of GPT-4 Turbo.

These results demonstrate that even smaller models can excel in specific capabilities.

Depth and Breadth of the Evaluation System

The design of Omni-MATH fully considers the selection process and difficulty levels of international mathematical competitions:

Referencing the Olympiad mathematics selection systems of countries such as the UK and the US.

Covering multiple mathematical fields from number theory and algebra to geometry.

Data sources include various competition questions, analyses, and forum content from renowned mathematical websites.

Innovative Evaluation Methods

The research team also developed Omni-Judge, an open-source answer verifier built on a fine-tuned Llama3-Instruct model that quickly judges whether a model's output is consistent with the reference answer. The verifier achieves a consistency rate of about 95% while providing a convenient way to evaluate answers to complex mathematical problems.
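
The sketch below illustrates how such an answer verifier might be called in practice. The Hugging Face model id KbsdJames/Omni-Judge, the prompt wording, and the expected "consistent"/"inconsistent" verdict format are assumptions for illustration; the project repository documents the official usage.

```python
# A minimal sketch of calling an answer verifier in the spirit of Omni-Judge:
# the judge receives the problem, the reference answer, and a candidate model's
# output, and returns a consistency verdict. The hub id "KbsdJames/Omni-Judge",
# the prompt wording, and the verdict format are assumptions for illustration.
import torch
from transformers import pipeline

judge = pipeline(
    "text-generation",
    model="KbsdJames/Omni-Judge",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

problem = "What is the smallest positive integer divisible by both 6 and 10?"
reference_answer = "30"
candidate_output = "The least common multiple of 6 and 10 is 30."

messages = [
    {
        "role": "user",
        "content": (
            "You are a mathematical answer verifier.\n"
            f"Problem: {problem}\n"
            f"Reference answer: {reference_answer}\n"
            f"Model answer: {candidate_output}\n"
            "Are the two answers equivalent? Reply 'consistent' or 'inconsistent'."
        ),
    }
]

# Chat-style input assumes the verifier keeps an instruction-tuned chat template.
result = judge(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])
```

In a full evaluation run, such per-problem verdicts would be aggregated across the benchmark to produce the accuracy figures reported on the leaderboard.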

The launch of Omni-MATH not only poses a new challenge to AI's mathematical abilities but also provides an important evaluation tool for the future application and development of AI in advanced mathematics. As AI technology continues to advance, we may witness astonishing AI performances at the International Mathematical Olympiad in the near future.

Project Address: https://github.com/KbsdJames/Omni-MATH/