The debate over how artificial intelligence benchmarks are reported has recently intensified. An OpenAI employee accused xAI, the AI company founded by Elon Musk, of publishing misleading benchmark results for its Grok 3 model, while xAI co-founder Igor Babushkin insisted the company did nothing wrong.

The incident began when xAI published a chart on its blog showing Grok 3's performance on AIME 2025, a set of challenging problems from a recent American Invitational Mathematics Examination. Although some experts have questioned AIME's validity as an AI benchmark, it is still widely used to evaluate models' mathematical abilities.

xAI's chart showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperforming OpenAI's best available model, o3-mini-high, on AIME 2025. However, OpenAI employees quickly pointed out that the chart omitted o3-mini-high's AIME 2025 score computed with "cons@64".

[Image: xAI's benchmark chart for Grok 3 on AIME 2025]

So, what is cons@64? It stands for "consensus@64": the model is given 64 attempts at each question, and the most common answer among those attempts is taken as its final answer. Unsurprisingly, this scoring mechanism can significantly boost a model's benchmark score, so omitting it from the chart could create the misleading impression that one model outperforms another when that is not the case.
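For a concrete sense of the difference, here is a minimal Python sketch of the two scoring rules applied to a single question. The answer strings and sample counts below are invented purely for illustration; this is not xAI's or OpenAI's actual grading code.

```python
from collections import Counter

def at_1(samples: list[str], correct: str) -> bool:
    # "@1": grade only the model's first sampled answer.
    return samples[0] == correct

def cons_at_k(samples: list[str], correct: str) -> bool:
    # "consensus@k": take the most frequent answer among the k samples
    # (a majority vote) and grade that single consensus answer.
    consensus, _ = Counter(samples).most_common(1)[0]
    return consensus == correct

# Hypothetical data: 64 sampled answers to one AIME-style question whose
# correct integer answer is "113". The counts are made up for illustration.
samples = ["042"] + ["113"] * 40 + ["042"] * 19 + ["007"] * 4  # 64 samples

print(at_1(samples, "113"))       # False: the first sample happened to be wrong
print(cons_at_k(samples, "113"))  # True: majority vote recovers "113"
```

As the toy example shows, the same model can fail under @1 yet pass under cons@64, which is why comparing one model's cons@64 score against another's @1 score can flip the apparent ranking.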

The "@1" scores for Grok3Reasoning Beta and Grok3mini Reasoning in AIME2025, which represent the scores from the models' first attempts, are actually lower than that of o3-mini-high. Moreover, Grok3Reasoning Beta's performance is slightly inferior to OpenAI's o1 model. Nevertheless, xAI continues to promote Grok3 as "the smartest AI in the world."

Babushkin responded on social media that OpenAI has published similarly misleading benchmark charts in the past, albeit ones comparing its own models against each other. A more neutral observer then compiled the various models' results into a more "accurate" chart, sparking broader discussion.

[Image: a third-party chart compiling the models' AIME 2025 scores]

Additionally, AI researcher Nathan Lambert pointed out that a more important metric remains unknown: the computational (and financial) cost each model incurred to achieve its best score. His point underscores how little most current AI benchmarks reveal about models' limitations and strengths.

Key Points:

🔍 The dispute between xAI and OpenAI over Grok 3's benchmark results has drawn widespread attention.  

📊 xAI's chart omitted the "cons@64" score for OpenAI's model, which could mislead readers.  

💰 The computational and financial costs behind each model's top scores remain unknown.