Recently, the Doubao large model team at ByteDance, in collaboration with the M-A-P open-source community, released SuperGPQA, a knowledge reasoning benchmark covering 285 postgraduate disciplines and containing 26,529 professional questions.
This dataset not only covers mainstream disciplines such as mathematics and physics but also, for the first time, brings long-tail disciplines such as light industry, agriculture, and service science into the evaluation system, filling a gap left by existing benchmarks. SuperGPQA has already been used to quantify the performance gap between open-source and closed-source models, making it a demanding yardstick for tracking progress in large language models.
Traditional benchmarks like MMLU and GPQA cover fewer than 50 disciplines, with long-tail disciplines accounting for less than 5%. Furthermore, because they draw on a single data source (e.g., Wikipedia) and rely on unreliable crowdsourced annotations, they struggle to measure model reasoning in complex scenarios. SuperGPQA, built over six months through an expert-LLM collaborative mechanism, draws its questions from authoritative sources. Each question has 9.67 options on average, and 42.33% of questions require mathematical calculation or formal reasoning, giving the benchmark both breadth and depth. Experiments show that the best-performing model, DeepSeek-R1, achieves an accuracy of only 61.82%, indicating that current large language models still have considerable room for improvement across diverse knowledge domains.
SuperGPQA employs a three-stage process to ensure quality: expert screening of source questions, standardized transcription, and multi-layered quality checks (rule-based filtering, LLM-based detection, and expert review). Evaluation results show that instruction fine-tuning significantly improves performance, with DeepSeek-V3 outperforming its base model; however, open-source models still lag behind closed-source ones on the hardest questions.
Paper Link: https://arxiv.org/pdf/2502.14739
Data Link: https://huggingface.co/datasets/m-a-p/SuperGPQA
Code Link: https://github.com/SuperGPQA/SuperGPQA
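For readers who want to explore the benchmark directly, the dataset can be pulled from the Hugging Face Hub with the `datasets` library. The minimal sketch below loads SuperGPQA and recomputes two of the headline statistics mentioned above (average options per question and the share of calculation-heavy questions). The split name `train` and the field names `options` and `is_calculation` are assumptions about the released schema rather than details confirmed by the paper; adjust them to match the dataset card if they differ.

```python
# Minimal sketch for inspecting SuperGPQA locally.
# Assumptions (not verified against the dataset card): records live in a
# "train" split, and each record has an "options" list and an
# "is_calculation" flag. Rename fields as needed.
from datasets import load_dataset

ds = load_dataset("m-a-p/SuperGPQA", split="train")

# Average number of answer options per question (reported as 9.67 in the paper).
avg_options = sum(len(row["options"]) for row in ds) / len(ds)

# Share of questions requiring calculation or formal reasoning
# (reported as 42.33% in the paper).
calc_share = sum(bool(row["is_calculation"]) for row in ds) / len(ds)

print(f"questions: {len(ds)}")
print(f"avg options per question: {avg_options:.2f}")
print(f"calculation-heavy share: {calc_share:.2%}")
```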