At a press conference on December 19, 2024, the Zhiyuan Institute and Tencent announced the launch of LongBench v2, a benchmark designed to assess the deep understanding and reasoning capabilities of large language models (LLMs) in real-world, long-text, multi-task scenarios. The benchmark aims to advance long-text understanding and reasoning and to address the difficulties LLMs currently face in practical applications.
LongBench v2 supports longer contexts than its predecessor, with text lengths ranging from 8k to 2M words. It contains 503 challenging multiple-choice questions, and the difficulty is high: even human experts, given 15 minutes per question, achieved an average accuracy of only 53.7%. The benchmark covers six major task categories, including single-document question answering, multi-document question answering, and long in-context learning, so that it reflects a wide range of application scenarios.
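Because every question is multiple choice, a score such as the 53.7% human baseline is simply the fraction of questions answered correctly. The following is a minimal sketch of that calculation, not the benchmark's own evaluation script; the function name `multiple_choice_accuracy` and the use of letter labels for the options are assumptions made for illustration.

```python
def multiple_choice_accuracy(predictions, answers):
    """Return the fraction of predictions that match the gold answers.

    Both arguments are sequences of option labels (assumed here to be
    letters such as "A" or "B"). A prediction of None, e.g. when no
    option could be extracted from a model's output, counts as wrong.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")

    correct = sum(
        pred is not None and pred.strip().upper() == gold.strip().upper()
        for pred, gold in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    # Hypothetical predictions for four questions; three are correct.
    print(multiple_choice_accuracy(["A", "b", None, "D"], ["A", "B", "C", "D"]))
```

Counting an unparseable answer as incorrect is one possible convention; a real evaluation would need to state how such cases are handled.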
To ensure the reliability of the evaluation, all questions in LongBench v2 are in multiple-choice format and have undergone a rigorous manual annotation and review process. During data collection, annotators from top universities were recruited to ensure the quality and difficulty of the questions. By introducing control variables, LongBench v2 improves upon the original Bradley-Terry statistical algorithm, reducing the impact of confounding factors and making the model ranking more reliable.
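For readers unfamiliar with the Bradley-Terry model mentioned above, it assigns each model a positive strength p so that model i beats model j with probability p_i / (p_i + p_j). The sketch below fits the standard, unmodified Bradley-Terry model with the classic iterative update; it does not include the control variables described in the article, whose exact form is not given here. The function name `fit_bradley_terry` and the input format of (winner, loser) pairs are assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Repeats the update  p_i <- W_i / sum_j [ n_ij / (p_i + p_j) ],
    where W_i is the number of wins of i and n_ij is the number of
    comparisons between i and j, then rescales so strengths sum to 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    players = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    players = sorted(players)
    strength = {p: 1.0 / len(players) for p in players}

    for _ in range(iterations):
        updated = {}
        for i in players:
            denom = 0.0
            for j in players:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Keep the old value for a player with no comparisons.
            updated[i] = wins[i] / denom if denom else strength[i]

        total = sum(updated.values())
        updated = {p: s / total for p, s in updated.items()}
        delta = max(abs(updated[p] - strength[p]) for p in players)
        strength = updated
        if delta < tol:
            break
    return strength


if __name__ == "__main__":
    # Hypothetical pairwise outcomes between three placeholder models.
    games = (
        [("model_a", "model_b")] * 8 + [("model_b", "model_a")] * 2
        + [("model_b", "model_c")] * 7 + [("model_c", "model_b")] * 3
        + [("model_a", "model_c")] * 9 + [("model_c", "model_a")] * 1
    )
    for name, p in sorted(fit_bradley_terry(games).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {p:.3f}")
```

With only two models, where one wins 3 of 4 comparisons, the fitted strengths are 0.75 and 0.25, matching the observed win rate. Ranking then amounts to sorting models by fitted strength.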
For the evaluation, the research team tested 10 open-source LLMs and 6 closed-source LLMs and found that introducing control variables significantly improved model performance. Notably, GPT-4o performed exceptionally well on tasks such as multi-document question answering and long in-context learning once more reasoning steps were incorporated, which highlights the importance of reasoning ability.
The launch of LongBench v2 provides a new tool for evaluating large language models and points the way for future research, emphasizing the importance of strengthening the models' own understanding and reasoning capabilities. The collaboration between the Zhiyuan Institute and Tencent marks a further advance in AI technology, and we look forward to this benchmark driving progress in long-text understanding and reasoning.
Homepage: https://longbench2.github.io
Paper: https://arxiv.org/abs/2412.15204
Data and Code: https://github.com/THUDM/LongBench