The OpenAI team has introduced PaperBench, a benchmark designed to evaluate the ability of AI agents to replicate cutting-edge AI research. The test requires AI agents to replicate, from scratch, 20 Spotlight and Oral papers from the 2024 International Conference on Machine Learning (ICML). Replication involves understanding each paper's contributions, developing a codebase, and successfully executing the experiments.
To ensure objective evaluation, the researchers developed detailed rubrics that break each replication task down into sub-tasks with clear grading standards. PaperBench contains a total of 8,316 individually gradable tasks, with every rubric developed in collaboration with the authors of the corresponding paper to guarantee accuracy and validity.
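The hierarchical rubric idea can be sketched as a weighted tree: leaf requirements are graded pass/fail, and a parent's score is the weighted average of its children. This is a minimal illustrative sketch; the class names, weights, and node labels below are assumptions, not the benchmark's actual API.

```python
# Hedged sketch of hierarchical rubric scoring: leaves are pass/fail,
# internal nodes aggregate child scores by weight. Names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)
    passed: bool = False  # only meaningful for leaf nodes

    def score(self) -> float:
        """Leaf: 1.0 if passed, else 0.0; internal: weighted mean of children."""
        if not self.children:
            return 1.0 if self.passed else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

# Hypothetical example: one sub-task completed, one not.
paper = RubricNode("paper", children=[
    RubricNode("code-development", weight=2.0, passed=True),
    RubricNode("experiment-execution", weight=1.0, passed=False),
])
print(round(paper.score(), 3))  # → 0.667 (weighted average 2/3)
```

Scoring bottom-up this way lets partial credit propagate: an agent that writes correct code but fails to run the experiments still earns a nonzero replication score.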
For large-scale evaluation, the research team also developed an automated scoring system based on a large language model (LLM). This system scores the AI agent's replication attempts according to the predefined scoring criteria. The team also established an independent benchmark for this scoring system to assess its performance.
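Benchmarking the judge itself amounts to comparing its pass/fail verdicts against human gold labels on the same rubric items, for example with a binary F1 score. The data and metric choice below are an illustrative assumption, not the team's exact evaluation protocol.

```python
# Hedged sketch: measuring an automated judge's agreement with human graders
# via binary F1 over pass/fail verdicts. The example labels are hypothetical.
def f1_score(gold, pred):
    """Binary F1 between human (gold) and judge (pred) verdicts."""
    tp = sum(g and p for g, p in zip(gold, pred))          # both say pass
    fp = sum((not g) and p for g, p in zip(gold, pred))    # judge too lenient
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # judge too strict
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

human_verdicts = [True, True, False, True, False]
judge_verdicts = [True, False, False, True, True]
print(round(f1_score(human_verdicts, judge_verdicts), 3))  # → 0.667
```

A high F1 against human grades is what justifies substituting the LLM judge for manual grading at the scale of thousands of rubric items.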
After evaluating several leading AI models, the study found that the best-performing agent was Claude 3.5 Sonnet (New), achieving an average replication score of 21.0%. To put these results in context, the researchers also recruited top machine learning PhD students to attempt a subset of PaperBench. The comparison showed that current AI models have not yet surpassed human replication capabilities.
To foster further research, the OpenAI team has decided to open-source their developed code, allowing more researchers to utilize this platform and explore the engineering capabilities of AI agents and their potential in replicating AI research.
Project code: https://github.com/openai/preparedness/tree/main/project/paperbench
Key Highlights:
🌟 PaperBench is a new benchmark for evaluating the ability of AI agents to replicate AI research, encompassing 20 ICML 2024 papers.
🔍 The test features 8316 individually scorable tasks, with scoring criteria developed in collaboration with paper authors.
🤖 Claude 3.5 Sonnet (New) was the best-performing model in the test, but still hasn't surpassed top human researchers.