The OpenAI team has introduced PaperBench, a benchmark designed to evaluate the ability of AI agents to replicate cutting-edge AI research. The test requires AI agents to replicate, from scratch, 20 Spotlight and Oral papers from the 2024 International Conference on Machine Learning (ICML). Replication involves understanding each paper's contributions, developing a codebase, and successfully executing the experiments.
To ensure objective evaluation, the researchers developed detailed rubrics that break each replication task down into sub-tasks with clear grading standards. PaperBench contains a total of 8,316 individually gradable tasks, with every rubric developed in collaboration with the authors of the corresponding paper to guarantee accuracy and validity.
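The hierarchical rubric idea can be sketched as a weighted tree: leaf requirements are graded pass/fail, and a parent's score is the weighted average of its children. This is a minimal illustrative sketch; the class names, weights, and node labels below are assumptions, not the benchmark's actual API.

```python
# Hedged sketch of hierarchical rubric scoring: leaves are pass/fail,
# internal nodes aggregate child scores by weight. Names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    children: List["RubricNode"] = field(default_factory=list)
    passed: bool = False  # only meaningful for leaf nodes

    def score(self) -> float:
        """Leaf: 1.0 if passed, else 0.0; internal: weighted mean of children."""
        if not self.children:
            return 1.0 if self.passed else 0.0
        total_weight = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total_weight

# Hypothetical example: one sub-task completed, one not.
paper = RubricNode("paper", children=[
    RubricNode("code-development", weight=2.0, passed=True),
    RubricNode("experiment-execution", weight=1.0, passed=False),
])
print(round(paper.score(), 3))  # → 0.667 (weighted average 2/3)
```

Scoring bottom-up this way lets partial credit propagate: an agent that writes correct code but fails to run the experiments still earns a nonzero replication score.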
For large-scale evaluation, the research team also developed an automated scoring system based on a large language model (LLM). This system scores the AI agent's replication attempts according to the predefined scoring criteria. The team also established an independent benchmark for this scoring system to assess its performance.
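Benchmarking the judge itself amounts to comparing its pass/fail verdicts against human gold labels on the same rubric items, for example with a binary F1 score. The data and metric choice below are an illustrative assumption, not the team's exact evaluation protocol.

```python
# Hedged sketch: measuring an automated judge's agreement with human graders
# via binary F1 over pass/fail verdicts. The example labels are hypothetical.
def f1_score(gold, pred):
    """Binary F1 between human (gold) and judge (pred) verdicts."""
    tp = sum(g and p for g, p in zip(gold, pred))          # both say pass
    fp = sum((not g) and p for g, p in zip(gold, pred))    # judge too lenient
    fn = sum(g and (not p) for g, p in zip(gold, pred))    # judge too strict
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

human_verdicts = [True, True, False, True, False]
judge_verdicts = [True, False, False, True, True]
print(round(f1_score(human_verdicts, judge_verdicts), 3))  # → 0.667
```

A high F1 against human grades is what justifies substituting the LLM judge for manual grading at the scale of thousands of rubric items.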
After evaluating several leading AI models, the study found that the best-performing agent was Claude 3.5 Sonnet (New), achieving an average replication score of 21.0%. To put these results in context, the researchers also recruited top machine learning PhD students to attempt a subset of PaperBench. The comparison showed that current AI models have not yet surpassed human replication capabilities.
To foster further research, the OpenAI team has decided to open-source their developed code, allowing more researchers to utilize this platform and explore the engineering capabilities of AI agents and their potential in replicating AI research.
Project code: https://github.com/openai/preparedness/tree/main/project/paperbench
Key Highlights:
🌟 PaperBench is a new benchmark for evaluating the ability of AI agents to replicate AI research, encompassing 20 ICML 2024 papers.
🔍 The test features 8316 individually scorable tasks, with scoring criteria developed in collaboration with paper authors.
🤖 Claude 3.5 Sonnet (New) was the best-performing model in the test, but still hasn't surpassed top human researchers.