In a recent study, OpenAI researchers introduced MLE-bench, a new benchmark for evaluating how well AI agents perform machine learning engineering.
The benchmark draws on 75 machine learning engineering competitions from Kaggle, chosen to test agents on real-world skills such as training models, preparing datasets, and running experiments.
To ground the evaluation, the team used Kaggle's public leaderboards to establish human baselines for each competition. In their experiments, they ran several frontier language models inside open-source agent scaffolds. The best-performing setup, OpenAI's o1-preview paired with the AIDE scaffold, achieved at least a Kaggle bronze medal level in 16.9% of the competitions.
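The article does not spell out how medal-level performance is determined against a leaderboard. The sketch below shows one simplified way to check whether an agent's submission score would land within a bronze cutoff; the cutoff rules, function names, and example numbers are assumptions for illustration, not the benchmark's actual grading code.

```python
def bronze_cutoff(num_teams: int) -> int:
    """Simplified bronze-medal cutoff, loosely modeled on Kaggle's
    team-count-dependent rules (not the official thresholds)."""
    if num_teams < 250:
        return max(1, int(num_teams * 0.40))  # roughly top 40% of teams
    if num_teams < 1000:
        return 100                            # roughly top 100 teams
    return max(1, int(num_teams * 0.10))      # roughly top 10% of teams


def achieves_bronze(agent_score: float, leaderboard: list[float],
                    higher_is_better: bool = True) -> bool:
    """Check whether the agent's score would place within the bronze cutoff
    of a competition's public leaderboard."""
    if higher_is_better:
        rank = sum(1 for s in leaderboard if s > agent_score) + 1
    else:
        rank = sum(1 for s in leaderboard if s < agent_score) + 1
    return rank <= bronze_cutoff(len(leaderboard))


# Example: an agent scoring 0.87 in a 400-team competition where higher is better.
board = [0.95, 0.93, 0.91] + [0.80] * 397
print(achieves_bronze(0.87, board))  # True: rank 4 falls within the top-100 cutoff
```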
The team also explored various forms of resource scaling for AI agents and studied whether contamination from pretraining affects the results. They argue that these findings lay the groundwork for understanding AI agents' capabilities in machine learning engineering, and to support future research they have open-sourced the benchmark code.
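The article does not describe the scaling methodology in detail. As one illustration of how "more resources" can be turned into a scaling curve, the sketch below applies a standard pass@k-style estimator to hypothetical per-competition medal counts; the numbers and the assumption that scaling is measured via repeated attempts are illustrative, not taken from the study.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled attempts succeeds, given c successes out of n total attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-competition results: n attempts run, c of them earned a medal.
results = [(8, 0), (8, 1), (8, 3), (8, 0), (8, 2)]

for k in (1, 2, 4, 8):
    rate = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"expected medal rate with {k} attempt(s): {rate:.2f}")
```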
The release of this benchmark marks a notable step for the field, particularly in evaluating and improving the engineering capabilities of AI agents. The researchers hope that MLE-bench will provide a more rigorous evaluation standard and a practical foundation for the development of AI technology.
Project page: https://openai.com/index/mle-bench/
Key Points:
🌟 MLE-bench is a new benchmark designed to evaluate AI agents' machine learning engineering capabilities.
🤖 The study covers 75 Kaggle competitions, testing the agents' model training and data processing abilities.
📊 The combination of OpenAI's o1-preview with the AIDE scaffold reached Kaggle's bronze medal level in 16.9% of the competitions.