MLE-bench

Benchmark for assessing the capabilities of AI agents in machine learning engineering.

Common Product · Productivity · Machine Learning · AI Agents
MLE-bench is a benchmark developed by OpenAI to measure how well AI agents perform machine learning engineering. It curates 75 diverse ML engineering competitions from Kaggle, testing real-world skills such as training models, preparing datasets, and running experiments. Human baselines for each competition are established from Kaggle's publicly available leaderboards. Several frontier language models were evaluated on the benchmark using open-source agent scaffolds, revealing that the best-performing setup, OpenAI's o1-preview paired with the AIDE scaffold, achieved at least a Kaggle bronze medal in 16.9% of the competitions. The study also investigates various forms of resource scaling for AI agents and the impact of contamination from pre-training. The benchmark code has been open-sourced to facilitate future research into the ML engineering capabilities of AI agents.
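To make the medal-based grading concrete, here is a minimal sketch of ranking an agent's submission score against a competition's final public leaderboard and checking it against Kaggle's bronze-medal cutoffs. This is an invented illustration, not the mlebench package's actual API: the function names and leaderboard numbers are hypothetical, and the cutoffs follow Kaggle's published progression rules.

```python
# Hypothetical sketch of medal-threshold grading (not the mlebench API).

def bronze_cutoff_rank(num_teams: int) -> int:
    """Highest leaderboard rank that still earns bronze (Kaggle's public rules)."""
    if num_teams < 250:
        return max(1, int(num_teams * 0.40))  # top 40% of teams
    if num_teams < 1000:
        return 100                            # top 100 teams
    return int(num_teams * 0.10)              # top 10% of teams

def earns_at_least_bronze(agent_score: float, leaderboard: list[float],
                          higher_is_better: bool = True) -> bool:
    """True if the agent's score would rank at or above the bronze cutoff."""
    if higher_is_better:
        rank = 1 + sum(s > agent_score for s in leaderboard)
    else:
        rank = 1 + sum(s < agent_score for s in leaderboard)
    return rank <= bronze_cutoff_rank(len(leaderboard))

# Hypothetical 8-team leaderboard (accuracy, higher is better):
final_scores = [0.91, 0.89, 0.88, 0.86, 0.85, 0.83, 0.80, 0.78]
print(earns_at_least_bronze(0.90, final_scores))  # rank 2 of 8, cutoff 3 -> True
print(earns_at_least_bronze(0.84, final_scores))  # rank 6 of 8, cutoff 3 -> False
```

MLE-bench's headline result is the fraction of competitions in which the agent clears at least such a bronze threshold, which is where the 16.9% figure above comes from.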

MLE-bench Visit Over Time

Monthly Visits: 525,964,165
Bounce Rate: 57.10%
Pages per Visit: 2.2
Visit Duration: 00:01:38
