MLE-bench
Benchmark for assessing the capabilities of AI agents in machine learning engineering.
Common Product · Productivity · Machine Learning · AI Agents
MLE-bench is a benchmark introduced by OpenAI to measure how well AI agents perform at machine learning engineering. It curates 75 diverse machine-learning engineering competitions from Kaggle, testing real-world skills such as training models, preparing datasets, and running experiments. Human baselines for each competition are established from Kaggle's publicly available leaderboards. Several frontier language models were evaluated on the benchmark using open-source agent scaffolds; the best-performing setup, OpenAI's o1-preview paired with the AIDE scaffold, achieved at least a Kaggle bronze medal in 16.9% of the competitions. The study also examines how agent performance scales with additional resources and the impact of contamination from pre-training. The MLE-bench code has been open-sourced to facilitate future research on the machine learning engineering capabilities of AI agents.
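To make the leaderboard-based grading concrete, here is a minimal Python sketch of how an agent's would-be leaderboard rank could be mapped to a Kaggle medal. The rank cutoffs follow Kaggle's published progression-system rules; the function names and structure are illustrative assumptions, not the actual mle-bench API.

```python
def medal_thresholds(num_teams: int) -> dict[str, int]:
    """Worst (highest) leaderboard rank that still earns each medal,
    following Kaggle's published progression-system cutoffs.
    NOTE: a simplified illustration, not mle-bench's grading code."""
    if num_teams < 100:
        return {"gold": int(num_teams * 0.10),
                "silver": int(num_teams * 0.20),
                "bronze": int(num_teams * 0.40)}
    if num_teams < 250:
        return {"gold": 10,
                "silver": int(num_teams * 0.20),
                "bronze": int(num_teams * 0.40)}
    if num_teams < 1000:
        return {"gold": 10 + int(num_teams * 0.002),
                "silver": 50,
                "bronze": 100}
    return {"gold": 10 + int(num_teams * 0.002),
            "silver": int(num_teams * 0.05),
            "bronze": int(num_teams * 0.10)}


def grade_submission(agent_rank: int, num_teams: int) -> str | None:
    """Map an agent's rank against the human leaderboard to a medal."""
    cutoffs = medal_thresholds(num_teams)
    for medal in ("gold", "silver", "bronze"):
        if agent_rank <= cutoffs[medal]:
            return medal
    return None  # no medal earned


if __name__ == "__main__":
    # Example: ranking 85th out of 900 teams clears the bronze cutoff (100).
    print(grade_submission(85, 900))  # -> "bronze"
```

An agent's overall benchmark score can then be summarized as the fraction of the 75 competitions in which it earns any medal, which is how the 16.9% figure above is reported.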
MLE-bench Visits Over Time
Monthly Visits: 525,964,165
Bounce Rate: 57.10%
Pages per Visit: 2.2
Visit Duration: 00:01:38