ByteDance's Doubao large model team recently announced the open-sourcing of Multi-SWE-bench, the industry's first multilingual code repair benchmark dataset, aimed at evaluating and improving large models' "automatic bug fixing" capabilities.

With the rapid development of large model technology, code generation tasks have become a key area for testing model intelligence. While code repair benchmarks like SWE-bench can measure a model's programming intelligence, they have significant limitations. They focus solely on Python, failing to assess cross-lingual generalization capabilities. Furthermore, their limited task difficulty restricts the evaluation of large models in complex development scenarios, hindering further advancements in code intelligence.

Figure: Code Ability Scores for Different Models

Multi-SWE-bench addresses these limitations. Building upon SWE-bench, it significantly expands coverage to include seven mainstream programming languages: Java, TypeScript, C, C++, Go, Rust, and JavaScript. It comprises 1632 repair tasks sourced from real-world open-source repositories. These tasks have undergone rigorous screening and manual verification to ensure reliability. Multi-SWE-bench also introduces a difficulty level system (easy, medium, hard), enabling a more comprehensive evaluation of model performance across different skill levels.
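
To make the task format concrete, here is a minimal sketch (not the official loader) that reads a hypothetical local JSONL export of the benchmark and tallies instances by language and difficulty. The file name and field names ("language", "difficulty") are illustrative assumptions, not the dataset's confirmed schema.

```python
# Minimal sketch: tally Multi-SWE-bench-style instances by language and difficulty.
# The JSONL file name and field names below are assumptions for illustration only.
import json
from collections import Counter


def load_instances(path: str) -> list[dict]:
    """Load one benchmark instance per JSONL line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(instances: list[dict]) -> Counter:
    """Count instances per (language, difficulty) pair."""
    return Counter((inst["language"], inst["difficulty"]) for inst in instances)


if __name__ == "__main__":
    tasks = load_instances("multi_swe_bench.jsonl")  # hypothetical local export
    for (lang, level), n in sorted(summarize(tasks).items()):
        print(f"{lang:12s} {level:6s} {n:4d}")
```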

Experiments using this dataset show that current large language models perform reasonably well on Python repair tasks, but their average repair rate for other languages is less than 10%, highlighting the challenge of multilingual code repair for large models.

Some mainstream models perform well in Python but fall significantly short in other languages. Furthermore, models' repair rates decrease as task difficulty increases.

To support the application of reinforcement learning in automatic programming, the team also open-sourced Multi-SWE-RL. It provides 4,723 instances with reproducible Docker environments that support one-click startup and automatic evaluation, creating a standardized data foundation for RL training. Additionally, the team has launched an open-source community initiative, inviting developers and researchers to participate in dataset expansion, new method evaluation, and other efforts to jointly advance the RL for Code ecosystem.
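
To illustrate how such containerized instances can back an RL training loop, below is a minimal sketch of a reward step, assuming each instance ships a prebuilt Docker image and a shell test command. The image name, in-container repo path (/workspace), and test command are illustrative assumptions, not the released environment's actual interface.

```python
# Minimal sketch of a Multi-SWE-RL-style reward step: apply a model-generated
# patch inside the instance's container, run its tests, and map the outcome to
# a binary reward. Image names, paths, and test commands are assumptions.
import subprocess
import tempfile
from pathlib import Path


def evaluate_patch(image: str, test_cmd: str, patch: str, timeout: int = 1800) -> float:
    """Return 1.0 if the patched repository passes its tests, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        patch_file = Path(tmp) / "model.patch"
        patch_file.write_text(patch, encoding="utf-8")
        cmd = [
            "docker", "run", "--rm",
            "-v", f"{patch_file}:/tmp/model.patch:ro",
            image,
            "bash", "-lc",
            f"cd /workspace && git apply /tmp/model.patch && {test_cmd}",
        ]
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0


# Example call (hypothetical image and test command):
# reward = evaluate_patch("multi-swe-rl/rust-instance-0421", "cargo test", patch_text)
```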

The ByteDance Doubao large model team hopes that Multi-SWE-bench will propel automatic programming technology to new heights. They plan to continue expanding its coverage to help large models make greater strides in the field of "automated software engineering."