AWS AI Labs recently launched SWE-PolyBench, a multilingual, open-source benchmark designed to provide a more comprehensive framework for evaluating AI programming assistants. Driven by advances in large language models (LLMs), AI programming assistants that generate, modify, and understand software code have improved rapidly. Existing evaluation methods, however, remain limited: many benchmarks focus on a single language, typically Python, and therefore fail to reflect the structural and semantic diversity of real-world codebases.

SWE-PolyBench addresses this by covering 21 GitHub repositories across four popular programming languages: Java, JavaScript, TypeScript, and Python. It contains 2,110 tasks spanning bug fixes, feature implementations, and code refactoring. Unlike previous benchmarks, SWE-PolyBench is built from real-world pull requests (PRs) that close actual issues and come with associated test cases, enabling verifiable, execution-based evaluation. A smaller stratified subset, SWE-PolyBench500, is also released to support faster experimentation while preserving the diversity of tasks and languages.
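
As a rough illustration of how the benchmark's scale might be explored, the sketch below tallies tasks by language and category from a hypothetical JSON Lines export of the dataset. The field names (`language`, `task_category`) and the file name are assumptions for illustration, not the benchmark's actual schema.

```python
import json
from collections import Counter

def summarize_tasks(path: str) -> None:
    """Count tasks per language and per task category.

    Assumes one JSON object per line with "language" and "task_category"
    fields; the released dataset's real field names may differ.
    """
    by_language: Counter = Counter()
    by_category: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            by_language[task["language"]] += 1
            by_category[task["task_category"]] += 1
    print("Tasks per language:", dict(by_language))
    print("Tasks per category:", dict(by_category))

if __name__ == "__main__":
    summarize_tasks("swe_polybench_tasks.jsonl")  # hypothetical file name
```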

In terms of technical structure and evaluation metrics, SWE-PolyBench employs an execution-based evaluation process. Each task includes a codebase snapshot and a task description derived from a GitHub issue. The harness applies a candidate patch within a containerized testing environment configured for the specific language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript). Success is measured with two kinds of unit tests: Fail-to-Pass (F2P) tests, which fail before the fix and must pass afterwards, and Pass-to-Pass (P2P) tests, which must keep passing to guard against regressions.
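
The following Python sketch illustrates the F2P/P2P idea under simplified assumptions: given pass/fail results from test runs before and after a patch is applied, it classifies tests and decides whether a task counts as resolved. It is not the benchmark's actual harness code.

```python
from typing import Dict, Set

def classify_tests(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, Set[str]]:
    """Split tests into F2P and P2P given pass/fail maps from two runs.

    `before` is the run on the unpatched snapshot, `after` the run with the
    candidate patch applied; True means the test passed.
    """
    f2p = {t for t, ok in after.items() if ok and not before.get(t, False)}
    p2p = {t for t, ok in after.items() if ok and before.get(t, False)}
    return {"fail_to_pass": f2p, "pass_to_pass": p2p}

def task_resolved(expected_f2p: Set[str], expected_p2p: Set[str],
                  after: Dict[str, bool]) -> bool:
    """A task counts as resolved only if every expected F2P test now passes
    and no expected P2P test has regressed."""
    return all(after.get(t, False) for t in expected_f2p | expected_p2p)
```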

For a more granular evaluation of programming assistants, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics, including file-level and node-level retrieval scores, which assess how well an assistant locates and modifies the relevant parts of the codebase. For the baseline evaluation, three open-source programming assistants – Aider, SWE-Agent, and Agentless – were adapted to the benchmark's multilingual repositories, all driven by Anthropic's Claude 3.5 model.
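
A minimal sketch of what a file-level retrieval score could look like is shown below: it compares the files touched by a model-generated patch against those touched by the ground-truth patch and reports recall. A node-level variant would additionally compare CST nodes (e.g., classes and functions) within those files; the exact scoring used by SWE-PolyBench may differ.

```python
import re
from typing import Set

def touched_files(unified_diff: str) -> Set[str]:
    """Extract target file paths from the '+++ b/<path>' headers of a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", unified_diff, flags=re.MULTILINE))

def file_level_recall(model_patch: str, gold_patch: str) -> float:
    """Fraction of ground-truth files that the model's patch also touches."""
    gold = touched_files(gold_patch)
    if not gold:
        return 0.0
    model = touched_files(model_patch)
    return len(gold & model) / len(gold)
```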

Evaluation results show significant performance differences across programming languages and task types. For instance, agents achieved pass rates of up to 24.1% on Python tasks but only 4.7% on TypeScript. Task complexity also matters: modifications confined to a single function or class reached success rates of up to 40%, while performance dropped sharply on tasks requiring multi-file changes.

GitHub: https://github.com/amazon-science/SWE-PolyBench

Key Highlights:

🌟 AWS introduces SWE-PolyBench, a comprehensive evaluation framework for AI programming assistants.

🔧 The benchmark covers 21 GitHub repositories and supports four languages: Java, JavaScript, TypeScript, and Python.

📈 Evaluation reveals performance variations across languages and tasks, with Python tasks showing the highest success rate.