AWS AI Labs recently launched SWE-PolyBench, a multilingual, open-source benchmark designed to provide a more comprehensive framework for evaluating AI programming assistants. Driven by advances in large language models (LLMs), AI programming assistants that generate, modify, and understand software code have improved rapidly. Existing evaluation methods, however, remain limited: many benchmarks focus on a single language, typically Python, and therefore fail to reflect the structural and semantic diversity of real-world codebases.

SWE-PolyBench addresses this by covering 21 GitHub repositories across four popular programming languages: Java, JavaScript, TypeScript, and Python. It contains 2,110 tasks spanning bug fixes, feature implementations, and code refactoring. Unlike previous benchmarks, SWE-PolyBench is built from real-world pull requests (PRs) that close actual issues and come with associated test cases, enabling verifiable, execution-based evaluation. A smaller stratified subset, SWE-PolyBench500, is also released to support faster experimentation while preserving the diversity of tasks and languages.
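
As a rough illustration of how the benchmark's scale might be explored, the sketch below tallies tasks by language and category from a hypothetical JSON Lines export of the dataset. The field names (`language`, `task_category`) and the file name are assumptions for illustration, not the benchmark's actual schema.

```python
import json
from collections import Counter

def summarize_tasks(path: str) -> None:
    """Count tasks per language and per task category.

    Assumes one JSON object per line with "language" and "task_category"
    fields; the released dataset's real field names may differ.
    """
    by_language: Counter = Counter()
    by_category: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            by_language[task["language"]] += 1
            by_category[task["task_category"]] += 1
    print("Tasks per language:", dict(by_language))
    print("Tasks per category:", dict(by_category))

if __name__ == "__main__":
    summarize_tasks("swe_polybench_tasks.jsonl")  # hypothetical file name
```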

In terms of technical structure and evaluation metrics, SWE-PolyBench employs an execution-based evaluation process. Each task includes a codebase snapshot and a task description derived from a GitHub issue. The harness applies a candidate patch within a containerized testing environment configured for the specific language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript). Success is measured with two kinds of unit tests: Fail-to-Pass (F2P) tests, which fail before the fix and must pass afterwards, and Pass-to-Pass (P2P) tests, which must keep passing to guard against regressions.
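
The following Python sketch illustrates the F2P/P2P idea under simplified assumptions: given pass/fail results from test runs before and after a patch is applied, it classifies tests and decides whether a task counts as resolved. It is not the benchmark's actual harness code.

```python
from typing import Dict, Set

def classify_tests(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, Set[str]]:
    """Split tests into F2P and P2P given pass/fail maps from two runs.

    `before` is the run on the unpatched snapshot, `after` the run with the
    candidate patch applied; True means the test passed.
    """
    f2p = {t for t, ok in after.items() if ok and not before.get(t, False)}
    p2p = {t for t, ok in after.items() if ok and before.get(t, False)}
    return {"fail_to_pass": f2p, "pass_to_pass": p2p}

def task_resolved(expected_f2p: Set[str], expected_p2p: Set[str],
                  after: Dict[str, bool]) -> bool:
    """A task counts as resolved only if every expected F2P test now passes
    and no expected P2P test has regressed."""
    return all(after.get(t, False) for t in expected_f2p | expected_p2p)
```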

For a more granular evaluation of programming assistants, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics, including file-level and node-level retrieval scores, which assess how well an assistant locates and modifies the relevant parts of the codebase. For the baseline evaluation, three open-source programming assistants – Aider, SWE-Agent, and Agentless – were adapted to the benchmark's multilingual repositories, all driven by Anthropic's Claude 3.5 model.
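
A minimal sketch of what a file-level retrieval score could look like is shown below: it compares the files touched by a model-generated patch against those touched by the ground-truth patch and reports recall. A node-level variant would additionally compare CST nodes (e.g., classes and functions) within those files; the exact scoring used by SWE-PolyBench may differ.

```python
import re
from typing import Set

def touched_files(unified_diff: str) -> Set[str]:
    """Extract target file paths from the '+++ b/<path>' headers of a unified diff."""
    return set(re.findall(r"^\+\+\+ b/(\S+)", unified_diff, flags=re.MULTILINE))

def file_level_recall(model_patch: str, gold_patch: str) -> float:
    """Fraction of ground-truth files that the model's patch also touches."""
    gold = touched_files(gold_patch)
    if not gold:
        return 0.0
    model = touched_files(model_patch)
    return len(gold & model) / len(gold)
```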

Evaluation results show significant performance differences across programming languages and task types. For instance, agents achieved pass rates of up to 24.1% on Python tasks but only 4.7% on TypeScript. Task complexity also matters: modifications confined to a single function or class reached success rates of up to 40%, while performance dropped sharply on tasks requiring multi-file changes.

GitHub: https://github.com/amazon-science/SWE-PolyBench

Key Highlights:

🌟 AWS introduces SWE-PolyBench, a comprehensive evaluation framework for AI programming assistants.

🔧 The benchmark covers 21 GitHub repositories and supports four languages: Java, JavaScript, TypeScript, and Python.

📈 Evaluation reveals performance variations across languages and tasks, with Python tasks showing the highest success rate.