OpenAI's Latest Research Reveals: State-of-the-Art AI Still Struggles with Coding Problems

AIbase基地

Published in AI News · 5 min read · Feb 24, 2025

OpenAI researchers acknowledged in a newly published paper that even today's most advanced AI models still cannot match human programmers. OpenAI CEO Sam Altman had previously said that AI is expected to surpass "junior" software engineers by the end of this year, but the findings show that these models still face significant challenges.

In the study, the OpenAI team used a new benchmark called SWE-Lancer, built from more than 1,400 software engineering tasks drawn from the freelance platform Upwork. The benchmark was used to assess the coding abilities of three large language models (LLMs): OpenAI's o1 reasoning model, its flagship GPT-4o, and Anthropic's Claude 3.5 Sonnet.

The models were given two types of assignments: individual tasks, mainly centered on fixing bugs in code, and management tasks, which required the models to make higher-level decisions. The models had no internet access during testing, so they could not simply look up answers online.

The tasks the models took on were worth hundreds of thousands of dollars in total, yet the models could only fix superficial issues and struggled to identify deeper bugs and root causes in complex projects. This mirrors a common experience with AI tools: they quickly produce output that looks correct, but its shortcomings often show up under closer examination.

The paper notes that although all three LLMs handle tasks far faster than humans, they often fail to grasp the full breadth and context of an error, so their solutions are frequently inaccurate or incomplete. The researchers reported that Claude 3.5 Sonnet outperformed both OpenAI models and yielded higher returns, but its accuracy still fell short of a reliable level.
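To make the idea of "returns" on a priced task set concrete, here is a minimal sketch of how earnings might be tallied when each task carries a dollar value. It is not the scoring code from the paper; the `TaskResult` type, the `summarize` function, the task names, and the dollar amounts are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str       # hypothetical task name
    payout_usd: float  # price attached to the task (made-up values below)
    passed: bool       # whether the model's solution was accepted


def summarize(results):
    """Return earnings and pass rate for a list of TaskResult objects."""
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.passed)
    pass_rate = sum(r.passed for r in results) / len(results)
    return {
        "earned_usd": earned,
        "total_usd": total,
        "pass_rate": pass_rate,
        "earned_share": earned / total,
    }


# Illustrative data only: the model passes the two cheaper tasks
# and fails the two expensive ones.
results = [
    TaskResult("bugfix-small", 250.0, True),
    TaskResult("bugfix-deep", 1000.0, False),
    TaskResult("manager-choice", 500.0, True),
    TaskResult("feature-large", 2250.0, False),
]

summary = summarize(results)
print(summary)
```

In this made-up example the model passes half of the tasks but collects well under half of the available money, because its successes are concentrated in the cheaper tasks. That is one way a model can look capable on surface-level fixes while still falling short on the deeper, higher-value problems.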

The research indicates that while these advanced AI models can operate quickly on certain specific tasks, they still fall short in overall software engineering capabilities and are far from being able to replace human programmers. However, this has not deterred some companies from replacing human programmers with immature AI models.

Key Points:
🧑‍💻 OpenAI's research indicates that advanced AI models still lag behind human programmers in coding abilities.
🚫 The three AI models performed poorly in fixing coding errors and struggled with complex problems.
🔍 Despite their speed, the models lack a comprehensive understanding of the problems, so their solutions are often inaccurate or incomplete.

This article is from AIbase Daily
