Researchers at OpenAI recently acknowledged in a newly published paper that even today's most advanced AI models still cannot compete with human programmers. OpenAI CEO Sam Altman had previously said that AI is expected to surpass "junior" software engineers by the end of this year, but the findings show that these models still face significant challenges.
In the study, the OpenAI team used a new benchmark called SWE-Lancer, built from more than 1,400 software engineering tasks drawn from the freelance platform Upwork. The benchmark assessed the coding abilities of three large language models (LLMs): OpenAI's o1 reasoning model, its flagship GPT-4o, and Anthropic's Claude 3.5 Sonnet.
The models were given two types of assignments: individual tasks, mainly centered on fixing bugs in code, and management tasks, which required the models to make higher-level decisions. The models had no internet access during testing, so they could not simply look up answers online.
Although the tasks the models took on were collectively worth hundreds of thousands of dollars, the models could fix only superficial issues and struggled to identify deeper bugs and root causes in complex projects. This mirrors a common experience with AI tools: they can quickly generate information that looks correct, but the shortcomings show on closer examination.
The paper notes that although the three LLMs work far faster than humans, they often fail to grasp the breadth and context of a bug, which leads to solutions that are frequently inaccurate or incomplete. The researchers reported that Claude 3.5 Sonnet outperformed OpenAI's two models and yielded higher returns, but its accuracy still did not reach a reliable level.
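To make the difference between "returns" and accuracy concrete, here is a minimal sketch of how the two measures can diverge when each task carries a dollar value. It is not the paper's scoring code: the `TaskResult` type, the function names, and every payout figure below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    payout_usd: float  # price attached to the task (made-up values below)
    solved: bool       # whether the model's solution was accepted


def pass_rate(results):
    """Fraction of tasks solved, ignoring what each task is worth."""
    return sum(r.solved for r in results) / len(results)


def earnings(results):
    """Total payout of the solved tasks only."""
    return sum(r.payout_usd for r in results if r.solved)


# Hypothetical results for two imaginary models on the same four tasks.
model_a = [
    TaskResult(250, True),
    TaskResult(250, True),
    TaskResult(1000, False),
    TaskResult(500, False),
]
model_b = [
    TaskResult(250, False),
    TaskResult(250, False),
    TaskResult(1000, True),
    TaskResult(500, False),
]

for name, results in (("model_a", model_a), ("model_b", model_b)):
    print(name, pass_rate(results), earnings(results))
```

In this invented example, one model solves more tasks while the other earns more, because a single high-value task outweighs two cheap ones. That is why a model can post higher returns and still fall well short of reliable accuracy.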
The research indicates that while these advanced AI models can operate quickly on certain specific tasks, they still fall short in overall software engineering capabilities and are far from being able to replace human programmers. However, this has not deterred some companies from replacing human programmers with immature AI models.
Key Points:
🧑‍💻 OpenAI's research indicates that advanced AI models still lag behind human programmers in coding abilities.
🚫 The three AI models performed poorly in fixing coding errors and struggled with complex problems.
🔍 Despite their speed, the AI models lack comprehensive understanding, which leads to solutions that are not accurate enough.