OpenAI recently released a report assessing AI programming capabilities, gauging the current state of AI in software development against real-world freelance work worth a combined $1 million. The benchmark, named SWE-Lancer, draws on more than 1,400 real tasks posted on Upwork and evaluates AI performance in two areas: hands-on development work and project-management decisions (choosing among competing implementation proposals).

The results showed that the best-performing model, Claude 3.5 Sonnet, solved 26.2% of the coding tasks and made the correct call in 44.9% of the project-management decisions. While that still leaves a clear gap to human developers, the economic value of the work the model can complete is already substantial.

On the publicly released Diamond subset, the model completed development work worth $208,050 in freelance payouts; across the full task set, that figure rises to more than $400,000.
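
To make the pay-based metric concrete, here is a minimal sketch, assuming hypothetical task records and field names, of how earnings on such a benchmark could be tallied: a model only "banks" a task's real freelance payout when its solution passes that task's end-to-end tests.

```python
# Sketch of a payout-weighted scoring scheme. Task records, field names,
# and the example payouts below are illustrative, not taken from SWE-Lancer.

from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    payout_usd: float    # real freelance price attached to the task
    tests_passed: bool   # did the model's solution pass the end-to-end tests?

def total_earnings(results: list[TaskResult]) -> float:
    """Sum the payouts of only those tasks the model actually solved."""
    return sum(r.payout_usd for r in results if r.tests_passed)

def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved, ignoring how much each task pays."""
    return sum(r.tests_passed for r in results) / len(results)

if __name__ == "__main__":
    demo = [
        TaskResult("simple-bug-fix", 250.0, True),       # solved, payout counted
        TaskResult("video-playback-feature", 16000.0, False),  # failed, no payout
    ]
    print(f"earned ${total_earnings(demo):,.2f}, pass rate {pass_rate(demo):.0%}")
```

The same records yield both numbers quoted above: the plain pass rates and the dollar totals, which weight each task by what a real client paid for it.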


However, the research also exposed clear limits on complex development work. The models competently handle simple bug fixes, such as removing a redundant API call, but perform poorly on tasks that demand deeper understanding and an end-to-end solution, such as building a cross-platform video-playback feature. Notably, they often locate the problematic code yet fail to diagnose the root cause or deliver a complete fix.
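
For a sense of what such a "simple fix" looks like, here is an illustrative sketch (not taken from the benchmark; the endpoint and function names are hypothetical) of a redundant API call being collapsed into a single request:

```python
# Hypothetical example of a "redundant API call" bug and its one-line class
# of fix: the same endpoint is fetched twice for data that one request
# already returns.

import requests

API_URL = "https://example.com/api/user-profile"  # placeholder endpoint

def render_profile_buggy(user_id: str) -> str:
    # Bug: two identical network calls for the same profile data.
    name = requests.get(API_URL, params={"id": user_id}).json()["name"]
    email = requests.get(API_URL, params={"id": user_id}).json()["email"]
    return f"{name} <{email}>"

def render_profile_fixed(user_id: str) -> str:
    # Fix: fetch once and reuse the parsed response.
    profile = requests.get(API_URL, params={"id": user_id}).json()
    return f"{profile['name']} <{profile['email']}>"
```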

To spur further research, OpenAI has open-sourced the SWE-Lancer Diamond dataset and its evaluation tooling on GitHub, so that researchers can measure different models' coding performance against a standardized benchmark. That shared baseline should serve as a useful reference point for further improving AI programming capabilities.