OpenAI Introduces SWE-Lancer Benchmark for AI in Software Engineering

OpenAI has launched SWE-Lancer, a new benchmark that evaluates the coding performance of AI models on real-world freelance software engineering tasks from Upwork, with combined real-world payouts of $1 million. In a press release, OpenAI detailed that SWE-Lancer includes over 1,400 tasks, ranging from $50 bug fixes to $32,000 feature implementations.

The benchmark tests AI models not only on independent engineering tasks but also on managerial tasks, in which a model must choose between competing technical implementation proposals. Independent tasks are graded with end-to-end tests verified by experienced software engineers, while managerial decisions are assessed against the choices made by the originally hired engineering managers.
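Based on that description, the two grading modes can be sketched roughly as follows. This is an illustrative sketch only; the class and function names are hypothetical and not OpenAI's actual evaluation harness.

```python
# Hypothetical sketch of SWE-Lancer's two grading modes, inferred from the
# article's description. Names are illustrative, not OpenAI's real harness.
from dataclasses import dataclass
from typing import Callable


@dataclass
class IndependentTask:
    payout: float
    e2e_test: Callable[[str], bool]  # engineer-verified end-to-end test over the patched repo


@dataclass
class ManagerialTask:
    payout: float
    chosen_proposal: int  # proposal index picked by the originally hired manager


def grade_independent(task: IndependentTask, patch_dir: str) -> float:
    # All-or-nothing: the payout is earned only if the end-to-end test passes.
    return task.payout if task.e2e_test(patch_dir) else 0.0


def grade_managerial(task: ManagerialTask, model_choice: int) -> float:
    # Earned only when the model picks the same proposal as the original manager.
    return task.payout if model_choice == task.chosen_proposal else 0.0
```

Under this all-or-nothing scheme, a model's score is simply the sum of payouts from the tasks it fully solves.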

Despite the comprehensive nature of SWE-Lancer, current frontier models still cannot solve the majority of the tasks. The best-performing model, Anthropic's Claude 3.5 Sonnet, earned just over $400,000 of the possible $1 million across all tasks. OpenAI has open-sourced a subset of the benchmark, called SWE-Lancer Diamond, to encourage further research into AI's role in software development.
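The headline metric is dollars earned as a share of the total payout pool, which can be computed as below. The task names and payouts here are made-up placeholders, not actual SWE-Lancer data.

```python
# Illustrative computation of the earnings metric: total dollars earned
# across solved tasks, as a share of the full payout pool.
# Task names and amounts are hypothetical placeholders.
payouts = {"bug_fix_123": 50.0, "feature_456": 32000.0}
solved = {"bug_fix_123"}  # tasks whose graded output passed

earned = sum(p for task, p in payouts.items() if task in solved)
total = sum(payouts.values())
print(f"Earned ${earned:,.0f} of ${total:,.0f} ({earned / total:.1%})")
```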
