OpenAI Introduces SWE-Lancer Benchmark for AI in Software Engineering

OpenAI has launched SWE-Lancer, a new benchmark that evaluates the coding performance of AI models on real-world freelance software engineering tasks from Upwork, with combined real-world payouts of $1 million. In a press release, OpenAI detailed that SWE-Lancer includes over 1,400 tasks, ranging from $50 bug fixes to $32,000 feature implementations.

The benchmark tests AI models not only on independent engineering tasks but also on managerial tasks, in which a model must choose between competing technical implementation proposals. Independent tasks are graded with end-to-end tests verified by experienced software engineers, while managerial decisions are assessed against the choices made by the originally hired engineering managers.
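Based on that description, the two grading modes can be sketched roughly as follows. This is an illustrative sketch only; the class and function names are hypothetical and not OpenAI's actual evaluation harness.

```python
# Hypothetical sketch of SWE-Lancer's two grading modes, inferred from the
# article's description. Names are illustrative, not OpenAI's real harness.
from dataclasses import dataclass
from typing import Callable


@dataclass
class IndependentTask:
    payout: float
    e2e_test: Callable[[str], bool]  # engineer-verified end-to-end test over the patched repo


@dataclass
class ManagerialTask:
    payout: float
    chosen_proposal: int  # proposal index picked by the originally hired manager


def grade_independent(task: IndependentTask, patch_dir: str) -> float:
    # All-or-nothing: the payout is earned only if the end-to-end test passes.
    return task.payout if task.e2e_test(patch_dir) else 0.0


def grade_managerial(task: ManagerialTask, model_choice: int) -> float:
    # Earned only when the model picks the same proposal as the original manager.
    return task.payout if model_choice == task.chosen_proposal else 0.0
```

Under this all-or-nothing scheme, a model's score is simply the sum of payouts from the tasks it fully solves.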

Despite the comprehensive nature of SWE-Lancer, current frontier models still cannot solve the majority of the tasks. The best-performing model, Anthropic's Claude 3.5 Sonnet, earned just over $400,000 of the possible $1 million across all tasks. OpenAI has open-sourced a subset of the benchmark, called SWE-Lancer Diamond, to encourage further research into AI's role in software development.
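The headline metric is dollars earned as a share of the total payout pool, which can be computed as below. The task names and payouts here are made-up placeholders, not actual SWE-Lancer data.

```python
# Illustrative computation of the earnings metric: total dollars earned
# across solved tasks, as a share of the full payout pool.
# Task names and amounts are hypothetical placeholders.
payouts = {"bug_fix_123": 50.0, "feature_456": 32000.0}
solved = {"bug_fix_123"}  # tasks whose graded output passed

earned = sum(p for task, p in payouts.items() if task in solved)
total = sum(payouts.values())
print(f"Earned ${earned:,.0f} of ${total:,.0f} ({earned / total:.1%})")
```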
