OpenAI Introduces PaperBench for AI Research Replication

April 03, 2025

OpenAI has launched PaperBench, a benchmark designed to evaluate AI agents' ability to replicate state-of-the-art AI research. The benchmark involves replicating top ICML 2024 papers, including understanding, coding, and executing experiments.

PaperBench is part of OpenAI's Preparedness Framework. Agents are asked to replicate 20 ICML 2024 Spotlight and Oral papers from scratch, which means understanding each paper's contributions, developing a codebase, and successfully executing the experiments.

To ensure objective evaluation, PaperBench uses rubrics that break down each replication task into smaller, gradable sub-tasks. These rubrics are co-developed with the authors of each ICML paper to ensure accuracy and realism. An LLM-based judge has been developed to automatically grade replication attempts against these rubrics, and its performance is assessed through a separate benchmark for judges.
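As a rough illustration of how a rubric of gradable sub-tasks can roll up into a single replication score, here is a minimal Python sketch. It is not OpenAI's implementation: the `RubricNode` class, the `score` function, the per-node weights, the weighted-average aggregation, and the example requirements are all assumptions made for illustration. The pass/fail values stand in for verdicts a judge would supply.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One requirement in a hypothetical rubric tree (not PaperBench's real schema)."""
    name: str
    weight: float = 1.0                # assumed relative importance among siblings
    passed: Optional[bool] = None      # set on leaves only: the judge's verdict
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1].

    Leaves score 1.0 if passed, else 0.0. Parents take the
    weight-averaged score of their children (an assumed aggregation rule).
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    return sum(child.weight * score(child) for child in node.children) / total_weight


# Illustrative rubric; requirement names, weights, and verdicts are made up.
rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("data pipeline implemented", passed=True),
        RubricNode("training loop implemented", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=False),
])

replication_score = score(rubric)
print(f"Replication score: {replication_score:.1%}")
```

Under these assumptions, partial progress earns partial credit: an agent that completes some sub-tasks but not others receives a fractional score rather than an all-or-nothing grade.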

OpenAI tested several frontier models on the benchmark. The best-performing agent, Claude 3.5 Sonnet, achieved an average replication score of 21.0%. Even so, the models do not yet match human experts on this task. OpenAI has open-sourced the code to facilitate further research into AI engineering capabilities.

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework.

Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments. pic.twitter.com/CvYcDdk0nI
— OpenAI (@OpenAI) April 2, 2025

