OpenAI Introduces PaperBench for AI Research Replication

OpenAI has launched PaperBench, a benchmark that evaluates how well AI agents can replicate state-of-the-art AI research. Agents must replicate top ICML 2024 papers, which means understanding each paper, writing the code, and running the experiments.

OpenAI has launched PaperBench, a new benchmark that evaluates whether AI agents can replicate state-of-the-art AI research. The initiative is part of OpenAI's Preparedness Framework and requires agents to replicate 20 ICML 2024 Spotlight and Oral papers from scratch. For each paper, an agent must understand the paper's contributions, develop a codebase, and successfully execute the experiments.

To ensure objective evaluation, PaperBench uses rubrics that break down each replication task into smaller, gradable sub-tasks. These rubrics are co-developed with the authors of each ICML paper to ensure accuracy and realism. An LLM-based judge has been developed to automatically grade replication attempts against these rubrics, and its performance is assessed through a separate benchmark for judges.
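As a rough illustration of how a rubric made of gradable sub-tasks could be rolled up into a single replication score, here is a minimal Python sketch. It assumes a tree of weighted requirements in which each leaf is graded pass or fail and each parent takes the weighted average of its children. The class name, field names, weights, and example requirements are all hypothetical and are not taken from OpenAI's released code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One requirement in a hypothetical rubric tree."""
    name: str
    weight: float = 1.0                 # relative importance among siblings (assumed)
    passed: Optional[bool] = None       # set only on leaves, e.g. by a judge
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Return a score in [0, 1]: leaves are pass/fail, parents are weighted averages."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total_weight


# Illustrative rubric for one paper; the requirements are invented for this sketch.
rubric = RubricNode("replicate paper", children=[
    RubricNode("code development", weight=2.0, children=[
        RubricNode("training loop implemented", passed=True),
        RubricNode("baseline implemented", passed=False),
    ]),
    RubricNode("experiments executed", weight=1.0, passed=False),
    RubricNode("results match paper", weight=1.0, passed=False),
])

paper_score = score(rubric)
print(f"replication score: {paper_score:.1%}")
```

In this sketch, partial progress on one branch still earns partial credit at the root, which is what makes fine-grained sub-tasks useful for grading incomplete replication attempts.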

OpenAI evaluated several frontier models on the benchmark. The best-performing agent, Claude 3.5 Sonnet, achieved an average replication score of 21.0%, and human experts still outperform AI models on this task. OpenAI has open-sourced the code to facilitate further research into the AI engineering capabilities of AI agents.
