
Study Reveals OpenAI Models Memorize Copyrighted Content
OpenAI is facing new scrutiny after a study suggested its models have memorized copyrighted content. The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford, introduces a method for identifying training data memorized by models such as GPT-4 and GPT-3.5. The probe relies on 'high-surprisal' words, that is, words that are statistically unlikely given their surrounding context: these words are masked out of text snippets, and the model is asked to guess them. A model that reliably fills in the missing words has likely seen the passage during training, indicating potential memorization of copyrighted material.
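The article does not reproduce the researchers' code, but the core idea can be sketched briefly. Below is a minimal illustration in Python, using GPT-2 from Hugging Face transformers as a stand-in scoring model; the model choice, the word-level surprisal aggregation, and the single-word masking are all illustrative assumptions, not the study's exact setup.

```python
# Sketch of a "high-surprisal word" memorization probe. GPT-2 here is a
# stand-in scorer; the study's actual models, prompts, and thresholds differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_surprisals(text: str):
    """Return (word, surprisal) pairs, where surprisal is the negative
    log-probability of each word given the preceding context.
    (The first word is skipped: it has no preceding context to condition on.)"""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = torch.arange(input_ids.size(1) - 1)
    token_surprisal = -log_probs[positions, input_ids[0, 1:]]
    # Merge subword-token surprisals into whole-word surprisals; in GPT-2's
    # BPE vocabulary, a decoded token starting with a space begins a new word.
    words, current, total = [], "", 0.0
    for tok_id, s in zip(input_ids[0, 1:], token_surprisal):
        piece = tokenizer.decode(int(tok_id))
        if piece.startswith(" ") and current:
            words.append((current.strip(), total))
            current, total = "", 0.0
        current += piece
        total += s.item()
    if current:
        words.append((current.strip(), total))
    return words

def make_probe(text: str):
    """Mask the highest-surprisal word. If the model under test can
    reliably fill it back in, that hints the passage was memorized.
    (Masks the word's first occurrence only; a sketch-level simplification.)"""
    scored = word_surprisals(text)
    target, _ = max(scored, key=lambda pair: pair[1])
    return text.replace(target, "[MASK]", 1), target

snippet = "It was a bright cold day in April, and the clocks were striking thirteen."
masked, answer = make_probe(snippet)
print(masked)   # passage with its most surprising word hidden
print(answer)   # the word the probed model must guess
```

The intuition behind targeting high-surprisal words is that a model cannot recover them from context alone; consistently guessing them correctly is therefore a stronger signal of memorization than recovering common, predictable words would be.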
The study found that GPT-4 showed signs of memorizing portions of popular fiction books and New York Times articles. The finding adds to the legal challenges OpenAI already faces from authors and rights holders who accuse the company of using their works without permission. OpenAI has defended its practices under the fair use doctrine, but the study's results strengthen calls for greater transparency around AI training data.
The researchers emphasize that the ability to probe and audit large language models is essential to establishing their trustworthiness, and they argue for more transparency in the data used to train them. OpenAI, for its part, has advocated for looser restrictions on the use of copyrighted data in AI development.