
Study Reveals OpenAI Models Memorize Copyrighted Content
OpenAI is facing new scrutiny after a study suggested its models have memorized copyrighted content. The study, co-authored by researchers from the University of Washington, the University of Copenhagen, and Stanford, introduces a method for identifying training data memorized by models such as GPT-4 and GPT-3.5. The probe relies on 'high-surprisal' words, that is, words that are statistically unlikely given their surrounding context: these words are masked out of text snippets, and the model is asked to guess them. A model that reliably fills in the missing words has likely seen the passage during training, indicating potential memorization of copyrighted material.
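The article does not reproduce the researchers' code, but the core idea can be sketched briefly. Below is a minimal illustration in Python, using GPT-2 from Hugging Face transformers as a stand-in scoring model; the model choice, the word-level surprisal aggregation, and the single-word masking are all illustrative assumptions, not the study's exact setup.

```python
# Sketch of a "high-surprisal word" memorization probe. GPT-2 here is a
# stand-in scorer; the study's actual models, prompts, and thresholds differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def word_surprisals(text: str):
    """Return (word, surprisal) pairs, where surprisal is the negative
    log-probability of each word given the preceding context.
    (The first word is skipped: it has no preceding context to condition on.)"""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token given everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    positions = torch.arange(input_ids.size(1) - 1)
    token_surprisal = -log_probs[positions, input_ids[0, 1:]]
    # Merge subword-token surprisals into whole-word surprisals; in GPT-2's
    # BPE vocabulary, a decoded token starting with a space begins a new word.
    words, current, total = [], "", 0.0
    for tok_id, s in zip(input_ids[0, 1:], token_surprisal):
        piece = tokenizer.decode(int(tok_id))
        if piece.startswith(" ") and current:
            words.append((current.strip(), total))
            current, total = "", 0.0
        current += piece
        total += s.item()
    if current:
        words.append((current.strip(), total))
    return words

def make_probe(text: str):
    """Mask the highest-surprisal word. If the model under test can
    reliably fill it back in, that hints the passage was memorized.
    (Masks the word's first occurrence only; a sketch-level simplification.)"""
    scored = word_surprisals(text)
    target, _ = max(scored, key=lambda pair: pair[1])
    return text.replace(target, "[MASK]", 1), target

snippet = "It was a bright cold day in April, and the clocks were striking thirteen."
masked, answer = make_probe(snippet)
print(masked)   # passage with its most surprising word hidden
print(answer)   # the word the probed model must guess
```

The intuition behind targeting high-surprisal words is that a model cannot recover them from context alone; consistently guessing them correctly is therefore a stronger signal of memorization than recovering common, predictable words would be.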
The study found that GPT-4 showed signs of memorizing portions of popular fiction books and New York Times articles. The finding adds to the legal challenges OpenAI already faces from authors and rights holders who accuse the company of using their works without permission. OpenAI has defended its practices under the fair use doctrine, but the study's results strengthen calls for greater transparency around AI training data.
The researchers emphasize that the ability to probe and audit large language models is essential to establishing their trustworthiness, and they argue for more transparency in the data used to train them. OpenAI, for its part, has advocated for looser restrictions on the use of copyrighted data in AI development.