OpenAI Allegedly Used Paywalled O'Reilly Books for AI Training

A recent paper by the AI Disclosures Project suggests that OpenAI's GPT-4o model was trained on paywalled O'Reilly Media books without a licensing agreement.

OpenAI has been accused of training its AI models on copyrighted content without permission. A new paper from the AI Disclosures Project suggests that the company used paywalled books from O'Reilly Media to train its GPT-4o model. The paper, authored by Tim O'Reilly, Ilan Strauss, and Sruly Rosenblat, reports that GPT-4o shows stronger recognition of non-public O'Reilly book content than earlier models such as GPT-3.5 Turbo.

The researchers used a method known as DE-COP, which is designed to detect copyrighted content in a language model's training data. The results suggest that GPT-4o likely has prior knowledge of many non-public O'Reilly books published before its training cutoff date. The findings highlight the need for greater transparency about the data sources used to train AI models.
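For readers curious how a test of this kind can work, here is a minimal sketch in Python. It assumes the usual description of DE-COP: the model is shown a verbatim passage mixed in with paraphrased versions and asked to pick the original, and a hit rate well above chance is read as a sign of prior exposure. The function names `build_quiz` and `guess_rate` and the `choose` callback are hypothetical and stand in for a real model call. This is an illustration and not the paper's code.

```python
import random


def build_quiz(verbatim, paraphrases, rng):
    """Mix the verbatim passage in with its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)


def guess_rate(quizzes, choose):
    """Fraction of quizzes in which choose(options) picks the verbatim passage.

    `choose` is a hypothetical stand-in for asking a model which option
    is the original text; it must return an index into `options`.
    """
    hits = sum(1 for options, answer in quizzes if choose(options) == answer)
    return hits / len(quizzes)


# Toy data: one "book passage" and three paraphrases (all made up).
rng = random.Random(0)
passage = "The quick brown fox jumps over the lazy dog."
paraphrases = [
    "A fast brown fox leaps over a sleepy dog.",
    "Over the lazy dog jumps the quick brown fox.",
    "The speedy fox hops across the idle dog.",
]
quizzes = [build_quiz(passage, paraphrases, rng) for _ in range(20)]

# With four options, a model that has never seen the passage should land
# near a 25% hit rate; a much higher rate hints at prior exposure.
chance = 1 / (1 + len(paraphrases))
```

In a real study the `choose` function would query the model under test, and the hit rates would be compared across books published before and after the model's training cutoff.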

The AI Disclosures Project, co-founded by Tim O'Reilly and Ilan Strauss, aims to address the societal impacts of AI's commercialization by advocating for better corporate transparency. The paper's findings suggest that OpenAI, despite having some licensing agreements, may have used unlicensed paywalled content to enhance its AI models.
