Study Claims OpenAI May Have Used O'Reilly Paid Books to Train AI Models Without Authorization

AIbase基地

Published in AI News · 4 min read · Apr 2, 2025

A recent study has sparked controversy, alleging that OpenAI used copyrighted O'Reilly Media books to train its latest AI model without permission. The study was published by the AI Disclosures Project, a non-profit organization co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss.

AI models can be viewed as sophisticated prediction engines. Trained on massive datasets (including books, movies, and television shows), they learn patterns and use them to extrapolate from simple prompts. When an AI writes, say, an essay on Greek tragedy or generates a stylized image, it is drawing on patterns learned from that training data to approximate a plausible result, not creating something entirely new.

The way AI models are trained is evolving, with more AI labs, including OpenAI, using AI-generated data to address the dwindling supply of real-world data (primarily public web resources). However, the risks associated with relying solely on synthetic data mean many organizations still opt for real-world data for training.

The paper argues that OpenAI's GPT-4o model was likely trained on O'Reilly's paid books without a licensing agreement. According to the researchers, GPT-4o recognizes O'Reilly's paid book content significantly better than its predecessor, GPT-3.5 Turbo.

The researchers used a method called DE-COP, which is designed to detect copyrighted content in a language model's training data. They probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models with 13,962 paragraph excerpts drawn from 34 O'Reilly books to estimate the probability that each excerpt was present in the training data.
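To make the idea behind this kind of test concrete, here is a minimal Python sketch. It assumes, as a simplification, that a DE-COP-style probe is a multiple-choice quiz: the model sees one verbatim excerpt mixed in with paraphrases and must pick the verbatim one, and a hit rate well above chance hints at prior exposure. The names `recognition_rate`, `make_memorizer`, and the toy sentences are hypothetical illustrations, not the study's code or data, and the "model" here is a stand-in function rather than a real OpenAI API call.

```python
import random
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical interface: given the answer options, return the index the model picks.
Chooser = Callable[[Sequence[str]], int]
QuizItem = Tuple[str, List[str]]  # (verbatim excerpt, paraphrases of it)


def recognition_rate(items: Sequence[QuizItem], choose: Chooser, seed: int = 0) -> float:
    """Fraction of quizzes in which `choose` picks the verbatim excerpt."""
    if not items:
        raise ValueError("need at least one quiz item")
    rng = random.Random(seed)
    hits = 0
    for verbatim, paraphrases in items:
        options = [verbatim, *paraphrases]
        rng.shuffle(options)  # hide the position of the verbatim text
        if options[choose(options)] == verbatim:
            hits += 1
    return hits / len(items)


def make_memorizer(seen: Set[str]) -> Chooser:
    """Toy stand-in for a model: picks an option it has 'seen', else option 0."""
    def choose(options: Sequence[str]) -> int:
        for i, text in enumerate(options):
            if text in seen:
                return i
        return 0  # no memory of any option, so this is effectively a guess
    return choose


# Invented example passages (assumption: three paraphrases per excerpt).
items: List[QuizItem] = [
    ("The parser reads tokens one at a time.",
     ["Tokens are read one by one by the parser.",
      "One token at a time is consumed by the parser.",
      "The parser consumes its input token by token."]),
    ("Each worker pulls a job from the shared queue.",
     ["Jobs are pulled from the shared queue by each worker.",
      "Every worker takes a job off the common queue.",
      "The shared queue hands a job to each worker."]),
]

exposed = make_memorizer({verbatim for verbatim, _ in items})
unexposed = make_memorizer(set())

print("exposed model:  ", recognition_rate(items, exposed))
print("unexposed model:", recognition_rate(items, unexposed))
```

With four options per quiz, a model with no exposure should land near the 25% chance level over many items, so the signal of interest is the gap between a model's hit rate and that baseline. Even a large gap only suggests exposure; it does not show how the text reached the model.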

The results showed that GPT-4o recognized considerably more paid O'Reilly book content than older models such as GPT-3.5 Turbo, suggesting that the model may have been exposed to this non-public material during training.

However, the researchers acknowledge that this is not definitive proof. OpenAI might have obtained the content indirectly, for example from users copying and pasting book excerpts. Furthermore, the study did not evaluate OpenAI's latest models, so it remains possible that those models were not trained on O'Reilly's paid books.

While OpenAI pays for some of its training data and has agreements with news publishers and social networks, its data usage practices remain heavily scrutinized under current legal frameworks. The research adds to the pressure on OpenAI, which already faces numerous lawsuits over its use of training data.

Key takeaways:
📚 OpenAI is accused of using O'Reilly's paid books to train its AI models without authorization.
🔍 The study shows that GPT-4o exhibits significantly better recognition of O'Reilly books than earlier models.
⚖️ OpenAI faces multiple legal challenges regarding its training data usage.

Wikimedia Foundation Warns of Bandwidth Strain from AI Crawlers

The Wikimedia Foundation has warned of increasing bandwidth strain on its projects caused by AI-powered web crawlers. Representatives noted a 50% increase in bandwidth consumption for multimedia files since January 2024, largely attributed to automated programs harvesting content from Wikimedia's openly licensed image library for AI model training. Wikimedia Foundation staff members Birgit Mueller, Chris Danis, and...

OpenAI Urges UK to Develop Forward-Looking Copyright Policy to Boost AI Development

OpenAI submitted a consultation response to the UK Parliament's Science, Innovation and Technology Committee on AI and copyright, highlighting the importance of policies that foster innovation and aim to establish the UK as a European leader in AI. OpenAI expressed its eagerness to collaborate with the UK government, Parliament, and copyright holders to find solutions that balance the interests of all parties. OpenAI believes that while laws are national, technological advancements are borderless. To ensure the UK's competitiveness in AI, clear and innovation-friendly regulations are urgently needed.

ChatGPT Updates Image Generation Capabilities, Now Including Cursive Script

ChatGPT's recent image generation update has driven a significant surge in paying users, with a 20-million increase reported. The creative applications showcased demonstrate impressive advancements in ChatGPT4.0's capabilities, even addressing previously challenging aspects like Chinese character generation. Now, ChatGPT has further enhanced its 'Create image' function, moving beyond standard fonts to generate accurate cursive script.

Tinder Launches AI-Powered Flirting Game 'Game Game' in Partnership with OpenAI, Sparking Controversy

Tinder recently announced a partnership with OpenAI to launch an AI-powered flirting game called 'Game Game'. Utilizing OpenAI's voice models and GPT-4 reasoning model, the game encourages users to role-play in various hypothetical encounter scenarios and earn points based on their flirting skills. The company emphasizes that voice data collected in the game will not be used to train any new AI models. This follows the recent appointment of a former Zillow executive as CEO of Tinder's parent company, Match Group.

OpenAI Establishes New Committee to Build the Most Powerful Non-profit

As an established non-profit, OpenAI is committed to building the world's best-equipped non-profit organization, aiming to enhance human creativity through historic financial resources and powerful technology. Imagine a model where a charity's investment capacity grows as the value of its affiliated companies increases. In OpenAI's vision, philanthropy is not merely the flow of money, but a fundamental form of support. Leveraging technology developed by leading AI companies, non-profit organizations will be able to...

OpenAI's o3 Model Cost Correction: Per-Task Price May Reach $30,000

The Arc Prize Foundation, which maintains and manages the ARC-AGI competition, last week sharply revised its cost estimate for OpenAI's upcoming o3 reasoning model, from an initial estimate of $3,000 per ARC-AGI task to $30,000. The correction suggests that the operational costs of today's most complex AI models may be ten times higher than previously anticipated. While OpenAI has yet to announce an official pricing strategy for o3, or even officially release the model, the Arc Prize...

Hugging Face Adds Handy Feature: One-Click Check for Compatible Models

Hugging Face, a leading open-source AI community platform, has launched a highly anticipated new feature: users can quickly see which machine learning models their computer hardware can run via platform settings. Users simply add their hardware information, such as GPU model, to their Hugging Face profile settings (located at the top right corner: Profile Icon > Settings > Local Apps and Hardware).