A recent study has sparked controversy by alleging that OpenAI trained its GPT-4o model on copyrighted O'Reilly Media books without permission. The study was published by the AI Disclosures Project, a non-profit organization co-founded in 2024 by media mogul Tim O'Reilly and economist Ilan Strauss.
AI models can be viewed as sophisticated prediction engines. They learn patterns from massive datasets—including books, movies, and television shows—to extrapolate from simple prompts. When an AI writes, say, an essay on Greek tragedy or generates stylized images, it's retrieving information from a vast knowledge base, not creating something entirely new.
The way AI models are trained is evolving, with more AI labs, including OpenAI, using AI-generated data to address the dwindling supply of real-world data (primarily public web resources). However, the risks associated with relying solely on synthetic data mean many organizations still opt for real-world data for training.
The paper suggests that OpenAI's GPT-4o model was likely trained on O'Reilly's paid books without a licensing agreement. According to the research, GPT-4o recognizes O'Reilly's paid book content significantly better than its predecessor, GPT-3.5 Turbo.
The researchers used a method called DE-COP, which is designed to detect copyrighted content in language model training data. They probed GPT-4o, GPT-3.5 Turbo, and other OpenAI models with 13,962 paragraph excerpts from 34 O'Reilly books to estimate the probability that each excerpt was present in the training data.
The results showed that GPT-4o recognized considerably more paid O'Reilly book content than GPT-3.5 Turbo, suggesting that the model may have been exposed to this non-public material during training.
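To make the method concrete: DE-COP, as its authors describe it, quizzes a model with multiple-choice questions in which one option is a verbatim passage from a book and the others are paraphrases. A model that picks the verbatim text more often than chance may have seen it during training. The Python sketch below illustrates only that scoring idea under simplifying assumptions. The function names (build_quiz, guess_rate, make_memorizing_chooser) and the toy chooser standing in for a real model are hypothetical, and this is not the study's actual evaluation pipeline.

```python
import random


def build_quiz(verbatim, paraphrases, rng):
    """Shuffle one verbatim passage among its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    options = [verbatim] + list(paraphrases)
    rng.shuffle(options)
    return options, options.index(verbatim)


def guess_rate(passages, choose, seed=0):
    """Fraction of quizzes in which `choose` picks the verbatim passage.

    `passages` is a list of (verbatim, paraphrases) pairs.
    `choose` is a hypothetical stand-in for querying a model: it takes
    the list of options and returns the index it believes is verbatim.
    """
    rng = random.Random(seed)
    correct = 0
    for verbatim, paraphrases in passages:
        options, answer = build_quiz(verbatim, paraphrases, rng)
        if choose(options) == answer:
            correct += 1
    return correct / len(passages)


def make_memorizing_chooser(seen):
    """Toy 'model' that recognizes any passage contained in `seen`.

    If it recognizes none of the options, it falls back to option 0.
    """
    def choose(options):
        for index, option in enumerate(options):
            if option in seen:
                return index
        return 0
    return choose


if __name__ == "__main__":
    # Invented toy passages, not text from any real book.
    toy = [
        ("The cache is flushed on every write.",
         ["Each write flushes the cache.",
          "On every write the cache gets cleared.",
          "Writes always cause a cache flush."]),
        ("Indexes speed up reads but slow down writes.",
         ["Reads get faster with indexes, writes get slower.",
          "An index helps reading and hurts writing.",
          "Indexes trade write speed for read speed."]),
    ]
    exposed = make_memorizing_chooser({verbatim for verbatim, _ in toy})
    print(guess_rate(toy, exposed))  # 1.0: every verbatim passage is recognized
```

With four options per question, a model guessing at random would be expected to score near 0.25 over many passages. Scores well above that baseline are what a test of this kind treats as a sign of possible exposure, which is why the comparison between models matters more than any single score.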
However, the researchers acknowledge that this is not definitive proof. OpenAI might have acquired the content indirectly, for example through users copying and pasting book excerpts into ChatGPT. Furthermore, the study did not evaluate OpenAI's latest models, leaving open the possibility that they were not trained on O'Reilly's paid books.
While OpenAI pays for some of its training data and has agreements with news publishers and social networks, its data practices remain under heavy legal scrutiny. The study is likely to add to the pressure on the company, which already faces numerous lawsuits over how it uses training data.
Key takeaways:
📚 OpenAI is accused of using O'Reilly's paid books to train its AI models without authorization.
🔍 The study finds that GPT-4o recognizes paid O'Reilly book content significantly better than earlier models such as GPT-3.5 Turbo.
⚖️ OpenAI faces multiple legal challenges regarding its training data usage.