A new study reveals that AI models are gradually losing access to the data they were trained on. Conducted by the Data Provenance Initiative, it shows that the proportion of completely blocked content in AI training data rose from roughly 1% to between 5% and 7% from April 2023 to April 2024. If the trend continues, future AI models could end up learning from less diverse, more biased, and more outdated information.
The study analyzed the robots.txt files and terms of use of 14,000 web domains that serve as sources for popular AI training datasets such as C4, RefinedWeb, and Dolma.
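To make the robots.txt mechanism concrete, here is a minimal sketch of how a crawler restriction can be checked with Python's standard library. It is an illustration only, not the study's methodology: the robots.txt content, the `allowed_agents` helper, and the example.com URLs are hypothetical, and the user agents (GPTBot, CCBot, SomeOtherBot) are used here simply as sample crawler names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for this example only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /archive/

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each crawler user agent, whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


agents = ["GPTBot", "CCBot", "SomeOtherBot"]
news_access = allowed_agents(SAMPLE_ROBOTS_TXT, agents, "https://example.com/news/story.html")
archive_access = allowed_agents(SAMPLE_ROBOTS_TXT, agents, "https://example.com/archive/2020.html")

print(news_access)
print(archive_access)
```

In this sample, GPTBot is blocked from the whole site, CCBot is blocked only under /archive/, and any other crawler falls through to the wildcard rule. A "completely blocked" domain in the sense used above would correspond to a blanket `Disallow: /` for the relevant crawlers.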
The research found that news websites, forums, and social media platforms account for most of the restrictions on AI data access, with the blocking rate for news sites surging from 3% to 45% over the same period. As a result, the share of high-quality news content in AI training data may shrink, with lower-quality content from corporate and e-commerce sites potentially taking its place.
This presents a challenge for AI developers, as high-quality data is crucial for training superior models. However, providers of high-quality content may find new revenue streams by entering into licensing agreements with AI companies.
Meta CEO Mark Zuckerberg has stated that obtaining enough copyrighted data to train an excellent AI model is almost impossible or extremely expensive.
Unless courts rule that AI training qualifies as fair use, this situation is likely to keep escalating. OpenAI has recently struck deals worth millions of dollars with several publishers to access their content for real-time display and AI training. Other companies are expected to follow suit unless the legal landscape changes significantly.
Key Points:
🛑 Data access restrictions intensify: From April 2023 to April 2024, the proportion of blocked content in AI training data increased significantly, with the blocking rate for news sites rising from 3% to 45%.
📉 Decrease in high-quality data: The share of high-quality news content in AI training data may shrink, potentially being replaced by lower-quality corporate and e-commerce content.
💸 High costs and licensing issues: Obtaining sufficient data for AI training is costly, and companies such as OpenAI and Meta face challenges, while high-quality content providers may find new revenue streams through licensing agreements.