The Rapid Increase in the Number of Domain Names Blocking AI Models from Accessing Training Data

AIbase基地

Published in AI News · 4 min read · Jul 25, 2024


A new study reveals that AI models are gradually losing access to the data they were trained on. Conducted by the Data Provenance Initiative, it shows that the proportion of completely blocked content in AI training data increased from approximately 1% to 5-7% between April 2023 and April 2024. This trend could lead to future AI models learning from less diverse, more biased, and outdated information.


Image Source: Image generated by AI, authorized service provider Midjourney

The study analyzed the robots.txt files and terms of use of 14,000 web domains that serve as sources for popular AI training datasets such as C4, RefinedWeb, and Dolma.
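To make the mechanism concrete, here is a minimal sketch of how a robots.txt file can shut out an AI crawler, using Python's standard urllib.robotparser. The domain names and robots.txt contents are hypothetical and not taken from the study. The helper names is_fully_blocked and blocked_share are also illustrative, and counting domains is a simplification rather than the study's own methodology.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for illustration only (not study data).
SAMPLE_ROBOTS = {
    "news.example": """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""",
    "shop.example": """
User-agent: *
Disallow: /checkout/
""",
}


def is_fully_blocked(robots_txt: str, agent: str) -> bool:
    """Return True if `agent` is not allowed to fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")


def blocked_share(robots_by_domain: dict, agent: str) -> float:
    """Fraction of domains whose robots.txt blocks `agent` at the root.

    Assumption: this counts domains equally, whereas a study of training
    data may weight each domain by how much content it contributes.
    """
    blocked = sum(
        is_fully_blocked(text, agent) for text in robots_by_domain.values()
    )
    return blocked / len(robots_by_domain)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot"):
        share = blocked_share(SAMPLE_ROBOTS, agent)
        print(f"{agent}: {share:.0%} of sample domains block the site root")
```

In this toy sample, GPTBot is blocked on one of the two domains while CCBot is blocked on neither, since the only site-wide Disallow rule names GPTBot specifically. Terms-of-use restrictions, which the study also examined, cannot be detected this way.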

The research found that news websites, forums, and social media platforms account for most of the restrictions on AI data access, with the blocking rate for news sites surging from 3% to 45%. As a result, the share of high-quality news content in AI training data may shrink, potentially displaced by lower-quality content from corporate and e-commerce sites.

This presents a challenge for AI developers, as high-quality data is crucial for training superior models. However, providers of high-quality content may find new revenue streams by entering into licensing agreements with AI companies.

Meta CEO Mark Zuckerberg has stated that obtaining enough copyrighted data to train an excellent AI model is almost impossible or extremely expensive.

Without a fair use ruling, this situation may continue to escalate. OpenAI has recently struck deals worth millions of dollars with several publishers to access their content for real-time display and AI training. It is expected that other companies will follow suit unless there is a significant change in legal rulings.

Key Points:

🛑 Data access restrictions intensify: From 2023 to 2024, the proportion of blocked content in AI training data increased significantly, with the blocking rate for news sites rising from 3% to 45%.
📉 Decrease in high-quality data: The proportion of high-quality news content in AI training data is decreasing, potentially being replaced by lower-quality corporate and e-commerce content.
💸 High costs and licensing issues: Obtaining sufficient data for AI training is costly, with OpenAI and Meta facing challenges, while high-quality content providers may find new revenue streams through licensing agreements.


This article is from AIbase Daily



