In recent years, the rapid development of artificial intelligence (AI) technology has led to a significant increase in bandwidth pressure on Wikimedia projects due to web crawlers. Representatives from the Wikimedia Foundation have pointed out that since January 2024, bandwidth consumption for serving multimedia files has increased by 50%. This growth primarily stems from automated programs constantly scraping content from Wikimedia's openly licensed image library to train AI models.


In an open letter, Wikimedia Foundation staff members Birgit Mueller, Chris Danis, and Giuseppe Lavagetto stated that this bandwidth increase isn't driven by human users but by the relentless demands of automated bots. They emphasized: "Our infrastructure is designed to withstand bursts of traffic from human users during high-interest events, but the traffic generated by crawlers is unprecedented, posing increasing risks and costs to us."

According to Wikimedia's statistics, approximately 65% of the most expensive traffic is generated by crawlers, even though crawlers account for only 35% of page views. The mismatch comes from Wikimedia's caching layer: popular content is replicated to data centers around the world, so requests for it are cheap to serve, but crawlers sweep through pages regardless of popularity. Their requests for rarely viewed content miss those regional caches and must be served from the core data centers, consuming far more computing resources.
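The asymmetry described above can be sketched with a toy model. This is an illustration only, not Wikimedia's actual infrastructure: it assumes an edge cache holding the most popular 10% of pages, human traffic skewed toward popular pages, and a crawler that sweeps the whole corpus uniformly.

```python
import random

random.seed(0)
NUM_PAGES = 10_000
# Assume the edge cache holds only the top 10% most popular pages.
cached = set(range(NUM_PAGES // 10))

def hit_rate(requests):
    """Fraction of requests served from the edge cache."""
    return sum(1 for page in requests if page in cached) / len(requests)

# Humans: popularity-skewed access, heavily weighted toward top pages.
human = [min(int(random.expovariate(1 / 500)), NUM_PAGES - 1)
         for _ in range(100_000)]

# Crawler: uniform sweep over the whole corpus, ignoring popularity.
crawler = [random.randrange(NUM_PAGES) for _ in range(100_000)]

print(f"human cache hit rate:   {hit_rate(human):.0%}")
print(f"crawler cache hit rate: {hit_rate(crawler):.0%}")
```

Under these assumptions the human hit rate lands well above 80% while the crawler's hovers near the 10% cache coverage, so even a modest share of crawler page views translates into a disproportionate share of expensive origin traffic.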

Over the past year, excessive scraping by web crawlers has drawn complaints from several open-source projects: the Git hosting service Sourcehut, Diaspora developer Dennis Schubert, the repair guide site iFixit, and ReadTheDocs have all reported problems caused by the aggressive demands of AI crawlers.

The Wikimedia Foundation's 2025/2026 annual plan includes the goal of "reducing crawler-generated traffic," aiming to reduce the request rate by 20% and bandwidth usage by 30%. They hope to prioritize the user experience for human users and support Wikimedia projects and contributors.

While many websites recognize that providing bandwidth for crawlers is part of doing business, the proliferation of generative AI like ChatGPT has made crawler scraping increasingly aggressive, potentially threatening the existence of source websites. The Wikimedia Foundation acknowledges that while Wikipedia and Wikimedia Commons are crucial for training machine learning models, they must prioritize the needs of human users.

To address this challenge, tools have emerged to deter excessive scraping, such as the data-poisoning projects Glaze, Nightshade, and ArtShield, and the web tools Kudurru and Nepenthes. The long-standing robots.txt convention, however, is only partially effective at limiting these crawlers: compliance is voluntary, and some crawlers disguise themselves as other crawlers to circumvent blocks.
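For context, a robots.txt rule blocking AI crawlers looks like the sketch below. GPTBot (OpenAI) and CCBot (Common Crawl) are real user-agent tokens; the point the article makes is that rules like these are a request, not an enforcement mechanism.

```
# robots.txt — a polite request, not an enforcement mechanism.
# Crawlers that honor the convention will skip the site;
# crawlers that spoof their User-Agent simply ignore it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: normal access.
User-agent: *
Allow: /
```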

Key Points:

🌐 Crawlers have increased Wikimedia bandwidth consumption by 50%, primarily due to AI model content scraping.

🤖 Approximately 65% of high-cost content traffic is generated by crawlers, although crawlers only account for 35% of page views.

📉 The Wikimedia Foundation plans to reduce crawler-generated traffic in 2025/2026, prioritizing the needs of human users.