In recent years, the rapid development of artificial intelligence (AI) technology has led to a significant increase in bandwidth pressure on Wikimedia projects due to web crawlers. Representatives from the Wikimedia Foundation have pointed out that since January 2024, bandwidth consumption for serving multimedia files has increased by 50%. This growth primarily stems from automated programs constantly scraping content from Wikimedia's openly licensed image library to train AI models.


In an open letter, Wikimedia Foundation staff members Birgit Mueller, Chris Danis, and Giuseppe Lavagetto stated that this bandwidth increase isn't driven by human users but by the relentless demands of automated bots. They emphasized: "Our infrastructure is designed to withstand bursts of traffic from human users during high-interest events, but the traffic generated by crawlers is unprecedented, posing increasing risks and costs to us."

According to Wikimedia's statistics, approximately 65% of the most expensive traffic is generated by crawlers, even though crawlers account for only 35% of page views. The mismatch comes from Wikimedia's caching layer: popular content is replicated to data centers around the world, so requests for it are cheap to serve, but crawlers sweep through pages regardless of popularity. Their requests for rarely viewed content miss those regional caches and must be served from the core data centers, consuming far more computing resources.
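The asymmetry described above can be sketched with a toy model. This is an illustration only, not Wikimedia's actual infrastructure: it assumes an edge cache holding the most popular 10% of pages, human traffic skewed toward popular pages, and a crawler that sweeps the whole corpus uniformly.

```python
import random

random.seed(0)
NUM_PAGES = 10_000
# Assume the edge cache holds only the top 10% most popular pages.
cached = set(range(NUM_PAGES // 10))

def hit_rate(requests):
    """Fraction of requests served from the edge cache."""
    return sum(1 for page in requests if page in cached) / len(requests)

# Humans: popularity-skewed access, heavily weighted toward top pages.
human = [min(int(random.expovariate(1 / 500)), NUM_PAGES - 1)
         for _ in range(100_000)]

# Crawler: uniform sweep over the whole corpus, ignoring popularity.
crawler = [random.randrange(NUM_PAGES) for _ in range(100_000)]

print(f"human cache hit rate:   {hit_rate(human):.0%}")
print(f"crawler cache hit rate: {hit_rate(crawler):.0%}")
```

Under these assumptions the human hit rate lands well above 80% while the crawler's hovers near the 10% cache coverage, so even a modest share of crawler page views translates into a disproportionate share of expensive origin traffic.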

Over the past year, excessive scraping by web crawlers has drawn complaints from several open-source projects: the Git hosting service Sourcehut, Diaspora developer Dennis Schubert, the repair guide site iFixit, and ReadTheDocs have all reported problems caused by the aggressive demands of AI crawlers.

The Wikimedia Foundation's 2025/2026 annual plan includes the goal of "reducing crawler-generated traffic," aiming to reduce the request rate by 20% and bandwidth usage by 30%. They hope to prioritize the user experience for human users and support Wikimedia projects and contributors.

While many websites recognize that providing bandwidth for crawlers is part of doing business, the proliferation of generative AI like ChatGPT has made crawler scraping increasingly aggressive, potentially threatening the existence of source websites. The Wikimedia Foundation acknowledges that while Wikipedia and Wikimedia Commons are crucial for training machine learning models, they must prioritize the needs of human users.

To address this challenge, tools have emerged to deter excessive scraping, such as the data-poisoning projects Glaze, Nightshade, and ArtShield, and the web tools Kudurru and Nepenthes. The long-standing robots.txt convention, however, is only partially effective at limiting these crawlers: compliance is voluntary, and some crawlers disguise themselves as other crawlers to circumvent blocks.
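For context, a robots.txt rule blocking AI crawlers looks like the sketch below. GPTBot (OpenAI) and CCBot (Common Crawl) are real user-agent tokens; the point the article makes is that rules like these are a request, not an enforcement mechanism.

```
# robots.txt — a polite request, not an enforcement mechanism.
# Crawlers that honor the convention will skip the site;
# crawlers that spoof their User-Agent simply ignore it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: normal access.
User-agent: *
Allow: /
```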

Key Points:

🌐 Crawlers have increased Wikimedia bandwidth consumption by 50%, primarily due to AI model content scraping.

🤖 Approximately 65% of high-cost content traffic is generated by crawlers, although crawlers only account for 35% of page views.

📉 The Wikimedia Foundation plans to reduce crawler-generated traffic in 2025/2026, prioritizing the needs of human users.