In the data-driven era of AI, obtaining large volumes of data has become crucial for training powerful models. The methods of acquiring that data, however, have sparked controversy. Recently, the team behind Claude drew public outrage over its aggressive data-scraping practices.
The incident began when Claude's crawler hit a company's servers roughly one million times within 24 hours, scraping the site's content without permission or payment. The crawler not only ignored the website's explicit "no scraping" notice but also consumed significant server resources.
Despite its attempts to block the traffic, the affected company was ultimately unable to stop Claude's scraping. Its chief executive vented his anger on social media, condemning the Claude team's conduct, and many commenters echoed the criticism, with some describing the behavior outright as "theft."
The company involved is iFixit, an American e-commerce and how-to website that offers millions of pages of free online repair guides for consumer electronics and gadgets. iFixit found that Claude's crawler, ClaudeBot, fired off a huge number of requests in a short period, pulling 10TB of files in a single day and 73TB over the course of May.
Kyle Wiens, CEO of iFixit, said that ClaudeBot "stole" all of their data without permission and tied up the company's server resources. Although iFixit's website clearly prohibits unauthorized data scraping, the Claude team appeared to turn a blind eye.
The Claude team's behavior is not an isolated case. In April this year, the Linux Mint forum was also hammered by frequent ClaudeBot visits, which slowed the forum down and at times crashed it. Others have pointed out that, beyond Claude and OpenAI's GPT, many AI companies ignore websites' robots.txt directives and scrape data regardless; a typical opt-out robots.txt is sketched below.
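For context, a robots.txt file is simply a plain-text file at a site's root that asks crawlers to stay out of certain paths. A minimal example that attempts to opt out of these crawlers might look like the following; ClaudeBot and GPTBot are the user-agent tokens Anthropic and OpenAI publish for their crawlers, and compliance is entirely voluntary on the crawler's side, which is exactly the problem described here.

```
# Ask AI crawlers not to fetch any page on this site.
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
```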
In response, some have suggested that website owners plant fake content containing traceable or unique markers in their pages, so that any unauthorized scraping can later be detected (a minimal sketch of this "canary" approach follows below). iFixit has in fact taken this measure and found that its data had been scraped not only by Claude but also by OpenAI.
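As a rough illustration of that idea, the Python sketch below embeds a unique canary token into a page and later checks whether the token turns up in text obtained elsewhere, such as a scraped dump or a model's output. The token format, page template, and helper names are assumptions for the example, not iFixit's actual implementation.

```python
import secrets


def make_canary() -> str:
    """Generate a unique, hard-to-guess token to hide in a page."""
    return "canary-" + secrets.token_hex(16)


def inject_canary(page_html: str, canary: str) -> str:
    """Embed the canary in an HTML comment, invisible to ordinary readers."""
    return page_html + f"\n<!-- {canary} -->"


def contains_canary(text: str, canary: str) -> bool:
    """Check whether scraped data or model output contains the canary."""
    return canary in text


if __name__ == "__main__":
    canary = make_canary()
    page = inject_canary("<html><body>Repair guide</body></html>", canary)
    # If this exact token later appears in a crawler's dataset or a model's
    # output, the page was scraped despite the site's "no scraping" notice.
    print(contains_canary(page, canary))  # True for the page itself
```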
This incident has sparked broad discussion about AI companies' data-scraping practices. On one hand, AI development genuinely requires large amounts of data; on the other, scraping should respect the rights and rules set by website owners. Striking a balance between advancing the technology and protecting copyright is a question the entire industry needs to confront.