A recent survey has found that hundreds of websites trying to block content scraping by AI company Anthropic are inadvertently blocking the wrong bots because of outdated directives. The finding highlights how difficult it is for website owners to keep pace with a constantly evolving ecosystem of AI web crawlers.
According to the anonymous operator of the crawler-tracking site Dark Visitors, many websites are still blocking two agents that Anthropic no longer uses, "anthropic-ai" and "claude-web," while unknowingly letting the company's actual current crawler, "ClaudeBot," through. The problem arises largely because website owners copy and paste outdated directives into their robots.txt files while AI companies keep introducing crawlers under new names.
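For illustration, here is a minimal robots.txt sketch of the mismatch described above. The user-agent strings are the ones named in the report; the file layout itself is hypothetical. The first two rules target agents Anthropic has retired, so on their own they accomplish nothing, while the final rule is the one that actually opts out of the current crawler.

```
# Obsolete agents copied from older block lists; Anthropic no longer crawls under these names
User-agent: anthropic-ai
Disallow: /

User-agent: claude-web
Disallow: /

# The rule many sites are missing: Anthropic's current crawler
User-agent: ClaudeBot
Disallow: /
```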
The confusion is not limited to Anthropic. The operator of Dark Visitors notes that tech giants such as Apple and Meta have recently added new crawler agents of their own, making it nearly impossible for website owners to keep up manually. More concerning, some AI companies have been found scraping sites they should not, or ignoring robots.txt directives outright.
This situation creates problems in both directions. Some websites respond by blocking all crawlers, or allowing only a handful of specific ones, which can interfere with search engine indexing, internet archiving, and academic research. Others face technical and financial strain from heavy AI crawler traffic: the repair guide site iFixit reported that Anthropic's crawler hit its site nearly a million times in a single day, and another service provider, Read the Docs, said a crawler pulled 10 TB of files in a single day, driving up its bandwidth costs.
A study by the Data Provenance Initiative further documents the widespread confusion content creators and website owners face when trying to keep their work out of AI training data. It notes that the burden of blocking AI scrapers falls entirely on website owners, and that the growing, frequently changing roster of crawlers makes the task extremely difficult.
Faced with this complex situation, experts advise website administrators to preemptively block suspicious AI crawlers, even at the risk of listing agents that do not exist. Some also predict that more creators will move their content behind paywalls to prevent unrestricted scraping.