Recently, Oleksandr Tomchuk, the CEO of Triplegangers, received an alert that his company's e-commerce website was down. Upon investigation, he discovered that the culprit was a bot from OpenAI relentlessly trying to scrape his vast website, which features over 65,000 products, each with its own page and at least three photos. OpenAI's bot sent "tens of thousands" of server requests in an attempt to download all of it: hundreds of thousands of photos and their detailed descriptions.
Tomchuk said OpenAI's crawler was effectively a DDoS attack that was taking down his website. The company sells 3D object files and photos (ranging from hands to hair, skin, and full bodies) to 3D artists, video game developers, and anyone who needs to digitally recreate real human features.
The Triplegangers website is central to its business. The company spent over a decade building what is considered the largest database of "digital human avatars" on the web, made up of 3D image files scanned from real human models.
Tomchuk's team is based in Ukraine but is also licensed in Tampa, Florida, and its website's terms of service page prohibits bots from taking its images without permission. That alone proved useless, however: a site also needs a properly configured robots.txt file with tags specifically telling OpenAI's bot, GPTBot, to leave it alone.
Robots.txt, also known as the Robots Exclusion Protocol, was created to tell search engines what on a website should not be crawled when they index the web. OpenAI states on its information page that it respects such files when they are configured with its own set of crawl-disallowing tags, though it warns that its bots can take up to 24 hours to recognize changes to a robots.txt file.
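For illustration (syntax per OpenAI's published GPTBot documentation), a robots.txt file at the site root that blocks GPTBot from an entire site is just two lines:

    User-agent: GPTBot
    Disallow: /

Replacing the bare / with a path (for example, Disallow: /products/) would block only that directory; either way, per the warning above, the change can take up to 24 hours to take effect.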
As Tomchuk noted, if a website isn't using robots.txt correctly, OpenAI and other companies take that to mean they can scrape at will; it is an opt-out system, not an opt-in one.
Worse still, Triplegangers was knocked offline by OpenAI's bot during U.S. business hours, and Tomchuk expects a noticeably higher AWS bill from all the CPU time and downloads the bot's activity generated.
Robots.txt is not a foolproof solution either, since AI companies comply with it only voluntarily. Last summer, another AI startup, Perplexity, was called out by Wired in a widely publicized investigation that found evidence Perplexity was not honoring it.
Tomchuk said he has found no way to contact OpenAI to ask about the situation. OpenAI did not respond to TechCrunch's request for comment, and it has yet to deliver the opt-out tool it has long promised.
For Triplegangers, this is a particularly tricky issue. "In our line of business, the rights issues are quite serious because we are scanning real people," he said. Under laws like the European GDPR, "they cannot just take anyone's photo off the internet and use it."
Ironically, it was the voraciousness of OpenAI's bot that made Triplegangers realize how exposed it was: had the bot scraped more gently, Tomchuk says, he would never have known.
"It's frightening because these companies seem to exploit a loophole to scrape data, saying 'if you update your robot.txt with our tags, you can opt out,'" Tomchuk said, but this places the burden on business owners to understand how to block them.
He hopes other small online businesses learn that the only way to discover whether an AI bot is taking a website's copyrighted assets is to actively look for it. He is certainly not the only one being harassed by AI bots; other website owners recently told Business Insider how OpenAI's bots have crashed their sites and driven up their AWS bills.
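As a rough sketch of what actively looking can mean (the log path below is an assumption, the default location for nginx access logs, not something Triplegangers has described): crawler visits appear in ordinary web server logs under the bot's user-agent string, so a one-line search shows whether GPTBot has been hitting a site and from how many addresses:

    grep "GPTBot" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn

The pipeline keeps only the log lines mentioning GPTBot, extracts each request's client IP (the first field in the common log format), and counts requests per IP, sorted from busiest to quietest.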
And the problem only grew in 2024. Recent research from the digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in "general invalid traffic" in 2024, that is, traffic that does not come from a real user.