The top two AI startups in the world are not complying with requests from media publishers to stop scraping their web content for free model training data, according to sources familiar with the matter.
OpenAI and Anthropic have been found to either ignore or bypass the established web rule known as robots.txt, which is designed to prevent automated scraping of websites. This information comes from individuals with knowledge of TollBit’s analytics and others familiar with the situation.
TollBit is a startup that aims to facilitate paid licensing agreements between publishers and AI companies. In a letter sent to large publishers on Friday, TollBit said several AI companies were circumventing robots.txt. The letter, which was first reported by Reuters, did not name the companies.
Although OpenAI and Anthropic have publicly stated that they respect robots.txt rules and blocks on their crawlers, GPTBot and ClaudeBot, TollBit’s findings suggest otherwise. AI companies, including OpenAI and Anthropic, are reportedly bypassing robots.txt to scrape all of a website's content.
A spokeswoman for OpenAI declined to comment further, referring to a blog post from May. This post asserts that the company considers web crawler permissions when training new models. Anthropic did not respond to requests for comment.
The use of robots.txt dates back to the mid-1990s, when it was introduced as a way for websites to tell bot crawlers which of their data should not be scraped. Compliance is voluntary, but the file has been widely accepted as a key rule governing web behavior.
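For readers unfamiliar with the mechanism, the sketch below shows how a well-behaved crawler might consult robots.txt before fetching a page. It is illustrative only: the robots.txt contents, the example.com URL, the `ExampleBrowser` agent name and the `is_allowed` helper are all assumptions made for this example, not rules from any real publisher or code from any AI company. It relies on Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to block two AI crawlers
# while leaving the site open to every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "ExampleBrowser"):
        print(agent, is_allowed(agent, page))
```

Under these assumed rules, the check comes back negative for GPTBot and ClaudeBot and positive for the generic agent. Nothing in the file technically stops a crawler that skips this check, which is why the protocol depends on voluntary compliance.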
As generative AI becomes more prevalent, companies are racing to create powerful models that rely on high-quality data. This demand for training data has led some companies to disregard robots.txt and the informal agreements that support its use.
OpenAI is known for ChatGPT, a popular chatbot, and is backed by Microsoft. Anthropic, backed by Amazon, is responsible for another widely used chatbot called Claude. Both bots provide human-like responses to user queries, made possible by vast amounts of web-scraped data.
Several tech companies have argued to the US Copyright Office that web content should not be protected by copyright when used as AI training data. OpenAI has secured agreements with publishers like Axel Springer, the owner of Business Insider. The US Copyright Office is expected to update its guidance on AI and copyright in the near future.
If you have insights to share, contact Kali Hays at khays@businessinsider.com or on secure messaging app Signal at +1-949-280-0267. Use a non-work device to reach out.