OpenAI and Anthropic disregard an existing rule barring bots from scraping online content.

The top two AI startups in the world are not complying with requests from media publishers to stop scraping their web content for free model training data, according to sources familiar with the matter.

OpenAI and Anthropic have been found to either ignore or bypass the established web rule known as robots.txt, which is designed to prevent automated scraping of websites. This information comes from individuals with knowledge of TollBit’s analytics and others familiar with the situation.

TollBit is a startup that aims to facilitate paid licensing agreements between publishers and AI companies. In a letter sent to large publishers on Friday, TollBit highlighted the actions of several AI companies in this regard. The names of the accused AI companies were not disclosed in the letter, which was first reported by Reuters.

Although OpenAI and Anthropic have publicly stated that they respect robots.txt and blocks on their own web crawlers, GPTBot and ClaudeBot, TollBit’s findings suggest otherwise. AI companies, including OpenAI and Anthropic, are reportedly bypassing robots.txt to scrape all content from websites.

A spokeswoman for OpenAI declined to comment further, referring to a blog post from May. This post asserts that the company considers web crawler permissions when training new models. Anthropic did not respond to requests for comment.

The robots.txt convention dates back to the mid-1990s as a way for websites to tell bot crawlers that their data should not be scraped. Compliance is voluntary rather than technically enforced, but it has been widely accepted as one of the unofficial rules of the web.
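To make the mechanism concrete, here is a minimal sketch of how a compliant crawler could check a site's robots.txt before fetching a page, using Python's standard-library `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not taken from any real publisher or from TollBit's data. The check runs entirely on the crawler's side, so a bot that skips it is not stopped by the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (assumed for illustration):
# it blocks the two named AI crawlers and allows every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This only reports what the file asks for; it cannot enforce anything.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, is_allowed(agent, page))
```

A crawler that honors the file would skip the page whenever `is_allowed` returns False. The behavior described in this article amounts to fetching the page anyway, or fetching it under a user agent the file does not name.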

As generative AI becomes more prevalent, companies are racing to create powerful models that rely on high-quality data. This demand for training data has led to disregard for robots.txt and the informal agreements that support its use.

OpenAI is known for ChatGPT, a popular chatbot, and is backed by Microsoft. Anthropic, backed by Amazon, is responsible for another widely used chatbot called Claude. Both bots provide human-like responses to user queries, made possible by vast amounts of web-scraped data.

Several tech companies have argued to the US Copyright Office that web content should not be protected by copyright when used as AI training data. The Copyright Office is expected to update its guidance on AI and copyright in the near future. Separately, OpenAI has secured agreements with publishers including Axel Springer, the owner of Business Insider (BI).

If you have insights to share, contact Kali Hays at [email protected] or on secure messaging app Signal at +1-949-280-0267. Use a non-work device to reach out.
