How to Disallow GPTBot Crawling in robots.txt
To tell OpenAI’s web crawler to skip your site, add these lines to your robots.txt (see docs):
User-agent: GPTBot
Disallow: /
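If you want to sanity-check that the rule does what you expect, Python’s standard-library robotparser can read the file and report whether GPTBot would be allowed to fetch a given URL. A minimal sketch, with example.com standing in for your own domain:

```python
# Minimal sketch: confirm that robots.txt blocks GPTBot but not other crawlers.
# Replace example.com with your own domain.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Should print False once the Disallow rule above is live.
print(rp.can_fetch("GPTBot", "https://example.com/some-page"))

# Other crawlers stay unaffected unless you add rules for them.
print(rp.can_fetch("Googlebot", "https://example.com/some-page"))
```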
The outlook is bleak: if you don’t want to be scraped at all, ever, there’s no way to win this short of not putting things online. If they don’t scrape your content directly, they scrape the copycat sites that republish it, as Rik Schennink pointed out. Or they simply ignore the robots.txt rule. (How could you tell, anyway?)
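About the only signal you have is your own access logs. The sketch below, assuming a standard nginx/Apache “combined” log format with a hypothetical access.log path, lists which paths were requested by anything identifying itself as GPTBot; after the Disallow rule is live, any such hits on blocked paths suggest the rule is being ignored. A crawler that spoofs its user agent slips through this entirely.

```python
# Rough sketch: list paths requested by clients identifying as GPTBot.
# Assumes the "combined" access log format; LOG_PATH is a placeholder.
import re
from collections import Counter

LOG_PATH = "access.log"

# Matches the request line, e.g. "GET /some/path HTTP/1.1"
request_re = re.compile(r'"([A-Z]+) ([^ "]+) HTTP/[^"]*"')

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "GPTBot" not in line:
            continue
        match = request_re.search(line)
        if match:
            hits[match.group(2)] += 1  # count requests per path

for path, count in hits.most_common(10):
    print(f"{count:6d}  {path}")
```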
My stuff is CC-BY-SA licensed. I don’t want to feel overly protective of my content, and I generally feel best when I stay out of a scarcity mindset. If any of my wisdom makes its way into the LLM content regurgitation machine, so be it. I’m not holding my breath for OpenAI to figure out the “BY” part of my license any time soon, but I do hope they eventually offer source attribution for all of us.