---
title: "Blocking AI Crawlers"
date: 2024-04-12
lastmod: 2024-06-13T20:51:54Z
description: "Using Hugo to politely ask AI bots to not steal my content - and then configuring Cloudflare's WAF to actively block them, just to be sure."
featured: false
toc: true
categories: Backstage
tags:
  - cloud
  - cloudflare
  - hugo
  - meta
  - selfhosting
---

I've seen some recent posts from folks like Cory Dransfeldt and Ethan Marcotte about how (and why) to prevent your personal website from being slurped up by the crawlers that AI companies use to actively enshittify the internet. I figured it was past time for me to hop on board with this, so here we are.

My initial approach was to use Hugo's `robots.txt` templating to generate a `robots.txt` file based on a list of bad bots I got from the ai.robots.txt project on GitHub.

I dumped that list into my `config/params.toml` file, above any of the nested tables (TOML insists that top-level keys come before any `[table]` sections).

```toml
bad_robots = [
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot-Extended",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  "Claude-Web",
  "ClaudeBot",
  "cohere-ai",
  "DataForSeoBot",
  "Diffbot",
  "FacebookBot",
  "Google-Extended",
  "GPTBot",
  "ImagesiftBot",
  "magpie-crawler",
  "omgili",
  "Omgilibot",
  "peer39_crawler",
  "PerplexityBot",
  "YouBot"
]
```
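
If you want to keep that list in sync with the upstream project, something like this little script could regenerate the TOML block on demand. It's just a sketch rather than part of my actual workflow, and the raw-file URL is an assumption about where the current list is published:

```python
# A hypothetical helper, not part of the original setup: regenerate the
# bad_robots array from the upstream ai.robots.txt project. The raw URL
# below is an assumption about where the current list lives.
from urllib.request import urlopen

SOURCE = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"


def fetch_bot_names(url: str = SOURCE) -> list[str]:
    """Pull every User-agent value out of the upstream robots.txt."""
    with urlopen(url) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    names = []
    for line in lines:
        if line.lower().startswith("user-agent:"):
            name = line.split(":", 1)[1].strip()
            # skip the wildcard agent and any duplicates
            if name and name != "*" and name not in names:
                names.append(name)
    return names


def to_toml(names: list[str]) -> str:
    """Format the names as the bad_robots array used in config/params.toml."""
    body = ",\n".join(f'  "{name}"' for name in names)
    return f"bad_robots = [\n{body}\n]"


if __name__ == "__main__":
    print(to_toml(fetch_bot_names()))
```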

I then created a new template in `layouts/robots.txt`:

```text
Sitemap: {{ .Site.BaseURL }}/sitemap.xml

# hello robots [^_^]
# let's be friends <3

User-agent: *
Disallow:

# except for these bots which are not friends:
{{ range .Site.Params.bad_robots }}
User-agent: {{ . }}
{{- end }}
Disallow: /
```

And enabled the template processing for this in my `config/hugo.toml` file:

```toml
enableRobotsTXT = true
```

Now Hugo will generate the following `robots.txt` file for me:

```text
Sitemap: https://runtimeterror.dev/sitemap.xml

# hello robots [^_^]
# let's be friends <3

User-agent: *
Disallow:

# except for these bots which are not friends:

User-agent: AdsBot-Google
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: DataForSeoBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: magpie-crawler
User-agent: omgili
User-agent: Omgilibot
User-agent: peer39_crawler
User-agent: PerplexityBot
User-agent: YouBot
Disallow: /
```

Cool!
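
As a quick sanity check (just a sketch, nothing I strictly need), Python's standard-library `urllib.robotparser` can read the generated file and confirm the grouped rules actually turn those bots away. It assumes the site has already been built, so the output exists at `public/robots.txt`:

```python
# A hypothetical sanity check, not from the original post: parse the robots.txt
# that Hugo generated and confirm the grouped rules behave as intended.
# Assumes the site has been built, so the file exists at public/robots.txt.
from pathlib import Path
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(Path("public/robots.txt").read_text().splitlines())

page = "https://runtimeterror.dev/posts/blocking-ai-crawlers/"
for agent in ("GPTBot", "ClaudeBot", "PerplexityBot", "Mozilla/5.0"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "blocked"
    print(f"{agent:>15}: {verdict}")
# Expectation (an assumption, not a captured run): the three AI bots report
# "blocked" while the browser-style user agent reports "allowed".
```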

I also dropped the following into `static/ai.txt` for good measure:

```text
# Spawning AI
# Prevent datasets from using the following file types

User-Agent: *
Disallow: /
Disallow: *
```

That's all well and good, but these files carry all the weight and authority of a "No Soliciting" sign. Do I really trust these bots to honor it?

I'm hosting this site on Neocities, and Neocities unfortunately (though perhaps wisely) doesn't give me control of the web server there. But the site is fronted by Cloudflare, and that does give me a lot of options for blocking stuff I don't want.

So I added a WAF Custom Rule to block those unwanted bots. (I could have used Cloudflare's User Agent Blocking rules to accomplish the same thing, but the free tier only allows 10 of those; a single WAF Custom Rule can hold all of the user agents at once.)

Here's the expression I'm using:

```text
(http.user_agent contains "AdsBot-Google") or (http.user_agent contains "Amazonbot") or (http.user_agent contains "anthropic-ai") or (http.user_agent contains "Applebot-Extended") or (http.user_agent contains "AwarioRssBot") or (http.user_agent contains "AwarioSmartBot") or (http.user_agent contains "Bytespider") or (http.user_agent contains "CCBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Claude-Web") or (http.user_agent contains "cohere-ai") or (http.user_agent contains "DataForSeoBot") or (http.user_agent contains "FacebookBot") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "GoogleOther") or (http.user_agent contains "GPTBot") or (http.user_agent contains "ImagesiftBot") or (http.user_agent contains "magpie-crawler") or (http.user_agent contains "Meltwater") or (http.user_agent contains "omgili") or (http.user_agent contains "omgilibot") or (http.user_agent contains "peer39_crawler") or (http.user_agent contains "peer39_crawler/1.0") or (http.user_agent contains "PerplexityBot") or (http.user_agent contains "Seekr") or (http.user_agent contains "YouBot")
```
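
That expression is tedious to build and update by hand, so here's a rough sketch of how it could be generated from the same bot list that feeds `robots.txt`. The `BOTS` list below is a stand-in for the full `bad_robots` array rather than something read out of `params.toml`:

```python
# A hypothetical generator, not part of the original post: build the Cloudflare
# WAF expression from the same bot list that feeds robots.txt, so the two stay
# in sync. BOTS is a stand-in here rather than something read from params.toml.
BOTS = [
    "AdsBot-Google", "Amazonbot", "anthropic-ai", "Applebot-Extended",
    "Bytespider", "CCBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "cohere-ai", "GPTBot", "PerplexityBot", "YouBot",  # ...and the rest
]


def waf_expression(bots: list[str]) -> str:
    """Emit one '(http.user_agent contains "...")' clause per bot, joined with 'or'."""
    return " or ".join(f'(http.user_agent contains "{bot}")' for bot in bots)


if __name__ == "__main__":
    print(waf_expression(BOTS))
```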

Creating a custom WAF rule in Cloudflare's web UI

And checking on that rule ~24 hours later, I can see that it's doing some good:

It's blocked 102 bot hits already
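
If you'd rather not wait a day on the dashboard, a quick spot-check along these lines can confirm the rule fires. It's only a sketch, and it assumes Cloudflare's Block action answers with an HTTP 403 while a browser-style user agent still gets a 200:

```python
# A hypothetical spot-check, not from the original post: request the site with a
# blocked user agent and make sure Cloudflare turns it away. Assumes the WAF
# "Block" action returns HTTP 403 and a browser-style user agent gets a 200.
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def status_for(url: str, user_agent: str) -> int:
    """Return the HTTP status code seen when fetching url as user_agent."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request) as response:
            return response.status
    except HTTPError as err:
        return err.code


if __name__ == "__main__":
    site = "https://runtimeterror.dev/"
    print("GPTBot      :", status_for(site, "GPTBot"))       # expecting 403
    print("Mozilla/5.0 :", status_for(site, "Mozilla/5.0"))  # expecting 200
```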

See ya, AI bots!