---
title: "Blocking AI Crawlers"
date: 2024-04-12
lastmod: 2024-06-13T20:51:54Z
description: "Using Hugo to politely ask AI bots to not steal my content - and then configuring Cloudflare's WAF to actively block them, just to be sure."
featured: false
toc: true
categories: Backstage
tags:
  - cloud
  - cloudflare
  - hugo
  - meta
  - selfhosting
---

I've seen some recent posts from folks like Cory Dransfeldt and Ethan Marcotte about how (and why) to prevent your personal website from being slurped up by the crawlers that AI companies use to actively enshittify the internet. I figured it was past time for me to hop on board with this, so here we are.

My initial approach was to use Hugo's `robots.txt` templating to generate a `robots.txt` file based on a list of bad bots I got from the ai.robots.txt project on GitHub.

I dumped that list into my `config/params.toml` file, above any of the nested tables (TOML insists that top-level keys come before any `[table]` sections).

```toml
bad_robots = [
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot-Extended",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  "Claude-Web",
  "ClaudeBot",
  "cohere-ai",
  "DataForSeoBot",
  "Diffbot",
  "FacebookBot",
  "Google-Extended",
  "GPTBot",
  "ImagesiftBot",
  "magpie-crawler",
  "omgili",
  "Omgilibot",
  "peer39_crawler",
  "PerplexityBot",
  "YouBot"
]
```
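
If you want to keep that list in sync with the upstream project, something like this little script could regenerate the TOML block on demand. It's just a sketch rather than part of my actual workflow, and the raw-file URL is an assumption about where the current list is published:

```python
# A hypothetical helper, not part of the original setup: regenerate the
# bad_robots array from the upstream ai.robots.txt project. The raw URL
# below is an assumption about where the current list lives.
from urllib.request import urlopen

SOURCE = "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt"


def fetch_bot_names(url: str = SOURCE) -> list[str]:
    """Pull every User-agent value out of the upstream robots.txt."""
    with urlopen(url) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    names = []
    for line in lines:
        if line.lower().startswith("user-agent:"):
            name = line.split(":", 1)[1].strip()
            # skip the wildcard agent and any duplicates
            if name and name != "*" and name not in names:
                names.append(name)
    return names


def to_toml(names: list[str]) -> str:
    """Format the names as the bad_robots array used in config/params.toml."""
    body = ",\n".join(f'  "{name}"' for name in names)
    return f"bad_robots = [\n{body}\n]"


if __name__ == "__main__":
    print(to_toml(fetch_bot_names()))
```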

I then created a new template in `layouts/robots.txt`:

```text
Sitemap: {{ .Site.BaseURL }}/sitemap.xml

# hello robots [^_^]
# let's be friends <3

User-agent: *
Disallow:

# except for these bots which are not friends:
{{ range .Site.Params.bad_robots }}
User-agent: {{ . }}
{{- end }}
Disallow: /
```

And enabled the template processing for this in my `config/hugo.toml` file:

```toml
enableRobotsTXT = true
```

Now Hugo will generate the following `robots.txt` file for me:

```text
Sitemap: https://runtimeterror.dev/sitemap.xml

# hello robots [^_^]
# let's be friends <3

User-agent: *
Disallow:

# except for these bots which are not friends:

User-agent: AdsBot-Google
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: DataForSeoBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: magpie-crawler
User-agent: omgili
User-agent: Omgilibot
User-agent: peer39_crawler
User-agent: PerplexityBot
User-agent: YouBot
Disallow: /
```

Cool!
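
As a quick sanity check (just a sketch, nothing I strictly need), Python's standard-library `urllib.robotparser` can read the generated file and confirm the grouped rules actually turn those bots away. It assumes the site has already been built, so the output exists at `public/robots.txt`:

```python
# A hypothetical sanity check, not from the original post: parse the robots.txt
# that Hugo generated and confirm the grouped rules behave as intended.
# Assumes the site has been built, so the file exists at public/robots.txt.
from pathlib import Path
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(Path("public/robots.txt").read_text().splitlines())

page = "https://runtimeterror.dev/posts/blocking-ai-crawlers/"
for agent in ("GPTBot", "ClaudeBot", "PerplexityBot", "Mozilla/5.0"):
    verdict = "allowed" if parser.can_fetch(agent, page) else "blocked"
    print(f"{agent:>15}: {verdict}")
# Expectation (an assumption, not a captured run): the three AI bots report
# "blocked" while the browser-style user agent reports "allowed".
```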

I also dropped the following into `static/ai.txt` for good measure:

```text
# Spawning AI
# Prevent datasets from using the following file types

User-Agent: *
Disallow: /
Disallow: *
```

That's all well and good, but these files carry all the weight and authority of a "No Soliciting" sign. Do I really trust these bots to honor it?

I'm hosting this site on Neocities, and Neocities unfortunately (though perhaps wisely) doesn't give me control of the web server there. But the site is fronted by Cloudflare, and that does give me a lot of options for blocking stuff I don't want.

So I added a WAF Custom Rule to block those unwanted bots. (I could have used Cloudflare's User Agent Blocking rules to accomplish the same thing, but the free tier only allows 10 of those; a single WAF Custom Rule can hold all of the user agents at once.)

Here's the expression I'm using:

```text
(http.user_agent contains "AdsBot-Google") or (http.user_agent contains "Amazonbot") or (http.user_agent contains "anthropic-ai") or (http.user_agent contains "Applebot-Extended") or (http.user_agent contains "AwarioRssBot") or (http.user_agent contains "AwarioSmartBot") or (http.user_agent contains "Bytespider") or (http.user_agent contains "CCBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Claude-Web") or (http.user_agent contains "cohere-ai") or (http.user_agent contains "DataForSeoBot") or (http.user_agent contains "FacebookBot") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "GoogleOther") or (http.user_agent contains "GPTBot") or (http.user_agent contains "ImagesiftBot") or (http.user_agent contains "magpie-crawler") or (http.user_agent contains "Meltwater") or (http.user_agent contains "omgili") or (http.user_agent contains "omgilibot") or (http.user_agent contains "peer39_crawler") or (http.user_agent contains "peer39_crawler/1.0") or (http.user_agent contains "PerplexityBot") or (http.user_agent contains "Seekr") or (http.user_agent contains "YouBot")
```
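
That expression is tedious to build and update by hand, so here's a rough sketch of how it could be generated from the same bot list that feeds `robots.txt`. The `BOTS` list below is a stand-in for the full `bad_robots` array rather than something read out of `params.toml`:

```python
# A hypothetical generator, not part of the original post: build the Cloudflare
# WAF expression from the same bot list that feeds robots.txt, so the two stay
# in sync. BOTS is a stand-in here rather than something read from params.toml.
BOTS = [
    "AdsBot-Google", "Amazonbot", "anthropic-ai", "Applebot-Extended",
    "Bytespider", "CCBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "cohere-ai", "GPTBot", "PerplexityBot", "YouBot",  # ...and the rest
]


def waf_expression(bots: list[str]) -> str:
    """Emit one '(http.user_agent contains "...")' clause per bot, joined with 'or'."""
    return " or ".join(f'(http.user_agent contains "{bot}")' for bot in bots)


if __name__ == "__main__":
    print(waf_expression(BOTS))
```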

Creating a custom WAF rule in Cloudflare's web UI

And checking on that rule ~24 hours later, I can see that it's doing some good:

It's blocked 102 bot hits already
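
If you'd rather not wait a day on the dashboard, a quick spot-check along these lines can confirm the rule fires. It's only a sketch, and it assumes Cloudflare's Block action answers with an HTTP 403 while a browser-style user agent still gets a 200:

```python
# A hypothetical spot-check, not from the original post: request the site with a
# blocked user agent and make sure Cloudflare turns it away. Assumes the WAF
# "Block" action returns HTTP 403 and a browser-style user agent gets a 200.
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def status_for(url: str, user_agent: str) -> int:
    """Return the HTTP status code seen when fetching url as user_agent."""
    request = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(request) as response:
            return response.status
    except HTTPError as err:
        return err.code


if __name__ == "__main__":
    site = "https://runtimeterror.dev/"
    print("GPTBot      :", status_for(site, "GPTBot"))       # expecting 403
    print("Mozilla/5.0 :", status_for(site, "Mozilla/5.0"))  # expecting 200
```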

See ya, AI bots!