update post

update params, archetypes, tags
new post
2024-11-23 07:22:18 +00:00 · 2024-04-12 18:27:14 -05:00 · 2024-04-12 17:12:10 -05:00 · 2024-04-12 17:09:05 -05:00 · 2024-04-12 15:42:22 -05:00 · 2024-04-12 15:29:07 -05:00
8 changed files with 179 additions and 0 deletions
--- a/archetypes/default.md
+++ b/archetypes/default.md
@@ -12,6 +12,7 @@ tags:
  - android
  - caddy
  - chromeos
  - cloudflare
  - crostini
  - docker
  - gcp
--- a/config/_default/hugo.toml
+++ b/config/_default/hugo.toml
@@ -6,6 +6,7 @@ paginate = 10
 languageCode = "en"
 DefaultContentLanguage = "en"
 enableInlineShortcodes = true
 enableRobotsTXT = true
 # define gemini media type
 [mediaTypes]
--- a/config/_default/params.toml
+++ b/config/_default/params.toml
@@ -8,6 +8,34 @@ numberOfRelatedPosts = 5
 indexTitle = ".-. ..- -. - .. -- . - . .-. .-. --- .-."
 robots = [
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  "Claude-Web",
  "ClaudeBot",
  "cohere-ai",
  "DataForSeoBot",
  "Diffbot",
  "FacebookBot",
  "Google-Extended",
  "GPTBot",
  "ImagesiftBot",
  "magpie-crawler",
  "omgili",
  "Omgilibot",
  "peer39_crawler",
  "PerplexityBot",
  "YouBot"
 ]
 # Comments
 comments = true
 giscusCategory = "Announcements"
--- a/content/posts/blocking-ai-crawlers/cloudflare-waf-rule.png
+++ b/content/posts/blocking-ai-crawlers/cloudflare-waf-rule.png
--- a/content/posts/blocking-ai-crawlers/index.md
+++ b/content/posts/blocking-ai-crawlers/index.md
@@ -0,0 +1,134 @@
 ---
 title: "Blocking AI Crawlers"
 date: 2024-04-12
 # lastmod: 2024-04-12
 description: "Using Hugo to politely ask AI bots to not steal my content - and then configuring Cloudflare's WAF to actively block them, just to be sure."
 featured: false
 toc: true
 comments: true
 categories: Backstage
 tags:
  - cloud
  - cloudflare
  - hugo
  - meta
  - selfhosting
 ---
 I've seen some recent posts from folks like [Cory Dransfeldt](https://coryd.dev/posts/2024/go-ahead-and-block-ai-web-crawlers/) and [Ethan Marcotte](https://ethanmarcotte.com/wrote/blockin-bots/) about how (and *why*) to prevent your personal website from being slurped up by the crawlers that AI companies use to [actively enshittify the internet](https://boehs.org/node/llms-destroying-internet). I figured it was past time for me to hop on board with this, so here we are.
 My initial approach was to use [Hugo's robots.txt templating](https://gohugo.io/templates/robots/) to generate a `robots.txt` file based on a list of bad bots I got from [ai.robots.txt on GitHub](https://github.com/ai-robots-txt/ai.robots.txt).
 I dumped that list into my `config/_default/params.toml` file, *above* any of the nested tables (TOML is picky about that: any key placed after a `[table]` header belongs to that table).
 ```toml
 robots = [
  "AdsBot-Google",
  "Amazonbot",
  "anthropic-ai",
  "Applebot",
  "AwarioRssBot",
  "AwarioSmartBot",
  "Bytespider",
  "CCBot",
  "ChatGPT",
  "ChatGPT-User",
  "Claude-Web",
  "ClaudeBot",
  "cohere-ai",
  "DataForSeoBot",
  "Diffbot",
  "FacebookBot",
  "Google-Extended",
  "GPTBot",
  "ImagesiftBot",
  "magpie-crawler",
  "omgili",
  "Omgilibot",
  "peer39_crawler",
  "PerplexityBot",
  "YouBot"
 ]
 [author]
 name = "John Bowdre"
 ```
 I then created a new template in `layouts/robots.txt`:
 ```text
 Sitemap: {{ .Site.BaseURL }}/sitemap.xml
 User-agent: *
 Disallow:
 {{ range .Site.Params.robots }}
 User-agent: {{ . }}
 {{- end }}
 Disallow: /
 ```
 And enabled the template processing for this in my `config/_default/hugo.toml` file:
 ```toml
 enableRobotsTXT = true
 ```
 Now Hugo will generate the following `robots.txt` file for me:
 ```text
 Sitemap: https://runtimeterror.dev//sitemap.xml
 User-agent: *
 Disallow:
 User-agent: AdsBot-Google
 User-agent: Amazonbot
 User-agent: anthropic-ai
 User-agent: Applebot
 User-agent: AwarioRssBot
 User-agent: AwarioSmartBot
 User-agent: Bytespider
 User-agent: CCBot
 User-agent: ChatGPT
 User-agent: ChatGPT-User
 User-agent: Claude-Web
 User-agent: ClaudeBot
 User-agent: cohere-ai
 User-agent: DataForSeoBot
 User-agent: Diffbot
 User-agent: FacebookBot
 User-agent: Google-Extended
 User-agent: GPTBot
 User-agent: ImagesiftBot
 User-agent: magpie-crawler
 User-agent: omgili
 User-agent: Omgilibot
 User-agent: peer39_crawler
 User-agent: PerplexityBot
 User-agent: YouBot
 Disallow: /
 ```
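A quick way to sanity-check a file shaped like this is to feed it to a robots.txt parser and ask what a given user agent may fetch. The sketch below uses Python's standard-library `urllib.robotparser`. The shortened bot list, the `is_allowed` helper, and the test URL are illustrative assumptions, and real crawlers may match user agents differently than this parser does:

```python
from urllib.robotparser import RobotFileParser

# Shortened, illustrative version of the generated file (three bots only).
ROBOTS_TXT = """\
Sitemap: https://runtimeterror.dev/sitemap.xml

User-agent: *
Disallow:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str = "https://runtimeterror.dev/posts/") -> bool:
    """Return True if the rules above let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "ClaudeBot", "Mozilla/5.0"):
    print(agent, is_allowed(agent))
```

The empty `Disallow:` under `User-agent: *` permits everything for agents that are not named, while the stacked `User-agent` lines share the single `Disallow: /` that follows them.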
 Cool!
 I also dropped the following into `static/ai.txt` for [good measure](https://site.spawning.ai/spawning-ai-txt):
 ```text
 # Spawning AI
 # Prevent datasets from using the following file types
 User-Agent: *
 Disallow: /
 Disallow: *
 ```
 That's all well and good, but these files carry all the weight of a "No Soliciting" sign. Do I *really* trust these bots to honor it?
 I'm hosting this site [on Neocities](/deploy-hugo-neocities-github-actions/), but it's fronted by Cloudflare. So I added a [WAF Custom Rule](https://developers.cloudflare.com/waf/custom-rules/) to block those unwanted bots. Here's the expression I'm using:
 ```text
 (http.user_agent contains "AdsBot-Google") or (http.user_agent contains "Amazonbot") or (http.user_agent contains "anthropic-ai") or (http.user_agent contains "Applebot") or (http.user_agent contains "AwarioRssBot") or (http.user_agent contains "AwarioSmartBot") or (http.user_agent contains "Bytespider") or (http.user_agent contains "CCBot") or (http.user_agent contains "ChatGPT-User") or (http.user_agent contains "ClaudeBot") or (http.user_agent contains "Claude-Web") or (http.user_agent contains "cohere-ai") or (http.user_agent contains "DataForSeoBot") or (http.user_agent contains "FacebookBot") or (http.user_agent contains "Google-Extended") or (http.user_agent contains "GoogleOther") or (http.user_agent contains "GPTBot") or (http.user_agent contains "ImagesiftBot") or (http.user_agent contains "magpie-crawler") or (http.user_agent contains "Meltwater") or (http.user_agent contains "omgili") or (http.user_agent contains "omgilibot") or (http.user_agent contains "peer39_crawler") or (http.user_agent contains "peer39_crawler/1.0") or (http.user_agent contains "PerplexityBot") or (http.user_agent contains "Seekr") or (http.user_agent contains "YouBot")
 ```
 ![Creating a custom WAF rule in Cloudflare's web UI](cloudflare-waf-rule.png)
 I'll probably streamline this in the future by managing it with a GitHub Actions workflow, but this will do for now.
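As a starting point for that kind of automation, the WAF expression could be generated from the same list of bot names rather than typed by hand. This is only a sketch: `build_waf_expression` and `SAMPLE_BOTS` are hypothetical names, pushing the result to Cloudflare is left out entirely, and the optional case-insensitive form assumes the `lower()` function of Cloudflare's rules language, which is worth checking against the current documentation before relying on it:

```python
def build_waf_expression(bots, case_insensitive=False):
    """Join one `contains` clause per bot name with `or`."""
    # Assumption: lower() is available as a transformation function
    # in the Cloudflare rules language.
    field = "lower(http.user_agent)" if case_insensitive else "http.user_agent"
    clauses = []
    for bot in bots:
        # Refuse names that would need escaping inside the quoted string.
        if '"' in bot or "\\" in bot:
            raise ValueError(f"unsupported character in bot name: {bot!r}")
        needle = bot.lower() if case_insensitive else bot
        clauses.append(f'({field} contains "{needle}")')
    return " or ".join(clauses)


# Illustrative subset; the real list lives in the site params.
SAMPLE_BOTS = ["GPTBot", "ClaudeBot", "CCBot"]
print(build_waf_expression(SAMPLE_BOTS))
```

Generating both `robots.txt` and the WAF rule from one list would also keep the two from drifting apart, since the hand-written expression above already differs slightly from the list in the site params.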
--- a/content/posts/publish-services-cloudflare-tunnel/index.md
+++ b/content/posts/publish-services-cloudflare-tunnel/index.md
@@ -8,6 +8,7 @@ toc: true
 comments: true
 categories: Self-Hosting
 tags:
  - cloudflare
  - cloud
  - containers
  - docker
--- a/layouts/robots.txt
+++ b/layouts/robots.txt
@@ -0,0 +1,8 @@
 Sitemap: {{ .Site.BaseURL }}/sitemap.xml
 User-agent: *
 Disallow:
 {{ range .Site.Params.robots }}
 User-agent: {{ . }}
 {{- end }}
 Disallow: /
--- a/static/ai.txt
+++ b/static/ai.txt
@@ -0,0 +1,6 @@
 # Spawning AI
 # Prevent datasets from using the following file types
 User-Agent: *
 Disallow: /
 Disallow: *
Author	SHA1	Message	Date
John Bowdre	7726d3e5dd	update post	2024-04-12 18:27:14 -05:00
John Bowdre	dff5146771	update params, archetypes, tags	2024-04-12 17:12:10 -05:00
John Bowdre	e6f6c2e8d0	new post	2024-04-12 17:09:05 -05:00
John Bowdre	0197cb3179	add ai.txt	2024-04-12 15:42:22 -05:00
John Bowdre	883e85c701	let ai bots know they aren't welcome	2024-04-12 15:29:07 -05:00