Automating My robots.txt to Block AI User Agents

darkvisitors.com maintains a helpful list of user agents you can use to "block" scrapers that are feeding large language models.


In their words: "Insight into the hidden ecosystem of autonomous chatbots and data scrapers crawling across the web. Protect your website from unwanted AI agent access."


They even provide a sample robots.txt file. So, I decided to scrape it, save the agent strings, and dynamically generate a robots.txt file - that way I don't need to manually update mine each time BigCo creates a new chatbot.


This site is built with Architect, so out of the box I have access to 4 key features:


  1. Scheduled functions (EventBridge + Lambda)

  2. A stupid fast database (DynamoDB)

  3. HTTP Cloud functions (API Gateway + Lambda)

  4. A CDN/cache layer (CloudFront)


Here's how I've defined resources for this feature in my site's app.arc file:


@scheduled
get-dark-visitors rate(1 day)

@http
get /robots.txt

@tables
things
  key *String
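
A quick note on that @tables entry: it becomes a DynamoDB table whose partition key is a string attribute named key, and @architect/functions exposes a small data client for it. Here's a rough sketch of that client in use (the item is made up for illustration):

import arc from '@architect/functions'

// arc.tables() discovers the tables declared in app.arc and returns a
// lightweight DynamoDB document client per table
const { things } = await arc.tables()

// put() and get() work with plain objects; `key` is the partition key
await things.put({ key: 'example', note: 'extra attributes are stored as-is' })
const item = await things.get({ key: 'example' })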

Without getting into a full Architect tutorial, here's the source of the 2 related Lambdas:


The scheduled function, executed once per day:


import arc from '@architect/functions'

const matcher = /^User-agent:/
const { things } = await arc.tables()

export async function handler() {
  // Grab the pre-built robots.txt from Dark Visitors and keep only the
  // "User-agent: ..." lines
  const response = await fetch('https://darkvisitors.com/robots-txt-builder')
  const agents = (await response.text())
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => matcher.test(line))

  // Save the agent list (plus a timestamp) under a single key
  await things.put({
    key: 'agents:dark-visitors',
    agents,
    updated: new Date().toISOString(),
  })
}
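
After a run, the item saved to DynamoDB should look roughly like this (agents abbreviated, timestamp is a placeholder):

{
  key: 'agents:dark-visitors',
  agents: [
    'User-agent: anthropic-ai',
    'User-agent: Bytespider',
    'User-agent: CCBot',
    // ...one entry per agent from the Dark Visitors list
  ],
  updated: '2024-01-01T00:00:00.000Z',
}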


And here's the HTTP handler for tbeseda.com/robots.txt:


import arc from '@architect/functions'

const { things } = await arc.tables()

// Static rules; agent-specific blocks are appended per request so warm
// Lambda invocations don't accumulate duplicates
const baseRules = `
User-agent: *
Disallow: /sekret
`

async function get() {
  let lines = baseRules

  const agentsThing = await things.get({
    key: 'agents:dark-visitors',
  })

  if (agentsThing) {
    const { agents } = agentsThing
    const agentLines = agents
      .map((agent) => `${agent}\nDisallow: /`)
      .join('\n\n')

    lines += `\n${agentLines}`
  }

  return {
    // cache at the CDN for 1 day
    headers: { 'Cache-Control': 'public, max-age=86400' },
    text: lines.trim(),
  }
}

export const handler = arc.http(get)
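
To sanity-check the deployed endpoint and its cache header, a hypothetical one-off script (anywhere fetch is available) would look like:

// expected values based on the handler above
const res = await fetch('https://tbeseda.com/robots.txt')
console.log(res.status) // 200
console.log(res.headers.get('cache-control')) // public, max-age=86400
console.log(await res.text()) // the generated rules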


The current robots.txt now looks like this:


User-agent: *
Disallow: /sekret

User-agent: anthropic-ai
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: omgili
Disallow: /

That's that. Now a scheduled function grabs the known agents once a day and saves them to the database, and the generated robots.txt is served from the CloudFront cache for a day at a time.
