robots.txt + ai.txt

Per RFC 9309

Crawler controls including AI-crawler blocks (ClaudeBot, GPTBot, Google-Extended).

What is this, and when do I need it?

What is this?

robots.txt is a text file at the root of your website (/robots.txt) that tells crawlers which paths they may fetch and which they may not. The format is standardised in RFC 9309.

Complementing it, ai.txt and llms.txt are aimed specifically at AI crawlers (ClaudeBot, GPTBot, Google-Extended, PerplexityBot). ai.txt signals whether your content may be used as training material for language models; llms.txt gives language models a short overview of the site. Neither is legally binding yet, but serious providers have respected such signals so far.

When do I need it?

robots.txt is a must for any production website. Without it, search engines crawl everything they can reach, including internal paths, admin pages and staging environments. A few Disallow lines save crawl budget and reduce the risk of accidental indexing.

ai.txt / llms.txt are recommended as soon as you publish content with IP value (texts, code, data) that you do not want in AI training. In practice this is effective with the major providers; against bad actors, only legal remedies help.

Sensible next steps

security.txt to give security researchers a contact path

User-Agent
Targets a specific crawler. * applies to all crawlers without a more specific rule of their own.

Sitemap URL (absolute, RFC 9309 § 2.2.4)
Full https:// URL of your XML sitemap. Search engines read it for efficient crawling. Add multiple sitemaps by repeating the entry.

Disallow (one path rule per line)
Paths the crawler must not fetch. Rules match by prefix (e.g. /admin/ blocks everything below it). Important: this is NOT a security mechanism, only a courtesy rule. The file itself is public, and listed URLs remain reachable by anyone who requests or guesses them.

Allow (one path rule per line)
Exceptions to a Disallow rule. Example: Disallow: /admin/ plus Allow: /admin/public/ blocks the admin area but still allows the public subarea.

Crawl-delay (sec., 0 = off)
Wait time between two requests. Not in RFC 9309. Google ignores it; Bing/Yandex support it.

Host (Yandex-specific)
Preferred spelling of the domain (with/without www); only Yandex reads it. For other crawlers redirect via the server.

AI crawler block
Adds a block for 37 known AI training crawlers (GPTBot, ClaudeBot, Google-Extended, PerplexityBot, ...). The list needs ongoing maintenance.
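
The way Disallow and Allow rules combine can be sketched in a few lines of Python. This is a minimal illustration, not a full parser: it assumes plain prefix rules without the * and $ wildcards, and it follows the RFC 9309 guidance that the longest matching rule wins and that Allow wins a tie. The function name is_allowed and the (kind, prefix) tuples are hypothetical names chosen for this example.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Evaluate prefix rules for one crawler group.

    rules is a list of (kind, prefix) pairs, kind being "allow" or "disallow".
    Assumption: no wildcard support, prefixes are compared literally.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be fetched
    for kind, prefix in rules:
        if not prefix:
            continue  # an empty rule value matches nothing
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


rules = [("disallow", "/admin/"), ("allow", "/admin/public/")]
print(is_allowed("/admin/settings", rules))    # False
print(is_allowed("/admin/public/faq", rules))  # True
print(is_allowed("/blog/", rules))             # True
```

Because the longest prefix decides, the order of the lines in the file does not matter for this sketch.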

/robots.txt per RFC 9309

# robots.txt per RFC 9309 (Robots Exclusion Protocol)
# Created with Dernium Webtools

User-agent: *
Disallow: /admin/
Disallow: /api/

# AI crawler block. Tokens per vendor documentation as of early 2026.
# List requires ongoing maintenance because vendors change tokens.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: OAI-SearchBot
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: CCBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Bytespider
User-agent: Amazonbot
User-agent: Applebot-Extended
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: YouBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: FacebookBot
User-agent: facebookexternalhit
User-agent: ImagesiftBot
User-agent: Diffbot
User-agent: Webzio-Extended
User-agent: omgili
User-agent: omgilibot
User-agent: Timpibot
User-agent: PetalBot
User-agent: AI2Bot
User-agent: Andibot
User-agent: Kangaroo Bot
User-agent: Velen Crawler
User-agent: MistralAI-User
User-agent: DuckAssistBot
User-agent: iaskspider
User-agent: Sidetrade indexer bot
User-agent: ICC-Crawler
User-agent: ISSCyberRiskCrawler
Disallow: /

Sitemap: https://example.com/sitemap.xml
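
To sanity-check a generated file before deploying it, one option is Python's standard urllib.robotparser module. The sketch below parses a shortened excerpt of the file above (only three of the AI tokens, to keep it brief) and asks how two crawlers would be treated. ExampleBot is a made-up user agent standing in for any crawler without a group of its own.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the generated file, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = load_rules(ROBOTS_TXT)

# AI crawlers in the shared group are blocked everywhere.
print(parser.can_fetch("GPTBot", "https://example.com/"))            # False
# Other crawlers fall back to the * group.
print(parser.can_fetch("ExampleBot", "https://example.com/blog/"))   # True
print(parser.can_fetch("ExampleBot", "https://example.com/admin/"))  # False
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Note that urllib.robotparser applies the first matching rule in file order rather than the longest match, so its answers can differ from an RFC 9309 parser when Allow and Disallow rules overlap. The excerpt above has no such overlap.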

Extras: ai.txt and llms.txt

ai.txt per Spawning is an opt-out or opt-in marker for AI training pipelines at the media-type level (text, image, audio, video, code). llms.txt per llmstxt.org is a short Markdown briefing that language models can read to understand the structure of the site.

Domain
Apex domain without protocol prefix (e.g. example.com). Used as the domain comment in ai.txt and as the link URL base in llms.txt.

Brand name / title for llms.txt
Main heading (H1) at the top of the llms.txt file. Usually the domain or brand name; language models use it to classify.

llms.txt short description
1-2 sentences on what the site does. Inserted as a blockquote directly under the heading.

ai.txt directive

Media types (comma-separated)

/ai.txt per Spawning ai.txt

# ai.txt per Spawning (https://spawning.ai/)
# Opt-out / opt-in signal for AI training pipelines, separate from robots.txt.
# Created with Dernium Webtools

User-Agent: *
Disallow: image, text, audio, video, code

# Domain: example.com
# Host under https://<domain>/ai.txt

/llms.txt per llmstxt.org

# Example Ltd
> Short description of the site for language models.

## Important content

- [Home](https://example.com/)
- [Imprint](https://example.com/imprint)
- [Contact](https://example.com/contact)

<!-- Created with Dernium Webtools -->
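
How the form fields map onto the file can be sketched as follows. This is a guess at the assembly logic, not the generator's actual code: the function name build_llms_txt, its parameters and the default section heading are assumptions made for this example. It combines the domain (as the link base), the title (as H1) and the short description (as a blockquote).

```python
def build_llms_txt(
    domain: str,
    title: str,
    description: str,
    pages: list[tuple[str, str]],
    section: str = "Important content",
) -> str:
    """Assemble a minimal llms.txt: H1, blockquote, one H2 link section.

    pages is a list of (label, path) pairs; each path is appended to
    https://<domain> to form the link target.
    """
    base = f"https://{domain}"
    lines = [f"# {title}", f"> {description}", "", f"## {section}", ""]
    for label, path in pages:
        lines.append(f"- [{label}]({base}{path})")
    return "\n".join(lines) + "\n"


text = build_llms_txt(
    domain="example.com",
    title="Example Ltd",
    description="Short description of the site for language models.",
    pages=[("Home", "/"), ("Imprint", "/imprint"), ("Contact", "/contact")],
)
print(text)
```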

Inspect an existing robots.txt

Fetches /robots.txt of the given domain and shows the content.

Server path: this inspection does NOT run locally in your browser. Our server fetches the DNS record or HTTPS response on your behalf. We do not log the queried domain or the result. Rate limit: 12 requests per minute per IPv4 address or IPv6 /64 subnet.