GEO Robots.txt Generator

Configure AI crawler access for optimal GEO visibility

📚 Educational Tool

This tool generates the AI crawler section of your robots.txt. Append it to your existing file, and always test in staging first.
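One way to run that staging test is with Python's standard-library robots.txt parser. The sketch below is illustrative only: the sample file, the crawler-to-action assignments, and the helper names `load_rules` and `audit` are assumptions made for this example, not output of the generator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample: one blocked, one rate-limited, one allowed crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Crawl-delay: 10
Disallow: /cart/
Allow: /

User-agent: Claude-SearchBot
Disallow: /cart/
Allow: /
"""


def load_rules(robots_txt):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def audit(robots_txt, checks):
    """Return the (agent, path) pairs whose result differs from the expectation."""
    parser = load_rules(robots_txt)
    return [
        (agent, path)
        for agent, path, expected in checks
        if parser.can_fetch(agent, path) != expected
    ]


# Expectations you would write down before deploying (illustrative values).
EXPECTED = [
    ("GPTBot", "/blog/post", False),
    ("ClaudeBot", "/blog/post", True),
    ("ClaudeBot", "/cart/", False),
    ("Claude-SearchBot", "/blog/post", True),
]

if __name__ == "__main__":
    print("mismatches:", audit(SAMPLE_ROBOTS_TXT, EXPECTED))
    print("ClaudeBot delay:", load_rules(SAMPLE_ROBOTS_TXT).crawl_delay("ClaudeBot"))
```

The sample lists Disallow lines before Allow: / on purpose. Python's parser applies the first matching rule, while crawlers that follow RFC 9309 apply the most specific match, and this ordering gives the same answer under both.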

📖 Key Concepts Explained

Allow — Full access. The crawler can visit all pages (except blocked paths) at any speed. Best for maximum visibility.

Rate-Limit — Restricted speed. The crawler can visit pages but must wait X seconds between requests (Crawl-delay). Use for training crawlers that consume server resources.

Block — No access. The crawler cannot visit any pages. You become invisible to that AI system. Use for aggressive or unwanted crawlers.

Crawl-delay — Seconds the crawler must wait between page requests. Higher values = less server load but slower indexing. Typical: 2-10s for search crawlers, 10-30s for training crawlers.

User-agent: * — A wildcard rule that applies to ALL crawlers not specifically listed. Acts as a "catch-all" fallback.

Training crawlers — Bulk-collect content to train future AI models. Your content becomes part of model weights but is not directly attributed. High server load, indirect long-term value. robots.txt is your control surface for opting out. Consider rate-limiting.

Search Index crawlers — Proactively crawl your site to build a searchable index, like Googlebot does for Google Search. When users later ask questions, the AI searches this pre-built index. robots.txt is your control surface for search visibility. Prioritize allowing these.

User Fetcher (RAG) crawlers — Fire only when a human user asks a question, fetching your page in real-time to augment that specific response. This is Retrieval-Augmented Generation. robots.txt compliance varies by company — some honor it (Claude-User), some don't (ChatGPT-User). Allow, but don't rely on robots.txt for blocking.

Ad Validation crawlers — Visit landing pages submitted as ads to verify safety and policy compliance before the ad is allowed to serve. Currently OpenAI's OAI-AdsBot is the only major example (added April 2026). Blocking these does not protect content but does break your ability to advertise on the AI platform. Allow — especially if running AI-platform ad campaigns.

Content-Signal — Cloudflare's proposed extension to robots.txt (Sept 2025) that adds a second layer: if the crawler is allowed to fetch the page, what may it do with the content afterward? Three signals: search (index for results), ai-input (real-time AI inference — the GEO citation surface), ai-train (model training). Voluntary, unratified, but deployed across 3.8M+ Cloudflare-managed domains by default. Configure in Section 5 below.

1. Select Your Strategy

🚀 Maximize Visibility

Allow all crawlers for maximum AI citations

⚖️ Balanced

Allow search index + RAG, rate-limit training

🛡️ Conservative

Search index + RAG only, block training
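The three presets can be read as a lookup from crawler category to action. The following is a minimal sketch under that reading; the keys, the category identifiers, and the `action_for` helper are hypothetical names. The preset descriptions do not say how Ad Validation crawlers are treated, so the sketch leaves that category out rather than guess.

```python
# Hypothetical encoding of the three presets described above.
STRATEGIES = {
    "maximize": {
        "training": "allow",
        "search_index": "allow",
        "user_fetcher": "allow",
    },
    "balanced": {
        "training": "rate-limit",
        "search_index": "allow",
        "user_fetcher": "allow",
    },
    "conservative": {
        "training": "block",
        "search_index": "allow",
        "user_fetcher": "allow",
    },
}


def action_for(strategy, category):
    """Return 'allow', 'rate-limit', or 'block'; None if the preset is silent."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return STRATEGIES[strategy].get(category)


if __name__ == "__main__":
    for name in STRATEGIES:
        print(name, "->", action_for(name, "training"))
```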

2. Configure AI Crawlers

Training: Bulk model training; indirect long-term value

Search Index: Proactive indexing for AI search; controllable via robots.txt

User Fetcher (RAG): Live retrieval at query time; robots.txt may not apply

Ad Validation: Validates ad landing pages; blocking may break ad serving
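Each choice of Allow, Rate-Limit, or Block for a crawler ends up as one rule group in robots.txt. The sketch below shows one plausible rendering; `render_group` is a hypothetical helper, and the default delay of 10 seconds is an assumption taken from the 10-30s range given for training crawlers under Key Concepts.

```python
def render_group(user_agent, action, blocked_paths=(), crawl_delay=10):
    """Render one robots.txt rule group for a single crawler.

    action is 'allow', 'rate-limit', or 'block'. The output layout is an
    assumption for illustration, not the generator's exact format.
    """
    lines = [f"User-agent: {user_agent}"]
    if action == "block":
        # Block: no access to any page.
        lines.append("Disallow: /")
        return "\n".join(lines)
    if action == "rate-limit":
        # Rate-Limit: same access as Allow, plus a wait between requests.
        lines.append(f"Crawl-delay: {crawl_delay}")
    elif action != "allow":
        raise ValueError(f"unknown action: {action!r}")
    # Blocked paths go first so first-match parsers also honor them.
    lines.extend(f"Disallow: {path}" for path in blocked_paths)
    lines.append("Allow: /")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_group("GPTBot", "block"))
    print()
    print(render_group("ClaudeBot", "rate-limit", ["/cart/", "/checkout/"], 15))
    print()
    print(render_group("*", "allow", ["/admin/"]))
```

Passing "*" as the user agent produces the catch-all fallback group for unlisted crawlers.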

3. Paths to Block

Pages that should NOT be crawled by AI systems (applied to all non-blocked crawlers)

Shopping Cart: /cart/
Checkout: /checkout/
User Accounts: /account/
Admin Area: /admin/
Internal Search: /search/
API Endpoints: /api/
Private/Internal: /private/
WordPress Admin: /wp-admin/

Custom Paths to Block

Comma-separated paths. Include trailing slash.
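A comma-separated field like this needs light cleanup before it becomes Disallow lines. The sketch below is one hedged way to do that; `parse_custom_paths` is a hypothetical helper, and it assumes every entry is a directory, so it adds a missing leading or trailing slash. That assumption would be wrong for single files such as /report.pdf.

```python
def parse_custom_paths(raw):
    """Split a comma-separated string into normalized directory paths.

    Assumption: every entry names a directory, so missing leading and
    trailing slashes are added. Empty entries and duplicates are dropped.
    """
    paths = []
    for part in raw.split(","):
        path = part.strip()
        if not path:
            continue
        if not path.startswith("/"):
            path = "/" + path
        if not path.endswith("/"):
            path += "/"
        if path not in paths:
            paths.append(path)
    return paths


if __name__ == "__main__":
    for path in parse_custom_paths("/staging/, drafts ,, /staging"):
        print(f"Disallow: {path}")
```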

4. Additional Options

Sitemap URL: Helps crawlers find important pages

Include General Fallback (*): Rule for unlisted crawlers

5. Content Signals Policy

Unratified Proposal · Voluntary Compliance · Cloudflare-Deployed

The acquisition vs. usage distinction. Traditional robots.txt directives (Allow / Disallow) control whether a crawler may fetch your pages. Cloudflare's Content Signals Policy, introduced September 2025 and deployed across 3.8M+ domains, adds a separate layer that controls how the content may be used after it has been fetched. The two layers operate independently: a Disallow-ed page is never retrieved, so its Content-Signal is moot; an Allow-ed page is fetched but the signal binds the crawler's downstream usage.
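To make the second layer concrete, here is a small sketch that builds a Content-Signal line from three yes/no preferences. The `content_signal_line` helper is hypothetical, and the comma-separated key=value syntax follows the proposal as understood at the time of writing, so check the current policy text before relying on it. Signals left out of the dictionary are omitted from the line, which is assumed to express no preference either way.

```python
# Signal names as listed in this section, in the order they are emitted.
SIGNALS = ("search", "ai-input", "ai-train")


def content_signal_line(prefs):
    """Build a Content-Signal line from a {signal: bool} mapping.

    Returns an empty string when no preference is set. The exact syntax is
    an assumption based on the published proposal.
    """
    unknown = set(prefs) - set(SIGNALS)
    if unknown:
        raise ValueError(f"unknown signals: {sorted(unknown)}")
    parts = [
        f"{name}={'yes' if prefs[name] else 'no'}"
        for name in SIGNALS
        if name in prefs
    ]
    return "Content-Signal: " + ", ".join(parts) if parts else ""


if __name__ == "__main__":
    # Acquisition layer (Allow) and usage layer (Content-Signal) side by side.
    print("User-agent: *")
    print(content_signal_line({"search": True, "ai-input": True, "ai-train": False}))
    print("Allow: /")
```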

search

Permission to index your content for search results and excerpts. Setting search=no removes you from search citation surfaces.

ai-input

Permission to use your content as real-time input for AI answers (RAG, AI Overviews, ChatGPT browsing). Most consequential signal for GEO.

ai-train

Permission to use your content to train or fine-tune AI models. Affects long-term foundational authority, not direct citations.

⚠️ ai-input=no removes you from generative AI citation

Setting ai-input=no signals to compliant crawlers that your content should not be used in real-time AI answers — including AI Overviews, ChatGPT, Perplexity, and similar surfaces. This is the opposite of typical GEO objectives, which aim to maximize citation in those surfaces. Use this only for deliberate licensing, content protection, or regulatory strategies — not as a default precaution.

Include Cloudflare's CC0 Preamble: Human-readable explanation of the policy

Application Scope: Avoids the per-user-agent specificity trap

6. Verification & Enforcement Layer

IETF Draft (builds on RFC 9421) · Cloudflare / AWS WAF · Identity, not Policy

Policy vs. enforcement. Everything above this point — Allow/Disallow and Content Signals — is a stated preference. It tells well-behaved crawlers what you want; it does not prove who is knocking or stop anyone who ignores it. That gap matters more each quarter: as of June 2026 automated systems generate 57.5% of all web requests (Cloudflare), surpassing human traffic for the first time, and TollBit measured 13%+ of AI-bot requests ignoring robots.txt in Q4 2025 (a 400% rise over two quarters). The enforcement layer lives at your CDN/WAF and origin — not in robots.txt — and it answers a different question: is this request cryptographically who it claims to be?

🔑 Web Bot Auth — cryptographic crawler identity (RFC 9421)

Web Bot Auth applies HTTP Message Signatures (RFC 9421) to crawler traffic. An operator generates an Ed25519 keypair, publishes the public key at /.well-known/http-message-signatures-directory, and signs every outbound request with Signature-Agent, Signature-Input, and Signature headers. Your edge verifies the signature against the published key — confirming identity with cryptographic certainty rather than trusting a spoofable user-agent string. It is an IETF draft (draft-meunier-web-bot-auth-architecture, v-05, March 2026) backed by Cloudflare, Google, Amazon, Akamai, and OpenAI; working-group milestones target standards-track specs and a Best Current Practice document through August 2026, with RFC publication possible in 2027. If you sit behind Cloudflare, signatures are validated at the edge and exposed via cf.verified_bot_category; AWS WAF added native support as well.

Three tiers of verification, strongest first. When deciding whether a request claiming to be GPTBot or ClaudeBot is genuine, rank your evidence: (1) a valid Web Bot Auth signature — highest certainty; (2) a source IP that matches the operator's published IP-range file (OpenAI, Anthropic, Google, and others publish machine-readable JSON ranges); (3) reverse-DNS confirmation of the hostname. The user-agent header alone is the weakest signal and is trivially spoofed — which is precisely how Perplexity's stealth fetches were caught.
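The ranking above can be sketched as a small decision function. Everything here is illustrative: `verification_tier` is a hypothetical helper, the signature and reverse-DNS results are assumed to be computed elsewhere (for example at your CDN/WAF), and the CIDR blocks are reserved documentation ranges, not any operator's real published ranges.

```python
import ipaddress


def verification_tier(signature_valid, source_ip, published_ranges, rdns_confirmed):
    """Return the strongest evidence tier for a request, or None.

    1 = valid Web Bot Auth signature
    2 = source IP inside the operator's published ranges
    3 = reverse-DNS confirmation
    None = only the user-agent string, which is trivially spoofed
    """
    if signature_valid:
        return 1
    address = ipaddress.ip_address(source_ip)
    if any(address in ipaddress.ip_network(cidr) for cidr in published_ranges):
        return 2
    if rdns_confirmed:
        return 3
    return None


if __name__ == "__main__":
    # Placeholder ranges from the documentation address blocks.
    example_ranges = ["192.0.2.0/24", "2001:db8::/32"]
    print(verification_tier(False, "192.0.2.10", example_ranges, False))
    print(verification_tier(False, "203.0.113.5", example_ranges, True))
    print(verification_tier(False, "203.0.113.5", example_ranges, False))
```

A request that lands on None claims a crawler identity with nothing to back it, which is the case to treat with the most suspicion.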

💳 Licensing & monetization layer (optional)

Two complementary efforts let publishers move from "block or allow" to "license or charge." Cloudflare Pay-Per-Crawl revives the HTTP 402 Payment Required status code so sites can charge a per-request fee; Cloudflare has blocked AI crawlers by default on newly onboarded domains since mid-2025, and customers were already sending more than a billion 402 responses a day by June 2026. RSL (Really Simple Licensing, launched September 2025, backed by Reddit, Yahoo, Quora, Ziff Davis, and Creative Commons) embeds machine-readable license and royalty terms — free, attribution, subscription, pay-per-crawl, or pay-per-inference — directly in robots.txt. Both are signaling/billing layers that depend on the verification layer above to be enforceable, and both remain voluntary on the AI operator's side.

Current crawler economics (context for your strategy). The crawl-to-refer ratios shift monthly, so treat any single figure as a snapshot and re-check quarterly. As of late May 2026 (Cloudflare Radar), Anthropic's ClaudeBot sits highest at roughly 11,122:1, OpenAI's blended ratio near 857:1, Perplexity around 190:1, and Google about 5:1. The standings also moved: in May 2026 GPTBot overtook ClaudeBot as the third-largest AI crawler (11.48% vs 9.73% of AI-bot traffic), with Claude-SearchBot entering the top 10. The durable takeaway is unchanged — block or rate-limit pure training crawlers with extractive ratios, allow search and user-fetch bots that return referrals, and verify identity before acting on either.
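For the quarterly re-check, the ratio can be recomputed from your own logs. The sketch below assumes the crawl-to-refer ratio means crawler requests divided by referral visits from the same operator over the same period; `crawl_to_refer` is a hypothetical helper and the counts are made-up inputs, not measurements.

```python
def crawl_to_refer(crawl_requests, referral_visits):
    """Format a crawl-to-refer ratio such as '857:1'.

    Returns None when there are no referrals, since the ratio is undefined.
    """
    if referral_visits <= 0:
        return None
    ratio = crawl_requests / referral_visits
    return f"{ratio:,.0f}:1"


if __name__ == "__main__":
    # Made-up log counts for illustration only.
    print(crawl_to_refer(85_700, 100))
    print(crawl_to_refer(11_122, 1))
    print(crawl_to_refer(500, 0))
```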

⚠️ Compliance Note

Robots.txt is a voluntary standard: legitimate crawlers respect it, but compliance is not guaranteed.

Known gaps as of May 2026:

ChatGPT-User does not follow robots.txt for OpenAI's user-initiated browsing pathway (removed December 2025). OpenAI's newer ChatGPT Agent shares the same user-agent token but does respect robots.txt.

Perplexity-User ignores robots.txt for user-initiated requests.

Google-Agent (Project Mariner), added to Google's official fetcher list in March 2026, ignores robots.txt entirely by design. Google classifies it as a user-triggered fetcher, and the only effective controls are server-side authentication and WAF rules, not robots.txt.

Agentic AI browsers (ChatGPT Atlas, Perplexity Comet) similarly use standard Chrome user-agents with no distinguishing token, so robots.txt cannot address them.

For these cases, look beyond this tool to your WAF and access-control layer.

For crisis or sensitive content: assume that once a page is publicly accessible, AI assistants may fetch and quote it immediately.

Content Signals are preferences, not enforcement. Google has not committed to honoring them for AI Overviews, and the directive is not yet a ratified standard (the IETF AIPREF Working Group's vocabulary lock targets August 2026).

The llms.txt proposal (Howard, September 2024) is intentionally excluded from this generator: no major AI crawler reads or honors it as of May 2026.

Generated robots.txt

