GEO Robots.txt Generator

Configure AI crawler access for optimal Generative Engine Optimization (GEO) visibility

📚 Educational Tool
This tool generates the AI crawler section of your robots.txt. Append the output to your existing file, and always test in staging first.
📖 Key Concepts Explained
Allow — Full access. The crawler can visit all pages (except blocked paths) at any speed. Best for maximum visibility.
Rate-Limit — Restricted speed. The crawler can visit pages but is asked to wait X seconds between requests (Crawl-delay). Use for training crawlers that consume server resources.
Block — No access. The crawler cannot visit any pages. You become invisible to that AI system. Use for aggressive or unwanted crawlers.
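Crawl-delay — Seconds the crawler is asked to wait between page requests. Higher values mean less server load but slower indexing. Typical: 2-10s for search crawlers, 10-30s for training crawlers. The directive is not part of RFC 9309, and some crawlers, including Googlebot, ignore it.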
User-agent: * — A wildcard rule that applies to ALL crawlers not specifically listed. Acts as a "catch-all" fallback.
Training crawlers — Bulk-collect content to train future AI models. Your content becomes part of model weights but is not directly attributed. High server load, indirect long-term value. robots.txt is your control surface for opting out. Consider rate-limiting.
Search Index crawlers — Proactively crawl your site to build a searchable index, like Googlebot does for Google Search. When users later ask questions, the AI searches this pre-built index. robots.txt is your control surface for search visibility. Prioritize allowing these.
User Fetcher (RAG) crawlers — Fire only when a human user asks a question, fetching your page in real-time to augment that specific response. This is Retrieval-Augmented Generation. robots.txt compliance varies by company — some honor it (Claude-User), some don't (ChatGPT-User). Allow, but don't rely on robots.txt for blocking.
Ad Validation crawlers — Visit landing pages submitted as ads to verify safety and policy compliance before the ad is allowed to serve. Currently OpenAI's OAI-AdsBot is the only major example (added April 2026). Blocking these does not protect content but does break your ability to advertise on the AI platform. Allow — especially if running AI-platform ad campaigns.
Content-Signal — Cloudflare's proposed extension to robots.txt (Sept 2025) that adds a second layer: if the crawler is allowed to fetch the page, what may it do with the content afterward? Three signals: search (index for results), ai-input (real-time AI inference — the GEO citation surface), ai-train (model training). Voluntary, unratified, but deployed across 3.8M+ Cloudflare-managed domains by default. Configure in Section 5 below.
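As a rough illustration of the concepts above, the three access levels and the wildcard fallback might look like this in a generated file. The user-agent tokens (OAI-SearchBot, GPTBot, CCBot) and the 10-second delay are assumptions chosen for the example, not output copied from this tool; verify current tokens against each vendor's documentation before deployment.

```text
# Allow: full access for a search-index crawler (token is illustrative)
User-agent: OAI-SearchBot
Allow: /

# Rate-limit: access with a requested 10-second pause between requests.
# Crawl-delay is non-standard, so not every crawler honors it.
User-agent: GPTBot
Crawl-delay: 10
Allow: /

# Block: no access for this crawler
User-agent: CCBot
Disallow: /

# Catch-all for crawlers not listed above
User-agent: *
Allow: /
```

Each group starts with a User-agent line, and a compliant crawler follows the group that names it, falling back to the `*` group only when no group does.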

1. Select Your Strategy

🚀 Maximize Visibility: Allow all crawlers for maximum AI citations
⚖️ Balanced: Allow search index + RAG, rate-limit training
🛡️ Conservative: Search index + RAG only, block training
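To make the strategies concrete, here is a sketch of what the Conservative preset could produce for a single vendor that runs one crawler per category. The tokens shown (ClaudeBot, Claude-SearchBot, Claude-User) are assumptions used for illustration; confirm them against the vendor's current documentation.

```text
# Conservative strategy, sketched for one vendor's three crawlers

# Training crawler: blocked
User-agent: ClaudeBot
Disallow: /

# Search index crawler: allowed
User-agent: Claude-SearchBot
Allow: /

# User fetcher (RAG): allowed
User-agent: Claude-User
Allow: /
```

Under the Balanced preset, the first group would presumably swap `Disallow: /` for a `Crawl-delay` line plus `Allow: /`. Under Maximize Visibility, all three groups would simply allow access.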

2. Configure AI Crawlers

Training: Bulk model training; indirect long-term value
Search Index: Proactive indexing for AI search; controllable via robots.txt
User Fetcher (RAG): Live retrieval at query time; robots.txt may not apply
Ad Validation: Validates ad landing pages; blocking may break ad serving

3. Paths to Block

Pages that should NOT be crawled by AI systems (applied to all non-blocked crawlers)

Comma-separated paths. Include trailing slash.
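As a sketch of how the path list is applied, assume the hypothetical input `/admin/, /checkout/`. Each path would become its own Disallow line inside every allowed or rate-limited group. The paths and the user-agent token below are placeholders, not recommendations.

```text
# Hypothetical input: /admin/, /checkout/
User-agent: OAI-SearchBot
Allow: /
Disallow: /admin/
Disallow: /checkout/
```

Rules are matched as path prefixes, and under RFC 9309 the longest matching rule wins, so `Disallow: /admin/` overrides `Allow: /` for anything under that directory. The trailing slash matters: `Disallow: /admin` would also match `/administrator`, while `Disallow: /admin/` matches only the directory's contents.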

4. Additional Options

One URL per line. Each becomes its own Sitemap: line. A sitemap index file is also valid.
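For example, two entries in this field would be emitted as two separate lines. The URLs are placeholders on example.com.

```text
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-index.xml
```

Sitemap lines stand outside the user-agent groups, so they apply regardless of which crawler reads the file.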
Rule for unlisted crawlers

5. Content Signals Policy

Unratified Proposal · Voluntary Compliance · Cloudflare-Deployed
The acquisition vs. usage distinction. Traditional robots.txt directives (Allow / Disallow) control whether a crawler may fetch your pages. Cloudflare's Content Signals Policy, introduced September 2025 and deployed across 3.8M+ domains, adds a separate layer that states how the content may be used after it has been fetched. The two layers operate independently: a Disallow-ed page is never retrieved by a compliant crawler, so its Content-Signal is moot; an Allow-ed page is fetched, and the signal tells the crawler which downstream uses you permit.

The signals below apply to every allowed and rate-limited crawler. To give a specific bot a different policy, use the Content-Signal control on that bot's row in Section 2. An overridden bot is written into its own user-agent group with its own signal line, which it reads in place of the global one: under RFC 9309, a crawler follows the group that names its user-agent and falls back to the wildcard group only when no group does.
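A minimal sketch of a global policy plus one per-bot override follows. The directive syntax mirrors Cloudflare's published examples as best understood here, the GPTBot token and the chosen yes/no values are assumptions, and the exact group layout this generator emits is not shown in the source.

```text
# Global signals for crawlers that have no group of their own
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

# Overridden bot: its own group carries its own signal line
User-agent: GPTBot
Content-Signal: search=yes, ai-input=yes, ai-train=yes
Allow: /
```

Because a crawler with its own group does not read the wildcard group, any bot listed separately needs the signal line repeated inside its group. Otherwise that bot sees no Content-Signal at all.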
search
Permission to index your content for search results and excerpts. Setting search=no asks compliant crawlers to remove you from search citation surfaces.
ai-input
Permission to use your content as real-time input for AI answers (RAG, AI Overviews, ChatGPT browsing). Most consequential signal for GEO.
ai-train
Permission to use your content to train or fine-tune AI models. Affects long-term foundational authority, not direct citations.
⚠️ ai-input=no removes you from generative AI citation
Setting ai-input=no signals to compliant crawlers that your content should not be used in real-time AI answers — including AI Overviews, ChatGPT, Perplexity, and similar surfaces. This is the opposite of typical GEO objectives, which aim to maximize citation in those surfaces. Use this only for deliberate licensing, content protection, or regulatory strategies — not as a default precaution.
Human-readable explanation of the policy
⚠️ Compliance Note
Robots.txt is a voluntary standard: legitimate crawlers respect it, but compliance is not guaranteed. Known gaps as of May 2026:
ChatGPT-User does not follow robots.txt for OpenAI's user-initiated browsing pathway (removed December 2025); OpenAI's newer ChatGPT Agent shares the same user-agent token but does respect robots.txt.
Perplexity-User ignores robots.txt for user-initiated requests.
Google-Agent (Project Mariner), added to Google's official fetcher list in March 2026, ignores robots.txt entirely by design: Google classifies it as a user-triggered fetcher, and the only effective controls are server-side authentication and WAF rules, not robots.txt.
Agentic AI browsers (ChatGPT Atlas, Perplexity Comet) similarly use standard Chrome user-agents with no distinguishing token, so robots.txt cannot address them.
For these cases, look beyond this tool to your WAF and access-control layer. For crisis or sensitive content: assume that once a page is publicly accessible, AI assistants may fetch and quote it immediately.
Content Signals are preferences, not enforcement: Google has not committed to honoring them for AI Overviews, and the directive is not yet a ratified standard (the IETF AIPREF Working Group's vocabulary lock targets August 2026).
The llms.txt proposal (Howard, September 2024) is intentionally excluded from this generator because no major AI crawler reads or honors it as of May 2026.

Generated robots.txt


For Educational Purposes Only

Based on the Three Streams GEO Methodology

Crawler data verified May 2026 (OAI-AdsBot added April 2026; Anthropic three-bot split confirmed February 2026); Content Signals Policy added April 2026. Verify current user-agents and signal directive name before deployment — IETF AIPREF vocabulary lock targets August 2026.