Citegrove · Reddit-first AI citation outreach

AI Crawler Audit

Before you worry about getting cited in ChatGPT or Perplexity, make sure those AIs can actually crawl your site. Paste any domain — we'll fetch your robots.txt and check 14 known AI crawler tokens (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, ChatGPT-User, claude-web, and more), showing exactly which are allowed, blocked, or partially restricted. No login, no spam — just the truth from your robots.txt.

How it works

Three steps, fourteen crawlers

  1. Fetch your robots.txt

    We GET https://yourdomain/robots.txt directly — no API, no third-party.

  2. Parse User-agent groups

    Standard RFC 9309 parser. Specific tokens win over wildcards (User-agent: *).

  3. Match against the AI list

    Each of the 14 known AI crawlers gets a verdict: allowed, blocked, partial (some paths blocked), or unknown.
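The three steps can be sketched with Python's standard-library robots.txt parser. A caveat: urllib.robotparser predates RFC 9309 and matches User-agent tokens by case-insensitive substring rather than exact token, so this is an approximation of the audit, not its implementation. The crawler subset and sample robots.txt below are illustrative.

```python
from urllib import robotparser

# Illustrative subset of the 14 audited tokens.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot",
               "Google-Extended", "ChatGPT-User", "claude-web"]

# Example file; in production you would GET https://yourdomain/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

def audit(robots_txt, probe_paths=("/", "/private/page")):
    """Return a verdict per crawler: 'allowed', 'blocked', or 'partial'.

    The real audit also reports 'unknown' (e.g. when the fetch fails),
    which a parse-only sketch cannot observe.
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    verdicts = {}
    for token in AI_CRAWLERS:
        results = [rp.can_fetch(token, path) for path in probe_paths]
        if all(results):
            verdicts[token] = "allowed"   # every probed path fetchable
        elif not any(results):
            verdicts[token] = "blocked"   # nothing fetchable
        else:
            verdicts[token] = "partial"   # some paths blocked
    return verdicts

print(audit(ROBOTS_TXT))
```

With this sample file, GPTBot comes back blocked (its specific group disallows everything, overriding the wildcard), while the other tokens fall through to the * group and come back partial because /private/ is off-limits.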

FAQ

Frequently asked

Why does this matter?

If your robots.txt blocks GPTBot or ClaudeBot, ChatGPT and Claude can't crawl your site for training. Block ChatGPT-User or Perplexity-User and the live AI assistants can't fetch your page when answering a user — meaning you literally cannot be cited.

What's the difference between training and live-search crawlers?

Training crawlers (GPTBot, ClaudeBot, Google-Extended) ingest your content for the next model. Live-search crawlers (ChatGPT-User, Perplexity-User, claude-web) fetch your page in real time when a user asks a question. Blocking live-search hurts more than blocking training.

Should I block training but allow search?

If you don't want your content used for AI training but DO want to be cited in answers, block GPTBot/ClaudeBot/Google-Extended and allow ChatGPT-User/Perplexity-User/claude-web. The audit shows you both columns so you can see the split.
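As a sketch, a robots.txt implementing that split could look like the following (the tokens are the vendors' documented user-agents; adjust paths to taste):

```
# Opt out of model training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Stay fetchable for live answers (and citations)
User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: claude-web
Allow: /
```

The explicit Allow groups are redundant when nothing else restricts those tokens, but they guard against a later blanket User-agent: * / Disallow: /, since under RFC 9309 a crawler that matches a specific group ignores the wildcard group entirely.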

My site has no robots.txt. Is that bad?

No — missing robots.txt means everything is implicitly allowed. We report all 14 crawlers as "allowed". You can still write one to opt OUT of specific crawlers.
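You can see this default with Python's stdlib parser: parsing an empty file (the moral equivalent of a missing robots.txt) leaves every path fetchable for every token.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([])  # no robots.txt: zero rules parsed

# With no Disallow groups, any crawler may fetch any path.
print(rp.can_fetch("GPTBot", "/"))              # True
print(rp.can_fetch("ClaudeBot", "/blog/post"))  # True
```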

Why these 14 crawlers?

They're the publicly documented user-agents used by ChatGPT, Claude, Perplexity, Gemini/Google AI, Apple Intelligence, Meta AI, ByteDance, and Common Crawl (which most open LLMs train on). New ones appear occasionally — we update the list as they're announced.

Do you store my domain?

No — domains aren't persisted. We log a hashed IP for rate limiting (15 audits per IP per 24 hours).
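A hashed-IP rate limiter of the kind described could be sketched as follows; the salt, the in-memory store, and the sliding-window bookkeeping are all assumptions for illustration, not Citegrove's actual implementation.

```python
import hashlib
import time

WINDOW = 24 * 3600  # 24-hour sliding window, in seconds
LIMIT = 15          # audits per hashed IP per window
SALT = b"server-side-secret"  # hypothetical salt; only the hash is stored

_hits = {}  # hashed IP -> list of request timestamps

def allow_audit(ip, now=None):
    """Return True if this IP may run another audit right now."""
    now = time.time() if now is None else now
    key = hashlib.sha256(SALT + ip.encode()).hexdigest()  # never store the raw IP
    # Keep only timestamps still inside the window.
    recent = [t for t in _hits.get(key, []) if now - t < WINDOW]
    if len(recent) >= LIMIT:
        _hits[key] = recent
        return False
    recent.append(now)
    _hits[key] = recent
    return True
```

Hashing with a server-side salt means the log can enforce the limit without ever holding a raw IP address.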