Why don't you do this in the browser?

**Because CORS**. A script running in your browser **is not allowed** to read `https://othersite.com/robots.txt`, because `othersite.com` doesn't send an `Access-Control-Allow-Origin` header. **Server-side** the problem disappears: CORS is a browser-side protection, **not a server-to-server one**. That's why every serious validator (Google Search Console, Bing Webmaster, Screaming Frog) runs from a server. **Ours is no exception**. Bonus: server-side we see the **real HTTP status** (if your `robots.txt` returns a 500, the validator tells you). A browser would only show "blocked by CORS".

What is "longest-prefix matching" for robots rules?

**The algorithm Googlebot uses** (and most modern crawlers). If your file has:

```
User-agent: Googlebot
Disallow: /admin
Allow: /admin/public
```

And you test the path `/admin/public/report.pdf`, the validator (and Google) picks the rule by **"longest matching prefix wins"**:

- `Disallow: /admin` matches (6 chars)
- `Allow: /admin/public` also matches (13 chars, **longer**)

**Allow wins**, so the path is **allowed**. The old "first match wins" algorithm (used by older Bing) would give a different answer, but **modern crawlers use longest-match**. The validator implements **exactly that logic**, so the **"Allowed"** verdict in the tester matches what Googlebot actually does.
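The rule selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the validator's actual code: the `decide` function and the `(directive, prefix)` tuple shape are assumptions made for the example, and wildcards (`*`, `$`) are left out on purpose.

```python
from typing import List, Optional, Tuple

# A rule is (directive, path prefix), e.g. ("disallow", "/admin").
# This tuple shape is an assumption for the sketch.
Rule = Tuple[str, str]


def decide(rules: List[Rule], path: str) -> Tuple[bool, Optional[Rule]]:
    """Return (allowed, winning_rule) using longest-matching-prefix."""
    best: Optional[Rule] = None
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules never apply
        if best is None or len(prefix) > len(best[1]):
            best = (directive, prefix)
        elif len(prefix) == len(best[1]) and directive == "allow":
            best = (directive, prefix)  # on a tie, the less restrictive rule wins
    if best is None:
        return True, None  # no rule matched: allowed by default
    return best[0] == "allow", best


rules = [("disallow", "/admin"), ("allow", "/admin/public")]
print(decide(rules, "/admin/public/report.pdf"))  # (True, ('allow', '/admin/public'))
print(decide(rules, "/admin/users"))              # (False, ('disallow', '/admin'))
```

Returning the winning rule alongside the verdict mirrors what the URL tester reports: not just "Allowed", but which line decided it.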

I have both `User-agent: *` and `User-agent: Googlebot`. Which wins?

**The more specific one**. Googlebot, when it sees a `User-agent: Googlebot` group, **completely ignores** the `User-agent: *` group. **It's all or nothing**: Googlebot does not mix rules across groups. The classic trap: you put an important `Disallow: /admin` in the `*` group, then add a tiny Googlebot-specific group with **just one rule**, `Crawl-delay: 5`. **Googlebot now ignores `Disallow: /admin`**, because the entire `*` group is invisible to it. **Fix**: if you want a Googlebot-specific override, **duplicate every rule** that should still apply (here `Disallow: /admin`). The `Sitemap:` line is usually declared **outside** any group and is global anyway, so it does not need duplicating. The validator's **per-bot view** shows you **exactly what Googlebot really sees**.
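To make the "all or nothing" behaviour concrete, here is a small Python sketch of group selection. It is a simplification under stated assumptions: `parse_groups` and `rules_for` are hypothetical helper names, tokens are compared by exact (case-insensitive) match, and `Sitemap:` lines are treated as global and skipped.

```python
from typing import Dict, List


def parse_groups(text: str) -> Dict[str, List[str]]:
    """Map each lower-cased User-agent token to the rule lines of its group."""
    groups: Dict[str, List[str]] = {}
    current: List[str] = []  # tokens of the group currently being read
    in_rules = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key == "sitemap":
            continue  # global, belongs to no group
        else:
            in_rules = True
            for token in current:
                groups[token].append(f"{key}: {value}")
    return groups


def rules_for(groups: Dict[str, List[str]], bot: str) -> List[str]:
    """A bot with its own group gets only that group; otherwise it falls back to '*'."""
    bot = bot.lower()
    if bot in groups:
        return groups[bot]
    return groups.get("*", [])


ROBOTS = """\
User-agent: *
Disallow: /admin

User-agent: Googlebot
Crawl-delay: 5
"""

groups = parse_groups(ROBOTS)
print(rules_for(groups, "Googlebot"))  # ['crawl-delay: 5']
print(rules_for(groups, "Bingbot"))    # ['disallow: /admin']
```

The Googlebot result contains no `Disallow` at all, which is exactly the trap described above.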

My sitemap.xml has 50,001 URLs, why does the validator complain?

**Because that exceeds the official spec**. `sitemaps.org` says: a single sitemap can hold **up to 50,000 URLs** and weigh **at most 50 MB** (uncompressed). Google won't read the overflow; **it just truncates**. **Fix**: build a **sitemap index** (`<sitemapindex>`) that links to several plain sitemaps (`<urlset>`):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages-2.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
</sitemapindex>
```

Each child sitemap can have **its own 50,000**, so an index realistically lets you have **up to 2.5 billion URLs** (the limit is 50,000 child sitemaps per index × 50,000 URLs each). Our validator **automatically fetches** up to 50 nested sitemaps and validates each.
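If you generate sitemaps yourself, the split can be sketched like this. The helper names `split_urls` and `build_index` and the `sitemap-pages-N.xml` naming are assumptions for the example, not something the validator requires.

```python
from typing import List
from xml.sax.saxutils import escape

MAX_URLS = 50_000  # per-sitemap limit from sitemaps.org


def split_urls(urls: List[str], size: int = MAX_URLS) -> List[List[str]]:
    """Cut a long URL list into chunks that each fit one sitemap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(child_urls: List[str]) -> str:
    """Render a sitemap index that points at the given child sitemaps."""
    entries = "\n".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>" for u in child_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>\n"
    )


urls = [f"https://example.com/p/{i}" for i in range(50_001)]
chunks = split_urls(urls)
print([len(c) for c in chunks])  # [50000, 1]

index = build_index(
    [f"https://example.com/sitemap-pages-{n}.xml" for n in range(1, len(chunks) + 1)]
)
print(index)
```

The sketch only checks the URL count; a real generator would also need to keep each file under the 50 MB uncompressed limit.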

A URL is in the sitemap, but it's not indexed?

**A sitemap is a hint, not a guarantee**. Google looks at `<loc>` entries but ultimately **its own algorithm decides** whether to index a page. A URL in the sitemap **can still be missing from Search**. The usual reasons:

- **The page has `<meta name="robots" content="noindex">`** - the sitemap says "crawl", the tag says "don't index", **the tag wins**
- **The page returns 404 or 5xx** - Google drops it from the index quickly
- **Duplicate content** - Google sees the page is a **copy** of another, indexes only one
- **Low quality** - Google decides the page is **thin content** (little text, auto-generated) and skips it
- **Blocked by robots.txt** - the validator surfaces this

**A sitemap is helpful**, but **not magic**. It's a **map** for Google, not an **indexing mandate**. The validator helps with what is checkable: file validity, completeness, no duplicates.

Why does the validator warn "no Sitemap line in robots.txt"?

**Because that's the standard Google and Bing recommendation**. Crawlers look for the sitemap link in three places:

1. in `robots.txt` (`Sitemap: https://...`),
2. in **Google Search Console** (manual submission),
3. at the default `/sitemap.xml`.

**Missing `Sitemap:` in `robots.txt`** = you skip the **free** sitemap-discovery mechanism. Every crawler on the planet fetches `robots.txt` on first visit; if it finds `Sitemap: ...` there, **it immediately follows the link**. Without it, it has to guess (it tries `/sitemap.xml`, but if your sitemap is at `/sitemap_index.xml` instead, **it might not find it**). **Easy fix**: add **one line** at the end of the file:

```
Sitemap: https://example.com/sitemap.xml
```

You can have **several** (`Sitemap: ...` repeated, e.g. one per language).
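The check behind this warning is simple enough to sketch. The function name `sitemap_lines` is hypothetical; the sketch assumes comments start at `#` and that the directive name is case-insensitive.

```python
from typing import List


def sitemap_lines(robots_text: str) -> List[str]:
    """Collect the URL of every `Sitemap:` line in a robots.txt body."""
    found = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


ROBOTS = """\
User-agent: *
Disallow: /admin

Sitemap: https://example.com/sitemap-en.xml
sitemap: https://example.com/sitemap-de.xml
"""

print(sitemap_lines(ROBOTS))
# ['https://example.com/sitemap-en.xml', 'https://example.com/sitemap-de.xml']
print(sitemap_lines("User-agent: *\nDisallow: /admin\n"))  # [] -> would trigger the warning
```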

What are these "unknown directives" in my robots.txt?

**Any directive** that's not in the **widely recognised set** (User-agent, Allow, Disallow, Crawl-delay, Sitemap, Host). Common non-standard ones:

- **`Clean-param`**: Yandex-only, strips URL parameters from crawl
- **`Request-rate`**: an old `Crawl-delay` cousin, most crawlers ignore it
- **`Visit-time`**: a hint about when to crawl (e.g. `0500-0845`), ignored everywhere except Yandex
- **Malformed comments**: sometimes someone writes `# comment` instead of `#comment`; some crawlers parse it, some flag it

The validator surfaces them as **info (gray)**, not errors. **They don't break indexing**, but you should know they're there. If you see something exotic, **you probably inherited** it from an old SEO consultant, and it is safe to remove.
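A rough sketch of how such a scan could work, assuming the six directive names listed above are the known set. `unknown_directives` is a hypothetical helper, not the validator's API.

```python
from typing import List, Tuple

# The directive names listed above, lower-cased for comparison.
KNOWN = {"user-agent", "allow", "disallow", "crawl-delay", "sitemap", "host"}


def unknown_directives(robots_text: str) -> List[Tuple[int, str]]:
    """Return (line_number, directive) for every directive not in KNOWN."""
    out = []
    for number, raw in enumerate(robots_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # comments are not directives
        key, sep, _ = line.partition(":")
        if sep and key.strip().lower() not in KNOWN:
            out.append((number, key.strip()))
    return out


SAMPLE = """\
User-agent: *
Disallow: /admin
Clean-param: ref /shop
Request-rate: 1/5
# just a note
Visit-time: 0500-0845
"""

print(unknown_directives(SAMPLE))
# [(3, 'Clean-param'), (4, 'Request-rate'), (6, 'Visit-time')]
```

Keeping the line number with each finding is what lets a report point at the exact place in the file.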

Can I block ChatGPT and Claude from training on my site?

**Yes, each bot has its own User-agent**, so you can block them individually. **Current** (as of 2026):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /
```

**Note**: `Google-Extended` blocks **only Bard/Gemini training**; it **does not** block regular Googlebot (so you **don't fall out of search**, just out of Google's AI training). Blocking `GPTBot` stops **only training**; `ChatGPT-User` is the real-time fetcher (when a user asks ChatGPT to look something up live). **The validator lets you check** whether your `Disallow: /` for GPTBot **actually applies**: click the "GPTBot" chip in the per-bot view and you see exactly the rules.
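For a quick local cross-check of a block like the one above, Python's standard library ships `urllib.robotparser`. Treat this as a rough sanity test rather than a replacement for the per-bot view: the standard-library parser does not follow Google's longest-prefix logic in every case, and the shortened file and example URL below are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())  # parse from memory, no network request

for bot in ("GPTBot", "ClaudeBot", "Google-Extended", "Googlebot"):
    allowed = parser.can_fetch(bot, "https://example.com/article")
    print(f"{bot}: {'allowed' if allowed else 'disallowed'}")
```

With this file the three AI-related agents come back as disallowed, while Googlebot, which has no group of its own here, stays allowed.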

Why should every sitemap URL have a `lastmod`?

**Because Google uses it to prioritise recrawl**. If you submit a sitemap with 10,000 URLs but only 50 of them changed since the last crawl (fresh `lastmod`), Google **starts with those 50**. Without `lastmod` it has to **probe every URL** to see what changed, which is **slower and a waste of crawl budget**. **The validator shows `lastmod` coverage** as a percentage: if you see **30%**, that means 70% of URLs have no date, and Google treats those as "unknown last change". **Goal: 100%** of URLs in the sitemap. The `lastmod` format must be **W3C/ISO-8601**:

- `2026-05-11` (day)
- `2026-05-11T14:30:00Z` (UTC)
- `2026-05-11T14:30:00+02:00` (with offset)

**Invalid**: `11/05/2026`, `2026-5-11`, `May 11, 2026`. The validator catches these and points at the offending line.
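A minimal sketch of both checks, format and coverage, assuming only the three shapes listed above need to pass (the W3C format also permits coarser values such as a bare year, which this sketch rejects). The names `is_valid_lastmod` and `lastmod_coverage` are hypothetical, and time-of-day ranges are not verified.

```python
import re
from datetime import date
from typing import List, Optional

# Day, or day + time with "Z" or a numeric offset. Seconds are optional.
_LASTMOD = re.compile(
    r"\d{4}-\d{2}-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
)


def is_valid_lastmod(value: str) -> bool:
    """True if the value has one of the accepted shapes and a real calendar date."""
    if not _LASTMOD.fullmatch(value):
        return False
    try:
        date.fromisoformat(value[:10])  # rejects impossible dates such as 2026-02-30
    except ValueError:
        return False
    return True


def lastmod_coverage(values: List[Optional[str]]) -> float:
    """Percentage of URLs whose lastmod is present and valid."""
    if not values:
        return 0.0
    ok = sum(1 for v in values if v and is_valid_lastmod(v))
    return 100.0 * ok / len(values)


for sample in ("2026-05-11", "2026-05-11T14:30:00Z", "2026-05-11T14:30:00+02:00",
               "11/05/2026", "2026-5-11", "May 11, 2026"):
    print(sample, is_valid_lastmod(sample))

print(lastmod_coverage(["2026-05-11", None, "11/05/2026", "2026-05-11T14:30:00Z"]))  # 50.0
```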

robots.txt + sitemap.xml validator - free

Why isn't your site showing in Google? Start with robots.txt and sitemap.xml

You paste a URL, pick a mode (`robots.txt` alone, `sitemap.xml` alone, or Both together) and hit Check. Our server fetches the publicly-accessible files, parses them and shows you exactly what Googlebot would see when it visits your domain.

The validator does three things you can't do from the browser:

Pulls `robots.txt` from the actual origin, not your CDN cache, the same bytes a crawler would get;
Simulates real bots: Googlebot, Bingbot, GPTBot, ChatGPT-User, ClaudeBot. Pick a bot from the chips and you see exactly the rules that apply to it (with longest-prefix matching, the algorithm Google really uses);
Parses sitemap.xml (including a sitemap index with nested sitemaps), checks the spec limits (50,000 URLs, 50 MB), validates W3C/ISO-8601 dates, `changefreq`, `priority` and surfaces duplicate `<loc>` entries.

Everything comes back as a tidy report with errors (red), warnings (yellow) and info (gray). Plus a URL tester: paste `/admin` or `/private/reports.pdf` and instantly see "allowed" or "disallowed" for the selected bot.

Why bother? The single most common reason a new site never gets indexed is a slip in robots.txt (`Disallow: /` where `Disallow: /admin` was meant) or no sitemap link in robots.txt. The validator catches both in 5 seconds.

How to use it

Pick a mode in the segmented bar at the top. If unsure, choose "Both": we'll fetch `/robots.txt` first, find the sitemap link inside, and pull that too.
Paste your URL into the URL field. Bare domain (`example.com`), full URL (`https://example.com`) or a direct link to a sitemap (`https://example.com/sitemap.xml`) all work.
Hit "Check" (or press Enter). The server fetches with a 10-second timeout and a 50 MB cap, so even huge sitemaps won't stall the validation.
The robots.txt section shows: HTTP status, file size, group count, total Allow/Disallow rules. Issues are split into 3 severity levels (error / warning / info), each with the line number where it lives.
Per-bot view: click the bot chips (Googlebot, Bingbot, GPTBot, ChatGPT-User and others). You see exactly the rules that apply to that bot, plus we tell you which User-Agent token in your file matched.
URL tester: type any path (e.g. `/admin` or `/api/users`) and see "Allowed" or "Disallowed" plus the exact rule that decided. Perfect for figuring out why a specific URL is missing from Google.
The sitemap section shows: type (urlset / sitemapindex), URL count, `lastmod` coverage (%), newest and oldest date, plus a sample of the first 100 URLs in a table. If it's a sitemap index, we automatically fetch the nested sitemaps (up to 50 for safety).
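For a feel of what goes into the sitemap summary in the last step, here is a small sketch using the standard-library XML parser. The `summarize` function and its dictionary keys are assumptions for the example. It only counts entries and does not validate date formats or fetch nested sitemaps.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Any, Dict

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def summarize(xml_text: str) -> Dict[str, Any]:
    """Report type, entry count, duplicate <loc> values and lastmod coverage."""
    root = ET.fromstring(xml_text)
    kind = root.tag.split("}", 1)[-1]  # "urlset" or "sitemapindex"
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]
    with_lastmod = len(root.findall(".//sm:lastmod", NS))
    duplicates = sorted(u for u, n in Counter(locs).items() if n > 1)
    coverage = round(100 * with_lastmod / len(locs), 1) if locs else 0.0
    return {
        "type": kind,
        "count": len(locs),
        "duplicates": duplicates,
        "lastmod_coverage": coverage,
        "over_limit": len(locs) > 50_000,
    }


SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2026-05-11</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/</loc></url>
</urlset>
"""

report = summarize(SAMPLE)
print(report["type"], report["count"], report["duplicates"])
# urlset 3 ['https://example.com/']
```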

When this is useful

Five situations where the validator saves you a weekend in Search Console:

New site won't index in Google. You check `robots.txt`, the validator flags `Disallow: /` under `User-agent: *` (the classic dev-environment leftover). You change it to `Disallow: /admin` and indexing starts within 24 hours.
Domain migration or redesign. After moving to a new platform, you validate the old and the new sitemap. The validator shows 1,200 URLs missing in the new one (forgotten language prefix). You fix it in the CMS before Google notices the drop.
SEO audit before a big launch. A client asks "why isn't the shop showing in search". The validator finds `User-agent: Googlebot` + `Disallow: /products`, someone (knowingly or not) blocked the whole product catalogue. You'd never have spotted that without the per-bot view.
GPTBot, ClaudeBot, Google-Extended. You want to opt out of AI training on your content. The validator's per-bot view shows whether your `Disallow: /` for `GPTBot` actually applies, or whether GPTBot ends up in a different group than you intended (for example, falling back to a `*` group with `Allow: /` because the `User-agent` token is misspelled).
CI/CD pre-deploy checks. Plug the validator into your pipeline (a plain `curl` with JSON does it) and builds fail when `robots.txt` has `Disallow: /` under `User-agent: *`. Selling that to a senior DevOps takes 10 minutes. Savings: thousands.
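The pre-deploy check from the last scenario can also be approximated locally, without calling any service. The sketch below is an assumption-laden stand-in, not the validator's JSON API: `blocks_everything` and `exit_code_for` are hypothetical names, and the check ignores the edge case where an `Allow: /` in the same group would cancel the block.

```python
def blocks_everything(robots_text: str) -> bool:
    """True if a group that names `User-agent: *` contains `Disallow: /`."""
    agents = []       # User-agent tokens of the group being read
    in_rules = False
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif key in ("allow", "disallow"):
            in_rules = True
            if key == "disallow" and value == "/" and "*" in agents:
                return True
    return False


def exit_code_for(robots_text: str) -> int:
    """Exit code for a CI step: 1 fails the build, 0 lets it pass."""
    return 1 if blocks_everything(robots_text) else 0


print(exit_code_for("User-agent: *\nDisallow: /\n"))       # 1
print(exit_code_for("User-agent: *\nDisallow: /admin\n"))  # 0
```

In a pipeline you would read the built `robots.txt` from disk and pass the result to `sys.exit`, so a leftover dev-environment rule stops the deploy.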

Need to author the files? Generate them in the robots.txt builder and the sitemap.xml builder. For social previews of the same URLs, use the OpenGraph preview.

Questions and answers

Only to our server, which then connects to your domain to fetch publicly-accessible files: `/robots.txt` and `/sitemap.xml`. These are the same files every crawler on the planet can grab in 5 seconds (that's the point of them being public). We do not store your URL, we do not log the content, and we do not pass it to any third party. The validation is stateless: once the result is rendered, we forget it.
