01 What is robots.txt?
robots.txt is a plain-text file at the root of your domain that tells web crawlers which URLs they're allowed to fetch. It's the oldest and simplest tool in the SEO toolkit — and also the one that breaks the most sites when misconfigured.
The file lives at https://yourdomain.com/robots.txt. There's no other valid location. If a crawler can't fetch it (404, 5xx, timeout), it generally assumes "no rules" and crawls freely — though some crawlers treat a 5xx as "block everything" until the file becomes available.
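The fallback behaviour described above can be sketched as a small decision function. This is an illustration only, not any particular crawler's implementation: the function name, the `strict_on_server_error` flag and the returned labels are all hypothetical, and treating a timeout like a 5xx is an assumption of this sketch.

```python
from typing import Optional


def crawl_policy_for_status(status_code: Optional[int],
                            strict_on_server_error: bool = False) -> str:
    """Map the result of fetching /robots.txt to a crawl policy.

    status_code is the final HTTP status (after redirects), or None when
    the request timed out or the connection failed.
    """
    if status_code is not None and 200 <= status_code < 300:
        # File fetched: parse it and obey the rules inside.
        return "use_rules"
    if status_code is not None and 400 <= status_code < 500:
        # No file (e.g. 404): treated as "no rules".
        return "allow_all"
    # Everything else (5xx, timeouts): lenient crawlers carry on,
    # strict ones back off until the file is reachable again.
    return "disallow_all" if strict_on_server_error else "allow_all"
```

For example, `crawl_policy_for_status(503, strict_on_server_error=True)` returns `"disallow_all"`, which is why a flaky server can quietly pause crawling even when the file itself is fine.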
02 Syntax basics
The format is dead simple: groups of rules, each starting with User-agent: (which crawler), followed by Allow: and Disallow: rules.
User-agent: *
Disallow: /admin/
Disallow: /tmp/
Allow: /tmp/public/
User-agent: Googlebot
Allow: /
Sitemap: https://example.com/sitemap.xml
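- User-agent: * applies to all crawlers.
- Disallow: /path blocks any URL starting with /path.
- Allow: can override a broader Disallow.
- Sitemap: points crawlers to your XML sitemap. Always include it.
One catch: a crawler obeys only the most specific group that matches its name. In the example above, Googlebot follows its own group and ignores the * rules, so /admin/ and /tmp/ stay open to Googlebot unless you repeat those Disallow lines inside its group.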
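To see how these rules combine, here is a minimal sketch of rule evaluation, assuming the precedence described in RFC 9309: a crawler uses the group that names it (falling back to `*`), and within a group the longest matching path wins, with Allow winning a tie. The names `GROUPS`, `select_group` and `is_allowed` are hypothetical, the example file is written out by hand rather than parsed, and wildcards are left out to keep it short.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # ("allow" or "disallow", path prefix)

# The example file above, written out by hand (no parsing in this sketch).
GROUPS: Dict[str, List[Rule]] = {
    "*": [("disallow", "/admin/"), ("disallow", "/tmp/"), ("allow", "/tmp/public/")],
    "googlebot": [("allow", "/")],
}


def select_group(groups: Dict[str, List[Rule]], crawler_name: str) -> List[Rule]:
    """Return the rules of the group naming this crawler, else the '*' group."""
    name = crawler_name.lower()
    for agent, rules in groups.items():
        if agent != "*" and agent in name:
            return rules
    return groups.get("*", [])


def is_allowed(rules: List[Rule], path: str) -> bool:
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    best_length, allowed = -1, True
    for kind, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and kind == "allow"
        if longer or tie_for_allow:
            best_length, allowed = len(prefix), kind == "allow"
    return allowed
```

With these rules, a generic crawler is kept out of /tmp/cache.dat but let into /tmp/public/logo.png, because /tmp/public/ is the longer match. Googlebot, reading only its own group, is allowed into /admin/ as well.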
03 Common patterns
Block staging or development environments
User-agent: *
Disallow: /
This blocks the entire site. Use it on staging — never on production. (We'll come back to that.)
Block an admin area
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Block faceted-search and internal-search URLs
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /search?
Allow CSS and JavaScript
User-agent: *
Allow: /*.css$
Allow: /*.js$
This used to be needed because some old WordPress themes blocked /wp-includes/. Today it's mostly belt-and-braces — but Googlebot needs CSS and JS to render your pages, so never block them.
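The wildcard patterns used in this section are easy to misread, so here is a rough sketch of how they match. It assumes the common semantics: `*` matches any run of characters, a trailing `$` pins the end of the URL, and everything else is a literal prefix. The helper names `pattern_to_regex` and `matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape each literal chunk, then let '*' stand for "anything".
    body = ".*".join(re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(body + ("$" if anchored_end else ""))


def matches(pattern: str, url_path: str) -> bool:
    """True if url_path (path plus query string) is covered by the pattern."""
    return pattern_to_regex(pattern).match(url_path) is not None
```

Two things worth checking with a helper like this: `/*?sort=` only catches URLs where sort is the first query parameter (`/shoes?colour=red&sort=price` slips through), and `/*.css$` does not cover a versioned stylesheet such as `/assets/site.css?v=2`.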
04 The disastrous mistake
Every senior SEO has seen this happen at least once:
User-agent: *
Disallow: /
This is the staging robots.txt. It blocks everything. When a developer copies the staging codebase to production without updating the file, Google starts dropping pages from the index within hours. By the next week, organic traffic has cratered.
The single most expensive line of code in SEO is Disallow: / on production.
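One way to catch this before it ships is a tripwire in the deploy pipeline. The sketch below is a deliberately simple check, not a full parser: `blocks_entire_site` is a hypothetical helper that looks for a bare `Disallow: /` in the group for a given user agent and ignores any Allow rules that might partly re-open the site.

```python
def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """Return True if the group for `agent` contains a bare 'Disallow: /'."""
    current_agents = []   # user agents named by the group being read
    in_rules = False      # True once the group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in current_agents:
                return True
    return False
```

In a production build you would read the robots.txt that is about to be deployed, call this function, and fail the build if it returns True.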
Set up monitoring so that when /robots.txt changes, you get a Slack ping. Catching this within an hour vs. catching it after the next crawl cycle is the difference between a non-event and a quarter of recovery.
05 robots.txt vs. noindex
These two tools look similar but do completely different things. Confusing them is the second-most-expensive SEO mistake.
| | robots.txt Disallow | noindex meta tag |
|---|---|---|
| What it does | Blocks crawling | Allows crawling, blocks indexing |
| Removes from index? | No (paradoxically) | Yes |
| Saves crawl budget? | Yes | No |
| Use when… | You want crawlers off entire areas | You want to remove a page from search |
The paradox: if you Disallow a URL in robots.txt, Google can't crawl it — which means it can't see any noindex tag on it. The URL stays indexed (often with a "no description available" snippet) until you remove the disallow.
The fix: to remove a URL from the index, use noindex. Once Google has recrawled and processed the noindex, you can add a robots.txt disallow if you want to stop further crawling.
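The conflict described above is easy to audit from the site owner's side, since you can fetch your own pages even when crawlers are told not to. A minimal sketch, assuming you already know whether the URL is disallowed and have its HTML in hand; `has_noindex` and `removal_status` are hypothetical helpers, and only the robots meta tag is checked (not the X-Robots-Tag header).

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag carries noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


def removal_status(disallowed: bool, html: str) -> str:
    """Describe what a crawler can do with this page."""
    noindex = has_noindex(html)
    if disallowed and noindex:
        return "conflict: crawlers cannot see the noindex, lift the disallow first"
    if noindex:
        return "ok: page can be crawled and will drop out of the index"
    if disallowed:
        return "crawl blocked only: the URL can stay indexed"
    return "crawlable and indexable"
```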
06 Crawl budget and robots.txt
For most sites, crawl budget isn't a real concern — Google has plenty of capacity for a 5,000-page site. But for large e-commerce or publishing sites with millions of URLs, robots.txt becomes a sharp tool for directing Googlebot toward what matters.
Common budget-savers:
- Block faceted-navigation URLs (?colour=, ?size=, ?sort=).
- Block calendar widgets that generate infinite future-dated URLs.
- Block parameterised search results.
- Block thin user-generated archive pages (e.g. /users/ profile listings).
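Before blocking anything, it helps to measure how much of your URL inventory is actually facet noise. The sketch below assumes you have a list of URLs from a crawl export or server log; `FACET_PARAMS`, `is_facet_url` and `facet_share` are hypothetical names, and the parameter set simply mirrors the examples in the list above.

```python
from typing import Iterable
from urllib.parse import parse_qs, urlsplit

# Query parameters treated as faceted navigation (taken from the list above).
FACET_PARAMS = {"colour", "size", "sort"}


def is_facet_url(url: str) -> bool:
    """True if the URL carries at least one facet parameter."""
    query = urlsplit(url).query
    # keep_blank_values so that "?sort=" with no value still counts.
    return any(name in FACET_PARAMS for name in parse_qs(query, keep_blank_values=True))


def facet_share(urls: Iterable[str]) -> float:
    """Fraction of the given URLs that are facet variations."""
    url_list = list(urls)
    if not url_list:
        return 0.0
    return sum(is_facet_url(url) for url in url_list) / len(url_list)
```

If the share is small, robots.txt rules for facets are unlikely to move the needle. If most crawled URLs are facets, that is crawl budget you can redirect.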
The goal isn't "fewer URLs crawled overall" — it's "more crawl budget spent on URLs that matter."
07 Validating your robots.txt
Always validate after editing:
- Google Search Console: the robots.txt report shows the file Google last fetched and any parsing problems, and URL Inspection tells you whether a given URL is blocked. (The older standalone robots.txt Tester has been retired.)
- Smart SEO Audit automatically checks your robots.txt on every audit and flags errors, including missing or unreachable files.
- Manual check: visit https://yourdomain.com/robots.txt in a browser. If it doesn't load, no crawler can read it either.
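For a quick offline sanity check between edits, a few lines of linting catch the most common typos. This is a rough sketch, not a replacement for the tools above: `lint_robots_txt` is a hypothetical helper, and the list of known fields (including the non-standard but common Crawl-delay) is an assumption of this example.

```python
from typing import List

# Fields this sketch accepts; Crawl-delay is non-standard but widely seen.
KNOWN_FIELDS = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots_txt(text: str) -> List[str]:
    """Return human-readable problems found in a robots.txt body."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            seen_agent = True
        elif field in ("allow", "disallow"):
            if not seen_agent:
                problems.append(f"line {number}: rule appears before any User-agent")
            elif value and not value.startswith(("/", "*")):
                problems.append(f"line {number}: path should start with '/'")
    return problems
```

An empty list means nothing obvious is wrong. It does not mean the rules do what you intended, so still test real URLs.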
Frequently asked questions
Does robots.txt stop a page from being indexed?
No — this is the most common robots.txt mistake. Disallow only blocks crawling, not indexing. A blocked URL can still appear in search results (often with no description) if other pages link to it. To keep a page out of the index, allow crawling and use a noindex meta tag, or protect it behind authentication.
Where must the robots.txt file be located?
It must sit at the root of your domain — example.com/robots.txt. Crawlers only look there; a robots.txt in a subdirectory is ignored. Each subdomain needs its own robots.txt, and the protocol matters too (https and http are treated separately).
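As a small illustration of that rule, the robots.txt that governs any page can be derived from the page URL by keeping the scheme and host and replacing everything else. `robots_txt_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Same scheme, same host (and port); path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
```

So a page on https://shop.example.com/ is governed by https://shop.example.com/robots.txt, not by the file on example.com.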
Should I block CSS and JavaScript in robots.txt?
No. Google needs to fetch your CSS and JS to render the page the way users see it. Blocking these resources can cause Google to misjudge your layout, mobile-friendliness and content, hurting rankings. Allow rendering resources and only disallow genuinely private or low-value paths.