robots.txt controls what search engines crawl — and a single over-broad rule can hide your whole site. Here's how it actually behaves, why blocking isn't the same as deindexing, and how not to block your own pages.
It's a crawl-management tool, not an indexing tool. That one distinction explains almost every robots.txt mistake.
robots.txt is a plain text file at yourdomain.com/robots.txt that crawlers read before crawling. With User-agent and Disallow / Allow rules, you tell bots which paths to skip — typically admin areas, internal search, login pages, and infinite parameter URLs that would waste crawl budget. You also use it to declare your sitemap. What robots.txt does not do is remove pages from the index — and conflating the two is where sites get into trouble.
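To make those rules concrete, here is a minimal sketch that parses a sample file with Python's standard-library urllib.robotparser. The file contents and the paths in it (/admin/, /search, /login) are assumptions chosen to mirror the examples above, not a recommended configuration. Python's parser applies the first matching rule and treats wildcards literally, so it only approximates how Google evaluates a file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; the paths are assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /login

Sitemap: https://yourdomain.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Disallowed path: a compliant crawler skips it.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/admin/users"))  # False

# No rule matches, so crawling is allowed by default.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/blog/post"))  # True
```

A False result only means a compliant crawler will not fetch the URL. It says nothing about whether the URL is in the index.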
The most important thing to understand about robots.txt.
If you Disallow a URL, crawlers won't fetch its content — but Google can still index the URL if it's linked from elsewhere, showing it in results without a description ("No information is available for this page"). So disallowing a page you wanted hidden can leave a bare, description-less listing in search.
Here's the trap that catches people: if you want a page out of search, you might add a noindex tag and Disallow it in robots.txt to be thorough. But because robots.txt blocks crawling, Google never fetches the page, so it never sees the noindex, and the page stays indexed. To deindex a page, you must allow crawling so Google can read the noindex. Disallow and noindex work against each other.
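One way to catch that conflict is to cross-check the pages you have marked noindex against your robots.txt rules. The sketch below assumes you already have a list of noindex URLs from a crawl or your CMS; the function name, the sample URLs, and the default user agent are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def unreadable_noindex(robots_txt, noindex_urls, user_agent="Googlebot"):
    """Return the noindex URLs that robots.txt stops the crawler from fetching.

    For each URL returned, the noindex directive can never be read,
    so the page may stay in the index.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical inputs for illustration.
robots_txt = "User-agent: *\nDisallow: /old-promo\n"
noindex_urls = [
    "https://yourdomain.com/old-promo",  # noindex AND disallowed: the trap
    "https://yourdomain.com/thank-you",  # noindex and crawlable: works as intended
]

print(unreadable_noindex(robots_txt, noindex_urls))
```

Any URL this returns needs its Disallow rule lifted before the noindex can take effect.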
Real robots.txt checks the audit runs, and how each goes wrong.
Disallow: / from staging. Staging sites often block all crawlers. If that file ships to production, your entire site becomes uncrawlable. This is the robots.txt equivalent of the stray noindex.
Blocking indexable pages. An over-broad path rule can cover pages you want ranked; the audit flags robots.txt blocking indexable pages.
Blocking the sitemap. A rule that disallows your sitemap path stops crawlers reading it.
No sitemap declared. robots.txt should include a Sitemap: line pointing to your XML sitemap.
Blocking CSS/JS. Google renders pages; blocking the assets it needs to render can hurt how it sees the page.
Missing robots.txt. Having no file at all isn't fatal, but a present, correct file is best practice, and the audit flags a missing robots.txt too.
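The checks above can be approximated in a few lines. This is a rough sketch under stated assumptions, not how the audit itself is implemented: the function name, the issue wording, and the sample URLs are hypothetical, and you supply the lists of indexable pages and render assets yourself. It relies on RobotFileParser.site_maps(), which needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, indexable_urls, asset_urls, user_agent="Googlebot"):
    """Return a list of human-readable issues found in robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    issues = []

    # No sitemap declared / sitemap path blocked.
    sitemaps = parser.site_maps() or []  # site_maps() returns None when absent
    if not sitemaps:
        issues.append("no sitemap declared")
    for sitemap_url in sitemaps:
        if not parser.can_fetch(user_agent, sitemap_url):
            issues.append(f"sitemap blocked: {sitemap_url}")

    # Indexable pages blocked; if all are blocked, suspect a staging file.
    blocked = [u for u in indexable_urls if not parser.can_fetch(user_agent, u)]
    if indexable_urls and len(blocked) == len(indexable_urls):
        issues.append("every indexable page is blocked (stray 'Disallow: /'?)")
    else:
        issues.extend(f"indexable page blocked: {u}" for u in blocked)

    # CSS/JS needed for rendering.
    issues.extend(
        f"render asset blocked: {u}"
        for u in asset_urls
        if not parser.can_fetch(user_agent, u)
    )
    return issues


# Hypothetical inputs for illustration.
pages = ["https://yourdomain.com/", "https://yourdomain.com/pricing"]
assets = ["https://yourdomain.com/static/app.js"]

staging_file = "User-agent: *\nDisallow: /\n"
for issue in audit_robots(staging_file, pages, assets):
    print(issue)
```

Run against the staging file, the sketch reports the missing sitemap, the site-wide block, and the blocked asset. A file that only disallows /admin/ and declares its sitemap returns an empty list.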
robots.txt lives at the root of the domain: https://yourdomain.com/robots.txt. It only applies to the host it's served from, so subdomains need their own. It must return a 200 status and be plain text.
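Because the file is per host, the robots.txt that governs a page can be derived from the page's own URL. The sketch below shows that derivation plus a simple check of the two serving requirements named above; both helper names are hypothetical, and the status code and Content-Type are assumed to come from whatever HTTP client you use.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def served_correctly(status: int, content_type: str) -> bool:
    """Check the response: 200 status and a text/plain Content-Type."""
    return status == 200 and content_type.lower().startswith("text/plain")


# A subdomain resolves to its own file, not the main domain's.
print(robots_url("https://yourdomain.com/pricing?plan=pro"))
print(robots_url("https://blog.yourdomain.com/posts/1"))
```

The second call returns the blog subdomain's own robots.txt URL, which is why rules on the main domain never cover it.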
To keep a page out of search, allow it to be crawled and add a noindex meta tag or X-Robots-Tag header. For sensitive content, use authentication, not robots.txt: disallowed URLs are still publicly reachable, and robots.txt itself can be read by anyone, so the paths it lists can be discovered. robots.txt is for crawl management, not security or guaranteed removal.
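If you want to confirm a page actually carries a noindex signal, you can look in both places a crawler would. This is a simplified sketch using only the standard library: the helper names are hypothetical, the response headers and HTML are assumed to come from your own HTTP client, and it only checks meta tags named robots or googlebot.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Record whether any robots meta tag contains 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name in ("robots", "googlebot") and "noindex" in content:
            self.noindex = True


def has_noindex(headers: dict, html: str) -> bool:
    """Return True if the X-Robots-Tag header or a meta tag says noindex."""
    for key, value in headers.items():
        if key.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex


# Hypothetical responses for illustration.
print(has_noindex({"X-Robots-Tag": "noindex"}, ""))
print(has_noindex({}, '<meta name="robots" content="noindex, follow">'))
```

A True result only helps if the page is also crawlable, since a crawler blocked by robots.txt never receives either signal.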
Yes, declare your sitemap in robots.txt. Add a line like Sitemap: https://yourdomain.com/sitemap.xml. It's the simplest way to make sure every crawler can find your sitemap, and the audit flags a robots.txt that doesn't declare one.
Free to start. Check your robots.txt against every indexable page on your site.