Robots.txt: Understanding the Basics of Crawler Management

Business owners today would be hard-pressed to build a steady stream of business without web leads and traffic. As part of your overall online marketing arsenal, it is crucial to have the right files live on your website.

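One of the most important files to post is robots.txt. Search engines use “bots” (a.k.a. spiders, wanderers, or crawlers) to discover and index websites and pages, and site owners have a feedback loop for sending instructions back to them. It is not specific to Google; the major search engines all support it.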

The feedback loop is essentially the ability to request that search engines stay away from certain pages or directories completely. This is done using the robots.txt file.

Take note that robots.txt is a request, not an enforcement mechanism. Well-behaved crawlers honor it, but nothing forces a bot to comply, and a blocked URL can still show up in search results if other sites link to it. So it is imperative that you put any sensitive or proprietary data behind a login or other access layer as well.

Robots.txt: How to make it

This file is quite literally a plain text file that you can create in minutes using MS Notepad or a similar basic application. Simply open up Notepad, create a new file, and save as “robots.txt”. Voila! You have the file.
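
If you would rather script this step than open an editor, the following is a minimal Python sketch of the same thing. The function name create_robots_txt is my own, and the temporary directory simply stands in for wherever you keep your local copy of the site.

```python
import tempfile
from pathlib import Path


def create_robots_txt(site_root: Path) -> Path:
    """Create an empty robots.txt inside site_root and return its path."""
    path = site_root / "robots.txt"
    # The file is plain text; it stays empty until rules are added.
    path.write_text("", encoding="utf-8")
    return path


# Demo in a throwaway directory (a stand-in for your local site folder).
with tempfile.TemporaryDirectory() as tmp:
    created = create_robots_txt(Path(tmp))
    print(created.name)  # robots.txt
```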

Robots.txt: Managing the content

So we have the file, but it contains no content just yet. Let’s look at what goes in there.

The robots.txt file is basically divided into sections, one for each of the robot crawlers’ User Agent names. You can direct a section at all crawlers or at a specific one, so this can be as simple or as complex an exercise as you feel comfortable taking on.

Each section begins with code designating what User Agent is targeted. Examples of this piece of code include:

User-agent: * (targets all spiders)
User-agent: Googlebot
User-agent: insert name of agent here

Beneath each User-agent designation, there will be one or more DISALLOW entries. How’s that for simple-to-learn logic? The syntax for this command looks like this:

Disallow: /    (tells the User-agent not to crawl any pages on the site)
Disallow: /name-of-microsite/
Disallow: /directory-not-to-index/
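
To see how a section like this gets interpreted, here is a small sketch that uses Python's standard-library urllib.robotparser as a stand-in for a well-behaved crawler. Real search engine bots may differ in the details, and www.example.com along with the helper name is_allowed are placeholders of mine, not anything a search engine prescribes.

```python
from urllib.robotparser import RobotFileParser

# One section aimed at all crawlers, using the directory entries shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /name-of-microsite/
Disallow: /directory-not-to-index/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(is_allowed(ROBOTS_TXT, "Googlebot",
                 "https://www.example.com/name-of-microsite/page.html"))  # False
print(is_allowed(ROBOTS_TXT, "Googlebot",
                 "https://www.example.com/about.html"))  # True
```

Each Disallow value is matched against the start of the URL path, so everything underneath a listed directory is covered by a single line.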

Disallow can also be turned into its opposite by entering nothing after the colon. An empty Disallow is essentially an “allow this crawler to crawl any and all pages on this website” command. I have found only one practical use for this variant: you set a Disallow for all user agents, but want to override it for one specific agent. To keep it straightforward, come back to this one once you are more comfortable with the topic.
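
For when you do come back to it, the override case might look like the sketch below. It again leans on Python's urllib.robotparser as a stand-in for a compliant crawler. The choice of Googlebot as the one permitted agent, the name SomeOtherBot, and the www.example.com URL are assumptions made purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Block every crawler, but give Googlebot an empty Disallow (allow everything).
OVERRIDE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(OVERRIDE_ROBOTS_TXT.splitlines())

page = "https://www.example.com/any-page.html"
googlebot_allowed = parser.can_fetch("Googlebot", page)
other_bot_allowed = parser.can_fetch("SomeOtherBot", page)

print(googlebot_allowed)  # True
print(other_bot_allowed)  # False
```

A crawler uses the section that names it specifically and falls back to the * section only when no named section applies, which is why the empty Disallow wins for Googlebot here.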

Robots.txt: Where to put it

The answer is simple, but very important. Once you finish building your file, upload it to the root directory of the website, so that it is reachable at a URL like https://www.example.com/robots.txt. If you place it anywhere else, search engine spiders will treat it as just another posted document and not as a set of instructions to review prior to crawling.
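
To double-check where a crawler will look, you can work out the expected robots.txt address from any page on the site. This is a minimal sketch; the function name robots_txt_url and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/2020/some-post.html?ref=home"))
# https://www.example.com/robots.txt
```

Note that the location depends on the host name, so a subdomain such as blog.example.com would need its own file in its own root.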

Difference between robots.txt and the “robots” meta tag

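Robots.txt and the robots meta tag both give search engines instructions about a page, but they are not interchangeable. Robots.txt asks crawlers not to fetch a page at all, while the robots meta tag sits in a page’s HTML and asks search engines not to index a page they have already fetched. One consequence is that a crawler blocked by robots.txt never sees the meta tag on that page. I’ve heard many an SEO split hairs about whether there is any good reason to use one over the other.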

Whichever side of that debate you land on, keep in mind that the robots.txt file is massively more scalable than its meta tag cousin. Why? Because you can disallow access to an entire directory on your site with two lines of code. If that directory had, say, 18 pages in it, getting the same coverage with the meta tag would mean editing, saving, and uploading a new version of each page individually.
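
The scalability point can be checked with a quick sketch. It uses Python's urllib.robotparser as a stand-in for a compliant crawler, and the directory name, the 18 page names, and www.example.com are all made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The entire rule set: two lines.
TWO_LINES = """\
User-agent: *
Disallow: /directory-not-to-index/
"""

parser = RobotFileParser()
parser.parse(TWO_LINES.splitlines())

# 18 hypothetical pages living inside the disallowed directory.
pages = [
    f"https://www.example.com/directory-not-to-index/page-{n}.html"
    for n in range(1, 19)
]
blocked = [url for url in pages if not parser.can_fetch("Googlebot", url)]

print(len(blocked), "of", len(pages), "pages blocked")  # 18 of 18 pages blocked
```

Adding a 19th page to that directory later needs no further changes, whereas the meta tag approach would mean touching that new page as well.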

More Information

I strongly recommend you take the time to learn how to manage this yourself. It is really not very difficult, and something you can keep in your back pocket for later when you really need it.

For those of you who don’t need to do this more than one time, I stumbled upon a robots.txt generator while researching a couple of items for this blog post. If you use it, let me know how it goes. I am sure there are multiple tools out there, and would rather only share the good ones.

In closing, the following chart from technyat.com explains all this in an easy-to-understand comparison format. Enjoy!

Robots.txt: How it works
Image Source: Technyat.com