Search engines use web robots (also known as spiders or crawlers) to crawl sites and index their content. At the same time, spammers and other malicious parties use web robots to harvest details about a site, such as email addresses.
A robots.txt file gives these robots instructions about which parts of your site they should and shouldn't crawl. By default, search engines will try to index everything on your site, so a robots.txt file is crucial for keeping pages you want hidden out of the index. It is also useful when dealing with duplicate-content issues, especially post-Panda.
Keep reading to find out all about the robots.txt file and what you need to consider when setting one up.
Before taking any further steps, you need to ensure you have access to the root of your domain. The file must live at the root, meaning it can only be found at www.domain-name.com/robots.txt, not www.domain-name.com/another-folder/robots.txt.
As the name suggests, it has to be a plain text file, and it must be named exactly robots.txt.
To fully understand what to include, it is important to learn the syntax.
User-agent: The file will always start with this line. Written as User-agent: *, it tells crawlers that the rules that follow apply to all robots.
Disallow: This is used to specify the URL path you want to block.
Allow: If you have blocked a directory but want to permit a URL path within a subdirectory of that parent, you can use the Allow directive to ensure those pages are still crawled.
You can use as many lines as you wish in the file.
Below is an example of how the file might appear.
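A minimal sketch combining the three directives above; the /private/ directory and the allowed page are placeholder paths, not real recommendations:

```
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
```

Here every crawler is told to stay out of /private/, with one page inside it carved back out for crawling.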
Test Your File
You should test anything before moving on, and this is no different. Fortunately, Google provides a free robots.txt testing tool through Search Console, where you can input the URL and check whether any errors or warnings come up.
If there are any issues, do not move on until they are completely fixed. Once you are certain any issues are ironed out, click Submit in the robots.txt editor in Search Console. Once you have verified the live version and submitted it, Google will crawl the file and take note of your instructions.
Meta Robots vs Robots.txt
While robots.txt is incredibly useful for rules that span whole directories or an entire site, sometimes you just want a specific page blocked. In that case, the preferred method is often a meta robots tag. This will often look something like this:
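For example, a tag placed in the page's &lt;head&gt; that blocks both indexing and link-following might be:

```html
<meta name="robots" content="noindex, nofollow">
```

The content attribute can be adjusted to suit, e.g. "noindex, follow" to keep the page out of the index while still allowing its links to be followed.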
We have detailed some of the reasons for adding this file and how to create it, but it is also worth noting some of its limitations, as well as some of the risks. You should never create one unless you are completely comfortable with what you are doing. Ensure the URL path you are blocking doesn't contain any pages you want to appear in search engines, as a small mistake here can cost you dearly.
These instructions are directives rather than enforced law. Well-behaved crawlers such as Google's will obey them, but not all will, so be aware of this. Different web crawlers also interpret the syntax differently, so what works for one might not work for another.
Many people ask about the crawl-delay directive, so it is worth quickly pointing out that Google ignores this instruction. Some people include it so the server isn't under as much pressure while the site is being crawled, but Google began ignoring crawl-delay when it realised a huge number of people were making costly mistakes with it. If you really want to change the crawl rate for Google, you can adjust the settings within Search Console (Webmaster Tools).
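Some other crawlers, such as Bing's, do still honour the directive. A sketch of how it is typically written, where the 10-second value is purely illustrative:

```
User-agent: bingbot
Crawl-delay: 10
```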
You aren't required to submit your sitemap in the robots.txt file, but many people choose to. If you would like to, then you submit it in the format below:
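Using the placeholder domain from earlier (the sitemap path is an assumption; substitute your own full URL):

```
Sitemap: https://www.domain-name.com/sitemap.xml
```

Note that the Sitemap directive requires the full absolute URL, not a relative path.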
Blocking The Entire Site
If for any reason you want to block the entire site from all web crawlers, you just need to include the below:
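The whole file would consist of just these two lines:

```
User-agent: *
Disallow: /
```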
The / following disallow means that every page on the site will be blocked, while the * after User-Agent means that you are applying this rule to all crawlers.
If you wanted to block the site against just a specific crawler, you can apply this by setting the user-agent to the relevant site, e.g.
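For instance, using Googlebot (Google's crawler) as the example user agent:

```
User-agent: Googlebot
Disallow: /
```

Other crawlers would ignore this rule and continue crawling as normal.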
Both Google and Bing allow you to block pages with wildcards. So for example, you could add:
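One way this could be written, blocking any URL that contains a question mark (and therefore any query string):

```
User-agent: *
Disallow: /*?
```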
This will block all pages on the site whose URLs include a question mark. It is useful for dealing with issues such as faceted navigation creating duplicate content, a common problem for e-commerce websites.
The focus should be on looking at the URL structure of each page, working out which pages you want to block, and ensuring they don't share a URL pattern with pages you do want crawled. Failing to consider related URLs when implementing wildcards can be very costly.
If you ever require help on implementing a robots.txt, don’t hesitate to contact us and we would be more than happy to assist.