
Robots.txt and Meta Robots

What is Robots.txt?
The robots exclusion protocol (REP), or robots.txt, is a text file webmasters create to instruct robots (typically search engine crawlers) on how to crawl and index pages on their website.

Cheat Sheet

Block all web crawlers from all content

User-agent: *
Disallow: /

Block a specific web crawler from a specific folder

User-agent: Googlebot
Disallow: /no-google/

Block a specific web crawler from a specific web page

User-agent: Googlebot
Disallow: /no-google/blocked-page.html

Allow a specific web crawler to visit a specific web page

User-agent: *
Disallow: /no-bots/block-all-bots-except-rogerbot-page.html

User-agent: rogerbot
Allow: /no-bots/block-all-bots-except-rogerbot-page.html

Sitemap Parameter

User-agent: *
Disallow:

Sitemap:

(The Sitemap: line takes the absolute URL of the sitemap file.)

Optimal Format
Robots.txt needs to be placed in the top-level directory of a web server in order to be useful, since crawlers only request it from the domain root. Example: http://www.example.com/robots.txt
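Rules like those in the cheat sheet above can be sanity-checked locally with Python's standard-library urllib.robotparser; the rules and URLs below are illustrative, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: block only Googlebot from one folder.
rules = """\
User-agent: Googlebot
Disallow: /no-google/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot may not fetch pages under the blocked folder...
print(parser.can_fetch("Googlebot", "/no-google/page.html"))  # False
# ...but agents with no matching group are allowed by default.
print(parser.can_fetch("Bingbot", "/no-google/page.html"))    # True
```

Note that this parser implements the original prefix-matching REP; it does not understand the * and $ extensions discussed later in this post.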

The Robots Exclusion Protocol (REP)
The Robots Exclusion Protocol (REP) is a group of web standards that regulate Web robot behavior and search engine indexing. The REP consists of the following:

The original REP from 1994, extended in 1997, that defines crawler directives for robots.txt. Some search engines support extensions like URI patterns (wildcards).
Its extension from 1996 that defines indexer directives (REP tags) for use in the robots meta element, also known as the "robots meta tag." Meanwhile, search engines support additional REP tags with an X-Robots-Tag; webmasters can apply REP tags in the HTTP header of non-HTML resources like PDF documents or images.
The microformat rel-nofollow from 2005 that defines how search engines should handle links where the A element's rel attribute contains the value "nofollow."

Robots Exclusion Protocol Tags

Applied to a URI, REP tags (noindex, nofollow, unavailable_after) steer particular tasks of indexers, and in some cases (nosnippet, noarchive, noodp) even query engines at runtime of a search query. Unlike crawler directives, REP tags are interpreted differently by each search engine. For example, Google wipes out even URL-only listings and ODP references from its SERPs when a resource is tagged with "noindex," but Bing sometimes lists such external references to forbidden URLs on its SERPs. Since REP tags can be supplied in META elements of X/HTML content as well as in HTTP headers of any web object, the consensus is that the contents of X-Robots-Tags should overrule conflicting directives found in META elements.


Indexer directives put as microformats overrule page settings for particular HTML elements. For example, when a page's X-Robots-Tag states "follow" (there's no "nofollow" value), the rel-nofollow directive of a particular A element (link) wins.

Although robots.txt lacks indexer directives, it is possible to set indexer directives for groups of URIs with server-side scripts acting at the site level that apply X-Robots-Tags to requested resources. This method requires programming skills and a good understanding of web servers and the HTTP protocol.
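As a minimal sketch of that server-side approach (the function name and paths are illustrative, not any particular server's API), a request handler can attach an X-Robots-Tag header to every PDF it serves, applying an indexer directive to a whole group of URIs without editing the files themselves:

```python
def response_headers(path):
    """Build response headers for a requested path, adding an
    X-Robots-Tag indexer directive to non-HTML resources (PDFs)."""
    headers = [("Content-Type", "application/octet-stream")]
    if path.lower().endswith(".pdf"):
        # Delivered in the HTTP header, this is equivalent to a
        # robots meta tag for resources that cannot carry one.
        headers.append(("X-Robots-Tag", "noindex, noarchive"))
    return headers
```

In a real deployment the same effect is usually achieved in the web server configuration rather than application code, but the principle is identical.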

Pattern Matching

Google and Bing both honor two pattern-matching characters that can be used to identify pages or subfolders that an SEO wants excluded: the asterisk (*) and the dollar sign ($).

* - which is a wildcard that represents any sequence of characters
$ - which matches the end of the URL
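One way to understand these two characters is to translate a robots.txt pattern into an equivalent regular expression. The helper below is a hedged sketch of the matching behavior Google and Bing describe, not their actual implementation:

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.
    '*' matches any sequence of characters; a trailing '$' anchors
    the pattern to the end of the URL (the Google/Bing extension)."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    # Robots.txt patterns match from the start of the path,
    # which is exactly re.match's behavior.
    return re.compile(regex + ("$" if anchored else ""))

# "Disallow: /*.php$" blocks URLs ending in .php ...
rule = robots_pattern_to_regex("/*.php$")
print(bool(rule.match("/shop/index.php")))          # True
# ... but not URLs where something follows the extension.
print(bool(rule.match("/shop/index.php?lang=en")))  # False
```

Without the trailing $, the pattern /*.php would also block /shop/index.php?lang=en, since it only requires the path to contain ".php".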

Public Information

robots.txt File is Public -- Be aware that a robots.txt file is a publicly available file. Anyone can see which sections of a server the webmaster has blocked the engines from. This means that if an SEO has private user information that they don't want publicly searchable, they should use a more secure approach (such as password protection) to keep visitors from viewing any confidential pages they don't want indexed.

Important Rules

In most cases, the meta robots tag with the parameters "noindex, follow" should be employed as a way to restrict crawling or indexation.
It is important to note that malicious crawlers are likely to completely ignore robots.txt and as such, this protocol does not make a good security mechanism.
Only one "Disallow:" line is allowed for each URL.
Each subdomain on a root domain uses a separate robots.txt file.
Google and Bing accept two specific regular expression characters for pattern exclusion (* and $).
The filename of robots.txt is case sensitive. Use "robots.txt", not "Robots.TXT".
Spaces are not an accepted way to separate query parameters. For example, "/category/ /product page" would not be honored by robots.txt.

MOD - Quotes added; please provide which sites you used for your Sources.
Material you did not write should be quoted and sourced. Please refer to the forum rules here:
- Ankhanu
This info is very useful and saves me a huge amount of time, which I can spend on something else instead of searching for posts like this. Thank you!
The robots.txt file is a text file that tells search engine crawlers which portions of your website they should NOT index. If you don't want to restrict search engine crawlers, you should simply create an empty robots.txt file (e.g., touch robots.txt) or one that looks like this:

User-agent: *
Disallow:
Once you have created a robots.txt file, you store it in the root directory of your Web server.

Anyway thanks for this very useful and helpful information!!

© 2005-2011 Frihost, forums powered by phpBB.