

How to prepare a "robots.txt" to get crawled by Search Engines





paskall
Search bots crawl each URL, and the first thing they look for at a site's root is the robots.txt file. So by making our own robots.txt we can change the search bots' behaviour and tell them where to search and index and where not to. Imagine we have private folders on our website, for example a folder or a file that contains e-mail addresses we don't want published; with a few simple lines in the robots.txt file we can keep the search robots away from them. Here we go.

We use the /robots.txt file to give instructions about our site to web robots; this is called The Robots Exclusion Protocol.

Simply put, robots.txt is a plain text file placed in our root directory, for example www.frihost.com/robots.txt. This file tells search engines and other robots which areas of our site they are allowed to visit and index.
The rule is: we can have ONLY one robots.txt on our site, and ONLY in the root directory (where our home page is):

TRUE: www.frihost.com/robots.txt (works)

FALSE: www.frihost.com/crap/robots.txt (does not work)

All the big search engine spiders respect this file, but unfortunately most spambots (e-mail collectors, harvesters) do not. If you need security on your site, or have files or content to hide, you have to actually put the files in a protected directory; you can't trust the robots.txt file alone.

So which program do we need to create it? Good old Notepad or any text editor is enough; all we need to do is create a new text file and name it. Attention: the name has to be exactly "robots.txt", not "robot.txt", "Robot.txt" or "robots.TXT". Simple: no caps, and plural "robots"!

Now we start writing in it. A simple robots.txt looks like this:

User-agent: *
Disallow:


The "User-agent: *" line means this section applies to all robots; the wildcard "*" matches every bot. The empty "Disallow:" line tells the robots they can go anywhere they want.

User-agent: *
Disallow: /


The wildcard "*" is used in this one too, so all bots must read it. But there is one small difference: a slash "/" in the Disallow line, which means nothing may be crawled, so the bots don't crawl your website, the good ones of course.

If we want all the bots to read the file, we put the wildcard "*" in the User-agent line. When we leave the Disallow: line blank, it means "come crawl my site, you bots!", and when there is a slash it means "keep out!". Simple. That is the simplest form; now we can learn how to let some bots crawl and keep others out.

The User-agent line is the part we work on to target a bot by name. For example, say we want the Google bot to crawl the site but not the Yahoo bot. How would our text file look then?
Simple, all we need to know is the names of the bots, that's all. I will list the bot names later, but first let's make a sample file.
User-agent: googlebot
Disallow:

User-agent: yahoo-slurp
Disallow: /

In this sample, we addressed googlebot and left the Disallow line blank, so we said "crawl my website". In the second block we addressed the Yahoo bot, but its Disallow line has a slash, so we told it to go away.
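If you have Python handy, you can sanity-check a draft like this before uploading it: the standard library's urllib.robotparser module implements the Robots Exclusion Protocol, so it reads the file the same way a compliant crawler would. A minimal sketch using the sample above:

```python
# Check how a compliant crawler would read the sample robots.txt above,
# using Python's standard-library robots.txt parser.
import urllib.robotparser

sample = """\
User-agent: googlebot
Disallow:

User-agent: yahoo-slurp
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(sample.splitlines())

# googlebot's Disallow line is blank, so it may fetch anything.
print(rp.can_fetch("googlebot", "/index.html"))    # True
# yahoo-slurp is disallowed from "/", i.e. the whole site.
print(rp.can_fetch("yahoo-slurp", "/index.html"))  # False
```

The same parser can also fetch a live file with `rp.set_url(...)` followed by `rp.read()`, which is handy for checking the robots.txt you actually deployed.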

Now we will learn how to keep some folders of our site from being searched by the spiders while letting other folders be searched at the same time. For this, we change the value in the Disallow line. Say we have two folders in our domain, /images and /emails, and we want /images to be searched but not /emails. Then the text file would look like:
User-agent: *
Disallow: /emails/

As we can see, we addressed all the robots and excluded the /emails folder so it won't be seen, but the rest of the website can be searched by the robots.
Here are a few samples to make it clearer.

To exclude all folders from all bots:
User-agent: *
Disallow: /

To exclude one folder from all bots:
User-agent: *
Disallow: /emails/

To exclude all folders from one bot:
User-agent: googlebot
Disallow: /
User-agent: *
Disallow:

To allow just one bot to crawl the site:
User-agent: googlebot
Disallow:
User-agent: *
Disallow: /

To allow all bots to see all folders:
User-agent: *
Disallow:

After learning these, I believe you've got it. Now there are a few rules we should know. We can't use a wildcard "*" in the Disallow line for every bot; most bots won't read it (Googlebot and MSNBot can). So a line like "Disallow: /emails/*.htm" is not valid for all bots. Another rule: you have to write separate User-agent and Disallow lines for each specific bot, and a new Disallow line for each directory you want to exclude. "User-agent: googlebot, yahoobot" and "Disallow: /emails, /images" are not valid.

Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
The /robots.txt file is a publicly available file. Anyone can see what sections of your server you don't want robots to use, so don't try to use /robots.txt to hide information.

Is it possible to allow just one file or folder to be crawled and the rest not? The original standard has no Allow line (though Google and some other engines do support an "Allow:" directive), but in practice it can be done. How? Put all the files you don't want seen into one folder and disallow it. For example, "Disallow: /filesthatIdontwanttoshare/".

Major Known Spiders
Googlebot (Google), Googlebot-Image (Google Image Search), MSNBot (MSN), Slurp (Yahoo), Yahoo-Blogs, Mozilla/2.0 (compatible; Ask Jeeves/Teoma), Gigabot (Gigablast), Scrubby (Scrub The Web), Robozilla (DMOZ)

Google
Google allows the use of asterisks. Disallow patterns may include "*" to match any sequence of characters, and patterns may end in "$" to indicate the end of a name. To remove all files of a specific file type (for example, to include .jpg but not .gif images), you'd use the following robots.txt entry:

User-agent: Googlebot-Image
Disallow: /*.gif$
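Standard robots.txt matching is plain prefix matching, while Google's "*" and "$" extensions behave like a small pattern language. The sketch below models that behaviour with a regular expression; `pattern_to_regex` is a made-up helper for illustration, not Google's actual matcher:

```python
# Sketch of how a Google-style Disallow pattern with "*" and "$"
# could be matched against URL paths. pattern_to_regex is a
# hypothetical helper, shown only to explain the wildcards.
import re

def pattern_to_regex(pattern: str) -> "re.Pattern":
    anchored = pattern.endswith("$")      # "$" pins the match to the end of the path
    body = pattern[:-1] if anchored else pattern
    # Escape regex metacharacters, then turn the escaped "*" back into ".*"
    regex = re.escape(body).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    return re.compile(regex)

rule = pattern_to_regex("/*.gif$")            # the Googlebot-Image example above
print(bool(rule.match("/images/photo.gif")))  # True  -> blocked
print(bool(rule.match("/gif-gallery.html")))  # False -> allowed
```

Without the trailing "$", the pattern would also block any URL merely containing ".gif", such as "/photo.gif.html"; that is why Google's examples anchor file-type rules.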

Yahoo
Yahoo also has a few specific commands, including:

Crawl-delay: xx - where "xx" is the minimum delay in seconds between successive crawler accesses. Yahoo's default crawl-delay value is 1 second. If the crawler's rate is a problem for your server, you can raise the delay to 5, 20, or whatever value is comfortable for your server.

Setting a crawl-delay of 20 seconds for Yahoo-Blogs/v3.9 would look something like:

User-agent: Yahoo-Blogs/v3.9
Crawl-delay: 20

Ask / Teoma
Supports the crawl-delay command.

MSN Search
Supports the crawl-delay command, and also allows wildcard patterns:

User-agent: msnbot
Disallow: /*.[file extension]$
(the "$" is required to mark the end of the filename)

Examples:

User-agent: msnbot
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$


Why do I want a Robots.txt?
There are several reasons you would want to control a robots visit to your site:

*It saves your bandwidth - the spider won't visit areas where there is no useful information (your cgi-bin, images, etc)

*It gives you a very basic level of protection - although it's not good security, it will keep people from easily finding stuff you don't want accessible via search engines. They actually have to visit your site and browse to the directory instead of finding it on Google, MSN, Yahoo or Teoma.

*It cleans up your logs - every time a search engine visits your site it requests the robots.txt, which can happen several times a day. If you don't have one it generates a "404 Not Found" error each time. It's hard to wade through all of these to find genuine errors at the end of the month.

*It can prevent spam and penalties associated with duplicate content. Let's say you have a high-speed and a low-speed version of your site, or a landing page intended for advertising campaigns. If this content duplicates other content on your site, you can find yourself in ill favor with some search engines. You can use the robots.txt file to prevent the content from being indexed and therefore avoid issues. Some webmasters also use it to exclude "test" or "development" areas of a website that are not ready for public viewing yet.

*It's good programming policy. Pros have a robots.txt; amateurs don't. Which group do you want your site to be in? This is more of an ego/image thing than a "real" reason, but in competitive areas or when applying for a job it can make a difference. Some employers may pass over a webmaster who didn't know how to use one, on the assumption that they may not know other, more critical things either. Many feel it's sloppy and unprofessional not to use one.


So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site's main "index.html" welcome page. Where exactly that is, and how to put the file there, depends on your web server software.

Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

MAJOR SEARCH BOTS - SPIDERS NAMES
Google = googlebot
MSN Search = msnbot
Yahoo = yahoo-slurp
Ask/Teoma = teoma
GigaBlast = gigabot
Scrub The Web = scrubby
DMOZ Checker = robozilla
Nutch = nutch
Alexa/Wayback = ia_archiver
Baidu = baiduspider

Specific Special Bots:
Google Image = googlebot-image
Yahoo MM = yahoo-mmcrawler
MSN PicSearch = psbot
SingingFish = asterias
Yahoo Blogs = yahoo-blogs/v3.9

Feel free to ask any question, or correct my mistakes,
Cheers,
georgekalathil
Thanks a lot for this informative topic, Paskall. It's a good help for those relatively new to SEO and those wanting to know more about the robots.txt file. Keep up the good work.
paskall
Thanks, hope it's helpful. I had a hard time making my own robots.txt, so now I want to help people make theirs easily.

Cheers.
Crinoid
Thank you, very informative!

Can you clarify a little more:
1. Does it mean that only www.frihost.com/robots.txt will function, and none of the robots.txt files in its many subdomains (websites of the members)?
2. How to allow search of html and jpg files only? Only by disallowing all other file types existing on the website?
paskall
A domain can have just one robots.txt, in your web root folder; you can't use a separate robots.txt for each URL. For example, this is your website, http://yourdomain.com, and these are your pages:

http://yourdomain.com/index.html
http://yourdomain.com/products.htm
http://yourdomain.com/sellers.htm
http://yourdomain.com/aboutus.htm
http://yourdomain.com/photos.htm


Now you don't need to make a specific robots.txt file for each page. Remember the rules on how to write a robots.txt: don't use caps in the file's name, and put it in the right directory, which is usually the folder that contains your main page.

A "robots.txt" file is only read from the root of a host; spiders never look for one inside a subfolder. If you have a folder tree like this,

A Folder > B Subfolder > E Childfolder
A Folder > C Subfolder > F Childfolder
A Folder > D Subfolder > G Childfolder

when you place the robots.txt file in the A Folder (the root), all the folders A, B, C, D, E, F, G will be crawled or not according to that robots.txt. If you want separate rules for the B folder and its subfolders, don't move the file there; add a Disallow line for /B/ in the root robots.txt instead.

So if you make a robots.txt file and put it in the main directory of your website, for example www.frihost.com/robots.txt, it will govern all the subfolders and everything inside the root folder. If member sites were paths like "http://frihost.com/paskall", search bots would crawl them according to that robots.txt. But what you mean is "http://paskall.frihost.com", which has its own root folder, so the main robots.txt does not cover it. To get more information about first and second level subdomains, see my tutorial about them in the Domains section.


About the second question: as far as I know you can't simply whitelist html files, and not all search engines let you specify a file type to allow or disallow. In the tutorial there is a part about image crawlers; you can build your own robots.txt according to that. And for the other file types you don't want seen, for example your .css and .js files, just disallow them.

And never forget: only good bots listen to your robots.txt. Malware robots, such as e-mail harvesters or spam bots, will never obey it. Always keep secret files in a secure place, and don't give e-mail or other sensitive files easily guessable names like "emails". You should have a complicated folder tree, like a labyrinth. Another rule: never, ever put file names that you want to hide into robots.txt!
Like
User-agent: *
Disallow: /myscript.js

or

User-agent: *
Disallow: /emails/

Then the malware bots would read the hidden directory name and go straight there, because what they want and seek is exactly that: security weaknesses.

When you want to disallow a file type, do this:
User-agent: *
Disallow: /*.gif

for example, according to the tutorial of course


Hope this helps you. Cheers.
Crinoid
1. Yes, I don't have a domain name, and was asking about a subdomain, like http://www.defineyourreef.frihost.net. So robots.txt in a subdomain will be invisible to the robots and will not be used by them. Do you know, by any chance, what the robots.txt file of frihost.net allows? Will the individual subdomains be unsearchable by robots and never appear in the search engines without additional promotion, like submissions to the search engines, Open Directory, StumbleUpon and Digg?

2. As I understand it, html files are always allowed, and each of the other file types should be disallowed for every search bot that supports this. And for that, those of us who have domain names should find each search engine bot's list of supported commands. Right?
Thank you.
paskall
1- You have a subdomain, and if you want it to be searched you have to do more than just wait. Via Frihost, your subdomain will get crawled of course, but if you want to be listed on the first pages of Google or other search engines, exchange links with other frequently visited websites; other than that, people can't find your subdomain unless they search for the exact name, "defineyourreef". And you don't need a robots.txt file for that domain, because frihost.com, the main domain, already has a robots.txt file, and bots already crawl all the links starting from there.

I recommend you buy a domain and point it to your host, Frihost.

2- I tried to summarize all the big search bots. They generally don't have too many commands.

As I said, try to protect your scripts and secret files.

Cheers.
nivinjoy
Thank you for this very useful and informative post! I will follow the directions and create one now!
paskall
You're welcome, hope it's helpful. If you need more help, you can get in contact with me.
paskall
They say Microsoft's Live search engine will be changed to a new system; it's going to be a paid system and you will have to pay to get searched. Is it true?
Jean-Luc
Crinoid wrote:
1. Yes, I don't have a domain name, and was asking about a subdomain, like http://www.defineyourreef.frihost.net. So robots.txt in a subdomain will be invisible to the robots and will not be used by them. Do you know, by any chance, what the robots.txt file of frihost.net allows?
Hi,

With your subdomain, your robots.txt should be at http://www.defineyourreef.frihost.net/robots.txt. The Frihost robots.txt at http://www.frihost.net/robots.txt has no effect on your subdomain.

Jean-Luc
Crinoid
Thank you! I can only hope that this is so - otherwise the subdomains would be invisible to search engines. I have already seen that my subdomain was visited by the Yahoo bot; the phpBB3 forum shows who is visiting it at the current time.

Very interesting and useful topic.
paskall
yes, the key point here is that search bots look for robots.txt at the root of a host. Your aim is to get read by the bots, so you put it inside your root folder, which for you is the root of the subdomain defineyourreef.

easy.
joharin
Wonderful!! This is definitely something I had no knowledge of until now. This is just amazing... Thank you so much for mentioning it here; I can't help thinking how generous you are in sharing this information.

Thanks again!!
shkhanal
Such long and nice information.

Though I came to know about this a bit earlier, I still have not used it. I need to implement it.
imagefree
If I have a page http://www.example.com/my_private_page.php or http://www.example.com/my_private_page.html, but that page is not linked from any website on the web (not even example.com itself links to it), and I set the robots.txt as follows:


Code:
User-agent: *
Disallow:


then do you mean that page (my_private_page.html or .php) will be crawled?


I need a little bit of explanation too.

Thanks @ Paskall
paskall
So you have a ghost page, with no link pointing to it?

First, if there's no link to it from your website or any other website, it will probably never be visited. There is little chance bots can find it, unless you visit it yourself or somebody stumbles on it by accident; then there is traffic, and I guess bots can pick it up, at least Google can, but not 100%. But if you let bots read your directory via the robots.txt file and you list that page in robots.txt, then robots will go there and check.

Other than this, there is one way the bots can find it. Malware bots, and big bots generally, try some common names such as "index.htm", "products.html", "contact.html", etc. The names they try depend on what they seek; malware bots usually hunt for e-mail addresses. So, as I recommend, use unusual or not easily guessable file names. Of course this lowers your manual traffic (sometimes people type a URL manually if it's easy to remember, for example blabla.com/this.is.easy.to.memory.html). Most website builder software and rookie webmaster sites recommend memorable names, but I don't, if you need to hide important files such as e-mail lists.

Hope this helps you.
esiportal
imagefree wrote:
If I have a page http://www.example.com/my_private_page.php or http://www.example.com/my_private_page.html, but that page is not linked from any website on the web (not even example.com itself links to it), and I set the robots.txt as follows:


Code:
User-agent: *
Disallow:


then do you mean that page (my_private_page.html or .php) will be crawled?


I need a little bit of explanation too.

Thanks @ Paskall


Yes, crawlers can still crawl these, because bad bots crawl anything inside your host and ignore the robots.txt file - robots.txt is a non-standard, advisory file, it is only an option. You can use it to keep good bots such as Google's and Yahoo's away from specific files. See http://www.robotstxt.org for more information.
paskifire
very useful!
darkwater49
Thank you for the guide!!
wanshi
thank you for the contribution on robots.txt, hope to see more about htaccess if you have time Very Happy
traxion
wtf, i've found a spammer Very Happy

on-topic
thanks for the complete overview of robots.txt, I will try it out on my website
alomari
The Robot Exclusion Protocol (REP) is one important part of search engine optimization (SEO). It gives website owners the power to restrict search engines from indexing or crawling parts or all of a website. One cannot say that all search engines follow the REP, but the big three (Google, Yahoo and Bing) have adopted a strategy of working together to support it.
Source http://widwebway.com/en/blog/?p=31
hakaner
Thank you for this helpful guide. I'm also using the Google XML Sitemaps plugin to create both the sitemap and robots.txt in my WordPress installation. The plugin automatically creates a robots.txt file in the root folder like this:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Sitemap: http://mysite.com/sitemap.xml.gz
hakaner
Amit_Airon wrote:
Thanks for sharing, but if you want your website or webpage to be followed by crawlers, please don't use any robots file.


If it is used correctly, it helps point Google's web-crawling bots in the right direction.
ilmkidunya1
hmmm, nice information, thanks for sharing. Is it easy to get the file from the internet, or do you make your own?

© 2005-2011 Frihost, forums powered by phpBB.