
Hide Part of Your Web Site from Yahoo!

Hack 93. Hide Part of Your Web Site from Yahoo!

Though many sites do whatever they can to be found and ranked highly in Yahoo! Search results, there might be parts of your site that you want to keep private.
The Web is considered a public place, used to share pages of information with anyone in the world who wants to view them. But, to a lesser degree, the Web is also used to share information with small groups, or even a single individual. Because Yahoo! indexes as much of the Web as it can, these semipublic spaces can also be included in Yahoo! Search results. With just a bit of work, you can tell Yahoo! exactly which pages are meant for public consumption and which pages shouldn't be included in Yahoo! Search results. In addition, some sections of sites might not be a good introduction to the site, and you might want to control where people enter.
Yahoo! scans the Web with a program called Slurp. Slurp is a bot (short for robot) that visits and indexes web pages, makes a copy of each page for Yahoo!, and follows any links in the page looking for more pages to index. In addition to standard web pages, Slurp copies other files it finds along the way, such as PowerPoint presentations, PDF files, Word documents, Excel spreadsheets, and XML data files. Because of Slurp's link-following nature, many people think that if a page or document isn't linked from a page on their site, Yahoo! won't find it and it will stay out of view. But Slurp doesn't follow links exclusively; Slurp also looks for common filenames. And if a particular file is linked from another site, Slurp might find it that way.
6.3.1. Server Authentication
The best way to keep pages and files out of view of the general public is to place them behind server authentication. Server authentication is the web server's attempt to verify the identity of a particular user by requesting a username and password. The authentication is set at the server level.
Slurp can't enter a username and password when it encounters a server-authenticated page, so you can be sure that anything behind this wall will not be indexed. You can set authentication permissions on a directory or file, and it's fairly easy to set up with both Apache and Microsoft's Internet Information Server (IIS).
Imagine you have a directory on your server called /private and you'd like to keep any pages or files out of Yahoo! Search results. Apache includes many ways to set authentication, but a straightforward method involves setting a .htaccess file. The .htaccess file tells Apache how to configure a particular directory, and you can add a .htaccess file to the /private directory with the following information:
AuthName "Please enter your login info."
AuthType Basic
AuthUserFile /your/path/to/.htpasswd
AuthGroupFile /dev/null
require user insert_user_name

Note that AuthUserFile points to a file that contains the username and password of the authenticated user, and you'll need to change /your/path/to/ to a real directory on your server that's not accessible via the Web. The next step is to create that password file with the htpasswd tool. Enter the following command from a command prompt:
htpasswd -c /your/path/to/.htpasswd insert_user_name

This creates the proper .htpasswd file for that user and puts in place all of the pieces for basic HTTP authentication.
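To make it clearer why a crawler cannot get past this wall, here is a minimal Python sketch of what a client has to send for Basic authentication: the username and password joined by a colon and Base64-encoded in an Authorization header. The helper names and the credentials are hypothetical, chosen only for illustration. A bot that has no credentials sends no such header and is refused by the server.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def parse_basic_auth_header(value: str) -> tuple[str, str]:
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


# Hypothetical credentials, for illustration only.
header = basic_auth_header("alice", "s3cret")
print(header)
print(parse_basic_auth_header(header))
```

Note that Base64 is an encoding, not encryption, so Basic authentication is best combined with an encrypted connection.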
To get the same results on a Windows server running IIS, open the IIS manager and find the directory you'd like to protect. Right-click the directory and choose Properties → Directory Security. Click Edit under "Anonymous Access and Authentication Control," and you'll see the window shown in Figure 6-2.
Figure 6-2. Authentication Methods prompt in IIS

Uncheck the "Anonymous access" box to require authentication. Check "Integrated Windows authentication" for a bit more security or "Basic authentication" for the most basic HTTP authentication. Once you set one of these, only authenticated users will be able to view the files or subdirectories of /private, and Slurp won't be allowed in.
6.3.2. robots.txt Exclusions
If server authentication seems like overkill and you'd rather make your directory or files available to everyone except Slurp, you can do so with a robots.txt file, which indicates how you'd like robots to behave at your site. Well-behaved bots (such as Slurp) check for robots.txt before indexing anything, to make sure they're acting as the site owner wants them to.
With robots.txt, you can tell Slurp that you'd like it to exclude certain directories or files from its crawl. For example, if you'd like Slurp to skip a directory called /private, save the following lines to a file called robots.txt:
User-agent: Slurp
Disallow: /private/

You can also tell Slurp to skip specific files:
User-agent: Slurp
Disallow: /Private.doc
Disallow: /Private.html
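One way to check rules like these before publishing them is to feed them to a robots.txt parser. The sketch below uses Python's standard urllib.robotparser module; the example.com URLs and the "OtherBot" agent name are assumptions made up for the test, and the module's matching may differ in detail from what Slurp itself does.

```python
from urllib.robotparser import RobotFileParser

# The directory rule and the file rules from above, combined into one record.
RULES = """\
User-agent: Slurp
Disallow: /private/
Disallow: /Private.doc
Disallow: /Private.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical URLs on a placeholder domain.
blocked_dir = parser.can_fetch("Slurp", "http://example.com/private/notes.html")
blocked_file = parser.can_fetch("Slurp", "http://example.com/Private.doc")
open_page = parser.can_fetch("Slurp", "http://example.com/index.html")
other_bot = parser.can_fetch("OtherBot", "http://example.com/private/notes.html")

print(blocked_dir, blocked_file, open_page, other_bot)
```

The last check is a reminder that a record addressed to Slurp says nothing to other bots: in this parser, an agent with no matching record is allowed everywhere.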

Once you've listed all of the files and directories you'd like to hide, add robots.txt to the root directory of your web site, so that it is served from the path /robots.txt directly under your domain name.

If a human reads your robots.txt file, they'll see a list of the files and directories you've asked Yahoo! not to index. While robots.txt will keep some bots away, it won't keep people from viewing the files. Private files should always be placed behind server authentication where a password is required to access them.

If you'd like to deny entry to all robots across all areas of your site, you can use a wildcard, like this:
User-agent: *
Disallow: /

Keep in mind that only bots that adhere to the robots.txt standard will play by the rules. People are free to build bots any way they want, and some ignore robots.txt altogether. Luckily, Slurp will always play by the rules.
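To illustrate what "playing by the rules" means in practice, here is a rough sketch of the check a well-behaved bot might run before fetching anything. The function name allowed_urls, the agent names, and the example.com URLs are all hypothetical; nothing here reflects how Slurp is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# The wildcard record from above: every bot, every path.
DENY_ALL = """\
User-agent: *
Disallow: /
"""


def allowed_urls(robots_txt: str, agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that robots_txt permits this agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


# Hypothetical crawl queue on a placeholder domain.
queue = [
    "http://example.com/",
    "http://example.com/private/notes.html",
]

print(allowed_urls(DENY_ALL, "Slurp", queue))
print(allowed_urls(DENY_ALL, "SomeOtherBot", queue))
```

A bot that skips this step is not stopped by anything in robots.txt, which is why the file is a request rather than a protection.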
6.3.3. robots Meta Tags
Another way to guide the Slurp bot on a page-by-page basis as it crawls your web site is through special HTML meta tags. Meta tags add extra information to a web page and are located toward the top of the page, between the <head></head> tags. To keep Slurp from indexing a particular page, add the following tag:
<META NAME="robots" CONTENT="noindex">

This will ensure that the page does not show up in Yahoo! Search results. Many web crawlers look for this tag, so adding it will affect more than Yahoo! Other search engines, such as Google, will also skip the page.
If you'd like search engines to index the page, but not keep a copy in their cache, you can use the following tag:
<META NAME="robots" CONTENT="noarchive">

With this tag, your page will still show up in search results, but the search engine will not store a copy of the page that its users can view. Again, this will affect more than just Yahoo!, because many search engines also obey this tag.
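If you maintain many pages, it can help to confirm which robots directives each one actually carries. The following is a minimal sketch using Python's standard html.parser module; the class and function names are hypothetical, and it only reads the name="robots" tag, not any bot-specific variants.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> matches.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for item in attributes.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="noindex"></head><body></body></html>'
print(robots_directives(page))
```

A page with no robots meta tag yields an empty set, which search engines treat as permission to index and cache.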
Now that you know how to speak Slurp's language, you can make sure that your private or semiprivate information doesn't turn up in Yahoo! Search results, and you can control what Yahoo! sees in the first place.
Copy + Paste.
No quote tags.
No credit.

If you think I'm wrong in any way, please inform me.
