Search Engine Optimization Techniques (SEO): August 2009

About /robots.txt

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Web site owners use the /robots.txt file to give instructions about their site to web robots.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
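The check described above can be sketched with Python's standard urllib.robotparser module. This is only a minimal illustration, not how any particular robot is implemented: the helper names robots_url_for and is_allowed and the robot name ExampleBot are hypothetical, and the rules are parsed from a string instead of being fetched over the network.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url_for(page_url):
    """Return the /robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_allowed(robots_txt, user_agent, page_url):
    """Parse robots.txt text and report whether user_agent may fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# The same rules as in the example above: every robot, whole site off limits.
RULES = """\
User-agent: *
Disallow: /
"""

page = "http://www.example.com/welcome.html"
print(robots_url_for(page))
print(is_allowed(RULES, "ExampleBot", page))
```

A real robot would download the file first (RobotFileParser also offers set_url and read for that) and only then decide whether to request the page.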

There are two important considerations when using /robots.txt:

a. Robots can ignore your /robots.txt. In particular, malware robots that scan the web for security vulnerabilities and email address harvesters used by spammers will pay no attention to it.
b. The /robots.txt file is publicly available. Anyone can see which sections of your server you don't want robots to visit.

So don't try to use /robots.txt to hide information.

1. To exclude all robots from the entire server

User-agent: *
Disallow: /

2. To allow all robots complete access

User-agent: *
Disallow:

3. To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
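Each Disallow value is treated as a path prefix, which is why the trailing slashes matter. The sketch below illustrates that matching rule in a simplified form. The function name blocked_by and the sample paths are hypothetical, and real parsers also deal with URL encoding and with extensions that this sketch ignores.

```python
def blocked_by(disallow_prefixes, path):
    """Return the first Disallow prefix that path starts with, or None."""
    for prefix in disallow_prefixes:
        # An empty Disallow value blocks nothing, so skip it.
        if prefix and path.startswith(prefix):
            return prefix
    return None


DISALLOWED = ["/cgi-bin/", "/tmp/", "/junk/"]

for path in ["/index.html", "/tmp/cache.dat", "/cgi-bin/search", "/tmpfile.html"]:
    print(path, "->", blocked_by(DISALLOWED, path))
```

Note that /tmpfile.html is not blocked: it starts with /tmp but not with /tmp/, so it falls outside the excluded directory.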

4. To exclude a single robot

User-agent: BadBot
Disallow: /

5. To allow a single robot

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
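When a file contains several sections, a robot follows the section that names it and falls back to the "*" section otherwise. The following sketch checks that behaviour with Python's urllib.robotparser. It assumes Googlebot is the token the allowed crawler matches against, and OtherBot is a hypothetical name standing in for every other robot.

```python
from urllib.robotparser import RobotFileParser

# One section for the allowed robot, one catch-all section for everyone else.
RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for agent in ["Googlebot", "OtherBot"]:
    allowed = parser.can_fetch(agent, "http://www.example.com/welcome.html")
    print(agent, "allowed" if allowed else "blocked")
```

The blank line between the two sections is what separates them, so it should not be left out.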

6. To prevent all robots from indexing a page on your site, place the following meta tag into the head section of your page:

<meta name="robots" content="noindex">
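To confirm that a page really carries this tag, you can scan its HTML. The sketch below uses Python's standard html.parser module. The class name RobotsMetaParser, the function is_noindex and the sample page are hypothetical, and the sketch only looks at the robots meta tag, not at HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every robots meta tag found in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for item in attr_map.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def is_noindex(html_text):
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" in parser.directives


PAGE = '<html><head><meta name="robots" content="noindex"></head><body>Hi</body></html>'
print(is_noindex(PAGE))
```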

7. To submit your URL to be indexed, use:

http://www.bing.com/docs/submit.aspx
