About /robots.txt

A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

Web site owners use the /robots.txt file to give instructions about their site to web robots.

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt, and finds:

User-agent: *
Disallow: /

The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
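As a rough sketch of the check described above, the snippet below uses Python's standard urllib.robotparser module to parse those two lines and ask whether a robot may fetch the example page. The robot name "ExampleBot" and the helper name may_visit are made up for illustration, and the rules are passed in as text so that nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# The rules a robot would find at http://www.example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def may_visit(robots_txt: str, robot_name: str, url: str) -> bool:
    """Return True if the given rules let robot_name fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)

# "ExampleBot" is a hypothetical robot name.
allowed = may_visit(ROBOTS_TXT, "ExampleBot", "http://www.example.com/welcome.html")
print(allowed)  # False: "Disallow: /" covers every path on the site
```

A well-behaved robot would skip the page whenever this check comes back false.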

There are two important considerations when using /robots.txt:

a. Robots can ignore your /robots.txt. Malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention to it.
b. The /robots.txt file is publicly available. Anyone can see what sections of your server you don't want robots to use.

So don't try to use /robots.txt to hide information.

1. To exclude all robots from the entire server

User-agent: *
Disallow: /

2. To allow all robots complete access

User-agent: *
Disallow:

3. To exclude all robots from part of the server

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
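
Each Disallow value is matched against the start of the URL path, so the three lines above block everything under those directories and leave the rest of the site open. Here is a minimal sketch of that prefix test. The helper name is_allowed and the sample paths are hypothetical, and a real parser also handles details such as percent-encoding that are left out here.

```python
# Prefixes taken from the example above.
DISALLOWED = ("/cgi-bin/", "/tmp/", "/junk/")

def is_allowed(path, disallowed=DISALLOWED):
    """Return True unless path starts with one of the disallowed prefixes."""
    return not any(path.startswith(prefix) for prefix in disallowed)

# Sample paths are made up for illustration.
for path in ["/index.html", "/tmp/scratch.html", "/cgi-bin/search", "/tmpfile.html"]:
    print(path, is_allowed(path))
```

Note that /tmpfile.html stays allowed, because the trailing slash in "/tmp/" limits the rule to the directory itself.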

4. To exclude a single robot

User-agent: BadBot
Disallow: /

5. To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /
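
To see how the two records interact, the sketch below feeds them to Python's standard urllib.robotparser and asks about the same page on behalf of two robots. "OtherBot" is a hypothetical name standing in for any robot that is not listed explicitly.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The named robot matches its own record, where the empty Disallow permits everything.
google_ok = parser.can_fetch("Google", "/welcome.html")
# "OtherBot" is a hypothetical name; it falls through to the catch-all "*" record.
other_ok = parser.can_fetch("OtherBot", "/welcome.html")
print(google_ok, other_ok)  # True False
```

A robot uses the record that names it and only falls back to the "*" record when no other record matches.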

6. To prevent all robots from indexing a page on your site, place the following meta tag into the head section of your page:

<meta name="robots" content="noindex">
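
On the robot's side, honoring this tag means reading the page's head section and looking for the directive. Below is a minimal sketch using Python's standard html.parser. The class name RobotsMetaParser, the helper is_noindex, and the sample page are all hypothetical, and this is not a description of how any particular search engine does it.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives of any robots meta tag found in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content attribute may hold several comma-separated directives.
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())

def is_noindex(html: str) -> bool:
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Sample page made up for illustration.
PAGE = '<html><head><meta name="robots" content="noindex"></head><body>Hello</body></html>'
print(is_noindex(PAGE))  # True
```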






7. Submit your URL to be indexed via:
http://www.bing.com/docs/submit.aspx


1 comment:

Marc Smith said...

Robots.txt files are helpful to hide your personal folders or personal web pages from search engine crawling. This is well-explained information on the topic. I have found it useful.