Using ROBOTS.TXT on your web site.


sophtware
21-11-2001, 14:31/02:31PM
Hi All,

This is my first post to this forum, and since I didn't see any threads on ROBOTS.TXT, I thought I would add one.

This potentially small file can have a big impact on your web site and its listings in the search engines. Many people, and some SEOs, disregard this fact, but it's true.

Using a robots.txt file on your web site not only helps search engines index relevant content pages, but can also keep other spiders off your site completely. You can also use the robots.txt file to point specific search engine spiders to optimized pages (often referred to as cloaking).

This is just the tip of the iceberg, and I would be more than happy to answer any questions users might have about using this file on their web site.

ihelpyou
21-11-2001, 14:42/02:42PM
Welcome to the forums sophtware! :hi:

Yes, it is important. I know we have a few threads that have different posts on it in here somewhere, but here is also a link that explains a lot as well.

http://www.robotstxt.org/wc/exclusion-admin.html

Yes, any help members can get with it is more than welcome!

Kal
22-11-2001, 00:46/12:46AM
Welcome sophtware! :hi: - Thanks for posting and looking forward to your contributions to our forum.

sophtware
22-11-2001, 11:05/11:05AM
Hi Kal,

Thanks for the warm welcome. I look forward to contributing to this forum.

Alan Perkins
22-11-2001, 11:34/11:34AM
Hi Michael, welcome to the forums :hi:

You may be interested to read my thoughts on robots.txt and the robots meta tag (http://www.ebrandmanagement.com/whitepapers/robots2.htm).

One thing - sending spiders to different pages using robots.txt is not cloaking. It's just using a side effect of the standard. For more about cloaking, read Relevancy, Spam, Technology, Cloaking and Ranking : The Ethical Guide (http://www.ebrandmanagement.com/whitepapers/spam1.htm) and The Classification of Search Engine Spam (http://www.ebrandmanagement.com/whitepapers/spam-classification/).

Wow! Managed to mention all my White Papers in a single post. :D

Blue
22-11-2001, 11:47/11:47AM
Welcome to the forums sophtware! :hi:

It has been mentioned that some SE's ignore the robots.txt file. Could you give us a "DO & DON'T" listing of which SE's do which?

sophtware
22-11-2001, 12:21/12:21PM
Hi All,

Thanks for all the welcomes!

I will read all of your white papers later today, Alan. Thanks for that feedback. When I said 'cloaking' earlier, I really didn't mean it in the literal sense. When trying to explain some new concepts, I try to feed off the knowledge that I think my readers may have. Everyone knows that cloaking is a method of presenting optimized pages to specific search engines. In a way, that's what a robots.txt file can do too.

:)

To answer your question, Blue: we are currently building a list of spiders that obey the robots.txt file and those that (sometimes) do not. One spider I know of so far that sometimes doesn't obey the robots.txt file is OpenFind. I investigated this and found out the following, which may apply to other spiders as well.

This engine uses two different spiders for indexing: one for retrieving the robots.txt file and caching it, and another for actually indexing the site. From what I was told, it is possible for the indexing spider to get ahead of the robots.txt spider. This can happen when the indexing spider is coming from a link on another site and your site has not had the robots.txt spider visit yet.

If the indexing spider doesn't find a 'cached' version of the robots.txt file, it posts a request to the robots.txt spider to retrieve it. Until then, the indexing spider gladly indexes all the pages and links on your site. But once the robots.txt spider comes by, the indexing spider will obey your robots.txt file.

Seems kind of backwards, but that is how it was explained to me. I'm sure there are other engine spiders that do the same thing. Eventually we will have a complete list of spiders and their habits posted to our web site.

An interesting note on Scooter (AltaVista's spider). I have noticed that if you don't pay the listing fee in AV, Scooter seems to only index your home page and not follow any links. Once you pay the fee, Scooter seems to break free and index the rest of your site.

:)

Alan Perkins
22-11-2001, 12:36/12:36PM
Hi Michael

I reserve use of the word "cloaking" for deliberate deception of search engine spiders (e.g. detecting a spider by IP address and feeding <BODY> content to that spider that is not designed for any human to view).

The behaviour of that OpenFind robot, if you are correct, is bizarre!

Before you read the White Papers, here's a teaser for you: is it within the robots standards for an indexing spider to create and maintain an index of a page if that page is denied by both robots.txt and the robots meta tag?

Answer: yes! Read the white papers to find out why...

sophtware
22-11-2001, 13:40/01:40PM
Hi Alan,

I read your white papers. Very insightful! If I'm correct, your paper on robots.txt is in reference to the original spec., which did not include the allow rule. With the new spec. it should be possible to disallow the whole site while only allowing access to the index page. It would look something like this:

User-agent: *
Allow: /index.html
Disallow: /
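
For anyone who wants to check a rule set like this before uploading it, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes the allow path is written with a leading slash and that the Allow line comes before the broader Disallow line, since a parser that stops at the first matching rule depends on that order. The host www.example.com and the page names are placeholders, and real spiders may not treat the proposed allow rule the same way this parser does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: block the whole site except index.html.
RULES = """\
User-agent: *
Allow: /index.html
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Ask the parser what a rule-following robot would be permitted to fetch.
index_allowed = parser.can_fetch("*", "http://www.example.com/index.html")
other_allowed = parser.can_fetch("*", "http://www.example.com/products.html")

print("index.html allowed:", index_allowed)
print("products.html allowed:", other_allowed)
```

This only shows how one parser reads the file; it says nothing about whether a given search engine honors the allow rule.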

Also, I would recommend to anyone creating a new web site to install a robots.txt file on the site that disallows access to the whole site until it reaches a point where it can be indexed.

I think the reason some people think the search engine spiders ignore their robots.txt file is because one or more spiders indexed their site in the beginning before the robots.txt existed. And now with the spiders coming back, it looks as though they did not retrieve the robots.txt file.

We are still trying to build a list of spider habits and IP addresses for our web site, and hope to have a fairly comprehensive list by some time next year. We are building this with the help of our users of Robot-Analyzer Professional, which allows you to mine your log file for spider visits.

If you would like, I would be more than happy to extend you a copy of the software for free. Just download it from our web site and email me (michael@analystsoftware.com) and I'll send you a free unlock code. :)
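
For readers who would like to look for spider visits in their own logs by hand, here is a rough sketch of the general idea in Python. It is not how Robot-Analyzer works (that is not described in this thread). It assumes the access log is in the common "combined" format with the user agent as the last quoted field, and the sample lines, IP address, and ExampleBot name are made up for illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format: ip, two ignored fields, [date], "request",
# status, size, "referrer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_requests(lines):
    """Count /robots.txt requests per (IP address, user agent) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == "/robots.txt":
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


# Made-up sample lines standing in for a real access log.
SAMPLE_LOG = [
    '192.0.2.10 - - [22/Nov/2001:12:21:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [22/Nov/2001:12:21:05 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [23/Nov/2001:08:00:00 +0000] "GET /robots.txt HTTP/1.0" 200 68 "-" "ExampleBot/1.0"',
]

counts = robots_txt_requests(SAMPLE_LOG)
for (ip, agent), hits in counts.items():
    print(ip, agent, hits)
```

A visitor that asks for robots.txt is probably a robot, but as noted elsewhere in this thread, curious people request it too, so treat the result as a starting list rather than proof.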

Alan Perkins
22-11-2001, 14:12/02:12PM
Thanks Michael, I'll do that. I'll also return the compliment on the next release of Search Mechanics (at the moment it's quite difficult to give out free licenses!).

Originally posted by sophtware
Also, I would recommend to anyone creating a new web site to install a robots.txt file on the site that disallows access to the whole site until it reaches a point where it can be indexed.

That is good advice!

Originally posted by sophtware
If I'm correct, your paper on robots.txt is in reference to the original spec

That's right. My understanding is that the "original" spec is still the only spec. The "allow" rule is only a proposal, not a specification. If you have evidence of well-known spiders obeying the allow rule, that will be very interesting...

Are you on the robots mailing list, Michael? It's where some of us robot writers hang out. To subscribe, send a message to listar@mccmedia.com with "subscribe robots" in the message body.

sophtware
22-11-2001, 14:25/02:25PM
Hi Alan,

Didn't know about the 'robots' list serv. I just signed up. Thanks for the tip.

I don't know of any specific engines that are obeying the latest cut of the standard. But it would be interesting to set up some dummy web sites to verify this.

We have something like this planned for next year on our web server. We have a co-located box with some spare IP addresses and domain names. When we get some more time, we'll set up some test web sites and submit them to the search engines and watch the logs. Should prove to be very insightful.

We also plan on offering a service on our web site next year for customers of Robot-Analyzer Professional that will allow them to submit their collected spider IP addresses and view those of others that have been submitted. Over time we should have the most comprehensive spider database on the web!

rmridgew
05-12-2001, 19:17/07:17PM
Why wouldn't I want my entire site indexed? Can subpages hurt me?

The way I see it so far, the spider goes to your parent directory (in my case, for www.charlestonfishing guide.com, that is http://www.awod.com/gallery/business/fishguide/), looks at every file (including files in subfolders), and relates them to your site.
If I have a /cgi-bin/ folder, I would need to disallow it in my robots.txt file.

I have thought in the past that spiders only indexed html, shtml, and other web documents linked from your index page and other pages throughout the net.

Alan Perkins
05-12-2001, 19:22/07:22PM
Quote from http://www.robotstxt.org/wc/norobots.html

...there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

sophtware
05-12-2001, 22:55/10:55PM
There are many reasons why you don't want spiders indexing your entire site (if you can help it). Here are a couple that I can think of off the top of my head:

1) Most sites, mine included, are in a constant state of development. I often post pages to the web server so I can see them in real-time, but don't want them indexed (yet). This is usually the case when your site has many non-html pages, like php, asp, and others. These pages you can't directly view in your editor, but can on your web site.

2) Lots of pages on a site really don't contain any useful information for the public. For instance, if I was searching a search engine for copyright and legal information, I sure wouldn't want to see 200 million listings for copyright, privacy, or legal pages from every site on the web! I would like to see relevant listings for my search. Every web site owner can do us all a favor by *limiting* the pages that get indexed to those that are *relevant* to the site's content.

Robots.txt can help us all. We just need to use it. And the more we use it, the more the search engines will pay attention to it. And in the end, we can look forward to better searching on all the engines for all of us :)

lots0cash
06-12-2001, 01:21/01:21AM
I have never used a robots.txt file, so this may be a dumb question;

What is the difference between using a robots.txt allow all and not using a robots.txt file at all?

Alan Perkins
06-12-2001, 05:15/05:15AM
Originally posted by lots0cash
I have never used a robots.txt file, so this may be a dumb question;

What is the difference between using a robots.txt allow all and not using a robots.txt file at all?

It's a very good question.

Both should have the same effect on your site being indexed. The differences are quite subtle:

1) Using a robots.txt file, you are actively giving permission to robots to crawl your site. You are indicating that you know this might happen and you accept it. Without the file, you are either ignorant of it (and maybe robots generally) or you are passively giving permission.
2) All good robots request robots.txt. So if you have one, the hits will show up in your access logs. If you don't have one, the failed requests will show up in your error logs.

Don't assume that ONLY robots access robots.txt, however. Plenty of nosey/interested people take a look too!
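
To make the "allow all" case concrete, here is a small sketch using Python's standard urllib.robotparser module. It assumes the usual way of writing an allow-all file, a Disallow line with an empty value; the ExampleBot name and www.example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# An "allow all" file: an empty Disallow value excludes nothing.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL.splitlines())

# Any robot may fetch any page under these rules.
any_page_allowed = parser.can_fetch(
    "ExampleBot", "http://www.example.com/any/page.html"
)
print("any page allowed:", any_page_allowed)
```

As far as indexing goes, that is the same answer a robot reaches when the file is missing, which is why the practical difference shows up mainly in your logs.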

SEO_Speedster
11-12-2001, 15:09/03:09PM
Welcome Michael...

Today too, is my first day here.

If there is anyone that knows how to handle the robots.txt business, I would say your company has it down. I won't plug the product specifically or anything like that, but Ken Garner @ Analyst has done a great job getting things together in regards to this subject.

:)

ihelpyou
11-12-2001, 15:11/03:11PM
oh shoot, no one cares about posting the link... that is just fine as it does pertain to the thread.

SEO_Speedster
11-12-2001, 16:07/04:07PM
Seems like the root domain of http://www.analystsoftware.com/ has done a good job of calling attention to it anyways!

If it changes, and this thread gets old, the information relevant to that particular product should still be available at:
http://www.analystsoftware.com/downloads/

pageoneresults
31-12-2001, 10:51/10:51AM
Good morning everyone! As some of you already know, I devoted extensive research into this topic months ago. I found that having the robots.txt file present in the "root directory" plays an important part in a clean indexing.

I accidentally disallowed a sub directory on our corporate site that has a 216 web safe color palette (http://www.eagle411.com/java/colorcube/colorcube.htm) in it. That page held positions in the top ten results in most of the majors.

I was just recently reviewing statistics and could not for the life of me figure out why it disappeared. Well, I figured it out, I had a disallow: /java/ line in my robots text file that was the problem. Inside that /java/ directory is my color palette. I just removed the disallow: line a couple of days ago and hope to see that page back in the top positions within the next couple of months.

What is a Robots Text File (http://www.123seo.com/information-tips/robots-text-file.htm)?

sophtware
31-12-2001, 10:58/10:58AM
Try using Robot-Manager to manage your robots.txt files for your web site. You get a GUI for creating your robots.txt file, which might have helped catch your problem. You can download it here:

http://www.websitemanagementtools.com/downloads/rbtmgr30.exe

FYI: Analyst Software, which used to sell an oem version of Robot-Manager, has folded. You can still get the program from www.websitemanagementtools.com. Thanks...

Kal
03-01-2002, 03:20/03:20AM
Hey pageone! Haven't seen you since I left the other forums. Welcome to the friendly forums. :cheers:

pageoneresults
03-01-2002, 11:42/11:42AM
Hey, long time no hear. Yes, it has been a while since I've been active in any of the forums I previously participated in. All the politics sort of pushed me away from the "other" place, plus I was overburdened with responsibilities at the time.

Thanks for the welcome!

ihelpyou
03-01-2002, 11:58/11:58AM
PageOne, you really should make up a sig file for your posts. That way, members can easily visit you. :)

Sharon & Roy
03-01-2002, 12:06/12:06PM
Hello Michael & pageoneresults,

Got a couple of questions for you boys (or anyone else) if you happen to know. This is specifically for Google.

We've "heard" that by disallowing a directory (or page) using the robots.txt file ...

1) That will then remove all the pages in the directory on the next update, is that correct?

2) If that is correct, then we also heard that it will take 90 days before the spider will return to check the robots.txt file?

We have a possible client that we need to "clean" out a directory or two of both mirrored pages and old outdated ones. We don't want to proceed on this until we are sure one way or the other.

Thank you.

pageoneresults
03-01-2002, 12:25/12:25PM
Google will call the robots.txt file each time it comes around for indexing so the 90 day delay may not be accurate.

As for cleaning up old pages that are mirrored: I will typically remove old pages and have my host set up a custom 404 script so any invalid requests automatically go to the site map or the home page, usually the site map. I want to give the user or spider a page of relevant content where they can hopefully find what they were looking for. I just hate to see those "page not found" errors, and the custom 404 is one way to eliminate them.

If it is just Google that you are worried about, I would put a Disallow: line for just Google, that way you'll keep the little bugger out of there to avoid any penalties.

User-agent: Googlebot
Disallow: /directory-name/
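
Here is a quick sketch of how a rule-following parser reads a record aimed at a single robot, using Python's standard urllib.robotparser module. The ExampleBot name, the www.example.com host, and the page names are placeholders. It only shows which requests the rules permit; it does not tell you when or whether Google drops pages it has already indexed.

```python
from urllib.robotparser import RobotFileParser

# A record that applies to Googlebot only.
RULES = """\
User-agent: Googlebot
Disallow: /directory-name/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

old_page = "http://www.example.com/directory-name/old-page.html"
home_page = "http://www.example.com/index.html"

googlebot_old = parser.can_fetch("Googlebot", old_page)    # inside the blocked directory
googlebot_home = parser.can_fetch("Googlebot", home_page)  # outside it
other_bot_old = parser.can_fetch("ExampleBot", old_page)   # a robot with no matching record

print(googlebot_old, googlebot_home, other_bot_old)
```

Robots that are not named in the file, and have no "*" record to fall back on, are not restricted at all, so the other engines would still crawl that directory.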

sophtware
03-01-2002, 13:38/01:38PM
While using Robot-Manager Professional, I have noticed that Google does in fact request the robots.txt file every time it hits your site. Sometimes, that's all it requests. Other times, it will grab one or two pages. Then one day, out of the blue, Google indexes the whole site. (Probably the 90-day indexing interval that you mentioned.)

I agree with the custom 404 errors--to some extent. I personally would rather have Google throw out pages that don't exist on my site than have it index a directory or home page. In some ways hiding the 404 could be considered spam (and could easily be detected by Google and others). Just request a page on the site that obviously doesn't exist, and see what you get back. You could in theory load your web site up with tons of links to missing pages that all serve up your home page.

To me the risk of getting banned would be far greater and not worth it. I would let the search engine remove the missing pages.
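
The "request a page that obviously doesn't exist" test is easy to sketch in Python. This is only an illustration of the idea, not a description of what any search engine actually does; the function names are made up, and www.example.com in the usage note is a placeholder.

```python
import uuid
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the final HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            # urlopen follows redirects, so this is the status of the last hop.
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is on the exception.
        return err.code


def serves_page_for_missing_url(base_url, fetch=fetch_status):
    """True if a made-up URL under base_url comes back as 200 instead of 404."""
    # A random name that should not exist on any real site.
    probe = base_url.rstrip("/") + "/" + uuid.uuid4().hex + ".html"
    return fetch(probe) == 200
```

Calling serves_page_for_missing_url("http://www.example.com") would make one real request. The fetch parameter is there so the check can be tried out with a stand-in function instead of the network.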

Alex Pickering
18-04-2002, 23:32/11:32PM
I am fairly new to the concept of robots (I am guessing you are talking about <meta name="robots" content="index,follow">; if I am mistaken, please let me know), and I would like to know everything that you have to offer on them. I do not have any specific questions, since I don't yet know the basics well enough to ask them, so an in-depth overview with any additional information would be useful. I am also interested in using some of the information I collect from your post to write my own article on robots for my web site. If you do not want me to do so, please say so, and if you don't mind or would even like to write it for my web site, all the better :)

Thank you!

ihelpyou
18-04-2002, 23:49/11:49PM
hey alex, all you need to know is right here:

www.robotstxt.org

brian110872
10-03-2004, 13:12/01:12PM
Hi,

I'm new to the forums. I want to disallow a folder with usernames and passwords, so I created a robots.txt file and uploaded it to my main html directory. Now the search engines won't index it, but if someone comes along and types in www.mywebsite.com/robots.txt they can see which files I'm telling the search engines not to index. What can I do?

Brian

ihelpyou
10-03-2004, 13:25/01:25PM
Welcome to the forums Brian! :hi:

Oh yes, anyone can view your robots.txt file.

You should password protect that folder as well. I don't know any other way to do it. So when someone tries to view the folder with usernames and passwords in it, they will get a message saying they don't have permission to view the file.

brian110872
10-03-2004, 13:44/01:44PM
Hi ihelpyou and thank you. How do I password protect it? I never password protected anything.

Brian

brian110872
10-03-2004, 16:55/04:55PM
Can I put the robots.txt file in a different directory than my main root directory? I just contacted my web host and I can only password protect a directory.

Brian

pageoneresults
10-03-2004, 17:33/05:33PM
Originally posted by brian110872
Can I put the robots.txt file in a different directory than my main root directory?

No. It must reside at the root level of your domain...

http://www.example.com/robots.txt

Originally posted by brian110872
I just contacted my web host and I can only password protect a directory

I think there may have been some confusion in the beginning. You cannot password protect the robots.txt file. What Doug was suggesting is that you password protect the sub-directory that those files are in if you don't want anyone to see or know where they are. The robots.txt file cannot be protected; it is there for all to see.
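
To illustrate the root-level rule: whatever page a robot starts from, it looks for the file at the top of the same host. Here is a minimal sketch with Python's standard urllib.parse module; the function name and the page URL are made up for the example.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that serves page_url."""
    # A path starting with "/" replaces the whole path of the page URL.
    return urljoin(page_url, "/robots.txt")


location = robots_txt_url("http://www.example.com/private/members/list.html")
print(location)
```

A copy of robots.txt placed inside /private/ or any other sub-directory is never requested, so it has no effect.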