Question Regarding Robots.txt and Engine Spiders
blazeusa
17-07-2002, 22:27/10:27PM
Hello All,
I have a quick question regarding the robots.txt file and the engine bots that visit each site.
First off, I am curious to know whether there are robots that will not index your site if they don't find a robots.txt file.
Also, if I want a robots.txt file that doesn't exclude anything, is the following the correct format:
User-agent: *
Disallow:
Now, on a second note. I have noticed many discussions in which people say that a search engine spider visits their site but doesn't index it. How in the world can you determine which engines are indexing your site or not? Is this vital information included in the site logs? If so, which software package will generate a report for you on the robots by scanning the logs (if any)?
Thanks all!
Nick
excell
18-07-2002, 03:13/03:13AM
User-agent: *
Disallow:
Is correct. I don't know of any robots that will not index your site because you do not have a robots.txt file. If you have no robots.txt, though, make sure the server is configured correctly to return a 404 (file not found) when a robot requests it.
There are many programs that can be used for analysing log files. sawmill.com has a free trial that should give you an idea.
From that you can follow the path the robot takes.
But if you have the raw log files (and they are not too large), you can open them in WordPad and search for robots to see whether, and which, pages they index as well.
(I prefer to look at the raw access files, it's more fun :))
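For anyone who would rather script the raw-log search than do it by hand, here is a minimal sketch. It assumes the server writes the common "combined" log format (with referrer and user agent at the end of each line); the `BOT_MARKERS` list, the function name, and the sample lines are made up for illustration and are not taken from any particular log tool.

```python
import re

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative user-agent substrings; extend with the robots you care about.
BOT_MARKERS = ("googlebot", "slurp", "scooter", "bot", "crawler", "spider")


def robot_paths(lines):
    """Yield (agent, time, path, status) for requests that look like robots."""
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent")
        if any(marker in agent.lower() for marker in BOT_MARKERS):
            yield (agent, match.group("time"),
                   match.group("path"), int(match.group("status")))


# Made-up sample lines, only to show the shape of the input.
sample = [
    '66.249.0.1 - - [18/Jul/2002:03:13:00 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '66.249.0.1 - - [18/Jul/2002:03:13:05 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '10.0.0.7 - - [18/Jul/2002:03:14:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

hits = list(robot_paths(sample))
for agent, when, path, status in hits:
    print(when, status, path, agent)
```

Reading the output top to bottom gives the path the robot took through the site. Note that it only shows what was requested, not what ended up in any index.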
potato
18-07-2002, 18:39/06:39PM
You might want to take a look at
www.sxw.org.uk/computing/robots/botwatch.html
There is a quite useful Perl script available, written by Simon Wilkinson, which can be installed and used for free. Only very little configuration is necessary.
After uploading the file to your cgi-bin directory, you call the file directly from your browser and get a monthly table showing which bots have crawled your site (derived from your server's logfile). The included list of bots is a bit outdated but can be made current if necessary.
Of course, looking through the raw log is more fun (or more grief) as it tells you more about your visitors, potential errors etc. But my logfiles reach 20MB plus a month, so I appreciate Perl's help.
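To give a rough idea of what such a monthly table involves, here is a small sketch of the same kind of tally. It is not BotWatch's code: the `KNOWN_BOTS` table, the function name, and the sample lines are assumptions for illustration, and it expects the user agent to be the last quoted field on each log line.

```python
from collections import Counter

# Hypothetical marker -> label table; this is not the list shipped with BotWatch.
KNOWN_BOTS = {"googlebot": "Google", "slurp": "Inktomi", "scooter": "AltaVista"}


def monthly_bot_counts(lines):
    """Count requests per (month, bot label), one line at a time."""
    counts = Counter()
    for line in lines:
        start, end = line.find("["), line.find("]")
        if start == -1 or end < start or line.count('"') < 2:
            continue  # no timestamp or no quoted fields: skip the line
        # "18/Jul/2002:18:39:00 +0000" -> "18/Jul/2002" -> "Jul/2002"
        month = line[start + 1:end].split(":", 1)[0].split("/", 1)[-1]
        # The user agent is assumed to be the last quoted field.
        agent = line.rstrip().rsplit('"', 2)[-2].lower()
        for marker, label in KNOWN_BOTS.items():
            if marker in agent:
                counts[(month, label)] += 1
                break
    return counts


# Made-up sample lines, only to show the shape of the input.
sample = [
    '66.249.0.1 - - [18/Jul/2002:18:39:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '66.249.0.1 - - [19/Jul/2002:02:10:00 +0000] "GET /a.html HTTP/1.0" 200 512 "-" "Googlebot/2.1"',
    '209.1.1.1 - - [19/Jul/2002:04:00:00 +0000] "GET / HTTP/1.0" 200 512 "-" "Mozilla/3.0 (Slurp/si; slurp@inktomi.com)"',
    '10.0.0.7 - - [19/Jul/2002:05:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

counts = monthly_bot_counts(sample)
for (month, bot), hits in sorted(counts.items()):
    print(month, bot, hits)
```

Because the function only iterates over lines, an open file object can be passed in directly, so a 20MB log never has to be loaded into memory in one piece.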
Alan Perkins
18-07-2002, 18:50/06:50PM
Originally posted by blazeusa
First off, I am curious to know whether there are robots that will not index your site if they don't find a robots.txt file.
Not that I know of. That would be an extremely polite robot.
Originally posted by blazeusa
Also, if I want a robots.txt file that doesn't exclude anything, is the following the correct format:
User-agent: *
Disallow:
Yes.
Originally posted by blazeusa
Now, on a second note. I have noticed many discussions in which people say that a search engine spider visits their site but doesn't index it. How in the world can you determine which engines are indexing your site or not? Is this vital information included in the site logs?
No, it's not in the logs (directly). The logs will show you when spiders read your site, but you don't necessarily know that it has been indexed. If you see referrers from a search engine, then you know it has been indexed. But there is a third case: indexed but no referrers. In this case, you either need to query the search engine's database to determine whether you have been indexed, or assume that you have (or haven't).
Reading and indexing are separate processes. A resource can be indexed without being read, and read without being indexed.
Note also that robots.txt is only a protocol and "unfriendly" spiders may totally ignore it.
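The referrer check described above could be sketched roughly as follows. The `SEARCH_ENGINES` table (host suffix mapped to the query-string parameter holding the search terms) and the function name are assumptions for illustration; check the referrers that actually appear in your own logs before relying on them.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical table: host suffix -> query parameter carrying the search terms.
SEARCH_ENGINES = {"google.com": "q", "altavista.com": "q", "search.yahoo.com": "p"}


def search_referral(referer):
    """Return (engine, terms) if the referrer looks like a results page, else None."""
    parts = urlsplit(referer)
    host = (parts.hostname or "").lower()
    for suffix, param in SEARCH_ENGINES.items():
        if host == suffix or host.endswith("." + suffix):
            terms = parse_qs(parts.query).get(param)
            if terms:
                return suffix, terms[0]
    return None


print(search_referral("http://www.google.com/search?q=robots.txt+format"))
print(search_referral("http://www.mysite.com/page.html"))
print(search_referral("-"))
```

A hit here means a visitor clicked through from a results page, so the page was in that engine's index at the time. No hits proves nothing either way, which is the third case mentioned above.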
Black_Knight
18-07-2002, 19:51/07:51PM
Originally posted by blazeusa
First off, I am curious to know whether there are robots that will not index your site if they don't find a robots.txt file.
No. A robots.txt file exists only to exclude, never to include. The absence of a robots.txt file is the normal, default scenario, meaning that you are not requesting that anything be excluded.
Having no robots.txt file tells the spider that you have not requested that anything be excluded.
Originally posted by blazeusa
If I want a robots.txt file that doesn't exclude anything, is the following the correct format:
User-agent: *
Disallow:
Yes, but the norm is to have no robots.txt file at all, or else to upload a blank file.
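One way to convince yourself that the explicit "Disallow:" file and a blank file behave the same is to feed both to a parser. This is a small sketch using Python's standard-library `urllib.robotparser`; the helper name `allowed`, the agent name "AnyBot", and the test URL are made up for illustration, and a real spider may of course interpret the file with its own code.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, agent, url):
    """Parse robots.txt content from a string and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


allow_all = "User-agent: *\nDisallow:\n"    # the format asked about
blank = ""                                  # an empty robots.txt
block_all = "User-agent: *\nDisallow: /\n"  # for contrast: excludes everything

for name, content in [("allow_all", allow_all), ("blank", blank), ("block_all", block_all)]:
    print(name, allowed(content, "AnyBot", "http://www.mysite.com/page.html"))
```

The first two should report that the page may be fetched, and only the "Disallow: /" version should refuse.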
Originally posted by blazeusa
How in the world can you determine which engines are indexing your site or not? Is this vital information included in the site logs? If so, which software package will generate a report for you on the robots by scanning the logs (if any)?
Your logs can tell you whether the spider is requesting pages (i.e. crawling), but the only way to determine whether they actually get indexed or not is to search on the engine. Most search engines provide a way to search for all pages indexed from a specific domain in the Advanced search options. This is usually in the format of 'domain:mysite.com' or 'url:mysite.com'.
Your logs tell you who or what has looked at your pages, but can't tell you whether it actually indexed or rejected what it found.
Alan Perkins
19-07-2002, 08:20/08:20AM
If your logs show referrals from search engines this normally indicates that you have been indexed.
Originally posted by Black_Knight
Most search engines provide a way to search for all pages indexed from a specific domain in the Advanced search options. This is usually in the format of 'domain:mysite.com' or 'url:mysite.com'
It's simpler to use this JavaScript tool to check pages indexed by various search engines (http://www.searchmechanics.com/look/look.htm). :)
Matt B
19-07-2002, 09:07/09:07AM
Thumbs up to Alan's Search Mech tool :up:
I highly recommend it. It has helped me many times.
Alan, are any other SEs going to be added to the options in the future? :cool: