BotSeer?


WebSavvy
26-09-2007, 06:58/06:58AM
Can someone please explain the "logic" behind this??? I know bots that are compliant are supposed to ask for robots.txt file -- BUT, have you ever heard of one that wanted to INDEX just your robots.txt file?

Honestly, this is useless and IMO it shouldn't be indexed in any friggin SE!!!
http://botseer.ist.psu.edu/

Found that "bot" hitting several of my sites today. It can't get anything from my sites though because I use a whitelist via .htaccess and it was served a nice big fat juicy ACCESS DENIED message.

Just thought I'd give the rest of you a heads-up on this one if you hadn't already noticed it hitting your site.

IncrediBILL
26-09-2007, 11:27/11:27AM
I went to the site and took a look.

Although I can see how it could provide value to a handful of researchers it's the stupidest excuse for crawling the net I've ever seen.

WebSavvy
26-09-2007, 11:51/11:51AM
OK, now here's a quandary: how do you add a rule to a robots.txt file that disallows bots from being able to index it?

These bozos are indexing the robots.txt file, and there will be other bozos that will come along and do it too (count on it).

So how would you stop future bozos, the ones you won't know about until they've already indexed your robots.txt file, from indexing it?

You can't add this:
User-agent:*
Disallow: /robots.txt

Because then NO BOT would be able to access your robots.txt file and know which files they are and aren't allowed to index.

Now, to make it worse, the bozos over at BotSeer are storing a local copy of your robots.txt on their server (like they have permission to do that or something) PLUS they're linking to your robots.txt file.

In some circles, blackhats would kill for a link from an .edu site -- would it matter if the link is to their robots.txt file? :lol:

IncrediBILL
26-09-2007, 12:05/12:05PM
FWIW, even Google has robots.txt files indexed (http://www.google.com/search?q=%22some+misbehaved+spiders+out+there+that+go+_way_+too+fast%22&num=100&hl=en&newwindow=1&c2coff=1&safe=off&filter=0) that show up in the SERPs so there is precedent for this stuff.

WebSavvy
08-07-2008, 10:09/10:09AM
I came across this WMW thread (http://www.webmasterworld.com/google/3649546.htm) in Google while I was doing some research on something related to robots.txt.

Seems there are others now that are concerned about having their robots.txt file indexed and cached in the SEs' databases.

IMO, it's a security risk to have your robots.txt file indexed/cached in SEs because of malicious bots that crawl through the cache files in SEs looking for specific strings. (case in point: read every bad bot rant by IncrediBill regarding cache pages in SERPs)

Is there a way to effectively disallow indexing of your robots.txt file without accidentally causing any negative impact on the crawling of your site?

UPDATE: Do a link: command query in Google for that BotSeer site. It seems BotSeer is allowing their own search results pages to be indexed -- which is why so many sites are now finding their own robots.txt files indexed in Google!

SIDE NOTE: Bill, g1 -- I'm not a member over at WMW but maybe one of you could post the BotSeer info to the WMW thread linked to above?

g1smd
08-07-2008, 10:44/10:44AM
*** You can't add this:

User-agent:*
Disallow: /robots.txt

to your site. ***


Yes you can.

You are mixing up spidering with indexing.

Bots will always request robots.txt whatever you have inside that file.

They even request it on sites that have this:

Disallow: /

The effect of Disallow: / or Disallow: /robots.txt is simply that the robots.txt file will not be fetched by the systems that scan file content for indexing. It will still be fetched by the systems that read robots.txt to gather the site's permissions and denials as far as indexing content goes.
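
To make the crawling-versus-indexing point concrete, here is a minimal sketch using Python's standard urllib.robotparser. The bot name "ExampleBot" and the www.example.com URLs are made up for illustration. The parser has to read the file before it can apply any rule, including the rule that tells content fetchers to leave robots.txt itself alone. This only shows how a compliant parser interprets the rule; it says nothing certain about how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks compliant bots not to fetch
# robots.txt itself as indexable content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /robots.txt
Disallow: /cgi-bin/
"""


def build_parser(text):
    """Feed robots.txt text to the parser, as a bot's permission reader would."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and example.com are illustrative names only.
robots_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/robots.txt")
page_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print("robots.txt fetchable as content:", robots_allowed)
print("index.html fetchable as content:", page_allowed)
```

The rules were only known after the file was read, which is why the Disallow line cannot stop a bot from requesting robots.txt for its permissions.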

g1smd
08-07-2008, 11:37/11:37AM
Pwned!!!

botseer.ist.psu.edu/namelog.jsp?s=0&l=1

There are several thousand nonsense user-agents in that list, and almost every one directly links back to some porn site or other.

The last 90% of the list is filled with that sort of junk. Anyone at that university have a clue?

WebSavvy
08-07-2008, 13:22/01:22PM
Hey g1, thanks. Yep, I'll add that to my file today. Thanks for updating the WMW thread with the BotSeer info.

At least it'll give them a starting point. What I can't understand is why Google is even indexing this stuff.

Blue
08-07-2008, 14:06/02:06PM
Do we know what botseer's user-agent is called?

g1smd
08-07-2008, 14:19/02:19PM
It should be mentioned on their site somewhere.


Looks like the user-agent list at botseer.ist.psu.edu/namelog.jsp?s=0&l=1 has been partially fixed.

The duff entries no longer link back to the porn sites, but the list is still 90% polluted with junk entries.

WebSavvy
08-07-2008, 14:37/02:37PM
Blue, all I can remember was the UA had "BotSeer" in it, and that's how I found it.

It'd been requesting my robots.txt file over, and over, and over again and kept getting a 403 each time. There were over 300 requests from that bot in one day's time. I was going over my logfile and saw all the requests so went to Google to do some digging (that was last year in 2007 when I opened this thread).

So, I imagine if you just block any UA string containing "BotSeer", you should be fine.
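
For anyone who wants to check their own logs the same way, here is a rough sketch that counts robots.txt requests from any user-agent containing a given string, grouped by status code. It assumes the common Apache "combined" log format. The sample lines, IP addresses, and user-agent strings are invented for illustration, since the real BotSeer UA string is not quoted anywhere in this thread.

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user-agent fields of an
# Apache "combined" format log line (assumed format).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_robots_hits(lines, needle):
    """Count /robots.txt requests per status code for UAs containing needle."""
    counts = Counter()
    needle = needle.lower()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if match["path"] == "/robots.txt" and needle in match["agent"].lower():
            counts[match["status"]] += 1
    return counts


# Made-up sample data: the user-agent strings below are placeholders.
SAMPLE_LOG = [
    '192.0.2.10 - - [26/Sep/2007:06:41:02 +0000] "GET /robots.txt HTTP/1.1" 403 199 "-" "Hypothetical-BotSeer/0.0"',
    '192.0.2.10 - - [26/Sep/2007:06:41:09 +0000] "GET /robots.txt HTTP/1.1" 403 199 "-" "Hypothetical-BotSeer/0.0"',
    '192.0.2.10 - - [26/Sep/2007:06:41:15 +0000] "GET /robots.txt HTTP/1.1" 403 199 "-" "Hypothetical-BotSeer/0.0"',
    '192.0.2.10 - - [26/Sep/2007:06:41:20 +0000] "GET /index.html HTTP/1.1" 403 199 "-" "Hypothetical-BotSeer/0.0"',
    '192.0.2.77 - - [26/Sep/2007:06:42:00 +0000] "GET /robots.txt HTTP/1.1" 200 310 "-" "SomeOtherBot/1.0"',
]

hits = count_robots_hits(SAMPLE_LOG, "botseer")
print(dict(hits))
```

The match is a case-insensitive substring test, the same idea as blocking any UA containing "BotSeer".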

I noticed Wikipedia has an "article" on BotSeer. I've not looked at that (and don't plan to) but maybe there might be some UA information listed there?

g1smd
08-07-2008, 14:53/02:53PM
Hmm. The article in Wikipedia gives some basic background to the project.

They have analysed millions of robots.txt files to see which bots are being excluded.

However, they are missing a huge amount of the picture. I do most of my exclusions using .htaccess rules.

WebSavvy
08-07-2008, 15:53/03:53PM
*** I do most of my exclusions using .htaccess rules ***

Yep. So do I ... which is why they couldn't access my site to index my robots.txt file.

You'd think a group of university students who built a robot that specifically indexes robots.txt files would at least know how to properly set up their own.

They have bot-specific rules FIRST followed by general rules. Once a bot finds its own bot-specific rules it doesn't continue to read the rest of the file so any generic (user-agent: *) rules won't even be read or followed.

They should have placed the generic rules FIRST
User-agent: *

Then followed it afterwards with bot-specific rules
User-agent: Googlebot

I did some testing in Google webmaster apps last night with the robots tool.

If I create my robots.txt file in the following manner, I don't need to keep editing it to stop other compliant bots (ones I may not even know exist) that may come knocking at any point in the future, while still allowing the bots I do grant permissions to.


User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: MSNBot
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /javascript/
Disallow: /robots.txt

User-agent: Googlebot-Image
Disallow: /


If Googlebot-Image isn't given its own DISALLOW rule, Google treats it as having permission to index, even though it wasn't listed in the middle section.

I haven't implemented this type of format yet, but I'm thinking about it. Seems like it'd be a lot easier. Then, save the .htaccess blocking for the nasty noncompliant bots.

g1smd
08-07-2008, 16:22/04:22PM
Google ignores the generic section of robots.txt if there is a specific section for Googlebot present.

We proved this a couple years ago: http://www.webmasterworld.com/google/3044757.htm - with input from both GoogleGuy and Vanessa Fox at the time.

Only if there is no Google-specific section does Google follow the generic section of the file.

When both sections are present, Google does not read both sections and attempt to combine the rules. It follows only the most specific part.
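
Here is a small sketch of that "most specific section only" behaviour, run against the file layout posted above. The matching rule used here (case-insensitive prefix match on the bot name, longest match wins, "*" as fallback) is an assumption made for illustration, not Google's actual code, and the helper names are invented.

```python
def parse_groups(text):
    """Split robots.txt text into (user_agents, disallow_paths) groups."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:  # a rule line closed the previous group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow" and agents:
            rules.append(value)
    if agents:
        groups.append((agents, rules))
    return groups


def select_rules(groups, bot_name):
    """Return the rules of the single most specific matching group."""
    bot = bot_name.lower()
    best_agent, best_rules, fallback = None, None, None
    for agents, rules in groups:
        for agent in agents:
            if agent == "*":
                if fallback is None:
                    fallback = rules
            elif bot.startswith(agent) and (
                best_agent is None or len(agent) > len(best_agent)
            ):
                best_agent, best_rules = agent, rules
    if best_rules is not None:
        return best_rules  # the generic section is ignored entirely
    return fallback or []


def is_allowed(rules, path):
    """A path is blocked if any non-empty Disallow value is a prefix of it."""
    return not any(rule and path.startswith(rule) for rule in rules)


ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: MSNBot
User-agent: Slurp
Disallow: /cgi-bin/
Disallow: /javascript/
Disallow: /robots.txt

User-agent: Googlebot-Image
Disallow: /
"""

groups = parse_groups(ROBOTS_TXT)
for bot, path in [
    ("Googlebot", "/index.html"),
    ("Googlebot", "/cgi-bin/form.pl"),
    ("Googlebot-Image", "/images/logo.png"),
    ("UnknownBot", "/index.html"),
]:
    print(bot, path, is_allowed(select_rules(groups, bot), path))
```

Under these assumptions the position of the generic section in the file makes no difference: a bot with its own section never sees the generic rules, so any rule meant for it has to be repeated inside its own section.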

WebSavvy
08-07-2008, 16:52/04:52PM
Once I've read ONE thread at WMW, I can't read a second thread (or even the same thread a second time) without being shown some screen asking me to donate money.

I can't even click through to the thread from a SERP listed in google w/o getting that very same message.

As far as I can tell, WMW won't let you read any content there w/o some type of donation yet they let the bots crawl and index that content that users never get to view without payment.

Have a look at the robots.txt file on that BotSeer site, and you'll see what I mean about the way they've set it up.

The rule that blocks access to their search results is in the generic section of their file. There is a googlebot-specific section too, but it has no matching rule.

So, what it boils down to is, they didn't deliberately allow their results to be indexed, as there is a disallow rule in the generic section. However, Google didn't read/obey that because the same disallow rule wasn't duplicated in the googlebot-specific section of the file.

However, whether or not their SERPs ended up indexed in the SEs SERPs is a moot point because they direct link to robots.txt files on people's sites. They (BotSeer) are being irresponsible, IMO.

g1smd
08-07-2008, 17:02/05:02PM
WMW has "first click free". You do not have to pay anything, you can just join as a free member, like here at IHU. There are some pay sections at WMW, and they are more heavily promoted than those here.

Connie
08-07-2008, 17:17/05:17PM
Deb, WMW did recently start requiring that you log in to view more than one thread in the free forum. You do not have to pay. You have to register. Registration is free. Once you log in you can read all the threads in the free forum.

Connie
08-07-2008, 17:18/05:18PM
Not sure how long the login cookie lasts, but I had to log in yesterday, and again today.

WebSavvy
08-07-2008, 17:24/05:24PM
Yeah, that's what g1 just told me in PM. I will probably sign up tomorrow. Right now we have 70mph winds, thunder, lightning, and power outages all over the place.

Looks like we're getting a tornado.

g1smd
08-07-2008, 18:08/06:08PM
That robots.txt database is quite interesting in some ways.

It is amazing how many of the robots.txt files have non-valid and broken entries.

Blue
08-07-2008, 18:59/06:59PM
Thanks Deb! :)