MojeekBot


marcs
07-07-2006, 08:41/08:41AM
[Split from thread (http://www.ihelpyou.com/forums/showthread.php?s=&threadid=22608)]

Hi,

Hope this is OK, but I noticed a few genuine but smaller search engines on that list, including my own unfortunately, and wondered if I could ask Deb (savvy1) whether there was any particular reason, or whether there was anything I could have done differently to avoid being included in the first place.

I understand it's up to webmasters who they ban, but when you have spent so much time working on a project and made every effort to be considerate by crawling slowly, obeying robots.txt and limiting the number of pages, to then be described as a bad bot, scraper etc., or even just to have spammer=yes next to your name, can be a little disheartening.

I'm not here as some kind of self promotion so please sticky me if anyone wants to know which engine/bot, all feedback is appreciated.

Thanks,
Marc.

gio3
07-07-2006, 09:03/09:03AM
Hi marcs and welcome :hi:
As you can see from this thread and the original one, we were only trying to compile a list of bots that are dangerous or annoying. Maybe your bot was included in the list by mistake.
I think you should post it in the forum so we can check and Deb can, if need be, remove it from the list.
Gio

WebSavvy
07-07-2006, 09:08/09:08AM
Please note specifically this from the opening paragraph, first post, top of the blocked bots list:
Below are a list of bots I've blocked from my site using .htaccess. Not all of the bots listed below are "bad" bots per se, many are just "annoying" bots.
In the list I posted, I said that not ALL of them were bad bots. I also said that the list I posted was the list from MY OWN SITE.

In order to simplify things, blocked bots are grouped together and handled with one directive (e.g., spammer=yes).

It's a little easier to do that than to set up multiple groups with the same directives, since multiple groups make the server work harder than necessary and therefore use more resources and bandwidth than required.

If your engine is in my list, it's because it's blocked from my site. If it's blocked from my site, it's in my list for a reason. I block all unnecessary bots from accessing my site.

My domain has well over 150,000 pages and when I have daily activity from all of the majors (that count!) plus every joe-shmoe with a robot, it's eating up more bandwidth than I believe is necessary.

There's no reason that bot activity should account for 3 GIG of BW on my site daily.

Since blocking all of the unnecessary bots, the pages on my site load faster for my VISITORS and that's the bottom line.

marcs
07-07-2006, 10:00/10:00AM
Hi Gio and thanks for the welcome, I didn't want to post my url as new engines usually then get accused of being a me too operation. Also I know some forums discourage it. I'll pm you the url.

Hi Deb, thanks for the answer and I understand where you are coming from. I was just wondering if there was a particular reason that was all, if it was because the bot had done something it shouldn't then I'd like to know so I could rectify the problem. Although I'd be surprised if it was because of crawling too many pages as I've limited the number per site in the early stages. But if you can't help never mind.

Regards,
Marc.

WebSavvy
07-07-2006, 10:04/10:04AM
Hi Marc,

I've changed the directive to simply just read as follows:
blocked=yes

This way, no one's bot is labeled as a spammer, when they're not. It may just be an annoying bot, and is lumped into the control directive for easier management.

FWIW, I don't block bots that send me traffic even if they're not a major/minor SE.

If they're not sending me traffic, and are just indexing my site daily for the hell of it ... I block them.

I have a business to run too. There's no reason why some wanna-be with a spider should be able to access my site just because he feels like it (not that you are in this group, but many others, are).

Connie
07-07-2006, 10:32/10:32AM
Hi marcs and welcome. :hi:

ihelpyou
07-07-2006, 11:22/11:22AM
Welcome to the forums marcs! :hi:

Which bot is your bot on that list marcs? You seem very sincere and want to do good, so I'd like you to post in here which one is your bot. :)

ihelpyou
07-07-2006, 11:32/11:32AM
Is that marc's bot?

This thread is confusing. :D Which bot are you all now talking about?

marcs
07-07-2006, 11:38/11:38AM
Hi Connie and Doug, thanks for the welcome! If it's ok then mine is MojeekBot.

Thanks Deb and I fully understand and respect your decision, it's just unfortunately some people will blindly copy and paste that list without a second thought or any research of their own. Then it's no longer just your personal blocked list, and when on a respected site like this can spread quicker and be even more damaging. But at least it got me out of lurking mode for a while! That's one positive.. I think.

Regards,
Marc.

Connie
07-07-2006, 11:46/11:46AM
Originally posted by ihelpyou
Is that marc's bot?

This thread is confusing. :D Which bot are you all now talking about?
We're not talking about marc's bot; we're talking about "psycheclone".

WebSavvy
07-07-2006, 11:47/11:47AM
I didn't just willy-nilly block your bot. Your bot accessed files on my site it wasn't supposed to, and then indexed them anyway. So, I blocked it.

You asked me why, so now I'm answering.

I had an eCard system on my site. In the files I had the following:

<meta name="robots" content="noindex, nofollow" />

In robots.txt file I had the following:

User-agent: *
Disallow: /eCards/

More than 300 bots IGNORED the robots.txt, and IGNORED the robots metas. They all climbed onto that eCards script all in one night.
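
For readers following along, here is a minimal Python sketch of the two checks a compliant crawler would be expected to make against the directives quoted above. It is an illustration only, not any real bot's code: the helper names `may_fetch` and `may_index`, the class `RobotsMeta`, and the agent name `ExampleBot` are all hypothetical. It uses the standard library's `urllib.robotparser` and `html.parser`.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMeta(HTMLParser):
    """Collect the directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",")}


def may_fetch(robots_txt, agent, path):
    """Check robots.txt BEFORE requesting the page."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


def may_index(html):
    """Check the robots meta tag AFTER a permitted page has been fetched."""
    meta = RobotsMeta()
    meta.feed(html)
    return "noindex" not in meta.directives


ROBOTS_TXT = "User-agent: *\nDisallow: /eCards/\n"
PAGE = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'

print(may_fetch(ROBOTS_TXT, "ExampleBot", "/eCards/index.php"))
print(may_index(PAGE))
```

Note that a crawler honouring the robots.txt rule would never request the page at all, so it would never see the meta tag; the meta tag only matters for pages that robots.txt leaves open.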

It resulted in a massive 20 GIG error file that ate up all of the space I have set aside for my directory, which stopped all of my pages from loading and the server sending me a distress email saying the domain was out of space and files needed to be DELETED!

I went through my logs and found all 300 bots. Yours was among the ones that IGNORED the robots.txt file and robots metas. It indexed my entire eCard system along with the other non-compliant 299 bots.

I blocked them all in one massive swoop. I have no intention of removing your bot from my private block list.

My blocked URL is still listed in your SERPs as well.

http://www.mojeek.com/search?q=websavvy+ecards&r=www.websavvy.cc

ihelpyou
07-07-2006, 11:52/11:52AM
>> Yours was among the ones that IGNORED the robots.txt file and robots metas. It indexed my entire eCard system along with the other non-compliant 299 bots.
hmm. Now we know why.

It seems to me you need to fix your bot marcs. I would have done the same thing as Deb as I don't like bots who ignore robots.txt. :)

marcs
07-07-2006, 12:03/12:03PM
Hi Deb,

Maybe I'm wrong but I truly believe there has been a mistake, I have only crawled one page from your eCards directory, the main page way back on the 31st December 2005. Was that page definitely blocked by robots and meta back then?

I can assure you my bot obeys the robots standard, as verified by other sites and by checking my results. If this has not been the case this time it was a mistake and I'm sorry.

The reason the page is still listed is because pages only get recrawled when links are found to them and early on I decided to leave obsolete pages in there for now. I will get that page removed and it will be gone by next week some time.

Marc.

WebSavvy
07-07-2006, 12:12/12:12PM
Yes, Marc. It was blocked then.

Our update went live in AUG. Then by SEPT all of these "compliant" bots were hitting the script and creating huge error files.

In OCT I added the robots metas and the robots.txt to disallow access. Google even indexed all of the eCard images even though Google imagebot has always been blocked from my site.

At the end of Dec, I had a massive 20 GIG error file from one night's worth of bot activity on my server. I was off over the holidays (New Years).

I then came back online on Jan. 2, to find out my entire domain had stopped responding 3 days prior because of bots ignoring the robots.txt and robots metas on the eCards URLs.

I took the eCard system off the site.

Just within the past month, I learned through trial and error how to block disobedient bots via .htaccess

Your bot, along with many others, are now blocked from accessing my site.

I can now safely put the eCard system back up again without worrying that I'm going to end up with data corruption because of so-called "compliant" bots, that do not extend that courtesy when on my domain.

So, yes. The URLs were blocked well in advance of your last index date of those files.

marcs
07-07-2006, 12:26/12:26PM
Thanks Deb, I understand and sincerely apologize if my bot added to your problems. I honestly don't understand why it would have indexed that page if it was blocked by robots.txt or meta, as all testing has worked perfectly as far as I can tell, but I will look into this straight away and keep at it until it's resolved. I must add though, at no time have I crawled multiple files from that directory; I have only ever crawled and indexed one page from your eCards directory.

Doug: Please check my results, I do not ignore robots.txt. If I have it's a genuine mistake that as far as I can tell is quite rare but I suppose inevitable. And why would I then come here trying to find out the problem?

Regards,
Marc.

WebSavvy
07-07-2006, 12:34/12:34PM
Yes, I know only one URL from the /eCards/ is indexed in your SE.

However, please understand that the eCards script is written so that the whole system is served from just that one page you did index.

Example: /eCards/index.php?cat=get_well_soon

When your bot, and all the other ones climbed on top of it, it killed my server for 3 days! A 20 GIG error file is MASSIVE and there's no excuse for it. It's something that should have never happened in the first place because those URLs were disallowed to begin with.

If a bot is "compliant" it doesn't make mistakes and index files it isn't supposed to. That's what "compliant" means. It means, it complies with the directives given by the owner of the domain.

Once your bot, or any bot, goes against the directives given by the owner of the site, said bot is then, noncompliant.

Hope you're able to sort the issues with your bot, especially if you have major web indexing plans.

WebSavvy
07-07-2006, 12:44/12:44PM
For anyone following this, I just wanted to say that Marc has been speaking with me via PM also. He seems very sincere to try and get to the bottom of it, with regard to why his bot didn't obey the robots.txt file.

I've looked at your engine Marc, and it has a lot of potential. I really do wish you the best of luck with it!

ihelpyou
07-07-2006, 13:16/01:16PM
Good for you Marc! :up:

I hope you get things straightened out. :)

marcs
07-07-2006, 13:35/01:35PM
Thanks Deb, much appreciated.

And Doug, so do I, but I'm also quite sure this was a very rare occurrence or I'd have a few more webmasters on my back.

Marc.

gio3
07-07-2006, 14:03/02:03PM
Hi Deb,
I'm thinking that maybe you could split your list into two parts: definitely bad robots, and the ones blocked for any other reason.
ihelpyou is a very trusted site and I really think your list may become a reference in the long run, so I think you may have an added responsibility in editing your list.
I know very well it's your own blocked list and also that it was posted following the requests of the members here, mine too of course, and I thank you again for that.
But people coming across your list after a Google search will probably not read the whole thread and will only copy and paste your code.
Splitting it into two parts, with spammers and scrapers in the main one and your own choices in a secondary one, may also give greater importance to your work.
I know it may be a waste of time for you, but I really think it may be a useful service to the community.
I'm a little unhappy with what happened to Marc, and he seems to me very serious about trying to fix the problem.
Gio

Connie
07-07-2006, 14:04/02:04PM
Reading in a few forums I think all robots occasionally don't follow the robots.txt.

I've seen complaints about Yahoo and Google.

I agree with Deb that you have a nice looking search interface, and wish you well.

WebSavvy
07-07-2006, 14:08/02:08PM
Gio, splitting it up, and what have you, will have to be done when I actually have the time. Right now, I just don't.

I'm in the middle of a huge update that thus far has taken 7 months from start to near finish.

I've changed the directive (a few hours ago) and it says now:

blocked=yes

Vs

spammer=yes

It is up to anyone using the list to do their own research. What you block, may not be something I decide to block. Every webmaster must be responsible for their own site and what access/permission rights they decide to grant.

Connie
07-07-2006, 15:13/03:13PM
Have to agree with Deb. It is not her responsibility to tell you what each bot may have done.

She has already spent a great deal of time coming up with this list. She had a reason for listing each bot that she listed.

Anyone can use this link to see if there is any information about a bot listed.

http://www.psychedelix.com/agents/index.shtml

You can also check Google to see if there have been complaints about a particular bot. Believe me, if a bot is misbehaving it will show up on Google.

And in the case of marc's bot there have not been any, at least on the first page.

WebSavvy
07-07-2006, 21:18/09:18PM
Connie, that's a really good list of bots at that site. I'd run across it some time back, but then had forgotten about it.

I saw a few bots listed there that I have seen showing up in my logs, also. One of them was that "NASA search" one, and I didn't know it was a spambot -- but I do now. Plan to block it today.

marcs
08-07-2006, 09:35/09:35AM
Thanks everyone for your feedback and comments. I know the engine still has a lot of work to do but I'm looking forward to it, and sorry if I confused the original thread that was not my intention.

Gio: That was exactly my point, from experience I know people will come across lists like this on reputable sites and blindly copy and paste. Thanks.

I have done some more testing and honestly can't find why it didn't obey Deb's robots.txt. The only possibilities I can think of are that there was an error while trying to retrieve the file, or that for some reason it was not parsed correctly. But I will carry on trying to find the cause and will keep a closer eye on it in the future.

If anyone else notices MojeekBot doing something on their site that it shouldn't, please don't hesitate to contact me; details are available on the associated page, or PM me here.

Regards,
Marc.

WebSavvy
08-07-2006, 09:51/09:51AM
Marc, just replied to your PM. :)
>> The only possibilities I can think of is, there was an error while trying to retrieve the file or for some reason it was not parsed correctly.
Yep, anything's possible.

Google states that they're only able to parse a robots.txt file that's under 150 lines. With all the bots I had blocked in robots.txt mine was more like 400+ lines.

I've removed all of the ones from robots.txt that I was blocking, and blocked them through .htaccess instead -- which is why yours was blocked in .htaccess Vs robots.txt file.

I'm keeping the robots.txt file as short as possible for the bots that I actually do want indexing my site. All those others, I don't care about as they're either spambots, downloaders, scrapers, or site rippers. Then, there's the other bots that are legit bots, but I just didn't want them indexing, so they're in my blocked list too.

With the robots.txt file being as long as it was, maybe that's what caused your bot not to be able to read it?

Marc, no apologies needed about posting in the other thread with regard to this. We just decided yesterday to split the threads because we didn't want someone to confuse your bot with an actual spambot.

I'll probably have some free time later this week, and will go through my posted list and remove bots that are legit bots and leave the remaining bots there.

marcs
08-07-2006, 13:19/01:19PM
The robots.txt file being too long shouldn't have been the problem (unfortunately). I used to treat a file that was too long as a complete block. That is no longer the case since an update to the bot: it now records all relevant disallows. If this becomes a problem in the future I will revert to treating an overly long file as a complete block.
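
The two policies described above can be sketched in Python. This is only an illustration and not MojeekBot's actual code: the `MAX_LINES` threshold, the function name `disallows_for`, and the simplified parsing (it collects every Disallow value and ignores User-agent groups) are all assumptions.

```python
MAX_LINES = 150  # hypothetical cap, chosen only for illustration


def disallows_for(robots_txt, block_if_too_long):
    """Return the Disallow prefixes a crawler would honour.

    Simplification: User-agent groups are ignored and every Disallow
    line is collected. With block_if_too_long=True, a file longer than
    MAX_LINES is treated as "Disallow: /" (a complete block).
    """
    lines = robots_txt.splitlines()
    if block_if_too_long and len(lines) > MAX_LINES:
        return ["/"]
    rules = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            rules.append(value.strip())
    return rules


# A 401-line file: one User-agent line plus 400 Disallow lines.
long_file = "User-agent: *\n" + "".join(f"Disallow: /dir{i}/\n" for i in range(400))

print(disallows_for(long_file, block_if_too_long=True))        # older policy
print(len(disallows_for(long_file, block_if_too_long=False)))  # newer policy
```

Under the older policy the long file collapses to a single rule blocking the whole site; under the newer one every Disallow line is kept.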

Connie
08-07-2006, 15:04/03:04PM
Marc, the other thread was split because there had been several bots mentioned in it, and it was getting confusing.

If you want to point your bot to my site condells.com I'll see how it behaves. I have a file set up that is not mentioned anywhere on the site except in robots.txt, and of course all bots are excluded.

g1smd
08-07-2006, 16:34/04:34PM
A disallow directive for a folder, like:

Disallow: /folder/

does NOT stop a spider from requesting /folder


On many servers, a request for /folder will serve the default index page for that folder.

You need to Disallow: /folder instead.

Beware.

WebSavvy
08-07-2006, 16:49/04:49PM
Correct. But on mine, the way I did it was correct for my setup.

I use virtual folders -- meaning that it all lives in one file and uses mod_rewrite to fork the folder structure.

/folder

does not exist

whereas

/folder/

DOES.

I'm not worried to keep them out of something that doesn't exist. I don't want them in specific areas that do exist.

Going to /folder
on my site serves a blank page.

They can have that all they want.

Going to /folder/
serves content -- that I don't want certain bots to access.

Connie
08-07-2006, 17:05/05:05PM
I have a blank index page in the folder. In fact I have a blank index page in every folder on my site.

So any request for that folder should go to the blank index page. Right or wrong?

So are you saying G1
Disallow: /folder/ is not sufficient? Do I need to disallow /folder/index.htm?

g1smd
08-07-2006, 17:15/05:15PM
No, the disallow does not block a file or a folder as such; it has to match the left-hand part of the URL that it is to disallow.

As I said above, you need Disallow: /folder and that will disallow any URL that starts with /folder.


It will block:

/folder
/folder/
/folderthing
/folder/index.html
/folder/some.file.html
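
A minimal Python sketch of the left-hand match described above. It assumes plain prefix matching with no wildcards; the function name `is_disallowed` is hypothetical and not taken from any crawler.

```python
def is_disallowed(path, disallow_rules):
    """Return True if any Disallow value is a left-hand prefix of path.

    Assumption: plain prefix matching only, with no wildcard support.
    An empty Disallow value blocks nothing, so it is skipped.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


paths = ["/folder", "/folder/", "/folderthing", "/folder/index.html"]

for rule in ("/folder", "/folder/"):
    blocked = [p for p in paths if is_disallowed(p, [rule])]
    print(f"Disallow: {rule} blocks {blocked}")
```

With `Disallow: /folder` all four paths are blocked; with `Disallow: /folder/` the bare `/folder` and `/folderthing` are left open.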

Connie
08-07-2006, 18:00/06:00PM
G1 really not arguing but why do all the tutorials say to block with
/folder/?

If I'm understanding you correctly these
User-agent: Nutch
Disallow: /articles.tips/
Disallow: /articles.htm
User-agent: *
Disallow: /cgi-bin/

should be
User-agent: Nutch
Disallow: /articles.tips
Disallow: /articles.htm
User-agent: *
Disallow: /cgi-bin

I have never seen a bot in my cgi-bin or any other folder that was disallowed using Disallow: /foldername/.

WebSavvy
08-07-2006, 18:17/06:17PM
Connie: http://www.google.com/robots.txt

Disallow: /pagead/
Disallow: /relpage/
Disallow: /sorry/
Disallow: /keyword/
Disallow: /u/
Disallow: /univ/
Disallow: /sms/demo?
Disallow: /blogsearch/
Disallow: /reader/
Disallow: /uds/
Disallow: /extern_js/
Disallow: /calendar/feeds/
Disallow: /calendar/ical/
Disallow: /cl2/feeds/
Disallow: /cl2/ical/

g1smd
08-07-2006, 18:19/06:19PM
Connie. I already said why, right above.


Disallow: /folder/ does NOT block access to /folder at all. Note that /folder on some servers will serve an index page. That index page will be indexed by search engines because it has not been blocked. That is why you need to block /folder instead. That blocks any URL that starts with /folder in it:

/folder
/folder/
/folder/index.html
/folder/some.file.html
/foldername
/foldername/
etc.

Disallow: /folder/ only blocks URLs that start with /folder/.

See that /folder does not start with /folder/ - it is shorter.



Savvy. That works for Google because they have a 301 redirect from /folder to /folder/ for every folder URL that you attempt to access without the trailing / on it. Not all servers are set up that way. Those that are not set up like that expose an index page at /folder that will be indexed if you do not block /folder exactly like that.

WebSavvy
08-07-2006, 18:29/06:29PM
Ah, OK g1. Thanks. Didn't know they did.

=> Savvy ... ???
What happened? :( Usually you call me Deb.

Yep, mine are setup to use /folder/ (because it's virtual folders Vs physical ones).

So, I have to use
Disallow: /folder/

:)

Comeran
08-07-2006, 18:50/06:50PM
Deb,

See 1 mistake and we are downgraded... after all of my questions I might be down to just C :p

G1, I hadn't known what you were saying here. I checked our robots.txt and found that we too were using /folder/, as per the guidelines from which we learned to write the files.

This was VERY informative and I am sure it will be extremely useful to most.

Marc: It really seems that you are working hard to ensure a good site, bot, and SE; I am sure most will not block you. I even noticed that you have your e-mail in your about us page so that webmasters can contact you. If you are as responsive to them as you have been here I am sure that things will work out well for you.

Comeran-

Connie
08-07-2006, 19:30/07:30PM
G1 as far as I'm concerned you haven't explained anything. All you have done is confuse the issue.

I have read literally hundreds of tutorials in regard to robots.txt. I have looked at several in the last few days.

I'll be the first to admit I do not understand the tech language but I do know how to copy and paste.

What I posted above is basically a copy and paste from what I have read on the web.

If you want to disagree fine.

Post some authoritative examples.

Sorry but as much as I respect you I do not accept the fact that you are the web authority on any subject.

g1smd
08-07-2006, 20:03/08:03PM
>> So, I have to use
>> Disallow: /folder/

No you do not have to use that at all.

You could equally use Disallow: /folder and that will stop /folder and /folder/ and /folder/any.file.html from being accessed.

You could even use Disallow: /fold or even Disallow: /f, because the disallow statement only needs to match the left-hand characters of the URL to be disallowed (taking care that it doesn't block something you wanted to be left open).
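
The caveat about shorter prefixes can be checked with a small Python sketch. The site layout below is invented for illustration, and `blocked_by` is a hypothetical helper, again assuming plain prefix matching.

```python
def blocked_by(rule, paths):
    """Return the paths that a single Disallow prefix would block."""
    return [p for p in paths if p.startswith(rule)]


# Hypothetical site layout, invented for illustration.
site_paths = ["/folder/", "/folder/any.file.html", "/forum/", "/faq.html", "/index.html"]
keep_open = {"/forum/", "/faq.html", "/index.html"}

for rule in ("/folder", "/fold", "/f"):
    collateral = [p for p in blocked_by(rule, site_paths) if p in keep_open]
    print(f"Disallow: {rule} also blocks {collateral}")
```

On this made-up layout, `/folder` and `/fold` block only the intended folder, while `/f` also catches `/forum/` and `/faq.html`.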


Connie. I have no way to explain this a third time. I really don't know why you can't follow the logic of this at all.


Debs. Not everyone reading the thread knows your real name, hence the lapse back to handle :-)

WebSavvy
08-07-2006, 20:17/08:17PM
heh heh ... yeah, I do have to use /folder/

because I have some folders with similar start names, and file names.

No matter anyway, because in the new setup we have real folders. I'm sick of virtual folders and files. We went to physical setup this time. It should be much easier to do things with it now, and to make sure things that I don't want indexed, aren't ... and things that I do want indexed, are.

Poor Marc, we've taken over his thread and littered it with robots.txt discussion (it's OK because it's related -- as per his example that mine may have issued an error upon request).

Connie, if you want to split off the robots.txt discussion into another thread, go ahead. I'm tired, otherwise I'd do it.

I've been coding all day ... man do I need a break. :(

gio3
09-07-2006, 06:15/06:15AM
Great discussion. I have learned here something really interesting about robots.txt
Thanks
Have a wonderful break Deb
Gio

WebSavvy
09-07-2006, 10:46/10:46AM
Hey Marc, we have a forum in here for UK SEs too, ya know? Why don't you begin a thread there for Mojeek?

This way when people are looking for UK SEs to submit to, they can find yours there.

Out of curiosity, how do you find sites? Is submission allowed?
How much of a site (or page) does your bot index?
What criteria are used in the ranking process?

marcs
09-07-2006, 15:08/03:08PM
The bot follows links and adds sites/pages as it comes across them. I did allow submissions for a while but unfortunately there's a few that ruin it for others and as I don't have limitless resources had to disable it, but send me a personal email and I'd be happy to add anyone's site.

I use links as one of the ranking factors and this is also used to decide what sites and how deep to crawl. Sites listed in personal searches are also crawled more often and deeper, hopefully making the service even more useful.

The usual criteria are used: a combination of links, link text and on-page factors. I'm also planning on allowing the tweaking of the parameters that go into these different criteria; you can get an idea of how much it's possible to alter the results this way by trying the different available ranking methods. An example of where this could be useful is for site searches, where external links are not always the best method to rank pages by.

Marc.

WebSavvy
09-07-2006, 15:17/03:17PM
Impressive, Marc ... truly.

I do wish you very well with it, as you already knew. :cheers: