Dez
16-09-2004, 18:51/06:51PM
I get a lot of email sent to WebSearch.COM.AU and Websearch.CO.NZ ( yea, a lot, thousands and thousands - impossible to answer them all, sorry folks, I'm a proud Dad of two with a three year old daughter who's magic to be around, and now a brand new 5 week old boy in the house and I try to be a half decent husband most of the time, and apparently I also have a day job ( I really must look into that as it might need me to rock up and do something from time to time if that is the case! ), I really do need to get those 3.5 hours of sleep each day, otherwise those weird things keep drifting across my eyes, and they really are quite disconcerting at times, so if you don't hear back from me for a while when you send email to Dez@WebSearch.COM.AU - well .. sorry, it is a pet project after all, and we are only partially human * grin * ).
One question that I have never had a good answer for is:
Q: Why is it that from time to time I search for the same thing and it takes a few seconds longer to get the results than it usually does?
A: Well, this is a good question, in fact it goes to the very heart of what is happening inside the search engine itself.
Usually, the search engine cluster is relatively quiet, that is, idle, and a millisecond in the life of a search engine usually goes something like this:
...wait for someone to visit http://WebSearch.COM.AU
...got a visitor, yay, send them some HTML, hope they like it
...am I IDLE or CRAWLING, quick, is it BLUE or GREEN, oh man, why is life so difficult today?
...ok, now I just wait for that someone to type in a query and hit ENTER
...got their query! yay! now go into panic mode to get the results back to them, asap
...broadcast a request across the cluster for any results pertaining to this query
...wait for the cluster to poll, and ship back all the results
...start to get some data back
...re-poll anyone that didn't give me data
...aw heck, give up on old #05 as he's either dead or busy darn him
...wish Dez would get around to upgrading #05 to an Opteron, those bloody Xeons are soooo lame!! ( am I Xeonphobic I wonder!? )
...ok, now I've got data, what do I do!?
...oh yea, I do some magic on the data
...and, I do some more magic on that same data
...now I wrap it up in HTML, nice and gently like
...ship it back to that user's browser, pronto, compressed if they will allow gzip data
...now where did I leave that page header and websearch logo.gif file darn it, they'll need that too!
...and the bloody Cradle Technologies partner logo and page footer, where the heck is that thing?! Ah, found it!
...ok, now try to catch my breath while they read the results and cache the next results for them
...broadcast a poll to everyone on the cluster that I'm idle, coz I'm a glutton for punishment
...ok, got a response again, so I ship them the cached "next page>>" if they want it, or
...deal with the clickthrough for them to whatever URL they liked
...better log the action, if it's a Next Page>> then I've got it cached, no sweat
...I better quickly cache the next page now and try to get some sleep, or
...kew, they found a link they like, that's nice, now I better zap them the clickthrough URL
...oh, and Mr Log That Data Sonny Jim "god of all clusters" will kill me if I don't log the ip, time, date, search query, results page #, and click through URL
...phew, done, all served, searched, logged, and now I can see if I can get some time with the resident cluster head shrink before the next request
...call me a kleptomaniac, but I better hang on to the results that I prepared for them for a while, in case they come back soon
...now I get to flush some stuff from brain and try to relax ( yea right - as if !! )
...go back to top [ waiting that is ] and wait nervously for the next visitor, hope they are nice...
...I wonder what would happen if I crash, do I get a rest, will I dream!?
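For the code minded, the "broadcast, wait, re-poll, give up on old #05" part of that list can be sketched roughly as below. This is only a minimal illustration and not the actual WebSearch code: the function name scatter_gather, the idea that each node is a plain Python callable, and the (score, url) result pairs are all assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def scatter_gather(nodes, query, timeout=0.5, retries=1):
    """Ask every node for results, re-poll the silent ones, then give up on them.

    `nodes` maps a node name to a callable that takes the query and returns a
    list of (score, url) pairs. Both are hypothetical, for illustration only.
    Returns (results sorted best first, names of nodes we gave up on).
    """
    results = []
    pending = dict(nodes)  # nodes that have not answered yet
    # Enough workers that a re-poll is never queued behind a stuck request.
    pool = ThreadPoolExecutor(max_workers=max(1, len(nodes) * (retries + 1)))
    try:
        for _attempt in range(retries + 1):
            if not pending:
                break
            # Broadcast the query to every node still owing us an answer.
            futures = {pool.submit(fn, query): name for name, fn in pending.items()}
            done, _not_done = wait(futures, timeout=timeout)
            for fut in done:
                if fut.exception() is None:  # a failed node stays pending
                    results.extend(fut.result())
                    pending.pop(futures[fut])
    finally:
        # Don't block on stragglers; a busy node's thread finishes on its own.
        pool.shutdown(wait=False)
    return sorted(results, reverse=True), sorted(pending)
```

Called with three nodes where "05" always fails, it returns the merged results from the other two plus ["05"] as the list of nodes it gave up on, which is about how #05 gets treated in real life.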
So you get the idea.. Well, this is what happens when the search engine is just sitting there dealing with search requests all day and night.
But then someone comes through and slaps a Site Submission on us, and we've got to go through a process of getting their URL and EMAIL details, racing off and sucking down their site, checking links, indexing, sorting, scoring, etc etc, and then zap them a quick "thanks for submitting your web site" email to let them know that we a) got it and b) care ;-)
So, on the whole, the cluster is busy but in bursts, not constantly working hard..
So why then, as the emailed question asks, does it sometimes take more than a second ( ideally it's less than a second ) to get results on some days, but not all the time?
Well that's due to the fact that there are times when the search engine ( usually each day ) realises that a whole bunch of URLs and or SITES that it has in the database have in theory "expired", that is, they have passed what I consider a Time To Live threshold.
This basically means that each object, or "page" as it were, has a set time to live in the system, and after that, it gets re-visited, re-indexed or crawled, and updated, or refreshed ( email me and ask me about my swinging pendulum object expiration theory if you want a dissertation via email, as it's too long-winded and detailed for this forum's POST size limit - sorry ).
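As a rough sketch of that Time To Live check, and nothing more than that: the Page record, its field names and the expired_pages function below are made up for this example, and the real expiry rules ( the pendulum theory ) are not shown here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    last_crawled: float  # Unix timestamp of the last visit
    ttl: float           # seconds the page may live before a re-visit


def expired_pages(pages, now):
    """Return the pages whose time to live has run out at time `now`."""
    return [p for p in pages if now - p.last_crawled >= p.ttl]
```

A daily sweep would call expired_pages with the current time and hand whatever comes back to the crawler.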
More than 45% of the time the pages have not changed and they are just left the same, no point in hauling them down and re-indexing them if they have not changed.
Then there's dynamic content, it always changes, but it's a whole different science, more on that another day - but it always gets updated to say the least.
But if a page does get updated, then it needs to be hauled down, re-indexed et al and jammed back into the system, so that next time someone searches for something pertaining to what that page ( and others ) contains, they get the very latest info and excerpts and links etc.
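I haven't spelled out here how the engine decides a page is unchanged, so take the following as one common way to do it rather than a description of WebSearch itself: compare the standard HTTP ETag or Last-Modified header saved at the last crawl against the one from a fresh HEAD request. The function name needs_refetch and the plain dict inputs are assumptions for the sketch, and it assumes the header names have already been normalised to this exact spelling.

```python
def needs_refetch(stored, head_headers):
    """Decide from HEAD response headers whether a page must be hauled down again.

    `stored` holds the "ETag" and/or "Last-Modified" values saved at the last
    crawl; `head_headers` holds the headers from a fresh HEAD request.
    Both are plain dicts, a simplification made for this example.
    """
    for key in ("ETag", "Last-Modified"):
        if key in stored and key in head_headers:
            # Same validator means the page is unchanged: leave it alone.
            return stored[key] != head_headers[key]
    # No usable validator (typical for dynamic content): always re-fetch.
    return True
```

Pages with no usable validator always come back True, which lines up with dynamic content always getting updated.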
So, what this means then is that on a daily basis, it's possible that hundreds, thousands, even hundreds of thousands of pages have expired, and have to be re-fetched etc etc.
But this doesn't always get every page, so from time to time, at scientifically determined intervals ( Dez sticks a wet finger in the air and figures it's time ), which is usually once a quarter at least, but of late it's been ideally once a month, the entire index is re-validated!
Yea, ye gods! Every single one of the objects in the system is re-visited, checked, and if necessary hauled down and re-indexed as part of an entire engine re-crawl!
So that has a CPU and Network load impact of course, and that is why, at times, and hopefully no more than once a month, you will find that a search request to http://WebSearch.COM.AU and or http://WebSearch.CO.NZ ( they are not always crawling at the same time, but it does happen ) will return results to you that you previously had in milliseconds, well, sub-second anyway, in up to 10 or 12 seconds god forbid, because the cluster is working like crazy trying to rip through the .AU or .NZ name space literally downloading every Australian or New Zealand web page that it knows about, and any new links it finds in the process.
No mean feat, a very time and resource and MONEY costly exercise and to date I have yet to find a way to reduce the load that this causes.
I've dumped some more info about this and what I do about telling users that we are IDLE ( blue status bar! ) or CRAWLING ( green status bar ) at any one time.
Check out:
http://www.websearch.com.au/crawl-status.html
Now there are ways that folk like FAST, Inktomi, Google, the Internet Archive or Alexa for example fix this, they have whole clusters of servers dedicated to just crawling, and whole clusters of servers dedicated to web serving, and entire farms of servers dedicated to search results generation.
But they usually have hundred million dollar budgets with which to build such things.
Heck, if you visit Alexa.COM and find the link to their Crawler and see the picture of their racks and racks of indexing servers, you can't see the far end of the warehouse of servers, it's so bloody big!
Whereas projects like WebSearch.COM.AU, Gigablast, and others, well we're all doing it "tough" and we fund our projects privately with a sprinkling ( and I mean sprinkling folks ) of external support in advertising and sponsorship, so we just can't afford to pour millions into our projects, we have mouths to feed and bills to pay.
Gigablast for example is from memory ( see Matt's blog ) around 9 x desktop PCs running hot doing web serving, crawling and indexing all in one go ( it's a nice engine - I envy it a lot ) and WebSearch.COM.AU too is a handful of servers, although rack mount in my case, I invested a bit more heavily and I house them in a high end data centre with http://CradleTechnoliges.COM where I can get truck loads of bandwidth albeit at commercial data rates, whereas I think Matt still runs his Gigablast from home on a group of cheap ( read as cost effective sorry ) ADSL links, but that's smart and I wish we could do the same here in Australia - if we didn't have a stupid Government that thinks it's a phone company, or is that a Phone company that thinks it's a Government, I never know which it is!?
But hey, enough on the rant </rant>
Check out http://www.websearch.com.au/crawl-status.html
It tells the story.. Well part of it.. Stay tuned for more, tell me if you're interested, I'll keep adding to "a millisecond in the life of a search engine" * grin *
And let me know if you think it was worth the effort, or if frankly you don't give a toss if the engine is crawling or not ;-)
Cheers,
Dez