Theme Sites Irrelevant!!


kneelsit
30-08-2002, 23:00/11:00PM
3.2 Topic distillation
Our first application of the Term Vector Database -- indeed, the motivation for building it -- was topic distillation. Topic distillation is a technique of using hyperlink connectivity to improve ranking of web search results (see [Kleinberg 98], [Chakrabarti et al 98], [Bharat et al 98] for specific instances of topic distillation algorithms). These algorithms work on the assumption that some of the best pages on a query topic are highly connected pages in a subgraph of the web that is relevant to the topic. A simple mechanism for constructing this query-specific subgraph is to seed the subgraph with top-ranked pages from a standard search engine and then expand this set with other pages in vicinity of the seed set ([Kleinberg 98]). However, this expanded set sometimes includes highly connected pages that are not relevant to the query topic, reducing the precision of the final result. [Bharat et al 98b] identifies this problem as topic drift.

[Bharat et al 98b] shows that topic drift can be avoided by using topic vectors to filter the expanded subgraph. A topic vector is a term vector computed from the text of pages in the seed set. A page is allowed to remain in the expanded graph only if its term vector is a good match to this topic vector. Specifically, the inner product of the two vectors is compared to a suitable relevance threshold, and pages below the threshold are expunged from the expanded graph.

In the past, we computed term vectors and topic vectors by downloading pages from the web after computing the initial subgraph. Downloading pages to compute vectors proved to be a significant bottleneck. We eliminated this bottleneck by re-implementing the algorithm on top of the Term Vector Database. The dramatic performance improvement provided by the Term Vector Database allows us to increase the size of our seed sets and expanded subgraphs while still reducing the overall time it takes to perform topic distillation.
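
The filtering step in the excerpt above (build a topic vector from the seed pages, then keep a page only if the inner product of its term vector with the topic vector reaches a threshold) might look roughly like the sketch below. This is only an illustration, not the paper's implementation: the function names, the whitespace tokenizer, the plain term-frequency weighting, and the threshold of 0.2 are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def term_vector(text):
    """Length-normalized term-frequency vector (assumed weighting scheme)."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def topic_vector(seed_texts):
    """Sum the seed pages' term vectors, then normalize the result."""
    total = defaultdict(float)
    for text in seed_texts:
        for term, weight in term_vector(text).items():
            total[term] += weight
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {term: w / norm for term, w in total.items()} if norm else {}


def inner_product(a, b):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def filter_expanded(pages, seed_ids, threshold):
    """Keep only the pages whose match to the topic vector meets the threshold."""
    topic = topic_vector([pages[pid] for pid in seed_ids])
    return {
        pid
        for pid, text in pages.items()
        if inner_product(term_vector(text), topic) >= threshold
    }


# Hypothetical expanded subgraph: "a" and "b" are seed pages, "c" drifted in.
pages = {
    "a": "guitar chords guitar lessons",
    "b": "guitar lessons online",
    "c": "cheap flights hotel deals",
}
kept = filter_expanded(pages, seed_ids=["a", "b"], threshold=0.2)
print(sorted(kept))
```

Page "c" shares no terms with the seed pages, so its inner product with the topic vector is zero and it is dropped. That is the "topic drift" case the excerpt describes.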

O.K. Let's all go back to school now.

http://www9.org/w9cdrom/159/159.html and the article that started it:

http://www.fantomaster.com/fafnissue0014.html

Advisor
30-08-2002, 23:28/11:28PM
Hi Greg,

Themed sites are NOT irrelevant. That's not what any of that says.

Sites just can't be summed up in only 2 words. That's all. Themed sites can and do work just like they always did!

Jill

Mel
31-08-2002, 00:14/12:14AM
As I understand term vectors, they are in fact summing up the theme of your pages in two-word sets. If the pages in your site are "themed", they are more likely to garner high term vector rankings, as more pages pointing to each other will be seen to have the same term vector.

Advisor
31-08-2002, 00:34/12:34AM
Originally posted by Mel
As I understand term vectors, they are in fact summing up the theme of your pages in two-word sets. If the pages in your site are "themed", they are more likely to garner high term vector rankings, as more pages pointing to each other will be seen to have the same term vector.

That's exactly what Mike's article is debunking!

I don't understand the whole thing, but I've talked to Mike about this a lot (only cuz he likes to talk!), and from what I gather, what you are saying above is how some in the SEO industry have interpreted what theming is. However, it's apparently been misinterpreted. If you read Mike's article in that FantomNews that Greg posted, he explains it pretty well.

Jill

Mel
31-08-2002, 05:30/05:30AM
Yes, but I don't happen to agree with the assumption that themed pages will not rank well using term vectors.

Mel
31-08-2002, 06:00/06:00AM
Here is a quote from Mike's Fantomaster article:

>>As they have been saying since the appearance of this
document on the web, in some SEM circles, the story
goes like this: search engines have discovered a brand
new technology called 'term vectoring'. Effectively,
search engines no longer look at words on a page, they
now turn them into 'vector numbers' (sic) and use this
information to provide the theme of an entire web site.
It's now necessary to ensure that you can sum up the
entire theme of your web site in as little as two
words, because that's how term vectors work: in term weight pairs.

This is not saying that "themed" sites don't or can't rank well under term vector analysis, just that search engines are using a more efficient method of calculating how appropriate a site is to a particular theme or term vector.

Remember that search engine queries do not use term vectors; they use plain old-fashioned words to interface with the searchers. The searchers are not asking for pages that are relevant to some term vector numbers. Either the rankings produced once the search words are converted into term vectors are relevant to the words originally searched for, or they are not.

If they are relevant to the original words, then all works fine, and pages and sites that use these same words will have reasonable rankings. If they are not relevant to the original words searched for, then they are useless.

Advisor
31-08-2002, 12:02/12:02PM
Agreed!

J

kneelsit
05-09-2002, 23:29/11:29PM
Thanks, Jill and Mel, for your explanations.

The subject was a little too deep and technical for me to grasp it fully. :o:

I only just scraped a bare pass in statistics at Uni. When they got on to 3-dimensional graphs I just lost it. :confused: