kneelsit
30-08-2002, 23:00/11:00PM
3.2 Topic distillation
Our first application of the Term Vector Database -- indeed, the motivation for building it -- was topic distillation. Topic distillation is a technique that uses hyperlink connectivity to improve the ranking of web search results (see [Kleinberg 98], [Chakrabarti et al 98], [Bharat et al 98] for specific instances of topic distillation algorithms). These algorithms work on the assumption that some of the best pages on a query topic are highly connected pages in a subgraph of the web that is relevant to the topic. A simple mechanism for constructing this query-specific subgraph is to seed the subgraph with top-ranked pages from a standard search engine and then expand this set with other pages in the vicinity of the seed set ([Kleinberg 98]). However, this expanded set sometimes includes highly connected pages that are not relevant to the query topic, reducing the precision of the final result. [Bharat et al 98b] identifies this problem as topic drift.
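To make the seed-and-expand step concrete, here is a rough Python sketch. It assumes "vicinity" means pages one link away in either direction, and that the link graph is already available as two plain dictionaries; the names expand_seed_set, out_links and in_links are made up for illustration and do not come from the paper.

```python
def expand_seed_set(seed, out_links, in_links):
    """Return the seed pages plus every page one link away from them.

    Assumption: out_links[p] lists the pages p points to and
    in_links[p] lists the pages that point to p.
    """
    expanded = set(seed)
    for page in seed:
        expanded.update(out_links.get(page, ()))  # pages the seed links to
        expanded.update(in_links.get(page, ()))   # pages linking to the seed
    return expanded


# Toy link graph with hypothetical page names.
out_links = {"a": ["b", "c"], "b": ["d"]}
in_links = {"a": ["e"], "b": ["a"], "c": ["a"], "d": ["b"]}

expanded = expand_seed_set({"a"}, out_links, in_links)
```

With a single seed page "a", the expanded set picks up its neighbours "b", "c" and "e" but not "d", which is two links away. Nothing in this step checks what the neighbours are about, which is how off-topic but well-connected pages get in.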
[Bharat et al 98b] shows that topic drift can be avoided by using topic vectors to filter the expanded subgraph. A topic vector is a term vector computed from the text of pages in the seed set. A page is allowed to remain in the expanded graph only if its term vector is a good match to this topic vector. Specifically, the inner product of the two vectors is compared to a suitable relevance threshold, and pages below the threshold are expunged from the expanded graph.
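The filtering step described above could look roughly like this in Python. It is only a sketch under simplifying assumptions: term vectors are raw word counts with no weighting or normalization, the topic vector is the sum of the seed pages' vectors, and the threshold value is arbitrary. The function names and the sample pages are hypothetical, not taken from the paper.

```python
from collections import Counter


def term_vector(text):
    """Assumption: a term vector is just a bag of lowercase word counts."""
    return Counter(text.lower().split())


def inner_product(u, v):
    """Sum of products of the weights of terms present in both vectors."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def filter_by_topic(pages, seed_urls, threshold):
    """Keep only pages whose term vector matches the topic vector."""
    topic = Counter()
    for url in seed_urls:
        topic.update(term_vector(pages[url]))  # topic vector from seed text
    return {
        url
        for url, text in pages.items()
        if inner_product(term_vector(text), topic) >= threshold
    }


# Hypothetical expanded graph: two seed pages plus two neighbours.
pages = {
    "s1": "jaguar car engine",
    "s2": "jaguar car dealer",
    "p1": "car engine repair",
    "p2": "big cat habitat",
}
kept = filter_by_topic(pages, seed_urls=["s1", "s2"], threshold=2)
```

Here "p2" shares no terms with the seed pages, so its inner product with the topic vector is 0 and it is dropped, while "p1" overlaps on "car" and "engine" and stays. A real system would almost certainly weight and normalize the vectors rather than use raw counts.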
In the past, we computed term vectors and topic vectors by downloading pages from the web after computing the initial subgraph. Downloading pages to compute vectors proved to be a significant bottleneck. We eliminated this bottleneck by re-implementing the algorithm on top of the Term Vector Database. The dramatic performance improvement provided by the Term Vector Database allows us to increase the size of our seed sets and expanded subgraphs while still reducing the overall time it takes to perform topic distillation.
O.K. Let's all go back to school now.
http://www9.org/w9cdrom/159/159.html and the article that started it:
http://www.fantomaster.com/fafnissue0014.html