MSN begins formulating its own PR system
JohnC
16-08-2004, 13:47/01:47PM
Jason Dowdell of Airgin.com (http://www.airgin.com/) has located a Microsoft document that may shed some light on the new algos they are building.
Block-Level Link Analysis (http://research.microsoft.com/research/pubs/view.aspx?tr_id=754)
I am still reading and trying to digest the whole thing, but the basic idea, from what I can tell, is to stop treating the "web page" as the smallest unit that can be assigned a value, and instead break the page into smaller blocks that are compared and rated individually. This means different parts of the page could have different PR.
This is going to be interesting. :)
dvduval
17-08-2004, 00:29/12:29AM
Yes, I like that idea about blocks. Maybe this would be a new way to find the good content on a page. I do hope that Microsoft and Yahoo will both come up with some interesting methods of finding good content. Competition is good. I really hope we can keep all 3 competing, and maybe even add a 4th or 5th strong competitor. Amazon and their A9 search could be a possibility.
bwelford
17-08-2004, 07:53/07:53AM
That Block-level Link Analysis is intriguing as a concept. However I'm still from Missouri on it. We always have the dilemma that computers deal in information or text content. The human user sees a combination of images and text. A given web page may therefore "look" quite different to a spider and to a human visitor. I would think this gets even more likely as you try to go down to sections of the web page.
I really think MSN has some much more fundamental problems to sort out than to spend much effort on "drilling down" to seek relevance in slices of a web page.
RandyDotcom
17-08-2004, 09:38/09:38AM
As SEO specialists, it is incumbent upon us to realize that spiders will never see images and get the same meaning out of them, just as any two people do not get the same information from a single picture.
I would rather the spiders consider the whole page in total. It's like when a reporter takes a quote out of context. The page is the thing. If you are going to request that the page be downloaded to retrieve the information, then only the relevant information should be on the page. Trying to index blocks just leads to spam techniques. Spammers can create potent blocks displayed below the visual return of the page.
I think the most important job the spiders need to do is make sure that indexed content is visible to the requester, then rate the page based on the content, leaving PR as a secondary index.
PR is ruining the ability to retrieve valuable information. More and more I am seeing the exact same text on the first 20 returns.
Personally I have started jumping to page 5 of the returns first when doing research.
ihelpyou
17-08-2004, 09:46/09:46AM
I agree with Randy.
If this is what MS is trying to do, I see it as a fail-fail situation. I see no good at all coming from 'blocks' of a page, and I do see "many" more spam possibilities than we already have anyway.
Webmaster T
17-08-2004, 10:51/10:51AM
Originally posted by ihelpyou
If this is what MS is trying to do, I see it as a fail-fail situation. I see no good at all coming from 'blocks' of a page, and I do see "many" more spam possibilities than we already have anyway.
I got the opposite feeling when I read that. It could be used to identify ads and remove noise, which would enable better identification of the page's "topic". Anything any SE does will be used to spam. That's a given. If SEs used that as a criterion for an algo then nothing would change. Spammers adapt and exploit every algo because every algo can be exploited.
What kind of links could be eliminated with this method:
1. designer/developer links (almost always near the copyright)
2. networks built to inflate linkpop (almost always in the nav/menu or footer)
3. ads/link brokering
4. possibly link pages
IMO, it would smooth out PageRank as a lot of internal PR is garnered through links in Navigation. It could be used to only pass PR where the documents are related. It smooths it because then all the PR is external and truly an "unbiased vote".
JohnC
17-08-2004, 11:39/11:39AM
Originally posted by Webmaster T
I got the opposite feeling when I read that.
I have to agree with "T" on this one. I got the feeling that breaking the pages into blocks would make it easier to identify and discount spammy techniques or irrelevant content on a page.
Quadrille
17-08-2004, 18:55/06:55PM
There is certainly the potential to identify mismatches within a page, that could suggest spamming; also it could be used to ensure it wasn't the ad that matched the search.
But it'll be a long time before they reach the level of sophistication needed ... meanwhile, it could lead to some very odd results!
RandyDotcom
17-08-2004, 19:14/07:14PM
I don't think they are trying to trap spamming. They are trying to assign a value to a section of the page, and assigning PR Value to links within that section.
I think this whole thing is over-thought. And the results we are seeing from Google are getting worse and worse.
I recently needed to find the per capita income for Argentina.
I got 179,000 English results for "Argentina Per Capita Income". All of the early results had the terms in them, but it wasn't until about page 10 that I got detailed information. All of the first returns were based on PR and had, in most cases, the exact same table cut and pasted into the page.
Keywords and page content need to make a comeback into the algorithm. Not links.
Blue
17-08-2004, 19:51/07:51PM
Originally posted by RandyDotcom
....Keywords and page content need to make a comeback into the algorithm. Not links.
Hear, hear!!!
Percept
18-08-2004, 02:48/02:48AM
Originally posted by Webmaster T
...
What kind of links could be eliminated with this method:
1. designer/developer links (almost always near the copyright)
2. networks built to inflate linkpop (almost always in the nav/menu or footer)
3. ads/link brokering
4. possibly link pages
IMO, it would smooth out PageRank as a lot of internal PR is garnered through links in Navigation. It could be used to only pass PR where the documents are related. It smooths it because then all the PR is external and truly an "unbiased vote".
This wouldn't work because with CSS you can put anything anywhere on the page no matter where it occurs in the HTML code.
JohnC
18-08-2004, 08:39/08:39AM
Originally posted by Percept
This wouldn't work because with CSS you can put anything anywhere on the page no matter where it occurs in the HTML code.
It sounds to me like this is not going to be an issue, as they have stated:
In this paper, the web page is partitioned into blocks using the vision-based page segmentation algorithm. By extracting the page-to-block, block-to-page relationships from link structure and page layout analysis, we can...
Highlighting added by me. I can only assume they understand CSS and have taken this into consideration.
Percept
18-08-2004, 08:48/08:48AM
They can do that but they can't fix the CSS bugs in IE ... tsssss :rolleyes:
Quadrille
18-08-2004, 08:50/08:50AM
As the possibilities of page design are infinite, and they can't 'consider' them all, I suspect they are going to have to make a few broad assumptions ... which will lead to some amazing page gymnastics as the spammers struggle to keep up!
I think the problem with all this is that MSN have hinted at what they are doing, but have been a little reticent about the thinking behind it.
If they really have developed a system that can do justice to CSS, tables, <P>, <Hx> and every other design feature (Flash?), then I guess we're going to have exciting times. Especially if they show preference to FrontPage (this is M$ - why wouldn't they?).
But I suspect their actual achievement will turn out to be a little less ambitious than they would have us believe.
JohnC
18-08-2004, 09:51/09:51AM
Originally posted by Percept
They can do that but they can't fix the CSS bugs in IE ... tsssss :rolleyes:
LOL ... Good Point.. :D
Originally posted by Quadrille
But I suspect their actual achievement will turn out to be a little less ambitious than they would have us believe.
I have to agree, especially considering MS's product development history; they are not known for "unique" ideas.
However, I do believe that some sort of A.I. that can "see" a page is not that far off. I am not saying Bill and company have done this, but the technology is almost here and pieces of it are probably in use already. Rumored Example: Lex Wexner's companies (The Limited, Victoria’s Secret, Abercrombie & Fitch, Lane Bryant and more) use a form of A.I. that "looks" at fabric swatches and is able to identify certain characteristics and properties. This is used to help bring order to their database of hundreds of thousands of fabrics. The point is, it's A.I. based and can "see" the fabrics. (I had a close friend in Columbus, Ohio, who worked in their IT division; sorry, it’s my only source, hence the term “Rumored”. Oh yeah, and this was 4 years ago.)
With the amount of money Search earns and its growth potential, progress like this is bound to happen if it can be profitable. MSN and Google both have some big brains on the payroll. If something like this can make a difference, it probably will.
Quadrille
18-08-2004, 13:35/01:35PM
... It is only a matter of time before AI becomes a reality in web searching. But the sheer anarchy of the web, and the inability of most users (especially M$) to agree on or properly implement web standards, means that progress is likely to be incremental, rather than a 'big bang'.
Which is fine, except that some (eg Google, of late) seem to get so bogged down in implementing increments, that they kinda lose sight of what they were doing in the first place!
M$ has demonstrated time after time, that their idea of progress is always "adding on". Until someone starts to think about rationalizing and simplifying, software - and searches - will get klunkier and klunkier; they'll always deliver - but less and less of what was asked for.
BionicOffice
22-08-2004, 12:54/12:54PM
Hi,
This is my first post here so I hope I don't break any rules of forum netiquette but I was directed here from another forum that was also looking at this paper. If I step on any virtual toes, I apologize in advance.
I wrote an interpretation of this paper for that forum and sent it off to one of the authors of it to see how well I conveyed the content. It does not cover every issue raised by the paper.
Below is what I wrote and it's followed by his response.
I offer this in the spirit of shared knowledge.
Hope it helps.
Good luck with your endeavors,
Nicola Andrews
Here is what I wrote:
This paper discusses new ways for improving web information
retrieval.
It wants to assign a value, like PR, to pages based upon the
value of their content as “key entry pages”. A key entry page
has one major topic and is not part of a larger site based
upon that same topic.
PR uses the web page as the smallest unit of comparison.
The authors say that a smaller unit, the semantic block,
should be used. This is sensible since every page is made
up of blocks that direct the reader to different topics of
information.
This proposed system uses blocks to analyze the web page for
topic relevance and importance.
Their system can assign value to different blocks based upon
their position on the web page. The closer a block is to
the center of a page, the higher importance it is assigned.
And it also considers the size of a block. It assumes that
the larger a block, the higher its probable importance to
the web page contents.
This helps to remove what they call “noisy” information –-
like navigation, ads and decoration -- from the page
weighting.
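To make the position-and-size idea above concrete, here is a rough Python sketch. The scoring formula (relative area divided by one plus the normalized distance from the page center), the `Block` type and the sample layout are all assumptions made for illustration; the paper's actual importance function may well differ.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Block:
    """A rectangular region of a rendered page (hypothetical type)."""
    name: str
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


def block_importance(blocks, page_width, page_height):
    """Return a weight per block; the weights sum to 1.

    Assumed formula: larger blocks and blocks nearer the page
    center score higher. This is an illustration only.
    """
    cx, cy = page_width / 2, page_height / 2
    max_dist = hypot(cx, cy)  # distance from the center to a corner
    raw = {}
    for b in blocks:
        bx = b.x + b.width / 2
        by = b.y + b.height / 2
        dist = hypot(bx - cx, by - cy) / max_dist   # 0 at the center
        area = (b.width * b.height) / (page_width * page_height)
        raw[b.name] = area / (1.0 + dist)
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


# A made-up 1000x800 layout: nav bar, main article, side ad, footer.
blocks = [
    Block("nav", 0, 0, 1000, 100),
    Block("article", 200, 150, 600, 500),
    Block("ad", 850, 150, 150, 500),
    Block("footer", 0, 700, 1000, 100),
]
weights = block_importance(blocks, 1000, 800)
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {weight:.3f}")
```

With this layout the central article block ends up with most of the weight, and the nav bar, ad and footer (the "noisy" blocks) share what is left.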
It also assigns a different value to this noisy information.
Pages linked from a block are assumed to be linked because
the block’s author thinks they are relevant to the block the
links appear in. This is called “human endorsement”. Nav bars
and ads would, in their model, lose this aspect of
endorsement.
In PageRank one value is given to each page.
In this system two values are given.
The page is given an authority value and the block is given
a hub value. The authority value is a measure of page
content match to a search query. And the hub value is a
measure of the block content match to the search query.
By looking at the page to block relationship (page layout)
and block to page relationship (link analysis) they have
developed an initial model to improve the results for
information searches.
A page with a lot of different topics would have a low
authority score.
A page with a lot of closely related topics (the blocks)
would have higher authority score.
The hubs are also looked at in terms of their outbound
links. The number of links and the degree of relatedness to
the block are evaluated to calculate the hub value.
A block on the page with a low hub score would be one that
has a poor match to the page’s main content.
I am not sure how this would affect portal sites, like
Yahoo, used in the paper.
It seems to me that the number of blocks would be very large
and the difference in topics would also be very large. I
guess a consideration would be the number of blocks that
link to Yahoo.
It may also mean a way for them to increase advertising,
without reducing relevance ranking, because the “noise” of
the advertising is filtered out.
Also of interest are the documents cited in the References.
Among them:
“Extracting content structure for web pages based upon
visual representation”
“Recognizing nepotistic links on the web”
“Does “authority” mean quality? Predicting expert quality
ratings of web documents”
“Automatic resource list compilation by analyzing hyperlink
structure and associated text”
Seems to me if you are building quality content sites, you
will probably do better with this new algorithm.
End of article
Dear Nicola,
Thanks for reading our paper. I have looked through your writing. I think
you have understood this paper very well. What I want to say is as
follows:
1. This paper is built entirely upon our previous work on web page
segmentation, i.e. the Vision-Based Page Segmentation (VIPS) algorithm.
The performance of the algorithm presented in this paper depends mainly
on the performance of the VIPS algorithm. In fact, we are currently
working with MSN search and the most important factor of our algorithm is
its speed. In other words, I believe that our algorithm can improve
current search engines like Google, MSN search, etc., but it needs to be
much faster.
2. From an algorithmic perspective, I think the most important aspect of our
work is the way to build graph models, rather than the way to compute the
importance value of pages (blocks). The graph models reflect the structure of
the Internet. Better graph models lead to a better understanding of the
structure of the Internet.
3. You might also be interested in the following paper:
http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=742
The main difference between our method and traditional link analysis is
that we use different graph models. However, once the page-to-page graph
is constructed, we use the same approach to compute an importance value for
each page, which is called "block level PageRank" in our paper. What you
talked about is actually the HITS algorithm, based on which we can
assign an authority value to a page and a hub value to a block.
We actually have another Tech Report which might contain more details
about the graph models.
http://research.microsoft.com/research/pubs/view.aspx?type=Technical%20Report&id=742
Best,
Xiaofei
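The two steps described in the letter (build a page-to-page graph from the block relationships, then run ordinary PageRank on it) might look roughly like the Python sketch below. The way the edge weights are combined (block importance within the page, split evenly across the pages that block links to), the function names and the three-page example are assumptions for illustration, not code or formulas taken from the paper.

```python
def page_graph(page_to_block, block_to_page):
    """Build weighted page-to-page edges from block relationships.

    page_to_block[p][b]: importance of block b within page p
                         (assumed to sum to 1 per page).
    block_to_page[b]:    list of pages that block b links to.
    Assumption: a block splits its weight evenly across its links.
    """
    graph = {p: {} for p in page_to_block}
    for p, blocks in page_to_block.items():
        for b, weight in blocks.items():
            targets = block_to_page.get(b, [])
            for q in targets:
                graph.setdefault(q, {})  # make sure link targets are nodes
                graph[p][q] = graph[p].get(q, 0.0) + weight / len(targets)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a weighted graph."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, out in graph.items():
            total = sum(out.values())
            if total == 0:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q, w in out.items():
                    new[q] += damping * rank[p] * w / total
        rank = new
    return rank


# Made-up example: page A links to B from its main block (weight 0.8)
# and to C from its nav bar (weight 0.2). B and C each link back to A.
page_to_block = {
    "A": {"A.main": 0.8, "A.nav": 0.2},
    "B": {"B.main": 1.0},
    "C": {"C.main": 1.0},
}
block_to_page = {
    "A.main": ["B"],
    "A.nav": ["C"],
    "B.main": ["A"],
    "C.main": ["A"],
}
ranks = pagerank(page_graph(page_to_block, block_to_page))
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In page-level PageRank, B and C would get the same share of A's vote. Here B comes out ahead of C because its link sits in A's main block rather than in the nav bar.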
ihelpyou
22-08-2004, 13:29/01:29PM
Welcome to the forums Nicola! :hi:
Thanks for that! :up:
Martin
10-09-2004, 12:40/12:40PM
I don't think they are trying to trap spamming.
Randy probably meant: The emphasis is on other things than spam, but are "they" keen on trapping spam at all these days?
I've seen a number of sites that are spamming like mad and have PR5. Are they so confident about their little algorithms that they say: "We'll index the good stuff and forget about the spam"?
Crichey
16-09-2004, 17:02/05:02PM
Hmmm....first, it's Microsoft. It will be buggy. IF it works, by the time it does it will be old hat. (Anyone have ME?) I think that it's great in theory, bad in reality. If I understand it correctly, what could happen is that a high-profile site could get top ranking for subjects that they really don't stand for, just by having a great "block" on their page. I could see sites being ranked for ads they are displaying----when the ads are really not related to the rest of the site. All you would need to do is experiment until you hit the magic formula, and bingo----you rule MSN search. Of course, those ads would be MSN PPC ads.....
RandyDotcom
16-09-2004, 17:19/05:19PM
That's what I was thinking. What if a really popular site like ESPN had an article about a player's cat? Then all searches for cat toys would point to the ESPN site.
It's silly to think that block content is relevant. What is important is the page taken as a whole.
Catfish
15-10-2004, 18:26/06:26PM
Originally posted by RandyDotcom
It's silly to think that block content is relevant. What is important is the page taken as a whole.
I think the jury is still out. It would certainly limit the effectiveness of link pages that have links of every shape and size for all subjects under the sun from Viagra to Vegas.
I think a combination of block analysis as it relates to the overall theme of a page is an interesting concept for link analysis (i.e. block analysis combined with Topic-Sensitive PageRank).
Now the practical implementation on the other hand, may or may not be doable.
If it's MSN, it should be ready by at least 2012. Of course, anyone using the new MSN search will probably have to download a new patch monthly for security reasons. :rolleyes: