中文XML论坛--Google的PageRank 算法

http://www.a100.cn/article/Article_Show.asp?ArticleID=999

[ 作者：戴云杰 | 转贴自：蓝色理想 | 点击数：71 | 更新时间：2004-11-2 | 文章录入：飞虹 ]

减小字体增大字体

前段时间看了 hedong对于PageRank算法学习的文章http://hedong.3322.org/archives/000199.html，参考了 PageRank的英文原始资料，感觉hedong写的内容稍微少了点，能有原版译文就更好了！Google了一下，没任何资料……还是自己开金山词霸看吧-.-

　　想想反正都看了，索性再花点时间写成文字记下来，方便今后的同道者。可是……555，偶e文实在太Poor了，因此将原文一段段附上，如有严重错误，请一定留言指正！

　　这是第一段，译自：Google PageRank Introduction - http://pr.efactory.de/

--------------------------------------------------------------------------------
　　Within the past few years, Google has become the far most utilized search engine worldwide. A decisive factor therefore was, besides high performance and ease of use, the superior quality of search results compared to other search engines. This quality of search results is substantially based on PageRank, a sophisticated method to rank web documents.

　　在过去几年之内，Google成为了全世界被使用的最多的搜索引擎。与其它搜索引擎比较，除高性能和易用以外，一个决定性的因素是它的优秀的搜索结果。搜索结果的这质量极大地来源于PageRank——一个精密的排序网页文件等级的方式。

　　The aim of these pages is to provide a broad survey of all aspects of PageRank. The contents of these pages primarily rest upon papers by Google founders Lawrence Page and Sergey Brin from their time as graduate students at Stanford University.

　　本文的主要目的就是对PageRank的各个方面做一次广泛的勘测。本文内容主要依据Google创始人Lawrence Page和Sergey Brin在他们作为斯坦福大学研究生时的文章。

　　It is often argued that, especially considering the dynamic of the internet, too much time has passed since the scientific work on PageRank, as that it still could be the basis for the ranking methods of the Google search engine. There is no doubt that within the past years most likely many changes, adjustments and modifications regarding the ranking methods of Google have taken place, but PageRank was absolutely crucial for Google's success, so that at least the fundamental concept behind PageRank should still be constitutive.

　　经常被讨论的是，尤其是考虑到互联网的动态性，自从PageRank科学工作开始，许多时间被浪费了，因为他仍然可以是Google搜索引擎的等级等级的基本依据。毋庸置疑，在过去几年内有许多关于Google等级方法的调整和修改，但PageRank是Google成功的绝对关键，因此至少PageRank的根本概念在之后应该仍然不会改变的。

　　Since the early stages of the world wide web, search engines have developed different methods to rank web pages. Until today, the occurence of a search phrase within a document is one major factor within ranking techniques of virtually any search engine. The occurence of a search phrase can thereby be weighted by the length of a document (ranking by keyword density) or by its accentuation within a document by HTML tags.

PageRank的概念

　　从万维网的早期，搜索引擎开发不同的方法排序网页。实际上，直到今天，任一个搜索引擎对网页的排序，是根据搜索的词组短语在页面中的出现次数，并用页面长度和html标签的重要性提示等进行权重修订。

　　For the purpose of better search results and especially to make search engines resistant against automatically generated web pages based upon the analysis of content specific ranking criteria (doorway pages), the concept of link popularity was developed. Following this concept, the number of inbound links for a document measures its general importance. Hence, a web page is generally more important, if many other web pages link to it. The concept of link popularity often avoids good rankings for pages which are only created to deceive search engines and which don't have any significance within the web, but numerous webmasters elude it by creating masses of inbound links for doorway pages from just as insignificant other web pages.

　　为了得到更好的搜索结果，尤其是使搜索引擎自动抵制那些基于对详细等级标准页面（入口页）内容的分析而自动生成的网页，连接人气值的概念开始被开发了。根据这个概念，一个网页文件的入链数量通常表示此文件的重要程度。因此，一般地，如果从其他网页链接到一个网页的数量越多，那么这个网页就越重要。链接人气值的概念通常可以避免那些只被创造出来欺骗搜索引擎并且没有任何实际意义的网页得到好的等级，然而，许多网站管理员为了避免发生这种情况，他们从其他没有意义的网页创建大量入站链接，而不是从入口页（doorway pages）。

　　Contrary to the concept of link popularity, PageRank is not simply based upon the total number of inbound links. The basic approach of PageRank is that a document is in fact considered the more important the more other documents link to it, but those inbound links do not count equally. First of all, a document ranks high in terms of PageRank, if other high ranking documents link to it.

　　与链接人气值向比较，PageRank的概念并不是简单地根据入站链接的总数。PageRank基本的方法是，越是重要的文件链接一个文件，则这个文件就越重要，但那些入站链接并不是被平等计算的。首先，如果其他高等级的文件连接到它，那么根据PageRank的规则，此文件的等级也高。

　　So, within the PageRank concept, the rank of a document is given by the rank of those documents which link to it. Their rank again is given by the rank of documents which link to them. Hence, the PageRank of a document is always determined recursively by the PageRank of other documents. Since - even if marginal and via many links - the rank of any document influences the rank of any other, PageRank is, in the end, based on the linking structure of the whole web. Although this approach seems to be very broad and complex, Page and Brin were able to put it into practice by a relatively trivial algorithm.

　　如此, 在PageRank概念中，文件的等级由与它连接那些文件的等级决定的。它们的等级再由与他们连接文件的等级决定。因此, 文件的PageRank由其他文件的PageRank总递归之和确定。因为，即使是在边缘的少量链接，任一个文件的等级都会影响些其他文件的等级，概言之，PageRank的等级是由整个网的连接结构决定的。虽然这种方法似乎是非常宽泛和复杂的, Page和Brin已经能够通过一个微不足道的运算法则将它投入实践了。

　　个人总结：PageRank绝对是个很科学的小创意。说他科学，你会在我以后的文章中看到Google是如何将数学（具体来说多数是统计学）理论淋漓尽致地发挥在搜索技术之中。说他“小”，因为这些理论对于搞数学的人来说实在太微不足道了，甚至稍微有些科学高数知识的人都能理解。

　　我一向认为，搜索引擎对于互联网的价值就好比桌面操作系统对于计算机的价值，微软已经无可争议地占领PC桌面之后，互联网的桌面之争从Internet诞生起就异常惨烈，后来Yahoo!因为进入互联网最早而取得阶段性胜利。不过那时候的搜索引擎对于我们来说好比是马桶……不得不用，一用就恶心。那时无论是Yahoo! 、AltaVista、AllTheWeb或者Lycos，搜索出来几乎都是大便。

　　对于我来说，生命中出现搜索引擎的一天，是我同学的一个英国的同学告诉我用用看www.google.com。

　　Lawrence Page和Sergey Brin在个别场合描述了PageRank最初的算法。这就是

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) ——算法1
式中：

PR(A) :网页A页的PageRank值；
PR(Ti) :链接到A页的网页Ti的PageRank值；
C(Ti) :网页Ti的出站链接数量；
d :阻尼系数，0<d<1。
　　可见，首先，PageRank并不是将整个网站排等级，而是以单个页面计算的。其次，页面A的PageRank值取决于那些连接到A的页面的PageRank的递归值。

　　PR(Ti)值并不是均等影响页面PR(A)的。在PageRank的计算公式里，T对于A的影响还受T的出站链接数C(T)的影响。这就是说，T的出站链接越多，A受T的这个连接的影响就越少。

　　PR(A)是所有PR(Ti)之和。所以，对于A来说，每多增加一个入站链接都会增加PR(A)。

　　最后，所有PR(Ti)之和乘以一个阻尼系数d，它的值在0到1之间。因此，阻尼系数的使用，减少了其它页面对当前页面A的排序贡献。

随机冲浪模型
　　Lawrence Page和Sergey Brin为以上这个PageRank算法给出了一个非常简单直观的解释。他们将PageRank视作一种模型，就是用户不关心网页内容而随机点击链接。

　　网页的PageRank值决定了随机访问到这个页面的概率。用户点击页面内的链接的概率，完全由页面上链接数量的多少决定的，这也是上面PR(Ti)/C(Ti)的原因。

　　因此，一个页面通过随机冲浪到达的概率就是链入它的别的页面上的链接的被点击概率的和。并且，阻尼系数d减低了这个概率。阻尼系数d的引入，是因为用户不可能无限的点击链接，常常因无聊而随机跳入另一个页面。

　　阻尼系数d定义为用户不断随机点击链接的概率，所以，它取决于点击的次数，被设定为0-1之间。d的值越高，继续点击链接的概率就越大。因此，用户停止点击并随机冲浪至另一页面的概率在式子中用常数(1-d)表示。无论入站链接如何，随机冲浪至一个页面的概率总是(1-d)。(1-d)本身也就是页面本身所具有的PageRank值。

　　Lawrence Page和Sergey Brin在不同的刊物中发表了2个不同版本的PageRank的算法公式。在第二个版本的算法里，页面A的PageRank值是这样得到的：

PR(A) = (1-d) / N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) ——算法2
　　这里的Ｎ是整个互联网网页的总数。这个算法2，并不是完全不同于算法1。随机冲浪模型中，算法2中页面的PageRank值就是在点击许多链接后到达这个页面页面的实际概率。因此，互联网上所有网页的PageRank值形成一个概率分布，所有RageRank值之和为1。

　　相反地，第一种算法中随机访问到一个页面的概率受到互联网网页总数的影响。因此，算法2解得的PageRank值就是用户开始访问过程后，该页面被随机访问到的概率的期望值。如果互联网有100个网页，其中一个页面PageRank值为2；那么，如果他将访问互联网的过程重新开始100次（xdanger注：这句话具体含义是，该用户随机点击网页上的链接进入另一个页面，每点击一次都有一定概率因疲劳或厌倦或其他任何原因停止继续点击，这就是阻尼系数d的含义；每当停止点击后，即算作此次访问结束，然后随机给出一个页面让他开始另一次访问过程；让他将这样的“手续”重复进行100次），平均就有2次访问到该页面。

　　就像前面所提到的，两种算法并非彼此是本质的不同。用算法2解得的PR(A)乘以互联网的总网页数N，即得到由算法1解得的PR(A)。Page和Brin在他们最著名的刊物《The Anatomy of a Large-Scale Hypertextual Web Search Engine》中调和了两种算法，文中声称算法1是将PageRank形成对于互联网网页的一个概率分布，其和为1。

　　接下来，我们将使用算法1。理由是算法1忽略了互联网的网页总数，使得更易于计算。


	W 3 C h i n a ( since 2003 ) 旗下站点苏ICP备05006046号《全国人大常委会关于维护互联网安全的决定》《计算机信息网络国际联网安全保护管理办法》	56.641ms