advertisement

Bryan Morgan CompSci 49S Shivnath Babu The Google Pagerank Algorthm and How it Works Introduction In this paper, the author, Ian Rogers, will: 1. Clearly explain how PageRank is calculated. 2. Go through every example in Chris’ paper, and add some more of my own, showing the correct PageRank for each diagram. By showing the code used to calculate each diagram I’ve opened myself up to peer review - mostly in an effort to make sure the examples are correct, but also because the code can help explain the PageRank calculations. 3. Describe some principles and observations on website design based on these correctly calculated examples. How is PageRank Used? PageRank is one of the methods Google uses to determine a page’s relevance or importance. The maximum PR of all pages on the web changes every month when Google does its re-indexing. PageRank says nothing about the content or size of a page, the language it’s written in, or the text used in the anchor of a link! So what is PageRank? In short PageRank is a “vote”, by all the other pages on the Web, about how important a page is. A link to a page counts as a vote of support. If there’s no link there’s no support (but it’s an abstention from voting rather than a vote against the page). The equation for PageRank is: PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn/C(TN)) Let’s break the equation down into sections 1. PR(Tn) – each page has a conception of its own self-importance. So “PR(T1)” is for the first page in the web and “PR(Tn)” is the last page 2. C(Tn) – This is the number of outgoing links a page has. 3. PR(Tn)/C(Tn) – so if our page (Page A) has a backlink from page :n” the share of the vote page A will get is “PR(Tn)/C(Tn)” 4. d – d is the dampering factor. It damps down the total vote by multiplying by 0.85 (the factor “d”) 5. (1-d) – if a page does not get pointed to, it will still get a small PR of .15 (i.e. 10.85). How is PageRank Calculated? This is where it gets tricky. The PR of each page depends on the PR of the pages pointing to it. But we won’t know what PR those pages have until the pages pointing to them have their PR calculated and so on… And when you consider that page links can form circles it seems impossible to do this calculation! But actually it’s not that bad. Remember this bit of the Google paper: PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. What that means to us is that we can just go ahead and calculate a page’s PR without knowing the final value of the PR of the other pages. That seems strange but, basically, each time we run the calculation we’re getting a closer estimate of the final value. So all we need to do is remember the each value we calculate and repeat the calculations lots of times until the numbers stop changing much. Examples: We don’t know what their PR should be to begin with, so let’s take a guess at 1.0 and do some calculations: d = 0.85 PR(A) = (1 – d) + d(PR(B)/1) PR(B) = (1 – d) + d(PR(A)/1) i.e. PR(A) = 0.15 + 0.85 * 1 =1 PR(B) = 0.15 + 0.85 * 1 =1 Hmm, the numbers aren’t changing at all! So it looks like we started out with a lucky guess!!! If you do the same thing for this example here is what it would look like: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: a: 0.00000 0.15000 0.48612 0.81913 1.04169 1.19042 1.28982 1.35626 1.40065 1.43032 1.45015 1.46341 1.47226 1.47818 1.48214 1.48478 1.48655 1.48773 1.48852 1.48904 1.48940 1.48963 1.48979 1.48990 1.48997 1.49001 1.49004 1.49007 1.49008 1.49009 1.49009 1.49010 1.49010 1.49010 b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: b: 0.00000 0.21375 0.35660 0.49813 0.59272 0.65593 0.69818 0.72641 0.74528 0.75789 0.76632 0.77195 0.77571 0.77823 0.77991 0.78103 0.78178 0.78228 0.78262 0.78284 0.78299 0.78309 0.78316 0.78321 0.78324 0.78326 0.78327 0.78328 0.78328 0.78329 0.78329 0.78329 0.78329 0.78329 c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: c: 0.00000 0.39544 0.78721 1.04904 1.22403 1.34097 1.41912 1.47136 1.50626 1.52959 1.54518 1.55560 1.56257 1.56722 1.57033 1.57241 1.57380 1.57473 1.57535 1.57576 1.57604 1.57622 1.57635 1.57643 1.57649 1.57652 1.57655 1.57656 1.57657 1.57658 1.57659 1.57659 1.57659 1.57659 d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: d: 0.00000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000 a: 1.49010 b: 0.78329 c: 1.57659 a: 1.49011 b: 0.78329 c: 1.57660 a: 1.49011 b: 0.78330 c: 1.57660 a: 1.49011 b: 0.78330 c: 1.57660 a: 1.49011 b: 0.78330 c: 1.57660 a: 1.49011 b: 0.78330 c: 1.57660 Average pagerank = 1.0000 d: d: d: d: d: d: 0.15000 0.15000 0.15000 0.15000 0.15000 0.15000