Google Page Rank Project
Linear Algebra, Summer 2012

How does an internet search engine, like Google, work? In this project you will discover how the PageRank algorithm works to place the most relevant information at the top of a Google search.

This document is organized as follows:
- a brief description of what happens when you query a search engine,
- a brief description of the PageRank algorithm (with a few tasks for you),
- the description of your major tasks for this project.

1 A Google Query

Search engines compile large indexes of the dynamic information on the internet so that it can be searched easily. This means that when you do a Google search, you are not actually searching the internet; instead, you are searching the indexes at Google.

When you type a query into Google, the following two steps take place:

1. Query Module: The query module at Google converts your natural language into a language that the search system can understand, and consults the various indexes at Google in order to answer the query. This is done to find the list of relevant pages.
2. Ranking Module: The ranking module takes the set of relevant pages and ranks them according to some criterion. The outcome is an ordered list of webpages such that the pages near the top of the list are most likely to be what you desire from your search. This ranking is the same as assigning a popularity score to each web site and then listing the relevant sites by this score.

This project focuses on the linear algebra behind the ranking module developed by Sergey Brin and Larry Page: the PageRank algorithm.

2 The PageRank Algorithm

In simple terms: a webpage is important if it is pointed to by other important pages. The internet can be viewed as a directed graph (look up this term on Wikipedia) where the nodes are the webpages and the edges are the hyperlinks between the pages. The hyperlinks into a page are called inlinks, and the ones pointing out of a page are called outlinks.
In essence, a hyperlink from my page to yours is my endorsement of your page. Thus, a page with more recommendations must be more important than a page with only a few. However, the status of the page making the recommendation is also important.

Let us now translate this into mathematical equations. To help understand this, we first consider the small web of six pages shown in Figure 1. The links between the pages are shown by arrows. An arrow pointing into a node is an inlink and an arrow pointing out of a node is an outlink. In Figure 1, node 3 has three outlinks (to nodes 1, 2, and 5) and one inlink (from node 1).

The PageRank of a page $P_i$, denoted $r(P_i)$, is the sum of the PageRanks of all pages pointing into $P_i$, each divided by that page's number of outlinks:

$$r(P_i) = \sum_{P_j \in B_{P_i}} \frac{r(P_j)}{|P_j|} \tag{1}$$
where $B_{P_i}$ is the set of pages pointing into $P_i$, and $|P_j|$ is the number of outlinks from page $P_j$. This means that each inlink is weighted by the number of pages its source page links to: a link from a page with many outlinks counts for less.

[Figure 1: Sample graph of a web with six pages.]

This formula has the problem that to find the PageRank of one page you must first know the PageRanks of all of the other pages. To overcome this problem we iterate (find successive approximations of $r(P_i)$) using the formula

$$r_{k+1}(P_i) = \sum_{P_j \in B_{P_i}} \frac{r_k(P_j)}{|P_j|}. \tag{2}$$

In this formula, the subscripts describe the iterate. This process is initiated with $r_0(P_i) = 1/n$ for all pages $P_i$ and repeated with the hope that the PageRank scores will eventually converge to some final stable values. Applying equation (2) to the web in Figure 1 gives the values in Table 1 for the PageRanks after a few iterations.

Iteration 0         Iteration 1          Iteration 2           Iteration 3
r_0(P_1) = 1/6      r_1(P_1) = 1/18      r_2(P_1) = 1/36
r_0(P_2) = 1/6      r_1(P_2) = 5/36      r_2(P_2) = 1/18
r_0(P_3) = 1/6      r_1(P_3) = 1/12      r_2(P_3) = 1/36
r_0(P_4) = 1/6      r_1(P_4) = 1/4       r_2(P_4) = 17/72
r_0(P_5) = 1/6      r_1(P_5) = 5/36      r_2(P_5) = 11/72
r_0(P_6) = 1/6      r_1(P_6) = 1/6       r_2(P_6) = 14/72

Table 1: First few iterates using equation (2)

These PageRanks yield the following rankings at each iteration:

Page    Iteration 0    Iteration 1    Iteration 2    Iteration 3
P_1     1              5              5
P_2     1              3              4
P_3     1              4              5
P_4     1              1              1
P_5     1              3              3
P_6     1              2              2

Table 2: Rankings for first few iterates of equation (2)

Task #1: Fill in the tables above with the PageRank and rankings for iteration 3.

The process described here can be simplified using matrices. Define the matrix $H$ as an $n \times n$ matrix and define $x$ as an $n \times 1$ vector. The matrix $H$ is called the hyperlink matrix, with entries

$$H_{ij} = \begin{cases} 1/|P_i|, & \text{if there is a link from node } i \text{ to node } j, \\ 0, & \text{otherwise.} \end{cases}$$
Using this notation for the web in Figure 1, $H_{11} = 0$, $H_{12} = 1/2$, $H_{13} = 1/2$, $H_{14} = 0$, $H_{15} = 0$, and $H_{16} = 0$.

Task #2: Fill in the remainder of the entries for the $H$ matrix for the web shown in Figure 1.

$$H = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \end{pmatrix}$$

The vector $x$ is a vector that contains all of the PageRanks. Therefore,

$$x_0 = \begin{pmatrix} 1/6 \\ 1/6 \\ 1/6 \\ 1/6 \\ 1/6 \\ 1/6 \end{pmatrix} \quad \text{and} \quad x_1 = \begin{pmatrix} 1/18 \\ 5/36 \\ 1/12 \\ 1/4 \\ 5/36 \\ 1/6 \end{pmatrix}.$$

Task #3: Verify that $x_1 = H^T x_0$, $x_2 = H^T x_1$, and $x_3 = H^T x_2$. (You do not need to show any calculations here. Just do the calculations on the side.)

The general task to find the PageRank is to find successive approximations for $r(P_i)$ using

$$x_{k+1} = H^T x_k. \tag{3}$$

If this process reaches a steady state, then that steady state is the final PageRank for the web. In principle the iteration can proceed forever, but in certain instances the iterates settle to a stationary solution: over time the values in $x$ stop changing. In other words, the sequence $x_k$ has a limit.

Task #4: Apply this process several times using MATLAB and approximate the stationary vector for this example.

x =

Given this stationary point solution to (3), what are the rankings of the web pages?

Task #5: Plot the behavior of the iterates over the first 20 iterations, and explain the plot that you see.
(uncomment the code here) (be sure that your image file is saved in the same folder as your tex file, and the file type should be pdf)
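The definition of the hyperlink matrix and the iteration $x_{k+1} = H^T x_k$ can be sketched in MATLAB as follows. This is only a minimal sketch on a hypothetical three-page web (links 1 to 2, 1 to 3, 2 to 3, and 3 to 1), which is an assumption made for illustration and is not the web of Figure 1. The variable names links, n, and outlinks are likewise illustrative.

```matlab
% Hypothetical 3-page web (an assumption for illustration, not Figure 1).
% Each row of links is [from, to].
links = [1 2; 1 3; 2 3; 3 1];
n = 3;

% Mark every link, then divide each row by its number of outlinks,
% so that H(i,j) = 1/|P_i| if page i links to page j and 0 otherwise.
H = zeros(n);
for k = 1:size(links, 1)
    H(links(k,1), links(k,2)) = 1;
end
outlinks = sum(H, 2);            % |P_i| for each page i
for i = 1:n
    if outlinks(i) > 0           % a page with no outlinks keeps a zero row
        H(i,:) = H(i,:) / outlinks(i);
    end
end

% Iterate x_{k+1} = H' * x_k starting from x_0 = 1/n in every entry.
x = ones(n, 1) / n;
for k = 1:50
    x = H' * x;
end
disp(x')
```

For this small web the iterates should settle near (0.4, 0.2, 0.4), which can be confirmed by hand by solving $x = H^T x$ with entries summing to 1.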
3 Big Project Tasks

1. Prove that the sum of each row of the $H$ matrix for any web of $n$ pages will always be either 0 or 1. Be sure to indicate the special circumstances when the sum of a row is zero. (You should not simply use the example from Figure 1.)
(Soln) +Your proof gets typed here+

2. Definition: A probability vector is a vector with nonnegative entries that add up to 1.
Definition: A stochastic matrix is a square matrix whose columns are probability vectors.

3. What must be true about a collection of $n$ pages such that $H^T$ is a stochastic matrix?
(Soln) +Your solution gets typed here+

4. Consider the web in the figure below.

[Figure 2: Graph of a web with eight pages.]

(a) Write the $H$ matrix, the initial state $x_0$, and the steady state PageRank vector.
(Soln) +Your solution gets typed here+
(b) Rank the web pages according to the PageRank vector.
(Soln) +Your solution gets typed here+
(c) Create a graph of the iterates of the PageRank vector for 50 iterations.
(Soln) (uncomment the code here)
(d) Use MATLAB (or something similar) to find the largest eigenvalue of $H^T$ and the eigenvector associated with this eigenvalue. Compare this eigenvector to the PageRank vector from part (a). (Keep in mind that MATLAB returns normalized eigenvectors, so the eigenvector may differ from the PageRank vector by a scale factor.)
(Soln) +Your Solution Gets Typed Here+

5. Theorem: The largest eigenvalue of every stochastic matrix is 1. (You do not need to prove this.)

6. Explain how the fact that $H^T$ is a stochastic matrix relates to the PageRank vector, and use this to explain how Google will rank the search results for a web query. Hint: Google does not do an iterative process.
(Soln) +Your Solution Gets Typed Here+
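The definitions of a probability vector and a stochastic matrix given in task 2 can be checked numerically. The following is a minimal sketch, assuming a hypothetical 2-by-2 matrix A chosen only for illustration (it is not the $H$ matrix of either web) and an arbitrary round-off tolerance tol.

```matlab
% Hypothetical matrix, chosen only to illustrate the definitions.
A = [0.5, 0.2;
     0.5, 0.8];
tol = 1e-12;                      % round-off tolerance (an assumption)

isSquare      = size(A, 1) == size(A, 2);
isNonnegative = all(A(:) >= 0);
columnSums    = sum(A, 1);        % sum down each column
isStochastic  = isSquare && isNonnegative && all(abs(columnSums - 1) < tol);

disp(isStochastic)                % true when every column is a probability vector
disp(sum(A, 2)')                  % row sums, shown for comparison
```

In this example the columns each sum to 1 while the rows do not, so the check depends on whether the matrix or its transpose is being tested.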
A MATLAB Basics

The following are some basics for running MATLAB on this project.

To type in the matrix $H = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}$, type the following and press enter:

>> H = [1, 2, 3; 4, 5, 6; 7, 8, 9]

Similarly, to type the vector $x = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$, type the following and press enter:

>> x = [1; 2; 3]

To transpose a matrix, use the apostrophe:

>> H'

To iterate equation (3) you can run

>> x = H' * x

over and over again. Each time will produce the result of one iterate.

To do several iterates quickly you can write a for loop as follows. For this example, assume that $x_0 = \begin{pmatrix} 0.4 \\ 0.3 \\ 0.3 \end{pmatrix}$ and $H = \begin{pmatrix} 0.1 & 0 & 0.9 \\ 0 & 0.3 & 0.7 \\ 0.3 & 0.2 & 0.5 \end{pmatrix}$.

>> H = [0.1, 0, 0.9; 0, 0.3, 0.7; 0.3, 0.2, 0.5];
>> x(:,1) = [0.4; 0.3; 0.3];
>> for j=2:20, x(:,j) = H' * x(:,j-1); end

This example stores $x_0$ and the next 19 iterates as the 20 columns of x. You can change the 20 to anything that you like.

To plot the behavior of $x_k$ over the iterations, type

>> plot(x');
>> legend('node 1', 'node 2', 'node 3')

To find the eigenvalues of a matrix A, type

>> eigs(A)

To find the eigenvalues and eigenvectors of A, type

>> [evec, evals] = eigs(A)

The columns of the matrix evec are the eigenvectors, and the diagonal entries of evals are the corresponding eigenvalues.
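MATLAB returns eigenvectors scaled to unit length, possibly with a flipped sign, so an eigenvector usually has to be rescaled before it can be compared with a probability vector. The following is a minimal sketch, assuming a hypothetical 2-by-2 column-stochastic matrix A chosen only for illustration. It uses eig, the dense-matrix counterpart of eigs, and the names evecs, evals, idx, v, and p are illustrative.

```matlab
% Hypothetical column-stochastic matrix (an assumption for illustration).
A = [0.5, 0.2;
     0.5, 0.8];

[evecs, evals] = eig(A);
[~, idx] = max(abs(diag(evals)));   % position of the largest-magnitude eigenvalue
v = evecs(:, idx);                  % its eigenvector (unit 2-norm, sign arbitrary)
p = v / sum(v);                     % rescale so that the entries sum to 1

disp(evals(idx, idx))
disp(p')
```

Here the largest eigenvalue should be 1 and p should come out close to (2/7, 5/7); multiplying A*p should reproduce p up to round-off.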