A Fast Two-Stage Algorithm for Computing PageRank
Chris Pan-Chi Lee (Scientific Computing & Computational Mathematics Program, Stanford University), Gene H. Golub (Department of Computer Science, Stanford University), Stefanos A. Zenios (Division of Operations, Information, & Technology, Graduate School of Business, Stanford University)

ABSTRACT

In this paper we present a fast two-stage algorithm for computing the PageRank [16] vector. Our algorithm exploits the observation that the homogeneous discrete-time Markov chain associated with PageRank is lumpable [13]; the lumpable subset of nodes is precisely the set of dangling nodes. As a result, the algorithm can converge in a fraction of the time required by the standard PageRank algorithm [16]. On data of 451,237 webpages, our two-stage algorithm converged in only 20% of the time taken by the standard PageRank algorithm. The algorithm described here also replaces a common practice that is in general not correct: including the dangling nodes only during the last stages of computation [16] does not necessarily accelerate convergence in a general context. Our algorithm, on the other hand, is provable, generally applicable, and achieves the desired speed gains.

Keywords: PageRank, link analysis, dangling nodes, Power Method, eigenvector computation, limiting distribution, statespace reduction, state aggregation, lumpable Markov chains

1. INTRODUCTION

Aside from its commercial success, the PageRank approach to ranking webpages has generated a significant amount of interest in the research community. The Markov chain interpretation gives an explicit model for web traffic and surfer behavior, yet the computation poses a numerically daunting challenge [15]. With billions of webpages already in existence, computing the PageRank vector is a very time-consuming procedure. It is reported in [11] that the computation of a PageRank vector over 290 million webpages requires as much as 3 hours (on a 1.5GHz AMD Athlon with 3.5GB of RAM); the computing time for a realistically large subset of the entire web would take days. Furthermore, frequent computation of the PageRank vector is often necessary. With webpages constantly updated, added, or removed, the PageRank vector needs to be re-computed continuously to maintain the timeliness and relevance of the search results. In the context of personalized web search [9], a number of PageRank vectors need to be computed to reflect the preferences of different classes of websurfers. Clearly, there is a demand for faster algorithms.

The PageRank vector can be regarded as the limiting distribution of a homogeneous discrete-time Markov chain that jumps from webpage to webpage. In this paper we present a fast algorithm for computing the PageRank vector. The algorithm exploits the observation that the Markov chain is in fact lumpable [13]. The algorithm proceeds in two stages. In the first stage, we compute the limiting distribution of the chain in which the dangling nodes [16] are combined into one super node; in the second stage, we compute the limiting distribution of the chain in which the non-dangling nodes are combined into one. (In this paper, the terms node, state, and webpage are used interchangeably.) When the limiting distributions of the two chains are concatenated, we recover the limiting distribution of the original chain, i.e. the PageRank vector. As we shall see, this approach can dramatically reduce the overall amount of computing time.

A number of papers discuss accelerating PageRank computation, and many of these focus on numerical linear algebra techniques.
A Gauss-Seidel algorithm is discussed in [1], where the most recent component values of the PageRank vector are used in the computation. In [12], one periodically subtracts away approximations of the sub-dominant eigenvectors to accelerate convergence. It is noted in [11] that, when sorted by URL, the Google matrix has a block structure; hence a PageRank vector can be computed separately for each block, and the results are pasted together to yield a good starting iterate for the entire matrix. It is noted in [10] that components of the PageRank vector converge at different rates, and hence performance gains are realized by not re-computing components that have already converged.

This paper contributes to this growing literature in a number of ways. First, by adopting a characteristically Markov-chain view and observing that the chain associated with PageRank is lumpable, we are not only able to achieve performance gains, we also bring to the forefront a powerful technique for statespace reduction. This technique of lumping is distinctively different from the better-known technique of state aggregation (cf. [3], [14], [17]), which we also make use of in this paper. Thus, we have a two-stage algorithm where during each stage a different statespace reduction method is used; the reduction is aggressive, the overall performance gains are very significant, and the concept is novel. Second, our approach is analyzable. Whereas previous methods sometimes rely on intuition and approximate arguments, our procedure can be analyzed with greater precision, leading to some very interesting results. In addition, we show that the common practice of including the dangling nodes only during the last stages of computation [16] does not accelerate convergence in general and can be replaced by our present algorithm.
Lastly, our approach is easily combined with many other methods to exploit even greater performance gains. For example, all of the existing methods described above can be combined with our approach, especially during the first stage of our algorithm.

Notation. Notation in this paper is as follows. If v is a vector, then v_i denotes the i-th element of v. If M is a matrix, then M(i, j) denotes the element in the i-th row and the j-th column; M(i:j, k:l) denotes the elements in rows i through j and columns k through l; M(i, :) denotes the entire i-th row; and so on. Superscripts and subscripts may have different meanings depending on the context, but the meaning is always made clear. An un-transposed vector is always a column vector; the transpose is superscripted with a T. \|\cdot\|_1 is the sum of the absolute values of a vector; for example, \|v\|_1 = \sum_i |v_i|. The notation e_n means an n-dimensional vector of 1's.

2. PAGERANK REVIEW

The central idea behind PageRank is to regard web surfing as a Markov chain. Imagine a collection of webpages indexed as S = {1, 2, ..., N}, and suppose we have a personalization vector u \in R^{N \times 1} which records a generic surfer's preference for each page in S. (Specifically, the personalization vector is assumed to be componentwise positive and normalized so that \sum_{l=1}^{N} u_l = 1.) Let this generic surfer be currently at some page i \in S. We assume that at the next time step, the surfer will move to some j \in S according to the probability

    Q(i, j) = \begin{cases} \frac{G(i, j)}{\sum_{l=1}^{N} G(i, l)} & \text{if } G(i, l) = 1 \text{ for some } l \\ u_j & \text{otherwise} \end{cases}

where

    G(i, j) = \begin{cases} 1 & \text{if there is an outlink from } i \text{ to } j \\ 0 & \text{otherwise.} \end{cases}

The above definition has a nice interpretation. If the i-th page has outlinks, the surfer will move to one of the outlinks with equal probability; this corresponds to the first case in the definition of Q above. If no outlink from i exists, the surfer will move to any page in S with a probability according to preference; this corresponds to the second case above. A page that has no outlink is called a dangling page.

At each time step k = 0, 1, 2, ..., we assume the surfer jumps from page to page according to the above probabilities. This gives rise to a homogeneous discrete-time Markov chain; Q = [Q(i, j)] is simply the transition probability matrix for this Markov chain. Let \pi^{(0)} \in R^{N \times 1} denote the probability distribution of where the surfer is to be found at the initial time step; then the distribution for the k-th time step is given by \pi^{(k)T} = \pi^{(0)T} Q^k. The idea of PageRank is that the importance of webpages can be defined as the limiting probability distribution associated with the Markov chain as k \to \infty. Such a limiting distribution can be interpreted as the proportion of time the surfer spends on each webpage, and certainly quantifies the notion of importance.

Unfortunately, there is nothing in our definition so far that guarantees convergence to such a limiting distribution as k \to \infty. The solution is to consider a closely related Markov chain obtained by adding to Q a small shift. We take

    P = cQ + (1 - c) e_N u^T                                             (1)

to be our new transition probability matrix. e_N u^T is simply a matrix where each row is u^T. c is a constant in (0, 1), and P in some sense approximates Q for c close to 1. (A typical value for c is between 0.85 and 0.99; it is shown in [6] that c controls the convergence rate of the PageRank algorithm.) It is easily seen that P is a positive matrix, and each row sums to 1. In particular, the Markov chain associated with P is irreducible and aperiodic. (The positivity ensures a direct positive-probability path between any two pages, and hence the irreducible and aperiodic properties; cf. [2], [13].) The Perron-Frobenius Theorem and the Power Method (cf. [2], [5]) guarantee that for such a matrix a unique limiting distribution \pi^T = \lim_{k \to \infty} \pi^{(0)T} P^k exists regardless of the initial distribution.
The PageRank vector is defined to be this limiting distribution \pi. P is also known as the Google matrix.

P has some interesting properties. Let S_D denote the subset of S containing only the dangling nodes, and let S_ND = S \ S_D be the subset containing only the non-dangling nodes. Then

    P(i, :) = \begin{cases} c \frac{G(i, :)}{\sum_{l=1}^{N} G(i, l)} + (1 - c) u^T & \text{if } i \in S_{ND} \\ u^T & \text{if } i \in S_D. \end{cases}          (2)

All the rows of P that correspond to a dangling node are identically u^T. The rows that correspond to a non-dangling node are separable into two components: the first component consists of the contribution from G, or the actual outlinks; the second component is u^T.

Below we give the standard algorithm [16] for computing the PageRank vector \pi. The algorithm proceeds by first taking an arbitrary vector, then multiplying it by P repeatedly until convergence; it is an implementation of the Power Method.

ALGORITHM 1 (PAGERANK).

    form \bar{P}, where \bar{P}(i, j) = \begin{cases} G(i, j) / \sum_{l=1}^{N} G(i, l) & \text{if } i \in S_{ND} \\ 0 & \text{if } i \in S_D \end{cases}
    select any y \in R^{N \times 1}
    do
        x = y
        y^T = c x^T \bar{P}
        d = \|x\|_1 - \|y\|_1
        y = y + d u
        \delta = \|y - x\|_1
    until \delta < \epsilon

Notice that P is never actually enumerated. Instead, a matrix \bar{P} is formed. \bar{P} comprises the contribution from G only; the contribution from u is left out completely. Because most webpages have only a handful of outlinks, \bar{P} is mostly zeros and an extremely sparse matrix. The multiplication step x^T \bar{P} can hence be implemented very efficiently. The contribution of u is subsequently added in during the y + du step. Based on this approach, each iteration of the loop can be performed in O(N) operations. In comparison, if P were explicitly enumerated, each iteration would require O(N^2) operations, which is prohibitively expensive due to the large size of N. (As of the year 2000, the number of webpages is on the order of billions; see [15].) We emphasize that the computational savings of the PageRank algorithm are achieved by recognizing that P is separable into a sparse matrix \bar{P} plus a dense vector u; multiplication is done separately with those components, and the results are added together subsequently.
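For concreteness, the following is a minimal Python sketch of Algorithm 1, assuming SciPy sparse matrices; the function and variable names are ours, not the paper's.

    import numpy as np
    import scipy.sparse as sp

    def pagerank_standard(G, u, c=0.85, eps=1e-8):
        # Power iteration on P = c*Q + (1-c)*e_N*u^T, storing only the
        # sparse part P_bar (rows of dangling pages remain all-zero).
        N = G.shape[0]
        out = np.asarray(G.sum(axis=1)).ravel()              # outlink counts
        inv_out = np.where(out > 0, 1.0 / np.maximum(out, 1), 0.0)
        P_bar = sp.diags(inv_out) @ G                        # row-normalized link matrix
        y = np.full(N, 1.0 / N)
        while True:
            x = y
            y = c * (P_bar.T @ x)                            # y^T = c x^T P_bar, cost O(nnz)
            d = np.abs(x).sum() - np.abs(y).sum()            # probability mass lost to c and dangling rows
            y = y + d * u                                    # redistribute it via the personalization vector
            if np.abs(y - x).sum() < eps:                    # delta < epsilon
                return y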
3. PAGERANK AS A LUMPABLE MARKOV CHAIN

Our goal in this paper is to present an algorithm that is a substantial improvement over Algorithm 1. Our approach is based on the observation that the Markov chain associated with P is lumpable (cf. [13], [4]). In general, a Markov chain is lumpable if its transition probabilities satisfy certain properties that allow its states (nodes) to be combined into blocks (super nodes). The block-level transitions yield another Markov chain, the transition probabilities of which can be very easily calculated. Unlike conventional state aggregation (cf. [3], [14], [17]), lumping doesn't require prior knowledge or computation of aggregation weights. Lumping is thus very effective in reducing the size of the statespace.

Definition 1. Suppose M \in R^{n \times n} is the transition probability matrix of a homogeneous discrete-time Markov chain with n states. Let S_1, S_2, ..., S_p \subseteq {1, 2, ..., n} be such that

    \bigcup_{l=1}^{p} S_l = {1, 2, ..., n}   and   S_l \cap S_m = \emptyset   for   1 \le l \ne m \le p.

Then the Markov chain is said to be lumpable with respect to the partition S_1, S_2, ..., S_p if for all l, m \in {1, 2, ..., p}, every i \in S_l satisfies

    \sum_{j \in S_m} M(i, j) = c(l, m)                                   (3)

where the right-hand side is a constant that depends only on l and m. The transition probability matrix for the p-by-p lumped chain is

    \tilde{M} = [c(l, m)].                                               (4)

The key here is that the right-hand side of (3) depends only on l and m. We think of S_1, S_2, ..., S_p as blocks of nodes, and (3) requires every node in the same block to depart for another block with an identical probability. There is thus some notion of symmetry within each block. Lumping the Markov chain is to exploit this symmetry by discarding the within-block details and focusing on the between-block transitions. Because of the symmetry, the computation of the block-level transition probability matrix (4) involves minimal effort.

We now claim the Markov chain associated with P is lumpable with respect to the partition in which all the dangling nodes are lumped into one block and each non-dangling node is a singleton block.

PROPOSITION 1. For each k \in S_ND, define S_k = {k}. The homogeneous discrete-time Markov chain associated with P is lumpable with respect to the partition consisting of S_k for each k \in S_ND, together with S_D. This is a partition with card(S_ND) + 1 blocks.

PROOF. (3) is by construction true for all singleton blocks S_l, S_m with l, m \in S_ND. In addition, for every i \in S_D,

    \sum_{j \in S_m} P(i, j) = P(i, m) = u_m   for all m \in S_ND,   and   \sum_{j \in S_D} P(i, j) = \sum_{j \in S_D} u_j,

which is a constant. See (2).

Thus by lumping the dangling nodes into one block we obtain a Markov chain with just card(S_ND) + 1 states, compared to card(S_ND) + card(S_D) states in the original chain. The lumped Markov chain is irreducible and aperiodic, since its transition probability matrix is necessarily positive. The Perron-Frobenius Theorem and the Power Method guarantee the existence of a unique limiting distribution. This limiting distribution is a vector with card(S_ND) + 1 components: card(S_ND) components are identical to the components of \pi corresponding to S_ND; the remaining component equals the sum of the components of \pi corresponding to S_D. The benefit of lumping the dangling nodes is that typically card(S_D) is very large, often several times larger than card(S_ND). (According to [11], a 2001 crawl by Stanford's WebBase project [7] contains 290 million pages in total; only 70 million are non-dangling.) Lumping can dramatically reduce the size of the transition probability matrix and enables the limiting distribution to be computed with much less effort.
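As an illustration, here is a small dense-matrix sketch (our own, in Python) of the lumping in Proposition 1; for web-scale data one would of course work with the sparse forms developed in Section 4 instead of an explicit P.

    import numpy as np

    def lump_dangling(P, dangling):
        # Lumped matrix of Proposition 1: each non-dangling state is a
        # singleton block; all dangling states form one block. `dangling`
        # is a boolean mask over the states of the full chain P.
        nd = np.flatnonzero(~dangling)
        d = np.flatnonzero(dangling)
        assert np.allclose(P[d], P[d[0]])            # every dangling row is u^T, so (3) holds
        K = len(nd)
        P1 = np.empty((K + 1, K + 1))
        P1[:K, :K] = P[np.ix_(nd, nd)]               # singleton -> singleton
        P1[:K, K] = P[np.ix_(nd, d)].sum(axis=1)     # singleton -> dangling block
        P1[K, :K] = P[d[0], nd]                      # block -> singleton (same for every dangling row)
        P1[K, K] = P[d[0], d].sum()                  # block -> itself
        return P1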
4. A TWO-STAGE ALGORITHM

While the lumped chain can be used to compute the limiting probability of the non-dangling nodes, i.e. \pi_k, k \in S_ND, we are still left with the task of computing the limiting probability of the dangling nodes, i.e. \pi_k, k \in S_D. As it turns out, once we have computed the limiting probability for the non-dangling nodes, the limiting probability for the dangling nodes can be computed with very little additional work. This is done by considering yet another Markov chain, this time obtained by combining all the non-dangling nodes into one block and treating each dangling node as a singleton block. This is a Markov chain with card(S_D) + 1 states. We emphasize this Markov chain is not obtained by lumping, as lumping is not applicable with respect to this particular partition. Rather, we use the traditional state aggregation technique to combine the non-dangling nodes. The procedure requires aggregation weights, but we can readily compute these weights using the limiting probability of the non-dangling nodes.

To summarize, we propose a two-stage algorithm that can be outlined as follows (a sketch of how the steps chain together is given after this list):

1. Compute the transition probability matrix P^{(1)} of the lumped chain where the dangling nodes are combined into one block. See Proposition 1.

2. Compute the limiting distribution of P^{(1)}. This gives us \pi_k for each k \in S_ND, as well as \sum_{k \in S_D} \pi_k. The computation is an iterative procedure similar to Algorithm 1. This step constitutes the bulk of the total work.

3. Compute the weights for state aggregation,

       \pi_k / \sum_{m \in S_ND} \pi_m,

   for each k \in S_ND.

4. Compute the transition probability matrix P^{(2)} of the Markov chain where the non-dangling nodes are combined into a block. This requires the weights computed in Step 3.

5. Compute the limiting distribution of P^{(2)}. This yields \pi_k for each k \in S_D, as well as \sum_{k \in S_ND} \pi_k. The amount of work involved is negligible compared to Step 2, as we will show.

6. Concatenate the results from Step 2 and Step 5 to get \pi_k for all k \in S. This is the limiting distribution of P, or the PageRank vector.
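The following Python sketch shows how the six steps chain together. It is our own illustration, not the paper's code: `stage1`, `stage2`, and `build_rank_two` are hypothetical helpers corresponding to Algorithms 2 and 3 and to equations (13)-(14), formalized in the next sections.

    import numpy as np

    def two_stage_pagerank(G, u, c=0.85, eps=1e-8):
        # Reorder states so the non-dangling nodes come first, as assumed
        # in Section 4 (S_ND = {1,...,K}, S_D = {K+1,...,N}).
        dangling = np.asarray(G.sum(axis=1)).ravel() == 0
        order = np.concatenate([np.flatnonzero(~dangling), np.flatnonzero(dangling)])
        G = G[order][:, order]
        u = u[order]
        K = int((~dangling).sum())
        y = stage1(G[:K], u, c, eps)                 # Steps 1-2: lumped chain, length K+1
        eta = y[:K] / y[:K].sum()                    # Step 3: aggregation weights, eq. (11)
        w, beta, alpha = build_rank_two(G[:K], eta, u, K)   # Step 4: eqs. (13)-(14)
        z = stage2(w, beta, alpha, u[K:], c, eps)    # Step 5: aggregated chain, length N-K+1
        pi = np.concatenate([y[:K], z[1:]])          # Step 6: concatenate the two stages
        return pi[np.argsort(order)]                 # undo the reordering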
We will now formalize Steps 1, 2, 4, and 5, give specific numerical algorithms for an efficient implementation of these steps, and discuss performance issues. To simplify the notation, throughout the subsequent sections we assume, without loss of generality, that S_ND = {1, 2, ..., K} and S_D = {K+1, K+2, ..., N}. P can be partitioned accordingly as

    P = \begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}
      = \begin{pmatrix} P_{11} & P_{12} \\ e_{N-K} u_K^T & e_{N-K} u_{N-K}^T \end{pmatrix}                   (5)

P_{11}, P_{12}, P_{21}, and P_{22} are K-by-K, K-by-(N-K), (N-K)-by-K, and (N-K)-by-(N-K) blocks respectively; the first K rows and columns are associated with the non-dangling nodes, and the last N-K rows and columns are associated with the dangling nodes. Likewise, we partition the personalization vector u^T = [u_K^T, u_{N-K}^T]. Note that (5) follows from (2).

4.1 Formalizing Steps 1 & 2

According to Proposition 1, the transition probability matrix for the lumped chain is given by

    P^{(1)} = \begin{pmatrix} P_{11} & P_{12} e_{N-K} \\ u_K^T & u_{N-K}^T e_{N-K} \end{pmatrix}               (6)

This is a (K+1)-by-(K+1) matrix. The matrix is positive, and each row sums to 1.

Recall that Algorithm 1 was able to perform each iteration of the multiplication step x^T P in O(N) operations, as opposed to O(N^2), by separating P into two parts: a sparse matrix \bar{P} and a dense vector u. Multiplication was done with these parts separately, and the results were subsequently added together. Here we can do the same by separating P^{(1)} into sparse and dense-vector parts. A mathematically equivalent form of (6) is

    P^{(1)} = c \begin{pmatrix} \bar{P}^{(1)} \\ 0^T \end{pmatrix} + \begin{pmatrix} (1-c) e_K \\ 1 \end{pmatrix} \tilde{u}^T          (7)

where for 1 \le i \le K

    \bar{P}^{(1)}(i, :) = \left[ \frac{G(i, 1:K)}{\sum_{l=1}^{N} G(i, l)},\; 1 - \frac{G(i, 1:K)\, e_K}{\sum_{l=1}^{N} G(i, l)} \right]          (8)

    \tilde{u}^T = [u_K^T,\; 1 - \alpha]                                  (9)

and \alpha = u_K^T e_K (see (2)). Notice that \bar{P}^{(1)} is K-by-(K+1) and extremely sparse. Computationally, (7) is a much more efficient form than (6). Let x \in R^{(K+1) \times 1} be an arbitrary componentwise non-negative vector with unit one-norm (i.e. its components sum to 1); then

    x^T P^{(1)} = c\, x(1:K)^T \bar{P}^{(1)} + (1 - c + c\, x(K+1))\, \tilde{u}^T                            (10)

Because \bar{P}^{(1)} is extremely sparse, this multiplication requires only O(K) operations. On the other hand, if one had used (6) directly, O(K^2) operations would be needed. Based on this representation we can formalize an algorithm which combines Steps 1 and 2 above:

ALGORITHM 2 (STAGE 1).

    form \bar{P}^{(1)} and \tilde{u} via (8) and (9)
    select y \in R^{K+1}, y \ge 0, \|y\|_1 = 1
    do
        x = y
        y^T = c x(1:K)^T \bar{P}^{(1)} + (1 - c + c x(K+1)) \tilde{u}^T
        \delta = \|y - x\|_1
    until \delta < \epsilon

Algorithm 2 converges to the limiting distribution of P^{(1)}, and yields \pi_k, k = 1, 2, ..., K, together with \sum_{m=K+1}^{N} \pi_m. We now have K components of the PageRank vector. We also use these results to compute the aggregation weights for Steps 4 & 5. We designate these weights as a vector \eta \in R^{K \times 1}. For k = 1, 2, ..., K,

    \eta_k = \frac{\pi_k}{\sum_{m=1}^{K} \pi_m}                          (11)
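Again for concreteness, a minimal Python sketch of Algorithm 2 follows, under the same SciPy assumptions as before; `G_nd` denotes the K-by-N link matrix restricted to the non-dangling rows (with the non-dangling columns first), and all names are ours.

    import numpy as np
    import scipy.sparse as sp

    def stage1(G_nd, u, c=0.85, eps=1e-8):
        # Power iteration on the lumped chain via the sparse form (7)-(10).
        K = G_nd.shape[0]
        out = np.asarray(G_nd.sum(axis=1)).ravel()           # outlink counts (all > 0 here)
        B = sp.diags(1.0 / out) @ G_nd[:, :K]                # links into S_ND, row-normalized
        last = 1.0 - np.asarray(B.sum(axis=1)).ravel()       # mass flowing to the dangling block
        P1_bar = sp.hstack([B, sp.csr_matrix(last).T]).tocsr()   # K x (K+1), eq. (8)
        alpha = u[:K].sum()
        u_tilde = np.append(u[:K], 1.0 - alpha)              # eq. (9)
        y = np.full(K + 1, 1.0 / (K + 1))
        while True:
            x = y
            y = c * (P1_bar.T @ x[:K]) + (1.0 - c + c * x[K]) * u_tilde   # eq. (10)
            if np.abs(y - x).sum() < eps:
                return y    # pi_k for k in S_ND, plus the total dangling mass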
4.2 Formalizing Steps 4 & 5

With the aggregation weights of (11), we can compute the transition probability matrix P^{(2)} by aggregating the non-dangling nodes. According to [14], we have

    P^{(2)} = \begin{pmatrix} \eta^T P_{11} e_K & \eta^T P_{12} \\ e_{N-K} u_K^T e_K & e_{N-K} u_{N-K}^T \end{pmatrix}
            = \begin{pmatrix} \eta^T P_{11} e_K & \eta^T P_{12} \\ \alpha e_{N-K} & e_{N-K} u_{N-K}^T \end{pmatrix}             (12)

This is an (N-K+1)-by-(N-K+1) matrix; recall N - K = card(S_D). Each row of the matrix sums to 1. In addition, as the Perron-Frobenius Theorem guarantees the aggregation weights (11) to be positive, P^{(2)} is also positive. The Markov chain associated with P^{(2)} is thus irreducible and aperiodic, and it has a unique limiting distribution.

A remarkable property of P^{(2)} is that all the rows starting from the second are identical; the only distinct rows are the first and the second. In other words, P^{(2)} is a rank-two matrix. This property allows us to compute the limiting distribution of P^{(2)} with very little work and storage. We will return to this shortly. Meanwhile, we can also derive an alternative form of P^{(2)} which is computationally efficient.

We again split P^{(2)} into sparse and dense parts. Since P^{(2)} is rank-two, we need only look at the first two rows:

    P^{(2)}(1:2, :) = c \begin{pmatrix} \beta & w^T \\ 0 & 0^T \end{pmatrix} + \begin{pmatrix} 1-c \\ 1 \end{pmatrix} [\alpha,\; u_{N-K}^T]

where

    w^T = \sum_{i=1}^{K} \eta_i \frac{G(i, K+1:N)}{\sum_{l=1}^{N} G(i, l)}                                   (13)

    \beta = 1 - w^T e_{N-K}                                              (14)

Notice that w is the weighted sum of K extremely sparse vectors and can be formed very cheaply; the work for computing \beta is even less. Multiplication with a vector can be efficiently implemented as follows. For x \in R^{(N-K+1) \times 1} an arbitrary componentwise non-negative vector with unit one-norm, we have

    x^T P^{(2)} = c\, x_1 [\beta,\; w^T] + (1 - c\, x_1) [\alpha,\; u_{N-K}^T]                               (15)

In other words, multiplication with a vector can be implemented as just the sum of two vectors, which is extremely efficient.
Based on this representation we can formalize an algorithm which combines Steps 4 and 5 above:

ALGORITHM 3 (STAGE 2). Suppose we have computed the aggregation weights \eta according to (11).

    form w and \beta via (13) and (14)
    select x^{(0)} \in R^{N-K+1}, x^{(0)} \ge 0, \|x^{(0)}\|_1 = 1
    for i = 1 : 3
        x^{(i)T} = c x^{(i-1)}_1 [\beta, w^T] + (1 - c x^{(i-1)}_1) [\alpha, u_{N-K}^T]
    end
    if \|x^{(3)} - x^{(2)}\|_1 < \epsilon
        z = x^{(3)}
    else
        % Perform Aitken Extrapolation
        for i = 1 : N-K+1
            v_i = (x^{(2)}_i - x^{(1)}_i)^2 / (x^{(3)}_i - 2 x^{(2)}_i + x^{(1)}_i)
        end
        z = x^{(1)} - v
    end

Algorithm 3 is characteristically different from Algorithm 1 and Algorithm 2. The latter two algorithms amount to a numerical implementation of \lim_{k \to \infty} \pi^{(0)T} P^k and converge iteratively; in general, convergence to a fixed tolerance will occur only after many iterations, and the number of iterations needed is never known ahead of time. On the other hand, Algorithm 3 requires only three iterations of the vector-matrix multiplication. After three iterations, either convergence has already occurred or, if not, the Aitken Extrapolation [12] is performed to extract the limiting distribution. In either case the limiting distribution is available after just three iterations, guaranteed. What makes this fast convergence possible is related to the property of P^{(2)} being a positive, rank-two matrix with each row summing to one. In the next section we prove the correctness of Algorithm 3.

Meanwhile, Algorithm 3 involves relatively little computational work. Each iteration of the main loop is just a sum of two vectors, and only three such iterations are needed; the additional work of the Aitken Extrapolation is also very mild. In fact, the amount of work for the whole of Algorithm 3 is far less than one iteration of Algorithm 2, which involves multiplication with a very large matrix. Considering that Algorithm 2 can take 100 or more iterations to converge, Algorithm 3 is relatively cost-free in comparison. Furthermore, the storage requirement is extremely mild. There is no explicit enumeration of a transition probability matrix: because P^{(2)} is rank-two, only two vectors are stored. As a consequence, the work for Algorithm 3 is computationally negligible, and the overall efficiency of the two-stage algorithm rests entirely on Algorithm 2. We will compare the performance of Algorithm 2 to Algorithm 1 in the next section, and show that Algorithm 2 requires much less work than Algorithm 1.
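In Python, Algorithm 3 might look like the following sketch (our own, with our variable names). A denominator in the Aitken step can vanish only where the numerator also vanishes, i.e. where that component has already converged, so we guard the division.

    import numpy as np

    def stage2(w, beta, alpha, u_d, c=0.85, eps=1e-8):
        # Three multiplications with the rank-two matrix P^(2) via (15),
        # followed by Aitken extrapolation if not yet converged. u_d is
        # the personalization vector restricted to the dangling nodes.
        row1 = np.append(beta, w)                    # first-row part  [beta, w^T]
        row2 = np.append(alpha, u_d)                 # dangling row    [alpha, u_{N-K}^T]
        M = len(row2)
        x = [np.full(M, 1.0 / M)]
        for _ in range(3):                           # three iterations always suffice
            x.append(c * x[-1][0] * row1 + (1.0 - c * x[-1][0]) * row2)   # eq. (15)
        x1, x2, x3 = x[1], x[2], x[3]
        if np.abs(x3 - x2).sum() < eps:
            return x3                                # converged exactly
        denom = x3 - 2.0 * x2 + x1                   # Aitken denominator, componentwise
        v = np.where(denom != 0.0, (x2 - x1) ** 2 / np.where(denom != 0.0, denom, 1.0), 0.0)
        return x1 - v                                # Aitken-extrapolated limit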
5. CONVERGENCE ANALYSIS

We address two issues in this section. First, we compare the performance of Algorithm 2 to Algorithm 1. Second, we validate the correctness of Algorithm 3 for computing the limiting distribution of P^{(2)}.

5.1 Convergence of Algorithm 2

We show that Algorithm 2 can always be made to converge in as many or fewer iterations than Algorithm 1. We begin with a lemma.

LEMMA 1. Let x^{(0)} \in R^{N \times 1} be given, and define

    L = \begin{pmatrix} I & 0 \\ 0 & e_{N-K} \end{pmatrix}

where I is the K-by-K identity matrix. Set y^{(0)T} = x^{(0)T} L, and consider the two sequences of iterates, for l = 0, 1, 2, ...,

    x^{(l+1)T} = x^{(l)T} P,    y^{(l+1)T} = y^{(l)T} P^{(1)}.

Then for l = 0, 1, 2, ...,

    y^{(l)T} = x^{(l)T} L.

PROOF. Suppose the claim is true for l_0. Then, using (5) and (6),

    y^{(l_0+1)T} = y^{(l_0)T} P^{(1)} = x^{(l_0)T} L P^{(1)}
                 = x^{(l_0)T} \begin{pmatrix} P_{11} & P_{12} e_{N-K} \\ e_{N-K} u_K^T & e_{N-K} u_{N-K}^T e_{N-K} \end{pmatrix}
                 = x^{(l_0)T} P L = x^{(l_0+1)T} L.

Thus the claim must hold for l_0 + 1. Induction completes the proof.

PROPOSITION 2. Let x^{(l)} and y^{(l)} be defined as in the lemma. Then for l = 0, 1, 2, ...,

    \|y^{(l+1)} - y^{(l)}\|_1 \le \|x^{(l+1)} - x^{(l)}\|_1.

PROOF. Following the lemma, the first K components of y^{(l+1)} - y^{(l)} and x^{(l+1)} - x^{(l)} coincide, and y^{(l)}_{K+1} = \sum_{i=K+1}^{N} x^{(l)}_i. Hence

    \|x^{(l+1)} - x^{(l)}\|_1 - \|y^{(l+1)} - y^{(l)}\|_1
        = \sum_{i=K+1}^{N} |x^{(l+1)}_i - x^{(l)}_i| - |y^{(l+1)}_{K+1} - y^{(l)}_{K+1}|
        = \sum_{i=K+1}^{N} |x^{(l+1)}_i - x^{(l)}_i| - \left| \sum_{i=K+1}^{N} (x^{(l+1)}_i - x^{(l)}_i) \right|
        \ge 0

by the triangle inequality.

This shows that if Algorithm 1 is applied to P with some starting iterate, a related starting iterate can always be constructed for P^{(1)} so that Algorithm 2 converges in as many or fewer iterations, with respect to the same tolerance. In addition, we note that Algorithm 2 requires much less computational work per iteration. The reasons are:

- Algorithm 2 works with (K+1)-vectors throughout; Algorithm 1 works with N-vectors.

- Algorithm 2 involves multiplying by a K-by-(K+1) sparse matrix; Algorithm 1 involves multiplying by an N-by-N sparse matrix. With sparse matrices, the number of non-zeros is a much better gauge of performance than the size of the matrix, and it is easily seen that the number of non-zeros in \bar{P}^{(1)} is only a fraction of \bar{P}'s.
- Algorithm 2 gets rid of a norm-taking step during each iteration of the loop by requiring the initial vector to have unit norm. In comparison, Algorithm 1 requires taking one additional norm, adding O(N) operations to each iteration.

This suggests each iteration of Algorithm 2 is O(K); on the other hand, each iteration of Algorithm 1 is O(N). While the actual reduction in work depends on additional factors, such as the distribution of the non-zero elements, the difference is rather dramatic, as K is typically only a fraction of N. As it turns out, the amount of time the entire two-stage algorithm takes is roughly K/N of what Algorithm 1 takes. Furthermore, the two-stage algorithm never explicitly enumerates the entire transition probability matrix; only a part of it is used at any given time. Consequently, on systems with insufficient memory to store the entire transition probability matrix (almost always the case: it is reported in [12] that a modest dataset with 290 million pages requires as much as 6GB, whereas the amount of addressable memory on a 32-bit machine is 4GB, so if Algorithm 1 is used, disk use cannot be avoided), the performance advantage of the two-stage algorithm is even more pronounced, as the frequency of disk access is reduced.

5.2 Convergence of Algorithm 3

We now show that Algorithm 3 indeed computes the unique limiting distribution of P^{(2)}. We remark that it can easily be shown that the limiting distribution of P^{(2)} is in fact the left eigenvector associated with the eigenvalue 1; thus it suffices to show that Algorithm 3 computes this eigenvector.

LEMMA 2. To simplify notation, denote M = N - K + 1 (i.e. P^{(2)} is M-by-M). Then 1 is the dominant eigenvalue of P^{(2)}, with an algebraic and geometric multiplicity of one, and 0 is also an eigenvalue, with a geometric multiplicity of M - 2. If P^{(2)} does not have another distinct eigenvalue, then

    x^T P^{(2)} P^{(2)} = x^T P^{(2)} P^{(2)} P^{(2)}

for any x \in R^{M \times 1}. In other words, the sequence has converged exactly to either the left eigenvector associated with the eigenvalue 1 or the null vector 0.

PROOF. We have stated earlier that P^{(2)} is positive, rank-two, and has rows that sum to 1. The Perron-Frobenius Theorem (cf. [2], [8]) establishes the first claim. Next, suppose P^{(2)} does not have a third distinct eigenvalue. The algebraic multiplicity of 0 is then necessarily M - 1. The Jordan canonical form of P^{(2)} establishes the second claim (see [5], [8]).

PROPOSITION 3. Let x \in R^{M \times 1} be given. If

    x^T P^{(2)} P^{(2)} \ne x^T P^{(2)} P^{(2)} P^{(2)},

then there exists another eigenvalue \lambda of P^{(2)} such that 0 < |\lambda| < 1. In addition, for l = 1, 2, ...,

    x^T (P^{(2)})^l = c_1 v_1^T + c_2 \lambda^l v_2^T

where v_1^T P^{(2)} = v_1^T, v_2^T P^{(2)} = \lambda v_2^T, and c_1 and c_2 are constants.

[Figure 1: Log-error (one-norm) at each iteration, for the standard algorithm and Stage 1.]

PROOF. The first part follows directly from the lemma. An examination of the geometric multiplicities of P^{(2)} reveals the existence of a full set of eigenvectors that span R^{M \times 1}. Writing x as a combination of these eigenvectors establishes the second part.

Proposition 3 can be rephrased as follows. Consider an arbitrary vector repeatedly multiplied by P^{(2)}. Either exact convergence will have occurred after three iterations, or it won't. If converged, then we are done. If not, we are still assured that all subsequent iterates are contained in the span of the first and second eigenvectors. This knowledge enables us to extract the first eigenvector, i.e. the
limiting distribution, by subtracting away the component along the second eigenvector. (This is precisely what the Aitken Extrapolation does; for details on the Aitken Extrapolation, see [12].) This validates the correctness of Algorithm 3.

6. A NUMERICAL EXPERIMENT

The analysis in the preceding section suggests that the two-stage algorithm ought to take only a fraction of the time to converge compared to the standard algorithm. We now show that this is indeed the case with an actual numerical experiment. Our results are based on a subset of N = 451,237 webpages sampled from a 2001 crawl by the Stanford WebBase project. The number of non-dangling nodes in this sample is K = 137,212; K/N is roughly 30%. The experiment was conducted on a 2.4GHz dual-Xeon workstation with 4GB of RAM and a 70GB x 4 RAID-0 hard disk system. The amount of memory is ample for the size of the data, so there are no complications of disk access in the results. The following table summarizes the dimensions and the number of non-zero elements of the matrices involved in the computation:

              \bar{P}       \bar{P}^{(1)}    u          \tilde{u}
    dims      N x N         K x (K+1)        N          K+1
    nnz       1,082,...     ...              451,237    137,213

Two values of c were tried: c = 0.85 as a fast-converging example, and c = 0.95 as a slow-converging example. The tolerance \epsilon is set to ... The following table shows that with either value of c, the total time for the two-stage algorithm is just 20% of the standard PageRank algorithm's:
[Table: time in seconds and number of iterations for Steps 1-5, the two-stage total, and the standard algorithm, for c = 0.85 and c = 0.95.]

[Figure 2: A blow-up of the first 15 iterations of Figure 1.]

In either case, Stage 1 (Steps 1 & 2) constitutes the bulk of the work, making up 95% and 98% of the total time respectively. Furthermore, the error of Stage 1 at each iteration is consistently below that of the standard algorithm's, but the gap eventually diminishes; in both cases the two algorithms terminate after the same number of iterations (see Figure 1 and Figure 2). This coincides with the prediction of Proposition 2. The amount of time needed for Stage 2 (Steps 4 & 5) is minuscule in comparison. When the distributions from the two stages are concatenated, we obtain the entire PageRank vector; the one-norm difference between this vector and the one produced by the standard algorithm is ... when c = 0.85 and ... when c = 0.95. In sum, we have presented a way of effectively managing the dangling nodes: regardless of the number of dangling nodes present, which is usually a very large number, the total computation time is only proportional to the number of non-dangling nodes.

7. TREATMENT OF DANGLING NODES

In this section we address the issue of dangling nodes from a modeling perspective. There are two sources of dangling nodes. A webpage is dangling if it genuinely has no outlinks. On the other hand, we also consider a webpage to be dangling if we simply have no information regarding its outlinks. The latter can arise when the webpage has been referenced (i.e. linked to) by another webpage in a crawl but is itself not included in the crawl; this is a very typical scenario, as the vastness and rapidly-changing nature of the Web render a complete crawl impossible [16].

How best to treat the dangling nodes is very much a philosophical question. Some people choose to leave them out of the computation completely; this amounts to computing the limiting distribution of just the leading K-by-K block P_{11} of (5) and defining it as the PageRank vector. Others have chosen to include the dangling nodes in the computation by inserting the personalization vector u into the rows of P corresponding to the dangling nodes; see (2). In this paper, we have adopted the second view for a number of reasons:

- Because in a typical situation there usually is a very large number of dangling nodes, throwing all of them away is to give up an enormous amount of information. First, we would have no way of ranking any of the dangling pages, which, ironically, constitute most of the webpages. Second, the resulting PageRank vector would not incorporate any information from P_{12}. Keep in mind that P_{12} consists of actual links; it is completely legitimate and, in terms of probability mass and contribution to the final limiting distribution, no less important than P_{11}.

- There are some very important classes of webpages (URLs) that are by nature dangling. These include PDFs, images, movies, etc. It would be a significant loss if one could not search for research papers or movie trailers, for example.

- While inserting the personalization vector into the rows of the dangling nodes may seem arbitrary at first, the practice is not necessarily so inappropriate. What is asserted here is that if there are no outlinks, the websurfer can move to any page according to preference. This is certainly not an inaccurate way to model transitions (in reality, a websurfer can always go to a page by directly entering the URL; an explicit link is not the only way to move to a page), and is in fact quite sensible from a behavioral perspective. The bottom line is that the Markov chain model of PageRank is very much a hybrid model of structure (i.e. links) and behavior (i.e. preferences/personalization); its success may well lie in its ability to recognize the importance of both aspects.
While on this issue of whether to throw away the dangling nodes, we shall also mention that a very common suggestion is to not throw away the dangling nodes in the overall computation, but to leave them out until the very last stages [16]. In other words, one would first compute the limiting distribution of P_{11}, pad it with N - K more elements, and use that as the initial vector for the entire matrix P. It is believed that this procedure accelerates convergence. While it may well accelerate convergence in particular cases, it isn't true in general; see the Appendix for a simple counter-example constructed by personalization. In general, the limiting distribution of P_{11} doesn't coincide with, or even approximate, the first K components of P's limiting distribution. What is true, from the theory of stochastic complementation [14], is that the first K components of P's limiting distribution, when normalized, coincide with the limiting distribution of the stochastic complement of P_{11}:

    S_{11} = P_{11} + P_{12} (I - P_{22})^{-1} P_{21}

If P were a nearly completely decomposable (NCD) matrix, the off-diagonal blocks would contain negligible probability mass, and S_{11} and P_{11} would be roughly the same; in that case the limiting distribution of P_{11} would approximate the first K components of P's limiting distribution [3]. However, P is not NCD under our present partition: the dangling nodes and the non-dangling nodes do not form two NCD subsets, as a significant probability mass can be found in each block of (5). The procedure described in this paper renders this common practice irrelevant.
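To make the stochastic-complement point concrete, here is a small self-contained Python check (our own toy construction, not the paper's example): the normalized leading components of P's stationary vector match the stationary vector of S_{11}, while the stationary vector of row-normalized P_{11} generally does not.

    import numpy as np

    def stationary(A):
        # Left eigenvector of A for the eigenvalue 1, scaled to sum to 1.
        vals, vecs = np.linalg.eig(A.T)
        v = np.real(vecs[:, np.argmax(np.real(vals))])
        return v / v.sum()

    rng = np.random.default_rng(1)
    P = rng.random((5, 5))
    P /= P.sum(axis=1, keepdims=True)                # a random positive stochastic matrix
    K = 2
    S11 = P[:K, :K] + P[:K, K:] @ np.linalg.inv(np.eye(5 - K) - P[K:, K:]) @ P[K:, :K]
    pi = stationary(P)
    print(stationary(S11))                           # equals pi[:K] / pi[:K].sum()
    print(pi[:K] / pi[:K].sum())
    print(stationary(P[:K, :K] / P[:K, :K].sum(axis=1, keepdims=True)))  # generally different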
What we have shown is as follows. Contrary to common practice, we cannot hope to use the limiting distribution of P_{11} to approximate the first K components of P's limiting distribution. On the other hand, we can compute the latter exactly by computing the limiting distribution of the lumped chain. The transition probability matrix (6) of the lumped chain is of course (K+1)-by-(K+1), effectively the same size as P_{11}. And once that is done, with very little additional work we obtain the entire PageRank vector.

8. CONCLUDING REMARKS

In this paper we present a fast two-stage algorithm for computing the PageRank vector. We exploit the fact that the Markov chain associated with PageRank is lumpable. In the first stage, we compute the limiting distribution of a Markov chain where the dangling nodes are lumped into one; in the second stage, we compute the limiting distribution of a chain where the non-dangling nodes are combined. The two limiting distributions are concatenated to form the PageRank vector. Most of the work lies in computing the limiting distribution of the lumped chain, and the total work is only proportional to the number of non-dangling nodes. A numerical experiment shows that in practice the two-stage algorithm finishes in only a fraction of the time required by the standard PageRank algorithm, in this case as little as 20%. Furthermore, only a part of the transition probability matrix is enumerated at any given time, and the memory requirement is accordingly mild. On machines where the memory is limited relative to the size of the problem (which is almost always the case in reality), the performance gap between the two-stage algorithm and the standard algorithm is likely to be even wider. Lastly, our algorithm represents an alternative to the common practice of not including the dangling nodes until the last stages of the computation. That practice lacks theoretical support and cannot be expected to accelerate convergence in general. On the other hand, the algorithm described here is provable, generally applicable, and achieves the desired speed gains.

9. ACKNOWLEDGMENTS

The authors would like to thank Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan, and Gary Wesley of the Stanford WebBase project for assisting with access to their data, and Sepandar Kamvar, Wang Lam, Amy Langville, and Sebastiano Vigna for their helpful comments.

10. ADDITIONAL AUTHORS

Additional author: Stephanie Leung (Computer Science Department, Stanford University), wleung@stanford.edu.

11. REFERENCES

[1] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin. PageRank computation and the structure of the Web: experiments and algorithms. In Proceedings of the Eleventh International World Wide Web Conference, Poster Track, 2002.

[2] A. Berman and R. J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. SIAM Press, Philadelphia, 1994.

[3] W. L. Cao and W. J. Stewart. Iterative aggregation/disaggregation techniques for nearly uncoupled Markov chains. Journal of the Association for Computing Machinery, 32, 1985.

[4] T. Dayar and W. J. Stewart. Quasi-lumpability, lower bounding coupling matrices, and nearly completely decomposable Markov chains. SIAM Journal on Matrix Analysis and Applications, 18(2), 1997.

[5] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Third Edition, 1996.

[6] T. H. Haveliwala and S. D. Kamvar. The second eigenvalue of the Google matrix. Technical report, Stanford University, 2003.

[7] J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. WebBase: a repository of web pages. In Proceedings of the Ninth International World Wide Web Conference, 2000.

[8] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge, 1985.
[9] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the Twelfth International World Wide Web Conference, 2003.

[10] S. D. Kamvar, T. H. Haveliwala, and G. H. Golub. Adaptive methods for the computation of PageRank. Technical report, Stanford University, 2003.

[11] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the block structure of the Web for computing PageRank. Technical report, Stanford University, 2003.

[12] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrapolation methods for accelerating PageRank computations. In Proceedings of the Twelfth International World Wide Web Conference, 2003.

[13] J. G. Kemeny and J. L. Snell. Finite Markov Chains. D. Van Nostrand, New York, 1960.

[14] C. D. Meyer. Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Review, 31(2), 1989.

[15] C. Moler. The world's largest matrix computation. MATLAB News & Notes, pages 12-13, October 2002.

[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.

[17] H. A. Simon and A. Ando. Aggregation of variables in dynamic systems. Econometrica, 29, 1961.

APPENDIX

Here is a small counter-example. It demonstrates that the common practice of including the dangling nodes only during the last stages of computation doesn't always accelerate convergence. Take the link matrix

    G = ...

Here, K = 2 and N = 4. Take c = 0.85 and u^T = (1/(3a)) [a, a, a, ...], where a = 43. Thus

    P = ...

It can be verified that the limiting distribution of the leading 2-by-2 submatrix of P is ..., while for the entire matrix it is .... The bottom line is that the limiting distribution of the 2-by-2 submatrix yields a worse starting iterate than the uniform vector, and the desired acceleration is not observed. For more details on why, see [14].
Chapter 14 SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Today we continue the topic of low-dimensional approximation to datasets and matrices. Last time we saw the singular
More informationAN APPLICATION OF LINEAR ALGEBRA TO NETWORKS
AN APPLICATION OF LINEAR ALGEBRA TO NETWORKS K. N. RAGHAVAN 1. Statement of the problem Imagine that between two nodes there is a network of electrical connections, as for example in the following picture
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data Instructor: Yizhou Sun yzsun@ccs.neu.edu March 16, 2016 Methods to Learn Classification Clustering Frequent Pattern Mining Matrix Data Decision
More informationMultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors
MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors Michael K. Ng Centre for Mathematical Imaging and Vision and Department of Mathematics
More informationSlides based on those in:
Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering
More informationData Mining and Matrices
Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013 Link analysis The web as a directed graph Set of web pages with associated textual content Hyperlinks between webpages
More informationLinear Programming Redux
Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains
More informationMPageRank: The Stability of Web Graph
Vietnam Journal of Mathematics 37:4(2009) 475-489 VAST 2009 MPageRank: The Stability of Web Graph Le Trung Kien 1, Le Trung Hieu 2, Tran Loc Hung 1, and Le Anh Vu 3 1 Department of Mathematics, College
More informationLecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis
CS-621 Theory Gems October 18, 2012 Lecture 10 Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis 1 Introduction In this lecture, we will see how one can use random walks to
More informationQuick Introduction to Nonnegative Matrix Factorization
Quick Introduction to Nonnegative Matrix Factorization Norm Matloff University of California at Davis 1 The Goal Given an u v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices
More information6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities
6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities 1 Outline Outline Dynamical systems. Linear and Non-linear. Convergence. Linear algebra and Lyapunov functions. Markov
More informationRobust PageRank: Stationary Distribution on a Growing Network Structure
oname manuscript o. will be inserted by the editor Robust PageRank: Stationary Distribution on a Growing etwork Structure Anna Timonina-Farkas Received: date / Accepted: date Abstract PageRank PR is a
More informationA linear model for a ranking problem
Working Paper Series Department of Economics University of Verona A linear model for a ranking problem Alberto Peretti WP Number: 20 December 2017 ISSN: 2036-2919 (paper), 2036-4679 (online) A linear model
More informationVolume in n Dimensions
Volume in n Dimensions MA 305 Kurt Bryan Introduction You ve seen that if we have two vectors v and w in two dimensions then the area spanned by these vectors can be computed as v w = v 1 w 2 v 2 w 1 (where
More informationeigenvalues, markov matrices, and the power method
eigenvalues, markov matrices, and the power method Slides by Olson. Some taken loosely from Jeff Jauregui, Some from Semeraro L. Olson Department of Computer Science University of Illinois at Urbana-Champaign
More informationLink Analysis. Stony Brook University CSE545, Fall 2016
Link Analysis Stony Brook University CSE545, Fall 2016 The Web, circa 1998 The Web, circa 1998 The Web, circa 1998 Match keywords, language (information retrieval) Explore directory The Web, circa 1998
More informationCS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works
CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The
More informationIncompatibility Paradoxes
Chapter 22 Incompatibility Paradoxes 22.1 Simultaneous Values There is never any difficulty in supposing that a classical mechanical system possesses, at a particular instant of time, precise values of
More informationTopics in linear algebra
Chapter 6 Topics in linear algebra 6.1 Change of basis I want to remind you of one of the basic ideas in linear algebra: change of basis. Let F be a field, V and W be finite dimensional vector spaces over
More informationMarkov Chain Monte Carlo The Metropolis-Hastings Algorithm
Markov Chain Monte Carlo The Metropolis-Hastings Algorithm Anthony Trubiano April 11th, 2018 1 Introduction Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability
More informationMATH36001 Perron Frobenius Theory 2015
MATH361 Perron Frobenius Theory 215 In addition to saying something useful, the Perron Frobenius theory is elegant. It is a testament to the fact that beautiful mathematics eventually tends to be useful,
More informationUtilizing Network Structure to Accelerate Markov Chain Monte Carlo Algorithms
algorithms Article Utilizing Network Structure to Accelerate Markov Chain Monte Carlo Algorithms Ahmad Askarian, Rupei Xu and András Faragó * Department of Computer Science, The University of Texas at
More informationCutting Graphs, Personal PageRank and Spilling Paint
Graphs and Networks Lecture 11 Cutting Graphs, Personal PageRank and Spilling Paint Daniel A. Spielman October 3, 2013 11.1 Disclaimer These notes are not necessarily an accurate representation of what
More informationChapter 2: Matrix Algebra
Chapter 2: Matrix Algebra (Last Updated: October 12, 2016) These notes are derived primarily from Linear Algebra and its applications by David Lay (4ed). Write A = 1. Matrix operations [a 1 a n. Then entry
More informationCANONICAL FORMS FOR LINEAR TRANSFORMATIONS AND MATRICES. D. Katz
CANONICAL FORMS FOR LINEAR TRANSFORMATIONS AND MATRICES D. Katz The purpose of this note is to present the rational canonical form and Jordan canonical form theorems for my M790 class. Throughout, we fix
More informationUsing Markov Chains To Model Human Migration in a Network Equilibrium Framework
Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Jie Pan Department of Mathematics and Computer Science Saint Joseph s University Philadelphia, PA 19131 Anna Nagurney School
More informationa (b + c) = a b + a c
Chapter 1 Vector spaces In the Linear Algebra I module, we encountered two kinds of vector space, namely real and complex. The real numbers and the complex numbers are both examples of an algebraic structure
More informationMath 471 (Numerical methods) Chapter 3 (second half). System of equations
Math 47 (Numerical methods) Chapter 3 (second half). System of equations Overlap 3.5 3.8 of Bradie 3.5 LU factorization w/o pivoting. Motivation: ( ) A I Gaussian Elimination (U L ) where U is upper triangular
More informationMath 291-2: Lecture Notes Northwestern University, Winter 2016
Math 291-2: Lecture Notes Northwestern University, Winter 2016 Written by Santiago Cañez These are lecture notes for Math 291-2, the second quarter of MENU: Intensive Linear Algebra and Multivariable Calculus,
More informationPseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports
Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports Jevin West and Carl T. Bergstrom November 25, 2008 1 Overview There
More information642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004
642:550, Summer 2004, Supplement 6 The Perron-Frobenius Theorem. Summer 2004 Introduction Square matrices whose entries are all nonnegative have special properties. This was mentioned briefly in Section
More information