UpdatingtheStationary VectorofaMarkovChain Amy Langville Carl Meyer Department of Mathematics North Carolina State University Raleigh, NC NSMC 9/4/2003
Outline Updating and Pagerank Aggregation Partitioning Iterative Aggregation Algorithm Experiments
The Updating Problem Given: Original chain s P andπ T and new chain s P Find: new chain s π T
P PageRank Application Google uses hyperlink structure of Web + fudge factor matrix to form irreducible, aperiodic Markov chain π i π T long-run proportion of time a random surfer spends on webpage i gives ranking of relative importance of webpages How Google Usesπ T To rank importance of thousands of pages containing a query phrase and list only the most important of those relevant pages to users.
Need for Updating PageRank vectorπ T Fact: Currently π T for immense Web Markov chain is computed monthly. Fact: Fact: Web changes much more frequently. Computing π T takes days. (hourly on news sites) (power method used) Need: Update π T more frequently with less work.
Use iterative aggregation to find π T faster, with less work, than full recomputation. Computing π T A Big Problem Solve π T = π T P π T (I P) = 0 (stationary distribution vector) (too big for direct solves) Start with π T 0 = e/n and iterate πt j+1 = πt j P (power method) Google s solution to updating problem Full recomputation run power method from scratch Start with π T 0 = e/n and iterate πt j+1 = πt P j Don t use old PageRank vector to find new PageRank faster. Our goal
Idea behind Aggregation Best for NCD systems (Simon and Ando (1960s), Courtois (1970s)) C 1 C 2 C 3 C 1 C 2 P = Π Τ Π τ C 3 C 1 C 2 C 3 A = C 1 C 2 C 3 a i j Τ = Π P ij e ξ Τ = ξ 1 ξ 2 ξ 3 Π Τ ~ ξ 1 Π Τ ξ 2 Π Τ ξ 3 Π Τ Pro Con exploits structure to reduce work produces an approximation, quality is dependent on degree of coupling
Iterative Aggregation Problem: repeated aggregation leads to fixed point. Solution: Do a power step to move off fixed point. Do this iteratively. Approximations improve and approach exact solution. Success with NCD systems, not in general. Input: approximation to Π get censored distributions Π T Π T Π T get coupling constants Output: get approximate global stationary distribution Output: move off fixed point with power step T ξ i T = ξ ξ ξ Π Π T Π T Π T 1 2 3
How to Partition for Updating Problem? Intuition There are some bad states (G) and some good states (G). Give more attention to bad states. Each state in G forms a partitioning level. Lump good states into 1 superstate. Aggregation Matrix A = G 1 G 2. G G 1 G 2... G ( G +1) ( G +1)
Definitions for Good and Bad 1. Good = states most likely to have π i change Bad = states least likely to have π i change 2. Good = states with smallest π i after k transient steps Bad = states nearby, with largest π i after k transient steps 3. Good = smallest π i from old PageRank vector Bad = largest π i from old PageRank vector 4. Good = fast converging states Bad = slow converging states
Determining Fast and Slow Consider power method and its rate of convergence π T k+1 = π T k P = π T k eπ T + λ k 2π T k x 2 y T 2 + λ k 3π T k x 3 y T 3 +... + λ k nπ T k x n y T n Asymptotic rate of convergence is rate at which λ k 2 0 Consider convergence of elements Some states converge to stationary value faster than λ 2 rate, due to LH e vector y T 2. Partitioning Rule Put states with largest y T 2 i values in bad group G, where Practicality y T 2 they receive more individual attention in aggregation method. expensive, but for PageRank problem, Kamvar et al. show states with large π i are slow-converging. inexpensive, use old π T to determine G. (adaptively approximate y T 2 )
Power law for PageRank Scale-free Model of Web network creates power laws (Kamvar, Barabasi, Raghavan)
Experiments Test Networks From Crawl Of Web (Supplied by Ronny Lempel) Censorship 562 nodes 736 links (Sites concerning censorship on the net ) Movies 451 nodes 713 links MathWorks 517 nodes 13,531 links Abortion 1,693 nodes 4,325 links Genetics 2,952 nodes 6,485 links (Sites concerning movies ) (Supplied by Cleve Moler) (Sites concerning abortion ) (Sites concerning genetics )
Parameters Number Of Nodes (States) Added 3 Number Of Nodes (States) Removed 5 Number Of Links Added (Different values have little effect on results) 10 Number Of Links Removed 20 Stopping Criterion 1-norm of residual < 10 10
The Partition Intuition to find slow converging states, but expen- Prefer to use y T 2 sive. + Slow converging components tend to be high PageRank pages The G Set New states go into G States corresponding to large entries in φ T = (φ 1, φ 2,..., φ m ) G States corresponding to small entries G
Censorship Power Method Iterations Time 38 1.40 Iterative Aggregation G Iterations Time 5 38 1.68 10 38 1.66 15 38 1.56 20 20 1.06 25 20 1.05 50 10.69 100 8.55 nodes = 562 links = 736 300 6.65 400 5.70
Censorship Power Method Iterative Aggregation Iterations Time 38 1.40 nodes = 562 links = 736 G Iterations Time 5 38 1.68 10 38 1.66 15 38 1.56 20 20 1.06 25 20 1.05 50 10.69 100 8.55 200 6.53 300 6.65 400 5.70
Movies Power Method Iterations Time 17.40 Iterative Aggregation G Iterations Time 5 12.39 10 12.37 15 11.36 20 11.35 nodes = 451 links = 713 100 9.33 200 8.35 300 7.39 400 6.47
Movies Power Method Iterative Aggregation Iterations Time 17.40 nodes = 451 links = 713 G Iterations Time 5 12.39 10 12.37 15 11.36 20 11.35 25 11.31 50 9.31 100 9.33 200 8.35 300 7.39 400 6.47
MathWorks Power Method Iterations Time 54 1.25 Iterative Aggregation G Iterations Time 5 53 1.18 10 52 1.29 15 52 1.23 20 42 1.05 25 20 1.13 nodes = 517 links = 13, 531 300 11.83 400 10 1.01
MathWorks Power Method Iterative Aggregation Iterations Time 54 1.25 nodes = 517 links = 13, 531 G Iterations Time 5 53 1.18 10 52 1.29 15 52 1.23 20 42 1.05 25 20 1.13 50 18.70 100 16.70 200 13.70 300 11.83 400 10 1.01
Abortion Power Method Iterations Time 106 37.08 Iterative Aggregation G Iterations Time 5 109 38.56 10 105 36.02 15 107 38.05 20 107 38.45 25 97 34.81 50 53 18.80 nodes = 1, 693 links = 4, 325 250 12 5.62 500 6 5.21 750 5 10.22 1000 5 14.61
Abortion Power Method Iterative Aggregation Iterations Time 106 37.08 nodes = 1, 693 links = 4, 325 G Iterations Time 5 109 38.56 10 105 36.02 15 107 38.05 20 107 38.45 25 97 34.81 50 53 18.80 100 13 5.18 250 12 5.62 500 6 5.21 750 5 10.22 1000 5 14.61
Genetics Power Method Iterations Time 92 91.78 Iterative Aggregation G Iterations Time 5 91 88.22 10 92 92.12 20 71 72.53 50 25 25.42 100 19 20.72 250 13 14.97 nodes = 2, 952 links = 6, 485 1000 5 17.76 1500 5 31.84
Genetics Power Method Iterative Aggregation Iterations Time 92 91.78 nodes = 2, 952 links = 6, 485 G Iterations Time 5 91 88.22 10 92 92.12 20 71 72.53 50 25 25.42 100 19 20.72 250 13 14.97 500 7 11.14 1000 5 17.76 1500 5 31.84
Conclusions First updating algorithm to handle both element and state updates. Algorithm is very sensitive to partition. For PageRank problem, partition can be determined cheaply from old PageRanks. For general Markov updating, use y T 2 to determine partition. When too expensive, approximate adaptively with Aitken s δ 2 or difference of successive iterates. Improvements Practical Optimize G-set Accelerate Smoothing Theoretical Relationship between partitioning by y T 2 and λ 2(S 2 ) not well-understood. Predict algorithm and partitioning by old π T will work very well on other scale-free networks.