UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

Size: px

Start display at page:

Download "UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer"

Blaise Haynes
6 years ago
Views:

1 UpdatingtheStationary VectorofaMarkovChain Amy Langville Carl Meyer Department of Mathematics North Carolina State University Raleigh, NC NSMC 9/4/2003

2 Outline Updating and Pagerank Aggregation Partitioning Iterative Aggregation Algorithm Experiments

3 The Updating Problem Given: Original chain s P andπ T and new chain s P Find: new chain s π T

4 P PageRank Application Google uses hyperlink structure of Web + fudge factor matrix to form irreducible, aperiodic Markov chain π i π T long-run proportion of time a random surfer spends on webpage i gives ranking of relative importance of webpages How Google Usesπ T To rank importance of thousands of pages containing a query phrase and list only the most important of those relevant pages to users.

5 Need for Updating PageRank vectorπ T Fact: Currently π T for immense Web Markov chain is computed monthly. Fact: Fact: Web changes much more frequently. Computing π T takes days. (hourly on news sites) (power method used) Need: Update π T more frequently with less work.

6 Use iterative aggregation to find π T faster, with less work, than full recomputation. Computing π T A Big Problem Solve π T = π T P π T (I P) = 0 (stationary distribution vector) (too big for direct solves) Start with π T 0 = e/n and iterate πt j+1 = πt j P (power method) Google s solution to updating problem Full recomputation run power method from scratch Start with π T 0 = e/n and iterate πt j+1 = πt P j Don t use old PageRank vector to find new PageRank faster. Our goal

7 Idea behind Aggregation Best for NCD systems (Simon and Ando (1960s), Courtois (1970s)) C 1 C 2 C 3 C 1 C 2 P = Π Τ Π τ C 3 C 1 C 2 C 3 A = C 1 C 2 C 3 a i j Τ = Π P ij e ξ Τ = ξ 1 ξ 2 ξ 3 Π Τ ~ ξ 1 Π Τ ξ 2 Π Τ ξ 3 Π Τ Pro Con exploits structure to reduce work produces an approximation, quality is dependent on degree of coupling

8 Iterative Aggregation Problem: repeated aggregation leads to fixed point. Solution: Do a power step to move off fixed point. Do this iteratively. Approximations improve and approach exact solution. Success with NCD systems, not in general. Input: approximation to Π get censored distributions Π T Π T Π T get coupling constants Output: get approximate global stationary distribution Output: move off fixed point with power step T ξ i T = ξ ξ ξ Π Π T Π T Π T 1 2 3

9 How to Partition for Updating Problem? Intuition There are some bad states (G) and some good states (G). Give more attention to bad states. Each state in G forms a partitioning level. Lump good states into 1 superstate. Aggregation Matrix A = G 1 G 2. G G 1 G 2... G ( G +1) ( G +1)

10 Definitions for Good and Bad 1. Good = states most likely to have π i change Bad = states least likely to have π i change 2. Good = states with smallest π i after k transient steps Bad = states nearby, with largest π i after k transient steps 3. Good = smallest π i from old PageRank vector Bad = largest π i from old PageRank vector 4. Good = fast converging states Bad = slow converging states

11 Determining Fast and Slow Consider power method and its rate of convergence π T k+1 = π T k P = π T k eπ T + λ k 2π T k x 2 y T 2 + λ k 3π T k x 3 y T λ k nπ T k x n y T n Asymptotic rate of convergence is rate at which λ k 2 0 Consider convergence of elements Some states converge to stationary value faster than λ 2 rate, due to LH e vector y T 2. Partitioning Rule Put states with largest y T 2 i values in bad group G, where Practicality y T 2 they receive more individual attention in aggregation method. expensive, but for PageRank problem, Kamvar et al. show states with large π i are slow-converging. inexpensive, use old π T to determine G. (adaptively approximate y T 2 )

12 Power law for PageRank Scale-free Model of Web network creates power laws (Kamvar, Barabasi, Raghavan)

13 Experiments Test Networks From Crawl Of Web (Supplied by Ronny Lempel) Censorship 562 nodes 736 links (Sites concerning censorship on the net ) Movies 451 nodes 713 links MathWorks 517 nodes 13,531 links Abortion 1,693 nodes 4,325 links Genetics 2,952 nodes 6,485 links (Sites concerning movies ) (Supplied by Cleve Moler) (Sites concerning abortion ) (Sites concerning genetics )

14 Parameters Number Of Nodes (States) Added 3 Number Of Nodes (States) Removed 5 Number Of Links Added (Different values have little effect on results) 10 Number Of Links Removed 20 Stopping Criterion 1-norm of residual < 10 10

15 The Partition Intuition to find slow converging states, but expen- Prefer to use y T 2 sive. + Slow converging components tend to be high PageRank pages The G Set New states go into G States corresponding to large entries in φ T = (φ 1, φ 2,..., φ m ) G States corresponding to small entries G

16 Censorship Power Method Iterations Time Iterative Aggregation G Iterations Time nodes = 562 links =

17 Censorship Power Method Iterative Aggregation Iterations Time nodes = 562 links = 736 G Iterations Time

18 Movies Power Method Iterations Time Iterative Aggregation G Iterations Time nodes = 451 links =

19 Movies Power Method Iterative Aggregation Iterations Time nodes = 451 links = 713 G Iterations Time

20 MathWorks Power Method Iterations Time Iterative Aggregation G Iterations Time nodes = 517 links = 13,

21 MathWorks Power Method Iterative Aggregation Iterations Time nodes = 517 links = 13, 531 G Iterations Time

22 Abortion Power Method Iterations Time Iterative Aggregation G Iterations Time nodes = 1, 693 links = 4,

23 Abortion Power Method Iterative Aggregation Iterations Time nodes = 1, 693 links = 4, 325 G Iterations Time

24 Genetics Power Method Iterations Time Iterative Aggregation G Iterations Time nodes = 2, 952 links = 6,

25 Genetics Power Method Iterative Aggregation Iterations Time nodes = 2, 952 links = 6, 485 G Iterations Time

26 Conclusions First updating algorithm to handle both element and state updates. Algorithm is very sensitive to partition. For PageRank problem, partition can be determined cheaply from old PageRanks. For general Markov updating, use y T 2 to determine partition. When too expensive, approximate adaptively with Aitken s δ 2 or difference of successive iterates. Improvements Practical Optimize G-set Accelerate Smoothing Theoretical Relationship between partitioning by y T 2 and λ 2(S 2 ) not well-understood. Predict algorithm and partitioning by old π T will work very well on other scale-free networks.

Updating PageRank. Amy Langville Carl Meyer

Updating PageRank. Amy Langville Carl Meyer Updating PageRank Amy Langville Carl Meyer Department of Mathematics North Carolina State University Raleigh, NC SCCM 11/17/2003 Indexing Google Must index key terms on each page Robots crawl the web software