Monte Carlo methods in PageRank computation: When one iteration is sufficient

Nelly Litvak (University of Twente, The Netherlands)
Konstantin Avrachenkov (INRIA Sophia Antipolis, France)
Dmitri Nemirovsky and Natalia Osipova (St. Petersburg State University, Russia)

Financial support: Netherlands Organization for Scientific Research (NWO) under Meervoud grant and the grant VGP; French Organization EGIDE under Van Gogh grant no.05433ud

MCM2005, Tallahassee

Outline

- Markov model for the PageRank
- Monte Carlo algorithms
- Convergence Analysis
- Experiments

Search engine context

A user types a query to find relevant pages.

Problem: normally, there are hundreds of relevant pages. In which order should we list the pages for the user?

The solution has been found by Google.

Google ranking: list the most important and popular pages first!

S. Brin and L. Page (1998). The anatomy of a large-scale hypertextual Web search engine. In WWW7, Australia.

PageRank: Markov model

The PageRank $\pi_i$ of page $i$ is the long-run fraction of time that a random surfer spends on page $i$.

Easily bored surfer model: with probability $c$ (= 0.85), the surfer follows a randomly chosen outgoing link. Otherwise, he/she jumps to a random page.

[Diagram: a page with $d$ outgoing links; each link is followed with probability $c/d$, and a random jump occurs with probability $1-c$.]

$$\pi_i = \sum_{j \to i} \frac{c}{d_j}\,\pi_j + \frac{1-c}{n}$$

Formal model description

$n$ is the total number of pages.

$P = (p_{ij})$ is the hyperlink matrix:
$p_{ij} = 1/d_i$ if $j$ is one of the $d_i$ outgoing links of $i$,
$p_{ij} = 1/n$ if $d_i = 0$,
$p_{ij} = 0$ otherwise.

Modified transition matrix: $\tilde{P} = cP + (1-c)(1/n)E$, where $E$ is an $n \times n$ matrix consisting of ones and $c = 0.85$.

PageRank vector: $\pi \tilde{P} = \pi$, $\pi \mathbf{1} = 1$.
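The construction of the hyperlink matrix and the modified transition matrix can be sketched for a tiny graph. This is a minimal illustration, not code from the talk: the four-page graph `LINKS` and the function names `hyperlink_matrix` and `modified_matrix` are assumptions made for the example.

```python
def hyperlink_matrix(out_links, n):
    """Return P as a list of rows: 1/d_i on the links of i, 1/n on dangling rows."""
    P = []
    for i in range(n):
        targets = out_links.get(i, [])
        if targets:
            row = [0.0] * n
            for j in targets:              # assumes the targets of a page are distinct
                row[j] = 1.0 / len(targets)
        else:
            row = [1.0 / n] * n            # dangling page: uniform row
        P.append(row)
    return P


def modified_matrix(P, c=0.85):
    """Return cP + (1 - c)(1/n)E, where E is the all-ones matrix."""
    n = len(P)
    return [[c * p + (1.0 - c) / n for p in row] for row in P]


# Hypothetical toy graph: page 3 has no outgoing links (dangling).
LINKS = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
P = hyperlink_matrix(LINKS, 4)
P_TILDE = modified_matrix(P)
```

Every row of `P_TILDE` sums to one, so it is a valid transition matrix even though page 3 is dangling.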

PageRank update

Google updates the PageRank monthly:
$P$ is determined by crawling the web.
PageRank is computed by power iterations with the modified transition matrix $\tilde{P} = cP + (1-c)(1/n)E$: $\pi^{(0)} = (1/n, \dots, 1/n)$; $\pi^{(k)} = \pi^{(k-1)} \tilde{P}$, $k > 0$.
Stop when $\pi^{(k)}$ and $\pi^{(k-1)}$ are close enough; [...] iterations are needed with $c = 0.85$.

The matrix $P$ is huge, so each iteration is costly.

We believe that Monte Carlo is more efficient...
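The power iteration described above can be sketched as follows. This is an illustrative implementation under stated assumptions: the toy matrix `P`, the L1 stopping rule, and the tolerance `tol` are choices made for the example, since the slide only says "close enough".

```python
def power_iteration(P_tilde, tol=1e-10, max_iter=1000):
    """Iterate pi <- pi * P_tilde from the uniform vector until the L1 change is below tol."""
    n = len(P_tilde)
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        new = [sum(pi[i] * P_tilde[i][j] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if change < tol:
            break
    return pi, k


# Hypothetical toy graph (page 3 is dangling), c = 0.85.
C, N = 0.85, 4
P = [[0, .5, .5, 0], [0, 0, .5, .5], [1, 0, 0, 0], [.25, .25, .25, .25]]
P_TILDE = [[C * p + (1 - C) / N for p in row] for row in P]
PI, ITERATIONS = power_iteration(P_TILDE)
```

Each pass touches every entry of the matrix, which is why a single iteration is expensive on a web-scale graph.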

Monte Carlo Methods

Convenient formula for the PageRank $\pi$:

$$\pi = \frac{1-c}{n}\,\mathbf{1}^T [I - cP]^{-1} = \frac{1}{n}\,\mathbf{1}^T \sum_{k=0}^{\infty} (1-c)\,c^k P^k$$

$(X_t)_{t \ge 0}$ is a Markov chain with transition matrix $P$.
$T$ is a geometric$(1-c)$ stopping time, $E[T] = 1/(1-c)$.

MCM1, end-point, random start: given that $X_0$ is picked at random, $X_T$ is a sample from $\pi$.

The complexity estimate $O(n^2)$ (Breyer, 2002) is overly pessimistic.
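A small sketch of MCM1 next to the series formula may help. It assumes a hypothetical four-page graph and reads the stopping rule as "before each move, stop with probability 1 - c", so that a run makes exactly k moves with probability (1 - c) c^k, matching the weights in the series. The names `series_pagerank` and `mcm1` are illustrative.

```python
import random


def series_pagerank(P, c=0.85, terms=400):
    """Truncated (1/n) 1^T sum_k (1 - c) c^k P^k."""
    n = len(P)
    row = [1.0 / n] * n            # (1/n) 1^T P^k, starting at k = 0
    weight = 1.0 - c               # (1 - c) c^k
    pi = [0.0] * n
    for _ in range(terms):
        pi = [p + weight * r for p, r in zip(pi, row)]
        row = [sum(row[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= c
    return pi


def mcm1(P, runs, c=0.85, seed=0):
    """End-point estimate with a uniformly random start page."""
    rng = random.Random(seed)
    n = len(P)
    pages = list(range(n))
    hits = [0] * n
    for _ in range(runs):
        x = rng.randrange(n)
        while rng.random() < c:                      # continue w.p. c, stop w.p. 1 - c
            x = rng.choices(pages, weights=P[x])[0]  # one step of the chain
        hits[x] += 1
    return [h / runs for h in hits]


# Hypothetical toy graph (page 3 is dangling, so its row is uniform).
P = [[0, .5, .5, 0], [0, 0, .5, .5], [1, 0, 0, 0], [.25, .25, .25, .25]]
EXACT = series_pagerank(P)
ESTIMATE = mcm1(P, runs=20000)
```

The end points of the runs are independent samples from the PageRank distribution, so `ESTIMATE` fluctuates around `EXACT` with binomial noise.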

Variance reduction

$$Z = [I - cP]^{-1} = \sum_{k=0}^{\infty} c^k P^k, \qquad (1-c)\,z_{ij} = P[X_T = j \mid X_0 = i]$$

$$\pi_j = \frac{1-c}{n} \sum_{i=1}^{n} z_{ij}$$

MCM2, end-point, cyclic start: run $(X_t)_{t \ge 0}$ $m$ times from each page. Evaluate $\pi_j$ as

$\hat\pi_j$ = [fraction of runs when $\{X_T = j\}$]

$$\mathrm{Var}(\hat\pi_j) < (mn)^{-1}\,\pi_j (1 - \pi_j)$$
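The cyclic-start variant differs from the random-start one only in how the start page is chosen. The sketch below is an illustration on a hypothetical four-page graph; `mcm2` and `variance_bound` are names invented for the example, and the bound is evaluated with the estimate plugged in for the unknown $\pi_j$.

```python
import random


def mcm2(P, m, c=0.85, seed=0):
    """End-point estimate with cyclic start: m runs from every page."""
    rng = random.Random(seed)
    n = len(P)
    pages = list(range(n))
    hits = [0] * n
    for start in range(n):
        for _ in range(m):
            x = start
            while rng.random() < c:                      # stop w.p. 1 - c before each move
                x = rng.choices(pages, weights=P[x])[0]
            hits[x] += 1
    return [h / (m * n) for h in hits]


def variance_bound(pi_j, m, n):
    """Upper bound (mn)^{-1} pi_j (1 - pi_j) on Var(pi_hat_j)."""
    return pi_j * (1.0 - pi_j) / (m * n)


# Hypothetical toy graph (page 3 is dangling, so its row is uniform).
P = [[0, .5, .5, 0], [0, 0, .5, .5], [1, 0, 0, 0], [.25, .25, .25, .25]]
M = 5000
ESTIMATE = mcm2(P, M)
BOUNDS = [variance_bound(p, M, len(P)) for p in ESTIMATE]   # plug-in values
```

Starting the same number of runs from every page removes the randomness of the start page, which is where the variance reduction comes from.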

Full path version

Note: $z_{ij}$, element $(i,j)$ of the matrix $\sum_{k=0}^{\infty} c^k P^k$, is the average number of visits to $j$ before time $T$ given $\{X_0 = i\}$. Also,

$$\pi_j = \frac{1-c}{n} \sum_{i=1}^{n} z_{ij}$$

MCM3, complete path, cyclic start: run $(X_t)_{t \ge 0}$ $m$ times from each page, terminating at each step w.p. $1-c$. Evaluate $\pi_j$ as

$\hat\pi_j$ = [fraction of time spent in $j$]

Stopping time: it is natural to stop not only w.p. $1-c$ at each step but also at dangling nodes.
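A sketch of the complete-path estimator, again on a hypothetical four-page graph. The function name `mcm3` is illustrative, and "fraction of time spent in j" is implemented as the share of all recorded visits that fall on page j, with the start page counted as a visit.

```python
import random


def mcm3(P, m, c=0.85, seed=0):
    """Complete-path estimate with cyclic start; also returns the mean path length."""
    rng = random.Random(seed)
    n = len(P)
    pages = list(range(n))
    visits = [0] * n
    for start in range(n):
        for _ in range(m):
            x = start
            visits[x] += 1                               # the start page counts as a visit
            while rng.random() < c:                      # stop w.p. 1 - c at each step
                x = rng.choices(pages, weights=P[x])[0]
                visits[x] += 1
    total = sum(visits)
    return [v / total for v in visits], total / (m * n)


# Hypothetical toy graph (page 3 is dangling, so its row is uniform).
P = [[0, .5, .5, 0], [0, 0, .5, .5], [1, 0, 0, 0], [.25, .25, .25, .25]]
ESTIMATE, MEAN_LENGTH = mcm3(P, m=2000)
```

Every run contributes all the pages on its path rather than a single end point, so the same number of runs yields more observations; the mean path length should be close to 1/(1 - c).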

Full path version

MCM4, complete path, cyclic start: run $(X_t)_{t \ge 0}$ $m$ times from each page, terminating at each step w.p. $1-c$ or on reaching a dangling node. Evaluate $\pi_j$ as

$\hat\pi_j$ = [fraction of time spent in $j$]

Let $Q$ be the hyperlink matrix with zero rows for dangling nodes. Then

$$\pi = \gamma\,\mathbf{1}^T \sum_{k=0}^{\infty} c^k Q^k, \qquad \gamma = \frac{c}{n} \sum_{i\ \text{dangl.}} \pi_i + \frac{1-c}{n} < \frac{1}{n}$$
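The relation between $\pi$, $Q$ and $\gamma$ can be checked numerically, and the stopping rule at dangling nodes simulated, on a hypothetical four-page graph. The names `mcm4` and `expected_visits` are illustrative, and $\gamma$ is obtained here from the normalisation $\sum_j \pi_j = 1$, which is an assumption of this sketch rather than a step shown on the slide.

```python
import random


def mcm4(out_links, n, m, c=0.85, seed=0):
    """Complete-path estimate, cyclic start, runs also stop at dangling pages."""
    rng = random.Random(seed)
    visits = [0] * n
    for start in range(n):
        for _ in range(m):
            x = start
            while True:
                visits[x] += 1
                targets = out_links.get(x, [])
                if not targets or rng.random() >= c:   # dangling page, or stop w.p. 1 - c
                    break
                x = rng.choice(targets)                # follow a random outgoing link
    total = sum(visits)
    return [v / total for v in visits]


def expected_visits(out_links, n, c=0.85, terms=400):
    """Truncated w = 1^T sum_k c^k Q^k, where Q has zero rows for dangling pages."""
    row = [1.0] * n                 # 1^T (cQ)^k, starting at k = 0
    w = [0.0] * n
    for _ in range(terms):
        w = [a + b for a, b in zip(w, row)]
        nxt = [0.0] * n
        for i, mass in enumerate(row):
            targets = out_links.get(i, [])
            for j in targets:       # dangling pages pass nothing on
                nxt[j] += c * mass / len(targets)
        row = nxt
    return w


# Hypothetical toy graph: page 3 is dangling.
LINKS = {0: [1, 2], 1: [2, 3], 2: [0], 3: []}
W = expected_visits(LINKS, 4)
GAMMA = 1.0 / sum(W)                # normalisation so that gamma * W sums to 1
PI_FROM_W = [GAMMA * w for w in W]
ESTIMATE = mcm4(LINKS, 4, m=5000)
```

Runs that hit a dangling page end early, so the paths are shorter than in the version without this rule, yet the normalised visit counts still estimate the same PageRank vector.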

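Convergence Analysis

$\bar{W}_{ij}$: average number of visits to $j$ after $m$ runs from $i$;

$$\bar{W}_j = \sum_{i=1}^{n} \bar{W}_{ij}, \qquad \bar{W} = \sum_{j=1}^{n} \bar{W}_j$$

Then $\hat\pi_j = \bar{W}_j \bar{W}^{-1}$. Here $\hat\pi_j$ is determined by $\bar{W}_j$, and the relative errors are similar.

Th. 1. If $|\bar{W}_j - w_j| \le \varepsilon w_j$, then $|\hat\pi_j - \pi_j| \le \varepsilon_{n,\beta}\,\pi_j$ w.p. $1-\beta$, where $|\varepsilon - \varepsilon_{n,\beta}| < C(\beta)(1+\varepsilon)/\sqrt{nm}$.

Thus, the error in estimating $\pi_j$ originates mainly from $\bar{W}_j$, the estimator of $w_j = [\mathbf{1}^T [I - cQ]^{-1}]_j$.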

Idea of the proof of Theorem 1

$$|\hat\pi_j - \pi_j| = |\bar{W}_j \bar{W}^{-1} - \pi_j| \le \varepsilon \pi_j + |(\gamma \bar{W})^{-1} - 1|\,(1+\varepsilon)\,\pi_j$$

1. The length of each run is smaller than $T$, so we can bound its variance.
2. The runs are independent.
3. From 1 and 2: $\mathrm{Var}(\bar{W}) = O(n)$, hence $\mathrm{Var}(\gamma \bar{W}) = O(1/n)$.
4. $\bar{W}$ is approximately normally distributed.
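The first inequality of the proof can be unpacked in two steps. The derivation below is a reconstruction, not taken from the slides; it assumes $\pi_j = \gamma w_j$ (the formula with the matrix $Q$), writes $\hat\pi_j = \bar{W}_j / \bar{W}$ for the estimator, and uses the hypothesis $|\bar{W}_j - w_j| \le \varepsilon w_j$, which also gives $\bar{W}_j \le (1+\varepsilon) w_j$.

```latex
\begin{align*}
|\hat\pi_j - \pi_j|
  &= \left| \frac{\bar{W}_j}{\bar{W}} - \gamma w_j \right|
   \le \gamma\,|\bar{W}_j - w_j|
     + \gamma \bar{W}_j \left| (\gamma \bar{W})^{-1} - 1 \right| \\
  &\le \varepsilon\,\gamma w_j
     + (1+\varepsilon)\,\gamma w_j \left| (\gamma \bar{W})^{-1} - 1 \right|
   = \varepsilon\,\pi_j
     + \left| (\gamma \bar{W})^{-1} - 1 \right| (1+\varepsilon)\,\pi_j .
\end{align*}
```

The remaining work is then to show that $\gamma \bar{W}$ is close to 1, which is what steps 1 to 4 address.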

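Confidence intervals

$$P(|\bar{W}_j - w_j| < \varepsilon w_j) \ge 1 - \alpha$$

We can show:

$$\mathrm{Var}(\bar{W}_j) \le \frac{1}{m}\,\frac{1+q_{jj}}{1-q_{jj}}\,w_j,$$

where $q_{jj} \le c^2$ is the probability to return to $j$ starting from $j$. Then

$$\varepsilon \approx x_{1-\alpha/2}\,\sqrt{\frac{1+q_{jj}}{1-q_{jj}}}\;\frac{\sqrt{1 - c + c \sum_{\text{dangl.}} \pi_i}}{\sqrt{\pi_j\,mn}},$$

where $x_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of $N(0,1)$.

Ex.: $\pi_j = 10^4 (1-c)/n$, $m = 1$ (!) gives $\varepsilon \approx$ [...]. This is much better than one power iteration!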
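The relative-error formula can be evaluated directly. The sketch below is an illustration only: the function name, the graph size `N`, the choice of zero PageRank mass on dangling nodes, and the use of the bound $q_{jj} = c^2$ are all assumptions made for the example, so the result need not match the value quoted in the talk.

```python
from math import sqrt
from statistics import NormalDist


def relative_error_complete_path(pi_j, m, n, c=0.85, dangling_mass=0.0,
                                 q_jj=None, alpha=0.05):
    """Approximate relative error of the complete-path estimate of pi_j."""
    if q_jj is None:
        q_jj = c ** 2                               # upper bound on the return probability
    x = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # (1 - alpha/2)-quantile of N(0, 1)
    cycle_factor = sqrt((1.0 + q_jj) / (1.0 - q_jj))
    return x * cycle_factor * sqrt(1.0 - c + c * dangling_mass) / sqrt(pi_j * m * n)


# Hypothetical setting: a page whose PageRank is 10^4 times the minimum (1 - c)/n.
N = 10 ** 6
C = 0.85
EPS = relative_error_complete_path(pi_j=1e4 * (1 - C) / N, m=1, n=N, c=C)
```

Because $\pi_j$ enters only through the product $\pi_j\,mn$, the graph size cancels in this example and the error depends on how many times larger than $(1-c)/n$ the PageRank is.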

Complete path vs. end-point

$$\varepsilon_{\text{complete path}} \approx x_{1-\alpha/2}\,\sqrt{\frac{1+q_{jj}}{1-q_{jj}}}\;\frac{\sqrt{1 - c + c \sum_{\text{dangl.}} \pi_i}}{\sqrt{\pi_j\,mn}}$$

$$\varepsilon_{\text{end-point}} \approx x_{1-\alpha/2}\,\frac{\sqrt{1-\pi_j}}{\sqrt{\pi_j\,mn}}$$

The complete path might work worse if:
- there are many cycles (high variability);
- there are many dangling nodes (the stopping time is short).

In practice, the complete path method works better. In our experiments, $\varepsilon_{\text{comp.path}} \approx 0.59\,\varepsilon_{\text{end point}}$.
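Dividing the two error formulas shows how cycles and dangling nodes shift the balance, since the quantile and the factor $\sqrt{\pi_j\,mn}$ cancel. The sketch below is illustrative; `error_ratio` and the parameter values are assumptions chosen for the example, not figures from the experiments.

```python
from math import sqrt


def error_ratio(pi_j, c=0.85, dangling_mass=0.0, q_jj=0.0):
    """eps(complete path) / eps(end-point); the quantile, m and n cancel."""
    complete = sqrt((1.0 + q_jj) / (1.0 - q_jj)) * sqrt(1.0 - c + c * dangling_mass)
    end_point = sqrt(1.0 - pi_j)
    return complete / end_point


# Hypothetical page with small PageRank, no dangling mass.
RATIO_NO_CYCLES = error_ratio(pi_j=1e-4)                      # q_jj = 0
RATIO_MANY_CYCLES = error_ratio(pi_j=1e-4, q_jj=0.85 ** 2)    # q_jj at its bound c^2
```

A ratio below 1 favours the complete path. The ratio grows with the return probability `q_jj` and with the PageRank mass on dangling nodes, which is the sense in which cycles and dangling nodes can hurt the complete-path method.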

Power iterations vs. MCM

Data set: INRIA Sophia Antipolis ([...] pages, [...] hyperlinks).

[Figure: two panels, for pages $\pi_1$ and $\pi_{10}$, plotting PR against the number of iterations. Curves: MC complete path stopping at dangling nodes, MC confidence interval, PI method, PI method (10th iteration).]

Power iterations vs. MCM

[Figure: two panels, for pages $\pi_{100}$ and $\pi_{1000}$, plotting PR against the number of iterations. Curves: MC complete path stopping at dangling nodes, MC confidence interval, PI method, PI method (10th iteration).]

Different versions of MCM

Data set: INRIA Sophia Antipolis ([...] pages, [...] hyperlinks).

[Figure: two panels, for pages $\pi_1$ and $\pi_{10}$, plotting relative error against the number of iterations. Curves: MC complete path stopping at dangling nodes (with confidence interval), MC end point with cyclic start (with confidence interval), MC complete path with random start.]

Different versions of MCM

[Figure: two panels, for pages $\pi_{100}$ and $\pi_{1000}$, plotting relative error against the number of iterations. Curves: MC complete path stopping at dangling nodes (with confidence interval), MC end point with cyclic start (with confidence interval), MC complete path with random start.]

Conclusions

- MCM with cyclic start outperforms MCM with random start.
- The complete path algorithm in practice outperforms the end-point algorithm.
- The PageRank of important pages is estimated well after the first iteration.
- Other advantages of MCM: natural parallel implementation and possibilities for on-line update.

That's all for today... Questions? Suggestions?

Department of Applied Mathematics. University of Twente. Faculty of EEMCS. Memorandum No. 1712

Department of Applied Mathematics, Faculty of EEMCS, University of Twente, The Netherlands
P.O. Box 217, 7500 AE Enschede, The Netherlands
Phone: +31-53-4893400
Fax: +31-53-4893114
Email: memo@math.utwente.nl