Monte Carlo methods in PageRank computation: When one iteration is sufficient

Nelly Litvak (University of Twente, The Netherlands), e-mail: n.litvak@ewi.utwente.nl
Konstantin Avrachenkov (INRIA Sophia Antipolis, France)
Dmitri Nemirovsky and Natalia Osipova (St. Petersburg State University, Russia)
Financial support: Netherlands Organization for Scientific Research (NWO) under Meervoud grant 632.002.401 and the grant VGP 61-520; French Organization EGIDE under Van Gogh grant no. 05433ud
MCM2005, Tallahassee, 19.05.2005

Outline
- Markov model for the PageRank
- Monte Carlo algorithms
- Convergence analysis
- Experiments

Search engine context
- A user types a query to find relevant pages.
- Problem: normally, there are hundreds of relevant pages. In which order should we list the pages for the user?
- The solution has been found by... Google.
- Google ranking: list the most important and popular pages first!
S. Brin and L. Page (1998), The Anatomy of a Large-Scale Hypertextual Web Search Engine. In WWW7, Australia.

PageRank: Markov model
- PageRank $\pi_i$ of page $i$ is the long-run fraction of time that a random surfer spends on page $i$.
- Easily bored surfer model: with probability $c$ (= 0.85), the surfer follows a randomly chosen outgoing link. Otherwise, he/she jumps to a random page.
[Figure: transition diagram for a page $i$ with $d$ outgoing links; each link is labeled $c/d$ and the random jump is labeled $1-c$.]

Formal model description
- $n$ is the total number of pages
- $P = (p_{ij})$ is the hyperlink matrix:
  $p_{ij} = 1/d_i$ if $j$ is one of the $d_i$ outgoing links of $i$,
  $p_{ij} = 1/n$ if $d_i = 0$,
  $p_{ij} = 0$ otherwise.
- Modified transition matrix: $\tilde{P} = cP + (1-c)(1/n)E$, where $E$ is an $n \times n$ matrix consisting of ones and $c = 0.85$.
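As a minimal sketch of the definitions on this slide, the two matrices can be built from an adjacency list. The function names `hyperlink_matrix` and `modified_matrix` and the four-page toy graph are illustrative assumptions, not part of the talk; dense nested lists are used only for readability and would not scale to a real web graph.

```python
def hyperlink_matrix(links):
    """Build P = (p_ij) from links[i], the list of pages that page i links to."""
    n = len(links)
    P = []
    for out in links:
        if out:
            # p_ij = 1/d_i for each of the d_i outgoing links of page i.
            row = [0.0] * n
            for j in out:
                row[j] += 1.0 / len(out)
        else:
            # Dangling page (d_i = 0): p_ij = 1/n for every j.
            row = [1.0 / n] * n
        P.append(row)
    return P


def modified_matrix(P, c=0.85):
    """Return c*P + (1-c)*(1/n)*E, where E is the all-ones matrix."""
    n = len(P)
    return [[c * p + (1.0 - c) / n for p in row] for row in P]


# Hypothetical toy graph: page 3 has no outgoing links (dangling node).
links = [[1, 2], [2], [0, 3], []]
P = hyperlink_matrix(links)
P_tilde = modified_matrix(P)
for row in P_tilde:
    print([round(p, 4) for p in row])
```

Every row of both matrices sums to one, so each is a valid transition matrix.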

PageRank update
Google updates the PageRank monthly:
- $P$ is determined by crawling the web
- PageRank is computed by power iterations with the modified transition matrix $\tilde{P}$: $\pi^{(0)} = (1/n, \ldots, 1/n)$; $\pi^{(k)} = \pi^{(k-1)} \tilde{P}$, $k > 0$
- Stop when $\pi^{(k)}$ and $\pi^{(k-1)}$ are close enough
- 50-100 iterations are needed with $c = 0.85$
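The power iteration above can be sketched in a few lines. This is an illustration under stated assumptions, not the production procedure: the function name `power_iteration`, the L1 stopping rule with tolerance `tol`, and the toy graph are all hypothetical choices. The product with the modified transition matrix is computed without forming the matrix, using the fact that the random jump and the dangling pages both contribute a uniform share to every page.

```python
def power_iteration(links, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate pi <- pi * (cP + (1-c)(1/n)E) until the L1 change is below tol.

    links[i] is the list of pages that page i links to.
    Returns the final vector and the number of iterations performed.
    """
    n = len(links)
    pi = [1.0 / n] * n  # pi^(0) = (1/n, ..., 1/n)
    for k in range(1, max_iter + 1):
        # Probability mass currently sitting on dangling pages (d_i = 0).
        dangling = sum(pi[i] for i in range(n) if not links[i])
        # Uniform share: random jump plus the mass leaving dangling pages.
        base = (1.0 - c) / n + c * dangling / n
        new = [base] * n
        for i, out in enumerate(links):
            for j in out:
                new[j] += c * pi[i] / len(out)
        diff = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if diff < tol:
            return pi, k
    return pi, max_iter


# Hypothetical toy graph: page 3 is a dangling node.
links = [[1, 2], [2], [0, 3], []]
pi, iterations = power_iteration(links)
print([round(p, 4) for p in pi], iterations)
```

The tolerance here is far tighter than a search engine would need; it is chosen only so that the toy example converges to a stable vector.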

Monte Carlo Methods
Convenient formula for the PageRank $\pi$:
$$\pi = \frac{1-c}{n}\,\mathbf{1}^T [I - cP]^{-1} = \frac{1}{n}\,\mathbf{1}^T \sum_{k=0}^{\infty} (1-c)\,c^k P^k$$
- $(X_t)_{t \ge 0}$: Markov chain with transition matrix $P$
- $T$: geometric($1-c$) stopping time, $E[T] = 1/(1-c) = 1/0.15 \approx 6.67$
MCM1, end-point, random start: given that $X_0$ is picked at random, $X_T$ is a sample from $\pi$.
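A minimal sketch of MCM1, assuming a uniformly random start page and a walk that continues with probability $c$ and stops with probability $1-c$ before each transition, so that the number of transitions $k$ has probability $(1-c)c^k$ as in the series above. The function name, the `runs` and `seed` parameters, and the toy graph are illustrative assumptions.

```python
import random


def mcm1_end_point_random_start(links, runs, c=0.85, seed=0):
    """Estimate PageRank as the empirical distribution of walk end points."""
    rng = random.Random(seed)
    n = len(links)
    counts = [0] * n
    for _ in range(runs):
        x = rng.randrange(n)  # X_0 picked uniformly at random
        # Continue with probability c, stop with probability 1 - c.
        while rng.random() < c:
            out = links[x]
            # Follow a random outgoing link; a dangling page jumps uniformly,
            # matching p_ij = 1/n when d_i = 0.
            x = rng.choice(out) if out else rng.randrange(n)
        counts[x] += 1  # record the end point X_T
    return [k / runs for k in counts]


# Hypothetical toy graph: page 3 is a dangling node.
links = [[1, 2], [2], [0, 3], []]
print(mcm1_end_point_random_start(links, runs=20000))
```

Each run contributes a single sample, which is why the later variants that start from every page and count every visit are of interest.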

Variance reduction
$$Z = [I - cP]^{-1} = \sum_{k=0}^{\infty} c^k P^k, \qquad (1-c)\,z_{ij} = P[X_T = j \mid X_0 = i], \qquad \pi_j = \frac{1-c}{n} \sum_{i=1}^{n} z_{ij}$$
MCM2, end-point, cyclic start: run $(X_t)_{t \ge 0}$ $m$ times from each page. Evaluate $\pi_j$ as
$\hat{\pi}_j$ = [fraction of runs with $X_T = j$], for which
$$\mathrm{Var}(\hat{\pi}_j) < (mn)^{-1}\, \pi_j (1 - \pi_j)$$
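The cyclic-start variant differs from the random-start one only in how the start pages are chosen. The sketch below is an illustration, not the authors' code: the function name and the toy graph are assumptions, and the printed bound plugs the estimates into $(mn)^{-1}\pi_j(1-\pi_j)$ in place of the unknown $\pi_j$.

```python
import random


def mcm2_end_point_cyclic_start(links, m, c=0.85, seed=0):
    """Run m walks from each page; estimate pi_j as the fraction ending at j."""
    rng = random.Random(seed)
    n = len(links)
    counts = [0] * n
    for start in range(n):  # cyclic start: every page in turn
        for _ in range(m):
            x = start
            # Continue with probability c, stop with probability 1 - c.
            while rng.random() < c:
                out = links[x]
                # Dangling pages jump uniformly (p_ij = 1/n when d_i = 0).
                x = rng.choice(out) if out else rng.randrange(n)
            counts[x] += 1
    return [k / (m * n) for k in counts]


# Hypothetical toy graph: page 3 is a dangling node.
links = [[1, 2], [2], [0, 3], []]
m = 5000
estimate = mcm2_end_point_cyclic_start(links, m)
# Plug-in version of the variance bound, using estimates instead of pi_j.
bound = [p * (1.0 - p) / (m * len(links)) for p in estimate]
print(estimate)
print(bound)
```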

Full path version
Note: $z_{ij}$, element $(i,j)$ of the matrix $\sum_{k=0}^{\infty} c^k P^k$, is the average number of visits to $j$ before time $T$ given $\{X_0 = i\}$. Also,
$$\pi_j = \frac{1-c}{n} \sum_{i=1}^{n} z_{ij}$$
MCM3, complete path, cyclic start: run $(X_t)_{t \ge 0}$ $m$ times from each page, terminating at each step w.p. $1-c$. Evaluate $\pi_j$ as $\hat{\pi}_j$ = [fraction of time spent in $j$].

MCM4, complete path, cyclic start: run $(X_t)_{t \ge 0}$ $m$ times from each page, terminating at each step w.p. $1-c$ or on reaching a dangling node. Evaluate $\pi_j$ as $\hat{\pi}_j$ = [fraction of time spent in $j$].
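A minimal sketch of the complete-path method that stops at dangling nodes (MCM4). Every visited page is counted, not only the end point, and the estimate is the share of all visits that fall on page $j$. The function name, parameters, and toy graph are illustrative assumptions.

```python
import random


def mcm4_complete_path(links, m, c=0.85, seed=0):
    """Run m walks from each page, counting every visit along each path.

    A walk stops at each step with probability 1 - c, or as soon as it
    reaches a dangling page (one with no outgoing links).
    """
    rng = random.Random(seed)
    n = len(links)
    visits = [0] * n
    for start in range(n):  # cyclic start: every page in turn
        for _ in range(m):
            x = start
            while True:
                visits[x] += 1  # count the visit, including the start page
                if not links[x] or rng.random() >= c:
                    break  # dangling node reached, or stop w.p. 1 - c
                x = rng.choice(links[x])
    total = sum(visits)
    return [v / total for v in visits]


# Hypothetical toy graph: page 3 is a dangling node.
links = [[1, 2], [2], [0, 3], []]
print(mcm4_complete_path(links, m=1))     # a single run per page
print(mcm4_complete_path(links, m=5000))  # many runs per page
```

Because each run yields several visits rather than one end point, far more information is extracted per simulated step.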

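Convergence Analysis
$\bar{W}_{ij}$: average number of visits to $j$ after $m$ runs from $i$,
$$\bar{W}_j = \sum_{i=1}^{n} \bar{W}_{ij}, \qquad \bar{W} = \sum_{j=1}^{n} \bar{W}_j$$
Then $\hat{\pi}_j = \bar{W}_j \bar{W}^{-1}$. Here $\hat{\pi}_j$ is determined by $\bar{W}_j$, and the relative errors are similar.
Th. 1. If $|\bar{W}_j - w_j| \le \varepsilon w_j$, then $|\hat{\pi}_j - \pi_j| \le \varepsilon_{n,\beta}\, \pi_j$ w.p. $1-\beta$, where $|\varepsilon - \varepsilon_{n,\beta}| < C(\beta)(1+\varepsilon)/\sqrt{nm}$.
Thus, the error in estimating $\pi_j$ originates mainly from $\bar{W}_j$, the estimator of $w_j = [\mathbf{1}^T [I - cQ]^{-1}]_j$.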

Idea of the proof of Theorem 1
$$|\hat{\pi}_j - \pi_j| = |\bar{W}_j \bar{W}^{-1} - \pi_j| \le \varepsilon \pi_j + |(\gamma \bar{W})^{-1} - 1|\,(1+\varepsilon)\,\pi_j$$
1. The length of each run is smaller than $T$, so we can bound its variance.
2. The runs are independent.
From 1 and 2:
3. $\mathrm{Var}(\bar{W}) = O(n)$, hence $\mathrm{Var}(\gamma \bar{W}) = O(1/n)$.
4. $\bar{W}$ is approximately normally distributed.

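Confidence intervals
$$P(|\bar{W}_j - w_j| < \varepsilon w_j) \ge 1 - \alpha$$
We can show:
$$\mathrm{Var}(\bar{W}_j) \le \frac{1}{m}\, \frac{1+q_{jj}}{1-q_{jj}}\, w_j,$$
where $q_{jj} \le c^2$ is the probability to return to $j$ starting from $j$. Then
$$\varepsilon \approx x_{1-\alpha/2}\, \sqrt{\frac{1 - c + c \sum_{i\ \mathrm{dangl.}} \pi_i}{\pi_j\, mn}}\, \sqrt{\frac{1+q_{jj}}{1-q_{jj}}},$$
where $x_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of $N(0,1)$.
Ex.: $\pi_j = 10^4 (1-c)/n$, $m = 1$ (!) gives $\varepsilon \approx 0.01$. This is much better than one power iteration!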
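To get a feel for the numbers, the relative error can be evaluated directly. The sketch assumes the bound has the form $\varepsilon \approx x_{1-\alpha/2}\sqrt{(1-c+c\sum_{\mathrm{dangl.}}\pi_i)/(\pi_j mn)}\,\sqrt{(1+q_{jj})/(1-q_{jj})}$, which is a reading of this slide and not a verified statement of the result. The function name, the parameter names `q_jj` and `dangling_mass`, and the value of `n` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_error(pi_j, n, m, c=0.85, q_jj=0.0, dangling_mass=0.0, alpha=0.05):
    """Evaluate the assumed complete-path relative error for page j.

    q_jj is the probability of returning to j from j, and dangling_mass is
    the total PageRank of the dangling pages; both default to 0 for simplicity.
    """
    x = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2)-quantile of N(0,1)
    path_term = sqrt((1.0 - c + c * dangling_mass) / (pi_j * m * n))
    return_term = sqrt((1.0 + q_jj) / (1.0 - q_jj))
    return x * path_term * return_term


# The slide's example: pi_j = 10^4 (1 - c)/n with a single run per page (m = 1).
c = 0.85
n = 10**6  # hypothetical graph size; it cancels out in this example
pi_j = 1e4 * (1.0 - c) / n
print(relative_error(pi_j, n, m=1, c=c))
```

With `q_jj = 0` and no dangling mass, the expression reduces to 0.01 times the normal quantile, so the result is of order 0.01 for any common confidence level.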

Complete path vs. end-point
$\varepsilon$, complete path:
$$\varepsilon \approx x_{1-\alpha/2} \sqrt{\frac{1+q_{jj}}{1-q_{jj}}}\, \sqrt{\frac{1 - c + c \sum_{i\ \mathrm{dangl.}} \pi_i}{\pi_j\, mn}}$$
$\varepsilon$, end-point:
$$\varepsilon \approx x_{1-\alpha/2} \sqrt{\frac{1-\pi_j}{\pi_j\, mn}}$$
The complete path might work worse if:
- there are many cycles (high variability)
- there are many dangling nodes (stopping time is short)
In practice, the complete path method works better. In our experiments, $\varepsilon_{\text{comp. path}} \approx 0.59\, \varepsilon_{\text{end-point}}$.

Power iterations vs. MCM
INRIA Sophia Antipolis, http://www-sop.inria.fr, 50000 pages, 200000 hyperlinks

[Figure: two panels, PageRank estimate (PR) vs. no. of iterations (1 to 10), for $\pi_1 = 0.00409$ and $\pi_{10} = 0.00103$. Curves: MC complete path with dangling nodes, MC confidence interval, PI method, PI method (10th iteration).]

[Figure: two panels, PageRank estimate (PR) vs. no. of iterations (1 to 10), for $\pi_{100} = 0.00054$ and $\pi_{1000} = 0.00009$. Curves: MC complete path with dangling nodes, MC confidence interval, PI method, PI method (10th iteration).]

Different versions of MCM
INRIA Sophia Antipolis, http://www-sop.inria.fr, 50000 pages, 200000 hyperlinks

[Figure: two panels, relative error vs. no. of iterations (1 to 10), for $\pi_1 = 0.00409$ and $\pi_{10} = 0.00103$. Curves: MC complete path with dangling nodes (with confidence interval), MC end-point with cyclic start (with confidence interval), MC complete path with random start.]

[Figure: two panels, relative error vs. no. of iterations (1 to 10), for $\pi_{100} = 0.00054$ and $\pi_{1000} = 0.00009$. Curves: MC complete path with dangling nodes (with confidence interval), MC end-point with cyclic start (with confidence interval), MC complete path with random start.]

Conclusions
- MCM with cyclic start outperforms MCM with random start
- The complete path algorithm in practice outperforms the end-point algorithm
- The PageRank of important pages is estimated well after the first iteration
- Other advantages of MCM: natural parallel implementation and possibilities for on-line update

That's all for today... Questions? Suggestions?