Monte Carlo methods in PageRank computation: When one iteration is sufficient

Nelly Litvak (University of Twente, The Netherlands), e-mail: n.litvak@ewi.utwente.nl
Konstantin Avrachenkov (INRIA Sophia Antipolis, France)
Dmitri Nemirovsky and Natalia Osipova (St. Petersburg State University, Russia)

Financial support: Netherlands Organization for Scientific Research (NWO) under Meervoud grant 632.002.401 and the grant VGP 61-520; French Organization EGIDE under Van Gogh grant no.05433ud

MCM2005, Tallahassee, 19.05.2005

Outline
Markov model for the PageRank
Monte Carlo algorithms
Convergence analysis
Experiments

Search engine context
A user types a query to find relevant pages.
Problem: Normally, there are hundreds of relevant pages. In which order should we list the pages for the user?
The solution has been found by... Google
Google ranking: List most important and popular pages first!
S. Brin and L. Page (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. In WWW7, Australia.

PageRank: Markov model
PageRank $\pi_i$ of page $i$ is the long-run fraction of time that a random surfer spends on page $i$.
Easily bored surfer model: with probability $c$ ($= 0.85$), a surfer follows a randomly chosen outgoing link. Otherwise, he/she jumps to a random page.
[Diagram: page $i$ with $d$ outgoing links, each followed with probability $c/d$; a random jump occurs with probability $1-c$.]
$$\pi_i = \sum_{j \to i} \frac{c}{d_j}\,\pi_j + \frac{1-c}{n}$$

Formal model description
$n$ is the total number of pages.
$P = (p_{ij})$ is the hyperlink matrix:
$p_{ij} = 1/d_i$ if $j$ is one of the $d_i$ outgoing links of $i$,
$p_{ij} = 1/n$ if $d_i = 0$,
$p_{ij} = 0$ otherwise.
Modified transition matrix: $\tilde{P} = cP + (1-c)(1/n)E$, where $E$ is an $n \times n$ matrix consisting of ones and $c = 0.85$.
PageRank vector: $\pi \tilde{P} = \pi$, $\pi \mathbf{1} = 1$.

PageRank update
Google updates the PageRank monthly:
$P$ is determined by crawling the web.
PageRank is computed by power iterations: $\pi^{(0)} = (1/n, \ldots, 1/n)$; $\pi^{(k)} = \pi^{(k-1)} \tilde{P}$, $k > 0$.
Stop when $\pi^{(k)}$ and $\pi^{(k-1)}$ are close enough.
50 to 100 iterations are needed with $c = 0.85$.
The matrix $P$ is huge, so each iteration is costly.
We believe that Monte Carlo is more efficient...
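
The power iteration described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the adjacency-list input `out_links`, the function name `power_iteration`, the L1 stopping rule with tolerance `tol`, and the four-page `toy_graph` are all hypothetical and not taken from the talk. The modified matrix is never formed explicitly; the uniform jump and the dangling pages are handled as scalar corrections.

```python
def power_iteration(out_links, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate pi(k) = pi(k-1) * (c*P + (1-c)/n * E) until the L1 change is below tol.

    out_links[i] is the list of pages that page i links to (assumed free of duplicates).
    Returns the final vector and the number of iterations used.
    """
    n = len(out_links)
    pi = [1.0 / n] * n                      # pi(0) = (1/n, ..., 1/n)
    for k in range(1, max_iter + 1):
        new = [(1.0 - c) / n] * n           # random-jump part, uses sum(pi) = 1
        dangling_mass = 0.0
        for i, links in enumerate(out_links):
            if links:
                share = c * pi[i] / len(links)   # c * pi_i * p_ij with p_ij = 1/d_i
                for j in links:
                    new[j] += share
            else:
                dangling_mass += pi[i]           # rows with d_i = 0 have p_ij = 1/n
        extra = c * dangling_mass / n
        new = [x + extra for x in new]
        diff = sum(abs(a - b) for a, b in zip(new, pi))
        pi = new
        if diff < tol:
            return pi, k
    return pi, max_iter


# Hypothetical toy web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, page 3 is dangling.
toy_graph = [[1, 2], [2], [0], []]
pi, iterations = power_iteration(toy_graph)
```

Each pass touches every link once, which is the per-iteration cost the slide refers to.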

Monte Carlo Methods
Convenient formula for the PageRank $\pi$:
$$\pi = \frac{1-c}{n}\,\mathbf{1}^T [I - cP]^{-1} = \frac{1}{n}\,\mathbf{1}^T \sum_{k=0}^{\infty} (1-c)\,c^k P^k$$
$(X_t)_{t \ge 0}$: Markov chain with transition matrix $P$.
$T$: geometric($1-c$) stopping time, $E[T] = 1/(1-c) = 1/0.15 \approx 6.67$.
MCM1, end-point, random start: Given that $X_0$ is picked uniformly at random, $X_T$ is a sample from $\pi$.
The complexity estimate $O(n^2)$ (Breyer, 2002) is overly pessimistic.
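
A minimal sketch of the end-point method with random start (MCM1) follows. It assumes that the walk stops before each step with probability 1 - c, so that P(T = k) = (1 - c) c^k for k = 0, 1, ..., which matches the series in the formula above; under this convention a run visits 1/(1 - c) pages on average, counting the starting page. The function name, the adjacency-list input, the seed, and the toy graph are hypothetical.

```python
import random


def mc_end_point_random_start(out_links, num_runs, c=0.85, seed=0):
    """Estimate PageRank as the empirical distribution of the end point X_T."""
    rng = random.Random(seed)
    n = len(out_links)
    counts = [0] * n
    for _ in range(num_runs):
        page = rng.randrange(n)              # X_0 is uniform over all pages
        while rng.random() < c:              # continue w.p. c, stop w.p. 1 - c
            links = out_links[page]
            # Follow a random outgoing link; a dangling page jumps uniformly (p_ij = 1/n).
            page = rng.choice(links) if links else rng.randrange(n)
        counts[page] += 1                    # record X_T
    return [cnt / num_runs for cnt in counts]


# Hypothetical toy web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, page 3 is dangling.
toy_graph = [[1, 2], [2], [0], []]
pi_hat = mc_end_point_random_start(toy_graph, num_runs=100_000)
```

Every run is independent of the others, so the loop over runs can be split across workers without coordination.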

Variance reduction
$$Z = [I - cP]^{-1} = \sum_{k=0}^{\infty} c^k P^k, \qquad (1-c)\,z_{ij} = P[X_T = j \mid X_0 = i]$$
$$\pi_j = \frac{1-c}{n} \sum_{i=1}^{n} z_{ij}$$
MCM2, end-point, cyclic start: Run $(X_t)_{t \ge 0}$ $m$ times from each page. Evaluate $\pi_j$ as $\hat{\pi}_j$ = [fraction of runs when $\{X_T = j\}$].
$$\mathrm{Var}(\hat{\pi}_j) < (mn)^{-1}\,\pi_j (1 - \pi_j)$$
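
The variance bound can be checked exactly on a small example without simulation. With m independent runs from each of the n pages, the end-point estimator is an average of independent Bernoulli variables with success probabilities (1 - c) z_ij, so its variance is the sum over i of (1 - c) z_ij (1 - (1 - c) z_ij), divided by m n^2. With mn runs from uniformly random starts the variance is π_j (1 - π_j)/(mn). The sketch below computes both for a toy graph; the helper name `end_point_distribution`, the truncation at `num_terms` terms, and the graph itself are assumptions made for illustration.

```python
def end_point_distribution(out_links, start, c=0.85, num_terms=400):
    """Return p[j] = P[X_T = j | X_0 = start] = (1 - c) * sum_k c^k (P^k)[start][j].

    The series is truncated after num_terms terms (c**400 is negligible for c = 0.85).
    """
    n = len(out_links)
    dist = [0.0] * n                 # row `start` of P^k, beginning with k = 0
    dist[start] = 1.0
    p = [0.0] * n
    weight = 1.0 - c                 # (1 - c) * c^k
    for _ in range(num_terms):
        for j in range(n):
            p[j] += weight * dist[j]
        nxt = [0.0] * n
        for i, mass in enumerate(dist):
            links = out_links[i]
            if links:
                for j in links:
                    nxt[j] += mass / len(links)
            else:                    # dangling page: uniform jump, p_ij = 1/n
                for j in range(n):
                    nxt[j] += mass / n
        dist = nxt
        weight *= c
    return p


# Hypothetical toy web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, page 3 is dangling.
toy_graph = [[1, 2], [2], [0], []]
n, m = len(toy_graph), 1
rows = [end_point_distribution(toy_graph, i) for i in range(n)]
pi = [sum(rows[i][j] for i in range(n)) / n for j in range(n)]

# Variance of the end-point estimator with m*n runs in total.
var_random_start = [pi[j] * (1 - pi[j]) / (m * n) for j in range(n)]
var_cyclic_start = [
    sum(rows[i][j] * (1 - rows[i][j]) for i in range(n)) / (m * n * n)
    for j in range(n)
]
```

The cyclic-start variance is never larger because p(1 - p) is concave, so averaging it over starting pages cannot exceed its value at the average π_j.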

Full path version
Note: $z_{ij}$, element $(i,j)$ of the matrix $\sum_{k=0}^{\infty} c^k P^k$, is the average number of visits to $j$ before time $T$ given $\{X_0 = i\}$. Also,
$$\pi_j = \frac{1-c}{n} \sum_{i=1}^{n} z_{ij}$$
MCM3, complete path, cyclic start: Run $(X_t)_{t \ge 0}$ $m$ times from each page, terminating at each step w.p. $1-c$. Evaluate $\pi_j$ as $\hat{\pi}_j$ = [fraction of time spent in $j$].
Stopping time: It is natural to stop not only w.p. $1-c$ at each step but also at dangling nodes.

Full path version
MCM4, complete path, cyclic start: Run $(X_t)_{t \ge 0}$ $m$ times from each page, terminating at each step w.p. $1-c$, or on reaching a dangling node. Evaluate $\pi_j$ as $\hat{\pi}_j$ = [fraction of time spent in $j$].
$Q$: matrix with zero rows for dangling nodes.
$$\pi = \gamma\,\mathbf{1}^T \sum_{k=0}^{\infty} c^k Q^k, \qquad \gamma = \frac{c}{n} \sum_{i \in \text{dangl.}} \pi_i + \frac{1-c}{n} < \frac{1}{n}$$
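
The complete-path method that also stops at dangling nodes (MCM4) might be sketched as follows. The sketch counts every visited page, starting page included, and normalises by the total number of visits; the function name, the adjacency-list input, the seed, and the toy graph are hypothetical choices for illustration.

```python
import random


def mc_complete_path_dangling_stop(out_links, m, c=0.85, seed=0):
    """Run m walks from every page; estimate pi_j as the fraction of all visits spent in j."""
    rng = random.Random(seed)
    n = len(out_links)
    visits = [0] * n
    for start in range(n):                   # cyclic start: every page in turn
        for _ in range(m):
            page = start
            while True:
                visits[page] += 1            # count the whole path, not only the end point
                links = out_links[page]
                # Stop at a dangling node, or w.p. 1 - c before taking the next step.
                if not links or rng.random() >= c:
                    break
                page = rng.choice(links)
    total = sum(visits)
    return [v / total for v in visits]


# Hypothetical toy web: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}, page 3 is dangling.
toy_graph = [[1, 2], [2], [0], []]
pi_hat = mc_complete_path_dangling_stop(toy_graph, m=25_000)
```

Dividing by the total visit count plays the role of the unknown constant γ, so the dangling-node PageRank mass never has to be known in advance.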

Convergence Analysis
$W_{ij}$: average number of visits to $j$ after $m$ runs from $i$;
$$W_j = \sum_{i=1}^{n} W_{ij}, \qquad W = \sum_{j=1}^{n} W_j$$
Then $\hat{\pi}_j = W_j W^{-1}$. Here $\hat{\pi}_j$ is determined by $W_j$, and the relative errors are similar.
Th. 1. If $|W_j - w_j| \le \varepsilon w_j$, then $|\hat{\pi}_j - \pi_j| \le \varepsilon_{n,\beta}\,\pi_j$ w.p. $1-\beta$, where $|\varepsilon - \varepsilon_{n,\beta}| < C(\beta)(1+\varepsilon)/\sqrt{nm}$.
Thus, the error in estimating $\pi_j$ originates mainly from $W_j$, the estimator of $w_j = \left[\mathbf{1}^T [I - cQ]^{-1}\right]_j$.

Idea of the proof of Theorem 1
$$|\hat{\pi}_j - \pi_j| = |W_j W^{-1} - \pi_j| \le \varepsilon \pi_j + |(\gamma W)^{-1} - 1|\,(1+\varepsilon)\,\pi_j$$
1. The length of each run is smaller than $T$, so we can bound its variance.
2. The runs are independent.
3. From 1 and 2: $\mathrm{Var}(W) = O(n)$, hence $\mathrm{Var}(\gamma W) = O(1/n)$.
4. $W$ is approximately normally distributed.

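Confidence intervals
$$P(|W_j - w_j| < \varepsilon w_j) \ge 1 - \alpha$$
We can show:
$$\mathrm{Var}(W_j) \le \frac{1}{m}\,\frac{1+q_{jj}}{1-q_{jj}}\,w_j,$$
where $q_{jj} \le c^2$ is the probability to return to $j$ starting from $j$. Then
$$\varepsilon \approx x_{1-\alpha/2}\,\sqrt{\frac{1+q_{jj}}{1-q_{jj}}}\;\frac{\sqrt{1-c+c\sum_{i \in \text{dangl.}} \pi_i}}{\sqrt{\pi_j\,mn}},$$
where $x_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of $N(0,1)$.
Ex. $\pi_j = 10^4 (1-c)/n$, $m = 1$ (!) $\Rightarrow$ $\varepsilon \approx 0.01$. This is much better than one power iteration!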
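
The relative error on this slide is easy to evaluate numerically. The sketch below reads it as ε = x · sqrt((1 + q_jj)/(1 - q_jj)) · sqrt(1 - c + c · Σ π_i over dangling i) / sqrt(π_j m n), which is an interpretation of the slide rather than a quotation. The function name, the default values q_jj = 0 and zero dangling mass, the 95% level, and the choice n = 50000 are assumptions for illustration.

```python
import math
from statistics import NormalDist


def relative_error_complete_path(pi_j, n, m, c=0.85, q_jj=0.0,
                                 dangling_mass=0.0, alpha=0.05):
    """Relative half-width of the (1 - alpha) confidence interval for pi_j.

    q_jj is the return probability to page j; dangling_mass is the total
    PageRank of the dangling pages. Both default to 0 purely for illustration.
    """
    x = NormalDist().inv_cdf(1.0 - alpha / 2.0)      # (1 - alpha/2)-quantile of N(0, 1)
    cycle_factor = math.sqrt((1.0 + q_jj) / (1.0 - q_jj))
    stop_factor = math.sqrt(1.0 - c + c * dangling_mass)
    return x * cycle_factor * stop_factor / math.sqrt(pi_j * m * n)


# The slide's example: pi_j is 10^4 times the minimal PageRank (1 - c)/n, one run per page.
c, n, m = 0.85, 50_000, 1
pi_j = 1e4 * (1.0 - c) / n
eps = relative_error_complete_path(pi_j, n, m, c=c)
x_95 = NormalDist().inv_cdf(0.975)
```

With these defaults the factors of 1 - c and n cancel and ε reduces to 0.01 · x, so the slide's figure of about 0.01 corresponds to a quantile x of order one; at the 95% level used here the result is roughly twice that.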

Complete path vs. end-point
$\varepsilon$, complete path:
$$x_{1-\alpha/2}\,\sqrt{\frac{1+q_{jj}}{1-q_{jj}}}\;\frac{\sqrt{1-c+c\sum_{i \in \text{dangl.}} \pi_i}}{\sqrt{\pi_j\,mn}}$$
$\varepsilon$, end-point:
$$x_{1-\alpha/2}\,\frac{\sqrt{1-\pi_j}}{\sqrt{\pi_j\,mn}}$$
The complete path might work worse if:
there are many cycles (high variability);
there are many dangling nodes (stopping time is short).
In practice, the complete path method works better. In our experiments, $\varepsilon_{\text{comp. path}} \approx 0.59\,\varepsilon_{\text{end-point}}$.

Power iterations vs. MCM
INRIA Sophia Antipolis, http://www-sop.inria.fr, 50000 pages, 200000 hyperlinks
[Figure, four panels: PageRank (PR) against the number of iterations (1 to 10) for the pages with $\pi_1 = 0.00409$, $\pi_{10} = 0.00103$, $\pi_{100} = 0.00054$ and $\pi_{1000} = 0.00009$. Curves: MC comp path dangl nodes, MC confidence interval, PI method, PI method (10th iteration).]

Different versions of MCM
INRIA Sophia Antipolis, http://www-sop.inria.fr, 50000 pages, 200000 hyperlinks
[Figure, four panels: relative error against the number of iterations (1 to 10) for the pages with $\pi_1 = 0.00409$, $\pi_{10} = 0.00103$, $\pi_{100} = 0.00054$ and $\pi_{1000} = 0.00009$. Curves: MC comp path dangl nodes (with conf. interv.), MC end point with cyclic start (with conf. interv.), MC comp path rand start.]

Conclusions
MCM with cyclic start outperforms MCM with random start.
The complete path algorithm in practice outperforms the end-point algorithm.
The PageRank of important pages is estimated well after the first iteration.
Other advantages of MCM: natural parallel implementation and possibilities for on-line update.

That's all for today... Questions? Suggestions?