
Updating PageRank

Amy Langville and Carl Meyer
Department of Mathematics, North Carolina State University, Raleigh, NC
SCCM, 11/17/2003

Indexing
- Google must index key terms on each page.
- Robots crawl the web; software does the indexing.
- Inverted file structure (like a book index: terms to pages):
      Term 1 -> P_i, P_j, ...
      Term 2 -> P_k, P_l, ...

Ranking
- Determine a PageRank for each page P_i, P_j, P_k, P_l, ...
- Query independent: based only on link structure.

Query matching
- Q = (Term 1, Term 2, ...) produces P_i, P_j, P_k, P_l, ...
- Return P_i, P_j, P_k, P_l, ... to the user in order of PageRank.
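To make the inverted-file idea concrete, here is a minimal Python sketch; the pages, terms, and names are hypothetical toy data, not Google's actual structures.

```python
# Minimal sketch of an inverted file structure (toy data, illustrative only).
from collections import defaultdict

pages = {
    "P1": "markov chains and pagerank",
    "P2": "updating pagerank by aggregation",
    "P3": "markov chains in genetics",
}

# Inverted file: term -> set of pages containing that term.
index = defaultdict(set)
for page, text in pages.items():
    for term in text.split():
        index[term].add(page)

# Query matching: pages containing every query term; in the full system these
# would be returned to the user in order of their precomputed PageRank.
query = ["markov", "chains"]
matches = set.intersection(*(index[t] for t in query))
print(sorted(matches))  # ['P1', 'P3']
```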

Google's PageRank Idea (Sergey Brin & Lawrence Page, 1998)
- Rankings are not query dependent: they depend only on link structure.
- All calculations are done off-line.
- Your page P has some rank r(P).
- Adjust r(P) higher or lower depending on the ranks of the pages that point to P.
- Importance is not the raw number of in-links or out-links:
    - One link to P from Yahoo! is important.
    - Many links to P from me are not.
    - Yahoo! points many places, so the value of its link to P is diluted.

The Definition

    r(P) = Σ_{Q ∈ B_P} r(Q)/|Q|

where B_P = {all pages pointing to P} and |Q| = number of out-links from Q.

Successive Refinement
- Start with r_0(P_i) = 1/n for all pages P_1, P_2, ..., P_n.
- Iteratively refine the ranking of each page:

      r_1(P_i) = Σ_{Q ∈ B_{P_i}} r_0(Q)/|Q|
      r_2(P_i) = Σ_{Q ∈ B_{P_i}} r_1(Q)/|Q|
      ...
      r_{j+1}(P_i) = Σ_{Q ∈ B_{P_i}} r_j(Q)/|Q|
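A minimal sketch of this successive refinement on a hypothetical three-page web (the page names and links are illustrative only):

```python
# Successive refinement r_{j+1}(P_i) = sum over Q in B_{P_i} of r_j(Q)/|Q|,
# implemented by letting each page Q distribute its current rank evenly
# over its out-links.
out_links = {                        # page -> pages it links to
    "P1": ["P2", "P3"],
    "P2": ["P3"],
    "P3": ["P1"],
}
n = len(out_links)
r = {p: 1.0 / n for p in out_links}  # r_0(P_i) = 1/n

for _ in range(50):
    r_next = {p: 0.0 for p in out_links}
    for q, targets in out_links.items():
        for p in targets:
            r_next[p] += r[q] / len(targets)
    r = r_next

print(r)  # converges here: this toy graph is irreducible and aperiodic
```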

After Step j, in Matrix Notation

    π_j^T = [ r_j(P_1), r_j(P_2), ..., r_j(P_n) ]
    π_{j+1}^T = π_j^T P,   where p_ij = 1/|P_i| if P_i links to P_j, and 0 otherwise

    PageRank = lim_{j→∞} π_j^T = π^T   (provided the limit exists)

It's a Markov Chain
- P = [p_ij] is a stochastic matrix (row sums = 1).
- Each π_j^T is a probability distribution vector (Σ_i r_j(P_i) = 1).
- π_{j+1}^T = π_j^T P is a random walk on the graph defined by the links.
- π^T = lim_{j→∞} π_j^T is the stationary probability distribution.

Random Surfer
- A web surfer randomly clicks on links (the back button is not a link).
- The long-run proportion of time spent on page P_i is π_i.

Problems
- Dead-end page (nothing to click on): π^T is not well defined.
- The surfer could get trapped in a cycle (P_i → P_j → P_i): no convergence.
- For convergence, the Markov chain must be irreducible and aperiodic.

Fix: the Bored Surfer Enters a Random URL
- Replace P by P̄ = αP + (1 − α)E, where e_ij = 1/n and α ≈ 0.85.
- Different choices of E = ev^T and α allow customization and speedup.
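A sketch of the bored-surfer fix under the slide's choices (uniform E and α = 0.85); the toy matrix is our own and is assumed already row-stochastic (dead ends patched):

```python
# Replace P by P_bar = alpha*P + (1 - alpha)*E, with E = e v^T and uniform v.
import numpy as np

alpha = 0.85
P = np.array([[0.0, 0.5, 0.5],      # hypothetical row-stochastic matrix
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
n = P.shape[0]
v = np.full(n, 1.0 / n)             # personalization vector (uniform here)
P_bar = alpha * P + (1 - alpha) * np.outer(np.ones(n), v)
print(P_bar.sum(axis=1))            # rows still sum to 1
```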

Computing π^T

A Big Problem
- Solve π^T = π^T P, i.e. π^T(I − P) = 0^T (the stationary distribution vector):
  too big for direct solves.
- Instead, start with π_0^T = e^T/n and iterate π_{j+1}^T = π_j^T P (the power
  method; a sketch follows below).

A Bigger Problem: Updating
- The link structure of the web is extremely dynamic.
- Links on CNN change frequently; links are added and deleted almost continuously.
- Google says: just start from scratch every 3 to 4 weeks.
- Old results don't help to restart.
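A power-method sketch at toy scale (the function name and tolerance are our own; the real web matrix is far too large to store densely and is handled as a sparse matrix plus a rank-one correction):

```python
# Power method for pi^T = pi^T * P_bar, starting from the uniform vector.
import numpy as np

def pagerank_power(P_bar, tol=1e-10, max_iter=1000):
    n = P_bar.shape[0]
    pi = np.full(n, 1.0 / n)                   # pi_0^T = e^T / n
    for _ in range(max_iter):
        pi_next = pi @ P_bar                   # pi_{j+1}^T = pi_j^T * P_bar
        if np.abs(pi_next - pi).sum() < tol:   # 1-norm residual test
            return pi_next
        pi = pi_next
    return pi
```

Applied to the P_bar from the previous sketch, this returns its stationary vector.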

The Updating Problem

Given: the original chain's P and π^T, and the new chain's P̃.
Find: the new chain's π̃^T.

Idea behind Aggregation
- Works best for NCD (nearly completely decomposable) systems
  (Simon and Ando, 1960s; Courtois, 1970s).
- Partition the chain into clusters C_1, C_2, C_3, estimate a stationary
  distribution Π_i^T within each cluster, and form the small aggregation matrix

      A = [a_ij],   a_ij = Π_i^T P_ij e.

- Let ξ^T = (ξ_1, ξ_2, ξ_3) be the stationary distribution of A (the coupling
  constants). Then

      Π^T ≈ ( ξ_1 Π_1^T | ξ_2 Π_2^T | ξ_3 Π_3^T ).

Pro: exploits structure to reduce work.
Con: produces an approximation; quality depends on the degree of coupling.

Iterative Aggregation
- Problem: repeated aggregation leads to a fixed point.
- Solution: do a power step to move off the fixed point, and repeat; the
  approximations improve and approach the exact solution.
- Successful with NCD systems, but not in general.

One pass:
1. Input: an approximation to Π^T.
2. Get the censored distributions Π_1^T, Π_2^T, Π_3^T.
3. Get the coupling constants ξ^T = (ξ_1, ξ_2, ξ_3).
4. Output: the approximate global stationary distribution
   ( ξ_1 Π_1^T | ξ_2 Π_2^T | ξ_3 Π_3^T ).
5. Move off the fixed point with a power step and feed the result back in.

Exact Aggregation (Meyer 1989)
- Partition P into clusters C_1, C_2, C_3, and let s_i^T be the censored
  (stationary) distribution of the stochastic complement

      S_i = P_ii + P_i* (I − P_**)^{-1} P_*i.

  For a 2-level partition, S_1 = P_11 + P_12 (I − P_22)^{-1} P_21.
- Form the aggregation matrix A = [a_ij] with a_ij = s_i^T P_ij e, and let
  ξ^T = (ξ_1, ξ_2, ξ_3) be its stationary distribution. Then

      Π^T = ( ξ_1 s_1^T | ξ_2 s_2^T | ξ_3 s_3^T )   exactly.

Pro: only one step is needed to produce the exact global vector.
Con: the stochastic complement matrices S_i are very expensive to compute.
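The stochastic complement and its censored distribution can be checked numerically at toy scale; this sketch uses a hypothetical 4-state chain split 2 + 2 (dense linear algebra, illustrative only):

```python
# S_1 = P11 + P12 (I - P22)^{-1} P21 and its censored distribution s_1^T.
import numpy as np

P = np.array([[0.6, 0.2, 0.1, 0.1],    # hypothetical 4-state stochastic matrix
              [0.3, 0.5, 0.1, 0.1],
              [0.1, 0.1, 0.5, 0.3],
              [0.1, 0.1, 0.4, 0.4]])
P11, P12 = P[:2, :2], P[:2, 2:]
P21, P22 = P[2:, :2], P[2:, 2:]

S1 = P11 + P12 @ np.linalg.solve(np.eye(2) - P22, P21)

# Censored distribution: s_1^T S_1 = s_1^T with entries summing to 1,
# i.e. the left eigenvector of S1 for eigenvalue 1.
vals, vecs = np.linalg.eig(S1.T)
k = np.argmin(np.abs(vals - 1.0))
s1 = np.real(vecs[:, k])
s1 /= s1.sum()
print(S1.sum(axis=1), s1)   # S1 is stochastic; s1 is its stationary vector
```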

Back to Updating...
- Partition the updated chain's P into clusters C_1, ..., C_{g−1}, C_g, C_{g+1}
  and form the corresponding aggregation matrix A over the same clusters.

Partitioned Matrix

    P (n×n) = [ P_11  P_12 ]  G
              [ P_21  P_22 ]  Ḡ

with each state of G kept as its own 1×1 cluster:

    P = [ p_11 ... p_1g  r_1^T ]
        [  ...      ...   ...  ]
        [ p_g1 ... p_gg  r_g^T ]
        [  c_1  ...  c_g  P_22 ]

and π^T = (π_1, ..., π_g | π_{g+1}, ..., π_n).

Advantages of this Partition
- p_11, ..., p_gg are 1×1, so their stochastic complements are 1 and their
  censored distributions are 1.
- Only one significant complement: S_2 = P_22 + P_21 (I − P_11)^{-1} P_12.
- Only one significant censored distribution: s_2^T S_2 = s_2^T.
- A/D theorem: s_2^T = (π_{g+1}, ..., π_n) / Σ_{i=g+1}^n π_i.

Aggregation Matrix

    A = [ P_11         P_12 e           ]    ((g+1) × (g+1))
        [ s_2^T P_21   1 − s_2^T P_21 e ]

The Aggregation/Disaggregation Theorem
- If α^T = (α_1, ..., α_g, α_{g+1}) is the stationary distribution of A, then

      π^T = ( α_1, ..., α_g | α_{g+1} s_2^T )

  is the stationary distribution of P.

Trouble! Always a Big Problem
- G small ⇒ Ḡ big ⇒ S_2 = P_22 + P_21 (I − P_11)^{-1} P_12 is large.
- G big ⇒ A is large.
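A small numerical check of the A/D theorem, again on a hypothetical 4-state chain with g = 2 (the stationary() helper is our own dense eigenvector solve, not part of the talk):

```python
# Verify: aggregate G-bar into one superstate via s_2^T, then disaggregate.
import numpy as np

def stationary(M):
    vals, vecs = np.linalg.eig(M.T)
    k = np.argmin(np.abs(vals - 1.0))
    v = np.real(vecs[:, k])
    return v / v.sum()

P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.3, 0.5, 0.1, 0.1],
              [0.1, 0.1, 0.5, 0.3],
              [0.1, 0.1, 0.4, 0.4]])
g = 2
P11, P12, P21, P22 = P[:g, :g], P[:g, g:], P[g:, :g], P[g:, g:]

S2 = P22 + P21 @ np.linalg.solve(np.eye(2) - P11, P12)
s2 = stationary(S2)                     # exact censored distribution on G-bar

e = np.ones(2)
A = np.block([[P11, (P12 @ e)[:, None]],
              [(s2 @ P21)[None, :], np.array([[1.0 - s2 @ P21 @ e]])]])
alpha = stationary(A)                   # stationary vector of the small chain
pi = np.concatenate([alpha[:g], alpha[g] * s2])
print(np.allclose(pi, stationary(P)))   # True: the disaggregation is exact
```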

Approximate Aggregation
- Assumption: updating involves relatively few states, so G is small and

      A = [ P_11         P_12 e           ]    ((g+1) × (g+1))
          [ s_2^T P_21   1 − s_2^T P_21 e ]

  is small.
- Approximation: let φ^T be the old PageRank vector and π^T the new, updated
  PageRank vector, and take (π_{g+1}, ..., π_n) ≈ (φ_{g+1}, ..., φ_n), so that

      s_2^T = (π_{g+1}, ..., π_n) / Σ_{i=g+1}^n π_i
            ≈ (φ_{g+1}, ..., φ_n) / Σ_{i=g+1}^n φ_i = s̃_2^T

  (this avoids computing s_2^T from the large complement S_2).
- Then A ≈ Ã, α^T ≈ α̃^T, and

      π^T ≈ π̃^T = ( α̃_1, ..., α̃_g | α̃_{g+1} s̃_2^T )   (not bad).

Iterative Aggregation
- Improve by successive aggregation/disaggregation? No: A/D can't be applied
  twice in a row, because a fixed point emerges.
- Solution: perturb the A/D output to move off the fixed point, pushing it in
  the direction of the solution of π^T = π^T P.

The Iterative A/D Updating Algorithm (sketched in code below)
1. Determine the G-set partition S = G ∪ Ḡ.
2. An approximate A/D step generates an approximation π^T.
3. Smooth the result: π^T ← π^T P (a power step).
4. Use the smoothed π^T as input to another approximate aggregation step.
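Here is a dense, toy-scale sketch of that loop, assuming the first g states form G and phi is the old PageRank vector; the helper names are ours, and the real algorithm exploits sparsity rather than forming dense blocks:

```python
# Iterative A/D updating: approximate aggregation step + power (smoothing) step.
import numpy as np

def stationary(M):
    vals, vecs = np.linalg.eig(M.T)
    k = np.argmin(np.abs(vals - 1.0))
    v = np.real(vecs[:, k])
    return v / v.sum()

def iterative_ad_update(P, g, phi, tol=1e-10, max_iter=200):
    n = P.shape[0]
    P11, P12, P21, P22 = P[:g, :g], P[:g, g:], P[g:, :g], P[g:, g:]
    e = np.ones(n - g)
    pi = phi.copy()
    for _ in range(max_iter):
        s2 = pi[g:] / pi[g:].sum()        # cheap approximate censored dist
        A = np.block([[P11, (P12 @ e)[:, None]],
                      [(s2 @ P21)[None, :], np.array([[1.0 - s2 @ P21 @ e]])]])
        alpha = stationary(A)             # small (g+1)-state solve
        pi_new = np.concatenate([alpha[:g], alpha[g] * s2])
        pi_new = pi_new @ P               # smoothing (power) step
        if np.abs(pi_new - pi).sum() < tol:
            return pi_new
        pi = pi_new
    return pi
```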

How to Partition for the Updating Problem?

Intuition
- There are some bad states (G) and some good states (Ḡ); give more attention
  to the bad states.
- Each state in G forms its own partitioning level: much progress toward the
  correct PageRank is made during the aggregation step.
- Lump the good states in Ḡ into one superstate: progress toward the correct
  PageRank is made during the smoothing step (power iteration).

Definitions for Good and Bad
1. Good = states least likely to have π_i change.
   Bad = states most likely to have π_i change.
2. Good = states with the smallest π_i after k transient steps.
   Bad = states nearby, with the largest π_i after k transient steps.
3. Good = states with the smallest π_i in the old PageRank vector.
   Bad = states with the largest π_i in the old PageRank vector.
4. Good = fast-converging states.
   Bad = slow-converging states.

Determining Fast and Slow
- Consider the power method and its rate of convergence:

      π_k^T = π_0^T P^k = π^T + λ_2^k (π_0^T x_2) y_2^T + λ_3^k (π_0^T x_3) y_3^T
              + ... + λ_n^k (π_0^T x_n) y_n^T.

- The asymptotic rate of convergence is the rate at which λ_2^k → 0.
- Element-wise, some states converge to their stationary values faster than the
  λ_2 rate, due to the left-hand eigenvector y_2^T.

Partitioning Rule
- Put the states with the largest entries of y_2^T in the bad group G, where
  they receive more individual attention in the aggregation method.

Practicality
- Computing y_2^T is expensive, but for the PageRank problem Kamvar et al. show
  that states with large π_i are slow-converging.
- Inexpensive solution: use the old π^T to determine G (adaptively
  approximating y_2^T).
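The inexpensive rule amounts to sorting the old PageRank vector and taking the largest entries as G; a one-function sketch (the function name, set size, and data are hypothetical):

```python
# Cheap partitioning rule: largest old-PageRank states form the bad set G.
import numpy as np

def choose_G(phi_old, g_size):
    """Return indices of the g_size largest entries of the old PageRank."""
    return np.argsort(phi_old)[::-1][:g_size]

phi_old = np.array([0.02, 0.40, 0.05, 0.30, 0.23])
print(choose_G(phi_old, 2))   # [1 3]: the slow-converging, high-rank states
```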

Power Law for PageRank
- The scale-free model of the web network creates power laws
  (Kamvar, Barabasi, Raghavan).

Convergence Theorem
- Always converges to the stationary distribution π^T of P.
- Converges for all partitions S = G ∪ Ḡ.
- Rate of convergence is the rate at which S_2^n converges, where
  S_2 = P_22 + P_21 (I − P_11)^{-1} P_12; it is dictated by the Jordan
  structure of λ_2(S_2).
- If λ_2(S_2) is simple, then π_k^T → π^T at the rate at which λ_2(S_2)^n → 0.

The Game
- The goal now is to find a relatively small G that minimizes λ_2(S_2).

Experiments

Test networks from a crawl of the web (supplied by Ronny Lempel):
- Censorship: 562 nodes, 736 links (sites concerning censorship on the net)
- Movies: 451 nodes, 713 links (sites concerning movies)
- MathWorks: 517 nodes, 13,531 links (supplied by Cleve Moler)
- Abortion: 1,693 nodes, 4,325 links (sites concerning abortion)
- Genetics: 2,952 nodes, 6,485 links (sites concerning genetics)

Parameters (different values have little effect on results)
- Number of nodes (states) added: 3
- Number of nodes (states) removed: 5
- Number of links added: 10
- Number of links removed: 20
- Stopping criterion: 1-norm of residual < 10^-10

Censorship (562 nodes, 736 links)

Power method: 38 iterations, time 1.40

Iterative aggregation:
    |G|   Iterations   Time
      5       38       1.68
     10       38       1.66
     15       38       1.56
     20       20       1.06
     25       20       1.05
     50       10       0.69
    100        8       0.55
    200        6       0.53
    300        6       0.65
    400        5       0.70

Movies (451 nodes, 713 links)

Power method: 17 iterations, time 0.40

Iterative aggregation:
    |G|   Iterations   Time
      5       12       0.39
     10       12       0.37
     15       11       0.36
     20       11       0.35
     25       11       0.31
     50        9       0.31
    100        9       0.33
    200        8       0.35
    300        7       0.39
    400        6       0.47

MathWorks (517 nodes, 13,531 links)

Power method: 54 iterations, time 1.25

Iterative aggregation:
    |G|   Iterations   Time
      5       53       1.18
     10       52       1.29
     15       52       1.23
     20       42       1.05
     25       20       1.13
     50       18       0.70
    100       16       0.70
    200       13       0.70
    300       11       0.83
    400       10       1.01

Abortion (1,693 nodes, 4,325 links)

Power method: 106 iterations, time 37.08

Iterative aggregation:
    |G|   Iterations   Time
      5      109       38.56
     10      105       36.02
     15      107       38.05
     20      107       38.45
     25       97       34.81
     50       53       18.80
    100       13        5.18
    250       12        5.62
    500        6        5.21
    750        5       10.22
   1000        5       14.61

Genetics (2,952 nodes, 6,485 links)

Power method: 92 iterations, time 91.78

Iterative aggregation:
    |G|   Iterations   Time
      5       91       88.22
     10       92       92.12
     20       71       72.53
     50       25       25.42
    100       19       20.72
    250       13       14.97
    500        7       11.14
   1000        5       17.76
   1500        5       31.84

Conclusions
- First updating algorithm to handle both element and state updates.
- The algorithm is very sensitive to the partition.
- For the PageRank problem, the partition can be determined cheaply from the
  old PageRanks.
- For general Markov updating, use y_2^T to determine the partition; when that
  is too expensive, approximate it adaptively with Aitken's δ² or the
  difference of successive iterates.

Improvements
- Practical: optimize the G-set; accelerate the smoothing step.
- Theoretical: the relationship between partitioning by y_2^T and λ_2(S_2) is
  not well understood.
- Prediction: the algorithm, with the partition determined by the old π^T,
  will work very well on other scale-free networks.