PageRank: Ranking of nodes in graphs

PageRank: Ranking of nodes in graphs Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ October 4, 207 Introduction to Random Processes Ranking of nodes in graphs

PageRank: Random walk Ranking of nodes in graphs: Random walk Ranking of nodes in graphs: Probability propagation Introduction to Random Processes Ranking of nodes in graphs 2

Graphs 5 4 6 2 3 Graph A set of V of vertices or nodes j =,..., J Connected by a set of edges E defined as ordered pairs (i, j) In figure Nodes are V = {, 2, 3, 4, 5, 6} Edges E = {(, 2), (, 5),(2, 3), (2, 5), (3, 4),... (3, 6), (4, 5), (4, 6), (5, 4)} Ex. : Websites and hyperlinks World Wide Web (WWW) Ex. 2: People and friendship Social network Introduction to Random Processes Ranking of nodes in graphs 3

How well connected nodes are? 5 4 6 2 3 Q: Which node is the most connected? A: Define most connected Can define most connected in different ways Two important connectivity indicators ) How many links point to a node (outgoing links irrelevant) 2) How important are the links that point to a node Node rankings to measure website relevance, social influence Introduction to Random Processes Ranking of nodes in graphs 4

Connectivity ranking Key insight: There is information in the structure of the network Knowledge is distributed through the network The network (not the nodes) knows the rankings Idea exploited by Google s PageRank c to rank webpages... by social scientists to study trust & reputation in social networks... by ISI to rank scientific papers, transactions & magazines... No one points to 5 4 6 Only points to 2 Only 2 points to 3, but 2 more important than 4 as high as 5 with less links 2 3 Links to 5 have lower rank Same for 6 Introduction to Random Processes Ranking of nodes in graphs 5

Preliminary definitions Graph G = (V, E) vertices V = {, 2,..., J} and edges E 5 4 6 2 3 Outgoing neighborhood of i is the set of nodes j to which i points n(i) := {j : (i, j) E} Incoming neighborhood, n (i) is the set of nodes that point to i: n (i) := {j : (j, i) E} Strongly connected G directed path joining any pair of nodes Introduction to Random Processes Ranking of nodes in graphs 6

Definition of rank Agent A chooses node i, e.g., web page, at random for initial visit Next visit randomly chosen between links in the neighborhood n(i) All neighbors chosen with equal probability If reach a dead end because node i has no neighbors Chose next visit at random equiprobably among all nodes Redefine graph G = (V, E) adding edges from dead ends to all nodes Restrict attention to connected (modified) graphs 5 4 6 2 3 Rank of node i is the average number of visits of agent A to i Introduction to Random Processes Ranking of nodes in graphs 7

Equiprobable random walk Formally, let A n be the node visited at time n Define transition probability P ij from node i into node j P ij := P ( A n+ = j An = i ) Next visit equiprobable among i s N i := n(i) neighbors P ij = n(i) = N i, for all j n(i) /2 /2 5 4 /2 /2 /2 /2 2 3 /2 /2 6 /5 /5 /5 /5 /5 to to 2 to 3 to 4 to 5 Still have a graph But also a MC Red (not blue) circles Introduction to Random Processes Ranking of nodes in graphs 8

Formal definition of rank Def: Rank r i of i-th node is the time average of number of visits r i := lim n n n I {A m = i} m= Define vector of ranks r := [r, r 2,..., r J ] T Rank r i can be approximated by average r ni at time n r ni := n n I {A m = i} m= Since lim r ni = r i, it holds r ni r i for n sufficiently large n Define vector of approximate ranks r n := [r n, r n2,..., r nj ] T If modified graph is connected, rank independent of initial visit Introduction to Random Processes Ranking of nodes in graphs 9

Ranking algorithm Output : Vector r(i) with ranking of node i Input : Scalar n indicating maximum number of iterations Input : Vector N(i) containing number of neighbors of i Input : Matrix N(i, j) containing indices j of neighbors of i m = ; r=zeros(j,); % Initialize time and ranks A 0 = random( unid,j); % Draw first visit uniformly at random while m < n do jump = random( unid,n Am ); % Neighbor uniformly at random A m = N(A m, jump); % Jump to selected neighbor r(a m ) = r(a m ) + ; % Update ranking for A m m = m + ; end r = r/n; % Normalize by number of iterations n Introduction to Random Processes Ranking of nodes in graphs 0

Social graph example Asked probability students about homework collaboration Created (crude) graph of the social network of students in the class Used ranking algorithm to understand connectedness Ex: I want to know how well students are coping with the class Best to ask people with higher connectivity ranking 2009 data from UPenn s ECE440 Introduction to Random Processes Ranking of nodes in graphs

Ranked class graph Harish Venkatesan Jacci Jeffries Pallavi Yerramilli Xiang-Li Lim Thomas Cassel Owen Tian Eric Lamb Daniela Savoia Ceren Dumaz Priya Takiar Lindsey Eatough Sugyan Lohiaa Ankit Aggarwal Lisa Zheng Madhur Agarwal Anthony Dutcher Robert Feigenberg Carolina Lee Aarti Kochhar Ciara Kennedy Amanda Smith Saksham Karwal Michael Harker Amanda Zwarenstein Ranga Ramachandran Varun Balan Katie Joo Shahid Bosan Pia Ramchandani Jesse Beyroutey Rebecca Gittler Ivan Levcovitz Jane Kim Paul Deren Alexandra Malikova Jihyoung Ahn Chris Setian Charles Jeon Ella Kim Aditya Kaji Introduction to Random Processes Ranking of nodes in graphs 2

Convergence metrics Recall r is vector of ranks and r n of rank iterates By definition lim r n = r. How fast r n converges to r (r given)? n Can measure by l 2 distance between r and r n ( J ) /2 ζ n := r r n 2 = (r ni r i ) 2 If interest is only on highest ranked nodes, e.g., a web search Denote r (i) as the index of the i-th highest ranked node Let r (i) n i= be the index of the i-th highest ranked node at time n First element wrongly ranked at time n ξ n := arg min{r (i) r n (i) } i Introduction to Random Processes Ranking of nodes in graphs 3

Evaluation of convergence metrics 0 Distance correctly ranked nodes 0 0 0 Distance close to 0 2 in 5 0 3 iterations 0 2 0 000 2000 3000 4000 5000 6000 7000 8000 9000 0000 time (n) 4 2 First element wrongly ranked Bad: Two highest ranks in 4 0 3 iterations Awful: Six best ranks in 8 0 3 iterations correctly ranked nodes 0 8 6 4 (Very) slow convergence 2 0 0 000 2000 3000 4000 5000 6000 7000 8000 9000 0000 time (n) Introduction to Random Processes Ranking of nodes in graphs 4

When does this algorithm converge? Cannot confidently claim convergence until 0 5 iterations Beyond particular case, slow convergence inherent to algorithm 40 35 30 correctly ranked nodes 25 20 5 0 5 0 0 0.2 0.4 0.6 0.8 time (n).2.4.6.8 2 x 0 5 Example has 40 nodes, want to use in network with 0 9 nodes! Leverage properties of MCs to obtain a faster algorithm Introduction to Random Processes Ranking of nodes in graphs 5

PageRank: Probability propagation Ranking of nodes in graphs: Random walk Ranking of nodes in graphs: Probability propagation Introduction to Random Processes Ranking of nodes in graphs 6

Limit probabilities Recall definition of rank r i := lim n n n I {A m = i} m= Rank is time average of number of state visits in a MC Can be as well obtained from limiting probabilities Recall transition probabilities P ij = N i, for all j n(i) Stationary distribution π = [π, π,..., π J ] T solution of π i = j n (i) P ji π j = π j N j j n (i) Plus normalization equation J i= π i = for all i As per ergodicity of MC (strongly connected G) r = π Introduction to Random Processes Ranking of nodes in graphs 7

Matrix notation, eigenvalue problem As always, can define matrix P with elements P ij π i = J P ji π j = P ji π j j n (i) j= for all i Right hand side is just definition of a matrix product leading to π = P T π, π T = Also added normalization equation Idea: solve system of linear equations or eigenvalue problem on P T Requires matrix P available at a central location Computationally costly (sparse matrix P with 0 8 entries) Introduction to Random Processes Ranking of nodes in graphs 8

What are limit probabilities? Let p i (n) denote probability of agent A visiting node i at time n p i (n) := P (A n = i) Probabilities at time n + and n can be related P (A n+ = i) = P ( A n+ = i A n = j ) P (A n = j) j n (i) Which is, of course, probability propagation in a MC p i (n + ) = P ji p j (n) j n (i) By definition limit probabilities are (let p(n) = [p (n),..., p J (n)] T ) lim p(n) = π = r n Compute ranks from limit of propagated probabilities Introduction to Random Processes Ranking of nodes in graphs 9

Probability propagation Can also write probability propagation in matrix form p i (n + ) = J P ji p j (n) = P ji p j (n) j n (i) j= for all i Right hand side is just definition of a matrix product leading to p(n + ) = P T p(n) Idea: can approximate rank by large n probability distribution r = lim p(n) p(n) for n sufficiently large n Introduction to Random Processes Ranking of nodes in graphs 20

Ranking algorithm Algorithm is just a recursive matrix product, a power iteration Output : Vector r(i) with ranking of node i Input : Scalar n indicating maximum number of iterations Input : Matrix P containing transition probabilities m = ; % Initialize time r=(/j)ones(j,); % Initial distribution uniform across all nodes while m < n do r = P T r; % Probability propagation m = m + ; end Introduction to Random Processes Ranking of nodes in graphs 2

Interpretation of probability propagation Q: Why does the random walk converge so slow? A: Need to register a large number of agent visits to every state Ex: 40 nodes, say 00 visits to each 4 0 3 iters. Smart idea: Unleash a large number of agents K r i = lim n n n m= K K I {A km = i} k= Visits are now spread over time and space Converges K times faster But haven t changed computational cost Introduction to Random Processes Ranking of nodes in graphs 22

Interpretation of prob. propagation (continued) Q: What happens if we unleash infinite number of agents K? n K r i = lim lim I {A km = i} n n K K m= Using law of large numbers and expected value of indicator function n n r i = lim E [I {A m = i}] = lim P (A m = i) n n n n m= Graph walk is an ergodic MC, then r i = lim n n n m= k= m= lim P (A m = i) exists, and m p i (m) = lim n p i(n) Probability propagation Unleashing infinitely many agents Introduction to Random Processes Ranking of nodes in graphs 23

Distance to rank Initialize with uniform probability distribution p(0) = (/J) Plot distance between p(n) and r 0 0 0 Distance 0 2 0 3 0 4 0 20 40 60 80 00 20 40 time (n) Distance is 0 2 in 30 iters., 0 4 in 40 iters. Convergence two orders of magnitude faster than random walk Introduction to Random Processes Ranking of nodes in graphs 24

Number of nodes correctly ranked Rank of highest ranked node that is wrongly ranked by time n 40 35 correctly ranked nodes 30 25 20 5 0 5 0 0 20 40 60 80 00 20 40 time (n) Not bad: All nodes correctly ranked in 20 iterations Good: Ten best ranks in 70 iterations Great: Four best ranks in 20 iterations Introduction to Random Processes Ranking of nodes in graphs 25

Distributed algorithm to compute ranks Nodes want to compute their rank r i Can communicate with neighbors only (incoming + outgoing) Access to neighborhood information only Recall probability update p i (n + ) = j n (i) Uses local information only P ji p j (n) = j n (i) N j p j (n) Distributed algorithm. Nodes keep local rank estimates r i (n) Receive rank (probability) estimates rj (n) from neighbors j n (i) Update local rank estimate ri (n + ) = j n (i) r j(n)/n j Communicate rank estimate ri (n + ) to outgoing neighbors j n(i) Only need to know the number of neighbors of my neighbors Introduction to Random Processes Ranking of nodes in graphs 26

Distributed implementation of random walk Can communicate with neighbors only (incoming + outgoing) But cannot access neighborhood information Pass agent ( hot potato ) around Local rank estimates r i (n) and counter with number of visits V i Algorithm run by node i at time n if Agent received from neighbor then V i = V i + Choose random neighbor Send agent to chosen neighbor end n = n + ; r i (n) = V i /n; Speed up convergence by generating many agents to pass around Introduction to Random Processes Ranking of nodes in graphs 27

Comparison of different algorithms Random walk (RW) implementation Most secure. No information shared with other nodes Implementation can be distributed Convergence exceedingly slow System of linear equations Least security. Graph in central server Distributed implementation not clear Convergence not an issue But computationally costly to obtain approximate solutions Probability propagation Somewhat secure. Information shared with neighbors only Implementation can be distributed Convergence rate acceptable (orders of magnitude faster than RW) Introduction to Random Processes Ranking of nodes in graphs 28

Glossary Graph, nodes and edges Connectivity indicators Node ranking Google s PageRank Node s neighborhood Strong connectivity Random walk on a graph Long-run fraction of state visits Ranking algorithm Convergence metrics Computational cost Probability propagation Power method Distributed algorithm Security Introduction to Random Processes Ranking of nodes in graphs 29