Ranking of nodes in graphs

Similar documents
PageRank: Ranking of nodes in graphs

Centrality Measures and Link Analysis

0.1 Naive formulation of PageRank

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Online Social Networks and Media. Link Analysis and Web Search

Markov chains (week 6) Solutions

COMPSCI 514: Algorithms for Data Science

Uncertainty and Randomization

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Online Social Networks and Media. Link Analysis and Web Search

Lecture 12: Link Analysis for Web Retrieval

A Note on Google s PageRank

ECEN 689 Special Topics in Data Science for Communications Networks

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Link Analysis Ranking

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Graph Models The PageRank Algorithm

Approximate Inference

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Link Mining PageRank. From Stanford C246

CPSC 540: Machine Learning

No class on Thursday, October 1. No office hours on Tuesday, September 29 and Thursday, October 1.

Data and Algorithms of the Web

Updating PageRank. Amy Langville Carl Meyer

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

The PageRank Problem, Multi-Agent. Consensus and Web Aggregation

Web Structure Mining Nodes, Links and Influence

Markov Models. CS 188: Artificial Intelligence Fall Example. Mini-Forward Algorithm. Stationary Distributions.

Complex Social System, Elections. Introduction to Network Analysis 1

Data Mining Recitation Notes Week 3

Data Mining and Matrices

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Page rank computation HPC course project a.y

CS 188: Artificial Intelligence Fall Recap: Inference Example

CPSC 540: Machine Learning

Distributed Systems Gossip Algorithms

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Math 304 Handout: Linear algebra, graphs, and networks.

CS6220: DATA MINING TECHNIQUES

Announcements. CS 188: Artificial Intelligence Fall VPI Example. VPI Properties. Reasoning over Time. Markov Models. Lecture 19: HMMs 11/4/2008

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Announcements. CS 188: Artificial Intelligence Fall Markov Models. Example: Markov Chain. Mini-Forward Algorithm. Example

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

Link Analysis. Stony Brook University CSE545, Fall 2016

Computing PageRank using Power Extrapolation

CS6220: DATA MINING TECHNIQUES

Hidden Markov Models. Hal Daumé III. Computer Science University of Maryland CS 421: Introduction to Artificial Intelligence 19 Apr 2012

Slides based on those in:

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

CS 188: Artificial Intelligence Spring 2009

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

INTRODUCTION TO MCMC AND PAGERANK. Eric Vigoda Georgia Tech. Lecture for CS 6505

IR: Information Retrieval

The Theory behind PageRank

Lecture: Local Spectral Methods (1 of 4)

Today. Next lecture. (Ch 14) Markov chains and hidden Markov models

DS504/CS586: Big Data Analytics Graph Mining II

Lecture 14: Random Walks, Local Graph Clustering, Linear Programming

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

The Static Absorbing Model for the Web a

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Link Analysis. Leonid E. Zhukov

At the boundary states, we take the same rules except we forbid leaving the state space, so,.

1998: enter Link Analysis

Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports

8. Statistical Equilibrium and Classification of States: Discrete Time Markov Chains

CS 188: Artificial Intelligence Spring Announcements

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

Modelling data networks stochastic processes and Markov chains

Sequential Decision Problems

6 Markov Chain Monte Carlo (MCMC)

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Monte Carlo methods in PageRank computation: When one iteration is sufficient

Discrete-time Consensus Filters on Directed Switching Graphs

Lecture 11 October 11, Information Dissemination through Social Networks

Machine Learning CPSC 340. Tutorial 12

Propp-Wilson Algorithm (and sampling the Ising model)

A Tunable Mechanism for Identifying Trusted Nodes in Large Scale Distributed Networks

Review: TD-Learning. TD (SARSA) Learning for Q-values. Bellman Equations for Q-values. P (s, a, s )[R(s, a, s )+ Q (s, (s ))]

Markov Chain Monte Carlo The Metropolis-Hastings Algorithm

Algebraic Representation of Networks

Lesson Plan. AM 121: Introduction to Optimization Models and Methods. Lecture 17: Markov Chains. Yiling Chen SEAS. Stochastic process Markov Chains

Markov Chains. October 5, Stoch. Systems Analysis Markov chains 1

The PageRank Computation in Google: Randomization and Ergodicity

Finding central nodes in large networks

The Second Eigenvalue of the Google Matrix

Designing Information Devices and Systems I Fall 2018 Homework 5

Combating Web Spam with TrustRank

25.1 Markov Chain Monte Carlo (MCMC)

1 Searching the World Wide Web

Transcription:

Ranking of nodes in graphs Alejandro Ribeiro & Cameron Finucane Dept. of Electrical and Systems Engineering University of Pennsylvania aribeiro@seas.upenn.edu http://www.seas.upenn.edu/users/~aribeiro/ October 3, 200 Stoch. Systems Analysis Ranking of nodes in graphs

Ranking of nodes in graphs: Random walk Ranking of nodes in graphs: Random walk Ranking of nodes in graphs: Markov chain Stoch. Systems Analysis Ranking of nodes in graphs 2

Graphs 5 4 6 2 3 Graph A set of J nodes j =,..., J Connected by a set of edges E defined as ordered pairs (i, j) In figure nodes are j =, 2, 3, 4, 5, edges E = {(, 2), (, 5),(2, 3), (2, 5), (3, 4),... (3, 6), (4, 5), (4, 6), (5, 4)} Websites and links The web. People and friendship Social network Stoch. Systems Analysis Ranking of nodes in graphs 3

How well connected nodes are? 5 4 6 2 3 Q: Which node is the most connected? A: Define most connected Can define most connected in different ways Node rankings to measure website quality, social influence There are two important connectivity indicators How many nodes point to a link (outgoing links irrelevant) How Important are the links that point to a node Stoch. Systems Analysis Ranking of nodes in graphs 4

Connectivity ranking Insight There is information in the structure of the network Knowledge is distributed through the network The network (not the nodes) know the rankings Idea exploited by Google s PageRank c to rank webpages... by social scientists to study trust & reputation in social networks... by ISI to rank scientific papers, transactions & magazines... No one points to 5 4 6 Only points to 2 Only 2 points to 3, but 2 more important than 4 as high as 5 with less links 2 3 Links to 5 have lower rank Same for 6 Stoch. Systems Analysis Ranking of nodes in graphs 5

Preliminary definitions Graph G = (V, E) sets of vertices V = {, 2,..., J} and edges E Edges (elements of E) are ordered pairs (i, j) We say there is a connection from i to j Outgoing neighborhood of i is the set of nodes j to which i points n(i) := {j : (i, j) E} Incoming neighborhood, n (i) is the set of nodes that point to i: n (i) := {j : (j, i) E} Connected graph There is a path from any node to any other node Stoch. Systems Analysis Ranking of nodes in graphs 6

Definition of rank Agent A chooses node i, e.g., web page, at random for initial visit Next visit randomly chosen between links in the neighborhood n(i) All neighbors chosen with equal probability If reach a dead end because node i has no neighbors Chose next visit at random equiprobably among all nodes Redefine graph G = (V, E) adding edges from dead ends to all nodes Restrict attention to connected (modified) graphs 5 4 6 2 3 Rank of node i is the average number of visits of A to i Stoch. Systems Analysis Ranking of nodes in graphs 7

Equiprobable random walk Formally, let A n be the node visited at time n Define transition probability P ij from node i into node j P ij := P [ A n+ = j A n = i ] Next visit equiprobable among neighbors P ij = #[n(i)] = N i, Defined number of neighbors N i = #[n(i)] for all j n(i) /2 /2 5 4 /2 /2 /2 /2 2 3 /2 /2 6 /5 /5 /5 /5 /5 to to 2 to 3 to 4 to 5 Still have a graph But also a MC Red (not blue) circles Stoch. Systems Analysis Ranking of nodes in graphs 8

Formal definition of rank Consider variable I {A m = i} to indicate visit to state i at time m. Rank r i of i-th node defined as time average of number of visits, i.e., r i := lim n n n I {A m = i} m= Define vector of ranks r := [r, r 2,..., r J ] T Rank r i can be approximated by average r ni at time n r ni := n n I {A m = i} m= Since lim n r ni = r i, it holds r ni r i for n sufficiently large Define vector of approximate ranks r n := [r n, r n2,..., r nj ] T If modified graph is connected, rank independent of initial visit Stoch. Systems Analysis Ranking of nodes in graphs 9

Algorithm Output : Vector r(i) with ranking of node i Input : Vector N(i) containing number of neighbors of i Input : Matrix N(i, k) containing indices j of neighbors of i m = ; r=zeros(j,); % Initialize time and ranks A 0 = random( unid,j); % Draw first visit uniformly at random while m < n do jump = random( unid,n Am ); % Neighbor uniformly at random A m = N(A m, jump); % Jump to selected neighbor r(a m ) = r(a m ) + ; % Update ranking for A m m = m + ; end r = r/n; % Normalize by number of iterations n Stoch. Systems Analysis Ranking of nodes in graphs 0

Example: Social graph Asked students taking ESE303 about homework collaboration Created (crude) graph of the social network of students in this class Used ranking algorithm to understand connectedness E.g., If I want to know how well students are coping with the class it is best to ask people with higher connectivity ranking 2009 data Students in 200 don t show up in class in enough numbers Stoch. Systems Analysis Ranking of nodes in graphs

Ranked class graph Harish Venkatesan Jacci Jeffries Pallavi Yerramilli Xiang-Li Lim Thomas Cassel Owen Tian Eric Lamb Daniela Savoia Ceren Dumaz Priya Takiar Lindsey Eatough Sugyan Lohiaa Ankit Aggarwal Lisa Zheng Madhur Agarwal Anthony Dutcher Robert Feigenberg Carolina Lee Aarti Kochhar Ciara Kennedy Amanda Smith Saksham Karwal Michael Harker Amanda Zwarenstein Ranga Ramachandran Varun Balan Katie Joo Shahid Bosan Pia Ramchandani Jesse Beyroutey Rebecca Gittler Ivan Levcovitz Jane Kim Paul Deren Alexandra Malikova Jihyoung Ahn Chris Setian Charles Jeon Ella Kim Aditya Kaji Stoch. Systems Analysis Ranking of nodes in graphs 2

Convergence metrics Recall r is vector of ranks and r n of rank iterates By definition lim r n = r. How fast r n converges to r (r given)? n Can measure by distance between r and r n ( J ) /2 ζ n := r r n 2 = (r ni r i ) 2 If interest is only on largest ranked nodes, e.g., a web search Denote r (i) as the index of the i-th highest ranked node Similarly, r (i) n i= is the index of the i-th highest ranked node at time n First element wrongly ranked at time n ξ n := min r (i) r n (i) i Stoch. Systems Analysis Ranking of nodes in graphs 3

Evaluation of convergence metrics 0 Distance correctly ranked nodes 0 0 0 Distance gets close to 0 2 in approx. 5 0 3 iterations 0 2 0 000 2000 3000 4000 5000 6000 7000 8000 9000 0000 time (n) 4 2 First element wrongly ranked Bad: Two largest ranks in 3 0 3 iterations Awful: Six best ranks in 8 0 3 iterations correctly ranked nodes 0 8 6 4 Convergence appears (very) slow 2 0 0 000 2000 3000 4000 5000 6000 7000 8000 9000 0000 time (n) Stoch. Systems Analysis Ranking of nodes in graphs 4

When does this algorithm converge? Can confidently claim convergence not until 0 5 iterations True for particular case. Slow convergence inherent to algorithm Example has 40 nodes, want to use in network with 0 9 nodes 40 35 30 correctly ranked nodes 25 20 5 0 5 0 0 0.2 0.4 0.6 0.8 time (n).2.4.6.8 2 x 0 5 Use fact that this process is a MC to obtain faster algorithm Stoch. Systems Analysis Ranking of nodes in graphs 5

Ranking of nodes in graphs: Markov chain Ranking of nodes in graphs: Random walk Ranking of nodes in graphs: Markov chain Stoch. Systems Analysis Ranking of nodes in graphs 6

Limit probabilities Recall definition of rank r i := lim t t t I{A(u) = i} u= Rank is time average of number of state visits in a MC Can be equally obtained from limiting probabilities Recall transition probabilities P ij = N i, for all j n(i) Stationary distribution π = [π, π,..., π J ] T solution of π i = j n (i) P ji π j = π j N j j n (i) Plus normalization equation J i= π i = As per ergodicity r = π for all i Stoch. Systems Analysis Ranking of nodes in graphs 7

Matrix notation, eigenvalue problem As always, can define matrix P with elements P ij π i = J P ji π j = P ji π j j n (i) j= for all i Right hand side is just definition of a matrix product leading to π = P T π, π T = Also added normalization equation Can solve as system of linear equations or eigenvalue problem on P T Non-iterative method Convergence not an issue But requires matrix P available at a central location Computationally costly (matrix P with 0 9 rows and columns) All methods are costly to compute exact solution This one is costly to find even approximate solution Stoch. Systems Analysis Ranking of nodes in graphs 8

What are limit probabilities? Let p i (n) denote probability of agent A visiting node i at time t p i (n) := P [A n = i] Probabilities at time n + and n can be related P [A n+ = i] = P [ A n = i An = i ] P [A n = j] j n (i) Which is, of course, probability propagation in a MC p i (n + ) = P ji p j (n) j n (i) By definition limit probabilities are (let p(n) = [p (n),..., p J (n)] T ) lim p(n) = π = r n Compute ranks from limit of probability propagation Stoch. Systems Analysis Ranking of nodes in graphs 9

Probability propagation Can also write probability propagation in matrix form p i (n + ) = J P ji p j (n) = P ji p j (n) j n (i) j= for all i Right hand side is just definition of a matrix product leading to p(n + ) = P T p(n) Can approximate rank by probability distribution r = lim n p(n) p(n) for n sufficiently large Stoch. Systems Analysis Ranking of nodes in graphs 20

Algorithm Algorithm is just a recursive matrix product Output : Vector r(i) with ranking of node i Input : Matrix P containing transition probabilities m = ; % Initialize time r=(/j)ones(j,); % Initial distribution uniform across all nodes while m < n do r = P T r; % Probability propagation m = m + ; end Stoch. Systems Analysis Ranking of nodes in graphs 2

Interpretation of probability propagation Why does the random walk converges so slow? What does it take to obtain a time average r ni close to r i? Need to register a large number of agent visits to every state Back of hand: 40 nodes, some 00 visits to each 4 0 3 iters. Idea: Unleash a large number of agents K r i = lim n n n m= K K I {A km = i} k= Visits are now spread over time and space Converges K times faster (depends agents initial distribution) But haven t changed computational cost Stoch. Systems Analysis Ranking of nodes in graphs 22

Interpretation of prob. propagation (continued) What happens if we unleash infinite number of agents K? n K r i = lim lim I {A km = i} n n K K m= Using law of large numbers and expected value of indicator function n n r i = lim E [I {A m = i}] = lim P [A m = i] n n n n m= Graph walk is a MC, then r i = lim n n k= m= lim P [A m = i] = lim p i(m) exists, and m m n m= p i (m) = lim n p i(n) Probability propagation Unleashing infinite number of agents Interpretation true for any MC Stoch. Systems Analysis Ranking of nodes in graphs 23

Distance to rank Initialize with uniform probability distribution p(0) = (/J) Distance between p(n) and r 0 0 0 Distance 0 2 0 3 0 4 0 20 40 60 80 00 20 40 time (n) Distance is 0 2 in approximately 30 iterations, 0 4 in 40 iterations Convergence is two orders of magnitude faster than random walk Stoch. Systems Analysis Ranking of nodes in graphs 24

Number of nodes correctly ranked Rank of highest ranked node that is wrongly ranked by time n 40 35 correctly ranked nodes 30 25 20 5 0 5 0 0 20 40 60 80 00 20 40 time (n) Not bad: All nodes correctly ranked in 20 iterations Good: Ten best ranks in 80 iterations Great: Four best ranks in 20 iterations Stoch. Systems Analysis Ranking of nodes in graphs 25

Distributed algorithm to compute ranks Nodes want to compute their rank r i Can communicate with neighbors only (incoming + outgoing) Access to neighborhood information only Recall probability update p i (n + ) = Uses local information only j n (i) P ji p j (n) = j n (i) Algorithm. Nodes keep local rank estimates p i (n) N j p j (n) Receive rank (probability) estimates p j (n) from neighbors j n (i) Update local rank estimate p i (n + ) = j n (i) p j(n)/n j Communicate rank estimate p i (n + ) to outgoing neighbors j n(i) Need only know number of neighbors of my neighbors Stoch. Systems Analysis Ranking of nodes in graphs 26

Distributed implementation of random walk Can communicate with neighbors only (incoming + outgoing) But cannot access neighborhood information Pass agent around Local rank estimates r i (n) and counter with number of visits V i Algorithm run by node i at time n if Agent received from neighbor then V i = V i + Choose random neighbor Send agent to chosen neighbor end n = n + ; r i (n) = V i /n; Speed up convergence by generating many agents to pass around Stoch. Systems Analysis Ranking of nodes in graphs 27

Comparison of different algoithms Random walk implementation Most secure & robust. No information shared with other nodes Implementation can be distributed Convergence exceedingly slow System of linear equations Least security and robustness. Graph in central server Distributed implementation not clear Non-iterative method, convergence not a problem But computationally costly to obtain approximate solutions Probability propagation, matrix powers Somewhat secure/robust. Info. shared with neighbors only Implementation can be distributed Convergence rate acceptable (orders of magnitude faster than RW) Stoch. Systems Analysis Ranking of nodes in graphs 28