Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Similar documents
A Note on Google s PageRank

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

Link Mining PageRank. From Stanford C246

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

0.1 Naive formulation of PageRank

Graph Models The PageRank Algorithm

Data Mining Recitation Notes Week 3

Uncertainty and Randomization

Online Social Networks and Media. Link Analysis and Web Search

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

ECEN 689 Special Topics in Data Science for Communications Networks

Introduction to Data Mining

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Calculating Web Page Authority Using the PageRank Algorithm

On the mathematical background of Google PageRank algorithm

Online Social Networks and Media. Link Analysis and Web Search

Lecture: Local Spectral Methods (1 of 4)

Link Analysis Ranking

CS6220: DATA MINING TECHNIQUES

1998: enter Link Analysis

Google Page Rank Project Linear Algebra Summer 2012

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

PageRank: The Math-y Version (Or, What To Do When You Can t Tear Up Little Pieces of Paper)

No class on Thursday, October 1. No office hours on Tuesday, September 29 and Thursday, October 1.

Link Analysis. Leonid E. Zhukov

Slides based on those in:

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

IR: Information Retrieval

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data and Algorithms of the Web

SUPPLEMENTARY MATERIALS TO THE PAPER: ON THE LIMITING BEHAVIOR OF PARAMETER-DEPENDENT NETWORK CENTRALITY MEASURES

Math 304 Handout: Linear algebra, graphs, and networks.

The Second Eigenvalue of the Google Matrix

Applications of The Perron-Frobenius Theorem

Today. Next lecture. (Ch 14) Markov chains and hidden Markov models

1 Searching the World Wide Web

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

The Google Markov Chain: convergence speed and eigenvalues

COMPSCI 514: Algorithms for Data Science

Inf 2B: Ranking Queries on the WWW

Majorizations for the Eigenvectors of Graph-Adjacency Matrices: A Tool for Complex Network Design

eigenvalues, markov matrices, and the power method

Pseudocode for calculating Eigenfactor TM Score and Article Influence TM Score using data from Thomson-Reuters Journal Citations Reports

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

CS6220: DATA MINING TECHNIQUES

Page rank computation HPC course project a.y

Applications to network analysis: Eigenvector centrality indices Lecture notes

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

Web Structure Mining Nodes, Links and Influence

Justification and Application of Eigenvector Centrality

6.842 Randomness and Computation March 3, Lecture 8

MATH36001 Perron Frobenius Theory 2015

Distributed Randomized Algorithms for the PageRank Computation Hideaki Ishii, Member, IEEE, and Roberto Tempo, Fellow, IEEE

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

Computing PageRank using Power Extrapolation

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Markov Chains for Biosequences and Google searches

Lecture 7 Mathematics behind Internet Search

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

Application. Stochastic Matrices and PageRank

Chapter 10. Finite-State Markov Chains. Introductory Example: Googling Markov Chains

Definition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states.

Data Mining and Matrices

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Algebraic Representation of Networks

Google and Biosequence searches with Markov Chains

STA141C: Big Data & High Performance Statistical Computing

Lecture: Local Spectral Methods (2 of 4) 19 Computing spectral ranking with the push procedure

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

Spectral radius, symmetric and positive matrices

Lecture 5: Random Walks and Markov Chain

Node and Link Analysis

Node Centrality and Ranking on Networks

Eigenvalues of Exponentiated Adjacency Matrices

Powerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature.

CS249: ADVANCED DATA MINING

1.3 Convergence of Regular Markov Chains

Why matrices matter. Paul Van Dooren, UCL, CESAME

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Link Analysis and Web Search

How does Google rank webpages?

Markov Chains Handout for Stat 110

Affine iterations on nonnegative vectors

Conditioning of the Entries in the Stationary Vector of a Google-Type Matrix. Steve Kirkland University of Regina

Complex Social System, Elections. Introduction to Network Analysis 1

Lecture 15: MCMC Sanjeev Arora Elad Hazan. COS 402 Machine Learning and Artificial Intelligence Fall 2016

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Stochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property

Introduction to Information Retrieval

Finite-Horizon Statistics for Markov chains

Computational Economics and Finance

UpdatingtheStationary VectorofaMarkovChain. Amy Langville Carl Meyer

Pr[positive test virus] Pr[virus] Pr[positive test] = Pr[positive test] = Pr[positive test]

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora

Link Analysis. Stony Brook University CSE545, Fall 2016

Transcription:

Lab 8: Measuring Graph Centrality - PageRank Monday, November 5 CompSci 531, Fall 2018

Outline Measuring Graph Centrality: Motivation Random Walks, Markov Chains, and Stationarity Distributions Google s PageRank Algorithm

Directed Graphs A B C D E F A 0 1 0 0 0 0 B 0 0 1 0 0 0 C 0 0 0 0 1 0 D 0 1 0 0 0 0 E 0 0 0 1 0 1 F 0 0 0 0 0 0

Graph Centrality Which vertex is the most important in this graph? What do we even mean by important? In this class, we will focus on importance as centrality as measured by a random walk.

Motivation Social Media Who is important in the Twitter network?

Motivation Academic Publishing How impactful is a scientific publication?

Motivation Web Search Which webpages are most important for displaying after a search query? (The original motivation).

Outline Measuring Graph Centrality: Motivation Random Walks, Markov Chains, and Stationarity Distributions Google s PageRank Algorithm

Formalizing Graph Centrality Attempt 1. Measure the in-degree (number of incoming directed edges) of every node. Node in-degree A 0 B 2 C 1 D 1 E 1 F 1

Formalizing Graph Centrality Problem. Why do edges from unimportant and important nodes contribute equally? What is the most important and central vertex in this graph?

Formalizing Graph Centrality Attempt 2. Say that a node is central in so far as we are likely to arrive at the node while traversing the graph. For example, in this graph, all traversals end at the same place.

Random Walk Question. What do we mean by likely in a traversal? Where is the probability coming from? Answer. We consider a random walk. Start at a random vertex For t from 1 to T steps: Choose an outgoing edge uniformly at random and follow it Let! " # be the probability that we are at node $ at time %. Then the centrality of node $ is lim # *! " #.

Transition Probabilities Note that! "#$ only depends on! ". In particular, let % & denote the outdegree of vertex '. Then! ( "#$ = * &: &,(. For convenience, let 0 be the transition matrix defined below. For now, assume that % & 1 for all '.! & " % &. 1, 0 &( = 3% & 4 &( = 1 0, 4 &( = 0

Markov Chain Each row represents a conditional probability distribution: we can interpret! "# as the probability that we move to $ given we are at %. We can rewrite the updates in terms of the transition matrix. & '() = & '! Note that & '() is independent the history, conditional on & ', i.e., & '() & ), & 0,, & ' = & '() & '. Thus, this random walk is a Markov Chain.

Stationary Distribution lim $ & ' $, our measure of graph centrality, is the stationary distribution of the Markov chain. Questions. 1. Does the limit even exist? 2. Does the limit depend on the starting state ' (? 3. Can we compute lim $ & ' $ efficiently?

Existence and Uniqueness Note that if lim $ & ' $ exists, then it must be some ' such that ' = ' * * + ' = '. That is, the stationary distribution ' should be an eigenvector of the transposed transition matrix * +, with eigenvalue 1. (More to come next class on eigenvalues in graphs). Is it the only one? We need a theorem from linear algebra. Suppose for a moment that * has all strictly positive values.

Existence and Uniqueness Perron-Frobenius Theorem (abbreviated). Let A be a square matrix with real, strictly positive entries. Then the following hold. 1. The largest eigenvalue (call it! " ) of A is unique. 2. There is a unique eigenvector (call it # ) corresponding to! ", all entries of which are positive, and this is the only eigenvector with all positive entries. 3. The power iteration method that repeatedly applies # %&" = (# % beginning from an initial vector # " not orthogonal to # converges to # as ). Every row of, is a probability distribution, so, 1 = 1. By conditions 2 and 1, it must be that the largest eigenvalue of, is 1. Since, is square,, and, / have the same eigenvalues, so 1 is the largest eigenvalue of, / too!

Existence and Uniqueness Since 1 is the largest eigenvalue of! ", the theorem implies that # exists and is the unique eigenvector of! " with all positive entries. So we have answered questions 1 and 2: the stationary distribution exists, and it is unique. What about computation? The theorem tells us that the power iteration method converges in the limit but how long does that take?

Computation In general, the convergence rate is determined by the spectral gap. If! " = 1 is the largest eigenvalue of % &, and! ' is the second largest eigenvalue of % &, then the spectral gap is! "! '. As we will see next lab, the spectral gap is in turn related to the conductance of the underlying graph. Let ) + be a cut in, = (+, /). The conductance of the cut is 1 ) = 3,4 6:3 8,4 8 :;<( >? @ >, >? @ > ).

Computation The conductance of a graph is the minimum conductance of any cut. S! " = 2 min(10, 4) = 1 2

Computation S! " = 1 min(7, 7) = 1 7

Computation So intuitively, lower conductance graphs have bottlenecks, and it may take a longer time for the random walk to traverse the cut. By contrast, power iteration converges rapidly on graphs with high conductance (e.g., complete graphs). To converge (to within some constant error term), one needs iterations. What does that look like in practice?! "#$ % & '

Outline Measuring Graph Centrality: Motivation Random Walks, Markov Chains, and Stationarity Distributions Google s PageRank Algorithm

PageRank Page rank is named after Larry Page. He was doing a PhD at Stanford when he started working on the project of building a search engine. He didn t finish his PhD, but he is currently the Alphabet CEO and worth around 53 billion USD.

PageRank PageRank treats the web as a huge graph, where webpages are vertices, and hyperlinks are directed edges. The PageRank algorithm simply applies the power iteration method to compute the stationary distribution of a random walk on the web. Recall that we needed all entries in! to be strictly positive to be guaranteed that this works. That means that from any vertex, there has to be nonzero probability of transitioning to any other vertex.

PageRank To satisfy this, PageRank assumes a slightly different random walk than we described. In particular: Start at a random vertex For t from 1 to T steps: If current page has no links Choose a page uniformly at random. Else With probability 0.15, choose a page uniformly at random. With the remaining probability, choose a link from the current page uniformly at random and follow it.

PageRank Thus, if there are n web pages in total, the transition matrix for this random walk is given by! "# = 0.85) "# + 0.15, / h12 3/-42 * " - 1, / h12-5 3/-42 - Then we just compute the stationary distribution by the power iteration method. What kind results does this generate?

PageRank

PageRank Note that our modification also ensures that the conductance of the graph is not too small. In practice, 50 to 100 power iterations suffice for a reasonable approximation to the stationary distribution. This might seem hard for large n, but note that the graph itself is extremely sparse, so matrix vector multiplication can be implemented efficienctly. All other things equal, google search prefers to show results with higher PageRank. The #1 thing that increases your PageRank? Having other important pages link to you.