Applications to network analysis: Eigenvector centrality indices Lecture notes

Similar documents
Nonnegative and spectral matrix theory Lecture notes

Link Analysis. Leonid E. Zhukov

Node Centrality and Ranking on Networks

Eigenvalues of Exponentiated Adjacency Matrices

Node and Link Analysis

1 Searching the World Wide Web

How does Google rank webpages?

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Lecture 7 Mathematics behind Internet Search

Bounds on the Ratio of Eigenvalues

Majorizations for the Eigenvectors of Graph-Adjacency Matrices: A Tool for Complex Network Design

CSI 445/660 Part 6 (Centrality Measures for Networks) 6 1 / 68

Google Page Rank Project Linear Algebra Summer 2012

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

IR: Information Retrieval

Data Mining Recitation Notes Week 3

Complex Networks CSYS/MATH 303, Spring, Prof. Peter Dodds

Graph Models The PageRank Algorithm

As it is not necessarily possible to satisfy this equation, we just ask for a solution to the more general equation

ECEN 689 Special Topics in Data Science for Communications Networks

A Note on Google s PageRank

Nonnegative Matrices I

Inf 2B: Ranking Queries on the WWW

Markov Chains and Spectral Clustering

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

Calculating Web Page Authority Using the PageRank Algorithm

MATH36001 Perron Frobenius Theory 2015

Complex Social System, Elections. Introduction to Network Analysis 1

Math 304 Handout: Linear algebra, graphs, and networks.

6.207/14.15: Networks Lectures 4, 5 & 6: Linear Dynamics, Markov Chains, Centralities

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

THE PERTURBATION BOUND FOR THE SPECTRAL RADIUS OF A NON-NEGATIVE TENSOR

Notes on Linear Algebra and Matrix Theory

Link Analysis Ranking

How works. or How linear algebra powers the search engine. M. Ram Murty, FRSC Queen s Research Chair Queen s University

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

This section is an introduction to the basic themes of the course.

Online Social Networks and Media. Link Analysis and Web Search

Eigenvectors Via Graph Theory

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

Detailed Proof of The PerronFrobenius Theorem

Lecture 3: graph theory

A New Method to Find the Eigenvalues of Convex. Matrices with Application in Web Page Rating

Hub, Authority and Relevance Scores in Multi-Relational Data for Query Search

Data Mining and Matrices

1998: enter Link Analysis

eigenvalues, markov matrices, and the power method

Perron Frobenius Theory

THEORY OF SEARCH ENGINES. CONTENTS 1. INTRODUCTION 1 2. RANKING OF PAGES 2 3. TWO EXAMPLES 4 4. CONCLUSION 5 References 5

The Miracles of Tropical Spectral Theory

On the mathematical background of Google PageRank algorithm

Online Social Networks and Media. Link Analysis and Web Search

Krylov Subspace Methods to Calculate PageRank

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

No class on Thursday, October 1. No office hours on Tuesday, September 29 and Thursday, October 1.

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

Link Analysis and Web Search

Page rank computation HPC course project a.y

Multiple Relational Ranking in Tensor: Theory, Algorithms and Applications

Z-Pencils. November 20, Abstract

Affine iterations on nonnegative vectors

Key words. Strongly eventually nonnegative matrix, eventually nonnegative matrix, eventually r-cyclic matrix, Perron-Frobenius.

1 T 1 = where 1 is the all-ones vector. For the upper bound, let v 1 be the eigenvector corresponding. u:(u,v) E v 1(u)

The Singular Values of the Exponentiated Adjacency Matrices of Broom-Tree Graphs

The Google Markov Chain: convergence speed and eigenvalues

Intrinsic products and factorizations of matrices

Lecture: Local Spectral Methods (1 of 4)

Bonacich Measures as Equilibria in Network Models

CS47300: Web Information Search and Management

Math 443/543 Graph Theory Notes 5: Graphs as matrices, spectral graph theory, and PageRank

Uncertainty and Randomization

Lecture 10. Lecturer: Aleksander Mądry Scribes: Mani Bastani Parizi and Christos Kalaitzis

Applications of The Perron-Frobenius Theorem

Link Analysis. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

The Second Eigenvalue of the Google Matrix

0.1 Naive formulation of PageRank

Markov Chains, Random Walks on Graphs, and the Laplacian

Justification and Application of Eigenvector Centrality

Spectral Properties of Matrix Polynomials in the Max Algebra

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices)

Announcements: Warm-up Exercise:

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Central Groupoids, Central Digraphs, and Zero-One Matrices A Satisfying A 2 = J

REPRESENTATION THEORY WEEK 5. B : V V k

Authoritative Sources in a Hyperlinked Enviroment

Degree Distribution: The case of Citation Networks

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Spectral Graph Theory Tools. Analysis of Complex Networks

COMPSCI 514: Algorithms for Data Science

The Giving Game: Google Page Rank

CS168: The Modern Algorithmic Toolbox Lectures #11 and #12: Spectral Graph Theory

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lecture 14: SVD, Power method, and Planted Graph problems (+ eigenvalues of random matrices) Lecturer: Sanjeev Arora

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Transcription:

Applications to network analysis: Eigenvector centrality indices Lecture notes Dario Fasino, University of Udine (Italy) Lecture notes for the second part of the course Nonnegative and spectral matrix theory with applications to network analysis, held within the Rome-Moscow school on Matrix Methods and Applied Linear Algebra, August September 204. The author is partially supported by Istituto Nazionale di Alta Matematica. This document is intended for internal circulation only. A brief introduction to the analysis of complex networks Complex networks is a common name for various real networks which are usually presented by graphs with a large number of vertices. Here belong Internet graphs, phone graphs, e-mail graphs, social networks and many other. The term network analysis refers to a wealth of mathematical techniques aiming at describing the structure, function, and evolution, of complex networks. One of the main tasks in network analysis is the localization of nodes that, in some sense, are the most important in a given graph. The main tool to quantify the relevance of nodes in a graph is through the computation of suitably defined centrality indices. Many centrality indices have been invented during time. Each one of them refers to a particular definition of importance or relevance that is most useful in a given context. Graph partitioning is the problem of dividing the vertices of a graph into a given number of disjoint subsets of given sizes such that the total weight of edges between such sets is minimized. The best known example of a graph partitioning problem is the problem of dividing a graph into two subsets of comparable size, such that the number of edges between them is minimized. Community detection differs from graph partitioning in that the number and size of the subsets into which the network is divided are generally not apriori specified. Instead it is assumed that the graph is intrinsically structured into communities or groups of vertices which are more or less evidently delimited, the aim being to reveal the presence and the consistency of such groups.. Notations and definitions Two graphs G = (V, E) and G = (V, E ) are called isomorphic if there exists a permutation matrix P such that A G = P A G P T. Let A = A G. The in-degree and the out-degree of node i are respectively the numbers d in i = A ij, j= d out i = A ji. j= They represent the number (or overall weight) of edges that arrive to or depart from node i, respectively. If G is not oriented the two numbers are the same and their common value is the degree d i. Let V = {,..., n} and let G n be the set of all graphs whose node set is V. A graph invariant is any function f : G n R which is invariant under graph isomorphisms: If A G = P A GP T then

2 Eigenvector centralities 2 f(g) = f(g ). A centrality index is any function c : G n R n such that if A G = P A GP T then P c(g) = c(g ). The degree vectors d in = A and d out = A T are the most simple (somehow trivial) examples of a centrality index. Clearly, T d out = T d out, and the sum is equal to the total edge weight of G, which is a graph invariant called volume. Many interesting graph invariants and centrality indices are based on linear algebraic properties (in particular, eigenpairs) of A G and variations thereof. Remark.. Let G G n and let A = A G. Let c : G n R n be a centrality index. Suppose that there exists a nontrivial permutation matrix P such that A = P T AP (that is, the graph owns a nontrivial automorphism). Then for all i =..., n the centrality index of node i is the same as that of node j, where e j = P e i. 2 Eigenvector centralities The purpose of this section is to describe some of the most important centrality indices, whose definition is largely based on tools and concepts borrowed from linear algebra and matrix analysis. They are the Bonacich index, PageRank, and HITS scores. The common feature shared by these indices is that they are Perron eigenvectors of suitably defined nonnegative matrices. 2. The Bonacich index One of the first centrality indices (besides degrees) was introduced by the american sociologist Bonacich []. Let G be a directed graph. For simplicity, assume it is strongly connected, and let A = A G. The original idea is that a node is important it is connected to other important nodes. This sort of circular defintion can be formalized rigorously by assuming that the score of node i is proportional to the sum of scores of all nodes j such that i j: λb i = j:j i b j = A ji b j, where A = A G. Hence, the vector of Bonacich indices fulflills the eigenvalue equation A T b = λb. Among the possible solutions of the previous equation, the Bonacich index is the one which corresponds to the Perron eigenpair of A T. In fact, if G is strongly connected then the Bonacich score vector b = (b,..., b n ) T immediately obtains a number of useful properties from Perron Frobenius theory: It is uniquely defined, apart of a scaling factor; its entries are positive (every node gets a nonzero score). If we add a new edge i j to G then the node whose Bonacich index receives the largest increase is i (by Dietzenbacher theorem), consistently to the idea that its influence has increased. Remark 2.. Recall that the number (A k ) ij represents the number of distinct paths from node j to node i whose length is k. If A is primitive then we can compute the Bonacich index by applying the (normalized) power method to A T. Hence, apart of a scaling factor, b = lim k (A T ) k / (A T ) k. In particular, b i is proportional to limit for k of the number of different paths whose length is k from node i to every node in the graph. Thus a node with a large Bonacich index is a node that originates many long paths. Exercise 2.2. A star graph with n nodes is the undirected graph whose adjacency matrix is 0 0 0 A =... Rn n. 0 0 Prove that the spectrum of A is { ρ, 0, ρ} and a Perron eigenvector of A (which contains the Bonacich centrality index of the nodes) is x = (ρ,,..., ) T where ρ = n. j=

2 Eigenvector centralities 3 2.2 PageRank One of the best known centrality indices for arbitrary graphs is PageRank, whose fortune is due to its usage in the Google search engine [5]. The original formula by S. Brin and L. Page [2] defines the PageRank vector π = (π,..., π n ) T of a graph G as the solution of the following linear system: π i = ( α) + α π j d out j:j i j where α (0, ) is a fixed constant called the damping factor, originally set to α = 0.85. In matrix form, (I αm)π = ( α), () where M O is the so-called link matrix which is defined as { A ij /d out j if d out j > 0 M ij = 0 otherwise and A = A G. For simplicity of exposition, hereafter I assume that all nodes in G have at least one outgoing edge, that is, d out > 0. In this case, the sum of all entries in any column of M is. Unfortunately, M is seldom irreducible. Nevertheless, we can say something about ρ(m): Lemma 2.3. ρ(m) =. Proof. Since M O, there exists x 0 such that Mx = ρ(m)x. Moreover, we can rewrite i M ij = as M T =. Hence, T x = T Mx = ρ(m) T x, and the proof is complete, by observing that T x > 0. The surprise is that there exists a matrix Γ > O (which is called Google matrix) such that, apart of a scaling factor, π is a Perron Frobenius eigenvector of Γ. Theorem 2.4. Let Γ R n n be the positive matrix defined as Γ = αm + α n T. for 0 < α <. Then, ρ(γ) = and the vector π defined in () is a Perron eigenvector. Proof. Observe that Γ > O by construction; in particular, it is primitive, hence irreducible. Simple computations show that Γ T =, so that ρ(γ) =. Finally, let x be a Perron eigenvector of Γ normalized so that T x = n. Then, x = Γx = αmx + α n T x = αmx + ( α). Rearranging terms, (I αm)x = ( α), which is (). To complete the proof it remains to prove that I αm is nonsingular. Using Lemma 2.3 we can derive that all eigenvalues of I αm are contained in the circle {z C : z α}, which excludes 0 since α < by hypothesis., 2.3 Hubs and Authorities Exactly in the same year Brin and Page invented PageRank, J. Kleinberg introduced another algorithm to evaluate the relevance of documents in a large hypertext, such as the internet [3, 5]. This algorithm (HITS, Hypertext Induced Topic Search) quantifies the importance of nodes in a graph according to two centrality indices: the hub score and the authority score. Very informally, the hub score of a node is a measure of how good it is as access point or portal, while the authority score is a measure of how good a node is as final document. Kleinberg original idea is that a node is a good hub if it points to good authorities; and a node is a good authority if it is pointed by good hubs.

2 Eigenvector centralities 4 This concept has been formalized by the following equations: Let h i and a i be the hub score and authority score of node i, respectively. Then, λh i = λa i = h j, i =..., n, (2) j:i j a j where λ is a proportionality constant, to be defined. In matrix notations, λh = Aa and λa = A T h. The two equations can be uncoupled as follows: j:j i λ 2 h = A T Ah, λ 2 a = AA T a. Let M hub = A T A and M auth = AA T be the hub matrix and the authority matrix, respectively. These two matrices are symmetric, nonnegative, positive semidefinite, and have exactly the same eigenvalues (why?). In particular, ρ(m hub ) = ρ(m auth ). The preferred solution to (2) corresponds to Perron eigenvectors, with λ = ρ(m hub ). Unfortunately, M hub and M auth are usually not irreducible, even if the original graph is strongly connected. As an exercise, let s compute hub and authority scores of the following graph: 4 2 3 The adjacency and hub matrix are 2 0 0 A = 0 0 0 0 0, M hub = A T A = 2 0 0. 0 0 0 The eigenvalues of M hub are 3,, 0, 0. An eigenvector associated to ρ(m hub ) = 3 is h = (,, 0, 0) T. We can compute authority scores from the formula λa = Ah. We obtain a = (0,, 2, ) T / 3. We conclude that nodes and 2 are good hubs, nodes 3 and 4 are not (indeed, they have no outgoing links). The best authority node is 3, which is pointed by both best hubs. Node is not an authority, because it has no ingoing links. Exercise 2.5. Suppose that in a given graph G there are two nodes, say i and j such that for every k V it holds k i k j. Prove that a i a j. HITS algorithm is essentially the power method (with normalization) applied to M hub and M auth starting from the initial vector [3]. However, the largest eigenvalue of these matrices may be not simple, and this fact implies that hub scores and authority scores may be not uniquely defined (apart of a scaling factor), since the convergence of the power method may be affected by the choiche of the starting vector. For example, consider 2 3 4 2 0 0 0 M hub = 0 0 0 0. The eigenvalues of M hub are 2, 2, 0, 0. Any vector of the form h = (α β β 0) T is an eigenvector corresponding to ρ(m hub ) = 2. If we apply the power method to M hub starting from ( ) T we obtain h ( 0) T, while if the starting vector is ( 0 0 0) T then we obtain h ( 0 0 0) T. Various modifications have been devised in order to make hub-authority scores well defined under rather general hypotheses, see e.g., [4]. One of these tricks is described in the following exercise:

3 Exercises and problems 5 Exercise 2.6. Let  = A + εi where ε > 0 (note: this modification corresponds to adding a loop with weight ε to every node in the graph) and let M hub = ÂT Â. Prove that if G is not disconnected then M hub is irreducible. Hint: M hub = ε(a + A T ) + other nonnegative matrices. 3 Exercises and problems Exercises marked with a star ( ) are requested for the final evaluation.. ( ) For any given integers p, q, let G be the graph having + p + pq nodes, which is defined as follows: There is a root node, which is connected to p star nodes; every star node is connected to q peripheral nodes. Compute the Bonachich centrality index for the nodes of this graph. Use the preceding result to prove that the Bonacich index of a node is not an inccreasing function of its degree. 2. ( ) Let G = (V, E) be an undirected, connected graph. Consider the following centrality indices for both nodes and edges: To any i V and ij E associate variables q i and e ij respectively, by means of these equations: λq i = e ij, λe ij = q i + q j. j:i j where λ is a constant (depending on G) to be determined. Discuss existence, uniqueness, positivity... (open ended problem). References [] Phillip Bonacich. Factoring and Weighting Approaches to Status Scores and Clique Identification. Journal of Mathematical Sociology, 2 (972), 3 20. [2] Sergey Brin, Larry Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the 7th international conference on World Wide Web (WWW). Brisbane, Australia (998), 07 7. [3] J. M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46 (999), 604 632. This is the seminal paper on HITS. Preliminary versions circulated in 997, see Kleinberg homepage, http://www.cs.cornell.edu/home/kleinber/ [4] A. Farahat, T. Lofaro, J. Miller, G. C. Rae, L. A. Ward. Authority rankings from HITS, PageRank, and SALSA: existence, uniqueness, and effect of initialization. SIAM J. Sci. Comput. 27 (2006), 8 20. [5] A. N. Langville, C. D. Meyer. A survey of eigenvector methods for Web information retrieval. SIAM Rev., 47(2005), 35 6. Expository paper on PageRank and HITS. Last update: September 6, 204 A graph is disconnected if its vertex set can be partitioned into two subsets, V = V V 2 and V V 2 =, so that no edge belongs to (V V 2 ) (V 2 V ).