Modularity and Graph Algorithms


Modularity and Graph Algorithms
David Bader, Georgia Institute of Technology
Joe McCloskey, National Security Agency
12 July 2010

Outline
Modularity Optimization and the Clauset, Newman, and Moore Algorithm

Modularity Optimization: Directed Graphs

Let $G$ be a directed, unweighted graph with adjacency matrix $A$. Assume that $L$ is the total number of edges in the network and that $G$ has been partitioned into $m$ disjoint sets of nodes $M_s$, $s = 1, 2, \ldots, m$. Consider the set $M_s$ and let $l_{ss}$ be the number of observed transitions from $M_s$ to $M_s$. Define $l_{s\bar{s}}$ to be the total number of transitions from within $M_s$ to outside $M_s$. The quantities $l_{\bar{s}s}$ and $l_{\bar{s}\bar{s}}$ are defined analogously. Note that we have collapsed the adjacency matrix into a $2 \times 2$ table of transition counts, where $L = l_{ss} + l_{s\bar{s}} + l_{\bar{s}s} + l_{\bar{s}\bar{s}}$.

Let the row and column marginal sums of the $2 \times 2$ transition table be $l_{+s} = l_{ss} + l_{\bar{s}s}$ and $l_{s+} = l_{ss} + l_{s\bar{s}}$. We observe $l_{ss}$ transitions within node set $M_s$. Under the hypothesis of independence the estimated number of transitions is
$$\hat{l}_{ss} = L \left( \frac{l_{+s}}{L} \right) \left( \frac{l_{s+}}{L} \right) = \frac{l_{s+} l_{+s}}{L}.$$
The residual is defined as $R_{ss} = l_{ss} - \hat{l}_{ss}$. If $R_{ss} > 0$ then we observe more transitions than expected; if $R_{ss}$ is significantly larger than zero then we might suspect that $M_s$ is a module (cluster).
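To make the calculation concrete, here is a minimal Python sketch (mine, not from the talk; all counts are invented) of the expected within-module count and residual for one collapsed $2 \times 2$ table:

```python
# Expected within-module count and residual for the collapsed 2x2 table
# of a directed graph. l_sb, l_bs, l_bb abbreviate l_{s,sbar}, l_{sbar,s},
# and l_{sbar,sbar}; all counts below are hypothetical.

def expected_within(l_ss, l_sb, l_bs, l_bb):
    """hat{l}_ss = l_s+ * l_+s / L under row/column independence."""
    L = l_ss + l_sb + l_bs + l_bb      # total number of edges
    l_s_plus = l_ss + l_sb             # row marginal l_{s+}
    l_plus_s = l_ss + l_bs             # column marginal l_{+s}
    return l_s_plus * l_plus_s / L

# 40 edges inside M_s, 5 leaving, 5 entering, 50 elsewhere in the graph.
l_ss, l_sb, l_bs, l_bb = 40, 5, 5, 50
l_hat = expected_within(l_ss, l_sb, l_bs, l_bb)
R_ss = l_ss - l_hat                    # R_ss > 0: more edges than expected
print(l_hat, R_ss)                     # 20.25 19.75
```

The large positive residual is exactly the signal the slides describe: $M_s$ contains far more internal edges than independence predicts.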

Following CNM we let $Q_s = R_{ss}/L$. The set of nodes $M_s$ is said to be a module if and only if $Q_s > 0$. Next we define the modularity function $Q$ as follows:
$$Q = \sum_{s=1}^{m} Q_s = \frac{1}{L} \sum_{s=1}^{m} R_{ss} = \frac{1}{L} \sum_{s=1}^{m} (l_{ss} - \hat{l}_{ss}).$$
The goal of modularity optimization is to efficiently find a decomposition of $G$ into a set of $m$ modules $M_s$, $s = 1, 2, \ldots, m$, that maximizes the modularity function $Q$.

It follows that the residual $R_{ss} = l_{ss} - \hat{l}_{ss} = l_{ss} - \frac{l_{s+} l_{+s}}{L} > 0$ if and only if $l_{ss} l_{\bar{s}\bar{s}} - l_{s\bar{s}} l_{\bar{s}s} > 0$. Moreover,
$$Q = \frac{1}{L^2} \sum_{s=1}^{m} \left( l_{ss} l_{\bar{s}\bar{s}} - l_{s\bar{s}} l_{\bar{s}s} \right),$$
and $M_s$ is a module if and only if
$$c_{12} = \frac{l_{ss} l_{\bar{s}\bar{s}}}{l_{s\bar{s}} l_{\bar{s}s}} > 1,$$
where $c_{12}$ is the cross-product ratio of the $2 \times 2$ contingency table. The expected value of $c_{12}$ equals one if and only if the rows and columns of the table are statistically independent; $c_{12} > 1$ indicates positive dependence.
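The equivalence $L R_{ss} = l_{ss} l_{\bar{s}\bar{s}} - l_{s\bar{s}} l_{\bar{s}s}$ is easy to confirm numerically; a quick check (mine, not from the talk) over a grid of arbitrary counts:

```python
# Numeric check of the identity L * R_ss = l_ss*l_bb - l_sb*l_bs for the
# directed 2x2 transition table (l_sb = edges leaving M_s, l_bs = entering).
from itertools import product

def residual_identity_gap(l_ss, l_sb, l_bs, l_bb):
    L = l_ss + l_sb + l_bs + l_bb
    R_ss = l_ss - (l_ss + l_sb) * (l_ss + l_bs) / L   # l_ss - l_s+ l_+s / L
    return L * R_ss - (l_ss * l_bb - l_sb * l_bs)      # should be ~0

for counts in product([1, 3, 7], repeat=4):
    assert abs(residual_identity_gap(*counts)) < 1e-9
print("identity holds")
```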

Modularity Optimization: Undirected Graphs

Now let $G$ be an undirected graph. The adjacency matrix $A$ is now symmetric, so $l_{\bar{s}s} = l_{s\bar{s}}$ and $L = l_{ss} + l_{s\bar{s}} + l_{\bar{s}\bar{s}}$. Since there are $L$ edges in the graph, the sum of all of the entries of the adjacency matrix equals $N = 2L$. Let $X$ be the $2 \times 2$ contingency table obtained by collapsing the adjacency matrix with respect to module $M_s$. Because of symmetry, $x_{ss} = 2 l_{ss}$, $x_{\bar{s}\bar{s}} = 2 l_{\bar{s}\bar{s}}$, and $x_{s\bar{s}} = x_{\bar{s}s} = l_{s\bar{s}} = l_{\bar{s}s}$.

Let $\hat{m}_{ij}$ be the estimated entries of table $X$ under independence. The marginal sums of $X$ with respect to module $M_s$ are $x_s = x_{+s} = x_{s+} = x_{ss} + x_{s\bar{s}} = 2 l_{ss} + l_{s\bar{s}}$. The expected number of transitions within $M_s$ is seen to be
$$\hat{l}_{ss} = \frac{\hat{m}_{ss}}{2} = (2L) \left( \frac{x_{+s}}{2L} \right) \left( \frac{x_{s+}}{2L} \right) \left( \frac{1}{2} \right) = \frac{x_s^2}{4L}.$$
As before,
$$Q = \sum_{s=1}^{m} Q_s = \frac{1}{L} \sum_{s=1}^{m} R_{ss} = \frac{1}{L} \sum_{s=1}^{m} (l_{ss} - \hat{l}_{ss}).$$

The residual is $R_{ss} = l_{ss} - \hat{l}_{ss} = l_{ss} - \frac{x_s^2}{4L}$, and $M_s$ is a module if and only if $4 L l_{ss} > x_s^2$. Equivalently, $M_s$ is a module if and only if $4 l_{ss} l_{\bar{s}\bar{s}} - l_{s\bar{s}}^2 > 0$. The cross-product ratio becomes
$$c_{12} = \frac{x_{ss} x_{\bar{s}\bar{s}}}{x_{s\bar{s}} x_{\bar{s}s}} = \frac{(2 l_{ss})(2 l_{\bar{s}\bar{s}})}{l_{s\bar{s}}^2} = \frac{4 l_{ss} l_{\bar{s}\bar{s}}}{l_{s\bar{s}}^2}.$$
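The undirected equivalence can be checked the same way; a small numeric sketch (mine, with arbitrary counts):

```python
# Numeric check that for the undirected 2x2 table,
# 4 L R_ss = 4 l_ss l_bb - l_sb^2, where R_ss = l_ss - x_s^2 / (4L)
# and x_s = 2 l_ss + l_sb (l_sb = l_{s,sbar}, l_bb = l_{sbar,sbar}).
def undirected_gap(l_ss, l_sb, l_bb):
    L = l_ss + l_sb + l_bb
    x_s = 2 * l_ss + l_sb
    R_ss = l_ss - x_s**2 / (4 * L)
    return 4 * L * R_ss - (4 * l_ss * l_bb - l_sb**2)   # should be ~0

for counts in [(10, 2, 30), (4, 4, 4), (1, 6, 9)]:
    assert abs(undirected_gap(*counts)) < 1e-9
print("undirected identity holds")
```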

The modularity function $Q$ can be rewritten in the following useful form, where the cross-product ratio is more apparent:
$$Q = \sum_{s=1}^{m} Q_s = \frac{1}{L} \sum_{s=1}^{m} R_{ss} = \frac{1}{4L^2} \sum_{s=1}^{m} \left( 4 l_{ss} l_{\bar{s}\bar{s}} - l_{s\bar{s}}^2 \right).$$
Clauset, Newman, and Moore (2004) have developed a scalable, agglomerative algorithm to maximize the modularity function. Their greedy algorithm makes use of efficient data structures and for certain networks can run in time $O(L \log^2 L)$.
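The CNM implementation reaches that running time with heaps and balanced trees; as a much simpler illustration of the agglomerative idea, here is an unoptimized quadratic-time sketch (mine, not CNM's implementation, on a made-up example graph) that starts with singleton communities and greedily applies the merge with the largest positive modularity gain:

```python
# Toy greedy agglomerative modularity maximization on an undirected graph.
# Not the CNM data structures -- every candidate merge is re-evaluated
# from scratch, which is fine for tiny examples.

def modularity(edges, comm):
    """Q = sum_s (l_ss/L - (x_s/2L)^2) for an undirected edge list."""
    L = len(edges)
    within, degsum = {}, {}            # l_ss and x_s per community
    for u, v in edges:
        if comm[u] == comm[v]:
            within[comm[u]] = within.get(comm[u], 0) + 1
        degsum[comm[u]] = degsum.get(comm[u], 0) + 1
        degsum[comm[v]] = degsum.get(comm[v], 0) + 1
    return sum(within.get(c, 0) / L - (degsum[c] / (2 * L)) ** 2
               for c in degsum)

def greedy_communities(edges):
    nodes = sorted({u for e in edges for u in e})
    comm = {u: u for u in nodes}       # start from singleton communities
    while True:
        labels = sorted(set(comm.values()))
        q0 = modularity(edges, comm)
        best, best_pair = 0.0, None
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                trial = {u: (a if c == b else c) for u, c in comm.items()}
                gain = modularity(edges, trial) - q0
                if gain > best:
                    best, best_pair = gain, (a, b)
        if best_pair is None:          # no merge increases Q: stop
            return comm
        a, b = best_pair
        comm = {u: (a if c == b else c) for u, c in comm.items()}

# Two triangles joined by one bridge edge: the triangles should emerge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
comm = greedy_communities(edges)
print(sorted(set(comm.values())))
```

On this example the greedy merges recover the two triangles and refuse the bridge merge, since joining the triangles would decrease $Q$.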

The Resolution Limit

"It is a priori impossible to tell whether a module (large or small), detected through modularity optimization, is indeed a single module or a cluster of smaller modules." (Fortunato and Barthelemy, 2007)

The Most Modular Connected Network

Let $G$ be composed of $n$ cliques, $n$ even, where each clique has $m$ nodes and each clique is connected by a single edge to an adjacent clique. The number of edges in this graph is $L = \frac{1}{2} n m (m-1) + n = \frac{n}{2} \left( m(m-1) + 2 \right)$. Consider two partitions of $G$ into hypothesized modules: in the first partition every module is a clique, while in the second partition every module is a pair of adjacent cliques. Then
$$Q_1 = 1 - \frac{2}{m(m-1) + 2} - \frac{1}{n}, \qquad Q_2 = 1 - \frac{1}{m(m-1) + 2} - \frac{2}{n}.$$

It follows that $Q_2 > Q_1$ if $n > m(m-1) + 2$. Since $L = \frac{n}{2}(m(m-1) + 2)$, this further implies that $Q_2 > Q_1$ if $n^2 > 2L$. If $n = 25$ and $m = 5$, then $Q_2 = 0.8745 > Q_1 = 0.8691$, and modularity optimization has not determined the best partition. Fortunato and Barthelemy have analyzed real-world data sets that exhibit the property that the best partition does not have the maximum modularity score.
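The quoted numbers follow directly from the two formulas; a short check (my sketch):

```python
# Ring-of-cliques modularity: n cliques of m nodes joined in a ring by
# single edges, scored under two partitions.
def q_single(n, m):    # every clique its own module
    return 1 - 2 / (m * (m - 1) + 2) - 1 / n

def q_pairs(n, m):     # every module a pair of adjacent cliques
    return 1 - 1 / (m * (m - 1) + 2) - 2 / n

n, m = 25, 5
print(round(q_single(n, m), 4), round(q_pairs(n, m), 4))  # 0.8691 0.8745
```

Even though the individual cliques are the obviously correct communities, the pairwise partition scores higher once $n > m(m-1) + 2$.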

An Improved Resolution Limit

Berry, Hendrickson, LaViolette, and Phillips (2010) use the local topology of the graph to infer edge weights, then generalize the results of Fortunato and Barthelemy to show that it is possible to resolve much smaller communities. They also suggest that the dendrogram constructed by the CNM algorithm can be mined to resolve smaller communities, which might be even more effective with their modified CNM algorithm.

A Different Resolution?

Claim: the failure of the CNM algorithm to resolve communities is due to the properties of the modularity function $Q$, not the information that it uses. Can CNM be salvaged? Is it possible to optimize a different objective function that uses the same information and resolves communities?

Two Equivalent Representations of $\Delta Q$

Consider a graph with $L$ edges and three modules. Let $l_{ij}$ be the number of transitions from module $M_i$ to module $M_j$. Then $L = l_{11} + l_{12} + l_{13} + l_{22} + l_{23} + l_{33}$. What happens to the modularity function if modules $M_1$ and $M_2$ are merged together? We find two equivalent representations of the difference $\Delta Q = Q_{1,2,3} - Q_{1 \cup 2, 3}$ of the modularity functions due to the merge of $M_1$ and $M_2$. If $Q_{1 \cup 2, 3} > Q_{1,2,3}$ then CNM will merge $M_1$ and $M_2$.

Use the three modules $M_1$, $M_2$, and $M_3$ to collapse the undirected graph into a $3 \times 3$ table $X_{1,2,3}$ of counts $x_{ij}$. The $x_{ij}$ are determined from the transition counts $l_{ij}$ as follows: $x_{jj} = 2 l_{jj}$, $x_{ij} = x_{ji} = l_{ij}$ for $i \neq j$, $x_1 = x_{1+} = x_{+1} = 2 l_{11} + l_{12} + l_{13}$, and $x_2 = x_{2+} = x_{+2} = l_{12} + 2 l_{22} + l_{23}$. Because of symmetry, the total sample size of the $x_{ij}$ is $N = 2L$.

If we further merge $M_1$ and $M_2$ together, then the collapsed graph has transition counts $l'_{11} = l_{11} + l_{12} + l_{22}$, $l'_{12} = l'_{21} = l_{13} + l_{23}$, and $l'_{22} = l_{33}$. The resulting $2 \times 2$ symmetric table $X_{1 \cup 2, 3}$ of counts $x'_{ij}$ has entries $x'_{11} = 2 l_{11} + 2 l_{12} + 2 l_{22}$, $x'_{12} = x'_{21} = l_{13} + l_{23}$, and $x'_{22} = 2 l_{33}$.

Properties of the Modularity Function

Let $Q_{1,2,3}$ be the modularity function when modules $M_1$, $M_2$, $M_3$ are considered to be separate modules, and $Q_{1 \cup 2, 3}$ the modularity function when modules $M_1$ and $M_2$ are merged together. Let $\Delta Q = Q_{1,2,3} - Q_{1 \cup 2, 3}$ be the difference of the two modularity functions. If $Q_{1 \cup 2, 3} > Q_{1,2,3}$ then $\Delta Q < 0$, and CNM will merge $M_1$ and $M_2$. To understand the properties of the modularity function we find two equivalent representations of $\Delta Q = Q_{1,2,3} - Q_{1 \cup 2, 3}$.

Module $M_3$ makes the same contribution $Q_3 = R_{33}/L$ to each of the modularity functions $Q_{1,2,3}$ and $Q_{1 \cup 2, 3}$. Thus,
$$L \, \Delta Q = L (Q_{1,2,3} - Q_{1 \cup 2, 3}) = (R_{11} + R_{22} + R_{33}) - (R_{1 \cup 2, 1 \cup 2} + R_{33}) = R_{11} + R_{22} - R_{1 \cup 2, 1 \cup 2}.$$
It remains to compute $R_{1 \cup 2, 1 \cup 2} = l'_{11} - \hat{l}'_{11}$.

Now $\hat{l}'_{11} = \frac{1}{4L} (x'_1)^2$, where $x'_1 = 2 l_{11} + 2 l_{12} + 2 l_{22} + l_{13} + l_{23}$. A straightforward calculation yields
$$2 L^2 \, \Delta Q = c_0 + c_1 + c_2 + c_3,$$
where we define
$$c_0 = 4 l_{11} l_{22} - l_{12}^2, \quad c_1 = 2 l_{11} l_{23} - l_{12} l_{13}, \quad c_2 = 2 l_{22} l_{13} - l_{12} l_{23}, \quad c_3 = l_{13} l_{23} - 2 l_{12} l_{33}.$$
The $c_j$ have cross-product interpretations as $2 \times 2$ subtables of $X_{1,2,3}$.
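The decomposition can be verified numerically over random transition counts; a sketch (mine, not from the talk):

```python
# Numeric check that 2 L^2 (Q_{1,2,3} - Q_{1u2,3}) = c0 + c1 + c2 + c3
# for arbitrary nonnegative transition counts l_ij.
import random

def decomposition_gap(l11, l12, l13, l22, l23, l33):
    L = l11 + l12 + l13 + l22 + l23 + l33
    x1 = 2 * l11 + l12 + l13           # marginal of module 1 in the 3x3 table
    x2 = 2 * l22 + l12 + l23           # marginal of module 2
    # L * dQ = R_11 + R_22 - R_{1u2,1u2}, with R_ss = l_ss - x_s^2 / (4L)
    L_dQ = (l11 - x1**2 / (4 * L)) + (l22 - x2**2 / (4 * L)) \
           - (l11 + l12 + l22 - (x1 + x2)**2 / (4 * L))
    c0 = 4 * l11 * l22 - l12**2
    c1 = 2 * l11 * l23 - l12 * l13
    c2 = 2 * l22 * l13 - l12 * l23
    c3 = l13 * l23 - 2 * l12 * l33
    return 2 * L * L_dQ - (c0 + c1 + c2 + c3)   # should be ~0

random.seed(1)
for _ in range(100):
    counts = [random.randint(0, 20) for _ in range(6)]
    if sum(counts) > 0:
        assert abs(decomposition_gap(*counts)) < 1e-6
print("decomposition verified")
```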

With this representation it is easy to manufacture examples where $\Delta Q < 0$ and modules $M_1$ and $M_2$ will be merged in error by the CNM algorithm. Note that $l_{33}$ is the number of transitions in the rest of the graph, and recall that $c_3 = l_{13} l_{23} - 2 l_{12} l_{33}$. If $l_{12} > 0$ and $l_{33}$ is large, then $c_3 \ll 0$ and $\Delta Q < 0$. The most modular connected network is a simple example where this happens.

A second representation of $\Delta Q$ leads to a simple observation that addresses several flaws in the current CNM algorithm. Standard properties of symmetric contingency tables can be used to show that the residuals of the $3 \times 3$ contingency table $X_{1,2,3}$ satisfy
$$2 R_{11} + R_{12} + R_{13} = 0, \quad R_{12} + 2 R_{22} + R_{23} = 0, \quad R_{13} + R_{23} + 2 R_{33} = 0.$$
Analogously, the residuals of the collapsed $2 \times 2$ contingency table $X_{1 \cup 2, 3}$ satisfy $R_{1 \cup 2, 1 \cup 2} = R_{33}$.

These equations lead to another representation of $L \, \Delta Q$:
$$L \, \Delta Q = L (Q_{1,2,3} - Q_{1 \cup 2, 3}) = R_{11} + R_{22} - R_{1 \cup 2, 1 \cup 2} = R_{11} + R_{22} - R_{33} = \tfrac{1}{2}(-R_{12} - R_{13}) + \tfrac{1}{2}(-R_{12} - R_{23}) - \tfrac{1}{2}(-R_{13} - R_{23}) = -R_{12}.$$
Hence, $R_{12} > 0$ if and only if $\Delta Q < 0$.
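This second representation also checks out numerically; a small sketch (mine), where $R_{12} = l_{12} - x_1 x_2 / (2L)$ is the off-diagonal residual of the $3 \times 3$ table:

```python
# Check that L (Q_{1,2,3} - Q_{1u2,3}) = -R_12, with
# R_12 = l_12 - x1*x2/(2L) the off-diagonal residual of X_{1,2,3}.
def merge_gap(l11, l12, l13, l22, l23, l33):
    L = l11 + l12 + l13 + l22 + l23 + l33
    x1 = 2 * l11 + l12 + l13
    x2 = 2 * l22 + l12 + l23
    L_dQ = (l11 - x1**2 / (4 * L)) + (l22 - x2**2 / (4 * L)) \
           - (l11 + l12 + l22 - (x1 + x2)**2 / (4 * L))
    R12 = l12 - x1 * x2 / (2 * L)
    return L_dQ + R12                   # should be ~0

for counts in [(5, 1, 2, 6, 3, 10), (0, 4, 1, 2, 2, 7)]:
    assert abs(merge_gap(*counts)) < 1e-9
print("L*dQ == -R12")
```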

This last identity means that two modules $M_1$ and $M_2$ are candidates to be merged by the CNM algorithm if their residual $R_{12} > 0$. The greedy CNM algorithm merges the two modules $M_i$ and $M_j$ with the largest residual $R_{ij} > 0$.

CNM modification: before two modules $M_i$ and $M_j$ are merged, their residual $R_{ij} > 0$ should be deemed statistically significant. The residuals $R_{ij}$ are not directly comparable since their variances are not necessarily equal, so let
$$S_{ij} = \frac{R_{ij}}{\mathrm{std}(R_{ij})}$$
be the standardized residuals.

CNM modification: before two modules $M_i$ and $M_j$ are merged, their standardized residual $S_{ij} > 0$ should be deemed statistically significant. The greedy modified CNM algorithm will merge the two modules $M_i$ and $M_j$ with the largest statistically significant standardized residual $S_{ij} > 0$. This adjustment to the CNM algorithm will alleviate the resolution limit problem and the tendency for CNM to favor the merging of large modules.

Modularity optimization can fail to resolve communities. This failure is due to the properties of the modularity function. The CNM algorithm can be modified to avoid the resolution limit problem. Algorithms that use spectral information from the modularity matrix can detect modules when CNM fails.