Modularity and Graph Algorithms David Bader Georgia Institute of Technology Joe McCloskey National Security Agency 12 July 2010 1
Outline Modularity Optimization and the Clauset, Newman, and Moore Algorithm 2
Modularity Optimization Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs Let G represent a directed, unweighted graph with adjacency matrix A. Assume that L is the total number of edges in the network and that G has been partitioned into m disjoint sets of nodes M s, s = 1, 2,..., m. Consider the set M s and let l ss represent the number of observed transitions from M s to M s. Now define l s s to be the total number of transitions from within to outside the node set M s. The quantities l ss and l s s are defined analogously. Note that we have collapsed the adjacency matrix into a 2 2 table of transition counts where L = l ss + l s s + l ss + l s s. 3
Modularity Optimization Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs Let the row and column marginal sums of the 2 2 transition table be defined as l +s = l ss + l ss, and l s+ = l ss + l s s. We observe l ss transitions within node set M s. Under the hypothesis of independence the estimated number of transitions is as follows: ˆlss = L ( l+s L ) ( ) ls+ = l s+l +s L L. The residual is defined as R ss = l ss ˆl ss. If R ss > 0 then we observe more transitions than expected. If R ss is significantly larger than zero then we might suspect that M s is a module (cluster). 4
Modularity Optimization Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs Following CNM we let Q s = R ss /L. The set of nodes M s to said to be a module if and only if Q s > 0. Next we define the modularity function Q as follows: Q = m Q s = 1 L s=1 m R ss = 1 L s=1 m s=1 (l ss ˆl ss ). The goal of modularity optimization is to efficiently find a decomposition of G into a set of m modules M s, s = 1, 2,..., m that maximizes the modularity function Q. 5
Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs It follows that the residual R ss = l ss ˆl ss = l ss ls+l+s L only if l ss l s s l s s l ss > 0. Moreover, > 0 if and Q = 1 L 2 m s=1 (l ss l s s l s s l ss ), and M s is a module if and only if c 12 = lssl s s l ssl s s > 1, where c 12 is the cross-product ratio of a 2 2 contingency table. The expected value of c 12 equals one if and only if the rows and columns of the table are statistically independent. c 12 > 1 indicates positive dependence. 6
Modularity Optimization Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs Now let G represent an undirected graph. The adjacency matrix A is now symmetric and it follows that l s s = l ss, and L = l ss + l ss + l s s. Since there are L edges in the graph, the sum of all of the entries of the adjacency matrix is equal to N = 2L. Let X represent the 2 2 contingency table obtained by collapsing the adjacency matrix with respect to module M s. Because of symmetry x ss = 2l ss, x s s = 2l s s, and x ss = x s s = l ss = l s s. 7
Modularity Optimization Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs Let ˆm ij represent the estimated entries of table X under independence. The marginal sums of X with respect to module M s are x s = x +s = x s+ = x ss + x s s = 2l ss + l s s. The expected number of transitions within M s is seen to be ˆlss = ˆm ( ss 2 = (2L) x+s 2L As before Q = m Q s = 1 L s=1 ) ( xs+ 2L m R ss = 1 L s=1 ) ( ) 1 = (x s) 2 2 4L. m s=1 (l ss ˆl ss ). 8
Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs The residual R ss = l ss ˆl ss = l ss x2 s 4L, and M s is a module if and only if 4L ss > x 2 s. Equivalently, M s is a module if and only if 4l ss l s s l 2 s s > 0. The cross-product ratio becomes ( ) ( ) xss x s s c 12 = = x s s x s s ( 2lss l s s ) ( ) 2l s s l s s = 4l ssl s s ls s 2. 9
Modularity Optimization: Directed Graphs Modularity Optimization: Undirected Graphs The modularity function Q can be rewritten in the following useful form where the cross-product ratio is more apparent: Q = m Q s = 1 L s=1 m s=1 R ss = 1 4L 2 m s=1 ( 4lss l s s ls s 2 ). Clauset, Newman, and Moore (2004) have developed a scalable, agglomerative algorithm to maximize the modularity function. Their greedy algorithm makes use of efficient data structures and for certain networks can run in time O ( L log 2 L ). 10
The Resolution Limit The Most Modular Connected Network An Improved Resolution Limit A Different Resolution It is a-priori impossible to tell whether a module (large or small), detected through modularity optimization, is indeed a single module or a cluster of smaller modules. Fortunato and Barthelemy (2007) 11
The Resolution Limit The Most Modular Connected Network An Improved Resolution Limit A Different Resolution The Most Modular Connected Network Let G be composed of n cliques, n even, where each clique has m nodes. Each clique is connected by a single edge to an adjacent clique. The number of edges in this graph is equal to L = 1 2 nm(m 1) + n = n 2 (m(m 1) + 2). Consider two partitions of G into hypothesized modules. In the first partition every module is a clique, while in the second partition every module is a pair of adjacent cliques. Then Q 1 = 1 2 m(m 1) + 2 1 n, Q 1 2 = 1 m(m 1) + 2 2 n. 12
The Resolution Limit The Most Modular Connected Network An Improved Resolution Limit A Different Resolution The Most Modular Connected Network It follows that Q 2 > Q 1 if n > m(m 1) + 2. Since L = n 2 (m(m 1) + 2) this further implies that Q 2 > Q 1 if n 2 > 2L. If n = 25, and m = 5, then Q 2 = 0.8745 > Q 1 = 0.8691 and modularity optimization has not determined the best partition. Fortunato and Barthelemy have analyzed real world data sets that exhibit the property that the best partition does not have the maximum modularity score. 13
An Improved Resolution Limit The Resolution Limit The Most Modular Connected Network An Improved Resolution Limit A Different Resolution Berry, Hendrickson, Laviolette, and Phillips (2010) use the local topology of the graph to infer edge weights. Then they generalize the results of Fortunato and Barthelemy to show it is possible to resolve much smaller communities. They also suggest that the dendrogram constructed by the CNM algorithm can be mined to resolve smaller communities. This might be even more effective with their modified CNM algorithm. 14
A Different Resolution? The Resolution Limit The Most Modular Connected Network An Improved Resolution Limit A Different Resolution Claim: The failure of the CNM algorithm to resolve communities is due to the properties of the modularity function Q and not the information that it uses. Can CNM be salvaged? Is it possible to optimize a different objective function that uses the same information and resolves communities? 15
Two Equivalent Representations of Q Consider a graph with L edges and three modules. Let l ij represent the number of transitions from module M i to module M j. Then L = l 11 + l 12 + l 13 + l 22 + l 23 + l 33. What happens to the modularity function if modules M 1 and M 2 are merged together? We find two equivalent representations of the difference Q 1,2,3 Q 1 2,3 of the modularity functions due to the merge of M 1 and M 2. If Q 1 2,3 > Q 1,2,3 then CNM will merge M 1 and M 2. 16
Two Equivalent Representations of Q Use the three modules M 1, M 2 and M 3 to collapse the undirected graph into a 3 3 table X 1,2,3 of counts x ij. The x ij are determined from the transition counts l ij as follows: x jj = 2l jj, x ij = x ji = l ij for i j, x 1 = x 1+ = x +1 = 2l 11 + l 12 + l 13, x 2 = x 2+ = x +2 = l 12 + 2l 22 + l 23. Because of symmetry the total sample size of the x ij s is N = 2L. 17
Two Equivalent Representations of Q If we further merge M 1 and M 2 together then the collapsed graph has transition counts of the form l 11 = l 11 + l 12 + l 22, l 12 = l 21 = l 13 + l 23, l 22 = l 33. The resulting 2 2 symmetric table X 1 2,3 of counts x ij has entries x 11 = 2l 11 + 2l 12 + 2l 22, x 12 = x 21 = l 13 + l 23, x 22 = 2l 33. 18
Properties of the Modularity Function Two Equivalent Representations of Q Let Q 1,2,3 be the modularity function when modules M 1, M 2, M 3 are considered to be separate modules, and Q 1 2,3 the modularity function when modules M 1 and M 2 are merged together. Let Q = Q 1,2,3 Q 1 2,3 be the difference of the two modularity functions. If Q 1 2,3 > Q 1,2,3 then Q < 0, and CNM will merge M 1 and M 2. To understand the properties of the modularity function we find two equivalent representations of Q = Q 1,2,3 Q 1 2,3. 19
Properties of the Modularity Function Two Equivalent Representations of Q Module M 3 makes the same contribution Q 3 = R 33 /L to each of the modularity functions Q 1,2,3 and Q 1 2,3. Thus, L Q = L (Q 1,2,3 Q 1 2,3 ) = LQ 1,2,3 LQ 1 2,3 = (R 11 + R 22 + R 33 ) (R 1 2,1 2 + R 33 ) = R 11 + R 22 R 1 2,1 2. We have left to compute R 1 2,1 2 = l 11 ˆl 11. 20
Properties of the Modularity Function Two Equivalent Representations of Q Now ˆl 11 = 1 4L x 2 1, where x 1 = 2l 11 + 2l 12 + 2l 22 + l 13 + l 23. A straightforward calculation yields where we define 2L 2 Q = c 0 + c 1 + c 2 + c 3, c 0 = 4l 11 l 22 l 2 12, c 1 = 2l 11 l 23 l 12 l 13, c 2 = 2l 22 l 13 l 12 l 23, c 3 = l 13 l 23 2l 12 l 33. The c j s have cross-product interpretations as 2 2 subtables of X 1,2,3. 21
Properties of the Modularity Function Two Equivalent Representations of Q With this representation it is easy to manufacture examples where Q < 0 and modules M 1 and M 2 will be merged in error by the CNM algorithm. Note that l 33 is the number of transitions in the rest of the graph and recall that c 3 = l 13 l 23 l 12 l 33. If l 12 > 0 and l 33 is large then c 3 << 0 and Q < 0. The most modular connected network is a simple example where this happens. 22
Properties of the Modularity Function Two Equivalent Representations of Q A second representation of Q leads to a simple observation that addresses several flaws in the current CNM algorithm. Standard properties of symmetric contingency tables can be used to show that the residuals of the 3 3 contingency table X 1,2,3 satisfy 2R 11 + R 12 + R 13 = 0, R 12 + 2R 22 + R 23 = 0, R 13 + R 23 + 2R 33 = 0. Analogously, the residuals of the 2 2 collapsed contingency table X 1 2,3 satisfy R 1 2,1 2 = R 33. 23
Properties of the Modularity Function Two Equivalent Representations of Q These equations lead to another representation of L Q. L Q = L (Q 1,2,3 Q 1 2,3 ) = R 11 + R 22 R 1 2,1 2 = R 11 + R 22 R 33 = 1 2 ( R 12 R 13 ) + 1 2 ( R 12 R 23 ) 1 2 ( R 13 R 23 ) = R 12. Hence, R 12 > 0 if and only if Q < 0. 24
Two Equivalent Representations of Q This last inequality means that two modules M 1 and M 2 are candidates to be merged by the CNM algorithm if their residual R 12 > 0. The greedy CNM algorithm merges the two modules M i and M j with the largest residual R ij > 0. CNM Modification: Before two modules M i and M j are merged their residual R ij > 0 should be deemed statistically significant. The residuals R ij are not comparable since their variances are not necessarily equal. Let S ij = be standardized residuals. R ij std(r ij) 25
Two Equivalent Representations of Q CNM Modification: Before two modules M i and M j are merged their standardized residual S ij > 0 should be deemed statistically significant. The greedy modified CNM algorithm will merge the two modules M i and M j with the largest statistically significant standardized residual S ij > 0. This adjustment to the CNM algorithm will alleviate the resolution limit problem, and the tendency for CNM to favor the merging of large modules. 26
Modularity optimization can fail to resolve communities. This failure is due to the properties of the modularity function. The CNM algorithm can be modified to avoid the resolution limit problem. Algorithms that use spectral information from the modularity matrix can detect modules when CNM fails. 27