The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks

Size: px

Start display at page:

Download "The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks"

Lionel Malone
5 years ago
Views:

1 The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks Andy M. Yip Steve Horvath Abstract Systems biologic studies of gene and protein interaction networks have found that these networks are comprised of modules (groups of tightly interconnected nodes). Module identification is an essential step towards understanding the whole network architecture. Here we will focus on module identification methods that are based on using a node dissimilarity measure in conjunction with a clustering method. More specifically, we introduce a general class of node dissimilarity measures based on the notion of topological overlap, which has been found to be biologically meaningful in several applications. The resulting generalized topological overlap measure (GTOM) generalizes the standard topological overlap measure (TOM) introduced by (2002). Specifically, the m-th order version of this family is constructed by (i) counting the number of m-step neighbors that a pair of nodes share and (ii) normalizing it to take a value between 0 and 1. The main use of the GTOM measures is the identification of network modules (sets of tightly connected nodes). But it can also be used to define novel measures of node connectivity. These GTOM based connectivity measures go beyond the usual nodal degree (number of connections) by taking into account higher-order connections. We discuss the properties of the GTOM measures and provide empirical evidence that they are useful in the context of gene co-expression network analysis. We show that the original TOM proposed by (2002) favors the discovery of smaller modules whereas the higher-order GTOM s favor larger modules. Keywords: Gene Co-expression Networks, Microarrays, Gene Module, Network Distance, Cluster Analysis, Scale-free Topology Availability: An R implementation of the GTOM measures and the datasets can be obtained from the following webpage: Department of Mathematics, University of California, Los Angeles, CA , USA Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA , USA Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA , USA To whom correspondence should be addressed. shorvath@mednet.ucla.edu

2 1 Introduction Genes and their protein products carry out cellular processes in the context of functional modules (Hartwell et al., 1999). Thus it is natural to ask whether such modular organization can be revealed by gene- or protein interaction networks. Multiple approaches for defining gene modules have been proposed in the literature, e.g. (Bar-Joseph et al., 2003; Bergmann et al., 2004; Isaacs et al., 2003; Prinz et al., 2004; Segal et al., 2003; Thimm et al., 2004; Tornow and Mewes, 2003; Toyoda and Konagaya, 2003; Xu et al., 2004). Recent efforts devoted to module detection include a repeated edge-removal algorithm by (2004) where at each step the edge with the highest centrality is removed, and a hierarchical clustering method using the topological overlap matrix (TOM) as input similarities by (2002). Here we aim to generalize the topological overlap dissimilarity measure since we and others have found that it results in biologically meaningful modules (Ravasz et al., 2002; Solé and Fernández, 2003; Ye and Godzik, 2004). Further we study the properties of the resulting class of node distance measures. The Topological Overlap Measure The approach taken by Ravasz et al. on uncovering network modules essentially boils down to the problem of measuring how close pairs of nodes are in a network. Ravasz et al. demonstrate the potential of using the set of immediate neighbors (first order interactions) of a pair of nodes as a basis for measuring pair-wise similarity. Moreover, in their study of a E. coli metabolic network, they show that two substrates having a higher overlap are more likely to belong to the same functional class than substrates having a lower topological overlap. Such an interesting finding prompts the question whether a measure of topological similarity based on higher-order interactions would be useful to define more stable and larger modules. In this paper, we generalize Ravasz et al. s approach by incorporating information from higher-order interactions into the similarity measure. The organization of the rest of the paper is as follows. We start off in 1.1 with two examples showing the superiority of GTOM measures in relating functional classes with modules and in predicting lethal gene deletions in yeast. The notion of GTOM is formally introduced in 2. Methods for module detection and nodal connectivity assessment, together with illustrations in yeast microarray data are presented in 3 and 4 respectively. Finally, we give a conclusion in Motivational Example An important use of DNA microarray data is to annotate genes by clustering them on the basis of their gene expression profiles across several microarrays. Because the transcriptional response of cells to changing conditions involves the coordinated co-expression of genes encoding interacting proteins, studying co-expression patterns can provide insights into the underlying cellular processes (Eisen et al., 1998; Tamayo et al., 1999). It is standard to use the (Pearson) correlation coefficient as a co-expression measure. 2

3 To group genes with coherent expression profiles into modules, one typically uses hierarchical clustering (Kaufman and Rousseeuw, 1990) coupled with an appropriate distance measure. Here our emphasis is the distance measure and not the clustering procedure. Comparing different clustering procedures is beyond the scope of this paper. Once a dendrogram is obtained from a hierarchical clustering method, we need to choose a height cutoff in order arrive at a clustering. It is a judgement call where to cut the tree branches. When apply our proposed GTOM measures for clustering, we use a GTOM plot (see 3.2) to visually aid the choice of the cutoff. Thus the modules are found by inspection: a height cutoff value is chosen in the dendrogram such that some of the resulting branches correspond to the discrete diagonal blocks (modules) in the GTOM plot Applications to a Yeast Dataset To demonstrate how the GTOM measures are useful and biologically meaningful, we apply the proposed measures to gene co-expression networks constructed from microarray data. A dataset recording gene expression levels during different stages of cell cycles in yeasts (Spellman et al., 1998) is employed throughout the paper. Based on the expression data, the absolute pair-wise (Pearson) correlation coefficient between the expression profiles of each pair of genes is calculated. Then, a network with each node representing one gene is constructed. An edge between two nodes is present if their absolute correlation coefficient exceeds a threshold τ = 0.7. Our methods work for any threshold. A discussion for how this threshold is chosen is beyond the scope of the article. Here, we obtain the threshold τ by using the scale-free criterion proposed by (2005). Such a topology implies the existence of hub genes (Barabási and Albert, 1999) and the robustness to random perturbations (Albert and Barabási, 2000) which are biologically relevant features. The scale-free topology of the yeast network is demonstrated in Appendix. Module Detection Figure 1 shows the modules (as branches of the dendrogram) detected by applying hierarchical clustering with 3 different similarity measures: The adjacency matrix (GTOM0), Ravasz et al. s TOM (GTOM1) and a generalized TOM (GTOM2) presented in 2. Genes belong to the functional class protein biosynthesis are grouped in the vicinity of each other by the GTOM2 measure. Prediction of Lethal Gene Deletions It has been reported independently by many research groups, which used different biological networks or constructed the same type of networks in different ways, genes with high nodal degree (i.e. adjacency matrix based connectivity) are more likely to be essential, i.e. they are lethal if knocked out. Figure 2 shows boxplots of the distribution of the 3 different connectivity measures for essential and non-essential genes. (2005) proposed to use the TOMbased connectivity measure that is better able to distinguish essential from non-essential 3

4 Adjacency Matrix Standard TOM Measure New TOM Measure Height Height Height Colored by GTOM1 modules Colored by protein biosynthesis status (a) (b) (c) Figure 1: Comparisons of 3 different similarities in capturing the functional class protein biosynthesis. (a) The adjacency matrix (GTOM0). (b) Standard Ravasz et al. s TOM (GTOM1). (c) Our new generalized TOM (GTOM2). In each column, the top row shows the dendrogram obtained by applying hierarchical clustering to the corresponding similarity matrix, the middle row shows the color bar ordered by the corresponding dendrogram but colored by the module assignment with respect to the TOM measure in (b), the bottom shows the color bar ordered by the corresponding dendrogram where genes belong to the class protein biosynthesis are colored in dark red. The modules defined by the TOM are the branches of the dendrogram in (b) at the cutoff Almost all protein biosynthesis genes are grouped together by the proposed new TOM measure whereas the other two measures tend to distribute the class over two modules. 4

5 genes (Figure 2b) than the standard measure (Figure 2a). Even our novel GTOM2-based connectivity measure leads to a more significant difference (p = ) than the standard connectivity measure (p = ). Standard, p= 1.3e 09 Standard TOM, p= 2.0e 16 New TOM, p= 8.4e 10 Degree (standard) Degree (new) Degree (new) Essentiality Essentiality Essentiality (a) (b) (c) Figure 2: Comparisons of 3 different nodal connectivity measures in predicting lethal gene deletions in yeast. The connectivity measures are obtained by the column sums of the corresponding similarity matrix. (a) The adjacency matrix (GTOM0). (b) Ravasz et al. s TOM (GTOM1). (c) A generalized TOM (GTOM2). The boxplots of connectivity in each of the two groups (Essential/Non-essential) are shown. The p-values of the Kruskal-Wallis rank sum test are p = , p = and respectively. The result shows that TOM gives rise to a larger difference in the group means. 2 Generalized Topological Overlap Matrices In this section, we formally introduce the notion of generalized topological overlap for measuring pair-wise similarity. We start with a network encoded by its corresponding adjacency matrix A = [a ij ] which is a symmetric with binary entries. By convention, the diagonal elements are assumed to be zero. The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to. (2002) define the topological overlap matrix T = [t ij ] as follows t ij = { l ij +a ij min{k i,k j }+1 a ij if i j 1 if i = j. where, l ij = u a iua uj, k i = u a iu and the index u runs across all nodes of the network. 1 Basically, t ij is an indicator for the agreement between the sets of neighboring nodes of i and 1 The definition given in (Ravasz et al., 2002) is slightly different: (l ij + a ij )/ min{k i, k j }. In a personal communication with E. Ravasz, the definition in Eq. (1) is preferred, which is also given in the online supporting material of (Ravasz et al., 2002). 5 (1)

6 j. The inclusion of the term a ij in the numerator makes t ij explicitly depends on whether there is a direct link between the two nodes in question. The purpose of the quantity 1 a ij in the denominator is to avoid double-counting i as a neighbor of j and vice versa. Our generalization of the TOM is motivated by the observation that formula (1) can be expressed as t ij = { N 1 (i) N 1 (j) +a ij min{ N 1 (i), N 1 (j) }+1 a ij if i j 1 if i = j. where N 1 (i) denotes the set of neighbors of i excluding i itself and denotes the number of elements (cardinality) in its argument. The quantity N 1 (i) N 1 (j) measures the number of common neighbors that nodes i and j share whereas N 1 (i) gives the number of neighbors of i. By denoting N m (i) (with m > 0) the set of nodes (excluding i itself) that are reachable from i within a path of length m (see Figure 3), i.e., N m (i) := {j i dist(i, j) m} (3) where dist(i, j) is the geodesic distance between i and j, we obtain a very natural generalization of the TOM, which reads as follows { t [m] N m(i) N m(j) +a ij ij = min{ N m(i), N m(j) }+1 a ij if i j (4) 1 if i = j. We call the matrix T [m] = [t [m] ij ] the m-th order generalized topological overlap matrix (GTOMm). This quantity simply measures the agreement between the nodes that are reachable from i and from j within m steps. When m = 1, we obtain back the original TOM in formula (1). Since a ii = 0 and t [m] ii = 1, we can rewrite the definition above in a more compact way: t [m] ij = N m(i) N m (j) + a ij + I i=j min{ N m (i), N m (j) } + 1 a ij. (5) Here, I i=j equals to 1 if and only if i = j. It is convenient and intuitive to define as well the GTOM0 as T [0] A + I which only considers the direct link between the pair of nodes in question. An example showing different GTOM values of a graph with 11 nodes is given in Figure A Simple Method for Computing GTOM In this subsection, we present computational formulas for T [m]. We first note that the matrix A m counts the number of paths of length m connecting the pair of nodes in place (Wasserman and Faust, 1994). Here, the paths are not necessarily geodesic and may contain cycles. Hence, the matrix S [m] [s [m] ij ] := A + A A m gives 6 (2)

7 Figure 3: (a) A graph with 11 nodes. (b) The 1-step and 2-step neighbors of nodes 1 3. (c) The GTOM0, GTOM1 and GTOM2 values of the pairs (1, 2), (1, 3) and (2, 3). exactly how many distinct paths with length smaller than or equal to m connecting each pair of nodes. Thus, we have N m (i) {j i s [m] ij > 0}. If we define a binary matrix B [m] to be { b [m] [m] 1 if s ij = ij > 0 and i j, 0 otherwise then N m (i) {j i b [m] ij = 1}. Thus, to obtain the value N m (i) N m (j), we simply take the inner product of the i-th and the j-th columns of B [m] which can be obtained from the matrix ( B [m]) 2 [ Nm (i) N m (j) ] because of the symmetry of B [m]. In particular, N m (i) is given by the i diagonal entry of ( B [m]) 2. These values can then be used to compute T [m] using formula (4). In the actually implementation, one may recursively update S [m] by the formula A(S [m 1] + I). 2.2 Properties of GTOM Each entry of GTOMm lies between 0 (no common neighbors, no direct link) and 1 (one set of neighbors is a subset of the other, has a direct link). Moreover, it increases with the number of common m-step neighbors. This can be seen by the following observations. Suppose two nodes i and j share c number of common m-step neighbors, not counting i and j, i.e., N m (i) N m (j) = c, c.f. Eq. (3). Without loss of generality, assume that N m (i) N m (j). If there is a direct link between i and j (implying i j), then the m-th order generalized topological overlap between i and j is given by t [m] ij = (c + 1)/ N m (i) where 0 c N m (i) 1. On the other hand, if there is no direct link between i and j, then we have t [m] ij = (c + I i=j )/( N m (i) + 1) where 0 c N m (i). Thus the results follow. 7

8 Next, we consider the situation where every pair of nodes are reachable to each other within m steps. This requires m to be large enough and the network only has one component. In this case, we have N m (i) = n 1 and N m (i) N m (j) = n 2 for all i and j where n is the size of network. Thus, t [m] ij = n 2 + a ij + 2I i=j n a ij = 1 2 n a ij + 2(a ij + I i=j ) n a ij. When n is large, we have T [m] (1 2n 1 )1+2n 1 (A+I), and therefore, GTOMm behaves like GTOM0 ( A + I). Since most biological networks have an average path length of 2 to 6 (Newman, 2003), it is expected that GTOMm is useful for small values of m, but may depend on the size and nature of the network. (2002) provide empirical evidence that GTOM1 yields biologically meaningful results. In our motivational example, we provide evidence that GTOM2 can lead to biologically more meaningful results than GTOM1 when dealing with large modules. 3 Module Detection 3.1 A GTOM-based Dissimilarity Measure A rationale for considering the generalized topological overlap matrix is that nodes that are part of highly interconnected modules have high topological overlap with their neighbors. The GTOMm T [m] is a similarity measure (Kaufman and Rousseeuw, 1990) since it is non-negative, symmetric and each entry increases with the relative overlapping between two corresponding sets of neighbors. To turn it into a dissimilarity measure, it is subtracted from one, i.e, the generalized topological overlap-based dissimilarity measure is defined by d T,[m] ij = 1 t [m] ij. (6) Next, we described two common dissimilarity measures which are used for comparison with our proposed measures. We have found that a power of the correlation coefficient can sometimes approximate the TOM measure quite well, which is why we consider it here. Specifically, we consider the following class of correlation based dissimilarities: d C,[p] ij = 1 ρ ij p. (7) Here, ρ ij is the (Pearson) correlation coefficient between the expression profiles of i and j. Setting p = 1 yields the absolute correlation coefficient which is in standard use for clustering genes. We also consider p = 6 since we find that the resulting distance is highly related to GTOM1 in the yeast dataset. Our next example shows the relationship between these dissimilarities. 8

9 3.1.1 Comparing the Dissimilarity Measures Figure 4 shows the correlation between six dissimilarity measures, GTOM-based d T,[m] for m = 0, 1, 2, 3 and correlation-based d C,[p] for p = 1, 6, when applied to the yeast dataset. For this dataset, we arrive at the following results. First, d C,[6] is highly correlated (> 0.8) with the lower-order GTOM dissimilarities, d T,[0] and d T,[1]. This is in concordance with the results reported in (Zhang and Horvath, 2005) which uses a cancer dataset. The other correlation-based measure d C,[1] is highly correlated (0.79) with d T,[2]. Second, the higher-order GTOM dissimilarities d T,[2] and d T,[3] show a high correlation of Third, two GTOM-based dissimilarities are moderately correlated (< 0.4) if their orders differ by 2 or more. Finally, the frequency distribution of d T,[3] is concentrated around 0 while that of the others are concentrated around 1. This shows that many pairs of nodes considered as dissimilar under lower-order GTOM measures become similar when measured with higherorder ones. We visualize the dissimilarity measures using classical multi-dimensional scaling plots (Cox and Cox, 2001), see 3.3. But to define modules, we use hierarchical clustering and topological overlap matrix plots (Ravasz et al., 2002) which are reviewed next. 3.2 Hierarchical Clustering and GTOM Plots In networks involving few nodes, modules can easily be identified by inspecting the network but for large networks involving hundreds of nodes, it is useful to provide a reduced view of the network. Towards this end, it has been suggested to use a topological overlap matrix plot (TOM or GTOM plot) which is a color-coded depiction of the values of [d T,[m] ij ], i.e. the TOM based dissimilarity measure described above (Ravasz et al., 2002) Comparing the GTOM Plots When detecting modules using hierarchical clustering, we use GTOM plots to aid the choice of the dendrogram s height cutoff. The four GTOM plots corresponding to the zeroth- to third-order GTOM are shown in Figure 5. The dataset used here is the same as the one in Figure 1. Red/yellow indicate low/high values of d T,[m] ij. Both rows and columns of [d T,[m] ij ] have been sorted using the hierarchical clustering tree. Since [d T,[m] ij ] is symmetric, the GTOM plot is also symmetric around the diagonal. Since modules are sets of nodes with high (generalized) topological overlap, modules correspond to red squares along the diagonal. As in all hierarchical clustering analysis, it is a judgement call where to cut the tree branches. Thus the modules are found by inspection: a height cutoff value is chosen in the dendrogram such that some of the resulting branches correspond to the discrete blocks of red (modules) in the GTOM plot. As we can see in Figure 5, modules are more pronounced in the GTOM2 and GTOM3 plots (larger contrast between the diagonal blocks and off-diagonal blocks). Smaller modules (as diagonal blocks of red) are more visible in GTOM0 and GTOM1 plots whereas larger modules are more pronounced in GTOM2 and GTOM3 plots. 9

10 Figure 4: Pair-wise scatter plots between various dissimilarity measures. The upper triangular panel shows the scatter plots, the lower triangular panel shows the corresponding Pearson correlation coefficients, the diagonal panel shows the frequency distributions of the dissimilarities. Correlation-based dissimilarities d C,[p] are denoted by disscorp. GTOM-based dissimilarities d T,[m] are denoted by dissgtomm. For comparison purposes, a color bar is shown on the top on each GTOM plot. The color bar is ordered by the respective dendrogram and colored by the GTOM1 module assignment (c.f. Figure 1b). To show the usefulness of modules for class discovery, a second color bar is included on the left of the heatmap. Here, dark red indicates the membership to the class protein biosynthesis. Genes belong to other classes (or unknown) are depicted by a gray color in the bar. We observe that genes belong to the protein biosynthesis class are grouped in the vicinity of each other in the GTOM2 and GTOM3 plots. 3.3 Multi-dimensional Scaling Plots To visualize the GTOM-based dissimilarities, we use classical multi-dimensional scaling plots, which are shown in Figure 6. All the plots are color-coded according to the modules with 10

(a) (b) (c) (d) Figure 5: GTOM plots of various GTOM-based dissimilarities. (a) GTOM0. (b) GTOM1.

The color bar on the top of each heatmap shows the module assignment obtained from GTOM1.

genes. Dark red indicates the membership to the class protein biosynthesis.

11 (a) (b) (c) (d) Figure 5: GTOM plots of various GTOM-based dissimilarities. (a) GTOM0. (b) GTOM1. (c) GTOM2. (d) GTOM3. The color bar on the top of each heatmap shows the module assignment obtained from GTOM1. The color bar on the left of each heatmap shows the functional category of the corresponding genes. Dark red indicates the membership to the class protein biosynthesis. Modules are more pronounced in the GTOM2 and GTOM3 plots (larger contrast between the diagonal blocks and off-diagonal blocks). Smaller modules (as diagonal blocks of red) are more visible in GTOM0 and GTOM1 plots whereas larger modules are more respected in GTOM2 and GTOM3 plots. Genes belong to the protein biosynthesis class are grouped in the vicinity of each other in the GTOM2 and GTOM3 plots. 11

12 respect to GTOM1 depicted in Figure 1. The relative position of the points are well-preserved as we can see that points having the same color are almost always clustered together. To make the plots more prolific, we encode some functional category information into the plots. Genes that belong to the class protein biosynthesis are depicted by the symbol. Other genes are denoted by a. Interestingly, almost all protein biosynthesis genes are in the vicinity of each other and they are colored in black, brown and gray. The plot using the GTOM1 dissimilarity in Figure 6(b) shows a more clear separation between the black and brown modules. 4 Measuring Nodal Connectivity One typically measures the connectivity of a node by its degree (Wasserman and Faust, 1994) defined by the number of links that the node possesses. Such a connectivity is found useful in indicating the centrality of a node in a network (Freeman, 1978) and in characterizing the scale-free structure of a network (Barabási and Albert, 1999). Moreover, it has been recently used for prediction of lethal gene deletions in yeast using various biological networks (Butte et al., 2000; Jeong et al., 2001; Jeong et al., 2003; Rho et al., 2003). It turns out that, beyond module detection, the GTOM similarity can be used to obtain a nodal connectivity measure in a way analog to obtaining degree from the adjacency matrix. As GTOM encodes higher-order interactions, the proposed connectivity introduced below also inherits such a property. 4.1 A GTOM-based Connectivity Measure The usual nodal degree can be obtained from the column sum of the adjacency matrix (with 0 s on the main diagonal). (2005) have found empirical evidence that a GTOM1-based connectivity measure may be biologically more meaningful than the standard connectivity. Here we generalize their connectivity measure to the m-th GTOM-based connectivity as follows: k T,[m] i := n j=1,j i t [m] ij. (8) Here, n is the number of nodes in the network. Since each t [m] is bounded by 0 and 1, the defined connectivity is bounded by 0 and n 1. Moreover, when m = 0, i.e., the adjacency matrix is used as the similarity matrix, we get back to the usual nodal degree. The connectivity measures based on similarity measured by correlation coefficients is similarly defined: k C,[p] i := n j=1,j i ρ p ij. (9) In 4.1.1, we compare the performance of various connectivity measures on predicting genes essentiality in yeast. The results show that these connectivity measures themselves 12

13 Dimension Dimension Dimension 1 (a) Dimension 1 (b) Dimension Dimension Dimension 1 (c) Dimension 1 (d) Figure 6: Multi-dimensional scaling plots of various GTOM-based dissimilarities. (a) GTOM0. (b) GTOM1. (c) GTOM2. (d) GTOM3. The coloring scheme is used to reflect the 7 modules shown in Figure 1(b) detected by using hierarchical clustering on the GTOM1-based dissimilarity. The symbol denotes genes that belong to the functional category protein biosynthesis. Genes belong to other classes are denoted by a. The overall cluster structures are generally preserved across the different GTOM measures. But the spatial distributions of points vary to a large extent. Genes in the protein biosynthesis class are clustered together. 13

14 are quite correlated to each other but k T,[1] gives a better prediction accuracy Comparing the Connectivity Measures In addition to the example in Figure 2, we now present some further comparisons of the connectivity measures k T,[m] and k C,[p] defined in formulae (8) and (9) respectively. In Figure 7, the pair-wise scatterplots of the different connectivity measures is presented for the yeast dataset. Overall, the connectivities are highly correlated. Specifically, we obtain the following results. First, k C,[6] is highly correlated (> 0.9) with k T,[0] and k T,[1] whereas k C,[1] is highly correlated (> 0.9) with k T,[2]. Second, k T,[2] is highly correlated (0.85) with k T,[3]. Third, k T,[m] s are only moderately correlated (< 0.5) if their orders differ by at least 2. Finally, an interesting observation is that the frequency distribution of k C,[1] and k T,[2] tend to be bimodal whereas the other GTOM connectivity measures and k C,[6] are unimodal. Table 1 shows the (Spearman) correlation coefficient between each connectivity and essentiality (a binary variable). The commonly used degree measure k T,[0] shows a correlation of Interestingly, the GTOM1 connectivity has a highly correlation which is Table 1: Prediction of lethal gene deletions in yeast by various connectivity measures. The first row shows the Spearman correlation coefficients s between each connectivity measure and essentiality (a binary variable). The second row shows the corresponding p-value. The GTOM1-based connectivity has the best performance. In particular, GTOM1 is better than the widely used one k T,[0] (i.e. degree). k C,[1] k C,[6] k T,[0] k T,[1] k T,[2] k T,[3] s p 3E 6 2E 11 6E 10 4E 17 4E 10 9E 3 5 DISCUSSION A class of natural generalizations of the widely used TOM similarity, called generalized topological overlap, is proposed. This class of new measures is constructed by counting the number of m-step neighbors that a pair of nodes share and then normalized suitably. The proposed GTOM measure is independent of the number of paths and the number of geodesic paths connecting a node and its m-step neighbor. In the literature, several authors have found GTOM1 useful in defining modules. However, we provide evidence that GTOM2 is also capable of identifying cohesive and biologically plausible modules, especially when one is interested in larger modules such as the class of protein biosynthesis genes in our examples. In general, the GTOM measures with lower 14

15 Figure 7: Pair-wise scatter plots between various connectivity measures. The upper triangular panel shows the scatter plots, the lower triangular panel shows the corresponding Pearson correlation coefficients, the diagonal panel shows the frequency distributions of the connectivity measures. Correlation-based connectivity k C,[p] are denoted by kcorp. GTOM-based connectivity measures k T,[m] are denoted by kgtomm. 15

16 orders allow discovery of modules in finer scales while those with higher orders favor discovery of coarser-scale modules. We also show analytically that GTOMm is similar to the adjacency matrix (GTOM0) if m is larger than the diameter of the network. Thus GTOMm will be useful in networks with moderate or large degree of separation (average path length between any pair of nodes). TOM plots are widely used to visualize modularity in networks. But we demonstrate that classical MDS plots are particularly handy for visualizing the relative distance between the nodes. We also observe from the MDS plots that there is a tendency of consolidation as the order of the GTOM measure increases. This phenomenon can be seen in Figure 6(d) where GTOM3 is used and a few sinks (points of attraction) are formed. Intuitively speaking, each cluster becomes tighter whereas different clusters are more separated. Another class of importance measures, nodal connectivity, is also constructed based on the proposed similarity measures. Its application in predicting essential genes for yeast s survival shows that the generalized connectivity could lead to a better prediction performance. Acknowledgement The authors would like to thank Jun Dong, Ai Li, Bin Zhang, Marc Carlson, Stan Nelson, and Paul Mischel for their helpful discussions. Appendix: Scale-free Topology In Figure 8, the frequency distribution p(k) against the nodal degree k is plotted with both axes in the logarithm scale. A straight line relationship in the graph would indicate a scalefree topology, i.e., the frequency distribution follows a power law: p(k) k γ (Barabási and Albert, 1999). However, this relationship is often only approximately satisfied in practice and is better explained by an exponentially truncated power law p(k) k γ e αk (Csányi and Szendrői, 2004). References Albert, R. and Barabási, A. (2000). Error and attack tolerance of complex networks. Nature, 406(6794), Bar-Joseph, Z, Gerber, G K, Lee, T I, Rinaldi, N J, Yoo, J Y, Robert, F, Gordon, D B, Fraenkel, E, Jaakkola, T S, Young, R A and Gifford, D K. (2003). Computational discovery of gene modules and regulatory networks. Nat biotechnol, 21(11), Evaluation Studies Journal Article. 16

17 tau= 0.7, scale R^2= 0.91, trunc. R^2= log10(p(k)) log10(k) Figure 8: Log-log plot between frequency of connectivity p(k) and connectivity k for determining whether the network exhibits scale-free topology. In this gene co-expression network, a Pearson correlation cut-off of 0.7 results in an approximate straight line relationship: the red linear regression line leads to a fitting index of R 2 = The black line represents the fit of an exponentially truncated power law, which leads to a fitting index of R 2 = Barabási, A. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286, Bergmann, S., Ihmels, J. and Barkai, N. (2004). Similarities and differences in genome-wide expression data of six organisms. Plos biology, 2(1), Butte, A., Tamayo, P., Slonim, D., Golub, T. and Kohane, I. (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relavance networks. Pnas, 97, Cox, T. and Cox, M. (2001). Multidimensional scaling. Boca Raton: Chapman and Hall/CRC. Csányi, G. and Szendrői, B. (2004). Structure of a large social network. Physical review e, 69(3), Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Pnas, 95(25), Freeman, L. (1978). Centrality in social networks: Conceptual clarification. Social networks, 1,

18 Hartwell, L., Hopefield, J., S., Leibler and Murray, A. (1999). From molecular to modular cell biology. Nature, 402(6761 Suppl), C Isaacs, F J, Hasty, J, Cantor, C R and Collins, J J. (2003). Prediction and measurement of an autoregulatory genetic module. Proc natl acad sci u s a, 100(13), Journal Article. Jeong, H., Mason, S., Barabási, A. and Oltvai, Z. (2001). Lethality and centrality in protein networks. Nature, 411(6833), Jeong, H., Oltvai, Z. and Barabási, A. (2003). Prediction of protein essetiality based on genome data. Complexus, 1, Kaufman, Leonard and Rousseeuw, Peter. (1990). Finding groups in data: An introduction to cluster analysis. New York: John Wiley & Sons, Inc. Newman, M. (2003). The structure and function of complex networks. Siam review, 45(2), Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review e, 69(2), Prinz, S, Avila-Campillo, I, Aldridge, C, Srinivasan, A, Dimitrov, K, Siegel, A F and Galitski, T. (2004). Control of yeast filamentous-form growth by modules in an integrated molecular network. Genome res, 14(3), Journal Article. Ravasz, E., Somera, A., Mongru, D., Oltvai, Z. and Barabási, A. (2002). Hierarchical organization of modularity in metabolic networks. Science, 297(5586), Rho, K., Jeong, H. and Kahng, B. (2003). Identification of essential and functionally moduled genes through the microarray assay. Preprint. Segal, E, Shapira, M, Regev, A, Pe er, D, Botstein, D, Koller, D and Friedman, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat genet, 34(2), Journal Article. Solé, R. and Fernández, P. (2003). Modularity for free in genome architecture? Preprint. Spellman, P., Sherlock, G., Zhang, M., Iyer, V. and Anders, K. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. biol. cell, 9(12), Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic. Pnas, 96,

19 Thimm, O, Blasing, O, Gibon, Y, Nagel, A, Meyer, S, Kruger, P, Selbig, J, Muller, L A, Rhee, S Y and Stitt, M. (2004). Mapman: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant j, 37(6), Journal Article. Tornow, S and Mewes, H W. (2003). Functional modules by relating protein interaction networks and gene expression. Nucleic acids res, 31(21), Journal Article. Toyoda, T and Konagaya, A. (2003). Knowledgeeditor: a new tool for interactive modeling and analyzing biological pathways based on microarray data. Bioinformatics, 19(3), Journal Article. Wasserman, S. and Faust, K. (1994). Social network analysis: Methods and applications. Structural Analysis in the Social Science. New York: Cambridge University Press. Xu, X, Wang, L and Ding, D. (2004). Learning module networks from genome-wide location and expression data. Febs lett, 578(3), Journal Article. Ye, Y. and Godzik, A. (2004). Comparative analysis of protein domain organization. Genome Biology, 14(3), Zhang, B. and Horvath, S. (2005). Using the scale-free topology criterion for constructing weighted gene co-expression networks. 19

A Geometric Interpretation of Gene Co-Expression Network Analysis. Steve Horvath, Jun Dong

A Geometric Interpretation of Gene Co-Expression Network Analysis Steve Horvath, Jun Dong Outline Network and network concepts Approximately factorizable networks Gene Co-expression Network Eigengene Factorizability,