The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks

Size: px
Start display at page:

Download "The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks"

Transcription

1 The Generalized Topological Overlap Matrix For Detecting Modules in Gene Networks Andy M. Yip Steve Horvath Abstract Systems biologic studies of gene and protein interaction networks have found that these networks are comprised of modules (groups of tightly interconnected nodes). Module identification is an essential step towards understanding the whole network architecture. Here we will focus on module identification methods that are based on using a node dissimilarity measure in conjunction with a clustering method. More specifically, we introduce a general class of node dissimilarity measures based on the notion of topological overlap, which has been found to be biologically meaningful in several applications. The resulting generalized topological overlap measure (GTOM) generalizes the standard topological overlap measure (TOM) introduced by (2002). Specifically, the m-th order version of this family is constructed by (i) counting the number of m-step neighbors that a pair of nodes share and (ii) normalizing it to take a value between 0 and 1. The main use of the GTOM measures is the identification of network modules (sets of tightly connected nodes). But it can also be used to define novel measures of node connectivity. These GTOM based connectivity measures go beyond the usual nodal degree (number of connections) by taking into account higher-order connections. We discuss the properties of the GTOM measures and provide empirical evidence that they are useful in the context of gene co-expression network analysis. We show that the original TOM proposed by (2002) favors the discovery of smaller modules whereas the higher-order GTOM s favor larger modules. Keywords: Gene Co-expression Networks, Microarrays, Gene Module, Network Distance, Cluster Analysis, Scale-free Topology Availability: An R implementation of the GTOM measures and the datasets can be obtained from the following webpage: Department of Mathematics, University of California, Los Angeles, CA , USA Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, CA , USA Department of Biostatistics, School of Public Health, University of California, Los Angeles, CA , USA To whom correspondence should be addressed. shorvath@mednet.ucla.edu

2 1 Introduction Genes and their protein products carry out cellular processes in the context of functional modules (Hartwell et al., 1999). Thus it is natural to ask whether such modular organization can be revealed by gene- or protein interaction networks. Multiple approaches for defining gene modules have been proposed in the literature, e.g. (Bar-Joseph et al., 2003; Bergmann et al., 2004; Isaacs et al., 2003; Prinz et al., 2004; Segal et al., 2003; Thimm et al., 2004; Tornow and Mewes, 2003; Toyoda and Konagaya, 2003; Xu et al., 2004). Recent efforts devoted to module detection include a repeated edge-removal algorithm by (2004) where at each step the edge with the highest centrality is removed, and a hierarchical clustering method using the topological overlap matrix (TOM) as input similarities by (2002). Here we aim to generalize the topological overlap dissimilarity measure since we and others have found that it results in biologically meaningful modules (Ravasz et al., 2002; Solé and Fernández, 2003; Ye and Godzik, 2004). Further we study the properties of the resulting class of node distance measures. The Topological Overlap Measure The approach taken by Ravasz et al. on uncovering network modules essentially boils down to the problem of measuring how close pairs of nodes are in a network. Ravasz et al. demonstrate the potential of using the set of immediate neighbors (first order interactions) of a pair of nodes as a basis for measuring pair-wise similarity. Moreover, in their study of a E. coli metabolic network, they show that two substrates having a higher overlap are more likely to belong to the same functional class than substrates having a lower topological overlap. Such an interesting finding prompts the question whether a measure of topological similarity based on higher-order interactions would be useful to define more stable and larger modules. In this paper, we generalize Ravasz et al. s approach by incorporating information from higher-order interactions into the similarity measure. The organization of the rest of the paper is as follows. We start off in 1.1 with two examples showing the superiority of GTOM measures in relating functional classes with modules and in predicting lethal gene deletions in yeast. The notion of GTOM is formally introduced in 2. Methods for module detection and nodal connectivity assessment, together with illustrations in yeast microarray data are presented in 3 and 4 respectively. Finally, we give a conclusion in Motivational Example An important use of DNA microarray data is to annotate genes by clustering them on the basis of their gene expression profiles across several microarrays. Because the transcriptional response of cells to changing conditions involves the coordinated co-expression of genes encoding interacting proteins, studying co-expression patterns can provide insights into the underlying cellular processes (Eisen et al., 1998; Tamayo et al., 1999). It is standard to use the (Pearson) correlation coefficient as a co-expression measure. 2

3 To group genes with coherent expression profiles into modules, one typically uses hierarchical clustering (Kaufman and Rousseeuw, 1990) coupled with an appropriate distance measure. Here our emphasis is the distance measure and not the clustering procedure. Comparing different clustering procedures is beyond the scope of this paper. Once a dendrogram is obtained from a hierarchical clustering method, we need to choose a height cutoff in order arrive at a clustering. It is a judgement call where to cut the tree branches. When apply our proposed GTOM measures for clustering, we use a GTOM plot (see 3.2) to visually aid the choice of the cutoff. Thus the modules are found by inspection: a height cutoff value is chosen in the dendrogram such that some of the resulting branches correspond to the discrete diagonal blocks (modules) in the GTOM plot Applications to a Yeast Dataset To demonstrate how the GTOM measures are useful and biologically meaningful, we apply the proposed measures to gene co-expression networks constructed from microarray data. A dataset recording gene expression levels during different stages of cell cycles in yeasts (Spellman et al., 1998) is employed throughout the paper. Based on the expression data, the absolute pair-wise (Pearson) correlation coefficient between the expression profiles of each pair of genes is calculated. Then, a network with each node representing one gene is constructed. An edge between two nodes is present if their absolute correlation coefficient exceeds a threshold τ = 0.7. Our methods work for any threshold. A discussion for how this threshold is chosen is beyond the scope of the article. Here, we obtain the threshold τ by using the scale-free criterion proposed by (2005). Such a topology implies the existence of hub genes (Barabási and Albert, 1999) and the robustness to random perturbations (Albert and Barabási, 2000) which are biologically relevant features. The scale-free topology of the yeast network is demonstrated in Appendix. Module Detection Figure 1 shows the modules (as branches of the dendrogram) detected by applying hierarchical clustering with 3 different similarity measures: The adjacency matrix (GTOM0), Ravasz et al. s TOM (GTOM1) and a generalized TOM (GTOM2) presented in 2. Genes belong to the functional class protein biosynthesis are grouped in the vicinity of each other by the GTOM2 measure. Prediction of Lethal Gene Deletions It has been reported independently by many research groups, which used different biological networks or constructed the same type of networks in different ways, genes with high nodal degree (i.e. adjacency matrix based connectivity) are more likely to be essential, i.e. they are lethal if knocked out. Figure 2 shows boxplots of the distribution of the 3 different connectivity measures for essential and non-essential genes. (2005) proposed to use the TOMbased connectivity measure that is better able to distinguish essential from non-essential 3

4 Adjacency Matrix Standard TOM Measure New TOM Measure Height Height Height Colored by GTOM1 modules Colored by protein biosynthesis status (a) (b) (c) Figure 1: Comparisons of 3 different similarities in capturing the functional class protein biosynthesis. (a) The adjacency matrix (GTOM0). (b) Standard Ravasz et al. s TOM (GTOM1). (c) Our new generalized TOM (GTOM2). In each column, the top row shows the dendrogram obtained by applying hierarchical clustering to the corresponding similarity matrix, the middle row shows the color bar ordered by the corresponding dendrogram but colored by the module assignment with respect to the TOM measure in (b), the bottom shows the color bar ordered by the corresponding dendrogram where genes belong to the class protein biosynthesis are colored in dark red. The modules defined by the TOM are the branches of the dendrogram in (b) at the cutoff Almost all protein biosynthesis genes are grouped together by the proposed new TOM measure whereas the other two measures tend to distribute the class over two modules. 4

5 genes (Figure 2b) than the standard measure (Figure 2a). Even our novel GTOM2-based connectivity measure leads to a more significant difference (p = ) than the standard connectivity measure (p = ). Standard, p= 1.3e 09 Standard TOM, p= 2.0e 16 New TOM, p= 8.4e 10 Degree (standard) Degree (new) Degree (new) Essentiality Essentiality Essentiality (a) (b) (c) Figure 2: Comparisons of 3 different nodal connectivity measures in predicting lethal gene deletions in yeast. The connectivity measures are obtained by the column sums of the corresponding similarity matrix. (a) The adjacency matrix (GTOM0). (b) Ravasz et al. s TOM (GTOM1). (c) A generalized TOM (GTOM2). The boxplots of connectivity in each of the two groups (Essential/Non-essential) are shown. The p-values of the Kruskal-Wallis rank sum test are p = , p = and respectively. The result shows that TOM gives rise to a larger difference in the group means. 2 Generalized Topological Overlap Matrices In this section, we formally introduce the notion of generalized topological overlap for measuring pair-wise similarity. We start with a network encoded by its corresponding adjacency matrix A = [a ij ] which is a symmetric with binary entries. By convention, the diagonal elements are assumed to be zero. The topological overlap of two nodes reflects their similarity in terms of the commonality of the nodes they connect to. (2002) define the topological overlap matrix T = [t ij ] as follows t ij = { l ij +a ij min{k i,k j }+1 a ij if i j 1 if i = j. where, l ij = u a iua uj, k i = u a iu and the index u runs across all nodes of the network. 1 Basically, t ij is an indicator for the agreement between the sets of neighboring nodes of i and 1 The definition given in (Ravasz et al., 2002) is slightly different: (l ij + a ij )/ min{k i, k j }. In a personal communication with E. Ravasz, the definition in Eq. (1) is preferred, which is also given in the online supporting material of (Ravasz et al., 2002). 5 (1)

6 j. The inclusion of the term a ij in the numerator makes t ij explicitly depends on whether there is a direct link between the two nodes in question. The purpose of the quantity 1 a ij in the denominator is to avoid double-counting i as a neighbor of j and vice versa. Our generalization of the TOM is motivated by the observation that formula (1) can be expressed as t ij = { N 1 (i) N 1 (j) +a ij min{ N 1 (i), N 1 (j) }+1 a ij if i j 1 if i = j. where N 1 (i) denotes the set of neighbors of i excluding i itself and denotes the number of elements (cardinality) in its argument. The quantity N 1 (i) N 1 (j) measures the number of common neighbors that nodes i and j share whereas N 1 (i) gives the number of neighbors of i. By denoting N m (i) (with m > 0) the set of nodes (excluding i itself) that are reachable from i within a path of length m (see Figure 3), i.e., N m (i) := {j i dist(i, j) m} (3) where dist(i, j) is the geodesic distance between i and j, we obtain a very natural generalization of the TOM, which reads as follows { t [m] N m(i) N m(j) +a ij ij = min{ N m(i), N m(j) }+1 a ij if i j (4) 1 if i = j. We call the matrix T [m] = [t [m] ij ] the m-th order generalized topological overlap matrix (GTOMm). This quantity simply measures the agreement between the nodes that are reachable from i and from j within m steps. When m = 1, we obtain back the original TOM in formula (1). Since a ii = 0 and t [m] ii = 1, we can rewrite the definition above in a more compact way: t [m] ij = N m(i) N m (j) + a ij + I i=j min{ N m (i), N m (j) } + 1 a ij. (5) Here, I i=j equals to 1 if and only if i = j. It is convenient and intuitive to define as well the GTOM0 as T [0] A + I which only considers the direct link between the pair of nodes in question. An example showing different GTOM values of a graph with 11 nodes is given in Figure A Simple Method for Computing GTOM In this subsection, we present computational formulas for T [m]. We first note that the matrix A m counts the number of paths of length m connecting the pair of nodes in place (Wasserman and Faust, 1994). Here, the paths are not necessarily geodesic and may contain cycles. Hence, the matrix S [m] [s [m] ij ] := A + A A m gives 6 (2)

7 Figure 3: (a) A graph with 11 nodes. (b) The 1-step and 2-step neighbors of nodes 1 3. (c) The GTOM0, GTOM1 and GTOM2 values of the pairs (1, 2), (1, 3) and (2, 3). exactly how many distinct paths with length smaller than or equal to m connecting each pair of nodes. Thus, we have N m (i) {j i s [m] ij > 0}. If we define a binary matrix B [m] to be { b [m] [m] 1 if s ij = ij > 0 and i j, 0 otherwise then N m (i) {j i b [m] ij = 1}. Thus, to obtain the value N m (i) N m (j), we simply take the inner product of the i-th and the j-th columns of B [m] which can be obtained from the matrix ( B [m]) 2 [ Nm (i) N m (j) ] because of the symmetry of B [m]. In particular, N m (i) is given by the i diagonal entry of ( B [m]) 2. These values can then be used to compute T [m] using formula (4). In the actually implementation, one may recursively update S [m] by the formula A(S [m 1] + I). 2.2 Properties of GTOM Each entry of GTOMm lies between 0 (no common neighbors, no direct link) and 1 (one set of neighbors is a subset of the other, has a direct link). Moreover, it increases with the number of common m-step neighbors. This can be seen by the following observations. Suppose two nodes i and j share c number of common m-step neighbors, not counting i and j, i.e., N m (i) N m (j) = c, c.f. Eq. (3). Without loss of generality, assume that N m (i) N m (j). If there is a direct link between i and j (implying i j), then the m-th order generalized topological overlap between i and j is given by t [m] ij = (c + 1)/ N m (i) where 0 c N m (i) 1. On the other hand, if there is no direct link between i and j, then we have t [m] ij = (c + I i=j )/( N m (i) + 1) where 0 c N m (i). Thus the results follow. 7

8 Next, we consider the situation where every pair of nodes are reachable to each other within m steps. This requires m to be large enough and the network only has one component. In this case, we have N m (i) = n 1 and N m (i) N m (j) = n 2 for all i and j where n is the size of network. Thus, t [m] ij = n 2 + a ij + 2I i=j n a ij = 1 2 n a ij + 2(a ij + I i=j ) n a ij. When n is large, we have T [m] (1 2n 1 )1+2n 1 (A+I), and therefore, GTOMm behaves like GTOM0 ( A + I). Since most biological networks have an average path length of 2 to 6 (Newman, 2003), it is expected that GTOMm is useful for small values of m, but may depend on the size and nature of the network. (2002) provide empirical evidence that GTOM1 yields biologically meaningful results. In our motivational example, we provide evidence that GTOM2 can lead to biologically more meaningful results than GTOM1 when dealing with large modules. 3 Module Detection 3.1 A GTOM-based Dissimilarity Measure A rationale for considering the generalized topological overlap matrix is that nodes that are part of highly interconnected modules have high topological overlap with their neighbors. The GTOMm T [m] is a similarity measure (Kaufman and Rousseeuw, 1990) since it is non-negative, symmetric and each entry increases with the relative overlapping between two corresponding sets of neighbors. To turn it into a dissimilarity measure, it is subtracted from one, i.e, the generalized topological overlap-based dissimilarity measure is defined by d T,[m] ij = 1 t [m] ij. (6) Next, we described two common dissimilarity measures which are used for comparison with our proposed measures. We have found that a power of the correlation coefficient can sometimes approximate the TOM measure quite well, which is why we consider it here. Specifically, we consider the following class of correlation based dissimilarities: d C,[p] ij = 1 ρ ij p. (7) Here, ρ ij is the (Pearson) correlation coefficient between the expression profiles of i and j. Setting p = 1 yields the absolute correlation coefficient which is in standard use for clustering genes. We also consider p = 6 since we find that the resulting distance is highly related to GTOM1 in the yeast dataset. Our next example shows the relationship between these dissimilarities. 8

9 3.1.1 Comparing the Dissimilarity Measures Figure 4 shows the correlation between six dissimilarity measures, GTOM-based d T,[m] for m = 0, 1, 2, 3 and correlation-based d C,[p] for p = 1, 6, when applied to the yeast dataset. For this dataset, we arrive at the following results. First, d C,[6] is highly correlated (> 0.8) with the lower-order GTOM dissimilarities, d T,[0] and d T,[1]. This is in concordance with the results reported in (Zhang and Horvath, 2005) which uses a cancer dataset. The other correlation-based measure d C,[1] is highly correlated (0.79) with d T,[2]. Second, the higher-order GTOM dissimilarities d T,[2] and d T,[3] show a high correlation of Third, two GTOM-based dissimilarities are moderately correlated (< 0.4) if their orders differ by 2 or more. Finally, the frequency distribution of d T,[3] is concentrated around 0 while that of the others are concentrated around 1. This shows that many pairs of nodes considered as dissimilar under lower-order GTOM measures become similar when measured with higherorder ones. We visualize the dissimilarity measures using classical multi-dimensional scaling plots (Cox and Cox, 2001), see 3.3. But to define modules, we use hierarchical clustering and topological overlap matrix plots (Ravasz et al., 2002) which are reviewed next. 3.2 Hierarchical Clustering and GTOM Plots In networks involving few nodes, modules can easily be identified by inspecting the network but for large networks involving hundreds of nodes, it is useful to provide a reduced view of the network. Towards this end, it has been suggested to use a topological overlap matrix plot (TOM or GTOM plot) which is a color-coded depiction of the values of [d T,[m] ij ], i.e. the TOM based dissimilarity measure described above (Ravasz et al., 2002) Comparing the GTOM Plots When detecting modules using hierarchical clustering, we use GTOM plots to aid the choice of the dendrogram s height cutoff. The four GTOM plots corresponding to the zeroth- to third-order GTOM are shown in Figure 5. The dataset used here is the same as the one in Figure 1. Red/yellow indicate low/high values of d T,[m] ij. Both rows and columns of [d T,[m] ij ] have been sorted using the hierarchical clustering tree. Since [d T,[m] ij ] is symmetric, the GTOM plot is also symmetric around the diagonal. Since modules are sets of nodes with high (generalized) topological overlap, modules correspond to red squares along the diagonal. As in all hierarchical clustering analysis, it is a judgement call where to cut the tree branches. Thus the modules are found by inspection: a height cutoff value is chosen in the dendrogram such that some of the resulting branches correspond to the discrete blocks of red (modules) in the GTOM plot. As we can see in Figure 5, modules are more pronounced in the GTOM2 and GTOM3 plots (larger contrast between the diagonal blocks and off-diagonal blocks). Smaller modules (as diagonal blocks of red) are more visible in GTOM0 and GTOM1 plots whereas larger modules are more pronounced in GTOM2 and GTOM3 plots. 9

10 Figure 4: Pair-wise scatter plots between various dissimilarity measures. The upper triangular panel shows the scatter plots, the lower triangular panel shows the corresponding Pearson correlation coefficients, the diagonal panel shows the frequency distributions of the dissimilarities. Correlation-based dissimilarities d C,[p] are denoted by disscorp. GTOM-based dissimilarities d T,[m] are denoted by dissgtomm. For comparison purposes, a color bar is shown on the top on each GTOM plot. The color bar is ordered by the respective dendrogram and colored by the GTOM1 module assignment (c.f. Figure 1b). To show the usefulness of modules for class discovery, a second color bar is included on the left of the heatmap. Here, dark red indicates the membership to the class protein biosynthesis. Genes belong to other classes (or unknown) are depicted by a gray color in the bar. We observe that genes belong to the protein biosynthesis class are grouped in the vicinity of each other in the GTOM2 and GTOM3 plots. 3.3 Multi-dimensional Scaling Plots To visualize the GTOM-based dissimilarities, we use classical multi-dimensional scaling plots, which are shown in Figure 6. All the plots are color-coded according to the modules with 10

11 (a) (b) (c) (d) Figure 5: GTOM plots of various GTOM-based dissimilarities. (a) GTOM0. (b) GTOM1. (c) GTOM2. (d) GTOM3. The color bar on the top of each heatmap shows the module assignment obtained from GTOM1. The color bar on the left of each heatmap shows the functional category of the corresponding genes. Dark red indicates the membership to the class protein biosynthesis. Modules are more pronounced in the GTOM2 and GTOM3 plots (larger contrast between the diagonal blocks and off-diagonal blocks). Smaller modules (as diagonal blocks of red) are more visible in GTOM0 and GTOM1 plots whereas larger modules are more respected in GTOM2 and GTOM3 plots. Genes belong to the protein biosynthesis class are grouped in the vicinity of each other in the GTOM2 and GTOM3 plots. 11

12 respect to GTOM1 depicted in Figure 1. The relative position of the points are well-preserved as we can see that points having the same color are almost always clustered together. To make the plots more prolific, we encode some functional category information into the plots. Genes that belong to the class protein biosynthesis are depicted by the symbol. Other genes are denoted by a. Interestingly, almost all protein biosynthesis genes are in the vicinity of each other and they are colored in black, brown and gray. The plot using the GTOM1 dissimilarity in Figure 6(b) shows a more clear separation between the black and brown modules. 4 Measuring Nodal Connectivity One typically measures the connectivity of a node by its degree (Wasserman and Faust, 1994) defined by the number of links that the node possesses. Such a connectivity is found useful in indicating the centrality of a node in a network (Freeman, 1978) and in characterizing the scale-free structure of a network (Barabási and Albert, 1999). Moreover, it has been recently used for prediction of lethal gene deletions in yeast using various biological networks (Butte et al., 2000; Jeong et al., 2001; Jeong et al., 2003; Rho et al., 2003). It turns out that, beyond module detection, the GTOM similarity can be used to obtain a nodal connectivity measure in a way analog to obtaining degree from the adjacency matrix. As GTOM encodes higher-order interactions, the proposed connectivity introduced below also inherits such a property. 4.1 A GTOM-based Connectivity Measure The usual nodal degree can be obtained from the column sum of the adjacency matrix (with 0 s on the main diagonal). (2005) have found empirical evidence that a GTOM1-based connectivity measure may be biologically more meaningful than the standard connectivity. Here we generalize their connectivity measure to the m-th GTOM-based connectivity as follows: k T,[m] i := n j=1,j i t [m] ij. (8) Here, n is the number of nodes in the network. Since each t [m] is bounded by 0 and 1, the defined connectivity is bounded by 0 and n 1. Moreover, when m = 0, i.e., the adjacency matrix is used as the similarity matrix, we get back to the usual nodal degree. The connectivity measures based on similarity measured by correlation coefficients is similarly defined: k C,[p] i := n j=1,j i ρ p ij. (9) In 4.1.1, we compare the performance of various connectivity measures on predicting genes essentiality in yeast. The results show that these connectivity measures themselves 12

13 Dimension Dimension Dimension 1 (a) Dimension 1 (b) Dimension Dimension Dimension 1 (c) Dimension 1 (d) Figure 6: Multi-dimensional scaling plots of various GTOM-based dissimilarities. (a) GTOM0. (b) GTOM1. (c) GTOM2. (d) GTOM3. The coloring scheme is used to reflect the 7 modules shown in Figure 1(b) detected by using hierarchical clustering on the GTOM1-based dissimilarity. The symbol denotes genes that belong to the functional category protein biosynthesis. Genes belong to other classes are denoted by a. The overall cluster structures are generally preserved across the different GTOM measures. But the spatial distributions of points vary to a large extent. Genes in the protein biosynthesis class are clustered together. 13

14 are quite correlated to each other but k T,[1] gives a better prediction accuracy Comparing the Connectivity Measures In addition to the example in Figure 2, we now present some further comparisons of the connectivity measures k T,[m] and k C,[p] defined in formulae (8) and (9) respectively. In Figure 7, the pair-wise scatterplots of the different connectivity measures is presented for the yeast dataset. Overall, the connectivities are highly correlated. Specifically, we obtain the following results. First, k C,[6] is highly correlated (> 0.9) with k T,[0] and k T,[1] whereas k C,[1] is highly correlated (> 0.9) with k T,[2]. Second, k T,[2] is highly correlated (0.85) with k T,[3]. Third, k T,[m] s are only moderately correlated (< 0.5) if their orders differ by at least 2. Finally, an interesting observation is that the frequency distribution of k C,[1] and k T,[2] tend to be bimodal whereas the other GTOM connectivity measures and k C,[6] are unimodal. Table 1 shows the (Spearman) correlation coefficient between each connectivity and essentiality (a binary variable). The commonly used degree measure k T,[0] shows a correlation of Interestingly, the GTOM1 connectivity has a highly correlation which is Table 1: Prediction of lethal gene deletions in yeast by various connectivity measures. The first row shows the Spearman correlation coefficients s between each connectivity measure and essentiality (a binary variable). The second row shows the corresponding p-value. The GTOM1-based connectivity has the best performance. In particular, GTOM1 is better than the widely used one k T,[0] (i.e. degree). k C,[1] k C,[6] k T,[0] k T,[1] k T,[2] k T,[3] s p 3E 6 2E 11 6E 10 4E 17 4E 10 9E 3 5 DISCUSSION A class of natural generalizations of the widely used TOM similarity, called generalized topological overlap, is proposed. This class of new measures is constructed by counting the number of m-step neighbors that a pair of nodes share and then normalized suitably. The proposed GTOM measure is independent of the number of paths and the number of geodesic paths connecting a node and its m-step neighbor. In the literature, several authors have found GTOM1 useful in defining modules. However, we provide evidence that GTOM2 is also capable of identifying cohesive and biologically plausible modules, especially when one is interested in larger modules such as the class of protein biosynthesis genes in our examples. In general, the GTOM measures with lower 14

15 Figure 7: Pair-wise scatter plots between various connectivity measures. The upper triangular panel shows the scatter plots, the lower triangular panel shows the corresponding Pearson correlation coefficients, the diagonal panel shows the frequency distributions of the connectivity measures. Correlation-based connectivity k C,[p] are denoted by kcorp. GTOM-based connectivity measures k T,[m] are denoted by kgtomm. 15

16 orders allow discovery of modules in finer scales while those with higher orders favor discovery of coarser-scale modules. We also show analytically that GTOMm is similar to the adjacency matrix (GTOM0) if m is larger than the diameter of the network. Thus GTOMm will be useful in networks with moderate or large degree of separation (average path length between any pair of nodes). TOM plots are widely used to visualize modularity in networks. But we demonstrate that classical MDS plots are particularly handy for visualizing the relative distance between the nodes. We also observe from the MDS plots that there is a tendency of consolidation as the order of the GTOM measure increases. This phenomenon can be seen in Figure 6(d) where GTOM3 is used and a few sinks (points of attraction) are formed. Intuitively speaking, each cluster becomes tighter whereas different clusters are more separated. Another class of importance measures, nodal connectivity, is also constructed based on the proposed similarity measures. Its application in predicting essential genes for yeast s survival shows that the generalized connectivity could lead to a better prediction performance. Acknowledgement The authors would like to thank Jun Dong, Ai Li, Bin Zhang, Marc Carlson, Stan Nelson, and Paul Mischel for their helpful discussions. Appendix: Scale-free Topology In Figure 8, the frequency distribution p(k) against the nodal degree k is plotted with both axes in the logarithm scale. A straight line relationship in the graph would indicate a scalefree topology, i.e., the frequency distribution follows a power law: p(k) k γ (Barabási and Albert, 1999). However, this relationship is often only approximately satisfied in practice and is better explained by an exponentially truncated power law p(k) k γ e αk (Csányi and Szendrői, 2004). References Albert, R. and Barabási, A. (2000). Error and attack tolerance of complex networks. Nature, 406(6794), Bar-Joseph, Z, Gerber, G K, Lee, T I, Rinaldi, N J, Yoo, J Y, Robert, F, Gordon, D B, Fraenkel, E, Jaakkola, T S, Young, R A and Gifford, D K. (2003). Computational discovery of gene modules and regulatory networks. Nat biotechnol, 21(11), Evaluation Studies Journal Article. 16

17 tau= 0.7, scale R^2= 0.91, trunc. R^2= log10(p(k)) log10(k) Figure 8: Log-log plot between frequency of connectivity p(k) and connectivity k for determining whether the network exhibits scale-free topology. In this gene co-expression network, a Pearson correlation cut-off of 0.7 results in an approximate straight line relationship: the red linear regression line leads to a fitting index of R 2 = The black line represents the fit of an exponentially truncated power law, which leads to a fitting index of R 2 = Barabási, A. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286, Bergmann, S., Ihmels, J. and Barkai, N. (2004). Similarities and differences in genome-wide expression data of six organisms. Plos biology, 2(1), Butte, A., Tamayo, P., Slonim, D., Golub, T. and Kohane, I. (2000). Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relavance networks. Pnas, 97, Cox, T. and Cox, M. (2001). Multidimensional scaling. Boca Raton: Chapman and Hall/CRC. Csányi, G. and Szendrői, B. (2004). Structure of a large social network. Physical review e, 69(3), Eisen, M., Spellman, P., Brown, P. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Pnas, 95(25), Freeman, L. (1978). Centrality in social networks: Conceptual clarification. Social networks, 1,

18 Hartwell, L., Hopefield, J., S., Leibler and Murray, A. (1999). From molecular to modular cell biology. Nature, 402(6761 Suppl), C Isaacs, F J, Hasty, J, Cantor, C R and Collins, J J. (2003). Prediction and measurement of an autoregulatory genetic module. Proc natl acad sci u s a, 100(13), Journal Article. Jeong, H., Mason, S., Barabási, A. and Oltvai, Z. (2001). Lethality and centrality in protein networks. Nature, 411(6833), Jeong, H., Oltvai, Z. and Barabási, A. (2003). Prediction of protein essetiality based on genome data. Complexus, 1, Kaufman, Leonard and Rousseeuw, Peter. (1990). Finding groups in data: An introduction to cluster analysis. New York: John Wiley & Sons, Inc. Newman, M. (2003). The structure and function of complex networks. Siam review, 45(2), Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Physical review e, 69(2), Prinz, S, Avila-Campillo, I, Aldridge, C, Srinivasan, A, Dimitrov, K, Siegel, A F and Galitski, T. (2004). Control of yeast filamentous-form growth by modules in an integrated molecular network. Genome res, 14(3), Journal Article. Ravasz, E., Somera, A., Mongru, D., Oltvai, Z. and Barabási, A. (2002). Hierarchical organization of modularity in metabolic networks. Science, 297(5586), Rho, K., Jeong, H. and Kahng, B. (2003). Identification of essential and functionally moduled genes through the microarray assay. Preprint. Segal, E, Shapira, M, Regev, A, Pe er, D, Botstein, D, Koller, D and Friedman, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat genet, 34(2), Journal Article. Solé, R. and Fernández, P. (2003). Modularity for free in genome architecture? Preprint. Spellman, P., Sherlock, G., Zhang, M., Iyer, V. and Anders, K. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. biol. cell, 9(12), Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic. Pnas, 96,

19 Thimm, O, Blasing, O, Gibon, Y, Nagel, A, Meyer, S, Kruger, P, Selbig, J, Muller, L A, Rhee, S Y and Stitt, M. (2004). Mapman: a user-driven tool to display genomics data sets onto diagrams of metabolic pathways and other biological processes. Plant j, 37(6), Journal Article. Tornow, S and Mewes, H W. (2003). Functional modules by relating protein interaction networks and gene expression. Nucleic acids res, 31(21), Journal Article. Toyoda, T and Konagaya, A. (2003). Knowledgeeditor: a new tool for interactive modeling and analyzing biological pathways based on microarray data. Bioinformatics, 19(3), Journal Article. Wasserman, S. and Faust, K. (1994). Social network analysis: Methods and applications. Structural Analysis in the Social Science. New York: Cambridge University Press. Xu, X, Wang, L and Ding, D. (2004). Learning module networks from genome-wide location and expression data. Febs lett, 578(3), Journal Article. Ye, Y. and Godzik, A. (2004). Comparative analysis of protein domain organization. Genome Biology, 14(3), Zhang, B. and Horvath, S. (2005). Using the scale-free topology criterion for constructing weighted gene co-expression networks. 19

A Geometric Interpretation of Gene Co-Expression Network Analysis. Steve Horvath, Jun Dong

A Geometric Interpretation of Gene Co-Expression Network Analysis. Steve Horvath, Jun Dong A Geometric Interpretation of Gene Co-Expression Network Analysis Steve Horvath, Jun Dong Outline Network and network concepts Approximately factorizable networks Gene Co-expression Network Eigengene Factorizability,

More information

A General Framework for Weighted Gene Co-Expression Network Analysis. Steve Horvath Human Genetics and Biostatistics University of CA, LA

A General Framework for Weighted Gene Co-Expression Network Analysis. Steve Horvath Human Genetics and Biostatistics University of CA, LA A General Framework for Weighted Gene Co-Expression Network Analysis Steve Horvath Human Genetics and Biostatistics University of CA, LA Content Novel statistical approach for analyzing microarray data:

More information

Weighted gene co-expression analysis. Yuehua Cui June 7, 2013

Weighted gene co-expression analysis. Yuehua Cui June 7, 2013 Weighted gene co-expression analysis Yuehua Cui June 7, 2013 Weighted gene co-expression network (WGCNA) A type of scale-free network: A scale-free network is a network whose degree distribution follows

More information

Cell biology traditionally identifies proteins based on their individual actions as catalysts, signaling

Cell biology traditionally identifies proteins based on their individual actions as catalysts, signaling Lethality and centrality in protein networks Cell biology traditionally identifies proteins based on their individual actions as catalysts, signaling molecules, or building blocks of cells and microorganisms.

More information

Network Biology: Understanding the cell s functional organization. Albert-László Barabási Zoltán N. Oltvai

Network Biology: Understanding the cell s functional organization. Albert-László Barabási Zoltán N. Oltvai Network Biology: Understanding the cell s functional organization Albert-László Barabási Zoltán N. Oltvai Outline: Evolutionary origin of scale-free networks Motifs, modules and hierarchical networks Network

More information

The architecture of complexity: the structure and dynamics of complex networks.

The architecture of complexity: the structure and dynamics of complex networks. SMR.1656-36 School and Workshop on Structure and Function of Complex Networks 16-28 May 2005 ------------------------------------------------------------------------------------------------------------------------

More information

Gene expression microarray technology measures the expression levels of thousands of genes. Research Article

Gene expression microarray technology measures the expression levels of thousands of genes. Research Article JOURNAL OF COMPUTATIONAL BIOLOGY Volume 7, Number 2, 2 # Mary Ann Liebert, Inc. Pp. 8 DOI:.89/cmb.29.52 Research Article Reducing the Computational Complexity of Information Theoretic Approaches for Reconstructing

More information

CS224W: Social and Information Network Analysis

CS224W: Social and Information Network Analysis CS224W: Social and Information Network Analysis Reaction Paper Adithya Rao, Gautam Kumar Parai, Sandeep Sripada Keywords: Self-similar networks, fractality, scale invariance, modularity, Kronecker graphs.

More information

Erzsébet Ravasz Advisor: Albert-László Barabási

Erzsébet Ravasz Advisor: Albert-László Barabási Hierarchical Networks Erzsébet Ravasz Advisor: Albert-László Barabási Introduction to networks How to model complex networks? Clustering and hierarchy Hierarchical organization of cellular metabolism The

More information

Analyzing Microarray Time course Genome wide Data

Analyzing Microarray Time course Genome wide Data OR 779 Functional Data Analysis Course Project Analyzing Microarray Time course Genome wide Data Presented by Xin Zhao April 29, 2002 Cornell University Overview 1. Introduction Biological Background Biological

More information

networks in molecular biology Wolfgang Huber

networks in molecular biology Wolfgang Huber networks in molecular biology Wolfgang Huber networks in molecular biology Regulatory networks: components = gene products interactions = regulation of transcription, translation, phosphorylation... Metabolic

More information

Evidence for dynamically organized modularity in the yeast protein-protein interaction network

Evidence for dynamically organized modularity in the yeast protein-protein interaction network Evidence for dynamically organized modularity in the yeast protein-protein interaction network Sari Bombino Helsinki 27.3.2007 UNIVERSITY OF HELSINKI Department of Computer Science Seminar on Computational

More information

Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree

Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree YOICHI YAMADA, YUKI MIYATA, MASANORI HIGASHIHARA*, KENJI SATOU Graduate School of Natural

More information

Self Similar (Scale Free, Power Law) Networks (I)

Self Similar (Scale Free, Power Law) Networks (I) Self Similar (Scale Free, Power Law) Networks (I) E6083: lecture 4 Prof. Predrag R. Jelenković Dept. of Electrical Engineering Columbia University, NY 10027, USA {predrag}@ee.columbia.edu February 7, 2007

More information

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics GENOME Bioinformatics 2 Proteomics protein-gene PROTEOME protein-protein METABOLISM Slide from http://www.nd.edu/~networks/ Citrate Cycle Bio-chemical reactions What is it? Proteomics Reveal protein Protein

More information

BioControl - Week 6, Lecture 1

BioControl - Week 6, Lecture 1 BioControl - Week 6, Lecture 1 Goals of this lecture Large metabolic networks organization Design principles for small genetic modules - Rules based on gene demand - Rules based on error minimization Suggested

More information

How Scale-free Type-based Networks Emerge from Instance-based Dynamics

How Scale-free Type-based Networks Emerge from Instance-based Dynamics How Scale-free Type-based Networks Emerge from Instance-based Dynamics Tom Lenaerts, Hugues Bersini and Francisco C. Santos IRIDIA, CP 194/6, Université Libre de Bruxelles, Avenue Franklin Roosevelt 50,

More information

Application of random matrix theory to microarray data for discovering functional gene modules

Application of random matrix theory to microarray data for discovering functional gene modules Application of random matrix theory to microarray data for discovering functional gene modules Feng Luo, 1 Jianxin Zhong, 2,3, * Yunfeng Yang, 4 and Jizhong Zhou 4,5, 1 Department of Computer Science,

More information

Analysis of Biological Networks: Network Robustness and Evolution

Analysis of Biological Networks: Network Robustness and Evolution Analysis of Biological Networks: Network Robustness and Evolution Lecturer: Roded Sharan Scribers: Sasha Medvedovsky and Eitan Hirsh Lecture 14, February 2, 2006 1 Introduction The chapter is divided into

More information

Written Exam 15 December Course name: Introduction to Systems Biology Course no

Written Exam 15 December Course name: Introduction to Systems Biology Course no Technical University of Denmark Written Exam 15 December 2008 Course name: Introduction to Systems Biology Course no. 27041 Aids allowed: Open book exam Provide your answers and calculations on separate

More information

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002 Cluster Analysis of Gene Expression Microarray Data BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002 1 Data representations Data are relative measurements log 2 ( red

More information

Supplemental Material

Supplemental Material Supplemental Material Article title: Construction and comparison of gene co expression networks shows complex plant immune responses Author names and affiliation: Luis Guillermo Leal 1*, Email: lgleala@unal.edu.co

More information

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it? Proteomics What is it? Reveal protein interactions Protein profiling in a sample Yeast two hybrid screening High throughput 2D PAGE Automatic analysis of 2D Page Yeast two hybrid Use two mating strains

More information

Supplementary Information

Supplementary Information Supplementary Information For the article"comparable system-level organization of Archaea and ukaryotes" by J. Podani, Z. N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and. Szathmáry (reference numbers

More information

Differential Modeling for Cancer Microarray Data

Differential Modeling for Cancer Microarray Data Differential Modeling for Cancer Microarray Data Omar Odibat Department of Computer Science Feb, 01, 2011 1 Outline Introduction Cancer Microarray data Problem Definition Differential analysis Existing

More information

Using Knowledge Driven Matrix Factorization to Reconstruct Modular Gene Regulatory Network

Using Knowledge Driven Matrix Factorization to Reconstruct Modular Gene Regulatory Network Using Knowledge Driven Matrix Factorization to Reconstruct Modular Gene Regulatory Network Yang Zhou 1 and Zheng Li and Xuerui Yang and Linxia Zhang and Shireesh Srivastava and Rong Jin 1 and Christina

More information

Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations

Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations Inferring Gene Regulatory Networks from Time-Ordered Gene Expression Data of Bacillus Subtilis Using Differential Equations M.J.L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, S. Miyano Pacific Symposium

More information

Fine-scale dissection of functional protein network. organization by dynamic neighborhood analysis

Fine-scale dissection of functional protein network. organization by dynamic neighborhood analysis Fine-scale dissection of functional protein network organization by dynamic neighborhood analysis Kakajan Komurov 1, Mehmet H. Gunes 2, Michael A. White 1 1 Department of Cell Biology, University of Texas

More information

Fuzzy Clustering of Gene Expression Data

Fuzzy Clustering of Gene Expression Data Fuzzy Clustering of Gene Data Matthias E. Futschik and Nikola K. Kasabov Department of Information Science, University of Otago P.O. Box 56, Dunedin, New Zealand email: mfutschik@infoscience.otago.ac.nz,

More information

Divergence Pattern of Duplicate Genes in Protein-Protein Interactions Follows the Power Law

Divergence Pattern of Duplicate Genes in Protein-Protein Interactions Follows the Power Law Divergence Pattern of Duplicate Genes in Protein-Protein Interactions Follows the Power Law Ze Zhang,* Z. W. Luo,* Hirohisa Kishino,à and Mike J. Kearsey *School of Biosciences, University of Birmingham,

More information

Biological Networks. Gavin Conant 163B ASRC

Biological Networks. Gavin Conant 163B ASRC Biological Networks Gavin Conant 163B ASRC conantg@missouri.edu 882-2931 Types of Network Regulatory Protein-interaction Metabolic Signaling Co-expressing General principle Relationship between genes Gene/protein/enzyme

More information

An Introduction to the spls Package, Version 1.0

An Introduction to the spls Package, Version 1.0 An Introduction to the spls Package, Version 1.0 Dongjun Chung 1, Hyonho Chun 1 and Sündüz Keleş 1,2 1 Department of Statistics, University of Wisconsin Madison, WI 53706. 2 Department of Biostatistics

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

identifiers matched to homologous genes. Probeset annotation files for each array platform were used to

identifiers matched to homologous genes. Probeset annotation files for each array platform were used to SUPPLEMENTARY METHODS Data combination and normalization Prior to data analysis we first had to appropriately combine all 1617 arrays such that probeset identifiers matched to homologous genes. Probeset

More information

A Modified Method Using the Bethe Hessian Matrix to Estimate the Number of Communities

A Modified Method Using the Bethe Hessian Matrix to Estimate the Number of Communities Journal of Advanced Statistics, Vol. 3, No. 2, June 2018 https://dx.doi.org/10.22606/jas.2018.32001 15 A Modified Method Using the Bethe Hessian Matrix to Estimate the Number of Communities Laala Zeyneb

More information

A Multiobjective GO based Approach to Protein Complex Detection

A Multiobjective GO based Approach to Protein Complex Detection Available online at www.sciencedirect.com Procedia Technology 4 (2012 ) 555 560 C3IT-2012 A Multiobjective GO based Approach to Protein Complex Detection Sumanta Ray a, Moumita De b, Anirban Mukhopadhyay

More information

Dynamic optimisation identifies optimal programs for pathway regulation in prokaryotes. - Supplementary Information -

Dynamic optimisation identifies optimal programs for pathway regulation in prokaryotes. - Supplementary Information - Dynamic optimisation identifies optimal programs for pathway regulation in prokaryotes - Supplementary Information - Martin Bartl a, Martin Kötzing a,b, Stefan Schuster c, Pu Li a, Christoph Kaleta b a

More information

An Overview of Weighted Gene Co-Expression Network Analysis. Steve Horvath University of California, Los Angeles

An Overview of Weighted Gene Co-Expression Network Analysis. Steve Horvath University of California, Los Angeles An Overview of Weighted Gene Co-Expression Network Analysis Steve Horvath University of California, Los Angeles Contents How to construct a weighted gene co-expression network? Why use soft thresholding?

More information

arxiv:physics/ v1 9 Jun 2006

arxiv:physics/ v1 9 Jun 2006 Weighted Networ of Chinese Nature Science Basic Research Jian-Guo Liu, Zhao-Guo Xuan, Yan-Zhong Dang, Qiang Guo 2, and Zhong-Tuo Wang Institute of System Engineering, Dalian University of Technology, Dalian

More information

Protein function prediction via analysis of interactomes

Protein function prediction via analysis of interactomes Protein function prediction via analysis of interactomes Elena Nabieva Mona Singh Department of Computer Science & Lewis-Sigler Institute for Integrative Genomics January 22, 2008 1 Introduction Genome

More information

Lecture 5: November 19, Minimizing the maximum intracluster distance

Lecture 5: November 19, Minimizing the maximum intracluster distance Analysis of DNA Chips and Gene Networks Spring Semester, 2009 Lecture 5: November 19, 2009 Lecturer: Ron Shamir Scribe: Renana Meller 5.1 Minimizing the maximum intracluster distance 5.1.1 Introduction

More information

T H E J O U R N A L O F C E L L B I O L O G Y

T H E J O U R N A L O F C E L L B I O L O G Y T H E J O U R N A L O F C E L L B I O L O G Y Supplemental material Breker et al., http://www.jcb.org/cgi/content/full/jcb.201301120/dc1 Figure S1. Single-cell proteomics of stress responses. (a) Using

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Towards Detecting Protein Complexes from Protein Interaction Data

Towards Detecting Protein Complexes from Protein Interaction Data Towards Detecting Protein Complexes from Protein Interaction Data Pengjun Pei 1 and Aidong Zhang 1 Department of Computer Science and Engineering State University of New York at Buffalo Buffalo NY 14260,

More information

A New Method to Build Gene Regulation Network Based on Fuzzy Hierarchical Clustering Methods

A New Method to Build Gene Regulation Network Based on Fuzzy Hierarchical Clustering Methods International Academic Institute for Science and Technology International Academic Journal of Science and Engineering Vol. 3, No. 6, 2016, pp. 169-176. ISSN 2454-3896 International Academic Journal of

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Icm/Dot secretion system region I in 41 Legionella species.

Nature Genetics: doi: /ng Supplementary Figure 1. Icm/Dot secretion system region I in 41 Legionella species. Supplementary Figure 1 Icm/Dot secretion system region I in 41 Legionella species. Homologs of the effector-coding gene lega15 (orange) were found within Icm/Dot region I in 13 Legionella species. In four

More information

WGCNA User Manual. (for version 1.0.x)

WGCNA User Manual. (for version 1.0.x) (for version 1.0.x) WGCNA User Manual A systems biologic microarray analysis software for finding important genes and pathways. The WGCNA (weighted gene co-expression network analysis) software implements

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Clustering and Network

Clustering and Network Clustering and Network Jing-Dong Jackie Han jdhan@picb.ac.cn http://www.picb.ac.cn/~jdhan Copy Right: Jing-Dong Jackie Han What is clustering? A way of grouping together data samples that are similar in

More information

Bioinformatics. Transcriptome

Bioinformatics. Transcriptome Bioinformatics Transcriptome Jacques.van.Helden@ulb.ac.be Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/ Bioinformatics

More information

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource

Networks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks in biology Protein-Protein Interaction Network of Yeast Transcriptional regulatory network of E.coli Experimental

More information

Modeling Gene Expression from Microarray Expression Data with State-Space Equations. F.X. Wu, W.J. Zhang, and A.J. Kusalik

Modeling Gene Expression from Microarray Expression Data with State-Space Equations. F.X. Wu, W.J. Zhang, and A.J. Kusalik Modeling Gene Expression from Microarray Expression Data with State-Space Equations FX Wu, WJ Zhang, and AJ Kusalik Pacific Symposium on Biocomputing 9:581-592(2004) MODELING GENE EXPRESSION FROM MICROARRAY

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Network Biology-part II

Network Biology-part II Network Biology-part II Jun Zhu, Ph. D. Professor of Genomics and Genetic Sciences Icahn Institute of Genomics and Multi-scale Biology The Tisch Cancer Institute Icahn Medical School at Mount Sinai New

More information

Computational Network Biology Biostatistics & Medical Informatics 826 Fall 2018

Computational Network Biology Biostatistics & Medical Informatics 826 Fall 2018 Computational Network Biology Biostatistics & Medical Informatics 826 Fall 2018 Sushmita Roy sroy@biostat.wisc.edu https://compnetbiocourse.discovery.wisc.edu Sep 6 th 2018 Goals for today Administrivia

More information

Systems biology and biological networks

Systems biology and biological networks Systems Biology Workshop Systems biology and biological networks Center for Biological Sequence Analysis Networks in electronics Radio kindly provided by Lazebnik, Cancer Cell, 2002 Systems Biology Workshop,

More information

Introduction to clustering methods for gene expression data analysis

Introduction to clustering methods for gene expression data analysis Introduction to clustering methods for gene expression data analysis Giorgio Valentini e-mail: valentini@dsi.unimi.it Outline Levels of analysis of DNA microarray data Clustering methods for functional

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

GRAPH-THEORETICAL COMPARISON REVEALS STRUCTURAL DIVERGENCE OF HUMAN PROTEIN INTERACTION NETWORKS

GRAPH-THEORETICAL COMPARISON REVEALS STRUCTURAL DIVERGENCE OF HUMAN PROTEIN INTERACTION NETWORKS 141 GRAPH-THEORETICAL COMPARISON REVEALS STRUCTURAL DIVERGENCE OF HUMAN PROTEIN INTERACTION NETWORKS MATTHIAS E. FUTSCHIK 1 ANNA TSCHAUT 2 m.futschik@staff.hu-berlin.de tschaut@zedat.fu-berlin.de GAUTAM

More information

Design and characterization of chemical space networks

Design and characterization of chemical space networks Design and characterization of chemical space networks Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-University Bonn 16 August 2015 Network representations of chemical spaces

More information

In order to compare the proteins of the phylogenomic matrix, we needed a similarity

In order to compare the proteins of the phylogenomic matrix, we needed a similarity Similarity Matrix Generation In order to compare the proteins of the phylogenomic matrix, we needed a similarity measure. Hamming distances between phylogenetic profiles require the use of thresholds for

More information

Discovering molecular pathways from protein interaction and ge

Discovering molecular pathways from protein interaction and ge Discovering molecular pathways from protein interaction and gene expression data 9-4-2008 Aim To have a mechanism for inferring pathways from gene expression and protein interaction data. Motivation Why

More information

ACTA PHYSICA DEBRECINA XLVI, 47 (2012) MODELLING GENE REGULATION WITH BOOLEAN NETWORKS. Abstract

ACTA PHYSICA DEBRECINA XLVI, 47 (2012) MODELLING GENE REGULATION WITH BOOLEAN NETWORKS. Abstract ACTA PHYSICA DEBRECINA XLVI, 47 (2012) MODELLING GENE REGULATION WITH BOOLEAN NETWORKS E. Fenyvesi 1, G. Palla 2 1 University of Debrecen, Department of Experimental Physics, 4032 Debrecen, Egyetem 1,

More information

Types of biological networks. I. Intra-cellurar networks

Types of biological networks. I. Intra-cellurar networks Types of biological networks I. Intra-cellurar networks 1 Some intra-cellular networks: 1. Metabolic networks 2. Transcriptional regulation networks 3. Cell signalling networks 4. Protein-protein interaction

More information

Measuring the shape of degree distributions

Measuring the shape of degree distributions Measuring the shape of degree distributions Dr Jennifer Badham Visiting Fellow SEIT, UNSW Canberra research@criticalconnections.com.au Overview Context What does shape mean for degree distribution Why

More information

Phylogenetic Analysis of Molecular Interaction Networks 1

Phylogenetic Analysis of Molecular Interaction Networks 1 Phylogenetic Analysis of Molecular Interaction Networks 1 Mehmet Koyutürk Case Western Reserve University Electrical Engineering & Computer Science 1 Joint work with Sinan Erten, Xin Li, Gurkan Bebek,

More information

How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures

How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures Dongxiao Zhu June 13, 2018 1 Introduction OMICS data are increasingly available to biomedical researchers, and

More information

Graph Theory and Networks in Biology

Graph Theory and Networks in Biology Graph Theory and Networks in Biology Oliver Mason and Mark Verwoerd Hamilton Institute, National University of Ireland Maynooth, Co. Kildare, Ireland {oliver.mason, mark.verwoerd}@nuim.ie January 17, 2007

More information

Self-organized scale-free networks

Self-organized scale-free networks Self-organized scale-free networks Kwangho Park and Ying-Cheng Lai Departments of Electrical Engineering, Arizona State University, Tempe, Arizona 85287, USA Nong Ye Department of Industrial Engineering,

More information

Chapter 8: The Topology of Biological Networks. Overview

Chapter 8: The Topology of Biological Networks. Overview Chapter 8: The Topology of Biological Networks 8.1 Introduction & survey of network topology Prof. Yechiam Yemini (YY) Computer Science Department Columbia University A gallery of networks Small-world

More information

Modelling Gene Expression Data over Time: Curve Clustering with Informative Prior Distributions.

Modelling Gene Expression Data over Time: Curve Clustering with Informative Prior Distributions. BAYESIAN STATISTICS 7, pp. 000 000 J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West (Eds.) c Oxford University Press, 2003 Modelling Data over Time: Curve

More information

Network landscape from a Brownian particle s perspective

Network landscape from a Brownian particle s perspective Network landscape from a Brownian particle s perspective Haijun Zhou Max-Planck-Institute of Colloids and Interfaces, 14424 Potsdam, Germany Received 22 August 2002; revised manuscript received 13 January

More information

Unsupervised Learning with Random Forest Predictors

Unsupervised Learning with Random Forest Predictors Unsupervised Learning with Random Forest Predictors Tao Shi, and Steve Horvath,, Department of Human Genetics, David Geffen School of Medicine, UCLA Department of Biostatistics, School of Public Health,

More information

Overview of clustering analysis. Yuehua Cui

Overview of clustering analysis. Yuehua Cui Overview of clustering analysis Yuehua Cui Email: cuiy@msu.edu http://www.stt.msu.edu/~cui A data set with clear cluster structure How would you design an algorithm for finding the three clusters in this

More information

Comparative Network Analysis

Comparative Network Analysis Comparative Network Analysis BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by

More information

Dynamic modular architecture of protein-protein interaction networks beyond the dichotomy of date and party hubs

Dynamic modular architecture of protein-protein interaction networks beyond the dichotomy of date and party hubs Dynamic modular architecture of protein-protein interaction networks beyond the dichotomy of date and party hubs Xiao Chang 1,#, Tao Xu 2,#, Yun Li 3, Kai Wang 1,4,5,* 1 Zilkha Neurogenetic Institute,

More information

Physical network models and multi-source data integration

Physical network models and multi-source data integration Physical network models and multi-source data integration Chen-Hsiang Yeang MIT AI Lab Cambridge, MA 02139 chyeang@ai.mit.edu Tommi Jaakkola MIT AI Lab Cambridge, MA 02139 tommi@ai.mit.edu September 30,

More information

Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions

Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions Belfield Campus Map Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions

More information

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Inferring Transcriptional Regulatory Networks from Gene Expression Data II Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday

More information

Exploratory statistical analysis of multi-species time course gene expression

Exploratory statistical analysis of multi-species time course gene expression Exploratory statistical analysis of multi-species time course gene expression data Eng, Kevin H. University of Wisconsin, Department of Statistics 1300 University Avenue, Madison, WI 53706, USA. E-mail:

More information

Correlation Networks

Correlation Networks QuickTime decompressor and a are needed to see this picture. Correlation Networks Analysis of Biological Networks April 24, 2010 Correlation Networks - Analysis of Biological Networks 1 Review We have

More information

Bioinformatics I. CPBS 7711 October 29, 2015 Protein interaction networks. Debra Goldberg

Bioinformatics I. CPBS 7711 October 29, 2015 Protein interaction networks. Debra Goldberg Bioinformatics I CPBS 7711 October 29, 2015 Protein interaction networks Debra Goldberg debra@colorado.edu Overview Networks, protein interaction networks (PINs) Network models What can we learn from PINs

More information

Overview. Overview. Social networks. What is a network? 10/29/14. Bioinformatics I. Networks are everywhere! Introduction to Networks

Overview. Overview. Social networks. What is a network? 10/29/14. Bioinformatics I. Networks are everywhere! Introduction to Networks Bioinformatics I Overview CPBS 7711 October 29, 2014 Protein interaction networks Debra Goldberg debra@colorado.edu Networks, protein interaction networks (PINs) Network models What can we learn from PINs

More information

FCModeler: Dynamic Graph Display and Fuzzy Modeling of Regulatory and Metabolic Maps

FCModeler: Dynamic Graph Display and Fuzzy Modeling of Regulatory and Metabolic Maps FCModeler: Dynamic Graph Display and Fuzzy Modeling of Regulatory and Metabolic Maps Julie Dickerson 1, Zach Cox 1 and Andy Fulmer 2 1 Iowa State University and 2 Proctor & Gamble. FCModeler Goals Capture

More information

Module preservation statistics

Module preservation statistics Module preservation statistics Module preservation is often an essential step in a network analysis Steve Horvath University of California, Los Angeles Construct a network Rationale: make use of interaction

More information

An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules

An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules Ying Liu 1 Department of Computer Science, Mathematics and Science, College of Professional

More information

BMD645. Integration of Omics

BMD645. Integration of Omics BMD645 Integration of Omics Shu-Jen Chen, Chang Gung University Dec. 11, 2009 1 Traditional Biology vs. Systems Biology Traditional biology : Single genes or proteins Systems biology: Simultaneously study

More information

DESIGN OF EXPERIMENTS AND BIOCHEMICAL NETWORK INFERENCE

DESIGN OF EXPERIMENTS AND BIOCHEMICAL NETWORK INFERENCE DESIGN OF EXPERIMENTS AND BIOCHEMICAL NETWORK INFERENCE REINHARD LAUBENBACHER AND BRANDILYN STIGLER Abstract. Design of experiments is a branch of statistics that aims to identify efficient procedures

More information

Integration of functional genomics data

Integration of functional genomics data Integration of functional genomics data Laboratoire Bordelais de Recherche en Informatique (UMR) Centre de Bioinformatique de Bordeaux (Plateforme) Rennes Oct. 2006 1 Observations and motivations Genomics

More information

Grouping of correlated feature vectors using treelets

Grouping of correlated feature vectors using treelets Grouping of correlated feature vectors using treelets Jing Xiang Department of Machine Learning Carnegie Mellon University Pittsburgh, PA 15213 jingx@cs.cmu.edu Abstract In many applications, features

More information

Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction Networks

Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction Networks 22 International Conference on Environment Science and Engieering IPCEE vol.3 2(22) (22)ICSIT Press, Singapoore Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction

More information

Research Article A Topological Description of Hubs in Amino Acid Interaction Networks

Research Article A Topological Description of Hubs in Amino Acid Interaction Networks Advances in Bioinformatics Volume 21, Article ID 257512, 9 pages doi:1.1155/21/257512 Research Article A Topological Description of Hubs in Amino Acid Interaction Networks Omar Gaci Le Havre University,

More information

ANÁLISE DOS DADOS. Daniela Barreiro Claro

ANÁLISE DOS DADOS. Daniela Barreiro Claro ANÁLISE DOS DADOS Daniela Barreiro Claro Outline Data types Graphical Analysis Proimity measures Prof. Daniela Barreiro Claro Types of Data Sets Record Ordered Relational records Video data: sequence of

More information

V 5 Robustness and Modularity

V 5 Robustness and Modularity Bioinformatics 3 V 5 Robustness and Modularity Mon, Oct 29, 2012 Network Robustness Network = set of connections Failure events: loss of edges loss of nodes (together with their edges) loss of connectivity

More information

Graph Theory and Networks in Biology arxiv:q-bio/ v1 [q-bio.mn] 6 Apr 2006

Graph Theory and Networks in Biology arxiv:q-bio/ v1 [q-bio.mn] 6 Apr 2006 Graph Theory and Networks in Biology arxiv:q-bio/0604006v1 [q-bio.mn] 6 Apr 2006 Oliver Mason and Mark Verwoerd February 4, 2008 Abstract In this paper, we present a survey of the use of graph theoretical

More information

A Dimensionality Reduction Framework for Detection of Multiscale Structure in Heterogeneous Networks

A Dimensionality Reduction Framework for Detection of Multiscale Structure in Heterogeneous Networks Shen HW, Cheng XQ, Wang YZ et al. A dimensionality reduction framework for detection of multiscale structure in heterogeneous networks. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 27(2): 341 357 Mar. 2012.

More information

Course plan Academic Year Qualification MSc on Bioinformatics for Health Sciences. Subject name: Computational Systems Biology Code: 30180

Course plan Academic Year Qualification MSc on Bioinformatics for Health Sciences. Subject name: Computational Systems Biology Code: 30180 Course plan 201-201 Academic Year Qualification MSc on Bioinformatics for Health Sciences 1. Description of the subject Subject name: Code: 30180 Total credits: 5 Workload: 125 hours Year: 1st Term: 3

More information

Comparing transcription factor regulatory networks of human cell types. The Protein Network Workshop June 8 12, 2015

Comparing transcription factor regulatory networks of human cell types. The Protein Network Workshop June 8 12, 2015 Comparing transcription factor regulatory networks of human cell types The Protein Network Workshop June 8 12, 2015 KWOK-PUI CHOI Dept of Statistics & Applied Probability, Dept of Mathematics, NUS OUTLINE

More information

Data visualization and clustering: an application to gene expression data

Data visualization and clustering: an application to gene expression data Data visualization and clustering: an application to gene expression data Francesco Napolitano Università degli Studi di Salerno Dipartimento di Matematica e Informatica DAA Erice, April 2007 Thanks to

More information