Annotating gene function by combining expression data with a modular gene network

Size: px

Start display at page:

Download "Annotating gene function by combining expression data with a modular gene network"

Derek Hancock
5 years ago
Views:

1 BIOINFORMATICS Vol. 23 ISMB/ECCB 2007, pages i468 i478 doi: /bioinformatics/btm173 Annotating gene function by combining expression data with a modular gene network Motoki Shiga, Ichigaku Takigawa and Hiroshi Mamitsuka* Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji , Japan ABSTRACT Motivation: A promising and reliable approach to annotate gene function is clustering genes not only by using gene expression data but also literature information, especially gene networks. Results: We present a systematic method for gene clustering by combining these totally different two types of data, particularly focusing on network modularity, a global feature of gene networks. Our method is based on learning a probabilistic model, which we call a hidden modular random field in which the relation between hidden variables directly represents a given gene network. Our learning algorithm which minimizes an energy function considering the network modularity is practically time-efficient, regardless of using the global network property. We evaluated our method by using a metabolic network and microarray expression data, changing with microarray datasets, parameters of our model and gold standard clusters. Experimental results showed that our method outperformed other four competing methods, including k-means and existing graph partitioning methods, being statistically significant in all cases. Further detailed analysis showed that our method could group a set of genes into a cluster which corresponds to the folate metabolic pathway while other methods could not. From these results, we can say that our method is highly effective for gene clustering and annotating gene function. Contact: shiga@kuicr.kyoto-u.ac.jp 1 INTRODUCTION Recent progress in genome sciences has led to the development of DNA microarray technology which allows to monitor the expression of thousands of genes simultaneously. Microarray technology by which we can see which genes are active in a given cell or tissue promises the ability to determine the function of each gene. A current popular approach to annotate gene function from gene expression data is clustering genes by expression values based on the assumption that genes with similar expression patterns can be clustered into a group with the same gene function. Existing approaches include typical clustering methods, such as hierarchical clustering (Eisen et al., 1998), k-means (Tavazoie et al., 1999) and self-organizing maps (Tamayo et al., 1999). However, microarray expression data is inevitably noisy, making the clustering result by the above methods unstable (Kerr and Churchill, 2001; Zhang and Zhao, 2000). A possible solution to overcome this problem is to generate many array replicates which are however *To whom correspondence should be addressed. demanding because of the high experimental cost of microarray experiments. A more promising direction in current bioinformatics research is to combine microarray expression data with the existing knowledge of gene annotation derived from literature. Typical examples include model-based clustering incorporating GO (Gene Ontology) annotation as priors of model parameters (Pan, 2006), k-means clustering using GO annotation dependent distances (Huang and Pan, 2006) and hierarchical clustering based on the distance defined by both gene expressions and the shortest path on a metabolic (chemical reaction) pathway (Hanisch et al., 2002), etc. This type of combination is of interest, since dynamic behavior of genes which would be observed from microarray data can be integrated with the literature-derived biological data which is obviously static information. Specifically, dynamic information of microarray expression would change the static gene clusters to fit the experimental condition of given microarray data. However, existing approaches are not methodologically sophisticated enough in combining the two data, i.e. realvalued expression data and literature-derived data, especially gene networks. In addition, the focus of current approaches is placed on the rather local information, such as neighboring genes, of gene networks, and incorporating global information of gene networks might find more appropriate gene clusters. In light of the above, we present a new method for clustering genes using both microarray expression data, i.e. the dynamic information of a cell, and a given gene network, focusing on network modularity, a global nature of biological networks (Ravasz et al., 2002). Our method is based on learning a probabilistic model which we call a hidden module random field that is well suited to combine the two different types of data. Our model has observable variables for microarray expression data and hidden variables for gene annotation (cluster labels). Concretely, each observable variable is dependent upon a hidden variable which can be dependent upon one or more hidden variables each other, as defined by a given biological network. Learning parameters of our model is based on the EM (Expectation-Maximization) algorithm that minimizes the objective function which considers the modularity of a given gene network. Briefly speaking, our method uses not only the simple gene relations which are directly obtained from literature but also more complex information which can be derived from the entire network structure. We further stress that regardless of focusing on the global nature of a given network, our proposed algorithm for estimating probability parameters of our model is time-efficient. ß 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

2 Clustering genes with expression data and network modularity Using a metabolic network and a wide variety of microarray datasets, we validated the performance of our method, comparing with other methods and analyzed the results obtained by our method. Clusters of genes are variable depending on the condition under which genes are placed, meaning that true cluster labels are unknown. However, to evaluate the performance of our method, we generated two different sets of gene clusters as standard data. Assuming that a good clustering result will be close to them if many microarray observations are used, we compared our method with four different clustering methods: k-means with microarray expression data and three graph partitioning methods using the structure of the metabolic network. We then found that our clustering method outperformed all other competing methods, being statistically significant in all cases. We further examined the clustering result, focusing on the genes in folate biosynthesis to confirm the significance of our method. These results proved that our method of combining microarray expression data with the modularity of a gene network is highly effective for clustering genes and annotating gene functions. 2 METHOD: PROBABILISTIC MODEL Clustering is to assign a gene function (or generally a cluster label) to each gene. Starting with a hidden random field, we will explain our model for clustering genes by combining microarray expression data with a global feature of gene networks. Each node in a gene network is labeled by a gene, and an undirected edge connecting two nodes represents some relation between the two genes labeling the two nodes. 2.1 Hidden random field A hidden random field is a probabilistic model with two types of variables: observable and hidden variables. Figure 1a shows a model in which an observable variable depends on only one hidden variable and a hidden variable depends on no other variables. This model corresponds to that of a typical clustering method like k-means in which genes are clustered with microarray expression data only, assuming that the expression pattern of a gene is independent of another. However, this assumption is not necessarily true. For example, the expression patterns of two genes located neighborhood on a metabolic network are in most cases strongly correlated with each other a Observable Hidden Fig. 1. Hidden random fields with (a) independent hidden variables and with (b) relation between hidden variables. b (Kharchenko et al., 2005). Figure 1b shows a hidden random field in which hidden variables can be dependent upon each other which would be more natural for clustering genes using both gene expressions and biological networks. In particular, we use a given gene network to represent the relation of hidden variables. Mathematically, these models can be formalized as follows: let Z ¼ðz 1 ; z 2,...; z N Þ be a set of hidden variables and VðZÞ be a potential function which defines the relation between genes. The probabilistic distribution over Z is given as the Gibbs distribution: PðZÞ ¼ 1 exp VðZÞ ; ð1þ CðÞ where CðÞ ¼ P Z exp VðZÞ and is a parameter (weight) for VðZÞ. We can further define the joint probability of a set of observable variables X ¼ðx 1 ; x 2,...; x N Þ and Z as follows: PðX; ZÞ ¼PðXjZÞPðZÞ ¼ YN n¼1 pðx n jz n ÞPðZÞ Practically N is the number of genes. Given a microarray dataset, the size of x n is the number of observations (experiments), and each value taken in x n is an observable expression value. K is the number of clusters. A hidden variable z n takes a cluster label out of 1 to K. The conditional probability pðx n jz n Þ is assumed to be a normal distribution or the von Mises Fisher distribution (Mardia and Jupp, 2000). The probability of X is further given as follows: PðXÞ ¼ X PðX; ZÞ Z The model of Figure 1b has dependency relation between hidden variables as a network with nodes corresponding to genes and edges connecting nodes. The structure information of this network can be described in PðZÞ of Equation (2). 2.2 Hidden Markov random field (HMaF) A typical hidden random field is a hidden Markov random field, which we call HMaF in this article. 1 We explain this model briefly to show the significance of the dependency structure of hidden variables. In HMaF, a hidden variable for a node depends on the hidden variables for its neighboring (closest) nodes alone, satisfying with the Markov property. The potential function in this model is defined as follows: V M ðzþ ¼ XN n¼1 1 X ðz n 6¼ z i Þ; js n j i2s n 1 A general abbreviation of a hidden Markov random field is HMRF. If we follow this abbreviation, our model which we call a hidden modular random field is also HMRF. So in this article, we use HMRF for a hidden Markov random fiela hidden Markov random field while for our moded while for our model. ð2þ ð3þ i469

M.Shiga et al. a Observable b a b c Fig. 3. Image restoration by (a) HMaF and by (b) k-means. Hidden Fig. 2. (a) HMaF for image restoration and (b) observable noisy data and (c) the true image.

3 M.Shiga et al. a Observable b a b c Fig. 3. Image restoration by (a) HMaF and by (b) k-means. Hidden Fig. 2. (a) HMaF for image restoration and (b) observable noisy data and (c) the true image. where S n is a set of neighboring nodes of n and ðþ is defined as follows: ðz i 6¼ z j Þ¼ 1 when z i 6¼ z j ; 0 others As seen by Equation (3), the value of V M ðzþ increases as the neighboring hidden variables take the same value. Using this potential function, the probabilistic distribution over Z in HMaF can be defined as follows: PðZÞ ¼ 1 C M ðþ exp V MðZÞ ; ð4þ where C M ðþ ¼ P Z exp V MðZÞ. A major application of HMaF is image restoration (or image modeling), which is a problem of estimating true pixel labels from an observed image which is usually very noisy (Fjortoft et al., 2003; Zhang et al., 2001). We show the effectiveness of using the network information of hidden variables through an example of image restoration. Figure 2 shows a schematic picture of applying HMaF to image restoration, exactly according to (Zhang et al., 2001) in which observable and hidden variables are used for (b) noisy images and (c) clear true pixels, respectively. Each of (b) and (c) has 2500 (¼50 50) pixels, and a pixel can take one of three labels as a true value. As shown in Figure 2a, a typical network structure for the relation of hidden variables in image restoration is a grid structure in which a node is connected to neighboring four (left, right, up and down) nodes only, keeping the Markov property. Figure 3a shows the result obtained by applying HMaF to the data shown in Figure 2b, while Figure 3b shows that by applying k-means (which corresponds to Figure 1a) to the same data. As shown in these figures, HMaF clearly removes the noise in Figure 2b while obviously k-means cannot, indicating that the network structure assumption in HMaF works effectively for image restoration. 2.3 Our model: hidden Modular random field () for gene clustering As shown in the previous section, the network structure of hidden variables is an important assumption to estimate the values of hidden variables exactly. The hidden structure in image restoration is a grid shape in which the number of edges from a node, i.e. the degree of a node, is basically constant. In contrast, the degree of a node in a gene network varies according to the scale-free nature (Jeong et al., 2000; Ravasz et al., 2002), suggesting that the grid shape is inappropriate for a gene network and the Markov property would be insufficient. It is already reported that the genes having the same function (cluster) tend to be gathered together on a gene network, especially metabolic networks (Ravasz et al., 2002). This feature is called modularity, a global feature of gene networks. Modularity of a network is defined as a quantity which becomes larger as with increasing the number of edges in a cluster and with decreasing the number of edges between two different clusters (Guimera and Nunes Amaral, 2005; Guimera et al., 2004; Newman and Girvan, 2004). Modularity can then be mathematically defined as follows: ( RðZÞ ¼ XK l k ðzþ L d ) 2 kðzþ ð5þ 2L k¼1 where L is the total number of edges in a given network, l k ðzþ is the number of edges in cluster k and d k ðzþ is the total sum of degrees in cluster k. The first term of the right-hand side of Equation (5) increases with the number of edges inside a cluster, playing the same role as that of Equation (3) which becomes larger as the same cluster is assigned to neighboring genes on a gene network. On the other hand, the second term checks the number of edges not only in a cluster but also crossing different clusters. Totally, the right-hand side becomes larger as with increasing the number of edges in a cluster without increasing the number of edges crossing different clusters. Thus Equation (5) can reflect the modularity, a global feature of gene networks. Thus, we propose a new random field using V D ðzþð¼ N RðZÞÞ as a potential function as follows: PðZÞ ¼ 1 C D ðþ exp V DðZÞ ; ð6þ where C D ðþ ¼ P Z exp V DðZÞ. We call this hidden random field which stands for hidden modular random field. 3 METHOD: LEARNING ALGORITHM FOR 3.1 EM algorithm The EM algorithm is a local optimization method which is widely used for estimating parameters of a lot of probabilistic i470

4 Clustering genes with expression data and network modularity models with hidden variables (Dempster et al., 1977). According to the EM algorithm, we have to minimize the following so-called Q function: Qðhjh ðtþ Þ¼ X PðZjX; h ðtþ Þ log PðX; ZjhÞ Z ¼ log PðX; ZjhÞ where h ðtþ is the set of current parameters, h is the set of parameters to be estimated and a -function is used for PðZjX; h ðtþ Þ. This Q function is true of a lot of other probabilistic models with hidden variables, such as k-means (Kearns et al., 2001). We assume that the probabilistic distribution of an observable variable follows the von Mises Fisher distribution (Mardia and Jupp, 2000), since this distribution is used for clustering high-dimensional data such as text documents (Zhong and Ghosh, 2005):! pðx n jz n ¼ kþ ¼ 1 exp xt n l k pffiffiffiffiffiffiffiffiffiffi Cv x T n x ; ð8þ n where, qffiffiffiffiffiffiffiffiffiffi l k is the center of cluster k and must satisfy with l T k l k ¼ 1, and C v is a normalization constant. From Equations (2, 6 8), we can have the objective function J which should be minimized: ( ) J :¼ ð1!þ XK X xt n l k pffiffiffiffiffiffiffiffiffiffi k¼1 n2u k x T n x!v D ðzþ; ð9þ n where! satisfies with ¼! 1! and 0! 1, and U k is a set of the indices of hidden variables taking cluster k. We note that! is a weight parameter which is not trained from data, and so C D () in Equation (6) becomes a constant when we derive the above Equation (9). The EM algorithm suggests a procedure to minimize J by optimizing cluster centers l k ðk ¼ 1; 2,...; KÞ and hidden variables Z separately. More concretely, this algorithm repeats the following two steps alternately: (1) we optimize cluster centers using fixed hidden variables and (2) optimize hidden variables using fixed cluster centers. For the first step, when hidden variables Z are fixed, the objective function J is minimized by the following l k : ð7þ 3.2 ICM method The basic procedure of the ICM method repeats the following two steps until the convergence of J: (1) we first randomly permute N hidden variables. (2) In the permuted order, we update the value of the chosen hidden variable so that J should be minimized. The ICM method is a local optimization method, and so we can repeat the above procedure many times using different initial values to find the most optimal solution. We emphasize that the ICM method for our model is timeefficient, although our model deals with the global feature of a given network. The reason is as follows: we do not have to compute function J at every time of doing the above two steps. Instead, we can compute the difference between the current J and the new J. More concretely, after we choose a hidden variable, say z n and then examine a new hidden (cluster) value, say j, the difference between the current J specified by z n and the new J by j can be given by the function of these two: J ðn;jþ ðzþ ¼ ð1!þ xt n l j x T n l i pffiffiffiffiffiffiffiffiffiffi x T n x!v ðn;jþ D ðzþ; n where i is the current value of hidden variable z n and V ðn;jþ D ðzþ : ¼ 1 l ðn;jþ ðzþ l ðn;iþ ðzþ L þ dðnþ ðzþ 2L 2 d i ðzþ d j ðzþ d ðnþ ðzþ ; where, l ðn;iþ ðzþ is the number of edges coming to the nodes in cluster i and d ðnþ ðzþ is the degree of node n. From the above equations, we note that J ðn;jþ ðzþ can be computed by using only the local (neighboring) nodes of node n, although reflects the global structure of a given biological network. Thus, the practical computation time of is totally comparable to that of HMaF which just uses the local information of a given network. Finally, Figure 4 shows our entire algorithm of estimating probabilistic parameters by minimizing J. Line 1 is the input of y k l k ¼ q ffiffiffiffiffiffiffiffiffi ; y T k y k where y k ¼ 1 X x n : ju k j n2u k The second step is to optimize J when we fix the above cluster centers. We note that potential function VðZÞ is not the sum over all hidden variables z n ðn ¼ 1,...; NÞ, since hidden variables can be dependent upon each other. We then have to find the optimal combination of hidden values, but the computational complexity of this problem is NP-hard. We then use the ICM (iterative conditional modes) method (Besag, 1986) to find an approximation solution of this problem. Fig. 4. Our entire algorithm for estimating parameters of. Let O n be a set of neighbors of node n. i471

5 M.Shiga et al. initial values which are, in our experiments, obtained by repeating randomly assigning cluster labels to the nodes of a given gene network and selecting the set of cluster labels with the highest modularity which can be computed by Equation (5). Lines 3 6 are the ICM method to which, as shown in line 5, we add one constraint that the new cluster label is selected out of the labels of neighboring nodes only. This device incorporates the local information of a given network in updating labels and speeds up the efficiency of this algorithm. Lines 2 9 are the entire EM algorithm. 4 EXPERIMENTAL RESULTS 4.1 Data We used metabolic pathways for a gene network and focused on the data of Saccharomyces cerevisiae in both the metabolic gene network and microarray expression data. We will show the detail of our data subsequently Metabolic network We generated a metabolic network from KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa et al., 2006), by assigning an enzyme gene to a node and connecting the two nodes of two neighboring genes on the KEGG metabolic pathway by an edge. We removed a node with no edges and an isolated network with no more than 15 nodes. The generated network had 636 nodes (genes) and 3104 edges. We note again that this metabolic network is an undirected gene network, meaning that each node corresponds to a gene Microarray expression From the GEO (Gene Expression Omnibus) database (Edgar et al., 2002), we chose five datasets of S. cerevisiae under the condition that each dataset has more than 50 experiments (observations), because a larger number of observations is more reliable when genes with similar expression patterns are clustered in the same group. We further added a dataset (Hughes et al., 2000) which has been often used in microarray analysis (Huang and Pan, 2006; Pan, 2006; Wu et al., 2002; Zhou et al., 2002) since it has 300 observations while a small number of missing values. Tables 1 shows the detail of the above six datasets. A missing value in these datasets was interpolated by using the 10-nearest least square method by (Troyanskaya et al., 2001) Standard gene cluster data Gene clusters are variable depending on the condition under which genes are placed. However, to evaluate the performance of our method for clustering genes, we generated two different sets of standard gene clusters (or true cluster labels) from KEGG and GO (Ashburner et al., 2000), which we call KEGG clusters and GO clusters, respectively. (1) KEGG clusters: The KEGG pathway maps are classified into six major categories, including metabolism. We used 10 sub-categories under the metabolism category as KEGG clusters. Tables 2 shows the list of KEGG clusters and the number of genes in the corresponding cluster. We note that a gene can be in more than one clusters. (2) GO clusters: GO is the ontology of gene functions which are classified into three categories: molecular function, Table 1. Microarray expression data summary Name observations Number of missing values Reference Brem (Brem et al., 2002) Gasch (Gasch et al., 2000) Hughes (Hughes et al., 2000) Spellman (Spellman et al., 1998) Storey (Storey et al., 2005) Yvert (Yvert et al., 2003) Table 2. Ten standard gene clusters from KEGG Cluster name Amino acid metabolism 208 Carbohydrate metabolism 226 Metabolism of cofactors and vitamins 83 Energy metabolism 56 Glycan biosynthesis and metabolism 66 Lipid metabolism 72 Nucleotide metabolism 110 Metabolism of other amino acids 42 Biosynthesis of secondary metabolites 13 Xenobiotics biodegradation and metabolism 22 Number of genes Table 3. Number of gene clusters from GO when changing the minimum size of a cluster Cut-off value Number of clusters biological process and cellular component. Out of these three major categories, we used the function labels in biological process only. In GO, the relation between gene functions forms a so-called DAG (directed acyclic graph) which is a directed graph with no cycles. Each node of the DAG corresponds to a gene function (cluster label) to which corresponding genes are assigned. Starting with the root of the DAG, we went from a parent to its child, checking the number of genes assigned to each node. Naturally, the number of genes reduces as with going to descendants. We fixed a cutoff value (minimum size of a cluster), and if the number of genes was less than the specified cutoff value, we went back to its parent and stopped there. If the number of genes at the parent is larger than 300, we did not use that node as a cluster because its size is too large to annotate gene function. Tables 3 shows the number of clusters obtained by the above procedure, changing the cutoff value. We selected the cutoff value of 75 to have 19 clusters as i472

6 Clustering genes with expression data and network modularity cell organization and biogenesis cellular biosynthesis cellular catabolism lipid metabolism carboxylic acid metabolism biologocal process 636 ( Root ) nitrogen compound biosynthesis amino acid biosynthesis 88 amine biosynthesis Fig standard gene clusters (nodes with thick lines) from GO cellular carbohydrate metabolism cellular lipid metabolism amino acid and derivative metabolism alcohol metabolism nucleobase, nucleoside, 182 nucleotide and nucleic acid metabolism 83 energy derivation by 82 oxidation of organic compounds protein biosynthesis cellular macromolecule metabolism 97 RNA metabolism protein modification biopolymer modification GO clusters, because of choosing a little different number from 10, which was the number of KEGG clusters. Figure 5 shows all nodes appeared during the above procedure to obtain 19 clusters which correspond to circles with thick lines in this figure. The number in each circle (node) in Figure 5 shows the number of corresponding genes. The name attached to each node shows the corresponding cluster name. As well as KEGG clusters, a gene can be in more than one clusters. 4.2 Performance evaluation: criterion In information theory, mutual information is defined as a quantity to measure the amount of information shared between two random variables. This criterion can be applied to check the overlap between our clustering result and standard clusters. This is because the mutual information between two sets of cluster labels becomes larger as one set of clusters is more consistent with the other set of clusters. In general, mutual information is normalized since the range taken by mutual information depends on the size of given sets of clusters (Strehl and Ghosh, 2003). We note that normalized mutual information (NMI) has been widely used in a lot of applications to measure the performance of clustering methods (Zhong and Ghosh, 2003, 2005). Let Gð¼ ðg 1,...; G K ÞÞ be a cluster set of our result, where G i be a set of genes in cluster i. Let G 0 ð¼ ðg 0 1,...; G 0 K 0ÞÞ be a standard cluster set, where G 0 i be a set of genes in standard (true) cluster i. NMI is defined as follows: NMI : ¼ MIðG; G 0 Þ pffiffiffiffiffiffiffiffiffiffiffipffiffiffiffiffiffiffiffiffiffiffiffiffi HðGÞ HðG 0 Þ where MIðX; YÞ : ¼ HðXÞþHðYÞ HðX; YÞ; HðXÞ : ¼ X PðXÞ log PðXÞ; HðX; YÞ : ¼ X X PðX; YÞ log PðX; YÞ: X Y 4.3 Performance evaluation: results Parameter setting and competing methods We checked three different sizes of clusters: K ¼ 10, 15 and 20, since two sets of standard clusters have 10 and 19 clusters. As shown in Equation (9), the objective function that must be minimized has parameter! which varies in the range between zero and one. If we set! at zero, J has the first term only, consisting of microarray expression alone. This case, our model and its learning algorithm are almost the same as clustering genes by k-means with microarray expression data, and so hereafter we call! ¼ 0 as k-means. On the other hand, if we set! at one, J has the second term alone, meaning that genes are clustered by a given biological network only. This is similar to graph partitioning in which nodes are clustered by using the connectivity of a given graph. Thus, we changed the value of parameter! from zero to one at every 0.1. We then mainly examined the performance of our method at three values: zero, one and! that gives the highest performance, which we denote! 0,! 1 and! max, respectively. Our method was further compared with two famous graph partitioning methods, called (Karypis and Kumar, 1998a) and (Karypis and Kumar, 1998b), both of which use only the structure of a given metabolic network, meaning that their NMIs are irrelevant to! and microarray datasets. This is true of! NMI We first show the result when we used KEGG clusters. Figure 6 shows the NMI of when we used Hughes as microarray expression data. This figure shows that the NMI of! max was higher than those of! 0 (k-means) and! 1 in all three cases of K. This indicates that the clustering result obtained by using both the modularity of the metabolic network and microarray expression data is more consistent with KEGG clusters than that by either of the two types of data. We can further see that the NMI values in 05!51 were higher than those by and which are shown as dashed lines in Figure 6, indicating that our clustering result i473

7 M.Shiga et al. a b c NMI Fig. 6. NMI changing with! (weight) when we used Hughes, KEGG clusters and K¼ (a)10, (b)15 and (c) 20. Table 4. NMI results summary when we used KEGG clusters Data! 0! 1! max (a) K¼10 Brem (0.2) Gasch (0.2) Hughes (0.3) Spellman (0.4) Storey (0.5) Yvert (0.2) (b) K¼15 Brem (0.2) Gasch (0.3) Hughes (0.2) Spellman (0.2) Storey (0.2) Yvert (0.3) (c) K¼20 Brem (0.4) Gasch (0.4) Hughes (0.4) Spellman (0.3) Storey (0.2) Yvert (0.2) was overlapped more with KEGG clusters than those by the two popular graph partitioning methods. These findings were true of other microarray datasets. Tables 4 summarizes the NMI values for all six microarray datasets used in our experiments. Each! max value is shown in the corresponding parenthesis. From this table, we can see that the NMI of! max was the best among five compared methods in all 18 cases. We then show the results obtained by using GO clusters as standard data. Figure 7 shows the NMI of when we used Hughes, changing K and!. The NMI values of and are also shown in this figure as dashed lines. From this figure, the advantage of over other four methods was confirmed again. The NMI values in this figure ranged roughly from 0.15 to 0.2, which were smaller than those by KEGG clusters, probably because the number of GO clusters is 19 which is larger than the number of KEGG clusters, i.e. 10. Tables 5 summarizes the NMI values obtained by all six microarray datasets. From this table, we can see that the NMI of! max was the highest among the five competing methods in all 18 cases, totally following the result by KEGG clusters. From Tables 4 and 5, we can see that! max is 0.2 in 21 cases in all 36 cases, meaning that the clustering performance of! at 0.2 is almost always close to that of! max (or the true maximum). Thus, hereafter we fix! at 0.2, which we denote! 0:2. The above results show the clear advantage of! max (and! 0:2 ) over k-means (! 0 ), and, but the difference of NMI between! max and! 1 was rather slight. We then examined the statistical significance in the difference of NMI between! 0:2 and! 1. Tables 6 shows the P-values of pairwise t-test (Kreyszig, 1970) between! 0:2 and! 1 over all six microarray datasets. This table indicates that the NMI of! 0:2 was higher than that of! 1 with a statistical significance of more than around 99.5% in all six cases. Overall we can say that our method of combining microarray expression data with the network modularity achieved a clearly higher performance in gene clustering than other four compared methods which use only either of the two data types. 4.4 Annotating gene function Once a cluster label is assigned to each gene, we can compute the P-value between each gene function in some ontology database like GO and a cluster containing the gene in question. We can then rank the gene functions according to their P-values, and the top function can be assigned to each gene of the cluster. This manner of function annotation would be helpful for a functionally putative gene. To compute the P-values and rank them, we can use a software called GO Term Finder (Boyle et al., 2004) which uses a set of genes in a cluster as input and outputs a list of gene functions ranked according to P-values. Tables 7 shows an example of function annotation, i.e. a list of gene functions with P-values for a cluster which is obtained under the condition that the microarray dataset was Hughes and K¼20. We show this example, since this cluster has the i474

8 Clustering genes with expression data and network modularity a 0.2 b c 0.2 NMI Fig. 7. NMI changing with! (weight) when we used Hughes, GO clusters and K¼ (a)10, (b)15 and (c) 20. Table 5. NMI results summary when we used GO clusters Table 6. P-values of pairwise t-test between! 0.2 and! 1 Data! 0! 1! max (a) K¼10 Brem (0.2) Gasch (0.2) Hughes (0.2) Spellman (0.2) Storey (0.1) Yvert (0.1) (b) K¼15 Brem (0.2) Gasch (0.2) Hughes (0.1) Spellman (0.2) Storey (0.2) Yvert (0.1) (c) K¼20 Brem (0.2) Gasch (0.1) Hughes (0.2) Spellman (0.3) Storey (0.2) Yvert (0.2) highest P-value, 1.62E-37, among all 20 clusters obtained under the above condition. The top of this list is protein modification. In fact, this cluster contained 45 genes including, say ALG11/YNL048W (mannosyltransferase) and ALG6/ YOR002W (glucosyltransferase) which add a sugar unit to a protein amino acid, meaning that their function is protein modification. 4.5 Analysis on clustering result Figure 8 shows the three metabolic gene networks obtained by! 0,! 0:2 and! 1 using Hughes under K ¼ 20. Each node corresponds to a gene, and the edge between two nodes indicates that the genes of these two nodes are neighboring enzyme genes on the KEGG metabolic pathway map. Number of clusters (K) KEGG clusters GO clusters E E E E E E 4 Table 7. An example of function annotation: top five functions for an obtained cluster having, say ALG11/YNL48W (mannosyltransferase) GO gene function Protein modification Biopolymer modification Protein metabolic process Cellular protein metabolic process Macromolecule biosynthetic process P-value 1.62E E E E E 29 These networks have the same structure but nodes were differently colored. Each color indicates a cluster, meaning that genes labeled with the same color fall into the same cluster. From this figure, we can easily see that nodes with the same color were widely distributed over the entire network in (a), while the nodes with the same color were gathered together in (c) because nodes were clustered by the network structure only. We then focused on the square at the top-left part in each of the three networks in Figure 8. Figure 9 shows the enlargement of each of these squares. In Figure 9, two colors, red and green, were merged in (b) while more than two colors were merged in (a) and the two colors were clearly separated in (c). As shown in the previous section, (b), i.e.! 0:2, was closest to the standard clusters among! 0,! 0:2 and! 1. We then checked orange colored genes in (b) by using the KEGG database and found that they correspond to those categorized in metabolism of cofactors and vitamins, more precisely those in the folate biosynthesis pathway. Figure 10 shows this pathway in which, i475

9 M.Shiga et al. a b c Fig. 8. Clustered genes on metabolic networks when we used Hughes, K ¼ 20 and (a)! 0,(b)! 0:2 and (c)! 1. a YDR481C b YDR481C YER171W YER171W c YDR481C YER171W YGL251C YGL163C YDR040C YGR271W YJL092W YGR267C YOR163W YOR116C YGL163C YDR040C YGL251C YGR271W YJL092W YGR267C YOR163W YOR116C YGL251C YGL163C YDR040C YGR271W YJL092W YGR267C YOR163W YOR116C YDR038C YKR080W YPL235W YJR103W YDR268W YER172C YDR045C YOR241W YJR063W YOR236W YJL101C YNL256W YMR113W YPL235W YER172C YDR038C YKR080W YJR103W YDR268W YOR241W YOR236W YJL101C YNL256W YJR063W YDR045C YMR113W YDR038C YKR080W YPL235W YJR103W YDR268W YER172C YDR045C YOR241W YJR063W YOR236W YJL101C YNL256W YMR113W Fig. 9. Enlargement of the square located at the top-left part in Figure 8 (a)! 0,(b)! 0:2 and (c)! 1. e.g. dihydroneopterin (which is located at almost the center) is generated from GTP, catalyzed by FOL2/YGR267C (GTP-cyclohydrolase I) and PHO8/YDR481C (repressible alkaline phosphatase). These two genes were colored orange in Figure 9b, while not in both (a) and (c), meaning that they were wrongly clustered under! 0 and! 1. This result is true of other orange-colored genes in (b) such as RAD54/YGL163C (DNA-dependent ATPase) and HIR2/YDR038C (Histone transcription regulator) which are also in the folate biosynthesis pathway but not colored orange in (a) and (c). These four genes were all colored dark blue in (a). A possible reason for this is that in Hughes, they must not be co-expressed with other orange colored genes such as ENA1/YDR040C (ATPase) and RAD3/YER171W (DNA helicase), though the four genes themselves might be co-expressed with each other. On the other hand, a reason why the above four genes were clustered into the green group in (c) is that since only the network structure is considered in (c), MTD1/YKR080W (methylenetetrahydrafolate dehydrogenase), a major hub colored green (located at slightly right from the center) strongly affected other genes linked to this gene. In fact, as shown in Figure 9, YGR267C, YDR481C and YGL163C are directly connected to YKR080W. Interestingly, YDR038C is not linked to YKR080W but to the above all three genes, implying that YDR038C was colored green by the three genes (YGR267C, YDR481C and YGL163C) which were already colored green by YKR080W. On contrary, a possible reason why these four genes were well clustered in (b) might be that YDR038C was first colored orange by, say MET7/YOR241W (folylpolyglutamate synthetase) and/or DFR1/YOR236W (dihydrofolate reductase), and then other three were colored orange by the link to YDR038C or by their tight co-expression with YDR038C. Thus this result and analysis also confirmed the advantage of our method for balancing the two different types of data. 5 CONCLUDING REMARKS We have presented a systematic method for gene clustering by combining expression data with the modularity of gene i476

10 Clustering genes with expression data and network modularity Formamidopyrimidine nucleoside triphosphate ,5-Diaminopyrimidine nucleoside triphosphate YGR267C ,5-Diamino-6- (5'-triphosphoryl-3',4'-trihydroxy-2'-oxopentyl) -amino-4-oxopyrimidine Amino-4-hydroxy-6- (erythro-1,2,3-trihydroxypropyl) dihydropteridine triphosphate YDR038C YDR040C YDR481C YGR271W YGL163C Dihydroneopterin YER171W YOR163W phosphate YER172C YJL092W Dihydroneopterin YGL251C YPL235W Folate YNL256W YOR236W THF-L-glutamate GTP DHF THF ,8-Dihydropteroate THF-polyglutamate Fig. 10. Folate biosynthesis pathway. networks, based on learning of a probabilistic model. Experimental results showed that our method successfully integrated the totally different two types of data for clustering genes to accurately annotate gene function. Interesting future work is to use another or other multiple types of gene networks which are not necessarily curated, since KEGG is a curated database, and the KEGG network might be closely related to the gold-standard gene clusters. It would also be interesting to consider a probabilistic model in which more than one cluster labels can be probabilistically assigned to each gene. ACKNOWLEDGEMENTS This work is supported in part by Bioinformatics Education Program Education and Research Organization for Genome Information Science and Kyoto University 21st Century COE Program Knowledge Information Infrastructure for Genome Science with support from MEXT, Japan. Conflict of Interest: none declared. 2-Amino-4-hydroxy-6-hydroxymethyl -7,8-dihydropteridine 2-Amino-4-hydroxy-6-hydroxymethyl -7,8-dihydropteridine-P2 YOR241W YMR113W REFERENCES Ashburner,M., et al. (2000) Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat. Genet., 25, Besag,J. (1986) On the statistical analysis of dirty pictures. J.R. Statist. Soc. B, 48, Boyle,E.I., et al. (2004) GO::TermFinder open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics, 20, Brem,R.B., et al. (2002) Genetic dissection of transcriptional regulation in budding yeast. Science, 296, Dempster,A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm. J.R. Statist. Soc. B, 39, Edgar,R., et al. (2002) Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, Eisen,M.B., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, Fjortoft,R., et al. (2003) Unsupervised classification of radar images using hidden Markov chains and hidden Markov random fields. IEEE Trans. Geosci. Remote Sens., 41, Gasch,A.P., et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, Guimera,R. and Nunes Amaral,L.A. (2005) Functional cartography of complex metabolic networks. Nature, 433, Guimera,R., et al. (2004) Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E, 70, Hanisch,D., et al. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics, 18 (Suppl.), S145 S154. Huang,D. and Pan,W. (2006) Incorporating biological knowledge into distancebased clustering analysis of microarray gene expression data. Bioinformatics, 22, Hughes,T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, Jeong,H., et al. (2000) The large-scale organization of metabolic networks. Nature, 407, Kanehisa,M., et al. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34, D354 D357. Karypis,G. and Kumar,V. (1998a) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20, Karypis,G. and Kumar,V. (1998b) Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput., 48, Kearns,M., et al. (2001) An information-theoretic analysis of hard and soft assignment methods for clustering. In 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI2001), pp Kerr,M.K. and Churchill,G.A. (2001) Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl. Acad. Sci. USA, 98, Kharchenko,P., et al. (2005) Expression dynamics of a cellular metabolic network. Mol. Syst. Biol., msb , E1 E6. Kreyszig,E. (1970) Introductory Mathematical Statistics. John Wiley & Sons. Mardia,K.V. and Jupp,P.E. (2000) Directional Statistics. 2nd edn. JohnWiley & Sons. Newman,M.E.J. and Girvan,M. (2004) Finding and evaluating community structure in networks. Phys. Rev. E, 69, Pan,W. (2006) Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, 22, Ravasz,E., et al. (2002) Hierarchical organization of modularity in metabolic networks. Science, 297, Spellman,P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell., 9, Storey,J.D., et al. (2005) Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol., 3, e267. Strehl,A. and Ghosh,J. (2003) Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J. Comput., 15, Tamayo,P., et al. (1999) Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation. Proc. Natl Acad. Sci. USA, 96, Tavazoie,S., et al. (1999) Systematic determination of genetic network architecture. Nat. Genet., 22, i477

11 M.Shiga et al. Troyanskaya,O., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, Wu,L.F., et al. (2002) Large-scale prediction of saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet., 31, Yvert,G., et al. (2003) Trans-acting regulatory variation in saccharomyces cerevisiae and the role of transcription factors. Nat. Genet., 35, Zhang,K. and Zhao,H. (2000) Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics, 1, Zhang,Y., et al. (2001) Segmentation of brain MR images through a hidden Markov random field model and the Expectation-Maximization algorithm. IEEE Trans. Med. Imaging, 20, Zhong,S. and Ghosh,J. (2003) A unified framework for model-based clustering. J. Mach. Learn. Res., 4, Zhong,S. and Ghosh,J. (2005) Generative model-based document clustering: a comparative study. Knowl. Inf. Syst., 8, Zhou,X., et al. (2002) Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA, 99, i478

Discovering molecular pathways from protein interaction and ge

Discovering molecular pathways from protein interaction and gene expression data 9-4-2008 Aim To have a mechanism for inferring pathways from gene expression and protein interaction data. Motivation Why