Complex detection from PPI data using ensemble method


ORIGINAL ARTICLE

Complex detection from PPI data using ensemble method

Sajid Nagi (1), Dhruba K. Bhattacharyya (2), Jugal K. Kalita (3)

Received: 18 August 2016 / Revised: 2 November 2016 / Accepted: 17 December 2016
© Springer-Verlag Wien 2016

Abstract Many algorithms have been proposed recently to detect protein complexes in protein-protein interaction (PPI) networks. Most proteins form complexes to accomplish biological functions such as transcription of DNA, translation of mRNA and cell growth. Since proteins perform their tasks by interacting with each other, determining these protein-protein interactions is an important task. Traditional clustering approaches for protein complex identification cannot deal with noisy and incomplete PPI data and depend on information from a single source. Since the noise in the interaction datasets hampers the detection of accurate protein complexes, we propose an ensemble approach for protein complex detection that incorporates information from Gene Ontology at the time of complex detection. The PPI network is taken as input by several baseline complex detection algorithms to generate protein complexes. These complexes are then combined by the proposed ensemble, using a consensus building module, to identify meaningful complexes. The protein complexes predicted by the ensemble are evaluated by comparing them to a set of gold standard protein complexes, and their biological relevance is established using a co-localization score.

Corresponding author: Sajid Nagi, sajidnagi@gmail.com
1 Department of Computer Science, St. Edmund's College, Shillong, Meghalaya, India
2 Department of Computer Science and Engineering, Tezpur University, Napaam, Assam, India
3 Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA

Keywords Protein complex prediction · Protein interaction network · Empirical study of proteins · Ensemble clustering

1 Introduction

Proteins are complex organic compounds made up of chains of amino acids and are used by the body for growth, maintenance and repair of body tissues. Protein complexes perform many crucial tasks within living beings, such as transcription of DNA, translation of mRNA, cell growth and transport of molecules from one place to another. Moreover, in-depth study of protein complexes helps us understand how they are built and how they work, allowing better comprehension of biological systems. To understand the dynamics of biological processes within an organism, it is necessary to determine the complete map of physical interactions among its proteins (the interactome). The result is the PPI network, where nodes represent proteins and edges represent interactions between pairs of proteins.

It is well established that proteins are a main cause of many diseases (Hung et al. 1997; Payne et al. 1998; Tanaka et al. 2003; Hung and Link 2011). Erroneous production of a protein complex can cause disease by affecting the biological processes in which the complex is involved. Therefore, correct classification of a newly discovered protein complex is important, as it may guide the discovery of appropriate drugs. New protein complexes can be detected based on the observation that densely connected regions in PPI networks often correspond to actual protein complexes. Graph clustering techniques have been useful here, since PPI networks are large-scale graphs consisting of tens of thousands of pairwise protein-protein interactions.
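The graph view of a PPI network described above can be made concrete with a small sketch. The interaction list and the helper names `build_network` and `density` below are hypothetical and chosen only for illustration; the sketch shows how a densely connected group of proteins scores higher than a looser one, which is the intuition behind density-based complex detection.

```python
from itertools import combinations


def build_network(interactions):
    """Return an adjacency map {protein: set of neighbours} from interaction pairs."""
    adj = {}
    for a, b in interactions:
        if a == b:
            continue  # self-interactions are ignored in this sketch
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def density(adj, proteins):
    """Fraction of all possible protein pairs in `proteins` that are joined by an edge."""
    proteins = list(proteins)
    if len(proteins) < 2:
        return 0.0
    possible = len(proteins) * (len(proteins) - 1) // 2
    present = sum(1 for u, v in combinations(proteins, 2) if v in adj.get(u, ()))
    return present / possible


# Hypothetical toy network: A, B and C form a triangle; D is attached only to C.
toy = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
network = build_network(toy)
triangle_density = density(network, ["A", "B", "C"])
whole_density = density(network, ["A", "B", "C", "D"])
```

In this toy network the triangle is fully connected, while adding the loosely attached protein D lowers the density of the group.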

1.1 Background

A wide range of detection methods has been proposed in recent years for the identification and categorization of protein complexes (Yong et al. 2012; Wu et al. 2013; Zaki 2014; Srihari et al. 2015; Yong et al. 2014; Ou-Yang et al. 2015; Sharma et al. 2015; Li et al. 2016). Every newly published method compares its performance with a few selected earlier methods. Because of differences in the PPI networks produced, the benchmark datasets used for evaluation, the evaluation criteria, the threshold settings and the parameters used, the results of such comparisons and surveys on complex detection vary and are difficult to reconcile (Brohee and van Helden 2006; Vlasblom and Wodak 2009; Li et al. 2010; Srihari and Leong 2013).

Evaluation of predicted complexes is generally performed in one of three ways:
(a) The predicted complexes are compared against one or more gold standard sets of complexes using performance measures such as precision and recall (Liu et al. 2009).
(b) If a gold standard dataset is not available, measures such as cluster cohesiveness and separability are used (Bader and Hogue 2003; King et al. 2004).
(c) A functional or co-localization score is computed after incorporating the appropriate annotation data (Liu et al. 2009; King et al. 2004). This evaluation is particularly useful for assessing the biological relevance of the predictions.

1.2 Observations and motivation

Based on our survey of the literature on protein complex detection, we make the following observations.
(i) Every clustering algorithm has its own nuances, being highly dependent on specific initial conditions, seed selection or parameter settings.
(ii) The noise in the datasets hampers the detection of accurate protein complexes.
(iii) Proteins can be assigned importance based on the number of interactions they participate in. The protein complexes can then be ranked to further study the roles they play in biological processes.
(iv) Most methods report low recall because the search for complexes takes place in the dense regions of the network, causing them to miss low-density complexes in sparse parts of the network (Srihari and Leong 2012).
(v) A protein complex is usually validated only after it has been fully identified. It may be desirable for a method to perform validation during complex identification, so that the goodness of a complex based on cohesiveness can be explored better.

To tackle these limitations, we propose an ensemble method, PCDEN, since ensembles have been able to improve the robustness, stability and prediction accuracy of clusterings by combining the output of several algorithms (Yang et al. 2010; Nagi and Bhattacharyya 2013). Since the noise in the interaction datasets hampers the detection of accurate protein complexes, one way to compensate for the lack of interaction coverage is to combine multiple datasets to identify meaningful complexes. An ensemble of clustering algorithms balances individual characteristics by combining the output of several algorithms and may be able to identify larger, denser clusters with improved biological significance (Nagi and Bhattacharyya 2014). Additionally, ensemble-based methods may be preferable for protein complex detection because traditional clustering methods search for complexes in dense regions and miss complexes of low density. The ensemble method proposed by Dai et al. (2016) employs a cluster-wise voting mechanism to extract the consensus information embedded in the results of different methods, whereas Wu et al. (2016) integrate the clusters obtained from the base clustering mechanisms with co-complex affinity scores.
An alternative may be to add topological information to biological information (Zaki 2014) so that complexes of low density, that is, small complexes of two or three proteins, are not discarded. Moreover, cluster ensembles may also be able to address the problem of local optima and obtain a more globally optimal solution, i.e., a stable clustering solution. To evaluate the quality and accuracy of predicted complexes, we use Frac (Nepusz et al. 2012), MMR (Nepusz et al. 2012), Acc (Brohee and van Helden 2006), Precision and Recall (Liu et al. 2009) and F-measure (Van Rijsbergen 1979). To evaluate biological relevance, we use co-localization (Huh et al. 2003) and GO semantic similarity (Schlicker et al. 2006).

1.3 Organization of the paper

The rest of the paper is organized as follows: Sect. 2 briefly describes the algorithms and ensemble approaches used for comparison. Section 3 gives an account of the evaluation metrics used, followed by the description of the datasets in Sect. 4. Section 5 describes the experimental design methodology of the proposed PCDEN ensemble, followed by the results in Sect. 6. Finally, the discussion and concluding remarks are included in Sect. 7.

2 Related work

We compare eight algorithms, namely MCL (Van Dongen 2000; Pereira-Leal et al. 2004), MCODE (Bader and Hogue 2003), RNSC (King et al. 2004), CFinder (Adamcsek et al. 2006), RRW (Macropol et al. 2009), AP (Frey and Dueck 2007), CMC (Liu et al. 2009) and ClusterONE (Nepusz et al. 2012), with our proposed ensemble. In addition, we compare our proposed ensemble method with four ensemble approaches proposed in recent years, namely Ensemble Non-negative Matrix Factorization (NMF) (Greene et al. 2008), Bipartite Graph Ensemble (BGENS) (Zhang et al. 2012), Full Graph Ensemble (FGENS) (Zhang et al. 2012) and Ensemble Clustering Bayesian Nonnegative Matrix Factorization (EC-BNMF) (Ou-Yang et al. 2013). A brief overview of the algorithms and ensembles is presented next.

2.1 Overview of the clustering approaches

Markov Clustering (MCL) (Van Dongen 2000; Pereira-Leal et al. 2004) detects protein complexes by simulating random walks in PPI networks, boosting the probabilities of intra-cluster walks and demoting inter-cluster walks (i.e., more links within a cluster and fewer links between clusters).

Molecular Complex Detection (MCODE) (Bader and Hogue 2003) selects seed nodes with high weights as initial clusters and enhances these clusters by outward traversal from the seeds. The number of predicted complexes is generally small, and many predicted complexes are too large.

The Restricted Neighborhoods Search Clustering (RNSC) (King et al. 2004) algorithm begins with an initial random clustering and then searches for a better clustering with lower cost by moving vertices between clusters. It predicts relatively few complexes, and its results depend on the quality of the initial clustering.

Clique Finder (CFinder) (Adamcsek et al. 2006) takes a parameter k as input, detects all the k-cliques of the input network and connects them if they are adjacent, i.e., if they share (k - 1) common nodes.
A k-clique percolation cluster is then constructed by linking all the adjacent k-cliques into a bigger subgraph. CFinder can detect overlapping clusters.

The Repeated Random Walk (RRW) (Macropol et al. 2009) algorithm iteratively expands a cluster of nodes, at most k times, so as to include proteins with high proximity according to an affinity function. This is followed by a ranking step that removes clusters with an overlap score above a given threshold. RRW is able to handle weighted and unweighted graphs.

Affinity Propagation (AP) (Frey and Dueck 2007) is a k-centers clustering technique whose centers, referred to as exemplars, are randomly selected and refined iteratively to decrease the sum of squared errors between each input object and its nearest center. AP cannot detect overlapping clusters.

Clustering based on Maximal Cliques (CMC) (Liu et al. 2009) uses an iterative scoring algorithm followed by a maximal clique finding process to assess whether two proteins are in the same protein complex. Cliques are merged when the network between them is denser than the merge threshold. CMC is not able to handle weighted networks.

The Clustering with Overlapping Neighborhood Expansion (ClusterONE) (Nepusz et al. 2012) algorithm starts by selecting the protein with the highest degree as the first seed, grows a cohesive group from it using a greedy approach and then moves on to the next seed. The cohesive groups are then merged based on their overlap score.

2.2 Overview of the ensemble approaches

The Ensemble Non-negative Matrix Factorization (NMF) framework for detecting protein complexes, proposed by Greene et al. (2008), first generates a collection of non-negative matrix factorizations with different numbers of dimensions. A hierarchical meta-clustering algorithm is then employed to aggregate these factorizations and produce a disjoint hierarchy of meta-clusters. Finally, the results are transformed into a soft hierarchical clustering of the original dataset.
Bipartite Graph Ensemble (BGENS) (Zhang et al. 2012) is based on a bipartite graph constructed from the affiliation of proteins and clusters. The original PPI data are initially clustered into a set of T partitions using the base clustering algorithms. Each clustering partition consists of a set of clusters from a certain clustering method. BGENS constructs a bipartite graph between the protein nodes in the dataset and the cluster nodes obtained from the base clustering algorithms.

The cluster ensemble Full Graph Ensemble (FGENS) (Zhang et al. 2012) is produced from the bipartite graph BGENS, by omitting interactions among the proteins and also the relationships among the clusters. The authors consider the original protein interactions as the basis of cluster-belonging preference for those proteins that appear in different clusters. The proteins are partitioned into different final clusters by taking into consideration their original relationships with other proteins. The method proposed in BGENS is followed to partition the cluster nodes generated by the base clustering algorithms, and a graph of proteins and clusters is constructed. The graph is then partitioned using spectral clustering, which divides the graph into K parts with the objective of minimizing the cut.

The ensemble approach EC-BNMF (Ensemble Clustering Bayesian Nonnegative Matrix Factorization) (Ou-Yang et al. 2013) consists of two phases. In the generation phase, useful information in the form of features is extracted from several base clustering results. Each base clustering result is regarded as a feature of the original PPI network, where two proteins are assumed to be connected if they occur in the same cluster at least once. In the complex detection phase, a Bayesian NMF-based ensemble clustering method uses the group information provided by the edges between interacting proteins to detect protein complexes.

3 Evaluation of protein complexes

In this section, we outline the measures used to compare a predicted protein complex with a reference complex and to evaluate the biological relevance of the predictions.

3.1 Performance measures

The following performance measures assess the quality of a predicted protein complex by comparing it with the protein complexes present in a specific gold standard. However, it is not possible to establish the quality of a predicted complex if it does not match, even partially, at least one gold standard complex.

3.1.1 Frac

The fraction of pairs (Frac) (Nepusz et al. 2012) is based on matching a predicted complex $p$ with a reference (gold standard) complex $b$ through the overlap score $NA(p, b)$, given by

\[
NA(p, b) = \frac{|V_p \cap V_b|^2}{|V_p| \times |V_b|}, \tag{1}
\]

where $V_p$ and $V_b$ denote the sets of proteins in $p$ and $b$. The complexes are considered to match if $NA(p, b)$ is greater than a threshold value.

3.1.2 Geometric accuracy

Geometric accuracy (Acc) (Brohee and van Helden 2006) is the geometric mean of the clustering sensitivity (Sn) and the positive predictive value (PPV) of a clustering (Brohee and van Helden 2006). Both are based on the confusion matrix $T = [t_{ij}]$ of the complexes. With $n$ reference and $m$ predicted complexes, the element $t_{ij}$ of the confusion matrix is the number of proteins found in both reference complex $i$ and predicted complex $j$.
If $n_i$ is defined as the number of proteins in reference complex $i$, then

\[
Sn = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} t_{ij}}{\sum_{i=1}^{n} n_i} \tag{2}
\]

and

\[
PPV = \frac{\sum_{j=1}^{m} \max_{i=1}^{n} t_{ij}}{\sum_{j=1}^{m} \sum_{i=1}^{n} t_{ij}}. \tag{3}
\]

Clustering sensitivity (Sn) can give inflated values if every protein is placed in the same cluster, whereas the positive predictive value (PPV) can be maximized by putting every protein in its own cluster. For this reason, the two measures are balanced by computing their geometric mean, the geometric accuracy (Acc), which is defined as:

\[
Acc = \sqrt{Sn \times PPV}. \tag{4}
\]

3.1.3 MMR

The maximum matching ratio (MMR) (Nepusz et al. 2012) expresses how well the predicted complexes compare with a gold standard, where the overlap score between two protein complexes is computed by Eq. (1). MMR is obtained by dividing the total weight of the maximum matching complexes by the number of reference complexes. Given $n$ reference complexes $b_i$ and $m$ predicted complexes $p_j$, MMR is defined as:

\[
MMR = \frac{\sum_{i=1}^{n} \max_{j} NA(p_j, b_i)}{n}. \tag{5}
\]

A predicted complex that does not match any reference complex degrades the geometric accuracy score. However, such a complex is not necessarily an undesired result and may be an undiscovered complex. Therefore, trying to optimize the geometric accuracy may discourage detection of such new complexes in a PPI network. This is one motivation behind the use of MMR as a metric.

3.1.4 F-measure

The F-measure (Van Rijsbergen 1979) is the harmonic mean of two measures called Precision and Recall, defined as follows.

Precision This measure is defined as the fraction of a cluster that consists of objects of a specific class. The precision of a cluster $V_j$ with respect to class $U_i$ is given by

\[
Precision(U_i, V_j) = \frac{n_{i,j}}{n_j}. \tag{6}
\]

Recall This measure is defined as the proportion of objects of a class in a cluster.
The recall of a cluster $V_j$ with respect to class $U_i$ is given by

\[
Recall(U_i, V_j) = \frac{n_{i,j}}{n_i}, \tag{7}
\]

where $n_{i,j}$ is the number of data items belonging to class $U_i$ in cluster $V_j$, $n_j$ is the number of data items belonging to cluster $V_j$ and $n_i$ is the number of data items belonging to class $U_i$.

The F-measure captures the extent to which a cluster contains only objects of a particular class and all objects of that class (Van Rijsbergen 1979). Thus, the F-measure of a cluster $V_j$ with respect to class $U_i$ is given by the following expression:

\[
F\text{-}measure = \frac{2 \times Precision(U_i, V_j) \times Recall(U_i, V_j)}{Precision(U_i, V_j) + Recall(U_i, V_j)}. \tag{8}
\]

The F-measure is useful because it is an objective metric of how well a clustering algorithm is able to recover the original clusters. It lies in the range [0, 1], and high values indicate a good quality of clustering.

3.2 Biological significance measures

One of the main issues that arise when comparing predicted complexes with a gold standard is that gold standard protein complex sets are usually incomplete (Jansen and Gerstein 2004). Hence, a predicted complex that does not match any of the reference complexes may belong to a valid but previously uncataloged complex. Therefore, relying only on the comparison measures outlined in Sect. 3.1, which are based on a predefined gold standard, may not give a true picture of the predicted complexes. Rather, the scores should be supported by measures that assess the biological relevance of predicted complexes, based on the similarity of their functional annotation or the co-localization of the constituent proteins.

3.2.1 Co-localization similarity

Since proteins in the same complex have a tendency to share common functions, they tend to be found at the same location in a cell. The maximum fraction of proteins in a complex found at the same location is known as the co-localization score for that complex.
The average co-localization score is calculated as the weighted average over all complexes and is defined as:

\[
C = \frac{\sum_{j} \max_{i} l_{i,j}}{\sum_{j} |A_j|}. \tag{9}
\]

Here, $l_{i,j}$ is the number of proteins of complex $A_j$ assigned to localization group $i$, and $|A_j|$ is the number of proteins in complex $A_j$ with localization assignments. A high co-localization score indicates a high functional similarity between proteins in the same complex.

3.2.2 GO semantic similarity

Comparison of the GO terms associated with the proteins in a protein complex is another indicator of biological relevance. This can be performed in terms of functional similarity (Ashburner et al. 2000). The functional similarity of two proteins is measured in terms of the semantic similarity of the GO terms with which these proteins have been annotated. The GO semantic similarity score for a set of complexes is defined as the geometric mean of all complex scores determined separately for the biological process and molecular function taxonomies. A higher GO semantic similarity score is associated with better quality of a set of protein complexes. Out of the several variations of semantic similarity that have been described (Resnick 1999; Lin 1998; Lord et al. 2003; Schlicker et al. 2006), we use the semantic similarity proposed by Schlicker et al. (2006). It is defined as:

\[
sim_{Schlicker}(t_1, t_2) = \frac{2 \ln p_{ms}(t_1, t_2)}{\ln p(t_1) + \ln p(t_2)} \times \bigl(1 - p_{ms}(t_1, t_2)\bigr). \tag{10}
\]

Here, $t$, $t_1$ and $t_2$ are GO terms, $p(t)$ represents the probability of term $t$ and $p_{ms}(t_1, t_2)$ is the probability of the minimum subsumer of $t_1$ and $t_2$.

4 Dataset description

We compare the protein complexes that our clustering algorithm predicts with sets of known protein complexes called gold standards.

4.1 PPI datasets

Each of the algorithms and the proposed ensemble has been tested on three publicly available high-throughput benchmark yeast PPI datasets, namely Gavin (Gavin et al. 2006), Krogan Core (Krogan et al. 2006), referred to in this paper as Krogan, and Krogan Extended (Krogan et al. 2006), referred to in this paper as Krogan+. Some details of the datasets are shown in Table 1.

4.2 Gold standard datasets

We consider the Saccharomyces Genome Database (SGD) (Hong et al. 2008) and the Munich Information Centre for Protein Sequences (MIPS) (Mewes et al. 2004) Saccharomyces cerevisiae protein-protein interaction dataset, as they have been applied in several analyses as gold standard references because of their quality and comprehensiveness. The general properties of these datasets are given in Table 2.
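The matching-based measures of Sect. 3.1 can be illustrated with a minimal Python sketch of the overlap score (Eq. 1), Frac, and the Sn, PPV and Acc values (Eqs. 2 to 4). The function names, the toy complexes and the threshold of 0.25 used in the example are assumptions made for illustration only; they are not taken from the paper's implementation, and Frac is computed here as the fraction of reference complexes matched by at least one predicted complex.

```python
from math import sqrt


def overlap_score(p, b):
    """NA(p, b) = |p & b|^2 / (|p| * |b|), following Eq. (1)."""
    p, b = set(p), set(b)
    if not p or not b:
        return 0.0
    return len(p & b) ** 2 / (len(p) * len(b))


def frac(predicted, reference, threshold):
    """Fraction of reference complexes matched by at least one predicted complex."""
    matched = sum(
        1 for b in reference
        if any(overlap_score(p, b) > threshold for p in predicted)
    )
    return matched / len(reference)


def geometric_accuracy(predicted, reference):
    """Return (Sn, PPV, Acc) computed from the confusion matrix t[i][j]."""
    # t[i][j] = number of proteins shared by reference complex i and predicted complex j
    t = [[len(set(b) & set(p)) for p in predicted] for b in reference]
    sn = sum(max(row) for row in t) / sum(len(set(b)) for b in reference)
    total = sum(sum(row) for row in t)
    ppv = sum(max(col) for col in zip(*t)) / total if total else 0.0
    return sn, ppv, sqrt(sn * ppv)


# Hypothetical toy data: two reference complexes and three predicted complexes.
reference = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"E", "F", "G"}, {"X", "Y"}]

na_first = overlap_score(predicted[0], reference[0])
frac_value = frac(predicted, reference, threshold=0.25)  # threshold is an assumed value
sn, ppv, acc = geometric_accuracy(predicted, reference)
```

Note that the third predicted complex, which matches no reference complex, leaves Sn and PPV unchanged in this toy case because it shares no protein with any reference complex.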

Table 1 Benchmark datasets

Details                  Gavin                 Krogan                 Krogan+
References               Gavin et al. (2006)   Krogan et al. (2006)   Krogan et al. (2006)
Number of proteins
Number of interactions   ,317
Weighted                 Yes                   Yes                    Yes

5 Protein complex detection ensemble (PCDEN)

The PCDEN approach consists of three phases. The first phase generates $n$ individual complex sets from the PPI network data, which are taken as input by $n$ base clustering algorithms. The set of base clustering algorithms is identified after exhaustive experimentation with a large number of benchmark datasets. The second phase employs a consensus building process to generate protein complexes by identifying a common set of protein nodes from the corresponding complex sets (obtained from the base clustering algorithms) and also by using GO similarity. The third phase evaluates the predicted protein complexes and establishes their biological relevance. A conceptual framework for the proposed protein complex detection ensemble (PCDEN) is shown in Fig. 1.

Phase I: complex set generation

The PPI network data are taken as input by $n$ consistently well-performing base clustering algorithms $BC_1, BC_2, \ldots, BC_n$, which generate $n$ individual complex sets $C_1, C_2, \ldots, C_n$. In this step, care must be taken to select base clustering algorithms that are diverse in nature. As a result, the errors made in clustering by one algorithm are likely to be averaged out by the correct clustering of another, so that the overall clustering accuracy is improved and a final unbiased decision can be made.

Phase II: protein complex generation through consensus

Next, we take the complex set $C_1$ generated by the base clustering algorithm $BC_1$ and identify the node with the highest degree within a complex of this set. A node with a higher degree has high connectivity or association with other nodes. Such nodes are chosen since they are more likely to form a meaningful complex.
Table 2 General properties of the gold standard datasets

General properties     SGD                  MIPS
References             Hong et al. (2008)   Mewes et al. (2004)
No. of proteins
No. of complexes
Overlapping proteins

The next complex set $C_2$, generated by $BC_2$, is then taken, and we identify the corresponding complex based on the highest degree node identified in complex set $C_1$. Similarly, we take the other complex sets $C_3, C_4, \ldots, C_n$ generated by the base clustering algorithms $BC_3, BC_4, \ldots, BC_n$ and identify the corresponding complexes based on the highest degree node identified earlier. After we have identified all the complexes containing the highest degree node from the $n$ complex sets, a common set of nodes (proteins), say $P_i$, is identified from the corresponding complexes, with the condition that each member of $P_i$ is present in at least two complexes. Two cases are considered now.

Case 1 For the common set of nodes $P_i$, we complete the edges among the nodes of $P_i$ subject to the following condition: if $n_i$ and $n_j$ are two nodes present in $P_i$, then they are connected by an edge if and only if $(n_i, n_j)$ are already connected by an edge in at least two complexes. This is repeated for every node and gives rise to a common complex, $P_i^{Com}$, consisting of nodes that are present in all corresponding complexes and inter-connected by an edge. Hence, the complex generated is arrived at by consensus among the corresponding complexes.

Case 2 For the common set of nodes $P_i$, we complete the edges to obtain a complex $P_i^{GO}$ based on GO similarity. Since a protein is usually annotated with multiple GO terms, the semantic similarity measured between two GO terms can be directly converted to a measurement of the similarity between two proteins. Thus, Gene Ontology information is integrated with the PPI network data based on a semantic similarity measure (Schlicker et al. 2006).
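The consensus step of Case 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `corresponding_complexes` and `consensus_complex` are hypothetical, the seed is simply taken as the highest degree node of the whole toy network, the first complex containing the seed is used for each base algorithm, and an edge is counted as present in a complex when it exists in the PPI network and both endpoints belong to that complex.

```python
from collections import Counter
from itertools import combinations


def corresponding_complexes(seed, complex_sets):
    """From each base algorithm's complex set, pick one complex containing `seed`."""
    picked = []
    for complexes in complex_sets:
        for c in complexes:
            if seed in c:
                picked.append(set(c))
                break  # assumption: use the first matching complex per algorithm
    return picked


def consensus_complex(seed, complex_sets, adj, min_support=2):
    """Return (common node set P_i, edge set of P_i^Com) for the given seed."""
    picked = corresponding_complexes(seed, complex_sets)
    counts = Counter(node for c in picked for node in c)
    # P_i: nodes present in at least `min_support` corresponding complexes
    common = {node for node, k in counts.items() if k >= min_support}
    edges = set()
    for u, v in combinations(sorted(common), 2):
        if v not in adj.get(u, ()):
            continue  # no interaction between u and v in the PPI network
        support = sum(1 for c in picked if u in c and v in c)
        if support >= min_support:
            edges.add((u, v))
    return common, edges


# Hypothetical toy PPI network and outputs of three base clustering algorithms.
ppi_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in ppi_edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

complex_sets = [
    [{"A", "B", "C"}, {"D", "E"}],   # output of BC_1
    [{"A", "B", "C", "D"}],          # output of BC_2
    [{"A", "B"}, {"C", "D", "E"}],   # output of BC_3
]

seed = max(adj, key=lambda node: len(adj[node]))  # highest degree node
common, edges = consensus_complex(seed, complex_sets, adj)
```

In this toy case protein E is dropped because it appears in only one of the corresponding complexes, while the edge between C and D is kept because two base algorithms place both proteins in the same complex.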
Next, we superimpose the two complexes obtained from the common set of nodes $P_i$, i.e., $P_i^{Com}$ and $P_i^{GO}$, to obtain the final complex $P_i^{Fin}$. The superimposition follows the logic given in Table 3. In other words, if an edge is present for a pair of nodes $n_i$ and $n_j$ in both $P_i^{Com}$ and $P_i^{GO}$, then the edge is retained in the final predicted complex $P_i^{Fin}$. If an edge is present for a pair of nodes $n_i$ and $n_j$ in $P_i^{Com}$ and absent in $P_i^{GO}$, or vice versa, then that edge is also retained even though it may be a false alarm. The reason for retaining such an edge is that in the case of $P_i^{Com}$ the edge is a result of consensus (an unsupervised approach), and in the case of $P_i^{GO}$ the edge is obtained through GO similarity (a supervised approach). So in both cases, the existence of the edge is justified. The final predicted complex $P_i^{Fin}$ is now stored in a file for the purpose of validation.

Fig. 1 Conceptual framework for the proposed protein complex detection ensemble (flowchart with three stages: complex set generation from base clustering algorithms; protein complex generation through consensus among the corresponding complexes; validation phase using Frac, Acc, MMR, Precision, Recall, F-measure and the co-localization score against gold standard protein complexes)

Table 3 Existence of an edge in the final complex

Edge in $P_i^{Com}$   Edge in $P_i^{GO}$   Edge in $P_i^{Fin}$
Present               Present              Present
Present               Not present          Present
Not present           Present              Present
Not present           Not present          Not present

Now, we start with the second highest degree node for the formation of the second protein complex and then iterate through to the next lower degree node, and so on. This process of identifying the node of the highest degree from the next set of complexes, and of identifying protein complexes through consensus and through the use of GO similarity, is repeated for each corresponding complex.
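The superimposition rule of Table 3 amounts to taking the union of the two edge sets, and the iteration over seeds proceeds in order of decreasing degree. The sketch below illustrates both steps; the helper names `normalise`, `superimpose` and `seeds_by_degree` and the toy data are hypothetical, and ties in degree are broken alphabetically, which is an assumption not stated in the text.

```python
def normalise(edges):
    """Store each undirected edge as a sorted tuple so (u, v) and (v, u) coincide."""
    return {tuple(sorted(edge)) for edge in edges}


def superimpose(edges_com, edges_go):
    """Table 3: an edge is in P_i^Fin if it is in P_i^Com, in P_i^GO, or in both."""
    return normalise(edges_com) | normalise(edges_go)


def seeds_by_degree(adj):
    """Order nodes from highest to lowest degree (ties broken alphabetically)."""
    return sorted(adj, key=lambda node: (-len(adj[node]), node))


# Hypothetical edge sets for one common node set P_i.
edges_com = {("A", "B"), ("B", "C")}   # edges agreed on by consensus
edges_go = {("C", "B"), ("C", "D")}    # edges supported by GO similarity
final_edges = superimpose(edges_com, edges_go)

# Hypothetical adjacency map used to order the seeds.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
seed_order = seeds_by_degree(adj)
```

Because the rule is a union, an edge supported by only one of the two sources is kept, which matches the second and third rows of Table 3.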
The end result of this phase is a set of interacting protein complexes with overlapping proteins.

Phase III: validation

Finally, the validation phase evaluates the predicted protein complexes by comparing them to a set of gold standard protein complexes and also establishes their biological relevance using the co-localization score.

Though the method adopted for protein complex detection is computationally expensive, the primary objective is to find complexes with high precision. For this reason, speed of detection is compromised in favor of accuracy and precision. Our main focus is to make use of multiple sources and means to detect and validate the complexes found, since a single information source or criterion has its limitations, as discussed earlier.

5.1 Experimental evaluation

Three different clustering methods for protein complex detection, MCODE (Bader and Hogue 2003), MCL (Van Dongen 2000; Pereira-Leal et al. 2004) and CFinder (Adamcsek et al. 2006), are used as the base clustering algorithms. These methods have been selected because they follow different approaches to clustering, and MCODE and CFinder can detect overlapping complexes. Moreover, we have also performed an empirical evaluation of these algorithms along with others, namely RNSC (King et al. 2004), AP (Frey and Dueck 2007), CMC (Liu et al. 2009), RRW (Macropol et al. 2009) and ClusterONE (Nepusz et al. 2012), and have found that the performance of these algorithms falls into the categories of good, average and not so good. Hence, these algorithms satisfy the diversity criteria of our ensemble. ClusterONE (Nepusz et al. 2012) has been found to be the best performing complex finding algorithm, and hence we compare our ensemble's performance with this algorithm. GO semantic similarity is calculated based on the information content of GO terms and the semantic similarity proposed by Schlicker et al. (2006). The PPI datasets used for the experiments are summarized in Table 1. The GO-slim file was downloaded, and the Biological Process (BP) hierarchy was selected to calculate the GO-driven similarity of proteins. The results were evaluated using the SGD and MIPS gold standard datasets, shown in Table 2.

6 Results

For each clustering algorithm tested, we tune the input parameters after experimenting with different combinations of settings to achieve optimal performance, unless the authors of the original algorithm explicitly suggest particular settings.
This procedure was conducted separately for each measure, for each dataset and with respect to each gold standard to obtain the most optimistic score.

6.1 Comparison of PCDEN with baseline clustering algorithms

We compare the performance of the proposed ensemble PCDEN with the approaches described under Related work.

6.1.1 Quality score by MIPS gold standard

In this section, we report the results obtained by comparing the protein complexes predicted by each clustering algorithm and by the proposed PCDEN ensemble with the gold standard obtained from the MIPS catalog of protein complexes. The best scores are highlighted in bold for easy comparison. The number of complexes and the number of matched complexes for each dataset, the fraction of protein complexes matched by at least one predicted complex, the geometric accuracy and the maximum matching ratio obtained by the algorithms are reported in Table 4.

When we compare the eight clustering approaches, we observe that ClusterONE shows the best scores in the number of detected complexes that match at least one real complex, and also for the three quality measures, namely Frac, Acc and MMR, across all three datasets. However, we notice that although the number of protein complexes identified by PCDEN is lower than that of several others, 88 of the 152 predicted complexes matched very well with the benchmark complexes. Further, we observe that PCDEN obtains the best scores for the quality measures Frac and Acc across all three datasets used for comparison. The maximum matching ratio (MMR) score obtained by PCDEN for the Gavin dataset is also the highest, indicating better quality of the overlapped protein complexes detected. The comparison results are presented in Fig. 2, which shows the PCDEN ensemble outperforming the other approaches in all datasets by achieving the highest score in each of the three categories used for comparison.
The results obtained by PCDEN are better than those of the single clustering approaches.

Table 4 Comparison of the protein complex detection algorithms and proposed ensemble PCDEN on PPI datasets using the MIPS complex dataset. For each of the Gavin, Krogan and Krogan+ datasets, the columns are #c, #m, Frac, Acc and MMR; the rows are MCL, MCODE, RNSC, CFINDER, RRW, AP, CMC, ClusterONE and PCDEN. The best scores are highlighted in bold. #c no. of complexes, #m no. of matched complexes.

Fig. 2 Comparing the evaluation metrics of the clustering algorithms and PCDEN for the MIPS dataset (quality scores Frac, Acc and MMR on the Gavin, Krogan and Krogan+ datasets)

Quality score by SGD gold standard

Similarly, Table 5 shows the scores for the number of complexes detected, number of matched complexes, Frac, Acc and MMR for the three datasets achieved by the algorithms and PCDEN using the SGD gold standard. It is observed that although ClusterONE shows the best scores among the eight baseline clustering algorithms across all the three datasets, PCDEN achieves the best results in protein complex detection with respect to the three quality measures for all the three datasets, and also identifies the maximum number of matched complexes.

Table 5 Comparison of the protein complex detection algorithms and proposed ensemble on PPI datasets using the SGD complex dataset. For each of the Gavin, Krogan and Krogan+ datasets, the columns are #c, #m, Frac, Acc and MMR; the rows are MCL, MCODE, RNSC, CFINDER, RRW, AP, CMC, ClusterONE and PCDEN. The best scores are highlighted in bold. #c no. of complexes, #m no. of matched complexes.

We also observe in Tables 4 and 5 that the number of complexes detected by PCDEN for the Krogan and Krogan+ datasets is closer to the actual number of MIPS complexes, i.e., 203, while the results of ClusterONE seem to contain a large number of extra clusters, i.e., 526 and 531. Moreover, the number of predicted complexes by PCDEN matches very well with the number of benchmark complexes for the Krogan and Krogan+ datasets. Figure 3 shows the comparison of the clustering algorithms and PCDEN for the SGD dataset, and it is noticed that PCDEN outperforms all the other single clustering approaches in all datasets. The results obtained by PCDEN show that the integration of multiple sources has greater power to predict protein complexes in PPI networks.
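The maximum matching ratio (MMR) used in Tables 4 and 5 can be sketched in the same spirit. The sketch assumes the usual definition (Nepusz et al. 2012): reference and predicted complexes form a bipartite graph weighted by the overlap score, a maximum-weight one-to-one matching is selected, and its total weight is divided by the number of reference complexes. The exhaustive search below is an illustrative assumption that is only practical for a handful of complexes; a polynomial-time assignment solver would be needed for real data.

```python
from functools import lru_cache


def overlap_score(a, b):
    """Overlap score |A & B|^2 / (|A| * |B|) between two protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def maximum_matching_ratio(reference, predicted):
    """Weight of the best one-to-one matching divided by the number of references."""
    reference = [frozenset(r) for r in reference]
    predicted = [frozenset(p) for p in predicted]
    if not reference:
        return 0.0
    weights = [[overlap_score(r, p) for p in predicted] for r in reference]

    @lru_cache(maxsize=None)
    def best(i, used):
        # Best total weight for reference complexes i, i+1, ... given the
        # bitmask `used` of predicted complexes that are already matched.
        if i == len(reference):
            return 0.0
        total = best(i + 1, used)  # option: leave reference i unmatched
        for j in range(len(predicted)):
            if not used & (1 << j) and weights[i][j] > 0:
                total = max(total, weights[i][j] + best(i + 1, used | (1 << j)))
        return total

    return best(0, 0) / len(reference)


if __name__ == "__main__":
    gold = [{"A", "B", "C"}, {"C", "D"}]
    # One merged prediction can be matched to only one gold complex.
    print(maximum_matching_ratio(gold, [{"A", "B", "C", "D"}]))
```

Because each predicted complex can be matched only once, a single large cluster that covers several reference complexes is rewarded less than separate, well-delimited predictions.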

Fig. 3 Comparative analysis of the evaluation metrics of the clustering algorithms and PCDEN for SGD (quality scores Frac, Acc and MMR on the Gavin, Krogan and Krogan+ datasets)

6.1.3 Biological significance of predicted complexes on Gavin and Krogan datasets

The biological relevance of the predicted complexes, based on the similarity of their co-localization score, is discussed next.

Co-localization score

We make use of the localization dataset published by Huh et al. (2003) and ProCope (Krumsiek et al. 2008; Friedel et al. 2009), a popular tool used to predict and evaluate protein complexes, to calculate the co-localization score. Table 6 shows the co-localization scores of protein complexes detected using the eight algorithms and PCDEN on the Gavin and Krogan datasets. As expected, the highest co-localization score of on the Gavin dataset is achieved by PCDEN, followed by ClusterONE with a score of The co-localization score on the Krogan dataset of PCDEN is which is higher than 0.725, the score of obtained by MCL and achieved by ClusterONE. Hence, the protein complexes detected by the PCDEN ensemble have relatively high quality from the biological standpoint, owing to the high co-localization score on both the Gavin and Krogan datasets. The results of comparisons of the co-localization score of PCDEN and the baseline algorithms on the Gavin and Krogan datasets are shown in Fig. 4.

Table 6 Comparison of Co-localization Score. The columns are the co-localization scores on the Gavin and Krogan datasets; the rows are MCL, MCODE, RNSC, CFINDER, RRW, AP, CMC, ClusterONE and PCDEN. The best scores are highlighted in bold.

Fig. 4 Comparison of Co-localization Score (Gavin and Krogan datasets)

Comparison of PCDEN with Ensemble Approaches

In addition to comparison with individual clustering methods, we compare PCDEN with the ensemble approaches proposed in recent years: NMF (Greene et al. 2008), EC-BNMF (Ou-Yang et al. 2013), BGENS (Zhang et al. 2012) and FGENS (Zhang et al. 2012).

Analysis of ensemble results

We use the validity measures Precision, Recall and F-measure discussed in Sect. The results of the PCDEN ensemble are compared with the results of the ensemble approaches discussed in Sect. Due to the non-availability of source code of the ensemble approaches used for comparison with PCDEN, the metrics Frac, Acc and MMR could not be established for these ensemble approaches. Hence, we use the Precision, Recall and F-measure scores as provided by the authors for the purpose of comparison with PCDEN.
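As a rough illustration of how Precision, Recall and F-measure are typically computed for predicted complexes, the sketch below assumes the matching rule widely used in this literature: a predicted and a reference complex match when their neighborhood affinity |A ∩ B|^2 / (|A| |B|) reaches a threshold. The threshold of 0.2, the function names and the toy data are assumptions for illustration and are not taken from the text.

```python
def neighborhood_affinity(a, b):
    """Neighborhood affinity |A & B|^2 / (|A| * |B|) between two protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def precision_recall_f(reference, predicted, threshold=0.2):
    """Return (precision, recall, f_measure) under the matching rule above.

    The default threshold is an assumed value, not one stated in the text.
    """
    matched_predicted = sum(
        1
        for pred in predicted
        if any(neighborhood_affinity(pred, ref) >= threshold for ref in reference)
    )
    matched_reference = sum(
        1
        for ref in reference
        if any(neighborhood_affinity(ref, pred) >= threshold for pred in predicted)
    )
    precision = matched_predicted / len(predicted) if predicted else 0.0
    recall = matched_reference / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = [{"A", "B", "C"}, {"D", "E"}, {"F", "G"}]
    found = [{"A", "B", "C"}, {"D", "X"}, {"Y", "Z"}, {"Q", "W"}]
    print(precision_recall_f(gold, found))
```

Precision penalizes spurious predictions and Recall penalizes missed reference complexes, so the F-measure balances the number of predicted complexes against their quality.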

The proposed ensemble method outperforms other approaches by achieving higher F-measure and Recall rates for the MIPS gold standard, as shown in Table 7. The F-measure and Recall rates for SGD gold standard are and for PCDEN, which are almost at par with FGENS, which has scored for F-measure and for Recall rate. However, PCDEN finds more matched modules from the PPI network than others, and this demonstrates its effectiveness. NMF and EC-BNMF use only the PPI network topology, whereas PCDEN integrates information from GO similarity sources and so achieves higher scores than these methods. It also shows better results than BGENS and FGENS for the MIPS gold standard. Its F-measure and Recall rates for the SGD gold standard are almost at par with FGENS but are better than BGENS, the reason being that FGENS combines gene expression data and GO similarity information, which is also the case with PCDEN. This indicates that the integration of multiple information sources greatly benefits protein complex detection.

A graphical comparison of the clustering approaches using Precision, Recall and F-measure is given in Fig. 5. The performance of PCDEN is better than the other cluster ensemble methods, i.e., NMF and EC-BNMF, but is at par for one result with FGENS. This indicates the effectiveness of graph-based cluster ensemble methods such as FGENS and PCDEN in detecting protein complexes, especially when additional information about the proteins is integrated into the network.

Table 7 Comparison of the ensemble approaches with the proposed PCDEN ensemble. For each of the MIPS and SGD gold standards, the columns are Precision, Recall and F-measure; the rows are BGENS, FGENS, NMF, EC-BNMF and PCDEN. The best scores are highlighted in bold.

Fig. 5 Comparison of the ensemble approaches with the proposed ensemble PCDEN (Precision, Recall and F-measure for the MIPS and SGD gold standards)

7 Discussion and conclusion

Protein complexes are important for understanding principles of cellular organization and function. A better comprehension of their roles allows us to realize how a protein complex disorder can affect the biological processes in which it is involved. Proteins are related to diseases, and diseases are usually caused by an erroneous production of some protein complex. Therefore, much work has been focused on the prediction of protein complexes from PPI networks. However, it has been observed that traditional clustering approaches for protein complex identification are hampered by a number of factors, such as PPI networks being noisy and incomplete, giving rise to high false-positive rates in identifying interactions. The partitioning schemes themselves are dependent on specific initial conditions and parameter settings; the end result is unstable and unsatisfactory cluster partitions which lack biological meaning.

Keeping these challenges in mind, we have developed and established a framework for an ensemble method with satisfactory results for accurate identification of protein complexes from interaction datasets. An ensemble approach is generally able to improve the robustness and stability of the clusterings by combining the output of several algorithms, thus improving the overall prediction accuracy. Such a method can identify larger, denser clusters with improved biological significance. Ensemble-based methods are also preferable for protein complex detection because traditional clustering methods search for complexes in dense regions and miss complexes of low density, that is, small complexes of two or three proteins. Moreover, cluster ensembles can also address the problem of local optima and obtain a globally optimal solution, i.e., a stable clustering solution.

We propose an ensemble framework called Protein Complex Detection Ensemble (PCDEN) to predict protein complexes from a PPI network. We compare PCDEN with eight algorithms, and the results are validated independently using two gold standard datasets. The biological relevance of the protein complexes detected is measured using a co-localization score. PCDEN is also compared with four ensemble approaches. The PCDEN ensemble outperforms the other clustering approaches in all datasets by achieving the highest score in each of the categories used for comparison. We conclude that the complexes obtained by PCDEN achieve accuracies comparable to gold standard complexes. The results show that the integration of multiple sources increases the precision in detecting protein complexes in PPI networks.

References

Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T (2006) CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8):
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25-29
Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform 4(1):2. doi: /
Brohee S, van Helden J (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinform 7:488. doi: /
Dai Q, Duan X, Guo M, Guo Y (2016) EnPC: an EnsembleClustering framework for detecting protein complexes in protein-protein interaction network. Curr Proteom 13(2):
Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814):
Friedel CC, Krumsiek J, Zimmer R (2009) Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. J Comput Biol 16(8):
Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084):
Greene D, Cagney G, Krogan N, Cunningham P (2008) Ensemble nonnegative matrix factorization. Bioinformatics 24:
Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G (2008) Gene ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res 36(Database issue):
Huh WK, Falvo JV, Gerke LC, Carroll SA, Howson RW, Weissman JS, O'Shea EK (2003) Global analysis of protein localization in budding yeast. Nature 425:
Hung MC, Link W (2011) Protein localization in disease and therapy. J Cell Sci 124:
Hung IH, Suzuki M, Yamaguchi Y, Yuan DS, Klausner RD, Gitlin JD (1997) Biochemical characterization of the Wilson disease protein and functional expression in the yeast Saccharomyces cerevisiae. J Biol Chem 272:
Jansen R, Gerstein M (2004) Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr Opin Microbiol 7:
King A, Przulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20(17):
Krogan N, Cagney G, Yu H, Zhong G, Guo X et al (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084):
Krumsiek J, Friedel CC, Zimmer R (2008) ProCope-Protein complex Prediction and evaluation. Bioinformatics 24(18): doi: /bioinformatics/btn376
Li XL, Wu M, Kwoh CC, Ng SK (2010) Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genom 11(Suppl 1):S3. doi: / S1-S3
Li X, Wang J, Zhao B, Wu F-X, Pan Y (2016) Identification of protein complexes from multi-relationship protein interaction networks. Hum Genom 10(Suppl 2):17. doi:10.1186/s40246-016-0069-z
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, pp
Liu G, Wong L, Chua HN (2009) Complex discovery from weighted PPI networks. Bioinformatics 25(15):
Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19:
Macropol K, Can T, Singh AK (2009) RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinform 10:283
Mewes HW, Frishman D, Mayer KFX, Münsterkötter M, Noubibou O, Rattei T, Oesterheld M, Stümpflen V (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 32:41-44
Nagi S, Bhattacharyya DK (2013) Classification of microarray cancer data using ensemble approach. Netw Model Anal Health Inform Bioinform 2(3):
Nagi S, Bhattacharyya DK (2014) Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures. Netw Model Anal Health Inform Bioinform 3(67):1-38
Nepusz T, Yu H, Paccanaro A (2012) Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods 9(5):
Ou-Yang L, Dai D-Q, Zhang XF (2013) Protein complex detection via weighted ensemble. PLoS One 8(5):e62158
Ou-Yang L, Dai D-Q, Zhang XF (2015) Detecting protein complexes from signed protein-protein interaction networks. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 12(6):
Payne AS, Kelly EJ, Gitlin JD (1998) Functional expression of the Wilson disease protein reveals mislocalization and impaired copper-dependent trafficking of the common H1069Q mutation. Proc Natl Acad Sci USA 95:
Pereira-Leal JB, Enright AJ, Ouzounis CA (2004) Detection of functional modules from protein interaction networks. Proteins Struct Funct Bioinform 54:49-57
Resnick P (1999) Semantic similarity in a taxonomy: an information based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 11:
Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 7:302
Sharma P, Ahmed HA, Roy S, Bhattacharyya DK (2015) Unsupervised methods for finding protein complexes from PPI networks. Netw Model Anal Health Inform Bioinform 4(1):1-15


More information
