Complex detection from PPI data using ensemble method
|
|
- Arleen Haynes
- 6 years ago
- Views:
Transcription
1 ORIGINAL ARTICLE Complex detection from PPI data using ensemble method Sajid Nagi 1 Dhruba K. Bhattacharyya 2 Jugal K. Kalita 3 Received: 18 August 2016 / Revised: 2 November 2016 / Accepted: 17 December 2016 Springer-Verlag Wien 2016 Abstract Many algorithms have been proposed recently to detect protein complexes in protein protein interaction (PPI) networks. Most proteins form complexes to accomplish biological functions such as transcription of DNA, translation of mrna and cell growth. Since proteins perform their tasks by interacting with each other, determining these protein protein interactions is an important task. Traditional clustering approaches for protein complex identification cannot deal with noisy and incomplete PPI data and dependent on information from a single source. Since the noise in the interaction datasets hampers the detection of accurate protein complexes, we propose an ensemble approach for protein complex detection that attempts to combine information from Gene Ontology at the time of complex detection. The PPI data network is taken as input by several baseline complex detection algorithms to generate protein complexes. The protein complexes are then subsequently combined by the proposed ensemble using a consensus building module for the purpose of identifying meaningful complexes. The protein complexes thus predicted by the ensemble are evaluated by comparing them to a set of gold standard protein complexes and their biological relevance established using a co-localization score. & Sajid Nagi sajidnagi@gmail.com Department of Computer Science, St. Edmund s College, Shillong , Meghalaya, India Department of Computer Science and Engineering, Tezpur University, Napaam , Assam, India Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA Keywords Protein complex prediction Protein interaction network Empirical study of proteins Ensemble clustering 1 Introduction Proteins are complex organic compounds made up of chains of amino acids used by the body for growth, maintenance and repair of body tissues. Protein complexes perform many crucial tasks within living beings such as transcription of DNA, translation of mrna, cell growth and transporting molecules from one place to another. Moreover, in-depth study of protein complexes helps understand how they are built and how they work, allowing better comprehension of biological systems. To understand the dynamics of biological processes within an organism, it is necessary to determine the complete map of physical interactions among the proteins (interactome). The result is the PPI network where nodes represent proteins and edges represent interactions between pairs of proteins. It has been well established that the main cause of many diseases is proteins (Hung et al. 1997; Payne et al. 1998; Tanaka et al. 2003; Hung and Link 2011). An erroneous production of a protein complex can cause diseases by affecting the biological processes in which it is involved. Therefore, correct classification of a newly discovered protein complex is important as it may guide discovery of appropriate drugs. New protein complexes can be detected based on the observation that densely connected regions in the PPI networks often correspond to actual protein complexes. Graph clustering techniques have been useful since PPI networks are large-scale graphical data structures consisting of tens of thousands of pair-wise protein protein interactions.
2 1.1 Background A wide range of detection methods have been proposed in recent years for the purpose of identification and categorization of protein complexes (Yong et al. 2012; Wu et al. 2013; Zaki 2014; Srihari et al. 2015; Yong et al. 2014; Ou- Yang et al. 2015; Sharma et al. 2015; Li et al. 2016). Every new published method compares its performance with a few selected earlier methods. It has been noticed that due to the differences in the PPI networks produced, benchmark datasets used for evaluation, evaluation criteria, threshold settings and parameters used, the results of such comparisons and surveys on complex detection vary and are difficult to reconcile (Brohee and van Helden 2006; Vlasblom and Wodak 2009; Li et al. 2010; Srihari and Leong 2013). Evaluation of predicted complexes is generally performed by (a) comparing the predicted complexes against one or more gold standard sets of complexes using performance measures such as precision and recall (Liu et al. 2009), (b) if a gold standard dataset is not available, measures such as cluster cohesiveness and separability are used (Bader and Hogue 2003; King et al. 2004) and (c) evaluation of the predicted complexes can also be performed by computing a functional or co-localization score, after incorporating the appropriate annotation data (Liu et al. 2009; King et al. 2004). This evaluation is particularly useful for biological relevance of the predictions. 1.2 Observations and motivation Based on our survey of literature on protein complex detection, we make the following observations. (i) (ii) (iii) (iv) (v) Every clustering algorithm has its own nuances, being highly dependent on specific initial conditions, seed selection or parameter setting. The noise in the datasets hampers the detection of accurate protein complexes. Proteins can be assigned importance based on the number of interactions they participate in. These protein complexes can then be ranked to further study the roles they play in biological processes. Most methods report low recall because the search for complexes takes place in the dense regions of the network causing them to miss complexes of low densities from sparse parts in the network (Srihari and Leong 2012). A protein complex is usually validated after it is fully identified or detected and it may be desirable for a method to perform validation during complex identification or detection so that its goodness based on cohesiveness can be explored better. To tackle these limitations, an ensemble method, PCDEN, is proposed since ensembles have been able to improve robustness, stability and prediction accuracy of clusterings by combining the output of several algorithms (Yang et al. 2010; Nagi and Bhattacharyya 2013). Since the noise in the interaction datasets hampers the detection of accurate protein complexes, one way to compensate for the lack of interaction coverage is to combine multiple datasets for the purpose of identification of meaningful complexes. An ensemble of clustering algorithms balances individual characteristics by combining the output of several algorithms and may be able to identify larger, denser clusters with improved biological significance (Nagi and Bhattacharyya 2014). Additionally, ensemble-based methods may be preferable for protein complex detection considering that traditional clustering methods search for complexes in dense regions and miss complexes of low densities. An ensemble method proposed by (Dai et al. 2016) employs a cluster-wise voting mechanism to extract the consensus information embedded in the results of different methods, whereas (Wu et al. 2016) integrate the clusters obtained from the base clustering mechanisms with co-complex affinity scores. An alternate way may be to add topological information to biological information (Zaki 2014) so that complexes of low densities, that is, small complexes of two or three proteins are not discarded. Moreover, cluster ensembles may also be able to address the problem of local optima and obtain a more globally optimum solution, i.e., a stable clustering solution. To evaluate the quality and accuracy of predicted complexes, we use Frac (Nepusz et al. 2012), MMR (Nepusz et al. 2012), Acc (Brohee and van Helden 2006), Precision and Recall (Liu et al. 2009) and F-measure (Van Rijsbergen 1979). To evaluate the biological relevance, we have used co-localization (Huh et al. 2003) and GO semantic similarity (Schlicker et al. 2006). 1.3 Organization of the paper The rest of the paper is organized as follows: Sect. 2 briefly describes the algorithms and ensemble approaches used for comparison. Section 3 gives an account of the evaluation metrics used, followed by the description of the datasets in Sect. 4. Section 5 describes the experimental design methodology of the proposed PCDEN ensemble, followed by the results in Sect. 6. Finally, the discussion and concluding remarks are included in Sect. 7.
3 2 Related work We compare eight algorithms, namely MCL (Van Dongen 2000; Pereira-Leal et al. 2004), MCODE (Bader and Hogue 2003), RNSC (King et al. 2004), CFinder (Adamcsek et al. 2006), RRW (Macropol et al. 2009), AP (Frey and Dueck 2007), CMC (Liu et al. 2009) and ClusterONE (Nepusz et al. 2012) with our proposed ensemble. In addition, we compare our proposed ensemble method with three ensemble approaches proposed in recent years, namely Ensemble Non-negative Matrix Factorization (NMF) (Greene et al. 2008), Bipartite Graph Ensemble (BGENS) (Zhang et al. 2012) and Full Graph Ensemble (FGENS) (Zhang et al. 2012), Ensemble Clustering Bayesian Nonnegative Matrix Factorization (EC-BNMF) (Ou-Yang et al. 2013). A brief overview of the algorithms and ensembles is presented next. 2.1 Overview of the clustering approaches The Markov Clustering (MCL) (Van Dongen 2000; Pereira-Leal et al. 2004) detects protein complexes by simulating random walks in PPI networks by boosting the probabilities of intra-cluster walks and demoting intercluster walks (i.e., more links within a cluster and fewer links between clusters). Molecular Complex Detection (MCODE) (Bader and Hogue 2003) selects seed nodes with high weights as initial clusters and enhances these clusters by outward traversal from the seeds. The number of predicted complexes is generally small and the size of many predicted complexes is often too large. The Restricted Neighborhoods Search Clustering (RNSC) (King et al. 2004) algorithm begins with an initial random clustering and then searches for a better clustering with lower costs by moving vertices around. It predicts relatively few complexes and its results depend on the quality of the initial clustering. Clique Finder (Cfinder) (Adamcsek et al. 2006) takes a parameter k as input and detects all the k-cliques of the input network and connects them if they are adjacent, i.e., if they share (k - 1) common nodes. A k- clique percolation cluster is then constructed by linking all the adjacent k-cliques as a bigger subgraph. It can detect overlapping clusters. The Repeated Random Walk (RRW) (Macropol et al. 2009) algorithm iteratively expands at most k times a cluster of nodes so as to include proteins with high proximity using an affinity function, followed by a rank step to remove clusters with an overlap score above a given threshold. It is able to handle weighted and unweighted graphs. Affinity Propagation (AP) (Frey and Dueck 2007) is a k-centers clustering technique with centers referred to as exemplars, randomly selected and refined iteratively to decrease the sum of squared errors between each input object and its nearest center. AP cannot detect overlapping clusters. Clustering based on Maximum Cliques (CMC) (Liu et al. 2009) uses an iterative scoring algorithm followed by a maximal clique finding process to assess whether two proteins are in the same protein complex. The cliques are merged when the network between them is denser than the merge threshold. It is not able to handle weighted networks. The Cluster Overlapping Neighborhood Expansion (ClusterONE) (Nepusz et al. 2012) algorithm starts by selecting the protein with the highest degree as the first seed and grows a cohesive group from it using a greedy approach and then moves on to the next seed. The cohesive groups are then merged based on their overlap score. 2.2 Overview of the ensemble approaches The ensemble framework for detecting protein complexes Ensemble Non-negative Matrix Factorization (NMF) proposed by (Greene et al. 2008) first generates a collection of non-negative matrix factorizations with different number of dimensions. Then, a hierarchical meta-clustering algorithm is employed to aggregate these factorizations and produce a disjoint hierarchy of meta-clusters. Finally, the results are transformed into a soft hierarchical clustering of the original dataset. Bipartite Graph Ensemble (BGENS) (Zhang et al. 2012) is a bipartite graph that is constructed based on the affiliation of proteins and clusters. The original PPI data are initially clustered into a set of T partitions using the base clustering algorithms. Each clustering partition consists of a set of clusters from a certain clustering method. BGENS constructs a bipartite graph between protein nodes in the dataset and the cluster nodes obtained from the base clustering algorithms. The cluster ensemble, Full Graph Ensemble (FGENS) (Zhang et al. 2012), is produced from the bipartite graph BGENS, by omitting interactions among the proteins and also the relationships among the clusters. The authors consider the original protein interactions as the basis of cluster-belonging preference for those proteins that appear in different clusters. The proteins are partitioned into different final clusters by taking into consideration their original relationships with other proteins. The method proposed in BGENS is followed to partition the cluster nodes generated by the base clustering algorithms and a graph of proteins and clusters is constructed. The graph is now partitioned using spectral clustering, which partitions the graph into K parts with the objective of minimizing the cut. The ensemble approach EC-BNMF (Ensemble Clustering Bayesian Nonnegative Matrix Factorization) (Ou-Yang et al. 2013) consists of two phases. In the generation phase, useful information in the form of features is extracted from several base clustering results. Each base clustering result is regarded a feature of the original PPI network, where two proteins are
4 assumed to be connected if they occur in the same cluster at least once. In the complex detection phase, they use a Bayesian NMF-based ensemble clustering method which uses group information provided by the edges between interacting proteins to detect protein complexes. 3 Evaluation of protein complexes In this section, we outline the measures used to compare a predicted protein complex with a reference complex and also to evaluate the biological relevance of the predictions. 3.1 Performance measures The following performance measures try to assess the quality of a protein complex predicted by comparing it with the protein complexes present in a specific gold standard. However, it is not possible to establish the quality of a predicted complex if it does not match, even partially, with at least one gold standard complex Frac The fraction of pairs (Frac) (Nepusz et al. 2012) is a measure which says that a predicted complex p and a reference complex b (gold standard) match with an overlap score NAðp; bþ, given by NAðp; b Þ ¼ V p \ V b 2 x j Vb j : ð1þ V p The complexes are considered to match if NAðp; bþ. is greater than a threshold value Geometry accuracy Geometric accuracy (Acc) (Brohee and van Helden 2006) is the geometric mean of clustering sensitivity (Sn) (Brohee and van Helden 2006), and the positive predictive value (PPV) (Brohee and van Helden 2006) of a clustering. Both are based on the confusion matrix T ¼ t ij of the complexes. With n reference and m predicted complexes, the element t ij of the confusion matrix refers to the number of proteins that are found both in the reference complex i and the predicted complex j. If n i is defined as the number of proteins in reference complex i, then P n i¼1 Sn ¼ maxm j¼1 t ij P n i¼1 n ; and ð2þ i P m j¼1 PPV ¼ maxn i¼1 t ij P m P n j¼1 i¼1 t : ð3þ ij Clustering sensitivity (Sn) can give inflated values if every protein is placed in the same cluster, whereas the positive predictive value (PPV) can be maximized by putting every protein in its own cluster. For this reason, the two measures are balanced by computing their geometric accuracy (Acc), which is defined as: p Acc ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi Sn PPV: ð4þ MMR The maximum matching ratio (MMR) (Nepusz et al. 2012) is used to express how well the predicted complexes compare with a gold standard, where the overlap score between two protein complexes is computed by Eq. (2). MMR is obtained by dividing the total weight of the maximum matching complexes by the number of reference complexes. Given n standard complexes and m predicted complexes, let j be a member of the predicted complexes. MMR is then defined as: P n i¼1 MMR ¼ maxm j n NA ð p; b Þ : ð5þ A predicted complex that may not match any reference complex may show a degraded geometric accuracy score. However, a predicted complex that does not match any reference complex is not necessarily an undesired result and may be an undiscovered complex. Therefore, trying to optimize the geometric accuracy may discourage detection of such new complexes in a PPI network. This is one motivation behind the use of MMR as a metric F-measure The F-measure (Van Rijsbergen 1979) is the harmonic mean of two measures called Precision and Recall, defined as follows. Precision This measure is defined as the fraction of a cluster that consists of objects of a specific class. The precision of a cluster V j. with respect to class U i. is represented by n i;j Precision U i ; V j ¼ : ð6þ n j Recall This measure is defined as the proportion of objects of a class in a cluster. The recall of a cluster V j with respect to class U i is represented by n i;j Recall U i ; V j ¼ ; ð7þ n i where n i;j equals number of data items belonging to class U i in cluster V j, n j is the number of data items
5 belonging to cluster V j and n i denotes number of data items belonging to class U i. F-measure is defined as the rate at which a cluster contains only objects of a particular class and all objects of that class (Van Rijsbergen 1979). Thus, the F-measure of a cluster U i with respect to class V j is represented by the following expression: F-measure ¼ 2 Precision U i; V j Recall Ui ; V j : Precision U i ; V j þ Recall Ui ; V j ð8þ F-measure is useful in the sense that it is an objective metric of how well a clustering algorithm is able to recover the original clusters. It is measured in the range [0, 1] and high values indicate a good quality of clustering. 3.2 Biological significance measures One of the main issues that arise during the comparison of predicted complexes and a gold standard complex is that the gold standard protein complex sets are usually incomplete (Jansen and Gerstein 2004). Hence, a predicted complex that does not match any of the reference complexes may belong to a valid but previously uncataloged complex. Therefore, relying on the comparison measures outlined in Sect. 3.1, which are based on a predefined gold standard, may not give a true picture of the predicted complexes. Rather, the scores should be supported by measures that assess the biological relevance of predicted complexes, based on the similarity of their functional annotation or the co-localization of the constituent proteins Co-localization similarity Since proteins in the same complex have a tendency to share common functions, they tend to be located at the same location in a cell. The maximum fraction of proteins in a complex found at the same location is known as the colocalization score for that complex. The average co-localization score is calculated as the weighted average over all complexes and is defined as: P j C ¼ max il i;j P : ð9þ j A j Here, l i;j is the number of proteins of complex A j assigned to the localization group i and A j is the number of proteins in the complex A j with localization assignments. A high co-localization score indicates a high functional similarity between proteins in the same complex GO semantic similarity Comparison of GO terms associated with proteins in a protein complex is another indicator of biological relevance. This can be performed in terms of functional similarity (Ashburner et al. 2000). The functional similarity of two proteins is measured in terms of the semantic similarity of GO terms with which these proteins have been annotated. The GO semantic similarity score for a protein complex is defined as the geometric mean of all complex scores determined separately for the biological process and molecular function taxonomies for a set of complexes. A higher GO semantic similarity score is associated with better quality of a set of protein complexes. Out of the several variations of semantic similarity that have been described (Resnick 1999; Lin 1998; Lord et al. 2003; Schlicker et al. 2006), we use the semantic similarity proposed by (Schlicker et al. 2006). It is defined as: sim Schlicker ðt 1 ; t 2 Þ ¼ 2 ln ½ p msðt 1 ; t 2 Þ lnðpðt 1 Þþlnðpðt 2 Þ ð 1 p msðt 1 ; t 2 ÞÞ ð10þ Here, t, t 1 and t 2 are GO terms, pt ðþ represents the probability of term t and p ms is the minimum subsumer. 4 Dataset description We compare protein complexes, our clustering algorithm predicts, with a set of known interactions called gold standards. 4.1 PPI datasets Each of the algorithms and the proposed ensemble has been tested on three publicly available high-throughput benchmark yeast PPI datasets, namely Gavin (Gavin et al. 2006), Krogan Core (Krogan et al. 2006) referred to in this paper as Krogan, and Krogan Extended (Krogan et al. 2006) referred in the paper as Krogan?. Some details of the datasets are shown in Table Gold standard datasets We consider the Saccharomyces Genome Database (SGD) (Hong et al. 2008) and the Munich Information Centre for Protein Sequences (MIPS) (Mewes et al. 2004) Saccharomyces cerevisiae protein protein interaction dataset as they have been applied in several analyses as gold standard reference because of their quality and comprehensiveness. The general properties of these datasets are given in Table 2.
6 Table 1 Benchmark datasets Details Gavin Krogan Krogan? References Gavin et al. (2006) Krogan et al. (2006) Krogan et al. (2006) Number of proteins Number of interactions ,317 Weighted Yes Yes Yes 5 Protein complex detection ensemble (PCDEN) The PCDEN approach consists of three phases. The First Phase generates n individual complexes sets from PPI Network data taken as input by n base clustering algorithms. The set of base clustering algorithms is identified after an exhaustive experimentation with a large number of benchmark datasets. The Second Phase employs a consensus building process to generate protein complexes by identifying a common set of protein nodes from the corresponding complex sets (obtained from the base clustering algorithms) and also using GO similarity. The Third Phase evaluates the predicted protein complexes and also establishes their biological relevance. A conceptual framework for the proposed protein complex detection ensemble (PCDEN) is shown in Fig. 1. Phase I: complex set generation The PPI network data are taken as input by n consistently well-performing base clustering algorithms BC 1, BC 2,, BC n which generate n individual complexes sets C 1,C 2,, C n. In this step, care must be taken to select base clustering algorithms that are diverse in nature. As a result, the errors made in clustering by one algorithm are likely to be averaged out by the correct clustering of another, so that the overall clustering accuracy is improved and a final unbiased decision can be made. Phase II: protein complex generation through consensus Next, we take a complex set C 1 generated by the base clustering algorithms BC 1 and identify the node within this complex with the highest degree. A node with a higher degree implies high connectivity or association with other nodes. Such nodes are chosen since they give more guarantee to form meaningful complex. The next complex set C 2 generated by BC 2 is taken and we identify the corresponding complex based on the highest degree node as Table 2 General properties of the gold standard datasets General properties SGD MIPS References Hong et al. (2008) Mewes et al. (2004) No. of proteins No. of complexes Overlapping proteins identified by complex set C 1. Similarly, we take the other set of complexes C 3, C 4,, C n generated by the base clustering algorithms BC 3, BC 4,, BC n and identify the corresponding complexes based on the highest degree node identified earlier. After we have identified all the complexes containing the highest degree node from the n complex sets, a common set of nodes (proteins), say P i, is identified from the corresponding complex sets with the condition that each member element of P i is present in at least two complexes. Two cases are considered now. Case 1 For the common set of nodes P i, we complete the edges among the nodes of P i subject to the condition that if n i and n j are two nodes present in the common set of nodes P i, then they will be connected by an edge if and only if (n i, n j ) are already connected by an edge in at least two complexes. This will continue for every node and will give rise to a common complex, P Com i, consisting of nodes that are present in all corresponding complexes and inter-connected by an edge. Hence, a complex is generated which is arrived at by consensus among the corresponding complexes. Case 2 For the common set of nodes P i, complete the edges to obtain a complex P GO i based on GO similarity. Since a protein is usually annotated with multiple GO terms, the semantic similarity measured between two GO terms can be directly converted to a measurement of the similarity between two proteins. Thus, Gene Ontology information is integrated with PPI network data based on semantic similarity measure (Schlicker et al. 2006). Next, we superimpose both the complexes obtained from the common set of nodes P i, i.e., P Com i and P GO i to obtain the final complex P Fin i. The superimposition is done using the logic given in Table 3. In other words, if the edge is present for a pair of nodes n i and n j in both P Com i and P GO i, then the edge is retained in the final predicted complex P Fin i. If an edge is present for a pair of nodes n i and n j in P Com i and absent in P GO i or vice versa, then also that edge is retained even though it may be a case of false alarm. The reason to retain this edge is that in the case of P Com i, the edge is a result of consensus (where un-supervised approach has been applied) and in the case of P GO i, the edge is obtained through GO similarity (a case of supervised approach). So in both the cases, the probability of the existence of the edge is justified. The final
7 COMPLEX SET GENERATION FROM BASE CLUSTERING ALGORITHMS PPI Network data BC 1 BC 2 BC 3 BC n (unsupervised methods) C 1 C 2 C 3 C n-1 C n (n individual complex sets) PROTEIN COMPLEX GENERATION THROUGH CONSENSUS AMONG THE CORRESPONDING COMPLEXES Choose a complex from n individual complex sets Identify a node with highest degree from the complex Find corresponding complexes based on node identified Find a common set of nodes, P i, from the corresponding complex sets, where each member element is present atleast in two complexes Complete the edges among the nodes of P i to obtain P i Com by fulfilling the condition that: a pair of nodes (n i, n j ) will have an edge iff n i, n j are connected by an edge atleast in two complexes Use GO Similarity to complete the edges among the nodes of P i to obtain P i GO Superimpose P i Com and P i GO to obtain the final complex P i Fin such that an edge will exist in P i Fin iff the edge is present in (i) P i Com or P i GO (ii) Both in P i Com and in P i GO VALIDATION PHASE Result validation of P i Fin using Frac, Acc, MMR, Precision, Recall and F-measure Result validation using Co-localization Score Gold Standard protein complexes Fig. 1 Conceptual framework for the proposed protein complex detection ensemble Table 3 Existence of an edge in the final complex Edge in P Com i Edge in P GO i Edge in P Fin i Present Present Present Present Not present Present Not present Present Present Not present Not present Not present predicted complex P Fin i is now stored in a file for the purpose of validation. Now, we start with the second highest degree node for the formation of the second protein complex and then iterate through to the next lower degree node and so on. This process of identifying the node of the highest degree from the next set of complexes and the process of identifying protein complexes through consensus and through the use of GO similarity are repeated for each corresponding complex. The end result of this phase is that there will be interacting protein complexes with overlapping proteins. Phase III: validation Finally, the validation phase evaluates protein complexes predicted by comparing them to a set of gold standard protein complexes and also establishes the biological relevance of the predicted protein complexes using the co-localization score. Though the method adopted for protein complex detection is computationally expensive, the primary objective is to find complexes with high precision. For this reason, speed of detection is compromised in favor of
8 accuracy and precision. Our main focus is to make use of multiple sources and means to detect and validate the complexes found since a single information source or criterion has its limitations as discussed in Sect Experimental evaluation Three different clustering methods for protein complex detection, MCODE (Bader and Hogue 2003), MCL (Van Dongen 2000; Pereira-Leal et al. 2004), and CFinder (Adamcsek et al. 2006), are used as the base clustering algorithms. These methods have been selected as they follow different approaches for clustering and MCODE and CFinder can detect overlapping complexes. Moreover, we have also performed an empirical evaluation of these algorithms along with others, namely RNSC (King et al. 2004), AP (Frey and Dueck 2007), CMC (Liu et al. 2009), RRW (Macropol et al. 2009) and ClusterONE (Nepusz et al. 2012) and have found that the performance of these algorithms fall in the category of good, average and not so good. Hence, these algorithms satisfy the diversity criteria of our ensemble. ClusterONE (Nepusz et al. 2012) has been found to be the best performing complex finding algorithm and hence we will compare our ensemble s performance with this algorithm. GO semantic similarity is calculated based on the information content of GO terms and the semantic similarity proposed by (Schlicker et al. 2006). The PPI data set for the experiments is summarized in Table 1. The GO-slim file was downloaded from and the Biological Process (BP) hierarchy was selected to calculate the GO-driven similarity of proteins. The evaluation of the results was done by the SGD and MIPS gold standard datasets, shown in Table 2. 6 Results For each clustering algorithm tested, we tune the input parameters after experimenting with different combinations of settings to achieve optimal performance, unless the authors of the original algorithm explicitly suggest particular settings. This procedure was conducted separately for each measure, for each dataset and with respect to each gold standard to obtain the most optimistic score. 6.1 Comparison of PCDEN with baseline clustering algorithms We compare the performance of the proposed ensemble PCDEN with the approaches described in Sect Quality score by MIPS gold standard In this section, we report the results obtained by comparing protein complexes predicted by each clustering algorithm and the proposed PCDEN ensemble, with the gold standard obtained from the MIPS catalog of protein complexes. The best scores are highlighted in bold for easy comparison. The number of complexes and the matched number of complexes for each dataset, the fraction of protein complexes matched by at least one predicted complex, geometric accuracy and maximum matching ratio as detected by the algorithms are reported in Table 4. When we compare the eight clustering approaches, we observe that ClusterONE shows the best scores in the number of detected complexes that match at least one real complex, and also for the three quality measures, namely Frac, Acc and MMR across all the three datasets. However, we notice that although the number of protein complexes identified by PCDEN is lower than several others, 88 of the 152 predicted complexes matched very well with the benchmark complexes. Further we also observe that PCDEN obtains the best scores for the quality measures Frac and Acc across all the three datasets used for comparison. The maximum matching ratio MMR score of for the Gavin dataset obtained by PCDEN is also the highest indicating better quality of overlapped protein complexes detected. The comparison results are presented in Fig. 2 and it is clear that the PCDEN ensemble outperforms the other approaches in all datasets by achieving the highest score in each of the three categories used for comparison. The results obtained by PCDEN are better than those of the single clustering approaches Quality score by SGD gold standard Similarly, Table 5 shows the scores for the number of complexes detected, number of matched complexes, Frac, Acc and MMR for the three datasets achieved by the algorithms and PCDEN using the SGD gold standard. It is observed that although ClusterONE shows the best scores among the eight baseline clustering algorithms across all the three datasets, PCDEN achieves the best results in protein complex detection with respect to the three quality measures, for all the three datasets and also identifies the maximum number of matched complexes. We also observe in Tables 4 and 5 that the number of complexes detected by PCDEN for the Krogan and Krogan? datasets is closer to the actual number of MIPS complexes, i.e., 203, while the results of ClusterONE seem to contain a large number of extra clusters, i.e., 526 and 531. Moreover, the number of predicted complexes by
9 Table 4 Comparison of the protein complex detection algorithms and proposed ensemble PCDEN on PPI datasets using the MIPS complex dataset Algorithm Gavin Krogan Krogan? #c #m Frac Acc MMR #c #m Frac Acc MMR #c #m Frac Acc MMR MCL MCODE RNSC CFINDER RRW AP CMC ClusterONE PCDEN The best scores are highlighted in bold #c no. of complexes, #m no. of matched complexes 0.8 Comparison of baseline approaches with PCDEN for each Dataset Quality Score Frac Acc MMR Frac Acc MMR Frac Acc MMR MCL MCODE RNSC CFINDER RRW AP CMC ClusterONE PCDEN Gavin Krogan Krogan+ Evaluation Metrics of the clustering algorithms and PCDEN Fig. 2 Comparing the evaluation metrics of the clustering algorithms and PCDEN for the MIPS dataset Table 5 Comparison of the protein complex detection algorithms and proposed ensemble on PPI datasets using the SGD complex dataset Algorithm Gavin Krogan Krogan? #c #m Frac Acc MMR #c #m Frac Acc MMR #c #m Frac Acc MMR MCL MCODE RNSC CFINDER RRW AP CMC ClusterONE PCDEN The best scores are highlighted in bold #c no. of complexes, #m no. of matched complexes PCDEN matches very well with the number of benchmark complexes for the Krogan and Krogan? datasets. Figure 3 shows the comparison of the clustering algorithms and PCDEN for SGD dataset and it is noticed that PCDEN outperforms all the other single clustering approaches in all datasets. The results obtained by PCDEN prove that the integration of multiple sources has greater power to predict protein complexes in PPI networks.
10 6.1.3 Biological significance of predicted complexes on Gavin and Krogan datasets The biological relevance of the predicted complexes based on the similarity of their co-localization score is discussed next Co-localization score We make use of the localization dataset published by (Huh et al. 2003) and ProCope (Krumsiek et al. 2008; Friedel et al. 2009), a popular tool used to predict and evaluate protein complexes, to calculate the co-localization score. Table 6 shows the co-localization scores of protein complexes detected using the eight algorithms and PCDEN on the Gavin and Krogan datasets. The co-localization scores of protein complexes obtained by the eight algorithms on the Gavin and Krogan datasets are shown in Table 6. As expected, the highest colocalization score of on the Gavin dataset is achieved by PCDEN, followed by ClusterONE with a score of The co-localization score on the Krogan dataset of PCDEN is which is higher than 0.725, the score of obtained by MCL and achieved by ClusterONE. Hence, the protein complexes detected by the PCDEN ensemble have relatively high quality from the biological standpoint due to the high co-localization score on both the Gavin and Krogan datasets. The results of comparisons of Co-localization score of PCDEN and the baseline algorithms on the Gavin and Krogan datasets are shown in Fig Comparison of PCDEN with Ensemble Approaches In addition to comparison with individual clustering methods, we compare PCDEN with the ensemble approaches proposed in recent years, NMF (Greene et al. 2008), EC-BNMF (Ou-Yang et al. 2013), BGENS (Zhang et al. 2012) and FGENS (Zhang et al. 2012). Table 6 Comparison of Co-localization Score Algorithm Co-localization score Gavin Krogan MCL MCODE RNSC CFINDER RRW AP CMC ClusterONE PCDEN The best scores are highlighted in bold Co-localization Score Comparison of Co-localization Score Gavin Datasets Fig. 4 Comparison of Co-localization Score Krogan Analysis of ensemble results We use the validity measures Precision, Recall and F-measure discussed in Sect The results of the PCDEN ensemble are compared with the results of the ensemble approaches discussed in Sect Due to the non-availability of source code of the ensemble approaches used for comparison with PCDEN, the metrics Frac, Acc and MMR could not be established for these ensemble approaches. Hence, we use the Precision, Recall and F-measure scores as provided by the authors for the purpose of comparison with PCDEN. MCL MCODE RNSC CFINDER RRW AP CMC ClusterONE PCDEN Quality Score Comparison of baseline approaches with PCDEN for each Dataset Frac Acc MMR Frac Acc MMR Frac Acc MMR Gavin Krogan Krogan+ Evaluation metrics of the clustering algorithms and PCDEN MCL MCODE RNSC CFINDER RRW AP CMC ClusterONE PCDEN Fig. 3 Comparative analysis of the evaluation metrics of the clustering algorithms and PCDEN for SGD
11 The proposed ensemble method outperforms other approaches by achieving higher F-measure and Recall rates for the MIPS gold standard, as shown in Table 7. The F-measure and Recall rates for SGD gold standard are and for PCDEN, which are almost at par with FGENS, which has scored for F-measure and for Recall rate. However, PCDEN finds more matched modules from the PPI network than others and this demonstrates its effectiveness. NMF and BC-NMF use only the PPI network topology whereas PCDEN integrates information from GO similarity sources and so achieves higher scores than these methods. It also shows better results than BGENS and FGENS for the MIPS gold standard. Its F-measure and Recall rates for SGD gold standard are almost at par with FGENS but are better than BGENS, the reason being that FGENS combines gene expression data and GO similarity information, which is also the case with PCDEN. This indicates that the integration of multiple information sources greatly benefits the protein complex detection. A graphical comparison of the clustering approaches using Precision, Recall and F-measure is given in Fig. 5. The performance of PCDEN is better than the other cluster ensemble methods, i.e., NMF and BC-NMF, but is at par for one result with FGENS. This indicates the effectiveness of a graph-based cluster ensemble method such as FGENS and PCDEN in detecting protein complexes, especially so, when additional information about the proteins is integrated into the network. 7 Discussion and conclusion Protein complexes are important for understanding principles of cellular organization and function. A better comprehension of their roles allows us to realize how a protein complex disorder can affect the biological processes in which it is involved. Proteins are related to diseases and diseases are usually caused by an erroneous production of some protein complex. Therefore, much work has been focussed on the prediction of protein complexes from PPI Quality Score Precision Comparative Performance of Ensemble Approaches Recall MIPS F-measure networks. However, it has been observed that traditional clustering approaches for protein complex identification are hampered by a number of factors, such as PPI networks being noisy and incomplete giving rise to high false-positive rates in identifying interactions. The partitioning schemes themselves are dependent on specific initial conditions and parameter settings; the end result is unstable and unsatisfactory cluster partitions which lack biological meaning. Keeping these challenges in mind, we have developed and established a framework for an ensemble method with satisfactory results for accurate identification of protein complexes from interaction datasets. An ensemble approach is generally able to improve the robustness and stability of the clusterings by combining the output of several algorithms, thus improving the overall prediction accuracy. Such a method can identify larger, denser clusters with improved biological significance. Ensemble-based methods are preferable for protein complex detection also considering that traditional clustering methods search for complexes in dense regions and miss complexes of low densities, that is, small complexes of two or three proteins. Moreover, cluster ensembles can also address the problem of local optima and obtain a globally optimum solution, i.e., a stable clustering solution. Precision Recall SGD F-measure Evaluation Measures for MIPS and SGD Gold Standard BGENS FGENS NMF EC- BNMF PCDEN Fig. 5 Comparison of the ensemble approaches with the proposed ensemble PCDEN Table 7 Comparison of the ensemble approaches with the proposed PCDEN ensemble Ensemble Method MIPS SGD Precision Recall F-measure Precision Recall F-measure BGENS FGENS NMF EC-BNMF PCDEN The best scores are highlighted in bold
12 We propose an ensemble framework called Protein Complex Detection Ensemble (PCDEN) to predict protein complexes from a PPI network. We compare PCDEN with eight algorithms and the results are validated independently using two gold standard datasets. The biological relevance of the protein complexes detected is measured using a colocalization score. PCDEN is also compared with four ensemble approaches. The PCDEN ensemble outperforms the other clustering approaches in all datasets by achieving the highest score in each of the categories used for comparison. We conclude that the complexes obtained by PCDEN achieve accuracies comparable to gold standard complexes. The results prove that the integration of multiple sources increases the precision in detecting protein complexes in PPI networks. References Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T (2006) CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8): Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25 29 Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform 4(1):2. doi: / Brohee S, van Helden J (2006) Evaluation of clustering algorithms for protein protein interaction networks. BMC Bioinform 7:488. doi: / Dai Q, Duan X, Guo M, Guo Y (2016) EnPC: an EnsembleClustering framework for detecting protein complexes in protein protein interaction network. Curr Proteom 13(2): Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814): Friedel CC, Krumsiek J, Zimmer R (2009) Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. J Comput Biol 16(8): Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084): Greene D, Cagney G, Krogan N, Cunningham P (2008) Ensemble nonnegative matrix factorization. Bioinformatics 24: Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G (2008) Gene ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res 36(Database issue): Huh WK, Falvo JV, Gerke LC, Carroll SA, Howson RW, Weissman JS, O Shea EK (2003) Global analysis of protein localization in budding yeast. Nature 425: Hung MC, Link W (2011) Protein localization in disease and therapy. J Cell Sci 124: Hung IH, Suzuki M, Yamaguchi Y, Yuan DS, Klausner RD, Gitlin JD (1997) Biochemical characterization of the Wilson disease protein and functional expression in the yeast Saccharomyces cerevisiae. J Biol Chem 272: Jansen R, Gerstein M (2004) Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr Opin Microbiol 7: King A, Przulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20(17): Krogan N, Cagney G, Yu H, Zhong G, Guo X et al (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084): Krumsiek J, Friedel CC, Zimmer R (2008) ProCope-Protein complex Prediction and evaluation. Bioinformatics 24(18): doi: /bioinformatics/btn376 Li XL, Wu M, Kwoh CC, Ng SK (2010) Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genom. 11(Suppl 1):S3. doi: / S1-S3 Li X, Wang J, Zhao B, Wu F-X, Pan Y (2016) Identification of protein complexes from multi-relationship protein interaction networks. Hum Genom 10(Suppl 2):17. doi: /s z Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA, pp Liu G, Wong L, Chua HN (2009) Complex discovery from weighted PPI networks. Bioinformatics 25(15): Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19: Macropol K, Can T, Singh AK (2009) RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinform 10:283 Mewes HW, Frishman D, Mayer KFX, Münsterkötter M, Noubibou O, Rattei T, Oesterheld M, Stümpflen V (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 32:41 44 Nagi S, Bhattacharyya DK (2013) Classification of microarray cancer data using ensemble approach. Netw Model Anal Health Inform Bioinform 2(3): Nagi S, Bhattacharyya DK (2014) Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures. Netw Model Anal Health Inform Bioinform 3(67):1 38 Nepusz T, Yu H, Paccanaro A (2012) Detecting overlapping protein complexes in protein protein interaction networks. Nat Methods 9(5): Ou-Yang L, Dai D-Q, Zhang XF (2013) Protein complex detection via weighted ensemble. PLoS One 8(5):e62158 Ou-Yang L, Dai D-Q, Zhang XF (2015) Detecting protein complexes from signed protein protein interaction networks. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 12(6): Payne AS, Kelly EJ, Gitlin JD (1998) Functional expression of the Wilson disease protein reveals mislocalization and impaired copper-dependent trafficking of the common H1069Q mutation. Proc Natl Acad Sci USA 95: Pereira-Leal JB, Enright AJ, Ouzounis CA (2004) Detection of functional modules from protein interaction networks. Proteins Struct Funct Bioinform 54:49 57 Resnick P (1999) Semantic similarity in a taxonomy: an information based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 11: Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 7:302 Sharma P, Ahmed HA, Roy S, Bhattacharyya DK (2015) Unsupervised methods for finding protein complexes from PPI networks. Netw Model Anal Health Inform Bioinform 4(1):1 15
Robust Community Detection Methods with Resolution Parameter for Complex Detection in Protein Protein Interaction Networks
Robust Community Detection Methods with Resolution Parameter for Complex Detection in Protein Protein Interaction Networks Twan van Laarhoven and Elena Marchiori Institute for Computing and Information
More informationA Multiobjective GO based Approach to Protein Complex Detection
Available online at www.sciencedirect.com Procedia Technology 4 (2012 ) 555 560 C3IT-2012 A Multiobjective GO based Approach to Protein Complex Detection Sumanta Ray a, Moumita De b, Anirban Mukhopadhyay
More informationProtein complex detection using interaction reliability assessment and weighted clustering coefficient
Zaki et al. BMC Bioinformatics 2013, 14:163 RESEARCH ARTICLE Open Access Protein complex detection using interaction reliability assessment and weighted clustering coefficient Nazar Zaki 1*, Dmitry Efimov
More informationAn Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules
An Efficient Algorithm for Protein-Protein Interaction Network Analysis to Discover Overlapping Functional Modules Ying Liu 1 Department of Computer Science, Mathematics and Science, College of Professional
More informationMIPCE: An MI-based protein complex extraction technique
MIPCE: An MI-based protein complex extraction technique PRIYAKSHI MAHANTA 1, *, DHRUBA KR BHATTACHARYYA 1 and ASHISH GHOSH 2 1 Department of Computer Science and Engineering, Tezpur University, Napaam
More informationTowards Detecting Protein Complexes from Protein Interaction Data
Towards Detecting Protein Complexes from Protein Interaction Data Pengjun Pei 1 and Aidong Zhang 1 Department of Computer Science and Engineering State University of New York at Buffalo Buffalo NY 14260,
More informationIdentification of protein complexes from multi-relationship protein interaction networks
Li et al. Human Genomics 2016, 10(Suppl 2):17 DOI 10.1186/s40246-016-0069-z RESEARCH Identification of protein complexes from multi-relationship protein interaction networks Xueyong Li 1,2, Jianxin Wang
More informationProtein Complex Identification by Supervised Graph Clustering
Protein Complex Identification by Supervised Graph Clustering Yanjun Qi 1, Fernanda Balem 2, Christos Faloutsos 1, Judith Klein- Seetharaman 1,2, Ziv Bar-Joseph 1 1 School of Computer Science, Carnegie
More informationCS4220: Knowledge Discovery Methods for Bioinformatics Unit 6: Protein-Complex Prediction. Wong Limsoon
CS4220: Knowledge Discovery Methods for Bioinformatics Unit 6: Protein-Complex Prediction Wong Limsoon 2 Lecture Outline Overview of protein-complex prediction A case study: MCL-CAw Impact of PPIN cleansing
More informationEFFICIENT AND ROBUST PREDICTION ALGORITHMS FOR PROTEIN COMPLEXES USING GOMORY-HU TREES
EFFICIENT AND ROBUST PREDICTION ALGORITHMS FOR PROTEIN COMPLEXES USING GOMORY-HU TREES A. MITROFANOVA*, M. FARACH-COLTON**, AND B. MISHRA* *New York University, Department of Computer Science, New York,
More informationAnalysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science
1 Analysis and visualization of protein-protein interactions Olga Vitek Assistant Professor Statistics and Computer Science 2 Outline 1. Protein-protein interactions 2. Using graph structures to study
More informationCS4220: Knowledge Discovery Methods for Bioinformatics Unit 6: Protein-Complex Prediction. Wong Limsoon
CS4220: Knowledge Discovery Methods for Bioinformatics Unit 6: Protein-Complex Prediction Wong Limsoon 2 Lecture outline Overview of protein-complex prediction A case study: MCL-CAw Impact of PPIN cleansing
More informationThe Role of Network Science in Biology and Medicine. Tiffany J. Callahan Computational Bioscience Program Hunter/Kahn Labs
The Role of Network Science in Biology and Medicine Tiffany J. Callahan Computational Bioscience Program Hunter/Kahn Labs Network Analysis Working Group 09.28.2017 Network-Enabled Wisdom (NEW) empirically
More informationProtein function prediction via analysis of interactomes
Protein function prediction via analysis of interactomes Elena Nabieva Mona Singh Department of Computer Science & Lewis-Sigler Institute for Integrative Genomics January 22, 2008 1 Introduction Genome
More informationEnsemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions
Belfield Campus Map Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions
More informationChapter 16. Clustering Biological Data. Chandan K. Reddy Wayne State University Detroit, MI
Chapter 16 Clustering Biological Data Chandan K. Reddy Wayne State University Detroit, MI reddy@cs.wayne.edu Mohammad Al Hasan Indiana University - Purdue University Indianapolis, IN alhasan@cs.iupui.edu
More informationA Max-Flow Based Approach to the. Identification of Protein Complexes Using Protein Interaction and Microarray Data
A Max-Flow Based Approach to the 1 Identification of Protein Complexes Using Protein Interaction and Microarray Data Jianxing Feng, Rui Jiang, and Tao Jiang Abstract The emergence of high-throughput technologies
More informationComparison of Protein-Protein Interaction Confidence Assignment Schemes
Comparison of Protein-Protein Interaction Confidence Assignment Schemes Silpa Suthram 1, Tomer Shlomi 2, Eytan Ruppin 2, Roded Sharan 2, and Trey Ideker 1 1 Department of Bioengineering, University of
More informationConstruction of dynamic probabilistic protein interaction networks for protein complex identification
Zhang et al. BMC Bioinformatics (2016) 17:186 DOI 10.1186/s12859-016-1054-1 RESEARCH ARTICLE Open Access Construction of dynamic probabilistic protein interaction networks for protein complex identification
More informationEvidence for dynamically organized modularity in the yeast protein-protein interaction network
Evidence for dynamically organized modularity in the yeast protein-protein interaction network Sari Bombino Helsinki 27.3.2007 UNIVERSITY OF HELSINKI Department of Computer Science Seminar on Computational
More informationMTopGO: a tool for module identification in PPI Networks
MTopGO: a tool for module identification in PPI Networks Danila Vella 1,2, Simone Marini 3,4, Francesca Vitali 5,6,7, Riccardo Bellazzi 1,4 1 Clinical Scientific Institute Maugeri, Pavia, Italy, 2 Department
More informationIntroduction to Bioinformatics
CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics
More informationInteraction Network Topologies
Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005 Inferrng Protein-Protein Interactions Using Interaction Network Topologies Alberto Paccanarot*,
More informationIEEE Xplore - A binary matrix factorization algorithm for protein complex prediction http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=573783 Page of 2-3-9 Browse > Conferences> Bioinformatics and Biomedicine...
More informationDetecting temporal protein complexes from dynamic protein-protein interaction networks
Detecting temporal protein complexes from dynamic protein-protein interaction networks Le Ou-Yang, Dao-Qing Dai, Xiao-Li Li, Min Wu, Xiao-Fei Zhang and Peng Yang 1 Supplementary Table Table S1: Comparative
More informationA MAX-FLOW BASED APPROACH TO THE IDENTIFICATION OF PROTEIN COMPLEXES USING PROTEIN INTERACTION AND MICROARRAY DATA (EXTENDED ABSTRACT)
1 A MAX-FLOW BASED APPROACH TO THE IDENTIFICATION OF PROTEIN COMPLEXES USING PROTEIN INTERACTION AND MICROARRAY DATA (EXTENDED ABSTRACT) JIANXING FENG Department of Computer Science and Technology, Tsinghua
More informationAn Improved Ant Colony Optimization Algorithm for Clustering Proteins in Protein Interaction Network
An Improved Ant Colony Optimization Algorithm for Clustering Proteins in Protein Interaction Network Jamaludin Sallim 1, Rosni Abdullah 2, Ahamad Tajudin Khader 3 1,2,3 School of Computer Sciences, Universiti
More informationA method for predicting protein complex in dynamic PPI networks
Zhang et al. BMC Bioinformatics 2016, 17(Suppl 7):229 DOI 10.1186/s12859-016-1101-y RESEARCH Open Access A method for predicting protein complex in dynamic PPI networks Yijia Zhang *, Hongfei Lin, Zhihao
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationDiscrete Applied Mathematics. Integration of topological measures for eliminating non-specific interactions in protein interaction networks
Discrete Applied Mathematics 157 (2009) 2416 2424 Contents lists available at ScienceDirect Discrete Applied Mathematics journal homepage: www.elsevier.com/locate/dam Integration of topological measures
More informationDiscovering molecular pathways from protein interaction and ge
Discovering molecular pathways from protein interaction and gene expression data 9-4-2008 Aim To have a mechanism for inferring pathways from gene expression and protein interaction data. Motivation Why
More informationComputational approaches for functional genomics
Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding
More informationIteration Method for Predicting Essential Proteins Based on Orthology and Protein-protein Interaction Networks
Georgia State University ScholarWorks @ Georgia State University Computer Science Faculty Publications Department of Computer Science 2012 Iteration Method for Predicting Essential Proteins Based on Orthology
More informationFrancisco M. Couto Mário J. Silva Pedro Coutinho
Francisco M. Couto Mário J. Silva Pedro Coutinho DI FCUL TR 03 29 Departamento de Informática Faculdade de Ciências da Universidade de Lisboa Campo Grande, 1749 016 Lisboa Portugal Technical reports are
More informationPredicting Protein Functions and Domain Interactions from Protein Interactions
Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput
More informationPhylogenetic Analysis of Molecular Interaction Networks 1
Phylogenetic Analysis of Molecular Interaction Networks 1 Mehmet Koyutürk Case Western Reserve University Electrical Engineering & Computer Science 1 Joint work with Sinan Erten, Xin Li, Gurkan Bebek,
More informationMTGO: PPI Network Analysis Via Topological and Functional Module Identification
www.nature.com/scientificreports Received: 1 November 2017 Accepted: 28 February 2018 Published: xx xx xxxx OPEN MTGO: PPI Network Analysis Via Topological and Functional Module Identification Danila Vella
More informationCMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison
CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture
More informationDifferential Modeling for Cancer Microarray Data
Differential Modeling for Cancer Microarray Data Omar Odibat Department of Computer Science Feb, 01, 2011 1 Outline Introduction Cancer Microarray data Problem Definition Differential analysis Existing
More informationConstraint-based Subspace Clustering
Constraint-based Subspace Clustering Elisa Fromont 1, Adriana Prado 2 and Céline Robardet 1 1 Université de Lyon, France 2 Universiteit Antwerpen, Belgium Thursday, April 30 Traditional Clustering Partitions
More information2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms
Bioinformatics Advance Access published March 7, 2007 The Author (2007). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
More informationPROTEINS are the building blocks of all the organisms and
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 9, NO. 3, MAY/JUNE 2012 717 A Coclustering Approach for Mining Large Protein-Protein Interaction Networks Clara Pizzuti and Simona
More informationExtension and Robustness of Transitivity Clustering for Protein Protein Interaction Network Analysis
Internet Mathematics Vol. 7, No. 4: 255 273 Extension and Robustness of Transitivity Clustering for Protein Protein Interaction Network Analysis Tobias Wittkop, Sven Rahmann, Richard Röttger, Sebastian
More informationA general co-expression network-based approach to gene expression analysis: comparison and applications
BMC Systems Biology This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon. A general co-expression
More informationInteraction Network Analysis
CSI/BIF 5330 Interaction etwork Analsis Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Biological etworks Definition Maps of biochemical reactions, interactions, regulations
More informationarxiv: v1 [q-bio.mn] 5 Feb 2008
Uncovering Biological Network Function via Graphlet Degree Signatures Tijana Milenković and Nataša Pržulj Department of Computer Science, University of California, Irvine, CA 92697-3435, USA Technical
More informationEstimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree
Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree YOICHI YAMADA, YUKI MIYATA, MASANORI HIGASHIHARA*, KENJI SATOU Graduate School of Natural
More informationResearch Article Prediction of Protein-Protein Interactions Related to Protein Complexes Based on Protein Interaction Networks
BioMed Research International Volume 2015, Article ID 259157, 9 pages http://dx.doi.org/10.1155/2015/259157 Research Article Prediction of Protein-Protein Interactions Related to Protein Complexes Based
More informationMarkov Random Field Models of Transient Interactions Between Protein Complexes in Yeast
Markov Random Field Models of Transient Interactions Between Protein Complexes in Yeast Boyko Kakaradov Department of Computer Science, Stanford University June 10, 2008 Motivation: Mapping all transient
More informationComparison of Human Protein-Protein Interaction Maps
Comparison of Human Protein-Protein Interaction Maps Matthias E. Futschik 1, Gautam Chaurasia 1,2, Erich Wanker 2 and Hanspeter Herzel 1 1 Institute for Theoretical Biology, Charité, Humboldt-Universität
More informationComputational methods for predicting protein-protein interactions
Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational
More informationBayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen
Bayesian Hierarchical Classification Seminar on Predicting Structured Data Jukka Kohonen 17.4.2008 Overview Intro: The task of hierarchical gene annotation Approach I: SVM/Bayes hybrid Barutcuoglu et al:
More informationGene expression microarray technology measures the expression levels of thousands of genes. Research Article
JOURNAL OF COMPUTATIONAL BIOLOGY Volume 7, Number 2, 2 # Mary Ann Liebert, Inc. Pp. 8 DOI:.89/cmb.29.52 Research Article Reducing the Computational Complexity of Information Theoretic Approaches for Reconstructing
More informationGOSAP: Gene Ontology Based Semantic Alignment of Biological Pathways
GOSAP: Gene Ontology Based Semantic Alignment of Biological Pathways Jonas Gamalielsson and Björn Olsson Systems Biology Group, Skövde University, Box 407, Skövde, 54128, Sweden, [jonas.gamalielsson][bjorn.olsson]@his.se,
More informationA STUDY ON IDENTIFICATION OF STATIC AND DYNAMIC PROTEIN COMPLEX AND FUNCTIONAL MODULE IN PPI NETWORK
ISSN: 0976-3104 ARTICLE A STUDY ON IDENTIFICATION OF STATIC AND DYNAMIC PROTEIN COMPLEX AND FUNCTIONAL MODULE IN PPI NETWORK Subham Datta 1*, Dinesh Karunanidy 1, J. Amudhavel 2, Thamizh Selvam Datchinamurthy
More informationAn overlapping module identification method in protein-protein interaction networks
PROCEEDINGS Open Access An overlapping module identification method in protein-protein interaction networks Xuesong Wang *, Lijing Li, Yuhu Cheng From The 2011 International Conference on Intelligent Computing
More informationModel Accuracy Measures
Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses
More informationIntegrative Protein Function Transfer using Factor Graphs and Heterogeneous Data Sources
Integrative Protein Function Transfer using Factor Graphs and Heterogeneous Data Sources Antonina Mitrofanova New York University antonina@cs.nyu.edu Vladimir Pavlovic Rutgers University vladimir@cs.rutgers.edu
More informationIn order to compare the proteins of the phylogenomic matrix, we needed a similarity
Similarity Matrix Generation In order to compare the proteins of the phylogenomic matrix, we needed a similarity measure. Hamming distances between phylogenetic profiles require the use of thresholds for
More informationCPredictor3.0: detecting protein complexes from PPI networks with expression data and functional annotations
Xuet al. BMC Systems Biology 2017, 11(Suppl 7):135 DOI 10.1186/s12918-017-0504-3 RESEARCH Open Access CPredictor3.0: detecting protein complexes from PPI networks with expression data and functional annotations
More informationConstructing More Reliable Protein-Protein Interaction Maps
Constructing More Reliable Protein-Protein Interaction Maps Limsoon Wong National University of Singapore email: wongls@comp.nus.edu.sg 1 Introduction Progress in high-throughput experimental techniques
More informationIterative Laplacian Score for Feature Selection
Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,
More informationIntegrative Protein Function Transfer using Factor Graphs and Heterogeneous Data Sources
Integrative Protein Function Transfer using Factor Graphs and Heterogeneous Data Sources Antonina Mitrofanova New York University antonina@cs.nyu.edu Vladimir Pavlovic Rutgers University vladimir@cs.rutgers.edu
More informationGRAPH-THEORETICAL COMPARISON REVEALS STRUCTURAL DIVERGENCE OF HUMAN PROTEIN INTERACTION NETWORKS
141 GRAPH-THEORETICAL COMPARISON REVEALS STRUCTURAL DIVERGENCE OF HUMAN PROTEIN INTERACTION NETWORKS MATTHIAS E. FUTSCHIK 1 ANNA TSCHAUT 2 m.futschik@staff.hu-berlin.de tschaut@zedat.fu-berlin.de GAUTAM
More informationFunctional Characterization and Topological Modularity of Molecular Interaction Networks
Functional Characterization and Topological Modularity of Molecular Interaction Networks Jayesh Pandey 1 Mehmet Koyutürk 2 Ananth Grama 1 1 Department of Computer Science Purdue University 2 Department
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationLearning in Bayesian Networks
Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks
More informationNetwork by Weighted Graph Mining
2012 4th International Conference on Bioinformatics and Biomedical Technology IPCBEE vol.29 (2012) (2012) IACSIT Press, Singapore + Prediction of Protein Function from Protein-Protein Interaction Network
More informationBioinformatics 2. Yeast two hybrid. Proteomics. Proteomics
GENOME Bioinformatics 2 Proteomics protein-gene PROTEOME protein-protein METABOLISM Slide from http://www.nd.edu/~networks/ Citrate Cycle Bio-chemical reactions What is it? Proteomics Reveal protein Protein
More informationLecture 5: November 19, Minimizing the maximum intracluster distance
Analysis of DNA Chips and Gene Networks Spring Semester, 2009 Lecture 5: November 19, 2009 Lecturer: Ron Shamir Scribe: Renana Meller 5.1 Minimizing the maximum intracluster distance 5.1.1 Introduction
More informationIntegration of functional genomics data
Integration of functional genomics data Laboratoire Bordelais de Recherche en Informatique (UMR) Centre de Bioinformatique de Bordeaux (Plateforme) Rennes Oct. 2006 1 Observations and motivations Genomics
More informationInferring Transcriptional Regulatory Networks from Gene Expression Data II
Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday
More informationFunction Prediction Using Neighborhood Patterns
Function Prediction Using Neighborhood Patterns Petko Bogdanov Department of Computer Science, University of California, Santa Barbara, CA 93106 petko@cs.ucsb.edu Ambuj Singh Department of Computer Science,
More informationIdentifying Signaling Pathways
These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Anthony Gitter, Mark Craven, Colin Dewey Identifying Signaling Pathways BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2018
More informationData Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification
More informationAnalysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing
Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver
More informationA Geometric Interpretation of Gene Co-Expression Network Analysis. Steve Horvath, Jun Dong
A Geometric Interpretation of Gene Co-Expression Network Analysis Steve Horvath, Jun Dong Outline Network and network concepts Approximately factorizable networks Gene Co-expression Network Eigengene Factorizability,
More informationAN ENHANCED INITIALIZATION METHOD FOR NON-NEGATIVE MATRIX FACTORIZATION. Liyun Gong 1, Asoke K. Nandi 2,3 L69 3BX, UK; 3PH, UK;
213 IEEE INERNAIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEP. 22 25, 213, SOUHAMPON, UK AN ENHANCED INIIALIZAION MEHOD FOR NON-NEGAIVE MARIX FACORIZAION Liyun Gong 1, Asoke K. Nandi 2,3
More informationTypes of biological networks. I. Intra-cellurar networks
Types of biological networks I. Intra-cellurar networks 1 Some intra-cellular networks: 1. Metabolic networks 2. Transcriptional regulation networks 3. Cell signalling networks 4. Protein-protein interaction
More informationStructure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction Networks
22 International Conference on Environment Science and Engieering IPCEE vol.3 2(22) (22)ICSIT Press, Singapoore Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction
More informationBiological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor
Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms
More informationJure Leskovec Joint work with Jaewon Yang, Julian McAuley
Jure Leskovec (@jure) Joint work with Jaewon Yang, Julian McAuley Given a network, find communities! Sets of nodes with common function, role or property 2 3 Q: How and why do communities form? A: Strength
More informationBiological Systems: Open Access
Biological Systems: Open Access Biological Systems: Open Access Liu and Zheng, 2016, 5:1 http://dx.doi.org/10.4172/2329-6577.1000153 ISSN: 2329-6577 Research Article ariant Maps to Identify Coding and
More informationSupplemental Material
Supplemental Material Article title: Construction and comparison of gene co expression networks shows complex plant immune responses Author names and affiliation: Luis Guillermo Leal 1*, Email: lgleala@unal.edu.co
More informationA New Method to Build Gene Regulation Network Based on Fuzzy Hierarchical Clustering Methods
International Academic Institute for Science and Technology International Academic Journal of Science and Engineering Vol. 3, No. 6, 2016, pp. 169-176. ISSN 2454-3896 International Academic Journal of
More informationContext dependent visualization of protein function
Article III Context dependent visualization of protein function In: Juho Rousu, Samuel Kaski and Esko Ukkonen (eds.). Probabilistic Modeling and Machine Learning in Structural and Systems Biology. 2006,
More informationV 5 Robustness and Modularity
Bioinformatics 3 V 5 Robustness and Modularity Mon, Oct 29, 2012 Network Robustness Network = set of connections Failure events: loss of edges loss of nodes (together with their edges) loss of connectivity
More informationNonparametric Bayesian Matrix Factorization for Assortative Networks
Nonparametric Bayesian Matrix Factorization for Assortative Networks Mingyuan Zhou IROM Department, McCombs School of Business Department of Statistics and Data Sciences The University of Texas at Austin
More informationNetworks & pathways. Hedi Peterson MTAT Bioinformatics
Networks & pathways Hedi Peterson (peterson@quretec.com) MTAT.03.239 Bioinformatics 03.11.2010 Networks are graphs Nodes Edges Edges Directed, undirected, weighted Nodes Genes Proteins Metabolites Enzymes
More informationEffects of Gap Open and Gap Extension Penalties
Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See
More informationIEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. XX, NO. XX, XX 20XX 1
IEEE/ACM TRANSACTIONS ON COMUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. XX, NO. XX, XX 20XX 1 rotein Function rediction using Multi-label Ensemble Classification Guoxian Yu, Huzefa Rangwala, Carlotta Domeniconi,
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationNon-Negative Factorization for Clustering of Microarray Data
INT J COMPUT COMMUN, ISSN 1841-9836 9(1):16-23, February, 2014. Non-Negative Factorization for Clustering of Microarray Data L. Morgos Lucian Morgos Dept. of Electronics and Telecommunications Faculty
More informationFUNPRED-1: PROTEIN FUNCTION PREDICTION FROM A PROTEIN INTERACTION NETWORK USING NEIGHBORHOOD ANALYSIS
CELLULAR & MOLECULAR BIOLOGY LETTERS http://www.cmbl.org.pl Received: 25 July 2014 Volume 19 (2014) pp 675-691 Final form accepted: 20 November 2014 DOI: 10.2478/s11658-014-0221-5 Published online: 25
More informationProteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?
Proteomics What is it? Reveal protein interactions Protein profiling in a sample Yeast two hybrid screening High throughput 2D PAGE Automatic analysis of 2D Page Yeast two hybrid Use two mating strains
More informationGraph Clustering Algorithms
PhD Course on Graph Mining Algorithms, Università di Pisa February, 2018 Clustering: Intuition to Formalization Task Partition a graph into natural groups so that the nodes in the same cluster are more
More informationCluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002
Cluster Analysis of Gene Expression Microarray Data BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002 1 Data representations Data are relative measurements log 2 ( red
More information