Complex detection from PPI data using ensemble method


ORIGINAL ARTICLE

Sajid Nagi (1), Dhruba K. Bhattacharyya (2), Jugal K. Kalita (3)

Received: 18 August 2016 / Revised: 2 November 2016 / Accepted: 17 December 2016
Springer-Verlag Wien 2016

Correspondence: Sajid Nagi, sajidnagi@gmail.com
(1) Department of Computer Science, St. Edmund's College, Shillong, Meghalaya, India
(2) Department of Computer Science and Engineering, Tezpur University, Napaam, Assam, India
(3) Department of Computer Science, University of Colorado, Colorado Springs, CO 80918, USA

Abstract Many algorithms have been proposed recently to detect protein complexes in protein-protein interaction (PPI) networks. Most proteins form complexes to accomplish biological functions such as transcription of DNA, translation of mRNA and cell growth. Since proteins perform their tasks by interacting with each other, determining these protein-protein interactions is an important task. Traditional clustering approaches for protein complex identification cannot deal with noisy and incomplete PPI data and depend on information from a single source. Since the noise in the interaction datasets hampers the detection of accurate protein complexes, we propose an ensemble approach for protein complex detection that incorporates information from Gene Ontology at the time of complex detection. The PPI network is taken as input by several baseline complex detection algorithms to generate protein complexes. These complexes are then combined by the proposed ensemble using a consensus building module in order to identify meaningful complexes. The complexes predicted by the ensemble are evaluated by comparing them to a set of gold standard protein complexes, and their biological relevance is established using a co-localization score.

Keywords Protein complex prediction, Protein interaction network, Empirical study of proteins, Ensemble clustering

1 Introduction

Proteins are complex organic compounds made up of chains of amino acids and are used by the body for growth, maintenance and repair of body tissues. Protein complexes perform many crucial tasks within living beings, such as transcription of DNA, translation of mRNA, cell growth and the transport of molecules from one place to another. Moreover, in-depth study of protein complexes helps us understand how they are built and how they work, allowing better comprehension of biological systems. To understand the dynamics of biological processes within an organism, it is necessary to determine the complete map of physical interactions among its proteins (the interactome). The result is the PPI network, where nodes represent proteins and edges represent interactions between pairs of proteins.

It is well established that proteins play a central role in many diseases (Hung et al. 1997; Payne et al. 1998; Tanaka et al. 2003; Hung and Link 2011). Erroneous production of a protein complex can cause disease by affecting the biological processes in which the complex is involved. Therefore, correct classification of a newly discovered protein complex is important, as it may guide the discovery of appropriate drugs. New protein complexes can be detected based on the observation that densely connected regions in PPI networks often correspond to actual protein complexes. Graph clustering techniques have been useful here since PPI networks are large-scale graph structures consisting of tens of thousands of pairwise protein-protein interactions.
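The notion of a "densely connected region" can be made concrete with a small sketch. The code below is only an illustration, assuming the PPI network is given as an undirected edge list; the protein names and the `density` helper are hypothetical and not taken from the paper.

```python
def density(nodes, edges):
    """Fraction of possible protein pairs inside `nodes` that interact."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # Keep each undirected interaction once, ignoring self-loops.
    inside = {frozenset(e) for e in edges
              if set(e) <= nodes and len(set(e)) == 2}
    return 2 * len(inside) / (n * (n - 1))


# Toy PPI network (hypothetical proteins): a triangle plus a short chain.
ppi_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"),
             ("P3", "P4"), ("P4", "P5")]

dense_region = density({"P1", "P2", "P3"}, ppi_edges)
sparse_region = density({"P3", "P4", "P5"}, ppi_edges)
print(dense_region, sparse_region)
```

Under this definition the triangle is a fully connected candidate complex, while the chain is not; density-based detectors would favor the former.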

1.1 Background

A wide range of detection methods have been proposed in recent years for the identification and categorization of protein complexes (Yong et al. 2012; Wu et al. 2013; Zaki 2014; Srihari et al. 2015; Yong et al. 2014; Ou-Yang et al. 2015; Sharma et al. 2015; Li et al. 2016). Every newly published method compares its performance with a few selected earlier methods. Because of differences in the PPI networks produced, the benchmark datasets used for evaluation, the evaluation criteria, the threshold settings and the parameters used, the results of such comparisons and surveys on complex detection vary and are difficult to reconcile (Brohee and van Helden 2006; Vlasblom and Wodak 2009; Li et al. 2010; Srihari and Leong 2013).

Evaluation of predicted complexes is generally performed in one of three ways: (a) by comparing the predicted complexes against one or more gold standard sets of complexes using performance measures such as precision and recall (Liu et al. 2009); (b) if a gold standard dataset is not available, by using measures such as cluster cohesiveness and separability (Bader and Hogue 2003; King et al. 2004); and (c) by computing a functional or co-localization score after incorporating the appropriate annotation data (Liu et al. 2009; King et al. 2004). This last form of evaluation is particularly useful for establishing the biological relevance of the predictions.

1.2 Observations and motivation

Based on our survey of the literature on protein complex detection, we make the following observations.

(i) Every clustering algorithm has its own nuances, being highly dependent on specific initial conditions, seed selection or parameter settings.
(ii) The noise in the datasets hampers the detection of accurate protein complexes.
(iii) Proteins can be assigned importance based on the number of interactions they participate in. The protein complexes can then be ranked to further study the roles they play in biological processes.
(iv) Most methods report low recall because the search for complexes takes place in the dense regions of the network, causing them to miss low-density complexes in sparse parts of the network (Srihari and Leong 2012).
(v) A protein complex is usually validated after it is fully identified or detected. It may be desirable for a method to perform validation during complex identification or detection so that the goodness of a complex, based on its cohesiveness, can be explored better.

To tackle these limitations, an ensemble method, PCDEN, is proposed, since ensembles have been able to improve the robustness, stability and prediction accuracy of clusterings by combining the output of several algorithms (Yang et al. 2010; Nagi and Bhattacharyya 2013). Since the noise in the interaction datasets hampers the detection of accurate protein complexes, one way to compensate for the lack of interaction coverage is to combine multiple datasets in order to identify meaningful complexes. An ensemble of clustering algorithms balances individual characteristics by combining the output of several algorithms and may be able to identify larger, denser clusters with improved biological significance (Nagi and Bhattacharyya 2014). Additionally, ensemble-based methods may be preferable for protein complex detection considering that traditional clustering methods search for complexes in dense regions and miss complexes of low density. The ensemble method proposed by Dai et al. (2016) employs a cluster-wise voting mechanism to extract the consensus information embedded in the results of different methods, whereas Wu et al. (2016) integrate the clusters obtained from the base clustering mechanisms with co-complex affinity scores.
An alternative may be to add topological information to biological information (Zaki 2014) so that complexes of low density, that is, small complexes of two or three proteins, are not discarded. Moreover, cluster ensembles may also be able to address the problem of local optima and obtain a more globally optimal solution, i.e., a stable clustering solution. To evaluate the quality and accuracy of predicted complexes, we use Frac (Nepusz et al. 2012), MMR (Nepusz et al. 2012), Acc (Brohee and van Helden 2006), Precision and Recall (Liu et al. 2009) and F-measure (Van Rijsbergen 1979). To evaluate the biological relevance, we use co-localization (Huh et al. 2003) and GO semantic similarity (Schlicker et al. 2006).

1.3 Organization of the paper

The rest of the paper is organized as follows: Sect. 2 briefly describes the algorithms and ensemble approaches used for comparison. Section 3 gives an account of the evaluation metrics used, followed by the description of the datasets in Sect. 4. Section 5 describes the experimental design methodology of the proposed PCDEN ensemble, followed by the results in Sect. 6. Finally, the discussion and concluding remarks are included in Sect. 7.

2 Related work

We compare eight algorithms, namely MCL (Van Dongen 2000; Pereira-Leal et al. 2004), MCODE (Bader and Hogue 2003), RNSC (King et al. 2004), CFinder (Adamcsek et al. 2006), RRW (Macropol et al. 2009), AP (Frey and Dueck 2007), CMC (Liu et al. 2009) and ClusterONE (Nepusz et al. 2012), with our proposed ensemble. In addition, we compare our proposed ensemble method with ensemble approaches proposed in recent years, namely Ensemble Non-negative Matrix Factorization (NMF) (Greene et al. 2008), Bipartite Graph Ensemble (BGENS) (Zhang et al. 2012), Full Graph Ensemble (FGENS) (Zhang et al. 2012) and Ensemble Clustering Bayesian Nonnegative Matrix Factorization (EC-BNMF) (Ou-Yang et al. 2013). A brief overview of the algorithms and ensembles is presented next.

2.1 Overview of the clustering approaches

Markov Clustering (MCL) (Van Dongen 2000; Pereira-Leal et al. 2004) detects protein complexes by simulating random walks in PPI networks, boosting the probabilities of intra-cluster walks and demoting inter-cluster walks (i.e., more links within a cluster and fewer links between clusters).

Molecular Complex Detection (MCODE) (Bader and Hogue 2003) selects seed nodes with high weights as initial clusters and enhances these clusters by outward traversal from the seeds. The number of predicted complexes is generally small and many predicted complexes are too large.

The Restricted Neighborhoods Search Clustering (RNSC) (King et al. 2004) algorithm begins with an initial random clustering and then searches for a better clustering with lower cost by moving vertices around. It predicts relatively few complexes and its results depend on the quality of the initial clustering.

Clique Finder (CFinder) (Adamcsek et al. 2006) takes a parameter k as input, detects all the k-cliques of the input network and connects them if they are adjacent, i.e., if they share (k - 1) common nodes. A k-clique percolation cluster is then constructed by linking all the adjacent k-cliques into a bigger subgraph. It can detect overlapping clusters.

The Repeated Random Walk (RRW) (Macropol et al. 2009) algorithm iteratively expands a cluster of nodes, at most k times, so as to include proteins with high proximity according to an affinity function. This is followed by a ranking step that removes clusters with an overlap score above a given threshold. It is able to handle weighted and unweighted graphs.

Affinity Propagation (AP) (Frey and Dueck 2007) is a k-centers clustering technique with centers referred to as exemplars, randomly selected and refined iteratively to decrease the sum of squared errors between each input object and its nearest center. AP cannot detect overlapping clusters.

Clustering based on Maximum Cliques (CMC) (Liu et al. 2009) uses an iterative scoring algorithm followed by a maximal clique finding process to assess whether two proteins are in the same protein complex. Cliques are merged when the network between them is denser than the merge threshold. It is not able to handle weighted networks.

The Clustering with Overlapping Neighborhood Expansion (ClusterONE) (Nepusz et al. 2012) algorithm starts by selecting the protein with the highest degree as the first seed, grows a cohesive group from it using a greedy approach and then moves on to the next seed. The cohesive groups are then merged based on their overlap score.

2.2 Overview of the ensemble approaches

The Ensemble Non-negative Matrix Factorization (NMF) framework for detecting protein complexes proposed by Greene et al. (2008) first generates a collection of non-negative matrix factorizations with different numbers of dimensions. Then, a hierarchical meta-clustering algorithm is employed to aggregate these factorizations and produce a disjoint hierarchy of meta-clusters. Finally, the results are transformed into a soft hierarchical clustering of the original dataset.

Bipartite Graph Ensemble (BGENS) (Zhang et al. 2012) is based on a bipartite graph that is constructed from the affiliation of proteins and clusters. The original PPI data are initially clustered into a set of T partitions using the base clustering algorithms. Each clustering partition consists of a set of clusters from a certain clustering method. BGENS constructs a bipartite graph between the protein nodes in the dataset and the cluster nodes obtained from the base clustering algorithms.

The cluster ensemble Full Graph Ensemble (FGENS) (Zhang et al. 2012) is produced from the bipartite graph BGENS, by omitting interactions among the proteins and also the relationships among the clusters. The authors consider the original protein interactions as the basis of cluster-belonging preference for those proteins that appear in different clusters. The proteins are partitioned into different final clusters by taking into consideration their original relationships with other proteins. The method proposed in BGENS is followed to partition the cluster nodes generated by the base clustering algorithms, and a graph of proteins and clusters is constructed. The graph is then partitioned using spectral clustering, which partitions the graph into K parts with the objective of minimizing the cut.

The ensemble approach EC-BNMF (Ensemble Clustering Bayesian Nonnegative Matrix Factorization) (Ou-Yang et al. 2013) consists of two phases. In the generation phase, useful information in the form of features is extracted from several base clustering results. Each base clustering result is regarded as a feature of the original PPI network, where two proteins are

assumed to be connected if they occur in the same cluster at least once. In the complex detection phase, the authors use a Bayesian NMF-based ensemble clustering method which uses group information provided by the edges between interacting proteins to detect protein complexes.

3 Evaluation of protein complexes

In this section, we outline the measures used to compare a predicted protein complex with a reference complex and also to evaluate the biological relevance of the predictions.

3.1 Performance measures

The following performance measures assess the quality of a predicted protein complex by comparing it with the protein complexes present in a specific gold standard. However, it is not possible to establish the quality of a predicted complex if it does not match, even partially, at least one gold standard complex.

Frac

The fraction of pairs (Frac) (Nepusz et al. 2012) is based on matching a predicted complex $p$ and a reference (gold standard) complex $b$ through an overlap score $NA(p, b)$, given by

$$NA(p, b) = \frac{|V_p \cap V_b|^2}{|V_p| \times |V_b|}. \qquad (1)$$

The complexes are considered to match if $NA(p, b)$ is greater than a threshold value.

Geometric accuracy

Geometric accuracy (Acc) (Brohee and van Helden 2006) is the geometric mean of the clustering sensitivity (Sn) and the positive predictive value (PPV) of a clustering (Brohee and van Helden 2006). Both are based on the confusion matrix $T = [t_{ij}]$ of the complexes. With $n$ reference and $m$ predicted complexes, the element $t_{ij}$ of the confusion matrix is the number of proteins that are found both in reference complex $i$ and in predicted complex $j$. If $n_i$ is defined as the number of proteins in reference complex $i$, then

$$Sn = \frac{\sum_{i=1}^{n} \max_{j=1}^{m} t_{ij}}{\sum_{i=1}^{n} n_i} \qquad (2)$$

and

$$PPV = \frac{\sum_{j=1}^{m} \max_{i=1}^{n} t_{ij}}{\sum_{j=1}^{m} \sum_{i=1}^{n} t_{ij}}. \qquad (3)$$

Clustering sensitivity (Sn) can give inflated values if every protein is placed in the same cluster, whereas the positive predictive value (PPV) can be maximized by putting every protein in its own cluster. For this reason, the two measures are balanced by computing their geometric accuracy (Acc), which is defined as

$$Acc = \sqrt{Sn \times PPV}. \qquad (4)$$

MMR

The maximum matching ratio (MMR) (Nepusz et al. 2012) expresses how well the predicted complexes compare with a gold standard, where the overlap score between two protein complexes is computed by Eq. (1). MMR is obtained by dividing the total weight of the maximum matching complexes by the number of reference complexes. Given $n$ standard complexes and $m$ predicted complexes, let $j$ be a member of the predicted complexes. MMR is then defined as

$$MMR = \frac{\sum_{i=1}^{n} \max_{j}^{m} NA(p, b)}{n}. \qquad (5)$$

A predicted complex that does not match any reference complex may lower the geometric accuracy score. However, such a complex is not necessarily an undesired result and may be an undiscovered complex. Therefore, trying to optimize the geometric accuracy may discourage detection of such new complexes in a PPI network. This is one motivation behind the use of MMR as a metric.

F-measure

The F-measure (Van Rijsbergen 1979) is the harmonic mean of two measures called Precision and Recall, defined as follows.

Precision This measure is defined as the fraction of a cluster that consists of objects of a specific class. The precision of a cluster $V_j$ with respect to class $U_i$ is

$$Precision(U_i, V_j) = \frac{n_{i,j}}{n_j}. \qquad (6)$$

Recall This measure is defined as the proportion of objects of a class in a cluster. The recall of a cluster $V_j$ with respect to class $U_i$ is

$$Recall(U_i, V_j) = \frac{n_{i,j}}{n_i}, \qquad (7)$$

where $n_{i,j}$ is the number of data items belonging to class $U_i$ in cluster $V_j$, $n_j$ is the number of data items belonging to cluster $V_j$ and $n_i$ is the number of data items belonging to class $U_i$.

The F-measure captures the extent to which a cluster contains only objects of a particular class and all objects of that class (Van Rijsbergen 1979). Thus, the F-measure of a cluster $V_j$ with respect to class $U_i$ is

$$F\text{-}measure = \frac{2 \times Precision(U_i, V_j) \times Recall(U_i, V_j)}{Precision(U_i, V_j) + Recall(U_i, V_j)}. \qquad (8)$$

The F-measure is useful in that it is an objective metric of how well a clustering algorithm is able to recover the original clusters. It takes values in the range [0, 1] and high values indicate a good quality of clustering.

3.2 Biological significance measures

One of the main issues that arise when comparing predicted complexes with a gold standard is that the gold standard protein complex sets are usually incomplete (Jansen and Gerstein 2004). Hence, a predicted complex that does not match any of the reference complexes may belong to a valid but previously uncataloged complex. Therefore, relying on the comparison measures outlined in Sect. 3.1, which are based on a predefined gold standard, may not give a true picture of the predicted complexes. Rather, the scores should be supported by measures that assess the biological relevance of predicted complexes, based on the similarity of their functional annotation or the co-localization of the constituent proteins.

Co-localization similarity

Since proteins in the same complex have a tendency to share common functions, they tend to be found at the same location in a cell. The maximum fraction of proteins in a complex found at the same location is known as the co-localization score for that complex.
The average co-localization score is calculated as the weighted average over all complexes and is defined as

$$C = \frac{\sum_{j} \max_{i} l_{i,j}}{\sum_{j} |A_j|}. \qquad (9)$$

Here, $l_{i,j}$ is the number of proteins of complex $A_j$ assigned to localization group $i$ and $|A_j|$ is the number of proteins in complex $A_j$ with localization assignments. A high co-localization score indicates a high functional similarity between proteins in the same complex.

GO semantic similarity

Comparison of the GO terms associated with the proteins in a protein complex is another indicator of biological relevance. This can be performed in terms of functional similarity (Ashburner et al. 2000). The functional similarity of two proteins is measured in terms of the semantic similarity of the GO terms with which these proteins have been annotated. The GO semantic similarity score for a set of complexes is defined as the geometric mean of all complex scores determined separately for the biological process and molecular function taxonomies. A higher GO semantic similarity score is associated with a better quality set of protein complexes. Out of the several variations of semantic similarity that have been described (Resnick 1999; Lin 1998; Lord et al. 2003; Schlicker et al. 2006), we use the semantic similarity proposed by Schlicker et al. (2006). It is defined as

$$sim_{Schlicker}(t_1, t_2) = \frac{2 \ln p_{ms}(t_1, t_2)}{\ln p(t_1) + \ln p(t_2)} \times \left(1 - p_{ms}(t_1, t_2)\right). \qquad (10)$$

Here, $t$, $t_1$ and $t_2$ are GO terms, $p(t)$ represents the probability of term $t$ and $p_{ms}(t_1, t_2)$ is the probability of the minimum subsumer of $t_1$ and $t_2$.

4 Dataset description

We compare the protein complexes predicted by our clustering algorithm with sets of known complexes called gold standards.

4.1 PPI datasets

Each of the algorithms and the proposed ensemble has been tested on three publicly available high-throughput benchmark yeast PPI datasets, namely Gavin (Gavin et al. 2006), Krogan Core (Krogan et al. 2006), referred to in this paper as Krogan, and Krogan Extended (Krogan et al. 2006), referred to in this paper as Krogan+. Some details of the datasets are shown in Table 1.

4.2 Gold standard datasets

We use the Saccharomyces Genome Database (SGD) (Hong et al. 2008) and the Munich Information Centre for Protein Sequences (MIPS) (Mewes et al. 2004) Saccharomyces cerevisiae datasets, as they have been applied in several analyses as gold standard references because of their quality and comprehensiveness. The general properties of these datasets are given in Table 2.
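The matching measures of Sect. 3.1 and the co-localization score of Sect. 3.2 can be sketched in a few lines. This is a minimal illustration rather than the evaluation code used in the paper: complexes are assumed to be Python sets of protein names, the threshold of 0.25 and all toy data are hypothetical, each protein is assumed to carry at most one location, and `mmr` follows Eq. (5) as written (best overlap per reference complex) rather than solving a full bipartite matching. The function names are chosen here for illustration.

```python
from collections import Counter
from math import sqrt


def overlap_score(p, b):
    """Eq. (1): NA(p, b) = |p & b|^2 / (|p| * |b|) for two sets of proteins."""
    if not p or not b:
        return 0.0
    common = len(p & b)
    return common * common / (len(p) * len(b))


def frac(predicted, reference, threshold):
    """Fraction of reference complexes matched by at least one prediction."""
    matched = sum(
        1 for b in reference
        if any(overlap_score(p, b) > threshold for p in predicted)
    )
    return matched / len(reference)


def geometric_accuracy(predicted, reference):
    """Eqs. (2)-(4): returns (Sn, PPV, Acc) from the confusion matrix t."""
    # t[i][j] = proteins shared by reference complex i and predicted complex j
    t = [[len(b & p) for p in predicted] for b in reference]
    sn = sum(max(row) for row in t) / sum(len(b) for b in reference)
    total = sum(sum(row) for row in t)
    ppv = sum(max(col) for col in zip(*t)) / total if total else 0.0
    return sn, ppv, sqrt(sn * ppv)


def mmr(predicted, reference):
    """Eq. (5) as written: best overlap per reference complex, averaged."""
    best = (max(overlap_score(p, b) for p in predicted) for b in reference)
    return sum(best) / len(reference)


def colocalization(complexes, location):
    """Eq. (9): weighted average of the largest same-location group."""
    largest_groups = annotated = 0
    for members in complexes:
        counts = Counter(location[m] for m in members if m in location)
        if counts:
            largest_groups += max(counts.values())
            annotated += sum(counts.values())
    return largest_groups / annotated if annotated else 0.0


# Toy data (hypothetical protein names and locations).
reference = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"E", "F", "G"}, {"X", "Y"}]
location = {"A": "nucleus", "B": "nucleus", "C": "cytoplasm",
            "E": "nucleus", "F": "nucleus"}

sn, ppv, acc = geometric_accuracy(predicted, reference)
print(frac(predicted, reference, threshold=0.25), round(acc, 3),
      round(mmr(predicted, reference), 3),
      colocalization(predicted, location))
```

The sketch assumes non-empty lists of predicted and reference complexes. Note that the unmatched prediction `{"X", "Y"}` does not affect `mmr`, which illustrates why MMR does not penalize potentially novel complexes.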

Table 1 Benchmark datasets
Details                  Gavin                 Krogan                 Krogan+
References               Gavin et al. (2006)   Krogan et al. (2006)   Krogan et al. (2006)
Number of proteins
Number of interactions   ,317
Weighted                 Yes                   Yes                    Yes

Table 2 General properties of the gold standard datasets
General properties       SGD                   MIPS
References               Hong et al. (2008)    Mewes et al. (2004)
No. of proteins
No. of complexes
Overlapping proteins

5 Protein complex detection ensemble (PCDEN)

The PCDEN approach consists of three phases. The first phase generates n individual complex sets from the PPI network data, taken as input by n base clustering algorithms. The set of base clustering algorithms is identified after exhaustive experimentation with a large number of benchmark datasets. The second phase employs a consensus building process to generate protein complexes by identifying a common set of protein nodes from the corresponding complex sets (obtained from the base clustering algorithms) and also using GO similarity. The third phase evaluates the predicted protein complexes and also establishes their biological relevance. A conceptual framework for the proposed protein complex detection ensemble (PCDEN) is shown in Fig. 1.

Phase I: complex set generation

The PPI network data are taken as input by n consistently well-performing base clustering algorithms $BC_1, BC_2, \ldots, BC_n$, which generate n individual complex sets $C_1, C_2, \ldots, C_n$. In this step, care must be taken to select base clustering algorithms that are diverse in nature. As a result, the errors made in clustering by one algorithm are likely to be averaged out by the correct clustering of another, so that the overall clustering accuracy is improved and a final unbiased decision can be made.

Phase II: protein complex generation through consensus

Next, we take a complex from the complex set $C_1$ generated by the base clustering algorithm $BC_1$ and identify the node within this complex with the highest degree. A node with a higher degree has higher connectivity or association with other nodes. Such nodes are chosen because they are more likely to belong to a meaningful complex. The next complex set $C_2$, generated by $BC_2$, is taken and we identify the corresponding complex based on the highest degree node as identified in complex set $C_1$. Similarly, we take the other complex sets $C_3, C_4, \ldots, C_n$ generated by the base clustering algorithms $BC_3, BC_4, \ldots, BC_n$ and identify the corresponding complexes based on the highest degree node identified earlier. After we have identified all the complexes containing the highest degree node from the n complex sets, a common set of nodes (proteins), say $P_i$, is identified from the corresponding complexes, with the condition that each member of $P_i$ is present in at least two complexes. Two cases are considered now.

Case 1 For the common set of nodes $P_i$, we complete the edges among the nodes of $P_i$ subject to the following condition: if $n_i$ and $n_j$ are two nodes present in $P_i$, then they are connected by an edge if and only if $(n_i, n_j)$ are already connected by an edge in at least two complexes. This is repeated for every node and gives rise to a common complex, $P_i^{Com}$, consisting of nodes that are shared among the corresponding complexes and interconnected by edges. Hence, a complex is generated by consensus among the corresponding complexes.

Case 2 For the common set of nodes $P_i$, we complete the edges to obtain a complex $P_i^{GO}$ based on GO similarity. Since a protein is usually annotated with multiple GO terms, the semantic similarity measured between two GO terms can be directly converted to a measurement of the similarity between two proteins. Thus, Gene Ontology information is integrated with PPI network data based on a semantic similarity measure (Schlicker et al. 2006).
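The node selection and the Case 1 consensus step described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes each base algorithm's output is a list of protein sets, that degree is measured in the full PPI network, and that an edge "in a complex" means a PPI interaction whose two endpoints both belong to that complex. The names `highest_degree_node`, `consensus_complex` and `min_support` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def highest_degree_node(members, ppi_edges):
    """Member with the most interactions in the PPI network (ties: by name)."""
    degree = Counter()
    for u, v in ppi_edges:
        degree[u] += 1
        degree[v] += 1
    return max(sorted(members), key=lambda node: degree[node])


def consensus_complex(seed, complex_sets, ppi_edges, min_support=2):
    """Build the common node set P_i and the Case 1 edges around `seed`."""
    interactions = {frozenset(e) for e in ppi_edges}
    # Corresponding complexes: every complex, from any base method,
    # that contains the seed node.
    corresponding = [c for cs in complex_sets for c in cs if seed in c]
    # P_i: proteins present in at least `min_support` corresponding complexes.
    support = Counter(node for c in corresponding for node in c)
    nodes = {node for node, count in support.items() if count >= min_support}
    # Case 1: keep an edge if the pair interacts and both endpoints
    # appear together in at least `min_support` corresponding complexes.
    edges = set()
    for u, v in combinations(sorted(nodes), 2):
        pair = frozenset((u, v))
        together = sum(1 for c in corresponding if u in c and v in c)
        if pair in interactions and together >= min_support:
            edges.add(pair)
    return nodes, edges


# Toy example (hypothetical proteins and base clustering outputs).
ppi_edges = [("A", "B"), ("A", "C"), ("B", "C"),
             ("A", "D"), ("C", "D"), ("D", "E")]
complex_sets = [
    [{"A", "B", "C"}, {"D", "E"}],   # output of BC_1
    [{"A", "B", "C", "D"}],          # output of BC_2
    [{"A", "C", "D"}, {"B", "E"}],   # output of BC_3
]

seed = highest_degree_node(complex_sets[0][0], ppi_edges)
nodes, edges = consensus_complex(seed, complex_sets, ppi_edges)
print(seed, sorted(nodes), sorted(sorted(e) for e in edges))
```

In this toy run the pair (B, D) is left unconnected because the two proteins do not interact in the PPI network, even though both belong to the common node set.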
Next, we superimpose the two complexes obtained from the common set of nodes $P_i$, i.e., $P_i^{Com}$ and $P_i^{GO}$, to obtain the final complex $P_i^{Fin}$. The superimposition follows the logic given in Table 3. In other words, if an edge is present for a pair of nodes $n_i$ and $n_j$ in both $P_i^{Com}$ and $P_i^{GO}$, then the edge is retained in the final predicted complex $P_i^{Fin}$. If an edge is present for a pair of nodes $n_i$ and $n_j$ in $P_i^{Com}$ and absent in $P_i^{GO}$, or vice versa, then that edge is also retained even though it may be a false alarm. The reason for retaining this edge is that in the case of $P_i^{Com}$, the edge is a result of consensus (where an unsupervised approach has been applied) and in the case of $P_i^{GO}$, the edge is obtained through GO similarity (a supervised approach). So in both cases, the existence of the edge is justified. The final

predicted complex $P_i^{Fin}$ is now stored in a file for the purpose of validation. Now, we start with the second highest degree node for the formation of the second protein complex and then iterate through to the next lower degree node and so on. This process of identifying the node of the highest degree from the next set of complexes, and of identifying protein complexes through consensus and through the use of GO similarity, is repeated for each corresponding complex.

Fig. 1 Conceptual framework for the proposed protein complex detection ensemble. The flowchart has three stages: complex set generation from the base clustering algorithms (PPI network data fed to the unsupervised methods $BC_1, \ldots, BC_n$, yielding n individual complex sets $C_1, \ldots, C_n$); protein complex generation through consensus among the corresponding complexes (highest degree node, common node set $P_i$, $P_i^{Com}$, $P_i^{GO}$ and their superimposition $P_i^{Fin}$); and the validation phase (Frac, Acc, MMR, Precision, Recall and F-measure against gold standard protein complexes, and the co-localization score).

Table 3 Existence of an edge in the final complex
Edge in $P_i^{Com}$   Edge in $P_i^{GO}$   Edge in $P_i^{Fin}$
Present               Present              Present
Present               Not present          Present
Not present           Present              Present
Not present           Not present          Not present
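The superimposition rule of Table 3 amounts to a set union of the two edge sets. The sketch below illustrates this under stated assumptions: the paper does not say how a GO similarity value is turned into an edge, so the `threshold` parameter, the `go_edges` helper and the toy similarity values are hypothetical.

```python
from itertools import combinations


def go_edges(nodes, similarity, threshold):
    """Edges of P_i^GO: pairs whose GO-based similarity reaches `threshold`.

    `similarity` is any callable returning a protein-pair score; the
    cut-off is an assumption made for this sketch.
    """
    return {frozenset((u, v))
            for u, v in combinations(sorted(nodes), 2)
            if similarity(u, v) >= threshold}


def superimpose(edges_com, edges_go):
    """Table 3: an edge is in P_i^Fin if it is in P_i^Com or in P_i^GO."""
    return set(edges_com) | set(edges_go)


# Toy example (hypothetical proteins and similarity scores).
scores = {frozenset(("A", "B")): 0.9,
          frozenset(("B", "D")): 0.7,
          frozenset(("A", "D")): 0.2}


def toy_similarity(u, v):
    return scores.get(frozenset((u, v)), 0.0)


p_com = {frozenset(("A", "B")), frozenset(("A", "C"))}
p_go = go_edges({"A", "B", "C", "D"}, toy_similarity, threshold=0.5)
p_fin = superimpose(p_com, p_go)
print(sorted(sorted(e) for e in p_fin))
```

Using a union rather than an intersection favors recall: an edge supported by only one of the two sources is kept, which matches the paper's remark that such an edge is retained even though it may be a false alarm.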
The end result of this phase is a set of interacting protein complexes with overlapping proteins.

Phase III: validation

Finally, the validation phase evaluates the predicted protein complexes by comparing them to a set of gold standard protein complexes and also establishes their biological relevance using the co-localization score. Although the method adopted for protein complex detection is computationally expensive, the primary objective is to find complexes with high precision. For this reason, speed of detection is compromised in favor of accuracy and precision. Our main focus is to make use of multiple sources and means to detect and validate the complexes found, since a single information source or criterion has its limitations, as discussed earlier.

Experimental evaluation

Three different clustering methods for protein complex detection, MCODE (Bader and Hogue 2003), MCL (Van Dongen 2000; Pereira-Leal et al. 2004) and CFinder (Adamcsek et al. 2006), are used as the base clustering algorithms. These methods have been selected because they follow different approaches to clustering, and MCODE and CFinder can detect overlapping complexes. Moreover, we have also performed an empirical evaluation of these algorithms along with others, namely RNSC (King et al. 2004), AP (Frey and Dueck 2007), CMC (Liu et al. 2009), RRW (Macropol et al. 2009) and ClusterONE (Nepusz et al. 2012), and have found that the performance of these algorithms falls into the categories of good, average and not so good. Hence, these algorithms satisfy the diversity criterion of our ensemble. ClusterONE (Nepusz et al. 2012) has been found to be the best performing complex finding algorithm and hence we compare our ensemble's performance with this algorithm. GO semantic similarity is calculated based on the information content of GO terms and the semantic similarity proposed by Schlicker et al. (2006). The PPI datasets used for the experiments are summarized in Table 1. The GO-slim file was downloaded and the Biological Process (BP) hierarchy was selected to calculate the GO-driven similarity of proteins. The results were evaluated against the SGD and MIPS gold standard datasets, shown in Table 2.

6 Results

For each clustering algorithm tested, we tune the input parameters after experimenting with different combinations of settings to achieve optimal performance, unless the authors of the original algorithm explicitly suggest particular settings.
This procedure was conducted separately for each measure, for each dataset and with respect to each gold standard to obtain the most optimistic score.

6.1 Comparison of PCDEN with baseline clustering algorithms

We compare the performance of the proposed ensemble PCDEN with the baseline clustering approaches described earlier.

Quality score by MIPS gold standard

In this section, we report the results obtained by comparing the protein complexes predicted by each clustering algorithm and by the proposed PCDEN ensemble with the gold standard obtained from the MIPS catalog of protein complexes. The best scores are highlighted in bold for easy comparison. The number of complexes and the number of matched complexes for each dataset, the fraction of protein complexes matched by at least one predicted complex, the geometric accuracy and the maximum matching ratio achieved by the algorithms are reported in Table 4. Among the eight baseline clustering approaches, ClusterONE shows the best scores for the number of detected complexes that match at least one real complex, and also for the three quality measures, namely Frac, Acc and MMR, across all three datasets. However, although the number of protein complexes identified by PCDEN is lower than that of several other methods, 88 of the 152 predicted complexes matched very well with the benchmark complexes. Further, PCDEN obtains the best scores for the quality measures Frac and Acc across all three datasets used for comparison. The MMR score obtained by PCDEN for the Gavin dataset is also the highest, indicating better quality of the overlapping protein complexes detected. The comparison results are presented in Fig. 2, and it is clear that the PCDEN ensemble outperforms the other approaches on all datasets by achieving the highest score in each of the three categories used for comparison.
The results obtained by PCDEN are better than those of the single clustering approaches.

6.1.2 Quality score by SGD gold standard

Similarly, Table 5 shows the number of complexes detected, the number of matched complexes, Frac, Acc and MMR for the three datasets achieved by the baseline algorithms and PCDEN using the SGD gold standard. Although ClusterONE shows the best scores among the eight baseline clustering algorithms across all three datasets, PCDEN achieves the best results in protein complex detection with respect to the three quality measures for all three datasets, and also identifies the maximum number of matched complexes. We also observe in Tables 4 and 5 that the number of complexes detected by PCDEN for the Krogan and Krogan+ datasets is closer to the actual number of MIPS complexes, i.e., 203, while the results of ClusterONE appear to contain a large number of extra clusters, i.e., 526 and 531. Moreover, the number of complexes predicted by PCDEN matches the number of benchmark complexes very well for the Krogan and Krogan+ datasets. Figure 3 shows the comparison of the clustering algorithms and PCDEN for the SGD dataset; PCDEN outperforms all the single clustering approaches in all datasets. The results obtained by PCDEN indicate that the integration of multiple sources has greater power to predict protein complexes in PPI networks.

Table 4 Comparison of the protein complex detection algorithms and the proposed ensemble PCDEN on PPI datasets using the MIPS complex dataset
Columns (for each of the Gavin, Krogan and Krogan+ datasets): #c, #m, Frac, Acc, MMR
Rows: MCL, MCODE, RNSC, CFINDER, RRW, AP, CMC, ClusterONE, PCDEN
The best scores are highlighted in bold. #c no. of complexes, #m no. of matched complexes

Fig. 2 Comparing the evaluation metrics of the clustering algorithms and PCDEN for the MIPS dataset (quality scores Frac, Acc and MMR for each of the Gavin, Krogan and Krogan+ datasets)

Table 5 Comparison of the protein complex detection algorithms and the proposed ensemble on PPI datasets using the SGD complex dataset
Columns (for each of the Gavin, Krogan and Krogan+ datasets): #c, #m, Frac, Acc, MMR
Rows: MCL, MCODE, RNSC, CFINDER, RRW, AP, CMC, ClusterONE, PCDEN
The best scores are highlighted in bold. #c no. of complexes, #m no. of matched complexes
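The three quality measures reported in Tables 4 and 5 can be illustrated with a minimal Python sketch. It assumes the usual definitions from the cited literature: the overlap score and maximum matching ratio of Nepusz et al. (2012) and the geometric accuracy of Brohee and van Helden (2006). The matching threshold of 0.25, the function names and the toy complexes are assumptions made for illustration and are not taken from the PCDEN implementation. The brute-force matching is only practical for very small inputs.

```python
from itertools import permutations
from math import sqrt


def overlap(a, b):
    """Overlap score |A & B|^2 / (|A| * |B|) between two protein sets."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))


def frac(reference, predicted, threshold=0.25):
    """Fraction of reference complexes matched by at least one prediction.

    The threshold value is an assumption for this sketch.
    """
    matched = sum(
        1 for r in reference
        if any(overlap(r, p) >= threshold for p in predicted)
    )
    return matched / len(reference)


def geometric_accuracy(reference, predicted):
    """Geometric mean of clustering-wise sensitivity (Sn) and PPV."""
    # t[i][j] = number of proteins shared by reference i and prediction j
    t = [[len(r & p) for p in predicted] for r in reference]
    sn = sum(max(row) for row in t) / sum(len(r) for r in reference)
    total = sum(sum(row) for row in t)
    col_best = sum(max(row[j] for row in t) for j in range(len(predicted)))
    ppv = col_best / total if total else 0.0
    return sqrt(sn * ppv)


def mmr(reference, predicted):
    """Maximum matching ratio by brute force (toy-sized inputs only)."""
    w = [[overlap(r, p) for p in predicted] for r in reference]
    n, m = len(reference), len(predicted)
    if n <= m:
        # assign each reference complex to a distinct prediction
        best = max(sum(w[i][perm[i]] for i in range(n))
                   for perm in permutations(range(m), n))
    else:
        # assign each prediction to a distinct reference complex
        best = max(sum(w[perm[j]][j] for j in range(m))
                   for perm in permutations(range(n), m))
    return best / n


# Hypothetical toy data: two reference complexes, three predictions.
reference = [{"A", "B", "C", "D"}, {"E", "F", "G"}]
predicted = [{"A", "B", "C"}, {"E", "F", "G", "H"}, {"X", "Y"}]

print(frac(reference, predicted))
print(geometric_accuracy(reference, predicted))
print(mmr(reference, predicted))
```

On this toy input both reference complexes are matched, so Frac is 1.0, while MMR is 0.75 because each matched pair has an overlap score of 0.75. For realistically sized complex sets, the brute-force search would need to be replaced by a proper maximum weighted bipartite matching algorithm.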

Fig. 3 Comparative analysis of the evaluation metrics of the clustering algorithms and PCDEN for the SGD dataset (quality scores Frac, Acc and MMR for each of the Gavin, Krogan and Krogan+ datasets)

6.1.3 Biological significance of predicted complexes on Gavin and Krogan datasets

The biological relevance of the predicted complexes, based on their co-localization score, is discussed next.

Co-localization score

We make use of the localization dataset published by Huh et al. (2003) and of ProCope (Krumsiek et al. 2008; Friedel et al. 2009), a popular tool for predicting and evaluating protein complexes, to calculate the co-localization score. Table 6 shows the co-localization scores of the protein complexes detected using the eight algorithms and PCDEN on the Gavin and Krogan datasets. As expected, the highest co-localization score on the Gavin dataset is achieved by PCDEN, followed by ClusterONE. On the Krogan dataset, the co-localization score of PCDEN is higher than 0.725, the score obtained by MCL, and also higher than the score achieved by ClusterONE. Hence, the protein complexes detected by the PCDEN ensemble have relatively high quality from the biological standpoint, as reflected by the high co-localization score on both the Gavin and Krogan datasets. The co-localization scores of PCDEN and the baseline algorithms on the Gavin and Krogan datasets are compared in Fig. 4.

Table 6 Comparison of co-localization score
Columns: co-localization score on the Gavin and Krogan datasets
Rows: MCL, MCODE, RNSC, CFINDER, RRW, AP, CMC, ClusterONE, PCDEN
The best scores are highlighted in bold

Fig. 4 Comparison of co-localization score (Gavin and Krogan datasets)

6.2 Comparison of PCDEN with ensemble approaches

In addition to the comparison with individual clustering methods, we compare PCDEN with ensemble approaches proposed in recent years: NMF (Greene et al. 2008), EC-BNMF (Ou-Yang et al. 2013), BGENS (Zhang et al. 2012) and FGENS (Zhang et al. 2012).

Analysis of ensemble results

We use the validity measures Precision, Recall and F-measure discussed earlier. The results of the PCDEN ensemble are compared with the results of the ensemble approaches listed above. Because the source code of these ensemble approaches is not available, the metrics Frac, Acc and MMR could not be established for them. Hence, we use the Precision, Recall and F-measure scores as provided by the authors for the purpose of comparison with PCDEN.
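As an illustration of the Precision, Recall and F-measure used for the ensemble comparison, the following Python sketch assumes the matching-based definitions that are common in the complex detection literature: a predicted complex counts as correct if its overlap score with some gold standard complex reaches a threshold, and a gold standard complex counts as recovered if some predicted complex matches it. The threshold of 0.2, the function names and the toy data are assumptions made for this sketch and may differ from the settings used for PCDEN and for the compared methods.

```python
def overlap(a, b):
    """Overlap score |A & B|^2 / (|A| * |B|) between two protein sets."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))


def precision_recall_f(reference, predicted, threshold=0.2):
    """Matching-based Precision, Recall and F-measure.

    The threshold value is an assumption for this sketch.
    """
    # predicted complexes that match at least one gold standard complex
    hit_pred = sum(
        1 for p in predicted
        if any(overlap(p, r) >= threshold for r in reference)
    )
    # gold standard complexes matched by at least one predicted complex
    hit_ref = sum(
        1 for r in reference
        if any(overlap(r, p) >= threshold for p in predicted)
    )
    precision = hit_pred / len(predicted)
    recall = hit_ref / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: three gold standard complexes, four predictions.
gold = [{"A", "B", "C", "D"}, {"E", "F", "G"}, {"M", "N", "O"}]
pred = [{"A", "B", "C"}, {"E", "F", "G", "H"}, {"X", "Y"}, {"M", "P", "Q", "R"}]

p, r, f = precision_recall_f(gold, pred)
print(p, r, f)
```

Here two of the four predictions match a gold standard complex and two of the three gold standard complexes are recovered, giving a Precision of 0.5, a Recall of 2/3 and an F-measure of 4/7. Because the F-measure is the harmonic mean of the two, it penalizes methods that gain Recall only by predicting many unmatched clusters.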

The proposed ensemble method outperforms the other approaches by achieving higher F-measure and Recall rates for the MIPS gold standard, as shown in Table 7. For the SGD gold standard, the F-measure and Recall rates of PCDEN are almost at par with those of FGENS. However, PCDEN finds more matched modules in the PPI network than the other methods, which demonstrates its effectiveness. NMF and EC-BNMF use only the PPI network topology, whereas PCDEN integrates information from GO similarity and so achieves higher scores than these methods. PCDEN also shows better results than BGENS and FGENS for the MIPS gold standard. Its F-measure and Recall rates for the SGD gold standard are better than those of BGENS and almost at par with those of FGENS, the reason being that FGENS also integrates additional information, combining gene expression data with GO similarity information. This indicates that the integration of multiple information sources greatly benefits protein complex detection. A graphical comparison of the clustering approaches using Precision, Recall and F-measure is given in Fig. 5. The performance of PCDEN is better than that of the other cluster ensemble methods NMF and EC-BNMF, and at par with FGENS for one result. This indicates the effectiveness of graph-based cluster ensemble methods such as FGENS and PCDEN in detecting protein complexes, especially when additional information about the proteins is integrated into the network.

7 Discussion and conclusion

Protein complexes are important for understanding the principles of cellular organization and function. A better comprehension of their roles allows us to understand how a protein complex disorder can affect the biological processes in which it is involved. Proteins are related to diseases, and diseases are usually caused by the erroneous production of some protein complex.
Therefore, much work has focused on the prediction of protein complexes from PPI networks. However, traditional clustering approaches for protein complex identification are hampered by a number of factors, such as PPI networks being noisy and incomplete, which gives rise to high false-positive rates in identifying interactions. The partitioning schemes themselves depend on specific initial conditions and parameter settings; the end result is unstable and unsatisfactory cluster partitions that lack biological meaning. Keeping these challenges in mind, we have developed a framework for an ensemble method that gives satisfactory results for the identification of protein complexes from interaction datasets. An ensemble approach is generally able to improve the robustness and stability of the clusterings by combining the output of several algorithms, thus improving the overall prediction accuracy. Such a method can identify larger, denser clusters with improved biological significance. Ensemble-based methods are also preferable for protein complex detection because traditional clustering methods search for complexes in dense regions and miss complexes of low density, that is, small complexes of two or three proteins. Moreover, cluster ensembles can also address the problem of local optima and obtain a globally optimal solution, i.e., a stable clustering solution.

Fig. 5 Comparison of the ensemble approaches with the proposed ensemble PCDEN (Precision, Recall and F-measure for the MIPS and SGD gold standards; methods BGENS, FGENS, NMF, EC-BNMF and PCDEN)

Table 7 Comparison of the ensemble approaches with the proposed PCDEN ensemble
Columns (for each of the MIPS and SGD gold standards): Precision, Recall, F-measure
Rows: BGENS, FGENS, NMF, EC-BNMF, PCDEN
The best scores are highlighted in bold

We propose an ensemble framework called Protein Complex Detection Ensemble (PCDEN) to predict protein complexes from a PPI network. We compare PCDEN with eight algorithms, and the results are validated independently using two gold standard datasets. The biological relevance of the detected protein complexes is measured using a co-localization score. PCDEN is also compared with four ensemble approaches. The PCDEN ensemble outperforms the other clustering approaches in all datasets by achieving the highest score in each of the categories used for comparison. We conclude that the complexes obtained by PCDEN achieve accuracies comparable to gold standard complexes. The results indicate that the integration of multiple sources increases the precision in detecting protein complexes in PPI networks.

References

Adamcsek B, Palla G, Farkas IJ, Derényi I, Vicsek T (2006) CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8)
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29
Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform 4(1):2
Brohee S, van Helden J (2006) Evaluation of clustering algorithms for protein–protein interaction networks. BMC Bioinform 7:488
Dai Q, Duan X, Guo M, Guo Y (2016) EnPC: an EnsembleClustering framework for detecting protein complexes in protein–protein interaction network. Curr Proteom 13(2)
Frey BJ, Dueck D (2007) Clustering by passing messages between data points. Science 315(5814)
Friedel CC, Krumsiek J, Zimmer R (2009) Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. J Comput Biol 16(8)
Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, Marzioch M, Rau C (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084)
Greene D, Cagney G, Krogan N, Cunningham P (2008) Ensemble nonnegative matrix factorization. Bioinformatics 24
Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G (2008) Gene ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res 36(Database issue)
Huh WK, Falvo JV, Gerke LC, Carroll SA, Howson RW, Weissman JS, O'Shea EK (2003) Global analysis of protein localization in budding yeast. Nature 425
Hung MC, Link W (2011) Protein localization in disease and therapy. J Cell Sci 124
Hung IH, Suzuki M, Yamaguchi Y, Yuan DS, Klausner RD, Gitlin JD (1997) Biochemical characterization of the Wilson disease protein and functional expression in the yeast Saccharomyces cerevisiae. J Biol Chem 272
Jansen R, Gerstein M (2004) Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr Opin Microbiol 7
King A, Przulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20(17)
Krogan N, Cagney G, Yu H, Zhong G, Guo X et al (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084)
Krumsiek J, Friedel CC, Zimmer R (2008) ProCope: protein complex prediction and evaluation. Bioinformatics 24(18). doi: /bioinformatics/btn376
Li XL, Wu M, Kwoh CC, Ng SK (2010) Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genom 11(Suppl 1):S3
Li X, Wang J, Zhao B, Wu F-X, Pan Y (2016) Identification of protein complexes from multi-relationship protein interaction networks. Hum Genom 10(Suppl 2):17
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning. San Francisco, CA
Liu G, Wong L, Chua HN (2009) Complex discovery from weighted PPI networks. Bioinformatics 25(15)
Lord PW, Stevens RD, Brass A, Goble CA (2003) Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics 19
Macropol K, Can T, Singh AK (2009) RRW: repeated random walks on genome-scale protein networks for local cluster discovery. BMC Bioinform 10:283
Mewes HW, Frishman D, Mayer KFX, Münsterkötter M, Noubibou O, Rattei T, Oesterheld M, Stümpflen V (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 32:41–44
Nagi S, Bhattacharyya DK (2013) Classification of microarray cancer data using ensemble approach. Netw Model Anal Health Inform Bioinform 2(3)
Nagi S, Bhattacharyya DK (2014) Cluster analysis of cancer data using semantic similarity, sequence similarity and biological measures. Netw Model Anal Health Inform Bioinform 3(67):1–38
Nepusz T, Yu H, Paccanaro A (2012) Detecting overlapping protein complexes in protein–protein interaction networks. Nat Methods 9(5)
Ou-Yang L, Dai D-Q, Zhang XF (2013) Protein complex detection via weighted ensemble. PLoS One 8(5):e62158
Ou-Yang L, Dai D-Q, Zhang XF (2015) Detecting protein complexes from signed protein–protein interaction networks. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 12(6)
Payne AS, Kelly EJ, Gitlin JD (1998) Functional expression of the Wilson disease protein reveals mislocalization and impaired copper-dependent trafficking of the common H1069Q mutation. Proc Natl Acad Sci USA 95
Pereira-Leal JB, Enright AJ, Ouzounis CA (2004) Detection of functional modules from protein interaction networks. Proteins Struct Funct Bioinform 54:49–57
Resnik P (1999) Semantic similarity in a taxonomy: an information based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 11
Schlicker A, Domingues FS, Rahnenfuhrer J, Lengauer T (2006) A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 7:302
Sharma P, Ahmed HA, Roy S, Bhattacharyya DK (2015) Unsupervised methods for finding protein complexes from PPI networks. Netw Model Anal Health Inform Bioinform 4(1):1–15
