Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions
Derek Greene (1), Gerard Cagney (1,2), Nevan Krogan (2), Pádraig Cunningham (1)
1: School of Computer Science & Informatics, UCD
2: Department of Cellular & Molecular Pharmacology, UCSF
2 Outline
- Protein Interaction Data
- Existing Cluster Analysis Techniques
  - Hierarchical Clustering
  - Non-negative Matrix Factorization (NMF)
- Objectives for Clustering
- Ensemble NMF Clustering Algorithm
  - Generation Phase
  - Integration Phase
- Experimental Evaluation
- NMF Tree Browser Application
3 Analysing Protein Interaction Data
- Large biological datasets comprising thousands of protein-protein interactions have been assembled.
- Cataloguing and analysing interaction data is a first step toward understanding the biological basis of the interactions and the role of any network structure that underlies them.
- In recent years, the size and density of these datasets have presented a barrier to analysis, even for individuals with extensive knowledge of the proteins. For example, there are 18,324 physically interacting protein pairs in the Saccharomyces cerevisiae proteome alone (Salwinski et al., 2004).
- Cluster analysis techniques are often used to explore and organize large biological datasets.
4 Hierarchical Clustering
- Constructs a binary tree by iteratively merging the most similar clusters.
- Has been applied to identify functional groupings in protein interaction data (Collins et al., 2007).
[Figure: example tree containing the protein ARP2]
Drawbacks:
- Each data object can reside in only a single branch of the tree at a given level.
- In protein networks, a protein may be associated with multiple biological processes.
- Such a protein should belong to multiple distinct branches in the natural cluster hierarchy of the data.
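The merging procedure described on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes average linkage on a similarity matrix; the function name `average_linkage_tree` and the toy matrix `toy_sim` are hypothetical and not taken from the original work.

```python
from itertools import combinations

def average_linkage_tree(sim):
    """Agglomerative clustering on a similarity matrix (list of lists).

    Returns a nested tuple: leaves are item indices and internal nodes
    are (left, right) pairs, so the result is a binary tree.
    """
    # Each active cluster is stored as (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(sim))]

    def avg_sim(a, b):
        # Mean pairwise similarity between the members of two clusters.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of active clusters and merge it.
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_sim(clusters[p[0]][1], clusters[p[1]][1]))
        merged = ((clusters[x][0], clusters[y][0]),
                  clusters[x][1] + clusters[y][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical 4-item similarity matrix with two obvious groups.
toy_sim = [[1.0, 0.9, 0.1, 0.0],
           [0.9, 1.0, 0.2, 0.1],
           [0.1, 0.2, 1.0, 0.8],
           [0.0, 0.1, 0.8, 1.0]]
tree = average_linkage_tree(toy_sim)
```

Every index appears exactly once among the leaves of `tree`, which is the limitation noted under Drawbacks: an item cannot sit in two branches at once.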
5 NMF Clustering
- Non-negative Matrix Factorization (NMF) algorithms (Lee & Seung, 1999) have been used to discover overlapping groups.
- NMF produces a low-dimensional approximation of a non-negative data matrix, which can be interpreted as a "soft" clustering.
Symmetric NMF (Ding & He, 2005): $S \approx V V^{T}$
- $S$ ($n \times n$): non-negative similarity matrix. $S_{ij}$ is the strength of association between protein $i$ and protein $j$.
- $V$ ($n \times k$): factor matrix (the clustering). $V_{ij}$ is the real-valued membership weight for protein $i$ in cluster $j$.
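The factorization on this slide can also be stated as an optimisation problem. The form below is the usual least-squares statement of symmetric NMF, given here as a sketch; the slide itself shows only the shapes of the matrices, so the choice of the Frobenius norm is an assumption.

```latex
\min_{V \ge 0} \; \bigl\lVert S - V V^{T} \bigr\rVert_F^{2},
\qquad S \in \mathbb{R}_{\ge 0}^{n \times n},
\quad V \in \mathbb{R}_{\ge 0}^{n \times k}
```

Here $k$ is the number of clusters, and row $i$ of $V$ holds the membership weights of protein $i$.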
6 NMF Clustering
Example: Symmetric NMF with k = 2 applied to a 5 x 5 pairwise similarity matrix.

Pairwise similarity matrix S (n x n):
    1.0  0.8  0.4  0.1  0.0
    0.8  1.0  0.4  0.0  0.0
    0.4  0.4  1.0  0.5  0.5
    0.1  0.0  0.5  1.0  0.9
    0.0  0.0  0.5  0.9  1.0

Factor matrix V (n x k):
    Cluster 1  Cluster 2
    0.94       0.00
    0.94       0.00
    0.52       0.63
    0.02       0.95
    0.00       0.96

The third row (0.52, 0.63) shows significant overlap between the two clusters.
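The example above can be approximated with a short, dependency-free sketch. It assumes a damped multiplicative update, V <- V * (1 - beta + beta * (S V) / (V V^T V)) with beta = 0.5, which is one commonly used rule for symmetric NMF; the slides do not state which update was used, and the function names, iteration count and seed are illustrative choices.

```python
import random

# Pairwise similarity matrix from the slide.
S = [[1.0, 0.8, 0.4, 0.1, 0.0],
     [0.8, 1.0, 0.4, 0.0, 0.0],
     [0.4, 0.4, 1.0, 0.5, 0.5],
     [0.1, 0.0, 0.5, 1.0, 0.9],
     [0.0, 0.0, 0.5, 0.9, 1.0]]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def error(S, V):
    """Squared Frobenius error between S and V V^T."""
    R = matmul(V, transpose(V))
    return sum((s - r) ** 2
               for s_row, r_row in zip(S, R)
               for s, r in zip(s_row, r_row))

def symmetric_nmf(S, k, iters=200, beta=0.5, seed=0):
    """Approximate S by V V^T with V >= 0 (assumed damped update rule)."""
    rng = random.Random(seed)
    n = len(S)
    # Random strictly positive starting point.
    V = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    for _ in range(iters):
        SV = matmul(S, V)
        VVtV = matmul(V, matmul(transpose(V), V))
        # The scaling factor is always positive, so V stays non-negative.
        V = [[V[i][j] * (1 - beta + beta * SV[i][j] / (VVtV[i][j] + 1e-12))
              for j in range(k)] for i in range(n)]
    return V

V0 = symmetric_nmf(S, k=2, iters=0)   # the random starting point
V = symmetric_nmf(S, k=2)             # the fitted factor matrix
for row in V:
    print([round(w, 2) for w in row])
```

The fitted weights depend on the seed and on the order of the columns, so they need not match the slide exactly. That sensitivity to initialisation is one motivation for running the factorization many times.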
7 NMF Clustering - Analysis
Advantages:
- Solutions can represent overlapping clusters.
- Often produces a sparse factor matrix:
  - Can identify small, localised clusters.
  - Can eliminate irrelevant and outlying instances.
Disadvantages:
- Output depends on the initial matrix used to seed the algorithm.
- Does not discover hierarchical relations between clusters.
- No intuitive visualisation for the output.
- Parameter selection can be difficult: how many clusters k should the factor matrix contain?
[Figure: example factor matrix of membership weights, captioned "How are these clusters related?"]
8 Objectives for Clustering
Q. What features do we require in a cluster analysis procedure when working with protein interaction data?
1. Clusters similar to known protein complex compositions.
2. Clusters presented in an intuitive visual format.
3. Provision of meaningful hierarchical structure.
4. Identification of shared subunits and "moonlighting" proteins.
5. Assignment of putative protein function.
For analysing protein interaction networks, we propose a new algorithm that combines:
- The ability of NMF to accurately identify overlapping groups.
- The organisational and visualisation benefits of hierarchical clustering.
9 Soft Hierarchical Clustering An alternative binary tree representation that supports overlapping groups. Proteins can be associated with multiple nodes in the tree to different degrees.
10 Ensemble NMF Algorithm
Key idea: ensemble algorithms combine the output of multiple machine learning procedures to produce a superior result.
The algorithm involves a two-phase process:
1. Generation phase: produce a collection of NMF factorizations (i.e. the members of the ensemble).
2. Integration phase: combine the factorizations to produce an improved clustering.
[Figure: Original Dataset -> Symmetric NMF -> NMF Factorizations -> Integration Function -> Consensus Solution]
NB: The consensus solution is a soft hierarchical clustering.
11 Algorithm: Generation Phase
Q. How do we generate an ensemble of factorizations?
- Repeatedly apply Symmetric NMF to a pairwise similarity matrix S representing our data.
[Figure: Pairwise Similarity Matrix S -> Symmetric NMF -> factorizations V1, V2, V3, V4, ... (large collection of ensemble members)]
- Ensemble techniques are most effective when combining a diverse collection of solutions (Opitz & Shavlik, 1996).
To introduce diversity in the generation phase:
- Initialise Symmetric NMF with a random solution.
- Randomly select the number of factors k from a fixed range.
- The fixed range can be chosen "roughly", which simplifies the NMF model selection problem.
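A minimal sketch of the generation phase follows. It uses the two sources of diversity listed above (random initialisation and a randomly chosen k); the update rule inside `symmetric_nmf`, the function names, the ensemble size and the range of k are all assumptions made for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def symmetric_nmf(S, k, rng, iters=100):
    """Factorise S ~ V V^T from a random start (assumed damped update)."""
    n = len(S)
    V = [[rng.uniform(0.1, 1.0) for _ in range(k)] for _ in range(n)]
    for _ in range(iters):
        SV = matmul(S, V)
        Vt = [list(col) for col in zip(*V)]
        VVtV = matmul(V, matmul(Vt, V))
        V = [[V[i][j] * (0.5 + 0.5 * SV[i][j] / (VVtV[i][j] + 1e-12))
              for j in range(k)] for i in range(n)]
    return V

def generate_ensemble(S, size, k_min, k_max, seed=0):
    """Run symmetric NMF `size` times with random k and random starts."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(size):
        k = rng.randint(k_min, k_max)              # random number of factors
        ensemble.append(symmetric_nmf(S, k, rng))  # random initial solution
    return ensemble

# The 5 x 5 similarity matrix from the earlier NMF example slide.
S = [[1.0, 0.8, 0.4, 0.1, 0.0],
     [0.8, 1.0, 0.4, 0.0, 0.0],
     [0.4, 0.4, 1.0, 0.5, 0.5],
     [0.1, 0.0, 0.5, 1.0, 0.9],
     [0.0, 0.0, 0.5, 0.9, 1.0]]
ensemble = generate_ensemble(S, size=6, k_min=2, k_max=3)
```

Each member of `ensemble` is an n x k factor matrix, and k may differ from member to member.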
12 Algorithm: Integration Phase
Q. How do we combine an ensemble of factorizations to produce a final "consensus" clustering of the data?
- Construct a dataset from all clusters present in the ensemble.
- Apply "min-max" hierarchical clustering to produce a meta-clustering (i.e. a clustering of clusters).
[Figure: Ensemble of Factorizations (V1, V2, V3, V4) -> Build Matrix -> Matrix of Clusters as columns (n x l) -> Transpose Matrix -> Matrix of Clusters as rows (l x n) -> Min-Max Clustering -> Meta-Clustering]
NB: We can construct a soft hierarchical clustering of the original proteins from the meta-clustering by taking the mean vector for each tree node in the meta-clustering.
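The integration steps can be sketched as follows. The slides do not define the "min-max" linkage, so this sketch substitutes average linkage over cosine similarity as a stand-in; only the construction of the matrix of clusters and the mean vector per tree node follow the slide directly. All function names and the toy ensemble are hypothetical.

```python
import math
from itertools import combinations

def clusters_as_rows(ensemble):
    """Stack every column of every n x k factor matrix as a row of length n."""
    return [list(col) for V in ensemble for col in zip(*V)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def meta_cluster(rows):
    """Cluster the clusters; returns all tree nodes, with the root last.

    Stand-in linkage: average cosine similarity (NOT the min-max linkage).
    Each node stores the mean vector of its member clusters, which gives
    one soft membership weight per protein.
    """
    active = [{"members": [i], "children": None, "weights": list(r)}
              for i, r in enumerate(rows)]
    nodes = list(active)

    def avg_sim(a, b):
        total = sum(cosine(rows[i], rows[j])
                    for i in a["members"] for j in b["members"])
        return total / (len(a["members"]) * len(b["members"]))

    while len(active) > 1:
        a, b = max(combinations(active, 2), key=lambda p: avg_sim(*p))
        members = a["members"] + b["members"]
        node = {"members": members, "children": (a, b),
                "weights": mean_vector([rows[i] for i in members])}
        active = [c for c in active if c is not a and c is not b]
        active.append(node)
        nodes.append(node)
    return nodes

# Hypothetical ensemble: two factorizations of three proteins.
V1 = [[0.9, 0.0], [0.8, 0.1], [0.0, 0.9]]
V2 = [[1.0, 0.0], [0.9, 0.0], [0.1, 0.8]]
rows = clusters_as_rows([V1, V2])   # l x n matrix of clusters
nodes = meta_cluster(rows)
root = nodes[-1]
```

Because a protein can carry a non-zero weight in several nodes, the resulting tree is a soft hierarchy rather than a partition.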
13 Experimental Evaluation
- We used an extensive and high-quality assembly of binary interactions for 2390 proteins (Collins et al., 2007).
- This dataset provides a confidence score, referred to as Purification Enrichment (PE), measuring the evidence that the proteins do indeed co-purify.
- We apply Ensemble NMF to the corresponding PE score matrix $S$, where $S_{ij}$ is the strength of evidence that there is a genuine positive or negative interaction between protein $i$ and protein $j$.
- Baseline approach: we also applied average-linkage hierarchical clustering to the PE score matrix.
14 Evaluation: External Validation
- External validation: compare a clustering to a "gold standard" classification, if available.
- For protein interaction data we use functional groupings provided by the MIPS database.
We consider two well-known validation measures:
- Precision: fraction of proteins in a given cluster that pertain to a specific MIPS class.
- Recall: fraction of the proteins from a given MIPS class that were recovered in a given cluster.
Ideally we want a cluster analysis procedure that recovers known protein complex compositions with high precision and recall.
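The two measures defined above reduce to simple set arithmetic. The sketch below assumes each cluster has already been turned into a plain set of protein identifiers (for a soft cluster this would require some thresholding of membership weights, which the slides do not specify); the identifiers are hypothetical.

```python
def precision_recall(cluster, mips_class):
    """Precision and recall of one cluster against one reference class."""
    cluster, mips_class = set(cluster), set(mips_class)
    shared = len(cluster & mips_class)
    # Precision: share of the cluster that belongs to the class.
    precision = shared / len(cluster) if cluster else 0.0
    # Recall: share of the class that the cluster recovered.
    recall = shared / len(mips_class) if mips_class else 0.0
    return precision, recall

# Hypothetical identifiers: 3 of 4 cluster members are in a 5-member class.
cluster = {"P1", "P2", "P3", "P4"}
reference = {"P1", "P2", "P3", "P5", "P6"}
precision, recall = precision_recall(cluster, reference)
```

In this toy case precision is 3/4 and recall is 3/5.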
15 External Validation Results
- The structures uncovered by Ensemble NMF seem to be far more informative than those identified using the baseline approach.
- This is reflected in substantially improved scores on both validation measures, based on MIPS classes.

Table 1. Validation scores for 20 most significant clusters identified by Ensemble NMF on Collins protein interaction data.

    Class                                     Precision  Recall
    20S proteasome                            1.00       0.88
    Anaphase promoting complex (APC)          1.00       0.80
    H+-transporting ATPase vacuolar           1.00       0.64
    Post-replication complex                  1.00       1.00
    Pre-replication complex (pre-RC)          1.00       0.60
    Replication complex                       1.00       0.40
    Replication initiation complex            1.00       0.75
    Septin filaments                          1.00       1.00
    TRAPP complex                             1.00       0.70
    RNA polymerase I                          0.93       0.59
    SWI/SNF activator complex                 0.89       0.89
    COPI                                      0.88       1.00
    Exocyst complex                           0.88       1.00
    Kornberg's mediator (SRB) complex         0.86       1.00
    Signal recognition particle (SRP)         0.86       1.00
    Gim complexes                             0.83       1.00
    TFIIIC                                    0.83       1.00
    19/22S regulator                          0.78       1.00
    Arp2p/Arp3p complex                       0.71       1.00

Table 2. Validation scores for 20 most significant clusters identified by average-linkage hierarchical clustering on Collins protein interaction data.

    Class                                     Precision  Recall
    Geranylgeranyltransferase II              1.00       0.67
    v-SNAREs                                  1.00       0.33
    NEF3 complex                              0.50       0.14
    RNA polymerase I                          0.50       0.05
    RNase MRP                                 0.50       1.00
    RNase P                                   0.50       1.00
    Replication factor C complex              0.50       1.00
    mRNA splicing                             0.50       0.04
    Other respiration chain complexes         0.50       0.14
    RSC complex                               0.27       0.90
    SWI/SNF transcription activator complex   0.27       1.00
    SAGA complex                              0.14       0.91
    rRNA splicing                             0.13       0.15
    Dam1 protein complex                      0.10       1.00
    20S proteasome                            0.09       0.94
    RNA polymerase III                        0.08       0.92
    ADA complex                               0.07       0.83
    RNA polymerase II                         0.07       0.85
    TRAPP complex                             0.06       1.00
16 Evaluation: Discussion
Provision of meaningful hierarchical structure:
- The soft hierarchical clustering produced by Ensemble NMF lends itself to the identification of sub-complexes.
- Example: the COMA subcomplex (Ame1, Okp1, Mcm21, Ctf19) of the larger CTF19 central kinetochore complex can be resolved.
Identification of shared subunits and "moonlighting" proteins:
- Ensemble NMF successfully accommodates proteins that are present in two or more groupings.
- Example: the three chromatin remodelling complexes SWR-C, INO80, and NuA4 all contain actin and the actin-related protein Arp4.
Assignment of putative protein function:
- The uncharacterised protein YNR024W is grouped within a tree node that contains all twelve members of the exosome complex.
- YNR024W may be a previously undescribed component of this complex, and/or participate in these processes.
17 NMF Tree Browser Application
We developed the NMF Tree Browser, a cross-platform Java application for visually inspecting a soft hierarchy produced by the Ensemble NMF algorithm.
[Screenshot of the NMF Tree Browser, with callouts: zoom controls; statistics for selected node; class correlations for selected node; currently selected node; tree root node; membership weights for selected node]
18 NMF Tree Browser Application
The application includes a range of data exploration tools.
[Screenshot callouts: class sizes and correlations; precision & recall scores; list of most significant class/node combinations; membership weights for proteins in selected node]
Clustering and Tree Browser software is freely available: http://mlg.ucd.ie/nmf
19 Conclusions
- We have presented a new clustering approach that involves aggregating a collection of matrix factorizations generated using NMF-like techniques.
- In evaluations on high-quality protein interaction data, we have observed that Ensemble NMF can:
  - Improve our ability to identify groupings that accurately reflect known protein complex compositions.
  - Help discover overlapping groups and multi-function or "moonlighting" proteins.
  - Provide an intuitive, tree-like organisation of the data.
- We have developed the NMF Tree Browser application, which supports cluster visualisation and labelling of previously uncharacterised proteins.
- Many other potential applications exist, e.g. discovering structure in genetic interaction data and gene microarray data.
20 References
- Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U., and Eisenberg, D. (2004). The database of interacting proteins: 2004 update. Nucleic Acids Res, 32(Database issue), pp. 449-51.
- Collins, S. R., Kemmeren, P., Zhao, X.-C., Greenblatt, J. F., Spencer, F., Holstege, F. C., Weissman, J. S., and Krogan, N. J. (2007). Towards a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics.
- Strehl, A. and Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proc. Conference on Artificial Intelligence (AAAI 02), pp. 93-98.
- Ding, C. and He, X. (2005). On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering. In Proc. SIAM International Conference on Data Mining (SDM 05), pp. 606-610.
- Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, pp. 788-91.
- Opitz, D. W. and Shavlik, J. W. (1996). Generating accurate and diverse members of a neural-network ensemble. NIPS 8, pp. 535-541.