Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions


Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions
Derek Greene (1), Gerard Cagney (1,2), Nevan Krogan (2), Pádraig Cunningham (1)
1: School of Computer Science & Informatics, UCD
2: Department of Cellular & Molecular Pharmacology, UCSF

2 Outline
Protein Interaction Data
Existing Cluster Analysis Techniques
    Hierarchical Clustering
    Non-negative Matrix Factorization (NMF)
Objectives for Clustering
Ensemble NMF Clustering Algorithm
    Generation Phase
    Integration Phase
Experimental Evaluation
NMF Tree Browser Application

3 Analysing Protein Interaction Data
Large biological datasets comprising thousands of protein-protein interactions have been assembled.
Cataloguing and analysing interaction data is a first step toward understanding the biological basis of the interactions and the role of any network structure that underlies them.
In recent years, the size and density of these datasets have presented a barrier to analysis, even for individuals with extensive knowledge of the proteins. For example, there are 18,324 physically interacting protein pairs in the Saccharomyces cerevisiae proteome alone (Salwinski et al., 2004).
Cluster analysis techniques are often used to explore and organize large biological datasets.

4 Hierarchical Clustering
Constructs a binary tree by iteratively merging the most similar clusters.
Applied to identify functional groupings in protein interaction data (Collins et al., 2007).
[Figure: dendrogram; labels: ARP2, ARP2 (X)]
Drawbacks:
    Each data object can only reside in a single branch of the tree at a given level.
    In protein networks, a protein may be associated with multiple biological processes.
    Such a protein should be able to belong to multiple distinct branches in the natural cluster hierarchy of the data.

5 NMF Clustering
Non-negative Matrix Factorization (NMF) (Lee & Seung, 1999) algorithms have been used to discover overlapping groups.
NMF produces a low-dimensional approximation of a non-negative data matrix, which can be interpreted as a "soft" clustering.
Symmetric NMF (Ding & He, 2005): $S \approx V V^{T}$, where $S$ is the $n \times n$ non-negative similarity matrix and $V$ is the $n \times k$ factor matrix (the clustering).
    $S_{ij}$: strength of association between protein $i$ and protein $j$.
    $V_{ij}$: real-valued membership weight for protein $i$ in cluster $j$.

6 NMF Clustering: Example
Symmetric NMF with k = 2 applied to a 5 x 5 pairwise similarity matrix.

Pairwise similarity matrix S:
    1.0  0.8  0.4  0.1  0.0
    0.8  1.0  0.4  0.0  0.0
    0.4  0.4  1.0  0.5  0.5
    0.1  0.0  0.5  1.0  0.9
    0.0  0.0  0.5  0.9  1.0

Factor matrix V:
    Row  Cluster 1  Cluster 2
    1    0.94       0.00
    2    0.94       0.00
    3    0.52       0.63       (significant overlap)
    4    0.02       0.95
    5    0.00       0.96
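The slide does not say which update rule was used to obtain the factor matrix. As a minimal, dependency-free sketch, the code below assumes the damped multiplicative update V <- V * (1 - beta + beta * (S V) / (V V^T V)) with beta = 0.5, a rule reported in the symmetric NMF literature, and applies it to the 5 x 5 similarity matrix from the example. The function names, iteration count, and random seed are illustrative choices, and the resulting weights need not match the slide's values exactly.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def squared_error(s, v):
    """Squared Frobenius error ||S - V V^T||^2."""
    approx = matmul(v, transpose(v))
    n = len(s)
    return sum((s[i][j] - approx[i][j]) ** 2 for i in range(n) for j in range(n))

def symmetric_nmf(s, k, iterations=500, beta=0.5, seed=0, eps=1e-9):
    """Factorise a non-negative symmetric matrix S as V V^T (assumed update rule)."""
    rng = random.Random(seed)
    n = len(s)
    # Random non-negative starting point; the result depends on this seed.
    v = [[rng.random() for _ in range(k)] for _ in range(n)]
    for _ in range(iterations):
        sv = matmul(s, v)                            # numerator: S V
        vvtv = matmul(v, matmul(transpose(v), v))    # denominator: V (V^T V)
        v = [[v[i][c] * (1 - beta + beta * sv[i][c] / (vvtv[i][c] + eps))
              for c in range(k)] for i in range(n)]
    return v

# Pairwise similarity matrix from the slide.
S = [
    [1.0, 0.8, 0.4, 0.1, 0.0],
    [0.8, 1.0, 0.4, 0.0, 0.0],
    [0.4, 0.4, 1.0, 0.5, 0.5],
    [0.1, 0.0, 0.5, 1.0, 0.9],
    [0.0, 0.0, 0.5, 0.9, 1.0],
]

V = symmetric_nmf(S, k=2)
for row in V:
    print("  ".join(f"{w:.2f}" for w in row))
```

Each printed row holds one item's membership weights in the two clusters. Column order is arbitrary, so "Cluster 1" in the slide may correspond to either column here.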

7 NMF Clustering - Analysis
Advantages:
    Solutions can represent overlapping clusters.
    Often produces a sparse factor matrix...
        Can identify small, localised clusters.
        Can eliminate irrelevant and outlying instances.
Disadvantages:
    Output depends on the initial matrix used to seed the algorithm.
    Does not discover hierarchical relations between clusters.
    No intuitive visualisation for the output. How are these clusters related?
    Parameter selection can be difficult... How many clusters k in the factor matrix?
[Figure: example factor matrix]

8 Objectives for Clustering
Q. What features do we require in a cluster analysis procedure when working with protein interaction data?
1. Clusters similar to known protein complex compositions.
2. Clusters should be presented in an intuitive visual format.
3. Provision of meaningful hierarchical structure.
4. Identify shared subunits and "moonlighting" proteins.
5. Assignment of putative protein function.
For analysing protein interaction networks, we propose a new algorithm that combines:
    The ability of NMF to accurately identify overlapping groups.
    The organisational and visualisation benefits of hierarchical clustering.

9 Soft Hierarchical Clustering
An alternative binary tree representation that supports overlapping groups.
Proteins can be associated with multiple nodes in the tree to different degrees.

10 Ensemble NMF Algorithm
Key Idea: Ensemble algorithms combine the output of multiple Machine Learning procedures to produce a superior result.
The algorithm involves a two-phase process:
1. Generation phase: Produce a collection of NMF factorizations (i.e. the members of the ensemble).
2. Integration phase: Combine the factorizations to produce an improved clustering.
[Figure: Original Dataset -> Symmetric NMF -> NMF Factorizations -> Integration Function -> Consensus Solution]
NB: The consensus solution is a soft hierarchical clustering.

11 Algorithm: Generation Phase
Q. How do we generate an ensemble of factorizations?
Repeatedly apply Symmetric NMF to a pairwise similarity matrix representing our data.
[Figure: Pairwise Similarity Matrix S -> Symmetric NMF -> V1, V2, V3, V4, ... (large collection of ensemble members)]
Ensemble techniques are most effective when combining a diverse collection of solutions (Opitz & Shavlik, 1996).
To introduce diversity in the generation phase:
    Initialise Symmetric NMF with a random solution.
    Randomly select the number of factors k from a fixed range.
The fixed range can be chosen "roughly", which simplifies the NMF model selection problem.
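The generation phase can be sketched as a loop that draws k from a fixed range and starts each run from a fresh random matrix. This is an illustrative, dependency-free sketch rather than the authors' implementation. The damped multiplicative update inside `symmetric_nmf`, the names `generate_ensemble`, `k_min`, `k_max`, and `num_members`, and the small toy matrix are all assumptions made for the example.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def symmetric_nmf(s, k, rng, iterations=200, beta=0.5, eps=1e-9):
    """One Symmetric NMF run, S ~ V V^T, from a random start (assumed update rule)."""
    n = len(s)
    v = [[rng.random() for _ in range(k)] for _ in range(n)]
    for _ in range(iterations):
        vt = [list(col) for col in zip(*v)]
        sv = matmul(s, v)
        vvtv = matmul(v, matmul(vt, v))
        v = [[v[i][c] * (1 - beta + beta * sv[i][c] / (vvtv[i][c] + eps))
              for c in range(k)] for i in range(n)]
    return v

def generate_ensemble(s, num_members, k_min, k_max, seed=0):
    """Build a diverse ensemble: random k and random initialisation per member."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(num_members):
        k = rng.randint(k_min, k_max)               # number of factors from a fixed range
        ensemble.append(symmetric_nmf(s, k, rng))   # random initial solution
    return ensemble

# Toy similarity matrix (illustrative only).
S = [
    [1.0, 0.8, 0.4, 0.1, 0.0],
    [0.8, 1.0, 0.4, 0.0, 0.0],
    [0.4, 0.4, 1.0, 0.5, 0.5],
    [0.1, 0.0, 0.5, 1.0, 0.9],
    [0.0, 0.0, 0.5, 0.9, 1.0],
]

ensemble = generate_ensemble(S, num_members=10, k_min=2, k_max=3)
print([len(v[0]) for v in ensemble])  # number of factors chosen for each member
```

Because members differ in both k and starting point, the ensemble contains factor matrices of different widths, which is why the integration phase works on individual clusters (columns) rather than on whole factorizations.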

12 Algorithm: Integration Phase
Q. How do we combine an ensemble of factorizations to produce a final "consensus" clustering of the data?
Construct a dataset from all clusters present in the ensemble.
Apply "min-max" hierarchical clustering to produce a meta-clustering (i.e. a clustering of clusters).
[Figure: Ensemble of Factorizations (V1, V2, V3, V4) -> Build Matrix -> Matrix of Clusters as Columns (n x l) -> Transpose Matrix -> Matrix of Clusters as Rows (l x n) -> Min-Max Clustering -> Meta-Clustering]
NB: We can construct a soft hierarchical clustering of the original proteins from the meta-clustering. Take the mean vector for each tree node in the meta-clustering.
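The integration steps can be illustrated with a small sketch. Note the substitution: the slide specifies "min-max" hierarchical clustering, whereas the code below uses plain average linkage over cosine similarity as a simpler stand-in, so it shows the data flow rather than the authors' exact procedure. The tiny ensemble of two factor matrices and all function names are made up for the example.

```python
import math

def cluster_rows(ensemble):
    """Stack every factor-matrix column as one row: an l x n matrix of clusters."""
    rows = []
    for v in ensemble:
        rows.extend(list(col) for col in zip(*v))
    return rows

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def meta_cluster(rows):
    """Agglomerative clustering of base clusters (average linkage, a stand-in
    for min-max linkage). Returns tree nodes; the last node is the root."""
    nodes = [{"members": [i], "children": None, "weights": list(r)}
             for i, r in enumerate(rows)]
    active = list(range(len(nodes)))

    def linkage(a, b):
        pairs = [(i, j) for i in nodes[a]["members"] for j in nodes[b]["members"]]
        return sum(cosine(rows[i], rows[j]) for i, j in pairs) / len(pairs)

    while len(active) > 1:
        # Merge the two most similar active nodes.
        a, b = max(((x, y) for idx, x in enumerate(active) for y in active[idx + 1:]),
                   key=lambda p: linkage(*p))
        members = nodes[a]["members"] + nodes[b]["members"]
        # Soft membership of each protein in the new node: mean of member clusters.
        nodes.append({"members": members, "children": (a, b),
                      "weights": mean_vector([rows[i] for i in members])})
        active = [x for x in active if x not in (a, b)] + [len(nodes) - 1]
    return nodes

# Hypothetical ensemble: two factorizations of 4 proteins (k = 2 and k = 3).
V1 = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8]]
V2 = [[0.9, 0.0, 0.0], [0.7, 0.0, 0.3], [0.0, 0.9, 0.2], [0.0, 0.8, 0.1]]

rows = cluster_rows([V1, V2])   # l = 5 base clusters, each a vector over n = 4 proteins
nodes = meta_cluster(rows)
for node in nodes[len(rows):]:
    print(node["members"], [round(w, 2) for w in node["weights"]])
```

Each internal node carries a weight per protein, so a protein can have non-zero membership in several branches of the tree, which is what makes the resulting hierarchy "soft".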

13 Experimental Evaluation
We used an extensive and high-quality assembly of binary interactions for 2390 proteins (Collins et al., 2007).
This dataset provides a confidence score measuring the evidence that the proteins do indeed co-purify, referred to as Purification Enrichment (PE).
We apply Ensemble NMF to the corresponding PE score matrix $S$.
    $S_{ij}$: strength of evidence that there is a genuine positive or negative interaction between protein $i$ and protein $j$.
Baseline approach: We also applied average-linkage hierarchical clustering to the PE score matrix.

14 Evaluation: External Validation
External validation: compare a clustering to a "gold standard" classification, if available.
For protein interaction data we use functional groupings provided by the MIPS database.
We consider two well-known validation measures:
    Precision: Fraction of proteins in a given cluster that pertain to a specific MIPS class.
    Recall: Fraction of the proteins from a given MIPS class that were recovered in a given cluster.
Ideally we want a cluster analysis procedure that recovers known protein complex compositions with high precision and recall.
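As a small worked example of the two measures, the sketch below computes precision and recall for one cluster against one reference class, treating both as plain sets of protein identifiers. The identifiers are hypothetical, and the slides do not say how soft membership weights are turned into a set of cluster members, so the example assumes that step has already been done.

```python
def precision_recall(cluster, reference_class):
    """Precision and recall of one cluster against one reference class."""
    cluster, reference_class = set(cluster), set(reference_class)
    overlap = len(cluster & reference_class)
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(reference_class) if reference_class else 0.0
    return precision, recall

# Hypothetical identifiers: 3 of the 4 clustered proteins belong to a class of 5.
cluster = {"P1", "P2", "P3", "P4"}
reference_class = {"P1", "P2", "P3", "P5", "P6"}

precision, recall = precision_recall(cluster, reference_class)
print(precision, recall)  # 3/4 and 3/5
```

A cluster that is a strict subset of a class scores a precision of 1.0 with a recall below 1.0, while a very large cluster that swallows an entire class scores a recall of 1.0 with low precision.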

15 External Validation Results
The structures uncovered by Ensemble NMF seem to be far more informative than those identified using the baseline approach.
This is reflected in the substantially improved validation scores for both validation approaches, based on MIPS classes.

Table 1. Validation scores for 20 most significant clusters identified by Ensemble NMF on Collins protein interaction data.

    Class                                Precision  Recall
    20S proteasome                       1.00       0.88
    Anaphase promoting complex (APC)     1.00       0.80
    H+-transporting ATPase vacuolar      1.00       0.64
    Post-replication complex             1.00       1.00
    Pre-replication complex (pre-RC)     1.00       0.60
    Replication complex                  1.00       0.40
    Replication initiation complex       1.00       0.75
    Septin filaments                     1.00       1.00
    TRAPP complex                        1.00       0.70
    RNA polymerase I                     0.93       0.59
    SWI/SNF activator complex            0.89       0.89
    COPI                                 0.88       1.00
    Exocyst complex                      0.88       1.00
    Kornberg's mediator (SRB) complex    0.86       1.00
    Signal recognition particle (SRP)    0.86       1.00
    Gim complexes                        0.83       1.00
    TFIIIC                               0.83       1.00
    19/22S regulator                     0.78       1.00
    Arp2p/Arp3p complex                  0.71       1.00

Table 2. Validation scores for 20 most significant clusters identified by average-linkage hierarchical clustering on Collins protein interaction data.

    Class                                     Precision  Recall
    Geranylgeranyltransferase II              1.00       0.67
    v-SNAREs                                  1.00       0.33
    NEF3 complex                              0.50       0.14
    RNA polymerase I                          0.50       0.05
    RNase MRP                                 0.50       1.00
    RNase P                                   0.50       1.00
    Replication factor C complex              0.50       1.00
    mRNA splicing                             0.50       0.04
    Other respiration chain complexes         0.50       0.14
    RSC complex                               0.27       0.90
    SWI/SNF transcription activator complex   0.27       1.00
    SAGA complex                              0.14       0.91
    rRNA splicing                             0.13       0.15
    Dam1 protein complex                      0.10       1.00
    20S proteasome                            0.09       0.94
    RNA polymerase III                        0.08       0.92
    ADA complex                               0.07       0.83
    RNA polymerase II                         0.07       0.85
    TRAPP complex                             0.06       1.00

16 Evaluation: Discussion
Provision of meaningful hierarchical structure:
    The soft hierarchical clustering produced by Ensemble NMF lends itself to the identification of sub-complexes.
    Example: the COMA subcomplex (Ame1, Okp1, Mcm21, Ctf19) of the larger CTF19 central kinetochore complex can be resolved.
Identification of shared subunits and "moonlighting" proteins:
    Ensemble NMF successfully accommodates proteins that are present in two or more groupings.
    Example: The three chromatin remodelling complexes SWR-C, INO80, and NuA4 all contain actin and the actin-related protein Arp4.
Assignment of putative protein function:
    The uncharacterised protein YNR024W is grouped within a tree node that contains all twelve members of the exosome complex.
    YNR024W may be a previously undescribed component of this complex, and/or participate in these processes.

17 NMF Tree Browser Application
We developed the NMF Tree Browser, a cross-platform Java application for visually inspecting a soft hierarchy produced by the Ensemble NMF algorithm.
[Figure: application screenshot; labelled elements: Zoom controls, Tree root node, Currently selected node, Statistics for selected node, Class correlations for selected node, Membership weights for selected node]

18 NMF Tree Browser Application
The application includes a range of data exploration tools:
    Class sizes and correlations
    Precision & Recall scores
    List of most significant class/node combinations
    Membership weights for proteins in selected node
Clustering and Tree Browser software is freely available: http://mlg.ucd.ie/nmf

19 Conclusions
We have presented a new clustering approach that involves aggregating a collection of matrix factorizations generated using NMF-like techniques.
In evaluations on high-quality protein interaction data, we have observed that Ensemble NMF can:
    Improve our ability to identify groupings that accurately reflect known protein complex compositions.
    Help discover overlapping groups and multi-function or "moonlighting" proteins.
    Provide an intuitive, tree-like organisation of the data.
We have developed the NMF Tree Browser application, which supports cluster visualisation and labelling of previously uncharacterised proteins.
There are many other potential applications, e.g. discovering structures in genetic interaction data or gene microarray data.

20 References
Salwinski, L., Miller, C. S., Smith, A. J., Pettit, F. K., Bowie, J. U., and Eisenberg, D. (2004). The database of interacting proteins: 2004 update. Nucleic Acids Res, 32(Database issue), pp. 449-51.
Collins, S. R., Kemmeren, P., Zhao, X.-C., Greenblatt, J. F., Spencer, F., Holstege, F. C., Weissman, J. S., and Krogan, N. J. (2007). Towards a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol Cell Proteomics.
Strehl, A. and Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining partitionings. In Proc. Conference on Artificial Intelligence (AAAI 02), pp. 93-98.
Ding, C. and He, X. (2005). On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering. In Proc. SIAM International Conference on Data Mining (SDM 05), pp. 606-610.
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, pp. 788-91.
Opitz, D. W. and Shavlik, J. W. (1996). Generating accurate and diverse members of a neural-network ensemble. NIPS 8, pp. 535-541.