Protein Complex Identification by Supervised Graph Clustering

Protein Complex Identification by Supervised Graph Clustering
Yanjun Qi (1), Fernanda Balem (2), Christos Faloutsos (1), Judith Klein-Seetharaman (1,2), Ziv Bar-Joseph (1)
(1) School of Computer Science, Carnegie Mellon University; (2) University of Pittsburgh School of Medicine

Protein-Protein Interaction (PPI)
- Involved in most activities in the cell
  - Can be used to infer function
  - Combined to infer pathways and complexes
- Lots of experimental work
  - Mass spectrometry (Gavin et al. 2006; Krogan et al. 2006)
  - Yeast two-hybrid (Rual et al. 2005)
- Lots of computational work
  - Bayesian networks (Jansen et al. 2003)
  - Random Forest (Qi et al. 2006)

Protein Complexes
- A set of proteins working together as a super machine
- A complex member interacts with all or part of the group
- Correct identification leads to better understanding of function and mechanisms

Identifying complexes in a PPI graph
- Problem statement: given a PPI graph, identify the subsets of interacting proteins that form complexes
- Algorithms for this problem were used in the high-throughput mass spectrometry papers
- Many other algorithms have been suggested for this task
(Figure: a PPI network with a highlighted protein complex)

Methods for identifying complexes in a PPI graph focus on cliques
- Prior methods looked for dense subgraphs (cliques)
- Methods mainly differed in how the graph was segmented
- Most methods treated the graph as a binary graph, ignoring the weights on the edges; such weights can be obtained from both computational predictions and experimental data
- Example: MCODE (Bader et al. 2003) detects densely connected regions in PPI networks using vertex weights that represent local neighborhood density

...while many other topological structures are present
(Figure: projecting the computationally predicted PPI graph onto curated MIPS complexes shows structures beyond cliques)

Method

Key ideas for our method
- Utilize available data for training: supervised instead of unsupervised methods
- Summarize properties of the possible topological structures: use common subgraph features
- Take into account the biological properties of complexes: use information about the weight and size of proteins

Features used to model subgraphs
- Subgraph properties as features: various topological properties from the graph, plus biological attributes of complexes
- Can be computed on projections of known complexes onto our PPI graph

No. / Sub-Graph Property
1. Vertex Size
2. Graph Density
3. Edge Weight Ave / Var
4. Node Degree Ave / Max
5. Degree Correlation Ave / Max
6. Clustering Coefficient Ave / Max
7. Topological Coefficient Ave / Max
8. First Two Eigenvalues
9. Fraction of Edge Weight > Certain Cutoff
10. Complex Member Protein Size Ave / Max
11. Complex Member Protein Weight Ave / Max
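As a rough illustration (not the authors' code), a few of the simpler features above can be computed directly from a candidate node set and an edge-weight map; the function name and input format are assumptions:

```python
import statistics

def subgraph_features(nodes, weights):
    """Compute a subset of the listed subgraph features (illustrative sketch).

    nodes:   set of protein IDs in the candidate subgraph
    weights: dict mapping frozenset({u, v}) -> edge weight
    """
    n = len(nodes)
    # Keep only edges whose both endpoints lie inside the subgraph
    inside = {pair: w for pair, w in weights.items() if pair <= nodes}
    m = len(inside)
    # Graph density: fraction of possible node pairs that are connected
    density = 2.0 * m / (n * (n - 1)) if n > 1 else 0.0
    degree = {v: 0 for v in nodes}
    for pair in inside:
        for v in pair:
            degree[v] += 1
    wvals = list(inside.values())
    return {
        "vertex_size": n,
        "density": density,
        "edge_weight_avg": statistics.mean(wvals) if wvals else 0.0,
        "edge_weight_var": statistics.pvariance(wvals) if wvals else 0.0,
        "degree_avg": statistics.mean(degree.values()),
        "degree_max": max(degree.values()),
    }
```

A fully connected triangle, for instance, has density 1 and maximum degree 2.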

Example (five subgraphs)
Node Size: 3, 4, 4, 4, 5
Density: 1, 0.5, 1, 0.667, 0.4

Probabilistic model for complex features
We use a Bayesian Network to represent the joint probability distribution of the various features we use

Log likelihood ratio for complexes
We use a Bayesian Network (BN) to represent the joint probability distribution of the various features we use:
- C: whether this subgraph is a complex (1) or not (0)
- N: number of nodes in the subgraph
- X_i: properties of the subgraph

L = log [ p(c=1 | n, x_1, x_2, ..., x_m) / p(c=0 | n, x_1, x_2, ..., x_m) ]

Learning
- BN parameters were learned using MLE
- Trained from known complexes and randomly sampled subgraphs of the same size (non-complexes)
- Continuous features are discretized
- A Bayesian prior smooths the multinomial parameters
- Candidate subgraphs are evaluated with the log-ratio score:

L = log [ p(c=1 | n, x_1, ..., x_m) / p(c=0 | n, x_1, ..., x_m) ]
  = log [ p(c=1) p(n | c=1) prod_{k=1..m} p(x_k | n, c=1) ]
    - log [ p(c=0) p(n | c=0) prod_{k=1..m} p(x_k | n, c=0) ]
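The factored score above can be sketched in a few lines, given smoothed multinomial estimates. This is a simplified illustration, not the paper's implementation: the parameter layout (`model`) is invented, and for brevity each feature is conditioned on the class only, whereas the slide conditions it on the node count n as well:

```python
import math

def log_ratio_score(n, xs, model):
    """Log-ratio score L for a candidate subgraph (illustrative sketch).

    n:     number of nodes in the subgraph
    xs:    discretized feature values x_1..x_m
    model: dict of smoothed multinomial estimates:
           model["prior"][c]   = p(c)
           model["size"][c][n] = p(n | c)
           model["feat"][k][c] = {value: p(x_k | c)}  (simplified)
    """
    def loglik(c):
        s = math.log(model["prior"][c])
        s += math.log(model["size"][c].get(n, 1e-6))   # floor for unseen sizes
        for k, x in enumerate(xs):
            s += math.log(model["feat"][k][c].get(x, 1e-6))
        return s
    # L = log p(c=1 | ...) - log p(c=0 | ...); shared evidence term cancels
    return loglik(1) - loglik(0)
```

A positive L means the subgraph looks more like a complex than a random subgraph under the model.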

Searching for high-scoring complexes
- Given our likelihood function, we would like to find high-scoring complexes (maximizing the log likelihood ratio)
- Lemma: identifying the set of maximally scoring subgraphs in our PPI graph is NP-hard
- We thus employ an iterated simulated annealing search on the log-ratio score

Evaluation and Results

Experimental Setup
Positive training data (complexes):
- Set 1: MIPS yeast complex catalog, a curated set of ~100 protein complexes
- Set 2: TAP06 yeast complex catalog (Gavin et al. 2006), a reliable experimental set of ~150 complexes
- Complex size (number of nodes) follows a power law
Negative training data (pseudo non-complexes):
- Generated from randomly selected nodes in the graph
- Size distribution is similar to that of the positive complexes (for each of the two sets)
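Matching the negative size distribution to the positives can be done by reusing the positive complex sizes when drawing random node sets; this small sketch is one plausible way to do it (function name and signature are assumptions):

```python
import random

def sample_non_complexes(all_proteins, positive_sizes, seed=0):
    """Generate pseudo non-complexes: one random node set per positive
    complex, with the same size, so the size distributions match.
    (Illustrative sketch, not the authors' sampling code.)
    """
    rng = random.Random(seed)
    proteins = list(all_proteins)
    return [set(rng.sample(proteins, k)) for k in positive_sizes]
```

For example, positive sizes [3, 5, 4] yield three random subsets of sizes 3, 5, and 4.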

Data Distribution (figures: node size distribution; feature distribution)

Evaluation
- Train on Set 1 and evaluate on Set 2, and vice versa
- Precision / Recall / F1 measures
- A detected cluster matches a known complex if, with A the number of proteins only in the cluster, B the number only in the complex, and C the number shared between the two:
  C / (A + C) >= p  and  C / (B + C) >= p
- We set the threshold p to 50%
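The overlap criterion translates directly into code, since A + C is just the cluster size and B + C the complex size. A minimal sketch follows; note the precision/recall convention here (fraction of clusters matching some complex, fraction of complexes matched by some cluster) is a common one but is an assumption about the paper's exact definition:

```python
def matches(cluster, complex_, p=0.5):
    """Overlap criterion: shared proteins C must cover at least a
    fraction p of the cluster (C/(A+C)) and of the complex (C/(B+C))."""
    shared = len(cluster & complex_)
    return shared / len(cluster) >= p and shared / len(complex_) >= p

def precision_recall_f1(clusters, complexes, p=0.5):
    """Precision over detected clusters, recall over known complexes."""
    tp_clusters = sum(any(matches(c, k, p) for k in complexes) for c in clusters)
    tp_complexes = sum(any(matches(c, k, p) for c in clusters) for k in complexes)
    precision = tp_clusters / len(clusters) if clusters else 0.0
    recall = tp_complexes / len(complexes) if complexes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```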

Performance Comparison: Training on MIPS
- On the yeast predicted PPI graph (Qi et al. 2006, ~2000 nodes)
- Compared to three other methods:
  - MCODE, which looks for highly interconnected regions (Bader et al. 2003)
  - Search relying on the density feature only
  - Same set of features using an SVM rather than a BN

Training on MIPS, testing on TAP06:
Method    Precision  Recall  F1
Density   0.217      0.409   0.283
MCODE     0.293      0.088   0.135
SVM       0.247      0.377   0.298
BN        0.312      0.489   0.381

Performance Comparison: Training on TAP
Training on TAP06, testing on MIPS; compared to the same three methods:
Method    Precision  Recall  F1
Density   0.143      0.515   0.224
MCODE     0.146      0.063   0.088
SVM       0.176      0.379   0.240
BN        0.219      0.537   0.312

MCODE tends to find a few big clusters.

Examples of identified new complexes (8/5/2008; figure: edge weight color-coded)

Conclusions and future work
- A supervised method can identify complexes that are missed when using strong topological assumptions
- Utilizing edge weights leads to higher precision and recall
- Can be used whenever weight information is available
- Further improvements: a better local search algorithm; other features

Acknowledgements
- Funding: NSF grants CAREER 0448453 and CAREER CC044917; NIH NO1 AI-5001
- ISMB Travel Fellowship (though unable to use it)
www.cs.cmu.edu/~qyj/supercomplex/index.html

Heuristic Local Search
Search:
- Accept the new cluster candidate if it has a higher score
- If lower, accept with probability exp((l' - l) / T), where l' is the new score, l the current score, and T a temperature parameter that decreases by a scaling factor alpha after each round
- An accepted cluster must score higher than a threshold
When to stop (k-th round):
- The neighbor set N(i, k+1) is empty
- The number of rounds since the last score improvement is larger than a specified number
- k is larger than a specified number
Expanding the current cluster:
- Generate a subset V* from all neighbors of the current cluster
- Take the top M nodes ranked by their maximum weight to the current cluster
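The loop above can be sketched as follows. This is a simplified illustration of the annealing step, not the authors' implementation: the parameter names (t0, alpha, rounds, m) and the callback interface are assumptions, and the score threshold on accepted clusters is omitted for brevity:

```python
import math
import random

def anneal_search(seed_node, neighbors, score, t0=1.0, alpha=0.9,
                  rounds=100, m=5, seed=0):
    """Grow a cluster from a seed node by simulated annealing (sketch).

    neighbors(cluster) -> candidate nodes adjacent to the cluster,
                          ranked by max weight to the cluster
    score(cluster)     -> log-ratio score L of the cluster
    """
    rng = random.Random(seed)
    cluster = {seed_node}
    best, best_score = set(cluster), score(cluster)
    t = t0
    for _ in range(rounds):
        cand = neighbors(cluster)[:m]   # top-M expansion candidates V*
        if not cand:                    # N(i, k+1) is empty: stop
            break
        new = cluster | {rng.choice(cand)}
        l_old, l_new = score(cluster), score(new)
        # Accept improvements; otherwise accept with prob exp((l' - l) / T)
        if l_new > l_old or rng.random() < math.exp((l_new - l_old) / t):
            cluster = new
            if l_new > best_score:
                best, best_score = set(new), l_new
        t *= alpha                      # cool the temperature each round
    return best, best_score
```

With a toy score that peaks at clusters of size 3, the search settles on a three-node cluster.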

Algorithm
Input: weighted protein-protein interaction network; a training set of complexes and non-complexes
Output: list of detected clusters
1. Complex model parameter estimation: extract features from positive and negative training examples; calculate MLE parameters of the multinomial distributions
2. Search for complexes: starting from the seeding nodes, apply simulated annealing search to identify candidate complexes
3. Output detected clusters ranked by log-ratio score