Jeremy Chang Identifying protein protein interactions with statistical coupling analysis

Similar documents
Gürol M. Süel, Steve W. Lockless, Mark A. Wall, and Rama Ra

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Computational methods for predicting protein-protein interactions

Network Alignment 858L

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Orthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Week 10: Homology Modelling (II) - HHpred

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

Quantitative Stability/Flexibility Relationships; Donald J. Jacobs, University of North Carolina at Charlotte Page 1 of 12

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Prediction of double gene knockout measurements

Some Problems from Enzyme Families

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Pairwise & Multiple sequence alignments

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Probability models for machine learning. Advanced topics ML4bio 2016 Alan Moses

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

BLAST. Varieties of BLAST

arxiv: v3 [q-bio.bm] 16 Dec 2014

Exercise 5. Sequence Profiles & BLAST

Quantifying sequence similarity

Introduction to Bioinformatics

Genomics and bioinformatics summary. Finding genes -- computer searches

Computational Biology: Basics & Interesting Problems

Analysis of correlated mutations in Ras G-domain

Towards the Prediction of Protein Abundance from Tandem Mass Spectrometry Data

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

CSCE555 Bioinformatics. Protein Function Annotation

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Homology Modeling. Roberto Lins EPFL - summer semester 2005

BLAST: Target frequencies and information content Dannie Durand

Bioinformatics Chapter 1. Introduction

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Analysis of N-terminal Acetylation data with Kernel-Based Clustering

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure

SUPPLEMENTARY INFORMATION

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

EECS730: Introduction to Bioinformatics

Protein tertiary structure prediction with new machine learning approaches

Local Alignment Statistics

IEEE TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 1. Graphical Models of Residue Coupling in Protein Families

K-means-based Feature Learning for Protein Sequence Classification

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Introduction to Bioinformatics Online Course: IBT

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

Prediction of protein function from sequence analysis

Training algorithms for fuzzy support vector machines with nois

Xia Ning,*, Huzefa Rangwala, and George Karypis

proteins Zhenxing Liu, 1 Jie Chen, 1 and D. Thirumalai 1,2 * INTRODUCTION

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

7.36/7.91 recitation CB Lecture #4

Example of Function Prediction

Supplementary Materials for R3P-Loc Web-server

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Orthology Part I: concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona

Introduction to Bioinformatics

Improving Protein Function Prediction using the Hierarchical Structure of the Gene Ontology

Supplemental Material can be found at:

Lecture 5,6 Local sequence alignment

Supplementary Materials for mplr-loc Web-server

Protein Structure Prediction

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

A New Similarity Measure among Protein Sequences

Linear predictive coding representation of correlated mutation for protein sequence alignment

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

BIOINFORMATICS: An Introduction

CS612 - Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics

Algorithms in Bioinformatics

Radial Basis Function Neural Networks in Protein Sequence Classification ABSTRACT

An Introduction to Sequence Similarity ( Homology ) Searching

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

Basics of protein structure

Protein Structure Prediction, Engineering & Design CHEM 430

User Guide for LeDock

Graphical Models of Residue Coupling in Protein Families

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

The Quantum Landscape

In-Depth Assessment of Local Sequence Alignment

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Tools and Algorithms in Bioinformatics

Basic Local Alignment Search Tool

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Evolutionary dynamics of populations with genotype-phenotype map

Building 3D models of proteins

Practical considerations of working with sequencing data

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Bioinformatics and BLAST

We used the PSI-BLAST program ( to search the

Graph Alignment and Biological Networks

Transcription:

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis Abstract: We used an algorithm known as statistical coupling analysis (SCA) 1 to create a set of features for building a biologically relevant and evolutionarily stable protein interaction network. The heart of the algorithm is to use protein sequence information from different organisms to determine which residues of a protein coevolve: the idea is that coevolving proteins probably also interact in a way that is important for an organism's survival. We have concatenated the sequences of three proteins (two of which are known to interact with each other; the third was a negative control) and, treating them as a single protein sequence, used SCA to see which residues coevolve. We found that the SCA analysis on our sequence data produced noisy results and we were not able to identify significant interactions among the proteins. We conclude with suggestions on how to improve this method. Introduction: One of the central challenges of biology is creating a map of all biologically relevant protein protein interactions. High throughput screens may be useful in giving a set of possible interactions, but it is unclear which interactions are actually important parts of an organism s survival. One method, known as statistical coupling analysis, has been shown to be successful in singling out the most important interactions within a protein,3. The method is based on a multiple sequence alignment of orthologs of a given protein across many species. The statistical coupling between two sites is given essentially by how much the probability for finding a certain amino acid site A changes upon fixing the amino acid at site B to a given amino acid. In order to extend this technique proteome wide, we will concatenate the sequences of many proteins and treat them as a single superprotein, and then perform SCA on this superprotein. What follows is a brief introduction to SCA; for further explanation, please consult Ranganathan (3). Consider the frequency distribution of amino acids at a given position (say, position i) in our superprotein (for example, position i could be valine 3% of the time and cystein 7% of the time in our multiple sequence alignment). Let s say another site in the protein (site j) also has a given distribution of amino acids. If we now examine only the sequences in which the amino acid at position i is cystein, the distribution at position j may or may not change. If it changes dramatically, this would imply that the two sites are statistically coupled; if not, they are not coupled. We can quantify the amount of coupling with the following formula: Here, ddg stat i,j is the statistical coupling energy between positions i and j in the multiple sequence alignment. P x i dj is the probability of observing amino acid x at position i, given a perturbation dj (i.e. selecting only the sequences where a certain position is fixed to a certain amino acid). P x MSA dj is the probability of observing amino acid x across the total multiple sequence alignment. P x i and P x MSA are the equivalent probabilities without the perturbation. kt * is an arbitrary energy value, which we have used for normalization.

We applied SCA to an MSA of three proteins: MAP kinase 1 (MAPK1), MAP kinase kinase 1 (MAPK1), and p38 MAP kinase (p38) in order to determine which residues were statistically coupled. Methods and results: We downloaded sequences for our three proteins of interest from PSI BLAST by searching for the mouse versions. We removed duplicate homologs from a given species simply by randomly selecting one to keep, and removing the remaining sequences. We then did a multiple sequence alignment on each of the proteins separately, and then concatenated the resulting alignments by matching up the sequences by species. This resulted in an alignment with 11 orthologs (from 11 species), with a total length of 3739 amino acids. We then proceeded with the SCA by identifying 39 perturbations. The criteria for selecting perturbations were that there must be at least sequences that contain a certain amino acid at a certain site and, after the perturbation, there must be at least sequences remaining. We performed SCA with these perturbations, and finally produced the graphs displaying ddg for each position given a perturbation. In the MSA, MAPK1 is located at amino acids #1 11; MAPK1 is located at #116 68, and p38 is located at #68 3739. We found that the average ddg values did not change significantly across these three proteins, indicated that we did not identify any significant interactions. Each plot in Figure 1 shows the ddg values across all positions for a given perturbation. We had hoped to be able to (1) cluster our results to identify amino acids that coevolve across all perturbations and () use supervised learning techniques to train a classifier on what properties the matrix of ddg values of two interacting proteins has, but because the ddg values for all proteins had similar values, we decided not to proceed. Conclusions: We believe that the main reasons we were unable to recover significant interactions were due to the original quality of the multiple sequence alignment. To improve it, we would need (1) a larger number of sequences and () cleaner sequences, in the sense that PSI BLAST returns sequences for proteins that are not necessarily orthologous to the query sequence, and therefore some of our MSA is in fact composed of unrelated, junk proteins that happen to share sequence homology with our query sequence. The alignment process itself can also be improved. The SCA algorithm assumes that each position in the alignment refers to the same 3D position in all of the amino acids. Without adjusting the alignments so that they correspond to the (most likely) 3D structure, we can t be confident in the results from SCA. Finally, it may be worth trying to use mutual information as our coevolution metric, rather than SCA. Both methods have given similar results, and it is unclear which one gives better results. Please email Jeremy Chang (jbchang@stanford.edu) for the multiple sequence alignment and/or MATLAB code used in this project. References 1. Lockless, S.W. & Ranganathan, R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science 86, 9 9(1999).

. Socolich, M. et al. Evolutionary information for specifying a protein fold. Nature 37, 1 8(). 3. Russ, W.P. et al. Natural like function in artificial WW domains. Nature 37, 79 83().. Zhang, X. et al. An Allosteric Mechanism for Activation of the Kinase Domain of Epidermal Growth Factor Receptor. Cell 1, 1137 119(6).. Ben Hur, A. et al. Support Vector Machines and Kernels for Computational Biology. PLoS Comput Biol, e1173(8).

Figure 1 (two pages) Perturb. no.1 1 Perturb. no. Perturb. no. Perturb. no. 1 Perturb. no.3 Perturb. no.6 1 Perturb. no.7 Perturb. no.8 Perturb. no.9 Perturb. no.1 1 Perturb. no.11 Perturb. no.1 Perturb. no.13 Perturb. no.1 1 Perturb. no.1 Perturb. no.16 Perturb. no.17 Perturb. no.18 Perturb. no.19 Perturb. no. Perturb. no.1

Perturb. no.1 Perturb. no. 1 Perturb. no. Perturb. no. 1 Perturb. no.3 1 Perturb. no.6 Perturb. no.7 Perturb. no.8 Perturb. no.9 Perturb. no.3 Perturb. no.31 Perturb. no.3 Perturb. no.33 1 Perturb. no.3 Perturb. no.3 1 Perturb. no.36 Perturb. no.37 1 Perturb. no.38 Perturb. no.39