CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Similar documents
Computational methods for predicting protein-protein interactions

CSCE555 Bioinformatics. Protein Function Annotation

Predicting Protein Functions and Domain Interactions from Protein Interactions

Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem

Introduction to Bioinformatics

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Week 10: Homology Modelling (II) - HHpred

Computational Genomics. Reconstructing dynamic regulatory networks in multiple species

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

Lecture 7 Sequence analysis. Hidden Markov Models

Computational approaches for functional genomics

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Jeremy Chang Identifying protein protein interactions with statistical coupling analysis

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Computational Systems Biology

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Some Problems from Enzyme Families

PREDICTION OF PROTEIN BINDING SITES BY COMBINING SEVERAL METHODS

Computational Biology: Basics & Interesting Problems

Machine learning methods to infer drug-target interaction network

Data Mining in Bioinformatics HMM

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Computational Biology From The Perspective Of A Physical Scientist

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

EECS730: Introduction to Bioinformatics

PROTEIN FUNCTION PREDICTION WITH AMINO ACID SEQUENCE AND SECONDARY STRUCTURE ALIGNMENT SCORES

Information Extraction from Text

Sequence Alignment Techniques and Their Uses

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Computational Genomics and Molecular Biology, Fall

Dynamic Approaches: The Hidden Markov Model

Comparative Network Analysis

SUPPLEMENTARY INFORMATION

Computational Molecular Biology (

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Computational Genomics and Molecular Biology, Fall

Bioinformatics. Dept. of Computational Biology & Bioinformatics

This document describes the process by which operons are predicted for genes within the BioHealthBase database.

Basic Local Alignment Search Tool

Motif Extraction and Protein Classification

Hidden Markov Models

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

BIOINFORMATICS. Improved Network-based Identification of Protein Orthologs. Nir Yosef a,, Roded Sharan a and William Stafford Noble b

Bioinformatics Chapter 1. Introduction

Graphical Models Another Approach to Generalize the Viterbi Algorithm

Markov Chains and Hidden Markov Models. = stochastic, generative models

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

Biological Networks. Gavin Conant 163B ASRC

Support vector machines, Kernel methods, and Applications in bioinformatics

Procedure to Create NCBI KOGS

METABOLIC PATHWAY PREDICTION/ALIGNMENT

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Markov Random Field Models of Transient Interactions Between Protein Complexes in Yeast

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Hidden Markov Models

K-means-based Feature Learning for Protein Sequence Classification

Learning in Bayesian Networks

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

An Introduction to Sequence Similarity ( Homology ) Searching

Metabolic networks: Activity detection and Inference

A Study of Network-based Kernel Methods on Protein-Protein Interaction for Protein Functions Prediction

Genomics and bioinformatics summary. Finding genes -- computer searches

Hidden Markov Models (I)

O 3 O 4 O 5. q 3. q 4. Transition

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Hidden Markov Models. Terminology, Representation and Basic Problems

Template Free Protein Structure Modeling Jianlin Cheng, PhD

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Overview of Research at Bioinformatics Lab

Mining and classification of repeat protein structures

Genome Annotation Project Presentation

BLAST. Varieties of BLAST

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Similarity searching summary (2)

Improving domain-based protein interaction prediction using biologically-significant negative dataset

SUPPLEMENTARY INFORMATION

BIOINFORMATICS: An Introduction

Applications of Hidden Markov Models

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Class 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Improved network-based identification of protein orthologs

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

IMPORTANCE OF SECONDARY STRUCTURE ELEMENTS FOR PREDICTION OF GO ANNOTATIONS

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Hidden Markov Models

Information Extraction from Biomedical Text. BMI/CS 776 Mark Craven

Structure to Function. Molecular Bioinformatics, X3, 2006

Comparison of Protein-Protein Interaction Confidence Assignment Schemes

Large-Scale Genomic Surveys

Gene function annotation

Drug-Target Interaction Prediction by Learning From Local Information and Neighbors

Integration of functional genomics data

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Transcription:

CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1

Background Proteins do not function as isolated entities. Protein-Protein interaction is essential to cellular functions. When two proteins interact, it can mean: They physically interact They are enzymes catalyzing successive reactions in a pathway One protein regulates expression of the other CISC636, F16, Lec22, Liao 2

Protein-Protein Interaction plays essential roles in cellular processes

PPI network reconstruction is a central task in systems biology Given a pair of proteins: 1. Do they interact? (identify de novo pathways, cross talk) 2. How do they interact, i.e., which amino acids are involved in interaction? (design mutants to modulate PPI)

Data sources Yeast 2-hybrid system 2-D gel + MSMS Gene expression (DNA microarray) Localization data Phylogenetic profiles Structural information at binding sites Sequences? CISC636, F16, Lec22, Liao 5

Shoemaker & Panchenko, 2007 PLoS Computational Biology CISC636, F16, Lec22, Liao 6

CISC636, F16, Lec22, Liao 7

Shoemaker & Panchenko, 2007 PLoS Computational Biology CISC636, F16, Lec22, Liao 8

CISC636, F16, Lec22, Liao 9

Use of sequence information (gene cluster) Moreno-Hagelsieb G, Collado-Vides J (2002) A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18: S329 S336 CISC636, F16, Lec22, Liao 10

Use of sequence information (Rosetta stone, Gene fusion) Marcotte et al, Science (1999) CISC636, F16, Lec22, Liao 11

CISC636, F16, Lec22, Liao 12

Use of structural information Structural Compatibility Trypsin inhibitor Thermitase CISC636, F16, Lec22, Liao 13

Proteins Interact via Domains Chemical bonds are formed between amino acids across interface at two interacting proteins. Domain A Protein X Protein Y Domain B Residues at interface tend to be more conserved due to selection pressure during evolution. 14

Not all residues in domain directly participate in interaction Wilson et al, BMC Dev. Biol. (2011) RING domain: cysteines residues in red interact with Zn++ ions to stabilize the ring finger structure; residues colored blue are on the interacting interface.

Profile Hidden Markov Models capturing interaction domains P(x ) : probability that sequence x contains a domain described by the model. Viterbi algorithm can align x against the model to annotate interacting residues. iphmm H C Y W Friedrich et al, Bioinformatics 2006 16

How reliable is the prediction? If P(X A) = 0.8, P(Y B) = 0.8, probability X and Y interact via domains A and B is P(X A) P(Y B) = 0.8 x 0.8 = 0.64. From Domain to Domain-Domain Interaction Given a pair of proteins: 1. Do they interact? 2. How do they interact, i.e., which amino acids are involved in interaction? Simple Solution: i. Query the sequences against domain databases like Pfam. ii. If Protein X contains domain A, Protein Y contains domain B, and it is known that domain A interacts with domain B, then Protein X interact with Protein Y.

From Domain to Domain-Domain Interaction to Protein- Protein Interaction Gonzalez & Liao, BMC Bioinformatics 2010

Sequence A Given a pair of proteins: 1. Do they interact? 2. How do they interact, i.e., which amino acids are involved in interaction? Sequence B What is residue contact matrix? B 1 B 2 B 3 B 4 B 5 B n 0 1 0 0 0 0 A 1 A 2 A 3 0 0 0 0 0 0 0 0 0 0 0 0 A 5 A m 0 0 0 0 1 0 0 0 0 0 0 0 20

Predicting residue contact matrix for a pair of interacting proteins Gonzalez, Liao, Wu, Bioinformatics 2013 21

24

Knowledge Leverage and Integration for Better Learning Protein X Protein B iphmm Domain & site iphmm+s VM DDI & PPI Contact matrix CMipHMM

Method: iphmm bit Integrated machine learning classifier with contact matrix prediction and iphmm site prediction. Classifier: Logistic Regression Pr( Y 1 X 1,..., X k ) 1 exp[ ( X 0 1 1 1 X 2 2... X k k )] 26

Results The data set contains 72 DDI families collected from 3DID. Each has 10 ~ 20 member sequences, with domain length < 150 residues. CM-ipHMM vs. iphmm : p-value, 4.36E-77; CM-ipHMM vs. CM-Only : p-value, 9.32E-10; Ground-truth-CM: Replace predicted contact matrix with ground-truth

Basic idea of our method

Weight Optimization by Linear Programming (WOLP)

Weight Optimization by Linear Programming (WOLP)

Algorithm Golden standard connected network: G(V, E) Connected training network: G (E) tn G (E) vn G (E) = tt G (E) G (E) tn G (E) vn G (E) = f tt

Experiments on network inference with real data Data description of DIP yeast PPI networks(release 20150101) Largest connected component: G(V, E) = G(5,030, 22,394) 1 Connected training network: G tn (V, E) = (5,030, 5,394) 2 Validation edge set: G vn (V, E) = (?, 1,000) 3 Testing edge set: G tt (V, E) = (?, 16,000)

Experiments on network inference with real data) Feature kernels [39] 1 K Jaccard [15]: This kernel measure the similarity of protein pairs i, j in term of neighbors(i) neighbors(j)/neighbors(i) neighbors(j). 2 K SN : It measures the total number of neighbors of protein i and j, KSN = neighbors(i) + neighbors(j). 3 K B : It is a sequence-based kernel matrix that is generated using the BLAST. 4 K E : This is a gene co-expression kernel matrix constructed entirely from microarray gene expression measurements. 5 K Pfam : This is a generalization of the previous pairwise comparison-based matrices in which the pairwise comparison scores are replaced by expectation values derived from hidden Markov models (HMMs) in the Pfam database. [39] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20(suppl 1):i363 i370, 2004.