Protein Threading. BMI/CS 776 Colin Dewey Spring 2015

Similar documents
Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Bio nformatics. Lecture 23. Saad Mneimneh

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

Building 3D models of proteins

Multiple Whole Genome Alignment

Comparative Network Analysis

Protein Structure Prediction, Engineering & Design CHEM 430

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Week 10: Homology Modelling (II) - HHpred

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Protein Structure Prediction

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Multiple Sequence Alignment

CAP 5510 Lecture 3 Protein Structures

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

Protein Structures. 11/19/2002 Lecture 24 1

Protein Threading Based on Multiple Protein Structure Alignment

Bayesian Models and Algorithms for Protein Beta-Sheet Prediction

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

RNA and Protein Structure Prediction

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Protein Structure Determination

Programme Last week s quiz results + Summary Fold recognition Break Exercise: Modelling remote homologues

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Template Free Protein Structure Modeling Jianlin Cheng, PhD

Computational Genomics and Molecular Biology, Fall

Bioinformatics: Secondary Structure Prediction

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Computational Genomics and Molecular Biology, Fall

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

Pairwise & Multiple sequence alignments

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Supporting Online Material for

In-Depth Assessment of Local Sequence Alignment

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

BMI/CS 576 Fall 2016 Final Exam

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Algorithms in Bioinformatics

COMP 598 Advanced Computational Biology Methods & Research. Introduction. Jérôme Waldispühl School of Computer Science McGill University

A Method for Aligning RNA Secondary Structures

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

Multiple Sequence Alignment (MAS)

Bridging efficiency and capacity of threading models.

Lecture 18 Generalized Belief Propagation and Free Energy Approximations

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

Lecture 12 : Graph Laplacians and Cheeger s Inequality

Sequence Alignment (chapter 6)

Bio nformatics. Lecture 3. Saad Mneimneh

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

1.5 Sequence alignment

Data Structures and Algorithms

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

Applications of Hidden Markov Models

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

BLAST: Target frequencies and information content Dannie Durand

Bioinformatics. Macromolecular structure

Protein Structure Prediction using String Kernels. Technical Report

Evolutionary Tree Analysis. Overview

Texas A&M University

Introduction to Evolutionary Concepts

Phylogenetic inference

STRUCTURAL BIOINFORMATICS I. Fall 2015

Basics of protein structure

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Computational Biology

Pairwise sequence alignment

Scoring Matrices. Shifra Ben-Dor Irit Orr

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Variable Elimination: Algorithm

Sequence Database Search Techniques I: Blast and PatternHunter tools

CS612 - Algorithms in Bioinformatics

BINF 730. DNA Sequence Alignment Why?

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

CSE 202 Dynamic Programming II

Chem 14C Lecture 1 Spring 2017 Final Exam Part B Page 1

Comparing whole genomes

BMI/CS 776 Lecture 4. Colin Dewey

Lecture 5,6 Local sequence alignment

Modeling for 3D structure prediction

Structure-Based Comparison of Biomolecules

Dynamic Programming. Weighted Interval Scheduling. Algorithmic Paradigms. Dynamic Programming

Structural biomathematics: an overview of molecular simulations and protein structure prediction

Markov Chains and Hidden Markov Models. = stochastic, generative models

Transcription:

Protein Threading BMI/CS 776 www.biostat.wisc.edu/bmi776/ Colin Dewey cdewey@biostat.wisc.edu Spring 2015

Goals for Lecture the key concepts to understand are the following the threading prediction task the threading search task template models branch and bound search for threading

Protein Threading generalization of homology modeling homology modeling: align sequence to sequence threading: align sequence to structure (templates) key ideas limited number of basic folds found in nature amino acid preferences for different structural environments provide sufficient information to choose among folds

A Core Template protein A threaded on template template protein B threaded on template core secondary structure segments loops Figure from R. Lathrop et al. Analysis and Algorithms for Protein Sequence-Structure Alignment, in Computational Methods in Molecular Biology, Salzberg et al. editors, 1998.

Components of a Threading Approach library of core fold templates objective function to evaluate any particular placement of a sequence in a core template ü method for searching over space of alignments between sequence and each core template method for choosing the best template given alignments

Task Definition: Prediction Via Threading given: a protein sequence a library of core templates return: the best alignment of the sequence to a template

given: a protein sequence a single template Task Definition: Threading Search return: the best alignment of the sequence to the template

Threading Objective Functions possible sequence/template alignments are scored using a specified objective function the objective function scores the sequence/structure compatibility between sequence amino acids their corresponding positions in a core template it takes into account factors such as a.a. preferences for solvent accessibility a.a. preferences for particular secondary structures interactions among spatially neighboring a.a. s

Core Template with Interactions Figure from R. Lathrop et al. Analysis and Algorithms for Protein Sequence-Structure Alignment. first amino acid in L interacts with last amino acid in L last amino acid in K interacts with first amino acid in L small circles represent amino acid positions thin lines indicate interactions represented in model

One Threading Figure from R. Lathrop et al. Analysis and Algorithms for Protein Sequence-Structure Alignment. t a threading can be represented as a vector, where each element indicates the index of the amino acid placed in the first position of each core segment

Threading Sets b i d i bl dl a set of potential threadings can be represented by bounds on the first position of each core segment T = { t b i t i d i, b j t j d j, b k t k d k, b l t l d } l

Possible Threadings Figure from R. Lathrop et al. Analysis and Algorithms for Protein Sequence-Structure Alignment. finding the optimal alignment is NP-hard in the general case where there are variable length gaps between the core segments, and the objective function includes interactions between neighboring (in 3D) amino acids

A General Pairwise Objective Function the general objective function with pairwise interactions is: f ( t ) = g 1 (i, t i ) + g 2 (i, j, t i, t j ) i j>i We wish to minimize this function i scores for scores for segment interactions individual segments

Searching the Space of Alignments if interaction terms between amino acids are not allowed dynamic programming will find optimal alignment efficiently if interaction terms allowed heuristic methods fast might not find the optimal alignment exact methods (e.g. branch & bound) will find the optimal alignment might take exponential time

Branch and Bound Abstractly 3 components A data structure for compactly representing a set of potential solutions May be a very large set An algorithm for computing a lower bound on the score obtained by any member of a set In general, should not explicitly examine all members of the set An algorithm for splitting a set into subsets

Branch and Bound Search initialize Q repeat l set in Q if l else contains only 1 threading return split with one entry representing the set of l l into smaller subsets compute lower bound for each subset put subsets in with lowest lower bound Q sorted by lower bound all threadings

Branch and Bound Illustrated a hypothetical branch and bound search each circle illustrates the space of possible threadings solid lines indicate splits made in previous steps dashed lines indicate splits made in current step numbers indicate lower bounds for newly created subsets arrows show the set that has been split Figure from R. Lathrop and T. Smith. Global Optimum Protein Threading with Gapped Alignment and Empirical Pair Score Functions. Journal of Molecular Biology 255:641-665, 1996.

Branch and Bound Search there are two key issues to address in instantiating this approach how to compute the lower bound for a set of threadings how to split a threading set into subsets these aspects determine the expected efficiency of the branch and bound search

A Simple Lower Bound min t T f ( t $ ' ) = min & g 1 (i, t i ) + g 2 (i, j, t i, t j )) t T i %& j>i () % ( ' min g 1 (i,x)+ min g b i x d i b i y d 2 (i, j,y,z) * ' i * j>i & b j z d j ) i objective function lower bound in a nutshell: calculate minimum over each term separately

A Better Lower Bound min t T min t T f i ( t ) % ' ' g 1 (i, t i ) + g 2 (i 1, i, t i 1, t i ) + min u T ' & j>i+1 j<i 1 ( 1 2 g * 2(i, j, t i, u j )* * ) interaction with preceding segment best case interaction with other segments [Lathrop & Smith, JMB 96]

Splitting a Threading Set a threading set is split by choosing a single core segment a split point in the segment s i a simple method split the segment having the widest interval, i.e. max d b i [ ] i i s i choose the split point as the value that results in the lower bound for the set

Branch and Bound: Splitting a Set Figure from R. Lathrop et al. Analysis and Algorithms for Protein Sequence-Structure Alignment. T = { t b i t i d i, b j t j d j, b k t k d k, b l t l d } l splitting into three sets s i 1 2 3 { } { } { } T = t b i t i < s i, T = t t i = s i, T = t s i < t i d i,

Threading Example Suppose we have three segments (i, j, k), each of which includes three amino acids. For a given sequence there are three possible starting positions for each segment. Suppose that you are given the following values for the scores of the individual segments and the scores for segment interactions. g1(i,2) = 5 g1(j,8) = 9 g1(k,13) = 3 g1(i,3) = 2 g1(j,9) = 7 g1(k,14) = 4 g1(i,4) = 8 g1(j,10) = 6 g1(k,15) = 1 g2(i,j,2,8) = 1 g2(i,j,2,9) = 2 g2(i,j,2,10) = 2 g2(i,j,3,8) = 5 g2(i,j,3,9) = 6 g2(i,j,3,10) = 4 g2(i,j,4,8) = 7 g2(i,j,4,9) = 3 g2(i,j,4,10) = 4 g2(j,k,8,13) = 7 g2(j,k,8,14) = 8 g2(j,k,8,15) = 7 g2(j,k,9,13) = 1 g2(j,k,9,14) = 6 g2(j,k,9,15) = 8 g2(j,k,10,13) = 11 g2(j,k,10,14) = 12 g2(j,k,10,15) = 13 g2(i,k,2,13) = 1 g2(i,k,2,14) = 2 g2(i,k,2,15) = 5 g2(i,k,3,13) = 5 g2(i,k,3,14) = 6 g2(i,k,3,15) = 4 g2(i,k,4,13) = 1 g2(i,k,4,14) = 2 g2(i,k,4,15) = 4 We ll find the optimal threading using the "simple lower bound" and splitting a set on the segment with the minimal g1 value. When splitting the selected segment, we ll divide it into three intervals of length one.

Threading Example T=[2,4], [8,10], [13,15] T=[2,4], [8,10], [13] LB = g1(i,3) + g2(i,j,2,8) + g2(i,k,2,13) + g1(j,10) + g2(j,k,9,13) + g1(k,13) = 14 T=[2,4], [8,10], [14] LB = g1(i,3) + g2(i,j,2,8) + g2(i,k,2,14) + g1(j,10) + g2(j,k,9,14) + g1(k,14) = 21 T=[2,4], [8,10], [15] LB = g1(i,3) + g2(i,j,2,8) + g2(i,k,3,15) + g1(j,10) + g2(j,k,8,15) + g1(k,15) = 21 T=[2], [8,10], [13] LB = g1(i,2) + g2(i,j,2,8) + g2(i,k,2,13) + g1(j,10) + g2(j,k,9,13) + g1(k,13) = 17 T=[3], [8,10], [13] LB = g1(i,3) + g2(i,j,3,10) + g2(i,k,3,13) + g1(j,10) + g2(j,k,9,13) + g1(k,13) = 21 T=[4], [8,10], [13] LB = g1(i,4) + g2(i,j,4,9) + g2(i,k,4,13) + g1(j,10) + g2(j,k,9,13) + g1(k,13) = 22 T=[2], [8], [13] LB = g1(i,2) + g2(i,j,2,8) + g2(i,k,2,13) + g1(j,8) + g2(j,k,8,13) + g1(k,13) = 26 T=[2], [9], [13] LB = g1(i,2) + g2(i,j,2,9) + g2(i,k,2,13) + g1(j,9) + g2(j,k,9,13) + g1(k,13) = 19 T=[2], [10], [13] LB = g1(i,2) + g2(i,j,2,10) + g2(i,k,2,13) + g1(j,10) + g2(j,k,10,13) + g1(k,13) = 28

Branch and Bound Efficiency 58 proteins threaded against their native (i.e. correct) models Table from R. Lathrop and T. Smith, Journal of Molecular Biology 255:641-665, 1996.