Protein Structure Prediction


Russ B. Altman
BMI 214 / CS 274

Protein folding is different from structure prediction
-- Folding is concerned with the process by which the protein adopts its 3D shape, usually modeled from physical principles.
-- Prediction uses any statistical, theoretical, or empirical data to try to get at the end result.

Protein Structure Prediction
1. A bit of history: Asilomar, 1994, 1996, 1998 & 2000.
2. Four approaches to structure prediction:
   a. Homology modeling
   b. Ab initio prediction
   c. Sequence-structure threading
   d. Docking
3. Two ways of threading:
   - Dynamic programming
   - Knowledge-based potentials

Asilomar, 1994, 1996, 1998 & 2000
1. Asilomar is a state conference ground near Carmel and Monterey.
2. December 1994: meeting on Critical Assessment of Techniques for Protein Structure Prediction (CASP).
3. December 1996 & 1998: second and third meetings, and so on.
4. A competition was held to compare and contrast methods.

Asilomar: how the competition worked
- Experimentalists who had a structure that would be solved before the date of the CASP meeting submitted the sequence of the unknown to a central repository.
- Predictors could download the sequence and minimal information about the protein (its name), and could enter one of three categories.
- Assessors used automatic analysis programs, in addition to their own expertise, to evaluate the quality of the predictions.

Asilomar Categories
1. Homology modeling (sequences with high homology to sequences of known structure)
Given a sequence with > 25-30% sequence identity to a known structure in the PDB, use the known structure as the starting point for a model of the 3D structure of the sequence. This takes advantage of knowledge of a closely related protein. Sequence alignment techniques establish the correspondences between the known template and the unknown.

Asilomar Categories
2. Ab initio prediction (no known homology with any sequence of known structure)
Given only the sequence, predict the 3D structure from first principles, based on energetic or statistical principles. Secondary structure prediction and multiple alignment techniques are used to predict features of these molecules. Some method is then needed to assemble the 3D structure.

Ab initio prediction
New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
Predict secondary structure:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
HHHHHCCCCCHHHHHHHHHHCCCCBBBBBBBCCBBBB
Predict 3D structure entirely:
[Figure: comparison of calculated (red) and experimental (blue) structures for the protein myoglobin using the refined potential function. The calculated structure is the lowest-energy structure obtained from 3 different jobs with clustering and energy selection. The total simulation time on a 16-node partition of a CM-5 massively parallel computer was 60 hours, in which about 5 billion structures were generated. The RMS deviation between the two structures is 6.2 Å.]

Asilomar Categories
3. Fold recognition (sequences with no significant sequence identity (<= 30%) to sequences of known structure)
Given the sequence and a set of folds observed in the PDB, see whether the sequence could adopt one of the known folds. This takes advantage of knowledge of existing structures and of the principles by which they are stabilized (favorable interactions).

Fold Recognition
New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
Library of known folds:
[Figure: library of known folds]

Asilomar Categories
4. Docking two proteins ('96 only)
Given two separate (known) protein structures, predict the geometry of their physical association. Use information about surface properties to find the best hand/glove or lock/key fit between the two known structures. This can be done by rigid-body docking or by flexible docking (harder).
[Figure: protein docking]

How to evaluate predictions?
- RMSD
- Overall identification and topology of secondary structures
- Energy considerations (contacts, H-bonds)
- Similarity of hydrophobic core
- Sequence alignment quality (and systematic shift)
See the review of CASP4 at http://www3.interscience.wiley.com/cgi-bin/issuetoc?type=dd&id=90010623

Homology Modeling
When sequence homology is > 70%, high-resolution models are possible (< 3 Å RMSD). Sophisticated energy minimization techniques do not dramatically improve upon the initial guess.

Sample Homology Modeling: MODELLER (Sali et al., see course web page)
1. Find homologous proteins with known structure and align them.
2. Collect distance distributions between atoms in known protein structures.
3. Use these distributions to compute positions for the equivalent atoms in the alignment.
4. Refine using energetics.
Rigorous criteria are applied, such as torsion angles, van der Waals violations, and RMSD.
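
The RMSD criterion listed under "How to evaluate predictions?" can be sketched in a few lines. This is a minimal illustration, not the assessors' actual program: it assumes the model and the experimental structure are already superimposed and that atoms are paired by index. The function name and the coordinates are made up for the example.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumes the structures are already superimposed and that point i in
    coords_a corresponds to point i in coords_b (no fitting is done here).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha coordinates in angstroms, invented for illustration.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 2.0)]
print(round(rmsd(model, native), 3))
```

In practice the two structures are first optimally superimposed (for example with a least-squares fit) before the deviation is measured; that step is left out here to keep the sketch short.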

[Figure: homology modeling sample. The thick backbone shows the known structure; thin lines show the modeled structures. Some side chains are not positioned correctly, but the backbone and the other side chains look quite good.]

[Figure panels: typical errors in homology models]
a. Side-chain mistakes
b. Shifts with correct alignment
c. No template
d. Misalignment
e. Incorrect template

Use of sensitive multiple alignment techniques (e.g. PSI-BLAST) helped get the best alignments.
Side chains are modeled using libraries of known amino acid conformations. Success ranged from 45% to 80% correct (correct = angles within 30° of the experimental structure).
Energy-based refinement is still not improving the structures.

PSI-BLAST
Extension of BLAST with extra features:
1. Multiple blocks are aligned (not just one).
2. A profile is used iteratively to increase sensitivity in picking up distant sequences:
   - build a profile based on the initial hits
   - use the profile to conduct another search
   - rebuild the profile
   - repeat
3. Be careful about repeating too many times: the profile can drift (PSI-BLAST drift).
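
The side-chain success rate quoted above counts an angle as correct when it falls within 30 of the experimental value. A minimal sketch of that tally follows, assuming the threshold is in degrees and that the angles are side-chain torsion angles; the function names and the sample values are hypothetical. The one subtlety is that angles wrap around, so -170 and 175 are only 15 degrees apart.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(a_deg - b_deg) % 360.0
    return 360.0 - diff if diff > 180.0 else diff

def fraction_correct(predicted, experimental, tolerance_deg=30.0):
    """Fraction of predicted angles within tolerance_deg of the experimental ones."""
    if len(predicted) != len(experimental):
        raise ValueError("angle lists must have the same length")
    if not predicted:
        raise ValueError("angle lists must not be empty")
    hits = sum(
        1
        for p, e in zip(predicted, experimental)
        if angular_difference(p, e) <= tolerance_deg
    )
    return hits / len(predicted)

# Hypothetical torsion angles in degrees, invented for illustration.
predicted_angles = [60.0, -170.0, 180.0]
experimental_angles = [65.0, 175.0, -60.0]
print(fraction_correct(predicted_angles, experimental_angles))
```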

[Figure: PSI-BLAST overview]

(Skip fold recognition for now and come back to it.)

Ab Initio Predictions, 1° to 2° (secondary structure prediction)
- Accuracy ranges from 66% to 77% (3-state labeling: helix, coil, or beta).
- Human hand editing improves the accuracy.
- Multiple sequence alignments improve the performance of secondary structure prediction.

Ab Initio Predictions, 2° to 3° (assemble secondary structures into 3D)
- Sensitive to errors in secondary structure.
- Predictors were more likely to predict previously known structures.

Ab Initio Predictions, 1° to 3° (predict 3D from sequence only)
- Predict inter-residue contacts and then compute the structure (mild success).
- Simplified energy term + reduced search space (phi/psi or lattice) (moderate success).
- Creative ways to memorize sequence <-> structure correlations in short segments from the PDB, and use these to model new structures: the ROSETTA method.

Ab Initio Predictions, 1° to 3°: good progress (3 models better than fold recognition results in CASP III)
1. Associate the sequence of the unknown with a library of known 3D structures, then optimize the contact frequency of amino acids as measured in the PDB (Baker et al.).
2. Generate all folds on a lattice and then filter out the bad ones (Samudrala et al.).
3. Combine multiple sequence alignment, secondary structure prediction, and a lattice (Skolnick et al.).
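
The 3-state accuracy quoted for secondary structure prediction is the fraction of residues whose predicted state matches the observed one. A minimal sketch, assuming both assignments are given as equal-length strings over the three labels used in these slides (H = helix, C = coil, B = beta); the function name and the example strings are made up.

```python
def three_state_accuracy(predicted, observed):
    """Fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("assignments must have the same length")
    if not predicted:
        raise ValueError("assignments must not be empty")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return correct / len(predicted)

# Hypothetical assignments, invented for illustration.
predicted_states = "HHHHCCBB"
observed_states = "HHHCCCBB"
print(three_state_accuracy(predicted_states, observed_states))
```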

[Figure: lattice search]

Rosetta Method for ab initio
1. Break the target into fragments of 9 amino acids.
2. Create a profile, X, for the target.
3. Create a profile, S, for similar PDB sequences.
4. Align profiles X and S to get a rank-ordered list of the best-matching fragments in the PDB.
(REF: Simons, Baker, JMB 306: 1191-1199)
5. Start with an extended chain, and evaluate the effect of introducing the fragments into the chain.
6. Use a Metropolis-type algorithm for optimization, with the following terms:
   - hydrophobic burial
   - polar side-chain interactions
   - hydrogen bonding between beta-strands
   - hard-sphere repulsion (van der Waals)
7. Create 1000 structures and cluster them.
8. Choose one representative from each cluster as a possible prediction.
Use an ellipsoid to be sure that hydrophobic residues are central.
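
The "Metropolis-type algorithm" in the optimization step can be sketched generically. This is not Rosetta's code: the energy function and the move set below are toy stand-ins (a one-dimensional quadratic and a small random step) for the real scoring terms and fragment insertions, and all names and parameter values are assumptions chosen for illustration. Temperature is expressed in the same units as the energy.

```python
import math
import random

def metropolis_accept(delta_energy, temperature, draw=random.random):
    """Metropolis criterion: always accept downhill moves; accept uphill
    moves with probability exp(-delta_energy / temperature)."""
    if delta_energy <= 0:
        return True
    return draw() < math.exp(-delta_energy / temperature)

def metropolis_search(start, propose, energy, steps, temperature, seed=0):
    """Run a Metropolis walk and return the lowest-energy state visited."""
    rng = random.Random(seed)
    current, current_energy = start, energy(start)
    best, best_energy = current, current_energy
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_energy = energy(candidate)
        delta = candidate_energy - current_energy
        if metropolis_accept(delta, temperature, rng.random):
            current, current_energy = candidate, candidate_energy
            if current_energy < best_energy:
                best, best_energy = current, current_energy
    return best, best_energy

# Toy stand-ins: a real run would score a 3D chain and insert fragments.
def toy_energy(x):
    return (x - 3.0) ** 2

def toy_move(x, rng):
    return x + rng.uniform(-0.5, 0.5)

best_x, best_e = metropolis_search(0.0, toy_move, toy_energy,
                                   steps=2000, temperature=0.1)
print(best_x, best_e)
```

Accepting some uphill moves is what lets the search escape local minima; repeating the run many times from the extended chain gives the set of structures that is then clustered.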

CASP IV Performance
[Figure: performance of the Rosetta method]

Alexey Murzin (Proteins Volume 45, Issue S5, 2001, pages 76-85):
"In 1996, in CASP2, we presented a semimanual approach to the prediction of protein structure that was aimed at the recognition of probable distant homology, where it existed, between a given target protein and a protein of known structure (Murzin and Bateman, [Proteins 1997; Suppl 1:105-112]). Central to our method was the knowledge of all known structural and probable evolutionary relationships among proteins of known structure classified in the SCOP database (Murzin et al., J Mol Biol 1995;247:536-540). It was demonstrated that a knowledge-based approach could compete successfully with the best computational methods of the time in the correct recognition of the target protein fold."

Murzin prediction, CASP IV
[Figure: experimental vs. predicted structure]
The computational community responds: "Alexey can't play!"

Fold Recognition (check if sequence matches a known 3D fold)
CASP1: Of 21 target proteins, 11 wound up having folds that were previously known.
CASP2: Of 22 targets, 15 with available folds.
CASP3: Of 43 targets, 36 with available folds.
CASP4: Of 56 target domains, hard to say.
- Every predictor does well on something.
- Common folds (more examples) are easier to recognize.
- Fold recognition was the surprise performer at the first competition.
- Incremental progress at the second, third, and fourth.

Fold Recognition
- Not all or none: a list of the top N hits is much better than the top hit alone.
- Common folds are easier to recognize.
- The quality of the alignments that result is NOT good.
- Potentials include residue pair contact terms, hydrophobicity, polarity, H-bonds, and local structure terms.
- Simple dynamic programming with environmental matching sometimes performs as well as sophisticated 3D potentials.

Fold Recognition
New sequence:
MLDTNMKTQLKAYLEKLTKPVELIATLDDSAKSAEIKELL
Library of known folds:
[Figure: library of known folds]
[Figures: N-1 = target, N-2 = fold in PDB]

Fold Recognition ~ Threading ~ Inverse Folding
Fold recognition: given a sequence and a library of backbones, find the backbone that accommodates the sequence best.
Threading: given a backbone, find the best way to mount the sequence on the backbone (with gaps) to maximize good interactions.
Inverse folding: (folding = sequence to 3D) start with the 3D structure and find a good sequence.

[Figure: predictors for CASP I are along the top row; target sequences are along the first column. Dark grey means a bad prediction, light grey pretty good, white very good. Hatched means no prediction. The upper left corner of each cell shows the rank of the best answer among the list submitted by the predictor (it also shows the fold used to make the prediction, the shift error, and the general protein class).]

Elements of a fold recognition algorithm
1. Library of protein structures, suitably processed
   - All structures
   - Representative subset
   - Structures with loops removed
2. Scoring function
   - contact potential
   - environmental evaluation function
3. Method for generating initial alignments and/or searching for better alignments.

Dynamic Programming with Environmental Strings
(The subject of one of the homeworks)
IDEA: Instead of aligning a sequence to a sequence, align a sequence to a string of descriptors that describe the 3D environment of the target structure.

Usual DP: the score matrix relates two amino acids.

        A     R     N     D     C     Q
  A     2    -2     0     0    -2     0
  R    -2     6     0    -1    -4     1
  N     0     0     2     2    -4     1
  D     0    -1     2     4    -5     2
  C    -2    -4    -4    -5    12    -5
  Q     0     1     1     2    -5     4

Thread DP: relate amino acids to environments in the 3D structure.

        E1      E2      E3      E4      E5   ...
  A   -0.77    1.05   -0.54   -0.65   -1.52
  R   -1.80   -1.52   -2.35   -0.11   -0.41
  N   -1.76   -2.18   -2.61   -0.48   -0.26
  D   -2.48   -1.80   -2.63   -0.80   -2.08
  C   -0.43   -0.45   -0.59    0.15   -0.72
  Q   -1.38   -2.03   -0.84    0.16   -0.79
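
The thread DP idea can be sketched as an ordinary global alignment in which one of the two strings is a list of environment labels. The sketch below is an assumption-laden illustration rather than the homework solution: it uses a linear gap penalty of -1.0 (not given in the slides), the function name thread is made up, and the example environment string is hypothetical. The match scores are the ones in the thread DP table on this slide.

```python
# Amino acid vs. environment scores from the thread DP table on this slide.
MATCH = {
    "A": {"E1": -0.77, "E2": 1.05, "E3": -0.54, "E4": -0.65, "E5": -1.52},
    "R": {"E1": -1.80, "E2": -1.52, "E3": -2.35, "E4": -0.11, "E5": -0.41},
    "N": {"E1": -1.76, "E2": -2.18, "E3": -2.61, "E4": -0.48, "E5": -0.26},
    "D": {"E1": -2.48, "E2": -1.80, "E3": -2.63, "E4": -0.80, "E5": -2.08},
    "C": {"E1": -0.43, "E2": -0.45, "E3": -0.59, "E4": 0.15, "E5": -0.72},
    "Q": {"E1": -1.38, "E2": -2.03, "E3": -0.84, "E4": 0.16, "E5": -0.79},
}

def thread(sequence, environments, match, gap=-1.0):
    """Globally align a residue sequence to a list of environment labels.

    Returns (best_score, alignment); alignment is a list of
    (residue, environment) pairs in which None marks a gap.
    The linear gap penalty is an assumption, not taken from the slides.
    """
    n, m = len(sequence), len(environments)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # residues aligned to nothing
        score[i][0] = score[i - 1][0] + gap
        move[i][0] = "up"
    for j in range(1, m + 1):          # environments left unfilled
        score[0][j] = score[0][j - 1] + gap
        move[0][j] = "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match[sequence[i - 1]][environments[j - 1]]
            options = [
                (score[i - 1][j - 1] + pair, "diag"),
                (score[i - 1][j] + gap, "up"),
                (score[i][j - 1] + gap, "left"),
            ]
            score[i][j], move[i][j] = max(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the alignment.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            alignment.append((sequence[i - 1], environments[j - 1]))
            i, j = i - 1, j - 1
        elif step == "up":
            alignment.append((sequence[i - 1], None))
            i -= 1
        else:
            alignment.append((None, environments[j - 1]))
            j -= 1
    alignment.reverse()
    return score[n][m], alignment

# Hypothetical target: three residues threaded onto three environments.
best_score, alignment = thread("ACQ", ["E2", "E4", "E4"], MATCH)
print(best_score, alignment)
```

The only change from sequence-to-sequence DP is the lookup table: rows are amino acids and columns are environments, so the same recurrence scores how well each residue fits each structural position.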

What are environments? How do you compute them?
Conceptually, superimpose multiple structures and look at the statistically conserved features around each 3D (x, y, z) position. These may include:
- Is the amino acid buried, partially buried, or exposed?
- If buried, how polar is the environment?
- If partially buried, how polar?
- What kind of secondary structure?
(Buried status, polarity, and secondary structure)

1. Align proteins with similar 3D structure.
2. Align homologous proteins by sequence alone.
3. For each position in the protein, identify its environment by computing the local properties of interest (e.g. secondary structure, burial, polarity).
4. Count the frequencies of the different amino acids (within the multiple alignment) in the different environments. This creates a MATCH MATRIX.
Bowie et al. define 18 environments.
Another example of position-specific scores.

DP threading Match Matrix
[Figure: sample matrix showing the alignment of amino acids and environments for globins. Entries indicate the possible score for each amino acid at each environmental position, taken from the match matrix.]
[Figure: Z-scores of DP threading for myoglobins, globins, and non-globins.]

How do you thread a new sequence?
Using standard dynamic programming with the new score matrix, align the sequence of environments from the structure of interest to the sequence of amino acids from the unknown sequence. The highest-scoring alignment is the best superposition of the sequence onto the structure. Using the scores of sequences with known structure, you can see whether the score is high enough to put the new sequence in the family.
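
The counting step that produces the match matrix can be sketched as follows. The slides only say to count amino acid frequencies per environment; turning those counts into log-odds scores, ln[P(amino acid | environment) / P(amino acid)], and smoothing with a pseudocount are assumptions made here for illustration, as are the function name and the toy observations.

```python
import math
from collections import Counter

def build_match_matrix(observations, pseudocount=1.0):
    """Turn (amino_acid, environment) observations into log-odds scores.

    score[aa][env] = ln( P(aa | env) / P(aa) ), with a pseudocount added
    to every cell so that unseen combinations do not give log(0).
    """
    observations = list(observations)
    if not observations:
        raise ValueError("need at least one observation")
    pair_counts = Counter(observations)
    aa_counts = Counter(aa for aa, _ in observations)
    env_counts = Counter(env for _, env in observations)
    total = len(observations)
    amino_acids = sorted(aa_counts)
    matrix = {}
    for aa in amino_acids:
        background = aa_counts[aa] / total
        matrix[aa] = {}
        for env in sorted(env_counts):
            smoothed = (pair_counts[(aa, env)] + pseudocount) / (
                env_counts[env] + pseudocount * len(amino_acids)
            )
            matrix[aa][env] = math.log(smoothed / background)
    return matrix

# Toy observations, invented for illustration: leucine is mostly buried,
# lysine is mostly exposed.
toy_observations = (
    [("L", "buried")] * 3 + [("K", "buried")]
    + [("K", "exposed")] * 3 + [("L", "exposed")]
)
toy_matrix = build_match_matrix(toy_observations)
print(toy_matrix["L"]["buried"], toy_matrix["K"]["buried"])
```

A positive entry means the amino acid is seen in that environment more often than its overall frequency would suggest; a negative entry means it is seen less often.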

Advantages: DP Threading
1. Environmental proclivities may be more accurate than simple amino acid similarity:
   - structural information
   - local context
   - potentially, many other features
2. Fast.
3. Pretty good performance (even at Asilomar).

Net result: sample alignment
Environments: B1 E2α B2α B2α E2α B2β P2β Eα Eβ Eα ..
Residues:     His Asp Val Ile Lys Ile Tyr Ser ..

Disadvantages: DP Threading
- Requires previous examples to work.
- The resulting match usually needs refinement.
- May share some problems of DP in general (independence assumption from column to column, gap penalty choice, etc.).
- Assumes average amino acid preferences over all similar protein-family environments.
- Doesn't compute the actual environment created by mounting the sequence on the structure.
- Assumes that the environment is relatively constant, and that only amino acid details change. But there could be different types of interactions.

Contact Potential Threading
IDEA: Instead of modeling energies from first physical principles, simplify the problem by positioning only amino acids, and compute empirical energies from the observed associations of amino acids.
Example: GLU is attracted to LYS = E(glu, lys)

Contact potential threading
Create energy terms between amino acids:
$$E(\text{interaction}) = -KT \ln[\text{frequency of interaction}]$$
where K is a constant, T is the temperature (constant), and the frequency of interaction is measured in a database of known structures. More frequent means more favorable (lower energy).
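
The energy formula above amounts to taking the negative logarithm of an observed frequency. A minimal sketch, assuming the frequencies come from raw contact counts normalized by their total and that the product KT is folded into a single scale parameter; the function name and the counts are invented for illustration.

```python
import math

def contact_energies(contact_counts, kt=1.0):
    """Convert observed contact counts into energies E = -KT ln(frequency).

    contact_counts maps a residue pair such as ("GLU", "LYS") to how often
    that contact was seen in a database of known structures.
    kt stands for the product K*T; it only sets the energy scale.
    """
    total = sum(contact_counts.values())
    if total <= 0:
        raise ValueError("need at least one observed contact")
    energies = {}
    for pair, count in contact_counts.items():
        if count <= 0:
            raise ValueError(f"pair {pair} has no observations")
        energies[pair] = -kt * math.log(count / total)
    return energies

# Hypothetical counts, invented for illustration only.
counts = {("GLU", "LYS"): 60, ("GLU", "GLU"): 10, ("LYS", "LYS"): 30}
energies = contact_energies(counts)
print(energies)
```

With these toy counts the frequent GLU-LYS contact gets the lowest energy and the rare GLU-GLU contact the highest, which is the "more frequent, more favorable" rule in numbers.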

Contact potential (after Sippl et al.)
More specifically:
a = amino acid type a (ALA, VAL, etc.)
b = amino acid type b
s = separation in sequence
$$\Delta E_{abs}(r) = E_{abs}(r) - E_s(r)$$
The energy of interaction between a and b, minus the average energy at that separation, equals the energy difference that contributes to stability.

Contact Potential
$$\Delta E_{abs}(r) = -KT \ln\left[\frac{f_{abs}(r)}{f_s(r)}\right]$$
For any given sequence in 3D, compute the distances between all pairs of amino acids (usually up to r = 10-15 Å), and sum:
$$E_{tot} = \sum_{\text{all } a,b \text{ pairs}} \Delta E_{abs}(r)$$

Using contact potential
1. Given a 3D structure, mount the sequence on the structure. Options:
   - simple dynamic programming (misses the point)
   - other dynamic programming (better)
   - exhaustive enumeration (too expensive; a recent paper shows that this is NP-hard)
   - heuristic enumeration
   - limits on gap lengths and loop lengths (heuristic)
2. Evaluate the contact potential for the alignment.
3. (Optional) Locally optimize the potential score.
4. Compare the potential with random shuffles of the sequence, and with other sequences, to approximate a z-score.

Z-score: the number of standard deviations away from the mean. Most meaningful for normal distributions.
[Figure: sample threading score distribution, with the mean and 2 SD marked]

Other uses of contact potentials
- Fold recognition (as discussed here)
- Incorrect fold recognition: detect unlikely or wrong structures (bad predictions, bad contacts, etc.)
- Measure protein stability
- Use for ab initio prediction
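
The shuffle comparison in the last step can be sketched as follows. This is a toy version under stated assumptions: the structure is reduced to a fixed list of residue index pairs that are in contact, the pair energies are invented numbers rather than a real Sippl-style potential, and the function names are made up. The z-score is the number of standard deviations between the native energy and the mean energy of the shuffled sequences.

```python
import random
import statistics

def total_energy(sequence, contacts, pair_energy):
    """Sum the pair energies over all residue pairs that are in contact."""
    total = 0.0
    for i, j in contacts:
        key = tuple(sorted((sequence[i], sequence[j])))
        total += pair_energy[key]
    return total

def z_score(value, background):
    """Number of standard deviations between value and the background mean."""
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        raise ValueError("background scores have zero spread")
    return (value - mean) / spread

def shuffle_z_score(sequence, contacts, pair_energy, n_shuffles=200, seed=0):
    """Compare the native energy with energies of shuffled sequences
    mounted on the same contacts."""
    rng = random.Random(seed)
    native = total_energy(sequence, contacts, pair_energy)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(total_energy(residues, contacts, pair_energy))
    return z_score(native, background)

# Toy data, invented for illustration: E-K contacts are favorable.
toy_pair_energy = {
    ("A", "A"): 0.0, ("A", "E"): 0.0, ("A", "K"): 0.0,
    ("E", "E"): 0.5, ("E", "K"): -1.0, ("K", "K"): 0.5,
}
toy_contacts = [(0, 1), (2, 3)]
z = shuffle_z_score("EKEKAA", toy_contacts, toy_pair_energy)
print(z)
```

A strongly negative z-score says the sequence fits the structure much better than its own shuffled versions do, which is the evidence used to accept a fold assignment.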

Conclusions
1. Protein fold recognition will get asymptotically better as we get more folds.
2. The best ab initio methods use knowledge of the database, and will thus also improve.
3. Estimates are that we now have between 30% and 50% of the folds that occur.
4. Given a fold, we need to improve refinement with homology modeling techniques.

Other information
1. http://predictioncenter.llnl.gov/ points to CASP results and targets.
2. Special journal issues devoted to CASP:
   CASP1: Proteins 23(3), 1995
   CASP2: Proteins Supplement 1, 1997
   CASP3: Nature Structural Biology, Vol 6, No. 2, Feb 1999, page 108
   CASP4: Proteins Vol 45 (S5), 2001