RNA and Protein Structure Prediction

Similar documents
Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Bio nformatics. Lecture 23. Saad Mneimneh

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

CAP 5510 Lecture 3 Protein Structures

Protein Structure Basics

Basics of protein structure

COMP 598 Advanced Computational Biology Methods & Research. Introduction. Jérôme Waldispühl School of Computer Science McGill University

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Model Mélange. Physical Models of Peptides and Proteins

Biomolecules: lecture 10

Introduction to" Protein Structure

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

Combinatorial approaches to RNA folding Part I: Basics

Protein Structure. Hierarchy of Protein Structure. Tertiary structure. independently stable structural unit. includes disulfide bonds

From Amino Acids to Proteins - in 4 Easy Steps

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

What is the central dogma of biology?

D Dobbs ISU - BCB 444/544X 1

Bi 8 Midterm Review. TAs: Sarah Cohen, Doo Young Lee, Erin Isaza, and Courtney Chen

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007

Orientational degeneracy in the presence of one alignment tensor.

1/23/2012. Atoms. Atoms Atoms - Electron Shells. Chapter 2 Outline. Planetary Models of Elements Chemical Bonds

Dana Alsulaibi. Jaleel G.Sweis. Mamoon Ahram

Physiochemical Properties of Residues

Motif Prediction in Amino Acid Interaction Networks

Protein Structures. Sequences of amino acid residues 20 different amino acids. Quaternary. Primary. Tertiary. Secondary. 10/8/2002 Lecture 12 1

Conformational Geometry of Peptides and Proteins:

Details of Protein Structure

The Structure and Functions of Proteins

Advanced Certificate in Principles in Protein Structure. You will be given a start time with your exam instructions

Protein structure (and biomolecular structure more generally) CS/CME/BioE/Biophys/BMI 279 Sept. 28 and Oct. 3, 2017 Ron Dror

Chapter

Analysis and Prediction of Protein Structure (I)

Protein Dynamics. The space-filling structures of myoglobin and hemoglobin show that there are no pathways for O 2 to reach the heme iron.

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

1. (5) Draw a diagram of an isomeric molecule to demonstrate a structural, geometric, and an enantiomer organization.

Enzyme Catalysis & Biotechnology

MULTIPLE CHOICE. Circle the one alternative that best completes the statement or answers the question.

1. What is an ångstrom unit, and why is it used to describe molecular structures?

Protein Threading. BMI/CS 776 Colin Dewey Spring 2015

Ch 3: Chemistry of Life. Chemistry Water Macromolecules Enzymes

BCH 4053 Spring 2003 Chapter 6 Lecture Notes

Structure-Based Comparison of Biomolecules

Outline. Levels of Protein Structure. Primary (1 ) Structure. Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Packing of Secondary Structures

Supersecondary Structures (structural motifs)

From gene to protein. Premedical biology

Protein Structures: Experiments and Modeling. Patrice Koehl

ALL LECTURES IN SB Introduction

Objective: Students will be able identify peptide bonds in proteins and describe the overall reaction between amino acids that create peptide bonds.

Computational Biology: Basics & Interesting Problems

Bioinformatics. Macromolecular structure

Unit 1: Chemistry - Guided Notes

Final Chem 4511/6501 Spring 2011 May 5, 2011 b Name

CHEM 463: Advanced Inorganic Chemistry Modeling Metalloproteins for Structural Analysis

AP Biology. Proteins. AP Biology. Proteins. Multipurpose molecules

F. Piazza Center for Molecular Biophysics and University of Orléans, France. Selected topic in Physical Biology. Lecture 1

Read more about Pauling and more scientists at: Profiles in Science, The National Library of Medicine, profiles.nlm.nih.gov

Protein Structure and Function. Protein Architecture:

proteins are the basic building blocks and active players in the cell, and

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

BME 5742 Biosystems Modeling and Control

Announcements. Primary (1 ) Structure. Lecture 7 & 8: PROTEIN ARCHITECTURE IV: Tertiary and Quaternary Structure

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Table 1. Crystallographic data collection, phasing and refinement statistics. Native Hg soaked Mn soaked 1 Mn soaked 2

BA, BSc, and MSc Degree Examinations

NMR, X-ray Diffraction, Protein Structure, and RasMol

CS273: Algorithms for Structure Handout # 2 and Motion in Biology Stanford University Thursday, 1 April 2004

BIRKBECK COLLEGE (University of London)

15.2 Prokaryotic Transcription *

1. Protein Data Bank (PDB) 1. Protein Data Bank (PDB)

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

Presentation Outline. Prediction of Protein Secondary Structure using Neural Networks at Better than 70% Accuracy

Outline. Levels of Protein Structure. Primary (1 ) Structure. Lecture 6:Protein Architecture II: Secondary Structure or From peptides to proteins

LS1a Fall 2014 Problem Set #2 Due Monday 10/6 at 6 pm in the drop boxes on the Science Center 2 nd Floor

Design of a Novel Globular Protein Fold with Atomic-Level Accuracy

Motivating the need for optimal sequence alignments...

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell.

Properties of amino acids in proteins

Protein Secondary Structure Prediction

Protein Structure. Role of (bio)informatics in drug discovery. Bioinformatics

Biochemistry Quiz Review 1I. 1. Of the 20 standard amino acids, only is not optically active. The reason is that its side chain.

Protein Structure & Motifs

Introduction to Protein Folding

Review. Membrane proteins. Membrane transport

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

THE UNIVERSITY OF MANITOBA. PAPER NO: 409 LOCATION: Fr. Kennedy Gold Gym PAGE NO: 1 of 6 DEPARTMENT & COURSE NO: CHEM 4630 TIME: 3 HOURS

NGF - twenty years a-growing

Getting To Know Your Protein

Biochemistry Prof. S. DasGupta Department of Chemistry Indian Institute of Technology Kharagpur. Lecture - 06 Protein Structure IV

HIV protease inhibitor. Certain level of function can be found without structure. But a structure is a key to understand the detailed mechanism.

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

Protein Structures. 11/19/2002 Lecture 24 1

BIOCHEMISTRY Unit 2 Part 4 ACTIVITY #6 (Chapter 5) PROTEINS

Transcription:

RNA and Protein Structure Prediction Bioinformatics: Issues and Algorithms CSE 308-408 Spring 2007 Lecture 18-1-

Outline Multi-Dimensional Nature of Life RNA Secondary Structure Prediction Protein Structure Determination Protein Threading -2-

Life is not one-dimensional Secondary and tertiary structures for trna (tranfser RNA) Sample protein structures http://crystal.uah.edu/~carter/protein/xray.htm http://www.mdc-berlin.de/englisch/research/research_areas/cancer/heinemann.htm -3-

RNA secondary structure: Base-pairing rules similar to basic sequence alignment. Unlike DNA, single-stranded RNA can fold back on itself. Recall that C-G and A-U form stable pairs ( Watson-Crick ). In addition, G-U forms a weaker pair ( wobble ). http://www.bioinfo.rpi.edu/~zukerm/bio-5495/rnafold-html/ -4- http://tigger.uic.edu/classes/phys/phys461/phys450/anjum04/rna_sstrand.jpg RNA secondary structure

Modeling RNA secondary structure Let R = r1r2... rn be an RNA sequence, where ri {A, C, G, U}. A secondary structure is a set of ordered pairs (i, j) such that: (1) j i > 4 pairing can't be too tight (2) if (i, j) and (i', j') are two base pairs, with i i', then either: (a) i = i' and j = j' i.e., same base pair or (b) i < j < i' < j' (i, j) precedes (i', j') or (c) i < i' < j' < j (i, j) encloses (i', j') This is also known as a folding. http://www.bioinfo.rpi.edu/~zukerm/bio-5495/rnafold-html/ -5-

Visualizing RNA secondary structure Circular representation Computer predicted folding of Bacillus subtilis RNase P RNA H M = multi-loop I = interior loop B = bulge loop H = hairpin loop I base 0 I B I base 400 I H M H Dot plot representation H M H H M H M Hlower base pair precedes upper pair (i < j < i' < j') H M I H H this base pair encloses H all others (i < i' < j' < j ) http://www.bioinfo.rpi.edu/~zukerm/bio-5495/rnafold-html/ -6- place dot for each (i, j) pair

Predicting RNA secondary structure Premise: Base pairings have a stabilizing effect on a structure's free energy. Loops have a destabiling effect. Goal: to determine a secondary structure with minimum free energy. Hairpin loop Stacked base pairs Two popular approaches: Ignore loops, maximize number of base pairings (a bit simplistic, but leads to a nice algorithm). Employ loop-specific energy models (more realistic, but also more complex algorithms). -7-

An approach for predicting RNA secondary structure Assumption: Energy of each base pair is independent of all other base pairs and of loop structure. Consequence: Total free energy is sum of energies for all base pairs. Downside: Predictions only approximate reality. Key observation: We can use solutions for shorter subsequences to determine solution for longer sequences. Should sound familiar: precisely the kind of decoupling required for dynamic programming to work. -8-

Notation and definitions S(i, j) = secondary structure of RNA strand (i.e., set of base pairings) from base ri to base rj, inclusive. e(a, b) = free energy of base pair (a, b). E(S(i, j)) = free energy of secondary structure S(i, j). So we have: E S i, j = e a, b a, b S i, j Note: we can define e(a, b) to be a very large value when physical constraints are violated (e.g., b a 4, so that hairpin loop would be too tight). -9-

Algorithm for predicting RNA secondary structure Consider RNA strand from base position i to base position j: What is optimal folding for this? ri rk rj-1 rj Two possible cases depending on whether rj base-pairs: If rj does not base-pair, then E(S(i, j)) = E(S(i, j-1)). We already computed this. ri rk rj-1 rj If rj does pair, E(S(i, j)) = E(S(i, k-1)) + e(k, j) + E(S(k+1, j-1)). Already computed. ri Already computed. rk rj-1-10 - rj

Algorithm for predicting RNA secondary structure Optimize free energy over increasingly longer subsequences. For each partial folding S(i, j), free energy is given by: E S i, j = { min E S i 1, j 1 e i, j min {E S i, k 1 e k, j E S k 1, j 1 } for i k j Free energy of optimal folding for entire strand is E(S(1,n)). How do we deal with case where ri is not paired? Define e(k, j) to be 0 when k = j. - 11 - }

Time and space complexity What are time and space requirements of this algorithm? For each partial folding S(i, j), free energy is given by: E S i, j = { min E S i 1, j 1 e i, j min {E S i, k 1 e k, j E S k 1, j 1 } for i k j For RNA sequence of length n, we must be an n x n matrix. Each entry requires exploring an average of n/2 values for k. Hence, time complexity is O(n3), space complexity is O(n2). - 12 - }

Summary of RNA secondary structure prediction #1 General summary of the situation: Tertiary structure is difficult to model and compute. Determining secondary structure is more amendable to a solution. While only an approximation, it gives good hints. No knots planar graph. Solved using dynamic programming in O(n3) time. http://matisse.ucsd.edu/~hwa/pub/goletter.pdf - 13 -

Summary of RNA secondary structure prediction #2 Better results can be obtained by modeling loops. This problem is also solvable in O(n3) time using some tricks. Bulge Hairpin loop red = destabilizing ( bad ) green = stabilizing ( good ) Helical region Interior loop - 14 -

Protein structure Primary structure of protein is determined by number and order of amino acids within polypeptide chain. Protein's secondary structure is defined as local conformation of its backbone, which consists of molecules that make up an amino acid's frame excluding side chains. Two common motifs include beta-pleated sheets and alpha helices. Tertiary structure is formed when attractions of side chains and secondary structure combine to form distinct 3-dimensional structure. This gives protein its specific function. Sometimes distinct proteins must combine to form correct 3-dimensional structure for a particular protein to function properly. E.g., hemoglobin is made of four similar proteins that combine to form its quaternary structure. http://crystal.uah.edu/~carter/protein/protein.htm - 15 -

Protein structure http://www.science.org.au/sats2004/images/mackay2.jpg - 16 -

Sequence structure function structure medicine sequence function http://www.bioinformatics.uwaterloo.ca/~j3xu/cs882/cs882-proteinstructureprediction.ppt - 17 -

How is true 3-D structure determined? As of today, must be determined experimentally. Techniques include x-ray crystallography and NMR. diffraction pattern http://crystal.uah.edu/~carter/protein/xray.htm - 18 -

X-ray crystallography A diffraction pattern: the white spots are the reflections. http://www-structure.llnl.gov/xray/xrayequipment.htm - 19 -

From electron density map to structure http://www.ib3.gmu.edu/vaisman/csi731/lec04f02.pdf - 20 -

Protein structure determination backbone... w/ side chains http://www-structure.llnl.gov/xray/tutorial/protein_structure.htm - 21 - electron density map

Two common protein secondary structures Alpha Helix R groups of amino acids all extend to outside. Helix makes a complete turn every 3.6 amino acids. Helix is right-handed; it twists in clockwise direction. Carbonyl group (-C=O) of each peptide bond extends parallel to axis of helix and points directly at -N-H group of peptide bond 4 amino acids below it in helix. A hydrogen bond forms between them [-N-H O=C-]. Beta Conformation Consists of pairs of chains lying side-by-side and stabilized by hydrogen bonds between carbonyl oxygen atom on one chain and -NH group on adjacent chain. Chains are often "anti-parallel"; N-terminal to C-terminal direction of one being reverse of other. http://www.rothamsted.bbsrc.ac.uk/notebook/courses/guide/protalpha.htm - 22 -

Alpha helix Alpha-helix (also written α-helix) is rod-like structure stabilized by hydrogen bonds between CO and NH groups of main chain. Ribbon representation of righthanded alpha-helix with only the alpha carbons represented. Examining backbone structure, note that alpha carbons spaced three and four in linear sequence are actually quite close together in helix structure. The hydrogen bonds are shown in green; all main chain CO and NH groups are hydrogen bonded. This structure is quite sturdy. http://www.rothamsted.bbsrc.ac.uk/notebook/courses/guide/protalpha.htm - 23 -

Beta sheet http://www.cryst.bbk.ac.uk/pps2/course/section3/sheet.html - 24 -

Protein structure Some proteins are made up of mostly alpha helicies. Both marine bloodworm hemoglobin (left) and E. coli cytochrome B562 (right) are composed of mostly alpha helicies. The 4 helix bundle of the cytochrome is a common motif. Some are mostly beta sheet. The green alga plastocyanin (left) and sea snake neurotoxin (right) are mostly beta sheets. Red = alpha helix Green = beta sheet Black = misc. loops http://bmbiris.bmb.uga.edu/wampler/tutorial/prot3.html - 25 -

Protein structure Many proteins are a mix of alpha helicies and beta sheets. Two simple proteins with a mix of 2o components: ribonuclease T1 (left) and pancreatic trypsin inhibitor (right). http://bmbiris.bmb.uga.edu/wampler/tutorial/prot3.html - 26 -

Protein Domains Tertiary structure of many proteins is built from several domains. Often each has a separate function to perform, such as: binding a small ligand (e.g., a peptide in the molecule shown here) spanning the plasma membrane (transmembrane proteins) containing the catalytic site (enzymes) DNA-binding (in transcription factors) providing a surface to bind specifically to another protein. In some cases, each domain is encoded by a separate exon in the gene. The histocompatibility molecule shown here has three domains: α1, α2, and α3 are each encoded by its own exon. http://users.rcn.com/jkimball.ma.ultranet/biologypages/t/tertiarystructure.html - 27 -

Moving towards protein structure prediction... Most important information seems to be contained in alpha helices and beta sheets (which form core), not in loops. Given amino acid sequence, we want to determine locations of helices, sheets, and loops, and their arrangements. How to do this? Experimental techniques are expensive and time-consuming. Exhaustive enumeration at molecular level (taking structure with smallest free energy)? Nah... As of 2002, the NIH protein structure database contained approximately 15,000 entries. Hmm... Idea: given sequence, see if it could fit a known structure. This is known as protein threading. - 28 -

PDB new vs. old folding growth Old fold New fold Number of unique folds in nature is fairly small (possibly a few thousands). 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB. http://www.bioinformatics.uwaterloo.ca/~j3xu/cs882/cs882-proteinstructureprediction.ppt - 29 -

Protein threading Somewhat similar to sequence alignment we studied earlier: homology modeling: align sequence to sequence, threading: align sequence to structure (templates). http://www.biostat.wisc.edu/bmi576/lecture16.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf - 30 -

The protein threading problem Given: new protein sequence, and library of templates: Find: best alignment of sequence to some template. http://www.biostat.wisc.edu/bmi576/lecture16.pdf - 31 -

The protein threading problem One possible threading (note non-local interactions): http://www.dcs.kcl.ac.uk/teaching/units/csmacmb/doc/lecture17b.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf - 32 -

The protein threading problem Input: 1. Protein sequence A with n amino acids ai. 2. Core structural model C, with m core segments Ci. Also: (a) Length ci of each core segment. (b) Core segments Ci and Ci+1 are connected by loop λi for which we know max (lmaxi) and min (lmini) lengths. (c) Structural environment for each amino acid position. 3. Scoring function f(t) to evaluate each threading T. Output: set of integers T = {t1, t2,..., tm} such that value of ti indicates which amino acid from A occupies first position in core segment i. - 33 -

Core templates with interactions Small circles represent amino acid positions. Thin lines indicate interactions represented in model. http://www.biostat.wisc.edu/bmi576/lecture16.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf - 34 -

The protein threading problem Possible threadings: Unfortunately, due to variable-length gaps between core segments and non-local interactions, this problem is NP-hard. Fortunately, it is amenable to solution by a general-purpose optimization strategy known as branch-and-bound. http://www.biostat.wisc.edu/bmi576/lecture16.pdf http://www.ics.uci.edu/~rickl/publications/1998-salzberg-chapter.pdf - 35 -

Digression: branch-and-bound Basic idea: Partion solution space into distinct sets. Compute lower bound that applies to all solutions in given set. If we can find a solution that is better than this lower bound, we don't need to explore any solution in that set. Important note: branch and bound will find optimal solution (it's not heuristic)... but... it might take exponential time to do it. Still, it is often much faster than naive exhaustive search. - 36 -

Digression: branch-and-bound Let's see how this works for the traveling salesman problem, which we know is also NP-complete (protein threading is a bit too complicated for now). Given: a set of cities and costs to travel between them. Find: a minimum cost tour that visits each city once. A 2 12 3 B 7 5 D 2 2 4 C 2 5 E A B C D E A 2 12 5 2-37 - B 2 3 7 2 C 12 3 2 5 D 5 7 2 4 E 2 2 5 4 -

Branch-and-bound Start search at A: Now let's try going from A to C: A A 2 12 B B C C 3 B +3=5 C 2 15 +2=7 D 5 D E 14 17 D + 4 = 11 E E + 2 = 13 No point in exploring any of these subtrees any further! Total cost for this tour is 13. - 38 -

Branch-and-bound In traveling salesman, we were able to eliminate from consideration all tours starting with: A-C-B-... and A-C-D-... and A-C-E-... because we knew they could never be optimal; we already had a tour with total cost less than their partial costs. General observation: complete solution with cost w partial solutions lower bound x w partial solutions lower bound y < w partial solutions lower bound z < y don't bother exploring this explore this if/when bound becomes best explore this first - 39 -

Back to protein threading... Recall that ti indicates which amino acid occupies the first position in core segment i. Our scoring function is: f T = g 1 i, t i g 2 i, j, t i, t j i i j i As we know sizes of core segments and min and max lengths for loops, we can determine ranges for ti's. di-1 ti di di+1-40 - This will form basis for branchand-bound.

Branch-and-bound for protein threading Given a set of threadings T *, the optimization problem is: min f T = min g 1 i, t i g 2 i, j, t i, t j T T * T T * i i [ j i = min g 1 i, t i g 2 i, j, t i, t j T T * i j i ] What's a lower bound we can use? i [ min g 1 i, x min g 2 i, j, y, z bi x d i j i bi y d i b j z d j ] Note this is determined by the interval [bi, di] that ti may fall in. - 41 -

Branch-and-bound for protein threading Now we must split solution space into disjoint sets. Do this by selecting largest current interval for a ti and cutting it in half. http://www.biostat.wisc.edu/bmi576/lecture16.pdf - 42 -

Branch-and-bound for protein threading http://www.biostat.wisc.edu/bmi576/lecture16.pdf - 43 -

CAFASP3 Example CAFASP: Critical Assessment of Fully Automated Structure Prediction CAFASP3 evaluated by MaxSub, a computer program. Predicted structures are superimposed to the experimental structures to see how long is superimposable. Red: Experimental Structure Blue: Correct Prediction Green: Incorrect Prediction http://www.bioinformatics.uwaterloo.ca/~j3xu/cs882/cs882-proteinstructureprediction.ppt - 44 -

Wrap-up Readings for next time: "The Invention of the Genetic Code," Brian Hayes, American Scientist, vol. 86, no. 1, Jan. Feb., 1998, pp. 8 14. "Ode to the Code," Brian Hayes, American Scientist, vol. 92, no. 6, Nov. Dec., 2004, pp. 494 498. (Both papers are in the Readings folder on Blackboard.) Remember: Come to class having done the readings. Check Blackboard regularly for updates. - 45 -