BIOINF 4120 Bioinformatics 2 - Structures and Systems


Oliver Kohlbacher, Summer 2014

3. RNA Structure, Part II

Overview
- RNA folding: free energy as a criterion; folding free energy of RNA
- Zuker-Stiegler algorithm: k-loops; free energy definition; dynamic program
- Folding by comparative analysis: conservation of structure vs. sequence; mutual information

RNA Folding
Problems of Nussinov's algorithm:
- All base pairs are considered equal; the stability of different base pairs is not accounted for.
- Stability is not determined by base pairs alone: adjacent bases in helices contribute to stability through base stacking.
[Figure: base stacking in DNA/RNA, http://dspace.jorum.ac.uk/xmlui/bitstream/handle/10949/956/items/s377_1_006i.jpg]

Free Energy of RNA Folding
The surroundings of a base pair influence its stability as well:
- Stacking of a base on an adjacent base stabilizes the structure.
- Loops, bulges, and interior loops destabilize the structure.
A more complete list of energetic contributions could thus look like this:
- Free energy of base pairing (stabilizing)
- Free energy of base stacking (stabilizing)
- Free energy of end loops (destabilizing)
- Free energy of interior loops (destabilizing)
- Free energy of bulges (destabilizing)

Free Energy of RNA Folding (continued)
Reasonable estimates for the free energies of the base pairs C-G, A-U, and G-U at 37 °C are -12 kJ/mol, -8 kJ/mol, and -4 kJ/mol, respectively. A simple definition of e(i, j) could, for example, assign these values to the pair formed by $s_i$ and $s_j$. The total energy E(s, P) of a sequence s folding into secondary structure P is then the sum of the base pair contributions:
$$E(s, P) = \sum_{(i,j) \in P} e(i, j)$$

Free Energy Minimization
- Nussinov's algorithm can easily be adapted to account for different base pair energies: replace δ(i, j) by an energy function e(i, j).
- The maximization of the number of base pairs then has to be turned into a free energy minimization.
- Fortunately, the algorithm can easily be adapted to minimization instead of maximization.
- The free energy minimization problem can still be solved easily using dynamic programming, with a recursion analogous to Nussinov's in which the maximum is replaced by a minimum.
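The adaptation of Nussinov's algorithm to energy minimization can be sketched as follows. This is a minimal sketch, not the lecture's own code: the function names `e` and `min_energy` are made up for illustration, the pair energies are the estimates quoted above, and assigning an energy of 0 to bases that cannot pair is an assumption.

```python
# Pair energies in kJ/mol at 37 °C, as quoted in the text.
PAIR_ENERGY = {
    frozenset("CG"): -12.0,
    frozenset("AU"): -8.0,
    frozenset("GU"): -4.0,
}


def e(s, i, j):
    """Energy of pairing s[i] with s[j]; assumption: 0 if they cannot pair."""
    return PAIR_ENERGY.get(frozenset((s[i], s[j])), 0.0)


def min_energy(s):
    """Minimum total pair energy of s (Nussinov recursion with min instead of max)."""
    n = len(s)
    if n == 0:
        return 0.0
    # E[i][j] = minimum energy of the subsequence s[i..j]; single bases cost 0.
    E = [[0.0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Cases 1 and 2: i unpaired or j unpaired.
            best = min(E[i + 1][j], E[i][j - 1])
            # Case 3: i pairs with j.
            inner = E[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
            best = min(best, inner + e(s, i, j))
            # Case 4: bifurcation into two independent substructures.
            for k in range(i + 1, j):
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1]
```

Like the original Nussinov algorithm, this sketch fills O(n^2) entries with O(n) work each; it also places no lower bound on hairpin size, which is one reason the resulting structures are not realistic.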

Free Energy Minimization (continued)
Unfortunately, even these generalizations of Nussinov's algorithm do not yield good structures. The algorithm does not account for
- the stabilizing effect of base stacking in stems,
- the destabilizing effect of loops.
More sophisticated approaches are required for this. However, we want to hold on to the idea of energy minimization, which is perfectly reasonable from a thermodynamic point of view. What we need are better energy functions, and these should still be efficiently computable.

Zuker-Stiegler Algorithm
In 1981, Zuker and Stiegler proposed a more sophisticated dynamic programming algorithm for RNA folding. The algorithm is based on a more sophisticated energy function that accounts for loops, stacked base pairs, and other secondary structure elements. The key idea of their algorithm is the decomposition of the structure into loops rather than base pairs. Their energy function is thus more complex and captures biochemical reality better.
M. Zuker, P. Stiegler, Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. (1981), 9:133-148.

Accessibility and Loops
Definition 1: If (i, j) is a base pair in secondary structure P and $i < h < j$, then we say that base h is accessible from (i, j) if there is no base pair $(i', j') \in P$ such that $i < i' < h < j' < j$. A base pair (k, l) is accessible from (i, j) if both k and l are accessible from (i, j).
Definition 2: The set of all bases accessible from a base pair $(i, j) \in P$ is called a loop. The size of the loop is the number of unpaired bases it contains.
[Figure: base pair (k, l) accessible from base pair (i, j)]
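Definitions 1 and 2 translate almost literally into code. The following is a small sketch under assumptions: a structure is represented as a list of base pairs (i, j) with i < j and 1-based positions, and the function names `accessible_bases` and `loop_size` are chosen here for illustration only.

```python
def accessible_bases(pairs, closing):
    """Bases h with i < h < j that are not enclosed by another pair inside (i, j)."""
    i, j = closing
    # Candidate blocking pairs (i', j') with i < i' and j' < j.
    inner = [(a, b) for (a, b) in pairs if i < a and b < j]
    return [h for h in range(i + 1, j)
            if not any(a < h < b for (a, b) in inner)]


def loop_size(pairs, closing):
    """Number of unpaired bases in the loop closed by the given base pair."""
    paired = {x for pair in pairs for x in pair}
    return sum(1 for h in accessible_bases(pairs, closing) if h not in paired)


# Example structure: a stem of two stacked pairs, an interior loop, and a hairpin.
structure = [(1, 10), (2, 9), (4, 7)]
```

For this example structure, the loop closed by (1, 10) contains only the bases 2 and 9 of the next pair and has size 0, the loop closed by (2, 9) contains the unpaired bases 3 and 8 in addition to the pair (4, 7), and the loop closed by (4, 7) contains the two unpaired bases 5 and 6.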

k-Loops
Definition 2 implies that base pairs can form loops of size 0.
[Figure: loop of size 0 closed by (i, j)]
Definition 3: The set l of all k-1 base pairs and k' unpaired bases that are accessible from (i, j) is called the k-loop closed by (i, j). The null k-loop $l_0$ consists of those single bases and base pairs that are accessible from no base pair.

k-Loops (continued)
Definition 4: Based on the above, we can define well-known secondary structure elements in terms of k-loops:
1. A hairpin loop is a 1-loop.
2. Let (k, l) be the base pair accessible from (i, j) in the 2-loop closed by (i, j). The 2-loop is then called a
   - stacked pair if k - i = 1 and j - l = 1,
   - bulge loop if k - i > 1 or j - l > 1, but not both,
   - interior loop if k - i > 1 and j - l > 1.
3. Multi-loops are k-loops with k > 2.
4. Dangling ends of a structure form a null k-loop.

k-Loops and Secondary Structures
[Figure: examples of k-loops closed by (i, j) with accessible base pairs (k, l); from D. Mount, Bioinformatics, p. 209]
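The classification in Definition 4 can be illustrated with a short function. This is a sketch under assumptions: the structure is a list of base pairs (i, j) with i < j, the closing pair is one of them, and the name `classify_loop` is made up for this example.

```python
def classify_loop(pairs, closing):
    """Name the k-loop closed by the base pair `closing` (Definition 4)."""
    i, j = closing
    inner = [(a, b) for (a, b) in pairs if i < a and b < j]

    def accessible(h):
        # h is accessible from (i, j) if no inner pair strictly encloses it.
        return not any(a < h < b for (a, b) in inner)

    # Base pairs accessible from (i, j): both ends must be accessible.
    acc_pairs = [(a, b) for (a, b) in inner if accessible(a) and accessible(b)]
    k = len(acc_pairs) + 1  # a k-loop contains k-1 accessible base pairs
    if k == 1:
        return "hairpin loop"
    if k > 2:
        return "multi-loop"
    p, q = acc_pairs[0]
    left_gap, right_gap = p - i > 1, j - q > 1
    if left_gap and right_gap:
        return "interior loop"
    if left_gap or right_gap:
        return "bulge loop"
    return "stacked pair"
```

Applied to every base pair of a structure, this yields one loop label per base pair; the null k-loop, which has no closing pair, is not covered by the sketch.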

k-Loop Decomposition
Observation: Any secondary structure P on a sequence $s = (s_1, s_2, \ldots, s_n)$ can be partitioned into k-loops $l_0, l_1, \ldots, l_m$, where $m > 0$ iff $P \neq \emptyset$.
This k-loop decomposition was first suggested by Sankoff et al. (1983). It decomposes the structure into individual loops and, given an energy function e(l) for k-loops, allows the total energy of the secondary structure P to be determined additively:
$$E(P) = \sum_{i=0}^{m} e(l_i)$$
Sankoff, D., Kruskal, J., Mainville, S., Cedergren, R., 1983. In: Sankoff, D., Kruskal, J. (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, pp. 93-120.

k-Loop Decomposition (continued)
[Figure: k-loop decomposition of a secondary structure; from Miklós et al., Bull. Math. Biol., 67 (2005), 1031-1047]

Observation: The number of non-null k-loops of a structure equals the number of base pairs it contains.
Miklós et al., Bull. Math. Biol., 67 (2005), 1031-1047.

k-Loop Energies
Only stacked base pairs yield a negative contribution to ΔG. We denote the energy of the stacked pairs (i, j) and (k, l) in the stacking loop closed by (i, j) as $e_s(i, j)$.

e_s    A/U    C/G    G/C    U/A    G/U    U/G
A/U   -0.9   -1.8   -2.3   -1.1   -1.1   -0.8
C/G   -1.7   -2.9   -3.4   -2.3   -2.1   -1.4
G/C   -2.1   -2.0   -2.9   -1.8   -1.9   -1.2
U/A   -0.9   -1.7   -2.1   -0.9   -1.0   -0.5
G/U   -0.5   -1.2   -1.4   -0.8   -0.4   -0.2
U/G   -1.0   -1.9   -2.1   -1.1   -1.5   -0.4
(free energies in kcal/mol at 37 °C)

All other k-loops contribute positive energies to ΔG.

size   interior loop   bulge   hairpin
1      -               3.9     -
2      4.1             3.1     -
3      5.1             3.5     4.1
4      4.9             4.2     4.9
5      5.3             4.8     4.4
6      6.3             5.5     5.3
(free energies in kcal/mol at 37 °C)

The full energy function for a secondary structure is then composed of the following contributions:
- $e_h(i, j)$, the energy of the hairpin loop closed by (i, j)
- $e_s(i, j)$, the energy of the stacked pair (i, j) and (i+1, j-1)
- $e_{bi}(i, j, k, l)$, the energy of the bulge or interior loop closed by (i, j) with (k, l) accessible from (i, j)
- $e_{ml}$, a constant energy associated with multi-loops

Zuker-Stiegler Algorithm
Input: An RNA sequence s of length n.
Output: A set of base pairings P describing a secondary structure of s of minimal free energy.
Given an energy function for k-loops, the Zuker-Stiegler algorithm finds a minimum free energy secondary structure for s through dynamic programming. In contrast to Nussinov's algorithm, the recursion is centered on k-loops, not on base pairs. The recursion is a bit more complicated and requires two DP matrices, V and W.

Zuker-Stiegler Algorithm: Matrices
W(i, j) denotes the minimum folding free energy of all non-empty foldings of the subsequence $s_i, \ldots, s_j$ for all i < j. Additionally, V(i, j) denotes the minimum folding free energy of all non-empty foldings of the subsequence $s_i, \ldots, s_j$ that contain the base pair (i, j). From these definitions, it is evident that the following relation holds:
$$W(i, j) \le V(i, j) \quad \text{for all } i, j$$
Both matrices are initialized as follows:
$$W(i, j) = V(i, j) = \infty \quad \text{for all } i, j \text{ with } j - 4 < i < j$$

Zuker-Stiegler Algorithm: Main Recursion
For all i, j with $1 \le i < j \le n$, W(i, j) is the minimum over the four well-known cases:
1. i is unpaired.
2. j is unpaired.
3. i and j are paired to each other (and thus close a k-loop). The best free energies for the k-loop come from matrix V.
4. i and j are possibly paired, but not to each other.
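Written out, the four cases give a recursion of the following shape. This is a reconstruction in the form commonly found in textbook treatments of the Zuker-Stiegler algorithm, not a transcription of the slide; the bifurcation index k is a name introduced here.

```latex
W(i,j) = \min
\begin{cases}
W(i+1,\, j) & \text{(1) $i$ is unpaired} \\
W(i,\, j-1) & \text{(2) $j$ is unpaired} \\
V(i,\, j)   & \text{(3) $i$ and $j$ are paired to each other} \\
\min\limits_{i < k < j-1} \bigl\{ W(i,k) + W(k+1,j) \bigr\}
            & \text{(4) $i$ and $j$ are possibly paired, but not to each other}
\end{cases}
```

Because case 3 is one of the alternatives in the minimum, the relation $W(i,j) \le V(i,j)$ follows directly.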

Zuker-Stiegler Algorithm: Energies for the Main Recursion
Deriving the energies for V(i, j) is fairly simple for the first two cases, hairpins and stacking pairs:
1. For a hairpin, we just add the energy $e_h(i, j)$ of the hairpin closed by (i, j).
2. For a stacking loop closed by (i, j), we add the energy of the stacking loop plus the energy of the remaining secondary structure closed by (i+1, j-1).
The other two cases are a bit more complicated.

Zuker-Stiegler Algorithm: Case 3 (Bulges and Interior Loops)
For this case we have to consider every possible way to define a bulge or interior loop closed by (i, j) with an accessible base pair (k, l).
[Figure: bulge/interior loop closed by (i, j) with accessible base pair (k, l)]
The energy $V_{BI}$ is then the minimum, over all possible bulges/interior loops, of the loop energy plus the energy of the secondary structure closed by (k, l).

Zuker-Stiegler Algorithm: Case 4 (Multi-Loops)
For multi-loops we consider the different ways to compose a multi-loop from two substructures. To account for the destabilizing effect of the multi-loop, we add a constant energy $e_{ML}$.
[Figure: multi-loop closed by (i, j), composed of two substructures]
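The four cases for V can be summarized in formulas. As before, this is a reconstruction in the form usually given for the Zuker-Stiegler algorithm and an assumption about what the slides showed; it applies when $s_i$ and $s_j$ can pair, and the names $V_{BI}$ and $V_M$ follow the complexity discussion in these notes.

```latex
\begin{align*}
V(i,j) &= \min
\begin{cases}
e_h(i,j) & \text{hairpin} \\
e_s(i,j) + V(i+1,\, j-1) & \text{stacked pair} \\
V_{BI}(i,j) & \text{bulge or interior loop} \\
V_M(i,j) & \text{multi-loop}
\end{cases} \\[1ex]
V_{BI}(i,j) &= \min_{\substack{i < k < l < j \\ (k-i) + (j-l) > 2}}
  \bigl\{ e_{bi}(i,j,k,l) + V(k,l) \bigr\} \\[1ex]
V_M(i,j) &= \min_{i < k < j-1}
  \bigl\{ W(i+1,\, k) + W(k+1,\, j-1) \bigr\} + e_{ML}
\end{align*}
```

The condition $(k-i) + (j-l) > 2$ excludes the stacked pair $k = i+1$, $l = j-1$, which is handled by its own case.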

Complexity of the Algorithm
Let us consider Nussinov's algorithm first:
  For l = 2 to n:
    For j = l to n:
      i = j - l + 1
From the above it is evident that the matrix has $n^2$ entries ($O(n^2)$ space) and that the computation of the fourth case takes at most $O(n)$ time per entry. The overall run time complexity is thus $O(n^3)$.

Complexity of the Algorithm (continued)
Now for the Zuker-Stiegler algorithm, which iterates over all i, j with $1 \le i < j \le n$:
- Matrices V and W have $O(n^2)$ entries.
- Computation of W takes $O(n^3)$ steps (same as Nussinov!).
- Computation of V takes $O(n^2)$ steps (without $V_{BI}$ and $V_M$!).
- Computing each of the $O(n^2)$ entries $V_{BI}(i, j)$ takes $O(n^2)$ time, resulting in $O(n^4)$ in total.
- Similarly, computing each entry $V_M(i, j)$ requires $O(n)$ time, resulting in $O(n^3)$ time in total.
The total time complexity of the Zuker-Stiegler algorithm is thus $O(n^4)$.
By limiting the size of bulges or interior loops to some fixed number d, usually about 30, the runtime can be reduced to $O(n^3)$. This can be done by restricting the search in the definition of $V_{BI}$.
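To make the loop structure behind these bounds concrete, here is a compact sketch of the two-matrix recursion with the loop-size limit d. It is a sketch under stated assumptions, not a reference implementation: the energy functions `e_h`, `e_s`, `e_bi` and the constant `E_ML` are hypothetical toy values rather than the lecture's parameters, positions are 0-based, and the recursion follows the commonly given form of the algorithm.

```python
import math

INF = math.inf
CAN_PAIR = {"AU", "UA", "GC", "CG", "GU", "UG"}


# Hypothetical toy energies (kcal/mol); NOT the parameters from the lecture.
def e_h(i, j):
    return 4.0 + 0.1 * (j - i - 1)  # hairpin closed by (i, j)


def e_s(i, j):
    return -3.0  # stacked pair (i, j), (i+1, j-1)


def e_bi(i, j, k, l):
    return 2.0 + 0.2 * ((k - i - 1) + (j - l - 1))  # bulge / interior loop


E_ML = 3.0  # constant multi-loop penalty


def zuker_stiegler(s, d=30):
    """Fill V and W; d bounds the number of unpaired bases in bulge/interior loops."""
    n = len(s)
    V = [[INF] * n for _ in range(n)]
    W = [[INF] * n for _ in range(n)]
    for span in range(4, n):  # entries with j - i < 4 stay infinite
        for i in range(n - span):
            j = i + span
            if s[i] + s[j] in CAN_PAIR:
                best = min(e_h(i, j), e_s(i, j) + V[i + 1][j - 1])
                # V_BI: at most O(d^2) candidate pairs (k, l) per entry.
                for k in range(i + 1, min(i + d + 1, j - 2) + 1):
                    left = k - i - 1
                    for l in range(max(k + 1, j - 1 - (d - left)), j):
                        if left + (j - l - 1) > 0:  # skip the stacked pair
                            best = min(best, e_bi(i, j, k, l) + V[k][l])
                # V_M: O(n) split points per entry.
                for k in range(i + 1, j - 1):
                    best = min(best, W[i + 1][k] + W[k + 1][j - 1] + E_ML)
                V[i][j] = best
            split = min(W[i][k] + W[k + 1][j] for k in range(i + 1, j))
            W[i][j] = min(W[i + 1][j], W[i][j - 1], V[i][j], split)
    return V, W
```

With d fixed, the two inner loops for V_BI do a constant amount of work per entry, so the O(n) split loops for V_M and W dominate and the total is O(n^3). Setting d to the sequence length removes the restriction and gives the O(n^4) behavior described above.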

Multi-Loops
Constant energy functions for multi-loops are an oversimplification. A more general energy function could look like this:
$$e_{ml} = a + b \cdot n_{unp} + c \cdot n_p$$
where a, b, c are constants and $n_{unp}$ and $n_p$ are the numbers of unpaired and paired bases in the multi-loop.
Similar to the notion of affine gap costs in sequence alignment, this type of multi-loop energy allows the construction of an efficient $O(n^3)$ algorithm. Over the years, numerous additions and modifications have been proposed to improve on this. These are, however, beyond the scope of this lecture.

Example
We can now try to fold a simple sequence:
AAACAUGAGGAUUACCCAUGU
Applying the Zuker-Stiegler algorithm results in the following structure:
[Figure: predicted secondary structure of the example sequence]

MFOLD Web Server
Michael Zuker provides a web server that can be used to fold RNA sequences. It uses a slightly modified algorithm and a different energy function; in this case, though, it provides us with the same structure.
http://mfold.rit.albany.edu/?q=mfold/rna-folding-form

Folding by Comparative Analysis
Another way to predict secondary structure is to look at the sequences of related structures. As with protein structures, RNA structure is highly conserved even when sequence similarity is completely gone. Comparative analysis of RNA sequences/structures can nevertheless reveal the common structure. The underlying mechanism that conserves structure is called compensatory change: in order to conserve the secondary structure, not one but two bases have to change.
[Figure: example of a compensatory base change]

Folding by Comparative Analysis (continued)
Analysis of sequence covariance in related RNAs can thus help to identify positions that form base pairs.
[Figure: sequence covariation in related RNAs; from Mount, Bioinformatics, p. 223]

Mutual Information
To measure the amount of correlation between two positions, one can use mutual information: if you tell me the base at position i, how much do I learn about the base at position j?
Consider the base frequencies in a given alignment:
- First, the frequencies $f_i(x)$ are computed for each column i and base x.
- Second, the 16 joint frequencies $f_{ij}(x, y)$ of two nucleotides, x in column i and y in column j, are computed.
- For each pair of columns (i, j) we compute the ratio
$$\frac{f_{ij}(x, y)}{f_i(x)\, f_j(y)}$$
If the base frequencies are independent of each other, then that ratio should be close to 1; otherwise it will be larger than 1.

Mutual Information (continued)
To calculate the mutual information H(i, j) in bits between the two columns i and j, the logarithm of this ratio, weighted by the joint frequency, is summed over all base combinations:
$$H(i, j) = \sum_{x, y} f_{ij}(x, y) \log_2 \frac{f_{ij}(x, y)}{f_i(x)\, f_j(y)}$$
- For RNA sequences, we expect a maximum of two bits if there is perfect correlation, and zero if the two columns are entirely independent.
- If either site is totally conserved, the mutual information is zero, because there is no covariance.
- Problem: what happens for $f_i(x) = 0$?

Mutual Information: Small Samples
To compensate for small sample size or unobserved bases, a so-called unbiased probability estimator replaces the frequencies; it depends on n, the number of sequences in the alignment. Mutual information is then computed from these estimates in place of the raw frequencies.
Chiu & Kolodziejczak, CABIOS 7 (1991), 347.

Mutual Information Example
Compute the mutual information:

1 2 3 4 5 6
C G C G A U
C G G C C G
C G C G G C
C G G C U A

$H_{1,2} = ?$   $H_{3,4} = ?$   $H_{5,6} = ?$
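The example can be checked with a few lines of code. This sketch assumes the standard definition of mutual information, with each log-ratio weighted by the joint frequency, and uses raw frequencies rather than the unbiased estimator; the function name `mutual_information` is chosen here for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information in bits between two alignment columns."""
    n = len(col_i)
    f_i = Counter(col_i)              # counts of single bases in column i
    f_j = Counter(col_j)              # counts of single bases in column j
    f_ij = Counter(zip(col_i, col_j))  # counts of joint base combinations
    h = 0.0
    # Only observed combinations are summed, so no log of zero occurs.
    for (x, y), count in f_ij.items():
        p_xy = count / n
        h += p_xy * log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return h


# The four aligned sequences from the example, one string per row.
alignment = ["CGCGAU", "CGGCCG", "CGCGGC", "CGGCUA"]
columns = list(zip(*alignment))

h12 = mutual_information(columns[0], columns[1])
h34 = mutual_information(columns[2], columns[3])
h56 = mutual_information(columns[4], columns[5])
```

For this alignment the sketch gives 0 bits for columns 1 and 2 (both totally conserved), 1 bit for columns 3 and 4 (two covarying pair types), and 2 bits for columns 5 and 6 (four covarying pair types), matching the expected range of zero to two bits.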

Mutual Information Example (continued)
An alignment of 1088 tRNAs taken from Rfam:

CGCG.GGAU.A.GAGCAGUC.UGGU...AGCUCG.U.CGGGC.UCAUAACCCG.AAG
GCCA.AAGU.A.GUUUAAU...GGU...AGAACA.A.UAAUU.UCAUGAAUUA.AGA
GUCC.CUUU.C.GUCCAGU...GGUU..AGGACA.U.CGUCU.UUUCAUGUCG.AAG
UGCA.AUAU.G.AUGUAAUU..GGUU..AACAUU.U.UAGGG.UCAUGACCUA.AUU
GUGA.AUUU.A.GUUUAAUA..GAU...AAAACA.U.UUGCU.UUGCAAGCAA.AAC
AGGG.GUUU.A.AGUUAA...UCU...AAACUA.A.AAGCC.UUCAAAGCUU.UAA
ACUU.UUAA.A.GGAUAGA...AGU...AAUCCA.U.UGGCC.UUAGGAGCCA.AAA
GUCU.CUGU.G.GCGCAAUC..GGUU..AGCGCG.U.UCGGC.UGUUAACCGA.AAG
[...]

Mutual Information Example: Consensus Structure
[Figure: consensus structure (from Rfam)]

Links: Web Sites
- Rfam, the Rfam database of RNA alignments and CMs: http://rfam.janelia.org
- NonCode, database of non-coding RNAs: http://www.noncode.org
- RNAdb, mammalian non-coding RNA database: http://research.imb.uq.edu.au/rnadb/
- Many more links at IMB Jena: http://www.rna.uni-jena.de/rna.php

Links: Web Servers
- Zuker's mfold server: http://frontend.bioinfo.rpi.edu/applications/mfold/
- Vienna RNA Secondary Structure Prediction: http://rna.tbi.univie.ac.at/cgi-bin/rnafold.cgi

Sources
- Kay Nieselt. Lecture "RNA Secondary Structure" from Algorithms in Bioinformatics.
- M. S. Waterman. Introduction to Computational Biology: Maps, sequences and genomes. Chapman & Hall, Boca Raton, 1995.
- D. W. Mount. Bioinformatics: Sequences and genome analysis, 2001.
- M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. (1981), 9(1): 133-148. (PMID: 6163133)