BIOINF 4120 Bioinformatics 2 - Structures and Systems

Oliver Kohlbacher, Summer 2014

3. RNA Structure, Part II

Overview

- RNA folding
  - Free energy as a criterion
  - Folding free energy of RNA
- Zuker-Stiegler algorithm
  - k-loops
  - Free energy definition
  - Dynamic program
- Folding by comparative analysis
  - Conservation of structure vs. sequence
  - Mutual information

RNA Folding

Problems of Nussinov's algorithm:
- All base pairs are considered equal; the differing stability of different base pairs is not accounted for.
- Stability is not determined by base pairs alone: adjacent bases in helices contribute to stability through base stacking.

[Figure: Base stacking in DNA/RNA; http://dspace.jorum.ac.uk/xmlui/bitstream/handle/10949/956/items/s377_1_006i.jpg]

Free Energy of RNA Folding

The surroundings of a base pair influence its stability as well:
- Stacking of a base with an adjacent base stabilizes the structure.
- Loops, bulges, and interior loops destabilize the structure.

A more complete list of energetic contributions could thus look like this:
- Free energy of base pairing (stabilizing)
- Free energy of base stacking (stabilizing)
- Free energy of end loops (destabilizing)
- Free energy of interior loops (destabilizing)
- Free energy of bulges (destabilizing)

Reasonable estimates for the free energies of the base pairs C-G, A-U, and G-U at 37 °C are -12 kJ/mol, -8 kJ/mol, and -4 kJ/mol, respectively. A simple definition of e(i, j) could, for example, assign these values to the pair formed by s_i and s_j.

The total energy E(s, P) of a sequence s folding into secondary structure P is then the sum of the base pair contributions:

$$E(s, P) = \sum_{(i, j) \in P} e(i, j)$$

Free Energy Minimization

- Nussinov's algorithm can easily be adapted to account for different base pair energies: replace δ(i, j) by an energy function e(i, j).
- The maximization of the number of base pairs then has to be turned into a free energy minimization. Fortunately, the algorithm is just as easily adapted to minimization.
- The free energy minimization problem can still be solved with dynamic programming. Writing E(i, j) for the minimum energy of the subsequence s_i, ..., s_j, one standard form of the recursion is:

$$E(i, j) = \min \begin{cases} E(i+1, j) \\ E(i, j-1) \\ E(i+1, j-1) + e(i, j) \\ \min_{i < k < j} \left[ E(i, k) + E(k+1, j) \right] \end{cases}$$
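The energy-minimizing variant of Nussinov's algorithm can be sketched in Python as follows. This is a minimal illustration, not the lecture's reference implementation: the names `PAIR_ENERGY`, `e`, and `min_energy` are introduced here, indices are 0-based, and two things are assumptions: base combinations other than C-G, A-U, and G-U are forbidden via an infinite energy, and no minimum hairpin length is enforced.

```python
import math

# Pair energies in kJ/mol at 37 °C, taken from the estimates quoted in the text.
PAIR_ENERGY = {
    frozenset("CG"): -12.0,
    frozenset("AU"): -8.0,
    frozenset("GU"): -4.0,
}


def e(a, b):
    """Energy of pairing bases a and b.

    Assumption: any other combination is forbidden (infinite energy).
    """
    return PAIR_ENERGY.get(frozenset((a, b)), math.inf)


def min_energy(s):
    """Minimum total base pair energy of sequence s (energy only, no traceback)."""
    n = len(s)
    if n == 0:
        return 0.0
    # E[i][j] = minimum energy of s[i..j]; entries with i >= j stay 0 (nothing to pair).
    E = [[0.0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            best = min(
                E[i + 1][j],                        # i unpaired
                E[i][j - 1],                        # j unpaired
                E[i + 1][j - 1] + e(s[i], s[j]),    # i paired with j
            )
            for k in range(i + 1, j):               # bifurcation into two substructures
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1]
```

For instance, `min_energy("GAUC")` combines an outer G-C pair with an inner A-U pair. Because adjacent bases are allowed to pair in this sketch, the resulting structures are not physically meaningful; it only illustrates the change from maximization to minimization.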

Free Energy Minimization

Unfortunately, even these generalizations of Nussinov's algorithm do not yield good structures. The algorithm does not account for
- the stabilizing effect of base stacking in stems, or
- the destabilizing effect of loops.

More sophisticated approaches are required for this. However, we want to hold on to the idea of energy minimization, which is perfectly reasonable from a thermodynamic point of view. What we need are better energy functions that are still efficiently computable.

Zuker-Stiegler Algorithm

In 1981, Zuker and Stiegler proposed a more sophisticated dynamic programming algorithm for RNA folding. It is based on an energy function that accounts for loops, stacked base pairs, and other secondary structure elements. The key idea is to decompose the structure into loops rather than base pairs. The energy function is thus more complex and captures biochemical reality better.

M. Zuker, P. Stiegler, Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. (1981), 9:133-148

Accessibility and Loops

Definition 1: If (i, j) is a base pair in secondary structure P and i < h < j, then base h is accessible from (i, j) if there is no base pair (i', j') ∈ P such that i < i' < h < j' < j. A base pair (k, l) is accessible from (i, j) if both k and l are accessible from (i, j).

Definition 2: The set of all bases accessible from a base pair (i, j) ∈ P is called a loop. The size of the loop is the number of unpaired bases it contains.

[Figure: base pair (k, l) accessible from base pair (i, j)]

k-Loops

Definition 2 implies that base pairs can form loops of size 0.

[Figure: loop of size 0 closed by (i, j)]

Definition 3: The set l of all k-1 base pairs and k' unpaired bases that are accessible from (i, j) is called the k-loop closed by (i, j). The null k-loop l_0 consists of those single bases and base pairs that are accessible from no base pair.

Definition 4: Based on the above, we can define well-known secondary structure elements in terms of k-loops:
1. A hairpin loop is a 1-loop.
2. Let (k, l) be the pair accessible from the 2-loop closed by (i, j). The 2-loop is then called a
   - stacked pair if k - i = 1 and j - l = 1,
   - bulge loop if k - i > 1 or j - l > 1, but not both, and
   - interior loop if k - i > 1 and j - l > 1.
3. Multi-loops are k-loops for k > 2.
4. Dangling ends of a structure form a null k-loop.

[Figure: k-loops and secondary structure elements; D. Mount, Bioinformatics, p. 209]
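The case distinction for 2-loops in Definition 4 can be written down almost literally. The following is a small sketch; the function names `classify_two_loop` and `two_loop_size` are introduced here for illustration only.

```python
def classify_two_loop(i, j, k, l):
    """Classify the 2-loop closed by (i, j) whose accessible pair is (k, l)."""
    if not (i < k < l < j):
        raise ValueError("expected i < k < l < j")
    left_gap = k - i > 1     # unpaired bases between i and k
    right_gap = j - l > 1    # unpaired bases between l and j
    if left_gap and right_gap:
        return "interior loop"
    if left_gap or right_gap:
        return "bulge loop"
    return "stacked pair"


def two_loop_size(i, j, k, l):
    """Loop size: number of unpaired bases accessible from (i, j)."""
    return (k - i - 1) + (j - l - 1)
```

A stacked pair is thus exactly the 2-loop of size 0, which is why Definition 2 has to admit loops without unpaired bases.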

k-Loop Decomposition

Observation: Any secondary structure P on a sequence s = (s_1, s_2, ..., s_n) can be partitioned into k-loops l_0, l_1, ..., l_m, where m > 0 iff P ≠ ∅.

This k-loop decomposition was first suggested by Sankoff et al. (1983). Given an energy function e(l) for k-loops, it allows the total energy of the secondary structure P to be determined additively from its individual loops:

$$E(P) = \sum_{i=0}^{m} e(l_i)$$

Sankoff, D., Kruskal, J., Mainville, S., Cedergren, R., 1983. In: Sankoff, D., Kruskal, J. (Eds.), Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, pp. 93-120.

[Figure: k-loop decomposition of a secondary structure; Miklós et al., Bull. Math. Biol., 67 (2005), 1031-1047]

Observation: The number of non-null k-loops of a structure equals the number of base pairs it contains.

Miklós et al., Bull. Math. Biol., 67 (2005), 1031-1047.
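The decomposition can be made concrete with a short Python sketch. It assumes the structure is given in dot-bracket notation (an input format chosen here for convenience, not one used in the slides) with 0-based positions; `parse_pairs`, `scan`, and `k_loops` are hypothetical helper names.

```python
def parse_pairs(db):
    """Map each opening position to its closing partner in a dot-bracket string."""
    stack, partner = [], {}
    for pos, ch in enumerate(db):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            partner[stack.pop()] = pos
    return partner


def scan(partner, start, end):
    """Collect what is accessible in positions start..end-1.

    Returns (accessible base pairs, number of accessible unpaired bases).
    The interior of every accessible pair is skipped, since it is not accessible.
    """
    pairs, unpaired, h = [], 0, start
    while h < end:
        if h in partner:
            pairs.append((h, partner[h]))
            h = partner[h] + 1
        else:
            unpaired += 1
            h += 1
    return pairs, unpaired


def k_loops(db):
    """Decompose a structure into its k-loops.

    Returns (loops, null_loop): loops maps each closing pair (i, j) to
    (k, loop size); null_loop lists what is accessible from no base pair.
    """
    partner = parse_pairs(db)
    loops = {}
    for i, j in partner.items():
        pairs, unpaired = scan(partner, i + 1, j)
        loops[(i, j)] = (len(pairs) + 1, unpaired)   # k - 1 accessible pairs
    return loops, scan(partner, 0, len(db))
```

Because the dictionary has exactly one entry per base pair, the observation that the number of non-null k-loops equals the number of base pairs holds by construction. For `".((...)(...)).."` the outer pair closes a 3-loop (a multi-loop) of size 0, the two inner pairs close hairpin loops of size 3, and the null loop contains the outer pair plus three dangling bases.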

k-Loop Energies

Only stacked base pairs yield a negative contribution to ΔG. We denote the energy of stacked pairs (i, j) and (k, l) in the stacking loop closed by (i, j) as e_s(i, j).

e_s     A/U    C/G    G/C    U/A    G/U    U/G
A/U    -0.9   -1.8   -2.3   -1.1   -1.1   -0.8
C/G    -1.7   -2.9   -3.4   -2.3   -2.1   -1.4
G/C    -2.1   -2.0   -2.9   -1.8   -1.9   -1.2
U/A    -0.9   -1.7   -2.1   -0.9   -1.0   -0.5
G/U    -0.5   -1.2   -1.4   -0.8   -0.4   -0.2
U/G    -1.0   -1.9   -2.1   -1.1   -1.5   -0.4

(free energies in kcal/mol at 37 °C)

All other k-loops contribute positive energies to ΔG.

size   interior loop   bulge   hairpin
1      -               3.9     -
2      4.1             3.1     -
3      5.1             3.5     4.1
4      4.9             4.2     4.9
5      5.3             4.8     4.4
6      6.3             5.5     5.3

(free energies in kcal/mol at 37 °C)

The full energy function for a secondary structure is then composed of the following contributions:
- e_h(i, j), the energy of the hairpin loop closed by (i, j)
- e_s(i, j), the energy of the stacked pair (i, j) and (i+1, j-1)
- e_bi(i, j, k, l), the energy of the bulge or interior loop closed by (i, j) with (k, l) accessible from (i, j)
- e_ml, a constant energy associated with multi-loops

Zuker-Stiegler Algorithm

Input: A sequence s ∈ Σ_RNA^n
Output: A set of base pairings P describing a secondary structure of s of minimal free energy.

Given an energy function for k-loops, the Zuker-Stiegler algorithm finds a minimum free energy secondary structure for s through dynamic programming. In contrast to Nussinov's algorithm, the recursion is centered on k-loops, not on base pairs. It is a bit more complicated and requires two DP matrices, V and W:
- W(i, j) denotes the minimum folding free energy of all non-empty foldings of the subsequence s_i, ..., s_j for all i < j.
- V(i, j) denotes the minimum folding free energy of all non-empty foldings of the subsequence s_i, ..., s_j containing the base pair (i, j).

Since every folding counted in V(i, j) is also counted in W(i, j), the following relation holds:

$$W(i, j) \le V(i, j) \quad \text{for all } i, j$$

Both matrices are initialized as follows:

$$W(i, j) = V(i, j) = \infty \quad \text{for all } i, j \text{ with } j - 4 < i < j$$

Main recursion: for all i, j with 1 ≤ i < j ≤ n:

$$W(i, j) = \min \begin{cases} W(i+1, j) \\ W(i, j-1) \\ V(i, j) \\ \min_{i < k < j} \left[ W(i, k) + W(k+1, j) \right] \end{cases}$$

We consider the four well-known cases:
1. i is unpaired.
2. j is unpaired.
3. i and j are paired to each other (and thus close a k-loop). The best free energies for the k-loop come from matrix V.
4. i and j are possibly paired, but not to each other.

Zuker-Stiegler Algorithm

Energies for the main recursion:

$$V(i, j) = \min \begin{cases} e_h(i, j) \\ e_s(i, j) + V(i+1, j-1) \\ V_{BI}(i, j) \\ V_M(i, j) \end{cases}$$

Deriving the energies is fairly simple for the first two cases, hairpins and stacking pairs:
1. For a hairpin, we just add the energy e_h(i, j) of the hairpin closed by (i, j).
2. For a stacking loop closed by (i, j), we add the energy of the stacking loop plus the energy of the remaining secondary structure closed by (i+1, j-1).

The other two cases are a bit more complicated.

Case 3: bulges and interior loops

For this case we have to consider every possible way to define a bulge or interior loop closed by (i, j) with inner pair (k, l).

[Figure: bulge or interior loop closed by (i, j) with accessible pair (k, l)]

The energy V_BI is then the minimum, over all possible bulges and interior loops, of the loop energy plus the energy of the secondary structure closed by (k, l):

$$V_{BI}(i, j) = \min_{\substack{i < k < l < j \\ (k - i) + (j - l) > 2}} \left[ e_{bi}(i, j, k, l) + V(k, l) \right]$$

Case 4: multi-loops

For multi-loops we consider the different ways to compose a multi-loop from two substructures. To account for the destabilizing effect of the multi-loop, we add a constant energy e_ml:

$$V_M(i, j) = \min_{i < k < j-1} \left[ W(i+1, k) + W(k+1, j-1) \right] + e_{ml}$$
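Putting the W and V recursions together gives the following Python sketch. It is meant only to show how the two matrices interact and returns the energy without a traceback. The energies are deliberately simplified toy constants (assumptions, not the measured values from the tables), only canonical and G-U pairs may form, indices are 0-based, and the names `zuker_mfe`, `E_STACK`, `E_HAIRPIN`, `E_BI`, `E_ML`, and `max_loop` are introduced here.

```python
import math

INF = math.inf
CANONICAL = {"AU", "UA", "CG", "GC", "GU", "UG"}

# Hypothetical toy energies in kcal/mol; real parameters depend on loop size and sequence.
E_STACK = -3.0    # any stacked pair
E_HAIRPIN = 4.0   # any hairpin loop with at least 3 unpaired bases
E_BI = 3.0        # any bulge or interior loop
E_ML = 4.0        # constant multi-loop penalty


def zuker_mfe(s, max_loop=30):
    """Minimum free energy over all non-empty foldings of s (inf if none exists)."""
    n = len(s)
    if n == 0:
        return INF
    # Initialization: W = V = inf everywhere, in particular for j - 4 < i < j.
    W = [[INF] * n for _ in range(n)]
    V = [[INF] * n for _ in range(n)]
    for span in range(4, n):                  # span = j - i
        for i in range(n - span):
            j = i + span
            if s[i] + s[j] in CANONICAL:
                best = E_HAIRPIN                               # hairpin closed by (i, j)
                best = min(best, E_STACK + V[i + 1][j - 1])    # stacked pair
                # Bulges and interior loops with at most max_loop unpaired bases.
                for k in range(i + 1, min(j, i + max_loop + 2)):
                    left = k - i - 1
                    for l in range(max(k + 1, j - 1 - (max_loop - left)), j):
                        if left + (j - l - 1) > 0:             # size 0 is the stacked pair
                            best = min(best, E_BI + V[k][l])
                # Multi-loop composed of two substructures.
                for k in range(i + 1, j - 1):
                    best = min(best, W[i + 1][k] + W[k + 1][j - 1] + E_ML)
                V[i][j] = best
            w = min(W[i + 1][j], W[i][j - 1], V[i][j])
            for k in range(i + 1, j):                          # i and j not paired to each other
                w = min(w, W[i][k] + W[k + 1][j])
            W[i][j] = w
    return W[0][n - 1]
```

With these toy values, `"GGGAAACCC"` folds into a three-pair stem closing a hairpin of size 3: one hairpin (+4.0) plus two stacked pairs (2 × -3.0) gives -2.0. Following the definition of W as a minimum over non-empty foldings, a sequence that cannot form any pair yields infinity rather than 0.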

Complexity of the Algorithm

Let us consider Nussinov's algorithm first:

For l = 2 to n:
    For j = l to n:
        i = j - l + 1
        evaluate the recursion for entry (i, j)

From the above it is evident that the matrix has n² entries (O(n²) space) and that the computation of the fourth case takes at most O(n) time per entry. The overall run time complexity is thus O(n³).

Now for the Zuker-Stiegler algorithm, which evaluates its recursions for all i, j with 1 ≤ i < j ≤ n:
- Matrices V and W have O(n²) entries.
- Computation of W takes O(n³) steps (same as Nussinov!).
- Computation of V takes O(n²) steps (without V_BI and V_M!).
- Each of the O(n²) possible V_BI(i, j) takes O(n²) time, resulting in O(n⁴) in total.
- Similarly, each V_M(i, j) requires O(n) time, resulting in O(n³) in total.

The total time complexity of the Zuker-Stiegler algorithm is thus O(n⁴).

By limiting the size of bulges and interior loops to some fixed number d, usually about 30, the run time can be reduced to O(n³). This is done by restricting the search in the definition of V_BI.
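The effect of the size limit d can be made explicit with a short count. This is a sketch under the assumption that an inner pair (k, l) is described by the number of unpaired bases on each side of the loop, a = k - i - 1 and b = j - l - 1 (symbols introduced here for illustration):

```latex
\begin{align*}
\#\{(a, b) \in \mathbb{N}_0^2 : 1 \le a + b \le d\}
  &= \frac{(d+1)(d+2)}{2} - 1 = O(d^2) \\
\text{cost of all } V_{BI}(i, j)
  &= O(n^2) \cdot O(d^2) = O(n^2 d^2) \\
\text{total run time}
  &= O(n^3) + O(n^2 d^2) = O(n^3) \quad \text{for constant } d
\end{align*}
```

The subtracted 1 removes the case a = b = 0, which is the stacked pair and is handled separately. For constant d the V_BI term no longer dominates, and the O(n³) terms from W and V_M determine the run time.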

Multi-Loops

Constant energy functions for multi-loops are an oversimplification. A more general energy function could look like this:

$$e_{ml} = a + b \cdot n_{unp} + c \cdot n_p$$

where a, b, c are constants and n_unp and n_p are the numbers of unpaired and paired bases in the multi-loop.

Similar to the notion of affine gap costs in sequence alignment, this type of multi-loop energy still allows the construction of an efficient O(n³) algorithm. Over the years, numerous additions and modifications have been proposed to improve on this. These are, however, beyond the scope of this lecture.

Example

We can now try to fold a simple sequence:

AAACAUGAGGAUUACCCAUGU

Applying the Zuker-Stiegler algorithm results in the following structure:

[Figure: predicted secondary structure]

MFOLD Web Server

Michael Zuker provides a web server that can be used to fold RNA sequences. It uses a slightly modified algorithm and a different energy function; in this case, though, it yields the same structure.

http://mfold.rit.albany.edu/?q=mfold/rna-folding-form

Folding by Comparative Analysis

Another way to predict secondary structure is to look at the sequences of related structures. As with protein structures, RNA structure can be highly conserved even when sequence similarity is completely gone. Comparative analysis of RNA sequences and structures can nevertheless reveal the common structure.

The underlying mechanism that conserves structure is called compensatory change: in order to conserve the secondary structure, not one but two bases have to change.

[Figure: example of a compensatory base change]

Analysis of sequence covariance in related RNAs can thus help to identify positions that form base pairs.

[Figure: sequence covariance in related RNAs; Mount, Bioinformatics, p. 223]

Mutual Information

To measure the amount of correlation between two positions, one can use mutual information: if you tell me the base at position i, how much do I learn about the base at position j?

Consider the base frequencies in a given alignment:
- First, the frequencies f_i(x) are computed for each column i and base x.
- Second, the 16 joint frequencies f_ij(x, y) of nucleotide x in column i and nucleotide y in column j are computed.
- For each pair of columns (i, j) we compute the ratio

$$\frac{f_{ij}(x, y)}{f_i(x) \, f_j(y)}$$

If the bases in the two columns are independent of each other, this ratio should be close to 1. Otherwise it deviates from 1, and it is larger than 1 for base combinations that co-occur more often than expected.

Mutual Information

To calculate the mutual information H(i, j) in bits between the two columns i and j, the logarithm of this ratio, weighted by the joint frequency, is summed over all base combinations:

$$H(i, j) = \sum_{x, y} f_{ij}(x, y) \log_2 \frac{f_{ij}(x, y)}{f_i(x) \, f_j(y)}$$

- For RNA sequences, we expect a maximum of two bits if there is perfect correlation and zero if the two columns are entirely independent.
- If either site is totally conserved, the mutual information is zero, because there is no covariance.
- Problem: what happens for f_i(x) = 0?

To compensate for small sample sizes or unobserved bases, a so-called unbiased probability estimator replaces the frequencies; it depends on n, the number of sequences in the alignment. Mutual information is then computed as before, with the estimated probabilities in place of the raw frequencies.

Chiu & Kolodziejczak, CABIOS 7 (1991), 347

Mutual Information Example

Compute the mutual information:

1 2 3 4 5 6
C G C G A U
C G G C C G
C G C G G C
C G G C U A

H_1,2 = ?
H_3,4 = ?
H_5,6 = ?
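The example can be checked with a few lines of Python. This sketch uses the raw frequencies, not the unbiased estimator, and sidesteps the f_i(x) = 0 problem by summing only over base combinations that actually occur (the usual convention that 0 · log 0 = 0, which is an assumption here). The names `mutual_information`, `alignment`, and `columns` are introduced for illustration.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information in bits between two alignment columns of equal length."""
    n = len(col_i)
    f_i = Counter(col_i)                  # counts of bases in column i
    f_j = Counter(col_j)                  # counts of bases in column j
    f_ij = Counter(zip(col_i, col_j))     # counts of observed base combinations
    h = 0.0
    for (x, y), count in f_ij.items():    # unobserved combinations contribute nothing
        p_xy = count / n
        h += p_xy * math.log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return h


alignment = ["CGCGAU", "CGGCCG", "CGCGGC", "CGGCUA"]
columns = list(zip(*alignment))           # columns[0] is alignment column 1

h_12 = mutual_information(columns[0], columns[1])   # both columns conserved: 0 bits
h_34 = mutual_information(columns[2], columns[3])   # two covarying pairs: 1 bit
h_56 = mutual_information(columns[4], columns[5])   # four covarying pairs: 2 bits
```

Columns 1 and 2 always pair but never vary, so they carry no mutual information, whereas columns 5 and 6 show four different base pairs in four sequences and reach the maximum of two bits.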

Mutual Information Example

An alignment of 1088 tRNAs taken from Rfam:

CGCG.GGAU.A.GAGCAGUC.UGGU...AGCUCG.U.CGGGC.UCAUAACCCG.AAG
GCCA.AAGU.A.GUUUAAU...GGU...AGAACA.A.UAAUU.UCAUGAAUUA.AGA
GUCC.CUUU.C.GUCCAGU...GGUU..AGGACA.U.CGUCU.UUUCAUGUCG.AAG
UGCA.AUAU.G.AUGUAAUU..GGUU..AACAUU.U.UAGGG.UCAUGACCUA.AUU
GUGA.AUUU.A.GUUUAAUA..GAU...AAAACA.U.UUGCU.UUGCAAGCAA.AAC
AGGG.GUUU.A.AGUUAA...UCU...AAACUA.A.AAGCC.UUCAAAGCUU.UAA
ACUU.UUAA.A.GGAUAGA...AGU...AAUCCA.U.UGGCC.UUAGGAGCCA.AAA
GUCU.CUGU.G.GCGCAAUC..GGUU..AGCGCG.U.UCGGC.UGUUAACCGA.AAG
[...]

[Figure: consensus structure (from Rfam)]

Links

Web sites
- Rfam, the Rfam database of RNA alignments and CMs: http://rfam.janelia.org
- NonCode, database of non-coding RNAs: http://www.noncode.org
- RNAdb, mammalian non-coding RNA database: http://research.imb.uq.edu.au/rnadb/
- Many more links at IMB Jena: http://www.rna.uni-jena.de/rna.php

Webservers
- Zuker's mfold server: http://frontend.bioinfo.rpi.edu/applications/mfold/
- Vienna RNA secondary structure prediction: http://rna.tbi.univie.ac.at/cgi-bin/rnafold.cgi

Sources

- Kay Nieselt, Lecture "RNA Secondary Structure" from Algorithms in Bioinformatics
- M. S. Waterman. Introduction to Computational Biology: Maps, sequences and genomes. Chapman & Hall, Boca Raton, 1995
- D. W. Mount. Bioinformatics: Sequences and genome analysis, 2001
- M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl. Acids Res. (1981), 9(1): 133-148. (PMID: 6163133)