Boltzmann probability of RNA structural neighbors and riboswitch detection

Eva Freyhult 1, Vincent Moulton 2, Peter Clote 3,*

1 Linnaeus Centre for Bioinformatics, University of Uppsala, Sweden, eva.freyhult@lcb.uu.se.
2 School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK, vincent.moulton@cmp.uea.ac.uk.
3 Department of Biology, Boston College, Chestnut Hill, MA 02467, USA, clote@bc.edu.
* Corresponding author

Abstract

Given an RNA nucleotide sequence $s$, let $S_0$ be any secondary structure of $s$. $S_0$ could be the minimum free energy structure of $s$, the secondary structure obtained by analysis of the X-ray structure or by comparative sequence analysis, or an arbitrary intermediate structure. Another secondary structure $S$ of $s$ is called a $\delta$-neighbor of $S_0$ if $S$ and $S_0$ differ by exactly $\delta$ base pairs. Here we describe a new software package, RNAbor, to compute the number $N^\delta$, the Boltzmann partition function $Z^\delta$ and the minimum free energy structure $\mathrm{MFE}^\delta$ over the collection of all $\delta$-neighbors of $S_0$. This computation is done simultaneously for all $\delta \le m$, in run time $O(mn^3)$ and memory $O(mn^2)$. Our novel algorithms depend on a new manner of partitioning the space of secondary structures, nested multiple recursions and dynamic programming. Computations are done with the Turner nearest neighbor energy parameters, known for their success in ab initio secondary structure prediction. We apply RNAbor to automatic detection of possible RNA conformational switches, and compare RNAbor to existing switch detection methods. Public access to our software, RNAbor (RNA neighbor), is provided by a web server.

1 Introduction

In the last few years, there has been intense interest in RNA due to the surprising, previously unsuspected roles played by ribonucleic acid in what until now has been a predominantly protein-centric view of molecular biology.
Apart from its roles as messenger RNA and transfer RNA, ribonucleic acid molecules play a catalytic role in the peptidyltransferase reaction in peptide bond formation (??) and in intron splicing (?), both examples of enzymatic RNAs now termed ribonucleic enzymes or ribozymes (?). RNA plays a role in post-transcriptional gene regulation through the hybridization of mRNA by small interfering RNAs (siRNA) (??) and microRNAs (miRNA) (?). By completely different means, RNA performs transcriptional and translational gene regulation by allostery, where a portion of the 5′ untranslated region (5′ UTR) of mRNA known as a riboswitch (??) can undergo a conformational change upon binding a specific ligand such as adenine, guanine, lysine, etc. RNA is known to play critical roles in various other cellular mechanisms including dosage compensation (?), protein shuttling (?), retranslation events such as selenocysteine insertion (?) and ribosomal frameshift (??), etc. Illustrative of the growing recognition of the importance of RNA, the 2006 Nobel Prize in Physiology or Medicine was awarded to A.Z. Fire and C.C. Mello for their discovery of RNA interference and gene silencing by double-stranded RNA.

The function of noncoding RNA [1] and cis-regulatory motifs depends on the RNA tertiary structure, which is largely determined by secondary structure involving Watson-Crick and GU wobble base pair stacking, hairpin loops, bulges, interior loops and multiloops (?). For this reason, various groups have tried to develop a noncoding RNA (ncRNA) genefinder based on the fact that ncRNA has been shown to have lower folding energy than random RNA (???). One of the most successful such algorithms is the moving window ncRNA genefinder RNAz, developed by Washietl, Hofacker and Stadler (?). RNAz uses comparative genomics and a support vector machine to detect whether the current window contents structurally align with known ncRNAs and have lower folding energy than random RNA. [2]

In this paper, we develop novel and efficient algorithms to compute both the number $N^\delta(s, S_0)$ and the partition function $Z^\delta(s, S_0) = \sum_S \exp(-E(S)/RT)$ of all $\delta$-neighbors $S$ of $(s, S_0)$, where $E(S)$ denotes the energy of $S$ with respect to the Turner nearest neighbor energy model, $R$ is the universal gas constant, and $T$ is the absolute temperature in kelvin.
Our software, called RNAbor (RNA neighbor), additionally computes graphs of the probability density function $p_\delta = Z^\delta/Z$ as a function of $\delta$. As shown in various figures, we can computationally detect low energy secondary structures which differ structurally from $S_0$. In cases where there is too little sequence identity with known examples of ncRNA for application of programs such as RNAz (?) and Dynalign (?), our software RNAbor shows some promise as a tool capable of ab initio detection of riboswitches and pseudoknots, a topic to be explored in future work.

RNAbor was motivated by Moulton et al. (?), who suggested that the stability of a secondary structure might depend on the number of structural neighbors at varying distances from the given structure, for instance from the minimum free energy (MFE) structure. It turns out that the number of structural neighbors at varying distances is not sufficient to distinguish between structural RNA and random RNA having the same mono- and dinucleotide frequencies; see Figure ??. However, we do see a distinction when computing a weighted count of structural neighbors, where low energy structures are more heavily weighted. Formally, this is the Boltzmann partition function with respect to all structural neighbors at a given base pair distance $\delta$. Figure ?? displays a density plot produced by RNAbor which clearly suggests alternate low energy secondary structures with radically different topologies.

[1] Noncoding RNA (ncRNA) is transcribed but does not code for a protein. Examples of ncRNA are tRNA, rRNA, riboswitches, ribozymes, miRNA, etc.

[2] Rather than randomizing the RNA sequence in the current window, a multiple sequence alignment of the moving window contents with known ncRNAs is randomized (?) by column permutation. In contrast to randomization of single RNA sequences, the Z-scores thus obtained are statistically significant.

Figure 1: (a) Number of neighbors. (b) Density plot. The number and probability of neighboring structures at varying distances from the minimum free energy (MFE) structure of the precursor microRNA dme-mir-1 (AE / ) from Drosophila melanogaster (?) (solid line) and a random RNA having the same secondary structure (dash-dotted line). The random RNA was obtained by applying RNAinverse with the MFE structure of dme-mir-1 as input structure. In unshown data, we have applied RNAbor to random RNA generated by the Altschul-Erikson algorithm (?), which preserves the mono- and dinucleotide frequencies of the given RNA. (See Workman and Krogh (?) for an argument why random RNA should be generated so as to preserve dinucleotide frequency, and see (??) for data supporting the fact that structural RNA has lower folding energy than random RNA.)

The plan of this paper is as follows. In section ??, we present graphs of the number $N^\delta$ and the Boltzmann probability density $p_\delta = Z^\delta/Z$ of structural neighbors, which differ by $\delta$ base pairs from a given secondary structure. We compare the output of our program RNAbor with the program paRNAss of Giegerich and co-workers (?) when run on a wide variety of example RNAs, some known switches and some known non-switches. In section ??, we discuss the data presented, and in section ??, we describe the algorithms to compute the number $N^\delta$, the Boltzmann partition function $Z^\delta$ and the minimum free energy structure $\mathrm{MFE}^\delta$ over the collection of all $\delta$-neighbors. Pseudocode for our implementation is presented in the appendix. Additional data obtained by running RNAbor on all SAM riboswitches from Rfam (?) is available in the web supplement bioinformatics.bc.edu/clotelab/rnabor/websupplement.
2 Results

In this section, we present probability density graphs for a variety of conformational switches and for some non-switches. Additional data is provided for all SAM riboswitches at the web supplement bioinformatics.bc.edu/clotelab/rnabor/websupplement/. We additionally compare RNAbor with the web server paRNAss (?), which uses a heuristic to determine whether there appear to be two or more clusters of distinct secondary structures for a given RNA sequence. In (???) Giegerich and co-workers compute the Boltzmann partition function for RNAshapes, which are classes of secondary structures having the same topology. For instance, [[][][]] is the shape of the usual cloverleaf secondary structure for tRNA. The data we present compares the output of RNAbor with that of paRNAss and RNAshapes.

2.1 Detecting conformational switches

In this section, we define a conformational switch to be an RNA sequence which has exactly two distinct low energy secondary structures. By multi-switch we mean an RNA sequence which can adopt two or more distinct low energy secondary structures. For a given RNA sequence $s = s_1, \ldots, s_n$ and secondary structure $S_0$ of $s$, we use RNAbor to compute $p_\delta(s, S_0) = Z^\delta(s, S_0)/Z(s, S_0)$. Taking $S_0$ to be the minimum free energy structure, or alternatively the structure determined by comparative sequence alignment (?), our intuition is that a conformational switch should display a bimodal probability density graph.

To illustrate the behavior of a typical conformational switch, we consider the known 47 nt. switch (?) with EMBL accession number AE / and sequence GUGACUGCAA UGCUAUUUGA GUAUCCUGAA AACGGGCUUU UCAGAAU. This conformational switch, which involves a pseudoknotted structure, surrounds the bacterial alpha operon ribosome binding site and can fold into two distinct structures as illustrated in Figure ??. These are at a base-pair distance of 23 from each other and their energies, as computed by RNAfold -d2, are kcal/mol and kcal/mol, respectively. Figure ?? shows the density plot, i.e.
the probability $p_\delta = Z^\delta/Z$ of finding a structure at a distance $\delta$ from the input structure, which in this example is the MFE structure

...((((((...((((...))))...))))))..

the structure shown in Figure ??(A). Figure ?? depicts a similar bimodal density graph for the artificially engineered bistable switch CUUAUGAGGGUACUCAUAAGAGUAUCC of Flamm et al. (?).

In Figures ?? and ??, we analyze the 76 nt. conformational switch with PDB ID 1SJ3:R (?), which controls hepatitis delta virus ribozyme catalysis. The sequence and secondary structure of this switch are as follows. [3]

GAUGGCCGGCAUGGUCCCAGCCUCCUCGCUGGCGCCGGCUGGGCAACACCAUUGCACUCCGGUGGUGAAUGGGACU
...((((((...[.[[[.(((...)))))))))...((.((...)).))...]]].].

Note that the secondary structure contains 13 base pairs, including the noncanonical base pair (A,G) located at positions (46,65), and that there are four pseudoknotted base pairs. RNAbor computes that the Boltzmann probability Pr[0 neighbors] of the MFE structure is , and that the next largest probability occurs for the collection of 18-neighbors of the MFE structure, with Pr[18 neighbors] = . In addition to computing the partition function and Boltzmann probabilities for $\delta$-neighbors, for all $\delta$, RNAbor computes the $\mathrm{MFE}^\delta$ structures; i.e. for each $\delta$, RNAbor computes the MFE structure among all $\delta$-neighbors.

[3] The secondary structure is obtained by analysis of the hydrogen bonding classification from the Nucleic Acid DataBank for NDB ID PR0122 and by application of a program we wrote to extract the maximal planar secondary structure, denoted by parentheses, and subsequently to extract the pseudoknots, denoted by square brackets. The idea of the extraction algorithm is due to Yann Ponty, and will appear elsewhere.

Figure 2: (a) The MFE structure. (b) The alternative structure. The two alternative secondary structures with free energies and kcal/mol, respectively, of the primary structure GUGACUGCAA UGCUAUUUGA GUAUCCUGAA AACGGGCUUU UCAGAAU.

Figure 3: Boltzmann probability density plot for the 47 nt. conformational switch (?) with EMBL accession number AE / and sequence s = GUGACUGCAA UGCUAUUUGA GUAUCCUGAA AACGGGCUUU UCAGAAU. The curve shows the probability, $p_\delta = Z^\delta(s, S_0)/Z(s, S_0)$, for all secondary structures of RNA sequence $s$ having base pair distance $\delta$ from the MFE structure $S_0$.

Figure 4: (Left) Boltzmann probability density plot (density of states versus base pair distance) for the 29 nt. bistable switch artificially engineered by Flamm et al. (?) and having sequence CUUAUGAGGGUACUCAUAAGAGUAUCC. The graph shows the Boltzmann probability, $p_\delta = Z^\delta(s, S_0)/Z(s, S_0)$, of all $\delta$-neighbors, for all values of $\delta$ bounded by sequence length. (Right) Dot plot produced by RNAfold -d2 -p. The upper triangular region represents base pair probabilities, as computed by McCaskill's algorithm (?), while the lower triangular region represents base pairs from the MFE structure.

Figure 5: (a) MFE structure. (b) 18-neighbor. Two alternative secondary structures for the 76 nt. conformational switch which controls hepatitis delta virus ribozyme catalysis. This switch has sequence GAUGGCCGGC AUGGUCCCAG CCUCCUCGCU GGCGCCGGCU GGGCAACACC AUUGCACUCC GGUGGUGAAU GGGACU. The 3-dimensional structure of this switch, as determined by X-ray crystallography (?), is available in the Protein Data Bank with PDB code 1SJ3:R. (Left) Minimum free energy structure with free energy kcal/mol. (Right) Alternative low energy structure with free energy kcal/mol. This $\mathrm{MFE}^{18}$ structure has the lowest free energy among all 18-neighbors of the MFE structure.

paRNAss

In paRNAss (?) a structural RNA switch is predicted by studying properties of the energy landscape of the RNA. Secondary structures are sampled from the structure space using RNAsubopt (?) or mfold (?). Pairwise distances are calculated between the sampled structures using two different distance measures (e.g. energy barrier, morphological, tree alignment or string edit distance). Using a standard clustering method, the structures are clustered into two clusters based on the distance measures. If the RNA is a conformational switch, it has two stable structures and hence two clusters are expected (in a multi-switch, more than two stable structures are expected).
As an additional test, the consensus structure of each cluster is computed, and for each sample structure the distances to the two consensus structures are plotted against each other. If the RNA is really a conformational switch, then the paRNAss output should display two clouds of points, one near the x-axis and one near the y-axis.

Figure 6: Density of $\delta$-neighbors and output of RNAshapes for the 76 nt. conformational switch which controls hepatitis delta virus ribozyme catalysis, with PDB code 1SJ3:R (?). (Sequence is given in Figure ??.) (Left) RNAbor density plot (density of states versus base pair distance), which graphs the Boltzmann probability Pr[$\delta$ neighbors] of $\delta$-neighbors as a function of $\delta$. In this case, Pr[0 neighbors] = , and Pr[18 neighbors] = . When comparing the number of structures, there is one 0-neighbor, the MFE structure itself, many 18-neighbors, and many 38-neighbors (the largest class of $\delta$-neighbors for all $\delta$). By way of comparison, the output of RNAshapes lists the shapes [][], [[][]], [][[][]], [], [][][], [[][]][] and [[][][]]. Note that in this example, the two alternative low-energy structures displayed in this figure both have the RNA shape [][], and that RNAshapes does not predict a switch.

Comparison with paRNAss

We now compare the ability of RNAbor and paRNAss to predict RNA conformational switches. We have chosen to display the paRNAss distance plot of energy barrier versus morphological distance (?), but in all the examples below the distance plots using tree alignment or string edit distance showed similar results.

The E. coli hok (host killing) mRNA folds into two different conformations (?). The full length mRNA folds into a stable structure involving a long-range interaction between the 5′-end and the 3′-end. Degradation of the 3′-end leads to a conformational change as the stabilizing long-range interaction is broken. Here we have investigated the part of the mRNA that undergoes a conformational change (as provided on the paRNAss webserver uni-bielefeld.de/parnass). For this RNA, both RNAbor and paRNAss detect the conformational switch: the RNAbor probability plot shows two distinct peaks, suggesting two alternative stable structures, and the paRNAss plot shows two clearly separated clusters. Both suggest that all the reasonably stable structures fall into one of two conformations; see Figure ??. Although both RNAbor and paRNAss suggest that the hok gene has two alternative structures, there are some uncertainties in the result. In the RNAbor density plot there are actually three peaks (even though the third peak is significantly smaller than the other two), indicating that there might be more than two alternative structures.

The 5′-untranslated region (UTR) of E. coli thiM mRNA undergoes a change in structure that is important for regulation (?). Both RNAbor and paRNAss indicate more than a single stable structure for the thiM-leader. As can be seen from Figure ??, there actually seem to be more than two alternative structures.
However, the third structure seems to be less important (lower probability), and hence this RNA is predicted as a conformational switch by RNAbor. A comparison of how RNAbor and paRNAss perform on the example conformational switches available on the paRNAss website is summarized in Table ??.

We can of course also investigate a non-switch with RNAbor and paRNAss. Figure ?? shows the example of Hammerhead I. The Hammerhead I structure is an example of a not very well-defined structure (?). In the RNAbor plot there are plenty of peaks, not a single peak as is the case for well-defined structures such as miRNA (see Figure ??) and not two distinct peaks as for a conformational switch. paRNAss likewise shows one large cloud in the distance plot, indicating that Hammerhead I does not have two separated alternative structures.

To investigate how many false predictions we get with our switch prediction method based on the RNAbor computation, we have investigated a set of RNAs assumed not to have alternative structures. We have included both RNAs with well-defined and not so well-defined structures. See Table ?? for the results.
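The peak-based reading of a density plot described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the decision rule used by RNAbor: the function names `density_peaks` and `classify_density` and the height threshold `min_height=0.05` are hypothetical, and the input is assumed to be a list whose entry at index δ is the probability $p_\delta$.

```python
def density_peaks(p, min_height=0.05):
    """Return the distances delta at which p has a local maximum.

    p[delta] is assumed to hold the Boltzmann probability of the
    delta-neighbors. min_height is a hypothetical cut-off used to
    ignore small bumps; it is not a value taken from the paper.
    """
    peaks = []
    for delta, value in enumerate(p):
        left = p[delta - 1] if delta > 0 else 0.0
        right = p[delta + 1] if delta + 1 < len(p) else 0.0
        # Strictly above the left neighbor so a flat plateau counts once.
        if value >= min_height and value > left and value >= right:
            peaks.append(delta)
    return peaks


def classify_density(p, min_height=0.05):
    """Label a density by its number of peaks (illustrative labels only)."""
    count = len(density_peaks(p, min_height))
    if count == 0:
        return "no pronounced peak"
    if count == 1:
        return "single well"
    if count == 2:
        return "candidate switch"
    return "multi-modal"
```

With this sketch, a small third peak such as the one seen for hok falls below the threshold and is ignored, while a density with many comparable peaks, as for Hammerhead I, is labeled multi-modal. The threshold would need tuning on real data.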

Figure 7: E. coli hok. (a) RNAbor density plot. (b) paRNAss distance plot. (c) paRNAss validation plot.

3 Discussion

In this paper, we present probability density graphs for a variety of conformational switches and for some non-switches. We additionally compare RNAbor with the web server paRNAss (?), which uses a heuristic to determine whether there appear to be two or more clusters of distinct secondary structures for a given RNA sequence. In (?) Ding and Lawrence describe how to sample RNA secondary structures after first computing the Boltzmann partition function using McCaskill's algorithm (?). By computing the base pair distance between each pair of sampled structures, Ding et al. (?) subsequently compute centroids of clusters produced by hierarchical clustering. This Boltzmann centroid method of Ding et al. (?) provides a computational means of probing the landscape of the low energy ensemble at thermodynamic equilibrium. Since paRNAss calls the Vienna RNA Package program RNAsubopt, it requires a user-defined bound E in order to generate all secondary structures within E kcal/mol of the minimum free energy. In contrast to paRNAss and the Boltzmann centroid method, our algorithm RNAbor directly computes the Boltzmann partition function for all secondary structures of a given RNA sequence which differ by exactly $\delta$ base pairs, for all $\delta$ less than a user-defined bound.

Potential applications of RNAbor which will be pursued in future work include the following.

Since RNAbor allows one to distinguish whether the given RNA nucleotide sequence $s$ has a single pronounced well of attraction around a given secondary structure $S_0$ of $s$, it may be possible to use RNAbor to detect situations where the native secondary structure, as determined by X-ray crystallography, is different from that proposed by mfold and RNAfold. The idea would be to determine whether there is no peak around $\delta = 0$, when $S_0$ is taken to be the MFE structure. Figure ??(a) shows an interesting example where the RNA seems to have more than one alternative structure. Does this RNA have more than two alternative structures? Is it the case that the MFE structure is not biologically functional? (In this example, the other two alternative structures seem to be probable.) Using RNAbor, we can determine the minimum free energy structures over all $\delta$-neighbors, where the Boltzmann probability $p_\delta$ is high. Ultimately, chemical probing experiments might determine whether these $\mathrm{MFE}^\delta$ structures are the preferred biologically active structure.

RNAbor is a useful complement to already existing tools for detecting putative conformational switches. Unlike paRNAss, the number of structures to be analyzed and the maximum allowable free energy difference from the MFE structure need not be decided in advance. (These can change the paRNAss result quite dramatically.) Depending on the number of structures to be analyzed and the energy bound, paRNAss can take an exponential amount of time, in contrast to $O(mn^3)$ time for RNAbor to compute $N^\delta$, $Z^\delta$ and $\mathrm{MFE}^\delta$. As for any bioinformatics software, it will be necessary to perform experimental validation of predictions made by RNAbor. In future work, we intend to include user-defined constraints, which allow the user to require all investigated structures to contain certain specified base pairs and certain specified nucleotides to remain unpaired. This will allow RNAbor to be used together with chemical probing experiments to determine biologically active conformers.

Figure 8: E. coli thiM-leader. (a) RNAbor density plot. (b) The paRNAss distance plot.

Figure 9: Hammerhead I. (a) RNAbor density plot. (b) The paRNAss distance plot.

4 Materials and methods

Given an RNA nucleotide sequence $s$, consider a fixed secondary structure $S_0$ of $s$. Here $S_0$ could be the MFE structure of $s$, the secondary structure obtained from the 3-dimensional X-ray conformation or by comparative sequence analysis, or an arbitrary intermediate structure (such intermediate forms may play a biologically important role, as in viroids). Recall that a secondary structure $S$ of $s$ is a $\delta$-neighbor of $(s, S_0)$ if $S$ and $S_0$ differ by exactly $\delta$ base pairs. In this section, we describe how to efficiently compute the number $N^\delta$ of $\delta$-neighbors, the partition function $Z^\delta$ for $\delta$-neighbors, and the minimum free energy structure $\mathrm{MFE}^\delta$ over all $\delta$-neighbors.

4.1 The number of δ-neighbors of a fixed secondary structure

Let $s = a_1, \ldots, a_n$ denote an RNA sequence, i.e. a sequence of letters in the alphabet of nucleotides {A, C, G, U}. A secondary structure $S$ on $s$ is a set of base pairs $(i, j)$ with $1 \le i \le i + \theta < j \le n$, where $\theta \ge 0$ is an integer (corresponding to the minimum hairpin loop size, which we usually set to 3), such that if $(k, l)$ is a base pair, then $k = i \Leftrightarrow l = j$ and $i < k < j \Leftrightarrow i < l < j$.
We say that $S$ is compatible with $s$ if for every base pair $(i, j)$ in $S$ the pair $a_i a_j$ is contained in the set $B = \{AU, UA, GC, CG, GU, UG\}$ (i.e. the set of Watson-Crick base pairings together with wobbles). Given two secondary structures $S, T$ on $s$, we define the base-pair distance $d_{BP}$ between $S$ and $T$ to be the number of base pairs that they do not have in common, i.e.

\[ d_{BP}(S, T) = |S \,\triangle\, T| = |S \cup T| - |S \cap T|. \]

For the rest of this section, we consider both $s$ and the secondary structure $S$ on $s$ to be fixed. We now provide recursions for determining the number of secondary structures $T$ compatible with $s$ that are at base-pair distance precisely $\delta$ from $S$. Let $S_{[i,j]}$ denote the restriction of $S$ to the interval $[i, j]$ of $s$, that is, the set of base pairs

\[ S_{[i,j]} = \{(k, l) : i \le k < l \le j,\ (k, l) \in S\}. \]
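The two definitions above translate directly into set operations. The following is a minimal sketch, assuming structures are given in dot-bracket notation without pseudoknots and positions are 1-indexed as in the text; the function names are illustrative and are not part of RNAbor.

```python
def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j), 1-indexed, of a dot-bracket string."""
    stack, pairs = [], set()
    for position, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % position)
            pairs.add((stack.pop(), position))
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def base_pair_distance(structure_s, structure_t):
    """d_BP(S, T): size of the symmetric difference of the two pair sets."""
    return len(parse_dot_bracket(structure_s) ^ parse_dot_bracket(structure_t))


def restrict(pairs, i, j):
    """S_[i,j]: the base pairs of S with both ends inside the interval [i, j]."""
    return {(k, l) for (k, l) in pairs if i <= k < l <= j}
```

For example, `base_pair_distance("((..))", "(....)")` counts the single pair (2, 5) present in only one of the two structures.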

A secondary structure $T_{[i,j]}$ on $s$ is a $\delta$-neighbor of $S_{[i,j]}$ if $d_{BP}(S_{[i,j]}, T_{[i,j]}) = \delta$. For all $0 \le \delta \le m$ and all $1 \le i \le j \le n$, let $N^\delta_{i,j}(s, S)$ denote the number of secondary structures $T_{[i,j]}$ compatible with $s$ such that $d_{BP}(S_{[i,j]}, T_{[i,j]}) = \delta$. In the following we may omit the sequence $s$ and secondary structure $S$ in our notation, since these are fixed. In particular, we put $N^\delta_{i,j} = N^\delta_{i,j}(s, S)$.

$N^\delta_{i,j}$ is computed recursively. The initial conditions are given by

\[ N^0_{i,j} = 1, \quad \text{for } i < j, \tag{1} \]

since the only 0-neighbor of a structure is the structure itself, and

\[ N^\delta_{i,j} = 0, \quad \text{for } \delta > 0,\ j \le i + \theta, \tag{2} \]

since the empty structure is the only possible structure for a sequence shorter than $\theta + 2$ nucleotides, so there are no $\delta$-neighbors for $\delta > 0$. The recursion used to compute $N^\delta_{i,j}$ for $\delta > 0$ and $j > i + \theta$ is

\[ N^\delta_{i,j} = N^{\delta - b_0}_{i,j-1} + \sum_{\substack{a_k a_j \in B \\ i \le k < j}} \ \sum_{w + w' = \delta - b} N^{w}_{i,k-1}\, N^{w'}_{k+1,j-1}, \tag{3} \]

where $b_0 = 1$ if $j$ is base paired in $S_{[i,j]}$ and 0 otherwise, and $b = d_{BP}(S_{[i,j]}, S_{[i,k-1]} \cup S_{[k+1,j-1]} \cup \{(k, j)\})$. This holds since, in a secondary structure $T_{[i,j]}$ on $[i, j]$ that is a $\delta$-neighbor of $S_{[i,j]}$, either nucleotide $j$ is unpaired in $[i, j]$ or it is paired to a nucleotide $k$ such that $i \le k < j$. In the latter case it is enough to study the smaller sequence segments $[i, k-1]$ and $[k+1, j-1]$, noting that, except for $(k, j)$, base pairs outside of these regions are not allowed. In addition, for $d_{BP}(S_{[i,j]}, T_{[i,j]}) = \delta$ to be fulfilled it is necessary for $w + w' = \delta - b$ to hold, where $w = d_{BP}(S_{[i,k-1]}, T_{[i,k-1]})$ and $w' = d_{BP}(S_{[k+1,j-1]}, T_{[k+1,j-1]})$, since $b$ is the number of base pairs that differ between $S_{[i,j]}$ and a structure $T_{[i,j]}$ due to the introduction of the base pair $(k, j)$.

Pseudocode for computing $N^\delta_{i,j}$ for values of $\delta$ between 0 and $m$ is given in Appendix ??. The algorithm runs in time $O(mn^3)$ and space $O(mn^2)$ where, as defined above, $n$ is the length of $s$ and $m$ is the maximum value of $\delta$.
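Recursion (3) with initial conditions (1) and (2) can be sketched as a memoized function. This is an illustrative sketch rather than the appendix pseudocode: the name `count_neighbors` is hypothetical, the reference structure is assumed to be a set of 1-indexed pairs that is compatible with the sequence and respects the hairpin bound θ, the hairpin constraint k + θ < j is enforced explicitly in the loop, and no attempt is made to match the stated time and space bounds.

```python
from functools import lru_cache

# Watson-Crick and GU wobble pairs (the set B in the text).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def count_neighbors(seq, ref_pairs, max_delta, theta=3):
    """Return [N^0, ..., N^max_delta] for sequence seq and reference structure S."""
    n = len(seq)
    ref = frozenset(ref_pairs)
    # Map each right end l of a reference pair (k, l) to its partner k.
    partner_of_right_end = {l: k for (k, l) in ref}

    def restrict(i, j):
        # S_[i,j]: reference pairs with both ends inside [i, j].
        return {(k, l) for (k, l) in ref if i <= k and l <= j}

    @lru_cache(maxsize=None)
    def N(i, j, delta):
        if delta < 0:
            return 0
        if j - i <= theta:
            # Too short for any base pair: only the empty structure exists.
            return 1 if delta == 0 else 0
        # b0 = 1 if j is paired in S_[i,j], i.e. its partner lies in [i, j].
        b0 = 1 if partner_of_right_end.get(j, 0) >= i else 0
        total = N(i, j - 1, delta - b0)  # case: j unpaired in T
        s_ij = restrict(i, j)
        for k in range(i, j - theta):  # case: j paired with k in T
            if seq[k - 1] + seq[j - 1] not in CANONICAL:
                continue
            forced = restrict(i, k - 1) | restrict(k + 1, j - 1) | {(k, j)}
            b = len(s_ij ^ forced)
            for w in range(delta - b + 1):
                total += N(i, k - 1, w) * N(k + 1, j - 1, delta - b - w)
        return total

    return [N(1, n, delta) for delta in range(max_delta + 1)]
```

Calling `count_neighbors("GAAAC", [], 2)` considers the only possible pair (1, 5), so there is one structure at distance 0 (the empty one) and one at distance 1.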
4.2 Partition function analogue

In this section, we explain how to extend our approach for computing $N^\delta_{i,j}$ to one for computing the partition function contribution of the set of structures compatible with a given RNA sequence $s$ at a fixed base-pair distance $\delta$ from an RNA structure $S$ compatible with $s$. This allows us to compute the probability of finding a structure compatible with $s$ at distance $\delta$ from $S$.

It is straightforward to extend the previous approach to compute partition functions for the Nussinov-Jacobson energy model. In particular, by simply replacing recursion (??) with

\[ N^\delta_{i,j} = N^{\delta - b_0}_{i,j-1} + \sum_{\substack{a_k a_j \in B \\ i \le k < j}} \ \sum_{w + w' = \delta - b} N^{w}_{i,k-1}\, N^{w'}_{k+1,j-1}\, e^{-E_{bp}(k,j)/RT}, \tag{4} \]

where $E_{bp}(k, l)$ is the energy of the base pair $(k, l)$, $R$ is the gas constant, and $T$ is the temperature, we can compute the partition function contribution of structures at a given base-pair distance $\delta$. The base-pair energy $E_{bp}(k, l)$ takes the value $-1$ if $a_k a_l \in B$ and 0 otherwise. Note that the energy contribution can be altered for different base pairs (e.g. $-3$ for GC, $-2$ for AU and $-1$ for GU are weights used in (?)).

Employing a substantially more complicated algorithm, similar to the dynamic programming calculation of the partition function described in (?), the partition function contributions can also be computed according to the Turner energy model. In the Turner energy model a secondary structure is decomposed into loops, as described in (?), and the energy is computed as the sum of the energy contributions of the loops. A $k$-loop consists of $k - 1$ base pairs (excluding the closing base pair) and $u$ unpaired bases. The energies of 1-loops (hairpins) and 2-loops (stacks if $u = 0$, bulges or interior loops if $u > 0$) are experimentally determined (??) and depend on $k$ and $u$ as well as the RNA sequence. In the Turner model the energies for multi-loops ($k > 2$) are generally determined by the approximate linear model $E_M = a + b(k - 1) + cu$, where $a$, $b$ and $c$ are constants.

As before, from now on we regard $s$ as a fixed RNA sequence and $S$ as a fixed secondary structure compatible with $s$. The partition function for $s$ is then defined as

\[ Z = \sum_T e^{-E_T/RT}, \]

where the sum is taken over all structures $T$ compatible with $s$, and $E_T$ is the energy of the structure $T$. We aim to compute the restriction $Z^\delta = Z^\delta_{1,n} = Z^\delta_{1,n}(s, S)$, that is, the sum of $e^{-E_T/RT}$ taken over all structures $T$ that are compatible with $s$ and at base-pair distance $\delta$ from $S$. The probability of finding a structure at a distance $\delta$ from $S$ is then given by $p_\delta = Z^\delta/Z$. As with the usual McCaskill partition function calculations (?), in the dynamic programming we use three matrices $Z$, $ZB$ and $ZM$ for recursively computing $Z^\delta$, instead of the single matrix $N$ used for computing $N^\delta$ in the previous section.
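For the simpler base-pair energy model of recursion (4), the computation of $Z^\delta$ and $p_\delta$ can be sketched as follows. This is an illustration only and does not implement the Turner model: the names `neighbor_partition` and `neighbor_probabilities` are hypothetical, every canonical pair is assumed to contribute the same energy `pair_energy` (default $-1$), the default `rt` uses $R$ = 0.0019872 kcal/(mol K) at 310.15 K as an assumed unit convention, and the hairpin constraint k + θ < j is enforced explicitly.

```python
import math
from functools import lru_cache

CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}
GAS_CONSTANT = 0.0019872  # kcal/(mol K); assumed unit convention


def neighbor_partition(seq, ref_pairs, max_delta, pair_energy=-1.0,
                       rt=GAS_CONSTANT * 310.15, theta=3):
    """Return [Z^0, ..., Z^max_delta] under a uniform base-pair energy."""
    n = len(seq)
    ref = frozenset(ref_pairs)
    partner_of_right_end = {l: k for (k, l) in ref}
    boltzmann = math.exp(-pair_energy / rt)  # weight of one base pair

    def restrict(i, j):
        return {(k, l) for (k, l) in ref if i <= k and l <= j}

    @lru_cache(maxsize=None)
    def Z(i, j, delta):
        if delta < 0:
            return 0.0
        if j - i <= theta:
            return 1.0 if delta == 0 else 0.0
        b0 = 1 if partner_of_right_end.get(j, 0) >= i else 0
        total = Z(i, j - 1, delta - b0)  # j unpaired: no energy contribution
        s_ij = restrict(i, j)
        for k in range(i, j - theta):
            if seq[k - 1] + seq[j - 1] not in CANONICAL:
                continue
            forced = restrict(i, k - 1) | restrict(k + 1, j - 1) | {(k, j)}
            b = len(s_ij ^ forced)
            for w in range(delta - b + 1):
                total += Z(i, k - 1, w) * Z(k + 1, j - 1, delta - b - w) * boltzmann
        return total

    return [Z(1, n, delta) for delta in range(max_delta + 1)]


def neighbor_probabilities(seq, ref_pairs, **options):
    """Return p_delta = Z^delta / Z for every attainable distance delta."""
    # No structure can differ from S by more than |S| + floor(n/2) pairs.
    max_delta = len(set(ref_pairs)) + len(seq) // 2
    z = neighbor_partition(seq, ref_pairs, max_delta, **options)
    total = sum(z)  # the full partition function Z
    return [value / total for value in z]
```

Setting `pair_energy=0.0` makes every Boltzmann weight equal to 1, so the sketch then reproduces the counts $N^\delta$ of the previous section.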
In particular, for the sequence segment $[i, j]$ of $s$, define

\[ Z^\delta_{i,j} = \sum e^{-E_{T_{[i,j]}}/RT}, \]

where the sum is over all structures $T_{[i,j]}$ compatible with $s$ and such that $d_{BP}(S_{[i,j]}, T_{[i,j]}) = \delta$. Also, define the restricted partition function $ZB^\delta_{i,j}$ as the sum of $e^{-E_{T_{[i,j]}}/RT}$ taken over all such structures $T_{[i,j]}$ with $(i, j) \in T_{[i,j]}$, and $ZM^\delta_{i,j}$ as the partition function contribution if the sequence segment $[i, j]$ is part of a multi-loop. The matrices $Z$, $ZB$ and $ZM$ are filled using the following three recursions. To compute $Z$ we use

\[ Z^\delta_{i,j} = Z^{\delta - b_0}_{i,j-1} + \sum_{\substack{a_k a_j \in B \\ i \le k < j}} \ \sum_{w + w' = \delta - d_1} Z^{w}_{i,k-1}\, ZB^{w'}_{k,j}\, e^{-E_d/RT}, \]

where $E_d$ is the energy contribution due to dangling ends (energy contributions from single bases stacking on adjacent base pairs) and closing AU base pairs (since a non-GC base pair closing a stem has a destabilizing effect), and $d_1 = d_{BP}(S_{[i,j]}, S_{[i,k-1]} \cup S_{[k,j]})$. Note that the first term of this recursion corresponds to the case where $j$ is unpaired (and hence has no energy contribution) in $[i, j]$. The second term includes all other structures on $[i, j]$. The sum is taken over all possible base pairs $(k, j)$ with $i \le k < j$. If $(k, j)$ is a base pair, the partition function for $[k, j]$ is given by $ZB^{w'}_{k,j}$ and the partition function for $[i, k-1]$ is given by $Z^{w}_{i,k-1}$.

We compute $ZB$ using the recursion

\[
\begin{aligned}
ZB^\delta_{i,j} ={} & \Delta\big(d_{BP}(S_{[i,j]}, \{(i,j)\}),\, \delta\big)\, e^{-E(i,j)/RT} \\
& + \sum_{\substack{a_k a_l \in B \\ i < k < l < j}} ZB^{\delta - d_2}_{k,l}\, e^{-E(i,j,k,l)/RT} \\
& + \sum_{\substack{a_k a_l \in B \\ i < k < l < j}} \ \sum_{w + w' = \delta - d_3} ZM^{w}_{i+1,k-1}\, ZB^{w'}_{k,l}\, e^{-(a + b + c(j - l - 1))/RT},
\end{aligned}
\tag{5}
\]

where $E(i, j)$ is the energy of the hairpin loop with closing base pair $(i, j)$, $E(i, j, k, l)$ is the energy of the stack, bulge or interior loop with the closing base pair $(i, j)$ and the interior base pair $(k, l)$, $d_2 = d_{BP}(S_{[i,j]}, S_{[k,l]} \cup \{(i, j)\})$, and $d_3 = d_{BP}(S_{[i,j]}, S_{[i+1,k-1]} \cup S_{[k,l]} \cup \{(i, j)\})$. Here, $\Delta(x, y)$ is the Kronecker function, which equals 1 if $x = y$, and is otherwise 0. Note that since the above equation computes $ZB^\delta_{i,j}$, it follows that $(i, j)$ forms a base pair in the neighboring structures $T_{[i,j]}$ (if this is not possible then $ZB^\delta_{i,j} = 0$). The first term in the recursion takes care of the case where $(i, j)$ is the only base pair on $[i, j]$, i.e. $(i, j)$ closes a hairpin loop. The second term handles the case where there is an interior loop (or a bulge or a stack) closed by $(i, j)$ and $(k, l)$. The third term takes care of all the structures where $(i, j)$ closes a multi-loop. To reduce the complexity of the algorithm, the interior and bulge loop size can be limited to a maximum size of $L$, by requiring that $l > j - L$ in the above recursion.

The final recursion, for computing $ZM$, is

\[
ZM^\delta_{i,j} = ZM^{\delta - b_0}_{i,j-1}\, e^{-c/RT} + \sum_{\substack{a_k a_j \in B \\ i \le k < j}} \Big( ZB^{\delta - d_4}_{k,j}\, e^{-(b + c(k - i))/RT} + \sum_{w + w' = \delta - d_5} ZM^{w}_{i,k-1}\, ZB^{w'}_{k,j}\, e^{-b/RT} \Big),
\tag{6}
\]

where $d_4 = d_{BP}(S_{[i,j]}, S_{[k,j]})$ and $d_5 = d_{BP}(S_{[i,j]}, S_{[i,k-1]} \cup S_{[k,j]})$. Note that since $ZM^\delta_{i,j}$ computes the partition function contribution under the assumption that $[i, j]$ is part of a multi-loop, there will be exactly one stem-loop structure in this region (the $ZB$ term) or more than one (the $ZM$-$ZB$ term). Pseudocode for computing $Z^\delta$ is given in Appendix ??.
The complexity is the same as for computing the number of δ-neighbors, $O(mn^2)$ in space and $O(mn^3)$ in time, provided that the size of interior loops and bulges is limited to a fixed length such as 30, following the convention of the Vienna RNA Package.

4.3 Minimum free energy δ-neighbors

Given an RNA nucleotide sequence $s$ and secondary structure $S_0$, the minimum free energy δ-neighbor, denoted $\mathrm{MFE}_{\delta}$, is the secondary structure $S$ of $s$ that has base pair distance δ to $S_0$ and has the least free energy $E_{\delta}$ among all structures at base pair distance δ from $S_0$. Free energy is measured according to the Turner energy model (??), where our treatment of dangles follows that of the Vienna RNA Package with the -d2 option.

In this section, we describe a novel algorithm capable of computing the $\mathrm{MFE}_{\delta}$ structures, for all δ. As in our partition function computation, the run time [resp. space requirement] to compute all $\mathrm{MFE}_{\delta}$ structures for $\delta \le m$ is $O(mn^3)$ [resp. $O(mn^2)$]. This algorithm is obtained from the algorithm in Section ?? essentially by replacing the Boltzmann factor $e^{-E(S)/RT}$ by the free energy $E(S)$ and by replacing the operations of addition [resp. multiplication] by minimization [resp. addition]. In future work, we plan to analyze the morphological changes in structure in proceeding from $S_0$ to $\mathrm{MFE}_0$, $\mathrm{MFE}_1$, $\mathrm{MFE}_2$, etc. Such analysis could prove useful in conformational switch detection and other applications.

Fix an RNA nucleotide sequence $s = a_1, \ldots, a_n$ and a secondary structure $S_0$ of $s$. To compute $E^{\delta}$ we use

\[
E^{\delta}_{i,j} = \min\Bigl\{ E^{\delta - b_0}_{i,j-1},\;\; \min_{a_k a_j \in B,\; i \le k < j} \;\; \min_{w+w'=\delta-d_1} \bigl( E^{w}_{i,k-1} + EB^{w'}_{k,j} + E_d \bigr) \Bigr\},
\]

where $E_d$ is the energy contribution due to dangling ends (energy contributions from single bases stacking on adjacent base pairs) and closing AU base pairs (since a non-GC base pair closing a stem has a destabilizing effect), and $d_1 = d_{BP}(S_{[i,j]}, S_{[i,k-1]} \cup S_{[k,j]})$. Note that the first term of this recursion corresponds to the case where $j$ is unpaired (and hence has no energy contribution) in $[i, j]$. The second term includes all other structures on $[i, j]$. The minimization is taken over all possible base pairs $(k, j)$ with $i \le k < j$. If $(k, j)$ is a base pair, then the minimum free energy of all $w'$-neighbors of $S_0$ restricted to $[k, j]$ is given by $EB^{w'}_{k,j}$, while the minimum free energy of all $w$-neighbors of $S_0$ restricted to $[i, k-1]$ is given by $E^{w}_{i,k-1}$.
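The replacement of addition by minimization and of multiplication by addition can be sketched in the same kind of simplified setting. This is a minimal illustration under stated assumptions, not the RNAbor algorithm: each base pair is assumed to contribute a constant energy `E_BP`, loops contribute nothing, and distances that cannot be attained are marked with an infinite energy. All names and constants are hypothetical.

```python
import math
from functools import lru_cache

E_BP = -1.0   # assumed toy energy per base pair (not a Turner parameter)
THETA = 3     # assumed minimum number of unpaired bases in a hairpin
INF = math.inf
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def restrict(structure, i, j):
    """Base pairs of `structure` lying entirely inside [i, j] (0-based)."""
    return frozenset((x, y) for (x, y) in structure if i <= x and y <= j)


def mfe_delta(seq, s0, m):
    """Return (E^0, ..., E^m): least toy energy at each distance from s0."""

    @lru_cache(maxsize=None)
    def E(i, j):
        if j < i:  # empty segment: energy 0 at distance 0, nothing else possible
            return (0.0,) + (INF,) * m
        s_ij = restrict(s0, i, j)
        b0 = 1 if any(y == j for (_, y) in s_ij) else 0
        prev = E(i, j - 1)
        # Case 1: j unpaired (addition of the two cases becomes a minimum).
        out = [prev[d - b0] if d >= b0 else INF for d in range(m + 1)]
        # Case 2: j pairs with k (products of Boltzmann factors become sums).
        for k in range(i, j - THETA):
            if (seq[k], seq[j]) not in PAIRS:
                continue
            split = restrict(s0, i, k - 1) | restrict(s0, k + 1, j - 1) | {(k, j)}
            d1 = len(s_ij ^ split)
            left, inner = E(i, k - 1), E(k + 1, j - 1)
            for w in range(m + 1 - d1):
                for w2 in range(m + 1 - d1 - w):
                    cand = left[w] + inner[w2] + E_BP
                    if cand < out[d1 + w + w2]:
                        out[d1 + w + w2] = cand
        return tuple(out)

    return E(0, len(seq) - 1)


if __name__ == "__main__":
    energies = mfe_delta("GGGAAACCC", {(0, 8), (1, 7), (2, 6)}, m=6)
    for delta, energy in enumerate(energies):
        print(delta, energy)
```

In this sketch the neutral elements change together with the operations: an empty segment has weight 1 in the partition function and energy 0 here, while an impossible case has weight 0 there and energy $+\infty$ here. Recovering the structures themselves would additionally require a traceback, which is omitted.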
We compute $EB$ using the recursion

\[
\begin{aligned}
EB^{\delta}_{i,j} = \min\Bigl\{\; & \hat{E}^{\delta}(i,j), \\
& \min_{a_k a_l \in B,\; i<k<l<j} \bigl( EB^{\delta - d_2}_{k,l} + E(i,j,k,l) \bigr), \\
& \min_{a_k a_l \in B,\; i<k<l<j} \;\; \min_{w+w'=\delta-d_3} \bigl( EM^{w}_{i+1,k-1} + EB^{w'}_{k,l} + a + b + c(j-l-1) \bigr) \Bigr\},
\end{aligned}
\tag{7}
\]

where $E(i, j)$ is the energy of the hairpin loop with closing base pair $(i, j)$, $E(i, j, k, l)$ is the energy of the stack, bulge or interior loop with the closing base pair $(i, j)$ and the interior base pair $(k, l)$, $d_2 = d_{BP}(S_{[i,j]}, S_{[k,l]} \cup \{(i, j)\})$, and $d_3 = d_{BP}(S_{[i,j]}, S_{[i+1,k-1]} \cup S_{[k,l]} \cup \{(i, j)\})$. Here, $\hat{E}^{\delta}(i, j)$ equals $E(i, j)$ if $d_{BP}(S_{[i,j]}, \{(i, j)\}) = \delta$, and is otherwise $+\infty$. This is the minimization counterpart of the Kronecker factor in (5): a hairpin at the wrong distance is excluded by an infinite energy rather than by a zero weight. Note that since the above equation computes $EB^{\delta}_{i,j}$, it follows that $(i, j)$ forms a base pair in the neighboring structures $T_{[i,j]}$ (if this is not possible then $EB^{\delta}_{i,j} = +\infty$). The first term in the recursion takes care of the case where $(i, j)$ is the only base pair on $[i, j]$, i.e. $(i, j)$ closes a hairpin loop. The second term handles the case where there is an interior loop (or a bulge or a stack) closed by $(i, j)$ and $(k, l)$. The third term takes care of all the structures where $(i, j)$ closes a multi-loop. To reduce the complexity of the algorithm, the interior and bulge loop size can be limited to a maximum size of $L$, by requiring that $l > j - L$ in the above recursion.

The final recursion, for computing $EM$, is

\[
EM^{\delta}_{i,j} = \min\Bigl\{ EM^{\delta - b_0}_{i,j-1} + c,\;\; \min_{a_k a_j \in B,\; i \le k < j} \min\Bigl( EB^{\delta - d_4}_{k,j} + b + c(k-i),\;\; \min_{w+w'=\delta-d_5} \bigl( EM^{w}_{i,k-1} + EB^{w'}_{k,j} + b \bigr) \Bigr) \Bigr\},
\tag{8}
\]

where $d_4 = d_{BP}(S_{[i,j]}, S_{[k,j]})$ and $d_5 = d_{BP}(S_{[i,j]}, S_{[i,k-1]} \cup S_{[k,j]})$. Note that since $EM^{\delta}_{i,j}$ computes the minimum free energy of δ-neighbors of $S_0$ restricted to $[i, j]$, under the assumption that $[i, j]$ is part of a multi-loop, this minimization is made over structures with exactly one stem-loop structure in this region (the $EB$ term) and structures with more than one (the $EM$+$EB$ term). For reasons of space, the pseudocode for computing $E^{\delta}$ is not presented; given our previous description of $E^{\delta}$ and the pseudocode for computing the partition function $Z^{\delta}$, which appears in the appendix, the reader should have no difficulty reconstructing the pseudocode for $E^{\delta}$.

5 Acknowledgements

Research of P.C. was partially supported by National Science Foundation DBI , which additionally supported some travel of E.F. All three authors would like to thank Elena Rivas, Eric Westhof and the funding agencies for organizing the meeting RNA-2006 in Benasque, Spain, in July 2006, where some of this work was carried out. Thanks as well to Yann Ponty for reading the manuscript and for telling us of his algorithm to extract maximal planar secondary structures, details of which are forthcoming.

Table 1: RNAbor and paRNAss predictions for the positive examples provided on the paRNAss website and for a set of negative examples. For the positive examples, the paRNAss predictions are given both as they were reported in (?) and as we interpreted the paRNAss plots manually (using default settings). For the negative examples we present our own interpretations. The RNAbor results are given as automatic/manual. A question mark indicates that a decision could not be made, or was made with uncertainty.

RNA    RNAbor (auto/man)    paRNAss (auto/man)

Positive examples from the paRNAss website
47 nt. switch (AE / )    + / + / -
α operon mRNA (E. coli)    / /
3′-UTR of AMV RNA    / /
Attenuator    + / + b + / + a,b
5′-UTR of btuB mRNA (E. coli)    + / + a + / +
dsrA (E. coli)    + / + + / +
HDV ribozyme    + / + + / +
HIV-1 leader    + / + / +
hok (E. coli)    + / + ± / +
5′-UTR of MS2 RNA (E. coli)    / + /
ribD leader (B. subtilis)    / /
S15 mRNA (E. coli)    / + /
S-box leader metE (B. subtilis)    / + /
Spliced leader RNA (L. collosoma)    b / ± + / +
T4 td gene intron    + b / + + / +
Tetrahymena group I intron    + / + + / +
thiM-leader RNA (E. coli)    b / + / +
ypaA leader (B. subtilis)    c / /

Negative examples
Hammerhead type I (L07513)    / (?)
Hammerhead type I (Z69690)    b /?
miRNA, let-7 (AP001359)    /
miRNA, mir-1 (AE003667)    /
5S rRNA (M16530)    /
tRNA (AE006699)    c / ± (?)
tRNA (X06054)    + / + (few structures)
U5 (X13427)    /
U5 (X15935)    + / (partial?)

a A switch under the assumption that the input (MFE) structure is one of the alternative structures.
b More than two peaks.
c Two peaks, but the second is much smaller than the MFE peak.


A RNAbor and paRNAss plots

The figures below show the RNAbor plot side by side with the paRNAss plot, which shows energy barrier versus morphological distance. All of the paRNAss figures were produced using the paRNAss web server with the default settings (an energy threshold for the suboptimal structures of 2 kcal/mol and a maximum of 50 structures).

A.1 Positive examples

Figure 10: 47 nucleotide example switch
Figure 11: E. coli α operon mRNA
Figure 12: 3′-UTR of AMV RNA
Figure 13: Attenuator
Figure 14: 5′-UTR of E. coli btuB mRNA
Figure 15: E. coli dsrA
Figure 16: HDV ribozyme
Figure 17: HIV-1 leader
Figure 18: E. coli hok
Figure 19: 5′-UTR of MS2 RNA from E. coli
Figure 20: B. subtilis ribD leader
Figure 21: E. coli S15 mRNA
Figure 22: S-box leader of B. subtilis metE
Figure 23: Spliced leader RNA from L. collosoma
Figure 24: T4 td gene intron

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

8.3.1 Simple energy minimization

Maximizing the number of base pairs as described above does not lead to good structure predictions.