Computational RNA Secondary Structure Design:


Computational RNA Secondary Structure Design: Empirical Complexity and Improved Methods

by

Rosalía Aguirre-Hernández

B.Sc., Universidad Nacional Autónoma de México, 1996
M.Eng., Universidad Nacional Autónoma de México, 1999

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

The Faculty of Graduate Studies (Mathematics)

The University of British Columbia

April, 2007

© Rosalía Aguirre-Hernández 2007

Abstract

Ribonucleic acids play fundamental roles in cellular processes and their function is directly related to their structure. The research reported in this thesis is focused on the design of RNA strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model. The design of RNA structures is important for applications in therapeutics and nanotechnology. This work also applies to DNA with the appropriate thermodynamic model for DNA molecules.

The overall goal of this research is to improve the performance and scope of algorithmic methods for RNA secondary structure design. First, we investigate the hardness of this problem, since its theoretical complexity is unknown. A scaling analysis on random and biologically generated structures supports the hypothesis that the running time of the RNA Secondary Structure Designer (RNA-SSD) algorithm, one of the state-of-the-art algorithms for designing secondary structures, scales polynomially with the size of the structure. We found that structures with small stems separated by loops are difficult to design.

Our improvements to the RNA-SSD algorithm include support for primary structure constraints, where bases or base types are fixed in certain positions of the sequence. Such constraints are important, for example, when designing RNAs such as ribozymes or tRNAs, where certain base positions must be fixed in order to permit interaction with other molecules. We investigate the correlation between the number and the location of the primary structure constraints and the performance of RNA-SSD.

In the second part of our research, we have extended the RNA-SSD algorithm to design for stability, rather than minimum free energy folding. We measure stability according to several criteria, such as a high probability of observing the minimum free energy structure and a low average number of incorrectly paired nucleotides in the ensemble of structures for the designed sequence.

The design of complexes of RNA molecules, that is, RNA molecules that interact with each other, is relevant for many applications. We describe several ways to design stable structures and complexes, and we also discuss the advantages and limitations of each approach.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
    Motivation
    Research goals and contributions
    Thesis outline
2 RNA Secondary Structure Design Problem
    Secondary structure of an RNA molecule
    Free energy of a single RNA strand
    Free energy of a duplex
    Partition function
    Stability
3 Previous Work
    Protein design
    Nucleic acid design
        Biochemical methods
        Computational methods
4 Empirical Complexity of RNA-SSD Algorithm
    Improvement of the RNA-SSD algorithm
    Experiments
    4.3 Results
        Analysis of RNA-SSD and RNAinverse on secondary structures without constraints
        Analysis of RNA-SSD on secondary structures with constraints
        Performance of RNA-SSD with different number and locations of primary base constraints
    Summary
5 Interaction of Two RNA Molecules
    Description of the algorithms
        Internal algorithm
        Linker algorithm
        Interface algorithm
    Experiments
    Results
        Performance of Linker and Interface
        Performance of Linker and Internal
        Performance of Interface and Internal
        Hardness of duplex design
    Summary
6 Stability
    The RNA-SSD-stability algorithm
    Experiments
    Results
        Comparison of RNA-SSD-stability and INFO-RNA
        Comparison of RNA-SSD-stability and adaptive walk
    Summary
7 Conclusions and Future Work
Bibliography
Appendix A: Undesignable structures

List of Tables

3.1 List of algorithms for predicting and designing RNA secondary structures that are mentioned repeatedly in this thesis

IUPAC nomenclature for nucleic acids

Set of structures generated by folding random sequences with the RNAfold function from the Vienna RNA package. Nucleotides in the sequence are assigned uniformly at random. The sets of structures longer than 75 bases are smaller because of the amount of time required for designing these structures

Biological structures obtained from the literature and used by Andronescu et al. [4]. Structures marked with an asterisk (*) were obtained from original, pseudoknotted structures by eliminating 8 base pairs in each case to remove the pseudoknot

Properties of the structures from Table 4.3; the intervals specify the minimum and maximum values observed for the respective features. These parameters were used to generate structures with biological properties. These values denote the minimum and maximum ratio of bulges to base pairs in the stems

Sets of structures generated with the RNA structure generator, using the parameters from Table

Set of structures used to study the correlation between the primary structure constraints and the performance of RNA-SSD. Structures with similar characteristics (such as size, number of multiloops, etc.) appear in the same group. The structure Bio-50-n2 was also selected for the study because it is relatively easy to design

5.1 Pseudoknot-free ribozyme-substrate duplexes obtained from Andronescu et al. [5]. These duplexes have been selected arbitrarily from the biological literature. The lengths of the ribozyme and the target structure are specified by l_1 and l_2, respectively

Biologically motivated duplexes (BIOMD). The strands used to generate the duplexes have statistical features derived from biological structures (see Table 4.5). BIOMD-I refers to duplexes generated by splitting a single strand at a random position as explained in Section 5.2. BIOMD-II duplexes are generated by removing a hairpin from a single strand

Performance results for Linker, Interface and Internal on sets of biological duplexes. The columns SR( ) show the fraction of structures for which the respective algorithm found solutions in all of the runs (SR stands for success rate). R is the target duplex; S is the structure that results from concatenating the molecules D_1 and D_2 of the duplex; D_1 and D_2 are designed independently by RNA-SSD in the Interface algorithm

Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD

Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD

Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD

Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD

The first two structures are artificial RNAs generated by Dirks et al. [25]. Artificial 1-multiloop has one multiloop with four branches. Artificial 3-multiloop has three multiloops, each of them with three branches. The rest are biological structures that were obtained arbitrarily from the biological literature; the same structures were used by Andronescu et al. [4]

6.2 Cutoff times for the adaptive walk and RNA-SSD-stability. The running time columns represent the time spent by the adaptive walk, in CPU seconds, to find a local minimum for each structure. The adaptive walk was run only once for each of the stability criteria n(s*) and p(s*). The cutoff time columns represent the maximum time that we allow the adaptive walk and RNA-SSD-stability to find a solution

Statistics for the stability of the sequences designed by RNA-SSD-stability and INFO-RNA when n(s*) is optimized

Statistics for the stability of the sequences designed by RNA-SSD-stability and INFO-RNA when p(s*) is optimized

Probability of the shrep with the lowest free energy for the best and the median sequences designed by RNA-SSD-stability and INFO-RNA. Sequences from the upper and lower tables were obtained by optimizing n(s*) and p(s*), respectively. We do not report any value for ( ) because the target structure was not found in the shapes generated by RNAshape. For this structure we generate up to 0 shapes with free energy 0% above the MFE. This suggests that the shape probability of this sequence is low

Performance results for RNA-SSD-stability and adaptive walk on artificial and biological RNA structures. The stability criteria used to design the RNA structures are n(s*) and p(s*). The cutoff time columns indicate the running time in CPU seconds for both algorithms on our reference machine. The success rates show the fraction of runs for which the respective algorithm found a sequence that folds correctly into the target structure

Statistics for the stability of the sequences designed by RNA-SSD-stability and adaptive walk when n(s*) is optimized. The best, median and coefficient of variation, c_v, are computed for the values of n(s*) and p(s*) of the sequences designed for a given structure. The correlation, ρ_n,p, between n(s*) and p(s*) over all the runs for a given structure is also included

Statistics for the stability of the sequences designed by RNA-SSD-stability and adaptive walk when p(s*) is optimized

Stability of biological sequences with respect to the metrics n(s*) and p(s*)

6.10 Stability of sequences designed by RNA-SSD-stability and adaptive walk on biological RNA structures when n(s*) is optimized (upper table) and when p(s*) is optimized (lower table). We run the adaptive walk once, until it finds a local minimum. The running time in CPU seconds for the respective structure is indicated in the second column. In all cases, a valid sequence was found

Stability of the sequence designed by RNA-SSD-stability for the biological structure B-5. We performed one run of RNA-SSD-stability, where p(s*) is optimized, with a cutoff time of CPU seconds and different values of p_a and n_r. The default values of p_a and n_r are 0.2 and 5 respectively. The success rate is one if the sequence folds correctly. The metrics n(s*) and p(s*) are used to evaluate the stability of the sequences

Free energy parameters for internal loops

List of Figures

1.1 Primary structure (a) and secondary structure (b) of an RNA molecule

Hairpin ribozyme-substrate complex. The arrow indicates the cleavage site in the substrate. Sequence requirements are specified, where N represents any nucleotide; Y represents C or U; R represents A or G; B represents C, G or U; V represents A, C or G. Illustration adapted from E. Puerta-Fernández et al. [46]

Secondary structure motifs. (a) Hairpin. (b) Interior loop. (c) Bulge. (d) Multiloop. (e) Helix or stem. (f) Pseudoknot. (g) External loop

Free energy calculation of a secondary structure. Stems are denoted by S, hairpins by H, multiloops by M, internal loops by I, bulges by B and dangling ends by D. The free energy of this structure is ΔG(S,X) = ΔG(S1) + ... + ΔG(S5) + ΔG(H1) + ΔG(H2) + ΔG(M) + ΔG(I) + ΔG(B) + ΔG(D) = 23.3 kcal/mol

Matrix of equilibrium base pair binding probabilities between all nucleotides in the sequence of Figure 2.2. The lower triangle represents the optimal structure. Illustration obtained with the program MFold from Zuker et al. [7]

Positive and negative design. The x-axis represents the structure space S(X) of a sequence X and the y-axis the free energy of each structure. Illustration adapted from Dirks et al. [25]

Tectosquare designed by Chworos et al. [20]. Panel (a) shows the RNA structure that can self-assemble to form the square of panel (b). Panel (c) shows the three-dimensional representation of the tectosquare

3.2 Pseudocode for designing structures with the RNA-SSD algorithm

(a) Randomly generated structure of length 75 (RND-75-n62) with loops separated by short stems. The line represents the location where the structure is split into two substructures. Parts (b) and (c) show the corresponding substructures with static cap structure and dangling ends, respectively

Substructures from Figure a with dynamic cap structure (a) and dynamic dangling ends (b)

Ribosomal RNA: Leptospira interrogans strain. The structure motif that consists of two internal loops separated by one base pair is hard to design. We verified this experimentally by adding another base pair between the internal loops, whereupon the expected time to design the structure decreases by a factor of three

Scaling analysis of the expected run-time (y-axis) using RNA-SSD of structures of lengths 50, 75, 100, 125, 150, 200, 450 and 500 (x-axis). A logarithmic scale is used on the y-axis. The solid (dotted) line corresponds to best fits of the data, for structures with lengths 50 to 150, using a polynomial (exponential) that is specified in each case. The expected run-times for structures longer than 150 appear closer to the polynomial than to the exponential fit line. (a) Median (Q50) of expected run-time of random structures. (b) 0.9-quantile (Q90) of expected run-time of random structures. (c) Median (Q50) of expected run-time of biologically motivated structures. (d) 0.9-quantile (Q90) of expected run-time of biologically motivated structures

4.5 Scaling analysis of the expected run-time (y-axis) of structures of lengths 50, 75, 100, 125, 150, 200 and 450 (x-axis). A logarithmic scale is used on both axes. The lines correspond to best fits of the data, for structures with lengths 50 to 150, using a polynomial that is specified in each case. The expected run-times for structures longer than 150 appear close to the corresponding fit line. (a) Median (Q50) of expected run-time of biological structures and median, 0.1-quantile (Q10) and 0.9-quantile (Q90) of expected run-time using RNA-SSD for biologically motivated structures. (b) Median of expected run-time of random and biologically motivated structures using RNA-SSD and RNAinverse. Structures of length 200 are the largest structures of our data set that we designed with RNAinverse

Cost distribution of RNA-SSD. Distribution of expected run-time of RNA-SSD on (a) random structures and (b) biologically motivated structures. For each point, the x-value gives an expected run-time and the y-value gives the fraction of structures whose run-time is at most the x-value

Distribution of expected run-time of RNAinverse on (a) random structures and (b) biologically motivated structures

Examples of structures not designed by RNA-SSD. Structures not designed by RNA-SSD have short stems separated by loops, indicated by arrows in the figure. (a) Random structure of length 450 (RND-450-n84). This is the only random structure in our data set that RNA-SSD did not design. Note that it has two internal loops separated by only one base pair. (b) Biologically motivated structure of length 74 (BIOM-50-n262)

Undesignable motifs. Two structure motifs of our data set that are not compatible with the thermodynamic model. Bold lines represent base pairs. (a) Motif B: bulges separated by one base pair. (b) Motif 2I: internal loops separated by one base pair

Secondary structure of small subunit ribosomal RNA of Acanthamoeba castellanii. Biological secondary structure with three bulges, each separated by only one base pair. This structure was obtained from The Comparative RNA Web (CRW) Site [7, 8]

4.11 Secondary structure of small subunit ribosomal RNA of Escherichia coli. Biological secondary structure with two internal loops separated by one base pair. This structure was obtained from The Comparative RNA Web (CRW) Site [7, 8, 28]

Distribution of expected run-time of RNA-SSD on three structures of approximately 50 bases: RND-50-n85, BIOM-50-n89 and VS Ribozyme from Neurospora mitochondria. The structures were designed with two sets of primary base constraints: one where the bases are fixed at random positions and the other where the bases are fixed on stems for each structure. Both sets have the same range [a, b] of constrained bases after propagation, where a and b are the smallest and largest number of bases constrained if 50% of stems are fixed in a given structure. We fixed 50% plus one stems if a structure has an odd number of stems

Scaling analysis using RNA-SSD for the median expected run-time of biologically motivated structures with no primary base constraints and with bases constrained in fifty percent of random positions and fifty percent of stems. The lines represent the polynomial that best fits the data. The experiment with primary structure constraints is computationally expensive, and for this reason, fewer structures of each length were used. The best fit for structures with constraints was calculated based on those of lengths 50, 75 and 100. Note that the run-times for constrained structures longer than 100 appear below the corresponding fit line

4.14 Hardness of RNA-SSD with base constraints. Correlation between the fraction of bases constrained in a particular structure (x-axis) and the median expected run-time for designing the structure with RNA-SSD (y-axis). We report the fraction of constrained bases after propagation for constraints on randomly chosen base positions. This fraction, for both randomly chosen bases and stems, corresponds to the median fraction of bases constrained in a set of 50 constraints that were generated by fixing a given percentage of bases or stems. There are two curves in each graph, one for designing structures with base constraints located in random positions and the other for constraints located in stems. (a) VS Ribozyme from Neurospora mitochondria; (b) Bio-50-n38; (c) Bio-50-n4; (d) Group II intron ribozyme D35 from Saccharomyces; (e) Bio-200-n9; (f) Bio-50-n

Biologically motivated structure Bio-50-n4 with ten stems. When constraining the bases in stems 7 and 8, this structure is hard to design. The structure motif formed by these stems, which are short and separated by a bulge, is unstable

(a) Secondary structure of hammerhead ribozyme in satellite DNA from Dolichopoda schiavazzi (cricket). (b) Trans-cleaving hammerhead ribozyme. This picture shows the secondary structure model of the hammerhead-substrate duplex where the specified bases are important for catalytic activity. Dots represent any nucleotide. In both pictures the arrow indicates the cleavage site

RNA complex with kissing loop interactions. This complex, designed by Chworos et al. [20], is a self-assembling square made of four similar RNA strands, called tectoRNAs, that can have applications in nanobiotechnology

Pseudocode for designing duplexes with the Internal algorithm

Panel (a): Duplex structure (R,b), where b = 4 in this case. Panel (b): a single structure S is formed by concatenating the strands D_1 and D_2 of the duplex (R,4). Panel (c): Duplex structure (R,2). Panel (d): a linker of five bases is added between the strands D_1 and D_2 from (R,2) to obtain a structure S allowed by the thermodynamic model

Pseudocode for designing duplexes with the Linker algorithm

5.6 Interface algorithm to design duplexes. (a) Interface I: bases in the intermolecular helix are fixed. (b) Design of structures D_1 and D_2 independently with RNA-SSD

Pseudocode for designing duplexes with the Interface algorithm

Artificially generated duplex. (a) Biologically motivated structure generated with RNA Structure Generator. (b) The arrow in (a) indicates the place where the structure is cut to obtain the duplex shown in (b) or the hairpin that is removed to obtain the duplex in (c)

Correlation between the expected running times in CPU seconds of the Linker (x-axis) and the Interface (y-axis) approaches to design the biological duplexes from Table

Correlation between the expected running times of the Linker and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Linker or Interface is unable to design as 10^6 CPU seconds. Structures that are designed by none of the approaches are excluded. (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD

(a) Duplex number 0 from the data set BIOMD-I-200, see Table 5.6. (b) The Linker algorithm concatenates both molecules, S = D_1 D_2, and designs a sequence X = X_1 X_2 that folds correctly into S. (c) Secondary structure of X_1 and X_2 predicted by PairFold. Note that X_1 and X_2 do not fold correctly into the target duplex R

(a) Biological duplex BIOD-4. The corresponding sequences were designed by the Interface algorithm. The shaded bases belong to the intermolecular helix. The sequences X_1 and X_2 from (b) and (c) are designed independently by RNA-SSD. The shaded bases in this case correspond to nucleotides constrained in the interface. Note that X_1 and X_2 do not fold correctly into D_1 and D_2 respectively

(a) Target structure BIOD-7. (b) Duplex formed by the sequences X_1 and X_2 designed by Interface. Two bases are incorrectly paired and form an internal loop instead of the two consecutive bulges of Figure (a). (c) The sequences X_1 and X_2 fold correctly into D_1 and D_2

5.14 Correlation between the expected running times in CPU seconds of the Linker (x-axis) and the Internal (y-axis) approaches to design the biological duplexes from Table

Correlation between the expected running times of the Linker and the Internal algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Linker or Internal is unable to design as 10^6 CPU seconds. Structures that are designed by none of the approaches are excluded. (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD

Correlation between the expected running times in CPU seconds of the Internal (x-axis) and the Interface (y-axis) approaches to design the biological duplexes from Table

Correlation between the expected running times of the Internal and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Interface or Internal is unable to design as 10^6 CPU seconds. Structures that are designed by none of the approaches are excluded. (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD

Scaling analysis of the median (Q50) of expected run-time (y-axis) of artificially generated duplexes BIOMD of lengths 50, 100, 200 and 500 and artificially generated structures BIOM of lengths 50, 75, 100, 125, 150, 200 and 500 (x-axis). Duplexes are designed with Linker, Interface and Internal in Figure (a), (b) and (c) respectively. In all cases, single structures from BIOM are designed with RNA-SSD. The line corresponds to the best fit of the BIOM data (obtained in Chapter 4) for structures with lengths 50 to 150 using a polynomial of degree three

Pseudocode for designing stable structures with RNA-SSD-stability

Pseudocode for SLS-stable

6.3 Correlation between the average number of incorrect nucleotides and the probability of the target in the ensemble when sequences are designed with RNA-SSD-stability and INFO-RNA. Logarithmic values are used on both axes, with n(s*) on the x-axis and p(s*) on the y-axis. A structure is stable if n(s*) and p(s*) are small. Panels (a) and (b) correspond to structure A-1 and panels (c) and (d) to structure A-2. The stability metric n(s*) is optimized in panels (a) and (c) whereas p(s*) is optimized in panels (b) and (d)

Correlation between the stability metrics n(s*) and p(s*). Panels (a) and (b) correspond to structure B-1 and panels (c) and (d) to structure B-2. The values of n(s*) and p(s*) of the corresponding biological sequence are indicated in each panel

Correlation between the stability metrics n(s*) and p(s*). Panels (a) and (b) correspond to structure B-3 and panels (c) and (d) to structure B-4. The values of n(s*) and p(s*) of the corresponding biological sequence are indicated in each panel

Correlation between the stability metrics n(s*) and p(s*). Panels (a) and (b) correspond to structure B-5 and panels (c) and (d) to structure B-6. The values of n(s*) and p(s*) of the corresponding biological sequence are indicated in each panel

Solution quality distribution of RNA-SSD-stability and INFO-RNA with respect to the probability of shapes. Panels (a) and (b) correspond to structure A-1 and panels (c) and (d) to structure A-2. The stability metric n(s*) is optimized in panels (a) and (c) whereas p(s*) is optimized in panels (b) and (d)

Solution quality distribution of RNA-SSD-stability and INFO-RNA with respect to the probability of shapes. Panels (a) and (b) correspond to structure B-1 and panels (c) and (d) to structure B-2. The stability metric n(s*) is optimized in panels (a) and (c) whereas p(s*) is optimized in panels (b) and (d)

6.9 Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the artificial structure A-1. In panels (a) and (b) the sequences are designed by the adaptive walk and RNA-SSD-stability respectively, using n(s*) as the stability measure. Panels (c) and (d) show the quality of the sequences designed by adaptive walk and RNA-SSD-stability respectively when p(s*) is optimized

Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the artificial structure A-2 (panels (a) and (b)) and the biological structure B-1 (panels (c) and (d)) using the adaptive walk and the RNA-SSD-stability algorithm, respectively. The values of n(s*) and p(s*) for the biological sequence corresponding to B-1 are also included

Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the biological structure B-2. The values of n(s*) and p(s*) for the biological sequence corresponding to B-2 are also included in each panel

Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for B-3 (panels (a) and (b)) and B-4 (panels (c) and (d)). The stability of the corresponding biological sequence is indicated by an arrow

Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the biological structure B-5. The stability of the corresponding biological sequence is indicated by an arrow

Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for B-6. The stability of the corresponding biological sequence is indicated by an arrow

Solution quality distribution for RNA-SSD-stability and adaptive walk. The x-axis shows the probability of the shape that contains the target structure for every sequence designed by RNA-SSD-stability and the adaptive walk. The y-axis gives the fraction of structures whose shape probability is at most the x-value. Panels (a) and (b) show the stability of the sequences designed for structure A-1. Panels (c) and (d) correspond to structure A-2. The stability metric n(s*) is optimized in panels (a) and (c) whereas p(s*) is optimized in panels (b) and (d)

6.16 Solution quality distribution of RNA-SSD-stability and adaptive walk with respect to the probability of shapes. Panels (a) and (b) correspond to structure B-1 and panels (c) and (d) to structure B-2. The stability metric n(s*) is optimized in panels (a) and (c) whereas p(s*) is optimized in panels (b) and (d)

(a) Motif I: internal loop formed by breaking the base pair x i+ x j 3 from motif B. (b) Motif I: internal loop formed by breaking the base pair x i+6 x j 4 from motif 2I

Acknowledgements

My deepest gratitude goes to Anne Condon and Holger Hoos for their guidance, advice and enthusiasm as my research supervisors. They provided me with countless hours of discussions and financial support. Their keen observations and thorough revisions helped me to considerably improve my work. Certainly, without their enormous patience and constant encouragement this thesis would not have been completed.

I also have to acknowledge Jenny Bryan, David Kirkpatrick, Lawrence McIntosh and Paul Higgs, who kindly agreed to serve as members of my committee and provided me with valuable feedback and recommendations on this project.

Many thanks to Mirela Andronescu for her collaboration, enthusiasm and help when the technical aspects of my thesis became a burden. I would also like to thank Roman Baranowski for introducing me to the IAM cluster, and Dan Tulpan and Rachel MacKay for wonderful discussions. Thanks to Sanja Rogic for keeping me calm when I was finishing my thesis. And of course to the people in the beta lab and the IAM who made my stay at UBC enjoyable.

I am grateful for the financial support from CONACyT and from the Government of Canada in the first years of the program.

Special thanks to my parents and siblings, always supportive and encouraging. Finally, but most importantly, my deepest gratitude to my husband Alberto for his sacrifices, endless patience, love and help in editing this document.

To my lovely Tessa

Chapter 1

Introduction

Ribonucleic acids (RNA) are macromolecules that play fundamental roles in many biological processes. RNAs such as ribosomal, transfer and messenger RNAs have well-established roles in protein biosynthesis. Other RNAs, called ribozymes, catalyze several essential biological processes such as RNA splicing, RNA processing and peptide bond formation during translation [46].

RNAs are composed of a single strand of four different nucleotides: adenine (A), cytosine (C), guanine (G) and uracil (U). Each strand has two chemically distinct ends, known as the 5′ and 3′ ends; when written as a base sequence, the first base is considered to represent the 5′ end. The unique sequence that describes a particular RNA molecule is known as the primary structure of that molecule (Figure 1.1a). The RNA single-stranded chain can fold on itself, forming hydrogen bonds between pairs of bases. The most common pairs occur between the Watson-Crick complementary bases: A bonds with U, C bonds with G (and vice versa), but there are also G-U wobble base pairs and non-canonical pairs formed by other bases.

Folded RNA can be characterised at the secondary and tertiary structure level. An RNA secondary structure is the set of Watson-Crick and wobble base pairs that form when the molecule folds. The base pairs produce helical regions, also called stems, and single-stranded regions or loops (Figure 1.1b). The tertiary structure is the relative location of atoms in the molecule in three-dimensional space. It includes the precise geometry of the base-pairing interactions as well as the spatial arrangement of any interaction between secondary and tertiary structure elements.

Figure 1.1: Primary structure (a) and secondary structure (b) of an RNA molecule.

The structure of most RNAs is essential for their biological function. Replication control in single-stranded RNA viruses [35] is governed by their secondary structure. Mechanisms for mRNA synthesis also rely on specific RNA structures. For example, in bacteria, transcription termination is caused by the RNA polymerase responding to structural perturbations in the DNA template or by hairpin loops in the transcript [60]. In eukaryotes, mRNA contains introns whose splicing is catalyzed by a ribonucleoprotein complex known as the spliceosome. Conserved secondary structure motifs of introns can play an important role in intron recognition by spliceosomes. Furthermore, particular mRNA conformations can be recognized by regulatory proteins that modulate translation initiation, and mRNAs encode signals that modulate translation and regulate gene expression [22, 23]. These signals include sequences and structures that serve as targets for translational repressors [60].

The structure of ribozymes also plays an important role in their catalytic activity. For example, allosteric ribozymes have an effector-binding site that is separate from their active site. When the effector binds to the ribozyme, it induces a conformational change in the ribozyme that enhances or inhibits catalytic function; effectors can be proteins, oligonucleotides, metal ions, etc. [5]

Computational approaches for the prediction of RNA secondary structures are based on a thermodynamic model that associates a free energy value with each possible secondary structure for a strand at fixed conditions such as temperature and ion concentration. Thermodynamic parameters for RNA folding have been determined by different methods such as optical melting, microcalorimetry [52] and knowledge-based methods using databases of known structures [4]. The secondary structure with the lowest possible free energy value, the minimum free energy (MFE) structure, is predicted to be the most stable secondary structure for the strand and is known as the ground state.

There are dynamic programming algorithms that, given an RNA strand, find the secondary structure with MFE. The MFold server of Zuker [7] and RNAfold from the Vienna Package [29] are widely used programs for predicting structures without pseudoknots (see Chapter 2 for the definition of pseudoknot-free structures) using the RNA energy parameters of the Turner group [4]. Both algorithms have a Θ(n^3) running time, where n is the length of the input RNA sequence. There are also several dynamic programming algorithms for predicting RNA secondary structures with pseudoknots, but most of these handle a restricted class of pseudoknotted structures [47]. The general problem of predicting RNA secondary structures with pseudoknots is NP-hard, but Rivas and Eddy [49] proposed an algorithm with a time complexity of Θ(n^6) that can handle a large class of structures [47].

We are interested in the inverse problem, the RNA Secondary Structure Design Problem. Thus, in this thesis, we focus on the design of RNA strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model such as that of the Turner group [4]. It is not known whether this problem has a polynomial-time algorithm. An algorithm for solving this problem takes as input a secondary structure of an RNA molecule (without specific bases assigned to the sequence positions), and outputs an RNA sequence that is predicted to fold into that structure. Current algorithms design for MFE; that is, they find a sequence whose MFE structure is the target structure. Once the sequence is known, it is possible to create the RNA molecules using established laboratory techniques.
One of the best algorithms for solving this problem is a stochastic local search algorithm known as the RNA Secondary Structure Designer (RNA-SSD) of Andronescu et al. [4]. Stochastic local search (SLS) algorithms initialise the search process at a randomly chosen point of the given search space (here: the space of RNA sequences of a given length) and then proceed by iteratively moving from the present point to a neighbouring point, where the decision on each search step is based on local knowledge only, as well as randomization [32]. One of the reasons why RNA-SSD performs well is that it decomposes the input secondary structure at multiloops. Unlike other algorithms, RNA-SSD recursively splits the structure into two substructures in each decomposition step, thus obtaining a binary decomposition tree whose root is formed by the full target structure and whose leaves correspond to small substructures. Each non-leaf node of the tree represents a substructure that can be obtained by merging the two substructures corresponding to its children. The SLS algorithm is only applied to the smallest substructures, and the corresponding partial solutions are combined into candidate solutions for larger subproblems, guided by the decomposition tree.

1.1 Motivation

RNA Secondary Structure Design is important because it facilitates both the characterization of biological RNAs by their function and the design of new ribozymes that can potentially be used as therapeutic agents [4, 46]. There are also applications in nanobiotechnology in the context of building self-assembling, stable structures and devices from small RNA molecules [58].

Several aspects play an important role in the design of RNA structures for practical applications. For example, when designing ribozymes we need to be able to design duplexes, because ribozymes catalyze highly sequence-specific reactions determined by RNA-RNA interactions between the ribozyme and its substrate molecule. Substrate recognition and binding to the substrate are essentially governed by Watson-Crick interactions. Figure 1.2 shows the sequence requirements for the hairpin ribozyme-substrate complex. In particular, primary structure constraints play a fundamental role in the design of the complex. Most of the nucleotides important for catalytic activity are contained within the unpaired regions. There are almost no sequence restrictions for bases involved in the formation of the intermolecular helices as long as base pairing is achieved. Substrate regions that interact with the ribozyme do not show high sequence requirements. However, a stable association between the two molecules is required, especially close to the cleavage site. The catalytic performance of the hairpin ribozyme can be improved by stabilisation of helix 4 (H4) [46].
Consequently, it is also important to design for stability. It is also possible to design DNA sequences for the construction of DNA-based geometrical objects and nanomechanical devices. The key to using DNA for this purpose is the design of stable branched molecules that can interact with other nucleic acid molecules [57].

Figure 1.2: Hairpin ribozyme-substrate complex. The arrow indicates the cleavage site in the substrate. Sequence requirements are specified, where N represents any nucleotide; Y represents C or U; R represents A or G; B represents C, G or U; V represents A, C or G. Illustration adapted from E. Puerta-Fernández et al. [46].

An important aspect in the study of the RNA Secondary Structure Design problem is to understand better the factors that render RNA structures hard to design. Such understanding provides the basis for improving the performance of computational approaches for solving this problem and for characterising their limitations. To our knowledge, it has not been determined whether there is a polynomial-time algorithm for RNA secondary structure design. Schuster et al. [53] performed experiments with the RNAinverse algorithm by Hofacker et al. [3] on a few small random sequences and a simple tRNA to support the hypothesis that there is no need to search huge portions of the sequence space to find a particular structure by mutation and selection. Based on these experiments, they argue that sequences sharing the same structure are distributed randomly over sequence space and that common structures, that is, structures that have many sequences that fold into them, can be accessed from an arbitrary sequence compatible with the target structure by a number of mutations much smaller than the sequence length. These results are based on small sequences and therefore do not give insight into the computational complexity of the design problem. On the other hand, Andronescu et al. [4] found evidence that some ribosomal RNA structures are difficult to design and that the correlation between size and hardness is not very strong.

The goals of our work in this thesis are:

1. Understand the empirical complexity of the RNA Secondary Structure Design problem, that is, the scaling of the typical difficulty of the design task for various classes of RNA structures as the size of the target structure is increased. Closely related is the identification of factors that make RNA structures hard to design.
2. Develop methods for RNA secondary structure design with primary structure constraints; in this case, the input comprises fixed bases or base types in certain positions of the sequence to be designed.
3. Develop methods for designing duplexes of RNA molecules with a specific secondary structure.
4. Develop methods for designing stable RNA molecules, since there are several RNA sequences that can fold into a given structure. Finding a sequence whose MFE structure has low free energy is one approach to design for stability. However, in Chapter 2 we will describe other stability criteria, discussed by Dirks et al. [25], that can be used to design more stable structures than the MFE approach.

1.2 Research goals and contributions

In this section we explain our contributions to the rational design of RNA secondary structures. This includes building efficient algorithms to design duplexes and stable structures. To the best of our knowledge, there are no algorithms for designing duplexes in the literature, and the only algorithm that explicitly designs stable structures is the adaptive walk of Dirks et al., which we will describe in Chapter 3. However, this algorithm performs well only on short structures.

To achieve our first goal, namely to understand the empirical complexity of the RNA Secondary Structure Design Problem, we present an empirical analysis of the performance of two algorithms for the problem. One is the RNA-SSD algorithm of Andronescu et al. [4] and the other is the RNAinverse algorithm of Hofacker et al. [3] from the Vienna RNA Package [29].
RNA-SSD is one of the best algorithms available to design RNA molecules in terms of the time spent designing a given structure and the number of structures that it is able to design. We used an improved version of RNA-SSD that supports primary structure constraints; an online version is available. For the analysis we consider randomly generated structures, obtained by folding a randomly generated sequence with the RNAfold function from the Vienna RNA Package, and also structures that are generated to have random features of biological structures, which we refer to as biologically generated structures. The use of random structures allows us to test the algorithms on types of structures that rarely occur in nature but that can have applications in nanotechnology. Furthermore, the performance of RNA-SSD on biologically generated structures is relevant for biological applications like molecular therapeutics.

The scaling analysis on random and biologically motivated structures supports the hypothesis that the running time of both algorithms scales polynomially with the size of the structure. We also found that the algorithms are in general faster when constraints are placed only on paired bases in the structure. When comparing both algorithms, the RNA-SSD algorithm performs better than RNAinverse, since it requires less time to design a given structure and is able to design more structures. Furthermore, we prove that, according to the standard thermodynamic model, for some structures that the RNA-SSD algorithm was unable to design, there exists no sequence whose minimum free energy structure is the target structure.

Our next goal is to extend the RNA-SSD algorithm so that it designs structures that work better in practice. Here we consider the design of complexes of two RNA molecules that can work as ribozymes or be used in nanostructure design. We describe three different approaches for RNA duplex design. To the best of our knowledge, these are the first algorithms for designing duplexes reported in the literature. Each of our algorithms takes as its input a complex of molecules that adopt a particular secondary structure.
Depending on the problem of interest, there may be input base constraints located in some positions of the complex. The outputs are two sequences that are predicted to form the desired complex according to the PairFold function of Andronescu et al. [5]. The Linker approach concatenates two molecules to generate a single structure that is designed with RNA-SSD. The designed sequence is then separated into the corresponding strands. The Interface approach assigns bases in the positions of intermolecular helices of the complex. Then both structures are designed independently by RNA-SSD, with the bases located in the regions where both structures interact remaining fixed. The Internal approach modifies the RNA-SSD algorithm to design duplexes inside the algorithm, where the candidate solutions are evaluated with PairFold.

A scaling analysis of these algorithms shows no evidence that the running times of Linker and Internal scale exponentially with the length of the duplex. When comparing these approaches, we found that Interface has the lowest running time and designs fewer duplexes in our data set than Linker and Internal. Frequently, Interface has the problem that components of the designed sequences form undesirable interactions even though each sequence folds correctly into the structure of the corresponding molecule. We also found that Linker has the best running time but Internal designs more complexes than the other algorithms. Internal uses the PairFold function to evaluate the designs inside the algorithm. The theoretical running time of PairFold is $\Theta(n^3)$, where $n$ is the sum of the two sequence lengths. However, in practice its running time is longer than that of RNAfold, due to additional checking for the location of the intermolecular linkage in the calculation of the free energy of the duplex [3]. That is, in practice PairFold is a constant factor more expensive than RNAfold, which evaluates designs for single sequences. Therefore, there is a trade-off in using PairFold internally: doing so is more expensive, even though it designs more duplexes.

Finally, we extend the RNA-SSD algorithm to design stable structures, since to the best of our knowledge there are no algorithms reported in the literature that design stable structures explicitly. We design for stability inside RNA-SSD at the lowest level of the algorithm, where the smallest substructures of the decomposition tree are designed. An SLS algorithm is used to find a stable substructure with respect to some stability criterion, such as the probability of observing the MFE structure or the average number of incorrectly paired nucleotides in the ensemble of structures of the designed sequence.
The performance of the extended RNA-SSD, called RNA-SSD-stability, is evaluated by comparing the stability of the sequences designed with this algorithm and with other approaches, namely the adaptive walk procedure of Dirks et al. [25] and the INFO-RNA algorithm of Busch et al. [6]. RNA-SSD typically does not achieve the performance of INFO-RNA when designing MFE structures. However, RNA-SSD-stability performs better than INFO-RNA, which does not use any stability measure to design structures. When RNA-SSD-stability is compared with the adaptive walk, we find that our algorithm designs more stable sequences in less time, especially for structures that are difficult to design. However, for very long run-times, the adaptive walk performs better than RNA-SSD. An insight gained from our comparison is that RNA-SSD-stability designs stable structures by merging stable substructures. Therefore, we believe that further improvements can be obtained if stable sequences are designed at every level of the decomposition tree, possibly by keeping several candidate subsequences to concatenate and selecting the most stable.

1.3 Thesis outline

The remainder of this thesis is structured as follows. In Chapter 2, we give an overview of existing related work from the literature. Computational approaches to the design of RNA secondary structures are discussed in Chapter 3, including heuristic algorithms available to solve this problem such as RNAinverse, INFO-RNA and RNA-SSD. In Chapter 4 we present an extension of RNA-SSD that supports primary structure constraints and study the empirical complexity of the RNA Secondary Structure Design Problem. Chapter 5 deals with the design of RNA duplexes, and Chapter 6 with the design of stable RNA molecules. The conclusions of this thesis and the future directions in which this work can be extended are discussed in Chapter 7. A proof that some structural motifs are impossible to design with the current thermodynamic model of the Turner group [4] is provided in the Appendix.

Chapter 2

RNA Secondary Structure Design Problem

In this chapter we give some background to define the RNA Secondary Structure Design Problem.

2.1 Secondary structure of an RNA molecule

RNA secondary structure is characterized as the set of base pairs inducing a structure like the one in Figure 1.1 (b). Consider a sequence $X = x_1, x_2, x_3, \ldots, x_n$, where $x_i \in \{A, C, G, U\}$ for all $i = 1, \ldots, n$. For $1 \le i < j \le n$, let $i \cdot j$ denote the pairing of base $x_i$ with $x_j$. A secondary structure $S$ on a sequence $X$ is a set of base pairs such that any two base pairs in $S$, $i \cdot j$ and $i' \cdot j'$, are either identical, or else $i \ne i'$ and $j \ne j'$. This means that a base is paired with at most one other base.

Figure 2.1 shows a classification of the loops and the base pairs in a secondary structure. We can classify loops formed by the bonding of base pairs in a structure according to the number of base pairs that they contain. A hairpin loop contains exactly one base pair (Figure 2.1a). An internal loop contains exactly two base pairs (Figure 2.1b). A bulge is an internal loop with one base from each of its two base pairs adjacent (Figure 2.1c). Furthermore, a stacked pair is defined as two consecutive base pairs $(i \cdot j)$ and $(i+1 \cdot j-1)$ (Figure 2.1e). A stem is formed by one or several stacked pairs. A multibranched loop or multiloop is a loop that contains more than two base pairs (Figure 2.1d). An exterior or closing pair is the base pair in a loop closest to the ends of the RNA strand. More precisely, the exterior pair is the one that maximizes $j - i$ over all pairs $i \cdot j$ in the loop. All other pairs are interior. Dangling bases are free bases located in the immediate vicinity of a stem.

An RNA secondary structure includes a pseudoknot if there exist two base pairs $i \cdot j$, $i' \cdot j'$ in the structure with $i < i' < j < j'$ (Figure 2.1f); otherwise it is pseudoknot free. The bases $(i, j, i', j', \ldots, i_d, j_d)$ with $d \ge 0$, $i < j < i' < j' < \cdots < i_d < j_d$ define an external loop if $i$ pairs with $j$, $i'$ pairs with $j'$, ..., $i_d$ pairs with $j_d$, and $k$ is a free base for all $k$ with $1 \le k < i$, $j < k < i'$, ..., $j_d < k \le n$, where $n$ is the length of the sequence (Figure 2.1g). A domain is a substructure which is closed by a base pair of an external loop, that is, a domain closed by $i \cdot j$ is the set of all base pairs whose indices are in the interval $[i, j]$. The external loop shown in Figure 2.1 (g) contains two domains and seven free bases. The free bases are called external bases because they are not inside any domain, but they are between domains or between a domain and one of the ends of the strand.

2.2 Free energy of a single RNA strand

Associated with a secondary structure of a strand is its free energy. The free energy of a secondary structure measures (in kcal/mol) the specificity of the sequence for a secondary structure at fixed temperature. The bases that are bonded tend to stabilize the RNA, whereas unpaired bases form destabilizing loops. The sequence is most likely to fold into the structure with the lowest free energy. For pseudoknot-free secondary structures, the free energy is calculated with the nearest neighbour thermodynamic model described by Zuker et al. [74] and Mathews et al. [4]. This model calculates the free energy of a structure as the sum of the free energies of each stacked pair and each loop, using experimentally obtained thermodynamic data. An example is given in Figure 2.2.

Let $\Delta G(X, S)$ denote the free energy of an RNA sequence $X$ when folded into a secondary structure $S$. Furthermore, let $\Phi$ denote a function that assigns to each RNA sequence $X$ a secondary structure $S$ that minimizes the free energy $\Delta G(X, S')$ over all possible secondary structures $S'$ of $X$. The RNA Secondary Structure Prediction Problem can be formulated as follows. Given an RNA sequence $X$, determine $\Phi(X)$.
Zuker and Stiegler developed a dynamic programming algorithm for finding the minimum free energy (MFE) secondary structure $S$ without pseudoknots of an RNA molecule $X$. They use the nearest neighbour thermodynamic model to evaluate the energy $\Delta G(X, S)$ of a sequence $X$ folded into the secondary structure $S$. The running time of Zuker and Stiegler's algorithm is $\Theta(n^4)$ [75] and it has been reduced to $\Theta(n^3)$ by Lyngsø et al. [39]. The program mfold [72] was the first implementation of Zuker and Stiegler's algorithm and is available online [73]. The Vienna RNA Package [3] also implements Zuker and Stiegler's algorithm. It is available online and is free open source software [30].
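The definitions of Section 2.1 can be made concrete with a small sketch. It assumes that a structure is written in dot-bracket notation (matching parentheses for paired positions, dots for unpaired ones); the function names `parse_dot_bracket` and `is_pseudoknot_free` are illustrative choices and do not come from mfold or the Vienna RNA Package.

```python
def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j), 1-indexed with i < j."""
    stack, pairs = [], set()
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.add((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def is_pseudoknot_free(pairs):
    """True if every base is in at most one pair and no two pairs cross."""
    bases = [base for pair in pairs for base in pair]
    if len(bases) != len(set(bases)):
        return False  # some base takes part in more than one pair
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        for k, l in ordered[index + 1:]:
            # pairs are sorted by their first index, so i < k here
            if k < j < l:
                return False  # i < k < j < l: the two pairs cross
    return True


hairpin_pairs = parse_dot_bracket("((..))")
```

A structure parsed from balanced dot-bracket notation is always pseudoknot free, so the second function is mainly useful for pair sets that come from another source.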

Figure 2.1: Secondary structure motifs. (a) Hairpin. (b) Interior loop. (c) Bulge. (d) Multiloop. (e) Helix or stem. (f) Pseudoknot. (g) External loop. (Labels in the figure: exterior pair, interior pair, dangling base, stacked pair, Domain A, Domain B, external bases.)

Figure 2.2: Free energy calculation of a secondary structure. Stems are denoted by S, hairpins by H, multiloops by M, internal loops by I, bulges by B and dangling ends by D. The free energy of this structure is $\Delta G(S, X) = \Delta G(S_1) + \cdots + \Delta G(S_5) + \Delta G(H_1) + \Delta G(H_2) + \Delta G(M) + \Delta G(I) + \Delta G(B) + \Delta G(D) = -23.3$ kcal/mol.

An abstract secondary structure $S$ of length $n$, $(n, S)$, is defined as a set of integer pairs $(i, j)$, $1 \le i < j \le n$, such that each $i$ is contained in at most one pair and no two base pairs $(i, j)$ and $(i', j')$ of $S$ cross, that is, it is not the case that $i < i' < j < j'$ or that $i' < i < j' < j$. The RNA Secondary Structure Design Problem can be stated as follows. Given an abstract RNA secondary structure $(n, S^*)$, find a sequence $X^*$ such that $\Phi(X^*) = S^*$.

2.3 Free energy of a duplex

In some applications, it is desirable to predict the secondary structure of two or more interacting RNAs. Such predictions aid in understanding mechanisms for ribozyme function [69] and in designing novel ribozymes [7] or nanostructures [8]. The free energy calculation for a pair of RNA molecules is very similar to the free energy calculation for one molecule. The free energy of a duplex of RNAs is calculated with the nearest neighbour thermodynamic model as the sum of the energies of its loops. It is possible to distinguish regular and special loops. Let $Y = y_1 y_2 \ldots y_n$ be the sequence obtained by concatenating two RNA sequences $X_1$ and $X_2$, and let $b$ denote the number of nucleotides in $X_1$, that is, $b$ is the linkage location between $X_1$ and $X_2$. A loop is special if the linkage location $b$ lies within it; otherwise it is regular. The free energy of a special loop is calculated in the same way as for a regular loop, except that an inter-molecular initiation penalty $I = 4.1$ kcal/mol [5] is added if the special structure is a hairpin, an internal loop, a multiloop or a stacked pair.

Andronescu et al. developed an efficient algorithm, PairFold [5], that predicts the MFE secondary structure that can be formed by two interacting nucleic acid molecules. This algorithm takes as input a pair of RNA strands $X_1$ and $X_2$, and extends the dynamic programming algorithm by Zuker and Stiegler [75] for single molecules.
The worst-case complexity of PairFold when calculating the MFE structure is $\Theta(n^3)$ for time and $\Theta(n^2)$ for space, where $n$ is the sum of the two input sequence lengths. The structure of a duplex is specified as $(n, R, b)$, where $R$ is the secondary structure of length $n$ of the interacting strands $X_1$ of length $b$ and $X_2$ of length $n - b$. We refer to $b$ as the linkage location. An abstract duplex $R$ of length $n$, $(n, R, b)$, is defined as a set of integer pairs $(i, j)$, $1 \le i < j \le n$, such that each $i$ is contained in at most one pair and no two base pairs $(i, j)$ and $(i', j')$ of $R$ cross, that is, it is not the case that $i < i' < j < j'$ or that $i' < i < j' < j$. The RNA Secondary Structure Design Problem for Duplexes can be formulated as follows. Given an abstract duplex secondary structure $(n, R, b)$, find a pair of sequences $X_1$ of length $b$ and $X_2$ of length $n - b$ such that $\Phi(X_1, X_2) = (n, R, b)$.

2.4 Partition function

Although free energy models for secondary structure loops have been refined over time to achieve a better characterization of folding thermodynamics, the energy parameters are still inaccurate [4]. In the previous section we described how the free energy of a structure is computed by adding the free energy of each loop and stacked pair. A slight deviation in the free energy parameters can lead to substantial differences in the computed MFE structure. Hence, the MFE structure derived from a folding algorithm may not be the true structure, i.e. the structure into which the molecule folds. Another consideration is that RNA molecules are not in the (true) MFE structure all the time, because base pairs are constantly forming and breaking. For example, co-transcriptional folding leads to the formation of temporary secondary structure elements [36, 48] that have biological functions, e.g. as initial sites for protein anchoring during pre-mRNA transcription [50]. Furthermore, RNA conformational switching is fundamental in translational regulation, protein synthesis, and mRNA splicing [38]. However, the MFE structure is the most likely structure. These observations suggest that it is beneficial to characterise the ensemble of all secondary structures.

The probability $P(S)$ of sampling a secondary structure $S$ in the ensemble of structures of a given RNA molecule $X$ with free energy $\Delta G(X, S)$ is proportional to $e^{-\Delta G(X,S)/RT}$, where $R$ is the gas constant and $T$ is the absolute temperature. The partition function, from statistical mechanics, is a normalising constant that allows estimation of the probabilities from the free energy values and is given by

$$Z(X) = \sum_{S \in \mathcal{S}(X)} e^{-\Delta G(X,S)/RT} \qquad (2.1)$$

where $\mathcal{S}(X)$ is the set of all possible secondary structures of a strand $X$. Therefore, the probability of observing a structure $S$ in the ensemble of a given sequence $X$ is

$$P(X, S) = \frac{e^{-\Delta G(X,S)/RT}}{Z(X)}. \qquad (2.2)$$

To simplify notation we will use $P(S)$ from now on.
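As a minimal sketch of Equations 2.1 and 2.2, the following computes the partition function and the Boltzmann probabilities by explicit summation over a tiny, made-up ensemble. The three structures and their free energies are hypothetical illustration values, not output of any energy model; the function name `boltzmann_probabilities` and the temperature of 310.15 K (37 °C) are likewise assumptions of this sketch.

```python
import math

GAS_CONSTANT = 1.98717e-3  # kcal/(mol*K)


def boltzmann_probabilities(free_energies, temperature_kelvin=310.15):
    """Return (probabilities, Z) for a dict {structure: free energy in kcal/mol}."""
    rt = GAS_CONSTANT * temperature_kelvin
    # Boltzmann weight of each structure: exp(-dG / RT)
    weights = {s: math.exp(-dg / rt) for s, dg in free_energies.items()}
    partition = sum(weights.values())  # Equation 2.1, explicit summation
    probabilities = {s: w / partition for s, w in weights.items()}  # Equation 2.2
    return probabilities, partition


# Hypothetical free energies (kcal/mol) for three structures of one strand.
toy_ensemble = {"((...))": -2.0, "(.....)": -0.5, ".......": 0.0}
probabilities, partition = boltzmann_probabilities(toy_ensemble)
mfe_structure = min(toy_ensemble, key=toy_ensemble.get)
```

Explicit summation is only feasible for toy ensembles like this one, because the number of structures grows exponentially with sequence length.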

Note that the partition function (see Equation 2.1) is a weighted sum over all admissible secondary structures of a given RNA sequence. An admissible secondary structure has a set of base pairs that can be formed from the RNA sequence. The lower the free energy of a structure, the higher its weighting, that is, its contribution to the sum. The computation time for calculating the partition function by explicitly summing all terms grows exponentially with the length $n$ of the sequence. However, McCaskill [42] derived a dynamic programming algorithm to calculate the partition function of pseudoknot-free structures in time $\Theta(n^3)$.

McCaskill also introduced a matrix of equilibrium base pair probabilities between all nucleotides in the RNA sequence. The values displayed in the upper triangle of the matrix (sometimes graphically shown as differently sized dots or boxes) represent the sum of the probabilities associated with all the structures in which the chosen base pair occurs. The lower triangle of the matrix is often used to illustrate the MFE structure. This matrix summarizes the features of the global ensemble of structures at equilibrium. Figure 2.3 shows the matrix for the sequence of Figure 2.2.

The pairing probability for every possible pair can be calculated efficiently using a dynamic programming algorithm that is described by McCaskill [42]. Let $p_{ij}$ denote the probability of forming a base pair $i \cdot j$ for $1 \le i, j \le n$, where $n$ is the sequence length, and let $p_{i,n+1}$ be the probability that $i$ is unpaired for $1 \le i \le n$. Let

$$\delta^{\alpha}_{ij} = \begin{cases} 1 & \text{if } i \cdot j \text{ is a base pair of the structure } S_\alpha \\ 0 & \text{otherwise} \end{cases} \quad \text{for } 1 \le i, j \le n \qquad (2.3)$$

$$\delta^{\alpha}_{i,n+1} = \begin{cases} 1 & \text{if } i \text{ is an unpaired base of the structure } S_\alpha \\ 0 & \text{otherwise} \end{cases}$$

then

$$p_{ij} = \sum_{S_\alpha \in \mathcal{S}(X)} P(S_\alpha)\, \delta^{\alpha}_{ij} \quad \text{for } 1 \le i \le n \qquad (2.4)$$

where $P(S_\alpha)$ is the probability of observing $S_\alpha$ in the ensemble of structures $\mathcal{S}(X)$.
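Equation 2.4 can be illustrated by brute force on a toy ensemble. This is not McCaskill's dynamic programming algorithm, only a direct evaluation of the definition; the ensemble, its probabilities and the name `pairing_probabilities` are assumptions made for the example. Column `n + 1` holds the probability that a base is unpaired.

```python
def pairing_probabilities(ensemble, n):
    """Return a 1-indexed matrix p with p[i][j] as in Equation 2.4.

    ensemble: list of (set of base pairs, probability) tuples.
    p[i][n + 1] is the probability that base i is unpaired.
    """
    p = [[0.0] * (n + 2) for _ in range(n + 1)]
    for pairs, probability in ensemble:
        partner = {}
        for i, j in pairs:
            partner[i] = j
            partner[j] = i
        for i in range(1, n + 1):
            # delta is 1 exactly for the partner of i (or for n + 1 if unpaired)
            p[i][partner.get(i, n + 1)] += probability
    return p


# Hypothetical ensemble of three structures on a strand of length 7.
toy_ensemble = [
    ({(1, 7), (2, 6)}, 0.7),
    ({(1, 7)}, 0.2),
    (set(), 0.1),
]
p = pairing_probabilities(toy_ensemble, n=7)
row_sums = [sum(p[i][1:]) for i in range(1, 8)]
```

Each row of the matrix sums to one, because in every structure a base is either unpaired or paired with exactly one other base.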

Figure 2.3: Matrix of equilibrium base pair binding probabilities between all nucleotides in the sequence of Figure 2.2. The lower triangle represents the optimal structure. Illustration obtained with the program MFold from Zuker et al. [7].

2.5 Stability

Given a secondary structure $S^*$, there are typically several sequences that have MFE structure $S^*$. If we want to design stable structures, then one approach is to find the sequence with the lowest MFE. Dirks et al. [25] described two paradigms for designing a structure. A positive design optimizes sequence affinity for the target structure. A negative design optimizes sequence specificity to the target structure. Sequences with high affinity have admissible structures similar to the target structure. Sequences with high specificity have the target structure as their MFE structure. When designing a structure, it is desirable to achieve both high affinity and high specificity.

Figure 2.4 shows the structure space for three sequences. The target structure is indicated in the picture. Sequence C has a higher affinity to the target structure than sequences A and B, since it has the lowest free energy when folding into the desired structure. Moreover, sequences A and B have a higher specificity to the target structure, since $S^*$ is the MFE structure of A and B. Overall we will prefer sequence B, since it has high affinity and high specificity and therefore its MFE structure is the most thermodynamically stable.

Dirks et al. [25] define several criteria to evaluate the specificity and the affinity of a structure. Energy minimization only evaluates a sequence in terms of its affinity for a target structure. The lower the energy of the sequence when folding into the target structure, the higher the affinity to the target structure. But this condition is not sufficient to have high specificity to the target structure (see sequence C in Figure 2.4). Furthermore, the MFE criterion only evaluates the sequence in terms of its specificity to the target structure but does not ensure high affinity (see sequence A in Figure 2.4).

With the partition function it is possible to calculate the probability $P(S^*)$ for a sequence to fold into $S^*$ (see Equation 2.2). If $P(S^*)$ is close to one, then the sequence achieves high affinity and specificity. However, requiring that $P(S^*)$ be close to one is a very strict design evaluation criterion, since it requires that all nucleotides match the target exactly. An alternative criterion is the weighted average number of incorrect positions $n(X, S^*)$, with respect to $S^*$, over the equilibrium ensemble of secondary structures of a given sequence $X$ [25]. To simplify notation we will use $n(S^*)$ from now on. A sequence has high affinity and specificity to the target structure if $n(S^*)$ is close to zero. The weighted average number of incorrect bases can be computed from the matrix of base pair probabilities of the partition function.
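The criterion $n(S^*)$ can be sketched directly from its verbal definition: for each structure in the ensemble, count the positions whose pairing status differs from the target, and average these counts with the structure probabilities as weights. The ensemble below is a hypothetical toy example, and the helper names `partner_map` and `average_incorrect_positions` are illustrative; a practical implementation would use base pair probabilities from a partition function algorithm instead of enumerating structures.

```python
def partner_map(pairs, n):
    """Map each position 1..n to its partner, or to n + 1 if unpaired."""
    partner = {i: n + 1 for i in range(1, n + 1)}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    return partner


def average_incorrect_positions(ensemble, target_pairs, n):
    """Probability-weighted number of positions that disagree with the target."""
    target = partner_map(target_pairs, n)
    total = 0.0
    for pairs, probability in ensemble:
        current = partner_map(pairs, n)
        wrong = sum(1 for i in range(1, n + 1) if current[i] != target[i])
        total += probability * wrong
    return total


# Hypothetical ensemble of three structures on a strand of length 7.
toy_ensemble = [
    ({(1, 7), (2, 6)}, 0.7),  # identical to the target: 0 incorrect positions
    ({(1, 7)}, 0.2),          # positions 2 and 6 are incorrect
    (set(), 0.1),             # positions 1, 2, 6 and 7 are incorrect
]
n_target = average_incorrect_positions(toy_ensemble, {(1, 7), (2, 6)}, n=7)
```

For this toy ensemble the weighted average is 0.7 · 0 + 0.2 · 2 + 0.1 · 4 = 0.8 incorrect positions.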
If $S^*$ is the MFE structure of $X$ and $\delta^*$ is defined as in Equations 2.3 and 2.4, then the weighted average number of incorrect nucleotides is:

$$
\begin{aligned}
n(S^*) &= n - \underbrace{\sum_{1 \le i \le n} \; \sum_{1 \le j \le n+1} p_{ij}\, \delta^*_{ij}}_{\text{expected number of correct nucleotides}} \\
&= n - \sum_{1 \le i \le n} \Big( \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} + p_{i,n+1}\, \delta^*_{i,n+1} \Big) \\
&= n - \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} - \sum_{1 \le i \le n} p_{i,n+1}\, \delta^*_{i,n+1} \\
&= n - \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} - \sum_{1 \le i \le n} \Big( 1 - \sum_{1 \le j \le n} p_{ij} \Big)\, \delta^*_{i,n+1}
\end{aligned}
$$

since $\sum_{1 \le j \le n} p_{ij} + p_{i,n+1} = 1$ for every $1 \le i \le n$. Therefore,

$$n(S^*) = n - \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} - \sum_{1 \le i \le n} \delta^*_{i,n+1} + \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{i,n+1}. \qquad (2.5)$$

Figure 2.4: Positive and negative design. The x-axis represents the structure space $\mathcal{S}(X)$ of a sequence $X$ and the y-axis the free energy of each structure. Illustration adapted from Dirks et al. [25].

The indicator that we use to identify stable structures depends on the context. For example, in long structures it is difficult to have a probability $P(S^*)$ close to one, because the number of structures in the ensemble is very large. In this case we can use the first energy gap [67] to identify the most stable structure. The first energy gap is a negative design criterion that is defined as the difference between the energy of the MFE structure and the second best structure.

In some situations it is better to design a sequence with a large first energy gap. Suppose that we have two sequences $X_1$ and $X_2$ whose MFE structure is the target structure $S^*$. Assume that $X_1$ has an ensemble with several structures very different from each other, all of them with probabilities close to $P(S^*)$. Then we would prefer sequence $X_2$, even if $P(S^*)$ is smaller in its ensemble, provided that it has a larger first energy gap. Note that if we choose $X_1$, then there is a significant chance of getting kinetically trapped in a very stable structure different from the target structure. Moreover, if the energetically favourable structures in the ensemble of $X_2$ are similar to each other, then it is better to choose this sequence because it has a higher probability of folding into the desired target than sequence $X_1$.

As the previous example shows, it is useful to determine whether there is some family of structures in the ensemble that is similar, distinct from the rest, and dominates the probabilities of all other families. Voß et al. [64] introduced an algorithm, called RNAshapes, to compute the accumulated probabilities of all structures that share the same shape. An RNA shape is a representation of an RNA secondary structure that abstracts from loop and stem lengths. Consider the following two secondary structures from the folding space of a single sequence:

..(((.((..(((...))).(((...)))))))).
...(((...(((...))).(((...)))..)))..

There are several levels of abstraction. In the least abstract shape, the unpaired regions are represented by an underscore and the stacking regions by a pair of square brackets. The following are the shapes of the previous structures.

_[_[_[]_[]]]_
_[_[]_[]_]_

The most abstract level excludes unpaired bases and combines nested helices.
In this case, both structures have the same shape: [ [ ] [ ] ] The accumulated probabilities of all the structures that share the same shape can be used as another stability measure. If the MFE structure belongs to a shape of probability close to one, then the corresponding sequence achieves high affinity and specificity. 20
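The two abstraction levels and the accumulated shape probability can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RNAshapes implementation: the function names (`pair_table`, `detailed_shape`, `abstract_shape`, `shape_probabilities`) are hypothetical, structures are assumed to be pseudoknot-free dot-bracket strings, hairpin loops are assumed to carry no underscore in the least abstract shape, and the probabilities in the usage lines are made-up numbers rather than values computed from a thermodynamic model.

```python
from collections import defaultdict


def pair_table(structure):
    """Return pt where pt[i] is the partner of position i, or -1 if unpaired."""
    stack, pt = [], [-1] * len(structure)
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pt[i], pt[j] = j, i
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return pt


def detailed_shape(structure):
    """Least abstract shape: '_' per unpaired region, '[...]' per stacking region."""
    pt = pair_table(structure)

    def region(i, j):
        out, k = [], i
        while k < j:
            if pt[k] == -1:
                while k < j and pt[k] == -1:  # skip the whole unpaired run
                    k += 1
                out.append("_")
            else:
                a, b = k, pt[k]
                while a + 1 < b - 1 and pt[a + 1] == b - 1:  # follow stacked pairs
                    a, b = a + 1, b - 1
                inner = region(a + 1, b)
                if "[" not in inner:  # hairpin loop: assumed to have no underscore
                    inner = ""
                out.append("[" + inner + "]")
                k = pt[k] + 1
        return "".join(out)

    return region(0, len(structure))


def abstract_shape(structure):
    """Most abstract shape: no unpaired bases, nested helices combined."""
    pt = pair_table(structure)

    def children(i, j):
        shapes, k = [], i
        while k < j:
            if pt[k] > k:  # opening position of a base pair
                shapes.append(helix(k, pt[k]))
                k = pt[k] + 1
            else:
                k += 1
        return shapes

    def helix(i, j):
        inner = children(i + 1, j)
        if len(inner) == 1:  # stack, bulge or interior loop: merge with inner helix
            return inner[0]
        return "[" + "".join(inner) + "]"  # hairpin (no child) or multiloop

    return "".join(children(0, len(structure)))


def shape_probabilities(ensemble):
    """Sum the probabilities of all structures that share an abstract shape."""
    totals = defaultdict(float)
    for structure, probability in ensemble:
        totals[abstract_shape(structure)] += probability
    return dict(totals)


# The two example structures from the text, written as equal-length strings.
S1 = "..(((.((..(((...))).(((...))))))))."
S2 = "...(((...(((...))).(((...)))..))).."
print(detailed_shape(S1), detailed_shape(S2))
print(abstract_shape(S1), abstract_shape(S2))
# Made-up probabilities; the open chain maps to the empty shape "".
print(shape_probabilities([(S1, 0.5), (S2, 0.25), ("." * 35, 0.25)]))
```

The recursion follows the nesting of base pairs, so very deeply nested structures could exceed Python's default recursion limit; an explicit stack would avoid this. In practice the ensemble passed to `shape_probabilities` would come from a partition function or sampling procedure, which this sketch does not attempt to model.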

Chapter 3. Previous Work

The most important work related to structure design is presented in this chapter. We give an overview of the design of proteins and nucleic acids. We also describe biochemical and computational methods, including heuristic algorithms for RNA structure design that have been empirically shown to achieve good performance.

3.1 Protein design

Proteins have a wide range of natural functions and hence they represent a fertile medium for the design of new medical and industrial products. The ultimate goal of protein design is the creation of novel proteins that perform specified tasks. A necessary requirement for meeting this goal is the ability to identify sequences that fold with sufficient stability into a target structure. Computational procedures for protein design consist of starting from a given three-dimensional protein structure, usually a known structure from the Protein Data Bank (PDB) [9, 0], and searching for the amino acid sequence or sequences that are compatible with this structure. Pierce and Winfree show that the protein design problem is NP-hard [44]. However, this result reflects worst-case behavior; in practice, an exponential-time algorithm may perform well, and an approximate stochastic method may be capable of finding excellent solutions to NP-hard problems. Stochastic methods based on Monte Carlo, simulated annealing or genetic algorithms have performed with some success on small protein design problems [24].

Proteins can be designed computationally by using positive strategies that maximize the stability of the desired structure or by negative strategies that seek to destabilize competing states. Design efforts have focused mainly on positive strategies that maximize favorable interactions in the target conformation. This approach has given good results, including the introduction of catalytic activity into a previously inert protein [3] and the creation of a novel protein fold [37]. Negative design, by contrast, maximizes unfavorable interactions in competing states and requires modeling of each

unwanted conformation [70]. However, one of the challenges in negative design is to model accurately the energetic effects of destabilizing mutations in competing states.

3.2 Nucleic acid design

In this section we discuss biochemical and computational methods that have been used to design nucleic acids. Nucleic acids have the advantage that they are easy to synthesize and that structure formation is mainly based on secondary structure, that is, base-pairing interactions within a strand.

Nucleic acids are versatile building materials. In nature, DNA and RNAs, such as mRNA, rRNAs and tRNAs, are involved in making proteins. In nanotechnology, DNAs and RNAs have several applications. Assembly and folding principles of natural RNA are used to build potentially functional artificial structures at the nano-scale [20, 5]. This concept, called RNA tectonics, led to the synthesis of RNA grids with various patterns such as the one in Figure 3.1. Nucleic acids have also been used to build nanomechanical devices [59]. In this case, the combination of single-stranded and double-stranded sections of DNA yields structures which can be thought of as a network of stiff and flexible elements. The deliberate formation or destruction of double-stranded sections in such a network induces conformational changes which result in nanoscale motions such as rotational motion, pulling and stretching, or even unidirectional motion. Other applications include engineering logic circuits [54] and simple computers [6].

3.2.1 Biochemical methods

In practice, RNA design is mostly done using biochemical methods. These procedures allow us to find RNA structures similar to those retained by natural evolution, and also to identify alternative conformations that can perform the same function. Structures of interest can be characterized by X-ray crystallography and NMR spectroscopy. By using phylogenetic analysis and sequence alignment of several RNA sequences it is possible to identify conserved primary and secondary structural features that can be related to function [7]. Wang and Unrau [65] use random recombination and selection to isolate the core functional elements of an RNA where phylogeny is lacking or is limited. In vitro mutagenesis and selection have been performed on several ribozymes and substrates to determine the overall secondary structure and to identify which elements are essential for activity [2, 46]. Minimal


motifs that support catalytic activity or modified structures with an improved catalytic activity can be determined by this method. Breaker et al. [5] also used in vitro selection to design allosteric ribozymes as biosensor components. An allosteric ribozyme induces or inhibits catalytic function in the presence of an effector molecule that binds to a receptor site distinct from the enzyme's active site. Another approach searches the GenBank database for potential structural motifs that might have functional significance. Ferbeyre et al. [27] mutate versions of the hammerhead self-cleaving RNAs to find alternative structures with similar function or with increased catalytic activity.

Figure 3.1: Tectosquare designed by Chworos et al. [20]. Panel (a) shows the RNA structure that can self-assemble to form the square of panel (b). Panel (c) shows the three-dimensional representation of the tectosquare.

3.2.2 Computational methods

There has also been progress with computational approaches. A deterministic approach is given by Seeman [56]. He uses a sequence-symmetry minimization algorithm in which bases are selected to minimize similarities between segments of the molecule. In this way the sequence adopts the desired conformation and is less likely to fold into an alternative structure. Designed structures are validated with gel electrophoresis, where the particular RNA molecule is identified by the band patterns it yields after being cut with various restriction enzymes.

The RNA Secondary Structure Design Problem can be seen as a discrete constraint satisfaction problem [32], where the constraint variables are the
