Computational RNA Secondary Structure Design:

Computational RNA Secondary Structure Design: Empirical Complexity and Improved Methods
by
Rosalía Aguirre-Hernández
B.Sc., Universidad Nacional Autónoma de México, 1996
M. Eng., Universidad Nacional Autónoma de México, 1999
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Doctor of Philosophy
in
The Faculty of Graduate Studies
(Mathematics)
The University of British Columbia
April, 2007
© Rosalía Aguirre-Hernández 2007

Abstract

Ribonucleic acids play fundamental roles in cellular processes and their function is directly related to their structure. The research reported in this thesis is focused on the design of RNA strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model. The design of RNA structures is important for applications in therapeutics and nanotechnology. This work also applies to DNA with the appropriate thermodynamic model for DNA molecules. The overall goal of this research is to improve the performance and scope of algorithmic methods for RNA secondary structure design. First, we investigate the hardness of this problem, since its theoretical complexity is unknown. A scaling analysis on random and biologically generated structures supports the hypothesis that the running time of the RNA Secondary Structure Designer (RNA-SSD) algorithm, one of the state-of-the-art algorithms for designing secondary structures, scales polynomially with the size of the structure. We found that structures with small stems separated by loops are difficult to design. Our improvements to the RNA-SSD algorithm include support for primary structure constraints, where bases or base types are fixed in certain positions of the sequence. Such constraints are important, for example, when designing RNAs such as ribozymes or tRNAs, where certain base positions must be fixed in order to permit interaction with other molecules. We investigate the correlation between the number and the location of the primary structure constraints and the performance of RNA-SSD. In the second part of our research, we have extended the RNA-SSD algorithm to design for stability, rather than minimum free energy folding. We measure stability according to several criteria, such as high probability of observing the minimum free energy structure and low average number of incorrectly paired nucleotides in the ensemble of structures for the designed sequence.
The design of complexes of RNA molecules, that is, RNA molecules that interact with each other, is relevant for many applications. We describe several ways to design stable structures and complexes, and we also discuss the advantages and limitations of each approach.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Introduction
  Motivation
  Research goals and contributions
  Thesis outline
RNA Secondary Structure Design Problem
  Secondary structure of an RNA molecule
  Free energy of a single RNA strand
  Free energy of a duplex
  Partition function
  Stability
Previous Work
  Protein design
  Nucleic acid design
  Biochemical methods
  Computational methods
Empirical Complexity of RNA-SSD Algorithm
  Improvement of the RNA-SSD algorithm
  Experiments
  Results
  Analysis of RNA-SSD and RNAinverse on secondary structures without constraints
  Analysis of RNA-SSD on secondary structures with constraints
  Performance of RNA-SSD with different number and locations of primary base constraints
  Summary
Interaction of Two RNA Molecules
  Description of the algorithms
  Internal algorithm
  Linker algorithm
  Interface algorithm
  Experiments
  Results
  Performance of Linker and Interface
  Performance of Linker and Internal
  Performance of Interface and Internal
  Hardness of duplex design
  Summary
Stability
  The RNA-SSD-stability algorithm
  Experiments
  Results
  Comparison of RNA-SSD-stability and INFO-RNA
  Comparison of RNA-SSD-stability and adaptive walk
  Summary
Conclusions and Future Work
Bibliography
Appendix A: Undesignable structures

List of Tables

3.1 List of algorithms for predicting and designing RNA secondary structures that are mentioned frequently in this thesis
IUPAC nomenclature for nucleic acids
Set of structures generated by folding random sequences with the RNAfold function from the Vienna RNA package. Nucleotides in the sequence are assigned uniformly at random. The sets of structures longer than 75 bases are smaller because of the amount of time required for designing these structures
Biological structures obtained from the literature and used by Andronescu et al. [4]. Structures marked with an asterisk (*) were obtained from original, pseudoknotted structures by eliminating 8 base pairs in each case to remove the pseudoknot
Properties of the structures from Table 4.3; the intervals specify the minimum and maximum values observed for the respective features. These parameters were used to generate structures with biological properties. These values denote the minimum and maximum ratio of bulges to base pairs in the stems
Sets of structures generated with the RNA structure generator, using the parameters from Table
Set of structures used to study the correlation between the primary structure constraints and the performance of RNA-SSD. Structures with similar characteristics (such as size, number of multiloops, etc.) appear in the same group. The structure Bio-50-n2 was also selected for the study because it is relatively easy to design
5.1 Pseudoknot-free ribozyme-substrate duplexes obtained from Andronescu et al. [5]. These duplexes have been selected arbitrarily from the biological literature. The lengths of the ribozyme and the target structure are specified by l1 and l2, respectively
Biologically motivated duplexes (BIOMD). The strands used to generate the duplexes have statistical features derived from biological structures (see Table 4.5). BIOMD-I refers to duplexes generated by splitting a single strand at a random position, as explained in Section 5.2. BIOMD-II duplexes are generated by removing a hairpin from a single strand
Performance results for Linker, Interface and Internal on sets of biological duplexes. The columns SR( ) show the fraction of structures for which the respective algorithm found solutions in all of the runs (SR stands for success rate). R is the target duplex; S is the structure that results from concatenating the molecules D1 and D2 of the duplex; D1 and D2 are designed independently by RNA-SSD in the Interface algorithm
Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD
Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD
Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD
Performance results for Linker, Interface and Internal on the set of artificially generated structures BIOMD
The first two structures are artificial RNAs generated by Dirks et al. [25]. Artificial 1-multiloop has one multiloop with four branches. Artificial 3-multiloop has three multiloops, each of them with three branches. The rest are biological structures that were obtained arbitrarily from the biological literature; the same structures were used by Andronescu et al. [4]
6.2 Cutoff times for the adaptive walk and RNA-SSD-stability. The running time columns represent the time spent by the adaptive walk, in CPU seconds, to find a local minimum for each structure. The adaptive walk was run only once for each of the stability criteria n(s*) and p(s*). The cutoff time columns represent the maximum time that we allow the adaptive walk and RNA-SSD-stability to find a solution
Statistics for the stability of the sequences designed by RNA-SSD-stability and INFO-RNA when n(s*) is optimized
Statistics for the stability of the sequences designed by RNA-SSD-stability and INFO-RNA when p(s*) is optimized
Probability of the shrep with the lowest free energy for the best and the median sequences designed by RNA-SSD-stability and INFO-RNA. Sequences from the upper and lower tables were obtained by optimizing n(s*) and p(s*), respectively. We do not report any value for ( ) because the target structure was not found in the shapes generated by RNAshape. For this structure we generate up to 0 shapes with free energy 0% above the MFE. This suggests that the shape probability of this sequence is low
Performance results for RNA-SSD-stability and adaptive walk on artificial and biological RNA structures. The stability criteria used to design the RNA structures are n(s*) and p(s*). The cutoff time columns indicate the running time in CPU seconds for both algorithms on our reference machine. The success rates show the fraction of runs for which the respective algorithm found a sequence that folds correctly into the target structure
Statistics for the stability of the sequences designed by RNA-SSD-stability and adaptive walk when n(s*) is optimized. The best, median and coefficient of variation, c_v, are computed for the values of n(s*) and p(s*) of the sequences designed for a given structure. The correlation, ρ_{n,p}, between n(s*) and p(s*) of all the runs of a given structure is also included
Statistics for the stability of the sequences designed by RNA-SSD-stability and adaptive walk when p(s*) is optimized
Stability of biological sequences with respect to the metrics n(s*) and p(s*)
6.10 Stability of sequences designed by RNA-SSD-stability and adaptive walk on biological RNA structures when n(s*) is optimized (upper table) and when p(s*) is optimized (lower table). We run the adaptive walk once, until it finds a local minimum. The running time in CPU seconds for the respective structure is indicated in the second column. In all cases, a valid sequence was found
Stability of the sequence designed by RNA-SSD-stability for the biological structure B-5. We performed one run of RNA-SSD-stability, where p(s*) is optimized, with a cutoff time of CPU seconds and different values of p_a and n_r. The default values of p_a and n_r are 0.2 and 5, respectively. The success rate is one if the sequence folds correctly. The metrics n(s*) and p(s*) are used to evaluate the stability of the sequences
Free energy parameters for internal loops

List of Figures

1.1 Primary structure (a) and secondary structure (b) of an RNA molecule
Hairpin ribozyme-substrate complex. The arrow indicates the cleavage site in the substrate. Sequence requirements are specified, where N represents any nucleotide; Y represents C or U; R represents A or G; B represents C, G or U; V represents A, C or G. Illustration adapted from E. Puerta-Fernández et al. [46]
Secondary structure motifs. (a) Hairpin. (b) Interior loop. (c) Bulge. (d) Multiloop. (e) Helix or stem. (f) Pseudoknot. (g) External loop
Free energy calculation of a secondary structure. Stems are denoted by S, hairpins by H, multiloops by M, internal loops by I, bulges by B and dangling ends by D. The free energy of this structure is $\Delta G(S,X) = \Delta G(S_1) + \dots + \Delta G(S_5) + \Delta G(H_1) + \Delta G(H_2) + \Delta G(M) + \Delta G(I) + \Delta G(B) + \Delta G(D) = \dots = 23.3$ kcal/mol
Matrix of equilibrium base pair binding probabilities between all nucleotides in the sequence of Figure 2.2. The lower triangle represents the optimal structure. Illustration obtained with the program MFold from Zuker et al. [7]
Positive and negative design. The x-axis represents the structure space S(X) of a sequence X and the y-axis the free energy of each structure. Illustration adapted from Dirks et al. [25]
Tectosquare designed by Chworos et al. [20]. Panel (a) shows the RNA structure that can self-assemble to form the square of panel (b). Panel (c) shows the three-dimensional representation of the tectosquare
3.2 Pseudocode for designing structures with the RNA-SSD algorithm
(a) Randomly generated structure of length 75 (RND-75-n62) with loops separated by short stems. The line represents the location where the structure is split into two substructures. Parts (b) and (c) show the corresponding substructures with static cap structure and dangling ends, respectively
Substructures from Figure a with dynamic cap structure (a) and dynamic dangling ends (b)
Ribosomal RNA: Leptospira interrogans strain
The structure motif that consists of two internal loops separated by one base pair is hard to design. We verified this experimentally by adding another base pair between the internal loops, whereupon the expected time to design the structure decreases by a factor of three
Scaling analysis of the expected run-time (y-axis) using RNA-SSD of structures of lengths 50, 75, 100, 125, 150, 200, 450 and 500 (x-axis). A logarithmic scale is used on the y-axis. The solid (dotted) line corresponds to best fits of the data, for structures with lengths 50 to 150, using a polynomial (exponential) that is specified in each case. The expected run-times for structures longer than 150 appear closer to the polynomial than the exponential fit line. (a) Median (Q50) of expected run-time of random structures. (b) 0.9-quantile (Q90) of expected run-time of random structures. (c) Median (Q50) of expected run-time of biologically motivated structures. (d) 0.9-quantile (Q90) of expected run-time of biologically motivated structures
4.5 Scaling analysis of the expected run-time (y-axis) of structures of lengths 50, 75, 100, 125, 150, 200 and 450 (x-axis). A logarithmic scale is used on both axes. The lines correspond to best fits of the data, for structures with lengths 50 to 150, using a polynomial that is specified in each case. The expected run-times for structures longer than 150 appear close to the corresponding fit line. (a) Median (Q50) of expected run-time of biological structures and median, 0.1-quantile (Q10) and 0.9-quantile (Q90) of expected run-time using RNA-SSD for biologically motivated structures. (b) Median of expected run-time of random and biologically motivated structures using RNA-SSD and RNAinverse. Structures of length 200 are the largest structures of our data set that we designed with RNAinverse
Cost distribution of RNA-SSD. Distribution of expected run-time of RNA-SSD on (a) random structures and (b) biologically motivated structures. For each point, the x-value gives an expected run-time and the y-value gives the fraction of structures whose run-time is at most the x-value
Distribution of expected run-time of RNAinverse on (a) random structures and (b) biologically motivated structures
Examples of structures not designed by RNA-SSD. Structures not designed by RNA-SSD have short stems separated by loops, indicated by arrows in the figure. (a) Random structure of length 450 (RND-450-n84). This is the only random structure in our data set that RNA-SSD did not design. Note that it has two internal loops separated by only one base pair. (b) Biologically motivated structure of length 74 (BIOM-50-n262)
Undesignable motifs. Two structure motifs of our data set that are not compatible with the thermodynamic model. Bold lines represent base pairs. (a) Motif B: bulges separated by one base pair. (b) Motif 2I: internal loops separated by one base pair
Secondary structure of small subunit ribosomal RNA of Acanthamoeba castellanii. Biological secondary structure with three bulges, each separated by only one base pair. This structure was obtained from The Comparative RNA Web (CRW) Site [7, 8]
4.11 Secondary structure of small subunit ribosomal RNA of Escherichia coli. Biological secondary structure with two internal loops separated by one base pair. This structure was obtained from The Comparative RNA Web (CRW) Site [7, 8, 28]
Distribution of expected run-time of RNA-SSD on three structures of approximately 50 bases: RND-50-n85, BIOM-50-n89 and VS Ribozyme from Neurospora mitochondria. The structures were designed with two sets of primary base constraints: one where the bases are fixed at random positions and the other where the bases are fixed on stems for each structure. Both sets have the same range [a, b] of constrained bases after propagation, where a and b are the smallest and largest numbers of bases constrained if 50% of stems are fixed in a given structure. We fixed 50% plus one stems if a structure has an odd number of stems
Scaling analysis using RNA-SSD for the median expected run-time of biologically motivated structures with no primary base constraints and with bases constrained in fifty percent of random positions and fifty percent of stems. The lines represent the polynomial that best fits the data. The experiment with primary structure constraints is computationally expensive, and for this reason, fewer structures of each length were used. The best fit for structures with constraints was calculated based on those of lengths 50, 75 and 100. Note that the run-times for constrained structures longer than 100 appear below the corresponding fit line
4.14 Hardness of RNA-SSD with base constraints. Correlation between the fraction of bases constrained in a particular structure (x-axis) and the median expected run-time for designing the structure with RNA-SSD (y-axis). We report the fraction of constrained bases after propagation for constraints on randomly chosen base positions. This fraction, for both randomly chosen bases and stems, corresponds to the median fraction of bases constrained in a set of 50 constraints that were generated by fixing a given percentage of bases or stems. There are two curves in each graph, one for designing structures with base constraints located in random positions and the other for constraints located in stems. (a) VS Ribozyme from Neurospora mitochondria; (b) Bio-50-n38; (c) Bio-50-n4; (d) Group II intron ribozyme D135 from Saccharomyces; (e) Bio-200-n9; (f) Bio-50-n
Biologically motivated structure Bio-50-n4 with ten stems. When constraining the bases in stems 7 and 8, this structure is hard to design. The structure motif formed by these stems, which are short and separated by a bulge, is unstable
(a) Secondary structure of hammerhead ribozyme in satellite DNA from Dolichopoda schiavazzi (cricket). (b) Trans-cleaving hammerhead ribozyme. This picture shows the secondary structure model of the hammerhead-substrate duplex where the specified bases are important for catalytic activity. Dots represent any nucleotide. In both pictures the arrow indicates the cleavage site
RNA complex with kissing loop interactions. This complex, designed by Chworos et al. [20], is a self-assembling square made of four similar RNA strands, called tectoRNAs, that can have applications in nanobiotechnology
Pseudocode for designing duplexes with the Internal algorithm
Panel (a): Duplex structure (R,b), where b = 4 in this case. Panel (b): a single structure S is formed by concatenating the strands D1 and D2 of the duplex (R,4). Panel (c): Duplex structure (R,2). Panel (d): a linker of five bases is added between the strands D1 and D2 from (R,2) to obtain a structure S allowed by the thermodynamic model
Pseudocode for designing duplexes with the Linker algorithm
5.6 Interface algorithm to design duplexes. (a) Interface I: bases in the intermolecular helix are fixed. (b) Design of structures D1 and D2 independently with RNA-SSD
Pseudocode for designing duplexes with the Interface algorithm
Artificially generated duplex. (a) Biologically motivated structure generated with RNA Structure Generator. (b) The arrow in (a) indicates the place where the structure is cut to obtain the duplex shown in (b) or the hairpin that is removed to obtain the duplex in (c)
Correlation between the expected running times in CPU seconds of the Linker (x-axis) and the Interface (y-axis) approaches to design the biological duplexes from Table
Correlation between the expected running times of the Linker and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Linker or Interface is unable to design as $10^6$ CPU seconds. Structures that are designed by none of the approaches are excluded. (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD
(a) Duplex number 0 from the data set BIOMD-I-200, see Table 5.6. (b) The Linker algorithm concatenates both molecules, S = D1 D2, and designs a sequence X = X1 X2 that folds correctly into S. (c) Secondary structure of X1 and X2 predicted by PairFold. Note that X1 and X2 do not fold correctly into the target duplex R
(a) Biological duplex BIOD-4. The corresponding sequences were designed by the Interface algorithm. The shaded bases belong to the intermolecular helix. The sequences X1 and X2 from (b) and (c) are designed independently by RNA-SSD. The shaded bases in this case correspond to nucleotides constrained in the interface. Note that X1 and X2 do not fold correctly into D1 and D2, respectively
(a) Target structure BIOD-7. (b) Duplex formed by the sequences X1 and X2 designed by Interface. There are two incorrectly paired bases that form an internal loop instead of the two consecutive bulges of Figure (a). (c) The sequences X1 and X2 fold correctly into D1 and D2
5.14 Correlation between the expected running times in CPU seconds of the Linker (x-axis) and the Internal (y-axis) approaches to design the biological duplexes from Table
Correlation between the expected running times of the Linker and the Internal algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Linker or Internal is unable to design as $10^6$ CPU seconds. Structures that are designed by none of the approaches are excluded. (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD
Correlation between the expected running times in CPU seconds of the Internal (x-axis) and the Interface (y-axis) approaches to design the biological duplexes from Table
Correlation between the expected running times of the Internal and the Interface algorithms to design artificially generated duplexes. We arbitrarily (but unambiguously) report the running time for structures that Interface or Internal is unable to design as $10^6$ CPU seconds. Structures that are designed by none of the approaches are excluded. (a) BIOMD-50; (b) BIOMD-100; (c) BIOMD-200; (d) BIOMD
Scaling analysis of the median (Q50) of expected run-time (y-axis) of artificially generated duplexes BIOMD of lengths 50, 100, 200 and 500 and artificially generated structures BIOM of lengths 50, 75, 100, 125, 150, 200 and 500 (x-axis). Duplexes are designed with Linker, Interface and Internal in Figure (a), (b) and (c), respectively. In all cases, single structures from BIOM are designed with RNA-SSD. The line corresponds to the best fit of the BIOM data (obtained in Chapter 4) for structures with lengths 50 to 150 using a polynomial of degree three
Pseudocode for designing stable structures with RNA-SSD-stability
Pseudocode for SLS-stable
6.3 Correlation between the average number of incorrect nucleotides and the probability of the target in the ensemble when sequences are designed with RNA-SSD-stability and INFO-RNA. Logarithmic values are used on both axes, with n(s*) on the x-axis and p(s*) on the y-axis. A structure is stable if n(s*) and p(s*) are small. Panels (a) and (b) correspond to structure A-1 and panels (c) and (d) to structure A-2. The stability metric n(s*) is optimized in panels (a) and (c), whereas p(s*) is optimized in panels (b) and (d)
Correlation between the stability metrics n(s*) and p(s*). Panels (a) and (b) correspond to structure B-1 and panels (c) and (d) to structure B-2. The values of n(s*) and p(s*) of the corresponding biological sequence are indicated in each panel
Correlation between the stability metrics n(s*) and p(s*). Panels (a) and (b) correspond to structure B-3 and panels (c) and (d) to structure B-4. The values of n(s*) and p(s*) of the corresponding biological sequence are indicated in each panel
Correlation between the stability metrics n(s*) and p(s*). Panels (a) and (b) correspond to structure B-5 and panels (c) and (d) to structure B-6. The values of n(s*) and p(s*) of the corresponding biological sequence are indicated in each panel
Solution quality distribution of RNA-SSD-stability and INFO-RNA with respect to the probability of shapes. Panels (a) and (b) correspond to structure A-1 and panels (c) and (d) to structure A-2. The stability metric n(s*) is optimized in panels (a) and (c), whereas p(s*) is optimized in panels (b) and (d)
Solution quality distribution of RNA-SSD-stability and INFO-RNA with respect to the probability of shapes. Panels (a) and (b) correspond to structure B-1 and panels (c) and (d) to structure B-2. The stability metric n(s*) is optimized in panels (a) and (c), whereas p(s*) is optimized in panels (b) and (d)
6.9 Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the artificial structure A-1. In panels (a) and (b) the sequences are designed by the adaptive walk and RNA-SSD-stability, respectively, using n(s*) as the stability measure. Panels (c) and (d) show the quality of the sequences designed by adaptive walk and RNA-SSD-stability, respectively, when p(s*) is optimized
Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the artificial structure A-2 (panels (a) and (b)) and the biological structure B-1 (panels (c) and (d)) using the adaptive walk and the RNA-SSD-stability algorithm, respectively. The values of n(s*) and p(s*) for the biological sequence corresponding to B-1 are also included
Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the biological structure B-2. The values of n(s*) and p(s*) for the biological sequence corresponding to B-2 are also included in each panel
Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for B-3 (panels (a) and (b)) and B-4 (panels (c) and (d)). The stability of the corresponding biological sequence is indicated by an arrow
Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for the biological structure B-5. The stability of the corresponding biological sequence is indicated by an arrow
Correlation between the stability metrics n(s*) and p(s*) of the sequences designed for B-6. The stability of the corresponding biological sequence is indicated by an arrow
Solution quality distribution for RNA-SSD-stability and adaptive walk. The x-axis shows the probability of the shape that contains the target structure for every sequence designed by RNA-SSD-stability and the adaptive walk. The y-axis gives the fraction of structures whose shape probability is at most the x-value. Panels (a) and (b) show the stability of the sequences designed for structure A-1. Panels (c) and (d) correspond to structure A-2. The stability metric n(s*) is optimized in panels (a) and (c), whereas p(s*) is optimized in panels (b) and (d)
6.16 Solution quality distribution of RNA-SSD-stability and adaptive walk with respect to the probability of shapes. Panels (a) and (b) correspond to structure B-1 and panels (c) and (d) to structure B-2. The stability metric n(s*) is optimized in panels (a) and (c), whereas p(s*) is optimized in panels (b) and (d)
(a) Motif I: internal loop formed by breaking the base pair $x_{i+1} \cdot x_{j-3}$ from motif B. (b) Motif I: internal loop formed by breaking the base pair $x_{i+6} \cdot x_{j-4}$ from motif 2I

Acknowledgements

My deepest gratitude to Anne Condon and Holger Hoos for their guidance, advice and enthusiasm as my research supervisors. They provided me with countless hours of discussions and financial support. Their keen observations and thorough revisions helped me to considerably improve my work. Certainly, without their enormous patience and constant encouragement this thesis would not have been completed. I also have to acknowledge Jenny Bryan, David Kirkpatrick, Lawrence McIntosh and Paul Higgs, who kindly agreed to serve as members of my committee and provided me with valuable feedback and recommendations on this project. Many thanks to Mirela Andronescu for her collaboration, enthusiasm and help when the technical aspects of my thesis became a burden. I would also like to thank Roman Baranowski for introducing me to the IAM cluster, and Dan Tulpan and Rachel MacKay for wonderful discussions. Thanks to Sanja Rogic for keeping me calm when I was finishing my thesis. And of course to the people in the beta lab and the IAM who made my stay in BC enjoyable. I am grateful for the financial support from CONACyT and from the Government of Canada in the first years of the program. Special thanks to my parents and siblings, always supportive and encouraging. Finally, but most importantly, my deepest gratitude to my husband Alberto for his sacrifices, endless patience, love and help in editing this document.

To my lovely Tessa

Chapter 1

Introduction

Ribonucleic acids (RNA) are macromolecules that play fundamental roles in many biological processes. RNAs such as ribosomal, transfer and messenger RNAs have well-established roles in protein biosynthesis. Other RNAs, called ribozymes, catalyze several essential biological processes such as RNA splicing, RNA processing and peptide bond formation during translation [46].

RNAs are composed of a single strand of four different nucleotides: adenine (A), cytosine (C), guanine (G) and uracil (U). Each strand has two chemically distinct ends, known as the 5′ and 3′ ends; when written as a base sequence, the first base is considered to represent the 5′ end. The unique sequence that describes a particular RNA molecule is known as the primary structure of that molecule (Figure 1.1a). The RNA single-stranded chain can fold on itself, forming hydrogen bonds between pairs of bases. The most common pairs occur between the Watson-Crick complementary bases: A bonds with U, and C bonds with G (and vice versa), but there are also G-U wobble base pairs and non-canonical pairs formed by other bases.

Folded RNA can be characterised at the secondary and tertiary structure level. An RNA secondary structure is the set of Watson-Crick and wobble base pairs that form when the molecule folds. The base pairs produce helical regions, also called stems, and single-stranded regions or loops (Figure 1.1b). The tertiary structure is the relative location of atoms in the molecule in three-dimensional space. It includes the precise geometry of the base-pairing interactions as well as the spatial arrangement of any interaction between secondary and tertiary structure elements.

The structure of most RNAs is essential for their biological function. Replication control in single-stranded RNA viruses [35] is governed by their secondary structure. Mechanisms for mRNA synthesis also rely on specific RNA structures.
For example, in bacteria, transcription termination is caused by the RNA polymerase responding to structural perturbations in the DNA template or by hairpin loops in the transcript [60]. In eukaryotes, mRNA contains introns whose splicing is catalyzed by a ribonucleoprotein complex known as the spliceosome. Conserved secondary structure motifs of introns can play an important role in intron recognition by spliceosomes. Furthermore, particular mRNA conformations can be recognized by regulatory proteins that modulate translation initiation, and mRNAs encode signals that modulate translation and regulate gene expression [22, 23]. These signals include sequences and structures that serve as targets for translational repressors [60].

Figure 1.1: Primary structure (a) and secondary structure (b) of an RNA molecule.

The structure of ribozymes also plays an important role in their catalytic activity. For example, allosteric ribozymes have an effector-binding site that is separate from their active site. When the effector binds to the ribozyme, it induces a conformational change in the ribozyme that enhances or inhibits catalytic function; effectors can be proteins, oligonucleotides, metal ions, etc. [5].

Computational approaches for prediction of RNA secondary structures are based on a thermodynamic model that associates a free energy value with each possible secondary structure for a strand at fixed conditions such as temperature and ion concentration. Thermodynamic parameters for RNA folding have been determined by different methods such as optical melting, microcalorimetry [52] and knowledge-based methods using databases of known structures [4]. The secondary structure with the lowest possible free energy value, the minimum free energy (MFE) structure, is predicted to be the most stable secondary structure for the strand and is known as the ground state. There are dynamic programming algorithms that, given an RNA strand, find the secondary structure with MFE. The MFold server of Zuker [7] and RNAfold from the Vienna Package [29] are widely used programs for predicting structures without pseudoknots (see Chapter 2 for the definition of pseudoknot-free structures) using the RNA energy parameters of the Turner group [4]. Both algorithms have a $\Theta(n^3)$ running time, where n is the length of the input RNA sequence. There are also several dynamic programming algorithms for predicting RNA secondary structures with pseudoknots, but most of these handle a restricted class of pseudoknotted structures [47]. The general problem of predicting RNA secondary structures with pseudoknots is NP-hard, but Rivas and Eddy [49] proposed an algorithm with a time complexity of $\Theta(n^6)$ that can handle a large class of structures [47].

We are interested in the inverse problem, the RNA Secondary Structure Design Problem. Thus, in this thesis, we focus on the design of RNA strands that are predicted to fold to a given secondary structure, according to a standard thermodynamic model such as that of the Turner group [4]. It is not known whether this problem has a polynomial-time algorithm. An algorithm for solving this problem takes as input a secondary structure of an RNA molecule (without specific bases assigned to the sequence positions), and outputs an RNA sequence that is predicted to fold into that structure. Current algorithms design for MFE, that is, they find a sequence whose MFE structure is the target structure. Once the sequence is known, it is possible to create the RNA molecules using established laboratory techniques.
One of the best algorithms for solving this problem is a stochastic local search algorithm known as RNA Secondary Structure Designer (RNA-SSD) of Andronescu et al. [4]. Stochastic local search (SLS) algorithms initialise the search process at a randomly chosen point of the given search space (here: the space of RNA sequences of a given length) and then proceed by iteratively moving from the present point to a neighbouring point, where the decision on each search step is based on local knowledge only, as well as randomization [32]. One of the reasons why RNA-SSD performs well is that it decomposes the input secondary structure at multiloops. Unlike other algorithms, RNA-SSD recursively splits the structure into two substructures in each decomposition step, thus obtaining a binary decomposition tree whose root is formed by the full target structure and whose leaves correspond to small substructures. Each non-leaf node of the tree represents a substructure that can be obtained by merging the two substructures corresponding to its children. The SLS algorithm is only applied to the smallest substructures, and the corresponding partial solutions are combined into candidate solutions for larger subproblems guided by the decomposition tree.

1.1 Motivation

RNA Secondary Structure Design is important both because it facilitates the characterization of biological RNAs by their function and because it supports the design of new ribozymes that can potentially be used as therapeutic agents [4, 46]. There are also applications in nanobiotechnology in the context of building self-assembling, stable structures and devices from small RNA molecules [58].

There are several aspects that play an important role in the design of RNA structures for practical applications. For example, when designing ribozymes we need to be able to design duplexes, because ribozymes catalyze highly sequence-specific reactions determined by RNA-RNA interactions between the ribozyme and its substrate molecule. Substrate recognition and binding to the substrate are essentially governed by Watson-Crick interactions. Figure 1.2 shows the sequence requirements for the hairpin ribozyme-substrate complex. In particular, primary structure constraints play a fundamental role in the design of the complex. Most of the nucleotides important for catalytic activity are contained within the unpaired regions. There are almost no sequence restrictions for bases involved in the formation of the intermolecular helices as long as base pairing is achieved. Substrate regions that interact with the ribozyme do not show high sequence requirements. However, a stable association between the two molecules is required, especially close to the cleavage site. The catalytic performance of the hairpin ribozyme can be improved by stabilisation of helix 4 (H4) [46].
Consequently, it is also important to design for stability. It is also possible to design DNA sequences for the construction of DNA-based geometrical objects and nanomechanical devices. The key to using DNA for this purpose is the design of stable branched molecules that can interact with other nucleic acid molecules [57].

Figure 1.2: Hairpin ribozyme-substrate complex. The arrow indicates the cleavage site in the substrate. Sequence requirements are specified, where N represents any nucleotide; Y represents C or U; R represents A or G; B represents C, G or U; V represents A, C or G. Illustration adapted from E. Puerta-Fernández et al. [46].

An important aspect in the study of the RNA Secondary Structure Design problem is to understand better the factors that render RNA structures hard to design. Such understanding provides the basis for improving the performance of computational approaches for solving this problem and for characterising its limitations. To our knowledge, it has not been determined whether there is a polynomial-time algorithm for RNA secondary structure design. Schuster et al. [53] performed experiments with the RNAinverse algorithm by Hofacker et al. [3] on a few small random sequences and a simple tRNA to support the hypothesis that there is no need to search huge portions of the sequence space to find a particular structure by mutation and selection. Based on these experiments, they argue that sequences sharing the same structure are distributed randomly over sequence space and that common structures, that is, structures that have many sequences that fold into them, can be accessed from an arbitrary sequence compatible with the target structure by a number of mutations much smaller than the sequence length. These results are based on small sequences and therefore do not give insight into the computational complexity of the design problem. On the other hand, Andronescu et al. [4] found evidence that some ribosomal RNA structures are difficult to design and that the correlation between size and hardness is not very strong.

The goals of our work in this thesis are:

1. Understand the empirical complexity of the RNA Secondary Structure Design problem, that is, the scaling of the typical difficulty of the design task for various classes of RNA structures as the size of the target structure is increased. Closely related is the identification of factors that make RNA structures hard to design.
2. Develop methods for RNA secondary structure design with primary structure constraints; in this case, the input comprises fixed bases or base types in certain positions of the sequence to be designed.
3. Develop methods for designing duplexes of RNA molecules with specific secondary structure.
4. Develop methods for designing stable RNA molecules, since there are several RNA sequences that can fold into a given structure. Finding a sequence whose MFE structure has low free energy is one approach to design for stability. However, in Chapter 2 we will describe other stability criteria, discussed by Dirks et al. [25], that can be used to design more stable structures than the MFE approach.

1.2 Research goals and contributions

In this section we explain our contributions to the rational design of RNA secondary structures. This includes building efficient algorithms to design duplexes and stable structures. To the best of our knowledge, there are no algorithms for designing duplexes in the literature, and the only algorithm that explicitly designs stable structures is the adaptive walk of Dirks et al., which we will describe in Chapter 3. However, this algorithm performs well only on short structures.

To achieve our first goal, namely to understand the empirical complexity of the RNA Secondary Structure Design Problem, we present an empirical analysis of the performance of two algorithms for the problem. One is the RNA-SSD algorithm of Andronescu et al. [4] and the other is the RNAinverse algorithm of Hofacker et al. [3] from the Vienna RNA Package [29].
RNA-SSD is one of the best algorithms available to design RNA molecules in terms of the time spent designing a given structure and the number of structures that it is able to design. We used an improved version of RNA-SSD that supports primary structure constraints; an online version is available. For the analysis we consider randomly generated structures, obtained by folding a randomly generated sequence with the RNAfold function from the Vienna Package, and also structures that are generated to have random features of biological structures, which we refer to as biologically generated structures. The use of random structures allows us to test the algorithms on types of structures that rarely occur in nature but that can have applications in nanotechnology. Furthermore, the performance of RNA-SSD on biologically generated structures is relevant for biological applications such as molecular therapeutics.

The scaling analysis on random and biologically motivated structures supports the hypothesis that the running time of both algorithms scales polynomially with the size of the structure. We also found that the algorithms are in general faster when constraints are placed only on paired bases in the structure. When comparing both algorithms, the RNA-SSD algorithm performs better than RNAinverse, since it requires less time to design a given structure and is able to design more structures. Furthermore, we prove that, according to the standard thermodynamic model, for some structures that the RNA-SSD algorithm was unable to design, there exists no sequence whose minimum free energy structure is the target structure.

Our next goal is to extend the RNA-SSD algorithm to be able to design better structures that work in practice. Here we consider the design of complexes of two RNA molecules that can work as ribozymes or for applications in nanostructure design. We describe three different approaches for RNA duplex design. To the best of our knowledge, these are the first algorithms for designing duplexes reported in the literature. Each of our algorithms takes as its input a complex of molecules that adopt a particular secondary structure.
Depending on the problem of interest, there may be input base constraints located in some positions of the complex. The outputs are two sequences that are predicted to form the desired complex according to the PairFold function of Andronescu et al. [5]. The three approaches are the following:

- The Linker approach concatenates two molecules to generate a single structure that is designed with RNA-SSD. The designed sequence is then separated into the corresponding strands.
- The Interface approach assigns bases in the positions of intermolecular helices of the complex. Then both structures are designed independently by RNA-SSD, where the bases located in the regions where both structures interact remain fixed.
- The Internal approach modifies the RNA-SSD algorithm to design duplexes inside the algorithm, where the candidate solutions are evaluated with PairFold.

A scaling analysis of these algorithms shows that there is no evidence that the running time of Linker and Internal scales exponentially with the length of the duplex. When comparing these approaches, we found that Interface has the lowest running time and designs fewer duplexes in our data set than Linker and Internal. Frequently, Interface has the problem that components of the designed sequences form undesirable interactions even though each sequence folds correctly into the structure of the corresponding molecule. We also found that Linker has the best running time but Internal designs more complexes than the other algorithms. Internal uses the PairFold function to evaluate the designs inside the algorithm. The theoretical running time of PairFold is $\Theta(n^3)$, where $n$ is the sum of the two sequence lengths. However, in practice the running time is longer than that of RNAfold due to additional checking for the location of the intermolecular linkage in the calculation of the free energy of the duplex [3]. That is, in practice PairFold is a constant factor more expensive than RNAfold, which evaluates designs for single sequences. Therefore, there is a trade-off in using PairFold internally: it is more expensive, even though it allows more duplexes to be designed.

Finally, we extend the RNA-SSD algorithm to design stable structures since, to the best of our knowledge, there are no algorithms reported in the literature that design stable structures explicitly. We design for stability inside RNA-SSD at the lowest level of the algorithm, where the smallest substructures of the decomposition tree are designed. An SLS algorithm is used to find a stable substructure with respect to some stability criterion, such as the probability of observing the MFE structure or the average number of incorrectly paired nucleotides in the ensemble of structures of the designed sequence.
The performance of the extended RNA-SSD, called RNA-SSD-stability, is evaluated by comparing the stability of the sequences designed with this algorithm and with other approaches such as the adaptive walk procedure of Dirks et al. [25] and the INFO-RNA algorithm of Busch et al. [6]. RNA-SSD typically does not achieve the performance of INFO-RNA when designing MFE structures. However, RNA-SSD-stability performs better than INFO-RNA, which does not use any stability measure to design structures. When RNA-SSD-stability is compared with the adaptive walk, we find that our algorithm designs more stable sequences in less time, especially for structures that are difficult to design. However, for very long run-times, the adaptive walk performs better than RNA-SSD. An insight gained from our comparison is that RNA-SSD-stability designs stable structures by merging stable substructures. Therefore, we believe that further improvements can be obtained if stable sequences are designed at every level of the decomposition tree, possibly by keeping several candidate subsequences to concatenate and selecting the most stable.

1.3 Thesis outline

The remainder of this thesis is structured as follows. In Chapter 2, we give an overview of existing related work from the literature. Computational approaches to the design of RNA secondary structures are discussed in Chapter 3, including heuristic algorithms available to solve this problem such as RNAinverse, INFO-RNA and RNA-SSD. In Chapter 4 we present an extension of RNA-SSD that supports primary structure constraints and study the empirical complexity of the RNA Secondary Structure Design Problem. Chapter 5 deals with the design of RNA duplexes, and Chapter 6 with the design of stable RNA molecules. The conclusions of this thesis and the future directions in which this work can be extended are discussed in Chapter 7. A proof that some structural motifs are impossible to design with the current thermodynamic model of the Turner group [4] is provided in the Appendix.

Chapter 2. RNA Secondary Structure Design Problem

In this chapter we give some background to define the RNA Secondary Structure Design Problem.

2.1 Secondary structure of an RNA molecule

RNA secondary structure is characterized as the set of base pairs inducing a structure like the one in Figure 1.1 (b). Consider a sequence $X = x_1, x_2, x_3, \ldots, x_n$, where $x_i \in \{A, C, G, U\}$ for $i = 1, \ldots, n$. For $1 \le i < j \le n$, let $i \cdot j$ denote the pairing of base $x_i$ with $x_j$. A secondary structure $S$ on a sequence $X$ is a set of base pairs such that any two base pairs in $S$, $i \cdot j$ and $i' \cdot j'$, are either identical, or else $i \ne i'$ and $j \ne j'$. This means that a base is paired with at most one other base.

Figure 2.1 shows a classification of the loops and the base pairs in a secondary structure. We can classify loops formed by the bonding of base pairs in a structure according to the number of base pairs that they contain. A hairpin loop contains exactly one base pair (Figure 2.1a). An internal loop contains exactly two base pairs (Figure 2.1b). A bulge is an internal loop in which one base from each of its two base pairs is adjacent to the other (Figure 2.1c). Furthermore, a stacked pair is defined as two consecutive base pairs $i \cdot j$ and $(i+1) \cdot (j-1)$ (Figure 2.1e). A stem is formed by one or several stacked pairs. A multibranched loop or multiloop is a loop that contains more than two base pairs (Figure 2.1d). An exterior or closing pair is the base pair in a loop closest to the ends of the RNA strand. More precisely, the exterior pair is the one that maximizes $j - i$ over all pairs $i \cdot j$ in the loop. All other pairs are interior. Dangling bases are free bases located in the immediate vicinity of a stem.

An RNA secondary structure includes a pseudoknot if there exist two base pairs $i \cdot j$ and $i' \cdot j'$ in the structure with $i < i' < j < j'$ (Figure 2.1f); otherwise it is pseudoknot free. The bases $(i, j, i_1, j_1, \ldots, i_d, j_d)$ with $d \ge 0$ and $i < j < i_1 < j_1 < \cdots < i_d < j_d$ define an external loop if $i$ pairs with $j$, $i_1$ pairs with $j_1$, \ldots, $i_d$ pairs with $j_d$, and $k$ is a free base for every $k$ with $1 \le k < i$, $j < k < i_1$, \ldots, $j_d < k \le n$, where $n$ is the length of the sequence (Figure 2.1g). A domain is a substructure which is closed by a base pair of an external loop, that is, a domain closed by $i \cdot j$ is the set of all base pairs whose indices are in the interval $[i, j]$. The external loop shown in Figure 2.1 (g) contains two domains and seven free bases. The free bases are called external bases because they are not inside any domain, but they are between domains or between a domain and one of the ends of the strand.

2.2 Free energy of a single RNA strand

Associated with a secondary structure of a strand is its free energy. The free energy of a secondary structure measures (in kcal/mol) the specificity of the sequence for a secondary structure at fixed temperature. The bases that are bonded tend to stabilize the RNA, whereas unpaired bases form destabilizing loops. The sequence is most likely to fold into the structure with lowest free energy. For pseudoknot-free secondary structures, the free energy is calculated with the nearest neighbour thermodynamic model described by Zuker et al. [74] and Mathews et al. [4]. This model calculates the free energy of a structure as the sum of the free energies of each stacked pair and each loop using experimentally obtained thermodynamic data. An example is given in Figure 2.2.

Let $\Delta G(X, S)$ denote the free energy of an RNA sequence $X$ when folded into a secondary structure $S$. Furthermore, let $\Phi$ denote a function that assigns to each RNA sequence $X$ a secondary structure $S$ that minimizes the free energy $\Delta G(X, S)$ over all possible secondary structures $S$ of $X$. The RNA Secondary Structure Prediction Problem can be formulated as follows. Given an RNA sequence $X$, determine $\Phi(X)$.
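The structural definitions of Section 2.1 can be made concrete with a short sketch. The code below is an illustration only, not part of the thesis software: the function names are hypothetical, positions are 1-based as in the text, and the dot-bracket input convention (matching parentheses for paired bases, dots for free bases) is an assumed encoding. It checks that every base pairs with at most one other base and tests the pseudoknot condition $i < i' < j < j'$.

```python
def parse_dot_bracket(structure):
    """Return the set of base pairs (i, j), 1-based, encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.add((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '('")
    return pairs


def is_secondary_structure(pairs, n):
    """True if every pair satisfies 1 <= i < j <= n and no base is in two pairs."""
    used = set()
    for i, j in pairs:
        if not (1 <= i < j <= n) or i in used or j in used:
            return False
        used.update((i, j))
    return True


def has_pseudoknot(pairs):
    """True if two pairs i.j and i'.j' satisfy i < i' < j < j'."""
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        for i2, j2 in ordered[index + 1:]:
            if i < i2 < j < j2:
                return True
    return False


hairpin = parse_dot_bracket("((...))")
print(sorted(hairpin), is_secondary_structure(hairpin, 7), has_pseudoknot(hairpin))
```

A plain dot-bracket string can only encode nested pairs, so crossing pairs have to be supplied as an explicit set, for example `has_pseudoknot({(1, 5), (3, 8)})`.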
Zuker and Stiegler developed a dynamic programming algorithm for finding the minimum free energy (MFE) secondary structure $S$ without pseudoknots of an RNA molecule $X$. They use the nearest neighbour thermodynamic model to evaluate the energy $\Delta G(X, S)$ of a sequence $X$ folded into the secondary structure $S$. The running time of Zuker and Stiegler's algorithm is $\Theta(n^4)$ [75], and it has been reduced to $\Theta(n^3)$ by Lyngsø et al. [39]. The program mfold [72] was the first implementation of Zuker and Stiegler's algorithm and is available online [73]. The Vienna RNA Package [3] also implements Zuker and Stiegler's algorithm. It is available online and is free open source software [30].
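To give a feel for the cubic-time dynamic programming recursion, the following sketch uses a deliberately simplified stand-in: Nussinov-style base-pair maximization rather than the nearest neighbour energy model of Zuker and Stiegler. The function names and the minimum hairpin loop size of three are assumptions made for this illustration; the code only counts Watson-Crick and G-U pairs and does not compute free energies.

```python
# Pairs allowed in this toy model: Watson-Crick pairs plus the G-U wobble pair.
ALLOWED = {frozenset("AU"), frozenset("CG"), frozenset("GU")}


def can_pair(a, b):
    """True if bases a and b can form a Watson-Crick or wobble pair."""
    return frozenset((a, b)) in ALLOWED


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop free bases per hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]          # best[i][j]: optimum for seq[i..j]
    for span in range(min_loop + 1, n):         # shorter intervals cannot hold a pair
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: base j is unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with base k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

The three nested loops over `span`, `i` and `k` are what give the $\Theta(n^3)$ behaviour; an energy-based algorithm keeps the same interval structure but replaces the pair count with loop energies.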

Figure 2.1: Secondary structure motifs. (a) Hairpin. (b) Interior loop. (c) Bulge. (d) Multiloop. (e) Helix or stem. (f) Pseudoknot. (g) External loop.

Figure 2.2: Free energy calculation of a secondary structure. Stems are denoted by S, hairpins by H, multiloops by M, internal loops by I, bulges by B and dangling ends by D. The free energy of this structure is $\Delta G(S, X) = \Delta G(S1) + \cdots + \Delta G(S5) + \Delta G(H1) + \Delta G(H2) + \Delta G(M) + \Delta G(I) + \Delta G(B) + \Delta G(D) = 23.3$ kcal/mol.

An abstract secondary structure $S$ of length $n$, $(n, S)$, is defined as a set of integer pairs $(i, j)$, $1 \le i < j \le n$, such that each $i$ is contained in at most one pair and no two base pairs $(i, j)$ and $(i', j')$ of $S$ cross, that is, it is not the case that $i < i' < j < j'$ or that $i' < i < j' < j$. The RNA Secondary Structure Design Problem can be stated as follows. Given an abstract RNA secondary structure $(n, S^*)$, find a sequence $X^*$ such that $\Phi(X^*) = S^*$.

2.3 Free energy of a duplex

In some applications, it is desirable to predict the secondary structure of two or more interacting RNAs. Such predictions aid in understanding mechanisms for ribozyme function [69] and in designing novel ribozymes [7] or nanostructures [8]. The free energy calculation for a pair of RNA molecules is very similar to the free energy calculation for one molecule. The free energy of a duplex of RNAs is calculated with the nearest neighbour thermodynamic model as the sum of the energies of its loops. It is possible to distinguish regular and special loops. Let $Y = y_1 y_2 \cdots y_n$ be the sequence obtained by concatenating two RNA sequences $X_1$ and $X_2$, and let $b$ denote the number of nucleotides in $X_1$, that is, $b$ is the linkage location between $X_1$ and $X_2$. A loop is special if the linkage location $b$ lies within it; otherwise it is regular. The free energy of the special structures is calculated in the same way as for regular structures, except that an inter-molecular initiation penalty $I = 4.1$ kcal/mol [5] is added if the special structure is a hairpin, an internal loop, a multiloop or a stacked pair.

Andronescu et al. developed an efficient algorithm, PairFold [5], that predicts the MFE secondary structure that can be formed by two interacting nucleic acid molecules. This algorithm takes as input a pair of RNA strands $X_1$ and $X_2$, and extends the dynamic programming algorithm by Zuker and Stiegler [75] for single molecules.
The worst-case time and space complexity of PairFold when calculating the MFE structure are $\Theta(n^3)$ for time and $\Theta(n^2)$ for space, where $n$ is the sum of the two input sequence lengths. The structure of a duplex is specified as $(n, R, b)$, where $R$ is the secondary structure of length $n$ of the interacting strands $X_1$ of length $b$ and $X_2$ of length $n - b$. We refer to $b$ as the linkage location. An abstract duplex $R$ of length $n$, $(n, R, b)$, is defined as a set of integer pairs $(i, j)$, $1 \le i < j \le n$, such that each $i$ is contained in at most one pair and no two base pairs $(i, j)$ and $(i', j')$ of $R$ cross, that is, it is not the case that $i < i' < j < j'$ or that $i' < i < j' < j$. The RNA Secondary Structure Design Problem for Duplexes can be formulated as follows. Given an abstract duplex secondary structure $(n, R, b)$, find a pair of sequences $X_1$ of length $b$ and $X_2$ of length $n - b$ such that $\Phi(X_1, X_2) = (n, R, b)$.

2.4 Partition function

Although free energy models for secondary structure loops have been refined over time to achieve a better characterization of folding thermodynamics, the energy parameters are still inaccurate [4]. In the previous section we described how the free energy of a structure is computed by adding the free energy of each loop and stacked pair. A slight deviation in the free energy parameters can lead to substantial differences in the computed MFE structure. Hence, the MFE structure derived from a folding algorithm may not be the true structure, i.e. the structure into which the molecule folds. Another consideration is that RNA molecules are not in the (true) MFE structure all the time, because base pairs are constantly forming and breaking. For example, co-transcriptional folding leads to the formation of temporary secondary structure elements [36, 48] that have biological functions, e.g. as initial sites for protein anchoring during pre-mRNA transcription [50]. Furthermore, RNA conformational switching is fundamental in translational regulation, protein synthesis, and mRNA splicing [38]. However, the MFE structure is the most likely structure. These observations suggest that it is beneficial to characterise the ensemble of all secondary structures.

The probability $P(S)$ of sampling a secondary structure $S$ in the ensemble of structures of a given RNA molecule $X$ with free energy $\Delta G(X, S)$ is proportional to $e^{-\Delta G(X,S)/RT}$, where $R$ is the gas constant and $T$ is the absolute temperature. The partition function, from statistical mechanics, is a normalising constant that allows estimation of the probabilities from the free energy values and is given by

\[ Z(X) = \sum_{S \in \mathcal{S}(X)} e^{-\Delta G(X,S)/RT} \tag{2.1} \]

where $\mathcal{S}(X)$ is the set of all possible secondary structures of a strand $X$. Therefore, the probability of observing a structure $S$ in the ensemble of a given sequence $X$ is

\[ P(X, S) = \frac{e^{-\Delta G(X,S)/RT}}{Z(X)}. \tag{2.2} \]

To simplify notation we will use $P(S)$ from now on.
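As a small numerical illustration of Equations 2.1 and 2.2, the sketch below computes the partition function and the Boltzmann probabilities of an explicitly listed toy ensemble. The three structures and their free energies are made-up values chosen for the example, and the function name is hypothetical; for real sequences the ensemble is far too large to enumerate and a dynamic programming algorithm is used instead.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol K)


def boltzmann_probabilities(energies, temperature_kelvin=310.15):
    """Return (Z, probabilities) for a dict mapping structure -> free energy in kcal/mol."""
    rt = GAS_CONSTANT * temperature_kelvin
    weights = {s: math.exp(-dg / rt) for s, dg in energies.items()}
    z = sum(weights.values())                       # Equation 2.1
    return z, {s: w / z for s, w in weights.items()}  # Equation 2.2


# Hypothetical ensemble of a 7-nucleotide strand (energies are illustrative only).
toy_energies = {"((...))": -3.0, "(.....)": -1.0, ".......": 0.0}
z, probabilities = boltzmann_probabilities(toy_energies)
for structure, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(structure, round(p, 4))
```

The structure with the lowest free energy receives the largest weight, and the probabilities sum to one by construction.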

Note that the partition function (see Equation 2.1) is a weighted sum over all admissible secondary structures of a given RNA sequence. An admissible secondary structure has a set of base pairs that can be formed from the RNA sequence. The lower the free energy of a structure, the higher its weighting, that is, its contribution to the sum. The computation time for calculating the partition function by explicitly summing all terms grows exponentially with the length $n$ of the sequence. However, McCaskill [42] derived a dynamic programming algorithm to calculate the partition function of pseudoknot-free structures in time $\Theta(n^3)$.

McCaskill also introduced a matrix of equilibrium base pair probabilities between all nucleotides in the RNA sequence. The values displayed in the upper triangle of the matrix (sometimes graphically shown as differently sized dots or boxes) represent the sum of the probabilities associated with all the structures in which the chosen base pair occurs. The lower triangle of the matrix is often used to illustrate the MFE structure. This matrix summarizes the features of the global ensemble of structures at equilibrium. Figure 2.3 shows the matrix for the sequence of Figure 2.2.

The pairing probability for every possible pair can be calculated efficiently using a dynamic programming algorithm that is described by McCaskill [42]. Let $p_{ij}$ denote the probability of forming a base pair $i \cdot j$ for $1 \le i, j \le n$, where $n$ is the sequence length, and let $p_{i,n+1}$ be the probability that $i$ is unpaired for $1 \le i \le n$. Let

\[ \delta_{ij}^{\alpha} = \begin{cases} 1 & \text{if } i \cdot j \text{ is a base pair of the structure } S_\alpha \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } 1 \le i, j \le n, \tag{2.3} \]

\[ \delta_{i,n+1}^{\alpha} = \begin{cases} 1 & \text{if } i \text{ is an unpaired base of the structure } S_\alpha \\ 0 & \text{otherwise} \end{cases} \qquad \text{for } 1 \le i \le n, \tag{2.4} \]

then

\[ p_{ij} = \sum_{S_\alpha \in \mathcal{S}(X)} P(S_\alpha)\, \delta_{ij}^{\alpha} \qquad \text{for } 1 \le i \le n,\ 1 \le j \le n+1, \]

where $P(S_\alpha)$ is the probability of observing $S_\alpha$ in the ensemble of structures $\mathcal{S}(X)$.
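The definition of $p_{ij}$ can be checked on a tiny ensemble by direct summation. The sketch below is illustrative only: the ensemble and its probabilities are assumed values, the function name is hypothetical, and the brute-force sum stands in for McCaskill's dynamic programming algorithm, which is what would be used in practice. Column $n+1$ of the matrix holds the probability that a base is unpaired.

```python
def pair_probability_matrix(ensemble, n):
    """Sum P(S) over structures containing each pair; p[i][n + 1] is P(i unpaired).

    ensemble: list of (set of 1-based pairs (i, j), probability).
    Returns a matrix indexed p[i][j] with 1 <= i <= n and 1 <= j <= n + 1.
    """
    p = [[0.0] * (n + 2) for _ in range(n + 1)]
    for pairs, probability in ensemble:
        partner = {}
        for i, j in pairs:
            partner[i], partner[j] = j, i
        for i in range(1, n + 1):
            p[i][partner.get(i, n + 1)] += probability
    return p


# Hypothetical ensemble of a 7-nucleotide strand with assumed probabilities.
toy_ensemble = [
    ({(1, 7), (2, 6)}, 0.7),   # ((...))
    ({(1, 7)}, 0.2),           # (.....)
    (set(), 0.1),              # .......
]
p = pair_probability_matrix(toy_ensemble, 7)
print(round(p[1][7], 3), round(p[2][6], 3), round(p[2][8], 3))
```

Each row of the matrix sums to one, because in every structure a base is either paired with exactly one partner or unpaired.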

Figure 2.3: Matrix of equilibrium base pair binding probabilities between all nucleotides in the sequence of Figure 2.2. The lower triangle represents the optimal structure. Illustration obtained with the program MFold from Zuker et al. [7].

2.5 Stability

Given a secondary structure $S^*$, there are typically several sequences that have MFE structure $S^*$. If we want to design stable structures, then one approach is to find the sequence whose MFE structure $S^*$ has the lowest free energy. Dirks et al. [25] described two paradigms for designing a structure. A positive design optimizes sequence affinity for the target structure. A negative design optimizes sequence specificity to the target structure. Sequences with high affinity have admissible structures similar to the target structure. Sequences with high specificity have the target structure as their MFE structure. When designing a structure, it is desirable to achieve both high affinity and high specificity.

Figure 2.4 shows the structure space for three sequences. The target structure is indicated in the picture. Sequence C has a higher affinity to the target structure than sequences A and B, since it has the lowest free energy when folding into the desired structure. Moreover, sequences A and B have a higher specificity to the target structure, since $S^*$ is the MFE structure of A and B. Overall we will prefer sequence B, since it has high affinity and high specificity and therefore its MFE structure is the most thermodynamically stable.

Dirks et al. [25] define several criteria to evaluate the specificity and the affinity of a structure. Energy minimization only evaluates a sequence in terms of its affinity for a target structure. The lower the energy of the sequence when folding into the target structure, the higher the affinity to the target structure. But this condition is not sufficient to have high specificity to the target structure (see sequence C in Figure 2.4). Furthermore, the MFE criterion only evaluates the sequence in terms of its specificity to the target structure but does not ensure high affinity (see sequence A in Figure 2.4). With the partition function it is possible to calculate the probability $P(S^*)$ for a sequence to fold into $S^*$ (see Equation 2.2). If $P(S^*)$ is close to one, then the sequence achieves high affinity and specificity. However, requiring that $P(S^*)$ be close to one is a very strict design evaluation criterion, since it requires that all nucleotides match the target exactly. An alternative criterion is the weighted average number of incorrect positions $n(X, S^*)$, with respect to $S^*$, over the equilibrium ensemble of secondary structures of a given sequence $X$ [25]. To simplify notation we will use $n(S^*)$ from now on. A sequence has high affinity and specificity to the target structure if $n(S^*)$ is close to zero. The weighted average number of incorrect bases can be computed from the matrix of base pair probabilities of the partition function.
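A minimal sketch of this computation follows, under stated assumptions: the pair probabilities are supplied as a dictionary keyed by $(i, j)$ with $j = n+1$ meaning "unpaired", the values are a made-up toy example, and the function name is hypothetical. For each position the code adds the probability that the position is in its target state (paired with its target partner, or unpaired), which gives the expected number of correct nucleotides, and subtracts that total from $n$.

```python
def expected_incorrect_positions(pair_prob, target_pairs, n):
    """Weighted average number of positions that differ from the target structure.

    pair_prob: dict mapping (i, j) -> probability, 1-based, with j == n + 1
               meaning that base i is unpaired.
    target_pairs: set of 1-based base pairs (i, j) of the target structure.
    """
    partner = {}
    for i, j in target_pairs:
        partner[i], partner[j] = j, i
    expected_correct = 0.0
    for i in range(1, n + 1):
        target_state = partner.get(i, n + 1)     # target partner, or n + 1 if unpaired
        expected_correct += pair_prob.get((i, target_state), 0.0)
    return n - expected_correct


# Toy probabilities for a 7-nucleotide strand (assumed values for illustration).
toy_probabilities = {
    (1, 7): 0.9, (7, 1): 0.9, (2, 6): 0.7, (6, 2): 0.7,
    (1, 8): 0.1, (7, 8): 0.1, (2, 8): 0.3, (6, 8): 0.3,
    (3, 8): 1.0, (4, 8): 1.0, (5, 8): 1.0,
}
target = {(1, 7), (2, 6)}
print(round(expected_incorrect_positions(toy_probabilities, target, 7), 3))
```

A value near zero indicates that almost every nucleotide is in its target state across the ensemble, which is a less strict requirement than asking for $P(S^*)$ close to one.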
If $S^*$ is the MFE structure of $X$ and $\delta^*$ is defined as in Equations 2.3 and 2.4, then the weighted average number of incorrect nucleotides is:

\begin{align*}
n(S^*) &= n - \underbrace{\sum_{1 \le i \le n} \; \sum_{1 \le j \le n+1} p_{ij}\, \delta^*_{ij}}_{\text{expected number of correct nucleotides}} \\
&= n - \sum_{1 \le i \le n} \Big( \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} + p_{i,n+1}\, \delta^*_{i,n+1} \Big) \\
&= n - \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} - \sum_{1 \le i \le n} p_{i,n+1}\, \delta^*_{i,n+1} \\
&= n - \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} - \sum_{1 \le i \le n} \Big( 1 - \sum_{1 \le j \le n} p_{ij} \Big) \delta^*_{i,n+1}
\end{align*}

since $\sum_{1 \le j \le n} p_{ij} + p_{i,n+1} = 1$ for every $1 \le i \le n$. Therefore,

\[ n(S^*) = n - \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{ij} - \sum_{1 \le i \le n} \delta^*_{i,n+1} + \sum_{1 \le i \le n} \sum_{1 \le j \le n} p_{ij}\, \delta^*_{i,n+1}. \tag{2.5} \]

Figure 2.4: Positive and negative design. The x-axis represents the structure space $\mathcal{S}(X)$ of a sequence $X$ and the y-axis the free energy of each structure. Illustration adapted from Dirks et al. [25].

The indicator that we use to identify stable structures depends on the context. For example, in long structures it is difficult to have a probability $P(S^*)$ close to one because the number of structures in the ensemble is very large. In this case we can use the first energy gap [67] to identify the most stable structure. The first energy gap is a negative design criterion that is defined as the difference between the energy of the MFE structure and the second best structure. In some situations it is better to design a sequence with a large first energy gap. Suppose that we have two sequences $X_1$ and $X_2$ whose MFE structure is the target structure $S^*$. Assume that $X_1$ has an ensemble with several structures that are very different from each other, all of them with probabilities close to $P(S^*)$. Then we would prefer sequence $X_2$, even if $P(S^*)$ is smaller in its ensemble, provided that it has a larger first energy gap. Note that if we choose $X_1$, then there is a significant chance of getting kinetically trapped in a very stable structure different from the target structure. Moreover, if the energetically favourable structures in the ensemble of $X_2$ are similar to each other, then it is better to choose this sequence because it has a higher probability of folding into the desired target than sequence $X_1$.

The previous example shows that it is useful to determine whether there is some family of structures in the ensemble that is similar, distinct from the rest, and dominates the probabilities of all other families. Voß et al. [64] introduced an algorithm, called RNAshapes, to compute the accumulated probabilities of all structures that share the same shape. An RNA shape is a representation of an RNA secondary structure that abstracts from loop and stem lengths. Consider the following two secondary structures from the folding space of a sequence:

..(((.((..(((...))).(((...)))))))).
...(((...(((...))).(((...)))..)))..

There are several levels of abstraction. In the least abstract shape, the unpaired regions are represented by an underscore and the stacking regions by a pair of square brackets. The following are the shapes of the previous structures.

_[_[_[]_[]]]_
_[_[]_[]_]_

The most abstract level excludes unpaired bases and combines nested helices.
In this case, both structures have the same shape: [ [ ] [ ] ] The accumulated probabilities of all the structures that share the same shape can be used as another stability measure. If the MFE structure belongs to a shape of probability close to one, then the corresponding sequence achieves high affinity and specificity. 20
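The stability measures discussed in this section can be illustrated on a toy ensemble. The following Python sketch is only an illustration under stated assumptions, not the method used in this work: the helper names (`pair_table`, `shape5`, `boltzmann`, `expected_mispaired`) are hypothetical, the free energies are invented, and the three-structure ensemble stands in for the full ensemble that a partition function algorithm would provide. It takes the two example structures above (split into two strings of equal length, which is itself an assumption) plus the open chain, and computes the first energy gap, the expected number of incorrectly paired nucleotides $n(S^*)$, and the accumulated probability of each most abstract shape.

```python
import math
from collections import defaultdict

RT = 0.0019872 * 310.15  # kcal/mol: gas constant times 37 C in kelvin


def pair_table(structure):
    """partner[i] for each position of a balanced dot-bracket string (-1 if unpaired)."""
    partner, stack = [-1] * len(structure), []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def shape5(structure):
    """Most abstract shape: drop unpaired bases and merge nested helices."""
    partner = pair_table(structure)

    def outer_pairs(lo, hi):
        # Base pairs in [lo, hi) that are not enclosed by another pair in that range.
        pairs, i = [], lo
        while i < hi:
            if partner[i] > i:
                pairs.append((i, partner[i]))
                i = partner[i] + 1
            else:
                i += 1
        return pairs

    def render(i, j):
        inner = outer_pairs(i + 1, j)
        if len(inner) == 1:  # stacked pair, bulge or interior loop: same helix
            return render(*inner[0])
        return "[" + "".join(render(a, b) for a, b in inner) + "]"

    return "".join(render(i, j) for i, j in outer_pairs(0, len(structure))) or "_"


def boltzmann(energies):
    """Boltzmann probabilities for free energies given in kcal/mol."""
    weights = [math.exp(-e / RT) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def expected_mispaired(target, structures, probs):
    """n(S*) = n - sum_i p[i, partner of i in S*]; column n stands for 'unpaired'."""
    n = len(target)
    p = defaultdict(float)
    for s, pr in zip(structures, probs):
        for i, j in enumerate(pair_table(s)):
            p[(i, j if j >= 0 else n)] += pr
    return n - sum(p[(i, j if j >= 0 else n)] for i, j in enumerate(pair_table(target)))


# Toy ensemble: the two structures from the text plus the open chain.
structures = [
    "..(((.((..(((...))).(((...)))))))).",  # taken as the target S*
    "...(((...(((...))).(((...)))..)))..",
]
structures.append("." * len(structures[0]))
energies = [-10.0, -9.2, 0.0]  # hypothetical free energies in kcal/mol

probs = boltzmann(energies)
ranked = sorted(energies)
gap = ranked[1] - ranked[0]  # first energy gap
n_target = expected_mispaired(structures[0], structures, probs)

shape_prob = defaultdict(float)
for s, pr in zip(structures, probs):
    shape_prob[shape5(s)] += pr

print("P(S*) =", round(probs[0], 3))
print("first energy gap =", round(gap, 2), "kcal/mol")
print("n(S*) =", round(n_target, 2))
print("shape probabilities:", {k: round(v, 3) for k, v in shape_prob.items()})
```

With these invented energies the target is the MFE structure but its probability is well below one, whereas the shape that contains it collects almost all of the probability mass, which is the situation described in the text. The recursion in `shape5` is adequate for short examples; very long structures would need an iterative traversal.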

Chapter 3

Previous Work

The most important work related to structure design is presented in this chapter. We give an overview of the design of proteins and nucleic acids. We also describe biochemical and computational methods, including heuristic algorithms for RNA structure design that have been empirically shown to achieve good performance.

3.1 Protein design

Proteins have a wide range of natural functions and hence represent a fertile medium for the design of new medical and industrial products. The ultimate goal of protein design is the creation of novel proteins that perform specified tasks. A necessary requirement for meeting this goal is the ability to identify sequences that fold with sufficient stability into a target structure. Computational procedures for protein design consist of starting from a given three-dimensional protein structure, usually a known structure from the Protein Data Bank (PDB) [9, 10], and searching for the amino acid sequence or sequences that are compatible with this structure. Pierce and Winfree show that the protein design problem is NP-hard [44]. However, this result reflects worst-case behavior; in practice, an exponential-time algorithm may still perform well, and an approximate stochastic method may prove capable of finding excellent solutions to NP-hard problems. Stochastic methods based on Monte Carlo, simulated annealing or genetic algorithms have performed with some success on small protein design problems [24].

Proteins can be designed computationally by using positive strategies that maximize the stability of the desired structure or by negative strategies that seek to destabilize competing states. Design efforts have focused mainly on positive strategies that maximize favorable interactions in the target conformation. This approach has given good results, including the introduction of catalytic activity into a previously inert protein [3] and the creation of a novel protein fold [37]. Negative design, by contrast, maximizes unfavorable interactions in competing states and requires modeling of each unwanted conformation [70]. One of the challenges in negative design is to model accurately the energetic effects of destabilizing mutations in competing states.

3.2 Nucleic acid design

In this section we discuss biochemical and computational methods that have been used to design nucleic acids. Nucleic acids have the advantage that they are easy to synthesize and that structure formation is mainly based on secondary structure, that is, base-pairing interactions within a strand.

Nucleic acids are versatile building materials. In nature, DNA and RNAs, such as mRNA, rRNAs and tRNAs, are involved in making proteins. In nanotechnology, DNAs and RNAs have several applications. Assembly and folding principles of natural RNA are used to build potentially functional artificial structures at the nano-scale [20, 5]. This concept, called RNA tectonics, led to the synthesis of RNA grids with various patterns such as the one in Figure 3.1. Nucleic acids have also been used to build nanomechanical devices [59]. In this case, the combination of single- and double-stranded sections of DNA yields structures that can be thought of as a network of stiff and flexible elements. The deliberate formation or destruction of double-stranded sections in such a network induces conformational changes that result in nanoscale motions such as rotational motion, pulling and stretching, or even unidirectional motion. Other applications include engineering logic circuits [54] and simple computers [6].

3.2.1 Biochemical methods

In practice, RNA design is mostly done using biochemical methods. These procedures allow us to find RNA structures similar to those retained by natural evolution, and also to identify alternative conformations that can perform the same function. Structures of interest can be characterized by X-ray crystallography and NMR spectroscopy. By using phylogenetic analysis and sequence alignment of several RNA sequences it is possible to identify conserved primary and secondary structural features that can be related to function [7]. Wang and Unrau [65] use random recombination and selection to isolate the core functional elements of an RNA where phylogeny is lacking or is limited. In vitro mutagenesis and selection have been performed on several ribozymes and substrates to determine the overall secondary structure and to identify which elements are essential for activity [2, 46]. Minimal motifs that support catalytic activity, or modified structures with improved catalytic activity, can be determined by this method. Breaker et al. [5] also used in vitro selection to design allosteric ribozymes as biosensor components. In an allosteric ribozyme, catalytic function is induced or inhibited in the presence of an effector molecule that binds to a receptor site distinct from the enzyme's active site. Another approach searches the GenBank database for potential structural motifs that might have functional significance. Ferbeyre et al. [27] mutate versions of the hammerhead self-cleaving RNAs to find alternative structures with similar function or with increased catalytic activity.

Figure 3.1: Tectosquare designed by Chworos et al. [20]. Panel (a) shows the RNA structure that can self-assemble to form the square of panel (b). Panel (c) shows the three-dimensional representation of the tectosquare.

3.2.2 Computational methods

There has also been progress with computational approaches. A deterministic approach is given by Seeman [56]. He uses a sequence-symmetry minimization algorithm in which bases are selected to minimize similarities between segments of the molecule. In this way the sequence adopts the desired conformation and is less likely to fold into an alternative structure. Designed structures are validated with gel electrophoresis, where the particular RNA molecule is identified by the band patterns it yields after being cut with various restriction enzymes.

The RNA Secondary Structure Design Problem can be seen as a discrete constraint satisfaction problem [32], where the constraint variables are the


More information

GENETICS - CLUTCH CH.11 TRANSLATION.

GENETICS - CLUTCH CH.11 TRANSLATION. !! www.clutchprep.com CONCEPT: GENETIC CODE Nucleotides and amino acids are translated in a 1 to 1 method The triplet code states that three nucleotides codes for one amino acid - A codon is a term for

More information

9 The Process of Translation

9 The Process of Translation 9 The Process of Translation 9.1 Stages of Translation Process We are familiar with the genetic code, we can begin to study the mechanism by which amino acids are assembled into proteins. Because more

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

Structure-Based Comparison of Biomolecules

Structure-Based Comparison of Biomolecules Structure-Based Comparison of Biomolecules Benedikt Christoph Wolters Seminar Bioinformatics Algorithms RWTH AACHEN 07/17/2015 Outline 1 Introduction and Motivation Protein Structure Hierarchy Protein

More information

Impact Of The Energy Model On The Complexity Of RNA Folding With Pseudoknots

Impact Of The Energy Model On The Complexity Of RNA Folding With Pseudoknots Impact Of The Energy Model On The omplexity Of RN Folding With Pseudoknots Saad Sheikh, Rolf Backofen Yann Ponty, niversity of Florida, ainesville, S lbert Ludwigs niversity, Freiburg, ermany LIX, NRS/Ecole

More information

From Gene to Protein

From Gene to Protein From Gene to Protein Gene Expression Process by which DNA directs the synthesis of a protein 2 stages transcription translation All organisms One gene one protein 1. Transcription of DNA Gene Composed

More information

Videos. Bozeman, transcription and translation: https://youtu.be/h3b9arupxzg Crashcourse: Transcription and Translation - https://youtu.

Videos. Bozeman, transcription and translation: https://youtu.be/h3b9arupxzg Crashcourse: Transcription and Translation - https://youtu. Translation Translation Videos Bozeman, transcription and translation: https://youtu.be/h3b9arupxzg Crashcourse: Transcription and Translation - https://youtu.be/itsb2sqr-r0 Translation Translation The

More information

Conserved RNA Structures. Ivo L. Hofacker. Institut for Theoretical Chemistry, University Vienna.

Conserved RNA Structures. Ivo L. Hofacker. Institut for Theoretical Chemistry, University Vienna. onserved RN Structures Ivo L. Hofacker Institut for Theoretical hemistry, University Vienna http://www.tbi.univie.ac.at/~ivo/ Bled, January 2002 Energy Directed Folding Predict structures from sequence

More information

Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction Networks

Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction Networks 22 International Conference on Environment Science and Engieering IPCEE vol.3 2(22) (22)ICSIT Press, Singapoore Structure and Centrality of the Largest Fully Connected Cluster in Protein-Protein Interaction

More information

Rex-Family Repressor/NADH Complex

Rex-Family Repressor/NADH Complex Kasey Royer Michelle Lukosi Rex-Family Repressor/NADH Complex Part A The biological sensing protein that we selected is the Rex-family repressor/nadh complex. We chose this sensor because it is a calcium

More information

Chemical Reactions and the enzimes

Chemical Reactions and the enzimes Chemical Reactions and the enzimes LESSON N. 6 - PSYCHOBIOLOGY Chemical reactions consist of interatomic interactions that take place at the level of their orbital, and therefore different from nuclear

More information

Rapid Dynamic Programming Algorithms for RNA Secondary Structure

Rapid Dynamic Programming Algorithms for RNA Secondary Structure ADVANCES IN APPLIED MATHEMATICS 7,455-464 I f Rapid Dynamic Programming Algorithms for RNA Secondary Structure MICHAEL S. WATERMAN* Depurtments of Muthemutics und of Biologicul Sciences, Universitk of

More information

Laith AL-Mustafa. Protein synthesis. Nabil Bashir 10\28\ First

Laith AL-Mustafa. Protein synthesis. Nabil Bashir 10\28\ First Laith AL-Mustafa Protein synthesis Nabil Bashir 10\28\2015 http://1drv.ms/1gigdnv 01 First 0 Protein synthesis In previous lectures we started talking about DNA Replication (DNA synthesis) and we covered

More information

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p.110-114 Arrangement of information in DNA----- requirements for RNA Common arrangement of protein-coding genes in prokaryotes=

More information

Prediction of Conserved and Consensus RNA Structures

Prediction of Conserved and Consensus RNA Structures Prediction of onserved and onsensus RN Structures DISSERTTION zur Erlangung des akademischen rades Doctor rerum naturalium Vorgelegt der Fakultät für Naturwissenschaften und Mathematik der niversität Wien

More information

Chapter 12. Genes: Expression and Regulation

Chapter 12. Genes: Expression and Regulation Chapter 12 Genes: Expression and Regulation 1 DNA Transcription or RNA Synthesis produces three types of RNA trna carries amino acids during protein synthesis rrna component of ribosomes mrna directs protein

More information

Multiple Choice Review- Eukaryotic Gene Expression

Multiple Choice Review- Eukaryotic Gene Expression Multiple Choice Review- Eukaryotic Gene Expression 1. Which of the following is the Central Dogma of cell biology? a. DNA Nucleic Acid Protein Amino Acid b. Prokaryote Bacteria - Eukaryote c. Atom Molecule

More information

Human Biology. The Chemistry of Living Things. Concepts and Current Issues. All Matter Consists of Elements Made of Atoms

Human Biology. The Chemistry of Living Things. Concepts and Current Issues. All Matter Consists of Elements Made of Atoms 2 The Chemistry of Living Things PowerPoint Lecture Slide Presentation Robert J. Sullivan, Marist College Michael D. Johnson Human Biology Concepts and Current Issues THIRD EDITION Copyright 2006 Pearson

More information

BIOLOGY STANDARDS BASED RUBRIC

BIOLOGY STANDARDS BASED RUBRIC BIOLOGY STANDARDS BASED RUBRIC STUDENTS WILL UNDERSTAND THAT THE FUNDAMENTAL PROCESSES OF ALL LIVING THINGS DEPEND ON A VARIETY OF SPECIALIZED CELL STRUCTURES AND CHEMICAL PROCESSES. First Semester Benchmarks:

More information

Enzyme Enzymes are proteins that act as biological catalysts. Enzymes accelerate, or catalyze, chemical reactions. The molecules at the beginning of

Enzyme Enzymes are proteins that act as biological catalysts. Enzymes accelerate, or catalyze, chemical reactions. The molecules at the beginning of Enzyme Enzyme Enzymes are proteins that act as biological catalysts. Enzymes accelerate, or catalyze, chemical reactions. The molecules at the beginning of the process are called substrates and the enzyme

More information

Chapter 1. DNA is made from the building blocks adenine, guanine, cytosine, and. Answer: d

Chapter 1. DNA is made from the building blocks adenine, guanine, cytosine, and. Answer: d Chapter 1 1. Matching Questions DNA is made from the building blocks adenine, guanine, cytosine, and. Answer: d 2. Matching Questions : Unbranched polymer that, when folded into its three-dimensional shape,

More information

Combinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming

Combinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming ombinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming Matthew Macauley Department of Mathematical Sciences lemson niversity http://www.math.clemson.edu/~macaule/ Math

More information

Grand Plan. RNA very basic structure 3D structure Secondary structure / predictions The RNA world

Grand Plan. RNA very basic structure 3D structure Secondary structure / predictions The RNA world Grand Plan RNA very basic structure 3D structure Secondary structure / predictions The RNA world very quick Andrew Torda, April 2017 Andrew Torda 10/04/2017 [ 1 ] Roles of molecules RNA DNA proteins genetic

More information