Secondary Structure Models of Nucleic Acid Folding Kinetics

Carnegie Mellon University
Research Showcase @ CMU
Dissertations, Theses and Dissertations
9-8-2011

Secondary Structure Models of Nucleic Acid Folding Kinetics
Benjamin Adair Sauerwine, Carnegie Mellon University

Follow this and additional works at: http://repository.cmu.edu/dissertations
Part of the Physics Commons

Recommended Citation
Sauerwine, Benjamin Adair, "Secondary Structure Models of Nucleic Acid Folding Kinetics" (2011). Dissertations. Paper 103.

This Dissertation is brought to you for free and open access by the Theses and Dissertations at Research Showcase @ CMU. It has been accepted for inclusion in Dissertations by an authorized administrator of Research Showcase @ CMU. For more information, please contact researchshowcase@andrew.cmu.edu.

Secondary Structure Models of Nucleic Acid Folding Kinetics
by Benjamin Adair Sauerwine
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Carnegie Mellon University, Department of Physics, Pittsburgh, Pennsylvania
Advised by Professor Michael Widom
September 8, 2011

Abstract

This thesis examines nucleic acid folding processes using secondary structure models, which have the advantage that they can simulate behavior at long timescales. First, it verifies that Kinetic Monte Carlo can reproduce known behavior of biological riboswitches. Motivated by the results of that study, a new and general method for determining the effective barrier between local free energy minima is introduced. Finally, a general rate model is developed that greatly improves simulated agreement with measured dynamic properties by considering the energy of intermediates outside the state space of secondary structure.

Acknowledgments

For useful discussions regarding riboswitch folding kinetics, the author wishes to thank Maumita Mandal, Jay Kadane and Jon Widom. For implementing the UNAfold energy parameters in ViennaRNA, the author wishes to thank Tanmay Mudholkar. The author appreciates the advice of the thesis committee, Markus Deserno, Maumita Mandal, Robert Sekerka and Robert Swendsen. The wise guidance of Mike Widom has been invaluable to the success of this research.

Contents

1 Introduction
1.1 Nucleic Acids
1.1.1 Physical Properties of Nucleic Acids
1.1.2 Energy Landscapes
1.1.3 Biological Roles of Nucleic Acids
1.2 Simulation Tools
1.2.1 Energy Models
1.2.2 Zuker and McCaskill Algorithms
1.2.3 RNAsubopt Program
1.2.4 kinfold Monte Carlo
1.2.5 barriers Program
2 Folding Efficiency of Riboswitch Transcriptional Terminators
2.1 Motivation
2.2 Methods
2.2.1 Riboswitch Identification and Annotation
2.2.2 Folding
2.2.3 Statistical analysis of distributions
2.3 Results and Discussion
2.3.1 Terminators Versus Sequesterers
2.3.2 Antitermination Versus Termination
2.3.3 Engineered Antiterminator
2.4 Conclusions
3 Lifetimes of Nucleic Acid Structures from the Free Energy Landscape
3.1 Motivation
3.2 Arrhenius Laws
3.3 Ab-initio Methods
3.3.1 The ViennaRNA Package
3.3.2 Exact Calculation of Arrhenius Barriers
3.4 Empirical Methods
3.4.1 The barriers Program
3.4.2 Arrhenius Barriers From the Free Energy Landscape
3.5 Conclusions
4 Arrhenius Dynamics for Systems with Significant Intermediates
4.1 Motivation
4.2 Arrhenius Rate Model
4.3 Results
4.4 Conclusions

List of Tables

1.1 Physical properties of nucleic acid polymers. Some researchers use DNA parameters for RNA when unavailable [17]. More properties are available from the BioNumbers website [22].
1.2 Fractal dimensions, a measure of volume occupied by increasingly longer chains, for scaling factor in the Jacobson-Stockmayer entropy extrapolation [29].
1.3 A list of known types of riboswitches and their ligands.
2.1 Terminator efficiency tables for Fisher's exact test comparing TPP terminator hairpins to Shine-Dalgarno sequesterers, when grown at 50 nt/s and assuming $\tau_K = 5\,\mu\mathrm{s}$.
2.2 Switching performance of individual bound-off switches measured by termination efficiency e growing from the post-aptamer position and antitermination rate a = 1 − e where e is efficiency growing from the pre-antiterminator position (see figure 1.3).
2.3 Simulated switch function for Bacillus subtilis bound-off riboswitches. Termination post-aptamer and pre-antiterm indicate the fractions of runs that form proper terminator hairpins when grown, respectively, from the nucleotide following the aptamer or from the nucleotide preceding the antiterminator. Barriers are obtained from the barriers program and lifetimes are obtained from kinfold simulations. Starred entries correspond to functions that could not be meaningfully simulated due to the presence of pseudoknots, and are discussed in section 2.3.2.
3.1 Arrhenius prefactors $c_0$ and normalized mean-square errors, $\sigma^2_{B,F} = 1 - R^2$, for barriers (B) and the free energy methods (F).
4.1 Calibrated timescales for an elementary kinfold conformation change of DNA at the data points nearest 37 °C using the Arrhenius rate model.

List of Figures

1.1 A chemical diagram of DNA bonded with RNA. Hashes indicate hydrogen bonds. Figure produced with ChemPaint [20].
1.2 3D diagram of A-form and B-form DNA helices adapted from Wikipedia [21].
1.3 Two functional secondary structure conformations of the xpt riboswitch [39]. Blue nucleotides on the P2 and P3 loops (left) can form a pseudoknot, a type of tertiary structure.
1.4 Diagrams representing secondary structure of the xpt riboswitch aptamer. Above: Rainbow diagram of the folded aptamer. Below: Parenthesis diagram of the folded aptamer.
1.5 Tertiary structure of the ligand-bound aptamer of the add riboswitch, which is a purine riboswitch similar in structure and function to the xpt riboswitch. Figure taken from Serganov et al. [40].
1.6 Funnel diagram [43] schematic of the energy landscape between open and closed conformations of a nucleic acid hairpin.
1.7 The life cycle of a typical microRNA.
1.8 Types of nucleic acid secondary structure for which energies are given in energy models.
1.9 The three types of move allowed by kinfold [70].
2.1 Folding efficiency distributions of rho-independent terminators and Shine-Dalgarno sequesterers of TPP riboswitches from a wide range of prokaryotes. Inset: fraction f of expected pairs formed over repeated trials and all sequences.
2.2 Proportion of TPP terminators and sequesterers that fold efficiently at various timescales. Points of the same color indicate the distribution of individual folding efficiencies of the hairpins in that set. Green dotted line indicates the timescale for $\tau_K = 5\,\mu\mathrm{s}$ and $R_t = 50$ nt/s.
2.3 Reproduced results of figure 2.1 including efficiencies of TPP rho-independent terminators with an additional random five nucleotide pairs added to the stem.
2.4 Frequency weighted sequence logos [78] for TPP rho-independent terminators (left) and Shine-Dalgarno sequesterers (right). Regions 1-5 correspond, respectively, to the first half of the 5′ side of the stem, the second half of the same, the loop, the first half of the 3′ side of the stem, the second half of the same.
2.5 Modified xpt bound-off riboswitch. Ligand-bound state (left) terminates in original xpt but antiterminates in modified. Ligand-unbound state (right) antiterminates in original but terminates in modified. Notation: Red labels P1 stem of aptamer; blue and green label P2 and P3 stems; magenta labels terminator.
3.1 Ab-initio study of the sequence AAA.
3.2 (left) Kawasaki-rule simulated and (right) Metropolis-rule simulated lifetime versus the calculated barrier in dual-hairpin topology short sequences 40 nucleotides long.
3.3 Simplified one-dimensional illustration of the free-energy based barrier finding program.
3.4 Comparison of basin and barrier entropy contributions to Arrhenius barrier.
4.1 Diagram of the true barrier state between the open chain and a nucleic acid chain with a single base pair. Loop and stack energy contributions to the intermediate state are diagrammed at the bottom.
4.2 Optimal folding pathway for the $T_{12}$ hairpin in 0.1 M NaCl at 37 °C.
4.3 The $T_{12}$ DNA hairpin in its closed conformation. The other $A_n$ and $T_n$ hairpins differ only in the loop and share the same stem in the closed conformation.
4.4 Comparison of experimental [90] and calculated rates. Left-hand column shows sequences $T_n$ for various n. Right-hand column shows sequence $A_{21}$ at various salt concentrations. Top row shows experimental opening ($k_{open}$) and closing ($k_{close}$) rates. Middle row shows simulations utilizing the Kawasaki rate model. Bottom row compares equilibrium constants $K = k_{open}/k_{close}$ between experiment and calculation.
4.5 Simulated opening and closing rates of $T_n$ sequences. (left) the Kawasaki rate model as previously shown in figure 4.4. (right) Arrhenius rate model with a Kawasaki cutoff.
4.6 Simulated opening and closing rates of $A_{21}$ sequences. (top left) Kawasaki rate model as previously shown in figure 4.4. (top right) Arrhenius rate model with a Kawasaki cutoff. (bottom left) the Kawasaki rate model including stacking rigidity of 0.5 kcal/mol/nt. (bottom right) the Arrhenius rate model including stacking rigidity.

List of Algorithms

1.1 Greatly simplified pseudocode for the Zuker algorithm for determining the minimum free energy secondary structure of a nucleic acid. The variant here is simplified for ease of understanding.
1.2 Greatly simplified pseudocode for the backtracking RNAsubopt algorithm for determining all structures with free energy below a given threshold. signifies string concatenation.
1.3 Pseudocode for the barriers algorithm for determining saddle points and local energy minima for a nucleic acid free energy landscape.
3.1 Determine the median first passage time from an initial distribution to a final state indexed f using the rate matrix R exact f. I indicates the identity matrix.
3.2 Determine the basin or basins that each state from RNAsubopt is in.
3.3 Determine the partition functions for each basin and border state.
3.4 Determine the Arrhenius barrier between an initial and final state.

Chapter 1

Introduction

The discovery of the structure and function of DNA (deoxyribonucleic acid) by Watson and Crick [1] marked the beginning of molecular biology as a field. DNA and a related molecule, RNA (ribonucleic acid), are similar biopolymers; the primary chemical difference between them is a single additional hydroxyl (OH) group on the RNA backbone, as seen in figure 1.1. In life, DNA is normally found hydrogen bonded with a complementary strand, producing the familiar double-stranded double helix structure. RNA, on the other hand, is normally found single-stranded and as such is free to fold on itself or hybridize with other molecules to produce biologically useful structures. The goal of this work is to develop and implement techniques to study the kinetics of nucleic acid folding using thermodynamics and simplified secondary structure models concerned only with nucleotide binding.

The central dogma of molecular biology [2] states that genes are transcribed from DNA into messenger RNA (mRNA), which is then translated into proteins at the ribosome. Indeed, evolution has favored DNA as the carrier of genetic information in most known organisms, but RNA-based viruses such as retroviruses [3] and viroids [4], as well as prions [5], provide exceptions to the rule. Because RNA is normally single-stranded, it is free to form complex structure, and so it can function as both an enzyme and a carrier of genetic information. This pair of properties led to the RNA World hypothesis [6], that the earliest life on Earth consisted of organisms in which RNA played the roles of both protein and DNA.

This thesis focuses on those functions of nucleic acid outside the central dogma of molecular biology, especially regulatory or enzymatic roles. Examples include microRNAs [7], where small RNA complementary to part of a messenger RNA can affect translation, and riboswitches [8], where the presence of a ligand can regulate either transcription or translation of an affected gene. MicroRNA and riboswitches are not the only relevant applications of this work, as RNA viroids are known to have specialized functional components [4] and synthetic PNA (peptide nucleic acids) are being studied as therapeutic agents [9] as well as a means for nanoscale construction [10].

This chapter begins with a discussion of nucleic acids and their roles, behavior, and important properties. The remainder of this chapter focuses on simulation tools and algorithms that will be used in the later chapters.

1.1 Nucleic Acids

Nucleic acids play versatile roles in organisms. They encode the genes of all known organisms and can be considered a blueprint for life [1]. As microRNA [7] or riboswitches [8], they act as regulators, sensors, and even enzymes. In small nucleolar RNA [11] and the ribosome [12], nucleic acids play a role in the self-assembly of the ribosome itself as well as a mechanical role in constructing proteins. Nucleic acids are able to play such versatile biological roles because their physical properties allow them both to carry information and to assemble into structures able to perform enzymatic functions. This section first discusses the physical properties that allow nucleic acids to perform such remarkable functions, then the biological roles that are directly studied in the remaining chapters.

1.1.1 Physical Properties of Nucleic Acids

There are four regimes of structure that give increasing detail in terms of the behavior of the molecule. At the crudest level, nucleic acids are polymers with characteristic properties such as persistence length, radius of gyration, elastic modulus and chirality. At the level of primary structure, nucleic acids are composed of a sequence of bases known as nucleotides that encode genetic information. By virtue of their Watson-Crick and non-canonical base pairing, nucleic acids may form a secondary structure composed of stems and loops. Finally, the nucleic acid may form additional tertiary structure with other molecules or with itself. Tertiary structure refers to steric interactions, interactions between nucleic acids and proteins, and binding to ligands. This section discusses these levels of structure in order, from the lowest level, polymer-like structure, to the tertiary interactions with itself and other molecules.

Zeroth Level of Structure: Nucleic Acids as Polymers

Measurements of the radius of gyration of DNA in aqueous salt solution [13] show good agreement with the Flory exponent for a polymer in a good solvent, $\nu = 3/5$ [14]. This simply means that the expected end-to-end distance, and consequently the radius of gyration, of a polymer chain of N persistence lengths scales as $R \sim N^{\nu}$. Taken alongside the structural similarity between DNA and other nucleic acids, this agreement implies that nucleic acid in aqueous solution is well approximated by ideal polymer models. Indeed, experiments agree with the expected properties of nucleic acid polymers in dilute solutions, while interactions between the polymers become important in semi-dilute solutions [15].
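
As a short worked illustration of the scaling relation above, the relation can be turned into an equality by introducing a prefactor. The length b used here (of the order of one persistence length) is an assumption made only for this sketch and is not a parameter taken from the text.

```latex
R(N) \approx b\,N^{\nu},
\qquad
\frac{R(2N)}{R(N)} = 2^{\nu} = 2^{3/5} \approx 1.52
```

Under this assumption, doubling the chain length increases the expected size by roughly 52 percent, compared with $2^{1/2} \approx 1.41$ for a random walk without excluded volume ($\nu = 1/2$).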

Nucleic acids are chiral molecules, with DNA and RNA having distinct 5′ and 3′ ends. For the purpose of base pairing, the 5′ end is considered to be the beginning of the strand and the 3′ end the end of the strand, because most biological processes read the strand in this direction. Double-stranded DNA such as the DNA in human cells has two complementary strands that run in opposite directions, so that the 5′ end of one strand meets the 3′ end of the other. A chemical sketch of hybridized DNA and RNA can be seen in figure 1.1, with zig-zag lines indicating hydrogen bonds. Note the slight difference in backbone chemistry between DNA and RNA, with RNA having an additional hydroxyl group.

Typical values of physical properties for both double-stranded (ds) and single-stranded (ss) polymers are given in table 1.1. Note that the values in table 1.1 can change based on the ionic strength of the solution as well as in a sequence-dependent manner [16]. The elastic moduli of ssRNA and dsRNA are measurable through pulling experiments, but can be taken to equal those of their DNA counterparts [17]. For the purpose of free energy calculations, the persistence length of ssDNA and ssRNA is frequently taken to equal one nucleotide length [18, 19].

Figure 1.1: A chemical diagram of DNA bonded with RNA. Hashes indicate hydrogen bonds. Figure produced with ChemPaint [20].

Figure 1.2: 3D diagram of A-form and B-form DNA helices adapted from Wikipedia [21].

Table 1.1: Physical properties of nucleic acid polymers. Some researchers use DNA parameters for RNA when unavailable [17]. More properties are available from the BioNumbers website [22].

Property                           ssDNA        dsDNA          ssRNA        dsRNA
Persistence length (nm)            0.75 [23]    54 [24]        1 [25]       62 [24]
Length of one nucleotide (nm)      0.7 [26]
Rotation per nucleotide (degrees)  36 [27]      36 [27]        32.73 [27]   32.73 [27]
Radius at phosphate (nm)           0.75 [27]    0.89 [27]      0.7 [27]     0.88 [27]
Helical rise per base pair (nm)    0.18 [27]    0.338 [27]     0.19 [27]    0.281 [27]
Elastic modulus (pN)               800 [23]     1100 [28]
Form                               Polymer      B-form helix   Polymer      A-form helix

The behavior of nucleic acids is well explained by standard models of polymers in solution [15]. From the perspective of secondary structure and coarse-grained kinetics, polymer models give insight into bending free energy and the frequency with which distant nucleotides on the nucleic acid chain come together due to fluctuations of the polymer. For the purpose of determining bending free energy, the Jacobson-Stockmayer extrapolation [29] is used in both the RNA [18] and DNA [19] free energy models employed here. The assumption that the free energy of long polymer loops is primarily entropic is shown to be reasonable for DNA hairpins [16] as well as for carefully calibrated models of RNA [18]. The Jacobson-Stockmayer entropy extrapolation is particularly useful because it can be applied relative to a chosen reference length. This matters because nucleotide-nucleotide interactions within short loops cause some deviation from strictly logarithmic loop free energy, as seen in the data for the models themselves [18, 19]. The contribution of nucleotide-nucleotide interaction within the loops is most clear in the pathological cases of tetraloops and triloops, which are unusually stable due to interactions within the loop [18].

The Jacobson-Stockmayer extrapolation is most easily explained using the ideal chain model. In the ideal chain model, a polymer chain is treated simply as a non-self-interacting random walk. Then, for a polymer loop composed of n steps in a random walk, we may write $S(n) = R \ln Z(n)$, where $Z(n)$ is the partition function for this loop. The most important aspect of this partition function is the fractal dimension $D$ that the loop occupies, which appears in equation 1.1, so that $Z(n) \sim n^{D}$. For a free chain, this is famously given by the inverse of the Flory exponent, $D = \nu^{-1}$ [14]. Other treatments for self-avoiding walks are given by de Gennes [30]. Some energy models find it preferable to fit the measured loop free energy to the size of the loop in order to find the scaling exponent [19]. Finally, we may derive the Jacobson-Stockmayer entropy extrapolation for $n > n_0$,

$$S(n) = S(n_0) + R \ln\!\left(\frac{Z(n)}{Z(n_0)}\right) = S(n_0) + D\,R \ln\!\left(\frac{n}{n_0}\right). \tag{1.1}$$

In equation 1.1, $D$ is the fractal dimension of the polymer loop, $R$ is the ideal gas constant, $n_0$ is the length of the reference polymer chain in persistence lengths and $n$ is the length of the desired polymer chain in persistence lengths. Some values for $D$ are given in table 1.2. For the RNA energy parameters employed here [18], the value $D = 1.75$ is used from the Jacobson-Stockmayer derivation.
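
The extrapolation in equation 1.1 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used by any energy model: the function names, the symbol `d` for the fractal dimension, the choice of kcal/(mol K) units, and the convention that the entropy term enters the loop free energy as a penalty growing with loop length, ΔG(n) = ΔG(n0) + d·R·T·ln(n/n0), are all assumptions made here.

```python
import math

# Ideal gas constant in kcal/(mol K); the unit choice is an assumption of this sketch.
R_KCAL = 1.987e-3


def js_entropy(n, n0, s0, d=1.75, gas_constant=R_KCAL):
    """Extrapolate the loop entropy term from reference length n0 to n >= n0.

    Follows the form of equation 1.1: S(n) = S(n0) + d * R * ln(n / n0).
    Lengths are measured in persistence lengths; d is the fractal dimension.
    """
    if n0 <= 0 or n < n0:
        raise ValueError("expected 0 < n0 <= n")
    return s0 + d * gas_constant * math.log(n / n0)


def js_loop_free_energy(n, n0, g0, temperature=310.15, d=1.75,
                        gas_constant=R_KCAL):
    """Loop free energy penalty, assuming it is purely entropic beyond n0.

    Assumed convention: dG(n) = dG(n0) + d * R * T * ln(n / n0).
    """
    if n0 <= 0 or n < n0:
        raise ValueError("expected 0 < n0 <= n")
    return g0 + d * gas_constant * temperature * math.log(n / n0)
```

With these assumptions, each doubling of the loop length adds the same increment d·R·T·ln 2 to the loop penalty, which is the logarithmic behaviour that short loops such as tetraloops and triloops deviate from.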

Table 1.2: Fractal dimensions, a measure of volume occupied by increasingly longer chains, for scaling factor in the Jacobson-Stockmayer entropy extrapolation [29].

Model                             Fractal dimension $D = \nu^{-1}$
Free chain in a good solvent      5/3 [14]
Jacobson-Stockmayer derivation    1.75 [29]
DNA loop experiment               2.44 [19]

Two models relevant to the timescale of monomer dynamics within a polymer chain are due to Rouse [31] and Zimm [32]. The Rouse model treats the polymer as a set of beads connected by springs, while Zimm adds hydrodynamic interactions to the beads. Experimental evidence shows that single-stranded DNA monomer diffusion agrees with the Zimm model, whereas double-stranded DNA monomer diffusion agrees with the Rouse model [33]. Since this thesis is primarily concerned with interactions between distant nucleotides, which depend mainly on thermodynamics, the Zimm and Rouse models for monomer diffusion are not necessary. Rather, the remaining chapters use the Arrhenius model for transitions between states.

Primary Structure: The Genetic Sequence

The Watson-Crick base pairs are adenine (A) with thymine (T) and guanine (G) with cytosine (C) in DNA [1]. In RNA, uracil, denoted U, takes the place of thymine. At the level of primary structure, nucleic acids are a polymer as described above, with nucleotides covalently bound along the length of the chain. In addition to the four Watson-Crick bases, non-canonical bases such as inosine and pseudouridine exist in RNA. These are formed by modification of A, C, G and U by agents such as snoRNA [11].

Within the central dogma, the canonical four bases on DNA form the genetic code. Each gene codes for the formation of its respective protein through triplets of these four bases known as codons [34]. Transcription of DNA into RNA creates a copy of the code suitable for translation into protein. As the gene is translated by the ribosome, the ribosome presents nucleic acid triplets to transfer RNA (tRNA), and the tRNA that binds with this triplet ultimately decides which amino acid to add to the protein in formation [35]. Each tRNA has a clover-leaf structure which includes a domain that recognizes its intended codon or codons and a domain that attaches to its amino acid. Thus, the tRNA is essential to translating the genetic sequence on the primary structure, but the tRNA itself is a non-coding RNA with a functional secondary structure and important tertiary interactions.

Secondary Structure: Base Pairing

Secondary structure is simply the set of paired bases in a polymer, without regard to any spatial configuration of the polymer itself. Two different possible secondary structures of a particular functional RNA can be seen in figure 1.3. In DNA, A can hydrogen bond with T and G with C. Likewise in RNA, A can hydrogen bond with U and G with C. The canonical Watson-Crick base pairs can be seen in figure 1.1. The stabilizing free energy of an A-T or A-U pair is somewhat smaller than that of a G-C pair, but both have a contribution on the order of 1 kilocalorie per mole [18, 19]. The exact energetics of base pairing are a function of salt concentration [19, 36], temperature, and nucleic acid backbone, as well as the nucleotides and conformation near the paired bases [18, 19]. This dependence on nearby bases and conformation is known as stacking energy, and the energy models used are based on experimental data on numerous such local configurations [18, 19].

The most important non-canonical base pair for the purpose of RNA secondary structure is known as the G-U wobble pair [37, 18]. The G-U wobble pair has limited stability, and typically appears only as part of a stack of canonical base pairs. Other non-canonical base pairs exist, but are not part of the simulations discussed here. Inosine, for instance, may produce its own type of wobble base pair [37]. Another example can be found in G repeats, which may produce a G-quadruplex, a type of interaction where four guanine nucleotides mutually hydrogen bond with each other [38]. The G-quadruplex pairing is biologically relevant to telomeres, but since it requires association of four nucleotides to form, it is not part of secondary structure and is thus not considered further here.

Figure 1.3: Two functional secondary structure conformations of the xpt riboswitch [39]. Blue nucleotides on the P2 and P3 loops (left) can form a pseudoknot, a type of tertiary structure.

A convenient way of describing secondary structure is through a balanced parenthesis diagram. In a parenthesis diagram, unpaired nucleotides are written as "." and pairs of nucleotides are indicated with corresponding pairs of opening "(" and closing ")" parentheses. Each pair in secondary structure guarantees that the nucleotides internal to the pair do not pair with anything external to the pair. For example, if nucleotides (i, j) are paired and also (k, l) are paired, then either i < k < l < j or k < i < j < l, or the two pairs do not overlap at all. Thus the parenthesis diagram unambiguously defines the secondary structure of its associated sequence. An example of a parenthesis diagram representing secondary structures of the xpt riboswitch aptamer, as well as a rainbow diagram, can be seen in figure 1.4.
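
The parenthesis notation described above maps directly onto a stack-based parser. The following is a minimal sketch, not code from any of the packages discussed in this thesis; the function names `parse_dot_bracket` and `is_nested` and the 0-based indexing are assumptions made for illustration.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of 0-based (i, j) pairs in a parenthesis diagram."""
    stack = []   # positions of "(" still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def is_nested(pairs):
    """Return True when no two pairs cross, i.e. never i < k < j < l."""
    for index, (i, j) in enumerate(pairs):
        for k, l in pairs[index + 1:]:
            if i < k < j < l or k < i < l < j:
                return False
    return True


# A small hairpin followed by one unpaired nucleotide.
hairpin_pairs = parse_dot_bracket("((..)).")
```

Any list of pairs produced by `parse_dot_bracket` satisfies `is_nested` by construction, since a single kind of bracket cannot express crossing pairs; a hand-written pair list such as `[(0, 4), (2, 6)]` fails the check.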
The rainbow diagram simply has lines connecting the paired nucleotides in the parenthesis diagram.

Figure 1.4: Diagrams representing secondary structure of the xpt riboswitch aptamer. Above: Rainbow diagram of the folded aptamer. Below: Parenthesis diagram of the folded aptamer.

Secondary-and-a-half Structure: Pseudoknots

Pseudoknots are formed when bound pairs of nucleotides indexed (i, j) and (k, l) have the property that i < k < j < l or k < i < l < j. For example, the nucleotides highlighted in blue in figure 1.3 may hydrogen bond to produce a pseudoknot. There exist tools that are capable of performing optimizations on the space of pseudoknots, as discussed in section 1.2, but as pseudoknots break the convenient ordering of secondary structure, this quickly becomes computationally prohibitive. In fact, by excluding pseudoknots, the otherwise non-polynomial-time problem of determining optimal structure becomes a much easier polynomial-time problem. It is this division of the folding problem into independent nested pieces that allows the algorithms discussed in section 1.2 to complete in polynomial time. This work focuses on studying specific structures where pseudoknots are not directly relevant to their function, in order to develop algorithms for efficient calculation of properties.

Tertiary Structure: External Interactions and the 3D Shape

Tertiary structure is concerned with interactions between the nucleic acid and other molecules such as proteins or other nucleic acids, steric self-interactions, and the particular shape the molecule takes in space. This type of interaction is frequently studied with computationally expensive molecular dynamics models or heuristics. Because this thesis takes a coarse-grained thermodynamic approach to nucleic acid interactions, and because many important properties of nucleic acids can be understood from the secondary structures, tertiary structure is not considered in the models described in this thesis. As an example of tertiary structure, the ligand binding pocket of the add riboswitch aptamer can be seen in figure 1.5.

1.1.2 Energy Landscapes

Broadly, this thesis concerns processes that occur on complex energy landscapes, where states of various energy are connected through some topology representing a physical notion of elementary motion [41]. Nucleic acid folding is a particular application where a nontrivial topology of energies arises from a simple system [42].
Most of the processes considered here entail first passage times between an initial and 8

Figure 1.5: ertiary structure of the ligand-bound aptamer of the add riboswitch, which is a purine riboswitch similar in structure and function to the xpt riboswitch. Figure taken from Serganov et. al. [40]. final configuration. Figure 1.6 gives a schematic of a first passage process between an open and closed conformation of a nucleic acid hairpin with two funnels representing local basins that possess a large number of possible intermediate conformations. Both the initial and final state have a local basin with considerable entropy, and indeed the space of structures midway between them represents a combinatorial explosion of secondary structures available to the polymer. Since in secondary structure the open conformation is considered as a single state, it is shown in the funnel diagram as such despite the substantial conformational entropy available to it. hapter 3 will examine the effects of the combinatorial explosion of structures on the overall lifetime of structures. 1.1.3 Biological Roles of Nucleic Acids he most remarkable property of nucleic acids as functional molecules is their versatility, and in the last half-century research has revealed increasingly unusual roles that evolution has found for nucleic acids in organisms. hey are able to act as both carriers of genetic material [2] and as enzymes [8, 11]. he ability to fill both of these roles simultaneously makes them attractive candidates for the earliest life on Earth [6]. his section discusses the biological roles of nucleic acids starting with their roles as carriers of genetic information, then in the enzymatic and regulatory 9

Open onformation Order Parameter losed onformation Number of Available Secondary Structures otal Entropy of onformations Figure 1.6: Funnel diagram [43] schematic of the energy landscape between open and closed conformations of a nucleic acid hairpin. roles that will be important in the remainder of this work. Within the entral Dogma of Molecular Biology he most important functions of nucleic acids in life are described by the central dogma of molecular biology, which simply states that messenger RNA is transcribed from DNA by RNA polymerase and that proteins are translated from messenger RNA by ribosomes [2]. he proteins that a cell produces ultimately determine its function within its niche. It is easy to see, then, that the central dogma cannot be a complete description of the fate of a cell. For instance, each cell in the human body has an 10

identical copy of the genome in its nucleus; however, nerve cells are clearly different from liver cells, muscle cells, et cetera. This is due to a plethora of epigenetic factors and stimuli that decide which proteins need to be translated, how they need to be spliced, and when [44]. In the remainder of this section we will consider two ways, out of many, that nucleic acids can directly regulate which genes are transcribed and which proteins are translated.

Outside the Central Dogma: Riboswitches

Riboswitches are a prototypical example of nucleic acids acting simultaneously as a regulator and as an enzyme. They are found most frequently in bacteria, but are also found in some fungi and plants. The essential features of a riboswitch are an aptamer region, which binds some ligand in solution, and an expression platform. The expression platform performs a function based on its structure, and its ability to form that structure depends on whether the aptamer is bound to its ligand. Examples of expression platforms include terminator hairpins, which stop gene expression by dissociating the transcribing RNA polymerase from the DNA [45], and sequesterer hairpins, which stop gene expression by blocking the Shine-Dalgarno sequence, a ribosomal binding site [46].

A diagram of the ligand-bound and ligand-unbound conformations of the xpt riboswitch with a terminator hairpin is shown in figure 1.3, in which three distinct but overlapping functional regions can be seen. The first is the purine-binding aptamer, which forms first as the messenger RNA is transcribed from the 5′ to the 3′ side. The second functional region is the antiterminator, which acts as a logical NOT switch. If the aptamer is stabilized by its ligand, the antiterminator cannot form and the terminator will form as shown on the left. If the aptamer is not stabilized by its ligand, the antiterminator will parasitize it.
The antiterminator prevents formation of the terminator as shown on the right. The third functional region is the terminator. If the terminator is formed as shown on the left, it will cause the RNA polymerase transcribing the attached messenger RNA to dissociate from the DNA, preventing further transcription of the gene and effectively turning the gene off.

The xpt riboswitch is a bound-off switch, since the aptamer, when bound to its ligand, turns off the gene. There also exist riboswitches with the opposite, bound-on behavior, which lack the antiterminator and have a terminator hairpin that directly competes with the aptamer. Still other riboswitches can have multiple aptamers in tandem which cooperate to increase sensitivity on the expression platform. A table of known types of riboswitches and their associated ligands can be found in table 1.3. The most common riboswitch is the TPP riboswitch, which is annotated in over 1000 organisms in the Rfam database [47].

One aspect of riboswitches, and in particular riboswitches that include terminator hairpins, that makes them attractive targets for study by kinetic simulations is that these riboswitches are under a clear time constraint to reach their functional structure

quickly and efficiently. Since RNA is transcribed from DNA by RNA polymerase at a rate of approximately 50 nucleotides per second [59], knowing the length of the riboswitch plus the terminator expression platform and of any transcriptional pause sites [60] gives the timescale within which the riboswitch must act. Together, the length of the sequence and pausing at transcriptional pause sites lead to timescales of tens of seconds to a few minutes. In the remaining chapters of this work, the focus on the timescales of RNA folding is motivated by this prototypical example of a kinetic folding constraint on RNA.

Table 1.3: A list of known types of riboswitches and their ligands.

    Riboswitch                       Ligand
    SAM riboswitch [48]              S-adenosyl methionine
    Purine riboswitch [39]           Guanine and Adenine
    Lysine riboswitch [49]           Lysine
    Cobalamin riboswitch [50]        Cobalamin
    Glycine riboswitch [51]          Glycine
    SAH riboswitch [52]              S-adenosylhomocysteine
    Moco riboswitch [53]             Molybdenum or tungsten cofactor
    FMN riboswitch [54]              Flavin mononucleotide
    glmS ribozyme [55]               Glucosamine-6-phosphate
    Cyclic di-GMP riboswitch [56]    Cyclic di-guanylate
    TPP riboswitch [57]              Thiamine pyrophosphate
    preQ1 riboswitch [58]            Pre-queuosine 1

Outside the Central Dogma: microRNA

Another type of functional RNA, microRNA, acts in the cytoplasm at the time of translation by binding to messenger RNA, usually leading to the downregulation of the gene [7]. Secondary structure plays important roles in both the genesis and the regulatory action of microRNA. The usual route to production of a microRNA is to transcribe it from the gene, at which point the microRNA folds to its hairpin structure, known as pri-miRNA, and is trimmed by the protein Drosha. After export from the nucleus, the microRNA's hairpin loop is then trimmed off by the protein Dicer, and the two sides of the hairpin stem are free to dissociate. These dissociated halves of the stem then bind to an Argonaute protein.
Finally, this RNA-induced silencing complex binds to messenger RNA, usually downregulating its expression. A diagram of this process can be found in figure 1.7. Secondary structure is thus important in hairpin formation, dissociation of the stem, and binding to the messenger RNA. In fact, it has been shown that AU-rich regions of messenger RNA bear stronger regulatory targets due to the weaker secondary

structure in these regions [61].

Figure 1.7: The life cycle of a typical microRNA. Labels in the figure: miRNA, hairpin, mRNA, pri-miRNA.

1.2 Simulation Tools

Secondary structure models offer substantial improvements over molecular dynamics simulations in terms of computational efficiency. First, secondary structure models reduce the space of allowed states from a continuum of molecular configurations to a discrete set. The models by which the free energies of these discrete states are calculated are discussed in section 1.2.1. Second, efficient algorithms exist on the space of secondary structures that allow calculation of minimum free energy structures as well as the ensemble free energy of the nucleic acid molecule. These algorithms will be discussed in section 1.2.2. Third, the allowed secondary structures of nucleic acids can be enumerated so that the energy landscape can be studied. Kinetic Monte Carlo on this landscape is easily able to access biologically relevant timescales of seconds to minutes or even longer.

1.2.1 Energy Models

Common to each simulation tool discussed in this section are experimentally determined free energies of nucleic acid structures. By identifying each substructure that makes up the complete secondary structure of a nucleic acid, it is possible to add the corresponding energies of the substructures in order to determine the energy of the conformation of the entire nucleic acid. The RNA energy parameters of Mathews [18]

and the DNA energy parameters of SantaLucia [19] are used in the simulations in this thesis. An important aspect of each energy model is the salt concentration. Mathews' RNA parameters are determined at 1 M NaCl, with no salt correction given [18]. SantaLucia's DNA parameters are also determined at 1 M NaCl and include an empirical salt correction for both Na$^{+}$ and Mg$^{2+}$, so that they are useful for a variety of experimental conditions. Loop free energies are found experimentally for small loops of at most 30 nucleotides, and are given by the Jacobson-Stockmayer extrapolation discussed in section 1.1.1 for longer loops. Diagrams of various types of structures for which energies are given are shown in figure 1.8. Note that the free energy of a secondary structure is a constrained free energy, in contrast to the ensemble free energy of the polymer, with

\[ e^{-\beta F_{\text{full}}} = \sum_{\text{secondary structures}} e^{-\beta F_{\text{constrained}}}. \]

Figure 1.8: Types of nucleic acid secondary structure for which energies are given in energy models. Labels in the figure: Hairpin Loop, Bulge, Multiloop, Internal Loop, Dangles, Stacks.

1.2.2 Zuker and McCaskill Algorithms

The Zuker algorithm for determining the minimum free energy secondary structure of nucleic acids is based on the computational strategies of divide and conquer and dynamic programming [62]. The constraint of secondary structure gives a natural point of division for the dynamic programming algorithm, since the basic idea is to determine the optimal secondary structure on the global sequence by determining the optimal

structure on each possible subsequence. By definition of secondary structure, any base pair that occurs in any subsequence necessarily prevents the nucleotides interior to this pair from pairing with the exterior nucleotides. Thus, it is possible to write the minimum free energy of a sequence as a relatively simple function of the minimum free energies of its subsequences. The minimum free energy structure can then be found by backtracking on the set of energies of subsequences, as is done by the RNAsubopt program discussed in section 1.2.3. Pseudocode for the Zuker algorithm is found in algorithm 1.1.

A modification of the Zuker algorithm by McCaskill allows for efficient calculation of the partition function and ensemble free energy of a sequence, including constraints [63]. The McCaskill modification again relies on the canonical partition function of a large sequence being a function of those of its subsequences. By replacing the minimum over cases in the Zuker algorithm with a sum over cases, and the sums of energies with products of the partition functions of the operands, the resulting algorithm constructs the overall partition function. The McCaskill algorithm can therefore provide the ensemble free energy of the entire nucleotide chain, or the same subject to some constraints. Note that this modification will not work in the simplified version, algorithm 1.1, because the cases in the simplified version are not mutually exclusive.

Algorithm 1.1 Greatly simplified pseudocode for the Zuker algorithm for determining the minimum free energy secondary structure of a nucleic acid. The variant here is simplified for ease of understanding.
Zuker(a, b)
Require: Nucleotide sequence s_i with length L
Require: Dynamic programming matrix MFE_{i,j} for substrings of the nucleotide sequence
Require: Desired starting and ending nucleotides a and b, respectively
    if MFE_{a,b} = ∅ then
        MFE_{a,b} ← minimum of
            ∞                              (if a > b)
            E_unpaired                     (no pairs on [a, b])
            E_paired + Zuker(a+1, b-1)     (a and b are paired)
            Zuker(a, k) + Zuker(k+1, b)    (for all a < k < b)
    end if
    return MFE_{a,b}

The Zuker and McCaskill algorithms run in computational time O(N^3), as can be seen from algorithm 1.1, but they find only partition functions and minimum free energy structures of secondary structure without pseudoknots. The implementations we use are UNAfold (previously known as mfold) [64] and RNAfold from the ViennaRNA [65] package. There also exist implementations with more prohibitive computational complexity that can find optimal pseudoknotted structures with certain classes of pseudoknots, such as HotKnots [66] and pknotsRG [67]. The general problem of finding the minimum free energy structure including all degrees of pseudoknots is NP-hard [68].

1.2.3 RNAsubopt Program

The Zuker algorithm discussed in section 1.2.2 uses dynamic programming, that is, problem solving using a table to store intermediate results, in order to determine the minimum free energy of a structure. RNAsubopt adds backtracking in order to determine the structure from the dynamic programming matrix that the Zuker algorithm generates. RNAsubopt makes use of the dynamic programming matrix in order either to enumerate all structures with energy below a certain threshold or to give a Boltzmann-weighted sample of the equilibrium ensemble [69]. Pseudocode for RNAsubopt can be found in algorithm 1.2. In order to obtain the Boltzmann-weighted sample using RNAsubopt, for each potential pair (i, j) the choice is made between the cases unpaired, paired or substructures, based on the statistical weight of the partition function for that case. Substructures means that i pairs with k for some i < k < j. Though the number of secondary structures available to a sequence at a given energy grows exponentially for low energy, the total number of states is finite, and the enumeration of secondary structures below a given energy level is typically manageable for sequences of fewer than 50 nucleotides. This feature is used to determine exactly the energy barriers as well as the free energy basins of a sequence in section 1.2.5.

1.2.4 kinfold Monte Carlo

The kinfold model of nucleic acid kinetics [70] is utilized throughout this thesis. Kinfold itself entails two major ideas. The first is a move set which gives a notion of adjacency to secondary structures, allowing for the analysis of folding pathways and energy landscapes at the coarse-grained level of secondary structure. The allowed kinfold moves are to close a base pair, to open an existing base pair, and to change the base that an already paired base is paired with.
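The move set described above can be sketched as a neighbor enumeration on dot-bracket strings. This is a minimal illustration and not the kinfold implementation: the function names, the restriction to canonical and GU wobble pairs, and the minimum hairpin loop size of three unpaired bases are all assumptions made here for the example.

```python
from itertools import combinations

# Assumed pairing rules for this sketch: Watson-Crick pairs plus GU wobble.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired bases enclosed by a hairpin


def pairs_of(db):
    """Return the set of base pairs (i, j), i < j, in a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def to_db(pairs, n):
    """Convert a set of nested base pairs back to a dot-bracket string."""
    db = ["."] * n
    for i, j in pairs:
        db[i], db[j] = "(", ")"
    return "".join(db)


def can_add(seq, pairs, i, j):
    """True if pair (i, j) is allowed and compatible with the existing pairs."""
    if j - i <= MIN_LOOP or (seq[i], seq[j]) not in CANONICAL:
        return False
    for p, q in pairs:
        if len({i, j, p, q}) < 4:        # one of the bases is already paired
            return False
        if (p < i < q) != (p < j < q):   # crossing pairs would be a pseudoknot
            return False
    return True


def neighbors(seq, db):
    """All structures one move (close, open, or shift) away from db."""
    n, pairs, out = len(seq), pairs_of(db), set()
    for p in pairs:                                  # open an existing pair
        out.add(to_db(pairs - {p}, n))
    for i, j in combinations(range(n), 2):           # close a new pair
        if can_add(seq, pairs, i, j):
            out.add(to_db(pairs | {(i, j)}, n))
    for i, j in pairs:                               # shift: re-pair one base
        rest = pairs - {(i, j)}
        for keep in (i, j):
            for k in range(n):
                new = (min(keep, k), max(keep, k))
                if k != keep and new != (i, j) and can_add(seq, rest, *new):
                    out.add(to_db(rest | {new}, n))
    out.discard(db)
    return out
```

For the toy sequence GAAACC, the structure `(...).` has two neighbors under these assumptions: the open chain, reached by opening the pair, and `(....)`, reached by shifting the partner of the first base.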
These moves are shown in figure 1.9, and adjacency A between two states i and j is defined by

\[ A_{ij} = \begin{cases} 1 & i \text{ and } j \text{ separated by a single kinfold move},\ i \neq j \\ 0 & \text{otherwise.} \end{cases} \tag{1.2} \]

In kinfold, the rate to make each type of move depends only on the free energy difference between the beginning and ending states. Thus there is an implicit assumption that the rate-limiting steps of global reconfigurations belong to the space of secondary structure. This assumption is revisited in chapter 4.

The other important concept implemented in kinfold is kinetic Monte Carlo simulation of trajectories through the energy landscape of secondary structures

Algorithm 1.2 Greatly simplified pseudocode for the backtracking RNAsubopt algorithm for determining all structures with free energy below a given threshold. ⊕ signifies string concatenation.

subopt(a, b, E_max)
Require: Desired beginning nucleotide a and ending nucleotide b.
Require: Energy threshold E_max
Require: Nucleotide sequence s_i
Require: Complete memoization matrix MFE_{i,j} for substrings of the nucleotide sequence
    structs ← ∅
    if a > b then
        return structs
    end if
    if a = b then
        return "."
    end if
    if E_unpaired ≤ E_max then
        (The substring from a to b unpaired satisfies the threshold.)
        structs ← structs ∪ { "." repeated b-a+1 times }
    end if
    if E_paired + MFE_{a+1,b-1} ≤ E_max then
        (Nucleotides a and b paired with some substring satisfies the threshold.)
        structs ← structs ∪ "(" ⊕ subopt(a+1, b-1, E_max - E_paired) ⊕ ")"
    end if
    for k = a ... b do
        if MFE_{a,k} + MFE_{k+1,b} ≤ E_max then
            (Some substructures can satisfy the threshold.)
            substructs ← subopt(a, k, E_max - MFE_{k+1,b})
            for all s ∈ substructs do
                structs ← structs ∪ (s ⊕ subopt(k+1, b, E_max - energy(s)))
            end for
        end if
    end for
    return structs

using the Gillespie algorithm [71]. The Gillespie algorithm is a discrete approximation to the continuous dynamics described by the master equation,

\[ \frac{dP_i}{dt} = \sum_{j \neq i} R_{j \to i} P_j - \sum_{j \neq i} R_{i \to j} P_i, \tag{1.3} \]

where $R_{i \to j}$ is a transition rate from state i to state j and $P_i$ is the probability of being in state i. For transition rates, kinfold's simulation uses the familiar Metropolis or Kawasaki rate models

\[ R_{i \to j} = \begin{cases} A_{ij} \min\left(1, e^{-\Delta G_{i \to j}/RT}\right) & \text{(Metropolis)} \\ A_{ij}\, e^{-\Delta G_{i \to j}/2RT} & \text{(Kawasaki),} \end{cases} \tag{1.4} \]

with $\Delta G_{i \to j}$ the free energy change in going from state i to state j. These are only two options out of infinitely many possible choices of rate model which obey detailed balance, and the best model should be chosen based on agreement with experiment. The Kawasaki rate model gives preference to the most energetically favorable neighbor, whereas the Metropolis rate model chooses with equal statistical weight any energetically favorable neighboring state.

Since the master equation is clearly separated into arriving and departing terms for each state, it is convenient to think of the Metropolis and Kawasaki rates as a single rate matrix with diagonal element determined by $E(i) = \sum_{j \neq i} R_{i \to j}$. Then the rate matrix S is given as a function of the rates R between states, with the total exit rate for state i given by E(i):

\[ S_{i \to j} = \begin{cases} R_{i \to j} & i \neq j \\ -E(i) & i = j. \end{cases} \tag{1.5} \]

With this generalization, the master equation takes on the aesthetically pleasing form

\[ \frac{d\vec{P}}{dt} = S \vec{P}. \tag{1.6} \]

Note that the solution $\vec{P}(t) = \exp(St)\, \vec{P}(0)$ is a sum of exponentials and leads to a single exponential decay at long times.

During each step in the simulation of the Gillespie algorithm, the state i transitions to state j ≠ i with probability

\[ p_{i \to j} = \frac{R_{i \to j}}{E(i)}. \tag{1.7} \]

After the transition, the kinfold simulation timer is incremented by a dwell time drawn from the exponential distribution of dwell times in the beginning state, using a uniform random number $u \in (0, 1]$, as

\[ \Delta t = \frac{\ln(1/u)}{E(i)}. \tag{1.8} \]
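Equations 1.4, 1.7 and 1.8 can be combined into a single simulation step, sketched below. This is an illustrative sketch only and not kinfold's code: the function names, the three-state toy landscape, and the value of RT (roughly its value at 37 °C in kcal/mol) are assumptions introduced for the example, and times are in the unitless simulation units discussed in the text.

```python
import math
import random

RT = 0.616  # assumed: approximately RT in kcal/mol at 37 degrees Celsius


def metropolis(dG, RT):
    """Metropolis rate of equation 1.4 for adjacent states (A_ij = 1)."""
    return 1.0 if dG <= 0.0 else math.exp(-dG / RT)


def kawasaki(dG, RT):
    """Kawasaki rate of equation 1.4 for adjacent states (A_ij = 1)."""
    return math.exp(-dG / (2.0 * RT))


def gillespie_step(state, energy, neighbors, rate, RT, rng):
    """Perform one kinetic Monte Carlo step and return (next_state, dt)."""
    nbrs = list(neighbors[state])
    rates = [rate(energy[j] - energy[state], RT) for j in nbrs]
    total = sum(rates)                                   # exit rate E(i)
    # Equation 1.7: choose neighbor j with probability R_ij / E(i).
    next_state = rng.choices(nbrs, weights=rates, k=1)[0]
    # Equation 1.8: exponential dwell time from a uniform u in (0, 1].
    u = 1.0 - rng.random()
    return next_state, math.log(1.0 / u) / total


def first_passage_time(start, target, energy, neighbors, rate, RT, rng):
    """Accumulate dwell times until the trajectory first reaches target."""
    state, t = start, 0.0
    while state != target:
        state, dt = gillespie_step(state, energy, neighbors, rate, RT, rng)
        t += dt
    return t


# Hypothetical toy landscape: three states in a line with a barrier at state 1.
ENERGY = {0: 0.0, 1: 1.0, 2: -1.0}
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}
```

Calling `first_passage_time(0, 2, ENERGY, NEIGHBORS, metropolis, RT, random.Random(0))` gives one sample of the unitless first passage time over the barrier; averaging many such samples estimates a lifetime, which would still need the calibration to physical time.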

Since the master equation has solutions that are exponentials in the eigenvalues of R, it is reasonable that dwell times should themselves be exponentials as in equation 1.8; however, the sum of exponential distributions is not itself an exponential distribution, and as such equation 1.8 is an approximation to the true dynamics of the system.

It is noteworthy that two major assumptions have been made. First, the kinfold simulation has unitless rates. To address this, the simulation must be calibrated to real time by multiplying the time increment for each step by a factor κ with units of physical time. Second, kinfold's choice of unitless rates implies that κ is the same for each of the three processes seen in figure 1.9. Whether or not this is correct is more difficult to verify, and is best addressed by checking that the simulation and reality behave in comparable ways across many sequences, temperatures and processes. In fact, chapter 4 could be interpreted as a process for selecting values of κ that better reproduce the physics of the system.

The simple but powerful combination of the model of secondary structure adjacency with the Gillespie algorithm allows kinfold to quickly and efficiently simulate events that are known to take seconds or longer in vivo, such as loop opening or riboswitch folding.

Figure 1.9: The three types of move allowed by kinfold [70]. Labels in the figure: New Pair, Remove Pair, Shift.
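As a worked illustration of how the solution of equation 1.6 is built from exponentials, consider a hypothetical system with only two states. The symbols $a = R_{1 \to 2}$ and $b = R_{2 \to 1}$ are introduced here for illustration only. The master equation 1.3 then reads

\[ \frac{dP_1}{dt} = b P_2 - a P_1, \qquad \frac{dP_2}{dt} = a P_1 - b P_2 . \]

Using $P_1 + P_2 = 1$, the solution is

\[ P_1(t) = \frac{b}{a+b} + \left( P_1(0) - \frac{b}{a+b} \right) e^{-(a+b)t} . \]

The rate matrix of this system has eigenvalues $0$ and $-(a+b)$. The zero eigenvalue corresponds to the stationary distribution, and the nonzero eigenvalue sets the single exponential relaxation toward it. If the rates obey detailed balance, so that $a/b = e^{-(G_2 - G_1)/RT}$, the stationary value $b/(a+b)$ is the Boltzmann weight of state 1.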

1.2.5 barriers Program

The barriers program [72] from the ViennaRNA [65] package uses the notion of adjacency from the kinfold [70] move set illustrated in figure 1.9 and the enumerated states from RNAsubopt [69] in order to produce a list of local energy minima as well as the saddle points connecting them. The algorithm begins with a list of secondary structures available to the nucleic acid, sorted by energy. Then, for each structure in the list, one of three cases must hold. First, the structure could be adjacent to no other structure read so far. If this is the case, the structure is a local energy minimum. Second, the structure could be adjacent to structures in only a single energy basin. If this is the case, the structure is part of that particular energy basin. Third, the structure could be adjacent to structures in multiple energy basins. If this is the case, and this is the first such structure found that is adjacent to both basins, then the structure is a saddle point between the two basins. Pseudocode for the barriers program is given in algorithm 1.3. The use of the local minima and saddle points from barriers in order to determine Arrhenius lifetimes of nucleic acid secondary structures will be the central focus of chapter 3.
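The three cases described above can be sketched in a few lines on a generic landscape. This is a simplified illustration and not the barriers source code: the function name, the rule that a shallower basin is merged into the deepest adjacent basin once its saddle is found, and the five-state toy landscape are assumptions made for the example, and degenerate energies are not treated specially.

```python
def flood(energy, neighbors):
    """Classify states, read in order of increasing energy, into minima,
    basin members and saddle points.

    energy:    dict mapping state -> free energy
    neighbors: dict mapping state -> iterable of adjacent states
    Returns (minima, basin_of, saddles), where saddles maps a pair
    (shallower_minimum, deeper_minimum) to the lowest state connecting them.
    """
    basin_of = {}   # state -> local minimum whose basin it was assigned to
    parent = {}     # merged basins: minimum -> minimum it was merged into
    saddles = {}
    minima = []

    def find(m):
        # Follow merges to the basin that currently contains minimum m.
        while parent[m] != m:
            m = parent[m]
        return m

    for s in sorted(energy, key=energy.get):
        touched = {find(basin_of[n]) for n in neighbors[s] if n in basin_of}
        if not touched:
            # Case 1: adjacent to nothing read so far, so s is a local minimum.
            minima.append(s)
            parent[s] = s
            basin_of[s] = s
            continue
        deepest = min(touched, key=energy.get)
        basin_of[s] = deepest   # Case 2 when only one basin is touched.
        for m in touched - {deepest}:
            # Case 3: first state adjacent to both basins is their saddle.
            saddles[(m, deepest)] = s
            parent[m] = deepest  # merge, so later states see a single basin
    return minima, basin_of, saddles


# Hypothetical toy landscape: five states on a line, three local minima.
ENERGY = {0: -3.0, 1: -1.0, 2: -2.0, 3: 0.0, 4: -1.5}
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
MINIMA, BASIN_OF, SADDLES = flood(ENERGY, NEIGHBORS)
```

On this toy landscape the minima are states 0, 2 and 4, state 1 is the saddle between the basins of 2 and 0, and state 3 is the saddle between the basin of 4 and the merged basin of 0. The barrier height for leaving a minimum is then the saddle energy minus the energy of that minimum, for example `ENERGY[1] - ENERGY[2]`, which is the kind of quantity that would enter an Arrhenius lifetime estimate.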