Finding Consensus Energy Folding Landscapes Between RNA Sequences

University of Central Florida, Electronic Theses and Dissertations, Masters Thesis (Open Access), 2015. Joshua Burbridge, University of Central Florida. Part of the Computer Engineering Commons.

STARS Citation: Burbridge, Joshua, "Finding Consensus Energy Folding Landscapes Between RNA Sequences" (2015). Electronic Theses and Dissertations. 5032. http://stars.library.ucf.edu/etd/5032

FINDING CONSENSUS ENERGY FOLDING LANDSCAPES BETWEEN RNA SEQUENCES by JOSHUA BURBRIDGE B.S. University of Central Florida 2013 A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in the Department of Electrical Engineering and Computer Science in the College of Engineering and Computer Science at the University of Central Florida Orlando, Florida Summer Term 2015 Major Professor: Shaojie Zhang

© 2015 Joshua Burbridge

ABSTRACT

In molecular biology, the secondary structure of a ribonucleic acid (RNA) molecule is closely related to its biological function. One problem in structural bioinformatics is to determine the two- and three-dimensional structure of RNA using only sequencing information, which can be obtained at low cost. This entails designing sophisticated algorithms to simulate the process of RNA folding using detailed sets of thermodynamic parameters. The set of all chemically feasible structures an RNA molecule can assume, as well as the energy associated with each structure, is called its energy folding landscape.

This research focuses on defining and solving the problem of finding the consensus landscape between multiple RNA molecules. Specifically, we discuss how this problem is equivalent to the problem of Balanced Global Network Alignment, and what effect a solution to this problem would have on our understanding of RNA. Because this problem is known to be NP-hard, we instead define an approximate consensus on a landscape of reduced size, which dramatically reduces the search space associated with the problem. We use the program RNASLOpt to enumerate all stable local optimal secondary structures in multiple landscapes within a certain energy and stability range of the minimum free energy (MFE) structure. We then encode these using an extended structural alphabet and perform sequence alignment using a structural substitution matrix to find and rank the best matches between the sets based on stability, energy, and structural distance.

We apply this method to twenty landscapes from four sets of riboswitches from Bacillus subtilis in order to predict their native on and off structures. We find that this method significantly reduces the size of the list of candidate structures and raises the ranking of previously obscure secondary structures, resulting in more accurate predictions overall. Advances in the field of structural bioinformatics can help elucidate the underlying mechanisms of many genetic diseases.

ACKNOWLEDGMENTS

The author would like to thank his adviser and committee chair, Dr. Shaojie Zhang, for his invaluable intellectual and moral support during the research process.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
LIST OF EQUATIONS
CHAPTER ONE: INTRODUCTION
  1.1 General Background
    1.1.1 Dynamic Programming
    1.1.2 RNA
    1.1.3 RNA Secondary Structure
    1.1.4 RNA Folding
    1.1.5 Folding Landscapes
    1.1.6 Folding Pathways
  1.2 Consensus Folding Landscapes
  1.3 Riboswitches
  1.4 Predicting Native Structures of Riboswitches
CHAPTER TWO: PREVIOUS WORKS
  2.1 RNASLOpt
  2.2 RNAConSLOpt
  2.3 GraphClust
CHAPTER THREE: METHODOLOGY
  3.1 RNASLOpt
  3.2 Brand new Alphabet for RNA (BEAR)
  3.3 Substitution Matrices and MBR
  3.4 Datasets
  3.5 Software Pipeline
CHAPTER FOUR: RESULTS AND DISCUSSION
  4.1 Ranks of Target Matches in Complete Pairwise Alignment
  4.2 Comparison of Ranks to RNASLOpt Prediction
  4.3 Analysis of Consensus Landscapes
  4.4 Benchmarking
CHAPTER FIVE: CONCLUSIONS AND FUTURE WORK
  5.1 Conclusions, Advantages, Disadvantages, and Limitations
  5.2 Future Work and Alternative Approaches
APPENDIX: RNA SEQUENCES AND STRUCTURES USED
LIST OF REFERENCES

LIST OF FIGURES

Figure 1 - Example of the Secondary Structure of an RNA Molecule
Figure 2 - Examples of the Six Different Sub-Structural Elements
Figure 3 - MUSHI Pipeline
Figure 4 - Performance of RNASLOpt compared to performance of MUSHI
Figure 5 - Performance of RNASLOpt compared to performance of MUSHI (without adenine)
Figure 6 - Changes in target 'On' structures in adenine (graphed)
Figure 7 - Changes in target 'Off' structures in adenine (graphed)
Figure 8 - Changes in target 'On' structures in lysine (graphed)
Figure 9 - Changes in target 'Off' structures in lysine (graphed)
Figure 10 - Lysine native 'Off' structure
Figure 11 - Top structure returned by RNASLOpt for sequence lysine 3
Figure 12 - Top structure returned by RNASLOpt+MUSHI for lysine 3 after comparison with lysine 1
Figure 13 - Changes in target 'On' structures in TPP (graphed)
Figure 14 - Changes in target 'Off' structures in TPP (graphed)
Figure 15 - Changes in target 'On' structures in FMN (graphed)
Figure 16 - Changes in target 'Off' structures in FMN (graphed)
Figure 17 - Average effect of MUSHI on landscape size
Figure 18 - Structural similarity of RNASLOpt and RNAConSLOpt 'On' target structures to native 'On' structure
Figure 19 - Structural similarity of RNASLOpt and RNAConSLOpt 'Off' target structures to native 'Off' structure

LIST OF TABLES

Table 1 - RNASLOpt Parameters and Landscape Sizes
Table 2 - Results of aligning each landscape with native On and Off structures
Table 3 - Complete pairwise structural alignment for adenine On
Table 4 - Complete pairwise structural alignment for adenine Off
Table 5 - Complete pairwise structural alignment for lysine On
Table 6 - Complete pairwise structural alignment for lysine Off
Table 7 - Complete pairwise structural alignment for TPP On
Table 8 - Complete pairwise structural alignment for TPP Off
Table 9 - Complete pairwise structural alignment for FMN On
Table 10 - Complete pairwise structural alignment for FMN Off
Table 11 - Complete pairwise alignment rankings of On target matches for different values of α, β, γ, gap, and bonus respectively
Table 12 - Complete pairwise alignment rankings of Off target matches for different values of α, β, γ, gap, and bonus respectively
Table 13 - Effect of MUSHI on adenine landscape size
Table 14 - Changes in target On structures in adenine (tabulated)
Table 15 - Changes in target Off structures in adenine (tabulated)
Table 16 - Effect of MUSHI on lysine landscape size
Table 17 - Changes in target On structures in lysine (tabulated)
Table 18 - Changes in target Off structures in lysine (tabulated)
Table 19 - Effect of MUSHI on TPP landscape size
Table 20 - Changes in target On structures in TPP (tabulated)
Table 21 - Changes in target Off structures in TPP (tabulated)
Table 22 - Effect of MUSHI on FMN landscape size
Table 23 - Changes in target On structures in FMN (tabulated)
Table 24 - Changes in target Off structures in FMN (tabulated)
Table 25 - Performance of RNAConSLOpt on riboswitches from B. subtilis

LIST OF EQUATIONS

Equation 1 - Needleman-Wunsch recursive function
Equation 2 - Recursive function approximating landscape size
Equation 3 - Asymptotic formula approximating landscape size
Equation 4 - Recursive function for Nussinov's algorithm
Equation 5 - Recursive function for Zuker-Sankoff's algorithm
Equation 6 - Conserved topological interactions in global network alignment
Equation 7 - Total similarity score in balanced global network alignment
Equation 8 - Weighted sum in balanced global network alignment
Equation 9 - Covariance and conservation score from RNAalifold
Equation 10 - Bonus given to columns with compensatory mutations in RNAalifold
Equation 11 - Penalty given to columns not following the consensus structure in RNAalifold
Equation 12 - Log-odds score in substitution matrices
Equation 13 - Mattei-variant of the Needleman-Wunsch Algorithm
Equation 14 - Weighted sum ranking pairs of structures in MUSHI
Equation 15 - Stability ranking in MUSHI
Equation 16 - Energy ranking in MUSHI
Equation 17 - Structural similarity ranking in MUSHI

CHAPTER ONE: INTRODUCTION

1.1 General Background

Bioinformatics is one of the fastest-growing academic fields. Since the year 2000, rapid advances in processing power coupled with dramatic decreases in the cost of DNA sequencing technology have opened the floodgates for research institutions to carry out various studies of the human genome. Very often, a new research study will provide insight into one particular question while inevitably spawning multiple additional questions. Thus, we expect the exponential explosion of novel problems in bioinformatics to continue for some time. In this thesis, we formulate and suggest solutions to such a novel problem, one which has thus far remained largely untouched by other computer scientists: finding the consensus energy folding landscape between multiple RNA sequences.

Before presenting the problem statement, it will be necessary to briefly review the computational strategies used, as well as provide the appropriate background information necessary for understanding the problem. Due to the amount of background knowledge necessary to fully understand the problem, and in order to make this document accessible to readers from both Computer Science and Biology backgrounds, Chapter 1 is somewhat extensive. The reader is encouraged to skip over sections containing information with which they are already well-acquainted.

1.1.1 Dynamic Programming

Dynamic programming is a classic computational strategy central to many different problems in bioinformatics, including the problem of finding a consensus folding landscape. First formally proposed by Richard Bellman in 1953, dynamic programming

is an ambiguous name for a very clever strategy. The characteristic approach for a dynamic programming solution is to identify the different subproblems within a larger problem, solve these smaller subproblems, and then combine the smaller solutions into a solution to the overall problem.

A classic example of the power of dynamic programming is its use in generating Fibonacci numbers. The recursive definition of the Fibonacci sequence is F(n) = F(n-1) + F(n-2). This definition splits each instance of the problem into two smaller subproblems, and so a naïve recursive function to compute Fibonacci numbers would result in $O(2^N)$ time complexity. The power of dynamic programming lies in recognizing that there are only N distinct subproblems ($F_1, F_2, \ldots, F_N$), and so we can reduce the exponential solution to a linear solution simply by memoizing the result of each distinct subproblem.

One of the earliest and most well-known uses of dynamic programming in bioinformatics is the Needleman-Wunsch algorithm for global sequence alignment [1]. Biologists are often interested in computing the sequence identity of two polymers, usually DNA, RNA, or protein. Informally, the sequence identity specifies the similarity between two given strings. An alignment algorithm seeks the best possible mapping from characters in the first sequence to characters in the second, while also allowing insertion of gaps in either sequence. The algorithm operates by recognizing that an optimal alignment of the full sequences contains optimal alignments of their prefixes. We can proceed by first examining the last characters of both sequences. There are then three possibilities: they can be aligned, a gap can be inserted in the first sequence, or a gap can be inserted in the second sequence. The recursive function for the Needleman-Wunsch algorithm can be written as

$$D(i, j) = \max \begin{cases} D(i-1, j-1) + s(x_i, y_j) \\ D(i-1, j) + g \\ D(i, j-1) + g \end{cases}$$

$$s(x_i, y_j) = \begin{cases} x_{match} & \text{if } x_i = y_j \\ x_{mismatch} & \text{if } x_i \neq y_j \end{cases} \tag{1}$$

where g is the penalty for inserting a gap, $x_{match}$ is the bonus for aligning two identical characters, and $x_{mismatch}$ is the penalty for aligning two different characters. The value of these parameters can be adjusted to suit the needs of the specific application.

However, implementing the recursive function alone will not solve the problem. After transforming D into an iterative procedure, we will end up with an (M+1) by (N+1) matrix, in which D(m, n) represents the best possible score achievable by an optimal alignment of the first m characters of the first sequence and the first n characters of the second. To recover the alignment that generated this score, we perform a traceback procedure. While creating matrix D, we simultaneously create a traceback matrix T, where each cell T(i, j) corresponds to cell D(i, j). Each cell in T contains three Booleans, which we call left, up, and diagonal. When the value for D(i, j) is calculated, it is defined as the maximum of three candidate scores derived from D(i-1, j), D(i, j-1), and D(i-1, j-1). For each candidate that attains this maximum, the corresponding Boolean is set to true, indicating where the value D(i, j) was derived from, and ultimately marking the trail through the matrix that will show us the optimal solution.

When matrix D is filled, we begin the traceback procedure in the lower right cell T(M, N), and check each Boolean in the cell. If we find that T(M, N).left is true, this means the last character in the second sequence was matched with a gap in the first sequence, so we record this as the last column of the alignment. If T(M, N).diagonal is true, this means we matched the corresponding characters in both sequences with each other, so we add the last character of both sequences to the alignment. Finally, if T(M, N).up is true, we match the last character in the first sequence with an inserted gap in the second sequence. Proceeding in this fashion from the lower right corner of T to the upper left corner of T will generate an optimal alignment between the two sequences. It is important to note, however, that there may be more than one optimal alignment. In order to recover all of them, we need to write a procedure that can enumerate all distinct paths from T(M, N) to T(0, 0).

We sometimes expand upon the penalty for adding a gap into the alignment with an affine gap function. This is called the affine gap penalty, and it recognizes the fact that, from a biological standpoint, it may be more difficult for a sequence to mutate and open a gap (that is, to increase the length of the sequence) than to keep the characters aligned but mismatched, or to extend a gap that is already open. Clearly, the Needleman-Wunsch algorithm can be modified in many useful ways. This is important to note, as we will present a modified version of this algorithm in Chapter 3.

1.1.2 RNA

The primary focus of this thesis is to apply strategies like dynamic programming to a particular subfield within bioinformatics called structural bioinformatics. In structural bioinformatics, we are interested in characterizing the three-dimensional shape of various biological molecules. The biological molecule of interest in this study is called ribonucleic acid, also known as RNA. RNA plays a pivotal role in the process known to molecular biologists as the central dogma, which we will briefly explain.

It is now nearly universal knowledge that the physical and behavioral traits organisms pass down to their offspring are packaged in the form of genes. These genes are nothing more than complex sequences of the four nucleotides adenine, guanine, cytosine, and thymine,

chemically bonded to create a long, double-stranded polymer known as deoxyribonucleic acid, or DNA. Sequences of exactly three nucleotides in DNA, called codons, can each be translated into one amino acid, of which eukaryotes use twenty-one. Long sequences of amino acids called polypeptides fold into complex shapes and become proteins, which perform innumerable biological functions within the cell, many of which are still unknown.

However, DNA is not directly translated into protein. It relies on RNA as an intermediary. In order for a gene to be expressed in the cell, the two strands of its corresponding sequence in DNA must be pulled apart by an enzyme called RNA Polymerase. This enzyme attaches to one strand of the DNA and moves down the sequence, pausing at each nucleotide in the DNA to add the corresponding nucleotide to a growing chain matching the DNA sequence to be copied. Once finished, RNA Polymerase disconnects itself from the DNA, and releases the short molecule containing the copy of the sequence. The process of copying the sequence is called transcription, and the molecule which is a direct copy of the sequence is called RNA.

RNA differs from DNA in three major ways. First, it uses the nucleotide uracil in place of thymine. Therefore, if a DNA sequence were GCGCATA, the corresponding RNA sequence would be GCGCAUA. Second, RNA has a hydroxyl group attached to the 2′ position of its pentose ring, whereas DNA does not (hence the prefix "deoxy-"). This makes RNA less stable than DNA. Third and most importantly for this thesis, RNA is a single-stranded molecule, whereas DNA is double-stranded. This structural difference makes RNA more flexible and able to fold into complex shapes, which is often the key determinant of its function.

After transcription, the RNA molecule must be translated into an amino acid sequence to complete the gene expression process. However, before this can happen, some important

pre-processing steps must occur. The most important of these steps is called RNA splicing. Contrary to what was once popular belief, not all DNA directly codes for an amino acid sequence. In fact, in the human genome, only about 1% of our DNA will eventually be translated into protein. Approximately 25% of the remaining DNA has been associated with regulatory elements and other non-coding portions of genes, but the function of much of our genome is still unknown. What is clear, however, is that after a gene is transcribed, molecules called small nuclear ribonucleoproteins (snRNPs) cut the RNA molecule into fragments that can be classified as either exons or introns. Exons are stretches of RNA that will be translated into amino acid sequences, whereas introns may perform a variety of other functions. Once the process of RNA splicing is complete, the exons are, in essence, stitched back together, so that only the coding portion remains in the RNA molecule.

After undergoing some additional processing, the string of exons, now known as messenger RNA (mRNA), exits the nucleus and is picked up by a ribosome. The purpose of the ribosome is to scan through the RNA molecule and translate each codon into a single amino acid, thus building a chain of amino acids in the same way that RNA Polymerase builds a chain of nucleotides. This process is called translation.

The central dogma refers to the entire process in which DNA is transcribed into RNA, which is translated into proteins. While this is surely an oversimplification of the process (indeed, some viruses, known as retroviruses, use an enzyme called reverse transcriptase to attack the cell by reversing this process and inserting foreign sequences into the DNA), this basic explanation is sufficient for understanding the role that RNA plays in the cell, and why we are interested in predicting its shape.

In general, RNA can be broadly classified into two categories, coding RNA and non-coding RNA (ncRNA). ncRNA is defined as any RNA sequence that is not directly translated into a polypeptide. There are many different classes of RNA molecules. Messenger RNA (mRNA) is the concatenation of exons in a gene that undergoes translation. This is coding RNA. ncRNA contains many smaller groups. We have already discussed small nuclear ribonucleoproteins (snRNPs), which cleave RNA into exons and introns. As the name suggests, snRNPs are complexes that consist of RNA and proteins. The RNA portion of this complex is referred to as small nuclear RNA (snRNA). These molecules are typically about 150 nucleotides (nt) in length. Similarly, ribosomes are composed of RNA and protein molecules. The RNA in ribosomes is called ribosomal RNA (rRNA).

In order for the ribosome to match a codon with a particular amino acid, it bonds with transfer RNA (tRNA). tRNAs are a class of RNA molecules with a highly conserved cloverleaf shape. There exists a specific tRNA for each possible pairing of one codon with one amino acid. The tRNA's structure has a region that allows it to bond to a codon in an mRNA sequence, as well as a region that bonds to a specific amino acid. Thus, each tRNA acts as a sort of grammatical rule by adding its particular amino acid to the growing polypeptide in the ribosome when it recognizes the correct codon in the mRNA.

Some RNA molecules can silence the expression of genes by destroying the corresponding mRNA molecule in a process called RNA interference (RNAi). The two most important types of these molecules are called micro RNA (miRNA) and small interfering RNA (siRNA), both of which share roughly the same function but usually have dramatically different structures, thus affecting the specific mechanism by which they act. Small nucleolar RNAs (snoRNAs) function primarily in processing other RNA molecules, such as rRNA.

More classes of RNA exist than have been described here, and still more have yet to be discovered. Ultimately, there are two important facts that are central to this thesis. First, it is clear that RNA plays many complex roles in the cell beyond the simple storage of protein coding information. Second, and most importantly, the various functions of RNA molecules are differentiated primarily by the structure of the molecule itself. Thus, the more we understand about how RNA molecules acquire their three-dimensional shape, the more we can infer about their roles in gene regulation and expression.

1.1.3 RNA Secondary Structure

The structures of RNA and other biological molecules are typically classified in terms of primary, secondary, tertiary, and sometimes quaternary structure. An RNA molecule's primary structure refers to the specific sequence of the nucleotides it contains, and is simply written as a string: GCGCAUA. The chemical structure of RNA allows it to be very flexible such that, if it is energetically favorable, the molecule will fold back onto itself, allowing some nucleotides to form hydrogen bonds with other nucleotides farther downstream. We refer to these bonds as base pairs. In general, the following base pairs are energetically favorable, in order from most stable to least stable: G-C, A-U, G-U. The varying stability between these three possible base pairs is determined by the number of hydrogen bonds the nucleotides can form. When guanine bonds with cytosine, both components are held together by three hydrogen bonds, therefore conferring greater stability to the molecule as a whole than an A-U base pair, which is joined by two hydrogen bonds. Occasionally, it is possible for three nucleotides to form a more complex configuration in which all three are bonded to each other, resulting in a base triplet. However, this is an uncommon occurrence,

and it is usually safe to assume that if any two nucleotides form a base pair together, those nucleotides can no longer form base pairs with any other nucleotides in the sequence.

When describing the secondary structure of an RNA molecule, we typically number each nucleotide from 1 to N, where N is the length of the sequence. The notation for referring to a base pair is (i, j), where i and j refer to the i-th and j-th nucleotides in the sequence, respectively. Graphically, we may represent an RNA sequence and its corresponding structure as follows:

AAAGCGAAAGCGAAACGCAAAGCGAAACGCAAACGCAAA
...(((...(((...)))...(((...)))...)))...

This string representation corresponds to the following two-dimensional structure:

Figure 1 - Example of the Secondary Structure of an RNA Molecule

This is called dot-bracket notation. A dot in the i-th location in the structure indicates that, in this structure, the i-th nucleotide in the sequence is not part of any base pair, while an open or close parenthesis indicates that the associated nucleotide is paired with the nucleotide associated with the corresponding complementing parenthesis. Thus, a

secondary structure can be fully described by a sequence length N and a list of K base pairs $\{(i_1, j_1), \ldots, (i_K, j_K)\}$.

In addition to ignoring base triplets, we will also ignore pseudoknots. A pseudoknot is defined as a set of base pairs (i, j) and (i′, j′) such that i < i′ < j < j′. While such formations do appear in some RNA molecules, pseudoknots are considered uncommon if not rare [2], and excluding them from our model simplifies the problem of predicting secondary structure dramatically. Thus, the language describing the set of all valid RNA structures can be described as $L \subseteq \Sigma^*$, where $\Sigma = \{\, . \,,\ ( \,,\ ) \,\}$, such that all parentheses are balanced. This is a slight variation of the well-known Dyck language, which is context-free.

Longer, more complicated secondary structures have recurring motifs, which can be classified into six categories: hairpin loops, stacks, internal loops, bulges, multiloops, and dangling or unpaired bases. Figure 2 shows an example of a structure with each kind of motif. Because "motif" is a word with special connotations in sequence analysis, we henceforth refer to these as secondary sub-structural elements (SSEs).
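As a minimal sketch of how the dot-bracket notation described above maps to a list of base pairs, the following Python function matches parentheses with a stack and rejects unbalanced strings. The function name `parse_dot_bracket` is a hypothetical choice for illustration and is not taken from any tool cited in this thesis. Because only one bracket type is used, every string it accepts is pseudoknot-free by construction.

```python
def parse_dot_bracket(structure):
    """Return the sorted list of 1-based base pairs (i, j) in a dot-bracket string.

    Hypothetical helper for illustration only.
    """
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure, start=1):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


if __name__ == "__main__":
    # Structure of the example sequence shown with Figure 1.
    example = "...(((...(((...)))...(((...)))...)))..."
    for i, j in parse_dot_bracket(example):
        print(i, j)
```

The returned list, together with the string length, is exactly the (N, list of K base pairs) description of a secondary structure used in the text.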

Figure 2 - Examples of the Six Different Sub-Structural Elements

1.1.4 RNA Folding

It is important to note that one sequence may have many different possible structures. Based on the definition of admissible structures above, the size of the subset of words of length N+1 in the language L satisfies the recurrence

$$T(N + 1) = T(N) + \sum_{k=0}^{N-2} T(k)\, T(N - k - 1) \tag{2}$$

where T(N) is the number of possible structures of a sequence of length N [3]. The first term on the right hand side of the equation accounts for all possible structures where a new nucleotide N+1 remains unpaired, and the summation accounts for all possible structures when nucleotide N+1 forms a base pair with nucleotide k+1. We can also specify the base cases as T(0) = T(1) = T(2) = 1. This recurrence can be approximated by the formula [4]

$$T(N) \approx \sqrt{\frac{15 + 7\sqrt{5}}{8\pi}} \; N^{-3/2} \left( \frac{3 + \sqrt{5}}{2} \right)^{N} \tag{3}$$

which, for a sequence length of 150, equates to approximately $5.75 \times 10^{29}$, a colossal number of structures for a relatively short sequence. However, because we normally place further restrictions on the class of chemically feasible structures, this formula represents a vastly overestimated upper bound. Examples of these restrictions include setting a minimum for the number of nucleotides that must occur between any two base pairs. Clearly, two adjacent nucleotides are already bonded to each other along the backbone, and cannot be considered in the same base pair. Additionally, hairpin loops typically require at least three nucleotides between the closing members of the base pair because electromagnetic forces between nucleotides make it very difficult for a molecule to bend sharply, only including one or two nucleotides in the loop. Thus, three rules define the class of permissible RNA structures: (i) all parentheses must be balanced, (ii) we do not consider pseudoknots, and (iii) for any base pair (i, j), j − i ≥ 4.

Once we begin to consider the actual sequence information, further restrictions are placed on the set of permissible structures. Most importantly, we only consider the following pairs: G-C, A-U, and G-U. This property can alter the size of the set of permissible structures, taking it from only one structure (if, for example, the entire sequence is composed of a single nucleotide), up to a maximum that depends on the sequence itself. In general, the constraints imposed by the sequence data reduce the size of the landscape (a term formally defined in section 1.1.5) dramatically.
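Returning briefly to the counting argument, the recurrence of Equation 2 and the closed-form approximation of Equation 3 can be compared numerically. The sketch below is illustrative only: the function names `count_structures` and `asymptotic_count` are assumptions introduced here, and the code counts sequence-independent structures exactly as the recurrence defines them, without the additional restrictions (i) to (iii) or any base-pairing rules.

```python
import math


def count_structures(n):
    """Evaluate T(n) from Equation 2 bottom-up, with T(0) = T(1) = T(2) = 1."""
    t = [1, 1, 1]
    for m in range(2, n):
        # T(m+1) = T(m) + sum_{k=0}^{m-2} T(k) * T(m-k-1)
        paired = sum(t[k] * t[m - k - 1] for k in range(0, m - 1))
        t.append(t[m] + paired)
    return t[n]


def asymptotic_count(n):
    """Evaluate the approximation of Equation 3 for n >= 1."""
    constant = math.sqrt((15 + 7 * math.sqrt(5)) / (8 * math.pi))
    return constant * n ** -1.5 * ((3 + math.sqrt(5)) / 2) ** n


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        exact = count_structures(n)
        approx = asymptotic_count(n)
        print(f"N={n:3d}  exact={exact:.4e}  approx={approx:.4e}  ratio={approx / exact:.3f}")
```

Storing each T(m) in a list once it is computed is the same memoization idea described for Fibonacci numbers in section 1.1.1; without it, the recurrence would be re-evaluated exponentially many times.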

Simple thermodynamics requires that complex molecules conform into energetically stable structures. Thus, the first attempts at predicting RNA secondary structure focused on selecting the structure with minimum free energy (MFE) from the set of permissible structures. The earliest algorithm to achieve a reasonable solution to this problem was given by Nussinov et al. in 1980 [5]. Because base pairs contribute to the stability of the overall structure, this dynamic programming solution focuses on finding the structure in which the number of base pairs is maximized. The recursive function for the algorithm can be written

$$M(i, j) = \max \begin{cases} M(i, j-1) \\ M(i, k-1) + M(k+1, j-1) + \delta(k, j) & \text{for } i \le k < j \end{cases} \tag{4}$$

where δ(k, j) = 1 if the respective nucleotides can form a base pair. Note that this scoring function does not differentiate base pairs by stability, but it can easily be modified to do so. Filling this matrix requires $O(N^3)$ time.

However, base pairs are not the only elements within a structure that affect its overall stability. SSEs in various forms such as hairpin loops or internal loops destabilize the overall structure, but some loops destabilize it more than others. Additionally, there may be multiple structures with the maximum number of base pairs, and this solution would be unable to differentiate between them. Therefore, in order to determine the true MFE structure, a good folding algorithm must account for a larger set of parameters than Nussinov's algorithm.

A more sophisticated algorithm was developed by Zuker and Stiegler in 1981, refined by Zuker and Sankoff in 1984, and again by Sankoff in 1985 [2, 3, 6]. This algorithm takes into account the destabilizing energies of SSEs such as hairpin loops and internal loops. Its recursive function can be written

$$W(i, j) = \min \begin{cases} V(i, j) \\ W(i+1, j) \\ W(i, j-1) \\ \min_{i \le k \le j-1} \{ W(i, k) + W(k+1, j) \} \end{cases}$$

$$V(i, j) = \min \begin{cases} eh(i, j) \\ V(i+1, j-1) + es(r_i, r_j, r_{i+1}, r_{j-1}) \\ \min_{i < i' < j' < j,\ 2 < i' - i + j - j'} \{ V(i', j') + ebi(i, j, i', j') \} \\ \min_{i+1 \le k \le j-2} \{ W(i+1, k) + W(k+1, j-1) \} + a \end{cases} \tag{5}$$

where W(i, j) is the score of the best possible structure on subsequence [i..j], and V(i, j) is the score of the best possible structure on subsequence [i..j] with the added assumption that i and j form a base pair in that particular structure. Additionally, eh(i, j) is the energy penalty of a hairpin loop closed by base pair (i, j); $es(r_i, r_j, r_{i+1}, r_{j-1})$ is the energy bonus given by stacking base pairs (i, j) and (i+1, j-1) together, a function that depends specifically on the nucleotides involved in the stacking; ebi(i, j, i′, j′) is the energy penalty given by the internal loop or bulge closed by base pairs (i, j) and (i′, j′); and a is the energy penalty given by opening a new multiloop structure.

In simple terms, the matrix W accounts for four possibilities: i and j form a base pair, i is unpaired, j is unpaired, or both i and j are in a base pair, but not with each other. Note that the possibility that both i and j are unpaired can be accounted for by choosing the second and third options consecutively. The matrix V assumes i and j form a base pair, and then determines the SSE of minimum energy that could be closed by (i, j). The advantage of this algorithm is that the functions eh, es, ebi, and a can be continuously updated to reflect parameters given by new experiments. Michael Zuker's MFOLD software package uses a version of this algorithm which runs in $O(N^4)$ time.
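For comparison with the energy-based recurrences, the simpler base-pair maximization of Equation 4 can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the code of any tool cited here: the name `nussinov`, the `min_loop` parameter (defaulting to 3 so that every pair satisfies j − i ≥ 4), and the choice to skip the pairing case entirely when two nucleotides cannot pair are all additions made for this example. It returns only the maximum number of base pairs and implements none of the energy terms of Equation 5.

```python
# Pairs considered admissible in the text: G-C, A-U, and G-U (in either order).
ALLOWED_PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return the maximum number of non-crossing base pairs for seq (Equation 4)."""
    n = len(seq)
    if n == 0:
        return 0
    # m[i][j] holds the best score on the 0-based subsequence seq[i..j].
    m = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = m[i][j - 1]                    # case 1: j is unpaired
            for k in range(i, j - min_loop):      # case 2: j pairs with k
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = m[i][k - 1] if k > i else 0
                    inner = m[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + inner + 1)
            m[i][j] = best
    return m[0][n - 1]


if __name__ == "__main__":
    print(nussinov("GGGAAAUCC"))
    print(nussinov("AAAGCGAAAGCGAAACGCAAAGCGAAACGCAAACGCAAA"))
```

The three nested loops over span, i, and k account for the cubic running time noted for Nussinov's algorithm. Recovering the structure itself, rather than only its score, would require a traceback analogous to the one described for Needleman-Wunsch in section 1.1.1.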

However, while this algorithm has been used to successfully compute the MFE structure of many different sequences, it has been shown that the MFE structure is not often the native structure assumed by an RNA molecule [7]. This is because the native structure depends on the dynamic folding process of the molecule, which can depend on how the molecule interacts with external forces, and is thus extremely difficult to predict. Very often, the RNA molecule will end up in a structure that is both energetically favorable and stable, but is not necessarily the structure with the minimum possible energy given its sequence. Thus, algorithms that find the MFE structure are limited, and we must continue to search for new and creative ways to visualize and solve the problem.

1.1.5 Folding Landscapes

We discussed earlier the notion of a folding landscape, which is composed of all possible structures an RNA molecule can take. The landscape can naturally be represented as a graph in which any two structures that differ only by the addition or subtraction of a base pair are connected by an edge. These edges can also be directed if it is desired that an edge represents a transition from a structure with higher free energy to a structure with lower free energy. The edges can also be weighted if it is desired that an edge should include the difference in free energy between two structures. However, these are just alternate representations of the same information. For the purposes of this research, we will represent the landscape as an unweighted, undirected graph. Each node is encoded with the free energy of its associated structure.

As previously discussed, the size of the folding landscape is enormous, but finite. RNAsubopt, introduced by Wuchty et al. in 1999 [8], is capable of enumerating all suboptimal structures in the energy range from the MFE to some arbitrary upper limit.
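The edge relation of this graph can be illustrated with a short sketch that lists the neighbors of one structure, that is, every structure reachable by removing one base pair or adding one non-crossing base pair. The names `pair_table` and `neighbors`, the pairing rule, and the minimum hairpin loop size `min_loop=3` are hypothetical choices for this illustration and are not part of RNAsubopt or any other cited tool.

```python
# Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def pair_table(structure):
    """Map every paired position of a dot-bracket string to its partner."""
    stack, partner = [], {}
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            open_pos = stack.pop()
            partner[open_pos] = pos
            partner[pos] = open_pos
    return partner


def neighbors(seq, structure, min_loop=3):
    """Return all structures that differ from `structure` by one base pair."""
    partner = pair_table(structure)
    result = set()
    # Moves that remove an existing base pair.
    for i, j in partner.items():
        if i < j:
            s = list(structure)
            s[i] = s[j] = "."
            result.add("".join(s))
    # Moves that add a pair between two unpaired positions of the same loop.
    n = len(seq)
    for i in range(n):
        if structure[i] != ".":
            continue
        j = i + 1
        while j < n:
            if structure[j] == "(":
                j = partner[j] + 1   # jump over an enclosed helix
                continue
            if structure[j] == ")":
                break                # left the loop that contains i
            if j - i > min_loop and (seq[i], seq[j]) in PAIRS:
                s = list(structure)
                s[i], s[j] = "(", ")"
                result.add("".join(s))
            j += 1
    return result
```

For example, under these assumptions the open chain of `GAAAC` has a single neighbor, `(...)`, and that hairpin's only neighbor is the open chain again.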

Additionally, BARRIERS, introduced by Flamm et al. in 2002 [9], is capable of constructing the exact energy landscape by establishing that the topology of the landscape always forms a hierarchy, and so the landscape can be represented in a form called a barrier tree. Therefore, it is possible to fully realize an actual folding landscape. However, the size of a landscape for any RNA sequence of sufficient length is overwhelmingly large, so we are still in essence restricted to working with the portion of the landscape within a certain percentage of the MFE, rather than the full landscape.

To solve the problem of finding a consensus folding landscape, we must first start with a rigorous mathematical definition of a landscape as well as its interesting features. Flamm et al. have already provided such definitions [9], and the relevant ones will be explained here. Let a configuration space be a graph $G(V, E)$, where each $v \in V$ represents a unique secondary structure, and each pair of vertices $v_1, v_2 \in V$ whose associated structures in dot-bracket notation differ only by the addition or deletion of a single base pair is connected by an unweighted, undirected edge $e \in E$. Then a landscape $L(G, f)$ is defined by a configuration space and a function $f: V \to \mathbb{R}$. In the case of our RNA landscape, this function $f$ is precisely the function that calculates the free energy of a structure.

There are a few intuitive assumptions we can make about identifying the properties of native structures from a folding landscape. The first and most obvious is that a native structure will be locally optimal, or equivalently a local minimum in the landscape. This makes sense because any structure that is not a local minimum can spontaneously transform into a structure that is a local minimum by adding another base pair, much like a ball rolling down a hill. Thus, a folding RNA molecule will not come to rest in a conformation that is

not a local minimum. The second assumption is that the native structure should be stable. Specifically, the stability of a structure $x$ refers to the difference in energy between $x$ and the height of the lowest saddle point between $x$ and any other local minimum $y \in V$. This is an important feature of the landscape because for RNA molecules to function reliably, they should be strongly resistant to external forces that may act on the structure, breaking some base pairs and possibly pushing the molecule into a new energy basin and structure. Because local optimality and stability are two independent properties, it is important to note that for two local minima $A$ and $B$, $A$ may have greater total energy than $B$, but it may also be more stable. The third assumption deals with the notion of accessibility of a local minimum. That is, it may be possible to have a local minimum with low total energy which is also highly stable, but for which the probability that a random structure in $V$ lies in its associated basin is far less than the probability of the same event for a different basin. If we visualize an energy basin as generally assuming the shape of a bowl, the accessibility property refers to the width of the bowl. Flamm et al. have given us explicit ways to measure these three properties, as we will now see.

The neighbors of a vertex $v \in V$ can be defined by the set $\partial v = \partial\{v\} = \{y \in V \mid \{v, y\} \in E\}$. This definition extends to the neighboring set of a connected component: $\partial A = \{y \in V \setminus A \mid \exists x \in A : \{x, y\} \in E\}$. $\partial A$ is called the boundary of $A$. The neighborhood of $A$ is the union of the set $A$ and its boundary. That is, $N(A) = A \cup \partial A$. A vertex $x$ is a local minimum if $f(x) \le f(y)$ for all $y \in \partial x$. If the inequality is changed to $f(x) < f(y)$, $x$ is called a strict local minimum. $M$ is the set of all local minima in the landscape. As mentioned, it is assumed that the vertex representing the native structure of an RNA molecule belongs to $M$. A landscape is non-degenerate or invertible if $f(x) = f(y) \Rightarrow x = y$ for all $x, y \in V$. Some degenerate RNA

folding landscapes exist. A walk of length $k$ in the landscape is a sequence of vertices $p = (x_1, x_2, \ldots, x_k)$ such that each $x_i \in V$ and $\{x_i, x_{i+1}\} \in E$. The set of all possible walks between vertices $x$ and $y$ is denoted by $P_{xy}$. Two vertices $x$ and $y$ are mutually accessible at height $\eta$ if there exists a walk $p \in P_{xy}$ such that $f(z) \le \eta$ for all $z \in p$. Finally, the saddle height $\hat{f}(x, y)$ between two vertices $x$ and $y$ is the minimum height at which they are mutually accessible. That is, $\hat{f}(x, y) = \min_{p \in P_{xy}} \max_{z \in p} f(z)$.

With these definitions, we can measure the three characteristics discussed previously. A structure is locally optimal if its associated vertex is a local minimum in the landscape. The stability of a structure $x$ can be measured from the minimal saddle height between $x$ and any other local minimum: it is the difference between that height and $f(x)$. When discussing the saddle height between two adjacent local minima, the number represents the energy required to transform one structure into the other (for the purposes of this thesis, adjacent means that the shortest walk between two local minima can be divided into one ascending phase followed by one descending phase).

1.1.6 Folding Pathways

Assuming that the dynamic folding process occurs by the discrete addition or subtraction of individual base pairs, it can be modeled as a path through the landscape. Here, discrete means that no two base pairs are altered at precisely the same time, so a path denotes the series of transformations an RNA molecule assumes during the folding process. Naturally, we are primarily interested in the end point of this path, which we assume lies in a stable energy basin. However, this can depend on the molecule's starting position. If we assume in vitro folding, the molecule begins in a flat conformation, containing no base pairs. Then, small fluctuations in thermal noise allow the molecule to begin folding, travelling a

path in the landscape that is almost entirely determined by sequence information. However, this is not necessarily the case in vivo. In the cell, there may be many external forces that can interfere with the folding process. More importantly, when RNA molecules are assembled in the transcription process, the free-floating portion of the molecule may begin to fold before the rest of the molecule is assembled. Because each nucleotide added to the sequence adds a new set of structures to the folding landscape, the RNA molecule is constantly jumping between different landscapes until it arrives at the final one. The structure of the molecule at this point is then considered its starting point in the final landscape, and in the absence of any external stabilizing forces can drastically alter the end point of the path, or the final structure. This is akin to dropping a marble directly above different locations on a complex topological surface. Assuming the marble can only roll downhill (that is, there are no external forces acting upon the system), choosing an arbitrary starting location for the marble necessarily excludes a set of final locations. Thus, even if we could perfectly determine the step an RNA molecule may take at each point in the landscape, we still cannot predict the final vertex without knowing the starting vertex. The accessibility property discussed in the previous section has important implications for folding pathways. If we assume that the folding pathway for an RNA molecule ends in a local minimum that is highly inaccessible, there is a greater probability that some external force may act on the molecule before it reaches the basin, and because the basin is so inaccessible, the molecule may never recover from this error and will end up misfolded. 
Thus, it seems clear that pressure from natural selection would also drive the creation of nucleotide sequences for which the target structure in the folding landscape lies in a highly accessible basin, therefore minimizing the occurrences of a misfold.
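The definitions of section 1.1.5 and the downhill picture of a folding pathway can be made concrete on a toy landscape. The sketch below is illustrative only: the five-vertex path graph and its energies are invented, the function names are hypothetical, and the saddle height is computed with a minimax variant of Dijkstra's algorithm rather than with the flooding procedure used by BARRIERS [9].

```python
import heapq


def local_minima(energy, adj):
    """Vertices x with f(x) <= f(y) for every neighbor y."""
    return {v for v in energy if all(energy[v] <= energy[u] for u in adj[v])}


def saddle_heights(energy, adj, source):
    """Lowest achievable peak energy on a walk from source to each vertex."""
    best = {source: energy[source]}
    heap = [(energy[source], source)]
    while heap:
        height, v = heapq.heappop(heap)
        if height > best[v]:
            continue  # stale heap entry
        for u in adj[v]:
            peak = max(height, energy[u])
            if peak < best.get(u, float("inf")):
                best[u] = peak
                heapq.heappush(heap, (peak, u))
    return best


def stability(energy, adj, x):
    """Barrier from x to the nearest other local minimum (needs at least one)."""
    heights = saddle_heights(energy, adj, x)
    others = local_minima(energy, adj) - {x}
    return min(heights[y] for y in others if y in heights) - energy[x]


def downhill_walk(energy, adj, start):
    """Follow the steepest descent until no neighbor has lower energy."""
    v = start
    while True:
        lower = [u for u in adj[v] if energy[u] < energy[v]]
        if not lower:
            return v
        v = min(lower, key=lambda u: energy[u])


# Toy landscape: the path a-b-c-d-e with invented energies.
energy = {"a": -5, "b": -1, "c": -3, "d": 0, "e": -4}
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
```

In this toy example the local minima are `a`, `c`, and `e`. The minimum `c` has stability 2, while `a` and `e` both have stability 4, and a walk started at `b` ends in `a` whereas a walk started at `d` ends in `e`, which mirrors the observation that the starting vertex excludes some final structures.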

1.2 Consensus Folding Landscapes

At last, we are ready to move to the primary focus of this thesis: defining and solving the problem of finding the consensus folding landscape between two or more RNA sequences. To the best of our knowledge, Li et al. [10] are the only group who have researched this subject, but even in that paper, the problem is not formally defined. This thesis will give a formal definition of the problem, explain the efficacy of some solutions on constrained versions of the problem, and discuss what implications solving this problem could have for the field of structural bioinformatics.

Because we have already defined a folding landscape as a graph, the intuitive notion for a consensus folding landscape is that it should be an alignment of two graphs that conserves a maximal amount of correspondence between them. If two landscapes can be said to have a consensus in some region, then both of the associated RNA molecules should be governed by the same set of folding dynamics when they are in those regions. How can we formulate this problem? Clearly, one aspect should involve comparing the nodes of each graph, and finding the best correspondence between the two sets. The best correspondence could be defined in terms of structural similarity, total energy level, stability, or some combination of these factors. In fact, this is exactly the approach we take in Chapter 3. However, finding such a correspondence is not enough. Because a folding landscape defines how a molecule behaves during the folding process, a consensus landscape should also define how both molecules behave when they are in the consensus zone. Therefore, a consensus landscape should maximize the size of a correspondence between the vertices of two networks while also conserving topological information.

Although no one has yet defined the problem of finding a consensus folding landscape, it follows from the previous discussion that this problem is equivalent to the Global Network Alignment (GNA) problem. This problem is defined rigorously in relation to protein-protein interaction networks by Zaslavskiy et al. [11]. The definition we are about to give is more or less a transcription of their definition.

Assume we are trying to find some global alignment between graphs $G$ and $H$. We begin by assuming $G$ and $H$ have the same number of vertices. Even though this is rarely the case, we can simulate this property by adding some number of dead vertices to each graph so that they have the same order. If any particular alignment maps a live vertex in $G$ to a dead vertex in $H$, this means that the associated structure in $G$ does not map to any real structure in $H$. Now that the vertex sets are the same size, we are looking for a bijection between the two sets that maximizes the number of conserved topological interactions. Such a mapping is given by a permutation $\pi$ of $\{1, 2, \ldots, N\}$, which index the vertices of $G$. In each permutation, the $i$-th vertex of $G$ is mapped to the $\pi(i)$-th vertex of $H$. Each permutation $\pi$ can also be represented by a permutation matrix $P$, where $P_{ij} = 1$ if and only if $\pi(i) = j$. Then, the set of all possible permutation matrices is defined by

\[
\mathcal{P} = \{ P \in \{0, 1\}^{N \times N} \mid P \mathbf{1}_N = \mathbf{1}_N,\ P^{T} \mathbf{1}_N = \mathbf{1}_N \},
\]

where $\mathbf{1}_N$ is the column vector with $N$ entries all equal to 1. For clarity, the identity matrix $I_{N \times N}$ represents the permutation where the first vertex of $G$ is mapped to the first vertex of $H$, the second vertex of $G$ is mapped to the second vertex of $H$, and so on. The two conditions on the right side of the definition of $\mathcal{P}$ mean that each row and each column in a permutation matrix $P$ must sum to exactly one. This is because each vertex in $G$ must map to exactly one vertex in $H$, and the relationship is bidirectional.
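A small sketch may help fix the notation. It builds the permutation matrix for a given $\pi$, checks the row and column conditions, and counts the interactions of $G$ that a permutation conserves in $H$. All names are hypothetical, indices are 0-based rather than 1-based, and the exhaustive search over all $N!$ permutations is an assumption made purely for illustration; it is only feasible for a handful of vertices and is not the method of Zaslavskiy et al. [11].

```python
from itertools import permutations


def permutation_matrix(pi):
    """P[i][j] = 1 exactly when vertex i of G is mapped to vertex j of H."""
    n = len(pi)
    return [[1 if pi[i] == j else 0 for j in range(n)] for i in range(n)]


def is_permutation_matrix(P):
    """Check that P is binary and every row and column sums to one."""
    n = len(P)
    binary = all(x in (0, 1) for row in P for x in row)
    rows_ok = all(sum(row) == 1 for row in P)
    cols_ok = all(sum(P[i][j] for i in range(n)) == 1 for j in range(n))
    return binary and rows_ok and cols_ok


def conserved_edges(A_G, A_H, pi):
    """Count edges (i, j) of G whose images pi(i), pi(j) are adjacent in H."""
    n = len(pi)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if A_G[i][j] and A_H[pi[i]][pi[j]]
    )


def best_alignment(A_G, A_H):
    """Brute-force search over all N! bijections (tiny graphs only)."""
    n = len(A_G)
    return max(permutations(range(n)), key=lambda pi: conserved_edges(A_G, A_H, pi))


# G is the path 0-1-2; H is a path whose middle vertex is labeled 0.
A_G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_H = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
```

Here the identity mapping conserves only one of the two edges of `A_G`, while a mapping that sends the middle vertex of `G` to vertex 0 of `H` conserves both.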

We can score each permutation by recording the number of interactions it conserves. That is, we are interested in the number of vertex pairs $(i, j)$ that are connected in $G$ where $\pi(i)$ and $\pi(j)$ are also connected in $H$. We denote this number by $J(P)$, and it is clear that we seek to maximize this quantity. If we apply the permutation encoded by $P$ to the graph $H$, we can obtain a new graph isomorphic to $H$, denoted by $P(H)$. This is because the permutation simply shuffles the vertex labels. Note that this is not equivalent to computing the product $P A_H$. Now, the adjacency matrix for the permuted graph, $A_{P(H)}$, can be obtained from the product $P A_H P^{T}$ [12]. Because adjacency matrices are symmetric, $J(P)$ is then equivalent to half the number of entries in both $A_G$ and $A_{P(H)}$ that are simultaneously equal to 1. Formally,

\[
J(P) = \frac{1}{2} \sum_{i,j=1}^{N} [A_G]_{ij} \, [A_{P(H)}]_{ij} \tag{6}
\]

Zaslavskiy et al. [11] define two formulations of this problem. The first is referred to as the Constrained Global Network Alignment problem, and can be applied in the case where we have a list of candidate matchings between vertices, and we simply want to disambiguate the set of matchings by disallowing some correspondences, which allows us to rule out large numbers of permutations simultaneously. For the purposes of finding a consensus folding landscape, however, we can make no such obvious constraints on which structures can match. Therefore, we will focus on the second formulation of the problem given by Zaslavskiy et al. [11], which is called the Balanced Global Network Alignment problem. This formulation takes into account the degrees of similarity between all vertices in the network, and allows for the trade-off or balancing between vertex similarity and topological information. Assuming we have an $N \times N$ similarity matrix $C$ in which $C_{ij}$

denotes the similarity score between vertex $i \in G$ and vertex $j \in H$, the total similarity for any permutation $P$ can be denoted

\[
S(P) = \sum_{i=1}^{N} C_{i, \pi(i)} \tag{7}
\]

Finally, we can state the BGNA problem as the optimization of the following weighted sum

\[
\max_{P \in \mathcal{P}} \; \lambda J(P) + (1 - \lambda) S(P) \tag{8}
\]

where $\lambda$ is a weighting factor determining the relative importance of topological information and vertex similarity, respectively.

If the Balanced GNA problem could be solved exactly, it would illuminate many aspects of RNA folding landscapes. Currently, little is known about quantifying the change enacted on a landscape by a change in the RNA sequence. Two identical RNA sequences also have identical folding landscapes, but if we change a single nucleotide in the second sequence, how dramatically does its landscape change? We could quantify this using the change in the best alignment given by a solution to BGNA. We could then measure the point at which two landscapes diverge from each other due to differences in the primary sequence. This could establish a sequence identity threshold between two RNA molecules, below which any consensus folding landscape of sufficient size could be said to be statistically significant. In clearer terms, we mean that the consensus landscape between two sequences that differ only by a single nucleotide is more likely to be a result of high sequence identity between the two molecules, rather than having any biological significance. However, if two molecules with lower sequence identity share a relatively large consensus, this

information is likely to clue us in on significant biological structures, reducing the list of candidate structures, and improving native structure prediction significantly. Unfortunately, we can currently only speak of these advantages in general terms, because the intractability of BGNA prohibits us from quantifying exactly how much a solution would help us.

The Balanced GNA problem is known to be NP-hard. The number of possible permutations is $N!$, where $N$ is even greater than the estimated size of the landscape discussed in section 1.1.4 due to the additional dead vertices required by the problem formulation. A number of approximate solutions to the problem have been proposed [11, 13, 14], but even these are incapable of processing RNA folding landscapes, which are enormous in size. Thus, we will also propose an approximate solution specific to RNA folding landscapes, in which we perform a similar matching on a landscape whose size has been dramatically reduced by predicting and only considering the stable local optimal structures. The method for predicting these points of interest on the landscape is discussed in section 2.1, and methods for redefining and finding a consensus are discussed in chapter 3.

1.3 Riboswitches

The transcription and translation of mRNA are largely regulated by ncRNA. However, mRNA molecules are also capable of self-regulation. Riboswitches are cis-regulatory elements usually found in the 5' UTR of mRNA molecules. Most known riboswitches exist in the RNA of bacteria, though the existence of at least one riboswitch has been verified in eukaryotes [15]. Riboswitch sequences are relatively short (usually around 150 nt long) and consist of two primary domains. The aptamer domain is responsible for binding a very specific ligand, whose presence in the cellular solution triggers changes in mRNA

processing. The second domain is the expression platform, which is the portion of the sequence that can act as an interface between mRNA and ribosomes or other cellular machinery. The aptamer domain and expression platform overlap in the sequence. The overlapping portion is called the switching sequence, and it undergoes structural changes in response to the binding of a ligand to the aptamer domain. The result is that the riboswitch can assume two native structures, commonly referred to as the on and off positions. Thus, while most ncRNA molecules assume one stable local optimal structure in the folding landscape, riboswitches assume two native structures, and the binding of the ligand to the aptamer domain provides a mechanism for the riboswitch to follow a path from one energy basin to another and back. The exact mechanism by which this structural change occurs depends on the riboswitch, and requires detailed knowledge of the atomic structure of the molecule.

There are three primary mechanisms by which riboswitches can regulate RNA. The first is transcription termination. This occurs after the riboswitch has been transcribed from a DNA sequence but before the rest of the gene has been transcribed by RNA polymerase. In the presence of the appropriate ligand, the switching sequence may form what is known as a rho-independent hairpin loop, which causes the transcription process to stall. The hairpin loop is then followed by a polyuracil chain, which destabilizes the transcription complex, letting it detach from the DNA strand and ending the process prematurely. Because this step occurs before the alternative splicing of the gene, the riboswitch also disallows the introns from being transcribed, and therefore it also regulates ncRNA. The second mechanism is translation inhibition. In this case, the expression platform coincides with the ribosome-binding site on the mRNA, and the switching sequence partially alters

its structure. This makes the mRNA unable to bind a ribosome, prohibiting translation from occurring. The third mechanism is ribozyme activation. In this case, the riboswitch acts as a ribozyme, and when activated by a ligand, automatically cleaves itself, effectively destroying the mRNA in the process. Riboswitches perform regulatory duties in many other ways, but these are the most common.

1.4 Predicting Native Structures of Riboswitches

In this thesis, we are interested in using an approximate solution to the global network alignment problem to predict the native structures of different riboswitches. The rationale behind this is that different RNA sequences may regulate the same genes and bind the same ligands, and are therefore considered the same riboswitch, even though the primary structure of these sequences may differ somewhat. Because each riboswitch must bind a single particular ligand, the structure of the aptamer domain must be very strongly conserved among all sequences comprising the same riboswitch. The riboswitch would not function correctly if it were to accidentally bind the wrong ligand. Thus, though the different manifestations of the same riboswitch may differ in sequence and therefore folding landscape, we can make a strong assumption that there must be two structures common to both landscapes: the native on and off structures of the riboswitch. Current methods of secondary structure prediction tend to focus on the candidate structures derived from a single sequence, and the result is that viable theoretical candidates may be returned, but without having any biological significance. However, if we compare the points of interest across multiple landscapes, we may be able to narrow our pool of candidate structures.

One possible limitation to this approach is that the sequences for a riboswitch may have high (>90%) sequence identity, which may mean that the landscapes for each sequence are very similar, and therefore finding a consensus among them may not yield much useful information. Essentially, if we are interested in the intersection between two sets, and the sets happen to be very similar, the intersection will be large, which may not help us in our search. However, if the folding landscapes of two sequences diverge rapidly with decreasing sequence identity, the intersection is likely to be very small, which would improve structure prediction. The validity of such speculation remains to be seen.

CHAPTER TWO: PREVIOUS WORKS

There are a number of previous related works, some of which form a strong basis for this research. However, because of the novelty of the problem, very little work has been performed to directly solve the problem of finding a consensus landscape, except for RNAConSLOpt, discussed in section 2.2.

2.1 RNASLOpt

RNASLOpt is a program published in 2011 by Dr. Yuan Li and Dr. Shaojie Zhang at the University of Central Florida [7]. The problem that their research seeks to address is how to deal with the prohibitively large size of the energy folding landscape. RNASLOpt filters out the vast majority of these extraneous structures by computing the Stable, Locally Optimal structures in the landscape, hence the acronym RNASLOpt. The assumptions underlying the methodology are simple: the native structure should not spontaneously change into another structure that is thermodynamically more favorable (it should be locally optimal), and the barrier energy of the native structure should be sufficiently high that the secondary structure cannot be altered by thermal noise or other unpredictable events (it should be stable).

RNASLOpt takes an RNA sequence and uses Bafna's algorithm [16] to compute all possible putative stacks in $O(n^2)$ time within user-defined parameters. In order to significantly reduce the time of the computation, Li et al. do not consider stacks with length less than 4, because Bafna showed that the fraction of stacks missed with this cutoff is less than 10%. It is important to note that the native structures of the riboswitches we use in

this thesis contain a small number of stacks of length 2 and 3, as well as isolated base pairs, so we will have to accept this as merely an approximation.

Using this list of putative stacks, RNASLOpt can use two separate algorithms to compute optimal configurations of these stacks, one based on the Nussinov model [5], and the other based on the Turner energy model [17], and also partially based on the Zuker-Sankoff algorithm [3]. A configuration is considered locally optimal when no new stacks can be added to the structure without conflicting with another stack or forming a pseudoknot. RNASLOpt returns the list of locally optimal structures. It then computes the stability of each of these structures, determined by the height of the lowest saddle point adjacent to each energy basin. If this energy barrier is less than ΔB, the structure is discarded. Finally, the remaining structures are ranked according to stability.

Li et al. benchmark RNASLOpt using seven riboswitches (Adenine-BS, Adenine-VV, Guanine, SAM, C-di-GMP, Lysine, and TPP) against other RNA secondary structure prediction software, including mfold, RNAShapes, and RNAlocopt. In all cases except Lysine, RNASLOpt ranked the correct native structure higher than its competitors, indicating that its unique way of estimating the points of interest in the landscape is sound. Therefore, the methodology of this thesis will begin with results returned from RNASLOpt and attempt to optimize them, although the data sets will come from the paper summarized in the next section.

2.2 RNAConSLOpt

Li et al. expanded on RNASLOpt by creating RNAConSLOpt [10], which, based on a thorough review of the literature, is currently the only software package that addresses the

problem of finding a consensus folding landscape. The input to the program is a multiple sequence alignment. RNAConSLOpt then analyzes the alignment using the covariance and conservation score introduced in RNAalifold by Hofacker et al. [18]. Specifically, if we visualize the multiple sequence alignment as a matrix in which each row contains a sequence and each column contains a set of nucleotides that have been aligned, we can calculate the covariance and conservation score between columns $i$ and $j$, denoted by $\gamma_{ij}$, using the following formula:

\[
\gamma_{ij} = \frac{1}{n} \left( C_{ij} - \varphi_1 q_{ij} \right) \tag{9}
\]

where $\varphi_1$ is a weighting factor set to 1 by default, $n$ is the number of sequences in the alignment, and $C_{ij}$ is defined by

\[
C_{ij} = \frac{2}{n-1} \sum_{1 \le k < l \le n} \begin{cases} d(a_i^k, a_i^l) + d(a_j^k, a_j^l) & \text{if } (a_i^k \leftrightarrow a_j^k) \text{ and } (a_i^l \leftrightarrow a_j^l) \\ 0 & \text{otherwise} \end{cases} \tag{10}
\]

where $a_i^k$ refers to the $i$th nucleotide in sequence $k$, $(a_i^k \leftrightarrow a_j^k)$ means the associated nucleotides can form a base pair, and $d(x, y)$ is the Hamming distance between two nucleotides, equal to 0 if the nucleotides are equal and 1 if they are not. Finally, $q_{ij}$ is defined by

\[
q_{ij} = \sum_{1 \le k \le n} \begin{cases} 0 & \text{if } a_i^k \leftrightarrow a_j^k \\ 0.25 & \text{if both } a_i^k \text{ and } a_j^k \text{ are gaps} \\ 1 & \text{otherwise} \end{cases} \tag{11}
\]
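The score can be sketched directly from equations (9) through (11). The function below is an illustrative reading of those formulas and not code from RNAConSLOpt or RNAalifold: the name `gamma`, the gap character `-`, and the pairing rule are assumptions, and $d$ is taken to be the Hamming distance (1 when two nucleotides differ), so that compensatory mutations are rewarded as in RNAalifold [18].

```python
# Assumed pairing rule: Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(x, y):
    return (x, y) in PAIRS


def gamma(alignment, i, j, phi1=1.0):
    """Covariance and conservation score of columns i and j (0-based).

    `alignment` is a list of equal-length aligned sequences (at least two),
    with '-' assumed to mark a gap.
    """
    n = len(alignment)
    col_i = [row[i] for row in alignment]
    col_j = [row[j] for row in alignment]

    # Equation (10): covariation bonus over all pairs of sequences k < l.
    c = 0.0
    for k in range(n):
        for l in range(k + 1, n):
            if can_pair(col_i[k], col_j[k]) and can_pair(col_i[l], col_j[l]):
                c += (col_i[k] != col_i[l]) + (col_j[k] != col_j[l])
    c *= 2.0 / (n - 1)

    # Equation (11): penalty for sequences that cannot form the pair.
    q = 0.0
    for k in range(n):
        if can_pair(col_i[k], col_j[k]):
            continue
        q += 0.25 if col_i[k] == "-" and col_j[k] == "-" else 1.0

    # Equation (9).
    return (c - phi1 * q) / n
```

For the toy alignment `["GAAAC", "CAAAG", "AAAAU"]`, columns 0 and 4 pair in every sequence with a different base pair each time, which gives the maximal score of 2.0 under these assumptions, whereas columns 1 and 2 never pair and score -1.0.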

In other words, $C_{ij}$ represents a bonus given to columns with compensatory mutations that keep the consensus structure, while $q_{ij}$ is a penalty assessed to columns that do not follow the consensus structure. RNAConSLOpt computes $\gamma_{ij}$ for all possible pairs of distinct columns in the alignment, and uses this value in a modified version of the recursive function from [7] to compute locally optimal configurations of consensus stacks. Using the same heuristic from the 2011 paper, RNAConSLOpt calculates the barrier energy of each ConLOpt structure to determine the ConSLOpt structures. Because this methodology takes into account stacks shared between multiple sequences, the number of structures is reduced in comparison to RNASLOpt, essentially cutting out most of the structures that are simply a consequence of the RNA sequence, and leaving the biologically relevant structures. It is important to note that while other tools such as LocARNA [19-21] are already capable of computing the consensus structure of a multiple sequence alignment, these tools focus exclusively on the best possible consensus structure, and are therefore inappropriate to apply to RNA molecules that have alternate functional structures, such as riboswitches.

In a sense, MUSHI represents the converse of RNAConSLOpt. While RNAConSLOpt takes an alignment first and computes optimal configurations of stacks (the consensus landscape) second, MUSHI takes input from RNASLOpt, which computes the landscapes first, and performs structural alignment second. In section 4.4, we compare the results of RNASLOpt + MUSHI to RNAConSLOpt, explaining the advantages and disadvantages of each.

2.3 GraphClust

In 2012, Heyne et al. [22] published a paper describing GraphClust, a tool for ncRNA annotation. As discussed previously, recent studies have begun to indicate that ncRNA

plays a crucial role in gene expression [23], and yet we know relatively little about how such molecules function. One reason for this is that the classes of ncRNA are quite diverse, and ncRNAs sharing a common structure do not necessarily have high sequence identity, so sequencing information alone is not as useful in ncRNA clustering as it is in mRNAs. Thus, ncRNA annotation requires clustering using sequence-structure information, which is usually computationally expensive. GraphClust solves this problem by decomposing the landscape and using fast heuristics, many of which operate in constant time. Because of this speedup, GraphClust can scale to hundreds of thousands of sequences.

GraphClust's relevance to this thesis is due to its methodology, which is similar but not equivalent to finding the consensus landscape between two sequences. First, the sequences are analyzed and many suboptimal structures are enumerated. The structures are then represented by a graph, where the vertices are nucleotides, and an edge exists between vertices if and only if the corresponding nucleotides are adjacent in the sequence, or they form a base pair. Furthermore, an additional node is added in the middle of each pair of stacking base pairs, which induces important features into the graph. Because in practice one often only has a partial transcript of the RNA sequence, the authors further consider subsequences of the original sequence. In this way, each sequence is represented by a set of disconnected graphs. The authors can then compare these representative graphs using a graph kernel, which is a simple way of computing the similarity between two graphs. Specifically, they use the neighborhood subgraph pairwise distance kernel, which is a decomposition kernel introduced by Costa and De Grave in 2010 [24]. Further, Heyne et al. propose a fast method for testing for graph isomorphism, which reduces two isomorphic graphs to an identical string, which can then be mapped to an integer using an iterative

hashing procedure. At this point, determining isomorphism between two graphs simply amounts to checking whether or not these integers are equal. Using these special distance measures with the graph kernel, the authors are able to cluster ncRNAs based on the similarity of their representative structures, and thereby detect new classes of ncRNAs.

This methodology is important to us because it shows us one method of comparing elements of a sequence's landscape to find some sort of consensus which can be used to cluster the sequences. However, our desired outcomes are different. In our case, we already start with a group of ncRNAs that we know are part of the same class, and we want to analyze their landscapes in order to find common structures. GraphClust, by contrast, only seeks to cluster the ncRNAs, showing that they are part of the same class, but not necessarily making any definitive statements on their secondary structure.

These three papers will prove to be the most relevant to our discussion of consensus landscapes. In order to analyze various riboswitches and the folding landscapes, we have written a custom piece of software in Java capable of performing the tasks explained in detail in the next two chapters. Originally, we envisioned a structural alphabet similar to the BEAR alphabet before actually learning of BEAR's existence. The original structural alphabet was rudimentary, dividing each SSE into only five categories: multiloop, unpaired, stack, hairpin, internal loop/bulge. Thus, MUSHI was chosen as an acronym.
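To give a feel for what a five-letter structural alphabet of this kind might look like, the sketch below relabels each position of a dot-bracket string with one of the letters M, U, S, H, or I. This is a hypothetical illustration written in Python for brevity, not the encoding implemented in the Java software: the function name `mushi_encode` and the classification rules (every paired position is a stack, and an unpaired position is labeled by the number of helices inside its enclosing loop) are assumptions.

```python
def mushi_encode(structure):
    """Relabel a dot-bracket string with the letters M, U, S, H, I.

    Assumed rules: paired positions are S; unpaired positions outside every
    base pair are U; unpaired positions in a loop containing no inner helix
    are H, exactly one inner helix I, and two or more inner helices M.
    """
    stack = []                           # opening positions of pairs still open
    children = {}                        # opening position -> helices directly inside
    enclosing = [None] * len(structure)  # closest enclosing pair per unpaired position
    for pos, ch in enumerate(structure):
        if ch == "(":
            if stack:
                children[stack[-1]] += 1
            children[pos] = 0
            stack.append(pos)
        elif ch == ")":
            stack.pop()
        elif stack:
            enclosing[pos] = stack[-1]

    out = []
    for pos, ch in enumerate(structure):
        if ch in "()":
            out.append("S")
        elif enclosing[pos] is None:
            out.append("U")
        else:
            inner = children[enclosing[pos]]
            out.append("H" if inner == 0 else "I" if inner == 1 else "M")
    return "".join(out)
```

Under these rules, `..((...))..` becomes `UUSSHHHSSUU`, and the same run of dots is written as `H` or `I` depending on whether it closes a hairpin or sits inside an internal loop, which is the kind of context that a plain dot-bracket string leaves implicit.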

CHAPTER THREE: METHODOLOGY

Clearly, the problem of global network alignment is exceedingly difficult, especially for energy folding networks, which are enormous in size. Current methods cannot solve the problem directly. Therefore, it is necessary to find the best approximate solution by working with an abridged data set. One observation of note is that the folding landscape includes all possible structures on any particular sequence, and necessarily includes structures that are clearly not of any biological significance. If we can devise some method of filtering out these uninteresting data points before any sort of alignment is attempted, it may be possible to find a solution.

The primary method of investigation in this thesis combines RNASLOpt, a structural alphabet called BEAR, and a structural substitution matrix called MBR. In this chapter, we explain how these approaches are combined to provide insight into our new problem, and in Chapter 4 we present the results of applying them to predict the native structures of riboswitches.

3.1 RNASLOpt

As discussed in the literature review, RNASLOpt [7] takes an RNA sequence, finds the putative stacks predicted using the methods of Bafna et al. [16], and enumerates all possible locally optimal structures that can be constructed using the stacks. These structures are then filtered by stability, so the output of the program is a list of stable local optimal structures. Because these are the points in the landscape that are most likely to be the native structure, the input to MUSHI will be the output of RNASLOpt. The goal of this methodology is to improve native structure prediction by both reducing the size of the list of candidates output

by RNASLOpt and reordering them such that the structures most closely resembling the native structures move toward the top. Our primary thesis is that this can be done by solving, exactly or approximately, the consensus folding landscape problem. Therefore, given two lists of SLOpt structures for two RNA sequences, we should attempt to find an approximate consensus between them. The most thorough way to achieve this goal is to compare every possible pair consisting of one structure from each landscape.

Much thought was given to the problem of how to compare two structures. The most naïve method would be to align the dot-bracket strings directly using the Needleman-Wunsch algorithm. However, this method has many limitations. Consider the following two structures in dot-bracket notation:

...((((...))))...(((...(((...)))...)))...
...((((...(((...(((...)))...)))...))))

In particular, if we examine the first 10 characters in each structure, we can see that an alignment based on the edit distance of these two strings would treat both sets of characters as a perfect alignment. However, in the first structure, characters 8 through 10 are part of a hairpin loop, whereas the same characters in the second structure are actually part of an internal loop. In DNA sequence alignment, each nucleotide is an atomic unit of the sequence whose identity does not depend on the surrounding nucleotides. By contrast, each character in a dot-bracket string represents something different depending not only on the identity of the character, but also on the context of the surrounding characters. Thus, performing structural alignment based solely on the edit distance of these structural representations eliminates contextual information that is critical to understanding the information the

sequence is meant to convey. Clearly, we must devise some way to overcome the limitations of such a primitive alphabet.

In order to best understand the current methodology, it may help to explain some of the ideas that were discarded along the way, and why they were abandoned. When discussing the best way to compare two sets of SLOpt structures, many options were debated. One option was to compute the pairing probability matrix for each landscape. For a sequence of length N, the pairing probability matrix is an N x N matrix in which each cell (i, j) stores the probability that nucleotides i and j form a base pair. This probability is calculated from the frequency of such occurrences in the actual landscape. We calculated pairing probability matrices for two landscapes, scaled them so that the values in the cells ranged from 0 to 255, and used MATLAB to create a greyscale visualization of each matrix, where each cell was assigned a color in the spectrum from black to white, depending on its numerical value.

While this method provided interesting insight into the most common stack structures shared between both landscapes, aligning these matrices in a way that yielded valuable information proved to be difficult. Additionally, because this methodology was a direct translation of dot-bracket notation, it suffered from the same drawbacks described earlier. In particular, critical information about non-stack structures such as internal loops is not conveyed in a pairing probability matrix, which makes it a poor method of comparing landscapes.

Because the shortcomings of dot-bracket notation were now apparent, we next attempted to create a structural encoding for each structure that would convey a greater amount of information. Initially, we decided to use five possible characters: m (multiloop), u

(unpaired), s (stack), h (hairpin), i (internal loop), which inspired the acronym MUSHI. The example below shows how a structure would be encoded using this scheme:

...(((...(((...(((...)))...)))...(((...)))...)))...
uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu

Using this encoding, we then attempted to extract additional information from the landscape. We defined a condensed structural encoding as a shortened version of a structural encoding that retains transitions between different characters, but removes repeated characters. The example below shows the transformation from a structural encoding to a condensed structural encoding:

uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu
umusishsisushsumu

The rationale behind using the condensed representation was to preclude length from being a factor in the alignments. It is possible for two structures to strongly resemble each other even though they differ in length.

Additionally, we attempted to gather data on the frequency with which each nucleotide in the sequence was involved in a particular SSE. For each sequence of length N, this required creating a 5xN matrix, where each of the five rows corresponds to one SSE type and each column records how frequently the associated nucleotide was involved in that SSE. This process is similar to assigning a color to each nucleotide in a consensus structure, such as that computed by LocARNA, where the color denotes the sequence conservation of the nucleotide. We then attempted to visualize these matrices in MATLAB by mapping three of the five rows to the three channels of the RGB additive color model, again scaling the range [0, 1] to [0, 255]. While this resulted in some interesting patterns and identified a few important analogous nucleotides from each landscape, no meaningful consensus could be extracted from this information, so we again abandoned it.
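The condensed encoding and the 5xN frequency matrix described above can be sketched in a few lines of Python. This is only an illustration, not the Java code written for this thesis: the function names `condense` and `sse_frequencies` are hypothetical, and the sketch assumes that every encoding passed to `sse_frequencies` comes from the same sequence and therefore has the same length.

```python
from collections import Counter
from itertools import groupby

# The five MUSHI characters: multiloop, unpaired, stack, hairpin, internal loop.
SSE_TYPES = "mushi"


def condense(encoding: str) -> str:
    """Collapse every run of identical characters into a single character."""
    return "".join(symbol for symbol, _run in groupby(encoding))


def sse_frequencies(encodings: list[str]) -> dict[str, list[float]]:
    """Build the 5xN matrix of per-position SSE frequencies.

    Row `sse`, column `position` holds the fraction of the given structures
    in which the nucleotide at `position` belongs to an SSE of type `sse`.
    Assumes (hypothetically) that all encodings share the same length N.
    """
    if not encodings:
        raise ValueError("need at least one encoding")
    length = len(encodings[0])
    if any(len(e) != length for e in encodings):
        raise ValueError("all encodings must have the same length")

    matrix = {sse: [0.0] * length for sse in SSE_TYPES}
    for position in range(length):
        counts = Counter(e[position] for e in encodings)
        for sse in SSE_TYPES:
            matrix[sse][position] = counts[sse] / len(encodings)
    return matrix


if __name__ == "__main__":
    full = "uuummmuuusssiiissshhhsssiiisssuuussshhhsssuuummmuuu"
    print(condense(full))  # umusishsisushsumu
```

Because `condense` discards run lengths, two structures with the same order of SSEs but different stem or loop sizes map to the same condensed string, which is exactly the length-independence the condensed representation was meant to provide.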

3.2 Brand nEw Alphabet for RNA (BEAR)

After some research, we came across an article in Nucleic Acids Research published in March 2014 by a group of researchers from the University of Rome [25]. In this article, Mattei et al. describe a structural alphabet similar to our earlier version, but far more sophisticated. First, they divide the class of possible structures into four groups: loops (L), internal loops (I), stems/stacks (S), and bulges (B). Each group contains a series of characters denoting an SSE of different length. For example, S = {S1, S2, S3, ...}, where S1 denotes that the nucleotide is a member of a stem of length 1, and so on.

Additionally, the alphabets S, I, and B are divided into characters denoting branching and non-branching structures (S = Sn ∪ Sb). A non-branching structure is defined by the maximal boundary [i, j] such that if a hairpin loop exists within the boundary, it is the only such loop. In other words, there does not exist a multiloop structure in [i, j]. This distinction was necessary because they observed different transition rates between SSEs when building the substitution matrix discussed in the next section. Furthermore, the alphabets I and B are divided again into left and right internal loops and bulges. Thus, I = ILn ∪ ILb ∪ IRn ∪ IRb.

In order to determine the upper limit on the size of each alphabet, Mattei et al. take a set of carefully selected RNA structures and use the 95th percentile of the length distribution as the limit for each SSE. The final BEAR alphabet is then β = L ∪ I ∪ S ∪ B. One additional character, :, is added to the alphabet to denote unpaired nucleotides not belonging to any other SSE. The full alphabet used in this methodology is detailed in the following breakdown, where the first character in each set represents an SSE of length 1, the second represents an SSE of length 2, and so on:

Hairpin Loop = {j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,^}
Stack = {a,b,c,d,e,f,g,h,i,=}
Stack (Branching) = {A,B,C,D,E,F,G,H,I,J}
Left Internal Loop = {?,!,",#,$,%,&,',(,),+}
Left Internal Loop (Branching) = {?,K,L,M,N,O,P,Q,R,S,T,U,V,W}
Right Internal Loop = {?,2,3,4,5,6,7,8,9,0,>}
Right Internal Loop (Branching) = {?,Y,Z,~,?,_,|,/,\,@}
Left Bulge = { [ }
Left Bulge (Branching) = { { }
Right Bulge = { ] }
Right Bulge (Branching) = { } }
Unpaired = { : }

As an example of this new encoding, consider the following sequence, followed by the traditional dot-bracket representation of its structure, followed by the BEAR representation:

AAAGCGCAAAGCGCAAACGCGAAAGCGCAAACGCGAAACGCGAAA
...((((...((((...))))...((((...))))...))))...
:::dddd:::ddddllldddd:::ddddllldddd:::dddd:::

3.3 Substitution Matrices and MBR

In addition to creating a new expressive alphabet, Mattei et al. also construct a structural substitution matrix which can be used for a variety of purposes, such as structural alignment. Substitution matrices were originally introduced by Margaret Dayhoff in 1978 [26]. Their purpose is to measure rates of evolutionary mutation between two polypeptides by expressing the relative probability that one amino acid in a multiple sequence alignment will mutate into another. Such matrices have dimensions 20x20, and the value S(i, j)

corresponds to the likelihood that, over a certain period of time, residue i in the sequence will mutate into residue j. The two most common substitution matrices in use are PAM and BLOSUM. PAM (Point Accepted Mutations or Percent Accepted Mutations, depending on the literature) is the matrix introduced by Dayhoff, and is based on global alignments of very closely related amino acid sequences (i.e. <1% divergence). The matrix PAM1 describes the expected substitutions after 1% of the amino acids have changed. The commonly used PAM250 is obtained by raising PAM1 to the 250th power (PAM1^250). BLOSUM (Blocks Substitution Matrix) was introduced by Henikoff and Henikoff in 1992 [27]. This matrix is based on local alignments of sequences that are more distantly related.

A substitution matrix is calculated from a multiple sequence alignment by counting the number of occurrences of each residue, as well as the number of times a specific residue is aligned with another specific residue, including an identical one. The result is then expressed as what is known as a log-odds score:

\[ S(i, j) = \log\left( \frac{\text{observed frequency}}{\text{expected frequency}} \right) \tag{12} \]

The base of this logarithm is immaterial, since changing it only rescales the scores. A score less than zero indicates that residues i and j were aligned less often than would be expected simply by chance, while a score greater than zero indicates the converse. In this way, we can use substitution matrices to predict how an ancestral sequence might mutate over millions of years, which can help us establish genetic phylogenies.

Mattei et al. use the concept of a substitution matrix in a completely new context. Beginning with a set of highly structured RNA families from the work by Meyer et al. [28], they searched through the database Rfam to find additional highly structured RNA families.
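As a rough illustration of the log-odds score in Equation (12), the sketch below counts aligned symbol pairs in the columns of a small gap-free alignment and compares the observed pair frequency to the frequency expected if symbols were paired at random. It is a simplified, BLOSUM-style counting scheme chosen for brevity; it is not Dayhoff's PAM procedure and not the procedure Mattei et al. used to build MBR. The function name `log_odds_matrix`, the default base of 2, and the treatment of pairs as unordered are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(alignment: list[str], base: float = 2.0) -> dict[tuple[str, str], float]:
    """Return log-odds scores for every unordered symbol pair seen in the alignment.

    `alignment` is a list of equal-length, gap-free strings; each column
    contributes one count for every pair of rows. Keys are sorted tuples,
    so the score of ("A", "B") also stands for ("B", "A").
    """
    symbol_counts = Counter()
    pair_counts = Counter()
    for column in zip(*alignment):
        symbol_counts.update(column)
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total_symbols = sum(symbol_counts.values())
    total_pairs = sum(pair_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        p_a = symbol_counts[a] / total_symbols
        p_b = symbol_counts[b] / total_symbols
        # An unordered pair of distinct symbols can arise in two orders.
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log(observed / expected, base)
    return scores


if __name__ == "__main__":
    for pair, score in sorted(log_odds_matrix(["AAAB", "AAAB"]).items()):
        print(pair, round(score, 3))
```

Only pairs that actually occur in the alignment receive a score here; a practical matrix would also need a policy for unseen pairs, such as pseudocounts, which this sketch deliberately leaves out.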

Each sequence in these RNA families was folded, the resulting structure was converted to the BEAR alphabet, and each character in its BEAR representation was mapped to its corresponding nucleotide in the multiple sequence alignment. Then, using the method created by Dayhoff described above, they derived an 83x83 matrix where each value (i, j) denotes the relative probability that a nucleotide involved in a substructure i may, in another structure, be involved in a substructure j. In essence, this creates a completely new kind of substitution matrix that can be used to analyze different structures for many purposes. Mattei et al. explain such uses, and validate the matrix, in their paper [25]. This matrix is called the Matrix of BEAR-encoded RNA secondary structures, or MBR.

3.4 Datasets

Because we are interested in comparing this methodology to that of RNAConSLOpt, we selected the same four riboswitches used in the benchmarking process in Li's paper [10]. Specifically, we examined the following riboswitches: (1) the adenine riboswitch from the ydhL gene of Bacillus subtilis, (2) the lysine riboswitch from the lysC gene of Bacillus subtilis, (3) the thiamine pyrophosphate (TPP) riboswitch from the thiamin gene of Bacillus subtilis, and (4) the flavin mononucleotide (FMN) riboswitch from the ribD gene of Bacillus subtilis. For each riboswitch, we obtained five RNA sequences in the family, as well as the canonical on and off structures and sequences. These were used as input to the software pipeline described in the next section.

3.5 Software Pipeline

We began by creating four files, one for each of the riboswitches. Each file contained five RNA sequences belonging to that family. Each file was given to RNASLOpt as input to

find the stable local optimal structures of each sequence. RNASLOpt also requires parameters specifying the acceptable range of structures it should return. Specifically, the user selects a Δp, such that no structure whose free energy is more than p percent away from that of the MFE structure will be returned, and a ΔB, such that no structure with a stability of less than B kcal/mol will be returned. The full software pipeline is detailed in Figure 3. The parameters used for each sequence, as well as the number of structures generated by RNASLOpt, are detailed in Table 1.

Figure 3 - MUSHI Pipeline
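The effect of the two RNASLOpt thresholds described above can be pictured as a simple filter over candidate structures. The sketch below is one plausible reading of those thresholds, not RNASLOpt's actual implementation or interface: the `Candidate` record, the `filter_candidates` function, and the interpretation of "p percent away from the MFE" as a cutoff of MFE + |MFE| * p / 100 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one structure in a landscape."""
    structure: str    # dot-bracket string
    energy: float     # free energy in kcal/mol (lower is more favorable)
    stability: float  # stability in kcal/mol


def filter_candidates(candidates, mfe, delta_p, delta_b):
    """Keep structures within delta_p percent of the MFE and with stability >= delta_b.

    Assumed reading: with a negative MFE, a structure passes the energy test
    when its energy is at most mfe + |mfe| * delta_p / 100.
    """
    energy_cutoff = mfe + abs(mfe) * delta_p / 100.0
    return [
        c for c in candidates
        if c.energy <= energy_cutoff and c.stability >= delta_b
    ]


if __name__ == "__main__":
    landscape = [
        Candidate("((((...))))", -20.0, 5.0),
        Candidate("(((.....)))", -18.5, 3.0),
        Candidate("((.......))", -17.0, 6.0),   # fails the energy test
        Candidate(".(((...))).", -19.0, 1.0),   # fails the stability test
    ]
    for kept in filter_candidates(landscape, mfe=-20.0, delta_p=10.0, delta_b=2.0):
        print(kept.structure, kept.energy, kept.stability)
```

Tightening either threshold shrinks the candidate list, which is the trade-off behind the per-sequence parameter choices reported in Table 1.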