Lab III: Computational Biology and RNA Structure Prediction. Biochemistry 208 David Mathews Department of Biochemistry & Biophysics

Contact Info: David_Mathews@urmc.rochester.edu Phone: x51734 Office: 3-8816 Web: http://rna.urmc.rochester.edu

Outline: Define Bioinformatics and Computational Biology. Explain why RNA is important and interesting. Background in RNA structure. Comparative sequence analysis. Free energy model for quantifying structure stability. Free energy minimization by dynamic programming algorithm. Partition function calculations of base pair probabilities. A few words about Thursday's lab.

Definitions: Bioinformatics is the derivation of new knowledge about Biology by the analysis of data. Computational Biology is the use of computers to develop and test hypotheses about Biology. These terms are often used interchangeably.

Why RNA structure Prediction: This is what I study, so I can teach the field well. My group authored the software we will use for lab, so I know it well and it is free. This is a paradigm for computational biology. It is not so important that we are predicting RNA structure, but that you have an opportunity to solve a problem with the help of computation.

Central Dogma of Biology:

RNA is an Active Player: Antisense Antibiotics

RNA Secondary and Tertiary Structure: AAUUGCGGGAAAGGGGUCAA CAGCCGUUCAGUACCAAGUC UCAGGGGAAACUUUGAGAUG GCCUUGCAAAGGGUAUGGUA AUAAGCUGACGGACAUGGUC CUAACCACGCAGCCAAGUCC UAAGUCAACAGAUCUUCUGU UGAUAUGGAUGCAGUUCA
[Figure: secondary structure diagram of this sequence with helices labeled P4, P5, P5a, P5b, P5c, P6, P6a, and P6b and nucleotides numbered from 120 to 260.]
Waring & Davies. (1984) Gene 28: 277. Cate, et al. (Cech & Doudna). (1996) Science 273: 1678.

Base Pairs:

Helices:

An RNA Secondary Structure: R2 Retrotransposon 3′ UTR from D. melanogaster. Mathews et al., RNA 3: 1-16. On average, 46% of nucleotides are unpaired.

Predicting Secondary Structure is an Important Problem: A secondary structure provides insight into how an RNA functions. A predicted structure provides a framework for making hypotheses about structure. A secondary structure is needed for determining a tertiary structure: building constructs for structural biology (NMR and crystallography); assignments. This is a paradigm of (structural) Computational Biology.

Comparative Sequence Analysis of RNA Secondary Structure: Accurate method for predicting RNA secondary structure. It requires a large number of homologous sequences (usually derived from different species.) Over 97% of base pairs predicted in ribosomal RNA sequences were proven in subsequent crystal structures. The method assumes that base pairing is conserved by evolution even though sequence is not.

Example: Without using a computer algorithm, predict the secondary structure of: 5′ GCGACCGGG GCUGGCUUGG UAAUGGUACU CCCCUGUCAC GGGAGAGAAU GUGGGUUCAA AUCCCAUCGG UCGCGCCA 3′

Determine the Possible Base Pairs: [Figure: dot plot of the possible base pairs, with both axes numbered by nucleotide position from 1 to 71 in steps of 10.] For this 77-mer, there are 686 possible canonical (AU, GC, GU) base pairs.
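Counting the candidate pairs is easy to sketch in Python. This is only an illustration: the function name and the `min_loop` parameter (the fewest unpaired nucleotides allowed between two paired positions) are assumptions, and the slide does not say whether its count of 686 applied such a limit, so the result depends on that choice.

```python
# Canonical pairs named on the slide: AU, GC, and GU (in either orientation).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def count_possible_pairs(seq, min_loop=3):
    """Count positions i < j that could form a canonical pair.

    min_loop is a hypothetical parameter: the minimum number of
    nucleotides that must separate i and j.
    """
    n = len(seq)
    return sum(
        1
        for i in range(n)
        for j in range(i + min_loop + 1, n)
        if seq[i] + seq[j] in CANONICAL
    )

# The 77-mer from the example slide, in the same blocks as printed there.
SEQ = ("GCGACCGGG" "GCUGGCUUGG" "UAAUGGUACU" "CCCCUGUCAC"
       "GGGAGAGAAU" "GUGGGUUCAA" "AUCCCAUCGG" "UCGCGCCA")

print(len(SEQ), count_possible_pairs(SEQ, min_loop=3))
```

Setting `min_loop=0` counts every canonical combination regardless of distance.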

What if There were 10 Homologous Sequences: Homology: Merriam-Webster's Online Dictionary http://www.merriam-webster.com/dictionary/homology
1: a similarity often attributable to common origin
2 a: likeness in structure between parts of different organisms (as the wing of a bat and the human arm) due to evolutionary differentiation from a corresponding part in a common ancestor; compare analogy. b: correspondence in structure between a series of parts (as vertebrae) in the same individual
3: similarity of nucleotide or amino acid sequence (as in nucleic acids or proteins)
4: a branch of the theory of topology concerned with partitioning space into geometric components (as points, lines, and triangles) and with the study of the number and interrelationships of these components especially by the use of group theory; called also homology theory; compare cohomology

What If There Were 10 Homologous Sequences: You could look for base pairs that all sequences have in common. Many of the 686 base pairs in the first sequence will not be possible in all 10 sequences. For example, in the first sequence a putative AU pair might be AA in another sequence. More interestingly, a putative AU pair in the first sequence might align to a GC pair in another sequence. This is called a compensating base pair change. Secondary structure is conserved by evolution even though sequence is not.

A Convenient Way to Test Hypotheses About Base Pairing is to Construct a Sequence Alignment that Reflects Secondary Structure:
Helix annotation (column spacing lost in extraction): AAAAAAA BBBB bbbb CCCCC ccccc DDDDD dddddaaaaaaa
GCGACCGGGGCUGGCUU-GGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUACAAAUCCCACCGGUCGCGCCA
GCCCGGGUGGUGPAGU--GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUABOAAUCCCGCCUCGGGCGCCA
GGCCCCAAAGCGAAGUD-GGUU-AUCGCGCCUCCCUGUCACGGAGGAGAUCACGGGUACGAGUCCCGUUGGGGUCGCCA
GGCCCCG-GGUGPAGUU-GGUU-AACACACCCGCCUGUCACGPGGGAGAUCGCGGGUACGAGUCCCGUCGGGGCCGCCA
GGAGCGG-AGUUCAGUC-GGUU-AGAAUACCUGCCUGUCCCGCAGGGG-UCGCGGGUACGAGUCCCGUCCGUUCCGCCA
GGGAUUGUAGUUCAAUU-GGUC-AGAGCACCGCCCUGUCCAGGCGGAAGUUGCGGGUACGAGCCCCGUCAGUCCCGCCA
GGGAUUGUAGUUCAAUU-GGUC-AGAGCACCGCCCUAUCCAGGCGGAAGUUGCGGGUACGAGCCCCGUCAGUCCCGCCA
AAGAAACUAGUUAAACUA-----AUAACACUGGAUUAUCAGACCGGAG-UAACUGGUAAACAAUCAGUGUUUCUUGCCA
AAAAAAUUAGUUUAAU--CA---AAAACCUUAGUAUGUC-AACUAAAAA-AAUUAGAUCAU--CUAAUAUUUUUUACCA
GAGAUAUUAGUAAAA---UA---AUUACAUAACCUUAUCAAGGUUAAGU-UAUAGACUUAAA-UCUAUAUAUCUUACCA
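Testing a pairing hypothesis against an alignment can be sketched in a few lines of Python. The rows below are the first three sequences of the alignment. The column pairing (column k with column 74 - k for k = 0 to 6, counting from 0) is an assumption inferred from the A/a helix annotation, and the function names are illustrative only.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

# First three rows of the alignment on the slide (79 columns each).
ALIGNMENT = [
    "GCGACCGGGGCUGGCUU-GGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUACAAAUCCCACCGGUCGCGCCA",
    "GCCCGGGUGGUGPAGU--GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUABOAAUCCCGCCUCGGGCGCCA",
    "GGCCCCAAAGCGAAGUD-GGUU-AUCGCGCCUCCCUGUCACGGAGGAGAUCACGGGUACGAGUCCCGUUGGGGUCGCCA",
]

def pair_types(alignment, i, j):
    """Return the two-letter pair found at columns i and j in every row."""
    return [row[i] + row[j] for row in alignment]

def is_conserved_pair(alignment, i, j):
    """True if columns i and j can form a canonical pair in every sequence."""
    return all(p in CANONICAL for p in pair_types(alignment, i, j))

def has_compensating_change(alignment, i, j):
    """True if the pair is conserved although both columns vary in sequence."""
    pairs = pair_types(alignment, i, j)
    return (
        is_conserved_pair(alignment, i, j)
        and len({p[0] for p in pairs}) > 1
        and len({p[1] for p in pairs}) > 1
    )

# Assumed pairing for helix A/a: column k pairs with column 74 - k.
for k in range(7):
    i, j = k, 74 - k
    print(i, j, pair_types(ALIGNMENT, i, j),
          is_conserved_pair(ALIGNMENT, i, j),
          has_compensating_change(ALIGNMENT, i, j))
```

Columns 3 and 71, for example, hold an AU pair in the first sequence and a CG pair in the other two, which is the kind of compensating change that supports a proposed pair.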

Draw the Determined Pairs for the First Sequence:

Examples: RNase P Database: http://www.mbio.ncsu.edu/rnasep/home.html

Examples: Telomerase Database: http://telomerase.asu.edu/

What if There is a Single Sequence with Unknown Structure? Secondary structure can be predicted by Gibbs Free Energy minimization.

Gibbs Free Energy (ΔG°): Unpaired State ⇌ Structure i. $K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G_i^\circ / RT}$. ΔG° quantifies the favorability of a structure at a given temperature.

Determining the Most Favored Structure: Unpaired State ⇌ Structure i, and Structure j ⇌ Structure i. $K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G_i^\circ / RT}$. $\frac{[\text{Structure } i]}{[\text{Structure } j]} = \frac{K_i}{K_j} = e^{(\Delta G_j^\circ - \Delta G_i^\circ)/RT}$. The structure with the lowest ΔG° is the most favored at a given temperature.
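The ratio of two structures at equilibrium can be checked numerically. The sketch below is illustrative: the function names are assumptions, R is the gas constant in kcal mol^-1 K^-1, and the two free energies (-9.7 and -9.2 kcal/mol) are the values quoted on the "Equilibrium between Structures" slide.

```python
import math

R = 0.0019872  # gas constant in kcal/(mol K)

def equilibrium_constant(dg, temp_k=310.15):
    """K = exp(-dG/RT) for folding from the unpaired state (dG in kcal/mol)."""
    return math.exp(-dg / (R * temp_k))

def population_ratio(dg_i, dg_j, temp_k=310.15):
    """[Structure i]/[Structure j] = K_i/K_j = exp((dG_j - dG_i)/RT)."""
    return math.exp((dg_j - dg_i) / (R * temp_k))

ratio = population_ratio(-9.7, -9.2)
print(f"RT = {R * 310.15:.3f} kcal/mol, ratio = {ratio:.2f}")
```

A difference of only 0.5 kcal/mol gives a ratio of roughly 2 to 1, so the lowest free energy structure is favored but far from exclusive.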

Experimentally Determining ΔG°: Consider the duplex 5′CACGUG3′ / 3′GUGCAC5′. ΔG° (310 K = 37 °C) = -6.59 kcal/mol; ΔH° = -50.31 kcal/mol; ΔS° = -141.0 eu = -141.0 cal mol^-1 K^-1. $\Delta G^\circ = \Delta H^\circ - T\Delta S^\circ$. Xia et al., Biochemistry, 1998, 37: 14719.
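As a quick check of the relation ΔG° = ΔH° - TΔS°, the numbers on this slide can be plugged in directly. The function name is illustrative. Note the unit conversion: ΔS° is given in cal mol^-1 K^-1 while ΔH° is in kcal/mol.

```python
def gibbs_free_energy(dh_kcal, ds_cal_per_k, temp_k):
    """Return dG in kcal/mol from dH (kcal/mol) and dS (cal/(mol K))."""
    return dh_kcal - temp_k * ds_cal_per_k / 1000.0

dg_37 = gibbs_free_energy(-50.31, -141.0, 310.15)
print(f"dG(37 C) = {dg_37:.2f} kcal/mol")
```

The result agrees with the quoted -6.59 kcal/mol to within rounding of the published ΔH° and ΔS°.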

Optical Melting Curve (hypochromicity): [Figure: normalized absorbance at 260 or 280 nm (0 to 1) plotted against temperature (0 to 100 °C).] Tm = melting temperature = 52.0 °C

Nearest Neighbor Model: A nearest neighbor model is used to predict the Gibbs free energy change of RNA secondary structure formation. The free energy of each motif depends on only the sequence of that motif and the most adjacent base pairs. The total free energy is the sum of the increments.

Nearest Neighbor Model for Watson-Crick Base Pairs: Xia et al., Biochemistry, 1998, 37: 14719. Determined for helices in 1 M NaCl, pH 7, T = 37 °C.
Parameter (5′→3′ / 3′→5′)    ΔG°37 (kcal/mol)
AA/UU                         -0.93
AU/UA                         -1.10
UA/AU                         -1.33
CU/GA                         -2.08
CA/GU                         -2.11
GU/CA                         -2.24
GA/CU                         -2.35
CG/GC                         -2.36
GG/CC                         -3.26
GC/CG                         -3.42
Initiation                     4.09
Per AU end                     0.45
Self-complementary             0.43

Example: ΔG°37 = 4.09 - 2.08 - 3.42 - 2.36 - 3.42 - 2.08 + 0.45 + 0.45 + 0.43 = -7.94 kcal/mol. ΔG°37 (experiment) = -7.99 kcal/mol.
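The sum above can be reproduced with a short Python sketch of the nearest neighbor rules for a Watson-Crick duplex, using the parameters of Xia et al. as tabulated in these slides. The terms in the example are consistent with the self-complementary duplex 5′AGCGCU3′; the slide text does not name the sequence, so that choice is an assumption, as are the function names.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Stack increments (kcal/mol at 37 C), keyed by the 5'->3' dinucleotide
# of one strand; the other strand is its Watson-Crick complement.
STACK = {
    "AA": -0.93, "AU": -1.10, "UA": -1.33, "CU": -2.08, "CA": -2.11,
    "GU": -2.24, "GA": -2.35, "CG": -2.36, "GG": -3.26, "GC": -3.42,
}
INITIATION = 4.09
PER_AU_END = 0.45
SELF_COMPLEMENTARY = 0.43

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def stack_energy(dinucleotide):
    """Look up a stack, reading it from the opposite strand if needed."""
    if dinucleotide in STACK:
        return STACK[dinucleotide]
    return STACK[reverse_complement(dinucleotide)]

def duplex_dg37(strand):
    """Predicted dG37 for a strand paired with its perfect complement."""
    dg = INITIATION
    dg += sum(stack_energy(strand[k:k + 2]) for k in range(len(strand) - 1))
    dg += PER_AU_END * sum(1 for base in (strand[0], strand[-1]) if base in "AU")
    if strand == reverse_complement(strand):
        dg += SELF_COMPLEMENTARY
    return round(dg, 2)

print(duplex_dg37("AGCGCU"))  # assumed sequence for the worked example
print(duplex_dg37("CACGUG"))  # duplex from the optical melting slide
```

For CACGUG the same rules give a value close to the measured -6.59 kcal/mol, which is the sense in which the model is predictive.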

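Nearest Neighbor Model for Free Energy Change of a Sample Hairpin Loop: [Figure: hairpin drawn with each helix stack, the terminal mismatch, and the loop initiation labeled with its free energy increment.]
$\Delta G^\circ_{\text{helix}} = \Delta G^\circ_{\text{GC/CG}} + \Delta G^\circ_{\text{GU/CA}} + 2\,\Delta G^\circ_{\text{AA/UU}} + \Delta G^\circ_{\text{AC/UG}}$ = -2.0 kcal/mol - 2.1 kcal/mol + 2 × (-0.9) kcal/mol - 1.8 kcal/mol = -7.7 kcal/mol
$\Delta G^\circ_{\text{hairpin loop}} = \Delta G^\circ_{\text{initiation}}(\text{6 nucleotides}) + \Delta G^\circ_{\text{mismatch, GG/CA}}$ = 5.0 kcal/mol - 1.6 kcal/mol = 3.4 kcal/mol
$\Delta G^\circ_{\text{total}} = \Delta G^\circ_{\text{hairpin}} + \Delta G^\circ_{\text{helix}}$ = 3.4 kcal/mol - 7.7 kcal/mol = -4.3 kcal/mol
Note that the hairpin loop initiation replaces intermolecular initiation. Mathews et al., J. Mol. Biol., 1999, 288: 911. Mathews et al., PNAS, 2004, 101: 7287.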

Sequence of Unpaired Regions Important: Wu et al. Biochemistry, 1995, 34: 3204. [Figure: two loops of different sequence, labeled 1.6 kcal/mol and -2.7 kcal/mol.] $K = e^{-\Delta\Delta G^\circ/RT} = e^{(4.3\ \text{kcal/mol})/(0.62\ \text{kcal/mol})} \approx 1{,}000$

How can sequence dependence for loops be included in Free Energy Parameters? Too many possible sequences to study them all by optical melting. 1. Study model sequences and deduce general rules. (Hairpin loops, Internal loops). 2. Examine the frequency that sequences occur in motifs in a database of known structures. (Tetraloops - hairpins of four nucleotides). 3. Adjust thermodynamic parameters to optimize the accuracy of structure predictions. (Multibranch loops).

Equilibrium between Structures: [Figure: two alternative secondary structures for the same sequence, each annotated with its nearest neighbor free energy increments.] ΔG°37 = -9.7 kcal/mol for one structure and ΔG°37 = -9.2 kcal/mol for the other.

How is an RNA Secondary Structure Predicted? The lowest free energy structure is the most favored conformation. Nearest neighbor parameters can be used to predict the folding free energy at 37 °C. How is a secondary structure predicted?

How is the Lowest Free Energy Structure Determined? The naïve approach would be to calculate the free energy of every possible secondary structure. Number of secondary structures ≈ 1.8^N (where N is the number of nucleotides). Suppose the free energies of 1000 structures can be calculated in 1 second. For a 100 nucleotide sequence: number of secondary structures ≈ 3 × 10^25; time to calculate ≈ 10^15 years.
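The cost of the naïve approach is simple arithmetic, sketched below in Python. The growth factor 1.8 and the rate of 1000 structures per second come from the slide; the function name and the use of a 365.25-day year are assumptions.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def naive_enumeration_cost(n_nucleotides, growth=1.8, structures_per_second=1000.0):
    """Return (approximate number of structures, years needed to score them all)."""
    structures = growth ** n_nucleotides
    years = structures / structures_per_second / SECONDS_PER_YEAR
    return structures, years

for n in (20, 50, 100):
    structures, years = naive_enumeration_cost(n)
    print(f"N = {n:3d}: {structures:.1e} structures, {years:.1e} years")
```

The exponential growth is the point: each added nucleotide multiplies the work by about 1.8, which is why a polynomial-time algorithm is needed.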

Dynamic Programming Algorithm: Not to be confused with molecular dynamics. This is a calculation, not a simulation. The lowest free energy structure is guaranteed to be found, given the nearest neighbor parameters used. Reviewed by Sean Eddy. Nature Biotechnology. 2004. 22: 1457.

Dynamic Programming Algorithm: Named by Richard Bellman in 1953. Applies to calculations in which the cost/score is built progressively from smaller solutions. Other applications: sequence alignment; determining partition functions for RNA secondary structures; finding shortest paths; determining moves in games; linguistics.

Dynamic Programming: Recursion is used to speed the calculation. The problem is divided into smaller problems. The smaller problems are used to solve bigger problems. Two-step process: (1) Fill determines the lowest free energy folding possible for each subsequence. (2) Traceback determines the structure that has the lowest free energy.

Save Intermediate Results in Fill: Three arrays of numbers:
V(i,j) = lowest free energy for the fragment from nucleotides i to j, given that i and j are base paired. V(i,j) = infinity if i and j cannot form a base pair. V(i,j) = min[hairpin closure, extending a helix, closing an internal loop, closing a bulge loop, closing a multibranch loop] if i and j can base pair.
W(i,j) = lowest free energy for nucleotides i to j, given that the fragment will be a branch in a multibranch loop.
W5(i) = lowest free energy from nucleotides 1 to i.
Fill the arrays progressively, starting with the shortest sequences that can base pair (5 nucleotides) and getting longer. W5(N) = lowest free energy possible.

Some Examples for How Recursion Speeds the Consideration of All Possible Structures: When filling V(i,j), the base pair between i and j may stack on a previous pair (between i+1 and j-1). Then V(i,j) = nearest neighbor increment for the stacking of the i-j pair on the (i+1)-(j-1) pair + V(i+1,j-1). The energy V(i,j) can be determined without regard for what the structure is that gives V(i+1,j-1).

Some Examples for How Recursion Speeds the Consideration of All Possible Structures: When filling W(i,j), one thing that needs to be considered is that the structure may bifurcate to allow multiple branches in a multibranch loop. Then W(i,j) = min[W(i,k) + W(k+1,j)] for all i < k < j. The energy of a bifurcation can then be determined without knowing what the structures were that determine W(i,k) and W(k+1,j).

Fill direction: [Figure: the arrays, indexed by i and j, with the direction of filling indicated.]

Traceback: At the end of the Fill step, the lowest free energy is known, but the structure that gives that energy is unknown. The traceback step goes backwards through the recursions to determine the structure with lowest free energy.
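The fill and traceback steps can be illustrated with a much simpler scoring scheme than the nearest neighbor model. The sketch below swaps in base pair maximization (a Nussinov-style recursion that scores +1 per canonical pair) in place of free energies, and it uses a single array rather than V, W, and W5. It is not the algorithm in RNAstructure. The function names and the `min_loop` parameter are assumptions.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def fill(seq, min_loop=3):
    """best[i][j] = maximum number of pairs for the fragment i..j (0-based)."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):          # shortest fragments first
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case 1: j is unpaired
            for k in range(i, j - min_loop):     # case 2: j pairs with k
                if seq[k] + seq[j] in CANONICAL:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best

def traceback(seq, best, min_loop=3):
    """Walk back through the recursion to recover one optimal set of pairs."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                             # too short to hold a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))             # j was left unpaired
            continue
        for k in range(i, j - min_loop):
            if seq[k] + seq[j] in CANONICAL:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return sorted(pairs)

def dot_bracket(length, pairs):
    """Render pairs in dot-bracket notation."""
    chars = ["."] * length
    for i, j in pairs:
        chars[i], chars[j] = "(", ")"
    return "".join(chars)

seq = "GGGAAACCC"
table = fill(seq)
found = traceback(seq, table)
print(table[0][len(seq) - 1], dot_bracket(len(seq), found))
```

The three nested loops over span, i, and k are the source of the O(N^3) time, and the table itself is the O(N^2) storage.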

Traceback Scheme:

Dynamic Programming Algorithm for Predicting RNA Secondary Structure: The algorithm scales O(N^3) in time and O(N^2) in storage, where N is the length of the sequence. Therefore doubling the sequence length requires 8× as much computation time and 4× as much memory (RAM). This is costly compared to sorting numbers, which is O(N log N). Pseudoknots are excluded: two base pairs i-j and i′-j′ with i < i′ < j < j′ cannot both occur.
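The pseudoknot condition is easy to test on a list of base pairs. In this sketch each pair is a tuple (i, j) with i < j; the function name is illustrative.

```python
def has_pseudoknot(pairs):
    """True if any two pairs (i, j) and (ip, jp) satisfy i < ip < j < jp."""
    for index, (i, j) in enumerate(pairs):
        for ip, jp in pairs[index + 1:]:
            if i < ip < j < jp or ip < i < jp < j:
                return True
    return False

print(has_pseudoknot([(0, 8), (1, 7), (2, 6)]))  # nested pairs
print(has_pseudoknot([(0, 5), (3, 8)]))          # crossing pairs
```

Nested and side-by-side pairs pass the check; only crossing pairs count as a pseudoknot.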

Calculation is Fast:
Length (nt)   RNA                                              Time (H:min:sec)   Memory (MB)
433           Tetrahymena thermophila IVS LSU Group I Intron   0:00:03            15.7
1542          E. coli small subunit rRNA                       0:01:49            47.1
2904          E. coli large subunit rRNA                       0:10:35            130.2
Hardware: 3.4 GHz Intel i7, 4 cores, with 8 GB RAM; Microsoft Windows 7.

Suboptimal Structure Prediction: A number of methods exist that can calculate a set of low free energy structures. These suboptimal structures are alternative hypotheses for the secondary structure. Important because of limitations in the algorithms (no pseudoknots) and limitations in the nearest neighbor parameters. Also important because some sequences have more than one secondary structure.

Example:

Suboptimal Structure Prediction:
Set of heuristically generated suboptimal structures (Zuker. 1989. Science. 244: 48): Mfold: http://www.bioinfo.rpi.edu/applications/mfold/old/rna/ RNAstructure: http://rna.urmc.rochester.edu
Exhaustive sampling of all possible suboptimal structures within a small energy increment of the lowest free energy structure (Wuchty et al. 1999. Biopolymers. 49: 145.): Vienna RNA Package: http://www.tbi.univie.ac.at/~ivo/rna/ RNAstructure: http://rna.urmc.rochester.edu
Ensemble sampling of structures according to their probability of occurring in an equilibrium ensemble (Ding & Lawrence. 2003. Nucleic Acids Research. 31: 7280.): SFold: http://sfold.wadsworth.org RNAstructure: http://rna.urmc.rochester.edu
Recently reviewed: Mathews. Revolutions in RNA Secondary Structure Prediction. 2006. Journal of Molecular Biology. 359: 526.

Testing the Method: Predict secondary structures for sequences that have known structure (as determined by comparative sequence analysis). Score the percentage of known base pairs that are correctly predicted.

RNA Secondary Structure Prediction Accuracy: Percentage of Known Base Pairs Correctly Predicted:
RNA                   Nucleotides   Base Pairs   % Pseudoknot   Lowest Free Energy   Best Suboptimal    Any Suboptimal
SSU (16 S) rRNA       33,263        8,863        1.4            61.0 ± 23.7          75.7 ± 20.0        90.5 ± 14.1
                                                                (44.3 ± 13.2) a      (54.0 ± 13.7) a    (75.6 ± 12.1) a
LSU (23 S) rRNA       13,341        3,585        0.2            76.0 ± 12.4          87.0 ± 8.9         97.7 ± 2.6
                                                                (56.9 ± 9.3) a       (64.0 ± 10.6) a    (82.1 ± 10.9) a
5 S rRNA              26,925        10,188       0.0            74.2 ± 26.9          96.0 ± 5.2         99.9 ± 0.6
Group I Intron        5,518         1,532        6.0            70.8 ± 12.8          83.9 ± 11.2        98.1 ± 4.7
Group I Intron - 2    3,056         865          6.2            (60.5 ± 10.5)        (77.4 ± 9.8)       (97.3 ± 4.4)
Group II Intron       1,626         402          0.0            86.5 ± 3.6           92.4 ± 6.6         100 ± 0.0
RNase P               2,269         694          14.4           64.6 ± 15.2          75.9 ± 10.1        95.6 ± 4.6
RNase P - 2           2,198         1,099        11.3           (59.4 ± 10.2)        (77.6 ± 4.9)       (97.2 ± 2.7)
SRP RNA               24,383        6,273        1.9            68.2 ± 25.8          88.3 ± 12.0        96.3 ± 8.6
tRNA                  37,502        10,018       0.0            84.8 ± 18.9          96.5 ± 6.4         99.3 ± 4.7
Total:                151,503       43,519       1.4            72.8 ± 9.1           87.0 ± 8.1         97.2 ± 3.1
Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.

Predicting RNA Secondary Structure by Hydrogen Bond Maximization: Percentage of Known Base Pairs Correctly Predicted:
RNA                   Nucleotides   Base Pairs   % Pseudoknot   Maximum H Bonds   Best Suboptimal   Any Suboptimal
SSU (16 S) rRNA       33,263        8,863        1.4            19.7 ± 18.4       52.5 ± 17.5       87.0
                                                                (7.4 ± 9.2)       (28.3 ± 10.1)     (57.5)
LSU (23 S) rRNA       13,341        3,585        0.2            23.7 ± 18.8       48.1 ± 13.7       84.6
                                                                (7.7 ± 9.2)       (26.9 ± 6.1)      (52.5)
5 S rRNA              26,925        10,188       0.0            30.3 ± 24.8       78.6 ± 13.0       99.9
Group I Intron        5,518         1,532        6.0            14.8 ± 15.6       54.0 ± 12.7       90.4
Group I Intron - 2    3,056         865          6.2            (13.9 ± 12.9)     (52.0 ± 15.6)     (89.1)
Group II Intron       1,626         402          0.0            20.7 ± 16.5       46.4 ± 2.1        95.1
RNase P               2,269         694          14.4           28.4 ± 19.4       49.4 ± 13.1       82.5
RNase P - 2           2,198         1,099        11.3           (25.4 ± 15.7)     (54.0 ± 12.2)     (88.5)
SRP RNA               24,383        6,273        1.9            17.1 ± 24.1       65.3 ± 17.4       93.1
tRNA                  37,502        10,018       0.0            21.5 ± 24.0       78.8 ± 14.7       100.0
Total:                151,503       43,519       1.4            20.5 ± 6.5        59.1 ± 13.4       89.9 ± 6.1
Mathews, Sabina, Zuker, Turner. 1999. J. Mol. Biol. 288: 911.

Limitations to Prediction of the Minimum Free Energy Structure: A minimum free energy structure provides the single best guess for the secondary structure. It assumes that: the RNA is at equilibrium; the RNA has a single conformation; the RNA thermodynamic parameters are without error (in practice, non-nearest neighbor effects are neglected and some sequence-specific stabilities are averaged).

A Method that Looks at the Probability of a Structure could be more Informative: A partition function can be used to determine the probability of a structure at equilibrium.

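Recall the Equilibrium Equations: Unpaired State ⇌ Structure i, and Structure j ⇌ Structure i. $K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G_i^\circ / RT}$. $\frac{[\text{Structure } i]}{[\text{Structure } j]} = \frac{K_i}{K_j} = e^{(\Delta G_j^\circ - \Delta G_i^\circ)/RT}$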

A Step Further: Consider a sequence with possible structures i, j, and k.
$K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]}$, $K_j = \frac{[\text{Structure } j]}{[\text{Unpaired State}]}$, $K_k = \frac{[\text{Structure } k]}{[\text{Unpaired State}]}$
[strands] = [structure i] + [structure j] + [structure k] + [unpaired state]
Fraction of molecules in structure i $= \frac{[\text{Structure } i]}{[\text{strands}]} = \frac{[\text{Structure } i]/[\text{unpaired state}]}{[\text{strands}]/[\text{unpaired state}]} = \frac{K_i}{1 + K_i + K_j + K_k} = \frac{K_i}{Q}$

So, How is a partition function calculated? We call Q the partition function. $Q = 1 + \sum_i K_i = 1 + \sum_i e^{-\Delta G_i^\circ / RT}$

Dynamic Programming: Recursion is used to speed the calculation. O(N^3) in time and O(N^2) in storage. McCaskill. Biopolymers. 29: 1105 (1990). Mathews. RNA. 10: 1178. (2004).

So, what is Q good for? $P(\text{Secondary Structure}) = \frac{e^{-\Delta G^\circ(\text{Secondary Structure})/RT}}{Q}$. $P(i \text{ paired to } j) = \frac{\sum_k e^{-\Delta G^\circ(k)/RT}}{Q}$, where the sum over k is over all structures with the i-j base pair.
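For a handful of structures, Q and the probabilities can be computed by direct enumeration, which makes the definitions concrete. The three-structure ensemble below is hypothetical (made-up free energies and pairs, for illustration only), and the function names are assumptions. Real programs obtain Q by dynamic programming rather than by listing structures.

```python
import math

RT = 0.0019872 * 310.15  # kcal/mol at 37 C

def partition_function(energies):
    """Q = 1 + sum_i exp(-dG_i/RT); the 1 is the unpaired state."""
    return 1.0 + sum(math.exp(-dg / RT) for dg in energies)

def structure_probability(dg, energies):
    """P(structure) = exp(-dG/RT) / Q."""
    return math.exp(-dg / RT) / partition_function(energies)

def pair_probability(pair, structures):
    """Sum the Boltzmann weights of all structures that contain the pair."""
    q = partition_function([dg for dg, _ in structures])
    weight = sum(math.exp(-dg / RT) for dg, pairs in structures if pair in pairs)
    return weight / q

# Hypothetical ensemble: (dG in kcal/mol, set of base pairs).
ENSEMBLE = [
    (-2.0, {(0, 8), (1, 7)}),
    (-1.5, {(0, 8)}),
    (-0.5, {(2, 6)}),
]

energies = [dg for dg, _ in ENSEMBLE]
for dg, pairs in ENSEMBLE:
    print(sorted(pairs), round(structure_probability(dg, energies), 3))
print("P(0 paired to 8) =", round(pair_probability((0, 8), ENSEMBLE), 3))
```

A pair that appears in several low free energy structures can be highly probable even when no single structure dominates.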

Accuracy: Sensitivity: what percentage of known pairs occur in the predicted structure. Positive Predictive Value (PPV): what percentage of predicted pairs occur in the known structure. PPV ≤ Sensitivity because the structures determined by comparative sequence analysis do not have all pairs and there is a tendency to overpredict base pairs by free energy minimization.
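Both scores follow directly from the sets of known and predicted pairs. The sketch below uses exact matching of (i, j) pairs; the function name and the example pairs are illustrative, and published benchmarks may apply additional rules when deciding that a pair is correct.

```python
def sensitivity_and_ppv(known_pairs, predicted_pairs):
    """Return (sensitivity, PPV) as percentages, using exact pair matches."""
    known, predicted = set(known_pairs), set(predicted_pairs)
    correct = len(known & predicted)
    sensitivity = 100.0 * correct / len(known) if known else 0.0
    ppv = 100.0 * correct / len(predicted) if predicted else 0.0
    return sensitivity, ppv

# Made-up pairs for illustration: 3 of 4 known pairs are found,
# and 3 of 5 predicted pairs are correct.
known = [(0, 20), (1, 19), (2, 18), (3, 17)]
predicted = [(0, 20), (1, 19), (2, 18), (5, 15), (6, 14)]
print(sensitivity_and_ppv(known, predicted))
```

Predicting extra pairs lowers PPV without changing sensitivity, which is the overprediction effect described above.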

Applying P_BP(i,j) to Structure Prediction: [Figure: bar chart; vertical axis Percent, 0 to 100.]
Sensitivity                        72.8
Positive Predictive Value (PPV)    65.8
PPV, P_BP ≥ 99%                    90.7
PPV, P_BP ≥ 95%                    86.7
PPV, P_BP ≥ 90%                    83.3
PPV, P_BP ≥ 70%                    76.8
PPV, P_BP > 50%                    73.2
Mathews. RNA. 10: 1178. (2004).

Percent of Predicted BP above Threshold: [Figure: bar chart; vertical axis Percent of Predicted Pairs, 0 to 90.]
P_BP ≥ 99%    24
P_BP ≥ 95%    41.1
P_BP ≥ 90%    50.1
P_BP ≥ 70%    69.9
P_BP > 50%    80.8
Mathews. RNA. 10: 1178. (2004).

E. coli 5S rRNA Color Annotation:

Calculation is Fast:
Length (nt)   RNA                                              Time (H:min:sec)   Memory (MB)
433           Tetrahymena thermophila IVS LSU Group I Intron   0:00:02            39.6
1542          E. coli small subunit rRNA                       0:01:35            144.7
2904          E. coli large subunit rRNA                       0:11:02            430.3
Hardware: 3.4 GHz Intel i7, 4 cores, with 8 GB RAM; Microsoft Windows 7.

For Further Reading:
Xia, T., SantaLucia, J., Jr., Burkard, M. E., Kierzek, R., Schroeder, S. J., Jiao, X., Cox, C. & Turner, D. H. (1998). Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick pairs. Biochemistry. 37: 14719-14735.
Mathews, Sabina, Zuker, Turner. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288: 911-940.
Mathews, D. H., Schroeder, S. J., Turner, D. H., & Zuker, M. (2005). Predicting RNA Secondary Structure. In The RNA World, Third Edition (Gesteland, R. F., Cech, T. R., & Atkins, J. F., eds.), pp. 631-657. Cold Spring Harbor Laboratory Press. http://rna.cshl.edu/content/free/contents/rnaworld3e_toc.html
Mathews, D. H. (2006). Revolutions in RNA secondary structure prediction. J. Mol. Biol. 359: 526-532.
Eddy. (2004). How do RNA folding algorithms work? Nat. Biotechnol. 22: 1457-1458.

Summary: Comparative sequence analysis determines the common secondary structure for a set of homologous sequences. A dynamic programming algorithm can find the lowest free energy structure for a single sequence. A dynamic programming algorithm can be used to predict base pair probabilities using a partition function.

Lab on Thursday, 2/20: Meet here. Bring laptops. You will work in groups. There will be a quiz. I will be present the whole time, so feel free to come with questions about today's lecture.