Lab III: Computational Biology and RNA Structure Prediction. Biochemistry 208 David Mathews Department of Biochemistry & Biophysics


Contact Info: David_Mathews@urmc.rochester.edu Phone: x51734 Office: 3-8816 Web: http://rna.urmc.rochester.edu

Outline: Define Bioinformatics and Computational Biology. Explain why RNA is important and interesting. Background in RNA structure. Comparative sequence analysis. Free energy model for quantifying structure stability. Free energy minimization by dynamic programming algorithm. Partition function calculations of base pair probabilities. A few words about Thursday's lab.

Definitions: Bioinformatics is the derivation of new knowledge about Biology by the analysis of data. Computational Biology is the use of computers to develop and test hypotheses about Biology. These terms are often used interchangeably.

Why RNA structure Prediction: This is what I study, so I can teach the field well. My group authored the software we will use for lab, so I know it well and it is free. This is a paradigm for computational biology. It is not so important that we are predicting RNA structure, but that you have an opportunity to solve a problem with the help of computation.

Central Dogma of Biology:

RNA is an Active Player: Antisense Antibiotics

RNA Secondary and Tertiary Structure: AAUUGCGGGAAAGGGGUCAA CAGCCGUUCAGUACCAAGUC UCAGGGGAAACUUUGAGAUG GCCUUGCAAAGGGUAUGGUA AUAAGCUGACGGACAUGGUC CUAACCACGCAGCCAAGUCC UAAGUCAACAGAUCUUCUGU UGAUAUGGAUGCAGUUCA
[Figure: secondary structure diagram with helices P4, P5, P5a, P5b, P5c, P6, P6a, and P6b labeled, nucleotides numbered 120 to 260, alongside the tertiary structure.]
Waring & Davies. (1984) Gene 28: 277. Cate, et al. (Cech & Doudna). (1996) Science 273: 1678.

Base Pairs:

Helices:

An RNA Secondary Structure: R2 Retrotransposon 3′ UTR from D. melanogaster. Mathews et al., RNA 3:1-16. On average, 46% of nucleotides are unpaired.

Predicting Secondary Structure is an Important Problem: A secondary structure provides insight into how an RNA functions. A predicted structure provides a framework for making hypotheses about structure. A secondary structure is needed for determining a tertiary structure: building constructs for structural biology (NMR and crystallography); assignments. This is a paradigm of (structural) Computational Biology.

Comparative Sequence Analysis of RNA Secondary Structure: An accurate method for predicting RNA secondary structure. It requires a large number of homologous sequences (usually derived from different species). Over 97% of base pairs predicted in ribosomal RNA sequences were proven in subsequent crystal structures. The method assumes that base pairing is conserved by evolution even though sequence is not.

Example: Without using a computer algorithm, predict the secondary structure of: 5′ GCGACCGGG GCUGGCUUGG UAAUGGUACU CCCCUGUCAC GGGAGAGAAU GUGGGUUCAA AUCCCAUCGG UCGCGCCA 3′

Determine the Possible Base Pairs: [Figure: dot plot of possible base pairs; both axes are nucleotide position, with ticks from 1 to 71.] For this 77-mer, there are 686 possible canonical (AU, GC, GU) base pairs.
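The counting step can be sketched in a few lines. This is a minimal illustration, not the procedure used to produce the slide: the function name `possible_pairs` and the `min_loop` parameter are assumptions, and the count it reports depends on which positional constraints are applied, so it is not guaranteed to reproduce the figure quoted above.

```python
# Canonical pairs named on the slide: AU, GC, and the GU wobble pair.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def possible_pairs(seq, min_loop=0):
    """Return 1-based (i, j) positions, i < j, whose bases can pair.

    min_loop is a hypothetical knob: the minimum number of nucleotides
    that must lie between i and j for the pair to be counted.
    """
    pairs = []
    n = len(seq)
    for i in range(n):
        for j in range(i + 1 + min_loop, n):
            if (seq[i], seq[j]) in CANONICAL:
                pairs.append((i + 1, j + 1))
    return pairs

# The 77-mer from the example slide.
seq = ("GCGACCGGG" "GCUGGCUUGG" "UAAUGGUACU" "CCCCUGUCAC"
       "GGGAGAGAAU" "GUGGGUUCAA" "AUCCCAUCGG" "UCGCGCCA")
print(len(seq), "nucleotides,", len(possible_pairs(seq)), "candidate pairs")
```

Each candidate pair corresponds to one dot in a dot plot; the hard part of structure prediction is choosing a consistent subset of them.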

What if There were 10 Homologous Sequences: Homology, as defined by Merriam-Webster's Online Dictionary (http://www.merriam-webster.com/dictionary/homology):
1: a similarity often attributable to common origin
2 a: likeness in structure between parts of different organisms (as the wing of a bat and the human arm) due to evolutionary differentiation from a corresponding part in a common ancestor (compare analogy) b: correspondence in structure between a series of parts (as vertebrae) in the same individual
3: similarity of nucleotide or amino acid sequence (as in nucleic acids or proteins)
4: a branch of the theory of topology concerned with partitioning space into geometric components (as points, lines, and triangles) and with the study of the number and interrelationships of these components especially by the use of group theory (called also homology theory; compare cohomology)

What If There Were 10 Homologous Sequences: You could look for base pairs that all sequences have in common. Many of the 686 base pairs in the first sequence will not be possible in all 10 sequences. For example, a putative AU pair in the first sequence might be AA in another sequence. More interestingly, a putative AU pair in the first sequence might align to a GC pair in another sequence. This is called a compensating base pair change. Secondary structure is conserved by evolution even though sequence is not.
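The test described above can be sketched for a single pair of alignment columns. The helper name `column_pair_support` and the three-sequence toy alignment are hypothetical; real comparative analysis works on many more sequences and tolerates occasional mismatches.

```python
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def column_pair_support(alignment, col_i, col_j):
    """Check whether two alignment columns (0-based) could pair in every sequence.

    'covaries' is True when every sequence can pair AND the pair identity
    differs between sequences, i.e. a compensating base pair change.
    """
    observed = [seq[col_i] + seq[col_j] for seq in alignment]
    can_pair = [pair in CANONICAL for pair in observed]
    return {
        "observed": observed,
        "all_pair": all(can_pair),
        "covaries": all(can_pair) and len(set(observed)) > 1,
    }

# Hypothetical toy alignment: columns 1 and 5 covary (CG, GC, AU).
alignment = ["GCAAAGC", "GGAAACC", "GAAAAUC"]
print(column_pair_support(alignment, 1, 5))
print(column_pair_support(alignment, 0, 6))  # conserved GC, no covariation
print(column_pair_support(alignment, 2, 4))  # AA in every sequence, cannot pair
```

A conserved pair is compatible with the hypothesis, but a covarying pair is stronger evidence, because the sequence changed while the ability to pair was kept.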

A Convenient Way to Test Hypotheses About Base Pairing is to Construct a Sequence Alignment that Reflects Secondary Structure:
AAAAAAA  BBBB           bbbb CCCCC       ccccc     DDDDD       dddddaaaaaaa
GCGACCGGGGCUGGCUU-GGUA-AUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUACAAAUCCCACCGGUCGCGCCA
GCCCGGGUGGUGPAGU--GGCCCAUCAUACGACCCUGUCACGGUCGUGA-CGCGGGUABOAAUCCCGCCUCGGGCGCCA
GGCCCCAAAGCGAAGUD-GGUU-AUCGCGCCUCCCUGUCACGGAGGAGAUCACGGGUACGAGUCCCGUUGGGGUCGCCA
GGCCCCG-GGUGPAGUU-GGUU-AACACACCCGCCUGUCACGPGGGAGAUCGCGGGUACGAGUCCCGUCGGGGCCGCCA
GGAGCGG-AGUUCAGUC-GGUU-AGAAUACCUGCCUGUCCCGCAGGGG-UCGCGGGUACGAGUCCCGUCCGUUCCGCCA
GGGAUUGUAGUUCAAUU-GGUC-AGAGCACCGCCCUGUCCAGGCGGAAGUUGCGGGUACGAGCCCCGUCAGUCCCGCCA
GGGAUUGUAGUUCAAUU-GGUC-AGAGCACCGCCCUAUCCAGGCGGAAGUUGCGGGUACGAGCCCCGUCAGUCCCGCCA
AAGAAACUAGUUAAACUA-----AUAACACUGGAUUAUCAGACCGGAG-UAACUGGUAAACAAUCAGUGUUUCUUGCCA
AAAAAAUUAGUUUAAU--CA---AAAACCUUAGUAUGUC-AACUAAAAA-AAUUAGAUCAU--CUAAUAUUUUUUACCA
GAGAUAUUAGUAAAA---UA---AUUACAUAACCUUAUCAAGGUUAAGU-UAUAGACUUAAA-UCUAUAUAUCUUACCA

Draw the Determined Pairs for the First Sequence:

A Convenient Way to Test Hypotheses About Base Pairing is to Construct a Sequence Alignment that Reflects Secondary Structure: The same alignment, annotated to mark a homologous sequence, a homologous helix, and a homologous pair.

Examples: RNase P Database: http://www.mbio.ncsu.edu/rnasep/home.html


Examples: Telomerase Database: http://telomerase.asu.edu/

What if There is a Single Sequence with Unknown Structure? Secondary structure can be predicted by Gibbs Free Energy minimization.

Gibbs Free Energy ($\Delta G^\circ$): Unpaired State ⇌ Structure $i$. $K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G_i^\circ / RT}$. $\Delta G^\circ$ quantifies the favorability of a structure at a given temperature.

Determining the Most Favored Structure: Structure $j$ ⇌ Unpaired State ⇌ Structure $i$. $K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G_i^\circ / RT}$. $\frac{[\text{Structure } i]}{[\text{Structure } j]} = \frac{K_i}{K_j} = e^{(\Delta G_j^\circ - \Delta G_i^\circ)/RT}$. The structure with the lowest $\Delta G^\circ$ is the most favored at a given temperature.
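A short numerical sketch of these two relations follows. The function names are illustrative, and the two free energies (-9.7 and -9.2 kcal/mol) are borrowed from the pair of competing structures shown later in the lecture; the temperature is taken as 310.15 K.

```python
import math

R = 1.987e-3  # gas constant in kcal mol^-1 K^-1

def equilibrium_constant(dg, temp_k=310.15):
    """K = exp(-dG/RT) for folding from the unpaired state (dG in kcal/mol)."""
    return math.exp(-dg / (R * temp_k))

def population_ratio(dg_i, dg_j, temp_k=310.15):
    """[Structure i]/[Structure j] = K_i/K_j = exp((dG_j - dG_i)/RT)."""
    return math.exp((dg_j - dg_i) / (R * temp_k))

dg_i, dg_j = -9.7, -9.2
print("K_i =", equilibrium_constant(dg_i))
print("[i]/[j] =", population_ratio(dg_i, dg_j))
```

A difference of only 0.5 kcal/mol gives a ratio of roughly two to one, so the lowest free energy structure is favored but the alternative is far from negligible.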

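Experimentally Determining $\Delta G^\circ$: Consider the duplex 5′CACGUG3′ paired with 3′GUGCAC5′. $\Delta G^\circ$ (310 K = 37 °C) = -6.59 kcal/mol. $\Delta H^\circ$ = -50.31 kcal/mol. $\Delta S^\circ$ = -141.0 eu = -141.0 cal mol$^{-1}$ K$^{-1}$. $\Delta G^\circ = \Delta H^\circ - T \Delta S^\circ$. Xia et al., Biochemistry, 1998, 37: 14719.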
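The arithmetic behind the quoted value can be checked directly. The only care needed is the unit conversion, since the entropy is given in cal mol^-1 K^-1 and the enthalpy in kcal/mol. The function name is illustrative and 37 °C is taken as 310.15 K.

```python
def gibbs_free_energy(dh_kcal, ds_cal_per_k, temp_k):
    """dG = dH - T*dS, with dS converted from cal/(mol K) to kcal/(mol K)."""
    return dh_kcal - temp_k * ds_cal_per_k / 1000.0

dh, ds = -50.31, -141.0  # values quoted for the CACGUG duplex
for celsius in (25.0, 37.0, 50.0):
    dg = gibbs_free_energy(dh, ds, celsius + 273.15)
    print(f"{celsius:5.1f} C: dG = {dg:6.2f} kcal/mol")
```

At 37 °C this reproduces the quoted free energy to within rounding, and it shows how quickly the duplex loses stability as the temperature rises.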

Optical Melting Curve (hypochromicity): [Figure: normalized absorbance (A) at 260 or 280 nm, from 0 to 1, plotted against temperature from 0 to 100 °C.] Tm = melting temperature = 52.0 °C.

Nearest Neighbor Model: A nearest neighbor model is used to predict the Gibbs free energy change of RNA secondary structure formation. The free energy of each motif depends on only the sequence of that motif and the most adjacent base pairs. The total free energy is the sum of the increments.

Nearest Neighbor Model for Watson-Crick Base Pairs: Xia et al., Biochemistry, 1998, 37: 14719. Determined for helices in 1 M NaCl, pH 7, T = 37 °C.
Parameter (top strand 5′ to 3′ / bottom strand 3′ to 5′)    ΔG°37 (kcal/mol)
AA/UU                  -0.93
AU/UA                  -1.10
UA/AU                  -1.33
CU/GA                  -2.08
CA/GU                  -2.11
GU/CA                  -2.24
GA/CU                  -2.35
CG/GC                  -2.36
GG/CC                  -3.26
GC/CG                  -3.42
Initiation              4.09
Per AU end              0.45
Self-complementary      0.43

Example: $\Delta G^\circ_{37} = 4.09 - 2.08 - 3.42 - 2.36 - 3.42 - 2.08 + 0.45 + 0.45 + 0.43 = -7.94$ kcal/mol, using the nearest neighbor parameters of Xia et al. listed above (initiation, five stacks, two AU ends, and the self-complementary term). $\Delta G^\circ_{37}$ (experiment) = -7.99 kcal/mol.
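The sum above can be reproduced with a small lookup over the parameter table. This is a sketch under stated assumptions: the duplex sequence 5′AGCGCU3′ is not given on the slide and is inferred here from the increments that appear in the sum, and the function name `duplex_dg37` is illustrative. Only fully Watson-Crick duplexes are handled.

```python
# Stacking increments (kcal/mol) keyed as (top 5'->3', bottom 3'->5').
STACKS = {
    ("AA", "UU"): -0.93, ("AU", "UA"): -1.10, ("UA", "AU"): -1.33,
    ("CU", "GA"): -2.08, ("CA", "GU"): -2.11, ("GU", "CA"): -2.24,
    ("GA", "CU"): -2.35, ("CG", "GC"): -2.36, ("GG", "CC"): -3.26,
    ("GC", "CG"): -3.42,
}
INITIATION, PER_AU_END, SELF_COMPLEMENTARY = 4.09, 0.45, 0.43
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def duplex_dg37(top):
    """Nearest neighbor dG at 37 C for a strand paired with its exact complement."""
    bottom = "".join(COMPLEMENT[b] for b in top)  # written 3'->5', aligned to top
    total = INITIATION
    for i in range(len(top) - 1):
        key = (top[i:i + 2], bottom[i:i + 2])
        if key not in STACKS:
            # Same stack read from the other strand.
            key = (bottom[i:i + 2][::-1], top[i:i + 2][::-1])
        total += STACKS[key]
    total += sum(PER_AU_END for end in (top[0], top[-1]) if end in "AU")
    if bottom[::-1] == top:  # the two strands are identical
        total += SELF_COMPLEMENTARY
    return total

# Assumed duplex, inferred from the terms in the worked example.
print(round(duplex_dg37("AGCGCU"), 2))
```

Because the table lists each stack only once, the lookup falls back to reading the same stack from the opposite strand; that is why ten entries cover all sixteen Watson-Crick dinucleotide steps.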

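Nearest Neighbor Model for Free Energy Change of a Sample Hairpin Loop: [Figure: sample hairpin with a 6-nucleotide loop, annotated with the stacking increments (-2.0, -2.1, -0.9, -0.9, and -1.8 kcal/mol), the terminal mismatch (-1.6 kcal/mol), and loop initiation (+5.0 kcal/mol).]
$\Delta G^\circ_{\text{helix}} = \Delta G^\circ_{\mathrm{GC/CG}} + \Delta G^\circ_{\mathrm{GU/CA}} + 2\,\Delta G^\circ_{\mathrm{AA/UU}} + \Delta G^\circ_{\mathrm{AC/UG}} = -2.0 - 2.1 + 2 \times (-0.9) - 1.8 = -7.7$ kcal/mol
$\Delta G^\circ_{\text{hairpin loop}} = \Delta G^\circ_{\text{initiation}}(6\ \text{nucleotides}) + \Delta G^\circ_{\text{mismatch}} = 5.0 - 1.6 = 3.4$ kcal/mol
$\Delta G^\circ_{\text{total}} = \Delta G^\circ_{\text{hairpin}} + \Delta G^\circ_{\text{helix}} = 3.4 - 7.7 = -4.3$ kcal/mol
Note that the hairpin loop initiation replaces intermolecular initiation. Mathews et al., J. Mol. Biol., 1999, 288: 911. Mathews et al., PNAS, 2004, 101: 7287.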

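Sequence of Unpaired Regions Important: Wu et al., Biochemistry, 1995, 34: 3204. [Figure: two loops that differ in sequence, with free energy increments of 1.6 kcal/mol and -2.7 kcal/mol.] The two loops differ by 4.3 kcal/mol, so $K = e^{-\Delta G^\circ/RT} = e^{(4.3\ \text{kcal/mol})/(0.62\ \text{kcal/mol})} \approx 1{,}000$.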

How can sequence dependence for loops be included in Free Energy Parameters? Too many possible sequences to study them all by optical melting. 1. Study model sequences and deduce general rules. (Hairpin loops, Internal loops). 2. Examine the frequency that sequences occur in motifs in a database of known structures. (Tetraloops - hairpins of four nucleotides). 3. Adjust thermodynamic parameters to optimize the accuracy of structure predictions. (Multibranch loops).

Equilibrium between Structures: [Figure: two alternative secondary structures of the same sequence, each annotated with its nearest neighbor free energy increments.] One structure has $\Delta G^\circ_{37} = -9.7$ kcal/mol and the other has $\Delta G^\circ_{37} = -9.2$ kcal/mol.

How is an RNA Secondary Structure Predicted? The lowest free energy structure is the most favored conformation. Nearest neighbor parameters can be used to predict the folding free energy at 37 °C. How is a secondary structure predicted?

How is the Lowest Free Energy Structure Determined? A naïve approach would be to calculate the free energy of every possible secondary structure. Number of secondary structures ≈ $1.8^N$ (where N is the number of nucleotides). Suppose the free energies of 1000 structures can be calculated in 1 second. For a 100-nucleotide sequence: number of secondary structures ≈ $3 \times 10^{25}$; time to calculate ≈ $10^{15}$ years.
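The order-of-magnitude estimate is easy to rerun for other lengths. The sketch below simply evaluates the growth law and the evaluation rate stated above; the function name and the choice of 365.25 days per year are assumptions.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def brute_force_estimate(n_nucleotides, structures_per_second=1000.0, growth=1.8):
    """Return (approximate number of structures, years needed to score them all)."""
    n_structures = growth ** n_nucleotides
    years = n_structures / structures_per_second / SECONDS_PER_YEAR
    return n_structures, years

for n in (20, 50, 100):
    structures, years = brute_force_estimate(n)
    print(f"N = {n:3d}: {structures:.1e} structures, {years:.1e} years")
```

The exponential growth is the point: adding a single nucleotide multiplies the work by 1.8, so no realistic speed-up of the per-structure calculation rescues exhaustive enumeration.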

Dynamic Programming Algorithm: Not to be confused with molecular dynamics. This is a calculation, not a simulation. The lowest free energy structure is guaranteed, given the nearest neighbor parameters used. Reviewed by Sean Eddy. Nature Biotechnology. 2004. 22: 1457.

Dynamic Programming Algorithm: Named by Richard Bellman in 1953. Applies to calculations in which the cost/score is built progressively from smaller solutions. Other applications: sequence alignment; determining partition functions for RNA secondary structures; finding shortest paths; determining moves in games; linguistics.

Dynamic Programming: Recursion is used to speed the calculation. The problem is divided into smaller problems. The smaller problems are used to solve bigger problems. Two-step process: Fill determines the lowest free energy folding possible for each subsequence. Traceback determines the structure that has the lowest free energy.

Save Intermediate Results in Fill: Three arrays of numbers:
V(i,j) = lowest free energy for the fragment from nucleotides i to j, given that i and j are base paired. V(i,j) = infinity if i and j cannot form a base pair. V(i,j) = min[hairpin closure, extending a helix, closing an internal loop, closing a bulge loop, closing a multibranch loop] if i and j can base pair.
W(i,j) = lowest free energy for nucleotides i to j, given that the fragment will be a branch in a multibranch loop.
W5(i) = lowest free energy from nucleotides 1 to i.
Fill the arrays progressively, starting with the shortest sequences that can base pair (5 nucleotides) and getting longer. W5(N) = lowest free energy possible.
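The shape of the fill step can be shown with a deliberately reduced model. In this sketch V considers only hairpin closure and helix extension, so internal loops, bulges, and multibranch loops are left out and the W array is not needed. The two energy constants are hypothetical placeholders, not nearest neighbor parameters.

```python
import math

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
HAIRPIN = 4.0   # hypothetical cost of closing any hairpin loop (kcal/mol)
STACK = -3.0    # hypothetical bonus for stacking one pair on another (kcal/mol)
MIN_LOOP = 3    # shortest fragment that can pair is therefore 5 nucleotides

def fill(seq):
    """Fill V (0-based i, j) and W5 (W5[i] covers the first i nucleotides)."""
    n = len(seq)
    V = [[math.inf] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):          # shortest fragments first
        for i in range(n - span):
            j = i + span
            if (seq[i], seq[j]) not in PAIRS:
                continue                          # V stays infinite
            # Either close a hairpin, or stack on the (i+1, j-1) pair.
            V[i][j] = min(HAIRPIN, STACK + V[i + 1][j - 1])
    W5 = [0.0] * (n + 1)
    for j in range(1, n + 1):
        best = W5[j - 1]                          # nucleotide j left unpaired
        for k in range(1, j + 1):                 # or k pairs with j (1-based)
            best = min(best, W5[k - 1] + V[k - 1][j - 1])
        W5[j] = best
    return V, W5

V, W5 = fill("GGGAAACCC")
print("lowest free energy in the toy model:", W5[-1])
```

Every entry is computed once from entries for shorter fragments, which is what makes the later traceback possible without re-examining whole structures.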


Some Examples for How Recursion Speeds the Consideration of All Possible Structures: When filling V(i,j), the base pair between i and j may stack on a previous pair (between i+1 and j-1). Then V(i,j) = (nearest neighbor increment for stacking the i-j pair on the (i+1)-(j-1) pair) + V(i+1,j-1). The energy of the stack can be determined without regard for what the structure is that gives V(i+1,j-1).


Some Examples for How Recursion Speeds the Consideration of All Possible Structures: When filling W(i,j), one thing that needs to be considered is that the structure may bifurcate to allow multiple branches in a multibranch loop. Then W(i,j) = min[W(i,k) + W(k+1,j)] for all i < k < j. The energy of a bifurcation can then be determined without knowing what the structures were that determine W(i,k) and W(k+1,j).


Fill direction: [Figure: the arrays, indexed by i and j, with the direction in which they are filled.]

Traceback: At the end of the Fill step, the lowest free energy is known, but the structure that gives that energy is unknown. The traceback step goes backwards through the recursions to determine the structure with lowest free energy.
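Fill and traceback can be seen end to end in a simpler setting. The sketch below swaps free energy minimization for base pair maximization (a Nussinov-style score of one point per canonical pair), which is not the energy model used in the lecture but has the same two-step structure. Function names and the `min_loop` default are assumptions.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def fold_max_pairs(seq, min_loop=3):
    """Return (maximum number of pairs, sorted 1-based pair list)."""
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    # Fill: M[i][j] = most pairs possible within seq[i..j].
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = M[i][j - 1]                      # j unpaired
            for k in range(i, j - min_loop):        # j paired with k
                if (seq[k], seq[j]) in PAIRS:
                    left = M[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + M[k + 1][j - 1])
            M[i][j] = best
    # Traceback: revisit the recursion to find which choices gave the optimum.
    pairs, todo = [], [(0, n - 1)]
    while todo:
        i, j = todo.pop()
        if j - i <= min_loop:
            continue
        if M[i][j] == M[i][j - 1]:
            todo.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = M[i][k - 1] if k > i else 0
                if M[i][j] == left + 1 + M[k + 1][j - 1]:
                    pairs.append((k + 1, j + 1))
                    if k > i:
                        todo.append((i, k - 1))
                    todo.append((k + 1, j - 1))
                    break
    return (M[0][n - 1] if n else 0), sorted(pairs)

def dot_bracket(n, pairs):
    """Render 1-based pairs as a dot-bracket string."""
    s = ["."] * n
    for i, j in pairs:
        s[i - 1], s[j - 1] = "(", ")"
    return "".join(s)

score, pairs = fold_max_pairs("GGGAAACCC")
print(score, pairs, dot_bracket(9, pairs))
```

When several structures tie for the optimum, this traceback reports only the first one it finds; that limitation is one motivation for looking at suboptimal structures.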

Traceback Scheme:

Dynamic Programming Algorithm for Predicting RNA Secondary Structure: The algorithm scales as $O(N^3)$ in time and $O(N^2)$ in storage, where N is the length of the sequence. Doubling the sequence length therefore requires 8 times as much computation time and 4 times as much memory (RAM). This is costly compared to sorting numbers, which is $O(N \log N)$. Pseudoknots are excluded: no two base pairs i-j and i′-j′ may satisfy i < i′ < j < j′.
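The pseudoknot condition is a simple test on pairs of base pairs. A minimal sketch, with an illustrative function name, that reports every crossing in a list of pairs:

```python
def crossing_pairs(pairs):
    """Return every ((i, j), (k, l)) with i < k < j < l, i.e. each pseudoknot crossing."""
    ordered = sorted((min(p), max(p)) for p in pairs)
    found = []
    for a in range(len(ordered)):
        i, j = ordered[a]
        for b in range(a + 1, len(ordered)):
            k, l = ordered[b]
            if i < k < j < l:
                found.append(((i, j), (k, l)))
    return found

print(crossing_pairs([(1, 10), (2, 9)]))    # nested helix: no crossing
print(crossing_pairs([(1, 5), (6, 10)]))    # side by side: no crossing
print(crossing_pairs([(1, 10), (5, 15)]))   # crossing pairs: a pseudoknot
```

Nested and side-by-side pairs split the sequence into independent fragments, which is exactly the property the recursions rely on; crossing pairs break that independence.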

Calculation is Fast:
Length (nt)   RNA                                               Time (H:min:sec)   Memory (MB)
433           Tetrahymena thermophila IVS LSU Group I Intron    0:00:03            15.7
1542          E. coli small subunit rRNA                        0:01:49            47.1
2904          E. coli large subunit rRNA                        0:10:35            130.2
Hardware: 3.4 GHz Intel i7, 4 cores, with 8 GB RAM; Microsoft Windows 7.
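The timings can be used for a rough check of the stated scaling. The sketch below estimates the exponent p in time ∝ N^p from the two longer sequences, since the shortest run is too quick to time reliably. Wall-clock time and total memory include fixed overhead, so the exponents are only expected to land near the theoretical values; the function name is illustrative.

```python
import math

# (length in nucleotides, time in seconds, memory in MB), from the table above.
RUNS = [(433, 3, 15.7), (1542, 109, 47.1), (2904, 635, 130.2)]

def scaling_exponent(n1, y1, n2, y2):
    """Exponent p implied by two measurements if y grows as N**p."""
    return math.log(y2 / y1) / math.log(n2 / n1)

n1, t1, m1 = RUNS[1]
n2, t2, m2 = RUNS[2]
time_exp = scaling_exponent(n1, t1, n2, t2)
mem_exp = scaling_exponent(n1, m1, n2, m2)
print(f"time exponent ~ {time_exp:.2f}, memory exponent ~ {mem_exp:.2f}")
```

An empirical time exponent a little below 3 is typical, because lower-order terms still contribute at these lengths.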

Suboptimal Structure Prediction: A number of methods exist that can calculate a set of low free energy structures. These suboptimal structures are alternative hypotheses for the secondary structure. Important because of limitations in the algorithms (no pseudoknots) and limitations in the nearest neighbor parameters. Also important because some sequences have more than one secondary structure.

Example:

Suboptimal Structure Prediction:
Set of heuristically generated suboptimal structures (Zuker. 1989. Science. 244: 48):
  Mfold: http://www.bioinfo.rpi.edu/applications/mfold/old/rna/
  RNAstructure: http://rna.urmc.rochester.edu
Exhaustive sampling of all possible suboptimal structures within a small energy increment of the lowest free energy structure (Wuchty et al. 1999. Biopolymers. 49: 145):
  Vienna RNA Package: http://www.tbi.univie.ac.at/~ivo/rna/
  RNAstructure: http://rna.urmc.rochester.edu
Ensemble sampling of structures according to their probability of occurring in an equilibrium ensemble (Ding & Lawrence. 2003. Nucleic Acids Research. 31: 7280):
  SFold: http://sfold.wadsworth.org
  RNAstructure: http://rna.urmc.rochester.edu
Recently reviewed: Mathews. Revolutions in RNA Secondary Structure Prediction. 2006. Journal of Molecular Biology. 359: 526.

Testing the Method: Predict secondary structures for sequences that have known structure (as determined by comparative sequence analysis). Score the percentage of known base pairs that are correctly predicted.

RNA Secondary Structure Prediction Accuracy: Percentage of Known Base Pairs Correctly Predicted:
RNA                  Nucleotides  Base Pairs  % Pseudoknot  Lowest Free Energy  Best Suboptimal    Any Suboptimal
SSU (16S) rRNA       33,263       8,863       1.4           61.0 ± 23.7         75.7 ± 20.0        90.5 ± 14.1
                                                            (44.3 ± 13.2) a     (54.0 ± 13.7) a    (75.6 ± 12.1) a
LSU (23S) rRNA       13,341       3,585       0.2           76.0 ± 12.4         87.0 ± 8.9         97.7 ± 2.6
                                                            (56.9 ± 9.3) a      (64.0 ± 10.6) a    (82.1 ± 10.9) a
5S rRNA              26,925       10,188      0.0           74.2 ± 26.9         96.0 ± 5.2         99.9 ± 0.6
Group I Intron       5,518        1,532       6.0           70.8 ± 12.8         83.9 ± 11.2        98.1 ± 4.7
Group I Intron - 2   3,056        865         6.2           (60.5 ± 10.5)       (77.4 ± 9.8)       (97.3 ± 4.4)
Group II Intron      1,626        402         0.0           86.5 ± 3.6          92.4 ± 6.6         100 ± 0.0
RNase P              2,269        694         14.4          64.6 ± 15.2         75.9 ± 10.1        95.6 ± 4.6
RNase P - 2          2,198        1,099       11.3          (59.4 ± 10.2)       (77.6 ± 4.9)       (97.2 ± 2.7)
SRP RNA              24,383       6,273       1.9           68.2 ± 25.8         88.3 ± 12.0        96.3 ± 8.6
tRNA                 37,502       10,018      0.0           84.8 ± 18.9         96.5 ± 6.4         99.3 ± 4.7
Total                151,503      43,519      1.4           72.8 ± 9.1          87.0 ± 8.1         97.2 ± 3.1
Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.

Predicting RNA Secondary Structure by Hydrogen Bond Maximization: Percentage of Known Base Pairs Correctly Predicted:
RNA                  Nucleotides  Base Pairs  % Pseudoknot  Maximum H Bonds  Best Suboptimal  Any Suboptimal
SSU (16S) rRNA       33,263       8,863       1.4           19.7 ± 18.4      52.5 ± 17.5      87.0
                                                            (7.4 ± 9.2)      (28.3 ± 10.1)    (57.5)
LSU (23S) rRNA       13,341       3,585       0.2           23.7 ± 18.8      48.1 ± 13.7      84.6
                                                            (7.7 ± 9.2)      (26.9 ± 6.1)     (52.5)
5S rRNA              26,925       10,188      0.0           30.3 ± 24.8      78.6 ± 13.0      99.9
Group I Intron       5,518        1,532       6.0           14.8 ± 15.6      54.0 ± 12.7      90.4
Group I Intron - 2   3,056        865         6.2           (13.9 ± 12.9)    (52.0 ± 15.6)    (89.1)
Group II Intron      1,626        402         0.0           20.7 ± 16.5      46.4 ± 2.1       95.1
RNase P              2,269        694         14.4          28.4 ± 19.4      49.4 ± 13.1      82.5
RNase P - 2          2,198        1,099       11.3          (25.4 ± 15.7)    (54.0 ± 12.2)    (88.5)
SRP RNA              24,383       6,273       1.9           17.1 ± 24.1      65.3 ± 17.4      93.1
tRNA                 37,502       10,018      0.0           21.5 ± 24.0      78.8 ± 14.7      100.0
Total                151,503      43,519      1.4           20.5 ± 6.5       59.1 ± 13.4      89.9 ± 6.1
Mathews, Sabina, Zuker, Turner. 1999. J. Mol. Biol. 288: 911.

Limitations to Prediction of the Minimum Free Energy Structure: A minimum free energy structure provides the single best guess for the secondary structure. It assumes that: the RNA is at equilibrium; the RNA has a single conformation; the RNA thermodynamic parameters are without error. In practice the parameters neglect non-nearest neighbor effects, and some sequence-specific stabilities are averaged.

A Method that Looks at the Probability of a Structure could be more Informative: A partition function can be used to determine the probability of a structure at equilibrium.

Recall the Equilibrium Equations: Structure $j$ ⇌ Unpaired State ⇌ Structure $i$. $K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G_i^\circ / RT}$. $\frac{[\text{Structure } i]}{[\text{Structure } j]} = \frac{K_i}{K_j} = e^{(\Delta G_j^\circ - \Delta G_i^\circ)/RT}$.

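A Step Further: Consider a sequence with possible structures $i$, $j$, and $k$.
$K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]}$, $K_j = \frac{[\text{Structure } j]}{[\text{Unpaired State}]}$, $K_k = \frac{[\text{Structure } k]}{[\text{Unpaired State}]}$
$[\text{strands}] = [\text{Structure } i] + [\text{Structure } j] + [\text{Structure } k] + [\text{Unpaired State}]$
Fraction of molecules in structure $i$ $= \frac{[\text{Structure } i]}{[\text{strands}]} = \frac{[\text{Structure } i]/[\text{Unpaired State}]}{[\text{strands}]/[\text{Unpaired State}]} = \frac{K_i}{1 + K_i + K_j + K_k} = \frac{K_i}{Q}$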

So, How is a partition function calculated? We call $Q$ the partition function. $Q = 1 + \sum_i K_i = 1 + \sum_i e^{-\Delta G_i^\circ / RT}$
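For a handful of structures the partition function can be evaluated by hand or in a few lines. In the sketch below the three free energies are hypothetical values chosen so that the unpaired state keeps a visible share; the function name is illustrative.

```python
import math

R = 1.987e-3  # kcal mol^-1 K^-1

def structure_fractions(dgs, temp_k=310.15):
    """Return (Q, fraction in each structure, fraction unpaired).

    Q = 1 + sum(K_i); the leading 1 is the unpaired reference state.
    """
    rt = R * temp_k
    ks = [math.exp(-dg / rt) for dg in dgs]
    q = 1.0 + sum(ks)
    return q, [k / q for k in ks], 1.0 / q

# Hypothetical free energies (kcal/mol) for structures i, j, and k.
q, fractions, unpaired = structure_fractions([-3.0, -2.0, -1.0])
print("Q =", round(q, 1))
print("fractions =", [round(f, 3) for f in fractions], "unpaired =", round(unpaired, 4))
```

The fractions always sum to one together with the unpaired state, which is what makes Q the right normalization for probabilities.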

Dynamic Programming: Recursion is used to speed the calculation. McCaskill. Biopolymers. 29: 1105 (1990). Mathews. RNA. 10: 1178 (2004). $O(N^3)$ in time and $O(N^2)$ in storage.

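So, what is Q good for? $P(\text{Secondary Structure}) = \frac{e^{-\Delta G^\circ(\text{Secondary Structure})/RT}}{Q}$. $P_{i,j} = \frac{\sum_k e^{-\Delta G^\circ(k)/RT}}{Q} = \frac{1}{Q} \sum_k e^{-\Delta G^\circ(k)/RT} = \frac{Q_{i \text{ paired to } j}}{Q}$, where $\sum_k$ is the sum over all structures with the i-j base pair.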
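For a very short sequence the sums can be done by brute force, which makes the definition of a base pair probability concrete. This sketch enumerates every non-crossing structure and weights it with a toy energy of -1.0 kcal/mol per base pair; that energy is a hypothetical stand-in for the nearest neighbor model, and real programs use dynamic programming rather than enumeration.

```python
import math
from collections import defaultdict

R = 1.987e-3  # kcal mol^-1 K^-1
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def enumerate_structures(seq, min_loop=3):
    """Yield every non-crossing set of canonical pairs as a tuple of 0-based (i, j)."""
    def rec(i, j):
        if j - i <= min_loop:          # fragment too short to hold a pair
            yield ()
            return
        yield from rec(i + 1, j)       # nucleotide i unpaired
        for k in range(i + min_loop + 1, j + 1):   # nucleotide i paired with k
            if (seq[i], seq[k]) in PAIRS:
                for inner in rec(i + 1, k - 1):
                    for outer in rec(k + 1, j):
                        yield ((i, k),) + inner + outer
    return rec(0, len(seq) - 1)

def pair_probabilities(seq, dg_per_pair=-1.0, temp_k=310.15):
    """Return (Q, {pair: probability}) under the toy per-pair energy."""
    rt = R * temp_k
    q = 0.0
    q_pair = defaultdict(float)
    for structure in enumerate_structures(seq):
        weight = math.exp(-dg_per_pair * len(structure) / rt)
        q += weight                    # the empty structure contributes 1
        for pair in structure:
            q_pair[pair] += weight     # sum over structures containing the pair
    return q, {pair: w / q for pair, w in q_pair.items()}

q, probs = pair_probabilities("GGGAAACCC")
for pair, p in sorted(probs.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(p, 3))
```

The outermost pair of the full helix collects weight from many structures, so its probability is much higher than that of a pair that can only occur on its own.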

Accuracy: Sensitivity: what percentage of known pairs occur in the predicted structure. Positive Predictive Value (PPV): what percentage of predicted pairs occur in the known structure. PPV is lower than sensitivity because the structures determined by comparative sequence analysis do not have all pairs and there is a tendency to overpredict base pairs by free energy minimization.
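Both scores reduce to set operations on lists of base pairs. The sketch below counts a predicted pair as correct only if it matches a known pair exactly; published benchmarks may use more permissive matching rules, so this is an assumption, as are the function name and the example pairs.

```python
def sensitivity_ppv(known, predicted):
    """Return (sensitivity, PPV) as fractions, using exact pair matches."""
    known, predicted = set(known), set(predicted)
    correct = known & predicted
    sensitivity = len(correct) / len(known) if known else 0.0
    ppv = len(correct) / len(predicted) if predicted else 0.0
    return sensitivity, ppv

# Hypothetical example: 3 known pairs, 4 predicted pairs, 2 in common.
known = [(1, 9), (2, 8), (3, 7)]
predicted = [(1, 9), (2, 8), (4, 7), (10, 20)]
sens, ppv = sensitivity_ppv(known, predicted)
print(f"sensitivity = {sens:.2f}, PPV = {ppv:.2f}")
```

Predicting extra pairs cannot lower sensitivity but does lower PPV, which is why the two scores are reported together.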

Applying $P_{BP}(i,j)$ to Structure Prediction: [Figure: bar chart; vertical axis is Percent, from 0 to 100.]
Sensitivity: 72.8
Positive Predictive Value (PPV): 65.8
PPV, $P_{BP}$ ≥ 99%: 90.7
PPV, $P_{BP}$ ≥ 95%: 86.7
PPV, $P_{BP}$ ≥ 90%: 83.3
PPV, $P_{BP}$ ≥ 70%: 76.8
PPV, $P_{BP}$ > 50%: 73.2
Mathews. RNA. 10: 1178. (2004).

Percent of Predicted BP above Threshold: [Figure: bar chart; vertical axis is Percent of Predicted Pairs, from 0 to 90.]
$P_{BP}$ ≥ 99%: 24
$P_{BP}$ ≥ 95%: 41.1
$P_{BP}$ ≥ 90%: 50.1
$P_{BP}$ ≥ 70%: 69.9
$P_{BP}$ > 50%: 80.8
Mathews. RNA. 10: 1178. (2004).

E. coli 5S rRNA Color Annotation:

Calculation is Fast:
Length (nt)   RNA                                               Time (H:min:sec)   Memory (MB)
433           Tetrahymena thermophila IVS LSU Group I Intron    0:00:02            39.6
1542          E. coli small subunit rRNA                        0:01:35            144.7
2904          E. coli large subunit rRNA                        0:11:02            430.3
Hardware: 3.4 GHz Intel i7, 4 cores, with 8 GB RAM; Microsoft Windows 7.

For Further Reading:
Xia, T., SantaLucia, J., Jr., Burkard, M. E., Kierzek, R., Schroeder, S. J., Jiao, X., Cox, C., & Turner, D. H. (1998). Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick pairs. Biochemistry. 37: 14719-14735.
Mathews, D. H., Sabina, J., Zuker, M., & Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. 288: 911-940.
Mathews, D. H., Schroeder, S. J., Turner, D. H., & Zuker, M. (2005). Predicting RNA secondary structure. In The RNA World, Third Edition (Gesteland, R. F., Cech, T. R., & Atkins, J. F., eds.), pp. 631-657. Cold Spring Harbor Laboratory Press. http://rna.cshl.edu/content/free/contents/rnaworld3e_toc.html
Mathews, D. H. (2006). Revolutions in RNA secondary structure prediction. J. Mol. Biol. 359: 526-532.
Eddy, S. R. (2004). How do RNA folding algorithms work? Nat. Biotechnol. 22: 1457-1458.

Summary: Comparative sequence analysis determines the common secondary structure for a set of homologous sequences. A dynamic programming algorithm can find the lowest free energy structure for a single sequence. A dynamic programming algorithm can be used to predict base pair probabilities using a partition function.

Lab on Thursday, 2/20: Meet here. Bring laptops. You will work in groups. There will be a quiz. I will be present the whole time, so feel free to come with questions about today's lecture.