DNA/RNA Structure Prediction

Centre for Integrative Bioinformatics VU. Master Course: DNA/Protein Structure-Function Analysis and Prediction. Lecture 12: DNA/RNA Structure Prediction

Epigenetics, Epigenomics: Gene Expression
Transcription factors (TFs) are essential for transcription initiation.
Transcription is carried out by RNA polymerase II (in eukaryotes).
mRNA must then move from the nucleus to the ribosomes (outside the nucleus) for translation.
In eukaryotes there can be many TF-binding sites upstream of an ORF that together regulate transcription.
Nucleosomes (chromatin structures composed of histones) are structures around which DNA coils. This blocks access of TFs.

Epigenetics, Epigenomics: Gene Expression
[Figure: a TF binding site closed by a nucleosome and an open TF binding site upstream of the TATA box; mRNA transcription.]

Expression
Because DNA is flexible, bound TFs can move in order to interact with Pol II, which is necessary for transcription initiation (see next slide).
A recent TF-based initiation theory includes a wave function (Carlsberg) of TF binding, which is supposed to go from left to right. In this way the TF-binding site nearest to the TATA box would be bound by a TF, which would then in turn bind Pol II.
It has been suggested that speckles have something to do with this (speckles are protein plaques observed in the nucleus).
Current prediction methods for gene co-expression, e.g. finding a single shared TF-binding site, do not take this TF cooperativity into account ("parking lot optimisation").

Expression
[Figure: TF binding sites, TF, Pol II, TATA box, mRNA transcription, mRNA, speckle.]
This is still a hypothetical model.

DNA/RNA Structure-Function relationships
Apart from coding for proteins via genes, DNA is now known to code for many more RNA-based cell components (snRNA, rRNA, ...).
Structural features of DNA (e.g. bendability, histone binding, methylation) are increasingly recognised as important.
For the many different classes of RNA molecules, structure directly determines function.
It is therefore important to analyse and predict DNA structure, and RNA structure in particular.

Canonical base pairs
The complementary bases, C-G and A-U, form stable base pairs with each other through the creation of hydrogen bonds between donor and acceptor sites on the bases. These are called Watson-Crick base pairs and are also referred to as canonical base pairs. In addition, we consider the weaker G-U wobble pair, where the bases bond in a skewed fashion. Other base pairs also occur, some of which are stable. These are all called noncanonical base pairs.
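The pairing rules above can be written down as a small lookup. This is only an illustrative sketch: the function name `pair_type` and the three labels it returns are assumptions made for this example, not part of the lecture.

```python
# Watson-Crick (canonical) pairs and the G-U wobble pair, as listed in the text.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def pair_type(x, y):
    """Classify a pair of RNA bases (hypothetical helper for illustration)."""
    pair = (x.upper(), y.upper())
    if pair in WATSON_CRICK:
        return "watson-crick"
    if pair in WOBBLE:
        return "wobble"
    return "noncanonical"


if __name__ == "__main__":
    for a, b in [("G", "C"), ("A", "U"), ("G", "U"), ("A", "G")]:
        print(a, b, pair_type(a, b))
```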

RNA secondary structure
The secondary structure of an RNA molecule is the collection of base pairs that occur in its 3-dimensional structure. An RNA sequence will be represented as $R = r_1, r_2, r_3, \ldots, r_n$, where $r_i$ is called the i-th (ribo)nucleotide. Each $r_i$ belongs to the set {a, c, g, u}.

Secondary Structure and Pseudoknots
A secondary structure, or folding, on R is a set S of ordered pairs, written as i-j, satisfying:
1. $j - i > 4$
2. If i-j and i'-j' are two base pairs (assuming without loss of generality that $i \le i'$), then either:
   1. $i = i'$ and $j = j'$ (they are the same base pair),
   2. $i < j < i' < j'$ (i-j precedes i'-j'), or
   3. $i < i' < j' < j$ (i-j includes i'-j').
The last condition excludes pseudoknots. These occur when two base pairs, i-j and i'-j', satisfy $i < i' < j < j'$.
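The conditions on a secondary structure can be checked mechanically. The sketch below assumes base pairs are given as 1-based tuples (i, j) with i < j; the names `is_valid_structure` and `has_pseudoknot` are hypothetical helpers introduced here.

```python
from itertools import combinations


def is_valid_structure(pairs, min_gap=4):
    """Return True if the pairs satisfy j - i > min_gap and are pairwise
    either disjoint (one precedes the other) or nested (one includes the other)."""
    pairs = sorted(set(pairs))  # duplicates are the "same base pair" case
    if any(j - i <= min_gap for i, j in pairs):
        return False
    for (i, j), (k, l) in combinations(pairs, 2):
        # Sorting guarantees i <= k, matching the "without loss of generality" step.
        precedes = i < j < k < l
        includes = i < k < l < j
        if not (precedes or includes):
            return False
    return True


def has_pseudoknot(pairs):
    """Return True if two pairs cross, i.e. i < i' < j < j'."""
    pairs = sorted(set(pairs))
    return any(i < k < j < l for (i, j), (k, l) in combinations(pairs, 2))


if __name__ == "__main__":
    print(is_valid_structure([(1, 10), (2, 9)]))   # nested helix
    print(has_pseudoknot([(1, 10), (5, 15)]))      # crossing pairs
```

Sorting the pairs first is a design choice that lets the code test only the two orderings named in the definition.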

Pseudoknots
Pseudoknots are not taken into account in secondary structure prediction because energy-minimizing methods cannot deal with them. It is not known how to assign energies to the loops created by pseudoknots, and dynamic programming methods that compute minimum-energy structures break down. For this reason, pseudoknots are often considered as belonging to tertiary structure. Nevertheless, pseudoknots are real and important structural features, and covariance methods (next slide) are able to predict them from aligned, homologous RNA sequences. The figure on the next slide represents a small pseudoknot model.

A 3D model of a pseudoknot

A 3D model of a pseudoknot
The two helices in the structure (preceding slide) are stacked coaxially.
RNA structure can be predicted from sequence data. There are two basic routes:
1. The first attempts structure prediction of single sequences based on minimizing the free energy of folding.
2. The second computes common foldings for a family of aligned, homologous RNAs. Usually, the alignment and secondary structure inference must be performed simultaneously, or at least iteratively (see next slide).

Predicting RNA Secondary Structure
By the thermodynamics method: minimize the Gibbs free energy.
By the phylogenetic comparison method (covariance method): compare RNA sequences of identical function from different organisms.
By a combination of the above two methods: in principle, this could be the most powerful method.

Thermodynamics: Gibbs Free Energy, G
G describes the energetics of biomolecules in aqueous solution. The change in free energy, ΔG, for a chemical process, such as nucleic acid folding, can be used to determine the direction of the process:
ΔG = 0: equilibrium
ΔG > 0: unfavorable process
ΔG < 0: favorable process
Thus the natural tendency for biomolecules in solution is to minimize the free energy of the entire system (biomolecules + solvent).

Thermodynamics
$$\Delta G = \Delta H - T \Delta S$$
ΔH is the change in enthalpy, ΔS is the change in entropy, and T is the temperature in Kelvin.
Molecular interactions, such as hydrogen bonds, van der Waals and electrostatic interactions, contribute to the ΔH term. ΔS describes the change of order of the system.
Thus, both molecular interactions and the order of the system determine the direction of a chemical process.
For any nucleic acid solution, it is extremely difficult to calculate the free energy from first principles.
Biophysical methods can be used to measure free energy changes.
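As a worked example of the free energy relation, the sketch below evaluates the free energy change at body temperature and the temperature at which it crosses zero. The enthalpy and entropy values are made up for illustration (they are not measured parameters), and the function names are hypothetical.

```python
def delta_g(delta_h, delta_s, temperature_k):
    """Free energy change: dG = dH - T * dS (units: kcal/mol, kcal/(mol K), K)."""
    return delta_h - temperature_k * delta_s


def zero_crossing_temperature(delta_h, delta_s):
    """Temperature at which dG = 0, i.e. T = dH / dS (assumes dS != 0)."""
    return delta_h / delta_s


if __name__ == "__main__":
    # Illustrative values only: an enthalpically favorable, entropically
    # unfavorable folding process.
    dh = -50.0    # kcal/mol
    ds = -0.140   # kcal/(mol K)
    print("dG at 37 C:", delta_g(dh, ds, 310.15), "kcal/mol")
    print("dG = 0 at:", zero_crossing_temperature(dh, ds), "K")
```

With these numbers folding is favorable (negative free energy change) at 37 °C and becomes unfavorable above roughly 357 K, which illustrates how the enthalpy and entropy terms compete.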

Thermodynamics: The Equilibrium Partition Function
For a population of structures S, a partition function Q and the probability $P_s$ of a particular folding s can be calculated:
$$Q = \sum_{s \in S} e^{-\Delta G_s / RT}, \qquad P_s = \frac{e^{-\Delta G_s / RT}}{Q}$$
The free energy of the ensemble follows from Q, and the heat capacity of the RNA can be obtained from it:
$$\Delta G = -RT \ln Q, \qquad C_p = -T \frac{\partial^2 \Delta G}{\partial T^2}$$
The heat capacity $C_p$ (heat required to change the temperature by 1 degree) can be measured experimentally, and can then be used to get information on ΔG.
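A minimal numerical sketch of the partition function: given the folding free energies of a handful of structures, it computes Q, the Boltzmann probability of each structure, and the ensemble free energy -RT ln Q. The three energies are invented for illustration, and the function name `boltzmann` is an assumption; a real calculation sums over all structures with dynamic programming (McCaskill's algorithm) instead of enumerating them.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol K)


def boltzmann(energies, temperature_k=310.15):
    """Return (Q, probabilities, ensemble free energy) for a list of
    structure free energies in kcal/mol."""
    rt = R_KCAL * temperature_k
    weights = [math.exp(-g / rt) for g in energies]
    q = sum(weights)
    probabilities = [w / q for w in weights]
    ensemble_g = -rt * math.log(q)
    return q, probabilities, ensemble_g


if __name__ == "__main__":
    # Illustrative energies: the open chain (0) and two folded structures.
    q, probs, g = boltzmann([0.0, -1.0, -3.0])
    print("Q =", q)
    print("probabilities =", probs)
    print("ensemble free energy =", g, "kcal/mol")
```

The ensemble free energy is always at or below the lowest single-structure energy, because Q contains at least the Boltzmann weight of that structure.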

Zuker's Energy Minimization Method (mfold)
An RNA sequence is written $R = \{r_1, r_2, r_3, \ldots, r_n\}$, where $r_i$ is the i-th ribonucleotide and belongs to the set {A, U, G, C}.
A secondary structure of R is a set S of base pairs i.j that satisfies:
1. $1 \le i < j \le n$;
2. $j - i > 4$ (a hairpin loop cannot contain fewer than 4 nucleotides);
3. if i.j and i'.j' are two base pairs (assume $i \le i'$), then either
   » $i = i'$ and $j = j'$ (same base pair),
   » $i < j < i' < j'$ (i.j precedes i'.j'), or
   » $i < i' < j' < j$ (i.j includes i'.j').
   This excludes pseudoknots, for which $i < i' < j < j'$.
If e(i,j) is the energy of the base pair i.j, the total energy of a structure S on R is
$$E(S) = \sum_{i.j \in S} e(i,j)$$
The objective is to minimize E(S).

Zuker's Energy Minimization Method (mfold): Free Energy Parameters
An extensive database of free energies has been obtained for the following RNA units (the so-called Tinoco rules and Turner rules):
  Single-strand stacking energy
  Canonical (AU, GC) and non-canonical (GU) base pairs in duplexes
Accurate free energy parameters are still lacking for:
  Loops
  Mismatches (AA, CA, etc.)
Using these energy parameters, the current version of mfold can predict ~73% of phylogenetically deduced secondary structures.

Dynamic Programming (mfold)
[Figure: an example of W(i,j).]
A matrix W(i,j) is computed that depends on the experimentally measured base-pair energy e(i,j). Once the matrix is filled, the structure is read off by a recursion that begins with i = 1, j = n:
1. If W(i+1,j) = W(i,j), then i is not paired. Set i = i+1 and start the recursion again.
2. If W(i,j-1) = W(i,j), then j is not paired. Set j = j-1 and start the recursion again.
3. If W(i,j) = W(i,k) + W(k+1,j), the fragment k+1..j is put on a stack and the fragment i..k is analyzed by setting j = k and going back to the beginning of the recursion.
4. If W(i,j) = e(i,j) + W(i+1,j-1), the base pair i.j is identified and added to the list; set i = i+1 and j = j-1.
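The four cases above can be sketched in code. This is a simplified illustration, not mfold itself: it uses the per-base-pair energy model from the preceding slide with invented pair energies (G-C = -3, A-U = -2, G-U = -1), whereas mfold uses loop-based nearest-neighbor parameters. The names `fill`, `traceback`, `PAIR_ENERGY` and `MIN_GAP` are assumptions for this example.

```python
# Toy pair energies, assumed for illustration only (not measured values).
PAIR_ENERGY = {("G", "C"): -3, ("C", "G"): -3,
               ("A", "U"): -2, ("U", "A"): -2,
               ("G", "U"): -1, ("U", "G"): -1}
MIN_GAP = 4  # a pair i.j requires j - i > 4


def e(seq, i, j):
    """Energy of pair i.j (0-based indices), or None if the pair is not allowed."""
    if j - i <= MIN_GAP:
        return None
    return PAIR_ENERGY.get((seq[i], seq[j]))


def fill(seq):
    """Fill W so that W[i][j] is the minimum energy of the fragment i..j."""
    n = len(seq)
    W = [[0] * n for _ in range(n)]
    for span in range(MIN_GAP + 1, n):
        for i in range(n - span):
            j = i + span
            best = min(W[i + 1][j], W[i][j - 1])       # i or j unpaired
            for k in range(i + 1, j):                  # split into two fragments
                best = min(best, W[i][k] + W[k + 1][j])
            pair = e(seq, i, j)
            if pair is not None:                       # i pairs with j
                best = min(best, pair + W[i + 1][j - 1])
            W[i][j] = best
    return W


def traceback(seq, W):
    """Recover one optimal set of base pairs (1-based) using an explicit stack."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_GAP:
            continue                                   # fragment too short to pair
        if W[i + 1][j] == W[i][j]:                     # case 1: i unpaired
            stack.append((i + 1, j))
        elif W[i][j - 1] == W[i][j]:                   # case 2: j unpaired
            stack.append((i, j - 1))
        else:
            for k in range(i + 1, j):                  # case 3: bifurcation
                if W[i][j] == W[i][k] + W[k + 1][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
            else:                                      # case 4: i.j is a base pair
                pairs.append((i + 1, j + 1))
                stack.append((i + 1, j - 1))
    return sorted(pairs)


if __name__ == "__main__":
    seq = "GGGAAAACCC"
    W = fill(seq)
    print("minimum energy:", W[0][len(seq) - 1])
    print("base pairs:", traceback(seq, W))
```

For the hairpin-forming test sequence the traceback returns the three nested G-C pairs. When several structures share the minimum energy, the order in which the cases are tested decides which one is reported.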

Suboptimal Folding (mfold)
For any sequence of N nucleotides, the expected number of structures is greater than $1.8^N$.
A sequence of 100 nucleotides has about $3 \times 10^{25}$ foldings. If a computer could calculate 1000 structures per second, it would take about $10^{15}$ years!
mfold generates suboptimal foldings whose free energies fall within a certain range of values. Many of these structures differ only in trivial ways.
These suboptimal foldings can still be useful for designing experiments.
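The arithmetic behind these figures can be reproduced in a few lines. The function name `brute_force_years` is a hypothetical helper; the growth factor 1.8 and the rate of 1000 structures per second are the values quoted on the slide.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def brute_force_years(n, rate_per_second=1000.0, growth=1.8):
    """Return (number of structures, years needed to enumerate them)
    under the lower-bound estimate growth**n."""
    structures = growth ** n
    years = structures / rate_per_second / SECONDS_PER_YEAR
    return structures, years


if __name__ == "__main__":
    structures, years = brute_force_years(100)
    print(f"structures: {structures:.2e}")
    print(f"years at 1000 structures/s: {years:.2e}")
```

Both numbers agree with the orders of magnitude quoted on the slide, which is why exhaustive enumeration is replaced by dynamic programming.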

A computer-predicted folding of Bacillus subtilis RNase P RNA
[Figure: three representations of the predicted folding.] These three representations are equivalent.

Secondary Structure Prediction for Aligned RNA Sequences
Energy and RNA sequence covariation can be combined to predict RNA secondary structures.
To quantify sequence covariation, let $f_i(X)$ be the frequency of base X at aligned position i and $f_{ij}(XY)$ be the frequency of finding X at position i and Y at position j. The mutual information score is (Chiu & Kolodziejczak and Gutell & Woese):
$$M_{ij} = \sum_{X,Y} f_{ij}(XY) \log \frac{f_{ij}(XY)}{f_i(X)\, f_j(Y)}$$
If, for instance, only GC and GU pairs occur at positions i and j, then one of the two positions is always G, so $f_{ij}(XY) = f_i(X)\, f_j(Y)$ and $M_{ij} = 0$.
The total energy for the RNA is set to a linear combination of the measured free energy plus the covariance contribution.
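The mutual information score can be computed directly from two alignment columns. In this sketch each column is a string with one base per aligned sequence; the logarithm base is not stated on the slide, so base 2 (bits) is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(column_i, column_j):
    """Mutual information (in bits) between two alignment columns of equal length."""
    n = len(column_i)
    f_i = Counter(column_i)
    f_j = Counter(column_j)
    f_ij = Counter(zip(column_i, column_j))
    score = 0.0
    for (x, y), count in f_ij.items():
        p_xy = count / n
        score += p_xy * math.log2(p_xy / ((f_i[x] / n) * (f_j[y] / n)))
    return score


if __name__ == "__main__":
    # Position i is always G and pairs with C or U: no covariation.
    print(mutual_information("GGGG", "CUCU"))
    # Both positions vary together and stay complementary: strong covariation.
    print(mutual_information("GCAU", "CGUA"))
```

The first case reproduces the GC/GU example from the slide, where the score is zero because one column is invariant.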

Other Secondary Structure Prediction Methods
Nussinov algorithm (historically important), Hogeweg and Hesper (1984)
Vienna: http://www.tbi.univie.ac.at/~ivo/rna/
  Uses the same recursive method in searching the folding space
  Adds the option of computing the population of RNA secondary structures by the equilibrium partition function
  The specific heat of an RNA can be calculated by numerical differentiation from the equilibrium partition function
RNACAD: http://www.cse.ucsc.edu/research/compbio/ssurrna.html
  An effort to improve multiple RNA sequence alignment by taking into account both primary and secondary structure information
  Uses stochastic context-free grammars (SCFGs), an extension of the hidden Markov model (HMM) method
Bundschuh, R., and Hwa, T. (1999) RNA secondary structure formation: A solvable model of heteropolymer folding. Physical Review Letters 83, 1479-1482.
  This work treats RNA as a heteropolymer and uses a simplified Go-like model to provide an exact solution for the RNA transition between its native and molten phases.

Running mfold
http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi
Constraints can be entered, each on one line in the constraint box:
1. Force bases i, i+1, ..., i+k-1 to be double stranded by entering: F i 0 k
2. Force the consecutive base pairs i.j, i+1.j-1, ..., i+k-1.j-k+1 by entering: F i j k
3. Force bases i, i+1, ..., i+k-1 to be single stranded by entering: P i 0 k
4. Prohibit the consecutive base pairs i.j, i+1.j-1, ..., i+k-1.j-k+1 by entering: P i j k
5. Prohibit bases i to j from pairing with bases k to l by entering: P i-j k-l
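To make the constraint notation concrete, the sketch below formats a forced-helix line and lists the base pairs it stands for. These helpers are hypothetical conveniences written for this example; they only build the text described above and are not part of mfold.

```python
def force_helix_line(i, j, k):
    """Constraint line forcing the k consecutive pairs i.j, i+1.j-1, ..."""
    return f"F {i} {j} {k}"


def prohibit_helix_line(i, j, k):
    """Constraint line prohibiting the k consecutive pairs i.j, i+1.j-1, ..."""
    return f"P {i} {j} {k}"


def helix_pairs(i, j, k):
    """Expand a helix constraint into its base pairs: i.j down to i+k-1.j-k+1."""
    return [(i + t, j - t) for t in range(k)]


if __name__ == "__main__":
    print(force_helix_line(1, 21, 2))
    print(helix_pairs(1, 21, 2))
```

For example, the line F 1 21 2 forces the two stacked pairs 1.21 and 2.20.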

Running mfold
5'-CUUGGAUGGGUGACCACCUGGG-3'
[Figure: predicted foldings with no constraint and with the constraint F 1 21 2 entered.]

Predicting RNA 3D Structures
Currently available RNA 3D structure prediction programs make use of the fact that a tertiary structure is built upon preformed secondary structures.
So once a solid secondary structure can be predicted, it is possible to predict its 3D structure.
The chances of obtaining a valid 3D structure can be increased by known spatial constraints among the different secondary structure segments (e.g. from cross-linking or NMR results).
However, far fewer thermodynamic data are available for 3D RNA structures, which makes 3D structure prediction challenging.

Mc-Sym
Mc-Sym uses a backtracking method to solve a general problem in computer science called the constraint satisfaction problem (CSP).
The backtracking algorithm organizes the search space as a tree in which each node corresponds to the application of an operator.
At each application, if the partially folded RNA structure is consistent with its RNA conformational database, the next operator is applied; otherwise the entire attached branch is pruned and the algorithm backtracks to the previous node.
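The backtracking idea can be illustrated with a generic constraint satisfaction solver. This is a toy sketch, not Mc-Sym: the "conformations" are integer positions on a line, and the consistency test (neighbours one unit apart, no two units at the same position) is invented to stand in for a check against a conformational database. All names are hypothetical.

```python
def backtrack(domains, consistent, partial=None):
    """Assign one value per variable, in order, pruning inconsistent branches.

    domains[n] lists the candidate values for variable n; consistent(partial)
    checks a partial assignment. Returns a full assignment or None.
    """
    partial = [] if partial is None else partial
    if len(partial) == len(domains):
        return list(partial)
    for value in domains[len(partial)]:
        partial.append(value)                 # apply the next operator
        if consistent(partial):
            result = backtrack(domains, consistent, partial)
            if result is not None:
                return result
        partial.pop()                         # prune this branch and backtrack
    return None


def chain_is_consistent(partial):
    """Toy constraint: no clashes, and neighbours exactly one unit apart."""
    last = partial[-1]
    if last in partial[:-1]:
        return False
    return len(partial) < 2 or abs(last - partial[-2]) == 1


if __name__ == "__main__":
    domains = [[0], [1, -1], [0, 2], [1, 3]]
    print(backtrack(domains, chain_is_consistent))
```

Checking consistency after every single assignment is what allows an entire subtree to be discarded as soon as one placement fails.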

Mc-Sym (Continued)
The selection of a spanning tree for a particular RNA is left to the user, but it is suggested that the nucleotides imposing the most constraints be introduced first.
Users also supply a particular Mc-Sym conformation for each nucleotide. These conformers are derived from currently available 3D databases.

Sample script: Mc-Sym (Continued)
  SEQUENCE
  1 A r GAAUGCCUGCGAGCAUC CC
  ;;
  DECLARE
  ;;
  1 helixa *
  2 helixa *
  3 helixa *
  4 helixa *
  5 helixa *
  6 helixa *
  19 helixa *
  ;;
  ;; RELATIONS
  ;;
  18 helix * 19
  17 helix * 18
  16 helix * 17
  .
  5 helix * 6
  4 helix * 5
  3 helix * 4
  2 helix * 3
  1 helix * 2
  ;;
  BUILD
  ;
  19 18 17 16 15 14 13 12 12 11 10 9 8 7 6 5 5 4 3 2 1
  ;;
  CONSTRAINTS
  ;; (enter experimental constraints)
  18 2 3.0

RNA-protein Interactions
There is currently no computational method that can predict RNA-protein interaction interfaces.
Statistical methods have been applied to identify structural features at the protein-RNA interface. For instance, ENTANGLE finds that most atoms contributed by a protein to recognizing an RNA come from the main chain (C, O, N, H), not from side chains! But much remains to be done.
Electrostatic potential is of primary importance in protein-RNA recognition because of the negatively charged phosphate backbone. Efforts are being made to quantify the electrostatic potential at the molecular surface of a protein and an RNA in order to predict the site of RNA interaction. This often provides a good prediction, at least for the site on the protein.

References
Predicting RNA secondary structures (good reviews):
1. Turner, D. H., and Sugimoto, N. (1988) RNA structure prediction. Annu Rev Biophys Biophys Chem 17, 167-92.
2. Zuker, M. (2000) Calculating nucleic acid secondary structure. Curr Opin Struct Biol 10, 303-10.
Obtaining experimental thermodynamic parameters:
3. Xia, T., SantaLucia, J., Jr., Burkard, M. E., Kierzek, R., Schroeder, S. J., Jiao, X., Cox, C., and Turner, D. H. (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs. Biochemistry 37, 14719-35.
4. Borer, P. N., Dengler, B., Tinoco, I., Jr., and Uhlenbeck, O. C. (1974) Stability of ribonucleic acid double-stranded helices. J Mol Biol 86, 843-53.
Thermodynamic theory for RNA structure prediction:
5. Bundschuh, R., and Hwa, T. (1999) RNA secondary structure formation: A solvable model of heteropolymer folding. Physical Review Letters 83, 1479-1482.
6. McCaskill, J. S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 29, 1105-19.