RNA secondary structure prediction. Farhat Habib

Similar documents
Algorithms in Bioinformatics

proteins are the basic building blocks and active players in the cell, and

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable

Combinatorial approaches to RNA folding Part II: Energy minimization via dynamic programming

Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters

CS681: Advanced Topics in Computational Biology

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17

CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models

RecitaLon CB Lecture #10 RNA Secondary Structure

BIOINFORMATICS. Fast evaluation of internal loops in RNA secondary structure prediction. Abstract. Introduction

Efficient Algorithm for RNA Pseudoknotted Secondary Structure Prediction Given a Pseudoknot Free Structure

RNA Folding Algorithms. Michal Ziv-Ukelson Ben Gurion University of the Negev

RNA Folding Algorithms. Michal Ziv-Ukelson Ben Gurion University of the Negev

BIOINF 4120 Bioinforma2cs 2 - Structures and Systems -

Supplementary Material

RNA Structure Prediction and Comparison. RNA folding

Combinatorial approaches to RNA folding Part I: Basics

DNA/RNA Structure Prediction

Rapid Dynamic Programming Algorithms for RNA Secondary Structure

Lab III: Computational Biology and RNA Structure Prediction. Biochemistry 208 David Mathews Department of Biochemistry & Biophysics

RNA Secondary Structure Prediction

Bachelor Thesis. RNA Secondary Structure Prediction

The Ensemble of RNA Structures Example: some good structures of the RNA sequence

COMBINATORICS OF LOCALLY OPTIMAL RNA SECONDARY STRUCTURES

BIOINFORMATICS. Prediction of RNA secondary structure based on helical regions distribution

Grand Plan. RNA very basic structure 3D structure Secondary structure / predictions The RNA world

The Multistrand Simulator: Stochastic Simulation of the Kinetics of Multiple Interacting DNA Strands

RNA$2 nd $structure$predic0on

Introduction to Polymer Physics

RNA Secondary Structure Prediction

RNA Folding and Interaction Prediction: A Survey

Sparse RNA folding revisited: space efficient minimum free energy structure prediction

Complete Suboptimal Folding of RNA and the Stability of Secondary Structures

Sparse RNA Folding Revisited: Space-Efficient Minimum Free Energy Prediction

BIOINF 4120 Bioinforma2cs 2 - Structures and Systems -

Algorithmic Aspects of RNA Secondary Structures

In Genomes, Two Types of Genes

Journal of Mathematical Analysis and Applications

Computing the partition function and sampling for saturated secondary structures of RNA, with respect to the Turner energy model

Predicting RNA Secondary Structure

Lecture 2: Pairwise Alignment. CG Ron Shamir

Structure-Based Comparison of Biomolecules

RNA Abstract Shape Analysis

A Method for Aligning RNA Secondary Structures

BCB 444/544 Fall 07 Dobbs 1

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Supplementary Material

RNA SECONDARY STRUCTURES AND THEIR PREDICTION 1. Centre de Recherche de MatMmatiques Appliqu6es, Universit6 de Montr6al, Montreal, Canada H3C 3J7

Pairwise RNA Edit Distance

Lecture 12. DNA/RNA Structure Prediction. Epigenectics Epigenomics: Gene Expression

REDUCING THE WORST CASE RUNNING TIMES OF A FAMILY OF RNA AND CFG PROBLEMS, USING VALIANT S APPROACH

RNA Secondary Structure Prediction Using Hierarchical Folding

arxiv: v3 [math.co] 17 Jan 2018

SA-REPC - Sequence Alignment with a Regular Expression Path Constraint

Bio nformatics. Lecture 23. Saad Mneimneh

A nucleotide-level coarse-grained model of RNA

Protein folding. α-helix. Lecture 21. An α-helix is a simple helix having on average 10 residues (3 turns of the helix)

Finding Consensus Energy Folding Landscapes Between RNA Sequences

DYNAMIC PROGRAMMING ALGORITHMS FOR RNA STRUCTURE PREDICTION WITH BINDING SITES

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained

A statistical sampling algorithm for RNA secondary structure prediction

Algorithms in Bioinformatics

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

Local Alignment: Smith-Waterman algorithm

The Multistrand Simulator: Stochastic Simulation of the Kinetics of Multiple Interacting DNA Strands

arxiv: v1 [q-bio.bm] 16 Aug 2015

Computer Science & Engineering 423/823 Design and Analysis of Algorithms

. A Phase Transition for the Minimum Free Energy of

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell.

Computational approaches for RNA energy parameter estimation

A Combinatorial Framework for Multiple RNA Interaction Prediction

University Alexandru Ioan Cuza of Iaşi Faculty of Computer Science. Support Vector Machines for MicroRNA Identification.

A two length scale polymer theory for RNA loop free energies and helix stacking

RNA Bioinformatics Beyond the One Sequence-One Structure Paradigm. Peter Schuster

RNA and Protein Structure Prediction

EECS730: Introduction to Bioinformatics

Using distance geomtry to generate structures

An alignmnet ends either with (1) a match/mismatch (2) a gap in the first sequence (3) a gap in the second sequence ATCGCT GGCATAC ATCGCT TTCCT A

Dynamic Programming 1

Biphasic Folding Kinetics of RNA Pseudoknots and Telomerase RNA Activity

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

Lecture 5: September Time Complexity Analysis of Local Alignment

D Dobbs ISU - BCB 444/544X 1

The wonderful world of RNA informatics

A Novel Statistical Model for the Secondary Structure of RNA

13 Comparative RNA analysis

Pairwise alignment, Gunnar Klau, November 9, 2005, 16:

Using SetPSO to determine RNA secondary structure

Final Chem 4511/6501 Spring 2011 May 5, 2011 b Name

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Efficient Cache-oblivious String Algorithms for Bioinformatics

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

CSE 202 Dynamic Programming II

Lecture 8: RNA folding

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

A Structure-Based Flexible Search Method for Motifs in RNA

Sparse RNA Folding: Time and Space Efficient Algorithms

Optimal Folding Of Bit Sliced Stacks+

Semi-Supervised CONTRAfold for RNA Secondary Structure Prediction: A Maximum Entropy Approach

Transcription:

RNA secondary structure prediction Farhat Habib

RNA RNA is similar to DNA chemically. It is usually only a single strand. T(hyamine) is replaced by U(racil) Some forms of RNA can form secondary structures by pairing up with itself. This can change its properties dramatically.! DNA and RNA can pair with each other. trna linear and 3D view: http://en.wikipedia.org/wiki/transfer_rna

Secondary structure Several interesting RNAs have a conserved secondary structure (resulting from basepairing interactions) Sometimes, the sequence itself may not be conserved for the function to be retained It is important to tell what the secondary structure is going to be, for homology detection

Basics of secondary G-C pairing: three bonds (strong) A-U pairing: two bonds (weaker) structure Base pairs are approximately coplanar Base pairs are stacked onto other base pairs (arranged side by side): stems

Secondary structure elements

Non-canonical base pairs G-C and A-U are the canonical base pairs G-U is also possible, almost as stable

Nesting Base pairs almost always occur in a nested fashion If positions i and j are paired, and positions i and j are paired, then these two base-pairings are said to be nested if: i < i < j < j OR i < i < j < j Non-nested base pairing: pseudoknot

Circle Plot

List RNA sequence across the top. List RNA sequence on the vertical side. Score G-C, A-U, and G-U pairs Long runs indicate hairpins and other structures with basepairing Dot Plot

Pseudoknots Pseudoknots are not handled by the algorithms we shall see Pseudoknots do occur in many important RNAs But the total number of pseudoknotted base pairs is typically relatively small

RNA Combinatorics The number of RNA secondary structures for the sequence [1,n] Approx 1.3 b structures for n = 27 Derived by Waterman

Nussinov s Algorithm Find the secondary structure with most base pairs. Recursive: finds best structure for small subsequences, and works its way outwards to larger subsequences

Nussinov s algorithm: idea There are only four possible ways of getting the best structure for subsequence (i,j) from the best structures of the smaller subsequences (1) Add unpaired position i onto best structure for subsequence (i+1,j) i+1 i j

Nussinov s algorithm: idea There are only four possible ways of getting the best structure for subsequence (i,j) from the best structures of the smaller subsequences (2) Add unpaired position j onto best structure for subsequence (i,j-1) i j-1 j

Nussinov s algorithm: idea There are only four possible ways of getting the best structure for subsequence (i,j) from the best structures of the smaller subsequences (3) Add (i,j) pair onto best structure for subsequence (i+1,j-1) i+1 j-1 i j

Nussinov s algorithm: idea There are only four possible ways of getting the best structure for subsequence (i, j) from the best structures of the smaller subsequences (4) Combine two optimal substructures (i, k) and (k+1, j) i k k+1 j

Summarizing

Nussinov RNA folding algorithm Given a sequence s of length L with symbols s 1 s L. Let δ(i,j) = 1 if s i and s j are a complementary base pair, and 0 otherwise. We recursively calculate scores g(i,j) which are the maximal number of base pairs that can be formed for subsequence s i s j.

Recursion Starting with all subsequences of length 2, to length L Initialization S(i,i-1) = 0 S(i,i) = 0

Traceback As usual in sequence alignment? Optimal sequence alignment is a linear path in the dynamic programming table Optimal secondary structure can have bifurcations

Traceback Input: Matrix S and positions i, j. Output: Secondary structure maximizing the number of base pairs. Initial call: traceback(i = 1, j = L). if i < j then if S(i, j) = S(i + 1, j) then // case (1) traceback(i + 1, j) else if S(i, j) = S(i, j 1) then // case (2) traceback(i, j 1) else if S(i, j) = S(i + 1, j 1) + w(i, j) then // case (3) print base pair (i, j) traceback(i + 1, j 1) else for k = i + 1 to j 1 do // case (4) if S(i, j) = S(i, k) + S(k + 1, j) then traceback(i, k) traceback(k + 1, j) break end

Secondary structure prediction Zuker s algorithm Based on minimization of G, the equilibrium free energy, rather than maximization of number of base pairs Better fit to real (experimental) G Energy of stem is sum of stacking contributions from the interface between neighboring base pairs

One way is to assign an energy to each base pair in a secondary structure. That is, there is a function e such that e(r i, r j ) is the energy of a base pair. The energy, E(S), of the entire structure, is then given by: Reasonable values of e at are -3, -2 and -1 kcal/ mole for GC, AU and GU base pairs, respectively.

Base pair dependent energy rules Let W=min(W s), where s ranges over all secondary structures Energy for pairing r i with r j is given by e(r i, r j ) Compute W ij matrix as

4nt loop +5.9 1nt bulge +3.3 dangle -0.3 A U U A G-C G-C A G-C U-A A-U C-G A-U A A -1.1 terminal mismatch of hairpin -2.9 stack -2.9 stack (special case) -1.8 stack -0.9 stack -1.8 stack -2.1 stack Neighboring base pairs: stack Single bulges OK in stacking Longer bulges: no stacking term Loop destabilisation energy Loop terminal mismatch energy

Energy contributions es(i,j): Free energy of stacked pair (i,j) and (i+1,j-1) eh(i,j): Free energy of a loop closed by (i,j): depends on length of loop, bases at i,j, and bases adjacent to them el(i, j, i, j ): Free energy of an internal loop or bulge, with (i, j) and (i, j ) being the bordering base pairs. Depends on bases at these positions, and adjacent unpaired bases em(i,j,i 1,j 1, i k,j k ): Free energy of a multibranch loop with (i,j) as the closing base pair and i 1 j 1 etc as the internal base pairs

Zuker s algorithm W(j): FE of optimal structure of s(1..j) V(i,j): FE of optimal structure of s(i..j) assuming i, j form a base pair VBI(i,j): FE of optimal structure of s(i..j) assuming i, j closes a bulge or internal loop VM(i,j): FE of optimal structure of s(i..j) assuming i, j closes a multibranch loop WM(i,j): used to compute VM

Dynamic programming recurrences W(j): FE of optimal structure of s(1..j) W(0) = 0 W(j) = min( W(j-1), s[j] is external base (a base not in any loop) min 1 i<j V(i,j)+W(i-1)) s(j) pairs with s(i), for some i < j

Dynamic programming recurrences V(i,j): FE of optimal structure of s(i..j) assuming i, j form a base pair V(i,j) = infinity if i >= j V(i,j) = min( eh(i,j), es(i,j) + V(i+1,j-1), i,j is exterior base pair of a hairpin loop VBI(i,j), VM(i,j)) if i < j

Dynamic programming recurrences V(i,j): FE of optimal structure of s(i..j) assuming i, j form a base pair V(i,j) = infinity if i >= j V(i,j) = min( eh(i,j), i,j is exterior pair of a stacked pair. i+1,j-1 is therefore a pair too. es(i,j) + V(i+1,j-1), VBI(i,j), VM(i,j)) if i < j

Dynamic programming recurrences V(i,j): FE of optimal structure of s(i..j) assuming i, j form a base pair V(i,j) = infinity if i >= j V(i,j) = min( eh(i,j), es(i,j) + V(i+1,j-1), i,j is exterior pair of a bulge or interior loop VBI(i,j), VM(i,j)) if i < j

Dynamic programming recurrences V(i,j): FE of optimal structure of s(i..j) assuming i, j form a base pair V(i,j) = infinity if i >= j V(i,j) = min( eh(i,j), es(i,j) + V(i+1,j-1), i,j is exterior pair of a multibranch loop VBI(i,j), VM(i,j)) if i < j

Dynamic programming recurrences VBI(i, j): FE of optimal structure of s(i..j) assuming i,j closes a bulge or internal loop VBI(i, j) = min (el(i, j, i, j ) + V(i, j )) i,j i<i <j <j Energy of the bulge Slow!

Dynamic programming recurrences VM(i,j): FE of optimal structure of s(i..j) assuming i, j closes a multibranch loop VM(i,j) = min (em(i,j,i 1,j 1,..,i k,j k ) + h V(i h,j h )) k 2 i 1,j 1, i k,j k Very slow! Energy of the loop itself

Order of computation What order to fill the DP table in? Increasing order of (j-i) VBI(i, j) and VM(i, j) before V(i, j)