RNA Secondary Structure Prediction

RNA structure prediction methods: Base-Pair Maximization, Context-Free Grammar Parsing, Free Energy Methods, Covariance Models.

The Nussinov-Jacobson Algorithm: the dynamic-programming matrix for the example sequence ACAGUUGCA (q = 9); each entry gives the maximum number of base pairs for the corresponding subsequence.

SCFG Version: the Nussinov algorithm can be converted to a stochastic context-free grammar:
  S → W
  W → aW | cW | gW | uW
  W → Wa | Wc | Wg | Wu
  W → aWu | cWg | uWa | gWc
  W → WW

SCFGs: Stochastic Context-Free Grammars (SCFGs) have also been used to model RNA secondary structure. An example is the tRNAscan-SE program, created to find tRNAs. The grammars are created using a training set of data, and then applied to candidate sequences to see whether they fit into the language.

SCFGs allow the detection of sequences belonging to a family: tRNAs, group I introns, snoRNAs, snRNAs.

SCFGs: any RNA structure can be reduced to an SCFG (see Durbin et al., pp. 278-279).

Transformational Grammars: first described by the linguist Noam Chomsky in the 1950s. (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)

Transformational Grammars: very important in computer science, most notably in compiler design; covered in detail in compiler and automata classes.

Transformational Grammars: the idea is to take a set of outputs (a sentence, an RNA structure) and determine whether it can be produced by a set of rules. Grammars consist of a set of symbols and production rules; the symbols can be terminal (emitting) symbols or non-terminal symbols.

Grammar for Palindromes: consider palindromic DNA sequences. There are five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol).

Grammar for Palindromes: production rules, where S and W are non-terminal symbols:
  S → W
  W → aWa | cWc | gWg | tWt
  W → a | c | g | t | ε

Derivation of Sequences: using these production rules, a derivation of the palindromic sequence acttgttca follows:
  S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca
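
As a quick illustration of this grammar (not part of the original slides), here is a minimal Python sketch that tests whether a DNA string is derivable from S; the function name is_grammar_palindrome is our own.

# Minimal sketch of membership in the palindrome grammar:
#   S -> W,  W -> aWa | cWc | gWg | tWt,  W -> a | c | g | t | epsilon
def is_grammar_palindrome(s: str) -> bool:
    """True if s can be derived from S in the palindrome grammar."""
    if len(s) <= 1:
        return all(ch in "acgt" for ch in s)   # W -> epsilon, or W -> a | c | g | t
    # W -> aWa | cWc | gWg | tWt: first and last symbols must match
    return s[0] in "acgt" and s[0] == s[-1] and is_grammar_palindrome(s[1:-1])

print(is_grammar_palindrome("acttgttca"))  # True: the derivation shown above
print(is_grammar_palindrome("acttgtca"))   # False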

SCFGs for RNA: base-paired columns are modeled by pairwise-emitting non-terminals (aWu, uWa, gWc, cWg, ...); single-stranded columns are modeled by leftwise-emitting non-terminals (aW, cW, gW, uW, ...) when possible.

Parse Trees: a context-free grammar can be aligned to a sequence using a parse tree. The root of the tree is the non-terminal start symbol S, the leaves are terminal symbols, and the internal nodes are the non-terminals. The leaves can be read from left to right to view the result of the productions.

Parse Tree: for the palindrome acttgttca, the root S derives a chain of W non-terminals; each W emits a matching pair of terminals and the innermost W emits the central g, so the leaves read from left to right give a c t t g t t c a.

Amirkabir University of Technology, Faculty of Computer Engineering. CYK (Cocke-Younger-Kasami) Parsing Algorithm. Seyed Mohammad Hossein Moattar. Natural Language Processing.

Parsing Algorithms: CFGs are the basis for describing the (syntactic) structure of natural-language sentences; thus parsing algorithms are at the core of natural-language analysis systems. Recognition vs. parsing: recognition means deciding membership in the language, while parsing means recognition plus producing a parse tree for the input. Is parsing more difficult than recognition (in terms of time complexity)? Ambiguity: an input may have exponentially many parses.

CYK (Cocke-Younger-Kasami): one of the earliest recognition and parsing algorithms. The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF); it is also possible to extend the CYK algorithm to handle some grammars which are not in CNF, although these extensions are harder to understand. The algorithm is based on a dynamic programming approach: build solutions compositionally from sub-solutions, and store sub-solutions so they can be re-used whenever necessary. Recognition version: decide whether S ⇒* w.

CYK Algorithm: the CYK algorithm for the membership problem is as follows. Let the input string be a sequence of n letters a1 ... an. Let the grammar contain r terminal and nonterminal symbols R1 ... Rr, and let R1 be the start symbol. Let P[n,n,r] be an array of booleans; initialize all elements of P to false.

For each i = 1 to n
  For each unit production Rj -> ai, set P[i,1,j] = true.
For each i = 2 to n                  -- Length of span
  For each j = 1 to n-i+1            -- Start of span
    For each k = 1 to i-1            -- Partition of span
      For each production RA -> RB RC
        If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
If P[1,n,1] is true
  Then the string is a member of the language
Else the string is not a member of the language

CYK Pseudocode: on input x = x1 x2 ... xn:

for (i = 1 to n)                     // create middle diagonal
  for (each var. A)
    if (A -> xi) add A to table[i-1][i]
for (d = 2 to n)                     // d-th diagonal
  for (i = 0 to n-d)
    for (k = i+1 to i+d-1)
      for (each var. A)
        for (each var. B in table[i][k])
          for (each var. C in table[k][i+d])
            if (A -> BC) add A to table[i][i+d]
return S ∈ table[0][n] ? ACCEPT : REJECT

CYK Algorithm: this algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the subsequence starting at i, of length j, can be generated from Rk. Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two halves, and checks whether there is some production RA -> RB RC such that RB matches the first half and RC matches the second half. If so, it records RA as matching the whole subsequence. Once this process is completed, the string is recognized by the grammar if the subsequence spanning the entire string is matched by the start symbol.
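
To make the description concrete, here is a minimal Python sketch of the CYK recognizer (our own illustration, not from the slides); cyk_recognize and the rule-list layout are assumed names. It is shown with the CNF grammar of the worked example that follows.

# Minimal CYK recognizer for a grammar in Chomsky Normal Form.
# table[(i, j)] holds the set of variables that derive the substring x[i:j].
def cyk_recognize(x, unit_rules, binary_rules, start="S"):
    n = len(x)
    table = {}
    for i in range(n):                         # length-1 spans: rules A -> a
        table[(i, i + 1)] = {A for (A, a) in unit_rules if a == x[i]}
    for length in range(2, n + 1):             # longer spans, shortest first
        for i in range(n - length + 1):
            j = i + length
            cell = set()
            for k in range(i + 1, j):          # split point
                for (A, B, C) in binary_rules:
                    if B in table[(i, k)] and C in table[(k, j)]:
                        cell.add(A)
            table[(i, j)] = cell
    return n > 0 and start in table[(0, n)]

# Grammar of the example below: S -> AB | XB, T -> AB | XB, X -> AT, A -> a, B -> b
unit = [("A", "a"), ("B", "b")]
binary = [("S", "A", "B"), ("S", "X", "B"),
          ("T", "A", "B"), ("T", "X", "B"),
          ("X", "A", "T")]
print(cyk_recognize("aaabbb", unit, binary))   # True
print(cyk_recognize("aaabb", unit, binary))    # False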

CYK Algorithm for Deciding Context-Free Languages. Q: Consider the grammar G given by
  S → AB | XB
  T → AB | XB
  X → AT
  A → a
  B → b
1. Is x = aaabbb in L(G)?

Now look at the string: a a a b b b

1) Write variables for all length 1 substrings: each a gets variable A and each b gets variable B, giving A A A B B B under a a a b b b.

2) Write variables for all length 2 substrings: only the substring ab (positions 3-4) gets variables S,T.

3) Write variables for all length 3 substrings: the substring aab (positions 2-4) gets variable X.

4) Write variables for all length 4 substrings: the substring aabb (positions 2-5) gets variables S,T.

5) Write variables for all length 5 substrings: the substring aaabb (positions 1-5) gets variable X.

6) Write variables for all length 6 substrings: the whole string aaabbb gets variables S,T. S is included, so the string is accepted!

Can also use a table for the same purpose, with rows giving the start position (0-5) and columns the end position (1-6) of each substring.

1. Variables for length 1 substrings: entries (start 0, end 1), (1, 2) and (2, 3) get A; entries (3, 4), (4, 5) and (5, 6) get B.

2. Variables for length 2 substrings: entry (2, 4) gets S,T; all other length-2 entries are empty.

3. Variables for length 3 substrings: entry (1, 4) gets X.

4. Variables for length 4 substrings: entry (1, 5) gets S,T.

5. Variables for length 5 substrings: entry (0, 5) gets X.

6. Variables for length 6 substrings: entry (0, 6) gets S,T, so the string is ACCEPTED! The completed table:

start \ end   1:   2:   3:   4:   5:   6:
0:            A    -    -    -    X    S,T
1:                 A    -    X    S,T  -
2:                      A    S,T  -    -
3:                           B    -    -
4:                                B    -
5:                                     B

Parsing results: we keep the results for every substring w_ij in a table. Note that we only need to fill in entries up to the diagonal: the longest substring starting at i is of length n-i+1.

Constructing parse trees: to construct parse trees for a string w, the idea is to keep back-pointers to the table entries that we combine, and at the end reconstruct a parse from the back-pointers. This also allows us to find all parse trees.
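
A small sketch of that idea (our own, building on the cyk_recognize example earlier): store, for each variable placed in a cell, the split point and children that produced it, then follow the pointers from the start symbol over the whole string to rebuild one parse tree.

# Sketch: CYK with back-pointers, returning one parse tree as nested tuples.
def cyk_parse(x, unit_rules, binary_rules, start="S"):
    n = len(x)
    back = {}                                          # (i, j, A) -> terminal or (k, B, C)
    for i in range(n):
        for (A, a) in unit_rules:
            if a == x[i]:
                back[(i, i + 1, A)] = x[i]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C) in binary_rules:
                    if (i, k, B) in back and (k, j, C) in back:
                        back.setdefault((i, j, A), (k, B, C))   # remember one derivation
    def build(i, j, A):
        entry = back[(i, j, A)]
        if isinstance(entry, str):                     # leaf: A -> terminal
            return (A, entry)
        k, B, C = entry                                # internal node: A -> B C
        return (A, build(i, k, B), build(k, j, C))
    return build(0, n, start) if (0, n, start) in back else None

unit = [("A", "a"), ("B", "b")]
binary = [("S", "A", "B"), ("S", "X", "B"), ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]
print(cyk_parse("aaabbb", unit, binary))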

References:
Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation, Section 6.3, pp. 139-141.
CYK algorithm, Wikipedia, the free encyclopedia.
A presentation by Zeph Grunschlag.

The Nussinov-Jacobson Algorithm (revisited): the dynamic-programming matrix for the example sequence ACAGUUGCA (q = 9), shown again with the recursion cases for positions i < q illustrated on the matrix.

Two kinds of foldings are illustrated: co-terminus foldings, in which the two ends of the subsequence are base-paired (example: A U C A U G G C A U), and partitionable foldings, in which the structure splits into independently folded segments (example: A C A G U U G C A, positions 1-9).

Another way to write the Nussinov-Jacobson recursion.

Initialization:
  γ(i, i-1) = 0 for i = 2 to L
  γ(i, i) = 0 for i = 1 to L

Recursion:
  γ(i, j) = max of:
    γ(i+1, j);                                       (special case of partitionable folding)
    γ(i, j-1);                                       (special case of partitionable folding)
    γ(i+1, j-1) + BasePairScore(i, j);               (co-terminus folding)
    max over i < k < j of [ γ(i, k) + γ(k+1, j) ].   (partitionable folding)
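
As a hedged illustration of this recursion (our own sketch, not code from the slides), the following Python fills the γ matrix for a sequence such as ACAGUUGCA and traces back one optimal set of base pairs; scoring every allowed pair (including the G-U wobble) as +1 is an assumption of the sketch.

# Sketch of the Nussinov-Jacobson fill and traceback.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def base_pair_score(a, b):
    return 1 if (a, b) in PAIRS else 0                 # assumed scoring: +1 per allowed pair

def nussinov(seq):
    L = len(seq)
    g = [[0] * L for _ in range(L)]                    # g[i][j] = gamma(i, j), 0-based
    for span in range(1, L):                           # fill by increasing j - i
        for i in range(L - span):
            j = i + span
            best = max(g[i + 1][j],                                        # i unpaired
                       g[i][j - 1],                                        # j unpaired
                       g[i + 1][j - 1] + base_pair_score(seq[i], seq[j]))  # co-terminus
            for k in range(i + 1, j):                                      # partitionable
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    pairs, stack = [], [(0, L - 1)]                    # traceback for one optimal structure
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if g[i][j] == g[i + 1][j]:
            stack.append((i + 1, j))
        elif g[i][j] == g[i][j - 1]:
            stack.append((i, j - 1))
        elif base_pair_score(seq[i], seq[j]) and g[i][j] == g[i + 1][j - 1] + 1:
            pairs.append((i + 1, j + 1))               # report 1-based positions
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if g[i][j] == g[i][k] + g[k + 1][j]:
                    stack.extend([(i, k), (k + 1, j)])
                    break
    return g[0][L - 1], sorted(pairs)

print(nussinov("ACAGUUGCA"))                           # (max number of pairs, one pairing)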

SCFG version of the Nussinov-Jacobson algorithm: stochastic context-free grammars. It makes use of production rules such as W → aW | cW | gW | uW (position i unpaired). Every production rule has an associated probability parameter. The maximum-probability parse is equivalent to the maximum-probability secondary structure.

SCFG Version of the Nussinov-Jacobson Algorithm: the algorithm can be converted to a stochastic context-free grammar:
  S → W
  W → aW | cW | gW | uW
  W → Wa | Wc | Wg | Wu
  W → aWu | cWg | uWa | gWc
  W → WW

Needed terminology: the inside-outside (recursive dynamic programming) algorithm for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs. The best-path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm; it finds the maximum-probability alignment of the SCFG to the sequence.

CYK for Nussinov-style RNA SCFG.

Initialization:
  γ(i, i-1) = -∞ for i = 2 to L
  γ(i, i) = max[ log p(x_i W), log p(W x_i) ] for i = 1 to L

Recursion:
  γ(i, j) = max of:
    γ(i+1, j) + log p(x_i W);
    γ(i, j-1) + log p(W x_j);
    γ(i+1, j-1) + log p(x_i W x_j);                              (co-terminus folding)
    max over i < k < j of [ γ(i, k) + γ(k+1, j) ] + log p(WW).   (partitionable folding)

This mirrors the fill stage of the Nussinov algorithm; the principal difference is that the SCFG description is a probabilistic model.
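
A hedged Python sketch of this probabilistic fill (our own illustration): the same γ matrix is filled in log space, with each case adding the log probability of the corresponding production. The numerical rule probabilities below are arbitrary placeholders, not trained parameters, and the pair set follows the grammar above (no G-U rule).

import math

# Sketch of the CYK fill for the Nussinov-style SCFG (log space).
NEG_INF = float("-inf")
LOG_P_LEFT  = {b: math.log(0.05) for b in "ACGU"}      # W -> aW | cW | gW | uW
LOG_P_RIGHT = {b: math.log(0.05) for b in "ACGU"}      # W -> Wa | Wc | Wg | Wu
LOG_P_PAIR  = {("A", "U"): math.log(0.12), ("U", "A"): math.log(0.12),
               ("C", "G"): math.log(0.12), ("G", "C"): math.log(0.12)}  # W -> aWu | ...
LOG_P_BIFURCATION = math.log(0.12)                     # W -> WW

def scfg_cyk(seq):
    L = len(seq)
    g = [[NEG_INF] * L for _ in range(L)]              # gamma in log space
    for i in range(L):                                 # gamma(i, i): single residue
        g[i][i] = max(LOG_P_LEFT[seq[i]], LOG_P_RIGHT[seq[i]])
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            best = max(g[i + 1][j] + LOG_P_LEFT[seq[i]],          # x_i W
                       g[i][j - 1] + LOG_P_RIGHT[seq[j]],         # W x_j
                       g[i + 1][j - 1] + LOG_P_PAIR.get((seq[i], seq[j]), NEG_INF))  # x_i W x_j
            for k in range(i + 1, j):                             # W -> W W
                best = max(best, g[i][k] + g[k + 1][j] + LOG_P_BIFURCATION)
            g[i][j] = best
    return g[0][L - 1]                                 # log probability of the best parse

print(scfg_cyk("ACAGUUGCA"))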

CYK for Nussinov-style RNA SCFG (2): the resulting value log P(x, π̂) is the log likelihood of the optimal structure π̂ given the SCFG model. The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm.

Example of RNA Structure SCFG: the RNA structure for the following sequence (5' to 3'), as produced by MFOLD, can be constructed:
GCUUACGACCAUAUCACGUUGAAUGCACGCCAUCCCGUCCGAUCUGGCAAGUUAAGCAACGUUGAGUCCAGUUAGUACUUGGAUCGGAGACGGCCUGGGAAUCCUGGAUGUUGUAAGCU

Example Construction:
S ⇒ W ⇒ Wu ⇒ gWcu ⇒ gcWgcu ⇒ gcuWagcu ⇒ gcuuWaagcu ⇒ gcuuaWuaagcu ⇒ gcuuacWguaagcu ⇒ gcuuacgWuguaagcu ⇒ gcuuacgaWuuguaagcu ⇒ gcuuacgacWguuguaagcu ⇒ gcuuacgaccWguuguaagcu ⇒ gcuuacgaccaWguuguaagcu ⇒ ...

CYK for Nussinov-style RNA SCFG: this is a good starting example, but it is too simple to be an accurate RNA folder. The algorithm does not consider important structural features such as preferences for certain loop lengths, or nearest-neighbour effects caused by stacking interactions between neighbouring base pairs in a stem.