RNA Secondary Structure Prediction

RNA structure prediction methods
- Base-Pair Maximization
- Context-Free Grammar Parsing
- Free Energy Methods
- Covariance Models

The Nussinov-Jacobson Algorithm
q = 9, sequence A C A G U U G C A (positions 1 to 9)

      1  2  3  4  5  6  7  8  9
      A  C  A  G  U  U  G  C  A
1 A   0  0  0  1  2  2  2  3
2 C   0  0  0  1  1  1  2  2  3
3 A      0  0  0  1  1  1  2  3
4 G         0  0  0  0  0  1  2
5 U            0  0  0  0  1  2
6 U               0  0  0  1  2
7 G                  0  0  1  1
8 C                     0  0  0
9 A                        0  0

(The entry for row 1, column 9 is not filled in.)
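The fill that produces a matrix like the one above can be sketched in Python. This is a minimal sketch, assuming Watson-Crick pairs only (A-U and G-C), a score of 1 per pair, and no minimum loop length; the slide does not state these rules, and the names `can_pair` and `nussinov` are illustrative.

```python
def can_pair(a, b):
    """Assumed pairing rule: Watson-Crick pairs only."""
    return (a, b) in {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov(seq):
    """Return gamma, where gamma[i][j] is the maximum number of base
    pairs in seq[i..j] (0-based, inclusive). Cells with i > j stay 0."""
    n = len(seq)
    gamma = [[0] * n for _ in range(n)]
    for span in range(1, n):              # span = j - i
        for i in range(n - span):
            j = i + span
            # i unpaired, or j unpaired
            best = max(gamma[i + 1][j], gamma[i][j - 1])
            # i pairs with j (gamma[i + 1][j - 1] is 0 when i + 1 > j - 1)
            if can_pair(seq[i], seq[j]):
                best = max(best, gamma[i + 1][j - 1] + 1)
            # bifurcation into two independent substructures
            for k in range(i + 1, j):
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma


gamma = nussinov("ACAGUUGCA")
for i, row in enumerate(gamma):
    print(i + 1, row[i:])                 # upper triangle, row by row
```

Under these assumptions the sketch reproduces the filled entries of the matrix, and the top-right cell `gamma[0][8]` gives the score for the whole sequence.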

SCFG Version
The Nussinov algorithm can be converted to a stochastic context-free grammar:
S → W
W → aW | cW | gW | uW
W → Wa | Wc | Wg | Wu
W → aWu | cWg | uWa | gWc
W → WW

SCFGs
- Stochastic Context-Free Grammars (SCFGs) have also been used to model RNA secondary structure
- Examples:
  - tRNAscan-SE
  - a program created to find snoRNAs
- Grammars are created from a training set of data and are then applied to candidate sequences to see whether they fit into the language

SCFGs
SCFGs allow the detection of sequences belonging to a family:
- tRNAs
- group I introns
- snoRNAs
- snRNAs

SCFGs
Any RNA structure can be reduced to an SCFG (see Durbin et al., pp. 278-279)

Transformational Grammars
First described by linguist Noam Chomsky in the 1950s. (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)

13 June 2006 9

Transformational Grammars
- Very important in computer science, most notably in compiler design
- Covered in detail in compiler and automata classes

Transformational Grammars
- Idea: take an output (a sentence, an RNA structure) and determine whether it can be produced using a set of rules
- A grammar consists of a set of symbols and production rules
- The symbols can be terminal (emitting) symbols or non-terminal symbols

Grammar for Palindromes
- Consider palindromic DNA sequences
- Five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol)

Grammar for Palindromes
Production rules, where S and W are non-terminal symbols:
S → W
W → aWa | cWc | gWg | tWt
W → a | c | g | t | ε

Derivation of Sequences
Using these production rules, a derivation of the palindromic sequence acttgttca follows:
S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca
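The derivation above can be reproduced mechanically. The following is a minimal Python sketch, assuming the palindrome grammar with the rules W → xWx, W → x, and W → ε for x in {a, c, g, t}; the function name `derive_palindrome` is illustrative.

```python
def derive_palindrome(seq):
    """Return the list of sentential forms S, W, ..., seq for a
    palindromic DNA string, or None if seq is not in the language."""
    if any(ch not in "acgt" for ch in seq):
        return None
    steps = ["S", "W"]                    # S -> W
    left, right = "", ""
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        if seq[lo] != seq[hi]:
            return None                   # no rule W -> xWy with x != y
        left += seq[lo]                   # apply W -> xWx
        right = seq[hi] + right
        steps.append(left + "W" + right)
        lo += 1
        hi -= 1
    # Final step: W -> x (odd length) or W -> epsilon (even length)
    steps.append(seq)
    return steps


print(" => ".join(derive_palindrome("acttgttca")))
```

For a sequence that is not a palindrome the function returns `None`, which corresponds to the sequence not being in the language of the grammar.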

SCFGs for RNA
- Base-paired columns are modeled by pairwise-emitting nonterminals: aWu; uWa; gWc; cWg; ...
- Single-stranded columns are modeled by leftwise-emitting nonterminals when possible: aW; cW; gW; uW; ...

Parse Trees
- A context-free grammar can be aligned to a sequence using a parse tree
- The root of the tree is the non-terminal start symbol, S
- Leaves are terminal symbols
- Internal nodes are the nonterminals
- Leaves can be read from left to right to view the results of production

Parse Tree
[Figure: parse tree for the sequence a c t t g t t c a, with root S and five W nodes.]
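A parse tree of this kind can be represented with nested tuples. The sketch below assumes the palindrome grammar (S → W, W → xWx, W → x, W → ε); the tuple layout `(label, child, ...)` and the helper names are illustrative choices, not part of the slides.

```python
def parse_tree(seq):
    """Build the parse tree of a palindrome as nested tuples.
    Each node is (label, child, ...); terminals are plain strings."""
    def build_w(s):
        if len(s) <= 1:
            return ("W", s)               # W -> x, or W -> epsilon for ""
        if s[0] != s[-1]:
            raise ValueError("not a palindrome")
        return ("W", s[0], build_w(s[1:-1]), s[-1])   # W -> xWx
    return ("S", build_w(seq))


def leaves(node):
    """Read the terminal symbols from left to right."""
    if isinstance(node, str):
        return node
    return "".join(leaves(child) for child in node[1:])


def count_label(node, label):
    """Count the internal nodes that carry the given nonterminal label."""
    if isinstance(node, str):
        return 0
    return (node[0] == label) + sum(count_label(c, label) for c in node[1:])


tree = parse_tree("acttgttca")
print(leaves(tree), count_label(tree, "W"))
```

Reading the leaves from left to right recovers the sequence, and the tree for acttgttca contains one S node above a chain of W nodes, as in the figure.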

Amirkabir University of Technology, Department of Computer Engineering
CYK (Cocke-Younger-Kasami) Parsing Algorithm
Seyed Mohammad Hossein Moattar
Natural Language Processing

Parsing Algorithms
- CFGs are the basis for describing the (syntactic) structure of NL sentences
- Thus, parsing algorithms are the core of NL analysis systems
- Recognition vs. parsing:
  - Recognition: deciding membership in the language
  - Parsing: recognition + producing a parse tree
- Is parsing more difficult than recognition? (time complexity)
- Ambiguity: an input may have exponentially many parses

CYK (Cocke-Younger-Kasami)
- One of the earliest recognition and parsing algorithms
- The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF)
- It is also possible to extend the CYK algorithm to handle some grammars that are not in CNF
  - Harder to understand
- Based on a dynamic programming approach:
  - Build solutions compositionally from sub-solutions
  - Store sub-solutions and re-use them whenever necessary
- Recognition version: decide whether S ⇒* w

CYK Algorithm
The CYK algorithm for the membership problem is as follows:

Let the input string be a sequence of n letters a1 ... an.
Let the grammar contain r terminal and nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
For each i = 1 to n
    For each unit production Rj -> ai, set P[i,1,j] = true.
For each i = 2 to n                   -- Length of span
    For each j = 1 to n-i+1           -- Start of span
        For each k = 1 to i-1         -- Partition of span
            For each production RA -> RB RC
                If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
If P[1,n,1] is true
    Then string is member of language
    Else string is not member of language

CYK Pseudocode
On input x = x1 x2 ... xn:

for (i = 1 to n)                          // create middle diagonal
    for (each var. A)
        if (A → xi) add A to table[i-1][i]
for (d = 2 to n)                          // d-th diagonal
    for (i = 0 to n-d)
        for (k = i+1 to i+d-1)
            for (each var. A)
                for (each var. B in table[i][k])
                    for (each var. C in table[k][i+d])
                        if (A → BC) add A to table[i][i+d]
return S ∈ table[0][n] ? ACCEPT : REJECT

CYK Algorithm
This algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to true if the subsequence of length j starting at position i can be generated from Rk. Once it has considered subsequences of length 1, it goes on to subsequences of length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts, and checks whether there is some production A -> B C such that B matches the first part and C matches the second part. If so, it records A as matching the whole subsequence. Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.
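The boolean-array formulation described above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: it uses 0-based start positions, represents the grammar as lists of tuples (a representation chosen here), and uses as test data the grammar S → AB | XB, T → AB | XB, X → AT, A → a, B → b.

```python
def cyk_recognize(word, unit_rules, binary_rules, start="S"):
    """CYK membership test for a grammar in Chomsky Normal Form.
    unit_rules:   list of (A, a)     for productions A -> a
    binary_rules: list of (A, B, C)  for productions A -> B C"""
    n = len(word)
    if n == 0:
        return False                      # this sketch has no epsilon rule
    symbols = sorted({A for A, _ in unit_rules}
                     | {s for rule in binary_rules for s in rule}
                     | {start})
    idx = {s: k for k, s in enumerate(symbols)}
    # P[j][i][a]: the substring of length i starting at j derives from symbol a
    P = [[[False] * len(symbols) for _ in range(n + 1)] for _ in range(n)]
    for j, ch in enumerate(word):
        for A, a in unit_rules:
            if a == ch:
                P[j][1][idx[A]] = True
    for i in range(2, n + 1):             # length of span
        for j in range(n - i + 1):        # start of span
            for k in range(1, i):         # partition of span
                for A, B, C in binary_rules:
                    if P[j][k][idx[B]] and P[j + k][i - k][idx[C]]:
                        P[j][i][idx[A]] = True
    return P[0][n][idx[start]]


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"),
                ("X", "A", "T")]

print(cyk_recognize("aaabbb", UNIT_RULES, BINARY_RULES))
print(cyk_recognize("aab", UNIT_RULES, BINARY_RULES))
```

The three nested loops over span length, start, and partition give the cubic dependence on the input length, with an additional factor for the number of productions.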

CYK Algorithm for Deciding Context Free Languages
Q: Consider the grammar G given by
S → AB | XB
T → AB | XB
X → AT
A → a
B → b
1. Is x in L(G)?

Now look at aaabbb. For each length from 1 to 6 in turn, write the variables for all substrings of that length ("-" marks a substring with no variable):

                   a     a     a     b     b     b
1) length 1:       A     A     A     B     B     B
2) length 2:          -     -    S,T    -     -
3) length 3:             -     X     -     -
4) length 4:                -    S,T    -
5) length 5:                   X     -
6) length 6:                     S,T

S is included for the length 6 substring, so aaabbb is accepted!

CYK Algorithm for Deciding Context Free Languages
Can also use a table for the same purpose. Rows give the start position and columns give the end position. Steps 1 to 6 fill in the variables for substrings of length 1 to 6, one diagonal at a time:

start at \ end at   1:    2:    3:    4:    5:    6:
0:                  A     -     -     -     X     S,T
1:                        A     -     X     S,T   -
2:                              A     S,T   -     -
3:                                    B     -     -
4:                                          B     -
5:                                                B

6. Variables for aaabbb: S is in the entry for start 0, end 6. ACCEPTED!
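The start/end table for aaabbb can be reproduced with a short Python sketch of the set-valued formulation. The dictionary keyed by `(start, end)` and the tuple encoding of the grammar are choices made for this sketch.

```python
def cyk_table(word, unit_rules, binary_rules):
    """Return a dict mapping (start, end) to the set of variables that
    derive word[start:end], filled one diagonal at a time."""
    n = len(word)
    # Length 1 substrings: variables A with a production A -> word[i]
    table = {(i, i + 1): {A for A, a in unit_rules if a == word[i]}
             for i in range(n)}
    for d in range(2, n + 1):             # substring length (d-th diagonal)
        for i in range(n - d + 1):        # start position
            cell = set()
            for k in range(i + 1, i + d): # split point
                for A, B, C in binary_rules:
                    if B in table[(i, k)] and C in table[(k, i + d)]:
                        cell.add(A)
            table[(i, i + d)] = cell
    return table


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"),
                ("X", "A", "T")]

table = cyk_table("aaabbb", UNIT_RULES, BINARY_RULES)
for start in range(6):
    row = [",".join(sorted(table[(start, end)])) or "-"
           for end in range(start + 1, 7)]
    print(f"{start}:", " ".join(row))
print("ACCEPT" if "S" in table[(0, 6)] else "REJECT")
```

Each printed row lists the entries for one start position, from the shortest substring to the longest, so the accepting entry is the last item of row 0.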

Parsing results
- We keep the results for every substring w_ij in a table
- Note that we only need to fill in entries up to the diagonal: the longest substring starting at i is of length n-i+1

Constructing parse tree
- We need to construct parse trees for the string w
- Idea:
  - Keep back-pointers to the table entries that we combine
  - At the end, reconstruct a parse from the back-pointers
- This allows us to find all parse trees
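The back-pointer idea can be sketched in Python as below. Note the simplification: this sketch stores only the first back-pointer found for each entry and therefore reconstructs a single parse tree; keeping a list of back-pointers per entry would be needed to enumerate all of them. The grammar encoding and the names `cyk_parse` and `build` are illustrative.

```python
def cyk_parse(word, unit_rules, binary_rules, start="S"):
    """Return one parse tree as nested tuples, or None if word is rejected."""
    n = len(word)
    # back[(i, j, A)] is either the terminal character (for A -> a)
    # or a triple (k, B, C) meaning A -> B C with the split at k.
    back = {}
    for i, ch in enumerate(word):
        for A, a in unit_rules:
            if a == ch:
                back[(i, i + 1, A)] = ch
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for A, B, C in binary_rules:
                    if (i, k, B) in back and (k, j, C) in back:
                        back.setdefault((i, j, A), (k, B, C))  # keep first split

    def build(i, j, A):
        entry = back[(i, j, A)]
        if isinstance(entry, str):
            return (A, entry)
        k, B, C = entry
        return (A, build(i, k, B), build(k, j, C))

    return build(0, n, start) if (0, n, start) in back else None


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"),
                ("X", "A", "T")]

print(cyk_parse("aaabbb", UNIT_RULES, BINARY_RULES))
```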

References
- Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation, Section 6.3, pp. 139-141
- CYK algorithm, Wikipedia, the free encyclopedia
- A presentation by Zeph Grunschlag

The Nussinov-Jacobson Algorithm
[Figure: the matrix for the sequence A C A G U U G C A (positions 1 to 9), annotated with a row index i < q and the columns j, q-1, and q.]

Co-terminus foldings: [figure: folding of the sequence A U C A U G G C A U]
Partitionable foldings: [figure: folding of the sequence A C A G U U G C A, positions 1 to 9]

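Another way to write the Nussinov-Jacobson recursion
Initialization:
$$\gamma(i, i-1) = 0 \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = 0 \quad \text{for } i = 1 \text{ to } L$$
Recursion:
$$\gamma(i, j) = \max \begin{cases}
\gamma(i+1, j); & \text{(special case of partitionable folding)} \\
\gamma(i, j-1); & \text{(special case of partitionable folding)} \\
\gamma(i+1, j-1) + \mathrm{BasePairScore}(i, j); & \text{(co-terminus folding)} \\
\max_{i < k < j} \left[ \gamma(i, k) + \gamma(k+1, j) \right]. & \text{(partitionable folding)}
\end{cases}$$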
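As a worked instance of this recursion, the top-right entry for the sequence A C A G U U G C A can be evaluated from entries already present in the matrix shown earlier. This assumes that BasePairScore(i, j) is 1 for A-U and G-C pairs and 0 otherwise, which the slides do not state explicitly.

```latex
\begin{aligned}
\gamma(2,9) &= 3, \qquad \gamma(1,8) = 3, \\
\gamma(2,8) + \mathrm{BasePairScore}(1,9) &= 2 + 0 = 2
  \quad (\text{A and A do not pair}), \\
\gamma(1,5) + \gamma(6,9) &= 2 + 2 = 4 \quad (k = 5), \\
\gamma(1,9) &= \max\{3,\ 3,\ 2,\ 4\} = 4 .
\end{aligned}
```

The maximum comes from the partition at k = 5, which splits the sequence into A C A G U and U G C A, each contributing two base pairs.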

SCFG version of the Nussinov-Jacobson algorithm
- Stochastic Context-Free Grammars
- Makes use of production rules such as W → aW | cW | gW | uW (i unpaired)
- Every production rule has an associated probability parameter
- The maximum probability parse is equivalent to the maximum probability secondary structure

SCFG Version of Nussinov-Jacobson Algorithm
The algorithm can be converted to a stochastic context-free grammar:
S → W
W → aW | cW | gW | uW
W → Wa | Wc | Wg | Wu
W → aWu | cWg | uWa | gWc
W → WW

Needed terminology
- The inside-outside (recursive dynamic programming) algorithm for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs
- The best-path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm. It finds the maximum probability alignment of the SCFG to the sequence

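CYK for Nussinov-style RNA SCFG
An addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model.
Initialization:
$$\gamma(i, i-1) = -\infty \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = \max \{ \log p(x_i W),\ \log p(W x_i) \} \quad \text{for } i = 1 \text{ to } L$$
Recursion:
$$\gamma(i, j) = \max \begin{cases}
\gamma(i+1, j) + \log p(x_i W); & \text{(special case of partitionable folding)} \\
\gamma(i, j-1) + \log p(W x_j); & \text{(special case of partitionable folding)} \\
\gamma(i+1, j-1) + \log p(x_i W x_j); & \text{(co-terminus folding)} \\
\max_{i < k < j} \left[ \gamma(i, k) + \gamma(k+1, j) + \log p(WW) \right]. & \text{(partitionable folding)}
\end{cases}$$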
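The log-probability fill can be sketched in Python as follows. The rule probabilities used here are hypothetical values chosen only so that the rules for W sum to 1; they are not trained parameters and do not come from the slides. The function and parameter names are likewise illustrative.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else NEG_INF


def cyk_scfg(seq, p_left, p_right, p_pair, p_bif):
    """Fill gamma[i][j] with the best log probability of seq[i..j].
    p_left[x]      = p(W -> xW)     p_right[x] = p(W -> Wx)
    p_pair[(x, y)] = p(W -> xWy)    p_bif      = p(W -> WW)"""
    n = len(seq)
    # Cells with i > j stay at -infinity, i.e. gamma(i, i-1) = -inf
    gamma = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        gamma[i][i] = max(log(p_left[seq[i]]), log(p_right[seq[i]]))
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(
                gamma[i + 1][j] + log(p_left[seq[i]]),       # i unpaired
                gamma[i][j - 1] + log(p_right[seq[j]]),      # j unpaired
                gamma[i + 1][j - 1]
                + log(p_pair.get((seq[i], seq[j]), 0.0)),    # i, j pair
            )
            for k in range(i + 1, j):                        # bifurcation
                best = max(best, gamma[i][k] + gamma[k + 1][j] + log(p_bif))
            gamma[i][j] = best
    return gamma


# Hypothetical parameters: 4 * 0.05 + 4 * 0.05 + 4 * 0.125 + 0.10 = 1.0
P_LEFT = {x: 0.05 for x in "ACGU"}
P_RIGHT = {x: 0.05 for x in "ACGU"}
P_PAIR = {("A", "U"): 0.125, ("U", "A"): 0.125,
          ("G", "C"): 0.125, ("C", "G"): 0.125}
P_BIF = 0.10

gamma = cyk_scfg("ACAGUUGCA", P_LEFT, P_RIGHT, P_PAIR, P_BIF)
print(gamma[0][-1])   # log probability of the best parse of the whole sequence
```

Because the entries below the diagonal are minus infinity, a pair between adjacent positions scores minus infinity in this formulation, unlike the counting version where those entries are 0.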

CYK for Nussinov-style RNA SCFG (2)
- $\log P(x, \hat{\pi})$ is the log likelihood of the optimal structure $\hat{\pi}$ given the SCFG model
- The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm
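Since the SCFG traceback is described as analogous to the Nussinov traceback, the sketch below shows the latter on the base-pair-counting matrix. It assumes Watson-Crick pairs only and a score of 1 per pair; the dot-bracket output format and the names `fill` and `traceback` are choices made for this sketch. When several optimal structures exist, it returns just one of them.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}  # assumed pairing rule


def fill(seq):
    """Nussinov fill: gamma[i][j] = maximum number of pairs in seq[i..j]."""
    n = len(seq)
    gamma = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = 1 if (seq[i], seq[j]) in PAIRS else 0
            best = max(gamma[i + 1][j], gamma[i][j - 1],
                       gamma[i + 1][j - 1] + score)
            for k in range(i + 1, j):
                best = max(best, gamma[i][k] + gamma[k + 1][j])
            gamma[i][j] = best
    return gamma


def traceback(seq, gamma):
    """Recover one optimal structure in dot-bracket notation."""
    structure = ["."] * len(seq)
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if gamma[i + 1][j] == gamma[i][j]:            # i unpaired
            stack.append((i + 1, j))
        elif gamma[i][j - 1] == gamma[i][j]:          # j unpaired
            stack.append((i, j - 1))
        elif ((seq[i], seq[j]) in PAIRS
              and gamma[i + 1][j - 1] + 1 == gamma[i][j]):  # i pairs with j
            structure[i], structure[j] = "(", ")"
            stack.append((i + 1, j - 1))
        else:                                         # bifurcation
            for k in range(i + 1, j):
                if gamma[i][k] + gamma[k + 1][j] == gamma[i][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return "".join(structure)


seq = "ACAGUUGCA"
print(seq)
print(traceback(seq, fill(seq)))
```

For the SCFG version, the equality tests would compare log-probability sums instead of pair counts, which in floating point usually calls for stored back-pointers or a tolerance rather than exact equality.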

Example of RNA Structure SCFG
The RNA structure produced by MFOLD for the following sequence can be constructed (5′ to 3′):
GCUUACGACCAUAUCACGUUGAAUGCAC
GCCAUCCCGUCCGAUCUGGCAAGUUAAG
CAACGUUGAGUCCAGUUAGUACUUGGAU
CGGAGACGGCCUGGGAAUCCUGGAUGU
UGUAAGCU

Example Construction
S ⇒ W
  ⇒ Wu
  ⇒ gWcu
  ⇒ gcWgcu
  ⇒ gcuWagcu
  ⇒ gcuuWaagcu
  ⇒ gcuuaWuaagcu
  ⇒ gcuuacWguaagcu
  ⇒ gcuuacgWuguaagcu
  ⇒ gcuuacgaWuuguaagcu
  ⇒ gcuuacgacWguuguaagcu
  ⇒ gcuuacgaccWguuguaagcu
  ⇒ gcuuacgaccaWguuguaagcu
  ⇒ ...
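The derivation steps shown above can be replayed by rewriting the single W in each sentential form. In this Python sketch the list of right-hand sides was read off from consecutive steps of the derivation; note that one step uses gWu, a G-U pair that is not among the four pairing rules listed for the grammar. The function name `apply_rules` is illustrative.

```python
def apply_rules(rules, start="W"):
    """Apply each right-hand side in turn to the single W in the current
    sentential form and return every form along the way."""
    forms = [start]
    for rhs in rules:
        current = forms[-1]
        if "W" not in current:
            raise ValueError("no nonterminal left to rewrite")
        forms.append(current.replace("W", rhs, 1))
    return forms


# Right-hand sides used in the first steps of the derivation
RULES = ["Wu", "gWc", "cWg", "uWa", "uWa", "aWu",
         "cWg", "gWu", "aWu", "cWg", "cW", "aW"]

forms = apply_rules(RULES)
print(" => ".join(["S"] + forms))
```

The outer terminals of each form grow inward from both ends of the sequence, so the prefix and suffix of the last form match the 5′ and 3′ ends of the example sequence.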

CYK for Nussinov-style RNA SCFG
- Good starting example, but it is too simple to be an accurate RNA folder
- The algorithm does not consider important structural features, such as preferences for certain:
  - Loop lengths
  - Nearest neighbours in the structure, caused by stacking interactions between neighbouring base pairs in a stem