RNA Secondary Structure Prediction

RNA structure prediction methods
- Base-pair maximization
- Context-free grammar parsing
- Free energy methods
- Covariance models

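The Nussinov-Jacobson Algorithm (q = 9)

Rows i and columns j are positions 1-9 of the sequence A C A G U U G C A. Each row from row 2 on starts at its initialization cell (i, i-1). Cell (1, 9) is blank on the slide.

 i\j      1  2  3  4  5  6  7  8  9
          A  C  A  G  U  U  G  C  A
 1  A     0  0  0  1  2  2  2  3
 2  C     0  0  0  1  1  1  2  2  3
 3  A        0  0  0  1  1  1  2  3
 4  G           0  0  0  0  0  1  2
 5  U              0  0  0  0  1  2
 6  U                 0  0  0  1  2
 7  G                    0  0  1  1
 8  C                       0  0  0
 9  A                          0  0
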
SCFG Version

The Nussinov algorithm can be converted to a stochastic context-free grammar:

S → W
W → aW | cW | gW | uW
W → Wa | Wc | Wg | Wu
W → aWu | cWg | uWa | gWc
W → WW

SCFGs

Stochastic context-free grammars (SCFGs) have also been used to model RNA secondary structure.
Examples:
- tRNAscan-SE
- a program created to find snoRNAs
Grammars are created from a training set of data and then applied to candidate sequences to see whether they fit the language.

SCFGs

SCFGs allow the detection of sequences belonging to a family:
- tRNAs
- group I introns
- snoRNAs
- snRNAs

SCFGs

Any RNA structure can be reduced to an SCFG (see Durbin et al., pp. 278-279).

Transformational Grammars

First described by linguist Noam Chomsky in the 1950s. (Yes, the same Noam Chomsky who has expressed various dissident political views throughout the years!)

13 June 2006 9
Transformational Grammars

Very important in computer science, most notably in compiler design.
Covered in detail in compiler and automata courses.

Transformational Grammars

Idea: take an output (a sentence, an RNA structure) and determine whether it can be produced using a set of rules.
A grammar consists of a set of symbols and production rules.
The symbols can be terminal (emitting) symbols or non-terminal symbols.

Grammar for Palindromes

Consider palindromic DNA sequences.
Five possible terminal symbols: {a, c, g, t, ε} (ε represents the blank terminal symbol).

Grammar for Palindromes

Production rules, where S and W are non-terminal symbols:

S → W
W → aWa | cWc | gWg | tWt
W → a | c | g | t | ε

Derivation of Sequences

Using these production rules, a derivation of the palindromic sequence acttgttca follows:

S ⇒ W ⇒ aWa ⇒ acWca ⇒ actWtca ⇒ acttWttca ⇒ acttgttca

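The derivation can be replayed mechanically. The sketch below is illustrative only: it assumes the palindrome grammar has the rules W → aWa | cWc | gWg | tWt for the outer pairs and W → a | c | g | t | ε to close the derivation, and the function name `derive_palindrome` is hypothetical.

```python
def derive_palindrome(seq: str) -> list[str]:
    """Return the sentential forms S => W => ... => seq for a DNA palindrome.

    Assumes the rules W -> xWx (x in a, c, g, t) and W -> a | c | g | t | epsilon.
    Raises ValueError if seq cannot be derived.
    """
    steps = ["S", "W"]
    left, right = "", ""
    lo, hi = 0, len(seq) - 1
    while lo < hi:
        # Apply W -> xWx: the outermost unmatched symbols must be equal.
        if seq[lo] != seq[hi] or seq[lo] not in "acgt":
            raise ValueError(f"{seq!r} is not a palindrome over a, c, g, t")
        left += seq[lo]
        right = seq[hi] + right
        steps.append(left + "W" + right)
        lo += 1
        hi -= 1
    # Close with W -> x (odd length) or W -> epsilon (even length).
    middle = seq[lo] if lo == hi else ""
    if middle and middle not in "acgt":
        raise ValueError(f"{seq!r} contains a symbol outside a, c, g, t")
    steps.append(left + middle + right)
    return steps


print(" => ".join(derive_palindrome("acttgttca")))
```

Each loop iteration corresponds to one application of a pairwise rule, so the number of W-containing forms equals the number of matched symbol pairs plus one.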
SCFGs for RNA

Base-paired columns are modeled by pairwise-emitting nonterminals: aWu, uWa, gWc, cWg, ...
Single-stranded columns are modeled by leftwise-emitting nonterminals when possible: aW, cW, gW, uW, ...

Parse Trees

A context-free grammar can be aligned to a sequence using a parse tree.
- The root of the tree is the non-terminal start symbol, S
- Leaves are terminal symbols
- Internal nodes are the nonterminals
- Leaves can be read from left to right to view the results of production

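Parse Tree

(Figure: parse tree for the sequence a c t t g t t c a. The root is S, the internal nodes are five W nonterminals, and the leaves read a c t t g t t c a from left to right.)
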
Amirkabir University of Technology, Faculty of Computer Engineering
CYK (Cocke-Younger-Kasami) Parsing Algorithm
Seyed Mohammad Hossein Moattar
Natural Language Processing

Parsing Algorithms

CFGs are the basis for describing the (syntactic) structure of natural-language (NL) sentences, so parsing algorithms are at the core of NL analysis systems.
Recognition vs. parsing:
- Recognition: deciding membership in the language
- Parsing: recognition + producing a parse tree
Is parsing more difficult than recognition (in time complexity)?
Ambiguity: an input may have exponentially many parses.

CYK (Cocke-Younger-Kasami)

One of the earliest recognition and parsing algorithms.
The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF).
It is also possible to extend the CYK algorithm to handle some grammars that are not in CNF, but the result is harder to understand.
Based on a dynamic programming approach:
- Build solutions compositionally from sub-solutions
- Store sub-solutions and re-use them whenever necessary
Recognition version: decide whether S ⇒* w.

CYK Algorithm

The CYK algorithm for the membership problem is as follows:

Let the input string be a sequence of n letters a1 ... an.
Let the grammar contain r nonterminal symbols R1 ... Rr, and let R1 be the start symbol.
Let P[n,n,r] be an array of booleans. Initialize all elements of P to false.
For each i = 1 to n
    For each unit production Rj -> ai, set P[i,1,j] = true.
For each i = 2 to n                      -- length of span
    For each j = 1 to n-i+1              -- start of span
        For each k = 1 to i-1            -- partition of span
            For each production RA -> RB RC
                If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true
If P[1,n,1] is true
    Then the string is a member of the language
    Else the string is not a member of the language

CYK Pseudocode

On input x = x1 x2 ... xn:

for (i = 1 to n)                        // create middle diagonal
    for (each var. A)
        if (A → xi)
            add A to table[i-1][i]
for (d = 2 to n)                        // d-th diagonal
    for (i = 0 to n-d)
        for (k = i+1 to i+d-1)
            for (each var. A)
                for (each var. B in table[i][k])
                    for (each var. C in table[k][i+d])
                        if (A → BC)
                            add A to table[i][i+d]
return S ∈ table[0][n] ? ACCEPT : REJECT

CYK Algorithm

This algorithm considers every possible consecutive subsequence of the input and sets P[i,j,k] to true if the subsequence starting at position i with length j can be generated from Rk. Once it has considered subsequences of length 1, it goes on to length 2, and so on. For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two parts and checks whether there is some production P -> Q R such that Q matches the first part and R matches the second. If so, it records P as matching the whole subsequence. Once this process is complete, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol.

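The membership procedure described above can be sketched in Python. This is a minimal illustration, not a reference implementation: it indexes P by (start, length) with 0-based starts, stores sets of nonterminal names instead of a boolean array, and the names `cyk_recognize`, `UNIT_RULES` and `BINARY_RULES` are hypothetical. The sample grammar is the S → AB | XB grammar from the worked example in these slides.

```python
def cyk_recognize(word, unit_rules, binary_rules, start="S"):
    """Return True if `word` is generated by a CNF grammar.

    unit_rules:   iterable of (A, a) for productions A -> a
    binary_rules: iterable of (A, B, C) for productions A -> B C
    """
    n = len(word)
    if n == 0:
        return False  # this sketch assumes the grammar has no epsilon rule
    # P[(j, i)] = nonterminals deriving the span of length i starting at j.
    P = {(j, i): set() for i in range(1, n + 1) for j in range(n - i + 1)}
    for j, letter in enumerate(word):
        for A, a in unit_rules:
            if a == letter:
                P[(j, 1)].add(A)
    for i in range(2, n + 1):            # length of span
        for j in range(n - i + 1):       # start of span
            for k in range(1, i):        # length of the left part
                for A, B, C in binary_rules:
                    if B in P[(j, k)] and C in P[(j + k, i - k)]:
                        P[(j, i)].add(A)
    return start in P[(0, n)]


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

print(cyk_recognize("aaabbb", UNIT_RULES, BINARY_RULES))
print(cyk_recognize("aaabb", UNIT_RULES, BINARY_RULES))
```

The three nested loops over span length, start and split point give the usual cubic dependence on the input length, multiplied by the number of binary productions.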
CYK Algorithm for Deciding Context Free Languages

Q: Consider the grammar G given by

S → AB | XB
T → AB | XB
X → AT
A → a
B → b

1. Is x = aaabbb in L(G)?

CYK Algorithm for Deciding Context Free Languages

Now look at aaabbb, using the same grammar (S → AB | XB, T → AB | XB, X → AT, A → a, B → b). Build a pyramid over the string a a a b b b, one level per substring length:

1) Write variables for all length 1 substrings: A A A B B B.
2) Write variables for all length 2 substrings: S,T for ab (positions 3-4); nothing for aa or bb.
3) Write variables for all length 3 substrings: X for aab (positions 2-4).
4) Write variables for all length 4 substrings: S,T for aabb (positions 2-5).
5) Write variables for all length 5 substrings: X for aaabb (positions 1-5).
6) Write variables for all length 6 substrings: S,T for aaabbb. S is included, so the string is accepted!

CYK Algorithm for Deciding Context Free Languages

Can also use a table for the same purpose. Rows are start positions, columns are end positions, and the table is filled one diagonal (substring length) at a time:

1. Variables for length 1 substrings: A, A, A, B, B, B on the main diagonal.
2. Variables for length 2 substrings: S,T at (start 2, end 4).
3. Variables for length 3 substrings: X at (start 1, end 4).
4. Variables for length 4 substrings: S,T at (start 1, end 5).
5. Variables for length 5 substrings: X at (start 0, end 5).
6. Variables for the length 6 substring: S,T at (start 0, end 6). ACCEPTED!

Final table ("-" marks an empty entry):

start \ end   1:    2:    3:    4:    5:    6:
0:            A     -     -     -     X     S,T
1:                  A     -     X     S,T   -
2:                        A     S,T   -     -
3:                              B     -     -
4:                                    B     -
5:                                          B

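The table for aaabbb can be reproduced with a short script. This is a sketch under stated assumptions: entries are keyed by (start, end) with 0-based start and exclusive end, as in the table layout, and `cyk_table`, `UNIT_RULES` and `BINARY_RULES` are hypothetical names.

```python
def cyk_table(word, unit_rules, binary_rules):
    """Fill the CYK table; table[(i, j)] holds the variables deriving word[i:j]."""
    n = len(word)
    table = {(i, j): set() for i in range(n) for j in range(i + 1, n + 1)}
    for i, letter in enumerate(word):
        # Main diagonal: variables that produce a single letter.
        table[(i, i + 1)] = {A for A, a in unit_rules if a == letter}
    for d in range(2, n + 1):            # d-th diagonal = substring length
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):    # split word[i:j] into word[i:k], word[k:j]
                for A, B, C in binary_rules:
                    if B in table[(i, k)] and C in table[(k, j)]:
                        table[(i, j)].add(A)
    return table


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

table = cyk_table("aaabbb", UNIT_RULES, BINARY_RULES)
for start in range(6):
    cells = [",".join(sorted(table[(start, end)])) or "-"
             for end in range(start + 1, 7)]
    print(f"{start}: " + "  ".join(cells))
print("ACCEPT" if "S" in table[(0, 6)] else "REJECT")
```

Each printed row starts at the diagonal entry for that start position, so the rows get shorter as the start position increases.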
Parsing results

We keep the results for every substring w_ij in a table.
Note that we only need to fill in entries up to the diagonal: the longest substring starting at i has length n-i+1.

Constructing parse tree

We need to construct parse trees for the string w.
Idea:
- Keep back-pointers to the table entries that we combine
- At the end, reconstruct a parse from the back-pointers
This allows us to find all parse trees.

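A minimal sketch of the back-pointer idea follows. It is a simplification: it keeps only the first back-pointer found for each (start, end, variable) entry, so it returns one parse tree; storing a list of back-pointers per entry instead would allow all parses to be enumerated. The names `cyk_parse`, `UNIT_RULES` and `BINARY_RULES` are hypothetical, and the grammar is the one from the worked example.

```python
def cyk_parse(word, unit_rules, binary_rules, start="S"):
    """Return one parse tree as nested tuples, or None if `word` is rejected."""
    n = len(word)
    back = {}  # (i, j, A) -> terminal letter, or (k, B, C) for A -> B C split at k
    for i, letter in enumerate(word):
        for A, a in unit_rules:
            if a == letter:
                back[(i, i + 1, A)] = letter
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            for k in range(i + 1, j):
                for A, B, C in sorted(binary_rules):
                    if (i, k, B) in back and (k, j, C) in back:
                        # Keep the first derivation found for this entry.
                        back.setdefault((i, j, A), (k, B, C))

    def build(i, j, A):
        entry = back[(i, j, A)]
        if isinstance(entry, str):       # leaf: A -> terminal
            return (A, entry)
        k, B, C = entry                  # internal node: A -> B C
        return (A, build(i, k, B), build(k, j, C))

    return build(0, n, start) if (0, n, start) in back else None


UNIT_RULES = [("A", "a"), ("B", "b")]
BINARY_RULES = [("S", "A", "B"), ("S", "X", "B"),
                ("T", "A", "B"), ("T", "X", "B"), ("X", "A", "T")]

print(cyk_parse("aabb", UNIT_RULES, BINARY_RULES))
```

Reading the leaves of the returned tuple from left to right gives back the input string.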
References

- Hopcroft and Ullman, Introduction to Automata Theory, Languages, and Computation, Section 6.3, pp. 139-141
- "CYK algorithm", Wikipedia, the free encyclopedia
- A presentation by Zeph Grunschlag

The Nussinov-Jacobson Algorithm

Slide annotations: i < q ≤ j, with columns q-1 and q marked. Cell (1, 9) is blank on the slide.

 i\j      1  2  3  4  5  6  7  8  9
          A  C  A  G  U  U  G  C  A
 1  A     0  0  0  1  2  2  2  3
 2  C     0  0  0  1  1  1  2  2  3
 3  A        0  0  0  1  1  1  2  3
 4  G           0  0  0  0  0  1  2
 5  U              0  0  0  0  1  2
 6  U                 0  0  0  1  2
 7  G                    0  0  1  1
 8  C                       0  0  0
 9  A                          0  0

Co-terminus foldings (figure): example sequence A U C A U G G C A U
Partitionable foldings (figure): example sequence A C A G U U G C A, positions 1-9

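Another way to write the Nussinov-Jacobson recursion

Initialization:
\[
\gamma(i, i-1) = 0 \quad \text{for } i = 2 \text{ to } L; \qquad \gamma(i, i) = 0 \quad \text{for } i = 1 \text{ to } L.
\]

Recursion:
\[
\gamma(i, j) = \max
\begin{cases}
\gamma(i+1, j); \\
\gamma(i, j-1); \\
\gamma(i+1, j-1) + \mathrm{BasePairScore}(i, j); \\
\max_{i<k<j} \left[ \gamma(i, k) + \gamma(k+1, j) \right].
\end{cases}
\]

The first two cases are two special cases of partitionable folding, the third case is co-terminus folding, and the fourth case is partitionable folding.
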
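The fill stage of this recursion can be sketched in a few lines of Python. The sketch assumes BasePairScore(i, j) is 1 for the Watson-Crick pairs A-U and G-C and 0 otherwise (G-U pairs are not counted), which is consistent with the values in the ACAGUUGCA table; the function name `nussinov_fill` and the 0-based indexing are choices made here, not part of the slides.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def nussinov_fill(seq, pairs=WATSON_CRICK):
    """Return g where g[i][j] is the maximum number of base pairs in seq[i..j].

    Indices are 0-based and inclusive; the slides use 1-based positions.
    """
    L = len(seq)
    g = [[0] * L for _ in range(L)]      # gamma(i, i) = 0
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            # gamma(i+1, j-1) is the initialization value 0 when i+1 > j-1.
            inner = g[i + 1][j - 1] if i + 1 <= j - 1 else 0
            score = 1 if (seq[i], seq[j]) in pairs else 0
            best = max(g[i + 1][j],      # i unpaired
                       g[i][j - 1],      # j unpaired
                       inner + score)    # i and j paired (co-terminus)
            for k in range(i + 1, j):    # bifurcation (partitionable)
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    return g


g = nussinov_fill("ACAGUUGCA")
print(g[0][8])  # gamma(1, 9), the cell left blank in the table
```

Filling by increasing span guarantees that every cell on the right-hand side of the recursion is already computed when it is needed.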
SCFG version of the Nussinov-Jacobson algorithm

Stochastic context-free grammars make use of production rules such as:

W → aW | cW | gW | uW   (i unpaired)

Every production rule has an associated probability parameter.
The maximum-probability parse is equivalent to the maximum-probability secondary structure.

SCFG Version of Nussinov-Jacobson Algorithm

The algorithm can be converted to a stochastic context-free grammar:

S → W
W → aW | cW | gW | uW
W → Wa | Wc | Wg | Wu
W → aWu | cWg | uWa | gWc
W → WW

Needed terminology

The inside-outside algorithm (a recursive dynamic programming algorithm) for SCFGs in Chomsky normal form is the natural counterpart of the forward-backward algorithm for HMMs.
The best-path variant of the inside-outside algorithm is the Cocke-Younger-Kasami (CYK) algorithm. It finds the maximum-probability alignment of the SCFG to the sequence.

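CYK for Nussinov-style RNA SCFG

Initialization:
\[
\gamma(i, i-1) = -\infty \quad \text{for } i = 2 \text{ to } L; \qquad
\gamma(i, i) = \max \left\{ \log p(x_i W),\ \log p(W x_i) \right\} \quad \text{for } i = 1 \text{ to } L.
\]

Addition to the fill stage of the Nussinov algorithm. The principal difference is that the SCFG description is a probabilistic model.

Recursion:
\[
\gamma(i, j) = \max
\begin{cases}
\gamma(i+1, j) + \log p(x_i W); \\
\gamma(i, j-1) + \log p(W x_j); \\
\gamma(i+1, j-1) + \log p(x_i W x_j); \\
\max_{i<k<j} \left[ \gamma(i, k) + \gamma(k+1, j) + \log p(WW) \right].
\end{cases}
\]

The first two cases are two special cases of partitionable folding, the third case is co-terminus folding, and the fourth case is partitionable folding.
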
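As an illustration of the log-space fill, here is a small Python sketch. Everything numeric in it is an assumption: the uniform probability 1/13 assigned to each of the 13 W productions is a made-up placeholder rather than a trained parameter, γ(i, i-1) is taken to be minus infinity, and the names `cyk_nussinov_scfg`, `P_LEFT`, `P_RIGHT`, `P_PAIR` and `P_BIFURCATION` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) treated as minus infinity."""
    return math.log(p) if p > 0 else NEG_INF


def cyk_nussinov_scfg(seq, p_left, p_right, p_pair, p_bifurcation):
    """Log probability of the best parse of `seq` under the Nussinov-style SCFG.

    p_left[a]      = p(W -> aW),  p_right[a]    = p(W -> Wa)
    p_pair[(a, b)] = p(W -> aWb), p_bifurcation = p(W -> WW)
    """
    L = len(seq)
    if L == 0:
        return NEG_INF
    g = [[NEG_INF] * L for _ in range(L)]
    for i in range(L):
        g[i][i] = max(safe_log(p_left[seq[i]]), safe_log(p_right[seq[i]]))
    for span in range(1, L):
        for i in range(L - span):
            j = i + span
            # gamma(i+1, j-1) is minus infinity when the inner span is empty.
            inner = g[i + 1][j - 1] if i + 1 <= j - 1 else NEG_INF
            candidates = [
                g[i + 1][j] + safe_log(p_left[seq[i]]),               # i unpaired
                g[i][j - 1] + safe_log(p_right[seq[j]]),              # j unpaired
                inner + safe_log(p_pair.get((seq[i], seq[j]), 0.0)),  # i, j paired
            ]
            candidates += [g[i][k] + g[k + 1][j] + safe_log(p_bifurcation)
                           for k in range(i + 1, j)]                  # bifurcation
            g[i][j] = max(candidates)
    return g[0][L - 1]


# Hypothetical uniform parameters: 4 + 4 + 4 + 1 = 13 productions of W.
P = 1.0 / 13.0
P_LEFT = {b: P for b in "acgu"}
P_RIGHT = {b: P for b in "acgu"}
P_PAIR = {("a", "u"): P, ("u", "a"): P, ("g", "c"): P, ("c", "g"): P}
P_BIFURCATION = P

print(cyk_nussinov_scfg("gac", P_LEFT, P_RIGHT, P_PAIR, P_BIFURCATION))
```

With uniform parameters the best parse is simply the one that uses the fewest productions, so pairing g with c around the single a beats emitting all three bases one at a time.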
CYK for Nussinov-style RNA SCFG (2)

\(\log P(x, \hat{\pi})\) is the log likelihood of the optimal structure \(\hat{\pi}\) given the SCFG model.
The traceback to find the secondary structure corresponding to the best score is performed analogously to the traceback in the Nussinov algorithm.

Example of RNA Structure SCFG

The RNA structure for the sequence below, produced by MFOLD, can be constructed (5′ to 3′):

GCUUACGACCAUAUCACGUUGAAUGCAC
GCCAUCCCGUCCGAUCUGGCAAGUUAAG
CAACGUUGAGUCCAGUUAGUACUUGGAU
CGGAGACGGCCUGGGAAUCCUGGAUGU
UGUAAGCU

Example Construction

S ⇒ W
  ⇒ Wu
  ⇒ gWcu
  ⇒ gcWgcu
  ⇒ gcuWagcu
  ⇒ gcuuWaagcu
  ⇒ gcuuaWuaagcu
  ⇒ gcuuacWguaagcu
  ⇒ gcuuacgWuguaagcu
  ⇒ gcuuacgaWuuguaagcu
  ⇒ gcuuacgacWguuguaagcu
  ⇒ gcuuacgaccWguuguaagcu
  ⇒ gcuuacgaccaWguuguaagcu
  ⇒ ...

CYK for Nussinov-style RNA SCFG

A good starting example, but too simple to be an accurate RNA folder.
The algorithm does not consider important structural features, such as preferences for certain:
- Loop lengths
- Nearest neighbours in the structure, caused by stacking interactions between neighbouring base pairs in a stem