SA-REPC - Sequence Alignment with a Regular Expression Path Constraint
Nimrod Milo, Tamar Pinhas, Michal Ziv-Ukelson
Ben-Gurion University of the Negev, Be'er Sheva, Israel
Graduate Seminar, BGU 2010
Milo, Pinhas & Ziv-Ukelson (BGU) SA-REPC November 2010

Outline
1 About Michal's Group
2 Sequence Alignment with a Regular Expression Path Constraint
3 Applying SA-REPC to microRNA target prediction
4 Summary

Michal's group
- Michal Ziv-Ukelson
- Tamar Pinhas
- Isana Vaksler
- Noa Mussa
- Sivan Yogev
- Shay Zakov
- Erez Katzenelson

Topics of interest in our group
- Sequence and tree alignments and similarity
- Indexing, searching and compression
- Secondary structure prediction of RNA: folding and co-folding
- microRNA-mRNA target prediction
- Sequence/structure motifs involved in localization and post-transcriptional regulation
- Post-transcriptional regulation: virus-host microRNA-mRNA behavior
- Protein motif discovery (common signals within a family)
- Algorithms on strings and trees
- More...

Outline
1 About Michal's Group
2 Sequence Alignment with a Regular Expression Path Constraint
  - Manhattan Tourist Problem
  - Sequence Alignment
  - Constrained Sequence Alignment
  - SA-REPC definition
  - Algorithm for the SA-REPC
  - Complexity analysis
3 Applying SA-REPC to microRNA target prediction
  - Background
  - microRNAs
  - Modifying the SA-REPC for microRNA target prediction
  - Results
4 Summary

MTP: Manhattan Tourist Problem
Imagine seeking a path from source (s) to sink (t), going only eastward and southward, that passes the highest number of attractions. The attractions are marked by weights on the streets (edges) of a Manhattan grid.
[Figure: a Manhattan grid from source s to sink t with weighted edges]

Manhattan Tourist Problem: Formulation
Goal: Find the highest-scoring path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled source and the other labeled sink.
Output: A longest (highest-scoring) path in G from source to sink.

MTP solution using dynamic programming
Each vertex's score is the maximum, over its predecessor vertices, of the predecessor's score plus the weight of the edge in between.
The score of a point $(i, j)$ is computed by the recurrence relation:
\[
S_{0,0} = 0, \qquad
S_{i,j} = \max \begin{cases}
S_{i-1,j} + \text{score of the edge between } (i-1, j) \text{ and } (i, j) \\
S_{i,j-1} + \text{score of the edge between } (i, j-1) \text{ and } (i, j)
\end{cases}
\]
Running time: for a grid of size $n \times m$, evaluating the recurrence takes $O(nm)$ time.

Example
[Figure: the weighted grid, with the vertex currently being computed marked by *]
$S_{1,0} = S_{0,0} + 2 = 0 + 2 = 2$
$S_{1,1} = \max(S_{0,1} + 0,\; S_{1,0} + 3) = \max(1 + 0,\; 2 + 3) = 5$

Extending the MTP problem
- Changing the scores to real numbers.
- Adding diagonal movement (edges in the graph).
[Figure: the grid with real-valued edge weights and diagonal edges]
\[
S_{i,j} = \max \begin{cases}
S_{i-1,j} + \text{score of the edge between } (i-1, j) \text{ and } (i, j) \\
S_{i,j-1} + \text{score of the edge between } (i, j-1) \text{ and } (i, j) \\
S_{i-1,j-1} + \text{score of the edge between } (i-1, j-1) \text{ and } (i, j)
\end{cases}
\]

Sequence alignment
Definition (Global sequence alignment problem)
- $S_1$ and $S_2$ are two strings over an alphabet $\Sigma$.
- $s$ is a scoring matrix.
- A sequence alignment is obtained by inserting gaps into $S_1$ and $S_2$, so that the symbols can be placed in one-to-one correspondence with each other.
- The optimal global sequence alignment is a sequence alignment that has the optimal sum of scores, according to $s$, over the pairs of symbols that correspond to each other in the alignment.

Sequence Alignment example
S1 = AGCGCGUU
S2 = GUCAGACG
The scoring matrix s is -1 for a mismatch or indel (space) and 1 for a match.
Example:
    S1:     A   G   -   C   -   G   -   C   G   U   U
    S2:     -   G   U   C   A   G   A   C   G   -   -
    score: -1   1  -1   1  -1   1  -1   1   1  -1  -1
An optimal alignment of S1 and S2 is scored -1.

Adding some sequences to the grid
We can extend the grid to represent an alignment between two sequences in the following way:
- We create a grid with $(|S_1| + 1) \times (|S_2| + 1)$ vertices. The additional row and column are for the gap sign ('-').
- The scores on the edges between rows $i$, $i+1$ and columns $j$, $j+1$ are as follows:
  - horizontal edge: $s[\text{'-'}, S_2[j]]$
  - vertical edge: $s[S_1[i], \text{'-'}]$
  - diagonal edge: $s[S_1[i], S_2[j]]$

[Figure: the alignment grid for S1 = AGCGCGUU and S2 = GUCAGACG, from source s to sink t, with edge weights of 1 and -1]

The alignment table
S1 = AGCGCGUU (rows), S2 = GUCAGACG (columns)

          -    G    U    C    A    G    A    C    G
     -    0   -1   -2   -3   -4   -5   -6   -7   -8
     A   -1   -1   -2   -3   -2   -3   -4   -5   -6
     G   -2    0   -1   -2   -3   -1   -2   -3   -4
     C   -3   -1   -1    0   -1   -2   -2   -1   -2
     G   -4   -2   -2   -1   -1    0   -1   -2    0
     C   -5   -3   -3   -1   -2   -1   -1    0   -1
     G   -6   -4   -4   -2   -2   -1   -2   -1    1
     U   -7   -5   -3   -3   -3   -2   -2   -2    0
     U   -8   -6   -4   -4   -4   -3   -3   -3   -1

Alignment recovered by tracing back from the bottom-right cell:
    S1:  A  G  -  C  -  G  -  C  G  U  U
    S2:  -  G  U  C  A  G  A  C  G  -  -

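The table above can be reproduced with a short global-alignment sketch. This is an illustration under the slides' scoring (match +1, mismatch or indel -1); the function name `alignment_table` and its parameters are assumptions, not part of the original material.

```python
def alignment_table(s1, s2, match=1, other=-1):
    """Fill the global alignment table: rows follow s1, columns follow s2."""
    n, m = len(s1), len(s2)
    S = [[0] * (m + 1) for _ in range(n + 1)]

    # Border cells: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + other
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + other

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else other
            S[i][j] = max(S[i - 1][j - 1] + pair,   # diagonal edge
                          S[i - 1][j] + other,      # vertical edge (gap in s2)
                          S[i][j - 1] + other)      # horizontal edge (gap in s1)
    return S


table = alignment_table("AGCGCGUU", "GUCAGACG")
print(table[-1][-1])  # score of an optimal global alignment
```

The bottom-right cell holds the optimal score, so for the example sequences the printed value corresponds to the last entry of the table shown on the slide.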
Constrained Sequence Alignment
- Numerous studies suggest applying additional constraints to sequence alignment in order to improve speed or accuracy.
- The additional constraints can reflect a priori knowledge about the alignment and therefore narrow the search space or guide the search towards a preferred solution.

Related Work
- Position anchoring [Myers-96, Sammeth-03]: demands that the path pass through certain cells in the table.
- Spaced seeds [Ma-02, Kucherov-05, Benson-06]: a constraint on the path in the form of a partial word. Partial words are written over the letters 1 (match) and * (don't care). For example, 11*11* allows 111110 and also 110111.
- Regular Expression Constrained Sequence Alignment (RECSA) [Arslan-05]: each string should satisfy a regular expression constraint.
- SA-REPC: a constraint on the path in the form of a regular expression.

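The spaced-seed example can be checked mechanically. The sketch below assumes that a hit string is written over 1 (match) and 0 (anything else) and that the seed and the hit string have equal length; the function name `satisfies_seed` is hypothetical.

```python
def satisfies_seed(seed, hits):
    """True if `hits` has a match ('1') wherever `seed` requires one.

    seed : string over '1' (must match) and '*' (don't care)
    hits : string over '1' (match) and '0' (anything else)
    """
    if len(seed) != len(hits):
        return False
    return all(s == "*" or h == "1" for s, h in zip(seed, hits))


print(satisfies_seed("11*11*", "111110"))
print(satisfies_seed("11*11*", "110111"))
print(satisfies_seed("11*11*", "101111"))
```

The first two calls use the two strings from the slide, and the third fails because the second position is required to be a match.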
Preliminaries
An extended definition of sequence alignment with alignment-path constraints.
Example: the constraint is in the form of a regular expression.
S1 = AGCGCGUU
S2 = GUCAGACG
R = 11011 (1 - match, 0 - everything else)
[Figure: an alignment of S1 and S2 containing a region labeled 1 1 0 1 1]

Preliminaries - Alignment alphabet examples
- $\Sigma_r = \{1, 0\}$: 1 is a match, 0 is anything else.
  Example: the letters A and A are mapped to 1; U and '-' are mapped to 0.
- $\Sigma_r = \{m, s, i, d\}$: match, substitution, insertion, deletion.
  Example: the letters A and A are mapped to m; U and '-' are mapped to d; '-' and A are mapped to i.
- $\Sigma_r = \left\{ \binom{\sigma_1}{\sigma_2} \,\middle|\, \sigma_1, \sigma_2 \in \Sigma \cup \{-\} \right\} \setminus \left\{ \binom{-}{-} \right\}$: each pair of letters is mapped to a different symbol in the alignment alphabet.
  Example: the letters A and U are mapped to $\binom{A}{U}$ in the alignment alphabet, and A and '-' to $\binom{A}{-}$.

Preliminaries - Alignment alphabet examples (mapping function)
Because some $\Sigma_r$ symbols can be mapped from different pairs of symbols in $\Sigma$, we need a mapping function $f$ defined as:
\[
f : \Sigma \times \Sigma \to P(\Sigma_r)
\]
Example: in $\Sigma_r = \{0, 1\} \cup \left\{ \binom{\sigma_1}{\sigma_2} \,\middle|\, \sigma_1, \sigma_2 \in \Sigma \cup \{-\} \right\} \setminus \left\{ \binom{-}{-} \right\}$:
\[
f(A, A) = \left\{ 1, \tbinom{A}{A} \right\}, \qquad f(A, U) = \left\{ 0, \tbinom{A}{U} \right\}
\]

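A small sketch of such a mapping function for the combined alphabet in the example. The representation is an assumption made for illustration: the symbols 1 and 0 are the strings "1" and "0", and a pair symbol is a Python tuple `(a, b)`.

```python
GAP = "-"


def f(a, b):
    """Map an alignment column (a, b) to the set of Sigma_r symbols it stands for.

    Sigma_r here is {0, 1} together with all letter pairs except (gap, gap).
    """
    if a == GAP and b == GAP:
        raise ValueError("a column cannot consist of two gaps")
    # a == b cannot be a gap pair at this point, so equality means a match.
    match_symbol = "1" if a == b else "0"
    return {match_symbol, (a, b)}


print(f("A", "A"))
print(f("A", "U"))
print(f("A", GAP))
```

Returning a set rather than a single symbol mirrors the codomain $P(\Sigma_r)$: a regular expression may refer to a column either by its coarse class or by its exact pair.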
Sequence Alignment with a Regular Expression Path Constraint
Definition (Global SA-REPC)
Input:
- $S_1$ and $S_2$ are two strings over an alphabet $\Sigma$.
- $s$ is a scoring matrix over the alphabet $\Sigma$.
- $R$ is a regular expression over an alignment alphabet $\Sigma_r$.
Find an alignment of $S_1$ and $S_2$ such that two conditions hold:
1. There exists an accepted region in the alignment, that is, a region belonging to $L(R)$.
2. The overall score of the alignment, computed according to $s$, is optimal among all such alignments.

Sequence Alignment vs. SA-REPC
Example (input):
S1 = AGCGCGUU
S2 = GUCAGACG
s is a scoring matrix: match +1, all other -1.
R = $1\,0^3\,1\,0$
Example (Sequence Alignment): optimal alignment value = -1.
Example (SA-REPC): the alignment contains a region labeled 1 0 0 0 1 0; optimal alignment value = -3.
[Figure: the two alignments of S1 and S2 side by side]

Modifications in the automaton
Regular expression: $R = 1^{*}(1 \mid 0)\,1^{2}\,0^{*}$
Automaton $A_R$: states $q_0, q_1, q_2, q_3$ with start state $q_0$.
- $q_0 \to q_0$ on 1
- $q_0 \to q_1$ on 0 / 1
- $q_1 \to q_2$ on 1
- $q_2 \to q_3$ on 1
- $q_3 \to q_3$ on 0
Built automaton $A$: $A_R$ extended with two states.
- $q_{init}$ (the new start state) has a self-loop on every symbol of $\Sigma$ and an $\epsilon$-transition to $q_0$.
- $q_3$ has an $\epsilon$-transition to $q_{final}$, which has a self-loop on every symbol of $\Sigma$.

Dynamic Programming solution
The DP solution:
- We calculate a dynamic programming table $M$.
- Each cell $M[i, j]$ holds $|Q|$ entries, one per state of $A$.
- The entry $M[i, j](q)$ holds the optimal score of aligning $S_1[1, i]$ with $S_2[1, j]$ such that there is a run on $A$ which reaches $q$.
- If no such alignment exists, the value of the entry $M[i, j](q)$ is null.
- The answer is in $M[n, m](q_{final})$.

Dynamic programming recurrence formula
The recurrence formula for the problem is as follows:
1. \[
M[0, 0](q) = \begin{cases} 0 & q = q_{init} \\ \text{null} & \text{otherwise} \end{cases}
\]
2. \[
M[i, j](q) = \max \begin{cases}
\max \{ M[i-1, j-1](p) + s[S_1[i], S_2[j]] \mid q \in \delta(p, f(S_1[i], S_2[j])) \} \\
\max \{ M[i-1, j](p) + s[S_1[i], \text{'-'}] \mid q \in \delta(p, f(S_1[i], \text{'-'})) \} \\
\max \{ M[i, j-1](p) + s[\text{'-'}, S_2[j]] \mid q \in \delta(p, f(\text{'-'}, S_2[j])) \}
\end{cases}
\]
3. If $i = 0$ (or $j = 0$), the sets above that correspond to $i - 1$ (or to $j - 1$) are ignored.

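Single cell calculation
The calculation of a single cell $M[i, j]$ under the assumptions:
- $S_1[i] = S_2[j] = C$
- $s[C, C] = 1$
- $s[C, \text{'-'}] = s[\text{'-'}, C] = 0$
- $A$ is the built automaton: $q_{init}$ (self-loop on $\Sigma$) $\xrightarrow{\epsilon} q_0$ (self-loop on 1) $\xrightarrow{0/1} q_1 \xrightarrow{1} q_2 \xrightarrow{1} q_3$ (self-loop on 0) $\xrightarrow{\epsilon} q_{final}$ (self-loop on $\Sigma$)
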
Dynamic programming recurrence formula: example
Under the assumptions of the single cell calculation, $M[i, j](q_1)$ is the maximum of three candidates, one per incoming edge:
- Diagonal edge: $M[i-1, j-1](q_0) + s[C, C] = M[i-1, j-1](q_0) + 1$
- Vertical edge: $M[i-1, j](q_0) + s[C, \text{'-'}] = M[i-1, j](q_0) + 0$
- Horizontal edge: $M[i, j-1](q_0) + s[\text{'-'}, C] = M[i, j-1](q_0) + 0$

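The recurrence can be sketched as follows. This is a simplified illustration, not the calign implementation: it assumes the alignment alphabet {1, 0}, the +1 / -1 scoring of the earlier example, and an automaton for R given without $\epsilon$-transitions as a dictionary `delta`. The $\epsilon$-transitions of the built automaton are folded into the `"init"` and `"final"` states, and `literal_nfa` is a hypothetical helper that builds a chain automaton for a plain string such as `"100010"`.

```python
GAP = "-"


def label(a, b):
    """f(a, b) for Sigma_r = {1, 0}: 1 is a match, 0 is anything else."""
    return {"1"} if a == b and a != GAP else {"0"}


def score(a, b):
    """Assumed scoring: match +1, mismatch or gap -1."""
    return 1 if a == b and a != GAP else -1


def literal_nfa(pattern):
    """Hypothetical helper: automaton accepting exactly the string `pattern`."""
    delta = {(k, sym): {k + 1} for k, sym in enumerate(pattern)}
    return delta, 0, {len(pattern)}


def sa_repc(s1, s2, delta, start, accepting):
    """Optimal score of a global alignment containing a region in L(R), or None."""

    def close(states):
        # Epsilon move: reaching an accepting state of A_R also reaches "final".
        states = set(states)
        if states & accepting:
            states.add("final")
        return states

    def step(p, sym):
        if p == "final":                      # self-loop on every symbol
            return {"final"}
        if p == "init":                       # self-loop, or start reading R
            return {"init"} | close(delta.get((start, sym), ()))
        return close(delta.get((p, sym), ()))

    n, m = len(s1), len(s2)
    # M[i][j] maps a state q to the best score; a missing key plays the role of null.
    M = [[{} for _ in range(m + 1)] for _ in range(n + 1)]
    M[0][0]["init"] = 0
    if start in accepting:                    # the empty string is in L(R)
        M[0][0]["final"] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i > 0 and j > 0:
                moves.append((i - 1, j - 1, s1[i - 1], s2[j - 1]))
            if i > 0:
                moves.append((i - 1, j, s1[i - 1], GAP))
            if j > 0:
                moves.append((i, j - 1, GAP, s2[j - 1]))
            for pi, pj, a, b in moves:
                for p, value in M[pi][pj].items():
                    for sym in label(a, b):
                        for q in step(p, sym):
                            cand = value + score(a, b)
                            if cand > M[i][j].get(q, float("-inf")):
                                M[i][j][q] = cand
    return M[n][m].get("final")
```

For instance, `sa_repc("AGCGCGUU", "GUCAGACG", *literal_nfa("100010"))` runs the constrained example from the earlier slide, and passing `*literal_nfa("")` removes the constraint, so the result falls back to the ordinary global alignment score. Each cell iterates over the states of its three predecessors and their transitions, which is where the extra factor in the number of states comes from.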
Complexity analysis
We denote:
- $t = |Q|$, the number of states in $A$
- $n$, the length of $S_1$
- $m$, the length of $S_2$

Method      Trace   Time (NFA)    Time (DFA)   Memory
naïve       yes     $O(mnt^2)$    $O(mnt)$     $O(mnt)$
naïve       no      $O(mnt^2)$    $O(mnt)$     $O(\min\{m, n\}\,t)$
Hirschberg  yes     $O(mnt^2)$    $O(mnt)$     $O(\min\{m, n\}\,t)$

The Cell
[Figure]
The central dogma
[Figure]
A short movie
mRNA regions
[Figure]

microRNAs
1. microRNAs are short sequences of RNA (approximately 22 bases).
2. They function as specific gene regulators.
   - A cell's function at any given time is determined by the composition of proteins in it.
   - microRNAs suppress the translation of RNA to protein.
   - DNA (transcription) -> RNA (translation) -> Protein
3. They operate by binding to complementary sequences on their mRNA target (this interaction is called hybridization).
   - Hybridization is the chemical bonding of bases (also called base pairing): A:U, G:C, G:U.
4. The complex created by hybridization of the microRNA to its mRNA target is called a duplex.
   [Figure: picture from Lin et al. 2003]

Another short movie

Hybridization and Sequence alignment
Hybridization of two sequences can be solved with the standard sequence alignment framework. The only difference is the scoring scheme:
- In sequence alignment, a match is when both symbols are the same.
- In hybridization, a match is when the two symbols are complementary. The matching pairs are A:U, G:C and G:U.
[Figure: example hybridization of two RNA sequences]

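The change of scoring scheme amounts to swapping the match test. A minimal sketch, assuming the same +1 / -1 values as the alignment example and treating each pair as symmetric; the names `PAIRS` and `hybridization_score` are hypothetical.

```python
# Complementary pairs named on the slide, listed in both orientations.
PAIRS = {("A", "U"), ("U", "A"),
         ("G", "C"), ("C", "G"),
         ("G", "U"), ("U", "G")}


def hybridization_score(a, b, match=1, other=-1):
    """Score a column: a 'match' means the two bases can pair, not that they are equal."""
    return match if (a, b) in PAIRS else other


print(hybridization_score("G", "U"))  # wobble pair counts as a match
print(hybridization_score("A", "A"))  # identical bases do not pair
```

Plugging such a function into an alignment routine in place of the equality test leaves the dynamic programming itself unchanged.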
Duplex properties
Different properties of the microRNA-target duplex have been observed, some of which serve as a basis for current microRNA target prediction algorithms.
1. 5'-end dominant seed: Several studies suggest the existence of a region of 6-8 nucleotides at the 5'-end of the microRNA (the "seed"). The 5' end of the seed is unpaired or starts with U, and doesn't contain wobble pairs (G:U).
   [Figure: duplex with the 5' seed marked]
2. 3'-end compensatory seed: There is significant evidence that a 3'-end seed of the microRNA can compensate for a non-perfect 5'-seed.
   [Figure: duplex with the 3' seed marked]
3. Multiplicity: microRNAs have been shown to be capable of functioning in a collaborative manner. There are two types of multiplicity.
   [Figure: one microRNA on a target, and two microRNAs (microRNA1, microRNA2) on the same target]
4. Accessibility and thermodynamics: The thermodynamics and accessibility of the duplex and its surrounding area are very important properties.

Using the current dogma on duplexes

Utilizing SA-REPC for microRNA target prediction
Some properties of the duplex can be written as a regular expression constraint.
- 5'-end dominant seed: $\left( i \mid \binom{A}{G} \mid \binom{A}{A} \mid \binom{A}{C} \right) WCB^{5,7} \;\mid\; ii\,(WCB)^{6}$
  where $WCB = \left( \binom{G}{C} \mid \binom{C}{G} \mid \binom{A}{U} \mid \binom{U}{A} \right)$
- 3'-end compensatory seed: $1^{4,5}\, s^{0,2}$
- Inner bulge of the duplex: $\left( i^{1,4} \mid d^{1,6} \right)?\ \left( 1\,1^{+} \left( i^{1,4} \mid d^{1,6} \right) \right)$

More complex duplex properties
Thermodynamics and accessibility of the duplex site and its surroundings are more complex properties.
- Thermodynamics: microRNA-target hybridization tends to have low free energy.
- Accessibility: target site accessibility plays an important role in the formation of the duplex.
- Computing both properties is the computational bottleneck.
- The complexity of such computations ranges from $O(nm^2)$ [Stadler-06] (with restrictions) up to $O(nm^5)$ [Hofacker-08].
- We suggest using our method as an initial filter for target prediction tools that rely on energy computation.

Target prediction
Implementation:
- The tool is implemented as a Java package named calign.
- A web version is available at: http://www.cs.bgu.ac.il/~negevcb/calign
Our data set:
- 99 microRNAs.
- 640 3' UTRs of human genes (2183 transcripts).
- 873 verified duplexes from miRecords.

Comparative Results
[Figure]

Results

Tool        # of predicted pairs   # of true positives   Sensitivity
miRanda     22,857                 309                   35.3%
PITA        28,032                 661                   75.7%
RNAhybrid   43,693                 731                   83.7%
calign      43,210                 626                   71.7%

Table: Results on all 63,360 pairs

Conclusions
- Extended sequence alignment to support a path constraint (SA-REPC).
- Presented an application for our algorithm.
- Implemented the algorithm (calign).
- Showed preliminary comparative results.
Future work
- Find more properties of duplexes that can be used in SA-REPC.
- Find more applications for SA-REPC.
- The constraint may be extended to more general language classes, such as grammars.
- An interesting open problem is whether techniques previously used to obtain sub-quadratic sequence alignment, such as Four Russians and acceleration by compression, can be applied here.

Acknowledgements
Special thanks:
- Tamar Pinhas, co-author
- Dr. Michal Ziv-Ukelson, my advisor
- The rest of Michal's group at BGU: Erez Katznelson, Isana Vaksler, Sivan Yogev, Shay Zakov, Noa Mussa