CFG PS Algorithm: Sequence Alignment Guided by Common Motifs Described by Context-Free Grammars
Motivation
Find motifs: conserved regions that indicate a biological function or signature.
Other algorithms do not always align motif regions together.
Incorporate knowledge about common structures into the alignment process, forcing the alignment to align such common motifs.
[Figure: RNA secondary structure with labeled elements: internal loop, external base, hairpin loops, multi-loop, bulge]
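The goal
Align the sequences so that all or some of the motif matches are included, in an order that optimizes the resulting score:
$$\max \left[ \sum_{i=1}^{k+1} A(z_i, w_i) + \sum_{i=1}^{k} \max_{G \in \mathcal{G},\; x_i, y_i \in L(G)} \beta(G) \right]$$
Here $A$ is the ordinary alignment score of the segments between motifs, and $\beta$ is the weight of a matched motif grammar.
Example: with three motif-matching regions ($k = 3$), the score is the sum of four ordinary alignment scores and three motif weights:
$$A(z_1, w_1) + A(z_2, w_2) + A(z_3, w_3) + A(z_4, w_4) + \beta(g_1) + \beta(g_2) + \beta(g_3)$$
where $g_i$ is the grammar matched by the $i$-th motif-matching region.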
Problem definition
Let $A(u, v)$ denote the maximum ordinary alignment score for strings $u$ and $v$, which can be computed using the alignment algorithm.
Given:
  two sequences $S_1$ and $S_2$
  a set of context-free grammars $\mathcal{G} = \{G_1, \ldots, G_f\}$, each of which represents a motif with all its possible variations
  a weight function $\beta(G)$
Compute:
$$\max \left[ \sum_{i=1}^{k+1} A(z_i, w_i) + \sum_{i=1}^{k} \max_{G \in \mathcal{G},\; x_i, y_i \in L(G)} \beta(G) \right]$$
over all possible $x_1, x_2, \ldots, x_k$ and $y_1, y_2, \ldots, y_k$ such that
$$S_1 = z_1 x_1 z_2 x_2 \cdots z_k x_k z_{k+1}$$
$$S_2 = w_1 y_1 w_2 y_2 \cdots w_k y_k w_{k+1}$$
Each $z_j$ or $w_j$, $1 \le j \le k+1$, can be an empty string.
For all $i$, $1 \le i \le k$, there exists a $G \in \mathcal{G}$ such that $x_i, y_i \in L(G)$, where $L(G)$ is the language generated by the CFG $G$. Each such pair $(x_i, y_i)$ is a motif-matching region.
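The ordinary alignment score A(u, v) in the definition above can be sketched with a standard global alignment recurrence (Needleman-Wunsch). This is a minimal sketch, not the slides' own code: the function name `alignment_score` and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration.

```python
def alignment_score(u, v, match=1, mismatch=-1, gap=-1):
    """Global alignment score A(u, v); the scoring scheme is an assumption."""
    # prev[j] holds the best score for aligning the current prefix of u with v[:j].
    prev = [j * gap for j in range(len(v) + 1)]
    for i in range(1, len(u) + 1):
        cur = [i * gap]  # aligning u[:i] against the empty prefix of v
        for j in range(1, len(v) + 1):
            sub = match if u[i - 1] == v[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align u[i-1] with v[j-1]
                           prev[j] + gap,       # gap in v
                           cur[j - 1] + gap))   # gap in u
        prev = cur
    return prev[-1]


score = alignment_score("GAU", "GU")
```

Only two rows of the table are kept because the score alone is needed here; recovering the alignment itself would require the full table and a traceback.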
Example
Set of grammars: {G1, G2, G3, G4}
G1 = (V, T, P, V0), a context-free grammar
  Set of variables: V = {V0, V1, V2}, where V0 is the start variable
  Terminal symbols: T = {A, U, C, G}
  Set of rules:
    V0 --> V 1 G GV 1
    V1 --> GV 2
    V2 --> G
G2 = ({V0, V1, V2}, {A, U, C, G}, P, V0)
  Set of rules:
    V0 --> V 1 U UV 1
    V1 --> V 2 G
    V2 --> GG
G3 = ({V0, V1, V2, V3}, {A, U, C, G}, P, V0)
  Set of rules:
    V0 --> V 1 U GV 1 V 1 G
    V1 --> V 2 U
    V2 --> V 3 U UV 3
    V3 -->
Reminder
CYK Algorithm
CYK is an algorithm that receives a grammar G in CNF and a string S, and determines whether S can be produced from G, and how.
ParseAll
ParseAll is a modification of CYK that receives a grammar G in CNF and a string S, and finds all of the substrings of S that can be produced from G.
CNF (Chomsky normal form)
CNF is a CFG in which every production rule has one of the forms:
  V → TS (two variables)
  V → a (a single terminal)
  S → ε (the start variable derives the empty string)
Every CFG can be converted to an equivalent grammar in CNF by a standard procedure.
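Running Example
CNF grammar G with start variable V0. Its terminals are words: you, No, one, can, understand, it, Dynamic, Programming. Among its rules, D → Dynamic and P → Programming; the other variables shown include B, B2 and DP.
String S = "Poor fool, do you think you can understand Dynamic Programming? No one can understand it."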
Running Example (continued)
[Figure sequence: the ParseAll table T for S, filled in step by step. First, each diagonal entry T[i, i] receives the variables that derive the single word at position i, for example D for "Dynamic" and P for "Programming". Longer spans are then filled bottom-up: DP, then B2, then B, and finally V0. The last slide marks the two substrings derived from V0, S1 = "you can understand Dynamic Programming" and S2 = "No one can understand it".]
Algorithm ParseAll (input string S, length n, CFG G)
Step 1. Find the substrings derived by the rules of the form X → a:
  for i = 1 to n do
    set T[i, i] = ∅
    for each variable X
      if X → S[i] is a rule then add X to T[i, i]
Step 2. Find the substrings derived by the rules of the form X → YZ:
  for l = 2 to n do
    for i = 1 to n - l + 1 do
      set j = i + l - 1
      for k = i to j - 1 do
        for each rule X → YZ
          if Y ∈ T[i, k] and Z ∈ T[k + 1, j] then add X to T[i, j]
Step 3. Return the set of all substrings generated by G:
  P = ∅
  for i = 1 to n do
    for j = i to n do
      if V0 ∈ T[i, j] then add (i, j) to P
  return P
Runtime = O(|G| n^3)
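The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the grammar is passed as two hypothetical lists, `terminal_rules` with entries (X, a) for rules X → a and `binary_rules` with entries (X, Y, Z) for rules X → YZ, and spans are reported as 1-based inclusive (i, j) pairs as in the pseudocode. The toy grammar (V0 → L R | L X, X → V0 R, L → G, R → C) is invented for illustration and is not one of the slide grammars.

```python
from collections import defaultdict


def parse_all(s, terminal_rules, binary_rules, start="V0"):
    """Return all 1-based inclusive spans (i, j) of s derivable from `start`."""
    n = len(s)
    T = defaultdict(set)  # T[(i, j)] = variables that derive s[i..j]
    # Step 1: rules of the form X -> a.
    for i in range(1, n + 1):
        for X, a in terminal_rules:
            if s[i - 1] == a:
                T[(i, i)].add(X)
    # Step 2: rules of the form X -> YZ, by increasing span length.
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            for k in range(i, j):
                left, right = T.get((i, k)), T.get((k + 1, j))
                if not left or not right:
                    continue
                for X, Y, Z in binary_rules:
                    if Y in left and Z in right:
                        T[(i, j)].add(X)
    # Step 3: collect every span whose cell contains the start variable.
    return [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)
            if start in T.get((i, j), ())]


# Hypothetical toy grammar generating G^n C^n (n >= 1), already in CNF.
TERMINAL_RULES = [("L", "G"), ("R", "C")]
BINARY_RULES = [("V0", "L", "R"), ("V0", "L", "X"), ("X", "V0", "R")]
spans = parse_all("AGGCCU", TERMINAL_RULES, BINARY_RULES)
```

On "AGGCCU" the start variable is reached for the spans covering "GGCC" and "GC". Because `s` is only indexed and compared, a list of word tokens works as well as a character string.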
[Figure: dynamic-programming alignment score table for two example RNA sequences]
Preprocessing
For each grammar G_k in the set of grammars do:
  X_k ← ParseAll(S_1, n_1, G_k)
  Y_k ← ParseAll(S_2, n_2, G_k)
  Move every span in X_k to list X as a triple (start, end, k)
  Move every span in Y_k to list Y as a triple (start, end, k)
Sort X and Y in ascending order of their end points.
[Example: lists X and Y of (start, end, k) triples, sorted by end point]
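The list-building and sorting step can be sketched as below. This is a sketch under assumptions: `spans_by_grammar` is a hypothetical dictionary mapping a grammar index k to the (start, end) spans that a ParseAll-style routine reported for that grammar, and the function name is illustrative.

```python
def build_sorted_list(spans_by_grammar):
    """Flatten per-grammar spans into (start, end, k) triples sorted by end point."""
    triples = [(start, end, k)
               for k, spans in spans_by_grammar.items()
               for (start, end) in spans]
    # Python's sort is stable, so triples with equal end points keep their input order.
    triples.sort(key=lambda t: t[1])
    return triples


# Hypothetical spans: grammar 1 matched (2, 5) and (3, 4); grammar 2 matched (1, 2).
X = build_sorted_list({1: [(2, 5), (3, 4)], 2: [(1, 2)]})
```

Sorting by end point lets the alignment pass look up, for each table cell, only the motif matches that end exactly at that cell.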
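Time Complexity
Where $N = \max\{N_1, N_2\}$ and $|\mathcal{G}|$ is the total size of the grammars in the set $\mathcal{G}$:
  Running ParseAll for every grammar in $\mathcal{G}$ and creating X, Y: $O(|\mathcal{G}| N^3)$
  Sorting X, Y: $O(|X| \log |X| + |Y| \log |Y|)$
  Computing the maxterms: $O(|X||Y| + N^2)$
The last bound comes from the usual $N^2$ table cells plus the new maxterm
$$\max \{\, H(i', j') + \beta(G_{k_1}) \;:\; (i', i, k_1) \in X_i,\ (j', j, k_2) \in Y_j,\ G_{k_1} = G_{k_2} \,\}$$
where $X_i$ and $Y_j$ are the entries of X and Y that end at positions $i$ and $j$. The maxterm pairs every entry of X with every entry of Y and checks whether their CNF grammars match.
Total time complexity: $O(|\mathcal{G}| N^3 + |X||Y|)$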
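The table computation with the extra maxterm can be sketched as follows. This is a sketch under several assumptions rather than the presented implementation: X and Y are lists of (start, end, k) triples with 1-based inclusive positions, `beta` is a hypothetical dictionary from grammar index to weight, scoring is match +1, mismatch -1, linear gap -1, and a motif pair ending at cell (i, j) is scored from the cell just before both starts.

```python
from collections import defaultdict


def guided_alignment_score(s1, s2, X, Y, beta, match=1, mismatch=-1, gap=-1):
    """Global alignment score where matching motif regions earn beta[k]."""
    n1, n2 = len(s1), len(s2)
    # Group motif matches by end point so each cell sees only relevant entries.
    x_by_end, y_by_end = defaultdict(list), defaultdict(list)
    for start, end, k in X:
        x_by_end[end].append((start, k))
    for start, end, k in Y:
        y_by_end[end].append((start, k))

    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        H[i][0] = i * gap
    for j in range(1, n2 + 1):
        H[0][j] = j * gap

    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            # Maxterm: pair motif matches ending at i and j that share a grammar.
            for si, k1 in x_by_end[i]:
                for sj, k2 in y_by_end[j]:
                    if k1 == k2:
                        best = max(best, H[si - 1][sj - 1] + beta[k1])
            H[i][j] = best
    return H[n1][n2]


# Hypothetical input: grammar 1 matches "GC" in s1 and both "GC" and "GGCC" in s2.
guided = guided_alignment_score("AGCA", "AGGCCA", [(2, 3, 1)],
                                [(3, 4, 1), (2, 5, 1)], {1: 5})
plain = guided_alignment_score("AGCA", "AGGCCA", [], [], {})
```

Summed over all cells, the inner double loop touches each pair of entries from X and Y once, which is where the |X||Y| term in the bound comes from.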
Possible Modifications
Affine gap penalties, local alignment computations, or more advanced algorithms for aligning the non-motif-matching regions, with the same or better complexity.
Solve the general problem of optimally aligning multiple sequences guided by a given set of motifs described by CFGs.