Alignment Graph, Alignment Matrix
Computing the Optimal Global Alignment Value
An Introduction to Bioinformatics Algorithms
[Figure: alignment graph for sequences A and B, with the score assigned to each edge type.]
Classical Dynamic Programming: $O(n^2)$
The $O(n^2)$-time Classical Dynamic Programming Algorithm
The Alignment Graph
[Figure: alignment graph for sequences A and B. Each cell has three input vertices $I_1, I_2, I_3$ and one output vertex $O$.]
$$O = \max_{1 \le x \le 3} \left( I_x + \mathrm{edge}[I_x, O] \right)$$
Can the quadratic complexity of the optimal alignment value computation be reduced without relaxing the problem?
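The single-cell recurrence above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `global_alignment_value` and the scoring parameters (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def global_alignment_value(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment value of a and b in O(len(a) * len(b)) time.

    The scoring parameters are hypothetical defaults for illustration.
    """
    # prev[j] holds the best score of aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Three incoming edges: diagonal, vertical, horizontal.
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            vertical = prev[j] + gap
            horizontal = cur[j - 1] + gap
            cur[j] = max(diagonal, vertical, horizontal)
        prev = cur
    return prev[-1]
```

Each cell takes the maximum over its three incoming edges, so the whole table costs one constant-time step per vertex, which is the quadratic bound stated above.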
Is it Possible to Align Sequences in Subquadratic Time?
Dynamic Programming takes $O(n^2)$ for global alignment.
Can we do better? $O(h n^2 / \log n)$, $h \le 1$ (Landau, Crochemore & Ziv-Ukelson 2003).
Techniques:
(1) Compress the sequences.
(2) Utilize the Total Monotonicity of DIST.
[Figure: the alignment graph of the two sequences, shown in full and partitioned into blocks.]
Full graph: $O(n^2)$ vertices.
Block borders only: $O(hn / \log n)$ rows of $n$ vertices plus $O(hn / \log n)$ columns of $n$ vertices.
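Standard, single-cell DP (inputs $I_1, I_2, I_3$, output $O$):
$$O = \max_{1 \le x \le 3} \left( I_x + \mathrm{edge}[I_x, O] \right)$$
New, extended-cell DP (block G with input border $I_1, \dots, I_6$ and output border vertex $O_4$):
$$O_4 = \max_{1 \le x \le 6} \left( I_x + \mathrm{DIST}[x, 4] \right)$$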
Computing the score for Output Border Vertex $O_4$
[Figure: block G with input border vertices $I_1, \dots, I_6$ and output border vertex $O_4$.]

Input    DIST          Input + DIST
I_1      DIST[1,4]     I_1 + DIST[1,4]
I_2      DIST[2,4]     I_2 + DIST[2,4]
I_3      DIST[3,4]     I_3 + DIST[3,4]
I_4      DIST[4,4]     I_4 + DIST[4,4]
I_5      DIST[5,4]     I_5 + DIST[5,4]
I_6      DIST[6,4]     I_6 + DIST[6,4]

$$O_4 = \max_{1 \le x \le 6} \left( I_x + \mathrm{DIST}[x, 4] \right)$$
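As a small illustration of the extended-cell step, the following Python sketch computes every output border value by taking, for each output vertex, the maximum of input plus DIST. It is the naive quadratic version, before any total-monotonicity speedup. The function name `output_border` and the sample numbers are assumptions made up for the example.

```python
def output_border(inputs, dist):
    """Return O[j] = max over x of (inputs[x] + dist[x][j]) for every column j.

    inputs: scores on the input border of a block.
    dist:   dist[x][j] is the best path weight from input vertex x
            to output vertex j inside the block.
    """
    n_outputs = len(dist[0])
    return [
        max(inputs[x] + dist[x][j] for x in range(len(inputs)))
        for j in range(n_outputs)
    ]
```

With hypothetical values `inputs = [1, 2, 3]` and `dist = [[0, -1], [1, 0], [-2, 2]]`, the first output is max(1+0, 2+1, 3-2) = 3 and the second is max(1-1, 2+0, 3+2) = 5.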
The Main Challenges
[Figure: input vector I, the DIST matrix, and the output vector O, where OUT[x,j] = I_x + DIST[x,j].]
How to obtain the DIST for G in O(t) time? (Take advantage of the incremental nature of LZ78 parsing.)
How to compute the column maxima of OUT in O(t) time? (Utilize the Total Monotonicity property of OUT.)
How does Total Monotonicity affect Column Maxima behavior?
For all $a < b$ and $c < d$: $\mathrm{OUT}[a,c] \le \mathrm{OUT}[b,c] \implies \mathrm{OUT}[a,d] \le \mathrm{OUT}[b,d]$.
[Figure: the OUT matrix with rows a, b and columns c, d marked.]
Column maxima row indices are monotonically non-decreasing.
SMAWK Matrix Searching [Aggarwal et al. 87]: the t column maxima of a Totally Monotone array can be computed in O(t) time, by querying only O(t) elements.
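The monotonicity of the column-maxima row indices can already be exploited by a simple divide-and-conquer, sketched below in Python. This is not SMAWK: it needs O(t log t) queries rather than O(t), and is shown only because it is short enough to check by hand. The function name `column_maxima` is an assumption for the example.

```python
def column_maxima(out):
    """Column maxima of a totally monotone matrix (list of rows).

    Relies on the property that the row index of each column maximum
    is non-decreasing from left to right.
    """
    n_rows, n_cols = len(out), len(out[0])
    best = [None] * n_cols

    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        # Find the largest row index achieving the maximum in column mid,
        # searching only the rows still allowed for this column range.
        r_best = r_lo
        for r in range(r_lo, r_hi + 1):
            if out[r][mid] >= out[r_best][mid]:
                r_best = r
        best[mid] = out[r_best][mid]
        # Columns to the left peak at or above r_best, columns to the
        # right peak at or below it.
        solve(c_lo, mid - 1, r_lo, r_best)
        solve(mid + 1, c_hi, r_best, r_hi)

    solve(0, n_cols - 1, 0, n_rows - 1)
    return best
```

A quick sanity check is to compare the result against a brute-force scan on a matrix known to be totally monotone, for example `out[x][j] = x*j - x*x`.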
Accessing a Prefix Block in Constant Time
[Figure: LZ78 tries for A and for B. Block G at (5,4) is shown with its left prefix block (5,2), diagonal prefix block (3,2), and top prefix block (3,4).]
Another Technique to Align Sequences in Subquadratic Time?
For limited edit scoring schemes, such as LCS, use the Four-Russians Speedup.
How Many Points Of Interest?
LZ-78 compression: blocks of size t.
How many points of interest? $O(n^2/t)$
n/t rows with n vertices each
n/t columns with n vertices each
The Four-Russians technique for speeding up dynamic programming
Dan Gusfield: "The idea comes from a paper by four authors concerning boolean matrix multiplication. The general idea taken from this paper has come to be known in the West as the Four-Russians technique, even though only one of the authors is Russian."
Arlazarov, Dinic, Kronrod and Faradzev
Masek & Paterson applied the Four Russians technique to the string edit problem.
Partitioning the Alignment Grid into Blocks of Equal Size t
[Figure: an n x n grid partitioned into (n/t) x (n/t) blocks, each of size t x t.]
Block Alignment Problem
Goal: Find the longest block path through an edit graph.
Input: Two sequences, u and v, partitioned into blocks of size t. This is equivalent to an n x n edit graph partitioned into t x t subgrids.
Output: The block alignment of u and v with the maximum score (longest block path through the edit graph).
Stage 1: compute the mini-alignments
Solve the mini-alignment problems, one for each block pair. Each small square in the n/t by n/t grid represents a block pair.
How many blocks? $(n/t) \cdot (n/t) = n^2/t^2$
Stage 2: dynamic programming
Let $s_{i,j}$ denote the optimal block alignment score between the first i blocks of u and the first j blocks of v.
$$s_{i,j} = \max \begin{cases} s_{i-1,j} - \sigma_{block} \\ s_{i,j-1} - \sigma_{block} \\ s_{i-1,j-1} + \beta_{i,j} \end{cases}$$
$\sigma_{block}$ is the penalty for inserting or deleting an entire block.
$\beta_{i,j}$ is the score of the pair of blocks in row i and column j.
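The stage 2 recurrence can be sketched in Python as below. The block-pair scores are taken as given (they come from stage 1). The function name `block_alignment_score`, the argument names `beta` and `sigma_block`, and the boundary conditions (score 0 at the origin, one block penalty per step along the borders) are assumptions made for this illustration.

```python
def block_alignment_score(beta, sigma_block):
    """Optimal block alignment score.

    beta[i][j]:  score of aligning block i+1 of u with block j+1 of v.
    sigma_block: penalty for inserting or deleting an entire block.
    """
    rows, cols = len(beta), len(beta[0])
    s = [[0] * (cols + 1) for _ in range(rows + 1)]
    # Assumed boundary: skipping k whole blocks costs k * sigma_block.
    for i in range(1, rows + 1):
        s[i][0] = s[i - 1][0] - sigma_block
    for j in range(1, cols + 1):
        s[0][j] = s[0][j - 1] - sigma_block
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            s[i][j] = max(
                s[i - 1][j] - sigma_block,            # delete a block of u
                s[i][j - 1] - sigma_block,            # insert a block of v
                s[i - 1][j - 1] + beta[i - 1][j - 1]  # align the two blocks
            )
    return s[rows][cols]
```

The double loop visits one cell per block pair, which is where the $(n/t) \cdot (n/t)$ count for this stage comes from.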
Block Alignment Runtime
Indices i, j range from 0 to n/t.
Running time of the algorithm is $O([n/t] \cdot [n/t]) = O(n^2/t^2)$ if we don't count the time to compute each $\beta_{i,j}$.
Block Alignment Runtime (cont'd)
Computing all $\beta_{i,j}$ requires solving $(n/t) \cdot (n/t) = n^2/t^2$ mini block alignments, each of size $t \cdot t = t^2$.
So computing all $\beta_{i,j}$ takes time $O(n^2/t^2 \cdot t^2) = O(n^2)$.
This is the same as dynamic programming.
How do we speed this up?
Four Russians Technique
Let $t = \log(n)$, where t is the block size and n is the sequence size.
Instead of having $(n/t) \cdot (n/t) = n^2/t^2$ mini-alignments, construct $4^t \times 4^t$ mini-alignments for all pairs of strings of t nucleotides (huge size), and put them in a lookup table.
However, the size of the lookup table is not really that huge if t is small. Let $t = (\log n)/4$. Then
$$4^t \times 4^t = 4^{(\log n)/4} \times 4^{(\log n)/4} = 4^{(\log n)/2} = 2^{\log n} = n$$
Look-up Table for Four Russians Technique
[Figure: lookup table "Score" whose rows and columns are indexed by t-nucleotide strings AAAAAA, AAAAAC, AAAAAG, AAAAAT, AAAACA, ...; each sequence has t nucleotides.]
The table size is only n, instead of $(n/t) \cdot (n/t)$.
Let $t = (\log n)/4$. Then the number of entries in the lookup table is $4^t \times 4^t = n$.
Computing the score for each entry in the table requires dynamic programming for a $(\log n)$ by $(\log n)$ alignment: $(\log n)^2$.
Altogether: $n (\log n)^2$
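A lookup table of this kind can be sketched in Python by enumerating every pair of t-nucleotide strings and running a small dynamic program for each. The names `mini_alignment` and `build_score_table` and the scoring scheme (match +1, mismatch -1, gap -1) are assumptions for illustration only.

```python
from itertools import product


def mini_alignment(x, y, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two short strings (hypothetical scoring)."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap] + [0] * len(y)
        for j in range(1, len(y) + 1):
            diagonal = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur[j] = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def build_score_table(t, alphabet="ACGT"):
    """Score for every ordered pair of length-t strings: 4^t x 4^t entries."""
    kmers = ["".join(p) for p in product(alphabet, repeat=t)]
    return {(x, y): mini_alignment(x, y) for x in kmers for y in kmers}
```

For t = 2 the table has $4^2 \times 4^2 = 256$ entries. The table grows by a factor of 16 with every increment of t, which is why t has to stay logarithmic in n.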
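New Recurrence
The new lookup table Score is indexed by a pair of t-nucleotide strings, so
$$s_{i,j} = \max \begin{cases} s_{i-1,j} - \sigma_{block} \\ s_{i,j-1} - \sigma_{block} \\ s_{i-1,j-1} + \mathrm{Score}(i\text{th block of } v,\ j\text{th block of } u) \end{cases}$$
Each lookup in Score takes $O(\log n)$ time.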
Four Russians Speedup Runtime
Since computing the lookup table Score of size n takes $O(n (\log n)^2)$ time, the running time is mainly limited by the $n^2/t^2$ accesses to the lookup table.
Each access takes $O(\log n)$ time.
Overall running time: $O([n^2/t^2] \cdot \log n)$
Since $t = (\log n)/4$, substitute in: $O([n^2/(\log n)^2] \cdot \log n) = O(n^2/\log n)$
So Far (restricted to block alignment)
We can divide up the grid into blocks and run dynamic programming only on the corners of these blocks.
In order to speed up the mini-alignment calculations to under $n^2$, we create a lookup table of size n, which consists of all scores for all t-nucleotide pairs.
Running time goes from quadratic, $O(n^2)$, to subquadratic: $O(n^2/\log n)$
Four Russians Speedup for LCS
Unlike the block partitioned graph, the LCS path does not have to pass through the vertices of the blocks.
[Figure: two panels, "block alignment" and "longest common subsequence".]
Block Alignment vs. LCS
In block alignment, we only care about the corners of the blocks.
In LCS, we care about all points on the edges of the blocks, because those are points that the path can traverse.
Recall, each sequence is of length n, each block is of size t, so each sequence has (n/t) blocks.
How Many Points Of Interest?
Block alignment: how many blocks? $(n/t) \cdot (n/t) = n^2/t^2$
Longest common subsequence: how many points of interest? $O(n^2/t)$
n/t rows with n vertices each
n/t columns with n vertices each
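Traversing Blocks for LCS (cont'd)
If we used regular dynamic programming to compute the grid, it would take quadratic, $O(n^2)$, time, but we want to do better.
Use the Four Russians Tabulation!
[Figure: a t x t block. The scores on its top row and left column are marked "we know these scores"; the scores on its bottom row and right column are marked "we can calculate these scores".]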
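[Figure: example block. The input I consists of the scores on the top row, the scores on the left column, and the two substrings of the block; the output O consists of the scores on the bottom row and right column.]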
Number of possible inputs I (top-row scores, left-column scores, and two substrings of length t):
$$4^t \cdot 4^t \cdot n^t \cdot n^t = (4n)^{2t}$$
This will be a huge table! We need another trick.
The Longest Common Subsequence
T = B C B A D B D C D
S = A B C B D B D D
X = LCS(S,T) = BCBDBDD
L = |LCS(S,T)| = |BCBDBDD| = 7
The LCS Alignment Graph
[Figure: alignment graph with S (length m) along the rows and T (length n = 9) along the columns.]
Diagonal blue arrows are match points $\{(i,j) \mid S[i] = T[j]\}$. Assigned score of 1.
Horizontal black arrows are deletions from T. Assigned score of 0.
Vertical black arrows are deletions from S. Assigned score of 0.
Classical Dynamic Programming: $O(nm)$
(Crochemore, Landau, Ziv-Ukelson: $O(nm / \log m)$)
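The classical O(nm) computation on this graph can be sketched in Python as follows, with diagonal match edges scoring 1 and horizontal or vertical edges scoring 0. The function name `lcs_length` is an assumption for the example.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t, in O(len(s) * len(t))."""
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Diagonal match edge, score 1.
                cur[j] = prev[j - 1] + 1
            else:
                # Horizontal or vertical edge, score 0.
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]
```

For S = ABCBDBDD and T = BCBADBDCD this gives 7, the length of BCBDBDD.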
Observation. Due to the unit-step properties of LCS, both I and O are monotonically non-decreasing series, and their values go up by unit steps [Hunt-Szymanski-77].
[Figure: a strip of the alignment graph over T = B C B A D B D C D, with input row I and output row O.]
Input row I = (I0, ..., I9) = (0, 1, 2, 3, 4, 4, 4, 4, 4, 4)
Output row O = (O0, ..., O9) = (0, 1, 2, 3, 3, 4, 4, 5, 5, 6)
Reducing Table Size
Alignment scores in LCS are monotonically non-decreasing, and adjacent elements can't differ by more than 1.
Example: 0,1,2,2,3,4 is ok; 0,1,2,4,5,8 is not, because 2 and 4 differ by more than 1 (and so do 5 and 8).
Therefore, we only need to store quadruples whose scores are monotonically non-decreasing and differ by at most 1.
Efficient Encoding of Alignment Scores
Instead of recording numbers that correspond to the index in the sequences u and v, we can use binary to encode the differences between the alignment scores.
Original encoding: 0, 1, 2, 2, 3, 4
Binary encoding (successive differences): 1, 1, 0, 1, 1
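The difference encoding can be sketched in Python as a pair of helper functions. The names `encode_steps` and `decode_steps` are assumptions for the example; the encoder rejects any row that violates the unit-step property.

```python
def encode_steps(scores):
    """Encode a unit-step score row as the 0/1 differences of adjacent scores."""
    steps = [b - a for a, b in zip(scores, scores[1:])]
    if any(step not in (0, 1) for step in steps):
        raise ValueError("adjacent scores must differ by 0 or 1")
    return steps


def decode_steps(first, steps):
    """Rebuild the score row from its first value and the 0/1 differences."""
    scores = [first]
    for step in steps:
        scores.append(scores[-1] + step)
    return scores
```

A row of t + 1 scores becomes t bits plus one starting value, so there are only $2^t$ possible step patterns per border regardless of how large the scores are.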
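With this encoding, we need to precompute only inputs made of binary difference vectors for the block's top row and left column, together with the two length-t substrings.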
Reducing Lookup Table Size
$2^t$ possible steps (t = size of blocks)
$4^t$ possible strings
Lookup table size is $(2^t \cdot 2^t) \cdot (4^t \cdot 4^t) = 2^{6t}$
Computing each entry in the table: $t^2$
Total table construction time: $2^{6t} t^2$
Let $t = (\log n)/6$; table construction time is $O(2^{6((\log n)/6)} (\log n)^2) = O(n (\log n)^2)$
Reducing Lookup Table Size (cont'd)
Let $t = (\log n)/6$.
Stage 1: table construction time is $O(2^{6((\log n)/6)} (\log n)^2) = O(n (\log n)^2)$
Stage 2: alignment graph computation time is $O([n^2/t^2] \cdot t) = O([n^2/(\log n)^2] \cdot \log n) = O(n^2/\log n)$
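The two stages can be sketched together in Python. This is an illustration under stated assumptions, not the slides' implementation: the names `block_output` and `lcs_length_blocks` are hypothetical, the lookup table is filled lazily through memoization instead of being enumerated up front, and sequences are padded with sentinel characters (assumed not to occur in the input) so that their lengths are multiples of t.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def block_output(top, left, a, b):
    """Lookup-table entry for one block.

    top, left: 0/1 step tuples along the block's top row and left column.
    a, b:      substrings labelling the block's rows and columns.
    Returns the 0/1 step tuples along the bottom row and right column.
    """
    rows, cols = len(a), len(b)
    d = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(cols):
        d[0][j + 1] = d[0][j] + top[j]
    for i in range(rows):
        d[i + 1][0] = d[i][0] + left[i]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            match = 1 if a[i - 1] == b[j - 1] else 0
            d[i][j] = max(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1] + match)
    bottom = tuple(d[rows][j + 1] - d[rows][j] for j in range(cols))
    right = tuple(d[i + 1][cols] - d[i][cols] for i in range(rows))
    return bottom, right


def lcs_length_blocks(u, v, t):
    """LCS length of u and v, computed block by block with t x t blocks."""
    # Pad with two different sentinels that never match anything.
    u = u + "\x00" * (-len(u) % t)
    v = v + "\x01" * (-len(v) % t)
    top = [0] * len(v)  # steps along the current block-row boundary
    for bi in range(0, len(u), t):
        left = (0,) * t  # column 0 of the LCS table is all zeros
        for bj in range(0, len(v), t):
            bottom, left = block_output(
                tuple(top[bj:bj + t]), left, u[bi:bi + t], v[bj:bj + t]
            )
            top[bj:bj + t] = bottom
    # The last row starts at 0, so its steps sum to the final score.
    return sum(top)
```

Only step vectors on block borders are stored between blocks, which matches the $O(n^2/t)$ points of interest. A production version would precompute the table and index it by packed integers so that each access costs O(t).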
Summary
We take advantage of the fact that for each block of size $t = O(\log n)$, we can pre-compute all possible scores and store them in a lookup table of size n, whose values can be computed in time $O(n (\log n)^2)$.
We used the Four Russians speedup to go from a quadratic running time for LCS to a subquadratic running time: $O(n^2/\log n)$