Multiple Sequence Alignment (MSA)

Multiple Sequence Alignment (MSA): Group-to-group alignments
Steven Adriaensen & Ken Tanaka

References
- Osamu Gotoh, "Optimal alignment between groups of sequences and its application to multiple sequence alignment" (1993)
- Osamu Gotoh, "Further improvements in methods of group-to-group sequence alignment with generalized profile operations" (1994)

Context: Why do Multiple Sequence Alignment?
- Reducing uncertainties
- Better for identifying similarities

Context: How to do Multiple Sequence Alignment
- Generalization of standard DP algorithms
- Meta-heuristic optimizations
- Progressive alignment approach (e.g. the Clustal system)

Context: Progressive Alignment, Step 1 (discussed in the first lecture)
- Calculate a distance matrix between all sequence pairs
- Distance between a pair of sequences: the alignment score
- Aligning two sequences:
  - global alignment: Needleman-Wunsch
  - local alignment: Smith-Waterman
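Step 1 can be sketched as follows. This is a minimal illustration, not the lecture's own scoring scheme: it assumes a unit mismatch cost and a linear gap cost (both hypothetical values), and it uses the global alignment cost directly as the distance. The function names are made up for this example.

```python
def needleman_wunsch_distance(s, t, mismatch=1, gap=1):
    """Cost of the best global alignment of s and t (lower = more similar).

    Assumption: linear gap cost, so each gap symbol costs `gap`.
    """
    rows, cols = len(s) + 1, len(t) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, cols):
        dist[0][j] = j * gap          # t[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if s[i - 1] == t[j - 1] else mismatch
            dist[i][j] = min(
                dist[i - 1][j - 1] + substitution,  # match / mismatch
                dist[i - 1][j] + gap,               # gap in t
                dist[i][j - 1] + gap,               # gap in s
            )
    return dist[-1][-1]


def distance_matrix(sequences):
    """Pairwise distance matrix over all sequences."""
    n = len(sequences)
    return [
        [needleman_wunsch_distance(sequences[a], sequences[b]) for b in range(n)]
        for a in range(n)
    ]
```

The resulting matrix is symmetric with a zero diagonal, which is the form a guide-tree construction method expects as input.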

Context: Progressive Alignment, Step 2 (discussed in the second lecture)
- Constructing a guide tree (phylogenetic tree)
- Tree describing the relationships between sequences
  - Fitch-Margoliash
- Length of branches: distance between sequences

Context: Progressive Alignment, Step 3
- We now have a guide tree, what's next?
- GroupAlign = group-to-group alignment
  - the 4 algorithms by Osamu Gotoh

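Group-to-group alignments

\[
A = \begin{pmatrix} a_{11} & \cdots & a_{1I} \\ \vdots & & \vdots \\ a_{M1} & \cdots & a_{MI} \end{pmatrix},
\qquad
B = \begin{pmatrix} b_{11} & \cdots & b_{1J} \\ \vdots & & \vdots \\ b_{N1} & \cdots & b_{NJ} \end{pmatrix}
\]

e.g. with $I = J = 2$, writing $a_i$ and $b_j$ for whole columns:

\[
\begin{pmatrix} a_1 & a_2 & - \\ - & b_1 & b_2 \end{pmatrix}
\qquad \text{path: } 1\ 3\ 2
\]

where segment 3 is a match (a column of A against a column of B), segment 1 places a column of A against a gap, and segment 2 places a gap against a column of B.

In order to find such an alignment we need 2 things:
1. A measure of how good an alignment is.
2. An algorithm that creates an alignment that is (approximately) optimal with respect to this measure.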

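How good is an alignment?
- We have seen a score for an alignment of 2 sequences: extend it to multiple sequences.
- Affine gap penalty:

\[
\text{gap penalty} = u \cdot \text{gap length} + v
\]

- Cost of a pair of symbols:

\[
d(a_{mi}, a_{ki}) =
\begin{cases}
0 & (a_{mi} = - \text{ and } a_{ki} = -) \\
u & (a_{mi} = - \text{ or } a_{ki} = -) \\
d(a_{mi}, a_{ki}) & (\text{else})
\end{cases}
\]

- Sum of Pairs:

\[
SP = \sum_{m=2}^{M} \sum_{k=1}^{m-1} S_{mk},
\qquad
S_{mk} = \sum_{i=1}^{I} d(a_{mi}, a_{ki}) + v \, g_{mk}
\]

  where $g_{mk}$ is the number of gaps in the pair of sequences $(m, k)$.
- In the context of group-to-group alignments, the SP of the merged alignment of A and B decomposes as

\[
SP_A + SP_B + SP_{A.B},
\qquad
SP_{A.B} = \sum_{m=1}^{M} \sum_{n=1}^{N} S_{mn}
\]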
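The sum-of-pairs measure and its split into a within-A part, a within-B part and a cross part can be checked on a small example. The sketch below assumes u = 1, v = 2 and a unit mismatch cost (hypothetical values), and counts a gap in a pair only after dropping columns where both rows have a gap. All function names are illustrative.

```python
from itertools import combinations

GAP = "-"


def pair_score(row_m, row_k, u=1, v=2, mismatch=1):
    """S_mk: summed symbol costs plus v times the number of gaps in the pair."""
    total, gaps, open_side = 0, 0, None
    for x, y in zip(row_m, row_k):
        if x == GAP and y == GAP:
            continue                      # matching gaps cost 0 and are ignored
        if x == GAP or y == GAP:
            total += u                    # extension cost per gap symbol
            side = 0 if x == GAP else 1   # which row carries the gap
            if side != open_side:
                gaps += 1                 # a new gap is opened
            open_side = side
        else:
            total += 0 if x == y else mismatch
            open_side = None              # residue pair closes any open gap
    return total + v * gaps


def sum_of_pairs(rows, **costs):
    """SP over all unordered pairs of rows of one alignment."""
    return sum(pair_score(a, b, **costs) for a, b in combinations(rows, 2))


def cross_score(rows_a, rows_b, **costs):
    """SP restricted to pairs with one row from each group."""
    return sum(pair_score(a, b, **costs) for a in rows_a for b in rows_b)


group_a = ["AC-G", "A-TG"]
group_b = ["ACTG"]
total = sum_of_pairs(group_a + group_b)
parts = sum_of_pairs(group_a) + sum_of_pairs(group_b) + cross_score(group_a, group_b)
```

Since the within-group parts are fixed once A and B are given, only the cross part changes with the choice of group-to-group alignment.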

Mind the gap(s)
- If $v \neq 0$ we must be able to compute $g_{mk}$.
- What is g for this pair of sequences?
  [Figure: example pair of aligned sequences]
- Removing matching gaps from a pair of sequences should not affect SP and, by consequence, g, as these were not introduced in aligning these sequences.
- Answer: 5

Algorithms: Notations & remarks
- $D^{\alpha}_{ij}$: the score obtained by the algorithm for the alignment of $\{a_1, \dots, a_i\}$ and $\{b_1, \dots, b_j\}$, given that the last segment of the alignment path was $\alpha$.
  - In algorithms A-C these are the candidates retained for a sub-alignment ij.
  - In algorithm D the semantics of the superscript are extended to denote any candidate retained for a sub-alignment ij.
- The score returned by the algorithms is Score(A.B). Only for C and D we have Score = SP.
- $a^{\alpha}_{m,ij}$ is the last symbol in the m-th row of group A in a sub-alignment ij. It is equal to $a_{mi}$ if $\alpha = 1, 3$ and to $-$ if $\alpha = 2$.
- $b^{\alpha}_{n,ij}$ is the last symbol in the n-th row of group B in a sub-alignment ij. It is equal to $b_{nj}$ if $\alpha = 2, 3$ and to $-$ if $\alpha = 1$.
- e.g. for path 1 3 2 (with $I = J = 2$):

\[
\begin{pmatrix} a_1 & a_2 & - \\ - & b_1 & b_2 \end{pmatrix}
=
\begin{pmatrix} a^{1}_{10} & a^{3}_{21} & a^{2}_{IJ} \\ b^{1}_{10} & b^{3}_{21} & b^{2}_{IJ} \end{pmatrix}
\]

- All algorithms obtain the same result if neither A nor B contains gaps or if $v = 0$!

Algorithm A
- Backtrack: keep a pointer to the predecessor of each candidate retained.
- Reduce memory: replace records that are no longer required.
  - Only keep the J + 2 last computed records.
  - Let A be the group with the longest sequences (minimize J).
  - Backtrack: keep the path for every candidate retained.

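Algorithm A

\[
\begin{aligned}
D^{1}_{ij} &= \min\{\, D^{3}_{i-1,j} + V,\; D^{1}_{i-1,j} \,\} + d(a_i, -) \\
D^{2}_{ij} &= \min\{\, D^{3}_{i,j-1} + V,\; D^{2}_{i,j-1} \,\} + d(-, b_j) \\
D^{3}_{ij} &= \min_{\beta = 1,2,3} D^{\beta}_{i-1,j-1} + d(a_i, b_j)
\end{aligned}
\]

\[
V = MNv,
\qquad
d(a^{\alpha}_{ij}, b^{\alpha}_{ij}) = \sum_{m=1}^{M} \sum_{n=1}^{N} d(a^{\alpha}_{m,ij}, b^{\alpha}_{n,ij})
\]

\[
\text{Record}_A = \{ D^{1}_{ij}, D^{2}_{ij}, D^{3}_{ij} \}
\]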
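The recurrences of algorithm A can be sketched as a score-only dynamic program. This is an illustration under stated assumptions, not Gotoh's implementation: it uses u = 1 and a unit mismatch cost (hypothetical values), treats internal gaps as ordinary symbols in the column cost, charges V = MNv for the first gap segment at the start of the path, and keeps the full tables instead of the reduced-memory records.

```python
import math

GAP = "-"


def d(x, y, u=1, mismatch=1):
    """Cost of a pair of symbols; matching gaps cost 0."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return u
    return 0 if x == y else mismatch


def column_cost(col_a, col_b):
    """d(a_i, b_j): sum over all M*N symbol pairs of two columns."""
    return sum(d(x, y) for x in col_a for y in col_b)


def algorithm_a(group_a, group_b, v=2):
    """Score of the alignment found by the algorithm A recurrences."""
    M, N = len(group_a), len(group_b)
    cols_a, cols_b = list(zip(*group_a)), list(zip(*group_b))
    I, J = len(cols_a), len(cols_b)
    gap_a, gap_b = (GAP,) * M, (GAP,) * N      # all-gap columns
    V = M * N * v                               # assumed gap-opening charge
    inf = math.inf
    D1 = [[inf] * (J + 1) for _ in range(I + 1)]  # last segment 1
    D2 = [[inf] * (J + 1) for _ in range(I + 1)]  # last segment 2
    D3 = [[inf] * (J + 1) for _ in range(I + 1)]  # last segment 3
    D3[0][0] = 0                                # empty alignment
    for i in range(I + 1):
        for j in range(J + 1):
            if i > 0:
                D1[i][j] = (min(D3[i - 1][j] + V, D1[i - 1][j])
                            + column_cost(cols_a[i - 1], gap_b))
            if j > 0:
                D2[i][j] = (min(D3[i][j - 1] + V, D2[i][j - 1])
                            + column_cost(gap_a, cols_b[j - 1]))
            if i > 0 and j > 0:
                D3[i][j] = (min(D1[i - 1][j - 1], D2[i - 1][j - 1], D3[i - 1][j - 1])
                            + column_cost(cols_a[i - 1], cols_b[j - 1]))
    return min(D1[I][J], D2[I][J], D3[I][J])
```

For example, `algorithm_a(["ACGT", "ACGT"], ["AGT"])` aligns a two-row group with a single sequence; recovering the path itself would additionally need the backtracking pointers described for algorithm A.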

Gap openings and Algorithm A
- Algorithm A treats internal gaps in groups A, B as an ordinary symbol. From a gap-opening perspective, there are no internal gaps.
- Count rule: a change in path 3 → 1 or 3 → 2 is assumed to open a gap in every pairwise comparison (V = MNv).

Case            Path     Gaps counted     Actual gaps
Example         [3,1]    NM = 2x2 = 4     4
Overestimate    [3,1]    NM = 2x2 = 4     1
Underestimate   [3,3]    0                2

[Figures: the three example alignments of a 2-row group A with a 2-row group B, each with an m\n table of pairwise gap counts]
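The difference between the count rule and the actual number of gaps can be reproduced on small inputs. The groups below are made-up examples, not the ones from the slides, and the sketch assumes the start of the path behaves like a segment 3.

```python
GAP = "-"


def merge(group_a, group_b, path):
    """Rows of the merged alignment for a path of segments 1, 2 and 3."""
    rows_a, rows_b = [""] * len(group_a), [""] * len(group_b)
    i = j = 0
    for segment in path:
        take_a, take_b = segment in (1, 3), segment in (2, 3)
        rows_a = [row + (src[i] if take_a else GAP) for row, src in zip(rows_a, group_a)]
        rows_b = [row + (src[j] if take_b else GAP) for row, src in zip(rows_b, group_b)]
        i += take_a
        j += take_b
    return rows_a, rows_b


def rule_count(path, M, N):
    """Gaps charged by the count rule: M*N per change 3 -> 1 or 3 -> 2."""
    previous = [3] + list(path)  # assumption: the path start counts as 3
    changes = sum(1 for p, c in zip(previous, path) if p == 3 and c in (1, 2))
    return changes * M * N


def actual_count(rows_a, rows_b):
    """Gaps over all cross pairs, ignoring columns where both rows have a gap."""
    total = 0
    for a in rows_a:
        for b in rows_b:
            open_side = None
            for x, y in zip(a, b):
                if x == GAP and y == GAP:
                    continue
                if x == GAP or y == GAP:
                    side = "a" if x == GAP else "b"
                    total += side != open_side
                    open_side = side
                else:
                    open_side = None
    return total


# Overestimate: the internal gap in A already matches the new gap in B.
over = merge(["XY", "X-"], ["X", "X"], [3, 1])
# Underestimate: an internal gap in B is invisible to the count rule.
under = merge(["XY", "XY"], ["XY", "X-"], [3, 3])
```

In the first case the rule charges 4 gaps where only 2 pairs actually open one; in the second it charges none although 2 pairs contain a gap.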

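Complexity of Algorithm A
- Naively: O(NMIJ) = O(NML²)
- O(L²) assuming you use profiles:

\[
d(a^{\alpha}_{ij}, b^{\alpha}_{ij}) = \sum_{x \in X} f^{A}_{x,i} \, p^{B}_{x,j},
\qquad
f^{A}_{x,i} = \sum_{m=1}^{M} [\, a^{\alpha}_{m,ij} = x \,],
\qquad
p^{B}_{x,j} = \sum_{y \in X} d(x, y) \, f^{B}_{y,j}
\]

  Here $[\cdot]$ is 1 if the condition holds and 0 otherwise. When the last column of a group in the sub-alignment is a gap column ($a^{\alpha}_{ij} = -$ or $b^{\alpha}_{ij} = -$), the frequency or profile vector of an all-gap column ($f_{-}$, $p_{-}$) is used instead of $f_i$, $p_j$.
- $f^{A}_{i}$, $f^{B}_{j}$, $p^{A}_{i}$, $p^{B}_{j}$ can be precomputed (outside the loops).
- Essentially we are rewriting a sum with a fixed number of different terms as a dot product: e.g. 2+2+2+2+2+5+5+5+5 = 5*2 + 4*5. $p^{B}_{x,j}$ gives the possible values for the terms (2, 5) and $f^{A}_{x,i}$ their frequencies (5, 4).
- Beware of the large constant |X|!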
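The profile trick can be sketched as follows. The alphabet, the cost values (u = 1, unit mismatch) and all function names are assumptions made for this example; the point is only that the dot product over the alphabet gives the same column cost as the naive double sum over all M·N symbol pairs.

```python
from collections import Counter

GAP = "-"
ALPHABET = "ACGT" + GAP  # hypothetical alphabet X, including the gap symbol


def d(x, y, u=1, mismatch=1):
    """Cost of a pair of symbols; matching gaps cost 0."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return u
    return 0 if x == y else mismatch


def frequency_profile(group):
    """f[i][x]: number of rows whose i-th symbol is x."""
    return [Counter(column) for column in zip(*group)]


def cost_profile(group):
    """p[j][x]: sum over y of d(x, y) * f[j][y], precomputed per column."""
    return [
        {x: sum(d(x, y) * count for y, count in freq.items()) for x in ALPHABET}
        for freq in frequency_profile(group)
    ]


def profile_column_cost(freq_a, cost_b):
    """Dot product over the symbols that occur in the column of A."""
    return sum(count * cost_b[x] for x, count in freq_a.items())


def naive_column_cost(col_a, col_b):
    """Reference value: explicit sum over all M*N symbol pairs."""
    return sum(d(x, y) for x in col_a for y in col_b)
```

Both profiles depend on one group only, so they are computed once before the dynamic programming loops; inside the loops each column cost then takes at most |X| multiplications regardless of M and N.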

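Algorithm B

\[
\begin{aligned}
D^{1}_{ij} &= \min_{\beta = 1,3} \{\, D^{\beta}_{i-1,j} + g^{\beta 1}_{ij} \, v \,\} + d(a_i, -) \\
D^{2}_{ij} &= \min_{\beta = 2,3} \{\, D^{\beta}_{i,j-1} + g^{\beta 2}_{ij} \, v \,\} + d(-, b_j) \\
D^{3}_{ij} &= \min_{\beta = 1,2,3} \{\, D^{\beta}_{i-1,j-1} + g^{\beta 3}_{ij} \, v \,\} + d(a_i, b_j)
\end{aligned}
\]

\[
\text{Record}_B = \{ D^{1}_{ij}, D^{2}_{ij}, D^{3}_{ij} \}
\]

$g^{\beta\alpha}_{ij}$ is the estimated number of new gaps introduced, given that the last 2 path segments are $\beta\alpha$.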

Gap openings in algorithm B
- Gap openings: algorithm B estimates new gaps based on whether or not the symbols at the last 2 positions, for every pair of sequences, are gaps in the evaluated candidate alignment.
[Figure: example alignment in which the uncertain gap openings are marked with "?"]

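$g^{\beta\alpha}_{ij}$ in algorithm B

\[
\begin{aligned}
g^{11}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} (\gamma^{11} q_{m,i-1})(1 - q_{m,i}) \\
g^{13}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} \big[ (\gamma^{13} q_{m,i-1})(1 - q_{m,i}) \, r_{n,j} + (1 - q_{m,i-1} + \gamma^{13} q_{m,i-1})(1 - r_{n,j}) \, q_{m,i} \big] \\
g^{22}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} (\gamma^{22} r_{n,j-1})(1 - r_{n,j}) \\
g^{23}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} \big[ (1 - r_{n,j-1} + \gamma^{23} r_{n,j-1})(1 - q_{m,i}) \, r_{n,j} + (\gamma^{23} r_{n,j-1})(1 - r_{n,j}) \, q_{m,i} \big] \\
g^{31}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - r_{n,j-1} + \gamma^{31} q_{m,i-1} r_{n,j-1})(1 - q_{m,i}) \\
g^{32}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - q_{m,i-1} + \gamma^{32} r_{n,j-1} q_{m,i-1})(1 - r_{n,j}) \\
g^{33}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} \big[ (1 - r_{n,j-1} + \gamma^{33} q_{m,i-1} r_{n,j-1})(1 - q_{m,i}) \, r_{n,j} + (1 - q_{m,i-1} + \gamma^{33} r_{n,j-1} q_{m,i-1})(1 - r_{n,j}) \, q_{m,i} \big]
\end{aligned}
\]

where $q_{m,i} = 1$ if $a_{m,i} = -$ and 0 otherwise, $r_{n,j} = 1$ if $b_{n,j} = -$ and 0 otherwise, and $0 \le \gamma^{\beta\alpha} \le 1$.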

Complexity of algorithm B
- Naive: O(MNL²)
- Using profiles, both $d(a^{\alpha}_{ij}, b^{\alpha}_{ij})$ and $g^{\beta\alpha}_{ij}$ can be computed in a fixed number of steps, resulting in O(L²).

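Algorithm C

\[
\begin{aligned}
D^{1}_{ij} &= \min_{\beta = 1,3} \{\, D^{\beta}_{i-1,j} + g^{\beta 1}_{ij} \, v \,\} + d(a_i, -) \\
D^{2}_{ij} &= \min_{\beta = 2,3} \{\, D^{\beta}_{i,j-1} + g^{\beta 2}_{ij} \, v \,\} + d(-, b_j) \\
D^{3}_{ij} &= \min_{\beta = 1,2,3} \{\, D^{\beta}_{i-1,j-1} + g^{\beta 3}_{ij} \, v \,\} + d(a_i, b_j)
\end{aligned}
\]

- $g^{\beta\alpha}_{ij}$ is the actual number of new gaps introduced, given that the last 2 path segments are $\beta\alpha$.
- $Q^{\alpha}_{ij}$, $R^{\alpha}_{ij}$ give, for every sequence of A and B in sub-alignment ij, the number of consecutive gaps it has at the end.

\[
\text{Record}_C = \{ D^{\alpha}_{ij}, Q^{\alpha}_{ij}, R^{\alpha}_{ij} \} \quad (\alpha = 1, 2, 3)
\]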

Gap openings and algorithm C
[Figure: example alignment annotated with the gap counts of each sequence]
- We count a new gap in a pairwise sequence comparison if we have non-matching gaps and we are not extending a gap we already counted.
- Score(A,B) = SP(A.B)
- Note however that some gaps only get counted when they are closed. This extra cost, present in the full alignment but not in the sub-alignment, is called the retarded gap penalty.
- Because of the retarded gap penalty, a sub-alignment ij of the optimal alignment IJ is not guaranteed to be a retained alignment for ij.

The 55 sub-alignment of the optimal alignment is not the optimal 55 sub-alignment. However, as the better alternative has a different last path segment, it is still retained.

The 66 sub-alignment of the optimal alignment is not the optimal 66 sub-alignment. The better alternative now has the same last path segment and is retained instead.

Consequently, algorithm C does not find the optimal group-to-group alignment here.

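$g^{\beta\alpha}_{ij}$ in algorithm C
- Let $Q'$, $R'$ be the values for Q, R of the predecessor sub-alignment $\beta$. We then have

\[
\begin{aligned}
g^{\beta 1}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - q_{m,i}) \, [Q'_m \ge R'_n] \\
g^{\beta 2}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} (1 - r_{n,j}) \, [Q'_m \le R'_n] \\
g^{\beta 3}_{ij} &= \sum_{m=1}^{M} \sum_{n=1}^{N} \big( (1 - q_{m,i}) \, r_{n,j} \, [Q'_m \ge R'_n] + (1 - r_{n,j}) \, q_{m,i} \, [Q'_m \le R'_n] \big)
\end{aligned}
\]

  where $[\cdot]$ is 1 if the condition holds and 0 otherwise.
- Let $Q'$, $R'$ be the values for Q, R of the predecessor sub-alignment of the retained candidate. $Q_{ij}$, $R_{ij}$ are then computed as follows:

\[
Q_{m,ij} = \begin{cases} Q'_m + 1 & (\text{if } q_{m,i} = 1) \\ 0 & (\text{else}) \end{cases}
\qquad
R_{n,ij} = \begin{cases} R'_n + 1 & (\text{if } r_{n,j} = 1) \\ 0 & (\text{else}) \end{cases}
\]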
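The bookkeeping with Q and R can be sketched and checked against a direct count. This is an illustration under assumptions: Q and R are taken to be the lengths of the trailing gap runs of each row, a residue against a gap is assumed to open a new gap exactly when the run of the residue's row is at least as long as the run of the other row (Q ≥ R on one side, Q ≤ R on the other), and a segment 1 or 2 is represented by an all-gap column in the other group. Function names are made up.

```python
GAP = "-"


def new_gaps(col_a, col_b, Q, R):
    """Number of cross pairs (m, n) in which this column opens a new gap."""
    count = 0
    for m, x in enumerate(col_a):
        for n, y in enumerate(col_b):
            if x != GAP and y == GAP and Q[m] >= R[n]:
                count += 1   # new gap in row n of B relative to row m of A
            elif x == GAP and y != GAP and Q[m] <= R[n]:
                count += 1   # new gap in row m of A relative to row n of B
    return count


def update_runs(column, runs):
    """Trailing gap run per row: extended by a gap, reset by a residue."""
    return [run + 1 if symbol == GAP else 0 for symbol, run in zip(column, runs)]


def count_gaps_incrementally(rows_a, rows_b):
    """Total gaps over all cross pairs, using only Q and R as state."""
    Q, R = [0] * len(rows_a), [0] * len(rows_b)
    total = 0
    for col_a, col_b in zip(zip(*rows_a), zip(*rows_b)):
        total += new_gaps(col_a, col_b, Q, R)
        Q, R = update_runs(col_a, Q), update_runs(col_b, R)
    return total


def count_gaps_directly(rows_a, rows_b):
    """Reference: scan every cross pair, skipping columns with two gaps."""
    total = 0
    for a in rows_a:
        for b in rows_b:
            open_side = None
            for x, y in zip(a, b):
                if x == GAP and y == GAP:
                    continue
                if x == GAP or y == GAP:
                    side = "a" if x == GAP else "b"
                    total += side != open_side
                    open_side = side
                else:
                    open_side = None
    return total
```

Both functions take the rows of an already merged alignment, so the path is implicit in the columns. The state needed per candidate is only M + N integers, which is what Record C stores next to the score.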

Complexity of algorithm C
- Naive: O(MNL²)
- Problem: it is not straightforward to use profiles to compute $g^{\beta\alpha}_{ij}$, so we remain at O(MNL²).
- Gotoh (1994) introduces the concept of generalized profiles, achieving a complexity that is not directly dependent on M, N.

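Algorithm D
- D retains a dynamic set of good candidates (candidate list paradigm), using a series of tests $\tau_1$, $\tau_2$, $\tau_3$ and $\tau_4$.
- Let T be the filter based on $\tau_1$, $\tau_2$, $\tau_3$ and $\tau_4$.
- Let us call $EV_{ij}$ the set of candidates evaluated and $RE_{ij}$ the set of candidates retained.

\[
\begin{aligned}
EV^{1}_{ij} &= \{\, D^{\beta}_{i-1,j} + g^{\beta 1}_{ij} \, v + d(a_i, -) \;:\; \beta \in RE_{i-1,j} \,\} \\
EV^{2}_{ij} &= \{\, D^{\beta}_{i,j-1} + g^{\beta 2}_{ij} \, v + d(-, b_j) \;:\; \beta \in RE_{i,j-1} \,\} \\
EV^{3}_{ij} &= \{\, D^{\beta}_{i-1,j-1} + g^{\beta 3}_{ij} \, v + d(a_i, b_j) \;:\; \beta \in RE_{i-1,j-1} \,\} \\
EV_{ij} &= EV^{1}_{ij} \cup EV^{2}_{ij} \cup EV^{3}_{ij} \\
RE_{ij} &= T(EV_{ij})
\end{aligned}
\]

\[
\text{Record}_D = \{ D^{\alpha}_{ij}, Q^{\alpha}_{ij}, R^{\alpha}_{ij} \} \quad (\alpha \in RE_{ij})
\]

where $g^{\beta\alpha}_{ij}$ is calculated as in algorithm C.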

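Candidate tests in algorithm D
- The tests are 4 necessary conditions for being a sub-alignment of an optimal alignment. Algorithm D therefore returns the optimal alignment for two groups.
- Call $C_{ij}$ the set of candidates that have not failed any prior test.
- Each test compares a candidate $\alpha$ with a reference candidate over all pairs of rows ($m = 1, \dots, M$; $n = 1, \dots, N$):
  - $\tau_1$: reference $l = \arg\min_{C_{ij}} D^{\alpha}_{ij}$; the terms in $Q^{\alpha}_{m,ij}$, $R^{\alpha}_{n,ij}$ are compared (with <) to the same terms in $Q^{l}_{m,ij}$, $R^{l}_{n,ij}$.
  - $\tau_2$: reference $m = \arg\min_{C_{ij}} E^{\alpha}_{ij}$, where $E^{\alpha}_{ij}$ is computed from $D^{\alpha}_{ij}$, $v$ and the Q, R values; comparison with >.
  - $\tau_3$: reference $n = \arg\min_{C_{ij}} F^{\alpha}_{ij}$, where $F^{\alpha}_{ij}$ is computed from $D^{\alpha}_{ij}$, $v$ and the Q, R values; comparison with <.
  - $\tau_4$: reference chosen as $\arg\min_{C_{ij}} G^{\alpha}_{ij}$, where $G^{\alpha}_{ij}$ is computed from $D^{\alpha}_{ij}$, $v$ and the Q, R values; $\tau_4$ holds if $G > D^{\alpha}_{ij}$ or if one of three further comparisons of the Q, R terms (with <, <, >) holds.
  [The exact expressions for E, F, G and the terms compared inside the tests are not legible in this copy.]
- Test order: $\tau_4$ $\tau_1$ $\tau_2$ $\tau_3$ ... (until none is removed or $|C_{ij}| = 1$).
- Note that the reference candidates always automatically pass their corresponding test.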

Complexity of algorithm D
- No worst-case complexity known (obvious upper bound: retains all).
- Average case: O(MNL²)
  - Evaluates/retains fewer candidates than algorithms A-C in practice.
  - Evaluation of candidates takes more time.
- Using the profiles for C allows us to compute $D^{\alpha}_{ij}$ and $g^{\beta\alpha}_{ij}$ in a time not directly dependent on MN.
- However, the tests in D require MN time:
  - Calculating the test values $E_{ij}$, $F_{ij}$, $G_{ij}$: $O(|C_{ij}| \, MN)$
  - Performing the tests $\tau_1$, $\tau_2$, $\tau_3$: $O(|C_{ij}| \, MN)$
- Gotoh (1994) rewrites these computations and tests using generalized profiles to remove these direct dependencies between computation time and MN.

Experiments

Results: General
- Trade-off: speed vs. accuracy
- Speed: A > B > C > D (more procedural complexity)
- Accuracy: D > C > B > A (more accuracy in gap penalty estimation)

Results & Considerations
- Relatedness vs. accuracy: for distantly related sequences, C and D are better.
- (Generalized) profiles vs. no profiles: all algorithms have variants using profiles, but these bring additional complexity, so use them only when M.N is high.

Conclusions: Which algorithm to use?
- Can depend on the groups to be aligned, e.g.
  - No internal gaps: A = B = C = D
  - Distantly related: D > C > B > A
  - M.N is high: (D > C) with generalized profiles > ...
- Can depend on preferences, e.g.
  - Has to be simple (procedurally): A > B > C > D
  - Has to be efficient (time/memory-wise): A > B > ...
  - Has to be accurate: D > C > ...