
Multiple Sequence Alignment

Given:
- A set of sequences
- A score matrix
- Gap penalties
Find:
- An alignment of the sequences such that the optimal score is achieved.

Motivation

Aligning protein families:
- Establish evolutionary relationships
- Identify important functional regions
- Yield structural clues
Aligning non-coding DNA sequences:
- Find conserved regions in DNA for control of expression
- Infer evolutionary relationships
- Identify important functional regions

Motivation

- Binding sites (DNA sequence motifs) may be conserved within species (to control expression in concerted fashion).
- The sites may be conserved across species (using similar control mechanisms).
- They may diverge within and across species for a special purpose or through evolutionary drift.

S1: GA_TCA
S2: GTCTGA
S3: GATATT

Scoring a Multiple Alignment

- Alignment score = sum of column scores.
- Identifying a reasonable method of obtaining a cumulative score for the substitutions in each column is a challenge.
- SP (Sum of Pairs) measure (popular method): compute the pairwise scores of all pairs in the column and sum them. For the column (I, -, I, V):
  $\text{SP-score}(I, -, I, V) = p(I,-) + p(I,I) + p(I,V) + p(-,I) + p(-,V) + p(I,V)$
- Gap penalty: constant, linear, or nonlinear (the choice affects the computational complexity); $p(-,-) = 0$.
- For a whole alignment: $\text{SP-score}(\alpha) = \sum_{i<j} \text{score}(\alpha_{ij})$, where $\alpha_{ij}$ is the pairwise alignment induced by $\alpha$ on sequences $s_i$, $s_j$.
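The sum-of-pairs measure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the pair score `toy_p` (match +1, mismatch -1, gap -2) is a hypothetical stand-in for a real score matrix and gap penalty, and the underscore in the slide's first row is written as `-`.

```python
from itertools import combinations


def toy_p(x, y):
    """Hypothetical pair score: match +1, mismatch -1, residue vs gap -2."""
    if x == "-" or y == "-":
        return -2
    return 1 if x == y else -1


def sp_column_score(column, p, gap="-"):
    """Sum p(x, y) over all unordered pairs of symbols in one column."""
    total = 0
    for x, y in combinations(column, 2):
        if x == gap and y == gap:
            continue  # p(-,-) = 0
        total += p(x, y)
    return total


def sp_score(rows, p):
    """SP-score of an alignment given as equal-length gapped strings."""
    return sum(sp_column_score(col, p) for col in zip(*rows))


# The column (I, -, I, V) from the slide, under the toy scores.
column_score = sp_column_score("I-IV", toy_p)

# The three aligned rows shown on the slide.
alignment_score = sp_score(["GA-TCA", "GTCTGA", "GATATT"], toy_p)
print(column_score, alignment_score)
```

Because the column scores are summed independently, the same function works for any number of rows; only the pair score `p` needs to change to use a real substitution matrix.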

Ex. A Problem with SP-score

Five sequences, three columns, scored with BLOSUM62 (N-N = 6, N-C = -3, C-C = 9):

Column   Residues     N-N pairs   N-C pairs   C-C pairs   Score
A        5 N          10          0           0           60
B        4 N, 1 C     6           4           0           24
C        3 N, 2 C     3           6           1           9

- The scores decrease rapidly!
- SP-score tends to overweight the influence of mutations.

A Problem with SP-score

For n sequences, let column A hold n copies of N and column B hold n-1 copies of N and one C, with s(N,N) = 6 and s(N,C) = -3.
- Score of column A = $6 \cdot n(n-1)/2$
- Score of column B = $6 \cdot n(n-1)/2 - 9(n-1)$
- Relative difference: $9(n-1) / [6 \cdot n(n-1)/2] = 3/n$ (inverse dependence on n)
- Counter-intuitive: the relative difference should increase the more evidence we have for a conserved asparagine (N).

Multiple Alignments: Scoring

- Sum of pairs (SP-score)
- Number of matches (multiple longest common subsequence score)
- Entropy score

Multiple LCS Score

A column is a match if all the letters in the column are the same:

AAA
AAA
AAT
ATC

Only good for very similar sequences.

Methods for Multiple Alignment

- Dynamic Programming
- Progressive Alignment
  - Star
  - CLUSTALW
- Iterative Alignment
- Hidden Markov Model

Aligning Three Sequences

s1: GATTCA
s2: GTCTGA
s3: GATATT
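The SP-score weakness discussed above (the relative difference between a fully conserved column and a column with one mutation shrinks as 3/n) can be checked numerically. The sketch below assumes only the three BLOSUM62 entries quoted on the slide; the helper name `column_score` and the count-based representation of a column are illustrative choices.

```python
from fractions import Fraction
from math import comb

# BLOSUM62 entries used on the slide; keys are alphabetically ordered pairs.
S = {("N", "N"): 6, ("C", "C"): 9, ("C", "N"): -3}


def column_score(counts, s):
    """SP-score of a gap-free column given residue counts, e.g. {"N": 4, "C": 1}."""
    residues = sorted(counts)
    total = 0
    for idx, a in enumerate(residues):
        total += comb(counts[a], 2) * s[(a, a)]          # pairs within one residue type
        for b in residues[idx + 1:]:
            total += counts[a] * counts[b] * s[(a, b)]   # pairs across two residue types
    return total


score_a = column_score({"N": 5}, S)
score_b = column_score({"N": 4, "C": 1}, S)
score_c = column_score({"N": 3, "C": 2}, S)
print(score_a, score_b, score_c)

# Relative difference between column A and column B for growing n.
for n in (5, 10, 100):
    a = column_score({"N": n}, S)
    b = column_score({"N": n - 1, "C": 1}, S)
    print(n, Fraction(a - b, a))
```

Exact fractions are used so that the 3/n pattern is visible without floating-point noise.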

Aligning Three Sequences

- Same strategy as aligning two sequences.
- Use a 3-D "Manhattan Cube", with each axis representing a sequence to align.
- For global alignments, go from source to sink.

[Figure: 3-D grid with source and sink vertices]

2-D vs 3-D Alignment Grid

[Figure: 2-D edit graph (axes V, W) next to a 3-D edit graph]

2-D cell versus 3-D Alignment Cell

- In 2-D, 3 edges in each unit square.
- In 3-D, 7 edges in each unit cube.

Architecture of 3-D Alignment Cell

Vertices of the unit cube ending at (i, j, k):
(i-1, j-1, k-1), (i-1, j-1, k), (i-1, j, k-1), (i-1, j, k), (i, j-1, k-1), (i, j-1, k), (i, j, k-1), (i, j, k)

Alignment Paths

0  1  1  2  3  4
   A  -- T  G  C     x coordinate

0  1  2  3  3  4
   A  A  T  -- C     y coordinate

0  0  1  2  3  4
   -- A  T  G  C     z coordinate

Resulting path in (x, y, z) space:
(0,0,0) -> (1,1,0) -> (1,2,1) -> (2,3,2) -> (3,3,3) -> (4,4,4)

Multiple Alignment: Dynamic Programming

$$
s_{i,j,k} = \max \begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \text{face diagonal: one indel} \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \text{face diagonal: one indel} \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \text{edge: two indels} \\
s_{i,j,k-1} + \delta(\_, \_, u_k) & \text{edge: two indels}
\end{cases}
$$

$\delta(x, y, z)$ is an entry in the 3-D scoring matrix.
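The seven-way recurrence above translates directly into a triple loop over the cube. The following is a sketch under stated assumptions: the column score `sp_delta` (sum of pairs with match +1, mismatch -1, gap -1, and p(-,-) = 0) is a hypothetical choice of $\delta$, and only the optimal score is returned, without traceback.

```python
from itertools import combinations, product


def sp_delta(a, b, c, match=1, mismatch=-1, gap=-1):
    """Hypothetical column score: sum of pairs, with p(-,-) = 0."""
    total = 0
    for x, y in combinations((a, b, c), 2):
        if x == "-" and y == "-":
            continue
        if x == "-" or y == "-":
            total += gap
        else:
            total += match if x == y else mismatch
    return total


def align3_score(v, w, u, delta):
    """Optimal global alignment score of three sequences by 3-D dynamic programming."""
    n, m, l = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 incoming edges of a cell: every non-zero step in {0, 1}^3.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(l + 1):
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    # A step of 1 on an axis consumes a letter; a step of 0 emits a gap.
                    col = (v[pi] if di else "-",
                           w[pj] if dj else "-",
                           u[pk] if dk else "-")
                    cand = s[pi][pj][pk] + delta(*col)
                    if cand > s[i][j][k]:
                        s[i][j][k] = cand
    return s[n][m][l]


print(align3_score("GATTCA", "GTCTGA", "GATATT", sp_delta))
```

The table has (n+1)(m+1)(l+1) cells and each cell inspects 7 predecessors, so the cost grows cubically with sequence length.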

Multiple Alignment: Running Time

- For 3 sequences of length n, the run time is $7n^3$; $O(n^3)$.
- For k sequences, build a k-dimensional Manhattan, with run time $(2^k - 1)\,n^k$; $O(2^k n^k)$.
- Conclusion: the dynamic programming approach for alignment between two sequences is easily extended to k sequences, but it is impractical due to its exponential running time.

Progressive Alignment

- Star method
- CLUSTALW

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C    x: AC-GCTGG-C    y: AC-GC-GAG
y: ACGC--GAC    z: GCCGCA-GAG    z: GCCGCAGAG

can we construct a multiple alignment that induces them? NOT ALWAYS.

The STAR Alignment Method

- Using a pairwise alignment method, find the sequence that is most similar to all the other sequences: for each sequence $s_i$ compute $\sum_{j \neq i} \text{score}(\alpha_{ij})$ and take the sequence with the largest sum.
- Using this best sequence as the center (of a star, hence the name), align the other sequences to it following the "once a gap, always a gap" rule.

Ex:

S1: A T T G C C A T T
S2: A T G G C C A T T
S3: A T C C A A T T T T
S4: A T C T T C T T
S5: A C T G A C C

More on STAR Alignment

Assuming the following similarity matrix for the pairwise comparison of the sequences:

       S1    S2    S3    S4    S5    Σ score(α_ij)
S1     -     7     -2    0     -3    2
S2     7     -     -2    0     -4    1
S3     -2    -2    -     0     -7    -11
S4     0     0     0     -     -3    -3
S5     -3    -4    -7    -3    -     -17

Choose S1 to be the center of the star!

[Figure: star with S1 at the center and S2, S3, S4, S5 as its neighbors]
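Choosing the star center from the pairwise score table above amounts to a row sum followed by an argmax. A minimal sketch, with the scores copied from the slide's similarity matrix (diagonal entries are placeholders and are skipped); the function name `star_center` is an illustrative choice.

```python
def star_center(scores):
    """Return (index of center, row sums) for a symmetric pairwise score matrix."""
    sums = [
        sum(x for j, x in enumerate(row) if j != i)  # skip the diagonal entry
        for i, row in enumerate(scores)
    ]
    center = max(range(len(sums)), key=sums.__getitem__)
    return center, sums


# Pairwise similarity scores for S1..S5 (row and column order S1, S2, S3, S4, S5).
scores = [
    [0, 7, -2, 0, -3],
    [7, 0, -2, 0, -4],
    [-2, -2, 0, 0, -7],
    [0, 0, 0, 0, -3],
    [-3, -4, -7, -3, 0],
]

center, sums = star_center(scores)
print(f"center = S{center + 1}, row sums = {sums}")
```

With these scores the largest row sum belongs to S1, which agrees with the choice made on the slide.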

More on STAR Alignment

Next we get the best alignment between S1 and each of the other sequences:

S1  A T T G C C A T T
S2  A T G G C C A T T

S1  A T T G C C A T T - -
S3  A T C - C A A T T T T

S1  A T T G C C A T T
S4  A T C T T C - T T

S1  A T T G C C A T T
S5  A C T G A C C - -

More on STAR Alignment

Build the MSA starting with S1 and S2:

A T T G C C A T T
A T G G C C A T T

Adding S3 using "once a gap, always a gap":

A T T G C C A T T - -
A T G G C C A T T - -
A T C - C A A T T T T

Repeat to include all the sequences:

A T T G C C A T T - -
A T G G C C A T T - -
A T C - C A A T T T T
A T C T T C - T T - -
A C T G A C C - - - -

Complexity of STAR Alignment

The time complexity of the STAR method is dominated by computing the pairwise alignments.
- For k sequences, there are $O(k^2)$ pairs.
- Each pairwise alignment takes $O(n^2)$, where n is the length of each sequence.
- Cost of computing all pairwise alignments: $O(k^2 n^2)$.
- Cost of merging the sequences into an MSA: if $n_{max}$ is the upper bound of the alignment length, one merge takes $O(k\,n_{max})$, so all merges take $O(k^2 n_{max})$.
- Total time complexity of the STAR method: $O(k^2 n^2 + k^2 n_{max})$.

Profile Alignment

- Problem with the Star approach: all alignments are determined by pairwise sequence alignments.
- Profile alignment uses position-specific information from a group's multiple alignment to align a new sequence to it:
  - Mismatches at highly conserved positions should be penalized more.
  - Gaps should be penalized more at positions where few gaps occur.
- Scoring function: SP-score.

Profile Alignment

Aligning two multiple alignments (profiles) using SP-score:

1.    A T T G C C A T T
...
k.    A T G G C C A T T

k+1.  A T C - C A A T
...
K.    A - C T G A A C

Recall: $\text{SP-score}(\alpha) = \sum_{i<j} \text{score}(\alpha_{ij})$. Splitting the K rows into the two profiles (rows 1..k and rows k+1..K):

$$
\text{SP-score}(\alpha) = \sum_{i<j\le k} \text{score}(\alpha_{ij}) + \sum_{k<i<j\le K} \text{score}(\alpha_{ij}) + \sum_{i\le k}\ \sum_{k<j\le K} \text{score}(\alpha_{ij})
$$

The alignment can be done exactly like a standard pairwise alignment!
Only the last sum, over pairs with one row in each profile, needs to be optimized; the two within-profile sums are already fixed.

CLUSTALW

CLUSTALW is a progressive method:
- Use a pairwise alignment method to determine the most related sequences.
- Progressively add less related sequences or groups of sequences to the initial alignment.

CLUSTAL family:
- CLUSTAL: gives equal weight to all sequences
- CLUSTALW: can give different weights to the sequences and other program parameters
- CLUSTALX: provides a GUI to CLUSTAL
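The "once a gap, always a gap" merge used by the STAR method can be sketched as follows. This is an illustrative implementation, not taken from the slides: it assumes every pairwise alignment is given as a pair of gapped strings `(center_row, other_row)` over the same ungapped center, and the function name `star_merge` is hypothetical.

```python
def star_merge(pairwise):
    """Merge pairwise alignments against a common center into one MSA.

    Gaps already present in the center row are never removed; a pairwise
    alignment that needs a new gap in the center adds a gap column to
    every row collected so far.
    """
    center, first = pairwise[0]
    msa = [list(center), list(first)]          # row 0 is the gapped center
    for ca, sa in pairwise[1:]:
        new_row = []
        p = q = 0                              # p walks msa[0], q walks ca
        while p < len(msa[0]) or q < len(ca):
            msa_gap = p < len(msa[0]) and msa[0][p] == "-"
            pair_gap = q < len(ca) and ca[q] == "-"
            if pair_gap and not msa_gap:
                # New gap in the center: open a gap column in all existing rows.
                for row in msa:
                    row.insert(p, "-")
                new_row.append(sa[q])
                p += 1
                q += 1
            elif msa_gap and not pair_gap:
                # Existing gap column that this pairwise alignment lacks.
                new_row.append("-")
                p += 1
            else:
                # Same center letter (or a gap in both): copy the aligned symbol.
                new_row.append(sa[q])
                p += 1
                q += 1
        msa.append(new_row)
    return ["".join(row) for row in msa]


# Pairwise alignments of the center S1 with S2, S3, S4, S5 from the slides.
pairwise = [
    ("ATTGCCATT", "ATGGCCATT"),
    ("ATTGCCATT--", "ATC-CAATTTT"),
    ("ATTGCCATT", "ATCTTC-TT"),
    ("ATTGCCATT", "ACTGACC--"),
]
for row in star_merge(pairwise):
    print(row)
```

Each merge touches every row of the growing alignment once, which is where the $O(k\,n_{max})$ cost per merge comes from.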

CLUSTALW

1. Construct a distance matrix of all $k(k-1)/2$ pairs by pairwise dynamic programming alignment, and compute the distances between all pairs of sequences (p-distance: the proportion p of nucleotide sites at which the two sequences being compared are different).
2. Construct a guide tree by a neighbor-joining clustering algorithm.
3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

More on CLUSTALW

After computing the distance between all pairs of sequences we put them into a matrix. For example, for a set of 7 sequences we could have the following matrix:

Seq.   S1    S2    S3    S4    S5    S6    S7
S1     -
S2     .17   -
S3     .59   .60   -
S4     .59   .59   .13   -
S5     .77   .77   .75   .75   -
S6     .81   .82   .73   .74   .80   -
S7     .87   .86   .86   .88   .93   .90   -

Neighbor Joining

- Very popular method!
- Assumes additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them.
- Produces an unrooted tree.
- Very much like the Fitch-Margoliash method, except that the choice of which sequences to pair is made differently.

Neighbor Joining

[Figure: unrooted tree on four leaves with edge lengths]

- Additivity: the distance between a pair of leaves equals the sum of the lengths of the edges connecting them.
- $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$, where k is the parent of i and j.
- How to choose the neighbor leaves?

Neighbor Joining

Find the modified distance matrix:
- Scaled sum of the distances between sequence i and all other sequences: $r_i = \sum_k d_{ik} / (n - 2)$, where n is the total number of sequences.
- Modified distance matrix: $D_{ij} = d_{ij} - (r_i + r_j)$.
- Claim: a pair of leaves i, j for which $D_{ij}$ is minimal will be neighboring leaves.

Algorithm: Neighbor Joining

Initialization:
- Define T to be the set of leaf nodes, one for each given sequence.
- Let L = T.
Iteration:
- Pick i, j in L for which $D_{ij}$ is minimal.
- Define a new node k and set $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$ for all m in L.
- Add k to T with edges of lengths $d_{ik} = (d_{ij} + r_i - r_j)/2$ and $d_{jk} = d_{ij} - d_{ik}$.
- Remove i and j from L and add k.
Termination:
- When L consists of two leaves i and j, add the remaining edge between i and j, with length $d_{ij}$.
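One iteration of the neighbor-joining algorithm above can be sketched as below. The four-leaf distance matrix is a small additive example chosen for illustration, exact fractions are used to avoid floating-point ties, and the names `symmetric` and `nj_step` are hypothetical helpers rather than part of any library.

```python
from fractions import Fraction as F


def symmetric(pairs):
    """Build a nested dict d[i][j] = d[j][i] from {(i, j): distance}."""
    d = {}
    for (i, j), x in pairs.items():
        d.setdefault(i, {})[j] = x
        d.setdefault(j, {})[i] = x
    return d


def nj_step(d, nodes):
    """One neighbor-joining iteration on the active node list (needs >= 3 nodes).

    Returns the joined pair (i, j), the edge lengths (d_ik, d_jk) to the new
    node k, and the distances d_km from k to every remaining node m.
    """
    n = len(nodes)
    r = {i: sum(d[i][k] for k in nodes if k != i) / (n - 2) for i in nodes}
    best = None
    for a in range(n):
        for b in range(a + 1, n):
            i, j = nodes[a], nodes[b]
            D_ij = d[i][j] - (r[i] + r[j])     # modified distance
            if best is None or D_ij < best[0]:
                best = (D_ij, i, j)            # first minimum wins on ties
    _, i, j = best
    d_ik = (d[i][j] + r[i] - r[j]) / 2
    d_jk = d[i][j] - d_ik
    d_km = {m: (d[i][m] + d[j][m] - d[i][j]) / 2 for m in nodes if m not in (i, j)}
    return (i, j), (d_ik, d_jk), d_km


d = symmetric({
    (1, 2): F(3, 10), (1, 3): F(1, 2), (1, 4): F(3, 5),
    (2, 3): F(3, 5), (2, 4): F(1, 2), (3, 4): F(9, 10),
})
joined, edges, new_d = nj_step(d, [1, 2, 3, 4])
print(joined, edges, new_d)
```

Repeating the step on the reduced node set, with the new node's distances added to `d`, yields the full tree; the sketch stops after one iteration to keep the bookkeeping visible.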

Example

Distances d between four leaves:

d     2     3     4
1     0.3   0.5   0.6
2           0.6   0.5
3                 0.9

Modified distances D (here $r_1 = r_2 = 0.7$ and $r_3 = r_4 = 1.0$):

D     2      3      4
1     -1.1   -1.2   -1.1
2            -1.1   -1.2
3                   -1.1

$D_{13}$ is minimal, so leaves 1 and 3 are joined at a new node 5.
- From $d_{ik} = (d_{ij} + r_i - r_j)/2$ and $d_{jk} = d_{ij} - d_{ik}$: $d_{15} = 0.1$, $d_{35} = 0.4$.
- From $d_{km} = (d_{im} + d_{jm} - d_{ij})/2$: $d_{52} = 0.2$, $d_{54} = 0.5$.

Updated distances:

d     2     4
5     0.2   0.5
2           0.5

D     2      4
5     -1.2   -1.2
2            -1.2

All three remaining pairs tie, so any of them can be joined next. The resulting unrooted tree has leaf edges of length 0.1 (leaf 1), 0.4 (leaf 3), 0.1 (leaf 2), and 0.4 (leaf 4), and an internal edge of length 0.1.

[Figure: the neighbor-joining tree for leaves 1 to 4 with these edge lengths]

Exercise

s1: GTTGA
s2: GTTTGA
s3: GATATT
s4: GTATA

Distances

p-distance: for a pairwise alignment, count the number of mismatches and gaps between the two sequences, then divide this value by the length of the alignment.

Ex.
N K L - O N
- M L N O N
p-distance = 3/6 = 0.5

Jukes-Cantor distance: $d = -\tfrac{3}{4} \log\left[1 - \tfrac{4}{3}\,p\right]$, where p is the p-distance.

More on CLUSTALW

Construct a guide tree using the Neighbor Joining method. For the distance matrix in the example we could construct the following guide tree.

[Figure: guide tree over S1 to S7 with branch lengths]
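The two distance measures above are short enough to write out. A minimal sketch: `p_distance` counts every column where the two aligned symbols differ (mismatch or gap), and `jukes_cantor` applies the correction $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$, which assumes nucleotide data and is only defined for p < 0.75. Both function names are illustrative.

```python
import math


def p_distance(a, b):
    """Fraction of alignment columns where the two rows differ (mismatch or gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    if not a:
        raise ValueError("alignment is empty")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def jukes_cantor(p):
    """Jukes-Cantor distance for a p-distance p (nucleotide model, p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The pairwise alignment from the slide example.
p = p_distance("NKL-ON", "-MLNON")
print(p)
print(jukes_cantor(p))
```

The correction grows faster than p because multiple substitutions at the same site hide each other; for small p the two measures nearly coincide.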

More on CLUSTALW

- Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.
- In our example we first align S1 with S2 (group 1), then S3 with S4 (group 2), then align group 1 with group 2, and we continue until we have only one alignment.

CLUSTALW

http://clustalw.genome.jp/

Ex: (FASTA format)

>seqa
GARFIELDTHEFASTCAT
>seqb
GARFIELDTHEVERYFASTCAT
>seqc
GARFIELDTHEFATCAT

Problem of Sequence Weights

- The available sequences are not randomly sampled, but reflect biases in how we collect sequences.
- If we weight everything equally, closely related sequences will be allowed to dominate the multiple alignment.
- As a result, conclusions about 1) conservation, 2) evolutionary distance, and 3) reliability of predictions would be wrong.

Sequence Weighting Example

CYEGNGHF   Human-1
CYEGNGDF   Human-2
CYHGNGDF   Human-3
CYHGNGDS   Mouse
CYHGNGQS   Rat
CFEGNGHS   Pig

Solution: don't weight the three humans equally with the others. Use a measure of similarity to down-weight their influence on the multiple alignment.

More on CLUSTALW

More heuristics of CLUSTALW:
- Sequences are weighted to compensate for biased representation in large subfamilies and for the defects of the sum-of-pairs score.
- Use different substitution matrices (BLOSUM80 for closely related sequences; BLOSUM50 for distant sequences).
- Set the gap penalty to be a function of the residues observed at the position (hydrophobic residues give higher gap penalties than hydrophilic or flexible residues).
- Set gap and gap extension penalties to force all the gaps to occur in the same places.

ClustalW In Summary

- Popular multiple alignment tool today.
- "W" stands for "weighted" (different parts of the alignment are weighted differently).
- Three-step process:
  1) Construct pairwise alignments
  2) Build guide tree
  3) Progressive alignment guided by the tree

Iterative Methods

Shortcomings of the progressive approach:
- Dependence upon initial alignments
- Sub-alignments are "frozen"
- Errors in alignment are propagated

Iterative methods:
- Begin with an initial alignment.
- A sequence or a group of sequences is taken out and realigned to a profile of the remaining aligned sequences.
- The alignment is repeatedly refined until it no longer changes.