Sequence Analysis '17- lecture 8. Multiple sequence alignment

Ex5 explanation
How many random database search scores have e-values ≤ 10? (Answer: 10!) Why?
e-value of x = m * P(S ≥ x), where m is the database size and P() is the EVD, which models the distribution of random database search scores.
So, by definition, the e-value is the expected number of random database search scores at or above x.
[Figure: EVD curve marked at m * P(S ≥ x) = 10, 4, 3, 2, and 1, which are e-values of 10, 4, 3, 2, and 1. Axes: e-value; number of random chance occurrences in a database search.]

Manual editing of alignments in UGENE
Download and open the bad alignment from the course web page.*
Align using Kalign. Can you make it better?
Edit manually to consolidate gaps without forcing too many mismatches.
How many indel events are implied by your alignment?
* Opens as an alignment. Older versions of UGENE open this as a list of sequences instead of an alignment. If that happens, select the sequences, right-click and export all sequences as an alignment, and add it to the Project.

Methods for multiple sequence alignment
Dynamic programming
Star
Progressive
  ClustalW. Uses variable gap penalty.
  Kalign. Very fast. Uses exact match.
Progressive + stochastic
  Muscle.
MSA algorithms must be computationally efficient AND biologically relevant.

Is dynamic programming possible for more than two sequences?
A 3 sequence alignment matrix... DP in 3D
A(i,j,k) = MAX {
  A(i-1,j-1,k-1) + S(i,j,k),
  A(i-1,j,k) - gap,
  A(i,j-1,k) - gap,
  A(i,j,k-1) - gap,
  A(i-1,j-1,k) - gap,
  A(i-1,j,k-1) - gap,
  A(i,j-1,k-1) - gap
}
How about adding a 4th seq? How does DP run-time scale with the number of sequences?
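The 3D recurrence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: S(i,j,k) is taken to be a sum-of-pairs match/mismatch score, the values of match, mismatch, and gap are made up, and every non-diagonal move pays one flat gap penalty as on the slide. The function name align3_score is hypothetical.

```python
from itertools import product

def align3_score(a, b, c, match=1, mismatch=-1, gap=2):
    """Best score of a 3-sequence alignment using the 7-move 3D recurrence."""
    def s(x, y):
        return match if x == y else mismatch

    la, lb, lc = len(a), len(b), len(c)
    neg_inf = float("-inf")
    # A[i][j][k] = best score aligning a[:i], b[:j], c[:k]
    A = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    A[0][0][0] = 0
    # The 7 moves: every non-empty subset of the three sequences advances.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    if (di, dj, dk) == (1, 1, 1):
                        # S(i,j,k): assumed sum-of-pairs score of the column
                        step = (s(a[i - 1], b[j - 1]) + s(a[i - 1], c[k - 1])
                                + s(b[j - 1], c[k - 1]))
                    else:
                        step = -gap
                    best = max(best, A[pi][pj][pk] + step)
                A[i][j][k] = best
    return A[la][lb][lc]

print(align3_score("AGH", "AGH", "AGH"))
```

For N sequences of length L the table has (L+1)^N cells and each cell looks back along 2^N - 1 moves, so the run time grows exponentially with the number of sequences.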

Star alignment
1. Align all sequences to one sequence.
2. Stack them up.
[Figure: star diagram of sequences A to G]
Potential problems with star alignment: Unaligned gaps. Ambiguous associations.
A G H . I . W W . P F W P
A G H . I I F W . P Y . .
A G H I I . . W F P F W P
A G H . I P W W . P . . .
Each pairwise alignment by itself looks fine, but when you stack them up, you see disagreements.

What that alignment should look like.
A G H I . W W P F W P
A G H I I F W P Y . .
A G H I I W F P F W P
A G H I P W W P . . .

BLAST "query-anchored" alignments are star alignments

Progressive alignment
Method for progressive alignment
1. Align all pairs. Save scores in a distance matrix.
2. Make a guide tree.
3. Pairwise align the two most similar sequences.
4. Align the next two most similar sequences, and so on.
5. Add sequences until all sequences are aligned.
[Figure: distance matrix and guide tree; DP alignment matrix in which the sequence to add (A W P Y) is aligned, with a gap, against the current alignment:
A G H I . W W P F
A G H I I F W P Y]

"Distances" versus "similarities"
Maximizing similarity and minimizing distance are equivalent if d(i,j) + s(i,j) = s_max for each position in the alignment, where s_max is the maximum possible similarity and the minimum distance is d = 0.
Distance based on identity score (p-distance):
d = 100 - %identity
Distance using empirical J-C correction:
$$ d_{JC} = -\ln\left(\frac{S_{real} - S_{rand}}{S_{ident} - S_{rand}}\right) $$
where S_ident = score of an identity alignment, and S_rand = mode score of a false alignment. For proteins, S_rand ≈ 25%.
Twilight zone (R. Doolittle, 1986)
[Figure: d_JC plotted against S_real]

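Jukes-Cantor for proteins
Empirical J-C correction:
$$ d_{JC} = -\ln\left(\frac{pid - 25}{75}\right) $$
where pid is the percent identity and 25 = mode score of a false alignment.
[Figure: d_JC plotted against S_real and p-distance; axis ticks 0, 0.25, 0.75]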
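As a quick check of the two distance definitions, here is a minimal Python sketch. The function names are hypothetical, and returning infinity at or below 25% identity is an assumption made here so that the logarithm stays defined.

```python
import math

def p_distance(pid):
    """p-distance from percent identity: d = 100 - %identity."""
    return 100.0 - pid

def jc_distance(pid, s_rand=25.0, s_ident=100.0):
    """Empirical J-C distance: -ln((pid - s_rand) / (s_ident - s_rand))."""
    if pid <= s_rand:
        # Assumption: at or below the random level the distance is unbounded.
        return math.inf
    return -math.log((pid - s_rand) / (s_ident - s_rand))

for pid in (100.0, 62.5, 30.0):
    print(pid, p_distance(pid), jc_distance(pid))
```

The corrected distance is 0 for identical sequences and rises steeply as percent identity approaches the 25% level expected for unrelated proteins.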

Progressive alignment: making the guide tree from the distance matrix (step 2)
1. Select the shortest distance i,j.
2. Join i and j.
3. Reduce the rank of the distance matrix by joining columns i and j and rows i and j.
   Minimum rule: select the minimum of the two values.
   Maximum rule: select the maximum of the two values.
4. Repeat until rank = 1.
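The join-and-reduce loop can be sketched as follows. This is a simple agglomerative clustering written for illustration, not a full neighbor-joining implementation; the function name guide_tree, the frozenset-keyed distance dictionary, and the toy distances are all assumptions.

```python
from itertools import combinations

def guide_tree(names, dist, rule=min):
    """Build a nested-tuple guide tree.

    dist maps frozenset({a, b}) -> distance.
    rule=min is the minimum rule, rule=max is the maximum rule.
    """
    d = dict(dist)
    clusters = list(names)
    while len(clusters) > 1:
        # Select the shortest distance i,j among the current clusters.
        i, j = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (i, j)
        rest = [c for c in clusters if c not in (i, j)]
        # Reduce the rank: one new row/column replaces those of i and j.
        for c in rest:
            d[frozenset((merged, c))] = rule(d[frozenset((i, c))],
                                             d[frozenset((j, c))])
        clusters = rest + [merged]
    return clusters[0]

# Toy distances (hypothetical values).
pairs = {("A", "B"): 1, ("C", "D"): 2, ("A", "C"): 5,
         ("A", "D"): 6, ("B", "C"): 4, ("B", "D"): 7}
dist = {frozenset(k): v for k, v in pairs.items()}
print(guide_tree(["A", "B", "C", "D"], dist))
```

Passing rule=max instead of rule=min switches from the minimum rule to the maximum rule without changing the rest of the loop.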

In class: progressive alignment
Making a guide tree. Neighbor-joining algorithm:
[Table: 15 pairwise values for sequences A to F, in extraction order: 97 81 77 82 59 32 80 55 31 90 65 40 61 42 33]
Fill in J-C distances.
[Blank table for sequences A to F]
Draw guide tree here.

How do we represent two aligned sequences as one "sequence"?
A G H I . W W P F
A G H I I F W P Y

     1    2    3    4    5    6    7    8    9
A    1    0    0    0    0    0    0    0    0
F    0    0    0    0    0    0.5  0    0    0.5
G    0    1    0    0    0    0    0    0    0
H    0    0    1    0    0    0    0    0    0
I    0    0    0    1    1    0    0    0    0
P    0    0    0    0    0    0    0    1    0
W    0    0    0    0    0    0.5  1    0    0
Y    0    0    0    0    0    0    0    0    0.5
All other rows (C, D, E, K, L, M, N, Q, R, S, T, V) are 0 at every position.

PSSMs and profiles
20xN scoring matrix.
Set of probability distributions over the 20 amino acids. (Gap probabilities are (usually) not included.)
$$ P(a \mid i) = \frac{\sum_{S : S_i = a} w_S}{\sum_{S} w_S} $$
[Spoken equation: The probability of amino acid a at position i is the sum of the sequence weights w_S over all sequences S such that the amino acid at position i of that sequence, S_i, is a, divided by the sum over the sequence weights w_S for all sequences S.]

Sequence weights???
w1 = 0.75   A G H I . W W P F
w2 = 0.25   A G H I I F W P Y

     1    2    3    4    5    6    7    8    9
A    1    0    0    0    0    0    0    0    0
F    0    0    0    0    0    0.25 0    0    0.75
G    0    1    0    0    0    0    0    0    0
H    0    0    1    0    0    0    0    0    0
I    0    0    0    1    1    0    0    0    0
P    0    0    0    0    0    0    0    1    0
W    0    0    0    0    0    0.75 1    0    0
Y    0    0    0    0    0    0    0    0    0.25
All other rows (C, D, E, K, L, M, N, Q, R, S, T, V) are 0 at every position.
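The weighted profile for this two-sequence example can be computed with a short Python sketch. The function name profile is hypothetical. One assumption is made explicit: sequences with a gap at a position are left out of both sums for that position, which is what gives I a value of 1 at position 5.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def profile(rows, weights, gap="."):
    """Return one {amino acid: probability} dict per alignment column."""
    columns = []
    for i in range(len(rows[0])):
        col = {a: 0.0 for a in AMINO_ACIDS}
        total = 0.0
        for seq, w in zip(rows, weights):
            if seq[i] != gap:          # assumption: gaps are not counted
                col[seq[i]] += w
                total += w
        if total > 0:
            for a in col:
                col[a] /= total
        columns.append(col)
    return columns

rows = ["AGHI.WWPF", "AGHIIFWPY"]
prof = profile(rows, [0.75, 0.25])
# Position 6 (index 5): W from sequence 1, F from sequence 2.
print(prof[5]["W"], prof[5]["F"])
```

Setting both weights to 0.5 gives the unweighted profile in which W and F each have probability 0.5 at position 6.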

Why do we need sequence weights?
A MSA represents a sequence "family".
A sequence family has an amino acid preference at each position. That preference is determined by counting.
But some groups of sequences may be over-represented in the MSA.
[Figure: tree with leaves labeled primates, rabbit, rat, E. coli, lawyer]

Sequence weighting corrects for uneven sampling
Simplest weighting scheme:
Build a tree.
Start with weight = 1.0 at the common ancestor of the tree.
Split the weight evenly at each node.
Primate sequences are 10/18 of the tree, but only 0.125 of the weights, because they are overrepresented.
[Figure: tree with weight 1.000 at the root, split at each node (0.500, 0.250, 0.125, 0.0625, ...); leaf groups: primates, rabbit, rat, lawyer, E. coli.
weights: 0.008 0.008 0.016 0.008 0.008 0.016 0.016 0.016 0.016 0.016 0.0625 0.031 0.031 0.031 0.031 0.0625 0.0625 0.500]
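The weight-splitting scheme is easy to express recursively. The sketch below assumes the tree is given as nested tuples with leaf names as strings; the function name tree_weights and the small example tree are hypothetical, not the tree from the slide.

```python
def tree_weights(tree, weight=1.0, out=None):
    """Split the weight evenly among the children of every node."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        out[tree] = weight           # a leaf keeps whatever weight reached it
    else:
        for child in tree:
            tree_weights(child, weight / len(tree), out)
    return out

# Two closely related sequences share a weight that a lone outgroup keeps whole.
tree = ((("human", "chimp"), "rat"), "E. coli")
print(tree_weights(tree))
```

Leaves in a densely sampled clade end up with small weights, while a sequence that branches off near the root keeps a large share.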

Progressive alignment: scoring a match against aligned sequences (steps 3 to 5)
[Figure: DP alignment matrix aligning the sequence A W P Y, with a gap, against the alignment
A G H I . W W P F
A G H I I F W P Y]
matchscore = 0.25*S(P,W) + 0.75*S(P,F)
Match score for multiple sequence alignments:
$$ \text{matchscore}(i,j) = \sum_{n} \sum_{m} w_n \, w_m \, S(s^n_i, s^m_j) $$
n = index of a sequence in group 1
m = index of a sequence in group 2
w_n = weight of sequence n
w_m = weight of sequence m
S(aa1,aa2) = substitution matrix value for aa1 to aa2
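The double sum can be written directly in Python. In this sketch the substitution values are toy numbers chosen for easy arithmetic, not entries from a real matrix, and skipping gap characters is an assumption; the function name match_score is hypothetical.

```python
def match_score(col1, w1, col2, w2, S, gap="."):
    """Sum of w_n * w_m * S(a, b) over all residue pairs of two columns."""
    total = 0.0
    for a, wa in zip(col1, w1):
        for b, wb in zip(col2, w2):
            if a == gap or b == gap:   # assumption: gaps contribute nothing
                continue
            total += wa * wb * S[frozenset((a, b))]
    return total

# Toy substitution values (hypothetical).
S = {frozenset("PW"): -4.0, frozenset("PF"): -2.0}

# Column with a single P (weight 1) against a column holding W (0.25) and F (0.75).
print(match_score("P", [1.0], "WF", [0.25, 0.75], S))
```

With one sequence in group 1 the sum reduces to the single-line form 0.25*S(P,W) + 0.75*S(P,F).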

NOTE: Initial pairwise alignments are used to get the distances that are used to make the guide tree, but these alignments are discarded and new alignments are made using the progressive method.

CLUSTALW
JD Thompson, DG Higgins, TJ Gibson - Nucleic Acids Research, 1994
Start with an unrooted tree, using neighbor joining.
Choose a root to get the guide tree.
Progressive alignment:
  matches are scored using sequence weights
  gaps are position dependent
    GOP (gap opening penalty) lower for polar residues
    GOP zero where there is already a gap
http://www.ebi.ac.uk/tools/msa/clustalw2/
http://www.ch.embnet.org/software/clustalw-xxl.html

Lightning-striking-twice-in-the-same-place theory
There should be no gap penalty for aligning a gap to an already existing gap! If i is already a gap position in any sequence, set gap(i) = 0.
[Figure: DP alignment matrix aligning A W P Y against the alignment
A G H I . W W P F
A G H I I F W P Y
with a purple arrow. No gap penalty for the purple arrow.]
A(i,j) = A(i-1,j) - gap(1,i)
A(i,j) = A(i,j-1) - gap(2,j)
Sequence-specific, position-specific gap penalties.
NOTE: DP is still optimal when the gap penalty is position-specific.
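A position-specific gap penalty vector for an existing alignment can be sketched as below. The default penalty of 10.0 and the function name gap_penalties are assumptions for illustration; only the rule gap(i) = 0 at existing gap positions comes from the slide.

```python
def gap_penalties(rows, gop=10.0, gap="."):
    """gap(i) = 0 where any aligned sequence already has a gap, else gop."""
    ncol = len(rows[0])
    return [0.0 if any(r[i] == gap for r in rows) else gop
            for i in range(ncol)]

rows = ["AGHI.WWPF", "AGHIIFWPY"]
print(gap_penalties(rows))
```

The resulting list plays the role of gap(1,i) in the recurrence: it is looked up by position instead of using one constant penalty.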

CLUSTALW: position-specific gap penalty

MUSCLE
RC Edgar - Nucleic Acids Research, 2004
Iterative MSA:
k-mer distance matrix --> UPGMA tree --> progressive alignment --> MSA1 --> UPGMA tree --> progressive alignment --> MSA2
  k-mer distance: Not DP. Based on short identical matches.
  UPGMA: One way to build a guide tree.
For randomly selected tree branches:
1. Split the alignment into two groups.
2. Calculate profiles.
3. Align the profiles.
4. Accept or reject the new alignment.
5. Repeat.
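To illustrate a distance based on short identical matches, here is a minimal k-mer distance sketch. It is not claimed to be the exact formula MUSCLE uses: the normalization by the shorter sequence, the default k = 3, and the function names are assumptions.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """1 minus the fraction of k-mers shared by the two sequences."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())       # multiset intersection
    denom = min(len(a), len(b)) - k + 1    # k-mers in the shorter sequence
    if denom <= 0:
        return 1.0
    return 1.0 - shared / denom

print(kmer_distance("AGHIW", "AGHPW"))
```

No alignment is computed, so filling a whole distance matrix this way is much cheaper than aligning every pair with DP.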

UPGMA
Unweighted pair group method using averages
Raw p-distances above the diagonal, J-C corrected distances below the diagonal:

            A      B      C      D      E
Species A   ----   0.20   0.50   0.45   0.40
Species B   0.23   ----   0.40   0.55   0.50
Species C   0.87   0.59   ----   0.15   0.40
Species D   0.73   1.12   0.17   ----   0.25
Species E   0.59   0.89   0.61   0.31   ----

1) Generate neighbor-joining tree. (NJ)
2) For first neighbors, distance to ancestor is dij/2.
3) For next neighbors, distance to ancestor is the average pairwise distance between taxa in the two clades, divided by two.
4) Subtract to get lineage distances.
[Figure: tree of A, B, C, D, E with branch lengths 0.115 and 0.115 (A, B), 0.085 and 0.085 (C, D), 0.23 and 0.145]
To be discussed again when we talk about trees...
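The averaging steps can be checked against the J-C corrected distances in the table with a short Python sketch. It assumes the standard UPGMA procedure, in which the closest pair of clusters is joined at each round; the function name upgma and the nested-tuple tree format are hypothetical.

```python
from itertools import combinations

def upgma(names, dist):
    """Return (tree, joins); joins lists (cluster, distance to ancestor).

    dist maps frozenset({a, b}) -> distance between taxa a and b.
    """
    d = dict(dist)
    size = {n: 1 for n in names}
    active = list(names)
    joins = []
    while len(active) > 1:
        i, j = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        new = (i, j)
        joins.append((new, d[frozenset((i, j))] / 2))
        size[new] = size[i] + size[j]
        active = [c for c in active if c not in (i, j)]
        for c in active:
            # Average pairwise distance between the taxa of the two clades.
            d[frozenset((new, c))] = (size[i] * d[frozenset((i, c))]
                                      + size[j] * d[frozenset((j, c))]) / size[new]
        active.append(new)
    return active[0], joins

jc = {("A", "B"): 0.23, ("A", "C"): 0.87, ("B", "C"): 0.59, ("A", "D"): 0.73,
      ("B", "D"): 1.12, ("C", "D"): 0.17, ("A", "E"): 0.59, ("B", "E"): 0.89,
      ("C", "E"): 0.61, ("D", "E"): 0.31}
tree, joins = upgma("ABCDE", {frozenset(k): v for k, v in jc.items()})
for cluster, height in joins:
    print(cluster, round(height, 3))
```

Subtracting the C-D ancestor distance (0.085) from the distance at which E joins (0.23) gives the 0.145 lineage distance, as in step 4.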

MUSCLE iterative alignment

XP_001615335  YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
XP_002259219  YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
XP_001347897  YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
XP_726635     YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--
XP_671449     ------------------------------------------------------------
XP_001458064  VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
XP_001347129  VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
XP_002283970  DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
XP_002367832  RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA

[Figure: phylogenetic tree; X marks the random cut point]

Group 1:
VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA

Group 2:
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--

DP profile-profile alignment gives the new MSA:
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYV..SIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYV..SIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYV..FIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYI..NIFIYGNLSIPNEINIKNETN--
VVQAQYYTAELFLEELNILDLESLQQFHS..NYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHS..NYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVP..MLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLE..KVLTKNALDVFIMGDIDYEEARKLAEDFRAA

In each iteration: The phylogenetic tree is cut at a random branch, the two subtrees are converted to profiles, and aligned. The new alignment is either accepted or rejected.

Databases of multiple sequence alignments
balibase -- structural alignment-based
BLOCKS -- gapless regions
PFAM -- Hidden Markov models
CDD -- conserved domain database
FSSP -- structural alignment-based (families)

Visit balibase
A database of curated multiple sequence alignments derived from structure-based alignments.
http://www.lbgi.fr/balibase/

Selective re-alignment
Global affine-gap DP alignment may be used to refine an alignment between two conserved, confidently aligned columns.
Select the columns. Align the selected columns with MUSCLE. Or, paste them into the ClustalW web site.
Use the same penalty for opening gaps and end gaps.

Exercise 7: make a MSA (due Oct 5)
1. Select a protein sequence in NCBI.
2. Run a BLAST search. Keep the top 50.
3. Select the hits and download to a FASTA file.
4. Open in UGENE (merge sequences into an alignment).
5. Run MUSCLE.
6. Color using Zappo.
7. Reduce size so that the entire alignment (or as much of it as possible) fits on the screen. Save image.
8. Paste into a file and write a blurb (10 words or less).
9. Save as PDF and send to me in an email.

Review
Are multiple sequence alignments optimal?
How is phylogenetic information used in MSA algorithms?
What are the advantages/disadvantages of a star alignment?
What information is ClustalW encoding in its MSA algorithm?
What is the outermost loop in the MUSCLE alignment?