Ch. 9 Multiple Sequence Alignment (MSA)

- gather seqs. to make an MSA
- doing MSA with ClustalW
- doing MSA with Tcoffee
- comparing seqs. that cannot be aligned

Introduction
- from pairwise alignment to MSA
- comparing DNA or protein seqs.
- difficult for DNA because of mutation
- why MSA? to search for evolutionary (phylogenetic analysis) and structural similarity
- for protein seqs., regions that are similar in seq. usually superimpose in structure as well
- because it is easy to generate bad alignments that look good, we need to evaluate the quality of the alignment (use Tcoffee)
- seqs. that cannot be aligned can still be compared with the Gibbs sampler (identifies related segments of the same length) and Pratt (a motif discovery tool)

MSA is ideal for studying seqs. that share a common ancestor.
MSA cannot be used if the seqs. have no similarity.

The MSA problem
- given N seqs., with L = the length of the longest aligned seq.
- a minimal number of gaps is introduced in the seqs. so that the number of matches or similarities in each column is maximized
- let A_i be the ith seq., A_{i,k} the residue at the kth position of the ith seq., and G(A_i) a gap penalty function
- the objective of MSA is to maximize the score function Score(A^N)
- it is the sum of all possible pairwise alignment scores, i.e. there are N(N-1)/2 possible pairwise alignments

\[
\mathrm{Score}(A^{N}) = \sum_{k=1}^{L} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \mathrm{sub}(A_{i,k}, A_{j,k}) - \sum_{i=1}^{N} G(A_i)
\]

where sub(A_{i,k}, A_{j,k}) is the substitution score for aligning A_{i,k} with A_{j,k} (i.e. insertion, deletion, substitution).
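The sum-of-pairs objective described above can be sketched in a few lines. This is only an illustration under stated assumptions: the +1/-1 substitution scores, the affine form of the gap penalty G, and the choice to subtract G from the pair score are all assumptions made for the example, not values taken from these notes.

```python
from itertools import combinations

def sub(a, b, match=1, mismatch=-1):
    """Toy substitution score (assumed values); a residue facing a gap scores 0."""
    if a == "-" or b == "-":
        return 0
    return match if a == b else mismatch

def gap_penalty(row, gap_open=2, gap_extend=1):
    """Assumed affine G(A_i): gap_open for the first gap character of a run,
    gap_extend for each further gap character in the same run."""
    penalty = 0
    in_gap = False
    for ch in row:
        if ch == "-":
            penalty += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            in_gap = False
    return penalty

def sum_of_pairs(alignment):
    """Score over all N(N-1)/2 row pairs and all L columns, minus gap penalties."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length L")
    pair_score = sum(
        sub(a[k], b[k])
        for a, b in combinations(alignment, 2)
        for k in range(length)
    )
    return pair_score - sum(gap_penalty(row) for row in alignment)

toy_alignment = ["ACGT", "A-GT", "ACGA"]  # made-up rows, N = 3, L = 4
print(sum_of_pairs(toy_alignment))
```

Real programs replace the toy `sub` with a PAM or BLOSUM lookup; the structure of the sum stays the same.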

distance (MAMPR, LOXAF) = 0.01852 + 0.00265 = 0.02117
distance (LOXAF, ELEMA) = 0.01852 + 0.00265 = 0.02117
distance (MAMPR, ELEMA) = 0.01852 + 0.01852 = 0.03704

Help your research with MSA

Application: Procedure
- Extrapolation: MSA helps convince you that an uncharacterized seq. is really a member of a protein family.
- Phylogenetic analysis: If you carefully choose the seqs. to include in the MSA, you can reconstruct the history of these proteins.
- Pattern identification: By discovering very conserved regions, one can identify regions that characterize a function.
- Domain identification: Turn the MSA into a profile that describes a protein family or domain. One can use this profile to scan the databases and look for new members of the family.
- DNA regulatory elements: Aligning the promoters of a set of similarly regulated genes may reveal consensus binding sites for regulatory proteins. Turn a DNA MSA of a binding site into a weight matrix, scan the DNA database, and look for potential binding sites.
- Structure prediction: A good MSA can give a very good prediction of the protein/RNA secondary structure.

- Consider the human parvalbumin, P20472, a calcium-binding protein involved in muscle relaxation
- Use the Expasy BLAST server to retrieve similar seqs., store them, and then build the MSA
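The weight-matrix idea mentioned above for DNA regulatory elements can be sketched as follows. The aligned sites and the scanned DNA string are made-up toy data, and the pseudocount and the uniform background of 0.25 are assumptions chosen for the example.

```python
import math
from collections import Counter

def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Log-odds weight matrix from gap-free aligned sites of equal length."""
    matrix = []
    total = len(sites) + 4 * pseudocount
    for k in range(len(sites[0])):
        counts = Counter(site[k] for site in sites)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        }
        matrix.append(column)
    return matrix

def best_hit(dna, matrix):
    """Slide the matrix along dna; return (score, start) of the best window."""
    width = len(matrix)
    scored = [
        (sum(matrix[k][dna[start + k]] for k in range(width)), start)
        for start in range(len(dna) - width + 1)
    ]
    return max(scored)

toy_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]  # hypothetical binding sites
score, start = best_hit("GGCGTATAATGCCG", weight_matrix(toy_sites))
print(start, round(score, 2))
```

A database scan would apply `best_hit` (or a score threshold) to every seq.; the sketch only handles the letters A, C, G and T.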

- for protein/DNA use blastp/tblastn to search the whole database
- you can restrict the number of hits by selecting a smaller database, e.g. the microbial database

How to select the seqs. you want?
- for a first analysis, select about 10 seqs.; ideally the selected seqs. are evenly spaced between very good E-values (say 10^-40) and less good values (say 10^-5)
- select seqs. of about the same length; don't select fragments (MSA works best on seqs. of similar length)
- pick P20472, P80079, P02626, P02619, P43305, P32930, Q91482, P02620, P02622
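The selection rule above (similar length, E-values spread evenly between a good and a less good value) can be sketched as below. The hit tuples, the 20% length tolerance, and the function names are hypothetical; the spacing is done on the log10 scale, which is one reasonable reading of "evenly spaced" for E-values.

```python
import math

def similar_length(hits, query_len, tolerance=0.2):
    """Keep hits whose length is within +/- tolerance of the query length.
    Each hit is a tuple (identifier, e_value, length)."""
    return [h for h in hits if abs(h[2] - query_len) <= tolerance * query_len]

def pick_evenly_spaced(hits, n, best_exp=-40.0, worst_exp=-5.0):
    """Pick up to n hits whose log10(E-value) lies closest to n evenly spaced
    targets between best_exp and worst_exp. Assumes n >= 2 and E-values > 0."""
    targets = [best_exp + i * (worst_exp - best_exp) / (n - 1) for i in range(n)]
    remaining = list(hits)
    chosen = []
    for target in targets:
        if not remaining:
            break
        nearest = min(remaining, key=lambda h: abs(math.log10(h[1]) - target))
        chosen.append(nearest)
        remaining.remove(nearest)
    return chosen

# Hypothetical BLAST hits: (identifier, E-value, length)
toy_hits = [("a", 1e-40, 110), ("b", 1e-38, 109), ("c", 1e-22, 108),
            ("d", 1e-21, 40), ("e", 1e-5, 111), ("f", 1e-6, 109)]
kept = similar_length(toy_hits, query_len=110)
print([h[0] for h in pick_evenly_spaced(kept, n=3)])
```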

There are three ways to export these seqs.: FASTA, ClustalW, Tcoffee

EBI ClustalW server http://www.ebi.ac.uk/clustalw/index.html

Interpreting MSA result
- (*) indicates an entirely conserved column
- (:) indicates columns where all the residues have roughly the same size and the same hydropathy
- (.) indicates columns where the size or the hydropathy is roughly the same
- a good block is a unit at least 10-30 amino acids long exhibiting 1~3 (*), 5~7 (:) and a few (.)

If you know the accession numbers of the seqs., you can retrieve them as shown in the following:

http://tw.expasy.org/sprot/sprot-retrieve-list.html
P20472, P80079, P02626, P02619, P43305, P32930, Q91482, P02620, P02622

See Appendix A for all the seqs. (FASTA format)

EBI ClustalW server http://www.ebi.ac.uk/clustalw/index.html
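The rule of thumb for a good block, given under "Interpreting MSA result", can be turned into a small check on the line of conservation symbols that ClustalW prints under each alignment block. The thresholds below are the lower bounds from the rule (window of 10 columns, at least 1 star and 5 colons); the function name and the toy symbol line are assumptions for illustration.

```python
def good_block_starts(marks, width=10, min_star=1, min_colon=5):
    """Return the start index of every window of `width` columns that holds
    at least min_star '*' and at least min_colon ':' symbols."""
    starts = []
    for start in range(len(marks) - width + 1):
        window = marks[start:start + width]
        if window.count("*") >= min_star and window.count(":") >= min_colon:
            starts.append(start)
    return starts

# Made-up conservation line: 5 unconserved columns, a 10-column block, 5 more.
toy_marks = " " * 5 + "*::.:::.:*" + " " * 5
print(good_block_starts(toy_marks))
```

Overlapping windows that pass the test can be merged into one candidate block before inspecting it by eye.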

Changing ClustalW parameters

Parameter: Effect
- Substitution matrix: Substitution matrices control the cost of mutations in seq. alignments. If you select a matrix family, like PAM or BLOSUM, ClustalW automatically chooses the appropriate index. If the seqs. are closely related, a change of matrix has no effect. If your alignment is difficult to interpret, it is worth changing from BLOSUM to PAM.
- Gap-opening penalty (GOP): The higher the value of GOP, the more difficult it is to open a gap. Tuning it has little effect because ClustalW readjusts GOP automatically.
- Gap-extension penalty (GEP): GEP controls the size of the gaps.

It is impossible to predict the optimal combination of GOP/GEP. The only way to find this combination is empirically.
- the reason for changing the parameters is to test whether slight changes can improve the overall alignment

ClustalW
- Clustal program by Higgins & Sharp, 1988
- ClustalW is a more recent revision; the W stands for the weights assigned to the seqs., which reflect the evolutionary changes in the aligned seqs. and the distribution of gaps between conserved domains
- ClustalX provides a graphical interface

Progressive algorithm (fast, but errors made early in the process cannot be corrected: frozen-in errors); for obvious errors, edit manually with tools such as Jalview (Chapter 10), Seaview or Cinema
- start with the most similar seq. pair and continue to add seqs. in decreasing order of similarity
- one builds a clustering of the seqs. that looks like a phylogenetic tree (dendrogram; this is the file with the .dnd extension); for instance one has two alignments AB and CD
- then it aligns the two alignments as if each of them were a single seq.; for instance one can replace each alignment with a consensus seq.
- alignment of AB -> alignment of CD -> alignment of AB with CD (ABCD) -> alignment of (ABCD)E -> alignment of (ABCDE)F -> alignment of (ABCDEF)G

Making MSA with Tcoffee
- one of the most recently developed methods for MSA
- yields more accurate alignments at the cost of a longer running time
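The order in which a progressive method joins seqs. (AB, then CD, then AB with CD, then E, as in the ClustalW notes above) can be illustrated with a small clustering sketch. It uses plain average-linkage clustering on a made-up distance table; this is a simplification chosen for the example and is not claimed to be ClustalW's own guide-tree procedure.

```python
from itertools import combinations

def guide_tree(names, dist):
    """Average-linkage clustering.
    dist maps frozenset({a, b}) -> distance between seqs. a and b.
    Returns (tree, merges): the nested-tuple tree and the list of joins in order."""
    # Each cluster is (subtree, member names); a leaf subtree is just the name.
    clusters = [(name, [name]) for name in names]
    merges = []

    def average_distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters, i.e. the most similar ones first.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
        merges.append((c1[0], c2[0]))
    return clusters[0][0], merges

# Hypothetical pairwise distances for five seqs. A..E
raw = {("A", "B"): 1, ("C", "D"): 2,
       ("A", "C"): 6, ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6,
       ("A", "E"): 10, ("B", "E"): 10, ("C", "E"): 10, ("D", "E"): 10}
toy_dist = {frozenset(pair): d for pair, d in raw.items()}
tree, merges = guide_tree(["A", "B", "C", "D", "E"], toy_dist)
for step in merges:
    print(step)
```

Each printed join corresponds to one alignment step; because earlier joins are never revisited, a mistake in the first pair stays in the final alignment (the frozen-in errors mentioned above).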

- aln: a text file with the same format as a ClustalW alignment
- pdf

Pattern of Conservation in MSA

Amino acid: Characteristic
- W, Y, F: Tryptophan is a large hydrophobic a.a. located in the core of proteins; it is important for stability and not easily mutated; when it does mutate, W -> Y or F (aromatic a.a.); conserved aromatic a.a. indicate domains
- G, P: Often associated with beta strands or alpha helices
- C: A conserved column of C suggests a C-C disulphide bridge; conserved columns of C at characteristic distances are a signature of domains
- H, S: Probably a catalytic site, especially in proteases
- D, E, R, K: Charged a.a., often involved in ligand binding or in salt bridges (association of two ionic protein groups of opposite charge)
- L: Rarely very conserved unless involved in protein-protein interactions such as a leucine zipper

Adding distantly related seqs.
- the alignment contains many conserved positions
- add a few distantly related seqs. one by one and check the effect of these seqs. on the overall alignment quality
- we want to make sure these distantly related seqs. enhance existing patterns rather than destroy them
- include those seqs. that BLAST reported as marginal hits when we scanned SWISS-PROT for homologues
- add P02591, TPCC_RABIT, the troponin C of rabbit, BLAST E-value = 3.1

- the new seq. respects the blocks that already existed, though it removes some of the conserved positions
- it also reveals regions where insertions and deletions are likely to occur, most likely loops
- add another distantly related seq. to check that these few highly conserved regions are indeed conserved across the whole protein family, even when we compare distantly related species
- include P19123, TPCC_MOUSE, mouse troponin C, BLAST E-value = 3.1
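The check described above, seeing which conserved columns survive once a distant seq. is added, can be sketched on rows taken from a single alignment. The rows below are made-up fragments, and "conserved" here means strictly identical (the columns ClustalW would mark with a star).

```python
def conserved_columns(rows):
    """Indices of columns where every row carries the same non-gap residue."""
    return [
        k for k in range(len(rows[0]))
        if rows[0][k] != "-" and len({row[k] for row in rows}) == 1
    ]

# Hypothetical aligned fragments: three family members plus one distant seq.
family = ["DKDGDG", "DKDGDG", "DQDGDG"]
distant = "DKNGDG"

before = conserved_columns(family)
after = conserved_columns(family + [distant])
lost = [k for k in before if k not in after]
print(before, after, lost)
```

Columns that remain in `after` are the candidates for positions conserved across the whole family; the rows must come from the same alignment so that the column indices are comparable.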

- the most conserved columns remain
- these two highly conserved regions are involved in some biological process
- we know that most of these proteins bind calcium
- one can safely bet that the calcium-binding site involves some of these conserved positions
- in fact the SWISS-PROT annotation indicates that these regions are involved in calcium binding

Comparing sequences that you cannot align
- sometimes one needs to compare seqs. that don't necessarily have a common ancestor
- the Gibbs sampler looks for short, partially conserved, gap-free segments
- Pratt looks for flexible patterns that can contain gaps and only need to be conserved at certain positions

Gibbs Sampler http://bioweb.pasteur.fr/seqanal/interfaces/gibbs-simple.html
- stochastic method, so it is difficult to reproduce the same results
- but it can offer very sensible solutions
- very good at identifying HTH (Helix-Turn-Helix) domains across a protein family
- a nice way to search for regulatory elements shared by unrelated DNA seqs.
- to get good results you need >20 seqs.
- the Gibbs sampler is useful only when the segments you are looking for have exactly the same length, like HTH domains
- for motifs of different lengths use Pratt (http://www.ebi.ac.uk/pratt/index.html), TEIRESIAS (http://cbcsrv.watson.ibm.com/tspd.html), MEME (http://meme.sdc.edu/meme/website)
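The idea behind the Gibbs sampler described above (one gap-free segment of fixed width per seq., resampled one seq. at a time) can be sketched as follows. This is a bare-bones illustration and not the program behind the server listed above: it uses no background model, the pseudocount of 1 is an assumption, and the DNA strings are made up. Fixing the random seed is what makes a run repeatable.

```python
import random
from collections import Counter

def gibbs_sampler(seqs, width, iterations=100, seed=0, alphabet="ACGT"):
    """Return one segment start per seq. after repeated resampling."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Profile built from the current segments of all other seqs.
            others = [s[p:p + width]
                      for j, (s, p) in enumerate(zip(seqs, starts)) if j != i]
            total = len(others) + len(alphabet)
            profile = []
            for k in range(width):
                counts = Counter(segment[k] for segment in others)
                profile.append({a: (counts[a] + 1) / total for a in alphabet})
            # Weight each candidate start in seq. i by its probability
            # under the profile, then draw the new start at random.
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for k in range(width):
                    w *= profile[k][seq[p + k]]
                weights.append(w)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

toy_seqs = ["ACGTTATAATGC", "GGTATAATCCGA", "CCGGATATAATG", "TATAATGGCCAA"]
print(gibbs_sampler(toy_seqs, width=6, seed=1))
```

Different seeds can give different answers, which is the reproducibility problem noted in the list; running several seeds and comparing the reported segments is a common sanity check.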

Appendix A

>sp|P20472|PRVA_HUMAN Parvalbumin alpha - Homo sapiens (Human).
SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
>sp|P80079|PRVA_FELCA Parvalbumin alpha - Felis silvestris catus (Cat).
SMTDLLGAEDIKKAVEAFTAVDSFDYKKFFQMVGLKKKSPDDIKKVFHILDKDKSGFIEE
DELGFILKGFYPDARDLSVKETKMLMAAGDKDGDGKIDVDEFFSLVAKS
>sp|P02626|PRVA_AMPME Parvalbumin alpha - Amphiuma means (Salamander) (Two-toed amphiuma).
SMTDVIPEADINKAIHAFKAGEAFDFKKFVHLLGLNKRSPADVTKAFHILDKDRSGYIEE
EELQLILKGFSKEGRELTDKETKDLLIKGDKDGDGKIGVDEFTSLVAES
>sp|P02619|PRVB_ESOLU Parvalbumin beta - Esox lucius (Northern pike).
SFAGLKDADVAAALAACSAADSFKHKEFFAKVGLASKSLDDVKKAFYVIDQDKSGFIEED
ELKLFLQNFSPSARALTDAETKAFLADGDKDGDGMIGVDEFAAMIKA
>sp|P43305|PRVU_CHICK Parvalbumin, thymic CPV3 (Parvalbumin 3) - Gallus gallus (Chicken).
MSLTDILSPSDIAAALRDCQAPDSFSPKKFFQISGMSKKSSSQLKEIFRILDNDQSGFIE
EDELKYFLQRFECGARVLTASETKTFLAAADHDGDGKIGAEEFQEMVQS
>sp|P32930|ONCO_HUMAN Oncomodulin (OM) (Parvalbumin beta) - Homo sapiens (Human).
SITDVLSADDIAAALQECQDPDTFEPQKFFQTSGLSKMSANQVKDVFRFIDNDQSGYLDE
EELKFFLQKFESGARELTESETKSLMAAADNDGDGKIGAEEFQEMVHS
>sp|Q91482|PRV1_SALSA Parvalbumin beta 1 (Major allergen Sal s 1) - Salmo salar (Atlantic salmon).
MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
>sp|P02620|PRVB_MERME Parvalbumin beta - Merluccius merluccius (European hake).
AFAGILADADITAALAACKAEGSFKHGEFFTKIGLKGKSAADIKKVFGIIDQDKSDFVEE
DELKLFLQNFSAGARALTDAETATFLKAGDSDGDGKIGVEEFAAMVKG
>sp|P02622|PRVB_GADCA Parvalbumin beta (Allergen Gad c 1) (Gad c I) (Allergen M) - Gadus callarias (Baltic cod).
AFKGILSNADIKAAEAACFKEGSFDEDGFYAKVGLDAFSADELKKLFKIADEDKEGFIEE
DELKLFLIAFAADLRALTDAETKAFLKAGDSDGDGKIGVDEFGALVDKWGAKG
>sp|P02591|TPCC_RABIT Troponin C, slow skeletal and cardiac muscles (TN-C) - Oryctolagus cuniculus (Rabbit).
MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKIM
LQATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
>sp|P19123|TPCC_MOUSE Troponin C, slow skeletal and cardiac muscles (TN-C) - Mus musculus (Mouse).
MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM
IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
LQATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
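For working with the Appendix A seqs. in a script, a minimal FASTA reader could look like the sketch below. The function name is hypothetical, and the header is split on either "|" or whitespace, on the assumption that the accession number is always the second field after "sp".

```python
import re

def parse_fasta(text):
    """Return a dict mapping accession number -> seq. (header lines start with '>')."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Second field of the header is taken to be the accession number.
            fields = re.split(r"[|\s]+", line[1:])
            accession = fields[1]
            records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records

sample = """>sp|P20472|PRVA_HUMAN Parvalbumin alpha - Homo sapiens (Human).
SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
"""
for acc, seq in parse_fasta(sample).items():
    print(acc, len(seq))
```

Checking that all parsed seqs. have similar lengths before alignment is a cheap way to catch fragments.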