Multiple Sequence Alignments
Elements of Bioinformatics
Spring, 2003
Tom Carter
http://astarte.csustan.edu/~tom/
March, 2003
Sequence Alignments

Often, we would like to make direct comparisons between two or more residue sequences. This allows us to see which subsequences are conserved, and what differences there are between the sequences. In many cases, we would like to make comparisons among more than two sequences. These comparisons can help us understand evolutionary changes in molecules, and identify functionally or structurally important regions of the molecules.
In general, it can be computationally infeasible to look for globally optimal alignments of sequences, particularly when we allow gaps in the sequences. There may also be ambiguities about what should be considered optimal alignments in particular cases.

There are a variety of ways in which two residue sequences (or subsequences of a sequence) might be similar:
- They might be evolutionarily homologous, sharing a (relatively) recent common ancestor.
- They might be structurally similar, contributing in similar ways to the secondary (or tertiary) structure of the molecule (e.g., alpha helices or beta sheets).
- They might have functional similarity, such as binding sites.

In general, we would expect two evolutionarily homologous sequences to match each other fairly well at the residue level, and for similarity scoring via such models as PAM or BLOSUM to work fairly well. On the other hand, these scoring matrices may not do very well in recognizing structural or functional similarities. For these purposes, we may need more sophisticated methods for building alignments. For instance, an algorithm might use a simple PAM or BLOSUM scoring matrix approach in its early phases, and in later phases use a matching model more closely tailored to structural or functional features.

For example, current implementations of ClustalW use a variety of approaches, including:
- Variation of amino acid substitution matrices at different alignment stages according to the divergence of the sequences to be aligned.
- Residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.
- Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.
(These are from the ClustalW 1.7 descriptive material.)

This is a fairly active area of research. New approaches, such as hidden Markov models (HMMs), are providing new tools and methods.
Basic approach to multiple sequence alignments

The general approach used by many algorithms such as ClustalW for building multiple sequence alignments is as follows:
1. Using a scoring matrix approach, do pairwise comparisons between the sequences. If there are n sequences, we will be doing n(n-1)/2 pairwise comparisons.
2. Using the pairwise comparison scores, build a relatedness tree for the sequences.
3. Starting with the most closely related pair of sequences, build successively larger clusters of sequence alignments until all sequences are aligned.
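The pairwise comparison step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring values (match +1, mismatch -1, gap -2) are toy numbers standing in for a real substitution matrix such as PAM or BLOSUM, the sequences named s1 to s4 are made up, and the function names are hypothetical. It is not ClustalW's actual scoring code.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score.

    The match/mismatch/gap values are assumed toy numbers; a real tool
    would look up residue pairs in a substitution matrix instead.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def all_pairwise_scores(seqs):
    """Score every unordered pair of named sequences."""
    return {(x, y): global_alignment_score(seqs[x], seqs[y])
            for x, y in combinations(sorted(seqs), 2)}


# Hypothetical toy sequences, not taken from the SODC family.
toy = {"s1": "MKVLA", "s2": "MKILA", "s3": "MRVA", "s4": "MKVLA"}
scores = all_pairwise_scores(toy)
print(len(scores), "pairwise comparisons for", len(toy), "sequences")
```

With four sequences this performs 4(4-1)/2 = 6 comparisons; with nine sequences the count is 36.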
Here is a brief example, using some sequences from the Copper-Zinc superoxide dismutase (SODC) family:
SOD2-LYCES [Lycopersicon esculentum (Tomato)]
SODC-SPIOL [Spinacia oleracea (Spinach)]
SODC-YEAST [Saccharomyces cerevisiae (Baker's yeast)]
SODC-XENLA [Xenopus laevis (African clawed frog)]
SODC-RAT [Rattus norvegicus (Rat)]
SODC-MOUSE [Mus musculus (Mouse)]
SODC-HUMAN [Homo sapiens (Human)]
SODC-DROVI [Drosophila virilis (Fruit fly)]
SODC-CHICK [Gallus gallus (Chicken)]
The first step is to do pairwise similarity scoring of the sequences. In this case, we use a Gonnet variation of the PAM250 scoring matrix:

               1   2   3   4   5   6   7   8   9
SOD2_LYCES 1      84  55  57  54  56  54  59  55
SODC_SPIOL 2          53  55  52  53  53  59  55
SODC_YEAST 3              56  56  54  54  53  55
SODC_XENLA 4                  66  66  66  58  67
SODC_RAT   5                      96  83  59  71
SODC_MOUSE 6                          83  61  71
SODC_HUMAN 7                              59  72
SODC_DROVI 8                                  63
SODC_CHICK 9

Then we build a relatedness tree:

((LYCES, SPIOL 84), (YEAST, (XENLA, (((RAT, MOUSE 96), HUMAN 83), CHICK 71) 66), DROVI 58))
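To illustrate how pairwise scores can drive tree building, here is a minimal Python sketch that repeatedly joins the two most similar clusters, using the scores from the table above. The joining rule (average similarity between clusters) is an assumption chosen for simplicity; it is not claimed to be the method that produced the tree shown above, and its later joins need not agree with that tree. The function names are hypothetical.

```python
from itertools import combinations

NAMES = ["LYCES", "SPIOL", "YEAST", "XENLA", "RAT",
         "MOUSE", "HUMAN", "DROVI", "CHICK"]

# Upper triangle of the pairwise score table, row by row.
UPPER = [
    [84, 55, 57, 54, 56, 54, 59, 55],
    [53, 55, 52, 53, 53, 59, 55],
    [56, 56, 54, 54, 53, 55],
    [66, 66, 66, 58, 67],
    [96, 83, 59, 71],
    [83, 61, 71],
    [59, 72],
    [63],
]


def score_table(names, upper):
    """Map each unordered pair of names to its similarity score."""
    table = {}
    for i, row in enumerate(upper):
        for offset, score in enumerate(row):
            j = i + 1 + offset
            table[frozenset((names[i], names[j]))] = score
    return table


def average_similarity(c1, c2, table):
    """Mean score over all cross-cluster pairs (assumed joining rule)."""
    scores = [table[frozenset((a, b))] for a in c1 for b in c2]
    return sum(scores) / len(scores)


def cluster(names, table):
    """Join the two most similar clusters until one remains.

    Returns a nested-tuple tree and the list of joins in order,
    each as (members, similarity at which they were joined).
    """
    clusters = [(name,) for name in names]
    trees = {(name,): name for name in names}
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda p: average_similarity(p[0], p[1], table))
        sim = average_similarity(c1, c2, table)
        merged = c1 + c2
        trees[merged] = (trees[c1], trees[c2], round(sim, 1))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        merges.append((merged, sim))
    return trees[clusters[0]], merges


table = score_table(NAMES, UPPER)
tree, merges = cluster(NAMES, table)
for members, sim in merges:
    print(f"{sim:6.2f}  {', '.join(members)}")
```

The first joins follow directly from the table: RAT with MOUSE at 96, then LYCES with SPIOL at 84, then HUMAN with the RAT/MOUSE pair at 83.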
Here is a pictorial version of the unrooted relatedness tree:
[Figure: SODC unrooted relatedness tree]

Here is a multiple alignment:
[Figure: SODC multiple alignment (ClustalW 1.81)]

Here is another version of the multiple alignment:
[Figure: SODC multiple alignment (boxshade)]
Notations in ClustalW output

The conservation line output in the clustal format alignment file uses three characters:

*  indicates positions which have a single, fully conserved residue
:  indicates that one of the following strong groups is fully conserved:
   STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW
.  indicates that one of the following weaker groups is fully conserved:
   CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK NEQHRK FVLIM HFY

These are all the positively scoring groups that occur in the Gonnet PAM250 matrix. The strong and weak groups are defined as strong score > 0.5 and weak score <= 0.5 respectively.
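The conservation line rules above can be sketched in Python as follows. The strong and weak groups are copied from the lists above. The function names, the toy alignment, and the handling of gap columns (no symbol when a column contains "-") are assumptions made for illustration, not a transcription of ClustalW's code.

```python
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "  # assumption: columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall within one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall within one weaker group
    return " "


def conservation_line(rows):
    """Build the conservation line for equal-length aligned rows."""
    return "".join(conservation_symbol(col) for col in zip(*rows))


# Hypothetical toy alignment, not taken from the SODC example.
rows = ["MKSWD",
        "MRAFW",
        "MKGYK"]
print(conservation_line(rows))
```

In the toy alignment, the first column (M, M, M) is fully conserved, the second (K, R, K) fits the strong group QHRK, the third (S, A, G) fits only the weaker group SAG, the fourth (W, F, Y) fits the strong group FYW, and the last column fits no group.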