Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT

Size: px
Start display at page:

Download "Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT"

Transcription

1 22 Genome Informatics 16(1): (2005) Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT Kazutaka Katohl Takashi Miyata2" Kei-ichi Kuma1 Hiroyuki Toh5 1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto , Japan 2 Biohistory Research Hall, Takatsuki, Osaka , Japan 3 Department of Electrical Engineering and Bioscience, Science and Engineering, 4 Waseda University, Tokyo , Japan Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto , Japan Division of Bioinformatics, Research Center for Prevention of Infectious Diseases, Medical Institute of Bioregulation, Kyushu University, Fukuoka ,Japan Abstract In 2002, we developed and released a rapid multiple sequence alignment program MAFFT that was designed to handle a huge (up to `5,000 sequences) and long data (,-2,000 as or `5,000 nt) in a reasonable time on a standard desktop PC. As for the accuracy, however, the previous versions (v.4 and lower) of MAFFT were outperformed by ProbCons and TCoffee v.2, both of which were released in 2004, in several benchmark tests. Here we report a recent extension of MAFFT that aims to improve the accuracy with as little cost of calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i in this report). These options use a new objective function combining the weighted sum-of-pairs (WSP) score and a score similar to COFFEE derived from all pairwise alignments. We discuss the improvement in accuracy brought by this extension, mainly using two benchmark tests released very recently, BAliBASE v.3 (for protein alignments) and BR,AliBASE (for RNA alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than those of other methods successively released in 2004, although the difference among the most accurate methods (ProbCons, TCoffee v.2 and new options of MAFFT) was small. The advantage in accuracy of [GL]-INS-i became greater for the alignments consisting of sequences. By utilizing this feature of MAFFT, we also examined another possible approach to improve the accuracy by incorporating homolog information collected from database. The [GL]-INS-i options are applicable to aligning up to r\-, 200 sequences, although not applicable to thousands of sequences because of time and space complexities. Keywords: multiple sequence alignment, dynamic programming, objective function 1 Introduction Multiple alignment of protein or DNA sequences is a classical problem in the area of bioinformatics and an important tool required in a wide range of molecular biological analyses such as detecting conserved residues and inferring evolutionary history of genes and organisms. As a large amount of sequence information is being determined and accumulated by genome and cdna projects of various organisms, the importance of a rapid and accurate algorithm for multiple sequence alignment has been recognized. Under such a situation, several groups are now actively investigating the algorithms

2 MAFFT: A Multiple Sequence Alignment Program 23 Figure 1: Outlined process of MAFFT without pairwise alignment (A) and with pairwise alignments (B). of multiple sequence alignment and a number of new programs have been released since last year [5-7, 13, 15, 28, 32]. ClustalW [30] released in 1994 is a standard multiple sequence alignment software. It employsthe progressive method [8], which is simple but has room for improvement in accuracy and speed. Many of recent methods have tried to improve the accuracy. Gotoh [12]found that the alignment accuracy is significantlyimproved by an iterative refinement method using the weighted sum-of-pairs (WSP) score as an objective function. He developed a program, PRRN, that adopts the method. DIALIGN [17]and TCoffee [19] achieved high accuracies by combininglocal and global alignment features. Although the ability to combine different types of alignments derived from heterogeneoussources, such as structural alignment, is important [21],TCoffee requires a large CPU time proportional to N3. Thus it is hard to apply TCoffee to a large alignment consisting of dozens of sequences. The accuracy of TCoffee has recently been further improved in ver. 2 in comparison with ver. 1. Do et al. [5] proposed a new method ProbCons, whose accuracy is comparable to or slightly higher than TCoffee v.2 as shown in Table 1, and rather faster than TCoffee. Katoh et al. [16]developed MAFFT, in 2002, that aimed to reduce the CPU time of the progressive method. MAFFT also has an iterative refinement option that is a simplified version of PRRN. Edgar [6, 7] also implemented progressiveand iterative alignment strategies in MUSCLE in The algorithms of major options of MUSCLE seem similar to the NW-NS-[i2]options of MAFFT explained below. To reduce the CPU time, UPGMA [26]is more efficientlycoded in MUSCLE.Lassmann and Sonnhammer (manuscript submitted) released Kalign, in Mar. 2005, which adopts a progressive approach to achieve the reduction of CPU time. Kalign has a merit that the length dependency of the CPU time is near to linear. The previous version of MAFFT is used in part by several projects such as PFAM [2],ASTRAL [4] and MEROPS [23]. The outline of procedure of the previous version of MAFFT is explained below (see Figure 1A also). An initial alignment is constructed by the progressive method [8,30] and then improved by the iterative refinement method [3,11]. A user can select an appropriate strategy from the fastest one (FFT-NS-1) to the most accurate one (FFT-NS-i) in v Progressive alignment (1). A rough distance between every pair of input sequences is rapidly calculated based on the number of 6-tuples shared by the two sequences [7, 14,16]. A guide tree is constructed from the distance matrix using UPGMA [26] with modified linkage. Input sequences are progressivelyaligned [8,30] followingthe branching order of the guide tree. This procedure is referred to as FFT-NS Progressive alignment (2). The initial distance matrix is less reliable than that based on all pairwise alignments. We can obtain a more reliable distance matrix by using the FFT-NS-1 alignment [7,16,29]. Progressive alignment is re-performed based on the new guide tree calculated from the new distance matrix. This method is referred to as FFT-NS-2.

3 24 Katoh et al. 3. Iterative refinement. The FFT-NS-2 alignment is further improved by the iterative refinement method [3, 11] that maximizes the weighted sum of pairs (WSP) score proposed by Gotoh [10], using an approximate group-to-group alignment algorithm [16]. This procedure is referred to as FFT-NS-i. For the progressive alignment processes, a fast Fourier transform (FFT) approximation [16] is used in the FFT-NS-2, FFT-NS-1 and FFT-NS-i options (collectively denoted as FFT-NS-[12i] hereafter). When the input sequences are highly conserved, these options require CPU times effectively proportional to average sequence length L for alignments of a single gene. The slopes of the FFT-NS-[2i] lines are indeed close to 1 in Figure 4C. MAFFT also offers three more options NW-NS-[12i] with full dynamic programming (DP) [18] for the same part. In contrast to the case of FFT-NS-[12i], the time complexity of full DP options is O(L2), regardless of the similarity among input sequences. In addition to these options, two additional strategies, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i hereafter), have been introduced in v.5. The [GL]-INS-i strategies use a new objective function combining the WSP [12] score and a score similar to COFFEE [20] derived from all pairwise alignments in order to improve the alignment accuracy. Iterative refinement with the objective function is performed in reasonable time, as [GL] INS-i were designed to have low computational complexity. See Methods for details of the algorithms and the objective function. Recently, reference databases of protein alignments based on 3D structural superposition, such as HOMSTRAD [27], SABmark [32] and PREFAB [7], have been independently developed. They are valuable resources to find good parameters for a sequence alignment program as well as to evaluate the performance of a program. After the publication of our latest paper describing MAFFT v.5 [15] in Jan. 2005, two benchmark databases have already been released. (i) The most widely used benchmark test, BAliBASE [31], has been substantially updated to v.3 (Thompson et al., manuscript submitted) and now contains alignments that are difficult to make accurate residue-to-residue correspondences with only sequence information, due to high sequence divergence, long indels, etc. That is, the alignments are considered to represent difficult cases we may encounter when we deal with real problems. (ii) BRAliBASE, the first reference database for RNA alignment, has been recently released [9]. Thus, we compared the performances of the currently available sequence alignment programs including the new version of MAFFT, using the new reference alignment databases. The parameters of MAFFT and other methods are not closely optimized for these new benchmark datasets. It was suggested that the accuracy of a multiple alignment of distantly related sequences can be improved if they are aligned together with a number of their homologs, because the information from many sequences is expected to reduce the 'noise' [12]. This point has recently been investigated by Katoh et al. [15] and Simossis et al. [24]. We discuss the effect of number of homologs involved in an alignment in this report also. 2 Method 2.1 Incorporating Pairwise Alignment Information into the Objective Function In the two new iterative refinement options, [GL]-INS-i, the objective function is defined as the summation of the WSP score and a score derived from all pairwise alignments. Both global and local alignment algorithms were tested as all pairwise alignment; G-INS-i uses global alignment [18] with an FFT approximation [16], whereas L-INS-i uses local alignment algorithm [25]. The procedures of the algorithms of [GL]-INS-i are as follows (see Figure 1B also). 1. An initial distance matrix is constructed from the pairwise scores, instead of shared 6-tuples, using the equation shown in ref. [16] and a guide tree is built using UPGMA with modified linkage [15]. Unlike FFT-NS-[2i] explained above, the guide tree is not reconstructed in the

4 MAFFT: A Multiple Sequence Alignment Program 25 [GL]-INS-i strategies, because such a reconstruction did not show significant improvement of the alignment accuracy in our preliminary study (data not shown). 2. Each pairwise alignment is divided into gap-free segments and number n is assigned to each segment. The information of these segments is stored in a set of arrays, score [S(s,t,n)], which represents the alignment score of the nth gap-free segment between sequences s and t, length [L(s, t, n)] of the aligned segment (s, t, n), position [P(s, t, n)] for every pair of sequences s and t, and the importance value [E(s, t, n)] that is calculated, as described below, from the score of the segment and how frequently the residues are involved in gap-free segments. We denote (s, t, p, q) E P(s, t, n) if the pth site of sequence s corresponds to the qth site of segment t in aligned segment (s, t, n). 3. The frequency value f (s, p), which represents the frequency of the pth site of sequence s involved in gap-free segments, is calculated as where wt is weighting factor (see [30] for definition) for sequence t. The importance value E(s, t, n) for aligned segment is calculated as The importance matrix I/(s, t, p, q) between the pth site of sequence s and the qth site of sequence t is defined as 4. An alignment of a subset of given sequences, which is generated during the procedures of progressive and iterative refinement methods, is referred to as 'group.' To align groups i and j, matrix H(groupi, groupj, p, q) is constructed as (1) H(groupi, groupj, p, q) where A(s,p) is the pth amino acid residue on sequence s. Mab is a score between a pair of amino acids a and b. The score matrices examined in this study are described in the 'Parameter optimization' section. wst is a weighting factor between sequences s and t. A weighting scheme proposed by Thompson et al. [30] is used in progressive alignment stage whereas the iterative refinement stage adopts the weighting scheme by Gotoh [10]. WI, a weighting factor for the importance value, was set at 2.7 in this study. The alignment between two groups is computed by applying the dynamic programming algorithm [18] to matrix H(,,,) at each step of progressive alignment. The alignment produced by this procedure is referred to as [GL]-INS-1. The performances of [GL]-INS-1 are not shown in this report. 5. An alignment by [GL]-INS-1 is improved by the iterative refinement method ([GL]-INS-i), which optimizes the alignment with a objective function defined as the summation of the WSP score and importance value given by equation 1.

5 26 Katoh et al. 2.2 Performance Evaluation To evaluate the accuracy of the above methods, we used three datasets, BAliBASE v.3 (Thompson et al., manuscript submitted), HOMSTRAD [27] and SABmark [32], for protein alignment. BAliBASE is diveded into six different reference sets. Each set consists of: Ref11, small (-6) number of equidistant sequences with low similarity (%id<20); Refl2, small ( `6) number of equi-distant sequences with medium similarity (%id of 20-40); Ref3, subgroups with <25% identity between the subgroups: Ref4, sequences that have long terminal extensions; Ref5, sequences with internal insertions. From HOMSTRAD, highly diverged alignments was extracted and used in our tests. From SABmark, a part of the TWI subset was used in this report (see refs [15, 21] for the details). The accuracy of alignment was evaluated using either or both of the TC and SP values [31]. TC is the fraction of columns aligned identically to the reference alignment, whereas SP is the total number of correctly aligned sites over all sequence pairs divided by the number of all residue pairs in the reference alignment (see [31] for definition). Obviously, the TC and SP values are highly correlated to each other. For RNA alignment, BRAliBASE [9] was used. The accuracy of alignments were evaluated with SP and the structure conservation index (SCI) using the RNAz software [33]. SCI is a measure of conserved secondary-structural information contained within the alignment, which is proposed to he used for evaluating accuracy of RNA alignments by Gardner et al. The SCI is close to 1 if perfectly conserved structure is identified in an alignment, whereas TC 0 if no common structure found. Unlike SP and TC, SCI is calculated without using reference alignment. We also compared the accuracies and CPU times of several methods developed by other groups, ClustalW v1.83, PRRN v.3.11, TCoffee v2.03, MUSCLE v3.52, ProbCons v.1.09 and Kalign v1.0. For PRRN, a set of gap penalties re-tuned by Yamada et al. [34] was used instead of the default values. For other methods, default parameters were used. Several different options were tested for some programs according to need. These methods can be classified into three types of multiple sequence alignment strategies. (a) Progressive methods: ClustalW, the FFT-NS-[12] options of MAFFT, a progressive option of MUSCLE and Kalign. (b) Iterative refinement methods: PRRN, the FFT-NS-i option of MAFFT and the iterative refinement option of MUSCLE. (c) Methods incorporating pairwise alignment information: TCoffee, ProbCons and [GL]-INS-i options of MAFFT. 2.3 Parameter Optimization for Protein Alignments Gap-opening penalty S P and the offset value Sa (see ref. [16] for definitions) were determined to provide the highest accuracy of the FFT-NS-2 strategy for a part of SABmark, by using golden section search (e.g. [22]). We examined five scoring matrices (BLOSUM45, 62, 80, JTT100 and,ttt200) and selected the matrix providing the highest accuracy. 2.4 Availability MAFFT was written in C, and runs on Linux, Mac OS X and the Cygwin environment on Windows. The MAFFT package is available at http: // jp/~katoh/programs/ align/mafft/. The performances were measured on a 2.8 GHz Xeon processor with 1GB of RAM running SuSE Linux 9.0. The gcc v compiler was used with the '-O3' optimization option. 3 Results and Discussion 3.1 Improvement in Accuracy by New Parameter Set for Protein Alignments Out of the five scoring matrices examined, BLOSUM62 showed the highest accuracy when FFT-NS- 2 was applied to a part of SABmark. The golden section search provided the optimal parameters, 1.53 for Sop and for Sa (see ref. [15] for details). The improvement in accuracy from the

6 MAFFT: A Multiple Sequence Alignment Program 27 Figure 2: The SP values as a function of average pairwise sequence identity for several methods. The curves were fitted to the all SP values for BAliBASE v.3 using a cubic spline. Percent identity was calculated from reference alignments. Progressive methods (a) are shown as dash-dotted lines, iterative refinement methods (b) are shown as dotted lines and methods incorporating pairwise alignment information (c) are shown in solid lines. See the footnote of Table 1 for the command line option of each method. previous parameter set (JTT200, Sop=2.40, Sa=0.06) was approximately 5 percentage points. The new parameter also set gave generally higher accuracy than the previous parameter set for different datasets (the rest part of SABmark, HOMSTRAD and BAliBASE), although of course not optimal. Thus we judged this improvement is relevant for various protein alignments and adopted this parameter as default. 3.2 Accuracy Evaluation for Protein Alignments Table 1 shows the the averaged TC and SP accuracy values for each reference set of BAliBASE v.3. In Figure 2, the SP values of major methods are plotted as functions of the sequence similarities. The improvement in the SP value from a standard method (ClustalW) and the currently most accurate method (L-INS-i) was approximately 20%, for protein alignments with percent identity of ti 20%. A comparison among the accuracies of three types algorithms [(a) progressive methods, (b) iterative methods without pairwise alignment information, (c) methods incorporating pairwise alignment information] revealed a tendency of (c) > (b) > (a). There was no statistically significant difference in accuracy among the three most accurate methods (L-INS-i, ProbCons and TCoffee), when BAliBASE v.3 was used for the evaluation. Also in the other two benchmark tests (HOMATRAD+0 and SABmark+0 in Table 2), the differences in accuracy values were small. The SABmark test is possibly biased to MAFFT, since the default parameters of MAFFT were optimized by applying FFT-NS-2 to a part of SABmark as noted above. On the other hand, the default parameters of ProbCons were trained via an unsupervised learning method on unaligned sequences of the previous version of BAliBASE [5]. The accuracies are possibly improved by closely adjusting gap penalties and other parameters as in the case of RNA alignment (see section 3.4). To examine this possibility, we applied the downhill simplex method (e.g. [221) to search the optimal set of gap penalties of ProbCons and L-INS-i for

7 28 Katoh et al.

8 MAFFT: A Multiple Sequence Alignment Program 29 each of refl 1, refl2 and ref5 of BAliBASE v.3. With the optimized penalties, the SP accuracy values of L-INS-i were 67.4%, 93.95% and 90.85% for refl 1, ref12 and ref5, respectively, whereas those of ProbCons were %, 94.27% and 90.40% for refl 1, ref12 and ref5, respectively. For both methods, the difference between accuracy values with default penalties and that with the optimized penalties were quite small (<1%). This results suggest that the default parameters of these methods were already well tuned unlike the case of RNA alignment (section 3.4). Table 2: Accuracy values for the HOMSTRAD and SABmark tests. The TC scores are shown for HOMSTRAD and the fp scores (similar to SP) are shown for the SABmark test. Main parts of this were taken from our latest paper [15]. The +n columns indicate that n homologs collected from the SwissProt database were added to each of HOMSTRAD and SABmark tests, aligned together and then removed before calculating the accuracy scores. See the footnote of Table 1 for command line option of each method. 3.3 Effect of the Number of Homologs Involved in an Alignment To examine the effect of the number of homologs involved in an alignment, we carried out the following test using HOMSTRAD and SABmark. (i) Homologs of the each entry of HOMSTRAD and SABmark were collected with BLAST [1]. Three cases of the number of collected homologues, 20, 50 and 100 (only for HOMSTRAD), were considered in this study. (ii) The input sequences were aligned, together with the collected homologs. by each of the alignment methods. (iii) The homologs collected by BLAST were excluded from the alignment. (iv) Then, the resulting alignment was compared with the corresponding reference alignment (see ref. [15] for details). As expected, the accuracy of a multiple alignment increased as the number of collected homologs involved in the alignment increased. This tendency was remarkable for MAFFT, although similar behavior was observed for most of the methods to some extent. In both the HOMSTRAD and SABmark tests (Table 2), the improvement by adding 50 or 100 homologs were about 10 percentage points at most. The improvement was comparable to that by introducing the structural information of one or two proteins [21], and rather larger than that by modification of algorithm (from FFT-NS-i to [GL]-INS-i) presented in this paper. These results suggest the importance of including a number of homologs for obtaining an accurate sequence alignment. An ability to handle a large number of sequences is therefore required for a multiple sequence alignment program. 3.4 RNA Alignments Figure 3 shows two types of accuracy values (SCI and SP) for the BRAliBASE test. From the BRAliBASE tests, a serious problem of the previous versions of MAFFT (v.5.5 and lower) has been turned out; there are too many gaps in comparison with the reference alignment and the accuracy values were remarkably low. Thus the gap penalties was tripled in the latest version (v.5.6 and

9 30 Katoh et al. Figure 3: The SCI (A) and SP (B) values for BRAliBASE test. The curves were fitted to the all accuracy values for BRAliBASE using a cubic spline. Percent identity was calculated from reference alignments. In each panel, two thin lines show the accuracy values with default gap penalties of MAFFT v.4.22 and v Bold lines show the accuracy values with the optimized gap penalties for BRAliBASE. Accuracy values of ClustalW are also shown with default (thin) and optimized (bold) gap penalties for reference. higher) compared to the old penalties, which were the same to the penalties for protein alignments. Although the new parameter set showed higher accuracy values than the previous one, they are not yet optimum for BRAliBASE. The accuracy values of other methods, such as ClustalW and PRRN, were also improved by changing gap penalties from default. As BRAliBASE is the first large benchmark database for RNA alignments and the parameters of sequence alignment programs are not yet tuned well, the accuracy values are possibly drastically improved by using this information. The improvement in accuracy by introducing the new objective function was smaller than the effect of changing gap penalties. The difference between the average accuracy value of FFT-NS-i and that of L-INS-i was only `0.5% for both SCI and SP. The possible reasons are: (i) Such a objective function is effective for protein alignments but not for RNA alignments. (ii) The new objective function possibly improves the accuracy of RNA alignments but the difference was not detected in this analysis, because BRAliBASE contains only more simple problems than BAliBASE, every entry of BRAliBASE consists of sequences of similar lengths and the number of sequences is restricted to five. 3.5 CPU Time The L-INS-i option is at least several times faster than TCoffee and ProbCons as shown in Figure 4. However, the process of L-INS-i contains all pairwise alignment, which requires CPU time proportional to N2L2, where N is the number of sequences and L is the sequence length. Because of the dependence on the number and length of the sequences, the L-INS-i options are considerably slower than the other options (FFT-NS-[i2]) of MAFFT, in which a distance matrix is calculated by a very rapid way. When the highest accuracy is not required, the FFT-NS-i or FFT-NS-2 option may be still useful, considering the tradeoff between computational time and accuracy. 3.6 Conclusion The L-INS-i option of MAFFT is one of the most accurate methods and, in particular, suitable to alignments having medium number of sequences (20-200). The CPU time required for L-INS-i is relatively small. This method has a drawback that it requires all-pairwise alignments, which consumes a large amount of CPU time. A possible way to further improve the performance is the re-examination

10 MAFFT: A Multiple Sequence Alignment Program 31 Figure 4: The CPU times required for various sizes of alignments by various methods. Dash-dotted lines represent progressive methods (a), dotted lines represent iterative refinement methods (b) and solid lines represent methods incorporating pairwise alignment information (c). A and B, The number of input sequences (N) versus CPU time. Average sequence length was fixed at 300 aa. C and D, Average length (L) of input sequences versus CPU time. The number of sequence was fixed at 40. A and C simulates conserved alignments (average distance among input sequences was set at 100 PAM), whereas B and D simulates diverged alignments (average distance among input sequences was set at 250 PAM). of the objective function. More simple and relevant objective function may be useful for exploring the relationship between sequence and protein/rna structure, as well as for reducing the CPU time. Acknowledgments We thank J. D. Thompson, P. P. Gardner and T. Lassmann for discussion and permitting us to use unpublished data. References [1] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25: , [2] Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L., Studholme, D. J., Yeats, C., and Eddy, S. R., The pfam protein families database, Nucleic Acids Res., 32: D138-D141, [3] Berger, M. P. and Munson, P. J., A novel randomized iterative strategy for aligning multiple protein sequences, Comput. Appi. Biosci., 7: , 1991.

11 32 Katoh et al. [4] Chandonia, J. M., Hon, G., Walker, N. S., Lo Conte, L., Kochi, P., Levitt, M., and Brenner, S. E., The ASTRAL compendium in 2004, Nucleic Acids Res., 32: D189 -D192, [5] Do, C. B., Mahabhashyam, M. S., Brudno, M., and Batzoglou, S., ProbCons: Probabilistic consistency-based multiple sequence alignment, Genorne Res., 15: , [6] Edgar, R. C., MUSCLE: A multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics, 5: 113, [7] Edgar, R. C., MUSCLE: Multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res., 32: , [8] Feng, D. F. and Doolittle, R. F., Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol., 25: , [9] Gardner, P. P., Wilm, A., and Washietl, S., A benchmark of multiple sequence alignment programs upon structural RNAs, Nucleic Acids Res., 33(8): , [10] Gotoh, O., A weighting system and algorithm for aligning many phylogenetically related sequences, Comput. Appl. Biosci., 11: , [11] Gotoh, O., Optimal alignment between groups of sequences and its application to multiple sequence alignment, Comput. Appl. Biosci., 9: , [12] Gotoh, O., Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, J. Mol. Biol., 264: , [13] Grasso, C. and Lee, C., Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems, Bioinformatics, 20: , [14] Jones, D. T., Taylor, W. R., and Thornton, J. M. The rapid generation of mutation data matrices from protein sequences, Comput. Appl. Biosci., 8: , [15] Katoh, K., Kuma, K., Toh, H., and Miyata, T., MAFFT version 5: Improvement in accuracy of multiple sequence alignment, Nucleic Acids Res., 33: , [16] Katoh, K., Misawa, K., Kuma, K., and Miyata, T., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res., 30: , [17] Morgenstern, B., Dress, A., and Werner, T., Multiple DNA and protein sequence alignment based on segment-to-segment comparison, Proc. Natl. Acad. Sci. USA, 93: , [18] Needleman, S. B. and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol., 48: , [19] Notredame, C., Higgins, D. G., and Heringa, J., T-Coffee: A novel method for fast and accurate multiple sequence alignment, J. Mol. Biol., 302: , [20] Notredame, C., Holm, L., and Higgins, D. G., COFFEE: An objective function for multiple sequence alignments, Bioinformatics, 14(5): , [21] O'Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G., and Notredame, C., 3DCoffee: Combining protein sequences and structures within multiple sequence alignments, J. Mol. Biol., 340: , 2004.

12 MAFFT: A Multiple Sequence Alignment Program 33 [22] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., Numerical Recipes in C: the Art of Scientific Computing, Cambridge University Press, Cambridge, 2nd edition, [23] Rawlings, N. D., Tolle, D. P., and Barrett, A. J., MEROPS: The peptidase database, Nucleic Acids Res., 32: D160 D164, [24] Simossis, V. A., Kleinjung, J., and Heringa, J., Homology-extended sequence alignment, Nucleic Acids Res., 33: , [25] Smith, T. F. and Waterman, M. S., Identification of common molecular subsequences, J. Mol. Biol., 147: , [26] Sokal, R. R. and Michener, C. D., A statistical method for evaluating systematic relationships, University of Kansas Scientific Bulletin, 28: , [27] Stebbings, L. A. and Mizuguchi, K., HOMSTRAD: Recent developments of the homologous protein structure alignment database, Nucleic Acids Res., 32: D203-D207, [28] Subramanian, A. R., Weyer-Menkhoff, J., Kaufmann, M., and Morgenstern, B., DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment, BMC Bioinformatics, 6: 66, [29] Tateno, Y., Ikeo, K., Imanishi, T., Watanabe, H., Endo, T., Yamaguchi, Y., Suzuki, Y., Takahashi, K., Tsunoyama, K., Kawai, M., Kawanishi, Y., Naitou, K., and Gojobori, T., Evolutionary motif and its biological and structural significance, J. Mol. Evol., 44: S38-S43, [30] Thompson, J. D., Higgins, D. G., and Gibson, T. J., CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res., 22: , [31] Thompson, J. D., Plewniak, F., and Poch, O., BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics, 15: 87-88, [32] Van Walle, I., Lasters, I., and Wyns, L., Align-m - a new algorithm for multiple alignment of highly divergent sequences, Bioinformatics, 20: , [33] Washietl, S., Hofacker, I. L., and Stadler, P. F. Fast and reliable prediction of noncoding RNAs, Proc. Natl. Acad. Sci. USA, 102: , [34] Yamada, S., Gotoli, O., and Yamana, H., Extension of PRRN: Implementation of a doubly nested randomized iterative refinement strategy under a piecewise linear gap cost, Genome Informatics, 15: P082, 2004.

Probalign: Multiple sequence alignment using partition function posterior probabilities

Probalign: Multiple sequence alignment using partition function posterior probabilities Sequence Analysis Probalign: Multiple sequence alignment using partition function posterior probabilities Usman Roshan 1* and Dennis R. Livesay 2 1 Department of Computer Science, New Jersey Institute

More information

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences

Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Yue Lu 1 and Sing-Hoi Sze 1,2 1 Department of Biochemistry & Biophysics 2 Department of Computer Science, Texas A&M University,

More information

Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs

Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs Shirley Sutton, Biochemistry 218 Final Project, March 14, 2008 Introduction For both the computational biologist and the research

More information

Multiple sequence alignment

Multiple sequence alignment Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Chapter 11 Multiple sequence alignment

Chapter 11 Multiple sequence alignment Chapter 11 Multiple sequence alignment Burkhard Morgenstern 1. INTRODUCTION Sequence alignment is of crucial importance for all aspects of biological sequence analysis. Virtually all methods of nucleic

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Citation BMC Bioinformatics, 2016, v. 17 n. suppl. 8, p. 285:

Citation BMC Bioinformatics, 2016, v. 17 n. suppl. 8, p. 285: Title PnpProbs: A better multiple sequence alignment tool by better handling of guide trees Author(s) YE, Y; Lam, TW; Ting, HF Citation BMC Bioinformatics, 2016, v. 17 n. suppl. 8, p. 285:633-643 Issued

More information

Optimization of a New Score Function for the Detection of Remote Homologs

Optimization of a New Score Function for the Detection of Remote Homologs PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

BIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm017

BIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm017 Vol. 23 no. 7 2007, pages 802 808 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm017 Sequence analysis PROMALS: towards accurate multiple sequence alignments of distantly related proteins

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Recent developments in the MAFFT multiple sequence alignment program

Recent developments in the MAFFT multiple sequence alignment program A preprint of Katoh and Toh (2008, Briefings in Bioinformatics 9:286-298) Recent developments in the MAFFT multiple sequence alignment program Kazutaka Katoh 1 and Hiroyuki Toh 2 1 Digital Medicine Initiative,

More information

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists A greedy, graph-based algorithm for the alignment of multiple homologous gene lists Jan Fostier, Sebastian Proost, Bart Dhoedt, Yvan Saeys, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele Bioinformatics

More information

Upcoming challenges for multiple sequence alignment methods in the high-throughput era

Upcoming challenges for multiple sequence alignment methods in the high-throughput era BIOINFORMATICS REVIEW Vol. 25 no. 19 2009, pages 2455 2465 doi:10.1093/bioinformatics/btp452 Sequence analysis Upcoming challenges for multiple sequence alignment methods in the high-throughput era Carsten

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

More information

Protein sequence alignment with family-specific amino acid similarity matrices

Protein sequence alignment with family-specific amino acid similarity matrices TECHNICAL NOTE Open Access Protein sequence alignment with family-specific amino acid similarity matrices Igor B Kuznetsov Abstract Background: Alignment of amino acid sequences by means of dynamic programming

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

The PRALINE online server: optimising progressive multiple alignment on the web

The PRALINE online server: optimising progressive multiple alignment on the web Computational Biology and Chemistry 27 (2003) 511 519 Software Note The PRALINE online server: optimising progressive multiple alignment on the web V.A. Simossis a,b, J. Heringa a, a Bioinformatics Unit,

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Research Proposal Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Name: Minjal Pancholi Howard University Washington, DC. June 19, 2009 Research

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Simultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models. R.C. Edgar, K. Sjölander

Simultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models. R.C. Edgar, K. Sjölander Simultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models R.C. Edgar, K. Sjölander Pacific Symposium on Biocomputing 8:180-191(2003) SIMULTANEOUS SEQUENCE ALIGNMENT AND TREE CONSTRUCTION

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT 5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:

More information

PROMALS3D web server for accurate multiple protein sequence and structure alignments

PROMALS3D web server for accurate multiple protein sequence and structure alignments W30 W34 Nucleic Acids Research, 2008, Vol. 36, Web Server issue Published online 24 May 2008 doi:10.1093/nar/gkn322 PROMALS3D web server for accurate multiple protein sequence and structure alignments

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir Sequence Bioinformatics Multiple Sequence Alignment Waqas Nasir 2010-11-12 Multiple Sequence Alignment One amino acid plays coy; a pair of homologous sequences whisper; many aligned sequences shout out

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Some Problems from Enzyme Families

Some Problems from Enzyme Families Some Problems from Enzyme Families Greg Butler Department of Computer Science Concordia University, Montreal www.cs.concordia.ca/~faculty/gregb gregb@cs.concordia.ca Abstract I will discuss some problems

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Introduction to Evolutionary Concepts

Introduction to Evolutionary Concepts Introduction to Evolutionary Concepts and VMD/MultiSeq - Part I Zaida (Zan) Luthey-Schulten Dept. Chemistry, Beckman Institute, Biophysics, Institute of Genomics Biology, & Physics NIH Workshop 2009 VMD/MultiSeq

More information

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment Stefano Iantorno 1,2,*, Kevin Gori 3,*, Nick Goldman 3, Manuel Gil 4, Christophe Dessimoz 3, 1 Wellcome Trust Sanger

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 24 no. 19 2008, pages 2165 2171 doi:10.1093/bioinformatics/btn414 Sequence analysis Model-based prediction of sequence alignment quality Virpi Ahola 1,2,, Tero Aittokallio

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc Supplemental Data. Perea-Resa et al. Plant Cell. (22)..5/tpc.2.3697 Sm Sm2 Supplemental Figure. Sequence alignment of Arabidopsis LSM proteins. Alignment of the eleven Arabidopsis LSM proteins. Sm and

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information

Large Grain Size Stochastic Optimization Alignment

Large Grain Size Stochastic Optimization Alignment Brigham Young University BYU ScholarsArchive All Faculty Publications 2006-10-01 Large Grain Size Stochastic Optimization Alignment Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu

More information

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17: Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:

More information

Cédric Notredame and Chantal Abergel

Cédric Notredame and Chantal Abergel 8VLQJPXOWLSOHDOLJQPHQWPHWKRGVWRDVVHVVWKHTXDOLW\RI JHQRPLFGDWDDQDO\VLV Cédric Notredame and Chantal Abergel Information Génétique et Structurale UMR 1889 31 Chemin Joseph Aiguier 13 006 Marseille Email:

More information

Similarity searching summary (2)

Similarity searching summary (2) Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

More information

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8. Multiple sequence alignment Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

PROMALS3D: a tool for multiple protein sequence and structure alignments

PROMALS3D: a tool for multiple protein sequence and structure alignments Nucleic Acids Research Advance Access published February 20, 2008 Nucleic Acids Research, 2008, 1 6 doi:10.1093/nar/gkn072 PROMALS3D: a tool for multiple protein sequence and structure alignments Jimin

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Multiple Genome Alignment by Clustering Pairwise Matches

Multiple Genome Alignment by Clustering Pairwise Matches Multiple Genome Alignment by Clustering Pairwise Matches Jeong-Hyeon Choi 1,3, Kwangmin Choi 1, Hwan-Gue Cho 3, and Sun Kim 1,2 1 School of Informatics, Indiana University, IN 47408, USA, {jeochoi,kwchoi,sunkim}@bio.informatics.indiana.edu

More information

Multiple Sequence Alignment

Multiple Sequence Alignment Multiple Sequence Alignment Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences.! What about more than two? And what for?! A faint similarity between two

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

BIOINFORMATICS. RASCAL: rapid scanning and correction of multiple sequence alignments. J. D. Thompson, J. C. Thierry and O.

BIOINFORMATICS. RASCAL: rapid scanning and correction of multiple sequence alignments. J. D. Thompson, J. C. Thierry and O. BIOINFORMATICS Vol. 19 no. 9 2003, pages 1155 1161 DOI: 10.1093/bioinformatics/btg133 RASCAL: rapid scanning and correction of multiple sequence alignments J. D. Thompson, J. C. Thierry and O. Poch Laboratoire

More information

Moreover, the circular logic

Moreover, the circular logic Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT

More information

Truncated Profile Hidden Markov Models

Truncated Profile Hidden Markov Models Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov

More information

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity 1 frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity HUZEFA RANGWALA and GEORGE KARYPIS Department of Computer Science and Engineering

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

Segment-based scores for pairwise and multiple sequence alignments

Segment-based scores for pairwise and multiple sequence alignments From: ISMB-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Segment-based scores for pairwise and multiple sequence alignments Burkhard Morgenstern 1,*, William R. Atchley 2, Klaus

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

More information

Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs

Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs PROTEINS: Structure, Function, and Genetics 52:446 453 (2003) Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs Bin Qian 1 and Richard A. Goldstein 1,2 * 1 Biophysics Research Division, University

More information

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) 2 19 2015 Scribe: John Ekins Multiple Sequence Alignment Given N sequences x 1, x 2,, x N : Insert gaps in each of the sequences

More information

Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix

Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix Item Type Article Authors Ndhlovu, Andrew; Hazelhurst, Scott; Durand, Pierre M. Citation Ndhlovu et al.

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information