Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT
|
|
- Winifred Jacobs
- 5 years ago
- Views:
Transcription
1 22 Genome Informatics 16(1): (2005) Improvement in the Accuracy of Multiple Sequence Alignment Program MAFFT Kazutaka Katohl Takashi Miyata2" Kei-ichi Kuma1 Hiroyuki Toh5 1 Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto , Japan 2 Biohistory Research Hall, Takatsuki, Osaka , Japan 3 Department of Electrical Engineering and Bioscience, Science and Engineering, 4 Waseda University, Tokyo , Japan Department of Biophysics, Graduate School of Science, Kyoto University, Kyoto , Japan Division of Bioinformatics, Research Center for Prevention of Infectious Diseases, Medical Institute of Bioregulation, Kyushu University, Fukuoka ,Japan Abstract In 2002, we developed and released a rapid multiple sequence alignment program MAFFT that was designed to handle a huge (up to `5,000 sequences) and long data (,-2,000 as or `5,000 nt) in a reasonable time on a standard desktop PC. As for the accuracy, however, the previous versions (v.4 and lower) of MAFFT were outperformed by ProbCons and TCoffee v.2, both of which were released in 2004, in several benchmark tests. Here we report a recent extension of MAFFT that aims to improve the accuracy with as little cost of calculation time as possible. The extended version of MAFFT (v.5) has new iterative refinement options, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i in this report). These options use a new objective function combining the weighted sum-of-pairs (WSP) score and a score similar to COFFEE derived from all pairwise alignments. We discuss the improvement in accuracy brought by this extension, mainly using two benchmark tests released very recently, BAliBASE v.3 (for protein alignments) and BR,AliBASE (for RNA alignments). According to BAliBASE v.3, the overall average accuracy of L-INS-i was higher than those of other methods successively released in 2004, although the difference among the most accurate methods (ProbCons, TCoffee v.2 and new options of MAFFT) was small. The advantage in accuracy of [GL]-INS-i became greater for the alignments consisting of sequences. By utilizing this feature of MAFFT, we also examined another possible approach to improve the accuracy by incorporating homolog information collected from database. The [GL]-INS-i options are applicable to aligning up to r\-, 200 sequences, although not applicable to thousands of sequences because of time and space complexities. Keywords: multiple sequence alignment, dynamic programming, objective function 1 Introduction Multiple alignment of protein or DNA sequences is a classical problem in the area of bioinformatics and an important tool required in a wide range of molecular biological analyses such as detecting conserved residues and inferring evolutionary history of genes and organisms. As a large amount of sequence information is being determined and accumulated by genome and cdna projects of various organisms, the importance of a rapid and accurate algorithm for multiple sequence alignment has been recognized. Under such a situation, several groups are now actively investigating the algorithms
2 MAFFT: A Multiple Sequence Alignment Program 23 Figure 1: Outlined process of MAFFT without pairwise alignment (A) and with pairwise alignments (B). of multiple sequence alignment and a number of new programs have been released since last year [5-7, 13, 15, 28, 32]. ClustalW [30] released in 1994 is a standard multiple sequence alignment software. It employsthe progressive method [8], which is simple but has room for improvement in accuracy and speed. Many of recent methods have tried to improve the accuracy. Gotoh [12]found that the alignment accuracy is significantlyimproved by an iterative refinement method using the weighted sum-of-pairs (WSP) score as an objective function. He developed a program, PRRN, that adopts the method. DIALIGN [17]and TCoffee [19] achieved high accuracies by combininglocal and global alignment features. Although the ability to combine different types of alignments derived from heterogeneoussources, such as structural alignment, is important [21],TCoffee requires a large CPU time proportional to N3. Thus it is hard to apply TCoffee to a large alignment consisting of dozens of sequences. The accuracy of TCoffee has recently been further improved in ver. 2 in comparison with ver. 1. Do et al. [5] proposed a new method ProbCons, whose accuracy is comparable to or slightly higher than TCoffee v.2 as shown in Table 1, and rather faster than TCoffee. Katoh et al. [16]developed MAFFT, in 2002, that aimed to reduce the CPU time of the progressive method. MAFFT also has an iterative refinement option that is a simplified version of PRRN. Edgar [6, 7] also implemented progressiveand iterative alignment strategies in MUSCLE in The algorithms of major options of MUSCLE seem similar to the NW-NS-[i2]options of MAFFT explained below. To reduce the CPU time, UPGMA [26]is more efficientlycoded in MUSCLE.Lassmann and Sonnhammer (manuscript submitted) released Kalign, in Mar. 2005, which adopts a progressive approach to achieve the reduction of CPU time. Kalign has a merit that the length dependency of the CPU time is near to linear. The previous version of MAFFT is used in part by several projects such as PFAM [2],ASTRAL [4] and MEROPS [23]. The outline of procedure of the previous version of MAFFT is explained below (see Figure 1A also). An initial alignment is constructed by the progressive method [8,30] and then improved by the iterative refinement method [3,11]. A user can select an appropriate strategy from the fastest one (FFT-NS-1) to the most accurate one (FFT-NS-i) in v Progressive alignment (1). A rough distance between every pair of input sequences is rapidly calculated based on the number of 6-tuples shared by the two sequences [7, 14,16]. A guide tree is constructed from the distance matrix using UPGMA [26] with modified linkage. Input sequences are progressivelyaligned [8,30] followingthe branching order of the guide tree. This procedure is referred to as FFT-NS Progressive alignment (2). The initial distance matrix is less reliable than that based on all pairwise alignments. We can obtain a more reliable distance matrix by using the FFT-NS-1 alignment [7,16,29]. Progressive alignment is re-performed based on the new guide tree calculated from the new distance matrix. This method is referred to as FFT-NS-2.
3 24 Katoh et al. 3. Iterative refinement. The FFT-NS-2 alignment is further improved by the iterative refinement method [3, 11] that maximizes the weighted sum of pairs (WSP) score proposed by Gotoh [10], using an approximate group-to-group alignment algorithm [16]. This procedure is referred to as FFT-NS-i. For the progressive alignment processes, a fast Fourier transform (FFT) approximation [16] is used in the FFT-NS-2, FFT-NS-1 and FFT-NS-i options (collectively denoted as FFT-NS-[12i] hereafter). When the input sequences are highly conserved, these options require CPU times effectively proportional to average sequence length L for alignments of a single gene. The slopes of the FFT-NS-[2i] lines are indeed close to 1 in Figure 4C. MAFFT also offers three more options NW-NS-[12i] with full dynamic programming (DP) [18] for the same part. In contrast to the case of FFT-NS-[12i], the time complexity of full DP options is O(L2), regardless of the similarity among input sequences. In addition to these options, two additional strategies, G-INS-i and L-INS-i (collectively denoted as [GL]-INS-i hereafter), have been introduced in v.5. The [GL]-INS-i strategies use a new objective function combining the WSP [12] score and a score similar to COFFEE [20] derived from all pairwise alignments in order to improve the alignment accuracy. Iterative refinement with the objective function is performed in reasonable time, as [GL] INS-i were designed to have low computational complexity. See Methods for details of the algorithms and the objective function. Recently, reference databases of protein alignments based on 3D structural superposition, such as HOMSTRAD [27], SABmark [32] and PREFAB [7], have been independently developed. They are valuable resources to find good parameters for a sequence alignment program as well as to evaluate the performance of a program. After the publication of our latest paper describing MAFFT v.5 [15] in Jan. 2005, two benchmark databases have already been released. (i) The most widely used benchmark test, BAliBASE [31], has been substantially updated to v.3 (Thompson et al., manuscript submitted) and now contains alignments that are difficult to make accurate residue-to-residue correspondences with only sequence information, due to high sequence divergence, long indels, etc. That is, the alignments are considered to represent difficult cases we may encounter when we deal with real problems. (ii) BRAliBASE, the first reference database for RNA alignment, has been recently released [9]. Thus, we compared the performances of the currently available sequence alignment programs including the new version of MAFFT, using the new reference alignment databases. The parameters of MAFFT and other methods are not closely optimized for these new benchmark datasets. It was suggested that the accuracy of a multiple alignment of distantly related sequences can be improved if they are aligned together with a number of their homologs, because the information from many sequences is expected to reduce the 'noise' [12]. This point has recently been investigated by Katoh et al. [15] and Simossis et al. [24]. We discuss the effect of number of homologs involved in an alignment in this report also. 2 Method 2.1 Incorporating Pairwise Alignment Information into the Objective Function In the two new iterative refinement options, [GL]-INS-i, the objective function is defined as the summation of the WSP score and a score derived from all pairwise alignments. Both global and local alignment algorithms were tested as all pairwise alignment; G-INS-i uses global alignment [18] with an FFT approximation [16], whereas L-INS-i uses local alignment algorithm [25]. The procedures of the algorithms of [GL]-INS-i are as follows (see Figure 1B also). 1. An initial distance matrix is constructed from the pairwise scores, instead of shared 6-tuples, using the equation shown in ref. [16] and a guide tree is built using UPGMA with modified linkage [15]. Unlike FFT-NS-[2i] explained above, the guide tree is not reconstructed in the
4 MAFFT: A Multiple Sequence Alignment Program 25 [GL]-INS-i strategies, because such a reconstruction did not show significant improvement of the alignment accuracy in our preliminary study (data not shown). 2. Each pairwise alignment is divided into gap-free segments and number n is assigned to each segment. The information of these segments is stored in a set of arrays, score [S(s,t,n)], which represents the alignment score of the nth gap-free segment between sequences s and t, length [L(s, t, n)] of the aligned segment (s, t, n), position [P(s, t, n)] for every pair of sequences s and t, and the importance value [E(s, t, n)] that is calculated, as described below, from the score of the segment and how frequently the residues are involved in gap-free segments. We denote (s, t, p, q) E P(s, t, n) if the pth site of sequence s corresponds to the qth site of segment t in aligned segment (s, t, n). 3. The frequency value f (s, p), which represents the frequency of the pth site of sequence s involved in gap-free segments, is calculated as where wt is weighting factor (see [30] for definition) for sequence t. The importance value E(s, t, n) for aligned segment is calculated as The importance matrix I/(s, t, p, q) between the pth site of sequence s and the qth site of sequence t is defined as 4. An alignment of a subset of given sequences, which is generated during the procedures of progressive and iterative refinement methods, is referred to as 'group.' To align groups i and j, matrix H(groupi, groupj, p, q) is constructed as (1) H(groupi, groupj, p, q) where A(s,p) is the pth amino acid residue on sequence s. Mab is a score between a pair of amino acids a and b. The score matrices examined in this study are described in the 'Parameter optimization' section. wst is a weighting factor between sequences s and t. A weighting scheme proposed by Thompson et al. [30] is used in progressive alignment stage whereas the iterative refinement stage adopts the weighting scheme by Gotoh [10]. WI, a weighting factor for the importance value, was set at 2.7 in this study. The alignment between two groups is computed by applying the dynamic programming algorithm [18] to matrix H(,,,) at each step of progressive alignment. The alignment produced by this procedure is referred to as [GL]-INS-1. The performances of [GL]-INS-1 are not shown in this report. 5. An alignment by [GL]-INS-1 is improved by the iterative refinement method ([GL]-INS-i), which optimizes the alignment with a objective function defined as the summation of the WSP score and importance value given by equation 1.
5 26 Katoh et al. 2.2 Performance Evaluation To evaluate the accuracy of the above methods, we used three datasets, BAliBASE v.3 (Thompson et al., manuscript submitted), HOMSTRAD [27] and SABmark [32], for protein alignment. BAliBASE is diveded into six different reference sets. Each set consists of: Ref11, small (-6) number of equidistant sequences with low similarity (%id<20); Refl2, small ( `6) number of equi-distant sequences with medium similarity (%id of 20-40); Ref3, subgroups with <25% identity between the subgroups: Ref4, sequences that have long terminal extensions; Ref5, sequences with internal insertions. From HOMSTRAD, highly diverged alignments was extracted and used in our tests. From SABmark, a part of the TWI subset was used in this report (see refs [15, 21] for the details). The accuracy of alignment was evaluated using either or both of the TC and SP values [31]. TC is the fraction of columns aligned identically to the reference alignment, whereas SP is the total number of correctly aligned sites over all sequence pairs divided by the number of all residue pairs in the reference alignment (see [31] for definition). Obviously, the TC and SP values are highly correlated to each other. For RNA alignment, BRAliBASE [9] was used. The accuracy of alignments were evaluated with SP and the structure conservation index (SCI) using the RNAz software [33]. SCI is a measure of conserved secondary-structural information contained within the alignment, which is proposed to he used for evaluating accuracy of RNA alignments by Gardner et al. The SCI is close to 1 if perfectly conserved structure is identified in an alignment, whereas TC 0 if no common structure found. Unlike SP and TC, SCI is calculated without using reference alignment. We also compared the accuracies and CPU times of several methods developed by other groups, ClustalW v1.83, PRRN v.3.11, TCoffee v2.03, MUSCLE v3.52, ProbCons v.1.09 and Kalign v1.0. For PRRN, a set of gap penalties re-tuned by Yamada et al. [34] was used instead of the default values. For other methods, default parameters were used. Several different options were tested for some programs according to need. These methods can be classified into three types of multiple sequence alignment strategies. (a) Progressive methods: ClustalW, the FFT-NS-[12] options of MAFFT, a progressive option of MUSCLE and Kalign. (b) Iterative refinement methods: PRRN, the FFT-NS-i option of MAFFT and the iterative refinement option of MUSCLE. (c) Methods incorporating pairwise alignment information: TCoffee, ProbCons and [GL]-INS-i options of MAFFT. 2.3 Parameter Optimization for Protein Alignments Gap-opening penalty S P and the offset value Sa (see ref. [16] for definitions) were determined to provide the highest accuracy of the FFT-NS-2 strategy for a part of SABmark, by using golden section search (e.g. [22]). We examined five scoring matrices (BLOSUM45, 62, 80, JTT100 and,ttt200) and selected the matrix providing the highest accuracy. 2.4 Availability MAFFT was written in C, and runs on Linux, Mac OS X and the Cygwin environment on Windows. The MAFFT package is available at http: // jp/~katoh/programs/ align/mafft/. The performances were measured on a 2.8 GHz Xeon processor with 1GB of RAM running SuSE Linux 9.0. The gcc v compiler was used with the '-O3' optimization option. 3 Results and Discussion 3.1 Improvement in Accuracy by New Parameter Set for Protein Alignments Out of the five scoring matrices examined, BLOSUM62 showed the highest accuracy when FFT-NS- 2 was applied to a part of SABmark. The golden section search provided the optimal parameters, 1.53 for Sop and for Sa (see ref. [15] for details). The improvement in accuracy from the
6 MAFFT: A Multiple Sequence Alignment Program 27 Figure 2: The SP values as a function of average pairwise sequence identity for several methods. The curves were fitted to the all SP values for BAliBASE v.3 using a cubic spline. Percent identity was calculated from reference alignments. Progressive methods (a) are shown as dash-dotted lines, iterative refinement methods (b) are shown as dotted lines and methods incorporating pairwise alignment information (c) are shown in solid lines. See the footnote of Table 1 for the command line option of each method. previous parameter set (JTT200, Sop=2.40, Sa=0.06) was approximately 5 percentage points. The new parameter also set gave generally higher accuracy than the previous parameter set for different datasets (the rest part of SABmark, HOMSTRAD and BAliBASE), although of course not optimal. Thus we judged this improvement is relevant for various protein alignments and adopted this parameter as default. 3.2 Accuracy Evaluation for Protein Alignments Table 1 shows the the averaged TC and SP accuracy values for each reference set of BAliBASE v.3. In Figure 2, the SP values of major methods are plotted as functions of the sequence similarities. The improvement in the SP value from a standard method (ClustalW) and the currently most accurate method (L-INS-i) was approximately 20%, for protein alignments with percent identity of ti 20%. A comparison among the accuracies of three types algorithms [(a) progressive methods, (b) iterative methods without pairwise alignment information, (c) methods incorporating pairwise alignment information] revealed a tendency of (c) > (b) > (a). There was no statistically significant difference in accuracy among the three most accurate methods (L-INS-i, ProbCons and TCoffee), when BAliBASE v.3 was used for the evaluation. Also in the other two benchmark tests (HOMATRAD+0 and SABmark+0 in Table 2), the differences in accuracy values were small. The SABmark test is possibly biased to MAFFT, since the default parameters of MAFFT were optimized by applying FFT-NS-2 to a part of SABmark as noted above. On the other hand, the default parameters of ProbCons were trained via an unsupervised learning method on unaligned sequences of the previous version of BAliBASE [5]. The accuracies are possibly improved by closely adjusting gap penalties and other parameters as in the case of RNA alignment (see section 3.4). To examine this possibility, we applied the downhill simplex method (e.g. [221) to search the optimal set of gap penalties of ProbCons and L-INS-i for
7 28 Katoh et al.
8 MAFFT: A Multiple Sequence Alignment Program 29 each of refl 1, refl2 and ref5 of BAliBASE v.3. With the optimized penalties, the SP accuracy values of L-INS-i were 67.4%, 93.95% and 90.85% for refl 1, ref12 and ref5, respectively, whereas those of ProbCons were %, 94.27% and 90.40% for refl 1, ref12 and ref5, respectively. For both methods, the difference between accuracy values with default penalties and that with the optimized penalties were quite small (<1%). This results suggest that the default parameters of these methods were already well tuned unlike the case of RNA alignment (section 3.4). Table 2: Accuracy values for the HOMSTRAD and SABmark tests. The TC scores are shown for HOMSTRAD and the fp scores (similar to SP) are shown for the SABmark test. Main parts of this were taken from our latest paper [15]. The +n columns indicate that n homologs collected from the SwissProt database were added to each of HOMSTRAD and SABmark tests, aligned together and then removed before calculating the accuracy scores. See the footnote of Table 1 for command line option of each method. 3.3 Effect of the Number of Homologs Involved in an Alignment To examine the effect of the number of homologs involved in an alignment, we carried out the following test using HOMSTRAD and SABmark. (i) Homologs of the each entry of HOMSTRAD and SABmark were collected with BLAST [1]. Three cases of the number of collected homologues, 20, 50 and 100 (only for HOMSTRAD), were considered in this study. (ii) The input sequences were aligned, together with the collected homologs. by each of the alignment methods. (iii) The homologs collected by BLAST were excluded from the alignment. (iv) Then, the resulting alignment was compared with the corresponding reference alignment (see ref. [15] for details). As expected, the accuracy of a multiple alignment increased as the number of collected homologs involved in the alignment increased. This tendency was remarkable for MAFFT, although similar behavior was observed for most of the methods to some extent. In both the HOMSTRAD and SABmark tests (Table 2), the improvement by adding 50 or 100 homologs were about 10 percentage points at most. The improvement was comparable to that by introducing the structural information of one or two proteins [21], and rather larger than that by modification of algorithm (from FFT-NS-i to [GL]-INS-i) presented in this paper. These results suggest the importance of including a number of homologs for obtaining an accurate sequence alignment. An ability to handle a large number of sequences is therefore required for a multiple sequence alignment program. 3.4 RNA Alignments Figure 3 shows two types of accuracy values (SCI and SP) for the BRAliBASE test. From the BRAliBASE tests, a serious problem of the previous versions of MAFFT (v.5.5 and lower) has been turned out; there are too many gaps in comparison with the reference alignment and the accuracy values were remarkably low. Thus the gap penalties was tripled in the latest version (v.5.6 and
9 30 Katoh et al. Figure 3: The SCI (A) and SP (B) values for BRAliBASE test. The curves were fitted to the all accuracy values for BRAliBASE using a cubic spline. Percent identity was calculated from reference alignments. In each panel, two thin lines show the accuracy values with default gap penalties of MAFFT v.4.22 and v Bold lines show the accuracy values with the optimized gap penalties for BRAliBASE. Accuracy values of ClustalW are also shown with default (thin) and optimized (bold) gap penalties for reference. higher) compared to the old penalties, which were the same to the penalties for protein alignments. Although the new parameter set showed higher accuracy values than the previous one, they are not yet optimum for BRAliBASE. The accuracy values of other methods, such as ClustalW and PRRN, were also improved by changing gap penalties from default. As BRAliBASE is the first large benchmark database for RNA alignments and the parameters of sequence alignment programs are not yet tuned well, the accuracy values are possibly drastically improved by using this information. The improvement in accuracy by introducing the new objective function was smaller than the effect of changing gap penalties. The difference between the average accuracy value of FFT-NS-i and that of L-INS-i was only `0.5% for both SCI and SP. The possible reasons are: (i) Such a objective function is effective for protein alignments but not for RNA alignments. (ii) The new objective function possibly improves the accuracy of RNA alignments but the difference was not detected in this analysis, because BRAliBASE contains only more simple problems than BAliBASE, every entry of BRAliBASE consists of sequences of similar lengths and the number of sequences is restricted to five. 3.5 CPU Time The L-INS-i option is at least several times faster than TCoffee and ProbCons as shown in Figure 4. However, the process of L-INS-i contains all pairwise alignment, which requires CPU time proportional to N2L2, where N is the number of sequences and L is the sequence length. Because of the dependence on the number and length of the sequences, the L-INS-i options are considerably slower than the other options (FFT-NS-[i2]) of MAFFT, in which a distance matrix is calculated by a very rapid way. When the highest accuracy is not required, the FFT-NS-i or FFT-NS-2 option may be still useful, considering the tradeoff between computational time and accuracy. 3.6 Conclusion The L-INS-i option of MAFFT is one of the most accurate methods and, in particular, suitable to alignments having medium number of sequences (20-200). The CPU time required for L-INS-i is relatively small. This method has a drawback that it requires all-pairwise alignments, which consumes a large amount of CPU time. A possible way to further improve the performance is the re-examination
10 MAFFT: A Multiple Sequence Alignment Program 31 Figure 4: The CPU times required for various sizes of alignments by various methods. Dash-dotted lines represent progressive methods (a), dotted lines represent iterative refinement methods (b) and solid lines represent methods incorporating pairwise alignment information (c). A and B, The number of input sequences (N) versus CPU time. Average sequence length was fixed at 300 aa. C and D, Average length (L) of input sequences versus CPU time. The number of sequence was fixed at 40. A and C simulates conserved alignments (average distance among input sequences was set at 100 PAM), whereas B and D simulates diverged alignments (average distance among input sequences was set at 250 PAM). of the objective function. More simple and relevant objective function may be useful for exploring the relationship between sequence and protein/rna structure, as well as for reducing the CPU time. Acknowledgments We thank J. D. Thompson, P. P. Gardner and T. Lassmann for discussion and permitting us to use unpublished data. References [1] Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J., Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res., 25: , [2] Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L., Studholme, D. J., Yeats, C., and Eddy, S. R., The pfam protein families database, Nucleic Acids Res., 32: D138-D141, [3] Berger, M. P. and Munson, P. J., A novel randomized iterative strategy for aligning multiple protein sequences, Comput. Appi. Biosci., 7: , 1991.
11 32 Katoh et al. [4] Chandonia, J. M., Hon, G., Walker, N. S., Lo Conte, L., Kochi, P., Levitt, M., and Brenner, S. E., The ASTRAL compendium in 2004, Nucleic Acids Res., 32: D189 -D192, [5] Do, C. B., Mahabhashyam, M. S., Brudno, M., and Batzoglou, S., ProbCons: Probabilistic consistency-based multiple sequence alignment, Genorne Res., 15: , [6] Edgar, R. C., MUSCLE: A multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics, 5: 113, [7] Edgar, R. C., MUSCLE: Multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res., 32: , [8] Feng, D. F. and Doolittle, R. F., Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol., 25: , [9] Gardner, P. P., Wilm, A., and Washietl, S., A benchmark of multiple sequence alignment programs upon structural RNAs, Nucleic Acids Res., 33(8): , [10] Gotoh, O., A weighting system and algorithm for aligning many phylogenetically related sequences, Comput. Appl. Biosci., 11: , [11] Gotoh, O., Optimal alignment between groups of sequences and its application to multiple sequence alignment, Comput. Appl. Biosci., 9: , [12] Gotoh, O., Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments, J. Mol. Biol., 264: , [13] Grasso, C. and Lee, C., Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems, Bioinformatics, 20: , [14] Jones, D. T., Taylor, W. R., and Thornton, J. M. The rapid generation of mutation data matrices from protein sequences, Comput. Appl. Biosci., 8: , [15] Katoh, K., Kuma, K., Toh, H., and Miyata, T., MAFFT version 5: Improvement in accuracy of multiple sequence alignment, Nucleic Acids Res., 33: , [16] Katoh, K., Misawa, K., Kuma, K., and Miyata, T., MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform, Nucleic Acids Res., 30: , [17] Morgenstern, B., Dress, A., and Werner, T., Multiple DNA and protein sequence alignment based on segment-to-segment comparison, Proc. Natl. Acad. Sci. USA, 93: , [18] Needleman, S. B. and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol., 48: , [19] Notredame, C., Higgins, D. G., and Heringa, J., T-Coffee: A novel method for fast and accurate multiple sequence alignment, J. Mol. Biol., 302: , [20] Notredame, C., Holm, L., and Higgins, D. G., COFFEE: An objective function for multiple sequence alignments, Bioinformatics, 14(5): , [21] O'Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G., and Notredame, C., 3DCoffee: Combining protein sequences and structures within multiple sequence alignments, J. Mol. Biol., 340: , 2004.
12 MAFFT: A Multiple Sequence Alignment Program 33 [22] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., Numerical Recipes in C: the Art of Scientific Computing, Cambridge University Press, Cambridge, 2nd edition, [23] Rawlings, N. D., Tolle, D. P., and Barrett, A. J., MEROPS: The peptidase database, Nucleic Acids Res., 32: D160 D164, [24] Simossis, V. A., Kleinjung, J., and Heringa, J., Homology-extended sequence alignment, Nucleic Acids Res., 33: , [25] Smith, T. F. and Waterman, M. S., Identification of common molecular subsequences, J. Mol. Biol., 147: , [26] Sokal, R. R. and Michener, C. D., A statistical method for evaluating systematic relationships, University of Kansas Scientific Bulletin, 28: , [27] Stebbings, L. A. and Mizuguchi, K., HOMSTRAD: Recent developments of the homologous protein structure alignment database, Nucleic Acids Res., 32: D203-D207, [28] Subramanian, A. R., Weyer-Menkhoff, J., Kaufmann, M., and Morgenstern, B., DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment, BMC Bioinformatics, 6: 66, [29] Tateno, Y., Ikeo, K., Imanishi, T., Watanabe, H., Endo, T., Yamaguchi, Y., Suzuki, Y., Takahashi, K., Tsunoyama, K., Kawai, M., Kawanishi, Y., Naitou, K., and Gojobori, T., Evolutionary motif and its biological and structural significance, J. Mol. Evol., 44: S38-S43, [30] Thompson, J. D., Higgins, D. G., and Gibson, T. J., CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res., 22: , [31] Thompson, J. D., Plewniak, F., and Poch, O., BAliBASE: A benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics, 15: 87-88, [32] Van Walle, I., Lasters, I., and Wyns, L., Align-m - a new algorithm for multiple alignment of highly divergent sequences, Bioinformatics, 20: , [33] Washietl, S., Hofacker, I. L., and Stadler, P. F. Fast and reliable prediction of noncoding RNAs, Proc. Natl. Acad. Sci. USA, 102: , [34] Yamada, S., Gotoli, O., and Yamana, H., Extension of PRRN: Implementation of a doubly nested randomized iterative refinement strategy under a piecewise linear gap cost, Genome Informatics, 15: P082, 2004.
Probalign: Multiple sequence alignment using partition function posterior probabilities
Sequence Analysis Probalign: Multiple sequence alignment using partition function posterior probabilities Usman Roshan 1* and Dennis R. Livesay 2 1 Department of Computer Science, New Jersey Institute
More informationMultiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences
Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Yue Lu 1 and Sing-Hoi Sze 1,2 1 Department of Biochemistry & Biophysics 2 Department of Computer Science, Texas A&M University,
More informationMultiple Sequence Alignment: A Critical Comparison of Four Popular Programs
Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs Shirley Sutton, Biochemistry 218 Final Project, March 14, 2008 Introduction For both the computational biologist and the research
More informationMultiple sequence alignment
Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple
More informationInDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9
Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic
More informationSequence Alignment Techniques and Their Uses
Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this
More informationCONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018
CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationA profile-based protein sequence alignment algorithm for a domain clustering database
A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing
More informationTHEORY. Based on sequence Length According to the length of sequence being compared it is of following two types
Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between
More informationChapter 11 Multiple sequence alignment
Chapter 11 Multiple sequence alignment Burkhard Morgenstern 1. INTRODUCTION Sequence alignment is of crucial importance for all aspects of biological sequence analysis. Virtually all methods of nucleic
More informationIn-Depth Assessment of Local Sequence Alignment
2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.
More informationSingle alignment: Substitution Matrix. 16 march 2017
Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block
More informationCitation BMC Bioinformatics, 2016, v. 17 n. suppl. 8, p. 285:
Title PnpProbs: A better multiple sequence alignment tool by better handling of guide trees Author(s) YE, Y; Lam, TW; Ting, HF Citation BMC Bioinformatics, 2016, v. 17 n. suppl. 8, p. 285:633-643 Issued
More informationOptimization of a New Score Function for the Detection of Remote Homologs
PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department
More informationAn Introduction to Sequence Similarity ( Homology ) Searching
An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,
More informationTools and Algorithms in Bioinformatics
Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and
More information3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT
3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode
More informationBIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm017
Vol. 23 no. 7 2007, pages 802 808 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm017 Sequence analysis PROMALS: towards accurate multiple sequence alignments of distantly related proteins
More informationEffects of Gap Open and Gap Extension Penalties
Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationRecent developments in the MAFFT multiple sequence alignment program
A preprint of Katoh and Toh (2008, Briefings in Bioinformatics 9:286-298) Recent developments in the MAFFT multiple sequence alignment program Kazutaka Katoh 1 and Hiroyuki Toh 2 1 Digital Medicine Initiative,
More informationA greedy, graph-based algorithm for the alignment of multiple homologous gene lists
A greedy, graph-based algorithm for the alignment of multiple homologous gene lists Jan Fostier, Sebastian Proost, Bart Dhoedt, Yvan Saeys, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele Bioinformatics
More informationUpcoming challenges for multiple sequence alignment methods in the high-throughput era
BIOINFORMATICS REVIEW Vol. 25 no. 19 2009, pages 2455 2465 doi:10.1093/bioinformatics/btp452 Sequence analysis Upcoming challenges for multiple sequence alignment methods in the high-throughput era Carsten
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationOverview Multiple Sequence Alignment
Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments
More informationProtein sequence alignment with family-specific amino acid similarity matrices
TECHNICAL NOTE Open Access Protein sequence alignment with family-specific amino acid similarity matrices Igor B Kuznetsov Abstract Background: Alignment of amino acid sequences by means of dynamic programming
More informationBioinformatics. Dept. of Computational Biology & Bioinformatics
Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS
More informationThe PRALINE online server: optimising progressive multiple alignment on the web
Computational Biology and Chemistry 27 (2003) 511 519 Software Note The PRALINE online server: optimising progressive multiple alignment on the web V.A. Simossis a,b, J. Heringa a, a Bioinformatics Unit,
More informationBIOINFORMATICS: An Introduction
BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and
More informationChapter 5. Proteomics and the analysis of protein sequence Ⅱ
Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and
More informationCISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)
CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST
More informationSequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013
Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationTiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1
Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with
More informationSUPPLEMENTARY INFORMATION
Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)
More informationResearch Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.
Research Proposal Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Name: Minjal Pancholi Howard University Washington, DC. June 19, 2009 Research
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More informationSimultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models. R.C. Edgar, K. Sjölander
Simultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models R.C. Edgar, K. Sjölander Pacific Symposium on Biocomputing 8:180-191(2003) SIMULTANEOUS SEQUENCE ALIGNMENT AND TREE CONSTRUCTION
More informationStatistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department
More information5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT
5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT.03.239 03.10.2012 ALIGNMENT Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity. Homology:
More informationPROMALS3D web server for accurate multiple protein sequence and structure alignments
W30 W34 Nucleic Acids Research, 2008, Vol. 36, Web Server issue Published online 24 May 2008 doi:10.1093/nar/gkn322 PROMALS3D web server for accurate multiple protein sequence and structure alignments
More informationIntroduction to Bioinformatics
Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationCopyright 2000 N. AYDIN. All rights reserved. 1
Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment
More informationSequence Database Search Techniques I: Blast and PatternHunter tools
Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered
More informationPairwise sequence alignment
Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
More informationLarge-Scale Genomic Surveys
Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction
More informationHomology Modeling. Roberto Lins EPFL - summer semester 2005
Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationSequence analysis and comparison
The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species
More informationSequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir
Sequence Bioinformatics Multiple Sequence Alignment Waqas Nasir 2010-11-12 Multiple Sequence Alignment One amino acid plays coy; a pair of homologous sequences whisper; many aligned sequences shout out
More informationMATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME
MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:
More informationModule: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment
Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand
More informationSome Problems from Enzyme Families
Some Problems from Enzyme Families Greg Butler Department of Computer Science Concordia University, Montreal www.cs.concordia.ca/~faculty/gregb gregb@cs.concordia.ca Abstract I will discuss some problems
More informationProtein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.
Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein
More informationSequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5
Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many
More informationPairwise & Multiple sequence alignments
Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived
More informationIntroduction to Evolutionary Concepts
Introduction to Evolutionary Concepts and VMD/MultiSeq - Part I Zaida (Zan) Luthey-Schulten Dept. Chemistry, Beckman Institute, Biophysics, Institute of Genomics Biology, & Physics NIH Workshop 2009 VMD/MultiSeq
More informationUsing Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics
Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu
More informationQuantifying sequence similarity
Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity
More informationWho Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment
Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment Stefano Iantorno 1,2,*, Kevin Gori 3,*, Nick Goldman 3, Manuel Gil 4, Christophe Dessimoz 3, 1 Wellcome Trust Sanger
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationA New Similarity Measure among Protein Sequences
A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence
More informationPractical considerations of working with sequencing data
Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!
More informationBIOINFORMATICS ORIGINAL PAPER
BIOINFORMATICS ORIGINAL PAPER Vol. 24 no. 19 2008, pages 2165 2171 doi:10.1093/bioinformatics/btn414 Sequence analysis Model-based prediction of sequence alignment quality Virpi Ahola 1,2,, Tero Aittokallio
More informationPhylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches
Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell
More informationIntroduction to Bioinformatics Online Course: IBT
Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple
More informationSupplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc
Supplemental Data. Perea-Resa et al. Plant Cell. (22)..5/tpc.2.3697 Sm Sm2 Supplemental Figure. Sequence alignment of Arabidopsis LSM proteins. Alignment of the eleven Arabidopsis LSM proteins. Sm and
More informationBiologically significant sequence alignments using Boltzmann probabilities
Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this
More informationGrouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids
Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids
More informationAlignment & BLAST. By: Hadi Mozafari KUMS
Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence
More informationTools and Algorithms in Bioinformatics
Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology
More informationHomology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB
Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded
More informationLarge Grain Size Stochastic Optimization Alignment
Brigham Young University BYU ScholarsArchive All Faculty Publications 2006-10-01 Large Grain Size Stochastic Optimization Alignment Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu
More informationMultiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:
Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:
More informationCédric Notredame and Chantal Abergel
8VLQJPXOWLSOHDOLJQPHQWPHWKRGVWRDVVHVVWKHTXDOLW\RI JHQRPLFGDWDDQDO\VLV Cédric Notredame and Chantal Abergel Information Génétique et Structurale UMR 1889 31 Chemin Joseph Aiguier 13 006 Marseille Email:
More informationSimilarity searching summary (2)
Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity
More informationProtein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche
Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its
More informationFirst generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences
First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a
More informationSequence Analysis '17- lecture 8. Multiple sequence alignment
Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database
More informationAlignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)
Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in
More informationPROMALS3D: a tool for multiple protein sequence and structure alignments
Nucleic Acids Research Advance Access published February 20, 2008 Nucleic Acids Research, 2008, 1 6 doi:10.1093/nar/gkn072 PROMALS3D: a tool for multiple protein sequence and structure alignments Jimin
More informationSUPPLEMENTARY INFORMATION
Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,
More informationMultiple Genome Alignment by Clustering Pairwise Matches
Multiple Genome Alignment by Clustering Pairwise Matches Jeong-Hyeon Choi 1,3, Kwangmin Choi 1, Hwan-Gue Cho 3, and Sun Kim 1,2 1 School of Informatics, Indiana University, IN 47408, USA, {jeochoi,kwchoi,sunkim}@bio.informatics.indiana.edu
More informationMultiple Sequence Alignment
Multiple Sequence Alignment Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences.! What about more than two? And what for?! A faint similarity between two
More informationCMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison
CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture
More informationBIOINFORMATICS. RASCAL: rapid scanning and correction of multiple sequence alignments. J. D. Thompson, J. C. Thierry and O.
BIOINFORMATICS Vol. 19 no. 9 2003, pages 1155 1161 DOI: 10.1093/bioinformatics/btg133 RASCAL: rapid scanning and correction of multiple sequence alignments J. D. Thompson, J. C. Thierry and O. Poch Laboratoire
More informationMoreover, the circular logic
Moreover, the circular logic How do we know what is the right distance without a good alignment? And how do we construct a good alignment without knowing what substitutions were made previously? ATGCGT--GCAAGT
More informationTruncated Profile Hidden Markov Models
Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov
More informationfrmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity
1 frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity HUZEFA RANGWALA and GEORGE KARYPIS Department of Computer Science and Engineering
More informationScoring Matrices. Shifra Ben-Dor Irit Orr
Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison
More informationPairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55
Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise
More informationSegment-based scores for pairwise and multiple sequence alignments
From: ISMB-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Segment-based scores for pairwise and multiple sequence alignments Burkhard Morgenstern 1,*, William R. Atchley 2, Klaus
More informationBioinformatics for Biologists
Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational
More informationDetecting Distant Homologs Using Phylogenetic Tree-Based HMMs
PROTEINS: Structure, Function, and Genetics 52:446 453 (2003) Detecting Distant Homologs Using Phylogenetic Tree-Based HMMs Bin Qian 1 and Richard A. Goldstein 1,2 * 1 Biophysics Research Division, University
More informationLecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins
Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) 2 19 2015 Scribe: John Ekins Multiple Sequence Alignment Given N sequences x 1, x 2,, x N : Insert gaps in each of the sequences
More informationRobust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix
Robust sequence alignment using evolutionary rates coupled with an amino acid substitution matrix Item Type Article Authors Ndhlovu, Andrew; Hazelhurst, Scott; Durand, Pierre M. Citation Ndhlovu et al.
More informationBioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre
Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement
More informationBasic Local Alignment Search Tool
Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses
More information