The PRALINE online server: optimising progressive multiple alignment on the web

Computational Biology and Chemistry 27 (2003) 511–519

Software Note

V.A. Simossis a,b, J. Heringa a,*

a Bioinformatics Unit, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081A, 1081 HV Amsterdam, The Netherlands
b Division of Mathematical Biology, The National Institute for Medical Research, The Ridgeway, NW7 1AA, London, UK

Received 9 September 2003; received in revised form 9 September 2003; accepted 9 September 2003

Abstract

We introduce the online server for PRALINE (http://ibivu.cs.vu.nl/programs/pralinewww/), an iterative, versatile progressive multiple sequence alignment (MSA) tool. PRALINE provides various MSA optimisation strategies, including weighted global and local profile pre-processing, secondary structure-guided alignment and a reliability measure for individual aligned residue positions. The latter can also be used to optimise the alignment when the profile pre-processing strategies are iterated. In addition, we have designed the server output to enable comprehensive visualisation of the generated alignment and easy figure generation for publications. Depending on the options set, the alignment is represented in five default colour schemes based on residue type, position conservation, position reliability, residue hydrophobicity and secondary structure. We have also implemented a custom colour scheme that allows the user to select which colour will represent one or more amino acids in the alignment. The grouping of sequences on which the alignment is based can also be visualised as a dendrogram. The PRALINE algorithm is designed to work as a toolkit for MSA rather than as a one-step process.

Keywords: PRALINE; Multiple sequence alignment; Profile pre-processing; Secondary structure-guided MSA; Positional reliability

1. Introduction

Biological data processing tools are applied in many disciplines and vary greatly in complexity and specificity.
One very important and complex problem is multiple sequence alignment (MSA), a cornerstone of bioinformatics. A wide range of disciplines in computational biology, such as phylogeny, function prediction, secondary and tertiary structure prediction, modelling, sequence analysis and many more, are all largely based on MSA. The MSA problem has been addressed in various ways, and many strategies have been developed over the last two decades to improve the quality and reliability of MSA over a vast number of alignment cases (for reviews see Heringa et al., 1997; Notredame, 2002; Simossis et al., 2003). One of the most successful alignment strategies is progressive alignment (Hogeweg and Hesper, 1984; Feng and Doolittle, 1987), which is implemented in most top-performing MSA methods (Thompson et al., 1994; Heringa, 1999; Notredame et al., 2000; Holmes, 2003). Commonly in progressive alignment, a dendrogram is precompiled from sequence similarity scores and used to order the alignment steps so that the most related, and thus least error-prone, sequences are aligned first. However, the main problem with progressive alignment is that once a sequence has been aligned into the growing MSA, its alignment cannot be altered, even if newly added sequences require it ("once a gap, always a gap"; Feng and Doolittle, 1987). Therefore, early alignment errors are carried into the successive alignment steps and can cause further, larger errors to arise (error propagation). Such error propagation becomes even more detrimental to the alignment quality when the progressive strategy is used iteratively (Heringa, 1999, 2000, 2002). To counteract this weakness of progressive alignment, various researchers have developed optimisation steps to minimise the probability of early errors.

* Corresponding author. Tel.: +31-20-444-7749; fax: +31-20-444-7653. E-mail address: heringa@cs.vu.nl (V.A. Simossis).
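The guide-tree step of progressive alignment described in the Introduction can be illustrated with a minimal sketch. It assumes average-linkage clustering on pairwise similarity scores, which is one common choice and not necessarily the clustering PRALINE uses; the function name `merge_order`, the sequence labels and the scores are all hypothetical.

```python
from itertools import combinations


def merge_order(names, similarity):
    """Return the cluster merges of a guide tree, most similar first.

    `similarity` maps frozenset({a, b}) to a pairwise similarity score.
    Average linkage is an assumption made for this illustration.
    """
    clusters = [(name,) for name in names]
    merges = []

    def avg_sim(c1, c2):
        # Mean similarity over all cross-cluster sequence pairs.
        scores = [similarity[frozenset((a, b))] for a in c1 for b in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Join the two clusters that are currently most similar.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical similarity scores for four sequences.
scores = {
    frozenset(("A", "B")): 90.0,
    frozenset(("A", "C")): 40.0,
    frozenset(("B", "C")): 50.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 20.0,
    frozenset(("C", "D")): 30.0,
}
order = merge_order(["A", "B", "C", "D"], scores)
for left, right in order:
    print(left, "+", right)
```

Here the closest pair (A, B) is aligned first and the most distant sequence D joins last, which is the ordering that keeps early, hard-to-undo alignment steps as reliable as possible.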
Amongst the most successful progressive alignment methods using optimisation strategies are PRALINE, whose strategies will be briefly discussed (for elaborate accounts see Heringa, 1999, 2002), ClustalW with a number of heuristics (Thompson et al., 1994) and T-Coffee with the matrix extension strategy (Notredame et al., 2000). PRALINE follows a methodology similar to other progressive alignment methods but comprises three novel optimisation strategies: global profile pre-processing, local profile pre-processing, and secondary structure-guided alignment. These optimisation strategies can be used as single steps or in combination to construct an MSA, and can also be further optimised by iteration. PRALINE is a well-characterised alignment method (Heringa, 1999, 2000, 2002) and has recently been parallelised to minimise its processing time when aligning large datasets (Kleinjung et al., 2002).

1476-9271/$ - see front matter. doi:10.1016/j.compbiolchem.2003.09.002

2. Profile pre-processing

The profile pre-processing philosophy (Heringa, 1999) is to use information from other, related sequences in the sequence set to be aligned. In combination with position-specific gap penalties, it allows increased matching of distant sequences and makes it likely that gaps are placed outside un-gapped core regions during progressive alignment (Heringa, 1999, 2002). Initially, a score representing the degree of similarity is calculated for every pair of sequences by performing pairwise global or local alignments for the global and local strategies, respectively. A global pre-alignment is then created for each sequence, which includes only those sequences whose similarity score is higher than a user-specified threshold. Consequently, the pre-alignment only contains sequences that are as closely related as the user requires, which increases the information each sequence carries into the MSA and at the same time minimises the incorporation of misleading input arising from incorrect alignment (Heringa, 1999, 2002).
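The threshold step above can be sketched as follows. This is a minimal illustration of selecting, for each sequence, the partners whose pairwise score exceeds the threshold; the function name `pre_alignment_members` and the scores are hypothetical, and the actual pairwise alignment and profile construction are not shown.

```python
def pre_alignment_members(names, pair_scores, threshold):
    """For each sequence, list the other sequences that enter its pre-alignment.

    `pair_scores` maps frozenset({a, b}) to the pairwise similarity score.
    A partner is kept only if its score is higher than `threshold`.
    """
    members = {}
    for name in names:
        members[name] = [
            other
            for other in names
            if other != name
            and pair_scores[frozenset((name, other))] > threshold
        ]
    return members


# Hypothetical pairwise scores for three sequences.
pair_scores = {
    frozenset(("A", "B")): 120.0,
    frozenset(("A", "C")): 35.0,
    frozenset(("B", "C")): 40.0,
}
members = pre_alignment_members(["A", "B", "C"], pair_scores, threshold=50.0)
print(members)
```

With these toy scores, A and B enrich each other's pre-alignment, while C has no partner above the threshold and would therefore be represented by itself alone.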
Each pre-alignment is then converted to a global or local pre-profile, according to the strategy used, which represents the original sequence in the final MSA. In effect, the final MSA is no longer an alignment of a set of sequences, but rather of a set of profiles that contain more useful information than the individual sequences. During the profile pre-processing strategies, sequences can be included in more than one pre-alignment, depending on whether their similarity score exceeds the preset threshold. The consistency with which they align in all pre-alignments is used to generate a reliability measure for each aligned residue (Heringa, 1999). The more consistent an aligned position is, the higher its reliability score. Each pre-processed profile can contain information about its related sequences, unless the sequence is a distant outlier that has no other sequences in its pre-profile and, vice versa, does not appear in the pre-profiles of others. Each sequence in the final alignment can therefore be assessed in terms of the degree of consistency reached across the pre-profiles, which is translated into a reliability score for each amino acid in the final MSA.

3. Iteration

In addition, the consistency of pre-processed profiles can be used to optimise the alignment through iteration, by keeping the consistent pre-profile positions and realigning the inconsistent segments. Iteration is guided by the consistency scores obtained, which are used as weights in the construction of alignments during the next MSA step (Heringa, 1999, 2002). From the resulting set of iterative alignments, the one with the highest cumulative score over all pairwise matched amino acids in the alignment can be selected as a safeguard to prevent alignments from wandering away to less optimal areas in the alignment space (Heringa, 2002).

4. Secondary structure-guided MSA

The conservation of secondary structure elements across related sequences is usually much higher than that of single residues ("structure is more conserved than sequence"; Chothia and Lesk, 1986; Sander and Schneider, 1991; Rost, 1999). Therefore, in an alignment of related sequences, the secondary structure elements should align in the same regions. By taking into consideration the secondary structure state of each sequence position, we apply a local weight to the global alignment that keeps secondary structure element regions ungapped. The algorithm proceeds by initially constructing an MSA without information about the corresponding secondary structure. The initial alignment is constructed using a default residue exchange matrix (e.g. the BLOSUM62 matrix) and related gap penalties (Henikoff and Henikoff, 1992). If the structure of a sequence is known, i.e. it has a PDB entry (Berman et al., 2000), its secondary structure is determined using DSSP (Kabsch and Sander, 1983); otherwise the secondary structure is predicted by PREDATOR (Frishman and Argos, 1996, 1997) or the widely used PHD method (Rost and Sander, 1993), although in principle any available secondary structure prediction method could be used. A new alignment is then constructed, now using the corresponding secondary structure. Secondary structure prediction results in a tentative secondary structure for each sequence when a single sequence-based method is used, or in a single secondary structure when an MSA-reliant method is used. PRALINE then interchangeably uses three secondary structure-specific residue exchange matrices (Lüthy et al., 1994) and associated gap penalties.
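How the interchangeable matrices might be applied to a single matched pair of positions can be sketched as follows: when both positions share a secondary structure state, the weight is read from the matrix for that state, and otherwise from the default matrix. This is a minimal illustration, not PRALINE's code; the state labels "H", "E" and "C", the function name `exchange_weight` and all matrix values are placeholders invented for the example.

```python
def exchange_weight(res_a, res_b, state_a, state_b, default_matrix, ss_matrices):
    """Pick the residue exchange weight for one matched pair of positions."""
    key = tuple(sorted((res_a, res_b)))
    if state_a == state_b and state_a in ss_matrices:
        # Identical secondary structure states: use the state-specific matrix.
        return ss_matrices[state_a][key]
    # Non-identical states: fall back to the default exchange matrix.
    return default_matrix[key]


# Placeholder numbers only; not taken from BLOSUM62 or Luethy et al. (1994).
default_matrix = {("A", "A"): 2, ("A", "L"): -2, ("L", "L"): 3}
ss_matrices = {
    "H": {("A", "A"): 5, ("A", "L"): 0, ("L", "L"): 5},  # helix
    "E": {("A", "A"): 3, ("A", "L"): -3, ("L", "L"): 6},  # strand
    "C": {("A", "A"): 1, ("A", "L"): -1, ("L", "L"): 2},  # other
}

same_state = exchange_weight("L", "A", "H", "H", default_matrix, ss_matrices)
mixed_state = exchange_weight("L", "A", "H", "E", default_matrix, ss_matrices)
print(same_state, mixed_state)
```

Keeping the lookup in one small function makes it easy to see that the secondary structure only changes which matrix is consulted, not how the alignment itself is computed.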
The residue exchange weights for matched sequence positions with identical secondary structure states are taken from the corresponding secondary structure-specific residue exchange matrix, while matched sequence positions with non-identical secondary structure states are assigned the corresponding value from the default exchange matrix, e.g. the BLOSUM62 matrix (Heringa, 2002). If the PHD prediction method is used, this strategy can be iterated because of PHD's dependence on the alignment: each iteration produces a better MSA that is passed on to the next iteration, guiding a more accurate secondary structure prediction, which in turn guides alignment, and so on. Because the iteration cycles are supervised, convergence, divergence or a limit cycle is detected and reported back to the user.

In the remainder of this paper, we introduce the online server for the MSA method PRALINE. PRALINE is fully customisable and, with appropriate use of its options, can perform as well as or better than the current leading method T-Coffee and other popular methods such as ClustalW and Dialign (Morgenstern, 1999) on benchmarking standards such as BAliBASE (Thompson et al., 1999a,b). It has also been shown to perform much better on specific biological examples such as the alignment of the flavodoxin family members (Heringa, 1999), which the other methods fail to align correctly. The most important aspect of PRALINE is that it allows the user to use different settings to optimise an alignment. As a result, although PRALINE can still be run automatically with its default settings, it allows purposeful tweaking and optimisation of alignment parameters for specific problems.

5. Online accessibility

The PRALINE server is accessible on the IBIVU website at the Free University of Amsterdam (URL: http://ibivu.cs.vu.nl/programs/pralinewww/) or at the mirror site on the Department of Mathematical Biology server at the National Institute for Medical Research in London (URL: http://mathbio.nimr.mrc.ac.uk/~vsimoss/pralinewww/).

6. The PRALINE server

The PRALINE server aims to provide both non-specialist and specialist users with a fast and informative approach to aligning protein sequences. We provide online help sections for each of the parameters PRALINE may be set with, containing background information and examples, and an online documentation section describing how PRALINE uses this information.

Fig. 1. The PRALINE server standard user interface. (a) Text area for FASTA or PIR sequences, (b) path for uploading a FASTA or PIR file, (c) submit job for default run, (d) gap penalties and amino acid exchange weights matrix selection, (e) alignment method selection, (f) secondary structure information (no iteration at present), (g) select tree representation, (h) select user-defined colour scheme, (i) select final alignment file format.

6.1. The standard user interface

The standard user interface is targeted mainly towards non-specialist users. The sequences to be aligned must be in FASTA (Pearson, 1999) or PIR (Barker et al., 2000) format and can either be entered manually in the text field provided (see Fig. 1a) or uploaded as a file (see Fig. 1b). PRALINE can be run using its default settings, which perform a single global alignment of the sequences with a gap opening penalty of 12.0, a gap extension penalty of 1.0 and the amino acid substitution matrix BLOSUM62. Otherwise, a help section describes how the gap penalties work and gives some example combinations for standard amino acid substitution matrices. At present, the amino acid substitution matrices available are PAM250 (Dayhoff et al., 1983), BLOSUM50 and BLOSUM62 (Henikoff and Henikoff, 1992), and GON250 (Gonnet et al., 1992). A help section aids the choice of the most suitable matrix depending on the type of sequences the user wants to align (see Fig. 1d).

The PRALINE server provides three different alignment optimisation strategies: global, global with profile pre-processing and global with local profile pre-processing (see Fig. 1e). The profile pre-processing threshold values for the latter two methods are alignment-dependent and it is therefore up to the user to decide on an optimal value. All pairwise scores are saved in a list that is available on the results page after an initial run. It is therefore sensible to first run an alignment using a threshold value of 0, which will include all sequences, then choose an optimal threshold value from the score list on the results page and re-run the alignment using that threshold value.
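The two-pass workflow above, a first run at threshold 0 followed by a choice of threshold from the score list, can be supported by a small tabulation. The sketch below assumes the pairwise scores have already been copied from the results page into a dictionary; the function name `summarise_thresholds`, the data layout and the scores are hypothetical and do not reflect the server's actual output format.

```python
def summarise_thresholds(pair_scores, thresholds):
    """Tabulate the effect of each candidate threshold.

    Returns (threshold, pairs_kept, sequences_without_partner) tuples, where a
    pair is kept if its score is higher than the threshold.
    """
    names = sorted({name for pair in pair_scores for name in pair})
    summary = []
    for threshold in thresholds:
        kept = [pair for pair, score in pair_scores.items() if score > threshold]
        with_partner = {name for pair in kept for name in pair}
        summary.append((threshold, len(kept), len(names) - len(with_partner)))
    return summary


# Hypothetical score list from an initial run with threshold 0.
pair_scores = {
    ("A", "B"): 120.0,
    ("A", "C"): 35.0,
    ("B", "C"): 40.0,
    ("A", "D"): 8.0,
    ("B", "D"): 5.0,
    ("C", "D"): 12.0,
}
summary = summarise_thresholds(pair_scores, [0, 30, 100])
for threshold, pairs_kept, isolated in summary:
    print(threshold, pairs_kept, isolated)
```

A threshold that leaves many sequences without any partner removes most of the benefit of pre-processing, while a threshold that keeps every pair admits distant and potentially misleading sequences, so a value between these extremes is usually the one to try in the second run.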
If a negative threshold value (-x) is used, the threshold scores are weighted each time according to the sequence lengths; otherwise, the length is not taken into consideration. Heringa (2002) recommends a setting of 9.5 for the length-dependent threshold value.

Fig. 2. The PRALINE server advanced user interface. (a) Text area for FASTA or PIR sequences, (b) path for uploading a FASTA or PIR file, (c) command line for PRALINE options, (d) path for uploading user-defined amino acid exchange weights matrix, (e) select user-defined colour scheme, (f) complete PRALINE options list.

In addition, the two profile pre-processing methods can be iterated, with the number of iterations ranging from 0 to 10. Finally, when a profile pre-processing method is used, PRALINE provides, at each iteration, reliability scores for each amino acid position as well as an average reliability for each alignment position. The reliability scores are represented in the position reliability colour scheme.

PRALINE can use either PREDATOR or PHD to predict the secondary structure of the input sequences, but not both at once. It is also possible to search the PDB (Berman et al., 2000) to find 3D structure information for the input sequences and use the DSSP-derived secondary structure for the alignment. If both DSSP and a prediction method are selected, predictions will only be made for the sequences that do not have a PDB file (see Fig. 1f). The predicted structures are represented in the secondary structure colour scheme (vide infra).

PRALINE can also provide the grouping of the sequences on which the alignment is based as a dendrogram at each iteration (see Fig. 1g). The tree is available as one of the viewing formats on the results page (vide infra). PRALINE can currently save the final alignment to a file in either MSF (Genetics Computer Group, 1993) or FASTA format for possible further use (see Fig. 1i).

6.2. The advanced user interface

The advanced user interface is targeted mainly towards specialist users. As in the standard user interface, the input sequences must be in FASTA or PIR format and can either be entered in the text field provided or uploaded as a file (see Fig. 2a,b). Instead of selectable options, the advanced user interface has a command line so that the user can manually enter more options than are provided in the standard interface (see Fig. 2c).
In addition, the user can upload a custom amino acid substitution matrix in the same way as an input sequence file (see Fig. 2d). A sample amino acid substitution matrix is available for viewing in the format that PRALINE can read. Finally, the user can consult the reference options table, which lists all the options currently available in PRALINE with a short description of each option and the restrictions on the different combinations (see Fig. 2f).

6.3. Alignment representation methods: the results page

When a PRALINE job is submitted, the user is presented with a holding page that refreshes automatically and displays the results page once the job is complete.

Fig. 3. The results page headers.

Fig. 4. The default colour schemes.

Fig. 5. The user-defined colour table (left) and alignment representation (right).

The results page contains various parts depending on the options selected (see Fig. 3). Firstly, if the iteration number selected is greater than 0, a subtitle informs the user which iteration cycle's results are presented on the page. The alignment from each iteration cycle is presented on a separate page and is accessible through the corresponding links. The page also reports the total time taken for the process to complete, whether all the iterations were completed or whether they halted because the alignment converged or entered a limit cycle, and which iteration was the last. Secondly, if profile pre-processing is selected, the user has the option of viewing the profile pre-processing scores for all pairwise alignments in order to derive an optimum cut-off value. Finally, if selected, there is a link to download the alignment file in the chosen format (MSF or FASTA) and to view the raw PRALINE output.

6.4. Colour schemes

The default colour schemes are based on residue type, conservation by alignment position, reliability by alignment position and position average reliability, hydrophobicity and, finally, secondary structure (see Fig. 4). Each scheme has a short explanation of how to interpret the colours and a colour reference key at the top of the alignment. The default representation is the conservation scheme. Residue-specific colours follow the colouring scheme of ClustalX (Thompson et al., 1997), and hydrophobicity scaling has been assigned according to Eisenberg et al. (1984). The reliability colours are only available if profile pre-processing methods have been used. The secondary structure representation has three states (H: helix, E: strand, blank: other) and is only available if secondary structure has been used to guide the alignment.
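To illustrate what a position-based scheme such as conservation colouring needs as input, the sketch below computes one score per alignment column. The measure used here, the share of sequences carrying the most common residue scaled to 0 to 10, is an assumption chosen for simplicity; the text does not specify how the server computes conservation, and the function name `column_conservation` and the toy alignment are hypothetical.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Score each alignment column from 0 (no residues) to 10 (fully conserved).

    Assumed measure: share of sequences carrying the most common residue.
    Gaps count towards the column height but never as the common residue.
    """
    scores = []
    for column in zip(*alignment):
        residues = [residue for residue in column if residue != gap]
        if not residues:
            scores.append(0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(round(10 * top_count / len(column)))
    return scores


# Toy alignment of three equally long rows.
alignment = ["MKV-", "MKI-", "MRVA"]
conservation = column_conservation(alignment)
print(conservation)
```

Each score can then be mapped to a colour from a fixed palette, so that a fully conserved column stands out from a variable or mostly gapped one.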
Apart from the five default colour schemes, we also provide an optional user-defined colour scheme. It enables the user to select from a table of eight pre-set colours and assign any of them to one or more amino acids, in any combination desired for viewing (see Fig. 5). This is particularly useful when a specific position or motif needs to stand out in the alignment, or when specific amino acids need to be highlighted for illustrative purposes.

7. Caveats

The PRALINE server has some limitations that need to be clear to the user. Firstly, PRALINE is not a DNA alignment program: it does not accept DNA sequences as input, nor does it translate them into protein. Secondly, profile pre-processing, secondary structure prediction and iteration greatly improve alignment quality and information feedback, but can make PRALINE slow, although a parallelised version has been made available (Kleinjung et al., 2002). Finally, all alignment methods will produce some sort of alignment, whether biologically meaningful or not. However, the ability to manually optimise parameters and the position reliability scores provided by PRALINE allow the user to make a reasonable assessment of the alignment quality and choose the best resulting alignment.

8. Concluding remarks

The PRALINE server offers some unique features that make it a versatile and useful alignment tool. It provides the user with feedback about the quality of the alignment produced in an iterative scenario and, through fully customisable parameters, enables the user to use this information to optimise the alignment. It also provides more than one alignment strategy and can use secondary structure input, thus covering a wide range of alignment cases. In addition, the multiple representations of the alignment offer a convenient and diverse way of illustrating an alignment according to the user's needs.
Apart from being an accurate method, the PRALINE server is a toolbox for protein sequence alignment that gives users the opportunity to learn more about their alignment problem, the means to find the best possible solution and the ability to present it in a more detailed and educational form.

Acknowledgements

This project was funded by the generous contributions of the Medical Research Council and the Free University Amsterdam.

References

Barker, W.C., Garavelli, J.S., Huang, H., McGarvey, P.B., Orcutt, B.C., Srinivasarao, G.Y., Xiao, C., Yeh, L.S., Ledley, R.S., Janda, J.F., Pfeiffer, F., Mewes, H.W., Tsugita, A., Wu, C., 2000. The Protein Information Resource (PIR). Nucleic Acids Res. 28, 41–44.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Chothia, C., Lesk, A.M., 1986. The relationship between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826.
Dayhoff, M.O., Barker, W.C., Hunt, L.T., 1983. Establishing homologies in protein sequences. Methods Enzymol. 91, 524–545.
Eisenberg, D., Schwarz, E., Komaromy, M., Wall, R., 1984. Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol. 179 (1), 125–142.
Feng, D.F., Doolittle, R.F., 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351–360.
Frishman, D., Argos, P., 1996. Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9 (2), 133–142.
Frishman, D., Argos, P., 1997. Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27 (3), 329–335.
Genetics Computer Group, 1993. Program manual for the GCG package, version 8, 575 Science Drive, Madison, WI.

Gonnet, G.H., Cohen, M.A., Benner, S.A., 1992. Exhaustive matching of the entire protein sequence database. Science 256 (5062), 1443–1445.
Henikoff, S., Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919.
Heringa, J., 1999. Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem. 23, 341–364.
Heringa, J., 2000. Computational methods for protein secondary structure prediction using multiple sequence alignment. Curr. Protein Pept. Sci. 1, 273–301.
Heringa, J., 2002. Local weighting schemes for protein multiple sequence alignments. Comput. Chem. 26, 459–477.
Heringa, J., Frishman, D., Argos, P., 1997. Computational methods relating sequence and structure. In: Protein: A Comprehensive Treatise, Ch. 4, vol. 1. JAI Press Inc., Greenwich, CT, pp. 165–268.
Hogeweg, P., Hesper, B., 1984. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J. Mol. Evol. 20, 175–186.
Holmes, I., 2003. Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics 19 (Suppl. 1), I147–I157.
Kabsch, W., Sander, C., 1983. A dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers 22, 2577–2637.
Kleinjung, J., Douglas, N., Heringa, J., 2002. Parallelized multiple alignment. Bioinformatics 18 (9), 1270–1271.
Lüthy, R., Xenarios, I., Bucher, P., 1994. Improving the sensitivity of the sequence profile method. Protein Sci. 3, 139–146.
Morgenstern, B., 1999. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 15, 211–218.
Notredame, C., 2002. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3 (1), 1–14.
Notredame, C., Higgins, D.G., Heringa, J., 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205–217.
Pearson, W.R., 1999. Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132, 185–219.
Rost, B., 1999. Twilight zone of protein sequence alignment. Protein Eng. 12, 85–94.
Rost, B., Sander, C., 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584–599.
Sander, C., Schneider, R., 1991. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins: Struct. Funct. Genet. 9, 56–68.
Simossis, V.A., Kleinjung, J., Heringa, J., 2003. An overview of multiple sequence alignment. In: Current Protocols in Bioinformatics. Wiley & Sons Inc., in press.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties, and weight matrix choices. Nucleic Acids Res. 22 (22), 4673–4680.
Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., Higgins, D.G., 1997. The ClustalX windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25 (24), 4876–4882.
Thompson, J.D., Plewniak, F., Poch, O., 1999a. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 27 (13), 2682–2690.
Thompson, J.D., Plewniak, F., Poch, O., 1999b. BAliBASE: a benchmark alignments database for the evaluation of multiple sequence alignment programs. Bioinformatics 15, 87–88.