Cladistics 27 (2) 42 46./j.96-3.2.333.x

The deterministic effects of alignment bias in phylogenetic inference

Mark P. Simmons a,*, Kai F. Müller b and Colleen T. Webb a

a Department of Biology, Colorado State University, Fort Collins, CO 8523, USA; b Institute for Evolution and Biodiversity, University of Münster, Hüfferstrasse, 4849 Münster, Germany

Accepted June 2

Abstract

Alignment of nucleotide and/or amino acid sequences is a fundamental component of sequence-based molecular phylogenetic studies. Here we examined how different alignment methods affect the phylogenetic trees that are inferred from the alignments. We used simulations to determine how alignment errors can lead to systematic biases that affect phylogenetic inference from those sequences. We compared four approaches to sequence alignment: progressive pairwise alignment, simultaneous multiple alignment of sequence fragments, local pairwise alignment and direct optimization. When taking into account branch support, implied alignments produced by direct optimization were found to show the most extreme behaviour (based on the alignment programs for which nearly equivalent alignment parameters could be set) in that they provided the strongest support for the correct tree in the simulations in which it was easy to resolve the correct tree and the strongest support for the incorrect tree in our long-branch-attraction simulations. When applied to alignment-sensitive process partitions with different histories, direct optimization showed the strongest mutual influence between the process partitions when they were aligned and phylogenetically analysed together, which makes detecting recombination more difficult. Simultaneous alignment performed well relative to direct optimization and progressive pairwise alignment across all simulations.
Rather than relying upon methods that integrate alignment and tree search into a single step without accounting for alignment uncertainty, as with implied alignments, we suggest that simultaneous alignment using the similarity criterion, within the context of information available on biological processes and function, be applied whenever possible for sequence-based phylogenetic analyses.

© The Willi Hennig Society 2.

*Corresponding author: E-mail address: psimmons@lamar.colostate.edu

Alignment of nucleotide and/or amino acid sequences is a fundamental component of sequence-based molecular phylogenetic studies. As demonstrated by Morrison and Ellis (1997), how our sequences are aligned may have a greater effect on our inferred trees than how we perform phylogenetic inference on our aligned sequences. The importance of obtaining an accurate alignment, however, is of some debate. For example, Mugridge et al. (2, p. 85) asserted that, "Any nonhomologous alignment will lead to an erroneous phylogeny." In contrast, Ogden and Rosenberg (2006, p. 39) found that, stochastically, some inaccurate alignments produce better trees than the correct alignment in their simulations.

Most comparisons of alignment programs have focused on how congruent the inferred alignments are with known alignments (e.g. McClure et al., 1994; Hickson et al., 2; Grasso and Lee, 2004). In the present study, however, our focus was on how different alignment methods affect the gene trees that are derived from their inferred alignments. There is not necessarily a linear correlation between alignment accuracy and phylogenetic accuracy when the inferred alignments are a good approximation of the correct alignment (Wang et al., in press). With respect to the gene tree, the gold standard for alignments should be the similarity of tree topology and branch-support values to those provided by the correct alignment.
It is possible for alignment errors to result in more accurate tree topologies than those that would be inferred from the correct alignments. But this should be regarded as a fortuitous bias similar to long-branch attraction between sister groups (Swofford et al., 2).

We used simulations to determine how alignment errors can lead to systematic biases that affect phylogenetic inference from those sequences. We compared four approaches to sequence alignment: progressive pairwise alignment (Feng and Doolittle, 1987), as implemented in Clustal W (Thompson et al., 1994) and MUSCLE (Edgar, 2004a,b); simultaneous multiple alignment of sequence fragments (Tönges et al., 1996), as implemented in DCA (Stoye, 1998); local pairwise alignment (Morgenstern et al., 1996), as implemented in DIALIGN 2 (Morgenstern, 1999) and DIALIGN-T (Subramanian et al., 2005); and direct optimization (Wheeler, 1996), as implemented in POY 3 (Wheeler et al., 2003) and POY 4 (Varón et al., 2009).

In progressive pairwise alignment, the most similar pair(s) of sequences are aligned first, followed by progressively less similar pairs or sets of sequences until all sequences are aligned in a single multiple alignment. Lake (1991) demonstrated that the sequence-addition order used in progressive pairwise alignment can undesirably determine the topology of the phylogeny inferred from the aligned sequences (see also Thorne et al., 1991; Thorne and Kishino, 1992). Lake (1991) suggested that the solution to this problem would be to restrict phylogenetic inference to using characters from those regions for which the alignment is independent of any particular progressive alignment order. Alternatively, simultaneous alignment may potentially be used to eliminate any biases that may be caused by use of any particular progressive alignment order. Both progressive pairwise alignment and simultaneous alignment are global alignment methods for which the alignment criterion is similarity.

In contrast to the other three methods examined, local pairwise alignment (Smith and Waterman, 1981) does not necessarily align all nucleotides/amino acids.
In DIALIGN 2, the order of pairwise segment-to-segment alignments is independent of the overall similarity between pairs of sequences, in contrast to DIALIGN-T, in which the overall similarity between pairs of sequences is taken into account when determining the pairwise segment-to-segment alignment order (Subramanian et al., 2005).

Direct optimization differs from the other methods examined in that its optimality criterion is likelihood or parsimony, rather than similarity (Wheeler, 1996, 2006). The alignment(s) that produces the trees with the highest likelihood or fewest steps are favoured over alternative alignments. This is accomplished by integrating alignment and phylogenetic tree search into a single step.

Questions to address

We set out to address two questions in this study. First, when only a single process partition (Bull et al., 1993) is simulated, how accurately do the gene trees derived from the inferred alignments reconstruct the simulated tree topology? Are they more or less accurate than the gene tree derived from the correct alignment? This question is broadly relevant to selecting among alignment methods and programs for studies in which phylogenetic inference is performed from the aligned sequences. Our a priori hypothesis was that direct optimization would show the most extreme behaviour, followed by progressive alignment, followed by simultaneous and local alignment. In this context, extreme behaviour is defined as showing a positive bias relative to the correct alignment (i.e. being more likely to infer the correct tree topology with greater branch support) in our simulations in which it is easy to resolve the correct tree (i.e. all branches of equal length) and a greater negative bias in our simulations in which it is easy to resolve the incorrect tree (i.e. two long branches that are not sister groups separated by a short internal branch; Felsenstein, 1978).
This first hypothesis was based on the use of a single order (or in some cases two or more orders for direct optimization when equally optimal alignments are found) in which entire sequences are aligned (in the case of progressive pairwise alignment) or optimized (in the case of direct optimization). In contrast, no single order is necessarily used by simultaneous or local alignment methods. The use of a single tree (or subset of possible trees) to determine alignment or optimization order produces alignments from which phylogenetic trees are derived that are subject to the artefact described by Lake (1991; see above). Direct optimization optimizes sequences at internal nodes. In contrast, progressive pairwise alignment relies on pairwise comparisons among each sequence in one set of sequences being aligned relative to all sequences in the other set of sequences that are being aligned (e.g. see Thompson et al., 1994, fig. 2). As such, direct optimization is more effective at minimizing the number of substitutions and indels required to change from one sequence to another on the most-parsimonious tree(s). Hence, the implied alignment(s) (Wheeler, 2003) derived from direct optimization is more closely linked to its associated tree topology (or topologies, in some cases where two or more equally optimal alignments are found), and their bias in favour of this tree topology will be more pronounced than a progressive pairwise alignment that used the same topology to determine the pairwise alignment order. Among the two local alignment programs, we expected more extreme behaviour from DIALIGN-T than from DIALIGN 2 because DIALIGN-T uses overall similarity when determining the order of pairwise segment-to-segment alignments, whereas DIALIGN 2 relies exclusively upon local similarity.

Our second question was: when different process partitions that have different histories are aligned and phylogenetically analysed together, how congruent are the gene trees derived from the inferred alignments with the tree topology for the process partition with the greatest number of characters? Are they more or less congruent than the gene tree derived from the correct alignment? This question is relevant to determining the relative degree to which the different alignment methods allow different portions of the sequences to influence the alignment of each other. Our a priori hypothesis was that direct optimization would show the strongest influence, followed by the progressive alignment methods, followed by the simultaneous and local alignment methods. Among the two local alignment programs, we expected greater influence from DIALIGN-T than from DIALIGN 2 given DIALIGN-T's use of overall similarity when determining the order of pairwise segment-to-segment alignments.

Materials and methods

Simulations

Nucleotide sequences for four terminals in each replicate were simulated by using Rose ver..3 (Stoye et al., 1998). Settings used for all simulations were: average pairwise distance, ; choose from leaves, yes; transition bias, ; transition/transversion ratio,. The Jukes-Cantor substitution model (Jukes and Cantor, 1969) was used with equal nucleotide frequencies. So as to maintain a consistent number of nucleotides across all indel rates examined, the insertion and deletion probabilities and length distributions were set identically to each other. The indel length distribution included indels from to 2 bp long and was set at.,.25,.72,.38,.,.89,.72,.57,.46,.37,.3,.24,.9,.5,.2,.,.8,.6,.5 and.4 for indels from to 2 bp long. This indel length distribution is the exponential curve fitted by Simmons et al. (2007) to processed pseudogene distributions from the human nuclear genome (Gu and Li, 1995; Zhang and Gerstein, 2003). Note that this indel length distribution is unlikely to apply to rDNA, which is often difficult to align.
All simulated sequences consisted of bp, on average, for each of four terminals. Because only four terminals were used, any gapped positions are parsimony-uninformative for nucleotide characters (when gaps are treated as missing data), and are effectively excluded from the parsimony tree searches. Therefore, the inferred trees may be considered to be based on a conservative selection of characters. One hundred replicates were simulated for each set of simulation parameters used. As noted by an anonymous reviewer, this simulation, like others, relies on the assumption that both indels and substitutions are independent and identically distributed, which is known to be false for biological sequences (e.g. Haag-Liautard et al., 2008; Zhang et al., 2008). Failure to account for the full complexity of empirical data is an unavoidable limitation of simulation-based studies such as this one, although we do not expect this limitation to determine our results when comparing alignment methods relative to each other.

For the equal-branch-length simulations, the tree topology and relative branch lengths were set at: (seq1:, (seq2:, (seq3:, seq4: ):)). For the long-branch-attraction simulations, the tree topology and relative branch lengths were set at: (seq1:.55, (seq2: 2.27275, (seq3:.55, seq4: 2.27275):.55)). These relative branch lengths were set so as to lead to borderline long-branch attraction in trees inferred using the correct alignments. For both the equal-branch-length and the long-branch-attraction simulations, five substitution rates (.,.,.2,.3,.4) and five insertion and deletion rates (.,.5,.,.5,.2) were used for a total of 25 simulations each. The substitution rates used should result in averages of, 5,, 5 and 2 substitutions across the entire tree, while the indel rates used should result in averages of, 5 (25 insertions, 25 deletions),, 5 and 2 indels across the entire tree.
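
The statement above that gapped positions are parsimony-uninformative with four terminals can be checked mechanically. The following is a minimal Python sketch and not code from the study: the function names are hypothetical, and it assumes the standard definition of a parsimony-informative site (at least two states, each present in at least two terminals), with gaps and unknowns treated as missing data.

```python
from collections import Counter
from itertools import product

MISSING = {"-", "?"}  # assumption: gaps and unknowns are both missing data


def is_parsimony_informative(column):
    """Return True if an alignment column is parsimony-informative.

    Assumes the standard definition: at least two different states must
    each occur in at least two terminals. Missing data are ignored.
    """
    counts = Counter(state.upper() for state in column if state not in MISSING)
    shared_states = [state for state, n in counts.items() if n >= 2]
    return len(shared_states) >= 2


def any_informative_gapped_column(alphabet="ACGT", n_terminals=4):
    """Enumerate every column containing at least one gap and report
    whether any of them is parsimony-informative."""
    for column in product(alphabet + "-", repeat=n_terminals):
        if "-" in column and is_parsimony_informative(column):
            return True
    return False
```

With four terminals, a column containing a gap leaves at most three scored states, so two states cannot each occur twice; `any_informative_gapped_column()` confirms this by exhaustive enumeration.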
The higher substitution and indel rates were specifically chosen so as to represent difficult alignment problems. For highly similar sequences, all alignment programs were expected to perform nearly identically, which could have led to trivial results in this study.

For the different-histories simulations, two separate process partitions were simulated for each of the three sets of simulation parameters used. In the first set, the first partition consisted of an average of 7 bp and the second partition consisted of an average of 3 bp. In the second set, the first partition consisted of an average of 6 bp and the second partition consisted of an average of 4 bp. In the third set, both partitions consisted of an average of 5 bp. In both partitions, the substitution rate was set at. and the insertion and deletion rates were set at.2. This resulted in alignment-sensitive sequences given the moderately high substitution rate and the high indel rate. The tree topology and relative branch lengths for the first partition were (seq1:, (seq2:, (seq3:, seq4: ):)), whereas the tree for the second partition was (seq1:, (seq3:, (seq2:, seq4: ):)).

Alignments

Alignments of the nucleotide sequences were performed using Clustal W ver..83, DCA ver.., DIALIGN ver. 2.2., DIALIGN-T ver..2., MUSCLE ver. 3.6 and POY ver. 3... Identical alignment parameters, to the degree possible, were specified for all four global alignment programs. The default cost set for DCA (terminal gap, ; gap opening, 5; gap extension, 3; all substitutions, 7; match, ) was chosen as the baseline to which the costs of other programs were modified. These same costs were used for MUSCLE, except that terminal gaps were given half the cost of internal gaps, following the program's default setting (Edgar, 2004b, p. 7). In contrast to the other alignment programs, DCA uses the sum of the gap opening and gap extension costs for singleton gaps. As such, a singleton gap inserted by DCA would cost 8 rather than 5. Because of its greater gap cost, DCA may be expected to insert slightly fewer gaps than MUSCLE or POY 3.

Implied alignments were output by POY 3 using costs of: gap opening, 5; gap extension, 3; all substitutions, 7. With the gap extension cost just less than half the substitution cost, the triangle inequality (Wheeler, 1993) may be violated when longer gaps are inserted (Aagesen et al., 2005). But Aagesen et al. (2005; see also Aagesen, 2005) found non-metric gap costs to show maximal congruence between loci in two of the three empirical data sets that they examined. Terminal gaps were treated using the -noleading command, wherein leading and trailing gaps are counted and accounted for during builds and refinements (to prevent trivial alignments) but discounted when determining tree length (De Laet and Wheeler, 2003, p. 39). Searches were performed using replicates, with 5 builds per replicate and a maximum of ten trees held.

Varón and Wheeler (2008) described an error in POY 3 in which it can underestimate tree cost based on overlapping but non-homologous gaps when extension gap penalties are used. It is unclear what effect this would have, if any, on our results. Although POY 3 may be more likely to align non-homologous gaps at overlapping positions, we do not expect this to cause a bias against POY 3.
This is because gaps were treated as either missing data or simple indel coding wherein overlapping gaps with different endpoints are coded as separate characters rather than as fifth character states. In contrast, coding gaps as fifth states (e.g. Giribet and Wheeler, 1999) could have led to a bias when many indels were simulated.

Unfortunately, it is impossible to set identical alignment parameters between Clustal and POY (Ogden and Whiting, 2003) because Clustal relies on assigning positive scores for matches. Assigning zero for matches and negative scores for mismatches interfered with Clustal's use of the guide tree because all sequences had pairwise alignment scores of zero. The following costs were used for Clustal W (for both pairwise and multiple alignment): terminal gap, ; gap opening, 5; gap extension, 3; transition weight, ; match, +7. Although these settings maintained the proportional cost of matches and mismatches as well as the gap opening penalty relative to the gap extension penalty, they were problematic in that proportional alignment costs were not maintained with the other global alignment programs when examining alternative alignments because credit was given for matches rather than being taken away for mismatches. As such, this was a confounding effect in our simulations for the Clustal results and we rely primarily on MUSCLE when comparing progressive pairwise alignment with the alternative methods.

Both DIALIGN 2.2. and DIALIGN-T use raw counts of matches and mismatches; there is no differential transition/transversion cost (B. Morgenstern, pers. commun., 2005). As local alignment programs, they do not employ gap costs (Morgenstern et al., 1998). Unaligned nucleotides (output by the programs as lowercase letters) were re-scored as unknown (?) prior to indel-coding and phylogenetic inference.

For the different-histories simulations, two alternative approaches were used for alignment.
First, the two process partitions were concatenated and then aligned simultaneously with one another. Second, the two process partitions were aligned separately from one another and then concatenated. In addition to the degree of dependence among the process partitions, two other factors may account for differences between the two alternative approaches. First, in the simultaneous-alignment approach, it is possible for nucleotides from the different partitions (the 3′ end of the first partition and the 5′ end of the second partition) to be aligned at the same position. Second, in the separate-alignment approach, trailing gaps from the first partition and leading gaps for the second partition will be scored as separate characters for the analyses that incorporated gap characters.

Tree searches

Aligned sequences (as well as the correct alignments) were analysed using two approaches. Gaps were alternatively treated as missing data or scored as separate characters (for internal gaps flanked by aligned bases on both ends) using simple indel coding (SIC; Simmons and Ochoterena, 2) as implemented in SeqState (Müller, 2005). Tree searches were performed using equally weighted parsimony in PAUP* ver. 4.b (Swofford, 2). All most-parsimonious tree(s) were found using an exhaustive search and the strict consensus was calculated. Fifty per cent chance-of-deletion jackknife analyses (Farris et al., 1996) were performed using jackknife replicates with branch-and-bound searches within each replicate. Subsequently, 50, 75, 88, 94, 97, 98, 99 and 100% majority rule jackknife trees were calculated. The number of clades correctly resolved (relative to the tree topology used in the simulations) and the number of clades incorrectly resolved were determined by using PEST ver. 2.2 (Zujko-Miller and Miller, 2003). For the different-histories simulations, the tree topology for the first partition was used as the reference tree.

The overall success of resolution (OSR; the number of clades correctly resolved minus the number of clades incorrectly resolved; Simmons and Miya, 2004) was quantified by using the strict consensus of the most-parsimonious trees. The overall success of resolution quantifies performance of phylogenetic inference irrespective of branch support. The averaged overall success of resolution (aosr; the overall success of resolution averaged across bootstrap or jackknife trees so as to incorporate branch support; Simmons and Webb, 2006) was calculated by using the jackknife trees. By using a 50% chance-of-deletion jackknife analysis and averaging across the eight different (50-100%) output jackknife trees, the amount of support for each resolved clade was scaled to that provided by one to eight uncontradicted synapomorphies.

In contrast to the other alignment programs examined, it is possible for POY to find multiple equally optimal alignments. In cases where two or three equally optimal implied alignments were found (no cases of four or more equally optimal alignments were obtained), tree searches were performed on all alignments independently of one another. As a conservative approach, the strict consensus of the trees was used to calculate the OSR and aosr (i.e. for all four-sequence simulations where the inferred tree topologies were not identical). Two or three equally optimal implied alignments were output by POY 3 for 22 replicates. For all three replicates in which three equally optimal alignments were reported, the third alignment was found to be identical to the second alignment. The two equally optimal alignments reported in nine of the 22 replicates were identical.
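
The OSR and aosr definitions given above can be illustrated with a short sketch. This is a minimal Python example under stated assumptions, not the PEST implementation: the function names are hypothetical, clades are represented as frozensets of terminal names, the eight thresholds are taken to be 50, 75, 88, 94, 97, 98, 99 and 100%, and a clade is treated as resolved on a given majority-rule tree whenever its jackknife frequency meets that tree's threshold (conflicting clades are assumed never to reach the same threshold together).

```python
def overall_success_of_resolution(resolved_clades, true_clades):
    """OSR: clades correctly resolved minus clades incorrectly resolved."""
    resolved = set(resolved_clades)
    true = set(true_clades)
    correct = len(resolved & true)
    incorrect = len(resolved - true)
    return correct - incorrect


def averaged_osr(jackknife_support, true_clades,
                 thresholds=(50, 75, 88, 94, 97, 98, 99, 100)):
    """aosr: OSR averaged across the majority-rule jackknife trees.

    jackknife_support maps each clade (a frozenset of terminal names)
    to its jackknife frequency in per cent.
    """
    scores = []
    for threshold in thresholds:
        # Simplification: a clade appears on this tree if its frequency
        # is at least the threshold.
        resolved = [clade for clade, pct in jackknife_support.items()
                    if pct >= threshold]
        scores.append(overall_success_of_resolution(resolved, true_clades))
    return sum(scores) / len(scores)


# Hypothetical four-terminal example: the simulated tree groups seq3 + seq4.
TRUE_CLADES = [frozenset({"seq3", "seq4"})]
well_supported_correct = averaged_osr({frozenset({"seq3", "seq4"}): 96}, TRUE_CLADES)
moderately_supported_wrong = averaged_osr({frozenset({"seq2", "seq4"}): 80}, TRUE_CLADES)
```

In this sketch a correct clade with 96% jackknife support appears on four of the eight trees and scores 0.5, whereas an incorrect clade with 80% support appears on two trees and scores -0.25, which shows how the aosr penalizes supported errors and rewards supported correct resolution.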
Of the remaining 16 replicates for which two different alignments were reported: four resulted in conflicting topologies for both gaps treated as missing data or coded as separate characters; two resulted in unresolved trees for gaps treated as missing data, but conflicting trees for gaps coded as separate characters; one resulted in conflicting trees for gaps treated as missing data, but identical topologies for gaps coded as separate characters; and three resulted in unresolved trees for both gap codings. The final six replicates did not conflict with one another; one of the two alignments resulted in more resolution and/or jackknife support than the other.

POY 4 analyses

In addition to our use of POY 3 to generate implied alignments that were analysed with equal weighting among characters and character states, we also performed a second set of analyses using POY ver. 4... The POY 4 analyses were run entirely within POY 4 using direct optimization rather than conducting tree searches in PAUP* based on implied alignments. The cost set applied for most other programs (gap opening, 5; gap extension, 3; all substitutions, 7) is non-metric and can violate the triangle inequality (Wheeler, 1993) in extreme circumstances (e.g. favouring two 3-bp indels at a cost of 26 instead of 3 substitutions at a cost of 27). To avoid non-metric cost sets, a similar but metric cost set (gap opening, 3; gap extension, 7; all substitutions, 3) was used instead. Note that the cost set used by POY 4 for direct optimization results in a positive bias for POY 4 when compared with the trees inferred using equal character and character-state weighting from alignments generated by all other programs. Upweighting gaps over substitutions is an advantage for POY 4 given that an average of 2 substitutions were simulated across the bp whereas an average of only 2 indels were simulated.
Likewise, upweighting longer gaps is an advantage for POY 4 given that the simulated indel length distribution is an exponential curve with many short gaps and few long gaps. Because of these biases in favour of POY 4, we consider the POY 3 results to be more appropriate for comparison with other alignment programs whereas the POY 4 results are most appropriately restricted to comparisons with other POY 4 results.

POY 4 analyses were performed to infer the optimal tree(s), static jackknife support (i.e. using implied alignments), static Bremer (1988) support and dynamic Bremer support. Optimal direct optimization searches were conducted using the standard direct optimization algorithm with branch-and-bound searches, storing up to optimal trees [set (exhaustive_do); build (branch_and_bound, trees:)] and reporting the strict consensus tree [report (consensus)]. Static jackknife support was then calculated by using implied alignments [transform (static_approx)], running a second, identical search for the optimal tree(s) and performing 5%-probability-of-deletion jackknife replicates and identical branch-and-bound searches within each replicate [calculate_support (jackknife: (resample:, remove:5), build (branch_and_bound, trees:))]. Dynamic jackknife support was not calculated because only one or two process partitions were simulated in this study and POY 4 conducts resampling at the fragment (locus) level (Tëmkin et al., 2009). Static Bremer support was calculated after the optimal searches by using implied alignments, running a second, identical search for the optimal tree(s) and then calculating support using random-addition trees, holding up to optimal trees and joining at all possible positions during branch swapping [calculate_support (bremer, build (), swap (trees:, all))]. Dynamic Bremer support was calculated after the optimal searches by using random-addition trees, holding up to optimal trees and joining at all possible positions during branch swapping.

Three types of unexpected results were encountered when analysing static and/or dynamic Bremer support values. First, positive dynamic Bremer support values were occasionally observed for clades that were unresolved in the strict consensus trees. These support values were changed to zero. Second, negative dynamic Bremer support values were occasionally observed. These negative support values suggest that a more optimal tree had been found in the searches that were used to calculate Bremer support values than in the original optimal-tree search. Replicates for which negative dynamic Bremer support values were obtained were removed from the statistical analyses. Third, infinite dynamic and static Bremer support values were frequently observed, particularly in those simulations with high nucleotide substitution and/or indel rates. Given that these values made up the large majority of replicates for several simulations, we could not simply delete the replicates in question. Rather, we set them to the maximum positive value observed among the non-infinite replicates for that set of simulation parameters.

Note that the POY 4 results for the different-histories simulations cannot be directly compared with those from other programs because it is impossible to align the two partitions completely independently of each other in POY 4 (i.e. so as not to enforce the same tree topology for each process partition) without exporting implied alignments and analysing them using a different tree-search program such as PAUP*, as we did with POY 3.

Statistical analyses

Mean performance of the alignment methods was measured by using the OSR separately from the aosr.
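
The three adjustments to Bremer support values described in the POY 4 analyses can be summarized in a short sketch. This is a minimal Python illustration rather than the procedure as actually scripted in the study: the function name and the input layout (one support value per replicate, paired with a flag for whether the clade was resolved in the strict consensus) are hypothetical.

```python
import math


def clean_bremer_values(replicates):
    """Apply the three adjustments to one set of simulation replicates.

    replicates: list of (support, resolved_in_strict_consensus) tuples,
    all from the same set of simulation parameters.
    """
    kept = []
    for support, resolved in replicates:
        if support < 0:
            # Negative support: the replicate is removed from the analyses.
            continue
        if not resolved and support > 0:
            # Positive support for an unresolved clade is changed to zero.
            support = 0.0
        kept.append(support)

    finite = [s for s in kept if math.isfinite(s)]
    if not finite:
        raise ValueError("no non-infinite replicates to take a maximum from")
    ceiling = max(finite)
    # Infinite support is set to the maximum value observed among the
    # non-infinite replicates for this set of simulation parameters.
    return [ceiling if math.isinf(s) else s for s in kept]


example = clean_bremer_values([
    (3.0, True), (math.inf, True), (-2.0, True), (5.0, False), (7.0, True),
])
```

In the example, the negative replicate is dropped, the unresolved clade's value becomes zero, and the infinite value is replaced by 7.0, the largest finite value remaining.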
Means were calculated by using gaps treated as missing data separately from gaps coded using SIC, and the equal-branch-length simulations separately from the long-branch-attraction simulations. Means were calculated across all substitution and indel rates. Because the data were non-normally distributed, the non-parametric Kruskal-Wallis test and non-parametric van der Waerden test were used to compare multiple means. These two tests differ in their power when applied to different error distributions. The error distributions were not easy to calculate given the number of tests that were needed, so the more conservative P-value produced from both tests was used to determine if a comparison was significant. A significant test of multiple means determined that at least some of the alignment methods differed in performance. If the multiple-means test was significant, pairwise comparisons were then used to determine which of the alignment methods differed in performance. In order to limit the number of pairwise comparisons made, not all pairwise comparisons were performed. Instead, pairwise comparisons in order of increasing mean performance were performed with a Bonferroni correction for multiple tests. The POY 4 results were included in these comparisons. Because the means for POY 4 using static Bremer support and dynamic Bremer support were adjacent to one another, our pairwise comparisons allow us to compare static and dynamic Bremer support results. All comparisons were performed in JMP IN (SAS Institute, Inc., Cary, NC, USA).

The multiple-comparison-followed-by-pairwise-comparisons approach described above was also applied only to the simulations where the substitution rate was zero and the indel rate was greater than zero. These tests were limited to the gaps-treated-as-missing-data results.
These tests were applied to determine the relative susceptibility of the different alignment programs to creating phylogenetic signal based on inferred substitutions where there is none for alignment-variable sequences.

The same multiple-comparison-followed-by-pairwise-comparisons approach was used to compare the potential for differences in the aosr in the different-histories simulations. Initially, the differences in aosr were calculated as the aosr when the two partitions were aligned separately from one another subtracted from the aosr when the two partitions were aligned together with each other. Calculating the difference in this way measures the degree to which the different alignment programs allow different portions of the sequences to influence the alignment of each other. Specifically, this approach is based on the expectation that the longer partition would influence the alignment of the shorter partition. The three simulations with different proportions of the two process partitions (7 3% or 6 4%) or equal proportions of the two process partitions (5 5%) were compared with each other in a series of pairwise comparisons. The multiple-comparison test was then performed followed by a single pairwise comparison of MUSCLE vs. POY 3, which appeared by eye to have two of the greatest differences (see supplementary Fig. S3). No significant difference was found. We inferred that if this pairwise comparison was non-significant then other comparisons with smaller differences would probably also be non-significant.

Following further exploration of the data, we inferred that the difference in the aosr should be calculated as the value of the aosr when the two partitions were aligned separately from one another subtracted from the value of the aosr when the two partitions were aligned simultaneously with each other and then taking the absolute value of the result. This approach is not based on which partition would influence the alignment of the other partition.
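
The multiple-comparison-followed-by-pairwise-comparisons approach can be sketched as follows. This is a minimal Python illustration under stated assumptions and not the JMP IN analysis itself: the function names are hypothetical, the P-values are supplied by the caller rather than computed, the significance level of 0.05 is an assumption (the text does not state one), and "pairwise comparisons in order of increasing mean performance" is interpreted as comparing only methods whose means are adjacent in rank.

```python
def conservative_p(p_kruskal_wallis, p_van_der_waerden):
    """Use the larger (more conservative) of the two omnibus P-values."""
    return max(p_kruskal_wallis, p_van_der_waerden)


def adjacent_pairs_by_mean(means):
    """Order methods by increasing mean and pair each with its neighbour."""
    ordered = sorted(means, key=means.get)
    return list(zip(ordered, ordered[1:]))


def significant_adjacent_pairs(means, omnibus_p, pairwise_p, alpha=0.05):
    """Return the adjacent pairs that differ after Bonferroni correction.

    means: dict mapping method name to mean OSR or aosr.
    omnibus_p: conservative P-value from the multiple-means tests.
    pairwise_p: dict mapping (method_a, method_b) to a pairwise P-value.
    alpha: assumed significance level (not stated in the text).
    """
    if omnibus_p >= alpha:
        return []  # no pairwise tests unless the omnibus test is significant
    pairs = adjacent_pairs_by_mean(means)
    if not pairs:
        return []
    corrected_alpha = alpha / len(pairs)  # Bonferroni correction
    return [pair for pair in pairs if pairwise_p[pair] < corrected_alpha]
```

Restricting the tests to rank-adjacent methods keeps the number of comparisons at one fewer than the number of methods, so the Bonferroni-corrected threshold is less severe than it would be for all possible pairs.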
Means were calculated for each of the three different-histories simulations independently of each other, and our multiple-comparison-followed-by-pairwise-comparisons approach was applied. These tests were applied to the gaps-treated-as-missing-data results separately from the gaps-coded-using-SIC results.

Results

The results for the equal-branch-length simulations are presented in Table 1 and Figs 1 and S1; the raw data are available at: http://www.biology.colostate.edu/Research/. When comparing DCA (simultaneous alignment), Clustal and MUSCLE (progressive alignment) and POY 3 (direct optimization followed by implied alignments that were analysed using equal weighting) using the OSR, Clustal and DCA outperformed POY 3, which performed at least as well as MUSCLE. Yet, when using the aosr, Clustal and POY 3 outperformed DCA, which performed at least as well as MUSCLE. Direct optimization using internal weighting as implemented in POY 4 was outperformed by Clustal and DCA with respect to both OSR and aosr, although it outperformed MUSCLE with respect to aosr. When comparing the local alignment methods with each other, DIALIGN 2 consistently outperformed DIALIGN-T with respect to both the OSR and the aosr by a large margin (.3).

The results for the long-branch-attraction simulations are presented in Table 2 and Figs 2 and S2. All of the global alignment methods were outperformed by the correct alignment, as measured by both the OSR and the aosr. DCA performed at least as well as MUSCLE, which performed at least as well as POY 3, which outperformed Clustal according to both the OSR and the aosr. Direct optimization using internal weighting in POY 4 outperformed the global alignment methods with respect to OSR, but was outperformed by both DCA and MUSCLE when measured by the aosr. When comparing the local alignment methods with each other, no significant differences were detected between DIALIGN 2 and DIALIGN-T according to either the OSR or the aosr.
Both local alignment methods consistently outperformed all global alignment methods.

The results for the different-histories simulations are presented in Table 3 and Figs S3 and S4. The difference in the aosr was calculated as the value of the aosr when the two partitions were aligned separately from one another subtracted from the value of the aosr when the two partitions were aligned simultaneously with each other and then taking the absolute value of the result (Table 3, Fig. S4). Although POY 3 showed relatively little difference in the aosr in the 70/30% partition, it showed the highest differences in the 60/40% and 50/50% partitions (.23 and .37 greater aosr differences than all other alignment programs) and was followed by Clustal for both of those partitions. The results were consistent regardless of whether gaps were treated as missing data or coded using SIC. Between the two local alignment methods, DIALIGN-T was generally found to show larger differences in the aosr than DIALIGN 2 and typically by a large margin (.2.8).

The results for the analyses in which gaps were treated as missing data and comparisons were restricted to the simulations in which the substitution rate was zero and the indel rate was greater than zero are presented in Table 4 and Figs 1, 2, S1 and S2. All inferred alignments outperformed the correct alignment in the equal-branch-length simulations and were outperformed by the correct alignment in the long-branch-attraction simulations. Clustal showed the most extreme behaviour and performed far worse than all other alignment programs in the long-branch-attraction simulations (-.28 OSR, -.3 aosr).

In the POY 4-specific Bremer-support analyses, static Bremer support significantly (P <.) and dramatically outperformed dynamic Bremer support (means of 348 vs 686) in the equal-branch-length simulations. In contrast, dynamic Bremer support significantly (P <.) and dramatically outperformed static Bremer support (means of -484 vs -78) in the long-branch-attraction simulations.

Table 1. The relative performance of each of the different alignment methods as measured by the overall success of resolution (OSR) and the averaged overall success of resolution (aosr) in the equal-branch-length simulations

Alignment method    OSR                             aosr
                    Gaps = missing   Gaps = SIC     Gaps = missing   Gaps = SIC
Clustal             .95 1st          .95 1st        .9 1st           .92* 2nd
DCA                 .94 1st          .96 1st        .8* 2nd          .89* 4th
DIALIGN 2           .84 3rd          .88 2nd        .75* 4th         .8 7th
DIALIGN-T           .5* 5th          .57* 3rd       .4* 5th          .46* 8th
MUSCLE              .86* 3rd         .88 2nd        .75 4th          .83*‡ 6th
POY 3               .9 2nd           .9* 2nd        .85 1st          .9* 3rd
POY 4 internal      N/A              .82 2nd        N/A              .83*‡ 5th
Correct             .8* 4th          .96 1st        .8* 3rd          .95 1st

Statistically significant differences among alignment methods are indicated by different ranks. SIC, simple indel coding; N/A, not applicable. *Significant at P =. after Bonferroni correction relative to the next highest mean. Significant at P =.5 after Bonferroni correction relative to the next highest mean. ‡The difference reported between these two rounded values is statistically significant but not necessarily biologically meaningful.

Table 2. The relative performance of each of the different alignment methods as measured by the overall success of resolution (OSR) and the averaged overall success of resolution (aosr) in the long-branch-attraction simulations

Alignment method    OSR                             aosr
                    Gaps = missing   Gaps = SIC     Gaps = missing   Gaps = SIC
Clustal             -.7* 4th         -.7* 5th       -.6* 4th         -.62 6th
DCA                 -.47 2nd         -.56* 3rd      -.25 2nd         -.3* 2nd
DIALIGN 2           -.9 1st          -.9 1st        -.9 1st          -. 1st
DIALIGN-T           -.6 1st          -.28 1st       -.5 1st          -.8 1st
MUSCLE              -.44* 2nd        -.63* 4th      -.23* 2nd        -.39* 3rd
POY 3               -.6* 3rd         -.68 4th       -.48* 3rd        -.57 5th
POY 4 internal      N/A              -.45* 2nd      N/A              -.42* 4th
Correct             -.6 1st          -.23 1st       -.7 1st          -.9 1st

Statistically significant differences among alignment methods are indicated by different ranks. SIC, simple indel coding; N/A, not applicable. *Significant at P =. after Bonferroni correction relative to the next highest mean. Significant at P =.5 after Bonferroni correction relative to the next highest mean.

Fig. 1. The averaged overall success of resolution in trees inferred from the different alignments in the equal-branch-length simulations: (a) correct alignments, (b) DCA alignments, (c) MUSCLE alignments, (d) POY 3 implied alignments, (e) Clustal alignments, (f) DIALIGN 2 alignments, (g) DIALIGN-T alignments, (h) POY 4 trees. Each panel plots the averaged overall success of resolution against the average number of indels simulated, with separate series for each substitution rate and gap treatment (missing or SIC).

Discussion

Performance in the single-history simulations

When gaps were treated as missing data in the equal-branch-length simulations, most or all of the global alignment methods (Clustal, DCA, MUSCLE and POY 3) outperformed the correct alignment, as measured by both the OSR and the aosr (Table 1). In contrast, when gaps were coded as characters, none of the alignment methods outperformed the correct alignment, as measured by either the OSR or the aosr. Together, these two results suggest that those global alignment programs increased the apparent phylogenetic signal of nucleotide characters at the expense of gap characters under the alignment parameters applied. The outperformance of the correct alignment by most or all of the global alignment methods when gaps were treated as missing data cannot be ascribed to tree-search heuristics because exact tree searches were performed in PAUP*.

Our hypothesis that direct optimization would show the most extreme behaviour, followed by the progressive alignment methods, followed by the simultaneous and local alignment methods, was tested in the equal-branch-length simulations and the long-branch-attraction simulations. This hypothesis was generally corroborated when the comparisons were performed using the aosr, in which direct optimization followed by equal character weighting applied to implied alignments (POY 3) showed the most extreme behaviour among the alignment methods (except progressive alignment as implemented in Clustal, for which it is impossible to set directly comparable alignment parameters with the other global alignment programs) between the two sets of simulations. Contra our hypothesis, however, simultaneous alignment (DCA) performed at least as well as progressive alignment (as implemented in MUSCLE) in both sets of simulations.
Furthermore, when comparisons were made using the OSR, simultaneous alignment performed at least as well as MUSCLE and outperformed POY 3 in both sets of simulations.

Progressive alignment (as implemented in MUSCLE) showed more extreme behaviour than local alignment (as implemented in DIALIGN 2), performing at least as well as local alignment in the equal-branch-length simulations, yet progressive alignment was outperformed by local alignment in the long-branch-attraction simulations. Although the differences may be attributable in part to progressive alignment using a guide tree whereas local alignment does not, a confounding factor is the exclusion by DIALIGN 2 of regions that it was unable to align. Based on the generally fewer parsimony-informative characters and shorter tree lengths produced by DIALIGN 2 relative to the global alignment programs (supplementary raw data at: http://www.biology.colostate.edu/research/), this exclusion often accounted for a large proportion of the simulated nucleotides.

Our hypothesis that DIALIGN-T would show more extreme behaviour than DIALIGN 2 was not supported. DIALIGN 2 consistently outperformed DIALIGN-T with respect to both the OSR and the aosr in the equal-branch-length simulations, and no significant differences were found between the two local alignment programs in the long-branch-attraction simulations. These results suggest that, of the many changes made in DIALIGN-T (Subramanian et al., 2005) relative to the older DIALIGN 2, not all were advantageous for nucleotide-based alignment. Based on the fewer parsimony-informative characters and shorter tree lengths produced by DIALIGN-T relative to DIALIGN 2 (supplementary raw data), this appears to be a consequence of DIALIGN-T aligning fewer variable nucleotides at the same positions.

The different way in which Clustal treats nucleotide matches and differences relative to the other global alignment methods (i.e. assigning positive scores for matches rather than just assigning costs for mismatches and gaps) confounds comparisons of Clustal's results with those from DCA, MUSCLE and POY. But it is unproblematic to compare Clustal's results across simulations and with the correct alignment. These comparisons demonstrate that Clustal showed the predicted pattern for progressive alignment methods in that it performed very well in the equal-branch-length simulations, yet very poorly in the long-branch-attraction simulations.

Fig. 2. The averaged overall success of resolution in trees inferred from the different alignments in the long-branch-attraction simulations: (a) correct alignments, (b) DCA alignments, (c) MUSCLE alignments, (d) POY 3 implied alignments, (e) Clustal alignments, (f) DIALIGN 2 alignments, (g) DIALIGN-T alignments, (h) POY 4 trees. Each panel plots the averaged overall success of resolution against the average number of indels simulated, with separate series for each substitution rate and gap treatment (missing or SIC).

Table 3. The relative performance of each of the different alignment methods as measured by the difference in the averaged overall success of resolution (aosr) in the different-histories simulations

Alignment method    70% vs 30%                      60% vs 40%                      50% vs 50%
                    Gaps = missing   Gaps = SIC     Gaps = missing   Gaps = SIC     Gaps = missing   Gaps = SIC
Clustal             .34              .33 1st        .45 2nd          .45* 2nd       .53* 2nd         .53* 2nd
DCA                 .3               .27 1st        .35 3rd          .39* 3rd       .4* 3rd          .4 3rd
DIALIGN 2           .7               .4 2nd         .9 3rd           .9 3rd         .9* 4th          .8* 4th
DIALIGN-T           .35              .3 1st         .35* 3rd         .3 3rd         .33 3rd          .3 3rd
MUSCLE              .3               .35 1st        .3 3rd           .36 3rd        .33 3rd          .43* 3rd
POY 3               .26              .2 2nd         .65 1st          .68 1st        .79 1st          .9 1st

Statistically significant differences among alignment methods are indicated by different ranks. SIC, simple indel coding. *Significant at P =. after Bonferroni correction relative to the next highest mean. Significant at P =.5 after Bonferroni correction relative to the next highest mean.

Table 4. The relative performance of each of the different alignment methods as measured by the overall success of resolution (OSR) and the averaged overall success of resolution (aosr) in the simulations where gaps were treated as missing data

Alignment method    Equal-branch-length simulations    Long-branch-attraction simulations
                    OSR            aosr                OSR            aosr
Clustal             .99 1st        .92 1st             -.68* 4th      -.44* 4th
DCA                 .9 2nd         .55* 3rd            -.6 2nd        -.3 2nd
DIALIGN 2           .89 2nd        .7* 2nd             -.23 2nd       -.9 2nd
DIALIGN-T           .89 2nd        .68 2nd             -.7* 2nd       -.* 2nd
MUSCLE              .78* 3rd       .45* 4th            -.2 2nd        -.4 2nd
POY 3               .9* 2nd        .67 2nd             -.4* 3rd       -.3* 3rd
Correct             * 4th          * 5th               1st            1st

Means were taken across all four simulations in which the substitution rate was zero and the indel rate was greater than zero. Statistically significant differences among alignment methods are indicated by different ranks. *Significant at P =. after Bonferroni correction relative to the next highest mean.

Performance in the different-histories simulations

Our hypothesis that direct optimization would show the strongest influence when different process partitions that have different histories are aligned and phylogenetically analysed together, followed by the progressive alignment methods, followed by the simultaneous and local alignment methods, was partially supported. Our initial statistical tests were based on two assumptions. First, there would be more influence between the partitions when they comprised differential proportions of the two process partitions than when the two partitions comprised equal proportions of the process partitions. Second, the minority partition would be affected by the majority partition, therefore providing greater support for the tree topology on which the majority partition was simulated. There was no statistical support for these assumptions when comparing the two programs that appeared by eye to have two of the greatest differences. Our second set of statistical tests was not based on either of those two assumptions and indicated significant differences between direct optimization and the other alignment methods for two of the three partitions. For the 70/30% partition, POY 3 appeared to show a low difference in aosr simply because it frequently saturated the measure (e.g. obtaining the maximum value in 96 of the replicates for simultaneous alignment with gaps coded using SIC).
Taken together, these results suggest that direct optimization may produce this artefact even when two process partitions with different histories contain about the same amount of phylogenetic signal. Furthermore, even when one process partition contains more expected phylogenetic signal than another, the direction in which the bias will be expressed is not completely predictable (i.e. direct optimization could favour the process partition with less expected phylogenetic signal over the process partition with greater expected phylogenetic signal), even when the process of molecular evolution acts identically on the two process partitions.

There was no statistical support for our hypothesis that progressive alignment, as implemented in MUSCLE, would be more susceptible to this artefact than simultaneous alignment. Between the two local alignment programs, our hypothesis that DIALIGN-T, which uses overall similarity when determining the order of pairwise segment-to-segment alignments, would be more susceptible to this artefact than DIALIGN 2, which only uses local similarity, was generally supported.

Artificial creation of phylogenetic signal and gaps

For alignment-variable sequences in which no substitutions occurred, there are no variable, let alone parsimony-informative, characters when the correct alignment is used and gaps are treated as missing data. Yet all of the alignment programs created alignments from which trees with positive OSR and aosr were inferred in the equal-branch-length simulations and trees with negative OSR and aosr were inferred in the long-branch-attraction simulations. Therefore, all of these alignment programs (including the tree-independent programs DCA and DIALIGN 2) are susceptible to creating phylogenetic signal based on inferred substitutions where there is none for alignment-variable sequences, and to doing so in a biased manner.
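The point that indel-only sequences carry no variable characters under the correct alignment, yet can acquire them when misaligned, can be illustrated with a toy case. The sequences, the two alignments and the function names below are hypothetical and are not taken from the simulations; the sketch assumes that gaps ("-") and "?" are simply ignored when scoring a column, which is one way of treating gaps as missing data.

```python
from collections import Counter


def observed_states(alignment, column):
    """Nucleotides in one column, ignoring gaps and missing data."""
    return [seq[column] for seq in alignment if seq[column] not in "-?"]


def count_variable(alignment):
    """Columns with more than one observed nucleotide."""
    return sum(
        len(set(observed_states(alignment, i))) > 1
        for i in range(len(alignment[0]))
    )


def count_parsimony_informative(alignment):
    """Columns with at least two nucleotides that each occur at least twice."""
    total = 0
    for i in range(len(alignment[0])):
        counts = Counter(observed_states(alignment, i))
        if sum(n >= 2 for n in counts.values()) >= 2:
            total += 1
    return total


# Hypothetical history: ancestor ACGT, no substitutions.
# Two sequences lost the C, one sequence lost the G.
correct = ["ACGT", "ACGT", "A-GT", "A-GT", "AC-T"]

# Same sequences, but the two adjacent gaps are placed in the same column.
misaligned = ["ACGT", "ACGT", "AG-T", "AG-T", "AC-T"]

print(count_variable(correct), count_parsimony_informative(correct))
print(count_variable(misaligned), count_parsimony_informative(misaligned))
```

In this toy case the correct alignment yields no variable characters, whereas the misalignment yields one column that is both variable and parsimony-informative even though no substitution occurred.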
As noted by an anonymous reviewer, this may be caused by mis-aligning adjacent and overlapping gaps at identical positions, as demonstrated by Golubchik et al. (2007). Among the global alignment programs in which nearly identical alignment parameters were specified, POY 3 was at least as susceptible to creating these determinate artefacts as DCA, which was at least as susceptible as MUSCLE.

POY 4 analyses

In the POY 4-specific Bremer-support analyses, static Bremer support showed more extreme behaviour than dynamic Bremer support when comparing the equal-branch-length simulations and the long-branch-attraction simulations. This result reiterates the importance of calculating branch-support values for direct optimization in the context of dynamic, rather than static, homology. As reviewed by Simmons et al. (2008, p. 47), implied alignments are frequently used in empirical studies by POY users to calculate branch support (e.g. Grant et al., 2006; Arango and Wheeler, 2007; Janies et al., 2008; Sharma and Giribet, 2009). Furthermore, the POY 4..2. program documentation (Tëmkin et al., 2009, p. 55) does not recommend that dynamic bootstrapping or jackknife analyses even be applied. Unless investigators are prepared to restrict their analyses to Bremer support or conduct bootstrap or jackknife analyses at the fragment rather than the nucleotide level, which is not recommended in the POY 4 manual, users of POY 4 are limited to inferring branch support using implied alignments.

Criticisms of the similarity criterion for alignment

De Laet (2005, p. 4) provided a hypothetical example (partially reproduced here as Fig. 3a) in which alignments based on similarity following Zurawski and Clegg (1987) and Simmons (2004) lead to rejection of a tree that the data cannot distinguish from a tree that it accepts. That is, aligning by similarity would deterministically select one alignment tree over another whereas direct optimization would not. De Laet (2005, p. 4) ascribed this result to the similarity procedure select[ing] the tree that has the higher amount of homoplasious similarity than the higher amount of homologous similarity. Indeed, aligning by similarity does not distinguish between primary homology assessments that are supported as secondary homology assessments at the level that they were originally proposed (de Pinna, 1991), or not. De Laet's (2005) criticism also applies to conventional coding of morphological characters (e.g. Pimentel and Riggins, 1987).

It is also true that aligning by similarity would favour the alignment that De Laet (2005) presented in his fig. 6.8b over that presented in his fig. 6.8c. But the example that De Laet (2005, p. 4) used is not sufficient to demonstrate that aligning by similarity leads to rejection of a tree that the data cannot distinguish from a tree that it accepts. In order to make that conclusion, one must consider all equally optimal alignments, not just the one presented by De Laet in his fig. 6.8b. When all three equally optimal alignments of six differences each (Fig. 3b) are taken into account such that ambiguously aligned positions are scored as "?" in the data matrix (Fig. 3c), both of the trees presented by De Laet (2005, p. 3) are equally parsimonious.

(a)
O) TCCA
A) C
E) CG
F) GC
G) ACA
H) GG

(b)
O) TCCA   TCCA   TCCA
A) C--    -C--   --C-
E) CG--   -CG-   -CG-
F) GC--   -GC-   -GC-
G) ACA    -ACA   -ACA
H) GG--   -GG-   -GG-

(c)
O) TCCA
A) -??-
E) ???-
F) ???-
G) -ACA
H) ?G?-

Fig. 3. Ambiguous-alignment example from De Laet (2005, p. 3): (a) unaligned sequences for ambiguously aligned region, (b) three equally optimal alignments based on the similarity criterion, (c) appropriate character coding that takes into account the ambiguity amongst the three equally optimal alignments. Sequences A to D are identical in De Laet's (2005) example; only sequence A is shown here.

Kluge and Grant (2006, p. 286) criticized use of the similarity criterion for alignment, calling the approach "phenetic". To support their assertion, they took the hypothetical example from Simmons (2004) fig. and presented it as their fig. 2. Although both of the alignments presented in Kluge and Grant's (2006, p. 28) fig. 2 are equally optimal based on the similarity criterion (both have nine differences), they assert that the alignment in their fig. 2b, which leads to a less parsimonious tree than the alignment in their fig. 2a, is "The optimal alignment obtained a priori, where the characters are treated as objects and aligned according to similarity". In fact, there are three equally optimal alignments per direct repeat. Given that the alignment of each of the four repeats is independent of the alignment of the other repeats when using the similarity criterion, there are 3 × 3 × 3 × 3 = 81 equally optimal alignments for this example. The three equally optimal alignments per direct repeat, each of which entails two differences, are:

(1) CA-G   CA-G   C-AG
(2) CATG   CATG   CATG
(3) CT-G   C-TG   C-TG

Problems with direct optimization

Rather than using similarity for alignment, both De Laet (2005) and Kluge and Grant (2006) preferred direct optimization. Negative consequences of direct optimization have been demonstrated by Simmons (2004) using contrived examples, Simmons et al. (2008) for sensitivity to alignment parameters, Yoshizawa (2010) for a contaminant sequence, Simmons et al. (unpublished data) using randomly generated sequences, and in this paper using simulations. In all of these cases, the problems are caused by direct optimization maximizing dependence of different positions in the alignment. As stated by Morrison (2009, p. 3), "Shuffling the character states while simultaneously building the phylogenetic tree means that congruence among the characters becomes part of the definition of the characters rather than a test of them." In contrast to direct optimization, dependence among positions is minimized by simultaneously aligning all sequences for each locus independent of the other loci using the similarity criterion. Using secondary structure
Given that the alignment of each of the four repeats is independent of the alignment of the other repeats when using the similarity criterion, there are 3 3 3 3 = 8 equally optimal alignments for this example. The three equally optimal alignments per direct repeat, each of which entails two differences, are: () CA-G CA-G C-AG (2) CATG CATG CATG (3) CT-G C-TG C-TG Problems with direct optimization Rather than using similarity for alignment, both De Laet (25) and Kluge and Grant (26) preferred direct optimization. Negative consequences of direct optimization have been demonstrated by Simmons (24) using contrived examples, Simmons et al. (28) for sensitivity to alignment parameters, Yoshizawa (2) for a contaminant sequence, Simmons et al. (unpublished data) using randomly generated sequences, and in this paper using simulations. In all of these cases, the problems are caused by direct optimization maximizing dependence of different positions in the alignment. As stated by Morrison (29, p. 3), Shuffling the character states while simultaneously building the phylogenetic tree means that congruence among the characters becomes part of the definition of the characters rather than a test of them. In contrast to direct optimization, dependence among positions is minimized by simultaneously aligning all sequences for each locus independent of the other loci using the similarity criterion. Using secondary structure