Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Molecular Phylogenetics and Evolution 31 (2004) 865–873
www.elsevier.com/locate/ympev

Helen Piontkivska *
Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, USA
Received 14 October 2002; revised 21 September 2003

Abstract

Choice of a substitution model is a crucial step in the maximum likelihood (ML) method of phylogenetic inference, and investigators tend to prefer complex mathematical models to simple ones. However, when complex models with many parameters are used, the extent of noise in statistical inferences increases, and thus complex models may not produce the true topology with a higher probability than simple ones. This problem was studied using computer simulation. When the number of nucleotides used was relatively large (1000 bp), the HKY + Γ model showed a smaller d_T (topological distance between the inferred and the true trees) than the JC and Kimura models. In the case of shorter sequences (300 bp), a simpler model and search algorithm, such as the JC model and SA + NNI search, were found to be as efficient as more complicated searches and models in terms of topological distances, although the topologies obtained under the HKY + Γ model had the highest likelihood values. The performance of the relatively simple search algorithm SA + NNI was found to be essentially the same as that of the more extensive SA + TBR search under all models studied. Similarly to the conclusions reached by Takahashi and Nei [Mol. Biol. Evol. 17 (2000) 1251], our results indicate that simple models can be as efficient as complex models, and that the use of complex models does not necessarily give more reliable trees than the use of simple models.
© 2003 Elsevier Inc. All rights reserved.

Keywords: Maximum likelihood; Nucleotide substitution model; Phylogenetic tree; Topological distance

* Present address: Department of Biological Sciences, University of South Carolina, 501 Coker Building, 700 Sumter Street, Columbia, SC 29208, USA. Fax: 1-803-777-4002. E-mail address: elena@biol.sc.edu.
¹ Abbreviations used: JC, Jukes and Cantor; HKY, Hasegawa, Kishino, and Yano; Γ, gamma; bp, base pairs; CR, constant rate; VR, varying rate; ML, maximum likelihood; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

1. Introduction

In the maximum likelihood (ML)¹ method of phylogenetic inference, the likelihood of observing a given set of sequence data under a specific substitution model is maximized for each topology, and the topology with the highest maximum likelihood is chosen as the final tree (Felsenstein, 1981; Nei and Kumar, 2000). Construction of ML trees is extremely time-consuming, especially when complex substitution models are used. Although there are heuristic algorithms that speed up computation (e.g., the fastDNAml method (Olsen et al., 1994), the NJML method (Ota and Li, 2000, 2001), and TrExML (Wolf et al., 2000)), the computational time required is still substantial (Lemmon and Milinkovitch, 2002; Rogers and Swofford, 1998; Salter, 2001). Because the actual pattern of nucleotide substitutions is very complicated, many investigators tend to use complex, and therefore time-consuming, substitution models rather than simple ones (Hedin and Maddison, 2001; Posada and Crandall, 2001; Reyes et al., 2000; Rice et al., 1997). However, the probability of getting the true topology does not depend on the computational time, and the use of complex models may not produce the true topology with a higher probability than the use of simpler ones (Nei et al., 1998; Sullivan and Swofford, 2001; Takahashi and Nei, 2000).
It has been shown that when the number of nucleotides relative to the number of sequences used is small, a simple model such as the Jukes-Cantor (JC) model shows almost the same or even better performance than more complex models such as the Hasegawa, Kishino, and Yano + Gamma (HKY + Γ) model with which the simulated sequence data were obtained (Takahashi and Nei, 2000). Yet, the log likelihood score for the HKY + Γ model was always much higher than that for the JC model. As a consequence, the tree inferred under the HKY + Γ model was always selected as the ML tree, although in terms of topological distance this ML tree was often farther away from the true tree than the tree inferred under the JC model. However, these simulations were performed using only a relatively small number of nucleotides (300 bp). Therefore, we decided to investigate the efficiencies of different nucleotide substitution models in more detail, using relatively long nucleotide sequences. The relative efficiencies of various heuristic algorithms were also compared.

doi:10.1016/j.ympev.2003.10.011

2. Materials and methods

2.1. Model trees and nucleotide substitution models

DNA sequences were randomly generated according to a given model tree (Fig. 1) and a given substitution model (see below). These sequences were subsequently used for tree construction using different substitution models. The simulation scheme generally corresponds to the one described in Takahashi and Nei (2000). Because of the prohibitive amount of time required to perform ML analysis on large data sets, we concentrated our study primarily on the case of 24 sequences. Two model topologies were considered. The first topology, designated VR, did not follow the molecular clock assumption, and the rate of nucleotide substitution varied among branches of the tree (Fig. 1A). The second topology (CR) assumed a constant rate of evolution for all branches (Fig. 1B). A 48-taxon topology, taken from Takahashi and Nei's (2000) simulation, was also considered (Fig. 1C; it corresponds to Fig. 1F in the latter study).
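The general idea of generating sequences along a model tree can be illustrated with a minimal sketch. It assumes the plain JC model without rate variation among sites, and a hypothetical four-taxon tree (`TOY_TREE`) with made-up branch lengths; neither the tree nor the function names come from the study, which used 24- and 48-taxon trees and more elaborate models.

```python
import math
import random

BASES = "ACGT"

def jc_change_prob(branch_length):
    """Probability that a site differs between the two ends of a branch
    under JC; branch_length is the expected substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

def evolve(parent_seq, branch_length, rng):
    """Return a descendant sequence after one branch of JC evolution."""
    p_change = jc_change_prob(branch_length)
    child = []
    for base in parent_seq:
        if rng.random() < p_change:
            # Under JC the three other bases are equally likely.
            child.append(rng.choice([b for b in BASES if b != base]))
        else:
            child.append(base)
    return "".join(child)

def simulate(node, seq, rng, alignment):
    """Recursively evolve seq down the tree, storing leaf sequences."""
    if isinstance(node, str):          # a leaf is just a taxon name
        alignment[node] = seq
        return
    for child, branch_length in node:  # internal node: (subtree, length) pairs
        simulate(child, evolve(seq, branch_length, rng), rng, alignment)

# Hypothetical model tree with unequal branch lengths (illustration only).
TOY_TREE = [
    ([("A", 0.10), ("B", 0.30)], 0.05),
    ([("C", 0.10), ("D", 0.20)], 0.05),
]

rng = random.Random(1)
root_seq = "".join(rng.choice(BASES) for _ in range(300))
alignment = {}
simulate(TOY_TREE, root_seq, rng, alignment)
```

Each replicate data set would be produced by repeating the last four lines with a fresh random root sequence.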
The data sets for 24 sequences were generated according to three different nucleotide substitution models: the JC (Jukes and Cantor, 1969), Kimura (Kimura, 1980), and HKY (Hasegawa et al., 1985) models. Rate variation among sites following the gamma distribution with gamma parameter a (Jin and Nei, 1990) was incorporated into all three models. Four values of a were considered: the low values a = 0.1, 0.2, and 0.3, and a relatively moderate value, a = 1.0. Under the HKY model two different values of the transition/transversion rate ratio (k) were assumed: k = 5 and k = 20 (corresponding to a moderate and a severe transition/transversion bias, respectively). The following equilibrium nucleotide frequencies were used: g_A = g_C = g_G = g_T = 0.25 for the JC and Kimura models, and g_A = 0.10, g_C = 0.40, g_G = 0.40, g_T = 0.10 for the HKY model. The maximum divergence level (d_max) corresponded to the expected number of substitutions between the most distantly related sequences, and it was set to 1.0. These parameters were applied to 300 and 1000 bp long sequences.

To directly compare our results with those of Takahashi and Nei (2000), 48-taxon data sets were generated under the same conditions used in their study. In particular, under the HKY + Γ model, the following parameter values were employed: (case 1) k = 4, a = 1.0, g_A = 0.15, g_C = 0.35, g_G = 0.35, g_T = 0.15; (case 2) k = 10, a = 0.5, g_A = 0.10, g_C = 0.40, g_G = 0.40, g_T = 0.10. These somewhat biased nucleotide frequencies were chosen to demonstrate that, although the JC and Kimura models cannot be considered true models for these sequences, their performance is very similar to that of the true HKY + Γ model. As in the 24-taxon cases, the maximum level of sequence divergence was set to d_max = 1.0.

For the case of 24 sequences, 50 random sets of sequence data were generated according to each combination of the simulation parameters described above. For the case of 48 sequences, 30 random sets were generated. For each data set, the ML trees under several substitution models were reconstructed (see below).

Fig. 1. Model trees used for computer simulation. The 24-taxon trees (A, B) were randomly generated as described in Takahashi and Nei (2000). Tree A represents the case where the substitution rate varies with evolutionary lineage (VR case); tree B represents the case of constant-rate evolution (CR case); (C) 48-taxon tree taken from the simulations of Takahashi and Nei (2000), tree F.

2.2. Phylogenetic reconstruction

Three different tree-making algorithms were used. The stepwise addition (SA) algorithm with the randomized input order option was used as the computationally least extensive algorithm (Nei and Kumar, 2000). Starting from the quickly obtained SA tree, two more exhaustive algorithms, incorporating further tree search, were employed: nearest neighbor interchange (NNI) and tree bisection-reconnection (TBR). Of these algorithms, the latter is the most extensive one (Nei and Kumar, 2000; Swofford, 1998).

Once sequence data were generated, they were used to reconstruct the phylogenetic tree using several models of nucleotide substitution (JC, Kimura, and HKY + Γ). The parameters a and k and the nucleotide frequencies for a given substitution model were estimated by the ML method using the SA tree obtained with the JC model. Once these parameters were estimated, we constructed an ML tree for each substitution model. All the phylogenetic trees in this study were constructed using the beta version of the PAUP* 4.0 program (Swofford, 1998).

2.3. Efficiency of the tree topology estimations

The efficiency of the tree-making algorithms in inferring the true tree was measured by the topological distance (d_T) (Penny and Hendy, 1985; Robinson and Foulds, 1981) between the inferred tree and the true tree. In general, the d_T value is roughly twice the number of interior branch interchanges required to obtain the true topology from the inferred tree (Rzhetsky and Nei, 1992). Because the probability of obtaining the true tree for a large data set is very small unless the sequences are considerably long (Nei et al., 1998), we did not consider here another measure of phylogenetic accuracy, the proportion or percentage of cases in which the correct tree is obtained (Tateno et al., 1982). When multiple tie trees were identified, the average of the d_T values between the true tree and all the tie trees was computed. The ML value for each tree obtained was also recorded.

Because the results obtained for the VR topology were essentially the same as those obtained for the sequences generated under the CR assumption, we will primarily discuss the results obtained under the VR assumption. The results obtained using the CR topology are available as supplementary material from http://mep.bio.psu.edu/databases.

3. Results

3.1. Efficiency of search algorithms

The results of our simulations are summarized in Tables 1–5 (see also supplementary Tables A–E). The d_T values shown are the average topological distances from the true tree to each of the topologies found. To make the tables more comprehensible, the negative log likelihood values are not presented.
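The topological distance d_T defined in Section 2.3 can be sketched in a few lines. The code below is an illustration only, not the procedure used in the study: it assumes trees are written as nested tuples of taxon names, counts the interior-branch partitions present in one tree but not the other, and averages over tie trees. The six-taxon example trees and all function names are hypothetical.

```python
def leaf_set(node):
    """All taxon names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def splits(tree):
    """Non-trivial bipartitions induced by the interior branches of a tree.
    Each split is stored as the side that excludes a fixed reference taxon,
    so the result does not depend on where the tree is rooted."""
    taxa = leaf_set(tree)
    reference = min(taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if reference in clade else clade
        if 1 < len(side) < len(taxa) - 1:   # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found

def topological_distance(tree_a, tree_b):
    """d_T: number of partitions found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

def mean_distance_to_true(true_tree, tie_trees):
    """Average d_T between the true tree and a list of tie trees."""
    total = sum(topological_distance(true_tree, t) for t in tie_trees)
    return total / len(tie_trees)

# Hypothetical six-taxon example.
true_tree = (("A", "B"), ("C", "D"), ("E", "F"))
inferred = (("A", "C"), ("B", "D"), ("E", "F"))
rerooted = ((("A", "B"), ("C", "D")), ("E", "F"))  # same unrooted topology

d_wrong = topological_distance(true_tree, inferred)
d_same = topological_distance(true_tree, rerooted)
d_mean = mean_distance_to_true(true_tree, [inferred, rerooted])
```

Here two of the three interior branches of the true tree are missing from `inferred`, and `inferred` has two branches that the true tree lacks, so the distance is twice the number of misplaced interior branches, in line with the "roughly twice" rule of thumb quoted above.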
However, we should note that in all cases examined the best log likelihood score belonged to the topology obtained under the combination of the most complex model and the most extensive search algorithm. Furthermore, for every data set analyzed, the (negative-valued) log likelihood increased gradually as the complexity of the analysis increased. The minimal log likelihood score for each data set was obtained under the JC model with the simple SA search. Correspondingly, the ML tree was always the tree obtained under the HKY + Γ model with TBR search. However, as will be discussed below, in many cases the inferred tree that had the best log likelihood score (the ML tree) was not the true tree, as indicated by the topological distance values.

Table 1
Average topological distances from the true tree (d_T ± SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24-taxon data sets generated according to the HKY + Γ model with k = 5 and based on the VR topology)

Gamma         JC                                  Kimura                              HKY + Γ
parameter a   SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR
300 nucleotides
0.1           22.5 ± 0.4  21.1 ± 0.5  19.2 ± 0.5  22.0 ± 0.5  21.4 ± 0.5  20.5 ± 0.6  22.3 ± 0.6  21.8 ± 0.6  23.7 ± 0.7
0.2           16.4 ± 0.5  15.3 ± 0.5  13.5 ± 0.5  15.7 ± 0.5  14.7 ± 0.5  13.8 ± 0.5  15.4 ± 0.6  16.2 ± 0.7  14.5 ± 0.7
0.3           14.9 ± 0.5  11.5 ± 0.5  9.2 ± 0.5   13.1 ± 0.6  11.3 ± 0.6  9.4 ± 0.6   13.1 ± 0.6  11.4 ± 0.7  8.7 ± 0.7
1.0           8.5 ± 0.5   6.4 ± 0.5   6.1 ± 0.6   8.9 ± 0.5   7.0 ± 0.6   5.5 ± 0.6   8.4 ± 0.6   6.4 ± 0.6   6.1 ± 0.7
1000 nucleotides
0.1           11.7 ± 0.5  9.0 ± 0.5   8.3 ± 0.5   12.4 ± 0.5  9.2 ± 0.5   10.2 ± 0.6  8.3 ± 0.6   6.5 ± 0.7   6.9 ± 0.7
0.2           8.5 ± 0.4   7.2 ± 0.5   6.3 ± 0.5   8.0 ± 0.5   8.0 ± 0.6   8.0 ± 0.6   6.0 ± 0.6   4.0 ± 0.7   4.0 ± 0.8
0.3           8.5 ± 0.5   7.5 ± 0.5   6.5 ± 0.5   7.6 ± 0.5   6.0 ± 0.6   5.6 ± 0.6   5.6 ± 0.6   3.5 ± 0.6   2.7 ± 0.7
1.0           4.8 ± 0.5   4.3 ± 0.5   4.1 ± 0.5   5.6 ± 0.5   4.0 ± 0.5   3.6 ± 0.5   2.4 ± 0.6   2.0 ± 0.6   2.0 ± 0.7
Notes. JC, Jukes-Cantor; HKY, Hasegawa-Kishino-Yano; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

Table 2
Average topological distances from the true tree (d_T ± SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24-taxon data sets generated according to the HKY + Γ model with k = 20 and based on the VR topology)

Gamma         JC                                  Kimura                              HKY + Γ
parameter a   SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR
300 nucleotides
0.1           22.8 ± 0.5  23.7 ± 0.5  24.0 ± 0.6  21.8 ± 0.6  20.7 ± 0.6  20.3 ± 0.6  22.8 ± 0.6  21.0 ± 0.6  20.9 ± 0.7
0.2           17.0 ± 0.6  14.4 ± 0.7  13.9 ± 0.7  15.3 ± 0.6  14.4 ± 0.7  13.2 ± 0.7  14.5 ± 0.7  11.2 ± 0.8  12.1 ± 0.8
0.3           17.7 ± 0.6  15.3 ± 0.6  11.3 ± 0.7  14.5 ± 0.6  11.9 ± 0.7  10.6 ± 0.7  14.4 ± 0.6  7.3 ± 0.7   8.1 ± 0.8
1.0           10.2 ± 0.5  8.3 ± 0.6   8.0 ± 0.6   8.1 ± 0.6   6.0 ± 0.67  6.3 ± 0.6   7.8 ± 0.6   4.8 ± 0.6   4.5 ± 0.6
1000 nucleotides
0.1           13.8 ± 0.5  13.2 ± 0.6  10.8 ± 0.5  13.5 ± 0.6  11.4 ± 0.6  11.2 ± 0.7  9.0 ± 0.8   7.8 ± 0.8   6.6 ± 0.9
0.2           9.9 ± 0.6   7.6 ± 0.6   7.1 ± 0.6   9.4 ± 0.5   6.0 ± 0.6   5.2 ± 0.6   6.1 ± 0.7   2.9 ± 0.8   2.8 ± 0.8
0.3           9.2 ± 0.6   7.1 ± 0.6   5.6 ± 0.5   7.6 ± 0.5   5.6 ± 0.7   5.8 ± 0.5   6.0 ± 0.8   3.7 ± 0.8   3.7 ± 0.9
1.0           8.5 ± 0.8   6.4 ± 0.6   6.3 ± 0.7   5.4 ± 0.6   4.2 ± 0.6   4.2 ± 0.6   3.9 ± 0.9   2.5 ± 0.9   2.4 ± 1.0
Notes. JC, Jukes-Cantor; HKY, Hasegawa-Kishino-Yano; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

Table 3
Average topological distances from the true tree (d_T ± SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24-taxon data sets generated according to the JC + Γ model and based on the VR topology)

Gamma         JC                                  Kimura                              HKY + Γ
parameter a   SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR
300 nucleotides
0.1           19.7 ± 0.5  19.2 ± 0.5  18.3 ± 0.6  20.1 ± 0.5  17.9 ± 0.5  18.7 ± 0.6  20.4 ± 0.6  18.2 ± 0.6  17.6 ± 0.7
0.2           15.2 ± 0.5  13.1 ± 0.5  12.4 ± 0.5  15.2 ± 0.6  13.0 ± 0.6  12.4 ± 0.7  16.1 ± 0.7  13.8 ± 0.7  13.4 ± 0.7
0.3           12.0 ± 0.5  9.7 ± 0.6   8.6 ± 0.6   11.8 ± 0.6  9.2 ± 0.7   8.2 ± 0.7   11.8 ± 0.7  9.5 ± 0.7   8.9 ± 0.8
1.0           10.8 ± 0.6  8.1 ± 0.5   8.6 ± 0.6   10.9 ± 0.6  8.9 ± 0.7   7.5 ± 0.7   10.3 ± 0.7  8.0 ± 0.6   7.4 ± 0.7
1000 nucleotides
0.1           8.9 ± 1.0   7.3 ± 1.0   6.4 ± 1.0   8.9 ± 1.0   7.3 ± 1.0   6.4 ± 1.0   7.0 ± 1.0   6.2 ± 1.0   5.7 ± 1.0
0.2           7.5 ± 0.9   5.2 ± 0.9   4.4 ± 0.9   7.5 ± 0.9   5.2 ± 0.9   4.1 ± 0.9   5.2 ± 1.0   3.0 ± 1.0   3.2 ± 1.1
0.3           5.9 ± 0.8   2.8 ± 0.9   2.8 ± 0.9   5.7 ± 0.9   2.8 ± 0.9   2.9 ± 0.9   3.7 ± 0.9   1.8 ± 1.0   2.0 ± 1.0
1.0           3.3 ± 0.8   2.0 ± 0.8   1.6 ± 0.9   3.3 ± 0.9   2.0 ± 0.9   1.6 ± 0.9   0.9 ± 0.9   0.6 ± 1.0   0.3 ± 1.1
Notes. JC, Jukes-Cantor; HKY, Hasegawa-Kishino-Yano; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

Table 4
Average topological distances from the true tree (d_T ± SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24-taxon data sets generated according to the Kimura + Γ model and based on the VR topology)

Gamma         JC                                  Kimura                              HKY + Γ
parameter a   SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR    SA          SA + NNI    SA + TBR
300 nucleotides
0.1           23.3 ± 0.6  22.5 ± 0.6  19.7 ± 0.7  22.1 ± 0.7  20.4 ± 0.7  19.2 ± 0.8  20.5 ± 0.8  18.9 ± 0.9  14.9 ± 0.9
0.2           16.9 ± 0.6  14.4 ± 0.6  12.7 ± 0.7  14.7 ± 0.6  12.9 ± 0.8  10.8 ± 0.8  13.6 ± 0.9  11.4 ± 0.9  9.4 ± 0.9
0.3           15.2 ± 0.6  12.1 ± 0.7  10.7 ± 0.7  11.9 ± 0.7  8.4 ± 0.8   8.8 ± 0.8   10.9 ± 0.8  8.1 ± 0.9   7.6 ± 0.9
1.0           9.1 ± 0.6   7.3 ± 0.7   5.9 ± 0.8   8.0 ± 0.8   5.9 ± 0.9   5.4 ± 0.9   8.4 ± 0.9   5.5 ± 0.9   4.9 ± 0.9
1000 nucleotides
0.1           9.2 ± 0.6   6.0 ± 0.7   6.0 ± 0.7   8.4 ± 0.6   6.4 ± 0.8   6.0 ± 0.8   4.8 ± 0.8   5.0 ± 0.8   3.0 ± 0.9
0.2           9.2 ± 0.6   7.6 ± 0.6   6.4 ± 0.7   9.2 ± 0.7   7.2 ± 0.8   6.8 ± 0.8   6.0 ± 0.8   3.0 ± 0.9   3.0 ± 0.9
0.3           4.0 ± 0.8   4.0 ± 0.8   4.0 ± 0.8   6.8 ± 0.8   2.8 ± 0.8   3.5 ± 0.8   5.5 ± 0.9   2.0 ± 0.9   3.0 ± 0.9
1.0           4.8 ± 0.6   2.4 ± 0.7   2.4 ± 0.7   3.6 ± 0.7   2.8 ± 0.8   2.8 ± 0.8   3.2 ± 0.9   2.0 ± 0.9   2.0 ± 0.9
Notes. JC, Jukes-Cantor; HKY, Hasegawa-Kishino-Yano; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

Table 5
Average topological distances from the true tree (d_T ± SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (48-taxon data sets)

              JC                      Kimura                  HKY + Γ
              SA          SA + NNI    SA          SA + NNI    SA          SA + NNI
Case 1: HKY + Γ (k = 4, a = 1.0, [A] = [T] = 0.15, [C] = [G] = 0.35)
              9.6 ± 0.5   6.9 ± 0.6   9.2 ± 0.5   7.3 ± 0.6   12.7 ± 0.5  9.0 ± 0.5
Case 2: HKY + Γ (k = 10, a = 0.5, [A] = [T] = 0.10, [C] = [G] = 0.40)
              11.0 ± 0.6  9.3 ± 0.5   10.6 ± 0.5  9.8 ± 0.5   10.8 ± 0.4  10.6 ± 0.6
Notes. JC, Jukes-Cantor; HKY, Hasegawa-Kishino-Yano; SA, stepwise-addition; NNI, nearest neighbor interchange.

When average d_T values were compared among different search algorithms, the efficiency of the simplest search algorithm, SA, was much lower than that of the more extensive search algorithms, SA + NNI and SA + TBR. In particular, in all cases examined the d_T values obtained by the SA + NNI or SA + TBR algorithms were smaller than those obtained by the SA algorithm, with a difference of more than one interior branch. However, the relative efficiencies of the branch-swapping algorithms NNI and TBR appear to be very similar, as they showed nearly the same d_T values (a difference in d_T values was considered small if it did not exceed one internal branch, i.e., d_T ≤ 2). In some cases (e.g., Tables 1–4) the topologies found by SA + TBR search had identical or even slightly higher d_T values than those identified by SA + NNI search.
In the presence of extreme rate variation among sites (e.g., a = 0.1 and 0.2) the difference between the performances of SA + NNI and SA + TBR was even more noticeable than in the cases with moderate rate variation (e.g., a = 1). In the latter cases the majority of data sets exhibited almost identical d_T values between SA + NNI and SA + TBR searches.

3.2. 24 taxa topology

Table 1 presents the d_T values for the VR case when the sequence data were generated under the HKY + Γ model (k = 5). As one can see, these d_T values were rather high for short sequences generated under very low gamma parameter values (e.g., a = 0.1 and 0.2). In some extreme cases the inferred topologies differed from the true topology by approximately half of the interior branches, regardless of the model and/or branch-swapping algorithm used. However, as the sequence length increased from 300 to 1000 bp, the efficiency of finding the true tree also increased (as indicated by smaller d_T values). Similar effects on the relative efficiency were observed when the extent of among-site rate variation decreased (i.e., as the value of a increased). For all three models examined (i.e., JC, Kimura, and HKY + Γ) the d_T values for SA + NNI and SA + TBR appeared to be roughly the same. Furthermore, compared with SA, the SA + NNI heuristic search showed a significant decrease in d_T values. We should also note that for the 300 bp sequences the trees obtained under the relatively simple JC and Kimura models showed d_T values close to those obtained under the more complicated HKY + Γ model. That is, the wrong simple models showed essentially the same efficiency in finding the true tree as the more complicated, but true, model (i.e., HKY + Γ). Similar results were observed for the CR case (supplementary Table A).

Table 2 shows the results for the VR case when the sequence data were generated under the extreme transition/transversion bias (k = 20).
As in Table 1, the highest d_T values were observed for short sequences generated under low gamma parameter values. ML trees that were inferred with the SA-only search had the largest topological differences from the true tree. The use of more extensive searches (SA + NNI, SA + TBR) led to a decrease in d_T values compared with those obtained using the SA search. Interestingly, approximately the same d_T values were observed for the trees inferred under both SA + NNI and SA + TBR, suggesting that these two searches were finding the true tree with similar efficiencies. Perhaps, for the purpose of finding the ML tree, the relatively simple SA + NNI search can be considered sufficient, yet computationally it is much more efficient than more extensive searches such as SA + TBR or SPR. In some cases we observed the SA + TBR search taking up to several days and even weeks on either Dual Intel PIII 500 or Sun Ultra 60 platforms, in contrast to only several hours for the SA + NNI search. Supplementary Table B shows that essentially the same results were obtained for the CR cases.

Tables 3 and 4 present the results when the sequence data were generated under the relatively simple JC + Γ and Kimura + Γ models, respectively. As with the results presented in Tables 1 and 2, the increase in sequence length from 300 to 1000 bp significantly improves the performance of all three models of phylogenetic inference. The results also show that the simplest JC model gives essentially the same results as the Kimura or HKY + Γ models, especially when cases with a moderate value of the gamma parameter (a = 1.0) are considered. In two out of four cases the Kimura model outperformed HKY + Γ on the short sequences (Table 3). Overall, the SA + TBR search showed slightly better d_T values than those obtained under the SA + NNI search, although the observed improvements did not extend beyond 0.5 interior branches on average. When these results were compared to those obtained under the true model, it appeared that incorporating the rate variation among sites into the ML inference models slightly improves the performance of the phylogenetic inferences (see also supplementary Tables C and D). This trend was particularly noticeable when the SA + NNI search was employed, although the differences in d_T values among models were rather small, regardless of whether they were true models or not.

3.3. 48 taxa topology

To compare our results with those of Takahashi and Nei (2000), longer (1000 bp) sequences were generated under exactly the same conditions as those used in the latter study. Two cases were studied (Table 5). In both cases the true model of the phylogenetic inference (i.e., HKY + Γ here) was found to perform worse than the simpler JC or Kimura models. In fact, the simplest JC model outperformed the two more complex models. In the case of the relatively low k value and less biased nucleotide frequencies (case 1), the d_T values for the simple algorithm, SA, were 9.6, 9.2, and 12.7 for the JC, Kimura, and HKY + Γ models, respectively. However, the use of the SA + NNI branch-swapping algorithm reduced the d_T values for each model to 6.9, 7.3, and 9.0, respectively, with the best d_T value achieved under the JC model.
Because of the enormous amount of computational time required, the more extensive search algorithm SA + TBR was used only for a few data sets, and it appeared that the simple SA + NNI algorithm can indeed perform as efficiently as the more complex SA + TBR, as indicated by the close d_T values (results not shown). Similar results were observed for empirical data, where both extensive and simple heuristic searches produced essentially the same topologies. In particular, the ML trees of primate MHC class I genes identified under the relatively simple heuristic search 10SA + NNI were essentially the same as those identified under the extensive 10SA + TBR search (Piontkivska and Nei, 2003).

Interestingly, when more biased base frequencies and a higher value of k were used (case 2), the overall performance of all three models became essentially the same: d_T values of 9.3 versus 9.6 versus 10.6 for the SA + NNI search algorithm. However, as in case 1, the ML trees with the lowest d_T values were inferred under the JC model. Our results showed that extending the sequence length to 1000 nucleotides made the differences between the performance of the simple and complex models more noticeable. In case 1, the d_T values were 6.9 versus 9.0, compared with 10.7 versus 11.6 for the 300 bp sequences, while in case 2 the corresponding d_T values were 9.3 versus 10.6, compared with 18.5 versus 18.7 for the short sequences (Table 5; see also Takahashi and Nei, 2000).

Theoretically, the minimum and maximum possible d_T values for 48-taxon topologies are 0 and 90, respectively (Nei and Kumar, 2000). Thus, d_T = 9.3 means an error in the branching pattern (sequence partition) for approximately 5 interior branches, while d_T = 10.6 implies a difference in about 5.5 interior branches.
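The arithmetic behind these bounds can be checked with a short sketch. It assumes unrooted, fully bifurcating trees, which have n - 3 interior branches for n taxa, and uses the rule of thumb from Section 2.3 that d_T is roughly twice the number of interior branches in error; the function names are illustrative only.

```python
def max_topological_distance(n_taxa):
    """Largest possible d_T between two unrooted bifurcating trees:
    all n - 3 interior branches differ in both trees."""
    return 2 * (n_taxa - 3)

def interior_branches_in_error(d_t):
    """Rough number of misplaced interior branches implied by d_T."""
    return d_t / 2

max_48 = max_topological_distance(48)   # upper bound for the 48-taxon trees
max_24 = max_topological_distance(24)   # upper bound for the 24-taxon trees
err_jc = interior_branches_in_error(9.3)
err_hky = interior_branches_in_error(10.6)
```

On this scale, the gap between the JC and HKY + Γ results in case 2 amounts to well under one interior branch out of 45.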
From the biological point of view, a difference of two internal branches between two 48-taxon topologies may be considered rather small (Penny and Hendy, 1985; Takahashi and Nei, 2000), yet statistically it appears to be significant (t test, p < 0.005).

4. Discussion

4.1. Search algorithms

The relative efficiency of different substitution models used in ML phylogenetic inference and the performance of different heuristic search algorithms under the ML criterion were examined using computer simulations. The results showed that when a relatively large number of sequences is used, the overall performance of the computationally less extensive SA + NNI search is essentially the same as the performance of the more extensive SA + TBR search. The similarity in the performance of these two heuristic searches became more prominent when relatively long (1000 bp) sequences were considered. In some cases SA + NNI was observed to outperform SA + TBR, although the differences among inferred topologies could be considered rather small from the biological point of view (e.g., about two internal branches). Overall, our results demonstrated that the use of an extensive search algorithm, such as SA + TBR, does not guarantee finding the true tree, especially when sequences are short and/or the rate variation among sites is very large (i.e., when the gamma parameter a is low).

As has been shown earlier by Nei et al. (1998), the ML method tends to give incorrect topologies when the number of sequences examined is large and the number of nucleotides is relatively small. In our simulations the ML tree was always the one found under the combination of the most complex substitution model and the most extensive heuristic search (i.e., the HKY + Γ/SA + TBR combination) (for examples see supplementary Table E). However, when topological differences between the inferred ML tree and the true tree were examined, in many cases the ML tree was not the true tree or even the tree closest to the true tree. Furthermore, in some cases when short (300 bp) sequences were generated under extremely low gamma parameter values (i.e., a = 0.1–0.3), the number of equally likely topologies (i.e., tie trees) found by the heuristic search was three or more per data set. When tie trees were inferred, the TBR search usually identified at least as many tie trees as the NNI search. In some extreme cases 44 or more tie trees were inferred. A similar trend was observed for the longer sequences, although the maximum number of equally likely topologies per data set was smaller than in the case of 300 bp sequences and did not exceed 27 tie trees per data set. On the same set of sequences, the SA + TBR search produced a larger number of tie trees than the SA + NNI search (Piontkivska and Nei, unpublished results). However, the topological differences between the tie trees found by these two searches and the true tree, as measured by d_T values, were approximately the same. We should note that in the cases where multiple tie trees were identified, none of them corresponded to the true tree (i.e., d_T > 0). Unlike simulated data, where the true tree is known beforehand, for empirical data sets such trees are generally unknown. In the latter cases, each tie topology (i.e., equally likely topology) should be further examined with other sorts of data in order to decide whether this topology represents the true tree.

4.2. Choice of nucleotide substitution model

Our simulation results showed that when the number of nucleotides used is small relative to the number of sequences, the use of a simple substitution model, such as the JC or Kimura model, can be as efficient as, or even better than, the use of a more complicated model, such as HKY + Γ. These simple models appear to be quite efficient even when a relatively large number of sequences is used (Takahashi and Nei, 2000). Similar conclusions were reached for data sets with a small number of sequences (Bruno and Halpern, 1999; Yang, 1997). This phenomenon can partly be attributed to the amount of computational noise associated with the process of phylogenetic inference. When a complex model is used, the number of parameters to be estimated increases compared with relatively simple models. Thus, the use of complex models leads to an increase in the amount of computational noise. On the other hand, the actual pattern of nucleotide substitutions is always unknown and is perhaps even more complicated than any possible model. Moreover, the substitution pattern can change over evolutionary time, since even closely related species can exhibit substantial differences in codon usage (Anderson et al., 1993; Lloyd and Sharp, 1992; Shields, 1990; Tarrio et al., 2001), making it difficult to employ a model that is true for all the species included in the data set.

It has been suggested that the likelihood ratio test should be used to select the appropriate model for ML tree construction (Huelsenbeck and Crandall, 1997; Posada and Crandall, 1998, 2001; Swofford et al., 1996). However, our simulations showed that the highest likelihood value would always be observed for the trees inferred under the most complex model among all models compared.
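For readers unfamiliar with the likelihood ratio test mentioned above, a minimal sketch follows. It assumes two nested models that differ by exactly one free parameter (for example, JC versus Kimura with equal base frequencies), so that twice the log likelihood difference is compared with a chi-square distribution with one degree of freedom. The log likelihood values are hypothetical and are not taken from this study.

```python
import math

def lrt_one_extra_parameter(lnl_simple, lnl_complex):
    """Likelihood ratio test for nested models differing by one parameter.
    Returns the test statistic and its chi-square (1 d.f.) p-value."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Hypothetical log likelihoods for the same data set under two models.
statistic, p_value = lrt_one_extra_parameter(-1000.0, -998.0)
```

Because the more complex of two nested models can never have a lower maximized likelihood, the statistic is non-negative by construction; the test only asks whether the gain is larger than expected by chance, not whether the resulting topology is closer to the true tree.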
At the same time, in most of the cases examined these ML trees do not appear to be any closer to the true tree than those obtained under the less complicated models, that is, the trees that had lower log likelihood values (i.e., non-ML trees). An empirical study with a known phylogeny showed that models selected using the likelihood ratio test do not necessarily produce better topologies than simpler models (Russo et al., 1996). This may lead to the selection of wrong trees (or trees that are topologically further from the true tree than non-ML trees) as the ML tree. Use of the likelihood ratio test should be examined more carefully with both simulated and empirical data (Nei and Kumar, 2000; Takahashi and Nei, 2000).

The value of the gamma parameter a also influences the efficiency of different substitution models. Our results showed that extremely small values (such as a = 0.1 and 0.2) are generally associated with a decrease in the overall performance of every model. This holds true even if the substitution model used for the ML inference takes rate variation into account (see Tables 1–4). In terms of topological differences, better results were obtained for the data sets with a fairly moderate value of the gamma parameter (a = 1.0), and the topological distances between the inferred and the true trees increased as the gamma parameter of the data set decreased. Notably, the presence of a strong transition/transversion bias does not appear to heavily influence the overall performance of the substitution model used for ML phylogenetic inference.

Our results suggest that for relatively short sequences it is not necessary to use the most complex models and the most extensive search algorithms available. In many cases the simple JC model and the SA + NNI search algorithm can be as efficient in finding the true tree as more complicated models and more extensive searches.
Although in terms of the negative log likelihood value the topologies inferred under the JC model do not appear to be ML trees when compared with the trees inferred under more complicated models, in terms of the topological distance these trees can be as close to the true tree as the ML trees obtained under the more complicated models. However, for longer sequences, the HKY + Γ model showed smaller d_T values than the other models. When different heuristic search algorithms are considered, the relatively simple SA + NNI search appears to be as efficient in finding the true topology as more extensive searches, such as SA + TBR (this study) or SA + SPR (Takahashi and Nei, 2000).

However, we should note that the results of the simulations depend greatly on the model tree topology used to simulate the sequence data. Since we used only a few randomly generated topologies, our observations may be limited to similar types of topologies, although the molecular-clock assumption does not seem to affect the overall performance of different models. Further, the overwhelming complexity of the actual nucleotide substitution pattern poses the problem of model choice every time an empirical data set is considered. While the most sophisticated model might appear to be the most logical choice, in many cases, especially when the number of sites is relatively small (Nei et al., 1998), choosing a simple model can be the most practical choice.

Acknowledgments

I thank Masatoshi Nei for our numerous inspirational discussions. I am also grateful to Wen-Hsiung Li and two anonymous reviewers for their comments on an earlier version of this manuscript. This work was supported by grants from NIH (GM20293) and NASA (NCC2-1057) to Masatoshi Nei.

References

Anderson, C.L., Carew, E.A., Powell, J.R., 1993. Evolution of the Adh locus in the Drosophila willistoni group: the loss of an intron, and shift in codon usage. Mol. Biol. Evol. 10, 605–618.
Bruno, W.J., Halpern, A.L., 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16, 564–566.
Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376.
Hasegawa, M., Kishino, H., Yano, T., 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22, 160–174.
Hedin, M.C., Maddison, W.P., 2001. A combined molecular approach to phylogeny of the jumping spider subfamily Dendryphantinae (Araneae: Salticidae). Mol. Phylogenet. Evol. 18, 386–403.
Huelsenbeck, J.P., Crandall, K.A., 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28, 437–466.
Jin, L., Nei, M., 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7, 82–102.
Jukes, T.H., Cantor, C.R., 1969. Evolution of protein molecules. In: Munro, H.N. (Ed.), Mammalian Protein Metabolism. Academic Press, New York, pp. 21–132.
Kimura, M., 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120.
Lemmon, A.R., Milinkovitch, M.C., 2002. The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc. Natl. Acad. Sci. USA 99, 10516–10521.
Lloyd, A.T., Sharp, P.M., 1992. Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic Acids Res. 20, 5289–5295.
Nei, M., Kumar, S., 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford.
Nei, M., Kumar, S., Takahashi, K., 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA 95, 12390–12397.
Olsen, G.J., Matsuda, H., Hagstrom, R., Overbeek, R., 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10, 41–48.
Ota, S., Li, W.H., 2000. NJML: a hybrid algorithm for the neighbor-joining and maximum-likelihood methods. Mol. Biol. Evol. 17, 1401–1409.
Ota, S., Li, W.H., 2001. NJML+: an extension of the NJML method to handle protein sequence data and computer software implementation. Mol. Biol. Evol. 18, 1983–1992.
Penny, D., Hendy, M.D., 1985. The use of tree comparison metrics. Syst. Biol. 34, 75–82.
Piontkivska, H., Nei, M., 2003. Birth-and-death evolution in primate MHC class I genes: divergence time estimates. Mol. Biol. Evol. 20, 601–609.
Posada, D., Crandall, K., 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50, 580–601.
Posada, D., Crandall, K.A., 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14, 817–818.
Reyes, A., Pesole, G., Saccone, C., 2000. Long-branch attraction phenomenon and the impact of among-site rate variation on rodent phylogeny. Gene 259, 177–187.
Rice, K.A., Donoghue, M.J., Olmstead, R.G., 1997. Analyzing large data sets: rbcL revisited. Syst. Biol. 46, 554–563.
Robinson, D.F., Foulds, L.R., 1981. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147.
Rogers, J.S., Swofford, D.L., 1998. A fast method for approximating maximum likelihoods of phylogenetic trees from nucleotide sequences. Syst. Biol. 47, 77–89.
Russo, C.A., Takezaki, N., Nei, M., 1996. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13, 525–536.
Salter, L.A., 2001. Complexity of the likelihood surface for a large DNA dataset. Syst. Biol. 50, 970–978.
Shields, D.C., 1990. Switches in species-specific codon preferences: the influence of mutation biases. J. Mol. Evol. 31, 71–80.
Sullivan, J., Swofford, D.L., 2001. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 50, 723–729.
Swofford, D.L., 1998. PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland, MA.
Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M., 1996. Phylogenetic inference. In: Hillis, D.M., Moritz, C., Mable, B.K. (Eds.), Molecular Systematics, second ed. Sinauer, Sunderland, MA, pp. 407–514.
Takahashi, K., Nei, M., 2000. Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol. Biol. Evol. 17, 1251–1258.
Tarrio, R., Rodriguez-Trelles, F., Ayala, F.J., 2001. Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. Mol. Biol. Evol. 18, 1464–1473.
Tateno, Y., Nei, M., Tajima, F., 1982. Accuracy of estimated phylogenetic trees from molecular data. I. Distantly related species. J. Mol. Evol. 18, 387–404.
Wolf, M.J., Easteal, S., Kahn, M., McKay, B.D., Jermiin, L.S., 2000. TrExML: a maximum-likelihood approach for extensive tree-space exploration. Bioinformatics 16, 383–394.
Yang, Z., 1997. How often do wrong models produce better phylogenies? Mol. Biol. Evol. 14, 105–108.