Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used


Molecular Phylogenetics and Evolution 31 (2004)

Helen Piontkivska *

Institute of Molecular Evolutionary Genetics and Department of Biology, Pennsylvania State University, USA

Received 14 October 2002; revised 21 September 2003

* Present address: Department of Biological Sciences, University of South Carolina, 501 Coker Building, 700 Sumter Street, Columbia, SC 29208, USA. E-mail address: elena@biol.sc.edu.

Abstract

Choice of a substitution model is a crucial step in the maximum likelihood (ML) method of phylogenetic inference, and investigators tend to prefer complex mathematical models to simple ones. However, when complex models with many parameters are used, the extent of noise in statistical inferences increases, and thus complex models may not produce the true topology with a higher probability than simple ones. This problem was studied using computer simulation. When the number of nucleotides used was relatively large (1000 bp), the HKY + Γ model showed a smaller d_T (topological distance between the inferred and the true trees) than the JC and Kimura models. For shorter sequences (300 bp), a simpler model and search algorithm, such as the JC model and the SA + NNI search, were found to be as efficient as more complicated searches and models in terms of topological distances, although the topologies obtained under the HKY + Γ model had the highest likelihood values. The performance of the relatively simple search algorithm SA + NNI was found to be essentially the same as that of the more extensive SA + TBR search under all models studied. Similarly to the conclusions reached by Takahashi and Nei [Mol. Biol. Evol. 17 (2000) 1251], our results indicate that simple models can be as efficient as complex models, and that the use of complex models does not necessarily give more reliable trees compared with simple models.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Maximum likelihood; Nucleotide substitution model; Phylogenetic tree; Topological distance

¹ Abbreviations used: JC, Jukes and Cantor; HKY, Hasegawa, Kishino, and Yano; Γ, gamma; bp, base pairs; CR, constant rate; VR, varying rate; ML, maximum likelihood; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

1. Introduction

In the maximum likelihood (ML)¹ method of phylogenetic inference, the likelihood of observing a given set of sequence data for a specific substitution model is maximized for each topology, and the topology with the highest maximum likelihood is chosen as the final tree (Felsenstein, 1981; Nei and Kumar, 2000). Construction of ML trees is extremely time-consuming, especially when complex substitution models are used. Although there are heuristic algorithms that speed up computation (e.g., the fastDNAml method (Olsen et al., 1994), the NJML method (Ota and Li, 2000, 2001), and TrExML (Wolf et al., 2000)), the computational time required is still substantial (Lemmon and Milinkovitch, 2002; Rogers and Swofford, 1998; Salter, 2001). Because the actual pattern of nucleotide substitutions is very complicated, many investigators tend to use complex and therefore time-consuming substitution models rather than simple ones (Hedin and Maddison, 2001; Posada and Crandall, 2001; Reyes et al., 2000; Rice et al., 1997). However, the probability of getting the true topology does not depend on the computational time, and the use of complex models may not produce the true topology with a higher probability than the use of simpler ones (Nei et al., 1998; Sullivan and Swofford, 2001; Takahashi and Nei, 2000).
It has been shown that when the number of nucleotides relative to the number of sequences used is small, a simple model such as the Jukes–Cantor (JC) model shows almost the same or even better performance than more complex models such as the Hasegawa, Kishino, and Yano + Gamma (HKY + Γ) model with which the simulated sequence data were obtained (Takahashi and Nei, 2000). Yet, the log likelihood score for the HKY + Γ model was always much higher than that for the JC model. As a consequence, the tree inferred under the HKY + Γ model was always selected as the ML tree, although in terms of topological distance this ML tree was often farther away from the true tree than the tree inferred under the JC model. However, these simulations were performed using only a relatively small number of nucleotides (300 bp). Therefore, we decided to investigate the efficiencies of different nucleotide substitution models in more detail, using relatively long nucleotide sequences. The relative efficiencies of various heuristic algorithms were also compared.

2. Materials and methods

2.1. Model trees and nucleotide substitution models

DNA sequences were randomly generated according to a given model tree (Fig. 1) and a given substitution model (see below). These sequences were subsequently used for tree construction using different substitution models. The simulation scheme generally corresponds to the one described in Takahashi and Nei (2000). Because of the prohibitive amount of time required to perform ML analysis on large data sets, we concentrated our study primarily on the case of 24 sequences. Two model topologies were considered. The first topology, designated as VR, did not follow the molecular clock assumption, and the rate of nucleotide substitution varied among branches of the tree (Fig. 1A). The second topology (CR) assumed a constant rate of evolution for all branches (Fig. 1B). A 48 taxa topology, taken from Takahashi and Nei's (2000) simulation, was also considered (Fig. 1C; corresponds to Fig. 1F in the latter study).
The data sets for 24 sequences were generated according to three different nucleotide substitution models: the JC (Jukes and Cantor, 1969), Kimura (Kimura, 1980), and HKY (Hasegawa et al., 1985) models. Rate variation among sites following the gamma distribution with gamma parameter a (Jin and Nei, 1990) was incorporated into all three models. Four values of a were considered: the low values a = 0.1, 0.2, and 0.3, and the relatively moderate value a = 1.0. Under the HKY model two different values of the transition/transversion rate ratio (k) were assumed: k = 5 and k = 20, corresponding to a moderate and a severe transition/transversion bias, respectively. The following equilibrium nucleotide frequencies were used: g_A = g_C = g_G = g_T = 0.25 for the JC and Kimura models, and g_A = 0.10, g_C = 0.40, g_G = 0.40, g_T = 0.10 for the HKY model. The maximum divergence level (d_max) corresponded to the expected number of substitutions between the most distantly related sequences, and it was set to 1.0. These parameters were applied to 300 and 1000 bp long sequences.

To directly compare our results with those of Takahashi and Nei (2000), 48 taxa data sets were generated under the same conditions used in their study. In particular, under the HKY + Γ model, the following parameter values were employed: (case 1) k = 4, a = 1.0, g_A = 0.15, g_C = 0.35, g_G = 0.35, g_T = 0.15; (case 2) k = 10, a = 0.5, g_A = 0.10, g_C = 0.40, g_G = 0.40, g_T = 0.10. These somewhat biased nucleotide frequencies were chosen to demonstrate that, although the JC and Kimura models cannot be considered true models for these sequences, their performance is very similar to that of the true HKY + Γ model. Similarly to the 24 taxa cases, the maximum level of sequence divergence was set to d_max = 1.0.

Fig. 1. Model trees used for computer simulation. The 24 taxa trees (A, B) were randomly generated as described in Takahashi and Nei (2000). Tree A represents the case where the substitution rate varies with evolutionary lineage (VR case); tree B represents the case of constant-rate evolution (CR case); (C) 48 taxa tree taken from the Takahashi and Nei (2000) simulations, tree F.

For the case of 24 sequences, 50 random sets of sequence data were generated according to each combination of the simulation parameters described above. For the case of 48 sequences, 30 random sets were generated. For each data set, the ML trees under several substitution models were reconstructed (see below).

2.2. Phylogenetic reconstruction

Three different tree-making algorithms were used. The stepwise addition (SA) algorithm with the randomized input order option was used as the computationally least extensive algorithm (Nei and Kumar, 2000). Starting from the quickly obtained SA tree, two more exhaustive algorithms, incorporating further tree search, were employed: nearest neighbor interchange (NNI) and tree bisection-reconnection (TBR). Among these algorithms, the latter is the most extensive one (Nei and Kumar, 2000; Swofford, 1998). Once sequence data were generated, they were used to reconstruct the phylogenetic tree using several models of nucleotide substitution (JC, Kimura, and HKY + Γ). The parameters a, k, and the nucleotide frequencies for a given substitution model were estimated by the ML method using the SA tree obtained with the JC model. Once these parameters were estimated, we constructed an ML tree for each substitution model.
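The relationship among the three inference models can be made concrete with a small sketch. It assumes the standard HKY85 parameterisation, in which the rate of change to base j is proportional to the equilibrium frequency of j, multiplied by k for transitions; the JC and Kimura models then arise as special cases with equal frequencies. The function names and the normalisation to one expected substitution per unit time are illustrative choices, not taken from the paper or from PAUP*.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """A transition stays within purines (A, G) or within pyrimidines (C, T)."""
    return (x in PURINES) == (y in PURINES)


def hky_rate_matrix(k, freqs):
    """Build an HKY85-style rate matrix as a dict of dicts.

    k     : transition/transversion rate ratio (assumed HKY85 meaning)
    freqs : equilibrium base frequencies, e.g. {"A": 0.1, "C": 0.4, ...}
    The matrix is scaled so that the expected substitution rate is 1.
    """
    q = {x: {} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                q[x][y] = freqs[y] * (k if is_transition(x, y) else 1.0)
        # Diagonal entry makes each row sum to zero.
        q[x][x] = -sum(q[x].values())
    mean_rate = -sum(freqs[x] * q[x][x] for x in BASES)
    return {x: {y: q[x][y] / mean_rate for y in BASES} for x in BASES}


EQUAL = {b: 0.25 for b in BASES}
BIASED = {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10}

jc = hky_rate_matrix(1.0, EQUAL)        # JC: k = 1, equal frequencies
kimura = hky_rate_matrix(5.0, EQUAL)    # Kimura: k free, equal frequencies
hky = hky_rate_matrix(5.0, BIASED)      # HKY: k free, unequal frequencies
```

Under this parameterisation the JC model has no free substitution parameters, the Kimura model has one (k), and the HKY model has four (k and three free frequencies), which is the sense in which the models increase in complexity.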
All the phylogenetic trees in this study were constructed using the beta version of the PAUP* 4.0 program (Swofford, 1998).

2.3. Efficiency of the tree topology estimations

The efficiency of the tree-making algorithms for inferring the true tree was measured by the topological distance (d_T) (Penny and Hendy, 1985; Robinson and Foulds, 1981) between the inferred tree and the true tree. In general, the d_T value is roughly twice the number of interior branch interchanges required to obtain the true topology from the inferred tree (Rzhetsky and Nei, 1992). Because the probability of obtaining the true tree for a large data set is very small unless the sequences are considerably long (Nei et al., 1998), we did not consider here another measure of phylogenetic accuracy, the proportion or percentage of cases in which the correct tree is obtained (Tateno et al., 1982). When multiple tie trees were identified, the average of the d_T values between the true tree and all the tie trees was computed. The ML value for each tree obtained was also recorded. Because the results obtained for the VR topology were essentially the same as those obtained for the sequences generated under the CR assumption, we will primarily discuss the results obtained under the VR assumption. The results obtained using the CR topology are available as supplementary material.

3. Results

3.1. Efficiency of search algorithms

The results of our simulations are summarized in Tables 1–5 (see also supplementary Tables A–E). The d_T values shown are the average topological distance values from the true tree to each of the topologies found. To make the tables more comprehensible, the negative log likelihood values are not presented.
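The topological distance and its averaging over tie trees can be sketched as follows. This is a minimal illustration, assuming unrooted bifurcating trees written as nested tuples of taxon names over the same taxon set, and taking d_T as the number of interior-branch bipartitions present in one tree but not the other (the Robinson and Foulds measure cited in the methods). The tree encoding, the function names, and the six-taxon example trees are hypothetical and are not the trees used in the study.

```python
from statistics import mean


def _collect_clades(node, found):
    """Return the leaf set below `node`, recording each internal clade in `found`."""
    if isinstance(node, str):
        return frozenset([node])
    leaves = frozenset().union(*(_collect_clades(child, found) for child in node))
    found.append(leaves)
    return leaves


def nontrivial_splits(tree):
    """Set of bipartitions induced by interior branches, each stored as the
    side that does not contain a fixed reference taxon."""
    found = []
    all_leaves = _collect_clades(tree, found)
    reference = min(all_leaves)
    splits = set()
    for clade in found:
        side = all_leaves - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_leaves) - 2:
            splits.add(side)
    return splits


def topological_distance(tree_a, tree_b):
    """d_T: number of bipartitions found in exactly one of the two trees."""
    return len(nontrivial_splits(tree_a) ^ nontrivial_splits(tree_b))


def mean_distance_to_true(true_tree, tie_trees):
    """Average d_T between the true tree and every equally likely (tie) tree."""
    return mean(topological_distance(true_tree, t) for t in tie_trees)


# Hypothetical six-taxon examples.
TRUE_TREE = (("A", "B"), ("C", ("D", ("E", "F"))))
ONE_SWAP = (("A", "B"), ("D", ("C", ("E", "F"))))   # one interchange away
UNRELATED = (("A", "D"), ("B", ("E", ("C", "F"))))  # shares no interior branch
```

In this example a single interior-branch interchange gives d_T = 2, in line with the rule of thumb that d_T is roughly twice the number of interchanges, and two trees with no interior branch in common reach 2(6 − 3) = 6.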
However, we should note that in all cases examined the best log likelihood score belonged to the topology obtained under the combination of the most complex model and the most extensive search algorithm. Furthermore, for every data set analyzed, the log likelihood value increased gradually (i.e., became less negative) as the complexity of the analysis increased. The minimal value of the log likelihood score for each data set was obtained under the JC model with the simple SA search. Accordingly, the ML tree was always the tree obtained under the HKY + Γ model with the TBR search. However, as will be discussed below, in many cases the inferred tree that had the best log likelihood score (the ML tree) was not the true tree, as indicated by the topological distance values.

Table 1. Average topological distances from the true tree (d_T) (SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24 taxa data sets generated according to the HKY + Γ model with k = 5 and based on the VR topology)

Table 2. Average topological distances from the true tree (d_T) (SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24 taxa data sets generated according to the HKY + Γ model with k = 20 and based on the VR topology)

Table 3. Average topological distances from the true tree (d_T) (SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24 taxa data sets generated according to the JC + Γ model and based on the VR topology)

Table 4. Average topological distances from the true tree (d_T) (SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (24 taxa data sets generated according to the Kimura + Γ model and based on the VR topology)

Layout of Tables 1–4 (numerical entries not recoverable): columns are the gamma parameter a, followed by SA, SA + NNI, and SA + TBR under each of the JC, Kimura, and HKY + Γ models; rows are grouped into 300 nucleotides and 1000 nucleotides.

Table 5. Average topological distances from the true tree (d_T) (SE) of trees inferred according to the JC, Kimura, and HKY + Γ models under the ML criterion (48 taxa data sets)

Layout of Table 5 (numerical entries not recoverable): columns are SA and SA + NNI under each of the JC, Kimura, and HKY + Γ models; rows are Case 1, HKY + Γ (k = 4, a = 1.0, [A] = [T] = 0.15, [C] = [G] = 0.35), and Case 2, HKY + Γ (k = 10, a = 0.5, [A] = [T] = 0.10, [C] = [G] = 0.40).

Notes (Tables 1–5). JC, Jukes–Cantor; HKY, Hasegawa–Kishino–Yano; SA, stepwise-addition; NNI, nearest neighbor interchange; TBR, tree bisection-reconnection.

When average d_T values were compared among different search algorithms, the efficiency of the simplest search algorithm, SA, appeared to be much lower than that of the more extensive search algorithms SA + NNI and SA + TBR. In particular, in all cases examined the d_T values obtained by the SA + NNI or SA + TBR algorithms were smaller than those obtained by the SA algorithm by more than one interior branch. However, the relative efficiencies of the branch-swapping algorithms NNI and TBR appear to be very similar, as they showed nearly the same d_T values (a difference in d_T values was considered small if it did not exceed one internal branch, i.e., d_T ≤ 2). In some cases (e.g., Tables 1–4) the topologies found by the SA + TBR search had identical or even slightly higher d_T values than those identified by the SA + NNI search. In the presence of extreme rate variation among sites (e.g., a = 0.1 and 0.2) the difference between the performances of SA + NNI and SA + TBR was even more noticeable compared with the cases with moderate rate variation (e.g., a = 1).
In the latter cases the majority of data sets exhibited almost identical d_T values between the SA + NNI and SA + TBR searches.

3.2. 24 taxa topology

Table 1 presents the d_T values for the VR case when the sequence data were generated under the HKY + Γ model (k = 5). As one can see, these d_T values were rather high for short sequences generated under very low gamma parameter values (e.g., a = 0.1 and 0.2). In some extreme cases the inferred topologies differed from the true topology by approximately half of the interior branches regardless of the model and/or branch-swapping algorithm used. However, as the sequence length increased from 300 to 1000 bp, the efficiency of finding the true tree also increased (as indicated by smaller d_T values). Similar effects on the relative efficiency were observed when the extent of among-site rate variation decreased (i.e., as the value of a increased). For all three models examined (i.e., JC, Kimura, and HKY + Γ) the d_T values for SA + NNI and SA + TBR appeared to be roughly the same. Furthermore, when compared with SA, the SA + NNI heuristic search showed a significant decrease in the d_T values. We should also note that for 300 bp sequences the trees obtained under the relatively simple JC and Kimura models showed d_T values close to those obtained under the more complicated HKY + Γ model. That is, the incorrect simple models showed essentially the same efficiency in finding the true tree as the more complicated, but true, model (i.e., HKY + Γ). Similar results were observed for the CR case (supplementary Table A).

Table 2 shows the results for the VR case when the sequence data were generated under an extreme transition/transversion bias (k = 20). As in Table 1, the highest d_T values were observed for short sequences generated under low gamma parameter values. ML trees that were inferred with the SA-only search had the largest topological differences from the true tree.
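Why very low values of the gamma parameter are so damaging can be illustrated by sampling site rates directly. The sketch below assumes the usual mean-one parameterisation of the gamma distribution (shape a, scale 1/a); the threshold of 0.01 for calling a site "nearly invariant" and the function names are arbitrary choices made here for illustration.

```python
import random


def sample_site_rates(a, n_sites, rng):
    """Draw relative substitution rates for n_sites sites from a gamma
    distribution with shape a and scale 1/a, so that the mean rate is 1."""
    return [rng.gammavariate(a, 1.0 / a) for _ in range(n_sites)]


def fraction_nearly_invariant(rates, threshold=0.01):
    """Fraction of sites evolving at less than `threshold` times the mean rate."""
    return sum(r < threshold for r in rates) / len(rates)


if __name__ == "__main__":
    rng = random.Random(1)
    for a in (0.1, 0.2, 0.3, 1.0):
        rates = sample_site_rates(a, 1000, rng)
        frac = fraction_nearly_invariant(rates)
        print(f"a = {a}: {frac:.2f} of 1000 sites are nearly invariant")
```

With a = 0.1 roughly half of the sites are expected to fall below the threshold, while a few sites evolve extremely fast, so a 300 bp alignment carries far fewer phylogenetically informative sites than its length suggests. With a = 1.0 only about 1% of sites are that slow.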
The use of more extensive searches (SA + NNI, SA + TBR) led to a decrease in d_T values compared with those obtained using the SA search. Interestingly, approximately the same d_T values were observed for the trees inferred under both SA + NNI and SA + TBR, suggesting that these two searches found the true tree with similar efficiencies. Perhaps, for the purpose of finding the ML tree, the relatively simple SA + NNI search can be considered sufficient, yet computationally it is much more efficient than more extensive searches such as SA + TBR or SPR. In some cases we observed the SA + TBR search taking up to several days and even weeks on either Dual Intel PIII 500 or Sun Ultra 60 platforms, in contrast to only several hours for the SA + NNI search. Supplementary Table B shows that essentially the same results were obtained for the CR cases.

Tables 3 and 4 present the results when the sequence data were generated under the relatively simple JC + Γ and Kimura + Γ models, respectively. Similarly to the results presented in Tables 1 and 2, the increase in sequence length from 300 to 1000 bp significantly improves the performance of all three models of phylogenetic inference. The results also show that the simplest JC model gives essentially the same results as the Kimura or HKY + Γ models, especially when cases with a moderate value of the gamma parameter (a = 1.0) are considered. In two out of four cases the Kimura model outperformed HKY + Γ on the short sequences (Table 3). The SA + TBR search overall showed d_T values slightly better than those obtained under the SA + NNI search, although the observed improvements did not extend beyond 0.5 interior branches on average. When these results were compared with those obtained under the true model, it appeared that incorporating the rate variation among sites into the ML inference models slightly improves the performance of the phylogenetic inferences (see also supplementary Tables C and D). This trend was particularly noticeable when the SA + NNI search was employed, although the differences in d_T values among models were rather small, regardless of whether they were true models or not.

3.3. 48 taxa topology

To compare our results with those of Takahashi and Nei (2000), longer (1000 bp) sequences were generated under exactly the same conditions as those used in the latter study. Two cases were studied (Table 5). In both cases the true model of phylogenetic inference (i.e., HKY + Γ here) was found to perform worse than the simpler JC or Kimura models. In fact, the simplest JC model outperformed the two more complex models. In the case of the relatively low k value and less biased nucleotide frequencies (case 1), the d_T values for the simple algorithm, SA, were 9.6, 9.2, and 12.7 for the JC, Kimura, and HKY + Γ models, respectively. However, use of the SA + NNI branch-swapping algorithm reduced the d_T values for each model to 6.9, 7.3, and 9.0, respectively, with the best d_T value achieved under the JC model.
Because of the enormous amount of computational time required, the more extensive search algorithm SA + TBR was used only for a few data sets, and it appeared that the simple SA + NNI algorithm can indeed perform as efficiently as the more complex SA + TBR, as indicated by close d_T values (results not shown). Similar results were observed with empirical data, where both extensive and simple heuristic searches produced essentially the same topologies. In particular, the ML trees of primate MHC class I genes identified under the relatively simple heuristic search 10SA + NNI were essentially the same as those identified under the extensive 10SA + TBR search (Piontkivska and Nei, 2003). Interestingly, when more biased base frequencies and a higher value of k were used (case 2), the overall performance of all three models became essentially the same: d_T values of 9.3 versus 9.6 versus 10.6 for the SA + NNI search algorithm. However, as in case 1, the ML trees with the lowest d_T values were inferred under the JC model. Our results showed that extending the sequence length to 1000 nucleotides made the differences between the performance of the simple and complex models more noticeable. In case 1, the d_T values were 6.9 versus 9.0, compared with 10.7 versus 11.6 for 300 bp sequences, while in case 2 the corresponding d_T values were 9.3 versus 10.6, compared with 18.5 versus 18.7 for the short sequences (Table 5; see also Takahashi and Nei, 2000). Theoretically, the minimum and maximum possible d_T values for 48 taxa topologies are 0 and 90, respectively (Nei and Kumar, 2000). Thus, d_T = 9.3 means an error in the branching pattern (sequence partition) for approximately 5 interior branches, while d_T = 10.6 implies a difference in about 5.5 interior branches.
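As a worked check of these figures, assuming the standard bound for unrooted bifurcating trees with m taxa (which have m − 3 interior branches) and the rule of thumb from the methods that d_T is roughly twice the number of differing interior branches; the symbol m is introduced here only for illustration:

```latex
\begin{align*}
d_T^{\max} &= 2(m - 3) = 2(48 - 3) = 90, \\
\frac{9.3}{2} &= 4.65 \approx 5 \ \text{interior branches}, \\
\frac{10.6}{2} &= 5.3 \ \text{interior branches (quoted in the text as about 5.5)}.
\end{align*}
```

On this scale the two models differ by well under one interior branch out of the 45 in a 48 taxa tree.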
From the biological point of view, a difference of two internal branches between two 48 taxa topologies may be considered rather small (Penny and Hendy, 1985; Takahashi and Nei, 2000), yet statistically it appears significant (t test, p < 0.005).

4. Discussion

4.1. Search algorithms

The relative efficiency of different substitution models used in ML phylogenetic inference and the performance of different heuristic search algorithms under the ML criterion were examined using computer simulations. The results showed that when a relatively large number of sequences is used, the overall performance of the computationally less extensive SA + NNI search is essentially the same as the performance of the more extensive SA + TBR search. The similarity in the performance of these two heuristic searches became more prominent when relatively long (1000 bp) sequences were considered. In some cases SA + NNI was observed to outperform SA + TBR, although the differences among the inferred topologies could be considered rather small from the biological point of view (e.g., about two internal branches). Overall, our results demonstrated that the use of an extensive search algorithm, such as SA + TBR, does not guarantee finding the true tree, especially when sequences are short and/or the rate variation among sites is very large (i.e., when the gamma parameter a is low). As has been shown earlier by Nei et al. (1998), the ML method tends to give incorrect topologies when the number of sequences examined is large and the number of nucleotides is relatively small.

In our simulations the ML tree was always the one found under the combination of the most complex substitution model and the most extensive heuristic search (i.e., the HKY + Γ/SA + TBR combination) (for examples see supplementary Table E). However, when topological differences between the inferred ML tree and the true tree were examined, in many cases the ML tree was not the true tree or even the tree closest to the true tree. Furthermore, it appeared that in some cases when short (300 bp) sequences were generated under extremely low gamma parameter values (i.e., a = 0.1–0.3), the number of equally likely topologies (i.e., tie trees) found by the heuristic search reached three or more per data set. When tie trees were inferred, the TBR search usually identified at least the same number of tie trees as did the NNI search. In some extreme cases 44 or more tie trees were inferred. A similar trend was observed for the longer sequences, although the maximum number of equally likely topologies per data set was smaller than in the case of 300 bp sequences, and did not exceed 27 tie trees per data set. On the same set of sequences, the SA + TBR search produced a larger number of tie trees than the SA + NNI search (Piontkivska and Nei, unpublished results). However, the topological differences between the tie trees found by these two searches and the true tree, as measured by d_T values, were approximately the same. We should note that in cases where multiple tie trees were identified, none of them corresponded to the true tree (i.e., d_T > 0). Unlike simulated data, where the true tree may be known beforehand, for empirical data sets such trees are generally unknown. In the latter cases, each tie topology (i.e., equally likely topology) should be further examined with other sorts of data in order to decide whether this topology represents the true tree.

4.2. Choice of nucleotide substitution model

Our simulation results showed that when the number of nucleotides used is small relative to the number of sequences, the use of a simple substitution model, such as the JC or Kimura model, can be as efficient as, or even better than, the use of a more complicated model, such as HKY + Γ. These simple models appear to be quite efficient even when a relatively large number of sequences is used (Takahashi and Nei, 2000). Similar conclusions were reached for data sets with a small number of sequences (Bruno and Halpern, 1999; Yang, 1997). This phenomenon can partly be attributed to the amount of computational noise associated with the process of phylogenetic inference. When a complex model is used, the number of parameters to be estimated increases compared with the relatively simple models. Thus, the use of complex models leads to an increase in the amount of computational noise. On the other hand, the actual pattern of nucleotide substitutions is always unknown and perhaps even more complicated than any possible model. Moreover, the substitution pattern can change with evolutionary time, since even closely related species can exhibit substantial differences in codon usage (Anderson et al., 1993; Lloyd and Sharp, 1992; Shields, 1990; Tarrio et al., 2001), making it difficult to employ a model that is true for all the species included in the data set. It has been suggested that the likelihood ratio test should be used to select the appropriate model for ML tree construction (Huelsenbeck and Crandall, 1997; Posada and Crandall, 1998, 2001; Swofford et al., 1996). However, our simulations showed that the highest likelihood value would always be observed for the trees inferred under the most complex model among all models compared.
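For reference, the likelihood ratio test mentioned above can be sketched as follows. The sketch assumes nested models compared on the same topology and uses the usual chi-square approximation with a short table of 5% critical values; the log likelihood values in the example are hypothetical and are not results from this study, and the function name is illustrative.

```python
# Upper 5% critical values of the chi-square distribution, by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Compare two nested models by their maximized log likelihoods.

    Returns (statistic, reject_simple), where statistic = 2 * (lnL1 - lnL0)
    and reject_simple is True if the simple model is rejected at the 5% level.
    extra_params is the number of additional free parameters in the complex model.
    """
    if lnl_complex < lnl_simple:
        raise ValueError("a nested complex model cannot fit worse than the simple one")
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, statistic > CHI2_CRITICAL_05[extra_params]


# Hypothetical example: JC versus HKY + gamma, which adds k, three free
# base frequencies, and the gamma parameter a (five extra parameters).
stat, reject_jc = likelihood_ratio_test(-5234.7, -5102.3, extra_params=5)
```

Because the simpler model is a special case of the more complex one, the statistic can never be negative, so the test asks only whether the gain in fit exceeds what extra parameters would produce by chance. It says nothing directly about which model yields the topology closest to the true tree.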
At the same time, in most of the cases examined, these ML trees do not appear to be any closer to the true tree than those obtained under the less complicated models, that is, the trees that had lower log likelihood values (i.e., non-ML trees). An empirical study with a known phylogeny showed that models selected using the likelihood ratio test do not necessarily produce better topologies than a simpler model (Russo et al., 1996). This may potentially lead to the selection of wrong trees (or trees that are topologically further from the true tree than non-ML trees) as the ML tree. Use of the likelihood ratio test should be examined more carefully with both simulated and empirical data (Nei and Kumar, 2000; Takahashi and Nei, 2000).

The value of the gamma parameter a also influences the efficiency of different substitution models. Our results showed that extremely small values (such as a = 0.1 and 0.2) are generally associated with a decrease in the overall performance of every model. This holds true even if the substitution model used for the ML inference considers rate variation (see Tables 1–4). In terms of topological differences, better results were obtained for the data sets with a fairly moderate value of the gamma parameter (a = 1.0), and the topological distances between the inferred and the true trees increased as the gamma parameter of the data set decreased. Notably, the presence of a strong transition/transversion bias does not appear to heavily influence the overall performance of the substitution model used for ML phylogenetic inference.

Our results suggest that for relatively short sequences it is not necessary to use the most complex models and most extensive search algorithms available. In many cases the simple JC model and the SA + NNI search algorithm can be as efficient in finding the true tree as more complicated models and more extensive searches.
Although in terms of the negative log likelihood value the topologies inferred under the JC model do not appear to be ML trees when compared with the trees inferred under more complicated models, in terms of the topological distance these trees can be as close to the true tree as the latter, ML, trees. However, for longer sequences, the HKY + Γ model showed better d_T values than the other models. When different heuristic search algorithms are considered, the relatively simple SA + NNI search appears to be as efficient in finding the true topology as more extensive searches, such as SA + TBR (this study) or SA + SPR (Takahashi and Nei, 2000).

However, we should note that the results of the simulations depend greatly on the model tree topology used to simulate the sequence data. Since we have used only a few randomly generated topologies, our observations may be limited to similar types of topologies, although the molecular-clock hypothesis does not seem to affect the overall performance of different models. Further, the overwhelming complexity of the actual nucleotide substitution pattern poses the problem of model choice every time an empirical data set is considered. While the most sophisticated model might appear to be the most logical choice at the time, in many cases, especially when the number of sites is relatively small (Nei et al., 1998), choosing a simple model can be considered the most practical choice.

Acknowledgments

I thank Masatoshi Nei for our numerous inspirational discussions. I am also grateful to Wen-Hsiung Li and two anonymous reviewers for their comments on an earlier version of this manuscript. This work was supported by grants from NIH (GM20293) and NASA (NCC2-1057) to Masatoshi Nei.

References

Anderson, C.L., Carew, E.A., Powell, J.R., 1993. Evolution of the Adh locus in the Drosophila willistoni group: the loss of an intron, and shift in codon usage. Mol. Biol. Evol. 10.
Bruno, W.J., Halpern, A.L., 1999. Topological bias and inconsistency of maximum likelihood using wrong models. Mol. Biol. Evol. 16.
Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17.
Hasegawa, M., Kishino, H., Yano, T., 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 22.
Hedin, M.C., Maddison, W.P., 2001. A combined molecular approach to phylogeny of the jumping spider subfamily Dendryphantinae (Araneae: Salticidae). Mol. Phylogenet. Evol. 18.
Huelsenbeck, J.P., Crandall, K.A., 1997. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu. Rev. Ecol. Syst. 28.
Jin, L., Nei, M., 1990. Limitations of the evolutionary parsimony method of phylogenetic analysis. Mol. Biol. Evol. 7.
Jukes, T.H., Cantor, C.R., 1969. Evolution of protein molecules. In: Munro, H.N. (Ed.), Mammalian Protein Metabolism. Academic Press, New York.
Kimura, M., 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16.
Lemmon, A.R., Milinkovitch, M.C., 2002. The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc. Natl. Acad. Sci. USA 99.
Lloyd, A.T., Sharp, P.M., 1992. Evolution of codon usage patterns: the extent and nature of divergence between Candida albicans and Saccharomyces cerevisiae. Nucleic Acids Res. 20.
Nei, M., Kumar, S., 2000. Molecular Evolution and Phylogenetics. Oxford University Press, Oxford.
Nei, M., Kumar, S., Takahashi, K., 1998. The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. Proc. Natl. Acad. Sci. USA 95.
Olsen, G.J., Matsuda, H., Hagstrom, R., Overbeek, R., 1994. fastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10.
Ota, S., Li, W.H., 2000. NJML: a hybrid algorithm for the neighbor-joining and maximum-likelihood methods. Mol. Biol. Evol. 17.
Ota, S., Li, W.H., 2001. NJML+: an extension of the NJML method to handle protein sequence data and computer software implementation. Mol. Biol. Evol. 18.
Penny, D., Hendy, M.D., 1985. The use of tree comparison metrics. Syst. Biol. 34.
Piontkivska, H., Nei, M., 2003. Birth-and-death evolution in primate MHC class I genes: divergence time estimates. Mol. Biol. Evol. 20.
Posada, D., Crandall, K., 2001. Selecting the best-fit model of nucleotide substitution. Syst. Biol. 50.
Posada, D., Crandall, K.A., 1998. MODELTEST: testing the model of DNA substitution. Bioinformatics 14.
Reyes, A., Pesole, G., Saccone, C., 2000. Long-branch attraction phenomenon and the impact of among-site rate variation on rodent phylogeny. Gene 259.
Rice, K.A., Donoghue, M.J., Olmstead, R.G., 1997. Analyzing large data sets: rbcL revisited. Syst. Biol. 46.
Robinson, D.F., Foulds, L.R., 1981. Comparison of phylogenetic trees. Math. Biosci. 53.
Rogers, J.S., Swofford, D.L., 1998. A fast method for approximating maximum likelihoods of phylogenetic trees from nucleotide sequences. Syst. Biol. 47.
Russo, C.A., Takezaki, N., Nei, M., 1996. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Mol. Biol. Evol. 13.
Salter, L.A., 2001. Complexity of the likelihood surface for a large DNA dataset. Syst. Biol. 50.
Shields, D.C., 1990. Switches in species-specific codon preferences: the influence of mutation biases. J. Mol. Evol. 31.
Sullivan, J., Swofford, D.L., 2001. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 50.
Swofford, D.L., 1998. PAUP* Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland, MA.
Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M., 1996. Phylogenetic inference. In: Hillis, D.M., Moritz, C., Mable, B.K. (Eds.), Molecular Systematics, second ed. Sinauer, Sunderland, MA.
Takahashi, K., Nei, M., 2000. Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. Mol. Biol. Evol. 17, 1251.
17, Tarrio, R., Rodriguez-Trelles, F., Ayala, F.J., Shared nucleotide composition biases among species and their impact on phylogenetic

9 H. Piontkivska / Molecular Phylogenetics and Evolution 31 (2004) reconstructions of the drosophilidae. Mol. Biol. Evol. 18, Tateno, Y., Nei, M., Tajima, F., Accuracy of estimated phylogenetic trees from molecular data. I. Distantly related species. J. Mol. Evol. 18, Wolf, M.J., Easteal, S., Kahn, M., McKay, B.D., Jermiin, L.S., TrExML: a maximum-likelihood approach for extensive tree-space exploration. Bioinformatics 16, Yang, Z., How often do wrong models produce better phylogenies? Mol. Biol. Evol. 14,


More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Supplementary Note S2 Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Phylogenetic trees reconstructed by a variety of methods from either single-copy orthologous loci (Class

More information

Points of View. The Biasing Effect of Compositional Heterogeneity on Phylogenetic Estimates May be Underestimated

Points of View. The Biasing Effect of Compositional Heterogeneity on Phylogenetic Estimates May be Underestimated Points of View Syst. Biol. 53(4):638 643, 2004 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150490468648 The Biasing Effect of Compositional Heterogeneity

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/39

More information

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

More information

Inferring Complex DNA Substitution Processes on Phylogenies Using Uniformization and Data Augmentation

Inferring Complex DNA Substitution Processes on Phylogenies Using Uniformization and Data Augmentation Syst Biol 55(2):259 269, 2006 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 101080/10635150500541599 Inferring Complex DNA Substitution Processes on Phylogenies

More information

7. Tests for selection

7. Tests for selection Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study Li-San Wang Robert K. Jansen Dept. of Computer Sciences Section of Integrative Biology University of Texas, Austin,

More information

Performance-Based Selection of Likelihood Models for Phylogeny Estimation

Performance-Based Selection of Likelihood Models for Phylogeny Estimation Syst. Biol. 52(5):674 683, 2003 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150390235494 Performance-Based Selection of Likelihood Models for

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Accuracy and Power of the Likelihood Ratio Test in Detecting Adaptive Molecular Evolution

Accuracy and Power of the Likelihood Ratio Test in Detecting Adaptive Molecular Evolution Accuracy and Power of the Likelihood Ratio Test in Detecting Adaptive Molecular Evolution Maria Anisimova, Joseph P. Bielawski, and Ziheng Yang Department of Biology, Galton Laboratory, University College

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

BIOINFORMATICS DISCOVERY NOTE

BIOINFORMATICS DISCOVERY NOTE BIOINFORMATICS DISCOVERY NOTE Designing Fast Converging Phylogenetic Methods!" #%$&('$*),+"-%./ 0/132-%$ 0*)543768$'9;:(0'=A@B2$0*)A@B'9;9CD

More information

A Comparative Analysis of Popular Phylogenetic. Reconstruction Algorithms

A Comparative Analysis of Popular Phylogenetic. Reconstruction Algorithms A Comparative Analysis of Popular Phylogenetic Reconstruction Algorithms Evan Albright, Jack Hessel, Nao Hiranuma, Cody Wang, and Sherri Goings Department of Computer Science Carleton College MN, 55057

More information

Parsimony via Consensus

Parsimony via Consensus Syst. Biol. 57(2):251 256, 2008 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150802040597 Parsimony via Consensus TREVOR C. BRUEN 1 AND DAVID

More information