An Investigation of Phylogenetic Likelihood Methods

Tiffani L. Williams and Bernard M.E. Moret
Department of Computer Science, University of New Mexico, Albuquerque, NM 87131-1386
Email: {tlw,moret}@cs.unm.edu

Abstract

We analyze the performance of likelihood-based approaches used to reconstruct phylogenetic trees. Unlike other techniques such as Neighbor-Joining (NJ) and Maximum Parsimony (MP), relatively little is known regarding the behavior of algorithms founded on the principle of likelihood. We study the accuracy, speed, and likelihood scores of four representative likelihood-based methods (fastDNAml, MrBayes, PAUP*-ML, and TREE-PUZZLE) that use either Maximum Likelihood (ML) or Bayesian inference to find the optimal tree. NJ is also studied to provide a baseline comparison. Our simulation study is based on random birth-death trees, which are deviated from ultrametricity, and uses the Kimura 2-parameter + Gamma model of sequence evolution. We find that MrBayes (a Bayesian inference approach) consistently outperforms the other methods in terms of accuracy and running time.

I. INTRODUCTION

Evolutionary biology is founded on the concept that organisms share a common origin and have subsequently diverged through time. Phylogenies (typically formulated as trees) represent our attempts to reconstruct evolutionary history. Phylogenetic analysis is used in all branches of biology, with applications ranging from studies on the origin of human populations to investigations of the transmission patterns of HIV [14], and beyond, with a variety of uses in drug discovery, forensics, and security [1].

The accurate estimation of evolutionary trees is a challenging computational problem. For a given set of organisms (or taxa), the number of possible evolutionary trees grows super-exponentially with the number of taxa. There are more than two million different trees for 10 taxa, and roughly $2 \times 10^{182}$ different trees for 100 taxa. An exhaustive search through the tree space is certainly not an option.
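The growth in the number of trees can be checked directly. The sketch below assumes the standard count of unrooted bifurcating trees on $n$ labeled taxa, $(2n-5)!! = 3 \cdot 5 \cdots (2n-5)$; the function name is illustrative and not taken from any of the packages studied here.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 3 labeled taxa, (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # A tree on k-1 taxa has 2(k-1)-3 = 2k-5 edges, and taxon k can be
    # attached to any one of them, so each addition multiplies the count.
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


for n in (4, 10, 20):
    print(n, num_unrooted_trees(n))
```

For 10 taxa this gives 2,027,025 trees, and for 100 taxa the result has 183 decimal digits, which is why heuristics are unavoidable.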
Thus, scientists have designed a plethora of heuristics to assist them with phylogenetic analysis. Here, we study the performance of likelihood-based approaches for reconstructing phylogenetic trees. Likelihood is the probability of observing the data given a particular model. It is written $\Pr(D \mid M)$, read "the probability of $D$ given $M$," where $D$ is the observed data and $M$ is the model [18]. Different models may make the observed data more or less probable. For instance, suppose you toss a fair coin. After 100 independent tosses, you observe heads once and tails 99 times. Under the assumed model of a fair coin, the likelihood of this event is $100 \left(\tfrac{1}{2}\right)^{100}$, or $7.89 \times 10^{-29}$. For a coin tossed 100 times, there are $2^{100}$ different, equally likely sequences, and only 100 of those sequences yield a single head. Hence, the above event is not very likely if a fair coin is used. A more likely event is the appearance of 50 heads and 50 tails, which has a likelihood of $\binom{100}{50} \left(\tfrac{1}{2}\right)^{100}$, or $7.96 \times 10^{-2}$.

Many biologists prefer likelihood approaches since they are statistically well-founded, possibly producing the most accurate phylogenetic trees. However, techniques such as Neighbor-Joining (NJ) [22] and Maximum Parsimony (MP) are also popular, since they are relatively fast, whereas likelihood-based approaches are extremely slow, limiting their use to small problems. Many of the reconstruction methods used by biologists (distance methods, parsimony search, or quartet puzzling) have been studied extensively in an experimental environment [15], [], [25]. Yet, very little is known concerning the behavior of likelihood-based methods. Previous work has studied the performance of such algorithms in a limited context, where typically a single criterion is of interest (e.g., number of taxa, execution time, or accuracy) [3], [7], [11], [12], []. Our study is the first (to the best of our knowledge) to examine the behavior of likelihood methods under a variety of different conditions.
Specifically, our study varies the number of taxa (20, 40, 60), sequence length (100, 200, 400, 800, 1600), and evolutionary distance (0.1, 0.3, and 1.0). Space constraints force us to refer the reader to [27] for performance results with an evolutionary distance of 0.1. We study four representative likelihood methods (fastDNAml, MrBayes, PAUP*-ML, TREE-PUZZLE) that use either Maximum Likelihood (ML) or Bayesian inference to find the optimal tree. PAUP*-ML denotes using the ML procedure in PAUP* 4.0 [26].

NJ is also used to provide a baseline comparison. For all experiments, MrBayes (a Bayesian inference approach) clearly outperformed its competitors.

The rest of this paper is organized as follows. Section II provides background information on using likelihood in phylogenetic analysis. Our experimental methodology is presented in Section III. Results are presented in Section IV. Finally, Section V provides concluding remarks.

II. BACKGROUND

The principle of likelihood suggests that the explanation that makes the observed outcome the most likely occurrence is the one to be preferred. Formally, given data $D$ and a model $M$, the likelihood of that data is $L = \Pr(D \mid M)$, the probability of obtaining $D$ given $M$. In the context of phylogenetics, $D$ is the set of sequences of interest, and $M$ is a phylogenetic tree. Below, we describe two approaches that use the principle of likelihood: Maximum Likelihood (ML) and Bayesian inference.

A. Maximum Likelihood

The Maximum Likelihood (ML) estimate of a phylogeny is the tree for which the observed data are most probable. Consider the data as aligned DNA sequences for $n$ taxa. ML reconstruction consists of two tasks. The first task involves finding edge lengths that maximize the likelihood given a particular topology. Techniques to estimate branch lengths are based on iterative methods such as Expectation Maximization (EM) [2] or Newton-Raphson optimization [17]. Each iteration of these methods requires a large amount of computation. For a given tree, we must consider all possible nucleotides (A, G, C, T) at each interior node. The number of nucleotide combinations to examine for an $n$-taxon tree is $4^{n-2}$, since there are $n-2$ interior nodes.

The second task is to find a tree topology that maximizes the likelihood. Naïve, exhaustive search of the tree space is infeasible, since the number of bifurcating unrooted trees to be explored for $n$ taxa is $\frac{(2n-5)!}{2^{n-3}\,(n-3)!}$.
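To make the interior-node enumeration described above concrete, here is a minimal sketch that computes the likelihood of a single alignment column on a four-taxon tree by summing over all $4^{n-2} = 16$ assignments to its two interior nodes. It is an illustration under stated assumptions, not the procedure used by any of the programs studied: it substitutes the simpler Jukes-Cantor model for the substitution model, assumes uniform base frequencies, and all function names are hypothetical.

```python
import itertools
import math

NUCS = "ACGT"


def jc69(a: str, b: str, t: float) -> float:
    """Jukes-Cantor probability that state a becomes state b along an edge
    of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def quartet_site_likelihood(leaves, t_leaf, t_mid):
    """Likelihood of one alignment column on the unrooted tree ((A,B),(C,D)).

    leaves: the states observed at taxa A, B, C, D.
    t_leaf: the four pendant edge lengths, in the same order.
    t_mid:  length of the internal edge joining interior nodes u and v.
    """
    a, b, c, d = leaves
    total = 0.0
    # Brute-force sum over every joint assignment to the interior nodes.
    for u, v in itertools.product(NUCS, repeat=2):
        total += (0.25  # uniform frequency of the state at u (assumption)
                  * jc69(u, a, t_leaf[0]) * jc69(u, b, t_leaf[1])
                  * jc69(u, v, t_mid)
                  * jc69(v, c, t_leaf[2]) * jc69(v, d, t_leaf[3]))
    return total


def log_likelihood(columns, t_leaf, t_mid):
    """Sites are treated as independent, so per-site log likelihoods add."""
    return sum(math.log(quartet_site_likelihood(col, t_leaf, t_mid))
               for col in columns)


columns = ["AAAA", "AACC", "ACAC", "GGGT"]
print(log_likelihood(columns, t_leaf=(0.1, 0.1, 0.1, 0.1), t_mid=0.05))
```

Optimizing edge lengths means re-evaluating this quantity many times for each candidate topology, which is where the cost of ML methods comes from.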
Furthermore, finding the best tree is hampered by the costly procedure of estimating edge lengths for different trees. Instead, the most likely tree is constructed in a greedy fashion by using a stepwise-addition algorithm. An initial tree is created by starting with three randomly chosen taxa; there is only one possible topology for a set of three taxa. A fourth randomly chosen taxon is attached, in turn, to each branch of the initial tree. The tree with the highest score is used as the starting tree for adding a fifth randomly chosen taxon. The process continues until all $n$ taxa are added to the tree.

B. Bayesian inference

The Bayesian approach to phylogenetics builds upon a likelihood foundation. It is based on a quantity called the posterior probability of a tree. Bayes' theorem

$$\Pr(\text{Tree} \mid \text{Data}) = \frac{\Pr(\text{Data} \mid \text{Tree}) \, \Pr(\text{Tree})}{\Pr(\text{Data})}$$

is used to combine the prior probability of a phylogeny, $\Pr(\text{Tree})$, with the likelihood, $\Pr(\text{Data} \mid \text{Tree})$, to produce a posterior probability distribution on trees, $\Pr(\text{Tree} \mid \text{Data})$. The posterior probability represents the probability that the tree is correct, given the data and the model. Note that, unlike likelihood scores, posterior probabilities sum to 1. Inferences about the history of the group are then based on the posterior probability of trees. For example, the tree with the highest posterior probability might be chosen as the best estimate of phylogeny. Typically all trees are considered a priori equally probable, and likelihood is calculated using a substitution model of evolution.

Computing the posterior probability involves a summation over all trees and, for each tree, integration over all possible combinations of branch length and substitution model parameter values. A number of numerical methods are available to allow the posterior probability to be approximated. The most common is Markov Chain Monte Carlo (MCMC). For the phylogeny problem, the MCMC algorithm involves two steps. First, a new tree is proposed by stochastically perturbing the current tree.
Afterwards, the tree is either accepted or rejected with a probability described by Metropolis et al. [13] and Hastings [5]. If the new tree is accepted, then it is subjected to more perturbations. For a properly constructed and adequately run Markov chain, the proportion of the time that any tree is visited is a valid approximation of the posterior probability of that tree.

C. ML versus Bayesian inference

Bayesian analysis of phylogenies is similar to ML in that the user postulates a model of evolution and the program searches for the best tree(s) that are consistent with both the model and the data. Both attempt to estimate a conditional probability density function: $\Pr(\text{Data} \mid \text{Tree})$ for ML and $\Pr(\text{Tree} \mid \text{Data})$ for Bayesian methods. The salient difference between the two methods is that Bayesian analysis requires an additional parameter, the prior probability, that must be fixed in advance. Since the prior probability is unknown, using any fixed guess introduces a potential source of error. Another difference that is often argued is that ML seeks the single most likely tree, whereas Bayesian analysis searches for the best set of trees, leading to the construction of a consensus tree. However, such a difference is more a matter of traditional usage than one of algorithmic design. For example, the ML approach does not prohibit the user from keeping the top trees and computing their consensus.

III. EXPERIMENTAL METHODOLOGY

We use a simulation study in order to evaluate the likelihood-based methods under consideration. In a simulated environment, the true (or model) tree is artificially generated and forms the basis for comparison among phylogenetic methods. (With real data, the true tree is typically unknown.) The simulation process starts with evolving a single DNA sequence (placed at the root of the model tree) under some assumed stochastic model of evolution to a set of sequences (located at the leaves of the tree). These sequences are given to the phylogenetic reconstruction methods, which infer a tree based on the given sequences. The inferred trees are then compared against the model tree for topological accuracy. The above process is repeated many times to obtain a statistically significant test of the performance of the methods under these conditions.

A. Model Trees

We used random birth-death trees as model trees for our experiments. The birth-death trees were generated using the program r8s [23], using a backward Yule (coalescent) process with waiting times taken from an exponential distribution. An edge $e$ in a tree generated by this process has length $\lambda_e$, a value that indicates the expected number of times a random site changes on the edge. Trees produced by this process are ultrametric, meaning that all root-to-leaf paths have the same length. A random site is expected to change once on any root-to-leaf path; that is, the (evolutionary) height of the tree is 1. In our simulation study, we scale the edge lengths of the trees to give different heights (or evolutionary distances). To scale the tree to height $h$, we multiply each edge length of the tree by $h$. We used heights of 0.1, 0.3, and 1.0 in our study.
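The height scaling just described is simple enough to sketch. The snippet below assumes a hypothetical nested-list tree representation (a leaf is a string, an internal node is a list of `(child, edge_length)` pairs); it does not reflect the r8s file format or any tool used in the study.

```python
def leaf_depths(tree, depth=0.0):
    """Return the root-to-leaf path length of every leaf below `tree`."""
    if isinstance(tree, str):          # a leaf is just its taxon name
        return [depth]
    depths = []
    for child, length in tree:         # internal node: list of (child, length)
        depths.extend(leaf_depths(child, depth + length))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True when all root-to-leaf paths have the same length (within tol)."""
    depths = leaf_depths(tree)
    return max(depths) - min(depths) <= tol


def scale(tree, h):
    """Multiply every edge length by h; a height-1 tree becomes height h."""
    if isinstance(tree, str):
        return tree
    return [(scale(child, h), length * h) for child, length in tree]


# A small ultrametric tree of height 1 (made-up edge lengths).
unit_tree = [([("A", 0.4), ("B", 0.4)], 0.6),
             ([("C", 0.7), ("D", 0.7)], 0.3)]

for h in (0.1, 0.3, 1.0):
    scaled = scale(unit_tree, h)
    print(h, is_ultrametric(scaled), max(leaf_depths(scaled)))
```

Scaling by a constant preserves ultrametricity, so the scaled trees still satisfy a strict molecular clock.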
We also modify the edge lengths of birth-death trees to deviate the tree away from the assumption that sites evolve under a strong molecular clock. To deviate a tree away from ultrametricity with deviation $c$, we proceed as follows. For each edge $e \in T$, choose a number $x$ uniformly from $[-\ln c, \ln c]$. Afterwards, multiply the branch length $\lambda_e$ by $e^x$. Given deviation $c$, the expected value of this multiplier is $\frac{c - 1/c}{2 \ln c}$. In our experiments, the deviation factor $c$ was set to 4. Hence, each branch length is multiplied by an expected value of $(4 - 1/4)/(2 \ln 4) \approx 1.35$.

B. DNA Sequence Evolution

Once model trees are constructed, sequences are generated based on a model of evolution. The software used for sequence evolution is Seq-Gen [19]. We generated sequences under the Kimura 2-parameter (K2P) [10] + Gamma [28] model of DNA sequence evolution, one of the standard models for studying the performance of phylogenetic reconstruction methods. Under K2P, each site evolves down the tree under the Markov assumption. In this model, nucleotide substitutions are categorized according to whether they involve pyrimidines (C and T) or purines (A and G). A transition substitutes a purine for a purine or a pyrimidine for a pyrimidine, whereas a transversion is a substitution of a purine for a pyrimidine or vice versa. The probability of a given nucleotide substitution depends on the edge and on the type of substitution. We set the transition/transversion ratio, $\kappa$, to 2 (the standard setting). For rate heterogeneity, different rates were assigned to different sites according to the Gamma distribution with parameter $\alpha$ set to 1 (the standard setting).

C. Measures of Accuracy

We use the Robinson-Foulds (RF) distance [21] to measure the error between trees. Every edge $e$ in a leaf-labeled tree $T$ defines a bipartition $\pi_e$ on the leaves (induced by the deletion of $e$). The tree $T$ is uniquely characterized by the set $C(T) = \{\pi_e : e \in E(T)\}$, where $E(T)$ is the set of all internal edges of $T$.
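The bipartition set $C(T)$ can be computed with a single traversal. The sketch below assumes a hypothetical nested-tuple representation in which a leaf is a string and an internal node is a tuple of subtrees; the root of the tuple is only an artifact of the notation, so each bipartition is stored as an unordered pair of leaf sets.

```python
def _leaves(node):
    """All leaf names below a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in _leaves(child)]


def bipartitions(tree):
    """Set C(T): the bipartitions induced by the internal edges of `tree`."""
    all_leaves = frozenset(_leaves(tree))
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        # Keep only internal edges: both sides need at least two leaves.
        if len(clade) >= 2 and len(other) >= 2:
            splits.add(frozenset([clade, other]))  # unordered pair of sides
        return clade

    visit(tree)
    return splits


t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
c1, c2 = bipartitions(t1), bipartitions(t2)
print(len(c1), len(c2), len(c1 & c2))
```

Because the sets are ordinary Python sets, two trees on the same leaves can be compared with set difference and intersection.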
If $T$ is a model tree and $T'$ is the tree obtained by a phylogenetic reconstruction method, then the error in the topology can be calculated as follows:

False Positives (FP): $|C(T') - C(T)|$.
False Negatives (FN): $|C(T) - C(T')|$.

The RF distance is the average of the number of False Positives and False Negatives. Our figures plot the RF rates, which are obtained by normalizing the RF distance by the number of internal edges. Thus, the RF rate varies between 0% and 100%.

D. Phylogenetic Reconstruction Methods

We considered five phylogenetic reconstruction methods in our simulation study. Three of the methods (fastDNAml, TREE-PUZZLE, and PAUP*-ML) use ML to construct the inferred tree. MrBayes is the sole algorithm that uses a Bayesian approach. NJ was used to provide a baseline comparison. The software package PAUP* 4.0 was used to construct ML and NJ trees, which we denote by PAUP*-ML and NJ, respectively.

1) fastDNAml: fastDNAml [17] uses the stepwise-addition algorithm (described in Section II-A) to build a tree consisting of $n$ taxa. The best tree resulting from each step is subjected to local rearrangements, which explore the search space for a more likely tree. After all $n$ taxa are added to the tree, a global rearrangement stage may be invoked.

2) MrBayes: MrBayes [6] is a Bayesian inference method that incorporates both Markov Chain Monte Carlo (MCMC) (described in Section II-B) and Metropolis-Coupled Markov Chain Monte Carlo (MC$^3$) analysis [4]. MC$^3$ can be visualized as a set of independent searches that occasionally exchange information, which may allow a search to escape an area of local optima. For both types of analysis, the final output is a set of trees that the program has repeatedly visited. A majority-rule consensus tree may be built from the output trees.

3) TREE-PUZZLE: TREE-PUZZLE [24] is a quartet-puzzling (QP) algorithm. Let $S$ be the set of sequences labeling the leaves of an evolutionary tree $T$. A quartet of $S$ is a set of four sequences taken from $S$, and a quartet topology is an evolutionary tree on four sequences. The QP algorithm consists of three steps. In the ML step, all $\binom{n}{4}$ quartet ML trees are reconstructed to find the most likely relationship for each set of four out of $n$ sequences. During the puzzling step, these quartet trees are composed into an intermediate tree by adding sequences one by one. The result of this step is highly dependent on the order of sequences. As a result, many intermediate trees from different input orders are constructed. From these intermediate trees, a consensus tree (typically using a majority rule) is built in the consensus step.

4) PAUP*-ML: Similarly to fastDNAml, PAUP*-ML uses a stepwise-addition strategy to build the final tree. Local rearrangements occur after each taxon has been added to find a better tree. Once the final taxon has been added, a series of global rearrangements occur to find the final tree.

5) Neighbor-Joining: Neighbor-Joining [22] is one of the most popular distance-based methods. NJ uses a distance matrix to compute the resulting tree. For every pair of taxa, it determines a score based on the distance matrix. The algorithm joins the pair with the minimum score, making a subtree whose root replaces the two chosen taxa in the matrix.
Distances are recalculated based on this new node, and the joining continues until three nodes remain. These nodes are joined to form an unrooted binary tree. Appealing features of NJ are its simplicity and speed.

IV. EXPERIMENTAL RESULTS

Our simulation study explores the behavior of the phylogenetic methods under a variety of different conditions. We generated model trees for each tree size (20, 40, and 60 taxa) and scaled their heights by 0.1, 0.3, and 1.0 to vary the evolutionary distance. Afterwards, DNA sequences of lengths 100, 200, 400, 800, and 1600 were created from the scaled trees. All programs were run with the recommended default settings on Pentium 4 computers. Since MrBayes incorporates MC$^3$ analysis, we ran MrBayes with 2 chains, denoted MrBayes-CH2. Section IV-D considers the performance of MrBayes when the number of chains varies. For MrBayes, we used 6, generations, the current tree was saved every generations, and consensus analysis ignored the first 1 trees (the burn-in value).

A. Speed

Figure 1 presents the execution time required by the different phylogenetic methods. NJ is fast and requires less than 1 second to output a final tree. The likelihood approaches, on the other hand, require large amounts of time to complete a phylogenetic analysis. For 20 taxa, TREE-PUZZLE is the fastest likelihood-based algorithm, but the algorithm is unable to complete an analysis for either 40 or 60 taxa. (A three-day limit was placed on an experimental run.) The maximum likelihood calculation of $\binom{n}{4}$ quartet trees is the source of TREE-PUZZLE's inability to produce an answer within a reasonable amount of time. MrBayes is the only likelihood-based method that consistently produced a tree on all of our datasets. Its running time scales linearly with the sequence length. Neither fastDNAml nor PAUP*-ML terminated for 60 taxa. Instead, these ML methods cycled infinitely among equally likely trees.
With increases in both the number of taxa and sequence length, the likelihood score approaches negative infinity, which cannot be represented precisely in computer hardware. Bayesian approaches avoid infinite loops since the user must specify the number of generations an analysis will execute. (PAUP*-ML provides an option to place a time limit on the heuristic search, but we did not use it.) For 40 taxa, the ML methods require 1-3 hours of computation time. MrBayes, on the other hand, requires less than minutes of execution time on all datasets. MrBayes is quite fast for a likelihood-based approach. However, if compared to other methods (e.g., NJ, Weighbor, and Greedy MP), its slow nature becomes apparent [16].

B. Accuracy

Figure 2 shows the accuracy of the phylogenetic methods as a function of the sequence length. Although NJ is computationally fast, it is the worst performer in terms of accuracy, although NJ does outperform TREE-PUZZLE in Figure 2(a). All methods improve their RF rates as the sequence length is increased. Our results are consistent with those of Nakhleh et al. in that NJ's ability to produce a topologically correct tree degrades rapidly with an increase in evolutionary distance [16]. MrBayes and PAUP*-ML are less susceptible to changes in accuracy as the evolutionary distance increases. When PAUP*-ML terminated, the tree it produced was quite accurate. fastDNAml, on the other hand, finds it more difficult to produce a topologically accurate solution on trees with large evolutionary distances.

[Figure 1: four panels, (a) Taxa = 20, Height = 0.3; (b) Taxa = 20, Height = 1.0; (c) Taxa = 40, Height = 0.3; (d) Taxa = 40, Height = 1.0.]
Fig. 1. Execution time of the phylogenetic methods as a function of the sequence length for various values of taxa and height.

[Figure 2: four panels, (a) Taxa = 20, Height = 0.3; (b) Taxa = 20, Height = 1.0; (c) Taxa = 40, Height = 0.3; (d) Taxa = 40, Height = 1.0.]
Fig. 2. Average RF rate of the phylogenetic methods as a function of the sequence length for various values of taxa and height.

[Figure 3: vertical axis $-\ln L$.]
Fig. 3. Likelihood scores of fastDNAml, MrBayes-CH2, and PAUP*-ML as a function of sequence length for 40 taxa and a height of 1.0.

At 6, generations, MrBayes converges to a stable posterior estimate. However, the number of generations affects the overall running time. We considered the accuracy of MrBayes as the number of generations varies. Fewer generations (, and 4,) resulted in a serious drop in topological accuracy. Increases in the number of generations (,) resulted in relatively small changes in accuracy (not shown).

C. Likelihood scores

The likelihood, $L$, of a tree is often a very small number. So, likelihoods are expressed as natural logarithms and referred to as log likelihoods, $\ln L$. Since the log likelihood of a tree is typically negative, our graphs plot positive log likelihoods, or $-\ln L$. Consequently, in our graphs, the best likelihood is the one with the lowest score. Figure 3 plots the likelihood scores of fastDNAml, MrBayes-CH2, and PAUP*-ML as a function of sequence length. The likelihood scores increase linearly with sequence length. Similar behavior was observed with an increase in the number of taxa (not shown).

It is widely believed that higher likelihood scores correlate with an increase in topological accuracy. Yet, our results do not support this hypothesis. Although the likelihood scores returned by the phylogenetic methods are practically indistinguishable, the error rates are not. Once a method reaches its top trees, their likelihood values are so similar that selecting a tree essentially becomes a random choice.

D. MrBayes: A Closer Examination

Our results clearly establish MrBayes as the best likelihood-based method, so we take a closer look at its performance. MrBayes is capable of both Markov Chain Monte Carlo (MCMC) and Metropolis-Coupled Markov Chain Monte Carlo (MC$^3$) analysis.
Proponents of the MC$^3$ method claim that multiple chains enable the search to avoid being trapped in a suboptimal region. We study the performance of MrBayes when 1, 2, and 4 chains are used. Figure 4 plots the execution time of MrBayes for different numbers of chains as a function of the sequence length. Clearly, using multiple chains requires additional execution time. The increase in running time is due to chains swapping with each other to explore different parts of the tree space. An MCMC (single chain) analysis avoids such overhead. However, are multiple chains worth the extra computational effort? In many cases, the answer is no (see Figure 5). Running MrBayes with a single chain produced fairly good trees for low evolutionary rates (0.1, 0.3). As the evolutionary rates increase, multiple chains provide better solutions.

The results above show that MrBayes achieves consistent accuracy if two chains are used. Figure 6 plots the average RF rates of MrBayes-CH2 and NJ on trees of various heights. In this figure, MrBayes-0.1, MrBayes-0.3, and MrBayes-1.0 are the plots of MrBayes-CH2 for trees of height 0.1, 0.3, and 1.0, respectively. The accuracy of NJ degrades dramatically as the evolutionary distance is increased. MrBayes, on the other hand, seems to overcome this problem. On trees of large height, MrBayes-1.0 approaches the topological accuracy achieved on trees of lower height.

V. CONCLUDING REMARKS

This is the first study (to the best of our knowledge) to examine experimentally the performance of likelihood methods as the number of taxa, sequence length, and evolutionary distance vary. Our results clearly demonstrate that MrBayes is the best algorithm of the methods we studied. The ML methods (fastDNAml, PAUP*-ML, and TREE-PUZZLE) have difficulty producing an answer on a consistent basis. (When PAUP*-ML converged to a solution, the inferred tree was quite good.) We recommend running MrBayes with two chains for phylogenetic analysis. MC$^3$ analysis seems to be best suited for large evolutionary distances.
Although MrBayes runs relatively quickly, its running time is still a limiting factor. One cannot expect to use MrBayes, in its current form, to analyze large datasets (of several thousand taxa).

Although MrBayes produced the best trees, its likelihood score was indistinguishable from the scores returned by the other methods. For the top trees of a heuristic search, likelihood scores do not correlate well with topological accuracy. These results do not imply that such scores are unimportant, only that the correlation, which clearly exists over the tree space as a whole, is lost in regions of high likelihood.

There are many directions for future work. Combining disk-covering methods (DCMs) [8], [9] with Bayesian inference is one approach to speeding up a likelihood

Fig. 4. Execution time of MrBayes as a function of the sequence length for various values of taxa and height. Panels: (a) Taxa = 4, Height = 0.3; (b) Taxa = 4, Height = 1.0; (c) Taxa = 6, Height = 0.3; (d) Taxa = 6, Height = 1.0. Curves: MrBayes_CH1, MrBayes_CH2, MrBayes_CH4.

Fig. 5. Average RF rate for MrBayes as a function of the sequence length for 6 taxa and heights 0.3 and 1.0. Panels: (a) Taxa = 6, Height = 0.3; (b) Taxa = 6, Height = 1.0.

Fig. 6. Average RF rate for MrBayes and as a function of (a) sequence length and (b) number of taxa. MrBayes-0.1, MrBayes-0.3, and MrBayes-1.0 are the plots of for trees of height 0.1, 0.3, and 1.0, respectively. Panels: (a) Taxa = 6; (b) = 16.

calculation. The question of whether likelihood scores correlate with topological accuracy is an interesting one that needs further exploration.

VI. ACKNOWLEDGEMENTS

During the course of this research, Williams was supported by an Alfred P. Sloan Foundation Postdoctoral Fellowship in Computational Molecular Biology (Office of Science (BER), U.S. Department of Energy, Grant No. DE-FG3-2ER63426), and Moret was supported by the National Science Foundation under grants ACI -8144, EIA 1-195, EIA 1-21377, and DEB 1-79.

REFERENCES

[1] D. Bader, B. M. Moret, and L. Vawter. Industrial applications of high-performance computing for phylogeny reconstruction. In H. Siegel, editor, Proceedings of SPIE Commercial Applications for High-Performance Computing, volume 4528, pages 159-168, Denver, CO, Aug. 2001. SPIE.
[2] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368-376, 1981.
[3] N. Friedman, M. Ninio, I. Pe'er, and T. Pupko. A structural EM algorithm for phylogenetic inference. J. Comp. Biol., 9:331-353, 2002.
[4] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Keramidas, editor, Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156-163. Interface Foundation, 1991.
[5] W. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97-109, 1970.
[6] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754-755, 2001.
[7] J. P. Huelsenbeck, F. Ronquist, R. Nielsen, and J. P. Bollback. Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294:2310-2314, 2001.
[8] D. Huson, S. Nettles, and T. Warnow. Disk-covering, a fast-converging method for phylogenetic tree reconstruction. J. Comput. Biol., 6:369-386, 1999.
[9] D. Huson, L. Vawter, and T. Warnow. Solving large scale phylogenetic problems using DCM2. In ISMB99, pages 118-129, 1999.
[10] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16:111-120, 1980.
[11] B. Larget and D. L. Simon. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol., 16:750-759, 1999.
[12] P. O. Lewis. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol. Biol. Evol., 16:277-283, 1998.
[13] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087-1092, 1953.
[14] M. L. Metzker, D. P. Mindell, X.-M. Liu, R. G. Ptak, R. A. Gibbs, and D. M. Hillis. Molecular evidence of HIV-1 transmission in a criminal case. PNAS, 99(2):14292-14297, 2002.
[15] B. M. Moret, U. Roshan, and T. Warnow. Sequence length requirements for phylogenetic methods. In Proc. 2nd Int'l Workshop on Algorithms in Bioinformatics (WABI'02), volume 2452 of Lecture Notes in Computer Science, pages 343-356, 2002.
[16] L. Nakhleh, B. Moret, U. Roshan, K. St. John, and T. Warnow. The accuracy of fast phylogenetic methods for large datasets. In Proc. 7th Pacific Symp. on Biocomputing (PSB'02), pages 211-222. World Scientific Pub., 2002.
[17] G. J. Olson, H. Matsuda, R. Hagstrom, and R. Overbeek. fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci., 10:41-48, 1994.
[18] R. D. M. Page and E. C. Holmes. Molecular Evolution: A Phylogenetic Approach. Blackwell Science Inc., 1998.
[19] A. Rambaut and N. C. Grassly. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comp. Appl. Biosci., 13:235-238, 1997.
[20] V. Ranwez and O. Gascuel. Quartet-based phylogenetic inference: Improvements and links. Mol. Biol. Evol., 18:1103-1116, 2001.
[21] D. F. Robinson and L. R. Foulds. Comparison of phylogenetic trees. Mathematical Biosciences, 53:131-147, 1981.
[22] N. Saitou and M. Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406-425, 1987.
[23] M. J. Sanderson. r8s software package. Available from http://ginger.ucdavis.edu/r8s/.
[24] H. Schmidt, K. Strimmer, M. Vingron, and A. von Haeseler. TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics, 18:502-504, 2002.
[25] K. St. John, T. Warnow, B. M. Moret, and L. Vawter. Performance study of phylogenetic methods: (unweighted) quartet methods and neighbor-joining. J. Algorithms, 2003. To appear.
[26] D. Swofford. PAUP*: Phylogenetic analysis using parsimony (and other methods). Sinauer Associates, Sunderland, MA, 1996. Version 4.0.
[27] T. L. Williams and B. M. Moret. An investigation of phylogenetic likelihood methods. Technical Report TR-CS-2-33, The University of New Mexico, 2002.
[28] Z. Yang. Maximum likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol. Biol. Evol., 10:1396-1401, 1993.