LOWER BOUNDS ON SEQUENCE LENGTHS REQUIRED TO RECOVER THE EVOLUTIONARY TREE. (extended abstract submitted to RECOMB '99)

MIKLÓS CSŰRÖS AND MING-YANG KAO

Department of Computer Science, Yale University, New Haven, CT 06520; contact e-mail: csuros-miklos@cs.yale.edu.

Abstract. In this paper we study the sequence length requirements of distance-based evolutionary tree building algorithms in the Jukes-Cantor model of evolution. By deriving lower bounds on the sequence lengths required to recover the evolutionary tree topology correctly, we show that two algorithms, the Short Quartet Method and the Harmonic Greedy Triplets algorithm, have optimal sequence length requirements.

1. Introduction. A main area of computational biology is the development and analysis of evolutionary tree building algorithms [15]. By using biomolecular sequences, these algorithms have not only enabled the exploration of evolutionary relationships among species but also led to the discovery of new proteins and to the inference of transmission chains of viral diseases such as AIDS [19].

The evolutionary tree in biology can be defined as an edge-weighted binary tree, in which the nodes correspond to taxa, and edge weights correspond to the time of divergence between them [23]. Evolution in this model can be viewed as the broadcasting of a character sequence (the root sequence) along the edges from the root towards the leaves. On each edge, the sequence undergoes a certain number of changes or mutations, and the mutated sequences are observed at the leaves. By defining a meaningful mutation model that relates changes in character sequences to time of divergence, one can attempt to reconstruct the weighted binary tree. The primary goal of such reconstruction is to correctly recover the tree topology, i.e., the tree without the edge weights. A secondary goal is to estimate the edge weights as correctly as possible. By defining a probabilistic model of mutations, a probability distribution is defined over sets of $n$ sequences of length $\ell$, where $n$ is the number of leaves and $\ell$ is the length of an observed sequence.

An obvious requirement for a tree building algorithm, which we call computational efficiency, is that it runs in time polynomial in $n$ and $\ell$. It is also important to realize that the sequences are rather short, as the length of currently available biomolecular sequences ranges from a few hundred to a few thousand. Therefore, another principle to consider is statistical efficiency, which requires that the algorithm recovers the tree with high probability from sequences that have polynomial length in the number of taxa. Unfortunately, almost all the existing algorithms violate one or both of these principles.

Among the known algorithms, the class of parsimony algorithms [14] is the most popular. They attempt to compute a tree that minimizes the number of mutations leading to the observed sequences. Unfortunately, the problem of optimizing parsimony is NP-hard [9]. In addition, parsimony is not a consistent method [5, 13], in that increasing the length of sample sequences generated by an evolutionary tree ad infinitum may still not result in the correct topology being inferred. Consequently, statistical efficiency cannot be expected for all evolutionary trees, and computational efficiency can be achieved only by using heuristics to derive approximations to optimal solutions.

The class of distance-based algorithms is also widely used [23]. These algorithms first calculate pairwise evolutionary distances between taxa from the observed sequences and then build a tree from the resulting distance matrix. Many of these algorithms strive to find, among all possible trees, an evolutionary tree that fits the observed distances best according to some metric. Such an optimization task is provably NP-hard in known cases; see [8] for the $L_1$ and $L_2$ norms, and see [1] for $L_\infty$. However, a number of distance-based algorithms are computationally efficient. Examples include Neighbor-Joining [21, 20], Short Quartet [10], Harmonic Greedy Triplets [7], the algorithm of Farach and Kannan [12], and the algorithm of Cryan, Goldberg and Goldberg [6], to mention a few. The statistical efficiency of these algorithms has been analyzed mostly in a simple model of evolution, the Jukes-Cantor model.
The Short Quartet (SQM) and Harmonic Greedy Triplets (HGT) algorithms have been shown to be statistically efficient in this model, whereas the Neighbor-Joining algorithm and that of Farach and Kannan have exponential sample length requirements [2, 11]. In this paper we derive lower bounds on the sample length requirements for the efficiency of any distance-based algorithm, which match the bounds of HGT and SQM tightly.

2. Model of sequence evolution.

2.1. The Jukes-Cantor model of sequence evolution. From a sample of aligned biological sequences (e.g., see Figure 2.1), one can build a tree that provides a stochastic model of the succession of pointwise mutations leading to the observed sequences. The elements of the sequences are taken from a finite alphabet such as that of amino acids, nucleic acids, or codons. The nodes of such a tree represent taxa, each corresponding to a sequence. The leaves represent terminal taxa, which correspond to observable sequences; the nonleaf nodes represent ancestors of the terminal taxa, which correspond to unobservable sequences. The mutations occur along the edges.

    Woolly mammoth (MPR)     CTAAATCATCACTGATCAAAGAGAGC
    African elephant (LAF)   CTAAATCATCACCGATCAAAGAGAGC
    Asian elephant (EMA)     CTAAATCATCGCTGATCAAAGAGAGC
    Dugong (DDG)             TTAAATCACTCCCGATCATAAAGGAGC
    Manatee (TMA)            TCAAATCATTACTGACCATAAAGGAGC

Fig. 2.1. Taken from [18], this example uses the sequences of 12S ribosomal RNA and cytochrome b in mitochondrial DNA to establish the evolutionary relationship among the woolly mammoth and its extant relatives. The sequences shown are for 12S rRNA, base positions 81–110 (the original figure marks character positions 81, 91, and 101), omitting deletions that are common in the five taxa above.

To formalize this modeling, this paper employs the generalized Jukes-Cantor model [16, 17] of sequence evolution, defined as follows. Let $m \ge 2$ and $n \ge 3$ be two integers. Let $A = \{a_1, \ldots, a_m\}$ be a finite alphabet. An evolutionary tree $T$ for $A$ is a rooted binary tree of $n$ leaves with an edge mutation probability $p_e$ for each tree edge $e$. The edge mutation probabilities are bounded away from $0$ and $1 - \frac{1}{m}$, i.e., there exist $f$ and $g$ such that for every edge $e$ of $T$,

$$0 < f \le p_e \le g < 1 - \frac{1}{m}.$$

Given a sequence $s_1 \cdots s_\ell \in A^\ell$ associated with the root of $T$, a set of $n$ mutated sequences in $A^\ell$ is generated by $\ell$ random labelings of the tree at the nodes. These $\ell$ labelings are mutually independent. The labelings at the $j$-th leaf give the $j$-th sequence $s^{(j)}_1 \cdots s^{(j)}_\ell$, and the $i$-th labeling of the tree gives the $i$-th symbols $s^{(1)}_i, \ldots, s^{(n)}_i$. The $i$-th labeling is carried out from the root towards the leaves along the edges. The root is labeled by $s_i$. On edge $e$, the child's label is the same as the parent's with probability $1 - p_e$, or is different with probability $\frac{p_e}{m-1}$ for each different symbol. Such mutations of symbols along the edges are mutually independent.
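
The generative process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based tree representation, the node names, the function names `mutate` and `evolve`, and the example mutation probabilities are all hypothetical and do not come from the paper.

```python
import random

def mutate(symbol, p_e, alphabet, rng):
    """Child's symbol on an edge with mutation probability p_e: unchanged with
    probability 1 - p_e, otherwise one of the m - 1 other symbols, each of
    which is therefore chosen with probability p_e / (m - 1)."""
    if rng.random() < p_e:
        others = [a for a in alphabet if a != symbol]
        return rng.choice(others)
    return symbol

def evolve(tree, root_sequence, alphabet, rng, root="root"):
    """Broadcast root_sequence from the root towards the leaves.

    tree maps an internal node name to a list of (child, p_e) pairs; leaves do
    not appear as keys (a hypothetical representation chosen for this sketch).
    Every site mutates independently on every edge.
    Returns a dict {leaf name: observed sequence}.
    """
    observed = {}
    stack = [(root, list(root_sequence))]
    while stack:
        node, seq = stack.pop()
        children = tree.get(node, [])
        if not children:
            observed[node] = "".join(seq)
            continue
        for child, p_e in children:
            child_seq = [mutate(s, p_e, alphabet, rng) for s in seq]
            stack.append((child, child_seq))
    return observed

# Example: m = 4 (nucleotides), n = 4 leaves, sequence length 1000.
ALPHABET = "ACGT"
TREE = {
    "root": [("u", 0.05), ("v", 0.05)],
    "u": [("A", 0.10), ("B", 0.10)],
    "v": [("C", 0.10), ("D", 0.10)],
}
rng = random.Random(0)
root_seq = "".join(rng.choice(ALPHABET) for _ in range(1000))
leaves = evolve(TREE, root_seq, ALPHABET, rng)
```

Only the sequences in `leaves` would be visible to a tree building algorithm; the root sequence and the sequences at `u` and `v` play the role of the unobservable ancestors.
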
The topology $\mathcal{T}(T)$ of $T$ is the unrooted tree obtained from $T$ by omitting the edge mutation probabilities and replacing the two edges between the root and its children with a single edge. We further require that the leaves of $\mathcal{T}(T)$ are labeled with the same sequences as in $T$, but $\mathcal{T}(T)$ need not be labeled otherwise. Our task is to design a learning algorithm that takes the $n$ mutated sequences of length $\ell$ as input and recovers $\mathcal{T}(T)$ with high probability.

2.2. Evolutionary distance. Distance-based tree building methods first calculate pairwise evolutionary distances between the terminal taxa, and subsequently build the tree based on these values. The function $\Delta$ is a distance metric on $T$ if and only if the following four conditions hold.

1. $\Delta$ takes values on ordered pairs of nodes, and for any two nodes $X$ and $Y$ in $T$, $\Delta_{XY} \in [0, \infty)$.
2. $\Delta$ is symmetric, i.e., for any two nodes $X$ and $Y$, $\Delta_{XY} = \Delta_{YX}$.
3. $\Delta$ is additive, i.e., for any three nodes $X$, $Y$, and $Z$ such that $Y$ lies on the tree path between $X$ and $Z$, $\Delta_{XZ} = \Delta_{XY} + \Delta_{YZ}$.
4. For two nodes $X$ and $Y$ such that $\Pr\{X = Y\} = 1$, $\Delta_{XY} = 0$. In particular, $\Delta_{XX} = 0$.

For an edge $e$ with endpoints $X$ and $Y$, $\Delta_{XY}$ is referred to as the edge length of $e$, denoted by $\Delta_e$. The function $\sigma$ defined as $\sigma_{XY} = e^{-\Delta_{XY}}$ measures the similarity of the tree nodes. By the properties of the distance metric, $0 < \sigma_{XY} \le 1$, $\sigma_{XY} = \sigma_{YX}$, and $\sigma$ is multiplicative along any tree path. For two nodes $X$ and $Y$, let

$$p_{XY} = \Pr\{X \ne Y\}.$$

Also, for brevity, let

$$\alpha = \frac{m}{m-1}.$$

The following theorem is well known in the literature; for a formal proof, see, for example, [7].

Theorem 2.1. Define the function $\Delta$ on every node pair $X, Y$ as

$$\Delta_{XY} = -\ln\left(\Pr\{X = Y\} - \frac{1}{m-1}\Pr\{X \ne Y\}\right) = -\ln(1 - \alpha p_{XY}).$$

Then $\Delta$ is a distance metric in the generalized Jukes-Cantor model of evolution.

One might wonder if there are other possible distance metrics in this model. It is evident that if $\Delta$ is a distance metric, then $c\Delta$ is a distance metric as well, for any $c > 0$. The following lemma shows that if $\Delta$ is a function of $p_{XY}$, then that function is uniquely determined up to a scaling factor.
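
Before moving on, the additivity asserted by Theorem 2.1 can be checked numerically on a two-edge path X - Y - Z. The sketch below is illustrative only: the function names `jc_distance` and `compose` and the example probabilities are assumptions introduced here. It uses the fact that Z equals X either when neither edge mutates or when the second edge reverts the first mutation, which happens with probability 1/(m-1) times the second edge's mutation probability.

```python
import math

def jc_distance(p, m):
    """Distance of Theorem 2.1: -ln(1 - alpha * p) with alpha = m / (m - 1)."""
    alpha = m / (m - 1)
    return -math.log(1.0 - alpha * p)

def compose(p1, p2, m):
    """Pr{X != Z} on a path X - Y - Z with mutation probabilities p1 and p2.

    Z equals X if both edges keep the symbol, or if the first edge changes it
    and the second edge changes it back to the original symbol.
    """
    same = (1.0 - p1) * (1.0 - p2) + p1 * p2 / (m - 1)
    return 1.0 - same

m = 4
p_xy, p_yz = 0.10, 0.25
p_xz = compose(p_xy, p_yz, m)

# Additivity: the distance of the whole path equals the sum over its edges.
lhs = jc_distance(p_xz, m)
rhs = jc_distance(p_xy, m) + jc_distance(p_yz, m)
```

Algebraically, `compose` returns p1 + p2 - alpha * p1 * p2, so 1 - alpha * p_XZ factors as (1 - alpha * p1)(1 - alpha * p2), which is exactly the multiplicativity of the similarity along a path.
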

Lemma 2.2. Let $\Delta$ be an additive distance along any tree path, defined by $\Delta_{XY} = \varphi(\Pr\{X \ne Y\})$, where $\varphi : [0, 1 - 1/m) \to [0, +\infty)$ is a function with $\lim_{x \to +0} \varphi(x) = \varphi(0)$. Then there exists a real number $c$ such that $\varphi(p) = -c \ln(1 - \alpha p)$ for all $p$.

Proof. Assume that $X$, $Y$, and $Z$ are three consecutive nodes on a tree path, with $Y$ being a child of $X$ and $Z$ a child of $Y$. Let $p_{XY} = p_{YZ} = p$. Then $p_{XZ} = 2p - \alpha p^2$ because $\sigma$ is multiplicative. Since $\Delta$ is additive, $2\varphi(p) = \varphi(2p - \alpha p^2)$ for every $p$. Define $\psi_1(x) = (1 - e^{-x})/\alpha$ and $\psi_2(x) = \varphi(\psi_1(x))$, for $x \in [0, \infty)$. Then $2\psi_2(x) = \psi_2(2x)$. We prove by contradiction that there exists a real number $c$ such that $\psi_2(x) = cx$ for all $x$. Assume that there is no such constant, and thus there exist $x, v > 0$ and $u \ne 0$ such that

$$\frac{\psi_2(x + v)}{x + v} = \frac{\psi_2(x)}{x} + u.$$

Define the series $a_k = (x + v)/2^k$ and $b_k = x/2^k$ for $k \ge 0$. Since $\psi_2(2x) = 2\psi_2(x)$, $\psi_2(a_k)/a_k = \psi_2(x + v)/(x + v)$ and $\psi_2(b_k)/b_k = \psi_2(x)/x$. Therefore, $\psi_2(a_k) = \psi_2(b_k) + u(1 + v/x)$, and thus $\lim_{k \to \infty} \psi_2(a_k) \ne \lim_{k \to \infty} \psi_2(b_k)$. Since $\lim_{k \to \infty} \psi_1(a_k) = \lim_{k \to \infty} \psi_1(b_k) = 0$, $\varphi$ cannot be continuous at $0$, by virtue of the fact that $\lim_{k \to \infty} \varphi(\psi_1(a_k)) \ne \lim_{k \to \infty} \varphi(\psi_1(b_k))$.

3. Evolutionary tree building algorithms.

3.1. Estimation of evolutionary distances. Distance-based algorithms start by estimating evolutionary distances between terminal taxa. If $X$ and $Y$ are leaves, their similarity can be estimated using sample sequences as

$$\hat\sigma_{XY} = \frac{1}{\ell} \sum_{i=1}^{\ell} I_{X_i Y_i}, \qquad (3.1)$$

where $X_1, \ldots, X_\ell$ and $Y_1, \ldots, Y_\ell$ are the symbols at positions $1, \ldots, \ell$ of the observed sample sequences for the two leaves, and

$$I_{xy} = \begin{cases} -\frac{1}{m-1} & \text{if } x \ne y; \\ 1 & \text{if } x = y. \end{cases}$$

The distance between $X$ and $Y$ is estimated as

$$\hat\Delta_{XY} = \begin{cases} -\ln \hat\sigma_{XY} & \text{if } \hat\sigma_{XY} > 0; \\ \infty & \text{otherwise.} \end{cases} \qquad (3.2)$$
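
Equations (3.1) and (3.2) translate directly into code. The following is a minimal sketch; the function names `estimate_similarity` and `estimate_distance` and the two toy sequences are assumptions made for illustration.

```python
import math

def estimate_similarity(x, y, m):
    """Equation (3.1): average of the indicator I over aligned positions,
    where I is 1 on a match and -1/(m - 1) on a mismatch."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be aligned and non-empty")
    total = sum(1.0 if a == b else -1.0 / (m - 1) for a, b in zip(x, y))
    return total / len(x)

def estimate_distance(x, y, m):
    """Equation (3.2): -ln of the estimated similarity when it is positive,
    infinity otherwise."""
    sigma_hat = estimate_similarity(x, y, m)
    return -math.log(sigma_hat) if sigma_hat > 0 else math.inf

# Toy example with m = 4: nine matches and one mismatch out of ten positions.
seq_x = "ACGTACGTAC"
seq_y = "ACGTACGTAA"
sigma_hat = estimate_similarity(seq_x, seq_y, 4)
delta_hat = estimate_distance(seq_x, seq_y, 4)
```

Returning infinity for a non-positive similarity estimate mirrors the second case of (3.2): such a pair is so dissimilar in the sample that no finite distance can be assigned to it.
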

Distance-based algorithms build the tree by using the distance estimates $\hat\Delta$ between leaves. In order to recover the topology successfully, these estimates have to be close to the true distances. The next lemma gives a lower bound on the sequence length required for an accurate estimation.

Lemma 3.1. Let $0 < \epsilon < 1$ and $0 < \delta < 1$. For any leaf pair $X$ and $Y$, if

$$\Pr\left\{\hat\Delta_{XY} - \Delta_{XY} \ge -\ln(1 - \epsilon f)\right\} \le \delta,$$

then

$$\ell = \Omega\left(\frac{1}{\epsilon^2 f^2 \sigma_{XY}^2}\right).$$

Proof. (Sketch.) First observe that $\hat\sigma_{XY}$ can be viewed as a linear transformation of a binomially distributed random variable with parameters $\ell$ and $(1 - \sigma_{XY})/\alpha$. The proof bounds the rate of convergence of that binomial random variable by the tail of the standard normal distribution, using the Berry-Esseen Theorem [3] about the convergence rate in the Central Limit Theorem. The resulting integral is bounded by using a Taylor series approximation.

3.2. Distance matrices. A distance matrix is a symmetric $n \times n$ matrix in which diagonal entries are zero and non-diagonal entries are positive. An additive distance matrix is a distance matrix $D$ for which there exists an evolutionary tree with leaves $1, \ldots, n$ such that $D[i, j] = \Delta_{ij}$. A distance-based algorithm is defined as a partial function $F$ on the set of distance matrices such that for any $D$, either $F(D) = \text{fail}$ or $F(D)$ is a topology. Each $T$ defines a distance matrix $D_T$ by the distances $\Delta_{XY}$ between pairs of leaves, so that $\mathcal{T}(T)$ is uniquely determined by $D_T$ [4]. We assume that $F(D_T) = \mathcal{T}(T)$.

Based on ideas of Atteson [2] and Erdős et al. [11], we define the following method to construct trees that define distance matrices close to the one defined by $T$. Let $e$ be an edge of $T$. By contracting the edge $e$ and preserving the edge lengths of every other edge, one obtains the non-binary edge-weighted tree $T'$. $T'$ has exactly one vertex adjacent to four edges. Subsequently, this vertex can be replaced by an edge $e'$ with a positive edge weight $\Delta_{e'} \ge -\ln(1 - \alpha f)$.
In this way, one can obtain three trees with different topologies, one of which has the same topology as $T$. If an evolutionary tree $T''$ can be obtained from $T$ with $\Delta_{e'} = x$ in this manner, and $\mathcal{T}(T'') \ne \mathcal{T}(T)$, then $T''$ and $T$ have a similar topology, denoted by $T \vdash^{e,x} T''$. Let $T''$ define the distance metric $\Delta''$. Define $C_e$ as the set of leaf pairs $XY$ for which $\Delta_{XY} \ne \Delta''_{XY}$. If $x = \Delta_e$, then the matrices $D_T$ and $D_{T''}$ differ only at the entries corresponding to $C_e$, by $\Delta_e$.
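
The smallest instance of this construction is a four-leaf tree whose only internal edge is e. The sketch below builds the distance matrix of the tree AB|CD and of the rearranged tree AC|BD with the same internal edge length, then collects the entries that change. The function name `quartet_distances`, the leaf names, and the numeric edge lengths are all hypothetical choices made for illustration.

```python
from itertools import combinations

def quartet_distances(split, pendant, internal):
    """Leaf-to-leaf path lengths in a four-leaf tree.

    split    -- two pairs of leaves, one pair on each side of the internal edge
    pendant  -- dict {leaf: length of the edge attached to that leaf}
    internal -- length of the single internal edge
    """
    side = {leaf: i for i, pair in enumerate(split) for leaf in pair}
    dist = {}
    for x, y in combinations(sorted(pendant), 2):
        d = pendant[x] + pendant[y]
        if side[x] != side[y]:
            d += internal  # the path between x and y crosses the internal edge
        dist[(x, y)] = d
    return dist

pendant = {"A": 0.3, "B": 0.2, "C": 0.4, "D": 0.1}
delta_e = 0.05

d_t = quartet_distances((("A", "B"), ("C", "D")), pendant, delta_e)   # tree T
d_t2 = quartet_distances((("A", "C"), ("B", "D")), pendant, delta_e)  # tree T''

# C_e: the leaf pairs whose distance changes, with the signed change.
c_e = {pair: d_t2[pair] - d_t[pair]
       for pair in d_t
       if abs(d_t2[pair] - d_t[pair]) > 1e-12}
```

In this example four of the six leaf pairs change, each by the internal edge length in absolute value, while the pairs (A, D) and (B, C) cross the internal edge in both trees and are unaffected. When the internal edge is short, the two matrices are close, which is what makes the two topologies hard to tell apart from estimated distances.
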

Theorem 3.2. Let $T$ and $T'$ be two trees such that $T \vdash^{e,\Delta_e} T'$. Let $\hat D$ be a distance matrix corresponding to $\hat\Delta$ on leaf pairs, calculated from a sample of length $\ell$ that is generated by either $T$ or $T'$. Suppose that $F$ has a failure probability less than $\delta$ on sequences of length $\ell$. In other words, with probability at least $1 - \delta$, $F(\hat D) = \mathcal{T}(T)$ when $\hat D$ is generated by $T$, and $F(\hat D) = \mathcal{T}(T')$ when $\hat D$ is generated by $T'$. Then

$$\ell = \Omega\left(\frac{1}{p_e^2 \max_{XY \in C_e} \sigma_{XY}^2}\right).$$

Proof. (Sketch.) The proof uses Lemma 3.1 to prove a lower bound on $\ell$ by showing that when $\hat D$ comes from a shorter sample, $F$ cannot recognize both topologies with high probability.

4. Optimal algorithms. Similarly to [22, 11, 10], we define the notion of depth as follows. The g-depth of a node in a rooted tree is the smallest number of edges in a path from the node to a leaf. Let $e$ be an edge between nodes $u_1$ and $u_2$ in a rooted tree $T'$. Let $T'_1$ and $T'_2$ be the subtrees of $T'$ obtained by cutting $e$ which contain $u_1$ and $u_2$, respectively. The g-depth of $e$ in $T'$ is the larger of the g-depth of $u_1$ in $T'_1$ and that of $u_2$ in $T'_2$. The g-depth of a rooted tree is the largest possible g-depth of an edge in the tree. (We add the prefix g to the term depth because this usage of depth is nonstandard in graph theory.) Define $d$ as the g-depth of $T$. Then $d \le 1 + \lfloor \log_2(n - 1) \rfloor$. As a corollary of Theorem 3.2, we obtain the following result.

Corollary 4.1. For every $2 < d \le 1 + \lfloor \log_2(n - 1) \rfloor$, there is a tree $T$ with g-depth $d$ such that any algorithm $F$ needs sample sequences of length

$$\ell = \Omega\left(\frac{1}{f^2 (1 - \alpha g)^{4d}}\right)$$

to recover $\mathcal{T}(T)$ with probability $1 - o(1)$.

Proof. (Sketch.) The proof consists of constructing a tree $T$ such that it has an internal edge $e$ with $p_e = f$, every other edge $e'$ has $p_{e'} = g$, and $d$ equals the g-depth of the endpoints of $e$ in the subtrees obtained when $T$ is cut at $e$.

The diameter of $T$ is defined as the maximum number of edges in a path between leaves of $T$. The diameter is always at least as large as $2d$ and can be even $\Omega(n)$.
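
To make the difference between g-depth and diameter concrete, the following sketch computes both for a balanced tree and for a caterpillar tree on eight leaves. It is an illustration under assumptions: trees are given as hypothetical dictionaries mapping each internal node to its children, paths are counted as undirected, a degree-two root is not treated as a leaf, and all function and node names are invented for this example.

```python
from collections import deque

def adjacency(children):
    """Undirected adjacency sets from a {parent: [child, ...]} dictionary."""
    adj = {}
    for parent, kids in children.items():
        for kid in kids:
            adj.setdefault(parent, set()).add(kid)
            adj.setdefault(kid, set()).add(parent)
    return adj

def bfs(adj, start, blocked=None):
    """Edge counts from start to every node reachable without crossing the
    blocked edge (given as a set of its two endpoints)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if blocked is not None and {u, v} == blocked:
                continue
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def g_depth_and_diameter(children):
    adj = adjacency(children)
    leaves = {v for v in adj if v not in children}

    def node_depth(node, edge):
        # Nearest leaf in the component that remains after cutting the edge.
        reach = bfs(adj, node, blocked=edge)
        return min(reach[leaf] for leaf in leaves if leaf in reach)

    g_depth = 0
    for u in adj:
        for v in adj[u]:
            edge = {u, v}
            g_depth = max(g_depth, node_depth(u, edge), node_depth(v, edge))

    diameter = max(
        max(d for node, d in bfs(adj, leaf).items() if node in leaves)
        for leaf in leaves
    )
    return g_depth, diameter

BALANCED = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"],
            "c": ["L1", "L2"], "d": ["L3", "L4"],
            "e": ["L5", "L6"], "f": ["L7", "L8"]}
CATERPILLAR = {"root": ["L1", "u1"], "u1": ["L2", "u2"], "u2": ["L3", "u3"],
               "u3": ["L4", "u4"], "u4": ["L5", "u5"], "u5": ["L6", "u6"],
               "u6": ["L7", "L8"]}

balanced_stats = g_depth_and_diameter(BALANCED)
caterpillar_stats = g_depth_and_diameter(CATERPILLAR)
```

Under these conventions the balanced tree has g-depth 3 and diameter 6, while the caterpillar has g-depth 2 and diameter 8. As more leaves are added to a caterpillar its diameter keeps growing while its g-depth stays at 2, so a bound that is exponential in the g-depth can be far smaller than one that is exponential in the diameter.
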
Many distance-based algorithms have sequence length bounds that are exponential in the diameter and therefore in $n$. Atteson [2] established an

$$O\left(\frac{\log n}{f^2 (1 - \alpha g)^{\mathrm{diam}}}\right)$$

bound on the sample length requirements of Neighbor-Joining [20] and related algorithms. The algorithms of Agarwala et al. [1] and of Farach and Kannan [12] try to find a tree $T$ such that the distance matrix $D_T$ is close to $\hat D$ in the $L_\infty$ metric, i.e., $\max_{i,j} |\hat D[i, j] - D_T[i, j]|$ is small. This approach also results in exponential sequence length requirements. In particular, an

$$\ell = \Omega\left(\frac{1}{f^2 (1 - \alpha g)^{\mathrm{diam}}}\right)$$

lower bound can be derived [11]. Cryan, Goldberg and Goldberg [6] recently developed an algorithm that outputs a tree $\hat T$ such that $\max_{i,j} |D_{\hat T}[i, j] - D_T[i, j]|$ is small. Our lower bound results apply here as well, when this algorithm is used for topology estimation. However, if the goal is the minimization of $\max_{i,j} |D_{\hat T}[i, j] - D_T[i, j]|$ instead of recovering the topology, then the sample length bounds do not depend on $f$ and $g$.

The Short Quartet Method (SQM) [10] is an algorithm based on a greedy selection of quartets of leaves that recovers the correct tree topology with high probability from short sample sequences. In particular, for every $T$ with g-depth $d$, there exists an

$$\ell = O\left(\frac{\log n}{f^2 (1 - \alpha g)^{4d+6}}\right)$$

such that the topology is recovered correctly with probability $1 - o(1)$. The Harmonic Greedy Triplets (HGT) algorithm [7] is based on a greedy selection of triplets and successfully recovers $\mathcal{T}(T)$ from sequences of length

$$\ell = O\left(\frac{\log n}{f^2 (1 - \alpha g)^{4d+8}}\right)$$

with probability $1 - o(1)$. In addition, HGT recovers the edge weights with high accuracy, whereas SQM returns only the topology. Both algorithms provide a sample length bound that is close to our lower bound derived for any distance-based algorithm.

5. Summary. We derived lower bounds on the sample length requirements of any distance-based algorithm in the Jukes-Cantor model of sequence evolution. We also showed that the usual evolutionary distance definition is unique in this model, and thus no distance-based algorithm can achieve a better performance by using a different distance definition.
Finally, we showed that two distance-based algorithms, the Short Quartet Method and the Harmonic Greedy Triplets algorithm, match these lower bounds closely and therefore offer optimal performance.

REFERENCES

[1] R. Agarwala, V. Bafna, M. Farach, B. Narayanan, M. Paterson, and M. Thorup, On the approximability of numerical taxonomy (fitting distances by tree metrics), in Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, Atlanta, Georgia, 28–30 Jan. 1996, pp. 365–372.
[2] K. Atteson, The performance of neighbor-joining algorithms of phylogeny reconstruction, in Computing and Combinatorics, Third Annual International Conference, Shanghai, China, T. Jiang and D. T. Lee, eds., vol. 1276 of Lecture Notes in Computer Science, Berlin, 1997, Springer-Verlag, pp. 101–110.
[3] R. N. Bhattacharya and R. Ranga Rao, Normal approximation and asymptotic expansions, John Wiley & Sons, New York, 1976.
[4] P. Buneman, The recovery of trees from dissimilarity matrices, in Mathematics in the Archaeological and Historical Sciences, F. R. Hodson, D. G. Kendall, and P. Tautu, eds., Edinburgh University Press, Edinburgh, 1971, pp. 387–395.
[5] J. Cavender, Taxonomy with confidence, Mathematical Biosciences, 40 (1978), pp. 271–280.
[6] M. Cryan, L. A. Goldberg, and P. W. Goldberg, Evolutionary trees can be learned in polynomial time in the two-state general Markov model, Tech. Report RR347, Department of Computer Science, University of Warwick, UK, 1998. Preliminary version at FOCS '98.
[7] M. Csűrös and M.-Y. Kao, Recovering evolutionary trees through Harmonic Greedy Triplets, in SODA '99, 1999.
[8] W. H. E. Day, Computational complexity of inferring phylogenies from dissimilarity matrices, Bulletin of Mathematical Biology, 49 (1987), pp. 461–467.
[9] W. H. E. Day, D. S. Johnson, and D. Sankoff, The computational complexity of inferring rooted phylogenies by parsimony, Mathematical Biosciences, 81 (1986), pp. 33–42.
[10] P. Erdős, K. Rice, M. A. Steel, L. A. Székely, and T. Warnow, The Short Quartet Method, Mathematical Modeling and Scientific Computing, (1998), to appear.
[11] P. Erdős, M. A. Steel, L. A. Székely, and T. Warnow, A few logs suffice to build (almost) all trees (II), Tech. Report 97-72, DIMACS, 1997.
[12] M. Farach and S. Kannan, Efficient algorithms for inverting evolution, in Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania, 22–24 May 1996, pp. 230–236.
[13] J. Felsenstein, Cases in which parsimony or compatibility methods will be positively misleading, Systematic Zoology, 22 (1978), pp. 240–249.
[14] J. Felsenstein, Numerical methods for inferring evolutionary trees, The Quarterly Review of Biology, 57 (1982), pp. 379–404.
[15] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, UK, 1997.
[16] T. H. Jukes and C. R. Cantor, Evolution of protein molecules, in Mammalian Protein Metabolism, H. N. Munro, ed., vol. III, Academic Press, New York, 1969, ch. 24, pp. 21–132.
[17] J. Neyman, Molecular studies of evolution: a source of novel statistical problems, in Statistical Decision Theory and Related Topics, S. S. Gupta and J. Yackel, eds., Academic Press, New York, 1971, pp. 1–27.
[18] M. Noro, R. Masuda, I. A. Dubrovo, M. C. Yoshida, and M. Kato, Molecular phylogenetic inference of the Woolly Mammoth Mammuthus primigenius, based on complete sequences of mitochondrial cytochrome b and 12S ribosomal RNA genes, Journal of Molecular Evolution, 46 (1998), pp. 314–326.
[19] C.-Y. Ou, C. A. Ciesielski, G. Myers, C. I. Bandea, C.-C. Luo, B. T. M. Korber, J. I. Mullins, G. Schochetman, R. L. Berkelman, A. N. Economou, J. J. Witte, L. J. Furman, G. A. Satten, K. A. MacInnes, J. W. Curran, and H. W. Jaffe, Molecular epidemiology of HIV transmission in a dental practice, Science, 256 (1992), pp. 1165–1171.
[20] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution, 4 (1987), pp. 406–425.
[21] S. Sattath and A. Tversky, Additive similarity trees, Psychometrika, 42 (1977), pp. 319–345.
[22] D. D. Sleator and R. E. Tarjan, A data structure for dynamic trees, Journal of Computer and System Sciences, 26 (1983), pp. 362–391.
[23] D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis, Phylogenetic inference, in Molecular Systematics, D. M. Hillis, C. Moritz, and B. K. Mable, eds., Sinauer Associates, Inc., Sunderland, MA, 2nd ed., 1996, ch. 11, pp. 407–514.