LOWER BOUNDS ON SEQUENCE LENGTHS REQUIRED TO RECOVER THE EVOLUTIONARY TREE. (extended abstract submitted to RECOMB '99)

MIKLÓS CSŰRÖS AND MING-YANG KAO

Department of Computer Science, Yale University, New Haven, CT 06520; contact e-mail: csuros-miklos@cs.yale.edu.

Abstract. In this paper we study the sequence length requirements of distance-based evolutionary tree building algorithms in the Jukes-Cantor model of evolution. By deriving lower bounds on the sequence lengths required to recover the evolutionary tree topology correctly, we show that two algorithms, the Short Quartet Method and the Harmonic Greedy Triplets algorithm, have optimal sequence length requirements.

1. Introduction. A main area of computational biology is the development and analysis of evolutionary tree building algorithms [15]. By using biomolecular sequences, these algorithms have not only enabled the exploration of evolutionary relationships among species but also led to the discovery of new proteins and to the inference of transmission chains of viral diseases such as AIDS [19].

The evolutionary tree in biology can be defined as an edge-weighted binary tree in which the nodes correspond to taxa and the edge weights correspond to the time of divergence between them [23]. Evolution in this model can be viewed as the broadcasting of a character sequence (the root sequence) along the edges from the root towards the leaves. On each edge, the sequence undergoes a certain number of changes or mutations, and the mutated sequences are observed at the leaves. By defining a meaningful mutation model that relates changes in character sequences to time of divergence, one can attempt to reconstruct the weighted binary tree. The primary goal of such reconstruction is to correctly recover the tree topology, i.e., the tree without the edge weights. A secondary goal is to estimate the edge weights as correctly as possible. A probabilistic model of mutations defines a probability distribution over sets of $n$ sequences of length $\ell$, where $n$ is the number of leaves and $\ell$ is the length of an observed sequence.

An obvious requirement for a tree building algorithm, which we call computational efficiency, is that it runs in time polynomial in $n$ and $\ell$. It is also important to realize that the sequences are rather short, as the length of currently available biomolecular sequences ranges from a few hundred to a few thousand. Therefore, another principle to consider is statistical efficiency, which requires that the algorithm recover the tree with high probability from sequences whose length is polynomial in the number of taxa. Unfortunately, almost all the existing algorithms violate one or both of these principles.

Among the known algorithms, the class of parsimony algorithms [14] is the most popular. These algorithms attempt to compute a tree that minimizes the number of mutations leading to the observed sequences. Unfortunately, the problem of optimizing parsimony is NP-hard [9]. In addition, parsimony is not a consistent method [5, 13], in that increasing the length of sample sequences generated by an evolutionary tree ad infinitum may still not result in the correct topology being inferred. Consequently, statistical efficiency cannot be expected for all evolutionary trees, and computational efficiency can be achieved only by using heuristics to derive approximations to optimal solutions.

The class of distance-based algorithms is also widely used [23]. These algorithms first calculate pairwise evolutionary distances between taxa from the observed sequences and then build a tree from the resulting distance matrix. Many of these algorithms strive to find, among all possible trees, an evolutionary tree that fits the observed distances best according to some metric. Such an optimization task is provably NP-hard in known cases; see [8] for the $L_1$ and $L_2$ norms, and see [1] for $L_\infty$. However, a number of distance-based algorithms are computationally efficient. Examples include Neighbor-Joining [21, 20], Short Quartet [10], Harmonic Greedy Triplets [7], the algorithm of Farach and Kannan [12], and the algorithm of Cryan, Goldberg and Goldberg [6], to mention a few. The statistical efficiency of these algorithms has been analyzed mostly in a simple model of evolution, the Jukes-Cantor model.
The Short Quartet (SQM) and Harmonic Greedy Triplets (HGT) algorithms have been shown to be statistically efficient in this model, whereas the Neighbor-Joining algorithm and that of Farach and Kannan have exponential sample length requirements [2, 11]. In this paper we derive lower bounds on the sample length requirements of any distance-based algorithm, and these bounds match the upper bounds of HGT and SQM tightly.

2. Model of sequence evolution.

2.1. The Jukes-Cantor model of sequence evolution. From a sample of aligned biological sequences (e.g., see Figure 2.1), one can build a tree that provides a stochastic model of the succession of pointwise mutations leading to the observed sequences. The elements of the sequences are taken from a finite alphabet such as that of amino acids, nucleic acids, or codons. The nodes of such a tree represent taxa, each corresponding to a sequence. The leaves represent terminal taxa, which correspond to observable sequences; the nonleaf nodes represent ancestors of the terminal taxa, which correspond to unobservable sequences. The mutations occur along the edges.

Woolly mammoth (MPR)     CTAAATCATCACTGATCAAAGAGAGC
African elephant (LAF)   CTAAATCATCACCGATCAAAGAGAGC
Asian elephant (EMA)     CTAAATCATCGCTGATCAAAGAGAGC
Dugong (DDG)             TTAAATCACTCCCGATCATAAAGGAGC
Manatee (TMA)            TCAAATCATTACTGACCATAAAGGAGC
[Tree diagram over the leaves MPR, LAF, EMA, DDG, TMA; character positions 81, 91, and 101 are marked along the alignment.]

Fig. 2.1. Taken from [18], this example uses the sequences of 12S ribosomal RNA and cytochrome b in mitochondrial DNA to establish the evolutionary relationship among the woolly mammoth and its extant relatives. The sequences shown are for 12S rRNA, base positions 81–110, omitting deletions that are common in the five taxa above.

To formalize this modeling, this paper employs the generalized Jukes-Cantor model [16, 17] of sequence evolution, defined as follows. Let $m \ge 2$ and $n \ge 3$ be two integers. Let $A = \{a_1, \ldots, a_m\}$ be a finite alphabet. An evolutionary tree $T$ for $A$ is a rooted binary tree of $n$ leaves with an edge mutation probability $p_e$ for each tree edge $e$. The edge mutation probabilities are bounded away from $0$ and $1 - \frac{1}{m}$, i.e., there exist $f$ and $g$ such that for every edge $e$ of $T$,

$$0 < f \le p_e \le g < 1 - \frac{1}{m}.$$

Given a sequence $s_1 \cdots s_\ell \in A^\ell$ associated with the root of $T$, a set of $n$ mutated sequences in $A^\ell$ is generated by $\ell$ random labelings of the tree at the nodes. These $\ell$ labelings are mutually independent. The labelings at the $j$-th leaf give the $j$-th sequence $s^{(j)}_1 \cdots s^{(j)}_\ell$, and the $i$-th labeling of the tree gives the $i$-th symbols $s^{(1)}_i, \ldots, s^{(n)}_i$. The $i$-th labeling is carried out from the root towards the leaves along the edges. The root is labeled by $s_i$. On edge $e$, the child's label is the same as the parent's with probability $1 - p_e$, or is different with probability $\frac{p_e}{m-1}$ for each different symbol. Such mutations of symbols along the edges are mutually independent.
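The labeling process just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `mutate` and `simulate`, the dictionary representation of the tree, the four-leaf example tree, and the chosen edge probabilities are ours and do not come from the paper.

```python
import random

def mutate(symbol, p_e, alphabet, rng):
    """Label of the child for one site on an edge with mutation probability p_e."""
    if rng.random() < p_e:
        # Each of the m-1 other symbols is chosen with probability p_e / (m-1).
        return rng.choice([a for a in alphabet if a != symbol])
    return symbol

def simulate(children, root, root_seq, alphabet, rng):
    """Label the tree from the root towards the leaves; return {leaf: sequence}.

    children maps an internal node to a list of (child, p_e) pairs
    (hypothetical representation chosen for this sketch).
    """
    seqs = {root: list(root_seq)}
    leaf_seqs = {}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            leaf_seqs[node] = "".join(seqs[node])
        for child, p_e in kids:
            # Sites mutate independently of each other along every edge.
            seqs[child] = [mutate(s, p_e, alphabet, rng) for s in seqs[node]]
            stack.append(child)
    return leaf_seqs

alphabet = "ACGT"  # m = 4, so every p_e must stay below 1 - 1/m = 0.75
children = {
    "root": [("u", 0.05), ("v", 0.05)],
    "u": [("A", 0.10), ("B", 0.10)],
    "v": [("C", 0.10), ("D", 0.10)],
}
rng = random.Random(0)
root_seq = "".join(rng.choice(alphabet) for _ in range(200))
leaf_seqs = simulate(children, "root", root_seq, alphabet, rng)
```

Only the sequences in `leaf_seqs` would be visible to a tree building algorithm; the sequences at `root`, `u`, and `v` play the role of the unobservable ancestors.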
The topology $\mathcal{T}(T)$ of $T$ is the unrooted tree obtained from $T$ by omitting the edge mutation probabilities and replacing the two edges between the root and its children with a single edge. We further require that the leaves of $\mathcal{T}(T)$ are labeled with the same sequences as in $T$, but $\mathcal{T}(T)$ need not be labeled otherwise. Our task is to design a learning algorithm that takes the $n$ mutated sequences of length $\ell$ as input and recovers $\mathcal{T}(T)$ with high probability.

2.2. Evolutionary distance. Distance-based tree building methods first calculate pairwise evolutionary distances between the terminal taxa, and subsequently build the tree based on these values. The function $\Delta$ is a distance metric on $T$ if and only if the following four conditions hold.
1. $\Delta$ takes values on ordered pairs of nodes, and for any two nodes $X$ and $Y$ in $T$, $\Delta_{XY} \in [0, \infty)$.
2. $\Delta$ is symmetric, i.e., for any two nodes $X$ and $Y$, $\Delta_{XY} = \Delta_{YX}$.
3. $\Delta$ is additive, i.e., for any three nodes $X$, $Y$, and $Z$ such that $Y$ lies on the tree path between $X$ and $Z$, $\Delta_{XZ} = \Delta_{XY} + \Delta_{YZ}$.
4. For two nodes $X$ and $Y$ such that $\Pr\{X = Y\} = 1$, $\Delta_{XY} = 0$. In particular, $\Delta_{XX} = 0$.

For an edge $e$ with endpoints $X$ and $Y$, $\Delta_{XY}$ is referred to as the edge length or edge weight of $e$, denoted by $\Delta_e$. The function $\sigma$ defined as $\sigma_{XY} = e^{-\Delta_{XY}}$ measures the similarity of the tree nodes. By the properties of the distance metric, $0 < \sigma_{XY} \le 1$, $\sigma_{XY} = \sigma_{YX}$, and $\sigma$ is multiplicative along any tree path. For two nodes $X$ and $Y$, let

$$p_{XY} = \Pr\{X \ne Y\}.$$

Also, for brevity, let

$$\alpha = \frac{m}{m-1}.$$

The following theorem is well known in the literature; for a formal proof, see, for example, [7].

Theorem 2.1. Define the function $\Delta$ on every node pair $X, Y$ as

$$\Delta_{XY} = -\ln\left(\Pr\{X = Y\} - \frac{1}{m-1}\Pr\{X \ne Y\}\right) = -\ln(1 - \alpha p_{XY}).$$

Then $\Delta$ is a distance metric in the generalized Jukes-Cantor model of evolution.

One might wonder if there are other possible distance metrics in this model. It is evident that if $\Delta$ is a distance metric, then $c\Delta$ is a distance metric as well, for any $c > 0$. The following lemma shows that if $\Delta$ is a function of $p_{XY}$, then that function is uniquely determined up to a scaling factor.

Lemma 2.2. Let $\Delta$ be an additive distance along any tree path, defined by $\Delta_{XY} = \varphi(\Pr\{X \ne Y\})$, where $\varphi : [0, 1 - 1/m) \to [0, +\infty)$ is a function with $\lim_{x \to +0} \varphi(x) = \varphi(0)$. Then there exists a real number $c$ such that $\varphi(p) = -c \ln(1 - \alpha p)$ for all $p$.

Proof. Assume that $X$, $Y$, and $Z$ are three consecutive nodes on a tree path, with $Y$ being a child of $X$ and $Z$ a child of $Y$. Let $p_{XY} = p_{YZ} = p$. Then $p_{XZ} = 2p - \alpha p^2$ because $\sigma$ is multiplicative. Since $\Delta$ is additive, $2\varphi(p) = \varphi(2p - \alpha p^2)$ for every $p$. Define $\psi_1(x) = (1 - e^{-x})/\alpha$ and $\psi_2(x) = \varphi(\psi_1(x))$, for $x \in [0, \infty)$. Then $2\psi_2(x) = \psi_2(2x)$. We prove by contradiction that there exists a real number $c$ such that $\psi_2(x) = cx$ for all $x$. Assume that there is no such constant, and thus there exist $x, v > 0$ and $u \ne 0$ such that

$$\frac{\psi_2(x+v)}{x+v} = \frac{\psi_2(x)}{x} + u.$$

Define the sequences $a_k = (x+v)/2^k$ and $b_k = x/2^k$ for $k \ge 0$. Since $\psi_2(2x) = 2\psi_2(x)$, we have $\psi_2(a_k)/a_k = \psi_2(x+v)/(x+v)$ and $\psi_2(b_k)/b_k = \psi_2(x)/x$. Therefore, $\psi_2(a_k) = \psi_2(b_k) + u(1 + v/x)$ and thus $\lim_{k \to \infty} \psi_2(a_k) \ne \lim_{k \to \infty} \psi_2(b_k)$. Since $\lim_{k \to \infty} \psi_1(a_k) = \lim_{k \to \infty} \psi_1(b_k) = 0$, $\varphi$ cannot be continuous at $0$ by virtue of the fact that $\lim_{k \to \infty} \varphi(\psi_1(a_k)) \ne \lim_{k \to \infty} \varphi(\psi_1(b_k))$.

3. Evolutionary tree building algorithms.

3.1. Estimation of evolutionary distances. Distance-based algorithms start by estimating evolutionary distances between terminal taxa. If $X$ and $Y$ are leaves, their similarity can be estimated using sample sequences as

(3.1)  $$\hat\sigma_{XY} = \frac{1}{\ell} \sum_{i=1}^{\ell} I_{X_i Y_i},$$

where $X_1, \ldots, X_\ell$ and $Y_1, \ldots, Y_\ell$ are the symbols at positions $1, \ldots, \ell$ of the observed sample sequences for the two leaves, and

$$I_{xy} = \begin{cases} -\frac{1}{m-1} & \text{if } x \ne y; \\ 1 & \text{if } x = y. \end{cases}$$

The distance between $X$ and $Y$ is estimated as

(3.2)  $$\hat\Delta_{XY} = \begin{cases} -\ln \hat\sigma_{XY} & \text{if } \hat\sigma_{XY} > 0; \\ \infty & \text{otherwise.} \end{cases}$$
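The estimators (3.1) and (3.2) translate almost directly into code. The following Python sketch is only illustrative; the function names `similarity_estimate` and `distance_estimate` are ours, and the alphabet size `m` is passed explicitly as an assumption of the sketch.

```python
import math

def similarity_estimate(x, y, m):
    """Estimate of the similarity as in (3.1) for two aligned sequences."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    # I_xy is 1 on a match and -1/(m-1) on a mismatch.
    total = sum(1.0 if a == b else -1.0 / (m - 1) for a, b in zip(x, y))
    return total / len(x)

def distance_estimate(x, y, m):
    """Estimate of the distance as in (3.2); infinite if the similarity estimate is not positive."""
    s = similarity_estimate(x, y, m)
    return -math.log(s) if s > 0 else math.inf

sigma_hat = similarity_estimate("AAAA", "AAAC", m=4)  # three matches, one mismatch
delta_hat = distance_estimate("AAAA", "AAAC", m=4)
```

With three matches and one mismatch over a four-letter alphabet, the similarity estimate is $(3 - 1/3)/4 = 2/3$. Two sequences that differ at every position give a negative similarity estimate and therefore an infinite distance estimate.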

Distance-based algorithms build the tree by using the distance estimates $\hat\Delta$ between leaves. In order to recover the topology successfully, these estimates have to be close to the true distances. The next lemma gives a lower bound on the sequence length required for an accurate estimation.

Lemma 3.1. Let $0 < \epsilon < 1$ and $0 < \delta < 1$. For any leaf pair $X$ and $Y$, if

$$\Pr\left\{ \hat\Delta_{XY} - \Delta_{XY} \ge -\epsilon \ln(1 - \alpha f) \right\} \le \delta,$$

then

$$\ell = \Omega\left( \frac{1}{\epsilon^2 f^2 \sigma_{XY}^2} \right).$$

Proof. (Sketch.) First observe that $\hat\sigma_{XY}$ can be viewed as a linear transformation of a binomially distributed random variable with parameters $\ell$ and $(1 - \sigma_{XY})/\alpha$. The proof bounds the rate of convergence of that binomial random variable by the tail of the standard normal distribution, using the Berry-Esseen Theorem [3] on the convergence rate in the Central Limit Theorem. The resulting integral is bounded by using a Taylor series approximation.

3.2. Distance matrices. A distance matrix is a symmetric $n \times n$ matrix in which diagonal entries are zero and non-diagonal entries are positive. An additive distance matrix is a distance matrix $D$ for which there exists an evolutionary tree with leaves $1, \ldots, n$ such that $D[i, j] = \Delta_{ij}$. A distance-based algorithm is defined as a partial function $F$ on the set of distance matrices such that for any $D$, either $F(D) = \text{fail}$ or $F(D)$ is a topology. Each $T$ defines a distance matrix $D_T$ by the distances $\Delta_{XY}$ between pairs of leaves, so that $\mathcal{T}(T)$ is uniquely determined by $D_T$ [4]. We assume that $F(D_T) = \mathcal{T}(T)$.

Based on ideas of Atteson [2] and Erdős et al. [11], we define the following method to construct trees that define distance matrices close to the one defined by $T$. Let $e$ be an internal edge of $T$. By contracting the edge $e$ and preserving the edge lengths of every other edge, one obtains the non-binary edge-weighted tree $T'$. $T'$ has exactly one vertex adjacent to four edges. Subsequently, this vertex can be replaced by an edge $e'$ with a positive edge weight $\Delta_{e'} \ge -\ln(1 - \alpha f)$.
In this way, one can obtain three trees with different topologies, one of which has the same topology as $T$. If an evolutionary tree $T''$ can be obtained from $T$ with $\Delta_{e'} = x$ in this manner, and $\mathcal{T}(T'') \ne \mathcal{T}(T)$, then $T''$ and $T$ have a similar topology, denoted by $T \overset{e,x}{\vdash} T''$. Let $T''$ define the distance metric $\Delta''$. Define $C_e$ as the set of leaf pairs $XY$ for which $\Delta_{XY} \ne \Delta''_{XY}$. If $x = \Delta_e$, then the matrices $D_T$ and $D_{T''}$ differ only at the entries corresponding to $C_e$, and there by $\Delta_e$ in absolute value.
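The construction above can be checked on the smallest interesting case, a tree with four leaves. The following Python sketch builds a quartet tree with topology AB|CD, its similar tree with topology AC|BD and the same internal edge length, and lists the leaf pairs whose distances differ. The helper names `leaf_distances` and `quartet`, the node names `u` and `v`, and all edge lengths are hypothetical values chosen only for illustration.

```python
from itertools import combinations

def leaf_distances(edges, leaves):
    """Path lengths between all leaf pairs of a weighted tree given as {(u, v): length}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_source(src):
        # Depth-first traversal; in a tree every node is reached by a unique path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    all_dist = {leaf: from_source(leaf) for leaf in leaves}
    return {(a, b): all_dist[a][b] for a, b in combinations(leaves, 2)}

pendant = {"A": 0.3, "B": 0.2, "C": 0.4, "D": 0.1}  # lengths of the four leaf edges
x = 0.05  # length of the internal edge e in T and of e' in T''

def quartet(left, right, internal):
    """Quartet tree with leaves `left` on node u, leaves `right` on node v, and edge (u, v)."""
    edges = {("u", "v"): internal}
    for leaf in left:
        edges[("u", leaf)] = pendant[leaf]
    for leaf in right:
        edges[("v", leaf)] = pendant[leaf]
    return edges

leaves = ["A", "B", "C", "D"]
d_T = leaf_distances(quartet("AB", "CD", x), leaves)    # topology AB|CD
d_T2 = leaf_distances(quartet("AC", "BD", x), leaves)   # similar topology AC|BD
# C_e: leaf pairs whose distance changes, mapped to the signed change.
C_e = {pair: d_T2[pair] - d_T[pair]
       for pair in d_T if abs(d_T2[pair] - d_T[pair]) > 1e-12}
```

In this example the pairs A-B and C-D become longer by `x`, the pairs A-C and B-D become shorter by `x`, and the pairs A-D and B-C are unchanged, so an algorithm has to resolve differences of size `x` to tell the two topologies apart.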

Theorem 3.2. Let $T$ and $T'$ be two trees such that $T \overset{e,\Delta_e}{\vdash} T'$. Let $\hat D$ be a distance matrix corresponding to $\hat\Delta$ on leaf pairs, calculated from a sample of length $\ell$ that is generated by either $T$ or $T'$. Suppose that $F$ has a failure probability less than $\delta$ on sequences of length $\ell$. In other words, with probability at least $1 - \delta$, $F(\hat D) = \mathcal{T}(T)$ when $\hat D$ is generated by $T$, and $F(\hat D) = \mathcal{T}(T')$ when $\hat D$ is generated by $T'$. Then

$$\ell = \Omega\left( \frac{1}{p_e^2 \max_{XY \in C_e} \sigma_{XY}^2} \right).$$

Proof. (Sketch.) The proof uses Lemma 3.1 to prove a lower bound on $\ell$ by showing that when $\hat D$ comes from a shorter sample, $F$ cannot recognize both topologies with high probability.

4. Optimal algorithms. Similarly to [22, 11, 10], we define the notion of depth as follows. The g-depth of a node in a rooted tree is the smallest number of edges in a path from the node to a leaf. Let $e$ be an edge between nodes $u_1$ and $u_2$ in a rooted tree $T'$. Let $T'_1$ and $T'_2$ be the subtrees of $T'$ obtained by cutting $e$ which contain $u_1$ and $u_2$, respectively. The g-depth of $e$ in $T'$ is the larger of the g-depth of $u_1$ in $T'_1$ and that of $u_2$ in $T'_2$. The g-depth of a rooted tree is the largest possible g-depth of an edge in the tree. (We add the prefix g to the term depth because this usage of depth is nonstandard in graph theory.) Define $d$ as the g-depth of $T$. Then $d \le 1 + \lfloor \log_2(n-1) \rfloor$. As a corollary of Theorem 3.2, we obtain the following result.

Corollary 4.1. For every $2 < d \le 1 + \lfloor \log_2(n-1) \rfloor$, there is a tree $T$ with depth $d$ such that any algorithm $F$ needs sample sequences of length

$$\ell = \Omega\left( \frac{1}{f^2 (1 - \alpha g)^{4d}} \right)$$

to recover $\mathcal{T}(T)$ with probability $1 - o(1)$.

Proof. (Sketch.) The proof consists of constructing a tree $T$ such that it has an internal edge $e$ with $p_e = f$, every other edge $e'$ has $p_{e'} = g$, and $d$ equals the g-depth of the endpoints of $e$ in the subtrees obtained when $T$ is cut at $e$.

The diameter of $T$ is defined as the maximum number of edges in a path between leaves of $T$. The diameter is always at least as large as $2d$ and can be even $\Omega(n)$.
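The definitions of g-depth and diameter can be made concrete with a small Python sketch. It rests on an interpretation that should be read as an assumption: paths are taken to be undirected, the leaves are the leaves of the original tree, and cutting an edge is modeled by forbidding the search to step onto the other endpoint. The function names and the two example trees are ours.

```python
from collections import deque

def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def nearest_leaf(adj, leaves, start, blocked):
    """Fewest edges from start to a leaf without stepping onto `blocked`."""
    seen = {start, blocked}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in leaves:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("no leaf on this side of the edge")

def tree_g_depth(edges, leaves):
    """Largest g-depth over all edges, under the interpretation stated above."""
    adj = build_adjacency(edges)
    return max(
        max(nearest_leaf(adj, leaves, u, v), nearest_leaf(adj, leaves, v, u))
        for u, v in edges
    )

def diameter(edges, leaves):
    """Maximum number of edges on a path between two leaves."""
    adj = build_adjacency(edges)
    best = 0
    for src in leaves:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        best = max(best, max(dist[leaf] for leaf in leaves))
    return best

# Balanced rooted tree with four leaves.
balanced = [("root", "u"), ("root", "v"), ("u", "A"), ("u", "B"), ("v", "C"), ("v", "D")]
balanced_leaves = {"A", "B", "C", "D"}
# Caterpillar-shaped rooted tree with five leaves.
caterpillar = [("root", "A"), ("root", "i1"), ("i1", "B"), ("i1", "i2"),
               ("i2", "C"), ("i2", "i3"), ("i3", "D"), ("i3", "E")]
caterpillar_leaves = {"A", "B", "C", "D", "E"}

depths = (tree_g_depth(balanced, balanced_leaves), tree_g_depth(caterpillar, caterpillar_leaves))
diams = (diameter(balanced, balanced_leaves), diameter(caterpillar, caterpillar_leaves))
```

Under this interpretation both example trees have g-depth 2, while their diameters are 4 and 5. Lengthening the caterpillar increases its diameter with every added leaf, which is the situation in which bounds that are exponential in the diameter become much weaker than bounds that are exponential in the depth.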
Many distance-based algorithms have sequence length bounds that are exponential in the diameter and therefore in $n$. Atteson [2] established an

$$O\left( \frac{\log n}{f^2 (1 - \alpha g)^{\mathrm{diam}}} \right)$$

bound on the sample length requirements of Neighbor-Joining [20] and related algorithms. The algorithms of Agarwala et al. [1] and of Farach and Kannan [12] try to find a tree $\tilde T$ such that the distance matrix $D_{\tilde T}$ is close to $\hat D$ in the $L_\infty$ metric, i.e., $\max_{i,j} |\hat D[i,j] - D_{\tilde T}[i,j]|$ is small. This approach also results in exponential sequence length requirements. In particular, an

$$\ell = \Omega\left( \frac{1}{f^2 (1 - \alpha g)^{\mathrm{diam}}} \right)$$

lower bound can be derived [11]. Cryan, Goldberg and Goldberg [6] recently developed an algorithm that outputs a tree $\tilde T$ such that $\max_{i,j} |D_{\tilde T}[i,j] - D_T[i,j]|$ is small. Our lower bound results apply here as well when this algorithm is used for topology estimation. However, if the goal is the minimization of $\max_{i,j} |D_{\tilde T}[i,j] - D_T[i,j]|$ rather than the recovery of the topology, then the sample length bounds do not depend on $f$ and $g$.

The Short Quartet Method (SQM) [10] is an algorithm based on a greedy selection of quartets of leaves that recovers the correct tree topology with high probability from short sample sequences. In particular, for every $T$ with depth $d$, there exists an

$$\ell = O\left( \frac{\log n}{f^2 (1 - \alpha g)^{4d+6}} \right)$$

such that the topology is recovered correctly with probability $1 - o(1)$. The Harmonic Greedy Triplets (HGT) [7] algorithm is based on a greedy selection of triplets and successfully recovers $\mathcal{T}(T)$ from sequences of length

$$\ell = O\left( \frac{\log n}{f^2 (1 - \alpha g)^{4d+8}} \right)$$

with probability $1 - o(1)$. In addition, HGT recovers the edge weights with high accuracy, whereas SQM returns only the topology. Both algorithms provide a sample length bound that is close to our lower bound derived for any distance-based algorithm.

5. Summary. We derived lower bounds on the sample length requirements of any distance-based algorithm in the Jukes-Cantor model of sequence evolution. We also showed that the usual evolutionary distance definition is unique in this model up to a scaling factor, and thus no distance-based algorithm can achieve a better performance by using a different distance definition.
Finally, we showed that two distance-based algorithms, the Short Quartet and the Harmonic Greedy Triplets algorithms, match these lower bounds closely and therefore offer optimal performance.
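To see how closely the bounds match, one can evaluate the expressions inside the asymptotic notation. The Python sketch below sets every hidden constant to 1, which is purely an assumption for illustration, so the numbers indicate only how the expressions scale relative to each other and are not actual sequence lengths. The function names and parameter values are ours.

```python
import math

def alpha(m):
    """The constant m / (m - 1) of the generalized Jukes-Cantor model."""
    return m / (m - 1)

def lower_bound_expr(f, g, d, m=4):
    """Expression inside the lower bound of Corollary 4.1 (hidden constant set to 1)."""
    return 1.0 / (f ** 2 * (1 - alpha(m) * g) ** (4 * d))

def upper_bound_expr(f, g, d, n, extra, m=4):
    """Expression inside the upper bounds; extra = 6 for SQM and extra = 8 for HGT."""
    return math.log(n) / (f ** 2 * (1 - alpha(m) * g) ** (4 * d + extra))

f, g, d, n = 0.01, 0.1, 5, 100  # hypothetical parameter values
ratio_sqm = upper_bound_expr(f, g, d, n, extra=6) / lower_bound_expr(f, g, d)
ratio_hgt = upper_bound_expr(f, g, d, n, extra=8) / lower_bound_expr(f, g, d)
```

Both ratios equal $\log n$ divided by a fixed power of $1 - \alpha g$, so with constants ignored they do not depend on $f$ or on the depth $d$.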

REFERENCES

[1] R. Agarwala, V. Bafna, M. Farach, B. Narayanan, M. Paterson, and M. Thorup, On the approximability of numerical taxonomy (fitting distances by tree metrics), in Proceedings of the Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, Atlanta, Georgia, 28–30 Jan. 1996, pp. 365–372.
[2] K. Atteson, The performance of neighbor-joining algorithms of phylogeny reconstruction, in Computing and Combinatorics, Third Annual International Conference, Shanghai, China, T. Jiang and D. T. Lee, eds., vol. 1276 of Lecture Notes in Computer Science, Berlin, 1997, Springer-Verlag, pp. 101–110.
[3] R. N. Bhattacharya and R. Ranga Rao, Normal approximation and asymptotic expansions, John Wiley & Sons, New York, 1976.
[4] P. Buneman, The recovery of trees from dissimilarity matrices, in Mathematics in the Archaeological and Historical Sciences, F. R. Hodson, D. G. Kendall, and P. Tautu, eds., Edinburgh University Press, Edinburgh, 1971, pp. 387–395.
[5] J. Cavender, Taxonomy with confidence, Mathematical Biosciences, 40 (1978), pp. 271–280.
[6] M. Cryan, L. A. Goldberg, and P. W. Goldberg, Evolutionary trees can be learned in polynomial time in the two-state general Markov-model, Tech. Report RR347, Department of Computer Science, University of Warwick, UK, 1998. Preliminary version at FOCS '98.
[7] M. Csűrös and M.-Y. Kao, Recovering evolutionary trees through Harmonic Greedy Triplets, in SODA '99, 1999.
[8] W. H. E. Day, Computational complexity of inferring phylogenies from dissimilarity matrices, Bulletin of Mathematical Biology, 49 (1987), pp. 461–467.
[9] W. H. E. Day, D. S. Johnson, and D. Sankoff, The computational complexity of inferring rooted phylogenies by parsimony, Mathematical Biosciences, 81 (1986), pp. 33–42.
[10] P. Erdős, K. Rice, M. A. Steel, L. A. Székely, and T. Warnow, The Short Quartet Method, Mathematical Modeling and Scientific Computing, (1998). To appear.
[11] P. Erdős, M. A. Steel, L. A. Székely, and T. Warnow, A few logs suffice to build (almost) all trees (II), Tech. Report 97-72, DIMACS, 1997.
[12] M. Farach and S. Kannan, Efficient algorithms for inverting evolution, in Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing, Philadelphia, Pennsylvania, 22–24 May 1996, pp. 230–236.
[13] J. Felsenstein, Cases in which parsimony or compatibility methods will be positively misleading, Systematic Zoology, 27 (1978), pp. 401–410.
[14] J. Felsenstein, Numerical methods for inferring evolutionary trees, The Quarterly Review of Biology, 57 (1982), pp. 379–404.
[15] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, UK, 1997.
[16] T. H. Jukes and C. R. Cantor, Evolution of protein molecules, in Mammalian Protein Metabolism, H. N. Munro, ed., vol. III, Academic Press, New York, 1969, ch. 24, pp. 21–132.
[17] J. Neyman, Molecular studies of evolution: a source of novel statistical problems, in Statistical Decision Theory and Related Topics, S. S. Gupta and J. Yackel, eds., Academic Press, New York, 1971, pp. 1–27.
[18] M. Noro, R. Masuda, I. A. Dubrovo, M. C. Yoshida, and M. Kato, Molecular phylogenetic inference of the woolly mammoth Mammuthus primigenius, based on complete sequences of mitochondrial cytochrome b and 12S ribosomal RNA genes, Journal of Molecular Evolution, 46 (1998), pp. 314–326.
[19] C.-Y. Ou, C. A. Ciesielski, G. Myers, C. I. Bandea, C.-C. Luo, B. T. M. Korber, J. I. Mullins, G. Schochetman, R. L. Berkelman, A. N. Economou, J. J. Witte, L. J. Furman, G. A. Satten, K. A. MacInnes, J. W. Curran, and H. W. Jaffe, Molecular epidemiology of HIV transmission in a dental practice, Science, 256 (1992), pp. 1165–1171.
[20] N. Saitou and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution, 4 (1987), pp. 406–425.
[21] S. Sattath and A. Tversky, Additive similarity trees, Psychometrika, 42 (1977), pp. 319–345.
[22] D. D. Sleator and R. E. Tarjan, A data structure for dynamic trees, Journal of Computer and System Sciences, 26 (1983), pp. 362–391.
[23] D. L. Swofford, G. J. Olsen, P. J. Waddell, and D. M. Hillis, Phylogenetic inference, in Molecular Systematics, D. M. Hillis, C. Moritz, and B. K. Mable, eds., Sinauer Associates, Inc., Sunderland, MA, 2nd ed., 1996, ch. 11, pp. 407–514.