Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 6, Numbers 3/4, 1999
Mary Ann Liebert, Inc.
Pp.

DANIEL H. HUSON,1 SCOTT M. NETTLES,2 and TANDY J. WARNOW3

ABSTRACT

The evolutionary history of a set of species is represented by a phylogenetic tree, which is a rooted, leaf-labeled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees from realistic length sequences have long been considered one of the major challenges in systematic biology. In this paper, we present a simple method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods under various Markov models of evolution. We analyze the performance of DCM-boosted distance methods under the Jukes-Cantor Markov model of biomolecular sequence evolution, and prove that for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. We also provide an experimental study based upon simulating sequence evolution on model trees. This study confirms substantial reductions in error rates at realistic sequence lengths.

Key words: algorithms, evolution, phylogenetic trees, clustering, Jukes-Cantor, biomolecular data, chordal graphs.

1. INTRODUCTION

The evolution of biomolecular sequences is usually modeled as a Markov process operating on a rooted binary tree. A biomolecular sequence at the root of the tree evolves down the tree, each edge of the tree introducing point mutations, thereby generating sequences at the leaves of the tree. The phylogenetic tree reconstruction problem is to take the sequences that occur at the leaves of the tree, and infer, as accurately as possible, the tree that generated the sequences.
Under most Markov models of sequence evolution, locating the root is impossible. Consequently, the primary objective in phylogenetic analysis is to recover the branching process, as represented by the unrooted leaf-labeled topology of the evolutionary tree, and the secondary objective is to estimate the parameters of the evolutionary process. While inferring the parameters of the evolutionary process is of interest to both biologists and statisticians, this is relatively easy once an accurate estimate of the topology of the true tree is obtained; consequently, biologists have mostly focused upon the simpler question of inferring the tree topology. Methods for inferring or estimating the evolutionary history of biomolecular sequences are evaluated according to the accuracy of this topology estimation. Indeed, this is one of the most important problems for computational biology as a whole, because of the centrality of evolutionary studies to biology (Dobzhansky, 1993). [Evolutionary trees are often the basis of multiple sequence alignment algorithms (Gusfield, 1991; Gusfield and Wang, 1996; Hein, 1989), protein structure prediction routines (Rost and Sander, 1993), and other problems in biology.]

1 Celera Genomics, Rockville, Maryland.
2 Department of Electrical and Computer Engineering, University of Texas, Austin, Texas.
3 Department of Computer Science, University of Texas, Austin, Texas.

Experimentally investigating the performance of phylogenetic methods by simulating sequence evolution on different model trees, in order to determine how the sequence length affects the accuracy of the topology prediction, is central to systematic biology studies (see, for example, Hillis, 1996; Hillis et al., 1994; Huelsenbeck, 1995; Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Saitou and Imanishi, 1989; Schöniger and von Haeseler, 1995; Sourdis and Nei, 1996; Strimmer and von Haeseler, 1996). The focus on sequence length in these studies is primarily practical: biomolecular sequences are not particularly long (those used for phylogenetic tree reconstruction purposes are typically bounded by 1000 nucleotides, often by much smaller numbers), and sequence lengths of 5000 nucleotides are generally considered to be unusually long (Purvis and Quicke, 1997). Furthermore, experimental and anecdotal evidence suggests that methods are very sensitive to sequence lengths, and that different methods can have great differences in their accuracy on realistic sequence lengths. In an earlier paper (Huson et al., 1998b), we showed that a critical factor affecting the accuracy of different methods is the maximum evolutionary distance in the tree, which we call the divergence of the tree.
Theoretical studies (Berry and Gascuel, 1997; Erdős et al., 1997a,b, 1999) have bounded the convergence rates of various polynomial time distance methods [Neighbor-Joining (Saitou and Nei, 1987), a simple and very popular clustering technique used by systematic biologists, and more sophisticated methods developed by the theoretical computer science community (Agarwala et al., 1996; Ambainis et al., 1997; Farach et al., 1995)], and shown that these bounds grow exponentially in the divergence. This exponential dependency upon the divergence translates to superpolynomial convergence rates for almost all trees (Erdős et al., 1997b), suggesting that accuracy may be poor if these methods are used to infer tree topologies under conditions of high divergence. However, these are upper bounds, and may be pessimistic.

Our paper has several contributions. First, we introduce the concept of fast convergence under a model of evolution, such as the Jukes-Cantor model. By this we mean that for all model trees, a method will with high probability recover the tree from sequences that grow only polynomially in the number of leaves, once we bound the mutation probabilities on the edges of the tree. We present a general technique for designing methods that are fast converging, and we present a new fast-converging method. This particular method is actually a boosted version of a classical method, called the Buneman Method; thus, we also introduce a general technique for boosting the performance of phylogenetic methods. This is the Disk-Covering Method (DCM). In its exact form DCM is not polynomial time, but has provable performance guarantees when used with certain base methods. For example, we can prove that under the Jukes-Cantor Markov model (Jukes and Cantor, 1969), the DCM-Buneman method is fast converging. In its heuristic form, DCM is polynomial time, and our experimental studies indicate substantial reductions in error rates for DCM-boosted polynomial time methods.
The rest of this paper is organized as follows. In Section 2 we present definitions and some basic results about convergence rates of standard methods. We also present a general technique for establishing fast convergence. In Section 3, we describe the fast-converging method, DCM-Buneman, and provide the proof of fast convergence. In Section 4, we describe DCM more generally, and describe how it can be used with other distance-based methods. In Section 5, we present the results of a preliminary experimental study. Finally, in Section 6, we present open problems and discuss related research.

2. BASICS

Markov models of evolution

Most Markov models of evolution assume that the sites in the sequence evolve identically and independently (i.i.d.), and for the most part, the results that can be proven about any Markov model of i.i.d. site evolution can be proven about any other such model. Because of this, we will describe our results in terms of the Jukes-Cantor model. T. Jukes and C. Cantor (1969) introduced a very simple model of DNA sequence evolution. [Please see the excellent review articles by Felsenstein (1988) and Swofford et al. (1996) for more information about statistical aspects of phylogenetic inference.]

Definition 1. Let $T$ be a fixed rooted tree with leaves labeled $1, \ldots, n$. The Jukes-Cantor Markov model of evolution makes the following assumptions:

1. The sites (positions within the sequence) evolve identically and independently (i.i.d.) down the tree from the root. The probability of an event on edge $e$ is independent of the events outside the subtree below $e$.
2. The possible states for each site are A, C, T, G (denoting the four possible nucleotides), and for each site, the state at the root is drawn from a distribution (typically uniform).
3. For each edge $e = (u, v) \in E(T)$, with $u$ the parent of $v$, if the state of a site is different at $u$ than at $v$, then each of the three remaining states is equally likely to be the state at $v$.
4. Associated with each edge $e$ in the tree $T$ is a Poisson random variable $X_e$ for the number of mutations of a randomly selected site on that edge. We let $\lambda_e$ denote the expectation of $X_e$.

We will sometimes refer to the pair $(T, \lambda)$ as a Jukes-Cantor tree. We are interested in estimating the topology of the Jukes-Cantor tree, and will study the accuracy of different methods for this problem. Since the number of changes on each edge influences the sequence length that is needed to estimate the tree, we will be interested in the performance of different phylogenetic estimators when $f = \min_e \lambda_e$ and $g = \max_e \lambda_e$ are fixed, but arbitrary.

Topological accuracy

While complete accuracy is the objective, partial accuracy is the rule, especially when trees are large and/or highly divergent (that is, contain evolutionarily distant pairs of taxa). For this reason, quantifications of degrees of accuracy have been proposed. The most favored such quantification is as follows.

Definition 2. Let $T$ be the (unrooted) true tree, and let $T'$ be the (unrooted) inferred tree, both leaf-labeled by a set $S$ of taxa. Let $e$ be an edge in $T$, and let $\pi_e$ denote the bipartition induced on $S$ by the removal of $e$ from $T$.
Let $C(T) = \{\pi_e : e \in E(T)\}$ and let $C(T')$ be equivalently defined. Then $T = T'$ if and only if $C(T) = C(T')$. Any bipartition $\pi \in C(T) - C(T')$ is called a false negative (FN) and any bipartition in $C(T') - C(T)$ is said to be a false positive (FP). Thus, a false negative is an edge in the true tree that is missing from the inferred tree, while a false positive is an edge in the inferred tree that is not in the true tree. We will say that an edge $e \in E(T)$ is recovered in $T'$ (or by the method $M$ which produces $T'$) if the bipartition in $C(T)$ associated with $e$ appears in $C(T')$. The FN rate is the ratio of the number of false negatives to the number of edges in $T$, and similarly the FP rate is the ratio of the number of false positives to the number of edges in $T'$.

Usually the model (true) tree is binary, and hence has $n - 3$ internal edges, where $n$ is the number of leaves. Also, many distance methods, such as Neighbor-Joining (Saitou and Nei, 1987) and the method of Agarwala et al. (1996) (which obtains a 3-approximation to the $L_\infty$-nearest tree problem), always produce binary trees. However, the Buneman Tree method (Buneman, 1971) often produces unresolved (i.e., not binary) trees.

Performance criteria

The assumption of Markov models of evolution allows the performance of phylogenetic tree reconstruction methods to be assessed, either through analytical means or by experiments.
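As an illustration of how such an experimental assessment might score an inferred tree, the FN and FP rates of Definition 2 can be sketched in a few lines of Python. The edge-list representation and the function names `bipartitions` and `fn_fp_rates` are assumptions made for this sketch, not part of the paper. The sketch counts only internal edges, on the assumption that leaf edges induce trivial bipartitions shared by every tree on the same leaf set.

```python
def bipartitions(edges, leaves):
    """Nontrivial leaf bipartitions induced by the edges of an unrooted tree.

    edges: list of (u, v) pairs; leaves: iterable of leaf labels.
    (Hypothetical representation chosen for this sketch.)
    """
    leaves = frozenset(leaves)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = set()
    for u, v in edges:
        # Collect the leaves reachable from v once edge (u, v) is removed.
        side, stack = set(), [(v, u)]
        while stack:
            node, parent = stack.pop()
            if node in leaves:
                side.add(node)
            stack.extend((nxt, node) for nxt in adj[node] if nxt != parent)
        side = frozenset(side)
        other = leaves - side
        # Leaf edges give trivial bipartitions; skip them (assumption above).
        if len(side) >= 2 and len(other) >= 2:
            found.add(frozenset((side, other)))
    return found


def fn_fp_rates(true_edges, inferred_edges, leaves):
    """Return (FN rate, FP rate) of the inferred tree against the true tree."""
    c_true = bipartitions(true_edges, leaves)
    c_inf = bipartitions(inferred_edges, leaves)
    fn = len(c_true - c_inf)   # edges of the true tree that were missed
    fp = len(c_inf - c_true)   # edges of the inferred tree that are wrong
    fn_rate = fn / len(c_true) if c_true else 0.0
    fp_rate = fp / len(c_inf) if c_inf else 0.0
    return fn_rate, fp_rate


# True tree ((1,2),3,(4,5)) versus inferred tree ((1,3),2,(4,5)).
true_tree = [(1, "a"), (2, "a"), ("a", "b"), (3, "b"), ("b", "c"), (4, "c"), (5, "c")]
inferred = [(1, "a"), (3, "a"), ("a", "b"), (2, "b"), ("b", "c"), (4, "c"), (5, "c")]
print(fn_fp_rates(true_tree, inferred, {1, 2, 3, 4, 5}))  # (0.5, 0.5)
```

An unresolved inferred tree, such as a star, has no internal edges and therefore a zero FP rate but an FN rate of 1 against any resolved true tree, which is why both rates are reported.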
There are several criteria that have been considered to be of fundamental importance by statisticians working in evolutionary tree reconstruction (Felsenstein, 1988):

• statistical consistency (with respect to a given model of evolution), which is said to hold if the probability of recovering the leaf-labeled tree converges to 1 as the sequence length increases;
• convergence rate (with respect to a given model of evolution), which is the rate at which the probability (of recovering the leaf-labeled tree) goes to 1, as the sequence length increases; and
• accuracy (with respect to a given model of evolution), the expected number of topological errors at a given sequence length.

The reference to the model of evolution is important, as some methods are statistically consistent under some models but not under others. Under the Jukes-Cantor model, many distance methods (and all the methods we consider in this paper) are statistically consistent. However, if the assumption of i.i.d. site evolution (or that the sites evolve under different rates but from a known distribution) is relaxed, then positive results become less likely. For example, under more general models of evolution, even maximum likelihood (under the correct model!) can be statistically inconsistent (Steel et al., 1994). The theorems we will develop in this paper are stated in terms of Jukes-Cantor site evolution, but apply more generally to other models for which statistical consistency of distance-based methods has been established.

Distance-based reconstruction

We begin with some definitions.

Definition 3. A matrix $D$ is called additive (or said to be a tree-metric) if there exists a tree $T$ with positive edge weighting $w$ such that $D_{ij} = d^T_{ij} := \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path in $T$ between leaves $i$ and $j$.

Additive matrices have the nice property that they correspond uniquely to positively edge-weighted trees (Buneman, 1971), and that given $D$, the tree $T$ and the weighting $w$ can be recovered in polynomial time (Waterman et al., 1977).

Let $E(T)$ denote the set of edges of $T$. We represent the evolutionary process by a set $\{X_e : e \in E(T)\}$ of Poisson processes, where $X_e$ is the Poisson random variable for the number of mutations of a random site on the edge $e$. Let $X_{ij} = \sum_{e \in P_{ij}} X_e$. Then $X_{ij}$ is a Poisson random variable. Let $\lambda_{ij} = E[X_{ij}]$. Then $\lambda$ is an additive matrix. We will call $\lambda_{ij}$ the expected evolutionary distance, or true distance, between $i$ and $j$. Clearly, $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e$, where $\lambda_e = E[X_e]$.

Because the matrix $\lambda$ is additive, given $\lambda$ the tree $T$ can be constructed in polynomial time using a number of different distance-based methods. Furthermore, a distance correction transformation exists for the Jukes-Cantor Markov model of site evolution, so that arbitrarily good approximations to the matrix $\lambda$ can be obtained if the sequence length is unboundedly large (Felsenstein, 1988; Warnow, 1996).

Definition 4.
The Jukes-Cantor distance correction is given as follows:

$$d_{ij} := -\frac{3}{4} \ln\left(1 - \frac{4}{3} h_{ij}\right),$$

where $h_{ij}$ is the normalized Hamming distance, or $H(i, j)/k$, where $H(i, j)$ is the Hamming distance between sequences $i$ and $j$, and $k$ is the sequence length.

It is not hard to show that for all Jukes-Cantor trees $(T, \lambda)$ and for all $i, j$, as $k \to \infty$, $d_{ij} \to \lambda_{ij}$. Consequently, the following two-step process is a statistically consistent distance-based approach to Jukes-Cantor tree reconstruction: first, compute the Jukes-Cantor distance matrix $d$; then, map $d$ (using some distance method $M$ such as Neighbor-Joining) to a (nearby) additive matrix $M(d) = D$. Since accuracy in phylogenetic reconstruction is based upon comparisons between unrooted tree topologies, we will say that the method is accurate on input $d$ if $D$ and $\lambda$ define the same unrooted leaf-labeled tree, even if they assign different weights to the edges of the tree.

Some distance methods

In this paper we demonstrate the value of DCM boosting on three different distance-based methods, all of which are statistically consistent for the Jukes-Cantor model when used with properly corrected distances. We now describe these three methods.

Buneman. The Buneman Tree method was originally suggested by Peter Buneman (1971), and polynomial time algorithms for this method were obtained in Bandelt and Dress (1992) and Berry and Gascuel (1997). This method takes as input a dissimilarity matrix $d$, and computes a tree as follows. First, the topology on every quartet of taxa is inferred using the Four-Point Method (FPM), which computes trees on four-leaf subsets only. Given the $4 \times 4$ dissimilarity matrix on $i, j, k, l$, the topology $ij|kl$ is returned (meaning $i, j$ are separated from $k, l$ by an edge) if $d_{ij} + d_{kl} < \min\{d_{ik} + d_{jl}, d_{il} + d_{jk}\}$.
If the minimum of the three pairwise sums is not unique, then the FPM returns the star tree (that is, the tree with one interior node, and all leaves adjacent to that interior node).
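The Jukes-Cantor correction and the Four-Point Method described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `jc_distance` and `four_point`, the nested-list matrix layout, and the choice to return infinity for saturated pairs (normalized Hamming distance of 3/4 or more, where the logarithm is undefined) are all choices made for this sketch rather than details taken from the paper.

```python
import math


def jc_distance(seq_i, seq_j):
    """Jukes-Cantor corrected distance -(3/4) ln(1 - (4/3) h) between two
    aligned sequences of equal length, with h the normalized Hamming distance."""
    if len(seq_i) != len(seq_j) or not seq_i:
        raise ValueError("sequences must be non-empty and of equal length")
    h = sum(a != b for a, b in zip(seq_i, seq_j)) / len(seq_i)
    if h >= 0.75:
        # Saturated pair: the correction is undefined (assumption: report inf).
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * h / 3.0)


def four_point(d, i, j, k, l):
    """Four-Point Method on the quartet {i, j, k, l}.

    Returns the split ((a, b), (c, e)) whose pairwise sum d[a][b] + d[c][e]
    is the strict minimum, or None (the star tree) if the minimum is tied.
    """
    sums = {
        ((i, j), (k, l)): d[i][j] + d[k][l],
        ((i, k), (j, l)): d[i][k] + d[j][l],
        ((i, l), (j, k)): d[i][l] + d[j][k],
    }
    best = min(sums.values())
    winners = [split for split, total in sums.items() if total == best]
    # Exact ties are compared with ==; noisy real data would rarely tie.
    return winners[0] if len(winners) == 1 else None


# Additive matrix of the quartet tree 01|23 with all edge weights equal to 1.
d = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
print(four_point(d, 0, 1, 2, 3))   # ((0, 1), (2, 3))
print(jc_distance("ACGT", "ACGA"))
```

Applying `four_point` to every four-leaf subset yields the quartet set that the Buneman Tree construction takes as input.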

Given a set $Q$ of trees, one on each quartet of leaves, the Buneman Tree is defined to be the maximally resolved tree $T$ satisfying the following condition:

• for all quartets $i, j, k, l$, if $T$ restricted to $i, j, k, l$ induces a binary tree (instead of a star), then the tree in $Q$ on $i, j, k, l$ is the same binary tree.

Such a tree always exists, since the star tree satisfies this constraint. What is nice is that the maximally resolved tree with this property is unique, as the following lemma shows.

Lemma 1. Let $d$ be an input dissimilarity matrix, let $Q$ be the set of four-leaf trees defined by the FPM, and let $T$ be the Buneman Tree defined by $d$. Then $C(T)$ is the set of bipartitions $(A, B)$ defined by

• for all $\{a, a'\} \subseteq A$ and $\{b, b'\} \subseteq B$, the tree $aa'|bb' \in Q$.

The Agarwala et al. Method. The $L_\infty$-nearest tree problem is to find, for an input dissimilarity matrix $d$, an additive matrix $D$ minimizing $L_\infty(d, D)$. This is an NP-hard optimization problem (Farach et al., 1995), but can be 3-approximated in polynomial time. Thus, a 3-approximation algorithm for the $L_\infty$-nearest tree problem takes as input a dissimilarity matrix $d$, and returns an additive metric $D'$ such that $L_\infty(d, D') \le 3 L_\infty(d, D)$ for all additive metrics $D$. There are several 3-approximation algorithms for this problem, of which the one by Agarwala et al. (1996) is the first (to our knowledge) with a provable performance guarantee.

Neighbor-Joining. This is a polynomial time method very much favored in the systematic biology community for use with large data sets. See Saitou and Nei (1987) for the first paper on this method and Atteson (1997) for a proof of statistical consistency and a bound on its convergence rate.

Convergence rates of some distance methods

We begin with the only established upper bounds on the convergence rates of these three methods. This analysis relates the convergence rate to the divergence of a tree, which we now define.

Definition 5.
Let $T$ be an arbitrary Jukes-Cantor tree and let $\lambda_e$ be the expected number of mutations of a random site on an edge $e$. Then $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e$, where $P_{ij}$ is the path in the tree $T$ between leaves $i$ and $j$. We define the divergence of the tree $T$ to be $\lambda_{\max} = \max_{ij} \{\lambda_{ij}\}$. Note that $\lambda_{\max}$ is unbounded even when $n$ (the number of leaves) is bounded, due to changes that are not observed.

Theorem 1. Let $T$ be a Jukes-Cantor tree with $n$ leaves, let $\lambda_e$ be the expected number of changes of a random site on edge $e$, and let $f = \min_e \lambda_e$. Let $d_{ij}$ be an estimate of $\lambda_{ij}$, the expected number of changes of a random site on the path between $i$ and $j$. Let $\varepsilon = L_\infty(d, \lambda)$.

1. The Neighbor-Joining method is guaranteed to be accurate if $\varepsilon < f/2$.
2. The Buneman Tree method is guaranteed to be accurate if $\varepsilon < f/2$. Furthermore, every edge $e \in E(T)$ such that $\lambda_e > 2\varepsilon$ is recovered by the Buneman Tree Method.
3. Any 3-approximation algorithm for the $L_\infty$-nearest tree problem is guaranteed to be accurate if $\varepsilon < f/8$. Furthermore, if $\lambda_e > 8\varepsilon$, then the edge $e$ is recovered by the algorithm.

For each method above, there is a constant $C$ that depends upon $f$ and $\delta$ such that if the sequence length $k$ exceeds $C \log n \, e^{O(\lambda_{\max})}$ then with probability at least $1 - \delta$, the method is accurate given sequences of length $k$ generated on $T$.

Proof. (1) was proven in Atteson (1997), and (2) and (3) were proven in Erdős et al. (1999) and Huson et al. (1998b). The bound on the convergence rate is based upon the sufficient condition for accuracy, established in items (1)-(3) above, and is given in Erdős et al. (1999).
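To make the quantities in Definition 5 and Theorem 1 concrete, the following sketch computes the additive matrix of true distances from an edge-weighted tree, its divergence, and the $L_\infty$ error of an estimate, and then checks the sufficient conditions stated in items 2 and 3 of the theorem. The dictionary-based tree representation and all function names are assumptions made for this illustration.

```python
def leaf_distances(edge_weights, leaves):
    """Path-length (additive) distances between all ordered leaf pairs.

    edge_weights: dict mapping (u, v) to the expected number of mutations
    on that edge; leaves: set of leaf labels. (Hypothetical representation.)
    """
    adj = {}
    for (u, v), w in edge_weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for src in leaves:
        stack = [(src, None, 0.0)]
        while stack:
            node, parent, acc = stack.pop()
            if node in leaves and node != src:
                dist[(src, node)] = acc
            for nxt, w in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, acc + w))
    return dist


def divergence(lam):
    """Largest true distance between any two leaves."""
    return max(lam.values())


def linf_error(d, lam):
    """L-infinity distance between an estimate d and the true distances lam."""
    return max(abs(d[pair] - lam[pair]) for pair in lam)


def sufficient_conditions(eps, f):
    """Sufficient conditions from items 2 and 3 of Theorem 1."""
    return {"buneman": eps < f / 2, "three_approx": eps < f / 8}


# Quartet tree 12|34 with leaf edges of weight 0.1 and internal edge 0.2.
tree = {(1, "u"): 0.1, (2, "u"): 0.1, ("u", "v"): 0.2, (3, "v"): 0.1, (4, "v"): 0.1}
lam = leaf_distances(tree, {1, 2, 3, 4})
f = min(tree.values())
d = {pair: value + 0.04 for pair, value in lam.items()}  # uniformly perturbed estimate
print(divergence(lam), linf_error(d, lam), sufficient_conditions(linf_error(d, lam), f))
```

In this toy case the perturbation of 0.04 stays below $f/2 = 0.05$, so the Buneman condition holds, but it exceeds $f/8$, so the guarantee for a 3-approximation algorithm does not apply.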

Here is an intuition about why this result should be true. Recall that $\lambda_{ij} = E[X_{ij}]$. As $X_{ij}$ is a Poisson random variable, its expectation is the same as its variance. Consequently, $\lambda_{ij} = \mathrm{Var}[X_{ij}]$, and when the variance is high, errors in estimating $\lambda_{ij}$ are also high.

These results prove statistical consistency, and also provide upper bounds on the sequence length that suffices for accuracy with high probability for these three methods. They also provide bounds on the false-negative rate for the Buneman Tree Method, and for any 3-approximation algorithm for the $L_\infty$-nearest tree.

The term $\lambda_{\max}$ is the important term in this upper bound, and it is bounded by $g \cdot \mathrm{diam}(T)$, where $g = \max_e \lambda_e$, and where $\mathrm{diam}(T)$ denotes the length of the longest path (measured in terms of the number of edges) in the tree $T$. The diameter of a tree on $n$ leaves can be as much as $n - 1$, or as small as $O(\log n)$, but under the uniform distribution it is typically $\Omega(\sqrt{n})$, as was shown in Erdős et al. (1999). Therefore, this theorem shows that the convergence rate is bounded by a function of $n$ that is at worst exponential, and is typically superpolynomial.

Fast convergence

Theorem 1 gives an upper bound on the sequence length that suffices for accuracy for three different distance methods, but the upper bound is high. This is unsatisfactory, because we would like methods to converge more quickly to the true tree than those upper bounds would suggest. We therefore define fast convergence as follows:

Definition 6. A method is said to be fast converging for the Jukes-Cantor model if for all fixed $f, g$ with $0 < f \le g$, and all Jukes-Cantor trees $T$ with $f \le \lambda_e \le g$ for all edges $e$, and all $\delta > 0$, there is a constant $C$ that depends upon $f$, $g$, and $\delta$, and a polynomial $p(n)$, so that if $k$ exceeds $C \cdot p(n)$ then the method recovers the true tree topology from sequences of length $k$ with probability at least $1 - \delta$.
Until recently, there were no proofs of fast convergence, and in fact, no bounds on the convergence rates of any method. However, in recent years, there have been several papers providing bounds on the convergence rates for various distance-based methods (Ambainis et al., 1997; Erdős et al., 1997a, 1999; Farach and Kannan, 1996), and the introduction of the first provably fast-converging methods (Csűrös and Kao, 1999; Erdős et al., 1997a,b, 1999; Cryan et al., 1998).

Proving fast convergence

The fast-converging methods have a similar structure, in that they use only close relationships to determine the tree. The proofs of fast convergence differ somewhat, but there is in general a common idea behind the proofs, which suggests a common proof technique. We now introduce this proof technique.

Definition 7. Let $e_{ij} = |d_{ij} - \lambda_{ij}|$ and let $e(q) = \max\{e_{ij} : \min(d_{ij}, \lambda_{ij}) \le q\}$.

If $q$ is large, then $e(q)$ will converge to 0 more slowly than if $q$ is small, as we now show:

Theorem 2. Let $T$ be a Jukes-Cantor tree. For all $w > 0$, $\delta > 0$, and $y > 0$, there exists a constant $C$ that depends upon $\delta$ and $y$, such that the sequence length that suffices for $e(w) < y$ with probability at least $1 - \delta$ is $C \log n \, e^{O(w)}$.

Proof. The proof follows from results obtained in the proofs of Theorems 8 and 9 from Erdős et al. (1999). The proof of Theorem 8 shows the following: For all Jukes-Cantor model trees $(T, \lambda)$, and for all $\delta > 0$, $q > 0$, $y > 0$, there exists a constant $C$ that depends upon $\delta$ and $y$, such that if the sequence length $k$ exceeds $C \log n \, e^{O(q)}$, then $\mathrm{Prob}[\,|d_{ij} - \lambda_{ij}| < y \ \forall i, j \text{ with } \lambda_{ij} \le q\,] > 1 - \delta$. Similarly, the proof of Theorem 9 shows the following: For all Jukes-Cantor model trees $(T, \lambda)$, and for all $\delta > 0$, $q > 0$, $y > 0$, there exists a constant $C$ that depends upon $\delta$ and $y$, such that if the sequence length $k$ exceeds $C \log n \, e^{O(q)}$, then $\mathrm{Prob}[\,|d_{ij} - \lambda_{ij}| < y \ \forall i, j \text{ with } d_{ij} \le q\,] > 1 - \delta$. The proof of the theorem follows from these two observations.

Corollary 1.
Let $M$ be a fixed phylogenetic method and assume that there exist functions $y(f)$, $A(g)$, and $F(g, n)$, so that for each Jukes-Cantor tree $(T, \lambda)$ on $n$ leaves with $0 < f \le \lambda_e \le g$ for all edges $e$,

• $y(f) = \Theta(f)$,
• $F(g, n) = O(A(g) \log n)$, and
• for all input dissimilarity matrices $d$, whenever $e(F(g, n)) < y(f)$ then $M$ is correct on input $d$.

Then $M$ is fast converging for the Jukes-Cantor model.

Proof. By Theorem 2, the sequence length that suffices for $e(F(g, n)) < y(f)$ is $O(\log n \cdot e^{O(F(g, n))})$. But this is $O(\log n \cdot n^{O(A(g))})$, which is bounded by a polynomial in $n$ since we have bounded $g$. Note that the smaller $A(g)$ is, the smaller the degree of the polynomial.

Thus, a proof technique for establishing fast convergence under the Jukes-Cantor model is to show that functions such as $y$, $A$, and $F$ exist. The first methods that were proven fast converging were the Short Quartet Methods (Erdős et al., 1997a,b, 1999), followed by the Harmonic Greedy Triplets method (Csűrös and Kao, 1999), a method proposed by Cryan et al. (1998), and the fast-converging method in this paper. The techniques used to prove fast convergence in these papers are not alike, but the ones in Erdős et al. (1997b, 1999) can be restated as using the technique above, and we believe the other proofs can also be so restated. The new fast-converging method we present is obtained by using a general technique for boosting the performance of phylogenetic methods, which we call the Disk-Covering Method, or DCM.

3. DCM-BUNEMAN, A FAST-CONVERGING METHOD

The basic structure of DCM-Buneman is quite similar to the basic structure of the Short Quartet Methods, from which it is derived. These methods have two phases, and take as input a dissimilarity matrix $d$. During the first phase a collection of trees $T_q$ is constructed, one for each $q \in \{d_{ij}\}$. The input to the reconstruction of $T_q$ is the submatrix of $d$ consisting only of those entries for which $d_{ij} \le q$. Note that for $q < q'$, the input to the reconstruction of $T_q$ is more reliable than the input to the reconstruction of $T_{q'}$, because $e(q) \le e(q')$.
On the other hand, the amount of data given as input to the reconstruction of $T_{q'}$ is more than the amount of data given as input to the reconstruction of $T_q$. Thus, the trees $T_q$ will differ as $q$ ranges over the entries of $d$, because they will be based upon differing quantities and qualities of data, and the task of the second phase is to select, from the returned trees, one that made best use of the input. The particular instantiations of the two phases differ between these different fast-converging methods.

In the Short Quartet Methods, when reconstructing $T_q$, we infer a tree on each quartet of leaves if its maximum interleaf distance is bounded by $q$. A unique tree on the entire set of leaves is then sought that agrees with every quartet tree in the input; if no such tree can be found, then $T_q$ is left undefined. The analysis of the Short Quartet Methods showed that when the sequences are long enough, then for all $q$ such that $T_q$ is defined, $T_q$ is the true tree with high probability. The Short Quartet Methods are polynomial time and fast converging, but by design either reconstruct the true tree or fail (with high probability) to reconstruct anything.

The fast-converging method that we have designed is very similar to the Short Quartet Methods, but it has two distinct advantages: first, it always reconstructs a tree, and second, it performs much better in experimental studies. Thus, while it does not provide any theoretical advantage over the Short Quartet Methods, it provides empirical advantages. Furthermore, the fast-converging method we present is the result of applying the Disk-Covering Method (or DCM) to a simple polynomial time method. DCM is a very general phylogenetic method booster, which can be used with any phylogenetic reconstruction method.
Our experimental performance study shows that DCM boosting improves the accuracy of many distance-based methods at realistic sequence lengths.

Phase I of DCM-Buneman

We now describe how we compute each $T_q$, as $q$ ranges over the entries of $d$. Much of the algorithm for computing $T_q$ is based upon graph-theoretic concepts and results, and so we will begin with some graph theory.

Graph-theoretic material.

Definition 8. A graph is triangulated (or chordal) if no subset of nodes induces a cycle of size four or more.

There are many theoretical results established about triangulated graphs, and many NP-hard problems become solvable in polynomial time when restricted to triangulated graphs. The following can be found in Buneman (1974) and Golumbic (1980).

Lemma 2. Every triangulated graph is the intersection graph of subtrees of a tree, and vice versa. Every triangulated graph $G$ has a simplicial elimination ordering, $v_1, v_2, \ldots, v_n$; this is an ordering of the nodes so that the set $X_i = \{v_j : j > i \text{ and } (v_i, v_j) \in E\}$ forms a clique [i.e., for all $\{v_k, v_l\} \subseteq X_i$, $(v_k, v_l) \in E$]. The maximal cliques (cliques that cannot be enlarged by the addition of any further vertices) in $G$ are of the form $\{v_i\} \cup X_i$; hence there are at most $n$ maximal cliques and these can be found in $O(n^2)$ time. Given a triangulated graph, a simplicial elimination ordering for the graph can be found in $O(n^2)$ time, and from it the maximal cliques can also be found in that time.

We now define threshold graphs.

Definition 9. Let $d$ be an $n \times n$ dissimilarity matrix (i.e., a symmetric matrix that is 0 on the diagonal) and let $q$ be any real number. The threshold graph $\mathrm{Thresh}(d, q)$ is defined as follows. The vertex set is $\{1, 2, \ldots, n\}$ and $(i, j)$ is an edge if and only if $d_{ij} \le q$.

Lemma 3. If $d$ is an additive matrix, then $\mathrm{Thresh}(d, q)$ is triangulated.

Proof. (We are not the first to observe this fact, but we provide a proof because it is extremely simple.) By Lemma 2, to prove that a graph is triangulated, it suffices to prove that it is isomorphic to the intersection graph of subtrees of a tree. Let $d$ be an arbitrary additive matrix, and let $(T, w)$ be the edge-weighted tree associated uniquely with $d$. Let $q > 0$ be given.
Add intermediate vertices to the edges of $T$ and reweight the edges so that the path distances between leaf pairs are unchanged, but so that for every pair of leaves $u, v$ in $T$, if $d_{u,v} > q/2$ then there is a node $x$ in the enlarged tree $T'$ so that $d_{T'}(u, x) = q/2$ and $d_{T'}(x, v) = d_{T'}(u, v) - q/2$. Now let $X_u$ denote the subtree of $T'$ consisting of the nodes within distance at most $q/2$ of $u$. Note that $X_u \cap X_v \ne \emptyset$ if and only if $d_{u,v} \le q$, and that the threshold graph $\mathrm{Thresh}(d, q)$ is identical to the intersection graph of the $X_u$, as $u$ ranges over the leaves of $T$. Consequently $\mathrm{Thresh}(d, q)$ is triangulated.

Constructing $T_q$. We now describe how we compute a particular $T_q$.

• Step 1: Compute $\mathrm{Thresh}(d, q)$.
• Step 2: Triangulate $\mathrm{Thresh}(d, q)$: add edges to $\mathrm{Thresh}(d, q)$ to make it triangulated, while minimizing the weight of the largest edge added. (The weight of edge $(i, j)$ is given by $d_{ij}$.) We call the resultant triangulated graph $\mathrm{Thresh}^*(d, q)$.
• Step 3: Compute Buneman Trees for all maximal cliques in $\mathrm{Thresh}^*(d, q)$. Each maximal clique defines a subset of the taxa (for example, as represented by the DNA sequences at the associated leaves of the tree). We compute the Buneman Tree for each such subset of the taxa.
• Step 4: Merge the subtrees into a supertree.

We now discuss the specific techniques we use to implement the various steps, and their computational complexity. Computing the threshold graph is polynomial time, but minimally triangulating the threshold graph is NP-hard (McMorris et al., 1994). [In practice, the triangulation can be obtained using greedy heuristics, and these will generally perform well when $\mathrm{Thresh}(d, q)$ is close to triangulated. Because we compute $d$ using the Jukes-Cantor distance calculation, for long sequences $d$ is close to additive, and so $\mathrm{Thresh}(d, q)$ will be close to triangulated. Thus, polynomial time techniques can be used to triangulate the threshold graph without too much loss in performance, as our experiments suggest.]
Calculating maximal cliques in triangulated graphs is polynomial time (Golumbic, 1980). The merger of the subtrees into a supertree is the only remaining task, but we show that we can accomplish this merger in polynomial time and ensure accuracy in the supertree, when the subtrees are correct and based upon a large enough threshold graph.
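As a rough illustration of Steps 1 and 3 together with Lemma 2, the following sketch builds Thresh(d, q) for a small additive matrix, searches for a simplicial elimination ordering by repeatedly removing a simplicial vertex, and reads off the maximal cliques as the sets $\{v_i\} \cup X_i$. The function names and the five-taxon matrix (path distances on a hypothetical caterpillar tree with unit edge weights) are assumptions made for illustration, and the naive search is slower than the $O(n^2)$ bound quoted earlier; it is not the authors' implementation.

```python
from itertools import combinations

def threshold_graph(d, q):
    """Adjacency sets of Thresh(d, q): i ~ j iff d[i][j] <= q, i != j."""
    n = len(d)
    return {i: {j for j in range(n) if j != i and d[i][j] <= q}
            for i in range(n)}

def simplicial_elimination_ordering(adj):
    """Return an ordering in which every vertex's later neighbours form a
    clique, or None if no such ordering exists (graph not triangulated)."""
    remaining = set(adj)
    order = []
    while remaining:
        for v in sorted(remaining):
            nbrs = adj[v] & remaining
            # v is simplicial if its remaining neighbours are pairwise adjacent
            if all(b in adj[a] for a, b in combinations(nbrs, 2)):
                order.append(v)
                remaining.remove(v)
                break
        else:
            return None  # no simplicial vertex left
    return order

def maximal_cliques(adj, order):
    """Candidates {v_i} | X_i, with non-maximal candidates filtered out."""
    pos = {v: k for k, v in enumerate(order)}
    cands = [frozenset({v} | {u for u in adj[v] if pos[u] > pos[v]})
             for v in order]
    return [c for c in cands if not any(c < other for other in cands)]

# Hypothetical additive matrix: leaves 0,1 | 2 | 3,4 along a caterpillar tree.
d = [[0, 2, 3, 4, 4],
     [2, 0, 3, 4, 4],
     [3, 3, 0, 3, 3],
     [4, 4, 3, 0, 2],
     [4, 4, 3, 2, 0]]
adj = threshold_graph(d, 3)
order = simplicial_elimination_ordering(adj)
cliques = sorted(sorted(c) for c in maximal_cliques(adj, order))
print(order, cliques)
```

For this matrix and $q = 3$ the threshold graph consists of two triangles sharing taxon 2, so each maximal clique would be handed to the base method as its own subproblem.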

Supertree Construction Algorithm (SCA). The construction of the supertree from the subtrees is an interesting problem, because we would like to ensure that if all the subtrees are correct (in that the true tree induces these subtrees when restricted to the subsets of leaves) then a supertree consistent with all the subtrees should be returned. However, this generalizes to the Subtree Compatibility Problem, which is NP-complete (Steel, 1992). Thus we will need a special case of the Subtree Compatibility Problem if we are to solve this problem exactly and in polynomial time. (Note that while we were willing to accept a suboptimal triangulation, we are not willing to suboptimally construct the supertree, because we need to obtain the true tree, if possible, and not an incorrect tree.) Also, even if the subtrees are correct, they may not uniquely define the supertree (i.e., many different trees may be consistent with the set of subtrees). For this reason, the set of subsets must be defined with care, with respect both to computational consequences as well as to uniqueness of the supertree compatible with the subtrees.

We now describe how we compute a supertree from a set of subtrees. Our algorithm has the nice property that when applied to properly defined inputs, it is guaranteed to reconstruct a unique supertree consistent with the inputs. We assume that the input to the supertree construction algorithm is a triangulated graph $G$ and a collection of subtrees, one for each maximal clique in $G$. We let $T_C$ denote the subtree for clique $C$. If $G$ is not connected, then the algorithm produces a forest (i.e., a tree on each component of $G$); thus we will assume that $G$ is connected.

Stage I: Preprocessing: First obtain a simplicial elimination ordering $v_1, v_2, \ldots, v_n$ for $G$. Compute $C_i = \{v_i\} \cup X_i$, where $X_i = C(v_i) \cap \{v_{i+1}, v_{i+2}, \ldots, v_n\}$ is the set of neighbors of $v_i$ that follow it in the simplicial elimination ordering; here $C(v_i)$ denotes the set of all neighbors of $v_i$ in $G$.
The set of maximal cliques is a subset of the set $\{C_1, C_2, \ldots, C_n\}$. For each $C_i$, find a maximal clique $C$ containing $C_i$ and compute a tree for $C_i$ by deleting the leaves in $C - C_i$ from $T_C$. In this way, we associate a tree $t_i$ with every $C_i$.

Stage II: Construct the tree: For $i = n-4, n-5, \ldots, 1$, compute the tree $T_i$ formed by merging $t_i$ and $T_{i+1}$, using the Strict Consensus Subtree Merger method.

Strict Consensus Subtree Merger. The Strict Consensus Subtree Merger method contracts a minimum set of edges in each tree in order to make them identical on the subtrees they induce on $X$, the set of leaves the two trees share. The strict consensus (Day, 1995) of the induced subtrees is defined to be the maximally resolved tree that is a common contraction of the two subtrees. We will call this subtree on $X$ the backbone. Merging the two trees together is then achieved by attaching the pieces of each tree appropriately to the different edges of the backbone. It is worth noting that the strict consensus subtree merger of two trees, while it always exists, may not be unique. In other words, it may be that some piece of each tree attaches onto the same edge of the backbone. We call this a collision. For example, in Fig. 1, the common intersection of the two leaf-sets is $X = \{1, 2, 3, 4\}$, and the strict consensus of the two subtrees induced by $X$ is the 4-star. This is the backbone; it has four edges, and there is a collision on the edge of the backbone incident to leaf 4, but no collision on any other edge. Collisions are problematic, as the Strict Consensus Subtree Merger will potentially introduce false edges or lose true edges when they occur. However, as we will show, when the subtrees are correct and the threshold is selected to be large enough, then there are no collisions. In this case, the true tree is reconstructed.

FIG. 1. Merging two trees together, by first transforming them (through edge contractions) so that they induce the same subtrees on their shared leaves.
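The backbone computation can be sketched as follows, under the assumption that each unrooted tree is written as nested tuples of integer leaf labels and is summarized by the nontrivial bipartitions it induces on the shared leaf set $X$. The strict consensus of the two induced subtrees then keeps exactly the bipartitions common to both. The helper names and the three example trees are hypothetical (they are not the trees of Fig. 1), and the sketch stops at the backbone: it does not attach the remaining pieces or detect collisions.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples with integer leaf labels."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree, keep):
    """Nontrivial bipartitions induced on `keep` (a subset of the leaves).
    Each bipartition is stored as its lexicographically smaller side."""
    keep = frozenset(keep)
    found = set()

    def visit(node):
        below = leaves(node) & keep
        if isinstance(node, tuple):
            for child in node:
                visit(child)
        # keep only bipartitions with at least two leaves on each side
        if 2 <= len(below) <= len(keep) - 2:
            found.add(min(below, keep - below, key=sorted))

    visit(tree)
    return found

def backbone(t1, t2):
    """Shared leaf set X and the bipartitions of the strict consensus on X."""
    shared = leaves(t1) & leaves(t2)
    return shared, splits(t1, shared) & splits(t2, shared)

# Hypothetical trees: t1 and t2 disagree on {1,2,3,4}; t1 and t3 agree.
t1 = ((1, 2), 5, (3, 4))
t2 = ((1, 3), 6, (2, 4))
t3 = ((1, 2), (3, (4, 7)))

print(backbone(t1, t2))  # no common bipartition: the backbone is a 4-star
print(backbone(t1, t3))  # the bipartition {1,2} | {3,4} survives
```

An empty set of common bipartitions corresponds to a star backbone, which is the situation in which every piece of either tree must attach to a pendant edge and collisions become possible.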

Theorem 3. Let $G$ be a triangulated graph with $n$ vertices, let $\mathcal{T}$ be the associated set of trees on each maximal clique, and assume that $G$ and $\mathcal{T}$ are given as input. Then SCA takes $O(n^2)$ time.

Proof. The proof of this is straightforward. Computing the perfect elimination ordering takes $O(n^2)$ time, and the rest follows from the observation that merging two trees takes $O(n)$ time since computing the strict consensus of two trees takes $O(n)$ time (Day, 1995).

Conditions under which $T_q$ is the true tree. We now describe the conditions under which the reconstructed tree $T_q$ is the true tree. We begin with some definitions.

Definition 10. Let $(T, w)$ be a binary tree edge weighted by $w : E(T) \to \mathbb{R}^+$, and leaf labeled by the set $S = \{1, 2, \ldots, n\}$ of taxa. Let $l$ be the additive distance matrix associated to $T$. Let $e$ be an edge in $T$ that is not incident to a leaf of $T$. Around $e$, there are four subtrees, $A$, $B$, $C$, and $D$. Let $a$, $b$, $c$, and $d$ be four leaves, one in each of the four subtrees $A$, $B$, $C$, and $D$, respectively, closest to $e$ [where the distance between nodes $p$ and $q$ is measured as $\sum_{e \in P_{pq}} w(e)$, with $P_{pq}$ the path between $p$ and $q$]. We call $\{a, b, c, d\}$ a short quartet around $e$, and the collection of all short quartets around internal edges of $T$ is denoted by $Q_{short}(T)$. The maximum $l_{i,j}$ such that $i$ and $j$ are in a short quartet together is called the $l$-width($T$). The graph $G_{sq}$ on vertex set $S = \{1, 2, \ldots, n\}$ is defined by $(i, j) \in E(G_{sq})$ if $i$ and $j$ are in some short quartet together.

Theorem 4. Let $T$ be a fixed leaf-labeled tree, let $G$ be a triangulated graph such that $G_{sq} \subseteq G$, and assume that the Buneman Tree Method applied to each maximal clique in $G$ reconstructs the correct subtree (i.e., it reconstructs the subtree of $T$ induced by the maximal clique). Let $\mathcal{T}$ be the collection of Buneman Trees on maximal cliques of $G$, and let $T^*$ be the tree obtained by applying SCA to $(G, \mathcal{T})$. Then $T^* = T$.

Proof. Let $T$ be a tree whose leaves are labeled by $S = \{v_1, v_2, \ldots, v_n\}$.
Let $G$ be a triangulated graph on $S$, and let $\mathcal{T} = \{T_A\}$, where $T_A$ is a tree on leaf set $A$ for every maximal clique $A$ in $G$. Let $s = \{v_1, v_2, \ldots, v_n\}$ be a simplicial elimination ordering for $G$. Recall the definitions of $t_i$, $T_i$, $X_i$ from the description of SCA. The proof proceeds by showing that $T|_{\{v_i, v_{i+1}, \ldots, v_n\}} = T_i$ for all $i$. The base case requires that we show that $T_{n-3} = T|_{\{v_{n-3}, v_{n-2}, v_{n-1}, v_n\}}$, but this follows trivially since we assume $T_{n-3}$ is correct. Now assume that $T_i = T|_{\{v_i, v_{i+1}, \ldots, v_n\}}$ for some $i \in \{2, 3, \ldots, n-3\}$. Consider $X_{i-1}$. Note that by definition $X_{i-1} = C(v_{i-1}) \cap \{v_i, v_{i+1}, \ldots, v_n\}$, and that $X_{i-1}$ forms the leaf set of the backbone of the strict consensus merger of $t_{i-1}$ and $T_i$. Also $X_{i-1}$ is a clique, and so by assumption $T_i|_{X_{i-1}} = t_{i-1}|_{X_{i-1}}$. Consequently there is no edge contraction when we compute the backbone.

To complete the proof that $T_{i-1} = T|_{\{v_{i-1}, v_i, \ldots, v_n\}}$ we need only show that there is no collision formed by the merger of the two trees. There can be a collision only if the backbone contains an edge onto which both $v_{i-1}$ and some other $v_j \notin X_{i-1}$ attach. Let $e$ be the edge onto which $v_{i-1}$ attaches, and suppose there is a collision on this edge $e$. Thus, some subtree $t'$ of $T_i$ attaches onto $e$. (Note that in this case, these are true attachments in the sense that $v_{i-1}$ and $t'$ also attach to the path associated to $e$ in the true tree.) Let the leaf set of $t'$ be $Y$, and note $Y \subseteq \{v_i, v_{i+1}, \ldots, v_n\} - X_{i-1}$. Let $P$ be the path in $T$ corresponding to the edge $e$ and let its endpoints be $a$ and $b$. Consider the subtree $T'$ of $T$ obtained by deleting all the nodes in $T$ that are separated from $a$ by the deletion of $b$, or vice versa, and let $A_{a,b}$ be the leaves of $T'$. In other words, $T'$ consists of the path $P$ and all subtrees of $T$ that attach to interior nodes of $P$. The following conditions are then true:

1. $v_{i-1} \in A_{a,b}$ and all leaves in $t'$ are also in $A_{a,b}$.
2. $G_{sq}$ restricted to $A_{a,b}$ is path connected.
3. $X_{i-1} \cap A_{a,b} = \emptyset$.

The proofs of (1) and (3) follow from the fact that $T_i$ and $t_{i-1}$ are correct. Fact (2) can be proven by induction, and uses the fact that every short quartet in the true tree $T$ induces a four-clique in $G$. Now, let $P'$ be a path lying in $G_{sq} \cap A_{a,b}$ from $v_{i-1}$ to some node in $Y$. Let $y$ be the first node from $Y$ on $P'$; by definition, $y \notin X_{i-1}$. By (3), the path from $v_{i-1}$ to $y$ lies entirely in $\{v_1, v_2, \ldots, v_{i-1}\}$, so that $(v_{i-1}, y) \in E(G)$ [this follows from facts about simplicial elimination orderings, see (Golumbic, 1980)]. Consequently $y \in C(v_{i-1}) \cap \{v_i, v_{i+1}, \ldots, v_n\} = X_{i-1}$. However, this contradicts our earlier conclusion that $y \notin X_{i-1}$.

We now describe another condition under which $T_q$ is guaranteed to be the true tree.

Theorem 5. Let $(T, l)$ be a Jukes-Cantor tree, $d$ the input dissimilarity matrix, and $G$ a triangulated graph with $G_{sq} \subseteq G$; thus every short quartet in $T$ induces a four-clique in $G$. Furthermore, assume that for every short quartet $\{i, j, k, l\}$ in $T$, the Buneman Tree on $\{i, j, k, l\}$ is $T|_{\{i, j, k, l\}}$ (i.e., the correct tree). If the supertree construction algorithm applied to Buneman Trees on the maximal cliques of $G$ produces a binary tree $T'$, then $T' = T$.

Proof. We begin by citing the result proved by Erdős et al. (1999).

Lemma 4. Let $(T, w)$ be an edge-weighted tree leaf labeled by $S$ and let $T'$ be a tree also leaf labeled by $S$. If for every short quartet $\{i, j, k, l\}$ in $T$, the tree $T'$ induces the same tree on $\{i, j, k, l\}$ as $T$, then $T = T'$.

Now suppose that the supertree construction algorithm produces a binary tree $T'$. In this case, every Buneman Tree on every maximal clique is binary, and there are no collisions during the merger. Therefore, $T'$ agrees with $T$ for every short quartet of $T$. Then by the above stated lemma, $T = T'$.

Phase II of DCM-Buneman

In the previous sections we showed how to compute each $T_q$, and also established two conditions under which $T_q$ would be the true tree. We now show how we select a particular tree $T_q$ to return as the output of the DCM-Buneman Method. We select $T_q$ using the following rule: Return the most resolved $T_q$ (i.e., the one with the most internal edges), and if there is more than one such tree, then return the one associated to the largest $q$.

Performance guarantees of DCM-Buneman

Theorem 6. Let $T$ be a Jukes-Cantor model tree and let $0 < f \le l_e \le g$ for all edges $e$. Recall that $l$-width($T$) is the largest $l$-distance between two leaves in a short quartet. Then DCM-Buneman is accurate on input $d$ if $e(l\text{-width}(T) + 3f/2) < f/2$.

Proof. Let $q = l\text{-width}(T) + 3f/2$ and assume that $e(q) < f/2$. Now consider the threshold graph $\mathrm{Thresh}[d, l\text{-width}(T) + f/2]$.
Since $e(q) < f/2$, the following are true:

• $G_{sq} \subseteq \mathrm{Thresh}[d, l\text{-width}(T) + f/2]$.
• $\mathrm{Thresh}[d, l\text{-width}(T) + f/2] \subseteq \mathrm{Thresh}[l, l\text{-width}(T) + f]$.
• $\mathrm{Thresh}[l, l\text{-width}(T) + f] \subseteq \mathrm{Thresh}[d, l\text{-width}(T) + 3f/2]$.

Since $l$ is additive, $\mathrm{Thresh}[l, l\text{-width}(T) + f]$ is triangulated by Lemma 3. Hence, the minimal triangulation of $\mathrm{Thresh}[d, l\text{-width}(T) + f/2]$ is a subgraph of $\mathrm{Thresh}[d, l\text{-width}(T) + 3f/2]$. Consequently the Buneman Tree method computes the correct tree for every maximal clique in $\mathrm{Thresh}^*[d, l\text{-width}(T) + f/2]$. By Theorem 4, the strict consensus subtree merger reconstructs the true tree. Hence there is at least one threshold $p$ for which $T_p$ is the true tree, and it is $p = l\text{-width}(T) + f/2$. In Phase II of DCM-Buneman we select the most resolved tree, and if there is more than one equally resolved tree, we select the tree associated to the largest threshold. So suppose there is a $p' \ge p$ such that $T_{p'}$ is also binary. By Theorem 5, they are identical, and both equal $T$. Thus if $e(q) < f/2$ where $q = l\text{-width}(T) + 3f/2$, then DCM-Buneman reconstructs the true tree.

Theorem 7. DCM-Buneman is fast converging for the Jukes-Cantor model.

Proof. By Theorem 6 and Corollary 1, all we need to establish is that $l\text{-width}(T) + 3f/2 = O(g \log n)$. Clearly $l\text{-width}(T) + 3f/2 = O(l\text{-width}(T))$. Then $l\text{-width}(T) = O(g \log n)$, as was shown in Erdős et al. (1999). Thus, DCM-Buneman is fast converging.

A comparison between the convergence rates of the Buneman Tree Method and DCM-Buneman is interesting. Let $(T, l)$ be a fixed Jukes-Cantor model tree. The Buneman Tree Method is statistically consistent for the Jukes-Cantor model of evolution, but the only established upper bounds on the convergence rate indicate that it converges from sequence lengths that grow exponentially in $l_{\max} = \max_{ij} l_{ij}$, and so the Buneman Tree Method is not likely to be fast converging (see discussion following Theorem 1). However, for the same tree, the convergence rate of the DCM-Buneman method is much faster, and in fact DCM-Buneman is fast converging. The difference in convergence rates is obtained through restricting the attention to the small distances in the data set, rather than using all the distances. Thus DCM-Buneman is a boosted version of the Buneman method.

4. DCM BOOSTING OTHER METHODS

DCM-Buneman is actually a special instantiation of a very general two-phase technique (called the Disk-Covering Method, or DCM), which can be used in conjunction with any phylogenetic method. In the first phase, we construct a tree $T_q$ for each $q \in \{d_{ij}\}$, and in the second phase, we compute a tree on the entire set of taxa, by taking a consensus of the trees $T_q$. When used in conjunction with the phylogenetic method $M$, we call this DCM-M. The method $M$ is used to reconstruct the subtrees on the maximal cliques of the triangulated threshold graph, and we design the second phase (taking the consensus of the trees $T_q$) to optimize the performance of the method.

Phase I

We now describe how we perform the first phase, in which a tree $T_q$ is computed for each $q \in \{d_{ij}\}$. Let $S$ be the input set of sequences, and let $q \in \{d_{ij}\}$ be the selected threshold. Let $M$ be the base phylogenetic method.

• We construct the threshold graph Thresh(d, q). We triangulate each component of Thresh(d, q), minimizing the weight of the largest edge added, thus obtaining a triangulated graph Thresh*(d, q).
• We compute the maximal cliques in Thresh*(d, q), and compute a tree on each maximal clique using the method $M$.
• We apply the Supertree Construction Algorithm to the set of trees defined for the maximal cliques, obtaining a tree $T_q$ if Thresh*(d, q) is connected, and otherwise obtaining a forest $F_q$.
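To make the outer loop of Phase I concrete, the sketch below enumerates the candidate thresholds $q \in \{d_{ij}\}$ and, for each one, lists the connected components of Thresh(d, q); a single component means Phase I would yield a tree $T_q$, several mean it would yield a forest $F_q$. Triangulation, the base method $M$, and the supertree merger are deliberately left out. The function names and the five-taxon matrix are illustrative assumptions.

```python
def candidate_thresholds(d):
    """Distinct off-diagonal entries of d, in increasing order."""
    n = len(d)
    return sorted({d[i][j] for i in range(n) for j in range(i + 1, n)})

def components(d, q):
    """Connected components of Thresh(d, q), found by depth-first search."""
    n = len(d)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            # follow every threshold-graph edge leaving u
            stack.extend(v for v in range(n)
                         if v != u and d[u][v] <= q and v not in comp)
        seen |= comp
        comps.append(sorted(comp))
    return comps

# Hypothetical additive matrix on five taxa.
d = [[0, 2, 3, 4, 4],
     [2, 0, 3, 4, 4],
     [3, 3, 0, 3, 3],
     [4, 4, 3, 0, 2],
     [4, 4, 3, 2, 0]]

for q in candidate_thresholds(d):
    comps = components(d, q)
    kind = "tree T_q" if len(comps) == 1 else "forest F_q"
    print(q, comps, kind)
```

On this matrix the smallest threshold leaves three components, while the two larger thresholds connect all five taxa.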
This first phase can be modified to allow for the triangulation of the threshold graph to be done suboptimally, and in practice this is what we have done (greedy triangulations affect the performance of the method only very slightly, as our experiments show). The construction of a supertree from subtrees can be implemented in various ways, as well. We have elected to be quite conservative, and hence our Supertree Construction Algorithm employs the Strict Consensus Subtree Merger technique; however, this can also be modified.

Phase II

In the second phase, we take the trees $T_q$ we have computed in Phase I and infer a consensus of these trees. We have experimented with using DCM boosting in conjunction with several distance-based methods, including Neighbor-Joining (NJ), the Agarwala et al. (1996) algorithm that 3-approximates the $L_\infty$-nearest tree, and the Buneman Tree method. Our experimental studies for all these methods indicate that for almost every small $q$, the tree $T_q$ has very low false-positive rates, typically close to 0. Consequently, almost all $T_q$ are either contractions of the true tree or close to being contractions of the true tree. (This is perhaps not at all surprising, since we designed the merger using the strict consensus technique, and this collapses edges that are not supported by every subtree!) This suggests the following implementation of Phase II: take all the trees $T_q$, and compute the asymmetric median tree of these trees (Phillips and Warnow, 1996). We now define this consensus technique.

Definition 11. The asymmetric median tree of a set of leaf-labeled trees $\mathcal{T} = \{T_1, T_2, \ldots, T_p\}$ is a tree $T$ such that $C(T) \subseteq \bigcup_i C(T_i)$, and such that if each $c \in C(T)$ is weighted by the number of trees $T_i$ that contain $c$, then $w(T) = \sum_{c \in C(T)} w(c)$ is maximum.

The idea behind the asymmetric median tree is that when the input trees have low false-positive rates, the asymmetric median tree method recovers as many of the true edges as possible.
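The weighting in Definition 11 can be illustrated with a small sketch. It assumes that each element of $C(T_i)$ is a bipartition of the taxa, stored as the side that does not contain taxon 0, and it selects bipartitions with a simple greedy rule (heaviest first, keeping a bipartition only if it is compatible with those already chosen). That greedy order, the encoding, and the example inputs are assumptions made for illustration; the result is not guaranteed to maximize $w(T)$ and is not necessarily the strategy the authors implemented.

```python
from collections import Counter

def compatible(a, b, taxa):
    """Two bipartitions can coexist in one tree iff some pair of sides is disjoint."""
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def greedy_median(split_sets, taxa):
    """Pick pairwise-compatible bipartitions greedily by weight.

    split_sets: one set of bipartitions per input tree T_i.
    Returns the chosen bipartitions and their total weight w(T).
    """
    taxa = frozenset(taxa)
    # w(c) = number of input trees that contain bipartition c
    weight = Counter(s for splits in split_sets for s in splits)
    ranked = sorted(weight.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    chosen = []
    for s, _ in ranked:
        if all(compatible(s, t, taxa) for t in chosen):
            chosen.append(s)
    return chosen, sum(weight[s] for s in chosen)

# Hypothetical input trees on taxa 0..4, given by their bipartitions.
trees = [
    {frozenset({2, 3, 4}), frozenset({3, 4})},
    {frozenset({2, 3, 4})},
    {frozenset({1, 2}), frozenset({3, 4})},
]
chosen, total = greedy_median(trees, range(5))
print(chosen, total)
```

Here the bipartition $\{1, 2\}$, supported by only one tree, conflicts with $\{2, 3, 4\}$ and is dropped, so the selected tree keeps the two bipartitions that each appear in two input trees.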
Computing the asymmetric median tree is NP-hard, and so in practice we have implemented this using a greedy strategy (which is not guaranteed to find an optimal solution). This greedy technique nevertheless has good empirical performance, as our experimental study shows. This implementation of Phase II can be used with DCM-Buneman as well, but under these conditions we do not have provable performance guarantees.

5. EXPERIMENTAL RESULTS

We briefly describe a small portion of our experimental performance analysis. For additional performance results based upon simulating sequence evolution, see Huson et al. (1998a).

Model trees and simulated datasets

The basic model tree that we use has its topology and rates of evolution along the edges based upon reconstructions of the African Eve data set (Maddison et al., 1992) restricted to its human mitochondrial DNA sequences. We then scaled the rates of evolution on this basic model tree up, to produce a number of different trees on which there were high evolutionary rates. We used this larger set of model trees to generate several hundred different sets of DNA sequences using the ecat simulator (Rice, 1997), and using the Jukes-Cantor model of evolution. Later, we will report on the performance of DCM boosting on one particular scaled-up version of this basic tree (in which the largest probability of change on any edge is equal to 0.48).

Distance calculations

We computed Jukes-Cantor distance matrices for each data set. On some data sets the rate of evolution was high enough that some pairs of sequences differed in 75% or more of their positions. For such pairs of sequences, the standard Jukes-Cantor distance calculation cannot be used since the log cannot be computed. For these pairs, we defined $d_{ij}$ using the following version of the large value replacement technique (Swofford et al., 1996). We computed the maximum Jukes-Cantor distance, multiplied that value by the number $n$ of leaves in the matrix, and replaced all undefined values by this large number. These matrices were then input to six different distance-based methods: Neighbor-Joining (NJ), the Agarwala et al. method, the Buneman Tree, and the DCM-boosted versions of these three methods.

Performance evaluation criteria

We explored performance with respect to accuracy of the topology recovered by each method, by comparing the reconstructed tree to the model tree. Recall that this accuracy is quantified by examining false-negative (FN) rates and false-positive (FP) rates (see Definition 2). Recall that sequence lengths beyond 5000 nucleotides are considered unusually long for tree reconstruction, and that in general convergence to the true tree or acceptable error rates within 1000 nucleotides is thus the critical test of performance (all these methods will converge to the true tree given long enough sequences, since they are all statistically consistent under this model of evolution; the question is at what rate). Also, for systematic biology purposes, error rates below 5% can probably be tolerated, though of course this will depend upon the tree. Hence, we examined these experiments with the following specific questions in mind:

• At what sequence length do we get an error rate below 5%?
• At what sequence length (if any) do we recover the true tree reliably?
• How well do the different methods do when restricted to typical length sequences (between 200 and 1200 nucleotides)?

Since we are interested in how DCM boosting affects performance, we will specifically address how DCM-boosted methods differ from their base methods with respect to these three questions.

Summary of experimental results

We report on the results of a set of experiments on the African Eve tree with rates of evolution scaled up as described above (the version in which the largest probability of change on any edge is 0.48). This model tree is a good example of how DCM boosting affects performance when the tree is a difficult one to reconstruct, due to the combination of large numbers of taxa and high divergence. Here are some of the basic observations about the performance of these six methods on this tree.
