ON THE UNIQUENESS OF BALANCED MINIMUM EVOLUTION

AARON KLEINMAN

December 3, 20

Abstract. Minimum evolution is a class of parsimonious distance-based phylogenetic reconstruction methods. One noteworthy example is balanced minimum evolution (BME), which is the theoretical underpinning of the neighbor-joining algorithm. The robustness of a minimum evolution method can be captured in part by a statistic known as its $L_\infty$ radius. BME is known to have $L_\infty$ radius $\frac{1}{2}$, the best possible. We show that BME is in fact the only minimum evolution method with $L_\infty$ radius $\frac{1}{2}$.

0.1. Introduction

The minimum evolution (ME) approach to phylogenetic reconstruction is broadly based on the following idea: given a matrix $\delta$ of pairwise distances between a set of $n$ taxa, find the tree that explains $\delta$ with as little evolution as possible [11, 16]. More explicitly, ME employs the following algorithm:

(1) For each tree topology $T$, find the branch lengths of $T$ assuming $\delta$ comes from $T$;
(2) Use the branch lengths to compute the length $l_T$ of the tree $T$;
(3) Choose the tree $\hat{T} = \arg\min_T l_T$ with minimum length.

There is some ambiguity in how to use negative branch lengths to compute the length of the tree. Kidd and Sgaramella-Zonta [11] proposed summing the absolute values of the edge lengths, while Swofford et al. [20] suggested summing only the positive edge lengths. Throughout, we will follow Rzhetsky and Nei [15] in calculating the length of the tree by summing all the edge lengths, with sign.

The effectiveness of this method then depends crucially on how we select the branch lengths for a given tree $T$. One classic approach, first proposed by Cavalli-Sforza and Edwards [4] and Fitch and Margoliash [8], chooses the edge lengths that minimize the sum of squares
\[ \sum_{i<j} (\delta_{ij} - \delta^T_{ij})^2, \]
where $\delta^T_{ij}$ is the sum of the lengths of the edges on the path between $i$ and $j$. This method of assigning edge lengths is known as ordinary least squares (OLS), and we refer to the corresponding ME method as OLS+ME.
If we know the variances $V_{ij}$ of the $\delta_{ij}$, then the variance-minimizing estimate of the edge lengths is given by the $\delta^T$ that minimizes the weighted least squares (WLS) criterion
\[ \sum_{i<j} \frac{1}{V_{ij}} (\delta_{ij} - \delta^T_{ij})^2. \tag{0.1} \]

Supported by an NSF Graduate Research Fellowship. The author thanks Lior Pachter for suggesting this problem.

WLS assumes the $\delta_{ij}$ are uncorrelated. Generalized least squares (GLS) gets rid of this assumption and seeks to minimize
\[ \sum_{ij,kl} V^{-1}_{ij,kl} (\delta_{ij} - \delta^T_{ij})(\delta_{kl} - \delta^T_{kl}), \]
where $V^{-1}$ is the inverse of the variance-covariance matrix $V$ of the $\delta_{ij}$.

As more data is gathered, we expect the observed pairwise distances $\delta$ to converge to the true distances $\delta^T$. OLS+ME satisfies the important property that it is statistically consistent [15]: that is, when $\delta$ is sufficiently close to $\delta^T$, OLS+ME will recover the correct tree topology $T$. While this property holds for WLS methods other than OLS [5], it does not hold for all WLS (and therefore all GLS) ME methods [9].

In GLS, computing the optimal edge weights is equivalent to minimizing a quadratic form, and the edge weights are then linear in the elements of $\delta$. More precisely, we have
\[ \hat{l}_T = (S_T^t V^{-1} S_T)^{-1} S_T^t V^{-1} \delta, \tag{0.2} \]
where $V$ is the $\binom{n}{2} \times \binom{n}{2}$ variance-covariance matrix and $S_T$ is the $\binom{n}{2} \times |E|$ matrix whose entries are given by
\[ (S_T)_{ij,e} = \begin{cases} 1 & \text{if $e$ is an edge on the path in $T$ between $i$ and $j$,} \\ 0 & \text{otherwise.} \end{cases} \]

We briefly mention that GLS has the following statistical interpretation. Suppose $\delta_{ij} = D_{ij} + \epsilon_{ij}$, where $\delta_{ij}$ is the observed distance, $D_{ij}$ is the true distance and the $\epsilon_{ij}$ are error terms that are normally distributed with mean zero and covariance matrix $V$. Then $\hat{l}_e$ is the linear unbiased estimator for the length of edge $e$ with minimal variance.

Under GLS, the total length of the tree is a linear form in the coefficients of $\delta$ given by $l_T = \mathbf{1}^t \hat{l}_T$, where $\mathbf{1}$ is the vector of ones of length $|E|$. If $\delta$ actually is $T$-additive, then $\delta = S_T E_T$ for some positive vector $E_T$, and by (0.2) a GLS+ME method on $T$ will estimate the length of $\delta$ to be $\mathbf{1}^t E_T$, which is the correct length. This suggests the following definition.

Let $\mathcal{T}$ be the set of phylogenetic $X$-trees. Consider the space of all dissimilarity maps on $X$ as a cone in $\mathbb{R}^{\binom{n}{2}}$, and let $\mathcal{L}$ be the space of linear forms on $\mathbb{R}^{\binom{n}{2}}$.

Definition 0.1.
A minimum evolution (ME) method is a map $\varphi : \mathcal{T} \to \mathcal{L}$ such that if $\delta$ is $T$-additive with length $l$, then
\[ (\varphi(T))(\delta) = l. \tag{0.3} \]
(For notational convenience, in what follows we frequently write $\varphi(T, \delta)$ to mean $(\varphi(T))(\delta)$.) We say $\varphi$ is consistent if we further have
\[ T = \operatorname{argmin}_{T'} \varphi(T', \delta). \tag{0.4} \]

Equation (0.3) is a kind of normalization requirement that allows us to recover the correct length of the tree. Equation (0.4) says that when our dissimilarity map is tree-like, our method recovers the correct tree. Note that a general ME method doesn't attempt to calculate the edge lengths; it cares only about the total length of the tree. Similarly, there is no statistical interpretation for a general ME method.
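As a concrete illustration of (0.2) and the normalization (0.3), the following minimal sketch computes OLS edge lengths (the case where $V$ is the identity) on a single four-taxon tree and checks that a $T$-additive input returns its true length. The tree, its edge numbering, the weights and all function names are assumptions made for this example only; they do not come from the paper.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical four-taxon tree T = 01|23. Edges 0..3 are the pendant edges of
# leaves 0..3, and edge 4 is the single internal edge.
PAIRS = list(combinations(range(4), 2))
PATHS = {
    (0, 1): {0, 1},
    (0, 2): {0, 2, 4},
    (0, 3): {0, 3, 4},
    (1, 2): {1, 2, 4},
    (1, 3): {1, 3, 4},
    (2, 3): {2, 3},
}
NUM_EDGES = 5


def design_matrix():
    """S_T: one row per pair, one column per edge, 1 if the edge is on the path."""
    return [[Fraction(int(e in PATHS[p])) for e in range(NUM_EDGES)] for p in PAIRS]


def solve(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination (a must be invertible)."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [x / scale for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def ols_edge_lengths(delta):
    """Edge lengths (S^t S)^{-1} S^t delta, i.e. (0.2) with V the identity."""
    s = design_matrix()
    d = [Fraction(delta[p]) for p in PAIRS]
    sts = [[sum(row[i] * row[j] for row in s) for j in range(NUM_EDGES)]
           for i in range(NUM_EDGES)]
    std = [sum(row[i] * x for row, x in zip(s, d)) for i in range(NUM_EDGES)]
    return solve(sts, std)


def ols_tree_length(delta):
    """Tree length 1^t l_hat: the sum of the estimated edge lengths, with sign."""
    return sum(ols_edge_lengths(delta))


# A T-additive dissimilarity built from the edge weights (1, 2, 3, 4, 5).
weights = [1, 2, 3, 4, 5]
additive = {p: sum(weights[e] for e in PATHS[p]) for p in PAIRS}
print(ols_edge_lengths(additive))
print(ols_tree_length(additive))
```

Exact rational arithmetic is used so that the recovered edge lengths can be compared with the input weights without rounding error; a numerical linear algebra library would be the more usual choice for larger trees.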

We have shown that all GLS+ME methods are ME methods. Conversely, we say $\varphi$ is a GLS+ME method (respectively, WLS+ME method) if, for each tree $T$, there is a variance-covariance matrix (respectively, diagonal variance-covariance matrix) $V_T$ such that
\[ \varphi(T)(\delta) = \mathbf{1}^t (S_T^t V_T^{-1} S_T)^{-1} S_T^t V_T^{-1} \delta. \]

Balanced Minimum Evolution (BME), first introduced in [14], is a WLS+ME method that corresponds to taking $(V_T)_{ij} = 2^{|P_T(i,j)|}$, where $|P_T(i,j)|$ is the number of edges on the path $P_T(i,j)$ between $i$ and $j$ in $T$. Pauplin showed [14] that in this case the WLS-minimizing edge lengths have computationally simple expressions and sum to give the nice formula
\[ \varphi_{BME}(T) := \sum_{i<j} 2^{1-|P_T(i,j)|} \delta_{ij}. \]
Like OLS+ME, this is consistent [6].

BME also has applications to the Neighbor Joining (NJ) algorithm. First introduced in [17], NJ has historically been very important in phylogenetic reconstruction. It constructs a tree agglomeratively as follows:

(1) Given a distance matrix $\delta : X \times X \to \mathbb{R}$, compute the Q-criterion
\[ Q_\delta(i,j) = (n-2)\delta(i,j) - \sum_{k \neq i} \delta(i,k) - \sum_{k \neq j} \delta(j,k). \]
(2) Select a pair $(a,b)$ of taxa that minimizes $Q_\delta$. If there are more than three taxa, replace this pair by a leaf $ab$ and construct a new dissimilarity map $\delta'$ given by
\[ \delta'(i, ab) = \tfrac{1}{2}\bigl(\delta(i,a) + \delta(i,b) - \delta(a,b)\bigr). \]
(3) Repeat until there are three taxa remaining.

NJ is motivated by the result [17, 19] that if $\delta$ is a $T$-additive tree metric, then a pair $(a,b)$ minimizing $Q_\delta$ is a cherry in $T$. This shows that NJ is consistent. Consistency doesn't guarantee accuracy in practice, since actual data will rarely be tree-additive due to noise, but there are a number of results [1, 12] showing that NJ is maximally robust to small perturbations in the elements of the dissimilarity map. To make this more precise we use the following definition:

Definition 0.2.
A tree reconstruction method has $L_\infty$ radius $\alpha$ if, for each $T$-additive dissimilarity $\delta_T$ with minimal branch length $w_{\min}$, and each dissimilarity $\delta$ with $\|\delta - \delta_T\|_\infty < \alpha\, w_{\min}$, the method returns $T$ when given $\delta$ as input.

Any reconstruction method with positive $L_\infty$ radius is necessarily consistent. Atteson showed [1] that NJ has $L_\infty$ radius $\frac{1}{2}$. There exist a dissimilarity $\delta$ and, for $i \in \{1, 2\}$, distinct trees $T_i$ with minimum branch length $w_{\min}$ and $T_i$-additive dissimilarities $\delta_{T_i}$ such that $\|\delta - \delta_{T_i}\|_\infty = \frac{1}{2} w_{\min}$, so this is the best $L_\infty$ radius possible.

While other reconstruction algorithms might be similarly robust, NJ is special in at least one way. Bryant showed [3] that the Q-criterion is the only criterion that is: linear in the coefficients of $\delta$; consistent (i.e. given tree-like data the criterion will select a cherry at each step); and indifferent to the order of the taxa. Thus, one can consider NJ as the unique algorithm satisfying a certain set of desirable properties.
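The selection and reduction steps of NJ described in this section can be sketched as follows. This is a minimal illustration, assuming the dissimilarity is stored as a symmetric list of lists with zero diagonal; the function names and the example matrix are hypothetical and chosen only for this example.

```python
from itertools import combinations


def q_criterion(delta, i, j):
    """Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k).

    The diagonal of delta is zero, so each row sum equals the sum over k != i.
    """
    n = len(delta)
    return (n - 2) * delta[i][j] - sum(delta[i]) - sum(delta[j])


def select_pair(delta):
    """Return a pair (a, b) with a < b that minimizes the Q-criterion."""
    pairs = combinations(range(len(delta)), 2)
    return min(pairs, key=lambda p: q_criterion(delta, *p))


def reduced_distances(delta, a, b):
    """Distances from every remaining taxon i to the merged leaf ab."""
    return {
        i: (delta[i][a] + delta[i][b] - delta[a][b]) / 2
        for i in range(len(delta))
        if i not in (a, b)
    }


# Additive metric on the quartet 01|23 with pendant edge weights 1, 2, 3, 4
# and internal edge weight 5 (an example chosen for illustration).
D = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

a, b = select_pair(D)
print((a, b), q_criterion(D, a, b))
print(reduced_distances(D, a, b))
```

On this input both cherries, (0, 1) and (2, 3), attain the minimum value of the Q-criterion, which agrees with the statement that a minimizing pair of an additive metric is a cherry.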

Although NJ was originally created as a way to approximate OLS+ME, Gascuel and Steel showed [10] that NJ is a greedy agglomerative implementation of BME. The relationship between NJ and BME is bolstered in [13], where it is shown that BME also has $L_\infty$ radius $\frac{1}{2}$, while OLS+ME has $L_\infty$ radius $\to 0$ as the number of taxa $n$ increases to infinity. These are the only $L_\infty$ radii that have been precisely computed for ME methods.

Given the close ties between NJ and BME, and Bryant's result on the uniqueness of NJ, it is natural to wonder if BME is somehow unique in the class of ME methods. The main result of this paper is the following affirmative result:

Theorem 0.3. BME is the only ME method with $L_\infty$ radius $\frac{1}{2}$.

0.2. Proof

We begin with some standard definitions. Let $X$ be a set of taxa, $|X| = n$. A phylogenetic $X$-tree is an unrooted trivalent tree whose leaves are labelled bijectively with the elements of $X$. A split of $X$ is a partition of $X$ into two nonempty pieces. Each edge $e$ in a tree $T$ gives rise to a split obtained by removing $e$ and looking at the taxa in the two disconnected components of $T$. A clade of $T$ is a subset of $X$ which is a component of a split obtained in this way. If $A$, $B$ are disjoint clades in $X$, let $\sigma_{AB}$ be the dissimilarity map given by
\[ \sigma_{AB}(i,j) = \begin{cases} 1 & \text{if one of $i$, $j$ lies in $A$ and the other lies in $B$,} \\ 0 & \text{otherwise.} \end{cases} \]
If $e$ is an edge in $T$ that gives the $X$-split $A|B$, let $\sigma_e$ denote $\sigma_{AB}$.

A dissimilarity on a set $X$ is a symmetric matrix $\delta$ whose diagonal elements are 0. If $T$ is an $X$-tree, we say $\delta$ is $T$-additive if there is a positive function $w : E(T) \to \mathbb{R}_+$ such that
\[ \delta_{ij} = \sum_{e \in P_T(i,j)} w_e, \]
where $P_T(i,j)$ is the path between $i$ and $j$.

We now begin the proof. Suppose $\varphi$ is a consistent ME method and let $l_T = \varphi(T)$. By a classic result [18] the map $\delta$ is $T$-additive if and only if it has the form $\delta = \sum_{e \in E(T)} w_e \sigma_e$ for some positive $w_e$. For such $\delta$,
\[ \sum_{e} w_e = l_T(\delta) = \sum_{e} w_e\, l_T(\sigma_e). \]
This must hold for arbitrary $w_e$, so $l_T(\sigma_e) = 1$ for all $e \in E(T)$. Now suppose $S$ is an $X$-split that is not in $T$, and let $T'$ be a tree that contains this split.
If $l_{T'} = \varphi(T')$ and $\delta' = \sigma_S$, then $\delta'$ is $T'$-additive and not $T$-additive, so by consistency $l_T(\delta') > l_{T'}(\delta')$, which implies $l_T(\sigma_S) > 1$. Following [13], let $U(T)$ denote the set of linear forms $l$ such that $l(\sigma_S) \geq 1$ for each $X$-split $S$, with equality if and only if $S$ lies in $T$. Then we have shown:

Lemma 0.4. $\varphi$ is consistent if and only if $\varphi(T) \in U(T)$ for all $T \in \mathcal{T}$.
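The condition defining $U(T)$ can be checked by hand on a small case. The sketch below evaluates the BME linear form, with coefficients $2^{1-|P_T(i,j)|}$, on every split metric of a four-taxon set. The quartet $ab|cd$, the path-length table and the function names are assumptions introduced for this example.

```python
from fractions import Fraction

# Number of edges on each leaf-to-leaf path in the hypothetical quartet T = ab|cd.
PATH_EDGES = {
    ("a", "b"): 2,
    ("c", "d"): 2,
    ("a", "c"): 3,
    ("a", "d"): 3,
    ("b", "c"): 3,
    ("b", "d"): 3,
}


def split_metric(side):
    """sigma_S for the split S = side | complement: 1 on separated pairs, else 0."""
    side = set(side)
    return {p: int((p[0] in side) != (p[1] in side)) for p in PATH_EDGES}


def bme_form(delta):
    """Evaluate the BME linear form sum_{i<j} 2^(1 - |P_T(i, j)|) * delta_ij."""
    return sum(Fraction(2) ** (1 - k) * delta[p] for p, k in PATH_EDGES.items())


# One side of each of the 7 splits of {a, b, c, d}.
SPLITS_IN_T = ["a", "b", "c", "d", "ab"]
SPLITS_NOT_IN_T = ["ac", "ad"]

for side in SPLITS_IN_T + SPLITS_NOT_IN_T:
    print(side, bme_form(split_metric(side)))
```

The five splits of $T$ evaluate to exactly 1 and the two remaining splits evaluate to $\frac{3}{2}$, so on this tree the BME form lies in $U(T)$, as Lemma 0.4 requires of a consistent method.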

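[Figure 0.1. Three trees $T$, $T'$, $T''$ that are nearest-neighbor interchanges of each other.]

We must show that if $\varphi$ has $L_\infty$ radius $\frac{1}{2}$ then it is BME. First, we assume only that $\varphi$ is consistent. Fix $T \in \mathcal{T}$ and let $l = \varphi(T)$. If $e_1, \dots, e_k \in E(T)$ form a path with ends determining disjoint clades $A$ and $B$, then $\sigma_{e_1 \dots e_k}$ represents $\sigma_{AB}$.

Lemma 0.5. Let $e$ be an internal edge of $T$ and let $A$, $B$, $C$, $D$ be the disjoint clades obtained by removing the four edges adjacent to $e$, with $AB|CD$ the split corresponding to $e$, as in Figure 0.1. Then $l(\sigma_{AC}) = l(\sigma_{BD})$ and $l(\sigma_{AD}) = l(\sigma_{BC})$.

Proof. Let $v$ be an internal vertex with edges $e_1, e_2, e_3$. We have
\[ 1 = l(\sigma_{e_i}) = l(\sigma_{e_i e_j}) + l(\sigma_{e_i e_k}) \quad \text{for } \{i, j, k\} = \{1, 2, 3\}. \]
This is a system of three equations in three unknowns, and solving shows $l(\sigma_{e_i e_j}) = \frac{1}{2}$. Applying this to the path $e_A e$, where $e_A$ is the edge adjacent to $e$ that cuts off the clade $A$, gives
\[ \tfrac{1}{2} = l(\sigma_{e_A e}) = l(\sigma_{AC}) + l(\sigma_{AD}). \]
Similarly,
\[ \tfrac{1}{2} = l(\sigma_{e_B e}) = l(\sigma_{BC}) + l(\sigma_{BD}), \]
\[ \tfrac{1}{2} = l(\sigma_{e_C e}) = l(\sigma_{AC}) + l(\sigma_{BC}), \]
\[ \tfrac{1}{2} = l(\sigma_{e_D e}) = l(\sigma_{AD}) + l(\sigma_{BD}). \]
Simple manipulation then proves the lemma.

Two trees $T$, $T'$ are separated by a nearest neighbor interchange (NNI) if one can be obtained from the other by transposing two subtrees that are precisely three edges apart. Pick an interior edge $e \in E(T)$ and label the clades given by the edges adjacent to $e$ in this way, so $T$ has the split $AB|CD$. Let $T'$ be the tree obtained by a NNI so that $T'$ has the split $AC|BD$, as in Figure 0.1, and take $l' = \varphi(T')$. Write $c^T_{ij}$ and $c^{T'}_{ij}$ for the coefficients of $l$ and $l'$, so that $l(\delta) = \sum_{i<j} c^T_{ij} \delta_{ij}$ and $l'(\delta) = \sum_{i<j} c^{T'}_{ij} \delta_{ij}$.

Now suppose $\delta$ is a $T$-additive distance matrix $\delta = \sum_{e'} w_{e'} \sigma_{e'}$. Then $l(\delta) = \sum_{e'} w_{e'}$. Note that for each $e' \in E(T)$ with $e' \neq e$, the split induced by $e'$ in $T$ is also a split in $T'$. So $l'(\sigma_{e'}) = 1$ for all $e' \neq e$, and
\[ \begin{aligned} l'(\delta) - l(\delta) &= \Bigl( w_e\, l'(\sigma_e) + \sum_{e' \neq e} w_{e'}\, l'(\sigma_{e'}) \Bigr) - \sum_{e'} w_{e'}\, l(\sigma_{e'}) \\ &= w_e \bigl( l'(\sigma_e) - 1 \bigr) \\ &= w_e \bigl( l'(\sigma_{AC}) + l'(\sigma_{AD}) + l'(\sigma_{BC}) + l'(\sigma_{BD}) - 1 \bigr) \\ &= 2 w_e\, l'(\sigma_{AD}), \end{aligned} \]
by Lemma 0.5 applied to $T'$, together with $l'(\sigma_{AC}) = l'(\sigma_{BD}) = \frac{1}{2}$.

Let $\tilde{\delta}$ be a dissimilarity map satisfying $\|\tilde{\delta} - \delta\|_\infty < \alpha\, w_{\min}$. Then
\[ \begin{aligned} l'(\tilde{\delta}) - l(\tilde{\delta}) &= \bigl( l'(\tilde{\delta}) - l'(\delta) \bigr) - \bigl( l(\tilde{\delta}) - l(\delta) \bigr) + \bigl( l'(\delta) - l(\delta) \bigr) \\ &= \sum_{i<j} c^{T'}_{ij} (\tilde{\delta} - \delta)_{ij} - \sum_{i<j} c^{T}_{ij} (\tilde{\delta} - \delta)_{ij} + 2 w_e\, l'(\sigma_{AD}) \\ &\geq - \sum_{i<j} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr|\, \alpha\, w_{\min} + 2 w_e\, l'(\sigma_{AD}). \end{aligned} \]
Now assume $\varphi$ has $L_\infty$ radius $\alpha$. Because equality can be approached by taking $(\tilde{\delta} - \delta)_{ij}$ arbitrarily close to $-\alpha\, w_{\min} \operatorname{sgn}(c^{T'}_{ij} - c^{T}_{ij})$, the inequality $l'(\tilde{\delta}) - l(\tilde{\delta}) \geq 0$ gives
\[ 2 w_e\, l'(\sigma_{AD}) \geq \alpha\, w_{\min} \sum_{i<j} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr|. \tag{0.5} \]
The sum in the right-hand side of (0.5) is
\[ \sum_{i<j} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| \geq \sum_{\substack{i \in A \\ j \in B}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| + \sum_{\substack{i \in A \\ j \in C}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| + \sum_{\substack{i \in B \\ j \in D}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| + \sum_{\substack{i \in C \\ j \in D}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr|. \tag{0.6} \]
Now
\[ \sum_{\substack{i \in A \\ j \in B}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| \geq \Bigl| \sum_{\substack{i \in A \\ j \in B}} \bigl( c^{T'}_{ij} - c^{T}_{ij} \bigr) \Bigr| = \bigl| l'(\sigma_{AB}) - l(\sigma_{AB}) \bigr| = \bigl| l'(\sigma_{AB}) - \tfrac{1}{2} \bigr| = l'(\sigma_{AD}). \]
Similar calculations show
\[ \sum_{\substack{i \in A \\ j \in C}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| \geq l(\sigma_{AD}), \qquad \sum_{\substack{i \in B \\ j \in D}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| \geq l(\sigma_{AD}), \qquad \sum_{\substack{i \in C \\ j \in D}} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| \geq l'(\sigma_{AD}). \]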

Substituting into (0.6) gives
\[ \sum_{i<j} \bigl| c^{T'}_{ij} - c^{T}_{ij} \bigr| \geq 2\, l(\sigma_{AD}) + 2\, l'(\sigma_{AD}), \]
so (0.5) becomes
\[ 2 w_e\, l'(\sigma_{AD}) \geq 2 \alpha\, w_{\min} \bigl( l(\sigma_{AD}) + l'(\sigma_{AD}) \bigr). \]
This must hold when $w_e = w_{\min}$, so we have $(1 - \alpha)\, l'(\sigma_{AD}) \geq \alpha\, l(\sigma_{AD})$. An identical argument gives $(1 - \alpha)\, l(\sigma_{AD}) \geq \alpha\, l'(\sigma_{AD})$. When $\alpha = \frac{1}{2}$ these two inequalities combine to give $l'(\sigma_{AD}) = l(\sigma_{AD})$. This implies equality holds in each of the inequalities above, so in particular
\[ c^{T'}_{ij} = c^{T}_{ij} \quad \text{for all } i, j \text{ such that } (\sigma_{AD|BC})_{ij} = 0. \]

Let $T''$ be the tree with split $AD|BC$ obtained by a single NNI from $T$ and let $l'' = \varphi(T'')$. Since $T''$ can also be obtained from $T'$ by a NNI, arguments identical to the one above show $l''(\sigma_{AB}) = l'(\sigma_{AB})$ and $l''(\sigma_{AC}) = l(\sigma_{AC})$. Combining these three equations gives $l(\sigma_{AC}) = l(\sigma_{AD})$ and, since $l(\sigma_{AC}) + l(\sigma_{AD}) = \frac{1}{2}$, we have $l(\sigma_{AC}) = l(\sigma_{AD}) = \frac{1}{4}$.

We have shown that for $k = 1, 2, 3$, if $e_1, \dots, e_k \in E(T)$ form a path with ends determining disjoint clades $A$ and $B$, then $l(\sigma_{e_1 \dots e_k}) = 2^{1-k}$. We will prove this for all $k$ by induction. So let $T$, $T'$, $T''$ be the trees in Figure 0.2; note each can be obtained by an NNI from the other two. By our inductive hypothesis l(σ ) = l (σ E ) = 2 k. Since $c^{T'}_{ij} = c^{T}_{ij}$ for all i, j, l(σ ) = l (σ ) and l(σ ) = l (σ E ). Similar reasoning gives l (σ E ) = l (σ E ), l (σ E ) = l(σ ). Combining these equalities shows l(σ ) = l(σ ), so l(σ ) = 2 k and the induction is proved.

Finally, for any $i, j \in X$ let $e_1, \dots, e_k$ be the unique path between $i$ and $j$. Then we have shown $c^{T}_{ij} = l(\sigma_{e_1 \dots e_k}) = 2^{1-k}$, and the theorem is proved.

0.3. Concluding remarks

Finding the tree $T$ that minimizes $\varphi_{BME}(T, \delta)$ is in general NP-hard to approximate [7], so ME methods are seldom used directly in practice. But BME has served as a theoretical guide to several distance-based algorithms [2, 10]. While the relationship between the robustness of an ME method and the robustness of an algorithm may be complex, Theorem 0.3 suggests that if such an algorithm is going to be based on a ME method, it's best to choose BME.
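The closed form $c^T_{ij} = 2^{1-k}$ obtained at the end of the proof, where $k$ is the number of edges between $i$ and $j$, is straightforward to evaluate on any tree. The following sketch is one way to do so, assuming the tree is given as a weighted adjacency dictionary; the five-taxon tree, its weights and the function names are made up for this illustration. It checks the normalization (0.3) on the tree's own additive metric.

```python
from collections import deque
from fractions import Fraction
from itertools import combinations

# Hypothetical five-taxon tree: leaves a..e, internal vertices u, v, w.
# Each entry maps a vertex to {neighbor: edge weight}; the weights sum to 28.
TREE = {
    "a": {"u": 1}, "b": {"u": 2}, "c": {"v": 4}, "d": {"w": 6}, "e": {"w": 7},
    "u": {"a": 1, "b": 2, "v": 3},
    "v": {"u": 3, "c": 4, "w": 5},
    "w": {"v": 5, "d": 6, "e": 7},
}


def leaves(tree):
    """Leaves are the vertices of degree one."""
    return sorted(v for v, nbrs in tree.items() if len(nbrs) == 1)


def paths_from(tree, start):
    """Map each vertex to (number of edges, weighted distance) from start."""
    seen = {start: (0, 0)}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        k, dist = seen[node]
        for nbr, w in tree[node].items():
            if nbr not in seen:
                seen[nbr] = (k + 1, dist + w)
                queue.append(nbr)
    return seen


def tree_metric(tree):
    """The additive dissimilarity induced by the edge weights of the tree."""
    return {(i, j): paths_from(tree, i)[j][1] for i, j in combinations(leaves(tree), 2)}


def bme_form(tree, delta):
    """Evaluate sum_{i<j} 2^(1 - k_ij) * delta_ij with k_ij the path edge count."""
    total = Fraction(0)
    for i, j in combinations(leaves(tree), 2):
        k = paths_from(tree, i)[j][0]
        total += Fraction(2) ** (1 - k) * delta[(i, j)]
    return total


print(bme_form(TREE, tree_metric(TREE)))
```

Only the topology of the tree enters the coefficients; the edge weights are used solely to build the test metric, so the same function can be applied to any other dissimilarity on the same leaf set.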
[Figure 0.2. Three trees $T$, $T'$, $T''$ used in the induction. The surviving labels are the clades $F_1, F_2, \dots, F_{k-2}$ and $E$.]

In Definition 0.1, we required that a ME method satisfy $\varphi(T, \delta) = l$ when $\delta$ is a $T$-additive dissimilarity with length $l$. Although this normalization requirement holds for all GLS+ME methods, it seems unnatural in this broader context. We now drop this requirement, so in what follows a ME method is just a map $\varphi : \mathcal{T} \to \mathcal{L}$.

Definition 0.6. Let $f$ be a function that assigns a real number to each $X$-split. For each $X$-tree $T$, let $U_f(T)$ be the set of linear forms $l$ such that $l(\sigma_S) \geq f(S)$ for each $X$-split $S$, with equality if and only if $S$ lies in $T$. We say a ME method $\varphi$ is $f$-consistent if $\varphi(T) \in U_f(T)$ for all $T \in \mathcal{T}$.

Our usual definition of consistency corresponds to the case $f \equiv 1$. An easy generalization of Lemma 0.4 shows

Lemma 0.7. If $\varphi$ is statistically consistent, it is $f$-consistent for a unique function $f$.

Following the proof of Theorem 0.3, we obtain

Theorem 0.8. For each $f$, there is at most one $f$-consistent ME method with $L_\infty$ radius $\frac{1}{2}$.

The proof is constructive and the resulting ME methods are combinatorially interesting, but we do not have space to explore them here. We will note that there are some functions $f$ for which no $f$-consistent ME methods exist. For example, when $n = 4$, $X = \{A, B, C, D\}$, we have
\[ \sigma_{A|BCD} + \sigma_{B|ACD} + \sigma_{C|ABD} + \sigma_{D|ABC} = \sigma_{AB|CD} + \sigma_{AC|BD} + \sigma_{AD|BC}. \]
Every $X$-tree contains the four splits on the left-hand side and only one of the splits on the right-hand side. So if there exists an $f$-consistent ME method $\varphi$, applying $\varphi(T)$ to both sides for any $T$ gives
\[ f(A|BCD) + f(B|ACD) + f(C|ABD) + f(D|ABC) > f(AB|CD) + f(AC|BD) + f(AD|BC). \]
If this inequality is not satisfied then $U_f(T)$ is empty.

Let $P$ be the polytope in $\mathbb{R}^{\binom{n}{2}}$ given by
\[ P = \{ x \mid x \cdot \sigma_S \geq f(S) \ \ \forall \text{ splits } S \}. \]

These generalized ME methods can be thought of geometrically as selecting points from certain faces of $P$. Their combinatorics and geometry should be interesting objects of study.

References

1. Kevin Atteson, The performance of neighbor-joining methods of phylogenetic reconstruction, Algorithmica 25 (1999), 251–278, 10.1007/PL00008277.
2. Magnus Bordewich and Radu Mihaescu, Accuracy guarantees for phylogeny reconstruction algorithms based on balanced minimum evolution, Algorithms in Bioinformatics (Vincent Moulton and Mona Singh, eds.), Lecture Notes in Computer Science, vol. 6293, Springer Berlin / Heidelberg, pp. 250–261.
3. David Bryant, On the uniqueness of the selection criterion in neighbor-joining, Journal of Classification 22 (2005), 3–15, 10.1007/s00357-005-0003-x.
4. L.L. Cavalli-Sforza and A.W.F. Edwards, Phylogenetic analysis: Models and estimation procedures, American Journal of Human Genetics 19 (1967), 223–257.
5. François Denis and Olivier Gascuel, On the consistency of the minimum evolution principle of phylogenetic inference, Discrete Applied Mathematics 127 (2003), no. 1, 63–77.
6. Richard Desper and Olivier Gascuel, Theoretical foundation of the balanced minimum evolution method of phylogenetic inference and its relationship to weighted least-squares tree fitting, Molecular Biology and Evolution 21 (2004), 587–598.
7. Samuel Fiorini and Gwenaël Joret, Approximating the balanced minimum evolution problem, CoRR abs/1104.1080 (2011).
8. Walter M. Fitch and Emanuel Margoliash, Construction of phylogenetic trees, Science 155 (1967), no. 3760, 279–284.
9. Olivier Gascuel, David Bryant, and François Denis, Strengths and limitations of the minimum evolution principle, Systematic Biology 50 (2001), no. 5, pp. 621–627 (English).
10. Olivier Gascuel and Mike Steel, Neighbor-Joining Revealed, Molecular Biology and Evolution 23 (2006), no. 11, 1997–2000.
11. K. K. Kidd and L. A. Sgaramella-Zonta, Phylogenetic analysis: concepts and methods, Am J Hum Genet 23 (1971), no. 3, 235–252.
12. Radu Mihaescu, Dan Levy, and Lior Pachter, Why neighbor-joining works, Algorithmica 54 (2009), 1–24.
13. Fabio Pardi, Sylvain Guillemot, and Olivier Gascuel, Robustness of phylogenetic inference based on minimal evolution, Bulletin of Mathematical Biology 72 (2010), 1820–1839.
14. Yves Pauplin, Direct calculation of a tree length using a distance matrix, Journal of Molecular Evolution 51 (2000), no. 1, 41–47.
15. Andrey Rzhetsky and Masatoshi Nei, Theoretical foundation of the minimal-evolution method of phylogenetic inference, Molecular Biology and Evolution 10 (1993), no. 5, 1073–1095.
16. Andrey Rzhetsky and Masatoshi Nei, A simple method for estimating and testing minimum-evolution trees, Molecular Biology and Evolution 9 (1992), no. 5, 945.
17. Naruya Saitou and Masatoshi Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution 4 (1987), 406–425.
18. Charles Semple and Mike Steel, Cyclic permutations and evolutionary trees, Advances in Applied Mathematics 32 (2004), 669–680.
19. James A. Studier and Karl J. Keppler, A note on the neighbor-joining method of Saitou and Nei, Molecular Biology and Evolution 5 (1988), 729–731.
20. D.L. Swofford, G.J. Olsen, P.J. Waddell, and D.M. Hillis, Phylogenetic inference, Molecular Systematics (David M. Hillis, Craig Moritz, and Barbara K. Mable, eds.), Sinauer Associates, 1996, pp. 407–514.

Department of Mathematics, UC Berkeley
E-mail address: kleinman@math.berkeley.edu