THE TRIPLES DISTANCE FOR ROOTED BIFURCATING PHYLOGENETIC TREES

Syst. Biol. 45(3):323-334, 1996

DOUGLAS E. CRITCHLOW, DENNIS K. PEARL, AND CHUNLIN QIAN
Department of Statistics, Ohio State University, Columbus, Ohio 43210, USA; E-mail: dkp@stat.mps.ohio-state.edu (D.K.P.)

Abstract. We investigated the triples distance as a measure of the distance between two rooted bifurcating phylogenetic trees. The triples distance counts the number of subtrees of three taxa that are different in the two trees. Exact expressions are given for the mean and variance of the sampling distribution of this distance measure. Also, a normal approximation is proved under the class of label-invariant models on the distribution of trees. The theory is applied to the usage of the triples distance as a statistic for testing the null hypothesis that the similarities in two trees can be explained by independent random structures. In an example, two phylogenies that describe the same seven species of chlorococcalean zoosporic green algae are compared: one phylogeny based on morphological characteristics and one based on ribosomal RNA gene sequence data. [Tree comparison metrics; random trees; label-invariant models; hypothesis test.]

Developing interpretable measures of the distance between trees and of their sampling distributions under various probability models is important to the study of phylogenetic inference. Distance measures are a valuable tool for comparing phylogenetic trees created from two or more sources of data (e.g., Penny et al., 1982; Bledsoe and Raikow, 1990; Penny et al., 1991; Swofford, 1991; Estabrook, 1992), for reporting the results of a bootstrap analysis or of a comparison of phylogeny algorithms (e.g., Kuhner and Felsenstein, 1994), for making confidence statements about a proposed phylogeny, and for examining subtrees of particular taxa.

As an example of the first use, consider the two phylogenies presented in Figure 1 for seven species of chlorococcalean zoosporic green algae. Figure 1a shows a rooted bifurcating tree based on an assessment of certain morphological characteristics (primarily the details of the flagellar apparatus of motile cells), and Figure 1b shows the parsimony tree based on ribosomal RNA gene (rDNA) sequence data (Wilcox et al., 1992). Can we quantify the difference between the trees? Can the similarities in the two trees be explained by random chance? In this paper, we propose using the number of subtrees of three taxa that are different in the two trees as a measure of the distance between them. To answer the second question, we find the mean and variance of this statistic, along with its asymptotic distribution, under the model that the two trees have completely independent structures.

Several metrics for comparing phylogenies of n taxa have previously been suggested, and the most appropriate one to use depends on the underlying question of a particular investigation (Penny and Hendy, 1985). The branch-swapping metric proposed by Waterman and Smith (1978) computes the number of nearest-neighbor interchanges required to convert one tree into another. It has an appealing interpretation but cannot be calculated in polynomial time (i.e., the time required to compute the metric grows faster than any power of n). The symmetric difference metric discussed by Robinson and Foulds (1981) counts the number of partitions of the taxa, created by deleting internal edges, that differ in the two trees. The time required to calculate this metric is proportional to n for large trees (Day, 1985), its small sample distribution has been tabulated for specific probability models on the set of trees (Hendy et al., 1984), and a large sample Poisson approximation has been proved for the general class of label-invariant probability models (Steel, 1988; Steel and Penny, 1993).
The quartet metric for unrooted trees proposed by Estabrook et al. (1985) and studied quantitatively by Day (1986) counts the number of unrooted subtrees of four taxa that are different in the two trees. This metric can be calculated in $O(n^3)$ time for a tree of n taxa, and its approximate variance was given for bifurcating trees by Steel and Penny (1993) (exactly under the model that all such trees are equally likely).

The triples distance for rooted trees was suggested by Dobson (1975) as a method of comparing the shapes of trees, although she did not study aspects of its calculation or distribution. Each of the above metrics measures dissimilarity only with respect to the labeled topology of a phylogenetic tree, the theme of the present paper. Other metrics also consider differences in the branch lengths joining the taxa (e.g., Lapointe and Legendre, 1990, 1992). The triples distance for rooted bifurcating trees is a close cousin of the quartet measure for unrooted trees. Consequently, the distributional results we obtained may also give new findings for the quartet distance.

Regardless of the analytical distributional results available for a particular measure, it is always possible to carry out significance tests using simulation techniques (Shao and Sokal, 1986). For example, the triples distance was used by Page (1988) to test biogeographical hypotheses using simulation methods.

In this paper, we provide a formal definition of the triples distance and find an exact expression for the mean and variance of this statistic when the two trees are independent and we assume a label-invariant probability model on the set of all rooted bifurcating trees. The triples distance has a limiting normal distribution under this class of probability models. The triples distance can be calculated in $O(n^2)$ time, and we applied the probabilistic theory to a hypothesis test of the independence of the trees in Figure 1. The proof of the normal approximation theorem and tables of the null distribution of the triples distance for n < 50 under two specific models are also provided.

FIGURE 1. Two phylogenies for chlorococcalean zoosporic green algae. 1 = Glycine max; 2 = Characium perforatum; 3 = Friedmannia israelensis; 4 = Parietochloris pseudoalveolaris; 5 = Dunaliella parva; 6 = Characium hindakii; 7 = Chlamydomonas. (a) Tree based on morphological characteristics. (b) Tree based on 18S rDNA sequence data.

THE PROBABILITY DISTRIBUTION OF THE TRIPLES DISTANCE

Basic Description and Notation

Consider two labeled rooted bifurcating trees, each having the same n taxa (as in Fig. 1). We follow the usage of Steel and Penny (1993) in calling a labeled topology a tree (ignoring branch lengths) and generally further restrict our attention to the rooted bifurcating case. The triples distance $S_n$ between two such trees is defined as follows. For each triple {i, j, k} of distinct taxa in one of the original trees, consider the subtree that relates these three taxa alone. There are just three possibilities for this subtree, depending on which of i, j, and k is the most distant leaf relative to the other two. Let the indicator function

$$I_{ijk} = \begin{cases} 1 & \text{if taxa } i, j, k \text{ have different subtrees in the two trees} \\ 0 & \text{if taxa } i, j, k \text{ have the same subtree in the two trees} \end{cases}$$

and define the triples distance as

$$S_n = \sum_{\{i,j,k\}} I_{ijk},$$

where the summation is over all the possible unordered triples {i, j, k} of distinct taxa. For example, to compute the triples distance between the two trees in Figure 1, there are $\binom{7}{3} = 35$ subtrees of size three to be examined. The subtree made up of the triple {Glycine max, Characium perforatum, Friedmannia israelensis} = {1, 2, 3} is congruent in both trees because G. max is the most distant leaf among the three in each case. However, the subtree made up of {C. perforatum, F. israelensis, Parietochloris pseudoalveolaris} = {2, 3, 4} is incongruent. The overall triples distance equals 15 for the two trees because the 15 triples {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 3, 7}, {2, 4, 6}, {2, 5, 6}, {2, 5, 7}, {3, 4, 5}, {3, 4, 6}, {3, 4, 7}, {3, 5, 7}, {3, 6, 7}, {4, 5, 6}, {4, 5, 7}, and {5, 6, 7} are incongruent in the two trees. A fast general algorithm for calculating $S_n$ is provided in the section on rapid calculation.

The probability distribution of the triples distance between two trees depends on the underlying distribution of the trees themselves. We investigated probabilistic properties of the triples distance under the assumption that both trees are drawn independently from the same underlying probability distribution. These probabilistic properties of $S_n$ are of interest in their own right and are also useful for developing a statistical test of the hypothesis of independence. Initially, the underlying probability distribution on trees was taken to be the uniform model, and then the results were extended to general label-invariant distributions.

Distribution of $S_n$ under the Uniform Model

For n taxa, there are $(2n-3)!! = (2n-3)(2n-5)\cdots 3 \cdot 1$ possible labeled rooted bifurcating trees. A simple probability distribution of interest is the uniform model (e.g., Shao and Rohlf, 1983), under which each of these possible trees is assigned equal probability $[(2n-3)!!]^{-1}$.
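
As a concrete check of the definition of $S_n$ given above, the following minimal sketch counts differing triples by brute force. The nested-tuple encoding, the function names, and the two tree shapes (an assumed reading of Figure 1, with taxa numbered as in its caption) are illustrative assumptions and not part of the original paper.

```python
from itertools import combinations

# Hypothetical encoding: an internal node is a 2-tuple, a leaf is an int.
# Both shapes are an ASSUMED reading of Figure 1 (taxa numbered 1..7).
TREE_A = (1, (3, ((5, 7), (4, (2, 6)))))   # assumed morphological tree
TREE_B = (1, (7, ((5, 6), (4, (2, 3)))))   # assumed rDNA tree


def leaves(tree):
    """Return the set of leaf labels below a node."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return {tree}


def most_distant(tree, triple):
    """Return the member of `triple` that is the most distant leaf in `tree`."""
    triple = set(triple)
    while True:
        left, right = tree
        in_left = triple & leaves(left)
        if len(in_left) == 3:        # all three taxa sit in the left subtree
            tree = left
        elif len(in_left) == 0:      # all three taxa sit in the right subtree
            tree = right
        elif len(in_left) == 1:      # the lone taxon on the left is the outlier
            return next(iter(in_left))
        else:                        # the lone taxon on the right is the outlier
            return next(iter(triple - in_left))


def triples_distance(tree_a, tree_b):
    """Count the triples whose three-taxon subtrees differ between the two trees."""
    taxa = sorted(leaves(tree_a))
    differing = [t for t in combinations(taxa, 3)
                 if most_distant(tree_a, t) != most_distant(tree_b, t)]
    return len(differing), differing


distance, differing = triples_distance(TREE_A, TREE_B)
```

Under these assumed shapes, `distance` equals 15, the value reported for Figure 1, and `differing` contains {2, 3, 4} but not {1, 2, 3}. The loop visits all $\binom{n}{3}$ triples, so this is the direct $O(n^3)$ approach and not the faster algorithm.
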
If two trees are drawn independently from this model, it is straightforward to check that for each triple {i, j, k}, $I_{ijk}$ is a Bernoulli random variable with expectation 2/3 and that $I_{ijk}$ and $I_{i'j'k'}$ are independent whenever $\{i, j, k\} \cap \{i', j', k'\} = \emptyset$. Hence,

$$E(S_n) = \sum_{\{i,j,k\}} E(I_{ijk}) = \frac{2}{3}\binom{n}{3} \qquad (1)$$

and

$$\mathrm{Var}(S_n) = \sum_{ijk} \mathrm{Var}(I_{ijk}) + \sum_{ijkk'} \mathrm{Cov}(I_{ijk}, I_{ijk'}) + \sum_{ijkj'k'} \mathrm{Cov}(I_{ijk}, I_{ij'k'}) = \frac{2}{9}\binom{n}{3} + 12\binom{n}{4}\mathrm{Cov}(I_{ijk}, I_{ijk'}) + 30\binom{n}{5}\mathrm{Cov}(I_{ijk}, I_{ij'k'}), \qquad (2)$$

where all indices are distinct. The first term uses $\mathrm{Var}(I_{ijk}) = (2/3)(1/3) = 2/9$; the coefficients 12 and 30 count the ordered pairs of triples that share exactly two taxa and exactly one taxon, respectively.

To complete the variance calculation, note that the covariances in Equation 2 depend only on the joint probability distribution of the two subtrees containing the five taxa i, j, k, j', and k'. Independence of the two original trees implies independence of these two subtrees, and under the uniform model, each of the $(2 \cdot 5 - 3)!! = 105$ possible subtree topologies is equally likely. Thus, there are $(105)^2 = 11{,}025$ equally likely possibilities for the two subtrees. A direct computer enumeration of all these possibilities and the corresponding values of $I_{ijk}$, $I_{ijk'}$, and $I_{ij'k'}$ gives $\mathrm{Cov}(I_{ijk}, I_{ijk'}) = 8/225$ and $\mathrm{Cov}(I_{ijk}, I_{ij'k'}) = (8/105)^2$. Substituting back into Equation 2 and simplifying yields

$$\mathrm{Var}(S_n) = \frac{128}{735}\binom{n}{5} + \frac{32}{75}\binom{n}{4} + \frac{2}{9}\binom{n}{3}$$

under the uniform model.

An additional result is that for large n, the triples distance is approximately normally distributed under the uniform model. This fact, combined with the preceding expectation and variance calculations, gives a useful approximation for big trees and allows for the simple implementation of a hypothesis test of independence.

FIGURE 2. The two topologies with n = 4 taxa. There are 12 possible labeled trees of type a and three of type b.

Distribution of $S_n$ under a General Label-Invariant Model

A probability distribution on trees is said to be label invariant if the probability of a tree remains constant under an arbitrary permutation of the taxa labels (Steel and Penny, 1993). For example, in the case of n = 4 taxa, label invariance implies that the probability is a fixed constant for each of the 12 possible labeled trees of the type in Figure 2a and similarly for the three possible trees of the type in Figure 2b.

Most of the derivations under the uniform model rely exclusively on the fact that the uniform distribution on trees is itself label invariant. Only the calculations of $\mathrm{Cov}(I_{ijk}, I_{ijk'})$ and $\mathrm{Cov}(I_{ijk}, I_{ij'k'})$ use any additional features of the uniform model. Thus, Equations 1 and 2 remain true under an arbitrary label-invariant model. To complete the variance calculation, $\mathrm{Cov}(I_{ijk}, I_{ijk'})$ and $\mathrm{Cov}(I_{ijk}, I_{ij'k'})$ can be found, as in the uniform case, by a direct computer enumeration of all 11,025 possibilities. (However, these possibilities are no longer all equally likely, so that their probabilities must also be computed under the label-invariant model of interest.) The limiting normality result also carries over.

Theorem 1. Under an arbitrary label-invariant distribution on trees and the assumption that the two trees are independent, $S_n$ is approximately normally distributed for large n (with mean and variance given by Eqs. 1 and 2). More precisely, $[S_n - E(S_n)]/[\mathrm{Var}(S_n)]^{1/2}$ converges in distribution to the standard normal distribution as $n \to \infty$.

Proof. See Appendix 1 for the proof, which amounts to showing that all of the moments of the standardized triples distance converge to the corresponding moments of the standard normal distribution.
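
The direct enumeration behind the uniform-model covariances can be sketched as follows. This is a minimal illustration, not the authors' program: the nested-tuple tree encoding, the leaf-insertion generator, and all function names are assumptions introduced here, and taxa are labeled 0..n-1 for convenience.

```python
from fractions import Fraction
from itertools import product
from math import comb


def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one edge of `tree` (root edge included)."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for t in insert_leaf(left, leaf):
            yield (t, right)
        for t in insert_leaf(right, leaf):
            yield (left, t)


def all_trees(n):
    """All (2n-3)!! labeled rooted bifurcating trees on taxa 0..n-1, as nested tuples."""
    trees = [(0, 1)]
    for leaf in range(2, n):
        trees = [t for tree in trees for t in insert_leaf(tree, leaf)]
    return trees


def leaves(tree):
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return {tree}


def most_distant(tree, triple):
    """Return the member of `triple` that is the most distant leaf in `tree`."""
    triple = set(triple)
    while True:
        left, right = tree
        in_left = triple & leaves(left)
        if len(in_left) == 3:
            tree = left
        elif len(in_left) == 0:
            tree = right
        elif len(in_left) == 1:
            return next(iter(in_left))
        else:
            return next(iter(triple - in_left))


def covariance(n, t1, t2):
    """Cov(I_t1, I_t2) for two independent uniform trees on n taxa, by full enumeration."""
    tops = [(most_distant(t, t1), most_distant(t, t2)) for t in all_trees(n)]
    both = sum((a1 != b1) and (a2 != b2)          # I_t1 * I_t2 for this pair of trees
               for (a1, a2), (b1, b2) in product(tops, repeat=2))
    return Fraction(both, len(tops) ** 2) - Fraction(2, 3) ** 2


COV4 = covariance(4, (0, 1, 2), (0, 1, 3))   # triples sharing two taxa
COV5 = covariance(5, (0, 1, 2), (0, 3, 4))   # triples sharing one taxon


def var_uniform(n):
    """Equation 2 evaluated with the enumerated covariances."""
    return Fraction(2, 9) * comb(n, 3) + 12 * comb(n, 4) * COV4 + 30 * comb(n, 5) * COV5
```

The five-taxon call loops over all $105^2 = 11{,}025$ pairs of subtrees mentioned in the text. Exact rational arithmetic (`fractions.Fraction`) is used so that the enumerated values can be compared directly with 8/225 and $(8/105)^2$.
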

Example: Distribution of $S_n$ under the Markov Model

A widely studied example of a nonuniform label-invariant distribution on trees is the Markov model, initially examined by Harding (1971). This model is often considered to be more realistic than the uniform model in capturing the salient features of some evolutionary situations (e.g., Slowinski, 1990; Page, 1991). The Markov model is defined by combining the label-invariance property with the following recursive principle: To construct the tree distribution for n + 1 taxa from the tree distribution for n taxa, it is stipulated that each of the n existing taxa is equally likely to be the source of the next bifurcation. For example, in the case of n = 4 taxa, one can verify that the Markov model assigns a probability of 1/18 to each of the 12 possible labeled trees of the type in Figure 2a and a probability of 1/9 to each of the three trees of the type in Figure 2b.

A computer enumeration reveals that, under the Markov model, $\mathrm{Cov}(I_{ijk}, I_{ijk'}) = 77/162 - 4/9 = 5/162$ and $\mathrm{Cov}(I_{ijk}, I_{ij'k'}) = 401/900 - 4/9 = 1/900$. Substituting back into Equation 2 yields

$$\mathrm{Var}(S_n) = \frac{1}{30}\binom{n}{5} + \frac{10}{27}\binom{n}{4} + \frac{2}{9}\binom{n}{3}.$$

Thus, the theorem also gives a potentially useful approximation for big trees under the Markov model.
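
The Markov-model moments can be turned into a normal-approximation P value along the following lines. This is a hedged sketch: the function names are hypothetical, and the only inputs taken from the paper are Equation 1, the Markov covariances 5/162 and 1/900, and the observed distance of 15 for the n = 7 algae trees.

```python
from fractions import Fraction
from math import comb, erf, sqrt


def mean_triples(n):
    """Equation 1: E(S_n) = (2/3) * C(n, 3)."""
    return Fraction(2, 3) * comb(n, 3)


def var_markov(n):
    """Equation 2 with the Markov-model covariances 5/162 and 1/900."""
    return (Fraction(2, 9) * comb(n, 3)
            + 12 * comb(n, 4) * Fraction(5, 162)
            + 30 * comb(n, 5) * Fraction(1, 900))


def lower_tail_p(s_obs, n, variance):
    """Standardize s_obs and return (z, P(Z <= z)) for a standard normal Z."""
    z = (s_obs - float(mean_triples(n))) / sqrt(float(variance))
    p = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return z, p


z_markov, p_markov = lower_tail_p(15, 7, var_markov(7))
```

For n = 7 and an observed distance of 15, this gives z of about -1.80 and a lower-tail probability of about 0.036, in line with the 3.6% normal-approximation P value that the paper quotes for the Markov model.
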

Tabulation of the $S_n$ Distribution

The probability distribution of the triples distance is tabulated in Appendices 2 and 3, under both the uniform and Markov models. The tabulated probabilities are $P(S_n \le x)$ and will correspond to possible P values for the hypothesis test discussed below. These probabilities were computed exactly for small numbers of taxa ($n \le 7$) by a direct enumeration of all possible pairs of trees. For $8 \le n < 50$, critical values were approximated by simulating 100,000 pairs of random trees from the underlying probability distribution, making the significance levels accurate to about three decimal places. The normal approximation is recommended for larger values of n, for which it seems to work adequately except in the extreme tail of the distribution.

APPLICATION OF THE THEORY TO HYPOTHESIS TESTING

Rapid Calculation of $S_n$

Along with the convenient distribution theory, the triples distance can also be computed rapidly. Obviously, a direct search over all triples allows for a "brute force" algorithm requiring $O(n^3)$ time. However, an efficient $O(n^2)$ time algorithm is equally easy to program. This algorithm assumes that the two bifurcating phylogenies to be compared are stored in the form of generational matrices. The (i, j)th entry of a generational matrix is the generation number at which taxa i and j split (i.e., the number of nodes on the path from the root to the most recent common ancestor of i and j). For example, the symmetric generational matrices that uniquely describe the trees in Figures 1a and 1b are

$$a = \begin{pmatrix} - & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & - & 2 & 4 & 3 & 5 & 3 \\ 1 & 2 & - & 2 & 2 & 2 & 2 \\ 1 & 4 & 2 & - & 3 & 4 & 3 \\ 1 & 3 & 2 & 3 & - & 3 & 4 \\ 1 & 5 & 2 & 4 & 3 & - & 3 \\ 1 & 3 & 2 & 3 & 4 & 3 & - \end{pmatrix}$$

and

$$b = \begin{pmatrix} - & 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & - & 5 & 4 & 3 & 3 & 2 \\ 1 & 5 & - & 4 & 3 & 3 & 2 \\ 1 & 4 & 4 & - & 3 & 3 & 2 \\ 1 & 3 & 3 & 3 & - & 4 & 2 \\ 1 & 3 & 3 & 3 & 4 & - & 2 \\ 1 & 2 & 2 & 2 & 2 & 2 & - \end{pmatrix}.$$

The (4, 5) element in the matrix a is 3 because taxa 4 and 5 split at the third generation from the top of that tree.
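
One way to build generational matrices and use them for an $O(n^2)$ computation is sketched below. It rests on assumptions: the nested-tuple encoding and the two tree shapes are an assumed reading of Figure 1, the function names are hypothetical, and the counting step uses the fact that taxon i is the most distant leaf of {i, j, k} in both trees exactly when the pairs (i, j) and (i, k) have the same pair of generation numbers.

```python
from collections import Counter
from math import comb

# ASSUMED tree shapes for Figure 1 (taxa numbered 1..7); a leaf is an int.
TREE_A = (1, (3, ((5, 7), (4, (2, 6)))))
TREE_B = (1, (7, ((5, 6), (4, (2, 3)))))


def generational_matrix(tree, n):
    """g[i-1][j-1] = number of nodes from the root down to the MRCA of taxa i and j."""
    g = [[0] * n for _ in range(n)]          # diagonal entries stay 0 and are never used

    def visit(node, depth):
        if not isinstance(node, tuple):
            return [node]
        left = visit(node[0], depth + 1)
        right = visit(node[1], depth + 1)
        for i in left:                        # taxa on opposite sides split at this node
            for j in right:
                g[i - 1][j - 1] = g[j - 1][i - 1] = depth
        return left + right

    visit(tree, 1)                            # the root is generation 1
    return g


def triples_distance_fast(a, b):
    """Triples distance from two generational matrices in O(n^2) time."""
    n = len(a)
    congruent = 0
    for i in range(n):
        # Count how often each pair (a(i, j), b(i, j)) occurs in row i.
        patterns = Counter((a[i][j], b[i][j]) for j in range(n) if j != i)
        # Any two columns with the same pair give a triple with i most distant in both trees.
        congruent += sum(comb(count, 2) for count in patterns.values())
    return comb(n, 3) - congruent


A = generational_matrix(TREE_A, 7)
B = generational_matrix(TREE_B, 7)
distance = triples_distance_fast(A, B)
```

With these assumed shapes, `A[3][4]` is 3 (the (4, 5) element discussed in the text) and `distance` is 15. Each unordered pair of taxa is written once while building a matrix and each row is scanned once while counting, so both steps are quadratic in n.
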

Next, define the generational pattern associated with taxa i and j in the two trees to be the pair (a(i, j), b(i, j)) (i.e., i and j split at generation a(i, j) in the first tree and at generation b(i, j) in the second). Note that for any triple i, j, k of taxa, i is the most distant leaf among i, j, k in the tree of Figure 1a if and only if a(i, j) = a(i, k), and similarly i is the most distant leaf in Figure 1b if and only if b(i, j) = b(i, k). In other words, i is the most distant leaf among i, j, k in both trees whenever the generational patterns for taxa i and j and taxa i and k are the same. It follows that

$$S_n = \binom{n}{3} - \sum_{i}\sum_{m} \binom{n(m, i)}{2},$$

where n(m, i) is the number of times that taxon i is associated with the mth generational pattern. This observation is the basis for the fast algorithm for computing $S_n$; the required values n(m, i) can be computed quickly by scanning the n(n - 1) off-diagonal elements of the two generational matrices.

Example: The Congruence of Morphological and Molecular Algae Data

We now return to the two phylogenies presented in Figure 1 and try to answer the question: can their apparent similarities be explained by chance variation? To use the fast algorithm to compute the triples distance, note that the generational pattern (1, 1) is repeated six times for the first taxon (scanning the first rows of the matrices a and b), (2, 3) occurs two times for the third taxon, (3, 3) occurs two times for the fifth taxon, and (3, 2) occurs three times for the seventh taxon, and these are the only generational patterns that are associated more than once with any taxon. Thus,

$$S_7 = 35 - 15 - 1 - 1 - 3 = 15,$$

which agrees with the value found previously by a "brute force" enumeration.

Is this value of the triples distance statistically significant? From the tables in Appendix 2, there is an 8.6% probability that 15 or fewer incongruent triples would occur by chance when the two trees are constructed independently under the uniform model. This probability can be interpreted as a P value for testing the null hypothesis that the two trees are statistically independent under the uniform model. Moreover, from the table, the analogous P value under the Markov model is 6.9%. Thus, under either model there is only minimal evidence in these trees to indicate that the evolution of the seven species of algae suggested by the molecular data is associated with the evolution suggested by morphological characteristics.

The normal approximation is provided by Theorem 1 (although for n = 7 taxa we would not expect the approximation to work well here). Under the uniform model, with the variance evaluated at n = 7, this gives the standardized test statistic value

$$z = \frac{15 - \frac{2}{3}(35)}{\sqrt{\mathrm{Var}(S_7)}} \approx -1.35,$$

which yields an approximate P value of 8.9%. Similarly, the normal approximation of the P value under the Markov model is 3.6%. Thus, in this example, the normal distribution appears to provide a better approximation under the uniform model. In general, our investigations suggest that the approximation works adequately for trees with a larger number of taxa, e.g., n > 50. For values of n < 50, the tables in the appendices should be used in preference to the approximation, especially for low significance levels such as 0.01.

The Conservative Test

The proof given for Theorem 1 remains valid when the two trees are allowed to have different (and arbitrary) label-invariant distributions.
In particular, if the label-invariant distributions for trees A and B assign a probability of 1 to some arbitrary fixed topologies $T_A$ and $T_B$, then

$$E(S_n \mid T_A, T_B) = \frac{2}{3}\binom{n}{3}$$

for any such $T_A$ and $T_B$. It follows that

$$\mathrm{Var}(S_n) = E\{\mathrm{Var}[S_n \mid T_A, T_B]\} + \mathrm{Var}\{E[S_n \mid T_A, T_B]\} = E\{\mathrm{Var}[S_n \mid T_A, T_B]\}.$$

Therefore, the variance of $S_n$ is maximized (over all possible pairs of label-invariant distributions for trees A and B) when these distributions assign a probability of 1 to those topologies, $T_A$ and $T_B$, that yield the largest conditional variance $\mathrm{Var}[S_n \mid T_A, T_B]$. This type of conditional variance can be calculated using a simple method that allows an extension of the preceding hypothesis test to cases where it is unclear which probability model is most appropriate for the two trees.

An examination of the variance formula of Equation 2 reveals that the $\sum_{ijkk'}$ term depends only on the topologies of all the subtrees of size 4, whereas the $\sum_{ijkj'k'}$ term depends on the topologies of the subtrees of size 5. There are two types of topologies for trees with four taxa (type 1 [Fig. 2a] and type 2 [Fig. 2b]) and three types of topologies for trees with five taxa (type 1 [Fig. 3a], type 2 [Fig. 3b], and type 3 [Fig. 3c]). Let $p_{im}(A)$ denote the proportion of subtrees of size m that have topology type i in the full tree A. Then, a straightforward argument shows

$$\mathrm{Var}[S_n \mid T_A, T_B] = \frac{2}{9}\binom{n}{3} + c_4\binom{n}{4} + c_5\binom{n}{5}, \qquad (3)$$

where $c_4$ is a bilinear function of the proportions $p_{i4}(A)$ and $p_{j4}(B)$ and $c_5$ is a bilinear function of the proportions $p_{i5}(A)$ and $p_{j5}(B)$ (the explicit coefficients are not legible in the source).

Using the facts that $\sum_i p_{im}(A) = 1$ and that $p_{14}(A) = (4/5) + (1/5)p_{15}(A) - (2/5)p_{25}(A)$, it follows that Equation 3 is maximized when $p_{15}(A) = p_{15}(B) = 1$ (provided n > 5). Substituting these values into Equation 3 gives the maximum possible variance of $S_n$ over all possible label-invariant distributions on the two trees:

$$V_{\max} = \frac{8}{15}\binom{n}{5} + \frac{5}{6}\binom{n}{4} + \frac{2}{9}\binom{n}{3}.$$

Consequently, if we use $V_{\max}$ in computing our standardized test statistic, that is, take

$$z = \frac{S_n - \frac{2}{3}\binom{n}{3}}{\sqrt{V_{\max}}},$$

then we will have a conservative test that yields the maximum P value over any label-invariant distributions on the trees. Rejection of the null hypothesis using this conservative approach would be especially forceful evidence that the similarities in the two trees cannot be explained by random chance.

FIGURE 3. The three topologies with n = 5 taxa. There are 60 possible labeled trees of type a, 30 of type b, and 15 of type c.

When n = 7, $V_{\max} = 4333/90$, so that the conservative test statistic in the algae example is $z \approx -1.20$, which yields an approximate P value of 11.5%. Thus, even at the 10% significance level, the null hypothesis that the two trees are statistically independent cannot be rejected using the conservative test: there exists a label-invariant distribution on trees that provides a reasonable explanation for the congruencies in these particular data. However, in situations where the conservative test rejects the null hypothesis, then a very strong conclusion is justified: the similarities between two trees cannot be attributed to chance variation under any label-invariant model.

The Conditional Test

An alternative approach neglects the issue of choosing a distribution on trees and considers the permutation test conditioned on the topologies of the two trees that are actually observed. Thus, the null model says that all random relabelings of the nodes in the given topologies are equally likely. Because this test amounts to assuming that the tree distribution puts a probability of 1 on the observed topologies, the asymptotic normality of Theorem 1 still applies. The test is carried out in practice by computing the conditional variance given by Equation 3 and requires only a simple count of the number of occurrences of each possible type of subtree topology of size 4 and size 5, as illustrated in Figures 2 and 3.

Returning once again to the comparison of the molecular and the morphology trees in Figure 1, notice that they coincidentally have the same topology. Of the $\binom{7}{5} = 21$ subtrees of size 5, 14 are of type 1 (Fig. 3a), 1 is of type 2 (Fig. 3b), and 6 are of type 3 (Fig. 3c). Of the $\binom{7}{4} = 35$ subtrees of size 4, 32 are of type 1 (Fig. 2a) and 3 are of type 2 (Fig. 2b). Substituting the corresponding proportions into Equation 3 gives the conditional variance $V_{\mathrm{cond}} = 1364/35$ and the conditional test statistic $z \approx -1.33$, yielding an approximate P value of 9.1%.

SUMMARY

In this paper we have described a metric, the triples distance, for comparing rooted bifurcating phylogenetic trees. This distance is easy to interpret and easy to calculate and has a well-developed sampling theory. The sampling theory enables use of the triples distance as a statistic for testing the null hypothesis that the similarities in two trees can be explained by independent random structures. However, as with any hypothesis test, a statistically significant result should not be interpreted as proof of a global pattern, especially for large trees.
For example, it is possible to reject the null hypothesis based on the close agreement of a very small subset of the total collection of taxa (combined with otherwise independent structures). Thus, when n is large, it may be fruitful to also apply the triples distance methodology to particular subtrees that correspond to important subgroups of taxa.

An especially appealing attribute of the triples test is its potential applicability under any choice of the probability distribution on the set of possible trees. If the application indicates a good candidate for this distribution on trees, then the formulae provided can be used to compute the test statistic under this distribution. The statistic is quite robust to small deviations from the candidate tree distribution (e.g., when n = 7, $\sqrt{\mathrm{Var}(S_n)}$ ranges only from a minimum of 5.35 to a maximum of 6.94, over all possible label-invariant tree distributions). On the other hand, if no reasonable candidate distribution exists, the user may choose a conservative statistic, valid for any tree distribution, or a conditional permutation statistic, valid for the topologies of the trees that are actually observed. Although other metrics may be more appropriate for particular applications, the triples distance provides a useful, robust new resource in the systematist's tool kit.

REFERENCES

BLEDSOE, A. H., AND R. J. RAIKOW. 1990. A quantitative assessment of congruence between molecular and nonmolecular estimates of phylogeny. J. Mol. Evol. 30:247-259.
DAY, W. H. E. 1985. Optimal algorithms for comparing trees with labeled leaves. J. Classif. 2:7-28.
DAY, W. H. E. 1986. Analysis of quartet dissimilarity measures between undirected phylogenetic trees. Syst. Zool. 35:325-333.
DOBSON, A. J. 1975. Comparing the shapes of trees. Pages 95-100 in Lecture notes in mathematics, no. 452. Combinatorial mathematics III (A. P. Street and W. D. Wallis, eds.). Springer-Verlag, New York.
ESTABROOK, G. F. 1992. Evaluating undirected positional congruence of individual taxa between two estimates of the phylogenetic tree for a group of taxa. Syst. Biol. 41:172-177.
ESTABROOK, G. F., F. R. MCMORRIS, AND C. A. MEACHAM. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. 34:193-200.
1996 CRITCHLOW ET AL. TRIPLES DISTANCE FOR ROOTED TREES 331
FELLER, W. 1971. An introduction to probability theory and its applications, Volume 2. John Wiley and Sons, New York.
HARDING, E. F. 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab. 3:44-77.
HENDY, M. D., C. H. C. LITTLE, AND D. PENNY. 1984. Comparing trees with pendant vertices labelled. SIAM J. Appl. Math. 44:1054-1065.
KUHNER, M. A., AND J. FELSENSTEIN. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459-468.
LAPOINTE, F.-J., AND P. LEGENDRE. 1990. A statistical framework to test the consensus of two nested classifications. Syst. Zool. 39:1-13.
LAPOINTE, F.-J., AND P. LEGENDRE. 1992. A statistical framework to test the consensus among additive trees (cladograms). Syst. Biol. 41:158-171.
PAGE, R. D. M. 1988. Quantitative cladistic biogeography: Constructing and comparing area cladograms. Syst. Zool. 37:254-270.
PAGE, R. D. M. 1991. Random dendrograms and null hypotheses in cladistic biogeography. Syst. Zool. 40:54-62.
PENNY, D., L. R. FOULDS, AND M. D. HENDY. 1982. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297:197-200.
PENNY, D., AND M. D. HENDY. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82.
PENNY, D., M. D. HENDY, AND M. A. STEEL. 1991. Testing the theory of descent. Pages 155-183 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
ROBINSON, D. F., AND L. R. FOULDS. 1981. Comparison of phylogenetic trees. Math. Biosci. 53:131-147.
SHAO, K., AND F. J. ROHLF. 1983. Sampling distribution of consensus indices when all bifurcating trees are equally likely. Pages 132-136 in Numerical taxonomy (J. Felsenstein, ed.).
Springer-Verlag, Berlin.
SHAO, K., AND R. R. SOKAL. 1986. Significance tests of consensus indices. Syst. Zool. 35:582-590.
SLOWINSKI, J. B. 1990. Probabilities of n-trees under two models: A demonstration that asymmetrical interior nodes are not improbable. Syst. Zool. 39:89-94.
STEEL, M. A. 1988. Distribution of the symmetric difference metric on phylogenetic trees. SIAM J. Disc. Math. 1:541-551.
STEEL, M. A., AND D. PENNY. 1993. Distributions of tree comparison metrics: Some new results. Syst. Biol. 42:126-141.
SWOFFORD, D. L. 1991. When are phylogeny estimates from molecular and morphological data incongruent? Pages 295-333 in Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
WATERMAN, M. S., AND T. F. SMITH. 1978. On the similarity of dendrograms. J. Theor. Biol. 73:789-800.
WILCOX, L. W., L. A. LEWIS, P. A. FUERST, AND G. L. FLOYD. 1992. Assessing the relationships of autosporic and zoosporic chlorococcalean green algae with 18S rDNA sequence data. J. Phycol. 28:381-386.

Received 9 March 1995; accepted 8 March 1996
Associate Editor: Daniel Faith

APPENDIX 1

PROOF OF THE NORMAL APPROXIMATION, THEOREM 1

Recall the notation $S_n = \sum_{ijk} I_{ijk}$, where $I_{ijk} = 1$ if the triple of taxa $i, j, k$ has a different subtree topology in the two trees, and 0 otherwise. Let $Z_n$ denote $[S_n - E(S_n)]/[\mathrm{Var}(S_n)]^{1/2}$, and let $M_{nt}$ denote $E(Z_n^t)$. To prove asymptotic normality, it is sufficient to show that the moments $M_{nt}$ all converge to the corresponding moments of the standard normal distribution, i.e., that

$$\lim_{n \to \infty} M_{nt} = \text{the } t\text{th moment of the standard normal distribution} = \begin{cases} (t-1)!! & \text{if } t \text{ is even} \\ 0 & \text{if } t \text{ is odd} \end{cases}$$

(e.g., Feller, 1971:69, 4-9).

For notational convenience, let $A_{ijk} = I_{ijk} - 2/3$, so that $E(A_{ijk}) = 0$ and $S_n - E(S_n) = \sum_{ijk} A_{ijk}$. Then

$$M_{nt} = \frac{E\left[\left(\sum_{ijk} A_{ijk}\right)^t\right]}{[\mathrm{Var}(S_n)]^{t/2}}.$$

In the above expression, note that by Equation 2, $\mathrm{Var}(S_n) = (c/4)n^5 + o(n^5)$, where $c = \mathrm{Cov}(I_{ijk}, I_{ilm})$ is the covariance for two triples that share exactly one taxon.
Also note that $\left[\sum_{ijk} A_{ijk}\right]^t$ can be expanded as a summation of $t$-fold products, each having the form $\prod_{s=1}^{t} A_{i_s j_s k_s}$. For each such $t$-fold product, let $m$ denote the number of distinct indices among all the $3t$ indices $i_1, j_1, k_1, i_2, j_2, k_2, \ldots, i_t, j_t, k_t$ that occur in the product. We distinguish three possible cases, according to the value of $m$.

Case 1

For any fixed $m < 5t/2$, the number of possible products that attain this value of $m$ is of order $n^m$. Therefore, the contribution to $E\left[\sum_{ijk} A_{ijk}\right]^t$ from all such products is asymptotically negligible compared to $[\mathrm{Var}(S_n)]^{t/2}$, which is of order $n^{5t/2}$.

Case 2

Next consider any fixed $m > 5t/2$. Any product that attains this value of $m$ must have the following property: there exists some triple $i_{s'}, j_{s'}, k_{s'}$ such that $A_{i_{s'} j_{s'} k_{s'}}$ occurs in the product and such that $i_{s'}, j_{s'}, k_{s'}$ are distinct from all of the other indices occurring in the product. But then $A_{i_{s'} j_{s'} k_{s'}}$ is independent of all other terms in the product, so $E\left[\prod_{s=1}^{t} A_{i_s j_s k_s}\right] = 0$.

Note that if $t$ is odd, then either case 1 or case 2 must always hold, and therefore $\lim_{n \to \infty} M_{nt} = 0$, as claimed. However, if $t$ is even, consider the third case.

Case 3

Suppose $m = 5t/2$. Then $t$ is even, e.g., $t = 2r$. Consider any product that attains this value of $m$. If there happens to exist $s'$ as described in case 2, then still $E\left[\prod_{s=1}^{t} A_{i_s j_s k_s}\right] = 0$ as argued under case 2. If there does not exist such an $s'$, then this together with $m = 5t/2$ implies that the product must have the form

$$A_{i_1 j_1 k_1} A_{i_1 j_2 k_2} A_{i_2 j_3 k_3} A_{i_2 j_4 k_4} \cdots A_{i_r j_{2r-1} k_{2r-1}} A_{i_r j_{2r} k_{2r}}, \qquad (A1)$$

where $i_1, \ldots, i_r$, $j_1, \ldots, j_{2r}$, and $k_1, \ldots, k_{2r}$ are all distinct. The expectation of such a product is $\left(E[A_{i_1 j_1 k_1} A_{i_1 j_2 k_2}]\right)^r = c^r$. Moreover, it is straightforward to count the number of possible products of form A1: there are $(2r - 1)!!$ ways to decide which terms in the product are paired with each other by possessing a common index; $n(n - 1) \cdots (n - r + 1)$ ways of choosing $i_1, i_2, \ldots, i_r$; and

$$\binom{n-r}{2} \binom{n-r-2}{2} \cdots \binom{n-5r+2}{2}$$

ways of choosing $j_1, k_1, j_2, k_2, \ldots, j_{2r}, k_{2r}$. Hence, the number of possible products is

$$(2r - 1)!! \; n(n - 1) \cdots (n - r + 1) \binom{n-r}{2} \binom{n-r-2}{2} \cdots \binom{n-5r+2}{2} = \frac{(2r - 1)!! \; n!}{(n - 5r)! \; 2^{2r}}.$$

In conclusion, when $t = 2r$ is even,

$$\lim_{n \to \infty} M_{nt} = \lim_{n \to \infty} \frac{E\left[\sum \prod_{s=1}^{t} A_{i_s j_s k_s}\right]}{[\mathrm{Var}(S_n)]^{r}},$$

where the summation is over all possible products of form A1. By evaluating this expectation as above and substituting the leading term $cn^5/4$ of $\mathrm{Var}(S_n)$, we obtain

$$\lim_{n \to \infty} M_{nt} = \lim_{n \to \infty} (cn^5/4)^{-r} \, \frac{(2r - 1)!! \; n!}{(n - 5r)! \; 2^{2r}} \, c^r = (2r - 1)!!,$$

as claimed.
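The counting step in case 3 can be checked numerically. The following Python sketch is an illustrative aid rather than part of the original proof; the function names `double_factorial` and `limiting_moment_ratio` are hypothetical. It multiplies the number of products of form A1 by $c^r$, divides by $(cn^5/4)^r$ (so the constant $c$ cancels), and shows the ratio approaching $(2r - 1)!!$ as $n$ grows.

```python
from math import perm


def double_factorial(k: int) -> int:
    """Return k!! = k (k - 2) (k - 4) ... for odd k >= 1 (and 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def limiting_moment_ratio(n: int, r: int) -> float:
    """Leading term of the (2r)th moment of Z_n for n taxa.

    The number of form-A1 products is (2r-1)!! * n! / [(n-5r)! * 2^(2r)].
    Multiplying by c^r and dividing by (c n^5 / 4)^r cancels both c^r and
    the factor 2^(2r) = 4^r, leaving (2r-1)!! * [n!/(n-5r)!] / n^(5r).
    """
    numerator = double_factorial(2 * r - 1) * perm(n, 5 * r)  # perm(n, k) = n!/(n-k)!
    return numerator / n ** (5 * r)


if __name__ == "__main__":
    r = 2  # fourth moment; the standard normal value is (2r - 1)!! = 3
    for n in (10**2, 10**3, 10**4, 10**6):
        print(n, limiting_moment_ratio(n, r))
```

The convergence is slow for small n because the falling factorial $n!/(n - 5r)!$ differs from $n^{5r}$ by a factor of roughly $1 - 5r(5r - 1)/(2n)$.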

APPENDIX 2

Tables of the exact distribution of $S_n$. For each n (number of taxa: 4 ≤ n ≤ 7), the tabulated probabilities are $P(S_n \le x)$ for both the uniform and Markov models.

n = 4
 x    Uniform   Markov
 0    0.0667    0.0741
 1    0.1733    0.148
 2    0.3333    0.3333
 3    0.7600    0.7778
 4    1.0000    1.0000

n = 5
 x    Uniform   Markov
 0    0.0095    0.010
 1    0.059     0.034
 2    0.04      0.0417
 3    0.100     0.0769
 4    0.1674    0.1333
 5    0.37      0.333
 6    0.3959    0.4074
 7    0.5973    0.5889
 8    0.8041    0.8
 9    0.9565    0.985
10    1.0000    1.0000

n = 6
 x    Uniform   Markov
 0    0.0011    0.0016
 1    0.0031    0.0047
 2    0.0053    0.0087
 3    0.0113    0.0144
 4    0.0186    0.0191
 5    0.059     0.07
 6    0.048     0.03
 7    0.0746    0.0538
 8    0.1076    0.086
 9    0.138     0.1196
10    0.1910    0.179
11    0.531     0.417
12    0.333     0.314
13    0.456     0.4356
14    0.5808    0.5745
15    0.7049    0.798
16    0.891     0.8688
17    0.958     0.9613
18    0.974     0.9961
19    0.9936    0.999
20    1.0000    1.0000

n = 7
 x    Uniform   Markov
 0    0.0001    0.000
 1    0.0003    0.0006
 2    0.0006    0.0010
 3    0.001     0.00
 4    0.000     0.0037
 5    0.008     0.0048
 6    0.0043    0.0060
 7    0.0068    0.0073
 8    0.0100    0.0094
 9    0.0135    0.0117
10    0.0184    0.0144
11    0.059     0.019
12    0.0374    0.077
13    0.0516    0.038
14    0.0661    0.0506
15    0.0858    0.0690
16    0.1096    0.091
17    0.1350    0.1150
18    0.1665    0.144
19    0.047     0.1767
20    0.473     0.09
21    0.3045    0.80
22    0.3754    0.3518
23    0.447     0.4336
24    0.573     0.58
25    0.6181    0.693
26    0.7074    0.735
27    0.7918    0.8311
28    0.865     0.9139
29    0.918     0.9671
30    0.9549    0.9899
31    0.9796    0.9976
32    0.9916    0.9995
33    0.9966    0.9999
34    0.9993    0.99997
35    1.0000    1.0000
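For small n the uniform-model entries can be cross-checked by brute force. The sketch below is a minimal illustration, not the authors' own procedure; it assumes that the uniform model assigns equal probability to every labeled rooted bifurcating tree, and all function names are hypothetical. It enumerates the trees, computes the triples distance for every ordered pair, and accumulates the exact distribution.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def rooted_trees(leaves):
    """Yield every rooted bifurcating tree on `leaves`.

    A tree is either a leaf label or a (left, right) pair of subtrees.
    """
    leaves = tuple(leaves)
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    # Keep `first` on the left so each unordered root split appears once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            right = tuple(x for x in rest if x not in extra)
            for left_tree in rooted_trees((first,) + extra):
                for right_tree in rooted_trees(right):
                    yield (left_tree, right_tree)


def leaf_labels(tree):
    """Return the leaf labels of `tree` as a tuple."""
    if not isinstance(tree, tuple):
        return (tree,)
    return leaf_labels(tree[0]) + leaf_labels(tree[1])


def resolved_triples(tree, out=None):
    """Map each 3-taxon set to the pair that is most closely related in `tree`."""
    out = {} if out is None else out
    if isinstance(tree, tuple):
        left, right = leaf_labels(tree[0]), leaf_labels(tree[1])
        for side, other in ((left, right), (right, left)):
            for x, y in combinations(side, 2):
                for z in other:
                    out[frozenset((x, y, z))] = frozenset((x, y))
        resolved_triples(tree[0], out)
        resolved_triples(tree[1], out)
    return out


def triples_distance(tree_a, tree_b):
    """Count the triples of taxa whose subtree topology differs between the trees."""
    a, b = resolved_triples(tree_a), resolved_triples(tree_b)
    return sum(a[t] != b[t] for t in a)


def uniform_cdf(n):
    """Exact P(S_n <= x) for x = 0..C(n,3), both trees uniform and independent."""
    trees = list(rooted_trees(range(n)))
    counts = Counter(triples_distance(a, b) for a in trees for b in trees)
    total, running, cdf = len(trees) ** 2, 0, []
    for x in range(n * (n - 1) * (n - 2) // 6 + 1):
        running += counts[x]
        cdf.append(Fraction(running, total))
    return cdf


if __name__ == "__main__":
    cdf4 = uniform_cdf(4)
    print([round(float(p), 4) for p in cdf4[:2]])  # [0.0667, 0.1733]
```

The first two values agree with the n = 4 uniform column: 15 of the 225 ordered pairs of trees are identical, and a further 24 differ in exactly one triple. The enumeration grows as $[(2n - 3)!!]^2$ pairs, so this approach is only practical for small n.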

APPENDIX 3

Level α critical values of the statistic $S_n$. For each n (number of taxa: 8 ≤ n ≤ 50), the largest value of x such that $P(S_n \le x) \le \alpha$ is tabulated for both the uniform and Markov models.

          α = 0.01            α = 0.05            α = 0.1
 n    Uniform  Markov     Uniform  Markov     Uniform  Markov
 8       16      17           3       4           6       8
 9        7       8          37      39           4      44
10       43      44          55      59           6      65
11       63      67          79      83          87      91
12       88      95         109     115         119      15
13       10      17         145     153         157     164
14      159     167         188     198          03       1
15       04      18          38       5          56      68
16       57      73          97     314         317      33
17       30     338         366     385         389     406
18      389     415         443     467         470     491
19      467     500          53     559          56     586
20      558     595         630      66         665     693
21      659     704         741     779         781      81
22      775      85         863     906         908     944
23      898     960         999    1049        1048    1090
24     1036    1104        1148     103          10     149
25     1190     164        1311    1376        1371     144
26     1354     144        1487    1561        1555    1615
27      159     169        1681    1760        1754     180
28      179    1836        1890    1979        1970     044
29     1939     065         116      18           0      85
30      166     306         360     471          45     543
31      416     565           6     741          73      81
32      679     849         903     303        3011    3119
33      960    3141         300    3344        3318    3437
34      363    3445         351    3678        3647    3774
35     3569     379        3856     407        3995    4133
36     3909     415         414    4403        4364     451
37      489    4537        4608    4801        4758     491
38     4667     493        5008     519         517    5345
39     5066    5371        5434    5669        5611    5801
40     5498      58        5891    6134         608     677
41      594     680         636     667        6561    6778
42      647    6785        6859    7148        7076    7304
43     6935    7306         739    7691        7619    7859
44     7475     787        7953     871        8189    8445
45     8009    8457         855    8873        8779    9058
46     8574    9068         914    9506         939    9699
47       91    9694        9774   10161       10054   10363
48     9856   10366       10440   10844       10735   11061
49    10487   11071       11130   11571        1144   11793
50     1100   11791       11860     135         118    1557
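The definition of a level α critical value can be made concrete with a short sketch. The function name `critical_value` is hypothetical, and the example uses the n = 4 uniform-model probabilities from Appendix 2 purely for illustration, since Appendix 3 itself starts at n = 8. Given the cumulative probabilities $P(S_n \le x)$, it returns the largest x whose cumulative probability does not exceed α.

```python
from typing import Optional, Sequence


def critical_value(cdf: Sequence[float], alpha: float) -> Optional[int]:
    """Return the largest x with P(S_n <= x) <= alpha, or None if none exists.

    `cdf[x]` must hold P(S_n <= x) for x = 0, 1, 2, ...
    """
    best = None
    for x, p in enumerate(cdf):
        if p <= alpha:
            best = x
        else:
            break  # a CDF is non-decreasing, so no later x can qualify
    return best


if __name__ == "__main__":
    # Uniform-model probabilities for n = 4, x = 0..3 (Appendix 2).
    uniform_n4 = [0.0667, 0.1733, 0.3333, 0.7600]
    print(critical_value(uniform_n4, 0.10))  # 0
    print(critical_value(uniform_n4, 0.05))  # None
```

Under this convention the null hypothesis is rejected at level α when the observed triples distance is less than or equal to the tabulated value; a result of `None` means that no outcome is extreme enough to reject at that level.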