A Statistical Test of Phylogenies Estimated from Sequence Data

Wen-Hsiung Li
Center for Demographic and Population Genetics, University of Texas

A simple approach to testing the significance of the branching order, estimated from protein or DNA sequence data, of three taxa is proposed. The branching order is inferred by the transformed-distance method, under the assumption that one or two outgroups are available, and the branch lengths are estimated by the least-squares method. The inferred branching order is considered significant if the estimated internodal distance is significantly greater than zero. To test this, a formula for the variance of the internodal distance has been developed. The statistical test proposed has been checked by computer simulation. The same test also applies to the case of four taxa with no outgroup, if one considers an unrooted tree. Formulas for the variances of internodal distances have also been developed for the case of five taxa. Conditions are given under which it is more efficient to add the sequence of a fifth taxon than to do 25% more nucleotide sequencing in each of the original four. A method is presented for combining analyses of disparate data to get a single P value. Finally, the test, applied to the human-chimpanzee-gorilla problem, shows that the issue is not yet resolved.

Introduction

Although phylogenetic reconstruction has long been recognized as a problem in statistical inference (Edwards and Cavalli-Sforza 1964), few authors have considered how to evaluate the confidence level for estimated phylogenies (Cavender 1978; Felsenstein 1981, 1985a, 1985b; Mueller and Ayala 1982; Templeton 1983; Nei et al. 1985; Lake 1987; P. Pamilo, personal communication). This problem has become important because the rapid accumulation of molecular data has generated much interest in phylogenetic studies. How to test the significance of an inferred phylogeny is a difficult problem.
A simpler problem is to test the significance of estimated internodal distances. As will be explained later, in the case of four taxa significance of the internodal distance can be taken as significance of the inferred phylogeny. When the number of taxa under study is more than four, the two problems are no longer equivalent, and the requirement that all internodal distances be significantly greater than zero seems to be too stringent a test for the significance of the inferred branching order.

A simple way to test the significance of internodal distances is to study their variances. Mueller and Ayala (1982) proposed to compute these variances by the jackknife method, while Nei et al. (1985) derived analytic formulas for the case of a UPGMA tree, i.e., a tree estimated by the unweighted pair-group method of analysis (Sneath and Sokal 1973). The UPGMA method assumes a constant rate of evolution, but there is now strong evidence that this assumption is often violated (Wu and Li 1985; Britten 1986; Li et al. 1987). It is therefore desirable to consider an approach that does not make this assumption.

In this paper I propose a two-step approach. The first step is to infer the branching order. One can use the transformed-distance method (Farris 1977; Klotz et al. 1979; Li 1981), the neighbor-joining method (Saitou and Nei 1987), the maximum parsimony method (Eck and Dayhoff 1966; Fitch 1977), or any other method that does not assume rate constancy and that has been shown to be effective for obtaining the correct tree. The second step is to estimate the branch lengths by the least-squares method (Cavalli-Sforza and Edwards 1967; Chakraborty 1977). The variances of internodal distances are then obtained from the equations derived from the least-squares method. In this study analytic formulas for these variances have been developed for the cases of four and five taxa. Computer simulation of the case of four taxa confirmed that the statistical test proposed can indeed be used to test the significance of an inferred phylogeny. The present theory was applied to the human-chimpanzee-gorilla trichotomy problem.

Key words: phylogenetic reconstruction, transformed distance method, variances of branch lengths, significance of branching order, phylogeny of apes and man.

Address for correspondence and reprints: Wen-Hsiung Li, Center for Demographic and Population Genetics, University of Texas, P.O. Box 20334, Houston, TX 77225.

Mol. Biol. Evol. 6(4):424-435. 1989. © 1989 by The University of Chicago. All rights reserved. 0737-4038/89/0604-0009$02.00

Variances of Internodal Branch Lengths

In the following I shall explain how to derive the variance of a branch length, assuming that the tree topology has already been inferred by one of the methods mentioned above. Since the focus of this paper is on the internodal branches, the variances of the other branches are presented in an appendix; these variances are useful for evaluating the reliability of estimates of branch lengths.

Four Taxa

Denote the four taxa under study by 1, 2, 3, and 4. Suppose that the inferred tree topology is as shown in figure 1a; the root of the tree can be determined if one of the four taxa is an outgroup.
The branch lengths should satisfy the following equations:

$$d_{12} = a + b, \tag{1}$$
$$d_{13} = a + c + d, \tag{2}$$
$$d_{23} = b + c + d, \tag{3}$$
$$d_{14} = a + c + e, \tag{4}$$
$$d_{24} = b + c + e, \tag{5}$$
$$d_{34} = d + e, \tag{6}$$

where $d_{ij}$ is the distance between taxa i and j. From these equations I obtain the following least-squares solution:

$$a = \tfrac{1}{2} d_{12} + \tfrac{1}{4}(d_{13} - d_{23} + d_{14} - d_{24}), \tag{7}$$
$$b = d_{12} - a, \tag{8}$$
$$c = \tfrac{1}{4}(d_{13} + d_{23} + d_{14} + d_{24}) - \tfrac{1}{2}(d_{12} + d_{34}), \tag{9}$$
$$d = \tfrac{1}{2} d_{34} + \tfrac{1}{4}(d_{13} + d_{23} - d_{14} - d_{24}), \tag{10}$$
$$e = d_{34} - d. \tag{11}$$

FIG. 1 (panels a and b). Model trees used in the derivation of the mean and variance of the branch lengths.

The variance (V) of c can be obtained using formula (9) and following the method of Nei et al. (1985) and Wu and Li (1985):

$$
\begin{aligned}
V(c) ={} & \tfrac{1}{16}\bigl[V(d_{13}) + V(d_{23}) + V(d_{14}) + V(d_{24}) + 2V(d_{16}) + 2V(d_{26}) + 4V(d_{56}) + 2V(d_{53}) + 2V(d_{54})\bigr] \\
& - \tfrac{1}{2}\bigl[V(d_{15}) + V(d_{25}) + V(d_{36}) + V(d_{46})\bigr] + \tfrac{1}{4}V(d_{12}) + \tfrac{1}{4}V(d_{34}),
\end{aligned}
\tag{12}
$$

where $V(d_{ij})$ denotes the variance of the estimate of the distance between sequences i and j. Nodes 5 and 6 are the two internal nodes of figure 1a: node 5 joins taxa 1 and 2, and node 6 joins taxa 3 and 4.

First, consider protein sequence data. The mean and variance of the number ($d_{ij}$) of amino acid replacements per site between sequences i and j can be estimated by

$$d_{ij} = -f \ln(1 - p/f), \tag{13}$$
$$V(d_{ij}) = \frac{p(1 - p)}{L(1 - p/f)^2}, \tag{14}$$

where f = 19/20, p is the proportion of different amino acids between the two sequences, and L is the number of residue sites compared. For a pair of extant sequences, these formulas are readily applicable. However, the sequences at nodes 5 and 6 do not exist, and thus variances such as $V(d_{16})$ and $V(d_{56})$ cannot be estimated directly from actual data. They can, however, be estimated as follows (Nei et al. 1985). I use $V(d_{16})$ as an example. From formula (13) I obtain

$$p = f\bigl(1 - e^{-d_{ij}/f}\bigr). \tag{15}$$

Since $d_{16} = a + c$, $p = f[1 - e^{-(a+c)/f}]$; a + c can be obtained from formulas (7) and (9). Putting p into formula (14), one readily obtains $V(d_{16})$.
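The four-taxon calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the author's program: the function names and the dictionary layout are hypothetical, and the one-parameter case (f = 3/4, introduced just below for nucleotides) is used as the default. It applies formulas (7)-(11) for the branch lengths, formulas (14)-(15) for each variance, and formula (12) for V(c).

```python
import math

def jc_variance(p, L, f=0.75):
    """Formula (14): V(d) = p(1 - p) / [L (1 - p/f)^2]."""
    return p * (1.0 - p) / (L * (1.0 - p / f) ** 2)

def expected_p(d, f=0.75):
    """Formula (15): p = f (1 - exp(-d/f)); needed for distances to internal nodes."""
    return f * (1.0 - math.exp(-d / f))

def four_taxon_internal_branch(d, L, f=0.75):
    """Estimate branch lengths and V(c) for the tree ((1,2),(3,4)).

    d maps taxon pairs such as (1, 2) to estimated distances (hypothetical layout).
    """
    d12, d13, d14 = d[(1, 2)], d[(1, 3)], d[(1, 4)]
    d23, d24, d34 = d[(2, 3)], d[(2, 4)], d[(3, 4)]

    # Least-squares branch lengths, formulas (7)-(11).
    a = d12 / 2 + (d13 - d23 + d14 - d24) / 4
    b = d12 - a
    c = (d13 + d23 + d14 + d24) / 4 - (d12 + d34) / 2
    dd = d34 / 2 + (d13 + d23 - d14 - d24) / 4   # branch d of the text
    e = d34 - dd

    def var(dist):
        # Variance of a distance: formula (15) turns the distance into p,
        # formula (14) then gives the variance. Negative estimates are clipped to 0.
        return jc_variance(expected_p(max(dist, 0.0), f), L, f)

    # Formula (12). Node 5 joins taxa 1 and 2; node 6 joins taxa 3 and 4.
    v_c = ((var(d13) + var(d23) + var(d14) + var(d24)
            + 2 * var(a + c) + 2 * var(b + c)          # V(d16), V(d26)
            + 4 * var(c)                               # V(d56)
            + 2 * var(c + dd) + 2 * var(c + e)) / 16   # V(d53), V(d54)
           - (var(a) + var(b) + var(dd) + var(e)) / 2  # V(d15), V(d25), V(d36), V(d46)
           + (var(d12) + var(d34)) / 4)

    se = math.sqrt(v_c)
    return {"a": a, "b": b, "c": c, "d": dd, "e": e, "var_c": v_c, "se_c": se}


# Example: distances implied by a = b = 0.05, c = 0.01, d = 0.06, e = 0.14.
example = {(1, 2): 0.10, (1, 3): 0.12, (2, 3): 0.12,
           (1, 4): 0.20, (2, 4): 0.20, (3, 4): 0.20}
result = four_taxon_internal_branch(example, L=1000)
print(result["c"], result["se_c"], result["c"] / result["se_c"])
```

The ratio of the estimated c to its SE printed on the last line is the quantity used in the significance test described later in the text.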

Next, consider nucleotide sequence data. Under the assumption of random substitution among the four types of nucleotide, i.e., the one-parameter model, the mean and variance of the number of substitutions per nucleotide site between sequences i and j are also given by formulas (13) and (14), except that now f = 3/4, p is the proportion of different nucleotides between the two sequences, and L is the number of nucleotide sites compared (Jukes and Cantor 1969; Kimura and Ohta 1972). Under the two-parameter model (Kimura 1980), the formulas corresponding to formulas (13) and (14) are

$$d_{ij} = A + B, \tag{16}$$
$$V(d_{ij}) = \bigl[x^2 P + z^2 Q - (xP + zQ)^2\bigr]/L, \tag{17}$$

where P and Q are, respectively, the proportions of transitional and transversional differences between sequences i and j, $x = 1/(1 - 2P - Q)$, $y = 1/(1 - 2Q)$, $z = (x + y)/2$, $A = \tfrac{1}{2}\ln(x) - \tfrac{1}{4}\ln(y)$ is the number of transitional substitutions per site, and $B = \tfrac{1}{2}\ln(y)$ is the number of transversional substitutions per site. Note that, unlike formula (14), formula (17) involves two parameters, P and Q. The formulas corresponding to formula (15) are given by

$$Q = \tfrac{1}{2}\bigl(1 - e^{-2B}\bigr), \tag{18}$$
$$P = \tfrac{1}{2}\bigl[1 - Q - e^{-(2A + B)}\bigr] \tag{19}$$

(Wu and Li 1985).

Five Taxa

Suppose that the inferred branching order is as in figure 1b. The branch lengths are then given by

$$a = \tfrac{1}{2} d_{12} + \tfrac{1}{6}(d_{13} - d_{23} + d_{14} - d_{24} + d_{15} - d_{25}), \tag{20}$$
$$b = d_{12} - a, \tag{21}$$
$$c = \tfrac{1}{4}(d_{13} + d_{23}) + \tfrac{1}{8}(d_{14} + d_{24} + d_{15} + d_{25}) - \tfrac{1}{4}(d_{34} + d_{35}) - \tfrac{1}{2} d_{12}, \tag{22}$$
$$d = \tfrac{1}{2}(d_{13} + d_{23}) - \tfrac{1}{2} d_{12} - c, \tag{23}$$
$$e = \tfrac{1}{4}(d_{34} + d_{35} - d_{13} - d_{23}) + \tfrac{1}{8}(d_{14} + d_{15} + d_{24} + d_{25}) - \tfrac{1}{2} d_{45}, \tag{24}$$
$$f = d_{34} - d - e, \tag{25}$$
$$g = d_{45} - f. \tag{26}$$

If two of the five taxa, say taxa 4 and 5, are known to be outgroups, then one needs to obtain only the variance of c:

$$
\begin{aligned}
V(c) ={} & \tfrac{1}{64}\bigl[V(d_{14}) + V(d_{15}) + V(d_{24}) + V(d_{25})\bigr] + \tfrac{1}{32}\bigl[V(d_{18}) + V(d_{28}) + V(d_{46}) + V(d_{56})\bigr] \\
& + \tfrac{1}{16}\bigl[V(d_{13}) + V(d_{23}) + V(d_{34}) + V(d_{35}) + V(d_{68})\bigr] \\
& + \tfrac{1}{8}\bigl[V(d_{17}) + V(d_{27}) + V(d_{36}) + V(d_{38}) - V(d_{47}) - V(d_{57})\bigr] \\
& + \tfrac{1}{4}\bigl[V(d_{12}) + V(d_{67}) - V(d_{78})\bigr] - \tfrac{1}{2}\bigl[V(d_{16}) + V(d_{26}) + V(d_{37})\bigr].
\end{aligned}
\tag{27}
$$

If only one or no outgroup exists, then one needs also to obtain the variance of e:

$$
\begin{aligned}
V(e) ={} & \tfrac{1}{64}\bigl[V(d_{14}) + V(d_{15}) + V(d_{24}) + V(d_{25})\bigr] + \tfrac{1}{32}\bigl[V(d_{18}) + V(d_{28}) + V(d_{46}) + V(d_{56})\bigr] \\
& + \tfrac{1}{16}\bigl[V(d_{13}) + V(d_{23}) + V(d_{34}) + V(d_{35}) + V(d_{68})\bigr] \\
& + \tfrac{1}{8}\bigl[-V(d_{17}) - V(d_{27}) + V(d_{36}) + V(d_{38}) + V(d_{47}) + V(d_{57})\bigr] \\
& + \tfrac{1}{4}\bigl[V(d_{45}) - V(d_{67}) + V(d_{78})\bigr] - \tfrac{1}{2}\bigl[V(d_{58}) + V(d_{48}) + V(d_{37})\bigr].
\end{aligned}
\tag{28}
$$

Here nodes 6, 7, and 8 are the internal nodes of figure 1b: node 6 joins taxa 1 and 2, node 7 is where the lineage of taxon 3 attaches, and node 8 joins taxa 4 and 5. Computer programs for the above formulas are available on request by sending a floppy disk to the author.

Test of Significance of an Inferred Phylogeny

In the case of three taxa with one or two outgroups, the above results can be used to test the significance of an inferred phylogeny. Since in this case there is only one internal branch, i.e., branch c, testing the significance of the internal branch is equivalent to testing the significance of the inferred phylogeny. More explicitly, the null hypothesis is that the true phylogeny is a trichotomy, i.e., the three taxa diverged at the same time. This hypothesis is the same as the hypothesis of c = 0. Therefore, if the estimated c is significantly >0, the null hypothesis of trichotomy is rejected and the inferred branching order can be taken as statistically significant. The same argument applies to the case of four taxa with no outgroup if one considers unrooted trees. This is easy to see from figure 1a: since branch c is the only internal branch, the inferred topology can be taken as significant if c is significantly >0.

When the number of taxa under study is more than four, the situation becomes complicated. For example, in the case of five taxa there are two internal branches (fig. 1b), and the probability for (only) one of them to become by chance significantly greater than zero at the level of $\alpha$ = 5% is $2\alpha$ = 10%.
Thus, in this case one cannot reject the null hypothesis that all the internal branches have zero length, i.e., that all the taxa diverged at the same time point and form a star phylogeny; of course, this null hypothesis can be rejected if $\alpha \le 2.5\%$. On the other hand, the probability for both internal branches to be by chance significant at the level of $\alpha$ = 5% is approximately only $\alpha^2$ = 0.0025 (it is not strictly $\alpha^2$ because the two internal branch lengths are not estimated independently). Hence, the requirement of all internal branches being significant seems to be too stringent a test for the significance of the inferred topology. Another difficulty is that one cannot draw a conclusion about the significance of an inferred tree topology as long as one or more of the internodal distances are nonsignificant; of course, the uncertainty can be restricted to a subset of taxa. In short, a more careful study is required for understanding the problem of testing the significance of an inferred phylogeny when more than four taxa are involved.
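The two probabilities quoted for five taxa can be made explicit with a short worked calculation. It treats the two internal branches as if they were tested independently, which is an assumption made here only for illustration; as noted above, the two estimates are in fact correlated, so the figures are approximate.

$$
\begin{aligned}
\Pr(\text{exactly one branch significant}) &= 2\alpha(1 - \alpha) = 0.095 \approx 2\alpha = 0.10 \quad (\alpha = 0.05), \\
\Pr(\text{at least one branch significant}) &= 1 - (1 - \alpha)^2 = 2\alpha - \alpha^2 = 0.0975, \\
\Pr(\text{both branches significant}) &= \alpha^2 = 0.0025, \\
\Pr(\text{at least one branch significant}) &= 1 - (0.975)^2 \approx 0.049 \quad (\alpha = 0.025).
\end{aligned}
$$

The last line shows why testing each branch at the 2.5% level keeps the overall chance of falsely rejecting the star phylogeny near 5%.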

I now come back to the case of three taxa, where the task is to test the null hypothesis of trichotomy or c = 0. The above formulas for the mean and variance of c were derived under the assumption that the inferred branching order of the three taxa was ((1, 2) 3); the notation ((i, j) k) means that lineage k branched off earlier than did lineages i and j. If, instead, the inferred branching pattern is ((1, 3) 2), then the subscripts 2 and 3 in the above formulas should be exchanged, and if the inferred branching pattern is ((2, 3) 1), subscripts 1 and 3 should be exchanged. Under the null hypothesis of trichotomy, the three branching patterns ((1, 2) 3), ((1, 3) 2), and ((2, 3) 1) occur with equal probability. However, for each set of data only one pattern can occur and only one c can be positive and is tested for significant deviation from 0, so that there is no multiple-test problem. Moreover, regardless of which pattern occurs, the probability that c will assume a particular (nonnegative) value is the same. If the distribution of c is the same as the distribution of |x|, where x is a standard normal random variate, then the standard statistical test based on the standard normal distribution can be applied. In particular, the estimated c is significant at the 5% level if the ratio of mean to SE is ≥1.96, and it is significant at the 1% level if the ratio is ≥2.60. Obviously, the case of four taxa with no outgroup can be treated in the same manner, if one considers unrooted trees.

To test the accuracy of the level of significance defined by the above criteria, I conducted a computer simulation for the case of three taxa with one outgroup. I assumed that the three taxa diverged at the same time, and I used the two-parameter model of nucleotide substitution. The simulation results are shown in table 1.
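Since the simulation relies on the two-parameter model, it may help to see formulas (16)-(19) written out as code. The following is a minimal sketch with hypothetical function names; it only evaluates the formulas and is not the simulation program itself.

```python
import math

def k2p_components(P, Q):
    """Formula (16): return (A, B), the transitional and transversional
    substitutions per site; the distance is d = A + B."""
    x = 1.0 / (1.0 - 2.0 * P - Q)
    y = 1.0 / (1.0 - 2.0 * Q)
    A = 0.5 * math.log(x) - 0.25 * math.log(y)
    B = 0.5 * math.log(y)
    return A, B

def k2p_distance(P, Q):
    """Distance d = A + B under the two-parameter model."""
    A, B = k2p_components(P, Q)
    return A + B

def k2p_variance(P, Q, L):
    """Formula (17): V(d) = [x^2 P + z^2 Q - (xP + zQ)^2] / L."""
    x = 1.0 / (1.0 - 2.0 * P - Q)
    y = 1.0 / (1.0 - 2.0 * Q)
    z = (x + y) / 2.0
    return (x * x * P + z * z * Q - (x * P + z * Q) ** 2) / L

def expected_PQ(A, B):
    """Formulas (18)-(19): expected P and Q for given A and B, used for
    distances that involve an internal node."""
    Q = 0.5 * (1.0 - math.exp(-2.0 * B))
    P = 0.5 * (1.0 - Q - math.exp(-(2.0 * A + B)))
    return P, Q


# Illustrative values only: 3% transitional and 1% transversional differences
# over 1,000 sites.
d = k2p_distance(0.03, 0.01)
se = math.sqrt(k2p_variance(0.03, 0.01, 1000))
print(d, se)
```

When transitions make up one-third of all differences (Q = 2P), x = y = z and both functions reduce to the one-parameter formulas (13) and (14) with f = 3/4, which is a convenient check on an implementation.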
In the table a, b, and d denote the expected lengths of the three lineages (i.e., expected numbers of substitutions per nucleotide site), while e denotes the expected length from the common ancestor of the three taxa to the outgroup. Let r be the ratio of the estimated c value to the SE. The percentage of replicates with r ≥ 1.96 is <5% when a, b, and d are <0.20 (table 1) but tends to be somewhat >5% when a, b, and d are ≥0.20, suggesting that under the latter situation a slightly higher r value, say ≥2.2, is required for the 5% significance level. On the other hand, the percentage of replicates with r ≥ 2.60 is usually <1%. Therefore, although the simulation results do not support the assumption of normality for the distribution of c, the standard normal test appears to be generally applicable. In the two cases where d is larger than a and b, so that the rate-constancy assumption is violated, the percentages of replicates with r ≥ 1.96 or 2.60 are similar to those for the cases where rate constancy holds. In the above simulation I have not considered branch lengths >0.45 because at this stage of divergence the distance between two sequences is close to 1, so that estimates of the number of substitutions per site will become unreliable (e.g., see Li et al. 1985).

Table 1
Percentage of Replicates with r Exceeding a Specified Value

Branch length (no. of substitutions/site)      Percentage, α = 1/3      Percentage, α = 2/3
a = b    d       e       L        r ≥ 1.96   r ≥ 2.60    r ≥ 1.96   r ≥ 2.60
0.05     0.05    0.15    1,000    1.5        0.0         1.5        0.0
                         4,000    3.2        0.4         4.0        0.4
                         8,000    4.8        0.8         3.2        0.8
0.10     0.10    0.20    1,000    3.2        0.0         4.3        0.3
                         4,000    2.0        0.0         4.8        0.0
                         8,000    1.6        0.0         4.0        0.0
0.20     0.20    0.25    1,000    4.6        0.1         4.7        0.4
                         4,000    4.8        1.6         6.8        1.6
                         8,000    6.4        1.6         7.2        0.8
0.30     0.30    0.35    1,000    4.9        0.6         5.3        0.7
                         4,000    7.2        0.8         7.6        0.8
                         8,000    6.4        0.8         6.4        0.0
0.30     0.33    0.35    1,000    4.7        0.6         5.4        0.9
                         4,000    5.2        0.4         5.6        1.6
                         8,000    5.6        0.8         6.4        0.0
0.40     0.40    0.45    1,000    5.7        0.8         6.2        0.6
                         4,000    6.6        1.2         6.8        0.8
                         8,000    6.4        0.8         6.0        1.2
0.40     0.43    0.45    1,000    5.7        1.3         5.8        0.4
                         4,000    6.8        0.8         6.4        0.8
                         8,000    6.4        0.8         7.2        0.8

NOTE. In all cases, the true value of c is 0. L = number of nucleotide sites studied; α = proportion of transitional substitutions; r = 1.96 is significant at the 5% level and r = 2.60 is significant at the 1% level under the assumption of the standard normal distribution. In each case the number of replicates is 1,000 for L = 1,000, 250 for L = 4,000, and 125 for L = 8,000.

Numerical Examples

To better understand the theory developed above, consider some numerical examples. I assume that the rate of nucleotide substitution is constant over time and that the observed number of substitutions between each pair of sequences is equal to the expected value.

First, consider the case of three species with an outgroup (taxon 4) (fig. 1a). In table 2, α denotes the proportion of transitional changes; α = 1/3 if substitutions occur randomly. The SE, which is the square root of V(c), is larger for α = 2/3 than for α = 1/3. Since transitional changes generally occur more often than transversional changes (Brown et al. 1982; Li et al. 1984), the two-parameter model is more realistic than the one-parameter model; the former is applicable to all α values, whereas the latter is applicable only to α = 1/3.

Table 2
SE of the Estimate of the Length of Branch c in Figure 1a

c        α      SE (a)    c/SE (a)   L (b)
0.010    1/3    0.0046    2.17       850
         2/3    0.0050    2.00       1,000
0.005    1/3    0.0039    1.28       2,500
         2/3    0.0043    1.16       3,000
0.001    1/3    0.0032    0.31       41,000
         2/3    0.0037    0.27       54,000

NOTE. The branch lengths in fig. 1a are a = b = 0.05, d = 0.05 + c, and e = 0.15 - c. Symbols are as defined in the text and table 1.
(a) Computed under the assumption that L = 1,000.
(b) Number of nucleotide sites required for the ratio c/SE to be ≥2 (i.e., to be approximately 5% significant).

The ratio c/SE can be used to test whether c is significantly >0. A ratio of 2 can be taken as significant at the 5% level. The SE and c/SE values in table 2 were obtained for L = 1,000. When c = 0.01, the ratio is 2 or larger if α ≤ 2/3. Thus, this case requires only a small amount of sequence data to resolve the branching order of the three species. When c = 0.005, the ratio is considerably smaller than 2; for example, the ratio is 1.28 for α = 1/3. Formulas (14) and (17) imply that V(c) is inversely proportional to L, so that the SE is inversely proportional to the square root of L. Therefore, for the ratio to increase from 1.28 to 2 the L value should increase from 1,000 to L = 1,000 × (2/1.28)² ≈ 2,500. The other L values in table 2 were obtained in the same manner. If c = 0.001, then the number of nucleotide sites needed to be studied is rather large, >40,000. Saitou and Nei (1986) have earlier considered this problem from a different angle. They studied the probability of obtaining the correct topology as a function of the number of nucleotides studied under various tree-making methods.

Next, consider the amount of reduction in V(c) when a second outgroup (taxon 5) is added (fig. 1b). Let us denote the V(c) value for the case of one outgroup by $V_1(c)$ and that for the case of two outgroups by $V_2(c)$. A comparison of these two values is shown in table 3. The reduction increases as $V_1(c)$ becomes larger. Since V(c) is inversely proportional to L, a reduction in V(c) can also be achieved by increasing L. Is it more advantageous to increase L or to add a second outgroup? The total number of nucleotides sequenced is 4L for the case of one outgroup and 5L for the case of two outgroups, the latter being 1.25 times the former.
Therefore, if the same total number of nucleotides is to be sequenced, it is less advantageous to add a second outgroup than to increase L if $V_1(c)/V_2(c) < 1.25$, whereas the reverse is true if the ratio is >1.25. In table 3 the ratio is ≤1.25 for the first six cases and is >1.25 for the last six cases. Since the ratio tends to increase with $V_1(c)$, in general it is more advantageous to increase L if $V_1(c)$ is relatively small but more advantageous to add a second outgroup if $V_1(c)$ is relatively large. In all the cases in table 3, the distances from sequences 4 and 5 to the other three are the same, i.e., g = f in figure 1b, so that the fifth sequence is as good a reference as the fourth one. If the fifth is more distantly related to the other three than the fourth sequence is, then the reduction in V(c) is expected to be smaller than those shown in table 3. Further, the effect will also be reduced if sequences 4 and 5 are closely related to each other.

Table 3
Variance [V(c)] of the Estimate of the Length of Branch c in Figure 1b

a = b   d       f = g   e       c       α     V1(c) (a) (×10⁻⁴)   V2(c) (×10⁻⁴)   V1(c)/V2(c)
0.02    0.025   0.05    0.025   0.005   1/3   0.072               0.066           1.09
                                        2/3   0.079               0.071           1.11
                        0.029   0.001   1/3   0.028               0.023           1.22
                                        2/3   0.033               0.027           1.22
0.05    0.055   0.10    0.045   0.005   1/3   0.153               0.126           1.21
                                        2/3   0.188               0.150           1.25
                        0.049   0.001   1/3   0.104               0.079           1.32
                                        2/3   0.135               0.100           1.35
0.10    0.110   0.15    0.040   0.010   1/3   0.445               0.351           1.27
                                        2/3   0.577               0.444           1.30
                        0.045   0.005   1/3   0.376               0.287           1.31
                                        2/3   0.501               0.374           1.34

NOTE. Symbols are as defined in the text and table 1.
(a) Obtained under the assumption that the second outgroup (taxon 5 in fig. 1b) is not available.

Discussion

Heterogeneous Data

Phylogenetic studies often use sequence data from different DNA regions. If all the regions studied have similar rates of nucleotide substitution, then all the data can be combined together into one single set. However, if substantial variation in rates exists, regions with different rates should be treated separately. The question then arises as to how to test the significance when the results from different data sets are combined. A simple test procedure is the inverse χ² method (Fisher 1932). Suppose that there are k different data sets. Let $P_i$ be the significance level (probability) estimated from the ith data set. If the null hypothesis is true (i.e., the three taxa represent a trichotomy), then $-2\ln(P_i)$ has a χ² distribution with 2 degrees of freedom and

$$P = -2 \sum_{i=1}^{k} \ln(P_i)$$

has a χ² distribution with 2k degrees of freedom. The probability corresponding to the computed P value can be easily obtained from a χ² table.

Branching Order of Human, Chimpanzee, and Gorilla

Holmquist et al. (1988) have recently applied Lake's (1987) method of phylogenetic reconstruction to study the human-chimp-gorilla trichotomy problem by using two sets of data: (1) nuclear DNA for a 10-kb region around the η-globin pseudogene locus (Miyamoto et al. 1987; Maeda et al. 1988, and references therein) and (2) mitochondrial (mt) DNA for the 896-bp fragment characterized by Brown et al. (1982). I applied the present theory to the same two sets of data. From the nucleotide differences tabulated in table 2 of Holmquist et al. (1988), I obtained the proportions of transitional and transversional differences between each pair of the four species: human, chimpanzee, gorilla, and orangutan (table 4). Here the problem is to determine the neighbor pairs. In both sets of data the transformed-distance method pairs human and chimpanzee in one clade.
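The normal test on c/SE and the inverse χ² method described under Heterogeneous Data can be sketched together as follows. The function names are hypothetical, and the closed-form tail probability is a convenience introduced here in place of a χ² table; it is exact only because the number of degrees of freedom, 2k, is even.

```python
import math

def two_sided_p(ratio):
    """Probability that |x| >= ratio for a standard normal variate x."""
    return math.erfc(abs(ratio) / math.sqrt(2.0))

def inverse_chi_square(p_values):
    """Fisher's inverse chi-square method.

    Returns the statistic -2 * sum(ln P_i) and its upper-tail probability
    under a chi-square distribution with 2k degrees of freedom.
    """
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # For even degrees of freedom 2k the chi-square upper tail is
    # exp(-s/2) * sum_{j<k} (s/2)^j / j!.
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
    return statistic, tail


# Two data sets whose individual tests gave probabilities 0.16 and 0.12.
statistic, combined = inverse_chi_square([0.16, 0.12])
print(round(statistic, 2), round(combined, 2))
```

With probabilities of 0.16 and 0.12 from two data sets, the statistic is about 7.9 on 4 degrees of freedom and the combined probability is about 0.10.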
Applying the present theory to the nuclear DNA data, I obtain a = 0.0061, b = 0.0077, and c = 0.00032 with SE = 0.00022 if the one-parameter model of nucleotide substitution is used and a = 0.0060, b = 0.0077, and c = 0.00034 with SE = 0.00024 if the two-parameter model is used. Under both models the ratio c/SE is approximately 1.42 and the probability for this to occur is approximately 16%. For the mtDNA data, I obtain a = 0.0412, b = 0.0517, c/SE = 0.0086/0.0046 = 1.87, and a probability of approximately 6% for the one-parameter model and a = 0.0422, b = 0.0532, c/SE = 0.0086/0.0055 = 1.56, and a probability of approximately 12% for the two-parameter model (a more rigorous treatment of the mtDNA data should consider different types of regions separately). In this case the two-parameter model is much more realistic because there is a strong bias for transitional changes (Brown et al. 1982). To test the significance of the combined results, note that $P_1 = 0.16$, $P_2 = 0.12$, and $-2\sum \ln(P_i) = 7.90$. Since the probability of χ² = 7.90 with 4 degrees of freedom is 0.10, one cannot reject the hypothesis of trichotomy.

Table 4
P and Q between Species

             Human             Chimp             Gorilla           Orangutan
             P        Q        P        Q        P        Q        P        Q
Human                          0.0817   0.0056   0.0929   0.0090   0.1198   0.0392
Chimp        0.0093   0.0043                     0.0963   0.0100   0.1332   0.0381
Gorilla      0.0109   0.0034   0.0119   0.0041                     0.1300   0.0370
Orangutan    0.0195   0.0093   0.0200   0.0101   0.0215   0.0089

NOTE. The values below the diagonal were computed from the 10,046-bp region around the η-globin pseudogene locus (Miyamoto et al. 1987; Maeda et al. 1988); the values above the diagonal were computed from the mtDNA for the 896-bp fragment of Brown et al. (1982).

For both sets of data Lake's test gave a probability of approximately 25% (Holmquist et al. 1988), which is considerably larger than the probabilities obtained above. Holmquist et al. have combined the two sets of data in one set and obtained a probability of approximately 3%. One reason for this low probability is as follows: In Lake's test one calculates a parsimony-like term P and a background term B for each of the three alternative trees. Under the null hypothesis that the tree under consideration is wrong, P and B are statistically equal. Holmquist et al. used the binomial distribution to test the equality of P = B. It happened that for both sets of data P = 3 and B = 0 for the tree with human and chimpanzee in one clade, and so, when the two sets of data were combined, P = 6 and B = 0, from which Holmquist et al. obtained a probability of 3%. Since the mtDNA segment used has evolved six to seven times faster than the nuclear DNA segment (see the above a and b values), both the P and B values should have different probability distributions for the two sets of data. It is therefore not clear that they should be combined together as was done by Holmquist et al. This problem deserves a more careful study. A more serious problem is that in Lake's method three independent tests (one for each of the three alternative trees) are conducted. Thus, the probability for rejecting the null hypothesis should be $1 - (1 - P)^3 \approx 3P = 9\%$, instead of 3%.

APPENDIX

Variances of Branch Lengths

The variances of internal branch lengths have already been given in the text.
Here I present the variances of peripheral branch lengths. First, let us consider the case of four taxa. From equation (7),

$$
\begin{aligned}
V(a) ={}& \tfrac{1}{4}V(d_{12}) + \tfrac{1}{2}\left[V(d_{15}) - V(d_{25})\right] + \tfrac{1}{16}\left[V(d_{13}) + V(d_{14}) + V(d_{23}) + V(d_{24})\right] \\
&+ \tfrac{1}{8}\left[V(d_{16}) + V(d_{26}) - V(d_{35}) - V(d_{45}) - 2V(d_{56})\right].
\end{aligned}
$$

The four peripheral branches a, b, d, and e of figure 1a are topologically equivalent, so that the variance for any of the other branch lengths can be readily obtained from the above formula by exchanging the subscripts; for example, a comparison of equations (7) and (10) shows that to obtain V(d) one needs to exchange 1 with 3, 2 with 4, and 5 with 6 (see fig. 1a).

Next, consider the case of five taxa. From equation (20),

$$
\begin{aligned}
V(a) ={}& \tfrac{1}{4}V(d_{12}) + \tfrac{1}{2}\left[V(d_{16}) - V(d_{26})\right] + \tfrac{1}{36}\left[V(d_{13}) + V(d_{23}) + V(d_{14}) + V(d_{24}) + V(d_{15}) + V(d_{25})\right] \\
&+ \tfrac{1}{18}\left[2V(d_{17}) + V(d_{18}) + 2V(d_{27}) + V(d_{28}) - V(d_{36}) - V(d_{46}) - V(d_{56}) - 4V(d_{67}) - 2V(d_{68})\right].
\end{aligned}
$$

The four peripheral branches a, b, f, and g of figure 1b are topologically equivalent, so that the variances of b, f, and g can be obtained from the above formula by exchanging subscripts. The variance of d is given by

$$
\begin{aligned}
V(d) ={}& \tfrac{1}{16}\left[V(d_{13}) + V(d_{23}) + V(d_{34}) + V(d_{35})\right] \\
&+ \tfrac{1}{8}\left[V(d_{36}) + 4V(d_{37}) + V(d_{38}) - V(d_{17}) - V(d_{27}) - 2V(d_{67}) - V(d_{74}) - 2V(d_{78}) - V(d_{75})\right] \\
&+ \tfrac{1}{64}\left[V(d_{14}) + V(d_{24}) + V(d_{15}) + V(d_{25})\right] + \tfrac{1}{32}\left[V(d_{64}) + V(d_{18}) + 2V(d_{68}) + V(d_{28}) + V(d_{65})\right].
\end{aligned}
$$

Acknowledgments

I thank R. Chakraborty, M. Gouy, P. Pamilo, P. M. Sharp, and K. H. Wolfe for suggestions. This study was supported by NIH grant GM30998.

LITERATURE CITED

BRITTEN, R. J. 1986. Rates of DNA sequence evolution differ between taxonomic groups. Science 231:1393-1398.
BROWN, W. M., E. M. PRAGER, A. WANG, and A. C. WILSON. 1982. Mitochondrial DNA sequences of primates: tempo and mode of evolution. J. Mol. Evol. 18:225-239.
CAVALLI-SFORZA, L. L., and A. W. F. EDWARDS. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Genet. 19:233-257.
CAVENDER, J. A. 1978. Taxonomy with confidence. Math. Biosci. 40:271-280.
CHAKRABORTY, R. 1977. Estimation of time of divergence from phylogenetic studies. Can. J. Genet. Cytol. 19:217-223.
ECK, R. V., and M. O. DAYHOFF. 1966. Atlas of protein sequence and structure. National Biomedical Research Foundation, Silver Spring, Md.
EDWARDS, A. W. F., and L. L. CAVALLI-SFORZA. 1964. Reconstruction of evolutionary trees. Pp. 67-76 in V. H. HEYWOOD and J. MCNEILL, eds. Phenetic and phylogenetic classification. Systematics Association Publication 6. Systematics Association, London.
FARRIS, J. S. 1977. On the phenetic approach to vertebrate classification. Pp. 823-850 in M. K. HECHT, P. C. GOODY, and B. M. HECHT, eds. Major patterns in vertebrate evolution. Plenum, New York.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368-376.
FELSENSTEIN, J. 1985a. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.
FELSENSTEIN, J. 1985b. Confidence limits on phylogenies with a molecular clock. Syst. Zool. 34:152-161.
FISHER, R. A. 1932. Statistical methods for research workers. 4th ed. Oliver & Boyd, London.
FITCH, W. M. 1977. On the problem of discovering the most parsimonious tree. Am. Nat. 111:223-257.
HOLMQUIST, R., M. M. MIYAMOTO, and M. GOODMAN. 1988. Analysis of higher-primate phylogeny from transversion differences in nuclear and mitochondrial DNA by Lake's methods of evolutionary parsimony and operator metrics. Mol. Biol. Evol. 5:217-236.
JUKES, T. H., and C. R. CANTOR. 1969. Evolution of protein molecules. Pp. 21-132 in H. N. MUNRO, ed. Mammalian protein metabolism. Academic Press, New York.
KIMURA, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16:111-120.
KIMURA, M., and T. OHTA. 1972. On the stochastic model for estimation of mutational distance between homologous proteins. J. Mol. Evol. 2:87-90.
KLOTZ, L. C., N. KOMAR, R. L. BLANKEN, and R. M. MITCHELL. 1979. Calculation of evolutionary trees from sequence data. Proc. Natl. Acad. Sci. USA 76:4516-4520.
LAKE, J. A. 1987. A rate-independent technique for analysis of nucleic acid sequences: evolutionary parsimony. Mol. Biol. Evol. 4:167-191.
LI, W.-H. 1981. Simple method for constructing phylogenetic trees from distance matrices. Proc. Natl. Acad. Sci. USA 78:1085-1089.
LI, W.-H., C.-C. LUO, and C.-I. WU. 1985. Evolution of DNA sequences. Pp. 1-94 in R. J. MACINTYRE, ed. Molecular evolutionary genetics. Plenum, New York.
LI, W.-H., M. TANIMURA, and P. M. SHARP. 1987. An evaluation of the molecular clock hypothesis using mammalian DNA sequences. J. Mol. Evol. 25:330-342.
LI, W.-H., C.-I. WU, and C.-C. LUO. 1984. Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J. Mol. Evol. 21:58-71.
MAEDA, N., C.-I. WU, J. BLISKA, and J. RENEKE. 1988. Molecular evolution of intergenic DNA in higher primates: pattern of DNA changes, molecular clock and evolution of repetitive sequences. Mol. Biol. Evol. 5:1-20.
MIYAMOTO, M. M., J. L. SLIGHTOM, and M. GOODMAN. 1987. Phylogenetic relationships of human and African apes as ascertained from DNA sequences (7.1 kilobase pairs) of the ψη-globin region. Science 238:369-373.
MUELLER, L. D., and F. J. AYALA. 1982. Estimation and interpretation of genetic distance in empirical studies. Genet. Res. 40:127-137.
NEI, M., J. C. STEPHENS, and N. SAITOU. 1985. Methods for computing the standard errors of branching points in an evolutionary tree and their application to molecular data from humans and apes. Mol. Biol. Evol. 2:66-85.
SAITOU, N., and M. NEI. 1986. The number of nucleotides required to determine the branching order of three species with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24:189-204.
SAITOU, N., and M. NEI. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406-425.
SNEATH, P. H. A., and R. R. SOKAL. 1973. Numerical taxonomy. W. H. Freeman, San Francisco.
TEMPLETON, A. R. 1983. Phylogenetic inference from restriction endonuclease cleavage site maps with particular reference to the evolution of humans and the apes. Evolution 37:221-244.
WU, C.-I., and W.-H. LI. 1985. Evidence for higher rates of nucleotide substitution in rodents than in man. Proc. Natl. Acad. Sci. USA 82:1741-1745.

WALTER M. FITCH, reviewing editor
Received September 28, 1988; revision received January 10, 1989