THE TRIPLES DISTANCE FOR ROOTED BIFURCATING PHYLOGENETIC TREES


Syst. Biol. 45(3):323-334, 1996

DOUGLAS E. CRITCHLOW, DENNIS K. PEARL, AND CHUNLIN QIAN

Department of Statistics, Ohio State University, Columbus, Ohio 43210, USA; (D.K.P.)

Abstract. We investigated the triples distance as a measure of the distance between two rooted bifurcating phylogenetic trees. The triples distance counts the number of subtrees of three taxa that are different in the two trees. Exact expressions are given for the mean and variance of the sampling distribution of this distance measure. Also, a normal approximation is proved under the class of label-invariant models on the distribution of trees. The theory is applied to the use of the triples distance as a statistic for testing the null hypothesis that the similarities in two trees can be explained by independent random structures. In an example, two phylogenies that describe the same seven species of chlorococcalean zoosporic green algae are compared: one phylogeny based on morphological characteristics and one based on ribosomal RNA gene sequence data. [Tree comparison metrics; random trees; label-invariant models; hypothesis test.]

Developing interpretable measures of the distance between trees and of their sampling distributions under various probability models is important to the study of phylogenetic inference. Distance measures are a valuable tool for comparing phylogenetic trees created from two or more sources of data (e.g., Penny et al., 1982; Bledsoe and Raikow, 1990; Penny et al., 1991; Swofford, 1991; Estabrook, 1992), for reporting the results of a bootstrap analysis or of a comparison of phylogeny algorithms (e.g., Kuhner and Felsenstein, 1994), for making confidence statements about a proposed phylogeny, and for examining subtrees of particular taxa. As an example of the first use, consider the two phylogenies presented in Figure 1 for seven species of chlorococcalean zoosporic green algae.
Figure 1a shows a rooted bifurcating tree based on an assessment of certain morphological characteristics (primarily the details of the flagellar apparatus of motile cells), and Figure 1b shows the parsimony tree based on ribosomal RNA gene (rDNA) sequence data (Wilcox et al., 1992). Can we quantify the difference between the trees? Can the similarities in the two trees be explained by random chance? In this paper, we propose using the number of subtrees of three taxa that are different in the two trees as a measure of the distance between them. To answer the second question, we find the mean and variance of this statistic, along with its asymptotic distribution, under the model that the two trees have completely independent structures.

FIGURE 1. Two phylogenies for chlorococcalean zoosporic green algae. 1 = Glycine max; 2 = Characium perforatum; 3 = Friedmannia israelensis; 4 = Parietochloris pseudoalveolaris; 5 = Dunaliella parva; 6 = Characium hindakii; 7 = Chlamydomonas. (a) Tree based on morphological characteristics. (b) Tree based on 18S rDNA sequence data.

Several metrics for comparing phylogenies of n taxa have previously been suggested, and the most appropriate one to use depends on the underlying question of a particular investigation (Penny and Hendy, 1985). The branch-swapping metric proposed by Waterman and Smith (1978) computes the number of nearest-neighbor interchanges required to convert one tree into another. It has an appealing interpretation but cannot be calculated in polynomial time (i.e., the time required to compute the metric grows faster than any power of n). The symmetric difference metric discussed by Robinson and Foulds (1981) counts the number of partitions of the taxa, created by deleting internal edges, that differ in the two trees. The time required to calculate this metric is proportional to n for large trees (Day, 1985), its small sample distribution has been tabulated for specific probability models on the set of trees (Hendy et al., 1984), and a large sample Poisson approximation has been proved for the general class of label-invariant probability models (Steel, 1988; Steel and Penny, 1993). The quartet metric for unrooted trees proposed by Estabrook et al. (1985) and studied quantitatively by Day (1986) counts the number of unrooted subtrees of four taxa that are different in the two trees. This metric can be calculated in $O(n^3)$ time for a tree of n taxa, and its approximate variance was given for bifurcating trees by Steel and Penny (1993) (exactly under the model that all such trees are equally likely).

The triples distance for rooted trees was suggested by Dobson (1975) as a method of comparing the shapes of trees, although she did not study aspects of its calculation or distribution. Each of the above metrics measures dissimilarity only with respect to the labeled topology of a phylogenetic tree, which is the theme of the present paper. Other metrics also consider differences in the branch lengths joining the taxa (e.g., Lapointe and Legendre, 1990, 1992). The triples distance for rooted bifurcating trees is a close cousin of the quartet measure for unrooted trees. Consequently, the distributional results we obtained may also give new findings for the quartet distance. Regardless of the analytical distributional results available for a particular measure, it is always possible to carry out significance tests using simulation techniques (Shao and Sokal, 1986). For example, the triples distance was used by Page (1988) to test biogeographical hypotheses using simulation methods.

In this paper, we provide a formal definition of the triples distance and find an exact expression for the mean and variance of this statistic when the two trees are independent and we assume a label-invariant probability model on the set of all rooted bifurcating trees. The triples distance has a limiting normal distribution under this class of probability models. The triples distance can be calculated in $O(n^2)$ time, and we applied the probabilistic theory to a hypothesis test of the independence of the trees in Figure 1. The proof of the normal approximation theorem and tables of the null distribution of the triples distance for $n \le 50$ under two specific models are also provided.

THE PROBABILITY DISTRIBUTION OF THE TRIPLES DISTANCE

Basic Description and Notation

Consider two labeled rooted bifurcating trees, each having the same n taxa (as in Fig. 1). We follow the usage of Steel and Penny (1993) in calling a labeled topology a tree (ignoring branch lengths) and generally further restrict our attention to the rooted bifurcating case. The triples distance $S_n$ between two such trees is defined as follows. For each triple {i, j, k} of distinct taxa in one of the original trees, consider the subtree that relates these three taxa alone. There are just three possibilities for this subtree, depending on which of i, j, and k is the most distant leaf relative to the other two. Let the indicator function

$$I_{ijk} = \begin{cases} 1 & \text{if taxa } i, j, k \text{ have different subtrees in the two trees} \\ 0 & \text{if taxa } i, j, k \text{ have the same subtree in the two trees} \end{cases}$$

and define the triples distance as

$$S_n = \sum_{\{i,j,k\}} I_{ijk},$$

where the summation is over all the possible unordered triples {i, j, k} of distinct taxa. For example, to compute the triples distance between the two trees in Figure 1, there are $\binom{7}{3} = 35$ subtrees of size three to be examined. The subtree made up of the triple {Glycine max, Characium perforatum, Friedmannia israelensis} = {1, 2, 3} is congruent in both trees because G. max is the most distant leaf among the three in each case. However, the subtree made up of {C. perforatum, F. israelensis, Parietochloris pseudoalveolaris} = {2, 3, 4} is incongruent. The overall triples distance equals 15 for the two trees because the 15 triples {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 3, 7}, {2, 4, 6}, {2, 5, 6}, {2, 5, 7}, {3, 4, 5}, {3, 4, 6}, {3, 4, 7}, {3, 5, 7}, {3, 6, 7}, {4, 5, 6}, {4, 5, 7}, and {5, 6, 7} are incongruent in the two trees. A fast general algorithm for calculating $S_n$ is provided.

The probability distribution of the triples distance between two trees depends on the underlying distribution of the trees themselves. We investigated probabilistic properties of the triples distance under the assumption that both trees are drawn independently from the same underlying probability distribution. These probabilistic properties of $S_n$ are of interest in their own right and are also useful for developing a statistical test of the hypothesis of independence. Initially, the underlying probability distribution on trees was taken to be the uniform model, and then the results were extended to general label-invariant distributions.

Distribution of $S_n$ under the Uniform Model

For n taxa, there are $(2n-3)!! = (2n-3)(2n-5)\cdots(3)(1)$ possible labeled rooted bifurcating trees. A simple probability distribution of interest is the uniform model (e.g., Shao and Rohlf, 1983), under which each of these possible trees is assigned equal probability $[(2n-3)!!]^{-1}$.
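The definition of $S_n$ given under Basic Description and Notation can be checked directly by looping over every triple. The following is a minimal sketch, assuming trees are stored as nested two-element tuples with arbitrary hashable leaf labels; the function names and the four-taxon example trees are hypothetical and are not the trees of Figure 1.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of taxa below a node; a leaf is any non-tuple label."""
    if not isinstance(tree, tuple):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def outlier(tree, triple):
    """Return the taxon of `triple` that is most distant from the other two."""
    triple = set(triple)
    while True:
        left = triple & leaves(tree[0])
        if len(left) == 3:        # all three taxa sit in the left subtree
            tree = tree[0]
        elif len(left) == 0:      # all three taxa sit in the right subtree
            tree = tree[1]
        elif len(left) == 1:      # the lone taxon on the left splits off first
            return left.pop()
        else:                     # two on the left, so the right one splits off
            return (triple - left).pop()

def triples_distance(tree_a, tree_b):
    """Count the triples whose three-taxon subtrees differ in the two trees."""
    taxa = sorted(leaves(tree_a))
    assert set(taxa) == leaves(tree_b), "both trees must share the same taxa"
    return sum(outlier(tree_a, t) != outlier(tree_b, t)
               for t in combinations(taxa, 3))

# Hypothetical four-taxon example (not the trees of Figure 1).
tree_a = (((1, 2), 3), 4)   # unbalanced shape
tree_b = ((1, 2), (3, 4))   # balanced shape
print(triples_distance(tree_a, tree_b))   # 2: triples {1,3,4} and {2,3,4} differ
```

This version recomputes leaf sets at every step, so it is slower than necessary; it is meant only to mirror the definition.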
If two trees are drawn independently from this model, it is straightforward to check that for each triple {i, j, k}, $I_{ijk}$ is a Bernoulli random variable with expectation 2/3 and that $I_{ijk}$ and $I_{i'j'k'}$ are independent whenever $\{i, j, k\} \cap \{i', j', k'\} = \emptyset$. Hence,

$$E(S_n) = \sum_{\{i,j,k\}} E(I_{ijk}) = \frac{2}{3}\binom{n}{3} \tag{1}$$

and

$$\operatorname{Var}(S_n) = \sum_{ijk}\operatorname{Var}(I_{ijk}) + \sum_{ijkk'}\operatorname{Cov}(I_{ijk}, I_{ijk'}) + \sum_{ijkj'k'}\operatorname{Cov}(I_{ijk}, I_{ij'k'}) = \frac{2}{9}\binom{n}{3} + 12\binom{n}{4}\operatorname{Cov}(I_{ijk}, I_{ijk'}) + 30\binom{n}{5}\operatorname{Cov}(I_{ijk}, I_{ij'k'}), \tag{2}$$

where all indices are distinct. To complete the variance calculation, note that the covariances in Equation 2 depend only on the joint probability distribution of the two subtrees containing the five taxa i, j, k, j', and k'. Independence of the two original trees implies independence of these two subtrees, and under the uniform model, each of the $(2 \cdot 5 - 3)!! = 105$ possible subtree topologies is equally likely. Thus, there are $(105)^2 = 11{,}025$ equally likely possibilities for the two subtrees. A direct computer enumeration of all these possibilities and the corresponding values of $I_{ijk}$, $I_{ijk'}$, and $I_{ij'k'}$ gives $\operatorname{Cov}(I_{ijk}, I_{ijk'}) = 8/225$ and $\operatorname{Cov}(I_{ijk}, I_{ij'k'}) = (8/105)^2$. Substituting back into Equation 2 and simplifying yields

$$\operatorname{Var}(S_n) = \frac{128}{735}\binom{n}{5} + \frac{32}{75}\binom{n}{4} + \frac{2}{9}\binom{n}{3}$$

under the uniform model. An additional result is that for large n, the triples distance is approximately normally distributed under the uniform model. This fact, combined with the preceding expectation and variance calculations, gives a useful approximation for big trees

and allows for the simple implementation of a hypothesis test of independence.

FIGURE 2. The two topologies with n = 4 taxa. There are 12 possible labeled trees of type a and three of type b.

Distribution of $S_n$ under a General Label-Invariant Model

A probability distribution on trees is said to be label invariant if the probability of a tree remains constant under an arbitrary permutation of the taxa labels (Steel and Penny, 1993). For example, in the case of n = 4 taxa, label invariance implies that the probability is a fixed constant for each of the 12 possible labeled trees of the type in Figure 2a and similarly for the three possible trees of the type in Figure 2b. Most of the derivations under the uniform model rely exclusively on the fact that the uniform distribution on trees is itself label invariant. Only the calculations of $\operatorname{Cov}(I_{ijk}, I_{ijk'})$ and $\operatorname{Cov}(I_{ijk}, I_{ij'k'})$ use any additional features of the uniform model. Thus, Equations 1 and 2 remain true under an arbitrary label-invariant model. To complete the variance calculation, $\operatorname{Cov}(I_{ijk}, I_{ijk'})$ and $\operatorname{Cov}(I_{ijk}, I_{ij'k'})$ can be found, as in the uniform case, by a direct computer enumeration of all 11,025 possibilities. (However, these possibilities are no longer all equally likely, so that their probabilities must also be computed under the label-invariant model of interest.) The limiting normality result also carries over.

Theorem 1. Under an arbitrary label-invariant distribution on trees and the assumption that the two trees are independent, $S_n$ is approximately normally distributed for large n (with mean and variance given by Eqs. 1 and 2). More precisely, $[S_n - E(S_n)]/[\operatorname{Var}(S_n)]^{1/2}$ converges in distribution to the standard normal distribution as $n \to \infty$.

Proof. See Appendix 1 for the proof, which amounts to showing that all of the moments of the standardized triples distance converge to the corresponding moments of the standard normal distribution.
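Equations 1 and 2, together with the standardization in Theorem 1, can be turned into a small calculator. The sketch below assumes the two covariances are supplied by the caller and defaults to the uniform-model values quoted above; the function names, the lower-tail convention (small distances count as evidence against independence), and the observed value in the usage lines are illustrative assumptions.

```python
from fractions import Fraction
from math import comb, erfc, sqrt

# Covariances under the uniform model, as quoted in the text.
UNIFORM_COV4 = Fraction(8, 225)        # two triples sharing two taxa
UNIFORM_COV5 = Fraction(8, 105) ** 2   # two triples sharing one taxon

def triples_moments(n, cov4=UNIFORM_COV4, cov5=UNIFORM_COV5):
    """Exact mean (Eq. 1) and variance (Eq. 2) of S_n for independent trees."""
    mean = Fraction(2, 3) * comb(n, 3)
    var = (Fraction(2, 9) * comb(n, 3)
           + 12 * comb(n, 4) * cov4
           + 30 * comb(n, 5) * cov5)
    return mean, var

def lower_tail_p(s_obs, mean, var):
    """Standardized statistic z and the normal approximation to P(S_n <= s_obs)."""
    z = (s_obs - float(mean)) / sqrt(float(var))
    return z, 0.5 * erfc(-z / sqrt(2))   # standard normal CDF evaluated at z

# Hypothetical usage: 60 taxa and an invented observed distance of 20000.
mean, var = triples_moments(60)
z, p = lower_tail_p(20000, mean, var)
print(float(mean), float(var), z, p)
```

Exact fractions are used for the moments so that the result can be compared term by term with the closed-form variance; only the final standardization switches to floating point.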
Example: Distribution of $S_n$ under the Markov Model

A widely studied example of a nonuniform label-invariant distribution on trees is the Markov model, initially examined by Harding (1971). This model is often considered to be more realistic than the uniform model in capturing the salient features of some evolutionary situations (e.g., Slowinski, 1990; Page, 1991). The Markov model is defined by combining the label-invariance property with the following recursive principle: To construct the tree distribution for n + 1 taxa from the tree distribution for n taxa, it is stipulated that each of the n existing taxa is equally likely to be the source of the next bifurcation. For example, in the case of n = 4 taxa, one can verify that the Markov model assigns a probability of 1/18 to each of the 12 possible labeled trees of the type in Figure 2a and a probability of 1/9 to each of the three trees of the type in Figure 2b. A computer enumeration reveals that, under the Markov model, $\operatorname{Cov}(I_{ijk}, I_{ijk'}) = 77/162 - 4/9 = 5/162$ and $\operatorname{Cov}(I_{ijk}, I_{ij'k'}) = 401/900 - 4/9 = 1/900$. Substituting back into Equation 2 yields

$$\operatorname{Var}(S_n) = \frac{1}{30}\binom{n}{5} + \frac{10}{27}\binom{n}{4} + \frac{2}{9}\binom{n}{3}.$$

Thus, the theorem also gives a potentially useful approximation for big trees under the Markov model.
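The computer enumeration mentioned for both models can be sketched in a few lines. The code below assumes trees are nested two-element tuples and, for the Markov model, uses a closed-form labeled-tree probability that is not stated in the paper: $2^{n-1}/n!$ times the product over internal nodes of $1/(\text{taxa below the node} - 1)$. For n = 4 this assumed formula gives 1/18 and 1/9, matching the values quoted above. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial, prod

def all_trees(taxa):
    """Yield every labeled rooted bifurcating tree on `taxa` as nested pairs."""
    taxa = tuple(taxa)
    if len(taxa) == 1:
        yield taxa[0]
        return
    first, rest = taxa[0], taxa[1:]
    # The root child holding `first` takes any proper subset of the others.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            other = tuple(x for x in rest if x not in extra)
            for left in all_trees((first,) + extra):
                for right in all_trees(other):
                    yield (left, right)

def leaves(tree):
    if not isinstance(tree, tuple):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])

def outlier(tree, triple):
    """Taxon of `triple` that is most distant from the other two."""
    triple = set(triple)
    while True:
        left = triple & leaves(tree[0])
        if len(left) == 3:
            tree = tree[0]
        elif len(left) == 0:
            tree = tree[1]
        elif len(left) == 1:
            return next(iter(left))
        else:
            return next(iter(triple - left))

def uniform_prob(tree, n):
    # Each of the (2n - 3)!! trees is equally likely.
    return Fraction(1, prod(range(1, 2 * n - 2, 2)))

def markov_prob(tree, n):
    # Assumed closed form (not from the paper), see the lead-in above.
    p = Fraction(2 ** (n - 1), factorial(n))
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            p /= len(leaves(node)) - 1
            stack.extend(node)
    return p

def triple_covariance(taxa, triple1, triple2, prob):
    """Cov(I_triple1, I_triple2) for two independent trees drawn from `prob`."""
    joint = {}
    for tree in all_trees(taxa):
        cell = (outlier(tree, triple1), outlier(tree, triple2))
        joint[cell] = joint.get(cell, 0) + prob(tree, len(taxa))
    # Both triples are congruent exactly when the two trees land in the same
    # cell; the covariance of the "different" indicators equals that of the
    # "same" indicators, each of which has expectation 1/3.
    return sum(p * p for p in joint.values()) - Fraction(1, 9)

print(triple_covariance("abcd", "abc", "abd", uniform_prob))    # 8/225
print(triple_covariance("abcd", "abc", "abd", markov_prob))     # 5/162
print(triple_covariance("abcde", "abc", "ade", uniform_prob))   # 64/11025
print(triple_covariance("abcde", "abc", "ade", markov_prob))    # 1/900
```

Because the covariances depend only on the four or five taxa involved, enumerating the 15 four-taxon trees and the 105 five-taxon trees is enough; no enumeration over the full n-taxon tree space is needed.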

Tabulation of the $S_n$ Distribution

The probability distribution of the triples distance is tabulated in Appendices 2 and 3, under both the uniform and Markov models. The tabulated probabilities are $P(S_n \le x)$ and will correspond to possible P values for the hypothesis test discussed below. These probabilities were computed exactly for small numbers of taxa ($n \le 7$) by a direct enumeration of all possible pairs of trees. For $8 \le n \le 50$, critical values were approximated by simulating 100,000 pairs of random trees from the underlying probability distribution, making the significance levels accurate to about three decimal places. The normal approximation is recommended for larger values of n, for which it seems to work adequately except in the extreme tail of the distribution.

APPLICATION OF THE THEORY TO HYPOTHESIS TESTING

Rapid Calculation of $S_n$

Along with the convenient distribution theory, the triples distance can also be computed rapidly. Obviously, a direct search over all triples allows for a "brute force" algorithm requiring $O(n^3)$ time. However, an efficient $O(n^2)$ time algorithm is equally easy to program. This algorithm assumes that the two bifurcating phylogenies to be compared are stored in the form of generational matrices. The (i, j)th entry of a generational matrix is the generation number at which taxa i and j split (i.e., the number of nodes on the path from the root to the most recent common ancestor of i and j). For example, the trees in Figures 1a and 1b are uniquely described by two symmetric 7 × 7 generational matrices, a and b.

[Matrices a and b: entries not preserved in this transcription.]

The (4, 5) element in the matrix a is 3 because taxa 4 and 5 split at the third generation from the top of that tree. Next, define the generational pattern associated with taxa i and j in the two trees to be the pair (a(i, j), b(i, j)) (i.e., i and j split at generation a(i, j) in the first tree and at generation b(i, j) in the second).
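The generational matrices and patterns just defined support a short $O(n^2)$ computation. The sketch below relies on one counting identity: a triple is congruent exactly when some taxon i has the same generational pattern with the other two taxa, so the number of congruent triples is the sum, over taxa and patterns, of the number of pairs sharing a pattern. The nested-tuple tree format and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import comb

def generational_matrix(tree, taxa):
    """g[i][j] = number of nodes from the root to the MRCA of taxa i and j."""
    index = {t: k for k, t in enumerate(taxa)}
    n = len(taxa)
    g = [[0] * n for _ in range(n)]

    def visit(node, depth):
        # Returns the taxa below `node`; the root counts as generation 1.
        if not isinstance(node, tuple):
            return [node]
        left = visit(node[0], depth + 1)
        right = visit(node[1], depth + 1)
        for x in left:
            for y in right:
                g[index[x]][index[y]] = g[index[y]][index[x]] = depth
        return left + right

    visit(tree, 1)
    return g

def triples_distance_fast(a, b):
    """Triples distance from two generational matrices in O(n^2) time."""
    n = len(a)
    congruent = 0
    for i in range(n):
        # Count how often each generational pattern occurs for taxon i.
        patterns = Counter((a[i][j], b[i][j]) for j in range(n) if j != i)
        # Every pair of taxa sharing a pattern with i is a congruent triple.
        congruent += sum(comb(c, 2) for c in patterns.values())
    return comb(n, 3) - congruent

# Hypothetical four-taxon example (not the trees of Figure 1).
taxa = [1, 2, 3, 4]
a = generational_matrix((((1, 2), 3), 4), taxa)
b = generational_matrix(((1, 2), (3, 4)), taxa)
print(triples_distance_fast(a, b))   # 2
```

Each pair of taxa is written into the matrix exactly once, at their most recent common ancestor, so building the matrices is also quadratic in the number of taxa.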
Note that for any triple i, j, k of taxa, i is the most distant leaf among i, j, k in the tree of Figure 1a if and only if a(i, j) = a(i, k), and similarly i is the most distant leaf in Figure 1b if and only if b(i, j) = b(i, k). In other words, i is the most distant leaf among i, j, k in both trees whenever the generational patterns for taxa i and j and taxa i and k are the same. It follows that

$$S_n = \binom{n}{3} - \sum_{i}\sum_{m}\binom{n(m, i)}{2},$$

where n(m, i) is the number of times that taxon i is associated with the mth generational pattern. This observation is the basis for the fast algorithm for computing $S_n$; the required values n(m, i) can be computed quickly by scanning the n(n − 1) elements of the two generational matrices.

Example: The Congruence of Morphological and Molecular Algae Data

We now return to the two phylogenies presented in Figure 1 and try to answer the question: can their apparent similarities be explained by chance variation? To use the fast algorithm to compute the triples distance, note that the generational pattern (1, 1) is repeated six times for the first taxon (scanning the first rows of the matrices a and b), (2, 3) occurs two times for the third taxon, (3, 3) occurs two times for the fifth taxon, and (3, 2) occurs three times for the seventh taxon, and these are

the only generational patterns that are associated more than once with any taxon. Thus,

$$S_7 = \binom{7}{3} - \left[\binom{6}{2} + \binom{2}{2} + \binom{2}{2} + \binom{3}{2}\right] = 35 - 20 = 15,$$

which agrees with the value found previously by a "brute force" enumeration.

Is this value of the triples distance statistically significant? From the tables in Appendix 2, there is an 8.6% probability that 15 or fewer incongruent triples would occur by chance when the two trees are constructed independently under the uniform model. This probability can be interpreted as a P value for testing the null hypothesis that the two trees are statistically independent under the uniform model. Moreover, from the table, the analogous P value under the Markov model is 6.9%. Thus, under either model there is only minimal evidence in these trees to indicate that the evolution of the seven species of algae suggested by the molecular data is associated with the evolution suggested by morphological characteristics.

The normal approximation is provided by Theorem 1 (although for n = 7 taxa we would not expect the approximation to work well here). Under the uniform model, this gives the standardized test statistic value

$$z = \frac{15 - \frac{2}{3}(35)}{[\operatorname{Var}(S_7)]^{1/2}} \approx -1.35,$$

which yields an approximate P value of 8.9%. Similarly, the normal approximation of the P value under the Markov model is 3.6%. Thus, in this example, the normal distribution appears to provide a better approximation under the uniform model. In general, our investigations suggest that the approximation works adequately for trees with a larger number of taxa, e.g., n > 50. For values of $n \le 50$, the tables in the appendices should be used in preference to the approximation, especially for low significance levels.

The Conservative Test

The proof given for Theorem 1 remains valid when the two trees are allowed to have different (and arbitrary) label-invariant distributions.
In particular, if the label-invariant distributions for trees A and B assign a probability of 1 to some arbitrary fixed topologies $T_A$ and $T_B$, then

$$E(S_n \mid T_A, T_B) = \frac{2}{3}\binom{n}{3}$$

for any such $T_A$ and $T_B$. It follows that

$$\operatorname{Var}(S_n) = E\{\operatorname{Var}[S_n \mid T_A, T_B]\} + \operatorname{Var}\{E[S_n \mid T_A, T_B]\} = E\{\operatorname{Var}[S_n \mid T_A, T_B]\}.$$

Therefore, the variance of $S_n$ is maximized (over all possible pairs of label-invariant distributions for trees A and B) when these distributions assign a probability of 1 to those topologies, $T_A$ and $T_B$, that yield the largest conditional variance $\operatorname{Var}[S_n \mid T_A, T_B]$. This type of conditional variance can be calculated using a simple method that allows an extension of the preceding hypothesis test to cases where it is unclear which probability model is most appropriate for the two trees.

An examination of the variance formula of Equation 2 reveals that the $\sum \operatorname{Cov}(I_{ijk}, I_{ijk'})$ term depends only on the topologies of all the subtrees of size 4, whereas the $\sum \operatorname{Cov}(I_{ijk}, I_{ij'k'})$ term depends on the topologies of the subtrees of size 5. There are two types of topologies for trees with four taxa (type 1 [Fig. 2a] and type 2 [Fig. 2b]) and three types of topologies for trees with five taxa (type 1 [Fig. 3a], type 2 [Fig. 3b], and type 3 [Fig. 3c]). Let $p_{im}(A)$ denote the proportion of subtrees of size m that have topology type i in the full tree A. Then, a straightforward argument expresses $\operatorname{Var}[S_n \mid T_A, T_B]$ as Equation 3: the variance formula of Equation 2 with the size-4 covariance written as a function of the product $p_{14}(A)p_{14}(B)$ and the size-5 covariance written as a function of products of the form $p_{i5}(A)p_{j5}(B)$.

[Equation 3: explicit coefficients not legible in this transcription.]

FIGURE 3. The three topologies with n = 5 taxa. There are 60 possible labeled trees of type a, 30 of type b, and 15 of type c.

Using the facts that $\sum_i p_{im}(A) = 1$ and that $p_{14}(A) = (4/5) + (1/5)p_{15}(A) - (2/5)p_{25}(A)$, it follows that Equation 3 is maximized when $p_{15}(A) = p_{15}(B) = 1$ (provided n > 5). Substituting these values into Equation 3 gives the maximum possible variance of $S_n$ over all possible label-invariant distributions on the two trees:

$$V_{\max} = \frac{8}{15}\binom{n}{5} + \frac{5}{6}\binom{n}{4} + \frac{2}{9}\binom{n}{3}.$$

Consequently, if we use $V_{\max}$ in computing our standardized test statistic, that is, take

$$z = \frac{S_n - \frac{2}{3}\binom{n}{3}}{V_{\max}^{1/2}},$$

then we will have a conservative test that yields the maximum P value over any label-invariant distributions on the trees. Rejection of the null hypothesis using this conservative approach would be especially forceful evidence that the similarities in the two trees cannot be explained by random chance.

When n = 7, $V_{\max} = 4333/90$, so that the conservative test statistic in the algae example is $z \approx -1.20$, which yields an approximate P value of 11.5%. Thus, even at the 10% significance level, the null hypothesis that the two trees are statistically independent cannot be rejected using the conservative test: there exists a label-invariant distribution on trees that provides a reasonable explanation for the congruencies in these particular data. However, in situations where the conservative test rejects the null hypothesis, a very strong conclusion is justified: the similarities between two trees cannot be attributed

to chance variation under any label-invariant model.

The Conditional Test

An alternative approach neglects the issue of choosing a distribution on trees and considers the permutation test conditioned on the topologies of the two trees that are actually observed. Thus, the null model says that all random relabelings of the nodes in the given topologies are equally likely. Because this test amounts to assuming that the tree distribution puts a probability of 1 on the observed topologies, the asymptotic normality of Theorem 1 still applies. The test is carried out in practice by computing the conditional variance given by Equation 3 and requires only a simple count of the number of occurrences of each possible type of subtree topology of size 4 and size 5, as illustrated in Figures 2 and 3.

Returning once again to the comparison of the molecular and the morphology trees in Figure 1, notice that they coincidentally have the same topology. Of the $\binom{7}{5} = 21$ subtrees of size 5, 14 are of type 1 (Fig. 3a), 1 is of type 2 (Fig. 3b), and 6 are of type 3 (Fig. 3c). Of the $\binom{7}{4} = 35$ subtrees of size 4, 32 are of type 1 (Fig. 2a) and 3 are of type 2 (Fig. 2b). Substituting the corresponding proportions into Equation 3 gives the conditional variance $V_{\text{cond}} = 1364/35$ and the conditional test statistic $z \approx -1.33$, yielding an approximate P value of 9.1%.

SUMMARY

In this paper we have described a metric, the triples distance, for comparing rooted bifurcating phylogenetic trees. This distance is easy to interpret and easy to calculate and has a well-developed sampling theory. The sampling theory enables use of the triples distance as a statistic for testing the null hypothesis that the similarities in two trees can be explained by independent random structures. However, as with any hypothesis test, a statistically significant result should not be interpreted as proof of a global pattern, especially for large trees.
For example, it is possible to reject the null hypothesis based on the close agreement of a very small subset of the total collection of taxa (combined with otherwise independent structures). Thus, when n is large, it may be fruitful to also apply the triples distance methodology to particular subtrees that correspond to important subgroups of taxa.

An especially appealing attribute of the triples test is its potential applicability under any choice of the probability distribution on the set of possible trees. If the application indicates a good candidate for this distribution on trees, then the formulae provided can be used to compute the test statistic under this distribution. The statistic is quite robust to small deviations from the candidate tree distribution (e.g., when n = 7, $[\operatorname{Var}(S_n)]^{1/2}$ ranges only from a minimum of 5.35 to a maximum of 6.94, over all possible label-invariant tree distributions). On the other hand, if no reasonable candidate distribution exists, the user may choose a conservative statistic, valid for any tree distribution, or a conditional permutation statistic, valid for the topologies of the trees that are actually observed. Although other metrics may be more appropriate for particular applications, the triples distance provides a useful, robust new resource in the systematist's tool kit.

REFERENCES

BLEDSOE, A. H., AND R. J. RAIKOW. 1990. A quantitative assessment of congruence between molecular and nonmolecular estimates of phylogeny. J. Mol. Evol. 30.
DAY, W. H. E. 1985. Optimal algorithms for comparing trees with labeled leaves. J. Classif. 2:7-28.
DAY, W. H. E. 1986. Analysis of quartet dissimilarity measures between undirected phylogenetic trees. Syst. Zool. 35.
DOBSON, A. J. 1975. Comparing the shapes of trees. In Lecture notes in mathematics, no. 452. Combinatorial mathematics III (A. P. Street and W. D. Wallis, eds.). Springer-Verlag, New York.
ESTABROOK, G. F. 1992. Evaluating undirected positional congruence of individual taxa between two estimates of the phylogenetic tree for a group of taxa. Syst. Biol. 41.
ESTABROOK, G. F., F. R. MCMORRIS, AND C. A. MEACHAM. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst. Zool. 34.
FELLER, W. 1971. An introduction to probability theory and its applications, Volume 2. John Wiley and Sons, New York.
HARDING, E. F. 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab. 3.
HENDY, M. D., C. H. C. LITTLE, AND D. PENNY. 1984. Comparing trees with pendant vertices labelled. SIAM J. Appl. Math. 44.
KUHNER, M. A., AND J. FELSENSTEIN. 1994. A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. Mol. Biol. Evol. 11:459-468.
LAPOINTE, F.-J., AND P. LEGENDRE. 1990. A statistical framework to test the consensus of two nested classifications. Syst. Zool. 39:1-13.
LAPOINTE, F.-J., AND P. LEGENDRE. 1992. A statistical framework to test the consensus among additive trees (cladograms). Syst. Biol. 41.
PAGE, R. D. M. 1988. Quantitative cladistic biogeography: Constructing and comparing area cladograms. Syst. Zool. 37.
PAGE, R. D. M. 1991. Random dendrograms and null hypotheses in cladistic biogeography. Syst. Zool. 40.
PENNY, D., L. R. FOULDS, AND M. D. HENDY. 1982. Testing the theory of evolution by comparing phylogenetic trees constructed from five different protein sequences. Nature 297.
PENNY, D., AND M. D. HENDY. 1985. The use of tree comparison metrics. Syst. Zool. 34:75-82.
PENNY, D., M. D. HENDY, AND M. A. STEEL. 1991. Testing the theory of descent. In Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
ROBINSON, D. F., AND L. R. FOULDS. 1981. Comparison of phylogenetic trees. Math. Biosci. 53.
SHAO, K., AND F. J. ROHLF. 1983. Sampling distribution of consensus indices when all bifurcating trees are equally likely. In Numerical taxonomy (J. Felsenstein, ed.). Springer-Verlag, Berlin.
SHAO, K., AND R. R. SOKAL. 1986. Significance tests of consensus indices. Syst. Zool. 35.
SLOWINSKI, J. B. 1990. Probabilities of n-trees under two models: A demonstration that asymmetrical interior nodes are not improbable. Syst. Zool. 39.
STEEL, M. A. 1988. Distribution of the symmetric difference metric on phylogenetic trees. SIAM J. Disc. Math. 1.
STEEL, M. A., AND D. PENNY. 1993. Distributions of tree comparison metrics: Some new results. Syst. Biol. 42.
SWOFFORD, D. L. 1991. When are phylogeny estimates from molecular and morphological data incongruent? In Phylogenetic analysis of DNA sequences (M. M. Miyamoto and J. Cracraft, eds.). Oxford Univ. Press, New York.
WATERMAN, M. S., AND T. F. SMITH. 1978. On the similarity of dendrograms. J. Theor. Biol. 73.
WILCOX, L. W., L. A. LEWIS, P. A. FUERST, AND G. L. FLOYD. 1992. Assessing the relationships of autosporic and zoosporic chlorococcalean green algae with 18S rDNA sequence data. J. Phycol. 28.

Received 9 March 1995; accepted 8 March 1996
Associate Editor: Daniel Faith

APPENDIX 1

PROOF OF THE NORMAL APPROXIMATION, THEOREM 1

Recall the notation $S_n = \sum_{ijk} I_{ijk}$, where $I_{ijk} = 1$ if the triple of taxa i, j, k has a different subtree topology in the two trees, and 0 otherwise. Let $Z_n$ denote $[S_n - E(S_n)]/[\operatorname{Var}(S_n)]^{1/2}$, and let $M_{nt}$ denote $E(Z_n^t)$. To prove asymptotic normality, it is sufficient to show that the moments $M_{nt}$ all converge to the corresponding moments of the standard normal distribution, i.e., that

$$\lim_{n \to \infty} M_{nt} = \text{the } t\text{th moment of the standard normal distribution} = \begin{cases} (t-1)!! & \text{if } t \text{ is even} \\ 0 & \text{if } t \text{ is odd} \end{cases}$$

(e.g., Feller, 1971:69, 4-9). For notational convenience, let $A_{ijk} = I_{ijk} - 2/3$, so that $E(A_{ijk}) = 0$ and $S_n - E(S_n) = \sum_{ijk} A_{ijk}$. Then

$$M_{nt} = \frac{E\big[\sum_{ijk} A_{ijk}\big]^t}{[\operatorname{Var}(S_n)]^{t/2}}.$$

In the above expression, note that by Equation 2, $\operatorname{Var}(S_n) = (c/4)n^5 + o(n^5)$, where $c = \operatorname{Cov}(I_{ijk}, I_{ij'k'})$. Also note that $[\sum_{ijk} A_{ijk}]^t$ can be expanded as a summation of t-fold products, each having the form $\prod_{s=1}^{t} A_{i_s j_s k_s}$. For each such t-fold product, let m denote the number of distinct indices among all the 3t indices $i_1, j_1, k_1, i_2, j_2, k_2, \ldots, i_t, j_t, k_t$ that occur in the product.
We distinguish three possible cases, according to the value of m.

Case 1

For any fixed m < 5t/2, the number of possible products that attain this value of m is of order $n^m$. Therefore, the contribution to $E[\sum_{ijk} A_{ijk}]^t$ from all such products is asymptotically negligible compared to $[\operatorname{Var}(S_n)]^{t/2}$.

Case 2

Next consider any fixed m > 5t/2. Any product that attains this value of m must have the following property: there exists some triple $i_{s'}, j_{s'}, k_{s'}$ such that $A_{i_{s'} j_{s'} k_{s'}}$ occurs in the product and such that $i_{s'}, j_{s'}, k_{s'}$ are distinct from all of the other indices occurring in the product. But then $A_{i_{s'} j_{s'} k_{s'}}$ is independent of all other terms in the product, so $E[\prod_{s=1}^{t} A_{i_s j_s k_s}] = 0$.

Note that if t is odd, then either case 1 or case 2 must always hold, and therefore $\lim_{n \to \infty} M_{nt} = 0$, as claimed. However, if t is even, consider the third case.

Case 3

Suppose m = 5t/2. Then t is even, e.g., t = 2r. Consider any product that attains this value of m. If there happens to exist s' as described in case 2, then still $E[\prod_{s=1}^{t} A_{i_s j_s k_s}] = 0$ as argued under case 2. If there does not exist such an s', then this together with m = 5t/2 implies that the product must have the form

$$\prod_{s=1}^{r} A_{i_s j_{2s-1} k_{2s-1}} A_{i_s j_{2s} k_{2s}}, \tag{A1}$$

where $i_1, \ldots, i_r$, $j_1, \ldots, j_{2r}$, and $k_1, \ldots, k_{2r}$ are all distinct. The expectation of such a product is $(E[A_{i_1 j_1 k_1} A_{i_1 j_2 k_2}])^r = c^r$. Moreover, it is straightforward to count the number of possible products of form A1: there are (2r − 1)!! ways to decide which terms in the product are paired with each other by possessing a common index; $n(n-1)\cdots(n-r+1)$ ways of choosing $i_1, i_2, \ldots, i_r$; and

$$\binom{n-r}{2}\binom{n-r-2}{2}\cdots\binom{n-5r+2}{2}$$

ways of choosing $j_1, k_1, j_2, k_2, \ldots, j_{2r}, k_{2r}$. Hence, the number of possible products is

$$(2r-1)!!\; n(n-1)\cdots(n-r+1)\binom{n-r}{2}\cdots\binom{n-5r+2}{2} = \frac{(2r-1)!!\; n!}{(n-5r)!\; 2^{2r}}.$$

In conclusion, when t = 2r is even,

$$\lim_{n \to \infty} M_{nt} = \lim_{n \to \infty} \frac{E\big[\sum \prod_{s=1}^{r} A_{i_s j_{2s-1} k_{2s-1}} A_{i_s j_{2s} k_{2s}}\big]}{[\operatorname{Var}(S_n)]^{r}},$$

where the summation is over all possible products of form A1. By evaluating this expectation as above and substituting the leading term $cn^5/4$ of $\operatorname{Var}(S_n)$, we obtain

$$\lim_{n \to \infty} M_{nt} = \lim_{n \to \infty} \frac{c^r\,(2r-1)!!\; n!}{(cn^5/4)^r\,(n-5r)!\; 2^{2r}} = (2r-1)!!$$

as claimed.

11 1996 CRITCHLOW ET AL. TRIPLES DISTANCE FOR ROOTED TREES 333 APPENDIX Tables of the exact distribution of S n. For each n (number of taxa: 4 < n < 7), the tabulated probabilix) for both the uniform and Markov ties are P(S n < models. X n = n = n = n = Uniform Markov X APPENDIX Continued. Uniform Markov

APPENDIX 3

Level α critical values of the statistic $S_n$. For each n (number of taxa: 8 ≤ n ≤ 50), the largest value of x such that $P(S_n \le x) \le \alpha$ is tabulated for both the uniform and Markov models.

[Table columns: n, then Uniform and Markov critical values at each level α, including α = 0.05; the entries are not preserved in this transcription.]
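As an illustration of how a lower-tail critical value of this kind can be read off a discrete null distribution, here is a minimal sketch. The function name and the toy probability mass function are hypothetical, and the toy values are not taken from the paper's tables. It assumes the statistic takes integer values and that small distances lead to rejection of the null hypothesis.

```python
from fractions import Fraction


def lower_critical_value(pmf, alpha):
    """Return the largest integer x with P(S <= x) <= alpha, or None.

    pmf maps integer values of the statistic to their null probabilities.
    None means that even the smallest value has probability above alpha.
    """
    cumulative = 0
    critical = None
    for x in range(min(pmf), max(pmf) + 1):
        cumulative += pmf.get(x, 0)
        if cumulative > alpha:
            break
        critical = x
    return critical


# Toy null distribution (hypothetical numbers, for illustration only).
toy_pmf = {
    0: Fraction(1, 100),
    1: Fraction(3, 100),
    2: Fraction(6, 100),
    3: Fraction(20, 100),
    4: Fraction(70, 100),
}

if __name__ == "__main__":
    print(lower_critical_value(toy_pmf, Fraction(1, 20)))
```

Exact fractions are used so that comparisons against α are not affected by floating-point rounding. With a tabulated critical value x, an observed distance no larger than x would be significant at level α under the chosen model.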

Assessing Congruence Among Ultrametric Distance Matrices

Assessing Congruence Among Ultrametric Distance Matrices Journal of Classification 26:103-117 (2009) DOI: 10.1007/s00357-009-9028-x Assessing Congruence Among Ultrametric Distance Matrices Véronique Campbell Université de Montréal, Canada Pierre Legendre Université

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

arxiv: v1 [cs.ds] 1 Nov 2018

arxiv: v1 [cs.ds] 1 Nov 2018 An O(nlogn) time Algorithm for computing the Path-length Distance between Trees arxiv:1811.00619v1 [cs.ds] 1 Nov 2018 David Bryant Celine Scornavacca November 5, 2018 Abstract Tree comparison metrics have

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

DISTRIBUTIONS OF CHERRIES FOR TWO MODELS OF TREES

DISTRIBUTIONS OF CHERRIES FOR TWO MODELS OF TREES DISTRIBUTIONS OF CHERRIES FOR TWO MODELS OF TREES ANDY McKENZIE and MIKE STEEL Biomathematics Research Centre University of Canterbury Private Bag 4800 Christchurch, New Zealand No. 177 May, 1999 Distributions

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition David D. Pollock* and William J. Bruno* *Theoretical Biology and Biophysics, Los Alamos National

More information

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS In our work on hypothesis testing, we used the value of a sample statistic to challenge an accepted value of a population parameter. We focused only

More information

The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models

The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models Gabriel Cardona, Arnau Mir, Francesc Rosselló Department of Mathematics and Computer Science, University

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Distances that Perfectly Mislead

Distances that Perfectly Mislead Syst. Biol. 53(2):327 332, 2004 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150490423809 Distances that Perfectly Mislead DANIEL H. HUSON 1 AND

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT COMMUNICATIONS IN INFORMATION AND SYSTEMS c 2009 International Press Vol. 9, No. 4, pp. 295-302, 2009 001 THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT DAN GUSFIELD AND YUFENG WU Abstract.

More information

Parsimony via Consensus

Parsimony via Consensus Syst. Biol. 57(2):251 256, 2008 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150802040597 Parsimony via Consensus TREVOR C. BRUEN 1 AND DAVID

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft] Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley K.W. Will Parsimony & Likelihood [draft] 1. Hennig and Parsimony: Hennig was not concerned with parsimony

More information

Congruence of Morphological and Molecular Phylogenies

Congruence of Morphological and Molecular Phylogenies Acta Biotheor DOI 10.1007/s10441-007-9015-8 REGULAR A RTICLE Congruence of Morphological and Molecular Phylogenies Davide Pisani Æ Michael J. Benton Æ Mark Wilkinson Received: 10 March 2007 / Accepted:

More information

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Minimum evolution using ordinary least-squares is less robust than neighbor-joining Minimum evolution using ordinary least-squares is less robust than neighbor-joining Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA email: swillson@iastate.edu November

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

Probabilities of Evolutionary Trees under a Rate-Varying Model of Speciation

Probabilities of Evolutionary Trees under a Rate-Varying Model of Speciation Probabilities of Evolutionary Trees under a Rate-Varying Model of Speciation Mike Steel Biomathematics Research Centre University of Canterbury, Private Bag 4800 Christchurch, New Zealand No. 67 December,

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

arxiv: v1 [q-bio.pe] 1 Jun 2014

arxiv: v1 [q-bio.pe] 1 Jun 2014 THE MOST PARSIMONIOUS TREE FOR RANDOM DATA MAREIKE FISCHER, MICHELLE GALLA, LINA HERBST AND MIKE STEEL arxiv:46.27v [q-bio.pe] Jun 24 Abstract. Applying a method to reconstruct a phylogenetic tree from

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004, Tracing the Evolution of Numerical Phylogenetics: History, Philosophy, and Significance Adam W. Ferguson Phylogenetic Systematics 26 January 2009 Inferring Phylogenies Historical endeavor Darwin- 1837

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Non-independence in Statistical Tests for Discrete Cross-species Data

Non-independence in Statistical Tests for Discrete Cross-species Data J. theor. Biol. (1997) 188, 507514 Non-independence in Statistical Tests for Discrete Cross-species Data ALAN GRAFEN* AND MARK RIDLEY * St. John s College, Oxford OX1 3JP, and the Department of Zoology,

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution Ziheng Yang Department of Biology, University College, London An excess of nonsynonymous substitutions

More information

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 1, 2001 Mary Ann Liebert, Inc. Pp. 69 78 Perfect Phylogenetic Networks with Recombination LUSHENG WANG, 1 KAIZHONG ZHANG, 2 and LOUXIN ZHANG 3 ABSTRACT

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

Reconstructing Trees from Subtree Weights

Reconstructing Trees from Subtree Weights Reconstructing Trees from Subtree Weights Lior Pachter David E Speyer October 7, 2003 Abstract The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

arxiv: v1 [cs.cc] 9 Oct 2014

arxiv: v1 [cs.cc] 9 Oct 2014 Satisfying ternary permutation constraints by multiple linear orders or phylogenetic trees Leo van Iersel, Steven Kelk, Nela Lekić, Simone Linz May 7, 08 arxiv:40.7v [cs.cc] 9 Oct 04 Abstract A ternary

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Mathematical Statistics Stockholm University Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Bodil Svennblad Tom Britton Research Report 2007:2 ISSN 650-0377 Postal

More information

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites Benny Chor Michael Hendy David Penny Abstract We consider the problem of finding the maximum likelihood rooted tree under

More information

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them?

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them? Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them? Carolus Linneaus:Systema Naturae (1735) Swedish botanist &

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

Combining Data Sets with Different Phylogenetic Histories

Combining Data Sets with Different Phylogenetic Histories Syst. Biol. 47(4):568 581, 1998 Combining Data Sets with Different Phylogenetic Histories JOHN J. WIENS Section of Amphibians and Reptiles, Carnegie Museum of Natural History, Pittsburgh, Pennsylvania

More information

CSCE 222 Discrete Structures for Computing. Review for Exam 2. Dr. Hyunyoung Lee !!!

CSCE 222 Discrete Structures for Computing. Review for Exam 2. Dr. Hyunyoung Lee !!! CSCE 222 Discrete Structures for Computing Review for Exam 2 Dr. Hyunyoung Lee 1 Strategy for Exam Preparation - Start studying now (unless have already started) - Study class notes (lecture slides and

More information

arxiv: v1 [q-bio.pe] 3 May 2016

arxiv: v1 [q-bio.pe] 3 May 2016 PHYLOGENETIC TREES AND EUCLIDEAN EMBEDDINGS MARK LAYER AND JOHN A. RHODES arxiv:1605.01039v1 [q-bio.pe] 3 May 2016 Abstract. It was recently observed by de Vienne et al. that a simple square root transformation

More information

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

More information

Maximum Agreement Subtrees

Maximum Agreement Subtrees Maximum Agreement Subtrees Seth Sullivant North Carolina State University March 24, 2018 Seth Sullivant (NCSU) Maximum Agreement Subtrees March 24, 2018 1 / 23 Phylogenetics Problem Given a collection

More information

Evaluating phylogenetic hypotheses

Evaluating phylogenetic hypotheses Evaluating phylogenetic hypotheses Methods for evaluating topologies Topological comparisons: e.g., parametric bootstrapping, constrained searches Methods for evaluating nodes Resampling techniques: bootstrapping,

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

Combining the cycle index and the Tutte polynomial?

Combining the cycle index and the Tutte polynomial? Combining the cycle index and the Tutte polynomial? Peter J. Cameron University of St Andrews Combinatorics Seminar University of Vienna 23 March 2017 Selections Students often meet the following table

More information

What Is Conservation?

What Is Conservation? What Is Conservation? Lee A. Newberg February 22, 2005 A Central Dogma Junk DNA mutates at a background rate, but functional DNA exhibits conservation. Today s Question What is this conservation? Lee A.

More information

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood For: Prof. Partensky Group: Jimin zhu Rama Sharma Sravanthi Polsani Xin Gong Shlomit klopman April. 7. 2003 Table of Contents Introduction...3

More information

Algebraic Statistics Tutorial I

Algebraic Statistics Tutorial I Algebraic Statistics Tutorial I Seth Sullivant North Carolina State University June 9, 2012 Seth Sullivant (NCSU) Algebraic Statistics June 9, 2012 1 / 34 Introduction to Algebraic Geometry Let R[p] =

More information

A Generalization of Wigner s Law

A Generalization of Wigner s Law A Generalization of Wigner s Law Inna Zakharevich June 2, 2005 Abstract We present a generalization of Wigner s semicircle law: we consider a sequence of probability distributions (p, p 2,... ), with mean

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200B Spring 2009 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley B.D. Mishler Jan. 22, 2009. Trees I. Summary of previous lecture: Hennigian

More information

The expansion of random regular graphs

The expansion of random regular graphs The expansion of random regular graphs David Ellis Introduction Our aim is now to show that for any d 3, almost all d-regular graphs on {1, 2,..., n} have edge-expansion ratio at least c d d (if nd is

More information

A STATISTICAL FRAMEWORK TO TEST THE CONSENSUS OF TWO NESTED CLASSIFICATIONS

A STATISTICAL FRAMEWORK TO TEST THE CONSENSUS OF TWO NESTED CLASSIFICATIONS Syst. ZooL, 39(1):1-13, 1990 A STATISTICAL FRAMEWORK TO TEST THE CONSENSUS OF TWO NESTED CLASSIFICATIONS FRANCOIS-JOSEPH LAPOINTE AND PIERRE LEGENDRE Departement de Sciences biologiques, Universite de

More information

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Daniel Štefankovič Eric Vigoda June 30, 2006 Department of Computer Science, University of Rochester, Rochester, NY 14627, and Comenius

More information

Supplementary Information

Supplementary Information Supplementary Information For the article"comparable system-level organization of Archaea and ukaryotes" by J. Podani, Z. N. Oltvai, H. Jeong, B. Tombor, A.-L. Barabási, and. Szathmáry (reference numbers

More information

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS PETER J. HUMPHRIES AND CHARLES SEMPLE Abstract. For two rooted phylogenetic trees T and T, the rooted subtree prune and regraft distance

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

CSCI1950 Z Computa4onal Methods for Biology Lecture 5 CSCI1950 Z Computa4onal Methods for Biology Lecture 5 Ben Raphael February 6, 2009 hip://cs.brown.edu/courses/csci1950 z/ Alignment vs. Distance Matrix Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi DNA Phylogeny Signals and Systems in Biology Kushal Shah @ EE, IIT Delhi Phylogenetics Grouping and Division of organisms Keeps changing with time Splitting, hybridization and termination Cladistics :

More information

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution Today s topics Inferring phylogeny Introduction! Distance methods! Parsimony method!"#$%&'(!)* +,-.'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;?@AB=C?DEF Overview of phylogenetic inferences Methodology Methods

More information

Enumeration of subtrees of trees

Enumeration of subtrees of trees Enumeration of subtrees of trees Weigen Yan a,b 1 and Yeong-Nan Yeh b a School of Sciences, Jimei University, Xiamen 36101, China b Institute of Mathematics, Academia Sinica, Taipei 1159. Taiwan. Theoretical

More information

Lecture 6 Phylogenetic Inference

Lecture 6 Phylogenetic Inference Lecture 6 Phylogenetic Inference From Darwin s notebook in 1837 Charles Darwin Willi Hennig From The Origin in 1859 Cladistics Phylogenetic inference Willi Hennig, Cladistics 1. Clade, Monophyletic group,

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200 Spring 2018 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley D.D. Ackerly Feb. 26, 2018 Maximum Likelihood Principles, and Applications to

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data Yujun Wu, Marc G. Genton, 1 and Leonard A. Stefanski 2 Department of Biostatistics, School of Public Health, University of Medicine

More information

Theory of Evolution Charles Darwin

Theory of Evolution Charles Darwin Theory of Evolution Charles arwin 858-59: Origin of Species 5 year voyage of H.M.S. eagle (83-36) Populations have variations. Natural Selection & Survival of the fittest: nature selects best adapted varieties

More information

Lower Bounds for Testing Bipartiteness in Dense Graphs

Lower Bounds for Testing Bipartiteness in Dense Graphs Lower Bounds for Testing Bipartiteness in Dense Graphs Andrej Bogdanov Luca Trevisan Abstract We consider the problem of testing bipartiteness in the adjacency matrix model. The best known algorithm, due

More information

Systematics Lecture 3 Characters: Homology, Morphology

Systematics Lecture 3 Characters: Homology, Morphology Systematics Lecture 3 Characters: Homology, Morphology I. Introduction Nearly all methods of phylogenetic analysis rely on characters as the source of data. A. Character variation is coded into a character-by-taxon

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

More information

arxiv: v1 [q-bio.pe] 16 Aug 2007

arxiv: v1 [q-bio.pe] 16 Aug 2007 MAXIMUM LIKELIHOOD SUPERTREES arxiv:0708.2124v1 [q-bio.pe] 16 Aug 2007 MIKE STEEL AND ALLEN RODRIGO Abstract. We analyse a maximum-likelihood approach for combining phylogenetic trees into a larger supertree.

More information
