The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models

Size: px

Start display at page:

Download "The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models"

Harold Jackson
6 years ago
Views:

1 The expected value of the squared euclidean cophenetic metric under the Yule and the uniform models Gabriel Cardona, Arnau Mir, Francesc Rosselló Department of Mathematics and Computer Science, University of the Balearic Islands, E-07 Palma de Mallorca, Spain arxiv:30.53v [q-bio.pe] Jan 03 Abstract The cophenetic metrics d ϕ,p, for p {0} [, [, are a recent addition to the kit of available distances for the comparison of phylogenetic trees. Based on a fifty years old idea of Sokal and Rohlf, these metrics compare phylogenetic trees on a same set of taxa by encoding them by means of their vectors of cophenetic values of pairs of taxa and depths of single taxa, and then computing the L p norm of the difference of the corresponding vectors. In this paper we compute the expected value of the square of d ϕ, on the space of fully resolved rooted phylogenetic trees with n leaves, under the Yule and the uniform probability distributions. Keywords: Phylogenetic tree, Cophenetic metric, Uniform model, Yule model, Sackin index, Total cophenetic index. Introduction The definition and study of metrics for the comparison of rooted phylogenetic trees on the same set of taxa is a classical problem in phylogenetics [0, Ch. 30], and many metrics have been introduced so far with this purpose. A recent addition to the set of metrics available in this context are the cophenetic metrics d ϕ,p introduced in [8]. Based on a fifty years old idea of Sokal and Rohlf, these metrics compare phylogenetic trees on a same set of taxa by first encoding the trees by means of their vectors of cophenetic values of pairs of taxa and depths of single taxa, and then computing the L p norm of the difference of the corresponding vectors. Once the disimilarity between two phylogenetic trees has been computed through a given metric, it is convenient in many situations to assess its signifiance. One possibility is to compare the value obtained with its expected, or Corresponding author addresses: gabriel.cardona@uib.es (Gabriel Cardona), arnau.mir@uib.es (Arnau Mir), cesc.rossello@uib.es (Francesc Rosselló) Preprint submitted to Elsevier January 3, 03

2 mean, value: is it much larger, much smaller, similar? [8] This makes it necessary to study the distribution of the metric, or, at least, to have a formula for the expected value of the metric for any number n of leaves. The distribution of several metrics has been studied so far: see, for instance, [5, 6, 6, 7, 8]. The expected value of a distance depends on the probability distribution on the space of phylogenetic trees under consideration. The most popular distribution on the space T n of binary phylogenetic trees with n leaves is the uniform distribution, under which all trees in T n are equiprobable. But phylogeneticists consider also other probability distributions on T n, defined through stochastic models of evolution [0, Ch. 33]. The most popular is the so-called Yule model [4, 9], defined by an evolutionary process where, at each step, each currently extant species can give rise, with the same probability, to two new species. Under this model, different phylogenetic trees with the same number of leaves may have different probabilities, which depend on their shape. In this paper we provide explicit formulas for the expected values under the uniform and the Yule models of the square of the euclidean cophenetic metric d ϕ,. The proofs of these formulas are based on long and tedious algebraic computations and thus, to ease the task of the reader interested only in the formulas and the path leading to them, but not in the details, we have moved these computations to an Appendix at the end of the paper. Besides the aforemenentioned application of this value in the assessment of tree comparisons, the knowledge of formulas for the expected value of d ϕ, under different models may allow the use of d ϕ, to test stochastic models of tree growth, a popular line of research in the last years which so far has been mostly based on shape indices; see, for instance, [3, 9]. As a proof of concept, in 4 we report on a basic, preliminary such test performed on the binary phylogenetic trees contained in the TreeBASE database [0].. Preliminaries In this paper, by a phylogenetic tree on a set S of taxa we mean a fully resolved, or binary, rooted tree with its leaves bijectively labeled in S. We understand such a rooted tree as a directed graph, with its arcs pointing away from the root. To simplify the language, we shall always identify a leaf of a phylogenetic tree with its label. We shall also use the term phylogenetic tree with n leaves to refer to a phylogenetic tree on the set {,..., n}. We shall denote by T (S) the space of all phylogenetic trees on S and by T n the space of all phylogenetic trees with n leaves. Let T be a phylogenetic tree. If there exists a directed path from u to v in T, we shall say that v is a descendant of u and also that u is an ancestor of v. The lowest common ancestor LCA T (u, v) of a pair of nodes u, v in T is the unique common ancestor of them that is a descendant of every other common ancestor of them. The depth δ T (v) of a node v in T is the distance (in number of arcs) from the root of T to v. The cophenetic value ϕ T (i, j) of a pair of leaves i, j in T is the depth of their LCA. To simplify the notations, we shall often write ϕ T (i, i) to denote the depth δ T (i) of a leaf i.

3 Given two phylogenetic trees T, T on disjoint sets of taxa S, S, respectively, we shall denote by T T the phylogenetic tree on S S obtained by connecting the roots of T and T to a (new) common root. Every phylogenetic tree T T n is obtained as T k T n k, for some k n, some subset S k {,..., n} with k elements, some tree T k on S k and some tree T n k on Sc k {,..., n}\s k. Actually, every phylogenetic tree in T n is obtained in this way twice. The Yule, or Equal-Rate Markov, model of evolution [4, 9] is a stochastic model of phylogenetic trees growth. It starts with a node, and at every step a leaf is chosen randomly and uniformly and it is splitted into two leaves. Finally, the labels are assigned randomly and uniformly to the leaves once the desired number of leaves is reached. This corresponds to a model of evolution where, at each step, each currently extant species can give rise, with the same probability, to two new species. Under this stochastic model, if T T n is a phylogenetic tree with set of internal nodes V int (T ), and if for every v V int (T ) we denote by l T (v) the number of its descendant leaves, then the probability of T is [4, 7] P Y (T ) n! v V int(t ) l T (v). The uniform, or Proportional to Distinguishable Arrangements, model [] is another stochastic model of phylogenetic trees growth. Unlike the Yule model, its main feature is that all phylogenetic trees T T n have the same probability: P U (T ), where (n 3)!! (n 3)(n 5) 3. (n 3)!! From the point of view of tree growth, this model is described as the process that starts with a node labeled and then, at the k-th step, a new pendant arc, ending in the leaf labeled k +, is added either to a new root (whose other child will be, then, the original root) or to some edge, with all possible locations of this new pendant arc being equiprobable [9, 6]. Although this is not an explicit model of evolution, only of tree growth, several interpretations of it in terms of evolutionary processes have been given in the literature: see [3, p. 686] and the references therein. 3. Main results Let T T n be a phylogenetic tree with n leaves. The cophenetic vector of T is ϕ(t ) ( ϕ T (i, j) ) i j n Rn(n+)/, with its elements lexicographically ordered in (i, j). It turns out [8] that the mapping ϕ : T n R n(n+)/ sending each T T n to its cophenetic vector ϕ(t ), is injective up to isomorphism. As it is well known, this allows to induce metrics on T n from metrics defined on powers of R. In particular, in this paper 3

4 we consider the cophenetic metric d ϕ, on T n induced by the euclidean distance: d ϕ, (T, T ) (ϕ T (i, j) ϕ T (i, j)). i j n To distinguish it from other cophenetic metrics obtained through other L p normes, we shall call it the euclidean cophenetic metric. Example. Consider the phylogenetic trees T, T T 4 depicted in Fig.. Their total cophenetic vectors are ϕ(t ) (,, 0, 0,, 0, 0,,, ) ϕ(t ) (, 0, 0, 0,,,, 3,, 3) and therefore d ϕ, (T, T ) 7. As we shall see below, the expected values of the square of d ϕ, on T 4 under the uniform and the Yule models are, respectively, 0.56 and 9.4, and hence these two trees are quite more similar than average with respect to the euclidean cophenetic metric under both models. 3 4 T 3 4 T Figure : Two phylogenetic trees with 4 leaves. Let Dn the random variable that chooses a pair of trees T, T T n and computes d ϕ, (T, T ). Its expected values under the Yule and the uniform models are given by the following two theorems. Recall that the n-th harmonic number H n is defined as H n n i /i. Theorem. For every n, the expected value of Dn under the Yule model is E Y (Dn) n ( 3n 0n + 8(n + )H n 4(n + )H ) n. n Theorem 3. For every n, the expected value of Dn under the uniform model is E U (Dn) Å ã 3 (4n3 +8n n(n + 3) (n )!! n(n + 7) (n )!! 0n) (n 3)!! 4 (n 3)!! Since H n ln(n) and (n )!!/(n 3)!! πn, these formulas imply that ( 4 E Y (Dn) 6n, E U (Dn) 3 π ) n 3. 4 We shall prove the formulas in Theorems and 3 by reducing the computation of the expected value of D n to that of the following random variables: 4

5 S n, the random variable that chooses a tree T T n and computes its Sackin index S [3], defined by S(T ) n δ T (i) Φ n, the random variable that chooses a tree T T n and computes its total cophenetic index Φ [8], defined by Φ(T ) ϕ T (i, j) Φ () n i i<j n, the random variable that chooses a tree T T n and computes Φ () (T ) ϕ T (i, j) i j n For the models under consideration, the expected values of these variables are related to that of Dn by the next proposition. In it and henceforth, we shall denote by E(X) the expected value of a random variable X on T n under a generic probability distribution p : T n [0, ] on T n invariant under relabelings. The probability distributions p Y and p U defined by the Yule and the uniform models, respectively, are invariant under relabelings, and therefore the expected values under these specific models, which will be denoted by E Y and E U, respectively, are special cases of E. Proposition 4. E(Dn) E(Φ () n ) E(S n) 4 E(Φ n) n n(n ). Proof. To simplify the notations, let ϕ n be the random variable that chooses a tree T T n and computes ϕ T (, ). δ n be the random variable that chooses a tree T T n and computes δ T (). Let us compute now E(Dn) from its very definition: E(Dn) d ϕ, (T, T ) p(t )p(t ) (T,T ) T n ( (T,T ) T n i j n i j n (T,T ) Tn i j n ( (T,T ) T n (T,T ) T n (ϕ T (i, j) ϕ T (i, j)) )p(t )p(t ) (ϕ T (i, j) + ϕ T (i, j) ϕ T (i, j)ϕ T (i, j))p(t )p(t ) ϕ T (i, j) p(t )p(t ) + ) ϕ T (i, j)ϕ T (i, j)p(t )p(t ) (T,T ) T n ϕ T (i, j) p(t )p(t ) 5

6 ( ϕ T (i, j) p(t ) + ϕ T (i, j) p(t ) i j n( T T n T T )( n )) ϕ T (i, j)p(t ) ϕ T (i, j)p(t ) T T n T T n ( ( ) ) ϕ T (i, j) p(t ) ϕ T (i, j)p(t ) i j n T T n T T n ( ( ϕ T (i, j) )p(t ) ϕ T (i, j)p(t ) T T n i j n i<j n T T n ( ) ϕ T (i, i)p(t ) i n T T n Ç å Φ () n ( ) (T )p(t ) ϕ T (, )p(t ) T T n T T ( n ) n δ T ()p(t ) T T n E(Φ () n ) n(n )E(ϕ n ) ne(δ n ) Now, the values of E(δ n ) and E(ϕ n ) can be easily obtained from E(S n ) and E(Φ n ), respectively, using the invariance under relabelings of the probability distribution under which we compute the expected values E: E(δ n ) E(S n )/n, E(ϕ n ) E(Φ n )/ ( ) n The formula in the statement is then obtained by replacing E(δ n ) and E(ϕ n ) by these values. The expected values of S n and Φ n under the Yule and the uniform models are known: ( (n )!! ) E Y (S n ) n(h n ) E U (S n ) n (n 3)!! E Y (Φ n ) n(n ) n(h n ) E U (Φ n ) Ç å n ((n )!! ) (n 3)!! The formula for E Y (S n ) was proved in [5] and the other three, in [8]. To obtain the expected values of Dn, it remains to compute the expected values of Φ () n. They are given by the following result. Proposition 5. For every n, (a) E Y (Φ () n ) 5n(n ) 8n(H n ) (b) E U (Φ () n ) 6 n(4n + n 7) 3 )!! n(n + 3)(n 4 (n 3)!! This proposition is proved in the Appendix at the end of this paper. Finally, the identities given in Theorems and 3 are obtained by replacing, in the identity ) 6

7 given in Proposition 4, E(S n ), E(Φ n ), and E(Φ () n ) by their values under the Yule and the uniform models, respectively. We leave the last details to the reader. 4. An experiment on TreeBASE In this section we report on a very simple experiment to show how d ϕ, can be used to test evolutionary hypotheses. In this experiment, we have compared the expected value of d ϕ, on T n under the uniform and the Yule models with its average value on the set TreeBASE bin,n of binary phylogenetic trees with n leaves contained in TreeBASE [0]. To perform this experiment, we have taken some decisions. First, since there are only very few values n > 50 such that TreeBASE bin,n > 0, we have decided to consider only those binary trees contained in TreeBASE with n 50 leaves. On the other hand, even for those n such that TreeBASE bin,n is relatively large, in most cases it does not contain many pairs of trees with the same taxa. So, instead of computing the average value of d ϕ, on TreeBASE bin,n by averaging the values d ϕ,(t, T ) for pairs of trees T, T with exactly the same n taxa, we have made use of the formula given in Proposition 4, as if TreeBASE bin,n was closed under relabelings: that is, we have taken only into account the shapes of the trees contained in it. This is consistent with the fact that our final goal is to test models of evolution that produce tree shapes. So, we have computed the average values of Φ (), of the Sackin index S, and of the total cophenetic index Φ on TreeBASE bin,n, and we have taken as average value of d ϕ, on this set the result of appying the formula in Proposition 4. The detailed results of these computations, as well as the Python and R scripts used to compute and analyze them, are available in the Supplementary Material web page Fig. plots the log of these average values as a function of log(n). We have added the curves of the log of the expected values of Dn under the Yule distribution (lower, dotted curve) and under the uniform distribution (upper, dashed curve), again as a function of log(n). The graphic shows that the expected value of d ϕ, on (the shapes of) the phylogenetic trees contained in TreeBASE is better explained by the uniform model than by the Yule model. This agrees with the results of similar experiments using other measures (see, for instance, [3, 8]). 5. Conclusions and discussion In this paper we have obtained formulas for the expected values under the Yule and the uniform models of the square of the euclidean cophenetic metric d ϕ,, defined by the euclidean distance between cophenetic vectors. These formulas are explicit and hold on spaces T n of fully resolved phylogenetic trees with any number n of leaves. 7

8 log of the expected value of the square of the distance number of leaves Figure : Log-log plots of the mean of Dn for the binary trees in TreeBASE with a fixed number n of leaves, of E Y (Dn ) (dotted curve) and E U (Dn ) (dashed curve). These formulas have been obtained through long algebraic manipulations of sums of sequences. To double-check our results, we have computed the exact value of E Y (D n) and E U (D n) for n 3,..., 7, by generating all trees with up to 7 leaves. Moreover, we have computed numerical approximations to these values for n 0, 0,..., 00, by generating pairs of random trees until the numerical method stabilizes. These numerical experiments confirm that our formulas give the right figures. Table gives the exact values for n 3,..., 7. The results of the simulations for n 0, 0,..., 00, as well as the Python scripts used in these computations, are also available in the aforementioned Supplementary Material web page expectedcophdist/ E Y (Dn) E U (Dn) Table : Values of E Y (Dn ) and E U (Dn ) for n 3,..., 7. They agree with those given by our formulas. The formulas for E Y (D n) and E U (D n) grow in different orders: E Y (D n) is in Θ(n ), while E U (D n) is in Θ(n 3 ). Therefore, they can be used to test the Yule and the uniform models as null stochastic models of evolution for collections of phylogenetic trees reconstructed by different methods. We have reported on a first experiment of this type, which reinforces the conclusion that real world phylogenetic trees (that is, those contained in TreeBASE) are not consistent with the Yule model of evolution. We plan to report in a future paper on more 8

9 extensive tests on stochastic models of evolutionary processes, including Ford s α-model [] and Aldous β-model []. Acknowledgements The research reported in this paper has been partially supported by the Spanish government and the UE FEDER program, through projects MTM and TIN E. References [] M. Abramowitz, I. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover (964). [] D. Aldous. Probability distributions on cladograms. Random discrete structures, IMA Vol. Math. Appl. 76 (Springer,996), 8. [3] M. G. B. Blum, O. François, Which random processes describe the Tree of Life? A large-scale study of phylogenetic tree imbalance. Sys. Biol. 55 (006), [4] J. Brown, Probabilities of evolutionary trees. Syst. Biol. 43 (994), [5] D. Bryant, M. Steel, Computing the distribution of a tree metric. IEEE/ACM Trans. Comp. Biol. Bioinf. 6 (009), [6] G. Cardona, A. Mir, F. Rosselló, The expected value under the Yule model of the squared path-difference distance. Appl. Math. Let. 5 (0), [7] G. Cardona, A. Mir, F. Rosselló, Exact formulas for the variance of several balance indices under the Yule model. To appear in J. Math. Bio. http: //dx.doi.org/0.007/s [8] G. Cardona, A. Mir, L. Rotger, F. Rosselló, D. Sánchez, Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf. BMC Bioinformatics (03) 4:3 [9] L. L. Cavalli-Sforza, A. Edwards, Phylogenetic analysis. Models and estimation procedures. Am. J. Hum. Genet., 9 (967), [0] J. Felsenstein, Inferring Phylogenies. Sinauer Associates Inc., 004. [] D. Ford. Probabilities on cladograms: Introduction to the alpha model. arxiv:math/0546 [math.pr] (005). [] Hypergeometric3F/03/07/0. 9

10 [3] HypergeometricF/03/03/0. [4] E. Harding, The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Prob. 3 (97), [5] S. B. Heard, Patterns in Tree Balance among Cladistic, Phenetic, and Randomly Generated Phylogenetic Trees. Evolution 46 (99), [6] M. Hendy, C. Little, D. Penny, Comparing Trees with Pendant Vertices Labelled. SIAM J. Applied Math. 44 (984), [7] A. Mir, F. Rosselló, The mean value of the squared path-difference distance for rooted phylogenetic trees. J. Math. Anal. Appl. 37 (00), [8] A. Mir, F. Rosselló, L. Rotger, A new balance index for phylogenetic trees. Math. Biosc. 4 (03), [9] A. Mooers, S. B. Heard, Inferring evolutionary process from phylogenetic tree shape. Quart. Rev. Biol. 7 (997), [0] V. Morell, TreeBASE: the roots of phylogeny. Science 73 (996), [] M. Petkovsek, H. Wilf, D. Zeilberger, A B. AK Peters Ltd. (996). Available online at [] D. E. Rosen, Vicariant Patterns and Historical Explanation in Biogeography. Syst. Biol. 7 (978), [3] M. J. Sackin, Good and bad phenograms. Sys. Zool, (97), 5 6. [4] R. Sokal, F. Rohlf, The Comparison of Dendrograms by Objective Methods. Taxon (96), [5] M. Steel, Distribution of the symmetric difference metric on phylogenetic trees. SIAM J. Discr. Math. (988), [6] M. Steel, A. McKenzie, Distributions of cherries for two models of trees. Math. Biosc. 64 (000), 8 9. [7] M. Steel, A. McKenzie, Properties of phylogenetic trees generated by Yuletype speciation models. Math. Biosc. 70 (00), 9. [8] M. A. Steel, D. Penny, Distributions of tree comparison metrics some new results, Syst. Biol. 4 () (993) 6 4. [9] G. U. Yule, A mathematical theory of evolution based on the conclusions of Dr J. C. Willis. Phil. Trans. Royal Soc. (London) Series B 3 (94), 87. 0

11 Appendix: Proof of Proposition 5 Proof of Proposition 5.(a) For every T T n, let Φ(T ) S(T ) + Φ(T ) i j n ϕ T (i, j), and let Φ n be the random variable that chooses a tree T T n and computes Φ(T ). We have that E Y (Φ n ) E Y (S n ) + E Y (Φ n ) n(n ). To compute E Y (Φ () n ), we shall use an argument similar to the one used in the proof of [6, Prop. 3]. Notice that E Y (Φ () n ) Φ () (T ) py (T ) T T n S k {,...,n} S k k T k T (S k ) T n k T (Sc k ) Φ () (Tk T n k) p Y (T k T n k) Now, on the one hand, we have the following easy lemma on P Y (T T ): see [7, Lem. ]. Lemma 6. Let S k {,..., n} with S k k, let T k T (S k ) and T n k T (Sk c ). Then, P Y (T k T n k) (n ) ( n)p (T k )P (T n k). k On the other hand, we have the following recursive expression for Φ () (T T ). Lemma 7. Let S k {,..., n} with S k k, let T k T (S k ) and T n k T (Sk c ). Then Ç å Ç å Φ () (T k T n k) Φ () (T k )+Φ () k + n k + (T n k)+φ(t k )+Φ(T n k)+ +. Proof. Let us assume, without any loss of generality, that S {,..., m} and S {m +,..., n}. Then ϕ Tk (i, j) + if i, j k ϕ Tk T j) ϕ n k(i, T (i, j) + if k + i, j n n k 0 otherwise

12 and therefore Φ () (T k T n k) ϕ Tk T (i, j) n k i j n (ϕ Tk (i, j) + ) + (ϕ T (i, j) + ) n k i j k k+ i j n (ϕ Tk (i, j) + ϕ Tk (i, j) + ) + i j k Ç å k+ i j n Φ () k + (T k ) + Φ(T k ) + + Φ () (T n k) + Φ(T n k) + (ϕ T (i, j) + ϕ n k T (i, j) + ) n k Ç å n k +. So, if we set f(a, b) ( ) ( a+ + b+ ), we have that E Y (Φ () n ) Ç å n k T k T k ] T n k T n k +f(k, n k) (n ) ( n)p Y (T k )P Y (T n k) k [ Φ () (T k )P Y (T k )P Y (T n n k) T k T n k + Φ () (T n k)p Y (T k )P Y (T n k) T k T n k + Φ(T k )P Y (T k )P Y (T n k) T k T n k + Φ(T n k)p Y (T k )P Y (T n k) T k T n k + ] f(k, n k)p Y (T k )P Y (T n k) T k n T n k [ Φ () (T k ) + Φ () (T n k) + (Φ(T k ) + Φ(T n k)) [ Φ () (Tk )P Y (T k ) + Φ () (T n k)p Y (T n k) T k T n k + Φ(T k )P Y (T k ) + Φ(T n k)p Y (T n k) + f(k, n k) T k T n k [ E Y (Φ () k ) + E Y (Φ () n n k) + E Y (Φ k ) + E Y (Φ n k ) Ç å Ç å k + n k + ] + + n E Y (Φ () k ) + 4 n E Y (Φ k ) + n(n + ). 3 ]

13 In particular E Y (Φ ) n n and therefore E Y (Φ () n ) n n + n n E Y (Φ () k ) + 4 n n n n 4 n n E Y (Φ k ) + n(n ). 3 E Y (Φ () k ) + n E Y (Φ () ) E Y (Φ k ) + 4 n E Y (Φ ) + n n n(n ) + n 3 n n E Y (Φ () ) + n E Y (Φ () ) + 4 n E Y (Φ ) + n n n E Y (Φ () ) + 5n 8. Setting x n E Y (Φ () n )/n, this recurrence becomes x n x n and the solution of this recursive equation with x E Y (Φ () ) 0 is x n n k ( 5 8 ) 5(n ) 8(H n ) 5n + 3 8H n k from where we deduce that E Y (Φ () n ) 5n + 3n 8nH n, as we claimed. Proof of Proposition 5.(b) To compute E U (Φ () n ), we shall use an argument similar to the one used in [7]. For every k,..., n, let f k,n d k,n {T T n ϕ T (, ) k} {T T n ϕ T (i, j) k} for every i < j n {T T n δ T () k} {T T n δ T (i) k} for every i n (where X denotes the cardinal of the set X). Lemma 8. For every n, E U (Φ () ( n ) n k d k,n + (n 3)!! Ç å n n k f k,n ) 3

14 Proof. Under the uniform model, E U (Φ () T T n ) n Φ () (T ), (n 3)!! where Φ () (T ) ϕ T (i, j) T T n T T n i j n δ T (i) + ϕ T (i, j) i n T T n i<j n T T n k {T T n δ T (i) k} i n + i n i<j n k d k,n + n k d k,n + i j n n k {T T n ϕ T (i, j) k} Ç n i<j n å n n k f k,n k f k,n. T T n ϕ T (i, j) A formula for d k,n was obtained in the proof of [8, Lem. ]: (n k 3)! k d k,n. () (n k )!n k As far as f k,n goes, we have the following result. In it, and henceforth, p F q denotes the (generalized) hypergeometric function defined by Å ã a,..., a p pf q b,..., b ; z (a ) k (a p ) k zk q (b ) k (b q ) k k!, k 0 where (a) 0 and (a) k : a (a + ) (a + k ) for k. Lemma 9. For every n, f 0,n (n 4)!! and Å ã (n k 5)!k, n, k + n f k,n (n k 4)!! 3F k+5 k n, n + 3 ; for every k,..., n. Proof. Let us start by proving f 0,n (n 4)!! by induction on n. It is clear that f 0, ( 4)!!. Assume now that f 0, ((n ) 4)!!. Every phylogenetic tree T with n leaves such that ϕ T (, ) 0, that is, where LCA T (, ) is the root, is obtained by taking a phylogenetic tree T with n 4

15 leaves such that ϕ T (, ) 0 and adding a new pendant edge, ending in the leaf n, to any edge in T. Then, since there are f 0, (n 6)!! trees T T such that ϕ T (, ) 0, and each one of them has (n ) edges where we can add the new edge, we obtain f 0,n (n 4)(n 6)!! (n 4)!!. Now, to compute f k,n for k, we shall study the structure of a tree T T n such that ϕ T (, ) k; to simplify the notations, let us denote by x the node LCA T (, ), which has depth k, and by T 0 the subtree of T rooted at x. Then, on the one hand, T 0 is a phylogenetic tree on a subset S 0 {,..., n} containing,, and since its root x is the LCA of and in T, we have that ϕ T0 (, ) 0. On the other hand, there is a path (r v, v, v 3,..., v k+ x) in T from r to x. For every j,..., k, let T j be the subtree rooted at the child of v j other than v j+ ; see Fig. 3. So, the tree T is determined by: A number 0 m n k, so that m + will be the number of leaves of the phylogenetic tree T 0 rooted at LCA T (, ) A subset {i,..., i m } of {3,..., n}. There are ( ) n m such subsets. A phylogenetic tree T 0 on {,, i,..., i m } such that ϕ T0 (, ) 0. There are f 0,m+ (m)!! such trees. An ordered k-forest, that is, an ordered sequence of phylogenetic trees (T, T,..., T k ) such that k i L(T i) {,..., n} {,, i,..., i m }. The number of such ordered k-forests is (see, for instance, [7, Lem. ]) (n m k 5)!k (n m k )! n m k.... T T x T 0 T k Figure 3: The structure of a tree T with ϕ T (, ) k. 5

16 This shows that f k,n can be computed as f k,n k n k m0 n k m0 n k m0 (n )!k n k (number of ways of choosing {i,..., i m }) (number of trees in T m+ with ϕ T (, ) 0) (number of ordered k-forests on n m leaves) Ç å n (m)!! m (n m k 5)!k (n m k )! n m k (n )!m! m (n m k 5)! m!(n m )!(n m k )! n m k n k m0 Now, taking into account that () m m! 4 m (n m k 5)! (n m )!(n m k )! m (n )! ( n) m ( ) (n m )! m (n k )! (k + n) m ( ) Å ã k + 5 n Å ã k n + 3 we have that Å ã, n, k + n 3F k+5 k n, n + 3 ; m 0 m 0 n k m0 m m (n k m )! ( )m (n k 5)!! m (n k m 5)!!, ( )m (n k 6)!! m (n k m 6)!! () m ( n) m (k + n) m ( k+5 n) m ( k n + 3) m m! m!(n )!(n k )! m (n k m 5)!! m (n k m 6)!! (n m )!(n k m )!(n k 5)!!(n k 6)!!m! (n )!(n k )!(n k m 5)! m (n m )!(n k m )!(n k 5)! (n )!(n k )! (n k 5)! from where we deduce that n k m0 n k m0 (n k m 5)!4 m (n m )!(n k m )! (n k m 5)!4 m (n m )!(n k m )! Å ã (n k 5)!, n, k + n (n )!(n k )! 3 F k+5 k n, n + 3 ; 6

17 and hence f k,n as we claimed. n k (n )!k 4 m (n m k 5)! n k (n m )!(n m k )! m0 Å ã (n )!k n k (n k 5)!, n, k + n (n )!(n k )! 3 F k+5 k n, n + 3 ; Å ã (n k 5)!k, n, k + n (n k 4)!! 3F k+5 k n, n + 3 ; We must compute now the sums k d k,n, n k f k,n. To do that, we shall use the following auxiliary lemma. Lemma 0. For every n and m, let Then, U n,m n k0 k m (n + k )! k! k. U n,0 (n 4)!! U n, (n )(n 4)!! (n 3)!! U n, (n )(n 4)!! (n )(n 3)!! U n,3 (n 3 + 3n 3n )(n 4)!! (3n + n )(n 3)!! Proof. The proof of these identities is standard, using well known equalities for hypergeometric functions and the lookup algorithm given in [, p. 36]. We shall prove in detail the identity for m, and we leave the details of the rest to the reader. Notice that U n, n k0 k0 k n (n + k )! k (n + k )! k! k k! k (k + ) (n + k )! (k + )! k+ kn n 3 k0 (k + ) (n + k )! (k + )! k+ (k + ) (n + k )! (k + )! k+ Set X n k0 (k + ) (n + k )! (k + )! k+, Y n kn (k + ) (n + k )! (k + )! k+ We compute now these two summands. 7

18 As to X n, If we set we have that X n (n )! k0 (k + ) (n + k )! (n )!(k + )! k t k (k + ) (n + k )! (n )!(k + )! k, t k+ (k + )(k + n) t k (k + ) and therefore, by the lookup algorithm [, p. 36], we have that Å (n )!, n X n F ; ã Å ã (n )! n n, F ; (using (5.3.4) in [, p. 559]) (n )! (n) k ( ) k ( )k () k k! k 0 ( (n )! (n)0 ( ) 0 ( )0 + (n) ) ( ) ( ) () 0 0! ()! (n )! (n + ) As to Y n, If we take now we have that (k + n ) (n + k 3)! Y n (k + n )! k+ k0 (n ) (n 3)! (k + n ) (n + k 3)! (n )! k0 (k + n )! k () (n 3)! ()! t k (k + n ) (n + k 3)! (k + n )! k () (n 3)! ()! t k+ (n + k)(n + k ) t k (k + n ) and therefore, again by the lookup algorithm [, p. 36], we have that Y n (n Å ) (n 3)!, n, n (n )! 3F n, n ; (n ) (n 3)! [ Å n, (n )! F n ; ã ã + n F Å n, n ; ã] (using []). 8

19 Now Å n, F n ; ã Å n, F n (n ) Γ(n ) [ Γ(n ) Γ(n ) Γ(0) (n ) (n )! [ (n )! + (n 3)! (n )! ã ; + Γ(n) Γ() + Γ(n ) (n 3)!! ] (using (5.3.4) in [, p. 559]) ] Γ( ) (using [3]) + (n 3)!! Å n, F n ; ã Å ã, n F n ; (using (5.3.4) in [, p. 559]) Γ(n) ( Γ(n 4 ) ( n) Γ(n ) Γ( ) + Γ(n + ) ) Γ( 3 ) + Γ(n) (using [3]) n (n )! ( (n 3)!! (n )!! ) (n )! + + (n )! Therefore, (n )! ((n 3)!! + (n )!! + n (n )!) (n )! Y n (n ) (n 3)! [ (n )! (n )! + (n 3)!! + n (n )! ] ((n 3)!! + (n )!! + n (n )!) (n )! n (n + )(n )! + (n )!! and finally as we claimed. U n, X n Y n n (n + )(n )! (n )!! (n )(n 4)!! (n )(n 3)!! Lemma. For every n, k d k,n (4n )(n 3)!! 3(n )!!. Proof. By equation (), k d k,n k 3 n (n k 3)! (n k )! n k k0 (n k ) 3 (n + k )! k! k (n ) 3 U n,0 3(n ) U n, + 3(n )U n, U n,3 (n ) 3 (n 4)!! 3(n ) ( (n )(n 4)!! (n 3)!! ) +3(n ) ( (n )(n 4)!! (n )(n 3)!! ) ( (n 3 + 3n 3n )(n 4)!! (3n + n )(n 3)!! ) (4n )(n 3)!! 3(n )(n 4)!!. 9

20 Lemma. For every n, n k f k,n 3 (4n + )(n 3)!! 3 (n )!!. Proof. To simplify the notations, set S n n proof of Lemma 9, and therefore S n Set now S n f k,n (n )! n (n )!k n k n n n k m0 n k k k 3 m0 k f k,n. As we have seen in the 4 m (n m k 5)! (n m )!(n m k )! 4 m (n k m 5)! (n k )!(n k m )! (n )! n k n k k 3 4 n k m (k + m )! (k + m)!m! m0 ( n (n )! n k 3 n k Ç å) k k + k + m 4 m m k + m ( m (n )! n 6 n n + n + k 3 n k Ç å) k + m k 4 m m k + m m n k 3 n k k m 4 m m Since S 3 0, we have that Ç å k + m k + m n 3 S n (S p+ S p) p3 k 3 n k k m 4 m m Ç å k + m k + m and S p+ S p p 3 Ç (p )3 k 3 p k 3 p + k (p k )4 p k p (p )3 p + p (p )3 p + (p )3 p + p 3 p (p )! p (p )! å k 3 (p k 3)! k (p k )(p )!(p k )! p 3 p 3 k 3 (p k 3)! k (p k )! (p k ) 3 (p + k )! k p+ (k + )! 0

21 (p )3 p + (p )3 p + p (p )! p k [ p (p k ) 3 (p + k )! k k! (p k ) 3 (p + k )! p (p )! k k! k0 (p ) 3 (p )! ] (p )3 (p )! p (p ) (p k ) 3 (p + k )! p + p (p )! k k! k0 (p ) ( ) p + (4p )(p 3)!! 3(p )!! (p )!! (p ) (p 3)!! p + (4p ) (p )!! 3 Therefore ( S n (p 3)!! (4p ) (p )!! p3 (p ) ) p 3 (by Lemma ) Now, applying Gosper s algorithm [, p. 77] we have that (p 3)!! (4p ) (p )!! ( Ç å n 3 ) 3 n+ 3(4n 3n ) 39 n n p3 and then S n ( Ç å n 3 ) 3 n+ 3(4n 3n ) 39 n n n 8(n + ) n+ 3(n 3) n + (4n + )(n 3)!! 3(n + ) +. n 3(n 4)!! Finally, S n Å ã (n )! n 6 n + n + S n 3(n )! n (4n + )(n 3)!! (4n + )(n 3)!! 3 (n )!!.

22 Summarizing, by Lemmas 8,, and, we have that as we claimed. ( Ç å n n ) n k d k,n + k f k,n (n 3)!! [ n((4n )(n 3)!! 3(n )!!) (n 3)!! Ç å n ( + 3 (4n + )(n 3)!! 3 )] (n )!! E U (Φ () n ) 6 n(4n + n 7) 3n(n + 3) 4 (n )!! (n 3)!!

arxiv: v1 [q-bio.pe] 13 Aug 2015

arxiv: v1 [q-bio.pe] 13 Aug 2015 Noname manuscript No. (will be inserted by the editor) On joint subtree distributions under two evolutionary models Taoyang Wu Kwok Pui Choi arxiv:1508.03139v1 [q-bio.pe] 13 Aug 2015 August 14, 2015 Abstract