Maximum Agreement Subtrees
Seth Sullivant
North Carolina State University
March 24, 2018

Phylogenetics

Problem: Given a collection of species, find the tree that explains their history.

[Figure: phylogenetic tree from Yeates, Meier, Wiegmann, 2015.]

Rooted Binary X-Trees

Definition: A rooted tree $T$ has a distinguished vertex $\rho$, the root. A rooted binary phylogenetic X-tree $T$ is a binary tree with a distinguished root vertex whose leaves are labeled by $X$.

[Figure: a rooted binary phylogenetic X-tree with leaf labels 1 through 8.]

In phylogenetics, we only have access to data on extant (not extinct) species. We have no data on the species corresponding to internal nodes of the tree.

Induced Subtrees

Let $X$ be a label set with $n = |X|$, and let $T$ be a binary rooted phylogenetic X-tree. Given $S \subseteq X$, $T|_S$ is the binary restriction tree: keep only the leaves in $S$ and suppress the resulting vertices of degree two.

[Figure: a tree on leaves 1 through 8 and its restriction to S = {2, 3, 5}.]

Definition: Given binary rooted phylogenetic X-trees $T_1$ and $T_2$,
$$\mathrm{MAST}(T_1, T_2) = \max\{\#S : S \subseteq X \text{ and } T_1|_S = T_2|_S\}.$$
This is the size of a maximum agreement subtree.
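
The definition can be checked directly on small examples. The following is a minimal sketch, assuming a tree is encoded as a nested pair of integers (an encoding chosen here for illustration, not one used in the talk). It computes restrictions and finds MAST by trying every subset, so it is only practical for a handful of leaves.

```python
from itertools import combinations

# Assumed encoding: a tree is an int (a leaf label) or a pair (left, right).

def leaves(tree):
    """Return the set of leaf labels of a nested-pair tree."""
    if isinstance(tree, int):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])

def restrict(tree, keep):
    """Return the restriction T|_S in canonical form, or None if empty."""
    if isinstance(tree, int):
        return tree if tree in keep else None
    left = restrict(tree[0], keep)
    right = restrict(tree[1], keep)
    if left is None:
        return right  # this vertex would have degree two, so suppress it
    if right is None:
        return left
    # Order children by smallest leaf so that equal trees compare equal.
    return tuple(sorted((left, right), key=lambda t: min(leaves(t))))

def mast_brute_force(t1, t2):
    """Size of the largest S with T1|_S == T2|_S (exponential time)."""
    labels = sorted(leaves(t1))
    for size in range(len(labels), 0, -1):
        for subset in combinations(labels, size):
            keep = set(subset)
            if restrict(t1, keep) == restrict(t2, keep):
                return size
    return 0

t1 = (((1, 2), 3), 4)
t2 = (((1, 3), 2), 4)
print(restrict(t1, {2, 3, 4}))
print(mast_brute_force(t1, t2))
```

Sorting children by their smallest leaf is one simple way to compare rooted trees without regard to left/right order; any other canonical form would work as well.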

Example

[Figure: two trees $T_1$ and $T_2$ on leaves 1 through 8 with $\mathrm{MAST}(T_1, T_2) = 3$, together with agreement subtrees on the leaf sets {2, 3, 5}, {5, 6, 7}, and {2, 4, 7}.]

Theorem (Steel-Warnow 1993): There is an $O(n^2)$ algorithm to compute $\mathrm{MAST}(T_1, T_2)$ of binary rooted phylogenetic X-trees.
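
To give a feel for how a polynomial-time computation is possible, here is a sketch of a standard dynamic program over pairs of vertices. It is written under the assumption that trees are nested pairs of integer leaf labels, and it is not claimed to be the Steel-Warnow algorithm as published. For vertices a of $T_1$ and b of $T_2$, the best agreement either matches the two children of a with the two children of b (in one of two ways) or discards one child subtree entirely.

```python
from functools import lru_cache

# Assumed encoding: a tree is an int (a leaf label) or a pair (left, right).

@lru_cache(maxsize=None)
def leaf_set(tree):
    """Frozen set of leaf labels below a vertex."""
    if isinstance(tree, int):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])

def mast(t1, t2):
    """MAST of two rooted binary trees on the same label set."""

    @lru_cache(maxsize=None)
    def go(a, b):
        if isinstance(a, int):
            # A single leaf agrees with b exactly when b contains its label.
            return 1 if a in leaf_set(b) else 0
        if isinstance(b, int):
            return 1 if b in leaf_set(a) else 0
        (al, ar), (bl, br) = a, b
        return max(
            go(al, bl) + go(ar, br),  # match children straight across
            go(al, br) + go(ar, bl),  # match children crosswise
            go(al, b), go(ar, b),     # discard one child of a
            go(a, bl), go(a, br),     # discard one child of b
        )

    return go(t1, t2)

print(mast((((1, 2), 3), 4), (((1, 3), 2), 4)))
```

Memoizing on pairs of subtrees means each pair of vertices is evaluated once, which is where the quadratic number of subproblems comes from. The recursion is fine for small trees; a deep comb tree would need an iterative version.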

What is the distribution of $\mathrm{MAST}(T_1, T_2)$?

Problem: Determine the distribution of $\mathrm{MAST}(T_1, T_2)$ for reasonably nice probability distributions on rooted binary trees, such as the uniform distribution and the Yule-Harding distribution.

Remark: Simulations [Bryant-McKenzie-Steel 2003] suggest that under both the uniform distribution and the Yule-Harding distribution
$$E[\mathrm{MAST}(T_1, T_2)] \sim c\sqrt{n},$$
where $n = |X|$, for some constant $c$ depending on the distribution.

Motivation: Comparing New Phylogenetic Methods

Suppose we come up with a new phylogenetic method. This method takes a data set $D$ and constructs the tree $M(D)$. If we know the correct tree $T$, we can evaluate the method by computing $\mathrm{MAST}(T, M(D))$. If $\mathrm{MAST}(T, M(D))$ is consistently small (for lots of different $D$), then we conclude that the new method does not work well.

How small is small? Is it smaller than what you would expect to see by random chance? To answer this we need to know the distribution of $\mathrm{MAST}(T, T')$ for a random tree $T'$.

Motivation: Cospeciation

Let $T_H$ be a phylogenetic tree of host species, and $T_P$ a phylogenetic tree of parasite species. Hosts and parasites are paired, so $T_H$ and $T_P$ have the same label set.

If $\mathrm{MAST}(T_H, T_P)$ is large, reject the hypothesis that $T_H$ and $T_P$ evolved independently; that is, large $\mathrm{MAST}(T_H, T_P)$ $\Rightarrow$ cospeciation.

To perform the hypothesis test, we need the distribution of $\mathrm{MAST}(T_1, T_2)$ for random trees under the null hypothesis of independence.

[Figure source: Hafner, M.S., Nadler, S.A. (1988) Nature 332: 258-259.]

Motivation: Cool Math

Suppose that both $T_1$ and $T_2$ are comb trees.

[Figure: two comb trees on 9 leaves, one with leaves in the order 1, 2, ..., 9 and one with leaves in the order $w_1, w_2, \ldots, w_9$.]

A maximum agreement subtree corresponds to a longest increasing subsequence of the permutation $w = w_1 w_2 w_3 w_4 w_5 w_6 w_7 w_8 w_9$, whose length is denoted $L(w)$.

$\mathrm{MAST}(T_1, T_2)$ for uniformly random comb trees is equivalent to $L(w)$ for uniformly random permutations $w \in S_n$.

Theorem (Baik-Deift-Johansson 1999):
$$E[L(w)] = 2\sqrt{n} - c\,n^{1/6} + o(n^{1/6}), \qquad c \approx 1.77108,$$
$$\frac{L(w) - 2\sqrt{n}}{n^{1/6}} \to \text{Tracy-Widom random variable}.$$
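
For experimenting with $L(w)$, the length of a longest increasing subsequence can be computed in $O(n \log n)$ time with the patience-sorting idea. The sketch below is illustrative only; the function names and the small simulation (sample count, seed) are assumptions made here, not part of the talk.

```python
import random
from bisect import bisect_left
from math import sqrt

def lis_length(w):
    """Length L(w) of a longest strictly increasing subsequence of w."""
    tails = []  # tails[i] = smallest possible last value of an increasing
                # subsequence of length i + 1 seen so far
    for value in w:
        i = bisect_left(tails, value)
        if i == len(tails):
            tails.append(value)  # value extends the longest subsequence
        else:
            tails[i] = value     # value gives a smaller tail for length i + 1
    return len(tails)

def average_lis(n, samples, seed=0):
    """Monte Carlo estimate of E[L(w)] for uniform w in S_n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        w = list(range(1, n + 1))
        rng.shuffle(w)
        total += lis_length(w)
    return total / samples

n = 400
print(average_lis(n, samples=200), "versus 2*sqrt(n) =", 2 * sqrt(n))
```

For moderate n the simulated average sits somewhat below $2\sqrt{n}$, consistent with the negative $n^{1/6}$ correction term in the theorem.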

Random Trees

Biologists are interested in models for random trees as models for speciation processes.

Uniform distribution: Select a tree uniformly from all $(2n-3)!!$ rooted binary phylogenetic trees.

Yule-Harding distribution: Grow a random tree by successively splitting leaves selected uniformly at random, then apply leaf labels at random.

β-splitting model, α-splitting model, etc.

[Figure: a rooted binary tree with leaf labels 1, 5, 3, 4, 2.]

Question: How well do the different random tree models match the shape and structure of phylogenetic trees occurring in nature?
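
Both models are easy to sample from. The following sketch assumes trees are nested pairs of integer labels (an encoding chosen here for illustration). The Yule-Harding sampler follows the description above. The uniform sampler uses the standard observation that attaching leaf m + 1 to a uniformly chosen edge of a tree on m leaves (counting an edge above the root) gives 1 · 3 · 5 ··· (2n − 3) = (2n − 3)!! equally likely outcomes.

```python
import random

# Assumed encoding: a tree is an int (a leaf label) or a pair (left, right).

def count_nodes(tree):
    """Number of vertices (leaves and internal) in the tree."""
    if isinstance(tree, int):
        return 1
    return 1 + count_nodes(tree[0]) + count_nodes(tree[1])

def leaves(tree):
    """List of leaf labels, left to right."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def graft(tree, index, new_leaf):
    """Attach new_leaf on the edge above the vertex with preorder number index."""
    if index == 0:
        return (tree, new_leaf)
    left, right = tree
    left_size = count_nodes(left)
    if index <= left_size:
        return (graft(left, index - 1, new_leaf), right)
    return (left, graft(right, index - 1 - left_size, new_leaf))

def uniform_tree(n, rng):
    """Uniform rooted binary tree on labels 1..n via random edge insertion."""
    tree = 1
    for leaf in range(2, n + 1):
        tree = graft(tree, rng.randrange(count_nodes(tree)), leaf)
    return tree

def split_leaf(tree, target, new_leaf):
    """Replace the leaf `target` by the cherry (target, new_leaf)."""
    if isinstance(tree, int):
        return (tree, new_leaf) if tree == target else tree
    return (split_leaf(tree[0], target, new_leaf),
            split_leaf(tree[1], target, new_leaf))

def relabel(tree, mapping):
    """Apply a label mapping to every leaf."""
    if isinstance(tree, int):
        return mapping[tree]
    return (relabel(tree[0], mapping), relabel(tree[1], mapping))

def yule_harding_tree(n, rng):
    """Split uniformly random leaves, then assign labels 1..n at random."""
    tree = 1
    for leaf in range(2, n + 1):
        tree = split_leaf(tree, rng.randint(1, leaf - 1), leaf)
    labels = list(range(1, n + 1))
    rng.shuffle(labels)
    return relabel(tree, dict(zip(range(1, n + 1), labels)))

rng = random.Random(2018)
print(uniform_tree(6, rng))
print(yule_harding_tree(6, rng))
```

Passing an explicit `random.Random` instance keeps the samples reproducible, which is convenient when comparing the two models in simulations.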

Properties of Random Trees

Proposition: Both Yule-Harding and uniform random trees satisfy exchangeability and sampling consistency.

Exchangeability: The probability of a tree does not change when its leaf labels are permuted. [Figure: P(tree with leaves labeled 1 2 3 4 5) = P(same tree shape with leaves labeled 1 5 3 4 2).]

Sampling Consistency: If $T$ is a random tree and $S \subseteq X$, then $T|_S$ is a random tree from the same distribution on leaf label set $S$.

Theorem (Aldous): The expected depth of a uniformly random tree is $\Theta(\sqrt{n})$. The expected depth of a Yule-Harding random tree is $\Theta(\log n)$.

Conjecture About The Maximum Agreement Subtree

Conjecture: For any exchangeable, sampling consistent distribution on rooted binary phylogenetic X-trees,
$$E[\mathrm{MAST}(T_1, T_2)] = \Theta(\sqrt{n}),$$
where $n = |X|$.

Recall that $f(n) = \Theta(\sqrt{n})$ means that there are positive constants $c$ and $C$ such that $c\sqrt{n} \le f(n) \le C\sqrt{n}$. Note that the constants $c$ and $C$ might depend on the probability distribution.

We hope further that we can show that, asymptotically, $E[\mathrm{MAST}(T_1, T_2)] \sim d\sqrt{n}$ for some $d$ (depending on the distribution) as $n \to \infty$.

Upper Bounds

Theorem (BHLSSS): For any exchangeable, sampling consistent distribution on rooted binary phylogenetic trees, $E[\mathrm{MAST}(T_1, T_2)] = O(\sqrt{n})$.

Proof sketch for the uniform distribution. For $S \subseteq X$ let $X_S = 1$ if $T_1|_S = T_2|_S$, and $X_S = 0$ otherwise. Let
$$Y_{n,k} = \sum_{S \subseteq X,\ \#S = k} X_S = \text{number of agreement sets of size } k.$$
Then
$$E[Y_{n,k}] = \binom{n}{k} P(X_S = 1) = \binom{n}{k} \frac{1}{(2k-3)!!} \to 0 \quad \text{if } k > c\sqrt{n}.$$
Since $\mathrm{MAST}(T_1, T_2) \ge k$ exactly when $Y_{n,k} \ge 1$, Markov's inequality gives $P(\mathrm{MAST}(T_1, T_2) \ge k) \le E[Y_{n,k}]$. So $E[Y_{n,k}] \to 0$ for large $n$ $\Rightarrow$ $P(\mathrm{MAST}(T_1, T_2) > c\sqrt{n})$ goes to 0.
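
The first-moment formula is simple enough to evaluate exactly. This sketch (function names are hypothetical) computes $E[Y_{n,k}] = \binom{n}{k} / (2k-3)!!$ with rational arithmetic and finds the smallest $k$ at which the expected number of agreement sets drops below 1.

```python
from fractions import Fraction
from math import comb, sqrt

def odd_double_factorial(m):
    """m!! for odd m >= -1, using the convention (-1)!! = 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result

def expected_agreement_sets(n, k):
    """E[Y_{n,k}] for two independent uniform trees on n leaves."""
    return Fraction(comb(n, k), odd_double_factorial(2 * k - 3))

def first_k_below_one(n):
    """Smallest k with E[Y_{n,k}] < 1, or None if there is no such k <= n."""
    for k in range(1, n + 1):
        if expected_agreement_sets(n, k) < 1:
            return k
    return None

for n in (100, 400, 1600):
    k = first_k_below_one(n)
    print(n, k, k / sqrt(n))
```

A quick sanity check is $k = 2$: any two leaves form an agreement set, and the formula gives $\binom{n}{2} / 1!! = \binom{n}{2}$. Printing $k / \sqrt{n}$ for growing $n$ is one way to see the threshold scale like $\sqrt{n}$.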

Lower Bounds: Uniform Distribution

Theorem (BHLSSS): Under the uniform distribution on trees, $E[\mathrm{MAST}(T_1, T_2)] = \Omega(n^{1/8})$.

Proof sketch. The expected depth of a uniform random tree is $\Theta(n^{1/2})$. So with high probability there is a subset $S \subseteq X$ such that $T_1|_S$ is a comb tree of size $c\,n^{1/2}$.

Similarly, with high probability there is a subset $S' \subseteq S$ with $\#S' = \Theta(n^{1/4})$ such that $T_1|_{S'}$ and $T_2|_{S'}$ are both comb trees.

By sampling consistency, $T_1|_{S'}$ and $T_2|_{S'}$ are uniformly random comb trees with $\Theta(n^{1/4})$ leaves.

By Baik-Deift-Johansson, $T_1|_{S'}$ and $T_2|_{S'}$ have an agreement subtree of expected size $\Theta(n^{1/8})$.

Lower Bounds: Yule-Harding Distribution

Theorem (BHLSSS): Under the Yule-Harding distribution on trees, $E[\mathrm{MAST}(T_1, T_2)] = \Omega(n^{\alpha})$, where $\alpha$ is the positive root of $2^{2-\alpha} = (\alpha + 1)(\alpha + 2)$ ($\alpha \approx 0.344$).

From the Steel-Warnow algorithm, we see that for trees $T_1$ and $T_2$ of the following shapes

[Figure: $T_1$ with subtrees $A$ and $B$ below the root; $T_2$ with subtrees $C$ and $D$ below the root.]

$$\mathrm{MAST}(T_1, T_2) \ge \max\big(\mathrm{MAST}(A, C) + \mathrm{MAST}(B, D),\ \mathrm{MAST}(A, D) + \mathrm{MAST}(B, C)\big).$$
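
The exponent $\alpha$ can be checked numerically. The sketch below uses plain bisection on $f(\alpha) = 2^{2-\alpha} - (\alpha+1)(\alpha+2)$, which changes sign on $[0, 1]$ since $f(0) = 2$ and $f(1) = -4$; the helper names are chosen here for illustration.

```python
def f(alpha):
    """Difference of the two sides of 2^(2 - alpha) = (alpha + 1)(alpha + 2)."""
    return 2.0 ** (2.0 - alpha) - (alpha + 1.0) * (alpha + 2.0)

def bisect_root(func, lo, hi, tol=1e-12):
    """Root of func in [lo, hi], assuming func(lo) > 0 > func(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if func(mid) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
    return (lo + hi) / 2.0

alpha = bisect_root(f, 0.0, 1.0)
print(round(alpha, 3))
```

Since $f$ is strictly decreasing on this interval, the root found is the unique positive root referred to in the theorem.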

Lower Bounds: Fixed Tree Shape

Theorem (Misra-S.): Let $T_1$ and $T_2$ be uniformly random trees with the same tree shape with $n$ leaves. Then $E[\mathrm{MAST}(T_1, T_2)] = \Theta(\sqrt{n})$.

The idea comes from random comb trees and the connection to longest increasing subsequences.

[Figure: two comb trees on 9 leaves, one with leaves in the order 1, 2, ..., 9 and one with leaves in the order $w_1, w_2, \ldots, w_9$.]

The Simplest Proof of the $\Omega(\sqrt{n})$ Lower Bound

Let $w_1 w_2 \cdots w_n$ be a uniformly random permutation. Break this into blocks of length $k$:
$$B_1 B_2 \cdots B_{n/k} = (w_1 \cdots w_k)\,(w_{k+1} \cdots w_{2k}) \cdots (w_{n-k+1} \cdots w_n).$$

Let's call block $B_i$ awesome if one of the numbers $(i-1)k + 1, \ldots, ik$ appears in that block.

Note that the awesome blocks give an increasing subsequence (but probably not the longest): pick one such number from each awesome block, and both the positions and the values increase with $i$.

Example: 1 5 6 11 8 9 2 3 16 15 13 4 7 14 10 12

So if we can get a lower bound on the expected number of awesome blocks, that will give a lower bound on the length of the longest increasing subsequence.

A block $B_i$ is awesome if one of the numbers $(i-1)k + 1, \ldots, ik$ appears in that block.

The probability that $B_i$ is awesome is approximately
$$1 - \left(1 - \frac{k}{n}\right)^{k}.$$

The expected number of awesome blocks is then
$$\left(1 - \left(1 - \frac{k}{n}\right)^{k}\right) \frac{n}{k}.$$

Taking $k = \sqrt{n}$, we get that the expected number of awesome blocks is
$$\left(1 - \left(1 - \frac{1}{\sqrt{n}}\right)^{\sqrt{n}}\right) \sqrt{n} \approx (1 - e^{-1}) \sqrt{n}.$$

Proposition: The expected length of the longest increasing subsequence of a uniformly random permutation is bounded below by $(1 - e^{-1}) \sqrt{n}$.
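
The block argument can be traced on the 16-element permutation from the talk, assuming block length $k = 4 = \sqrt{16}$ (the talk does not state $k$ for that example, so this choice is an assumption). The sketch lists the awesome blocks, extracts the increasing subsequence they provide, and evaluates the approximate expected count from the formula above.

```python
def block(w, i, k):
    """Block B_i (1-based) of length k."""
    return w[(i - 1) * k : i * k]

def in_target_range(value, i, k):
    """True if value is one of (i-1)k + 1, ..., ik."""
    return (i - 1) * k < value <= i * k

def awesome_blocks(w, k):
    """Indices i of blocks B_i containing a value from their target range."""
    return [
        i
        for i in range(1, len(w) // k + 1)
        if any(in_target_range(v, i, k) for v in block(w, i, k))
    ]

def subsequence_from_awesome_blocks(w, k):
    """One value per awesome block; the result is increasing."""
    return [
        next(v for v in block(w, i, k) if in_target_range(v, i, k))
        for i in awesome_blocks(w, k)
    ]

def approx_expected_awesome(n, k):
    """The approximation (1 - (1 - k/n)^k) * n / k used above."""
    return (1 - (1 - k / n) ** k) * n / k

w = [1, 5, 6, 11, 8, 9, 2, 3, 16, 15, 13, 4, 7, 14, 10, 12]
print(awesome_blocks(w, 4))                   # [1, 2, 4]
print(subsequence_from_awesome_blocks(w, 4))  # [1, 8, 14]
print(approx_expected_awesome(16, 4))         # 2.734375
```

Here three of the four blocks are awesome, while a longest increasing subsequence of this permutation has length 7, which illustrates that the awesome blocks give a valid but usually not optimal subsequence.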

Extending These Ideas for Trees of Same Shape

Proposition: The leaf set of any tree on $n$ leaves can be grouped into at least $\frac{n}{4k}$ blobs of size between $k$ and $2k - 2$. The blobs yield a scaffold tree, which can force a structure for certain agreement subtrees between two trees of the same shape.

[Figure: example with $n = 17$, $k = 3$; two trees of the same shape on leaves 1 through 17 with their blobs.]

Let $T_1$, $T_2$ be two random trees with the same shape. They have corresponding blobs $B_1(T_i), \ldots, B_{n/4k}(T_i)$, $i = 1, 2$.

Call a blob $B_j$ awesome if $B_j(T_1) \cap B_j(T_2) \neq \emptyset$. The expected number of awesome blobs is at least
$$\frac{n}{4k} \left(1 - \left(1 - \frac{k}{n}\right)^{k}\right).$$

Awesome blobs give an agreement subtree between $T_1$ and $T_2$ (though not necessarily a maximum one), which is a subtree of the scaffold.

Taking $k = \sqrt{n}$ gives:

Proposition: If $T_1$ and $T_2$ are uniformly random trees with $n$ leaves and the same shape, then
$$E[\mathrm{MAST}(T_1, T_2)] \ge \frac{1 - e^{-1}}{4} \sqrt{n}.$$

Trees with Same Shape

Theorem (Misra-S.): Let $T_1$ and $T_2$ be uniformly random trees with the same tree shape with $n$ leaves. Then $E[\mathrm{MAST}(T_1, T_2)] = \Theta(\sqrt{n})$.

Conjecture (Martin-Thatte 2013): If $T_1$ and $T_2$ are arbitrary completely balanced trees with $n$ leaves, then $\mathrm{MAST}(T_1, T_2) \ge \sqrt{n}$.

Summary: Now We Know How Little We Know

Computing the distribution of $\mathrm{MAST}(T_1, T_2)$ is a generalization of hard problems in combinatorial probability.

Simulations suggest that $E[\mathrm{MAST}(T_1, T_2)] \sim c\,n^{1/2}$ for the uniform and Yule-Harding distributions.

We have upper bounds of the form $C n^{1/2}$ for all exchangeable, sampling consistent distributions.

We have lower bounds of the form $c\,n^{\alpha}$ for the uniform and Yule-Harding distributions, fixed shape, and some β-splitting examples.

Question: Is $E[\mathrm{MAST}(T_1, T_2)] \sim c\,n^{1/2}$ universal for all exchangeable, sampling consistent distributions? What else can be said about the distribution of $\mathrm{MAST}(T_1, T_2)$?

References

Aldous, David. Probability distributions on cladograms. Random discrete structures (Minneapolis, MN, 1993), 1-18, IMA Vol. Math. Appl., 76, Springer, New York, 1996.

Baik, Jinho; Deift, Percy; Johansson, Kurt. On the distribution of the length of the longest increasing subsequence of random permutations. J. Amer. Math. Soc. 12 (1999), no. 4, 1119-1178.

Bernstein, Daniel Irving; Ho, Lam Si Tung; Long, Colby; Steel, Mike; St. John, Katherine; Sullivant, Seth. Bounds on the expected size of the maximum agreement subtree. SIAM J. Discrete Math. 29 (2015), no. 4, 2065-2074.

Bryant, David; McKenzie, Andy; Steel, Mike. The size of a maximum agreement subtree for random binary trees. Bioconsensus (Piscataway, NJ, 2000/2001), 55-65, DIMACS Ser. Discrete Math. Theoret. Comput. Sci., 61, Amer. Math. Soc., Providence, RI, 2003.

Martin, Daniel M.; Thatte, Bhalchandra D. The maximum agreement subtree problem. Discrete Appl. Math. 161 (2013), no. 13-14, 1805-1817.