
Math 239: Discrete Mathematics for the Life Sciences, Spring 2008
Lecture 14, March 11
Lecturer: Lior Pachter. Scribe/Editor: Maria Angelica Cueto / C.E. Csar

14.1 Introduction

The goal of today's lecture is to prove the remaining theorem of last lecture characterizing tree additive dissimilarity maps. Namely,

Theorem 14.1. A dissimilarity map $\delta$ is tree additive (with respect to a given tree $T$) iff $\delta$ satisfies the weak four point condition.

As we discussed last time, rather than proving this theorem directly we will change our framework to dissimilarity maps with values in a given group $G$ and provide a more general result in this new setting. In our previous framework, the group $G$ is $(\mathbb{R}, +)$, with identity $1_G = 0$, and our original Theorem 14.1 will follow immediately from the general result. For general literature including the material discussed today, we refer the reader to the book by Semple and Steel.

14.2 General setting

In this section we provide analogous definitions for all concepts developed in Lecture 13.

Definition 14.2. Given a group $G$, a $G$-dissimilarity map is a map $\delta : X \times X \to G$ such that $\delta(x, x) = 1_G$ for all $x \in X$.

Note that in this definition we do not impose the symmetry condition required for dissimilarity maps. Two reasons justify this choice: $G$ in general may not be an abelian group, and the general framework for dissimilarity maps realized by trees will allow directed trees with directed edge weights, so that we may have $\delta(i, j) \neq \delta(j, i)$ for adjacent nodes $i, j \in V(T)$, where $\delta(i, j)$ denotes the weight of the edge $i \to j$.

Definition 14.3. $\delta$ is a tree dissimilarity map if there exists a tree $T$ (i.e. a phylogenetic $X$-tree) and a weight function $w : E(T) \to G$ such that
\[ \delta(x, y) = \prod_{e \,\in\, \text{path from } \phi(x) \text{ to } \phi(y)} w(e), \]
where $\phi : X \to V(T)$ is the corresponding labeling function and the product is the operation in the group $G$.

Since $(G, \cdot_G)$ may not be abelian, the product defining $\delta(x, y)$ must be taken in the order given by the path from $x$ to $y$. That is, if the path is $x = v_0 \to v_1 \to \cdots \to v_r \to y$, then
\[ \delta(x, y) = w(e_{v_0 v_1}) \cdot_G w(e_{v_1 v_2}) \cdot_G \cdots \cdot_G w(e_{v_r y}). \]
As we discussed before, we may assign a weight to each direction of every edge of $T$.

14.3 Main Theorem

We are now in a position to state the general result. To simplify notation we will drop the subscript $G$ from the group operation, but the reader should keep it in mind.

Theorem 14.4. (Main Theorem) Let $G$ be a group and $\delta$ a $G$-dissimilarity map on $X$. Consider the set
\[ H_\delta = \{\, \delta_{ik}\,\delta_{jk}^{-1}\,\delta_{jl}\,\delta_{il}^{-1} \;:\; i, j, k, l \in X \,\} \subseteq G. \]
If $\delta$ is a tree dissimilarity map then:
1. for all $i, j, k \in X$: $\delta_{ij}\,\delta_{kj}^{-1}\,\delta_{ki} = \delta_{ik}\,\delta_{jk}^{-1}\,\delta_{ji}$ (three point condition);
2. for all pairwise distinct $i, j, k, l \in X$ there exists some ordering of these points (i.e. a relabeling of them) such that $\delta_{ik}\,\delta_{jk}^{-1} = \delta_{il}\,\delta_{jl}^{-1}$ (four point condition).
Moreover, if $\delta$ satisfies the previous conditions and $H_\delta$ has no element of order 2 in $G$, then $\delta$ is a tree dissimilarity map.

Remark: As we discussed previously, Theorem 14.1 will be a consequence of the Main Theorem, since $(\mathbb{R}, +)$ has no elements of order 2.

The proof that the conditions of the Main Theorem are sufficient will mimic the arguments given for the corresponding theorem of last lecture. For this we will need to define the notion of an ultrametric $G$-dissimilarity map as well as an ultrametric tree representation. We will show that these two definitions are equivalent. The Main Theorem will be proven by induction on $|X|$. Given our $G$-dissimilarity map $\delta$ on $X$ satisfying conditions (1) and (2), we will construct a suitable ultrametric on $X' = X \setminus \{a\}$. This will give us an ultrametric tree representation $T'$, and we will need to attach a node $a$ and modify the weights of the edges of $T'$ in order to obtain our result.

The proof that the conditions are necessary will be immediate. We illustrate it with an example.

Example. Assume $\delta$ is a tree metric. We will denote the inverse element $\delta_{ij}^{-1}$ by a squiggly arrow:

[Figure: a squiggly arrow between the nodes $i$ and $j$.]

Each side of condition (1) is given by a diagram of weighted directed arrows.

[Figure: arrow diagrams on the leaves $i$, $j$, $k$ for the (LHS) and the (RHS) of condition (1).]

If we call $u$ the middle node and compute each side of the equation, both reduce to $\delta_{iu}\,\delta_{ui}$ after several cancellations. Namely,
\[ \text{(LHS)} = (\delta_{iu}\delta_{uj})(\delta_{ku}\delta_{uj})^{-1}(\delta_{ku}\delta_{ui}) = \delta_{iu}\,\delta_{ui} = (\delta_{iu}\delta_{uk})(\delta_{ju}\delta_{uk})^{-1}(\delta_{ju}\delta_{ui}) = \text{(RHS)}. \]

For condition (2), the two sides are again given by arrow diagrams.

[Figure: arrow diagrams on the leaves $i$, $j$, $k$, $l$ for the (LHS) and the (RHS) of condition (2).]

In this case we proceed as in condition (1). Call $u$ the node connecting the leaves $i$ and $j$. Both sides of the equation in condition (2) reduce to the same expression $\delta_{iu}\,\delta_{ju}^{-1}$, so condition (2) also holds.
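The cancellations in this example can be checked mechanically. The following is a minimal sketch, not part of the lecture: it assumes a hypothetical quartet tree (leaves i, j attached to u, leaves k, l attached to v), takes $G$ to be the non-abelian group $S_3$, and draws an independent random weight for each direction of each edge. The names `compose`, `tree_delta`, `three_point` and `four_point` are illustrative only.

```python
import itertools
import random

# Elements of S_3 as tuples p, where p[a] is the image of a.
IDENTITY = (0, 1, 2)
S3 = list(itertools.permutations(range(3)))

def compose(p, q):
    """Group product: apply p first, then q."""
    return tuple(q[a] for a in p)

def inverse(p):
    inv = [0] * len(p)
    for a, b in enumerate(p):
        inv[b] = a
    return tuple(inv)

# Hypothetical quartet tree: i, j hang off u; k, l hang off v; u and v are adjacent.
EDGES = [("i", "u"), ("j", "u"), ("u", "v"), ("k", "v"), ("l", "v")]
NEIGHBOURS = {}
for a, b in EDGES:
    NEIGHBOURS.setdefault(a, []).append(b)
    NEIGHBOURS.setdefault(b, []).append(a)

def path(x, y, seen=None):
    """Unique vertex path from x to y in the tree."""
    if x == y:
        return [x]
    seen = (seen or set()) | {x}
    for z in NEIGHBOURS[x]:
        if z not in seen:
            rest = path(z, y, seen)
            if rest:
                return [x] + rest
    return None

def random_weights(seed):
    """One independent S_3 weight per direction of every edge."""
    rng = random.Random(seed)
    w = {}
    for a, b in EDGES:
        w[(a, b)] = rng.choice(S3)
        w[(b, a)] = rng.choice(S3)
    return w

def tree_delta(w):
    """delta(x, y): ordered product of directed edge weights along the path."""
    delta = {}
    for x, y in itertools.product("ijkl", repeat=2):
        g = IDENTITY
        p = path(x, y)
        for a, b in zip(p, p[1:]):
            g = compose(g, w[(a, b)])
        delta[(x, y)] = g
    return delta

def three_point(d, i, j, k):
    lhs = compose(compose(d[i, j], inverse(d[k, j])), d[k, i])
    rhs = compose(compose(d[i, k], inverse(d[j, k])), d[j, i])
    return lhs == rhs

def four_point(d, i, j, k, l):
    return compose(d[i, k], inverse(d[j, k])) == compose(d[i, l], inverse(d[j, l]))

d = tree_delta(random_weights(seed=0))
print(all(three_point(d, a, b, c) for a, b, c in itertools.product("ijkl", repeat=3)))
print(four_point(d, "i", "j", "k", "l"))
```

Both checks hold for every choice of weights, since they only use the cancellations shown above. The four point check is run with the ordering in which i and j form a cherry, which is the ordering whose existence condition (2) asserts.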

Several cancellations give the equality of the two sides in conditions (1) and (2). By a similar method one can show that conditions (1) and (2) are necessary for $\delta$ to be a tree $G$-dissimilarity map. So we only need to prove the converse, provided that $H_\delta$ has no elements of order 2. As we anticipated earlier, the main idea will be to build an ultrametric from $\delta$ using the Gromov product:
\[ \delta_x(i, j) = \delta_{xi}\,\delta_{ji}^{-1}\,\delta_{jx} \qquad \text{for all } i, j \neq x. \]
Note that this function $\delta_x$ may not be a dissimilarity map, since $\delta_x(i, i)$ need not be $1_G$. However, the important fact is that $\delta_x$ will be an ultrametric in a more general setting that we explain later.

14.4 Ultrametric conditions and ultrametric tree representation

In this section we define the generalized notion of ultrametric conditions and ultrametric tree representations in the context of $G$-valued functions.

Definition 14.5. We say that $\delta : X \times X \to G$ satisfies the ultrametric conditions if
1. $\delta(i, j) = \delta(j, i)$ (i.e., $\delta$ is symmetric),
2. $|\{\delta(i, j), \delta(i, k), \delta(j, k)\}| \le 2$, i.e. at least two of these elements of $G$ are equal (weak three point condition),
3. (Technical condition for $H_\delta$) there do not exist four pairwise distinct points $i, j, k, l \in X$ with $\delta_{ij} = \delta_{jk} = \delta_{kl} \neq \delta_{jl} = \delta_{li} = \delta_{ik}$. In words, this says that things have to fit together nicely.

Before stating the next definition and the key result relating both notions, let us motivate this definition through an example.

Example. Assume $G = (\mathbb{R}, +)$. Suppose we are given a rooted $X$-tree $T$ with weights assigned to its edges, which corresponds to a tree metric $d$. Assume that the distance $d(\rho, x)$ from the root $\rho$ to each leaf $x$ is the same number.

[Figure: rooted tree with root $\rho$, leaves $a$, $b$, $c$, $d$, $e$ and weights on the edges.]

We claim that the edge weighting of the tree $T$ is equivalent to giving a weight to each internal node of $T$ in the following way. For each internal node $v$ we assign $w(v) = 2d(x, v)$ for any leaf $x$ below $v$. Likewise we assign the weight $w(\rho) = 2d(x, \rho)$ to the root $\rho$. Since the distance from $\rho$ to each leaf is the same, the number $w(v)$ does not depend on the choice of the leaf $x$. In our example:

[Figure: the same tree with weights on the internal nodes instead of on the edges.]

A first question one might ask is why we included the factor of 2 when defining $w(v)$. The reason is that if $v$ is the internal node of the cherry formed by the leaves $x, y$, then
\[ d(x, v) = w(e_{xv}) = \tfrac{1}{2}\, d(x, y) = w(e_{vy}) = d(v, y). \]
In our example we have $d(b, c) = 4 = 2 \cdot 2$. Moreover, in general we have the identity
\[ d(x, y) = \text{label (weight) of the least common ancestor of } x \text{ and } y. \]

So given $T$ and the distance function $d$ on $V(T)$ provided by the weights on $E(T)$, we can construct weights for the internal nodes of $T$. Conversely, assume that we have defined these weights on the internal nodes and want to construct the distance function $d$. It is obtained by assigning a weight to each edge as we ascend from the leaves of $T$ towards the root $\rho$, bearing in mind that $w(v) = 2d(x, v)$.

Since these two weighted representations of a tree $T$ are equivalent, we will define an ultrametric tree representation by simply labeling the internal vertices of a rooted tree via a function $t$, that is, $\delta(x, y) = t(\mathrm{l.c.a.}(x, y))$. Note that the weights on the internal nodes are free from any a priori restriction, so this notion can be generalized to take values in an arbitrary set, not necessarily in a fixed group.

Definition 14.6. An ultrametric tree representation is a rooted phylogenetic $X$-tree $T$, together with a labeling of the (internal) vertices of $T$ by elements of $G$ given by a function $t : V(T) \to G$.
We now state the key result for our Main Theorem. We omit the proof because it is the same as the one given for the analogous result in Lecture 13.

Theorem 14.7. Given an ultrametric tree representation $t$, the map $\delta$ defined by $\delta(x, y) = t(\mathrm{l.c.a.}(x, y))$ is an ultrametric. Conversely, given an ultrametric $\delta$ we can construct an ultrametric tree representation that realizes $\delta$.
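The first direction of Theorem 14.7 can be illustrated on a small case. The sketch below is an illustration under stated assumptions rather than the construction from the lecture: the tree `PARENT`, the labels `T_LABEL` (arbitrary strings, to stress that the labels need not lie in a group) and the function names are all hypothetical. It builds $\delta(x, y) = t(\mathrm{l.c.a.}(x, y))$ for distinct leaves and tests the three conditions of Definition 14.5 by brute force.

```python
from itertools import combinations, permutations

# Hypothetical rooted tree given by parent pointers; labels sit on internal nodes.
PARENT = {"a": "rho", "p": "rho", "q": "p", "s": "p",
          "b": "q", "c": "q", "d": "s", "e": "s"}
T_LABEL = {"rho": "red", "p": "green", "q": "blue", "s": "blue"}
LEAVES = ["a", "b", "c", "d", "e"]

def ancestors(v):
    """Vertices on the path from v up to the root, starting with v itself."""
    chain = [v]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lca(x, y):
    """Least common ancestor of x and y."""
    above_x = set(ancestors(x))
    return next(v for v in ancestors(y) if v in above_x)

def delta(x, y):
    """delta(x, y) = t(l.c.a.(x, y)), defined here for distinct leaves only."""
    return T_LABEL[lca(x, y)]

def is_ultrametric(delta, points):
    """Brute-force check of the three conditions of Definition 14.5."""
    symmetric = all(delta(x, y) == delta(y, x)
                    for x, y in combinations(points, 2))
    weak_three_point = all(
        len({delta(i, j), delta(i, k), delta(j, k)}) <= 2
        for i, j, k in combinations(points, 3))
    technical = not any(
        delta(i, j) == delta(j, k) == delta(k, l)
        != delta(j, l) == delta(l, i) == delta(i, k)
        for i, j, k, l in permutations(points, 4))
    return symmetric and weak_three_point and technical

# A map that passes conditions 1 and 2 but violates the technical condition 3.
CHAIN = {frozenset(pair) for pair in [("i", "j"), ("j", "k"), ("k", "l")]}

def bad_delta(x, y):
    return "A" if frozenset((x, y)) in CHAIN else "B"

print(is_ultrametric(delta, LEAVES))          # True
print(is_ultrametric(bad_delta, list("ijkl")))  # False
```

The second map shows why the technical condition is needed: it is symmetric and satisfies the weak three point condition on every triple, yet no rooted tree labeling realizes it.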

Proof of Main Theorem. As we said before, the argument proceeds as follows. We need to show that the three ultrametric conditions are satisfied. By induction we construct a tree, and then we transform it in order to get our tree dissimilarity map. We omit the details since the argument is very similar to that of the corresponding theorem of last lecture.

Remark: Note that, as in that theorem, the proof is constructive, hence we have an algorithm for building the tree dissimilarity map.

As a consequence we obtain

Corollary 14.8. A $G$-dissimilarity map is tree additive iff it satisfies the weak four point condition.

Proof. The four point condition with two equal nodes provides condition (1). On the other hand, condition (2) is just the four point condition.

For further details on the previous proof we refer to the book by Semple and Steel.

14.5 Why is this theorem relevant?

In this section we discuss the importance of this theorem from a historical perspective. In 1967, a paper by Cavalli-Sforza and Edwards appeared. It was the first paper to discuss statistical approaches to phylogenetics. The idea suggested in that work was the following. Starting with fixed DNA data, build a dissimilarity map $\delta$ in some way (today we would rather use the Jukes-Cantor correction, which was unknown at that time). By evolutionary theory, $\delta$ should come from a tree metric (recall that it was built from real DNA data). The goal was to find the corresponding representing tree $T$. To this end they proposed the following approach.

Let $T$ be a phylogenetic tree. Given $\delta : X \times X \to \mathbb{R}_{>0}$ (which takes positive real values since it corresponds to distances between vertices of $T$), the idea was to find $\hat\delta$ that minimizes the expression
\[ \sum_{i,j} (\delta_{ij} - \hat\delta_{ij})^2, \qquad (*) \]
where $\hat\delta$ is a tree metric for $T$. However, this approach has two main problems:
1. What happens in the case where $T$ is unknown? One possible solution would be to construct a tree metric $\hat\delta_T$ for every tree $T$ and find the one that is closest to $\delta$ in the sense of (*).
2. The other difficulty we may encounter is how to find an explicit $\hat\delta$ minimizing (*).

For the second task, if we weaken our restriction on (*) by allowing $\hat\delta$ to be tree additive rather than a tree metric, then we have a formula computing $\hat\delta$:
\[ \hat\delta = S_T\, \hat l, \]
where $S_T$ denotes the incidence matrix and $\hat l$ are the optimal weights of the edges of $T$. In this case, the least squares formula gives
\[ \hat l = (S_T^t S_T)^{-1} S_T^t\, \delta, \]
where $\delta$ is the given dissimilarity map. Note that in this case we obtain $\hat l \in \mathbb{R}^E$ (where $E = E(T)$), and it may not have positive entries. If instead we require $\hat l \in (\mathbb{R}_{>0})^E$, we have a constrained least squares problem, so the optimization task is harder. In fact, we will need to use an iterative approach to solve it.

Moreover, in the tree additive setting $\hat l$ has a very simple formula. Given any edge $e \in E(T)$ we have
\[ 2\,\hat l_e = \frac{n_A n_D + n_B n_C}{(n_A + n_B)(n_C + n_D)}\,(D_{AC} + D_{BD}) + \frac{n_A n_C + n_B n_D}{(n_A + n_B)(n_C + n_D)}\,(D_{AD} + D_{BC}) - D_{AB} - D_{CD}, \qquad (**) \]
where
- $n_A = \#\{\text{labeled nodes in the cluster } A\}$, and similarly for $B$, $C$ and $D$;
- $D_{AB} = \frac{1}{n_A n_B} \sum_{a \in A,\, b \in B} \delta_{ab}$ is the average dissimilarity between the clusters $A$ and $B$, and similarly for $D_{AC}$, $D_{BC}$, $D_{AD}$, $D_{BD}$ and $D_{CD}$.

[Figure: the edge $e$ of length $\hat l_e$, with the clusters $A$ and $B$ attached at its left endpoint and the clusters $C$ and $D$ attached at its right endpoint.]

Note that in this case $A$, $B$, $C$ and $D$ correspond to groups of nodes rather than single nodes.

There are two important remarks to make concerning formula (**):

Observation:
1. $\hat l_e$ depends only on those $\delta_{xy}$ for which the path from $x$ to $y$ touches the edge $e$. This is called the group property since we have groups of nodes. We say that the path touches rather than contains the edge $e$ since the paths in $D_{AB}$ only touch $e$ at its left node. Moreover, formula (**) does not involve distances between nodes in the same cluster: we always pick the two nodes from different groups among $A$, $B$, $C$ and $D$.
2. Although less obvious, we have an important complexity result: (**) gives an $O(n^2)$ algorithm to find $\hat\delta$, where $n = |X|$. Since the input $\delta$ already has on the order of $n^2$ entries, this is the best possible complexity. This result is due to Vach (1989).

These two facts give a strong argument in favor of considering tree additive maps instead of tree metrics. If we are lucky enough, our algorithms will give tree metrics, but a priori we should expect tree additive maps instead. An example of this general behaviour is the Neighbor-Joining algorithm.

14.6 Homework

Exercise (optional): Give a simple direct proof of the result for the case $(\mathbb{R}, +)$, i.e. try to avoid passing through the ultrametric construction. (For references to this approach, see a paper by Hakimi and Patrinos from the early 1970s.)
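As a sanity check on formula (**) of Section 14.5, the sketch below evaluates it on a dissimilarity map that is exactly additive on a small tree, where least squares should return the true edge weight. It is only an illustration under assumptions: the tree `EDGES`, its weights and the function names are hypothetical, and $D_{PQ}$ is taken to be the average of $\delta$ over pairs with one node in each cluster. Exact rational arithmetic is used so that the comparison needs no tolerance.

```python
from fractions import Fraction

def path_distances(edges, leaves):
    """Path lengths from every leaf to every vertex of a weighted tree."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {}
    for x in leaves:
        stack, seen = [(x, 0)], {x}
        while stack:
            node, d = stack.pop()
            dist[x, node] = d
            for nxt, w in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, d + w))
    return dist

def average(delta, P, Q):
    """D_PQ: mean of delta over pairs with one node in P and one in Q."""
    return Fraction(sum(delta[p, q] for p in P for q in Q), len(P) * len(Q))

def edge_length(delta, A, B, C, D):
    """Length of the edge separating clusters A, B from C, D, via (**)."""
    nA, nB, nC, nD = len(A), len(B), len(C), len(D)
    denom = (nA + nB) * (nC + nD)
    coeff_ac_bd = Fraction(nA * nD + nB * nC, denom)
    coeff_ad_bc = Fraction(nA * nC + nB * nD, denom)
    twice = (coeff_ac_bd * (average(delta, A, C) + average(delta, B, D))
             + coeff_ad_bc * (average(delta, A, D) + average(delta, B, C))
             - average(delta, A, B) - average(delta, C, D))
    return twice / 2

# Hypothetical tree: the edge e = (u, v) has weight 5.
# Clusters A = {a1, a2} and B = {b} hang off u; C = {c1, c2} and D = {d} hang off v.
EDGES = {("a1", "x"): 2, ("a2", "x"): 3, ("x", "u"): 1, ("b", "u"): 4,
         ("u", "v"): 5,
         ("v", "y"): 2, ("c1", "y"): 1, ("c2", "y"): 6, ("d", "v"): 3}
A, B, C, D = ["a1", "a2"], ["b"], ["c1", "c2"], ["d"]

delta = path_distances(EDGES, A + B + C + D)
print(edge_length(delta, A, B, C, D))  # 5
```

Note that `edge_length` only reads entries of `delta` between different clusters, in line with the group property from the first observation.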