
Math 239: Discrete Mathematics for the Life Sciences                    Spring 2008
Lecture 14: March 11
Lecturer: Lior Pachter        Scribe/Editor: Maria Angelica Cueto / C.E. Csar

14.1 Introduction

The goal of today's lecture is to prove the remaining theorem of last lecture characterizing tree additive dissimilarity maps. Namely,

Theorem 14.1. A dissimilarity map δ is tree additive (with respect to a given tree T) iff δ satisfies the weak four point condition.

As we discussed last time, rather than proving this theorem directly we change our framework to dissimilarity maps with values in a given group G and prove a more general result in this new setting. In our previous framework the group G is (ℝ, +), with identity 1_G = 0, and our original Theorem 14.1 will follow immediately from the general result. For general literature covering the material discussed today, we refer the reader to the book by Semple and Steel.

14.2 General setting

In this section we provide analogues of the definitions developed in Lecture 13.

Definition 14.2. Given a group G, a G-dissimilarity map is a map δ : X × X → G such that δ(x, x) = 1_G for all x ∈ X.

Note that in this definition we drop the symmetry condition required of ordinary dissimilarity maps. Why have we decided to do so? Two reasons justify the choice: G in general may not be an abelian group, and the general framework for dissimilarity maps realized by trees will allow directed trees with directed edge weights, so that we may have δ(i, j) ≠ δ(j, i) for adjacent nodes i, j ∈ V(T), where δ(i, j) denotes the weight of the directed edge i → j.

Definition 14.3. δ is a tree dissimilarity map if there exist a tree T (i.e. a phylogenetic X-tree) and a weight function w : E(T) → G such that

    δ(x, y) = ∏_{e ∈ path from ϕ(x) to ϕ(y)} w(e),

where ϕ : X → V(T) is the corresponding labeling function and the product is taken with the operation of the group G.
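
To make Definition 14.3 concrete, the following short Python sketch (not part of the original notes; the tree, the weights and the helper names path and tree_dissimilarity are made up for illustration) evaluates a tree dissimilarity map on a small tree whose edges carry a weight in each direction, with G taken to be (ℤ, +) so that the group "product" is addition.

    # A tree on vertices {0,1,2,3,4}; leaves are 0, 1, 4, internal vertices are 2, 3.
    adj = {0: [2], 1: [2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
    # Each direction of an edge carries its own weight w[(u, v)].
    # Group G = (Z, +): the "product" is addition and the identity is 0.
    w = {(0, 2): 1, (2, 0): 1, (1, 2): 2, (2, 1): 2,
         (2, 3): 3, (3, 2): 3, (3, 4): 1, (4, 3): 1}

    def path(u, v, visited=None):
        """Unique path from u to v in the tree, as a list of vertices."""
        if visited is None:
            visited = set()
        if u == v:
            return [u]
        visited.add(u)
        for nxt in adj[u]:
            if nxt not in visited:
                p = path(nxt, v, visited)
                if p is not None:
                    return [u] + p
        return None

    def tree_dissimilarity(x, y, op=lambda a, b: a + b, identity=0):
        """delta(x, y): ordered 'product' of directed edge weights along the path."""
        p = path(x, y)
        val = identity
        for u, v in zip(p, p[1:]):
            val = op(val, w[(u, v)])   # multiply in the order given by the path
        return val

    # delta(x, x) = identity, and delta(0, 4) = 1 + 3 + 1 = 5 here.
    print(tree_dissimilarity(0, 0), tree_dissimilarity(0, 4), tree_dissimilarity(4, 0))

Because each direction of an edge carries its own weight, δ(x, y) and δ(y, x) need not agree in general; they coincide here only because the example weights were chosen symmetric.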

Since (G, ·_G) may not be abelian, the product defining δ(x, y) must be taken in the order given by the path from x to y: if the path is x = v_0, v_1, ..., v_r = y, then

    δ(x, y) = w(e_{v_0 v_1}) ·_G w(e_{v_1 v_2}) ·_G ... ·_G w(e_{v_{r-1} v_r}).

As we discussed before, we may assign a weight to each direction of every edge of T.

14.3 Main Theorem

We are now in a position to state the general result. For simplicity of notation we drop the subscript G from the group operation, but the reader should keep it in mind.

Theorem 14.4 (Main Theorem). Let G be a group and δ a G-dissimilarity map on X. Consider the set

    H_δ = { δ_ik δ_jk^{-1} δ_jl δ_il^{-1} : i, j, k, l ∈ X } ⊆ G.

If δ is a tree dissimilarity map then:

1. for all i, j, k ∈ X: δ_ij δ_kj^{-1} δ_ki = δ_ik δ_jk^{-1} δ_ji ("three point condition");

2. for all i, j, k, l ∈ X pairwise distinct, there exists some ordering of these points (i.e. a relabeling of them) such that δ_ik δ_jk^{-1} = δ_il δ_jl^{-1} ("four point condition").

Moreover, if δ satisfies the previous conditions and H_δ has no element of order 2 in G, then δ is a tree dissimilarity map.

Remark: As we discussed previously, Theorem 14.1 will be a consequence of the Main Theorem, since (ℝ, +) has no elements of order 2.

The proof of the sufficiency part of the Main Theorem will mimic the arguments of the corresponding theorem from last lecture. For this we will need to define the notion of an ultrametric G-dissimilarity map as well as that of an ultrametric tree representation, and we will show that these two notions are equivalent. The Main Theorem will then be proven by induction on |X|: given a G-dissimilarity map δ on X satisfying conditions (1) and (2), we construct a suitable ultrametric on X' = X \ {a}. This gives an ultrametric tree representation T, and we then attach a node for a and modify the weights of the edges of T in order to obtain the result. The proof of the necessity of the conditions is immediate; we illustrate the desired conditions with an example.

Example. Assume δ is a tree metric. We denote the inverse element δ_ij^{-1} by a squiggly arrow from i to j.

[figure: a squiggly arrow from i to j, representing δ_ij^{-1}]
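
As a sanity check on the statement of Theorem 14.4, here is a small sketch (again Python, with G = (ℝ, +) written additively, so inverses become negations; the function names are ours) that tests conditions (1) and (2) for a dissimilarity map stored as a dictionary indexed by ordered pairs.

    from itertools import combinations, permutations

    TOL = 1e-9

    def three_point(delta, X):
        """Condition (1), written additively for G = (R, +):
        delta_ij - delta_kj + delta_ki == delta_ik - delta_jk + delta_ji."""
        return all(
            abs((delta[i, j] - delta[k, j] + delta[k, i])
                - (delta[i, k] - delta[j, k] + delta[j, i])) < TOL
            for i, j, k in permutations(X, 3))

    def four_point(delta, X):
        """Condition (2): every 4 pairwise distinct points admit a relabeling with
        delta_ik - delta_jk == delta_il - delta_jl."""
        return all(
            any(abs((delta[i, k] - delta[j, k]) - (delta[i, l] - delta[j, l])) < TOL
                for i, j, k, l in permutations(quad))
            for quad in combinations(X, 4))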

Each side of condition (1) is given by the following weighted directed arrows; the (LHS) corresponds to the left picture and the (RHS) to the right picture.

[figure: two copies of the three-leaf tree on leaves i, j, k with middle node u; solid and squiggly arrows indicate the factors of the LHS and of the RHS of condition (1)]

If we call u the middle node and compute the products on the (LHS) and on the (RHS) of the equation, both reduce to δ_iu δ_ui due to several cancellations. Namely,

    (LHS) = (δ_iu δ_uj)(δ_ku δ_uj)^{-1}(δ_ku δ_ui) = δ_iu δ_ui = (δ_iu δ_uk)(δ_ju δ_uk)^{-1}(δ_ju δ_ui) = (RHS).

For condition (2), the (LHS) and the (RHS) are given by the following pictures.

[figure: two copies of the quartet tree on leaves i, j, k, l; solid and squiggly arrows indicate the factors of the LHS and of the RHS of condition (2)]

In this case we proceed as in condition (1). Call u the node connecting the leaves i and j. Both sides of the equation for condition (2) reduce to the same expression δ_iu δ_ju^{-1}, so condition (2) also holds.
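
The cancellations above can also be checked mechanically by treating each edge weight δ_uv as a formal letter and reducing words as in a free group. The sketch below (Python; the representation of letters as (symbol, exponent) pairs is our own device, not notation from the lecture) verifies that both sides of condition (1) reduce to δ_iu δ_ui.

    def inv(word):
        """Inverse of a group word: reverse the letters and flip each exponent."""
        return [(sym, -e) for sym, e in reversed(word)]

    def reduce_word(word):
        """Cancel adjacent letters x x^{-1} (free-group reduction, via a stack)."""
        out = []
        for letter in word:
            if out and out[-1][0] == letter[0] and out[-1][1] == -letter[1]:
                out.pop()
            else:
                out.append(letter)
        return out

    def d(a, b):
        return [("d_" + a + b, 1)]   # the letter delta_{ab}

    # delta_ij = delta_iu * delta_uj, etc., reading the edge weights along the paths.
    d_ij, d_kj, d_ki = d("i", "u") + d("u", "j"), d("k", "u") + d("u", "j"), d("k", "u") + d("u", "i")
    d_ik, d_jk, d_ji = d("i", "u") + d("u", "k"), d("j", "u") + d("u", "k"), d("j", "u") + d("u", "i")

    lhs = reduce_word(d_ij + inv(d_kj) + d_ki)   # condition (1), left-hand side
    rhs = reduce_word(d_ik + inv(d_jk) + d_ji)   # condition (1), right-hand side
    print(lhs)           # [('d_iu', 1), ('d_ui', 1)], i.e. delta_iu * delta_ui
    print(lhs == rhs)    # True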

Several cancellations provide the equality of the two sides in each of conditions (1) and (2). By the same method one shows in general that conditions (1) and (2) are necessary for δ to be a tree G-dissimilarity map. So we only need to prove the converse, provided that H_δ has no elements of order 2.

As we anticipated earlier, the main idea will be to build an ultrametric from δ using the Gromov product:

    δ_x(i, j) = δ_xi δ_ji^{-1} δ_jx        for i, j ≠ x.

Note that this function δ_x may not be a dissimilarity map, since δ_x(i, i) need not be 1_G. However, the important fact is that δ_x will be an ultrametric in the more general sense that we explain next.

14.4 Ultrametric conditions and ultrametric tree representation

In this section we define the generalized notions of ultrametric conditions and ultrametric tree representations in the context of G-valued functions.

Definition 14.5. We say that δ : X × X → G satisfies the ultrametric conditions if

1. δ(i, j) = δ(j, i) (i.e., δ is symmetric);

2. |{δ(i, j), δ(i, k), δ(j, k)}| ≤ 2, i.e. at least two of these three elements of G are equal ("weak three point condition");

3. (technical condition involving H_δ) there do not exist four pairwise distinct points i, j, k, l ∈ X with δ_ij = δ_jk = δ_kl ≠ δ_jl = δ_li = δ_ik.

In words, condition (3) says that the values have to fit together nicely.

Before stating the next definition and the key result relating the two notions, let us motivate the definition through an example.

Example. Assume G = (ℝ, +). Suppose we are given a rooted X-tree T with weights assigned to its edges, corresponding to a tree metric d, and assume that the distance d(ρ, x) from the root ρ to each leaf x is the same number.

[figure: a rooted tree on the leaves a, b, c, d, e with edge weights 1, 2 and 3; every leaf is at the same distance from the root ρ]
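
Definition 14.5 can be checked directly on finite data. The sketch below (Python; the helper name is ours) tests the three ultrametric conditions for a map stored as a dictionary; condition (3) is coded following the reconstruction above, with the two common values required to be distinct.

    from itertools import combinations, permutations

    def ultrametric_conditions(delta, X):
        """Check the three ultrametric conditions of Definition 14.5."""
        # (1) symmetry
        if any(delta[i, j] != delta[j, i] for i, j in permutations(X, 2)):
            return False
        # (2) weak three point condition: at most two distinct values per triple
        if any(len({delta[i, j], delta[i, k], delta[j, k]}) > 2
               for i, j, k in combinations(X, 3)):
            return False
        # (3) no forbidden quartet pattern:
        #     delta_ij = delta_jk = delta_kl != delta_jl = delta_li = delta_ik
        for quad in combinations(X, 4):
            for i, j, k, l in permutations(quad):
                if (delta[i, j] == delta[j, k] == delta[k, l]
                        and delta[j, l] == delta[l, i] == delta[i, k]
                        and delta[i, j] != delta[j, l]):
                    return False
        return True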

We claim that the edge weighting of the tree T is equivalent to assigning a weight to each internal node of T in the following way. To each internal node v we assign w(v) = 2 d(x, v) for any leaf x below v; likewise we assign the weight w(ρ) = 2 d(x, ρ) to the root ρ. Since the distance from ρ to each leaf is the same, the number w(v) does not depend on the choice of the leaf x. In our example:

[figure: the same tree with each internal node v labeled by the weight w(v) = 2 d(x, v); the least common ancestor of b and c is labeled 4]

A first question one might ask is why we include the factor of 2 when defining w(v). One reason is that if v is the internal node corresponding to the cherry of the leaves x, y, then the weights satisfy

    d(x, v) = w(e_xv) = (1/2) d(x, y) = w(e_vy) = d(v, y).

In our example we have d(b, c) = 4 = 2 · 2. Moreover, in general we have the identity

    d(x, y) = label (weight) of the least common ancestor of x and y.

So, given T and the distance function d on V(T) provided by the weights on E(T), we can construct weights for the internal nodes of T. Conversely, if we are given these weights on the internal nodes and want to construct the distance function d, we recover it by assigning a weight to each edge as we ascend from the leaves of T towards the root ρ, bearing in mind that w(v) = 2 d(x, v).

Since these two weighting representations of a tree T are equivalent, we define an ultrametric tree representation by simply labeling the internal vertices of a rooted tree via a function t, so that δ(x, y) = t(l.c.a.(x, y)). Note that the labels on the internal nodes are free from any a priori restriction, so this notion can be generalized to take values in an arbitrary set, not necessarily in a fixed group.

Definition 14.6. An ultrametric tree representation is a rooted phylogenetic X-tree T together with a labeling of the internal vertices of T by elements of G, given by a function t : V(T) → G.

We now state the key result needed for our Main Theorem. We omit the proof because it is the same as the one given for the analogous result in Lecture 13.

Theorem 14.7. Given an ultrametric tree representation t, the map δ defined by δ(x, y) = t(l.c.a.(x, y)) satisfies the ultrametric conditions. Conversely, given an ultrametric δ we can construct an ultrametric tree representation that realizes δ.
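
To illustrate Definition 14.6 and Theorem 14.7, here is a minimal sketch (Python; the parent map, the labels g1, ..., g4 and the helper names are hypothetical) that evaluates δ(x, y) = t(l.c.a.(x, y)) on a small rooted tree shaped like the example above.

    # Rooted tree given by a parent map; 'rho' is the root.
    parent = {"a": "u", "b": "w", "c": "w", "w": "u", "u": "rho",
              "d": "v", "e": "v", "v": "rho", "rho": None}
    # Labels t(v) on the internal vertices (values in an arbitrary set G).
    t = {"w": "g1", "u": "g2", "v": "g3", "rho": "g4"}

    def ancestors(x):
        """Vertices on the path from x up to the root, in order."""
        chain = []
        while x is not None:
            chain.append(x)
            x = parent[x]
        return chain

    def lca(x, y):
        """Least common ancestor: the first ancestor of y that is also above x."""
        up = set(ancestors(x))
        for v in ancestors(y):
            if v in up:
                return v

    def delta(x, y):
        return t[lca(x, y)]

    # e.g. delta('b', 'c') = t('w') and delta('a', 'd') = t('rho')
    print(delta("b", "c"), delta("a", "d"))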

Proof of Main Theorem. As we said before, the argument proceeds as follows. We show that the three ultrametric conditions are satisfied; by induction we construct a tree, and we then transform it in order to obtain our tree dissimilarity map. We omit the details since the argument is very similar to that of the corresponding theorem from last lecture.

Remark: Note that, as in that theorem, the proof is constructive, hence we obtain an algorithm for building the tree dissimilarity map.

As a consequence we obtain

Corollary 14.8. A G-dissimilarity map is tree additive iff it satisfies the weak four point condition.

Proof. The four point condition with two equal nodes provides condition (1); on the other hand, condition (2) is just the four point condition.

For further details on the previous proof we refer to the book by Semple and Steel.

14.5 Why is this theorem relevant?

In this section we discuss the importance of this theorem from a historical perspective. In 1967 a paper by Cavalli-Sforza and Edwards appeared; it was the first paper to discuss statistical approaches to phylogenetics. The idea suggested in that work was the following. Starting with fixed DNA data, build a dissimilarity map in some way (today we would rather use the Jukes-Cantor correction, which was unknown at that time). Under the evolutionary assumption, δ comes from a tree metric (recall that it comes from real DNA data), and the goal is to find the corresponding representing tree T.

To accomplish this they proposed the following approach. Let T be a phylogenetic tree. Given δ : X × X → ℝ_{>0} (which takes positive real values since it corresponds to distances), the idea is to find the δ̂ that minimizes the expression

    ∑_{i,j} (δ_ij − δ̂_ij)²,        (*)

where δ̂ is a tree metric for T. However, this approach has two main problems:

1. What happens when T is unknown? One possible solution is to construct a tree metric δ̂_T for every tree T and find the one that is closest to δ in the sense of (*).

2. The other difficulty is how to find an explicit δ̂ minimizing (*).

For the second task, if we weaken the restriction in (*) by allowing δ̂ to be tree additive rather than a tree metric, then we have a formula computing δ̂:

    δ̂ = S_T l̂,

where S_T denotes the incidence matrix (rows indexed by pairs of leaves, columns by edges, with a 1 whenever the edge lies on the path joining the pair) and l̂ is the vector of optimal weights of the edges of T. In this case the least squares formula gives

    l̂ = (S_T^t S_T)^{-1} S_T^t δ,

where δ is the given dissimilarity map. Note that we obtain l̂ ∈ ℝ^E (where E = E(T)), and it may fail to have positive entries. If instead we require l̂ ∈ (ℝ_{>0})^E, we have a constrained least squares problem, so the optimization task is harder; in fact, we need an iterative approach to solve it.

Moreover, in the tree additive setting l̂ has a very simple formula. Given any internal edge e ∈ E(T), with clusters A and B attached on one side of e and clusters C and D on the other, we have

    2 l̂_e = (n_A n_D + n_B n_C)/((n_A + n_B)(n_C + n_D)) · (D_AC + D_BD)
          + (n_A n_C + n_B n_D)/((n_A + n_B)(n_C + n_D)) · (D_AD + D_BC)
          − D_AB − D_CD,                                                  (**)

where

    n_A = #{labeled nodes in the cluster A}, and similarly for B, C and D;
    D_AB = (1/(n_A n_B)) ∑_{a ∈ A, b ∈ B} δ_ab, the average distance between the clusters A and B, and similarly for D_AC, D_BC, D_AD, D_BD and D_CD.

[figure: an internal edge e of length l̂_e, with the clusters A and B attached on one side and the clusters C and D on the other]

Note that here A, B, C and D correspond to groups of nodes rather than single nodes. There are two important observations to make concerning formula (**):

Observation:

1. l̂_e depends only on the values δ_xy for which the path from x to y touches the edge e. This is called the group property, since we have groups of nodes. We say that the path touches rather than contains the edge e because the paths contributing to D_AB only touch e at its endpoint on the A, B side. Moreover, formula (**) does not involve distances between nodes in the same cluster: we always pick one node from each of two distinct groups among A, B, C and D.
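
A minimal numerical sketch of the unconstrained least squares fit l̂ = (S_T^t S_T)^{-1} S_T^t δ, δ̂ = S_T l̂ (Python with NumPy; the quartet tree, the pair ordering and the observed values of δ are made-up assumptions for illustration):

    import numpy as np

    # Quartet tree ((1,2),(3,4)) with pendant edges e1..e4 and internal edge e5.
    # Rows of S index the pairs (1,2),(1,3),(1,4),(2,3),(2,4),(3,4);
    # columns index the edges; S[p, e] = 1 iff edge e lies on the path for pair p.
    S = np.array([[1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 1],
                  [1, 0, 0, 1, 1],
                  [0, 1, 1, 0, 1],
                  [0, 1, 0, 1, 1],
                  [0, 0, 1, 1, 0]], dtype=float)

    # Observed dissimilarities delta_ij, in the same pair ordering (made-up numbers).
    delta = np.array([0.30, 0.55, 0.62, 0.60, 0.68, 0.45])

    # Unconstrained least squares: l_hat = (S^t S)^{-1} S^t delta, delta_hat = S l_hat.
    l_hat = np.linalg.solve(S.T @ S, S.T @ delta)
    delta_hat = S @ l_hat

    print("edge lengths l_hat:", np.round(l_hat, 4))
    print("fitted delta_hat:  ", np.round(delta_hat, 4))
    print("residual (*):      ", float(np.sum((delta - delta_hat) ** 2)))

Note that nothing prevents some entry of l̂ from coming out negative; requiring positive edge lengths turns this into the harder constrained problem mentioned above.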

2. Although less obvious, we have an important complexity result: (**) gives an O(n²) algorithm to find δ̂, where n = |X|. This is the optimal possible complexity, since the input dissimilarity map already has Θ(n²) entries. The result is due to Vach (1989).

These two facts give a strong argument in favor of considering tree additive maps instead of tree metrics. If we are lucky, our algorithms will return tree metrics, but a priori we should expect tree additive maps instead. An example of this general behaviour is the Neighbor-Joining algorithm.

14.6 Homework

Exercise (optional): Give a simple direct proof of the Main Theorem for the case (ℝ, +), i.e. try to avoid passing through the ultrametric construction. (For references to this approach, see the paper by Hakimi and Patrinos from the early 1970s.)