CSCI1950-Z Computational Methods for Biology
Lecture 4
Ben Raphael
February 2, 2009
http://cs.brown.edu/courses/csci1950-z/

Algorithm Summary

Class          Method                      Input              Output
Parsimony      Sankoff's & Fitch's Alg.    Characters, T      A, B
Parsimony      Perfect Phylogeny           Characters         A, B, T
Probabilistic  Felsenstein                 Characters, T, B   A

T = tree topology, B = branch lengths, A = ancestral states
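Pairwise Compatibility Test (Wilson 1965)

For a binary character i, let $i_0$ and $i_1$ be the sets of species in which i has state 0 and state 1, respectively.

Binary characters i and j are pairwise compatible if and only if j is homogeneous (constant) on $i_0$ or on $i_1$.
Equivalently: $i_1$ and $j_1$ are disjoint, or one contains the other.
Equivalently: the four state pairs (0,0), (0,1), (1,0), (1,1) do not all occur among the species.

Species   i   j   k
A         0   0   0
B         0   0   0
C         0   1   1
D         1   0   1
E         1   0   0

Here $i_0 = \{A, B, C\}$ and $i_1 = \{D, E\}$. Characters i and j are compatible ($i_1$ and $j_1 = \{C\}$ are disjoint), and j and k are compatible ($j_1 \subset k_1 = \{C, D\}$), but i and k are not: all four state pairs occur.

Pairwise Compatibility Theorem (Estabrook et al. 1976)

A set S of binary characters is mutually compatible if and only if all pairs c and c' of characters in S are pairwise compatible.

Pairwise compatibility $\iff$ mutual compatibility.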
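The "all four state pairs" form of the test, combined with the Pairwise Compatibility Theorem, can be sketched in a few lines. This is a minimal illustration, assuming each character is given as a list of 0/1 states in a fixed species order; the function names are hypothetical.

```python
from itertools import combinations

def pairwise_compatible(ci, cj):
    """True unless all four state pairs (0,0), (0,1), (1,0), (1,1) occur."""
    return len(set(zip(ci, cj))) < 4

def mutually_compatible(chars):
    """By the Pairwise Compatibility Theorem, checking all pairs suffices."""
    return all(pairwise_compatible(a, b) for a, b in combinations(chars, 2))

# Characters i, j, k from the slide, listed over species A, B, C, D, E.
i = [0, 0, 0, 1, 1]
j = [0, 0, 1, 0, 0]
k = [0, 0, 1, 1, 0]

print(pairwise_compatible(i, j))       # True
print(pairwise_compatible(i, k))       # False
print(mutually_compatible([i, j, k]))  # False
```

Counting distinct state pairs only works for binary characters; multi-state characters need a different test.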
Perfect Phylogeny

A set of mutually compatible binary characters gives a perfect phylogeny:
1. Evolutionary model: binary characters {0,1}; each character changes state only once in evolutionary history (no homoplasy!).
2. A tree in which every mutation is on an edge of the tree. All the species in one subtree contain a 0, and all species in the other contain a 1.

For simplicity, assume root = (0, 0, 0, 0, 0).

           traits
species    1  2  3  4  5
A          1  1  0  0  0
B          0  0  1  0  0
C          1  1  0  1  0
D          0  0  1  0  1
E          1  0  0  0  0

Last time: algorithm to reconstruct a tree.

Trees and Splits

Given a set X, a split is a partition of X into two non-empty subsets A and B: $X = A \cup B$.

For a phylogenetic tree T with leaves L, each edge e defines a split $L_e = A \mid B$, where A and B are the leaves in the subtrees obtained by removing e.

In a perfect phylogeny, the edge where binary character i changes state gives the split $i_0 \mid i_1$.

We will return to splits in a future lecture.
Splits Equivalence Theorem

A phylogenetic tree T defines a collection of splits $\Sigma(T) = \{ L_e : e \text{ is an edge in } T \}$.

Splits $A_1 \mid B_1$ and $A_2 \mid B_2$ are pairwise compatible if at least one of $A_1 \cap A_2$, $A_1 \cap B_2$, $B_1 \cap A_2$, and $B_1 \cap B_2$ is the empty set.

Splits Equivalence Theorem: Let $\Sigma$ be a collection of splits. There is a phylogenetic tree T such that $\Sigma(T) = \Sigma$ if and only if the splits in $\Sigma$ are pairwise compatible.

The Pairwise Compatibility Theorem (for binary characters) follows from this theorem.

Outline

Distance-based methods for phylogenetic tree reconstruction:
- Review of distances/metrics.
- Tree distances and additive distances.
- Small and large phylogeny problems.
- Non-additive distances and clustering.
- UPGMA and ultrametric distances.
Distances

A distance on a set X is a function $d: X \times X \to \mathbb{R}$ satisfying:
- For all $x, y \in X$, $d(x, y) \ge 0$, with equality iff x = y.
- For all $x, y \in X$, $d(x, y) = d(y, x)$ [symmetry].
- For all $x, y, z \in X$, $d(x, z) \le d(x, y) + d(y, z)$ [triangle inequality].

Examples:
- X = real numbers: $d(x, y) = |x - y|$ is a distance.
- X = strings of equal length over some alphabet: $d_H(s, t)$ = number of positions where s and t differ is called the Hamming distance.

Distances in Biological Data

- String distances (e.g. Hamming distance, edit distance) on DNA/protein sequence data.
- Substitution models (Jukes-Cantor, Kimura, etc.): scores for particular changes A → T, C → G, etc.

Rat:        ACAGTCACGCCCCACACGT
Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGAGGTAGCAAACGA
Human:      CCTGTGAGGTAGCACACGA
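The three distance axioms can be checked by brute force on a finite set of points. The sketch below is only an illustration: the name `is_distance` and the sample points are assumptions, and the squared difference is included as a function that fails the triangle inequality.

```python
from itertools import product

def is_distance(points, d):
    """Brute-force check of the three distance axioms on a finite point set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0 or (d(x, y) == 0) != (x == y):
            return False                      # non-negativity, zero iff x == y
        if d(x, y) != d(y, x):
            return False                      # symmetry
    return all(d(x, z) <= d(x, y) + d(y, z)   # triangle inequality
               for x, y, z in product(points, repeat=3))

reals = [0, 1, 2, 5]  # hypothetical sample points
print(is_distance(reals, lambda x, y: abs(x - y)))    # True
print(is_distance(reals, lambda x, y: (x - y) ** 2))  # False: d(0,2) = 4 > 1 + 1
```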
Distance Matrix

For n species, form the n x n distance matrix $D_{ij}$.
Example: $D_{ij}$ = edit distance between a gene in species i and species j.

Mouse:      ACAGTGACGCCACACACGT
Gorilla:    CCTGCGACGTAACAAACGC
Chimpanzee: CCTGCCAGTTAGCAAACGC
Human:      CCTGCCAGTTAGCACACGA

             Mouse  Gorilla  Chimpanzee  Human
Mouse          0       7        11        10
Gorilla        7       0         4         6
Chimpanzee    11       4         0         2
Human         10       6         2         0

Alignment vs. Distance Matrix

Sequencing a gene of length m in n species gives an n x m alignment matrix, which can be transformed into an n x n distance matrix (above, a 4 x 19 alignment becomes a 4 x 4 distance matrix). The reverse transformation is not possible due to loss of information.
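The transformation from alignment to distance matrix can be sketched as follows. The slide names edit distance; this sketch substitutes the simpler Hamming distance (an assumption), which reproduces the slide's matrix for these four equal-length sequences.

```python
def hamming(s, t):
    """Number of positions where two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """n x n matrix of pairwise Hamming distances."""
    return [[hamming(s, t) for t in seqs] for s in seqs]

alignment = {
    "Mouse":      "ACAGTGACGCCACACACGT",
    "Gorilla":    "CCTGCGACGTAACAAACGC",
    "Chimpanzee": "CCTGCCAGTTAGCAAACGC",
    "Human":      "CCTGCCAGTTAGCACACGA",
}
D = distance_matrix(list(alignment.values()))
for name, row in zip(alignment, D):
    print(f"{name:<11}", row)
```

Many different alignments give the same matrix, which is why the transformation cannot be reversed.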
Distances in Trees

Given a tree T with a positive weight w(e) on each edge, we define the tree distance $d_T$ on the set L of leaves by:

$d_T(i, j)$ = sum of the weights of the edges on the unique path from i to j.

In evolutionary biology, weights are sometimes called branch lengths.

Distance in Trees: an Example

[Figure: weighted tree with two leaves marked i and j.]

$d_T(1, 4) = 12 + 13 + 14 + 17 + 13 = 69$
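The tree distance can be computed by walking the unique path between two leaves. The sketch below assumes the tree is stored as an adjacency dictionary. The topology and the side-branch weights are hypothetical; only the five weights on the path from leaf 1 to leaf 4 are taken from the slide's example.

```python
def tree_distance(adj, i, j):
    """Sum of edge weights on the unique path from i to j in a weighted tree."""
    stack = [(i, None, 0)]            # (vertex, parent, distance from i)
    while stack:
        u, parent, dist = stack.pop()
        if u == j:
            return dist
        for v, w in adj[u].items():
            if v != parent:           # a tree has no cycles, so never go back
                stack.append((v, u, dist + w))
    raise ValueError("j is not reachable from i")

# Hypothetical tree: internal vertices a, b, c, d; leaves 1..6.
edges = [(1, "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", 4, 13),
         ("a", 2, 5), ("b", 3, 6), ("c", 5, 7), ("d", 6, 8)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

print(tree_distance(adj, 1, 4))  # 69
```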
Distance vs. Tree Distance

Given an n x n distance matrix for n species, note that $d_T(i, j)$, the tree distance between i and j, is not necessarily equal to $D_{ij}$ as given by the distance matrix.

Rat:        ACAGTGACGCCCCAAACGT
Mouse:      ACAGTGACGCTACAAACGT
Gorilla:    CCTGTGACGTAACAAACGA
Chimpanzee: CCTGTGACGTAGCAAACGA
Human:      CCTGTGACGTAGCAAACGA

Fitting a Distance Matrix

- Given n species, we can compute the n x n distance matrix $D_{ij}$.
- The evolution of these species is described by a tree that we don't know.
- We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$.

Find a tree T such that:

$D_{ij} = d_T(i, j)$

where $D_{ij}$ is the (known) distance between species and $d_T(i, j)$ is the length of the path in the (unknown) tree T.
Distance-Based Phylogeny Problem

Goal: Reconstruct an evolutionary tree from a distance matrix.
Input: n x n distance matrix $D_{ij}$.
Output: weighted tree T with n leaves fitting D.

Unknown topology of the tree makes evolutionary tree reconstruction hard!

Number of unrooted binary trees with n labeled leaves: $T(n) = \frac{(2n-5)!}{(n-3)!\, 2^{n-3}}$
n = 24: $T(n) \approx 5.64 \times 10^{26}$

If D is additive, this problem has a solution and there is a simple algorithm to solve it.

Distance-based vs. character-based

Key difference: Distance-based methods do not reconstruct ancestral states.

     A  B  C  D
A    0  1  2  2
B    1  0  1  1
C    2  1  0  0
D    2  1  0  0

Note that C and D are identical.
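To get a feel for why the unknown topology makes the problem hard, the number of unrooted binary trees on n labeled leaves can be computed directly. The sketch uses the standard double-factorial form $(2n-5)!! = 3 \cdot 5 \cdots (2n-5)$; the function name is hypothetical.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (4, 5, 10, 24):
    print(n, num_unrooted_binary_trees(n))
```

Each additional leaf can be attached to any of the 2n - 3 edges of a tree on n leaves, which is where the product of odd numbers comes from.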
Reconstructing a 3-Leaved Tree

Tree reconstruction for a 3 x 3 matrix is straightforward. We have 3 leaves i, j, k and a center vertex c.

Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$

Reconstructing a 3-Leaved Tree (cont'd)

Adding the first two equations and substituting the third:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$2d_{ic} + d_{jc} + d_{kc} = D_{ij} + D_{ik}$
$2d_{ic} + D_{jk} = D_{ij} + D_{ik}$
$d_{ic} = (D_{ij} + D_{ik} - D_{jk})/2$

Similarly,
$d_{jc} = (D_{ij} + D_{jk} - D_{ik})/2$
$d_{kc} = (D_{ki} + D_{kj} - D_{ij})/2$
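The three closed-form edge lengths translate directly into code. This is a small sketch with hypothetical input distances, not part of the original slides.

```python
def three_leaf_edges(D_ij, D_ik, D_jk):
    """Edge lengths (d_ic, d_jc, d_kc) of the star tree on leaves i, j, k."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc

# Hypothetical distances: D_ij = 6, D_ik = 10, D_jk = 10.
d_ic, d_jc, d_kc = three_leaf_edges(6, 10, 10)
print(d_ic, d_jc, d_kc)  # 3.0 3.0 7.0
```

The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality.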
Trees with > 3 Leaves

- A binary tree with n leaves has 2n - 3 edges.
- Fitting a given tree to a distance matrix D requires solving a system with n(n - 1)/2 equations and 2n - 3 variables.
- A solution is not always possible for n > 3.

Additive Distance Matrices

Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$ for all i, j, and NON-ADDITIVE otherwise.
Additive Distance Phylogeny

Small Additive Distance Phylogeny: Given phylogenetic tree T and distance matrix D, determine branch lengths such that $d_T(i, j) = D_{ij}$.

Large Additive Distance Phylogeny: Given distance matrix D, find T and branch lengths such that $d_T(i, j) = D_{ij}$.

Both of these problems can be solved efficiently.

Reconstructing Additive Distances Given T

D    v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0

[Figure: tree T on leaves v, w, x, y, z, shown with its edge lengths.]

If we know T and D, but do not know the length of each edge, we can reconstruct those lengths.
Reconstructing Additive Distances Given T

D    v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0

[Figure: tree T on leaves v, w, x, y, z with edge lengths not yet known.]

Find neighbors v, w (common parent a) and replace them by a:

$d_{ax} = \frac{1}{2}(d_{vx} + d_{wx} - d_{vw})$
$d_{ay} = \frac{1}{2}(d_{vy} + d_{wy} - d_{vw})$
$d_{az} = \frac{1}{2}(d_{vz} + d_{wz} - d_{vw})$

D_1  a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0
Reconstructing Additive Distances Given T

D_1  a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0

Neighbors x, y (common parent b):

D_2  a   b   z
a    0   6  10
b        0  10
z            0

The three remaining leaves a, b, z meet at a center vertex c:

D_3  a   c
a    0   3
c        0

d(a, c) = 3
d(b, c) = d(a, b) - d(a, c) = 3
d(c, z) = d(a, z) - d(a, c) = 7
d(b, x) = d(a, x) - d(a, b) = 5
d(b, y) = d(a, y) - d(a, b) = 4
d(a, w) = d(z, w) - d(a, z) = 4
d(a, v) = d(z, v) - d(a, z) = 6

Correct!!!

[Figure: tree T with the reconstructed edge lengths.]

Trees and Neighbors

The previous algorithm relied only on finding neighboring leaves:
1. Find neighboring leaves i and j with parent k.
2. Remove the rows and columns of i and j.
3. Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as: $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$

Compress i and j into k, then iterate the algorithm on the rest of the tree.
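The compression step can be sketched as below and run on the slides' five-leaf example. The dict-of-dicts matrix layout and the helper names `symmetric` and `compress_neighbors` are assumptions made for this illustration.

```python
def symmetric(labels, upper):
    """Build a symmetric dict-of-dicts matrix from its upper triangle."""
    D = {a: {a: 0} for a in labels}
    for (a, b), d in upper.items():
        D[a][b] = D[b][a] = d
    return D

def compress_neighbors(D, i, j, k):
    """Replace neighboring leaves i, j by their parent k.

    Returns the reduced matrix and the lengths of edges (i, k) and (j, k).
    """
    others = [m for m in D if m not in (i, j)]
    # D_km = (D_im + D_jm - D_ij) / 2 for every remaining leaf m
    to_k = {m: (D[i][m] + D[j][m] - D[i][j]) / 2 for m in others}
    reduced = {m: {p: D[m][p] for p in others} for m in others}
    reduced[k] = {k: 0}
    for m in others:
        reduced[k][m] = reduced[m][k] = to_k[m]
    m = others[0]  # any remaining leaf gives d(i, k) = D_im - D_km
    return reduced, D[i][m] - to_k[m], D[j][m] - to_k[m]

D = symmetric("vwxyz", {("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
                        ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
                        ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14})
D1, d_va, d_wa = compress_neighbors(D, "v", "w", "a")
D2, d_xb, d_yb = compress_neighbors(D1, "x", "y", "b")
print(d_va, d_wa, d_xb, d_yb)  # 6.0 4.0 5.0 4.0
```

The sketch assumes the caller already knows which leaves are neighbors, as in the small phylogeny problem where T is given.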
Finding Neighboring Leaves

To find neighboring leaves, we simply select a pair of closest leaves.

     i   j   k   l
i    0  13  21  22
j        0  12  13
k            0  13
l                0

WRONG! i and j are neighbors, but ($d_{ij} = 13$) > ($d_{jk} = 12$).

Finding a pair of neighboring leaves is a nontrivial problem!

Degenerate Triples

A degenerate triple is a set of three distinct elements $1 \le i, j, k \le n$ where $D_{ij} + D_{jk} = D_{ik}$.

Element j in a degenerate triple i, j, k lies on the evolutionary path from i to k (or is attached to this path by an edge of length 0).
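A brute-force search for a degenerate triple is a direct reading of the definition. The sketch below assumes a list-of-lists matrix with exact (integer) entries; with floating-point distances, the equality test would need a tolerance. The second example matrix is hypothetical.

```python
from itertools import combinations

def find_degenerate_triple(D):
    """Return (i, j, k) with D[i][j] + D[j][k] == D[i][k], or None."""
    n = len(D)
    for j in range(n):                                   # j is the middle element
        rest = [m for m in range(n) if m != j]
        for i, k in combinations(rest, 2):
            if D[i][j] + D[j][k] == D[i][k]:
                return i, j, k
    return None

# The 4 x 4 matrix from the slide (order i, j, k, l): no degenerate triple.
slide = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
# Hypothetical matrix: three points on a line at positions 0, 2, 5.
line = [[0, 2, 5], [2, 0, 3], [5, 3, 0]]

print(find_degenerate_triple(slide))  # None
print(find_degenerate_triple(line))   # (0, 1, 2)
```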
Looking for Degenerate Triples

If distance matrix D has a degenerate triple i, j, k, then j can be removed from D, thus reducing the size of the problem.

If distance matrix D does not have a degenerate triple i, j, k, one can create a degenerate triple in D by shortening all hanging edges (in the tree).

Shortening Hanging Edges to Produce Degenerate Triples

Shorten all hanging edges (edges that connect to leaves) until a degenerate triple is found.
Finding Degenerate Triples

If there is no degenerate triple, all hanging edges are reduced by the same amount δ, so that all pairwise distances in the matrix are reduced by 2δ.

Eventually this process collapses one of the leaves (when δ = length of shortest hanging edge), forming a degenerate triple i, j, k and reducing the size of the distance matrix D.

The attachment point for j can be recovered in the reverse transformations by saving $D_{ij}$ for each collapsed leaf.

Reconstructing Trees for Additive Distance Matrices

Trim(D, δ)
  for all 1 ≤ i ≠ j ≤ n
    D_ij = D_ij - 2δ
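Trim is short enough to write out. The sketch below returns a new matrix instead of modifying D in place (a design choice, not something the slide specifies). It is applied to the 4 x 4 matrix from the "Finding Neighboring Leaves" slide with δ = 2, a value worked out by hand for this illustration as that matrix's shortest hanging edge.

```python
def trim(D, delta):
    """Copy of D with every off-diagonal entry reduced by 2 * delta."""
    n = len(D)
    return [[D[i][j] - 2 * delta if i != j else 0 for j in range(n)]
            for i in range(n)]

# Matrix over leaves i, j, k, l (indices 0..3).
D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
trimmed = trim(D, 2)
for row in trimmed:
    print(row)

# After trimming, (i, j, k) is degenerate: 9 + 8 == 17.
print(trimmed[0][1] + trimmed[1][2] == trimmed[0][2])  # True
```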
AdditivePhylogeny Algorithm

AdditivePhylogeny(D)
  if D is a 2 x 2 matrix
    T = tree of a single edge of length D_{1,2}
    return T
  if D is non-degenerate
    Compute trimming parameter δ
    Trim(D, δ)
  Find a triple i, j, k in D such that D_ij + D_jk = D_ik
  x = D_ij
  Remove j-th row and j-th column from D
  T = AdditivePhylogeny(D)
  Traceback

AdditivePhylogeny (cont'd)

Traceback
  Add a new vertex v to T at distance x from i on the path from i to k
  Add j back to T by creating an edge (v, j) of length 0
  for every leaf l in T
    if distance from l to v in the tree ≠ D_{l,j}
      output "matrix is not additive"
      return
  Extend all hanging edges by length δ
  return T

Question: How to compute δ?
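One standard way to answer this question, offered as a sketch rather than as the lecture's own solution: for an additive matrix, $(D_{ij} + D_{jk} - D_{ik})/2$ is the distance from leaf j to the path between i and k, so its minimum over all triples equals the length of the shortest hanging edge. The function name is hypothetical.

```python
def trimming_parameter(D):
    """Shortest hanging edge of the tree realizing additive matrix D (n >= 3)."""
    n = len(D)
    return min(
        (D[i][j] + D[j][k] - D[i][k]) / 2
        for j in range(n)
        for i in range(n)
        for k in range(n)
        if len({i, j, k}) == 3      # i, j, k must be distinct
    )

# 4 x 4 matrix from the "Finding Neighboring Leaves" slide (order i, j, k, l).
D = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
print(trimming_parameter(D))  # 2.0
```

The cubic scan is the simplest formulation; it returns 0 when D already contains a degenerate triple.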
Additive Distance

How to tell if D is additive?
- AdditivePhylogeny provides a way to check if distance matrix D is additive.
- An even more efficient additivity check is the four point condition.

The Four Point Condition (Zaretskii 1965, Buneman 1971)

Let $1 \le i, j, k, l \le n$ be four distinct leaves in a tree, labeled so that i, j lie on one side of the middle edge and k, l on the other.

Compute:
1. $D_{ij} + D_{kl}$
2. $D_{ik} + D_{jl}$
3. $D_{il} + D_{jk}$

2 and 3 represent the same number: (total length of the four hanging edges) + 2 * (length of the middle edge).
1 represents a smaller number: the total length of the four hanging edges alone, because its two paths do not cross the middle edge.
The Four Point Condition

Four point condition: Every four leaves (quartet) can be labeled as i, j, k, l such that:

$D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$

Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet $1 \le i, j, k, l \le n$.

Proof (additive implies four point condition): Since D is additive, $D = d_T$ for some tree T. Find a split $S_1 \mid S_2$ such that $i, j \in S_1$ and $k, l \in S_2$. Define $\lambda_m$ to be the weights in the tree below.

$D_{ik} + D_{jl} = (\lambda_1 + \lambda_3 + \lambda_4) + (\lambda_2 + \lambda_3 + \lambda_5) = D_{il} + D_{jk} \ge (\lambda_1 + \lambda_2) + (\lambda_4 + \lambda_5) = D_{ij} + D_{kl}$

[Figure: quartet tree with branches $\lambda_1$ (to i), $\lambda_2$ (to j), $\lambda_4$ (to k), $\lambda_5$ (to l) and middle edge $\lambda_3$.]
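The four point condition can be tested on every quartet by sorting the three pairwise sums and checking that the two largest coincide. The sketch assumes a symmetric list-of-lists matrix with exact entries, and the second matrix is a hypothetical non-additive example.

```python
from itertools import combinations

def satisfies_four_point(D):
    """True if every quartet satisfies the four point condition."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if sums[1] != sums[2]:       # the two largest sums must be equal
            return False
    return True

# Additive matrix from the "Finding Neighboring Leaves" slide: sums 26, 34, 34.
additive = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
# Hypothetical matrix whose three sums are 4, 6, 9.
not_additive = [[0, 2, 3, 4], [2, 0, 5, 3], [3, 5, 0, 2], [4, 3, 2, 0]]

print(satisfies_four_point(additive))      # True
print(satisfies_four_point(not_additive))  # False
```

Matrices with fewer than four rows pass vacuously, and measured (floating-point) distances would need a tolerance in place of exact equality.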
Non-additive Distances

What if there is no tree T such that $D_{ij} = d_T(i, j)$?

Approaches:
1. Find a tree that minimizes the error.
2. Heuristic approaches: clustering.

Least Squares Distance Phylogeny Problem

If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best:

Squared Error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$

Squared Error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it.

Least Squares Distance Phylogeny Problem: Find the approximation tree T with minimum squared error for a non-additive matrix D. (NP-hard)
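For a fixed candidate tree, the squared error is easy to evaluate once the tree distances are tabulated. The sketch sums over unordered pairs i < j (an assumption; summing over ordered pairs doubles the value). The "observed" matrix is a hypothetical perturbation of an additive one.

```python
def squared_error(tree_dist, D):
    """Sum over pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Tree distances of a candidate tree T (additive by construction).
tree_dist = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]
# Hypothetical observed distances: two entries are off by one.
observed = [[0, 12, 21, 22], [12, 0, 12, 13], [21, 12, 0, 14], [22, 13, 14, 0]]

print(squared_error(tree_dist, observed))   # 2
print(squared_error(tree_dist, tree_dist))  # 0
```

Evaluating the error of one tree is cheap; the hardness lies in searching over all topologies and branch lengths.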
Tree Construction as Clustering

[Figure: points 1-5 and the corresponding clustering tree with leaf order 1, 4, 2, 3, 5.]

Pair Group Methods

Iteratively combine closest leaves/groups into larger groups.

C ← { {1}, ..., {n} }
While |C| > 2 do
  [Find closest clusters.]  Choose C_i, C_j with d(C_i, C_j) minimal over all pairs of clusters in C.
  C_k ← C_i ∪ C_j
  [Replace C_i and C_j by C_k.]  C ← (C \ C_i \ C_j) ∪ C_k.
Pair Group Methods

Two questions remain open in the pair group procedure:
- What is d?
- How to define branch lengths?

UPGMA

Unweighted Pair Group Method with Averages
- Distance between clusters is defined as the average pairwise distance.
- Assigns a height to every vertex in the tree, effectively dating every vertex.
UPGMA: Distance Between Clusters

The distance between clusters is defined as the average pairwise distance. Given two disjoint clusters $C_i$, $C_j$ of sequences,

$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$

UPGMA: Vertex Heights

UPGMA assigns a height to every vertex in the tree, effectively dating every vertex: add a vertex connecting $C_i$, $C_j$ and place it at height $d_{ij}/2$.
UPGMA Algorithm

Initialization:
- Assign each $x_i$ to its own cluster $C_i$.
- Define one leaf per sequence, each at height 0.

Iteration:
- Find two clusters $C_i$ and $C_j$ such that $d_{ij}$ is minimal.
- Let $C_k = C_i \cup C_j$.
- Add a vertex connecting $C_i$, $C_j$ and place it at height $d_{ij}/2$.
- Delete $C_i$ and $C_j$.

Termination:
- When a single cluster remains.

Trees from UPGMA

UPGMA produces an ultrametric tree: the distance from the root to any leaf is the same.

The Molecular Clock: The evolutionary distance between species x and y is twice the Earth time to reach the nearest common ancestor. That is, the molecular clock has constant rate in all species.

[Figure: UPGMA tree on leaves 1, 4, 2, 3, 5 drawn against a time axis in years.]

The molecular clock results in ultrametric distances.
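The UPGMA steps can be sketched compactly. This is a minimal, unoptimized illustration: the nested-tuple tree representation and the recomputation of average distances from the original matrix on every iteration are choices made here for brevity. The input is the Mouse/Gorilla/Chimpanzee/Human matrix from the "Distance Matrix" slide.

```python
from itertools import combinations

def upgma(D, labels):
    """Return a tree of nested tuples (left, right, height); leaves are labels."""
    # cluster id -> (subtree, indices of member leaves)
    clusters = {i: (labels[i], [i]) for i in range(len(labels))}
    next_id = len(labels)

    def avg(a, b):
        # average pairwise distance between the members of clusters a and b
        ma, mb = clusters[a][1], clusters[b][1]
        return sum(D[p][q] for p in ma for q in mb) / (len(ma) * len(mb))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        height = avg(a, b) / 2                     # new vertex at d_ij / 2
        (ta, ma), (tb, mb) = clusters.pop(a), clusters.pop(b)
        clusters[next_id] = ((ta, tb, height), ma + mb)
        next_id += 1
    return next(iter(clusters.values()))[0]

D = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]
tree = upgma(D, ["Mouse", "Gorilla", "Chimpanzee", "Human"])
print(tree)
```

Chimpanzee and Human merge first at height 1.0, Gorilla joins at height 2.5, and Mouse joins at the root.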
UPGMA's Weakness: Example

[Figure: the correct tree on leaves 1-4 next to the different tree produced by UPGMA.]

Ultrametrics

$D_{ij}$ is an ultrametric provided that for all species i, j, k (distinct leaves of the tree), two of the distances $D_{ij}$, $D_{jk}$ and $D_{ik}$ are equal and $\ge$ the third.

Ex. $d(i, k) = d(j, k) \ge d(i, j)$

[Figure: ultrametric tree on leaves i, j, k with edge lengths $\lambda_1$, $\lambda_2$ and $\lambda_1 + \lambda_2$.]

Proposition: If d is ultrametric, then d is additive.
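The three-point definition of an ultrametric can be checked directly on every triple. The sketch assumes a symmetric list-of-lists matrix with exact entries; the first example matrix is hypothetical, and the second is the additive matrix from the "Finding Neighboring Leaves" slide, which is not ultrametric.

```python
from itertools import combinations

def is_ultrametric(D):
    """True if in every triple the two largest distances are equal."""
    for i, j, k in combinations(range(len(D)), 3):
        a, b, c = sorted([D[i][j], D[j][k], D[i][k]])
        if b != c:
            return False
    return True

clock_like = [[0, 2, 6], [2, 0, 6], [6, 6, 0]]   # hypothetical ultrametric
additive_only = [[0, 13, 21, 22], [13, 0, 12, 13], [21, 12, 0, 13], [22, 13, 13, 0]]

print(is_ultrametric(clock_like))     # True
print(is_ultrametric(additive_only))  # False
```

The second result shows that the converse of the proposition fails: an additive matrix need not be ultrametric.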
Ultrametrics

Both additive distance phylogeny and perfect phylogeny can be reduced to the ultrametric phylogeny problem.

Let v = row of D containing the largest entry $m_v$.

Define $D'_{ij} = m_v + (D_{ij} - D_{vi} - D_{vj}) / 2$

[Figure: leaves i and j and the reference leaf v, with path lengths $\lambda_1$, $\lambda_2$, $\lambda_3$.]

Theorem: D is additive if and only if D' is ultrametric. (See Gusfield, Ch. 17)
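The transformation can be tried on the additive five-leaf matrix from the "Reconstructing Additive Distances Given T" slides (order v, w, x, y, z). In this sketch only the off-diagonal entries of D' are computed and the diagonal is left at 0, which is an assumption made for the check; the helper names are hypothetical.

```python
from itertools import combinations

def transform(D):
    """D'_ij = m_v + (D_ij - D_vi - D_vj) / 2 for i != j."""
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))   # a row with the largest entry
    m_v = max(D[v])
    Dp = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        Dp[i][j] = Dp[j][i] = m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
    return Dp

def three_point_ok(D):
    """Ultrametric test: in every triple the two largest distances are equal."""
    return all(sorted([D[i][j], D[j][k], D[i][k]])[1] == max(D[i][j], D[j][k], D[i][k])
               for i, j, k in combinations(range(len(D)), 3))

D = [[0, 10, 17, 16, 16],
     [10, 0, 15, 14, 14],
     [17, 15, 0, 9, 15],
     [16, 14, 9, 0, 14],
     [16, 14, 15, 14, 0]]
Dp = transform(D)
print(three_point_ok(Dp))  # True
```

Here v is the first row (largest entry 17), and for example $D'_{xy} = 17 + (9 - 17 - 16)/2 = 5$.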