CSCI1950-Z Computational Methods for Biology
Lecture 5
Ben Raphael
February 6, 2009
http://cs.brown.edu/courses/csci1950-z/

Alignment vs. Distance Matrix

Sequence a gene of length m in n species: n x m alignment matrix.

  Mouse:       ACAGTGACGCCACACACGT
  Gorilla:     CCTGCGACGTAACAAACGC
  Chimpanzee:  CCTGCCAGTTAGCAAACGC
  Human:       CCTGCCAGTTAGCACACGA

Transform into an n x n distance matrix:

   0   7  11  10
   7   0   4   6
  11   4   0   2
  10   6   2   0

The reverse transformation is not possible due to loss of information.
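The slide does not say which distance was used. A minimal sketch, assuming the distance is the number of mismatched alignment columns (Hamming distance), reproduces the matrix above. The function names `hamming` and `distance_matrix` are illustrative only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """n x n matrix of pairwise Hamming distances between aligned sequences."""
    return [[hamming(a, b) for b in seqs] for a in seqs]


# The four aligned sequences from the slide, in the order
# Mouse, Gorilla, Chimpanzee, Human.
alignment = [
    "ACAGTGACGCCACACACGT",
    "CCTGCGACGTAACAAACGC",
    "CCTGCCAGTTAGCAAACGC",
    "CCTGCCAGTTAGCACACGA",
]
for row in distance_matrix(alignment):
    print(row)
```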
Distances in Trees

Given a tree T with a positive weight w(e) on each edge, we define the tree distance d_T on the set L of leaves by:

  d_T(i, j) = sum of the weights of the edges on the unique path from i to j.

Example: d_T(1, 4) = 12 + 13 + 14 + 17 + 13 = 69.

[Figure: weighted tree with leaves i and j.]

Additivity and Four Point Condition

Theorem: If an n x n matrix D is additive*, then there exists a unique (up to isomorphism) phylogenetic tree T such that d_T(i, j) = D(i, j).

*D is additive iff the four point condition holds for every quartet 1 ≤ i, j, k, l ≤ n:

  D_ij + D_kl ≤ D_ik + D_jl = D_il + D_jk

for a suitable labeling of the quartet. Equivalently, the two largest of the three pairwise sums are equal.

[Figure: quartet tree on leaves i, j, k, l with edge weights λ1, ..., λ5.]
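The four point condition can be tested directly by enumerating quartets. The sketch below assumes D is already a symmetric matrix with zero diagonal and nonnegative entries; `is_additive` and the tolerance `tol` are illustrative names, not part of the lecture.

```python
from itertools import combinations


def is_additive(D, tol=1e-9):
    """Check the four point condition on every quartet of indices.

    Assumes D is symmetric with zero diagonal. For each quartet, the two
    largest of the three pairwise sums must be equal (up to tol).
    """
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances read off a hypothetical quartet tree: leaves 0 and 1 hang from
# one internal vertex (edges 1 and 2), leaves 2 and 3 from the other
# (edges 4 and 5), and the internal edge has weight 3.
tree_like = [[0, 3, 8, 9], [3, 0, 9, 10], [8, 9, 0, 9], [9, 10, 9, 0]]

# The mouse/gorilla/chimpanzee/human matrix from the first slide.
primates = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]

print(is_additive(tree_like))  # True
print(is_additive(primates))   # False: the sums are 9, 14, 17
```

This brute-force check visits all O(n^4) quartets, which is fine for small examples but not a practical test for large n.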
Fitting an Additive Distance Matrix (Finding T)

AdditivePhylogeny algorithm (see last lecture)
Clustering methods:
  UPGMA: produces an ultrametric tree.
  Neighbor joining: today.

UPGMA Algorithm

Initialization:
  Assign each x_i to its own cluster C_i.
  Define one leaf per sequence, each at height 0.
Iteration:
  Find two clusters C_i and C_j such that d_ij is minimal.
  Let C_k = C_i ∪ C_j.
  Add a vertex connecting C_i and C_j, and place it at height d_ij / 2.
  Delete C_i and C_j.
Termination:
  When a single cluster remains.

[Figure: leaves 1 to 5 merged step by step into a tree.]
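The steps above can be sketched as follows. The slide does not state how d is defined between merged clusters; this sketch assumes the usual UPGMA rule, the average of all pairwise distances between members, maintained with a size-weighted update. The function name `upgma` and the nested-tuple representation of clusters are illustrative choices.

```python
from itertools import combinations


def upgma(D, labels):
    """Return the merges as (cluster_a, cluster_b, height) in merge order.

    Assumption (not on the slide): the distance between clusters is the
    average pairwise distance between their members.
    """
    dist = {a: {b: float(D[i][j]) for j, b in enumerate(labels) if j != i}
            for i, a in enumerate(labels)}
    size = {a: 1 for a in labels}
    merges = []
    while len(dist) > 1:
        # Find the two clusters with minimal distance.
        a, b = min(combinations(dist, 2), key=lambda p: dist[p[0]][p[1]])
        new = (a, b)
        height = dist[a][b] / 2
        # Size-weighted average distance from the merged cluster to the rest.
        row = {m: (size[a] * dist[a][m] + size[b] * dist[b][m]) / (size[a] + size[b])
               for m in dist if m not in (a, b)}
        size[new] = size[a] + size[b]
        # Delete the two old clusters and add the new one.
        del dist[a], dist[b]
        for m in dist:
            del dist[m][a], dist[m][b]
            dist[m][new] = row[m]
        dist[new] = row
        merges.append((a, b, height))
    return merges


primates = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]
for merge in upgma(primates, ["Mouse", "Gorilla", "Chimpanzee", "Human"]):
    print(merge)
```

On the four-species matrix from the first slide this merges Chimpanzee and Human at height 1.0, then Gorilla at height 2.5, then Mouse last.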
UPGMA Example

[Figure: UPGMA worked example, from Felsenstein, Inferring Phylogenies.]
UPGMA Example

[Figure: UPGMA worked example, from Felsenstein, Inferring Phylogenies.]

Trees from UPGMA

UPGMA produces an ultrametric tree: the distance from the root to any leaf is the same.

The Molecular Clock: The evolutionary distance between species x and y is twice the Earth time to reach the nearest common ancestor. That is, the molecular clock has a constant rate in all species.

[Figure: tree on leaves 1 to 5 drawn against a time axis in years.]

The molecular clock results in ultrametric distances.
Ultrametrics

D is an ultrametric provided that for all species i, j, k (distinct leaves of the tree), two of the distances D_ij, D_jk and D_ik are equal and ≥ the third.

Ex. d(i, k) = d(j, k) ≥ d(i, j)

[Figure: tree on leaves i, j, k with edge weights λ1, λ1, λ2 and λ1 + λ2.]

Proposition: If d is ultrametric, then d is additive.

Ultrametrics

Both additive distance phylogeny and perfect phylogeny can be reduced to the ultrametric phylogeny problem.

Let v = row of D containing the largest entry m_v.

Define D'_ij = m_v + (D_ij − D_vi − D_vj) / 2.

[Figure: tree on leaves i, j and v with edge weights λ1, λ2, λ3.]

Theorem: D is additive if and only if D' is ultrametric. (See Gusfield, Ch. 17)
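The three point condition and the transformation D' can be tried out on small matrices. This is a sketch under stated assumptions: the diagonal of D' is set to 0 (the slide defines D'_ij only implicitly for i ≠ j), ties for the largest entry are broken by taking the first such row, and the names `is_ultrametric` and `to_ultrametric` are illustrative.

```python
from itertools import combinations


def is_ultrametric(D, tol=1e-9):
    """For every triple, the two largest of the three distances must be equal."""
    for i, j, k in combinations(range(len(D)), 3):
        a, b, c = sorted([D[i][j], D[j][k], D[i][k]])
        if abs(c - b) > tol:
            return False
    return True


def to_ultrametric(D):
    """Compute D'_ij = m_v + (D_ij - D_vi - D_vj) / 2 with zero diagonal.

    v is the first row containing the largest entry m_v (an assumed
    tie-breaking rule).
    """
    n = len(D)
    m_v = max(max(row) for row in D)
    v = next(i for i, row in enumerate(D) if max(row) == m_v)
    return [[0.0 if i == j else m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
             for j in range(n)] for i in range(n)]


# Additive matrix from a hypothetical quartet tree, and the non-additive
# four-species matrix from the first slide.
tree_like = [[0, 3, 8, 9], [3, 0, 9, 10], [8, 9, 0, 9], [9, 10, 9, 0]]
primates = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]

print(is_ultrametric(to_ultrametric(tree_like)))  # True
print(is_ultrametric(to_ultrametric(primates)))   # False
```

The two outputs agree with the theorem: the additive matrix transforms to an ultrametric, and the non-additive one does not.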
Additive vs. Ultrametric Trees

[Figure: from Felsenstein, Inferring Phylogenies.]

Neighbor Joining Algorithm (Saitou and Nei 1987)

Constructs binary phylogenetic trees.

Recall: leaves a and b are neighbors provided that they have a common parent. (Note: In graph theory there is a different usage of neighbor.)

Recall: closest leaves are not necessarily neighbors. Pairs of leaves that are close to each other but far from other leaves are neighbors.

Key Advantages
  Reproduces the correct tree for an additive matrix.
  Gives a good approximation of the correct tree for a non-additive matrix.
  Does not rely on the molecular clock assumption, unlike UPGMA.
Neighbor Joining as a Pair Group Method

Iteratively combine the leaves/groups that minimize a selection criterion into larger groups.

  C ← { {1}, ..., {n} }
  While |C| > 2 do
    [Select pair of clusters.] s(C_x, C_y) = min s(C_i, C_j).
    C_k ← C_x ∪ C_y
    [Replace C_x and C_y by C_k.] C ← (C \ C_x \ C_y) ∪ C_k.

[Figure: leaves 1 to 5 merged step by step into a tree.]

NJ Selection Criterion

Let C = {1, ..., n} be the current clusters/leaves.

Define: u_i = Σ_k D(i, k).

Intuitively, u_i measures the separation of i from the other leaves.

Goal: Minimize D(i, j) and maximize u_i + u_j.

Solution: Find the pair (i, j) that minimizes:

  S_D(i, j) = (n − 2) D(i, j) − u_i − u_j

Claim: Given an additive matrix D, S_D(x, y) = min S_D(i, j) if and only if x and y are neighbors in the tree T with d_T = D.

[Figure: four-leaf tree on leaves 1, 2, 3, 4 with edge weights 0.1, 0.1, 0.1, 0.4, 0.4.]
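A small numeric check shows why the criterion S_D is used instead of the raw distance. The tree below is an assumption made for illustration: leaves 0 and 1 are neighbors (pendant edges 1 and 4), leaves 2 and 3 are neighbors (pendant edges 1 and 4), and the internal edge has weight 1. Weights are in tenths so the arithmetic stays in integers. The name `nj_scores` is illustrative.

```python
from itertools import combinations


def nj_scores(D):
    """S_D(i, j) = (n - 2) * D(i, j) - u_i - u_j for every pair i < j."""
    n = len(D)
    u = [sum(row) for row in D]  # u_i = sum over k of D(i, k)
    return {(i, j): (n - 2) * D[i][j] - u[i] - u[j]
            for i, j in combinations(range(n), 2)}


# Tree distances for the assumed tree described above.
D = [
    [0, 5, 3, 6],
    [5, 0, 6, 9],
    [3, 6, 0, 5],
    [6, 9, 5, 0],
]

scores = nj_scores(D)
best = min(scores.values())
print(sorted(pair for pair, s in scores.items() if s == best))  # [(0, 1), (2, 3)]
print(scores[(0, 2)])  # -22, even though D[0][2] = 3 is the smallest distance
```

Leaves 0 and 2 are the closest pair but are not neighbors; the minimum of S_D is attained exactly at the two neighbor pairs, as the claim states for additive matrices.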
Algorithm: Neighbor joining

Initialization:
  Form n clusters, one for each leaf node.
  Define T to be the set of leaf nodes, one per sequence.
Iteration:
  Pick i, j such that S_D(i, j) = (n − 2) D(i, j) − u_i − u_j is minimal.
  Merge i and j into a new node (ij) in T.
  Assign length ½ ( D(i, j) + (u_i − u_j) / (n − 2) ) to edge (i, (ij)).
  Assign length ½ ( D(i, j) + (u_j − u_i) / (n − 2) ) to edge (j, (ij)).
  Remove the rows and columns of D corresponding to i and j.
  Add a row and column to D for the new vertex (ij):
    D((ij), m) = ½ [ D(i, m) + D(j, m) − D(i, j) ]
Termination:
  When only one cluster remains.

Neighbor Joining Tree

[Figure: from Felsenstein, Inferring Phylogenies.]
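The iteration above can be sketched as follows. One assumption is made beyond the slide: the loop stops when two clusters remain and joins them by a single edge of length D(i, j), which avoids dividing by n − 2 = 0. The function name `neighbor_joining` and the use of nested tuples as names for internal nodes are illustrative.

```python
from itertools import combinations


def neighbor_joining(D, labels):
    """Return the tree as a list of edges (node, node, length)."""
    dist = {a: {b: float(D[i][j]) for j, b in enumerate(labels) if j != i}
            for i, a in enumerate(labels)}
    edges = []
    while len(dist) > 2:
        n = len(dist)
        u = {a: sum(dist[a].values()) for a in dist}
        # Pick the pair minimizing S_D(i, j) = (n - 2) D(i, j) - u_i - u_j.
        i, j = min(combinations(dist, 2),
                   key=lambda p: (n - 2) * dist[p[0]][p[1]] - u[p[0]] - u[p[1]])
        dij = dist[i][j]
        new = (i, j)
        edges.append((i, new, 0.5 * (dij + (u[i] - u[j]) / (n - 2))))
        edges.append((j, new, 0.5 * (dij + (u[j] - u[i]) / (n - 2))))
        # Distances from the new node: D((ij), m) = (D(i,m) + D(j,m) - D(i,j)) / 2.
        row = {m: 0.5 * (dist[i][m] + dist[j][m] - dij)
               for m in dist if m not in (i, j)}
        del dist[i], dist[j]
        for m in dist:
            del dist[m][i], dist[m][j]
            dist[m][new] = row[m]
        dist[new] = row
    # Assumed final step: connect the last two clusters directly.
    a, b = dist
    edges.append((a, b, dist[a][b]))
    return edges


# Additive matrix from a hypothetical quartet tree with pendant edges
# a: 1, b: 2, c: 4, d: 5 and internal edge 3.
D = [[0, 3, 8, 9], [3, 0, 9, 10], [8, 9, 0, 9], [9, 10, 9, 0]]
for edge in neighbor_joining(D, ["a", "b", "c", "d"]):
    print(edge)
```

On this additive input the sketch recovers the five edge weights of the generating tree, consistent with the consistency property of neighbor joining.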
Neighbor Joining vs. UPGMA Tree

[Figure: from Felsenstein, Inferring Phylogenies.]

NJ Selection Criterion

Let C = {1, ..., n} be the current clusters/leaves.

Define: u_i = Σ_k D(i, k).

Goal: Minimize D(i, j) and maximize u_i + u_j.

Solution: Find the pair (i, j) that minimizes:

  S_D(i, j) = (n − 2) D(i, j) − u_i − u_j

Claim: Given an additive matrix D, S_D(x, y) = min S_D(i, j) if and only if x and y are neighbors in the tree T with d_T = D.

[Figure: four-leaf tree on leaves 1, 2, 3, 4 with edge weights 0.1, 0.1, 0.1, 0.4, 0.4.]

Proof:
Why Neighbor joining?

If D is additive, then neighbor joining produces the unique* phylogenetic tree T such that d_T = D. (Consistency)
*up to isomorphism

If D is non-additive, then neighbor joining performs well.

Why Neighbor joining?

For distance matrices D and D', define the error (l∞ norm) by

  ||D − D'||∞ = max_{i,j} |D_ij − D'_ij|.

Input: A non-additive matrix D.
Output: A tree S that is closest to D, in the sense that ||D_S − D||∞ is minimized, where D_S is the tree distance matrix of S.
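To make the error measure concrete, the sketch below computes the leaf-to-leaf distances of a candidate tree S and compares them with a matrix D in the l∞ norm. The edge-list representation and the names `tree_metric` and `linf_error` are assumptions for illustration; the measured matrix in the example is made up.

```python
from collections import defaultdict


def tree_metric(edges, leaves):
    """Leaf-to-leaf path lengths in a weighted tree given as (a, b, weight) edges."""
    adj = defaultdict(list)
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))
    rows = []
    for src in leaves:
        # Depth-first traversal from src; paths in a tree are unique.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        rows.append([dist[t] for t in leaves])
    return rows


def linf_error(A, B):
    """max over i, j of |A_ij - B_ij|."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical quartet tree S with internal vertices "u" and "w".
S = [("a", "u", 1), ("b", "u", 2), ("u", "w", 3), ("c", "w", 4), ("d", "w", 5)]
D_S = tree_metric(S, ["a", "b", "c", "d"])

# A made-up measured matrix that differs from D_S in one symmetric entry.
D = [[0, 3.5, 8, 9], [3.5, 0, 9, 10], [8, 9, 0, 9], [9, 10, 9, 0]]
print(linf_error(D_S, D))  # 0.5
```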
Why Neighbor joining?

For distance matrices D and D', define the error (l∞ norm) by

  ||D − D'||∞ = max_{i,j} |D_ij − D'_ij|.

Suppose there is a true tree T with additive matrix D = d_T. We measure a perturbed matrix D' and run NJ on D', obtaining a tree T'. How different can D and D' be and still obtain T' = T?

Theorem (Atteson 1999): If ||D − D'||∞ < ½ (shortest edge in T), then NJ applied to D' reconstructs T.

Compact Additive Trees

Compact Additive Tree Problem: Given an n x n distance matrix D, determine whether there is an additive tree for D with exactly n vertices.

Note: This is not a usual phylogenetic problem, since we are given data about ancestors.
Compact Additive Trees

Compact Additive Tree Problem: Given an n x n distance matrix D, determine whether there is an additive tree for D with exactly n vertices.

Let G(D) be the complete graph with edge weights w((i, j)) = D_ij.

Theorem: If there is a compact additive tree T for D, then T must be the unique minimum spanning tree of G(D).

Recall: A spanning tree is a tree containing all vertices. A minimum spanning tree has the least total weight.

Algorithm Summary

  Category        Method                     Input               Output
  Distance-based  Neighbor Joining           Distance matrix D   T (additive), B
  Distance-based  UPGMA                      Distance matrix D   T (ultrametric), B
  Parsimony       Sankoff's & Fitch's Alg.   Characters, T       A, B
  Parsimony       Perfect Phylogeny          Characters          A, B, T
  Probabilistic   Felsenstein                Characters, T, B    A

  T = tree topology, B = branch lengths, A = ancestral states
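The minimum spanning tree theorem for compact additive trees suggests a simple test: build a minimum spanning tree of G(D) and check whether its path lengths reproduce D. The sketch below uses Prim's algorithm and assumes D is symmetric with zero diagonal and positive off-diagonal entries; `compact_additive_tree` is an illustrative name.

```python
def compact_additive_tree(D, tol=1e-9):
    """Return the edges (parent, child, weight) of a compact additive tree, or None."""
    n = len(D)
    # Prim's algorithm on the complete graph G(D), starting from vertex 0.
    # best[j] = (cheapest weight connecting j to the tree, tree endpoint).
    best = {j: (D[0][j], 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda x: best[x][0])
        w, p = best.pop(j)
        edges.append((p, j, w))
        for k in best:
            if D[j][k] < best[k][0]:
                best[k] = (D[j][k], j)
    # Check that path lengths in the spanning tree reproduce D.
    adj = {i: [] for i in range(n)}
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))
    for s in range(n):
        dist = {s: 0}
        stack = [s]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        if any(abs(dist[t] - D[s][t]) > tol for t in range(n)):
            return None
    return edges


# A path 0 - 1 - 2 with edge weights 2 and 3 is its own compact additive tree.
print(compact_additive_tree([[0, 2, 5], [2, 0, 3], [5, 3, 0]]))  # [(0, 1, 2), (1, 2, 3)]

# The four-species matrix from the first slide has none.
print(compact_additive_tree([[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]))  # None
```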