CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Similar documents
CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Molecular Evolution and Phylogenetic Tree Reconstruction

Evolutionary Tree Analysis. Overview

Phylogeny: traditional and Bayesian approaches

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Phylogeny: building the tree of life

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

EVOLUTIONARY DISTANCES


Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetic Tree Reconstruction

Phylogenetic trees 07/10/13

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Constructing Evolutionary/Phylogenetic Trees

CSCI1950 Z Computa3onal Methods for Biology Lecture 24. Ben Raphael April 29, hgp://cs.brown.edu/courses/csci1950 z/ Network Mo3fs

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

A Phylogenetic Network Construction due to Constrained Recombination

Algorithms in Bioinformatics

Phylogeny Tree Algorithms

Phylogenetic inference

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Mul$ple Sequence Alignment Methods. Tandy Warnow Departments of Bioengineering and Computer Science h?p://tandy.cs.illinois.edu

A few logs suce to build (almost) all trees: Part II

BINF6201/8201. Molecular phylogenetic methods

X X (2) X Pr(X = x θ) (3)

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Theory of Evolution Charles Darwin

CSCI1950 Z Computa3onal Methods for Biology* (*Working Title) Lecture 1. Ben Raphael January 21, Course Par3culars

ALGORITHMS FOR RECONSTRUCTING PHYLOGENETIC TREES FROM DISSIMILARITY MAPS

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Hierarchical Clustering

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Reconstruction of certain phylogenetic networks from their tree-average distances

MOLECULAR EVOLUTION AND PHYLOGENETICS SERGEI L KOSAKOVSKY POND CSE/BIMM/BENG 181 MAY 27, 2011

Minimum Edit Distance. Defini'on of Minimum Edit Distance

Consistency Index (CI)

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetics: Building Phylogenetic Trees

A (short) introduction to phylogenetics

Lecture 18 Molecular Evolution and Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

BMI/CS 776 Lecture 4. Colin Dewey

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Tree-average distances on certain phylogenetic networks have their weights uniquely determined

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Chapter 3: Phylogenetics

Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances

Week 5: Distance methods, DNA and protein models

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Notes on the Matrix-Tree theorem and Cayley s tree enumerator

Theory of Evolution. Charles Darwin

Multiple Sequence Alignment. Sequences

2 (n 1). f (i 1, i 2,..., i k )

The Generalized Neighbor Joining method

Rela%ons and Their Proper%es. Slides by A. Bloomfield

Phylogeny Jan 5, 2016

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Phylogenetics. BIOL 7711 Computational Bioscience

arxiv: v1 [q-bio.pe] 1 Jun 2014

66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

Reconstructing Trees from Subtree Weights

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Evolutionary Models. Evolutionary Models

Constructing Evolutionary/Phylogenetic Trees

Did you know that Multiple Alignment is NP-hard? Isaac Elias Royal Institute of Technology Sweden

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Phylogenetic Networks, Trees, and Clusters

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

3. Coding theory 3.1. Basic concepts

j=1 [We will show that the triangle inequality holds for each p-norm in Chapter 3 Section 6.] The 1-norm is A F = tr(a H A).

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

RECOVERING NORMAL NETWORKS FROM SHORTEST INTER-TAXA DISTANCE INFORMATION

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

3D Data, Phylogenetics, and Trees In this lecture.

Dr. Amira A. AL-Hosary

Cartesian Tensors. e 2. e 1. General vector (formal definition to follow) denoted by components

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

The Strong Largeur d Arborescence

Integer Programming for Phylogenetic Network Problems

Lecture 10: Phylogeny

Constructing Evolutionary Trees

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

394C, October 2, Topics: Mul9ple Sequence Alignment Es9ma9ng Species Trees from Gene Trees

Lecture 2: Pairwise Alignment. CG Ron Shamir

Arrangements, matroids and codes

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Example 3.3 Moving-average filter

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

Phylogeny. November 7, 2017

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Algebraic Statistics Tutorial I

Week 15-16: Combinatorial Design

Transcription:

CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch s Alg. Characters, T A, B Perfect Phylogeny Characters A, B, T Felsenstein Characters, T, B A T = tree topology B = branch lengths A = ancestral states 1

Pairwise Compa4bility Test (Wilson 1965) Binary characters i and j are pairwise compa4ble if and only if: j is homogenous w.r.t i 0 or i 1. Equivalently: i 1 and j 1 are disjoint or one contains the other Equivalently: all 4 rows do not exist (0,0), (0,1), (1,0), (1,1) i 0 i 1 i A 0 B 0 C 0 D 1 E 1 j A 0 B 0 C 1 D 0 E 0 k A 0 B 0 C 1 D 1 E 0 Pairwise Compa4bility Theorem (Estabrook et al. 1976) A set S of binary characters is mutually compa4ble if and only if all pairs c and c of characters in S are pairwise compa4ble. Pairwise compa4bility mutual compa4bility. 2

Perfect Phylogeny A set of mutually compa4ble binary characters gives a perfect phylogeny: 1. Evolu4onary model Binary characters {0,1} Each character changes state only once in evolu4onary history (no homoplasy!). 2. Tree in which every muta4on is on an edge of the tree. All the species in one sub tree contain a 0, and all species in the other contain a 1. For simplicity, assume root = (0, 0, 0, 0, 0) species 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 1 traits Last )me: algorithm to reconstruct a tree. 1 0 Trees and Splits Given a set X, a split is a par44on of X into two non empty subsets A and B. X = A B. For a phylogene4c tree T with leaves L, each edge e defines a split L e = A B, where A and B are the leaves in the subtrees obtained by removing e. i In perfect phylogeny, edges where binary character changes state gave split i 0 and i 1. i 1 i 0 We will return to splits in a future lecture. 3

Splits Equivalence Theorem A phylogene4c tree T defines a collec4on of splits Σ(T) = { L e e is edge in T}. Splits A 1 B 1 and A 2 B 2 are pairwise compa3ble if at least one of A 1 A 2, A 1 B 2, B 1 A 2, and B 1 B 2 is the empty set. Splits Equivalence Theorem: Let Σ be a collec4on of splits. There is a phylogene4c tree such that Σ(T) = Σ if and only if the splits in Σ are pairwise compa4ble. The Pairwise Compa4bility Theorem (for binary characters) follows from this theorem. Outline Distance based methods for phylogene4c tree reconstruc4on. Review of distances/metrics. Tree distances and addi4ve distances Small and large phylogeny problems. Non addi4ve distances and clustering UPGMA and ultrametric distances. 4

Distances A distance on a set X is a func4on d: X R sa4sfying: d(x, y) 0, with equality iff x = y. For all x, y X, d(x, y) = d(y, x) [symmetry] For all x, y, z X, d(x, z) d(x, y) + d(y, z) [triangle inequality] Examples: X = real numbers, d(x, y) = x y is distance. X = strings over some alphabet. d H (s, t) = number of posi4ons where s and t differ is called Hamming distance. Distances in Biological Data String distances (e.g. Hamming distance, edit distance) on DNA/protein sequence data Subs4tu4on model (Jukes Cantor, Kimura, etc.): scores for par4cular changes A T, C G, etc. Rat: ACAGTCACGCCCCACACGT Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGAGGTAGCAAACGA Human: CCTGTGAGGTAGCACACGA 5

Distance Matrix For n species, form n x n distance matrix D ij Example: D ij = edit distance between a gene in species i and species j. Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC Chimpanzee: CCTGCCAGTTAGCAAACGC Human: CCTGCCAGTTAGCACACGA 0 7 11 10 7 0 4 6 11 4 0 2 10 6 2 0 Alignment vs. Distance Matrix Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC Chimpanzee: CCTGCCAGTTAGCAAACGC Human: CCTGCCAGTTAGCACACGA Sequence a gene of length m in n species n x m alignment matrix. Reverse transforma4on not possible due to loss of informa4on. Transform into 0 7 11 10 7 0 4 6 11 4 0 2 10 6 2 0 n x n distance matrix 6

Distances in Trees Given a tree T with a posi4ve weight w(e) on each edge, we define the tree distance d T on the set L of leaves by: d T (i, j) = sum of weights of edges on unique path from i to j. In evolu4onary biology, weights are some4mes called branch lengths. Distance in Trees: an Example j i d T (1,4) = 12 + 13 + 14 + 17 + 13 = 69 7

Distance vs. Tree Distance n x n distance matrix for n species Note that d T (i, j), tree distance between i and j, not necessarily equal to D ij as given by distance matrix. Rat: ACAGTGACGCCCCAAACGT Mouse: ACAGTGACGCTACAAACGT Gorilla: CCTGTGACGTAACAAACGA Chimpanzee: CCTGTGACGTAGCAAACGA Human: CCTGTGACGTAGCAAACGA Fivng a Distance Matrix Given n species, we can compute the n x n distance matrix D ij Evolu4on of these species is described by a tree that we don t know. We need an algorithm to construct a tree that best fits the distance matrix D ij Find a tree T such that: Lengths of path in an (unknown) tree T D ij = d T (i,j) Distance between species (known) 8

Distance Based Phylogeny Problem Goal: Reconstruct an evolu4onary tree from a distance matrix Input: n x n distance matrix D ij Output: weighted tree T with n leaves fivng D Unknown topology of tree makes evolu4onary tree reconstruc4on hard! # unrooted binary trees n leaves: T(n) = (2n 3)! / ((n 2)! 2 n 2 ) n = 24: T(n) = 5.74 x 10 26 If D is addi3ve, this problem has a solu4on and there is a simple algorithm to solve it Distance based vs. character based Key difference: Distance based methods do not reconstruct ancestral states. A B C D A 0 1 2 2 B 1 0 1 1 C 2 1 0 0 D 2 1 0 0 Note that C and D are iden4cal. 9

Reconstruc4ng a 3 Leaved Tree Tree reconstruc4on for a 3x3 matrix is straighxorward We have 3 leaves i, j, k and a center vertex c Observe: d ic + d jc = D ij d ic + d kc = D ik d jc + d kc = D jk Reconstruc4ng a 3 Leaved Tree (cont d) d ic + d jc = D ij + d ic + d kc = D ik 2d ic + d jc + d kc = D ij + D ik 2d ic + D jk = D ij + D ik d ic = (D ij + D ik D jk )/2 Similarly, d jc = (D ij + D jk D ik )/2 d kc = (D ki + D kj D ij )/2 10

Trees with > 3 Leaves A binary tree with n leaves has 2n 3 edges Fivng a given tree to a distance matrix D requires solving a system with n(n 1)/2 equa4ons and 2n 3 variables Solu4on not always possible for n > 3. Addi4ve Distance Matrices Matrix D is ADDITIVE if there exists a tree T with d ij (T) = D ij NON-ADDITIVE otherwise 11

Addi4ve Distance Phylogeny Small Addi>ve Distance Phylogeny: Given phylogene4c tree T and distance matrix D, determine branch lengths such that d T (i,j) = D ij. Large Addi>ve Distance Phylogeny: Given distance matrix D, find T and branch lengths such that d T (i,j) = D ij. Both of these problems can be solved efficiently. Reconstruc4ng Addi4ve Distances Given T x D v w x y z v 0 10 17 16 16 w 0 15 14 14 x 0 9 15 z y 7 4 5 3 T 3 4 6 w v y 0 14 z 0 If we know T and D, but do not know the length of each edge, we can reconstruct those lengths 12

Reconstruc4ng Addi4ve Distances Given T x D v w x y z y T v 0 10 17 16 16 w 0 15 14 14 z w x 0 9 15 v y 0 14 z 0 Reconstruc4ng Addi4ve Distances Given T D v w x y z v 0 10 17 16 16 w 0 15 14 14 x 0 9 15 y 0 14 z 0 z y x Find neighbors v, w (common parent) a w a x y z d ax = ½ (d vx + d wx d vw ) v D 1 a 0 11 10 10 x 0 9 15 y 0 14 z 0 d ay = ½ (d vy + d wy d vw ) d az = ½ (d vz + d wz d vw ) 13

Reconstruc4ng Addi4ve Distances Given T D 1 a x y z a 0 11 10 10 x 0 9 15 y 0 14 z 0 z y 7 4 x 5 b 3 Neighbors x, y (common parent) c 3 a 4 w D 2 a b z a 0 6 10 b 0 10 z 0 D 3 a c a 0 3 c 0 6 d(a, c) = 3 d(b, c) = d(a, b) d(a, c) = 3 d(c, z) = d(a, z) d(a, c) = 7 d(b, x) = d(a, x) d(a, b) = 5 d(b, y) = d(a, y) d(a, b) = 4 d(a, w) = d(z, w) d(a, z) = 4 d(a, v) = d(z, v) d(a, z) = 6 Correct!!! v Trees and Neighbors Previous algorithm relied only on finding neighboring leaves: 1. Find neighboring leaves i and j with parent k 2. Remove the rows and columns of i and j 3. Add a new row and column corresponding to k, where the distance from k to any other leaf m can be computed as: D km = (D im + D jm D ij )/2 Compress i and j into k, iterate algorithm for rest of tree 14

Finding Neighboring Leaves To find neighboring leaves we simply select a pair of closest leaves. i j k l i 0 13 21 22 j 0 12 13 k 0 13 l 0 WRONG! i and j are neighbors, but (d ij = 13) > (d jk = 12). Finding a pair of neighboring leaves is a nontrivial problem! Degenerate Triples A degenerate triple is a set of three dis4nct elements 1 i, j, k n where D ij + D jk = D ik Element j in a degenerate triple i,j,k lies on the evolu4onary path from i to k (or is ahached to this path by an edge of length 0). 15

Looking for Degenerate Triples If distance matrix D has a degenerate triple i,j,k then j can be removed from D thus reducing the size of the problem. If distance matrix D does not have a degenerate triple i,j,k, one can create a degenera4ve triple in D by shortening all hanging edges (in the tree). Shortening Hanging Edges to Produce Degenerate Triples Shorten all hanging edges (edges that connect leaves) un4l a degenerate triple is found 16

Finding Degenerate Triples If there is no degenerate triple, all hanging edges are reduced by the same amount δ, so that all pair wise distances in the matrix are reduced by 2δ. Eventually this process collapses one of the leaves (when δ = length of shortest hanging edge), forming a degenerate triple i,j,k and reducing the size of the distance matrix D. The ahachment point for j can be recovered in the reverse transforma4ons by saving D ij for each collapsed leaf. Reconstruc4ng Trees for Addi4ve Distance Matrices Trim(D, δ) for all 1 i j n D ij = D ij 2δ 17

Addi4vePhylogeny Algorithm AdditivePhylogeny(D) if D is a 2 x 2 matrix T = tree of a single edge of length D 1,2 return T if D is non-degenerate Compute trimming parameter δ Trim(D, δ) Find a triple i, j, k in D such that D ij + D jk = D ik x = D ij Remove j th row and j th column from D T = AdditivePhylogeny(D) Traceback Addi4vePhylogeny (cont d) Traceback Add a new vertex v to T at distance x from i to k Add j back to T by creating an edge (v,j) of length 0 for every leaf l in T if distance from l to v in the tree D l,j output matrix is not additive return Extend all hanging edges by length δ return T Ques>on: How to compute δ? 18

Addi4ve Distance How to tell if D is addi4ve? Addi4vePhylogeny provides a way to check if distance matrix D is addi4ve An even more efficient addi>vity check is the four point condi>on The Four Point Condi4on (Zaretskii 1965, Buneman 1971) Let 1 i,j,k,l n be four dis4nct leaves in a tree Compute: 1. D ij + D kl, 2. D ik + D jl, 3. D il + D jk 2 and 3 represent the same number: (length of all edges) + 2 * (length middle edge) 2 3 1 1 represents a smaller number: (length of all edges) (length middle edge) 19

The Four Point Condi4on Four point condi>on: Every four leaves (quartet) can be labeled as i,j,k,l such that: D ij + D kl D ik + D jl = D il + D jk Theorem : An n x n matrix D is addi4ve if and only if the four point condi4on holds for every quartet 1 i,j,k,l n The Four Point Condi4on Theorem : An n x n matrix D is addi4ve if and only if the four point condi4on holds for every quartet 1 i,j,k,l n. Proof: Since D addi4ve, D = d T. Find split such that: i, j S 1 and k, l S 2. Define λ m to be weights in tree below. D ik + D jl = (λ 1 + λ 3 + λ 4 ) + (λ 2 + λ 3 + λ 5 ) = D il + D jk >= (λ 1 + λ 2 ) + ( λ 4 + λ 5 ). i λ 1 λ 4 k λ 3 j λ 2 λ 5 l 20

Non addi4ve Distances What if there is no tree T such that D ij = d T (i,j). Approaches: 1. Find tree such that minimizes error 2. Heuris4c approaches: clustering. Least Squares Distance Phylogeny Problem If the distance matrix D is NOT addi4ve, then we look for a tree T that approximates D the best: Squared Error : i,j (d ij (T) D ij ) 2 Squared Error is a measure of the quality of the fit between distance matrix and the tree: we want to minimize it. Least Squares Distance Phylogeny Problem: Find approxima4on tree T with minimum squared error for a non addi4ve matrix D. (NP hard) 21

Tree construc4on as clustering 1 4 3 2 5 1 4 2 3 5 Pair Group Methods Itera4vely combine closest leaves/groups into larger groups. 1 4 3 2 5 C { {1},, {n} } While C > 2 do [Find closest clusters.] d(c i, C j ) = min d(c i, C j ). C k C i C j [Replace C i and C j by C k.] C (C \ C i \ C j ) C k. 1 4 2 3 5 22

Pair Group Methods What is d? 1 4 How to define branch lengths? 3 2 5 C { {1},, {n} } While C > 2 do [Find closest clusters.] d(c i, C j ) = min d(c i, C j ). C k C i C j [Replace C i and C j by C k.] C (C \ C i \ C j ) C k. 1 4 2 3 5 UPGMA Unweighted Pair Group Method with Averages Distance between clusters defined as average pairwise distance Assigns height to every vertex in the tree, effec4vely da4ng every vertex 1 4 3 2 5 1 4 2 3 5 23

UPGMA Unweighted Pair Group Method with Averages Distance between clusters defined as average pairwise distance 1 4 3 2 5 Given two disjoint clusters C i, C j of sequences, 1 d ij = Σ {p Ci, q C j } d pq C i C j 1 4 2 3 5 UPGMA Unweighted Pair Group Method with Averages Assigns height to every vertex in the tree, effec4vely da4ng every vertex Add a vertex connec4ng C i, C j and place it at height d ij / 2 1 4 3 2 5 1 4 2 3 5 24

UPGMA Algorithm Ini>aliza>on: Assign each x i to its own cluster C i Define one leaf per sequence, each at height 0 Itera>on: Find two clusters C i and C j such that d ij is min Let C k = C i C j Add a vertex connec4ng C i, C j and place it at height d ij /2 Delete C i and C j Termina>on: When a single cluster remains 1 4 3 2 5 1 4 2 3 5 Trees from UPGMA UPGMA produces an ultrametric tree; distance from the root to any leaf is the same The Molecular Clock: The evolu4onary distance between species x and y is twice the Earth 4me to reach the nearest common ancestor That is, the molecular clock has constant rate in all species years 1 4 2 3 5 The molecular clock results in ultrametric distances 25

UPGMA s Weakness: Example Correct tree UPGMA 2 3 1 4 1 4 3 2 Ultrametrics D ij is an ultrametric provided for all species i, j, k (dis4nct leaves of tree) two of the distances D ij, D jk and D ik are equal and the third. Ex. d(i,k) = d(j, k) d(i, j) i j λ 1 λ 1 λ 2 λ k 1 + λ 2 2 λ 1 Thus λ 2 λ 1 Proposi>on: If d is ultrametric, then d is addi4ve. 26

Ultrametrics Both addi4ve distance phylogeny and perfect phylogeny can be reduced to the ultrametric phylogeny problem. Let v = row of D containing largest entry m v. Define D ij = m v + (D ij D vi D vj ) / 2 i = m v λ 3 λ 1 λ 3 v Theorem: D is addi4ve if and only if D is ultrametric. (See Gusfield, Ch. 17) j λ 2 27