CSCI1950-Z Computational Methods for Biology, Lecture 5


Ben Raphael, February 6, 2009
http://cs.brown.edu/courses/csci1950-z/

Alignment vs. Distance Matrix

Sequence a gene of length m in n species to obtain an n x m alignment matrix:

  Mouse:      ACAGTGACGCCACACACGT
  Gorilla:    CCTGCGACGTAACAAACGC
  Chimpanzee: CCTGCCAGTTAGCAAACGC
  Human:      CCTGCCAGTTAGCACACGA

Transform it into an n x n distance matrix (rows and columns in the same species order):

              Mouse  Gorilla  Chimpanzee  Human
  Mouse           0        7          11     10
  Gorilla         7        0           4      6
  Chimpanzee     11        4           0      2
  Human          10        6           2      0

The reverse transformation is not possible because information is lost.
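The slide does not say which distance turns the alignment into the matrix. A minimal sketch, assuming the distance is the number of mismatched positions (Hamming distance), reproduces the matrix above; the function names `hamming` and `distance_matrix` are illustrative and not from the lecture.

```python
def hamming(a, b):
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Build the n x n matrix of pairwise Hamming distances."""
    return [[hamming(a, b) for b in seqs] for a in seqs]


# The four sequences from the slide, in the same order as the matrix rows.
alignment = {
    "Mouse": "ACAGTGACGCCACACACGT",
    "Gorilla": "CCTGCGACGTAACAAACGC",
    "Chimpanzee": "CCTGCCAGTTAGCAAACGC",
    "Human": "CCTGCCAGTTAGCACACGA",
}

D = distance_matrix(list(alignment.values()))
for name, row in zip(alignment, D):
    print(f"{name:<11}", row)
```

Many different alignments share the same pairwise mismatch counts, which is one way to see why the transformation cannot be reversed.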

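Distances in Trees

Given a tree T with a positive weight w(e) on each edge, we define the tree distance $d_T$ on the set L of leaves by:

  $d_T(i, j)$ = sum of the weights of the edges on the unique path from i to j.

Example: $d_T(1, 4) = 12 + 13 + 14 + 17 + 13 = 69$.
[Figure: weighted tree with the path between two leaves i and j highlighted.]

Additivity and Four Point Condition

Theorem: If an n x n matrix D is additive*, then there exists a unique (up to isomorphism) phylogenetic tree T such that $d_T(i, j) = D(i, j)$.

*D is additive iff the four point condition holds for every quartet $1 \le i, j, k, l \le n$: after a suitable relabeling of the quartet,

  $D_{ij} + D_{kl} \le D_{ik} + D_{jl} = D_{il} + D_{jk}$.

That is, of the three pairwise sums, the two largest are equal.
[Figure: quartet tree on leaves i, j, k, l with edge weights $\lambda_1, \dots, \lambda_5$.]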
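The four point condition can be checked directly on a matrix. The sketch below assumes D is symmetric with a zero diagonal and tests every quartet, including quartets with repeated indices (which is how the triangle inequality gets covered). The function names and the tolerance `tol` are illustrative assumptions, and the second example matrix is built by hand from a quartet tree with leaf edges 1, 2, 3, 4 and internal edge 5.

```python
from itertools import combinations_with_replacement


def four_point_ok(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D, tol=1e-9):
    """Check the four point condition on every quartet of indices."""
    quartets = combinations_with_replacement(range(len(D)), 4)
    return all(four_point_ok(D, *q, tol=tol) for q in quartets)


# Distance matrix from the alignment slide (Mouse, Gorilla, Chimpanzee, Human).
D_primates = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]

# Tree metric of a quartet tree: leaves a, b on one side (edges 1, 2),
# leaves c, d on the other (edges 3, 4), internal edge of weight 5.
D_tree = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]

print(is_additive(D_primates))  # the sums are 9, 14, 17, so this fails
print(is_additive(D_tree))      # the sums are 10, 20, 20, so this passes
```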

Fitting an Additive Distance Matrix (Finding T)

- AdditivePhylogeny algorithm (see last lecture)
- Clustering methods
  - UPGMA: produces an ultrametric tree
  - Neighbor joining: today

UPGMA Algorithm

Initialization:
  Assign each $x_i$ to its own cluster $C_i$.
  Define one leaf per sequence, each at height 0.
Iteration:
  Find two clusters $C_i$ and $C_j$ such that $d_{ij}$ is minimal.
  Let $C_k = C_i \cup C_j$.
  Add a vertex connecting $C_i$ and $C_j$, and place it at height $d_{ij}/2$.
  Delete $C_i$ and $C_j$.
Termination:
  When a single cluster remains.

[Figure: five points (1 to 5) grouped into nested clusters, and the resulting tree with leaf order 1, 4, 2, 3, 5.]
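The steps above can be sketched in a few lines. The slide does not spell out how $d_{ij}$ is defined between clusters; this sketch assumes the average distance over all leaf pairs, which is what the "arithmetic mean" in the name UPGMA refers to. The nested-tuple tree representation and the function name `upgma` are illustrative choices.

```python
from itertools import combinations


def upgma(D, labels):
    """Return (tree, root_height); tree is a nested tuple of labels."""
    # cluster id -> (subtree, member leaf indices, height of its root)
    clusters = {i: (labels[i], [i], 0.0) for i in range(len(labels))}

    def avg(a, b):
        # Average distance over all leaf pairs between clusters a and b.
        la, lb = clusters[a][1], clusters[b][1]
        return sum(D[x][y] for x in la for y in lb) / (len(la) * len(lb))

    next_id = len(labels)
    while len(clusters) > 1:
        # Find the two closest clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        d = avg(a, b)
        ta, la, _ = clusters.pop(a)
        tb, lb, _ = clusters.pop(b)
        # The new vertex joins both clusters and sits at height d / 2.
        clusters[next_id] = ((ta, tb), la + lb, d / 2)
        next_id += 1

    (tree, _, height), = clusters.values()
    return tree, height


labels = ["Mouse", "Gorilla", "Chimpanzee", "Human"]
D = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]
tree, height = upgma(D, labels)
print(tree, height)
```

On this matrix, Chimpanzee and Human merge first (distance 2), Gorilla joins next (average distance 5), and Mouse joins last.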

UPGMA Example (from Felsenstein, Inferring Phylogenies)
[Figure: worked UPGMA example; not reproduced in this transcript.]


Trees from UPGMA

UPGMA produces an ultrametric tree: the distance from the root to any leaf is the same.

The Molecular Clock: the evolutionary distance between species x and y is twice the Earth time to reach their nearest common ancestor. That is, the molecular clock has a constant rate in all species.

[Figure: ultrametric tree on leaves 1, 4, 2, 3, 5 drawn against a time axis in years.]

The molecular clock results in ultrametric distances.

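Ultrametrics

$D_{ij}$ is an ultrametric provided that, for all species i, j, k (distinct leaves of the tree), two of the distances $D_{ij}$, $D_{jk}$ and $D_{ik}$ are equal and $\ge$ the third.

Example: $d(i, k) = d(j, k) \ge d(i, j)$.
[Figure: leaves i and j each joined to a common vertex by an edge of weight $\lambda_1$; leaf k joined to that vertex by a path of weight $\lambda_2$.]
Here $d(i, k) = d(j, k) = \lambda_1 + \lambda_2 \ge 2\lambda_1 = d(i, j)$. Thus $\lambda_2 \ge \lambda_1$.

Proposition: If d is ultrametric, then d is additive.

Ultrametrics

Both additive distance phylogeny and perfect phylogeny can be reduced to the ultrametric phylogeny problem.

Let v = row of D containing the largest entry $m_v$. Define

  $D'_{ij} = m_v + (D_{ij} - D_{vi} - D_{vj}) / 2 = m_v - \lambda_3$

[Figure: leaves i, j and v meeting at a common vertex, with edge weights $\lambda_1$ (to i), $\lambda_2$ (to j) and $\lambda_3$ (to v).]

Theorem: D is additive if and only if D' is ultrametric. (See Gusfield, Ch. 17)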
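Both the three-point test and the transformation $D'_{ij} = m_v + (D_{ij} - D_{vi} - D_{vj})/2$ are easy to try on small matrices. The sketch below is an illustration under assumptions: the function names are made up, ties for the largest entry are broken by taking the first row, the diagonal of D' is set to 0, and the example matrix is the tree metric of a hand-built quartet tree (leaf edges 1, 4, 1, 4 and internal edge 1), not data from the lecture.

```python
from itertools import combinations


def is_ultrametric(D, tol=1e-9):
    """Three-point test: in every triple the two largest distances agree."""
    for i, j, k in combinations(range(len(D)), 3):
        _, mid, big = sorted([D[i][j], D[j][k], D[i][k]])
        if abs(big - mid) > tol:
            return False
    return True


def to_ultrametric(D):
    """Return (v, D') with D'_ij = m_v + (D_ij - D_vi - D_vj) / 2."""
    n = len(D)
    v = max(range(n), key=lambda r: max(D[r]))  # first row holding the max
    m_v = max(D[v])
    Dp = [[0.0 if i == j else m_v + (D[i][j] - D[v][i] - D[v][j]) / 2
           for j in range(n)] for i in range(n)]
    return v, Dp


# Additive but not ultrametric: quartet tree with neighbors (0, 1) and (2, 3).
D = [[0, 5, 3, 6], [5, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]]
v, Dp = to_ultrametric(D)
print(is_ultrametric(D), is_ultrametric(Dp))
```

For this example the largest entry is 9 in row 1, so v = 1 and $D'_{23} = 9 + (5 - 6 - 9)/2 = 4$.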

Additive vs. Ultrametric Trees (from Felsenstein, Inferring Phylogenies)
[Figure: an additive tree next to an ultrametric tree; not reproduced in this transcript.]

Neighbor Joining Algorithm (Saitou and Nei 1987)

Constructs binary phylogenetic trees.

Recall: leaves a and b are neighbors provided that they have a common parent. (Note: graph theory uses "neighbor" differently.)

Recall: the closest leaves are not necessarily neighbors.

A pair of leaves that are close to each other but far from the other leaves are neighbors.

Key Advantages
- Reproduces the correct tree for an additive matrix.
- Gives a good approximation of the correct tree for a non-additive matrix.
- Does not rely on the molecular clock assumption, unlike UPGMA.

Neighbor Joining as a Pair Group Method

Iteratively combine the leaves/groups that minimize a selection criterion into larger groups.

  $C \leftarrow \{\{1\}, \dots, \{n\}\}$
  While $|C| > 2$ do
    [Select pair of clusters.]  $s(C_x, C_y) = \min s(C_i, C_j)$
    $C_k \leftarrow C_x \cup C_y$
    [Replace $C_x$ and $C_y$ by $C_k$.]  $C \leftarrow (C \setminus \{C_x, C_y\}) \cup \{C_k\}$

[Figure: five points (1 to 5) grouped into nested clusters, and the resulting tree with leaf order 1, 4, 2, 3, 5.]

NJ Selection Criterion

Let C = {1, ..., n} be the current clusters/leaves. Define:

  $u_i = \sum_k D(i, k)$

Intuitively, $u_i$ measures the separation of i from the other leaves.

Goal: Minimize D(i, j) and maximize $u_i + u_j$.

Solution: Find the pair (i, j) that minimizes:

  $S_D(i, j) = (n - 2) D(i, j) - u_i - u_j$

Claim: Given an additive matrix D, $S_D(x, y) = \min_{i,j} S_D(i, j)$ if and only if x and y are neighbors in the tree T with $d_T = D$.

[Figure: tree on leaves 1, 2, 3, 4 with three edges of weight 0.1 and two edges of weight 0.4.]
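A small numeric case shows why the criterion is needed. The sketch below assumes a quartet tree in which leaves 0 and 1 are neighbors (edges 1 and 4), leaves 2 and 3 are neighbors (edges 1 and 4), and the internal edge has weight 1. The topology is an assumption chosen to echo the slide's figure with integer weights, and `nj_criterion` is an illustrative name.

```python
def nj_criterion(D):
    """Return {(i, j): S_D(i, j)} with S_D(i, j) = (n - 2) D(i, j) - u_i - u_j."""
    n = len(D)
    u = [sum(row) for row in D]  # u_i = sum over k of D(i, k)
    return {(i, j): (n - 2) * D[i][j] - u[i] - u[j]
            for i in range(n) for j in range(i + 1, n)}


# Tree metric of the assumed quartet tree described above.
D = [[0, 5, 3, 6], [5, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]]

S = nj_criterion(D)
closest = min(S, key=lambda p: D[p[0]][p[1]])  # pair with the smallest distance
best = min(S, key=S.get)                       # pair with the smallest S_D
print("closest pair:", closest, "NJ pair:", best)
```

Leaves 0 and 2 are the closest pair (distance 3) but are not neighbors. The criterion gives -24 for the true neighbor pairs (0, 1) and (2, 3) and -22 for every other pair.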

Algorithm: Neighbor joining

Initialization:
  Form n clusters, one for each leaf node.
  Define T to be the set of leaf nodes, one per sequence.
Iteration:
  Pick i, j such that $S_D(i, j) = (n - 2) D(i, j) - u_i - u_j$ is minimal.
  Merge i and j into a new node (ij) in T.
  Assign length $\frac{1}{2}\left(D(i, j) + \frac{1}{n - 2}(u_i - u_j)\right)$ to edge (i, (ij)).
  Assign length $\frac{1}{2}\left(D(i, j) + \frac{1}{n - 2}(u_j - u_i)\right)$ to edge (j, (ij)).
  Remove the rows and columns of D corresponding to i and j.
  Add a row and column to D for the new vertex (ij):
    $D((ij), m) = \frac{1}{2}\left[D(i, m) + D(j, m) - D(i, j)\right]$
Termination:
  When only one cluster remains.

Neighbor Joining Tree (from Felsenstein, Inferring Phylogenies)
[Figure not reproduced in this transcript.]
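The iteration can be sketched as follows. This is a minimal sketch, not the lecture's implementation: it stops when two nodes remain and joins them with a single edge (a common convention that avoids dividing by n - 2 = 0, and an assumption here), breaks ties by taking the first minimal pair, and returns the tree as a list of edges. The input is the tree metric of an assumed quartet tree with leaf edges A = 1, B = 4, C = 1, D = 4 and internal edge 1.

```python
def neighbor_joining(D, labels):
    """Return the NJ tree as a list of (node, node, length) edges."""
    # Dict-of-dicts so that nodes can be added and removed by name.
    dist = {a: {b: float(D[i][j]) for j, b in enumerate(labels)}
            for i, a in enumerate(labels)}
    edges = []
    while len(dist) > 2:
        n = len(dist)
        u = {a: sum(dist[a].values()) for a in dist}
        nodes = list(dist)
        # Pair minimizing S_D(i, j) = (n - 2) D(i, j) - u_i - u_j.
        i, j = min(((a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]),
                   key=lambda p: (n - 2) * dist[p[0]][p[1]] - u[p[0]] - u[p[1]])
        new = f"({i},{j})"
        dij = dist[i][j]
        edges.append((i, new, 0.5 * (dij + (u[i] - u[j]) / (n - 2))))
        edges.append((j, new, 0.5 * (dij + (u[j] - u[i]) / (n - 2))))
        # D((ij), m) = 1/2 [D(i, m) + D(j, m) - D(i, j)] for every other node m.
        row = {m: 0.5 * (dist[i][m] + dist[j][m] - dij)
               for m in dist if m not in (i, j)}
        for m in (i, j):
            del dist[m]
        for m in dist:
            del dist[m][i], dist[m][j]
            dist[m][new] = row[m]
        row[new] = 0.0
        dist[new] = row
    a, b = dist  # the two remaining nodes are joined directly
    edges.append((a, b, dist[a][b]))
    return edges


labels = ["A", "B", "C", "D"]
D = [[0, 5, 3, 6], [5, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]]
edges = neighbor_joining(D, labels)
for e in edges:
    print(e)
```

Because the input is additive, the recovered edge lengths match the tree the matrix was built from.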

Neighbor Joining vs. UPGMA Tree (from Felsenstein, Inferring Phylogenies)
[Figure: the neighbor joining and UPGMA trees for the same data; not reproduced in this transcript.]

NJ Selection Criterion: Proof of the Claim

Claim: Given an additive matrix D, $S_D(x, y) = \min_{i,j} S_D(i, j)$ if and only if x and y are neighbors in the tree T with $d_T = D$, where $S_D(i, j) = (n - 2) D(i, j) - u_i - u_j$ and $u_i = \sum_k D(i, k)$.

Proof: [presented on the slide; not captured in this transcript.]

Why Neighbor Joining?

If D is additive, then neighbor joining produces the unique* phylogenetic tree T such that $d_T = D$. (Consistency)
*up to isomorphism

If D is non-additive, then neighbor joining performs well.

Why Neighbor Joining?

For distance matrices D and D', define the error ($l_\infty$ norm) by

  $\|D - D'\|_\infty = \max_{i,j} |D_{ij} - D'_{ij}|$.

Input: A non-additive matrix D.
Output: Tree S that is closest to D, in the sense that $\|D_S - D\|_\infty$ is minimized.

Why Neighbor Joining?

For distance matrices D and D', define the error ($l_\infty$ norm) by $\|D - D'\|_\infty = \max_{i,j} |D_{ij} - D'_{ij}|$.

Suppose there is a true tree T with additive matrix $D = d_T$. We measure a perturbed matrix D' and run NJ on D', obtaining a tree T'.

How different can D and D' be and still obtain T' = T?

Theorem (Atteson 1999): If $\|D - D'\|_\infty < \frac{1}{2}$ (shortest edge in T), then NJ applied to D' reconstructs T.

Compact Additive Trees

Compact Additive Tree Problem: Given an n x n distance matrix D, determine whether there is an additive tree for D with exactly n vertices.

Note: This is not a usual phylogenetic problem, since we are given data about ancestors.
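The hypothesis of the theorem is a one-line computation. The sketch below only evaluates the bound; it does not run NJ. The function names, the example tree (leaf edges 1, 4, 1, 4 and internal edge 1, so the shortest edge is 1) and the perturbations are all illustrative assumptions.

```python
def linf_error(D, D2):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(D, D2) for a, b in zip(ra, rb))


def within_atteson_bound(D, D2, shortest_edge):
    """True if the perturbation is below half the shortest edge of T."""
    return linf_error(D, D2) < 0.5 * shortest_edge


# Additive matrix of the assumed tree; its shortest edge has length 1.
D = [[0, 5, 3, 6], [5, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]]

# Two hypothetical noisy measurements of D.
small = [[0, 5.25, 3, 6], [5.25, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]]
large = [[0, 5.75, 3, 6], [5.75, 0, 6, 9], [3, 6, 0, 5], [6, 9, 5, 0]]

print(linf_error(D, small), within_atteson_bound(D, small, 1))
print(linf_error(D, large), within_atteson_bound(D, large, 1))
```

The theorem is a sufficient condition: NJ may still recover T when the bound is exceeded, but recovery is no longer guaranteed.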

Compact Additive Trees (continued)

Let G(D) be the complete graph with edge weights $w((i, j)) = D_{ij}$.

Theorem: If there is a compact additive tree T for D, then T must be the unique minimum spanning tree of G(D).

Recall: A spanning tree is a tree containing all vertices. A minimum spanning tree has the least total weight.

Algorithm Summary

  Category        Method                     Input              Output
  Distance-based  Neighbor Joining           Distance matrix D  T (additive), B
  Distance-based  UPGMA                      Distance matrix D  T (ultrametric), B
  Parsimony       Sankoff's & Fitch's Alg.   Characters, T      A, B
  Parsimony       Perfect Phylogeny          Characters         A, B, T
  Probabilistic   Felsenstein                Characters, T, B   A

  T = tree topology, B = branch lengths, A = ancestral states
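The theorem suggests a direct test for the compact additive tree problem: build a minimum spanning tree of G(D), then check that path lengths in that tree reproduce D. The sketch below is one way to do this under assumptions: it uses a simple Prim-style construction, assumes D is symmetric with positive off-diagonal entries, and returns the tree edges or `None`. The function name and the 3-vertex example are illustrative.

```python
def compact_additive_tree(D, tol=1e-9):
    """Return the edges of a compact additive tree for D, or None."""
    n = len(D)
    # Prim's algorithm on the complete graph G(D), started at vertex 0.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((a, b) for a in in_tree for b in range(n)
                    if b not in in_tree),
                   key=lambda e: D[e[0]][e[1]])
        in_tree.add(j)
        edges.append((i, j))
    # Adjacency lists of the spanning tree.
    adj = {v: [] for v in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    # Path lengths in the tree must reproduce every entry of D.
    for s in range(n):
        dist, stack = {s: 0}, [s]
        while stack:
            x = stack.pop()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + D[x][y]
                    stack.append(y)
        if any(abs(dist[t] - D[s][t]) > tol for t in range(n)):
            return None
    return edges


# Path 0 - 1 - 2 with edge weights 2 and 3: vertex 1 is an observed ancestor.
D_path = [[0, 2, 5], [2, 0, 3], [5, 3, 0]]
# Distance matrix from the alignment slide: no compact tree exists.
D_primates = [[0, 7, 11, 10], [7, 0, 4, 6], [11, 4, 0, 2], [10, 6, 2, 0]]

print(compact_additive_tree(D_path))
print(compact_additive_tree(D_primates))
```

For the primate matrix the spanning tree is Mouse-Gorilla-Chimpanzee-Human with path length 7 + 4 + 2 = 13 from Mouse to Human, which does not match the measured distance 10.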