Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions


PLGW05
Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions
Joint work with Ilan Gronau (2), Shlomo Moran (3), and Irad Yavneh (3)
(2) Dept. of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA
(3) Computer Science Dept., Technion, Haifa, Israel
June 24, 2011

Motivation
[Figure: a tree over the taxa A, B, C, D with an unknown ancestral sequence X and edge lengths labeled $t_1, \dots, t_6$ and $t_{int}$. The leaf sequences are compared pairwise by a substitution rate (SR) function, i.e. a distance measure $\Delta$.]
- Sequences: A: ATCA..., B: ATGA..., C: ATTA..., D: TTTA...
- Pairwise distances: $D_{AB} = \Delta(\text{ATCA...}, \text{ATGA...})$, $D_{AC} = \Delta(\text{ATCA...}, \text{ATTA...})$, $D_{AD} = \Delta(\text{ATCA...}, \text{TTTA...})$, $D_{BC} = \Delta(\text{ATGA...}, \text{ATTA...})$, ...
- What distance measure should I choose?

Motivation: An ongoing quest...
Previous works:
- I. Gronau, S. Moran, and I. Yavneh: Towards optimal distance functions for stochastic substitution models. J Theor Biol, 260(2):294-307, 2009.
- I. Gronau, S. Moran, and I. Yavneh: Adaptive distance measures for resolving K2P quartets: Metric separation versus stochastic noise. J Comp Biol, 17(11):1391-1400, 2010.

SR functions: Distance measures
Kimura two-parameter (K2P) model
[Diagram: substitution rates between the nucleotides A, G, C, T. Transitions (A<->G, C<->T) occur at rate $\alpha$, transversions at rate $\beta$.]
- $\alpha$ = transitions, $\beta$ = transversions
- transition-to-transversion ratio $R = \frac{\alpha}{2\beta}$
- biological evidence that $\alpha > \beta$
- normalization: $\alpha + 2\beta = 1$
Distance measures
- additivity: $d_{uw} = d_{uv} + d_{vw}$ for a node $v$ on the path from $u$ to $w$
- additive distance measures induce tree metrics in homogeneous models
- distance measures are given by $\Delta(t)$
- additive distance measure in K2P (the standard SR function): $\Delta_{K2P}(t) = \alpha t + 2\beta t = t$
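The slide gives the SR functions only as functions of evolutionary time. As a minimal sketch of how they are applied to data, the following uses the textbook JC and K2P distance estimators on a pair of aligned sequences. The function names are hypothetical, and the code assumes ungapped sequences over A, C, G, T that are not saturated (otherwise the logarithm is undefined).

```python
import math

PURINES = {"A", "G"}  # the pyrimidines are C and T


def site_proportions(seq1, seq2):
    """Return (P, Q): observed proportions of transition and transversion sites."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    ts = tv = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if (x in PURINES) == (y in PURINES):
            ts += 1
        else:
            tv += 1
    n = len(seq1)
    return ts / n, tv / n


def jc_distance(seq1, seq2):
    """Jukes-Cantor estimate: uses only the total proportion of differing sites."""
    p, q = site_proportions(seq1, seq2)
    return -0.75 * math.log(1.0 - 4.0 * (p + q) / 3.0)


def k2p_distance(seq1, seq2):
    """Kimura two-parameter estimate: treats transitions and transversions separately."""
    p, q = site_proportions(seq1, seq2)
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))
```

On a pair that differs by a single transition in ten sites, the JC estimate is slightly smaller than the K2P estimate, which is the direction of the underestimation discussed later in the talk.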

Evaluation of SR functions: Experiment on Hasegawa's tree
[Figure: Hasegawa's tree on seven taxa (Mouse, Bovine, Gibbon, Orang, Gorilla, Chimp, Human) with branch lengths.]
[Plot: "Hasegawa's tree, sequence length = 500bp, R = 2". x-axis: tree diameter (0.0 to 2.0); y-axis: average normalized RF distance (0.00 to 0.30); curves: $\Delta_{K2P}$ and $\Delta_{JC}$.]
1. Simulate evolution according to the K2P model with ti-tv ratio R = 2.
2. For each tree size, generate 10,000 batches of 7-way alignments of 500bp length.
3. Measure the Robinson-Foulds (RF) tree distance to the true tree.
Observation: $\Delta_{JC}$ is statistically consistent w.r.t. Hasegawa's tree!
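Step 3 of the experiment can be illustrated with a small sketch of the normalized Robinson-Foulds distance. The tree representation (a set of non-trivial splits, each given by one of its sides) and the normalization by 2(n - 3), the maximum for unrooted binary trees on n taxa, are assumptions; the slides do not state which normalization was used.

```python
def normalize_split(side, taxa):
    """Represent a split by the side that does not contain a fixed reference taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return side if reference not in side else frozenset(taxa) - side


def normalized_rf(splits_a, splits_b, taxa):
    """Normalized RF distance between two unrooted binary trees on the same taxa.

    Each tree is given as an iterable of its non-trivial splits; either side
    of a split may be supplied.
    """
    taxa = frozenset(taxa)
    a = {normalize_split(s, taxa) for s in splits_a}
    b = {normalize_split(s, taxa) for s in splits_b}
    max_rf = 2 * (len(taxa) - 3)  # each binary tree has n - 3 non-trivial splits
    return len(a ^ b) / max_rf


# Hypothetical five-taxon example (not Hasegawa's tree).
taxa = "ABCDE"
true_tree = [{"A", "B"}, {"D", "E"}]
estimate = [{"A", "B"}, {"C", "D"}]
rf = normalized_rf(true_tree, estimate, taxa)
```

Averaging this value over all simulated batches for a given tree diameter would give one point of the plotted curves.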

Evaluation of SR functions: Non-additive SR functions
[Plot: "ti-tv ratio: 10, sequence length: 1000bp". x-axis: evolutionary time t (0 to 2); y-axis: estimated distance $\Delta(t)$ (0 to 2); curves: $\Delta_{JC}$ and $\Delta_{K2P}$, with their standard deviations $\sigma(\Delta_{JC})$ and $\sigma(\Delta_{K2P})$.]
- The Jukes-Cantor (JC) model is a homogeneous submodel of K2P for $R = \frac{1}{2}$.
- $\Delta_{JC}$:
  - deviates from additivity in K2P and in homogeneous submodels for R > 0.5
  - induces a near-additive metric w.r.t. Hasegawa's tree
  - has lower stochastic variance than $\Delta_{K2P}$
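The claim that the JC function deviates from additivity under K2P can be made concrete with a short derivation. This is a sketch that assumes the standard K2P probabilities for two sequences separated by evolutionary time t, in the limit of infinite sequence length; it is not taken from the slides.

```latex
% Assumption: standard K2P site-pattern probabilities with transition rate
% \alpha, transversion rate \beta, and the normalization \alpha + 2\beta = 1.
\begin{align*}
P(t) &= \tfrac14 + \tfrac14 e^{-4\beta t} - \tfrac12 e^{-2(\alpha+\beta)t}
  && \text{(sites differing by a transition)} \\
Q(t) &= \tfrac12 - \tfrac12 e^{-4\beta t}
  && \text{(sites differing by a transversion)} \\
\Delta_{JC}(t) &= -\tfrac34 \ln\!\Bigl(1 - \tfrac43\bigl(P(t)+Q(t)\bigr)\Bigr)
  = -\tfrac34 \ln\!\Bigl(\tfrac13 e^{-4\beta t} + \tfrac23 e^{-2(\alpha+\beta)t}\Bigr) \\
 &\le -\tfrac34\Bigl(\tfrac13(-4\beta t) + \tfrac23\bigl(-2(\alpha+\beta)t\bigr)\Bigr)
  = (\alpha + 2\beta)\,t = t = \Delta_{K2P}(t).
\end{align*}
```

The inequality is Jensen's inequality for the concave logarithm. Equality holds exactly when the two exponents coincide, i.e. when $\alpha = \beta$, which corresponds to $R = \frac{1}{2}$. For any other ratio the JC function underestimates t, and increasingly so as t grows.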

Evaluation of SR functions: Non-additive SR functions
[Plot: "ti-tv ratio: 10, sequence length: 1000bp". x-axis: evolutionary time t, with the interval $[t_0, t_1]$ marked; y-axis: estimated distance $\Delta(t)$; curves: $\Delta_{JC}$ and the affine functions $\Delta_{int} = A\,\Delta_{K2P} + b$ and $\Delta_{aff} = A\,\Delta_{K2P} + b$, with $\sigma(\Delta_{JC})$ and $\sigma(\Delta_{aff})$; the gap between $\Delta_{JC}$ and $\Delta_{aff}$ is labeled $dev(\Delta_{JC}, [t_0, t_1])$, "fixed error".]
[Figure: Hasegawa's tree (Mouse, Bovine, Gibbon, Orang, Gorilla, Chimp, Human) with $t_0$ and $t_1$ marked.]
- affine-additive transformation: $\Delta_{aff} = A\,\Delta + b$ remains additive
- allows analysis of non-additive SR functions
- deviation from additivity in $[t_0, t_1]$: $\frac{1}{a} \max\{\, |\Delta(t) - at - b| : t \in [t_0, t_1] \,\}$
- check consistency using the near-additivity theorem (Atteson, 1999)
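The deviation from additivity can be evaluated numerically. The sketch below computes $\frac{1}{a}\max|\Delta(t) - at - b|$ on a grid for the expected JC distance under K2P. The function names, the grid evaluation, and the choice of the affine function as the chord through the endpoints of the interval are illustrative assumptions; the slides do not say how a and b are selected.

```python
import math


def k2p_rates(R):
    """Rates (alpha, beta) with alpha + 2*beta = 1 and R = alpha / (2*beta)."""
    beta = 1.0 / (2.0 * (R + 1.0))
    return 2.0 * R * beta, beta


def expected_jc(t, R):
    """Expected JC distance when sequences evolve under K2P for time t."""
    alpha, beta = k2p_rates(R)
    same = (0.25 + 0.25 * math.exp(-4.0 * beta * t)
            + 0.5 * math.exp(-2.0 * (alpha + beta) * t))
    p = 1.0 - same  # probability that a site differs
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def deviation(delta, a, b, t0, t1, steps=1000):
    """(1/a) * max |delta(t) - a*t - b| over an evenly spaced grid on [t0, t1]."""
    grid = [t0 + (t1 - t0) * i / steps for i in range(steps + 1)]
    return max(abs(delta(t) - a * t - b) for t in grid) / a


def chord_fit(delta, t0, t1):
    """Affine function through (t0, delta(t0)) and (t1, delta(t1)); an illustrative choice."""
    a = (delta(t1) - delta(t0)) / (t1 - t0)
    return a, delta(t0) - a * t0


jc_r10 = lambda t: expected_jc(t, 10.0)
a, b = chord_fit(jc_r10, 0.1, 2.0)
dev_r10 = deviation(jc_r10, a, b, 0.1, 2.0)
```

A narrower interval gives a smaller deviation, which matches the intuition that a tree whose pairwise distances fall in a short range is nearly additive under the JC function.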

Experiments on Quartets: Two extreme archetypes of quartets with long and short edges
[Figure: Felsenstein quartet and Farris quartet on the taxa A, B, C, D, each with long edges $t_l$ and short edges $t_s$, where $t_l = 5\,t_s$.]
[Plots: "Felsenstein quartet, sequence length = 500bp, R = 5" and "Farris quartet, sequence length = 500bp, R = 5". x-axis: $t_1$ (0.0 to 2.0); y-axis: failure rate (0.00 to 0.20); curves: $\Delta_{JC}$ and $\Delta_{K2P}$.]
Felsenstein quartet:
- underestimation of $t_0 + t_1$ decreases separation of the split AB|CD
- biased towards AC|BD
- despite this impeding bias, $\Delta_{JC}$ performs better than $\Delta_{K2P}$ for a moderate $t_l / t_s$ ratio, e.g. $t_l = 3.5\,t_s$
Farris quartet:
- underestimation of $t_0 + t_1$ increases separation of the split AB|CD
- bias towards the correct split
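A quartet can be resolved from estimated distances by comparing the three pairwise sums, as in the standard four-point method. The slides do not name the quartet method used, so this choice, the function name, and the example distances are assumptions made for illustration.

```python
def resolve_quartet(d):
    """Pick the split whose sum of within-pair distances is smallest.

    d maps two-letter keys such as "AB" to the estimated distance between
    the taxa A and B; all six pairs over A, B, C, D must be present.
    """
    sums = {
        "AB|CD": d["AB"] + d["CD"],
        "AC|BD": d["AC"] + d["BD"],
        "AD|BC": d["AD"] + d["BC"],
    }
    return min(sums, key=sums.get)


# Hypothetical additive distances: leaf edges of length 1, internal edge 0.5,
# true split AB|CD.
additive = {"AB": 2.0, "CD": 2.0, "AC": 2.5, "BD": 2.5, "AD": 2.5, "BC": 2.5}
split = resolve_quartet(additive)
```

Reconstruction fails when stochastic noise or a systematic bias pushes one of the competing sums below the sum of the true split, which is the event counted by the failure rate in the plots.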

Fisher's Linear Discriminant
Fisher's Linear Discriminant (FLD) measures the separation between independent normally distributed random variables $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$:
$FLD(X, Y) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} = \frac{SEP(\Delta)}{NOISE(\Delta)}$
[Figure: quartet on the taxa A, B, C, D with the split AB|CD and internal edge weight $w_{int}$.]
- $\mu_1 = D_{AC} + D_{BD}$
- $\mu_2 = D_{AB} + D_{CD}$
- $\sigma_1^2 = \sigma^2(D_{AC}) + \sigma^2(D_{BD})$
- $\sigma_2^2 = \sigma^2(D_{AB}) + \sigma^2(D_{CD})$
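The quantities on this slide combine into a single number per quartet. The following sketch assumes the square-root form of the FLD, i.e. the difference of the means divided by the square root of the summed variances, and takes the expected distances and their variances as given inputs. The function names and the numbers are hypothetical.

```python
import math


def fld(mu1, mu2, var1, var2):
    """Separation of two independent normal variables: (mu1 - mu2) / sqrt(var1 + var2)."""
    return (mu1 - mu2) / math.sqrt(var1 + var2)


def quartet_fld(dist, var):
    """FLD for the competing sums D_AC + D_BD and D_AB + D_CD.

    dist and var map pair keys such as "AC" to the expected distance and to
    the variance of its estimate, respectively.
    """
    mu1 = dist["AC"] + dist["BD"]
    mu2 = dist["AB"] + dist["CD"]
    var1 = var["AC"] + var["BD"]
    var2 = var["AB"] + var["CD"]
    return fld(mu1, mu2, var1, var2)


# Hypothetical inputs: separation 1.0, every estimate with variance 0.01.
dist = {"AB": 2.0, "CD": 2.0, "AC": 2.5, "BD": 2.5}
var = {"AB": 0.01, "CD": 0.01, "AC": 0.01, "BD": 0.01}
value = quartet_fld(dist, var)
```

A larger value means that the true split AB|CD is better separated from the competing split AC|BD relative to the noise of the estimates.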

Fisher's Linear Discriminant: FLD on Felsenstein's quartet
[Figure: Felsenstein quartet with internal edge $t_i = 0.2$, long edges $t_l = 1$, and short edges $t_s \in [0.2, 1]$; plots compare simulated and predicted results.]
Simulation: 100,000 trees per data point, sequence length of 1000 bp. Prediction based on FLD.
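One way an FLD value can be turned into a predicted failure rate is through the normal tail probability: if X and Y are independent normal variables, then X - Y is normal, and the probability that X falls below Y is $\Phi(-FLD)$. Whether the authors predict failures exactly this way is an assumption. The sketch also considers only the two-way comparison between AB|CD and AC|BD and ignores the third split.

```python
import math


def predicted_failure(fld_value):
    """P(X < Y) for independent normals X, Y with the given FLD, i.e. Phi(-FLD)."""
    # Phi(-z) = 0.5 * erfc(z / sqrt(2)) for the standard normal distribution.
    return 0.5 * math.erfc(fld_value / math.sqrt(2.0))
```

For example, an FLD of 0 yields a predicted failure rate of one half, and the prediction decreases monotonically as the FLD grows.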

Modelling Separation and Noise with FLD
Fisher's Linear Discriminant: Separation vs. Noise
Fisher's Linear Discriminant measures the separation between independent normally distributed random variables $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$:
$FLD(X, Y) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} = \frac{SEP(\Delta)}{NOISE(\Delta)}$
- $SEP(\Delta) = \mu_1 - \mu_2$, $NOISE(\Delta) = \sqrt{\sigma_1^2 + \sigma_2^2}$
- $\frac{FLD(\Delta_1)}{FLD(\Delta_2)} = \frac{SEP(\Delta_1)}{SEP(\Delta_2)} \Big/ \frac{NOISE(\Delta_1)}{NOISE(\Delta_2)}$ is independent of the sequence length
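The independence from the sequence length can be motivated by a first-order argument. The sketch below assumes that, for sequences of length k, the expected value of each distance estimate does not depend on k and that its variance is inversely proportional to k. Both hold only approximately, for large k. The constant $\nu(\Delta)$ is a symbol introduced here for illustration.

```latex
% Assumption: \sigma^2(D) \approx v_D / k for every pairwise estimate D,
% so that NOISE scales as 1/\sqrt{k} with a constant \nu(\Delta) free of k.
\begin{align*}
NOISE_k(\Delta) &\approx \frac{\nu(\Delta)}{\sqrt{k}}, \qquad
FLD_k(\Delta) \approx \sqrt{k}\,\frac{SEP(\Delta)}{\nu(\Delta)}, \\
\frac{FLD_k(\Delta_1)}{FLD_k(\Delta_2)}
  &\approx \frac{SEP(\Delta_1)}{SEP(\Delta_2)} \cdot \frac{\nu(\Delta_2)}{\nu(\Delta_1)} .
\end{align*}
```

The factor $\sqrt{k}$ cancels in the ratio, so two SR functions can be compared without fixing a sequence length.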


Modelling Separation and Noise with FLD: Separation between noise and deviation from additivity
[Figure: two panels, R = 5 and R = 2.]

Experiment on biological data
[Plot: x-axis: RF between BIONJ GTR tree and LTP tree (0 to 14); left y-axis: difference from RF of BIONJ GTR tree for JC, K2P, and LogDet; right y-axis: number of trees (0 to 8000).]
- 163 bacterial species
- 31 marker genes (Ciccarelli et al., 2006)
- sample 40,000 random 10-species sub-alignments
- extract four-fold degenerate sites
- reference tree from the Living Tree Project (ARB-Silva)

Summary
- Surprising observation: non-additive SR functions can improve reconstruction accuracy
- Example: JC SR function in K2P trees
- Introduced concepts:
  - deviation from additivity
  - affine-additive SR function
  - SEP and NOISE
- More information in our (soon published) WABI paper

The end
Thank you!

Selected references
- K. Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25:251-278, 1999.
- D. Doerr, I. Gronau, S. Moran, and I. Yavneh. Stochastic errors vs. modeling errors in distance based phylogenetic reconstructions, in preparation, 2011. http://www.cs.technion.ac.il/~moran/r/wabi-in-prep.pdf
- J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2nd edition, September 2003.
- I. Gronau, S. Moran, and I. Yavneh. Towards optimal distance functions for stochastic substitution models. J Theor Biol, 260(2):294-307, 2009.
- I. Gronau, S. Moran, and I. Yavneh. Adaptive distance measures for resolving K2P quartets: Metric separation versus stochastic noise. J Comp Biol, 17(11):1391-1400, 2010.
- M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111-120, June 1980.