PLGW05
Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions
Joint work with Ilan Gronau (2), Shlomo Moran (3), and Irad Yavneh (3)
(2) Dept. of Biological Statistics and Computational Biology, Cornell University, Ithaca, USA
(3) Computer Science Dept., Technion, Haifa, Israel
June 24, 2011
Motivation

[Figure: an unknown tree (topology and edge lengths $t_1, \dots, t_6$ marked "?") relating four sequences A: ATCA..., B: ATGA..., C: ATTA..., D: TTTA...; a substitution rate (SR) function $\Delta$ (distance measure) maps each pair of sequences to an estimated distance. Candidate SR functions: $\Delta_1(P)$, $\Delta_2(P)$, $\Delta_3(P)$, $\Delta_4(P)$.]

$D_{AB} = \Delta(\text{ATCA...}, \text{ATGA...})$
$D_{AC} = \Delta(\text{ATCA...}, \text{ATTA...})$
$D_{AD} = \Delta(\text{ATCA...}, \text{TTTA...})$
$D_{BC} = \Delta(\text{ATGA...}, \text{ATTA...})$
...

What distance measure should I choose?
Motivation: An ongoing quest...

Previous works:
- I. Gronau, S. Moran, and I. Yavneh: Towards optimal distance functions for stochastic substitution models. J Theor Biol, 260(2):294-307, 2009.
- I. Gronau, S. Moran, and I. Yavneh: Adaptive distance measures for resolving K2P quartets: Metric separation versus stochastic noise. J Comp Biol, 17(11):1391-1400, 2010.
SR functions: Distance measures

Kimura two-parameter (K2P) model
- $\alpha$ = rate of transitions (A <-> G, C <-> T), $\beta$ = rate of transversions
- transition-to-transversion ratio: $R = \frac{\alpha}{2\beta}$
- biological evidence that $\alpha > \beta$
- normalization: $\alpha + 2\beta = 1$

Distance measures
- additivity along a path u, v, w: $d_{uw} = d_{uv} + d_{vw}$
- additive distance measures induce tree metrics in homogeneous models
- distance measures are given by SR functions $\Delta(t)$
- additive distance measure in K2P, the standard SR function: $\Delta_{K2P}(t) = \alpha t + 2\beta t = t$
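As a rough illustration of how such distance measures are evaluated on data, the sketch below computes the textbook JC and K2P distance estimates for two aligned sequences. The closed-form estimators are the standard ones (Jukes-Cantor; Kimura, 1980) and are not spelled out on the slides; the function names are hypothetical, and the code assumes sequences over A, C, G, T only.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (P, Q): observed proportions of transition and transversion sites."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        # A transition stays within purines or within pyrimidines.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n = len(seq1)
    return transitions / n, transversions / n


def jc_distance(seq1, seq2):
    """Jukes-Cantor estimate: depends only on the total mismatch proportion."""
    P, Q = count_differences(seq1, seq2)
    return -0.75 * math.log(1.0 - 4.0 * (P + Q) / 3.0)


def k2p_distance(seq1, seq2):
    """Kimura two-parameter estimate: treats transitions and transversions separately."""
    P, Q = count_differences(seq1, seq2)
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


if __name__ == "__main__":
    print(jc_distance("ATCA", "ATGA"), k2p_distance("ATCA", "ATGA"))
```

Both estimators raise a math domain error once the sequences are saturated (the argument of a logarithm becomes non-positive); real implementations handle that case explicitly.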
Evaluation of SR functions: Experiment on Hasegawa's tree

1. Simulate evolution according to the K2P model with ti-tv ratio R = 2.
2. For each tree size, generate 10,000 batches of 7-way alignments of 500 bp length.
3. Measure the Robinson-Foulds (RF) tree distance to the true tree.

[Figure: average normalized RF distance (0.00 to 0.30) against tree diameter (0.0 to 2.0) on Hasegawa's tree (Mouse, Bovine, Gibbon, Orang, Gorilla, Chimp, Human), sequence length = 500 bp, R = 2; curves for K2P and JC.]

JC is statistically consistent w.r.t. Hasegawa's tree!
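The accuracy measure in step 3 can be sketched as follows. This is a minimal illustration, assuming trees are given as nested tuples of leaf names and normalizing by 2(n - 3), the maximum RF distance between two binary unrooted trees on n leaves; the representation and the function names are assumptions, not taken from the slides.

```python
def leaf_set(tree):
    """All leaf names below a node (a leaf is a string, an inner node a tuple)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the tree, each stored as the side without the anchor leaf."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Keep only splits with at least two leaves on each side.
        if 1 < len(below) < len(all_leaves) - 1:
            found.add(all_leaves - below if anchor in below else below)
        return below

    visit(tree)
    return found


def rf_distance(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("trees must have the same leaf set")
    return len(splits(tree1) ^ splits(tree2))


def normalized_rf(tree1, tree2):
    """RF distance divided by its maximum 2(n - 3) for binary trees on n >= 4 leaves."""
    n = len(leaf_set(tree1))
    if n < 4:
        raise ValueError("need at least four leaves")
    return rf_distance(tree1, tree2) / (2 * (n - 3))


if __name__ == "__main__":
    print(normalized_rf((("A", "B"), "C", ("D", "E")), (("A", "C"), "B", ("D", "E"))))
```

Storing each split as the side that does not contain a fixed anchor leaf makes the comparison independent of where the tuples happen to be rooted.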
Evaluation of SR functions: Non-additive SR functions

- The Jukes-Cantor (JC) model is a homogeneous submodel of K2P for $R = \frac{1}{2}$.
- The JC SR function $\Delta_{JC}$:
  - deviates from additivity in K2P and in homogeneous submodels for R > 0.5
  - induces a near-additive metric w.r.t. Hasegawa's tree
  - has lower stochastic variance than K2P

[Figure: estimated distance $\Delta(t)$ (0 to 2) against evolutionary time t (0 to 2), ti-tv ratio 10, sequence length 1000 bp; curves for $\Delta_{JC}$ and $\Delta_{K2P}$ with bands $\sigma(\Delta_{JC})$ and $\sigma(\Delta_{K2P})$.]
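The non-additivity of JC on K2P data can be checked numerically. The sketch below assumes the standard K2P pairwise site probabilities (Kimura, 1980), rewritten for two sequences separated by total time t under the normalization $\alpha + 2\beta = 1$, and evaluates both SR functions in the limit of infinitely long sequences. Function names are hypothetical.

```python
import math


def k2p_rates(R):
    """Rates (alpha, beta) with alpha + 2*beta = 1 and R = alpha / (2*beta)."""
    beta = 1.0 / (2.0 * (R + 1.0))
    alpha = 1.0 - 2.0 * beta
    return alpha, beta


def k2p_site_probabilities(t, R):
    """Expected proportions (P, Q) of transition and transversion sites at distance t."""
    alpha, beta = k2p_rates(R)
    P = 0.25 * (1.0 - 2.0 * math.exp(-2.0 * (alpha + beta) * t) + math.exp(-4.0 * beta * t))
    Q = 0.5 * (1.0 - math.exp(-4.0 * beta * t))
    return P, Q


def expected_jc(t, R):
    """JC SR function applied to noise-free K2P data."""
    P, Q = k2p_site_probabilities(t, R)
    return -0.75 * math.log(1.0 - 4.0 * (P + Q) / 3.0)


def expected_k2p(t, R):
    """K2P SR function applied to noise-free K2P data; recovers t."""
    P, Q = k2p_site_probabilities(t, R)
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


if __name__ == "__main__":
    for t in (0.5, 1.0, 1.5, 2.0):
        print(t, round(expected_k2p(t, 10.0), 4), round(expected_jc(t, 10.0), 4))
```

With R = 1/2 the two functions coincide, while for R = 10 the JC values fall increasingly below t, which is the curvature visible in the plot.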
Evaluation of SR functions: Non-additive SR functions

- An affine transformation of an additive SR function, $\Delta_{aff} = A \Delta + b$, remains additive and allows the analysis of a non-additive SR function.
- Deviation from additivity in $[t_0, t_1]$: $\frac{1}{a} \max\{ |\Delta(t) - a t - b| : t \in [t_0, t_1] \}$
- Check consistency using the near-additivity theorem (Atteson, 1999).

[Figure: estimated distance $\Delta(t)$ against evolutionary time t, ti-tv ratio 10, sequence length 1000 bp; curves for $\Delta_{JC}$ and $\Delta_{aff} = A \Delta_{K2P} + b$ with bands $\sigma(\Delta_{JC})$ and $\sigma(\Delta_{aff})$; the interval $[t_0, t_1]$ is marked on the time axis and on Hasegawa's tree (Mouse, Bovine, Gibbon, Orang, Gorilla, Chimp, Human); $dev(\Delta_{JC}, [t_0, t_1])$ is labelled as the fixed error.]
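The deviation from additivity can be approximated on a grid, as in the sketch below. The slides do not say how a and b are chosen; here the line through the endpoints of $[t_0, t_1]$ is used, which is purely an assumption for illustration. The expected JC value under K2P uses the standard K2P site probabilities, and all function names are hypothetical.

```python
import math


def expected_jc_under_k2p(t, R):
    """JC SR function on noise-free K2P data (alpha + 2*beta = 1, R = alpha / (2*beta))."""
    beta = 1.0 / (2.0 * (R + 1.0))
    alpha = 1.0 - 2.0 * beta
    P = 0.25 * (1.0 - 2.0 * math.exp(-2.0 * (alpha + beta) * t) + math.exp(-4.0 * beta * t))
    Q = 0.5 * (1.0 - math.exp(-4.0 * beta * t))
    return -0.75 * math.log(1.0 - 4.0 * (P + Q) / 3.0)


def secant_line(delta, t0, t1):
    """Slope a and intercept b of the line through (t0, delta(t0)) and (t1, delta(t1)).

    Assumption: one simple way to pick the affine reference, not the slides' choice.
    """
    a = (delta(t1) - delta(t0)) / (t1 - t0)
    b = delta(t0) - a * t0
    return a, b


def deviation_from_additivity(delta, a, b, t0, t1, steps=1000):
    """Grid approximation of (1/a) * max |delta(t) - a*t - b| over [t0, t1]."""
    grid = (t0 + (t1 - t0) * i / steps for i in range(steps + 1))
    return max(abs(delta(t) - a * t - b) for t in grid) / a


if __name__ == "__main__":
    jc = lambda t: expected_jc_under_k2p(t, 10.0)
    a, b = secant_line(jc, 0.5, 1.5)
    print(a, b, deviation_from_additivity(jc, a, b, 0.5, 1.5))
```

Dividing by a expresses the error in units of evolutionary time, so deviations of SR functions with different slopes are comparable.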
Experiments on Quartets: Two extreme archetypes of quartets with long and short edges ($t_l = 5 t_s$)

Felsenstein quartet
- Underestimation of $t_0 + t_1$ decreases the separation of the split AB|CD.
- Bias towards AC|BD.
- Despite this impeding bias, JC performs better than K2P for a moderate $t_l / t_s$ ratio, e.g. $t_l = 3.5 t_s$.

Farris quartet
- Underestimation of $t_0 + t_1$ increases the separation of the split AB|CD.
- Bias towards the correct split.

[Figure: two panels, Felsenstein quartet and Farris quartet; failure rate (0.00 to 0.20) against $t_1$ (0.0 to 2.0), sequence length = 500 bp, R = 5; curves for $\Delta_{JC}$ and $\Delta_{K2P}$.]
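The effect of underestimating long distances can be illustrated with the four-point method, which picks the split with the smallest sum of within-pair distances. The sketch below uses hypothetical edge lengths and a hypothetical amount of underestimation; it assumes the long edges lead to A and C while the true split is AB|CD.

```python
def pair(x, y):
    """Unordered key for a pair of taxa."""
    return frozenset((x, y))


def quartet_distances(a, b, c, d, internal):
    """Additive distances for the quartet AB|CD with pendant edges a, b, c, d."""
    return {
        pair("A", "B"): a + b,
        pair("C", "D"): c + d,
        pair("A", "C"): a + internal + c,
        pair("A", "D"): a + internal + d,
        pair("B", "C"): b + internal + c,
        pair("B", "D"): b + internal + d,
    }


def split_sums(D):
    """Sum of the two within-pair distances for each of the three possible splits."""
    return {
        "AB|CD": D[pair("A", "B")] + D[pair("C", "D")],
        "AC|BD": D[pair("A", "C")] + D[pair("B", "D")],
        "AD|BC": D[pair("A", "D")] + D[pair("B", "C")],
    }


def resolve_quartet(D):
    """Four-point method: choose the split with the smallest sum."""
    sums = split_sums(D)
    return min(sums, key=sums.get)


if __name__ == "__main__":
    # Long edges to A and C, short edges elsewhere (hypothetical values).
    D = quartet_distances(a=1.0, b=0.2, c=1.0, d=0.2, internal=0.2)
    print(resolve_quartet(D))          # exact distances
    D[pair("A", "C")] -= 0.5           # hypothetical underestimation of the longest distance
    print(resolve_quartet(D))          # the long-edge pair is now grouped together
```

With exact additive distances the true split wins; once the longest distance is shortened enough, the sum for AC|BD drops below that of AB|CD.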
Fisher's Linear Discriminant

Fisher's Linear Discriminant (FLD) measures the separation between independent normally-distributed random variables $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$:

$FLD(X, Y) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} = \frac{SEP(\Delta)}{NOISE(\Delta)}$

For the quartet AB|CD with internal edge $w_{int}$:

$\mu_1 = D_{AC} + D_{BD}$
$\mu_2 = D_{AB} + D_{CD}$
$\sigma_1^2 = \sigma^2(D_{AC}) + \sigma^2(D_{BD})$
$\sigma_2^2 = \sigma^2(D_{AB}) + \sigma^2(D_{CD})$
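The quantities on this slide translate directly into code. The sketch below assumes the square-root form of the denominator (consistent with $\sigma_1$, $\sigma_2$ being standard deviations) and takes expected distances and their variances as plain dictionaries; the dictionary keys and the numbers in the usage example are hypothetical.

```python
import math


def fld(mu1, mu2, var1, var2):
    """Separation of two independent normal variables: (mu1 - mu2) / sqrt(var1 + var2)."""
    return (mu1 - mu2) / math.sqrt(var1 + var2)


def quartet_fld(D, var):
    """FLD of the competing sum D_AC + D_BD against the true-split sum D_AB + D_CD.

    D maps pair names to expected distances, var maps them to variances.
    """
    mu1 = D["AC"] + D["BD"]
    mu2 = D["AB"] + D["CD"]
    var1 = var["AC"] + var["BD"]
    var2 = var["AB"] + var["CD"]
    return fld(mu1, mu2, var1, var2)


if __name__ == "__main__":
    D = {"AB": 1.2, "CD": 1.2, "AC": 2.2, "BD": 0.6}       # hypothetical distances
    var = {"AB": 0.01, "CD": 0.01, "AC": 0.01, "BD": 0.01}  # hypothetical variances
    print(quartet_fld(D, var))
```

A larger value means the true split is separated from the competing split by more standard deviations.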
Fisher's Linear Discriminant: FLD on Felsenstein's quartet

$t_i = 0.2$, $t_l = 1$, $t_s \in [0.2, 1]$

[Figure: Felsenstein quartet with edge lengths $t_i$, $t_s$, $t_l$, and plots comparing simulation with prediction; axis labels not recoverable.]

Simulation: 100,000 trees per data point, sequence length of 1000 bp. Prediction based on FLD.
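One way to turn an FLD value into a predicted failure rate: for independent normal X and Y, the probability that X falls below Y equals $\Phi(-FLD(X, Y))$, with $\Phi$ the standard normal CDF. Whether the plotted prediction uses exactly this formula is an assumption; the sketch relies only on the normal model stated on the slides.

```python
import math


def standard_normal_cdf(x):
    """Phi(x), computed via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def predicted_failure_rate(fld_value):
    """P(X < Y) for independent normals with the given FLD(X, Y)."""
    return standard_normal_cdf(-fld_value)


if __name__ == "__main__":
    for f in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f, predicted_failure_rate(f))
```

This only accounts for one competing split; a full quartet prediction would have to consider both wrong splits.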
Modelling Separation and Noise with FLD: Fisher's Linear Discriminant, Separation vs. Noise

Fisher's Linear Discriminant measures the separation between independent normally-distributed random variables $X \sim N(\mu_1, \sigma_1)$ and $Y \sim N(\mu_2, \sigma_2)$:

$FLD(X, Y) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 + \sigma_2^2}} = \frac{SEP(\Delta)}{NOISE(\Delta)}$

$SEP(\Delta) = \mu_1 - \mu_2$, $NOISE(\Delta) = \sqrt{\sigma_1^2 + \sigma_2^2}$

$\frac{FLD(\Delta_1)}{FLD(\Delta_2)} = \frac{SEP(\Delta_1)}{SEP(\Delta_2)} \Big/ \frac{NOISE(\Delta_1)}{NOISE(\Delta_2)}$ is independent of sequence length.
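The independence from sequence length follows because every variance scales as 1/k for sequence length k, so the factor cancels in any ratio of NOISE terms. The sketch below illustrates this for a single pairwise distance, assuming the standard first-order (delta-method) standard deviations of the JC and K2P estimates under K2P data; these formulas and the function names are not taken from the slides.

```python
import math


def k2p_site_probabilities(t, R):
    """Expected transition/transversion proportions (P, Q) at distance t, alpha + 2*beta = 1."""
    beta = 1.0 / (2.0 * (R + 1.0))
    alpha = 1.0 - 2.0 * beta
    P = 0.25 * (1.0 - 2.0 * math.exp(-2.0 * (alpha + beta) * t) + math.exp(-4.0 * beta * t))
    Q = 0.5 * (1.0 - math.exp(-4.0 * beta * t))
    return P, Q


def sigma_jc(t, R, k):
    """First-order standard deviation of the JC estimate on k sites."""
    P, Q = k2p_site_probabilities(t, R)
    p = P + Q
    return math.sqrt(p * (1.0 - p) / k) / (1.0 - 4.0 * p / 3.0)


def sigma_k2p(t, R, k):
    """First-order standard deviation of the K2P estimate on k sites."""
    P, Q = k2p_site_probabilities(t, R)
    a = 1.0 / (1.0 - 2.0 * P - Q)
    b = 0.5 * (a + 1.0 / (1.0 - 2.0 * Q))
    return math.sqrt((a * a * P + b * b * Q - (a * P + b * Q) ** 2) / k)


if __name__ == "__main__":
    for k in (500, 1000, 2000):
        print(k, sigma_jc(1.0, 5.0, k) / sigma_k2p(1.0, 5.0, k))
```

The printed ratio is the same for every k. The raw standard deviations are not directly comparable across SR functions with different slopes, which is why the slides compare $\Delta_{JC}$ against an affine transformation of $\Delta_{K2P}$.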
Modelling Separation and Noise with FLD: Separation between noise and deviation from additivity

[Figure: two panels, R = 5 and R = 2.]
Experiment on biological data

- 163 bacterial species, 31 marker genes (Ciccarelli et al., 2006)
- sample 40,000 random 10-species sub-alignments
- extract four-fold degenerate sites
- reference tree from the Living Tree Project (ARB-Silva)

[Figure: difference from the RF of the BIONJ GTR tree (-2 to 4) for JC, K2P, and LogDet, and number of trees (0 to 8000), against the RF between the BIONJ GTR tree and the LTP tree (0 to 14).]
Summary

- Surprising observation: non-additive SR functions can improve reconstruction accuracy.
- Example: the JC SR function in K2P trees.
- Introduced concepts:
  - deviation from additivity
  - affine-additive SR function
  - SEP and NOISE
- More information in our (soon published) WABI paper.
The end

Thank you!
Selected references

- K. Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25:251-278, 1999.
- D. Doerr, I. Gronau, S. Moran, and I. Yavneh. Stochastic errors vs. modeling errors in distance based phylogenetic reconstructions. In preparation, 2011. http://www.cs.technion.ac.il/~moran/r/wabi-in-prep.pdf
- J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2nd edition, September 2003.
- I. Gronau, S. Moran, and I. Yavneh. Towards optimal distance functions for stochastic substitution models. J Theor Biol, 260(2):294-307, 2009.
- I. Gronau, S. Moran, and I. Yavneh. Adaptive distance measures for resolving K2P quartets: Metric separation versus stochastic noise. J Comp Biol, 17(11):1391-1400, 2010.
- M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111-120, June 1980.