Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution


Elizabeth S. Allman
Dept. of Mathematics and Statistics, University of Alaska Fairbanks
Current Challenges and Problems in Phylogenetics
Isaac Newton Institute, Cambridge, England
5 September 2007

Joint work with J. Rhodes and C. Ané

Identifiability
A model of molecular evolution M is identifiable if the values of all parameters can be determined from the joint distribution P of states.
Parameters = tree topology(ies), stationary distribution, edge lengths, rate matrix Q, Γ shape parameter, Markov edge matrices, p_inv, etc.
Identifiability is necessary for consistency of statistical inference, whether using ML or Bayesian methods.
INI Identifiability Slide 1
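As a minimal illustration of the definition (not taken from the talk), the Python sketch below uses a 2-taxon Jukes-Cantor model in which a substitution rate `mu` and a time `t` are treated as separate parameters. The function names and the parameter values are assumptions chosen for illustration. The joint distribution depends only on the product `mu * t`, so two different parameter pairs produce the same distribution: this parameterization is not identifiable, although the product is.

```python
import math

def jc_pair_distribution(mu, t):
    """Joint distribution of the states at two taxa separated by time t
    under the Jukes-Cantor model with rate mu and uniform base frequencies.
    Returns a 4x4 table of pattern probabilities."""
    decay = math.exp(-4.0 * mu * t / 3.0)
    p_same = 0.25 + 0.75 * decay      # probability of ending in the same base
    p_diff = 0.25 - 0.25 * decay      # probability of one specific other base
    return [[0.25 * (p_same if i == j else p_diff) for j in range(4)]
            for i in range(4)]

def max_abs_diff(F, G):
    """Largest entrywise difference between two 4x4 tables."""
    return max(abs(F[i][j] - G[i][j]) for i in range(4) for j in range(4))

# Two different (mu, t) pairs with the same product mu * t ...
F1 = jc_pair_distribution(mu=1.0, t=0.2)
F2 = jc_pair_distribution(mu=2.0, t=0.1)
# ... and one pair with a different product.
F3 = jc_pair_distribution(mu=1.0, t=0.3)

print("same product:     ", max_abs_diff(F1, F2))
print("different product:", max_abs_diff(F1, F3))
```

In this toy setting the map from `(mu, t)` to the joint distribution is not injective, which is exactly the failure that the definition rules out.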

Known identifiability results
Negative:
- For sufficiently complicated (non-explicit) rate-across-sites models, tree identifiability can fail (Steel-Székely-Hendy, J. Comp. Biol., 1994).
- Explicit non-generic examples (not r-a-s) of non-identifiability of mixtures (Štefankovič-Vigoda, Sys. Biol., 2007; J. Comp. Biol., 2007).
- Non-generic 2-class mixtures on one tree can exactly agree with a 1-class model on a different tree (Matsen-Steel, preprint).
- More general study of many-class non-identifiable mixtures under the 2-state symmetric model (Matsen-Mossel-Steel, preprint).

Positive:
- GTR is identifiable (use the log-det distance to identify the tree, etc.).
- GM is identifiable (Chang, Math. Biosci., 1996).
- General result on mixture models on one tree with a small number of classes (Allman-Rhodes, J. Comp. Biol., 2006). For DNA models, the tree is generically identifiable for:
  - GTR+I
  - GTR with 3 rate-across-sites classes
  - GTR+GTR+GTR
  - GM+GM+GM
  - covarion with 3 rate classes
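The log-det distance mentioned above can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' code: the helper names, the two example Markov matrices, and the normalisation by the two marginals (one common convention) are all choices made here. It checks numerically that the distance is additive along a path x, y, z, which is the property that lets distance methods recover the tree.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def det(A):
    """Determinant by Laplace expansion (adequate for 4x4 matrices)."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
               for j in range(len(A)))

def markov(eps, R):
    """Markov matrix (1 - eps) * I + eps * R for a row-stochastic R."""
    n = len(R)
    return [[(1 - eps) * (i == j) + eps * R[i][j] for j in range(n)]
            for i in range(n)]

def joint(p, M):
    """Joint distribution diag(p) * M of the states at two nodes."""
    return [[p[i] * M[i][j] for j in range(len(p))] for i in range(len(p))]

def logdet_distance(F):
    """Log-det distance from a pairwise joint distribution F,
    normalised by the two marginal distributions (assumed convention)."""
    rows = [sum(r) for r in F]            # marginal at the first taxon
    cols = [sum(c) for c in zip(*F)]      # marginal at the second taxon
    return (-math.log(abs(det(F)))
            + 0.5 * sum(math.log(v) for v in rows + cols))

# Hypothetical example data: a distribution at x and two edge matrices.
R1 = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.1, 0.2, 0.3],
      [0.3, 0.4, 0.1, 0.2], [0.2, 0.3, 0.4, 0.1]]
R2 = [[0.25, 0.25, 0.25, 0.25], [0.1, 0.6, 0.2, 0.1],
      [0.3, 0.3, 0.2, 0.2], [0.5, 0.1, 0.1, 0.3]]
p_x = [0.1, 0.2, 0.3, 0.4]
M_xy, M_yz = markov(0.1, R1), markov(0.2, R2)
p_y = [sum(p_x[i] * M_xy[i][j] for i in range(4)) for j in range(4)]

d_xy = logdet_distance(joint(p_x, M_xy))
d_yz = logdet_distance(joint(p_y, M_yz))
d_xz = logdet_distance(joint(p_x, matmul(M_xy, M_yz)))
print(d_xy + d_yz, d_xz)   # the two numbers agree up to rounding
```

The edge matrices here are general Markov matrices rather than GTR matrices; additivity does not rely on the extra GTR structure.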

Generic vs. non-generic identifiability
If $\mathcal{T}_n$ denotes n-leaf tree space and M any choice of model, then the parameterization map(s)
$$\phi_M : \bigcup_{T \in \mathcal{T}_n} (T, S_T) \to \mathbb{C}^{\kappa^n}, \qquad (T, s_T) \mapsto P = \phi_{M,T}(s_T)$$
give rise to the collection of joint distributions P for M.
M is identifiable $\iff$ $\phi_M$ is injective.

For a fixed tree T, the map
$$\phi_{M,T} : \{\text{Parameters on } T\} \to \mathbb{C}^{\kappa^n}, \qquad s_T \mapsto P = \phi_{M,T}(s_T)$$
associates to each tree T its phylogenetic variety $V_T$.
But $V_{T_1} \cap V_{T_2} \neq \emptyset$ always (star phylogenies).
If the intersection is of lower dimension, then the tree is identifiable for generic parameters.
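The role of star phylogenies can be checked numerically. The following Python sketch is a hedged illustration: the function names, the Jukes-Cantor-style matrices, and the labelling of the two quartet topologies by leaf indices 0 to 3 are assumptions, not the talk's notation. It evaluates the parameterization map on two different 4-leaf trees and shows that the images coincide when the internal edge has length zero (a star tree), while they differ once the internal edge carries real change.

```python
from itertools import product

K = 4  # number of states (DNA)

def jc_like(eps):
    """K x K Markov matrix (1 - eps) * I + eps * (uniform matrix)."""
    return [[(1 - eps) * (i == j) + eps / K for j in range(K)] for i in range(K)]

def quartet_distribution(pi, leaf, internal, split):
    """Pattern probabilities on a 4-leaf tree, as a dict over K**4 patterns.

    split = ((i, j), (k, l)): leaves i, j hang off internal node r and
    leaves k, l hang off internal node s. `internal` is the Markov matrix
    on the edge r -> s and leaf[m] is the matrix on the edge to leaf m."""
    (i, j), (k, l) = split
    P = {}
    for x in product(range(K), repeat=4):
        P[x] = sum(pi[r] * internal[r][s]
                   * leaf[i][r][x[i]] * leaf[j][r][x[j]]
                   * leaf[k][s][x[k]] * leaf[l][s][x[l]]
                   for r in range(K) for s in range(K))
    return P

def max_gap(P, Q):
    return max(abs(P[x] - Q[x]) for x in P)

pi = [0.25] * K
leaf = [jc_like(e) for e in (0.2, 0.3, 0.25, 0.15)]
T1, T2 = ((0, 1), (2, 3)), ((0, 2), (1, 3))   # two quartet topologies

identity = jc_like(0.0)   # zero-length internal edge: a star tree
proper = jc_like(0.4)     # internal edge with real change

star_gap = max_gap(quartet_distribution(pi, leaf, identity, T1),
                   quartet_distribution(pi, leaf, identity, T2))
tree_gap = max_gap(quartet_distribution(pi, leaf, proper, T1),
                   quartet_distribution(pi, leaf, proper, T2))
print(len(quartet_distribution(pi, leaf, proper, T1)))   # kappa**n coordinates
print(star_gap, tree_gap)
```

Star-tree parameters are non-generic, which is why the statement about tree identifiability is made for generic parameters only.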


Today...
Q1: Is the GTR+Γ+I model identifiable?
Q2: Are 2-tree mixtures identifiable?

Q1: Is the GTR+Γ+I model identifiable?
Rogers (Sys. Biol., 2001) claimed a proof, which is widely cited, but the argument has several major gaps in showing identifiability:
1) crucial use of an unjustified graphical claim
2) generic vs. non-generic parameters
There is no valid, published proof that ML or Bayesian inference using the GTR+Γ+I model is consistent.

None of the previous work applies to GTR+Γ or GTR+Γ+I, since:
- the continuous rate distribution prevents application of the Allman-Rhodes positive results (or algebraic methods of proof);
- specifying a particular form of rate distribution prevents application of the negative Steel or Matsen-Mossel-Steel results.

New result: Allman, Ané, Rhodes (2007):
For 4-state (DNA) models, GTR+Γ is identifiable.
And, more generally, for κ-state models, GTR+Γ is generically identifiable.
Comments:
- This is the first proof of identifiability for a rate-across-sites model with a continuous distribution of rates.
- For the 4-state result, identifiability holds for all parameters, not just generic ones.
- The proof does not follow Rogers' approach.

Main points of GTR+Γ proof:
- Obtain the stationary distribution and the eigenvectors of the rate matrix Q from 1- and 2-taxon marginals.
- Focus on a 3-leaf tree to identify the shape parameter α (work), then get Q and the edge lengths t_e.
- The result for an n-leaf tree then follows from combinatorial arguments.
- Use algebraic arguments to extract information from a 3-dimensional tensor.
- Use analytic arguments (convexity) for generic identifiability.
- A detailed analysis of non-generic cases completes the proof.
[Figure: 3-leaf tree labeled a_1, a_2, a_3]
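To make the role of the Γ rate distribution in pairwise marginals concrete, here is a small Python sketch for the Jukes-Cantor special case (JC is a submodel of GTR). It is an illustration, not the authors' argument: the function names and the values of `t` and `alpha` are assumptions. It relies on the standard fact that averaging $e^{\lambda r t}$ over a Gamma-distributed rate r with shape α and mean 1 gives $(1 - \lambda t/\alpha)^{-\alpha}$, so the eigenvectors of Q are unchanged and only the eigenvalue terms are transformed. The closed form is compared with direct numerical integration over r.

```python
import math

def gamma_mean_one_pdf(r, alpha):
    """Density of a Gamma distribution with shape alpha and mean 1."""
    return (alpha ** alpha) * r ** (alpha - 1) * math.exp(-alpha * r) / math.gamma(alpha)

def jc_same_prob(t):
    """Jukes-Cantor probability that the two ends of an edge of length t agree."""
    return 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)

def jc_gamma_same_prob_closed(t, alpha):
    """Closed form after averaging over Gamma rates, using the single
    nonzero Jukes-Cantor eigenvalue lambda = -4/3."""
    return 0.25 + 0.75 * (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha)

def jc_gamma_same_prob_numeric(t, alpha, upper=40.0, steps=100_000):
    """Same quantity by midpoint-rule integration over the rate r.
    Intended for alpha >= 1, where the density is bounded at r = 0."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * h
        total += jc_same_prob(r * t) * gamma_mean_one_pdf(r, alpha) * h
    return total

t, alpha = 0.5, 2.0   # hypothetical edge length and shape parameter
closed = jc_gamma_same_prob_closed(t, alpha)
numeric = jc_gamma_same_prob_numeric(t, alpha)
print(closed, numeric)
```

In this special case a single pairwise marginal is summarised by one number, so it cannot determine both `t` and `alpha` on its own. That is at least consistent with the proof turning to a 3-leaf tree to identify α.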

Note: We still lack a proof that the tree is identifiable for GTR+Γ+I. This is likely to be significantly harder to prove since:
- Γ introduces only 1 parameter (the shape parameter α);
- Γ+I introduces 2 parameters (α and the proportion of invariable sites p_inv).

Tree mixtures
Different parts of sequences may have evolved along different trees:
- gene tree vs. species tree, incomplete lineage sorting
- horizontal gene transfer
[Figure: a species tree with the trees for Gene 1 and Gene 2]

Two-tree mixtures can confound analysis.
- Mossel, E. and Vigoda, E., "Phylogenetic MCMC algorithms are misleading on mixtures of trees," Science 309, 2207 (2005).
- Ronquist, F., Larget, B., Huelsenbeck, J., Kadane, J., Simon, D., and van der Mark, P., Comment on "Phylogenetic MCMC algorithms are misleading on mixtures of trees," Science 312, 367a (2006).
- Mossel, E. and Vigoda, E., Response to comment on "Phylogenetic MCMC algorithms are misleading on mixtures of trees," Science 312, 367b (2006).
- Matsen, F. and Steel, M., "Phylogenetic mixtures on a single tree can mimic a tree of another topology," preprint. (2-state)
- Matsen, F., Mossel, E., and Steel, M., "Mixed-up trees: the structure of phylogenetic mixtures," preprint. (2-state)

Simple model: 4-taxon trees $T_1$, $T_2$, $T_3$
[Figure: the three 4-taxon tree topologies T_1, T_2, T_3 on taxa a, b, c, d]
Joint distributions $P_{1,2}$ are 2-tree mixtures with δ a mixing parameter:
$$P_{1,2} = \delta\, p_{M,T_1} + (1 - \delta)\, p_{M,T_2}$$
Similarly for the other two mixtures.
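The mixture formula can be written out directly. The sketch below is an assumption-laden illustration: the helper names, the Jukes-Cantor-style edge matrices, the particular edge parameters, and the identification of the two topologies with splits of leaf indices 0 to 3 are all hypothetical. It builds the pattern distributions on two quartet trees and mixes them with weight `delta`.

```python
from itertools import product

K = 4
PATTERNS = list(product(range(K), repeat=4))   # all K**4 site patterns

def jc_like(eps):
    """K x K Markov matrix (1 - eps) * I + eps * (uniform matrix)."""
    return [[(1 - eps) * (i == j) + eps / K for j in range(K)] for i in range(K)]

def tree_distribution(split, leaf_eps, internal_eps):
    """Pattern probabilities (ordered as PATTERNS) on the quartet tree
    whose internal edge separates the two leaf pairs in `split`,
    with a uniform root distribution."""
    (i, j), (k, l) = split
    leaf = [jc_like(e) for e in leaf_eps]
    inner = jc_like(internal_eps)
    return [sum(0.25 * inner[r][s]
                * leaf[i][r][x[i]] * leaf[j][r][x[j]]
                * leaf[k][s][x[k]] * leaf[l][s][x[l]]
                for r in range(K) for s in range(K))
            for x in PATTERNS]

def mix(delta, p1, p2):
    """Two-tree mixture delta * p1 + (1 - delta) * p2."""
    return [delta * a + (1 - delta) * b for a, b in zip(p1, p2)]

T1, T2 = ((0, 1), (2, 3)), ((0, 2), (1, 3))
p_T1 = tree_distribution(T1, (0.2, 0.3, 0.25, 0.15), 0.4)
p_T2 = tree_distribution(T2, (0.1, 0.2, 0.3, 0.25), 0.3)

delta = 0.3
P_12 = mix(delta, p_T1, p_T2)
print(len(P_12), sum(P_12))   # 256 coordinates; total probability is 1 up to rounding
```

Each component has its own edge parameters, so the mixture is a point in the same 256-dimensional space as a single-tree distribution but depends on both parameter sets and on `delta`.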

Theorem. Suppose $P_{ij}$ is a joint distribution arising from a 2-tree GM mixture on 4-taxon trees for κ = 4 states. Then the trees $T_i$, $T_j$ and stochastic parameters $s_i$, $s_j$ are generically identifiable from $P_{ij}$; i.e., given $P_{ij}$, we can generically identify $(T_i, s_i)$ and $(T_j, s_j)$.
A similar result holds for 2-tree GTR mixtures (and JC mixtures).

Two-tree mixtures proof (GM)
Find a specific point B that lies on both $V_{GM,T_1,T_2}$ and $V_{GM,T_1,T_3}$.
Prove B is non-singular by computing in Maple the dimensions of the tangent spaces $H_{1,2}$ to $V_{GM,T_1,T_2}$ at B and $H_{1,3}$ to $V_{GM,T_1,T_3}$ at B:
$$\dim(H_{1,2}) = 127, \qquad \dim(H_{1,3}) = 127$$

All computations for GM can be done exactly:
- B can be chosen to arise from rational parameter values.
- The parameterization is given by polynomials with rational coefficients.
- Maple performs exact rational arithmetic.
Another computation (ten minutes of computation) shows that the two tangent spaces intersect in a lower-dimensional linear space:
$$\dim(H_{1,2} \cap H_{1,3}) = 115$$
This proves that $V_{GM,T_1,T_2}$ and $V_{GM,T_1,T_3}$ are different, and then by principles of algebraic geometry we have
$$\dim(V_{GM,T_1,T_2} \cap V_{GM,T_1,T_3}) < 127.$$
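The kind of exact tangent-space computation described above can be imitated on a much smaller model. The Python sketch below is not the Maple computation from the talk: it uses a 2-state general Markov model on a 3-leaf tree, with hypothetical function names and a hypothetical rational point, and computes the rank of the Jacobian of the parameterization in exact rational arithmetic. It also counts free parameters: the count for a 2-tree GM mixture on quartets with 4 states comes out as 127, which is consistent with the tangent-space dimension quoted above.

```python
from fractions import Fraction as Fr
from itertools import product

def n_params(kappa, n_edges, n_trees=1):
    """Free parameters of a GM (mixture) model: per tree, a root
    distribution plus one Markov matrix per edge, plus mixing weights."""
    per_tree = (kappa - 1) + n_edges * kappa * (kappa - 1)
    return n_trees * per_tree + (n_trees - 1)

def stoch(off0, off1):
    """2x2 Markov matrix with the given off-diagonal entries."""
    return [[1 - off0, off0], [off1, 1 - off1]]

def phi(theta):
    """Parameterization map: 7 parameters -> 8 pattern probabilities."""
    p, a0, a1, b0, b1, c0, c1 = theta
    pi = [p, 1 - p]
    A, B, C = stoch(a0, a1), stoch(b0, b1), stoch(c0, c1)
    return [sum(pi[r] * A[r][i] * B[r][j] * C[r][k] for r in range(2))
            for i, j, k in product(range(2), repeat=3)]

def jacobian_rows(theta):
    """phi is affine in each single parameter, so each partial derivative
    equals the exact difference phi(theta + e_i) - phi(theta)."""
    base = phi(theta)
    rows = []
    for i in range(len(theta)):
        bumped = list(theta)
        bumped[i] += 1
        rows.append([u - v for u, v in zip(phi(bumped), base)])
    return rows

def rank(rows):
    """Rank by Gaussian elimination; exact when entries are Fractions."""
    M = [list(r) for r in rows]
    rk = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(rk, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for i in range(rk + 1, len(M)):
            f = M[i][col] / M[rk][col]
            M[i] = [x - f * y for x, y in zip(M[i], M[rk])]
        rk += 1
    return rk

# A hypothetical rational parameter point, playing the role of B.
point = [Fr(1, 3), Fr(1, 10), Fr(1, 5), Fr(1, 7), Fr(1, 4), Fr(1, 6), Fr(1, 8)]
tangent_dim = rank(jacobian_rows(point))
print(tangent_dim, n_params(2, 3))       # tangent-space dimension vs. parameter count
print(n_params(4, 5, n_trees=2))         # 2-tree GM mixture on quartets, 4 states
```

Because every quantity is a `Fraction`, the rank is computed without rounding error, mirroring the point that rational parameters and polynomial maps with rational coefficients allow exact computation.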

Extension to GTR (non-algebraic):
Observe JC ⊂ GTR. Choose B to be a Jukes-Cantor point (rational, yet GTR) with
$$B \in X_{GTR,T_1,T_2} \cap X_{GTR,T_1,T_3}.$$
Prove that there is a vector v tangent to $X_{GTR,T_1,T_3}$ at B that does not lie in the tangent plane at B to $V_{GM,T_1,T_2}$.

Preprint: http://www.dms.uaf.edu/~eallman/gamid.pdf