Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Elizabeth S. Allman
Dept. of Mathematics and Statistics, University of Alaska Fairbanks

Current Challenges and Problems in Phylogenetics
Isaac Newton Institute, Cambridge, England
5 September 2007

Joint work with J. Rhodes and C. Ané

Identifiability. A model of molecular evolution M is identifiable if the values of all parameters can be determined from the joint distribution P of states. Parameters = tree topology(ies), stationary distribution, edge lengths, rate matrix Q, Γ shape parameter, Markov edge matrices, $p_{\mathrm{inv}}$, etc. Identifiability is necessary for consistency of statistical inference, whether using ML or Bayesian methods. INI Identifiability Slide 1

Known identifiability results.

Negative:
- For sufficiently complicated (non-explicit) rate-across-sites models, tree identifiability can fail (Steel-Székely-Hendy, J. Comp. Biol., 1994).
- Explicit non-generic examples (not rate-across-sites) of non-identifiability of mixtures (Štefankovič-Vigoda, Sys. Biol., 2007; J. Comp. Biol., 2007).
- Non-generic 2-class mixtures on one tree can exactly agree with a 1-class model on a different tree (Matsen-Steel, preprint).
- A more general study of many-class non-identifiable mixtures under the 2-state symmetric model (Matsen-Mossel-Steel, preprint).

Positive:
- GTR is identifiable (use the log-det distance to identify the tree, etc.).
- GM is identifiable (Chang, Math. Biosci., 1996).
- A general result on mixture models on one tree with a small number of classes (Allman-Rhodes, J. Comp. Biol., 2006). For DNA models, the tree is generically identifiable for:
  - GTR+I
  - GTR with 3 rate-across-sites classes
  - GTR+GTR+GTR
  - GM+GM+GM
  - covarion with 3 rate classes
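The log-det distance mentioned above can be illustrated on a three-node chain x, z, y. This is a minimal sketch, not taken from the slides: it assumes the symmetric normalization of the log-det distance by the square root of the product of the marginal frequencies, and the root distribution `pi` and the Markov matrices `M1`, `M2` are made-up values. The point it shows is additivity along a path, which is the property that lets distance methods recover the tree.

```python
import math

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def det(A):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in A]
    n, d = len(A), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0.0:
            return 0.0
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        d *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return d

def joint(pi, M):
    """Joint distribution F = diag(pi) M of the states at two nodes."""
    return [[pi[i] * M[i][j] for j in range(len(M))] for i in range(len(M))]

def logdet_distance(F):
    """Log-det distance from a pairwise joint distribution F (assumed normalization)."""
    row = [sum(r) for r in F]        # marginal at the first node
    col = [sum(c) for c in zip(*F)]  # marginal at the second node
    norm = math.sqrt(math.prod(row) * math.prod(col))
    return -math.log(abs(det(F)) / norm)

# Hypothetical parameters: root distribution at x, Markov matrices x->z and z->y.
pi = [0.1, 0.2, 0.3, 0.4]
M1 = [[0.70, 0.10, 0.10, 0.10], [0.05, 0.80, 0.10, 0.05],
      [0.10, 0.10, 0.75, 0.05], [0.02, 0.08, 0.10, 0.80]]
M2 = [[0.85, 0.05, 0.05, 0.05], [0.10, 0.60, 0.20, 0.10],
      [0.05, 0.15, 0.70, 0.10], [0.10, 0.10, 0.10, 0.70]]

pi_z = [sum(pi[i] * M1[i][j] for i in range(4)) for j in range(4)]
d_xz = logdet_distance(joint(pi, M1))
d_zy = logdet_distance(joint(pi_z, M2))
d_xy = logdet_distance(joint(pi, matmul(M1, M2)))
print(d_xy, d_xz + d_zy)  # the two values agree up to rounding
```

Because the matrices here are general Markov matrices rather than GTR, the sketch also illustrates why the same distance works for GM.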

Generic vs. non-generic identifiability.

If $\mathcal{T}_n$ denotes n-leaf tree space and M any choice of model, then the parameterization map(s)

$$\varphi_M : \bigcup_{T \in \mathcal{T}_n} (T, S_T) \to \mathbb{C}^{\kappa^n}, \qquad (T, s_T) \mapsto P = \varphi_{M,T}(s_T)$$

give rise to the collection of joint distributions P for M.

M is identifiable $\iff$ $\varphi_M$ is injective.

For a fixed tree T, the map

$$\varphi_{M,T} : \{\text{Parameters on } T\} \to \mathbb{C}^{\kappa^n}, \qquad s_T \mapsto P = \varphi_{M,T}(s_T)$$

associates to each tree T its phylogenetic variety $V_T$.

But $V_{T_1} \cap V_{T_2} \neq \emptyset$ always (star phylogenies).

If the intersection is of lower dimension, then the tree is identifiable for generic parameters.

[Figure: the phylogenetic varieties $V_{T_1}$ and $V_{T_2}$ and their intersection.]

Today...

Q1: Is the GTR+Γ+I model identifiable?
Q2: Are 2-tree mixtures identifiable?

Q1: Is the GTR+Γ+I model identifiable?

Rogers (Sys. Biol., 2001) claimed a proof, which is widely cited, but the argument has several major gaps in showing identifiability:
1) crucial use of an unjustified graphical claim;
2) the treatment of generic vs. non-generic parameters.

There is no valid, published proof that ML or Bayesian inference using the GTR+Γ+I model is consistent.

None of the previous work applies to GTR+Γ or GTR+Γ+I, since:
- the continuous rate distribution prevents application of the Allman-Rhodes positive results (or algebraic methods of proof);
- specifying a particular form of rate distribution prevents application of the negative Steel or Matsen-Mossel-Steel results.

New result (Allman, Ané, Rhodes, 2007): For 4-state (DNA) models, GTR+Γ is identifiable. More generally, for κ-state models, GTR+Γ is generically identifiable.

Comments:
- This is the first proof of identifiability for a rate-across-sites model with a continuous distribution of rates.
- In the 4-state case, identifiability holds for all parameter values, not just generic ones.
- The proof does not follow Rogers' approach.

Main points of the GTR+Γ proof:
- Get the stationary distribution and the eigenvectors of the rate matrix Q from 1- and 2-taxon marginals.
- Focus on a 3-leaf tree (leaves $a_1$, $a_2$, $a_3$) to identify the shape parameter α (this takes work), then get Q and the edge lengths $t_e$.
- The result for an n-leaf tree then follows from combinatorial arguments.
- Use algebraic arguments to extract information from a 3-dimensional tensor.
- Use analytic arguments (convexity) for generic identifiability.
- A detailed analysis of the non-generic cases completes the proof.
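The first step can be made concrete for a pair of taxa a, b separated by total edge length t. What follows is a sketch under stated assumptions, not the argument of the preprint: it assumes the usual mean-one parameterization of the Γ distribution (shape α, rate α) and a diagonalization $Q = U \operatorname{diag}(\lambda_1, \dots, \lambda_\kappa) U^{-1}$ with real eigenvalues $\lambda_i \le 0$, which holds for GTR. The symbols $P_{ab}$, $U$ and $\lambda_i$ are notation introduced here.

```latex
\begin{align*}
P_{ab} &= \operatorname{diag}(\pi)\,
          \mathbb{E}_{r \sim \Gamma(\alpha,\alpha)}\!\left[ e^{Q r t} \right] \\
       &= \operatorname{diag}(\pi)\, U
          \operatorname{diag}\!\left( \mathbb{E}\!\left[ e^{\lambda_i r t} \right] \right) U^{-1} \\
       &= \operatorname{diag}(\pi)\, U
          \operatorname{diag}\!\left( \left( 1 - \frac{\lambda_i t}{\alpha} \right)^{-\alpha} \right) U^{-1}
\end{align*}
```

The last line uses the moment generating function of the Γ distribution. Under these assumptions π is the vector of row sums of $P_{ab}$, and $\operatorname{diag}(\pi)^{-1} P_{ab}$ has the same eigenvectors as Q. Its eigenvalues, however, mix $\lambda_i$, t and α, which is consistent with the slides turning to the 3-leaf tree to pin down α.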

Note: We still lack a proof that the tree is identifiable for GTR+Γ+I. This is likely to be significantly harder to prove, since Γ introduces only 1 parameter (the shape parameter α), while Γ+I introduces 2 parameters (α and the proportion of invariable sites $p_{\mathrm{inv}}$).

Tree mixtures. Different parts of sequences may have evolved along different trees:
- gene tree vs. species tree, incomplete lineage sorting
- horizontal gene transfer

[Figure: a species tree with gene trees labeled Gene 1 and Gene 2.]

Two-tree mixtures can confound analysis.
- Mossel, E. and Vigoda, E., Phylogenetic MCMC algorithms are misleading on mixtures of trees, Science 309, 2207 (2005).
- Ronquist, F., Larget, B., Huelsenbeck, J., Kadane, J., Simon, D., and van der Mark, P., Comment on "Phylogenetic MCMC algorithms are misleading on mixtures of trees", Science 312, 367a (2006).
- Mossel, E. and Vigoda, E., Response to comment on "Phylogenetic MCMC algorithms are misleading on mixtures of trees", Science 312, 367b (2006).
- Matsen, F. and Steel, M., Phylogenetic mixtures on a single tree can mimic a tree of another topology, preprint. (2-state)
- Matsen, F., Mossel, E., and Steel, M., Mixed-up trees: the structure of phylogenetic mixtures, preprint. (2-state)

Simple model: 4-taxon trees $T_1$, $T_2$, $T_3$.

[Figure: the three 4-taxon tree topologies on leaves a, b, c, d, labeled $T_1$, $T_2$, $T_3$.]

Joint distributions $P_{1,2}$ are 2-tree mixtures with δ a mixing parameter:

$$P_{1,2} = \delta\, p_{M,T_1} + (1 - \delta)\, p_{M,T_2}$$

Similarly for the other two mixtures.
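The mixture formula can be tried out numerically. The sketch below takes M to be the Jukes-Cantor model, which is the simplest model for which the theorem is stated, not the general model of the slide. The splits chosen for the two trees (ab|cd and ac|bd), the branch lengths and the value of δ are all illustrative assumptions.

```python
import math
from itertools import product

STATES = range(4)  # A, C, G, T encoded as 0..3

def jc(t):
    """Jukes-Cantor transition probability function for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return lambda i, j: same if i == j else diff

def quartet_distribution(split, lengths):
    """Site-pattern distribution on a 4-taxon tree under JC.

    split   -- pair of cherries, e.g. (("a", "b"), ("c", "d")) for ab|cd
    lengths -- pendant lengths "a", "b", "c", "d" and the "internal" length
    Returns a dict mapping patterns (states at a, b, c, d) to probabilities.
    """
    (l1, l2), (r1, r2) = split
    p = {x: jc(lengths[x]) for x in ("a", "b", "c", "d", "internal")}
    dist = {}
    for pattern in product(STATES, repeat=4):
        s = dict(zip("abcd", pattern))
        total = 0.0
        # Sum over the states u, v at the two internal nodes; the root is
        # placed at u with the uniform (stationary) distribution.
        for u, v in product(STATES, repeat=2):
            total += (0.25 * p[l1](u, s[l1]) * p[l2](u, s[l2])
                      * p["internal"](u, v)
                      * p[r1](v, s[r1]) * p[r2](v, s[r2]))
        dist[pattern] = total
    return dist

def mixture(delta, p1, p2):
    """Two-tree mixture delta * p1 + (1 - delta) * p2."""
    return {k: delta * p1[k] + (1.0 - delta) * p2[k] for k in p1}

# Hypothetical trees and parameters.
T1 = (("a", "b"), ("c", "d"))
T2 = (("a", "c"), ("b", "d"))
lengths1 = {"a": 0.10, "b": 0.20, "c": 0.15, "d": 0.30, "internal": 0.05}
lengths2 = {"a": 0.25, "b": 0.10, "c": 0.20, "d": 0.10, "internal": 0.08}

P12 = mixture(0.3, quartet_distribution(T1, lengths1),
              quartet_distribution(T2, lengths2))
print(len(P12), sum(P12.values()))
```

The result is a point in the space of distributions on the $4^4 = 256$ site patterns, which is the object the identifiability question is asked about.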

Theorem. Suppose $P_{ij}$ is a joint distribution arising from a 2-tree GM mixture on 4-taxon trees for κ = 4 states. Then the trees $T_i$, $T_j$ and the stochastic parameters $s_i$, $s_j$ are generically identifiable from $P_{ij}$; i.e., given $P_{ij}$, we can generically identify $(T_i, s_i)$ and $(T_j, s_j)$. A similar result holds for 2-tree GTR mixtures (and JC mixtures).

Two-tree mixtures proof (GM).

Find a specific point B that lies on both $V_{\mathrm{GM},T_1,T_2}$ and $V_{\mathrm{GM},T_1,T_3}$.

Prove B is non-singular by computing in Maple the dimension of the tangent spaces $H_{1,2}$ to $V_{\mathrm{GM},T_1,T_2}$ at B and $H_{1,3}$ to $V_{\mathrm{GM},T_1,T_3}$ at B:

$$\dim(H_{1,2}) = 127, \qquad \dim(H_{1,3}) = 127$$
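The kind of tangent-space computation described here can be imitated on a much smaller model. The sketch below is not the Maple computation from the talk: it uses the 2-state general Markov model on a 3-leaf star tree (7 free parameters, 8 site patterns) as a stand-in, and the point `B` is an arbitrary rational choice. It relies on the fact that each pattern probability is affine in each free parameter separately, so a unit-step difference gives the exact partial derivative, and on exact rational arithmetic via `fractions.Fraction`.

```python
from fractions import Fraction as Fr
from itertools import product

def pattern_probs(x):
    """2-state general Markov model on the 3-leaf star tree.

    x = (pi0, m1_00, m1_10, m2_00, m2_10, m3_00, m3_10): the free entries
    of the root distribution and of the three 2x2 Markov matrices.
    Returns the 8 site-pattern probabilities as exact rationals.
    """
    pi = (x[0], 1 - x[0])
    mats = [((x[m], 1 - x[m]), (x[m + 1], 1 - x[m + 1])) for m in (1, 3, 5)]
    return [sum(pi[r] * mats[0][r][i] * mats[1][r][j] * mats[2][r][k]
                for r in (0, 1))
            for i, j, k in product((0, 1), repeat=3)]

def jacobian(f, x):
    """Exact partial derivatives of f at x, one row per parameter.

    Valid because f is affine in each parameter separately, so
    f(x + e_i) - f(x) equals the partial derivative with respect to x_i.
    """
    base = f(x)
    rows = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += 1
        rows.append([a - b for a, b in zip(f(shifted), base)])
    return rows

def rank(rows):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [list(r) for r in rows]
    rk = 0
    for c in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            f = rows[r][c] / rows[rk][c]
            rows[r] = [a - f * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk

# Hypothetical rational point B in parameter space.
B = [Fr(1, 3), Fr(3, 4), Fr(1, 5), Fr(2, 3), Fr(1, 4), Fr(4, 5), Fr(1, 3)]
tangent_dim = rank(jacobian(pattern_probs, B))
print(tangent_dim)  # 7: the Jacobian has full rank at this point
```

For the 2-tree mixtures of the talk the same idea applies with many more parameters and 256 coordinates, which is why a computer algebra system was used.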

All computations for GM can be done exactly:
- B can be chosen to arise from rational parameter values.
- The parameterization is given by polynomials with rational coefficients.
- Maple performs exact rational arithmetic.

Another computation shows that the two tangent spaces intersect in a lower-dimensional subspace (about ten minutes of computation):

$$\dim(H_{1,2} \cap H_{1,3}) = 115$$

This proves that $V_{\mathrm{GM},T_1,T_2}$ and $V_{\mathrm{GM},T_1,T_3}$ are different, and then by principles of algebraic geometry we have

$$\dim(V_{\mathrm{GM},T_1,T_2} \cap V_{\mathrm{GM},T_1,T_3}) < 127.$$
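One way to obtain the dimension of an intersection of two linear spaces from exact rank computations is the identity dim(U ∩ W) = dim U + dim W − dim(U + W). The sketch below applies it to a toy pair of planes. It is an illustration of that identity only, not a reconstruction of how the Maple computation was organized, and the spanning vectors are made up.

```python
from fractions import Fraction as Fr

def rank(rows):
    """Rank by exact Gaussian elimination over the rationals."""
    rows = [list(r) for r in rows]
    rk = 0
    for c in range(len(rows[0])):
        pivot = next((r for r in range(rk, len(rows)) if rows[r][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for r in range(rk + 1, len(rows)):
            f = rows[r][c] / rows[rk][c]
            rows[r] = [a - f * b for a, b in zip(rows[r], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk

def intersection_dim(U, W):
    """dim(U ∩ W) for the subspaces spanned by the rows of U and of W."""
    return rank(U) + rank(W) - rank(U + W)  # U + W stacks the spanning sets

# Toy example: two coordinate planes in Q^3 meet in a line.
U = [[Fr(1), Fr(0), Fr(0)], [Fr(0), Fr(1), Fr(0)]]
W = [[Fr(0), Fr(1), Fr(0)], [Fr(0), Fr(0), Fr(1)]]
dim_UW = intersection_dim(U, W)
print(dim_UW)  # 1
```

Read against the numbers on the slide, the same identity says that the two tangent spaces together span a space of dimension 127 + 127 − 115 = 139.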

Extension to GTR (non-algebraic):
- Observe JC ⊂ GTR.
- Choose B to be a Jukes-Cantor point (rational, yet GTR) with $B \in X_{\mathrm{GTR},T_1,T_2} \cap X_{\mathrm{GTR},T_1,T_3}$.
- Prove that there is a vector v tangent to $X_{\mathrm{GTR},T_1,T_3}$ at B that does not lie in the tangent plane at B to $V_{\mathrm{GM},T_1,T_2}$.

Preprint: http://www.dms.uaf.edu/~eallman/gamid.pdf