Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Similar documents
Lecture Notes: Markov chains

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Mutation models I: basic nucleotide sequence mutation models

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Phylogenetic Assumptions

Phylogenetic invariants versus classical phylogenetics

Lab 9: Maximum Likelihood and Modeltest

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

What Is Conservation?

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Reconstruire le passé biologique modèles, méthodes, performances, limites

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

A Bayesian Approach to Phylogenetics

Lecture 4. Models of DNA and protein change. Likelihood methods

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

Introduction to Machine Learning CMU-10701

Statistics 992 Continuous-time Markov Chains Spring 2004

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Remarks on Hadamard conjugation and combinatorial phylogenetics

Taming the Beast Workshop

Markov Chains. Sarah Filippi Department of Statistics TA: Luke Kelly

Inferring Molecular Phylogeny

REPRESENTATION THEORY OF S n

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Using algebraic geometry for phylogenetic reconstruction

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Inferring Complex DNA Substitution Processes on Phylogenies Using Uniformization and Data Augmentation

Chapter 7: Models of discrete character evolution

Convex Optimization CMU-10725

Predicting the Evolution of two Genes in the Yeast Saccharomyces Cerevisiae

The Importance of Proper Model Assumption in Bayesian Phylogenetics

Modeling Noise in Genetic Sequences

Exercises on chapter 4

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Evolutionary Analysis of Viral Genomes

Bayesian Phylogenetics

Algebraic Statistics Tutorial I

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Lecture 3: Markov chains.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Letter to the Editor. Department of Biology, Arizona State University

The Phylo- HMM approach to problems in comparative genomics, with examples.

Applications of Hidden Markov Models

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Topics in Representation Theory: Roots and Weights

Quantum Theory and Group Representations

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

LECTURE 4: REPRESENTATION THEORY OF SL 2 (F) AND sl 2 (F)

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

8. Prime Factorization and Primary Decompositions

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Bayesian Analysis of Elapsed Times in Continuous-Time Markov Chains

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

Phylogenetics. Andreas Bernauer, March 28, Expected number of substitutions using matrix algebra 2

Molecular Evolution, course # Final Exam, May 3, 2006

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

IEOR 6711, HMWK 5, Professor Sigman

Bayesian Phylogenetics

Definition 2.3. We define addition and multiplication of matrices as follows.

BMI/CS 776 Lecture 4. Colin Dewey

Symmetries, Fields and Particles 2013 Solutions

Lecture 4. Models of DNA and protein change. Likelihood methods

Markov Models & DNA Sequence Evolution

Phylogenetic Inference using RevBayes

arxiv: v3 [math.ag] 21 Aug 2008

Phylogenetics: Building Phylogenetic Trees

Phylogenetic Inference using RevBayes

Hastings Ratio of the LOCAL Proposal Used in Bayesian Phylogenetics

Frequentist Properties of Bayesian Posterior Probabilities of Phylogenetic Trees Under Simple and Complex Substitution Models

3 The Simplex Method. 3.1 Basic Solutions

Representations of quivers

Introduction to Machine Learning CMU-10701

Computational Biology and Chemistry

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetic Inference and Hypothesis Testing. Catherine Lai (92720) BSc(Hons) Department of Mathematics and Statistics University of Melbourne

Bayesian Models for Phylogenetic Trees

Quantifying sequence similarity

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

Bayesian support is larger than bootstrap support in phylogenetic inference: a mathematical argument

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Representations Are Everywhere

Evolutionary Models. Evolutionary Models

Symmetries, Fields and Particles. Examples 1.

Variations on a Theme: Fields of Definition, Fields of Moduli, Automorphisms, and Twists

Models of Molecular Evolution

GROUPS. Chapter-1 EXAMPLES 1.1. INTRODUCTION 1.2. BINARY OPERATION

SCHUR-WEYL DUALITY FOR U(n)

Chapter 7. Markov chain background. 7.1 Finite state space

LECTURE 16: LIE GROUPS AND THEIR LIE ALGEBRAS. 1. Lie groups

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid

Part III Symmetries, Fields and Particles

Bayesian Inference and MCMC

Transcription:

Lie Markov models Jeremy Sumner School of Physical Sciences University of Tasmania, Australia Stochastic Modelling Meets Phylogenetics, UTAS, November 2015 Jeremy Sumner Lie Markov models 1 / 23

The theory of (matrix) Lie groups G and Lie algebras L Consider the orthogonal group with MM T = 1 i. (M 1 M 2 )(M 1 M 2 ) T = M 1 (M 2 M T 2 )MT 1 = 1 ii. 1 = M 1 (MM T )(M 1 ) T = M 1 1(M 1 ) T = M 1 (M 1 ) T Consider path M(t) and M(0) = 1. Tangents X := dm(t) dt satisfy X +X T = 0 0 Forms a Lie algebra L: i. X +λy L ii. [X,Y]:=XY YX L Exponential map: exp : L G i.e. exp(x)exp(x) T = exp(x)exp( X) = 1 Jeremy Sumner Lie Markov models 2 / 23

DNA substitutions modelled as cont-time Markov chain Model sequence evolution as a CTMC on nucleotides {A,G,C,T} Two extremes: All rates are the same OR All rates (might be!) different. α α α α α α α α α OR? α α α α β γ δ ǫ φ ψ ζ ϕ ξ ω σ What model is best depends on bias-variance tradeoff. Lots of molecular data means model complexity has somewhat been driven by computing power. Jeremy Sumner Lie Markov models 3 / 23

The GTR model (Tavare 1986) Stationary dist: π = (π A,π G,π C,π T ) T Time reversible: rate A T equals rate T A π A s 1 π A s 2 π A s 3 Q = π G s 1 π G s 4 π G s 5 π C s 2 π C s 4 π C s 6 π T s 3 π T s 5 π T s 6 (j)modeltest hierarchy Posada and Crandell, 1998 Huelsenback et. al. 2004 considered submodels via constraints on the relative rates s i I emailed this paper to Peter Jarvis in 2009... Jeremy Sumner Lie Markov models 4 / 23

What about the homogeneity assumption? Phylogenetic models are full of contradictory assumptions (of course!) Typically, substitution rates Q are assumed fixed throughout evolutionary history. Some modern implementations allow for differing rates on each branch. Leads to a problem... Jeremy Sumner Lie Markov models 5 / 23

What s the problem with global homogeneity? REALITY? MODEL Forgot 2 nd taxa Q 1 Q 4 Q 2 Q 3 Q Q Q Q e Q 4 e Q 3 e Q e Q = e Q 3 e Q 4 Is Q in the same model? Q = log(exp(q 3 )exp(q 4 )) = Q 3 +Q 4 + 1 2 [Q 3,Q 4 ]+ 1 12 [Q 3,[Q 3,Q 4 ]] 1 12 [Q 4,[Q 3,Q 4 ]]+... BCH formula with commutators [Q 3,Q 4 ] := Q 3 Q 4 Q 4 Q 3 Jeremy Sumner Lie Markov models 6 / 23

GTR (obviously) doesn t form a Lie algebra π A s 1 π A s 2 π A s 3 Q = π G s 1 π G s 4 π G s 5 π C s 2 π C s 4 π C s 6 π T s 3 π T s 5 π T s 6 Non-linear: q AG q GC q CA = (π A s 1 )(π C s 2 )(π G s 4 ) = q AC q CG q GA Therefore, GTR is not multiplicatively closed. Jeremy Sumner Lie Markov models 7 / 23

Is the GTR model bad for molecular phylogenetics? S et. al. Syst. Biol. 2012 Maximum absolute error in GTR substitution probabilities Percentage error 0 10 20 30 40 50 60 0.5 1.0 1.5 2.0 Distance between Q1 and Q2 Jeremy Sumner Lie Markov models 8 / 23

Almost Lie-Markov: GTR with uniform base frequencies What about if π i = 1 4 in the GTR model? In this case we do have a linear model: s 1 s 2 s 3 Q = s 1 s 4 s 5 s 2 s 4 s 6, i.e. QT = Q. s 3 s 5 s 6 Since this is a linear model, via Q T = Q, the first term in the BCH formula works: (Q 1 +Q 2 ) T = Q T 1 +QT 2 Commutators don t work though: [A,B] T = (AB) T (BA) T = BA AB = [A,B] In practice errors are not so bad up to order O(t 2 ). Will come back to the dual case s i = const. later... Jeremy Sumner Lie Markov models 9 / 23

Bring me a list of all Lie-Markov models... 1 Some (specific and general) models are already closed. e.g. Kimura models, Jukes-Cantor, Felsenstein 81 e.g. Group-based and equivariant 2 What is the Lie-algebraic closure of a model? e.g. GTR = GM and HKY = RY8.8 3 Use regular representation of a finite semigroup. e.g. Group-based and F81 (see later) 4 Constrain problem using symmetries and apply sledgehammer. Jeremy Sumner Lie Markov models 10 / 23

Purine/pyrimidine symmetries Nucleotides can be divided into purines {A,G} and pyrimidines {C,T} Purines: 2 carbon-nitrogen ring, Pyrimidines: 1 carbon-nitrogen ring Jeremy Sumner Lie Markov models 11 / 23

Models with purine/pyrimidine symmetry Kimura 2-parameter stationary (K2ST) 1980: α β β Q = α β β β β α β β α Hasegawa, Kishino and Yano (HKY) 1985: π A α π A β π A β Q = π G α π G β π G β π C β π C β π C α π T β π T β π T α Jeremy Sumner Lie Markov models 12 / 23

Purine/pyrimidine symmetries The mathematicians view AG CT = {{A,G},{C,T}} Symmetries: (AG) and (AC)(GT) S 4 Generates the dihedral group D 8 = C2 C 2 : {e,(ag),(ct),(ag)(ct),(ac)(gt),(at)(gc),(acgt),(atgc)} Jeremy Sumner Lie Markov models 13 / 23

Purine/pyrimidine symmetries The mathematicians view AG CT = {{A,G},{C,T}} Symmetries: (AG) and (AC)(GT) S 4 Generates the dihedral group D 8 = C2 C 2 : {e,(ag),(ct),(ag)(ct),(ac)(gt),(at)(gc),(acgt),(atgc)} Example with σ = (AC)(GT): π A α π A β π A β π C α π C β π C β Q = π G α π G β π G β π C β π C β π C α π T α π T β π T β π A β π A β π A α π T β π T β π T α π G β π G β π G α The labels change but this is still a HKY rate matrix! Jeremy Sumner Lie Markov models 13 / 23

Enter more algebra: group representation theory All popular models (Lie-Markov or not) have some permutation symmetries. e.g. GM, GTR, JC, K3ST, F81 have complete symmetry. K3ST: α β γ Q = α γ β β γ α γ β α e.g. K2ST, HKY have purine/pyrimidine symmetry. Algebraic theory says (for a linear model!) we can decompose into a sum of irreducible representations of the relevant permutation group. e.g. F81 = id (31) and K3ST = id (2 2 ) In other words 4=1+3 and 3=1+2. Jeremy Sumner Lie Markov models 14 / 23

What the hell does F81 = id (31) mean? Jeremy Sumner Lie Markov models 15 / 23

What the hell does F81 = id (31) mean? F81: π 1 π 1 π 1 Q = π 2 π 2 π 2 π 3 π 3 π 3 = π 1R 1 +π 2 R 2 +π 3 R 3 +π 4 R 4 π 4 π 4 π 4 The matrices {R 1,R 2,R 3,R 4 } form a basis for this model, e.g.: 0 1 1 1 R 1 = 0 1 0 0 0 0 1 0 0 0 0 1 Under permutations σ S 4 clearly R i R σ(i) i.e. F81 forms a representation of S 4. id is the trivial part: R 1 +R 2 +R 3 +R 4 (i.e. JC model!) (31) is what s left over: {R 1 R 2,R 1 R 3,R 1 R 4 } Jeremy Sumner Lie Markov models 15 / 23

What was that about a sledgehammer? F81 is a Lie-Markov model: [R i,r j ] = R i R j We can form the analogous model with constant columns: α 2 α 3 α 4 Q = α 1 α 3 α 4 α 1 α 2 α 4 = α 1C 1 +α 2 C 2 +α 3 C 3 +α 4 C 4 α 1 α 2 α 3 Again this provides the id (31) representation of S 4... Jeremy Sumner Lie Markov models 16 / 23

What was that about a sledgehammer? F81 is a Lie-Markov model: [R i,r j ] = R i R j We can form the analogous model with constant columns: α 2 α 3 α 4 Q = α 1 α 3 α 4 α 1 α 2 α 4 = α 1C 1 +α 2 C 2 +α 3 C 3 +α 4 C 4 α 1 α 2 α 3 Again this provides the id (31) representation of S 4... [C 1,C 2 ] = But this is not a Lie-Markov model! ( 3 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 )( 0 1 0 0 0 3 0 0 0 1 0 0 0 1 0 0 ) C 2 C 1 = ( 1 3 0 0 3 1 0 0 1 1 0 0 1 1 0 0 ) X Jeremy Sumner Lie Markov models 16 / 23

Her All-embracing Majesty, the general Markov model (!Weyl 1939) GM = id 2(31) (2 2 ) (21 3 ) In other words: 12 = 1+2 3+2+3. Our big idea: Models with symmetries must come as direct sum of irreducible bits Reduces computational complexity of use a sledgehammer approach just enough to solve the problem. i.e. id (31) = {R 1,R 2,R 3,R 4 } and {C 1,C 2,C 3,C 4 } Result (S et. al. JTB 2012): The Lie subalgebras of GM with full symmetry are JC, K3ST, F81, F+K, and GM. Result (Fernandez-Sanchez et. al. JMB 2015): There are (roughly) 35 Lie-Markov models with purine/pyrimidine symmetry. Jeremy Sumner Lie Markov models 17 / 23

The Lie-Markov models with purine/pyrimidine symmetry Jeremy Sumner Lie Markov models 18 / 23

More than Lie-Markov A matrix algebra A (as opposed to a Lie algebra), is a linear set of matrices closed under products: AB A All matrix algebras form Lie algebras automatically: [A,B] := AB BA A The reverse is not true (see next slide for counter example). In our 2015 characterization of models with purine/pyrimidine symmetry each model we found actually forms a matrix algebra. This is probably because the symmetry conditions are so strong. So do the equivariant models (Draisma and Kuttler 2009). Jeremy Sumner Lie Markov models 19 / 23

More than Lie-Markov: Noether s central dogma Any semi-group produces a Lie-Markov model under the regular representation, as follows. Consider the semigroup S with products xy = x. If S = {a 1,a 2,a 3,a 4 } we have, e.g.: 1 1 1 1 a 1 = 0 0 0 0 0 0 0 0 0 0 0 0 Jeremy Sumner Lie Markov models 20 / 23

More than Lie-Markov: Noether s central dogma Any semi-group produces a Lie-Markov model under the regular representation, as follows. Consider the semigroup S with products xy = x. If S = {a 1,a 2,a 3,a 4 } we have, e.g.: 1 1 1 1 a 1 = 0 0 0 0 0 0 0 0 0 0 0 0 This produces a Lie-Markov model: R i := 1+a i satisfying [R i,r j ] = [a i,a j ] = a i a j = R i R j. Jeremy Sumner Lie Markov models 20 / 23

More than Lie-Markov: Noether s central dogma Any semi-group produces a Lie-Markov model under the regular representation, as follows. Consider the semigroup S with products xy = x. If S = {a 1,a 2,a 3,a 4 } we have, e.g.: 1 1 1 1 a 1 = 0 0 0 0 0 0 0 0 0 0 0 0 This produces a Lie-Markov model: R i := 1+a i satisfying [R i,r j ] = [a i,a j ] = a i a j = R i R j. None other than the F81 model! π 1 π 1 π 1 Q = π 1 R 1 +π 2 R 2 +π 3 R 3 +π 4 R 4 = π 2 π 2 π 2 π 3 π 3 π 3 π 4 π 4 π 4 Jeremy Sumner Lie Markov models 20 / 23

Exactly Lie-Markov We know a few Lie-Markov models which do not form matrix algebras. i. Symmetric embedded Jarvis and S 2012. ii. AustMS 2015 model: L 1 = 3 0 0 1 0 1 L 2 = 1 0 2 1 0 1 2 0 1 0 0 3 Both models satisfy [L 1,L 2 ] = L 1 L 2, but have algebraic closures {L 1,L 2,X,Y,Z} and {L 1,L 2,X} respectively. e.g. 6 0 0 0 0 0 L 2 1 = 0 0 0 = 3L 1 +L 2 +X = 3L 1 +L 2 + 2 0 2 6 0 0 2 0 2 Jeremy Sumner Lie Markov models 21 / 23

Final thoughts Does anyone here have any other ideas on how to proceed? Does this issue matter in other contexts? Can the Lie-Markov condition be used as a productive constraint in other contexts? What about time-inhomogeneous Markov chains? What is the Lie-Markov condition saying in this case? Jeremy Sumner Lie Markov models 22 / 23

References Sumner JG, Fernandez-Sanchez J, Jarvis PD. 2012. Lie Markov models. Journal of Theoretical Biology Fernandez-Sanchez J, Sumner JG, Jarvis PD, Woodhams MD. 2015. Lie Markov models with purine/pyrimidine symmetry. Journal of Mathematical Biology Draisma J, Kuttler J. 2009. On the ideals of equivariant tree models. Mathematische Annalen Sumner, Holland, Woodhams, Kaine, Jarvis, Fenandez-Sanchez (2012). Is GTR bad for molecular phylogenetics?. Syst. Biol. JP Huelsenbeck, B Larget, ME Alfaro (2004). Bayesian phylogenetic model selection using reversible jump Markov chain Monte Carlo. Mol. Biol. Evol. D Posada, KA Crandall (1998). Modeltest: testing the model of DNA substitution. Bioinformatics. Tavare S (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In: Miura RM, editor. Lectures on mathematics in the life sciences. Volume 17. Providence (RI): American Mathematical Society. p. 57-86. Jeremy Sumner Lie Markov models 23 / 23