Mutation models I: basic nucleotide sequence mutation models

Similar documents
Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Lecture Notes: Markov chains

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Lab 9: Maximum Likelihood and Modeltest

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

BMI/CS 776 Lecture 4. Colin Dewey

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Taming the Beast Workshop

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Week 5: Distance methods, DNA and protein models

Markov Chains. Sarah Filippi Department of Statistics TA: Luke Kelly

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Evolutionary Analysis of Viral Genomes

Reading for Lecture 13 Release v10

Lecture 4. Models of DNA and protein change. Likelihood methods

Inferring Molecular Phylogeny

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Phylogenetics: Building Phylogenetic Trees

Evolutionary Models. Evolutionary Models

UNIT 5. Protein Synthesis 11/22/16

The Transition Probability Function P ij (t)

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations

Computational Biology: Basics & Interesting Problems

EVOLUTIONARY DISTANCES

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Chapter 5. Continuous-Time Markov Chains. Prof. Shun-Ren Yang Department of Computer Science, National Tsing Hua University, Taiwan

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Statistics 992 Continuous-time Markov Chains Spring 2004

In: P. Lemey, M. Salemi and A.-M. Vandamme (eds.). To appear in: The. Chapter 4. Nucleotide Substitution Models

Maximum Likelihood in Phylogenetics

Genetic distances and nucleotide substitution models

Markov Chains Handout for Stat 110

Inferring Complex DNA Substitution Processes on Phylogenies Using Uniformization and Data Augmentation

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Molecular Evolution, course # Final Exam, May 3, 2006

What Is Conservation?

Answers Lecture 2 Stochastic Processes and Markov Chains, Part2

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

Statistics 150: Spring 2007

Inferring Speciation Times under an Episodic Molecular Clock

STOCHASTIC PROCESSES Basic notions

In: M. Salemi and A.-M. Vandamme (eds.). To appear. The. Phylogenetic Handbook. Cambridge University Press, UK.

Bayesian Analysis of Elapsed Times in Continuous-Time Markov Chains

π b = a π a P a,b = Q a,b δ + o(δ) = 1 + Q a,a δ + o(δ) = I 4 + Qδ + o(δ),

Reconstruire le passé biologique modèles, méthodes, performances, limites

Lecture 4. Models of DNA and protein change. Likelihood methods

Predicting the Evolution of two Genes in the Yeast Saccharomyces Cerevisiae

Maximum Likelihood in Phylogenetics

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Phylogenetic Inference using RevBayes

Irreducibility. Irreducible. every state can be reached from every other state For any i,j, exist an m 0, such that. Absorbing state: p jj =1

Modeling Noise in Genetic Sequences

Letter to the Editor. Department of Biology, Arizona State University

Markov chains and the number of occurrences of a word in a sequence ( , 11.1,2,4,6)

18.600: Lecture 32 Markov Chains

Phylogenetic Assumptions

Models of Molecular Evolution

Chapter 7: Models of discrete character evolution

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

8. Statistical Equilibrium and Classification of States: Discrete Time Markov Chains

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

18.175: Lecture 30 Markov chains

Stochastic processes and

Stochastic processes. MAS275 Probability Modelling. Introduction and Markov chains. Continuous time. Markov property

Definition A finite Markov chain is a memoryless homogeneous discrete stochastic process with a finite number of states.

Biology 2018 Final Review. Miller and Levine

Problems on Evolutionary dynamics

Interphase & Cell Division

Markov Chains, Stochastic Processes, and Matrix Decompositions

Evolutionary Theory and Principles of Phylogenetics. Lucy Skrabanek ICB, WMC February 25, 2010

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine.

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

CDA6530: Performance Models of Computers and Networks. Chapter 3: Review of Practical Stochastic Processes

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

MATH 564/STAT 555 Applied Stochastic Processes Homework 2, September 18, 2015 Due September 30, 2015

IEOR 6711: Professor Whitt. Introduction to Markov Chains

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321

Phylogenetics and Darwin. An Introduction to Phylogenetics. Tree of Life. Darwin s Trees

2 Discrete-Time Markov Chains

Understanding relationship between homologous sequences

On Estimating Topology and Divergence Times in Phylogenetics

Outlines. Discrete Time Markov Chain (DTMC) Continuous Time Markov Chain (CTMC)

Discrete quantum random walks

Lecture Notes: BIOL2007 Molecular Evolution

LTCC. Exercises solutions

Stochastic Processes

7. Tests for selection

Natural selection on the molecular level

Transcription:

Mutation models I: basic nucleotide sequence mutation models Peter Beerli September 3, 009 Mutations are irreversible changes in the DNA. This changes may be introduced by chance, by chemical agents, or radioactive decay. Types of mutations Many mutations involve larger sections of the DNA, these increase or decrease the DNA content Duplication: a segment of a chromosome is duplicated, resulting in potential overdoses of proteins that are coded in the duplicated regions, this mechanism is thought to be one of the driving forces of evolution because one of the two copies can now take another function. Deletions: a section of a chromosome is deleted, these section can be small ( base ) or large (many 0,000 bases) Translocation: one segment of a chromosome is exchanged with another segment of another chromosome. Inversion: a chromosome section is flipped. Point mutations: these affect always a single DNA site. Typically a nucleotide is damaged and during the next replication of the cell that damaged nucleotide is replaced with another nucleotide. Because all coding regions on the DNA are organized in triplets that are then used to synthetize the protein using transfer-rna and ribosomes, and because these tripes are redundant, there are potentially 4 3 triplet combinations (with four nucleotides A=adenine, C=cytosine, G=guanine, T=thymine [in RNA T is replaced by U=uracile]), but only 0

ISC537-Fall 007 Purine adenine guanine Pyrimidine Cytosine Thymine Figure : Building blocks of DNA common aminoacids and 3 STOP codons to mark the end of a coding sequence (Figure?? shows such a translation table between codons and aminoacids). Point mutations. A simple model DNA is built out 4 nucleotides and a model would need to take into account for mutations that allow to transition from one particular nucleotide to another, in the following section we reduce the problem to very simple model with two states U and Y but this shows the whole complexity of the modelling process. If you wish you could think of this as purines (either the nucleotide adenosine [A] or a guanine [G]) or pyrimidines (either cytosine [C] or thymine [T]) (Figure ). We have the states and substitution rate µ for the substitution rate from U to Y and the transition rate µ from Y to U (Figure ). This is a very simple model. We could think of introducing a different rate μ U Y μ Figure : Simple model with two states and one mutation rate for going from Y to U, but will refrain to do so for this outline. We will see later that there are models that set back-mutation to zero. The most common of these is the infinite sites mutation model. It will be discussed in chapter mutation models III.

ISC537-Fall 007.. Discrete time The model shown in Figure assumes that time is discrete and that we evaluate the transition from one state in time i to the same or other state in time i +, where the allele B is at risk to mutate to Y with rate µ or to stay in B with rate µ. The same logic applies to Y where Y changes to B with rate µ and stays in Y with rate µ. We can express this as a transition matrix ( ) µ µ R = µ µ () often expressed like this ( µ µ ) R = µ µ () R = I + R (3) where I is the Identity matrix. When we start with a specific state, say U, then we can evaluate in what state we will be in the next time-click. Running this process for many steps creates a Markov chain, in which the next state is only dependent on the current state and on the transition matrix that formulates the probabilities of change from U to Y or from Y to U or the probability of no change... Stationary distributions Unspoken assumptions of the above framework are: The Markov chain is irreducible: we can reach every state from any other, in our example we can go from U to Y and from Y to U. If we would set one of the transition rates to 0 then our framework will have a problem. The chain is aperiodic, we never go into a loop that cycles forever only in a subset of solutions. As long as µ and ν in our sample are not zero we will visit every state. All states of the chain are ergodic. When we run the chain infinitely long every state has a non-zero probability π i. So we can say that the rate matrix R has stationary distribution Π (diagonal matrix) ergodic relating to or denoting systems or processes with the property that, given sufficient time, they include or impinge on all points in a given space and can be represented statistically by a reasonably large selection of points. 3

ISC537-Fall 007 lim n > (Rn ) = Π or lim n > (Rn ) ij = π i (4)..3 Divergence matrix The matrix R gives the conditional probabilities: Rij k = Prob(in state j after k ticks state i) the matrix X(k) is defined as the divergence matrix X(k) ij = Prob(in state j after k ticks AND state i) We assume that the initial state was sampled from the stationary distribution (the process is already at equilibrium). Then X(k) ij = π i (R k ) ij or X(k) = ΠR k Typically we would think of Π as a matrix with the stationary frequencies on the diagonal and zero anywhere else. There is some inconsistency between the graph theory and the phylogenetic literature about rate matrices and sometimes the stationary frequencies are considered to be part of the rate matrix R.. Time reversible models We say that the Markov chain is time reversible if the divergence matrix X(k) is symmetric for all k so that X(t) ij = X(t) ji. Therefore we can calculate probabilities on trees without bothering whether this is forward or backward in time and any calculations can be made on unrooted trees. This assumptions makes the calculation of probabilities of a pair of nodes A and B that have a root C easy as the probabilities do not depend on the actual time of a node but only on the total branch length between A and B, so it is the simple addition of length A-C and C-B... Calculation of probabilities for continuous time [see also appendix] We discussed a discrete model that would force us to know the number and duration of the time ticks, both is typically not known. instead of having fixed time events we can assume that the events are drawn from a Poisson distribution, with probability µt (µt)k Prob(k events t, µ) = e (5) 4

ISC537-Fall 007 the expected number of events is µt and we call µ the expected number of events per unit time biologists would think of this as the mean instantaneous substitution rate. Let P (t) ij be the probability that being in state j at time t when started at time zero in state i. P (t) = ( ) Prob(U U t) Prob(U Y t) Prob(Y U t) Prob(Y Y t) (6) P (t) = = R k Prob(k events t, µ) (7) = e µt R k µt (µt)k e R k (µt)k The sum in formula 9 has the same form as the series approximation of the matrix exponentiation (8) (9) e A = A k (0) and so we get P (t) = e µt e Rµt () = e µti e Rµt I is the identity matrix () = e (R I)µt (3) = e Qµt (4) where Q = R I is the is the instantenous substitution rate matrix... Matrix exponentiation Matrix exponentiation is not all that easy, the formula 0 does converge only very slowly and often is not very useful. A better approach for some square and real matrices is using an Eigen 5

ISC537-Fall 007 decomposition because ea = P ed P ed 0 0 ed ed =... 0... (5)...... 0 0 (6) ednn where P are the Eigenvectors and D is the diagonal matrix of the Eigenvalues...3 Substitution matrix for our simple -state case With this simple case we can solve the equation analytically. Using MATHEMATICA we find! tµ! tµ Prob(U U t) Prob(U Y t) +e e (7) = tµ tµ Prob(Y U t) Prob(Y Y t) e +e Some prefer to name this matrix substitution matrix because in biology the term transition is occupied for the transition from a nucleotide A to G, G to A or from C to T, T to C..3 Nucleotide models Many models are possible and only few have names, we show not even all of these. Most commonly used are Jukes-Cantor, Kimura- parameter model, Felsenstein 8, Hasegawa-Kishino-Yano, Tamura-Nei, and General time reversible models. The differences between these models stems from A b C G a b b b b a A a b T C b G c i e f T Figure 3: Left: Kimura -parameter model. Rates differ between transitions and transversion. Right: General time reversible model. Rates are symmetrical but all pairs of substitution rates are different from each other. the fact that they allow for different numbers of parameters. We typically order the substitution matrix 6

ISC537-Fall 007 A C G T A - a b c C d - e f G g h - i T k l m - and allow for maximally modifiers of the overall mutation rate µ, labelled a to m. alternative ordering of the mutation matrix see Felsenstein s book, he orders A, G, C, T. For an Often in the literature there is no distinction between the divergence matrix X and the instantaneous rate matrix Q. In the following discussion we present the equilibrium frequencies as part of the formula. So that we express X(t) = ΠQµt (8).3. General time reversible model (GTR) µ(aπ C + bπ G + cπ T ) µaπ C µbπ G µcπ T µaπ A µ(aπ A + dπ G + eπ T ) µdπ G µeπ T Q = µbπ A µdπ C µ(bπ A + dπ C + fπ T ) µfπ T µcπ A µeπ C µfπ G µ(cπ A + eπ C + fπ G ) GTR is the most complex time-reversible model with 6 rate parameters r ij = a, b, c, d, e, f and base frequencies π A, π C, π G ; π T is not a real parameter because the base frequencies have to add up to. Both, r ij and the base frequencies, form the rates in the Q matrix. Calculating the probability transition matrix P for a branch with infinite.3. Jukes-Cantor Jukes-Cantor (JC) allows for a single parameter and has a transition matrix 3 4 µ 4 µ 4 µ 4 µ 4 Q = µ 3 4 µ 4 µ 4 µ 4 µ 4 µ 3 4 µ 4 µ 4 µ 4 µ 4 µ 3 4 µ 7

ISC537-Fall 007 The base frequencies π A, π C, π G, π T are all the same and 0.5. There are only two types of changes possible, either one does not change or one changes. This results in two probabilities: Prob(t) ii = 4 + 3 4 e µt (9) Prob(t) ij = 4 4 e µt (0) Compared to the most complex model, GTR, JC is setting all parameters a to f to and all base frequencies to the same value..3.3 Kimura s -parameter model Kimura s two-parameter (KP) allows for two parameters, different rates for transitions and transversion but still assumes equal base frequencies and has a transition matrix 4 µ(κ + ) 4 µ 4 µκ 4 µ 4 Q = µ 4 µ(κ + ) 4 µ 4 µκ 4 µκ 4 µ 4 µ(κ + ) 4 µ 4 µ 4 µκ 4 µ 4 µ(κ + ) The base frequencies π A, π C, π G, π T are all the same and 0.5. the parameter κ represents the transition bias, when κ = then the models reduces to the JC model. If the importance of transitions and transversions are rated equally then κ should be two because there are twice as many transversions a transitions. There are three types of changes possible: Prob(t) ii = 4 + 3 4 e µt () Prob(t) ij,transition = 4 + 4 e µt + κ+ e µt( ) () Prob(t) ij,tranversion = 4 4 e µt (3) Using GTR we can express the KP model as a = c = d = f = and b = e = κ. 8

ISC537-Fall 007 Probability.0 0.8 0.6 0.4 0. Transitions Transversions 3 4 5 Time Figure 4: Using different rates for transitions and transversions results in different substitution probabilities..3.4 Hasegawa-Kishino-Yano 985 and Felsenstein 984 Hasegawa-Kishino-Yano and Felsenstein relaxed the KP model and allowed for unequal base frequencies µ(κπ G + π Y ) µπ C µκπ G µπ T µπ A µ(κπ T + π R ) µπ G µκπ T Q HKY = µκπ A µπ C µ(κπ G + π Y ) µπ T µπ A µκπ C µπ G µ(κπ C + π R ) where π R = π A + π G and π Y = π C + π T. There are three types of changes possible: ( ) ( ) Prob(t) jj = π j + π j e µt Πj π j + e µtα (4) Π j Π j ( ) ( ) Prob(t) ij,transition = π j + π j e µt πj e µtα (5) Π j Π j Prob(t) ij,tranversion = π j ( e µt ) (6) where Π j = π A + π G if the base j is a purine, and Π j = π C + π T is a pyrimidine. The parameter α is different between the HKY and the F84 model: α = + Π j (κ ) for the HKY model α = κ + for the F84 model GTR expresses the HKY model setting a = c = d = f = and b = e = κ and unequal base frequencies. 9

ISC537-Fall 007.3.5 Summary of nucleotide models Using this simplified Q matrix we express all models (aπ C + bπ G + cπ T ) aπ C bπ G cπ T aπ A (aπ A + dπ G + eπ T ) dπ G eπ T Q = bπ A dπ C (bπ A + dπ C + fπ T ) fπ T cπ A eπ C fπ G (cπ A + eπ C + fπ G ) (7) GTR +all rates are different TN +Purine/Pyrimidins HKY F84 +Transition/Transversion F8 +uneqal base freq. KP +uneqal base freq. +Transition/Transversion JC 0

ISC537-Fall 007.3.6 Other models According to Huelsenbeck and Ronquist (005) there are 03 time-reversible models. It is not difficult to generate them and use the matrix-exponentiation method to get probability estimates, whether these models are all useful is a difficult question. Very few attempts to work with models that are not time-reversible were made [JF:0]..4 Reducing constraints.4. Rate variation among sites We assumed that each site has the same substitution rate. this can be relaxed when we treat the mutation rate as a random variable with appropriate distribution. Currently almost all programs use the gamma distribution. Its density function with parameters α, β > 0 is f(r α, β) = Γ(a) = 0 Γ(α)β α rα e r/β (8) Prob(t r)f(r α, β)dr (9) We think of this mutation rate variation often as two parameters: the mutation rate µ and the rate modifier r. The rate r is the random variable whereas the mutation rate is fixed. for practical purposes we typically assume the mean of r as and so we can simplify the number of parameters of the gamma distribution and reduce β to /α (the mean of the gamma distribution is αβ. To calculate the rates we use Prob(t r) = e Qµrt (30) this results in Prob(t r) = 0 Prob(t r)f(r α)dr (3) where we integrate each element of the matrix separately. The integral is time consuming and needs to be done whenever the branch length changes. A faster approximation is to use a discretization of the gamma distribution.

ISC537-Fall 007 Probability α = 0. α = α = 00 α = 0 0 Mutation rate µ [ 0 6 ] Figure 5: Gamma distributions with several values for α, β = /α 3 Study question. What happens when the models are not time reversible? Can you given an example?. Change the simple model so that it has a different back mutation, say ν, can you generate the transition matrix Q? It is easiest to think of ν = aµ. Appendix: Complete -state model example We have the states and substitution rate µ for the transition rate from U to Y, and the transition rate µ from Y to U. μ U Y μ with transition matrix ( ) µ µ R = µ µ (3)

ISC537-Fall 007 Assuming a Poisson process with the probability that k events are needed to go from state to state in time t is µt (µt)k Prob(k events t, µ) = e Writing all transition probabilities into a matrix, we get P (t) = ( ) Prob(U U t) Prob(U Y t) Prob(Y U t) Prob(Y Y t) (33) (34) Inserting and substitution we get P (t) = = R k Prob(k events t, µ) (35) = e µt R k µt (µt)k e R k (µt)k The sum in formula 37 has the same form as the series approximation of the matrix exponentiation (36) (37) e A = A k (38) and so we get P (t) = e µt e Rµt (39) = e µti e Rµt I is the identity matrix (40) = e (R I)µt (4) = e Qµt (4) where Q = R I is the is the instantaneous substitution rate matrix. exp Qµt can be solved in many ways, but we use one of the simpler methods. Eigendecomposition A = P ΛP (43) 3

ISC537-Fall 007 with the eigenvalues on the diagonal matrix Λ and the eigenvectors P. But before we do that we simplify the exponent, by scaling time by the (expected) mutation rate µ, so that our new scaled time τ = µt; therefore we have exp(qτ). Our example decomposes then into ( ) ( ) µ µ µτ µτ Qτ = (R I)τ = τ = µ µ µτ µτ (44) ( ) ( ) ( ) 0 0 Qτ = P ΛP = 0 µτ (45) ( ) ( ) ( ) 0 0 = 0 µτ (46) The exponentiation can be done on the diagonal elements only of Λ, but standard matrix exponentiation Λ k = Λ Λ...Λ k results in the same result, ( ) ( ) ( ) e P (t) = e Qτ 0 0 = 0 e µτ ( ( + e τµ ) = ) ( e τµ + e τµ ) e τµ (47) (48) 4