Nucleotide substitution models

Similar documents
EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Understanding relationship between homologous sequences

Molecular Population Genetics

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Computational Biology and Chemistry

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Evolutionary Change in Nucleotide Sequences. Lecture 3

What Is Conservation?

Lecture Notes: Markov chains

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Molecular Evolution and Comparative Genomics

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Dr. Amira A. AL-Hosary

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Markov Models & DNA Sequence Evolution

Trade Patterns, Production networks, and Trade and employment in the Asia-US region

Quantifying sequence similarity

Maximum Likelihood in Phylogenetics

BIOINFORMATICS TRIAL EXAMINATION MASTERS KT-OR

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Lecture 4. Models of DNA and protein change. Likelihood methods

Evolutionary Analysis of Viral Genomes

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Michael Yaffe Lecture #4 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Taming the Beast Workshop

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Biology 644: Bioinformatics

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

Counting phylogenetic invariants in some simple cases. Joseph Felsenstein. Department of Genetics SK-50. University of Washington

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

C.DARWIN ( )

Crew of25 Men Start Monday On Showboat. Many Permanent Improvements To Be Made;Project Under WPA

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

that a w r o n g has been committed recognize where the responsibility

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Modeling Noise in Genetic Sequences

Phylogenetic invariants versus classical phylogenetics

Using algebraic geometry for phylogenetic reconstruction

Variances of the Average Numbers of Nucleotide Substitutions Within and Between Populations

7.36/7.91 recitation CB Lecture #4

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Molecular Evolution and Phylogenetic Tree Reconstruction

The wonderful world of RNA informatics

Phylogenetic Assumptions

Review for Exam 2 Solutions

Maximum Likelihood in Phylogenetics

Variance and Covariances of the Numbers of Synonymous and Nonsynonymous Substitutions per Site

Phylogenetics. Andreas Bernauer, March 28, Expected number of substitutions using matrix algebra 2

arxiv:q-bio/ v1 [q-bio.pe] 27 May 2005

Pithy P o i n t s Picked I ' p and Patljr Put By Our P e r i p a tetic Pencil Pusher VOLUME X X X X. Lee Hi^h School Here Friday Ni^ht

Initial amounts: mol Amounts at equilibrium: mol (5) Initial amounts: x mol Amounts at equilibrium: x mol

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Molecular evolution 2. Please sit in row K or forward

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Week 5: Distance methods, DNA and protein models

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

Lecture 4. Models of DNA and protein change. Likelihood methods

Predicting the Evolution of two Genes in the Yeast Saccharomyces Cerevisiae

Phylogenetics. BIOL 7711 Computational Bioscience

Markov Repairable Systems with History-Dependent Up and Down States

Computational Design of New and Recombinant Selenoproteins

Estimation of evolutionary distances between homologous

Letter to the Editor. Department of Biology, Arizona State University

Reading for Lecture 13 Release v10

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Sequence Divergence & The Molecular Clock. Sequence Divergence

Rate Law Summary. Rate Laws vary as a function of time

Constructing Evolutionary/Phylogenetic Trees

Lecture 6 Free energy and its uses

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

EVOLUTIONARY DISTANCES

Today. Doubling time, half life, characteristic time. Exponential behaviour as solution to DE. Nonlinear DE (e.g. y = y (1-y) )

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Introduction to Polymer Physics

Predicting RNA Secondary Structure

The Genetic Code. Section I A 24 codon table and the exponent 2/3 series (ES) Section II Comparisons with simpler numeral series

Transition Theory Abbreviated Derivation [ A - B - C] # E o. Reaction Coordinate. [ ] # æ Æ

Phylogenetics: Building Phylogenetic Trees

5.111 Lecture Summary #18 Wednesday, October 22, 2014

The Genetic code. Section I. A 24 codon table and the exponent 2/3 series (ES) Åsa Wohlin

INVARIANTS STEVEN N. EVANS AND XIAOWEN ZHOU. Abstract. The method of invariants is an approach to the problem of reconstructing

Macroeconomics Qualifying Examination

d (5 cos 2 x) = 10 cos x sin x x x d y = (cos x)(e d (x 2 + 1) 2 d (ln(3x 1)) = (3) (M1)(M1) (C2) Differentiation Practice Answers 1.

More on phase diagram, chemical potential, and mixing

Quantitative Model Checking (QMC) - SS12

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

How to construct international inputoutput

Lecture 3: Markov chains.

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Special features of phangorn (Version 2.3.1)

(2 pts) a. What is the time-dependent Schrödinger Equation for a one-dimensional particle in the potential, V (x)?

D. Incorrect! That is what a phylogenetic tree intends to depict.

Transcription:

Nucleotide substitution models Alexander Churbanov University of Wyoming, Laramie Nucleotide substitution models p. 1/23

Jukes and Cantor s model [1] The simples symmetrical model of DNA evolution All sites change independently All sites have the same stochastic process working at them Nucleotide substitution models p. 2/23

Probability for nucleotide (1) Let us assume nucleotide residing at certain cite in DNA sequence is A, Consider probability that p A(t) that the site will be occupied by A at time t, Since we start with A, p A(0) = 1, At a time 1 probability of still having A is p A(1) = 1 3α. Nucleotide substitution models p. 3/23

Probability for nucleotide (2) At a time 2 p A(2) = (1 3α)p A(1) + α(1 p A(1) ) 1. The nucleotide has remained unchanged with probability 1 3α 2. The nucleotide did change to T, C, G, but subsequently reverted to A with probability α The following recurrence holds p A(t+1) = (1 3α)p A(t) + α(1 p A(t) ), p A(t+1) p A(t) = 3αp A(t) + α(1 p A(t) ), p A(t) = 3αp A(t) + α(1 p A(t) ) = 4αp A(t) + α. Nucleotide substitution models p. 4/23

Probability for nucleotide (2) At a time 2 p A(2) = (1 3α)p A(1) + α(1 p A(1) ) 1. The nucleotide has remained unchanged with probability 1 3α 2. The nucleotide did change to T, C, G, but subsequently reverted to A with probability α The following recurrence holds p A(t+1) = (1 3α)p A(t) + α(1 p A(t) ), p A(t+1) p A(t) = 3αp A(t) + α(1 p A(t) ), p A(t) = 3αp A(t) + α(1 p A(t) ) = 4αp A(t) + α. Nucleotide substitution models p. 5/23

Continuous time dp A(t) dt = 4αp A(t) + α. This first-order linear differential equation has solution p A(t) = 1 ( 4 + p A(0) 1 ) e 4αt 4 Initial probability is p A(0) = 1, thererefore p A(t) = 1 4 + 3 4 e 4αt Nucleotide substitution models p. 6/23

Probabilities If the initial nucleotide is not A, then p A(0) = 0 and p A(t) = 1 4 1 4 e 4αt generalizing for nucleotides i and j, where i j p ii(t) = 1 4 + 3 4 e 4αt p ij(t) = 1 4 1 4 e 4αt Nucleotide substitution models p. 7/23

Graphical interpretation Nucleotide substitution models p. 8/23

Sequence similarity (1) Nucleotide substitution models p. 9/23

Sequence similarity (2) A common measure for sequence similarity is the proportion of identical nucleotides between the two sequences under study. The expected value of this proportion is equal to the probability I (t) that the nucleotide at a given site at a time t is the same in both sequences. Cases include nucleotide conservation p 2 ii(t) and parallel substitutions p 2 ij(t). I (t) = p 2 AA(t) + p2 AT(t) + p2 AC(t) + p2 AG(t), I (t) = 1 4 + 3 4 e 8αt. Nucleotide substitution models p. 10/23

Estimating substitutions (1) The probability that the two sequences are different at a site at time t is p = 1 I (t) p = 3 4 ( 1 e 8αt ), 8αt = ln (1 43 p ). Nucleotide substitution models p. 11/23

Estimating substitutions (2) We estimate K, the actual number of substitutions per site since the divergence between the two sequences. In the one parameter model, K = 2(3αt), where 3αt is the expected number of substitutions per site in one lineage. ( ) 3 K = ln (1 43 ) 4 p Nucleotide substitution models p. 12/23

Kimura model [2] The method has the merit of incorporating the possibility that sometimes transition type substitutions (with rate α) may occur more frequently than transversion type substitutions (with rate β). Nucleotide substitution models p. 13/23

Kimura model [2] Same UU CC AA GG Total (Frequency) (R 1 ) (R 2 ) (R 3 ) (R 4 ) (R) Different, Type I UC CU AG GA Total (Frequency) (P 1 ) (P 1 ) (P 2 ) (P 2 ) (P ) Different, UA AU UG GU TypeII (Q 1 ) (Q 1 ) (Q 2 ) (Q 2 ) Total CA AC CG GC (Q) (Frequency) (Q 3 ) (Q 3 ) (Q 4 ) (Q 4 ) Nucleotide substitution models p. 14/23

Kimura model (1) Total rate of substitutions per site per year is k = α + 2β P is the probability of homologous sites showing a type I difference Q is the probability of homologous sites showing a type II difference R is the probability of homologous sites to be the same We denote probability of identity at homologous sites at time T as R(T) = 1 P(T) Q(T) Nucleotide substitution models p. 15/23

Kimura model (2) We can derive the equation for P and Q at time T + T in terms of P, Q and R at time T We can distinguish three ways by which UC (U at homologous position of organism 1 corresponding to C in organism 2) at time T + T is derived from various base pairs at time T. 1. Pair UC is derived from UC. Since probability of substitution in short time interval is T is (α + 2β) T. Thus the probability of no change occurring in both homologous sites is [1 (α + 2β) T] 2, so this case contribution is [1 (α + 2β) T] 2 P 1 (T) Nucleotide substitution models p. 16/23

Kimura model (3) 2. Pair UC is derived either from UU or from CC with probability α T [R 1 (T) + R 2 (T)] 3. Pair UC could be derived from UA, UG, AC and GC with probability β T[Q 1 (T) + Q 2 (T) + Q 3 (T) + Q 4 (T)] = β T Q(T) 2 Nucleotide substitution models p. 17/23

Kimura model (4) Combining contributions coming from different classes resulting in UC and disregarding terms with ( T) 2, we get (1) P 1 (T + T) = [1 (2α + 4β) T] P 1 (T) + α T [R 1 (T) + R 2 (T)] + β T Q(T) 2 Similarly, for the base pair AG we get (2) P 2 (T + T) = [1 (2α + 4β) T] P 2 (T) + α T [R 3 (T) + R 4 (T)] + β T Q(T) 2 Nucleotide substitution models p. 18/23

Kimura model (5) Summing equations (1) and (2), and noting P(T) = 2P 1 (T) + 2P 2 (T), we get (3) P(T) T = 2α 4(α + β)p(t) 2(α β)q(t) Q(T) (4) = 4β 8βQ(T) T Converting (3) and (4) to continuous case, we get dp(t) dt = 2α 4(α + β)p(t) 2(α β)q(t) dq(t) dt = 4β 8βQ(T) Nucleotide substitution models p. 19/23

Kimura model (5) The solution for these equations satisfy the initial condition P(0) = Q(0) = 0, i.e. no base difference exists at T = 0 P(T) = 1 4 1 2 e 4(α+β)T + 1 4 e 8βT Q(T) = 1 2 1 2 e 8βT Nucleotide substitution models p. 20/23

Kimura model (6) It follows that (5) and (6) so that (7) 4(α + β)t = ln(1 2P(T) Q(T)) 8βT = ln(1 2Q(T)) 4αT = ln(1 2P(T) Q(T)) + 1 2 ln(1 2Q(T)) Nucleotide substitution models p. 21/23

Kimura model (7) Since evolutionary rate is k = α + 2β, the total number of substitutions per two diverged sequences is K = 2Tk = 2αT + 4βT By omitting index T and following equations (6) and (7) we obtain K = 1 {(1 2 ln 2P Q) } 1 2Q Nucleotide substitution models p. 22/23

References [1] T.H. Jukes and C.R. Cantor, Evolution of protein molecules, Mammalian protein metabolism (H.N. Munro, ed.), Academic Press, New York, 1969, pp. 21 132. [2] M. Kimura, A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences, Journal of Molecular Evolution 16 (1980), 111 120. Nucleotide substitution models p. 23/23