Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Similar documents
Lecture Notes: Markov chains

Understanding relationship between homologous sequences

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

7.36/7.91 recitation CB Lecture #4

Reading for Lecture 13 Release v10

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Evolutionary Analysis of Viral Genomes

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Mutation models I: basic nucleotide sequence mutation models

The neutral theory of molecular evolution

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Taming the Beast Workshop

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Nucleotide substitution models

Introduction to Polymer Physics

Probabilistic modeling and molecular phylogeny

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Properties of amino acids in proteins

Lecture 4. Models of DNA and protein change. Likelihood methods

Quantifying sequence similarity

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

Evolutionary Models. Evolutionary Models

Computational Biology: Basics & Interesting Problems

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Similarity or Identity? When are molecules similar?

BMI/CS 776 Lecture 4. Colin Dewey

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Proteins: Characteristics and Properties of Amino Acids

Dr. Amira A. AL-Hosary

Amino Acid Side Chain Induced Selectivity in the Hydrolysis of Peptides Catalyzed by a Zr(IV)-Substituted Wells-Dawson Type Polyoxometalate

Other Methods for Generating Ions 1. MALDI matrix assisted laser desorption ionization MS 2. Spray ionization techniques 3. Fast atom bombardment 4.

Major Types of Association of Proteins with Cell Membranes. From Alberts et al

Get started on your Cornell notes right away

7.012 Problem Set 1. i) What are two main differences between prokaryotic cells and eukaryotic cells?

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Advanced Topics in RNA and DNA. DNA Microarrays Aptamers

Translation. A ribosome, mrna, and trna.

Lecture 4. Models of DNA and protein change. Likelihood methods

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Molecular Evolution and Comparative Genomics

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

O 3 O 4 O 5. q 3. q 4. Transition

5. DISCRETE DYNAMICAL SYSTEMS

The translation machinery of the cell works with triples of types of RNA bases. Any triple of RNA bases is known as a codon. The set of codons is

C.DARWIN ( )

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

CSE 549: Computational Biology. Substitution Matrices

Week 5: Distance methods, DNA and protein models

Molecular evolution 2. Please sit in row K or forward

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Energy Minimization of Protein Tertiary Structure by Parallel Simulated Annealing using Genetic Crossover

Lecture 15: Realities of Genome Assembly Protein Sequencing

EVOLUTIONARY DISTANCES

C CH 3 N C COOH. Write the structural formulas of all of the dipeptides that they could form with each other.

Natural selection on the molecular level

Markov Chains. Sarah Filippi Department of Statistics TA: Luke Kelly

Molecular Population Genetics

Packing of Secondary Structures

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Introduction to Comparative Protein Modeling. Chapter 4 Part I


Bustamante et al., Supplementary Nature Manuscript # 1 out of 9 Information #

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Genomics and bioinformatics summary. Finding genes -- computer searches

Cladistics and Bioinformatics Questions 2013

Molecular Evolution and Phylogenetic Analysis

Any protein that can be labelled by both procedures must be a transmembrane protein.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Endowed with an Extra Sense : Mathematics and Evolution

Lab 2A--Life on Earth

Peptides And Proteins

π b = a π a P a,b = Q a,b δ + o(δ) = 1 + Q a,a δ + o(δ) = I 4 + Qδ + o(δ),

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Ranjit P. Bahadur Assistant Professor Department of Biotechnology Indian Institute of Technology Kharagpur, India. 1 st November, 2013

Introduction to the Ribosome Overview of protein synthesis on the ribosome Prof. Anders Liljas

Biol478/ August

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Phylogenetic inference

Physiochemical Properties of Residues

Processes of Evolution

Chapter 9 DNA recognition by eukaryotic transcription factors

Lecture 10: Cyclins, cyclin kinases and cell division

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

-1- HO H O H. β-d-galactopyranose (A) OMe CH 3 I. MeO. Ag 2 O. O Me HNO 3. OAc (CH 3 CO) 2 O. AcO. pyridine. O Ac. NaBH 4 H 2 O. MeOH. dry HCl.

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Supplementary Figure 3 a. Structural comparison between the two determined structures for the IL 23:MA12 complex. The overall RMSD between the two

BINF6201/8201. Molecular phylogenetic methods

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Transcription:

Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral theory came from observations on the rate of amino acid replacements in proteins. When extrapolated to the entire genome, the inferred rate of evolution was several nucleotide substitutions per year. This rate was regarded as much to high to result from natural selection, because the intensity of selection must be limited by the total amount of differential survival and reproduction that occurs in the organism. Direct DNA sequencing later revealed that rates of nucleotide substitution vary according to the function or presumed absence of function of the nucleotides. The first 18 amino acids in the amino terminal end of the human and mouse gamma interferon proteins are a signal peptide that is used in the secretion of molecules. The sequences are: Hum: Met Lys Try Thr Ser Tyr Ile Leu Ala Phe Gln Leu Cys Ile Val Leu Gly Ser Mou: Met Asn Ala Thr His Tyr Cys Leu Ala Leu Gln Leu Phe Leu Met Ala Val Ser We could simply count the number of sites that differ in the two sequences: there are 10/18 that differ, or 0.56. To interpret this, let us assume that amino acid replacements occur at the rate λ per unit time. Consider two independently evolving sequences, initially identical, which at time t are found to differ in proportion D t of their amino acids. After the next time interval, the proportion of differences is just D t+1 = (1 D t )(2λ)+D t, so we add to the already existing proportion of differences those that might have arisen in the current time interval. The factor of 2 is present because the total time for evolution is 2t time units (t units in each lineage). The equation ignores the unlikely possibility of an amino acid replacement making two previously different sites identical. If λ is the rate of amino acid replacement per unit time, then the probability that a particular site remains unsubstituted for t consecutive intervals along each of two independent lineages is (1-λ) 2t which is approximately e 2λt if λt is not too large. So, the probability of 1 or more replacements is just 1 minus this quantity, or about (1 e 2λt ). Since λ is the rate of amino acid replacement per unit time, the expected proportion of differences between two sequences at any time t is K=2λt. Substituting and rearranging gives the estimate of K from the observed differences K = ln(1 D t ). This quantity is used in preference to D because it takes multiple substitutions into account, especially over a long period of time. Rates of amino acid substitution vary over a 500-fold range in different proteins. The rate of amino acid replacement in gamma interferon is one of the largest rates known.

Among the slowest rates is that of histone H4, for which the substitution rate λ is 0.01 x -9 10 per year. The average rate for many proteins is very close to the rate found in hemoglobin, which is about 1 x 10-9 amino acid replacement per amino acid site per year. So far we have done our computations on polymorphism sequences based on just the observed counts, all on the assumption that we were looking at individuals from the same species. However, when we look across species, or even within species, if the time since common divergence is long enough, there is always the possibility that the same nucleotide site position could have changed more than once. For example, position 1, say, could first be an A; then be substituted with a T; and then once more back to A. In this case, we would not see any change from the ancestral state even though a change had occurred. We call this a multiple hit. (So in a way this relaxes the constraint of the infinite sites model.) In general then, the number of nucleotide substitutions that we observe, and that we have been calling mutations, are always an undercount of the true number. There are many models of nucleotide substitutions that have been proposed to provide a correction factor for this possibility. Nucleotide sequences are analyzed in the same manner as amino acid sequences, but we have to correct for cases where a substitution makes two previously different nucleotide sites identical. This is significant because an expected 1/3 of random substitutions will make two previously different nucleotides identical, while the chance is only 1/19 for amino acids. Many models of nucleotide substitution have been proposed, which differ mostly in the assumptions about rates of mutations between pairs of nucleotides. The simplest model for nucleotide substitution assumes that mutation occurs at a constant rate, and each nucleotide is equally likely to mutate to any other, the Jukes-Cantor model. If α is the rate of mutating from one nucleotide to another, then in any time interval, A mutates to C with probability α, A mutates to T with probability α, and A mutates to G with probability α. The probability that it does not mutate in this time interval is therefore (1 3α), The probability that a particular site is A at time t+1 is the probability of having been A at time t and not mutating, plus the probability of being at any other nucleotide an mutating to A: P A(t+1) = (1 3α)P A(t) +α(1 P A(t) ). From this we can obtain a simple differential equation: dp A(t ) =!4"P A(t ) + " which as we know show has the solution: P A(t ) = 1 4 + 3 4 e!4"t To solve this, we can use an integrating factor, multiplying both sides by f(t):

f (t)4!p A(t ) + f (t) dp A(t ) =! f (t) we need f (t) = f (t)4! so that we can use integration by parts so if f (t) = f (t)4!, then f (t) = e 4!t Substitution this value for f, we get: dp e 4!t 4!P A(t ) + e 4!t A(t ) =!e 4!t d ( e 4!t P A(t ) ) =!e 4!t d " e 4!t P A(t ) =!e 4!t + C ( e 4!t P A(t ) ) = "!e 4!t =!e 4!t + C To find the value of C, we use the fact that at time t=0, P A (0)=1 (we start at A). This gives us: 1 = 1 4 + Ce!4" 0 or C = 3 4 If we observe two sequences that have been separated for time t, then the probability that they continue to carry the same nucleotide at a particular site is: P AA = 1 4 + 3 4 e!8"t The proportion of sites that differ between the two sequences is just 1 minus this, or D=(1 P AA ). Therefore, D == 3 4 ( 1! e!8"t ) In our previous symbols, λ is the rate of mutation to a nucleotide different from the current nucleotide, so relating this to α, we have λ=3α. This implies that K=2λt=2(3α)=6αt. Taking logs of both sides of the equation above, we get: and since K=3/4(8αt), # 8!t = " ln 1" 4 $ % 3 D & ( K =! 3 4 ln " 1! 4 # $ 3 D % & This is the Jukes-Cantor correction factor. If one lets t go to infinity, the divergence will approach ¾, which makes sense: after enough time, the common ancestry of the two sequences has been erased, and ¼ of all sites will match by chance.

Example 1. If 25% of observed sites are different, D=0.25, then plugging this into the formula, the actual number of substitutions is K=0.3404, higher, as expected. Felsenstein s intuitive picture. To construct a more general picture, we can posit a 4x4 transition probability matrix, whose entries give the probability that a particular site will change to another site in one time-step. Let u be the mutation rate per unit per site; P ij = the probability that base i changes to base j in one generation. If t is large and u is small, we can approximate the process as a Poisson, on a continuous scale, with a constant risk of mutation. Note that if we had a type of mutation that changed a base to any one of the four bases chosen at random, with equal probability of the four outcomes, it would almost be like Jukes- Cantor, except that ¼ of the time it would make no change at all (it would do nothing, i.e., no event would occur at all). Now we imagine that we alter this new model by increasing the mutation rate to 4/3u. In this (rescaled) model, the mutations that change the bases occur at rate u, continuously in time (this includes the events that change, e.g., a base A to the same base A). Now this is just a Poisson model with mean rate of change 4/3u, so we can write the probability that k mutations occur, given that time t has passed, as follows: k " 4 Pr[k t] = e! 4 3 ut # $ 3 u % & k! Now we can easily compute the probability that we wind up with base j having started with a base i. The probability that zero base-changing events occur at all during time t is j just the 0 th term of the Poisson, which is just exp( 4/3ut) (The (4/3u) term drops out when j=0, as usual.). Therefore, the probability that there is at least one event that picks a base ¼ at random is 1 this probability, or (1 exp( 4/3ut)). ¼ of the time this change will be to some particular base. Putting this all together, we have: P ij = 1 4 (1! e! 4 3 ut ) (for i " j) P ij = 1 4 (1! e! 4 3 ut ) + e! 4 3 ut (for i = j) because in the second case we must add the probability that no base-changing event has happened at all, leaving us at base i. Thus, if we add up the expected proportion of sites that two sequences will differ after time t, it is just 3 times the first case above, giving us the expected fraction of sites that differing: D == 3 " 4 4 1! e! 3 ut # $ % & Note that the re-parameterization via the ut term is meant to replace the parameter K in the earlier derivation: the value of t here is actually the time 2t (total divergence time since common ancestor) given earlier.

The Kimura 2-parameter model. The Jukes-Cantor model suffers from several biological inaccuracies. For one thing, all transitions are assumed equal. However, it is well known that the rate of mutations between the two chemically similar purine bases (Adenine and Guanine, A and G) or between the two chemically similar pyrimidine bases (Cytosine and Thymine, C and T), is much higher (generally more than double) the rate of mutations between a purine and a pyrimidine (e.g., from an A to a T). This is particularly true, e.g., in mammals. The purine-purine or pyrimidine-pyrimidine mutations are called transitions and the purinepyrimidine mutations, transversions. If there were only one mutation parameter, as in the JC model, we would expect a 2:1 ration of transversions to transitions. Estimates from mammalian sequences, however, show that 57% of changes are actually transitions. There are two possible transition pairs (A-G and C-T) and four possible transversion pairs (A-T, A-C, G-T, G-C), discounting self-loops. So, in all there are 4 possible transitions, 8 transversions, and 4 self-loops. This suggests that one ought to move from a one parameter, single rate model to at least a 2 parameter rate model, one for the transition mutation rate (often denoted t s ) and one parameter for the transversion mutation rate (often denoted t v ). This is what Kimura suggested; this model is known as K2P. In the formulations of this model, we will often see the difference between the transition mutation rate and the transversion mutation rate expressed as a ratio, R= transition/transversion rates, or R=t s /t v. We can set up a differential equation as before to solve for the K2P model and its correction factor. If the time divergence is not too great, this will be close to Jukes- Cantor. We can also solve this model using general methods for continuous time Markov chains, as shown later for a slightly different model. Or, one can do the same sort of reparameterizing trick as with the Jukes-Cantor model, which we also defer to the next section. For now, we simply give the solution. If P denotes the total frequency of transitions and Q the total probability of transversions then the K2P says that: P = (1/ 4)(1! 2e!4(" +# )t + e!#t ) Q = (1/ 2)(1! e!8#t ) K = 2$t = 2"t + 4#t =!(1/ 2)ln(1! 2P! Q)! (1/ 4)(1! 2Q) For the Kimura model, we can show that the equilibrium frequency for all bases is ¼, irrespective of their initial frequency this follows from the symmetry of the model, as in Jukes-Cantor. However, turning this around, also like Jukes-Cantor, the formulas are applicable irrespective of the initial base frequency, so the model is applicable to a wider range of cases than many other models that are more complex. Further, the more complex models require one to estimate more parameters. We ll probe this further a bit later.

Example. You are given two sequences (50 bp each) of the homologous pseudogene in two species of yeast. (Pseudogene: non-transcribed gene that has decayed and is subject to purely neutral evolution.) The alignment is shown below: Species 1: CCTCGACGGCTTAGATCTGATCTGACCTAATGCTGCAATCGTTACAAAGT Species 2: CCTCCACGAGTAAGAGTTGATCCGACTTAGTCCTGCGATCGTTAGATAAT Note that Jukes-Cantor isn t exactly applicable here because there are 7 transitions and 7 transversions, 14 substitutions out of 50 base pairs. However, the Jukes-Cantor model assumes that there are twice as many transversions as transitions. To apply the Kimura correction, if we knew, say, that the two species diverged 10 million years ago, and there are 50 generations per year, the divergence time is 2 x 500 million or 10 9, but we d also have to get estimates of the transition/transversion ratio. In this data, it looks to be 1. The Jukes-Cantor correction would proceed as follows: There are 14 substitutions in 50 bp. D = 14/50 = 0.28. K = (3/4)*ln(1-(4/3)*D)=0.35. You should figure out what the difference is for the Kimura model, given R=1. Finally, notice that if we have these figures, we can determine the mutation rate (assuming that it is equal for the two species): K = 2t (years)*μ(mutations per bp per year) μ = 0.35/20 Myr = 1.75 10-8 mutations per bp per year Forthcoming: the matrix method of solution for estimating substitution corrections.