Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change)

Joe Felsenstein
Department of Genome Sciences and Department of Biology

The Jukes-Cantor model (1969)
[Diagram: the four bases A, G, C and T, with every base changing to each of the other three at rate u/3]
The simplest symmetrical model of DNA evolution.

Transition probabilities under the Jukes-Cantor model
All sites change independently.
All sites have the same stochastic process working at them.
Make up a fictional kind of event, such that when it happens the site changes to one of the 4 bases chosen at random (equiprobably).
Assertion: having these events occur at rate $\frac{4}{3}u$ is the same as having the Jukes-Cantor model events occur at rate $u$.
The probability that none of these fictional events happens in time $t$ is $\exp(-\frac{4}{3}ut)$.
No matter how many of these fictional events occur, provided it is not zero, the chance of ending up at a particular base is $\frac{1}{4}$.

Jukes-Cantor transition probabilities, cont'd
Putting all this together, the probability of changing to C, given the site is currently at A, in time $t$ is
$$\mathrm{Prob}(C \mid A, t) = \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
while
$$\mathrm{Prob}(A \mid A, t) = e^{-\frac{4}{3}ut} + \frac{1}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
or
$$\mathrm{Prob}(A \mid A, t) = \frac{1}{4}\left(1 + 3e^{-\frac{4}{3}ut}\right)$$
so that the total probability of change is
$$\mathrm{Prob}(\text{change} \mid t) = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$$
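The Jukes-Cantor probabilities can be written out as a minimal Python sketch. The function names are illustrative only and are not taken from the lecture; u is the substitution rate and t the elapsed time, as in the formulas.

```python
import math


def jc_prob_same(u, t):
    """Probability that a site shows the same base after time t."""
    return 0.25 * (1.0 + 3.0 * math.exp(-4.0 / 3.0 * u * t))


def jc_prob_specific_change(u, t):
    """Probability of ending at one particular different base (e.g. A to C)."""
    return 0.25 * (1.0 - math.exp(-4.0 / 3.0 * u * t))


def jc_prob_any_change(u, t):
    """Total probability that the site shows a different base after time t."""
    return 0.75 * (1.0 - math.exp(-4.0 / 3.0 * u * t))
```

The probability of staying plus three times the probability of one specific change should sum to 1 for any u and t, which is a quick sanity check on the formulas.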

Fraction of sites different, Jukes-Cantor
[Figure: differences per site plotted against branch length under the Jukes-Cantor model. The curve rises toward an asymptote of 0.75; the marked point is 0.49 differences per site at branch length 0.7945.]
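The curve can also be read in the other direction: given an observed fraction of differing sites, solve the Jukes-Cantor change probability for the branch length ut. The sketch below does that by inverting the formula; the function name jc_distance is hypothetical and not from the lecture.

```python
import math


def jc_distance(p):
    """Branch length (u*t) implied by a fraction p of differing sites.

    Inverts p = 3/4 * (1 - exp(-4/3 * u * t)). The fraction must lie
    below the 0.75 asymptote, otherwise no finite branch length exists.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("fraction of differences must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For the point marked on the figure, a fraction of 0.49 maps back to a branch length of roughly 0.79.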

Kimura's (1980) K2P model of DNA change
[Diagram: transitions (A to G, C to T, and the reverse) occur at rate a; each of the transversions between a purine and a pyrimidine occurs at rate b]
which allows for different rates of transitions and transversions.

Motoo Kimura
[Photo: Motoo Kimura, with family in Mishima, Japan, in the 1960s]

Transition probabilities for the K2P model
with two kinds of events:
I. At rate α, if the site has a purine (A or G), choose one of the two purines at random and change to it. If the site has a pyrimidine (C or T), choose one of the pyrimidines at random and change to it.
II. At rate β, choose one of the 4 bases at random and change to it.
By proper choice of α and β one can achieve the overall rate of change and $T_s/T_n$ ratio $R$ you want. For rate of change 1, the transition probabilities (warning: terminological tangle) are
$$\mathrm{Prob}(\text{transition} \mid t) = \frac{1}{4} - \frac{1}{2}\exp\left(-\frac{2R+1}{R+1}t\right) + \frac{1}{4}\exp\left(-\frac{2}{R+1}t\right)$$
$$\mathrm{Prob}(\text{transversion} \mid t) = \frac{1}{2} - \frac{1}{2}\exp\left(-\frac{2}{R+1}t\right)$$
(the transversion probability is the sum of the probabilities of both kinds of transversions).
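As a rough illustration, the two K2P probabilities can be evaluated in a few lines of Python. This sketch assumes the overall rate of change is 1, as on the slide; the function names are hypothetical.

```python
import math


def k2p_prob_transition(R, t):
    """Probability that a site differs by a transition after branch length t."""
    return (0.25
            - 0.5 * math.exp(-(2.0 * R + 1.0) / (R + 1.0) * t)
            + 0.25 * math.exp(-2.0 / (R + 1.0) * t))


def k2p_prob_transversion(R, t):
    """Probability that a site differs by either kind of transversion."""
    return 0.5 - 0.5 * math.exp(-2.0 / (R + 1.0) * t)


def k2p_prob_any_change(R, t):
    """Total probability of an observed difference."""
    return k2p_prob_transition(R, t) + k2p_prob_transversion(R, t)
```

With R = 1/2 all three possible changes are equally likely, so the total should reduce to the Jukes-Cantor value (3/4)(1 - exp(-4t/3)). For very long branches the transition probability tends to 1/4 and the transversion probability to 1/2, whatever the value of R.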

Transitions, transversions expected in different amounts of branch length under the K2P model, for $T_s/T_n = 10$
[Figure: expected differences (total differences, transitions, transversions) plotted against time (branch length) from 0.0 to 3.0, for R = 10]

Transitions, transversions expected in different amounts of branch length under the K2P model, for $T_s/T_n = 2$
[Figure: expected differences (total differences, transversions, transitions) plotted against time (branch length) from 0.0 to 3.0, for R = 2]

Other commonly used models include:
Two models that specify the equilibrium base frequencies (you provide the frequencies π_A, π_C, π_G, π_T and they are set up to have an equilibrium which achieves them), and also let you control the transition/transversion ratio:
The Hasegawa-Kishino-Yano (1985) model:

         to:   A              G              C              T
from:    A     -              απ_G + βπ_G    απ_C           απ_T
         G     απ_A + βπ_A    -              απ_C           απ_T
         C     απ_A           απ_G           -              απ_T + βπ_T
         T     απ_A           απ_G           απ_C + βπ_C    -

My F84 model

         to:   A                  G                  C                  T
from:    A     -                  απ_G + βπ_G/π_R    απ_C               απ_T
         G     απ_A + βπ_A/π_R    -                  απ_C               απ_T
         C     απ_A               απ_G               -                  απ_T + βπ_T/π_Y
         T     απ_A               απ_G               απ_C + βπ_C/π_Y    -

where π_R = π_A + π_G and π_Y = π_C + π_T (the equilibrium frequencies of purines and pyrimidines).
Both of these models have formulas for the transition probabilities, and both are subcases of a slightly more general class of models, the Tamura-Nei model (1993).
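The HKY and F84 rate tables can be sketched in Python as dictionaries of off-diagonal rates. This is only an illustration under assumptions: the function names, the dictionary layout, and the example base frequencies are all hypothetical, not part of the lecture. The final helper checks that the rate from i to j weighted by π_i equals the rate from j to i weighted by π_j, which both tables satisfy.

```python
BASES = "AGCT"
PURINES = {"A", "G"}


def is_transition(i, j):
    """True when i and j are both purines or both pyrimidines."""
    return (i in PURINES) == (j in PURINES)


def hky_rates(pi, alpha, beta):
    """Off-diagonal HKY rates: alpha*pi_j, plus beta*pi_j for transitions."""
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                extra = beta * pi[j] if is_transition(i, j) else 0.0
                q[i, j] = alpha * pi[j] + extra
    return q


def f84_rates(pi, alpha, beta):
    """Off-diagonal F84 rates: the transition term is beta*pi_j/pi_R or pi_Y."""
    pi_r = pi["A"] + pi["G"]
    pi_y = pi["C"] + pi["T"]
    q = {}
    for i in BASES:
        for j in BASES:
            if i != j:
                group = pi_r if j in PURINES else pi_y
                extra = beta * pi[j] / group if is_transition(i, j) else 0.0
                q[i, j] = alpha * pi[j] + extra
    return q


def is_reversible(pi, q, tol=1e-12):
    """Check pi_i * q_ij == pi_j * q_ji for every pair of bases."""
    return all(abs(pi[i] * q[i, j] - pi[j] * q[j, i]) < tol for (i, j) in q)


# Hypothetical equilibrium frequencies, chosen only for illustration.
example_pi = {"A": 0.1, "G": 0.2, "C": 0.3, "T": 0.4}
```

Calling is_reversible(example_pi, hky_rates(example_pi, 1.0, 2.0)) and the same check on f84_rates should both report True.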

Reversibility
[Diagram: states i and j, with equilibrium frequencies π_i and π_j and transition probabilities P_ij and P_ji between them, illustrating the reversibility condition π_i P_ij = π_j P_ji]

The General Time-Reversible model (GTR)
It maintains "detailed balance" so that the probability of starting at (say) A and ending at (say) T in evolution is the same as the probability of starting at T and ending at A:

         to:   A       G       C       T
from:    A     -       απ_G    βπ_C    γπ_T
         G     απ_A    -       δπ_C    ɛπ_T
         C     βπ_A    δπ_G    -       υπ_T
         T     γπ_A    ɛπ_G    υπ_C    -

And there is of course the general 12-parameter model which has arbitrary rates for each of the 12 possible changes (from each of the 4 nucleotides to each of the 3 others).
(Neither of these has formulas for the transition probabilities, but those can be done numerically.)
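One way to "do it numerically" is to exponentiate the rate matrix. The sketch below builds a GTR rate matrix from the table and computes P(t) = exp(Qt) with a plain scaling-and-squaring Taylor series in pure Python. The numerical method, the function names, and the layout of the rates argument are assumptions made for illustration; a linear-algebra library routine would normally be used instead.

```python
BASES = "AGCT"


def gtr_rate_matrix(pi, rates):
    """Build the 4x4 GTR rate matrix.

    pi maps each base to its equilibrium frequency. rates maps the six
    unordered pairs "AG", "AC", "AT", "GC", "GT", "CT" to the symmetric
    coefficients (alpha, beta, gamma, delta, epsilon, upsilon in the table).
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                coef = rates.get(a + b, rates.get(b + a))
                q[i][j] = coef * pi[b]
        # Diagonal is set last so that each row sums to zero.
        q[i][i] = -sum(q[i])
    return q


def mat_mul(x, y):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate exp(q * t): Taylor series on q*t/2**squarings, then square."""
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(4)] for i in range(4)]
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(4)]
                  for i in range(4)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result
```

With equal base frequencies and all six coefficients set to 4/3, every off-diagonal rate becomes 1/3, so the result should agree with the Jukes-Cantor probabilities for u = 1.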

Relation between models
There are many other models, but these are the most widely used ones. Here is a general scheme of which models are subcases of which other ones (each indented model is a subcase of the one above it):

General 12-parameter model (12)
  General time-reversible model (9)
    Tamura-Nei (6)
      HKY (5) and F84 (5)
        Kimura K2P (2)
          Jukes-Cantor (1)

Rate variation among sites
In reality, rates of evolution are not constant among sites.
Fortunately, in the transition probability formulas, rates come in as simple multiples of times.
Thus if we know the rates at two sites, we can compute the probabilities of change by simply, for each site, multiplying all branch lengths by the appropriate rate.
If we don't know the rates, we can imagine averaging them over a distribution of rates. Usually the Gamma distribution is used.
In practice a discrete histogram of rates approximates the integration. (For the Gamma it seems best to use Generalized Laguerre Quadrature to pick the rates and frequencies in the histogram.)
Also, there are actually autocorrelations, with neighboring sites having similar rates of change. This can be handled by Hidden Markov Models, which we cover later.
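A small sketch of the averaging idea, using the Jukes-Cantor change probability with u = 1 as the per-site formula. It assumes a Gamma distribution of rates with mean 1 and builds the histogram with a plain midpoint rule, which is a simple stand-in and not the Generalized Laguerre Quadrature mentioned above. All function names and default settings are illustrative assumptions.

```python
import math


def jc_prob_change(branch_length):
    """Jukes-Cantor probability of an observed difference (u = 1)."""
    return 0.75 * (1.0 - math.exp(-4.0 / 3.0 * branch_length))


def gamma_density(r, shape):
    """Density of a Gamma distribution with the given shape and mean 1."""
    return (shape ** shape * r ** (shape - 1.0)
            * math.exp(-shape * r) / math.gamma(shape))


def rate_histogram(shape, n_bins=2000, upper=20.0):
    """Discrete rates and weights from a midpoint rule on [0, upper]."""
    width = upper / n_bins
    rates = [(k + 0.5) * width for k in range(n_bins)]
    weights = [gamma_density(r, shape) * width for r in rates]
    total = sum(weights)
    return rates, [w / total for w in weights]


def averaged_prob_change(t, shape):
    """Average the change probability over the rate histogram.

    Each rate simply multiplies the branch length t.
    """
    rates, weights = rate_histogram(shape)
    return sum(w * jc_prob_change(r * t) for r, w in zip(rates, weights))


def exact_gamma_prob_change(t, shape):
    """Closed-form Gamma average of the Jukes-Cantor change probability."""
    return 0.75 * (1.0 - (1.0 + 4.0 * t / (3.0 * shape)) ** (-shape))
```

For a shape parameter above 1 the histogram average should agree closely with the closed form, and both fall below the single-rate value jc_prob_change(t), because fast sites saturate while slow sites contribute few changes.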

A pioneer of protein evolution
[Photo: Margaret Dayhoff, about 1966]

Models of amino acid change in proteins
There are a variety of models put forward since the mid-1960s:
1. Amino acid transition matrices
   Dayhoff (1968) model. Tabulation of empirical changes in closely related pairs of proteins, normalized. The PAM100 matrix, for example, is the expected transition matrix given 1 substitution per position.
   Jones, Taylor and Thornton (1992) recalculated PAM matrices (the JTT matrix) from a much larger set of data.
   Jones, Taylor, and Thornton (1994a, 1994b) have tabulated a separate mutation data matrix for transmembrane proteins.
   Koshi and Goldstein (1995) have described the tabulation of further context-dependent mutation data matrices.
   Henikoff and Henikoff (1992) have tabulated the BLOSUM matrix for conserved motifs in gene families.
2. Goldman and Yang (1994) pioneered codon-based models (see next screen).

Approaches to protein sequence models
Making a model for protein sequence evolution (a not very practical approach)
1. Use a good model of DNA evolution.
2. Use the appropriate genetic code.
3. When an amino acid changes, accept it with a probability that declines as the amino acids become more different.
4. Fit this to empirical information on protein evolution.
5. Take into account variation of rate from site to site.
6. Take into account correlation of rate variation in adjacent sites.
7. How about protein structure? Secondary structure? 3-D structure?

References
Barry, D., and J. A. Hartigan. 1987. Statistical analysis of hominoid molecular evolution. Statistical Science 2: 191-210. [Early use of full 12-parameter model]
Dayhoff, M. O. and R. V. Eck. 1968. Atlas of Protein Sequence and Structure 1967-1968. National Biomedical Research Foundation, Silver Spring, Maryland. [Dayhoff's PAM model for proteins]
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular Biology and Evolution 11: 725-736. [codon-based protein/DNA models]
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174. [HKY model]
Henikoff, S. and J. G. Henikoff. 1992. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, USA 89: 10915-10919. [BLOSUM protein model]

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Computer Applications in the Biosciences (CABIOS) 8: 275-282. [JTT model for proteins]
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1994a. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038-3049. [JTT membrane protein model]
Jones, D. T., W. R. Taylor, and J. M. Thornton. 1994b. A mutation data matrix for transmembrane proteins. FEBS Letters 339: 269-275. [JTT membrane protein model]
Jukes, T. H. and C. Cantor. 1969. Evolution of protein molecules. pp. 21-132 in Mammalian Protein Metabolism, ed. M. N. Munro. Academic Press, New York. [Jukes-Cantor model]
Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120. [Kimura's 2-parameter model]

Koshi, J. M. and R. A. Goldstein. 1995. Context-dependent optimal substitution matrices. Protein Engineering 8: 641-645. [generating other kinds of protein model matrices]
Lanave, C., G. Preparata, C. Saccone, and G. Serio. 1984. A new method for calculating evolutionary substitution rates. Journal of Molecular Evolution 20: 86-93. [General reversible model]
Lockhart, P. J., M. A. Steel, M. D. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Molecular Biology and Evolution 11: 605-612. [The LogDet distance for correcting for changing base composition]
Tamura, K. and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution 10: 512-526. [Tamura-Nei model]

How it was done
This projection was produced using the prosper style in LaTeX, using LaTeX to make a .dvi file, using dvips to turn this into a PostScript file, using ps2pdf to mill it into a PDF file, and displaying the slides in Adobe Acrobat Reader.
Result: nice slides using freeware.