Workshop
David Rasmussen & Carsten Magnus
June 27, 2016
Outline
- Models of sequence evolution: rate matrices
- Markov chain model
- Variable rates amongst different sites: +Γ
- Implementation in BEAST2
- genotype: sequence level [Figure: two example RNA sequences]
- phenotype: e.g. antigenic level, antibody binding to HIV
- codon: three nucleotides encode one amino acid
- one nucleotide change can already change the phenotype
- alphabet: 4 nucleotides (DNA: A, C, G, T; RNA: A, C, G, U) or 20 amino acids

When comparing two nucleotide sequences, we have to keep in mind that they are the result of mutation during replication (genotypic level) and selection (phenotypic level).
Sequence alignment
A sequence alignment is a way of arranging sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
- To find an alignment, use the concept of positional homology: nucleotides (or amino acids) show positional homology if they exist at equivalent positions in the respective sequences.
- Programs for alignment: MUSCLE, CLUSTAL, which can be called from e.g. AliView, MegAlign, ...
- A BEAST analysis starts with aligned sequences!
- File formats: .fas, .fasta, .nexus
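Because a BEAST analysis expects aligned input, a quick sanity check before loading a file is to confirm that all sequences have the same length. The following is a minimal sketch using only the Python standard library; the function names `read_fasta` and `is_aligned` and the toy sequences are hypothetical and not part of BEAST or of any alignment program.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA-formatted lines into a dict: sequence name -> sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in records.items()}


def is_aligned(records):
    """True if all sequences have equal length (gap characters count as columns)."""
    return len({len(seq) for seq in records.values()}) <= 1


# Hypothetical toy alignment with gaps, standing in for a .fasta file.
example = StringIO(">taxon1\nACGT-ACG\n>taxon2\nACGTTACG\n>taxon3\nAC-TTACG\n")
alignment = read_fasta(example)
print("aligned:", is_aligned(alignment))
```

Equal length is a necessary condition only: it does not show that the columns are positionally homologous, which is what the alignment program is for.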
Models for nucleotide substitutions
The fundamental problem
[Figure: aligned sequences of taxon 1, taxon 2 and taxon 3, with example histories showing a single substitution, multiple substitutions at one site, and a convergent substitution]
Problem of phylogenetics: We observe sequences but not their evolutionary history. Thus we have to take all possible evolutionary trajectories into account.
The sequence evolution model appears in the posterior:
[Equation: the posterior written as a product of probability terms P( )]
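A model for nucleotide substitutions
State space of each nucleotide position: S = {T, C, A, G}
Example: Assume the process is at state T. It moves to C, A or G at rates a, b and c, so it leaves T at total rate a+b+c, and the diagonal entry is -(a+b+c).
Substitution rate matrix (rows: from, columns: to, in the order T, C, A, G):
$$Q = \begin{pmatrix} -(a+b+c) & a & b & c \\ d & -(d+e+f) & e & f \\ g & h & -(g+h+i) & i \\ j & k & l & -(j+k+l) \end{pmatrix}$$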
Site models in BEAST2
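The easiest substitution model: JC69
- JC69: named after T. H. Jukes and C. R. Cantor: Evolution of protein molecules. 1969 [Jukes and Cantor, 1969].
- All substitutions have the same rate, λ.
Substitution rates (order T, C, A, G; the diagonal entries, written as a dot, make each row sum to zero):
$$Q = \begin{pmatrix} \cdot & \lambda & \lambda & \lambda \\ \lambda & \cdot & \lambda & \lambda \\ \lambda & \lambda & \cdot & \lambda \\ \lambda & \lambda & \lambda & \cdot \end{pmatrix}$$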
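Accounting for transition/transversion: K80
- K80: named after M. Kimura: A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. 1980. [Kimura, 1980]
- Transitions happen at rate α, transversions at rate β.
[Figure: pyrimidines (one ring): T, C; purines (two rings): A, G; transitions occur within a group, transversions between the groups]
Substitution rates (order T, C, A, G):
$$Q = \begin{pmatrix} \cdot & \alpha & \beta & \beta \\ \alpha & \cdot & \beta & \beta \\ \beta & \beta & \cdot & \alpha \\ \beta & \beta & \alpha & \cdot \end{pmatrix}$$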
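Accounting for transition/transversion: HKY
- HKY: named after [Hasegawa et al., 1984, Hasegawa et al., 1985]
- accounts for transitions (rate α) and transversions (rate β)
- after a long period of evolution, the equilibrium frequencies π_T, π_C, π_A, π_G are reached
[Figure: pyrimidines (one ring): T, C; purines (two rings): A, G; transitions occur within a group, transversions between the groups]
Substitution rates (order T, C, A, G):
$$Q = \begin{pmatrix} \cdot & \alpha\pi_C & \beta\pi_A & \beta\pi_G \\ \alpha\pi_T & \cdot & \beta\pi_A & \beta\pi_G \\ \beta\pi_T & \beta\pi_C & \cdot & \alpha\pi_G \\ \beta\pi_T & \beta\pi_C & \alpha\pi_A & \cdot \end{pmatrix}$$
For the off-diagonal entries, this is the K80 rate pattern multiplied by the diagonal matrix of equilibrium frequencies:
$$Q = \begin{pmatrix} \cdot & \alpha & \beta & \beta \\ \alpha & \cdot & \beta & \beta \\ \beta & \beta & \cdot & \alpha \\ \beta & \beta & \alpha & \cdot \end{pmatrix} \begin{pmatrix} \pi_T & 0 & 0 & 0 \\ 0 & \pi_C & 0 & 0 \\ 0 & 0 & \pi_A & 0 \\ 0 & 0 & 0 & \pi_G \end{pmatrix}$$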
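Accounting for transition/transversion: TN93
- TN93: named after [Tamura and Nei, 1993]
- accounts for different transition rates between T and C (rate α_1) as well as between A and G (rate α_2)
- after a long period of evolution, the equilibrium frequencies π_T, π_C, π_A, π_G are reached
[Figure: pyrimidines (one ring): T, C; purines (two rings): A, G; transitions with rates α_1 and α_2 within the groups, transversions between the groups]
Substitution rates (order T, C, A, G):
$$Q = \begin{pmatrix} \cdot & \alpha_1\pi_C & \beta\pi_A & \beta\pi_G \\ \alpha_1\pi_T & \cdot & \beta\pi_A & \beta\pi_G \\ \beta\pi_T & \beta\pi_C & \cdot & \alpha_2\pi_G \\ \beta\pi_T & \beta\pi_C & \alpha_2\pi_A & \cdot \end{pmatrix}$$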
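A more general substitution model: GTR
- GTR (REV): generalised time-reversible model
- based on three papers: [Tavaré, 1986, Yang, 1994, Zharkikh, 1994]
Substitution rates (order T, C, A, G):
$$Q = \begin{pmatrix} \cdot & a\pi_C & b\pi_A & c\pi_G \\ a\pi_T & \cdot & d\pi_A & e\pi_G \\ b\pi_T & d\pi_C & \cdot & f\pi_G \\ c\pi_T & e\pi_C & f\pi_A & \cdot \end{pmatrix}$$
+ quite flexible
+ time-reversible
- not completely general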
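The GTR matrix above can be assembled mechanically from the six rates a to f and the equilibrium frequencies, and time-reversibility can be checked through detailed balance, π_i q_ij = π_j q_ji. The following is a minimal sketch in plain Python; the state order T, C, A, G, the function names and the numerical values are assumptions made for illustration, not BEAST2 code.

```python
STATES = "TCAG"  # assumed state order for rows and columns


def gtr_rate_matrix(a, b, c, d, e, f, pi):
    """Build Q with q_ij = s_ij * pi_j for i != j and rows summing to zero.

    a..f are the rates for the pairs TC, TA, TG, CA, CG, AG;
    pi lists the equilibrium frequencies in the order T, C, A, G.
    """
    s = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    q = [[s[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


def hky_rate_matrix(alpha, beta, pi):
    """HKY as a special case of GTR: transitions (TC, AG) alpha, the rest beta."""
    return gtr_rate_matrix(alpha, beta, beta, beta, beta, alpha, pi)


def is_time_reversible(q, pi, tol=1e-12):
    """Detailed balance: pi_i * q_ij == pi_j * q_ji for all pairs of states."""
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
               for i in range(4) for j in range(4))


# Hypothetical parameter values, chosen only for illustration.
pi = [0.1, 0.2, 0.3, 0.4]
q_gtr = gtr_rate_matrix(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, pi)
q_hky = hky_rate_matrix(2.0, 1.0, pi)
print("GTR reversible:", is_time_reversible(q_gtr, pi))
print("HKY reversible:", is_time_reversible(q_hky, pi))
```

Setting all six rates equal and all frequencies to 1/4 recovers the JC69 pattern, which illustrates how the simpler models are nested inside GTR.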
The most general substitution model, implemented in BEAST2 but not in BEAUti
- UNREST: unrestricted model
- first described in [Yang, 1994]
- each substitution has a (different) rate
Substitution rates (order T, C, A, G):
$$Q = \begin{pmatrix} \cdot & a & b & c \\ d & \cdot & e & f \\ g & h & \cdot & i \\ j & k & l & \cdot \end{pmatrix}$$
+ most general case
+ all other models are special cases of UNREST
- mathematically very complicated and not handy to use
- not time-reversible
Substitution models in BEAUti

model    parameters   description
JC69     1            all substitutions have the same rate
K80      2            accounts for transitions and transversions, not in BEAUti
HKY      2+3          distinction between transitions and transversions, including equilibrium frequencies
TN93     3+3          different rates for the two types of transitions
GTR      6+3          general, but still time-reversible
UNREST   12           most general, not time-reversible, not in BEAUti

The equilibrium frequencies (the +3 parameters) can be empirically estimated from the alignment or inferred alongside the substitution rates.
The fundamental problem - again
[Figure: aligned sequences of taxon 1, taxon 2 and taxon 3]
Problem of phylogenetics: We observe sequences but not their evolutionary history. Thus we have to take all possible evolutionary trajectories into account.
So far we have determined rates of nucleotide substitutions. But we need probabilities.
Nucleotide substitutions as continuous-time Markov chains (CTMC)
Definition of a Markov chain (see also [Ross, 1996]):
- a stochastic process, i.e. a series of random experiments through time
- lives on a state space and jumps between the different states
- memorylessness: the probability of jumping to a state depends only on the current state
[Figure: nucleotide substitutions as a CTMC; a site moves among the states T, C, A and G through time, with jump probabilities p]
Why CTMCs are a great model for nucleotide substitutions
- memorylessness: a nucleotide substitution happens independently of the substitution history at this site
- the substitution rate matrix defines the transition probabilities
- applying linear algebra, we can calculate the transition probability matrix according to
$$P(t) = e^{Qt} = U \,\mathrm{diag}\left(e^{\varepsilon_1 t}, e^{\varepsilon_2 t}, e^{\varepsilon_3 t}, e^{\varepsilon_4 t}\right) U^{-1}$$
  where ε_1, ..., ε_4 are the eigenvalues of Q and the columns of U are the corresponding eigenvectors
- the transition probabilities take into account every possible substitution path (Chapman-Kolmogorov theorem)
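To make P(t) = e^{Qt} concrete, the sketch below computes the matrix exponential numerically and checks the Chapman-Kolmogorov property P(s+t) = P(s)P(t). Note the swap: instead of the eigendecomposition used on the slide, it uses a truncated Taylor series with scaling and squaring, so that it runs with the Python standard library alone. The K80-style rate values (α = 2, β = 1) and all function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(Q t) by a Taylor series with scaling and squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    m = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    # Taylor series of exp(M) for the scaled-down matrix M = Q t / 2^squarings.
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # Undo the scaling: exp(Q t) = exp(M)^(2^squarings).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical K80-style rate matrix (order T, C, A, G), alpha = 2, beta = 1.
Q = [[-4.0, 2.0, 1.0, 1.0],
     [2.0, -4.0, 1.0, 1.0],
     [1.0, 1.0, -4.0, 2.0],
     [1.0, 1.0, 2.0, -4.0]]

p_short = mat_exp(Q, 0.1)
p_long = mat_exp(Q, 0.2)
p_total = mat_exp(Q, 0.3)
p_product = mat_mul(p_short, p_long)  # Chapman-Kolmogorov: should match p_total
for row in p_total:
    print(" ".join(f"{x:.4f}" for x in row))
```

Each row of P(t) is a probability distribution over the four nucleotides, so the rows sum to one; the product of P(0.1) and P(0.2) agrees with P(0.3) up to numerical error.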
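Example of transition probabilities: JC69
Substitution rates:
$$Q = \begin{pmatrix} -3\lambda & \lambda & \lambda & \lambda \\ \lambda & -3\lambda & \lambda & \lambda \\ \lambda & \lambda & -3\lambda & \lambda \\ \lambda & \lambda & \lambda & -3\lambda \end{pmatrix}, \qquad P(t) = e^{Qt}$$
Transition probability matrix:
$$P(t) = \begin{pmatrix} p_0(t) & p_1(t) & p_1(t) & p_1(t) \\ p_1(t) & p_0(t) & p_1(t) & p_1(t) \\ p_1(t) & p_1(t) & p_0(t) & p_1(t) \\ p_1(t) & p_1(t) & p_1(t) & p_0(t) \end{pmatrix}$$
with
$$p_0(t) = \frac{1}{4} + \frac{3}{4} e^{-4\lambda t} \quad \text{and} \quad p_1(t) = \frac{1}{4} - \frac{1}{4} e^{-4\lambda t}$$
[Figure: transition probabilities p_0(t) and p_1(t) against time in days (0 to 100) for λ = 0.015 substitutions per site per day]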
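The closed-form JC69 probabilities are easy to tabulate. The sketch below evaluates p_0(t) and p_1(t) for the rate used in the plot, λ = 0.015 substitutions per site per day; the function name `jc69_probs` is hypothetical.

```python
import math


def jc69_probs(lam, t):
    """Return (p0, p1) under JC69 for rate lam and elapsed time t.

    p0: probability that the site shows the same nucleotide as at time 0.
    p1: probability that it shows one specific other nucleotide.
    """
    decay = math.exp(-4.0 * lam * t)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay


lam = 0.015  # substitutions per site per day, as in the plot
for days in range(0, 101, 20):
    p0, p1 = jc69_probs(lam, days)
    print(f"t = {days:3d} days  p0 = {p0:.3f}  p1 = {p1:.3f}  "
          f"row sum = {p0 + 3 * p1:.3f}")
```

Since a site is in exactly one of the four states, p_0(t) + 3 p_1(t) = 1 at every t, and both probabilities approach 1/4 as t grows.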
JC69: Stationary distribution
Suppose we have a sequence that evolves with rate λ = 2.2/3 × 10^{-9} substitutions per site per year. We follow the evolution of 4 different sites with T at site 1, C at site 2, A at site 3 and G at site 4 at time point 0. How likely is it that, after time t has passed, there is a T, C, A or G at the four different positions?
To answer this question, we follow the time evolution of the transition probability matrix P(t):
$$P(0) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad P(4.5 \times 10^{8}) = \begin{pmatrix} 0.46 & 0.18 & 0.18 & 0.18 \\ 0.18 & 0.46 & 0.18 & 0.18 \\ 0.18 & 0.18 & 0.46 & 0.18 \\ 0.18 & 0.18 & 0.18 & 0.46 \end{pmatrix}$$
$$P(9 \times 10^{8}) = \begin{pmatrix} 0.31 & 0.23 & 0.23 & 0.23 \\ 0.23 & 0.31 & 0.23 & 0.23 \\ 0.23 & 0.23 & 0.31 & 0.23 \\ 0.23 & 0.23 & 0.23 & 0.31 \end{pmatrix}, \quad P(1.8 \times 10^{9}) = \begin{pmatrix} 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 \end{pmatrix}$$
(time in years)
When t → ∞, the stationary distribution is reached. Any long sequence (e.g. ...) at time 0 will be composed of equal amounts of T, C, A and G after time t → ∞.
JC69: Time transformation
The times we look at, e.g. in species evolution, are very often very large. Thus, instead of real time, we display an evolutionary time scale in terms of sequence distances.
As a substitution happens at rate 3λ in JC69 (keep in mind that in other models the expected time to a substitution is different!), we expect one substitution to happen after time 1/(3λ). This is due to exponentially distributed waiting times for an event happening at a certain rate.
This means that we expect one substitution after
$$\frac{1}{3\lambda} = \frac{1}{2.2 \times 10^{-9}}\ \text{years} \approx 4.5 \times 10^{8}\ \text{years}$$
in our example.
[Figure: the four transition probability matrices of the stationary-distribution example on two axes: time in years (0, 4.5 × 10^8, 9 × 10^8, 1.8 × 10^9) and d = time × 3λ (0, 1, 2, 4)]
The expected time to d substitutions in JC69 is
$$t = \frac{d}{3\lambda}$$
Trick from physics: compare units: [t] = years and [d/(3λ)] = # substitutions / (# substitutions/year) = years.
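The time transformation is a one-line calculation, sketched below for the rate of the example, λ = 2.2/3 × 10^{-9} substitutions per site per year. Substituting t = d/(3λ) into p_0(t) gives p_0 as a function of distance, 1/4 + 3/4 e^{-4d/3}, which no longer depends on λ. The function names are hypothetical.

```python
import math

LAMBDA = 2.2 / 3 * 1e-9  # substitutions per site per year (rate of the example)


def distance(t_years, lam=LAMBDA):
    """Expected number of substitutions per site under JC69: d = 3 * lam * t."""
    return 3.0 * lam * t_years


def time_from_distance(d, lam=LAMBDA):
    """Inverse transformation: t = d / (3 * lam)."""
    return d / (3.0 * lam)


def jc69_p0_from_distance(d):
    """p0 written in terms of d, obtained by substituting t = d / (3 * lam)."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


for t in (0.0, 4.5e8, 9e8, 1.8e9):
    d = distance(t)
    print(f"t = {t:.2e} years  d = {d:.2f}  p0 = {jc69_p0_from_distance(d):.3f}")
```

Working in units of d is what makes branch lengths comparable across data sets with different absolute rates.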
Variable rates
- so far: all sites in the sequence evolve at the same rate
- but: substitution rates might differ over the genome
  - mutation rates might differ over sites
  - selective pressure might be different on the phenotypic level
We extend the existing models by replacing the constant rates by Γ-distributed random variables (notation: JC69+Γ, HKY+Γ, ...).
Example: JC69+Γ
λ → λR: we replace the substitution rate λ by λR, where R is a Γ-distributed random variable with shape parameter α and mean 1.
[Figure: density g(r) of the Γ distribution with mean 1 for α = 0.2, α = 1, α = 2 and α = 20, plotted for r from 0 to 3]
In BEAUti: change the Gamma Category Count to allow for rate variation. 4 to 6 categories normally work well.
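A category count implies that the continuous Γ distribution is replaced by a handful of representative rates. The sketch below illustrates the idea with equal-probability categories, each represented by its mean rate. Two assumptions are worth stating loudly: how BEAST2 computes its categories internally is not shown in the slides, and the category means here are approximated by Monte Carlo sampling rather than by exact quantiles, so that only the Python standard library is needed. All names and parameter values are hypothetical.

```python
import math
import random


def discrete_gamma_rates(alpha, k, per_category=50_000, seed=1):
    """Approximate mean rates of k equal-probability categories of Gamma(alpha), mean 1."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): scale = 1/alpha gives a distribution with mean 1.
    samples = sorted(rng.gammavariate(alpha, 1.0 / alpha)
                     for _ in range(k * per_category))
    rates = [sum(samples[i * per_category:(i + 1) * per_category]) / per_category
             for i in range(k)]
    mean = sum(rates) / k
    return [r / mean for r in rates]  # renormalise so the mean rate is exactly 1


def jc69_gamma_p0(lam, t, rates):
    """JC69+Gamma: average the JC69 p0 over the rate categories (lam -> lam * r)."""
    return sum(0.25 + 0.75 * math.exp(-4.0 * lam * r * t) for r in rates) / len(rates)


rates = discrete_gamma_rates(alpha=0.5, k=4)
print("category rates:", [round(r, 3) for r in rates])
print("p0 with rate variation:", round(jc69_gamma_p0(0.015, 20, rates), 3))
```

A small α gives strongly unequal category rates (a few fast sites, many slow ones), whereas a large α pushes all categories towards rate 1 and the model back towards plain JC69.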
The codon sun
A codon consists of three nucleotides, translating to one of the 20 amino acids:

Amino acid                     Three-letter abbreviation   One-letter symbol   Molecular weight
Alanine                        Ala                         A                   89 Da
Arginine                       Arg                         R                   174 Da
Asparagine                     Asn                         N                   132 Da
Aspartic acid                  Asp                         D                   133 Da
Asparagine or aspartic acid    Asx                         B                   133 Da
Cysteine                       Cys                         C                   121 Da
Glutamine                      Gln                         Q                   146 Da
Glutamic acid                  Glu                         E                   147 Da
Glutamine or glutamic acid     Glx                         Z                   147 Da
Glycine                        Gly                         G                   75 Da
Histidine                      His                         H                   155 Da
Isoleucine                     Ile                         I                   131 Da
Leucine                        Leu                         L                   131 Da
Lysine                         Lys                         K                   146 Da
Methionine                     Met                         M                   149 Da
Phenylalanine                  Phe                         F                   165 Da
Proline                        Pro                         P                   115 Da
Serine                         Ser                         S                   105 Da
Threonine                      Thr                         T                   119 Da
Tryptophan                     Trp                         W                   204 Da
Tyrosine                       Tyr                         Y                   181 Da
Valine                         Val                         V                   117 Da

[Figure: the codon sun] [Sanger, 2015] [Promega, 2015]
Example: Codon
Overview of substitution rates to the same codon; the thickness of the arrows represents different rates:
[Figure: a leucine codon and its nine single-nucleotide neighbours, coding for Ile, Val, Leu, Arg, Gln and Pro]
- synonymous substitutions: do not change the amino acid
- nonsynonymous substitutions: do change the amino acid
- bigger arrows: transitions
- smaller arrows: transversions
adapted from [Yang, 2014]
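The two distinctions on this slide can be checked programmatically for any pair of codons that differ at one position. The sketch below uses the standard genetic code; the function name `classify` is hypothetical, and the codon CTA is an assumed example, since the codon labels of the original figure are not legible here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
PURINES = {"A", "G"}


def classify(codon_from, codon_to):
    """Classify a single-nucleotide codon change.

    Returns (kind, effect): kind is 'transition' or 'transversion',
    effect is 'synonymous' or 'nonsynonymous'.
    """
    diffs = [(x, y) for x, y in zip(codon_from, codon_to) if x != y]
    if len(diffs) != 1:
        raise ValueError("codons must differ at exactly one position")
    x, y = diffs[0]
    # A transition stays within purines (A, G) or within pyrimidines (T, C).
    kind = "transition" if (x in PURINES) == (y in PURINES) else "transversion"
    same_aa = CODON_TABLE[codon_from] == CODON_TABLE[codon_to]
    return kind, "synonymous" if same_aa else "nonsynonymous"


# Assumed example codon: CTA (Leu) and three of its single-nucleotide neighbours.
for neighbour in ("CTG", "TTA", "ATA"):
    print("CTA ->", neighbour, classify("CTA", neighbour))
```

In codon models, the four combinations of these two labels are what typically receive different rates, matching the arrows of different thickness in the figure.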
Varying substitution rates amongst the codon positions
[Bofkin and Goldman, 2007] have shown that in protein-encoding regions
- second codon positions evolve more slowly than first codon positions
- third codon positions evolve faster than first codon positions
Different codon positions can have different evolutionary rates. BEAST2 allows for estimating these rates separately.
File: BEAST2.4.x/examples/nexus/primate-mtDNA.nex
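Estimating separate rates per codon position first requires splitting the alignment by position. The sketch below shows that split for an in-frame coding alignment and uses the proportion of variable columns as a crude proxy for how fast each position evolves; it is not how BEAST2 estimates rates. The function names and the toy alignment are hypothetical.

```python
def split_codon_positions(alignment):
    """Split an in-frame coding alignment into codon positions 1, 2 and 3.

    alignment: dict mapping taxon name -> sequence (length a multiple of 3).
    Returns a dict: position (1, 2, 3) -> dict of taxon name -> sub-sequence.
    """
    return {pos + 1: {name: seq[pos::3] for name, seq in alignment.items()}
            for pos in range(3)}


def proportion_variable(partition):
    """Fraction of alignment columns in which not all taxa share one nucleotide."""
    seqs = list(partition.values())
    n_columns = len(seqs[0])
    variable = sum(1 for i in range(n_columns) if len({s[i] for s in seqs}) > 1)
    return variable / n_columns


# Hypothetical toy alignment of three codons per taxon.
toy = {
    "taxon1": "CTAGCTAAA",
    "taxon2": "CTGGCCAAG",
    "taxon3": "TTAGCAAAA",
}
partitions = split_codon_positions(toy)
for pos in (1, 2, 3):
    print(f"codon position {pos}: {proportion_variable(partitions[pos]):.2f} variable")
```

In this toy data the third positions vary the most and the second positions not at all, mirroring the qualitative ordering reported by [Bofkin and Goldman, 2007].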
Including the choice of substitution rate model into your BEAST analysis
Rate models in BEAST2
BEAST2 allows for including different site models into your analysis (Site Model tab in BEAUti).
Which site model is the best for your data?
- package bModelTest: Bayesian site model selection for nucleotide data
- package SubstBMA: modelling across-site variation in the nucleotide
References
- Bofkin, L. and Goldman, N. (2007). Variation in Evolutionary Processes at Different Codon Positions. Molecular Biology and Evolution, 24(2):513-521.
- Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA. Journal of Molecular Evolution, 22(2):160-174.
- Hasegawa, M., Yano, T., and Kishino, H. (1984). A New Molecular Clock of Mitochondrial DNA and the Evolution of Hominoids. Proceedings of the Japan Academy, Series B: Physical and Biological Sciences, 60(4):95-98.
- Jukes, T. and Cantor, C. (1969). Evolution of protein molecules. Mammalian Protein Metabolism, pages 21-123.
- Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111-120.
- Promega (2015). The amino acids: https://www.promega.com/ /media/files/resources/technical references/amino acid abbreviations and molecular weights.pdf.
- Ross, S. M. (1996). Stochastic Processes. Second edition. Wiley.
- Sanger (2015). The codon sun: ftp://ftp.sanger.ac.uk/pub/yourgenome/downloads/activities/kras-cancer-mutation/krascodonwheel.pdf.
- Tamura, K. and Nei, M. (1993). Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution, 10(3):512-526.
- Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In Some mathematical questions in biology: DNA sequence analysis (New York, 1984), pages 57-86. Amer. Math. Soc., Providence, RI.
- Yang, Z. (1994). Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution, 39(1):105-111.
- Yang, Z. (2014). Molecular Evolution: A Statistical Approach. Oxford University Press.
- Zharkikh, A. (1994). Estimation of evolutionary distances between nucleotide sequences. Journal of Molecular Evolution, 39(3):315-329.