SENCA: A codon substitution model to better estimate evolutionary processes

Similar documents
SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

MICROBIAL BIOCHEMISTRY BIOT 309. Dr. Leslye Johnson Sept. 30, 2012

Sequence Divergence & The Molecular Clock. Sequence Divergence

Supporting Information for. Initial Biochemical and Functional Evaluation of Murine Calprotectin Reveals Ca(II)-

Supplementary Information for

Microbial Typing by Machine Learned DNA Melt Signatures

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

Edinburgh Research Explorer

Practical Bioinformatics

Identification of Bacteria Using Phylogenetic Relationships Revealed by MS/MS Sequencing of Tryptic Peptides Derived from Cellular Proteins

Supplementary Information for Hurst et al.: Causes of trends of amino acid gain and loss

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

Evidence That Mutation Is Universally Biased towards AT in Bacteria

Advanced topics in bioinformatics

Using an Artificial Regulatory Network to Investigate Neural Computation

types of codon models

INTERPRETATION OF THE GRAM STAIN

Tetracycline Rationale for the EUCAST clinical breakpoints, version th November 2009

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

Whole Genome based Phylogeny

SUPPLEMENTARY DATA - 1 -

Why do more divergent sequences produce smaller nonsynonymous/synonymous

Fitness constraints on horizontal gene transfer

MicroGenomics. Universal replication biases in bacteria

part 3: analysis of natural selection pressure

ydci GTC TGT TTG AAC GCG GGC GAC TGG GCG CGC AAT TAA CGG TGT GTA GGC TGG AGC TGC TTC

Crick s early Hypothesis Revisited

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

Introduction to Molecular Phylogeny

Rise and Fall of Mutator Bacteria

Domain Bacteria. BIO 220 Microbiology Jackson Community College

RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG

The Journal of Animal & Plant Sciences, 28(5): 2018, Page: Sadia et al., ISSN:

Codon-model based inference of selection pressure. (a very brief review prior to the PAML lab)

Codon Distribution in Error-Detecting Circular Codes

Evolutionary Analysis of Viral Genomes

Proteins: Characteristics and Properties of Amino Acids

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the

Figure S1. Pangenome plots of ten recombining bacterial species based on RAST annotated

An Analytical Model of Gene Evolution with 9 Mutation Parameters: An Application to the Amino Acids Coded by the Common Circular Code

Genetic Basis of Variation in Bacteria

Microbial Taxonomy. Classification of living organisms into groups. A group or level of classification

Figure Page 117 Microbiology: An Introduction, 10e (Tortora/ Funke/ Case)

Ch 27: The Prokaryotes Bacteria & Archaea Older: (Eu)bacteria & Archae(bacteria)

NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES

Objective: You will be able to justify the claim that organisms share many conserved core processes and features.

2 Genome evolution: gene fusion versus gene fission

Probabilistic modeling and molecular phylogeny

genome analysis, amino acid biosynthesis and transport, T-box, bacteria

Table S1. Primers and PCR conditions used in this paper Primers Sequence (5 3 ) Thermal conditions Reference Rhizobacteria 27F 1492R

Bacterial pan-genomics

The Prokaryotes: Domains Bacteria and Archaea

arxiv: v1 [q-bio.qm] 18 Nov 2013

Overview of the major bacterial pathogens The major bacterial pathogens are presented in this table:

Aoife McLysaght Dept. of Genetics Trinity College Dublin

Lecture 15: Realities of Genome Assembly Protein Sequencing

Taming the Beast Workshop

Protein Threading. Combinatorial optimization approach. Stefan Balev.

Additional file 1 for Structural correlations in bacterial metabolic networks by S. Bernhardsson, P. Gerlee & L. Lizana

Computational Biology: Basics & Interesting Problems

The Trigram and other Fundamental Philosophies

160, and 220 bases, respectively, shorter than pbr322/hag93. (data not shown). The DNA sequence of approximately 100 bases of each

Gene Location and Bacterial Sequence Divergence

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Introduction to Bioinformatics Integrated Science, 11/9/05

Supplemental Figure 1.

Prokaryotic phylogenies inferred from protein structural domains

The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome. The minimal prokaryotic genome

World Journal of Pharmaceutical Research SJIF Impact Factor 8.074

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Translation. A ribosome, mrna, and trna.

In previous lecture. Shannon s information measure x. Intuitive notion: H = number of required yes/no questions.

Considerations with Antibiotic Therapy PART

Product List. - Kits & Reagents

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

POPULATION GENETICS Biology 107/207L Winter 2005 Lab 5. Testing for positive Darwinian selection

Practical examination

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Product Catalogue 2015 Clinical and Industrial Microbiology

MODELING EVOLUTION AT THE PROTEIN LEVEL USING AN ADJUSTABLE AMINO ACID FITNESS MODEL

Diversity of Chlamydia trachomatis Major Outer Membrane

Genetic code on the dyadic plane

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

Clay Carter. Department of Biology. QuickTime and a TIFF (Uncompressed) decompressor are needed to see this picture.

Antimicrobial peptides

Sequence Diversity of Flagellin (flic) Alleles in Pathogenic Escherichia coli

Properties of amino acids in proteins

ANTIMICROBIAL TESTING. E-Coli K-12 - E-Coli 0157:H7. Salmonella Enterica Servoar Typhimurium LT2 Enterococcus Faecalis

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Gram negative bacilli

Product Catalogue 2016 Clinical and Industrial Microbiology

μ gyra parc Escherichia coli Klebsiella pneumoniae Pseudomonas aeruginosa E. coli gyra E. coli parc gyra parc gyra Escherichia coli E. coli E.

Evolvable Neural Networks for Time Series Prediction with Adaptive Learning Interval

codon substitution models and the analysis of natural selection pressure

Bacterial clasification

PROTEIN SYNTHESIS INTRO

Transcription:

SENCA: A codon substitution model to better estimate evolutionary processes Fanny Pouyet Marc Bailly-Bechet, Dominique Mouchiroud and Laurent Guéguen LBBE - UCB Lyon - UMR 5558 June 2015, Porquerolles, France.

Codon Usage Bias - CUB Lysine codon usage in bacteria % of codons 100 80 60 40 AAA AAG 20 0 A.dehalogenans B.radiodurans A.aeolicus E.coli B.aphidicola E.coli Lysine Usage 1200 61 sense codons, 20 amino acids Genes Number 1000 800 600 400 200 0 0 0.2 0.4 0.6 0.8 1 relative frequency of AAA codon 2/16

CUB How to measure it? Origins of the CUB: mutational vs. selective explanations. related trna concentration (um) Objectives: Disentangling the two hypotheses Codons usage (frequency %) Modelling evolutive processes explaining the observed CUB: at the nucleotidic (N), the codons (C) and the AA layers (A). 3/16

CUB How to measure it? Origins of the CUB: mutational vs. selective explanations. related trna concentration (um) Objectives: Disentangling the two hypotheses Codons usage (frequency %) Modelling evolutive processes explaining the observed CUB: at the nucleotidic (N), the codons (C) and the AA layers (A). 3/16

SENCA: Sites Evolution of Nucleotides, Codons and Amino-acids Let s have an example Thr ALANINE Gly Ser GCT GCC GCG GCA CCT CCG PROLINE CCC CCA Glu Asp GTT GTG VALINE GTC GTA Inspired by Yang and Nielsen 2008 (FMutSel). 4/16

SENCA: Sites Evolution of Nucleotides, Codons and Amino-acids N layer ALANINE T C GCT GCC C GCG GCA A CCT CCG PROLINE CCC CCA T GTT GTG VALINE GTC GTA κ : transition/transversion π n :equilibrium frequency of nucleotide n [A, C, G, T ]. 4/16

SENCA: Sites Evolution of Nucleotides, Codons and Amino-acids C layer ALANINE GCT GCC ALA GCT GCG ALA GCC ALA GCA GCA PRO CCG CCT CCG PROLINE CCC CCA GTT GTG VALINE GTC GTA VAL GTG φ AA (i) : codon i preference intra-aa. I codons(aa) φ AA(I) = 1 4/16

SENCA: Sites Evolution of Nucleotides, Codons and Amino-acids A layer ALANINE GCT GCC GCG GCA PRO CCT CCG PROLINE CCC CCA VAL GTT GTG VALINE GTC GTA ω : non-synonymous/synonymous substitution rates ψ AA : preference AA. AA amino acids ψ(aa) = 1 4/16

SENCA: Sites Evolution of Nucleotides, Codons and Amino-acids Inspiration and structure ALANINE GCT GCC ALA GCT GCG ALA GCC ALA GCA C T GCA A PRO CCG PRO C CCT CCG PROLINE CCC CCA VAL T GTT GTG VALINE GTC GTA VAL GTG Implemented in Bio++ http://biopp.univ-montp2.fr/. 4/16

SENCA Formulas Instantaneous evolutionary rate from codon i to j : 0 if 2 or 3 different positions, πj k κ f (x i, x j ) synonymous transition, q ij = πj k κωf (x i, x j ) non-synonymous transition, πj k f (x i, x j ) synonymous transversion, πj k ωf (x i, x j ) non-synonymous transversion. with: and: f (x i, x j ) = log( x i x j ) 1 x i x j x i = n aai φ aa (i)ψ aai 5/16

Data 21 pathogeneous bacteria and archea From Lassalle et al., PLoS Genet. 2015 concatenates of approx. 100 genes by increasing ENC (core genome, non-recombinant). Dataset Taxon Name Nb. of strains Nb of concatenates Mean GC % brucella Brucella spp. 9 4 58.8 francis Francisella tularensis 8 3 33.8 mycobacterium Mycobacterium tuberculosis complex 7 1 66.1 burk_mal Burkholderia Pseudomallei group 9 6 68.7 yersinia Yersinia pestis 11 6 49.3 clamydia Clamydia trachomatis 13 4 41.8 sulfo Sulfolobus spp. 8 4 35.4 burk_ceno Burkholderia cenocepacia complex 8 8 68.2 staph Staphylococcus aureus 15 5 34.2 bifido Bifidobacterium longum 6 3 61.9 acineto Acinetobacter spp. 6 5 40.8 campylo Campylobacter jejunii 6 4 31.6 clostridium Clostridium botulinum 8 5 29.6 strep_pyo Streptococcus pyogenes 12 3 39.6 strep_pneu Streptococcus pneumoniae 13 3 42.0 salmonella Salmonella enterica 14 6 54.6 listeria Listeria spp. 8 3 38.8 B_anthracis Bacillus antharcis/aureus group 17 3 37.0 nesseiria Nesseiria meningitidis 8 2 55.3 escherichia Escherichia coli 35 1 53.3 helicobacter Helicobacter pylori 14 1 40.4 6/16

Implementation Non-stationnary and homogeneous run, Intra-species runs: amino-acids preferences stationnary, Optimisation by max. likelihood: 109 parameters. Validation of the model: Comparisons with AIC criterium to standard codons model YN98xF61. SENCA is better in 68 over 78 concats. Exceptions are: 2 concats. of Brucella spp. and Burkholderia peusomallei, 1 of Salmonella enterica and Yersinia pestis, all of Sulfolobus spp. 7/16

Implementation Non-stationnary and homogeneous run, Intra-species runs: amino-acids preferences stationnary, Optimisation by max. likelihood: 109 parameters. Validation of the model: Comparisons with AIC criterium to standard codons model YN98xF61. SENCA is better in 68 over 78 concats. Exceptions are: 2 concats. of Brucella spp. and Burkholderia peusomallei, 1 of Salmonella enterica and Yersinia pestis, all of Sulfolobus spp. 7/16

What about the CUB? To disentangle between the 2 hypotheses: compute the CUB by comparison with expected if uniform > ENC index. Effective Number of Codons ENC varies between 61 and 20. ENC = 61 ENC = 32 ENC = 20 Strength of CUB 8/16

ENC index 60 50 ENC 40 30 20 SENCA OBS clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium burk_ceno Burk_mal At equilibrium, frequency of codon i is proportional to: 3 f (i) πi k φ aai (i)n aai ψ aai k=1 > ENC SENCA ENC OBS Species ordered by incr. GC obs 9/16

Species ordered by incr. GC obs ENC index 60 50 ENC 40 30 N 20 SENCA OBS clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium burk_ceno Burk_mal Compute ENCN : 3 f (i) πi k 1 1 > ENCN uniform CUB, k=1 9/16

ENC index 60 50 ENC 40 30 20 N C SENCA OBS clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium burk_ceno Burk_mal Compute ENC C : f (i) 1 φ aai (i)n aai 1 > ENC C ENC SENCA Species ordered by incr. GC obs 9/16

ENC index 60 50 ENC 40 30 20 N C SENCA OBS YN98 clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium burk_ceno Burk_mal ENC SENCA ENC OBS < ENC YN98, Is ENC SENCA mostly driven by ENC C? Quantifying. Species ordered by incr. GC obs 9/16

Quantification denc SENCA denc C + denc N CUB measured as distance to uniform usage: denc = 61 ENC, High correlation (R = 0.95, p-value<10e-16, Pearson correlation test), Relative importance of C and N: denc C denc SENCA > denc N denc SENCA. C N 1.5 1.0 0.5 0.0 clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium burk_ceno Burk_mal 10/16

What about genome composition? GC content 0.8 N C A SENCA OBS YN98 0.6 GC 0.4 0.2 clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium burk_ceno Burk_mal GC SENCA < GC obs SENCA combined effects of the 3 layers > AT enrichment at equilibrium. 11/16

What about genome composition? GC3 content 0.8 N C A SENCA OBS YN98 0.6 GC 0.4 0.2 clostridium campylo francis staph B_anthracis sulfo listeria acineto strep_pyo clamy_trach strep_pneu helico yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium GC3 SENCA more biased than GC3 YN98 : in agreement with GC3 obs, Which layer has the most important impact on GC bias? burk_ceno Burk_mal 12/16

Quantification dgc dgc N + dgc C + dgc A dgc N + dgcc + dgca 0.2 0.1 0.0 0.1 0.2 GC bias measured as dgc = 0.5 GC (distance to uniform composition). High correlation (R=0.99, p-value<10e-16, Pearson correlation test). 0.2 0.1 0.0 0.1 dgc* Slope=1.04, intersect fixed to 0. 13/16

Quantification dgc dgc N + dgc C + dgc A A C N 0.3 0.2 0.1 0.0 clostridium campylo francis staph sulfo B_anthracis listeria strep_pyo helico acineto clamy_trach strep_pneu yersinia escherichia salmo neisseria brucella bifido_longum mycobacterium N, C and A impact on GC. burk_ceno Burk_mal 14/16

Conclusion Methodologically: Non-stationary model Multilayered model New statistical tools: GC and ENC of layers Biologically: Bias towards AT at equilibrium Importance of N, C and A for genomic bias. CUB mostly due to C. 15/16

Perspectives Extension of the model: Gene expression factor : measure of CUB intensity within a genome Biological questions: Influence on ω estimation, HIV Human interaction and consequences on HIV CUB, Can we detect recombination patterns: BGC? Thank you for your attention! Acknowledgments: Laurent Guéguen, Marc Bailly-Bechet, Dominique Mouchiroud, Florent Lassalle, Vincent Daubin 16/16

Omega comparisons 1.0 0.8 SENCA 0.6 0.4 0.2 0.0 0.0 0.2 0.4 0.6 0.8 1.0 YN98 17/16

SENCA: Site Evolution of Nucleotides, Codons and Amino-acids Implementation and parameters SENCA with 65 parameters: πj k equil. frequency of mutational process of j k κ transition/transversion φ aa (i) codon preference. For each AA, we have: i φ aa(i) = 1 n aai degenerescence of the AA ω = dn ds mutation rate of non-synynymous dn and synonymous ds ψ aa AA preference. We have: aa ψ aa = 1 In Bio++ (http://biopp.univ-montp2.fr/). Optimisation by max. likelihood. Homogeneous, non-stationary analysis (ψ aa stat. because intra-species analyses) 18/16