Evolutionary Analysis of Viral Genomes

Lecture 1: Quantifying Genetic Diversity
Oliver G. Pybus
Evolutionary Biology Group, Department of Zoology, University of Oxford
South Parks Road, Oxford OX1 3PS, U.K.
Fax: +44 1865 271249
http://evolve.zoo.ox.ac.uk

Information in Molecular Sequences Biological sequences (DNA, RNA, protein) contain information about the processes and events that formed them. This evolutionary information is often scrambled, fragmentary, hidden, or lost. Our aim is to use mathematical models to recover and interpret this information.

Information in Molecular Sequences
Genetic distances
Phylogenetic relationships
Rates of evolution
Dates of historical events
Evolutionary / Population Processes:
  Recombination rates
  Migration rates among subpopulations
  Natural selection & adaptation
  Population size change

Information in Molecular Sequences
Mutation is the source of all sequence differences.
  Single nucleotide polymorphisms
    Silent / Replacement
    Transitions / Transversions
  Length polymorphisms
    Insertions / Deletions
Recombination generates new combinations of mutations.
Natural Selection and Genetic Drift act to change the frequency of mutations within a population.

Types of Mutation
Transitions (TS): purine-to-purine or pyrimidine-to-pyrimidine (A ↔ G or C ↔ T)
Transversions (TV): purine-to-pyrimidine (A ↔ C, A ↔ T, G ↔ C or G ↔ T)
Silent (synonymous): encoded amino acid is unchanged
Replacement (non-synonymous): encoded amino acid is changed

Cys Ile Ser Thr Ser
TGT ATC TCT ACG AGC
TGT ATA TCT ATG AGC
Cys Ile Ser Met Ser

Silent (synonymous) change: ATC → ATA (Ile → Ile)
Replacement (non-synonymous) change: ACG → ATG (Thr → Met)
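The transition/transversion definition above can be written as a small function. This is only an illustrative sketch; the function name `classify_substitution` is an assumption and does not come from the lecture.

```python
# Purines and pyrimidines as defined on the slide.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old, new):
    """Return 'transition', 'transversion' or 'none' for a single-base change."""
    old, new = old.upper(), new.upper()
    if old == new:
        return "none"
    pair = {old, new}
    # A change within the same chemical class is a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# Third base of ATC -> ATA, then second base of ACG -> ATG (the slide's example).
print(classify_substitution("C", "A"))
print(classify_substitution("C", "T"))
```

In the slide's codon example, ATC → ATA is a transversion that happens to be silent, while ACG → ATG is a transition that is a replacement change. The two classifications (TS/TV and silent/replacement) are therefore independent of each other.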

MOLECULAR SEQUENCES
  -> Alignment Methods (BIOINFORMATICS)
ALIGNMENT
  -> Sequence Evolution Models (MOLECULAR EVOLUTION)
GENETIC DISTANCES
  -> Phylogenetic Inference (PHYLOGENETICS)
EVOLUTIONARY TREE (time scale = genetic distance)
  -> Molecular Clock Models (PHYLOGENETICS)
EVOLUTIONARY TREE (time scale = years)
  -> Coalescent Models (POPULATION GENETICS)
POPULATION PROCESSES (e.g. adaptation, migration, population size change)


Sequence Alignment
Homology: similarity of a character among organisms due to inheritance from a shared common ancestor.
Positional Homology: equivalent nucleotide/amino acid positions within a sequence.
Alignment: a proposed assignment of positional homology for a set of gene/protein sequences.

Sequence Alignment
For example, how should these two sequences be aligned?
Seq1: ATGCGTCGTT
Seq2: ATCCGCGTC

Sequence Alignment
Like this?
Seq1: ATGCGTCGTT
Seq2: ATCCG-CGTC
(7 homologous sites + 2 mismatches + 1 insertion/deletion)
Or like this?
Seq1: AT--GCGTCGTT
Seq2: ATCCGCGTC---
(7 homologous sites + 0 mismatches + 2 insertions/deletions)

Sequence Alignment Most alignment methods start by assigning relative weights to mismatches versus insertions/deletions. Different types of mismatch (e.g. transitions and transversions) can be weighted differently. The weights are used to calculate a total score for each possible alignment. Algorithms then search for the alignment with the best total score.
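The scoring idea can be illustrated with a minimal sketch. The weights below are arbitrary assumptions chosen for illustration, and the function `score_alignment` is hypothetical. It also penalises every gap column separately, whereas the slide counts each run of gaps as one insertion/deletion event.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two already-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            total += gap        # column containing an insertion/deletion
        elif x == y:
            total += match      # identical bases
        else:
            total += mismatch   # substitution
    return total


# The two candidate alignments of Seq1 and Seq2 from the slides.
ALIGNMENT_1 = ("ATGCGTCGTT", "ATCCG-CGTC")
ALIGNMENT_2 = ("AT--GCGTCGTT", "ATCCGCGTC---")

for weights in ({"mismatch": -1, "gap": -2}, {"mismatch": -3, "gap": -1}):
    print(weights,
          score_alignment(*ALIGNMENT_1, **weights),
          score_alignment(*ALIGNMENT_2, **weights))
```

With the first set of weights the first alignment scores higher; with the second set (mismatches expensive, gaps cheap) the second alignment wins. The preferred alignment therefore depends on the weights chosen.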

Sequence Alignment ClustalX is a commonly used alignment program. Alignment algorithms provide a useful first draft. Further adjustment by hand is often needed to correct errors.

Sequence Alignment
Many pathogen nucleotide sequences are highly divergent and are therefore difficult to align:
Seq1: GAAGGAAGCTCCTGGTTACTCCTGGGATCC
Seq2: GAGGGTTCCTATCTATTAATTGGTAGC
Seq3: GACGGCAGTGCATGGCTTTTGGGCAGT
Seq4: GATGGGTCAGCTTACCTCCTGGCCGGGTCA

Sequence Alignment
Considering the amino acid translation of the nucleotide sequences can make things easier:
Seq1: GAA GGA AGC TCC TGG TTA CTC CTG GGA TCC
Seq2: GAG GGT TCC --- TAT CTA TTA ATT GGT AGC
Seq3: GAC GGC AGT GCA TGG --- CTT TTG GGC AGT
Seq4: GAT GGG TCA GCT TAC CTC CTG GCC GGG TCA

Seq1: Glu Gly Ser Ser Trp Leu Leu Leu Gly Ser
Seq2: Glu Gly Ser  -  Tyr Leu Leu Ile Gly Ser
Seq3: Asp Gly Ser Ala Trp  -  Leu Leu Gly Ser
Seq4: Asp Gly Ser Ala Tyr Leu Leu Ala Gly Ser
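The translation step can be sketched as follows, assuming the standard genetic code and one-letter amino acid codes (the slide uses three-letter codes). The helper name `translate_aligned` is hypothetical; a whole-codon gap (`---`) is carried through as a single `-`.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_aligned(seq):
    """Translate a codon-aligned nucleotide sequence, keeping codon gaps."""
    seq = seq.upper().replace(" ", "")
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of three")
    protein = []
    for k in range(0, len(seq), 3):
        codon = seq[k:k + 3]
        if codon == "---":
            protein.append("-")                        # gap spanning a whole codon
        else:
            protein.append(CODON_TABLE.get(codon, "X"))  # X = untranslatable
    return "".join(protein)


print(translate_aligned("GAA GGA AGC TCC TGG TTA CTC CTG GGA TCC"))
print(translate_aligned("GAG GGT TCC --- TAT CTA TTA ATT GGT AGC"))
```

Aligning the translated sequences first and then mapping the gaps back onto the codons keeps insertions and deletions in multiples of three, which is what the slide's example does by hand.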


Genetic Distances

SIVcpz ATGGGTGCGA GAGCGTCAGT TCTAACAGGG GGAAAATTAG ATCGCTGGGA
HIV-1  ATGGGTGCGA GAGCGTCAGT ATTAAGCGGG GGAGAATTAG ATCGATGGGA

SIVcpz AAAAGTTCGG CTTAGGCCCG GGGGAAGAAA AAGATATATG ATGAAACATT
HIV-1  AAAAATTCGG TTAAGGCCAG GGGGAAAGAA AAAATATAAA TTAAAACATA

SIVcpz TAGTATGGGC AAGCAGGGAG CTGGAAAGAT TCGCATGTGA CCCCGGGCTA
HIV-1  TAGTATGGGC AAGCAGGGAG CTAGAACGAT TCGCAGTTAA TCCTGGCCTG

SIVcpz ATGGAAAGTA AGGAAGGATG TACTAAATTG TTACAACAAT TAGAGCCAGC
HIV-1  TTAGAAACAT CAGAAGGCTG TAGACAAATA CTGGGACAGC TACAACCATC

SIVcpz TCTCAAAACA GGCTCAGAAG GACTGCGGTC CTTGTTTAAC ACTCTGGCAG
HIV-1  CCTTCAGACA GGATCAGAAG AACTTAGATC ATTATATAAT ACAGTAGCAA

SIVcpz TACTGTGGTG CATACATAGT GACATCACTG TAGAAGACAC ACAGAAAGCT
HIV-1  CCCTCTATTG TGTGCATCAA AGGATAGAGA TAAAAGACAC CAAGGAAGCT

SIVcpz CTAGAACAGC TAAAGCGGCA TCATGGAGAA CAACAGAGCA AAACTGAAAG
HIV-1  TTAGACAAGA TAGAG--GAA -----GAGCA AAACAAAAGT AA---GAAAA

SIVcpz TAACTCAGGA AGCCGTGAAG GGGGAGCCAG TCAAGGCGCT AGTGCCTCTG
HIV-1  AAGCACAGCA AGC-----AG CAGCTGACA- -CAGGACAC- AG--CAGC--

SIVcpz CTGGCATTAG TGGAAATTAC
HIV-1  CAGG--TCAG CCAAAATTAC

420 sites, 121 differences, 22 indels

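The Multiple Substitution Problem
Two sequences can carry the same base at a site for different reasons:
Identical by descent: both sequences inherited the base from their common ancestor.
Convergent evolution: the same base arose independently in each lineage.
[Figure: small trees contrasting identity by descent with convergent evolution at a single site.]
Over time, multiple substitutions at the same site can produce identical bases that are not identical by descent.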

The Multiple Substitution Problem
After sufficient time, the sequence will be random because so many substitutions have occurred. For nucleotide sequences with equal base frequencies, this means that 25% of sites will be identical by chance. Hence maximum sequence divergence = 75%.

The Multiple Substitution Problem
When divergence is low, the observed number of changes is similar to the true number of changes (genetic distance). When divergence is high, the observed number of changes underestimates the true genetic distance ("saturation").
[Figure: genetic distance (0% to 200%) plotted against time. The actual distance keeps increasing while the observed distance levels off; the gap between the two curves is labelled "hidden information".]

The Multiple Substitution Problem
A statistical model of sequence evolution can be used to accurately estimate genetic distance. These are called nucleotide substitution models.
[Figure: genetic distance (0% to 200%) plotted against time, showing the actual and observed curves.]
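As an illustration of such a correction, the sketch below uses the closed-form Jukes-Cantor (JC69) distance, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The formula is the standard JC69 result rather than something spelled out on these slides, and the function names are assumptions.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites, ignoring columns with gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p):
    """JC69-corrected distance (substitutions per site) from proportion p."""
    if not 0 <= p < 0.75:
        # At p = 0.75 the sequences look random and the distance is undefined.
        raise ValueError("JC69 distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.01, 0.1, 0.406, 0.7):
    print(p, round(jc69_distance(p), 3))
```

For small p the corrected distance is almost equal to p, but as p approaches 0.75 the correction grows without bound, which is the saturation effect described above. An observed proportion of 0.406 gives a JC69 distance close to the 0.586 quoted in the example at the end of these notes.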

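Nucleotide substitution models
Time-reversible Markov process with four states (A, C, G, T).
Each nucleotide site evolves independently.
A and G are purines. C and T are pyrimidines.
a, b, c, d, e, f = relative rate parameters:
  a: A ↔ C (transversion)
  b: A ↔ G (transition)
  c: A ↔ T (transversion)
  d: C ↔ G (transversion)
  e: C ↔ T (transition)
  f: G ↔ T (transversion)
[Figure: the four bases joined by arrows labelled a to f; transitions are in red, transversions are in blue.]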

Nucleotide substitution models
Q is the instantaneous rate matrix of the Markov process. Elements represent the instantaneous rate of change from base X to base Y (rows and columns in the order A, C, G, T):

\[
Q = \begin{pmatrix}
\cdot & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu a \pi_A & \cdot & \mu d \pi_G & \mu e \pi_T \\
\mu b \pi_A & \mu d \pi_C & \cdot & \mu f \pi_T \\
\mu c \pi_A & \mu e \pi_C & \mu f \pi_G & \cdot
\end{pmatrix}
\]

Diagonal elements are set so that the rows sum to zero.
μ = nucleotide substitution rate
πx = frequency of base X (usually estimated from the data)
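A minimal sketch of how Q could be assembled from the relative rates and base frequencies, in plain Python. The function name `build_q` and the example parameter values are assumptions for illustration only.

```python
BASES = "ACGT"

# Which relative rate parameter applies to each unordered pair of bases.
PAIR_RATE = {
    ("A", "C"): "a", ("A", "G"): "b", ("A", "T"): "c",
    ("C", "G"): "d", ("C", "T"): "e", ("G", "T"): "f",
}


def build_q(rates, freqs, mu=1.0):
    """Build the 4x4 rate matrix with rows and columns in A, C, G, T order."""
    q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            name = PAIR_RATE.get((x, y)) or PAIR_RATE[(y, x)]
            # Rate of change from x to y: mu * relative rate * frequency of y.
            q[i][j] = mu * rates[name] * freqs[y]
        # Diagonal is still 0.0 here, so this makes the row sum to zero.
        q[i][i] = -sum(q[i])
    return q


# Illustrative values: transitions (b, e) twice as fast as transversions.
example_rates = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 2, "f": 1}
example_freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
for row in build_q(example_rates, example_freqs):
    print([round(v, 3) for v in row])
```

Because each off-diagonal entry is a symmetric rate multiplied by the frequency of the target base, the matrix satisfies π_X Q_XY = π_Y Q_YX, which is the time-reversibility property mentioned earlier.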

Nucleotide substitution models
Table 1.1. The models of nucleotide substitution

Model | Description | Parameter constraints | Equal freq.? | Reference
REV | The most general, time-reversible, Markov model | none | no | (e.g. Yang, 1994a)
TrN | TVs; purine TSs; and pyrimidine TSs | a=c=d=f | no | (Tamura and Nei, 1993)
HKY | TVs and TSs | a=c=d=f, b=e | no | (Hasegawa et al., 1985)
F81 | One substitution type | a=b=c=d=e=f | no | (Felsenstein, 1981)
K3ST | A-T, C-G TVs; A-C, G-T TVs; and TSs | a=f, c=d, b=e | yes | (Kimura, 1981)
K2P | TVs and TSs | a=c=d=f, b=e | yes | (Kimura, 1980)
JC69 | One substitution type | a=b=c=d=e=f | yes | (Jukes and Cantor, 1969)

Equal freq.? = equal base frequencies (πx)?

Nucleotide substitution models
Suppose we have 2 sequences, A & B, which have n sites. The genetic distance between them is t. The units of t are substitutions per site (μ × time).

A ---- t ---- B

If the nucleotide substitution rate of the sequences (μ) is known, then t can be converted into time (months or years).

Nucleotide substitution models
At site x, sequence A has base i and sequence B has base j. For this site, the probability distribution of the genetic distance t between A and B is the (i, j) element of the matrix exponential of Qt:

\[
P^{x}_{i,j}(t) = \left[ e^{Qt} \right]_{i,j}
\]

Thus the distribution of genetic distance, given A & B, is:

\[
P(t) = \prod_{x=1}^{n} P^{x}_{i,j}(t)
\]

The value of t can be estimated using maximum likelihood.
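The calculation can be sketched numerically as below. Several choices here are assumptions made for illustration: Q is the JC69 matrix scaled to one expected substitution per site per unit of t, the matrix exponential uses a simple scaling-and-squaring Taylor series, the sequences are ungapped, and the maximum is found by golden-section search. None of these details are given in the lecture.

```python
import math

BASES = "ACGT"
# JC69 rate matrix: every off-diagonal rate is 1/3, so the total rate is 1.
JC69_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """Matrix exponential by Taylor series on m / 2**squarings, then squaring."""
    n = len(m)
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def log_likelihood(t, seq_a, seq_b, q=JC69_Q):
    """Sum over sites of log P_ij(t), with P(t) = exp(Qt)."""
    p = mat_exp([[rate * t for rate in row] for row in q])
    return sum(math.log(p[BASES.index(x)][BASES.index(y)])
               for x, y in zip(seq_a, seq_b))


def ml_distance(seq_a, seq_b, lo=1e-6, hi=5.0, tol=1e-6):
    """Golden-section search for the t that maximises the log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
        if log_likelihood(c, seq_a, seq_b) > log_likelihood(d, seq_a, seq_b):
            b = d
        else:
            a = c
    return (a + b) / 2


print(round(ml_distance("ACGTACGTAC", "ACGTACGTTT"), 4))
```

Under JC69 the maximum-likelihood estimate agrees with the closed-form value -(3/4) ln(1 - 4p/3), which gives a convenient check on the numerical code. For richer models such as REV, the same approach applies with a different Q.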

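Nucleotide substitution models
Under the HKY model, the substitution probability is:

\[
P^{x}_{i,j}(t) =
\begin{cases}
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu t} + \left( \dfrac{\Pi_j - \pi_j}{\Pi_j} \right) e^{-\mu t (1 + \Pi_j (\kappa - 1))} & (i = j) \\[2ex]
\pi_j + \pi_j \left( \dfrac{1}{\Pi_j} - 1 \right) e^{-\mu t} - \left( \dfrac{\pi_j}{\Pi_j} \right) e^{-\mu t (1 + \Pi_j (\kappa - 1))} & \text{(transition)} \\[2ex]
\pi_j \left( 1 - e^{-\mu t} \right) & \text{(transversion)}
\end{cases}
\]

\[
\Pi_j =
\begin{cases}
\pi_A + \pi_G & \text{if } j \text{ is a purine} \\
\pi_C + \pi_T & \text{if } j \text{ is a pyrimidine}
\end{cases}
\qquad
\kappa = \text{transition rate / transversion rate}
\]

More complex models are calculated numerically.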
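The three cases of the HKY formula translate directly into code. The sketch below is illustrative only; the function name `hky_probability` and the example frequencies are assumptions.

```python
import math

PURINES = "AG"
PYRIMIDINES = "CT"


def hky_probability(i, j, t, freqs, kappa, mu=1.0):
    """P_ij(t) under HKY; freqs maps each base to its frequency pi."""
    pi_j = freqs[j]
    group = PURINES if j in PURINES else PYRIMIDINES
    big_pi = sum(freqs[base] for base in group)   # Pi_j in the formula
    e1 = math.exp(-mu * t)
    e2 = math.exp(-mu * t * (1 + big_pi * (kappa - 1)))
    if i == j:
        return pi_j + pi_j * (1 / big_pi - 1) * e1 + (big_pi - pi_j) / big_pi * e2
    if i in group:
        # i and j are in the same chemical class: transition.
        return pi_j + pi_j * (1 / big_pi - 1) * e1 - pi_j / big_pi * e2
    # Different classes: transversion.
    return pi_j * (1 - e1)


example_freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
for start in "ACGT":
    row = [hky_probability(start, end, 0.5, example_freqs, kappa=2.0)
           for end in "ACGT"]
    print(start, [round(v, 4) for v in row], round(sum(row), 6))
```

Two quick sanity checks on any such implementation: each row of probabilities should sum to one, and at t = 0 the probability of staying in the same state should be exactly one.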

Among-site rate heterogeneity
Some sites evolve slowly, others evolve rapidly.
Among-site rate heterogeneity models let μ vary among nucleotide sites.
The codon-position model defines 3 relative rates, one for each codon position. The third codon position usually evolves faster than positions 1 and 2.
The gamma model supposes that μ is distributed according to a one-parameter gamma distribution. The substitution probability P(t) is then integrated across this distribution.

Among-site rate heterogeneity The gamma model has one parameter, α.
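To give a feel for the effect of α, the sketch below uses the closed-form JC69 distance with gamma-distributed rates, d = (3/4) α [(1 - 4p/3)^(-1/α) - 1]. This formula is a standard result for the simplest model only; it is not taken from the slides, and the function name is an assumption.

```python
import math


def jc69_gamma_distance(p, alpha):
    """JC69 distance with gamma rate heterogeneity (shape parameter alpha)."""
    if not 0 <= p < 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # (1 - 4p/3) ** (-1/alpha) - 1, written with expm1 for numerical accuracy.
    return 0.75 * alpha * math.expm1(-math.log(1 - 4 * p / 3) / alpha)


for alpha in (0.5, 1.0, 5.0, 100.0):
    print(alpha, round(jc69_gamma_distance(0.406, alpha), 3))
```

Small values of α (strong heterogeneity) inflate the estimated distance for the same observed proportion of differences, while large values of α approach the plain JC69 estimate. This is in line with the much larger distance reported for the +gamma model in the example at the end of these notes.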

Genetic Distances: Assumptions
Insertions and deletions are ignored.
Substitutions are reversible (non-reversible models are possible, but computationally difficult).
Nucleotide base frequencies (πx) do not change through time (stationarity).
All nucleotide sites evolve independently. But correlation among sites may arise from:
  i. Epistatic interactions among mutations
  ii. DNA/RNA secondary structure (stem-loops, tRNA, rRNA)
  iii. Mutations that change more than one site at once

Genetic Distances: Example
Estimated genetic distances between SIVcpz and HIVlai, under different nucleotide substitution models:
Observed proportion of different sites = 0.406
Jukes-Cantor (JC69) = 0.586
Kimura 2-parameter (K2P) = 0.602
Hasegawa-Kishino-Yano (HKY) = 0.611
General reversible (REV) = 0.620
General reversible + gamma (REV+gamma) = 1.017