Evolutionary Analysis of Viral Genomes

Lecture 1: Quantifying Genetic Diversity
Oliver G. Pybus
Evolutionary Biology Group, Department of Zoology, University of Oxford, South Parks Road, Oxford OX1 3PS, U.K.
Fax: +44 1865 271249
http://evolve.zoo.ox.ac.uk

Information in Molecular Sequences
Biological sequences (DNA, RNA, protein) contain information about the processes and events that formed them. This evolutionary information is often scrambled, fragmentary, hidden, or lost. Our aim is to use mathematical models to recover and interpret this information.

Information in Molecular Sequences
  Genetic distances
  Phylogenetic relationships
  Rates of evolution
  Dates of historical events
  Evolutionary / Population Processes
    Recombination rates
    Migration rates among subpopulations
    Natural selection & adaptation
    Population size change

Information in Molecular Sequences
Mutation is the source of all sequence differences.
  Single nucleotide polymorphisms
    Silent / Replacement
    Transitions / Transversions
  Length polymorphisms
    Insertions / Deletions
Recombination generates new combinations of mutations.
Natural Selection and Genetic Drift act to change the frequency of mutations within a population.

Types of Mutation
Transitions (TS): purine-to-purine or pyrimidine-to-pyrimidine
  A ↔ G or C ↔ T
Transversions (TV): purine-to-pyrimidine, or the reverse
  A ↔ C, A ↔ T, G ↔ C or G ↔ T
Silent (synonymous): encoded amino acid is unchanged
Replacement (non-synonymous): encoded amino acid is changed
Example:
  Cys Ile Ser Thr Ser
  TGT ATC TCT ACG AGC
  TGT ATA TCT ATG AGC
  Cys Ile Ser Met Ser
ATC to ATA is a silent (synonymous) change: both codons encode Ile.
ACG to ATG is a replacement (non-synonymous) change: Thr becomes Met.
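
The two classifications on this slide can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and DNA (not RNA) letters; the function names `substitution_type` and `coding_effect` are illustrative and do not come from the lecture.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def substitution_type(x, y):
    """Classify a change between two different bases as TS or TV."""
    if x == y:
        raise ValueError("bases are identical, so there is no substitution")
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def coding_effect(codon_before, codon_after):
    """Return 'silent' if both codons encode the same amino acid."""
    same = CODON_TABLE[codon_before] == CODON_TABLE[codon_after]
    return "silent" if same else "replacement"


# The two changes in the example on this slide.
print(substitution_type("C", "A"), coding_effect("ATC", "ATA"))
print(substitution_type("C", "T"), coding_effect("ACG", "ATG"))
```

Note that the two classifications are independent: the silent change in the example is a transversion, while the replacement change is a transition.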

[Flow diagram: overview of the analysis pipeline. Each step lists the method used, with the contributing field in parentheses.]
MOLECULAR SEQUENCES
  → Alignment Methods (BIOINFORMATICS)
ALIGNMENT
  → Sequence Evolution Models (MOLECULAR EVOLUTION)
GENETIC DISTANCES
  → Phylogenetic Inference (PHYLOGENETICS)
EVOLUTIONARY TREE (time scale = genetic distance)
  → Molecular Clock Models (PHYLOGENETICS)
EVOLUTIONARY TREE (time scale = years)
  → Coalescent Models (POPULATION GENETICS)
POPULATION PROCESSES (e.g. adaptation, migration, population size change)

Sequence Alignment
Homology: similarity of a character among organisms due to inheritance from a shared common ancestor.
Positional Homology: equivalent nucleotide/amino acid positions within a sequence.
Alignment: a proposed assignment of positional homology for a set of gene/protein sequences.

Sequence Alignment
For example, how should these two sequences be aligned?
  Seq1: ATGCGTCGTT
  Seq2: ATCCGCGTC

Sequence Alignment
Like this?
  Seq1: ATGCGTCGTT
  Seq2: ATCCG-CGTC
  (7 homologous sites + 2 mismatches + 1 insertion/deletion)
Or like this?
  Seq1: AT--GCGTCGTT
  Seq2: ATCCGCGTC---
  (7 homologous sites + 0 mismatches + 2 insertions/deletions)
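
The counts quoted for the two candidate alignments can be reproduced with a short script. This is a sketch under one assumption: a run of consecutive gap characters in the same sequence is counted as a single insertion/deletion event, which is how the slide appears to count them. The function name `alignment_summary` is illustrative.

```python
def alignment_summary(row1, row2, gap="-"):
    """Count matching sites, mismatches and indel events in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = indel_events = 0
    current_gap_row = None  # which row (1 or 2) the current gap run is in
    for x, y in zip(row1, row2):
        if x == gap and y == gap:
            continue  # a column of gaps only carries no information
        gap_row = 1 if x == gap else 2 if y == gap else None
        if gap_row is None:
            if x == y:
                matches += 1
            else:
                mismatches += 1
        elif gap_row != current_gap_row:
            indel_events += 1  # a new gap run starts here
        current_gap_row = gap_row
    return matches, mismatches, indel_events


print(alignment_summary("ATGCGTCGTT", "ATCCG-CGTC"))      # (7, 2, 1)
print(alignment_summary("AT--GCGTCGTT", "ATCCGCGTC---"))  # (7, 0, 2)
```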

Sequence Alignment
Most alignment methods start by assigning relative weights to mismatches versus insertions/deletions. Different types of mismatch (e.g. transitions and transversions) can be weighted differently. The weights are used to calculate a total score for each possible alignment. Algorithms then search for the alignment with the best total score.
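
To make the idea of weights and a best-scoring alignment concrete, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). The lecture does not name a specific algorithm, and the weights used here (match +1, mismatch -1, gap -2 per gap character) are arbitrary illustrative choices, not values from the slides.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Total score of a given alignment under simple per-column weights."""
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def global_align(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned s1, aligned s2) for a global alignment."""
    n, m = len(s1), len(s2)

    def sub(i, j):
        # Score for aligning s1[i-1] with s2[j-1].
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # score[i][j] = best score for aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two bases
                score[i - 1][j] + gap,            # gap in s2
                score[i][j - 1] + gap,            # gap in s1
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


best, row1, row2 = global_align("ATGCGTCGTT", "ATCCGCGTC")
print(best)
print(row1)
print(row2)
```

Changing the weights changes which alignment wins: a heavier gap penalty favours mismatches over indels, and a lighter one does the opposite. Real alignment programs typically use more elaborate gap penalties and substitution weights than this sketch.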

Sequence Alignment
ClustalX is a commonly used alignment program. Alignment algorithms provide a useful first draft. Further adjustment by hand is often needed to correct errors.

Sequence Alignment
Many pathogen nucleotide sequences are highly divergent and are therefore difficult to align:
  Seq1: GAAGGAAGCTCCTGGTTACTCCTGGGATCC
  Seq2: GAGGGTTCCTATCTATTAATTGGTAGC
  Seq3: GACGGCAGTGCATGGCTTTTGGGCAGT
  Seq4: GATGGGTCAGCTTACCTCCTGGCCGGGTCA

Sequence Alignment
Considering the amino acid translation of the nucleotide sequences can make things easier:
  Seq1: GAA GGA AGC TCC TGG TTA CTC CTG GGA TCC
  Seq2: GAG GGT TCC --- TAT CTA TTA ATT GGT AGC
  Seq3: GAC GGC AGT GCA TGG --- CTT TTG GGC AGT
  Seq4: GAT GGG TCA GCT TAC CTC CTG GCC GGG TCA
  Seq1: Glu Gly Ser Ser Trp Leu Leu Leu Gly Ser
  Seq2: Glu Gly Ser  -  Tyr Leu Leu Ile Gly Ser
  Seq3: Asp Gly Ser Ala Trp  -  Leu Leu Gly Ser
  Seq4: Asp Gly Ser Ala Tyr Leu Leu Ala Gly Ser
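
The translation step can be sketched as follows, assuming the standard genetic code, an in-frame alignment, and gaps that always span whole codons (`---`). The function name `translate_aligned` is illustrative, and one-letter amino acid codes are used instead of the three-letter codes on the slide.

```python
# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_aligned(seq):
    """Translate a codon-aligned sequence; a whole-codon gap becomes '-'."""
    seq = seq.replace(" ", "")
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    return "".join("-" if c == "---" else CODON_TABLE[c] for c in codons)


codon_alignment = {
    "Seq1": "GAA GGA AGC TCC TGG TTA CTC CTG GGA TCC",
    "Seq2": "GAG GGT TCC --- TAT CTA TTA ATT GGT AGC",
    "Seq3": "GAC GGC AGT GCA TGG --- CTT TTG GGC AGT",
    "Seq4": "GAT GGG TCA GCT TAC CTC CTG GCC GGG TCA",
}
for name, seq in codon_alignment.items():
    print(name, translate_aligned(seq))
```

The protein rows differ at few positions even though the nucleotide rows differ at many, which is why aligning at the amino acid level and then mapping back to codons is often easier for divergent sequences.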

Genetic Distances
Alignment of SIVcpz and HIV-1: 420 sites, 121 differences, 22 indels
  SIVcpz  ATGGGTGCGA GAGCGTCAGT TCTAACAGGG GGAAAATTAG ATCGCTGGGA
  HIV-1   ATGGGTGCGA GAGCGTCAGT ATTAAGCGGG GGAGAATTAG ATCGATGGGA
  SIVcpz  AAAAGTTCGG CTTAGGCCCG GGGGAAGAAA AAGATATATG ATGAAACATT
  HIV-1   AAAAATTCGG TTAAGGCCAG GGGGAAAGAA AAAATATAAA TTAAAACATA
  SIVcpz  TAGTATGGGC AAGCAGGGAG CTGGAAAGAT TCGCATGTGA CCCCGGGCTA
  HIV-1   TAGTATGGGC AAGCAGGGAG CTAGAACGAT TCGCAGTTAA TCCTGGCCTG
  SIVcpz  ATGGAAAGTA AGGAAGGATG TACTAAATTG TTACAACAAT TAGAGCCAGC
  HIV-1   TTAGAAACAT CAGAAGGCTG TAGACAAATA CTGGGACAGC TACAACCATC
  SIVcpz  TCTCAAAACA GGCTCAGAAG GACTGCGGTC CTTGTTTAAC ACTCTGGCAG
  HIV-1   CCTTCAGACA GGATCAGAAG AACTTAGATC ATTATATAAT ACAGTAGCAA
  SIVcpz  TACTGTGGTG CATACATAGT GACATCACTG TAGAAGACAC ACAGAAAGCT
  HIV-1   CCCTCTATTG TGTGCATCAA AGGATAGAGA TAAAAGACAC CAAGGAAGCT
  SIVcpz  CTAGAACAGC TAAAGCGGCA TCATGGAGAA CAACAGAGCA AAACTGAAAG
  HIV-1   TTAGACAAGA TAGAG--GAA -----GAGCA AAACAAAAGT AA---GAAAA
  SIVcpz  TAACTCAGGA AGCCGTGAAG GGGGAGCCAG TCAAGGCGCT AGTGCCTCTG
  HIV-1   AAGCACAGCA AGC-----AG CAGCTGACA- -CAGGACAC- AG--CAGC--
  SIVcpz  CTGGCATTAG TGGAAATTAC
  HIV-1   CAGG--TCAG CCAAAATTAC

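The Multiple Substitution Problem
Two sequences can share the same base at a site for two different reasons:
  Identical by descent: both sequences inherited the base unchanged from their common ancestor.
  Convergent evolution: the shared base arose through independent substitutions.
Over time, multiple substitutions can generate sequence homoplasy (identity that does not reflect shared descent).
[Figure: example trees for the two cases; the bases shown at the tips and nodes (A, G, C) cannot be reliably placed from the extracted text.]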

The Multiple Substitution Problem
After sufficient time, the sequence will be random because so many substitutions have occurred. For nucleotide sequences with equal base frequencies, this means that 25% of sites will be identical by chance. Hence maximum sequence divergence = 75%.

The Multiple Substitution Problem
When divergence is low, the observed number of changes is similar to the true number of changes (genetic distance). When divergence is high, the observed number of changes underestimates the true genetic distance (saturation).
[Figure: genetic distance (0% to 200%) against time, with curves labelled "Actual" and "Observed"; the gap between them is labelled "Hidden Information".]
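
Saturation can be illustrated numerically. The sketch below assumes the simplest substitution model (Jukes-Cantor, JC69: equal base frequencies and a single substitution rate), under which the expected proportion of differing sites p for a true distance d is p = 3/4 (1 - e^(-4d/3)). The choice of model and the function name are assumptions made for illustration only.

```python
import math


def expected_observed_difference(d):
    """Expected proportion of differing sites at true distance d (JC69 assumption)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


# True distance (substitutions per site) versus what would be observed.
for d in (0.05, 0.25, 0.5, 1.0, 2.0, 5.0):
    p = expected_observed_difference(d)
    print(f"true distance {d:4.2f}  observed {p:5.3f}  hidden {d - p:5.3f}")
```

At small distances the two quantities nearly coincide; at large distances the observed proportion approaches, but never exceeds, the 75% ceiling.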

The Multiple Substitution Problem
A statistical model of sequence evolution can be used to accurately estimate genetic distance. These are called nucleotide substitution models.
[Figure: genetic distance (0% to 200%) against time, with curves labelled "Actual" and "Observed".]

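Nucleotide substitution models
Time-reversible Markov process with four states (A, C, G, T).
Each nucleotide site evolves independently.
A and G are purines. C and T are pyrimidines.
a, b, c, d, e, f = relative rate parameters:
  a: A ↔ C (transversion)
  b: A ↔ G (transition)
  c: A ↔ T (transversion)
  d: C ↔ G (transversion)
  e: C ↔ T (transition)
  f: G ↔ T (transversion)
[Figure: the four bases drawn at the corners of a square and joined by arrows labelled a to f; transitions are in red, transversions are in blue.]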

Nucleotide substitution models
Q is the instantaneous rate matrix of the Markov process. Elements represent the instantaneous rate of change from base X to base Y (rows and columns in the order A, C, G, T).
\[
Q = \begin{pmatrix}
\cdot & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu a \pi_A & \cdot & \mu d \pi_G & \mu e \pi_T \\
\mu b \pi_A & \mu d \pi_C & \cdot & \mu f \pi_T \\
\mu c \pi_A & \mu e \pi_C & \mu f \pi_G & \cdot
\end{pmatrix}
\]
Diagonal elements are set so that the rows sum to zero.
μ = nucleotide substitution rate
πx = frequency of base X (usually estimated from the data)
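
As a sketch of how such a matrix might be assembled, the code below builds Q from the six relative rates a to f, the base frequencies πx and the rate μ, then fills the diagonal so that each row sums to zero. The function name `build_rate_matrix` and all numerical values are hypothetical, chosen only for illustration.

```python
BASES = "ACGT"


def build_rate_matrix(rates, freqs, mu=1.0):
    """Build the 4x4 rate matrix Q (rows and columns in the order A, C, G, T).

    rates: relative rates keyed "a".."f" (a=A-C, b=A-G, c=A-T, d=C-G, e=C-T, f=G-T)
    freqs: base frequencies keyed by base
    """
    pair_rate = {
        ("A", "C"): rates["a"], ("A", "G"): rates["b"], ("A", "T"): rates["c"],
        ("C", "G"): rates["d"], ("C", "T"): rates["e"], ("G", "T"): rates["f"],
    }
    Q = []
    for x in BASES:
        row = []
        for y in BASES:
            if x == y:
                row.append(0.0)  # placeholder, filled in below
            else:
                relative = pair_rate[tuple(sorted((x, y)))]
                row.append(mu * relative * freqs[y])
        row[BASES.index(x)] = -sum(row)  # diagonal makes the row sum to zero
        Q.append(row)
    return Q


# Hypothetical values: transitions (b, e) twice as fast as transversions.
example_rates = {"a": 1.0, "b": 2.0, "c": 1.0, "d": 1.0, "e": 2.0, "f": 1.0}
example_freqs = {"A": 0.35, "C": 0.15, "G": 0.25, "T": 0.25}
for base, row in zip(BASES, build_rate_matrix(example_rates, example_freqs)):
    print(base, ["%7.3f" % value for value in row])
```

Because each off-diagonal entry is a symmetric relative rate multiplied by the frequency of the target base, the matrix satisfies πx Q[x][y] = πy Q[y][x], which is the time-reversibility property of these models.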

Nucleotide substitution models
Table 1.1. The models of nucleotide substitution
Model  Description                                       Parameter constraints  Equal freq.?  Reference
REV    The most general, time-reversible, Markov model   none                   no            (e.g. Yang, 1994a)
TrN    TVs; purine TSs; and pyrimidine TSs               a=c=d=f                no            (Tamura and Nei, 1993)
HKY    TVs and TSs                                       a=c=d=f, b=e           no            (Hasegawa et al., 1985)
F81    One substitution type                             a=b=c=d=e=f            no            (Felsenstein, 1981)
K3ST   A-T, C-G TVs; A-C, G-T TVs; and TSs               a=f, c=d, b=e          yes           (Kimura, 1981)
K2P    TVs and TSs                                       a=c=d=f, b=e           yes           (Kimura, 1980)
JC69   One substitution type                             a=b=c=d=e=f            yes           (Jukes and Cantor, 1969)
Equal freq.? = equal base frequencies (πx)?

Nucleotide substitution models
Suppose we have 2 sequences, A & B, which have n sites. The genetic distance between them is t. The units of t are substitutions per site (μ × time).
  A ---- t ---- B
If the nucleotide substitution rate of the sequences (μ) is known, then t represents time (months or years).

Nucleotide substitution models
At site x, sequence A has base i and sequence B has base j. For this site, the probability distribution of the genetic distance t between A and B is the (i, j) element of the matrix exponential of Qt:
\[
P^{x}_{i,j}(t) = \left[ e^{Qt} \right]_{i,j}
\]
Thus the distribution of genetic distance, given A & B, is the product over all n sites:
\[
P(t) = \prod_{x=1}^{n} P^{x}_{i,j}(t)
\]
The value t can be estimated using maximum likelihood.
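
A minimal sketch of the maximum likelihood step, under a deliberate simplification: instead of a general Q, it assumes the Jukes-Cantor (JC69) model, for which the substitution probabilities have the closed form P(same) = 1/4 + 3/4 e^(-4t/3) and P(different) = 1/4 - 1/4 e^(-4t/3), with t in substitutions per site. The search method (golden-section) and all function names are illustrative choices, not the lecture's.

```python
import math


def jc69_log_likelihood(t, seq_a, seq_b):
    """Log of the product over sites of P_ij(t), assuming the JC69 model."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return sum(math.log(p_same if x == y else p_diff)
               for x, y in zip(seq_a, seq_b))


def ml_distance(seq_a, seq_b, lo=1e-6, hi=5.0, iterations=100):
    """Golden-section search for the t that maximises the log-likelihood.

    Assumes a single maximum on [lo, hi]; the upper bound of 5 is arbitrary.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iterations):
        left = hi - ratio * (hi - lo)
        right = lo + ratio * (hi - lo)
        if jc69_log_likelihood(left, seq_a, seq_b) < jc69_log_likelihood(right, seq_a, seq_b):
            lo = left   # the maximum lies to the right of 'left'
        else:
            hi = right  # the maximum lies to the left of 'right'
    return (lo + hi) / 2.0


# Two aligned sequences without gaps: 9 sites, 2 differences.
seq_a = "ATGCGCGTT"
seq_b = "ATCCGCGTC"
t_hat = ml_distance(seq_a, seq_b)
p = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
print("numerical ML estimate:", round(t_hat, 4))
print("analytic JC69 value:  ", round(-0.75 * math.log(1.0 - 4.0 * p / 3.0), 4))
```

For JC69 the numerical optimum should agree with the analytic formula; for richer models no such formula exists in general and the numerical search is the only route.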

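Nucleotide substitution models
Under the HKY model, the substitution probability is:
\[
P^{x}_{i,j}(t) =
\begin{cases}
\pi_j + \pi_j\left(\frac{1}{\Pi_j} - 1\right)e^{-\mu t} + \left(\frac{\Pi_j - \pi_j}{\Pi_j}\right)e^{-\mu t\,(1 + \Pi_j(\kappa - 1))} & (i = j) \\[4pt]
\pi_j + \pi_j\left(\frac{1}{\Pi_j} - 1\right)e^{-\mu t} - \left(\frac{\pi_j}{\Pi_j}\right)e^{-\mu t\,(1 + \Pi_j(\kappa - 1))} & \text{(transition)} \\[4pt]
\pi_j\left(1 - e^{-\mu t}\right) & \text{(transversion)}
\end{cases}
\]
\[
\Pi_j =
\begin{cases}
\pi_A + \pi_G & \text{if } j \text{ is a purine} \\
\pi_C + \pi_T & \text{if } j \text{ is a pyrimidine}
\end{cases}
\qquad
\kappa = \text{transition rate / transversion rate}
\]
More complex models are calculated numerically.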
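
The HKY probabilities can be coded directly from the three cases (same base, transition, transversion). In this sketch `big_pi` stands for the summed frequency of the class (purine or pyrimidine) that base j belongs to, and `kappa` for the transition/transversion rate ratio; the base frequencies, `kappa` and `mu * t` values used in the example are hypothetical.

```python
import math

PURINES = "AG"


def hky_probability(i, j, t, freqs, kappa, mu=1.0):
    """Probability that base i is replaced by base j after time t (HKY model)."""
    pi_j = freqs[j]
    if j in PURINES:
        big_pi = freqs["A"] + freqs["G"]
    else:
        big_pi = freqs["C"] + freqs["T"]
    e1 = math.exp(-mu * t)
    e2 = math.exp(-mu * t * (1.0 + big_pi * (kappa - 1.0)))
    if i == j:
        return pi_j + pi_j * (1.0 / big_pi - 1.0) * e1 + (big_pi - pi_j) / big_pi * e2
    if (i in PURINES) == (j in PURINES):  # both purines or both pyrimidines
        return pi_j + pi_j * (1.0 / big_pi - 1.0) * e1 - pi_j / big_pi * e2
    return pi_j * (1.0 - e1)  # transversion


# Hypothetical parameter values, for illustration only.
example_freqs = {"A": 0.35, "C": 0.15, "G": 0.25, "T": 0.25}
for i in "ACGT":
    row = [hky_probability(i, j, t=0.5, freqs=example_freqs, kappa=4.0) for j in "ACGT"]
    print(i, ["%.4f" % p for p in row], "row sum = %.6f" % sum(row))
```

Two sanity checks follow from the formula: each row of probabilities sums to one, and as t grows the probability of ending at base j tends to its frequency πj regardless of the starting base.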

Among-site rate heterogeneity
Some sites evolve slowly, others evolve rapidly. Among-site rate heterogeneity models let μ vary among nucleotide sites.
The codon-position model defines 3 relative rates, one for each codon position. The third codon position usually evolves faster than positions 1 and 2.
The gamma model supposes that μ is distributed according to a one-parameter gamma distribution. The substitution probability P(t) is then integrated across this distribution.

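Among-site rate heterogeneity
The gamma model has one parameter, α. Small values of α (below 1) correspond to strong rate variation among sites; large values correspond to nearly equal rates.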
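
The effect of α on an estimated distance can be sketched with the one case where the integration over the gamma distribution has a simple closed form: the Jukes-Cantor model. Averaging e^(-4rt/3) over gamma-distributed relative rates r (mean 1, shape α) gives (1 + 4t/(3α))^(-α), which leads to the corrected distance d = (3α/4) [(1 - 4p/3)^(-1/α) - 1] for an observed proportion of differences p. Using JC69 here is an illustrative simplification; the value of p below is hypothetical.

```python
import math


def jc69_distance(p):
    """JC69 distance with equal rates at all sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """JC69 distance when site rates follow a gamma distribution with shape alpha."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


p = 0.3  # hypothetical observed proportion of differing sites
print("equal rates:", round(jc69_distance(p), 3))
for alpha in (0.25, 0.5, 1.0, 5.0, 100.0):
    print("alpha =", alpha, "->", round(jc69_gamma_distance(p, alpha), 3))
```

Stronger rate heterogeneity (smaller α) implies more hidden substitutions at the fast sites, so the corrected distance grows as α falls and approaches the equal-rates value as α becomes large.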

Genetic Distances: Assumptions
Insertions and deletions are ignored.
Substitutions are reversible (non-reversible models are possible, but computationally difficult).
Nucleotide base frequencies (πx) do not change through time (stationarity).
All nucleotide sites evolve independently. But correlation among sites may arise from:
  i. Epistatic interactions among mutations
  ii. DNA/RNA secondary structure (stem-loops, tRNA, rRNA)
  iii. Mutations that change more than one site at once

Genetic Distances: Example
Estimated genetic distances (substitutions per site) between SIVcpz and HIVlai, under different nucleotide substitution models:
  Observed proportion of different sites    = 0.406
  Jukes-Cantor (JC69)                       = 0.586
  Kimura 2-parameter (K2P)                  = 0.602
  Hasegawa-Kishino-Yano (HKY)               = 0.611
  General reversible (REV)                  = 0.620
  General reversible + gamma (REV+gamma)    = 1.017
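
The two simplest corrections in this list have well-known closed forms, sketched below: JC69, d = -3/4 ln(1 - 4p/3), and K2P, d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the observed proportions of transitional and transversional differences. Applying the JC69 formula to the rounded observed value 0.406 gives roughly 0.585, close to the 0.586 quoted above (the small gap is plausibly due to rounding of the observed proportion). The slide does not report how the differences split into transitions and transversions, so the split used for K2P below is hypothetical.

```python
import math


def jc69_distance(p):
    """Jukes-Cantor distance from the observed proportion of differences p."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(p_ts, p_tv):
    """Kimura 2-parameter distance from transition and transversion proportions."""
    return (-0.5 * math.log(1.0 - 2.0 * p_ts - p_tv)
            - 0.25 * math.log(1.0 - 2.0 * p_tv))


observed = 0.406  # observed proportion of different sites, from the slide
print("JC69:", round(jc69_distance(observed), 3))

# Hypothetical split of the 0.406 into transitions and transversions.
print("K2P (hypothetical split):", round(k2p_distance(0.25, 0.156), 3))

# When transitions are one third of all differences, K2P reduces to JC69.
print("K2P (1/3 transitions):", round(k2p_distance(observed / 3, 2 * observed / 3), 3))
```

The general pattern in the table, where richer models give larger distances, reflects the fact that each added parameter lets the model account for more hidden substitutions.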