Molecular evolution 2 Please sit in row K or forward
RBFD: cat, mouse, parasite Toxoplasma gondii cyst in a mouse brain http://phenomena.nationalgeographic.com/2013/04/26/mind-bending-parasite-permanently-quells-cat-fear-in-mice/ Credit: Jitinder P. Dubey https://commons.wikimedia.org/wiki/file:kittyply_edit1.jpg https://commons.wikimedia.org/wiki/file:мышь_2.jpg http://commons.wikimedia.org/wiki/file:phylogenetic_tree_of_life.png
Topics for the next few days
- The HIV genome in context
- Phylogenetic trees of hosts vs. pathogens: introducing the HIV/SIV case
- Phylogenetic reconstruction methods
- Constructing a distance matrix between sequences
- What is a sequence alignment?
- The Jukes-Cantor correction
- The neighbor-joining algorithm to make a tree from distances
The HIV genome in context
Variation in protein-coding gene count

Genome            Protein-coding genes
Human             ~20,000
E. coli           ~4000
Mimivirus         ~1000
M. genitalium     ~500
HIV               9
LINE transposon   1
SINE transposon   0
HIV (entire genome) E. coli (~25 thousand bp on the E. coli chromosome) Human (~1 million bp on human chromosome 4) https://commons.wikimedia.org/wiki/file:hiv-genome.png
HIV and SIV https://www.flickr.com/photos/23993953@n04/13079389505 http://www.evoanth.net/2015/05/23/what-does-a-chimp-look-for-in-a-tool/ http://pin.primate.wisc.edu/factsheets/entry/sooty_mangabey https://www.flickr.com/photos/berniedup/7692054594
Question: If we made a tree for the HIV/SIV sequences infecting these primates, how would it compare with this host tree? Hosts on the tree: Gabon talapoin, sooty mangabey, drill, chimp, human. Phylogeny from Perelman et al. (2011), "A molecular phylogeny of living primates."
The data

Strain and host        Sequence
HIV1_human_a           TTTTTTGGGTTTGGC...
HIV1_human_b           TTTTTTGGGGTTTGGC...
HIV1_human_c           TTTTTTGGCTCTGGC...
HIV1_human_d           TTTTTTGGGGGCTGGT...
HIV1_human_e           TTTTTTGGGGTCTGGC...
SIV_chimp_a            TTTTTTGGGCGCCCC...
SIV_chimp_b            TTTTTTGGGGGGCTGGC...
HIV2_human_a           TTTTTTGGGTGGGCTCC...
HIV2_human_b           TTTTTTGGGTTGGCCCT...
HIV2_human_c           TTTTTTGGGTTTGGCCCT...
SIV_sootyMangabey_a    TTTTTTGGTTTGGCCCT...
SIV_sootyMangabey_b    TTTTTTGGTTTGGTCCTT...
SIV_drill              TTTTTTGGGTCTCCCT...
SIV_gabonTalapoin      TTTTTTGGGGTCTTTTT...
HTLV-1                 TGCGCTGGCCCTTCCT...
Representing nucleic acid molecules on a computer
UUUUUUGGGGUCUGGCCUUCCUCGGG
By convention, we represent the molecule as a single string written 5' to 3'.
Representing nucleic acid molecules on a computer
5' TTTTTTGGGGTCTGGCCTTCCTCGGG 3'
3' AAAAAACCCCAGACCGGAAGGAGCCC 5'
Either strand is OK as the single-string representation, written 5' to 3':
TTTTTTGGGGTCTGGCCTTCCTCGGG or CCCGAGGAAGGCCAGACCCCAAAAAA
The two strings are called reverse complements.
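A minimal Python sketch of the string convention, assuming uppercase sequences containing only A, C, G, T (or U for RNA). The function names `rna_to_dna` and `reverse_complement` are illustrative choices, not something defined in the lecture.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rna_to_dna(seq: str) -> str:
    """Write an RNA string in the DNA alphabet (U becomes T)."""
    return seq.replace("U", "T")


def reverse_complement(seq: str) -> str:
    """Return the other strand of a DNA string, also read 5' to 3'."""
    # Complement every base, then reverse so the result reads 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]


top = rna_to_dna("UUUUUUGGGGUCUGGCCUUCCUCGGG")
print(top)                      # the same molecule in the DNA alphabet
print(reverse_complement(top))  # the complementary strand, 5' to 3'
```

Applying `reverse_complement` twice returns the original string, which is one way to see why either strand can stand in for the double-stranded molecule.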
Trees and distances: substitutions often accumulate (roughly) in proportion to time.
[Figure: tree with tips A, B, C, D drawn against a time axis]
What is a sequence alignment?
GTCGGT
GTCCGCT
Our goal: to obtain distances between sequences by estimating the number of substitutions.
What is a sequence alignment?
GTCGGT
GTCCGCT
Alignment process:
G-T--CGGT
GTCCGCT
Distances from alignments: estimating the number of substitutions
Partial alignment from the gag gene:
SIV_deBrazzaMonkey: TTTCTGGGTT
HIV2_human_a:       TGT------GCGGT
Ignore sites with a gap character.
How many substitutions occurred between these two sequences since their last common ancestor?
Number of sequence differences at non-gap sites: 6
Does this mean there were 6 substitutions?
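Counting differences at non-gap sites can be sketched in Python as follows. The two aligned strings are a made-up toy example, not the gag alignment from the slide, and the function name `count_differences` is hypothetical.

```python
def count_differences(a: str, b: str, gap: str = "-") -> tuple[int, int]:
    """Return (differences, compared_sites) for two aligned sequences.

    Columns where either sequence has a gap character are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    diffs = 0
    sites = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip any column that contains a gap
        sites += 1
        if x != y:
            diffs += 1
    return diffs, sites


# Toy alignment (hypothetical data, for illustration only).
seq1 = "GATTACA-GT"
seq2 = "GACTA--AGA"
diffs, sites = count_differences(seq1, seq2)
print(diffs, sites, diffs / sites)  # differences, non-gap sites, proportion p
```

The proportion `diffs / sites` is the observed fraction of differing sites, which is the quantity a correction for multiple hits would take as input.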
Number of substitutions vs. number of observed differences
[Figure: example sequences accumulating substitutions over time]
Number of substitutions vs. number of observed differences
[Figure: the same example, comparing the final sequences]
The number of observed differences (3) is less than the true number of substitutions (5).
Correcting for multiple hits with the probabilistic Jukes-Cantor model
Model a single nucleotide position over a series of discrete time steps.
[Figure: the nucleotide states A, C, G and T]
Correcting for multiple hits with the probabilistic Jukes-Cantor model
Model a single nucleotide position over a series of discrete time steps.
$P_A(0) = 1$, $P_A(1) = \;?$
Correcting for multiple hits with the probabilistic Jukes-Cantor model
Model a single nucleotide position over a series of discrete time steps.
$P_A(0) = 1$, $P_A(1) = 1 - 3\alpha$
(Here $\alpha$ is the probability, per time step, of changing to any one particular other nucleotide.)
Two ways we can have nucleotide n at time t+1:
1. Nucleotide n present at time t, nucleotide n present at time t+1 (stays the same).
2. A nucleotide other than n present at time t, nucleotide n present at time t+1 (changes to n).
Write an expression for $P_n(t+1)$ in terms of $P_n(t)$ and $\alpha$:
$P_n(t+1) = $
Two ways we can have nucleotide n at time t+1:
1. Nucleotide n present at time t, nucleotide n present at time t+1 (stays the same).
2. A nucleotide other than n present at time t, nucleotide n present at time t+1 (changes to n).
Write an expression for $P_n(t+1)$ in terms of $P_n(t)$ and $\alpha$:
$P_n(t+1) = (1 - 3\alpha)\,P_n(t) + \alpha\,[1 - P_n(t)]$
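Iterating the one-step expression numerically shows how the probability behaves over many time steps. A minimal sketch, assuming an arbitrary illustrative value of alpha (0.01) and a starting probability of 1; the function names are hypothetical.

```python
def step(p: float, alpha: float) -> float:
    """One time step: stay n with prob. 1 - 3*alpha, or change to n with prob. alpha."""
    return (1 - 3 * alpha) * p + alpha * (1 - p)


def trajectory(p0: float, alpha: float, steps: int) -> list[float]:
    """Return [P_n(0), P_n(1), ..., P_n(steps)]."""
    probs = [p0]
    for _ in range(steps):
        probs.append(step(probs[-1], alpha))
    return probs


probs = trajectory(p0=1.0, alpha=0.01, steps=500)
print(probs[1])    # after one step: 1 - 3*alpha
print(probs[-1])   # after many steps: approaches 1/4
```

Starting from 0 instead of 1 gives a trajectory that rises toward the same limiting value, since the recursion shrinks the distance from 1/4 by a factor of (1 - 4*alpha) at every step.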
Expression for the change in probability of nucleotide n, arising over one time step, from time t to time t+1.
Expression for the change in probability of nucleotide n, arising over one time step, from time t to time t+1.
$\Delta P_n(t) = P_n(t+1) - P_n(t)$
$\Delta P_n(t) = P_n(t) - 3\alpha P_n(t) + \alpha - \alpha P_n(t) - P_n(t)$
$\Delta P_n(t) = -4\alpha P_n(t) + \alpha$
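From $P_n(0)$ to $P_n(t)$ in continuous time (Math 45):
$\frac{dP_n(t)}{dt} = -4\alpha P_n(t) + \alpha$
$P_n(t) = \frac{1}{4} + \left(P_n(0) - \frac{1}{4}\right) e^{-4\alpha t}$
Probability of j at time t given j at time 0: $P_{jj}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
Probability of k at time t given j at time 0: $P_{jk}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$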
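One way to get from the differential equation to its solution is sketched below. The substitution $u(t) = P_n(t) - \frac{1}{4}$ is a choice made here for illustration; the lecture may have solved it differently.

```latex
\begin{align*}
\frac{dP_n(t)}{dt} &= -4\alpha P_n(t) + \alpha
  && \text{the rate equation for nucleotide } n \\
u(t) &= P_n(t) - \tfrac{1}{4}
  && \text{measure the distance from } \tfrac{1}{4} \\
\frac{du}{dt} &= -4\alpha\left(u + \tfrac{1}{4}\right) + \alpha = -4\alpha u
  && \text{the constant terms cancel} \\
u(t) &= u(0)\, e^{-4\alpha t}
  && \text{exponential decay} \\
P_n(t) &= \tfrac{1}{4} + \left(P_n(0) - \tfrac{1}{4}\right) e^{-4\alpha t} \\
P_n(0) = 1 &\;\Rightarrow\; P_{jj}(t) = \tfrac{1}{4} + \tfrac{3}{4} e^{-4\alpha t} \\
P_n(0) = 0 &\;\Rightarrow\; P_{jk}(t) = \tfrac{1}{4} - \tfrac{1}{4} e^{-4\alpha t}
\end{align*}
```

Setting $P_n(0) = 1$ covers the case where the position starts as nucleotide n, and $P_n(0) = 0$ covers the case where it starts as a different nucleotide.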
Worksheet (Rip it off from the back of your packet) Name:
$P_{jj}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
$P_{jk}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$
1. If the nucleotide is C at time 0, i.e., $P_C(0) = 1$, what is the probability it is C after a long time?
2. If the nucleotide is C at time 0, i.e., $P_C(0) = 1$, what is the probability it is G after a long time?
3. What are the equilibrium nucleotide frequencies we would expect as a result of this process? In other words, if we had a long sequence, and this process was happening at every position, what frequencies of As, Cs, Gs and Ts would we expect after a long time?
Worksheet (Rip it off from the back of your packet) Name:
$P_{jj}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
$P_{jk}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$
1. If the nucleotide is C at time 0, i.e., $P_C(0) = 1$, what is the probability it is C after a long time? 0.25
2. If the nucleotide is C at time 0, i.e., $P_C(0) = 1$, what is the probability it is G after a long time? 0.25
3. What are the equilibrium nucleotide frequencies we would expect as a result of this process? In other words, if we had a long sequence, and this process was happening at every position, what frequencies of As, Cs, Gs and Ts would we expect after a long time? 0.25 each
Consider a single nucleotide position
[Figure: an ancestor giving rise to descendant strain 1 and descendant strain 2]
The probability that strain 1 and strain 2 have the same nucleotide at this position, taking the ancestral nucleotide to be A:
$I(t) = P_{AA}^2(t) + P_{AC}^2(t) + P_{AG}^2(t) + P_{AT}^2(t)$
Express $I(t)$ in terms of $\alpha$ and $t$.
$I(t) = $
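Consider a single nucleotide position
[Figure: an ancestor giving rise to descendant strain 1 and descendant strain 2]
The probability that strain 1 and strain 2 have the same nucleotide at this position, taking the ancestral nucleotide to be A:
$I(t) = P_{AA}^2(t) + P_{AC}^2(t) + P_{AG}^2(t) + P_{AT}^2(t)$
Express $I(t)$ in terms of $\alpha$ and $t$.
$I(t) = \left(\frac{1}{4} + \frac{3}{4} e^{-4\alpha t}\right)^2 + 3\left(\frac{1}{4} - \frac{1}{4} e^{-4\alpha t}\right)^2$
*Note that this would still be the same if we had imagined the ancestral nucleotide was C, G or T.
$I(t) = \left(\frac{1}{4} + \frac{3}{4} e^{-4\alpha t}\right)^2 + 3\left(\frac{1}{4} - \frac{1}{4} e^{-4\alpha t}\right)^2$
$I(t) = \frac{1}{16} + \frac{6}{16} e^{-4\alpha t} + \frac{9}{16} e^{-8\alpha t} + \frac{3}{16} - \frac{6}{16} e^{-4\alpha t} + \frac{3}{16} e^{-8\alpha t}$
$I(t) = \frac{4}{16} + \frac{12}{16} e^{-8\alpha t}$
$I(t) = \frac{1}{4} + \frac{3}{4} e^{-8\alpha t}$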
1. Probability that the two nucleotides are different (can be measured from alignments):
$p = 1 - I(t) = \frac{3}{4} - \frac{3}{4} e^{-8\alpha t} = \frac{3}{4}\left(1 - e^{-8\alpha t}\right)$
2. Probability of a substitution per site per unit time: $3\alpha$
Expected number of substitutions per site in one lineage: $3\alpha t$
Expected number of substitutions per site separating the two strains: $K = 6\alpha t$
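The algebra for the identity probability can be checked numerically. The sketch below assumes the Jukes-Cantor transition probabilities 1/4 + (3/4)e^(-4αt) for staying the same and 1/4 - (1/4)e^(-4αt) for each specific change; the function names and the values of alpha and t are arbitrary illustrative choices.

```python
import math


def p_same(alpha: float, t: float) -> float:
    """Probability a position has its starting nucleotide at time t."""
    return 0.25 + 0.75 * math.exp(-4 * alpha * t)


def p_other(alpha: float, t: float) -> float:
    """Probability a position has one particular different nucleotide at time t."""
    return 0.25 - 0.25 * math.exp(-4 * alpha * t)


def identity_from_sum(alpha: float, t: float) -> float:
    """I(t) as the sum of squares: both strains kept the ancestral base, or both hold the same other base."""
    return p_same(alpha, t) ** 2 + 3 * p_other(alpha, t) ** 2


def identity_closed_form(alpha: float, t: float) -> float:
    """I(t) after simplification."""
    return 0.25 + 0.75 * math.exp(-8 * alpha * t)


alpha, t = 0.01, 30.0
print(identity_from_sum(alpha, t), identity_closed_form(alpha, t))
print(1 - identity_closed_form(alpha, t))  # p, the probability the strains differ
print(6 * alpha * t)                       # K, expected substitutions per site between strains
```

For any positive time, p comes out smaller than K, which is the multiple-hits effect expressed in numbers.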
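$\frac{4}{3} p = 1 - e^{-8\alpha t}$
$e^{-8\alpha t} = 1 - \frac{4}{3} p$
$-8\alpha t = \ln\left(1 - \frac{4}{3} p\right)$
$K = 6\alpha t = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right)$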
Using the Jukes-Cantor correction
Partial alignment from the gag gene:
SIV_deBrazzaMonkey: TTTCTGGGTT
HIV2_human_a:       TGT------GCGGT
Proportion of sites that are different (consider only non-gap sites): $p = \frac{6}{13} = 0.462$
Estimated substitutions per site separating the two strains: $K = -\frac{3}{4} \ln\left(1 - \frac{4}{3} p\right) = 0.717$
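The correction is a one-line calculation. A minimal Python sketch follows, with the function name `jukes_cantor_distance` chosen here for illustration; the input check reflects the fact that the logarithm is undefined once p reaches 3/4.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Estimated substitutions per site, K, from the proportion of differing sites, p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75 for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


p = 6 / 13  # 6 differences at 13 non-gap sites
K = jukes_cantor_distance(p)
print(round(p, 3), round(K, 3))  # 0.462 0.717
```

For small p the corrected distance is close to p itself, and the correction grows as p approaches 3/4, where sequences are as different as unrelated ones would be.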
Hand in your worksheet please! (and be sure you put your full name on it)