Human Population Genomics
Heritability & Environment
[Figure: Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio).] TA Manolio et al. Nature 461, 747-753 (2009) doi:10.1038/nature08494
Global Ancestry Inference
Genotypes: $G_1, \ldots, G_N$; $G_i = g_{i1} \ldots g_{in}$; $g_{ij} \in \{0, 1, 2\}$
Nature. 2008 November 6; 456(7218): 98-101.
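Modeling population haplotypes
[Figure: VLMC (variable-length Markov chain) over the haplotypes 0000, 0001, 0011, 0110, 1000, 1001, 1011]
Genotypes: $G_1, \ldots, G_N$; $G_i = g_{i1} \ldots g_{in}$; $g_{ij} \in \{0, 1, 2\}$
$G_i = H_{i1} + H_{i2}$, where $H_{ij} = h_{ij1} \ldots h_{ijn}$; $h_{ijk} \in \{0, 1\}$
Browning, 2006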
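The relation $G_i = H_{i1} + H_{i2}$ can be illustrated with a minimal Python sketch. The function name and the example haplotypes are hypothetical and chosen only for illustration; the point is that two different haplotype pairs can yield the same unphased genotype, which is why phasing is needed.

```python
from typing import List


def genotype_from_haplotypes(h1: List[int], h2: List[int]) -> List[int]:
    """Return the unphased genotype G_i = H_i1 + H_i2, position by position."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    if any(allele not in (0, 1) for allele in h1 + h2):
        raise ValueError("haplotype alleles must be 0 or 1")
    return [a + b for a, b in zip(h1, h2)]


# Two different (hypothetical) haplotype pairs for the same individual.
pair_a = ([0, 1, 1, 0], [0, 0, 1, 1])
pair_b = ([0, 0, 1, 0], [0, 1, 1, 1])

genotype_a = genotype_from_haplotypes(*pair_a)
genotype_b = genotype_from_haplotypes(*pair_b)

# Both pairs collapse to the same genotype vector over {0, 1, 2}.
print(genotype_a, genotype_b)
```

Each entry of the result counts copies of the minor allele, so it lies in {0, 1, 2} as in the notation above.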
Phasing
Browning & Browning, 2007
Identity By Descent
IBD detection
[Figure labels: IBD = F, IBD = T]
Parente: Rodriguez et al. 2013
FastIBD: sample haplotypes for each individual, check for IBD. Browning & Browning 2011
Mexican Ancestry
"The genetics of Mexico recapitulates Native American substructure and affects biomedical traits", Moreno-Estrada et al. Science, 2014.
Population Sequencing
Coverage C = 2-6x; 1,000 to 1,000,000 individuals
[Figure: example per-site genotypes, e.g. A/T, C/C, C/G, A/A, C/T, G/G]
Population Sequencing
Genotypes: $G_1, \ldots, G_N$; $G_i = g_{i1} \ldots g_{in}$; $g_{ij} \in \{0, 1, 2\}$
Genotype probabilities: $P_1, \ldots, P_N$; $P_i : [\, p_{ijk} = \Pr(g_{ij} = k \mid \text{data}) \,]$
Population Sequencing
When C is high (>30x): $\Pr(g_{ij} = k \mid \text{data}) \approx \Pr(g_{ij} = k \mid \text{reads mapping on } (i, j))$, which is fast and easy to compute.
When C is low: $\Pr(g_{ij} = k \mid \text{data})$ needs to leverage LD, i.e. positions $j' \neq j$ in all individuals; in principle, this is intractable.
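To see why per-site calling is easy at high coverage and unreliable at low coverage, here is a minimal sketch. It assumes a simple binomial read model with a fixed sequencing error rate and a flat prior over genotypes; neither assumption comes from the slides, and the function names and read counts are hypothetical.

```python
from math import comb
from typing import List


def site_posterior(ref_reads: int, alt_reads: int, error_rate: float = 0.01) -> List[float]:
    """Posterior over genotypes 0, 1, 2 using only the reads at this site.

    Assumed model: each read shows the minor allele with probability
    error_rate (g=0), 0.5 (g=1) or 1 - error_rate (g=2); flat prior.
    """
    n = ref_reads + alt_reads
    alt_prob = (error_rate, 0.5, 1.0 - error_rate)
    likelihoods = [
        comb(n, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for p in alt_prob
    ]
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


# High coverage (30 reads): the heterozygous call is essentially certain.
high_coverage = site_posterior(ref_reads=16, alt_reads=14)

# Low coverage (2 reads, both reference): a heterozygote is still plausible.
low_coverage = site_posterior(ref_reads=2, alt_reads=0)

print(high_coverage)
print(low_coverage)
```

With only two reads, a true heterozygote shows two reference reads a quarter of the time, so the reads at one site cannot settle the call; this is the gap that LD information is meant to fill.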
Population Sequencing: Summarization-Maximization
1. Identify candidate polymorphic sites
2. Initialize $G^{(0)}$
3. Summarization: $p^{(n+1)}_{ijk} = \Pr(g_{ij} = k \mid G^{(n)}, \text{data})$
4. Maximization: $g^{(n+1)}_{ij} = \arg\max_k \, p^{(n+1)}_{ijk}$
5. Repeat until convergence
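The iteration in steps 2 to 5 can be sketched as a generic loop. This is only a skeleton under stated assumptions: the `posterior` callback stands in for the summarization step, and the toy callback used here (smoothed genotype frequencies in a column) is a placeholder that ignores reads entirely. It is not the model used by the algorithm on the slides.

```python
from typing import Callable, List

Genotypes = List[List[int]]  # g[i][j] in {0, 1, 2}
Posterior = Callable[[Genotypes, int, int], List[float]]


def summarize_maximize(g0: Genotypes, posterior: Posterior, max_iter: int = 50) -> Genotypes:
    """Alternate summarization and maximization until G stops changing."""
    g = [row[:] for row in g0]
    for _ in range(max_iter):
        # Summarization: p[i][j][k] computed from the current genotypes G^(n).
        p = [[posterior(g, i, j) for j in range(len(g[i]))] for i in range(len(g))]
        # Maximization: pick the most probable genotype at every (i, j).
        new_g = [[max(range(3), key=pij.__getitem__) for pij in row] for row in p]
        if new_g == g:  # converged
            break
        g = new_g
    return g


def column_frequency_posterior(g: Genotypes, i: int, j: int) -> List[float]:
    """Toy placeholder: add-one smoothed genotype frequencies at position j."""
    counts = [1, 1, 1]
    for row in g:
        counts[row[j]] += 1
    total = sum(counts)
    return [c / total for c in counts]


initial = [[0, 1], [0, 1], [2, 1]]
result = summarize_maximize(initial, column_frequency_posterior)
print(result)
```

All probabilities in one round are computed from $G^{(n)}$ before any genotype is updated, which matches the synchronous form of steps 3 and 4.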
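Modeling LD: Nearest Neighbors
Number of reads covering the minor allele at two sites i and j:

             Site i   Site j
Sample 1       1        1
Sample 2       1        0
Sample 3       3        5
Sample 4       0        2
...
Sample 10      1        0

Let $S_i$ = {samples with at least one read covering the minor allele at site i}. Here $S_i = \{1, 2, 3, 10\}$ and $S_j = \{1, 3, 4\}$.
Then $\mathrm{Sim}_1(i, j) = |S_i \cap S_j| \,/\, |S_i \cup S_j| = 2/5$, since the intersection is {1, 3} and the union is {1, 2, 3, 4, 10}.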
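The similarity between two sites is a Jaccard index over sample sets, which a short sketch makes concrete. The read counts below are the ones shown for Samples 1, 2, 3, 4 and 10; the samples not shown are assumed to carry no minor-allele reads at either site, consistent with the sets listed. With those sets the intersection has 2 samples and the union has 5.

```python
from typing import Dict, Set, Tuple


def jaccard_similarity(samples_i: Set[int], samples_j: Set[int]) -> float:
    """|S_i intersect S_j| / |S_i union S_j|, defined as 0.0 when both sets are empty."""
    union = samples_i | samples_j
    if not union:
        return 0.0
    return len(samples_i & samples_j) / len(union)


# sample id -> (minor-allele reads at site i, minor-allele reads at site j)
read_counts: Dict[int, Tuple[int, int]] = {
    1: (1, 1),
    2: (1, 0),
    3: (3, 5),
    4: (0, 2),
    10: (1, 0),
}

s_i = {sample for sample, (at_i, _) in read_counts.items() if at_i >= 1}
s_j = {sample for sample, (_, at_j) in read_counts.items() if at_j >= 1}

similarity = jaccard_similarity(s_i, s_j)
print(sorted(s_i), sorted(s_j), similarity)
```

Ranking all other sites by this score and keeping the top few is one way to pick the nearest neighbors of a site.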
Reveel Algorithm: Summarization-Maximization, step 1 (identify candidate polymorphic sites)
Candidate polymorphic site: essentially, a position j where some individuals have at least 2 reads with the same minor allele.
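The candidate-site rule can be written as a small predicate. This sketch assumes each individual's reads at a position have already been summarized as a mapping from non-reference allele to read count; that data layout and the function name are hypothetical, and "minor allele" is approximated here by "non-reference allele".

```python
from typing import Dict, List


def is_candidate_site(alt_reads_per_individual: List[Dict[str, int]], min_reads: int = 2) -> bool:
    """True if any individual has at least min_reads reads of one and the same allele."""
    return any(
        count >= min_reads
        for reads in alt_reads_per_individual
        for count in reads.values()
    )


# One supporting read each in two individuals: not enough by this rule.
scattered = is_candidate_site([{"T": 1}, {"T": 1}, {}])

# Two reads in one individual, but for different alleles: still not a candidate.
mixed = is_candidate_site([{"T": 1, "G": 1}, {}])

# Two reads of the same allele in one individual: candidate site.
supported = is_candidate_site([{"T": 2}, {}, {}])

print(scattered, mixed, supported)
```

Requiring two reads of the same allele within one individual makes it less likely that a single sequencing error creates a candidate.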
Reveel Algorithm: Summarization-Maximization
At each position j, use the sum of read counts at j and its nearest neighbors.
Reveel Algorithm: calculate $P^{(n+1)}$
$p^{(n+1)}_{ijk} = P(g_{ij} = k \mid G^{(n)}, \text{reads}) \approx P(g_{ij} = k \mid g_{knn}, \text{reads}) \propto P(\text{reads} \mid g_{ij} = k)\, P(g_{ij} = k \mid g_{knn})$, normalized over $k \in \{0, 1, 2\}$.
$P(\text{reads} \mid g_{ij} = k)$: easy.
$P(g_{ij} = k \mid g_{knn})$: let $C_0, C_1, C_2$ be the number of samples that match individual i at the nearest-neighbor sites in $G^{(n)}$ and whose genotype at position j is 0, 1, 2 respectively. Then $P(g_{ij} = k \mid g_{knn}) = C_k / (C_0 + C_1 + C_2)$.
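A minimal sketch of this update combines a read likelihood with the neighbor-based prior $C_k / (C_0 + C_1 + C_2)$. The read likelihood is an assumed binomial model with a fixed error rate, since the slide only says this term is easy. The function names, the read counts and the list of matching-sample genotypes are hypothetical.

```python
from math import comb
from typing import List


def read_likelihood(ref_reads: int, alt_reads: int, k: int, error_rate: float = 0.01) -> float:
    """Assumed binomial model for P(reads | g_ij = k)."""
    alt_prob = (error_rate, 0.5, 1.0 - error_rate)[k]
    n = ref_reads + alt_reads
    return comb(n, alt_reads) * alt_prob ** alt_reads * (1.0 - alt_prob) ** ref_reads


def knn_prior(matching_genotypes_at_j: List[int]) -> List[float]:
    """P(g_ij = k | g_knn) = C_k / (C_0 + C_1 + C_2)."""
    if not matching_genotypes_at_j:
        raise ValueError("need at least one matching sample")
    counts = [matching_genotypes_at_j.count(k) for k in range(3)]
    total = sum(counts)
    return [c / total for c in counts]


def genotype_posterior(ref_reads: int, alt_reads: int, matching_genotypes_at_j: List[int]) -> List[float]:
    """Normalized product of read likelihood and neighbor-based prior."""
    prior = knn_prior(matching_genotypes_at_j)
    unnormalized = [read_likelihood(ref_reads, alt_reads, k) * prior[k] for k in range(3)]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Two reference reads alone favor genotype 0, but most matching samples
# carry genotype 1 at position j, so the combined call shifts to 1.
matching = [1, 1, 1, 1, 1, 1, 1, 0]
prior = knn_prior(matching)
posterior = genotype_posterior(ref_reads=2, alt_reads=0, matching_genotypes_at_j=matching)
print(prior, posterior)
```

Note that raw counts give a hard zero to any genotype absent among the matching samples; whether and how to smooth that is a design choice this sketch leaves open.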
Fixation, Positive & Negative Selection
[Figure panels: Negative Selection, Neutral Drift, Positive Selection]
How can we detect negative selection? How can we detect positive selection?
How can we detect positive selection?
Ka/Ks ratio: ratio of nonsynonymous to synonymous substitutions.
Detects very old, persistent, strong positive selection for a protein that keeps adapting.
Examples: immune response, spermatogenesis.
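As a worked illustration of the Ka/Ks ratio, the sketch below divides each substitution count by the number of sites of that type and then takes the ratio. It applies no correction for multiple substitutions at the same site, and the counts are made-up numbers used only to show the arithmetic.

```python
def ka_ks_ratio(nonsyn_subs: int, nonsyn_sites: float, syn_subs: int, syn_sites: float) -> float:
    """Ka/Ks from raw counts, without any multiple-hit correction."""
    ka = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per nonsynonymous site
    ks = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


# Hypothetical gene: 30 nonsynonymous changes over 700 nonsynonymous sites,
# 10 synonymous changes over 300 synonymous sites.
ratio = ka_ks_ratio(nonsyn_subs=30, nonsyn_sites=700, syn_subs=10, syn_sites=300)
print(ratio)
```

A ratio above 1 suggests positive selection, a ratio well below 1 suggests negative (purifying) selection, and a ratio near 1 is what neutral evolution would produce.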
Positive Selection in Human Lineage
Long Haplotypes
EHH, iHS tests
Less time: fewer mutations, fewer recombinations
Application: Malaria
Study of genes known to be implicated in resistance to malaria.
Malaria is an infectious disease caused by protozoan parasites of the genus Plasmodium.
Frequent in tropical and subtropical regions.
Transmitted by the Anopheles mosquito.
Image source: wikipedia.org. Slide credits: Marc Schaub
Application: Malaria
Image source: NIH - http://history.nih.gov/exhibits/bowman/images/malariacyclebig.jpg
Application: Malaria
Image source: CDC - http://www.dpd.cdc.gov/dpdx/images/parasiteimages/m-r/malaria/malaria_risk_2003.gif
Results: G6PD
Source: Sabeti et al. Nature 2002.
Results: TNFSF5
Source: Sabeti et al. Nature 2002.
Malaria and Sickle-cell Anemia
Allison (1954): Sickle-cell anemia is limited to the region in Africa in which malaria is endemic.
[Figure panels: Distribution of malaria; Distribution of sickle-cell anemia]
Image source: wikipedia.org
Malaria and Sickle-cell Anemia
Single point mutation in the coding region of the Hemoglobin-B gene (Glu → Val).
Heterozygote advantage: resistance to malaria, with only slight anemia.
Image source: wikipedia.org
Lactose Intolerance
Source: Ingram and Swallow. Population Genetics of Encyclopedia of Life Sciences. 2007.
Lactose Intolerance
[Figure panels: LCT, 5'; LCT, 3']
Source: Bersaglieri et al. Am. J. Hum. Genet. 2004.
Positive Selection in Human Lineage