Human Population Genomics

Heritability & Environment
Figure: feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio).
Source: TA Manolio et al. Nature 461, 747-753 (2009), doi:10.1038/nature08494

Global Ancestry Inference
$G_1, \ldots, G_N$; $G_i = g_{i1} \cdots g_{in}$; $g_{ij} \in \{0, 1, 2\}$
Source: Nature. 2008 November 6; 456(7218): 98-101.

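Modeling population haplotypes
VLMC (variable-length Markov chain). Example haplotypes: 0000, 0001, 0011, 0110, 1000, 1001, 1011
$G_1, \ldots, G_N$; $G_i = g_{i1} \cdots g_{in}$; $g_{ij} \in \{0, 1, 2\}$
$G_i = H_{i1} + H_{i2}$, where $H_{ij} = h_{ij1} \cdots h_{ijn}$; $h_{ijk} \in \{0, 1\}$
Source: Browning, 2006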
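The relation $G_i = H_{i1} + H_{i2}$ can be illustrated with a minimal sketch. The haplotype values and the helper name `count_compatible_phasings` are hypothetical and not from the slides; the point is only that one unphased genotype is compatible with many haplotype pairs, which is why phasing needs a population model.

```python
import numpy as np

# Two hypothetical haplotypes of one individual over 7 sites
# (0 = major allele, 1 = minor allele). Values are illustrative only.
h1 = np.array([0, 0, 1, 1, 0, 1, 0])
h2 = np.array([0, 1, 1, 0, 0, 1, 1])

# Unphased genotype: per-site minor-allele count, each entry in {0, 1, 2}.
g = h1 + h2


def count_compatible_phasings(genotype):
    """Number of unordered haplotype pairs that sum to this genotype.

    Homozygous sites (0 or 2) are fully determined; each heterozygous
    site (1) can be assigned two ways, and swapping the two haplotypes
    gives the same unordered pair, hence 2 ** (n_het - 1).
    """
    n_het = int(np.sum(np.asarray(genotype) == 1))
    return 1 if n_het == 0 else 2 ** (n_het - 1)


n_phasings = count_compatible_phasings(g)
```

With three heterozygous sites, as in this example, four distinct haplotype pairs explain the same genotype.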

Phasing Browning & Browning, 2007

Identity By Descent

IBD detection
Figure labels: IBD = F, IBD = T
Parente: Rodriguez et al. 2013
FastIBD: sample haplotypes for each individual, check for IBD. Browning & Browning 2011

Mexican Ancestry The genetics of Mexico recapitulates Native American substructure and affects biomedical traits, Moreno-Estrada et al. Science, 2014.

Population Sequencing
Coverage C = 2-6x; 1,000 to 1,000,000 samples.
Figure: example per-sample genotype calls (A/T, C/C, C/G; A/A, C/T, G/G; A/A, T/T, C/G; A/A, C/T, G/G)

Population Sequencing
Genotypes: $G_1, \ldots, G_N$; $G_i = g_{i1} \cdots g_{in}$; $g_{ij} \in \{0, 1, 2\}$
Genotype probabilities: $P_1, \ldots, P_N$; $P_i : [\, p_{ijk} = \mathrm{Prob}(g_{ij} = k \mid \text{data}) \,]$

Population Sequencing
When C is high (>30x), $\mathrm{Prob}(g_{ij} = k \mid \text{data}) \approx \mathrm{Prob}(g_{ij} = k \mid \text{reads mapping on } (i, j))$: fast & easy.
When C is low (2-6x), $\mathrm{Prob}(g_{ij} = k \mid \text{data})$ needs to leverage LD: positions $j' \neq j$ in all individuals. In principle, intractable.
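The contrast between high and low coverage can be sketched with a simple per-site model. This is an illustration under stated assumptions, not the method from the slides: reads are treated as independent binomial draws, `error_rate` is a hypothetical per-read error probability, and the prior over genotypes is uniform.

```python
from math import comb


def genotype_posteriors(n_ref, n_alt, error_rate=0.01, prior=(1/3, 1/3, 1/3)):
    """Posterior over g in {0, 1, 2} from the reads at a single (i, j).

    Assumed model (not from the slides): each read shows the minor
    allele with probability error_rate, 0.5, or 1 - error_rate when the
    genotype carries 0, 1, or 2 minor alleles.
    """
    p_alt = (error_rate, 0.5, 1.0 - error_rate)
    n = n_ref + n_alt
    # Binomial likelihood of the observed read counts under each genotype.
    lik = [comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref for p in p_alt]
    unnorm = [l * pr for l, pr in zip(lik, prior)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# About 30x coverage: 15 reads per allele leave essentially no ambiguity.
post_high = genotype_posteriors(15, 15)
# About 2x coverage: two major-allele reads cannot rule out a heterozygote.
post_low = genotype_posteriors(2, 0)
```

At low coverage a substantial share of the posterior mass stays on the heterozygous genotype, which is the gap that information from other positions (LD) is meant to close.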

Population Sequencing
Summarization-Maximization:
1. Identify candidate polymorphic sites
2. Initialize $G^{(0)}$
3. Summarization: $p^{(n+1)}_{ijk} = \mathrm{Prob}(g_{ij} = k \mid G^{(n)}, \text{data})$
4. Maximization: $g^{(n+1)}_{ij} = \arg\max_k \, p^{(n+1)}_{ijk}$
5. Repeat until convergence

Modeling LD: Nearest Neighbors
Reads covering the minor allele at positions i and j:

Sample      i   j
Sample 1    1   1
Sample 2    1   0
Sample 3    3   5
Sample 4    0   2
...
Sample 10   1   0

Let $S_i$ = { samples with >= one read covering the minor allele at position i }: $S_i = \{1, 2, 3, 10\}$, $S_j = \{1, 3, 4\}$.
Then $\mathrm{Sim}_1(i, j) = |S_i \cap S_j| / |S_i \cup S_j| = 2/5$.
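The similarity above is the Jaccard index of the two sample sets. A minimal sketch follows, using the sets from the slide; the function names `jaccard` and `nearest_neighbors` and the extra position `"m"` are hypothetical and only illustrate how positions could be ranked by similarity.

```python
def jaccard(s_i, s_j):
    """|S_i intersect S_j| / |S_i union S_j|; defined as 0.0 for two empty sets."""
    union = s_i | s_j
    return len(s_i & s_j) / len(union) if union else 0.0


def nearest_neighbors(target, carriers, k):
    """Return the k positions whose carrier sets are most similar to target's.

    carriers maps a position label to the set of samples with at least
    one read covering the minor allele at that position.
    """
    others = [pos for pos in carriers if pos != target]
    others.sort(key=lambda pos: jaccard(carriers[target], carriers[pos]),
                reverse=True)
    return others[:k]


S_i = {1, 2, 3, 10}
S_j = {1, 3, 4}
# Intersection {1, 3} has 2 elements; union {1, 2, 3, 4, 10} has 5.
sim = jaccard(S_i, S_j)

# Hypothetical third position "m", added only to show the ranking step.
carriers = {"i": S_i, "j": S_j, "m": {7, 8}}
top = nearest_neighbors("i", carriers, k=1)
```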

Reveel Algorithm
Candidate polymorphic site (step 1 of Summarization-Maximization): essentially, a position j where some individuals have at least 2 reads with the same minor allele.

Reveel Algorithm
At each position j, use the sum of read counts at j and its nearest neighbors.

Reveel Algorithm: calculate $P^{(n+1)}$
$p^{(n+1)}_{ijk} = P(g_{ij} = k \mid G^{(n)}, \text{reads}) \approx P(g_{ij} = k \mid g_{knn}, \text{reads}) \propto P(\text{reads} \mid g_{ij} = k) \, P(g_{ij} = k \mid g_{knn})$, normalized over $k$.
$P(\text{reads} \mid g_{ij} = k)$: easy.
$P(g_{ij} = k \mid g_{knn})$: let $C_0, C_1, C_2$ = number of samples matching $i$ in the kNN at iteration $(n)$ whose genotype at position $j$ is 0, 1, 2, respectively.
Then $P(g_{ij} = k \mid g_{knn}) = C_k / (C_0 + C_1 + C_2)$.
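One summarization and maximization update for a single $(i, j)$ can be sketched as follows. This is an illustrative sketch, not the Reveel implementation: the read model, the `error_rate` value, the optional `pseudocount`, and all function names are assumptions. `neighbor_genotypes` stands for the current genotypes at position j of the samples that match sample i at the nearest-neighbor positions.

```python
def knn_prior(neighbor_genotypes, pseudocount=0.0):
    """P(g_ij = k | g_knn) = C_k / (C_0 + C_1 + C_2).

    pseudocount is a hypothetical smoothing term (0.0 reproduces the
    slide's formula); a uniform prior is returned if there are no counts.
    """
    counts = [pseudocount, pseudocount, pseudocount]
    for g in neighbor_genotypes:
        counts[g] += 1
    total = sum(counts)
    if total == 0:
        return [1 / 3, 1 / 3, 1 / 3]
    return [c / total for c in counts]


def read_likelihood(n_ref, n_alt, error_rate=0.01):
    """P(reads | g_ij = k) under an assumed independent-read error model."""
    p_alt = (error_rate, 0.5, 1.0 - error_rate)
    return [p ** n_alt * (1.0 - p) ** n_ref for p in p_alt]


def update_genotype(n_ref, n_alt, neighbor_genotypes, pseudocount=0.0):
    """Return (posterior over k, argmax genotype) for one sample and site."""
    prior = knn_prior(neighbor_genotypes, pseudocount)
    lik = read_likelihood(n_ref, n_alt)
    unnorm = [l * p for l, p in zip(lik, prior)]
    total = sum(unnorm)
    if total == 0:
        # Prior and likelihood disagree completely; fall back to the prior.
        post = prior
    else:
        post = [u / total for u in unnorm]
    best = max(range(3), key=post.__getitem__)
    return post, best


# One major-allele read and one minor-allele read; four of the five
# matching samples are currently called heterozygous at this position.
post, g_new = update_genotype(1, 1, [1, 1, 1, 0, 1])
```

Running this update over all samples and candidate sites, then recomputing the neighbor counts from the new genotypes, gives one iteration of the loop; repeating until the genotypes stop changing corresponds to "repeat until convergence".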

Fixation, Positive & Negative Selection
How can we detect negative selection? How can we detect positive selection?
Figure panels: Negative Selection, Neutral Drift, Positive Selection

How can we detect positive selection?
Ka/Ks ratio: ratio of nonsynonymous to synonymous substitutions.
Detects very old, persistent, strong positive selection for a protein that keeps adapting.
Examples: immune response, spermatogenesis.
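As a rough illustration of the Ka/Ks ratio, the sketch below divides per-site substitution rates. It is deliberately simplified: the counts are made-up inputs, the function name `ka_ks` is hypothetical, and no correction for multiple substitutions at the same site is applied, which real estimators include.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Uncorrected Ka/Ks from substitution and site counts.

    Ka = nonsynonymous substitutions per nonsynonymous site
    Ks = synonymous substitutions per synonymous site
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    ka = nonsyn_subs / nonsyn_sites
    ks = syn_subs / syn_sites
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


# Hypothetical gene: 12 nonsynonymous changes over 300 sites,
# 3 synonymous changes over 100 sites.
ratio = ka_ks(12, 300, 3, 100)
```

A ratio above 1 is usually read as evidence of positive selection on the protein, a ratio well below 1 as negative (purifying) selection, and a ratio near 1 as consistent with neutral drift.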

How can we detect positive selection?

Positive Selection in Human Lineage

Long Haplotypes
EHH, iHS tests
Less time: fewer mutations, fewer recombinations
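Long-haplotype tests build on extended haplotype homozygosity (EHH): the probability that two chromosomes carrying the same core allele are identical over the whole interval from the core to a given distance. The sketch below is a minimal version under stated assumptions: haplotypes are strings of '0'/'1', the caller has already selected the chromosomes carrying the core allele, and the example data and the function name `ehh` are hypothetical.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, distance):
    """EHH over sites core .. core + distance (inclusive).

    Groups the haplotypes by their allele string over the interval and
    returns sum_c C(n_c, 2) / C(n, 2), the fraction of chromosome pairs
    that are identical across the whole interval.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    classes = Counter(h[core:core + distance + 1] for h in haplotypes)
    return sum(comb(c, 2) for c in classes.values()) / comb(n, 2)


# Four hypothetical chromosomes that all carry allele '1' at the core (site 0).
carriers = ["1000", "1000", "1000", "1011"]
decay = [ehh(carriers, core=0, distance=d) for d in range(4)]
```

A recently selected allele has had less time to accumulate mutations and recombinations, so its EHH decays more slowly with distance than that of an equally frequent neutral allele.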

Application: Malaria
Study of genes known to be implicated in the resistance to malaria.
Malaria: infectious disease caused by protozoan parasites of the genus Plasmodium; frequent in tropical and subtropical regions; transmitted by the Anopheles mosquito.
Image source: wikipedia.org
Slide Credits: Marc Schaub

Application: Malaria
Image source: NIH - http://history.nih.gov/exhibits/bowman/images/malariacyclebig.jpg
Slide Credits: Marc Schaub

Application: Malaria
Image source: CDC - http://www.dpd.cdc.gov/dpdx/images/parasiteimages/m-r/malaria/malaria_risk_2003.gif
Slide Credits: Marc Schaub

Results: G6PD Source: Sabeti et al. Nature 2002. Slide Credits: Marc Schaub

Results: TNFSF5 Source: Sabeti et al. Nature 2002. Slide Credits: Marc Schaub

Malaria and Sickle-cell Anemia
Allison (1954): Sickle-cell anemia is limited to the region in Africa in which malaria is endemic.
Figure panels: Distribution of malaria; Distribution of sickle-cell anemia
Image source: wikipedia.org
Slide Credits: Marc Schaub

Malaria and Sickle-cell Anemia
Single point mutation in the coding region of the Hemoglobin-B gene (Glu → Val).
Heterozygote advantage: resistance to malaria, slight anemia.
Image source: wikipedia.org
Slide Credits: Marc Schaub

Lactose Intolerance
Source: Ingram and Swallow. Population Genetics of Encyclopedia of Life Sciences. 2007.
Slide Credits: Marc Schaub

Lactose Intolerance
Figure panels: LCT, 5'; LCT, 3'
Source: Bersaglieri et al. Am. J. Hum. Genet. 2004.
Slide Credits: Marc Schaub
