Learning ancestral genetic processes using nonparametric Bayesian models

Kyung-Ah Sohn
October 31, 2011
Committee Members: Eric P. Xing (Chair), Zoubin Ghahramani, Russell Schwartz, Kathryn Roeder, Matthew Stephens (University of Chicago)
In partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Recent explosion of genomic data
[Figure: timeline labeled 2003 and 2006-]

Inference on human migration history
Li, Hui, Kelly Cho, Judith R. Kidd, and Kenneth K. Kidd. 2009. Genetic landscape of Eurasia and 'admixture' in Uyghurs. American Journal of Human Genetics 85 (6) (December): 934-7; author reply 937-9.
Xu, Shuhua, and Li Jin. 2008. A genome-wide analysis of admixture in Uyghurs and a high-density admixture map for disease-gene discovery. American Journal of Human Genetics 83 (3) (September): 322-336.

Finding Disease Genes
[Figure: a mutation from G to A]
Search for causal genes and mutations

Key challenges
- How to model the complex inheritance processes underlying the data
- Realistic and flexible inheritance model
Images courtesy of thurj.org and stormfront.org

Key challenges
- Lack of large-scale samples; data in a high-dimensional space
- Need to exploit structural information in the data from genetically distinct but related groups
Image courtesy of Nature Reviews: Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. 3 (5) (May): 380-390. doi:10.1038/nrg795.

Thesis goal
Develop flexible nonparametric Bayesian models for learning ancestral genetic processes that efficiently utilize structural information in genetic data
- Biologically, provide practical tools that can reveal interesting characteristics of study populations
- Statistically, validate how nonparametric Bayesian models can help improve the performance of real applications

Theoretical component
- Nonparametric Bayesian models based on the Dirichlet process and its extensions, such as the hierarchical Dirichlet process and the infinite hidden Markov model
- Combined with a haplotype inheritance model describing various genetic processes

SNP haplotypes
- Genetic polymorphism: a difference in DNA sequence among individuals or populations
- Single Nucleotide Polymorphism (SNP): DNA sequence variation occurring when a single nucleotide (A, T, C, or G) differs between members of a group
- Haplotypes
- Genotypes: sequences of unordered pairs of haplotype alleles in diploids, e.g. the two haplotypes (AATC) and (ACGG) form the genotype sequence ({A,A}, {A,C}, {G,T}, {C,G})
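The haplotype-to-genotype example above can be reproduced with a short sketch. The function names are hypothetical and chosen only for illustration; each site is stored as a sorted pair so that the order of the two alleles carries no phase information. The second function counts how many unordered haplotype pairs are consistent with a genotype, which is what makes haplotype inference a non-trivial problem.

```python
def genotype_from_haplotypes(h1, h2):
    """Pair the alleles of two haplotypes site by site, discarding phase."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Sorting each pair makes the site unordered: (G, T) and (T, G) coincide.
    return [tuple(sorted((a, b))) for a, b in zip(h1, h2)]


def count_compatible_pairs(genotype):
    """Number of unordered haplotype pairs that produce this genotype."""
    het_sites = sum(1 for a, b in genotype if a != b)
    # Each heterozygous site can be phased two ways; swapping the two
    # haplotypes gives the same unordered pair, hence the division by 2.
    return 1 if het_sites == 0 else 2 ** (het_sites - 1)


genotype = genotype_from_haplotypes("AATC", "ACGG")
print(genotype)                          # site-wise unordered allele pairs
print(count_compatible_pairs(genotype))  # candidate phasings for 3 heterozygous sites
```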

Thesis outline
1. Haplotype inference from multi-population data
2. Joint inference of population structure and recombination events
3. Local ancestry estimation in admixed populations

Application 1
A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data (International Conference on Machine Learning 2006, Annals of Applied Statistics 2009)

Motivation
- Few existing programs explicitly use population labels in haplotype inference, even though many datasets come from genetically distinct populations
- Each population has its own characteristics
- Populations are related and may share some components
- Develop a principled approach that explicitly exploits the population labels

Bayesian haplotype inference model
[Figure: graphical model. Founder haplotypes A give rise to individual haplotypes H_n1 and H_n2 through a mutation model; the individual haplotypes give rise to the observed individual genotype G_n through a genotyping model]
Each individual haplotype is a mixture of founder haplotypes
Eric P. Xing, Michael I. Jordan, and Roded Sharan. 2007. Bayesian haplotype inference via the Dirichlet process. Journal of Computational Biology 14 (3) (April): 267-284. doi:10.1089/cmb.2006.0102.

Bayesian haplotype inference model
[Figure: the same graphical model with a Dirichlet process prior added over the founder haplotypes; labels include G_0 and F]

Dirichlet Process
A CDF $F$ on $\Theta$ follows a Dirichlet process if, for any finite measurable partition $(B_1, B_2, \ldots, B_m)$ of $\Theta$, the joint distribution of the random variables satisfies
$(F(B_1), F(B_2), \ldots, F(B_m)) \sim \mathrm{Dirichlet}(\alpha G_0(B_1), \ldots, \alpha G_0(B_m))$
where $G_0$ is the base measure and $\alpha$ is the scale parameter

Dirichlet Process: Pólya urn model
We associate mixture components (founders) with colors and samples (individual chromosomes) with balls in the Pólya urn model
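The urn metaphor can be simulated directly. The sketch below is a generic Pólya urn with a hypothetical function name, where `alpha` stands for the scale parameter of the Dirichlet process: the i-th ball takes an existing color with probability proportional to the number of balls already of that color, or a new color with probability proportional to `alpha`. It only draws founder labels; it does not draw the founder haplotypes themselves from a base measure.

```python
import random


def polya_urn(n_balls, alpha, rng):
    """Return a color (founder) index for each of n_balls sequential draws."""
    counts = []   # counts[k] = balls of color k already in the urn
    colors = []
    for i in range(n_balls):
        # i balls are in the urn, so the total weight is alpha + i.
        u = rng.uniform(0, alpha + i)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                break            # reuse existing color k
        else:
            k = len(counts)      # open a new color
            counts.append(0)
        counts[k] += 1
        colors.append(k)
    return colors


labels = polya_urn(20, alpha=1.0, rng=random.Random(0))
print(labels)
```

Larger values of `alpha` make new colors more likely, so more founders are used for the same number of chromosomes.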

DP mixture model
- A flexible haplotype inheritance model
- The number of founders can grow as large as needed by the given data
- Reasonable approximation to coalescent theory
[Figure: coalescent with mutation]
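How fast the number of founders grows can be made concrete with a standard property of the Dirichlet process, stated here as a supplement rather than as a result from the thesis. Writing $\alpha$ for the scale parameter (a notational assumption) and $K_n$ for the number of distinct founders used by $n$ chromosomes, the prior expectation is:

```latex
\mathbb{E}[K_n] \;=\; \sum_{i=1}^{n} \frac{\alpha}{\alpha + i - 1}
\;\approx\; \alpha \log\!\left(1 + \frac{n}{\alpha}\right)
```

The $i$-th term is the probability that the $i$-th chromosome introduces a new founder, so the expected number of founders grows only logarithmically with the sample size.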

Multi-population data

A hierarchical Dirichlet process mixture
Two-level Pólya urn scheme
- Assume a population-specific DP
- Use a common base measure distributed as another Dirichlet process
- Atoms (or founder haplotypes) are shared across populations
[Figure: graphical model with mutation model and genotyping model]
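One common way to write a two-level urn scheme is sketched below; it is an illustration under stated assumptions, not the sampler used in the thesis. Each population has its own urn, and a draw that would open a new color is instead passed to a shared stock urn, which either returns a founder already in use somewhere or creates a new one. The names `alpha` (population-level scale), `gamma` (top-level scale), and both function names are hypothetical.

```python
import random


def draw_index(weights, extra, rng):
    """Pick k with probability proportional to weights[k], or
    len(weights) with probability proportional to extra."""
    u = rng.uniform(0, sum(weights) + extra)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if u < acc:
            return k
    return len(weights)


def two_level_urn(pop_sizes, alpha, gamma, rng):
    """Assign every chromosome of every population to a shared founder."""
    stock = []          # stock[k] = times founder k was drawn from the shared urn
    assignments = []
    for n in pop_sizes:
        local = [0] * len(stock)   # local[k] = chromosomes here using founder k
        labels = []
        for _ in range(n):
            k = draw_index(local, alpha, rng)
            if k == len(local):            # defer to the shared stock urn
                k = draw_index(stock, gamma, rng)
                if k == len(stock):        # brand-new founder for all populations
                    stock.append(0)
                    local.append(0)
                stock[k] += 1
            local[k] += 1
            labels.append(k)
        assignments.append(labels)
    return assignments, stock


assignments, stock = two_level_urn([15, 15], alpha=1.0, gamma=1.0, rng=random.Random(1))
print(assignments[0])
print(assignments[1])
```

Because founder indices come from the shared urn, the same index appearing in both populations denotes the same founder haplotype, which is how atoms end up shared.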

Performance
Comparison of DP and HDP mixtures
- Mode-1: ignore labels
- Mode-2: handle each group separately
Haplotype inference as a measurable application of DP and HDP mixture models

Comparison with benchmark algorithms

Sensitivity analysis on hyperparameters

Discussion
- Population labels are explicitly leveraged in haplotype inference using a hierarchical Dirichlet process mixture model
- Recombination is not taken into account; a divide-and-conquer scheme called Partition-Ligation is used to deal with long sequences

Application 2
Spectrum: Joint Inference of Population Structure and Recombination Events (NIPS 2007, ISMB 2007, BA 2007)

Inheritance process
[Figure: chromosomes after one generation and after many generations]

A new model for population representation
Basic assumption: modern chromosomes are derived from hypothetical founder chromosomes via recombination and mutation
[Figure: hypothetical founder chromosomes and an individual chromosome]
Each individual chromosome is a mosaic of founders

Population analysis
Given population data, recover the pool of hypothetical founders and their association with individual haplotypes
[Figure: haplotypes 00010000000, 00101000011, 00111010001, 01101000001, the hypothetical founders, and the association between them]

How to recover the association
- Use hidden Markov models (HMM)
[Figure: hidden state sequence c_1, c_2, c_3, ..., c_n and the observation sequence]
- How many founders? Use an infinite HMM (Beal et al. (2002), Teh et al. (2006))
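The HMM view can be illustrated with Viterbi decoding over a fixed, finite set of founders. This is a deliberate simplification: the thesis uses an infinite HMM with an unknown number of founders, whereas this sketch assumes the founders are given. The function name and the parameters `recomb` (per-site recombination probability) and `mut` (per-site mutation probability) are hypothetical.

```python
import math


def viterbi_mosaic(hap, founders, recomb=0.05, mut=0.01):
    """Most likely founder index at each site of haplotype `hap`."""
    K, T = len(founders), len(hap)

    def log_emit(k, t):
        # The observed allele copies the founder allele unless it mutated.
        return math.log(1 - mut if founders[k][t] == hap[t] else mut)

    # With probability recomb the chain jumps to a uniformly chosen founder.
    log_stay = math.log((1 - recomb) + recomb / K)
    log_jump = math.log(recomb / K)

    score = [math.log(1.0 / K) + log_emit(k, 0) for k in range(K)]
    back = []
    for t in range(1, T):
        new_score, ptr = [], []
        for k in range(K):
            best = max(range(K),
                       key=lambda j: score[j] + (log_stay if j == k else log_jump))
            step = log_stay if best == k else log_jump
            ptr.append(best)
            new_score.append(score[best] + step + log_emit(k, t))
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    state = max(range(K), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


print(viterbi_mosaic("00001111", ["00000000", "11111111"]))
```

A change of founder index along the decoded path marks an inferred recombination point.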

Infinite hidden Markov model (Hidden Markov Dirichlet Process)
- Transitions among an infinite number of founders
- Infinite-dimensional transition matrix modeled by a hierarchical Dirichlet process
[Figure: transition matrix with rows and columns indexed 1, 2, 3, ...]
- each row modeled with a DP
- rows coupled by a common base measure under another DP
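The coupling of rows through a common base measure can be seen in a finite truncation of the infinite transition matrix. The sketch below is an approximation for illustration only, not the inference scheme of the thesis: shared weights `beta` are drawn by stick-breaking, and each row is drawn from a Dirichlet distribution centered on `beta`. The names `alpha`, `gamma`, `truncation`, and the function names are assumptions.

```python
import random


def stick_breaking(gamma, truncation, rng):
    """Truncated stick-breaking weights; the last weight absorbs the remainder."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, gamma)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)
    return weights


def dirichlet(concentrations, rng):
    """Dirichlet draw obtained by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def truncated_transition_matrix(alpha, gamma, truncation, rng):
    """Shared weights beta plus one transition row per founder state."""
    beta = stick_breaking(gamma, truncation, rng)
    # Every row has mean beta, so founders popular in one row tend to be
    # popular in all rows; alpha controls how closely rows follow beta.
    rows = [dirichlet([alpha * b for b in beta], rng) for _ in range(truncation)]
    return beta, rows


beta, rows = truncated_transition_matrix(alpha=5.0, gamma=2.0, truncation=6,
                                         rng=random.Random(0))
print([round(b, 3) for b in beta])
print([round(p, 3) for p in rows[0]])
```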

Infinite HMM model for population analysis
[Figure: graphical model with founder alleles A, hidden founder indicators c_1, c_2, c_3, ..., c_n, and haplotype H]

Infinite HMM model for population analysis
- Founder allele reconstruction
- Inferring population structure
- Inferring recombination hotspots
[Figure: graphical model with founder alleles A, hidden founder indicators c_1, c_2, c_3, ..., c_n, and haplotype H]

Discussion
- A new haplotype model based on an infinite hidden Markov model allows joint inference of population structure and recombination events
- Provides an alternative way of characterizing a population

Application 3
Robust Estimation of Local Ancestry in Admixed Populations (in submission)

Recently admixed population
African Americans: multiple sources of ancestry (African / European)
[Figure: local genetic ancestry along a chromosome, with African and European segments]

Admixture mapping in recently admixed populations
[Figure: percent ancestry from population A (y-axis, 0% to 80%) against chromosome position (x-axis) for patients and controls, with the candidate disease locus marked]
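The idea behind the plot can be sketched as a per-site comparison of ancestry proportions between patients and controls. This is a toy illustration with made-up data and hypothetical function names; real admixture mapping relies on formal statistical tests rather than a raw difference. Local ancestry is encoded as one character per site, with `a` and `e` used here as arbitrary labels for two source populations.

```python
def ancestry_fraction(local_ancestry, label):
    """Fraction of chromosomes carrying `label` ancestry at each site."""
    n_sites = len(local_ancestry[0])
    return [
        sum(row[t] == label for row in local_ancestry) / len(local_ancestry)
        for t in range(n_sites)
    ]


def candidate_locus(patients, controls, label):
    """Site where patients exceed controls the most in `label` ancestry."""
    p = ancestry_fraction(patients, label)
    c = ancestry_fraction(controls, label)
    excess = [pi - ci for pi, ci in zip(p, c)]
    return max(range(len(excess)), key=lambda t: excess[t]), excess


# Made-up local ancestry calls, one string per chromosome.
patients = ["aaaee", "eaaae", "eeaaa", "aeaee"]
controls = ["aeeea", "eaeee", "eeaea", "aeeee"]

site, excess = candidate_locus(patients, controls, "a")
print(site, excess)
```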

Local genetic ancestry estimation
Input
- Ancestral populations (train data)
  African: 11011000111 01001010011 11001010110 11010011011
  European: 00010000001 00101000011 00111010001 01101000001
- Admixed individual (test data): 11001010001
Output: aaaaaaaeeee

Challenges
- Real ancestral populations are not available for study
- Use un-admixed modern descendants
- How to reflect the discrepancy in the statistical model
- How to encode ancestral population data

Previous Work
1. Allele-frequency-based approach
[Table residue: probability of allele A, European and African; values 0.2 0.8 0.05 0.3 0.15 0.5 0.3 0.6 0.1 0.2]
2. Model individual-based approach
[Figure residue: EUROPEAN 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0; AFRICAN 0 0 1 1 0 1 1 1 0 0 1 0 1 0 0]
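The allele-frequency-based approach listed above can be illustrated with Bayes' rule applied one site at a time. This sketch is a simplification with a hypothetical function name and made-up frequencies: it treats sites as independent, whereas practical methods also model the dependence between neighboring sites.

```python
def ancestry_posterior(alleles, freq_a, freq_b, prior_a=0.5):
    """Per-site posterior probability of ancestry A given 0/1 alleles.

    freq_a[t] and freq_b[t] are the frequencies of allele 1 at site t
    in populations A and B.
    """
    posterior = []
    for x, fa, fb in zip(alleles, freq_a, freq_b):
        like_a = fa if x == 1 else 1 - fa
        like_b = fb if x == 1 else 1 - fb
        evidence = prior_a * like_a + (1 - prior_a) * like_b
        posterior.append(prior_a * like_a / evidence)
    return posterior


# Made-up frequencies: allele 1 is rare in population A and common in B.
print(ancestry_posterior([1, 0], freq_a=[0.2, 0.2], freq_b=[0.8, 0.8]))
```

Sites where the two populations have similar frequencies carry little information, which is one reason single-site evidence is usually pooled along the chromosome.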

Our approach: founder-based population model
Hypothetical founders
African: 11011000111 01001010011 11001010110 11010011011
European: 00010000000 00101000011 00111010001 01101000001

Local ancestry estimation
Hypothetical founders
African: 11011000111 01001010011 11001010110 11010011011
European: 00010000000 00101000011 00111010001 01101000001
African American: 11001010001

How to construct the pool of founders
Associate each population with a unique infinite HMM and link them hierarchically by using a common base measure under a DP
[Figure: iHMM for population 1 and iHMM for population 2]

Hidden Markov Model
- Hidden state: S = (founder indicator k, population indicator j)
- Transition model: recombination between founders and between populations
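A finite sketch of an HMM over the joint state (k, j) is given below. The transition rule is an assumption made for illustration, not the thesis's model: with probability `recomb` a recombination occurs, the population changes with probability `pop_switch`, and the new founder is chosen uniformly within the chosen population. The function name, parameter names, and toy founders are all hypothetical, and at least two populations are assumed. The forward pass visits every pair of states at every site, so its cost is the square of the number of states times the number of sites.

```python
import math


def forward_loglik(hap, founders, recomb=0.05, pop_switch=0.1, mut=0.01):
    """Log-likelihood of `hap` and the number of state-pair evaluations.

    founders maps a population label j to a list of founder haplotypes.
    """
    pops = sorted(founders)
    states = [(k, j) for j in pops for k in range(len(founders[j]))]

    def emit(state, t):
        k, j = state
        return 1 - mut if founders[j][k][t] == hap[t] else mut

    def trans(a, b):
        stay = (1 - recomb) if a == b else 0.0
        # On recombination: pick the population, then a founder within it.
        to_pop = (1 - pop_switch) if a[1] == b[1] else pop_switch / (len(pops) - 1)
        return stay + recomb * to_pop / len(founders[b[1]])

    probs = [emit(s, 0) / len(states) for s in states]
    loglik, ops = 0.0, 0
    for t in range(len(hap)):
        if t > 0:
            probs = [emit(b, t) * sum(p * trans(a, b) for p, a in zip(probs, states))
                     for b in states]
            ops += len(states) ** 2
        scale = sum(probs)          # rescale to avoid numerical underflow
        loglik += math.log(scale)
        probs = [p / scale for p in probs]
    return loglik, ops


toy_founders = {"African": ["1101", "0100"], "European": ["0001", "0010"]}
print(forward_loglik("1101", toy_founders))
```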

Local ancestry model using infinite HMMs
- A new haplotype-based ancestry model
- Unique population encoding based on founder haplotypes
- Leverages the structural relatedness between multiple ancestral populations
- Insensitive to the choice of ancestral population data
- Robust under deviation from typical modeling assumptions

Error rates as a function of train data size
Even when only a limited amount of reference data is available, our method still performs very well
[Figure: error rates for Data 1, Data 2, and Data 3; x-axis: number of individuals per train population]

Robustness under deviation from modeling assumption
[Figure: error rates of our method, HAPMIX, and LAMP; x-axis: deviation from modeling assumption]
Our method is more accurate (lower error rates) and most robust (errors do not increase much)

Analysis of HGDP dataset
Training: YRI, CEU, JPT+CHB, MAYA
[Figure: geographic locations: African, Middle East, Europe, Asia, Oceania, American]

Analysis of HGDP dataset

Discussion
- A new population representation model using the nonparametric Bayesian approach of infinite HMMs
- Efficiently exploits the shared structural information across multiple populations
- Insensitive to the amount of train data; robust under deviation from modeling assumptions
- Reveals interesting characteristics of the study populations

Conclusion
We have presented nonparametric Bayesian models for learning ancestral genetic processes
- Systematically handle grouped data by using hierarchical models
- Explicitly exploit the structural information
- Provide a flexible haplotype inheritance model that can incorporate various genetic properties

Conclusion: future work
- Improve computational complexity
  - Computation in dynamic programming: O((KJ)^2 * T) per chromosome per iteration
  - Reduce redundant computation
  - Parallelization
- Fine-scale analysis of real datasets
- Combination of admixture mapping and association study

Dedicated to my family: Celine, KiHyun, and my parents

Thank you

HDP mixture model for multi-population haplotype inference

Result on three-way admixture

Biological Terms
- Genetic admixture: mixing of genetically distant populations
- Local ancestry: locus-by-locus ancestry in an admixed individual chromosome, e.g. ... 0001010000100 ...