Learning ancestral genetic processes using nonparametric Bayesian models


Kyung-Ah Sohn
October 31, 2011
Committee members: Eric P. Xing (Chair), Zoubin Ghahramani, Russell Schwartz, Kathryn Roeder, Matthew Stephens (University of Chicago)
In partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Recent explosion of genomic data
[Figure: labels 2003 and 2006-]

Inference on human migration history
- Li, Hui, Kelly Cho, Judith R. Kidd, and Kenneth K. Kidd. 2009. Genetic landscape of Eurasia and 'admixture' in Uyghurs. American Journal of Human Genetics 85 (6) (December): 934-7; author reply 937-9.
- Xu, Shuhua, and Li Jin. 2008. A genome-wide analysis of admixture in Uyghurs and a high-density admixture map for disease-gene discovery. American Journal of Human Genetics 83 (3) (September): 322-336.

Finding disease genes
[Figure: mutation, alleles G and A]
Search for causal genes and mutations

Key challenges
- How to model the complex inheritance processes underlying the data
- Realistic and flexible inheritance model
Images courtesy of thurj.org and stormfront.org

Key challenges
- Lack of large-scale samples; data in a high-dimensional space
- Need to exploit structural information in data from genetically distinct but related groups
Image courtesy of Nature Reviews: Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. 3 (5) (May): 380-390. doi:10.1038/nrg795.

Thesis goal
Develop flexible nonparametric Bayesian models for learning ancestral genetic processes that efficiently use structural information in genetic data.
- Biologically: provide practical tools that can reveal interesting characteristics of the study populations
- Statistically: validate how nonparametric Bayesian models can help improve performance in real applications

Theoretical component
- Nonparametric Bayesian models based on the Dirichlet process and its extensions, such as the hierarchical Dirichlet process and the infinite hidden Markov model
- Combined with a haplotype inheritance model describing various genetic processes

SNP haplotypes
- Genetic polymorphism: a difference in DNA sequence among individuals or populations
- Single nucleotide polymorphism (SNP): a DNA sequence variation in which a single nucleotide (A, T, C, or G) differs between members of a group
- Haplotype: the sequence of alleles carried on a single chromosome
- Genotype: the sequence of unordered pairs of haplotype alleles in a diploid individual; e.g. the two haplotypes (AATC) and (ACGG) form the genotype sequence ({A,A}, {A,C}, {G,T}, {C,G})
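The haplotype-to-genotype example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `genotype` is hypothetical, and each unordered pair is represented as a sorted tuple.

```python
def genotype(hap1, hap2):
    """Return the genotype: one unordered allele pair per SNP locus."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same loci")
    # Sorting each pair discards which chromosome an allele came from.
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


# The example from the slide.
g = genotype("AATC", "ACGG")
print(g)  # [('A', 'A'), ('A', 'C'), ('G', 'T'), ('C', 'G')]

# A different haplotype pair yields the same genotype, which is why
# haplotypes have to be inferred rather than read off the genotype.
print(genotype("AAGC", "ACTG") == g)  # True
```

The second call shows the ambiguity that haplotype inference has to resolve: every heterozygous locus after the first doubles the number of haplotype pairs consistent with the observed genotype.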

Thesis outline
1. Haplotype inference from multi-population data
2. Joint inference of population structure and recombination events
3. Local ancestry estimation in admixed populations

Application 1
A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data (International Conference on Machine Learning 2006, Annals of Applied Statistics 2009)

Motivation
- Few existing programs explicitly use population labels in haplotype inference, even though many datasets come from genetically distinct populations
- Each population has its own characteristics
- The populations are related and may share some components
- Develop a principled approach that explicitly exploits the population labels

Bayesian haplotype inference model
[Figure: graphical model. A Dirichlet process prior (base measure G_0, random distribution F) generates the founder haplotypes A; a mutation model links the founders to the individual haplotypes H_n1 and H_n2; a genotyping model links these to the observed individual genotype G_n.]
Each individual haplotype is a mixture of founder haplotypes.
Xing, Eric P., Michael I. Jordan, and Roded Sharan. 2007. Bayesian haplotype inference via the Dirichlet process. Journal of Computational Biology 14 (3) (April): 267-284. doi:10.1089/cmb.2006.0102.

Dirichlet process
A random probability distribution $F$ on $\Omega$ follows a Dirichlet process if, for any finite measurable partition $(B_1, B_2, \ldots, B_m)$ of $\Omega$, the random variables $F(B_1), F(B_2), \ldots, F(B_m)$ have the joint distribution

$$(F(B_1), F(B_2), \ldots, F(B_m)) \sim \mathrm{Dirichlet}(\alpha G_0(B_1), \ldots, \alpha G_0(B_m)),$$

where $G_0$ is the base measure and $\alpha$ is the scale parameter.
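As a small worked case of this definition, take the two-set partition made of a measurable set $B$ and its complement. Writing $\alpha$ for the scale parameter and $G_0$ for the base measure (the symbol choices here are an assumption), the Dirichlet distribution reduces to a Beta distribution, and the standard Beta moments give the mean and variance of $F(B)$:

```latex
\begin{aligned}
F(B) &\sim \mathrm{Beta}\bigl(\alpha G_0(B),\; \alpha (1 - G_0(B))\bigr), \\
\mathbb{E}[F(B)] &= G_0(B), \\
\mathrm{Var}[F(B)] &= \frac{G_0(B)\,\bigl(1 - G_0(B)\bigr)}{\alpha + 1}.
\end{aligned}
```

So the base measure sets the expected mass on any set, and a larger scale parameter concentrates $F$ more tightly around $G_0$.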

Dirichlet process: Pólya urn model
We associate mixture components (founders) with colors, and samples (individual chromosomes) with balls, in the Pólya urn model.
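The urn metaphor can be simulated directly. The sketch below is a minimal illustration under stated assumptions, not the thesis implementation: `polya_urn` and its parameter names are hypothetical, `alpha` plays the role of the DP scale parameter, and colors are plain integers rather than founder haplotypes drawn from a base measure.

```python
import random


def polya_urn(n_balls, alpha, rng):
    """Assign n_balls samples to colors with the Polya urn scheme.

    Ball i (0-based) takes existing color k with probability
    counts[k] / (i + alpha) and a new color with probability
    alpha / (i + alpha).
    """
    counts = []   # counts[k] = number of balls with color k so far
    colors = []   # color assigned to each ball, in order of arrival
    for i in range(n_balls):
        u = rng.random() * (i + alpha)
        for k, c in enumerate(counts):
            if u < c:
                break          # landed on existing color k
            u -= c
        else:
            k = len(counts)    # fell past all existing colors: new color
            counts.append(0)
        counts[k] += 1
        colors.append(k)
    return colors, counts


colors, counts = polya_urn(100, alpha=2.0, rng=random.Random(0))
print(len(counts), "founders (colors) used by", len(colors), "chromosomes (balls)")
```

Popular colors attract more balls, while a new color remains possible at every draw, so the number of colors grows with the sample size instead of being fixed in advance.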

DP mixture model
- A flexible haplotype inheritance model
- The number of founders can grow as large as the data require
- A reasonable approximation to coalescent theory
[Figure: coalescent with mutation]

Multi-population data

A hierarchical Dirichlet process mixture
Two-level Pólya urn scheme:
- Assume a population-specific DP
- Use a common base measure, itself distributed as another Dirichlet process
- Atoms (founder haplotypes) are shared across populations
[Figure labels: mutation model, genotyping model]
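The two-level urn scheme can be sketched as follows. This is a simplified illustration under assumptions: the function names, the parameters `alpha` (population-level scale) and `gamma` (top-level scale), and the integer founder ids are all hypothetical, and no mutation or genotyping model is included.

```python
import random


def draw(counts, concentration, rng):
    """Return index k with weight counts[k], or len(counts) with weight concentration."""
    u = rng.random() * (sum(counts) + concentration)
    for k, c in enumerate(counts):
        if u < c:
            return k
        u -= c
    return len(counts)


def two_level_urn(pop_sizes, alpha, gamma, rng):
    """Assign a founder id to every chromosome in every population."""
    top = []       # top[k] = number of times founder k was drawn from the shared urn
    labels = []    # labels[j] = founder ids of the chromosomes in population j
    for n in pop_sizes:
        local = {}             # founder id -> ball count in this population's urn
        pop_labels = []
        for _ in range(n):
            ids = list(local)
            i = draw([local[k] for k in ids], alpha, rng)
            if i < len(ids):
                k = ids[i]                 # reuse a founder already in this population
            else:
                k = draw(top, gamma, rng)  # consult the shared top-level urn
                if k == len(top):
                    top.append(0)          # a founder new to all populations
                top[k] += 1
            local[k] = local.get(k, 0) + 1
            pop_labels.append(k)
        labels.append(pop_labels)
    return labels, top


labels, top = two_level_urn([20, 20, 20], alpha=1.0, gamma=1.0, rng=random.Random(1))
for j, pop in enumerate(labels):
    print("population", j, "uses founders", sorted(set(pop)))
```

Because every population draws its new founders from the same top-level urn, a founder introduced by one population can reappear in another, which is the sharing of atoms described above.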

Performance
Comparison of DP and HDP mixtures
- Mode-1: ignore labels
- Mode-2: handle each group separately
Haplotype inference as a measurable application of DP and HDP mixture models

Comparison with benchmark algorithms

Sensitivity analysis on hyperparameters

Discussion
- Explicitly leverages population labels in haplotype inference using a hierarchical Dirichlet process mixture model
- Recombination is not taken into account; a divide-and-conquer scheme called Partition-Ligation is used to handle long sequences

Application 2
Spectrum: joint inference of population structure and recombination events (NIPS 2007, ISMB 2007, BA 2007)

Inheritance process
[Figure: chromosomes after one generation and after many generations, with positions marked by x]

A new model for population representation
- Basic assumption: modern chromosomes are derived from hypothetical founder chromosomes via recombination and mutation
- Each individual chromosome is a mosaic of founders
[Figure: hypothetical founder chromosomes and an individual chromosome]
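A generative reading of the mosaic assumption might look like the sketch below. Everything here is an illustrative assumption rather than the thesis model: the function name, a single constant recombination probability per locus, uniform choice of the next founder, a symmetric 0/1 mutation, and the `FOUNDERS` strings (binary haplotypes borrowed from the slides purely as example data).

```python
import random

# Hypothetical founder pool for illustration only.
FOUNDERS = ["00010000000", "00101000011", "00111010001", "01101000001"]


def mosaic_chromosome(founders, recomb_rate, mut_rate, rng):
    """Copy alleles locus by locus from one founder at a time.

    At each locus after the first, a recombination occurs with probability
    recomb_rate and the copying source jumps to a uniformly chosen founder
    (possibly the same one). Each copied allele is flipped with probability
    mut_rate.
    """
    k = rng.randrange(len(founders))
    alleles, path = [], []
    for t in range(len(founders[0])):
        if t > 0 and rng.random() < recomb_rate:
            k = rng.randrange(len(founders))
        a = founders[k][t]
        if rng.random() < mut_rate:
            a = "1" if a == "0" else "0"
        alleles.append(a)
        path.append(k)
    return "".join(alleles), path


chrom, path = mosaic_chromosome(FOUNDERS, recomb_rate=0.2, mut_rate=0.01, rng=random.Random(7))
print(chrom)
print("".join(str(k) for k in path))  # founder index copied at each locus
```

The second printed line is the hidden founder association for the chromosome; population analysis runs this process in reverse, recovering the founders and the path from observed chromosomes.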

Population analysis
Given population data, recover the pool of hypothetical founders and their association with individual haplotypes.
Haplotypes: 00010000000 00101000011 00111010001 01101000001
[Figure labels: hypothetical founder, association]

How to recover the association
- Use hidden Markov models (HMMs)
- How many founders? Use an infinite HMM (Beal et al. (2002), Teh et al. (2006))
[Figure: HMM with hidden state sequence c_1, c_2, c_3, ..., c_n and an observation sequence]
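For intuition, here is a Viterbi decoder for a finite version of this HMM, in which the founders are already known and the hidden state at each locus is the founder being copied. This is a deliberately simplified sketch, not the infinite HMM inference: the fixed number of founders, the uniform-jump transition, the symmetric mismatch emission, and all names and default rates are assumptions.

```python
import math


def viterbi_founders(hap, founders, recomb_rate=0.05, mut_rate=0.01):
    """Most likely founder index at each locus of hap (finite-K sketch)."""
    K, T = len(founders), len(hap)
    log_stay = math.log(1 - recomb_rate + recomb_rate / K)
    log_jump = math.log(recomb_rate / K)

    def log_emit(k, t):
        match = founders[k][t] == hap[t]
        return math.log(1 - mut_rate if match else mut_rate)

    def log_trans(j, k):
        return log_stay if j == k else log_jump

    score = [math.log(1 / K) + log_emit(k, 0) for k in range(K)]
    back = []  # back[t-1][k] = best predecessor of state k at locus t
    for t in range(1, T):
        new_score, ptr = [], []
        for k in range(K):
            best = max(range(K), key=lambda j: score[j] + log_trans(j, k))
            new_score.append(score[best] + log_trans(best, k) + log_emit(k, t))
            ptr.append(best)
        score = new_score
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    k = max(range(K), key=lambda j: score[j])
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    return path[::-1]


print(viterbi_founders("00001111", ["00000000", "11111111"]))
# [0, 0, 0, 0, 1, 1, 1, 1]
```

The decoded path switches founder exactly where the haplotype stops matching the first founder, which is the kind of recombination breakpoint the association is meant to expose. Fixing K in advance is the limitation that the infinite HMM removes.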

Infinite hidden Markov model (Hidden Markov Dirichlet Process)
- Transitions among an infinite number of founders
- Infinite-dimensional transition matrix modeled by a hierarchical Dirichlet process
  - each row is modeled with a DP
  - rows are coupled by a common base measure under another DP
[Figure: transition matrix with rows and columns indexed 1, 2, 3, ...]
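In symbols, the row-wise construction can be written as below. The notation is an assumption borrowed from the standard HDP formulation of the infinite HMM rather than taken from the slides: $\beta$ is the shared vector of weights over founders, $\gamma$ and $\alpha$ are the top-level and row-level scale parameters, $\pi_k$ is row $k$ of the transition matrix, and $c_t$ is the founder indicator at locus $t$.

```latex
\begin{aligned}
\beta \mid \gamma &\sim \mathrm{GEM}(\gamma), \\
\pi_k \mid \alpha, \beta &\sim \mathrm{DP}(\alpha, \beta), \qquad k = 1, 2, 3, \ldots, \\
c_t \mid c_{t-1} &\sim \pi_{c_{t-1}}, \qquad t = 2, \ldots, n.
\end{aligned}
```

Because every row $\pi_k$ is drawn around the same $\beta$, all rows place their mass on one shared, countable set of founders, which is what lets any founder transition to any other.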

Infinite HMM model for population analysis
- Founder allele reconstruction
- Inferring population structure
- Inferring recombination hotspots
[Figure: graphical model of the infinite HMM; legible labels: H, A, c_1, c_2, c_3, ..., c_n]

Discussion
- A new haplotype model based on an infinite hidden Markov model allows joint inference of population structure and recombination events
- Provides an alternative way of characterizing a population

Application 3
Robust estimation of local ancestry in admixed populations (in submission)

Recently admixed population
African Americans: multiple sources of ancestry (African / European)
[Figure: local genetic ancestry, with African and European segments]

Admixture mapping in recently admixed populations
[Figure: percent ancestry from population A (y-axis, 0% to 80%) against chromosome position (x-axis) for patients and controls, with the candidate disease locus marked]

Local genetic ancestry estimation
Input
- Ancestral populations (training data)
  - African: 11011000111 01001010011 11001010110 11010011011
  - European: 00010000001 00101000011 00111010001 01101000001
- Admixed individual (test data): 11001010001
Output
- Local ancestry per locus: aaaaaaaeeee

Challenges
- Real ancestral populations are not available for study, so un-admixed modern descendants are used instead
- How to reflect this discrepancy in the statistical model
- How to encode ancestral population data

Previous work
1. Allele-frequency-based approach
[Figure: probability of allele A in the European and African populations; values as extracted: 0.2 0.8 0.05 0.3 0.15 0.5 0.3 0.6 0.1 0.2]
2. Model individual-based approach
[Figure: reference haplotypes as extracted. EUROPEAN: 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0; AFRICAN: 0 0 1 1 0 1 1 1 0 0 1 0 1 0 0]

Our approach
Founder-based population model
[Figure: hypothetical founders linked to the population haplotypes]
- African: 11011000111 01001010011 11001010110 11010011011
- European: 00010000000 00101000011 00111010001 01101000001

Local ancestry estimation
[Figure: hypothetical founders linked to the population haplotypes and to the admixed chromosome]
- African: 11011000111 01001010011 11001010110 11010011011
- European: 00010000000 00101000011 00111010001 01101000001
- African American: 11001010001

How to construct the pool of founders
Associate each population with its own infinite HMM, and link the HMMs hierarchically through a common base measure under a DP.
[Figure: iHMM for population 1, iHMM for population 2]

Hidden Markov model
- Hidden state: S = (founder indicator k, population indicator j)
- Transition model: recombination between founders and between populations
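One way to picture the composite state space is to enumerate the (k, j) pairs and build a transition table over them. The parameterization below is purely hypothetical and chosen for illustration; the slides do not specify it. It assumes a recombination probability `r`, a probability `s` that a recombination also switches population, uniform choice among the other populations, and fixed per-population founder weights.

```python
from itertools import product


def composite_transitions(founder_weights, r, s):
    """Build transitions over states (k, j) for at least two populations.

    founder_weights[j][k] is the (hypothetical) weight of founder k within
    population j; each inner list sums to 1. With probability 1 - r the
    state is kept. Otherwise a recombination picks the population (the same
    one with probability 1 - s, else another one uniformly) and then a
    founder within that population according to its weights.
    """
    J = len(founder_weights)
    states = [(k, j) for j in range(J) for k in range(len(founder_weights[j]))]
    P = {}
    for (k, j), (k2, j2) in product(states, repeat=2):
        pop_prob = (1 - s) if j2 == j else s / (J - 1)
        p = r * pop_prob * founder_weights[j2][k2]
        if (k2, j2) == (k, j):
            p += 1 - r
        P[(k, j), (k2, j2)] = p
    return states, P


weights = [[0.5, 0.3, 0.2], [0.6, 0.4]]   # 3 founders in population 0, 2 in population 1
states, P = composite_transitions(weights, r=0.1, s=0.05)
print(len(states), "composite states")
print(sum(P[states[0], nxt] for nxt in states))  # each row sums to 1 (up to rounding)
```

Reading the population indicator j off the decoded state sequence gives the local ancestry at each locus, while changes in k within one population correspond to recombination between founders.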

Local ancestry model using infinite HMMs
- A new haplotype-based ancestry model
- Unique population encoding based on founder haplotypes
- Leverages the structural relatedness between multiple ancestral populations
- Insensitive to the choice of ancestral population data
- Robust under deviation from typical modeling assumptions

Error rates as a function of training data size
Even when only a limited amount of reference data is available, our method still performs very well.
[Figure: error rates for Data 1, Data 2, and Data 3; x-axis: number of individuals per training population]

Robustness under deviation from modeling assumption
Compared with HAPMIX and LAMP, our method is more accurate (lower error rates) and the most robust (errors do not increase much).
[Figure: error rates for our method, HAPMIX, and LAMP; x-axis: deviation from modeling assumption]

Analysis of HGDP dataset
Training: YRI, CEU, JPT+CHB, MAYA
[Figure: results by geographic location: African, Middle East, Europe, Asia, Oceania, American]

Analysis of HGDP dataset

Discussion
- A new population representation model using the nonparametric Bayesian approach of infinite HMMs
- Efficiently exploits the structural information shared across multiple populations
- Insensitive to the amount of training data and robust under deviation from modeling assumptions
- Reveals interesting characteristics of the study populations

Conclusion
We have presented nonparametric Bayesian models for learning ancestral genetic processes that:
- Systematically handle grouped data by using hierarchical models
- Explicitly exploit the structural information
- Provide a flexible haplotype inheritance model that can incorporate various genetic properties

Conclusion: future work
- Improve computational efficiency
  - Computation in dynamic programming: O((KJ)^2 * T) per chromosome per iteration
  - Reduce redundant computation
  - Parallelization
- Fine-scale analysis of real datasets
- Combination of admixture mapping and association study
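The quoted cost can be seen in a plain forward pass over the composite states. The sketch below is a generic HMM forward recursion, not the thesis code: the function name and the operation counter are hypothetical, S = K * J is assumed to be the number of (founder, population) states, and T the number of loci. It omits the rescaling a real implementation would need to avoid underflow on long chromosomes.

```python
def forward_likelihood(init, trans, emit):
    """Forward algorithm; returns (likelihood, transition multiply-add count).

    init[s]      : initial probability of state s
    trans[s][s2] : probability of moving from state s to state s2
    emit[t][s]   : probability of the observed allele at locus t under state s
    """
    S, T = len(init), len(emit)
    alpha = [init[s] * emit[0][s] for s in range(S)]
    ops = 0
    for t in range(1, T):
        new_alpha = []
        for s2 in range(S):
            total = 0.0
            for s in range(S):          # S terms for each of the S target states
                total += alpha[s] * trans[s][s2]
                ops += 1
            new_alpha.append(total * emit[t][s2])
        alpha = new_alpha
    return sum(alpha), ops


K, J, T = 3, 2, 10
S = K * J
init = [1.0 / S] * S
trans = [[1.0 / S] * S for _ in range(S)]
emit = [[0.5] * S for _ in range(T)]
likelihood, ops = forward_likelihood(init, trans, emit)
print(ops, "=", S * S * (T - 1))   # grows as (KJ)^2 * T
```

The inner double loop is where the (KJ)^2 factor comes from, so any structure in the transition model that avoids summing over all state pairs reduces redundant computation, and chromosomes can be processed in parallel since each pass is independent.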

Dedicated to my family: Celine, KiHyun, and my parents

Thank you

HDP mixture model for multi-population haplotype inference

Result on three-way admixture

Biological terms
- Genetic admixture: mixing of genetically distant populations
- Local ancestry: locus-by-locus ancestry in an admixed individual's chromosome
[Figure: chromosome ... 0001010000100 ...]