Effect of Genetic Divergence in Identifying Ancestral Origin using HAPAA

Similar documents
Learning ancestral genetic processes using nonparametric Bayesian models

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

Mathematical models in population genetics II

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Genotype Imputation. Biostatistics 666

Population genetics models of local ancestry

Supporting Information Text S1

Introduction to Advanced Population Genetics

Learning Ancestral Genetic Processes using Nonparametric Bayesian Models

1. Understand the methods for analyzing population structure in genomes

Notes on Population Genetics

Spatial localization of recent ancestors for admixed individuals

Calculation of IBD probabilities

Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

Genotype Imputation. Class Discussion for January 19, 2016

CS1820 Notes. hgupta1, kjline, smechery. April 3-April 5. output: plausible Ancestral Recombination Graph (ARG)

Figure S1: The model underlying our inference of the age of ancient genomes

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

An Efficient and Accurate Graph-Based Approach to Detect Population Substructure

Linkage and Linkage Disequilibrium

Calculation of IBD probabilities

Inferring Admixture Histories of Human Populations Using. Linkage Disequilibrium

Supplementary Materials for

CS 68: BIOINFORMATICS. Prof. Sara Mathieson Swarthmore College Spring 2018

Population Genetics I. Bio

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Detecting selection from differentiation between populations: the FLK and hapflk approach.

p(d g A,g B )p(g B ), g B

Week 7.2 Ch 4 Microevolutionary Proceses

Supporting Information

Computational Genomics and Molecular Biology, Fall

The Wright-Fisher Model and Genetic Drift

Haplotyping as Perfect Phylogeny: A direct approach

Introduction to Linkage Disequilibrium

Evidence of Evolution

O 3 O 4 O 5. q 3. q 4. Transition

Inferring Admixture Histories of Human Populations Using Linkage Disequilibrium

Hidden Markov models in population genetics and evolutionary biology

Population Structure

Genetic Variation in Finite Populations

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

Figure S2. The distribution of the sizes (in bp) of syntenic regions of humans and chimpanzees on human chromosome 21.

Populations in statistical genetics

Problems for 3505 (2011)

mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations

2. Map genetic distance between markers

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Structure Learning in Sequential Data

Big Idea #1: The process of evolution drives the diversity and unity of life

Binomial Mixture Model-based Association Tests under Genetic Heterogeneity

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Bayesian Multi-Population Haplotype Inference via a Hierarchical Dirichlet Process Mixture

Genetic Association Studies in the Presence of Population Structure and Admixture

KARYOTYPE. An organism s complete set of chromosomes

A sequentially Markov conditional sampling distribution for structured populations with migration and recombination

Evolutionary Models. Evolutionary Models

(Genome-wide) association analysis

Supporting information for Demographic history and rare allele sharing among human populations.

Pair Hidden Markov Models

Hidden Markov Models. Terminology, Representation and Basic Problems

Frequency Spectra and Inference in Population Genetics

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Segregation versus mitotic recombination APPENDIX

Haploid & diploid recombination and their evolutionary impact

Phasing via the Expectation Maximization (EM) Algorithm

LECTURE NOTES: Discrete time Markov Chains (2/24/03; SG)

Analysis of Y-STR Profiles in Mixed DNA using Next Generation Sequencing

Computational Methods for Learning Population History from Large Scale Genetic Variation Datasets

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

BMI/CS 576 Fall 2016 Final Exam

How should we organize the diversity of animal life?

theta H H H H H H H H H H H K K K K K K K K K K centimorgans

HMM part 1. Dr Philip Jackson

Feature Selection via Block-Regularized Regression

Statistical Methods for studying Genetic Variation in Populations

Similarity Measures and Clustering In Genetics

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Hidden Markov Models

Organizing Life s Diversity

6 Introduction to Population Genetics

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

What is the expectation maximization algorithm?

Processes of Evolution

Population Genetics II (Selection + Haplotype analyses)

Lecture 7 Sequence analysis. Hidden Markov Models

Mutation, Selection, Gene Flow, Genetic Drift, and Nonrandom Mating Results in Evolution

Homology and Information Gathering and Domain Annotation for Proteins

How robust are the predictions of the W-F Model?

Statistical Methods for studying Genetic Variation in Populations

Genome 373: Hidden Markov Models II. Doug Fowler

Analysis of DNA variations in GSTA and GSTM gene clusters based on the results of genome-wide data from three Russian populations taken as an example

Extent of third-order linkage disequilibrium in a composite line of Iberian pigs

6 Introduction to Population Genetics

Effective population size and patterns of molecular evolution and variation

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

The genomes of recombinant inbred lines

Homology. and. Information Gathering and Domain Annotation for Proteins

THERE has been interest in analyzing population genomic

Supplementary Materials: Efficient moment-based inference of admixture parameters and sources of gene flow

Transcription:

Effect of Genetic Divergence in Identifying Ancestral Origin using HAPAA Andreas Sundquist*, Eugene Fratkin*, Chuong B. Do, Serafim Batzoglou Department of Computer Science, Stanford University, Stanford, CA 94305, USA {asundqui, fratkin, chuongdo, serafim}@cs.stanford.edu Figures Figure 1: (A) Hierarchical HMM state diagram for HAPAA. On the left, inter- and intra-population transitions occur with probabilities governed by matrix (, ). In the middle, each population has a similar structure: entry state transitions with uniform probability to a diploid model individual, then to exit state. On the right, in we transition into one of two states representing the haplotypes and of model individual with equal probability. Each haplotype emits its alleles via a mutation/error probability distribution (, ). Haplotypes transition to each other with probability proportional to the phase switch error, and transition out of the diploid sample with probability governed by genetic distance to the next locus and the population-specific recombination rate parameter. (B) HMM state diagram for previous methods. Each state represents a population and emits alleles according to frequency estimates for the populations, and admixture transition probabilities depend on the degree of admixture expected and other learned parameters. By construction, these methods assume a greater degree of independence between adjacent loci. *These authors contributed equally to the work.

Figure 2: Example inference on chromosome 22 of an individual admixed between three HapMap populations. The top two tracks represent the true ancestries, followed by three stages of HAPAA processing, and finally posterior probabilities and Viterbi decoding by SABER. The gray bars highlight two locations with correctly inferred ancestry but with phase switching errors between the haplotypes. Figure 3: Performance comparison between HAPAA and SABER. We measured the mean-square- error of the inferred posterior probability of population ancestry on chromosome 22 for a varying number of generations of admixture. Tests were constructed by simulating admixture over generations from 2 individuals selected randomly from three HapMap populations. 2

Figure 4: Methodology for studying the effect of genetic divergence on ancestry inference. We simulate pairs of randomly mating populations of fixed size 5,000 derived from the HapMap CEU population over generations. We construct training and test individuals derived over generations of admixture from 2 1 ancestors from one population and one ancestor from the other (minor) population. 3

Figure 5: Recall and precision of detecting minor population. We simulated twenty pairs of populations separated by {100,200,,2000} generations of drift on the whole genome of Illumina 550K loci. For each we constructed test individuals that were derived over {1,2,,20} generations of admixture from 2 1 ancestors from one population and one ancestor from the other (minor) population. Conditioned on the existence of at least one haploblock derived from the minor population, we measure the ability of HAPAA to identify these loci. 4

Figure 6: Performance of HAPAA when varying the number of model individuals. We created models with a varying number of individuals derived from three populations in the HapMap dataset within chromosome 22. For the Uniform population sizes we randomly picked {4,8,,48} individuals to model each population, while for the Uneven population sizes we picked 20 individuals from CEU and YRI and {4,8,,72} individuals from the ASN population. We benchmarked the meansquare-error performance of HAPAA on 1,000 test individuals admixed over {1,2,,20} generations from the three populations. Figure 7: Performance of HAPAA when using ancestral versus present-day model individuals. For each {1,2,,20} we constructed (1) a set of unrelated ancestral individuals by randomly selecting 45 from each HapMap population, and (2) 45 unrelated present-day model individuals for each population. Present-day individuals were generated over generations of random mating using HapMap samples as templates. We tested on chromosome 22 on individuals admixed from the three populations over generations using the mean-square-error metric. 5