A Nonparametric Bayesian Approach for Haplotype Reconstruction from Single and Multi-Population Data


Eric P. Xing and Kyung-Ah Sohn
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
January 2007
CMU-ML-07-107

Abstract

Uncovering the haplotypes of single nucleotide polymorphisms (SNPs) and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far, including those based on approximate coalescence, finite mixtures, and maximal parsimony, often bypass issues such as the unknown complexity of the haplotype space and the demographic structures underlying multi-population genotype data. In this paper, we propose a new class of haplotype inference models based on a nonparametric Bayesian formalism built on the Dirichlet process, which represents a tractable surrogate to the coalescent process underlying population haplotypes and offers a well-founded statistical framework to tackle the aforementioned issues. Our proposed model, known as a hierarchical Dirichlet process mixture, is exchangeable, unbounded, and capable of coupling demographic information across populations for posterior inference of individual haplotypes, the size and configuration of haplotype ancestor pools, and other parameters of interest given genotype data. The resulting haplotype inference program, Haploi, is readily applicable to genotype sequences with thousands of SNPs, at a time-cost often two orders of magnitude less than that of the state-of-the-art PHASE program, with competitive and sometimes superior performance. Haploi also significantly outperforms several other extant algorithms on both simulated and realistic data.

Keywords: haplotype inference, Dirichlet process, hierarchical Dirichlet process, mixture model, population genetics

1 Introduction

Recent experimental advances have led to an explosion of data documenting genetic variation at the DNA level within and between populations. For example, the International SNP Map Working Group [1] has reported the identification and mapping of 1.4 million single nucleotide polymorphisms (SNPs) from the genomes of four different human populations. These kinds of data lead to challenging inference problems whose solutions could lead to greater understanding of the genetic basis of disease propensities and other complex traits [2,3].

SNPs represent the largest class of individual differences in DNA. A SNP refers to the existence of two specific nucleotides chosen from {A, C, G, T} at a single chromosomal locus in a population; each variant is called an allele. A haplotype refers to the joint allelic identities of a list of SNPs at contiguous sites in a local region of a single chromosome. Assuming no recombination in this local region, a haplotype is inherited as a unit. For diploid organisms (such as humans), each individual has two physical copies of each chromosome (except for the Y chromosome) in his/her somatic cells; one copy is inherited from the mother, and the other from the father. Thus, during each generation of inheritance, when chromosomes come in pairs, two haplotypes, for example h_1 = (1, 1, 0, 0) and h_2 = (0, 0, 1, 1) of a 4-locus region, go together to make up a genotype, which is the list of unordered pairs of alleles in the attendant region, e.g., g = (1/0, 1/0, 1/0, 1/0) in the case of the aforementioned two haplotypes. That is, a genotype is obtained from a pair of haplotypes by omitting the specification of the association of each allele with one of the two chromosomes, i.e., its phase. Indeed, phase is in general ambiguous when only the genotypes of a SNP sequence are given [4,5]. For example, given the g above, an alternative configuration of the haplotypes, h_1 = (1, 1, 1, 1) and h_2 = (0, 0, 0, 0), is also consistent with the genotype; but observing multiple genotypes in a population can help to bias the phase reconstruction toward the true haplotypes.

The problem of inferring SNP haplotypes from genotypes is essential for the understanding of genetic variation and linkage disequilibrium patterns in a population. For example, accurate inferences concerning population structures or quantitative trait locus maps usually demand the analysis of the genetic states of possibly non-recombinant segments of the subject's chromosome(s) [6]. Thus, it is advantageous to study haplotypes, which consist of several closely spaced (hence linked) phase-known SNPs and often prove to be more powerful discriminators of genetic variations within and among populations. Common biological methods for assaying genotypes typically do not provide phase information for individuals with heterozygous genotypes at multiple autosomal loci; phase can be obtained at a considerably higher cost via molecular haplotyping [7]. In addition to being costly, these methods are subject to experimental error and are low-throughput. Alternatively, phase can also be inferred from the genotypes of a subject's close relatives [5], but this approach is often hampered by the fact that typing family members increases the cost and does not guarantee full informativeness. It is therefore desirable to develop automatic and robust in silico methods for reconstructing haplotypes from genotypes and possibly other data sources (e.g., pedigrees).
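To make the phase ambiguity concrete, the short Python sketch below (illustrative only, and not part of the original report; the 0/1 allele coding and the 0/1/2 genotype coding anticipate the conventions introduced in Section 2) enumerates all haplotype pairs consistent with an unphased genotype.

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs (h1, h2) consistent with an
    unphased genotype, where each entry is 0 or 1 (homozygous) or 2
    (heterozygous).  Returns a sorted list of pairs of 0/1 tuples."""
    per_site = []
    for g in genotype:
        if g in (0, 1):
            per_site.append([(g, g)])           # homozygous site: phase is fixed
        elif g == 2:
            per_site.append([(0, 1), (1, 0)])   # heterozygous site: two orderings
        else:
            raise ValueError("unexpected genotype code: %r" % g)
    pairs = set()
    for combo in product(*per_site):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))      # treat (h1, h2) as an unordered pair
    return sorted(pairs)

# The 4-locus example from the text: g = (1/0, 1/0, 1/0, 1/0).
for h1, h2 in consistent_haplotype_pairs([2, 2, 2, 2]):
    print(h1, h2)
# Prints 2^(4-1) = 8 unordered pairs, among them (0,0,1,1)/(1,1,0,0) and
# (0,0,0,0)/(1,1,1,1), so the phase cannot be resolved from one genotype alone.
```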
Key to the inference of individual haplotypes based on a given genotype sample is the formulation and tractability of the marginal probability of the haplotypes of a study population. Consider the set of haplotypes, denoted as H = {h_1, h_2, ..., h_{2n}} (where h_i ∈ P^T, P denotes the allele space of the polymorphic markers, and T denotes the length of the marker sequence), of a random

sample of 2n chromosomes from n individuals taken from a population at stationarity of some inheritance process, e.g., an infinitely-many-allele (IMA) mutation model. Under common genetic arguments, the ancestral relationships amongst the sample back to its most recent common ancestor (MRCA) can be described by a genealogical tree, and computing P(H) involves a marginalization over all possible genealogical trees consistent with the sample, which is widely known to be intractable. As discussed in Stephens and Donnelly [8], if we write P(H) as a product of conditionals based on the chain rule, i.e.,

P(h_1, h_2, ..., h_{2n}) = P(h_1) P(h_2 | h_1) ... P(h_{2n} | h_1, ..., h_{2n-1}),    (1)

then the generation of a haplotype sample H can be viewed as a sequential process that draws one haplotype at a time, conditioning on all the previously drawn haplotypes, e.g., by introducing random mutations to the latter. (This is equivalent to sampling from a genealogy evolving in non-overlapping generations.) Therefore, one can develop a tractable approximation to P(H) by appropriately approximating the conditionals in Eq. (1). Stephens and Donnelly [8] suggested an approximation to P(h_i | h_1, ..., h_{i-1}) that captures, among several desirable genetic properties, the parent-dependent-mutation (PDM) property, by modeling h_i as the progeny of a randomly chosen existing haplotype through a geometrically distributed number of mutations. (The parent-dependent-mutation property posits that, in a sequential generation process of haplotypes, if the next haplotype does not exactly match an existing haplotype, it will tend to differ by a small number of mutations from an existing one, rather than be completely different.) This model, referred to as the PAC (for Product of Approximate Conditionals) model, forms the basis of the PHASE program [9], which has set the state-of-the-art benchmarks in haplotype inference.

However, one caveat of the PAC model, as acknowledged in Li and Stephens [10], is that it implicitly assumes the existence of an ordering in the haplotype sample; therefore the resulting likelihood does not enjoy the property of exchangeability that we would expect to be satisfied by the true P(H). Although empirically this pitfall appears to be inconsequential after some heuristic averaging over a moderate number of random orderings, it is difficult to associate this approximation with an explicit assumption about the population demography and genealogy underlying the sample. For example, the genealogy of haplotypes with possibly common ancestry is replaced by asymmetric pairwise relationships (induced by the conditional mutation model) between the haplotypes. The resulting flattening of the latent genealogical history makes it difficult to use the PAC method to discover and exploit latent demographic structures, such as estimating the number and pattern of prototypical haplotypes (i.e., founders), which may be indicative of genetic bottlenecks and divergence times of the study population, or to make use of the ethnic identities of the sample to improve haplotyping accuracy in multi-population haplotype inference.

The finite mixture models adopted by programs such as HAPLOTYPER represent another class of haplotype models that rely very little on demographic and genetic assumptions about the sample. Under such a model, haplotypes are treated as latent variables associated with specific frequencies, and the probability of a genotype is given by:

p(g) = Σ_{h_1, h_2 ∈ H} p(h_1, h_2) f(g | h_1, h_2),    (2)

where f(g | h_1, h_2) is a noisy channel relating the observed genotype to the unobserved true underlying haplotypes, and H denotes the set of all possible haplotypes of a given region. (A prevalent form of f in the literature is the deterministic indicator f = I(h_1 ⊕ h_2 = g), which simply requires that haplotypes h_1 and h_2 be consistent with g. More desirable forms of f would model errors in the genotyping or data recording process, a point we will return to later in the paper.) Under the assumption of Hardy-Weinberg equilibrium (HWE), an assumption that is standard in the literature and will also be made here, the mixing proportion p(h_1, h_2) is assumed to factor as p(h_1)p(h_2). Given this basic statistical structure, the haplotype inference problem can be viewed as a missing-value inference and parameter estimation problem. Numerous statistical models and inference approaches have been developed for this problem, such as maximum likelihood approaches via the EM algorithm [11, 15-17] and a number of parametric Bayesian inference methods based on Markov chain Monte Carlo (MCMC) techniques [12, 14].

The finite mixture model defines an exchangeable P(H). But since such models are data-driven rather than genetically motivated, they offer no insight into the genealogical history underlying the sample. Furthermore, these methodologies have rather severe computational requirements in that a probability distribution must be maintained on a (large) set of possible haplotypes. Indeed, the size of the haplotype pool, which reflects the diversity of the genome and its evolutionary history, is unknown for any given population data; thus we have a mixture model problem in which a key aspect of the inferential problem involves inference over the number of mixture components, i.e., the haplotypes. There is a plethora of combinatorial algorithms based on various deterministic hypotheses, such as parsimony principles, that offer control over the complexity of the inference problem [4, 18-20], and these methods have demonstrated effectiveness in certain settings and provided important insights into the problem (see Gusfield [21] for an excellent survey). But compared to the statistical approaches, they offer less flexibility and/or scalability in handling missing values, typing errors, evolution modeling, and more complex scenarios on the horizon in haplotype modeling (e.g., recombinations, genetic mapping, etc.). Most current statistical methods for haplotype inference bypass the issue of ancestral-space uncertainty via an ad hoc specification of the number of haplotypes needed to account for the given genotypes, although certain coalescent-based models [14] or model-selection methods can partially address this issue.

Besides the ancestral-space uncertainty issue discussed above, the haplotype models developed so far are still limited in their flexibility and are inadequate for addressing many realistic problems. Consider for example a genetic demography study, in which one seeks to uncover ethnic- and/or geography-specific genetic patterns based on a sparse census of multiple populations. In particular, suppose that we are given a sample that can be divided into a set of subpopulations, e.g., African, Asian, and European. We may not only want to discover the sets of haplotypes within each subpopulation, but we may also wish to discover which haplotypes are shared between subpopulations, and what their frequencies are.
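As a concrete illustration of Eq. (2), the sketch below (with made-up haplotype frequencies, and using the deterministic indicator form of f mentioned in the parenthetical note above) computes p(g) for a small haplotype pool under the HWE factorization p(h_1, h_2) = p(h_1)p(h_2).

```python
from itertools import product

def compatible(h1, h2, g):
    """Deterministic f: 1.0 if the unordered allele pair at every site matches g
    (0/1 = homozygous, 2 = heterozygous), otherwise 0.0."""
    for a, b, gt in zip(h1, h2, g):
        if gt == 2 and a == b:                    # heterozygous site needs differing alleles
            return 0.0
        if gt in (0, 1) and (a != gt or b != gt): # homozygous site needs both alleles equal to gt
            return 0.0
    return 1.0

def genotype_probability(g, haplotype_freqs):
    """p(g) = sum over ordered pairs of p(h1) * p(h2) * f(g | h1, h2), assuming HWE."""
    total = 0.0
    for (h1, p1), (h2, p2) in product(haplotype_freqs.items(), repeat=2):
        total += p1 * p2 * compatible(h1, h2, g)
    return total

# Hypothetical 3-SNP haplotype pool with made-up population frequencies.
freqs = {(0, 0, 0): 0.4, (1, 1, 0): 0.3, (1, 0, 1): 0.2, (0, 1, 1): 0.1}
print(genotype_probability((2, 2, 0), freqs))   # 0.24: only (0,0,0) with (1,1,0) is compatible
```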
Empirical and theoretical evidence suggests that an early split of an ancestral population following a population bottleneck (e.g., due to sudden migration or environmental changes) may lead to ethnic-group-specific population diversity, which features both ancient haplotypes (that have high variability) shared among different ethnic groups, and modern haplotypes (that are more strictly conserved) uniquely present in different ethnic groups [22]. This structure is analogous to a hierarchical clustering setting in which different groups comprising

multiple clusters may share clusters with common centroids.

In this paper, we describe a new class of haplotype inference models based on a nonparametric Bayesian formalism built on the Dirichlet process (DP) [23, 24], which offers a well-founded statistical framework to tackle the problems discussed above more efficiently and accurately. As we discuss in the sequel, the Dirichlet process can induce a partition of an unbounded population in a way that is closely related to the Ewens sampling formula [25]; thus it can be viewed as an exchangeable approximation to the joint distribution of population haplotypes under a coalescent process. On the other hand, in the setting of mixture models, the Dirichlet process is able to capture uncertainty about the number of mixture components [26]. A hierarchical extension of the DP also leads to an elegant model that couples the demographic information in different populations for solving multi-population haplotype inference problems. Our model differs from the extant models in the following important ways: 1) Instead of resorting to ad hoc parametric assumptions or model selection over the number of population haplotypes, as in many parametric Bayesian models, we introduce a nonparametric prior over haplotype ancestors, which facilitates posterior inference of the haplotypes (and other genetic properties of interest) in an open state space that can accommodate arbitrary sample sizes. 2) Our model captures similar genetic features as those emphasized in Stephens et al. [9], including the parent-dependent-mutation property, but with an exchangeable likelihood function rather than an order-dependent one as in the PAC model. 3) The hierarchical Bayesian framework of our model explicitly captures ancestral/population structures and incorporates demographic/genetic parameters so that they can be inferred or estimated along with the haplotype phase from the given genotype data. 4) Our model can explicitly exploit the ethnic labels, and potentially latent sub-population structures, of the sample to improve haplotyping accuracy.

Some fragments of the technical aspects of the proposed model were announced previously in conferences in the machine learning community [27, 28], but to our knowledge the full statistical model and its population-genetic interpretations are new to the genetics community, and a computer program based on this model for haplotype reconstruction from large genotype data has not been available. In this paper, we describe this new nonparametric Bayesian approach to haplotype modeling in detail, and we present an efficient Monte Carlo algorithm, Haploi, for haplotype inference based on the proposed model, which is readily applicable to genotype sequences with thousands of SNPs, at a time-cost often at least two orders of magnitude less than that of the state-of-the-art PHASE program, with competitive and sometimes superior performance (mostly on long sequences). We also show that Haploi significantly outperforms several other extant haplotype inference algorithms on both simulated and realistic data. A C++ implementation of Haploi can be obtained from the authors upon request, and will soon be made public on the world wide web once interface and GUI development are completed.
2 The Statistical Model

As motivated in the introduction, it is desirable to explicitly explore the demographic characteristics, such as population structure and ethnic labels, and the genetic scenarios, such as coalescence and mutation, underlying the study populations, while performing haplotype inference on complex population samples. In the following, we present two novel nonparametric Bayesian models for

haplotype inference that serve this purpose. We begin with a basic model for the simplest demographic and genetic scenario, in which we ignore individual ethnic labels in the sample (as in most extant haplotyping methods) and assume absence of recombination in the sample. Then we generalize this model to a multi-population scenario. Finally, we deal with long genotypes with recombinations via an algorithmic approach motivated by the partition-ligation scheme [12].

2.1 A Dirichlet process mixture model for haplotypes

2.1.1 Dirichlet process mixture

Given a sample of n chromosomes, under neutrality and random-mating assumptions, the distribution of the genealogical trees of the sample can be approximated by that of a random tree known as the n-coalescent [29]. Additionally, on each lineage there is a point process of mutation events. The best understood, and also the simplest, instance of such mutation processes is the infinitely-many-alleles (IMA) model, in which each mutation in the lineage produces a novel type, independent of the parental allele. The IMA can be understood as an independent Poisson process with rate, say, α/2, which is determined by the size of the evolving population N (usually N >> n) and the per-generation mutation rate µ (i.e., α = 4Nµ). Although easy to analyze, the IMA is unrealistic because it fails to capture dependencies among haplotypes. On the other hand, to date no closed-form expression of P(H) is known for the more realistic parent-dependent mutation model under the n-coalescent; approximations such as the PAC model have been used as tractable surrogates. In the sequel, we describe an alternative approach to modeling P(H) based on a nonparametric Bayesian formalism known as the Dirichlet process, which leads to a new class of models for the haplotype distribution that approximately captures the major properties that would result from a coalescent-with-PDM model.

We begin with a brief genetic motivation of the proposed approach. Rather than focusing on a complete random genealogy up to the MRCA, we instead consider a sample of n individuals from a population characterized by an unknown set of founding haplotypes, with unknown founder frequencies. For now we focus attention on a small chromosomal region within which the possibility of recombination over relevant time-scales is negligible. As a consequence, we postulate that each individual's genotype is formed by drawing two random haplotype founders from an ancestral pool, one for each of the two actual haplotypes of this individual, which can be mutated versions of their corresponding founders. We further assume that we are given noisy observations of the resulting genotypes. Below we show how this setting relates to the coalescent-with-IMA and coalescent-with-PDM models.

Under Kingman's coalescent-with-IMA model, one can treat a haplotype from a modern individual as a descendant of a most recent common ancestor with unknown haplotype, via random mutations that alter the allelic states of some SNPs [29]. Hoppe [30] observed that a coalescent process in an infinite population leads to a partition of the population at every generation that can be succinctly captured by the following Pólya urn scheme. Consider an urn that at the outset contains a ball of a single color. At each step we either draw a ball from the urn and replace it with two balls of the same color, or we are given a ball of a new color which we place in the urn. One can see that such a scheme leads to a partition of

the balls according to their color. Mapping each ball to a haploid individual and each color to a possible haplotype, this partition is equivalent to the one resulting from the coalescent-with-IMA process [30], and the probability distribution of the resulting allele spectrum (the number of colors, i.e., haplotypes, with every possible number of representative balls, i.e., descendants) is captured by the well-known Ewens sampling formula [25].

Letting the parameter α define the probabilities of the two types of draws in the aforementioned Pólya urn scheme, and viewing each (distinct) color as a sample from Q_0 and each ball as a sample from Q, Blackwell and MacQueen [24] showed that this Pólya urn model yields samples whose distributions are the marginal probabilities of Q under the Dirichlet process [23]. Formally, a random probability measure Q is generated by a DP if for any measurable partition A_1, ..., A_k of the sample space (e.g., the partition of an unbounded haploid population according to common haplotype patterns), the vector of random probabilities Q(A_i) follows a Dirichlet distribution: (Q(A_1), ..., Q(A_k)) ~ Dir(αQ_0(A_1), ..., αQ_0(A_k)), where α denotes a scaling parameter and Q_0 denotes a base measure. The Pólya urn construction of the DP makes explicit an order-independent sequential sampling scheme for drawing samples from a DP. Specifically, having observed n samples with values (φ_1, ..., φ_n) from a Dirichlet process DP(α, Q_0), the distribution of the value of the (n+1)-th sample is given by:

φ_{n+1} | φ_1, ..., φ_n, α, Q_0  ~  Σ_{k=1}^{K} [n_k / (n + α)] δ_{φ*_k}(·) + [α / (n + α)] Q_0(·),    (3)

where δ_{φ*_k}(·) denotes a point mass at a unique value φ*_k, n_k denotes the number of samples with value φ*_k, and K denotes the number of unique values among the n samples drawn so far. This conditional distribution is useful for implementing Monte Carlo algorithms for haplotype inference under DP-based models. We will return to this point in the Appendix.

Under the DP described above, the sampled haplotypes follow an IMA model, meaning that all distinct haplotypes (i.e., ball colors) are independent. How can we take into account the fact that, in a real haplotype sample, one would expect some haplotypes to differ only slightly (i.e., at a few SNP loci) whereas others differ much more significantly, a phenomenon caused by possibly parent-dependent mutations? We now describe a DP mixture (DPM) model that approximates this effect. In the context of mixture models, one can associate common data centroids, i.e., haplotype founders rather than all distinct haplotypes, with the colors drawn from the Pólya urn model, and thereby define a clustering of the (possibly noisy) data {h_i} (e.g., modern haplotypes that are recognizable variants of their corresponding founders) via a likelihood function p(h_i | φ_i). As is obvious from Eq. (3), the samples (i.e., ball draws) {φ_i} from a DP (i.e., the urn) tend to concentrate themselves around some unique values {φ*_k} (i.e., colors); thus, conditioning on each such unique value φ*_k, we have a mixture component p(h_i | φ*_k) for the data. Such a mixture model is known as the DP mixture [26, 31]. Note that a DP mixture requires no prior specification of the number of components, which is typically unknown in genetic demography problems. It is important to emphasize that here the DP is used as a prior distribution over mixture components.
Multiplying this prior by a likelihood that relates the mixture components to the actual data yields a posterior distribution of the mixture components, and the design of the likelihood function is completely up to the

modeler based on the specific problem. This nonparametric Bayesian formalism forms the technical foundation of the haplotype modeling and inference algorithms developed in this paper.

2.1.2 DPM for haplotype inference

Now we briefly recapitulate the basic DPM for haplotypes first proposed in Xing et al. [27]. In the next section we generalize this model to multi-population haplotypes, and describe specific Bayesian treatments of all relevant model parameters.

Write H_{i_e} = [H_{i_e,1}, ..., H_{i_e,T}], where the sub-subscript e ∈ {0, 1} denotes the two possible parental origins (i.e., paternal and maternal), for a haplotype over T contiguous SNPs from individual i; and let G_i = [G_{i,1}, ..., G_{i,T}] denote the genotype of these SNPs for individual i. For diploid organisms such as humans, we denote the two alleles of a SNP by 0 and 1; thus each G_{i,t} can take on one of four values: 0 or 1, indicating a homozygous site; 2, indicating a heterozygous site; and ?, indicating missing data. (A generalization to polymorphisms with k-ary alleles is straightforward, but omitted here for simplicity.) Let A_k = [A_{k,1}, ..., A_{k,T}] denote an ancestor haplotype (indexed by k) and θ_k denote the mutation rate of ancestor k; and let C_{i_e} denote an inheritance variable that specifies the ancestor of haplotype H_{i_e}. We write P_h(H | A) for the inheritance model according to which individual haplotypes are derived from a founder, and P_g(G | H_0, H_1) for the genotyping model via which noisy observations of the genotypes are related to the true haplotypes.

Under a DP mixture, we have the following Pólya urn scheme for sampling the genotypes G_i, i = 1, ..., n, of a sample with n individuals (here the 2n haplotypes are indexed by a single running index for notational simplicity):

Draw the first haplotype:
  {a_1, θ_1} ~ Q_0(·), sample the first founder (and its associated mutation rate);
  h_1 ~ P_h(· | a_1, θ_1), sample the first haplotype from an inheritance model defined on the first founder.

For each subsequent haplotype, sample the founder indicator for the i-th haplotype:
  p(c_i = c_j for some j < i | c_1, ..., c_{i-1}) = n_{c_j} / (i - 1 + α),
  p(c_i ≠ c_j for all j < i | c_1, ..., c_{i-1}) = α / (i - 1 + α),
where n_c is the occupancy number of class c, i.e., the number of previous samples generated from founder a_c; then sample the founder of haplotype i (indexed by c_i):
  φ_{c_i} = {a_{c_j}, θ_{c_j}} if c_i = c_j for some j < i (i.e., c_i refers to an inherited founder),
  φ_{c_i} ~ Q_0(a, θ) if c_i ≠ c_j for all j < i (i.e., c_i refers to a new founder);

sample the haplotype according to its founder:
  h_i | c_i ~ P_h(· | a_{c_i}, θ_{c_i});
and sample all genotypes (according to the one-to-one mapping between the running haplotype index used above and the individual and parental-origin indices (i, e)):
  g_i | h_{i_0}, h_{i_1} ~ P_g(· | h_{i_0}, h_{i_1}).

Under appropriate parameterizations of the base measure Q_0, the inheritance model P_h, and the genotyping model P_g, which will be described in detail shortly, the problem of phasing individual haplotypes and estimating the size and configuration of the latent ancestral pool under a DPM model can be solved via posterior inference given the genotypes from a (presumably ethnically homogeneous) population descending from a single pool of ancestors, using, for example, a Gibbs sampler as we outline in the Appendix.

As mentioned earlier, treating the haplotype distribution as a mixture model, where the set of mixture components corresponds to the pool of ancestral haplotypes, or founders, of the population, can be justified by straightforward statistical genetics arguments. Crucially, however, the size of this pool is unknown; indeed, knowing the size of the pool would correspond to knowing something significant about the genome and its history. In most practical population genetic problems, the detailed genealogical structure of a population (as provided by the coalescent trees) is usually of less importance than population-level features such as the pattern of common ancestral alleles (i.e., founders) in a population bottleneck, the age of such alleles, etc. In this case, the DP mixture offers a principled approach to generalizing the finite mixture model for haplotypes to an infinite mixture model that captures uncertainty regarding the size of the ancestral haplotype pool, and at the same time it provides a reasonable approximation to the coalescent-with-PDM model by utilizing the partition structure resulting therefrom, while allowing further mutations within each partite to introduce further diversity among descendants of the same founder.

2.2 A Hierarchical DP mixture model for multi-population haplotypes

Now we consider the case in which there exist multiple sample populations (e.g., ethnic groups), each modeled by a distinct DP mixture. Note that now we have multiple ancestor pools, one for each attendant population; instead of modeling these populations independently, we place all the population-specific DPMs under a common prior, so that the ancestors (i.e., mixture components) in any of the population-specific mixtures can be shared across all the mixtures, while the weight of an ancestral haplotype in each mixture is unique. Genetically, this means that every possible ancestral haplotype can either be a founder in only one of the populations, or be a founder shared in some or all attendant populations; in the latter case, the frequencies with which this haplotype founder is inherited differ across populations. To tie the population-specific DP mixtures together in this way, we use a hierarchical DP mixture model [32], in which the base measure associated with each population-specific DP mixture is itself drawn from a higher-level Dirichlet process DP(γ, F). Since a draw from this higher-level DP is a discrete measure with probability 1, the atoms drawn by different population-specific DPs from DP(γ, F), i.e., the haplotype founders and their mutation rates φ_k = {A_k, θ_k}, which are used as the

mixture components in each of the population-specific DP mixtures, are not all going to be distinct (i.e., the same (A_k, θ_k) can be drawn by two different population-specific DPMs). This allows sharing of components across different mixture models.

2.2.1 Hierarchical Dirichlet Process

Before presenting the HDP mixture for haplotypes, we digress with a brief description of the HDP formalism. As with the DP, it is useful to describe the marginals induced by an HDP using the more concrete representation of Pólya urn models. Imagine we set up a single stock urn at the top level, which contains balls of colors that are represented by at least one ball in one or multiple urns at the bottom level. At the bottom level, we have a set of distinct urns which are used to define the DP mixture for each population. Now suppose that, upon drawing the m_j-th ball for urn j at the bottom, the stock urn contains n balls of K distinct colors indexed by the integer set C = {1, 2, ..., K}. We either draw a ball randomly from urn j and place back two balls of that color, or with some probability we return to the top level. From the stock urn, we can either draw a ball randomly and put back two balls of that color in the stock urn and one in urn j, or, with probability γ/(n + γ), obtain a ball of a new color K + 1 and put back a ball of this color in both the stock urn and urn j at the lower level. Essentially, we have a master DP (the top urn) that serves as a source of atoms for J child DPs (bottom urns).

Associating each color k with a random variable φ*_k whose value is drawn from the base measure F, and recalling our discussion in the previous section, we know that draws from the stock urn can be viewed as marginals from a random measure Q_0 distributed as a Dirichlet process with parameters (γ, F). From Eq. (3), for n random draws φ = {φ_1, ..., φ_n} from Q_0, the conditional prior for (φ_n | φ_{-n}), where the subscript -n denotes the index set of all but the n-th ball, is

φ_n | φ_{-n}  ~  Σ_{k=1}^{K} [n_k / (n - 1 + γ)] δ_{φ*_k}(φ_n) + [γ / (n - 1 + γ)] F(φ_n),    (4)

where φ*_k, k = 1, ..., K, denote the K distinct values (i.e., colors) of φ (i.e., of all the balls in the stock urn), and n_k denotes the number of balls of color k in the top urn. Conditioning on Q_0 (i.e., using Q_0 as an atomic base measure for each of the DPs corresponding to the bottom-level urns), the m_j-th draw from the j-th bottom-level urn is also distributed as a marginal under a Dirichlet measure:

φ_{m_j} | φ_{-m_j}  ~  Σ_{k=1}^{K} [ (m_{j,k} + τ n_k/(n - 1 + γ)) / (m_j - 1 + τ) ] δ_{φ*_k}(φ_{m_j}) + [ τ/(m_j - 1 + τ) ][ γ/(n - 1 + γ) ] F(φ_{m_j})
            = Σ_{k=1}^{K} π_{j,k} δ_{φ*_k}(φ_{m_j}) + π_{j,K+1} F(φ_{m_j}),    (5)

where π_{j,k} := (m_{j,k} + τ n_k/(n - 1 + γ)) / (m_j - 1 + τ), π_{j,K+1} := [τ/(m_j - 1 + τ)][γ/(n - 1 + γ)], and m_{j,k} denotes the number of balls of color k in the j-th bottom urn. In our case, π_j = (π_{j,1}, π_{j,2}, ...) gives the a priori frequencies (i.e., mixture weights) of the haplotype founders in population j.
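The nested urn scheme above can be summarized in a few lines of Python. The sketch below is only an illustration of the sharing behavior behind Eqs. (4)-(5), not the sampler used by Haploi; the function name, concentration values, and number of draws are ours, chosen for illustration. It shows how founder "colors" drawn through the shared stock urn end up shared across populations while each population maintains its own founder frequencies.

```python
import random
from collections import Counter

def sample_hdp_urns(num_pops, draws_per_pop, gamma=1.0, tau=1.0, seed=0):
    """Simulate the two-level Polya urn scheme: a top-level 'stock' urn shared
    by all populations and one bottom-level urn per population.  Colors stand
    for haplotype founders; returns the per-population color counts."""
    rng = random.Random(seed)
    stock = Counter()                       # color -> number of balls in the stock urn
    pops = [Counter() for _ in range(num_pops)]
    next_color = 0

    def draw_from_stock():
        nonlocal next_color
        n = sum(stock.values())
        if rng.random() < gamma / (n + gamma):       # brand-new founder color
            color = next_color
            next_color += 1
        else:                                        # existing color, proportional to stock counts
            colors, weights = zip(*stock.items())
            color = rng.choices(colors, weights=weights)[0]
        stock[color] += 1
        return color

    for urn in pops:
        for _ in range(draws_per_pop):
            m = sum(urn.values())
            if m == 0 or rng.random() < tau / (m + tau):
                color = draw_from_stock()            # back off to the shared stock urn
            else:                                    # reuse a founder already in this urn
                colors, weights = zip(*urn.items())
                color = rng.choices(colors, weights=weights)[0]
            urn[color] += 1
    return pops

for j, urn in enumerate(sample_hdp_urns(num_pops=3, draws_per_pop=40)):
    print("population", j, dict(urn))   # founders shared across urns, with different weights
```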

Figure 1: The haplotype-genotype generative process under the HDPM, illustrated by an example concerning three populations. At the first level, all haplotype founders of the different populations are drawn from a common pool via a Pólya urn scheme, which leads to the following effects: 1) the same founder can be drawn by multiple populations (e.g., the red founder in populations 1 and 2, and the blue one in populations 1 and 3), or by only a single population (e.g., the grey founder in population 1); 2) shared founders can have different frequencies of being inherited. At the second level, individual haplotypes are drawn from a population-specific founder pool, also via a Pólya urn scheme, but this time through an inheritance model P_h(· | a_k) that allows mutations with respect to the founders, as indicated by the underscores at the mutated loci in the individual haplotypes. Finally, the genotype observations are related to the haplotype pairs of every individual via a noisy channel P_g(·).

2.2.2 HDPM for multi-population haplotype inference

Using the HDP construction described above, we now define an HDP mixture model for the genotypes in J populations. Elaborating on the notational scheme used earlier, let G^{(j)}_i = [G^{(j)}_{i,1}, ..., G^{(j)}_{i,T}] denote the genotype of T contiguous SNPs of individual i from ethnic group j; and let H^{(j)}_{i_e} = [H^{(j)}_{i_e,1}, ..., H^{(j)}_{i_e,T}] denote a haplotype of individual i from ethnic group j. The basic generative structure of multi-population genotypes under an HDPM is as follows, and is also illustrated graphically in Figure 1:

  Q_0 ≡ (φ_1, φ_2, ...) | γ, F ~ DP(γ, F), sample a DP of founders for all populations;
  Q_j ≡ (φ^{(j)}_1, φ^{(j)}_2, ...) | τ, Q_0 ~ DP(τ, Q_0), sample the DP of founders for each population;
  φ^{(j)}_{i_e} | Q_j ~ Q_j, sample the founder of haplotype i_e in population j;
  h^{(j)}_{i_e} | φ^{(j)}_{i_e} ~ P_h(· | φ^{(j)}_{i_e}), sample haplotype i_e in population j;
  g^{(j)}_i | h^{(j)}_{i_0}, h^{(j)}_{i_1} ~ P_g(· | h^{(j)}_{i_0}, h^{(j)}_{i_1}), sample genotype i in population j,

where in practice the first three sampling steps follow the nested Pólya urn scheme described above. Note that in the HDP the base measure of each lower-level DP is a draw from the root DP(γ, F). From this description, it is apparent that the totality of atomic samples, i.e., all instantiated haplotype founders and their associated mutation rates drawn from the base measure F, forms the common support (i.e., the candidate founder patterns) of both the root DP and all the population-specific DPs. The child DPs place different mass distributions, i.e., a priori frequencies of the haplotype founders, on this common support, in a population-specific fashion.

Recall that the base measure F in the above generative process is defined as a distribution from which haplotype founders φ_k := {A_k, θ_k} are drawn. Thus it is a joint measure on both A and θ. We let F(A, θ) = p(A)p(θ), where p(A) is uniform over all possible haplotypes and p(θ) is a beta distribution introducing a prior belief of a low PDM mutation rate. For generality, we assume each A_{k,t} (and also each H_{i,t}) takes its value from an allele set P. For the other building blocks of the HDPM model, we propose the following specifications.

Haplotype inheritance model: Omitting all but the locus index t, we define our inheritance model to be a single-locus mutation model as follows [27]:

P_h(h_t | a_t, θ) = θ^{I(h_t = a_t)} [ (1 - θ) / (|P| - 1) ]^{I(h_t ≠ a_t)},    (6)

where I(·) is the indicator function. This model corresponds to a star genealogy resulting from infrequent mutations over a shared ancestor, and is widely used as an approximation to a full coalescent genealogy starting from the shared ancestor (e.g., Liu et al. [33]). Given this inheritance model, it can be shown that the marginal conditional distribution of a haplotype sample h = {h_{i_e} : e ∈ {0, 1}, i ∈ {1, 2, ..., I}} takes the following form, resulting from an integration over θ in the joint conditional:

p(h | a, c) = Π_{k=1}^{K} R(α_h, β_h) [ Γ(α_h + l_k) Γ(β_h + l'_k) / Γ(α_h + β_h + l_k + l'_k) ] (1 / (|P| - 1))^{l'_k},    (7)

where R(α_h, β_h) = Γ(α_h + β_h) / [Γ(α_h) Γ(β_h)], l_k = Σ_{i,e,t} I(h_{i_e,t} = a_{k,t}) I(c_{i_e} = k) is the number of alleles that are identical to the corresponding ancestral alleles, and l'_k = Σ_{i,e,t} I(h_{i_e,t} ≠ a_{k,t}) I(c_{i_e} = k) is the total number of mutated alleles.

Genotype observation model: Next, we assume that the observed genotype at a locus is determined by the paternal and maternal alleles at this site via the following genotyping model [27]:

P_g(g | h_{i_0,t}, h_{i_1,t}, ξ) = ξ^{I(h = g)} [µ_1 (1 - ξ)]^{I(h ≠_1 g)} [µ_2 (1 - ξ)]^{I(h ≠_2 g)},    (8)

where h ≜ h_{i_0,t} ⊕ h_{i_1,t} denotes the unordered pair of the two actual SNP allele instances at locus t; ≠_1 denotes a set difference of exactly one element; ≠_2 denotes a set difference of both elements; and µ_1 and µ_2 are appropriately defined normalizing constants. Again, we place a beta prior Beta(α_g, β_g) on ξ for smoothing.
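To make the two emission models concrete, the following sketch evaluates Eq. (6) and a simplified version of Eq. (8) for binary SNPs. It is illustrative only: the parameter values are arbitrary, Eq. (6) is written here under the convention that θ is the probability of copying the ancestral allele faithfully, and the normalizers µ_1 and µ_2 of Eq. (8) are replaced by the simplifying assumption that the error mass 1 - ξ is spread uniformly over the incorrect genotype values.

```python
def p_h(h_t, a_t, theta, num_alleles=2):
    """Single-locus inheritance model (Eq. 6): the ancestral allele a_t is copied
    with probability theta; otherwise a different allele is chosen uniformly from
    the remaining num_alleles - 1 possibilities."""
    return theta if h_t == a_t else (1.0 - theta) / (num_alleles - 1)

def p_g(g_t, h0_t, h1_t, xi):
    """Genotyping model in the spirit of Eq. (8): the observed genotype code g_t
    (0 or 1 = homozygous, 2 = heterozygous) matches the genotype implied by the
    true allele pair (h0_t, h1_t) with probability xi; as a simplifying stand-in
    for mu_1 and mu_2, the error mass 1 - xi is spread uniformly over the two
    incorrect genotype values."""
    true_g = h0_t if h0_t == h1_t else 2
    return xi if g_t == true_g else (1.0 - xi) / 2.0

# Toy 4-SNP example: both haplotypes of one individual descend from the same
# founder here (in the model each haplotype carries its own founder indicator).
founder = [1, 1, 0, 0]
hap0, hap1 = [1, 1, 0, 0], [1, 0, 0, 0]      # hap1 carries one mutation at site 2
genotype = [1, 2, 0, 0]

likelihood = 1.0
for t in range(len(founder)):
    likelihood *= p_h(hap0[t], founder[t], theta=0.95)
    likelihood *= p_h(hap1[t], founder[t], theta=0.95)
    likelihood *= p_g(genotype[t], hap0[t], hap1[t], xi=0.98)
print(likelihood)   # product of per-site inheritance and genotyping terms
```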

Under the above model specifications, it is standard to derive the posterior distribution of each haplotype H_{i_e} given all other haplotypes and all genotypes, and the posterior of any missing genotypes, by integrating out the parameters θ or ξ and resorting to Bayes' theorem, which enables collapsed Gibbs sampling steps where necessary. For simplicity, we omit the details.

Hyperprior for coalescent rate: Lastly, to capture uncertainty over the scaling parameter γ (or α in the single-layer DPM model), which is twice the mutation rate in the coalescent over the haplotype founders, we use a vague inverse Gamma prior:

p(γ^{-1}) ~ G(1, 1), i.e., p(γ) ∝ γ^{-2} exp(-1/γ).    (9)

Under this prior, the posterior distribution of γ depends only on the number of instances n and the number of components K, but not on how the samples are distributed among the components:

p(γ | k, n) ∝ γ^{k-2} exp(-1/γ) Γ(γ) / Γ(n + γ).    (10)

The distribution p(log(γ) | k, n) is log-concave, so we may efficiently generate independent samples from it using adaptive rejection sampling [34]. It is noteworthy that in an HDPM we also need to define vague inverse Gamma priors for the scaling parameters of the population-specific DPs at the bottom level. We use a single concentration parameter τ for these DPs; it is also possible to allow separate concentration parameters for each of the lower-level DPs, possibly tied distributionally via a common hyperparameter.

2.2.3 Posterior inference via Gibbs sampling

Putting everything together, we have constructed an HDPM model for multi-population haplotypes. The two-level nested Pólya urn scheme described above motivates an efficient and easy-to-implement Gibbs sampling algorithm for sampling from the posterior associated with the HDPM. Specifically, under a collapsed Gibbs sampling scheme where θ and ξ are integrated out, the variables of interest are C^{(j)}_{i_e}, A_{k,t}, H^{(j)}_{i_e,t}, γ, and τ, for all i, j, k, t, e. The sampler alternates between three coupled stages. First, it samples the scaling parameters γ and τ of the DPs, following the predictive distribution given by Eq. (10). Then, it samples the c^{(j)}_{i_e}'s and a_{k,t}'s given the current values of the hidden haplotypes and the scaling parameters. Finally, given the current state of the ancestral pool and the ancestor assignment for each individual, it samples the h^{(j)}_{i_e,t} variables. Due to space limits, details of the forms and derivations of the predictive distributions used in the second and third stages are given in the online supplemental material.
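As an illustration of the first stage of the sampler (the hyperparameter update), the sketch below evaluates the unnormalized posterior of Eq. (10) and draws γ from it. The report uses adaptive rejection sampling on log(γ); the grid-based draw here is merely a simple stand-in for illustration, and the values of k and n in the example are arbitrary.

```python
import math
import random

def log_post_gamma(gamma, k, n):
    """Unnormalized log posterior of the DP scale gamma given k components among
    n samples (Eq. 10): gamma^(k-2) * exp(-1/gamma) * Gamma(gamma) / Gamma(n + gamma)."""
    return ((k - 2) * math.log(gamma) - 1.0 / gamma
            + math.lgamma(gamma) - math.lgamma(n + gamma))

def sample_gamma(k, n, grid=None, rng=random):
    """Draw gamma from its posterior by discretizing it on a grid and sampling
    from the normalized weights; a crude but adequate stand-in for adaptive
    rejection sampling for the purpose of illustration."""
    if grid is None:
        grid = [0.01 * i for i in range(1, 2001)]       # gamma in (0, 20]
    logw = [log_post_gamma(g, k, n) for g in grid]
    m = max(logw)
    weights = [math.exp(lw - m) for lw in logw]          # stabilize before exponentiating
    return rng.choices(grid, weights=weights)[0]

# Example: 17 founders recovered from 200 haplotypes (100 diploid individuals).
print(sample_gamma(k=17, n=200))
```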

3 Partition-ligation and the Haploi program

As for most haplotype inference models proposed in the literature, the state space of the proposed HDPM model scales exponentially with the length of the genotype sequence, and therefore it cannot be directly applied to genotype data containing hundreds or thousands of SNPs. To deal with haplotypes with a large number of linked SNPs, Niu et al. [12] proposed a divide-and-conquer heuristic known as Partition-Ligation (PL), which has been adopted by a number of haplotype inference algorithms including PL-EM [35], PHASE [9, 36], and CHB [14]. We equipped our haplotype inference algorithm based on the HDPM model with a variant of the PL heuristic, and present a new tool, Haploi, for haplotype inference of either single- or multiple-population genotype data containing thousands of SNPs. The original PL scheme in Niu et al. [12] works on disjoint blocks and then recursively ligates the phased blocks into larger (non-overlapping) haplotypes via Gibbs sampling in the product space of all the atomistic haplotypes of every attendant pair of blocks to be ligated, under a fixed-dimensional Dirichlet prior on the frequencies of the ligated haplotypes. In contrast, our PL scheme generates partially overlapping intermediate blocks from smaller blocks phased at the lower level, and pairs of overlapping blocks are recursively merged into larger ones by leveraging the redundancy of information in the overlapping regions, as well as an overall parsimony criterion. Empirically we found that this strategy can lead to a significant reduction of the size of the haplotype search space for long genotypes, such as those with thousands of SNPs, and therefore facilitates a more efficient inference algorithm. Again, due to space limits, details of this algorithm are available in the supplemental material.

4 Results

In this section, we present a comparison of Haploi with PHASE [9, 36], fastPHASE [37], HAPLOTYPER 1.0 [12], and CHB 1.0 [14]. We ran each program using its default parameter settings. Three different error measures were used for evaluation: err_s, the ratio of incorrectly phased SNP sites over all non-trivial heterozygous SNPs (excluding individuals with a single heterozygous SNP); err_i, the ratio of incorrectly phased individuals over all non-trivial heterozygous individuals (i.e., those with at least two heterozygous SNPs); and d_w, the switch distance, which is the number of phase flips required to correct the predicted haplotypes, counted over all non-trivial heterozygous SNPs. Both short (about 100 SNPs) and long sequences were tested when the program permits. For short SNP sequences, we primarily use err_s and err_i; for long sequences we compare d_w, according to common practice, since it is regarded as a more sensible indicator of performance for longer SNP sequences. On short SNP sequences, we test on a large number of samples and report summary statistics of errors over all samples; for long SNP sequences, we present the error on each of the samples (of different lengths) we tested, due to the heavy computational cost. (Specifically, we applied fastPHASE only to long SNP sequences, since it is tailored for fast processing of long sequences at the cost of reduced accuracy; and we applied HAPLOTYPER 1.0 and CHB 1.0 only to short SNP sequences, since they cannot reach convergence within an acceptable test time (> 4 hours) on samples with more than 200 SNPs.)
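For concreteness, the following sketch computes the per-individual quantities behind the three measures: the number of incorrectly phased heterozygous sites (contributing to err_s), whether the individual is phased incorrectly at all (contributing to err_i), and the number of switch errors (contributing to d_w). It reflects one common reading of these measures; the exact conventions used in the experiments may differ in minor details, and the function names are ours.

```python
def phase_orientations(true_pair, pred_pair):
    """For each heterozygous site, record whether the predicted phase agrees (0)
    or is flipped (1) relative to the first true haplotype.  Assumes the
    prediction is consistent with the genotype at heterozygous sites."""
    (t1, t2), (p1, p2) = true_pair, pred_pair
    return [0 if p1[s] == t1[s] else 1
            for s in range(len(t1)) if t1[s] != t2[s]]

def individual_errors(true_pair, pred_pair):
    """Return (site_errors, num_het_sites, indiv_error, switches, switch_opps)
    for one non-trivial individual (at least two heterozygous sites)."""
    o = phase_orientations(true_pair, pred_pair)
    if len(o) < 2:
        return None                               # trivial individual: excluded
    site_err = min(o.count(0), o.count(1))        # best of the two global alignments
    indiv_err = int(site_err > 0)                 # any heterozygous site phased wrongly
    switches = sum(o[s] != o[s + 1] for s in range(len(o) - 1))
    return site_err, len(o), indiv_err, switches, len(o) - 1

# Toy example: one switch error between the 2nd and 3rd heterozygous sites.
true_pair = ([1, 1, 0, 0, 1], [0, 0, 1, 1, 1])
pred_pair = ([1, 1, 1, 1, 1], [0, 0, 0, 0, 1])
print(individual_errors(true_pair, pred_pair))    # (2, 4, 1, 1, 3)
```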

We have also estimated other population genetic metrics of interest, such as the haplotype frequencies, the mutation rates θ, and the number of reconstructed haplotype founders K, under the HDP and DP models. We will present some of these results to illustrate the consistency of our model (on simulation) and the characteristics of some of the real data sets being studied.

[Figure 2 panels: bar charts of err_s, err_i, and d_w for HDP, DP, PHASE, HAPLOTYPER, and CHB at θ = 0.01 and θ = 0.05.]

Figure 2: Performance on simulated datasets. Two kinds of datasets with different mutation rates θ were tested. Each dataset includes 100 individuals from 5 groups (20 from each). The sequence length was fixed to 100. The performance of each algorithm is represented in terms of err_s, err_i, and d_w. Each bar represents the average error rate across the 50 different datasets, with the standard deviation shown as a vertical line.

4.1 Simulated Multi-Population SNP Data

For simulation-based tests, we used a pool of haplotypes taken from the coalescent-based synthetic dataset in Stephens et al. [9], each containing 100 SNPs, as the hypothetical founders; and we drew each individual's haplotypes and genotype by randomly choosing two ancestors from these founders and applying the mutation and noisy genotyping models described in the methodology section. For each of our synthetic multi-population data sets, we simulated 100 individuals consisting of five populations, each with 20 individuals. Each population is derived from 5 founders, two of which are shared among all populations, while the other three are population-specific; thus the total number of founders across the five populations is 17. We test our algorithm on two data sets with different degrees of sequence diversity. In the conserved data set, we assume the mutation rate θ to be 0.01 for all populations and all loci; in the diverse data set, θ is set to be 0.05. All populations and loci are assumed to have the same genotyping error rate. Fifty random samples were drawn from both the conserved and the diverse settings. (Our simulation scheme assumes a star genealogy with a uniform mutation rate for each sub-population. This scheme is simple to implement under our multi-population scenario for testing the bias/variance of a number of estimators of interest, and it reasonably approximates the genetic demography of samples under an IMA model on a coalescent tree. We have also tested an early version of Haploi, which employed a DPM (rather than an HDPM) without partition-ligation, on the single-population, coalescent-based simulated data used in Stephens et al. [9], in comparison with PHASE (see Xing et al. [27]), and the results were similar.)
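A minimal sketch of this simulation scheme is given below. The founder patterns, haplotype length, and rates used here are placeholders (the actual experiments use 100-SNP founders taken from the coalescent-based data of Stephens et al. [9], with the mutation and genotyping models of Section 2).

```python
import random

def simulate_population_data(founders, n_individuals=20, theta=0.01,
                             geno_err=0.01, seed=0):
    """Toy version of the simulation scheme: each individual's two haplotypes are
    copied from two founders chosen uniformly from the population's founder pool,
    each site flipping with probability theta, and the genotype is then observed
    with a per-site error probability geno_err.  Parameter values are placeholders."""
    rng = random.Random(seed)

    def mutate(founder):
        return [a ^ 1 if rng.random() < theta else a for a in founder]

    data = []
    for _ in range(n_individuals):
        h0 = mutate(rng.choice(founders))
        h1 = mutate(rng.choice(founders))
        g = []
        for a, b in zip(h0, h1):
            obs = a if a == b else 2
            if rng.random() < geno_err:                     # noisy genotyping
                obs = rng.choice([v for v in (0, 1, 2) if v != obs])
            g.append(obs)
        data.append((g, (h0, h1)))
    return data

# Five populations: 2 founders shared by all, plus 3 private founders each.
T = 10                                                      # short toy haplotypes
rng = random.Random(1)
shared = [[rng.randint(0, 1) for _ in range(T)] for _ in range(2)]
populations = []
for j in range(5):
    private = [[rng.randint(0, 1) for _ in range(T)] for _ in range(3)]
    populations.append(simulate_population_data(shared + private, seed=j))
print(len(populations), "populations,", len(populations[0]), "individuals each")
```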

4.1.1 Haplotype Accuracy

We compare Haploi, implemented based either on a DPM model or on an HDPM model (dubbed Haploi-DP and Haploi-HDP, respectively, when the distinction is needed; otherwise we simply use Haploi for Haploi-HDP), with extant phasing methods applied in two modes on the synthetic data sets. As mentioned in the introduction, given multi-population genotype data, an extant method can either adopt mode-I, pooling all populations together and jointly solving a single haplotype inference problem that ignores the population label of each individual, or follow mode-II, applying the algorithm to each population separately and solving multiple haplotype inference problems independently. Haploi-HDP takes a different approach, making explicit use of the population labels (if available) and jointly solving multiple coupled haplotype inference problems. (Note that when only a single population is concerned, or no population label is available, Haploi-HDP is still valid, and is equivalent to a baseline Haploi-DP with one more layer of DP hyper-prior.) We apply our method to the simulated multi-population data, compare its overall performance on the whole data with the outcomes of other algorithms run in mode-I, and compare its score on each population with the outcomes of extant methods run in mode-II.

Figure 2 summarizes the overall performance of Haploi on the 50 conserved simulated samples and the 50 diverse simulated samples, along with those of the reference algorithms run in mode-I. On the conserved samples, which are presumably easier to phase, Haploi-HDP outperforms all the other algorithms appreciably. On the diverse samples, which are more challenging to phase (due to more severe inconsistencies among individual haplotypes caused by the higher mutation rate), Haploi-HDP outperforms all other algorithms by a significant margin.

Table 1: Comparison of Haploi and other algorithms run in mode-II on the synthetic multi-population data. Results on the conserved data sets (θ = 0.01) and the diverse data sets (θ = 0.05) are shown in the upper and lower sections, respectively. In the table, column "pop" specifies the 5 attendant sub-populations; column "K" shows the estimated number of founders in each sub-population; the columns denoted err_s and err_i show the phase error measured by site- and individual-discrepancies, respectively; and the row "avg" shows the average errors over the 50 random samples of the test data set. The columns are grouped by method (Haploi-HDP, Haploi-DP, PHASE, HAPLOTYPER, CHB), with rows for populations (1)-(5) and "avg" in each of the two θ sections. [Table entries omitted.]

Haploi-HDP also dominates the other methods when the latter are run in mode-II on the simulated data.

Table 1 shows a comparison of the accuracy of Haploi-HDP on each sub-population (directly extracted from the results obtained in a single run of Haploi-HDP on all populations) in a simulated data set against the results from separate runs of the other algorithms on each sub-population of genotypes. Note that here K is expected to be 5 for each group, and the estimate given by Haploi-HDP is quite close to this value. On the conserved data set, CHB shows the best result, while all algorithms perform comparably. On the more difficult diverse dataset, the HDP approach outperformed the other algorithms and inferred the founders of each group more robustly than the group-specific runs using the baseline Haploi-DP (which employs independent DPs for each sub-population).

[Figure 3 panels: trajectories of K (top row) and θ (bottom row) over the sampling iterations, for the conserved (left) and diverse (right) datasets.]

Figure 3: Trajectories of K and θ along the iterations of HDP inference on one of the simulated datasets. The left panels show the results on the conserved dataset, and the right panels those on the diverse dataset.

4.1.2 MCMC and parameter estimation

Typically, the Pólya urn Gibbs sampler for Haploi converges within 100 iterations (Figure 3) on the synthetic data. This contrasts with the sampling algorithms used in some of the other haplotype inference programs, which typically need tens of thousands of samples to reach convergence. The fast convergence is possibly due to Haploi's ability to quickly infer the correct number of founding haplotypes underlying the genotype samples, which leads to a model significantly more compact (i.e., parsimonious) than those derived by other algorithms. In Figure 4 (a) and (b), we show the histogram of the estimated K, the number of recovered ancestors, across the 50 datasets for both of our algorithms. Recall that we expect K to be 17; the estimated K under both the DP and HDP models turns out to be very close to this number on the conserved datasets (i.e., those with a small mutation rate). On the diverse data sets, Haploi-HDP can still offer a good estimate of the number of ancestors, whereas Haploi-DP recovered more ancestors (around 25 on average) than


More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations

mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations Suyash Shringarpure and Eric P. Xing May 2008 CMU-ML-08-105 mstruct: Inference of Population Structure

More information

Tree-Based Inference for Dirichlet Process Mixtures

Tree-Based Inference for Dirichlet Process Mixtures Yang Xu Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, USA Katherine A. Heller Department of Engineering University of Cambridge Cambridge, UK Zoubin Ghahramani

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

Supporting Information Text S1

Supporting Information Text S1 Supporting Information Text S1 List of Supplementary Figures S1 The fraction of SNPs s where there is an excess of Neandertal derived alleles n over Denisova derived alleles d as a function of the derived

More information

Processes of Evolution

Processes of Evolution 15 Processes of Evolution Forces of Evolution Concept 15.4 Selection Can Be Stabilizing, Directional, or Disruptive Natural selection can act on quantitative traits in three ways: Stabilizing selection

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Modelling Genetic Variations with Fragmentation-Coagulation Processes

Modelling Genetic Variations with Fragmentation-Coagulation Processes Modelling Genetic Variations with Fragmentation-Coagulation Processes Yee Whye Teh, Charles Blundell, Lloyd Elliott Gatsby Computational Neuroscience Unit, UCL Genetic Variations in Populations Inferring

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

AUTHORIZATION TO LEND AND REPRODUCE THE THESIS. Date Jong Wha Joanne Joo, Author

AUTHORIZATION TO LEND AND REPRODUCE THE THESIS. Date Jong Wha Joanne Joo, Author AUTHORIZATION TO LEND AND REPRODUCE THE THESIS As the sole author of this thesis, I authorize Brown University to lend it to other institutions or individuals for the purpose of scholarly research. Date

More information

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology

More information

1. Understand the methods for analyzing population structure in genomes

1. Understand the methods for analyzing population structure in genomes MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population

More information

Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

More information

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

More information

Part IV: Monte Carlo and nonparametric Bayes

Part IV: Monte Carlo and nonparametric Bayes Part IV: Monte Carlo and nonparametric Bayes Outline Monte Carlo methods Nonparametric Bayesian models Outline Monte Carlo methods Nonparametric Bayesian models The Monte Carlo principle The expectation

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees: MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2 Elizabeth Thompson University of Washington Genetic mapping and linkage lod scores Monte Carlo likelihood and likelihood ratio estimation

More information

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data Na Li and Matthew Stephens July 25, 2003 Department of Biostatistics, University of Washington, Seattle, WA 98195

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Introduction to population genetics & evolution

Introduction to population genetics & evolution Introduction to population genetics & evolution Course Organization Exam dates: Feb 19 March 1st Has everybody registered? Did you get the email with the exam schedule Summer seminar: Hot topics in Bioinformatics

More information

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES Elizabeth A. Thompson Department of Statistics, University of Washington Box 354322, Seattle, WA 98195-4322, USA Email: thompson@stat.washington.edu This

More information

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES: .5. ESTIMATION OF HAPLOTYPE FREQUENCIES: Chapter - 8 For SNPs, alleles A j,b j at locus j there are 4 haplotypes: A A, A B, B A and B B frequencies q,q,q 3,q 4. Assume HWE at haplotype level. Only the

More information

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009 Gene Genealogies Coalescence Theory Annabelle Haudry Glasgow, July 2009 What could tell a gene genealogy? How much diversity in the population? Has the demographic size of the population changed? How?

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Challenges when applying stochastic models to reconstruct the demographic history of populations.

Challenges when applying stochastic models to reconstruct the demographic history of populations. Challenges when applying stochastic models to reconstruct the demographic history of populations. Willy Rodríguez Institut de Mathématiques de Toulouse October 11, 2017 Outline 1 Introduction 2 Inverse

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Bayesian Inference of Interactions and Associations

Bayesian Inference of Interactions and Associations Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin CHAPTER 1 1.2 The expected homozygosity, given allele

More information

The Combinatorial Interpretation of Formulas in Coalescent Theory

The Combinatorial Interpretation of Formulas in Coalescent Theory The Combinatorial Interpretation of Formulas in Coalescent Theory John L. Spouge National Center for Biotechnology Information NLM, NIH, DHHS spouge@ncbi.nlm.nih.gov Bldg. A, Rm. N 0 NCBI, NLM, NIH Bethesda

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Introduction to Advanced Population Genetics

Introduction to Advanced Population Genetics Introduction to Advanced Population Genetics Learning Objectives Describe the basic model of human evolutionary history Describe the key evolutionary forces How demography can influence the site frequency

More information

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model David B Dahl 1, Qianxing Mo 2 and Marina Vannucci 3 1 Texas A&M University, US 2 Memorial Sloan-Kettering

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Closed-form sampling formulas for the coalescent with recombination

Closed-form sampling formulas for the coalescent with recombination 0 / 21 Closed-form sampling formulas for the coalescent with recombination Yun S. Song CS Division and Department of Statistics University of California, Berkeley September 7, 2009 Joint work with Paul

More information

Bayesian analysis of the Hardy-Weinberg equilibrium model

Bayesian analysis of the Hardy-Weinberg equilibrium model Bayesian analysis of the Hardy-Weinberg equilibrium model Eduardo Gutiérrez Peña Department of Probability and Statistics IIMAS, UNAM 6 April, 2010 Outline Statistical Inference 1 Statistical Inference

More information

Similarity Measures and Clustering In Genetics

Similarity Measures and Clustering In Genetics Similarity Measures and Clustering In Genetics Daniel Lawson Heilbronn Institute for Mathematical Research School of mathematics University of Bristol www.paintmychromosomes.com Talk outline Introduction

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

MIXED MODELS THE GENERAL MIXED MODEL

MIXED MODELS THE GENERAL MIXED MODEL MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted

More information

Approximate Bayesian Computation

Approximate Bayesian Computation Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate

More information

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences Indian Academy of Sciences A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences ANALABHA BASU and PARTHA P. MAJUMDER*

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Big Idea #1: The process of evolution drives the diversity and unity of life

Big Idea #1: The process of evolution drives the diversity and unity of life BIG IDEA! Big Idea #1: The process of evolution drives the diversity and unity of life Key Terms for this section: emigration phenotype adaptation evolution phylogenetic tree adaptive radiation fertility

More information

Supporting Information

Supporting Information Supporting Information Hammer et al. 10.1073/pnas.1109300108 SI Materials and Methods Two-Population Model. Estimating demographic parameters. For each pair of sub-saharan African populations we consider

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

More information

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007 Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

More information