A Nonparametric Bayesian Approach for Haplotype Reconstruction from Single and Multi-Population Data


Eric P. Xing and Kyung-Ah Sohn
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
January 2007
CMU-ML-07-107

Abstract

Uncovering the haplotypes of single nucleotide polymorphisms (SNPs) and their population demography is essential for many biological and medical applications. Methods for haplotype inference developed thus far, including those based on approximate coalescence, finite mixtures, and maximal parsimony, often bypass issues such as the unknown complexity of the haplotype space and the demographic structures underlying multi-population genotype data. In this paper, we propose a new class of haplotype inference models based on a nonparametric Bayesian formalism built on the Dirichlet process, which represents a tractable surrogate to the coalescent process underlying population haplotypes and offers a well-founded statistical framework to tackle the aforementioned issues. Our proposed model, known as a hierarchical Dirichlet process mixture, is exchangeable, unbounded, and capable of coupling demographic information across populations for posterior inference of individual haplotypes, the size and configuration of haplotype ancestor pools, and other parameters of interest given genotype data. The resulting haplotype inference program, Haploi, is readily applicable to genotype sequences with thousands of SNPs, at a time-cost often two orders of magnitude less than that of the state-of-the-art PHASE program, with competitive and sometimes superior performance. Haploi also significantly outperforms several other extant algorithms on both simulated and realistic data.

Keywords: haplotype inference, Dirichlet process, hierarchical Dirichlet process, mixture model, population genetics

1 Introduction

Recent experimental advances have led to an explosion of data documenting genetic variation at the DNA level within and between populations. For example, the International SNP Map Working Group [1] has reported the identification and mapping of 1.4 million single nucleotide polymorphisms (SNPs) from the genomes of four different human populations. These kinds of data lead to challenging inference problems whose solutions could lead to greater understanding of the genetic basis of disease propensities and other complex traits [2,3].

SNPs represent the largest class of individual differences in DNA. A SNP refers to the existence of two specific nucleotides chosen from {A, C, G, T} at a single chromosomal locus in a population; each variant is called an allele. A haplotype refers to the joint allelic identities of a list of SNPs at contiguous sites in a local region of a single chromosome. Assuming no recombination in this local region, a haplotype is inherited as a unit. For diploid organisms (such as humans), each individual has two physical copies of each chromosome (except for the Y chromosome) in his/her somatic cells; one copy is inherited from the mother, and the other from the father. Thus, during each generation of inheritance, when chromosomes come in pairs, two haplotypes, for example h_1 = (1, 1, 0, 0) and h_2 = (0, 0, 1, 1) of a 4-locus region, go together to make up a genotype, which is the list of unordered pairs of alleles in the attendant region, e.g., g = (1/0, 1/0, 1/0, 1/0) in the case of the aforementioned two haplotypes. That is, a genotype is obtained from a pair of haplotypes by omitting the specification of the association of each allele with one of the two chromosomes, i.e., its phase. Indeed, phase is in general ambiguous when only the genotypes of a SNP sequence are given [4,5]. For example, given the g above, an alternative configuration of the haplotypes, h_1 = (1, 1, 1, 1) and h_2 = (0, 0, 0, 0), is also consistent with the genotype; but observing multiple genotypes in a population can help to bias the phase reconstruction toward the true haplotypes.

The problem of inferring SNP haplotypes from genotypes is essential for the understanding of genetic variation and linkage disequilibrium patterns in a population. For example, accurate inferences concerning population structures or quantitative trait locus maps usually demand the analysis of the genetic states of possibly non-recombinant segments of the subject's chromosome(s) [6]. Thus, it is advantageous to study haplotypes, which consist of several closely spaced (hence linked) phase-known SNPs and often prove to be more powerful discriminators of genetic variations within and among populations. Common biological methods for assaying genotypes typically do not provide phase information for individuals with heterozygous genotypes at multiple autosomal loci; phase can be obtained at a considerably higher cost via molecular haplotyping [7]. In addition to being costly, these methods are subject to experimental error and are low-throughput. Alternatively, phase can also be inferred from the genotypes of a subject's close relatives [5], but this approach is often hampered by the fact that typing family members increases the cost and does not guarantee full informativeness. It is therefore desirable to develop automatic and robust in silico methods for reconstructing haplotypes from genotypes and possibly other data sources (e.g., pedigrees).
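To make the phase ambiguity concrete, the short Python sketch below (illustrative only, and not part of the original report; the 0/1 allele coding and the 0/1/2 genotype coding anticipate the conventions introduced in Section 2) enumerates all haplotype pairs consistent with an unphased genotype.

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs (h1, h2) consistent with an
    unphased genotype, where each entry is 0 or 1 (homozygous) or 2
    (heterozygous).  Returns a sorted list of pairs of 0/1 tuples."""
    per_site = []
    for g in genotype:
        if g in (0, 1):
            per_site.append([(g, g)])           # homozygous site: phase is fixed
        elif g == 2:
            per_site.append([(0, 1), (1, 0)])   # heterozygous site: two orderings
        else:
            raise ValueError("unexpected genotype code: %r" % g)
    pairs = set()
    for combo in product(*per_site):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))      # treat (h1, h2) as an unordered pair
    return sorted(pairs)

# The 4-locus example from the text: g = (1/0, 1/0, 1/0, 1/0).
for h1, h2 in consistent_haplotype_pairs([2, 2, 2, 2]):
    print(h1, h2)
# Prints 2^(4-1) = 8 unordered pairs, among them (0,0,1,1)/(1,1,0,0) and
# (0,0,0,0)/(1,1,1,1), so the phase cannot be resolved from one genotype alone.
```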
Key to the inference of individual haplotypes based on a given genotype sample is the formulation and tractability of the marginal probability of the haplotypes of a study population. Consider the set of haplotypes, denoted as H = {h_1, h_2, ..., h_{2n}} (where h_i ∈ P^T, P denotes the allele space of the polymorphic markers, and T denotes the length of the marker sequence), of a random

sample of 2n chromosomes from n individuals taken from a population at stationarity of some inheritance process, e.g., an infinitely-many-allele (IMA) mutation model. Under common genetic arguments, the ancestral relationships amongst the sample back to its most recent common ancestor (MRCA) can be described by a genealogical tree, and computing P(H) involves a marginalization over all possible genealogical trees consistent with the sample, which is widely known to be intractable. As discussed in Stephens and Donnelly [8], if we write P(H) as a product of conditionals based on the chain rule, i.e.,

P(h_1, h_2, ..., h_{2n}) = P(h_1) P(h_2 | h_1) ... P(h_{2n} | h_1, ..., h_{2n-1}),    (1)

then the generation of a haplotype sample H can be viewed as a sequential process that draws one haplotype at a time, conditioning on all the previously drawn haplotypes, e.g., by introducing random mutations to the latter. (This is equivalent to sampling from a genealogy evolving in non-overlapping generations.) Therefore, one can develop a tractable approximation to P(H) by appropriately approximating the conditionals in Eq. (1). Stephens and Donnelly [8] suggested an approximation to P(h_i | h_1, ..., h_{i-1}) that captures, among several desirable genetic properties, the parent-dependent-mutation (PDM) property, by modeling h_i as the progeny of a randomly chosen existing haplotype through a geometrically distributed number of mutations. (The parent-dependent-mutation property posits that, in a sequential generation process of haplotypes, if the next haplotype does not exactly match an existing haplotype, it will tend to differ by a small number of mutations from an existing one, rather than be completely different.) This model, referred to as the PAC (for Product of Approximate Conditionals) model, forms the basis of the PHASE program [9], which has set the state-of-the-art benchmarks in haplotype inference.

However, one caveat of the PAC model, as acknowledged in Li and Stephens [10], is that it implicitly assumes the existence of an ordering in the haplotype sample; therefore the resulting likelihood does not enjoy the property of exchangeability that we would expect to be satisfied by the true P(H). Although empirically this pitfall appears to be inconsequential after some heuristic averaging over a moderate number of random orderings, it is difficult to associate this approximation with an explicit assumption about the population demography and genealogy underlying the sample. For example, the genealogy of haplotypes with possibly common ancestry is replaced by asymmetric pairwise relationships (induced by the conditional mutation model) between the haplotypes. The resulting flattening of the latent genealogical history makes it difficult to use the PAC method to discover and exploit latent demographic structures, such as estimating the number and pattern of prototypical haplotypes (i.e., founders), which may be indicative of genetic bottlenecks and divergence times of the study population, or to make use of the ethnic identities of the sample to improve haplotyping accuracy in multi-population haplotype inference.

The finite mixture models adopted by programs such as HAPLOTYPER represent another class of haplotype models that rely very little on demographic and genetic assumptions about the sample. Under such a model, haplotypes are treated as latent variables associated with specific frequencies, and the probability of a genotype is given by:

p(g) = Σ_{h_1, h_2 ∈ H} p(h_1, h_2) f(g | h_1, h_2),    (2)

where f(g | h_1, h_2) is a noisy channel relating the observed genotype to the unobserved true underlying haplotypes, and H denotes the set of all possible haplotypes of a given region. (A prevalent form of f in the literature is the deterministic indicator f = I(h_1 ⊕ h_2 = g), which simply requires that haplotypes h_1 and h_2 be consistent with g. More desirable forms of f would model errors in the genotyping or data recording process, a point we will return to later in the paper.) Under the assumption of Hardy-Weinberg equilibrium (HWE), an assumption that is standard in the literature and will also be made here, the mixing proportion p(h_1, h_2) is assumed to factor as p(h_1)p(h_2). Given this basic statistical structure, the haplotype inference problem can be viewed as a missing-value inference and parameter estimation problem. Numerous statistical models and inference approaches have been developed for this problem, such as maximum likelihood approaches via the EM algorithm [11, 15-17] and a number of parametric Bayesian inference methods based on Markov chain Monte Carlo (MCMC) techniques [12, 14].

The finite mixture model defines an exchangeable P(H). But since such models are data-driven rather than genetically motivated, they offer no insight into the genealogical history underlying the sample. Furthermore, these methodologies have rather severe computational requirements in that a probability distribution must be maintained on a (large) set of possible haplotypes. Indeed, the size of the haplotype pool, which reflects the diversity of the genome and its evolutionary history, is unknown for any given population data; thus we have a mixture model problem in which a key aspect of the inferential problem involves inference over the number of mixture components, i.e., the haplotypes. There is a plethora of combinatorial algorithms based on various deterministic hypotheses, such as parsimony principles, that offer control over the complexity of the inference problem [4, 18-20], and these methods have demonstrated effectiveness in certain settings and provided important insights into the problem (see Gusfield [21] for an excellent survey). But compared to the statistical approaches, they offer less flexibility and/or scalability in handling missing values, typing errors, evolution modeling, and more complex scenarios on the horizon in haplotype modeling (e.g., recombinations, genetic mapping, etc.). Most current statistical methods for haplotype inference bypass the issue of ancestral-space uncertainty via an ad hoc specification of the number of haplotypes needed to account for the given genotypes, although certain coalescent-based models [14] or model-selection methods can partially address this issue.

Besides the ancestral-space uncertainty issue discussed above, the haplotype models developed so far are still limited in their flexibility and are inadequate for addressing many realistic problems. Consider for example a genetic demography study, in which one seeks to uncover ethnic- and/or geography-specific genetic patterns based on a sparse census of multiple populations. In particular, suppose that we are given a sample that can be divided into a set of subpopulations, e.g., African, Asian, and European. We may not only want to discover the sets of haplotypes within each subpopulation, but we may also wish to discover which haplotypes are shared between subpopulations, and what their frequencies are.
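As a concrete illustration of Eq. (2), the sketch below (with made-up haplotype frequencies, and using the deterministic indicator form of f mentioned in the parenthetical note above) computes p(g) for a small haplotype pool under the HWE factorization p(h_1, h_2) = p(h_1)p(h_2).

```python
from itertools import product

def compatible(h1, h2, g):
    """Deterministic f: 1.0 if the unordered allele pair at every site matches g
    (0/1 = homozygous, 2 = heterozygous), otherwise 0.0."""
    for a, b, gt in zip(h1, h2, g):
        if gt == 2 and a == b:                    # heterozygous site needs differing alleles
            return 0.0
        if gt in (0, 1) and (a != gt or b != gt): # homozygous site needs both alleles equal to gt
            return 0.0
    return 1.0

def genotype_probability(g, haplotype_freqs):
    """p(g) = sum over ordered pairs of p(h1) * p(h2) * f(g | h1, h2), assuming HWE."""
    total = 0.0
    for (h1, p1), (h2, p2) in product(haplotype_freqs.items(), repeat=2):
        total += p1 * p2 * compatible(h1, h2, g)
    return total

# Hypothetical 3-SNP haplotype pool with made-up population frequencies.
freqs = {(0, 0, 0): 0.4, (1, 1, 0): 0.3, (1, 0, 1): 0.2, (0, 1, 1): 0.1}
print(genotype_probability((2, 2, 0), freqs))   # 0.24: only (0,0,0) with (1,1,0) is compatible
```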
Empirical and theoretical evidence suggests that an early split of an ancestral population following a population bottleneck (e.g., due to sudden migration or environmental changes) may lead to ethnic-group-specific population diversity, which features both ancient haplotypes (that have high variability) shared among different ethnic groups, and modern haplotypes (that are more strictly conserved) uniquely present in different ethnic groups [22]. This structure is analogous to a hierarchical clustering setting in which different groups comprising

multiple clusters may share clusters with common centroids.

In this paper, we describe a new class of haplotype inference models based on a nonparametric Bayesian formalism built on the Dirichlet process (DP) [23, 24], which offers a well-founded statistical framework to tackle the problems discussed above more efficiently and accurately. As we discuss in the sequel, the Dirichlet process can induce a partition of an unbounded population in a way that is closely related to the Ewens sampling formula [25]; thus it can be viewed as an exchangeable approximation to the joint distribution of population haplotypes under a coalescent process. On the other hand, in the setting of mixture models, the Dirichlet process is able to capture uncertainty about the number of mixture components [26]. A hierarchical extension of the DP also leads to an elegant model that couples the demographic information in different populations for solving multi-population haplotype inference problems. Our model differs from the extant models in the following important ways: 1) Instead of resorting to ad hoc parametric assumptions or model selection over the number of population haplotypes, as in many parametric Bayesian models, we introduce a nonparametric prior over haplotype ancestors, which facilitates posterior inference of the haplotypes (and other genetic properties of interest) in an open state space that can accommodate arbitrary sample sizes. 2) Our model captures similar genetic features as those emphasized in Stephens et al. [9], including the parent-dependent-mutation property, but with an exchangeable likelihood function rather than an order-dependent one as in the PAC model. 3) The hierarchical Bayesian framework of our model explicitly captures ancestral/population structures and incorporates demographic/genetic parameters so that they can be inferred or estimated along with the haplotype phase from the given genotype data. 4) Our model can explicitly exploit the ethnic labels, and potentially latent sub-population structures, of the sample to improve haplotyping accuracy.

Some fragments of the technical aspects of the proposed model were announced previously in conferences in the machine learning community [27, 28], but to our knowledge the full statistical model and its population-genetic interpretations are new to the genetics community, and a computer program based on this model for haplotype reconstruction from large genotype data has not been available. In this paper, we describe this new nonparametric Bayesian approach to haplotype modeling in detail, and we present an efficient Monte Carlo algorithm, Haploi, for haplotype inference based on the proposed model, which is readily applicable to genotype sequences with thousands of SNPs, at a time-cost often at least two orders of magnitude less than that of the state-of-the-art PHASE program, with competitive and sometimes superior performance (mostly on long sequences). We also show that Haploi significantly outperforms several other extant haplotype inference algorithms on both simulated and realistic data. A C++ implementation of Haploi can be obtained from the authors upon request, and will soon be made public on the world wide web once interface and GUI development are completed.
2 The Statistical Model

As motivated in the introduction, it is desirable to explicitly explore the demographic characteristics, such as population structure and ethnic labels, and the genetic scenarios, such as coalescence and mutation, underlying the study populations, while performing haplotype inference on complex population samples. In the following, we present two novel nonparametric Bayesian models for

haplotype inference that serve this purpose. We begin with a basic model for the simplest demographic and genetic scenario, in which we ignore individual ethnic labels in the sample (as in most extant haplotyping methods) and assume absence of recombination in the sample. Then we generalize this model to a multi-population scenario. Finally, we deal with long genotypes with recombinations via an algorithmic approach motivated by the partition-ligation scheme [12].

2.1 A Dirichlet process mixture model for haplotypes

2.1.1 Dirichlet process mixture

Given a sample of n chromosomes, under neutrality and random-mating assumptions, the distribution of the genealogical trees of the sample can be approximated by that of a random tree known as the n-coalescent [29]. Additionally, on each lineage there is a point process of mutation events. The best understood, and also the simplest, instance of such mutation processes is the infinitely-many-alleles (IMA) model, in which each mutation in the lineage produces a novel type, independent of the parental allele. The IMA can be understood as an independent Poisson process with rate, say, α/2, which is determined by the size of the evolving population N (usually N >> n) and the per-generation mutation rate µ (i.e., α = 4Nµ). Although easy to analyze, the IMA is unrealistic because it fails to capture dependencies among haplotypes. On the other hand, to date no closed-form expression of P(H) is known for the more realistic parent-dependent mutation model under the n-coalescent; approximations such as the PAC model have been used as tractable surrogates. In the sequel, we describe an alternative approach to modeling P(H) based on a nonparametric Bayesian formalism known as the Dirichlet process, which leads to a new class of models for the haplotype distribution that approximately captures the major properties that would result from a coalescent-with-PDM model.

We begin with a brief genetic motivation of the proposed approach. Rather than focusing on a complete random genealogy up to the MRCA, we instead consider a sample of n individuals from a population characterized by an unknown set of founding haplotypes, with unknown founder frequencies. For now we focus attention on a small chromosomal region within which the possibility of recombination over relevant time-scales is negligible. As a consequence, we postulate that each individual's genotype is formed by drawing two random haplotype founders from an ancestral pool, one for each of the two actual haplotypes of this individual, which can be mutated versions of their corresponding founders. We further assume that we are given noisy observations of the resulting genotypes. Below we show how this setting relates to the coalescent-with-IMA and coalescent-with-PDM models.

Under Kingman's coalescent-with-IMA model, one can treat a haplotype from a modern individual as a descendant of a most recent common ancestor with unknown haplotype, via random mutations that alter the allelic states of some SNPs [29]. Hoppe [30] observed that a coalescent process in an infinite population leads to a partition of the population at every generation that can be succinctly captured by the following Pólya urn scheme. Consider an urn that at the outset contains a ball of a single color. At each step we either draw a ball from the urn and replace it with two balls of the same color, or we are given a ball of a new color which we place in the urn. One can see that such a scheme leads to a partition of

the balls according to their color. Mapping each ball to a haploid individual and each color to a possible haplotype, this partition is equivalent to the one resulting from the coalescent-with-IMA process [30], and the probability distribution of the resulting allele spectrum (the number of colors, i.e., haplotypes, with every possible number of representative balls, i.e., descendants) is captured by the well-known Ewens sampling formula [25].

Letting the parameter α define the probabilities of the two types of draws in the aforementioned Pólya urn scheme, and viewing each (distinct) color as a sample from Q_0 and each ball as a sample from Q, Blackwell and MacQueen [24] showed that this Pólya urn model yields samples whose distributions are the marginal probabilities of Q under the Dirichlet process [23]. Formally, a random probability measure Q is generated by a DP if for any measurable partition A_1, ..., A_k of the sample space (e.g., the partition of an unbounded haploid population according to common haplotype patterns), the vector of random probabilities Q(A_i) follows a Dirichlet distribution: (Q(A_1), ..., Q(A_k)) ~ Dir(αQ_0(A_1), ..., αQ_0(A_k)), where α denotes a scaling parameter and Q_0 denotes a base measure. The Pólya urn construction of the DP makes explicit an order-independent sequential sampling scheme for drawing samples from a DP. Specifically, having observed n samples with values (φ_1, ..., φ_n) from a Dirichlet process DP(α, Q_0), the distribution of the value of the (n+1)-th sample is given by:

φ_{n+1} | φ_1, ..., φ_n, α, Q_0  ~  Σ_{k=1}^{K} [n_k / (n + α)] δ_{φ*_k}(·) + [α / (n + α)] Q_0(·),    (3)

where δ_{φ*_k}(·) denotes a point mass at a unique value φ*_k, n_k denotes the number of samples with value φ*_k, and K denotes the number of unique values among the n samples drawn so far. This conditional distribution is useful for implementing Monte Carlo algorithms for haplotype inference under DP-based models. We will return to this point in the Appendix.

Under the DP described above, the sampled haplotypes follow an IMA model, meaning that all distinct haplotypes (i.e., ball colors) are independent. How can we take into account the fact that, in a real haplotype sample, one would expect some haplotypes to differ only slightly (i.e., at a few SNP loci) whereas others differ much more significantly, a phenomenon caused by possibly parent-dependent mutations? We now describe a DP mixture (DPM) model that approximates this effect. In the context of mixture models, one can associate common data centroids, i.e., haplotype founders rather than all distinct haplotypes, with the colors drawn from the Pólya urn model, and thereby define a clustering of the (possibly noisy) data {h_i} (e.g., modern haplotypes that are recognizable variants of their corresponding founders) via a likelihood function p(h_i | φ_i). As is obvious from Eq. (3), the samples (i.e., ball draws) {φ_i} from a DP (i.e., the urn) tend to concentrate themselves around some unique values {φ*_k} (i.e., colors); thus, conditioning on each such unique value φ*_k, we have a mixture component p(h_i | φ*_k) for the data. Such a mixture model is known as the DP mixture [26, 31]. Note that a DP mixture requires no prior specification of the number of components, which is typically unknown in genetic demography problems. It is important to emphasize that here the DP is used as a prior distribution over mixture components.
Multiplying this prior by a likelihood that relates the mixture components to the actual data yields a posterior distribution of the mixture components, and the design of the likelihood function is completely up to the

modeler based on the specific problem. This nonparametric Bayesian formalism forms the technical foundation of the haplotype modeling and inference algorithms developed in this paper.

2.1.2 DPM for haplotype inference

Now we briefly recapitulate the basic DPM for haplotypes first proposed in Xing et al. [27]. In the next section we generalize this model to multi-population haplotypes, and describe specific Bayesian treatments of all relevant model parameters.

Write H_{i_e} = [H_{i_e,1}, ..., H_{i_e,T}], where the sub-subscript e ∈ {0, 1} denotes the two possible parental origins (i.e., paternal and maternal), for a haplotype over T contiguous SNPs from individual i; and let G_i = [G_{i,1}, ..., G_{i,T}] denote the genotype of these SNPs for individual i. For diploid organisms such as humans, we denote the two alleles of a SNP by 0 and 1; thus each G_{i,t} can take on one of four values: 0 or 1, indicating a homozygous site; 2, indicating a heterozygous site; and ?, indicating missing data. (A generalization to polymorphisms with k-ary alleles is straightforward, but omitted here for simplicity.) Let A_k = [A_{k,1}, ..., A_{k,T}] denote an ancestor haplotype (indexed by k) and θ_k denote the mutation rate of ancestor k; and let C_{i_e} denote an inheritance variable that specifies the ancestor of haplotype H_{i_e}. We write P_h(H | A) for the inheritance model according to which individual haplotypes are derived from a founder, and P_g(G | H_0, H_1) for the genotyping model via which noisy observations of the genotypes are related to the true haplotypes.

Under a DP mixture, we have the following Pólya urn scheme for sampling the genotypes G_i, i = 1, ..., n, of a sample with n individuals (here the 2n haplotypes are indexed by a single running index for notational simplicity):

Draw the first haplotype:
  {a_1, θ_1} ~ Q_0(·), sample the first founder (and its associated mutation rate);
  h_1 ~ P_h(· | a_1, θ_1), sample the first haplotype from an inheritance model defined on the first founder.

For each subsequent haplotype, sample the founder indicator for the i-th haplotype:
  p(c_i = c_j for some j < i | c_1, ..., c_{i-1}) = n_{c_j} / (i - 1 + α),
  p(c_i ≠ c_j for all j < i | c_1, ..., c_{i-1}) = α / (i - 1 + α),
where n_c is the occupancy number of class c, i.e., the number of previous samples generated from founder a_c; then sample the founder of haplotype i (indexed by c_i):
  φ_{c_i} = {a_{c_j}, θ_{c_j}} if c_i = c_j for some j < i (i.e., c_i refers to an inherited founder),
  φ_{c_i} ~ Q_0(a, θ) if c_i ≠ c_j for all j < i (i.e., c_i refers to a new founder);

sample the haplotype according to its founder:
  h_i | c_i ~ P_h(· | a_{c_i}, θ_{c_i});
and sample all genotypes (according to the one-to-one mapping between the running haplotype index used above and the individual and parental-origin indices (i, e)):
  g_i | h_{i_0}, h_{i_1} ~ P_g(· | h_{i_0}, h_{i_1}).

Under appropriate parameterizations of the base measure Q_0, the inheritance model P_h, and the genotyping model P_g, which will be described in detail shortly, the problem of phasing individual haplotypes and estimating the size and configuration of the latent ancestral pool under a DPM model can be solved via posterior inference given the genotypes from a (presumably ethnically homogeneous) population descending from a single pool of ancestors, using, for example, a Gibbs sampler as we outline in the Appendix.

As mentioned earlier, treating the haplotype distribution as a mixture model, where the set of mixture components corresponds to the pool of ancestral haplotypes, or founders, of the population, can be justified by straightforward statistical genetics arguments. Crucially, however, the size of this pool is unknown; indeed, knowing the size of the pool would correspond to knowing something significant about the genome and its history. In most practical population genetic problems, the detailed genealogical structure of a population (as provided by the coalescent trees) is usually of less importance than population-level features such as the pattern of common ancestral alleles (i.e., founders) in a population bottleneck, the age of such alleles, etc. In this case, the DP mixture offers a principled approach to generalizing the finite mixture model for haplotypes to an infinite mixture model that captures uncertainty regarding the size of the ancestral haplotype pool, and at the same time it provides a reasonable approximation to the coalescent-with-PDM model by utilizing the partition structure resulting therefrom, while allowing further mutations within each partite to introduce further diversity among descendants of the same founder.

2.2 A Hierarchical DP mixture model for multi-population haplotypes

Now we consider the case in which there exist multiple sample populations (e.g., ethnic groups), each modeled by a distinct DP mixture. Note that now we have multiple ancestor pools, one for each attendant population; instead of modeling these populations independently, we place all the population-specific DPMs under a common prior, so that the ancestors (i.e., mixture components) in any of the population-specific mixtures can be shared across all the mixtures, while the weight of an ancestral haplotype in each mixture is unique. Genetically, this means that every possible ancestral haplotype can either be a founder in only one of the populations, or be a founder shared in some or all attendant populations; in the latter case, the frequencies with which this haplotype founder is inherited differ across populations. To tie the population-specific DP mixtures together in this way, we use a hierarchical DP mixture model [32], in which the base measure associated with each population-specific DP mixture is itself drawn from a higher-level Dirichlet process DP(γ, F). Since a draw from this higher-level DP is a discrete measure with probability 1, the atoms drawn by different population-specific DPs from DP(γ, F), i.e., the haplotype founders and their mutation rates φ_k = {A_k, θ_k}, which are used as the

mixture components in each of the population-specific DP mixtures, are not all going to be distinct (i.e., the same (A_k, θ_k) can be drawn by two different population-specific DPMs). This allows sharing of components across different mixture models.

2.2.1 Hierarchical Dirichlet Process

Before presenting the HDP mixture for haplotypes, we digress with a brief description of the HDP formalism. As with the DP, it is useful to describe the marginals induced by an HDP using the more concrete representation of Pólya urn models. Imagine we set up a single stock urn at the top level, which contains balls of colors that are represented by at least one ball in one or multiple urns at the bottom level. At the bottom level, we have a set of distinct urns which are used to define the DP mixture for each population. Now suppose that, upon drawing the m_j-th ball for urn j at the bottom, the stock urn contains n balls of K distinct colors indexed by the integer set C = {1, 2, ..., K}. We either draw a ball randomly from urn j and place back two balls of that color, or with some probability we return to the top level. From the stock urn, we can either draw a ball randomly and put back two balls of that color in the stock urn and one in urn j, or, with probability γ/(n + γ), obtain a ball of a new color K + 1 and put back a ball of this color in both the stock urn and urn j at the lower level. Essentially, we have a master DP (the top urn) that serves as a source of atoms for J child DPs (bottom urns).

Associating each color k with a random variable φ*_k whose value is drawn from the base measure F, and recalling our discussion in the previous section, we know that draws from the stock urn can be viewed as marginals from a random measure Q_0 distributed as a Dirichlet process with parameters (γ, F). From Eq. (3), for n random draws φ = {φ_1, ..., φ_n} from Q_0, the conditional prior for (φ_n | φ_{-n}), where the subscript -n denotes the index set of all but the n-th ball, is

φ_n | φ_{-n}  ~  Σ_{k=1}^{K} [n_k / (n - 1 + γ)] δ_{φ*_k}(φ_n) + [γ / (n - 1 + γ)] F(φ_n),    (4)

where φ*_k, k = 1, ..., K, denote the K distinct values (i.e., colors) of φ (i.e., of all the balls in the stock urn), and n_k denotes the number of balls of color k in the top urn. Conditioning on Q_0 (i.e., using Q_0 as an atomic base measure for each of the DPs corresponding to the bottom-level urns), the m_j-th draw from the j-th bottom-level urn is also distributed as a marginal under a Dirichlet measure:

φ_{m_j} | φ_{-m_j}  ~  Σ_{k=1}^{K} [ (m_{j,k} + τ n_k/(n - 1 + γ)) / (m_j - 1 + τ) ] δ_{φ*_k}(φ_{m_j}) + [ τ/(m_j - 1 + τ) ][ γ/(n - 1 + γ) ] F(φ_{m_j})
            = Σ_{k=1}^{K} π_{j,k} δ_{φ*_k}(φ_{m_j}) + π_{j,K+1} F(φ_{m_j}),    (5)

where π_{j,k} := (m_{j,k} + τ n_k/(n - 1 + γ)) / (m_j - 1 + τ), π_{j,K+1} := [τ/(m_j - 1 + τ)][γ/(n - 1 + γ)], and m_{j,k} denotes the number of balls of color k in the j-th bottom urn. In our case, π_j = (π_{j,1}, π_{j,2}, ...) gives the a priori frequencies (i.e., mixture weights) of the haplotype founders in population j.
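The nested urn scheme above can be summarized in a few lines of Python. The sketch below is only an illustration of the sharing behavior behind Eqs. (4)-(5), not the sampler used by Haploi; the function name, concentration values, and number of draws are ours, chosen for illustration. It shows how founder "colors" drawn through the shared stock urn end up shared across populations while each population maintains its own founder frequencies.

```python
import random
from collections import Counter

def sample_hdp_urns(num_pops, draws_per_pop, gamma=1.0, tau=1.0, seed=0):
    """Simulate the two-level Polya urn scheme: a top-level 'stock' urn shared
    by all populations and one bottom-level urn per population.  Colors stand
    for haplotype founders; returns the per-population color counts."""
    rng = random.Random(seed)
    stock = Counter()                       # color -> number of balls in the stock urn
    pops = [Counter() for _ in range(num_pops)]
    next_color = 0

    def draw_from_stock():
        nonlocal next_color
        n = sum(stock.values())
        if rng.random() < gamma / (n + gamma):       # brand-new founder color
            color = next_color
            next_color += 1
        else:                                        # existing color, proportional to stock counts
            colors, weights = zip(*stock.items())
            color = rng.choices(colors, weights=weights)[0]
        stock[color] += 1
        return color

    for urn in pops:
        for _ in range(draws_per_pop):
            m = sum(urn.values())
            if m == 0 or rng.random() < tau / (m + tau):
                color = draw_from_stock()            # back off to the shared stock urn
            else:                                    # reuse a founder already in this urn
                colors, weights = zip(*urn.items())
                color = rng.choices(colors, weights=weights)[0]
            urn[color] += 1
    return pops

for j, urn in enumerate(sample_hdp_urns(num_pops=3, draws_per_pop=40)):
    print("population", j, dict(urn))   # founders shared across urns, with different weights
```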

Figure 1: The haplotype-genotype generative process under the HDPM, illustrated by an example concerning three populations. At the first level, all haplotype founders of the different populations are drawn from a common pool via a Pólya urn scheme, which leads to the following effects: 1) the same founder can be drawn by multiple populations (e.g., the red founder in populations 1 and 2, and the blue one in populations 1 and 3), or by only a single population (e.g., the grey founder in population 1); 2) shared founders can have different frequencies of being inherited. At the second level, individual haplotypes are drawn from a population-specific founder pool, also via a Pólya urn scheme, but this time through an inheritance model P_h(· | a_k) that allows mutations with respect to the founders, as indicated by the underscores at the mutated loci in the individual haplotypes. Finally, the genotype observations are related to the haplotype pairs of every individual via a noisy channel P_g(·).

2.2.2 HDPM for multi-population haplotype inference

Using the HDP construction described above, we now define an HDP mixture model for the genotypes in J populations. Elaborating on the notational scheme used earlier, let G^{(j)}_i = [G^{(j)}_{i,1}, ..., G^{(j)}_{i,T}] denote the genotype of T contiguous SNPs of individual i from ethnic group j; and let H^{(j)}_{i_e} = [H^{(j)}_{i_e,1}, ..., H^{(j)}_{i_e,T}] denote a haplotype of individual i from ethnic group j. The basic generative structure of multi-population genotypes under an HDPM is as follows, and is also illustrated graphically in Figure 1:

  Q_0 ≡ (φ_1, φ_2, ...) | γ, F ~ DP(γ, F), sample a DP of founders for all populations;
  Q_j ≡ (φ^{(j)}_1, φ^{(j)}_2, ...) | τ, Q_0 ~ DP(τ, Q_0), sample the DP of founders for each population;
  φ^{(j)}_{i_e} | Q_j ~ Q_j, sample the founder of haplotype i_e in population j;
  h^{(j)}_{i_e} | φ^{(j)}_{i_e} ~ P_h(· | φ^{(j)}_{i_e}), sample haplotype i_e in population j;
  g^{(j)}_i | h^{(j)}_{i_0}, h^{(j)}_{i_1} ~ P_g(· | h^{(j)}_{i_0}, h^{(j)}_{i_1}), sample genotype i in population j,

where in practice the first three sampling steps follow the nested Pólya urn scheme described above. Note that in the HDP the base measure of each lower-level DP is a draw from the root DP(γ, F). From this description, it is apparent that the totality of atomic samples, i.e., all instantiated haplotype founders and their associated mutation rates drawn from the base measure F, forms the common support (i.e., the candidate founder patterns) of both the root DP and all the population-specific DPs. The child DPs place different mass distributions, i.e., a priori frequencies of the haplotype founders, on this common support, in a population-specific fashion.

Recall that the base measure F in the above generative process is defined as a distribution from which haplotype founders φ_k := {A_k, θ_k} are drawn. Thus it is a joint measure on both A and θ. We let F(A, θ) = p(A)p(θ), where p(A) is uniform over all possible haplotypes and p(θ) is a beta distribution introducing a prior belief of a low PDM mutation rate. For generality, we assume each A_{k,t} (and also each H_{i,t}) takes its value from an allele set P. For the other building blocks of the HDPM model, we propose the following specifications.

Haplotype inheritance model: Omitting all but the locus index t, we define our inheritance model to be a single-locus mutation model as follows [27]:

P_h(h_t | a_t, θ) = θ^{I(h_t = a_t)} [ (1 - θ) / (|P| - 1) ]^{I(h_t ≠ a_t)},    (6)

where I(·) is the indicator function. This model corresponds to a star genealogy resulting from infrequent mutations over a shared ancestor, and is widely used as an approximation to a full coalescent genealogy starting from the shared ancestor (e.g., Liu et al. [33]). Given this inheritance model, it can be shown that the marginal conditional distribution of a haplotype sample h = {h_{i_e} : e ∈ {0, 1}, i ∈ {1, 2, ..., I}} takes the following form, resulting from an integration over θ in the joint conditional:

p(h | a, c) = Π_{k=1}^{K} R(α_h, β_h) [ Γ(α_h + l_k) Γ(β_h + l'_k) / Γ(α_h + β_h + l_k + l'_k) ] (1 / (|P| - 1))^{l'_k},    (7)

where R(α_h, β_h) = Γ(α_h + β_h) / [Γ(α_h) Γ(β_h)], l_k = Σ_{i,e,t} I(h_{i_e,t} = a_{k,t}) I(c_{i_e} = k) is the number of alleles that are identical to the corresponding ancestral alleles, and l'_k = Σ_{i,e,t} I(h_{i_e,t} ≠ a_{k,t}) I(c_{i_e} = k) is the total number of mutated alleles.

Genotype observation model: Next, we assume that the observed genotype at a locus is determined by the paternal and maternal alleles at this site via the following genotyping model [27]:

P_g(g | h_{i_0,t}, h_{i_1,t}, ξ) = ξ^{I(h = g)} [µ_1 (1 - ξ)]^{I(h ≠_1 g)} [µ_2 (1 - ξ)]^{I(h ≠_2 g)},    (8)

where h ≜ h_{i_0,t} ⊕ h_{i_1,t} denotes the unordered pair of the two actual SNP allele instances at locus t; ≠_1 denotes a set difference of exactly one element; ≠_2 denotes a set difference of both elements; and µ_1 and µ_2 are appropriately defined normalizing constants. Again, we place a beta prior Beta(α_g, β_g) on ξ for smoothing.
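To make the two emission models concrete, the following sketch evaluates Eq. (6) and a simplified version of Eq. (8) for binary SNPs. It is illustrative only: the parameter values are arbitrary, Eq. (6) is written here under the convention that θ is the probability of copying the ancestral allele faithfully, and the normalizers µ_1 and µ_2 of Eq. (8) are replaced by the simplifying assumption that the error mass 1 - ξ is spread uniformly over the incorrect genotype values.

```python
def p_h(h_t, a_t, theta, num_alleles=2):
    """Single-locus inheritance model (Eq. 6): the ancestral allele a_t is copied
    with probability theta; otherwise a different allele is chosen uniformly from
    the remaining num_alleles - 1 possibilities."""
    return theta if h_t == a_t else (1.0 - theta) / (num_alleles - 1)

def p_g(g_t, h0_t, h1_t, xi):
    """Genotyping model in the spirit of Eq. (8): the observed genotype code g_t
    (0 or 1 = homozygous, 2 = heterozygous) matches the genotype implied by the
    true allele pair (h0_t, h1_t) with probability xi; as a simplifying stand-in
    for mu_1 and mu_2, the error mass 1 - xi is spread uniformly over the two
    incorrect genotype values."""
    true_g = h0_t if h0_t == h1_t else 2
    return xi if g_t == true_g else (1.0 - xi) / 2.0

# Toy 4-SNP example: both haplotypes of one individual descend from the same
# founder here (in the model each haplotype carries its own founder indicator).
founder = [1, 1, 0, 0]
hap0, hap1 = [1, 1, 0, 0], [1, 0, 0, 0]      # hap1 carries one mutation at site 2
genotype = [1, 2, 0, 0]

likelihood = 1.0
for t in range(len(founder)):
    likelihood *= p_h(hap0[t], founder[t], theta=0.95)
    likelihood *= p_h(hap1[t], founder[t], theta=0.95)
    likelihood *= p_g(genotype[t], hap0[t], hap1[t], xi=0.98)
print(likelihood)   # product of per-site inheritance and genotyping terms
```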

Under the above model specifications, it is standard to derive the posterior distribution of each haplotype H_{i_e} given all other haplotypes and all genotypes, and the posterior of any missing genotypes, by integrating out the parameters θ or ξ and resorting to Bayes' theorem, which enables collapsed Gibbs sampling steps where necessary. For simplicity, we omit the details.

Hyperprior for coalescent rate: Lastly, to capture uncertainty over the scaling parameter γ (or α in the single-layer DPM model), which is twice the mutation rate in the coalescent over the haplotype founders, we use a vague inverse Gamma prior:

p(γ^{-1}) ~ G(1, 1), i.e., p(γ) ∝ γ^{-2} exp(-1/γ).    (9)

Under this prior, the posterior distribution of γ depends only on the number of instances n and the number of components K, but not on how the samples are distributed among the components:

p(γ | k, n) ∝ γ^{k-2} exp(-1/γ) Γ(γ) / Γ(n + γ).    (10)

The distribution p(log(γ) | k, n) is log-concave, so we may efficiently generate independent samples from it using adaptive rejection sampling [34]. It is noteworthy that in an HDPM we also need to define vague inverse Gamma priors for the scaling parameters of the population-specific DPs at the bottom level. We use a single concentration parameter τ for these DPs; it is also possible to allow separate concentration parameters for each of the lower-level DPs, possibly tied distributionally via a common hyperparameter.

2.2.3 Posterior inference via Gibbs sampling

Putting everything together, we have constructed an HDPM model for multi-population haplotypes. The two-level nested Pólya urn scheme described above motivates an efficient and easy-to-implement Gibbs sampling algorithm for sampling from the posterior associated with the HDPM. Specifically, under a collapsed Gibbs sampling scheme where θ and ξ are integrated out, the variables of interest are C^{(j)}_{i_e}, A_{k,t}, H^{(j)}_{i_e,t}, γ, and τ, for all i, j, k, t, e. The sampler alternates between three coupled stages. First, it samples the scaling parameters γ and τ of the DPs, following the predictive distribution given by Eq. (10). Then, it samples the c^{(j)}_{i_e}'s and a_{k,t}'s given the current values of the hidden haplotypes and the scaling parameters. Finally, given the current state of the ancestral pool and the ancestor assignment for each individual, it samples the h^{(j)}_{i_e,t} variables. Due to space limits, details of the forms and derivations of the predictive distributions used in the second and third stages are given in the online supplemental material.
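As an illustration of the first stage of the sampler (the hyperparameter update), the sketch below evaluates the unnormalized posterior of Eq. (10) and draws γ from it. The report uses adaptive rejection sampling on log(γ); the grid-based draw here is merely a simple stand-in for illustration, and the values of k and n in the example are arbitrary.

```python
import math
import random

def log_post_gamma(gamma, k, n):
    """Unnormalized log posterior of the DP scale gamma given k components among
    n samples (Eq. 10): gamma^(k-2) * exp(-1/gamma) * Gamma(gamma) / Gamma(n + gamma)."""
    return ((k - 2) * math.log(gamma) - 1.0 / gamma
            + math.lgamma(gamma) - math.lgamma(n + gamma))

def sample_gamma(k, n, grid=None, rng=random):
    """Draw gamma from its posterior by discretizing it on a grid and sampling
    from the normalized weights; a crude but adequate stand-in for adaptive
    rejection sampling for the purpose of illustration."""
    if grid is None:
        grid = [0.01 * i for i in range(1, 2001)]       # gamma in (0, 20]
    logw = [log_post_gamma(g, k, n) for g in grid]
    m = max(logw)
    weights = [math.exp(lw - m) for lw in logw]          # stabilize before exponentiating
    return rng.choices(grid, weights=weights)[0]

# Example: 17 founders recovered from 200 haplotypes (100 diploid individuals).
print(sample_gamma(k=17, n=200))
```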

3 Partition-ligation and the Haploi program

As for most haplotype inference models proposed in the literature, the state space of the proposed HDPM model scales exponentially with the length of the genotype sequence, and therefore it cannot be directly applied to genotype data containing hundreds or thousands of SNPs. To deal with haplotypes with a large number of linked SNPs, Niu et al. [12] proposed a divide-and-conquer heuristic known as Partition-Ligation (PL), which has been adopted by a number of haplotype inference algorithms including PL-EM [35], PHASE [9, 36], and CHB [14]. We equipped our haplotype inference algorithm based on the HDPM model with a variant of the PL heuristic, and present a new tool, Haploi, for haplotype inference of either single- or multiple-population genotype data containing thousands of SNPs. The original PL scheme in Niu et al. [12] works on disjoint blocks and then recursively ligates the phased blocks into larger (non-overlapping) haplotypes via Gibbs sampling in the product space of all the atomistic haplotypes of every attendant pair of blocks to be ligated, under a fixed-dimensional Dirichlet prior on the frequencies of the ligated haplotypes. In contrast, our PL scheme generates partially overlapping intermediate blocks from smaller blocks phased at the lower level, and pairs of overlapping blocks are recursively merged into larger ones by leveraging the redundancy of information in the overlapping regions, as well as an overall parsimony criterion. Empirically we found that this strategy can lead to a significant reduction of the size of the haplotype search space for long genotypes, such as those with thousands of SNPs, and therefore facilitates a more efficient inference algorithm. Again, due to space limits, details of this algorithm are available in the supplemental material.

4 Results

In this section, we present a comparison of Haploi with PHASE [9, 36], fastPHASE [37], HAPLOTYPER 1.0 [12], and CHB 1.0 [14]. We ran each program using its default parameter settings. Three different error measures were used for evaluation: err_s, the ratio of incorrectly phased SNP sites over all non-trivial heterozygous SNPs (excluding individuals with a single heterozygous SNP); err_i, the ratio of incorrectly phased individuals over all non-trivial heterozygous individuals (i.e., those with at least two heterozygous SNPs); and d_w, the switch distance, which is the number of phase flips required to correct the predicted haplotypes, counted over all non-trivial heterozygous SNPs. Both short (about 100 SNPs) and long sequences were tested when the program permits. For short SNP sequences, we primarily use err_s and err_i; for long sequences we compare d_w, according to common practice, since it is regarded as a more sensible indicator of performance for longer SNP sequences. On short SNP sequences, we test on a large number of samples and report summary statistics of errors over all samples; for long SNP sequences, we present the error on each of the samples (of different lengths) we tested, due to the heavy computational cost. (Specifically, we applied fastPHASE only to long SNP sequences, since it is tailored for fast processing of long sequences at the cost of reduced accuracy; and we applied HAPLOTYPER 1.0 and CHB 1.0 only to short SNP sequences, since they cannot reach convergence within an acceptable test time (> 4 hours) on samples with more than 200 SNPs.)
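For concreteness, the following sketch computes the per-individual quantities behind the three measures: the number of incorrectly phased heterozygous sites (contributing to err_s), whether the individual is phased incorrectly at all (contributing to err_i), and the number of switch errors (contributing to d_w). It reflects one common reading of these measures; the exact conventions used in the experiments may differ in minor details, and the function names are ours.

```python
def phase_orientations(true_pair, pred_pair):
    """For each heterozygous site, record whether the predicted phase agrees (0)
    or is flipped (1) relative to the first true haplotype.  Assumes the
    prediction is consistent with the genotype at heterozygous sites."""
    (t1, t2), (p1, p2) = true_pair, pred_pair
    return [0 if p1[s] == t1[s] else 1
            for s in range(len(t1)) if t1[s] != t2[s]]

def individual_errors(true_pair, pred_pair):
    """Return (site_errors, num_het_sites, indiv_error, switches, switch_opps)
    for one non-trivial individual (at least two heterozygous sites)."""
    o = phase_orientations(true_pair, pred_pair)
    if len(o) < 2:
        return None                               # trivial individual: excluded
    site_err = min(o.count(0), o.count(1))        # best of the two global alignments
    indiv_err = int(site_err > 0)                 # any heterozygous site phased wrongly
    switches = sum(o[s] != o[s + 1] for s in range(len(o) - 1))
    return site_err, len(o), indiv_err, switches, len(o) - 1

# Toy example: one switch error between the 2nd and 3rd heterozygous sites.
true_pair = ([1, 1, 0, 0, 1], [0, 0, 1, 1, 1])
pred_pair = ([1, 1, 1, 1, 1], [0, 0, 0, 0, 1])
print(individual_errors(true_pair, pred_pair))    # (2, 4, 1, 1, 3)
```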

We have also estimated other population genetic metrics of interest, such as the haplotype frequencies, the mutation rates θ, and the number of reconstructed haplotype founders K, under the HDP and DP models. We will present some of these results to illustrate the consistency of our model (on simulation) and the characteristics of some of the real data sets being studied.

[Figure 2 panels: bar charts of err_s, err_i, and d_w for HDP, DP, PHASE, HAPLOTYPER, and CHB at θ = 0.01 and θ = 0.05.]

Figure 2: Performance on simulated datasets. Two kinds of datasets with different mutation rates θ were tested. Each dataset includes 100 individuals from 5 groups (20 from each). The sequence length was fixed to 100. The performance of each algorithm is represented in terms of err_s, err_i, and d_w. Each bar represents the average error rate across the 50 different datasets, with the standard deviation shown as a vertical line.

4.1 Simulated Multi-Population SNP Data

For simulation-based tests, we used a pool of haplotypes taken from the coalescent-based synthetic dataset in Stephens et al. [9], each containing 100 SNPs, as the hypothetical founders; and we drew each individual's haplotypes and genotype by randomly choosing two ancestors from these founders and applying the mutation and noisy genotyping models described in the methodology section. For each of our synthetic multi-population data sets, we simulated 100 individuals consisting of five populations, each with 20 individuals. Each population is derived from 5 founders, two of which are shared among all populations, while the other three are population-specific; thus the total number of founders across the five populations is 17. We test our algorithm on two data sets with different degrees of sequence diversity. In the conserved data set, we assume the mutation rate θ to be 0.01 for all populations and all loci; in the diverse data set, θ is set to be 0.05. All populations and loci are assumed to have the same genotyping error rate. Fifty random samples were drawn from both the conserved and the diverse settings. (Our simulation scheme assumes a star genealogy with a uniform mutation rate for each sub-population. This scheme is simple to implement under our multi-population scenario for testing the bias/variance of a number of estimators of interest, and it reasonably approximates the genetic demography of samples under an IMA model on a coalescent tree. We have also tested an early version of Haploi, which employed a DPM (rather than an HDPM) without partition-ligation, on the single-population, coalescent-based simulated data used in Stephens et al. [9], in comparison with PHASE (see Xing et al. [27]), and the results were similar.)
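A minimal sketch of this simulation scheme is given below. The founder patterns, haplotype length, and rates used here are placeholders (the actual experiments use 100-SNP founders taken from the coalescent-based data of Stephens et al. [9], with the mutation and genotyping models of Section 2).

```python
import random

def simulate_population_data(founders, n_individuals=20, theta=0.01,
                             geno_err=0.01, seed=0):
    """Toy version of the simulation scheme: each individual's two haplotypes are
    copied from two founders chosen uniformly from the population's founder pool,
    each site flipping with probability theta, and the genotype is then observed
    with a per-site error probability geno_err.  Parameter values are placeholders."""
    rng = random.Random(seed)

    def mutate(founder):
        return [a ^ 1 if rng.random() < theta else a for a in founder]

    data = []
    for _ in range(n_individuals):
        h0 = mutate(rng.choice(founders))
        h1 = mutate(rng.choice(founders))
        g = []
        for a, b in zip(h0, h1):
            obs = a if a == b else 2
            if rng.random() < geno_err:                     # noisy genotyping
                obs = rng.choice([v for v in (0, 1, 2) if v != obs])
            g.append(obs)
        data.append((g, (h0, h1)))
    return data

# Five populations: 2 founders shared by all, plus 3 private founders each.
T = 10                                                      # short toy haplotypes
rng = random.Random(1)
shared = [[rng.randint(0, 1) for _ in range(T)] for _ in range(2)]
populations = []
for j in range(5):
    private = [[rng.randint(0, 1) for _ in range(T)] for _ in range(3)]
    populations.append(simulate_population_data(shared + private, seed=j))
print(len(populations), "populations,", len(populations[0]), "individuals each")
```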

4.1.1 Haplotype Accuracy

We compare Haploi, implemented based either on a DPM model or on an HDPM model (dubbed Haploi-DP and Haploi-HDP, respectively, when the distinction is needed; otherwise we simply use Haploi for Haploi-HDP), with extant phasing methods applied in two modes on the synthetic data sets. As mentioned in the introduction, given multi-population genotype data, an extant method can either adopt mode-I, pooling all populations together and jointly solving a single haplotype inference problem that ignores the population label of each individual, or follow mode-II, applying the algorithm to each population separately and solving multiple haplotype inference problems independently. Haploi-HDP takes a different approach, making explicit use of the population labels (if available) and jointly solving multiple coupled haplotype inference problems. (Note that when only a single population is concerned, or no population label is available, Haploi-HDP is still valid, and is equivalent to a baseline Haploi-DP with one more layer of DP hyper-prior.) We apply our method to the simulated multi-population data, compare its overall performance on the whole data with the outcomes of other algorithms run in mode-I, and compare its score on each population with the outcomes of extant methods run in mode-II.

Figure 2 summarizes the overall performance of Haploi on the 50 conserved simulated samples and the 50 diverse simulated samples, along with those of the reference algorithms run in mode-I. On the conserved samples, which are presumably easier to phase, Haploi-HDP outperforms all the other algorithms appreciably. On the diverse samples, which are more challenging to phase (due to more severe inconsistencies among individual haplotypes caused by the higher mutation rate), Haploi-HDP outperforms all other algorithms by a significant margin.

Table 1: Comparison of Haploi and other algorithms run in mode-II on the synthetic multi-population data. Results on the conserved data sets (θ = 0.01) and the diverse data sets (θ = 0.05) are shown in the upper and lower sections, respectively. In the table, column "pop" specifies the 5 attendant sub-populations; column "K" shows the estimated number of founders in each sub-population; the columns denoted err_s and err_i show the phase error measured by site- and individual-discrepancies, respectively; and the row "avg" shows the average errors over the 50 random samples of the test data set. The columns are grouped by method (Haploi-HDP, Haploi-DP, PHASE, HAPLOTYPER, CHB), with rows for populations (1)-(5) and "avg" in each of the two θ sections. [Table entries omitted.]

Haploi-HDP also dominates the other methods when the latter are run in mode-II on the simulated data.

Table 1 shows a comparison of the accuracy of Haploi-HDP on each sub-population (directly extracted from the results obtained in a single run of Haploi-HDP on all populations) in a simulated data set against the results from separate runs of the other algorithms on each sub-population of genotypes. Note that here K is expected to be 5 for each group, and the estimate given by Haploi-HDP is quite close to this value. On the conserved data set, CHB shows the best result, while all algorithms perform comparably. On the more difficult diverse dataset, the HDP approach outperformed the other algorithms and inferred the founders of each group more robustly than the group-specific runs using the baseline Haploi-DP (which employs independent DPs for each sub-population).

[Figure 3 panels: trajectories of K (top row) and θ (bottom row) over the sampling iterations, for the conserved (left) and diverse (right) datasets.]

Figure 3: Trajectories of K and θ along the iterations of HDP inference on one of the simulated datasets. The left panels show the results on the conserved dataset, and the right panels those on the diverse dataset.

4.1.2 MCMC and parameter estimation

Typically, the Pólya urn Gibbs sampler for Haploi converges within 100 iterations (Figure 3) on the synthetic data. This contrasts with the sampling algorithms used in some of the other haplotype inference programs, which typically need tens of thousands of samples to reach convergence. The fast convergence is possibly due to Haploi's ability to quickly infer the correct number of founding haplotypes underlying the genotype samples, which leads to a model significantly more compact (i.e., parsimonious) than those derived by other algorithms. In Figure 4 (a) and (b), we show the histogram of the estimated K, the number of recovered ancestors, across the 50 datasets for both of our algorithms. Recall that we expect K to be 17; the estimated K under both the DP and HDP models turns out to be very close to this number on the conserved datasets (i.e., those with a small mutation rate). On the diverse data sets, Haploi-HDP can still offer a good estimate of the number of ancestors, whereas Haploi-DP recovered more ancestors (around 25 on average) than


More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Genotype Imputation. Biostatistics 666

Genotype Imputation. Biostatistics 666 Genotype Imputation Biostatistics 666 Previously Hidden Markov Models for Relative Pairs Linkage analysis using affected sibling pairs Estimation of pairwise relationships Identity-by-Descent Relatives

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations

mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations mstruct: Inference of Population Structure in Light of both Genetic Admixing and Allele Mutations Suyash Shringarpure and Eric P. Xing May 2008 CMU-ML-08-105 mstruct: Inference of Population Structure

More information

Tree-Based Inference for Dirichlet Process Mixtures

Tree-Based Inference for Dirichlet Process Mixtures Yang Xu Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, USA Katherine A. Heller Department of Engineering University of Cambridge Cambridge, UK Zoubin Ghahramani

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

Supporting Information Text S1

Supporting Information Text S1 Supporting Information Text S1 List of Supplementary Figures S1 The fraction of SNPs s where there is an excess of Neandertal derived alleles n over Denisova derived alleles d as a function of the derived

More information

Processes of Evolution

Processes of Evolution 15 Processes of Evolution Forces of Evolution Concept 15.4 Selection Can Be Stabilizing, Directional, or Disruptive Natural selection can act on quantitative traits in three ways: Stabilizing selection

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Modelling Genetic Variations with Fragmentation-Coagulation Processes

Modelling Genetic Variations with Fragmentation-Coagulation Processes Modelling Genetic Variations with Fragmentation-Coagulation Processes Yee Whye Teh, Charles Blundell, Lloyd Elliott Gatsby Computational Neuroscience Unit, UCL Genetic Variations in Populations Inferring

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

AUTHORIZATION TO LEND AND REPRODUCE THE THESIS. Date Jong Wha Joanne Joo, Author

AUTHORIZATION TO LEND AND REPRODUCE THE THESIS. Date Jong Wha Joanne Joo, Author AUTHORIZATION TO LEND AND REPRODUCE THE THESIS As the sole author of this thesis, I authorize Brown University to lend it to other institutions or individuals for the purpose of scholarly research. Date

More information

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS CRYSTAL L. KAHN and BENJAMIN J. RAPHAEL Box 1910, Brown University Department of Computer Science & Center for Computational Molecular Biology

More information

1. Understand the methods for analyzing population structure in genomes

1. Understand the methods for analyzing population structure in genomes MSCBIO 2070/02-710: Computational Genomics, Spring 2016 HW3: Population Genetics Due: 24:00 EST, April 4, 2016 by autolab Your goals in this assignment are to 1. Understand the methods for analyzing population

More information

Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

More information

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

More information

Part IV: Monte Carlo and nonparametric Bayes

Part IV: Monte Carlo and nonparametric Bayes Part IV: Monte Carlo and nonparametric Bayes Outline Monte Carlo methods Nonparametric Bayesian models Outline Monte Carlo methods Nonparametric Bayesian models The Monte Carlo principle The expectation

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees: MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2 Elizabeth Thompson University of Washington Genetic mapping and linkage lod scores Monte Carlo likelihood and likelihood ratio estimation

More information

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data Na Li and Matthew Stephens July 25, 2003 Department of Biostatistics, University of Washington, Seattle, WA 98195

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Introduction to population genetics & evolution

Introduction to population genetics & evolution Introduction to population genetics & evolution Course Organization Exam dates: Feb 19 March 1st Has everybody registered? Did you get the email with the exam schedule Summer seminar: Hot topics in Bioinformatics

More information

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES Elizabeth A. Thompson Department of Statistics, University of Washington Box 354322, Seattle, WA 98195-4322, USA Email: thompson@stat.washington.edu This

More information

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES: .5. ESTIMATION OF HAPLOTYPE FREQUENCIES: Chapter - 8 For SNPs, alleles A j,b j at locus j there are 4 haplotypes: A A, A B, B A and B B frequencies q,q,q 3,q 4. Assume HWE at haplotype level. Only the

More information

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009

Gene Genealogies Coalescence Theory. Annabelle Haudry Glasgow, July 2009 Gene Genealogies Coalescence Theory Annabelle Haudry Glasgow, July 2009 What could tell a gene genealogy? How much diversity in the population? Has the demographic size of the population changed? How?

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

Stat 516, Homework 1

Stat 516, Homework 1 Stat 516, Homework 1 Due date: October 7 1. Consider an urn with n distinct balls numbered 1,..., n. We sample balls from the urn with replacement. Let N be the number of draws until we encounter a ball

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Challenges when applying stochastic models to reconstruct the demographic history of populations.

Challenges when applying stochastic models to reconstruct the demographic history of populations. Challenges when applying stochastic models to reconstruct the demographic history of populations. Willy Rodríguez Institut de Mathématiques de Toulouse October 11, 2017 Outline 1 Introduction 2 Inverse

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Bayesian Inference of Interactions and Associations

Bayesian Inference of Interactions and Associations Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin CHAPTER 1 1.2 The expected homozygosity, given allele

More information

The Combinatorial Interpretation of Formulas in Coalescent Theory

The Combinatorial Interpretation of Formulas in Coalescent Theory The Combinatorial Interpretation of Formulas in Coalescent Theory John L. Spouge National Center for Biotechnology Information NLM, NIH, DHHS spouge@ncbi.nlm.nih.gov Bldg. A, Rm. N 0 NCBI, NLM, NIH Bethesda

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project

Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Introduction to Advanced Population Genetics

Introduction to Advanced Population Genetics Introduction to Advanced Population Genetics Learning Objectives Describe the basic model of human evolutionary history Describe the key evolutionary forces How demography can influence the site frequency

More information

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model

Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model Simultaneous inference for multiple testing and clustering via a Dirichlet process mixture model David B Dahl 1, Qianxing Mo 2 and Marina Vannucci 3 1 Texas A&M University, US 2 Memorial Sloan-Kettering

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Closed-form sampling formulas for the coalescent with recombination

Closed-form sampling formulas for the coalescent with recombination 0 / 21 Closed-form sampling formulas for the coalescent with recombination Yun S. Song CS Division and Department of Statistics University of California, Berkeley September 7, 2009 Joint work with Paul

More information

Bayesian analysis of the Hardy-Weinberg equilibrium model

Bayesian analysis of the Hardy-Weinberg equilibrium model Bayesian analysis of the Hardy-Weinberg equilibrium model Eduardo Gutiérrez Peña Department of Probability and Statistics IIMAS, UNAM 6 April, 2010 Outline Statistical Inference 1 Statistical Inference

More information

Similarity Measures and Clustering In Genetics

Similarity Measures and Clustering In Genetics Similarity Measures and Clustering In Genetics Daniel Lawson Heilbronn Institute for Mathematical Research School of mathematics University of Bristol www.paintmychromosomes.com Talk outline Introduction

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

MIXED MODELS THE GENERAL MIXED MODEL

MIXED MODELS THE GENERAL MIXED MODEL MIXED MODELS This chapter introduces best linear unbiased prediction (BLUP), a general method for predicting random effects, while Chapter 27 is concerned with the estimation of variances by restricted

More information

Approximate Bayesian Computation

Approximate Bayesian Computation Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate

More information

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences

A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences Indian Academy of Sciences A comparison of two popular statistical methods for estimating the time to most recent common ancestor (TMRCA) from a sample of DNA sequences ANALABHA BASU and PARTHA P. MAJUMDER*

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Big Idea #1: The process of evolution drives the diversity and unity of life

Big Idea #1: The process of evolution drives the diversity and unity of life BIG IDEA! Big Idea #1: The process of evolution drives the diversity and unity of life Key Terms for this section: emigration phenotype adaptation evolution phylogenetic tree adaptive radiation fertility

More information

Supporting Information

Supporting Information Supporting Information Hammer et al. 10.1073/pnas.1109300108 SI Materials and Methods Two-Population Model. Estimating demographic parameters. For each pair of sub-saharan African populations we consider

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Intraspecific gene genealogies: trees grafting into networks

Intraspecific gene genealogies: trees grafting into networks Intraspecific gene genealogies: trees grafting into networks by David Posada & Keith A. Crandall Kessy Abarenkov Tartu, 2004 Article describes: Population genetics principles Intraspecific genetic variation

More information

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007

Sequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007 Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember

More information