Improving prediction from Dirichlet process mixtures via enrichment


Sara K. Wade, David B. Dunson, Sonia Petrone, Lorenzo Trippa, for the Alzheimer's Disease Neuroimaging Initiative.

November 15, 2013

Abstract

Flexible covariate-dependent density estimation can be achieved by modelling the joint density of the response and covariates as a Dirichlet process mixture. An appealing aspect of this approach is that computations are relatively easy. In this paper, we examine the predictive performance of these models with an increasing number of covariates. Even for a moderate number of covariates, we find that the likelihood for x tends to dominate the posterior of the latent random partition, degrading the predictive performance of the model. To overcome this, we suggest using a different nonparametric prior, namely an enriched Dirichlet process. Our proposal maintains a simple allocation rule, so that computations remain relatively simple. Advantages are shown through both predictive equations and examples, including an application to diagnosing Alzheimer's disease.

Keywords. Bayesian nonparametrics; Density regression; Predictive distribution; Random partition; Urn scheme.

1 Introduction

Dirichlet process (DP) mixture models have become popular tools for Bayesian nonparametric regression. In this paper, we examine their behavior in prediction and aim to highlight the difficulties that emerge with increasing dimension of the covariate space. To overcome these difficulties, we suggest a simple extension based on a nonparametric prior developed in Wade et al. [2011] that maintains desirable conjugacy properties of the Dirichlet process and leads to improved prediction.

Corresponding author: sara.wade@eng.cam.ac.uk, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/adni_acknowledgement_list.pdf.

The motivating application is to Alzheimer's disease studies, where the focus is prediction of the disease status based on biomarkers obtained from neuroimages. In this problem, a flexible nonparametric approach is needed to account for the possible nonlinear behavior of the response and complex interaction terms, resulting in improved diagnostic accuracy.

DP mixtures are widely used for Bayesian density estimation, see e.g. Ghosal [2010] and references therein. A common way to extend these methods to nonparametric regression and conditional density estimation is by modelling the joint distribution of the response and the covariates (X, Y) as a mixture of multivariate Gaussians (or more general kernels). The regression function and conditional density estimates are indirectly obtained from inference on the joint density, an idea which is similarly employed in classical kernel regression (Scott [1992], Chapter 8). This approach, which we call the joint approach, was first introduced by Müller et al. [1996], and subsequently studied by many others including Kang and Ghosal [2009], Shahbaba and Neal [2009], Hannah et al. [2011], Park and Dunson [2010], and Müller and Quintana [2010]. The DP model uses simple local linear regression models as building blocks and partitions the observed subjects into clusters, where within clusters, the linear regression model provides a good fit. Even though within clusters the model is parametric, globally, a wide range of complex distributions can describe the joint distribution, leading to a flexible model for both the regression function and the conditional density.

Another related class of models is based on what we term the conditional approach. In such models, the conditional density of Y given x, f(y | x), is modelled directly, as a convolution of a parametric family f(y | x, θ) with an unknown mixing distribution P_x for θ. A prior is then given on the family of distributions {P_x, x ∈ X} such that the P_x's are dependent. Examples for the law of {P_x, x ∈ X} start from the dependent DPs of MacEachern [1999], Gelfand et al. [2005] and include Griffin and Steel [2006], Dunson and Park [2008], Ren et al. [2011], Chung and Dunson [2009], and Rodriguez and Dunson [2011], just to name a few. Such conditional models can approximate a wide range of response distributions that may change flexibly with the covariates. However, computations are often quite burdensome. One of the reasons the model examined here is so powerful is its simplicity. Together, the joint approach and the clustering of the DP provide a built-in technique to allow for changes in the response distribution across the covariate space, yet it is simple and generally less computationally intensive than the nonparametric conditional models based on dependent DPs.

The random allocation of subjects into groups in joint DP mixture models is driven by the need to obtain a good approximation of the joint distribution of X and Y. This means that subjects with similar covariates and similar relationship between the response and covariates will tend to cluster together. However, difficulties emerge as p, the dimension of the covariate space, increases. As we will detail, even for moderately large p the likelihood of x tends to dominate the posterior of the random partition, so that clusters are based mainly on similarity in the covariate space. This behaviour is quite unappealing if the marginal density of X is complex, as is typical in high dimensions, because it causes the posterior to concentrate on partitions with many small clusters, as many kernels are needed to describe f(x).

This occurs even if the conditional density of Y given x is more well-behaved, meaning, a few kernels suffice for its approximation. Typical results include poor estimates of the regression function and conditional density with unnecessarily wide credible intervals due to small clusters and, consequently, poor prediction.

This inefficient performance may not disappear with increasing samples. On one hand, appealing to recent theoretical results (Wu and Ghosal [2008], Wu and Ghosal [2010], Tokdar [2011]), one could expect that as the sample size increases, the posterior on the unknown density f(x, y) induced by the DP joint mixture model is consistent at the true density. In turn, posterior consistency of the joint is likely to have positive implications for the behavior of the random conditional density and regression function; see Rodriguez et al. [2009], Hannah et al. [2011], and Norets and Pelenis [2012] for some developments in this direction. However, the unappealing behaviour of the random partition that we described above could be reflected in worse convergence rates. Indeed, recent results by Efromovich [2007] suggest that if the conditional density is smoother than the joint, it can be estimated at a faster rate. Thus, improving inference on the random partition to take into account the different degrees of smoothness of f(x) and f(y | x) appears to be a crucial issue.

Our goal in this paper is to show that a simple modification of the nonparametric prior on the mixing distribution, which better models the random partition, can more efficiently convey the information present in the sample, leading to more efficient conditional density estimates in terms of smaller errors and less variability, for finite samples. To achieve this aim, we consider a prior that allows local clustering, that is, the clustering structure for the marginal of X and the regression of Y on x may be different. We achieve this by replacing the DP with the enriched Dirichlet process (EDP) developed in Wade et al. [2011]. Like the DP, the EDP is a conjugate nonparametric prior, but it allows a nested clustering structure that can overcome the above issues and lead to improved predictions. An alternative proposal is outlined in Petrone and Trippa [2009] and in unpublished work by Dunson et al. [2011]. However, the EDP offers a richer parametrization. In a Bayesian nonparametric framework, several extensions of the DP have been proposed to allow local clustering (see e.g. Dunson et al. [2008], Dunson [2009], Petrone et al. [2009]). However, the greater flexibility is often achieved at the price of more complex computations. Instead, our proposal maintains an analytically computable allocation rule, and therefore, computations are a straightforward extension of those used for the joint DP mixture model. Thus, our main contributions are to highlight the problematic behavior in prediction of the joint DP mixture model for increasing p and also offer a simple solution based on the EDP that maintains computational ease. In addition, we give results on random nested partitions that are implied by the proposed prior.

This paper is organized as follows. In Section 2, we review the joint DP mixture model, discuss the behavior of the random partition, and examine the predictive performance.
In Section 3, we propose a joint EDP mixture model, derive its random partition model, and emphasize the predictive improvements of the model. Section 4 covers computational procedures. We provide a simulated example in Section 5 to demonstrate how the EDP model can lead to more efficient estimators by making better use of information contained in the sample. Finally, in Section 6, we apply the model to predict Alzheimer's disease status based on measurements of various brain structures.

2 Joint DP mixture model

A joint DP mixture model for multivariate density estimation and nonparametric regression assumes that

(X_i, Y_i) | P ~ iid f(x, y | P) = ∫ K(x, y | ξ) dP(ξ),

where X_i is p-dimensional, Y is usually univariate, and the mixing distribution P is given a DP prior with scale parameter α > 0 and base measure P_0, denoted by P ~ DP(αP_0). Due to the a.s. discrete nature of the DP, the model reduces to a countable mixture

f(x, y | P) = ∑_{j=1}^∞ w_j K(x, y | ξ_j),

where the mixing weights w_j have a stick-breaking prior with parameter α and ξ_j ~ iid P_0, independently of the w_j. This model was first developed for Bayesian nonparametric regression by Müller, Erkanli, and West [1996], who assume multivariate Gaussian kernels N_{p+1}(µ, Σ) with ξ = (µ, Σ) and use a conjugate normal-inverse-Wishart prior as the base measure of the DP. However, for even moderately large p, this approach is practically unfeasible. Indeed, the computational cost of dealing with the full (p+1) by (p+1) covariance matrix greatly increases with large p. Furthermore, the conjugate inverse-Wishart prior is known to be too poorly parametrized; in particular, there is a single parameter ν to control variability, regardless of p (see Consonni and Veronese [2001]).

A more effective formulation of this model has been recently proposed by Shahbaba and Neal [2009], based on two simple modifications. First, the joint kernel is decomposed as the product of the marginal of X and the conditional of Y | x, and the parameter space is consequently expressed in terms of the parameters ψ of the marginal and the parameters θ of the conditional. This is a classic reparametrization which, in the Gaussian case, is the basis of generalizations of the inverse-Wishart conjugate prior; see Brown et al. [1994]. Secondly, they suggest using simple kernels, assuming local independence among the covariates, i.e. the covariance matrix of the kernel for X is diagonal. These two simple modifications allow several important improvements. Computationally, reducing the covariance matrix to p variances can greatly ease calculations. Regarding flexibility of the base measure, the conjugate prior now includes a separate parameter to control variability for each of the p variances.

Furthermore, the model still allows for local correlations between Y and X through the conditional kernel of Y | x and the parameter θ. In addition, the factorization in terms of the marginal and conditional and the assumption of local independence of the covariates allow for easy inclusion of discrete or other types of response or covariates. Note that even though, within each component, we assume independence of the covariates, globally, there may be dependence. This extended model, in full generality, can be described through latent parameter vectors as follows:

Y_i | x_i, θ_i ~ ind F_y(· | x_i, θ_i),   X_i | ψ_i ~ ind F_x(· | ψ_i),   (1)
(θ_i, ψ_i) | P ~ iid P,   P ~ DP(α P_{0θ} × P_{0ψ}).

Here the base measure P_{0θ} × P_{0ψ} of the DP assumes independence between the θ and the ψ parameters, as this is the structurally conjugate prior for (θ, ψ), and thus results in simplified computations, but the model could be extended to more general choices. We further assume that P_{0θ} and P_{0ψ} are absolutely continuous, with densities p_{0θ} and p_{0ψ}. Since P is discrete a.s., integrating out the subject-specific parameters (θ_i, ψ_i), the model for the joint density is

f(y_i, x_i | P) = ∑_{j=1}^∞ w_j K(y_i | x_i, θ_j) K(x_i | ψ_j),   (2)

where the kernels K(y | x, θ) and K(x | ψ) are the densities associated to F_y(· | x, θ) and F_x(· | ψ). In the Gaussian case, K(x | ψ) is the product of N(µ_{x,h}, σ²_{x,h}) Gaussians, h = 1, ..., p, and K(y | x, θ) is N(x̃'β, σ²_y), where x̃ = (1, x')'. Shahbaba and Neal focus on the case when Y is categorical and the local model for Y | x is a multinomial logit. Hannah et al. [2011] extend the model to the case when, locally, the conditional distribution of Y | x belongs to the class of generalized linear models (GLM), that is, the distribution of the response belongs to the exponential family and the mean of the response can be expressed as a function of a linear combination of the covariates. Model (2) allows for flexible conditional densities

f(y | x, P) = ∑_{j=1}^∞ [w_j K(x | ψ_j) / ∑_{j'=1}^∞ w_{j'} K(x | ψ_{j'})] K(y | x, θ_j) ≡ ∑_{j=1}^∞ w_j(x) K(y | x, θ_j),

and nonlinear regression

E[Y | x, P] = ∑_{j=1}^∞ w_j(x) E[Y | x, θ_j],

with E[Y | x, θ_j] = x̃'β_j for Gaussian kernels. Thus, the model provides flexible kernel-based density and regression estimation, and MCMC computations are standard. However, the DP only allows joint clusters of the parameters (θ_i, ψ_i), i = 1, ..., n. We underline drawbacks of such joint clustering in the following subsections.
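To make the covariate-dependent weights w_j(x) concrete, the following is a minimal sketch (not from the paper) of model (2) under a finite stick-breaking truncation with Gaussian kernels; the truncation level J, the base-measure choices and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
J, p, alpha = 50, 2, 1.0  # truncation level, covariate dimension, DP mass (assumed values)

# Stick-breaking weights: w_j = v_j * prod_{l<j} (1 - v_l), with v_j ~ Beta(1, alpha)
v = rng.beta(1.0, alpha, size=J)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

# Atoms (theta_j, psi_j) drawn from a simple base measure (illustrative choices)
beta = rng.normal(0.0, 1.0, size=(J, p + 1))   # theta_j: regression coefficients
sigma_y = 0.5 * np.ones(J)                     # theta_j: response sd (fixed for simplicity)
mu_x = rng.normal(0.0, 2.0, size=(J, p))       # psi_j: covariate means
sigma_x = np.ones((J, p))                      # psi_j: covariate sds

def weights_x(x):
    """Covariate-dependent weights w_j(x), proportional to w_j * K(x | psi_j)."""
    kx = np.prod(norm.pdf(x, loc=mu_x, scale=sigma_x), axis=1)
    wx = w * kx
    return wx / wx.sum()

def conditional_density(y, x):
    """f(y | x, P) = sum_j w_j(x) N(y; x~'beta_j, sigma_y_j^2)."""
    xt = np.concatenate(([1.0], x))
    return np.sum(weights_x(x) * norm.pdf(y, loc=beta @ xt, scale=sigma_y))

def regression(x):
    """E[Y | x, P] = sum_j w_j(x) x~'beta_j."""
    xt = np.concatenate(([1.0], x))
    return np.sum(weights_x(x) * (beta @ xt))

x0 = np.zeros(p)
print(regression(x0), conditional_density(0.0, x0))
```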

2.1 Random partition and inference

One of the crucial features of DP mixture models is the dimension reduction and clustering obtained due to the almost sure discreteness of P. In fact, this implies that a sample (θ_i, ψ_i), i = 1, ..., n from a DP presents ties with positive probability and can be conveniently described in terms of the random partition and the distinct values. Using a standard notation, we denote the random partition by a vector of cluster allocation labels ρ_n = (s_1, ..., s_n), with s_i = j if (θ_i, ψ_i) is equal to the j-th unique value observed, (θ*_j, ψ*_j), for j = 1, ..., k, where k = k(ρ_n) is the number of groups in the partition ρ_n. Additionally, we will denote by S_j = {i : s_i = j} the set of subject indices in the j-th cluster and use the notation y*_j = {y_i : i ∈ S_j} and x*_j = {x_i : i ∈ S_j}. We also make use of the short notation x_{1:n} = (x_1, ..., x_n).

Often, the random probability measure P is integrated out and inference is based on the posterior of the random partition ρ_n and the cluster-specific parameters (θ*, ψ*) = (θ*_j, ψ*_j)_{j=1}^k;

p(ρ_n, θ*, ψ* | x_{1:n}, y_{1:n}) ∝ p(ρ_n) ∏_{j=1}^k p_{0θ}(θ*_j) p_{0ψ}(ψ*_j) ∏_{i ∈ S_j} K(x_i | ψ*_j) K(y_i | x_i, θ*_j).

As is well known (Antoniak [1974]), the prior induced by the DP on the random partition is

p(ρ_n) ∝ α^k ∏_{j=1}^k Γ(n_j),

where n_j is the size of S_j. Marginalizing out θ*, ψ*, the posterior of the random partition is

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α^k ∏_{j=1}^k Γ(n_j) h_x(x*_j) h_y(y*_j | x*_j).   (3)

Notice that the marginal likelihood component in (3), say h(x_{1:n}, y_{1:n} | ρ_n), is factorized as the product of the cluster-specific marginal likelihoods h_x(x*_j) h_y(y*_j | x*_j), where

h_x(x*_j) = ∫_Ψ ∏_{i ∈ S_j} K(x_i | ψ) dP_{0ψ}(ψ)   and   h_y(y*_j | x*_j) = ∫_Θ ∏_{i ∈ S_j} K(y_i | x_i, θ) dP_{0θ}(θ).

Given the partition and the data, the conditional distribution of the distinct values is simple, as they are independent across clusters, with posterior densities

p(θ*_j | ρ_n, x_{1:n}, y_{1:n}) ∝ p_{0θ}(θ*_j) ∏_{i ∈ S_j} K(y_i | x_i, θ*_j),
p(ψ*_j | ρ_n, x_{1:n}, y_{1:n}) ∝ p_{0ψ}(ψ*_j) ∏_{i ∈ S_j} K(x_i | ψ*_j).   (4)

Thus, given ρ_n, the cluster-specific parameters (θ*_j, ψ*_j) are updated based only on the observations in cluster S_j. Computations are simple if p_{0θ} and p_{0ψ} are conjugate priors.

The above expressions show the crucial role of the random partition. From equation (3), we have that given the data, subjects are clustered in groups with similar behaviour in the covariate space and similar relationship with the response.
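As a small illustration of (3) (not from the paper), the unnormalized log posterior of a candidate partition can be scored once the cluster-specific marginal likelihoods h_x and h_y are available; here they are passed in as user-supplied functions, which is an assumption of this sketch.

```python
import numpy as np
from scipy.special import gammaln

def log_partition_posterior(labels, x, y, alpha, log_hx, log_hy):
    """Unnormalized log p(rho_n | x, y) from (3):
    k*log(alpha) + sum_j [ log Gamma(n_j) + log h_x(x_j) + log h_y(y_j | x_j) ]."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    logp = len(clusters) * np.log(alpha)
    for j in clusters:
        idx = np.where(labels == j)[0]
        logp += gammaln(len(idx)) + log_hx(x[idx]) + log_hy(y[idx], x[idx])
    return logp
```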

However, even for moderate p the likelihood for x tends to dominate the posterior of the random partition, so that clusters are determined only by similarity in the covariate space. This is particularly evident when the covariates are assumed to be independent locally, i.e. K(x_i | ψ*_j) = ∏_{h=1}^p K(x_{i,h} | ψ*_{j,h}). Clearly, for large p, the scale and magnitude of changes in ∏_{h=1}^p K(x_{i,h} | ψ*_{j,h}) will wash out any information given in the univariate likelihood K(y_i | θ*_j, x_i). Indeed, this is just the behavior we have observed in practice in running simulations for large p (results not shown).

For a simple example demonstrating how the number of components needed to approximate the marginal of X can blow up with p, imagine X is uniformly distributed on a cuboid of side length r > 1. Consider approximating

f_0(x) = (1/r^p) 1(x ∈ [0, r]^p)

by

f_k(x) = ∑_{j=1}^k w_j N_p(x; µ_j, σ²_j I_p).

Since the true distribution of x is uniform on the cube [0, r]^p, to obtain a good approximation, the weighted components must place most of their mass on values of x contained in the cuboid. Let B_σ(µ) denote a ball of radius σ centered at µ. If a random vector V is normally distributed with mean µ and variance σ² I_p, then for 0 < ε < 1,

P(V ∈ B_{σ z(ε)}(µ)) = 1 − ε,   where z(ε)² = (χ²_p)^{-1}(1 − ε),

i.e. the square of z(ε) is the (1 − ε) quantile of the chi-squared distribution with p degrees of freedom. For small ε, this means that the density of V places most of its mass on values contained in a ball of radius σ z(ε) centered at µ. For ε > 0, define

f̃_k(x) = ∑_{j=1}^k w_j N(x; µ_j, σ²_j I_p) 1(x ∈ B_{σ_j z(ε_j)}(µ_j)),   where ε_j = ε/(k w_j).

Then, f̃_k is close to f_k (in the L_1 sense):

∫_{R^p} |f_k(x) − f̃_k(x)| dx = ∫_{R^p} ∑_{j=1}^k w_j N(x; µ_j, σ²_j I_p) 1(x ∈ B^c_{σ_j z(ε_j)}(µ_j)) dx = ε.

For f̃_k to be close to f_0, the parameters µ_j, σ_j, w_j need to be chosen so that the balls B_{σ_j z(ε/(k w_j))}(µ_j) are contained in the cuboid. That means that the centers of the balls are contained in the cuboid,

µ_j ∈ [0, r]^p,   (5)

with further constraints on σ²_j and w_j, so that the radius is small enough. In particular,

σ_j z(ε/(k w_j)) ≤ min(µ_{j,1}, r − µ_{j,1}, ..., µ_{j,p}, r − µ_{j,p}) ≤ r/2.   (6)
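A quick numerical check of the constraint (6) (not part of the paper): for equal weights w_j = 1/k, the largest admissible radius is σ_j z(ε/(k w_j)) ≤ r/2, and the fraction of the cuboid's volume covered by one such ball collapses as p grows. The values of r, k and ε below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

r, k, eps = 2.0, 10, 0.01  # cuboid side, number of components, L1 slack (assumed values)

for p in (1, 5, 10, 15):
    z = np.sqrt(chi2.ppf(1.0 - eps / k, df=p))   # z(eps_j) with w_j = 1/k
    sigma_max = (r / 2.0) / z                    # largest sigma_j allowed by (6)
    # volume of a ball of radius r/2 relative to the cuboid volume r^p
    log_ball = (p / 2.0) * np.log(np.pi) - gammaln(p / 2.0 + 1.0) + p * np.log(r / 2.0)
    ratio = np.exp(log_ball - p * np.log(r))
    print(p, round(sigma_max, 3), ratio)
```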

However, as p increases the volume of the cuboid goes to infinity, but the volume of any ball B_{σ_j z(ε/(k w_j))}(µ_j) defined by (5) and (6) goes to 0 (see Clarke et al. [2009], Section 1.1). Thus, just to reasonably cover the cuboid with the balls of interest, the number of components will increase dramatically, and more so, when we consider the approximation error of the density estimate. Now, as an extreme example, imagine that f_0(y | x) is a linear regression model. Even though one component is sufficient for f_0(y | x), a large number of components will be required to approximate f_0(x), especially as p increases.

It appears evident from (4) that this behavior of the random partition also negatively affects inference on the cluster-specific parameters. In particular, when many kernels are required to approximate the density of X with few observations within each cluster, the posterior for θ*_j may be based on a sample of unnecessarily small size, leading to a flat posterior with an unreliable posterior mean and large influence of the prior.

2.2 Covariate-dependent urn scheme and prediction

Difficulties associated to the behavior of the random partition also deteriorate the predictive performance of the model. Prediction in DP joint mixture models is based on a covariate-dependent urn scheme (Park and Dunson [2010]; Müller and Quintana [2010]), such that conditionally on the partition ρ_n and x_{1:n+1}, the cluster allocation s_{n+1} of a new subject with covariate value x_{n+1} is determined as

s_{n+1} | ρ_n, x_{1:n+1} ~ [ω_{k+1}(x_{n+1})/c_0] δ_{k+1} + ∑_{j=1}^k [ω_j(x_{n+1})/c_0] δ_j,   (7)

where

ω_{k+1}(x_{n+1}) = [α/(α + n)] h_{0,x}(x_{n+1}),
ω_j(x_{n+1}) = [n_j/(α + n)] ∫ K(x_{n+1} | ψ) p(ψ | x*_j) dψ ≡ [n_j/(α + n)] h_{j,x}(x_{n+1}),

and c_0 = p(x_{n+1} | ρ_n, x_{1:n}) is the normalizing constant. This urn scheme is a generalization of the classic Pólya urn scheme that allows the cluster allocation probability to depend on the covariates; the probability of allocation to cluster j depends on the similarity of x_{n+1} to the x_i in cluster j as measured by the predictive density h_{j,x}. See Park and Dunson [2010] for more details.
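The allocation weights in (7) are easy to compute once the prior and cluster-specific predictive densities of the covariates are available; the following sketch (not from the paper) takes them as user-supplied functions, which is an assumption made for illustration.

```python
import numpy as np

def dp_allocation_probs(x_new, cluster_sizes, alpha, h0x, hjx_list):
    """Allocation probabilities from the covariate-dependent urn scheme (7).
    h0x(x) is the prior predictive density of x; hjx_list[j](x) is the predictive
    density of x given the covariates already in cluster j."""
    n = sum(cluster_sizes)
    w_old = [nj / (alpha + n) * hjx(x_new)
             for nj, hjx in zip(cluster_sizes, hjx_list)]
    w_new = alpha / (alpha + n) * h0x(x_new)
    w = np.array(w_old + [w_new])
    return w / w.sum()   # division by the normalizing constant c_0
```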

From the urn scheme (7) one obtains the structure of the prediction. The predictive density at y for a new subject with covariate value x_{n+1} is computed as

f(y | y_{1:n}, x_{1:n+1}) = ∑_{ρ_n} ∑_{s_{n+1}} f(y | y_{1:n}, x_{1:n+1}, ρ_n, s_{n+1}) p(s_{n+1} | y_{1:n}, x_{1:n+1}, ρ_n) p(ρ_n | y_{1:n}, x_{1:n+1})
= ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c_0] h_{0,y}(y | x_{n+1}) + ∑_{j=1}^k [ω_j(x_{n+1})/c_0] h_{j,y}(y | x_{n+1}) } c_0 p(ρ_n | x_{1:n}, y_{1:n}) / p(x_{n+1} | x_{1:n}, y_{1:n})
= ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] h_{0,y}(y | x_{n+1}) + ∑_{j=1}^k [ω_j(x_{n+1})/c] h_{j,y}(y | x_{n+1}) } p(ρ_n | x_{1:n}, y_{1:n}),   (8)

where c = p(x_{n+1} | x_{1:n}, y_{1:n}). Thus, given the partition, the conditional predictive density is a weighted average of the prior guess

h_{0,y}(y | x) = ∫ K(y | x, θ) dP_{0θ}(θ)

and the cluster-specific predictive densities of y at x_{n+1},

h_{j,y}(y | x_{n+1}) = ∫ K(y | x_{n+1}, θ) p(θ | x*_j, y*_j) dθ,

with covariate-dependent weights. The predictive density is obtained by averaging with respect to the posterior of ρ_n. However, for moderate to large p, the posterior of the random partition suffers the drawbacks discussed in the previous subsection. In particular, too many small x-clusters lead to unreliable within-cluster predictives based on small sample sizes. Furthermore, the measure which determines similarity of x_{n+1} and the j-th cluster will be too rigid. Consequently, the resulting overall prediction may be quite poor. These drawbacks will also affect the point prediction, which, under quadratic loss, is

E[Y_{n+1} | y_{1:n}, x_{1:n+1}] = ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] E_0[Y_{n+1} | x_{n+1}] + ∑_{j=1}^k [ω_j(x_{n+1})/c] E_j[Y_{n+1} | x_{n+1}] } p(ρ_n | x_{1:n}, y_{1:n}),   (9)

where E_0[Y_{n+1} | x_{n+1}] is the expectation of Y_{n+1} with respect to h_{0,y} and E_j[Y_{n+1} | x_{n+1}] = E[E[Y_{n+1} | x_{n+1}, θ*_j] | x*_j, y*_j] is the expectation of Y_{n+1} with respect to h_{j,y}.

Example. When K(y | x, θ) = N(y; x̃'β, σ²_y) and the prior for (β, σ²_y) is the multivariate normal-inverse gamma with parameters (β_0, C^{-1}, a_y, b_y), (9) is

∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] x̃'_{n+1} β_0 + ∑_{j=1}^k [ω_j(x_{n+1})/c] x̃'_{n+1} β̂_j } p(ρ_n | x_{1:n}, y_{1:n}),   (10)

where β̂_j = Ĉ_j^{-1}(C β_0 + X*_j' y*_j), Ĉ_j = C + X*_j' X*_j, and X*_j is the n_j by p + 1 matrix with rows x̃'_i for i ∈ S_j, and (8) is

∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] T(y; x̃'_{n+1} β_0, (b_y/a_y) W^{-1}, 2a_y) + ∑_{j=1}^k [ω_j(x_{n+1})/c] T(y; x̃'_{n+1} β̂_j, (b̂_{y,j}/â_{y,j}) W_j^{-1}, 2â_{y,j}) } p(ρ_n | x_{1:n}, y_{1:n}),

where T(·; µ, σ², ν) denotes the density of a random variable V such that (V − µ)/σ has a t-distribution with ν degrees of freedom,

â_{y,j} = a_y + n_j/2,   b̂_{y,j} = b_y + ½ (y*_j − X*_j β_0)'(I_{n_j} − X*_j Ĉ_j^{-1} X*_j')(y*_j − X*_j β_0),
W = 1 − x̃'_{n+1}(C + x̃_{n+1} x̃'_{n+1})^{-1} x̃_{n+1},

and W_j is defined as W with C replaced by Ĉ_j.
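To illustrate the Example, here is a minimal sketch (not the authors' code) of the cluster-specific conjugate update and Student-t predictive under the stated normal-inverse-gamma parametrization; the posterior scale uses the equivalent form (b̂/â)(1 + x̃'Ĉ_j^{-1}x̃), which matches W_j^{-1} as reconstructed above, and the function and variable names are assumptions.

```python
import numpy as np
from scipy.stats import t as student_t

def nig_cluster_predictive(Xj, yj, x_new, beta0, C, a_y, b_y):
    """Cluster-specific predictive of y at x_new: returns (mean, df, scale) of the
    Student-t predictive. Xj: n_j x (p+1) design matrix with rows (1, x_i');
    yj: responses in the cluster."""
    C_hat = C + Xj.T @ Xj
    beta_hat = np.linalg.solve(C_hat, C @ beta0 + Xj.T @ yj)
    a_hat = a_y + len(yj) / 2.0
    resid = yj - Xj @ beta0
    b_hat = b_y + 0.5 * resid @ (resid - Xj @ np.linalg.solve(C_hat, Xj.T @ resid))
    x_t = np.concatenate(([1.0], x_new))
    scale2 = (b_hat / a_hat) * (1.0 + x_t @ np.linalg.solve(C_hat, x_t))
    return x_t @ beta_hat, 2.0 * a_hat, np.sqrt(scale2)

# Density h_{j,y}(y | x_new) on a grid of y values:
# mean, df, scale = nig_cluster_predictive(Xj, yj, x_new, beta0, C, a_y, b_y)
# dens = student_t.pdf(y_grid, df=df, loc=mean, scale=scale)
```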

3 Joint EDP mixture model

As seen, the global clustering of the DP prior on P(Θ × Ψ), the space of probability measures on Θ × Ψ, does not allow one to efficiently model the types of data discussed in the previous section. Instead, it is desirable to use a nonparametric prior that allows many ψ-clusters, to fit the complex marginal of X, and fewer θ-clusters. At the same time, we want to preserve the desirable conjugacy properties of the DP, in order to maintain fairly simple computations. To these aims, our proposal is to replace the DP with the more richly parametrized enriched Dirichlet process of Wade et al. [2011]. The EDP is conjugate and has an analytically computable urn scheme, but it gives a nested partition structure that can model the desired clustering behavior.

Recall that the model (2) was obtained by decomposing the joint kernel as the product of the marginal and conditional kernels. The EDP is a natural alternative for the mixing distribution of this model, as it is similarly based on the idea of expressing the unknown random joint probability measure P of (θ, ψ) in terms of the random marginal and conditionals. This requires the choice of an ordering of θ and ψ, and this choice is problem specific. In the situation described here, it is natural to consider the random marginal distribution P_θ and the random conditional P_{ψ|θ}, to obtain the desired clustering structure. Then, the EDP prior is defined by

P_θ ~ DP(α_θ P_{0θ}),
P_{ψ|θ}(· | θ) ~ DP(α_ψ(θ) P_{0ψ|θ}(· | θ)),   θ ∈ Θ,   (11)

and the P_{ψ|θ}(· | θ) for θ ∈ Θ are independent among themselves and from P_θ. Together these assumptions induce a prior for the random joint P through the joint law of the marginal and conditionals and the mapping (P_θ, P_{ψ|θ}) → ∫ P_{ψ|θ}(· | θ) dP_θ(θ).

The prior is parametrized by the base measure P_0, expressed as

P_0(A × B) = ∫_A P_{0ψ|θ}(B | θ) dP_{0θ}(θ)

for all Borel sets A and B, and by a precision parameter α_θ associated to θ and a collection of precision parameters α_ψ(θ), for every θ ∈ Θ, associated to ψ | θ. Note the contrast with the DP, which only allows one precision parameter to regulate the uncertainty around P_0.

The proposed EDP mixture model for regression is as in (2), but with P ~ EDP(α_θ, α_ψ(θ), P_0) in place of P ~ DP(α P_{0θ} × P_{0ψ}). In general, P_0 is such that θ and ψ are dependent, but here we assume the same structurally conjugate base measure as for the DP model (2), so P_0 = P_{0θ} × P_{0ψ}. Using the square-breaking representation of the EDP (Wade et al. [2011], Proposition 4) and integrating out the (θ_i, ψ_i) parameters, the model for the joint density is

f(x, y | P) = ∑_{j=1}^∞ ∑_{l=1}^∞ w_j w_{l|j} K(x | ψ*_{l|j}) K(y | x, θ*_j).

This gives a mixture model for the conditional densities with more flexible weights

f(y | x, P) = ∑_{j=1}^∞ [w_j ∑_{l=1}^∞ w_{l|j} K(x | ψ*_{l|j}) / ∑_{j'=1}^∞ w_{j'} ∑_{l'=1}^∞ w_{l'|j'} K(x | ψ*_{l'|j'})] K(y | x, θ*_j) ≡ ∑_{j=1}^∞ w_j(x) K(y | x, θ*_j).

3.1 Random partition and inference

The advantage of the EDP is the implied nested clustering. The EDP partitions subjects in θ-clusters and ψ-subclusters within each θ-cluster, allowing the use of more kernels to describe the marginal of X for each kernel used for the conditional of Y | x. The random partition model induced from the EDP can be described as a nested Chinese Restaurant Process (nCRP). First, customers choose restaurants according to the CRP induced by P_θ ~ DP(α_θ P_{0θ}), that is, with probability proportional to the number n_j of customers eating at restaurant j, the (n + 1)-th customer eats at restaurant j, and with probability proportional to α_θ, she eats at a new restaurant. Restaurants are then colored iid with colors θ*_j ~ P_{0θ}. Within restaurant j, customers sit at tables as in the CRP induced by P_{ψ|θ*_j} ~ DP(α_ψ(θ*_j) P_{0ψ|θ}(· | θ*_j)). Tables in restaurant j are then colored with colors ψ*_{l|j} ~ iid P_{0ψ|θ}(· | θ*_j).
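A minimal simulation of the nested restaurant process just described (not from the paper; the constant α_ψ, the seed, and the parameter values are illustrative assumptions):

```python
import numpy as np

def nested_crp(n, alpha_theta, alpha_psi, rng):
    """Sample a nested partition: restaurant (theta-cluster) and table
    (psi-subcluster) labels for n customers."""
    rest, table = [], []     # labels per customer
    rest_sizes = []          # customers per restaurant
    table_sizes = []         # table_sizes[j] = list of table counts in restaurant j
    for i in range(n):
        # choose a restaurant: existing ones w.p. prop. to size, new w.p. prop. to alpha_theta
        probs = np.array(rest_sizes + [alpha_theta], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(rest_sizes):
            rest_sizes.append(0)
            table_sizes.append([])
        rest_sizes[j] += 1
        # choose a table within restaurant j
        tprobs = np.array(table_sizes[j] + [alpha_psi], dtype=float)
        l = rng.choice(len(tprobs), p=tprobs / tprobs.sum())
        if l == len(table_sizes[j]):
            table_sizes[j].append(0)
        table_sizes[j][l] += 1
        rest.append(j)
        table.append(l)
    return np.array(rest), np.array(table)

rest, table = nested_crp(200, alpha_theta=1.0, alpha_psi=5.0, rng=np.random.default_rng(1))
print(len(np.unique(rest)), [len(np.unique(table[rest == j])) for j in np.unique(rest)])
```

A small α_θ with a larger α_ψ tends to produce few restaurants (θ-clusters) with several tables (ψ-subclusters) each, which is exactly the structure motivated above.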

This differs from the nCRP proposed by Blei et al. [2010], which had the alternative aim of learning topic hierarchies by clustering parameters (topics) hierarchically along a tree of infinite CRPs. In particular, each subject follows a path down the tree according to a sequence of nested CRPs and the parameters of subject i are associated with the cluster visited at a latent subject-specific level l of this path. Although related, the EDP is not a special case with the tree depth fixed to 2; the EDP defines a prior on a multivariate random probability measure on Θ × Ψ and induces a nested partition of the multivariate parameter (θ, ψ), where the first level of the tree corresponds to the clustering of the θ parameters and the second corresponds to the clustering of the ψ parameters. A generalization of the EDP to a depth of D is related to Blei et al.'s nCRP with depth D, but only if one regards the parameters of each subject as the vector of parameters (ξ_1, ..., ξ_D) associated to each level of the tree. Furthermore, this generalization of the EDP would allow a more flexible specification of the mass parameters and possible correlation among the nested parameters.

The nested partition of the EDP is described by ρ_n = (ρ_{n,y}, ρ_{n,x}), where ρ_{n,y} = (s_{y,1}, ..., s_{y,n}) and ρ_{n,x} = (s_{x,1}, ..., s_{x,n}), with s_{y,i} = j if θ_i = θ*_j, the j-th distinct θ-value in order of appearance, and s_{x,i} = l if ψ_i = ψ*_{l|j}, the l-th color that appeared inside the j-th θ-cluster. Additionally, we use the notation S_{j+} = {i : s_{y,i} = j}, with size n_j, j = 1, ..., k, and S_{l|j} = {i : s_{y,i} = j, s_{x,i} = l}, with size n_{l|j}, l = 1, ..., k_j. The unique parameters will be denoted by θ* = (θ*_j)_{j=1}^k and ψ* = (ψ*_{1|j}, ..., ψ*_{k_j|j})_{j=1}^k. Furthermore, we use the notation ρ_{n_j,x} = (s_{x,i} : i ∈ S_{j+}) and y*_j = {y_i : i ∈ S_{j+}}, x*_j = {x_i : i ∈ S_{j+}}, x*_{l|j} = {x_i : i ∈ S_{l|j}}.

Proposition 3.1 The probability law of the nested random partition defined from the EDP is

p(ρ_n) = [Γ(α_θ)/Γ(α_θ + n)] α_θ^k ∏_{j=1}^k { ∫_Θ α_ψ(θ)^{k_j} [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) }.

Proof. From independence of the random conditional distributions among θ ∈ Θ,

p(ρ_n, θ*) = p(ρ_{n,y}) ∏_{j=1}^k p_{0θ}(θ*_j) p(ρ_{n,x} | ρ_{n,y}, θ*) = p(ρ_{n,y}) ∏_{j=1}^k p_{0θ}(θ*_j) p(ρ_{n_j,x} | θ*_j).

Next, using the results on the random partition model of the DP (Antoniak [1974]), we have

p(ρ_n, θ*) = [Γ(α_θ)/Γ(α_θ + n)] α_θ^k ∏_{j=1}^k p_{0θ}(θ*_j) α_ψ(θ*_j)^{k_j} [Γ(α_ψ(θ*_j)) Γ(n_j) / Γ(α_ψ(θ*_j) + n_j)] ∏_{l=1}^{k_j} Γ(n_{l|j}).

Integrating out θ* leads to the result.

From Proposition 3.1, we gain an understanding of the types of partitions preferred by the EDP and the effect of the parameters. If for all θ, α_ψ(θ) = α_θ P_{0θ}({θ}), that is α_ψ(θ) = 0 if P_{0θ} is nonatomic, we are back to the DP random partition model, see Proposition 2 of Wade et al. [2011].

In the case when P_{0θ} is nonatomic, this means that the conditional P_{ψ|θ} is degenerate at some random location with probability one (for each restaurant, one table). In general, α_ψ(θ) may be a flexible function of θ, reflecting the fact that within some θ-clusters more kernels may be required for good approximation of the marginal of X. In practice, a common situation that we observe is a high value of α_ψ(θ) for average values of θ and lower values of α_ψ(θ) for more extreme θ values, capturing homogeneous outlying groups. In this case, a small value of α_θ will encourage few θ-clusters, and, given θ*, a large α_ψ(θ*_j) will encourage more ψ-clusters within the j-th θ-cluster. The term ∏_{j=1}^k ∏_{l=1}^{k_j} Γ(n_{l|j}) will encourage asymmetrical (θ, ψ)-clusters, preferring one large cluster and several small clusters, while, given θ*, the term involving the product of beta functions contains parts that both encourage and discourage asymmetrical θ-clusters. In the special case when α_ψ(θ) = α_ψ for all θ ∈ Θ, the random partition model simplifies to

p(ρ_n) = [Γ(α_θ)/Γ(α_θ + n)] α_θ^k ∏_{j=1}^k α_ψ^{k_j} [Γ(α_ψ) Γ(n_j) / Γ(α_ψ + n_j)] ∏_{l=1}^{k_j} Γ(n_{l|j}).

In this case, the tendency of the term involving the product of beta functions is to slightly prefer asymmetrical θ-clusters, with large values of α_ψ boosting this preference.

As discussed in the previous section, the random partition plays a crucial role, as its posterior distribution affects both inference on the cluster-specific parameters and prediction. For the EDP, it is given by the following proposition.

Proposition 3.2 The posterior of the random partition of the EDP model is

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α_θ^k ∏_{j=1}^k { ∫_Θ α_ψ(θ)^{k_j} [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] dP_{0θ}(θ) } h_y(y*_j | x*_j) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

The proof relies on a simple application of Bayes theorem. In the case of constant α_ψ(θ), the expression for the posterior of ρ_n simplifies to

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α_θ^k ∏_{j=1}^k α_ψ^{k_j} [Γ(α_ψ) Γ(n_j) / Γ(α_ψ + n_j)] h_y(y*_j | x*_j) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

Again, as in (3), the marginal likelihood component in the posterior distribution of ρ_n is the product of the cluster-specific marginal likelihoods, but now the nested clustering structure of the EDP separates the factors relative to y | x and x, being

h(x_{1:n}, y_{1:n} | ρ_n) = ∏_{j=1}^k h_y(y*_j | x*_j) ∏_{l=1}^{k_j} h_x(x*_{l|j}).

Even if the x-likelihood favors many ψ-clusters, now these can be obtained by sub-partitioning a coarser θ-partition, and the number k of θ-clusters can be expected to be much smaller than in (3).
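For the constant-α_ψ case, the nested analogue of the earlier partition score is a small extension; again this is a sketch with user-supplied marginal likelihoods, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def log_nested_partition_posterior(s_y, s_x, x, y, a_theta, a_psi, log_hx, log_hy):
    """Unnormalized log posterior of a nested partition (Proposition 3.2, constant alpha_psi).
    s_y: theta-cluster labels; s_x: psi-subcluster labels within each theta-cluster."""
    s_y, s_x = np.asarray(s_y), np.asarray(s_x)
    logp = len(np.unique(s_y)) * np.log(a_theta)
    for j in np.unique(s_y):
        idx = np.where(s_y == j)[0]
        n_j = len(idx)
        subclusters = np.unique(s_x[idx])
        logp += (len(subclusters) * np.log(a_psi)
                 + gammaln(a_psi) + gammaln(n_j) - gammaln(a_psi + n_j)
                 + log_hy(y[idx], x[idx]))
        for l in subclusters:
            sub = idx[s_x[idx] == l]
            logp += gammaln(len(sub)) + log_hx(x[sub])
    return logp
```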

Further insights into the behavior of the random partition are given by the induced covariate-dependent random partition of the θ_i parameters given the covariates, which is detailed in the following propositions. We will use the notation P_n to denote the set of all possible partitions of the first n integers.

Proposition 3.3 The covariate-dependent random partition model induced by the EDP prior is

p(ρ_{n,y} | x_{1:n}) ∝ α_θ^k ∏_{j=1}^k ∑_{ρ_{n_j,x} ∈ P_{n_j}} ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

Proof. An application of Bayes theorem implies that

p(ρ_n | x_{1:n}) ∝ α_θ^k ∏_{j=1}^k ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).   (12)

Integrating over ρ_{n,x}, or equivalently summing over all ρ_{n_j,x} in P_{n_j} for j = 1, ..., k, leads to

p(ρ_{n,y} | x_{1:n}) ∝ ∑_{ρ_{n_1,x}} ... ∑_{ρ_{n_k,x}} α_θ^k ∏_{j=1}^k ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}),

and, finally, since (12) is the product over the j terms, we can pull the sum over ρ_{n_j,x} within the product.

This covariate-dependent random partition model will favor θ-partitions of the subjects which can be further partitioned into groups with similar covariates, where a partition with many desirable sub-partitions will have higher mass.

Proposition 3.4 The posterior of the random covariate-dependent partition induced from the EDP model is

p(ρ_{n,y} | x_{1:n}, y_{1:n}) ∝ α_θ^k ∏_{j=1}^k h_y(y*_j | x*_j) ∑_{ρ_{n_j,x} ∈ P_{n_j}} ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

The proof is similar in spirit to that of Proposition 3.3. Notice the preferred θ-partitions will consist of clusters with a similar relationship between y and x, as measured by the local marginal model h_y for y given x, and similar x behavior, which is measured much more flexibly as a mixture of the previous marginal local models.

The behavior of the random partition, detailed above, has important implications for the posterior of the unique parameters.

Conditionally on the partition, the cluster-specific parameters (θ*, ψ*) are still independent, with posterior density

p(θ*, ψ* | y_{1:n}, x_{1:n}, ρ_n) = ∏_{j=1}^k p(θ*_j | y*_j, x*_j) ∏_{l=1}^{k_j} p(ψ*_{l|j} | x*_{l|j}),

where

p(θ*_j | y*_j, x*_j) ∝ p_{0θ}(θ*_j) ∏_{i ∈ S_{j+}} K(y_i | θ*_j, x_i),   p(ψ*_{l|j} | x*_{l|j}) ∝ p_{0ψ}(ψ*_{l|j}) ∏_{i ∈ S_{l|j}} K(x_i | ψ*_{l|j}).

The important point is that the posterior of θ*_j can now be updated with much larger sample sizes if the data determines that a coarser θ-partition is present. This will result in a more reliable posterior mean, a smaller posterior variance, and larger influence of the data compared with the prior.

3.2 Covariate-dependent urn scheme and prediction

Similar to the DP model, computation of the predictive estimates relies on a covariate-dependent urn scheme. For the EDP, we have

s_{y,n+1} | ρ_n, x_{1:n+1}, y_{1:n} ~ [ω_{k+1}(x_{n+1})/c_0] δ_{k+1} + ∑_{j=1}^k [ω_j(x_{n+1})/c_0] δ_j,   (13)

where c_0 = p(x_{n+1} | ρ_n, x_{1:n}), but now the expression of the ω_j(x_{n+1}) takes into account the possible allocation of x_{n+1} in subgroups, being

ω_j(x_{n+1}) = ∑_{s_{x,n+1}} p(x_{n+1} | ρ_n, x_{1:n}, y_{1:n}, s_{y,n+1} = j, s_{x,n+1}) p(s_{x,n+1} | ρ_n, x_{1:n}, y_{1:n}, s_{y,n+1} = j).

From this, it can be easily found that

ω_{k+1}(x_{n+1}) = [α_θ/(α_θ + n)] h_{0,x}(x_{n+1}),
ω_j(x_{n+1}) = [n_j/(α_θ + n)] { ∑_{l=1}^{k_j} π_{l|j} h_{l|j,x}(x_{n+1}) + π_{k_j+1|j} h_{0,x}(x_{n+1}) },

where

π_{l|j} = ∫ [n_{l|j} / (α_ψ(θ) + n_j)] dP(θ | x*_j, y*_j),   π_{k_j+1|j} = ∫ [α_ψ(θ) / (α_ψ(θ) + n_j)] dP(θ | x*_j, y*_j).

In the case of constant α_ψ(θ) = α_ψ, these expressions simplify to

π_{l|j} = n_{l|j} / (α_ψ + n_j),   π_{k_j+1|j} = α_ψ / (α_ψ + n_j).
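For constant α_ψ, the EDP allocation weights are a small modification of the DP version; the sketch below (not from the paper) again treats the x-predictive densities as user-supplied functions, which is an assumption made for illustration.

```python
import numpy as np

def edp_allocation_probs(x_new, subcluster_sizes, alpha_theta, alpha_psi, h0x, hljx):
    """Allocation probabilities for the theta-cluster of a new subject, eq. (13),
    with constant alpha_psi. subcluster_sizes[j] is the list of n_{l|j};
    hljx[j][l](x) is the predictive density of x given the covariates in subcluster (l|j)."""
    n = sum(sum(sizes) for sizes in subcluster_sizes)
    w = []
    for sizes, preds in zip(subcluster_sizes, hljx):
        n_j = sum(sizes)
        mix = sum(n_lj / (alpha_psi + n_j) * h(x_new) for n_lj, h in zip(sizes, preds))
        mix += alpha_psi / (alpha_psi + n_j) * h0x(x_new)
        w.append(n_j / (alpha_theta + n) * mix)
    w.append(alpha_theta / (alpha_theta + n) * h0x(x_new))
    w = np.array(w)
    return w / w.sum()
```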

Notice that (13) is similar to the covariate-dependent urn scheme of the DP model. The important difference is that the weights, which measure the similarity between x_{n+1} and the j-th cluster, are much more flexible. It follows, from similar computations as in Section 2.2, that the predictive density at y for a new subject with a covariate value of x_{n+1} is

f(y | x_{1:n}, y_{1:n}, x_{n+1}) = ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] h_{0,y}(y | x_{n+1}) + ∑_{j=1}^k [ω_j(x_{n+1})/c] h_{j,y}(y | x_{n+1}) } p(ρ_n | x_{1:n}, y_{1:n}),   (14)

where c = p(x_{n+1} | x_{1:n}, y_{1:n}). Under the squared error loss function, the point prediction of y_{n+1} is

E[Y_{n+1} | y_{1:n}, x_{1:n+1}] = ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] E_0[Y_{n+1} | x_{n+1}] + ∑_{j=1}^k [ω_j(x_{n+1})/c] E_j[Y_{n+1} | x_{n+1}] } p(ρ_n | x_{1:n}, y_{1:n}).   (15)

The expressions for the predictive density (14) and point prediction (15) are quite similar to those of the DP, (8) and (9), respectively; in both cases, the cluster-specific predictive estimates are averaged with covariate-dependent weights. However, there are two important differences for the EDP model. The first is that the weights in (13) are defined with a more flexible kernel; in fact, it is a mixture of the original kernels used in the DP model. This means that we have a more flexible measure of similarity in the covariate space. The second difference is that k will be smaller and n_j will be larger with a high posterior probability, leading to a more reliable posterior distribution of θ*_j due to larger sample sizes and better cluster-specific predictive estimates. We will demonstrate the advantage of these two key differences in simulated and applied examples in Section 5.

4 Computations

Inference for the EDP model cannot be obtained analytically and must therefore be approximated. To obtain approximate inference, we rely on Markov Chain Monte Carlo (MCMC) methods and consider an extension of Algorithm 2 of Neal [2000] for the DP mixture model. In this approach, the random probability measure, P, is integrated out, and the model is viewed in terms of (ρ_n, θ*, ψ*). This algorithm requires the use of conjugate base measures P_{0θ} and P_{0ψ}. To deal with non-conjugate base measures, the approach used in Algorithm 8 of Neal [2000] can be directly adapted to the EDP mixture model.

Algorithm 2 is a Gibbs sampler which first samples the cluster label of each subject conditional on the partition of all other subjects, the data, and (θ*, ψ*), and then samples (θ*, ψ*) given the partition and the data. The first step can be easily performed thanks to the Pólya urn which marginalizes the DP. Extending Algorithm 2 for the EDP model is straightforward, since the EDP maintains a simple, analytically computable urn scheme. In particular, the conditional probabilities p(s_i | ρ_{n-1}^{-i}, θ*, ψ*, x_{1:n}, y_{1:n}) (provided in the Appendix) have a simple closed form, which allows conditional sampling of the individual cluster membership indicators s_i, where ρ_{n-1}^{-i} denotes the partition of the n − 1 subjects with the i-th subject removed. To improve mixing, we include an additional Metropolis-Hastings step; at each iteration, after performing the n Gibbs updates for each s_i, we propose a shuffle of the nested partition structure obtained by moving a ψ-cluster to be nested within a different or new θ-cluster. This move greatly improves mixing. A detailed description of the sampler, including the Metropolis-Hastings step, can be found in the Appendix.

MCMC produces approximate samples, {ρ^s_n, ψ^{*s}, θ^{*s}}_{s=1}^S, from the posterior. The prediction given in equation (15) can be approximated by

(1/S) ∑_{s=1}^S { [ω^s_{k+1}(x_{n+1}) / ĉ] E_{h_y}[Y_{n+1} | x_{n+1}] + ∑_{j=1}^{k^s} [ω^s_j(x_{n+1}) / ĉ] E_F[Y_{n+1} | x_{n+1}, θ^{*s}_j] },

where the ω^s_j(x_{n+1}), for j = 1, ..., k^s + 1, are as previously defined in (13) with (ρ_n, ψ*, θ*) replaced by (ρ^s_n, ψ^{*s}, θ^{*s}), and

ĉ = (1/S) ∑_{s=1}^S { ω^s_{k+1}(x_{n+1}) + ∑_{j=1}^{k^s} ω^s_j(x_{n+1}) }.

For the predictive density estimate at x_{n+1}, we define a grid of new y values and for each y in the grid, we compute

(1/S) ∑_{s=1}^S { [ω^s_{k+1}(x_{n+1}) / ĉ] h_y(y | x_{n+1}) + ∑_{j=1}^{k^s} [ω^s_j(x_{n+1}) / ĉ] K(y | x_{n+1}, θ^{*s}_j) }.   (16)

Note that hyperpriors may be included for the precision parameters, α_θ and α_ψ(·), and the parameters of the base measures. For the simulated examples and application, a Gamma hyperprior is assigned to α_θ, and the α_ψ(θ) for θ ∈ Θ are assumed to be i.i.d. from a Gamma hyperprior. At each iteration, α^s_θ and α^s_ψ(θ) at θ^{*s}_j, for j = 1, ..., k^s, are approximate samples from the posterior using the method described in Escobar and West [1995].
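As an illustration of the Monte Carlo approximation above (not the authors' code), given per-sample weights and cluster-specific predictive means, the point prediction is a simple weighted average; the argument names are assumptions.

```python
import numpy as np

def mc_point_prediction(weights_per_sample, means_per_sample, prior_mean):
    """Approximate E[Y_{n+1} | data, x_{n+1}] from S MCMC samples.
    weights_per_sample[s] = [w_1, ..., w_{k_s}, w_{k_s+1}] (last entry: new-cluster weight);
    means_per_sample[s] = [E[Y|x, theta_1^s], ..., E[Y|x, theta_{k_s}^s]];
    prior_mean = E[Y|x] under the prior predictive h_y."""
    S = len(weights_per_sample)
    c_hat = sum(np.sum(w) for w in weights_per_sample) / S
    total = 0.0
    for w, m in zip(weights_per_sample, means_per_sample):
        total += np.sum(np.asarray(w[:-1]) * np.asarray(m)) + w[-1] * prior_mean
    return total / (S * c_hat)
```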

5 Simulated example

We consider a toy example that demonstrates two key advantages of the EDP model: first, it can recover the true coarser θ-partition; second, improved prediction and smaller credible intervals result. The example shows that these advantages are evident even for a moderate value of p, with more drastic differences as p increases.

A data set of n = 200 points was generated where only the first covariate is a predictor for Y. The true model for Y is a nonlinear regression model obtained as a mixture of two normals with linear regression functions and weights depending only on the first covariate;

Y_i | x_i ~ ind p(x_{i,1}) N(y_i; β_{1,0} + β_{1,1} x_{i,1}, σ²_1) + (1 − p(x_{i,1})) N(y_i; β_{2,0} + β_{2,1} x_{i,1}, σ²_2),

where

p(x_{i,1}) = τ_1 exp(−τ (x_{i,1} − µ_1)²) / [τ_1 exp(−τ (x_{i,1} − µ_1)²) + τ_2 exp(−τ (x_{i,1} − µ_2)²)],

with β_1 = (0, 1), σ²_1 = 1/16, β_2 = (4.5, 0.1), σ²_2 = 1/8 and µ_1 = 4, µ_2 = 6, τ_1 = τ_2 = 2. The covariates are sampled from a multivariate normal,

X_i = (X_{i,1}, ..., X_{i,p}) ~ iid N(µ, Σ),   (17)

centered at µ = (4, ..., 4) with a standard deviation of 2 along each dimension, i.e. Σ_{h,h} = 4. The covariance matrix Σ models two groups of covariates: those in the first group are positively correlated among each other and the first covariate, but independent of the second group of covariates, which are positively correlated among each other but independent of the first covariate. In particular, we take Σ_{h,l} = 3.5 for h ≠ l in {1, 2, 4, ..., 2⌊p/2⌋} or h ≠ l in {3, 5, ..., 2⌊(p−1)/2⌋ + 1}, and Σ_{h,l} = 0 for all other cases of h ≠ l.

We examine both the DP and EDP mixture models;

Y_i | x_i, β_i, σ²_{y,i} ~ ind N(x̃'_i β_i, σ²_{y,i}),   X_i | µ_i, σ²_{x,i} ~ ind ∏_{h=1}^p N(µ_{i,h}, σ²_{x,h,i}),
(β_i, σ²_{y,i}, µ_i, σ²_{x,i}) | P ~ iid P,

with P ~ DP or P ~ EDP. The conjugate base measure is selected; P_{0θ} is a multivariate normal-inverse gamma prior and P_{0ψ} is the product of p normal-inverse gamma priors, that is,

p_{0θ}(β, σ²_y) = N(β; β_0, σ²_y C^{-1}) IG(σ²_y; a_y, b_y),

and

p_{0ψ}(µ, σ²_x) = ∏_{h=1}^p N(µ_h; µ_{0,h}, σ²_{x,h} c_h^{-1}) IG(σ²_{x,h}; a_{x,h}, b_{x,h}).

For both models, we use the same subjective choice of the parameters of the base measure. In particular, we center the base measure on an average of the true parameter values with enough variability to recover the true model.

Figure 1: The partition with the highest estimated posterior probability, for the DP (panels a-d) and the EDP (panels e-h) with p = 1, 5, 10, 15. Data points are plotted in the x_1 vs y space and colored by cluster membership, with the estimated posterior probability included in the plot title.

Table 1: The posterior mode of the number of clusters, denoted k̂ for both models, and the posterior mean of α (DP) and α_θ (EDP), denoted α̂ for both models, for p = 1, 5, 10, 15.

A list of the prior parameters can be found in the Appendix. We assign hyperpriors to the mass parameters, where for the DP model, α ~ Gamma(1, 1), and for the EDP model, α_θ ~ Gamma(1, 1) and α_ψ(β, σ²_y) ~ iid Gamma(1, 1) for all (β, σ²_y) ∈ R^{p+1} × R^+.

The computational procedures described in Section 4 were used to obtain posterior inference with 20,000 iterations and a burn-in period of 5,000. An examination of the trace and autocorrelation plots for the subject-specific parameters (β_i, σ²_{y,i}, µ_i, σ²_{x,i}) provided evidence of convergence. Additional criteria for assessing the convergence of the chain, in particular, the Geweke diagnostic, also suggested convergence, and the results are given in Table A.1 of the Appendix (see the R package coda for implementation and further details of the diagnostic). It should be noted that running times for both models are quite similar, although slightly faster for the DP.
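For concreteness, here is a minimal sketch (not the authors' code) of the data-generating process described at the start of this section; the value of τ and the 0-based index sets for the correlated covariate blocks are assumptions based on the reconstruction above.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 5
mu1, mu2, tau, tau1, tau2 = 4.0, 6.0, 2.0, 2.0, 2.0  # tau assumed equal to tau1 = tau2
beta1, s1 = np.array([0.0, 1.0]), np.sqrt(1.0 / 16)
beta2, s2 = np.array([4.5, 0.1]), np.sqrt(1.0 / 8)

# covariates: N_p(mu, Sigma) with two positively correlated blocks
mu = np.full(p, 4.0)
Sigma = np.zeros((p, p))
block1 = [h for h in range(p) if h == 0 or h % 2 == 1]   # covariates 1, 2, 4, ... (1-based)
block2 = [h for h in range(p) if h != 0 and h % 2 == 0]  # covariates 3, 5, ... (1-based)
for block in (block1, block2):
    for h in block:
        for l in block:
            Sigma[h, l] = 3.5
np.fill_diagonal(Sigma, 4.0)
X = rng.multivariate_normal(mu, Sigma, size=n)

# response: mixture of two linear regressions with x1-dependent weights
w1 = tau1 * np.exp(-tau * (X[:, 0] - mu1) ** 2)
w2 = tau2 * np.exp(-tau * (X[:, 0] - mu2) ** 2)
comp1 = rng.random(n) < w1 / (w1 + w2)
y = np.where(comp1,
             beta1[0] + beta1[1] * X[:, 0] + s1 * rng.standard_normal(n),
             beta2[0] + beta2[1] * X[:, 0] + s2 * rng.standard_normal(n))
```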

Figure 2: The point predictions for 20 test samples of the covariates are plotted against x_1 and represented with circles (DP in blue, panels a-d, and EDP in red, panels e-h, for p = 1, 5, 10, 15), with the true predictions as black stars. The bars about the predictions depict the 95% credible intervals.

Table 2: Prediction error (empirical l̂_1 and l̂_2) for both models, for p = 1, 5, 10, 15.

The first main point to emphasize is the improved behavior of the posterior of the random partition for the EDP. We note that for both models, the posterior of the partition is spread out. This is because the space of partitions is very large and many partitions are very similar, differing only in a few subjects; thus, many partitions fit the data well. We depict representative partitions of both models with increasing p in Figure 1. Observations are plotted in the x_1 vs y space and colored according to the partition for the DP and the θ-partition for the EDP. As expected, for p = 1 the DP does well at recovering the true partition, but as clearly seen from Figure 1, for large values of p, the DP partition is comprised of many clusters, which are needed to approximate the multivariate density of X. In fact, the density of Y | x can be recovered with only two kernels regardless of p, and the θ-partitions of the EDP depicted in Figure 1, with only two θ-clusters, are very similar to the true configuration, even for increasing p.

Figure 3: Predictive density estimates (DP in blue and EDP in red) for the first new subject, for p = 1, 5, 10, 15, with the true conditional density in black. The pointwise 95% credible intervals are depicted with dashed lines (DP in blue and EDP in red).

Figure 4: Predictive density estimates (DP in blue and EDP in red) for the fifth new subject, for p = 1, 5, 10, 15, with the true conditional density in black. The pointwise 95% credible intervals are depicted with dashed lines (DP in blue and EDP in red).

Table 3: l_1 density regression error for both models, for p = 1, 5, 10, 15.

On the other hand, the (θ, ψ)-partition of the EDP (not shown) consists of many clusters and resembles the partition of the DP model. This behavior is representative of the posterior distribution of the random partition that, for the DP, has a large posterior mode of k and a large posterior mean of α for larger values of p, while most of the EDP θ-partitions are composed of only 2 clusters with only a handful of subjects placed in the incorrect cluster, and the posterior mean of α_θ is much smaller for large p. Table 1 summarizes the posterior of k for both models and the posterior of the precision parameters α, α_θ.

It is interesting to note that posterior samples of α_ψ(θ) for θ characteristic of the first cluster tend to be higher than posterior samples of α_ψ(θ) for the second cluster; that is, more clusters are needed to approximate the density of X within the first cluster. A non-constant α_ψ(θ) allows us to capture this behavior.

As discussed in Section 3, a main point of our proposal is the increased finite sample efficiency of the EDP model. To illustrate this, we create a test set with m = 200 new covariate values simulated from (17) with maximum dimension p = 15 and compute the true regression function E[Y_{n+j} | x_{n+j}] and conditional density f(y | x_{n+j}) for each new subject. To quantify the gain in efficiency of the EDP model, we calculate the point prediction and predictive density estimates from both models and compare them with the truth. Judging from both the empirical l_1 and l_2 prediction errors, the EDP model outperforms the DP model, with greater improvement for larger p; see Table 2. Figure 2 displays the prediction against x_1 for 20 new subjects. Recall that each new x_1 is associated with different values of (x_2, ..., x_p), which accounts for the somewhat erratic behavior of the prediction as a function of x_1 for increasing p. The comparison of the credible intervals is quite interesting. For p > 1, the unnecessarily wide credible intervals for the DP regression model stand out in the first row of Figure 2. This is due to small cluster sample sizes for the DP model with p > 1.

The density regression estimates for all new subjects were computed by evaluating (16) at a grid of y-values. As a measure of the performance of the models, the empirical l_1 distance between the true and estimated conditional densities for each of the new covariate values is shown in Table 3. Again the EDP model outperforms the DP model. Figures 3 and 4 display the true conditional density (in black) for two new covariate values with the estimated conditional densities in blue for the DP and red for the EDP. It is evident that for p > 1 the density regression estimates are improved and that the pointwise 95% credible intervals are almost uniformly wider both in y and x for the DP model, sometimes drastically so. It is important to note that while the credible intervals of the EDP model are considerably tighter, they still contain the true density.

6 Alzheimer's disease study

Our primary goal in this section is to show that the EDP model leads to improved inference in a real data study with the important goal of diagnosing Alzheimer's disease (AD). In particular, the EDP leads to improved predictive accuracy, tighter credible intervals around the predictive probability of having the disease, and a more interpretable clustering structure.

Alzheimer's disease is a prevalent form of dementia that slowly destroys memory and thinking skills, and eventually even the ability to carry out the simplest tasks. Unfortunately, a definitive diagnosis cannot be made until autopsy. However, the brain may show severe evidence of neurobiological damage even at early stages of the disease before the onset of memory disturbances. As this damage may be difficult


More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Hybrid Dirichlet processes for functional data

Hybrid Dirichlet processes for functional data Hybrid Dirichlet processes for functional data Sonia Petrone Università Bocconi, Milano Joint work with Michele Guindani - U.T. MD Anderson Cancer Center, Houston and Alan Gelfand - Duke University, USA

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Lecture 16: Mixtures of Generalized Linear Models

Lecture 16: Mixtures of Generalized Linear Models Lecture 16: Mixtures of Generalized Linear Models October 26, 2006 Setting Outline Often, a single GLM may be insufficiently flexible to characterize the data Setting Often, a single GLM may be insufficiently

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

CMPS 242: Project Report

CMPS 242: Project Report CMPS 242: Project Report RadhaKrishna Vuppala Univ. of California, Santa Cruz vrk@soe.ucsc.edu Abstract The classification procedures impose certain models on the data and when the assumption match the

More information

Chapter 2. Data Analysis

Chapter 2. Data Analysis Chapter 2 Data Analysis 2.1. Density Estimation and Survival Analysis The most straightforward application of BNP priors for statistical inference is in density estimation problems. Consider the generic

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Normalized kernel-weighted random measures

Normalized kernel-weighted random measures Normalized kernel-weighted random measures Jim Griffin University of Kent 1 August 27 Outline 1 Introduction 2 Ornstein-Uhlenbeck DP 3 Generalisations Bayesian Density Regression We observe data (x 1,

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Latent factor density regression models

Latent factor density regression models Biometrika (2010), 97, 1, pp. 1 7 C 2010 Biometrika Trust Printed in Great Britain Advance Access publication on 31 July 2010 Latent factor density regression models BY A. BHATTACHARYA, D. PATI, D.B. DUNSON

More information

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Bayesian Nonparametrics: Models Based on the Dirichlet Process Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Bayesian Nonparametric Models

Bayesian Nonparametric Models Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

arxiv: v1 [stat.me] 3 Mar 2013

arxiv: v1 [stat.me] 3 Mar 2013 Anjishnu Banerjee Jared Murray David B. Dunson Statistical Science, Duke University arxiv:133.449v1 [stat.me] 3 Mar 213 Abstract There is increasing interest in broad application areas in defining flexible

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

36-720: Latent Class Models

36-720: Latent Class Models 36-720: Latent Class Models Brian Junker October 17, 2007 Latent Class Models and Simpson s Paradox Latent Class Model for a Table of Counts An Intuitive E-M Algorithm for Fitting Latent Class Models Deriving

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

Dirichlet Enhanced Latent Semantic Analysis

Dirichlet Enhanced Latent Semantic Analysis Dirichlet Enhanced Latent Semantic Analysis Kai Yu Siemens Corporate Technology D-81730 Munich, Germany Kai.Yu@siemens.com Shipeng Yu Institute for Computer Science University of Munich D-80538 Munich,

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Nonparametric Bayes Uncertainty Quantification

Nonparametric Bayes Uncertainty Quantification Nonparametric Bayes Uncertainty Quantification David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & ONR Review of Bayes Intro to Nonparametric Bayes

More information

Supplementary Material: A Robust Approach to Sequential Information Theoretic Planning

Supplementary Material: A Robust Approach to Sequential Information Theoretic Planning Supplementar aterial: A Robust Approach to Sequential Information Theoretic Planning Sue Zheng Jason Pacheco John W. Fisher, III. Proofs of Estimator Properties.. Proof of Prop. Here we show how the bias

More information

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models William Barcella 1, Maria De Iorio 1 and Gianluca Baio 1 1 Department of Statistical Science,

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Bayesian spatial hierarchical modeling for temperature extremes

Bayesian spatial hierarchical modeling for temperature extremes Bayesian spatial hierarchical modeling for temperature extremes Indriati Bisono Dr. Andrew Robinson Dr. Aloke Phatak Mathematics and Statistics Department The University of Melbourne Maths, Informatics

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Peter Green and John Lau University of Bristol P.J.Green@bristol.ac.uk Isaac Newton Institute, 11 December

More information

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Copright, 8, RE Kass, EN Brown, and U Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Chapter 6 Random Vectors and Multivariate Distributions 6 Random Vectors In Section?? we etended

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

A Process over all Stationary Covariance Kernels

A Process over all Stationary Covariance Kernels A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information