Improving prediction from Dirichlet process mixtures via enrichment


Sara K. Wade, David B. Dunson, Sonia Petrone, Lorenzo Trippa, for the Alzheimer's Disease Neuroimaging Initiative.

November 15, 2013

Abstract

Flexible covariate-dependent density estimation can be achieved by modelling the joint density of the response and covariates as a Dirichlet process mixture. An appealing aspect of this approach is that computations are relatively easy. In this paper, we examine the predictive performance of these models with an increasing number of covariates. Even for a moderate number of covariates, we find that the likelihood for x tends to dominate the posterior of the latent random partition, degrading the predictive performance of the model. To overcome this, we suggest using a different nonparametric prior, namely an enriched Dirichlet process. Our proposal maintains a simple allocation rule, so that computations remain relatively simple. Advantages are shown through both predictive equations and examples, including an application to diagnosing Alzheimer's disease.

Keywords. Bayesian nonparametrics; Density regression; Predictive distribution; Random partition; Urn scheme.

1 Introduction

Dirichlet process (DP) mixture models have become popular tools for Bayesian nonparametric regression. In this paper, we examine their behavior in prediction and aim to highlight the difficulties that emerge with increasing dimension of the covariate space. To overcome these difficulties, we suggest a simple extension based on a nonparametric prior developed in Wade et al. [2011] that maintains desirable conjugacy properties of the Dirichlet process and leads to improved prediction.

Corresponding author: sara.wade@eng.cam.ac.uk, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/adni_acknowledgement_list.pdf.

The motivating application is to Alzheimer's disease studies, where the focus is prediction of the disease status based on biomarkers obtained from neuroimages. In this problem, a flexible nonparametric approach is needed to account for the possible nonlinear behavior of the response and complex interaction terms, resulting in improved diagnostic accuracy.

DP mixtures are widely used for Bayesian density estimation, see e.g. Ghosal [2010] and references therein. A common way to extend these methods to nonparametric regression and conditional density estimation is by modelling the joint distribution of the response and the covariates (X, Y) as a mixture of multivariate Gaussians (or more general kernels). The regression function and conditional density estimates are indirectly obtained from inference on the joint density, an idea which is similarly employed in classical kernel regression (Scott [1992], Chapter 8). This approach, which we call the joint approach, was first introduced by Müller et al. [1996], and subsequently studied by many others including Kang and Ghosal [2009], Shahbaba and Neal [2009], Hannah et al. [2011], Park and Dunson [2010], and Müller and Quintana [2010]. The DP model uses simple local linear regression models as building blocks and partitions the observed subjects into clusters, where within clusters, the linear regression model provides a good fit. Even though within clusters the model is parametric, globally, a wide range of complex distributions can describe the joint distribution, leading to a flexible model for both the regression function and the conditional density.

Another related class of models is based on what we term the conditional approach. In such models, the conditional density of Y given x, f(y | x), is modelled directly, as a convolution of a parametric family f(y | x, θ) with an unknown mixing distribution P_x for θ. A prior is then given on the family of distributions {P_x, x ∈ X} such that the P_x's are dependent. Examples for the law of {P_x, x ∈ X} start from the dependent DPs of MacEachern [1999], Gelfand et al. [2005] and include Griffin and Steel [2006], Dunson and Park [2008], Ren et al. [2011], Chung and Dunson [2009], and Rodriguez and Dunson [2011], just to name a few. Such conditional models can approximate a wide range of response distributions that may change flexibly with the covariates. However, computations are often quite burdensome. One of the reasons the model examined here is so powerful is its simplicity. Together, the joint approach and the clustering of the DP provide a built-in technique to allow for changes in the response distribution across the covariate space, yet it is simple and generally less computationally intensive than the nonparametric conditional models based on dependent DPs.

The random allocation of subjects into groups in joint DP mixture models is driven by the need to obtain a good approximation of the joint distribution of X and Y. This means that subjects with similar covariates and similar relationship between the response and covariates will tend to cluster together. However, difficulties emerge as p, the dimension of the covariate space, increases. As we will detail, even for moderately large p the likelihood of x tends to dominate the posterior of the random partition, so that clusters are based mainly on similarity in the covariate space. This behaviour is quite unappealing if the marginal density of X is complex, as is typical in high dimensions, because it causes the posterior to concentrate on partitions with many small clusters, as many kernels are needed to describe f(x).

This occurs even if the conditional density of Y given x is more well-behaved, meaning, a few kernels suffice for its approximation. Typical results include poor estimates of the regression function and conditional density with unnecessarily wide credible intervals due to small clusters and, consequently, poor prediction.

This inefficient performance may not disappear with increasing samples. On one hand, appealing to recent theoretical results (Wu and Ghosal [2008], Wu and Ghosal [2010], Tokdar [2011]), one could expect that as the sample size increases, the posterior on the unknown density f(x, y) induced by the DP joint mixture model is consistent at the true density. In turn, posterior consistency of the joint is likely to have positive implications for the behavior of the random conditional density and regression function; see Rodriguez et al. [2009], Hannah et al. [2011], and Norets and Pelenis [2012] for some developments in this direction. However, the unappealing behaviour of the random partition that we described above could be reflected in worse convergence rates. Indeed, recent results by Efromovich [2007] suggest that if the conditional density is smoother than the joint, it can be estimated at a faster rate. Thus, improving inference on the random partition to take into account the different degrees of smoothness of f(x) and f(y | x) appears to be a crucial issue.

Our goal in this paper is to show that a simple modification of the nonparametric prior on the mixing distribution, which better models the random partition, can more efficiently convey the information present in the sample, leading to more efficient conditional density estimates in terms of smaller errors and less variability, for finite samples. To achieve this aim, we consider a prior that allows local clustering, that is, the clustering structure for the marginal of X and the regression of Y on x may be different. We achieve this by replacing the DP with the enriched Dirichlet process (EDP) developed in Wade et al. [2011]. Like the DP, the EDP is a conjugate nonparametric prior, but it allows a nested clustering structure that can overcome the above issues and lead to improved predictions. An alternative proposal is outlined in Petrone and Trippa [2009] and in unpublished work by Dunson et al. [2011]. However, the EDP offers a richer parametrization. In a Bayesian nonparametric framework, several extensions of the DP have been proposed to allow local clustering (see e.g. Dunson et al. [2008], Dunson [2009], Petrone et al. [2009]). However, the greater flexibility is often achieved at the price of more complex computations. Instead, our proposal maintains an analytically computable allocation rule, and therefore, computations are a straightforward extension of those used for the joint DP mixture model. Thus, our main contributions are to highlight the problematic behavior in prediction of the joint DP mixture model for increasing p and also offer a simple solution based on the EDP that maintains computational ease. In addition, we give results on random nested partitions that are implied by the proposed prior.

This paper is organized as follows. In Section 2, we review the joint DP mixture model, discuss the behavior of the random partition, and examine the predictive performance.
In Section 3, we propose a joint EDP mixture model, derive its random partition model, and emphasize the predictive improvements of the model. Section 4 covers computational procedures. We provide a simulated example in Section 5 to demonstrate how the EDP model can lead to more efficient estimators by making better use of information contained in the sample. Finally, in Section 6, we apply the model to predict Alzheimer's disease status based on measurements of various brain structures.

2 Joint DP mixture model

A joint DP mixture model for multivariate density estimation and nonparametric regression assumes that

(X_i, Y_i) | P ~ iid f(x, y | P) = ∫ K(x, y | ξ) dP(ξ),

where X_i is p-dimensional, Y is usually univariate, and the mixing distribution P is given a DP prior with scale parameter α > 0 and base measure P_0, denoted by P ~ DP(αP_0). Due to the a.s. discrete nature of the DP, the model reduces to a countable mixture

f(x, y | P) = ∑_{j=1}^∞ w_j K(x, y | ξ_j),

where the mixing weights w_j have a stick-breaking prior with parameter α and ξ_j ~ iid P_0, independently of the w_j. This model was first developed for Bayesian nonparametric regression by Müller, Erkanli, and West [1996], who assume multivariate Gaussian kernels N_{p+1}(µ, Σ) with ξ = (µ, Σ) and use a conjugate normal-inverse-Wishart prior as the base measure of the DP. However, for even moderately large p, this approach is practically unfeasible. Indeed, the computational cost of dealing with the full (p+1) by (p+1) covariance matrix greatly increases with large p. Furthermore, the conjugate inverse-Wishart prior is known to be too poorly parametrized; in particular, there is a single parameter ν to control variability, regardless of p (see Consonni and Veronese [2001]).

A more effective formulation of this model has been recently proposed by Shahbaba and Neal [2009], based on two simple modifications. First, the joint kernel is decomposed as the product of the marginal of X and the conditional of Y | x, and the parameter space is consequently expressed in terms of the parameters ψ of the marginal and the parameters θ of the conditional. This is a classic reparametrization which, in the Gaussian case, is the basis of generalizations of the inverse-Wishart conjugate prior; see Brown et al. [1994]. Secondly, they suggest using simple kernels, assuming local independence among the covariates, i.e. the covariance matrix of the kernel for X is diagonal. These two simple modifications allow several important improvements. Computationally, reducing the covariance matrix to p variances can greatly ease calculations. Regarding flexibility of the base measure, the conjugate prior now includes a separate parameter to control variability for each of the p variances.

Furthermore, the model still allows for local correlations between Y and X through the conditional kernel of Y | x and the parameter θ. In addition, the factorization in terms of the marginal and conditional and the assumption of local independence of the covariates allow for easy inclusion of discrete or other types of response or covariates. Note that even though, within each component, we assume independence of the covariates, globally, there may be dependence. This extended model, in full generality, can be described through latent parameter vectors as follows:

Y_i | x_i, θ_i ~ ind F_y(· | x_i, θ_i),   X_i | ψ_i ~ ind F_x(· | ψ_i),   (1)
(θ_i, ψ_i) | P ~ iid P,   P ~ DP(α P_{0θ} × P_{0ψ}).

Here the base measure P_{0θ} × P_{0ψ} of the DP assumes independence between the θ and the ψ parameters, as this is the structurally conjugate prior for (θ, ψ), and thus results in simplified computations, but the model could be extended to more general choices. We further assume that P_{0θ} and P_{0ψ} are absolutely continuous, with densities p_{0θ} and p_{0ψ}. Since P is discrete a.s., integrating out the subject-specific parameters (θ_i, ψ_i), the model for the joint density is

f(y_i, x_i | P) = ∑_{j=1}^∞ w_j K(y_i | x_i, θ_j) K(x_i | ψ_j),   (2)

where the kernels K(y | x, θ) and K(x | ψ) are the densities associated to F_y(· | x, θ) and F_x(· | ψ). In the Gaussian case, K(x | ψ) is the product of N(µ_{x,h}, σ²_{x,h}) Gaussians, h = 1, ..., p, and K(y | x, θ) is N(x̃'β, σ²_y), where x̃ = (1, x')'. Shahbaba and Neal focus on the case when Y is categorical and the local model for Y | x is a multinomial logit. Hannah et al. [2011] extend the model to the case when, locally, the conditional distribution of Y | x belongs to the class of generalized linear models (GLM), that is, the distribution of the response belongs to the exponential family and the mean of the response can be expressed as a function of a linear combination of the covariates. Model (2) allows for flexible conditional densities

f(y | x, P) = ∑_{j=1}^∞ [w_j K(x | ψ_j) / ∑_{j'=1}^∞ w_{j'} K(x | ψ_{j'})] K(y | x, θ_j) ≡ ∑_{j=1}^∞ w_j(x) K(y | x, θ_j),

and nonlinear regression

E[Y | x, P] = ∑_{j=1}^∞ w_j(x) E[Y | x, θ_j],

with E[Y | x, θ_j] = x̃'β_j for Gaussian kernels. Thus, the model provides flexible kernel-based density and regression estimation, and MCMC computations are standard. However, the DP only allows joint clusters of the parameters (θ_i, ψ_i), i = 1, ..., n. We underline drawbacks of such joint clustering in the following subsections.
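To make the covariate-dependent weights w_j(x) concrete, the following is a minimal sketch (not from the paper) of model (2) under a finite stick-breaking truncation with Gaussian kernels; the truncation level J, the base-measure choices and all function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
J, p, alpha = 50, 2, 1.0  # truncation level, covariate dimension, DP mass (assumed values)

# Stick-breaking weights: w_j = v_j * prod_{l<j} (1 - v_l), with v_j ~ Beta(1, alpha)
v = rng.beta(1.0, alpha, size=J)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

# Atoms (theta_j, psi_j) drawn from a simple base measure (illustrative choices)
beta = rng.normal(0.0, 1.0, size=(J, p + 1))   # theta_j: regression coefficients
sigma_y = 0.5 * np.ones(J)                     # theta_j: response sd (fixed for simplicity)
mu_x = rng.normal(0.0, 2.0, size=(J, p))       # psi_j: covariate means
sigma_x = np.ones((J, p))                      # psi_j: covariate sds

def weights_x(x):
    """Covariate-dependent weights w_j(x), proportional to w_j * K(x | psi_j)."""
    kx = np.prod(norm.pdf(x, loc=mu_x, scale=sigma_x), axis=1)
    wx = w * kx
    return wx / wx.sum()

def conditional_density(y, x):
    """f(y | x, P) = sum_j w_j(x) N(y; x~'beta_j, sigma_y_j^2)."""
    xt = np.concatenate(([1.0], x))
    return np.sum(weights_x(x) * norm.pdf(y, loc=beta @ xt, scale=sigma_y))

def regression(x):
    """E[Y | x, P] = sum_j w_j(x) x~'beta_j."""
    xt = np.concatenate(([1.0], x))
    return np.sum(weights_x(x) * (beta @ xt))

x0 = np.zeros(p)
print(regression(x0), conditional_density(0.0, x0))
```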

2.1 Random partition and inference

One of the crucial features of DP mixture models is the dimension reduction and clustering obtained due to the almost sure discreteness of P. In fact, this implies that a sample (θ_i, ψ_i), i = 1, ..., n from a DP presents ties with positive probability and can be conveniently described in terms of the random partition and the distinct values. Using a standard notation, we denote the random partition by a vector of cluster allocation labels ρ_n = (s_1, ..., s_n), with s_i = j if (θ_i, ψ_i) is equal to the j-th unique value observed, (θ*_j, ψ*_j), for j = 1, ..., k, where k = k(ρ_n) is the number of groups in the partition ρ_n. Additionally, we will denote by S_j = {i : s_i = j} the set of subject indices in the j-th cluster and use the notation y*_j = {y_i : i ∈ S_j} and x*_j = {x_i : i ∈ S_j}. We also make use of the short notation x_{1:n} = (x_1, ..., x_n).

Often, the random probability measure P is integrated out and inference is based on the posterior of the random partition ρ_n and the cluster-specific parameters (θ*, ψ*) = (θ*_j, ψ*_j)_{j=1}^k;

p(ρ_n, θ*, ψ* | x_{1:n}, y_{1:n}) ∝ p(ρ_n) ∏_{j=1}^k p_{0θ}(θ*_j) p_{0ψ}(ψ*_j) ∏_{i ∈ S_j} K(x_i | ψ*_j) K(y_i | x_i, θ*_j).

As is well known (Antoniak [1974]), the prior induced by the DP on the random partition is

p(ρ_n) ∝ α^k ∏_{j=1}^k Γ(n_j),

where n_j is the size of S_j. Marginalizing out θ*, ψ*, the posterior of the random partition is

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α^k ∏_{j=1}^k Γ(n_j) h_x(x*_j) h_y(y*_j | x*_j).   (3)

Notice that the marginal likelihood component in (3), say h(x_{1:n}, y_{1:n} | ρ_n), is factorized as the product of the cluster-specific marginal likelihoods h_x(x*_j) h_y(y*_j | x*_j), where

h_x(x*_j) = ∫_Ψ ∏_{i ∈ S_j} K(x_i | ψ) dP_{0ψ}(ψ)   and   h_y(y*_j | x*_j) = ∫_Θ ∏_{i ∈ S_j} K(y_i | x_i, θ) dP_{0θ}(θ).

Given the partition and the data, the conditional distribution of the distinct values is simple, as they are independent across clusters, with posterior densities

p(θ*_j | ρ_n, x_{1:n}, y_{1:n}) ∝ p_{0θ}(θ*_j) ∏_{i ∈ S_j} K(y_i | x_i, θ*_j),
p(ψ*_j | ρ_n, x_{1:n}, y_{1:n}) ∝ p_{0ψ}(ψ*_j) ∏_{i ∈ S_j} K(x_i | ψ*_j).   (4)

Thus, given ρ_n, the cluster-specific parameters (θ*_j, ψ*_j) are updated based only on the observations in cluster S_j. Computations are simple if p_{0θ} and p_{0ψ} are conjugate priors.

The above expressions show the crucial role of the random partition. From equation (3), we have that given the data, subjects are clustered in groups with similar behaviour in the covariate space and similar relationship with the response.
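As a small illustration of (3) (not from the paper), the unnormalized log posterior of a candidate partition can be scored once the cluster-specific marginal likelihoods h_x and h_y are available; here they are passed in as user-supplied functions, which is an assumption of this sketch.

```python
import numpy as np
from scipy.special import gammaln

def log_partition_posterior(labels, x, y, alpha, log_hx, log_hy):
    """Unnormalized log p(rho_n | x, y) from (3):
    k*log(alpha) + sum_j [ log Gamma(n_j) + log h_x(x_j) + log h_y(y_j | x_j) ]."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    logp = len(clusters) * np.log(alpha)
    for j in clusters:
        idx = np.where(labels == j)[0]
        logp += gammaln(len(idx)) + log_hx(x[idx]) + log_hy(y[idx], x[idx])
    return logp
```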

However, even for moderate p the likelihood for x tends to dominate the posterior of the random partition, so that clusters are determined only by similarity in the covariate space. This is particularly evident when the covariates are assumed to be independent locally, i.e. K(x_i | ψ*_j) = ∏_{h=1}^p K(x_{i,h} | ψ*_{j,h}). Clearly, for large p, the scale and magnitude of changes in ∏_{h=1}^p K(x_{i,h} | ψ*_{j,h}) will wash out any information given in the univariate likelihood K(y_i | θ*_j, x_i). Indeed, this is just the behavior we have observed in practice in running simulations for large p (results not shown).

For a simple example demonstrating how the number of components needed to approximate the marginal of X can blow up with p, imagine X is uniformly distributed on a cuboid of side length r > 1. Consider approximating

f_0(x) = (1/r^p) 1(x ∈ [0, r]^p)

by

f_k(x) = ∑_{j=1}^k w_j N_p(x; µ_j, σ²_j I_p).

Since the true distribution of x is uniform on the cube [0, r]^p, to obtain a good approximation, the weighted components must place most of their mass on values of x contained in the cuboid. Let B_σ(µ) denote a ball of radius σ centered at µ. If a random vector V is normally distributed with mean µ and variance σ² I_p, then for 0 < ε < 1,

P(V ∈ B_{σ z(ε)}(µ)) = 1 − ε,   where z(ε)² = (χ²_p)^{-1}(1 − ε),

i.e. the square of z(ε) is the (1 − ε) quantile of the chi-squared distribution with p degrees of freedom. For small ε, this means that the density of V places most of its mass on values contained in a ball of radius σ z(ε) centered at µ. For ε > 0, define

f̃_k(x) = ∑_{j=1}^k w_j N(x; µ_j, σ²_j I_p) 1(x ∈ B_{σ_j z(ε_j)}(µ_j)),   where ε_j = ε/(k w_j).

Then, f̃_k is close to f_k (in the L_1 sense):

∫_{R^p} |f_k(x) − f̃_k(x)| dx = ∫_{R^p} ∑_{j=1}^k w_j N(x; µ_j, σ²_j I_p) 1(x ∈ B^c_{σ_j z(ε_j)}(µ_j)) dx = ε.

For f̃_k to be close to f_0, the parameters µ_j, σ_j, w_j need to be chosen so that the balls B_{σ_j z(ε/(k w_j))}(µ_j) are contained in the cuboid. That means that the centers of the balls are contained in the cuboid,

µ_j ∈ [0, r]^p,   (5)

with further constraints on σ²_j and w_j, so that the radius is small enough. In particular,

σ_j z(ε/(k w_j)) ≤ min(µ_{j,1}, r − µ_{j,1}, ..., µ_{j,p}, r − µ_{j,p}) ≤ r/2.   (6)
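A quick numerical check of the constraint (6) (not part of the paper): for equal weights w_j = 1/k, the largest admissible radius is σ_j z(ε/(k w_j)) ≤ r/2, and the fraction of the cuboid's volume covered by one such ball collapses as p grows. The values of r, k and ε below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

r, k, eps = 2.0, 10, 0.01  # cuboid side, number of components, L1 slack (assumed values)

for p in (1, 5, 10, 15):
    z = np.sqrt(chi2.ppf(1.0 - eps / k, df=p))   # z(eps_j) with w_j = 1/k
    sigma_max = (r / 2.0) / z                    # largest sigma_j allowed by (6)
    # volume of a ball of radius r/2 relative to the cuboid volume r^p
    log_ball = (p / 2.0) * np.log(np.pi) - gammaln(p / 2.0 + 1.0) + p * np.log(r / 2.0)
    ratio = np.exp(log_ball - p * np.log(r))
    print(p, round(sigma_max, 3), ratio)
```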

However, as p increases the volume of the cuboid goes to infinity, but the volume of any ball B_{σ_j z(ε/(k w_j))}(µ_j) defined by (5) and (6) goes to 0 (see Clarke et al. [2009], Section 1.1). Thus, just to reasonably cover the cuboid with the balls of interest, the number of components will increase dramatically, and more so, when we consider the approximation error of the density estimate. Now, as an extreme example, imagine that f_0(y | x) is a linear regression model. Even though one component is sufficient for f_0(y | x), a large number of components will be required to approximate f_0(x), especially as p increases.

It appears evident from (4) that this behavior of the random partition also negatively affects inference on the cluster-specific parameters. In particular, when many kernels are required to approximate the density of X with few observations within each cluster, the posterior for θ*_j may be based on a sample of unnecessarily small size, leading to a flat posterior with an unreliable posterior mean and large influence of the prior.

2.2 Covariate-dependent urn scheme and prediction

Difficulties associated to the behavior of the random partition also deteriorate the predictive performance of the model. Prediction in DP joint mixture models is based on a covariate-dependent urn scheme (Park and Dunson [2010]; Müller and Quintana [2010]), such that conditionally on the partition ρ_n and x_{1:n+1}, the cluster allocation s_{n+1} of a new subject with covariate value x_{n+1} is determined as

s_{n+1} | ρ_n, x_{1:n+1} ~ [ω_{k+1}(x_{n+1})/c_0] δ_{k+1} + ∑_{j=1}^k [ω_j(x_{n+1})/c_0] δ_j,   (7)

where

ω_{k+1}(x_{n+1}) = [α/(α + n)] h_{0,x}(x_{n+1}),
ω_j(x_{n+1}) = [n_j/(α + n)] ∫ K(x_{n+1} | ψ) p(ψ | x*_j) dψ ≡ [n_j/(α + n)] h_{j,x}(x_{n+1}),

and c_0 = p(x_{n+1} | ρ_n, x_{1:n}) is the normalizing constant. This urn scheme is a generalization of the classic Pólya urn scheme that allows the cluster allocation probability to depend on the covariates; the probability of allocation to cluster j depends on the similarity of x_{n+1} to the x_i in cluster j as measured by the predictive density h_{j,x}. See Park and Dunson [2010] for more details.
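The allocation weights in (7) are easy to compute once the prior and cluster-specific predictive densities of the covariates are available; the following sketch (not from the paper) takes them as user-supplied functions, which is an assumption made for illustration.

```python
import numpy as np

def dp_allocation_probs(x_new, cluster_sizes, alpha, h0x, hjx_list):
    """Allocation probabilities from the covariate-dependent urn scheme (7).
    h0x(x) is the prior predictive density of x; hjx_list[j](x) is the predictive
    density of x given the covariates already in cluster j."""
    n = sum(cluster_sizes)
    w_old = [nj / (alpha + n) * hjx(x_new)
             for nj, hjx in zip(cluster_sizes, hjx_list)]
    w_new = alpha / (alpha + n) * h0x(x_new)
    w = np.array(w_old + [w_new])
    return w / w.sum()   # division by the normalizing constant c_0
```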

From the urn scheme (7) one obtains the structure of the prediction. The predictive density at y for a new subject with covariate value x_{n+1} is computed as

f(y | y_{1:n}, x_{1:n+1}) = ∑_{ρ_n} ∑_{s_{n+1}} f(y | y_{1:n}, x_{1:n+1}, ρ_n, s_{n+1}) p(s_{n+1} | y_{1:n}, x_{1:n+1}, ρ_n) p(ρ_n | y_{1:n}, x_{1:n+1})
= ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c_0] h_{0,y}(y | x_{n+1}) + ∑_{j=1}^k [ω_j(x_{n+1})/c_0] h_{j,y}(y | x_{n+1}) } c_0 p(ρ_n | x_{1:n}, y_{1:n}) / p(x_{n+1} | x_{1:n}, y_{1:n})
= ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] h_{0,y}(y | x_{n+1}) + ∑_{j=1}^k [ω_j(x_{n+1})/c] h_{j,y}(y | x_{n+1}) } p(ρ_n | x_{1:n}, y_{1:n}),   (8)

where c = p(x_{n+1} | x_{1:n}, y_{1:n}). Thus, given the partition, the conditional predictive density is a weighted average of the prior guess

h_{0,y}(y | x) = ∫ K(y | x, θ) dP_{0θ}(θ)

and the cluster-specific predictive densities of y at x_{n+1},

h_{j,y}(y | x_{n+1}) = ∫ K(y | x_{n+1}, θ) p(θ | x*_j, y*_j) dθ,

with covariate-dependent weights. The predictive density is obtained by averaging with respect to the posterior of ρ_n. However, for moderate to large p, the posterior of the random partition suffers the drawbacks discussed in the previous subsection. In particular, too many small x-clusters lead to unreliable within-cluster predictives based on small sample sizes. Furthermore, the measure which determines similarity of x_{n+1} and the j-th cluster will be too rigid. Consequently, the resulting overall prediction may be quite poor. These drawbacks will also affect the point prediction, which, under quadratic loss, is

E[Y_{n+1} | y_{1:n}, x_{1:n+1}] = ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] E_0[Y_{n+1} | x_{n+1}] + ∑_{j=1}^k [ω_j(x_{n+1})/c] E_j[Y_{n+1} | x_{n+1}] } p(ρ_n | x_{1:n}, y_{1:n}),   (9)

where E_0[Y_{n+1} | x_{n+1}] is the expectation of Y_{n+1} with respect to h_{0,y} and E_j[Y_{n+1} | x_{n+1}] = E[E[Y_{n+1} | x_{n+1}, θ*_j] | x*_j, y*_j] is the expectation of Y_{n+1} with respect to h_{j,y}.

Example. When K(y | x, θ) = N(y; x̃'β, σ²_y) and the prior for (β, σ²_y) is the multivariate normal-inverse gamma with parameters (β_0, C^{-1}, a_y, b_y), (9) is

∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] x̃'_{n+1} β_0 + ∑_{j=1}^k [ω_j(x_{n+1})/c] x̃'_{n+1} β̂_j } p(ρ_n | x_{1:n}, y_{1:n}),   (10)

where β̂_j = Ĉ_j^{-1}(C β_0 + X*_j' y*_j), Ĉ_j = C + X*_j' X*_j, and X*_j is the n_j by p + 1 matrix with rows x̃'_i for i ∈ S_j, and (8) is

∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] T(y; x̃'_{n+1} β_0, (b_y/a_y) W^{-1}, 2a_y) + ∑_{j=1}^k [ω_j(x_{n+1})/c] T(y; x̃'_{n+1} β̂_j, (b̂_{y,j}/â_{y,j}) W_j^{-1}, 2â_{y,j}) } p(ρ_n | x_{1:n}, y_{1:n}),

where T(·; µ, σ², ν) denotes the density of a random variable V such that (V − µ)/σ has a t-distribution with ν degrees of freedom,

â_{y,j} = a_y + n_j/2,   b̂_{y,j} = b_y + ½ (y*_j − X*_j β_0)'(I_{n_j} − X*_j Ĉ_j^{-1} X*_j')(y*_j − X*_j β_0),
W = 1 − x̃'_{n+1}(C + x̃_{n+1} x̃'_{n+1})^{-1} x̃_{n+1},

and W_j is defined as W with C replaced by Ĉ_j.
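To illustrate the Example, here is a minimal sketch (not the authors' code) of the cluster-specific conjugate update and Student-t predictive under the stated normal-inverse-gamma parametrization; the posterior scale uses the equivalent form (b̂/â)(1 + x̃'Ĉ_j^{-1}x̃), which matches W_j^{-1} as reconstructed above, and the function and variable names are assumptions.

```python
import numpy as np
from scipy.stats import t as student_t

def nig_cluster_predictive(Xj, yj, x_new, beta0, C, a_y, b_y):
    """Cluster-specific predictive of y at x_new: returns (mean, df, scale) of the
    Student-t predictive. Xj: n_j x (p+1) design matrix with rows (1, x_i');
    yj: responses in the cluster."""
    C_hat = C + Xj.T @ Xj
    beta_hat = np.linalg.solve(C_hat, C @ beta0 + Xj.T @ yj)
    a_hat = a_y + len(yj) / 2.0
    resid = yj - Xj @ beta0
    b_hat = b_y + 0.5 * resid @ (resid - Xj @ np.linalg.solve(C_hat, Xj.T @ resid))
    x_t = np.concatenate(([1.0], x_new))
    scale2 = (b_hat / a_hat) * (1.0 + x_t @ np.linalg.solve(C_hat, x_t))
    return x_t @ beta_hat, 2.0 * a_hat, np.sqrt(scale2)

# Density h_{j,y}(y | x_new) on a grid of y values:
# mean, df, scale = nig_cluster_predictive(Xj, yj, x_new, beta0, C, a_y, b_y)
# dens = student_t.pdf(y_grid, df=df, loc=mean, scale=scale)
```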

3 Joint EDP mixture model

As seen, the global clustering of the DP prior on P(Θ × Ψ), the space of probability measures on Θ × Ψ, does not allow one to efficiently model the types of data discussed in the previous section. Instead, it is desirable to use a nonparametric prior that allows many ψ-clusters, to fit the complex marginal of X, and fewer θ-clusters. At the same time, we want to preserve the desirable conjugacy properties of the DP, in order to maintain fairly simple computations. To these aims, our proposal is to replace the DP with the more richly parametrized enriched Dirichlet process of Wade et al. [2011]. The EDP is conjugate and has an analytically computable urn scheme, but it gives a nested partition structure that can model the desired clustering behavior.

Recall that the model (2) was obtained by decomposing the joint kernel as the product of the marginal and conditional kernels. The EDP is a natural alternative for the mixing distribution of this model, as it is similarly based on the idea of expressing the unknown random joint probability measure P of (θ, ψ) in terms of the random marginal and conditionals. This requires the choice of an ordering of θ and ψ, and this choice is problem specific. In the situation described here, it is natural to consider the random marginal distribution P_θ and the random conditional P_{ψ|θ}, to obtain the desired clustering structure. Then, the EDP prior is defined by

P_θ ~ DP(α_θ P_{0θ}),
P_{ψ|θ}(· | θ) ~ DP(α_ψ(θ) P_{0ψ|θ}(· | θ)),   θ ∈ Θ,   (11)

and the P_{ψ|θ}(· | θ) for θ ∈ Θ are independent among themselves and from P_θ. Together these assumptions induce a prior for the random joint P through the joint law of the marginal and conditionals and the mapping (P_θ, P_{ψ|θ}) → ∫ P_{ψ|θ}(· | θ) dP_θ(θ).

The prior is parametrized by the base measure P_0, expressed as

P_0(A × B) = ∫_A P_{0ψ|θ}(B | θ) dP_{0θ}(θ)

for all Borel sets A and B, and by a precision parameter α_θ associated to θ and a collection of precision parameters α_ψ(θ), for every θ ∈ Θ, associated to ψ | θ. Note the contrast with the DP, which only allows one precision parameter to regulate the uncertainty around P_0.

The proposed EDP mixture model for regression is as in (2), but with P ~ EDP(α_θ, α_ψ(θ), P_0) in place of P ~ DP(α P_{0θ} × P_{0ψ}). In general, P_0 is such that θ and ψ are dependent, but here we assume the same structurally conjugate base measure as for the DP model (2), so P_0 = P_{0θ} × P_{0ψ}. Using the square-breaking representation of the EDP (Wade et al. [2011], Proposition 4) and integrating out the (θ_i, ψ_i) parameters, the model for the joint density is

f(x, y | P) = ∑_{j=1}^∞ ∑_{l=1}^∞ w_j w_{l|j} K(x | ψ*_{l|j}) K(y | x, θ*_j).

This gives a mixture model for the conditional densities with more flexible weights

f(y | x, P) = ∑_{j=1}^∞ [w_j ∑_{l=1}^∞ w_{l|j} K(x | ψ*_{l|j}) / ∑_{j'=1}^∞ w_{j'} ∑_{l'=1}^∞ w_{l'|j'} K(x | ψ*_{l'|j'})] K(y | x, θ*_j) ≡ ∑_{j=1}^∞ w_j(x) K(y | x, θ*_j).

3.1 Random partition and inference

The advantage of the EDP is the implied nested clustering. The EDP partitions subjects in θ-clusters and ψ-subclusters within each θ-cluster, allowing the use of more kernels to describe the marginal of X for each kernel used for the conditional of Y | x. The random partition model induced from the EDP can be described as a nested Chinese Restaurant Process (nCRP). First, customers choose restaurants according to the CRP induced by P_θ ~ DP(α_θ P_{0θ}), that is, with probability proportional to the number n_j of customers eating at restaurant j, the (n + 1)-th customer eats at restaurant j, and with probability proportional to α_θ, she eats at a new restaurant. Restaurants are then colored iid with colors θ*_j ~ P_{0θ}. Within restaurant j, customers sit at tables as in the CRP induced by P_{ψ|θ*_j} ~ DP(α_ψ(θ*_j) P_{0ψ|θ}(· | θ*_j)). Tables in restaurant j are then colored with colors ψ*_{l|j} ~ iid P_{0ψ|θ}(· | θ*_j).
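A minimal simulation of the nested restaurant process just described (not from the paper; the constant α_ψ, the seed, and the parameter values are illustrative assumptions):

```python
import numpy as np

def nested_crp(n, alpha_theta, alpha_psi, rng):
    """Sample a nested partition: restaurant (theta-cluster) and table
    (psi-subcluster) labels for n customers."""
    rest, table = [], []     # labels per customer
    rest_sizes = []          # customers per restaurant
    table_sizes = []         # table_sizes[j] = list of table counts in restaurant j
    for i in range(n):
        # choose a restaurant: existing ones w.p. prop. to size, new w.p. prop. to alpha_theta
        probs = np.array(rest_sizes + [alpha_theta], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(rest_sizes):
            rest_sizes.append(0)
            table_sizes.append([])
        rest_sizes[j] += 1
        # choose a table within restaurant j
        tprobs = np.array(table_sizes[j] + [alpha_psi], dtype=float)
        l = rng.choice(len(tprobs), p=tprobs / tprobs.sum())
        if l == len(table_sizes[j]):
            table_sizes[j].append(0)
        table_sizes[j][l] += 1
        rest.append(j)
        table.append(l)
    return np.array(rest), np.array(table)

rest, table = nested_crp(200, alpha_theta=1.0, alpha_psi=5.0, rng=np.random.default_rng(1))
print(len(np.unique(rest)), [len(np.unique(table[rest == j])) for j in np.unique(rest)])
```

A small α_θ with a larger α_ψ tends to produce few restaurants (θ-clusters) with several tables (ψ-subclusters) each, which is exactly the structure motivated above.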

This differs from the nCRP proposed by Blei et al. [2010], which had the alternative aim of learning topic hierarchies by clustering parameters (topics) hierarchically along a tree of infinite CRPs. In particular, each subject follows a path down the tree according to a sequence of nested CRPs and the parameters of subject i are associated with the cluster visited at a latent subject-specific level l of this path. Although related, the EDP is not a special case with the tree depth fixed to 2; the EDP defines a prior on a multivariate random probability measure on Θ × Ψ and induces a nested partition of the multivariate parameter (θ, ψ), where the first level of the tree corresponds to the clustering of the θ parameters and the second corresponds to the clustering of the ψ parameters. A generalization of the EDP to a depth of D is related to Blei et al.'s nCRP with depth D, but only if one regards the parameters of each subject as the vector of parameters (ξ_1, ..., ξ_D) associated to each level of the tree. Furthermore, this generalization of the EDP would allow a more flexible specification of the mass parameters and possible correlation among the nested parameters.

The nested partition of the EDP is described by ρ_n = (ρ_{n,y}, ρ_{n,x}), where ρ_{n,y} = (s_{y,1}, ..., s_{y,n}) and ρ_{n,x} = (s_{x,1}, ..., s_{x,n}), with s_{y,i} = j if θ_i = θ*_j, the j-th distinct θ-value in order of appearance, and s_{x,i} = l if ψ_i = ψ*_{l|j}, the l-th color that appeared inside the j-th θ-cluster. Additionally, we use the notation S_{j+} = {i : s_{y,i} = j}, with size n_j, j = 1, ..., k, and S_{l|j} = {i : s_{y,i} = j, s_{x,i} = l}, with size n_{l|j}, l = 1, ..., k_j. The unique parameters will be denoted by θ* = (θ*_j)_{j=1}^k and ψ* = (ψ*_{1|j}, ..., ψ*_{k_j|j})_{j=1}^k. Furthermore, we use the notation ρ_{n_j,x} = (s_{x,i} : i ∈ S_{j+}) and y*_j = {y_i : i ∈ S_{j+}}, x*_j = {x_i : i ∈ S_{j+}}, x*_{l|j} = {x_i : i ∈ S_{l|j}}.

Proposition 3.1 The probability law of the nested random partition defined from the EDP is

p(ρ_n) = [Γ(α_θ)/Γ(α_θ + n)] α_θ^k ∏_{j=1}^k { ∫_Θ α_ψ(θ)^{k_j} [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) }.

Proof. From independence of the random conditional distributions among θ ∈ Θ,

p(ρ_n, θ*) = p(ρ_{n,y}) ∏_{j=1}^k p_{0θ}(θ*_j) p(ρ_{n,x} | ρ_{n,y}, θ*) = p(ρ_{n,y}) ∏_{j=1}^k p_{0θ}(θ*_j) p(ρ_{n_j,x} | θ*_j).

Next, using the results on the random partition model of the DP (Antoniak [1974]), we have

p(ρ_n, θ*) = [Γ(α_θ)/Γ(α_θ + n)] α_θ^k ∏_{j=1}^k p_{0θ}(θ*_j) α_ψ(θ*_j)^{k_j} [Γ(α_ψ(θ*_j)) Γ(n_j) / Γ(α_ψ(θ*_j) + n_j)] ∏_{l=1}^{k_j} Γ(n_{l|j}).

Integrating out θ* leads to the result.

From Proposition 3.1, we gain an understanding of the types of partitions preferred by the EDP and the effect of the parameters. If for all θ, α_ψ(θ) = α_θ P_{0θ}({θ}), that is α_ψ(θ) = 0 if P_{0θ} is nonatomic, we are back to the DP random partition model, see Proposition 2 of Wade et al. [2011].

In the case when P_{0θ} is nonatomic, this means that the conditional P_{ψ|θ} is degenerate at some random location with probability one (for each restaurant, one table). In general, α_ψ(θ) may be a flexible function of θ, reflecting the fact that within some θ-clusters more kernels may be required for good approximation of the marginal of X. In practice, a common situation that we observe is a high value of α_ψ(θ) for average values of θ and lower values of α_ψ(θ) for more extreme θ values, capturing homogeneous outlying groups. In this case, a small value of α_θ will encourage few θ-clusters, and, given θ*, a large α_ψ(θ*_j) will encourage more ψ-clusters within the j-th θ-cluster. The term ∏_{j=1}^k ∏_{l=1}^{k_j} Γ(n_{l|j}) will encourage asymmetrical (θ, ψ)-clusters, preferring one large cluster and several small clusters, while, given θ*, the term involving the product of beta functions contains parts that both encourage and discourage asymmetrical θ-clusters. In the special case when α_ψ(θ) = α_ψ for all θ ∈ Θ, the random partition model simplifies to

p(ρ_n) = [Γ(α_θ)/Γ(α_θ + n)] α_θ^k ∏_{j=1}^k α_ψ^{k_j} [Γ(α_ψ) Γ(n_j) / Γ(α_ψ + n_j)] ∏_{l=1}^{k_j} Γ(n_{l|j}).

In this case, the tendency of the term involving the product of beta functions is to slightly prefer asymmetrical θ-clusters, with large values of α_ψ boosting this preference.

As discussed in the previous section, the random partition plays a crucial role, as its posterior distribution affects both inference on the cluster-specific parameters and prediction. For the EDP, it is given by the following proposition.

Proposition 3.2 The posterior of the random partition of the EDP model is

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α_θ^k ∏_{j=1}^k { ∫_Θ α_ψ(θ)^{k_j} [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] dP_{0θ}(θ) } h_y(y*_j | x*_j) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

The proof relies on a simple application of Bayes theorem. In the case of constant α_ψ(θ), the expression for the posterior of ρ_n simplifies to

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α_θ^k ∏_{j=1}^k α_ψ^{k_j} [Γ(α_ψ) Γ(n_j) / Γ(α_ψ + n_j)] h_y(y*_j | x*_j) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

Again, as in (3), the marginal likelihood component in the posterior distribution of ρ_n is the product of the cluster-specific marginal likelihoods, but now the nested clustering structure of the EDP separates the factors relative to y | x and x, being

h(x_{1:n}, y_{1:n} | ρ_n) = ∏_{j=1}^k h_y(y*_j | x*_j) ∏_{l=1}^{k_j} h_x(x*_{l|j}).

Even if the x-likelihood favors many ψ-clusters, now these can be obtained by sub-partitioning a coarser θ-partition, and the number k of θ-clusters can be expected to be much smaller than in (3).
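For the constant-α_ψ case, the nested analogue of the earlier partition score is a small extension; again this is a sketch with user-supplied marginal likelihoods, not the authors' code.

```python
import numpy as np
from scipy.special import gammaln

def log_nested_partition_posterior(s_y, s_x, x, y, a_theta, a_psi, log_hx, log_hy):
    """Unnormalized log posterior of a nested partition (Proposition 3.2, constant alpha_psi).
    s_y: theta-cluster labels; s_x: psi-subcluster labels within each theta-cluster."""
    s_y, s_x = np.asarray(s_y), np.asarray(s_x)
    logp = len(np.unique(s_y)) * np.log(a_theta)
    for j in np.unique(s_y):
        idx = np.where(s_y == j)[0]
        n_j = len(idx)
        subclusters = np.unique(s_x[idx])
        logp += (len(subclusters) * np.log(a_psi)
                 + gammaln(a_psi) + gammaln(n_j) - gammaln(a_psi + n_j)
                 + log_hy(y[idx], x[idx]))
        for l in subclusters:
            sub = idx[s_x[idx] == l]
            logp += gammaln(len(sub)) + log_hx(x[sub])
    return logp
```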

Further insights into the behavior of the random partition are given by the induced covariate-dependent random partition of the θ_i parameters given the covariates, which is detailed in the following propositions. We will use the notation P_n to denote the set of all possible partitions of the first n integers.

Proposition 3.3 The covariate-dependent random partition model induced by the EDP prior is

p(ρ_{n,y} | x_{1:n}) ∝ α_θ^k ∏_{j=1}^k ∑_{ρ_{n_j,x} ∈ P_{n_j}} ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

Proof. An application of Bayes theorem implies that

p(ρ_n | x_{1:n}) ∝ α_θ^k ∏_{j=1}^k ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).   (12)

Integrating over ρ_{n,x}, or equivalently summing over all ρ_{n_j,x} in P_{n_j} for j = 1, ..., k, leads to

p(ρ_{n,y} | x_{1:n}) ∝ ∑_{ρ_{n_1,x}} ... ∑_{ρ_{n_k,x}} α_θ^k ∏_{j=1}^k ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}),

and, finally, since (12) is the product over the j terms, we can pull the sum over ρ_{n_j,x} within the product.

This covariate-dependent random partition model will favor θ-partitions of the subjects which can be further partitioned into groups with similar covariates, where a partition with many desirable sub-partitions will have higher mass.

Proposition 3.4 The posterior of the random covariate-dependent partition induced from the EDP model is

p(ρ_{n,y} | x_{1:n}, y_{1:n}) ∝ α_θ^k ∏_{j=1}^k h_y(y*_j | x*_j) ∑_{ρ_{n_j,x} ∈ P_{n_j}} ∫_Θ [Γ(α_ψ(θ)) Γ(n_j) / Γ(α_ψ(θ) + n_j)] α_ψ(θ)^{k_j} dP_{0θ}(θ) ∏_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

The proof is similar in spirit to that of Proposition 3.3. Notice the preferred θ-partitions will consist of clusters with a similar relationship between y and x, as measured by the local marginal model h_y for y given x, and similar x behavior, which is measured much more flexibly as a mixture of the previous marginal local models.

The behavior of the random partition, detailed above, has important implications for the posterior of the unique parameters.

Conditionally on the partition, the cluster-specific parameters (θ*, ψ*) are still independent, with posterior density

p(θ*, ψ* | y_{1:n}, x_{1:n}, ρ_n) = ∏_{j=1}^k p(θ*_j | y*_j, x*_j) ∏_{l=1}^{k_j} p(ψ*_{l|j} | x*_{l|j}),

where

p(θ*_j | y*_j, x*_j) ∝ p_{0θ}(θ*_j) ∏_{i ∈ S_{j+}} K(y_i | θ*_j, x_i),   p(ψ*_{l|j} | x*_{l|j}) ∝ p_{0ψ}(ψ*_{l|j}) ∏_{i ∈ S_{l|j}} K(x_i | ψ*_{l|j}).

The important point is that the posterior of θ*_j can now be updated with much larger sample sizes if the data determines that a coarser θ-partition is present. This will result in a more reliable posterior mean, a smaller posterior variance, and larger influence of the data compared with the prior.

3.2 Covariate-dependent urn scheme and prediction

Similar to the DP model, computation of the predictive estimates relies on a covariate-dependent urn scheme. For the EDP, we have

s_{y,n+1} | ρ_n, x_{1:n+1}, y_{1:n} ~ [ω_{k+1}(x_{n+1})/c_0] δ_{k+1} + ∑_{j=1}^k [ω_j(x_{n+1})/c_0] δ_j,   (13)

where c_0 = p(x_{n+1} | ρ_n, x_{1:n}), but now the expression of the ω_j(x_{n+1}) takes into account the possible allocation of x_{n+1} in subgroups, being

ω_j(x_{n+1}) = ∑_{s_{x,n+1}} p(x_{n+1} | ρ_n, x_{1:n}, y_{1:n}, s_{y,n+1} = j, s_{x,n+1}) p(s_{x,n+1} | ρ_n, x_{1:n}, y_{1:n}, s_{y,n+1} = j).

From this, it can be easily found that

ω_{k+1}(x_{n+1}) = [α_θ/(α_θ + n)] h_{0,x}(x_{n+1}),
ω_j(x_{n+1}) = [n_j/(α_θ + n)] { ∑_{l=1}^{k_j} π_{l|j} h_{l|j,x}(x_{n+1}) + π_{k_j+1|j} h_{0,x}(x_{n+1}) },

where

π_{l|j} = ∫ [n_{l|j} / (α_ψ(θ) + n_j)] dP(θ | x*_j, y*_j),   π_{k_j+1|j} = ∫ [α_ψ(θ) / (α_ψ(θ) + n_j)] dP(θ | x*_j, y*_j).

In the case of constant α_ψ(θ) = α_ψ, these expressions simplify to

π_{l|j} = n_{l|j} / (α_ψ + n_j),   π_{k_j+1|j} = α_ψ / (α_ψ + n_j).
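For constant α_ψ, the EDP allocation weights are a small modification of the DP version; the sketch below (not from the paper) again treats the x-predictive densities as user-supplied functions, which is an assumption made for illustration.

```python
import numpy as np

def edp_allocation_probs(x_new, subcluster_sizes, alpha_theta, alpha_psi, h0x, hljx):
    """Allocation probabilities for the theta-cluster of a new subject, eq. (13),
    with constant alpha_psi. subcluster_sizes[j] is the list of n_{l|j};
    hljx[j][l](x) is the predictive density of x given the covariates in subcluster (l|j)."""
    n = sum(sum(sizes) for sizes in subcluster_sizes)
    w = []
    for sizes, preds in zip(subcluster_sizes, hljx):
        n_j = sum(sizes)
        mix = sum(n_lj / (alpha_psi + n_j) * h(x_new) for n_lj, h in zip(sizes, preds))
        mix += alpha_psi / (alpha_psi + n_j) * h0x(x_new)
        w.append(n_j / (alpha_theta + n) * mix)
    w.append(alpha_theta / (alpha_theta + n) * h0x(x_new))
    w = np.array(w)
    return w / w.sum()
```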

Notice that (13) is similar to the covariate-dependent urn scheme of the DP model. The important difference is that the weights, which measure the similarity between x_{n+1} and the j-th cluster, are much more flexible. It follows, from similar computations as in Section 2.2, that the predictive density at y for a new subject with a covariate value of x_{n+1} is

f(y | x_{1:n}, y_{1:n}, x_{n+1}) = ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] h_{0,y}(y | x_{n+1}) + ∑_{j=1}^k [ω_j(x_{n+1})/c] h_{j,y}(y | x_{n+1}) } p(ρ_n | x_{1:n}, y_{1:n}),   (14)

where c = p(x_{n+1} | x_{1:n}, y_{1:n}). Under the squared error loss function, the point prediction of y_{n+1} is

E[Y_{n+1} | y_{1:n}, x_{1:n+1}] = ∑_{ρ_n} { [ω_{k+1}(x_{n+1})/c] E_0[Y_{n+1} | x_{n+1}] + ∑_{j=1}^k [ω_j(x_{n+1})/c] E_j[Y_{n+1} | x_{n+1}] } p(ρ_n | x_{1:n}, y_{1:n}).   (15)

The expressions for the predictive density (14) and point prediction (15) are quite similar to those of the DP, (8) and (9), respectively; in both cases, the cluster-specific predictive estimates are averaged with covariate-dependent weights. However, there are two important differences for the EDP model. The first is that the weights in (13) are defined with a more flexible kernel; in fact, it is a mixture of the original kernels used in the DP model. This means that we have a more flexible measure of similarity in the covariate space. The second difference is that k will be smaller and n_j will be larger with a high posterior probability, leading to a more reliable posterior distribution of θ*_j due to larger sample sizes and better cluster-specific predictive estimates. We will demonstrate the advantage of these two key differences in simulated and applied examples in Section 5.

4 Computations

Inference for the EDP model cannot be obtained analytically and must therefore be approximated. To obtain approximate inference, we rely on Markov Chain Monte Carlo (MCMC) methods and consider an extension of Algorithm 2 of Neal [2000] for the DP mixture model. In this approach, the random probability measure, P, is integrated out, and the model is viewed in terms of (ρ_n, θ*, ψ*). This algorithm requires the use of conjugate base measures P_{0θ} and P_{0ψ}. To deal with non-conjugate base measures, the approach used in Algorithm 8 of Neal [2000] can be directly adapted to the EDP mixture model.

Algorithm 2 is a Gibbs sampler which first samples the cluster label of each subject conditional on the partition of all other subjects, the data, and (θ*, ψ*), and then samples (θ*, ψ*) given the partition and the data. The first step can be easily performed thanks to the Pólya urn which marginalizes the DP. Extending Algorithm 2 for the EDP model is straightforward, since the EDP maintains a simple, analytically computable urn scheme. In particular, the conditional probabilities p(s_i | ρ_{n-1}^{-i}, θ*, ψ*, x_{1:n}, y_{1:n}) (provided in the Appendix) have a simple closed form, which allows conditional sampling of the individual cluster membership indicators s_i, where ρ_{n-1}^{-i} denotes the partition of the n − 1 subjects with the i-th subject removed. To improve mixing, we include an additional Metropolis-Hastings step; at each iteration, after performing the n Gibbs updates for each s_i, we propose a shuffle of the nested partition structure obtained by moving a ψ-cluster to be nested within a different or new θ-cluster. This move greatly improves mixing. A detailed description of the sampler, including the Metropolis-Hastings step, can be found in the Appendix.

MCMC produces approximate samples, {ρ^s_n, ψ^{*s}, θ^{*s}}_{s=1}^S, from the posterior. The prediction given in equation (15) can be approximated by

(1/S) ∑_{s=1}^S { [ω^s_{k+1}(x_{n+1}) / ĉ] E_{h_y}[Y_{n+1} | x_{n+1}] + ∑_{j=1}^{k^s} [ω^s_j(x_{n+1}) / ĉ] E_F[Y_{n+1} | x_{n+1}, θ^{*s}_j] },

where the ω^s_j(x_{n+1}), for j = 1, ..., k^s + 1, are as previously defined in (13) with (ρ_n, ψ*, θ*) replaced by (ρ^s_n, ψ^{*s}, θ^{*s}), and

ĉ = (1/S) ∑_{s=1}^S { ω^s_{k+1}(x_{n+1}) + ∑_{j=1}^{k^s} ω^s_j(x_{n+1}) }.

For the predictive density estimate at x_{n+1}, we define a grid of new y values and for each y in the grid, we compute

(1/S) ∑_{s=1}^S { [ω^s_{k+1}(x_{n+1}) / ĉ] h_y(y | x_{n+1}) + ∑_{j=1}^{k^s} [ω^s_j(x_{n+1}) / ĉ] K(y | x_{n+1}, θ^{*s}_j) }.   (16)

Note that hyperpriors may be included for the precision parameters, α_θ and α_ψ(·), and the parameters of the base measures. For the simulated examples and application, a Gamma hyperprior is assigned to α_θ, and the α_ψ(θ) for θ ∈ Θ are assumed to be i.i.d. from a Gamma hyperprior. At each iteration, α^s_θ and α^s_ψ(θ) at θ^{*s}_j, for j = 1, ..., k^s, are approximate samples from the posterior using the method described in Escobar and West [1995].
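As an illustration of the Monte Carlo approximation above (not the authors' code), given per-sample weights and cluster-specific predictive means, the point prediction is a simple weighted average; the argument names are assumptions.

```python
import numpy as np

def mc_point_prediction(weights_per_sample, means_per_sample, prior_mean):
    """Approximate E[Y_{n+1} | data, x_{n+1}] from S MCMC samples.
    weights_per_sample[s] = [w_1, ..., w_{k_s}, w_{k_s+1}] (last entry: new-cluster weight);
    means_per_sample[s] = [E[Y|x, theta_1^s], ..., E[Y|x, theta_{k_s}^s]];
    prior_mean = E[Y|x] under the prior predictive h_y."""
    S = len(weights_per_sample)
    c_hat = sum(np.sum(w) for w in weights_per_sample) / S
    total = 0.0
    for w, m in zip(weights_per_sample, means_per_sample):
        total += np.sum(np.asarray(w[:-1]) * np.asarray(m)) + w[-1] * prior_mean
    return total / (S * c_hat)
```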

5 Simulated example

We consider a toy example that demonstrates two key advantages of the EDP model: first, it can recover the true coarser θ-partition; second, improved prediction and smaller credible intervals result. The example shows that these advantages are evident even for a moderate value of p, with more drastic differences as p increases.

A data set of n = 200 points was generated where only the first covariate is a predictor for Y. The true model for Y is a nonlinear regression model obtained as a mixture of two normals with linear regression functions and weights depending only on the first covariate;

Y_i | x_i ~ ind p(x_{i,1}) N(y_i; β_{1,0} + β_{1,1} x_{i,1}, σ²_1) + (1 − p(x_{i,1})) N(y_i; β_{2,0} + β_{2,1} x_{i,1}, σ²_2),

where

p(x_{i,1}) = τ_1 exp(−τ (x_{i,1} − µ_1)²) / [τ_1 exp(−τ (x_{i,1} − µ_1)²) + τ_2 exp(−τ (x_{i,1} − µ_2)²)],

with β_1 = (0, 1), σ²_1 = 1/16, β_2 = (4.5, 0.1), σ²_2 = 1/8 and µ_1 = 4, µ_2 = 6, τ_1 = τ_2 = 2. The covariates are sampled from a multivariate normal,

X_i = (X_{i,1}, ..., X_{i,p}) ~ iid N(µ, Σ),   (17)

centered at µ = (4, ..., 4) with a standard deviation of 2 along each dimension, i.e. Σ_{h,h} = 4. The covariance matrix Σ models two groups of covariates: those in the first group are positively correlated among each other and the first covariate, but independent of the second group of covariates, which are positively correlated among each other but independent of the first covariate. In particular, we take Σ_{h,l} = 3.5 for h ≠ l in {1, 2, 4, ..., 2⌊p/2⌋} or h ≠ l in {3, 5, ..., 2⌊(p−1)/2⌋ + 1}, and Σ_{h,l} = 0 for all other cases of h ≠ l.

We examine both the DP and EDP mixture models;

Y_i | x_i, β_i, σ²_{y,i} ~ ind N(x̃'_i β_i, σ²_{y,i}),   X_i | µ_i, σ²_{x,i} ~ ind ∏_{h=1}^p N(µ_{i,h}, σ²_{x,h,i}),
(β_i, σ²_{y,i}, µ_i, σ²_{x,i}) | P ~ iid P,

with P ~ DP or P ~ EDP. The conjugate base measure is selected; P_{0θ} is a multivariate normal-inverse gamma prior and P_{0ψ} is the product of p normal-inverse gamma priors, that is,

p_{0θ}(β, σ²_y) = N(β; β_0, σ²_y C^{-1}) IG(σ²_y; a_y, b_y),

and

p_{0ψ}(µ, σ²_x) = ∏_{h=1}^p N(µ_h; µ_{0,h}, σ²_{x,h} c_h^{-1}) IG(σ²_{x,h}; a_{x,h}, b_{x,h}).

For both models, we use the same subjective choice of the parameters of the base measure. In particular, we center the base measure on an average of the true parameter values with enough variability to recover the true model.

Figure 1: The partition with the highest estimated posterior probability, for the DP (panels a-d) and the EDP (panels e-h) with p = 1, 5, 10, 15. Data points are plotted in the x_1 vs y space and colored by cluster membership, with the estimated posterior probability included in the plot title.

Table 1: The posterior mode of the number of clusters, denoted k̂ for both models, and the posterior mean of α (DP) and α_θ (EDP), denoted α̂ for both models, for p = 1, 5, 10, 15.

A list of the prior parameters can be found in the Appendix. We assign hyperpriors to the mass parameters, where for the DP model, α ~ Gamma(1, 1), and for the EDP model, α_θ ~ Gamma(1, 1) and α_ψ(β, σ²_y) ~ iid Gamma(1, 1) for all (β, σ²_y) ∈ R^{p+1} × R^+.

The computational procedures described in Section 4 were used to obtain posterior inference with 20,000 iterations and a burn-in period of 5,000. An examination of the trace and autocorrelation plots for the subject-specific parameters (β_i, σ²_{y,i}, µ_i, σ²_{x,i}) provided evidence of convergence. Additional criteria for assessing the convergence of the chain, in particular, the Geweke diagnostic, also suggested convergence, and the results are given in Table A.1 of the Appendix (see the R package coda for implementation and further details of the diagnostic). It should be noted that running times for both models are quite similar, although slightly faster for the DP.
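For concreteness, here is a minimal sketch (not the authors' code) of the data-generating process described at the start of this section; the value of τ and the 0-based index sets for the correlated covariate blocks are assumptions based on the reconstruction above.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 5
mu1, mu2, tau, tau1, tau2 = 4.0, 6.0, 2.0, 2.0, 2.0  # tau assumed equal to tau1 = tau2
beta1, s1 = np.array([0.0, 1.0]), np.sqrt(1.0 / 16)
beta2, s2 = np.array([4.5, 0.1]), np.sqrt(1.0 / 8)

# covariates: N_p(mu, Sigma) with two positively correlated blocks
mu = np.full(p, 4.0)
Sigma = np.zeros((p, p))
block1 = [h for h in range(p) if h == 0 or h % 2 == 1]   # covariates 1, 2, 4, ... (1-based)
block2 = [h for h in range(p) if h != 0 and h % 2 == 0]  # covariates 3, 5, ... (1-based)
for block in (block1, block2):
    for h in block:
        for l in block:
            Sigma[h, l] = 3.5
np.fill_diagonal(Sigma, 4.0)
X = rng.multivariate_normal(mu, Sigma, size=n)

# response: mixture of two linear regressions with x1-dependent weights
w1 = tau1 * np.exp(-tau * (X[:, 0] - mu1) ** 2)
w2 = tau2 * np.exp(-tau * (X[:, 0] - mu2) ** 2)
comp1 = rng.random(n) < w1 / (w1 + w2)
y = np.where(comp1,
             beta1[0] + beta1[1] * X[:, 0] + s1 * rng.standard_normal(n),
             beta2[0] + beta2[1] * X[:, 0] + s2 * rng.standard_normal(n))
```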

Figure 2: The point predictions for 20 test samples of the covariates are plotted against x_1 and represented with circles (DP in blue, panels a-d, and EDP in red, panels e-h, for p = 1, 5, 10, 15), with the true predictions as black stars. The bars about the predictions depict the 95% credible intervals.

Table 2: Prediction error (empirical l̂_1 and l̂_2) for both models, for p = 1, 5, 10, 15.

The first main point to emphasize is the improved behavior of the posterior of the random partition for the EDP. We note that for both models, the posterior of the partition is spread out. This is because the space of partitions is very large and many partitions are very similar, differing only in a few subjects; thus, many partitions fit the data well. We depict representative partitions of both models with increasing p in Figure 1. Observations are plotted in the x_1 vs y space and colored according to the partition for the DP and the θ-partition for the EDP. As expected, for p = 1 the DP does well at recovering the true partition, but as clearly seen from Figure 1, for large values of p, the DP partition is comprised of many clusters, which are needed to approximate the multivariate density of X. In fact, the density of Y | x can be recovered with only two kernels regardless of p, and the θ-partitions of the EDP depicted in Figure 1, with only two θ-clusters, are very similar to the true configuration, even for increasing p.

Figure 3: Predictive density estimates (DP in blue and EDP in red) for the first new subject, for p = 1, 5, 10, 15, with the true conditional density in black. The pointwise 95% credible intervals are depicted with dashed lines (DP in blue and EDP in red).

Figure 4: Predictive density estimates (DP in blue and EDP in red) for the fifth new subject, for p = 1, 5, 10, 15, with the true conditional density in black. The pointwise 95% credible intervals are depicted with dashed lines (DP in blue and EDP in red).

Table 3: l_1 density regression error for both models, for p = 1, 5, 10, 15.

On the other hand, the (θ, ψ)-partition of the EDP (not shown) consists of many clusters and resembles the partition of the DP model. This behavior is representative of the posterior distribution of the random partition that, for the DP, has a large posterior mode of k and a large posterior mean of α for larger values of p, while most of the EDP θ-partitions are composed of only 2 clusters with only a handful of subjects placed in the incorrect cluster, and the posterior mean of α_θ is much smaller for large p. Table 1 summarizes the posterior of k for both models and the posterior of the precision parameters α, α_θ.

It is interesting to note that posterior samples of α_ψ(θ) for θ characteristic of the first cluster tend to be higher than posterior samples of α_ψ(θ) for the second cluster; that is, more clusters are needed to approximate the density of X within the first cluster. A non-constant α_ψ(θ) allows us to capture this behavior.

As discussed in Section 3, a main point of our proposal is the increased finite sample efficiency of the EDP model. To illustrate this, we create a test set with m = 200 new covariate values simulated from (17) with maximum dimension p = 15 and compute the true regression function E[Y_{n+j} | x_{n+j}] and conditional density f(y | x_{n+j}) for each new subject. To quantify the gain in efficiency of the EDP model, we calculate the point prediction and predictive density estimates from both models and compare them with the truth. Judging from both the empirical l_1 and l_2 prediction errors, the EDP model outperforms the DP model, with greater improvement for larger p; see Table 2. Figure 2 displays the prediction against x_1 for 20 new subjects. Recall that each new x_1 is associated with different values of (x_2, ..., x_p), which accounts for the somewhat erratic behavior of the prediction as a function of x_1 for increasing p. The comparison of the credible intervals is quite interesting. For p > 1, the unnecessarily wide credible intervals for the DP regression model stand out in the first row of Figure 2. This is due to small cluster sample sizes for the DP model with p > 1.

The density regression estimates for all new subjects were computed by evaluating (16) at a grid of y-values. As a measure of the performance of the models, the empirical l_1 distance between the true and estimated conditional densities for each of the new covariate values is shown in Table 3. Again the EDP model outperforms the DP model. Figures 3 and 4 display the true conditional density (in black) for two new covariate values with the estimated conditional densities in blue for the DP and red for the EDP. It is evident that for p > 1 the density regression estimates are improved and that the pointwise 95% credible intervals are almost uniformly wider both in y and x for the DP model, sometimes drastically so. It is important to note that while the credible intervals of the EDP model are considerably tighter, they still contain the true density.

6 Alzheimer's disease study

Our primary goal in this section is to show that the EDP model leads to improved inference in a real data study with the important goal of diagnosing Alzheimer's disease (AD). In particular, the EDP leads to improved predictive accuracy, tighter credible intervals around the predictive probability of having the disease, and a more interpretable clustering structure.

Alzheimer's disease is a prevalent form of dementia that slowly destroys memory and thinking skills, and eventually even the ability to carry out the simplest tasks. Unfortunately, a definitive diagnosis cannot be made until autopsy. However, the brain may show severe evidence of neurobiological damage even at early stages of the disease before the onset of memory disturbances. As this damage may be difficult


More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

Gibbs Sampling in Linear Models #2

Gibbs Sampling in Linear Models #2 Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Hybrid Dirichlet processes for functional data

Hybrid Dirichlet processes for functional data Hybrid Dirichlet processes for functional data Sonia Petrone Università Bocconi, Milano Joint work with Michele Guindani - U.T. MD Anderson Cancer Center, Houston and Alan Gelfand - Duke University, USA

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Lecture 16: Mixtures of Generalized Linear Models

Lecture 16: Mixtures of Generalized Linear Models Lecture 16: Mixtures of Generalized Linear Models October 26, 2006 Setting Outline Often, a single GLM may be insufficiently flexible to characterize the data Setting Often, a single GLM may be insufficiently

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

CMPS 242: Project Report

CMPS 242: Project Report CMPS 242: Project Report RadhaKrishna Vuppala Univ. of California, Santa Cruz vrk@soe.ucsc.edu Abstract The classification procedures impose certain models on the data and when the assumption match the

More information

Chapter 2. Data Analysis

Chapter 2. Data Analysis Chapter 2 Data Analysis 2.1. Density Estimation and Survival Analysis The most straightforward application of BNP priors for statistical inference is in density estimation problems. Consider the generic

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

Normalized kernel-weighted random measures

Normalized kernel-weighted random measures Normalized kernel-weighted random measures Jim Griffin University of Kent 1 August 27 Outline 1 Introduction 2 Ornstein-Uhlenbeck DP 3 Generalisations Bayesian Density Regression We observe data (x 1,

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Latent factor density regression models

Latent factor density regression models Biometrika (2010), 97, 1, pp. 1 7 C 2010 Biometrika Trust Printed in Great Britain Advance Access publication on 31 July 2010 Latent factor density regression models BY A. BHATTACHARYA, D. PATI, D.B. DUNSON

More information

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Bayesian Nonparametrics: Models Based on the Dirichlet Process Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Bayesian Nonparametric Models

Bayesian Nonparametric Models Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

arxiv: v1 [stat.me] 3 Mar 2013

arxiv: v1 [stat.me] 3 Mar 2013 Anjishnu Banerjee Jared Murray David B. Dunson Statistical Science, Duke University arxiv:133.449v1 [stat.me] 3 Mar 213 Abstract There is increasing interest in broad application areas in defining flexible

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Bayesian Linear Models

Bayesian Linear Models Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

36-720: Latent Class Models

36-720: Latent Class Models 36-720: Latent Class Models Brian Junker October 17, 2007 Latent Class Models and Simpson s Paradox Latent Class Model for a Table of Counts An Intuitive E-M Algorithm for Fitting Latent Class Models Deriving

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

Dirichlet Enhanced Latent Semantic Analysis

Dirichlet Enhanced Latent Semantic Analysis Dirichlet Enhanced Latent Semantic Analysis Kai Yu Siemens Corporate Technology D-81730 Munich, Germany Kai.Yu@siemens.com Shipeng Yu Institute for Computer Science University of Munich D-80538 Munich,

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,

More information

STAT 518 Intro Student Presentation

STAT 518 Intro Student Presentation STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Nonparametric Bayes Uncertainty Quantification

Nonparametric Bayes Uncertainty Quantification Nonparametric Bayes Uncertainty Quantification David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & ONR Review of Bayes Intro to Nonparametric Bayes

More information

Supplementary Material: A Robust Approach to Sequential Information Theoretic Planning

Supplementary Material: A Robust Approach to Sequential Information Theoretic Planning Supplementar aterial: A Robust Approach to Sequential Information Theoretic Planning Sue Zheng Jason Pacheco John W. Fisher, III. Proofs of Estimator Properties.. Proof of Prop. Here we show how the bias

More information

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models William Barcella 1, Maria De Iorio 1 and Gianluca Baio 1 1 Department of Statistical Science,

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Dirichlet Process Mixtures of Generalized Linear Models

Dirichlet Process Mixtures of Generalized Linear Models Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department

More information

The Metropolis-Hastings Algorithm. June 8, 2012

The Metropolis-Hastings Algorithm. June 8, 2012 The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Gaussian Mixture Model

Gaussian Mixture Model Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models

Lecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance

More information

Bayesian spatial hierarchical modeling for temperature extremes

Bayesian spatial hierarchical modeling for temperature extremes Bayesian spatial hierarchical modeling for temperature extremes Indriati Bisono Dr. Andrew Robinson Dr. Aloke Phatak Mathematics and Statistics Department The University of Melbourne Maths, Informatics

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles

Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Peter Green and John Lau University of Bristol P.J.Green@bristol.ac.uk Isaac Newton Institute, 11 December

More information

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS

Copyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Copright, 8, RE Kass, EN Brown, and U Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Chapter 6 Random Vectors and Multivariate Distributions 6 Random Vectors In Section?? we etended

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

A Process over all Stationary Covariance Kernels

A Process over all Stationary Covariance Kernels A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information