Improving prediction from Dirichlet process mixtures via enrichment
Sara K. Wade, David B. Dunson, Sonia Petrone, Lorenzo Trippa, for the Alzheimer's Disease Neuroimaging Initiative. November 15, 2013

Abstract

Flexible covariate-dependent density estimation can be achieved by modelling the joint density of the response and covariates as a Dirichlet process mixture. An appealing aspect of this approach is that computations are relatively easy. In this paper, we examine the predictive performance of these models with an increasing number of covariates. Even for a moderate number of covariates, we find that the likelihood for x tends to dominate the posterior of the latent random partition, degrading the predictive performance of the model. To overcome this, we suggest using a different nonparametric prior, namely an enriched Dirichlet process. Our proposal maintains a simple allocation rule, so that computations remain relatively simple. Advantages are shown through both predictive equations and examples, including an application to diagnosis of Alzheimer's disease.

Keywords. Bayesian nonparametrics; Density regression; Predictive distribution; Random partition; Urn scheme.

1 Introduction

Dirichlet process (DP) mixture models have become popular tools for Bayesian nonparametric regression. In this paper, we examine their behavior in prediction and aim to highlight the difficulties that emerge with increasing dimension of the covariate space. To overcome these difficulties, we suggest a simple extension based on a nonparametric prior developed in Wade et al. [2011] that maintains desirable conjugacy properties of the Dirichlet process and leads to improved prediction. The

Corresponding author: sara.wade@eng.cam.ac.uk, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu).
As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.ucla.edu/wp-content/uploads/how_to_apply/adni_acknowledgement_list.pdf.
motivating application is to Alzheimer's disease studies, where the focus is prediction of the disease status based on biomarkers obtained from neuroimages. In this problem, a flexible nonparametric approach is needed to account for the possible nonlinear behavior of the response and complex interaction terms, resulting in improved diagnostic accuracy. DP mixtures are widely used for Bayesian density estimation; see e.g. Ghosal [2010] and references therein. A common way to extend these methods to nonparametric regression and conditional density estimation is by modelling the joint distribution of the response and the covariates (X, Y) as a mixture of multivariate Gaussians (or more general kernels). The regression function and conditional density estimates are indirectly obtained from inference on the joint density, an idea which is similarly employed in classical kernel regression (Scott [1992], Chapter 8). This approach, which we call the joint approach, was first introduced by Müller et al. [1996], and subsequently studied by many others including Kang and Ghosal [2009], Shahbaba and Neal [2009], Hannah et al. [2011], Park and Dunson [2010], and Müller and Quintana [2010]. The DP model uses simple local linear regression models as building blocks and partitions the observed subjects into clusters, where within clusters, the linear regression model provides a good fit. Even though within clusters the model is parametric, globally, a wide range of complex distributions can describe the joint distribution, leading to a flexible model for both the regression function and the conditional density. Another related class of models is based on what we term the conditional approach. In such models, the conditional density of Y given x, f(· | x), is modelled directly, as a convolution of a parametric family f(· | x, θ) with an unknown mixing distribution P_x for θ.
A prior is then given on the family of distributions {P_x : x ∈ X} such that the P_x's are dependent. Examples for the law of {P_x : x ∈ X} start from the dependent DPs of MacEachern [1999], Gelfand et al. [2005] and include Griffin and Steel [2006], Dunson and Park [2008], Ren et al. [2011], Chung and Dunson [2009], and Rodriguez and Dunson [2011], to name just a few. Such conditional models can approximate a wide range of response distributions that may change flexibly with the covariates. However, computations are often quite burdensome. One of the reasons the model examined here is so powerful is its simplicity. Together, the joint approach and the clustering of the DP provide a built-in technique to allow for changes in the response distribution across the covariate space, yet it is simple and generally less computationally intensive than the nonparametric conditional models based on dependent DPs. The random allocation of subjects into groups in joint DP mixture models is driven by the need to obtain a good approximation of the joint distribution of X and Y. This means that subjects with similar covariates and a similar relationship between the response and covariates will tend to cluster together. However, difficulties emerge as p, the dimension of the covariate space, increases. As we will detail, even for moderately large p the likelihood of x tends to dominate the posterior of the random partition, so that clusters are based mainly on similarity in the covariate space. This behaviour is quite unappealing if the marginal density of X is
complex, as is typical in high dimensions, because it causes the posterior to concentrate on partitions with many small clusters, as many kernels are needed to describe f(x). This occurs even if the conditional density of Y given x is more well-behaved, meaning that a few kernels suffice for its approximation. Typical results include poor estimates of the regression function and conditional density with unnecessarily wide credible intervals due to small clusters and, consequently, poor prediction. This inefficient performance may not disappear with increasing samples. On one hand, appealing to recent theoretical results (Wu and Ghosal [2008], Wu and Ghosal [2010], Tokdar [2011]), one could expect that as the sample size increases, the posterior on the unknown density f(x, y) induced by the DP joint mixture model is consistent at the true density. In turn, posterior consistency of the joint is likely to have positive implications for the behavior of the random conditional density and regression function; see Rodriguez et al. [2009], Hannah et al. [2011], and Norets and Pelenis [2012] for some developments in this direction. However, the unappealing behaviour of the random partition that we described above could be reflected in worse convergence rates. Indeed, recent results by Efromovich [2007] suggest that if the conditional density is smoother than the joint, it can be estimated at a faster rate. Thus, improving inference on the random partition to take into account the different degrees of smoothness of f(x) and f(y | x) appears to be a crucial issue. Our goal in this paper is to show that a simple modification of the nonparametric prior on the mixing distribution, one that better models the random partition, can more efficiently convey the information present in the sample, leading to more efficient conditional density estimates in terms of smaller errors and less variability, for finite samples.
To achieve this aim, we consider a prior that allows local clustering, that is, the clustering structure for the marginal of X and the regression of Y on x may be different. We achieve this by replacing the DP with the enriched Dirichlet process (EDP) developed in Wade et al. [2011]. Like the DP, the EDP is a conjugate nonparametric prior, but it allows a nested clustering structure that can overcome the above issues and lead to improved predictions. An alternative proposal is outlined in Petrone and Trippa [2009] and in unpublished work by Dunson et al. [2011]. However, the EDP offers a richer parametrization. In a Bayesian nonparametric framework, several extensions of the DP have been proposed to allow local clustering (see e.g. Dunson et al. [2008], Dunson [2009], Petrone et al. [2009]). However, the greater flexibility is often achieved at the price of more complex computations. Instead, our proposal maintains an analytically computable allocation rule, and therefore, computations are a straightforward extension of those used for the joint DP mixture model. Thus, our main contributions are to highlight the problematic behavior in prediction of the joint DP mixture model for increasing p and also to offer a simple solution based on the EDP that maintains computational ease. In addition, we give results on the random nested partitions that are implied by the proposed prior. This paper is organized as follows. In Section 2, we review the joint DP mixture model, discuss the behavior of the random partition, and examine the predictive performance. In Section 3, we propose a joint EDP mixture model, derive its random
partition model, and emphasize the predictive improvements of the model. Section 4 covers computational procedures. We provide a simulated example in Section 5 to demonstrate how the EDP model can lead to more efficient estimators by making better use of the information contained in the sample. Finally, in Section 6, we apply the model to predict Alzheimer's disease status based on measurements of various brain structures.

2 Joint DP mixture model

A joint DP mixture model for multivariate density estimation and nonparametric regression assumes that

(X_i, Y_i) | P ~ iid f(x, y | P) = ∫ K(x, y; ξ) dP(ξ),

where X_i is p-dimensional, Y_i is usually univariate, and the mixing distribution P is given a DP prior with scale parameter α > 0 and base measure P_0, denoted by P ~ DP(αP_0). Due to the a.s. discrete nature of the DP, the model reduces to a countable mixture

f(x, y | P) = Σ_{j=1}^∞ w_j K(x, y; ξ_j),

where the mixing weights w_j have a stick-breaking prior with parameter α and ξ_j ~ iid P_0, independently of the w_j. This model was first developed for Bayesian nonparametric regression by Müller, Erkanli, and West [1996], who assume multivariate Gaussian kernels N_{p+1}(µ, Σ) with ξ = (µ, Σ) and use a conjugate normal inverse-Wishart prior as the base measure of the DP. However, for even moderately large p, this approach is practically unfeasible. Indeed, the computational cost of dealing with the full (p+1)-by-(p+1) covariance matrix greatly increases with large p. Furthermore, the conjugate inverse-Wishart prior is known to be too poorly parametrized; in particular, there is a single parameter ν to control variability, regardless of p (see Consonni and Veronese [2001]). A more effective formulation of this model has been recently proposed by Shahbaba and Neal [2009], based on two simple modifications.
First, the joint kernel is decomposed as the product of the marginal of X and the conditional of Y | x, and the parameter space is consequently expressed in terms of the parameters ψ of the marginal and the parameters θ of the conditional. This is a classic reparametrization which, in the Gaussian case, is the basis of generalizations of the inverse-Wishart conjugate prior; see Brown et al. [1994]. Secondly, they suggest using simple kernels, assuming local independence among the covariates, i.e. the covariance matrix of the kernel for X is diagonal. These two simple modifications allow several important improvements. Computationally, reducing the covariance matrix to p variances can greatly ease calculations. Regarding flexibility of the base measure, the conjugate prior now includes a separate parameter to control variability for each of the p variances.
Furthermore, the model still allows for local correlations between Y and X through the conditional kernel of Y | x and the parameter θ. In addition, the factorization in terms of the marginal and conditional and the assumption of local independence of the covariates allow for easy inclusion of discrete or other types of response or covariates. Note that even though, within each component, we assume independence of the covariates, globally, there may be dependence. This extended model, in full generality, can be described through latent parameter vectors as follows:

Y_i | x_i, θ_i ~ ind F_y(· | x_i, θ_i), X_i | ψ_i ~ ind F_x(· | ψ_i), (1)
(θ_i, ψ_i) | P ~ iid P, P ~ DP(α P_{0θ} × P_{0ψ}).

Here the base measure P_{0θ} × P_{0ψ} of the DP assumes independence between the θ and ψ parameters, as this is the structurally conjugate prior for (θ, ψ) and thus results in simplified computations, but the model could be extended to more general choices. We further assume that P_{0θ} and P_{0ψ} are absolutely continuous, with densities p_{0θ} and p_{0ψ}. Since P is discrete a.s., integrating out the subject-specific parameters (θ_i, ψ_i), the model for the joint density is

f(y_i, x_i | P) = Σ_{j=1}^∞ w_j K(y_i | x_i, θ_j) K(x_i | ψ_j), (2)

where the kernels K(y | x, θ) and K(x | ψ) are the densities associated to F_y(· | x, θ) and F_x(· | ψ). In the Gaussian case, K(x | ψ) is the product of N(µ_{x,h}, σ²_{x,h}) Gaussians, h = 1, ..., p, and K(y | x, θ) is N(x̃β, σ²_y), where x̃ = (1, x'). Shahbaba and Neal focus on the case when Y is categorical and the local model for Y | x is a multinomial logit. Hannah et al. [2011] extend the model to the case when, locally, the conditional distribution of Y | x belongs to the class of generalized linear models (GLM), that is, the distribution of the response belongs to the exponential family and the mean of the response can be expressed as a function of a linear combination of the covariates.
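A minimal sketch of model (2) with Gaussian kernels may help fix ideas: a truncated stick-breaking draw of the weights, locally independent normal kernels for X, and a local linear-Gaussian kernel for Y | x. All numerical settings below (p, the truncation level, the parameter values) are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, n_comp, rng):
    """Truncated stick-breaking weights w_j for a DP with precision alpha."""
    v = rng.beta(1.0, alpha, size=n_comp)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()  # renormalise to account for the truncation

def normal_pdf(z, mu, sd):
    return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# One hypothetical draw of P: p = 2 covariates, 25 components.
# psi_j = (mu_x, sd_x) gives the locally independent x-kernel,
# theta_j = (beta, sd_y) gives the local linear model for y | x.
p, n_comp, alpha = 2, 25, 1.0
w = stick_breaking(alpha, n_comp, rng)
mu_x = rng.normal(0, 2, size=(n_comp, p))
sd_x = np.full((n_comp, p), 0.5)
beta = rng.normal(0, 1, size=(n_comp, p + 1))
sd_y = np.full(n_comp, 0.3)

def joint_density(y, x):
    """f(y, x | P) = sum_j w_j K(y | x, theta_j) prod_h K(x_h | psi_{j,h})."""
    xt = np.concatenate(([1.0], x))                     # x-tilde = (1, x')
    kx = np.prod(normal_pdf(x, mu_x, sd_x), axis=1)     # prod_h K(x_h | psi_{j,h})
    ky = normal_pdf(y, beta @ xt, sd_y)                 # K(y | x, theta_j)
    return float(np.sum(w * ky * kx))

f = joint_density(0.0, np.zeros(p))
```

Note how the local independence of the covariates reduces the x-kernel to a product of p univariate normals, as in the Shahbaba and Neal formulation discussed above.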
Model (2) allows for flexible conditional densities

f(y | x, P) = Σ_{j=1}^∞ [ w_j K(x | ψ_j) / Σ_{j'=1}^∞ w_{j'} K(x | ψ_{j'}) ] K(y | x, θ_j) ≡ Σ_j w_j(x) K(y | x, θ_j),

and nonlinear regression

E[Y | x, P] = Σ_j w_j(x) E[Y | x, θ_j],

with E[Y | x, θ_j] = x̃β_j for Gaussian kernels. Thus, the model provides flexible kernel-based density and regression estimation, and MCMC computations are standard. However, the DP only allows joint clusters of the parameters (θ_i, ψ_i), i = 1, ..., n. We underline the drawbacks of such joint clustering in the following subsections.
2.1 Random partition and inference

One of the crucial features of DP mixture models is the dimension reduction and clustering obtained due to the almost sure discreteness of P. In fact, this implies that a sample (θ_i, ψ_i), i = 1, ..., n, from a DP presents ties with positive probability and can be conveniently described in terms of the random partition and the distinct values. Using a standard notation, we denote the random partition by a vector of cluster allocation labels ρ_n = (s_1, ..., s_n), with s_i = j if (θ_i, ψ_i) is equal to the j-th unique value observed, (θ*_j, ψ*_j), for j = 1, ..., k, where k = k(ρ_n) is the number of groups in the partition ρ_n. Additionally, we will denote by S_j = {i : s_i = j} the set of subject indices in the j-th cluster and use the notation y*_j = {y_i : i ∈ S_j} and x*_j = {x_i : i ∈ S_j}. We also make use of the short notation x_{1:n} = (x_1, ..., x_n). Often, the random probability measure P is integrated out and inference is based on the posterior of the random partition ρ_n and the cluster-specific parameters (θ*, ψ*) = (θ*_j, ψ*_j)_{j=1}^k:

p(ρ_n, θ*, ψ* | x_{1:n}, y_{1:n}) ∝ p(ρ_n) Π_{j=1}^k p_{0θ}(θ*_j) p_{0ψ}(ψ*_j) Π_{i ∈ S_j} K(x_i | ψ*_j) K(y_i | x_i, θ*_j).

As is well known (Antoniak [1974]), the prior induced by the DP on the random partition is

p(ρ_n) ∝ α^k Π_{j=1}^k Γ(n_j),

where n_j is the size of S_j. Marginalizing out (θ*, ψ*), the posterior of the random partition is

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α^k Π_{j=1}^k Γ(n_j) h_x(x*_j) h_y(y*_j | x*_j). (3)

Notice that the marginal likelihood component in (3), say h(x_{1:n}, y_{1:n} | ρ_n), is factorized as the product of the cluster-specific marginal likelihoods h_x(x*_j) h_y(y*_j | x*_j), where

h_x(x*_j) = ∫_Ψ Π_{i ∈ S_j} K(x_i | ψ) dP_{0ψ}(ψ) and h_y(y*_j | x*_j) = ∫_Θ Π_{i ∈ S_j} K(y_i | x_i, θ) dP_{0θ}(θ).
Given the partition and the data, the conditional distribution of the distinct values is simple, as they are independent across clusters, with posterior densities

p(θ*_j | ρ_n, x_{1:n}, y_{1:n}) ∝ p_{0θ}(θ*_j) Π_{i ∈ S_j} K(y_i | x_i, θ*_j),
p(ψ*_j | ρ_n, x_{1:n}, y_{1:n}) ∝ p_{0ψ}(ψ*_j) Π_{i ∈ S_j} K(x_i | ψ*_j). (4)

Thus, given ρ_n, the cluster-specific parameters (θ*_j, ψ*_j) are updated based only on the observations in cluster S_j. Computations are simple if p_{0θ} and p_{0ψ} are conjugate priors. The above expressions show the crucial role of the random partition. From equation (3), we have that given the data, subjects are clustered in groups with similar behaviour in the covariate space and a similar relationship with the response.
However, even for moderate p the likelihood for x tends to dominate the posterior of the random partition, so that clusters are determined only by similarity in the covariate space. This is particularly evident when the covariates are assumed to be locally independent, i.e. K(x_i | ψ*_j) = Π_{h=1}^p K(x_{i,h} | ψ*_{j,h}). Clearly, for large p, the scale and magnitude of changes in Π_{h=1}^p K(x_{i,h} | ψ*_{j,h}) will wash out any information given in the univariate likelihood K(y_i | θ*_j, x_i). Indeed, this is just the behavior we have observed in practice in running simulations for large p (results not shown). For a simple example demonstrating how the number of components needed to approximate the marginal of X can blow up with p, imagine X is uniformly distributed on a cuboid of side length r > 1. Consider approximating

f_0(x) = (1/r^p) 1(x ∈ [0, r]^p) by f_k(x) = Σ_{j=1}^k w_j N_p(x; µ_j, σ²_j I_p).

Since the true distribution of x is uniform on the cuboid [0, r]^p, to obtain a good approximation, the weighted components must place most of their mass on values of x contained in the cuboid. Let B_σ(µ) denote a ball of radius σ centered at µ. If a random vector V is normally distributed with mean µ and variance σ² I_p, then for 0 < ε < 1,

P(V ∈ B_{σz(ε)}(µ)) = 1 − ε, where z(ε)² = (χ²_p)^{−1}(1 − ε),

i.e. the square of z(ε) is the (1 − ε) quantile of the chi-squared distribution with p degrees of freedom. For small ε, this means that the density of V places most of its mass on values contained in a ball of radius σz(ε) centered at µ. For ε > 0, define

f̃_k(x) = Σ_{j=1}^k w_j N(x; µ_j, σ²_j I_p) 1(x ∈ B_{σ_j z(ε_j)}(µ_j)), where ε_j = ε/(k w_j).

Then, f̃_k is close to f_k (in the L_1 sense):

∫_{R^p} |f_k(x) − f̃_k(x)| dx = ∫_{R^p} Σ_{j=1}^k w_j N(x; µ_j, σ²_j I_p) 1(x ∈ B^c_{σ_j z(ε_j)}(µ_j)) dx = ε.

For f̃_k to be close to f_0, the parameters µ_j, σ_j, w_j need to be chosen so that the balls B_{σ_j z(ε/(k w_j))}(µ_j) are contained in the cuboid.
That means that the centers of the balls are contained in the cuboid,

µ_j ∈ [0, r]^p, (5)

with further constraints on σ²_j and w_j, so that the radius is small enough. In particular,

σ_j z(ε/(k w_j)) ≤ min(µ_{j,1}, r − µ_{j,1}, ..., µ_{j,p}, r − µ_{j,p}) ≤ r/2. (6)
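A quick numerical sketch of constraints (5)-(6): any admissible ball has radius at most r/2, and the ratio of its volume to the cuboid's volume r^p collapses as p grows. The quantile z(ε) is estimated here by Monte Carlo rather than an exact chi-squared inverse, and the specific values of r, ε, and p are illustrative choices of ours.

```python
import numpy as np
from math import lgamma, log, pi

rng = np.random.default_rng(1)

def z_eps(p, eps, n_mc=200_000):
    """z(eps) = sqrt of the (1 - eps) quantile of chi^2_p, Monte Carlo estimate:
    an N_p(mu, sigma^2 I_p) vector falls in B_{sigma z(eps)}(mu) w.p. 1 - eps."""
    return float(np.sqrt(np.quantile(rng.chisquare(p, size=n_mc), 1 - eps)))

def log_ball_volume(radius, p):
    """log volume of a p-dimensional ball of the given radius."""
    return 0.5 * p * log(pi) + p * log(radius) - lgamma(p / 2 + 1)

r, eps = 2.0, 0.05
log_ratios = []
for p in (2, 10, 50):
    z = z_eps(p, eps)
    sigma_max = (r / 2) / z        # largest sigma permitted by constraint (6)
    ball_radius = sigma_max * z    # = r/2, independent of the quantile
    # log( vol(ball) / vol(cuboid) )
    log_ratios.append(log_ball_volume(ball_radius, p) - p * log(r))
# log_ratios drops steeply with p: each admissible ball covers a vanishing
# fraction of the cuboid, so the number of components must blow up.
```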
However, as p increases the volume of the cuboid goes to infinity, but the volume of any ball B_{σ_j z(ε/(k w_j))}(µ_j) defined by (5) and (6) goes to 0 (see Clarke et al. [2009], Section 1.1). Thus, just to reasonably cover the cuboid with the balls of interest, the number of components must increase dramatically, and more so when we consider the approximation error of the density estimate. Now, as an extreme example, imagine that f_0(y | x) is a linear regression model. Even though one component is sufficient for f_0(y | x), a large number of components will be required to approximate f_0(x), especially as p increases. It is evident from (4) that this behavior of the random partition also negatively affects inference on the cluster-specific parameters. In particular, when many kernels are required to approximate the density of X with few observations within each cluster, the posterior for θ*_j may be based on a sample of unnecessarily small size, leading to a flat posterior with an unreliable posterior mean and a large influence of the prior.

2.2 Covariate-dependent urn scheme and prediction

Difficulties associated with the behavior of the random partition also deteriorate the predictive performance of the model. Prediction in DP joint mixture models is based on a covariate-dependent urn scheme (Park and Dunson [2010]; Müller and Quintana [2010]), such that conditionally on the partition ρ_n and x_{1:n+1}, the cluster allocation s_{n+1} of a new subject with covariate value x_{n+1} is determined as

s_{n+1} | ρ_n, x_{1:n+1} ~ (ω_{k+1}(x_{n+1})/c_0) δ_{k+1} + Σ_{j=1}^k (ω_j(x_{n+1})/c_0) δ_j, (7)

where

ω_{k+1}(x_{n+1}) = (α/(α + n)) h_{0,x}(x_{n+1}),
ω_j(x_{n+1}) = (n_j/(α + n)) ∫ K(x_{n+1} | ψ) p(ψ | x*_j) dψ ≡ (n_j/(α + n)) h_{j,x}(x_{n+1}),

and c_0 = p(x_{n+1} | ρ_n, x_{1:n}) is the normalizing constant.
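The allocation rule (7) can be sketched in a deliberately simple setting: one covariate, a Gaussian x-kernel with known scale, and a conjugate normal prior on the kernel mean, so that the predictives h_{j,x} and h_{0,x} are available in closed form. This toy setup, including the parameter values, is our own simplification, not the specification used in the paper's examples.

```python
import numpy as np

def normal_pdf(z, mu, sd):
    return np.exp(-0.5 * ((z - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def allocation_probs(x_new, clusters, alpha, mu0, tau0, sigma):
    """Allocation probabilities of the urn scheme (7) for p = 1, a N(., sigma^2)
    x-kernel with known sd, and a conjugate N(mu0, tau0^2) prior on the mean.
    `clusters` is a list of 1-d arrays of covariate values, one per cluster.
    Returns [P(join cluster 1), ..., P(join cluster k), P(new cluster)]."""
    n = sum(len(c) for c in clusters)
    w = []
    for xj in clusters:
        nj = len(xj)
        # Posterior of the cluster mean given x*_j, then the cluster
        # predictive h_{j,x}(x_new): a normal with inflated variance.
        prec = 1.0 / tau0**2 + nj / sigma**2
        mu_post = (mu0 / tau0**2 + xj.sum() / sigma**2) / prec
        sd_pred = np.sqrt(sigma**2 + 1.0 / prec)
        w.append(nj / (alpha + n) * normal_pdf(x_new, mu_post, sd_pred))
    # New cluster: prior predictive h_{0,x}(x_new).
    w.append(alpha / (alpha + n) *
             normal_pdf(x_new, mu0, np.sqrt(sigma**2 + tau0**2)))
    w = np.array(w)
    return w / w.sum()  # dividing by c_0 = p(x_{n+1} | rho_n, x_{1:n})

# Two clusters, centred near -2 and +2: a new subject with x = 2 is sent
# to the second cluster with high probability, as h_{j,x} measures similarity.
clusters = [np.array([-2.1, -1.9, -2.0]), np.array([1.8, 2.2, 2.0])]
probs = allocation_probs(2.0, clusters, alpha=1.0, mu0=0.0, tau0=2.0, sigma=0.5)
```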
This urn scheme is a generalization of the classic Pólya urn scheme that allows the cluster allocation probability to depend on the covariates; the probability of allocation to cluster j depends on the similarity of x_{n+1} to the x_i in cluster j, as measured by the predictive density h_{j,x}. See Park and Dunson [2010] for more details. From the urn scheme (7) one obtains the structure of the prediction. The predictive density at y for a new subject with a covariate value of x_{n+1} is computed as

f(y | y_{1:n}, x_{1:n+1}) = Σ_{ρ_n} Σ_{s_{n+1}} f(y | y_{1:n}, x_{1:n+1}, ρ_n, s_{n+1}) p(s_{n+1} | y_{1:n}, x_{1:n+1}, ρ_n) p(ρ_n | y_{1:n}, x_{1:n+1})
= Σ_{ρ_n} [ (ω_{k+1}(x_{n+1})/c) h_{0,y}(y | x_{n+1}) + Σ_{j=1}^k (ω_j(x_{n+1})/c) h_{j,y}(y | x_{n+1}) ] p(ρ_n | x_{1:n}, y_{1:n}), (8)

where c = p(x_{n+1} | x_{1:n}, y_{1:n}). Thus, given the partition, the conditional predictive density is a weighted average of the prior guess h_{0,y}(y | x) = ∫ K(y | x, θ) dP_{0θ}(θ) and the cluster-specific predictive densities of y at x_{n+1},

h_{j,y}(y | x_{n+1}) = ∫ K(y | x_{n+1}, θ) p(θ | x*_j, y*_j) dθ,

with covariate-dependent weights. The predictive density is obtained by averaging with respect to the posterior of ρ_n. However, for moderate to large p, the posterior of the random partition suffers the drawbacks discussed in the previous subsection. In particular, too many small x-clusters lead to unreliable within-cluster predictives based on small sample sizes. Furthermore, the measure which determines the similarity of x_{n+1} and the j-th cluster will be too rigid. Consequently, the resulting overall prediction may be quite poor. These drawbacks will also affect the point prediction, which, under quadratic loss, is

E[Y_{n+1} | y_{1:n}, x_{1:n+1}] = Σ_{ρ_n} [ (ω_{k+1}(x_{n+1})/c) E_0[Y_{n+1} | x_{n+1}] + Σ_{j=1}^k (ω_j(x_{n+1})/c) E_j[Y_{n+1} | x_{n+1}] ] p(ρ_n | x_{1:n}, y_{1:n}), (9)

where E_0[Y_{n+1} | x_{n+1}] is the expectation of Y_{n+1} with respect to h_{0,y} and E_j[Y_{n+1} | x_{n+1}] = E[E[Y_{n+1} | x_{n+1}, θ*_j] | x*_j, y*_j] is the expectation of Y_{n+1} with respect to h_{j,y}.

Example. When K(y | x, θ) = N(y; x̃β, σ²_y) and the prior for (β, σ²_y) is the multivariate normal-inverse gamma with parameters (β_0, C^{-1}, a_y, b_y), (9) is

Σ_{ρ_n} [ (ω_{k+1}(x_{n+1})/c) x̃_{n+1}β_0 + Σ_{j=1}^k (ω_j(x_{n+1})/c) x̃_{n+1}β̂_j ] p(ρ_n | x_{1:n}, y_{1:n}), (10)
where β̂_j = Ĉ_j^{-1}(Cβ_0 + X'_j y*_j), Ĉ_j = C + X'_j X_j, and X_j is an n_j-by-(p+1) matrix with rows x̃_i for i ∈ S_j, and (8) is

(ω_{k+1}(x_{n+1})/c) T(y; x̃_{n+1}β_0, (b_y/a_y) W^{-1}, 2a_y) + Σ_{j=1}^k (ω_j(x_{n+1})/c) T(y; x̃_{n+1}β̂_j, (b̂_{y,j}/â_{y,j}) W_j^{-1}, 2â_{y,j}),

where T(·; µ, σ², ν) denotes the density of a random variable V such that (V − µ)/σ has a t-distribution with ν degrees of freedom,

â_{y,j} = a_y + n_j/2,
b̂_{y,j} = b_y + (1/2)(y*_j − X_jβ_0)'(I_{n_j} − X_j Ĉ_j^{-1} X'_j)(y*_j − X_jβ_0),
W = 1 − x̃_{n+1}(C + x̃'_{n+1}x̃_{n+1})^{-1} x̃'_{n+1},

and W_j is defined as W with C replaced by Ĉ_j.

3 Joint EDP mixture model

As seen, the global clustering of the DP prior on P(Θ × Ψ), the space of probability measures on Θ × Ψ, does not allow one to efficiently model the types of data discussed in the previous section. Instead, it is desirable to use a nonparametric prior that allows many ψ-clusters, to fit the complex marginal of X, and fewer θ-clusters. At the same time, we want to preserve the desirable conjugacy properties of the DP, in order to maintain fairly simple computations. To these aims, our proposal is to replace the DP with the more richly parametrized enriched Dirichlet process of Wade et al. [2011]. The EDP is conjugate and has an analytically computable urn scheme, but it gives a nested partition structure that can model the desired clustering behavior. Recall that model (2) was obtained by decomposing the joint kernel as the product of the marginal and conditional kernels. The EDP is a natural alternative for the mixing distribution of this model, as it is similarly based on the idea of expressing the unknown random joint probability measure P of (θ, ψ) in terms of the random marginal and conditionals. This requires the choice of an ordering of θ and ψ, and this choice is problem specific. In the situation described here, it is natural to consider the random marginal distribution P_θ and the random conditional P_{ψ|θ}, to obtain the desired clustering structure.
Then, the EDP prior is defined by

P_θ ~ DP(α_θ P_{0θ}),
P_{ψ|θ}(· | θ) ~ DP(α_ψ(θ) P_{0ψ|θ}(· | θ)), θ ∈ Θ, (11)

where the P_{ψ|θ}(· | θ), θ ∈ Θ, are independent among themselves and of P_θ. Together these assumptions induce a prior for the random joint P through the joint law of
the marginal and conditionals and the mapping (P_θ, P_{ψ|θ}) → ∫ P_{ψ|θ}(· | θ) dP_θ(θ). The prior is parametrized by the base measure P_0, expressed as

P_0(A × B) = ∫_A P_{0ψ|θ}(B | θ) dP_{0θ}(θ)

for all Borel sets A and B, and by a precision parameter α_θ associated to θ and a collection of precision parameters α_ψ(θ), for every θ ∈ Θ, associated to ψ | θ. Note the contrast with the DP, which only allows one precision parameter to regulate the uncertainty around P_0. The proposed EDP mixture model for regression is as in (2), but with P ~ EDP(α_θ, α_ψ(θ), P_0) in place of P ~ DP(α P_{0θ} × P_{0ψ}). In general, P_0 is such that θ and ψ are dependent, but here we assume the same structurally conjugate base measure as for the DP model (2), so P_0 = P_{0θ} × P_{0ψ}. Using the square-breaking representation of the EDP (Wade et al. [2011], Proposition 4) and integrating out the (θ_i, ψ_i) parameters, the model for the joint density is

f(x, y | P) = Σ_{j=1}^∞ Σ_{l=1}^∞ w_j w_{l|j} K(x | ψ_{l|j}) K(y | x, θ_j).

This gives a mixture model for the conditional densities with more flexible weights:

f(y | x, P) = Σ_{j=1}^∞ [ w_j Σ_{l=1}^∞ w_{l|j} K(x | ψ_{l|j}) / Σ_{j'=1}^∞ w_{j'} Σ_{l'=1}^∞ w_{l'|j'} K(x | ψ_{l'|j'}) ] K(y | x, θ_j) ≡ Σ_j w_j(x) K(y | x, θ_j).

3.1 Random partition and inference

The advantage of the EDP is the implied nested clustering. The EDP partitions subjects into θ-clusters and ψ-subclusters within each θ-cluster, allowing the use of more kernels to describe the marginal of X for each kernel used for the conditional of Y | x. The random partition model induced from the EDP can be described as a nested Chinese Restaurant Process (nCRP). First, customers choose restaurants according to the CRP induced by P_θ ~ DP(α_θ P_{0θ}); that is, with probability proportional to the number n_j of customers eating at restaurant j, the (n+1)-th customer eats at restaurant j, and with probability proportional to α_θ, she eats at a new restaurant. Restaurants are then colored iid with colors θ*_j ~ P_{0θ}.
Within restaurant j, customers sit at tables as in the CRP induced by P_{ψ|θ*_j} ~ DP(α_ψ(θ*_j) P_{0ψ|θ}(· | θ*_j)). Tables in restaurant j are then colored with colors ψ*_{l|j} ~ iid P_{0ψ|θ}(· | θ*_j). This differs from the nCRP proposed by Blei et al. [2010], which had the alternative aim of learning topic hierarchies by clustering parameters (topics) hierarchically along a tree of infinite CRPs. In particular, each subject follows a path down the
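The restaurants-then-tables scheme above can be simulated directly. The sketch below takes α_ψ(θ) constant for simplicity (the constant-α_ψ special case discussed later); all numerical settings are illustrative.

```python
import numpy as np

def nested_crp(n, alpha_theta, alpha_psi, rng):
    """Simulate the EDP's nested partition: each customer first picks a
    restaurant via a CRP(alpha_theta), then a table within that restaurant
    via a CRP(alpha_psi).  Returns restaurant labels s_y, table labels s_x,
    restaurant sizes n_j, and table sizes n_{l|j}."""
    s_y, s_x = [], []
    rest_sizes = []        # n_j: customers per restaurant
    table_sizes = []       # table_sizes[j][l] = n_{l|j}
    for _ in range(n):
        # Restaurant choice: existing restaurants weighted by n_j,
        # a new restaurant weighted by alpha_theta.
        probs = np.array(rest_sizes + [alpha_theta], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(rest_sizes):
            rest_sizes.append(0)
            table_sizes.append([])
        rest_sizes[j] += 1
        # Table choice within restaurant j: existing tables weighted by
        # n_{l|j}, a new table weighted by alpha_psi.
        tprobs = np.array(table_sizes[j] + [alpha_psi], dtype=float)
        l = rng.choice(len(tprobs), p=tprobs / tprobs.sum())
        if l == len(table_sizes[j]):
            table_sizes[j].append(0)
        table_sizes[j][l] += 1
        s_y.append(j)
        s_x.append(l)
    return s_y, s_x, rest_sizes, table_sizes

rng = np.random.default_rng(2)
s_y, s_x, nj, nlj = nested_crp(200, alpha_theta=0.5, alpha_psi=5.0, rng=rng)
# Small alpha_theta favours few theta-clusters; large alpha_psi favours many
# psi-subclusters within each: exactly the asymmetry the EDP is built for.
```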
tree according to a sequence of nested CRPs, and the parameters of subject i are associated with the cluster visited at a latent subject-specific level l of this path. Although related, the EDP is not a special case with the tree depth fixed to 2; the EDP defines a prior on a multivariate random probability measure on Θ × Ψ and induces a nested partition of the multivariate parameter (θ, ψ), where the first level of the tree corresponds to the clustering of the θ parameters and the second corresponds to the clustering of the ψ parameters. A generalization of the EDP to a depth of D is related to Blei et al.'s nCRP with depth D, but only if one regards the parameters of each subject as the vector of parameters (ξ_1, ..., ξ_D) associated to each level of the tree. Furthermore, this generalization of the EDP would allow a more flexible specification of the mass parameters and possible correlation among the nested parameters. The nested partition of the EDP is described by ρ_n = (ρ_{n,y}, ρ_{n,x}), where ρ_{n,y} = (s_{y,1}, ..., s_{y,n}) and ρ_{n,x} = (s_{x,1}, ..., s_{x,n}), with s_{y,i} = j if θ_i = θ*_j, the j-th distinct θ-value in order of appearance, and s_{x,i} = l if ψ_i = ψ*_{l|j}, the l-th color that appeared inside the j-th θ-cluster. Additionally, we use the notation S_{j+} = {i : s_{y,i} = j}, with size n_j, j = 1, ..., k, and S_{l|j} = {i : s_{y,i} = j, s_{x,i} = l}, with size n_{l|j}, l = 1, ..., k_j. The unique parameters will be denoted by θ* = (θ*_j)_{j=1}^k and ψ* = (ψ*_{1|j}, ..., ψ*_{k_j|j})_{j=1}^k. Furthermore, we use the notation ρ_{n_j,x} = (s_{x,i} : i ∈ S_{j+}) and y*_j = {y_i : i ∈ S_{j+}}, x*_j = {x_i : i ∈ S_{j+}}, x*_{l|j} = {x_i : i ∈ S_{l|j}}.

Proposition 3.1 The probability law of the nested random partition defined from the EDP is

p(ρ_n) = (Γ(α_θ)/Γ(α_θ + n)) α_θ^k Π_{j=1}^k [ ∫_Θ α_ψ(θ)^{k_j} (Γ(α_ψ(θ))Γ(n_j)/Γ(α_ψ(θ) + n_j)) dP_{0θ}(θ) ] Π_{l=1}^{k_j} Γ(n_{l|j}).

Proof.
From the independence of the random conditional distributions among θ ∈ Θ,

p(ρ_n, θ*) = p(ρ_{n,y}) Π_{j=1}^k p_{0θ}(θ*_j) p(ρ_{n,x} | ρ_{n,y}, θ*) = p(ρ_{n,y}) Π_{j=1}^k p_{0θ}(θ*_j) p(ρ_{n_j,x} | θ*_j).

Next, using the results on the random partition model of the DP (Antoniak [1974]), we have

p(ρ_n, θ*) = (Γ(α_θ)/Γ(α_θ + n)) α_θ^k Π_{j=1}^k p_{0θ}(θ*_j) α_ψ(θ*_j)^{k_j} (Γ(α_ψ(θ*_j))Γ(n_j)/Γ(α_ψ(θ*_j) + n_j)) Π_{l=1}^{k_j} Γ(n_{l|j}).

Integrating out θ* leads to the result.

From Proposition 3.1, we gain an understanding of the types of partitions preferred by the EDP and the effect of the parameters. If, for all θ, α_ψ(θ) = α_θ P_{0θ}({θ}), that is, α_ψ(θ) = 0 if P_{0θ} is nonatomic, we are back to the DP random partition model; see
Proposition 2 of Wade et al. [2011]. In the case when P_{0θ} is nonatomic, this means that the conditional P_{ψ|θ} is degenerate at some random location with probability one (for each restaurant, one table). In general, α_ψ(θ) may be a flexible function of θ, reflecting the fact that within some θ-clusters more kernels may be required for a good approximation of the marginal of X. In practice, a common situation that we observe is a high value of α_ψ(θ) for average values of θ and lower values of α_ψ(θ) for more extreme θ values, capturing homogeneous outlying groups. In this case, a small value of α_θ will encourage few θ-clusters, and, given θ*, a large α_ψ(θ*_j) will encourage more ψ-clusters within the j-th θ-cluster. The term Π_{j=1}^k Π_{l=1}^{k_j} Γ(n_{l|j}) will encourage asymmetrical (θ, ψ)-clusters, preferring one large cluster and several small clusters, while, given θ*, the term involving the product of beta functions contains parts that both encourage and discourage asymmetrical θ-clusters. In the special case when α_ψ(θ) = α_ψ for all θ ∈ Θ, the random partition model simplifies to

p(ρ_n) = (Γ(α_θ)/Γ(α_θ + n)) α_θ^k Π_{j=1}^k α_ψ^{k_j} (Γ(α_ψ)Γ(n_j)/Γ(α_ψ + n_j)) Π_{l=1}^{k_j} Γ(n_{l|j}).

In this case, the tendency of the term involving the product of beta functions is to slightly prefer asymmetrical θ-clusters, with large values of α_ψ boosting this preference. As discussed in the previous section, the random partition plays a crucial role, as its posterior distribution affects both inference on the cluster-specific parameters and prediction. For the EDP, it is given by the following proposition.

Proposition 3.2 The posterior of the random partition of the EDP model is

p(ρ_n | x_{1:n}, y_{1:n}) ∝ α_θ^k Π_{j=1}^k [ ∫_Θ α_ψ(θ)^{k_j} (Γ(α_ψ(θ))Γ(n_j)/Γ(α_ψ(θ) + n_j)) dP_{0θ}(θ) ] h_y(y*_j | x*_j) Π_{l=1}^{k_j} Γ(n_{l|j}) h_x(x*_{l|j}).

The proof relies on a simple application of Bayes' theorem.
In the case of constant $\alpha_\psi(\theta)$, the expression for the posterior of $\rho_n$ simplifies to
$$p(\rho_n \mid x_{1:n}, y_{1:n}) \propto \alpha_\theta^k \prod_{j=1}^{k} \frac{\Gamma(\alpha_\psi)\Gamma(n_j)}{\Gamma(\alpha_\psi + n_j)}\, \alpha_\psi^{k_j}\, h_y(y_j \mid x_j) \prod_{l=1}^{k_j} \Gamma(n_{l|j})\, h_x(x_{l|j}).$$
Again, as in (3), the marginal likelihood component in the posterior distribution of $\rho_n$ is the product of the cluster-specific marginal likelihoods, but now the nested clustering structure of the EDP separates the factors relative to $y|x$ and $x$, being
$$h(x_{1:n}, y_{1:n} \mid \rho_n) = \prod_{j=1}^{k} h_y(y_j \mid x_j) \prod_{l=1}^{k_j} h_x(x_{l|j}).$$
Even if the $x$-likelihood favors many $\psi$-clusters, these can now be obtained by sub-partitioning a coarser $\theta$-partition, and the number $k$ of $\theta$-clusters can be expected to be much smaller than in (3).
Further insights into the behavior of the random partition are given by the induced covariate-dependent random partition of the $\theta_i$ parameters given the covariates, which is detailed in the following propositions. We will use the notation $\mathcal{P}_n$ to denote the set of all possible partitions of the first $n$ integers.

Proposition 3.3 The covariate-dependent random partition model induced by the EDP prior is
$$p(\rho_{n,y} \mid x_{1:n}) \propto \alpha_\theta^k \prod_{j=1}^{k} \sum_{\rho_{n_j,x} \in \mathcal{P}_{n_j}} \int_\Theta \alpha_\psi(\theta)^{k_j}\, \frac{\Gamma(\alpha_\psi(\theta))\Gamma(n_j)}{\Gamma(\alpha_\psi(\theta) + n_j)}\, dP_{0\theta}(\theta) \prod_{l=1}^{k_j} \Gamma(n_{l|j})\, h_x(x_{l|j}).$$

Proof. An application of Bayes' theorem implies that
$$p(\rho_n \mid x_{1:n}) \propto \alpha_\theta^k \prod_{j=1}^{k} \int_\Theta \alpha_\psi(\theta)^{k_j}\, \frac{\Gamma(\alpha_\psi(\theta))\Gamma(n_j)}{\Gamma(\alpha_\psi(\theta) + n_j)}\, dP_{0\theta}(\theta) \prod_{l=1}^{k_j} \Gamma(n_{l|j})\, h_x(x_{l|j}). \quad (12)$$
Integrating over $\rho_{n,x}$, or equivalently summing over all $\rho_{n_j,x}$ in $\mathcal{P}_{n_j}$ for $j = 1, \ldots, k$, leads to
$$p(\rho_{n,y} \mid x_{1:n}) \propto \sum_{\rho_{n_1,x}} \cdots \sum_{\rho_{n_k,x}} \alpha_\theta^k \prod_{j=1}^{k} \int_\Theta \alpha_\psi(\theta)^{k_j}\, \frac{\Gamma(\alpha_\psi(\theta))\Gamma(n_j)}{\Gamma(\alpha_\psi(\theta) + n_j)}\, dP_{0\theta}(\theta) \prod_{l=1}^{k_j} \Gamma(n_{l|j})\, h_x(x_{l|j}),$$
and, finally, since (12) is the product over the $j$ terms, we can pull the sum over $\rho_{n_j,x}$ within the product.

This covariate-dependent random partition model will favor $\theta$-partitions of the subjects which can be further partitioned into groups with similar covariates, where a partition with many desirable sub-partitions will have higher mass.

Proposition 3.4 The posterior of the covariate-dependent random partition induced from the EDP model is
$$p(\rho_{n,y} \mid x_{1:n}, y_{1:n}) \propto \alpha_\theta^k \prod_{j=1}^{k} h_y(y_j \mid x_j) \sum_{\rho_{n_j,x} \in \mathcal{P}_{n_j}} \int_\Theta \alpha_\psi(\theta)^{k_j}\, \frac{\Gamma(\alpha_\psi(\theta))\Gamma(n_j)}{\Gamma(\alpha_\psi(\theta) + n_j)}\, dP_{0\theta}(\theta) \prod_{l=1}^{k_j} \Gamma(n_{l|j})\, h_x(x_{l|j}).$$
The proof is similar in spirit to that of Proposition 3.3. Notice the preferred $\theta$-partitions will consist of clusters with a similar relationship between $y$ and $x$, as measured by the marginal local model $h_y$, and similar $x$ behavior, which is measured much more flexibly as a mixture of the marginal local models for $x$.
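For small clusters, the inner sum over sub-partitions in Proposition 3.3 can be computed exactly by enumeration. Below is a sketch for the constant-$\alpha_\psi$ case; the Gaussian-with-known-variance marginal standing in for $h_x$ and all helper names (`set_partitions`, `h_x`, `cluster_factor`) are illustrative choices of ours, not the paper's:

```python
from math import lgamma, log, exp, pi

def set_partitions(items):
    """Enumerate all set partitions of a list, recursively."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller

def h_x(xs, prior_var=4.0, var=1.0):
    """Conjugate Gaussian marginal likelihood of a sub-cluster:
    x_i ~ N(mu, var) with mu ~ N(0, prior_var)."""
    n = len(xs)
    s, ssq = sum(xs), sum(x * x for x in xs)
    post_var = 1.0 / (1.0 / prior_var + n / var)
    log_ml = (-0.5 * n * log(2 * pi * var) - 0.5 * ssq / var
              + 0.5 * log(post_var / prior_var)
              + 0.5 * post_var * (s / var) ** 2)
    return exp(log_ml)

def cluster_factor(xs, a_psi):
    """Inner sum of Prop. 3.3 for one theta-cluster (constant alpha_psi):
    sum over sub-partitions of a_psi^{k_j} B(a_psi, n_j) prod_l Gamma(n_{l|j}) h_x(x_{l|j})."""
    nj = len(xs)
    beta = exp(lgamma(a_psi) + lgamma(nj) - lgamma(a_psi + nj))
    total = 0.0
    for sub in set_partitions(list(range(nj))):
        term = a_psi ** len(sub)
        for block in sub:
            term *= exp(lgamma(len(block))) * h_x([xs[i] for i in block])
        total += beta * term
    return total

# two well-separated pairs of covariate values in one theta-cluster
val = cluster_factor([-0.3, 0.1, 2.5, 2.8], a_psi=1.0)
```

This is only feasible for tiny $n_j$ (the number of sub-partitions grows as the Bell numbers); in the sampler the sum is never formed explicitly.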
The behavior of the random partition, detailed above, has important implications for the posterior of the unique parameters. Conditionally on the partition,
the cluster-specific parameters $(\theta^*, \psi^*)$ are still independent, their posterior density being
$$p(\theta^*, \psi^* \mid y_{1:n}, x_{1:n}, \rho_n) = \prod_{j=1}^{k} p(\theta^*_j \mid y_j, x_j) \prod_{l=1}^{k_j} p(\psi^*_{l|j} \mid x_{l|j}),$$
where
$$p(\theta^*_j \mid y_j, x_j) \propto p_{0\theta}(\theta^*_j) \prod_{i \in S_{j+}} K(y_i \mid \theta^*_j, x_i), \qquad p(\psi^*_{l|j} \mid x_{l|j}) \propto p_{0\psi}(\psi^*_{l|j}) \prod_{i \in S_{j,l}} K(x_i \mid \psi^*_{l|j}).$$
The important point is that the posterior of $\theta^*_j$ can now be updated with much larger sample sizes if the data determine that a coarser $\theta$-partition is present. This will result in a more reliable posterior mean, a smaller posterior variance, and a larger influence of the data compared with the prior.

3.2 Covariate-dependent urn scheme and prediction

Similar to the DP model, computation of the predictive estimates relies on a covariate-dependent urn scheme. For the EDP, we have
$$p(s_{y,n+1} \mid \rho_n, x_{1:n+1}, y_{1:n}) \propto \frac{\omega_{k+1}(x_{n+1})}{c_0}\, \delta_{k+1} + \sum_{j=1}^{k} \frac{\omega_j(x_{n+1})}{c_0}\, \delta_j, \quad (13)$$
where $c_0 = p(x_{n+1} \mid \rho_n, x_{1:n})$, but now the expression of the $\omega_j(x_{n+1})$ takes into account the possible allocation of $x_{n+1}$ in subgroups, being
$$\omega_j(x_{n+1}) = \sum_{s_{x,n+1}} p(x_{n+1} \mid \rho_n, x_{1:n}, y_{1:n}, s_{y,n+1} = j, s_{x,n+1})\, p(s_{x,n+1} \mid \rho_n, x_{1:n}, y_{1:n}, s_{y,n+1} = j).$$
From this, it can be easily found that
$$\omega_{k+1}(x_{n+1}) = \frac{\alpha_\theta}{\alpha_\theta + n}\, h_{0,x}(x_{n+1}), \qquad \omega_j(x_{n+1}) = \frac{n_j}{\alpha_\theta + n} \left( \pi_{k_j+1|j}\, h_{0,x}(x_{n+1}) + \sum_{l=1}^{k_j} \pi_{l|j}\, h_{l|j,x}(x_{n+1}) \right),$$
where
$$\pi_{l|j} = \int \frac{n_{l|j}}{\alpha_\psi(\theta) + n_j}\, dP_{0\theta}(\theta \mid x_j, y_j), \qquad \pi_{k_j+1|j} = \int \frac{\alpha_\psi(\theta)}{\alpha_\psi(\theta) + n_j}\, dP_{0\theta}(\theta \mid x_j, y_j).$$
In the case of constant $\alpha_\psi(\theta) = \alpha_\psi$, these expressions simplify to
$$\pi_{l|j} = \frac{n_{l|j}}{\alpha_\psi + n_j}, \qquad \pi_{k_j+1|j} = \frac{\alpha_\psi}{\alpha_\psi + n_j}.$$
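The covariate-dependent weights above are cheap to evaluate once per-sub-cluster predictive densities for $x$ are available. A minimal sketch for the constant-$\alpha_\psi$ case, with Gaussian densities standing in for $h_{0,x}$ and $h_{l|j,x}$ (the distributional choices and the data-structure for clusters are illustrative, not the paper's):

```python
import math

def normal_pdf(x, mean, var):
    """Univariate Gaussian density, used here as a stand-in predictive density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def edp_urn_weights(x_new, clusters, a_theta, a_psi, h0):
    """Unnormalized allocation weights: omega_j(x_new) for existing
    theta-clusters and omega_{k+1}(x_new) for a new one (constant alpha_psi).
    `clusters` is a list of theta-clusters; each cluster is a list of
    (n_lj, h_lj) pairs: sub-cluster size and its predictive density at x."""
    n = sum(sum(nl for nl, _ in cl) for cl in clusters)
    w_new = a_theta / (a_theta + n) * h0(x_new)
    ws = []
    for cl in clusters:
        nj = sum(nl for nl, _ in cl)
        # mixture over sub-clusters: new psi-cluster gets mass a_psi/(a_psi+n_j)
        inner = a_psi / (a_psi + nj) * h0(x_new)
        inner += sum(nl / (a_psi + nj) * h(x_new) for nl, h in cl)
        ws.append(nj / (a_theta + n) * inner)
    return ws, w_new

# illustrative sub-cluster predictive densities (Gaussian, known variance)
h0 = lambda x: normal_pdf(x, 0.0, 10.0)
clusters = [[(4, lambda x: normal_pdf(x, -1.0, 1.0)),
             (2, lambda x: normal_pdf(x, 1.5, 1.0))],
            [(3, lambda x: normal_pdf(x, 4.0, 1.0))]]
ws, w_new = edp_urn_weights(0.5, clusters, a_theta=1.0, a_psi=1.0, h0=h0)
```

With $x_{n+1} = 0.5$ near the first cluster's sub-clusters, `ws[0]` dominates `ws[1]`, reflecting the similarity measure described in the text.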
Notice that (13) is similar to the covariate-dependent urn scheme of the DP model. The important difference is that the weights, which measure the similarity between $x_{n+1}$ and the $j$th cluster, are much more flexible. It follows, from computations similar to those in Section 2.2, that the predictive density at $y$ for a new subject with a covariate value of $x_{n+1}$ is
$$f(y \mid x_{1:n}, y_{1:n}, x_{n+1}) = \sum_{\rho_n} \left[ \frac{\omega_{k+1}(x_{n+1})}{c}\, h_{0,y}(y \mid x_{n+1}) + \sum_{j=1}^{k} \frac{\omega_j(x_{n+1})}{c}\, h_{j,y}(y \mid x_{n+1}) \right] p(\rho_n \mid x_{1:n}, y_{1:n}), \quad (14)$$
where $c = p(x_{n+1} \mid x_{1:n}, y_{1:n})$. Under the squared error loss function, the point prediction of $y_{n+1}$ is
$$E[Y_{n+1} \mid y_{1:n}, x_{1:n+1}] = \sum_{\rho_n} \left[ \frac{\omega_{k+1}(x_{n+1})}{c}\, E_0[Y_{n+1} \mid x_{n+1}] + \sum_{j=1}^{k} \frac{\omega_j(x_{n+1})}{c}\, E_j[Y_{n+1} \mid x_{n+1}] \right] p(\rho_n \mid x_{1:n}, y_{1:n}). \quad (15)$$
The expressions for the predictive density (14) and point prediction (15) are quite similar to those of the DP, (8) and (9), respectively; in both cases, the cluster-specific predictive estimates are averaged with covariate-dependent weights. However, there are two important differences for the EDP model. The first is that the weights in (13) are defined with a more flexible kernel; in fact, it is a mixture of the original kernels used in the DP model. This means that we have a more flexible measure of similarity in the covariate space. The second difference is that $k$ will be smaller and $n_j$ will be larger with high posterior probability, leading to a more reliable posterior distribution of $\theta^*_j$ due to larger sample sizes and better cluster-specific predictive estimates. We will demonstrate the advantage of these two key differences in simulated and applied examples in Section 5.

4 Computations

Inference for the EDP model cannot be obtained analytically and must therefore be approximated. To obtain approximate inference, we rely on Markov chain Monte Carlo (MCMC) methods and consider an extension of Algorithm 2 of Neal [2000] for the DP mixture model.
In this approach, the random probability measure, $P$, is integrated out, and the model is viewed in terms of $(\rho_n, \theta^*, \psi^*)$. This algorithm requires the use of conjugate base measures $P_{0\theta}$ and $P_{0\psi}$. To deal with non-conjugate base measures, the approach used in Algorithm 8 of Neal [2000] can be directly adapted to the EDP mixture model.
Algorithm 2 is a Gibbs sampler which first samples the cluster label of each subject conditional on the partition of all other subjects, the data, and $(\theta^*, \psi^*)$, and then samples $(\theta^*, \psi^*)$ given the partition and the data. The first step can be easily performed thanks to the Pólya urn which marginalizes the DP. Extending Algorithm 2 to the EDP model is straightforward, since the EDP maintains a simple, analytically computable urn scheme. In particular, the conditional probabilities $p(s_i \mid \rho_{n-1}^{-i}, \theta^*, \psi^*, x_{1:n}, y_{1:n})$ (provided in the Appendix) have a simple closed form, which allows conditional sampling of the individual cluster membership indicators $s_i$, where $\rho_{n-1}^{-i}$ denotes the partition of the $n-1$ subjects with the $i$th subject removed. To improve mixing, we include an additional Metropolis-Hastings step; at each iteration, after performing the $n$ Gibbs updates for each $s_i$, we propose a shuffle of the nested partition structure, obtained by moving a $\psi$-cluster to be nested within a different or new $\theta$-cluster. This move greatly improves mixing. A detailed description of the sampler, including the Metropolis-Hastings step, can be found in the Appendix.

MCMC produces approximate samples, $\{\rho_n^s, \psi^{*s}, \theta^{*s}\}_{s=1}^{S}$, from the posterior. The prediction given in equation (15) can be approximated by
$$\frac{1}{S} \sum_{s=1}^{S} \left[ \frac{\omega_{k+1}^s(x_{n+1})}{\hat{c}}\, E_{h}[Y_{n+1} \mid x_{n+1}] + \sum_{j=1}^{k^s} \frac{\omega_j^s(x_{n+1})}{\hat{c}}\, E_F[Y_{n+1} \mid x_{n+1}, \theta_j^{*s}] \right],$$
where $\omega_j^s(x_{n+1})$, for $j = 1, \ldots, k^s + 1$, are as previously defined in (13) with $(\rho_n, \psi^*, \theta^*)$ replaced by $(\rho_n^s, \psi^{*s}, \theta^{*s})$, and
$$\hat{c} = \frac{1}{S} \sum_{s=1}^{S} \left[ \omega_{k+1}^s(x_{n+1}) + \sum_{j=1}^{k^s} \omega_j^s(x_{n+1}) \right].$$
For the predictive density estimate at $x_{n+1}$, we define a grid of new $y$ values, and for each $y$ in the grid, we compute
$$\frac{1}{S} \sum_{s=1}^{S} \left[ \frac{\omega_{k+1}^s(x_{n+1})}{\hat{c}}\, h_{0,y}(y \mid x_{n+1}) + \sum_{j=1}^{k^s} \frac{\omega_j^s(x_{n+1})}{\hat{c}}\, K(y \mid x_{n+1}, \theta_j^{*s}) \right]. \quad (16)$$
Note that hyperpriors may be included for the precision parameters, $\alpha_\theta$ and $\alpha_\psi(\cdot)$, and the parameters of the base measures.
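The averaging in (16) is a plain Monte Carlo loop over posterior draws. A sketch with a Gaussian response kernel standing in for $K$ and $h_{0,y}$, and a hypothetical per-draw data structure holding the weights already evaluated at $x_{n+1}$ (all of this is illustrative scaffolding, not actual MCMC output):

```python
import math

def normal_pdf(y, mean, var):
    """Univariate Gaussian density, a stand-in for the response kernel K."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def predictive_density(y_grid, draws):
    """Average the conditional density over S posterior draws.
    Each draw is (w_new, h0_params, [(w_j, mean_j, var_j), ...]), where the
    weights are the omegas already evaluated at x_{n+1}."""
    # normalizing constant c-hat: average total weight across draws, as in the text
    c_hat = sum(d[0] + sum(w for w, _, _ in d[2]) for d in draws) / len(draws)
    dens = []
    for y in y_grid:
        val = 0.0
        for w_new, h0, comps in draws:
            val += w_new * normal_pdf(y, *h0)          # new-cluster term
            val += sum(w * normal_pdf(y, m, v) for w, m, v in comps)
        dens.append(val / (len(draws) * c_hat))
    return dens

# two illustrative "posterior draws" (weights chosen to sum to one per draw)
draws = [(0.1, (0.0, 4.0), [(0.5, 1.0, 1.0), (0.4, 3.0, 1.0)]),
         (0.2, (0.0, 4.0), [(0.8, 1.2, 1.0)])]
y_grid = [i * 0.1 for i in range(-40, 81)]
dens = predictive_density(y_grid, draws)
```

Because the toy weights sum to one within each draw, the averaged estimate is itself a proper density, which a grid integration confirms.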
For the simulated examples and application, a Gamma hyperprior is assigned to $\alpha_\theta$, and the $\alpha_\psi(\theta)$ for $\theta \in \Theta$ are assumed to be i.i.d. from a Gamma hyperprior. At each iteration, $\alpha_\theta^s$ and $\alpha_\psi^s(\theta)$ at $\theta_j^{*s}$ for $j = 1, \ldots, k^s$ are approximately sampled from the posterior using the method described in Escobar and West [1995].

5 Simulated example

We consider a toy example that demonstrates two key advantages of the EDP model: first, it can recover the true coarser $\theta$-partition; second, improved prediction and
smaller credible intervals result. The example shows that these advantages are evident even for a moderate value of $p$, with more drastic differences as $p$ increases.

A data set of $n = 200$ points was generated where only the first covariate is a predictor for $Y$. The true model for $Y$ is a nonlinear regression model obtained as a mixture of two normals with linear regression functions and weights depending only on the first covariate:
$$Y_i \mid x_i \overset{ind}{\sim} p(x_{i,1})\, N(y_i;\, \beta_{1,0} + \beta_{1,1} x_{i,1},\, \sigma_1^2) + (1 - p(x_{i,1}))\, N(y_i;\, \beta_{2,0} + \beta_{2,1} x_{i,1},\, \sigma_2^2),$$
where
$$p(x_{i,1}) = \frac{\tau_1 \exp\left(-\tau (x_{i,1} - \mu_1)^2\right)}{\tau_1 \exp\left(-\tau (x_{i,1} - \mu_1)^2\right) + \tau_2 \exp\left(-\tau (x_{i,1} - \mu_2)^2\right)},$$
with $\beta_1 = (0, 1)$, $\sigma_1^2 = 1/16$, $\beta_2 = (4.5, 0.1)$, $\sigma_2^2 = 1/8$, and $\mu_1 = 4$, $\mu_2 = 6$, $\tau_1 = \tau_2 = 2$. The covariates are sampled from a multivariate normal,
$$X_i = (X_{i,1}, \ldots, X_{i,p}) \overset{iid}{\sim} N(\mu, \Sigma), \quad (17)$$
centered at $\mu = (4, \ldots, 4)$ with a standard deviation of 2 along each dimension, i.e. $\Sigma_{h,h} = 4$. The covariance matrix $\Sigma$ models two groups of covariates: those in the first group are positively correlated among each other and with the first covariate, but independent of the second group of covariates, which are positively correlated among each other but independent of the first covariate. In particular, we take $\Sigma_{h,l} = 3.5$ for $h \neq l$ in $\{1, 2, 4, \ldots, 2\lfloor p/2 \rfloor\}$ or $h \neq l$ in $\{3, 5, \ldots, 2\lfloor (p-1)/2 \rfloor + 1\}$, and $\Sigma_{h,l} = 0$ for all other cases of $h \neq l$.

We examine both the DP and EDP mixture models:
$$Y_i \mid x_i, \beta_i, \sigma_{y,i}^2 \overset{ind}{\sim} N(x_i' \beta_i, \sigma_{y,i}^2), \qquad X_i \mid \mu_i, \sigma_{x,i}^2 \overset{ind}{\sim} \prod_{h=1}^{p} N(\mu_{i,h}, \sigma_{x,h,i}^2), \qquad (\beta_i, \sigma_{y,i}^2, \mu_i, \sigma_{x,i}^2) \mid P \overset{iid}{\sim} P,$$
with $P \sim$ DP or $P \sim$ EDP. A conjugate base measure is selected: $P_{0\theta}$ is a multivariate normal-inverse gamma prior and $P_{0\psi}$ is the product of $p$ normal-inverse gamma priors, that is,
$$p_{0\theta}(\beta, \sigma_y^2) = N(\beta;\, \beta_0, \sigma_y^2 C^{-1})\, IG(\sigma_y^2;\, a_y, b_y),$$
and
$$p_{0\psi}(\mu, \sigma_x^2) = \prod_{h=1}^{p} N(\mu_h;\, \mu_{0,h}, \sigma_{x,h}^2 c_h^{-1})\, IG(\sigma_{x,h}^2;\, a_{x,h}, b_{x,h}).$$
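The data-generating mechanism above can be sketched in a few lines. Two caveats: the bandwidth $\tau$ inside the exponentials is not fully legible in this excerpt, so we set `tau=1.0` purely for illustration, and the covariate correlation structure is dropped here (covariates drawn i.i.d. $N(4, 4)$) for brevity:

```python
import math
import random

def sim_data(n, p, tau=1.0, seed=0):
    """Simulate from the toy model of Section 5. `tau` is an illustrative
    placeholder; covariates are i.i.d. N(4, 4), ignoring the paper's
    correlation structure."""
    rng = random.Random(seed)
    b1, s1 = (0.0, 1.0), math.sqrt(1 / 16)    # beta_1, sd of component 1
    b2, s2 = (4.5, 0.1), math.sqrt(1 / 8)     # beta_2, sd of component 2
    mu1, mu2, t1, t2 = 4.0, 6.0, 2.0, 2.0
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(4.0, 2.0) for _ in range(p)]
        # component weight depends only on the first covariate
        w1 = t1 * math.exp(-tau * (x[0] - mu1) ** 2)
        w2 = t2 * math.exp(-tau * (x[0] - mu2) ** 2)
        if rng.random() < w1 / (w1 + w2):
            y = rng.gauss(b1[0] + b1[1] * x[0], s1)
        else:
            y = rng.gauss(b2[0] + b2[1] * x[0], s2)
        X.append(x)
        Y.append(y)
    return X, Y

X, Y = sim_data(200, 5)
```

Plotting `Y` against the first coordinate of `X` reproduces the two crossing regression lines that the mixture describes.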
For both models, we use the same subjective choice of the parameters of the base measure. In particular, we center the base measure on an average of the true
parameter values, with enough variability to recover the true model. A list of the prior parameters can be found in the Appendix. We assign hyperpriors to the mass parameters: for the DP model, $\alpha \sim$ Gamma(1, 1), and for the EDP model, $\alpha_\theta \sim$ Gamma(1, 1) and $\alpha_\psi(\beta, \sigma^2) \overset{iid}{\sim}$ Gamma(1, 1) for all $(\beta, \sigma^2) \in \mathbb{R}^{p+1} \times \mathbb{R}^+$. The computational procedures described in Section 4 were used to obtain posterior inference with 20,000 iterations and a burn-in period of 5,000. An examination of the trace and autocorrelation plots for the subject-specific parameters $(\beta_i, \sigma_{y,i}^2, \mu_i, \sigma_{x,i}^2)$ provided evidence of convergence. Additional criteria for assessing the convergence of the chain, in particular the Geweke diagnostic, also suggested convergence, and the results are given in Table A.1 of the Appendix (see the R package coda for implementation and further details of the diagnostic). It should be noted that running times for both models are quite similar, although slightly faster for the

Figure 1: The partition with the highest estimated posterior probability for the DP and EDP models with $p = 1, 5, 10, 15$. Data points are plotted in the $x_1$ vs $y$ space and colored by cluster membership, with the estimated posterior probability included in the plot title.

Table 1: The posterior mode of the number of clusters, denoted $\hat{k}$ for both models, and the posterior mean of $\alpha$ (DP) or $\alpha_\theta$ (EDP), denoted $\hat{\alpha}$ for both models, as $p$ increases. [Entries for the DP and EDP at $p = 1, 5, 10, 15$; the numerical values are not recoverable from this extraction.]
DP.

Figure 2: The point predictions for 20 test samples of the covariates, plotted against $x_1$ and represented with circles (DP in blue and EDP in red), with the true predictions as black stars, for $p = 1, 5, 10, 15$. The bars about the predictions depict the 95% credible intervals.

Table 2: Prediction error ($\hat{\ell}_1$ and $\hat{\ell}_2$) for both models as $p$ increases. [Entries for the DP and EDP at $p = 1, 5, 10, 15$; the numerical values are not recoverable from this extraction.]

The first main point to emphasize is the improved behavior of the posterior of the random partition for the EDP. We note that for both models, the posterior of the partition is spread out. This is because the space of partitions is very large and many partitions are very similar, differing only in a few subjects; thus, many partitions fit the data well. We depict representative partitions of both models with increasing $p$ in Figure 1. Observations are plotted in the $x_1$ vs $y$ space and colored according to the partition for the DP and the $\theta$-partition for the EDP. As expected, for $p = 1$ the DP does well at recovering the true partition, but, as clearly seen from Figure 1, for large values of $p$, the DP partition is comprised of many clusters, which are needed to approximate the multivariate density of $X$. In fact, the density of $Y|x$ can be recovered with only two kernels regardless of $p$, and the $\theta$-partitions of
Figure 3: Predictive density estimates (DP in blue and EDP in red) for the first new subject, with the true conditional density in black, for $p = 1, 5, 10, 15$. The pointwise 95% credible intervals are depicted with dashed lines (DP in blue and EDP in red).

Figure 4: Predictive density estimates (DP in blue and EDP in red) for the fifth new subject, with the true conditional density in black, for $p = 1, 5, 10, 15$. The pointwise 95% credible intervals are depicted with dashed lines (DP in blue and EDP in red).

Table 3: $\ell_1$ density regression error for both models as $p$ increases. [Entries for the DP and EDP at $p = 1, 5, 10, 15$; the numerical values are not recoverable from this extraction.]

the EDP depicted in Figure 1, with only two $\theta$-clusters, are very similar to the true configuration, even for increasing $p$. On the other hand, the $(\theta, \psi)$-partition of the EDP (not shown) consists of many clusters and resembles the partition of the DP model. This behavior is representative of the posterior distribution of the random partition, which, for the DP, has a large posterior mode of $k$ and a large posterior mean of $\alpha$ for larger values of $p$, while most of the EDP $\theta$-partitions are composed of only 2 clusters, with only a handful of subjects placed in the incorrect cluster, and the posterior mean of $\alpha_\theta$ is much smaller for large $p$. Table 1 summarizes the posterior of $k$ for both models and the posterior of the precision parameters $\alpha$, $\alpha_\theta$. It is interesting
to note that posterior samples of $\alpha_\psi(\theta)$ for $\theta$ characteristic of the first cluster tend to be higher than posterior samples of $\alpha_\psi(\theta)$ for the second cluster; that is, more clusters are needed to approximate the density of $X$ within the first cluster. A non-constant $\alpha_\psi(\theta)$ allows us to capture this behavior.

As discussed in Section 3, a main point of our proposal is the increased finite sample efficiency of the EDP model. To illustrate this, we create a test set of $m = 200$ new covariate values simulated from (17) with maximum dimension $p = 15$ and compute the true regression function $E[Y_{n+j} \mid x_{n+j}]$ and conditional density $f(y \mid x_{n+j})$ for each new subject. To quantify the gain in efficiency of the EDP model, we calculate the point predictions and predictive density estimates from both models and compare them with the truth. Judging from both the empirical $\ell_1$ and $\ell_2$ prediction errors, the EDP model outperforms the DP model, with greater improvement for larger $p$; see Table 2. Figure 2 displays the predictions against $x_1$ for 20 new subjects. Recall that each new $x_1$ is associated with different values of $(x_2, \ldots, x_p)$, which accounts for the somewhat erratic behavior of the prediction as a function of $x_1$ for increasing $p$. The comparison of the credible intervals is quite interesting. For $p > 1$, the unnecessarily wide credible intervals for the DP regression model stand out in the first row of Figure 2. This is due to small cluster sample sizes for the DP model with $p > 1$.

The density regression estimates for all new subjects were computed by evaluating (16) at a grid of $y$-values. As a measure of the performance of the models, the empirical $\ell_1$ distance between the true and estimated conditional densities for each of the new covariate values is shown in Table 3. Again the EDP model outperforms the DP model.
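The empirical errors reported in Tables 2 and 3 are simple averages over the test set. A sketch of how such errors might be computed (the exact normalization used in the paper is not specified in this excerpt, so the one below is an assumption):

```python
def prediction_errors(y_hat, y_true):
    """Empirical l1 and l2 errors between predicted and true regression
    values over m test points; the normalization is illustrative."""
    m = len(y_true)
    l1 = sum(abs(a - b) for a, b in zip(y_hat, y_true)) / m
    l2 = (sum((a - b) ** 2 for a, b in zip(y_hat, y_true)) / m) ** 0.5
    return l1, l2

# tiny worked example: errors of 0.5, 0, 0.5 over three test points
l1, l2 = prediction_errors([1.0, 2.0, 3.5], [1.5, 2.0, 3.0])
# l1 = (0.5 + 0 + 0.5)/3 = 1/3; l2 = sqrt(0.5/3)
```

The same routine applied to pointwise evaluations of the true and estimated conditional densities on a $y$-grid (scaled by the grid spacing) gives an empirical $\ell_1$ density-regression error of the kind shown in Table 3.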
Figures 3 and 4 display the true conditional density (in black) for two new covariate values with the estimated conditional densities in blue for the DP and red for the EDP. It is evident that for $p > 1$ the density regression estimates are improved and that the pointwise 95% credible intervals are almost uniformly wider, both in $y$ and $x$, for the DP model, sometimes drastically so. It is important to note that while the credible intervals of the EDP model are considerably tighter, they still contain the true density.

6 Alzheimer's disease study

Our primary goal in this section is to show that the EDP model leads to improved inference in a real data study with the important goal of diagnosing Alzheimer's disease (AD). In particular, the EDP leads to improved predictive accuracy, tighter credible intervals around the predictive probability of having the disease, and a more interpretable clustering structure.

Alzheimer's disease is a prevalent form of dementia that slowly destroys memory and thinking skills, and eventually even the ability to carry out the simplest tasks. Unfortunately, a definitive diagnosis cannot be made until autopsy. However, the brain may show severe evidence of neurobiological damage even at early stages of the disease, before the onset of memory disturbances. As this damage may be difficult
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationBayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference
1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationarxiv: v1 [stat.me] 3 Mar 2013
Anjishnu Banerjee Jared Murray David B. Dunson Statistical Science, Duke University arxiv:133.449v1 [stat.me] 3 Mar 213 Abstract There is increasing interest in broad application areas in defining flexible
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More information1 Data Arrays and Decompositions
1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is
More information17 : Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo
More informationBayesian Linear Models
Bayesian Linear Models Sudipto Banerjee September 03 05, 2017 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles Linear Regression Linear regression is,
More informationA Brief Overview of Nonparametric Bayesian Models
A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine
More information36-720: Latent Class Models
36-720: Latent Class Models Brian Junker October 17, 2007 Latent Class Models and Simpson s Paradox Latent Class Model for a Table of Counts An Intuitive E-M Algorithm for Fitting Latent Class Models Deriving
More informationNonparametric Bayesian Methods - Lecture I
Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics
More informationDavid B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison
AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationPartial factor modeling: predictor-dependent shrinkage for linear regression
modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework
More informationDirichlet Enhanced Latent Semantic Analysis
Dirichlet Enhanced Latent Semantic Analysis Kai Yu Siemens Corporate Technology D-81730 Munich, Germany Kai.Yu@siemens.com Shipeng Yu Institute for Computer Science University of Munich D-80538 Munich,
More informationFoundations of Nonparametric Bayesian Methods
1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models
More informationSTAT 425: Introduction to Bayesian Analysis
STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte
More informationINFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES
INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationBayesian Modeling of Conditional Distributions
Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings
More informationSTA 216, GLM, Lecture 16. October 29, 2007
STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural
More informationNonparametric Bayes Uncertainty Quantification
Nonparametric Bayes Uncertainty Quantification David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & ONR Review of Bayes Intro to Nonparametric Bayes
More informationSupplementary Material: A Robust Approach to Sequential Information Theoretic Planning
Supplementar aterial: A Robust Approach to Sequential Information Theoretic Planning Sue Zheng Jason Pacheco John W. Fisher, III. Proofs of Estimator Properties.. Proof of Prop. Here we show how the bias
More informationA comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models
A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models William Barcella 1, Maria De Iorio 1 and Gianluca Baio 1 1 Department of Statistical Science,
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More informationDirichlet Process Mixtures of Generalized Linear Models
Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department
More informationThe Metropolis-Hastings Algorithm. June 8, 2012
The Metropolis-Hastings Algorithm June 8, 22 The Plan. Understand what a simulated distribution is 2. Understand why the Metropolis-Hastings algorithm works 3. Learn how to apply the Metropolis-Hastings
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning
More informationComputer Vision Group Prof. Daniel Cremers. 14. Clustering
Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it
More informationGaussian Mixture Model
Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationBayesian spatial hierarchical modeling for temperature extremes
Bayesian spatial hierarchical modeling for temperature extremes Indriati Bisono Dr. Andrew Robinson Dr. Aloke Phatak Mathematics and Statistics Department The University of Melbourne Maths, Informatics
More informationLecture 7 and 8: Markov Chain Monte Carlo
Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani
More informationMarkov Chain Monte Carlo methods
Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationColouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles
Colouring and breaking sticks, pairwise coincidence losses, and clustering expression profiles Peter Green and John Lau University of Bristol P.J.Green@bristol.ac.uk Isaac Newton Institute, 11 December
More informationCopyright, 2008, R.E. Kass, E.N. Brown, and U. Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS
Copright, 8, RE Kass, EN Brown, and U Eden REPRODUCTION OR CIRCULATION REQUIRES PERMISSION OF THE AUTHORS Chapter 6 Random Vectors and Multivariate Distributions 6 Random Vectors In Section?? we etended
More informationInfinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix
Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations
More informationA Process over all Stationary Covariance Kernels
A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that
More informationRonald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California
Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University
More information