QUASI-BERNSTEIN PREDICTIVE DENSITY ESTIMATION

QUASI-BERNSTEIN PREDICTIVE DENSITY ESTIMATION

MARK FISHER

Preliminary and incomplete.

Abstract. The Quasi-Bernstein Density (QBD) model is a Bayesian nonparametric model for density estimation on a finite interval that incorporates beta distributions into a Dirichlet Process Mixture (DPM) model. The coefficients of the beta distributions are restricted to the natural numbers, producing a model that is related to the Bernstein polynomial model introduced by Petrone (1999a). The potential number of mixture components is unbounded. The prior predictive density is uniform over the finite interval. The model is first applied to observed data, and then applied to latent variables via indirect density estimation. In this latter setting the model becomes a DPM model for the prior. In the context of binomial observations and latent success rates, this prior is compared with the Dirichlet process (DP) prior found in the literature. (The DP prior can be viewed as a special case of the DPM prior.)

1. Introduction

This paper is about density estimation on a finite interval using a mixture of beta distributions. (A simple extension allows for two-dimensional density estimation on a rectangle.) Here are the main features: (1) the approach to inference is Bayesian; (2) the parameters of the beta distributions are restricted to the natural numbers; (3) the potential number of mixture components is unbounded; and (4) the prior predictive density is uniform. (The prior predictive density can be made nonuniform if desired.)

As illustrations, the model is applied to a number of standard data sets in one and two dimensions. In addition, I show how to apply the model to latent variables via what I call indirect density estimation. (In this context I introduce the distinction between generic and specific cases.) To illustrate this technique, I apply it to compute the density of unobserved success rates that underlie the observations from binomial experiments and/or units. The results are compared with those generated by an alternative model that has appeared in the literature.

Related literature. This model is related to the Bernstein polynomial density technique of Petrone (1999a). See also Trippa et al. (2011) and Kottas (2006). For asymptotic properties of random Bernstein polynomials, see Petrone (1999b) and Petrone and Wasserman (2002). For a multivariate extension, see Zhao et al. (2013). For a discussion of samplers for Dirichlet process mixture models see Neal (2000).

Date: July 22, 05:23. The views expressed herein are the author's and do not necessarily reflect those of the Federal Reserve Bank of Atlanta or the Federal Reserve System.

Liu (1996) presents a related model in which the latent success rates for binomial observations have a Dirichlet Process (DP) prior. (This model is discussed in Section 8.)

Outline. Sections 2 through 6 present the model as applied to observables. Section 2 provides a general introduction to Dirichlet Process Mixture (DPM) models and Chinese Restaurant Processes (CRP) and derives the general form of the conditional predictive distribution. Section 3 specializes the model to the quasi-Bernstein model. Section 4 presents a two-dimensional extension of the model. Section 5 describes the details of the Markov chain Monte Carlo (MCMC) sampler. An empirical investigation is presented in Section 6. Sections 7 through 9 apply the model to the case of latent variables. Section 7 applies the model to latent variables via indirect density estimation, introducing the distinction between generic and specific cases. Section 8 presents an alternative (DP-based) prior that has been used in the literature for the case of binomial observations with latent success rates. Section 9 presents an empirical investigation for the latent-variable case. In addition there are five appendices. Appendix A discusses potential variation in the predictive distribution arising from additional observations. Appendix B presents additional features of the probabilities associated with the Chinese Restaurant Process. Appendix C presents a generalization of the two-dimensional model that includes local dependence via a copula. Appendix D shows how to integrate out the unobserved success rate in the case of binomial data. Appendix E discusses pooling, shrinkage, and sharing.

2. The standard model in general terms

This section describes the model. Much of the exposition in this section is dedicated to providing the less knowledgeable reader with some insight into (i) the structure of the standard Dirichlet Process Mixture (DPM) model, (ii) its marginalization into a classification model based on the Chinese Restaurant Process (CRP), and (iii) the derivation of the conditional predictive distribution (which forms the basis for posterior estimation). To that end, this section borrows from the exposition of Teh (2010) [to which the reader is referred for additional introductory material]. The general form of the standard model adopted in this paper is stated in (2.13). The novelty is entirely in the specializations, which are presented in Section 3 in (3.1), (3.3), and (3.12).

Inferential goal. Let x_{1:n} = (x_1, ..., x_n) denote a set of observations.^1 The goal of inference is the posterior distribution for the next observation (which we express hereafter as a density), also known as the posterior predictive distribution:

    p(x_{n+1} | x_{1:n}, I),    (2.1)

where I stands for the model and the prior information.^2 Computing this predictive distribution amounts to an exercise in Bayesian density estimation. For density estimation, it is natural to adopt assumptions that allow for a wide

^1 In Section 3, the model will be specialized to require x_i ∈ [0, 1] for all i. This restriction, however, is not relevant in the current section.
^2 Appendix A discusses the potential variation in the predictive distribution arising from additional observations.

range of possible distributions, including the possibility of multi-modality. In what follows, we provide a brief introduction to what has become the standard approach. The discussion of DPM models is intended to help give the reader a sense of where the flexibility of the Bayesian approach to density estimation comes from. However, one may skim this discussion (for notation) and start in earnest with the CRP model [summarized in (2.13)] upon which the predictive distribution is based.

DPM models. Let θ_{1:n} = (θ_1, ..., θ_n) denote a set of latent parameters, where θ_i ∈ Θ. The Dirichlet Process Mixture (DPM) model has the following hierarchical structure:^3

    x_i | θ_i ∼ F(θ_i)    (2.2a)
    θ_i | G ∼ G (iid)    (2.2b)
    G | α ∼ DP(α, H)    (2.2c)
    α ∼ P_α.    (2.2d)

According to (2.2a), the observations are conditionally independent but not (necessarily) identically distributed. The distribution F plays the role of a kernel in density estimation. The location and scale (i.e., bandwidth) of the kernel associated with the ith observation are controlled by the parameter θ_i. The prior distribution for θ_i is G. G is a random probability distribution with a prior given by a Dirichlet Process (DP). The DP depends on a concentration parameter α > 0 and a base distribution H, where H is a distribution over Θ. The flexibility of the DP prior is such that G can approximate arbitrarily well (in the weak topology^4) any distribution defined on the support of H.

The DP prior for G is completely characterized by the following proposition: If A_1, ..., A_s is any finite measurable partition of Θ, then

    (G(A_1), ..., G(A_s)) ∼ Dirichlet(α H(A_1), ..., α H(A_s)),    (2.3)

where Dirichlet denotes the Dirichlet distribution. Note ∑_{j=1}^{s} G(A_j) = ∑_{j=1}^{s} H(A_j) = 1. It follows from the properties of the Dirichlet distribution that

    E[G(A)] = H(A), for any measurable set A ⊆ Θ.    (2.4)

Thus, the base distribution H can be thought of as the expectation of the random distribution G. In addition, the variance of G around H is inversely related to the concentration parameter α:

    V[G(A)] = H(A) (1 − H(A)) / (1 + α).    (2.5)

Thus, the concentration parameter determines how concentrated G is around its expectation H. Realizations of G are (almost surely) discrete even if H is not, and consequently there is positive probability that θ_i = θ_j for some j ≠ i. The smaller the concentration parameter, the greater the probability of duplication. (This will become evident shortly.)

^3 The distributions F, H, and P_α will be specified in Section 3.
^4 The central limit theorem involves the weak topology.

There is an explicit representation for G. Let θ̃ = (θ̃_1, θ̃_2, ...) denote the support of G and let v = (v_1, v_2, ...) denote the associated probabilities. In other words, v_c is the probability that θ_i = θ̃_c. Then G can be expressed as

    G = ∑_{c=1}^{∞} v_c δ_{θ̃_c},    (2.6)

where δ_w denotes a point-mass (distribution) located at w. The randomness of G follows from the randomness of both the support and the associated probabilities:

    θ̃_c ∼ H (iid)    (2.7a)
    v | α ∼ Stick(α),    (2.7b)

where Stick(α) denotes the stick-breaking distribution given by^5

    v_c = ε_c ∏_{l=1}^{c−1} (1 − ε_l)  where  ε_c ∼ Beta(1, α) (iid).    (2.8)

The parameter α controls the rate at which the weights decline on average. In particular, the weights decline geometrically in expectation:

    E[v_c | α, I] = α^{c−1} / (1 + α)^c.    (2.9)

Note E[v_1 | α, I] = 1/(1 + α) and E[∑_{c=n+1}^{∞} v_c | α, I] = (α/(1 + α))^n. When α is small, a few weights dominate the distribution, thereby producing a substantial probability of duplicates.

Marginalization. There are a number of schemes to estimate the DPM-based model (assuming F, H and P_α have been specified). However, it is possible to reexpress the model (via marginalization) in a form that allows for estimation schemes that are both simpler and lower-variance. Although the marginalized model cannot deliver inferences about G, it does deliver the predictive inferences that are the subject of this paper.

As a first step in reexpressing the model, introduce the classification variables z_{1:n} = (z_1, ..., z_n), where z_i = c indicates θ_i = θ̃_c.^6 The classifications are solely determined by the mixture weights v. In particular, z_i takes on value c with probability v_c, which can be expressed as

    z_{1:n} | v ∼ Multinomial(v).    (2.10)

The reexpression is completed by integrating out v using its prior, producing the marginal distribution for the classifications:

    p(z_{1:n} | α) = ∫ p(z_{1:n} | v) p(v | α) dv.    (2.11)

This distribution can be expressed as follows:

    z_{1:n} | α ∼ CRP(α),    (2.12)

where CRP(α) denotes the Chinese Restaurant Process (CRP) with concentration parameter α. The CRP plays a central role in the model and we will discuss it shortly.

^5 Start with a stick of length one. Break off the fraction ε_1, leaving a stick of length 1 − ε_1. Then break off the fraction ε_2 of the remaining stick, leaving a stick of length (1 − ε_1)(1 − ε_2). Continue in this manner.
^6 Note that z_{1:n} can be interpreted as a partition of the set {1, ..., n}. We may interpret the partition as determining the way in which the parameters are shared among the observations.
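To make the stick-breaking construction (2.7)–(2.9) concrete, the sketch below draws a truncated set of weights from Stick(α) and checks by Monte Carlo that they decline geometrically in expectation, as in (2.9). This is an illustrative sketch rather than part of the model specification; the truncation level C, the function name, and the use of NumPy are assumptions of mine.

```python
import numpy as np

def stick_breaking_weights(alpha, C, rng):
    """Draw the first C stick-breaking weights from Stick(alpha), eq. (2.8)."""
    eps = rng.beta(1.0, alpha, size=C)                         # epsilon_c ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - eps)[:-1]))
    return eps * remaining                                     # v_c = eps_c * prod_{l<c} (1 - eps_l)

rng = np.random.default_rng(0)
alpha, C = 1.0, 50
draws = np.array([stick_breaking_weights(alpha, C, rng) for _ in range(20000)])

# Compare simulated means with E[v_c | alpha] = alpha^(c-1) / (1 + alpha)^c, eq. (2.9).
c = np.arange(1, 6)
print(draws.mean(axis=0)[:5])
print(alpha ** (c - 1) / (1.0 + alpha) ** c)
```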

Summary. The marginalized model is summarized as follows:

    x_i | θ_i ∼ F(θ_i)    (2.13a)
    θ_i = θ̃_{z_i}    (2.13b)
    θ̃_c ∼ H (iid)    (2.13c)
    z_{1:n} | α ∼ CRP(α)    (2.13d)
    α ∼ P_α.    (2.13e)

This is a standard model in the literature. The novelty introduced in this paper comes in the specification of F, H, and P_α (in Section 3). In the meantime, it is convenient to discuss the CRP and also to derive the general forms of some additional expressions.

CRP as sharing. Perhaps the most interesting aspect of (2.13) is the sharing of parameters among the observations. We begin by stating certain features of this sharing. In particular, there is a finite number of clusters, where the number of clusters is random. Each of the clusters has a common parameter, where θ̃_c is the parameter for cluster c. Each observation has a classification parameter z_i, where z_i = c denotes that θ_i belongs to cluster c, and consequently

    θ_i = θ̃_{z_i}.    (2.14)

We assume the cluster parameters are independently and identically distributed according to some distribution function H:

    θ̃_c ∼ H (iid).    (2.15)

The way in which the parameters are shared is determined by a probability distribution for the classifications, z_{1:n} = (z_1, ..., z_n). For this purpose we adopt the Chinese Restaurant Process (CRP). The imagery behind the CRP is that there is a restaurant with an unlimited number of tables at which customers are randomly seated, and all customers seated at the same table share the same entree. In the analogy, the tables correspond to the clusters, the customers correspond to the observations, and the entrees correspond to the parameter values.

The seating assignment works as follows. Customers arrive one at a time, and the first customer is always seated at table number one. Subsequent customers are seated either at an already occupied table or at the next unoccupied table. The probabilities of each of these possible seating choices are described next. Suppose n customers have already been seated. Let m_n denote the number of occupied tables, which can be anywhere from 1 to n. Let d_{n,c} denote the number of customers seated at table c. (For future reference, note that m_n = max[z_{1:n}], that d_{n,c} is the multiplicity of c in z_{1:n}, and that ∑_{c=1}^{m_n} d_{n,c} = n.) The probability that the next customer is seated at table c is given by

    p(z_{n+1} = c | z_{1:n}, α) = d_{n,c}/(n + α)  for c ∈ {1, ..., m_n},  and  α/(n + α)  for c = m_n + 1,    (2.16)

where α is the concentration parameter.^7

^7 The CRP can be interpreted as the limit of a sharing scheme involving K categories. Let z_{1:n} | w ∼ Multinomial(n, w) where w ∼ Dirichlet(α/K, ..., α/K). Integrate out w and let K → ∞. See Neal (2000) for details.
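The sequential seating rule (2.16) is easy to simulate directly. The sketch below is illustrative only; the function name and the use of NumPy are mine.

```python
import numpy as np

def crp_sample(n, alpha, rng):
    """Simulate table assignments z_{1:n} from CRP(alpha) using eq. (2.16)."""
    z = [1]                     # the first customer sits at table 1
    counts = [1]                # d_{i,c}: current occupancy of each table
    for i in range(1, n):       # i customers already seated
        probs = np.array(counts + [alpha]) / (i + alpha)   # existing tables, then a new table
        table = rng.choice(len(probs), p=probs) + 1
        if table > len(counts):
            counts.append(1)
        else:
            counts[table - 1] += 1
        z.append(table)
    return z

rng = np.random.default_rng(0)
print(crp_sample(10, 1.0, rng))
```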

The effect of α on the amount of sharing can be seen in (2.16). If α = 0, then all customers are seated at the first table and there is complete sharing (θ_i = θ̃_1). At the other extreme, if α = ∞, then each customer is seated separately and there is no sharing (θ_i = θ̃_i). Intermediate values for α produce partial sharing. Because the value of α plays such an important role in determining the amount of sharing, we provide it with a prior and allow the data to help determine its value.

The probability of z_{1:n} (given α) can be computed using (2.16). It depends only on the set of multiplicities associated with z_{1:n}:

    p(z_{1:n} | α) = p(z_1 | α) ∏_{i=1}^{n−1} p(z_{i+1} | z_{1:i}, α) = α^{m_n} ∏_{c=1}^{m_n} (d_{n,c} − 1)! / (α)_n,    (2.17)

where (α)_n := ∏_{i=1}^{n} (i − 1 + α) = Γ(n + α)/Γ(α).^8

Posterior predictive distribution revisited. The following parameter vector is central to what follows:

    ψ_n := (α, z_{1:n}, θ̃_{1:m_n}).    (2.18)

We can express the posterior predictive distribution in terms of ψ_n:

    p(x_{n+1} | x_{1:n}, I) = ∫ p(x_{n+1} | ψ_n, I) p(ψ_n | x_{1:n}, I) dψ_n.    (2.19)

Calculation of the posterior distribution according to (2.19) involves the following steps: Find the explicit form for the conditional predictive distribution

    p(x_{n+1} | ψ_n, I),    (2.20)

make draws of {ψ_n^{(r)}}_{r=1}^{R} from the posterior distribution p(ψ_n | x_{1:n}, I), and compute the approximation (Rao–Blackwellization):

    p(x_{n+1} | x_{1:n}, I) ≈ (1/R) ∑_{r=1}^{R} p(x_{n+1} | ψ_n^{(r)}, I).    (2.21)

The task of finding the explicit form for p(x_{n+1} | ψ_n) begins by recognizing

    x_{n+1} | ψ_{n+1} ∼ F(θ̃_{z_{n+1}}).    (2.22)

Then, noting

    ψ_{n+1} = ψ_n ∪ (z_{n+1}, θ̃_{m_n+1}),    (2.23)

the task concludes by integrating out (z_{n+1}, θ̃_{m_n+1}) using distributions conditional on ψ_n. The result is a mixture of m_n + 1 terms:

    p(x_{n+1} | ψ_n, I) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] f(x_{n+1} | θ̃_c) + [α/(n + α)] p(x_{n+1} | I),    (2.24)

where f(· | θ_i) is the density of F(θ_i), h(· | I) is the density of H, and

    p(x_i | I) = ∫ f(x_i | θ̃_c) h(θ̃_c | I) dθ̃_c    (2.25)

for any i ≥ 1 (including n + 1).

^8 See Appendix B for additional information regarding the assignment of probabilities to partitions.
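The following sketch shows how (2.21) and (2.24) combine in practice: evaluate the conditional predictive density on a grid for each retained posterior draw of ψ_n and average. It is a schematic rather than the paper's code; the kernel density, the prior predictive density, and the container format for the draws are placeholders to be supplied by the user.

```python
import numpy as np

def conditional_predictive(x, psi, kernel_pdf, prior_pdf):
    """Evaluate p(x | psi_n, I) from eq. (2.24) for one draw psi = (alpha, z, theta)."""
    alpha, z, theta = psi["alpha"], psi["z"], psi["theta"]   # theta: list of cluster parameters
    n = len(z)
    counts = np.bincount(z)[1:]                              # d_{n,c} for clusters labeled 1..m_n
    dens = alpha / (n + alpha) * prior_pdf(x)
    for c, d in enumerate(counts, start=1):
        dens += d / (n + alpha) * kernel_pdf(x, theta[c - 1])
    return dens

def rao_blackwell_predictive(x_grid, draws, kernel_pdf, prior_pdf):
    """Average eq. (2.24) over posterior draws of psi_n, as in eq. (2.21)."""
    return np.mean([conditional_predictive(x_grid, psi, kernel_pdf, prior_pdf)
                    for psi in draws], axis=0)
```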

Equation (2.24) is the standard workhorse of Bayesian nonparametric density estimation. A key feature of (2.24) is this: If α = ∞ then nothing is learned about x_{n+1} from x_{1:n} because no information flows through the hyperparameter and p(x_{n+1} | ψ_n, I) = p(x_{n+1} | I). Thus the no-sharing environment is also a no-learning environment.

3. Specializing the model

Having set the stage with the general structure of the standard Bayesian nonparametric density model, we are now in a position to specify the model by specializing F, H, and P_α. Recall the general structure of the model is summarized by (2.13).

Specializing F. Let θ_i = (a_i, b_i) ∈ Θ = ℕ × ℕ and let

    F(θ_i) = Beta(a_i, b_i),    (3.1)

where

    Beta(x | a, b) = 1_{[0,1]}(x) x^{a−1} (1 − x)^{b−1} / B(a, b),    (3.2)

and where B(a, b) is the beta function, B(a, b) = Γ(a) Γ(b)/Γ(a + b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx. Note the parameters in the Beta distributions are restricted to the positive integers. Also note the model assumes x_i ∈ [0, 1]. It is straightforward to adapt the model to any finite interval.

Specializing H. Let θ̃_c = (ã_c, b̃_c) ∈ Θ. In order to describe the prior for θ̃_c it is convenient to change variables to (j̃_c, k̃_c), where (ã_c, b̃_c) = (j̃_c, k̃_c − j̃_c + 1). Note k̃_c ∈ ℕ and j̃_c ∈ {1, ..., k̃_c}. Let H be characterized by

    k̃_c ∼ Pascal(1, ξ)    (3.3a)
    j̃_c | k̃_c ∼ Discrete(1/k̃_c, ..., 1/k̃_c).    (3.3b)

This prior delivers a model that shares much in common with Petrone's random Bernstein polynomial model. Because we do not treat an entire Bernstein basis as a unit, we refer to this as a quasi-Bernstein model. The prior for (j̃_c, k̃_c) can be expressed as

    p(j̃_c, k̃_c | I) = p(k̃_c | I)/k̃_c,    (3.4)

where

    p(k̃_c | I) = ξ (1 − ξ)^{k̃_c − 1}.    (3.5)
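A minimal sketch of drawing a cluster parameter from the base distribution H in (3.3), under the reading (consistent with (3.5)) that Pascal(1, ξ) is the geometric distribution on {1, 2, ...} with success probability ξ. The function name and the use of NumPy are assumptions of mine.

```python
import numpy as np

def draw_cluster_parameter(xi, rng):
    """Draw (a, b) from H via (j, k): k ~ Geometric(xi) on {1,2,...}, j | k ~ Uniform{1,...,k}."""
    k = rng.geometric(xi)            # p(k) = xi * (1 - xi)**(k - 1), eq. (3.5)
    j = rng.integers(1, k + 1)       # uniform on {1, ..., k}, eq. (3.3b)
    a, b = j, k - j + 1              # change of variables back to (a_c, b_c)
    return a, b

rng = np.random.default_rng(1)
print([draw_cluster_parameter(1 / 200, rng) for _ in range(3)])
```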

We are now equipped to explicitly express p(x_i | I):

    p(x_i | I) = ∫ p(x_i | θ̃_c) p(θ̃_c | I) dθ̃_c
               = ∑_{k̃_c=1}^{∞} ∑_{j̃_c=1}^{k̃_c} Beta(x_i | j̃_c, k̃_c − j̃_c + 1) p(k̃_c | I)/k̃_c
               = ∑_{k̃_c=1}^{∞} p(k̃_c | I) Uniform(x_i | 0, 1) = 1_{[0,1]}(x_i).    (3.6)

The simplification in (3.6) does not depend on the form of p(k̃_c | I). It follows from a property of Bernstein polynomials, for which ∑_{l=0}^{d} B_{l,d}(x) = 1, where B_{l,d}(x) = C(d, l) x^l (1 − x)^{d−l} denotes the lth Bernstein basis function of degree d. Equation (3.6) follows since Beta(x | j, k − j + 1)/k ≡ B_{j−1,k−1}(x).

Here is the prior in terms of (ã_c, b̃_c):

    p(ã_c, b̃_c | I) = p(j̃_c, k̃_c | I) evaluated at (j̃_c, k̃_c) = (ã_c, ã_c + b̃_c − 1) = ξ (1 − ξ)^{ã_c + b̃_c − 2} / (ã_c + b̃_c − 1).    (3.7)

(The Jacobian of the transformation is identically one.)

Conditional predictive distribution revisited. Given the specification of F and H, we can now express the conditional predictive distribution in the form used for estimation in this paper:

    p(x_{n+1} | ψ_n, I) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] Beta(x_{n+1} | ã_c, b̃_c) + [α/(n + α)] 1_{[0,1]}(x_{n+1}),    (3.8)

where the prior for (ã_c, b̃_c) ∈ ℕ × ℕ is given in (3.7). Thus the conditional predictive distribution for x_{n+1} is a mixture of Beta distributions with integer coefficients and consequently so is the predictive distribution p(x_{n+1} | x_{1:n}, I).

Specializing P_α. In formulating the prior for α it is instructive to examine the joint distribution for x_1 and x_2 conditional on α. First, note that the joint distribution conditional on ψ_1 can be expressed as

    p(x_1, x_2 | ψ_1, I) = f(x_1 | θ̃_1) p(x_2 | ψ_1, I),    (3.9)

where [refer to (2.24)]

    p(x_2 | ψ_1, I) = [1/(1 + α)] f(x_2 | θ̃_1) + [α/(1 + α)] p(x_2 | I).    (3.10)

[Figure 1. Density histogram of 10^5 draws from p(x_1, x_2 | I) with ξ = .005.]

Therefore, the joint distribution conditional on α can be obtained by integrating out θ̃_1:

    p(x_1, x_2 | α, I) = ∫ p(x_1, x_2 | ψ_1, I) p(θ̃_1 | I) dθ̃_1
        = [1/(1 + α)] ∫ f(x_1 | θ̃_1) f(x_2 | θ̃_1) h(θ̃_1 | I) dθ̃_1 + [α/(1 + α)] p(x_1 | I) p(x_2 | I).    (3.11)

We see that the joint distribution is a weighted average of the two extreme sharing environments. Equation (3.11) illustrates sharing in its simplest form. In light of this equation, we choose the prior for α such that λ = α/(1 + α) ∼ Uniform(0, 1). In particular, P_α is characterized by the density

    p(α | I) = 1/(1 + α)^2.    (3.12)

This prior differs from the standard conjugate prior for α. With this prior for α, the joint distribution for x_1 and x_2 is

    p(x_1, x_2 | I) = (1/2) ∫ f(x_1 | θ̃_1) f(x_2 | θ̃_1) h(θ̃_1 | I) dθ̃_1 + (1/2) 1_{[0,1]}(x_1) 1_{[0,1]}(x_2),    (3.13)

where the densities f and h are given above. See Figure 1 for a binned density plot of 10^5 draws from p(x_1, x_2 | I) where ξ = .005.
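Draws like those behind Figure 1 can be generated directly from (3.13): with probability 1/2 the two observations share a single cluster parameter drawn from H, and otherwise they are independent uniforms. The sketch below is my own illustration of that reading (NumPy assumed; draw_cluster_parameter is the hypothetical helper sketched after (3.5)).

```python
import numpy as np

def draw_x1_x2(xi, rng):
    """One draw from the prior joint p(x1, x2 | I) in eq. (3.13)."""
    if rng.uniform() < 0.5:                       # shared-parameter branch
        a, b = draw_cluster_parameter(xi, rng)    # theta_1 ~ H
        return rng.beta(a, b), rng.beta(a, b)     # x1, x2 iid Beta(a, b) given theta_1
    return rng.uniform(), rng.uniform()           # independent-uniform branch

rng = np.random.default_rng(2)
draws = np.array([draw_x1_x2(0.005, rng) for _ in range(10_000)])
```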

4. Two-dimensional density estimation

The extension of the model to two-dimensional observations is straightforward.^9 Let x_i = (x_{i1}, x_{i2}), θ_i = (θ_{i1}, θ_{i2}), and θ̃_c = (θ̃_{c1}, θ̃_{c2}), where θ_{il} = (a_{il}, b_{il}) and θ̃_{cl} = (ã_{cl}, b̃_{cl}) for l ∈ {1, 2}. Let F be characterized by

    f(x_i | θ_i) = Beta(x_{i1} | a_{i1}, b_{i1}) Beta(x_{i2} | a_{i2}, b_{i2}).    (4.1)

Note the local independence. A model with local dependence is described in Appendix C. Let H be characterized by p(θ̃_c | I) = p(θ̃_{c1} | I) p(θ̃_{c2} | I), where p(θ̃_{cl} | I) = p(k̃_{cl} | I)/k̃_{cl}. Consequently,

    p(x_i | I) = 1_{[0,1]^2}(x_i).    (4.2)

The priors for α and z_{1:n} are unchanged. The expression for the conditional predictive distribution becomes

    p(x_{n+1} | ψ_n, I) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] Beta(x_{n+1,1} | ã_{c1}, b̃_{c1}) Beta(x_{n+1,2} | ã_{c2}, b̃_{c2}) + [α/(n + α)] 1_{[0,1]^2}(x_{n+1}).    (4.3)

Simple adaptations of the sampling scheme described in Section 5 allow one to make draws of ψ_n in the two-dimensional case and thereby use (2.21) in conjunction with (4.3) to compute the approximation to the posterior predictive density.

5. Drawing from the posterior distribution

We discuss how to make draws of ψ_n = (α, z_{1:n}, θ̃_{1:m_n}) from the posterior distribution via MCMC (Markov chain Monte Carlo). We adopt a Gibbs sampling approach, relying on full conditional posterior distributions. The sampling scheme described here follows the outline summarized by Algorithm 2 in Neal (2000).

Classifications. We begin with making draws of the classifications, z_{1:n}. Let z_{1:n}^{−i} denote the (possibly renormalized) vector of classifications after having removed case i. Renormalization will occur if before removal the cluster associated with observation i is a singleton (i.e., d_{n,z_i} = 1), in which case θ̃_{z_i} will be discarded and the remaining clusters will be relabeled. Based on (2.16) and the exchangeability of the classifications, we have

    p(z_i = c | α, z_{1:n}^{−i}) = d_{n,c}^{−i}/(n − 1 + α)  for c ∈ {1, ..., m_n^{−i}},  and  α/(n − 1 + α)  for c = m_n^{−i} + 1,    (5.1)

where m_n^{−i} = max[z_{1:n}^{−i}] and d_{n,c}^{−i} denotes the multiplicity of class c in z_{1:n}^{−i}. Therefore, the full conditional probability of z_i given the observations is

    p(z_i | α, z_{1:n}^{−i}, θ̃_{1:m_n^{−i}}, x_{1:n}, I) ∝ [d_{n,c}^{−i}/(n − 1 + α)] Beta(x_i | ã_c, b̃_c)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] 1_{[0,1]}(x_i)  for c = m_n^{−i} + 1,    (5.2)

^9 See Zhao et al. (2013) for a discussion of the properties of a two-dimensional Bernstein polynomial density.

where 1_{[0,1]}(x_i) = 1 is the likelihood of x_i under a new, as yet unobserved, component parameter. If a new cluster is selected (i.e., if z_i = m_n^{−i} + 1), then the new cluster parameter is drawn from the posterior using x_i as the sole observation. The posterior distribution for the parameters for a new cluster factors as follows (letting c = m_n^{−i} + 1):

    p(j̃_c, k̃_c | x_i, I) = p(x_i | j̃_c, k̃_c) p(j̃_c | k̃_c, I) p(k̃_c | I) / p(x_i | I) = p(j̃_c | k̃_c, x_i, I) p(k̃_c | I),    (5.3)

where

    p(j̃_c | k̃_c, x_i, I) = C(k̃_c − 1, j̃_c − 1) x_i^{j̃_c − 1} (1 − x_i)^{k̃_c − j̃_c} = Binomial(j̃_c − 1 | k̃_c − 1, x_i).    (5.4)

Note that k̃_c is not identified and is drawn from its prior distribution, and then (j̃_c − 1) ∼ Binomial(k̃_c − 1, x_i), with the proviso that if k̃_c = 1 then j̃_c = 1.

Cluster parameters. Having classified each of the observations (and having, in the process, discarded old values of θ̃_c and drawn new values of θ̃_c), we then resample each element of θ̃_{1:m_n} conditional on the classifications. Let

    I_c = {i : z_i = c},    (5.5)

and let x_{1:n}^c = {x_i | i ∈ I_c}, where the number of observations in x_{1:n}^c equals d_{n,c}. We have already shown how to make draws if d_{n,c} = 1. Now assume d_{n,c} ≥ 2. The posterior for (j̃_c, k̃_c) is given by

    p(θ̃_c | x_{1:n}, z_{1:n}, α, I) = p(θ̃_c | x_{1:n}^c, I) ∝ h(θ̃_c | I) ∏_{i ∈ I_c} f(x_i | θ̃_c)
        = p(j̃_c, k̃_c | I) (∏_{i ∈ I_c} x_i)^{j̃_c − 1} (∏_{i ∈ I_c} (1 − x_i))^{k̃_c − j̃_c} / B(j̃_c, k̃_c − j̃_c + 1)^{d_{n,c}}.    (5.6)

One can adopt a Metropolis–Hastings scheme with the following proposal:

    k̃_c^* − 1 ∼ Poisson(k̃_c^{(r)})    (5.7)
    j̃_c^* − 1 ∼ Binomial(k̃_c^* − 1, x̄_{1:n}^c),    (5.8)

where x̄_{1:n}^c = (1/d_{n,c}) ∑_{i ∈ I_c} x_i is the sample mean of x_{1:n}^c.^10 Let

    q(θ̃_c^* | θ̃_c, x_{1:n}^c) := Poisson(k̃_c^* − 1 | k̃_c) Binomial(j̃_c^* − 1 | k̃_c^* − 1, x̄_{1:n}^c).    (5.9)

Then

    θ̃_c^{(r+1)} = θ̃_c^*  if  M_c^{(r)} ≥ u^{(r+1)},  and  θ̃_c^{(r)}  otherwise,    (5.10)

where u^{(r+1)} ∼ Uniform(0, 1) and

    M_c^{(r)} = [p(θ̃_c^* | x_{1:n}, z_{1:n}^{(r)}, α^{(r)}, I) / p(θ̃_c^{(r)} | x_{1:n}, z_{1:n}^{(r)}, α^{(r)}, I)] × [q(θ̃_c^{(r)} | θ̃_c^*, x_{1:n}^{c,(r)}) / q(θ̃_c^* | θ̃_c^{(r)}, x_{1:n}^{c,(r)})].    (5.11)

^10 As an alternative, one can adopt a Metropolis scheme with a random-walk proposal for (ã_c, b̃_c). For example, one can draw from a bivariate normal distribution and round the draw to the integers.
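To make the cluster-parameter update concrete, here is a sketch of one Metropolis–Hastings step implementing (5.6)–(5.11) for a single cluster. It is illustrative only: the helper names and the use of NumPy/SciPy are mine, and the proposal follows (5.7)–(5.8) with both draws shifted by one so that k ≥ 1 and 1 ≤ j ≤ k.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_post(j, k, x, xi):
    """Log of the unnormalized posterior (5.6) for a cluster whose members are x."""
    log_prior = np.log(xi) + (k - 1) * np.log1p(-xi) - np.log(k)        # eqs. (3.4)-(3.5)
    return (log_prior + (j - 1) * np.sum(np.log(x))
            + (k - j) * np.sum(np.log1p(-x)) - len(x) * betaln(j, k - j + 1))

def log_poisson_pmf(m, lam):
    return m * np.log(lam) - lam - gammaln(m + 1)

def log_binom_pmf(m, n, p):
    if n == 0:
        return 0.0
    return gammaln(n + 1) - gammaln(m + 1) - gammaln(n - m + 1) + m * np.log(p) + (n - m) * np.log1p(-p)

def mh_update(j, k, x, xi, rng):
    """One MH step for (j, k) given the cluster's observations x, per eqs. (5.7)-(5.11)."""
    xbar = np.mean(x)
    k_new = 1 + rng.poisson(k)                         # eq. (5.7)
    j_new = 1 + rng.binomial(k_new - 1, xbar)          # eq. (5.8)
    log_q_fwd = log_poisson_pmf(k_new - 1, k) + log_binom_pmf(j_new - 1, k_new - 1, xbar)
    log_q_rev = log_poisson_pmf(k - 1, k_new) + log_binom_pmf(j - 1, k - 1, xbar)
    log_M = log_post(j_new, k_new, x, xi) - log_post(j, k, x, xi) + log_q_rev - log_q_fwd
    return (j_new, k_new) if np.log(rng.uniform()) < log_M else (j, k)
```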

Concentration parameter. The remaining component of ψ_n is α. The likelihood for α is given by (2.17):

    p(z_{1:n} | α) ∝ α^{m_n}/(α)_n = α^{m_n} Γ(α)/Γ(n + α).    (5.12)

Draws of α can be made using a Metropolis–Hastings scheme.^11 Let the proposal be

    α^* ∼ Singh–Maddala(α^{(r)}, h),    (5.13)

where^12

    Singh–Maddala(x | m, h) = h x^{h−1} m^h / (x^h + m^h)^2.    (5.14)

Note that P_α = Singh–Maddala(1, 1). The inverse-CDF method can be used to make draws from Singh–Maddala(m, h): Draw u ∼ Uniform(0, 1) and set x = m (u/(1 − u))^{1/h}. For determining whether or not to accept the proposal, we require the Hastings ratio,

    Singh–Maddala(α^{(r)} | α^*, h) / Singh–Maddala(α^* | α^{(r)}, h) = α^*/α^{(r)}.    (5.15)

Then

    α^{(r+1)} = α^*  if  M^{(r)} ≥ u^{(r+1)},  and  α^{(r)}  otherwise,    (5.16)

where

    M^{(r)} = [p(z_{1:n}^{(r)} | α^*) / p(z_{1:n}^{(r)} | α^{(r)})] × (α^*/α^{(r)}) = [(α^{(r)})_n / (α^*)_n] × (α^*/α^{(r)})^{m_n + 1}.    (5.17)

6. Investigation: Part I

In this section we apply the quasi-Bernstein density prior to a number of applications and investigate its performance: the galaxy data, the Buffalo snowfall data, and the Old Faithful data (in two dimensions). Unless otherwise noted, ξ = 1/200. With this setting, the prior mean for k̃_c equals 200 and the prior standard deviation equals approximately 199.5. The 90% highest prior density (i.e., probability mass) region runs from k̃_c = 1 to k̃_c = 460. As noted above, the prior for α is given by p(α | I) = 1/(1 + α)^2, so that the prior median for α equals one.

Galaxy data. Figure 5 shows the quasi-Bernstein predictive density for the galaxy data with support over the interval [5, 40]. A total of 50,500 draws were made, with the first 500 discarded and 1,000 draws retained (every 50th) from the remaining 50,000. Figures 6–8 show the posterior distributions of the number of clusters, λ = α/(1 + α), and k̃_c.

^11 Alternatively, draws of α can be made using a Metropolis scheme. Make a random-walk proposal of λ^* ∼ N(λ^{(r)}, s^2), where λ^{(r)} = α^{(r)}/(1 + α^{(r)}) and s^2 is a suitable scale. Then evaluate the likelihood ratio for α^* = λ^*/(1 − λ^*) relative to α^{(r)} to determine whether or not to accept the proposal α^*.
^12 The Singh–Maddala distribution is also known as the Burr Type XII distribution. The version of the Singh–Maddala distribution expressed in (5.14) is specialized from the more general version that involves a third parameter. In Mathematica, the distribution is represented by SinghMaddalaDistribution[1, h, m].
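One way to code the α update of (5.13)–(5.17) is sketched below, using the inverse-CDF draw for the Singh–Maddala proposal described in Section 5. The function names, the choice of h, and the use of NumPy/SciPy are mine. The sketch targets the full conditional p(z_{1:n} | α) p(α | I), so its acceptance ratio is the expression in (5.17) multiplied by the prior ratio (1 + α^{(r)})^2 / (1 + α^*)^2; drop the final term of log_target_alpha to reproduce (5.17) exactly.

```python
import numpy as np
from scipy.special import gammaln

def log_target_alpha(alpha, n, m):
    """log of p(z_{1:n} | alpha) * p(alpha | I): eq. (5.12) times the prior (3.12)."""
    return m * np.log(alpha) + gammaln(alpha) - gammaln(n + alpha) - 2.0 * np.log1p(alpha)

def update_alpha(alpha, n, m, h, rng):
    """One MH step for alpha with the Singh-Maddala(alpha, h) proposal of eq. (5.13)."""
    u = rng.uniform()
    alpha_star = alpha * (u / (1.0 - u)) ** (1.0 / h)     # inverse-CDF draw from the proposal
    log_M = (log_target_alpha(alpha_star, n, m) - log_target_alpha(alpha, n, m)
             + np.log(alpha_star) - np.log(alpha))        # Hastings factor alpha*/alpha, eq. (5.15)
    return alpha_star if np.log(rng.uniform()) < log_M else alpha
```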

Buffalo snowfall data. Figure 9 shows the quasi-Bernstein predictive density for the Buffalo snowfall data with support over the interval [0, 150]. A total of 50,500 draws were made, with the first 500 discarded and 1,000 draws retained (every 50th) from the remaining 50,000. The posterior distributions of the number of clusters, λ = α/(1 + α), and k̃_c are shown in the figures that follow.

The density in Figure 9 is substantially smoother than what is produced by many alternative models, which typically display three modes. In the current model, fixing α = 5 will produce three modes, but this value for α is deemed unlikely according to the model when we learn about α. The posterior median for α is well below 5. The posterior probability of α ≥ 5 is about 20 times lower than the prior probability. (Increasing α also has the effect of increasing the probability of a new cluster, which in turn has the effect of increasing the predictive density at the boundaries of the region. For example, the predictive density increases by roughly a factor of 10 at x_{n+1} = 150.) With this data set there is a strong (posterior) relation between α and k̃_c. The posterior median of k̃_c equals about 10 given α < 1, but it equals about 140 given α ≥ 1.

Old Faithful data. Here we examine the Old Faithful data, which comprises 272 observations of pairs composed of eruption time (the duration of the current eruption in minutes) and waiting time (the amount of time until the subsequent eruption in minutes). Figure 13 shows a scatter plot of the data, a contour plot of the joint predictive distribution, and two line plots of conditional expectations computed from the joint distribution. The distribution was given positive support over the region [1, 5.75] × [35, 105]. The distribution is distinctly bimodal. Figure 14 shows the marginal predictive distributions computed from the joint distribution (along with rug plots of the data). The following four figures provide diagnostics.

7. Indirect density estimation for latent variables

Up to this point we have assumed that x_{1:n} was observed and the goal of inference was the posterior predictive distribution p(x_{n+1} | x_{1:n}, I). In this section, we now suppose x_{1:n} is latent and instead Y_{1:n} = (Y_1, ..., Y_n) is observed. Note that Y_i may be a vector of observations. To accommodate this situation, we augment (2.13) with

    Y_i | x_i ∼ P_{Y_i}(x_i).    (7.1)

The form of the density p(Y_i | x_i) will depend on the specific application. Nuisance parameters may have been integrated out to obtain p(Y_i | x_i). One may interpret p(Y_i | x_i) as a noisy signal for x_i.

When x_{1:n} was observed, there was an obvious asymmetry between x_i ∈ x_{1:n} and x_{n+1}. This asymmetry carries over to the situation where x_{1:n} is latent. In particular, we observe signals for x_i ∈ x_{1:n} but we do not observe a signal for x_{n+1}. Based on this distinction, we refer to x_i as a specific case because we have a (specific) signal for it, and we refer to x_{n+1} as the generic case because it applies to any coefficient for which we have no signal as yet (and for which we judge x_{1:n} to provide an appropriate basis for inference). The specific

cases are the ones that appear in the likelihood:

    p(Y_{1:n} | x_{1:n+1}) = p(Y_{1:n} | x_{1:n}) = ∏_{i=1}^{n} p(Y_i | x_i).    (7.2)

The goal of inference is the indirect posterior predictive distribution, p(x_{n+1} | Y_{1:n}, I). From a conceptual standpoint, the indirect predictive distribution is the average of the direct distribution, where the average is computed with respect to the posterior uncertainty of x_{1:n}:

    p(x_{n+1} | Y_{1:n}, I) = ∫ p(x_{n+1} | x_{1:n}, I) p(x_{1:n} | Y_{1:n}, I) dx_{1:n}.    (7.3)

Once again we can express the distribution of interest in terms of ψ_n:

    p(x_{n+1} | Y_{1:n}, I) = ∫ p(x_{n+1} | ψ_n) p(ψ_n | Y_{1:n}, I) dψ_n ≈ (1/R) ∑_{r=1}^{R} p(x_{n+1} | ψ_n^{(r)}),    (7.4)

where {ψ_n^{(r)}}_{r=1}^{R} now represents draws from p(ψ_n | Y_{1:n}, I).^13 The sampler works as before with the additional step of drawing x_{1:n} | Y_{1:n}, ψ_n for each sweep of the sampler. Since the joint likelihood factors [see (7.2)], the full conditional posterior for x_i reduces to the posterior for x_i in isolation (conditional on θ_i):

    p(x_i | Y_{1:n}, x_{1:n}^{−i}, ψ_n, I) = p(x_i | Y_i, θ_i) evaluated at θ_i = θ̃_{z_i},    (7.5)

where

    p(x_i | Y_i, θ_i) = p(Y_i | x_i) f(x_i | θ_i) / ∫ p(Y_i | x_i) f(x_i | θ_i) dx_i.    (7.6)

The posterior distributions of the specific cases can be approximated with histograms of the draws {x_i^{(r)}}_{r=1}^{R} from the posterior. However, we can adopt a Rao–Blackwellization approach (as we have done with the generic case) and obtain a lower-variance approximation. In particular,

    p(x_i | Y_{1:n}, I) = ∫ p(x_i | Y_i, θ_i) p(θ_i | Y_{1:n}, I) dθ_i ≈ (1/R) ∑_{r=1}^{R} p(x_i | Y_i, θ_i^{(r)}).    (7.7)

Note θ_i^{(r)} = θ̃_c^{(r)} where c = z_i^{(r)}.

^13 In passing, note the likelihood of the observations Y_{1:n} can be computed via a succession of generic distributions: p(Y_{1:n} | I) = p(Y_1 | I) ∏_{i=2}^{n} p(Y_i | Y_{1:i−1}, I), where p(Y_i | Y_{1:i−1}, I) = ∫ p(Y_i | x_i) p(x_i | Y_{1:i−1}, I) dx_i.

Binomial data. An important case is when the data are binomial in nature:^14

    p(Y_i | x_i) = Binomial(s_i | T_i, x_i),    (7.8)

where T_i is the number of trials, s_i is the number of successes, and x_i is the latent probability of success. In this case, the conditional posterior for a specific case is

    p(x_i | Y_i, θ_i) = Beta(x_i | a_i + s_i, b_i + T_i − s_i),    (7.9)

which can be used in (7.7). In this setting both specific and generic posterior distributions are mixtures of beta distributions.

8. Alternative prior for indirect density estimation with binomial data

We present a special case where the kernel F is a point mass located at θ_i ∈ [0, 1] and the base distribution H is uniform over the unit interval:

    F(θ_i) = δ_{θ_i}  and  H = Uniform(0, 1).    (8.1)

The densities can be expressed as

    f(x_i | θ_i) = δ_{θ_i}(x_i)  and  h(θ̃_c | I′) = 1_{[0,1]}(θ̃_c),    (8.2)

where I′ denotes this alternative model. Given (8.2), the conditional predictive distribution [see (2.24)] becomes a mixture of point masses and the uniform density:

    p(x_{n+1} | ψ_n, I′) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] δ_{θ̃_c}(x_{n+1}) + [α/(n + α)] 1_{[0,1]}(x_{n+1}).    (8.3)

The knowledgeable reader will recognize (8.3) as the conditional predictive distribution that would be derived from a DP model (as opposed to a DPM model). This equivalence is a consequence of the point-mass kernel.

The point-mass kernel identifies x_i with θ_i. Since x_i plays no role given θ_i, it is convenient to integrate x_i out, obtaining the likelihood of θ_i:

    p(Y_i | θ_i) := ∫ p(Y_i | x_i) f(x_i | θ_i) dx_i = p(Y_i | x_i) evaluated at x_i = θ_i.    (8.4)

The sampling scheme proceeds as follows. The classifications are updated according to

    p(z_i | z_{1:n}^{−i}, θ̃_{1:m_n^{−i}}, α, Y_{1:n}, I′) ∝ [d_{n,c}^{−i}/(n − 1 + α)] p(Y_i | θ̃_c)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] p(Y_i | I′)  for c = m_n^{−i} + 1,    (8.5)

where

    p(Y_i | θ̃_c) = p(Y_i | θ_i) evaluated at θ_i = θ̃_c    (8.6)
    p(Y_i | I′) = ∫ p(Y_i | θ̃_c) h(θ̃_c | I′) dθ̃_c.    (8.7)

^14 With the binomial likelihood it is possible to analytically integrate out x_{1:n}, as shown in Appendix D. However, we do not follow that procedure in our estimation.
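Under the binomial likelihood, the Rao–Blackwellized approximation (7.7) of a specific case's posterior is simply an average of the Beta densities in (7.9) across retained sweeps. A sketch (NumPy/SciPy assumed; the container of draws is a placeholder of mine):

```python
import numpy as np
from scipy.stats import beta

def specific_case_density(x_grid, s_i, T_i, theta_draws):
    """Approximate p(x_i | Y_{1:n}, I) via eqs. (7.7) and (7.9).

    theta_draws: sequence of (a_i, b_i) pairs, one per retained MCMC sweep,
    giving the cluster parameter assigned to case i at that sweep."""
    dens = np.zeros_like(x_grid, dtype=float)
    for a, b in theta_draws:
        dens += beta.pdf(x_grid, a + s_i, b + T_i - s_i)   # Beta(a + s, b + T - s), eq. (7.9)
    return dens / len(theta_draws)
```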

The case we are interested in involves the binomial likelihood.^15 Therefore,

    p(Y_i | θ̃_c) = Binomial(s_i | T_i, θ̃_c)    (8.8)
    p(Y_i | I′) = 1/(T_i + 1).    (8.9)

Updating the cluster parameters conditional on the classifications is straightforward:

    θ̃_c | Y_{1:n}^c, I′ ∼ Beta(ă_c, b̆_c),    (8.10)

where

    ă_c := 1 + ∑_{l ∈ I_c} s_l  and  b̆_c := 1 + ∑_{l ∈ I_c} (T_l − s_l).    (8.11)

There is no change in the way the updates are computed for α. When we apply our approximations to the generic and specific distributions, we obtain

    p(x_{n+1} | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} [ ∑_{c=1}^{m_n^{(r)}} [d_{n,c}^{(r)}/(n + α^{(r)})] δ_{θ̃_c^{(r)}}(x_{n+1}) + [α^{(r)}/(n + α^{(r)})] 1_{[0,1]}(x_{n+1}) ]    (8.12)

    p(x_i | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} δ_{θ_i^{(r)}}(x_i).    (8.13)

The generic distribution is a weighted average of the specific distributions and the prior distribution. Also note these distributions are composed of mixtures of point masses (along with a continuous distribution for the generic case). It is possible to compute smoother approximations. We turn to that now.

Smoothing. In the binomial-data case with the alternative model it is possible to adopt Algorithm 3 in Neal (2000), which reduces estimation solely to classification (conditional on the concentration parameter). The algorithm involves integrating out θ̃_c:

    p(Y_i | (Y_{1:n}^{−i})^c, I′) = ∫ p(Y_i | θ̃_c) p(θ̃_c | (Y_{1:n}^{−i})^c, I′) dθ̃_c,    (8.14)

where (Y_{1:n}^{−i})^c denotes all the observations excluding the ith that are classified as c. In particular,

    p(Y_i | (Y_{1:n}^{−i})^c, I′) = Beta–Binomial(s_i | ă_c^{−i}, b̆_c^{−i}, T_i),    (8.15)

where

    ă_c^{−i} := ă_c − s_i  and  b̆_c^{−i} := b̆_c − (T_i − s_i).    (8.16)

The classifications are assigned according to

    p(z_i | α, z_{1:n}^{−i}, Y_{1:n}, I′) = [d_{n,c}^{−i}/(n − 1 + α)] p(Y_i | (Y_{1:n}^{−i})^c, I′)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] p(Y_i | I′)  for c = m_n^{−i} + 1
        ∝ [d_{n,c}^{−i}/(n − 1 + α)] Beta–Binomial(s_i | ă_c^{−i}, b̆_c^{−i}, T_i)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] [1/(T_i + 1)]  for c = m_n^{−i} + 1.    (8.17)

^15 For a textbook treatment of this case see Greenberg (2013) and Geweke et al. (2011).
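A sketch of the collapsed classification step (8.17), assuming the cluster sufficient statistics are tracked as success and failure totals; NumPy/SciPy and the function names are assumptions of mine.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_beta_binomial(s, a, b, T):
    """log Beta-Binomial(s | a, b, T)."""
    return (gammaln(T + 1) - gammaln(s + 1) - gammaln(T - s + 1)
            + betaln(a + s, b + T - s) - betaln(a, b))

def classify_case(s_i, T_i, alpha, counts, succ, fail, rng):
    """Draw z_i from eq. (8.17).

    counts[c] = d_{n,c}^{-i}; succ[c], fail[c] = the cluster's success and failure totals
    excluding case i, so a = 1 + succ[c] and b = 1 + fail[c] as in (8.16).
    The common factor 1/(n - 1 + alpha) cancels in the normalization."""
    log_w = [np.log(counts[c]) + log_beta_binomial(s_i, 1 + succ[c], 1 + fail[c], T_i)
             for c in range(len(counts))]
    log_w.append(np.log(alpha) - np.log(T_i + 1))      # new cluster: p(Y_i | I') = 1/(T_i + 1)
    w = np.exp(np.array(log_w) - max(log_w))
    return rng.choice(len(w), p=w / w.sum())           # 0..m-1: existing clusters; m: new cluster
```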

Draws of the concentration parameter can be made as usual.

Posterior distributions. The posterior distribution for the specific case conditional on the classifications and the data is given by

    p(x_i | z_{1:n}, Y_{1:n}, I′) = p(Y_i | x_i) p(x_i | (Y_{1:n}^{−i})^{z_i}, I′) / p(Y_i | (Y_{1:n}^{−i})^{z_i}, I′) = Beta(x_i | ă_{z_i}, b̆_{z_i}),    (8.18)

since

    p(Y_i | x_i) = Binomial(s_i | T_i, x_i)    (8.19)
    p(x_i | (Y_{1:n}^{−i})^{z_i}, I′) = Beta(x_i | ă_{z_i}^{−i}, b̆_{z_i}^{−i})    (8.20)
    p(Y_i | (Y_{1:n}^{−i})^{z_i}, I′) = Beta–Binomial(s_i | ă_{z_i}^{−i}, b̆_{z_i}^{−i}, T_i),    (8.21)

where ă_{z_i} and b̆_{z_i} are given in (8.11) and ă_{z_i}^{−i} and b̆_{z_i}^{−i} are given in (8.16) [where in both cases c = z_i]. Therefore,

    p(x_i | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} p(x_i | z_{1:n}^{(r)}, Y_{1:n}, I′) = (1/R) ∑_{r=1}^{R} Beta(x_i | ă_{z_i}^{(r)}, b̆_{z_i}^{(r)}).    (8.22)

The posterior distribution for the generic case given the concentration parameter, the classifications, and the data is

    p(x_{n+1} | Y_{1:n}, α, z_{1:n}, I′) = ∫ p(x_{n+1} | ψ_n, I′) p(θ̃_{1:m_n} | Y_{1:n}, I′) dθ̃_{1:m_n}
        = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] Beta(x_{n+1} | ă_c, b̆_c) + [α/(n + α)] 1_{[0,1]}(x_{n+1}),    (8.23)

where we have used

    ∫ f(x_{n+1} | θ̃_c) p(θ̃_{1:m_n} | Y_{1:n}, I′) dθ̃_{1:m_n} = p(θ̃_c | Y_{1:n}^c, I′) evaluated at θ̃_c = x_{n+1} = Beta(x_{n+1} | ă_c, b̆_c).    (8.24)

Therefore the posterior distribution for the generic case can be approximated by

    p(x_{n+1} | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} p(x_{n+1} | Y_{1:n}, α^{(r)}, z_{1:n}^{(r)}, I′).    (8.25)

Model comparison. The alternative model may be compared with the main model in terms of their likelihoods. For A ∈ {I, I′},

    p(Y_{1:n} | A) = p(Y_1 | A) ∏_{i=2}^{n} p(Y_i | Y_{1:i−1}, A),    (8.26)

where

    p(Y_i | Y_{1:i−1}, A) = ∫ p(Y_i | x_i) p(x_i | Y_{1:i−1}, A) dx_i.    (8.27)

Note p(Y_1 | I) = p(Y_1 | I′) = ∫_0^1 p(Y_1 | x_1) dx_1.

Table 1. Rat tumor data: 71 studies (rats with tumors/total number of rats).

00/17 00/18 00/18 00/19 00/19 00/19 00/19 00/20 00/20 00/20 00/20 00/20 00/20 00/20 01/20 01/20 01/20 01/20 01/19 01/19 01/18 01/18 02/25 02/24 02/23 01/10 02/20 02/20 02/20 02/20 02/20 02/20 05/49 02/19 05/46 03/27 02/17 07/49 07/47 03/20 03/20 02/13 09/48 04/20 04/20 04/20 04/20 04/20 04/20 04/20 10/50 10/48 04/19 04/19 04/19 05/22 11/46 12/49 05/20 05/20 06/23 05/19 06/22 04/14 06/20 06/20 06/20 16/52 15/47 15/46 09/24

Table 2. The thumbtack data set: 320 instances of binomial experiments with 9 trials each. The results are summarized in terms of the number of experiments that have a given number of successes (number of successes, number of experiments, and frequency in percent).

9. Investigation: Part II

In this section we apply the model to a number of applications and investigate its performance: the rat tumor data, the baseball data, and other data.

Rat tumor data. The rat tumor data are composed of the results from 71 studies. The number of rats per study varied from ten to 52. The rat tumor data are described in Table 5.1 in Gelman et al. (2014) and repeated for convenience in Table 1 (although the data are displayed in a different order). The data are plotted in Figure 3. This plot brings out certain features of the data that are not evident in the table. There are 59 studies for which the total number of rats is less than or equal to 35, and more than half of these studies (32) have observed tumor rates less than or equal to 10%. By contrast, none of the other 12 studies has an observed tumor rate less than or equal to 10%.

The posterior distribution for the generic case is shown in Figure 19. The posterior distributions for the specific cases are shown in Figure 20. This latter figure can be compared with Figure 5.4 in Gelman et al. (2014) to show the differences in the results obtained by the more general approach presented here.

Baseball batting skill. This example is inspired by the example in Efron (2010), which in turn draws on Efron and Morris (1975). We are interested in the ability of baseball players to generate hits. We do not observe this ability directly; rather we observe the outcomes (successes and failures) of a number of trials for a number of players. In this example T_i is the number of at-bats and s_i is the number of hits for player i. See Figure 2 for the data. [The analysis is not complete.]

Thumbtack data. The thumbtack data are shown in Table 2. The posterior distribution for the generic success rate is displayed in Figure 21.

[Figure 2. Baseball data: 18 players with 45 at-bats each. Vertical axis: hits.]

[Figure 3. Rat tumor data: 71 studies; observed rate of tumors plotted against number of rats in study. Number of studies (1 to 7) proportional to area of dot. Number of rats with tumors (0 to 16) indicated by contour lines. There are 59 studies for which the total number of rats is less than or equal to 35 and more than half of these studies (32) have observed tumor rates less than or equal to 10%. By contrast, none of the other 12 studies has an observed tumor rate less than or equal to 10%.]

The posterior distribution for the generic success rate given the alternative model is shown in Figure 26.

Appendix A. Uncertainty about the predictive distribution

The predictive distribution quantifies the uncertainty about x_{n+1} given x_{1:n}. It is this uncertainty that is of interest in this paper. There is, however, a different notion of uncertainty that could be of interest. Consider a decision that must be made based on the predictive distribution. Suppose the decision can be made either with the currently available information or delayed (with some additional cost) until additional information is acquired. Uncertainty about the extent to which that additional information can change the predictive distribution is relevant to the overall decision process.^16

For example, suppose the decision can be made using the current predictive distribution p(x_{n+1} | x_{1:n}, I) or delayed and made with the next predictive distribution

    p(x_{n+2} | x_{n+1}, x_{1:n}, I).    (A.1)

Variation in the next distribution is driven by variation in x_{n+1} from the current distribution, and the expectation of the next distribution equals the current distribution:

    p(x_{n+2} | x_{1:n}, I) = ∫ p(x_{n+2} | x_{n+1}, x_{1:n}, I) p(x_{n+1} | x_{1:n}, I) dx_{n+1}
        = ∫ p(x_{n+1}, x_{n+2} | x_{1:n}, I) dx_{n+1} = p(x_{n+1} | x_{1:n}, I) evaluated at x_{n+1} = x_{n+2},    (A.2)

where the last equality follows from the exchangeability of x_{n+1} and x_{n+2}. If x_{n+1} and x_{n+2} are independent conditional on x_{1:n}, then the next distribution equals its expectation and there is no variation. For n = 0, the next generic distribution (conditional on α) is

    p(x_2 | x_1, α, I) = p(x_1, x_2 | α, I)/p(x_1 | α, I) = p(x_1, x_2 | α, I)/p(x_1 | I)
        = [1/(1 + α)] [∫ f(x_1 | θ̃_1) f(x_2 | θ̃_1) h(θ̃_1 | I) dθ̃_1 / p(x_1 | I)] + [α/(1 + α)] p(x_2 | I).    (A.3)

If α = ∞ then there is no variation.

Appendix B. Partitions, configurations, and adding up

Let P(n) denote the set of partitions of the set {1, ..., n}. The elements of P(n) can be identified with (normalized) values for z_{1:n}. (We normalize z_{1:n} as follows: z_1 = 1 and z_{i+1} ≤ max[z_{1:i}] + 1.) For example, let n = 4. The partition {{1, 2, 4}, {3}} of the set {1, 2, 3, 4} can be identified with the classifications z_{1:4} = (1, 1, 2, 1). Note that the number of elements in the partition is m_n. In this example, m_4 = 2 and the set of multiplicities is {d_{4,1}, d_{4,2}} = {3, 1}.

Each element of P(n) is assigned a probability via p(z_{1:n} | α) such that

    ∑_{ζ ∈ P(n)} p(z_{1:n}(ζ) | α) = 1,    (B.1)

where z_{1:n}(ζ) denotes the classifications that correspond to the partition ζ.

^16 This is not the same as the posterior uncertainty regarding G.
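The adding-up property (B.1) is easy to verify numerically for small n by enumerating the normalized classification vectors and applying (2.17). The sketch below is illustrative only; the function names are mine.

```python
from math import factorial, prod

def crp_prob(z, alpha):
    """p(z_{1:n} | alpha) from eq. (2.17)."""
    n, m = len(z), max(z)
    counts = [z.count(c) for c in range(1, m + 1)]          # multiplicities d_{n,c}
    rising = prod(i + alpha for i in range(n))              # (alpha)_n
    return alpha ** m * prod(factorial(d - 1) for d in counts) / rising

def partitions(n):
    """All normalized classification vectors: z_1 = 1, z_{i+1} <= max(z_{1:i}) + 1."""
    zs = [[1]]
    for _ in range(n - 1):
        zs = [z + [c] for z in zs for c in range(1, max(z) + 2)]
    return zs

alpha, n = 1.0, 4
zs = partitions(n)
print(len(zs), sum(crp_prob(z, alpha) for z in zs))   # 15 partitions; probabilities sum to 1
```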

The number of partitions is given by the Bell number, B_n, which grows very rapidly. For example, B_10 = 115,975 and B_20 = 51,724,158,235,372. Here is a way to evaluate (B.1) without completely enumerating P(n). We can partition P(n) into what Casella et al. (2014) call configuration classes. Configuration classes themselves can be identified with the partitions of an integer. (Even though there are far fewer partitions of the integer n than of the set {1, ..., n}, there remains a combinatoric explosion that makes explicit evaluation infeasible for any but small values of n. For example, the number of partitions of 100 is 190,569,292.)

Classifications with the same set of multiplicities belong to the same configuration class. Consequently, the CRP assigns the same probability to each element of a configuration class [see (2.17)]. Therefore to obtain the probability of the entire configuration class we can evaluate the probability of a representative element of each configuration class (as given by a partition of the associated integer) and multiply by the number of elements of each configuration class. Casella et al. (2014) show that the number of elements in a configuration class is given by

    C(n; d_{n,1}, ..., d_{n,m_n}) / ∏_{i=1}^{n} ( ∑_{c=1}^{m_n} 1_{{i}}(d_{n,c}) )!,    (B.2)

where C(n; d_{n,1}, ..., d_{n,m_n}) is the multinomial coefficient and the second factor corrects for overcounting.

Example. Let n = 4. There are 15 partitions of the set {1, 2, 3, 4}, but there are only five partitions of the integer 4: (4), (3, 1), (2, 2), (2, 1, 1), and (1, 1, 1, 1), where each of the partitions sums to 4. Each partition has an associated value of m_4. The values are 1, 2, 2, 3, and 4, respectively. The second and third partitions of the integer 4 each involve two terms (i.e., m_4 = 2). Interpreting these partitions as configuration classes, the second and third configuration classes belong to the same partition class as defined by Casella et al. (2014).^17 Figure 4 provides a visualization for the n = 4 example. [See also Figure 2 in Casella et al. (2014).] The number of elements in each of the five configuration classes is (1, 4, 3, 6, 1). Additionally assuming α = 1, the probabilities of a representative element from each configuration class are (1/4, 1/12, 1/24, 1/24, 1/24).

Appendix C. Local dependence via copula

In this section we generalize the two-dimensional density model (presented in Section 4) to include local dependence, which we introduce via a copula. The Farlie–Gumbel–Morgenstern (FGM) copula is easy to work with because a flat prior for its copula parameter produces a flat prior predictive density over the unit square [0, 1]^2. The FGM copula is given by

    C(u_1, u_2 | τ) := u_1 u_2 (1 + τ (1 − u_1)(1 − u_2))  where  −1 ≤ τ ≤ 1.    (C.1)

^17 Casella et al. (2014) provide an alternative prior over the partitions of {1, ..., n} in which all elements of a given partition class have equal probability.

[Figure 4. The 15 partitions of {1, 2, 3, 4}. Each of the four rows represents a partition class: m_4 = 1, 2, 3, 4. Each of the five configuration classes is indicated by color. (The second partition class is comprised of two configuration classes.) Source: Wikipedia.]

This is the CDF for (u_1, u_2) defined over the unit square. The joint PDF is

    c(u_1, u_2 | τ) := 1 + τ (2 u_1 − 1)(2 u_2 − 1).    (C.2)

The marginal densities are flat: p(u_i | τ) = 1_{[0,1]}(u_i) for i = 1, 2. The correlation between u_1 and u_2 is τ/3. Thus the potential dependence is somewhat limited. Setting τ = 0 delivers independence.

The CDF for the beta distribution is the regularized incomplete beta function:

    I_x(a, b) = ∫_0^x Beta(t | a, b) dt.    (C.3)

Let the joint CDF for a single observation x_i = (x_{i1}, x_{i2}) be given by

    F(x_i | θ_i) := C(w_{i1}, w_{i2} | τ_i),    (C.4)

where θ_i = (j_{i1}, j_{i2}, k_{i1}, k_{i2}, τ_i) and

    w_{il} := I_{x_{il}}(j_{il}, k_{il} − j_{il} + 1).    (C.5)

Then the joint density is given by

    f(x_i | θ_i) = c(w_{i1}, w_{i2} | τ_i) ∏_{l=1}^{2} Beta(x_{il} | j_{il}, k_{il} − j_{il} + 1).    (C.6)
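A small sketch of the FGM copula pieces above, with a Monte Carlo check that the marginals stay uniform and that the correlation is close to τ/3. NumPy and the rejection-sampling approach are my own illustration, not part of the paper.

```python
import numpy as np

def fgm_density(u1, u2, tau):
    """FGM copula density, eq. (C.2)."""
    return 1.0 + tau * (2.0 * u1 - 1.0) * (2.0 * u2 - 1.0)

def sample_fgm(tau, size, rng):
    """Rejection sampler for the FGM copula (the density is bounded by 1 + |tau|)."""
    out = []
    while len(out) < size:
        u1, u2 = rng.uniform(size=2)
        if rng.uniform() * (1.0 + abs(tau)) < fgm_density(u1, u2, tau):
            out.append((u1, u2))
    return np.array(out)

rng = np.random.default_rng(3)
u = sample_fgm(0.8, 20_000, rng)
print(u.mean(axis=0))                                  # marginal means near 1/2 (flat marginals)
print(np.corrcoef(u[:, 0], u[:, 1])[0, 1], 0.8 / 3)    # correlation near tau/3
```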

Let the prior for θ̃_c = (j̃_{c1}, j̃_{c2}, k̃_{c1}, k̃_{c2}, τ̃_c) be given by

    p(θ̃_c | I) = p(k̃_{c1} | I) p(k̃_{c2} | I) p(τ̃_c | I) / (k̃_{c1} k̃_{c2}),    (C.7)

where

    p(τ̃_c | I) = (1/2) 1_{[−1,1]}(τ̃_c).    (C.8)

Let j̃_c = (j̃_{c1}, j̃_{c2}), k̃_c = (k̃_{c1}, k̃_{c2}), and

    f(x_i | θ̃_c) = c(w̃_{ic1}, w̃_{ic2} | τ̃_c) ∏_{l=1}^{2} Beta(x_{il} | j̃_{cl}, k̃_{cl} − j̃_{cl} + 1),    (C.9)

where

    w̃_{icl} := I_{x_{il}}(j̃_{cl}, k̃_{cl} − j̃_{cl} + 1).    (C.10)

The expectation with respect to τ̃_c produces (conditional) independence between x_{i1} and x_{i2}:

    p(x_i | j̃_c, k̃_c, I) = ∫_{−1}^{1} f(x_i | θ̃_c) p(τ̃_c | I) dτ̃_c = ∏_{l=1}^{2} Beta(x_{il} | j̃_{cl}, k̃_{cl} − j̃_{cl} + 1).    (C.11)

Given the conditional priors for j̃_{c1} and j̃_{c2}, this prior produces a flat prior predictive for any p(k̃_{c1} | I) and p(k̃_{c2} | I):

    p(x_i | I) = 1_{[0,1]^2}(x_i).    (C.12)

The posterior for θ̃_c given a single observation x_i is

    p(θ̃_c | x_i, I) = f(x_i | θ̃_c) p(θ̃_c | I) / p(x_i | I)
        = ( ∏_{l=1}^{2} p(k̃_{cl} | I) ) ( ∏_{l=1}^{2} p(j̃_{cl} | x_{il}, k̃_{cl}) ) ( c(w̃_{ic1}, w̃_{ic2} | τ̃_c) / 2 ),    (C.13)

where the three factors are p(k̃_c | x_i, I), p(j̃_c | x_i, k̃_c, I), and p(τ̃_c | x_i, j̃_c, k̃_c, I), respectively. Thus, we can make a draw from the joint posterior by first drawing k̃_c from its prior distribution, then drawing j̃_c from its distribution conditional on k̃_c, and finally drawing τ̃_c from its conditional distribution.
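The final step of that draw can be done by inverse CDF: given a single observation, the conditional density of τ̃_c in (C.13) is proportional to 1 + τ m with m = (2 w̃_{ic1} − 1)(2 w̃_{ic2} − 1), which is linear in τ on [−1, 1]. The closed-form inversion below is my own derivation under that reading, not a formula from the paper.

```python
import numpy as np

def draw_tau_given_w(w1, w2, rng):
    """Inverse-CDF draw of tau from the density proportional to 1 + tau*(2*w1 - 1)*(2*w2 - 1)
    on [-1, 1], i.e. the last factor of eq. (C.13) for a single observation."""
    m = (2.0 * w1 - 1.0) * (2.0 * w2 - 1.0)
    u = rng.uniform()
    if abs(m) < 1e-12:
        return 2.0 * u - 1.0                                  # flat conditional when m = 0
    return (-1.0 + np.sqrt((1.0 - m) ** 2 + 4.0 * m * u)) / m
```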

Appendix D. Binomial likelihood

When the data are observations from binomial experiments, it is possible to integrate out the unobserved success rates (since the prior is a mixture of beta distributions). There is a closed-form expression for the likelihood of θ_i = (j_i, k_i) in terms of the observations:

    p(s_i | T_i, θ_i) = p(s_i | T_i, j_i, k_i) = ∫_0^1 Binomial(s_i | T_i, x_i) Beta(x_i | j_i, k_i − j_i + 1) dx_i
        = Beta–Binomial(s_i | j_i, k_i − j_i + 1, T_i)
        = C(T_i, s_i) k_i! (s_i + j_i − 1)! (T_i + k_i − s_i − j_i)! / [(j_i − 1)! (k_i − j_i)! (T_i + k_i)!].    (D.1)

Note that

    p(s_i | T_i, k̃_c, I) = ∑_{j̃_c=1}^{k̃_c} p(s_i | T_i, j̃_c, k̃_c)/k̃_c = 1/(T_i + 1),    (D.2)

which is independent of k̃_c. Therefore, p(s_i | T_i, k̃_c, I) = p(s_i | T_i, I). We can again use Algorithm 2 from Neal (2000) to make draws from the posterior distribution. Classification of s_i | T_i depends on

    p(z_i | z_{1:n}^{−i}, θ̃_{1:m_n^{−i}}, α, s_i, T_i) ∝ [d_{n,c}^{−i}/(n − 1 + α)] p(s_i | T_i, θ̃_c)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] p(s_i | T_i, I)  for c = m_n^{−i} + 1.    (D.3)

In order to make a draw of a new cluster component, consider the following. Note that

    p(j̃_c, k̃_c | T_i, s_i, I) = p(s_i | T_i, j̃_c, k̃_c) p(j̃_c, k̃_c | I) / p(s_i | T_i, I)
        = (T_i + 1) Beta–Binomial(s_i | j̃_c, k̃_c − j̃_c + 1, T_i) p(k̃_c | I)/k̃_c
        = Beta–Binomial(j̃_c − 1 | s_i + 1, T_i − s_i + 1, k̃_c − 1) p(k̃_c | I),    (D.4)

so that

    (j̃_c − 1) | (T_i, s_i, k̃_c) ∼ Beta–Binomial(s_i + 1, T_i − s_i + 1, k̃_c − 1),    (D.5)

with the proviso that j̃_c = 1 if k̃_c = 1. Note that a single observation by itself does not identify k̃_c.

Appendix E. Sharing, shrinkage, and pooling

One may interpret the main prior in terms of partial sharing of the parameters. Whenever a parameter is shared among cases, the associated coefficients are shrunk toward a common value. Complete sharing, therefore, implies global shrinkage, while partial sharing implies local shrinkage, which allows for multiple modes to exist simultaneously.

Gelman et al. (2014) and Gelman and Hill (2007) discuss three types of pooling: no pooling, complete pooling, and partial pooling. The no-pooling model corresponds to our no-sharing prior and the partial-pooling model corresponds to our one-component complete


More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics Table of Preface page xi PART I INTRODUCTION 1 1 The meaning of probability 3 1.1 Classical definition of probability 3 1.2 Statistical definition of probability 9 1.3 Bayesian understanding of probability

More information

SIMPLEX REGRESSION MARK FISHER

SIMPLEX REGRESSION MARK FISHER SIMPLEX REGRESSION MARK FISHER Abstract. This note characterizes a class of regression models where the set of coefficients is restricted to the simplex (i.e., the coefficients are nonnegative and sum

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information

Hyperparameter estimation in Dirichlet process mixture models

Hyperparameter estimation in Dirichlet process mixture models Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Gibbs Sampling in Latent Variable Models #1

Gibbs Sampling in Latent Variable Models #1 Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Hierarchical Dirichlet Processes with Random Effects

Hierarchical Dirichlet Processes with Random Effects Hierarchical Dirichlet Processes with Random Effects Seyoung Kim Department of Computer Science University of California, Irvine Irvine, CA 92697-34 sykim@ics.uci.edu Padhraic Smyth Department of Computer

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Infinite Feature Models: The Indian Buffet Process Eric Xing Lecture 21, April 2, 214 Acknowledgement: slides first drafted by Sinead Williamson

More information

Bayesian Inference for Dirichlet-Multinomials

Bayesian Inference for Dirichlet-Multinomials Bayesian Inference for Dirichlet-Multinomials Mark Johnson Macquarie University Sydney, Australia MLSS Summer School 1 / 50 Random variables and distributed according to notation A probability distribution

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Fitting Narrow Emission Lines in X-ray Spectra

Fitting Narrow Emission Lines in X-ray Spectra Outline Fitting Narrow Emission Lines in X-ray Spectra Taeyoung Park Department of Statistics, University of Pittsburgh October 11, 2007 Outline of Presentation Outline This talk has three components:

More information

Dirichlet Processes: Tutorial and Practical Course

Dirichlet Processes: Tutorial and Practical Course Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Distance dependent Chinese restaurant processes

Distance dependent Chinese restaurant processes David M. Blei Department of Computer Science, Princeton University 35 Olden St., Princeton, NJ 08540 Peter Frazier Department of Operations Research and Information Engineering, Cornell University 232

More information

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017 Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado CSCI 5822 Probabilistic Model of Human and Machine Learning Mike Mozer University of Colorado Topics Language modeling Hierarchical processes Pitman-Yor processes Based on work of Teh (2006), A hierarchical

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Part 7: Hierarchical Modeling

Part 7: Hierarchical Modeling Part 7: Hierarchical Modeling!1 Nested data It is common for data to be nested: i.e., observations on subjects are organized by a hierarchy Such data are often called hierarchical or multilevel For example,

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Weakness of Beta priors (or conjugate priors in general) They can only represent a limited range of prior beliefs. For example... There are no bimodal beta distributions (except when the modes are at 0

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Spatial Bayesian Nonparametrics for Natural Image Segmentation Spatial Bayesian Nonparametrics for Natural Image Segmentation Erik Sudderth Brown University Joint work with Michael Jordan University of California Soumya Ghosh Brown University Parsing Visual Scenes

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Chapter 12 PAWL-Forced Simulated Tempering

Chapter 12 PAWL-Forced Simulated Tempering Chapter 12 PAWL-Forced Simulated Tempering Luke Bornn Abstract In this short note, we show how the parallel adaptive Wang Landau (PAWL) algorithm of Bornn et al. (J Comput Graph Stat, to appear) can be

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Data Analysis and Uncertainty Part 2: Estimation

Data Analysis and Uncertainty Part 2: Estimation Data Analysis and Uncertainty Part 2: Estimation Instructor: Sargur N. University at Buffalo The State University of New York srihari@cedar.buffalo.edu 1 Topics in Estimation 1. Estimation 2. Desirable

More information

Bayesian Inference. Chapter 2: Conjugate models

Bayesian Inference. Chapter 2: Conjugate models Bayesian Inference Chapter 2: Conjugate models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

Chapter 5. Chapter 5 sections

Chapter 5. Chapter 5 sections 1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Bayesian finite mixtures with an unknown number of. components: the allocation sampler

Bayesian finite mixtures with an unknown number of. components: the allocation sampler Bayesian finite mixtures with an unknown number of components: the allocation sampler Agostino Nobile and Alastair Fearnside University of Glasgow, UK 2 nd June 2005 Abstract A new Markov chain Monte Carlo

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Bayesian Mixtures of Bernoulli Distributions

Bayesian Mixtures of Bernoulli Distributions Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information