QUASI-BERNSTEIN PREDICTIVE DENSITY ESTIMATION

QUASI-BERNSTEIN PREDICTIVE DENSITY ESTIMATION

MARK FISHER

Preliminary and incomplete.

Abstract. The Quasi-Bernstein Density (QBD) model is a Bayesian nonparametric model for density estimation on a finite interval that incorporates beta distributions into a Dirichlet Process Mixture (DPM) model. The coefficients of the beta distributions are restricted to the natural numbers, producing a model that is related to the Bernstein polynomial model introduced by Petrone (1999a). The potential number of mixture components is unbounded. The prior predictive density is uniform over the finite interval. The model is first applied to observed data, and then applied to latent variables via indirect density estimation. In this latter setting the model becomes a DPM model for the prior. In the context of binomial observations and latent success rates, this prior is compared with the Dirichlet process (DP) prior found in the literature. (The DP prior can be viewed as a special case of the DPM prior.)

1. Introduction

This paper is about density estimation on a finite interval using a mixture of beta distributions. (A simple extension allows for two-dimensional density estimation on a rectangle.) Here are the main features: (1) the approach to inference is Bayesian; (2) the parameters of the beta distributions are restricted to the natural numbers; (3) the potential number of mixture components is unbounded; and (4) the prior predictive density is uniform. (The prior predictive density can be made nonuniform if desired.)

As illustrations, the model is applied to a number of standard data sets in one and two dimensions. In addition, I show how to apply the model to latent variables via what I call indirect density estimation. (In this context I introduce the distinction between generic and specific cases.) To illustrate this technique, I apply it to compute the density of unobserved success rates that underlie the observations from binomial experiments and/or units. The results are compared with those generated by an alternative model that has appeared in the literature.

Related literature. This model is related to the Bernstein polynomial density technique of Petrone (1999a). See also Trippa et al. (2011) and Kottas (2006). For asymptotic properties of random Bernstein polynomials, see Petrone (1999b) and Petrone and Wasserman (2002). For a multivariate extension, see Zhao et al. (2013). For a discussion of samplers for Dirichlet process mixture models see Neal (2000).

Date: July 22, 05:23. The views expressed herein are the author's and do not necessarily reflect those of the Federal Reserve Bank of Atlanta or the Federal Reserve System.

Liu (1996) presents a related model in which the latent success rates for binomial observations have a Dirichlet Process (DP) prior. (This model is discussed in Section 8.)

Outline. Sections 2 through 6 present the model as applied to observables. Section 2 provides a general introduction to Dirichlet Process Mixture (DPM) models and Chinese Restaurant Processes (CRP) and derives the general form of the conditional predictive distribution. Section 3 specializes the model to the quasi-Bernstein model. Section 4 presents a two-dimensional extension of the model. Section 5 describes the details of the Markov chain Monte Carlo (MCMC) sampler. An empirical investigation is presented in Section 6. Sections 7 through 9 apply the model to the case of latent variables. Section 7 applies the model to latent variables via indirect density estimation, introducing the distinction between generic and specific cases. Section 8 presents an alternative (DP-based) prior that has been used in the literature for the case of binomial observations with latent success rates. Section 9 presents an empirical investigation for the latent-variable case. In addition there are five appendices. Appendix A discusses potential variation in the predictive distribution arising from additional observations. Appendix B presents additional features of the probabilities associated with the Chinese Restaurant Process. Appendix C presents a generalization of the two-dimensional model that includes local dependence via a copula. Appendix D shows how to integrate out the unobserved success rate in the case of binomial data. Appendix E discusses pooling, shrinkage, and sharing.

2. The standard model in general terms

This section describes the model. Much of the exposition in this section is dedicated to providing the less knowledgeable reader with some insight into (i) the structure of the standard Dirichlet Process Mixture (DPM) model, (ii) its marginalization into a classification model based on the Chinese Restaurant Process (CRP), and (iii) the derivation of the conditional predictive distribution (which forms the basis for posterior estimation). To that end, this section borrows from the exposition of Teh (2010) [to which the reader is referred for additional introductory material]. The general form of the standard model adopted in this paper is stated in (2.13). The novelty is entirely in the specializations, which are presented in Section 3 in (3.1), (3.3), and (3.12).

Inferential goal. Let x_{1:n} = (x_1, ..., x_n) denote a set of observations.^1 The goal of inference is the posterior distribution for the next observation (which we express hereafter as a density), also known as the posterior predictive distribution:

    p(x_{n+1} | x_{1:n}, I),    (2.1)

where I stands for the model and the prior information.^2 Computing this predictive distribution amounts to an exercise in Bayesian density estimation. For density estimation, it is natural to adopt assumptions that allow for a wide

^1 In Section 3, the model will be specialized to require x_i ∈ [0, 1] for all i. This restriction, however, is not relevant in the current section.
^2 Appendix A discusses the potential variation in the predictive distribution arising from additional observations.

range of possible distributions, including the possibility of multi-modality. In what follows, we provide a brief introduction to what has become the standard approach. The discussion of DPM models is intended to help give the reader a sense of where the flexibility of the Bayesian approach to density estimation comes from. However, one may skim this discussion (for notation) and start in earnest with the CRP model [summarized in (2.13)] upon which the predictive distribution is based.

DPM models. Let θ_{1:n} = (θ_1, ..., θ_n) denote a set of latent parameters, where θ_i ∈ Θ. The Dirichlet Process Mixture (DPM) model has the following hierarchical structure:^3

    x_i | θ_i ∼ F(θ_i)    (2.2a)
    θ_i | G ∼ G (iid)    (2.2b)
    G | α ∼ DP(α, H)    (2.2c)
    α ∼ P_α.    (2.2d)

According to (2.2a), the observations are conditionally independent but not (necessarily) identically distributed. The distribution F plays the role of a kernel in density estimation. The location and scale (i.e., bandwidth) of the kernel associated with the ith observation are controlled by the parameter θ_i. The prior distribution for θ_i is G. G is a random probability distribution with a prior given by a Dirichlet Process (DP). The DP depends on a concentration parameter α > 0 and a base distribution H, where H is a distribution over Θ. The flexibility of the DP prior is such that G can approximate arbitrarily well (in the weak topology^4) any distribution defined on the support of H.

The DP prior for G is completely characterized by the following proposition: If A_1, ..., A_s is any finite measurable partition of Θ, then

    (G(A_1), ..., G(A_s)) ∼ Dirichlet(α H(A_1), ..., α H(A_s)),    (2.3)

where Dirichlet denotes the Dirichlet distribution. Note ∑_{j=1}^{s} G(A_j) = ∑_{j=1}^{s} H(A_j) = 1. It follows from the properties of the Dirichlet distribution that

    E[G(A)] = H(A), for any measurable set A ⊆ Θ.    (2.4)

Thus, the base distribution H can be thought of as the expectation of the random distribution G. In addition, the variance of G around H is inversely related to the concentration parameter α:

    V[G(A)] = H(A) (1 − H(A)) / (1 + α).    (2.5)

Thus, the concentration parameter determines how concentrated G is around its expectation H. Realizations of G are (almost surely) discrete even if H is not, and consequently there is positive probability that θ_i = θ_j for some j ≠ i. The smaller the concentration parameter, the greater the probability of duplication. (This will become evident shortly.)

^3 The distributions F, H, and P_α will be specified in Section 3.
^4 The central limit theorem involves the weak topology.

There is an explicit representation for G. Let θ̃ = (θ̃_1, θ̃_2, ...) denote the support of G and let v = (v_1, v_2, ...) denote the associated probabilities. In other words, v_c is the probability that θ_i = θ̃_c. Then G can be expressed as

    G = ∑_{c=1}^{∞} v_c δ_{θ̃_c},    (2.6)

where δ_w denotes a point-mass (distribution) located at w. The randomness of G follows from the randomness of both the support and the associated probabilities:

    θ̃_c ∼ H (iid)    (2.7a)
    v | α ∼ Stick(α),    (2.7b)

where Stick(α) denotes the stick-breaking distribution given by^5

    v_c = ε_c ∏_{l=1}^{c−1} (1 − ε_l)  where  ε_c ∼ Beta(1, α) (iid).    (2.8)

The parameter α controls the rate at which the weights decline on average. In particular, the weights decline geometrically in expectation:

    E[v_c | α, I] = α^{c−1} / (1 + α)^c.    (2.9)

Note E[v_1 | α, I] = 1/(1 + α) and E[∑_{c=n+1}^{∞} v_c | α, I] = (α/(1 + α))^n. When α is small, a few weights dominate the distribution, thereby producing a substantial probability of duplicates.

Marginalization. There are a number of schemes to estimate the DPM-based model (assuming F, H and P_α have been specified). However, it is possible to reexpress the model (via marginalization) in a form that allows for estimation schemes that are both simpler and lower-variance. Although the marginalized model cannot deliver inferences about G, it does deliver the predictive inferences that are the subject of this paper.

As a first step in reexpressing the model, introduce the classification variables z_{1:n} = (z_1, ..., z_n), where z_i = c indicates θ_i = θ̃_c.^6 The classifications are solely determined by the mixture weights v. In particular, z_i takes on value c with probability v_c, which can be expressed as

    z_{1:n} | v ∼ Multinomial(v).    (2.10)

The reexpression is completed by integrating out v using its prior, producing the marginal distribution for the classifications:

    p(z_{1:n} | α) = ∫ p(z_{1:n} | v) p(v | α) dv.    (2.11)

This distribution can be expressed as follows:

    z_{1:n} | α ∼ CRP(α),    (2.12)

where CRP(α) denotes the Chinese Restaurant Process (CRP) with concentration parameter α. The CRP plays a central role in the model and we will discuss it shortly.

^5 Start with a stick of length one. Break off the fraction ε_1, leaving a stick of length 1 − ε_1. Then break off the fraction ε_2 of the remaining stick, leaving a stick of length (1 − ε_1)(1 − ε_2). Continue in this manner.
^6 Note that z_{1:n} can be interpreted as a partition of the set {1, ..., n}. We may interpret the partition as determining the way in which the parameters are shared among the observations.
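To make the stick-breaking construction (2.7)–(2.9) concrete, the sketch below draws a truncated set of weights from Stick(α) and checks by Monte Carlo that they decline geometrically in expectation, as in (2.9). This is an illustrative sketch rather than part of the model specification; the truncation level C, the function name, and the use of NumPy are assumptions of mine.

```python
import numpy as np

def stick_breaking_weights(alpha, C, rng):
    """Draw the first C stick-breaking weights from Stick(alpha), eq. (2.8)."""
    eps = rng.beta(1.0, alpha, size=C)                         # epsilon_c ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - eps)[:-1]))
    return eps * remaining                                     # v_c = eps_c * prod_{l<c} (1 - eps_l)

rng = np.random.default_rng(0)
alpha, C = 1.0, 50
draws = np.array([stick_breaking_weights(alpha, C, rng) for _ in range(20000)])

# Compare simulated means with E[v_c | alpha] = alpha^(c-1) / (1 + alpha)^c, eq. (2.9).
c = np.arange(1, 6)
print(draws.mean(axis=0)[:5])
print(alpha ** (c - 1) / (1.0 + alpha) ** c)
```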

Summary. The marginalized model is summarized as follows:

    x_i | θ_i ∼ F(θ_i)    (2.13a)
    θ_i = θ̃_{z_i}    (2.13b)
    θ̃_c ∼ H (iid)    (2.13c)
    z_{1:n} | α ∼ CRP(α)    (2.13d)
    α ∼ P_α.    (2.13e)

This is a standard model in the literature. The novelty introduced in this paper comes in the specification of F, H, and P_α (in Section 3). In the meantime, it is convenient to discuss the CRP and also to derive the general forms of some additional expressions.

CRP as sharing. Perhaps the most interesting aspect of (2.13) is the sharing of parameters among the observations. We begin by stating certain features of this sharing. In particular, there is a finite number of clusters, where the number of clusters is random. Each of the clusters has a common parameter, where θ̃_c is the parameter for cluster c. Each observation has a classification parameter z_i, where z_i = c denotes that θ_i belongs to cluster c, and consequently

    θ_i = θ̃_{z_i}.    (2.14)

We assume the cluster parameters are independently and identically distributed according to some distribution function H:

    θ̃_c ∼ H (iid).    (2.15)

The way in which the parameters are shared is determined by a probability distribution for the classifications, z_{1:n} = (z_1, ..., z_n). For this purpose we adopt the Chinese Restaurant Process (CRP). The imagery behind the CRP is that there is a restaurant with an unlimited number of tables at which customers are randomly seated, and all customers seated at the same table share the same entree. In the analogy, the tables correspond to the clusters, the customers correspond to the observations, and the entrees correspond to the parameter values.

The seating assignment works as follows. Customers arrive one at a time, and the first customer is always seated at table number one. Subsequent customers are seated either at an already occupied table or at the next unoccupied table. The probabilities of each of these possible seating choices are described next. Suppose n customers have already been seated. Let m_n denote the number of occupied tables, which can be anywhere from 1 to n. Let d_{n,c} denote the number of customers seated at table c. (For future reference, note that m_n = max[z_{1:n}], that d_{n,c} is the multiplicity of c in z_{1:n}, and that ∑_{c=1}^{m_n} d_{n,c} = n.) The probability that the next customer is seated at table c is given by

    p(z_{n+1} = c | z_{1:n}, α) = d_{n,c}/(n + α)  for c ∈ {1, ..., m_n},  and  α/(n + α)  for c = m_n + 1,    (2.16)

where α is the concentration parameter.^7

^7 The CRP can be interpreted as the limit of a sharing scheme involving K categories. Let z_{1:n} | w ∼ Multinomial(n, w) where w ∼ Dirichlet(α/K, ..., α/K). Integrate out w and let K → ∞. See Neal (2000) for details.
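The sequential seating rule (2.16) is easy to simulate directly. The sketch below is illustrative only; the function name and the use of NumPy are mine.

```python
import numpy as np

def crp_sample(n, alpha, rng):
    """Simulate table assignments z_{1:n} from CRP(alpha) using eq. (2.16)."""
    z = [1]                     # the first customer sits at table 1
    counts = [1]                # d_{i,c}: current occupancy of each table
    for i in range(1, n):       # i customers already seated
        probs = np.array(counts + [alpha]) / (i + alpha)   # existing tables, then a new table
        table = rng.choice(len(probs), p=probs) + 1
        if table > len(counts):
            counts.append(1)
        else:
            counts[table - 1] += 1
        z.append(table)
    return z

rng = np.random.default_rng(0)
print(crp_sample(10, 1.0, rng))
```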

The effect of α on the amount of sharing can be seen in (2.16). If α = 0, then all customers are seated at the first table and there is complete sharing (θ_i = θ̃_1). At the other extreme, if α = ∞, then each customer is seated separately and there is no sharing (θ_i = θ̃_i). Intermediate values for α produce partial sharing. Because the value of α plays such an important role in determining the amount of sharing, we provide it with a prior and allow the data to help determine its value.

The probability of z_{1:n} (given α) can be computed using (2.16). It depends only on the set of multiplicities associated with z_{1:n}:

    p(z_{1:n} | α) = p(z_1 | α) ∏_{i=1}^{n−1} p(z_{i+1} | z_{1:i}, α) = α^{m_n} ∏_{c=1}^{m_n} (d_{n,c} − 1)! / (α)_n,    (2.17)

where (α)_n := ∏_{i=1}^{n} (i − 1 + α) = Γ(n + α)/Γ(α).^8

Posterior predictive distribution revisited. The following parameter vector is central to what follows:

    ψ_n := (α, z_{1:n}, θ̃_{1:m_n}).    (2.18)

We can express the posterior predictive distribution in terms of ψ_n:

    p(x_{n+1} | x_{1:n}, I) = ∫ p(x_{n+1} | ψ_n, I) p(ψ_n | x_{1:n}, I) dψ_n.    (2.19)

Calculation of the posterior distribution according to (2.19) involves the following steps: Find the explicit form for the conditional predictive distribution

    p(x_{n+1} | ψ_n, I),    (2.20)

make draws of {ψ_n^{(r)}}_{r=1}^{R} from the posterior distribution p(ψ_n | x_{1:n}, I), and compute the approximation (Rao–Blackwellization):

    p(x_{n+1} | x_{1:n}, I) ≈ (1/R) ∑_{r=1}^{R} p(x_{n+1} | ψ_n^{(r)}, I).    (2.21)

The task of finding the explicit form for p(x_{n+1} | ψ_n) begins by recognizing

    x_{n+1} | ψ_{n+1} ∼ F(θ̃_{z_{n+1}}).    (2.22)

Then, noting

    ψ_{n+1} = ψ_n ∪ (z_{n+1}, θ̃_{m_n+1}),    (2.23)

the task concludes by integrating out (z_{n+1}, θ̃_{m_n+1}) using distributions conditional on ψ_n. The result is a mixture of m_n + 1 terms:

    p(x_{n+1} | ψ_n, I) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] f(x_{n+1} | θ̃_c) + [α/(n + α)] p(x_{n+1} | I),    (2.24)

where f(· | θ_i) is the density of F(θ_i), h(· | I) is the density of H, and

    p(x_i | I) = ∫ f(x_i | θ̃_c) h(θ̃_c | I) dθ̃_c    (2.25)

for any i ≥ 1 (including n + 1).

^8 See Appendix B for additional information regarding the assignment of probabilities to partitions.
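The following sketch shows how (2.21) and (2.24) combine in practice: evaluate the conditional predictive density on a grid for each retained posterior draw of ψ_n and average. It is a schematic rather than the paper's code; the kernel density, the prior predictive density, and the container format for the draws are placeholders to be supplied by the user.

```python
import numpy as np

def conditional_predictive(x, psi, kernel_pdf, prior_pdf):
    """Evaluate p(x | psi_n, I) from eq. (2.24) for one draw psi = (alpha, z, theta)."""
    alpha, z, theta = psi["alpha"], psi["z"], psi["theta"]   # theta: list of cluster parameters
    n = len(z)
    counts = np.bincount(z)[1:]                              # d_{n,c} for clusters labeled 1..m_n
    dens = alpha / (n + alpha) * prior_pdf(x)
    for c, d in enumerate(counts, start=1):
        dens += d / (n + alpha) * kernel_pdf(x, theta[c - 1])
    return dens

def rao_blackwell_predictive(x_grid, draws, kernel_pdf, prior_pdf):
    """Average eq. (2.24) over posterior draws of psi_n, as in eq. (2.21)."""
    return np.mean([conditional_predictive(x_grid, psi, kernel_pdf, prior_pdf)
                    for psi in draws], axis=0)
```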

Equation (2.24) is the standard workhorse of Bayesian nonparametric density estimation. A key feature of (2.24) is this: If α = ∞ then nothing is learned about x_{n+1} from x_{1:n} because no information flows through the hyperparameter and p(x_{n+1} | ψ_n, I) = p(x_{n+1} | I). Thus the no-sharing environment is also a no-learning environment.

3. Specializing the model

Having set the stage with the general structure of the standard Bayesian nonparametric density model, we are now in a position to specify the model by specializing F, H, and P_α. Recall the general structure of the model is summarized by (2.13).

Specializing F. Let θ_i = (a_i, b_i) ∈ Θ = ℕ × ℕ and let

    F(θ_i) = Beta(a_i, b_i),    (3.1)

where

    Beta(x | a, b) = 1_{[0,1]}(x) x^{a−1} (1 − x)^{b−1} / B(a, b),    (3.2)

and where B(a, b) is the beta function, B(a, b) = Γ(a) Γ(b)/Γ(a + b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx. Note the parameters in the Beta distributions are restricted to the positive integers. Also note the model assumes x_i ∈ [0, 1]. It is straightforward to adapt the model to any finite interval.

Specializing H. Let θ̃_c = (ã_c, b̃_c) ∈ Θ. In order to describe the prior for θ̃_c it is convenient to change variables to (j̃_c, k̃_c), where (ã_c, b̃_c) = (j̃_c, k̃_c − j̃_c + 1). Note k̃_c ∈ ℕ and j̃_c ∈ {1, ..., k̃_c}. Let H be characterized by

    k̃_c ∼ Pascal(1, ξ)    (3.3a)
    j̃_c | k̃_c ∼ Discrete(1/k̃_c, ..., 1/k̃_c).    (3.3b)

This prior delivers a model that shares much in common with Petrone's random Bernstein polynomial model. Because we do not treat an entire Bernstein basis as a unit, we refer to this as a quasi-Bernstein model. The prior for (j̃_c, k̃_c) can be expressed as

    p(j̃_c, k̃_c | I) = p(k̃_c | I)/k̃_c,    (3.4)

where

    p(k̃_c | I) = ξ (1 − ξ)^{k̃_c − 1}.    (3.5)
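A minimal sketch of drawing a cluster parameter from the base distribution H in (3.3), under the reading (consistent with (3.5)) that Pascal(1, ξ) is the geometric distribution on {1, 2, ...} with success probability ξ. The function name and the use of NumPy are assumptions of mine.

```python
import numpy as np

def draw_cluster_parameter(xi, rng):
    """Draw (a, b) from H via (j, k): k ~ Geometric(xi) on {1,2,...}, j | k ~ Uniform{1,...,k}."""
    k = rng.geometric(xi)            # p(k) = xi * (1 - xi)**(k - 1), eq. (3.5)
    j = rng.integers(1, k + 1)       # uniform on {1, ..., k}, eq. (3.3b)
    a, b = j, k - j + 1              # change of variables back to (a_c, b_c)
    return a, b

rng = np.random.default_rng(1)
print([draw_cluster_parameter(1 / 200, rng) for _ in range(3)])
```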

We are now equipped to explicitly express p(x_i | I):

    p(x_i | I) = ∫ p(x_i | θ̃_c) p(θ̃_c | I) dθ̃_c
               = ∑_{k̃_c=1}^{∞} ∑_{j̃_c=1}^{k̃_c} Beta(x_i | j̃_c, k̃_c − j̃_c + 1) p(k̃_c | I)/k̃_c
               = ∑_{k̃_c=1}^{∞} p(k̃_c | I) Uniform(x_i | 0, 1) = 1_{[0,1]}(x_i).    (3.6)

The simplification in (3.6) does not depend on the form of p(k̃_c | I). It follows from a property of Bernstein polynomials, for which ∑_{l=0}^{d} B_{l,d}(x) = 1, where B_{l,d}(x) = C(d, l) x^l (1 − x)^{d−l} denotes the lth Bernstein basis function of degree d. Equation (3.6) follows since Beta(x | j, k − j + 1)/k ≡ B_{j−1,k−1}(x).

Here is the prior in terms of (ã_c, b̃_c):

    p(ã_c, b̃_c | I) = p(j̃_c, k̃_c | I) evaluated at (j̃_c, k̃_c) = (ã_c, ã_c + b̃_c − 1) = ξ (1 − ξ)^{ã_c + b̃_c − 2} / (ã_c + b̃_c − 1).    (3.7)

(The Jacobian of the transformation is identically one.)

Conditional predictive distribution revisited. Given the specification of F and H, we can now express the conditional predictive distribution in the form used for estimation in this paper:

    p(x_{n+1} | ψ_n, I) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] Beta(x_{n+1} | ã_c, b̃_c) + [α/(n + α)] 1_{[0,1]}(x_{n+1}),    (3.8)

where the prior for (ã_c, b̃_c) ∈ ℕ × ℕ is given in (3.7). Thus the conditional predictive distribution for x_{n+1} is a mixture of Beta distributions with integer coefficients and consequently so is the predictive distribution p(x_{n+1} | x_{1:n}, I).

Specializing P_α. In formulating the prior for α it is instructive to examine the joint distribution for x_1 and x_2 conditional on α. First, note that the joint distribution conditional on ψ_1 can be expressed as

    p(x_1, x_2 | ψ_1, I) = f(x_1 | θ̃_1) p(x_2 | ψ_1, I),    (3.9)

where [refer to (2.24)]

    p(x_2 | ψ_1, I) = [1/(1 + α)] f(x_2 | θ̃_1) + [α/(1 + α)] p(x_2 | I).    (3.10)

[Figure 1. Density histogram of 10^5 draws from p(x_1, x_2 | I) with ξ = .005.]

Therefore, the joint distribution conditional on α can be obtained by integrating out θ̃_1:

    p(x_1, x_2 | α, I) = ∫ p(x_1, x_2 | ψ_1, I) p(θ̃_1 | I) dθ̃_1
        = [1/(1 + α)] ∫ f(x_1 | θ̃_1) f(x_2 | θ̃_1) h(θ̃_1 | I) dθ̃_1 + [α/(1 + α)] p(x_1 | I) p(x_2 | I).    (3.11)

We see that the joint distribution is a weighted average of the two extreme sharing environments. Equation (3.11) illustrates sharing in its simplest form. In light of this equation, we choose the prior for α such that λ = α/(1 + α) ∼ Uniform(0, 1). In particular, P_α is characterized by the density

    p(α | I) = 1/(1 + α)^2.    (3.12)

This prior differs from the standard conjugate prior for α. With this prior for α, the joint distribution for x_1 and x_2 is

    p(x_1, x_2 | I) = (1/2) ∫ f(x_1 | θ̃_1) f(x_2 | θ̃_1) h(θ̃_1 | I) dθ̃_1 + (1/2) 1_{[0,1]}(x_1) 1_{[0,1]}(x_2),    (3.13)

where the densities f and h are given above. See Figure 1 for a binned density plot of 10^5 draws from p(x_1, x_2 | I) where ξ = .005.
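Draws like those behind Figure 1 can be generated directly from (3.13): with probability 1/2 the two observations share a single cluster parameter drawn from H, and otherwise they are independent uniforms. The sketch below is my own illustration of that reading (NumPy assumed; draw_cluster_parameter is the hypothetical helper sketched after (3.5)).

```python
import numpy as np

def draw_x1_x2(xi, rng):
    """One draw from the prior joint p(x1, x2 | I) in eq. (3.13)."""
    if rng.uniform() < 0.5:                       # shared-parameter branch
        a, b = draw_cluster_parameter(xi, rng)    # theta_1 ~ H
        return rng.beta(a, b), rng.beta(a, b)     # x1, x2 iid Beta(a, b) given theta_1
    return rng.uniform(), rng.uniform()           # independent-uniform branch

rng = np.random.default_rng(2)
draws = np.array([draw_x1_x2(0.005, rng) for _ in range(10_000)])
```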

4. Two-dimensional density estimation

The extension of the model to two-dimensional observations is straightforward.^9 Let x_i = (x_{i1}, x_{i2}), θ_i = (θ_{i1}, θ_{i2}), and θ̃_c = (θ̃_{c1}, θ̃_{c2}), where θ_{il} = (a_{il}, b_{il}) and θ̃_{cl} = (ã_{cl}, b̃_{cl}) for l ∈ {1, 2}. Let F be characterized by

    f(x_i | θ_i) = Beta(x_{i1} | a_{i1}, b_{i1}) Beta(x_{i2} | a_{i2}, b_{i2}).    (4.1)

Note the local independence. A model with local dependence is described in Appendix C. Let H be characterized by p(θ̃_c | I) = p(θ̃_{c1} | I) p(θ̃_{c2} | I), where p(θ̃_{cl} | I) = p(k̃_{cl} | I)/k̃_{cl}. Consequently,

    p(x_i | I) = 1_{[0,1]^2}(x_i).    (4.2)

The priors for α and z_{1:n} are unchanged. The expression for the conditional predictive distribution becomes

    p(x_{n+1} | ψ_n, I) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] Beta(x_{n+1,1} | ã_{c1}, b̃_{c1}) Beta(x_{n+1,2} | ã_{c2}, b̃_{c2}) + [α/(n + α)] 1_{[0,1]^2}(x_{n+1}).    (4.3)

Simple adaptations of the sampling scheme described in Section 5 allow one to make draws of ψ_n in the two-dimensional case and thereby use (2.21) in conjunction with (4.3) to compute the approximation to the posterior predictive density.

5. Drawing from the posterior distribution

We discuss how to make draws of ψ_n = (α, z_{1:n}, θ̃_{1:m_n}) from the posterior distribution via MCMC (Markov chain Monte Carlo). We adopt a Gibbs sampling approach, relying on full conditional posterior distributions. The sampling scheme described here follows the outline summarized by Algorithm 2 in Neal (2000).

Classifications. We begin with making draws of the classifications, z_{1:n}. Let z_{1:n}^{−i} denote the (possibly renormalized) vector of classifications after having removed case i. Renormalization will occur if before removal the cluster associated with observation i is a singleton (i.e., d_{n,z_i} = 1), in which case θ̃_{z_i} will be discarded and the remaining clusters will be relabeled. Based on (2.16) and the exchangeability of the classifications, we have

    p(z_i = c | α, z_{1:n}^{−i}) = d_{n,c}^{−i}/(n − 1 + α)  for c ∈ {1, ..., m_n^{−i}},  and  α/(n − 1 + α)  for c = m_n^{−i} + 1,    (5.1)

where m_n^{−i} = max[z_{1:n}^{−i}] and d_{n,c}^{−i} denotes the multiplicity of class c in z_{1:n}^{−i}. Therefore, the full conditional probability of z_i given the observations is

    p(z_i | α, z_{1:n}^{−i}, θ̃_{1:m_n^{−i}}, x_{1:n}, I) ∝ [d_{n,c}^{−i}/(n − 1 + α)] Beta(x_i | ã_c, b̃_c)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] 1_{[0,1]}(x_i)  for c = m_n^{−i} + 1,    (5.2)

^9 See Zhao et al. (2013) for a discussion of the properties of a two-dimensional Bernstein polynomial density.

where 1_{[0,1]}(x_i) = 1 is the likelihood of x_i under a new, as yet unobserved, component parameter. If a new cluster is selected (i.e., if z_i = m_n^{−i} + 1), then the new cluster parameter is drawn from the posterior using x_i as the sole observation. The posterior distribution for the parameters for a new cluster factors as follows (letting c = m_n^{−i} + 1):

    p(j̃_c, k̃_c | x_i, I) = p(x_i | j̃_c, k̃_c) p(j̃_c | k̃_c, I) p(k̃_c | I) / p(x_i | I) = p(j̃_c | k̃_c, x_i, I) p(k̃_c | I),    (5.3)

where

    p(j̃_c | k̃_c, x_i, I) = C(k̃_c − 1, j̃_c − 1) x_i^{j̃_c − 1} (1 − x_i)^{k̃_c − j̃_c} = Binomial(j̃_c − 1 | k̃_c − 1, x_i).    (5.4)

Note that k̃_c is not identified and is drawn from its prior distribution, and then (j̃_c − 1) ∼ Binomial(k̃_c − 1, x_i), with the proviso that if k̃_c = 1 then j̃_c = 1.

Cluster parameters. Having classified each of the observations (and having, in the process, discarded old values of θ̃_c and drawn new values of θ̃_c), we then resample each element of θ̃_{1:m_n} conditional on the classifications. Let

    I_c = {i : z_i = c},    (5.5)

and let x_{1:n}^c = {x_i | i ∈ I_c}, where the number of observations in x_{1:n}^c equals d_{n,c}. We have already shown how to make draws if d_{n,c} = 1. Now assume d_{n,c} ≥ 2. The posterior for (j̃_c, k̃_c) is given by

    p(θ̃_c | x_{1:n}, z_{1:n}, α, I) = p(θ̃_c | x_{1:n}^c, I) ∝ h(θ̃_c | I) ∏_{i ∈ I_c} f(x_i | θ̃_c)
        = p(j̃_c, k̃_c | I) (∏_{i ∈ I_c} x_i)^{j̃_c − 1} (∏_{i ∈ I_c} (1 − x_i))^{k̃_c − j̃_c} / B(j̃_c, k̃_c − j̃_c + 1)^{d_{n,c}}.    (5.6)

One can adopt a Metropolis–Hastings scheme with the following proposal:

    k̃_c^* − 1 ∼ Poisson(k̃_c^{(r)})    (5.7)
    j̃_c^* − 1 ∼ Binomial(k̃_c^* − 1, x̄_{1:n}^c),    (5.8)

where x̄_{1:n}^c = (1/d_{n,c}) ∑_{i ∈ I_c} x_i is the sample mean of x_{1:n}^c.^10 Let

    q(θ̃_c^* | θ̃_c, x_{1:n}^c) := Poisson(k̃_c^* − 1 | k̃_c) Binomial(j̃_c^* − 1 | k̃_c^* − 1, x̄_{1:n}^c).    (5.9)

Then

    θ̃_c^{(r+1)} = θ̃_c^*  if  M_c^{(r)} ≥ u^{(r+1)},  and  θ̃_c^{(r)}  otherwise,    (5.10)

where u^{(r+1)} ∼ Uniform(0, 1) and

    M_c^{(r)} = [p(θ̃_c^* | x_{1:n}, z_{1:n}^{(r)}, α^{(r)}, I) / p(θ̃_c^{(r)} | x_{1:n}, z_{1:n}^{(r)}, α^{(r)}, I)] × [q(θ̃_c^{(r)} | θ̃_c^*, x_{1:n}^{c,(r)}) / q(θ̃_c^* | θ̃_c^{(r)}, x_{1:n}^{c,(r)})].    (5.11)

^10 As an alternative, one can adopt a Metropolis scheme with a random-walk proposal for (ã_c, b̃_c). For example, one can draw from a bivariate normal distribution and round the draw to the integers.
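To make the cluster-parameter update concrete, here is a sketch of one Metropolis–Hastings step implementing (5.6)–(5.11) for a single cluster. It is illustrative only: the helper names and the use of NumPy/SciPy are mine, and the proposal follows (5.7)–(5.8) with both draws shifted by one so that k ≥ 1 and 1 ≤ j ≤ k.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_post(j, k, x, xi):
    """Log of the unnormalized posterior (5.6) for a cluster whose members are x."""
    log_prior = np.log(xi) + (k - 1) * np.log1p(-xi) - np.log(k)        # eqs. (3.4)-(3.5)
    return (log_prior + (j - 1) * np.sum(np.log(x))
            + (k - j) * np.sum(np.log1p(-x)) - len(x) * betaln(j, k - j + 1))

def log_poisson_pmf(m, lam):
    return m * np.log(lam) - lam - gammaln(m + 1)

def log_binom_pmf(m, n, p):
    if n == 0:
        return 0.0
    return gammaln(n + 1) - gammaln(m + 1) - gammaln(n - m + 1) + m * np.log(p) + (n - m) * np.log1p(-p)

def mh_update(j, k, x, xi, rng):
    """One MH step for (j, k) given the cluster's observations x, per eqs. (5.7)-(5.11)."""
    xbar = np.mean(x)
    k_new = 1 + rng.poisson(k)                         # eq. (5.7)
    j_new = 1 + rng.binomial(k_new - 1, xbar)          # eq. (5.8)
    log_q_fwd = log_poisson_pmf(k_new - 1, k) + log_binom_pmf(j_new - 1, k_new - 1, xbar)
    log_q_rev = log_poisson_pmf(k - 1, k_new) + log_binom_pmf(j - 1, k - 1, xbar)
    log_M = log_post(j_new, k_new, x, xi) - log_post(j, k, x, xi) + log_q_rev - log_q_fwd
    return (j_new, k_new) if np.log(rng.uniform()) < log_M else (j, k)
```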

Concentration parameter. The remaining component of ψ_n is α. The likelihood for α is given by (2.17):

    p(z_{1:n} | α) ∝ α^{m_n}/(α)_n = α^{m_n} Γ(α)/Γ(n + α).    (5.12)

Draws of α can be made using a Metropolis–Hastings scheme.^11 Let the proposal be

    α^* ∼ Singh–Maddala(α^{(r)}, h),    (5.13)

where^12

    Singh–Maddala(x | m, h) = h x^{h−1} m^h / (x^h + m^h)^2.    (5.14)

Note that P_α = Singh–Maddala(1, 1). The inverse-CDF method can be used to make draws from Singh–Maddala(m, h): Draw u ∼ Uniform(0, 1) and set x = m (u/(1 − u))^{1/h}. For determining whether or not to accept the proposal, we require the Hastings ratio,

    Singh–Maddala(α^{(r)} | α^*, h) / Singh–Maddala(α^* | α^{(r)}, h) = α^*/α^{(r)}.    (5.15)

Then

    α^{(r+1)} = α^*  if  M^{(r)} ≥ u^{(r+1)},  and  α^{(r)}  otherwise,    (5.16)

where

    M^{(r)} = [p(z_{1:n}^{(r)} | α^*) / p(z_{1:n}^{(r)} | α^{(r)})] × (α^*/α^{(r)}) = [(α^{(r)})_n / (α^*)_n] × (α^*/α^{(r)})^{m_n + 1}.    (5.17)

6. Investigation: Part I

In this section we apply the quasi-Bernstein density prior to a number of applications and investigate its performance: the galaxy data, the Buffalo snowfall data, and the Old Faithful data (in two dimensions). Unless otherwise noted, ξ = 1/200. With this setting, the prior mean for k̃_c equals 200 and the prior standard deviation equals approximately 199.5. The 90% highest prior density (i.e., probability mass) region runs from k̃_c = 1 to k̃_c = 460. As noted above, the prior for α is given by p(α | I) = 1/(1 + α)^2, so that the prior median for α equals one.

Galaxy data. Figure 5 shows the quasi-Bernstein predictive density for the galaxy data with support over the interval [5, 40]. A total of 50,500 draws were made, with the first 500 discarded and 1,000 draws retained (every 50th) from the remaining 50,000. Figures 6–8 show the posterior distributions of the number of clusters, λ = α/(1 + α), and k̃_c.

^11 Alternatively, draws of α can be made using a Metropolis scheme. Make a random-walk proposal of λ^* ∼ N(λ^{(r)}, s^2), where λ^{(r)} = α^{(r)}/(1 + α^{(r)}) and s^2 is a suitable scale. Then evaluate the likelihood ratio for α^* = λ^*/(1 − λ^*) relative to α^{(r)} to determine whether or not to accept the proposal α^*.
^12 The Singh–Maddala distribution is also known as the Burr Type XII distribution. The version of the Singh–Maddala distribution expressed in (5.14) is specialized from the more general version that involves a third parameter. In Mathematica, the distribution is represented by SinghMaddalaDistribution[1, h, m].
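One way to code the α update of (5.13)–(5.17) is sketched below, using the inverse-CDF draw for the Singh–Maddala proposal described in Section 5. The function names, the choice of h, and the use of NumPy/SciPy are mine. The sketch targets the full conditional p(z_{1:n} | α) p(α | I), so its acceptance ratio is the expression in (5.17) multiplied by the prior ratio (1 + α^{(r)})^2 / (1 + α^*)^2; drop the final term of log_target_alpha to reproduce (5.17) exactly.

```python
import numpy as np
from scipy.special import gammaln

def log_target_alpha(alpha, n, m):
    """log of p(z_{1:n} | alpha) * p(alpha | I): eq. (5.12) times the prior (3.12)."""
    return m * np.log(alpha) + gammaln(alpha) - gammaln(n + alpha) - 2.0 * np.log1p(alpha)

def update_alpha(alpha, n, m, h, rng):
    """One MH step for alpha with the Singh-Maddala(alpha, h) proposal of eq. (5.13)."""
    u = rng.uniform()
    alpha_star = alpha * (u / (1.0 - u)) ** (1.0 / h)     # inverse-CDF draw from the proposal
    log_M = (log_target_alpha(alpha_star, n, m) - log_target_alpha(alpha, n, m)
             + np.log(alpha_star) - np.log(alpha))        # Hastings factor alpha*/alpha, eq. (5.15)
    return alpha_star if np.log(rng.uniform()) < log_M else alpha
```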

Buffalo snowfall data. Figure 9 shows the quasi-Bernstein predictive density for the Buffalo snowfall data with support over the interval [0, 150]. A total of 50,500 draws were made, with the first 500 discarded and 1,000 draws retained (every 50th) from the remaining 50,000. The posterior distributions of the number of clusters, λ = α/(1 + α), and k̃_c are shown in the figures that follow.

The density in Figure 9 is substantially smoother than what is produced by many alternative models, which typically display three modes. In the current model, fixing α = 5 will produce three modes, but this value for α is deemed unlikely according to the model when we learn about α. The posterior median for α is well below 5. The posterior probability of α ≥ 5 is about 20 times lower than the prior probability. (Increasing α also has the effect of increasing the probability of a new cluster, which in turn has the effect of increasing the predictive density at the boundaries of the region. For example, the predictive density increases by roughly a factor of 10 at x_{n+1} = 150.) With this data set there is a strong (posterior) relation between α and k̃_c. The posterior median of k̃_c equals about 10 given α < 1, but it equals about 140 given α ≥ 1.

Old Faithful data. Here we examine the Old Faithful data, which comprises 272 observations of pairs composed of eruption time (the duration of the current eruption in minutes) and waiting time (the amount of time until the subsequent eruption in minutes). Figure 13 shows a scatter plot of the data, a contour plot of the joint predictive distribution, and two line plots of conditional expectations computed from the joint distribution. The distribution was given positive support over the region [1, 5.75] × [35, 105]. The distribution is distinctly bimodal. Figure 14 shows the marginal predictive distributions computed from the joint distribution (along with rug plots of the data). The following four figures provide diagnostics.

7. Indirect density estimation for latent variables

Up to this point we have assumed that x_{1:n} was observed and the goal of inference was the posterior predictive distribution p(x_{n+1} | x_{1:n}, I). In this section, we now suppose x_{1:n} is latent and instead Y_{1:n} = (Y_1, ..., Y_n) is observed. Note that Y_i may be a vector of observations. To accommodate this situation, we augment (2.13) with

    Y_i | x_i ∼ P_{Y_i}(x_i).    (7.1)

The form of the density p(Y_i | x_i) will depend on the specific application. Nuisance parameters may have been integrated out to obtain p(Y_i | x_i). One may interpret p(Y_i | x_i) as a noisy signal for x_i.

When x_{1:n} was observed, there was an obvious asymmetry between x_i ∈ x_{1:n} and x_{n+1}. This asymmetry carries over to the situation where x_{1:n} is latent. In particular, we observe signals for x_i ∈ x_{1:n} but we do not observe a signal for x_{n+1}. Based on this distinction, we refer to x_i as a specific case because we have a (specific) signal for it, and we refer to x_{n+1} as the generic case because it applies to any coefficient for which we have no signal as yet (and for which we judge x_{1:n} to provide an appropriate basis for inference). The specific

cases are the ones that appear in the likelihood:

    p(Y_{1:n} | x_{1:n+1}) = p(Y_{1:n} | x_{1:n}) = ∏_{i=1}^{n} p(Y_i | x_i).    (7.2)

The goal of inference is the indirect posterior predictive distribution, p(x_{n+1} | Y_{1:n}, I). From a conceptual standpoint, the indirect predictive distribution is the average of the direct distribution, where the average is computed with respect to the posterior uncertainty of x_{1:n}:

    p(x_{n+1} | Y_{1:n}, I) = ∫ p(x_{n+1} | x_{1:n}, I) p(x_{1:n} | Y_{1:n}, I) dx_{1:n}.    (7.3)

Once again we can express the distribution of interest in terms of ψ_n:

    p(x_{n+1} | Y_{1:n}, I) = ∫ p(x_{n+1} | ψ_n) p(ψ_n | Y_{1:n}, I) dψ_n ≈ (1/R) ∑_{r=1}^{R} p(x_{n+1} | ψ_n^{(r)}),    (7.4)

where {ψ_n^{(r)}}_{r=1}^{R} now represents draws from p(ψ_n | Y_{1:n}, I).^13 The sampler works as before with the additional step of drawing x_{1:n} | Y_{1:n}, ψ_n for each sweep of the sampler. Since the joint likelihood factors [see (7.2)], the full conditional posterior for x_i reduces to the posterior for x_i in isolation (conditional on θ_i):

    p(x_i | Y_{1:n}, x_{1:n}^{−i}, ψ_n, I) = p(x_i | Y_i, θ_i) evaluated at θ_i = θ̃_{z_i},    (7.5)

where

    p(x_i | Y_i, θ_i) = p(Y_i | x_i) f(x_i | θ_i) / ∫ p(Y_i | x_i) f(x_i | θ_i) dx_i.    (7.6)

The posterior distributions of the specific cases can be approximated with histograms of the draws {x_i^{(r)}}_{r=1}^{R} from the posterior. However, we can adopt a Rao–Blackwellization approach (as we have done with the generic case) and obtain a lower-variance approximation. In particular,

    p(x_i | Y_{1:n}, I) = ∫ p(x_i | Y_i, θ_i) p(θ_i | Y_{1:n}, I) dθ_i ≈ (1/R) ∑_{r=1}^{R} p(x_i | Y_i, θ_i^{(r)}).    (7.7)

Note θ_i^{(r)} = θ̃_c^{(r)} where c = z_i^{(r)}.

^13 In passing, note the likelihood of the observations Y_{1:n} can be computed via a succession of generic distributions: p(Y_{1:n} | I) = p(Y_1 | I) ∏_{i=2}^{n} p(Y_i | Y_{1:i−1}, I), where p(Y_i | Y_{1:i−1}, I) = ∫ p(Y_i | x_i) p(x_i | Y_{1:i−1}, I) dx_i.

Binomial data. An important case is when the data are binomial in nature:^14

    p(Y_i | x_i) = Binomial(s_i | T_i, x_i),    (7.8)

where T_i is the number of trials, s_i is the number of successes, and x_i is the latent probability of success. In this case, the conditional posterior for a specific case is

    p(x_i | Y_i, θ_i) = Beta(x_i | a_i + s_i, b_i + T_i − s_i),    (7.9)

which can be used in (7.7). In this setting both specific and generic posterior distributions are mixtures of beta distributions.

8. Alternative prior for indirect density estimation with binomial data

We present a special case where the kernel F is a point mass located at θ_i ∈ [0, 1] and the base distribution H is uniform over the unit interval:

    F(θ_i) = δ_{θ_i}  and  H = Uniform(0, 1).    (8.1)

The densities can be expressed as

    f(x_i | θ_i) = δ_{θ_i}(x_i)  and  h(θ̃_c | I′) = 1_{[0,1]}(θ̃_c),    (8.2)

where I′ denotes this alternative model. Given (8.2), the conditional predictive distribution [see (2.24)] becomes a mixture of point masses and the uniform density:

    p(x_{n+1} | ψ_n, I′) = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] δ_{θ̃_c}(x_{n+1}) + [α/(n + α)] 1_{[0,1]}(x_{n+1}).    (8.3)

The knowledgeable reader will recognize (8.3) as the conditional predictive distribution that would be derived from a DP model (as opposed to a DPM model). This equivalence is a consequence of the point-mass kernel.

The point-mass kernel identifies x_i with θ_i. Since x_i plays no role given θ_i, it is convenient to integrate x_i out, obtaining the likelihood of θ_i:

    p(Y_i | θ_i) := ∫ p(Y_i | x_i) f(x_i | θ_i) dx_i = p(Y_i | x_i) evaluated at x_i = θ_i.    (8.4)

The sampling scheme proceeds as follows. The classifications are updated according to

    p(z_i | z_{1:n}^{−i}, θ̃_{1:m_n^{−i}}, α, Y_{1:n}, I′) ∝ [d_{n,c}^{−i}/(n − 1 + α)] p(Y_i | θ̃_c)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] p(Y_i | I′)  for c = m_n^{−i} + 1,    (8.5)

where

    p(Y_i | θ̃_c) = p(Y_i | θ_i) evaluated at θ_i = θ̃_c    (8.6)
    p(Y_i | I′) = ∫ p(Y_i | θ̃_c) h(θ̃_c | I′) dθ̃_c.    (8.7)

^14 With the binomial likelihood it is possible to analytically integrate out x_{1:n}, as shown in Appendix D. However, we do not follow that procedure in our estimation.
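Under the binomial likelihood, the Rao–Blackwellized approximation (7.7) of a specific case's posterior is simply an average of the Beta densities in (7.9) across retained sweeps. A sketch (NumPy/SciPy assumed; the container of draws is a placeholder of mine):

```python
import numpy as np
from scipy.stats import beta

def specific_case_density(x_grid, s_i, T_i, theta_draws):
    """Approximate p(x_i | Y_{1:n}, I) via eqs. (7.7) and (7.9).

    theta_draws: sequence of (a_i, b_i) pairs, one per retained MCMC sweep,
    giving the cluster parameter assigned to case i at that sweep."""
    dens = np.zeros_like(x_grid, dtype=float)
    for a, b in theta_draws:
        dens += beta.pdf(x_grid, a + s_i, b + T_i - s_i)   # Beta(a + s, b + T - s), eq. (7.9)
    return dens / len(theta_draws)
```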

The case we are interested in involves the binomial likelihood.^15 Therefore,

    p(Y_i | θ̃_c) = Binomial(s_i | T_i, θ̃_c)    (8.8)
    p(Y_i | I′) = 1/(T_i + 1).    (8.9)

Updating the cluster parameters conditional on the classifications is straightforward:

    θ̃_c | Y_{1:n}^c, I′ ∼ Beta(ă_c, b̆_c),    (8.10)

where

    ă_c := 1 + ∑_{l ∈ I_c} s_l  and  b̆_c := 1 + ∑_{l ∈ I_c} (T_l − s_l).    (8.11)

There is no change in the way the updates are computed for α. When we apply our approximations to the generic and specific distributions, we obtain

    p(x_{n+1} | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} [ ∑_{c=1}^{m_n^{(r)}} [d_{n,c}^{(r)}/(n + α^{(r)})] δ_{θ̃_c^{(r)}}(x_{n+1}) + [α^{(r)}/(n + α^{(r)})] 1_{[0,1]}(x_{n+1}) ]    (8.12)

    p(x_i | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} δ_{θ_i^{(r)}}(x_i).    (8.13)

The generic distribution is a weighted average of the specific distributions and the prior distribution. Also note these distributions are composed of mixtures of point masses (along with a continuous distribution for the generic case). It is possible to compute smoother approximations. We turn to that now.

Smoothing. In the binomial-data case with the alternative model it is possible to adopt Algorithm 3 in Neal (2000), which reduces estimation solely to classification (conditional on the concentration parameter). The algorithm involves integrating out θ̃_c:

    p(Y_i | (Y_{1:n}^{−i})^c, I′) = ∫ p(Y_i | θ̃_c) p(θ̃_c | (Y_{1:n}^{−i})^c, I′) dθ̃_c,    (8.14)

where (Y_{1:n}^{−i})^c denotes all the observations excluding the ith that are classified as c. In particular,

    p(Y_i | (Y_{1:n}^{−i})^c, I′) = Beta–Binomial(s_i | ă_c^{−i}, b̆_c^{−i}, T_i),    (8.15)

where

    ă_c^{−i} := ă_c − s_i  and  b̆_c^{−i} := b̆_c − (T_i − s_i).    (8.16)

The classifications are assigned according to

    p(z_i | α, z_{1:n}^{−i}, Y_{1:n}, I′) = [d_{n,c}^{−i}/(n − 1 + α)] p(Y_i | (Y_{1:n}^{−i})^c, I′)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] p(Y_i | I′)  for c = m_n^{−i} + 1
        ∝ [d_{n,c}^{−i}/(n − 1 + α)] Beta–Binomial(s_i | ă_c^{−i}, b̆_c^{−i}, T_i)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] [1/(T_i + 1)]  for c = m_n^{−i} + 1.    (8.17)

^15 For a textbook treatment of this case see Greenberg (2013) and Geweke et al. (2011).
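A sketch of the collapsed classification step (8.17), assuming the cluster sufficient statistics are tracked as success and failure totals; NumPy/SciPy and the function names are assumptions of mine.

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_beta_binomial(s, a, b, T):
    """log Beta-Binomial(s | a, b, T)."""
    return (gammaln(T + 1) - gammaln(s + 1) - gammaln(T - s + 1)
            + betaln(a + s, b + T - s) - betaln(a, b))

def classify_case(s_i, T_i, alpha, counts, succ, fail, rng):
    """Draw z_i from eq. (8.17).

    counts[c] = d_{n,c}^{-i}; succ[c], fail[c] = the cluster's success and failure totals
    excluding case i, so a = 1 + succ[c] and b = 1 + fail[c] as in (8.16).
    The common factor 1/(n - 1 + alpha) cancels in the normalization."""
    log_w = [np.log(counts[c]) + log_beta_binomial(s_i, 1 + succ[c], 1 + fail[c], T_i)
             for c in range(len(counts))]
    log_w.append(np.log(alpha) - np.log(T_i + 1))      # new cluster: p(Y_i | I') = 1/(T_i + 1)
    w = np.exp(np.array(log_w) - max(log_w))
    return rng.choice(len(w), p=w / w.sum())           # 0..m-1: existing clusters; m: new cluster
```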

Draws of the concentration parameter can be made as usual.

Posterior distributions. The posterior distribution for the specific case conditional on the classifications and the data is given by

    p(x_i | z_{1:n}, Y_{1:n}, I′) = p(Y_i | x_i) p(x_i | (Y_{1:n}^{−i})^{z_i}, I′) / p(Y_i | (Y_{1:n}^{−i})^{z_i}, I′) = Beta(x_i | ă_{z_i}, b̆_{z_i}),    (8.18)

since

    p(Y_i | x_i) = Binomial(s_i | T_i, x_i)    (8.19)
    p(x_i | (Y_{1:n}^{−i})^{z_i}, I′) = Beta(x_i | ă_{z_i}^{−i}, b̆_{z_i}^{−i})    (8.20)
    p(Y_i | (Y_{1:n}^{−i})^{z_i}, I′) = Beta–Binomial(s_i | ă_{z_i}^{−i}, b̆_{z_i}^{−i}, T_i),    (8.21)

where ă_{z_i} and b̆_{z_i} are given in (8.11) and ă_{z_i}^{−i} and b̆_{z_i}^{−i} are given in (8.16) [where in both cases c = z_i]. Therefore,

    p(x_i | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} p(x_i | z_{1:n}^{(r)}, Y_{1:n}, I′) = (1/R) ∑_{r=1}^{R} Beta(x_i | ă_{z_i}^{(r)}, b̆_{z_i}^{(r)}).    (8.22)

The posterior distribution for the generic case given the concentration parameter, the classifications, and the data is

    p(x_{n+1} | Y_{1:n}, α, z_{1:n}, I′) = ∫ p(x_{n+1} | ψ_n, I′) p(θ̃_{1:m_n} | Y_{1:n}, I′) dθ̃_{1:m_n}
        = ∑_{c=1}^{m_n} [d_{n,c}/(n + α)] Beta(x_{n+1} | ă_c, b̆_c) + [α/(n + α)] 1_{[0,1]}(x_{n+1}),    (8.23)

where we have used

    ∫ f(x_{n+1} | θ̃_c) p(θ̃_{1:m_n} | Y_{1:n}, I′) dθ̃_{1:m_n} = p(θ̃_c | Y_{1:n}^c, I′) evaluated at θ̃_c = x_{n+1} = Beta(x_{n+1} | ă_c, b̆_c).    (8.24)

Therefore the posterior distribution for the generic case can be approximated by

    p(x_{n+1} | Y_{1:n}, I′) ≈ (1/R) ∑_{r=1}^{R} p(x_{n+1} | Y_{1:n}, α^{(r)}, z_{1:n}^{(r)}, I′).    (8.25)

Model comparison. The alternative model may be compared with the main model in terms of their likelihoods. For A ∈ {I, I′},

    p(Y_{1:n} | A) = p(Y_1 | A) ∏_{i=2}^{n} p(Y_i | Y_{1:i−1}, A),    (8.26)

where

    p(Y_i | Y_{1:i−1}, A) = ∫ p(Y_i | x_i) p(x_i | Y_{1:i−1}, A) dx_i.    (8.27)

Note p(Y_1 | I) = p(Y_1 | I′) = ∫_0^1 p(Y_1 | x_1) dx_1.

Table 1. Rat tumor data: 71 studies (rats with tumors/total number of rats).

00/17 00/18 00/18 00/19 00/19 00/19 00/19 00/20 00/20 00/20 00/20 00/20 00/20 00/20 01/20 01/20 01/20 01/20 01/19 01/19 01/18 01/18 02/25 02/24 02/23 01/10 02/20 02/20 02/20 02/20 02/20 02/20 05/49 02/19 05/46 03/27 02/17 07/49 07/47 03/20 03/20 02/13 09/48 04/20 04/20 04/20 04/20 04/20 04/20 04/20 10/50 10/48 04/19 04/19 04/19 05/22 11/46 12/49 05/20 05/20 06/23 05/19 06/22 04/14 06/20 06/20 06/20 16/52 15/47 15/46 09/24

Table 2. The thumbtack data set: 320 instances of binomial experiments with 9 trials each. The results are summarized in terms of the number of experiments that have a given number of successes (number of successes, number of experiments, and frequency in percent).

9. Investigation: Part II

In this section we apply the model to a number of applications and investigate its performance: the rat tumor data, the baseball data, and other data.

Rat tumor data. The rat tumor data are composed of the results from 71 studies. The number of rats per study varied from ten to 52. The rat tumor data are described in Table 5.1 in Gelman et al. (2014) and repeated for convenience in Table 1 (although the data are displayed in a different order). The data are plotted in Figure 3. This plot brings out certain features of the data that are not evident in the table. There are 59 studies for which the total number of rats is less than or equal to 35, and more than half of these studies (32) have observed tumor rates less than or equal to 10%. By contrast, none of the other 12 studies has an observed tumor rate less than or equal to 10%.

The posterior distribution for the generic case is shown in Figure 19. The posterior distributions for the specific cases are shown in Figure 20. This latter figure can be compared with Figure 5.4 in Gelman et al. (2014) to show the differences in the results obtained by the more general approach presented here.

Baseball batting skill. This example is inspired by the example in Efron (2010), which in turn draws on Efron and Morris (1975). We are interested in the ability of baseball players to generate hits. We do not observe this ability directly; rather we observe the outcomes (successes and failures) of a number of trials for a number of players. In this example T_i is the number of at-bats and s_i is the number of hits for player i. See Figure 2 for the data. [The analysis is not complete.]

Thumbtack data. The thumbtack data are shown in Table 2. The posterior distribution for the generic success rate is displayed in Figure 21.

[Figure 2. Baseball data: 18 players with 45 at-bats each. Vertical axis: hits.]

[Figure 3. Rat tumor data: 71 studies; observed rate of tumors plotted against number of rats in study. Number of studies (1 to 7) proportional to area of dot. Number of rats with tumors (0 to 16) indicated by contour lines. There are 59 studies for which the total number of rats is less than or equal to 35 and more than half of these studies (32) have observed tumor rates less than or equal to 10%. By contrast, none of the other 12 studies has an observed tumor rate less than or equal to 10%.]

The posterior distribution for the generic success rate given the alternative model is shown in Figure 26.

Appendix A. Uncertainty about the predictive distribution

The predictive distribution quantifies the uncertainty about x_{n+1} given x_{1:n}. It is this uncertainty that is of interest in this paper. There is, however, a different notion of uncertainty that could be of interest. Consider a decision that must be made based on the predictive distribution. Suppose the decision can be made either with the currently available information or delayed (with some additional cost) until additional information is acquired. Uncertainty about the extent to which that additional information can change the predictive distribution is relevant to the overall decision process.^16

For example, suppose the decision can be made using the current predictive distribution p(x_{n+1} | x_{1:n}, I) or delayed and made with the next predictive distribution

    p(x_{n+2} | x_{n+1}, x_{1:n}, I).    (A.1)

Variation in the next distribution is driven by variation in x_{n+1} from the current distribution, and the expectation of the next distribution equals the current distribution:

    p(x_{n+2} | x_{1:n}, I) = ∫ p(x_{n+2} | x_{n+1}, x_{1:n}, I) p(x_{n+1} | x_{1:n}, I) dx_{n+1}
        = ∫ p(x_{n+1}, x_{n+2} | x_{1:n}, I) dx_{n+1} = p(x_{n+1} | x_{1:n}, I) evaluated at x_{n+1} = x_{n+2},    (A.2)

where the last equality follows from the exchangeability of x_{n+1} and x_{n+2}. If x_{n+1} and x_{n+2} are independent conditional on x_{1:n}, then the next distribution equals its expectation and there is no variation. For n = 0, the next generic distribution (conditional on α) is

    p(x_2 | x_1, α, I) = p(x_1, x_2 | α, I)/p(x_1 | α, I) = p(x_1, x_2 | α, I)/p(x_1 | I)
        = [1/(1 + α)] [∫ f(x_1 | θ̃_1) f(x_2 | θ̃_1) h(θ̃_1 | I) dθ̃_1 / p(x_1 | I)] + [α/(1 + α)] p(x_2 | I).    (A.3)

If α = ∞ then there is no variation.

Appendix B. Partitions, configurations, and adding up

Let P(n) denote the set of partitions of the set {1, ..., n}. The elements of P(n) can be identified with (normalized) values for z_{1:n}. (We normalize z_{1:n} as follows: z_1 = 1 and z_{i+1} ≤ max[z_{1:i}] + 1.) For example, let n = 4. The partition {{1, 2, 4}, {3}} of the set {1, 2, 3, 4} can be identified with the classifications z_{1:4} = (1, 1, 2, 1). Note that the number of elements in the partition is m_n. In this example, m_4 = 2 and the set of multiplicities is {d_{4,1}, d_{4,2}} = {3, 1}.

Each element of P(n) is assigned a probability via p(z_{1:n} | α) such that

    ∑_{ζ ∈ P(n)} p(z_{1:n}(ζ) | α) = 1,    (B.1)

where z_{1:n}(ζ) denotes the classifications that correspond to the partition ζ.

^16 This is not the same as the posterior uncertainty regarding G.
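The adding-up property (B.1) is easy to verify numerically for small n by enumerating the normalized classification vectors and applying (2.17). The sketch below is illustrative only; the function names are mine.

```python
from math import factorial, prod

def crp_prob(z, alpha):
    """p(z_{1:n} | alpha) from eq. (2.17)."""
    n, m = len(z), max(z)
    counts = [z.count(c) for c in range(1, m + 1)]          # multiplicities d_{n,c}
    rising = prod(i + alpha for i in range(n))              # (alpha)_n
    return alpha ** m * prod(factorial(d - 1) for d in counts) / rising

def partitions(n):
    """All normalized classification vectors: z_1 = 1, z_{i+1} <= max(z_{1:i}) + 1."""
    zs = [[1]]
    for _ in range(n - 1):
        zs = [z + [c] for z in zs for c in range(1, max(z) + 2)]
    return zs

alpha, n = 1.0, 4
zs = partitions(n)
print(len(zs), sum(crp_prob(z, alpha) for z in zs))   # 15 partitions; probabilities sum to 1
```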

The number of partitions is given by the Bell number, B_n, which grows very rapidly. For example, B_10 = 115,975 and B_20 = 51,724,158,235,372. Here is a way to evaluate (B.1) without completely enumerating P(n). We can partition P(n) into what Casella et al. (2014) call configuration classes. Configuration classes themselves can be identified with the partitions of an integer. (Even though there are far fewer partitions of the integer n than of the set {1, ..., n}, there remains a combinatoric explosion that makes explicit evaluation infeasible for any but small values of n. For example, the number of partitions of 100 is 190,569,292.)

Classifications with the same set of multiplicities belong to the same configuration class. Consequently, the CRP assigns the same probability to each element of a configuration class [see (2.17)]. Therefore to obtain the probability of the entire configuration class we can evaluate the probability of a representative element of each configuration class (as given by a partition of the associated integer) and multiply by the number of elements of each configuration class. Casella et al. (2014) show that the number of elements in a configuration class is given by

    C(n; d_{n,1}, ..., d_{n,m_n}) / ∏_{i=1}^{n} ( ∑_{c=1}^{m_n} 1_{{i}}(d_{n,c}) )!,    (B.2)

where C(n; d_{n,1}, ..., d_{n,m_n}) is the multinomial coefficient and the second factor corrects for overcounting.

Example. Let n = 4. There are 15 partitions of the set {1, 2, 3, 4}, but there are only five partitions of the integer 4: (4), (3, 1), (2, 2), (2, 1, 1), and (1, 1, 1, 1), where each of the partitions sums to 4. Each partition has an associated value of m_4. The values are 1, 2, 2, 3, and 4, respectively. The second and third partitions of the integer 4 each involve two terms (i.e., m_4 = 2). Interpreting these partitions as configuration classes, the second and third configuration classes belong to the same partition class as defined by Casella et al. (2014).^17 Figure 4 provides a visualization for the n = 4 example. [See also Figure 2 in Casella et al. (2014).] The number of elements in each of the five configuration classes is (1, 4, 3, 6, 1). Additionally assuming α = 1, the probabilities of a representative element from each configuration class are (1/4, 1/12, 1/24, 1/24, 1/24).

Appendix C. Local dependence via copula

In this section we generalize the two-dimensional density model (presented in Section 4) to include local dependence, which we introduce via a copula. The Farlie–Gumbel–Morgenstern (FGM) copula is easy to work with because a flat prior for its copula parameter produces a flat prior predictive density over the unit square [0, 1]^2. The FGM copula is given by

    C(u_1, u_2 | τ) := u_1 u_2 (1 + τ (1 − u_1)(1 − u_2))  where  −1 ≤ τ ≤ 1.    (C.1)

^17 Casella et al. (2014) provide an alternative prior over the partitions of {1, ..., n} in which all elements of a given partition class have equal probability.

[Figure 4. The 15 partitions of {1, 2, 3, 4}. Each of the four rows represents a partition class: m_4 = 1, 2, 3, 4. Each of the five configuration classes is indicated by color. (The second partition class is comprised of two configuration classes.) Source: Wikipedia.]

This is the CDF for (u_1, u_2) defined over the unit square. The joint PDF is

    c(u_1, u_2 | τ) := 1 + τ (2 u_1 − 1)(2 u_2 − 1).    (C.2)

The marginal densities are flat: p(u_i | τ) = 1_{[0,1]}(u_i) for i = 1, 2. The correlation between u_1 and u_2 is τ/3. Thus the potential dependence is somewhat limited. Setting τ = 0 delivers independence.

The CDF for the beta distribution is the regularized incomplete beta function:

    I_x(a, b) = ∫_0^x Beta(t | a, b) dt.    (C.3)

Let the joint CDF for a single observation x_i = (x_{i1}, x_{i2}) be given by

    F(x_i | θ_i) := C(w_{i1}, w_{i2} | τ_i),    (C.4)

where θ_i = (j_{i1}, j_{i2}, k_{i1}, k_{i2}, τ_i) and

    w_{il} := I_{x_{il}}(j_{il}, k_{il} − j_{il} + 1).    (C.5)

Then the joint density is given by

    f(x_i | θ_i) = c(w_{i1}, w_{i2} | τ_i) ∏_{l=1}^{2} Beta(x_{il} | j_{il}, k_{il} − j_{il} + 1).    (C.6)
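A small sketch of the FGM copula pieces above, with a Monte Carlo check that the marginals stay uniform and that the correlation is close to τ/3. NumPy and the rejection-sampling approach are my own illustration, not part of the paper.

```python
import numpy as np

def fgm_density(u1, u2, tau):
    """FGM copula density, eq. (C.2)."""
    return 1.0 + tau * (2.0 * u1 - 1.0) * (2.0 * u2 - 1.0)

def sample_fgm(tau, size, rng):
    """Rejection sampler for the FGM copula (the density is bounded by 1 + |tau|)."""
    out = []
    while len(out) < size:
        u1, u2 = rng.uniform(size=2)
        if rng.uniform() * (1.0 + abs(tau)) < fgm_density(u1, u2, tau):
            out.append((u1, u2))
    return np.array(out)

rng = np.random.default_rng(3)
u = sample_fgm(0.8, 20_000, rng)
print(u.mean(axis=0))                                  # marginal means near 1/2 (flat marginals)
print(np.corrcoef(u[:, 0], u[:, 1])[0, 1], 0.8 / 3)    # correlation near tau/3
```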

Let the prior for θ̃_c = (j̃_{c1}, j̃_{c2}, k̃_{c1}, k̃_{c2}, τ̃_c) be given by

    p(θ̃_c | I) = p(k̃_{c1} | I) p(k̃_{c2} | I) p(τ̃_c | I) / (k̃_{c1} k̃_{c2}),    (C.7)

where

    p(τ̃_c | I) = (1/2) 1_{[−1,1]}(τ̃_c).    (C.8)

Let j̃_c = (j̃_{c1}, j̃_{c2}), k̃_c = (k̃_{c1}, k̃_{c2}), and

    f(x_i | θ̃_c) = c(w̃_{ic1}, w̃_{ic2} | τ̃_c) ∏_{l=1}^{2} Beta(x_{il} | j̃_{cl}, k̃_{cl} − j̃_{cl} + 1),    (C.9)

where

    w̃_{icl} := I_{x_{il}}(j̃_{cl}, k̃_{cl} − j̃_{cl} + 1).    (C.10)

The expectation with respect to τ̃_c produces (conditional) independence between x_{i1} and x_{i2}:

    p(x_i | j̃_c, k̃_c, I) = ∫_{−1}^{1} f(x_i | θ̃_c) p(τ̃_c | I) dτ̃_c = ∏_{l=1}^{2} Beta(x_{il} | j̃_{cl}, k̃_{cl} − j̃_{cl} + 1).    (C.11)

Given the conditional priors for j̃_{c1} and j̃_{c2}, this prior produces a flat prior predictive for any p(k̃_{c1} | I) and p(k̃_{c2} | I):

    p(x_i | I) = 1_{[0,1]^2}(x_i).    (C.12)

The posterior for θ̃_c given a single observation x_i is

    p(θ̃_c | x_i, I) = f(x_i | θ̃_c) p(θ̃_c | I) / p(x_i | I)
        = ( ∏_{l=1}^{2} p(k̃_{cl} | I) ) ( ∏_{l=1}^{2} p(j̃_{cl} | x_{il}, k̃_{cl}) ) ( c(w̃_{ic1}, w̃_{ic2} | τ̃_c) / 2 ),    (C.13)

where the three factors are p(k̃_c | x_i, I), p(j̃_c | x_i, k̃_c, I), and p(τ̃_c | x_i, j̃_c, k̃_c, I), respectively. Thus, we can make a draw from the joint posterior by first drawing k̃_c from its prior distribution, then drawing j̃_c from its distribution conditional on k̃_c, and finally drawing τ̃_c from its conditional distribution.
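The final step of that draw can be done by inverse CDF: given a single observation, the conditional density of τ̃_c in (C.13) is proportional to 1 + τ m with m = (2 w̃_{ic1} − 1)(2 w̃_{ic2} − 1), which is linear in τ on [−1, 1]. The closed-form inversion below is my own derivation under that reading, not a formula from the paper.

```python
import numpy as np

def draw_tau_given_w(w1, w2, rng):
    """Inverse-CDF draw of tau from the density proportional to 1 + tau*(2*w1 - 1)*(2*w2 - 1)
    on [-1, 1], i.e. the last factor of eq. (C.13) for a single observation."""
    m = (2.0 * w1 - 1.0) * (2.0 * w2 - 1.0)
    u = rng.uniform()
    if abs(m) < 1e-12:
        return 2.0 * u - 1.0                                  # flat conditional when m = 0
    return (-1.0 + np.sqrt((1.0 - m) ** 2 + 4.0 * m * u)) / m
```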

Appendix D. Binomial likelihood

When the data are observations from binomial experiments, it is possible to integrate out the unobserved success rates (since the prior is a mixture of beta distributions). There is a closed-form expression for the likelihood of θ_i = (j_i, k_i) in terms of the observations:

    p(s_i | T_i, θ_i) = p(s_i | T_i, j_i, k_i) = ∫_0^1 Binomial(s_i | T_i, x_i) Beta(x_i | j_i, k_i − j_i + 1) dx_i
        = Beta–Binomial(s_i | j_i, k_i − j_i + 1, T_i)
        = C(T_i, s_i) k_i! (s_i + j_i − 1)! (T_i + k_i − s_i − j_i)! / [(j_i − 1)! (k_i − j_i)! (T_i + k_i)!].    (D.1)

Note that

    p(s_i | T_i, k̃_c, I) = ∑_{j̃_c=1}^{k̃_c} p(s_i | T_i, j̃_c, k̃_c)/k̃_c = 1/(T_i + 1),    (D.2)

which is independent of k̃_c. Therefore, p(s_i | T_i, k̃_c, I) = p(s_i | T_i, I). We can again use Algorithm 2 from Neal (2000) to make draws from the posterior distribution. Classification of s_i | T_i depends on

    p(z_i | z_{1:n}^{−i}, θ̃_{1:m_n^{−i}}, α, s_i, T_i) ∝ [d_{n,c}^{−i}/(n − 1 + α)] p(s_i | T_i, θ̃_c)  for c ∈ {1, ..., m_n^{−i}},  and  [α/(n − 1 + α)] p(s_i | T_i, I)  for c = m_n^{−i} + 1.    (D.3)

In order to make a draw of a new cluster component, consider the following. Note that

    p(j̃_c, k̃_c | T_i, s_i, I) = p(s_i | T_i, j̃_c, k̃_c) p(j̃_c, k̃_c | I) / p(s_i | T_i, I)
        = (T_i + 1) Beta–Binomial(s_i | j̃_c, k̃_c − j̃_c + 1, T_i) p(k̃_c | I)/k̃_c
        = Beta–Binomial(j̃_c − 1 | s_i + 1, T_i − s_i + 1, k̃_c − 1) p(k̃_c | I),    (D.4)

so that

    (j̃_c − 1) | (T_i, s_i, k̃_c) ∼ Beta–Binomial(s_i + 1, T_i − s_i + 1, k̃_c − 1),    (D.5)

with the proviso that j̃_c = 1 if k̃_c = 1. Note that a single observation by itself does not identify k̃_c.

Appendix E. Sharing, shrinkage, and pooling

One may interpret the main prior in terms of partial sharing of the parameters. Whenever a parameter is shared among cases, the associated coefficients are shrunk toward a common value. Complete sharing, therefore, implies global shrinkage, while partial sharing implies local shrinkage, which allows for multiple modes to exist simultaneously.

Gelman et al. (2014) and Gelman and Hill (2007) discuss three types of pooling: no pooling, complete pooling, and partial pooling. The no-pooling model corresponds to our no-sharing prior and the partial-pooling model corresponds to our one-component complete


More information

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC

Stat 451 Lecture Notes Markov Chain Monte Carlo. Ryan Martin UIC Stat 451 Lecture Notes 07 12 Markov Chain Monte Carlo Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapters 8 9 in Givens & Hoeting, Chapters 25 27 in Lange 2 Updated: April 4, 2016 1 / 42 Outline

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics

PART I INTRODUCTION The meaning of probability Basic definitions for frequentist statistics and Bayesian inference Bayesian inference Combinatorics Table of Preface page xi PART I INTRODUCTION 1 1 The meaning of probability 3 1.1 Classical definition of probability 3 1.2 Statistical definition of probability 9 1.3 Bayesian understanding of probability

More information

SIMPLEX REGRESSION MARK FISHER

SIMPLEX REGRESSION MARK FISHER SIMPLEX REGRESSION MARK FISHER Abstract. This note characterizes a class of regression models where the set of coefficients is restricted to the simplex (i.e., the coefficients are nonnegative and sum

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information

Hyperparameter estimation in Dirichlet process mixture models

Hyperparameter estimation in Dirichlet process mixture models Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and

More information

Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

More information

Flexible Regression Modeling using Bayesian Nonparametric Mixtures

Flexible Regression Modeling using Bayesian Nonparametric Mixtures Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham

More information

Gibbs Sampling in Latent Variable Models #1

Gibbs Sampling in Latent Variable Models #1 Gibbs Sampling in Latent Variable Models #1 Econ 690 Purdue University Outline 1 Data augmentation 2 Probit Model Probit Application A Panel Probit Panel Probit 3 The Tobit Model Example: Female Labor

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham

MCMC 2: Lecture 3 SIR models - more topics. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham MCMC 2: Lecture 3 SIR models - more topics Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. What can be estimated? 2. Reparameterisation 3. Marginalisation

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Bayesian inference Bayes rule. Monte Carlo integation.

More information

Hierarchical Dirichlet Processes with Random Effects

Hierarchical Dirichlet Processes with Random Effects Hierarchical Dirichlet Processes with Random Effects Seyoung Kim Department of Computer Science University of California, Irvine Irvine, CA 92697-34 sykim@ics.uci.edu Padhraic Smyth Department of Computer

More information

10. Exchangeability and hierarchical models Objective. Recommended reading

10. Exchangeability and hierarchical models Objective. Recommended reading 10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Infinite Feature Models: The Indian Buffet Process Eric Xing Lecture 21, April 2, 214 Acknowledgement: slides first drafted by Sinead Williamson

More information

Bayesian Inference for Dirichlet-Multinomials

Bayesian Inference for Dirichlet-Multinomials Bayesian Inference for Dirichlet-Multinomials Mark Johnson Macquarie University Sydney, Australia MLSS Summer School 1 / 50 Random variables and distributed according to notation A probability distribution

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Fitting Narrow Emission Lines in X-ray Spectra

Fitting Narrow Emission Lines in X-ray Spectra Outline Fitting Narrow Emission Lines in X-ray Spectra Taeyoung Park Department of Statistics, University of Pittsburgh October 11, 2007 Outline of Presentation Outline This talk has three components:

More information

Dirichlet Processes: Tutorial and Practical Course

Dirichlet Processes: Tutorial and Practical Course Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida

Bayesian Statistical Methods. Jeff Gill. Department of Political Science, University of Florida Bayesian Statistical Methods Jeff Gill Department of Political Science, University of Florida 234 Anderson Hall, PO Box 117325, Gainesville, FL 32611-7325 Voice: 352-392-0262x272, Fax: 352-392-8127, Email:

More information

Multimodal Nested Sampling

Multimodal Nested Sampling Multimodal Nested Sampling Farhan Feroz Astrophysics Group, Cavendish Lab, Cambridge Inverse Problems & Cosmology Most obvious example: standard CMB data analysis pipeline But many others: object detection,

More information

Distance dependent Chinese restaurant processes

Distance dependent Chinese restaurant processes David M. Blei Department of Computer Science, Princeton University 35 Olden St., Princeton, NJ 08540 Peter Frazier Department of Operations Research and Information Engineering, Cornell University 232

More information

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017 Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado CSCI 5822 Probabilistic Model of Human and Machine Learning Mike Mozer University of Colorado Topics Language modeling Hierarchical processes Pitman-Yor processes Based on work of Teh (2006), A hierarchical

More information

SAMPLING ALGORITHMS. In general. Inference in Bayesian models

SAMPLING ALGORITHMS. In general. Inference in Bayesian models SAMPLING ALGORITHMS SAMPLING ALGORITHMS In general A sampling algorithm is an algorithm that outputs samples x 1, x 2,... from a given distribution P or density p. Sampling algorithms can for example be

More information

Part 7: Hierarchical Modeling

Part 7: Hierarchical Modeling Part 7: Hierarchical Modeling!1 Nested data It is common for data to be nested: i.e., observations on subjects are organized by a hierarchy Such data are often called hierarchical or multilevel For example,

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Graphical Models and Kernel Methods

Graphical Models and Kernel Methods Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.

More information

Weakness of Beta priors (or conjugate priors in general) They can only represent a limited range of prior beliefs. For example... There are no bimodal beta distributions (except when the modes are at 0

More information

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1

Parameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1 Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data

More information

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Spatial Bayesian Nonparametrics for Natural Image Segmentation Spatial Bayesian Nonparametrics for Natural Image Segmentation Erik Sudderth Brown University Joint work with Michael Jordan University of California Soumya Ghosh Brown University Parsing Visual Scenes

More information

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017

Markov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017 Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Chapter 12 PAWL-Forced Simulated Tempering

Chapter 12 PAWL-Forced Simulated Tempering Chapter 12 PAWL-Forced Simulated Tempering Luke Bornn Abstract In this short note, we show how the parallel adaptive Wang Landau (PAWL) algorithm of Bornn et al. (J Comput Graph Stat, to appear) can be

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Bayesian Inference for DSGE Models. Lawrence J. Christiano

Bayesian Inference for DSGE Models. Lawrence J. Christiano Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 2) Fall 2017 1 / 19 Part 2: Markov chain Monte

More information

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES

INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES INFINITE MIXTURES OF MULTIVARIATE GAUSSIAN PROCESSES SHILIANG SUN Department of Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 20024, China E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Data Analysis and Uncertainty Part 2: Estimation

Data Analysis and Uncertainty Part 2: Estimation Data Analysis and Uncertainty Part 2: Estimation Instructor: Sargur N. University at Buffalo The State University of New York srihari@cedar.buffalo.edu 1 Topics in Estimation 1. Estimation 2. Desirable

More information

Bayesian Inference. Chapter 2: Conjugate models

Bayesian Inference. Chapter 2: Conjugate models Bayesian Inference Chapter 2: Conjugate models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

Chapter 5. Chapter 5 sections

Chapter 5. Chapter 5 sections 1 / 43 sections Discrete univariate distributions: 5.2 Bernoulli and Binomial distributions Just skim 5.3 Hypergeometric distributions 5.4 Poisson distributions Just skim 5.5 Negative Binomial distributions

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Bayesian finite mixtures with an unknown number of. components: the allocation sampler

Bayesian finite mixtures with an unknown number of. components: the allocation sampler Bayesian finite mixtures with an unknown number of components: the allocation sampler Agostino Nobile and Alastair Fearnside University of Glasgow, UK 2 nd June 2005 Abstract A new Markov chain Monte Carlo

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

DAG models and Markov Chain Monte Carlo methods a short overview

DAG models and Markov Chain Monte Carlo methods a short overview DAG models and Markov Chain Monte Carlo methods a short overview Søren Højsgaard Institute of Genetics and Biotechnology University of Aarhus August 18, 2008 Printed: August 18, 2008 File: DAGMC-Lecture.tex

More information

Bayesian Mixtures of Bernoulli Distributions

Bayesian Mixtures of Bernoulli Distributions Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information