Bayesian Nonparametric Modelling with the Dirichlet Process Regression Smoother

J. E. Griffin and M. F. J. Steel
University of Kent and University of Warwick

Abstract

In this paper we discuss implementing Bayesian fully nonparametric regression by defining a process prior on distributions which depend on covariates. We consider the problem of centring our process over a class of regression models and propose fully nonparametric regression models with flexible location structures. We also introduce a non-trivial extension of a dependent finite mixture model proposed by Chung and Dunson (2007) to a dependent infinite mixture model and propose a specific prior, the Dirichlet Process Regression Smoother, which allows us to control the smoothness of the process. Computational methods are developed for the models described. Results are presented for simulated and actual data examples.

Keywords: Nonlinear regression; Nonparametric regression; Model centring; Stick-breaking prior

1 Introduction

Nonparametric regression offers the ability to accurately model relationships between covariates and responses that are not well represented by simple parametric models. For example, we may have non-linearity in the mean, heteroscedasticity or changing numbers of modes. Here, we take a Bayesian approach. There is a substantial literature on estimation of a nonlinear mean function (see, e.g., Denison et al., 2002, for a review) with parametric errors. There has been some work on non-linear heteroscedasticity, most notably in Kohn et al., by modelling the standard deviation as a function of covariates with parametric errors.

(Jim Griffin acknowledges research support from The Nuffield Foundation grant NUF-NAL/78.)

This approach was extended to errors drawn from a single unknown distribution in Leslie et al. (2007). In this paper, we will develop fully nonparametric models that allow the whole distribution of the response to change with covariates. The challenge is to define models that make a minimal amount of assumptions but allow effective inference.

Assume that a response, y, observed at a covariate value x can be expressed as y = m(x) + σ(x)ε, where ε has expectation 0 and variance 1, and its distribution may depend on x. The parameters m(x) and σ²(x) are the mean and the variance of the response at covariate value x. In Bayesian inference there has been substantial work on fitting this model when ε follows a parametric distribution with constant variance σ²(x) = σ². Many priors for flexible estimation of the mean function have been proposed. Examples include Gaussian processes (Neal 1998) or basis function regression, such as a spline basis or a wavelet basis. Gaussian process priors have a long history in Bayesian nonparametric analysis, dating back to at least O'Hagan (1978), and have recently seen renewed interest in the machine learning community (e.g. Neal 1998). The basis function approach is described in Denison et al. (2002). Wahba (1978) gives a Bayesian interpretation of spline smoothing, which is further explored in Speckman and Sun (2003). Models that relax the assumption of constant σ²(x) are developed in Kohn et al., and Leslie et al. (2007) extend these models to nonparametric inference about the distribution of ε (which is assumed independent of x).

An alternative approach investigated in the literature is so-called density regression or density smoothing, which defines a nonparametric prior for an unknown distribution (random probability measure) that changes with the covariate value. Much recent work on Bayesian density estimation from a sample y_1, y_2, ..., y_n has been based around extending nonparametric mixture models, which are defined by

y_i ~ k(ψ_i), ψ_i ~ F, F ~ Π, (1)

where k(·) is a probability density function (pdf) parameterised by ψ and the prior Π places support across a wide range of distributions. Many choices for Π are possible, including Normalized Random Measures (James et al. 2005) and the Dirichlet Process (Ferguson 1973). Some other methods are reviewed by Müller and Quintana (2004). In this paper we will concentrate on the stick-breaking class of discrete distributions for which

F =_d ∑_{i=1}^∞ p_i δ_{θ_i}, (2)

where δ_θ denotes a Dirac measure at θ and p_i = V_i ∏_{j<i} (1 − V_j). The sequence V_1, V_2, V_3, ... is independent with V_i ~ Be(a_i, b_i), while θ_1, θ_2, θ_3, ... are i.i.d. from some distribution H.
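The stick-breaking construction in (1)-(2) is straightforward to simulate. The sketch below draws a truncated approximation to a single realisation with a_i = 1 and b_i = M (the Dirichlet process case) and a normal centring distribution H; the truncation level, the value of M and the choice of H are illustrative assumptions, not quantities fixed by the paper.

```python
import numpy as np

def stick_breaking_dp(M, H_sampler, n_atoms=1000, rng=None):
    """Truncated stick-breaking draw F = sum_i p_i * delta_{theta_i} with V_i ~ Be(1, M)."""
    rng = rng or np.random.default_rng(0)
    V = rng.beta(1.0, M, size=n_atoms)                           # stick-breaking fractions
    p = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))    # p_i = V_i * prod_{j<i} (1 - V_j)
    theta = H_sampler(n_atoms, rng)                              # i.i.d. atoms from the centring H
    return p, theta

# Example: H = N(0, 1); then sample y_i ~ k(psi_i) with k a N(psi, 0.1^2) kernel, as in (1)
p, theta = stick_breaking_dp(M=5.0, H_sampler=lambda n, rng: rng.normal(0.0, 1.0, n))
rng = np.random.default_rng(1)
psi = rng.choice(theta, size=200, p=p / p.sum())                 # psi_i ~ F (renormalised truncation)
y = rng.normal(psi, 0.1)
```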

The Dirichlet process occurs as a special case if a_i = 1 and b_i = M for all i. If we observe pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) and we wish to estimate the conditional distribution of y given x, then a natural extension of (1) and (2) above defines

y_i ~ k(ψ_i), ψ_i ~ F_{x_i}, F_x =_d ∑_{j=1}^∞ p_j(x) δ_{θ_j(x)}. (3)

Many previously defined processes fall within this class: if p_i(x) = p_i then we have the single-p DDP model (MacEachern, 2000), which has been applied to spatial problems (Gelfand et al., 2005) and quantile regression (Kottas and Krnjajic, 2005). Often it is assumed that θ_i(x) follows a Gaussian process. Such models are flexible but make it hard to control specific features of the modelling. Alternatively, we could assume that θ_i(x) is a given function described by θ_i(x) = K(x; φ_i), parameterised by φ_i, and that k(·) is the pdf of a Normal(0, σ²) distribution; an alternative nonparametric representation is then

y_i ~ N(K(x_i; φ_i), σ²), φ_i i.i.d. ~ F, F ~ Π.

An example in regression analysis is the ANOVA-DDP model of De Iorio et al. (2004), who model replicated data from a series of experiments using combinations of treatment and block levels. A standard parametric approach would be ANOVA. The function K(x; φ_i) mimics the form of the group means in an ANOVA analysis by taking a linear combination of treatment effects, block effects and interactions.

Other authors have considered a regression model for p_i(x). Dunson et al. (2007) define dependent measures by taking each measure to be an average of several unknown, latent distributions; allowing the weights to change with covariates allows the distributions to change. Griffin and Steel (2006) (hereafter denoted by GS), Dunson and Park (2008) and Reich and Fuentes (2007) exploit the stick-breaking construction of random measures. The construction of GS transforms an infinite sequence of independent beta random variables V_1, V_2, V_3, ... and an infinite sequence of i.i.d. random variables θ_1, θ_2, θ_3, ... into a distribution F using (2). GS then define dependent Dirichlet processes by permuting the sequence V_1, V_2, V_3, ... at different covariate values x. Dunson and Park (2008) and Reich and Fuentes (2007) use the stick-breaking construction on a finite sequence of random variables on [0,1]. Dependence can be simply introduced by allowing each element of this sequence to depend on x; for example, V_i(x) = K(x, φ_i) W_i, where 0 < K(x, φ_i) < 1 and W_i is beta distributed. This induces dependence in p_i(x). However, it is usually hard to link the correlation of V_i(x) to the correlation of p_i(x), which makes these models difficult to implement in general. Chung and Dunson (2007) define a specific example of this type of process, the Local Dirichlet process, which defines K(x, φ_i) through a ball of radius φ_i around x. In this case, it is simple to relate the correlation in p_i(x) to the properties of the kernel and the choice of the prior for φ_i. One purpose of this paper is to extend this class of models to the nonparametric case where we have an infinite number of atoms.

The methods developed in GS lead to fairly tractable prior distributions which are centred over a single distribution. A second aim of the present paper is to develop hierarchical extensions to exploit the natural idea of centring over a non-trivial model (in other words, allowing the centring distribution to depend on the covariates). Thus, we allow for two sources of dependence on covariates: dependence of the random probability measure on the covariates and dependence of the centring distribution on the same (or other) covariates. Besides extra flexibility, this also provides a framework for assessing the adequacy of commonly used parametric models, by centring over these models. We will also propose a new density regression prior which allows us to control its smoothness, along with a simple Markov chain Monte Carlo (MCMC) algorithm for inference.

The paper is organised in the following way: Section 2 introduces the idea of centring a dependent nonparametric prior over a parametric model and introduces centred models using infinite mixtures of normal and of uniform distributions. The models indicate the suitability of the parametric model for the observed data. Section 3 introduces a new method for constructing Bayesian nonparametric regression estimators, discussing a particular class in detail: the Dirichlet Process Regression Smoother (DPRS). Section 4 briefly discusses computational methods for these hierarchical models defined using the DPRS, with more details of the implementation in Appendix B. Section 5 illustrates the use of these models in inference problems, and a final section concludes. Without specific mention in the text, proofs are grouped in Appendix A.

2 Centring nonparametric models over parametric models

Nonparametric priors for an unknown distribution are usually centred over a parametric distribution. This means that the prior expectation of the mass that the unknown distribution places on any set is given by a parametric distribution. It will be useful to extend this idea to dependent nonparametric priors and think explicitly about how we can implement centring over parametric regression models. The nonparametric prior will model aspects of the conditional distribution that are not well captured by our parametric centring model and, by centring, we can use prior information elicited for the parametric model directly. If the parameters controlling the departures of the fitted distribution from the centring model are given priors, then we can assess how well our parametric model describes the data. The covariates used in the parametric and nonparametric parts of the model are not required to be the same, but x will generally denote the union of all covariates.

Definition 1 A nonparametric model is centred over a parametric model, with parameters θ, if the prior predictive distribution of the nonparametric model conditional on θ coincides with the parametric model for all covariate values.

In this paper we will concentrate on centring over the standard regression model with normally distributed errors

y_i = g(x_i) + ε_i, ε_i ~ N(0, σ²),

where g(x) is a parametric regression function. Centring then refers to defining a nonparametric prior for the distribution of ε_i whose predictive distribution is a zero mean, normal distribution. An infinite mixture model can be naturally extended to depend on covariates through

ε_i ~ k(ε_i | ψ_i), ψ_i ~ F_{x_i}, F_x ~ Π(H, φ), (4)

where Π(H, φ) is the class of dependent discrete random probability measures for which

F_x =_d ∑_{i=1}^∞ p_i(x) δ_{θ_i},

where the p_i(x) are weights such that ∑_{i=1}^∞ p_i(x) = 1, θ_1, θ_2, θ_3, ... are i.i.d. H and φ are parameters governing the behaviour of p_i(x). This class is relatively tractable and it has formed the basis of several density regression priors. We restrict attention to Dirichlet process-based models since these methods dominate in the literature and our approach follows these ideas naturally. We have assumed that the position θ_i does not depend on x. A similar idea was implemented by GS using the πDDP prior. We consider two possible classes: one which defines k to be normal and a second which takes k to be a uniform distribution.

A natural centred nonparametric model assumes that the regression errors ε_i can be modelled by

ε_i ~ N(µ_i, aσ²), µ_i ~ F_{x_i}, F_x ~ Π(H, φ), H = N(0, (1 − a)σ²),

where 0 < a < 1. This model will be denoted here as Model 1(a) and is centred over y_i ~ N(g(x_i), σ²) with the same prior on σ². Note that dependence on the covariates x enters in two different ways: it is used to define ε_i as deviations of y_i from g(x_i), and the nonparametric distribution of the means µ_i depends on x_i through the dependent discrete random probability measure defined in (4). For all values of x the model reduces to the standard Dirichlet process mixture of normals model. The parameterisation of the model is discussed by Griffin (2006), who pays particular attention to prior choice. Many distributional features, such as multimodality, are more easily controlled by a rather than M. Small values of a suggest that the nonparametric modelling is crucial. A uniform prior distribution on a supports a wide range of departures from a normal distribution.
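A quick way to see the centring property of Model 1(a) is to integrate the mixture over the centring distribution H: if µ ~ N(0, (1 − a)σ²) and ε | µ ~ N(µ, aσ²) then, marginally, ε ~ N(0, σ²) whatever the value of a. The Monte Carlo check below is only a sketch; the values of a and σ are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(42)
a, sigma = 0.3, 1.5
n = 200_000

# Prior predictive of Model 1(a) at a single covariate value:
# mu ~ H = N(0, (1 - a) * sigma^2), then eps | mu ~ N(mu, a * sigma^2).
mu = rng.normal(0.0, np.sqrt((1 - a) * sigma**2), size=n)
eps = rng.normal(mu, np.sqrt(a * sigma**2))

# The marginal should match the centring model N(0, sigma^2).
print(eps.mean(), eps.std())   # approximately 0 and sigma = 1.5
```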

This type of model can capture non-linear relationships between errors and regressors through the changing weights p_i(x). However, the k-th moment of the distribution F_x, µ_x^{(k)} = ∑_{i=1}^∞ p_i(x) θ_i^k, has correlation across covariate values given by Corr(µ_x^{(k)}, µ_y^{(k)}) = (M + 1) ∑_{i=1}^∞ E[p_i(x) p_i(y)], which does not depend on k, and so the correlations of all moments of the distributions at two covariate levels are the same. This seems unsatisfactory for many applications where we would imagine that the mean is much less correlated than the variance (an assumption that underlies much regression modelling where the error variance is considered constant). This relatively crude correlation structure can lead to undersmoothing of the posterior estimates of the distribution. In particular, the posterior mean E[y | x] will often have the step form typical of piecewise constant models. These problems can be addressed by defining the mean of ε_i using flexible regression ideas such as Gaussian process priors (O'Hagan 1978, Rasmussen and Williams 2006) or basis regression models (Denison et al. 2002). If the mean of the errors has a highly non-linear relationship to the regressor then this relationship can be more accurately and parsimoniously modelled using these ideas. This is denoted by Model 1(b) and written as

ε_i − m(x_i) ~ N(µ_i, aσ_1²), µ_i ~ F_{x_i}, F_x ~ Π(H, φ), H = N(0, (1 − a)σ_1²),

where σ² = σ_1² + σ_2² and m(x) follows a Gaussian process prior: m(x_1), ..., m(x_n) are jointly normally distributed with constant mean and the covariance of m(x_i) and m(x_j) is σ_2² ρ(x_i, x_j), with ρ(x_i, x_j) a proper correlation function. A popular choice of correlation function is the flexible Matérn class (see e.g. Stein, 1999), for which

ρ(x_i, x_j) = (ζ ||x_i − x_j||)^ν K_ν(ζ ||x_i − x_j||) / (2^{ν−1} Γ(ν)),

where K_ν is the modified Bessel function of order ν. The process is q times mean square differentiable if and only if q < ν, and ζ acts as a range parameter. We define b = σ_2²/σ², which can be interpreted as the amount of residual variability explained by the nonparametric Gaussian process estimate of m(x). If we consider the prior predictive with respect to F_x we obtain the centring model y_i ~ N(g(x_i) + m(x_i), σ_1²), whereas if we integrate out both F_x and m(x) with their priors we obtain y_i ~ N(g(x_i), σ²). The model allows the first moment of ε_i to have a different correlation structure from all higher moments. The approach is not restricted to the Gaussian process, and a wavelet or spline basis could be introduced instead.

Models 1(a) and 1(b) illustrate an important advantage of centring over a model: it provides a natural way to distinguish between the parametric dependence on covariates, captured by g(x), and the nonparametric dependence, modelled through F_x and m(x). Thus, by choosing g(x) appropriately, we may find that the nonparametric modelling is less critical. This will be detected by a large value of a and a small value of b, and will allow us to use the model to evaluate interesting parametric specifications.
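For concreteness, the Matérn correlation above can be evaluated directly; the sketch below uses SciPy's modified Bessel function K_ν. The particular values of ν and ζ are illustrative only.

```python
import numpy as np
from scipy.special import kv, gamma

def matern_corr(d, nu, zeta):
    """Matern correlation rho(d) = (zeta*d)^nu * K_nu(zeta*d) / (2^(nu-1) * Gamma(nu))."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    rho = np.ones_like(d)                       # correlation equals 1 at distance 0
    pos = d > 0
    u = zeta * d[pos]
    rho[pos] = (u**nu) * kv(nu, u) / (2**(nu - 1) * gamma(nu))
    return rho

print(matern_corr([0.0, 0.5, 1.0, 2.0], nu=1.5, zeta=2.0))
```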

It is important to note that if we take a linear parametric regression function, say g(x) = x'γ, then the interpretation of γ is non-standard in this model, since E[y_i | F_{x_i}, γ, x_i] is merely distributed around g(x_i) and P(E[y_i | F_{x_i}, γ, x_i] = g(x_i)) = 0 if y_i is a continuous random variable and H is absolutely continuous, which occurs in a large proportion of potential applications. Thus, γ will not have the standard interpretation as the effect of the covariates on the mean of the response. The predictive mean, i.e. E[y_i | γ, x_i], still equals g(x_i), however. The prior uncertainty about this predictive mean will increase as our confidence in the centring model (usually represented by one of the parameters in φ) decreases. One solution to this identifiability problem is to follow Kottas and Gelfand (2001), who fix the median of ε_i, which is often a natural measure of centrality in nonparametric applications, to be 0. If we assume that the error distribution is symmetric and unimodal, then median and mean regression will coincide (if the mean exists). An alternative, wider, class of error distributions, introduced by Kottas and Gelfand (2001) for regression problems, is the class of unimodal densities with median zero (extensions to quantile regression are discussed in Kottas and Krnjajic 2005). Incorporating our dependent discrete random probability measure structure into this model defines Model 2:

ε_i − m(x_i) ~ U(−σ√u_i, σ√u_i), u_i ~ F_{x_i}, F_x ~ Π(H, φ),

which leads to symmetric error distributions, with U(c, d) denoting a Uniform distribution on (c, d). For H we choose a Gamma distribution with shape parameter 3/2 and scale 1/2, corresponding to a normal centring distribution. This model is centred exactly as Model 1(b).
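The Gamma(3/2, 1/2) choice for H is what makes Model 2's centring normal: mixing U(−σ√u, σ√u) kernels over u ~ Ga(3/2, 1/2) (shape/rate parameterisation) returns exactly a N(0, σ²) distribution. A small Monte Carlo sketch of this identity, with an arbitrary illustrative σ:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 500_000

# u ~ Gamma(shape=3/2, rate=1/2); numpy's gamma takes shape and *scale* = 1/rate
u = rng.gamma(shape=1.5, scale=2.0, size=n)

# eps | u ~ Uniform(-sigma*sqrt(u), sigma*sqrt(u)), the scale mixture of uniforms in Model 2
eps = rng.uniform(-sigma * np.sqrt(u), sigma * np.sqrt(u))

# The marginal of eps should be N(0, sigma^2): compare a couple of moments
print(eps.std(), sigma)                                      # ~2.0 vs 2.0
print(np.mean(np.abs(eps)), sigma * np.sqrt(2 / np.pi))      # mean absolute value of a normal
```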

3 A Bayesian Density Smoother

This section develops a measure-valued stochastic process that can be used as a prior for Bayesian nonparametric inference when we want to infer distributions {F_x}_{x ∈ X}, where X is the space of covariates. It will be stationary in the sense that the marginal distribution F_x follows a stick-breaking process with the same parameters at any x ∈ X. The model is developed in the spirit of GS and shares similarities with their order-based dependent Dirichlet process (πDDP) with an arrivals construction. The latter construction defines a stochastic process that is not reversible, and we will define a new process that is reversible in one dimension and easily extends to an isotropic process in higher dimensions, which makes it suitable for our regression purpose. The process is also a non-trivial generalisation of the independently proposed Local Dirichlet Process (Chung and Dunson, 2007) from finite mixture models to infinite mixture models.

The πDDP of GS defines F_x by making the V_i's enter (2) in a different order at each covariate value, so that

F_x = ∑_i p_i(x) δ_{θ_i}, where p_i(x) = V_i ∏_{j: π_j(x) < π_i(x)} (1 − V_j),

and π_1(x), π_2(x), π_3(x), ... is a random permutation of a subset of 1, 2, 3, .... Similar random permutations and choices of subsets will induce similar distributions. By allowing a permutation of a subset rather than a permutation of all natural numbers, we can allow atoms to appear in the distribution at only some points x ∈ X. We can define orderings by allocating a latent parameter τ_i to the i-th atom (V_i, θ_i) and ordering at x according to some distance between x and τ_i. The arrivals construction of GS takes time as a covariate and defines π_i(x) = #{j | τ_i ≤ τ_j < x}, i.e. the ordering is backwards in τ from the largest τ_j smaller than x. If we denote the region in which points are relevant for the ordering at x by U(x), then the arrivals process assumes that U(x) ⊂ U(y) if y > x and that new atoms are placed at the start of the ordering. Clearly, the process is not reversible in time. Intuitively, if the process allows atoms to appear in the ordering then, to define a reversible process, we need atoms to also disappear. A simple method that achieves this effect is to restrict U(x) to be an interval; the arrivals ordering then arises as the special case in which the left-hand end of the interval is minus infinity for all points. If the interval is symmetric around a centre then the process will be reversible by construction. In what follows, we will focus on processes with Dirichlet marginals.

Definition 2 Let {S_j}_{j=1}^∞ be an infinite collection of random subsets of X with associated marks (V_j, θ_j) which are i.i.d. realisations of Be(1, M) and H respectively. We define

F_x = ∑_{i: x ∈ S_i} δ_{θ_i} V_i ∏_{j<i: x ∈ S_j} (1 − V_j),

and then {F_x | x ∈ X} follows a Subset-based Dependent Dirichlet Process, which is represented as S-DDP(M, H, Π), where Π are parameters that determine how to generate the infinite collection of random subsets.

This defines the idea for general spaces. Any atom θ_j only appears in the stick-breaking representation of F_x at points x which belong to the subset S_j of X, and this allows atoms to appear and disappear. In particular, if X ⊂ R^p then we can define S_j to be a closed, convex set of appropriate dimension. However, the construction can be naturally applied to other spaces using a suitable metric. The dependence between the distributions at two locations s and v can be measured using the correlation of F_s(B) and F_v(B) for any measurable set B, which can be represented as follows.

Theorem 1

Corr(F_s, F_v) = 2/(M + 2) E[ ∑_{i=1}^∞ B_i (M/(M + 2))^{∑_{j<i} B_j} (M/(M + 1))^{∑_{j<i} (A_j − B_j)} ], (5)

where A_i = I(s ∈ S_i or v ∈ S_i) and B_i = I(s ∈ S_i and v ∈ S_i).
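The stick-breaking in Definition 2 only multiplies in the sticks of atoms whose subsets contain x. The helper below computes the weights p_i(x) at a single location from the subset-membership indicators; it is a sketch of the construction only (truncated at a finite number of atoms, with arbitrary example inputs).

```python
import numpy as np

def sddp_weights(member, V):
    """Weights p_i(x) = V_i * prod_{j<i, x in S_j} (1 - V_j) for atoms with x in S_i, else 0."""
    member = np.asarray(member, dtype=bool)        # member[i] = I(x in S_i)
    V = np.asarray(V, dtype=float)
    # Remaining stick after the earlier atoms whose subsets contain x
    stick = np.cumprod(np.where(member, 1.0 - V, 1.0))
    prev_stick = np.concatenate(([1.0], stick[:-1]))
    return np.where(member, V * prev_stick, 0.0)

rng = np.random.default_rng(3)
M, n_atoms = 2.0, 2000
V = rng.beta(1.0, M, n_atoms)
member = rng.random(n_atoms) < 0.5                 # toy membership pattern for one location x
p = sddp_weights(member, V)
print(p.sum())   # close to 1: in the full construction infinitely many subsets contain x
```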

If s, v ∈ S_i for all i then the correlation will be 1. If s and v do not both fall in any S_i then the correlation will be 0. Consider a situation where the probability of observing a shared element in the relevant subsequence is constant given two covariate values s and v and equals, say, p_{s,v}. Then we have the following result.

Theorem 2 The correlation between F_s and F_v can be expressed as

Corr(F_s, F_v) = 2(M + 1) p_{s,v} / (2 + M(1 + p_{s,v})),

where, for any k, p_{s,v} = P(s, v ∈ S_k | s ∈ S_k or v ∈ S_k).

This correlation is increasing both in p_{s,v} and in M, the mass parameter of the Dirichlet process at each covariate value. As p_{s,v} tends to the limits of zero and one, the correlation does the same, irrespective of M. As M tends to zero, the Sethuraman representation in (2) is totally dominated by the first element, and thus the correlation tends to p_{s,v}. Finally, as M → ∞ (the Dirichlet process tends to the centring distribution) the correlation tends to 2p_{s,v}/(1 + p_{s,v}), as other elements further down the ordering can also contribute to the correlation. Thus, the correlation is always larger than p_{s,v} if the latter is smaller than one. Note that the correlation between distributions at different values of x will not tend to unity as M tends to infinity, in contrast to the πDDP constructions proposed in GS. This is a consequence of the construction: some points will not be shared by the orderings at s and v, no matter how large M is. The correlation between drawings from F_s and F_v, given by Corr(F_s, F_v)/(M + 1) (see GS), will, however, tend to zero as M → ∞.

To make the result more applicable in regression, we now define a specific, simple method for choosing the subsets in p-dimensional Euclidean space using a ball of random radius r.

Definition 3 Let (C, r, t) be a Poisson process with intensity f(r) defined on R^p × R^+ × R^+ with associated marks (V_j, θ_j) which are i.i.d. realisations of Be(1, M) and H respectively. We define

F_x = ∑_{i: x ∈ B_{r_i}(C_i)} δ_{θ_i} V_i ∏_{j: x ∈ B_{r_j}(C_j), t_j < t_i} (1 − V_j),

where B_r(C) is a ball of radius r around C. Then {F_x | x ∈ X} follows a Ball-based Dependent Dirichlet Process, which is represented as B-DDP(M, H, f), where f is a non-negative function on R^+ (the positive half-line).

The intensity f(r) can be any positive function; however, we will usually take f(r) to be a probability density function. The following argument shows that defining f(r) more generally does not add to the flexibility of the model.
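Theorem 2 gives a closed form linking the subset-sharing probability p_{s,v} to the correlation between the random measures. The snippet below simply evaluates that expression and its two limiting cases, as reconstructed above.

```python
import numpy as np

def corr_bddp(p_sv, M):
    """Theorem 2: Corr(F_s, F_v) = 2(M+1) p / (2 + M(1 + p))."""
    p_sv = np.asarray(p_sv, dtype=float)
    return 2.0 * (M + 1.0) * p_sv / (2.0 + M * (1.0 + p_sv))

p = 0.6
print(corr_bddp(p, M=1e-8))        # ~ p        (M -> 0 limit)
print(corr_bddp(p, M=1e8))         # ~ 2p/(1+p) (M -> infinity limit)
print(corr_bddp(p, M=2.0))         # in between, and always >= p for p < 1
```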

Suppose that (C, r, t) follows the Poisson process above; then writing C_i' = C_i, r_i' = r_i and t_i' = t_i/λ for λ > 0 defines a coupled Poisson process (C', r', t') with intensity λf(r). The ordering of t and t' is the same for the two coupled processes, and the B-DDP only depends on t through its ordering. It follows that the distributions {F_x}_{x ∈ X} defined using (C, r, t) and (C', r', t') will be the same. The definition could easily be extended to ellipsoids around a central point, which would allow the process to exhibit anisotropy.

It is necessary to add the latent variable t for F_x to be a nonparametric prior. The set T(x) = {i | ||x − C_i|| < r_i} indexes the atoms that appear in the stick-breaking representation of F_x. If we were instead to define a Poisson process (C, r) on R^p × R^+ with intensity f(r), then #T(x) would be Poisson distributed with mean ∫ ν(B_r(x)) f(r) dr. This integral can be infinite, but that would have strong implications for the correlation structure. By including t we make T(x) infinite for all choices of f and therefore define a nonparametric process.

To calculate the correlation function, and to relate its properties to the parameters of the distribution of r, it is helpful to consider p_{s,v}. This probability only depends on those centres from the set {C_k | s ∈ S_k or v ∈ S_k} = {C_k | C_k ∈ B_{r_k}(s) ∪ B_{r_k}(v)}.

Theorem 3 If {F_x}_{x ∈ X} follows a B-DDP then

p_{s,v} = ∫ ν(B_r(s) ∩ B_r(v)) f(r) dr / ∫ ν(B_r(s) ∪ B_r(v)) f(r) dr,

where ν(·) denotes the Lebesgue measure in the covariate space X.

So far, our results are valid for a covariate space of any dimension. However, in the sequel we will focus particularly on implementations with a covariate that takes values on the real line. In this case, Theorem 3 leads to a simple expression.

Corollary 1 If {F_x}_{x ∈ X} follows a B-DDP on R then

p_{s,s+u} = (2µ_u − uI) / (4µ − 2µ_u + uI),

where µ = E[r], I = ∫_{u/2}^∞ f(r) dr and µ_u = ∫_{u/2}^∞ r f(r) dr, provided µ exists.

Throughout, we will assume the existence of a nonzero mean for r and define different correlation structures through the choice of f(r). We will focus on two properties of the autocorrelation. The first property is the range, say x_0, which we define as the distance at which the autocorrelation function takes the value ε; this implies that

p_{s,s+x_0} = ε(M + 2) / ((M + 2) + M(1 − ε)).

The second property is mean square differentiability, which is related to the smoothness of the process. In particular, a weakly stationary process on the real line is mean square differentiable of order q if and only if the 2q-th derivative of the autocovariance function evaluated at zero exists and is finite (see for example Stein, 1999).
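Corollary 1 makes the autocorrelation easy to evaluate numerically for a given radius density f(r). The sketch below does this for a Gamma-distributed radius, combining the corollary with the Theorem 2 expression reconstructed earlier; the shape, rate and mass values are illustrative.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

def p_su(u, alpha, beta):
    """Corollary 1 for a Gamma(alpha, rate=beta) radius."""
    rv = gamma_dist(a=alpha, scale=1.0 / beta)
    mu = rv.mean()
    I = rv.sf(u / 2.0)                                   # integral of f(r) over (u/2, infinity)
    # integral of r f(r) over (u/2, inf) = (alpha/beta) * P(Gamma(alpha+1, beta) > u/2)
    mu_u = (alpha / beta) * gamma_dist(a=alpha + 1.0, scale=1.0 / beta).sf(u / 2.0)
    return (2.0 * mu_u - u * I) / (4.0 * mu - 2.0 * mu_u + u * I)

def autocorr(u, alpha, beta, M):
    p = p_su(u, alpha, beta)
    return 2.0 * (M + 1.0) * p / (2.0 + M * (1.0 + p))   # Theorem 2

for u in [0.0, 0.5, 1.0, 2.0]:
    print(u, autocorr(u, alpha=2.0, beta=1.0, M=1.0))
```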

In the case of a Gamma-distributed radius we can derive the following result.

Theorem 4 If {F_x}_{x ∈ X} follows a B-DDP with f(r) = β^α/Γ(α) r^{α−1} exp{−βr}, then F_x is mean square differentiable of order q = 1, 2, ... if and only if α ≥ 2q − 1.

If each radius follows a Gamma distribution then we can choose the shape parameter, α, to control smoothness and the scale parameter, β, to define the range, x_0. A closed form inverse relationship will not be available analytically in general. However, if we choose α = 1, which gives an exponential distribution, then

β = (2/x_0) log((1 + M + ε)/(ε(M + 2))). (6)

[Figure 1: The autocorrelation function for a Gamma distance distribution with range 5, for three values of the shape parameter α (dashed, solid and dotted lines) and three values of M (panels).]

Figure 1 shows the form of the autocorrelation for various smoothness parameters and a range fixed to 5 (with ε = 0.5). Clearly, the mass parameter M, which is critical for the variability of the process, does not have much impact on the shape of the autocorrelation function once the range is fixed. We will concentrate on the Gamma implementation and work with the following class.

Definition 4 Let (C, r, t) be a Poisson process with intensity β^α/Γ(α) r^{α−1} exp{−βr} defined on R^p × R^+ × R^+ with associated marks (V_i, θ_i) which are i.i.d. realisations of Be(1, M) and H. We define

F_x = ∑_{i: x ∈ B_{r_i}(C_i)} V_i ∏_{j: x ∈ B_{r_j}(C_j), t_j < t_i} (1 − V_j) δ_{θ_i},

and then {F_x | x ∈ X} follows a Dirichlet Process Regression Smoother (DPRS), which is represented as DPRS(M, H, α, β).

Typically, we fix α and x_0 and put appropriate priors on M and any other parameters in the model. We use a prior distribution for M which can be elicited by choosing a typical value n_0 for M and a variance parameter η. This prior has density function

p(M) = (n_0^η Γ(2η) / Γ(η)²) M^{η−1} (M + n_0)^{−2η},

and is discussed in more detail in GS.
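Equation (6) gives the rate β of the exponential radius distribution that produces a desired range x_0, i.e. the distance at which the autocorrelation of the process falls to ε. A direct transcription of the formula as reconstructed above; the values of x_0, M and ε are chosen only for illustration.

```python
import numpy as np

def dprs_beta_from_range(x0, M, eps):
    """Equation (6): exponential radius (alpha = 1) giving autocorrelation eps at distance x0."""
    return (2.0 / x0) * np.log((1.0 + M + eps) / (eps * (M + 2.0)))

beta = dprs_beta_from_range(x0=5.0, M=1.0, eps=0.5)
print(beta)
# Sanity check against the earlier autocorr() sketch: autocorr(5.0, alpha=1.0, beta=beta, M=1.0) ~ 0.5
```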

4 Computational method

This section discusses how the nonparametric hierarchical models discussed in Section 2 with a DPRS prior can be fitted to data using a retrospective sampler. Retrospective sampling for Dirichlet process-based hierarchical models was introduced by Papaspiliopoulos and Roberts (2008). Previous attempts to use the stick-breaking representation of the Dirichlet process in MCMC samplers have used truncation (e.g. Muliere and Tardella 1998, Ishwaran and James 2001 for the Dirichlet process, and GS for order-based dependent Dirichlet processes). The retrospective sampler avoids the need to truncate. The method produces a sample from the posterior distribution of all parameters except the unknown distribution; inference about the unknown distribution must use some truncation method. This makes the method comparable to Polya urn based methods, which are reviewed by Neal (2000). Retrospective methods have been used for density regression models by Dunson and Park (2008).

We assume that data (x_1, y_1), ..., (x_n, y_n) have been observed which are hierarchically modelled by

y_i ~ p(y_i | ψ_i), ψ_i | x_i ~ F_{x_i}, F_x ~ DPRS(M, H, α, β).

The DPRS assumes that F_x = ∑_{j=1}^∞ p_j(x) δ_{θ_j}, where θ_1, θ_2, ... are i.i.d. H and the weights p_j(x) are constructed according to the definition of the DPRS. Additional parameters can be added to the sampling kernel or the centring distribution, and these are updated in the standard way for Dirichlet process mixture models. MCMC is more easily implemented by introducing latent variables s_1, s_2, ..., s_n and re-expressing the model as

y_i ~ p(y_i | θ_{s_i}), P(s_i = j) = p_j(x_i), θ_1, θ_2, ... i.i.d. H.

The latent variables s = (s_1, ..., s_n) are often called allocations, since s_i assigns the i-th observation to one of the distinct elements of F_{x_i} (i.e. ψ_i = θ_{s_i}). Defining y = (y_1, ..., y_n), θ = (θ_1, θ_2, ...) and V = (V_1, V_2, ...), the posterior distribution of the parameters is

p(θ, s, t, C, r, V | y) ∝ p(y | θ, s) ∏_{i=1}^n p(s_i | x_i, C, t, r, V) p(V) p(θ) p(C, r, t).

The probability p(s_i = j | x_i, C, t, r, V) is only non-zero if x_i ∈ B_{r_j}(C_j). Let (C^A, r^A, t^A) be the Poisson process (C, r, t) restricted to the set A and let (θ_i^A, V_i^A) be the mark associated with (C_i^A, r_i^A, t_i^A); these marks are collected in the sets θ^A and V^A respectively. Defining the set R = {(C, r, t) | x_i ∈ B_r(C) for some i = 1, ..., n}, with complement R^C, we obtain

p(θ, s, t, C, r, V | y) ∝ p(y | θ^R, s) ∏_{i=1}^n p(s_i | x_i, C^R, t^R, r^R, V^R) p(V^R) p(θ^R) p(C^R, r^R, t^R) p(V^{R^C}) p(θ^{R^C}) p(C^{R^C}, r^{R^C}, t^{R^C}).

This factorisation follows from the independence of Poisson processes on disjoint sets. Therefore we can draw inference using the following restricted posterior distribution:

p(θ^R, s, t^R, C^R, r^R, V^R | y) ∝ p(y | θ^R, s) ∏_{i=1}^n p(s_i | x_i, C^R, t^R, r^R, V^R) p(V^R) p(θ^R) p(C^R, r^R, t^R).

We define a retrospective sampler for this restricted posterior distribution. A method of simulating (C^R, r^R, t^R) that will be useful for our retrospective sampler is: 1) initialise t_1 ~ Ex(∫_R f(r) dC dr) and 2) set t_i = t_{i−1} + x_i, where x_i ~ Ex(∫_R f(r) dC dr). Then (C_1^R, r_1^R), (C_2^R, r_2^R), ... are independent of t_1^R, t_2^R, ... and we take i.i.d. draws from the distribution

p(C_k^R | r_k^R) = U(∪_{i=1}^n B_{r_k^R}(x_i)), p(r_k^R) = ν(∪_{i=1}^n B_{r_k^R}(x_i)) f(r_k^R) / ∫ ν(∪_{i=1}^n B_u(x_i)) f(u) du,

where U(A) represents the pdf of the uniform distribution on the set A and ν(·) the appropriate Lebesgue measure. It is often hard to simulate from this conditional distribution of C_k^R and to calculate the normalising constant of the distribution of r_k^R. It will usually be simpler to use a rejection sampler from the joint distribution of (C, r) conditioned to fall in a simpler set containing R. For example, in one dimension we define d(r_k) = (x_min − r_k, x_max + r_k), where x_min and x_max are the minimum and maximum values of x_1, ..., x_n, and the rejection envelope is

g(C_k^R | r_k^R) = U(d(r_k^R)), g(r_k^R) = (x_max − x_min + 2r_k^R) f(r_k^R) / ∫ (x_max − x_min + 2u) f(u) du.

Any values (C_k^R, r_k^R) are rejected if they do not fall in R. If we use a DPRS where r_k follows a Gamma(α, β) distribution, then we sample r_k^R from the rejection envelope using the mixture distribution

g(r_k^R) = w f_Ga(α, β) + (1 − w) f_Ga(α + 1, β),

where f_Ga(α, β) is the pdf of a Ga(α, β) distribution and w = (x_max − x_min)/(x_max − x_min + 2α/β).

This construction generates the Poisson process underlying the B-DDP ordered in t, and we will use it to retrospectively sample the Poisson process in t (we can think of the DPRS as defined by a marked Poisson process where t^R follows a Poisson process and (C^R, r^R) are the marks). In fact, the posterior distribution only depends on t_1^R, t_2^R, t_3^R, ... through their ordering, and we can simply extend the ideas of Papaspiliopoulos and Roberts (2008) to update the allocations s. The MCMC sampler defined on the posterior distribution parameterised by r^R can have problems mixing.
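The rejection step just described is easy to prototype. The sketch below draws (C, r) pairs from the one-dimensional envelope (r from the two-component Gamma mixture, C uniform on (x_min − r, x_max + r)) and rejects any pair whose ball covers none of the observed covariate values. The function name and the test data are illustrative, not part of the paper's implementation.

```python
import numpy as np

def sample_envelope_pair(x, alpha, beta, rng):
    """One draw of (C, r) from the rejection envelope for a Gamma(alpha, rate=beta) radius."""
    x_min, x_max = x.min(), x.max()
    w = (x_max - x_min) / (x_max - x_min + 2.0 * alpha / beta)
    while True:
        # radius from the mixture w*Ga(alpha, beta) + (1-w)*Ga(alpha+1, beta)
        shape = alpha if rng.random() < w else alpha + 1.0
        r = rng.gamma(shape, 1.0 / beta)
        C = rng.uniform(x_min - r, x_max + r)          # centre uniform on the enlarged interval
        if np.any(np.abs(x - C) < r):                  # accept only if the ball covers some x_i
            return C, r

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, size=50)                     # toy covariate values
pairs = [sample_envelope_pair(x, alpha=1.0, beta=2.0, rng=rng) for _ in range(5)]
print(pairs)
```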

The sampler can have much better mixing properties by using a reparameterisation from r^R to r̃^R, where we let d_il = |x_i − C_l| and define r̃_k^R = r_k^R − max{d_ik | s_i = k}. Conditioning on r̃^R = (r̃_1^R, r̃_2^R, ...) rather than r^R = (r_1^R, r_2^R, ...) allows each observation to be re-allocated to a distinct element. Initially, we condition on s_{−i} = (s_1, ..., s_{i−1}, s_{i+1}, ..., s_n) and remove s_i from the allocation. Let K_{−i} = max{s_{−i}}, and let

r_k^{(1)} = r̃_k^R + max{d_jk | s_j = k, j = 1, ..., i−1, i+1, ..., n},
r_k^{(2)} = r̃_k^R + max({d_jk | s_j = k, j = 1, ..., i−1, i+1, ..., n} ∪ {|x_i − C_k|}).

The proposal distribution is

q(s_i = k) = c^{−1} p(y_i | θ_k) V_k (1 − V_k)^{A_k} ∏_{l<k} (1 − V_l) f(r_k^{(2)})/f(r_k^{(1)}), for k ≤ K_{−i},
q(s_i = k) = c^{−1} max_{m ≤ K_{−i}} {p(y_i | θ_m)} V_k ∏_{l<k} (1 − V_l), for k > K_{−i},

where A_k = #{m | r_k^{(1)} < d_{m s_m} < r_k^{(2)}, s_m > k} and

c = ∑_{l=1}^{K_{−i}} p(y_i | θ_l) V_l (1 − V_l)^{A_l} ∏_{h<l} (1 − V_h) f(r_l^{(2)})/f(r_l^{(1)}) + max_{l ≤ K_{−i}} {p(y_i | θ_l)} ∏_{h ≤ K_{−i}} (1 − V_h).

If k > K_{−i} we need to generate (θ_{K_{−i}+1}, V_{K_{−i}+1}, C_{K_{−i}+1}, d_{K_{−i}+1}), ..., (θ_k, V_k, C_k, d_k) independently from their prior distribution. A value is generated from this discrete distribution using the standard inversion method (i.e. simulate a uniform random variate U and the proposed value k is such that ∑_{l=1}^{k−1} q(s_i = l) < U ≤ ∑_{l=1}^k q(s_i = l)). Papaspiliopoulos and Roberts (2008) show that the acceptance probability of the proposed value is

α = 1 if k ≤ K_{−i}, and α = min{1, p(y_i | θ_k)/max_{l ≤ K_{−i}} p(y_i | θ_l)} if k > K_{−i}.

The other full conditional distributions of the Gibbs sampler are given in Appendix B.

5 Examples

This section applies the models developed in Section 2, in combination with the DPRS of Section 3, to simulated data and two real datasets: the prestige data (Fox and Suschnigg, 1989) and the electricity data of Yatchew (2003). As a basic model, we take Model 1(a) with regression function g(x) = 0; this model tries to capture the dependence on x exclusively through the Dirichlet process smoother. Model 1(b) is the more sophisticated version of Model 1, where m(x) is modelled through a Gaussian process, as explained in Section 2. Finally, we will also use Model 2, which will always have a Gaussian process specification for m(x). The prior on M is as explained in Section 3, with n_0 = 3 and a common value of η for all examples. On the parameter a in Model 1 we adopt a Uniform prior over (0,1).

The range x_0 of the DPRS is chosen such that the correlation is 0.4 at the median distance between the covariate values. Priors on σ² and on the parameters σ_2² and ζ of the Gaussian process are as in the benchmark prior of Palacios and Steel (2006). Finally, we fix the smoothness parameter ν of the Gaussian process.

5.1 Example 1: Sine wave

We generated observations from the model y_i = sin(πx_i) + ε_i, where the x_i are uniformly distributed on (0, 1) and the errors ε_i are independent and chosen to be heteroscedastic and non-normally distributed. We consider two possible formulations. Error 1 assumes that ε_i follows a t-distribution with zero mean and 2.5 degrees of freedom, and a conditional variance σ²(x) that is quadratic in x, smallest at one point of the covariate range and increasing away from it. Error 2 assumes that the error distribution is a two-component mixture of normals with weights 0.3 and 0.7 and component means 0.3 and −0.3 + 0.6x_i, so that the error distribution is bimodal at x_i = 0 and unimodal (and normal) at x_i = 1. The first error distribution can be represented both as a mixture of normals and as a scale mixture of uniforms, whereas the second error distribution cannot be fitted using a mixture of uniforms. A simulation sketch of these two data-generating processes is given below.

[Figure 2: Example 1 with Error 1: predictive conditional mean of y given x for Model 1(a) for three values of α (dashed, solid and dotted lines). Data points are indicated by dots. The right panel presents a magnified section of the left panel.]

Initially, we assume Error 1. The results for Model 1(a) are illustrated in Figure 2 for three values of α. Smaller values of α lead to rougher processes and the effect of its choice on inference is clearly illustrated; in the sequel we only present results for one value of α. Under Model 1(a), we infer a rough version of the underlying true distribution, as illustrated by the predictive density in Figure 3. The small values of a in Table 1 indicate a lack of normality. The results are similar to those of GS, who find that the estimated conditional mean is often blocky, reflecting the underlying piecewise constant approximation to the changing distributional form. In the more complicated models the conditional location is modelled through a nonparametric regression function (in this case a Gaussian process prior).
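The following sketch simulates data of the kind used in this example. The variance function for Error 1 and the component variances for Error 2 are placeholders (the exact constants are not recoverable from this transcription), so the code illustrates the structure of the two error models rather than reproducing the paper's datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 1.0, n)
mean = np.sin(np.pi * x)

# Error 1: heteroscedastic t errors.  ASSUMED variance function: smallest mid-range, growing away from it.
df = 2.5
sd1 = np.sqrt(0.1 + (x - 0.5) ** 2)                                  # placeholder variance function
y1 = mean + sd1 * rng.standard_t(df, n) * np.sqrt((df - 2.0) / df)   # rescale the t draw to unit variance

# Error 2: two-component normal mixture, bimodal at x = 0 and collapsing to one mode at x = 1.
comp = rng.random(n) < 0.3
mu_err = np.where(comp, 0.3, -0.3 + 0.6 * x)
y2 = mean + rng.normal(mu_err, 0.1)                                  # placeholder component s.d. 0.1
```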

Table 1: Example 1 with Error 1: posterior median and 95% credible interval (in brackets) for selected parameters.

         Model 1(a)          Model 1(b)          Model 2
σ²       .7 (.48, .3)        .64 (.46, .96)      .68 (.49, .8)
a        .9 (., .4)          .5 (., .33)         -
b        -                   .75 (.54, .88)      .76 (.53, .9)
ρ        -                   .53 (.3, .96)       .6 (.3, .9)
M        .38 (.4, .95)       .84 (.6, 5.7)       .57 (.46, 3.64)

[Figure 3: Example 1 with Error 1: heatmaps of the posterior predictive density p(y|x) for Models 1(a), 1(b) and 2, and plots of the posterior conditional predictive variance σ²(x) (solid line) together with the true value (dashed line).]

Both Model 1(b) and Model 2 assume a constant prior mean for m(x). Introducing the Gaussian process into Model 1 leads to smaller values of σ_1², since some variability can now be explained by the Gaussian process prior. However, the posterior for a still favours fairly small values, reminding us that even if the conditional mean is better modelled with the Gaussian process, the tails are still highly non-normal (see Figure 4).

[Figure 4: Example 1 with Error 1: the posterior predictive distribution of y at x = 0.4 using Model 1(a) (solid line) and Model 1(b) (dashed line).]

The estimated posterior predictive distributions (as depicted in Figure 3) are now much smoother. Both Model 1(b) and Model 2 lead to large values of b, and thus of σ_2², which better fits the true variability of the mean.

This leads to better estimates of the conditional predictive variance, as illustrated in Figure 3. Clearly a model of this type would struggle with estimation at the extreme values of x, but the main part of the functional form is well recovered. The parameter ρ = ν/ζ used in Table 1 is an alternative range parameter, favoured by Stein (1999, p.5), and indicates that the Gaussian process dependence of m(x) is similar for Model 1(b) and Model 2. The posterior median values of ρ lead to a range of the Gaussian process equal to .89 and .6 for Models 1(b) and 2, respectively.

Results for data generated with the second error structure are shown in Figure 5 and Table 2 (for selected parameters). Model 1(b) is able to infer the bimodal distribution for small values of x and the single mode for large x, as well as the changing variance. Model 2 is not able to capture the bimodality by construction and only captures the changing variability. In both cases the mean is well estimated. Small values of a illustrate the difficulty in capturing the error structure. The large values of b indicate that the centring model (a constant model with mean zero) does a poor job of capturing the mean.

[Figure 5: Example 1 with Error 2, for Models 1(b) and 2: heatmap of the posterior predictive density p(y|x); plot of the posterior of m(x) indicating the median (solid line), 95% credible interval (dashed lines) and data (dots); and the posterior predictive error distribution indicating the 2.5th, 25th, 75th and 97.5th percentiles.]

Table 2: Example 1 with Error 2: posterior median and 95% credible interval (in brackets) for selected parameters.

         Model 1(a)          Model 1(b)          Model 2
σ²       .7 (.4, .66)        .47 (.34, .7)       .54 (.38, .98)
a        . (., .3)           .3 (., .38)         -
b        -                   .84 (.66, .9)       .84 (.65, .94)

5.2 Prestige data

Fox and Suschnigg (1989) and Fox (1997) consider measuring the relationship between income and prestige of occupations using the 1971 Canadian census. The prestige of the jobs was measured through a social survey. We treat income as the response and the prestige measure as a covariate. The data are available to download in the R package car.

[Figure 6: Prestige data: posterior distribution of the conditional mean for Models 1(a), 1(b) and 2, indicating the median (solid line), 95% credible interval (dashed lines) and data (dots).]

Figure 6 shows the fitted conditional mean. In all cases the relationship between income and prestige shows an increasing trend for smaller incomes before flattening out for larger incomes. The results are very similar to those described in Fox (1997). The inference for selected individual parameters is presented in Table 3.

Table 3: Prestige data: posterior median and 95% credible interval (in brackets) for selected parameters.

         Model 1(a)          Model 1(b)          Model 2
σ²       . (4.8, 43.7)       . (4.8, 3.)         . (6., 36.4)
a        . (.3, .3)          .8 (.8, .69)        -
b        -                   .66 (.38, .85)      .69 (.4, .88)

As in the previous example, the Gaussian process structure on m(x) accounts for quite a bit of the variability, rendering the error distribution not too far from normal in Model 1(b), as indicated by the fairly large values of a.

5.3 Scale economies in electricity distribution

Yatchew (2003) considers fitting a cost function for the distribution of electricity. A Cobb-Douglas model is fitted, which assumes that

tc = f(cust) + β_1 wage + β_2 pcap + β_3 PUC + β_4 kwh + β_5 life + β_6 lf + β_7 kmwire + ε,

where tc is the log of total cost per customer, cust is the log of the number of customers, wage is the log wage rate, pcap is the log price of capital, PUC is a dummy variable for public utility commissions, life is the log of the remaining life of distribution assets, lf is the log of the load factor, and kmwire is the log of kilometres of distribution wire per customer. The data consist of 81 municipal distributors in Ontario, Canada, during 1993. We will fit the DPRS model with cust as the covariate to ε, and we will centre the model over two parametric regression models by choosing f(cust) as follows: Parametric 1, γ_1 + γ_2 cust, and Parametric 2, γ_1 + γ_2 cust + γ_3 cust².

Table 4: Electricity data: posterior median and 95% credible interval (in brackets) for selected parameters of Parametric model 1 (linear) and the nonparametric models centred over Parametric model 1.

         Parametric 1          Model 1(a)            Model 1(b)            Model 2
γ_1      .4 (-4.4, 5.)         -.7 (-4.88, 3.)       -.9 (-4.98, 3.9)      -.67 (-4.79, 4.3)
γ_2      -.7 (-.3, -.)         -.7 (-.4, -.)         -. (-., .)            -.9 (-., .)
β_1      .48 (-.5, .6)         .67 (.5, .)           .7 (.7, .3)           .7 (., .53)
β_4      . (-.6, .3)           .7 (-., .5)           .4 (-.4, .)           .6 (-.4, .3)
β_6      .97 (.3, .9)          . (.9, .)             .4 (.4, .)            .9 (.4, .4)
σ²       .7 (.5, .)            . (.4, .36)           .3 (.7, .39)          .7 (.9, .48)
a        -                     .9 (.5, .45)          .75 (.5, .99)         -
b        -                     -                     .4 (., .77)           .55 (., .84)

The results of Yatchew (2003) suggest that a linear f(cust) is not sufficient to explain the effect of the number of customers. The results for selected parameters are shown in Tables 4 and 5 when centring over Parametric 1 and over Parametric 2, respectively. When fitting the two parametric models we see differences in the estimates of the effects of some other covariates: the parameters β_1 and β_6 have larger posterior medians under Parametric 2, while β_4 has a smaller estimate. If we centre our nonparametric models over the linear parametric model then we see the same changes for β_1 and β_6 and a smaller change for β_4. Posterior inference on the regression coefficients is much more similar for all models in Table 5. In particular, the parametric effect of customers is very similar for Parametric 2 and for all the nonparametric models centred over it. The estimated correction to the parametric fit for the effect of customers is shown in Figure 7. For models centred over the linear model, it shows a difference which could be well captured by a quadratic effect, especially for Models 1(b) and 2. Under both centring models, the importance of the nonparametric fitting of the error sharply decreases when a Gaussian process formulation for the regression function is used, as evidenced by the increase in a. Changing to a quadratic centring distribution leads to decreased estimates of b, indicating a more appropriate fit of the parametric part. This is corroborated by the nonparametric correction to this fit, as displayed in Figure 7.

Table 5: Electricity data: posterior median and 95% credible interval (in brackets) for selected parameters of Parametric model 2 (quadratic) and the nonparametric models centred over Parametric model 2.

         Parametric 2          Model 1(a)            Model 1(b)            Model 2
γ_1      .77 (-.53, 6.96)      .78 (-.83, 6.88)      .5 (-.44, 7.56)       .77 (-4., 7.79)
γ_2      -.83 (-.9, -.48)      -.9 (-.4, -.4)        -.9 (-.69, -.3)       -.96 (-.57, -.3)
γ_3      .4 (., .6)            .5 (., .7)            .4 (., .9)            .5 (., .8)
β_1      .83 (., .48)          .79 (.6, .38)         .8 (.4, .43)          .78 (-.3, .4)
β_4      -. (-., .5)           -. (-., .7)           . (-.8, .8)           . (-.8, .9)
β_6      .5 (.38, .9)          .3 (.5, .8)           .3 (.47, .5)          .3 (.48, .59)
σ²       .6 (.3, .9)           .7 (.4, .3)           . (.6, .34)           . (.6, .38)
a        -                     .3 (., .4)            .77 (.4, .99)         -
b        -                     -                     .3 (.8, .75)          .37 (.5, .77)

[Figure 7: Electricity data: posterior mean of the nonparametric component(s) of the model, for Models 1(a), 1(b) and 2 centred over the linear and the quadratic parametric models.]

6 Discussion

This paper shows how ideas from Bayesian nonparametric density estimation and nonparametric estimation of the mean in regression models can be combined to define a range of useful models. We introduce novel approaches to nonparametric modelling by centring over appropriately chosen parametric models. This allows for a more structured approach to Bayesian nonparametrics and can greatly assist in identifying the specific inadequacies of commonly used parametric models. An important aspect of the methodology is the separate modelling of various components, such as important quantiles (like the median) or moments (like the mean), which allows the nonparametric smoothing model to do less work. These ideas can be used in combination with any nonparametric prior that allows distributions to change with covariates.

In this paper we have concentrated on one example, the Dirichlet Process Regression Smoother (DPRS) prior introduced here. The latter is related to the πDDP methods of GS but allows simpler computation (and without truncation) through retrospective methods. The parameters of the DPRS can be chosen to control the smoothness and the scale of the process.

Appendix A: Proofs

Proof of Theorem 1

It is easy to show (see GS) that for any measurable set B,

Corr(F_x(B), F_y(B)) = (M + 1) ∑_{i=1}^∞ E[p_i(x) p_i(y)].

If x ∉ S_i or y ∉ S_i then E[p_i(x) p_i(y)] = 0. Otherwise, let R_i^{(1)} = {j < i | x ∈ S_j and y ∈ S_j}, R_i^{(2)} = {j < i | x ∈ S_j and y ∉ S_j} and R_i^{(3)} = {j < i | x ∉ S_j and y ∈ S_j}. Then

E[p_i(x) p_i(y)] = E[ V_i² ∏_{j ∈ R_i^{(1)}} (1 − V_j)² ∏_{j ∈ R_i^{(2)}} (1 − V_j) ∏_{j ∈ R_i^{(3)}} (1 − V_j) ]
= 2/((M + 1)(M + 2)) E[ (M/(M + 2))^{#R_i^{(1)}} (M/(M + 1))^{#R_i^{(2)}} (M/(M + 1))^{#R_i^{(3)}} ],

and so

Corr(F_x(B), F_y(B)) = 2/(M + 2) E[ ∑_{i=1}^∞ B_i (M/(M + 2))^{∑_{j<i} B_j} (M/(M + 1))^{∑_{j<i} (A_j − B_j)} ].

Proof of Theorem 2

The set {i | x ∈ S_i or y ∈ S_i} must have infinite size since it contains {i | x ∈ S_i}, which has infinite size. Restricting attention to this subsequence, each A_i = 1 and the B_i are independent Bernoulli(p_{s,v}) random variables, so from the expression in Theorem 1

Corr(F_s, F_v) = 2/(M + 2) ∑_{i=1}^∞ E[ B_i (M/(M + 2))^{∑_{j<i} B_j} (M/(M + 1))^{∑_{j<i} (1 − B_j)} ]
= 2/(M + 2) p_{s,v} ∑_{i=1}^∞ ( p_{s,v} M/(M + 2) + (1 − p_{s,v}) M/(M + 1) )^{i−1}
= (2 p_{s,v}/(M + 2)) × (M + 1)(M + 2)/((M + 2) + M p_{s,v})
= 2(M + 1) p_{s,v} / (2 + M(1 + p_{s,v})).

Proof of Theorem 3

Since (C, r, t) follows a Poisson process with intensity f(r) on R^p × R^+ × R^+, p(C_k | s ∈ S_k or v ∈ S_k, r_k) is uniformly distributed on B_{r_k}(s) ∪ B_{r_k}(v) and

p(r_k | s ∈ S_k or v ∈ S_k) = ν(B_{r_k}(s) ∪ B_{r_k}(v)) f(r_k) / ∫ ν(B_r(s) ∪ B_r(v)) f(r) dr,

where ν(·) is Lebesgue measure. Then

p_{s,v} = P(s, v ∈ S_k | s ∈ S_k or v ∈ S_k) = ∫ ∫_{B_{r_k}(s) ∩ B_{r_k}(v)} p(C_k, r_k | s ∈ S_k or v ∈ S_k) dC_k dr_k
= ∫ ν(B_r(s) ∩ B_r(v)) f(r) dr / ∫ ν(B_r(s) ∪ B_r(v)) f(r) dr.

Proof of Theorem 4

The autocorrelation function can be expressed as f(p_{s,s+u}), where f(x) = 2(M + 1)x/(2 + M(1 + x)). By Faà di Bruno's formula,

d^n/du^n f(p_{s,s+u}) = ∑ n!/(m_1! m_2! m_3! ⋯) × d^{m_1+⋯+m_n} f / dp_{s,s+u}^{m_1+⋯+m_n} × ∏_{j: m_j ≥ 1} ( (d^j p_{s,s+u}/du^j) / j! )^{m_j},

where the sum is over non-negative integers m_1, ..., m_n with m_1 + 2m_2 + ⋯ + n m_n = n, and so

lim_{u→0} d^n/du^n f(p_{s,s+u}) = ∑ n!/(m_1! m_2! m_3! ⋯) × lim_{u→0} d^{m_1+⋯+m_n} f / dp_{s,s+u}^{m_1+⋯+m_n} × ∏_{j: m_j ≥ 1} ( (d^j p_{s,s+u}/du^j) / j! )^{m_j}.

Since lim_{u→0} d^{m_1+⋯+m_n} f / dp_{s,s+u}^{m_1+⋯+m_n} is finite and non-zero for all values of n, the degree of differentiability of the autocorrelation function is equal to the degree of differentiability of p_{s,s+u}. We can write p_{s,s+u} = a/(4µ − a) with a = 2µ_u − uI. Now d^k p_{s,s+u}/da^k = 4µ k! (4µ − a)^{−(k+1)}, and lim_{u→0} d^k p_{s,s+u}/da^k = 4µ k! (2µ)^{−(k+1)}, which is finite and non-zero. By a further application of Faà di Bruno's formula,

d^n p_{s,s+u}/du^n = ∑ n!/(m_1! m_2! m_3! ⋯) × d^{m_1+⋯+m_n} p_{s,s+u}/da^{m_1+⋯+m_n} × ∏_{j: m_j ≥ 1} ( (d^j a/du^j) / j! )^{m_j},

and the degree of differentiability is determined by the degree of differentiability of a. If r ~ Ga(α, β) then dµ_u/du = −(u/4) (β^α/Γ(α)) (u/2)^{α−1} exp{−βu/2} and dI/du = −(1/2) (β^α/Γ(α)) (u/2)^{α−1} exp{−βu/2}, and it is easy to show that d^n a/du^n = C_n u^{α−n+1} exp{−βu/2} + ζ, where ζ contains terms with powers of u greater than α − n + 1. If lim_{u→0} u^{α−n+1} exp{−βu/2} is finite then so are the limits of the higher-order terms, and so the limit will be finite iff α − n + 1 ≥ 0, i.e. α ≥ n − 1.

Appendix B: Computational details

As we conduct inference on the basis of the Poisson process restricted to the set R, all quantities (C, r, t, V, θ) should carry a superscript R. To keep the notation manageable, these superscripts are not used explicitly in this Appendix.

Updating the centres. We update each centre C_1, ..., C_K from its full conditional distribution using a Metropolis-Hastings random walk step. A new value C_i' for the i-th centre is proposed from N(C_i, σ_C²), where σ_C² is chosen so that the acceptance rate is approximately 0.25. If there is no x_j such that x_j ∈ (C_i' − r_i, C_i' + r_i), or if there is a value of j with s_j = i for which x_j ∉ (C_i' − r_i, C_i' + r_i), then α(C_i, C_i') = 0. Otherwise, the acceptance probability has the form

α(C_i, C_i') = min{ 1, ∏_{j=1}^n ∏_{h<s_j: C_h'−r_h<x_j<C_h'+r_h} (1 − V_h) / ∏_{j=1}^n ∏_{h<s_j: C_h−r_h<x_j<C_h+r_h} (1 − V_h) },

where C' denotes the set of centres with C_i replaced by C_i'.

Updating the distances. The distances can be updated using a Gibbs step since the full conditional distribution of r̃_k has a simple piecewise form. Recall that d_ik = |x_i − C_k| and let S_k = {j | s_j ≥ k}. We define S_k^{ord} to be a version of S_k whose elements have been ordered to be increasing in d_ik, i.e. if i > j and i, j ∈ S_k^{ord} then d_ik > d_jk. Finally we define d_k = max[{x_min − C_k, C_k − x_max} ...


More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Hyperparameter estimation in Dirichlet process mixture models

Hyperparameter estimation in Dirichlet process mixture models Hyperparameter estimation in Dirichlet process mixture models By MIKE WEST Institute of Statistics and Decision Sciences Duke University, Durham NC 27706, USA. SUMMARY In Bayesian density estimation and

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Bayesian isotonic density regression

Bayesian isotonic density regression Bayesian isotonic density regression Lianming Wang and David B. Dunson Biostatistics Branch, MD A3-3 National Institute of Environmental Health Sciences U.S. National Institutes of Health P.O. Box 33,

More information

Gaussian Process Regression

Gaussian Process Regression Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference

Bayesian Estimation of DSGE Models 1 Chapter 3: A Crash Course in Bayesian Inference 1 The views expressed in this paper are those of the authors and do not necessarily reflect the views of the Federal Reserve Board of Governors or the Federal Reserve System. Bayesian Estimation of DSGE

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

STAT Advanced Bayesian Inference

STAT Advanced Bayesian Inference 1 / 32 STAT 625 - Advanced Bayesian Inference Meng Li Department of Statistics Jan 23, 218 The Dirichlet distribution 2 / 32 θ Dirichlet(a 1,...,a k ) with density p(θ 1,θ 2,...,θ k ) = k j=1 Γ(a j) Γ(

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Nonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group

Nonparmeteric Bayes & Gaussian Processes. Baback Moghaddam Machine Learning Group Nonparmeteric Bayes & Gaussian Processes Baback Moghaddam baback@jpl.nasa.gov Machine Learning Group Outline Bayesian Inference Hierarchical Models Model Selection Parametric vs. Nonparametric Gaussian

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

Nonparametric Drift Estimation for Stochastic Differential Equations

Nonparametric Drift Estimation for Stochastic Differential Equations Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information

Lecture 8: The Metropolis-Hastings Algorithm

Lecture 8: The Metropolis-Hastings Algorithm 30.10.2008 What we have seen last time: Gibbs sampler Key idea: Generate a Markov chain by updating the component of (X 1,..., X p ) in turn by drawing from the full conditionals: X (t) j Two drawbacks:

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process

A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We give a simple

More information

Lecture 4: Dynamic models

Lecture 4: Dynamic models linear s Lecture 4: s Hedibert Freitas Lopes The University of Chicago Booth School of Business 5807 South Woodlawn Avenue, Chicago, IL 60637 http://faculty.chicagobooth.edu/hedibert.lopes hlopes@chicagobooth.edu

More information

Modeling and Predicting Healthcare Claims

Modeling and Predicting Healthcare Claims Bayesian Nonparametric Regression Models for Modeling and Predicting Healthcare Claims Robert Richardson Department of Statistics, Brigham Young University Brian Hartman Department of Statistics, Brigham

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

CMPS 242: Project Report

CMPS 242: Project Report CMPS 242: Project Report RadhaKrishna Vuppala Univ. of California, Santa Cruz vrk@soe.ucsc.edu Abstract The classification procedures impose certain models on the data and when the assumption match the

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

On the Fisher Bingham Distribution

On the Fisher Bingham Distribution On the Fisher Bingham Distribution BY A. Kume and S.G Walker Institute of Mathematics, Statistics and Actuarial Science, University of Kent Canterbury, CT2 7NF,UK A.Kume@kent.ac.uk and S.G.Walker@kent.ac.uk

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Information geometry for bivariate distribution control

Information geometry for bivariate distribution control Information geometry for bivariate distribution control C.T.J.Dodson + Hong Wang Mathematics + Control Systems Centre, University of Manchester Institute of Science and Technology Optimal control of stochastic

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models

A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models A comparative review of variable selection techniques for covariate dependent Dirichlet process mixture models William Barcella 1, Maria De Iorio 1 and Gianluca Baio 1 1 Department of Statistical Science,

More information

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu Plan for today Why Bayesian Nonparametrics? Dirichlet Distribution and Dirichlet Processes. 2 Parameter and Patterns Reference:

More information

Chapter 2. Data Analysis

Chapter 2. Data Analysis Chapter 2 Data Analysis 2.1. Density Estimation and Survival Analysis The most straightforward application of BNP priors for statistical inference is in density estimation problems. Consider the generic

More information

Probabilistic & Unsupervised Learning

Probabilistic & Unsupervised Learning Probabilistic & Unsupervised Learning Gaussian Processes Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London

More information

Bayesian Nonparametric Autoregressive Models via Latent Variable Representation

Bayesian Nonparametric Autoregressive Models via Latent Variable Representation Bayesian Nonparametric Autoregressive Models via Latent Variable Representation Maria De Iorio Yale-NUS College Dept of Statistical Science, University College London Collaborators: Lifeng Ye (UCL, London,

More information

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions

Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Bayesian semiparametric modeling and inference with mixtures of symmetric distributions Athanasios Kottas 1 and Gilbert W. Fellingham 2 1 Department of Applied Mathematics and Statistics, University of

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

On the posterior structure of NRMI

On the posterior structure of NRMI On the posterior structure of NRMI Igor Prünster University of Turin, Collegio Carlo Alberto and ICER Joint work with L.F. James and A. Lijoi Isaac Newton Institute, BNR Programme, 8th August 2007 Outline

More information

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation Ian Porteous, Alex Ihler, Padhraic Smyth, Max Welling Department of Computer Science UC Irvine, Irvine CA 92697-3425

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS

ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS ECO 513 Fall 2009 C. Sims HIDDEN MARKOV CHAIN MODELS 1. THE CLASS OF MODELS y t {y s, s < t} p(y t θ t, {y s, s < t}) θ t = θ(s t ) P[S t = i S t 1 = j] = h ij. 2. WHAT S HANDY ABOUT IT Evaluating the

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

processes Sai Xiao, Athanasios Kottas and Bruno Sansó Abstract

processes Sai Xiao, Athanasios Kottas and Bruno Sansó Abstract Nonparametric Bayesian modeling and inference for renewal processes Sai Xiao, Athanasios Kottas and Bruno Sansó Abstract We propose a flexible approach to modeling and inference for renewal processes.

More information

University of Toronto Department of Statistics

University of Toronto Department of Statistics Norm Comparisons for Data Augmentation by James P. Hobert Department of Statistics University of Florida and Jeffrey S. Rosenthal Department of Statistics University of Toronto Technical Report No. 0704

More information

Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources

Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources th International Conference on Information Fusion Chicago, Illinois, USA, July -8, Non-parametric Bayesian Modeling and Fusion of Spatio-temporal Information Sources Priyadip Ray Department of Electrical

More information

Gaussian Processes (10/16/13)

Gaussian Processes (10/16/13) STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

A Nonparametric Model-based Approach to Inference for Quantile Regression

A Nonparametric Model-based Approach to Inference for Quantile Regression A Nonparametric Model-based Approach to Inference for Quantile Regression Matthew Taddy and Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz, CA

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Hybrid Dirichlet processes for functional data

Hybrid Dirichlet processes for functional data Hybrid Dirichlet processes for functional data Sonia Petrone Università Bocconi, Milano Joint work with Michele Guindani - U.T. MD Anderson Cancer Center, Houston and Alan Gelfand - Duke University, USA

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Bayesian Nonparametrics: Models Based on the Dirichlet Process Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro

More information