Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Environmetrics 00, 1–12 DOI: /env.XXXX

Comparing Non-informative Priors for Estimation and Prediction in Spatial Models

Regina Wu^a and Cari G. Kaufman^a

Summary: Fitting a Bayesian model to spatial data requires specifying prior distributions for the parameters. The current literature focuses on the performance of certain non-informative prior distributions in the context of parameter estimation, but often what is of main interest is the accuracy of predictions. In this paper, we carry out a simulation study comparing two commonly used non-informative priors for spatial models, the reference prior and the Jeffreys-rule prior. We find that even though the reference prior outperforms the Jeffreys-rule prior for parameter estimation, the two perform almost identically in terms of mean squared prediction error and coverage rates in prediction. We do find a systematic relationship between prior choice and the length of prediction credible intervals, and we explain why this happens, relating it to previous results on parameter estimation. Our results show that, unlike for parameter estimation, prior choice makes little difference in prediction.

Keywords: reference prior; Jeffreys-rule prior; spatial data; Bayesian kriging

1. INTRODUCTION

Spatial data are often modeled by Gaussian random fields in which the mean and covariance parameters are unknown. This uncertainty can be accounted for by taking a Bayesian approach and placing prior distributions on the parameters of the assumed model. Choosing a prior distribution can be difficult when the parameters are hard to interpret or when little prior knowledge is available, so it is useful to have default objective priors that avoid

^a Department of Statistics, University of California, Berkeley, CA 94720, U.S.A. Correspondence to: Regina Wu, Department of Statistics, University of California, Berkeley, CA 94720, U.S.A. wuregina88@gmail.com

This paper has been submitted for consideration for publication in Environmetrics

the need for subjective specification. The motivation of Berger et al. (2001) is to present non-informative prior distributions that yield proper posterior distributions, which many commonly used priors fail to do. Berger et al. (2001) recommend using the reference prior over the Jeffreys-rule prior, the independence Jeffreys prior, and the constant prior. In addition to having a proper posterior, Berger et al. (2001) show that the reference prior outperforms the Jeffreys-rule prior in terms of parameter estimation, as it produces credible intervals for the parameters with close to nominal coverage. Papers such as Kazianka and Pilz (2012), Ren et al. (2012), and De Oliveira (2007) have extended the study of objective Bayesian inference for spatially correlated data by including nugget effects to account for the possibility of measurement error. When examining frequentist coverage of parameters such as the range parameter θ and the noise-to-signal ratio η, their results agree with Berger et al. (2001): the Jeffreys-rule prior produces credible intervals for the parameters with low coverage rates, especially as p and θ get larger.

The current literature focuses on prior performance with regard to parameter estimation. The motivation for this paper stems from the fact that in geostatistics parameter estimation is usually not performed in isolation, but for the end goal of making predictions of the underlying process at unobserved locations. Hence, it is interesting to examine how the priors compare for prediction. From a Bayesian standpoint, predictions can easily be generated using kriging procedures (see, e.g., Handcock and Stein, 1993), but prior choice is again a question. We evaluate the predictive performance of the reference prior and Jeffreys-rule prior introduced in Berger et al. (2001).

2. THE MODEL AND PRIORS

2.1. The Model

Let {Z(s), s ∈ D} denote a Gaussian random field, with the n observations denoted by Z = (Z(s_1), ..., Z(s_n))', where s_1, ..., s_n are known distinct sampling locations in D. Our goal is to examine predictions made for Z_pred = (Z(s*_1), ..., Z(s*_m))', where s*_1, ..., s*_m are new locations in D. The mean function of the Gaussian field is E{Z(s)} = β'f(s), where β = (β_1, ..., β_p)' ∈ R^p are unknown regression parameters and f(s) = (f_1(s), ..., f_p(s))' are known location-dependent covariates. The covariance function is cov{Z(s), Z(u)} = σ² K_θ(‖s − u‖), where ‖·‖ denotes Euclidean distance, σ² = var{Z(s)}, K_θ(‖s − u‖) = corr{Z(s), Z(u)} is an isotropic correlation function, and θ = (θ_1, ..., θ_c) ∈ Θ ⊂ R^c is the vector of parameters controlling the range of correlation and the smoothness of the field.

The likelihood of the model parameters given observed data z = (z_1, ..., z_n)', where z_i represents the observed value of Z(s_i), is

  L(β, σ², θ; z) = (2πσ²)^(−n/2) |Σ_θ|^(−1/2) exp{ −(1/(2σ²)) (z − Xβ)' Σ_θ^(−1) (z − Xβ) }   (1)

where X is a full-rank n × p matrix defined by X_ij = f_j(s_i) and Σ_θ is a positive definite n × n matrix defined by Σ_θ,ij = K_θ(‖s_i − s_j‖), θ ∈ Θ. We focus on the use of spherical, power exponential, rational quadratic, and Matérn covariance functions.

Generating samples from the posterior distribution of θ is most efficiently carried out using the integrated likelihood L^I(θ; z), defined by

  ∫_{R^p} ∫_{(0,∞)} L(β, σ², θ; z) π(β, σ², θ) dβ dσ² = L^I(θ; z) π(θ)

where π(β, σ², θ) and π(θ) denote the prior distributions for (β, σ², θ) and θ, respectively. Further details on the priors are given in Section 2.2. Specifically, we have

  L^I(θ; z) ∝ |Σ_θ|^(−1/2) |X'Σ_θ^(−1)X|^(−1/2) (S²_θ)^(−((n−p)/2 + a − 1))   (2)

where S²_θ = (z − Xβ̂_θ)' Σ_θ^(−1) (z − Xβ̂_θ) is the generalized residual sum of squares and β̂_θ = (X'Σ_θ^(−1)X)^(−1) X'Σ_θ^(−1) z is the generalized least squares estimator of β given θ. Furthermore, we can write the posterior predictive distribution conditional on Z and θ, which is the multivariate t-distribution with df = n − p + 2a − 2 degrees of freedom,

  Z_pred | Z, θ ~ t_{n−p+2a−2}(Ẑ_pred, V_pred)   (3)

with mean Ẑ_pred and variance V_pred defined as

  Ẑ_pred = X_pred β̂_θ + k'_pred Σ_θ^(−1) (z − Xβ̂_θ)
  V_pred = (1/(df − 2)) S²_θ V_θ

where X_pred is a full-rank m × p matrix defined by X_pred,ij = f_j(s*_i), k_pred is an n × m matrix with k_pred,ij = K_θ(‖s_i − s*_j‖), K_pred is an m × m matrix with K_pred,ij = K_θ(‖s*_i − s*_j‖), and

  V_θ = K_pred − k'_pred Σ_θ^(−1) k_pred + [X_pred − k'_pred Σ_θ^(−1) X] (X'Σ_θ^(−1)X)^(−1) [X_pred − k'_pred Σ_θ^(−1) X]'

2.2. Reference and Jeffreys-rule Priors

The prior distributions for (β, σ², θ) ∈ Ω = R^p × (0, ∞) × (0, ∞) presented in Berger et al. (2001) take the form

  π(β, σ², θ) ∝ π(θ) / (σ²)^a.   (4)

They derive the reference prior for (σ², θ), which is the Jeffreys-rule prior computed under a marginal model with integrated likelihood defined for (σ², θ). The reference prior π^R(β, σ², θ) is of the form (4) with a = 1 and

  π^R(θ) ∝ { tr[W_θ²] − (1/(n−p)) (tr[W_θ])² }^(1/2)

where W_θ = ((∂/∂θ)Σ_θ) Σ_θ^(−1) P_θ and P_θ = I − X(X'Σ_θ^(−1)X)^(−1)X'Σ_θ^(−1). The independence Jeffreys prior π^J1(β, σ², θ) is defined by a = 1 and

  π^J1(θ) ∝ { tr[U_θ²] − (1/n) (tr[U_θ])² }^(1/2)

where U_θ = ((∂/∂θ)Σ_θ) Σ_θ^(−1). Lastly, the Jeffreys-rule prior π^J2(β, σ², θ) is defined by a = 1 + p/2 and

  π^J2(θ) ∝ |X'Σ_θ^(−1)X|^(1/2) π^J1(θ).

Berger et al. (2001) show that the reference prior and Jeffreys-rule prior always yield proper posteriors under sampling distribution (1), but the independence Jeffreys prior is proper only when 1 is not a column of X.

3. SIMULATION STUDY

3.1. Simulation Set-Up

Berger et al. (2001) compare the reference prior and Jeffreys-rule prior by examining the empirical coverage of equal-tailed Bayesian credible intervals for the range parameter θ and their average log lengths. Both priors produce intervals with reasonable coverage when only an intercept term is included in the mean of the Gaussian process, but the coverage of the Jeffreys-rule prior is very low when additional terms are added, particularly when the spatial correlation is also strong. The authors attribute the poor performance to the increase in the degrees of freedom of the posterior for σ², which causes a shift towards smaller σ² values.
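As an illustration (ours, not from the paper), the unnormalized reference prior π^R(θ) can be evaluated numerically. The sketch below assumes an isotropic exponential correlation function with a single range parameter θ; the grid of locations and the intercept-only design matrix are toy values of our own choosing.

```python
import numpy as np

def reference_prior(theta, X, locs):
    """Unnormalized pi^R(theta), assuming exponential correlation
    K_theta(d) = exp(-d / theta) with a single range parameter theta."""
    n, p = X.shape
    d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
    Sigma = np.exp(-d / theta)
    dSigma = (d / theta**2) * Sigma          # elementwise d/dtheta of Sigma_theta
    Si = np.linalg.inv(Sigma)
    # P_theta = I - X (X' Si X)^{-1} X' Si, then W_theta = (dSigma) Si P_theta
    P = np.eye(n) - X @ np.linalg.solve(X.T @ Si @ X, X.T @ Si)
    W = dSigma @ Si @ P
    return np.sqrt(np.trace(W @ W) - np.trace(W) ** 2 / (n - p))

# toy evaluation on a 3 x 3 grid with an intercept-only mean (p = 1)
locs = np.array([[x, y] for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)])
X = np.ones((9, 1))
val = reference_prior(0.5, X, locs)
```

The quantity under the square root is proportional to the determinant of the Fisher information of the marginal model for (σ², θ), so it is nonnegative by construction.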

This shift is said to cause the 95% credible sets from the marginal posterior distribution for θ to have poor coverage rates.

We replicate and extend the simulation study in Berger et al. (2001) to compare mean squared errors in addition to coverage rates, and posterior predictions of the Gaussian process in addition to inference about θ. Our simulated spatial data, modeled under six different Gaussian random fields Z(·), use the same choices of mean and covariance functions as in Berger et al. (2001). The mean function E{Z(s)} is either the constant value 0.15 (p = 1) or the function x − .1y + 9x² − xy + 1.2y² (p = 6), and the covariance function is the Matérn, which simplifies to the exponential case C(d) = .12 exp{−d/θ} when ν = .5. We let the θ values in the covariance vary across .2, .5, or 1.0. These values give correlations which decay to .05 at distances of .5992, 1.4979, and 2.9957, respectively. For each of the six combinations, we simulate 3000 sets of data at equally spaced locations in the union of a training set S_t = {(x, y) : x, y ∈ {0, 0.25, 0.5, 0.75, 1}} = {s_1, ..., s_25} and a validation set S_v = {(x, y) : x, y ∈ {0.125, 0.375, 0.625, 0.875}} = {s*_1, ..., s*_16}.

For each of the 3000 sample replications z_j in our training set S_t, we sample 1000 values of θ from the marginal posterior p(θ | z_j) using a random walk Metropolis-Hastings (MH) sampler. Candidates are sampled from a normal distribution centered at the previous value, and the first 200 values of each chain are removed as burn-in. The sampling procedure can be carried out using the integrated likelihood L^I(θ; z_j) in (2). For each dataset z_j, we take each sample θ drawn from p(θ | z_j) and draw from the corresponding multivariate t-distribution p(Z_pred | z_j, θ) given in (3). The resulting 800 draws are therefore a sample from p(Z_pred | z_j), from which we can calculate the posterior means z̄_pred,j, a vector of length 16.
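A minimal sketch of this sampling step might look as follows. This is our own illustrative code, not the paper's: the stand-in log posterior below replaces the actual log L^I(θ; z_j) + log π(θ), and the step size and seed are arbitrary.

```python
import numpy as np

def rw_metropolis(log_post, theta0, n_iter=1000, step=0.3, burn=200, seed=0):
    """Random-walk Metropolis for a positive scalar theta: normal proposals
    centered at the current value; candidates <= 0 are rejected outright."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        cand = theta + step * rng.normal()
        if cand > 0:
            lp_c = log_post(cand)
            if np.log(rng.uniform()) < lp_c - lp:   # MH acceptance step
                theta, lp = cand, lp_c
        draws[i] = theta
    return draws[burn:]                             # discard burn-in draws

# stand-in target: a normal density truncated to theta > 0
samples = rw_metropolis(lambda t: -0.5 * ((t - 1.0) / 0.2) ** 2,
                        theta0=1.0, n_iter=5000, seed=1)
```

Each retained θ draw would then be fed into the t-distribution (3) to produce one predictive draw, exactly as described above.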
Calculations of mean squared prediction error (MSPE), frequentist coverage, and credible interval length are based on just one arbitrary prediction location s*_k = (.625, .375). We compute MSPE by averaging over the 3000 replications. Bayesian equal-tailed 95% credible intervals (L_j, U_j) are calculated for each dataset z_j. The frequentist

coverage is estimated as p̂ = (1/3000) Σ_{j=1}^{3000} I(L_j < z_pred,j < U_j). This estimate has an associated standard error of √(p̂(1 − p̂)/3000). Lastly, we calculate the average credible interval lengths.

3.2. Simulation Results and Interpretation

Our first parameter setting follows the same set-up as in Berger et al. (2001), in which the data are simulated under smoothness parameter ν = .5 and this value is treated as known. Results in Table 1 show that the MSPEs are very similar between the two priors. For p = 1 the Jeffreys-rule prior has higher coverage rates than the reference prior, and for p = 6 the reference prior has higher coverage rates than the Jeffreys-rule prior. This ordering appears to be driven by the credible interval lengths. The Jeffreys-rule prior performs well here, in contrast to the results for parameter estimation in Berger et al. (2001), in which it had very poor performance, especially when p = 6. It should be noted that the differences in interval length and coverage for prediction are, practically speaking, very small, in contrast to the large differences observed for parameter estimation in Berger et al. (2001).

[Table 1 about here.]

The similarity in predictive performance across priors is further illustrated in the forest plots of Figure 1. In each plot, the same random subset of 10 of the 3000 datasets is taken, and for each prior we show the predictive samples at the location s*_k = (.625, .375) in the validation set S_v. The samples are centered at the true value for easier comparison, and the posterior means are plotted along with 95% equal-tailed credible intervals. The striking similarity between predictions made under the two priors implies that prior choice has little effect on predictive inference.

[Figure 1 about here.]

Although the similarity between the priors in terms of prediction is strong, we still seek to understand the root of the differences in credible interval lengths, since this is the suspected

cause of the differences in frequentist coverage rates. In particular, we look for differences in p(θ | Z) across priors that may be related to differences in the spread of p(Z_pred | Z). This relationship is somewhat subtle, depending both on the marginal posterior p(θ | Z) under the different priors and on the way a particular θ influences Var[Z_pred,jk | Z_j, θ] for a given prior. For a dataset Z_j, the prediction interval for Z_pred,jk should be wider when Var[Z_pred,jk | Z_j] is larger. We can decompose this as

  Var[Z_pred,jk | Z_j] = E(Var[Z_pred,jk | Z_j, θ]) + Var(E[Z_pred,jk | Z_j, θ])   (5)

where Var[Z_pred,jk | Z_j, θ] and E[Z_pred,jk | Z_j, θ] can be calculated using the t-distribution in (3) for a given Z_j and sample θ. The outer expectation and variance are taken with respect to the posterior distribution p(θ | Z_j). Therefore, we can approximate each term in the sum by the empirical mean and variance of these quantities calculated from the MCMC samples.

We find that the E(Var[Z_pred,jk | Z_j, θ]) component consistently dominates the sum, with the Var(E[Z_pred,jk | Z_j, θ]) component having negligible magnitude in comparison. Furthermore, no consistent ordering between priors is found in Var(E[Z_pred,jk | Z_j, θ]), but E(Var[Z_pred,jk | Z_j, θ]) follows the ordering of the priors with regard to the credible interval lengths observed earlier in Table 1.

To understand why this ordering occurs, we focus on the predictions when p = 6, since this is when the difference between the priors is largest. We illustrate the result for a randomly chosen dataset and the same arbitrary prediction location s*_k = (.625, .375). A sequence of θ values is plotted against the corresponding Var[Z_pred,jk | Z_j, θ] values. The resulting curves are shown in Figure 2, with a = 1 corresponding to the reference prior and a = 1 + p/2 corresponding to the Jeffreys-rule prior. Other differences between these two priors are not important in calculating Var[Z_pred,jk | Z_j, θ].
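The empirical approximation of (5) described above can be sketched as follows (illustrative code of our own; the stand-in conditional moments are simulated rather than taken from the model, and the law-of-total-variance identity is checked against direct draws).

```python
import numpy as np

def decompose_pred_var(cond_means, cond_vars):
    """Monte Carlo approximation of the two terms in (5), given, for each
    posterior draw theta_i, the conditional moments from (3):
    cond_means[i] = E[Z_pred | Z, theta_i],
    cond_vars[i]  = Var[Z_pred | Z, theta_i]."""
    e_var = cond_vars.mean()     # approximates E( Var[Z_pred | Z, theta] )
    var_e = cond_means.var()     # approximates Var( E[Z_pred | Z, theta] )
    return e_var, var_e, e_var + var_e

# stand-in moments: conditional means vary with theta, conditional variance is 1
rng = np.random.default_rng(3)
m = rng.normal(size=20000)
v = np.ones(20000)
e_var, var_e, total = decompose_pred_var(m, v)

# one draw from each conditional distribution; its variance should match total
z = m + rng.normal(size=20000)
```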
Specifically, for given Z_j and θ, the values of Var[Z_pred,jk | Z_j, θ] differ only by the multiplicative constant (n − 2)/(n − p − 2) across the different

priors. Along the x-axis are rug plots of the posterior θ samples under each prior, and along the y-axis are rug plots of the corresponding Var[Z_pred,jk | Z_j, θ] values for each prior. The horizontal lines indicate the estimated value of E(Var[Z_pred,jk | Z_j, θ]) under each prior.

[Figure 2 about here.]

The curves in Figure 2 indicate that, for a particular prior, larger θ values correspond to smaller Var[Z_pred,jk | Z_j, θ] values. Judging by the rug plots, if Var[Z_pred,jk | Z_j, θ] did not differ by the multiplicative constant, the reference prior would have the lowest E(Var[Z_pred,jk | Z_j, θ]), since its θ samples are slightly larger on average; instead, we see it has the highest E(Var[Z_pred,jk | Z_j, θ]). This is due to the upward shift of the true Var[Z_pred,jk | Z_j, θ] curve under the reference prior, which completely drowns out the effect of having larger θ samples. Hence, E(Var[Z_pred,jk | Z_j, θ]) is larger under the reference prior than under the Jeffreys-rule prior, which is exactly the ordering of the credible interval lengths seen in the p = 6 case in Table 1 and in the ordering of the E(Var[Z_pred,jk | Z_j, θ]) component of Equation (5).

When the simulation is re-run under a larger smoothness parameter of ν = 1.5, the credible interval lengths are higher for the Jeffreys-rule prior than for the reference prior in the p = 6 case. Looking at Figure 3, we see that the reference prior still has the largest θ samples on average, but this time the Var[Z_pred,jk | Z_j, θ] curve under the reference prior is shifted up only slightly. Because this curve is close to that of the Jeffreys-rule prior, the effect of having larger θ samples dominates, and the expected ordering of E(Var[Z_pred,jk | Z_j, θ]) is maintained, unlike in the ν = .5 case.

[Figure 3 about here.]
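Since the two priors enter the conditional predictive distribution (3) only through df = n − p + 2a − 2, the multiplicative constant separating the curves follows directly from the variance formula S²_θ V_θ/(df − 2). A quick hypothetical check of its value for this simulation's n = 25 training points:

```python
# For the multivariate t in (3), df = n - p + 2a - 2 and the conditional
# predictive variance is S^2_theta * V_theta / (df - 2).  This computes the
# reference-to-Jeffreys variance ratio implied by the two choices of a.
def pred_var_ratio(n, p):
    df_ref = n - p + 2 * 1 - 2             # reference prior: a = 1
    df_jef = n - p + 2 * (1 + p / 2) - 2   # Jeffreys-rule prior: a = 1 + p/2
    return (df_jef - 2) / (df_ref - 2)     # = (n - 2) / (n - p - 2)

ratio = pred_var_ratio(25, 6)              # 23/17, about 1.35 when p = 6
```

The ratio grows with p, which is consistent with the separation of the curves being largest in the p = 6 case.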

4. ILLUSTRATION

Unlike global climate models (GCMs), regional climate models (RCMs) are used to model climate system evolution in a limited area through discretized versions of physical processes. The higher resolution of RCMs is beneficial because it better captures the impact of local features such as lakes and mountains, as well as subgrid-scale atmospheric processes that can only be approximated in GCMs. Because RCMs cover a limited area, they require boundary conditions, which are commonly provided by the output of GCMs (Kaufman and Sain, 2010).

To illustrate the conclusions from the simulation study, we consider average summer temperatures from model output of an RCM called CHRM with boundary conditions from a GCM called HADAM, as shown in Figure 4. This dataset consists of 2663 average summer temperatures. Temperatures at 671 equally spaced locations (25%) are used as the training set, and the remaining 1,992 locations (75%) are used as the validation set. The three covariates are longitude (lon), latitude (lat), and elevation (elev). We model the mean as E{Z(s)} = β_1 + β_2 lon + β_3 lat + β_4 elev + β_5 elev² (p = 5), and we use the exponential covariance function.

For each prior, we sample 6000 values of θ from the marginal posterior p(θ | z) using a random walk Metropolis-Hastings sampler, as before. The first 1000 values of each chain are removed as burn-in, and every 10th sample is kept, leaving 500 values. For each remaining θ, we obtain sample predictions at the 1,992 validation locations by drawing from the corresponding multivariate t-distribution p(Z_pred | z, θ). The MSPE and average credible interval lengths are calculated by averaging over the 1,992 locations. The observed coverage rate is calculated as p̂ = (1/1992) Σ_{k=1}^{1992} I(L_k < z_pred,k < U_k). Table 2 shows that the two priors perform almost identically in terms of prediction. In fact, there are no significant differences in observed coverage nor in MSPE across priors.
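The prediction summaries reported in Tables 1 and 2 (MSPE, empirical coverage with its standard error, and average interval length) can be computed from posterior predictive draws along these lines. This is our own sketch; the stand-in draws below are simulated from a well-calibrated standard normal predictive, so coverage should land near the nominal 95%.

```python
import numpy as np

def prediction_summaries(pred_samples, z_true):
    """MSPE, empirical coverage (with standard error), and average length of
    equal-tailed 95% credible intervals, given an (n_cases, n_draws) array of
    posterior predictive draws and the corresponding true values."""
    means = pred_samples.mean(axis=1)
    mspe = np.mean((means - z_true) ** 2)
    lo, hi = np.percentile(pred_samples, [2.5, 97.5], axis=1)
    covered = (lo < z_true) & (z_true < hi)
    p_hat = covered.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / len(z_true))
    return mspe, p_hat, se, np.mean(hi - lo)

# stand-in data: predictive draws and truths both standard normal
rng = np.random.default_rng(2)
pred = rng.normal(size=(1000, 2000))
z_true = rng.normal(size=1000)
mspe, p_hat, se, avg_len = prediction_summaries(pred, z_true)
```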

[Figure 4 about here.]

[Table 2 about here.]

5. DISCUSSION

In this paper, we extended the current literature on prior performance from parameter estimation to prediction. Although the reference prior is the prior recommended by Berger et al. (2001), a key finding of this paper is that the reference prior and Jeffreys-rule prior perform almost identically for prediction. We discovered a systematic relationship between prior choice and credible interval lengths based on the posterior predictive distribution. Specifically, we decompose Var[Z_pred,jk | Z_j] and find that the ordering is driven by the E(Var[Z_pred,jk | Z_j, θ]) component in (5). The observed ordering of the credible interval lengths depends on the degree of separation between the a = 1 and a = 1 + p/2 curves seen in Figures 2 and 3: the larger the separation, the less we see the effect of the reference prior having larger θ values. In our simulation, we assumed the smoothness parameter ν is known, so it would be interesting to see the effects when it is fixed at an incorrect value or estimated. Furthermore, it would be interesting to extend the simulation to see how well the priors perform for prediction when a nugget is present.

REFERENCES

Berger JO, Oliveira VD, Sansó B, 2001. Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association 96(456):

Handcock MS, Stein ML, 1993. A Bayesian analysis of kriging. Technometrics 35(4):

Kaufman CG, Sain SR, 2010. Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Analysis 5(1):

Kazianka H, Pilz J, 2012. Objective Bayesian analysis of spatial data with uncertain nugget and range parameters. The Canadian Journal of Statistics 40(2):

Oliveira VD, 2007. Objective Bayesian analysis of spatial data with measurement error. The Canadian Journal of Statistics 35(2):

Ren C, Sun D, He C, 2012. Objective Bayesian analysis for a spatial model with nugget effects. Journal of Statistical Planning and Inference 142(7):

FIGURES

Figure 1. Forest plots of centered prediction samples grouped by dataset (dashed line with circles: reference prior; solid line with × marks: Jeffreys-rule prior). Panels correspond to θ = 0.2, 0.5, 1 and to p = 1, 6.

Figure 2. Plot of θ against Var[Z_pred,jk | Z_j, θ]. The upper curve is calculated using a = 1 and the lower curve using a = 1 + p/2. Rug plots of θ samples appear along the x-axis and rug plots of the corresponding Var[Z_pred,jk | Z_j, θ] values along the y-axis. Horizontal lines indicate the E(Var[Z_pred,jk | Z_j, θ]) values calculated from the MCMC samples. (Dashed line: reference prior; solid line: Jeffreys-rule prior.)

Figure 3. Plot of θ against Var[Z_pred,jk | Z_j, θ] values. The upper curve is calculated using a = 1 and the lower curve using a = 1 + p/2. Rug plots of θ samples appear along the x-axis and rug plots of the corresponding Var[Z_pred,jk | Z_j, θ] values along the y-axis. Horizontal lines indicate the E(Var[Z_pred,jk | Z_j, θ]) values calculated from the MCMC samples. (Dashed line: reference prior; solid line: Jeffreys-rule prior.)

Figure 4. Average summer temperatures (°C) of the Prudence Project experiment between

TABLES

Table 1. Mean squared prediction error, frequentist coverage of predictions, and credible interval length results from the simulation

                                     θ = .2           θ = .5           θ = 1

Mean Squared Prediction Error (and S.E.)
p = 1, ν = .5
  Reference prior                .0729 (.0019)    .0332 (.0008)    .0169 (.0004)
  Jeffreys-rule prior            .0732 (.0019)    .0334 (.0008)    .0170 (.0004)
  Prediction reference prior     .0728 (.0019)    .0332 (.0008)    .0169 (.0004)
p = 6, ν = .5
  Reference prior                .0742 (.0019)    .0330 (.0008)    .0174 (.0005)
  Jeffreys-rule prior            .0762 (.0019)    .0346 (.0009)    .0184 (.0005)
  Prediction reference prior     .0745 (.0019)    .0333 (.0008)    .0176 (.0005)

Frequentist Coverage (and S.E.)
p = 1, ν = .5
  Reference prior                .9480 (.0041)    .9580 (.0037)    .9647 (.0034)
  Jeffreys-rule prior            .9483 (.0040)    .9593 (.0036)    .9660 (.0033)
  Prediction reference prior     .9433 (.0042)    .9530 (.0039)    .9613 (.0035)
p = 6, ν = .5
  Reference prior                .9513 (.0039)    .9660 (.0033)    .9603 (.0036)
  Jeffreys-rule prior            .9243 (.0048)    .9380 (.0044)    .9413 (.0043)
  Prediction reference prior     .9180 (.0050)    .9310 (.0046)    .9357 (.0045)

Credible Interval Lengths (and S.E.)
p = 1, ν = .5
  Reference prior                (.0034)          .7746 (.0023)    .5649 (.0017)
  Jeffreys-rule prior            (.0034)          .7807 (.0024)    .5673 (.0017)
  Prediction reference prior     (.0032)          .7613 (.0023)    .5554 (.0017)
p = 6, ν = .5
  Reference prior                (.0035)          .8048 (.0025)    .5857 (.0018)
  Jeffreys-rule prior            (.0032)          .7382 (.0023)    .5386 (.0017)
  Prediction reference prior     (.0030)          .7061 (.0022)    .5140 (.0016)

Table 2. Mean squared prediction error, observed coverage of predictions, and credible interval length results from the illustration

Mean Squared Prediction Error (and S.E.)
  Reference prior          .2156 (.0214)
  Jeffreys-rule prior      .2150 (.0214)

Observed Coverage (and S.E.)
  Reference prior          .9487 (.0049)
  Jeffreys-rule prior      .9483 (.0049)

Credible Interval Lengths (and S.E.)
  Reference prior          (.0070)
  Jeffreys-rule prior      (.0069)


More information

Cross-covariance Functions for Tangent Vector Fields on the Sphere

Cross-covariance Functions for Tangent Vector Fields on the Sphere Cross-covariance Functions for Tangent Vector Fields on the Sphere Minjie Fan 1 Tomoko Matsuo 2 1 Department of Statistics University of California, Davis 2 Cooperative Institute for Research in Environmental

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

BAYESIAN KRIGING AND BAYESIAN NETWORK DESIGN

BAYESIAN KRIGING AND BAYESIAN NETWORK DESIGN BAYESIAN KRIGING AND BAYESIAN NETWORK DESIGN Richard L. Smith Department of Statistics and Operations Research University of North Carolina Chapel Hill, N.C., U.S.A. J. Stuart Hunter Lecture TIES 2004

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

Climate Change: the Uncertainty of Certainty

Climate Change: the Uncertainty of Certainty Climate Change: the Uncertainty of Certainty Reinhard Furrer, UZH JSS, Geneva Oct. 30, 2009 Collaboration with: Stephan Sain - NCAR Reto Knutti - ETHZ Claudia Tebaldi - Climate Central Ryan Ford, Doug

More information

Default priors and model parametrization

Default priors and model parametrization 1 / 16 Default priors and model parametrization Nancy Reid O-Bayes09, June 6, 2009 Don Fraser, Elisabeta Marras, Grace Yun-Yi 2 / 16 Well-calibrated priors model f (y; θ), F(y; θ); log-likelihood l(θ)

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Bruno Sansó. Department of Applied Mathematics and Statistics University of California Santa Cruz bruno

Bruno Sansó. Department of Applied Mathematics and Statistics University of California Santa Cruz   bruno Bruno Sansó Department of Applied Mathematics and Statistics University of California Santa Cruz http://www.ams.ucsc.edu/ bruno Climate Models Climate Models use the equations of motion to simulate changes

More information

Uncertainty and regional climate experiments

Uncertainty and regional climate experiments Uncertainty and regional climate experiments Stephan R. Sain Geophysical Statistics Project Institute for Mathematics Applied to Geosciences National Center for Atmospheric Research Boulder, CO Linda Mearns,

More information

Non-gaussian spatiotemporal modeling

Non-gaussian spatiotemporal modeling Dec, 2008 1/ 37 Non-gaussian spatiotemporal modeling Thais C O da Fonseca Joint work with Prof Mark F J Steel Department of Statistics University of Warwick Dec, 2008 Dec, 2008 2/ 37 1 Introduction Motivation

More information

Consistent high-dimensional Bayesian variable selection via penalized credible regions

Consistent high-dimensional Bayesian variable selection via penalized credible regions Consistent high-dimensional Bayesian variable selection via penalized credible regions Howard Bondell bondell@stat.ncsu.edu Joint work with Brian Reich Howard Bondell p. 1 Outline High-Dimensional Variable

More information

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment

Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Approximate Bayesian computation for spatial extremes via open-faced sandwich adjustment Ben Shaby SAMSI August 3, 2010 Ben Shaby (SAMSI) OFS adjustment August 3, 2010 1 / 29 Outline 1 Introduction 2 Spatial

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Carl N. Morris. University of Texas

Carl N. Morris. University of Texas EMPIRICAL BAYES: A FREQUENCY-BAYES COMPROMISE Carl N. Morris University of Texas Empirical Bayes research has expanded significantly since the ground-breaking paper (1956) of Herbert Robbins, and its province

More information

Nearest Neighbor Gaussian Processes for Large Spatial Data

Nearest Neighbor Gaussian Processes for Large Spatial Data Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns

More information

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P. Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk

More information

Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields

Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields 1 Introduction Jo Eidsvik Department of Mathematical Sciences, NTNU, Norway. (joeid@math.ntnu.no) February

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School

More information

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA

Gauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter

More information

Karhunen-Loeve Expansion and Optimal Low-Rank Model for Spatial Processes

Karhunen-Loeve Expansion and Optimal Low-Rank Model for Spatial Processes TTU, October 26, 2012 p. 1/3 Karhunen-Loeve Expansion and Optimal Low-Rank Model for Spatial Processes Hao Zhang Department of Statistics Department of Forestry and Natural Resources Purdue University

More information

Bayesian Methods in Multilevel Regression

Bayesian Methods in Multilevel Regression Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design

More information

What s New in Econometrics. Lecture 13

What s New in Econometrics. Lecture 13 What s New in Econometrics Lecture 13 Weak Instruments and Many Instruments Guido Imbens NBER Summer Institute, 2007 Outline 1. Introduction 2. Motivation 3. Weak Instruments 4. Many Weak) Instruments

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging Cheng Li DSAP, National University of Singapore Joint work with Rajarshi Guhaniyogi (UC Santa Cruz), Terrance D. Savitsky (US Bureau of Labor

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

STATISTICAL MODELS FOR QUANTIFYING THE SPATIAL DISTRIBUTION OF SEASONALLY DERIVED OZONE STANDARDS

STATISTICAL MODELS FOR QUANTIFYING THE SPATIAL DISTRIBUTION OF SEASONALLY DERIVED OZONE STANDARDS STATISTICAL MODELS FOR QUANTIFYING THE SPATIAL DISTRIBUTION OF SEASONALLY DERIVED OZONE STANDARDS Eric Gilleland Douglas Nychka Geophysical Statistics Project National Center for Atmospheric Research Supported

More information

Some Curiosities Arising in Objective Bayesian Analysis

Some Curiosities Arising in Objective Bayesian Analysis . Some Curiosities Arising in Objective Bayesian Analysis Jim Berger Duke University Statistical and Applied Mathematical Institute Yale University May 15, 2009 1 Three vignettes related to John s work

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Introduction to Spatial Data and Models

Introduction to Spatial Data and Models Introduction to Spatial Data and Models Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data

More information

Bayesian Gaussian / Linear Models. Read Sections and 3.3 in the text by Bishop

Bayesian Gaussian / Linear Models. Read Sections and 3.3 in the text by Bishop Bayesian Gaussian / Linear Models Read Sections 2.3.3 and 3.3 in the text by Bishop Multivariate Gaussian Model with Multivariate Gaussian Prior Suppose we model the observed vector b as having a multivariate

More information

Spatio-temporal precipitation modeling based on time-varying regressions

Spatio-temporal precipitation modeling based on time-varying regressions Spatio-temporal precipitation modeling based on time-varying regressions Oleg Makhnin Department of Mathematics New Mexico Tech Socorro, NM 87801 January 19, 2007 1 Abstract: A time-varying regression

More information

Bayesian inference for factor scores

Bayesian inference for factor scores Bayesian inference for factor scores Murray Aitkin and Irit Aitkin School of Mathematics and Statistics University of Newcastle UK October, 3 Abstract Bayesian inference for the parameters of the factor

More information

Hierarchical Modelling for Multivariate Spatial Data

Hierarchical Modelling for Multivariate Spatial Data Hierarchical Modelling for Multivariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Point-referenced spatial data often come as

More information

Empirical Bayes methods for the transformed Gaussian random field model with additive measurement errors

Empirical Bayes methods for the transformed Gaussian random field model with additive measurement errors 1 Empirical Bayes methods for the transformed Gaussian random field model with additive measurement errors Vivekananda Roy Evangelos Evangelou Zhengyuan Zhu CONTENTS 1.1 Introduction......................................................

More information

Introduction to Geostatistics

Introduction to Geostatistics Introduction to Geostatistics Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore,

More information

Statistical analysis of regional climate models. Douglas Nychka, National Center for Atmospheric Research

Statistical analysis of regional climate models. Douglas Nychka, National Center for Atmospheric Research Statistical analysis of regional climate models. Douglas Nychka, National Center for Atmospheric Research National Science Foundation Olso workshop, February 2010 Outline Regional models and the NARCCAP

More information

A STATISTICAL TECHNIQUE FOR MODELLING NON-STATIONARY SPATIAL PROCESSES

A STATISTICAL TECHNIQUE FOR MODELLING NON-STATIONARY SPATIAL PROCESSES A STATISTICAL TECHNIQUE FOR MODELLING NON-STATIONARY SPATIAL PROCESSES JOHN STEPHENSON 1, CHRIS HOLMES, KERRY GALLAGHER 1 and ALEXANDRE PINTORE 1 Dept. Earth Science and Engineering, Imperial College,

More information

Making rating curves - the Bayesian approach

Making rating curves - the Bayesian approach Making rating curves - the Bayesian approach Rating curves what is wanted? A best estimate of the relationship between stage and discharge at a given place in a river. The relationship should be on the

More information

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing STAT763: Applied Regression Analysis Multiple linear regression 4.4 Hypothesis testing Chunsheng Ma E-mail: cma@math.wichita.edu 4.4.1 Significance of regression Null hypothesis (Test whether all β j =

More information

Modelling geoadditive survival data

Modelling geoadditive survival data Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

Estimating marginal likelihoods from the posterior draws through a geometric identity

Estimating marginal likelihoods from the posterior draws through a geometric identity Estimating marginal likelihoods from the posterior draws through a geometric identity Johannes Reichl Energy Institute at the Johannes Kepler University Linz E-mail for correspondence: reichl@energieinstitut-linz.at

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Multi-resolution models for large data sets

Multi-resolution models for large data sets Multi-resolution models for large data sets Douglas Nychka, National Center for Atmospheric Research National Science Foundation NORDSTAT, Umeå, June, 2012 Credits Steve Sain, NCAR Tia LeRud, UC Davis

More information

Part 8: GLMs and Hierarchical LMs and GLMs

Part 8: GLMs and Hierarchical LMs and GLMs Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Bayesian analysis of conditional autoregressive models

Bayesian analysis of conditional autoregressive models Ann Inst Stat Math 202) 64:07 33 DOI 0.007/s0463-00-0298- Bayesian analysis of conditional autoregressive models Victor De Oliveira Received: 8 December 2008 / Revised: 4 January 200 / Published online:

More information

Gaussian Processes for Computer Experiments

Gaussian Processes for Computer Experiments Gaussian Processes for Computer Experiments Jeremy Oakley School of Mathematics and Statistics, University of Sheffield www.jeremy-oakley.staff.shef.ac.uk 1 / 43 Computer models Computer model represented

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes

Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Alan Gelfand 1 and Andrew O. Finley 2 1 Department of Statistical Science, Duke University, Durham, North

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo 1 Motivation 1.1 Bayesian Learning Markov Chain Monte Carlo Yale Chang In Bayesian learning, given data X, we make assumptions on the generative process of X by introducing hidden variables Z: p(z): prior

More information

MCMC Sampling for Bayesian Inference using L1-type Priors

MCMC Sampling for Bayesian Inference using L1-type Priors MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2015 Julien Berestycki (University of Oxford) SB2a MT 2015 1 / 16 Lecture 16 : Bayesian analysis

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Bayesian Inference. Chapter 4: Regression and Hierarchical Models

Bayesian Inference. Chapter 4: Regression and Hierarchical Models Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative

More information

Bayesian Analysis of Latent Variable Models using Mplus

Bayesian Analysis of Latent Variable Models using Mplus Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are

More information

A Framework for Daily Spatio-Temporal Stochastic Weather Simulation

A Framework for Daily Spatio-Temporal Stochastic Weather Simulation A Framework for Daily Spatio-Temporal Stochastic Weather Simulation, Rick Katz, Balaji Rajagopalan Geophysical Statistics Project Institute for Mathematics Applied to Geosciences National Center for Atmospheric

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information