Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Elizabeth C. Mannshardt-Shamseldin
Duke University, Department of Statistical Science

Advisor: Richard L. Smith
University of North Carolina at Chapel Hill, Department of Statistics and Operations Research

January 14th, 2009
Outline

- Introduction and Motivation
- Statistical Methods Background
- Univariate Linear Methods Developed (Smith and Zhu)
- Research Questions
- Multivariate Non-Linear Development
- Simulation Results
- Conclusions and Future Work
Introduction: Motivation

The need often arises in spatial settings for data transformation. The transformation may be non-linear, and the desired predictand may require interpolation of predictions at multiple sites.

- In traditional kriging, the standard MSPE formula does not take into account the estimation of the covariance parameters; this generally leads to underestimated prediction errors.
- Bayesian methods offer a solution, but iterative methods can be computationally time intensive.
Introduction: Possible Solution

Smith and Zhu (2004) establish a second-order expansion for predictive distributions in Gaussian processes with estimated covariances. Here, we establish a similar expansion for multivariate kriging with non-linear predictands.

Main Results

Explicit formulas, for a general non-linear predictand, for:
- the expected length of a Bayesian prediction interval
- the coverage probability bias

A matching prior (CPB = 0) and an alternative estimator are explored.
Background: Spatial Statistics

We have a stochastic process Z(s), generally assumed to be Gaussian with known mean µ and covariance structure V(θ):

  Z(s) = µ(s) + e(s)

where e(s) is a zero-mean error process.

Basic model: Z ~ N(µ, V(θ))

Model with the mean as a linear function of covariates: Z ~ N(Xβ, V(θ)), with
- X a matrix of covariates
- β a vector of unknown regression coefficients.
Background: Covariance Structures

Example: covariance function for the exponential model:

  cov{Z(s_i), Z(s_j)} = σ² exp(−d_ij / φ)    (1)

The covariance structure is introduced through V(θ), the matrix of standardized covariances determined by θ = (σ², φ), with entries v_ij = exp(−d_ij / φ).

- φ = range parameter
- σ = scale parameter
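The covariance matrix of Equation (1) is simple to form numerically. A minimal Python sketch (the helper name `exp_cov` is ours, not from the talk) builds V(θ) from site coordinates:

```python
import numpy as np

def exp_cov(coords, sigma=1.0, phi=1.0):
    """Exponential covariance matrix with entries v_ij = sigma^2 * exp(-d_ij / phi)."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances d_ij between sites
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma**2 * np.exp(-d / phi)

# Three example sites in the plane
V = exp_cov([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)], sigma=1.0, phi=1.0)
```

The diagonal entries equal σ², and the off-diagonal entries decay with distance at rate governed by the range parameter φ.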
Background: Kriging

Kriging: a technique for predicting values at unobserved locations within a random field through linear combinations of the observed variables. It refers to the construction of a spatial predictor in terms of known model parameters. Universal kriging: the mean process is a linear combination of covariates.

Y is the vector of known observations; Y_0 is the (scalar) value to be predicted:

  (Y, Y_0)^T ~ N( (Xβ, x_0^T β)^T , [ V(θ)  w(θ) ; w^T(θ)  v_0(θ) ] )    (2)

- X is the n × p matrix of covariates for the observations Y
- x_0 is the p × 1 vector of covariates for the predicted scalar Y_0
- β is the vector of regression coefficients
- θ is the vector of covariance parameters.
Background: Kriging

Universal kriging aims to find the linear predictor Ŷ_0 = λ^T Y that minimizes the MSPE E{(Y_0 − Ŷ_0)²} subject to the unbiasedness condition X^T λ = x_0. Using Lagrange multipliers, the optimal λ is:

  λ(θ) = V^{−1}(θ) w(θ) + V^{−1}(θ) X (X^T V^{−1}(θ) X)^{−1} (x_0 − X^T V^{−1}(θ) w(θ))

with corresponding MSPE:

  σ_0²(θ) = v_0(θ) − w(θ)^T V^{−1}(θ) w(θ)
            + (x_0 − X^T V^{−1}(θ) w(θ))^T (X^T V^{−1}(θ) X)^{−1} (x_0 − X^T V^{−1}(θ) w(θ)).

Thus the predictive distribution function is:

  Pr{Y_0 ≤ z | Y = y, θ} = ψ(z; y, θ) = Φ( (z − λ(θ)^T y) / σ_0(θ) )
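The weights λ(θ) and the MSPE σ_0²(θ) above can be computed directly once V, w, v_0, X, and x_0 are formed. A minimal Python sketch (the function name `universal_kriging` is ours) follows the two formulas term by term:

```python
import numpy as np

def universal_kriging(V, w, v0, X, x0, y):
    """Universal kriging for one unobserved site with known covariance parameters.
    Returns the optimal weights lambda(theta), the MSPE sigma_0^2(theta),
    and the point prediction lambda^T y."""
    Vi = np.linalg.inv(V)
    A = X.T @ Vi @ X                              # X^T V^{-1} X
    c = x0 - X.T @ Vi @ w                         # x_0 - X^T V^{-1} w
    lam = Vi @ w + Vi @ X @ np.linalg.solve(A, c) # optimal weights
    mspe = v0 - w @ Vi @ w + c @ np.linalg.solve(A, c)
    return lam, mspe, lam @ y
```

A quick way to check an implementation like this is to verify the unbiasedness constraint X^T λ = x_0, which the Lagrange construction enforces exactly.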
Background: REML Estimation

Restricted Maximum Likelihood (REML) estimation is based on the joint density of a vector of contrasts. The distribution is independent of the population mean, and the resulting maximum likelihood estimator is approximately unbiased, as opposed to the MLE.

The REML estimator (Smith, 2001; Stein, 1999) maximizes l_n(θ) over Θ, where

  l_n(θ) = −((n − q)/2) log(2π) + (1/2) log|X^T X| − (1/2) log|X^T V^{−1}(θ) X|
           − (1/2) log|V(θ)| − (1/2) G²(θ)

and G²(θ) is the generalized residual sum of squares

  G²(θ) = Y^T {V^{−1}(θ) − V^{−1}(θ) X (X^T V^{−1}(θ) X)^{−1} X^T V^{−1}(θ)} Y.

Use the REML estimator θ̂ to obtain the plug-in predictive distribution function:

  ψ̂(z; y) = ψ(z; y, θ̂)
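The restricted log-likelihood above can be evaluated directly from its definition. The Python sketch below (our own helper, assuming `cov_fn(theta)` returns V(θ)) does so; a useful sanity check is that l_n(θ) is invariant to shifts of Y in the column space of X, since it is built from contrasts:

```python
import numpy as np

def restricted_loglik(theta, Y, X, cov_fn):
    """Restricted (REML) log-likelihood l_n(theta) for Z ~ N(X beta, V(theta)).
    cov_fn(theta) must return the covariance matrix V(theta); q = ncol(X)."""
    n, q = X.shape
    V = cov_fn(theta)
    Vi = np.linalg.inv(V)
    A = X.T @ Vi @ X
    # Generalized residual sum of squares G^2(theta)
    P = Vi - Vi @ X @ np.linalg.solve(A, X.T @ Vi)
    G2 = Y @ P @ Y
    _, logdetV = np.linalg.slogdet(V)
    _, logdetA = np.linalg.slogdet(A)
    _, logdetXX = np.linalg.slogdet(X.T @ X)
    return (-(n - q) / 2 * np.log(2 * np.pi)
            + 0.5 * logdetXX - 0.5 * logdetA - 0.5 * logdetV - 0.5 * G2)
```

In practice this function would be handed to a numerical optimizer to obtain the REML estimate θ̂.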
Background: Smith and Zhu (2004)

Smith and Zhu provide the original development, for the univariate normal predictive distribution, of the methods this paper considers for the non-linear multivariate case. This includes:

- Establishing a second-order expansion for predictive distributions in Gaussian processes.
- Using covariance parameter estimates (REML) in the plug-in approach as well as Bayesian methods.

The main focus is the estimation of quantiles of the predictive distribution and their application to prediction intervals. This leads to the calculation of the second-order coverage probability bias, which suggests the possible existence of a matching prior with CPB = 0.

Also: a frequentist correction, z̃_P, leads to a coverage probability bias of zero, analogous to the existence of the matching prior.
Notation

Recall the restricted log-likelihood function l_n(θ). Let U_i = ∂l_n(θ)/∂θ_i, U_ij = ∂²l_n(θ)/∂θ_i ∂θ_j, etc.; U^{ij} is the (i, j) entry of the inverse of the matrix whose (i, j) entry is U_ij, and Q(θ) is the log of the prior π(θ). Superscripts denote components of vectors, subscripts indicate differentiation with respect to components of θ; the summation convention applies.

The function of interest is the predictive distribution function ψ(z; Z, θ). ψ* denotes either the plug-in estimator ψ̂ or the Bayesian estimator ψ̃:

  ψ̃ = ψ̂ + D̂ + O_p(n^{−2})    (3)

where

  D = (1/2) U_ijk ψ_l U^{ij} U^{kl} − (1/2) (ψ_ij + 2 ψ_i Q_j) U^{ij}    (4)

and D̂ indicates the evaluation of D at θ̂.
Notation

Further, introduce random Z_i, Z_ij, Z_ijk with mean 0 such that

  U_i = n^{1/2} Z_i,   U_ij = n κ_ij + n^{1/2} Z_ij,   U_ijk = n κ_ijk + n^{1/2} Z_ijk

where κ_{i,j} = E{Z_i Z_j} and κ_{ij,k} = E{Z_ij Z_k}. Note κ_{i,j} = −κ_ij is the (i, j) entry of the normalized Fisher information matrix. This matrix is assumed invertible, with inverse entries κ^{i,j}.
Univariate Normal Case

Assume that ψ* has the expansion:

  ψ*(z; Y) = ψ(z; Y, θ) + n^{−1/2} R(z, Y) + n^{−1} S(z, Y) + o_p(n^{−1})    (5)

For both the plug-in and Bayesian methods, the components of R and S can be calculated explicitly, using a Taylor expansion for the plug-in approach and a combination of Taylor and Laplace expansions for the Bayesian approach.

For ẑ_P (the plug-in estimator):

  R = κ^{i,j} Z_i ψ_j    (6)

  S = κ^{i,j} κ^{k,l} Z_ik Z_j ψ_l + (1/2) κ^{i,r} κ^{j,s} κ^{k,t} κ_ijk Z_r Z_s ψ_t + (1/2) κ^{i,j} κ^{k,l} Z_i Z_k ψ_jl    (7)

where S for ẑ_P is further denoted S_1. For z̃_P (the Bayesian estimator), the corresponding expression is:

  S_2 = S_1 + (1/2) κ_ijk κ^{i,j} κ^{k,l} ψ_l + ((1/2) ψ_ij + ψ_i Q_j) κ^{i,j}

where Q(θ) is the log of the prior, π(θ).
Coverage Probability Bias

Hence the coverage probability bias, the expected value of ψ(z*_P; Y, θ) − ψ(z_P; Y, θ), is expressed as:

  CPB = E[ −n^{−1/2} R(z_P, Y) + n^{−1} ( R(z_P, Y) R′(z_P, Y) / ψ′(z_P; Y, θ) − S(z_P, Y) ) ] + o(n^{−1})    (8)

where primes denote differentiation with respect to z. The coverage probability bias represents the difference between P{Y_0 ≤ z*_P | Y, θ} and the target probability P, where z*_P is the plug-in estimate ẑ_P or the Bayesian estimate z̃_P of the P-quantile of the target distribution.
Key Findings

The development of these expansions allows comparison with standard frequentist correction procedures. It also allows selection of a design criterion based on the expected length of a prediction interval and the coverage probability bias.

Matching Prior

An interesting development: the coverage probability bias can be reduced to a form (Smith, 2004) that suggests the existence of a matching prior. It may be possible to choose the prior π so that the expectations of the O(n^{−1/2}) and O(n^{−1}) terms in the second-order CPB are zero.

This is an important result because, while it may be difficult or impractical to compute the matching prior, it assists in prior selection based on how closely different standard priors (Jeffreys, reference prior, etc.) come to the matching prior.
Key Findings

The estimator z̃_P is a form of the asymptotic bias and includes a frequentist correction term developed by Harville and Jeske (1992) and Zimmerman and Cressie (1992):

  z̃_P = ẑ_P − n^{−1} (asymptotic bias) / φ(Φ^{−1}(P))    (9)

To calculate the CPB, moments of various expressions involving R, S, and their derivatives are needed. By the asymptotic formulae, these can be expressed in terms of derivatives of ψ and other quantities that are explicit functions of the Gaussian process.
Preliminary Simulation for Univariate Linear Predictand

We first look at a preliminary simulation for the univariate predictand.

- 16 locations are generated at random in the plane.
- Corresponding observations are simulated: Y(s) = X^T(s) β + S(s)
  - X(s) is a column vector with entries 1, s_1, s_2, where s_1 and s_2 are the coordinates at site s.
  - β = (1, 2, 3)^T and S(s) is a stationary Gaussian process with mean 0.
- An exponential covariance structure with σ = 1 and φ = 1 is used.
- Parameter estimates are obtained at each site using the other 15 sites.
- Theoretical 95% prediction intervals are constructed and the empirical coverage probabilities are computed over 100 simulations at each site.
Preliminary Simulation Results

- Empirical coverage probabilities obtained through kriging with REML estimates of the covariance parameters are much lower than 95%, with values ranging from 64% to 89% and an Average Empirical Coverage (AEC) of 81.4%. The discrepancy can be attributed to the error introduced into the model through the estimated covariance parameters.
- The Bayesian method used Gibbs sampling in WinBUGS with REML estimates as starting values. It showed empirical coverage around 95%, with a range of 86% to 99% and an AEC of 94.8%.
- The Smith-Zhu Laplace approximation method shows definite improvement, with an AEC of 92.9% and a range of 87% to 98% coverage.
Comparison of Laplace, Bayesian, and Plug-In Methods

[Figure: Empirical coverage probabilities for two-parameter exponential predictions at the 16 sites; coverage probability (60–100%) plotted by site. True: AEC = 94%; Bayes: AEC = 95%; REML plug-in: AEC = 81%; Laplace: AEC = 93%.]
Conclusion

The key results of Smith and Zhu's 2004 paper are expressions for the coverage probability bias and the expected length of a prediction interval for both plug-in and Bayesian predictors, established for a Gaussian process whose mean is a linear combination of regressors and whose covariance is parametrically specified.

- The possible existence of a matching prior is introduced.
- A frequentist correction allows for a second-order CPB of zero.

This paper extends these methods to analogous non-linear multivariate predictands, such as those motivated by the methods established in Smith, Kolenikov, and Cox (2003).
Multivariate Non-Linear Development

An important difference between the univariate linear and multivariate non-linear cases is that the multivariate predictive distribution G* is not necessarily available in closed form. Thus, a method is needed to determine the derivatives of the predictive distribution function.

In the multivariate case, the predictand can be written as

  H = Σ_{j=1}^m h(Y_{0,j})   or   H = H(Y_0)    (10)

where h(·) is a function applied to the individual kriged values, such as h(y) = y². Example: the variance-stabilizing square-root transform, where the desired predictand is a spatial average (Smith, Kolenikov and Cox, 2003). H(·) is a more general transformation function.
Multivariate Non-Linear Development

Assume G* has an expansion:

  G*(z; Y) = G(z; Y, θ) + n^{−1/2} R(z, Y) + n^{−1} S(z, Y) + o_p(n^{−1})

Consider the multivariate non-linear analogs of Smith and Zhu's expansions and of the form of the CPB:

  R = κ^{i,j} Z_i G_j

  S = κ^{i,j} κ^{k,l} Z_ik Z_j G_l + (1/2) κ^{i,r} κ^{j,s} κ^{k,t} κ_ijk Z_r Z_s G_t + (1/2) κ^{i,j} κ^{k,l} Z_i Z_k G_jl

  CPB = E[ −n^{−1/2} R(z_P, Y) + n^{−1} ( R(z_P, Y) R′(z_P, Y) / G′(z_P; Y, θ) − S(z_P, Y) ) ] + o(n^{−1})    (11)

The extension to multivariate kriging is a generalization of the univariate case.
Multivariate Non-Linear Development

The objective is to evaluate G = P(H(Y_0) ≤ z | Y; θ) and its partial derivatives. For the multivariate, non-linear predictand, the exact form of G may not be easily manipulated.

- Develop methodology to derive the derivatives of the predictive distribution G with respect to z, θ, and both z and θ.
- Employ kernel density estimation to evaluate each term.
- Develop a parametric bootstrap to estimate the empirical cdf.

The predictive distribution can be estimated as G*_B(z | Y, θ) = (1/B) Σ_b I{H(Y_0^{(b)}) ≤ z}, where B denotes the number of iterations and G*_B represents the bootstrapped estimate.
Derivative Development

For G*(z | Y, θ) = E_{f(Y_0 | Y, θ)}[I{H(Y_0) ≤ z}], partial derivatives up to order 2 are necessary; these can be expressed as expectations with respect to the predictive distribution:

  ∂/∂θ_i ∫ I{H(Y_0) ≤ z} f(Y_0 | Y, θ) dY_0 = E_{f(Y_0 | Y, θ)}[ I{H(Y_0) ≤ z} ∂/∂θ_i ln f(Y_0 | Y, θ) ]

where f(Y_0 | Y, θ) is the restricted likelihood and ∂/∂θ_i ln f(Y_0 | Y, θ) can be analytically evaluated.

In practice, I{H(Y_0) ≤ z} ∂/∂θ_i ln f(Y_0 | Y, θ) is empirically estimated and averaged over many iterations, using a numerical approximation for the derivatives of the restricted log-likelihood. Simulated values are used in place of theoretical expected values.
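The identity above is a score-function (log-derivative) trick: the θ-derivative of G becomes an average of the indicator times the score over simulated draws. A minimal Python sketch (function and argument names are ours; `score_i` must supply ∂/∂θ_i ln f, analytically or numerically):

```python
import numpy as np

def dG_dtheta(z, Y0_draws, H, score_i):
    """Monte Carlo estimate of dG/dtheta_i = E[ I{H(Y0) <= z} * d/dtheta_i ln f(Y0|Y,theta) ],
    averaging the indicator times the score over simulated draws Y0_draws."""
    ind = np.array([float(H(y0) <= z) for y0 in Y0_draws])
    sc = np.array([score_i(y0) for y0 in Y0_draws])
    return np.mean(ind * sc)
```

As a check of the identity in a case with a known answer: for Y_0 ~ N(θ, 1) with H the identity, ∂/∂θ P(Y_0 ≤ z) = −φ(z − θ), and the Monte Carlo average recovers this.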
Kernel Density

The cumulative distribution of the kernel is used to approximate the predictive distribution G*(z | Y, θ):

  G*(z | Y, θ) = (1/B) Σ_b I{H(Y_0^{(b)}) ≤ z} ≈ (1/B) Σ_{b=1}^B K_1( (z − H(Y_0^{(b)})) / h )    (12)

The kernel density is written K and its distribution function K_1, so that K_1′ = K. The density K is used to estimate the predictive density, while K_1 estimates the predictive distribution G*(z | Y, θ).
Kernel Density

Here we consider the Epanechnikov kernel outlined in Silverman (1986). The Epanechnikov kernel has an efficiency of 1, based on minimizing the mean integrated squared error.

  f̂(t) = (1/(Bh)) Σ_{b=1}^B K(t_b),   K(t) = (3/(4√5)) (1 − t²/5)  for |t| ≤ √5,  0 otherwise

where t = (z − H(Y_0)) / h, with z the predicted value, H(Y_0) the true value, and h the smoothing parameter (bandwidth).

[Figure: plot of the Epanechnikov kernel K(t) against t.]
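The kernel density K, its integrated form K_1, and the smoothed cdf estimate of Equation (12) can be sketched as follows (helper names are ours; K_1 is obtained by integrating the Epanechnikov density in closed form):

```python
import numpy as np

SQRT5 = np.sqrt(5.0)

def epan_kernel(t):
    """Epanechnikov density K(t) = (3/(4*sqrt(5)))(1 - t^2/5) for |t| <= sqrt(5)."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= SQRT5, 0.75 * (1 - t**2 / 5) / SQRT5, 0.0)

def epan_cdf(t):
    """Integrated kernel K_1(t) = 1/2 + (3/4)u - (1/4)u^3 with u = t/sqrt(5),
    a smooth stand-in for the indicator I{. <= t}."""
    u = np.clip(np.asarray(t, dtype=float) / SQRT5, -1.0, 1.0)
    return 0.5 + 0.75 * u - 0.25 * u**3

def kde_cdf(z, samples, h):
    """Kernel estimate of G(z) = P(H(Y0) <= z) from bootstrap samples H(Y0^(b))."""
    return np.mean(epan_cdf((z - np.asarray(samples)) / h))
```

Because K_1 is differentiable in z (with derivative K/h after the change of variables), the same expression can be differentiated to estimate the predictive density, as needed in Equations (14) and (15).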
Multivariate Non-Linear Development

In summary, estimation is achieved through a parametric bootstrap:

1. For b = 1, ..., B replications, generate Y_0^{(b)} ~ N[Λ^T Y, V_0].
2. Calculate G*(z | Y, θ) = P[H(Y_0) ≤ z] ≈ (1/B) Σ_b I{H(Y_0^{(b)}) ≤ z}.
3. Use a kernel density, differentiable with respect to z and to the components of θ, to approximate Ĝ and its derivatives with respect to z.
4. Express derivatives with respect to θ as expectations involving the restricted log-likelihood f(Y_0 | Y, θ).
5. Evaluate the derivatives with respect to θ using a numerical approximation.
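Steps 1 and 2 above can be sketched in a few lines of Python (names such as `bootstrap_G` and `H_ss` are ours; Λ is the matrix of kriging weights for the n_2 prediction sites and V_0 the corresponding predictive covariance):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_G(z, Lam, Y, V0, H, B=2000):
    """Parametric bootstrap, steps 1-2: simulate Y0 from the predictive
    (kriging) distribution N(Lam^T Y, V0), then estimate
    G(z) = P(H(Y0) <= z) by the empirical fraction of bootstrap draws."""
    mean = Lam.T @ Y                                  # kriging predictions at the n2 sites
    Y0 = rng.multivariate_normal(mean, V0, size=B)    # (B, n2) bootstrap draws
    Hb = np.apply_along_axis(H, 1, Y0)                # predictand on each draw
    return np.mean(Hb <= z)

# Sum-of-squares predictand used in the simulation section
H_ss = lambda y0: np.sum(y0**2)
```

With a zero predictive mean and V_0 = I, H_ss reduces to a chi-squared variable, which gives a closed-form check of the bootstrap estimate.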
Expansion of Coverage Probability Bias

The coverage probability bias can be expressed as:

  E[G(z*_P) − G(z_P)] = −n^{−1/2} ( n^{−1/2} κ^{i,j} E[ U_i (1/B) Σ_{b=1}^B K_1( (z − H(Y_0^{(b)})) / h ) ∂/∂θ_j ln f(Y_0^{(b)} | Y, θ) ] )
                        + n^{−1} ( n^{−1} κ^{i,j} κ^{k,l} E[ A^{RR}_G / B^{RR}_G ] − E[S_G] )    (13)

where K_1 is the kernel distribution function with kernel density K, A^{RR}_G and B^{RR}_G are as expressed in Equations (14) and (15), and E[S_G] is the appropriate S for the desired plug-in or Bayesian prediction.

  A^{RR}_G = U_i (1/B) Σ_{b=1}^B K_1( (z − H(Y_0^{(b)})) / h ) ∂/∂θ_j ln f(Y_0^{(b)} | Y, θ)
             × U_k Σ_{b=1}^B (1/(Bh)) K( (z − H(Y_0^{(b)})) / h ) ∂/∂θ_l ln f(Y_0^{(b)} | Y, θ)    (14)

  B^{RR}_G = (1/(Bh)) Σ_{b=1}^B K( (z − H(Y_0^{(b)})) / h )    (15)
For the Bayesian approach, where S = S_2G:

  E[S_2G] = E[S_1G]
          + (1/2) κ_ijk κ^{i,j} κ^{k,l} E[ (1/B) Σ_{b=1}^B K_1( (z − H(Y_0^{(b)})) / h ) ∂/∂θ_l ln f(Y_0^{(b)} | Y, θ) ]
          + (1/2) κ^{i,j} E[ (1/B) Σ_{b=1}^B K_1( (z − H(Y_0^{(b)})) / h ) A_{S2G} ]
          + κ^{i,j} E[ (1/B) Σ_{b=1}^B K_1( (z − H(Y_0^{(b)})) / h ) ∂/∂θ_i ln f(Y_0^{(b)} | Y, θ) ] Q_j    (16)

where Q(θ) = log(π(θ)) from the Bayesian framework and

  A_{S2G} = ∂²/∂θ_i ∂θ_j ln f(Y_0^{(b)} | Y, θ) + ∂/∂θ_i ln f(Y_0^{(b)} | Y, θ) · ∂/∂θ_j ln f(Y_0^{(b)} | Y, θ)
Asymptotic Frequentist Correction: Alternative to Matching Prior

It is not necessary to find the exact form of the matching prior. Instead, construct an artificial predictor z̃_P, equivalent to the Bayesian predictor, as an alternative to solving Equation (13) by obtaining the matching prior. For percentile P, define z̃_P via the Laplace approximation:

  z̃_P = ẑ_P − n^{−1} [Equation (13)] / G′(G^{−1}(P))    (17)

where Equation (13) is the expression for the CPB. This is a function of the asymptotic bias, as seen in Equation (9) from Smith and Zhu, and is the analog of the univariate normal case.
Simulation

To compare the Laplace approximation technique to the standard plug-in approach using REML estimates, a simulation was constructed. Here we look specifically at the sum of the squares of predictions over multiple sites.

The simulation is run over N = 100 iterations and can be thought of as a double loop:
- The outer loop generates predictions by kriging with REML estimates.
- The inner loop uses kernel density estimation to obtain an empirical predictive distribution across the prediction sites and calculates the estimate using the Laplace approximation technique.
Simulation

n_1 sites with (s_1, s_2) coordinates are randomly generated and a random field Y(s) is generated of the form Y(s) = X^T(s) β + S(s).

- X(s) is a column vector with entries s_1, s_2; β = (1, 2)^T; and S(s) is a stationary Gaussian process with mean 0 and variance σ² = 1.
- The correlation function is parametrized by an exponential covariance structure: cov{Y(s_i), Y(s_j)} = σ² exp(−d_ij / φ)
- σ = 1.0 and the range parameter φ = 0.2.
Simulation

The n_1 Y values (n_1 = 30) are treated as observed across the original sites. An additional n_2 = 5 sites are generated, and the corresponding simulated field values Y_0 are treated as true values.

The objective of the simulation is to interpolate a non-linear, multivariate predictand H(Y_0) across the n_2 sites. The predictand here is the sum of squares across the n_2 sites:

  H(Y_0) = Σ_{i=1}^{n_2} Y_{0,i}²
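The field-generation step of this simulation design can be sketched as follows (a minimal Python sketch under the stated settings; the helper name `simulate_field` is ours, and sites are assumed to lie on the unit square):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_field(n, beta=(1.0, 2.0), sigma=1.0, phi=0.2):
    """Simulate Y(s) = X(s)^T beta + S(s) at n random sites on the unit square,
    with S a zero-mean Gaussian process with exponential covariance."""
    coords = rng.uniform(size=(n, 2))                # (s1, s2) coordinates
    X = coords.copy()                                # covariates: s1, s2
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    V = sigma**2 * np.exp(-d / phi)                  # exponential covariance
    S = rng.multivariate_normal(np.zeros(n), V)      # spatial error process
    return coords, X, X @ np.asarray(beta) + S
```

In the full simulation, the first n_1 = 30 values would be treated as observed and a further n_2 = 5 draws as the true values of Y_0.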
Simulation

Of interest is the empirical coverage of 95% theoretical prediction intervals generated using the Laplace method versus the plug-in method.

- The n_1 observed sites are used to obtain REML estimates of the range parameter φ and the scale parameter σ.
- The REML estimates are plugged in to the universal kriging methods to interpolate the sum-of-squares prediction over the n_2 sites.
- The Laplace approximation is calculated using the developed methodology.
Simulation Results: Non-Linear Plug-In vs Laplace Prediction Intervals, 1 Parameter

Empirical 95% prediction intervals for the plug-in method result in severe undercoverage: an Average Empirical Coverage (AEC) of 75.8% was found. The Laplace approximation technique resulted in an improvement, with an AEC of 78.7%.

Note that the Laplace approximation sometimes exhibits erratic behavior: the correction produced extremely large adjustments, possibly because the REML estimates hit the bounds set in the optimization algorithm and because asymptotic arguments are less reliable for small samples. It is therefore reasonable to expect that the Laplace approximation may improve even further over the plug-in method if this can be corrected, and this provides an interesting area for future study.
The empirical probabilities from the Laplace technique are plotted against the plug-in coverages. As expected, there is a strong positive correlation. The line y = x shows that the majority of Laplace coverages are larger than the plug-in coverages.
Simulation Results: Empirical Coverage Probabilities, 2 Parameter Case

Empirical 95% prediction intervals for the plug-in method result in undercoverage: an Average Empirical Coverage (AEC) of 91.9% was found. The Laplace approximation technique resulted in a slight improvement, with an AEC of 92.2%.

The empirical probabilities from the Laplace technique are plotted against the plug-in coverages. The line y = x shows that the empirical coverages of the Laplace approximation are larger than the plug-in coverages in about half of the simulated intervals. The plot shows evidence of greater improvement of the Laplace approximation over the plug-in method when the empirical coverage is low.
Conclusions

- Developed a practical method for analytical evaluation, including a bootstrap method to obtain predictions for general, non-linear predictands and incorporating kernel density estimation for an unknown or computationally difficult predictive distribution.
- The Laplace approximation technique showed improvement in the linear univariate case, with empirical coverage probabilities for its prediction intervals analogous to those of Bayesian prediction intervals, both showing very close agreement with the theoretical prediction coverage of 95%.
- The simulation for the multivariate non-linear predictand showed promising results for the Laplace approximation technique.
- The results also suggest the existence of a matching prior for non-linear predictands, with a form analogous to that derived for the univariate normal case in Smith and Zhu (2004).
Future Work

- Investigation of different prior specifications for the Bayesian and Laplace methods, specifically the Jeffreys prior and the reference prior
- Specific computation of the matching prior to achieve a second-order coverage probability bias of zero
- Application to data analysis, such as the square-root transformation of the PM 2.5 data considered in Smith, Kolenikov, and Cox (2003)
Thank You!