A Fused Lasso Approach to Nonstationary Spatial Covariance Estimation


Supplementary materials for this article are available at /s

Ryan J. Parker, Brian J. Reich, and Jo Eidsvik

Spatial data are increasing in size and complexity due to technological advances. For an analysis of a large and diverse spatial domain, simplifying assumptions such as stationarity are questionable and standard computational algorithms are inadequate. In this paper, we propose a computationally efficient method to estimate a nonstationary covariance function. We partition the spatial domain into a fine grid of subregions and assign each subregion its own set of spatial covariance parameters. This introduces a large number of parameters, and to stabilize the procedure we impose a penalty to spatially smooth the estimates. By penalizing the absolute difference between parameters for adjacent subregions, the solution can be identical for adjacent subregions and thus the method identifies stationary subdomains. To apply the method to large datasets, we use a block composite likelihood, which is natural in this setting because it also operates on a partition of the spatial domain. The method is applied to tropospheric ozone in the US, and we find that the spatial covariance on the west coast differs from the rest of the country.

Key Words: Spatial statistics; Nonstationary covariance; Regularization; Penalized likelihood.

1. INTRODUCTION

Spatial datasets continue to grow in size and complexity, with satellites, for example, being able to collect spatiotemporal data on a large scale. A fundamental task of a spatial analysis is estimating the spatial covariance function. A common simplifying assumption is stationarity, i.e., that the covariance function is the same for the entire spatial domain and the covariance of the process at two locations depends only on their relative locations.
However, for data collected over a large and diverse geographic domain, this assumption is untenable as we might expect a different covariance in different subregions (e.g., in coastal and mountainous regions). Therefore, our interest is in modeling a nonstationary covariance function for large spatial datasets.

Ryan J. Parker (B), SAS Institute, Cary, NC, USA (ryan.parker@jmp.com). Brian J. Reich, North Carolina State University, Raleigh, NC, USA. Jo Eidsvik, Norwegian University of Science and Technology, Trondheim, Norway.

International Biometric Society. Journal of Agricultural, Biological, and Environmental Statistics. DOI: /s

As a motivating example, we conduct a spatial analysis of ground-level ozone in the Continental US. As shown in Fig. 2, ozone is monitored at hundreds of locations and our objective is to interpolate ozone throughout the entire domain. Ozone maps are useful for regulation, conducting epidemiological studies, and monitoring temporal changes. Emission sources, atmospheric chemistry, and climate vary substantially across this large domain, and thus the stationarity assumption is questionable.

Because of their complex structure, a variety of techniques have been studied to model spatial data having a nonstationary covariance. One of these techniques involves deforming the spatial region so that the deformed space can be modeled with a stationary covariance model (Sampson and Guttorp 1992). Nychka and Saltzman (1998), among others, have proposed the use of empirical orthogonal functions (EOF) to model the nonstationary covariance function using eigenfunctions. Constructing the nonstationary process in terms of kernel convolutions was proposed by Higdon (1998). The kernels used in this technique allow for complex nonstationary structures, with Paciorek and Schervish (2006) using this approach to introduce a nonstationary covariance model that is a function of any stationary correlation function. Due to the simplicity of working with stationary covariance models, Fuentes (2002) proposes the use of a mixture of independent stationary processes, with mixture weights coming from a kernel function that operates on spatial location. This construction allows for the covariance function to be modeled as a weighted average of stationary covariance functions. A more recent approach from Bornn et al. (2012) proposes expanding the observation space into higher dimensions where the process is stationary, such as when only two dimensions of a three-dimensional environmental process are observed. Sampson (2010) provides a detailed review of nonstationary methods.
Typically nonstationary models are fit using datasets that are large enough that traditional computing methods are either intractable or can be replaced with more efficient methods. One of these techniques uses a covariance taper to introduce sparsity into the covariance matrix so that more efficient sparse matrix methods can be used for estimation and prediction (Furrer et al. 2006; Kaufman et al. 2008). Fixed rank kriging (Cressie and Johannesson 2008) decomposes the covariance matrix in terms of basis functions that are more tractable to work with than the full covariance matrix. Banerjee et al. (2008) propose the use of predictive process models that allow for a low rank covariance matrix to be used for computing (see also Finley et al. 2009). Lindgren et al. (2011) show that there is a link between Gaussian processes and Gaussian Markov random fields (GMRF) that can be used to exploit fast matrix operations under the GMRF. The use of approximate likelihoods has helped speed up computations (Stein et al. 2004; Fuentes 2007). Composite likelihood methods have also been shown to be successful for estimation and prediction, with the block composite likelihood (BCL, Eidsvik et al. 2014) in particular having computations that are linear in the number of observations.

We propose a method that models the nonstationary covariance using a penalized likelihood (Sect. 3) as an alternative to costly Bayesian computing. Penalized likelihoods have been used previously to model a nonstationary covariance, although these techniques differ from ours. Chang et al. (2010) formulate a constrained least squares approach for estimating the nonstationary covariance by using basis functions to represent the spatial process (Hsu et al. (2012) extend this for spatiotemporal data). Our technique is novel in that instead of using a semiparametric approach as Chang et al. (2010) have done, we work directly with covariance parameters from familiar stationary covariance models. Using the flexible nonstationary covariance of Paciorek and Schervish (2006) (see Sect. 2), we discretize the domain into subregions that each have a covariance parameter so that the covariance is allowed to vary over space. Our regularization comes through penalizing the difference between parameter values in neighboring subregions. When penalizing the absolute differences in these parameters, our method is capable of detecting stationary subdomains formed by neighboring subregions with the same covariance parameters.

We believe this approach offers a unique and new contribution when modeling spatial data having a nonstationary covariance. We further anticipate increased applicability of such rich nonstationary models as the data size and complexity continue to grow. Our technique is demonstrated through a simulation study (Sect. 5) and an analysis of tropospheric ozone (Sect. 6). We conclude with a discussion and identify areas of future research.

2. NONSTATIONARY COVARIANCE MODEL

Let $y_t(s)$ be the response at spatial location $s$ for replicate $t$. We consider observing $t = 1, \ldots, T$ replications, each at $n$ spatial locations $s_i = (s_{i1}, s_{i2})^T$, with $y_{ti} = y_t(s_i)$ denoting the observation at the $i$th location in replication $t$. The observations are assumed to be the realization of a Gaussian process with mean function $\mu(s_i; \beta)$ and covariance function $C(s_i, s_j; \theta)$, with independence between replications. As a result, $y_t = (y_{t1}, \ldots, y_{tn})^T$ has a multivariate normal distribution with mean vector $\mu(\beta)$ and covariance matrix $\Sigma(\theta)$. Here, $\beta$ and $\theta$ are the model parameters.
The log-likelihood of these observations is

$$l(\beta, \theta; y) = -\frac{1}{2} \sum_{t=1}^{T} \left\{ \log|\Sigma(\theta)| + [y_t - \mu(\beta)]^T \Sigma(\theta)^{-1} [y_t - \mu(\beta)] \right\}. \qquad (1)$$

A common stationary covariance model for $C$ is

$$C(s_i, s_j; \theta) = \tau^2 I(i = j) + \sigma^2 R_S\left(\sqrt{D_{ij}}\right), \qquad (2)$$

where the Mahalanobis distance $D_{ij}$ is

$$D_{ij} = (s_i - s_j)^T \Sigma^{-1} (s_i - s_j).$$

In this model, $\tau^2$ is the nugget (nonspatial variance), $\sigma^2$ is the partial sill (spatial variance), and $\Sigma$ controls the range of spatial correlation. The stationary correlation function $R_S(\sqrt{D_{ij}})$ that depends on the range can take a variety of forms, such as the familiar exponential correlation $R_S(\sqrt{D_{ij}}) = \exp(-\sqrt{D_{ij}})$.

Our interest is in the case when $\Sigma(\theta)$ has a nonstationary covariance structure (see, for example, Sampson 2010). We choose to consider the flexible nonstationary covariance model of Paciorek and Schervish (2006). In the Paciorek and Schervish model, the nugget $\tau^2$, partial sill $\sigma^2$, and nonstationary kernel matrices $\Sigma$ are allowed to vary over space. The covariance function is

$$C(s_i, s_j; \theta) = \tau^2(s_i) I(i = j) + \sigma(s_i)\sigma(s_j) R_{NS}(s_i, s_j). \qquad (3)$$

The nonstationary correlation function,

$$R_{NS}(s_i, s_j) = |\Sigma(s_i)|^{1/4} \, |\Sigma(s_j)|^{1/4} \, \left| \frac{\Sigma(s_i) + \Sigma(s_j)}{2} \right|^{-1/2} R_S\left(\sqrt{D_{ij}}\right),$$

is defined for any stationary correlation function $R_S(\cdot)$. In the nonstationary case, the Mahalanobis distance $D_{ij}$ is

$$D_{ij} = (s_i - s_j)^T \left( \frac{1}{2}\left[\Sigma(s_i) + \Sigma(s_j)\right] \right)^{-1} (s_i - s_j).$$

Hence stationarity is a special case of this correlation function when $\tau^2$, $\sigma^2$, and $\Sigma$ are the same at all locations.

The nonstationary kernel matrices $\Sigma(s)$ that model the range of spatial dependence can be constructed to model complex anisotropic correlation structures (such as in Higdon et al. or Neto et al. 2014). We, however, will consider the simpler isotropic form

$$\Sigma(s) = \begin{pmatrix} \phi^2(s) & 0 \\ 0 & \phi^2(s) \end{pmatrix}$$

that assumes the range does not have an angular component. The parameter $\phi(s)$ then determines the range of spatial correlation in the vicinity of $s$. Because the parameters in this model vary over space, the model allows for the estimation of an anisotropic covariance even with diagonal $\Sigma(s)$. For example, when the range parameter $\phi(s)$ varies over space, the correlation is different depending on the direction of the neighboring location. We do not estimate the angular component, however, so our parameterization may not estimate the true covariance well when the data exhibit an explicit angular component.

Fitting this model when the parameters vary over space is difficult. Paciorek and Schervish (2006) propose the use of a Gaussian process for each spatially varying parameter. This approach is costly due to the $O(n^3)$ operations required to update parameters at each iteration in Bayesian inference. Banerjee et al. (2008) apply the predictive process model to (3), but they assume that there are a small number of known subregions that have fixed covariance parameters instead of each site having a covariance parameter. As described below, our method can be thought of as a mixture between these two approaches: we reduce the number of covariance parameters by dividing the domain into a discrete set of regions, and we also allow these parameters to vary smoothly over space without knowing these subregions a priori.
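As an illustration of the covariance function (3), the following minimal sketch builds the covariance matrix for the isotropic kernel with the exponential correlation. It assumes two-dimensional sites, in which case the determinant prefactor of the nonstationary correlation reduces to $\phi_i \phi_j / [(\phi_i^2 + \phi_j^2)/2]$. The function name `ns_exp_cov` and its argument layout are hypothetical and are not taken from the paper.

```python
import math

def ns_exp_cov(sites, tau2, sigma2, phi):
    """Covariance matrix from (3) with isotropic kernels Sigma(s) = phi(s)^2 * I.

    Illustrative sketch only (names are assumptions, not the authors' code).
    sites  : list of (x, y) coordinates
    tau2   : nugget at each site
    sigma2 : partial sill at each site
    phi    : range parameter at each site
    """
    n = len(sites)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # (Sigma_i + Sigma_j) / 2 equals avg * I in the isotropic case.
            avg = 0.5 * (phi[i] ** 2 + phi[j] ** 2)
            dx = sites[i][0] - sites[j][0]
            dy = sites[i][1] - sites[j][1]
            d_ij = (dx * dx + dy * dy) / avg  # Mahalanobis distance D_ij
            # |Sigma_i|^(1/4) |Sigma_j|^(1/4) |avg * I|^(-1/2) in two dimensions.
            prefactor = phi[i] * phi[j] / avg
            corr = prefactor * math.exp(-math.sqrt(d_ij))
            cov[i][j] = math.sqrt(sigma2[i] * sigma2[j]) * corr
            if i == j:
                cov[i][j] += tau2[i]  # nugget enters only on the diagonal
    return cov
```

When all three parameter vectors are constant, the result reduces to the stationary model (2): the off-diagonal entries become $\sigma^2 \exp(-d/\phi)$ for Euclidean distance $d$.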

3. COVARIANCE REGULARIZATION

To reduce the number of parameters in the nonstationary covariance (3), we create a grid of $R$ subregions, $\mathcal{R}_1, \ldots, \mathcal{R}_R$, that partition the region of interest $S$ (see Fig. 2b). We assume that each subregion $r$ has a different set of covariance parameters $\tau_r^2$, $\sigma_r^2$, and $\phi_r$ so that, e.g., $\tau^2(s) = \tau_r^2$ if $s$ is in subregion $\mathcal{R}_r$. With this model construction, we can estimate covariance parameters that vary over space while reducing the total number of parameters needed to represent the covariance.

Because we want to consider a fine grid of subregions, we propose regularizing the covariance parameters to determine where we can reduce the total parameters needed to represent the nonstationary covariance. We penalize the difference of parameter values in neighboring subregions, and this penalty allows us to borrow information between the neighboring subregions.

To perform covariance regularization, we first place the unconstrained covariance parameters in the $3 \times R$ matrix $\theta$ so that $\theta_{1r} = \log(\tau_r^2)$, $\theta_{2r} = \log(\sigma_r^2)$, and $\theta_{3r} = \log(\phi_r)$. Now we maximize the penalized log-likelihood

$$l_P(\beta, \theta; y, \lambda) = l(\beta, \theta; y) - \sum_{i=1}^{3} \lambda_i \sum_{r_1 \sim r_2} |\theta_{ir_1} - \theta_{ir_2}|^q, \qquad (4)$$

where $r_1 \sim r_2$ denotes that subregions $r_1$ and $r_2$ are neighbors. This penalized model has a tuning parameter $\lambda_i$ for each covariance parameter type to control the similarity between parameters in neighboring subregions. When $\lambda_i = 0$, parameter $i$ corresponds to the full nonstationary model without a penalty on the parameter. When $\lambda_i = \infty$, parameter $i$ is forced to be the same in all regions, just as if we were using the stationary model.

The choice of $q$ determines the type of regularization. When $q = 1$ we penalize the $L_1$ norm of the differences in parameters, similar to variable fusion (Land and Friedman 1997) or the fused lasso (Tibshirani et al. 2005; Tibshirani and Taylor 2011). This penalty allows for the solution $\theta_{ir_1} = \theta_{ir_2}$ for some pairs of neighbors, forming a stationary subdomain. Hence this penalty is useful when identifying these subdomains is an objective of the analysis. On the other hand, when $q = 2$, we penalize the squared $L_2$ norm of the differences of parameter values. This penalty bears resemblance to a conditionally autoregressive (CAR, Besag et al. 1991) prior for $\theta_i$. This never gives estimates $\theta_{ir_1} = \theta_{ir_2}$, but it provides for easier computation and a potentially smoother surface of covariance parameter estimates.

4. COMPUTING DETAILS

Nonstationary models are typically fit to large datasets, and so evaluating the full likelihood in (1) is burdensome because of the $O(n^3)$ operations required when working with the large covariance matrix $\Sigma(\theta)$. Therefore we choose to use the block composite likelihood (BCL) of Eidsvik et al. (2014) to perform fast estimation for large $n$. We have chosen to use this technique because computation scales linearly with the number of observations $n$ and is parallelizable. Also, the full likelihood is a special case of the BCL framework that can be used when $n$ is small enough so that the $O(n^3)$ operations are not prohibitive.
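The penalty term in (4) can be sketched in a few lines. The example below assumes a regular grid of subregions with rook adjacency (subregions sharing an edge are neighbors), which is an assumption made for illustration since the neighbor structure is not fixed by the penalty itself. The helper names `grid_neighbors` and `fusion_penalty` are hypothetical.

```python
def grid_neighbors(nrow, ncol):
    """Neighbor pairs (r1, r2) on an nrow x ncol grid, assuming rook adjacency."""
    pairs = []
    for r in range(nrow):
        for c in range(ncol):
            k = r * ncol + c
            if c + 1 < ncol:
                pairs.append((k, k + 1))      # neighbor to the right
            if r + 1 < nrow:
                pairs.append((k, k + ncol))   # neighbor below
    return pairs

def fusion_penalty(theta, neighbors, lam, q):
    """Penalty in (4): sum_i lam_i * sum_{r1 ~ r2} |theta[i][r1] - theta[i][r2]|^q.

    theta     : 3 rows (log nugget, log partial sill, log range), one column per subregion
    neighbors : list of (r1, r2) index pairs
    lam       : one finite tuning parameter per row of theta
    q         : 1 for the fused-lasso penalty, 2 for squared differences
    """
    total = 0.0
    for theta_i, lam_i in zip(theta, lam):
        total += lam_i * sum(abs(theta_i[a] - theta_i[b]) ** q for a, b in neighbors)
    return total
```

The penalty is zero exactly when every row of `theta` is constant across neighboring subregions, which corresponds to the stationary model. The sketch expects finite `lam`; the limiting case $\lambda_i = \infty$ is better handled by constraining row $i$ to a single shared value.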

The BCL is formed by creating a grid of $B$ blocks, $\mathcal{B}_1, \ldots, \mathcal{B}_B$, that partition $S$. Note that the grid of blocks $\mathcal{B}$ can be different from the grid of subregions $\mathcal{R}$ used in the covariance regularization (Fig. 2c). The log-composite likelihood is then composed of the sum of joint log-likelihoods for all neighboring blocks. That is, the BCL is

$$l_{BCL}(\beta, \theta; y) = \sum_{b_1 \sim b_2} l\left[\beta, \theta; (y_{b_1}, y_{b_2})^T\right], \qquad (5)$$

where $y_b$ are the observations in block $b$. To use the BCL in our regularized model, we add the penalty term of (4) and maximize

$$l_{PBCL}(\beta, \theta; y, \lambda) = l_{BCL}(\beta, \theta; y) - \sum_{i=1}^{3} \lambda_i \sum_{r_1 \sim r_2} |\theta_{ir_1} - \theta_{ir_2}|^q. \qquad (6)$$

To perform parameter estimation, Algorithm 1 of Eidsvik et al. (2014) has been modified to account for the penalty term in (6). Estimation requires we compute derivatives of (6) with respect to each parameter in $\theta$, as the regularization parameter $\lambda$ is tuned using a validation set (see Sect. 4.1). For $L_2$ regularization, the derivatives of the penalty on the squared differences are straightforward to compute. To estimate under $L_1$ regularization, however, we must modify the penalty to alleviate the need to take the derivative of the absolute value terms. This modification is done by introducing parameters $\alpha$ to replace the $L_1$ term with an $L_2$ term as used in Lin and Zhang (2006) and Storlie et al. (2011), among others. With these new parameters, we maximize

$$l_{BCL}(\beta, \theta; y) - \sum_{i=1}^{3} \sum_{r_1 \sim r_2} \alpha_{ir_1r_2}^{-1} (\theta_{ir_1} - \theta_{ir_2})^2 - \sum_{i=1}^{3} \lambda_{0i} \sum_{r_1 \sim r_2} \alpha_{ir_1r_2}. \qquad (7)$$

Now we will show that the maximizer of (7) is the maximizer of (6) when $q = 1$. First, we will show that solving (6) implies solving (7). Suppose that $\hat\beta$ and $\hat\theta$ maximize (6) for a given $\lambda$. Let $\lambda_{0i} = \frac{1}{4}\lambda_i^2$ and $\hat\alpha_{ir_1r_2} = \lambda_{0i}^{-1/2} |\hat\theta_{ir_1} - \hat\theta_{ir_2}|$. It follows that

$$l_{BCL}(\hat\beta, \hat\theta; y) - \sum_{i=1}^{3} \sum_{r_1 \sim r_2} \hat\alpha_{ir_1r_2}^{-1} (\hat\theta_{ir_1} - \hat\theta_{ir_2})^2 - \sum_{i=1}^{3} \lambda_{0i} \sum_{r_1 \sim r_2} \hat\alpha_{ir_1r_2}$$
$$= l_{BCL}(\hat\beta, \hat\theta; y) - 2 \sum_{i=1}^{3} \lambda_{0i}^{1/2} \sum_{r_1 \sim r_2} |\hat\theta_{ir_1} - \hat\theta_{ir_2}|$$
$$= l_{BCL}(\hat\beta, \hat\theta; y) - \sum_{i=1}^{3} \lambda_i \sum_{r_1 \sim r_2} |\hat\theta_{ir_1} - \hat\theta_{ir_2}|.$$

Hence (6) equals (7) at the solution of (6).
To conclude, we will now show that solving (7) implies solving (6). The optimal $\alpha_{jkl}$'s in (7) satisfy $\alpha_{jkl}^{-2} (\theta_{jk} - \theta_{jl})^2 - \lambda_{0j} = 0$, and thus $\alpha_{jkl} = \lambda_{0j}^{-1/2} |\theta_{jk} - \theta_{jl}|$. Substituting these $\alpha$'s into (7), we have

$$l_{BCL}(\beta, \theta; y) - \sum_{i=1}^{3} \sum_{r_1 \sim r_2} \alpha_{ir_1r_2}^{-1} (\theta_{ir_1} - \theta_{ir_2})^2 - \sum_{i=1}^{3} \lambda_{0i} \sum_{r_1 \sim r_2} \alpha_{ir_1r_2}$$
$$= l_{BCL}(\beta, \theta; y) - 2 \sum_{i=1}^{3} \lambda_{0i}^{1/2} \sum_{r_1 \sim r_2} |\theta_{ir_1} - \theta_{ir_2}|.$$

By letting $\lambda_i = 2\lambda_{0i}^{1/2}$, we have (7) equivalent to (6) when $q = 1$. Hence maximizing (7) for $\beta$ and $\theta$ leads to the equivalent solution in (6). Therefore, we have shown that estimation in (6) and (7) is equivalent.

4.1. CHOOSING TUNING PARAMETERS $\lambda$, $\mathcal{R}$, AND $\mathcal{B}$

In this model we must make choices about three tuning parameters: (1) penalty $\lambda$, (2) subregion grid $\mathcal{R}_1, \ldots, \mathcal{R}_R$, and (3) grid of blocks $\mathcal{B}_1, \ldots, \mathcal{B}_B$ for the BCL. For a fixed subregion grid and grid of BCL blocks, we choose the penalty $\lambda$ using the maximum log-likelihood of a holdout set $y_h$, $l(y_h; \hat\beta, \hat\theta, y_f)$, where $y_f$ are the observations used to compute parameter estimates $\hat\beta$ and $\hat\theta$ (Bien and Tibshirani 2011). This maximum log-likelihood is computed over a single random holdout set or with cross-validation. Candidate penalty parameters are proposed using the algorithm below.

To choose $\lambda$, we use an algorithm in the spirit of gradient directed regularization (Friedman and Popescu 2003). We must first choose a descending grid for each $\lambda_i$, such as $g = (\infty, 250, 100, 25, 5, 0)^T$, which we use in our simulation study in Sect. 5. This grid allows us to fit models with the same parameter in each region ($g_j = \infty$) all the way down to models without any penalty ($g_j = 0$). To start we set each $\lambda_i = g_1$, corresponding to the stationary model. Next, we fit three different models by one-at-a-time setting $\lambda_j = g_2$ and leaving the other $\lambda_i = g_1$. If we find improvement (a larger log-likelihood) with one of these three models, then we choose the fit with the largest improvement and repeat this process of fitting three different models with the elements of $g$. Once we no longer find improvement, the algorithm terminates.
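The search over $\lambda$ described above can be sketched as a greedy, one-coordinate-at-a-time descent through the grid $g$. The sketch assumes that each $\lambda_i$ keeps its own position in $g$ and that a candidate moves exactly one position down; the text leaves this detail open, so this is one possible reading. The callback `holdout_loglik`, which would fit the model for a given $\lambda$ and return the holdout log-likelihood, is a hypothetical placeholder supplied by the user.

```python
import math

# Descending grid from the simulation study; math.inf encodes the stationary fit.
GRID = [math.inf, 250, 100, 25, 5, 0]

def choose_lambda(grid, holdout_loglik, n_params=3):
    """Greedy search for the penalty vector lambda.

    grid           : descending candidate values shared by all lambda_i
    holdout_loglik : user-supplied function mapping a list of lambdas to the
                     holdout log-likelihood of the fitted model (hypothetical)
    """
    idx = [0] * n_params                      # start at g_1 for every parameter
    best = holdout_loglik([grid[k] for k in idx])
    while True:
        candidates = []
        for j in range(n_params):
            if idx[j] + 1 < len(grid):        # move lambda_j one step down the grid
                trial = idx.copy()
                trial[j] += 1
                score = holdout_loglik([grid[k] for k in trial])
                candidates.append((score, trial))
        if not candidates:
            break
        score, trial = max(candidates, key=lambda c: c[0])
        if score <= best:                     # no improvement: terminate
            break
        best, idx = score, trial
    return [grid[k] for k in idx], best
```

Each pass costs up to three model fits, so the number of fits grows with the length of the grid rather than with the number of grid combinations.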
The subregion grid can also be chosen with the above holdout set ideas. One way to tune the subregion grid would be to start with one subregion, corresponding to stationarity, and continue to add subregions until no more improvement is found in the log-likelihood on the holdout set. Alternatively, features of the observation space may direct the researcher to compare competing subregions to find the best for their data. Ultimately this can be time consuming, and so we propose using a fine grid of subregions so that the penalization can approximate the subdomains of stationarity without costly tuning over many competing subregion grids.

Finally, we propose setting the grid of BCL blocks based on the recommendations of Eidsvik et al. (2014). Because these blocks are used as a way to speed up computing time, they propose choosing the smallest number of observations in each block while maintaining statistical efficiency. The size of the blocks may depend on the features of the correlation structure, such as a slower decay in spatial correlation (longer spatial range) potentially needing larger block sizes. We have found that 50 observations per block work well in our simulation study and real data example. Also, under more complex anisotropic structures, blocks may need to be rectangular in shape to account for the directional differences in the correlation structure.

5. SIMULATION STUDY

In this section, we design and analyze the results of a simulation study to better understand the performance of the regularized nonstationary model.

5.1. DESIGN

In this study, over $N = 100$ simulation replications, we generate data from a nonstationary covariance with the exponential correlation function. These samples are generated using a $2 \times 2$ square grid of four subregions in the unit square $[0, 1] \times [0, 1]$, with each subregion having a nugget, partial sill, and range parameter. We consider three different covariance structures so that one of the three parameters varies from subregion to subregion, while the other two parameters are the same in each subregion. For example, we will allow the nugget to vary from subregion to subregion but have the partial sill and range parameter equal in each subregion. This allows us to gain insight into how sensitive estimation and prediction performance are to varying a single covariance parameter.

Data are generated over $T = 10$ independent replications on an $n = 55 \times 55 = 3{,}025$ grid in the unit square. The mean at each location is $\mu_{ti} = \beta_0 + \beta_1 x_{ti}$ with $\beta_0 = 0$ and $\beta_1 = 1$. The covariates $x_{ti}$ are generated independent and identically distributed from a standard normal distribution. The three nonstationary covariance structures are listed in Table 1. We compare the performance of the actual covariance (Oracle) to that of estimates obtained from (1) assuming stationarity, (2) $L_1$ regularization ($q = 1$), and (3) $L_2$ regularization ($q = 2$).
For the regularized fits, we will compare an $R = 4 \times 4 = 16$ subregion grid to an $R = 5 \times 5 = 25$ subregion grid. These two choices for $R$ will allow us to compare the case when the chosen grid of subregions fits perfectly into the true covariance ($R = 16$) to the case when the subregions overlap the true covariance ($R = 25$). The models are fit using $n_f = 2{,}525$ of the observation locations, with $n_h = 500$ randomly selected for a validation set. For the nonstationary models, 500 of the remaining $n_f$ observations are randomly chosen and used as a holdout set for choosing the penalty $\lambda$. After holding out the validation set, we construct $B = 51$ blocks for the BCL so that we have approximately 50 observations in each block (these blocks are constructed using k-means clustering on the observation locations). When fitting the regularized models we will allow all parameters in each subregion to vary since we would not expect to know beforehand which were fixed to be the same in each subregion. This further allows us to evaluate how well our penalized model handles this situation.

Table 1. Nonstationary covariance structures used to simulate data with the nugget (left), partial sill (center), and range (right) varying by region.

Design | $\tau^2$ | $\sigma^2$ | $\phi$
Varying nugget | | |
Varying partial sill | | |
Varying range | | |

5.2. EVALUATING PERFORMANCE

We evaluate the performance of the methods in our simulation study as follows. The overall model fit for each method is evaluated by computing the log-likelihood of the data in the validation set. To compare the estimation of the overall covariance matrix, we compute the Frobenius norm, $\|\Sigma(\theta) - \Sigma(\hat\theta)\|_F$, of the difference between the actual and estimated covariance matrix for each method. The root mean squared error (RMSE) of the regression coefficient estimates of $\beta_j$,

$$\sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat\beta_j^{(i)} - \beta_j\right)^2},$$

is used to evaluate how well the methods estimate the effect of model covariates. The integrated RMSE of each covariance parameter vector, for example,

$$\sqrt{\frac{1}{N} \sum_{i=1}^{N} \frac{1}{R} \sum_{r=1}^{R} \left(\hat\tau_r^{2(i)} - \tau_r^2\right)^2}$$

for the nugget, allows us to identify how well the methods estimate each covariance parameter type. Finally, we use the prediction MSE and coverage of 90% prediction intervals for data in the validation set to evaluate the prediction power of each method.

5.3. RESULTS

The results for data generated with a varying nugget are shown in Table 2, varying partial sill in Table 3, and varying range in Table 4.
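The evaluation metrics used in the simulation study (Frobenius norm of the covariance error, RMSE, integrated RMSE, and prediction-interval coverage) can be sketched as follows. The function names and the list-based data layout are assumptions made for illustration, and the square root in the integrated RMSE is taken over the full double average.

```python
import math

def frobenius_diff(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))

def rmse(estimates, truth):
    """RMSE of a scalar parameter over simulation replications."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

def integrated_rmse(est_by_rep, truth_by_region):
    """RMSE averaged over replications and subregions.

    est_by_rep[i][r] is the estimate for subregion r in replication i.
    """
    n_rep, n_reg = len(est_by_rep), len(truth_by_region)
    total = sum((est[r] - truth_by_region[r]) ** 2
                for est in est_by_rep for r in range(n_reg))
    return math.sqrt(total / (n_rep * n_reg))

def coverage(lower, upper, observed):
    """Fraction of held-out observations that fall inside their prediction intervals."""
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, observed))
    return hits / len(observed)
```

For well-calibrated 90% prediction intervals, `coverage` should be close to 0.90 on the validation set.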
Where appropriate, we present relative differences (indicated by "(Rel.)") that are relative to either the oracle or stationary model. These allow us to easily compare the proportional differences between results for each method. Using paired t tests, we find that the differences in the log-likelihood and Frobenius norm are statistically significant when comparing the nonstationary and stationary models, giving us confidence that we are better estimating the full covariance in these cases. The differences in prediction MSE between the nonstationary and stationary models are also statistically significant in all cases.

Table 2. Simulation study results when the nugget varies over space.

Metric | Oracle | Stationary | Nonstationary $L_1$ | Nonstationary $L_2$ | Nonstationary $L_1$ | Nonstationary $L_2$
Log-likelihood difference | | | | | |
$\|\Sigma(\theta) - \Sigma(\hat\theta)\|_F$ (Rel.) | | 41.7 (1) | 17.5 (0.42) | 16.6 (0.40) | 23.9 (0.57) | 28.2 (0.68)
Integrated RMSE $\tau^2$ (Rel.) | | 0.80 (1) | 0.07 (0.08) | 0.08 (0.09) | 0.15 (0.19) | 0.10 (0.13)
Integrated RMSE $\sigma^2$ (Rel.) | | 0.02 (1) | 0.03 (1.74) | 0.03 (1.59) | 0.05 (2.73) | 0.03 (1.56)
Integrated RMSE $\phi$ (Rel.) | | (1) | (3.45) | (1.73) | (2.02) | (2.41)
RMSE $\beta_0$ (Rel.) | 0.03 (1) | 0.04 (1.10) | 0.04 (1.09) | 0.04 (1.09) | 0.04 (1.10) | 0.04 (1.10)
RMSE $\beta_1$ (Rel.) | (1) | (1.20) | (1.02) | (1.02) | (1.03) | (1.03)
Prediction MSE (Rel.) | (1) | (1.004) | (1) | 1.33 (1) | (1.001) | (1.001)
Prediction coverage (0.90) | | | | | |

The relative factors shown in parentheses highlight the proportional differences between the methods and the Oracle. The Oracle model uses the true values of the covariance parameters, so for metrics evaluating covariance parameters the differences are relative to the stationary model.

When the nugget varies (Table 2), we see that the nonstationary models outperform the stationary model for overall model fit (log-likelihood) and estimating the nonstationary covariance (Frobenius norm). The nonstationary models also do better estimating the slope, $\beta_1$, with the RMSE under the stationary model being 1.20 times that of the oracle, whereas the RMSE from the nonstationary fits is only 1.02 to 1.03 times that of the oracle. The integrated RMSE of $\tau^2$ clearly shows that the flexibility of the nonstationary models is needed to estimate this varying parameter, with the error from the nonstationary models being less than 20% of the size of the error from the stationary model, depending on the structure of the subregion grid and type of penalty. For the partial sill and range, the estimates from the nonstationary model are not as precise as the stationary model. The errors are not large from a practical standpoint, and this is reinforced by the overall estimate of the covariance matrix being much better for the nonstationary models. Each method does a good job predicting, and they all have the expected coverage for the 90% prediction intervals. In this case, the nonstationary models with either the $L_1$ or $L_2$ penalty perform well.

Table 3. Simulation study results when the partial sill varies over space.

Metric | Oracle | Stationary | Nonstationary $L_1$ | Nonstationary $L_2$ | Nonstationary $L_1$ | Nonstationary $L_2$
Log-likelihood difference | | | | | |
$\|\Sigma(\theta) - \Sigma(\hat\theta)\|_F$ (Rel.) | | (1) | 37.8 (0.16) | 50.7 (0.21) | (0.71) | (0.59)
Integrated RMSE $\tau^2$ (Rel.) | | 0.02 (1) | 0.02 (0.71) | 0.02 (0.80) | 0.03 (1.27) | 0.02 (0.94)
Integrated RMSE $\sigma^2$ (Rel.) | | 0.55 (1) | 0.04 (0.08) | 0.05 (0.09) | 0.18 (0.32) | 0.13 (0.23)
Integrated RMSE $\phi$ (Rel.) | | (1) | (1.06) | (1.46) | (14.1) | (6.83)
RMSE $\beta_0$ (Rel.) | 0.04 (1) | 0.07 (1.94) | 0.04 (1.13) | 0.04 (1.11) | 0.05 (1.43) | 0.05 (1.49)
RMSE $\beta_1$ (Rel.) | (1) | (1.28) | (1.04) | (1.04) | (1.07) | (1.08)
Prediction MSE (Rel.) | (1) | 0.96 (1.003) | (1.00) | (1.00) | 0.96 (1.003) | 0.96 (1.003)
Prediction coverage (0.90) | | | | | |

The relative factors shown in parentheses highlight the proportional differences between the methods and the Oracle. The Oracle model uses the true values of the covariance parameters, so for metrics evaluating covariance parameters the differences are relative to the stationary model.

When the partial sill varies (Table 3), the nonstationary models continue to perform well. The $L_1$ and $L_2$ regularization outperform the stationary model in terms of the log-likelihood and covariance matrix estimate, and as we expect the estimate of the partial sill is better for all of the nonstationary models. When using the overlapping grid of $R = 5 \times 5$, the nonstationary models have some difficulty estimating the range parameter when compared to the stationary model. Estimating the overall covariance is still better under these conditions though, indicating these individual parameter estimates do not adversely impact the full covariance estimate. For the $R = 4 \times 4$ grid, we do not experience the same difficulty, with the estimated nugget outperforming the stationary model. The estimate of the range is slightly better for the stationary model compared to this subregion grid, but again this is not of practical importance given the size of the errors and the much better estimate of the overall covariance. Prediction performance continues to be very similar across all methods.

Table 4. Simulation study results when the range varies over space.

Metric | Oracle | Stationary | Nonstationary $L_1$ | Nonstationary $L_2$ | Nonstationary $L_1$ | Nonstationary $L_2$
Log-likelihood difference | | | | | |
$\|\Sigma(\theta) - \Sigma(\hat\theta)\|_F$ (Rel.) | | (1) | 11.8 (0.11) | 17.3 (0.16) | 52.7 (0.50) | 60.3 (0.57)
Integrated RMSE $\tau^2$ (Rel.) | | 0.16 (1) | 0.01 (0.05) | 0.02 (0.28) | 0.01 (0.09) | 0.11 (0.69)
Integrated RMSE $\sigma^2$ (Rel.) | | 0.09 (1) | 0.01 (0.13) | 0.02 (0.28) | 0.02 (0.26) | 0.07 (0.80)
Integrated RMSE $\phi$ (Rel.) | | (1) | (0.12) | (0.15) | (0.26) | 0.01 (0.32)
RMSE $\beta_0$ (Rel.) | 0.02 (1) | 0.05 (3.01) | 0.02 (1.19) | 0.02 (1.22) | 0.03 (1.53) | 0.03 (1.63)
RMSE $\beta_1$ (Rel.) | (1) | (1.22) | (1.02) | (1.02) | (1.04) | (1.05)
Prediction MSE (Rel.) | (1) | (1.019) | (1.00) | (1.00) | (1.01) | (1.009)
Prediction coverage (0.90) | | | | | |

The relative factors shown in parentheses highlight the proportional differences between the methods and the Oracle. The Oracle model uses the true values of the covariance parameters, so for metrics evaluating covariance parameters the differences are relative to the stationary model.

When the range varies (Table 4), the nonstationary models perform the best compared to when the other parameters vary. Each of the covariance parameters is estimated very well in this case when compared to the stationary model, regardless of the subregion grid or penalty used. In the varying nugget and varying partial sill cases, one of the nonvarying parameters would be estimated better with the stationary model, but now both the $L_1$ and $L_2$ regularization give better estimates of each of these parameters. The prediction MSE continues to be similar in all cases, but this case also has the largest difference in prediction MSE between the stationary model and the nonstationary models, highlighting the importance of modeling the range properly for prediction.

Figure 1. The relative prediction MSE (stationary / $L_2$ prediction MSE) by location, grouped into a grid of locations, between the stationary model and the nonstationary model with $L_2$ regularization. Panels: (a) Varying Nugget, (b) Varying Partial Sill, (c) Varying Range. These highlight the differences between prediction MSE by location, with the southwest quadrant having larger differences relative to the rest of the space. All cases show this effect, with the largest effect when varying the nugget, then when varying the range, and finally a smaller effect when varying the partial sill.

Finally, we explore the prediction performance by location. In Fig. 1, we plot the relative prediction MSE between the stationary model and the nonstationary model with $L_2$ regularization. These show that the prediction performance varies by location in our study. The southwest quadrant in particular shows larger prediction MSE in the stationary model relative to the nonstationary fit, with the parameter values in this region being smaller than the others around it. Although the prediction error is similar on the whole, we do find that specific locations can have improved prediction performance by using the nonstationary model over the stationary model.

These results show that the nonstationary model estimates the true covariance much better than the stationary model, as we would expect, especially for the parameters that vary over space. For this blocked covariance structure, the nonstationary model also outperforms the stationary model when the grid of subregions does not perfectly fit into the true covariance. Nonstationary models (including the oracle model) do not notably improve prediction MSE compared to the stationary model in any of the three cases we considered. As others have found, such as Reich et al. (2011, Table 1), nonstationary models do not tend to dramatically improve prediction performance (further results are given in the supplemental materials).
We not only find similar results overall, but also find that there are locations where prediction performance is improved using the nonstationary model. We also find that the MSE for estimating the mean parameters β is smaller for the nonstationary models. Therefore, when the objective is parameter estimation or understanding how the covariance changes across space, the nonstationary method performs well; if the objective is only spatial prediction, the stationary model appears to be sufficient, with the exception of the locational differences we found.

In terms of computation, estimating the parameters of the nonstationary model takes longer than for the stationary model. In this study we are able to estimate the stationary model in roughly 2 min. The nonstationary model, however, requires somewhere between 35 and 60 min for $L_1$ regularization and min for $L_2$ regularization. Computing time should therefore also be a factor when deciding between models.
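The location-wise comparison behind Fig. 1 can be sketched as follows. This is a minimal Python illustration, not the authors' code: it assumes the locations lie in the unit square, and the function name `relative_mse_by_cell` and the grid size `n_cells` are hypothetical (the grid dimensions used in the study are not stated above).

```python
import numpy as np

def relative_mse_by_cell(coords, y_true, pred_stat, pred_ns, n_cells=4):
    """Ratio of stationary to nonstationary prediction MSE in each grid cell.

    coords is an (n, 2) array of locations, assumed to lie in [0, 1] x [0, 1].
    Returns an (n_cells, n_cells) array. Values above 1 favor the
    nonstationary fit; cells without locations are NaN.
    """
    coords = np.asarray(coords, dtype=float)
    # Map each coordinate to a cell index in 0..n_cells-1.
    idx = np.clip((coords * n_cells).astype(int), 0, n_cells - 1)
    # Squared prediction errors under each model.
    se_stat = (np.asarray(y_true, dtype=float) - np.asarray(pred_stat, dtype=float)) ** 2
    se_ns = (np.asarray(y_true, dtype=float) - np.asarray(pred_ns, dtype=float)) ** 2
    ratio = np.full((n_cells, n_cells), np.nan)
    for i in range(n_cells):
        for j in range(n_cells):
            in_cell = (idx[:, 0] == i) & (idx[:, 1] == j)
            if in_cell.any() and se_ns[in_cell].mean() > 0:
                ratio[i, j] = se_stat[in_cell].mean() / se_ns[in_cell].mean()
    return ratio

# Toy usage with made-up predictions at three locations on a 2 x 2 grid.
demo = relative_mse_by_cell(
    coords=[[0.1, 0.1], [0.2, 0.2], [0.9, 0.9]],
    y_true=[0.0, 0.0, 0.0],
    pred_stat=[2.0, 2.0, 1.0],
    pred_ns=[1.0, 1.0, 1.0],
    n_cells=2,
)
print(demo)
```

Plotting the returned array as a heat map gives a picture of the kind described for Fig. 1: cells with ratios well above one mark locations where the stationary model predicts worse than the regularized nonstationary model.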

6. ANALYZING THE SPATIAL COVARIANCE OF TROPOSPHERIC OZONE

To demonstrate how the regularized nonstationary model performs on real data, we examine monitored ozone data for July 2005 along with model output for ozone from CMAQ version (Appel et al. 2013). CMAQ is a deterministic model that combines meteorology and emissions data with atmospheric chemistry equations to estimate daily ozone levels. The ozone is measured hourly, and we analyze the daily maximum 8-h average measured in parts-per-billion (ppb). We restrict our analysis to the n = 709 sites that have data for all T = 31 days of the month. In this analysis, all latitude and longitude coordinates have been projected to kilometers (km) using the Lambert Conformal Conic (LCC) projection with parameters defined by CMAQ. More details about this dataset are provided in Reich et al. (2014).

In Fig. 2, we plot the observation locations (a), the subregion grid of parameters for the regularized nonstationary model (b), and the BCL grid used to speed up computing (c). The subregion grid (Fig. 2b) has R = 100 subregions. These subregions were created with k-means clustering on the observation locations, with the shape of each subregion formed by computing the Delaunay triangulation at the locations of the 100 centroids from k-means. We do this because the observation locations are irregularly spaced, so that there are not too many observation locations within any subregion. The BCL grid (Fig. 2c) was created in a similar way, except that the number of blocks was chosen to be B = 14 so that approximately 50 observation locations fall within each block.

Figure 2. Location of observations $s_1, \ldots, s_n$ (a), the subregion grid having a covariance parameter in each subregion $R_1, \ldots, R_R$ (b), and the BCL grid $B_1, \ldots, B_B$ (c).

For these data, we fit models with two different mean structures. The first uses the CMAQ covariate available at each observation location so that the mean on day t at location s is

$$\mu_t(s) = \beta_0 + \beta_1 \,\mathrm{CMAQ}_t(s). \tag{8}$$

This will allow us to estimate the residual nonstationary covariance after accounting for the CMAQ values, which may prove useful to modelers working to improve CMAQ. Alternatively, we consider a polynomial in the location $s = (s_1, s_2)$ so that we can remove any large-scale trends before modeling the nonstationary covariance. That is, we assume the mean is

$$\mu_t(s) = \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \beta_3 s_1^2 + \beta_4 s_2^2 + \beta_5 s_1 s_2, \tag{9}$$

which is constant for each day. This mean structure is of interest because it does not remove the effect of CMAQ, allowing us to estimate the marginal nonstationary covariance structure to gain insight into the effects of long-range transport or common point sources.

Each of these mean structures is fit with a stationary covariance, $L_1$ regularization, and $L_2$ regularization, giving us six models for comparison. We use the exponential covariance model, as Reich et al. (2014) demonstrate that the estimated smoothness parameter for the Matérn covariance is approximately 0.5, which is equivalent to the simpler exponential model.

To evaluate model performance, we create a validation set by holding out all 31 days of data for 70 sites, which is approximately 10% of our 709 observation locations. After fitting each model to the remaining data, we compute the log-likelihood (LL), prediction mean squared error (Pred MSE), and prediction standard deviation (Pred SD) for the data in the validation set. These are computed for all of the validation locations, for locations excluding California, and for locations in California only. The results are presented in Table 5.

Table 5. A comparison between the stationary (S), $L_1$ nonstationary, and $L_2$ nonstationary covariance models. The table compares the log-likelihood of the validation set (LL) along with prediction error (Pred MSE) and prediction standard deviations (Pred SD) over sites in the validation set for the entire United States (All), excluding California (Excluding CA), and only in California (Only CA). The prediction MSE and SD are computed on a centered and scaled version of the original data so that the observations have mean zero and variance one.
Columns: Mean | Cov | LL | All: Pred MSE, Pred SD | Excluding CA: Pred MSE, Pred SD | Only CA: Pred MSE, Pred SD
Rows: CMAQ mean with S, L1, and L2 covariance; Poly mean with S, L1, and L2 covariance

These results show that the nonstationary models are preferred to the stationary model, as the log-likelihood is larger for the $L_1$ and $L_2$ regularized models than for the stationary model. The overall prediction MSE is not significantly different between the stationary and nonstationary models, although this changes when looking at the west coast. In California specifically, we find that the $L_2$ model performs as much as 15% better for the model with CMAQ and 11% better for the polynomial mean models. As the parameter estimates in Fig. 3 suggest, this region of the country is where we find more nonstationary features, and these results show that we can indeed perform better for these more complex structures.

These plots of the parameter estimates also highlight an attractive feature of the $L_1$ regularization: the ability to fuse parameters in neighboring subregions to detect subdomains of stationarity.
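To make the fusion idea concrete, the sketch below evaluates difference penalties over adjacent subregions and counts the fused groups in a parameter vector. It is a hedged illustration only: the function names are hypothetical, the $L_2$ penalty is assumed here to be the sum of squared differences, and the tuning weight and any parameter transformation used in the actual method are omitted.

```python
import numpy as np

def fusion_penalties(theta, neighbors):
    """Return (L1, L2) difference penalties over adjacent subregion pairs.

    theta holds one parameter value per subregion; neighbors is a list of
    (r, q) index pairs for adjacent subregions. The L2 form (sum of squared
    differences) is an assumption of this sketch.
    """
    theta = np.asarray(theta, dtype=float)
    diffs = np.array([theta[r] - theta[q] for r, q in neighbors])
    return np.abs(diffs).sum(), (diffs ** 2).sum()

def count_fused_groups(theta, neighbors, tol=1e-8):
    """Count connected groups of adjacent subregions sharing one value."""
    theta = np.asarray(theta, dtype=float)
    parent = list(range(len(theta)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for r, q in neighbors:
        # Merge neighbors whose estimates are equal up to the tolerance.
        if abs(theta[r] - theta[q]) <= tol:
            parent[find(r)] = find(q)
    return len({find(i) for i in range(len(theta))})

# Toy chain of five subregions with one jump between the third and fourth.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
theta_hat = [1.0, 1.0, 1.0, 3.0, 3.0]
l1_pen, l2_pen = fusion_penalties(theta_hat, chain)
n_groups = count_fused_groups(theta_hat, chain)
print(l1_pen, l2_pen, n_groups)
```

In this toy chain the estimate has a single jump, so the five subregions collapse into two stationary subdomains. An absolute-difference penalty can produce exactly equal neighboring estimates of this kind, whereas a squared-difference penalty generally shrinks differences without setting them to zero.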
The nugget (a) and range (c) parameters are estimated to be the same over the entire country for the $L_1$ penalty, with the partial sill (b) estimates being fused for large regions, especially in the east. Instead of a unique parameter in each of the 100 subregions, we have a much smaller set to work with due to this parameter fusion.

Plots of the predictions using the polynomial mean (9) for each covariance model, along with differences relative to the stationary model, are shown in Fig. 4. Although the predictions are very similar between the models for the entire United States (a), the prediction standard deviations (b) vary between the stationary and regularized models. The southwest region in particular is noteworthy, as the prediction variance is smaller under the stationary model. Predictions for only California are shown in (c) along with associated prediction standard deviations (d). These highlight that the nonstationary models do a better job of detecting lower ozone levels in the southeastern region of the state. One other key difference is near the borders of New Mexico, Texas, and Mexico. We do not have many observations in this region, and the prediction uncertainty under the nonstationary model is high relative to the stationary model (Fig. 4b).
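The link between sparse observations and large prediction standard deviations can be illustrated with a small kriging sketch under the exponential covariance. This is a simplified illustration, not the fitted model: it assumes zero-mean residuals, a single stationary set of parameters (nugget, partial sill, range), and hypothetical function names.

```python
import numpy as np

def exp_cov(a, b, psill, range_):
    """Exponential covariance (without nugget) between (n, 2) and (m, 2) locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return psill * np.exp(-d / range_)

def krige(obs_xy, obs_z, new_xy, nugget, psill, range_):
    """Zero-mean simple kriging: prediction mean and SD at new locations."""
    obs_xy = np.asarray(obs_xy, dtype=float)
    new_xy = np.asarray(new_xy, dtype=float)
    obs_z = np.asarray(obs_z, dtype=float)
    # Covariance among observations, with the nugget on the diagonal.
    K = exp_cov(obs_xy, obs_xy, psill, range_) + nugget * np.eye(len(obs_xy))
    # Cross-covariance between observations and prediction locations, shape (n, m).
    k = exp_cov(obs_xy, new_xy, psill, range_)
    w = np.linalg.solve(K, k)  # kriging weights
    mean = w.T @ obs_z
    # Variance for predicting a new noisy observation at each location.
    var = (psill + nugget) - np.sum(k * w, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Two observations; predict between them and far away (coordinates in km).
obs_xy = [[0.0, 0.0], [1.0, 0.0]]
obs_z = [0.5, -0.2]
new_xy = [[0.5, 0.0], [50.0, 0.0]]
mean, sd = krige(obs_xy, obs_z, new_xy, nugget=0.1, psill=1.0, range_=1.0)
print(mean, sd)
```

Far from any observation the cross-covariances vanish, so the prediction reverts to the mean and the prediction SD approaches the square root of the nugget plus partial sill. Regions with few monitors therefore show the largest uncertainty, and the size of that uncertainty depends directly on the local covariance parameters.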

Figure 3. Estimated parameter values for the nugget (a), partial sill (b), and range (c) for the $L_1$ and $L_2$ nonstationary covariance models along with the CMAQ and polynomial mean functions.

Figure 4. Predictions and associated prediction standard deviations (in terms of ppb) for the United States (a, b) and California (c, d) using the stationary covariance along with the $L_1$ and $L_2$ regularized nonstationary covariance models. Also included are the differences in predictions (a and c) as well as the prediction SD ratios (b and d).

7. DISCUSSION

In this paper, we have developed a computationally efficient method for estimating a nonstationary spatial covariance function. Using a penalty to smooth the covariance parameters over space provides stability and, in the $L_1$ case, identifies stationary subdomains. For both simulated and real data, we find this method improves estimation of the covariance function compared to the stationary model.

Future work includes allowing for anisotropy and the Matérn covariance. These extensions would increase the computational burden, but would not require fundamental changes. Computation may be improved using parallel computing. Since the BCL sums over pairs of blocks, it is embarrassingly parallel. Eidsvik et al. (2014) achieve 100-fold speedups via parallelization in some stationary cases, and similar gains should be possible in our nonstationary setting.

We have assumed in this paper that the data consist of several independent realizations from the spatial process. The method can also be applied to a single realization. Also, we conjecture that our method will produce reasonable estimates (with questionable uncertainty measures) even if the realizations are dependent over time with separable spatiotemporal covariance, so that each realization has the same spatial covariance. In the case of strong temporal dependence, a possibility is to analyze the differences between adjacent time points, which may be approximately independent. However, we concede that our approach is inappropriate for estimating nonstationary and nonseparable spatiotemporal covariance functions, and we leave this as future work.

[Received June Accepted April 2016.]

REFERENCES

Appel, K., Pouliot, G., Simon, H., Sarwar, G., Pye, H., Napelenok, S., Akhtar, F., and Roselle, S. (2013), Evaluation of dust and trace metal estimates from the Community Multiscale Air Quality (CMAQ) model version 5.0, Geoscientific Model Development, 6.
Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008), Gaussian predictive process models for large spatial data sets, Journal of the Royal Statistical Society: Series B, 70.
Besag, J., York, J., and Mollié, A. (1991), Bayesian image restoration, with two applications in spatial statistics, Annals of the Institute of Statistical Mathematics, 43.
Bien, J. and Tibshirani, R. J. (2011), Sparse estimation of a covariance matrix, Biometrika, 98.

Bornn, L., Shaddick, G., and Zidek, J. V. (2012), Modeling nonstationary processes through dimension expansion, Journal of the American Statistical Association, 107.
Chang, Y.-M., Hsu, N.-J., and Huang, H.-C. (2010), Semiparametric estimation and selection for nonstationary spatial covariance functions, Journal of Computational and Graphical Statistics, 19.
Cressie, N. and Johannesson, G. (2008), Fixed rank kriging for very large spatial data sets, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70.
Eidsvik, J., Shaby, B. A., Reich, B. J., Wheeler, M., and Niemi, J. (2014), Estimation and prediction in spatial models with block composite likelihoods, Journal of Computational and Graphical Statistics, 23.
Finley, A. O., Sang, H., Banerjee, S., and Gelfand, A. E. (2009), Improving the performance of predictive process modeling for large datasets, Computational Statistics & Data Analysis, 53.
Friedman, J. and Popescu, B. E. (2003), Gradient directed regularization for linear regression and classification, Tech. rep., Statistics Department, Stanford University.
Fuentes, M. (2002), Spectral methods for nonstationary spatial processes, Biometrika, 89.
Fuentes, M. (2007), Approximate likelihood for large irregularly spaced spatial data, Journal of the American Statistical Association, 102.
Furrer, R., Genton, M. G., and Nychka, D. (2006), Covariance tapering for interpolation of large spatial datasets, Journal of Computational and Graphical Statistics, 15.
Higdon, D. (1998), A process-convolution approach to modelling temperatures in the North Atlantic Ocean, Environmental and Ecological Statistics, 5.
Higdon, D., Swall, J., and Kern, J. (1999), Non-stationary spatial modeling, in Bayesian Statistics 6.
Hsu, N.-J., Chang, Y.-M., and Huang, H.-C. (2012), A group lasso approach for non-stationary spatial-temporal covariance estimation, Environmetrics, 23.
Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008), Covariance tapering for likelihood-based estimation in large spatial data sets, Journal of the American Statistical Association, 103.
Land, S. R. and Friedman, J. H. (1997), Variable fusion: A new adaptive signal regression method, Tech. rep., Department of Statistics, Carnegie Mellon University, Pittsburgh.
Lin, Y. and Zhang, H. H. (2006), Component selection and smoothing in smoothing spline analysis of variance models, Annals of Statistics, 34.
Lindgren, F., Rue, H., and Lindström, J. (2011), An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73.
Neto, J. H. V., Schmidt, A. M., and Guttorp, P. (2014), Accounting for spatially varying directional effects in spatial covariance structures, Journal of the Royal Statistical Society: Series C (Applied Statistics), 63.
Nychka, D. and Saltzman, N. (1998), Design of air-quality monitoring networks, in Case Studies in Environmental Statistics, Springer.
Paciorek, C. J. and Schervish, M. J. (2006), Spatial modelling using a new class of nonstationary covariance functions, Environmetrics, 17.
Reich, B. J., Chang, H. H., and Foley, K. M. (2014), A spectral method for spatial downscaling, Biometrics, 70.
Reich, B. J., Eidsvik, J., Guindani, M., Nail, A. J., and Schmidt, A. M. (2011), A class of covariate-dependent spatiotemporal covariance functions, The Annals of Applied Statistics, 5.
Sampson, P. D. (2010), Constructions for nonstationary spatial processes, in Handbook of Spatial Statistics, eds. Gelfand, A. E., Diggle, P. J., Fuentes, M., and Guttorp, P., CRC Press, chap. 9.
Sampson, P. D. and Guttorp, P. (1992), Nonparametric estimation of nonstationary spatial covariance structure, Journal of the American Statistical Association, 87.
Stein, M. L., Chi, Z., and Welty, L. J. (2004), Approximating likelihoods for large spatial data sets, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66.
Storlie, C. B., Bondell, H. D., Reich, B. J., and Zhang, H. H. (2011), Surface estimation, variable selection, and the nonparametric oracle property, Statistica Sinica, 21, 679.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005), Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society: Series B, 67.
Tibshirani, R. J. and Taylor, J. (2011), The solution path of the generalized lasso, Annals of Statistics, 39.
