Practical 12: Geostatistics

This practical introduces basic tools for geostatistics in R. You may first need to install and load a few packages. The packages sp and lattice contain useful functions and structures for managing spatially distributed data. The package gstat provides tools for the analysis of geostatistical data.

Meuse data

We consider a classical dataset in geostatistics, which is available in the sp package. The dataset consists of 155 samples of topsoil heavy metal concentrations (ppm), along with a number of soil and landscape variables. The samples were collected in a flood plain of the river Meuse, near the village of Stein (The Netherlands).

    library(sp)
    data(meuse)
    head(meuse)
    coordinates(meuse) <- c('x','y')

You can see that the dataset reports the geographical coordinates (x and y) as well as the measurements associated with each data point. The coordinates function above tells R which columns correspond to the coordinates; this changes the nature of the dataset, which is now treated as a SpatialPointsDataFrame, i.e. a data frame with associated spatial locations. This is one of the data structures available in R for spatial data and it is provided by the package sp. If we now plot the data, R shows the locations of the data points.

    plot(meuse)

Other types of plots are available to explore the other variables in the dataset, for example the quantity of zinc in the soil:

    spplot(meuse,'zinc',do.log=TRUE)
    bubble(meuse,'zinc',do.log=TRUE)

Can you interpret the output of these plots? (Check the help if needed.)

While the SpatialPointsDataFrame structure is usually preferred for geostatistical data, other types of structure are available in the sp package. For example, the river borders can be described using SpatialPolygons and plotted together with the meuse dataset for better data visualisation.
    data(meuse.riv)
    meuse.lst <- list(Polygons(list(Polygon(meuse.riv)), "meuse.riv"))
    meuse.sr <- SpatialPolygons(meuse.lst)
    plot(meuse.sr, col = "grey")
    plot(meuse, add = TRUE)

Looking at the geography, can you suggest any interpretation for the variation in the quantity of zinc?

To explore the spatial dependence, we can first plot the semivariogram cloud, i.e. the empirical semivariogram for all the distances observed in the dataset (we first transform the quantity of zinc to the log scale).

    library(gstat)
    cld <- variogram(log(zinc) ~ 1, data=meuse, cloud = TRUE)
    plot(cld, main = 'Semivariogram cloud')

or, better, the binned semivariogram.

    svgm <- variogram(log(zinc) ~ 1, width=100, data=meuse)
    plot(svgm, main = 'Binned Semivariogram',pch=19)

The parameter width controls the bandwidth, i.e. the width of the distance intervals over which the cloud is averaged; try changing the value and see what happens. It is also possible to include covariates in the formula (in place of 1), to account for a drift term in the model.

Does the plot of the binned semivariogram suggest the presence of spatial dependence? If yes, does the process appear to be (second order) stationary?
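As a rough illustration of what the semivariogram cloud and its binned version compute, the following base R sketch builds both by hand on simulated toy data. The data, the function name binned_semivariogram and the cutoff of 500 are assumptions made for this example; it is not the gstat implementation, and the defaults gstat uses for the cutoff may differ.

```r
# Toy data (hypothetical): random locations and values, not the meuse data
set.seed(1)
n  <- 60
xy <- cbind(x = runif(n, 0, 1000), y = runif(n, 0, 1000))
z  <- rnorm(n)

# Semivariogram cloud: one point per pair of sites (i < j)
d   <- as.matrix(dist(xy))
idx <- which(upper.tri(d), arr.ind = TRUE)
h   <- d[idx]                                   # separation distances
sv  <- 0.5 * (z[idx[, 1]] - z[idx[, 2]])^2      # half squared differences

# Binned semivariogram: average the cloud within distance classes
binned_semivariogram <- function(h, sv, width, cutoff) {
  keep <- h <= cutoff
  bin  <- cut(h[keep], breaks = seq(0, cutoff, by = width),
              include.lowest = TRUE)
  data.frame(np    = as.vector(table(bin)),              # pairs per bin
             dist  = as.vector(tapply(h[keep], bin, mean)),  # mean distance
             gamma = as.vector(tapply(sv[keep], bin, mean))) # mean semivariance
}

emp <- binned_semivariogram(h, sv, width = 100, cutoff = 500)
print(emp)
```

Changing width in this sketch shows the same trade-off as in the variogram call: narrow bins give a noisier curve based on fewer pairs, wide bins give a smoother curve with less detail at short distances.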
We can then fit a parametric semivariogram model to the binned sample semivariogram, via weighted least squares, using the vgm and fit.variogram functions. The vgm function provides the expressions for a variety of parametric semivariogram models. You can type vgm() for a complete list. Let us now fit a spherical semivariogram.

    sph.model<-fit.variogram(svgm, vgm(psill=0.6, "Sph", range=800, nugget=0.2))
    plot(svgm,sph.model)
    sph.model
    #   model      psill    range
    # 1   Nug 0.06114813   0.0000
    # 2   Sph 0.58610673 933.4006

We needed to specify initial values for the model parameters in the optimization algorithm: the partial sill (0.6), the range (800) and the nugget (0.2). Reasonable choices for these starting values can be obtained by looking at the binned variogram plot. In particular, the algorithm may fail if you select completely unreasonable values for the (effective) range.

The estimated nugget is $\hat{\tau}^2 = 0.0611$, the estimated sill is 0.0611 + 0.586 = 0.647 (psill stands for partial sill; the partial sills are the components that add up to the sill) and the estimated range is $\hat{a} = 933$.

Try now to fit an exponential semivariogram model.

    exp.model<-fit.variogram(svgm, vgm(0.6, "Exp", 800, 0.2))
    plot(svgm,exp.model)
    exp.model
    #   model      psill    range
    # 1   Nug 0.01429689   0.0000
    # 2   Exp 0.71486260 477.2015

What are the estimated nugget, sill and effective range? Which one better fits the binned semivariogram?

Kriging

Let us consider again the Meuse dataset, now with the aim of reconstructing a smooth surface of the (logarithm of) lead concentration. Let us start by assuming that the process is stationary and follows the spherical semivariogram model seen above. There are two alternative ways to obtain the kriging prediction: using the krige function, or fitting a model with gstat and then calling predict on that model. Both perform the same mathematical operations (actually, krige is just a wrapper for gstat and predict).
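To make those mathematical operations concrete, here is a minimal base R sketch of simple kriging at a single location, assuming a spherical semivariogram with nugget. The function names (sph_gamma, sph_cov, simple_krige), the four toy sites and the mean value are hypothetical; the semivariogram parameters are rounded from the spherical fit reported above. It illustrates the standard formulas and is not the code that gstat runs.

```r
# Spherical semivariogram: 0 at h = 0, nugget + psill beyond the range rng
sph_gamma <- function(h, nugget, psill, rng) {
  g <- ifelse(h < rng,
              psill * (1.5 * h / rng - 0.5 * (h / rng)^3),
              psill)
  ifelse(h > 0, nugget + g, 0)
}

# Covariance of a second-order stationary process: C(h) = sill - gamma(h)
sph_cov <- function(h, nugget, psill, rng) {
  (nugget + psill) - sph_gamma(h, nugget, psill, rng)
}

# Simple kriging at location s0, with the mean mu treated as known
simple_krige <- function(s0, obs, z, mu, nugget, psill, rng) {
  C  <- sph_cov(as.matrix(dist(obs)), nugget, psill, rng)  # data covariances
  h0 <- sqrt(colSums((t(obs) - s0)^2))                     # distances to s0
  c0 <- sph_cov(h0, nugget, psill, rng)                    # data-to-s0 covariances
  w  <- solve(C, c0)                                       # kriging weights
  list(weights = w,
       pred    = mu + sum(w * (z - mu)),                   # prediction
       var     = (nugget + psill) - sum(w * c0))           # prediction variance
}

# Toy data (hypothetical): four sites on a square, predict at the centre
obs <- cbind(x = c(0, 300, 0, 300), y = c(0, 0, 300, 300))
z   <- c(4.2, 4.8, 5.1, 4.5)
res <- simple_krige(c(150, 150), obs, z, mu = 4.6,
                    nugget = 0.061, psill = 0.586, rng = 933)
print(res)
```

Because the prediction site is equidistant from the four observations, the weights in this toy example are all equal; predicting at one of the observed sites instead returns the observed value with zero variance.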
Let us consider the simple kriging prediction first (we pretend to know the true mean, even if we estimate it from the data).
    # grid for prediction
    data(meuse.grid)
    coordinates(meuse.grid) <- c('x','y')
    meuse.grid <- as(meuse.grid, 'SpatialPixelsDataFrame')
    beta.hat <- mean(log(meuse$lead)) # assumed known
    # simple kriging prediction
    lz.sk <- krige(log(lead)~1, meuse, meuse.grid, sph.model, beta = beta.hat)
    plot(lz.sk)
    # spplot shows the prediction variance as well
    # (although the colour scale is not great)
    spplot(lz.sk, col.regions=bpy.colors(n = 100, cutoff.tails = 0.2))

The ordinary kriging prediction can be easily obtained by not providing a mean value:

    ## ordinary kriging
    lz.ok <- krige(log(lead)~1, meuse, meuse.grid, sph.model)
    plot(lz.ok)

Looking at the concentration of lead and the position of the observations with respect to the river, we see that the lead concentration appears to change with the distance from the river (which is one of the variables in the dataset, dist). Let us first consider the scatterplot of log(lead) against dist.

    plot(meuse$dist,log(meuse$lead))

This suggests fitting a universal kriging model where the mean is a function of the distance from the river (possibly a square root, looking at the scatterplot).
    lead.gstat <- gstat(id = 'lead', formula = log(lead) ~ sqrt(dist),
                        data = meuse, model=sph.model)
    lead.gstat
    lead.uk <- predict(lead.gstat, newdata = meuse.grid)
    plot(lead.uk)

The gstat function stores the formula, the data and the supplied variogram model (here sph.model, which is used as the residual variogram and is not re-estimated). The predict function then estimates the drift by generalized least squares and computes the universal kriging predictor at the new locations. It is also possible to get an evaluation of the drift using the option BLUE=TRUE:
    lead.trend<-predict(lead.gstat,newdata = meuse.grid,BLUE=TRUE)
    plot(lead.trend, main='drift')

You can also get the predicted value of the field, or the estimated trend, at a specific new location (but note that you also need to provide the corresponding values of the predictors in the non-stationary mean model, in this case the distance from the river).

    new_obs<-data.frame(x=179660,y=331860,dist=0.124805)
    coordinates(new_obs)<-c('x','y')
    predict(lead.gstat,newdata=new_obs)
    # [using universal kriging]
    #        coordinates lead.pred  lead.var
    # 1 (179660, 331860)   4.59488 0.1460198
    predict(lead.gstat,newdata=new_obs,BLUE=TRUE)
    # [generalized least squares trend estimation]
    #        coordinates lead.pred   lead.var
    # 1 (179660, 331860)   4.92977 0.03846024

Radioactivity data

The file radioactivity (read below into the object radio) reports information on 158 control units in the area around a nuclear power plant. At each site, the available data consist of: radioactivity level [Bq], longitude [Long], latitude [Lat] and type of soil [Soil], a factor with two levels (U, urban, and V, vegetation).

    filepath <- "http://www.statslab.cam.ac.uk/~tw389/teaching/slp18/data/"
    filename <- "radioactivity"
    radio <- read.table(paste0(filepath, filename), header=TRUE)
    head(radio)
    coordinates(radio) <- c('Long', 'Lat')
    spplot(radio, 'Soil')
    spplot(radio, 'Bq')
Explore the data graphically and comment on how radioactivity and type of soil are related in space.

Fit a linear model with the soil as predictor, assuming a spherical semivariogram for the errors. Try a semivariogram model both with and without nugget. Which one is preferable?

Write down the algebraic form of the fitted model. What are the estimates of the semivariogram parameters? What are the estimates of the coefficients of the linear model (hint: think about how they relate to the predicted drift)?

Using the chosen model, predict the radioactivity level at the location (Long = 78.59, Lat = 35.34), which is a parking lot. Estimate the variance of the prediction error at the same location.
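The hint above relies on the link between the drift and the regression coefficients. The following base R sketch illustrates that link on hypothetical toy data with a two-level soil factor, assuming a spherical covariance without nugget and made-up parameter values. The names gls_beta and sph_cov0 are introduced here for illustration only; the numbers have nothing to do with the radioactivity data, and the sketch shows the generalized least squares formula rather than the gstat internals.

```r
# Toy data (hypothetical): five sites, soil type U or V, and a response
xy   <- cbind(x = c(0, 1, 2, 0, 2), y = c(0, 0, 0, 1, 1))
soil <- factor(c("U", "V", "U", "V", "V"))
z    <- c(10, 14, 11, 15, 13)

# Spherical covariance without nugget: psill at h = 0, zero beyond rng
sph_cov0 <- function(h, psill, rng) {
  ifelse(h < rng, psill * (1 - 1.5 * h / rng + 0.5 * (h / rng)^3), 0)
}
Sigma <- sph_cov0(as.matrix(dist(xy)), psill = 2, rng = 1.5)

# Design matrix with treatment contrasts: intercept and indicator of soil V
X <- model.matrix(~ soil)

# GLS estimate: beta = (X' Sigma^-1 X)^-1 X' Sigma^-1 z
gls_beta <- function(X, z, Sigma) {
  SiX <- solve(Sigma, X)                       # Sigma^-1 X
  drop(solve(t(X) %*% SiX, t(SiX) %*% z))
}
beta <- gls_beta(X, z, Sigma)

# Drift implied at a U site and at a V site
drift_U <- unname(beta[1])
drift_V <- unname(beta[1] + beta[2])
print(c(U = drift_U, V = drift_V))
```

With this parametrisation the intercept is the drift at an urban site and the second coefficient is the difference between the vegetation and urban drifts, so both can be read off from drift values predicted at one site of each type.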