Spatial and Environmental Statistics

Dale Zimmerman
Department of Statistics and Actuarial Science
University of Iowa
January 17, 2019

Overview

1. Spatial Datasets
2. What is Spatial (and Environmental) Statistics?
3. Three Important Types of Spatial Data

Spatial Datasets

Wet deposition of SO4 (g/m^2) in 1987 at National Acid Deposition Program sites.

[Figure: map of the monitoring sites; axes: long (horizontal) and lat (vertical); legend: so4dep, from 0 to 5.]

Coal ash samples from a mine in Pennsylvania.

[Figure: measured values printed at their sampling locations on a grid; axes: coal.ash$x (horizontal) and coal.ash$y (vertical).]

Presence (black) or absence (white) of Atriplex hymenelytra on a grid of quadrats in Death Valley, CA.

Population-adjusted mortality rates due to SIDS in counties of North Carolina, 1974-1978.

Locations of Japanese pines, redwood saplings, biological cells, and scouring rushes in various study areas.

[Figure: four point-pattern panels titled Pines, Redwoods, Cells, and Rushes.]

What is Spatial Statistics?

Basic ingredients:
- Observations on one or more response variables are taken at multiple, identifiable sites in some spatial domain.
- Locations of these sites are observed and are attached, as labels, to the observations.
- An analysis of the observations is performed, in which the spatial locations of sites are taken into account.
- Either the observations or the spatial locations (or both) are modelled as random variables, and inferences are made about these models and/or about additional unobserved variables.

Thus, spatial statistics includes any investigation in which the data's spatial locations play a role in a probabilistic or statistical analysis (we will emphasize the statistical).

Spatial statistics is a vast subject, in large part because spatial data are of so many different types.

The response variable may be:
- univariate or multivariate
- categorical or continuous
- real-valued (numerical) or not real-valued (e.g., set-valued)
- observational or experimental

The data locations may:
- be points, regions, or something else
- be regularly or irregularly spaced
- be regularly or irregularly shaped
- belong to a Euclidean or non-Euclidean space (e.g., a river network)

The mechanism that generates the data locations may be:
- known or unknown
- random or non-random
- related or unrelated to the processes that govern the responses

Related subjects:
- Time series analysis
- Reliability/survival analysis
- Longitudinal data analysis

Three Important Types of Spatial Data

1. Geostatistical data

The response variable exists at every point in the study region; however, we observe the response at only a finite number of points, or at a finite number of subregions that are small relative to the spacing between them.

Examples:
(a) Annual acid rain deposition in the U.S.
(b) Richness of iron ore within an ore body

2. Areal (sometimes called lattice) data

The response variable exists and is observed only on a finite set of (usually contiguous) subregions within the study region.

Examples:
(a) Presence or absence of a plant species in square quadrats over a study area
(b) Numbers of deaths due to SIDS in the counties of North Carolina
(c) Pixel values from remote sensing (satellites)

3. Spatial point patterns

Data are the spatial locations of point events within the study region. No response variable is observed at the locations.

Examples:
(a) Locations of Equisetum arvense plants at a marsh edge: evidence of an environmental gradient?
(b) Locations of lunar craters: meteor impacts or volcanism?
(c) Locations of residences of individuals with lung cancer within 50 miles of a large incinerator: does disease risk increase with proximity to the incinerator?

A more general kind of spatial point pattern is a marked spatial point pattern, in which a nontrivial response variable (called the mark) is observed at each point. If the mark is discrete, we have a multivariate spatial point pattern.

The distinctions between these three types are not always clear-cut. In particular, areal data and geostatistical data have many similarities.

In a sense, areal data are not as refined as geostatistical data or spatial point patterns since you can obtain areal data by various reductions (integration or counting) of the other two.

In addition to indicating some prototypes of spatial data, the examples listed above indicate the breadth of disciplines in which scientific inquiry is concerned with spatial data.
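The "reduction by counting" mentioned above can be illustrated with a small sketch. It assumes a point pattern in the unit square and a hypothetical helper, quadrat_counts, that bins the points into a grid of quadrats; the simulated points are illustrative only and are not one of the datasets shown earlier.

```python
import numpy as np

def quadrat_counts(points, n_rows, n_cols):
    """Count the points of a pattern in the unit square that fall in each quadrat
    of an n_rows x n_cols grid. Row index follows y, column index follows x."""
    x, y = points[:, 0], points[:, 1]
    counts, _, _ = np.histogram2d(
        y, x, bins=[n_rows, n_cols], range=[[0.0, 1.0], [0.0, 1.0]]
    )
    return counts.astype(int)

# Illustrative point pattern: 200 uniformly scattered points (an assumption).
rng = np.random.default_rng(0)
points = rng.uniform(size=(200, 2))

counts = quadrat_counts(points, 4, 4)  # areal count data on a 4 x 4 grid
presence = counts > 0                  # areal presence/absence data
```

The same pattern thus yields two areal datasets, counts and presence/absence, both coarser than the original point locations.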

Spatial Statistics: Why Bother?

There are two main reasons why we bother with spatial statistics for spatial data (instead of just using classical statistics):
1. Characterizing the spatial structure of the data may be of direct interest.
2. The spatial structure may not be of direct interest, but modeling or otherwise accounting for it may improve other inferences.

More details on the first reason:

The observations are suspected of having a coherent spatial structure, the characterization of which may be important. The kinds of spatial structure that may occur vary across types, but there are some commonalities.

It has been observed over and over again in practice that observations taken at sites close together tend to be more alike than observations taken at sites far apart. In the spatial context, this is sometimes called the First Law of Spatial Statistics.

This law can manifest through either the large-scale (global) structure or the small-scale (local) structure, or both.

Large-scale structure:
- Mean function of a geostatistical process
- Mean vector of an areal process
- Intensity of a spatial point process

Small-scale structure:
- Variogram or covariance function of a geostatistical process
- Neighbor weights for an areal process
- Ripley's K-function, second-order intensity, and nearest-neighbor functions for a spatial point process

Two important types of spatial structure are stationarity and isotropy. Formal definitions of these will be given later. For now, the following descriptions will suffice.

(a) Stationarity: the property whereby the behavior of the process is similar across all of the spatial domain under study. This implies:
    - constant (no trend) large-scale structure
    - small-scale structure that depends on the spatial locations only through their relative positions (displacement)
(b) Isotropy: the property whereby the process is stationary and, in addition, the small-scale structure depends on the spatial locations only through the Euclidean distance between them.

Characterization of the spatial structure is usually achieved by one or more of the following types of statistical inference:
- Testing for the existence of spatial structure
- Estimating spatial structural parameters
- Choosing between alternative structural models
- Prediction of unobserved variables using estimated structure (almost exclusively geostatistical, where it is known as kriging)

Now we elaborate on the second reason why we bother with spatial statistics, i.e., taking account of the spatial structure to improve non-spatial inferences.

Example 1 (from geostatistics). Prediction of an unobserved response, $Z(s_0)$, at a site $s_0$ from 6 nearby observations.

If all of the observed responses are uncorrelated with each other and with $Z(s_0)$, then $\bar{Z}$ (the average of the observed responses) is the best linear unbiased predictor. If, however, the responses are spatially correlated, then $\bar{Z}$ is inefficient.
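The inefficiency of the sample mean under spatial correlation can be checked numerically. The sketch below is illustrative only: the six site coordinates, the target site, and the exponential covariance with its sill and range are all assumptions, and ordinary kriging is used as the best linear unbiased predictor under a constant unknown mean. The helper names exp_cov, ordinary_kriging_weights, and prediction_mse are hypothetical.

```python
import numpy as np

def exp_cov(h, sill=1.0, range_par=0.3):
    """Assumed exponential covariance as a function of distance h."""
    return sill * np.exp(-np.asarray(h) / range_par)

def ordinary_kriging_weights(cov_obs, cov_target):
    """Weights minimizing Var(Z(s0) - w'Z) subject to sum(w) = 1,
    obtained by solving the Lagrangian system [[C, 1], [1', 0]] [w; m] = [c; 1]."""
    n = len(cov_target)
    lhs = np.zeros((n + 1, n + 1))
    lhs[:n, :n] = cov_obs
    lhs[:n, n] = 1.0
    lhs[n, :n] = 1.0
    rhs = np.append(cov_target, 1.0)
    return np.linalg.solve(lhs, rhs)[:n]

def prediction_mse(weights, cov_obs, cov_target, var_target):
    """Var(Z(s0) - w'Z) for a linear predictor w'Z whose weights sum to one."""
    return var_target - 2.0 * weights @ cov_target + weights @ cov_obs @ weights

# Hypothetical configuration: 6 observation sites around the target site s0.
sites = np.array([[0.2, 0.5], [0.4, 0.7], [0.45, 0.45],
                  [0.6, 0.6], [0.7, 0.3], [0.9, 0.8]])
target = np.array([0.5, 0.5])

cov_obs = exp_cov(np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1))
cov_target = exp_cov(np.linalg.norm(sites - target, axis=1))

w_mean = np.full(len(sites), 1.0 / len(sites))  # sample-mean predictor
w_ok = ordinary_kriging_weights(cov_obs, cov_target)

mse_mean = prediction_mse(w_mean, cov_obs, cov_target, exp_cov(0.0))
mse_ok = prediction_mse(w_ok, cov_obs, cov_target, exp_cov(0.0))
print(f"MSE of sample mean: {mse_mean:.4f}, MSE of kriging predictor: {mse_ok:.4f}")
```

With an identity covariance matrix and zero covariance with the target, the same solver returns equal weights 1/6, which matches the uncorrelated case described above.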

Example 2 (from spatial point pattern analysis). Estimation of the number, $N$, of trees in a forest of area $A$.

One method for estimating $N$ is based on measuring the distance, $X_i$, to the nearest tree from each of $m$ fixed points.

[Figure: fixed points and their nearest-tree distances, labeled X1, X2, X3, and X4.]

If tree locations are completely spatially random (a random sample from the uniform distribution on $A$), then the MLE of $N$ is

$$\hat{N} = \frac{mA}{\sum_{i=1}^{m} \pi x_i^2}.$$

If not, then $\hat{N}$ can be badly biased.
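A small simulation gives a feel for how the nearest-tree estimator of $N$ behaves with and without complete spatial randomness. Everything in this sketch is an assumption made for illustration: the unit-square forest, the 10 × 10 grid of fixed sample points, the true tree count of 2000, and the clustered alternative (20 cluster centers with Gaussian scatter). The function name nearest_tree_estimate is hypothetical.

```python
import numpy as np

def nearest_tree_estimate(trees, sample_points, area):
    """N-hat = m * A / sum(pi * x_i^2), where x_i is the distance from
    sample point i to its nearest tree."""
    diffs = sample_points[:, None, :] - trees[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)
    return len(sample_points) * area / np.sum(np.pi * nearest ** 2)

rng = np.random.default_rng(1)
n_trees = 2000

# m = 100 fixed sample points, kept away from the boundary to limit edge effects.
grid = np.linspace(0.1, 0.9, 10)
sample_points = np.array([(gx, gy) for gx in grid for gy in grid])

# Completely spatially random trees in the unit square.
csr_trees = rng.uniform(size=(n_trees, 2))

# Clustered trees: 100 trees scattered tightly around each of 20 random centers
# (a few may fall slightly outside the unit square, which is ignored here).
parents = rng.uniform(size=(20, 2))
clustered_trees = (np.repeat(parents, n_trees // 20, axis=0)
                   + rng.normal(scale=0.02, size=(n_trees, 2)))

n_hat_csr = nearest_tree_estimate(csr_trees, sample_points, area=1.0)
n_hat_clustered = nearest_tree_estimate(clustered_trees, sample_points, area=1.0)
print(f"true N = {n_trees}, CSR estimate = {n_hat_csr:.0f}, "
      f"clustered estimate = {n_hat_clustered:.0f}")
```

Under clustering, most fixed points lie far from any tree, so the distances are inflated and the estimate falls well below the true count.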

Example 3 (from areal data analysis). Variance of the sample mean.

Consider 16 observations taken over square subregions in a 4 × 4 grid, indexed by rows and columns as $Z(i, j)$.

[Figure: 4 × 4 grid with corner cells labeled Z(4,1) (top left), Z(4,4) (top right), Z(1,1) (bottom left), and Z(1,4) (bottom right).]

Suppose that the observations have common mean $\mu$ and common variance 1, and

$$\operatorname{corr}[Z(i, j), Z(k, l)] = 0.5^{|i-k| + |j-l|}.$$

Suppose we wish to estimate $\mu$ by the sample mean, $\bar{Z}$. It's tedious but mathematically easy to show that $\operatorname{var}(\bar{Z}) \approx 0.266$. If there were no spatial correlation, then $\operatorname{var}(\bar{Z}) = 1/16 = 0.0625$.

Thus if we obtain a 95% (say) confidence interval for $\mu$ by acting as though there is no correlation, our interval will actually be much narrower than it should be.
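The "tedious" calculation can be delegated to a few lines of code. This sketch builds the 16 × 16 correlation matrix with entries $0.5^{|i-k|+|j-l|}$ for the 4 × 4 grid and evaluates the variance of the sample mean as a quadratic form with equal weights 1/16; the variable names are illustrative.

```python
import numpy as np

# Row/column indices (i, j) of the 16 cells in the 4 x 4 grid.
cells = [(i, j) for i in range(1, 5) for j in range(1, 5)]

# Unit variances, so the covariance matrix equals the correlation matrix.
corr = np.array([[0.5 ** (abs(i - k) + abs(j - l)) for (k, l) in cells]
                 for (i, j) in cells])

weights = np.full(16, 1.0 / 16.0)

var_mean_correlated = weights @ corr @ weights         # about 0.266
var_mean_independent = weights @ np.eye(16) @ weights  # 1/16 = 0.0625

# Factor by which the naive confidence interval is too narrow.
width_ratio = np.sqrt(var_mean_correlated / var_mean_independent)
print(var_mean_correlated, var_mean_independent, width_ratio)
```

The ratio of standard errors is a little over 2, so an interval computed as if the observations were uncorrelated is roughly half as wide as it should be.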

Example 4 (from areal data analysis). Spatial experimental design.

Consider a field-plot experiment with 50 units, laid out in 10 linear blocks of 5 plots each. Suppose there are 5 treatments and each is to occur once in each block. Consider two designs:
- Randomized block design (RBD)
- First-order nearest-neighbor balanced design (first-order NNBD)

Treatment layout (one block per row):

5 4 1 3 2
2 5 4 1 3
3 2 5 4 1
1 3 2 5 4
4 1 3 2 5
5 1 2 4 3
3 5 1 2 4
4 3 5 1 2
2 4 3 5 1
1 2 4 3 5

It turns out that if treatment-adjusted responses are independent across blocks but positively spatially correlated within blocks, then the first-order NNBD is optimal in the sense of minimizing the average variance of treatment contrasts. It can be considerably superior to the RBD if the correlation is strong.
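To make "nearest-neighbor balance" concrete, the sketch below counts how often each unordered pair of treatments occupies adjacent plots within a block. It assumes the 50 treatment labels listed in Example 4 are read as 10 blocks of 5, one block per row, which is an assumption about how the flattened slide should be parsed. The helper name adjacent_pair_counts is hypothetical.

```python
from collections import Counter

# Treatment labels from Example 4, read one block per row (assumed parsing).
layout = [
    [5, 4, 1, 3, 2],
    [2, 5, 4, 1, 3],
    [3, 2, 5, 4, 1],
    [1, 3, 2, 5, 4],
    [4, 1, 3, 2, 5],
    [5, 1, 2, 4, 3],
    [3, 5, 1, 2, 4],
    [4, 3, 5, 1, 2],
    [2, 4, 3, 5, 1],
    [1, 2, 4, 3, 5],
]

def adjacent_pair_counts(blocks):
    """Count how often each unordered pair of treatments sits on adjacent
    plots within the same block."""
    counts = Counter()
    for block in blocks:
        for left, right in zip(block, block[1:]):
            counts[frozenset((left, right))] += 1
    return counts

pair_counts = adjacent_pair_counts(layout)
for pair, count in sorted(pair_counts.items(), key=lambda item: sorted(item[0])):
    print(sorted(pair), count)
```

Under this reading, all 10 treatment pairs are adjacent equally often, which is the defining property of a first-order nearest-neighbor balanced layout; a typical randomized block design would not have equal counts.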

Geostatistical Data and Model, Part I

The basic form of geostatistical data is $(x_i, y_i) : i = 1, \ldots, n$, where $x_i$ is a spatial location in a spatial domain of interest $A$, and $y_i$ is the measured value of a variable $y$ at $x_i$.

Usually $A$ lies in 2-D space and the $x_i$'s are points, in which case $x_i$ is a 2-D vector of spatial coordinates.

Almost always, the $x_i$'s are distinct (i.e., no repeated measurements).

The $x_i$'s are either deterministic (nonrandom) or, if they are randomly chosen, it is assumed (unless noted otherwise) that the mechanism for their selection is independent of the process governing the $y_i$'s.

As for $y_i$, it is assumed to be a realization of a random variable $Y_i$, which in turn is regarded as a function $f(\cdot)$ of another random variable $S(x_i)$:

$$Y_i = f(S(x_i)).$$

In fact, it is assumed that there is a random variable $S(x)$ at each possible location $x \in A$, not merely at $x_1, \ldots, x_n$.

The collection $S(\cdot) \equiv \{S(x) : x \in A\}$ is a stochastic process, which in this context is called a geostatistical process or a random field. In general it is not observable, even at $x_1, \ldots, x_n$.

In the wet sulfate deposition example:
- $A$ is the continental U.S.
- $x_1, \ldots, x_n$ are the locations of the buckets
- $y_i$ is the observed annual wet sulfate deposition
- $Y_i = S(x_i) + \epsilon_i$, where $\epsilon_i$ is measurement error
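The structure $Y_i = S(x_i) + \epsilon_i$ can be mimicked in a short simulation. This is only a sketch under stated assumptions: $S(\cdot)$ is taken to be Gaussian with a constant mean and an exponential covariance, the measurement errors are independent normal, and the locations are random points in the unit square. None of these choices, nor the function name simulate_geostatistical_data and its parameters, come from the slides.

```python
import numpy as np

def simulate_geostatistical_data(locations, mean=1.0, sill=1.0, range_par=0.3,
                                 noise_sd=0.1, seed=0):
    """Simulate S(x_i) from an assumed Gaussian process and Y_i = S(x_i) + eps_i."""
    rng = np.random.default_rng(seed)
    n = len(locations)
    dists = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    cov = sill * np.exp(-dists / range_par)         # assumed exponential covariance
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # small jitter for stability
    s = mean + chol @ rng.standard_normal(n)        # unobserved signal S(x_i)
    y = s + noise_sd * rng.standard_normal(n)       # observed data with measurement error
    return s, y

# Hypothetical sampling locations x_1, ..., x_n in the unit square.
locations = np.random.default_rng(42).uniform(size=(50, 2))
s, y = simulate_geostatistical_data(locations)
```

In a real analysis only y and the locations would be available; s is returned here purely to show that the signal and the data differ by the measurement error.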

It is assumed that $Y_1, \ldots, Y_n$ are conditionally independent given $S(\cdot)$, but that $S(x)$ and $S(x')$ generally are not independent.

If $S(\cdot)$ is a Gaussian process, meaning that any finite subcollection has a multivariate normal distribution, then $S(x)$ and $S(x')$ are independent if and only if $\operatorname{corr}(S(x), S(x')) = 0$. Thus in this case at least, it is clear that the correlation structure of $S(\cdot)$ is important in modeling and inference for geostatistical data.

Even if $S(\cdot)$ is not Gaussian, some types of inference are possible if we can characterize the first-order ("large-scale") and second-order ("small-scale") moment structure of the data. So we take that as our first major task.

Means, Variances, and Correlations of Random Variables

Suppose that $Y$ is a random variable (defined on some probability space).

If $Y$ is discrete, it has a probability mass function $f(y)$, where $f(y) > 0$ for $y \in \mathcal{Y}$. If the sum

$$\sum_{y \in \mathcal{Y}} u(y) f(y)$$

exists, then it (the sum) is called the expectation of $u(Y)$, written as $E(u(Y))$.

If $Y$ is (absolutely) continuous, it has a probability density function $f(y)$, where $f(y) > 0$ for $y \in \mathcal{Y}$. If

$$\int_{\mathcal{Y}} u(y) f(y)\, dy$$

exists, then it (the integral) is also called the expectation of $u(Y)$, also written as $E(u(Y))$.

It follows easily from the definition of expectation that, when it exists,

$$E(c_1 u_1(Y) + c_2 u_2(Y)) = c_1 E(u_1(Y)) + c_2 E(u_2(Y)),$$

where $c_1$ and $c_2$ are constants and $u_1(\cdot)$ and $u_2(\cdot)$ are functions.

Some important special types of expectations are the mean and variance:
- Mean: $E(Y)$ ($\equiv \mu$)
- Variance: $\operatorname{Var}(Y) = E[(Y - \mu)^2] = E[Y^2 - 2\mu Y + \mu^2] = E(Y^2) - \mu^2$ ($\equiv \sigma^2$)

By the linearity rule above, $E(aY + b) = a\mu + b$ and $\operatorname{Var}(aY + b) = a^2 \sigma^2$.

Now suppose that $Y_1$ and $Y_2$ are two random variables, either both discrete or both continuous, with joint pmf or pdf $f(y_1, y_2)$. Then the expectation of $u(Y_1, Y_2)$, when it exists, is given by an expression similar to that given before. For example, in the discrete case, it is

$$\sum_{(y_1, y_2) \in \mathcal{Y}} u(y_1, y_2) f(y_1, y_2).$$

It turns out that the means $\mu_1$ and $\mu_2$, and variances $\sigma_1^2$ and $\sigma_2^2$, of $Y_1$ and $Y_2$ (when they exist) may be obtained using either this approach (the joint distribution) or the previous approach (the marginal distributions).

However, there is another important expectation in this bivariate case that requires the joint distribution. The covariance of $Y_1$ and $Y_2$, when it exists, is

$$\operatorname{Cov}(Y_1, Y_2) = E[(Y_1 - \mu_1)(Y_2 - \mu_2)] \equiv \sigma_{12}.$$

The covariance $\sigma_{12}$:
- exists if $\sigma_1^2$ and $\sigma_2^2$ exist
- measures the sign and extent to which $Y_1$ and $Y_2$ co-vary, but its magnitude depends on the variances
- may be written alternatively as $E(Y_1 Y_2) - \mu_1 \mu_2$
- satisfies the rule $\operatorname{Cov}(a_1 Y_1 + b_1, a_2 Y_2 + b_2) = a_1 a_2 \sigma_{12}$

A scaled version of the covariance is the correlation, written as $\operatorname{corr}(Y_1, Y_2)$ or $\rho_{12}$:

$$\rho_{12} = \frac{\sigma_{12}}{\sqrt{\sigma_1^2 \sigma_2^2}},$$

provided that $\sigma_1^2$ and $\sigma_2^2$ are positive.

Facts about $\rho_{12}$:
- $-1 \le \rho_{12} \le 1$
- $\rho_{12}$ measures the strength of the linear association between $Y_1$ and $Y_2$
- If $Y_1$ and $Y_2$ are independent, then $\rho_{12} = 0$; but the converse is false

Another moment used in spatial statistics is the semivariance of $Y_1$ and $Y_2$, written as $\gamma_{12}$:

$$\gamma_{12} = \tfrac{1}{2} \operatorname{Var}(Y_1 - Y_2),$$

provided that this variance exists. It can be shown that:
- when $\sigma_1^2$ and $\sigma_2^2$ exist, $\gamma_{12} = \tfrac{1}{2}(\sigma_1^2 + \sigma_2^2 - 2\sigma_{12})$;
- when $\sigma_1^2 = \sigma_2^2 = \sigma^2$, say, $\gamma_{12} = \sigma^2 - \sigma_{12} = \sigma^2 (1 - \rho_{12})$.
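For completeness, the two semivariance identities follow from expanding the variance of a difference, using the covariance rule stated earlier with $a_1 = 1$ and $a_2 = -1$. A short derivation, with no assumptions beyond the existence of the variances:

```latex
\begin{align*}
\gamma_{12}
  &= \tfrac{1}{2}\operatorname{Var}(Y_1 - Y_2) \\
  &= \tfrac{1}{2}\left[\operatorname{Var}(Y_1) + \operatorname{Var}(Y_2)
       - 2\operatorname{Cov}(Y_1, Y_2)\right] \\
  &= \tfrac{1}{2}\left(\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}\right).
\end{align*}
% Equal-variance case: set \sigma_1^2 = \sigma_2^2 = \sigma^2 and use
% \sigma_{12} = \rho_{12}\sigma_1\sigma_2 = \rho_{12}\sigma^2.
\begin{align*}
\gamma_{12}
  &= \tfrac{1}{2}\left(2\sigma^2 - 2\sigma_{12}\right)
   = \sigma^2 - \sigma_{12}
   = \sigma^2 (1 - \rho_{12}).
\end{align*}
```

So with equal variances, a large semivariance corresponds to a small correlation, and vice versa.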

We can extend all of these ideas to situations with any finite number $n$ of random variables $Y_1, \ldots, Y_n$.

If $Y_1, \ldots, Y_n$ are random variables whose variances $\sigma_1^2, \ldots, \sigma_n^2$ exist, and we write $\mu_1, \ldots, \mu_n$ for their means and $\sigma_{ij}$ for $\operatorname{cov}(Y_i, Y_j)$, then the following rules hold for the mean, variance, and covariance of linear transformations of the variables:

$$E\left(\sum_{i=1}^{n} a_i Y_i\right) = \sum_{i=1}^{n} a_i \mu_i;$$

$$\operatorname{Var}\left(\sum_{i=1}^{n} a_i Y_i\right)
  = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i a_j \sigma_{ij}
  = \sum_{i=1}^{n} a_i^2 \sigma_i^2 + 2 \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i a_j \sigma_i \sigma_j \rho_{ij};$$

$$\operatorname{Cov}\left(\sum_{i=1}^{n} a_i Y_i, \sum_{j=1}^{n} b_j Y_j\right)
  = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \sigma_{ij}
  = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i b_j \rho_{ij} \sigma_i \sigma_j.$$
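These summation rules are the quadratic and bilinear forms $a^\top \Sigma a$ and $a^\top \Sigma b$ written out term by term, where $\Sigma$ is the matrix with entries $\sigma_{ij}$ (and $\sigma_{ii} = \sigma_i^2$). The following sketch checks the equivalence numerically for an arbitrary covariance matrix and arbitrary coefficients; all values are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# An arbitrary symmetric nonnegative definite matrix, used as the covariance matrix.
root = rng.normal(size=(n, n))
sigma = root @ root.T

a = rng.normal(size=n)
b = rng.normal(size=n)

# Variance rule: sum_i a_i^2 sigma_i^2 + 2 sum_{i<j} a_i a_j sigma_ij.
var_sum = (sum(a[i] ** 2 * sigma[i, i] for i in range(n))
           + 2.0 * sum(a[i] * a[j] * sigma[i, j]
                       for i in range(n) for j in range(i + 1, n)))
var_quadratic = a @ sigma @ a

# Covariance rule: sum_i sum_j a_i b_j sigma_ij.
cov_sum = sum(a[i] * b[j] * sigma[i, j] for i in range(n) for j in range(n))
cov_bilinear = a @ sigma @ b

print(var_sum, var_quadratic, cov_sum, cov_bilinear)
```

The matrix forms are usually the more convenient ones in computation, which is one motivation for collecting the moments into vectors and matrices.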

For simplicity, it is helpful to organize all the means in an $n$-dimensional vector:

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix}.$$

Likewise, we can organize all the variances and covariances in an $n \times n$ symmetric nonnegative definite matrix:

$$\Sigma = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2
\end{pmatrix}.$$

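Similarly, we can construct an $n \times n$ symmetric nonnegative definite correlation matrix

$$\boldsymbol{\rho} = \begin{pmatrix}
1 & \rho_{12} & \cdots & \rho_{1n} \\
\rho_{21} & 1 & \cdots & \rho_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{n1} & \rho_{n2} & \cdots & 1
\end{pmatrix}$$

and an $n \times n$ symmetric semivariance matrix

$$\Gamma = \begin{pmatrix}
0 & \gamma_{12} & \cdots & \gamma_{1n} \\
\gamma_{21} & 0 & \cdots & \gamma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\gamma_{n1} & \gamma_{n2} & \cdots & 0
\end{pmatrix}.$$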
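Both matrices can be computed directly from a covariance matrix using $\rho_{ij} = \sigma_{ij} / (\sigma_i \sigma_j)$ and $\gamma_{ij} = \tfrac{1}{2}(\sigma_i^2 + \sigma_j^2 - 2\sigma_{ij})$. The sketch below uses a made-up 3 × 3 covariance matrix; the function names correlation_matrix and semivariance_matrix are hypothetical.

```python
import numpy as np

def correlation_matrix(sigma):
    """rho_ij = sigma_ij / (sigma_i * sigma_j); requires positive variances."""
    sd = np.sqrt(np.diag(sigma))
    return sigma / np.outer(sd, sd)

def semivariance_matrix(sigma):
    """gamma_ij = (sigma_i^2 + sigma_j^2 - 2 * sigma_ij) / 2."""
    variances = np.diag(sigma)
    return 0.5 * (variances[:, None] + variances[None, :] - 2.0 * sigma)

# Illustrative covariance matrix (values chosen only for the example).
sigma = np.array([[4.0, 1.2, 0.4],
                  [1.2, 1.0, 0.3],
                  [0.4, 0.3, 2.25]])

rho = correlation_matrix(sigma)
gamma = semivariance_matrix(sigma)
print(rho)
print(gamma)
```

As expected, the correlation matrix has ones on its diagonal and the semivariance matrix has zeros, since $\gamma_{ii} = \tfrac{1}{2}\operatorname{Var}(Y_i - Y_i) = 0$.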

Mean Function, Covariance Function, Correlation Function, and Semivariogram of a Geostatistical Process

Recall that a geostatistical process $S(\cdot) = \{S(x) : x \in A\}$ is an infinite collection of random variables, one located at each point in a spatial domain $A$. How can we represent the mean, variance, covariance (or correlation), and semivariance of all these random variables?

We do it by regarding the moments as functions of the spatial location(s):
- Mean function: $\mu(x) = E(S(x))$;
- Covariance function: $\sigma(x, x') = \operatorname{Cov}(S(x), S(x'))$;
- Correlation function: $\rho(x, x') = \operatorname{Corr}(S(x), S(x'))$;

- Semivariogram: $\gamma(x, x') = \tfrac{1}{2} \operatorname{Var}(S(x) - S(x'))$.

Often, there are not enough data to estimate these functions nonparametrically, so in practice they are approximated by parsimonious parametric models. In due time we will consider such models, but for now, we merely note the simplifications that result from stationarity and isotropy.

We consider three types of stationarity:

1. Strict stationarity requires that the joint probability distribution of $\{S(x) : x \in A\}$ depends only on the relative positions of sites, i.e.,

$$F_{S(x_1 + h), \ldots, S(x_m + h)}(s_1, \ldots, s_m) = F_{S(x_1), \ldots, S(x_m)}(s_1, \ldots, s_m)$$

for all $m$, all locations $x_1, \ldots, x_m$, all displacements $h$, and all $s_1, \ldots, s_m$.

This implies, for example, that $P[S(x_1 + h) \le s_1] = P[S(x_1) \le s_1]$ for all $x_1$, all $h$, and all $s_1$ (i.e., equality of marginal distributions). It also implies that bivariate distributions are identical for any pairs of variables displaced by the same vector.

2. Second-order stationarity requires that:
- the mean is constant over space;
- the way that two variables of $S(\cdot)$ co-vary is consistent for variables located at sites having the same relative positions. That is, the covariance between variables at two sites depends on only the sites' relative positions. This can be expressed in either of the following two ways:
    - $\sigma(x, x') = \sigma(x + h, x' + h)$ for all $h$.
    - $\sigma(x, x') = \sigma(x - x')$ for all $x, x' \in A$.

This implies that the variance is constant over space. If $S(\cdot)$ is second-order stationary, the correlation function satisfies the same property.

3. Intrinsic stationarity requires that:
- the mean is constant over space;
- the semivariogram depends on only the relative positions of sites. This can be expressed as $\gamma(x, x') = \gamma(x - x')$ for all $x, x' \in A$.

Some important facts:
- Strict stationarity is stronger than second-order stationarity, which in turn is stronger than intrinsic stationarity.
- If $S(\cdot)$ is second-order stationary with variance $\sigma^2$, then $\gamma(x - x') = \sigma^2 - \sigma(x - x')$.

Isotropy adds a further requirement to each stationarity assumption. In particular, it requires that the condition defining the stationarity hold not only for variables displaced by the same $h$, but also for variables for which the orientation of $h$ is different but the length of $h$ is the same.

Thus, isotropic second-order stationarity and isotropic intrinsic stationarity require, respectively, that

$$\sigma(x, x') = \sigma(\|x - x'\|) \quad \text{for all } x, x' \in A,$$

$$\gamma(x, x') = \gamma(\|x - x'\|) \quad \text{for all } x, x' \in A.$$

Isotropic second-order stationarity implies that $\rho(x, x') = \rho(\|x - x'\|)$ for all $x, x' \in A$.
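As a concrete illustration of these properties, the sketch below uses an exponential covariance function of the displacement length. This model, its variance of 2.0, and its range parameter are assumptions chosen for the example and are not taken from the slides. The code checks that the implied semivariogram is unchanged when both sites are shifted by the same vector (stationarity), equals $\sigma^2 - \sigma(x - x')$, and is unchanged when the displacement is rotated (isotropy).

```python
import numpy as np

VARIANCE = 2.0   # assumed sigma^2
RANGE_PAR = 0.5  # assumed range parameter

def cov(x, x_prime):
    """Assumed isotropic, second-order stationary covariance:
    sigma(x, x') = sigma^2 * exp(-||x - x'|| / range)."""
    return VARIANCE * np.exp(-np.linalg.norm(x - x_prime) / RANGE_PAR)

def semivariogram(x, x_prime):
    """gamma(x, x') = Var(S(x) - S(x')) / 2, expanded in terms of the covariance."""
    return 0.5 * (cov(x, x) + cov(x_prime, x_prime) - 2.0 * cov(x, x_prime))

x = np.array([0.2, 0.7])
x_prime = np.array([0.9, 0.1])
shift = np.array([3.0, -1.5])

gamma_original = semivariogram(x, x_prime)
gamma_shifted = semivariogram(x + shift, x_prime + shift)  # same displacement
gamma_from_cov = VARIANCE - cov(x, x_prime)                # sigma^2 - sigma(x - x')

# Rotate the displacement by 90 degrees: same length, different orientation.
h = x_prime - x
h_rotated = np.array([-h[1], h[0]])
gamma_rotated = semivariogram(x, x + h_rotated)

print(gamma_original, gamma_shifted, gamma_from_cov, gamma_rotated)
```

A covariance that depended on the direction of the displacement as well as its length would still pass the shift check but would fail the rotation check.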