Introduction to Spatial Data and Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. July 25, 2014 1
Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data that are: 2 Graduate Workshop on Environmental Data Analytics 2014
Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data that are: highly multivariate, with many important predictors and response variables, 2 Graduate Workshop on Environmental Data Analytics 2014
Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data that are: highly multivariate, with many important predictors and response variables, geographically referenced, and often presented as maps, and 2 Graduate Workshop on Environmental Data Analytics 2014
Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data that are: highly multivariate, with many important predictors and response variables, geographically referenced, and often presented as maps, and temporally correlated, as in longitudinal or other time series structures. 2 Graduate Workshop on Environmental Data Analytics 2014
Researchers in diverse areas such as climatology, ecology, environmental health, and real estate marketing are increasingly faced with the task of analyzing data that are: highly multivariate, with many important predictors and response variables, geographically referenced, and often presented as maps, and temporally correlated, as in longitudinal or other time series structures. motivates hierarchical modeling and data analysis for complex spatial (and spatiotemporal) data sets. 2 Graduate Workshop on Environmental Data Analytics 2014
Type of spatial data point-referenced data, where y(s) is a random vector at a location s R r, where s varies continuously over D, a fixed subset of R r that contains an r-dimensional rectangle of positive volume; 3 Graduate Workshop on Environmental Data Analytics 2014
Type of spatial data point-referenced data, where y(s) is a random vector at a location s R r, where s varies continuously over D, a fixed subset of R r that contains an r-dimensional rectangle of positive volume; areal data, where D is again a fixed subset (of regular or irregular shape), but now partitioned into a finite number of areal units with well-defined boundaries; 3 Graduate Workshop on Environmental Data Analytics 2014
Type of spatial data point-referenced data, where y(s) is a random vector at a location s R r, where s varies continuously over D, a fixed subset of R r that contains an r-dimensional rectangle of positive volume; areal data, where D is again a fixed subset (of regular or irregular shape), but now partitioned into a finite number of areal units with well-defined boundaries; point pattern data, where now D is itself random; its index set gives the locations of random events that are the spatial point pattern. y(s) itself can simply equal 1 for all s D (indicating occurrence of the event), or possibly give some additional covariate information (producing a marked point pattern process). 3 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First step in analyzing data 4 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First step in analyzing data First Law of Geography: Mean + Error 4 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First step in analyzing data First Law of Geography: Mean + Error Mean: first-order behavior 4 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First step in analyzing data First Law of Geography: Mean + Error Mean: first-order behavior Error: second-order behavior (covariance function) 4 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First step in analyzing data First Law of Geography: Mean + Error Mean: first-order behavior Error: second-order behavior (covariance function) EDA tools examine both first and second order behavior 4 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First step in analyzing data First Law of Geography: Mean + Error Mean: first-order behavior Error: second-order behavior (covariance function) EDA tools examine both first and second order behavior Preliminary displays: Simple locations to surface displays 4 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data First Law of Geography = + data mean error 5 Graduate Workshop on Environmental Data Analytics 2014
Exploration of spatial data Scallops Sites 6 Graduate Workshop on Environmental Data Analytics 2014
Deterministic surface interpolation Spatial surface observed at finite set of locations S = {s 1, s 2,..., s n } Tessellate the spatial domain (usually with data locations as vertices) Fit an interpolating polynomial: f(s) = i w i (S ; s)f(s i ) Interpolate by reading off f(s 0 ). Issues: Sensitivity to tessellations Choices of multivariate interpolators Numerical error analysis 7 Graduate Workshop on Environmental Data Analytics 2014
Scallops data: image and contour plots Latitude 39.0 39.5 40.0 40.5 73.5 73.0 72.5 72.0 Longitude 8 Graduate Workshop on Environmental Data Analytics 2014
Scallops data: image and contour plots Drop-line scatter plot 9 Graduate Workshop on Environmental Data Analytics 2014
11 11.211.411.611.8 12 12.2 12.4 logsp Introduction to spatial data and models 38.04 Scallops data: image and contour plots Surface plot 38.02 38 Latitude 37.98 37.96-121.34-121.32 Longitude -121.3-121.28 10 Graduate Workshop on Environmental Data Analytics 2014
Scallops data: image and contour plots Image contour plot Latitude 37.96 37.98 38.00 38.02 38.04 121.36 121.34 121.32 121.30 121.28 121.26 Longitude 11 Graduate Workshop on Environmental Data Analytics 2014
Scallops data: image and contour plots Locations form patterns 128100 128150 128200 128250 128300 1031350 1031400 1031450 1031500 1031550 Eastings Northings 128100 128150 128200 128250 128300 1031350 1031400 1031450 1031500 1031550 12 Graduate Workshop on Environmental Data Analytics 2014
0 2 4 6 8 10 12 14 Shrub Density Introduction to spatial data and models Scallops data: image and contour plots Surface features 1031550 1031500 1031450 Northings 1031400 1031350 128100 128150 128200 128250 128300 Eastings 13 Graduate Workshop on Environmental Data Analytics 2014
Scallops data: image and contour plots Interesting plot arrangements N S UTM coordinates 4879050 4879100 4879150 4879200 456300 456320 456340 456360 456380 456400 456420 456440 E W UTM coordinates 14 Graduate Workshop on Environmental Data Analytics 2014
Elements of point-level modeling Point-level modeling refers to modeling of spatial data collected at locations referenced by coordinates (e.g., lat-long, Easting-Northing). 15 Graduate Workshop on Environmental Data Analytics 2014
Elements of point-level modeling Point-level modeling refers to modeling of spatial data collected at locations referenced by coordinates (e.g., lat-long, Easting-Northing). Fundamental concept: Data from a spatial process {y(s) : s D}, where D is a fixed subset in Euclidean space. 15 Graduate Workshop on Environmental Data Analytics 2014
Elements of point-level modeling Point-level modeling refers to modeling of spatial data collected at locations referenced by coordinates (e.g., lat-long, Easting-Northing). Fundamental concept: Data from a spatial process {y(s) : s D}, where D is a fixed subset in Euclidean space. Example: y(s) is a pollutant level at site s 15 Graduate Workshop on Environmental Data Analytics 2014
Elements of point-level modeling Point-level modeling refers to modeling of spatial data collected at locations referenced by coordinates (e.g., lat-long, Easting-Northing). Fundamental concept: Data from a spatial process {y(s) : s D}, where D is a fixed subset in Euclidean space. Example: y(s) is a pollutant level at site s Conceptually: Pollutant level exists at all possible sites 15 Graduate Workshop on Environmental Data Analytics 2014
Elements of point-level modeling Point-level modeling refers to modeling of spatial data collected at locations referenced by coordinates (e.g., lat-long, Easting-Northing). Fundamental concept: Data from a spatial process {y(s) : s D}, where D is a fixed subset in Euclidean space. Example: y(s) is a pollutant level at site s Conceptually: Pollutant level exists at all possible sites Practically: Data will be a partial realization of a spatial process observed at {s 1,..., s n } 15 Graduate Workshop on Environmental Data Analytics 2014
Elements of point-level modeling Point-level modeling refers to modeling of spatial data collected at locations referenced by coordinates (e.g., lat-long, Easting-Northing). Fundamental concept: Data from a spatial process {y(s) : s D}, where D is a fixed subset in Euclidean space. Example: y(s) is a pollutant level at site s Conceptually: Pollutant level exists at all possible sites Practically: Data will be a partial realization of a spatial process observed at {s 1,..., s n } Statistical objectives: Inference about the process y(s); predict at new locations. 15 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Suppose our spatial process has a mean, µ (s) = E (y (s)), and that the variance of y(s) exists for all s D. 16 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Suppose our spatial process has a mean, µ (s) = E (y (s)), and that the variance of y(s) exists for all s D. Strong stationarity: If for any given set of sites, and any displacement h, the distribution of (y(s 1 ),..., y(s n )) is the same as, (y(s 1 + h),..., y(s n + h)). Weak stationarity: Constant mean µ(s) = µ, and Cov(y(s), y(s + h)) = C(h): the covariance depends only upon the displacement (or separation) vector. 16 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Suppose our spatial process has a mean, µ (s) = E (y (s)), and that the variance of y(s) exists for all s D. Strong stationarity: If for any given set of sites, and any displacement h, the distribution of (y(s 1 ),..., y(s n )) is the same as, (y(s 1 + h),..., y(s n + h)). Weak stationarity: Constant mean µ(s) = µ, and Cov(y(s), y(s + h)) = C(h): the covariance depends only upon the displacement (or separation) vector. Strong stationarity implies weak stationarity 16 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Suppose our spatial process has a mean, µ (s) = E (y (s)), and that the variance of y(s) exists for all s D. Strong stationarity: If for any given set of sites, and any displacement h, the distribution of (y(s 1 ),..., y(s n )) is the same as, (y(s 1 + h),..., y(s n + h)). Weak stationarity: Constant mean µ(s) = µ, and Cov(y(s), y(s + h)) = C(h): the covariance depends only upon the displacement (or separation) vector. Strong stationarity implies weak stationarity The process is Gaussian if y = (y(s 1 ),..., y(s n )) has a multivariate normal distribution. 16 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Suppose our spatial process has a mean, µ (s) = E (y (s)), and that the variance of y(s) exists for all s D. Strong stationarity: If for any given set of sites, and any displacement h, the distribution of (y(s 1 ),..., y(s n )) is the same as, (y(s 1 + h),..., y(s n + h)). Weak stationarity: Constant mean µ(s) = µ, and Cov(y(s), y(s + h)) = C(h): the covariance depends only upon the displacement (or separation) vector. Strong stationarity implies weak stationarity The process is Gaussian if y = (y(s 1 ),..., y(s n )) has a multivariate normal distribution. For Gaussian processes, strong and weak stationarity are equivalent. 16 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Variograms Suppose we assume E[y(s + h) y(s)] = 0 and define E[y(s + h) y(s)] 2 = V ar (y(s + h) y(s)) = 2γ (h). This is sensible if the left hand side depends only upon h. Then we say the process is intrinsically stationary. 17 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Variograms Suppose we assume E[y(s + h) y(s)] = 0 and define E[y(s + h) y(s)] 2 = V ar (y(s + h) y(s)) = 2γ (h). This is sensible if the left hand side depends only upon h. Then we say the process is intrinsically stationary. γ(h) is called the semivariogram and 2γ(h) is called the variogram. 17 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Variograms Suppose we assume E[y(s + h) y(s)] = 0 and define E[y(s + h) y(s)] 2 = V ar (y(s + h) y(s)) = 2γ (h). This is sensible if the left hand side depends only upon h. Then we say the process is intrinsically stationary. γ(h) is called the semivariogram and 2γ(h) is called the variogram. Note that intrinsic stationarity defines only the first and second moments of the differences y(s + h) y(s). It says nothing about the joint distribution of a collection of variables y(s 1 ),..., y(s n ), and thus provides no likelihood. 17 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Intrinsic Stationarity and Ergodicity Relationship between γ(h) and C(h): 2γ(h) = V ar(y(s + h)) + V ar(y(s)) 2Cov(y(s + h), y(s)) = C(0) + C(0) 2C(h) = 2[C(0) C(h)]. 18 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Intrinsic Stationarity and Ergodicity Relationship between γ(h) and C(h): 2γ(h) = V ar(y(s + h)) + V ar(y(s)) 2Cov(y(s + h), y(s)) = C(0) + C(0) 2C(h) = 2[C(0) C(h)]. Easy to recover γ from C. The converse needs the additional assumption of ergodicity: lim u C(u) = 0. 18 Graduate Workshop on Environmental Data Analytics 2014
Stationary Gaussian processes Intrinsic Stationarity and Ergodicity Relationship between γ(h) and C(h): 2γ(h) = V ar(y(s + h)) + V ar(y(s)) 2Cov(y(s + h), y(s)) = C(0) + C(0) 2C(h) = 2[C(0) C(h)]. Easy to recover γ from C. The converse needs the additional assumption of ergodicity: lim u C(u) = 0. So lim u γ(u) = C(0), and we can recover C from γ as long as this limit exists. C(h) = lim γ(u) γ(h). u 18 Graduate Workshop on Environmental Data Analytics 2014
Isotropy When γ(h) or C(h) depends upon the separation vector only through the distance h, we say that the process is isotropic. In that case, we write γ( h ) or C( h ). Otherwise we say that the process is anisotropic. 19 Graduate Workshop on Environmental Data Analytics 2014
Isotropy When γ(h) or C(h) depends upon the separation vector only through the distance h, we say that the process is isotropic. In that case, we write γ( h ) or C( h ). Otherwise we say that the process is anisotropic. If the process is intrinsically stationary and isotropic, it is also called homogeneous. 19 Graduate Workshop on Environmental Data Analytics 2014
Isotropy When γ(h) or C(h) depends upon the separation vector only through the distance h, we say that the process is isotropic. In that case, we write γ( h ) or C( h ). Otherwise we say that the process is anisotropic. If the process is intrinsically stationary and isotropic, it is also called homogeneous. Isotropic processes are popular because of their simplicity, interpretability, and because a number of relatively simple parametric forms are available as candidates for C (and γ). Denoting h by t for notational simplicity, the next two tables provide a few examples... 19 Graduate Workshop on Environmental Data Analytics 2014
Isotropy Some common isotropic variograms model Variogram, { γ(t) τ Linear γ(t) = 2 + σ 2 t if t > 0 0 otherwise τ 2 + σ 2 if t 1/φ Spherical γ(t) = τ 2 + σ 2 [ 3 2 φt 1 2 (φt)3] if 0 < t 1/φ { 0 otherwise τ Exponential γ(t) = 2 + σ 2 (1 exp( φt)) if t > 0 { 0 otherwise Powered τ γ(t) = 2 + σ 2 (1 exp( φt p )) if t > 0 exponential { 0 otherwise Matérn τ γ(t) = 2 + σ 2 [ 1 (1 + φt) e φt] if t > 0 at ν = 3/2 0 o/w 20 Graduate Workshop on Environmental Data Analytics 2014
Isotropy γ(t) = Examples: Spherical Variogram τ 2 + σ 2 if t 1/φ τ 2 + σ 2 [ 3 2 φt 1 2 (φt)3] if 0 < t 1/φ 0 if t = 0. 21 Graduate Workshop on Environmental Data Analytics 2014
Isotropy γ(t) = Examples: Spherical Variogram τ 2 + σ 2 if t 1/φ τ 2 + σ 2 [ 3 2 φt 1 2 (φt)3] if 0 < t 1/φ 0 if t = 0. While γ(0) = 0 by definition, γ(0 + ) lim t 0 + γ(t) = τ 2 ; this quantity is the nugget. 21 Graduate Workshop on Environmental Data Analytics 2014
Isotropy γ(t) = Examples: Spherical Variogram τ 2 + σ 2 if t 1/φ τ 2 + σ 2 [ 3 2 φt 1 2 (φt)3] if 0 < t 1/φ 0 if t = 0. While γ(0) = 0 by definition, γ(0 + ) lim t 0 + γ(t) = τ 2 ; this quantity is the nugget. lim t γ(t) = τ 2 + σ 2 ; this asymptotic value of the semivariogram is called the sill. (The sill minus the nugget, σ 2 in this case, is called the partial sill.) 21 Graduate Workshop on Environmental Data Analytics 2014
Isotropy γ(t) = Examples: Spherical Variogram τ 2 + σ 2 if t 1/φ τ 2 + σ 2 [ 3 2 φt 1 2 (φt)3] if 0 < t 1/φ 0 if t = 0. While γ(0) = 0 by definition, γ(0 + ) lim t 0 + γ(t) = τ 2 ; this quantity is the nugget. lim t γ(t) = τ 2 + σ 2 ; this asymptotic value of the semivariogram is called the sill. (The sill minus the nugget, σ 2 in this case, is called the partial sill.) Finally, the value t = 1/φ at which γ(t) first reaches its ultimate level (the sill) is called the range, R 1/φ. 21 Graduate Workshop on Environmental Data Analytics 2014
Isotropy Examples: Spherical Variogram 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.5 1.0 1.5 2.0 b) spherical; a0 = 0.2, a1 = 1, R = 1 22 Graduate Workshop on Environmental Data Analytics 2014
Isotropy Some common isotropic covariograms Model Covariance function, C(t) Linear C(t) does not exist 0 if t 1/φ Spherical C(t) = σ 2 [ 1 3 2 φt + 1 2 (φt)3] if 0 < t 1/φ { τ 2 + σ 2 otherwise σ Exponential C(t) = 2 exp( φt) if t > 0 { τ 2 + σ 2 otherwise Powered σ C(t) = 2 exp( φt p ) if t > 0 exponential { τ 2 + σ 2 otherwise Matérn σ C(t) = 2 (1 + φt) exp( φt) if t > 0 at ν = 3/2 τ 2 + σ 2 otherwise 23 Graduate Workshop on Environmental Data Analytics 2014
Isotropy Notes on exponential model C(t) = { τ 2 + σ 2 if t = 0 σ 2 exp( φt) if t > 0. 24 Graduate Workshop on Environmental Data Analytics 2014
Isotropy Notes on exponential model C(t) = { τ 2 + σ 2 if t = 0 σ 2 exp( φt) if t > 0. We define the effective range, t 0, as the distance at which this correlation has dropped to only 0.05. Setting exp( φt 0 ) equal to this value we obtain t 0 3/φ, since log(0.05) 3. 24 Graduate Workshop on Environmental Data Analytics 2014
Isotropy Notes on exponential model C(t) = { τ 2 + σ 2 if t = 0 σ 2 exp( φt) if t > 0. We define the effective range, t 0, as the distance at which this correlation has dropped to only 0.05. Setting exp( φt 0 ) equal to this value we obtain t 0 3/φ, since log(0.05) 3. Finally, the form of C(t) shows why the nugget τ 2 is often viewed as a nonspatial effect variance, and the partial sill (σ 2 ) is viewed as a spatial effect variance. 24 Graduate Workshop on Environmental Data Analytics 2014
Isotropy The Matèrn Correlation Function Much of statistical modelling is carried out through correlation functions rather than variograms 25 Graduate Workshop on Environmental Data Analytics 2014
Isotropy The Matèrn Correlation Function Much of statistical modelling is carried out through correlation functions rather than variograms The Matèrn is a very versatile family: { σ 2 C(t) = 2 ν 1 Γ(ν) (2 νtφ) ν K ν (2 (ν)tφ) if t > 0 τ 2 + σ 2 if t = 0 K ν is the modified Bessel function of order ν (computationally tractable) 25 Graduate Workshop on Environmental Data Analytics 2014
Isotropy The Matèrn Correlation Function Much of statistical modelling is carried out through correlation functions rather than variograms The Matèrn is a very versatile family: { σ 2 C(t) = 2 ν 1 Γ(ν) (2 νtφ) ν K ν (2 (ν)tφ) if t > 0 τ 2 + σ 2 if t = 0 K ν is the modified Bessel function of order ν (computationally tractable) ν is a smoothness parameter (a fractal) controlling process smoothness 25 Graduate Workshop on Environmental Data Analytics 2014
Variogram model fitting How do we select a variogram? Can the data really distinguish between variograms? 26 Graduate Workshop on Environmental Data Analytics 2014
Variogram model fitting How do we select a variogram? Can the data really distinguish between variograms? Empirical Variogram: γ(t) = 1 2 N(t) s i,s j N(t) (y(s i ) y(s j )) 2 where N(t) is the number of points such that s i s j = t and N(t) is the number of points in N(t). 26 Graduate Workshop on Environmental Data Analytics 2014
Variogram model fitting How do we select a variogram? Can the data really distinguish between variograms? Empirical Variogram: γ(t) = 1 2 N(t) s i,s j N(t) (y(s i ) y(s j )) 2 where N(t) is the number of points such that s i s j = t and N(t) is the number of points in N(t). Grid up the t space into intervals I 1 = (0, t 1 ), I 2 = (t 1, t 2 ), and so forth, up to I K = (t K 1, t K ). Representing t values in each interval by its midpoint, we define: N(t k ) = {(s i, s j ) : s i s j I k }, k = 1,..., K. 26 Graduate Workshop on Environmental Data Analytics 2014
Variogram model fitting Empirical variogram: scallops data Gamma(d) 0 1 2 3 4 5 6 0.0 0.5 1.0 1.5 2.0 distance 27 Graduate Workshop on Environmental Data Analytics 2014
Variogram model fitting Empirical variogram: scallops data Parametric Semivariograms gamma(d) 0 2 4 6 8 Exponential Gaussian Cauchy Spherical Bessel-J0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 distance Bessel Mixtures - Random Weights gamma(d) 0 2 4 6 8 Two Three Four Five 0.0 0.5 1.0 1.5 2.0 2.5 3.0 distance Bessel Mixtures - Random Phi s gamma(d) 0 2 4 6 8 Two Three Four Five 0.0 0.5 1.0 1.5 2.0 2.5 3.0 distance 28 Graduate Workshop on Environmental Data Analytics 2014