An Introduction to Spatial Statistics. Chunfeng Huang Department of Statistics, Indiana University

An Introduction to Spatial Statistics Chunfeng Huang Department of Statistics, Indiana University

Microwave Sounding Unit (MSU) Anomalies (Monthly): 1979-2006.

Iron Ore (Cressie, 1986) Raw percent data 2 4 6 8 5 10 15

North Carolina sudden infant deaths, 1974-1978 and 1979-1984 (Cressie, 1993) SIDS (sudden infant death syndrome) in North Carolina 36.5 N 1974 36 N 35.5 N 35 N 34.5 N 34 N 1979 36.5 N 36 N 35.5 N 35 N 34.5 N 84 W 82 W 80 W 78 W 76 W 34 N 0 10 20 30 40 50 60

cells japanesepines redwoodfull

Three types of spatial data Geostatistics Variogram Kriging Lattice or Areal data Markov Random Field Conditional Autoregressive Model (CAR) Spatial point pattern Complete spatial randomness (CSR) K function, L function

Geostatistics: Quantify spatial dependency and Assumptions Second-order stationarity: E(Z(s i )) = µ, cov(z(s i ), Z(s j )) = C(s i s j ) Isotropy: cov(z(s i ), Z(s j )) = C( s i s j ) Intrinsic stationarity: E(Z(s i )) = µ, var(z(s i ) Z(s j )) = 2γ(s i s j ) γ( ): semi-variogram, 2γ( ): variogram. Stationary implies intrinsic stationary: γ(s i s j ) = C(0) C(s i s j ) Intrinsic stationary does not necessarily imply stationary.

vgm(1,"sph",1.5,nugget=0.2) 1.2 1.0 0.8 semivariance 0.6 0.4 0.2 0.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 distance

Variogram (Semi-variogram) Sill, range, nugget Conditional negative definite. (Variance may be negative if this is not satisfied.) Parametric models

3 0.0 1.0 2.0 3.0 0.0 1.0 2.0 3.0 vgm(1,"nug",0) vgm(1,"exp",1) vgm(1,"sph",1) vgm(1,"gau",1) vgm(1,"exc",1) 2 1 0 vgm(1,"mat",1) vgm(1,"ste",1) vgm(1,"cir",1) vgm(1,"lin",0) vgm(1,"bes",1) 3 2 1 semivariance 3 2 vgm(1,"pen",1) vgm(1,"per",1) vgm(1,"hol",1) vgm(1,"log",1) vgm(1,"pow",1) 0 1 0 vgm(1,"spl",1) vgm(1,"leg",1) 3 2 1 0 0.0 1.0 2.0 3.0 distance

Estimation of variogram Empirical variogram (Method-of-moments estimator) 2ˆγ(h) = 1 N(h) {Z(s i ) Z(s j )} 2, N(h) where N(h) is the set of pairs with distance h, N(h) is the number of pairs. Tolerance regions. Fitting parametric models (weighted least squares, MLE, REML)

Empirical variogram semivariance 0 5 10 0 2 4 6 8 distance

Kriging Suppose we observe a spatial process Z(s 1 ), Z(s 2 ),..., Z(s n ). The best (in terms of minimizing mean squared prediction error) unbiased linear predictor of Z(s 0 ) is (Cressie, 1993) Ẑ(s 0 ) = n λ i Z(s i ), i=1 where (λ 1,..., λ n ) = ( γ + 1 (1 1T Γ 1 ) T γ) 1 T Γ 1 Γ 1 1 γ = (γ(s 0 s 1 ),..., γ(s 0 s n )) T Kriging variance: Γ = {γ(s i s j )} n n σ 2 K (s 0) = λ T Γ 1 λ (1 T Γ 1 1 1) 2 /(1 T Γ 1 1)

Z y x 48 50 52 54 56 58

Prediction MSE y x 2.6 2.8 3.0 3.2 3.4 3.6

Northing 2 4 6 8 x x o o x o x o x xo x x o o x o o 52 54 56 58 Iron Ore % 46 50 54 58 ox ox o o o o o o x o ox x o o x x x x x o o o ox x ox x x x 5 10 15 Easting x o = Median Iron Ore % Iron Ore % x = Mean Iron Ore %

Areal Data or Lattice Data Simultaneous autoregressive (SAR) model z(s 1 ) = +b 12 z(s 2 ) +... + b 1n z(s n ) + ɛ 1 z(s 2 ) = b 21 z(s 1 ) + +... + b 2n z(s n ) + ɛ 2. z(s n ) = b n1 z(s 1 ) + b n2 z(s 2 ) +... + +ɛ n Or z = Bz + ɛ Explanatory variables. One can show that ɛ is not independent of z. What if z is discrete?

Conditional autoregressive (CAR) model where z(s i ) {z(s j ) : j i} N( n i=1 c ij τ 2 j = c ji τ 2 i, c ii = 0 c ij z(s j ), τ 2 i ) If we further assume the conditional dependency only through the neighborhood of s i : N(s i ) (Markov random field (MRF)) c ik = 0, k / N(s i )

Response variable falls in exponential family: auto spatial models. Response variable is normal: auto Gaussian model. Response variable is binary: auto logistic model. Response variable is poisson: auto poisson model. SAR can be represented as a CAR model. Not necessarily vice versa, see example in Cressie (1993). Markov random field theory is used to guide from the conditional distributions to a valid joint distribution.

Proximity matrix W. Some choices for W ij (W ii = 0): W ij = 1 if i, j share a common boundary. W ij is an inverse distance between units. W ij = 1 if distance between i and j less than some fixed value. W ij = 1 for m nearest neighbors.

CAR model where and z N(X β, (I C) 1 M) C = {c ij } n n, M = τ 2 I C = ρw ρ = 0 spatial independence Condition on ρ: I ρw is positive definite.

Spatial point process A spatial point process is said to have the complete spatial randomness (CSR) property if it is a homogeneous poisson point process. A 1,..., A r disjoint, then N(A 1 ),..., N(A r ) are independent. (N(A) : number of events in A) N(A) Poisson(λ A ), where A is the volume of A.

Testing for CSR W : distance from a randomly chosen event to its nearest event. X : point to nearest event. Through Monte Carlo test.

First order intensity function λ(x) = lim dx 0 Second order intensity function λ 2 (x, y) = lim dx, dy 0 { } E[N(dx)] dx Stationary: λ(x) = λ, λ 2 (x, y) = λ 2 (x y) Isotropy: λ 2 (x y) = λ 2 ( x y ) Under CSR: λ(x) = λ, λ 2 (x, y) = λ 2 { } E[N(dx)N(dy)] dx dy

K function K(t) = 1 λ E[N 0(t)] where N 0 (t) is the number of further events within distance t of an arbitrary event. One can show that λ 2 (t) = λ2 K (t), λ 2 K(t) = 2π 2πt For the CSR: K(t) = πt 2 L function: L(t) = K(t)/π t t 0 λ 2 (y)ydy

Kenvjap K(r) 0.00 0.05 0.10 0.15 0.20 K obs(r) K theo(r) K hi(r) K lo(r) 0.00 0.05 0.10 0.15 0.20 r

Kenvred K(r) 0.00 0.05 0.10 0.15 K obs(r) K theo(r) K hi(r) K lo(r) 0.00 0.05 0.10 0.15 0.20 r

References Cressie, N. (1993), Statistics for Spatial Data, rev. ed., Wiley. Cressie, N. and Wikle, C. (2012), Statistics for Spatio-Temporal Data, Wiley. Diggle, P. J. (1983), Statistical Analysis of Spatial Point Patterns, Academic Press. Banerjee, S., Carlin, B. P. and Gelfand, A. E. (2004), Hierarchical Modeling and Analysis for Spatial Data, Chapman and Hall. Bivand, R., Pebesma, E. and Gomez-Rubio V. (2008), Applied Spatial Data Analysis with R, Springer-Verlag. Stein, M. (1999), Interpolation of Spatial Data. Cressie, N. (1986). Kiring non stationary data. Journal of American Statistical Association, 81, 625-634.