Gaussian predictive process models for large spatial data sets
Sudipto Banerjee, Alan E. Gelfand, Andrew O. Finley, and Huiyan Sang
Presenters: Halley Brantley and Chris Krut
September 28, 2015
Overview

- Recap: Gaussian process spatial regression
- Univariate predictive process
- Multivariate Gaussian processes
- Linear model of coregionalization
- Multivariate predictive process
- Extensions to non-Gaussian and space-time data
Gaussian Process Definition

$Y(s)$ is a Gaussian process with mean function $\mu(s)$ and covariance function $H(s, s') = \mathrm{cov}(Y(s), Y(s'))$ if for every finite set of locations $s_1, \dots, s_n$ the vector $\tilde{Y} = (Y(s_1), \dots, Y(s_n))^T$ satisfies
$$\tilde{Y} \sim MVN_n(\mu, H) \tag{1}$$
where $\mu = (\mu(s_1), \dots, \mu(s_n))^T$ and $\{H\}_{ij} = H(s_i, s_j; \phi)$. To be a valid covariance function, $H(s, s'; \phi)$ must be positive semidefinite in the sense that every covariance matrix $H$ it generates satisfies $v^T H v \geq 0$ [2, Pg. 80].
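The definition suggests a direct way to simulate: build the covariance matrix over a finite set of locations and draw from the multivariate normal. Below is a minimal Python sketch (ours, not from the paper); the exponential covariance and all parameter values are illustrative choices.

```python
# Draw one realization of a mean-zero GP at 200 random 2-d locations.
import numpy as np

def exp_cov(s1, s2, sigma2=1.0, phi=2.0):
    """C(s, s') = sigma^2 * exp(-phi * ||s - s'||)."""
    d = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)
    return sigma2 * np.exp(-phi * d)

rng = np.random.default_rng(0)
s = rng.uniform(0, 10, size=(200, 2))   # locations in [0, 10] x [0, 10]
H = exp_cov(s, s)                       # n x n covariance matrix
L = np.linalg.cholesky(H + 1e-10 * np.eye(len(s)))  # jitter for stability
y = L @ rng.standard_normal(len(s))     # y ~ MVN(0, H)
```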
Spatial Regression Model

$$Y(s) = X(s)^T \beta + w(s) + \varepsilon(s) \tag{2}$$

- $X(s)$ is a vector of covariates and $\beta$ the corresponding coefficients.
- $\varepsilon(s)$ is an independent process, the nugget effect: $\varepsilon(s) \sim N(0, \tau^2)$.
- $w(s)$ is a spatial random effect: $w(s) \sim GP(0, C(s, s'; \theta))$ with $C(s, s'; \theta) = \sigma^2 \rho(s, s'; \theta)$.

Marginally, $Y \sim N(X\beta, \Sigma_Y)$ with $\Sigma_Y = C(\theta) + \tau^2 I$.
Computational Challenges

- Fitting the above model requires determinants and inverses of the $n \times n$ matrix $\Sigma_Y$; these computations cost $O(n^3)$ flops, so complexity grows quickly with matrix size.
- Memory limitations also create problems, since storing $\Sigma_Y$ requires $O(n^2)$ space.
- A lot of work has been done on fitting models for large spatial data sets.
Univariate Predictive Process

$$Y(s) = X(s)^T \beta + \tilde{w}(s) + \varepsilon(s) \tag{3}$$

Given knots $s_1^*, \dots, s_m^*$, let $w^* = (w(s_1^*), \dots, w(s_m^*))^T$ and define
$$\tilde{w}(s) = E(w(s) \mid w^*) = c(s; \theta)^T C^{*-1}(\theta)\, w^*$$
where $c(s; \theta)^T = (C(s, s_1^*; \theta), \dots, C(s, s_m^*; \theta))$ and $C^*(\theta) = [C(s_i^*, s_j^*; \theta)]_{i,j=1}^m$. Then
$$\tilde{w}(s) \sim GP\left(0,\; c^T(s; \theta)\, C^{*-1}(\theta)\, c(s'; \theta)\right)$$

Advantage: we now work with $m \times m$ matrices instead of $n \times n$.
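A minimal sketch of the projection onto the knot process, reusing `exp_cov`, `s`, and `rng` from the earlier block; the $10 \times 10$ knot grid is an illustrative choice.

```python
# Predictive process: w_tilde(s) = c(s)^T C*^{-1} w* at all n locations.
import numpy as np

gx = np.linspace(0.5, 9.5, 10)
knots = np.array([(a, b) for a in gx for b in gx])   # m = 100 knots

C_star = exp_cov(knots, knots)                       # m x m knot covariance
w_star = np.linalg.cholesky(
    C_star + 1e-10 * np.eye(len(knots))
) @ rng.standard_normal(len(knots))                  # w* ~ MVN(0, C*)

c = exp_cov(s, knots)                                # n x m cross-covariance
# One m x m solve, then an O(nm) projection to all n locations.
w_tilde = c @ np.linalg.solve(C_star, w_star)
```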
Properties

- $\tilde{w}(s_0) = E(w(s_0) \mid w^*)$ minimizes $E\left[(w(s_0) - f(w^*))^2 \mid w^*\right]$ over all real-valued functions $f(w^*)$.
- $\tilde{w}(s_j^*) = c^T(s_j^*; \theta)\, C^{*-1}(\theta)\, w^* = w(s_j^*)$: the predictive process interpolates the parent process $w(s)$ at the knots.
- Let $w_a = (w(s_1), \dots, w(s_n), w(s_1^*), \dots, w(s_m^*))$ and let $p(w_a \mid Y)$ be the posterior distribution of $w_a$ with all other parameters fixed.
  1. $p(w_a \mid Y) \propto p(w_a)\, p(Y \mid w)$ since $p(Y \mid w) = p(Y \mid w_a)$.
  2. The posterior for the predictive process model replaces $p(Y \mid w)$ with $q(Y \mid w^*)$.
  3. We want to preserve $q(Y \mid w_a) = q(Y \mid w^*)$.
  4. The authors claim the predictive process model corresponds to the density that minimizes the reverse Kullback-Leibler divergence between the posteriors $q(w_a \mid Y)$ and $p(w_a \mid Y)$.
Knot Selection

- In addition to a covariance function, the predictive process requires a set of knots $S^*$: we must specify both the number of knots $m$ and their locations.
- Choosing the knots to be all observed spatial locations reduces to the original model.
- In choosing the number of knots we balance performance against computational complexity.
- The authors consider modifications to a standard grid of knots (close pairs, infill).
To compare performance:

- Compare the covariance function of the parent process with that of the predictive process.
- 200 locations are uniformly generated over a $[0, 10] \times [0, 10]$ square.
- Knots consist of a $10 \times 10$ equally spaced grid.
- Matérn covariance with $\sigma^2 = 1$, range parameter $\phi = 2$, and four values of the smoothness $\nu$.
- Covariances for 2,000 of the roughly 40,000 distance pairs are plotted for the predictive process.
Figure 1 (Pg. 831): covariances of $w(s)$ against distance (line) and covariances of $\tilde{w}(s)$ against distance (points): (a) smoothness parameter 0.5; (b) smoothness parameter 1; (c) smoothness parameter 1.5; (d) smoothness parameter 5.
Alternative Scenario

- Exponential covariance ($\nu = 0.5$).
- Choose four values of the range parameter $\phi$.
Figure 2 (Pg. 832): covariances of $w(s)$ against distance (line) and covariances of $\tilde{w}(s)$ against distance (points): (a) range parameter 2; (b) range parameter 4; (c) range parameter 6; (d) range parameter 12.
Take-Aways 1

- The parent and predictive process covariance functions agree better at larger distances, especially with increasing smoothness and range.
- Dense knots may be needed to capture short-range dependence.
- Knot selection with a packed subset (instead of just a grid) may improve results.
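This behavior is easy to check numerically. The sketch below (ours) compares the parent covariance $C(s, s')$ with the predictive process covariance $c^T(s)\, C^{*-1} c(s')$ for the exponential case, reusing `exp_cov`, `s`, `knots`, and `C_star` from the earlier blocks.

```python
# Compare parent vs. predictive process covariances over distance pairs.
import numpy as np

c_all = exp_cov(s, knots)                          # n x m
pp_cov = c_all @ np.linalg.solve(C_star, c_all.T)  # n x n predictive process
parent = exp_cov(s, s)                             # n x n parent covariance

d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
iu = np.triu_indices(len(s), k=1)                  # distinct location pairs
gap = np.abs(parent[iu] - pp_cov[iu])
# Agreement is close at large distances; short-range covariance (and hence
# variance) is under-represented by the predictive process.
print("mean gap, d > 5:", gap[d[iu] > 5].mean())
print("mean gap, d < 1:", gap[d[iu] < 1].mean())
```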
- Lattice plus close pairs design: start with a regular $k \times k$ lattice of knots, then intensify the grid by randomly choosing $m$ of the lattice points and placing an additional knot close to each of them.
- Lattice plus infill design: start with knots on a regular $k \times k$ lattice, then intensify the grid by placing a more finely spaced lattice within $m$ randomly chosen cells of the original lattice.

Both designs are sketched in code below.
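A standalone sketch (ours) of the two designs; the domain, jitter scale, and fill resolution are illustrative choices.

```python
# Generate "lattice plus close pairs" and "lattice plus infill" knot designs.
import numpy as np

def lattice(k, lo=0.0, hi=10.0):
    """Regular k x k lattice over [lo, hi]^2."""
    g = np.linspace(lo, hi, k)
    return np.array([(a, b) for a in g for b in g])

def close_pairs(k, m, eps=0.2, seed=1):
    """Lattice plus one extra knot placed within eps of m random lattice points."""
    rng = np.random.default_rng(seed)
    base = lattice(k)
    picks = base[rng.choice(len(base), m, replace=False)]
    extra = picks + rng.uniform(-eps, eps, picks.shape)
    return np.vstack([base, extra])

def infill(k, m, fine=3, seed=1):
    """Lattice plus a fine x fine sub-lattice inside m random cells."""
    rng = np.random.default_rng(seed)
    base = lattice(k)
    cell = 10.0 / (k - 1)                 # spacing of the base lattice
    corners = base[rng.choice(len(base), m, replace=False)]
    offsets = np.array([(i, j) for i in range(1, fine + 1)
                        for j in range(1, fine + 1)]) * cell / (fine + 1)
    return np.vstack([base] + [c + offsets for c in corners])
```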
Simulation 1

- Simulate $Y(s)$ at 3,000 irregularly scattered locations $s$ from $Y(s) = x^T(s)\beta + w(s) + \epsilon(s)$.
- See Figure 3a, Pg. 838.
See Figures 3a, 3b, 3c, Pg. 838.
See Table 1, Pg. 837
See Table 2, Pg. 839
See Table 3, Pg. 839
Take-Aways 2

- Estimation is more sensitive to the number of knots than to the underlying design.
- Close pair designs appear to improve estimation of the shorter ranges, as seen for $\lambda_2$ with 256 knots.
- Predictions are much more robust (little change as the number of knots increases).
Simulation 2

- 15,000 locations (vs. 3,000 in Simulation 1); the full model is computationally infeasible.
- Non-stationary random field: the domain is divided into 3 regions, each with a different intercept.
Simulated sites, knots, and OLS residuals: see Figures 4a, 4b, Pg. 840. Spatial residuals: see Figure 4c, Pg. 840.
See Table 4, Pg. 841
Take-Aways 3

- Higher knot density gives better estimation.
- Spatial residuals are smoother and illustrate regional anisotropy.
Application

- Forest biomass and other variables related to current carbon stocks are important for quantifying the ecological and economic viability of forest landscapes.
- We want to know how biomass changes across the landscape (as a continuous, interpolated surface) and how homogeneous it is across the region.
Data

- Point-referenced, log-transformed biomass data observed at 9,500 locations (USDA).
- $Y(s)$: biomass from trees.
- $X_1(s)$: cross-sectional area of all stems at 1.37 m above the ground (basal area).
- $X_2(s)$: number of tree stems (stem density) at that location.
- Spatially varying coefficient model: $Y(s) = x^T(s)\beta(s) + \epsilon(s)$ (see Figure 5, Pg. 843).
- Predictive process form: $Y = X\beta + Z^T \tilde{C}^T(\theta)\, C^{*-1}(\theta)\, w^* + \epsilon$.
See Table 5 Pg. 844.
See Figure 6, Pg. 845: posterior mean estimates of spatial surfaces from the spatially varying coefficients model: (a) intercept parameter; (b) basal area parameter; (c) stem density parameter; (d) (log-)biomass.
Bivariate Gaussian Process

$$\begin{pmatrix} w_1(s) \\ w_2(s) \end{pmatrix} \sim MVGP_2\left(0, \Gamma_w(s, s')\right)$$

Cross-covariance function:
$$\Gamma_w(s, s') = \begin{pmatrix} \mathrm{cov}(w_1(s), w_1(s')) & \mathrm{cov}(w_1(s), w_2(s')) \\ \mathrm{cov}(w_2(s), w_1(s')) & \mathrm{cov}(w_2(s), w_2(s')) \end{pmatrix}$$

For observed locations $s_1, \dots, s_n$, the covariance matrix induced by $\Gamma_w(s, s')$ becomes $2n \times 2n$.
Multivariate Gaussian Process

$$\begin{pmatrix} w_1(s) \\ \vdots \\ w_k(s) \end{pmatrix} \sim MVGP_k\left(0, \Gamma_w(s, s')\right)$$

Cross-covariance function:
$$\Gamma_w(s, s') = \begin{pmatrix} \mathrm{cov}(w_1(s), w_1(s')) & \cdots & \mathrm{cov}(w_1(s), w_k(s')) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(w_k(s), w_1(s')) & \cdots & \mathrm{cov}(w_k(s), w_k(s')) \end{pmatrix}$$

For observed locations $s_1, \dots, s_n$, the covariance matrix induced by $\Gamma_w(s, s')$ becomes $kn \times kn$.
Multivariate Spatial Regression

$$Y(s) = X(s)^T \beta + w(s) + \varepsilon(s) \tag{4}$$

Linear model of coregionalization [1]:
$$w(s) = \begin{pmatrix} w_1(s) \\ \vdots \\ w_k(s) \end{pmatrix} = A(s)\, v(s), \qquad v(s) = \begin{pmatrix} v_1(s) \\ \vdots \\ v_k(s) \end{pmatrix}$$

where the $v_j(s)$ are independent $GP(0, \rho_j(s, s'))$ processes, so that $v(s) \sim GP(0, \Gamma_v(s, s'))$ with
$$\Gamma_v(s, s') = \mathrm{diag}\left([\rho_j(s, s')]_{j=1}^k\right), \qquad \Gamma_w(s, s') = A(s)\, \Gamma_v(s, s')\, A^T(s').$$

Since $\Gamma_w(s, s) = A(s) A^T(s)$, we can take $A(s)$ to be lower triangular. A construction sketch follows.
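A minimal sketch (ours) of building the LMC cross-covariance for $k = 2$ processes with a constant lower-triangular $A$, reusing the locations `s` from the first block; all covariance choices are illustrative.

```python
# LMC: Gamma_w(s, s') = A Gamma_v(s, s') A^T with k = 2 latent processes.
import numpy as np

def corr(s1, s2, phi):
    """Exponential correlation rho(s, s') = exp(-phi ||s - s'||)."""
    d = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)
    return np.exp(-phi * d)

A = np.array([[1.0, 0.0],
              [0.5, 0.8]])          # lower triangular; Gamma_w(s, s) = A A^T
phis = [2.0, 6.0]                   # decay rates of v_1 and v_2

R1, R2 = corr(s, s, phis[0]), corr(s, s, phis[1])
# Full 2n x 2n covariance of w at all locations (location-major ordering):
# sum over latent processes of R_j (x) a_j a_j^T, a_j the j-th column of A.
Sigma_w = (np.kron(R1, np.outer(A[:, 0], A[:, 0])) +
           np.kron(R2, np.outer(A[:, 1], A[:, 1])))
```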
Multivariate Predictive Process

$$Y(s) = X(s)^T \beta + \tilde{w}(s) + \varepsilon(s) \tag{5}$$

$$\tilde{w}(s) = \mathrm{cov}(w(s), w^*)\, \mathrm{var}^{-1}(w^*)\, w^* = \mathcal{C}^T(s; \theta)\, C^{*-1}(\theta)\, w^*$$

$$\mathcal{C}(s; \theta) = \begin{pmatrix} \Gamma_w(s, s_1^*; \theta) \\ \vdots \\ \Gamma_w(s, s_m^*; \theta) \end{pmatrix} \text{ is an } mk \times k \text{ matrix}, \qquad C^*(\theta) = \left[\Gamma_w(s_i^*, s_j^*)\right]_{i,j=1}^m \text{ is an } mk \times mk \text{ matrix}.$$
Additional Computational Savings

$$\mathcal{A} = \begin{pmatrix} A(s_1^*) & 0 & \cdots & 0 \\ 0 & A(s_2^*) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & A(s_m^*) \end{pmatrix}, \qquad \Sigma_v = \begin{pmatrix} \Gamma_v(s_1^*, s_1^*) & \Gamma_v(s_1^*, s_2^*) & \cdots & \Gamma_v(s_1^*, s_m^*) \\ \Gamma_v(s_2^*, s_1^*) & \Gamma_v(s_2^*, s_2^*) & \cdots & \Gamma_v(s_2^*, s_m^*) \\ \vdots & & \ddots & \vdots \\ \Gamma_v(s_m^*, s_1^*) & \cdots & \cdots & \Gamma_v(s_m^*, s_m^*) \end{pmatrix}$$

$$\Gamma_v(s_i^*, s_j^*) = \begin{pmatrix} \rho_1(s_i^*, s_j^*) & 0 & \cdots & 0 \\ 0 & \rho_2(s_i^*, s_j^*) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \rho_k(s_i^*, s_j^*) \end{pmatrix}$$
Additional Computational Savings

$$C^* = \mathcal{A}\, \Sigma_v\, \mathcal{A}^T, \qquad C^{*-1} = \mathcal{A}^{-T}\, \Sigma_v^{-1}\, \mathcal{A}^{-1}$$

Permuting to process-major order, $\Sigma_v = P^T \tilde{H} P$ with $P^{-1} = P^T$ and
$$\tilde{H} = \begin{pmatrix} H_1 & 0 & \cdots & 0 \\ 0 & H_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & H_k \end{pmatrix}, \qquad H_i = \left[\rho_i(s_j^*, s_{j'}^*)\right]_{j, j'=1}^m,$$

so inverting $\Sigma_v$ only requires inverting $k$ separate $m \times m$ matrices.
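A small numerical check (ours) of the permutation identity for $k = 2$, reusing `corr` and `phis` from the LMC sketch and `knots` from the predictive process sketch.

```python
# Verify Sigma_v = P^T H~ P: reordering from location-major to
# process-major makes the kriging covariance block diagonal.
import numpy as np

m, k = len(knots), 2
H1 = corr(knots, knots, phis[0])          # m x m block for process 1
H2 = corr(knots, knots, phis[1])          # m x m block for process 2
H_tilde = np.block([[H1, np.zeros((m, m))],
                    [np.zeros((m, m)), H2]])

# Permutation matrix: process-major row j*m + i <-> location-major col i*k + j
P = np.zeros((k * m, k * m))
for i in range(m):
    for j in range(k):
        P[j * m + i, i * k + j] = 1.0

Sigma_v = P.T @ H_tilde @ P               # location-major Sigma_v
# Inverting H~ costs two m x m factorizations instead of one km x km one.
```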
General Framework: Spatial Mixed Model

$$Y(s) = X^T(s)\, \beta + Z^T(s)\, w(s) + \varepsilon(s)$$

- $Y(s)$ is a $q \times 1$ vector of responses at location $s$.
- $X^T(s) = \mathrm{diag}\left(x_1^T(s), \dots, x_q^T(s)\right)$.
- $Z^T(s)$ is a $q \times k$ design matrix.
- $w(s)$ is a $k \times 1$ vector of spatial effects.
- $\beta$ is a vector of coefficients of length $p = \sum_{l=1}^q p_l$.

Predictive process model:
$$Y(s) = X^T(s)\, \beta + Z^T(s)\, \tilde{w}(s) + \varepsilon(s)$$
Implementation

$$Y = X\beta + Z^T \tilde{C}^T(\theta)\, C^{*-1}(\theta)\, w^* + \varepsilon, \qquad \varepsilon \sim N(0, I_n \otimes \Psi).$$

- $Y = [Y(s_i)]_{i=1}^n$ is an $nq \times 1$ vector of responses.
- $X^T = [X(s_i)^T]_{i=1}^n$ is an $nq \times p$ design matrix.
- $Z^T = \mathrm{BlockDiag}\left(Z^T(s_1), \dots, Z^T(s_n)\right)$ is an $nq \times nk$ design matrix.
- $\tilde{C}^T(\theta) = [\Gamma_w(s_i, s_j^*)]_{i,j=1}^{n,m}$; $w^*$ and $C^*$ are from the predictive process.

After marginalizing out $w^*$:
$$f(Y \mid \Omega) = MVN\left(X\beta,\; Z^T \tilde{C}^T(\theta)\, C^{*-1}(\theta)\, \tilde{C}(\theta)\, Z + I_n \otimes \Psi\right)$$
Sherman-Woodbury-Morrison Formula

1. $(A + UCV)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}$ [4]
2. $\det(A + UWV^T) = \det\left(W^{-1} + V^T A^{-1} U\right) \det(W) \det(A)$ [3]

Likelihood calculations for $Y$ involve the determinant and inverse of $Z^T \tilde{C}^T(\theta)\, C^{*-1}(\theta)\, \tilde{C}(\theta)\, Z + I_n \otimes \Psi$. Using these identities, the computations involve $mk \times mk$ matrices instead of $nq \times nq$, as sketched below.
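A minimal univariate sketch (ours; $q = k = 1$, $\Psi = \tau^2$) of both identities applied to $\Sigma = \tau^2 I + c\, C^{*-1} c^T$, reusing `c`, `C_star`, and `y` from earlier blocks: every factorization is $m \times m$.

```python
# Woodbury inverse and determinant lemma for Sigma = tau^2 I + U C*^{-1} U^T.
import numpy as np

tau2 = 0.1
U = c                                     # n x m cross-covariance from above
n, m = U.shape

# Inner m x m matrix: with W = C*^{-1}, W^{-1} + U^T A^{-1} U = C* + U^T U / tau2
M = C_star + U.T @ U / tau2

# log det(Sigma) = log det(M) - log det(C*) + n log(tau^2)
logdet_Sigma = (np.linalg.slogdet(M)[1]
                - np.linalg.slogdet(C_star)[1]
                + n * np.log(tau2))

def sigma_inv_times(r):
    """Sigma^{-1} r via Woodbury: only an m x m solve is needed."""
    t = np.linalg.solve(M, U.T @ r / tau2)
    return r / tau2 - U @ t / tau2

quad = y @ sigma_inv_times(y)             # quadratic form y^T Sigma^{-1} y
```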
Extensions: Non-Gaussian and Space-Time Data

1. Non-Gaussian data (binary, count, categorical): e.g., binomial data (probit and logistic models) and count data.
   - Assume an appropriate transformation $\eta(s) = g(E(Y(s))) = X^T(s)\beta + w(s)$, with $g(\cdot)$ known.
   - In general you cannot marginalize out $w(s)$, and its full conditional is not available in closed form.
   - Clever trick: set $\eta(s) = g(E(Y(s))) = X^T(s)\beta + \tilde{w}(s) + \varepsilon(s)$; the added $\varepsilon(s)$ produces full conditionals for $w^*$ that are multivariate normal.
2. Space-time data: knots must now be specified over both space and time. The predictive process model extends naturally to this case.
References

[1] S. Banerjee, B. P. Carlin, and A. E. Gelfand. Hierarchical Modeling and Analysis for Spatial Data. Monographs on Statistics and Applied Probability 101, 2004.
[2] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. 2006.
[3] Wikipedia. "Matrix determinant lemma." 2015. [Online; accessed 16-September-2015].
[4] Wikipedia. "Woodbury matrix identity." 2015. [Online; accessed 16-September-2015].