Nearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data
Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3
July 31,
1 Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland.
2 Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles.
3 Departments of Forestry and Geography, Michigan State University, East Lansing, Michigan.
Low rank Gaussian Predictive Process: Pros
- Proper Gaussian process
- Allows for coherent spatial interpolation at arbitrary resolution
- Can be used as a prior for spatial random effects in any hierarchical setup for spatial data
- Computationally tractable
Low rank Gaussian Predictive Process: Cons
Figure: Comparing full GP vs low-rank GP with 2500 locations (panels: true w; full GP; PP, 64 knots)
- Low rank models like the Predictive Process (PP) often tend to oversmooth
- Increasing the number of knots can fix this but leads to heavy computation
Sparse matrices
Idea: Use a sparse matrix instead of a low rank matrix to approximate the dense full GP covariance matrix
Goals:
- Scalability, both in terms of storage and of computing inverses and determinants
- Closely approximate full GP inference
- Proper Gaussian process model, like the Predictive Process
Cholesky factors
Write a joint density p(w) = p(w_1, w_2, ..., w_n) as
p(w_1) p(w_2 | w_1) p(w_3 | w_1, w_2) ... p(w_n | w_1, w_2, ..., w_{n-1}).
For a Gaussian distribution w ~ N(0, C) this gives
w_1 = 0 + η_1
w_2 = a_21 w_1 + η_2
...
w_n = a_n1 w_1 + a_n2 w_2 + ... + a_{n,n-1} w_{n-1} + η_n,
or, stacking the equations, w = A w + η with η ~ N(0, D), where A is strictly lower triangular with entries a_ij and D = diag(d_1, d_2, ..., d_n), d_i = Var(η_i).
Cholesky factorization: C^{-1} = (I - A)' D^{-1} (I - A)
Cholesky factors
Let w_{<i} = (w_1, w_2, ..., w_{i-1})', c_i = Cov(w_i, w_{<i}) and C_i = Var(w_{<i}).
The i-th row of A and d_i = Var(η_i) are obtained from p(w_i | w_{<i}) as follows:
- Solve for the a_ij from Σ_{j=1}^{i-1} a_ij w_j = E(w_i | w_{<i}) = c_i' C_i^{-1} w_{<i}
- d_i = Var(w_i | w_{<i}) = σ² - c_i' C_i^{-1} c_i
For large i, inverting C_i becomes slow, so the Cholesky factor approach applied to the full GP covariance matrix C offers no computational benefit.
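These row-by-row conditionals are easy to check numerically. The sketch below (illustrative pure Python, not the authors' code; the exponential covariance on five 1-d points is an assumed toy choice) builds A and D from p(w_i | w_{<i}) and verifies C^{-1} = (I - A)' D^{-1} (I - A):

```python
import math

def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] for row in M]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n):
                M[r][c] -= f * M[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def sequential_factor(C):
    """w = A w + eta, eta ~ N(0, diag(d)), built from p(w_i | w_<i)."""
    n = len(C)
    A = [[0.0] * n for _ in range(n)]
    d = [C[0][0]]                                  # d_1 = Var(w_1)
    for i in range(1, n):
        Ci = [row[:i] for row in C[:i]]            # C_i = Var(w_<i)
        ci = [C[i][j] for j in range(i)]           # c_i = Cov(w_i, w_<i)
        a = solve(Ci, ci)                          # a_i' = c_i' C_i^{-1}
        A[i][:i] = a
        d.append(C[i][i] - sum(a[j] * ci[j] for j in range(i)))
    return A, d

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

s = [0.0, 0.3, 0.5, 0.9, 1.4]                      # toy 1-d locations
C = [[math.exp(-abs(u - v)) for v in s] for u in s]
A, d = sequential_factor(C)
n = len(s)
IA = [[(i == j) - A[i][j] for j in range(n)] for i in range(n)]   # I - A
IAt = [[IA[j][i] for j in range(n)] for i in range(n)]            # (I - A)'
Dinv = [[(i == j) / d[i] for j in range(n)] for i in range(n)]    # D^{-1}
prec = matmul(matmul(IAt, Dinv), IA)               # (I - A)' D^{-1} (I - A)
resid = matmul(prec, C)                            # should be the identity
err = max(abs(resid[i][j] - (i == j)) for i in range(n) for j in range(n))
assert err < 1e-8
```

Each step solves an (i - 1) × (i - 1) system, so the cost grows with i; this is exactly the bottleneck the sparse construction removes.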
Cholesky factors and directed acyclic graphs (DAGs)
Full dependency graph: the number of non-zero entries (the sparsity) of A equals the number of arrows in the graph.
In particular, the number of non-zero entries in the i-th row of A equals the number of arrows pointing to i in the DAG.
Introducing sparsity via graphical models
Full dependency graph:
p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) p(y_4 | y_1, y_2, y_3) p(y_5 | y_1, ..., y_4) p(y_6 | y_1, ..., y_5) p(y_7 | y_1, ..., y_6)
Introducing sparsity via graphical models
3-nearest-neighbor dependency graph: each y_i now conditions on at most its 3 nearest preceding variables,
p(y_1) p(y_2 | y_1) p(y_3 | y_1, y_2) p(y_4 | y_N(4)) p(y_5 | y_N(5)) p(y_6 | y_N(6)) p(y_7 | y_N(7)), with each |N(i)| ≤ 3
Introducing sparsity via graphical models
3-nearest-neighbor dependency graph
- Create a sparse DAG by keeping at most m arrows pointing to each node
- Set a_ij = 0 for every pair i, j with no arrow between them
- Fixing a_ij = 0 introduces conditional independence: w_j drops out of the conditioning set in p(w_i | {w_k : k < i})
Introducing sparsity via graphical models
3-nearest-neighbor dependency graph
Let N(i) denote the neighbor set of i, i.e., the set of nodes from which there are arrows to i. Then a_ij = 0 for j ∉ N(i), and the nonzero a_ij are obtained by solving
E[w_i | w_N(i)] = Σ_{j ∈ N(i)} a_ij w_j.
The above linear system is only m × m.
Choosing neighbor sets
Matérn covariance function:
C(s_i, s_j) = (1 / (2^{ν-1} Γ(ν))) (||s_i - s_j|| φ)^ν K_ν(||s_i - s_j|| φ); φ > 0, ν > 0,
where K_ν is the modified Bessel function of the second kind.
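With this parameterisation, the special case ν = 1/2 reduces to the exponential covariance exp(-φ ||s_i - s_j||), via the closed form K_{1/2}(x) = sqrt(π / (2x)) e^{-x}. A quick stdlib check (an illustrative sketch; the test distances and φ are arbitrary choices):

```python
import math

def matern(d, phi, nu):
    """Matérn correlation, coded only for nu = 1/2 where
    K_{1/2}(x) = sqrt(pi / (2 x)) * exp(-x) has a closed form."""
    assert nu == 0.5, "closed-form Bessel only coded for nu = 1/2"
    x = d * phi
    k_half = math.sqrt(math.pi / (2.0 * x)) * math.exp(-x)
    return (x ** nu) * k_half / (2.0 ** (nu - 1.0) * math.gamma(nu))

# nu = 1/2 Matérn coincides with the exponential correlation exp(-phi * d)
for d in (0.1, 0.5, 2.0):
    assert abs(matern(d, phi=3.0, nu=0.5) - math.exp(-3.0 * d)) < 1e-12
```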
Choosing neighbor sets
- Spatial covariance functions decay with distance
- Vecchia (1988): N(s_i) = the m nearest neighbors of s_i among s_1, s_2, ..., s_{i-1}
- Nearest points have the highest correlations
- Theory: screening effect (Stein, 2002)
- We use Vecchia's choice of m nearest neighbors
- Other choices, proposed in Stein et al. (2004), Gramacy and Apley (2015) and Guinness (2016), can also be used
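Vecchia's neighbor choice is straightforward to implement (a minimal sketch, assuming Euclidean distance; the example points are arbitrary):

```python
import math

def vecchia_neighbors(locs, m):
    """N(i): the (at most) m nearest of the earlier locations s_1, ..., s_{i-1}."""
    nbrs = [[]]                                    # s_1 has no earlier neighbors
    for i in range(1, len(locs)):
        earlier = sorted(range(i), key=lambda j: math.dist(locs[i], locs[j]))
        nbrs.append(sorted(earlier[:m]))
    return nbrs

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.2, 0.6)]
print(vecchia_neighbors(pts, m=2))   # → [[], [0], [0, 1], [0, 1], [0, 2]]
```

Note the sets depend on the ordering of the locations; the slides' DAG is built on exactly this ordered structure.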
Nearest neighbors
Sparse precision matrices
The neighbor sets and the covariance function C(·, ·) define a sparse Cholesky factor A and diagonal matrix D.
The NNGP replaces N(w | 0, C) with N(w | 0, C̃), where C̃^{-1} = (I - A)' D^{-1} (I - A).
- det(C̃) = ∏_{i=1}^n d_i
- C̃^{-1} is sparse with O(nm²) non-zero entries
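The sparsity claim is simple arithmetic: row i of A has min(i - 1, m) non-zero entries, so A has O(nm) non-zeros and the precision (I - A)' D^{-1} (I - A) has O(nm²). A small check (n = 2500 and m = 10 are just example sizes, matching the simulation later):

```python
def a_nonzeros(n, m):
    """Non-zero entries of the sparse factor A: row i has min(i - 1, m)."""
    return sum(min(i - 1, m) for i in range(1, n + 1))

n, m = 2500, 10
sparse = a_nonzeros(n, m)
dense = n * (n - 1) // 2        # strictly lower triangle of a dense factor
print(sparse, dense)            # 24945 vs 3123750
```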
Extension to a process
- We have defined w ~ N(0, C̃) over the set of data locations S = {s_1, s_2, ..., s_n}
- For s ∉ S, define N(s) as the set of m nearest neighbors of s in S
- Define w(s) = Σ_{i: s_i ∈ N(s)} a_i(s) w(s_i) + η(s), where η(s) ~ N(0, d(s)) independently
- a_i(s) and d(s) are once again obtained by solving an m × m system
- This yields a well-defined GP over the entire domain: the Nearest Neighbor GP (NNGP); Datta et al., JASA (2016)
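At a new location s the same m × m computation yields a(s) and d(s). A self-contained sketch (assumed exponential covariance and arbitrary toy locations, not the authors' code):

```python
import math

def solve(M, b):
    """Gaussian elimination with partial pivoting, for small m x m systems."""
    n = len(b)
    M = [row[:] for row in M]
    b = b[:]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n):
                M[r][c] -= f * M[k][c]
            b[r] -= f * b[k]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (b[k] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

def cov(u, v, phi=5.0):
    """Assumed exponential covariance with unit variance."""
    return math.exp(-phi * math.dist(u, v))

def nngp_weights(s, S, m):
    """Neighbor set N(s), weights a(s) = C_N^{-1} c(s), and d(s) for a new s."""
    N = sorted(range(len(S)), key=lambda j: math.dist(s, S[j]))[:m]
    CN = [[cov(S[i], S[j]) for j in N] for i in N]   # m x m neighbor covariance
    cN = [cov(s, S[j]) for j in N]
    a = solve(CN, cN)
    d = cov(s, s) - sum(ak * ck for ak, ck in zip(a, cN))
    return N, a, d

S = [(0.1, 0.2), (0.8, 0.3), (0.4, 0.9), (0.6, 0.6), (0.2, 0.8)]
N, a, d = nngp_weights((0.5, 0.5), S, m=3)
assert 0.0 < d < 1.0   # conditioning on neighbors shrinks the variance
```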
Hierarchical spatial regression with NNGP
Spatial linear model: y(s) = x(s)' β + w(s) + ε(s)
- w(s) is modeled as an NNGP derived from a GP(0, C(·, · | σ², φ))
- ε(s) iid N(0, τ²) contributes the nugget
- Priors for the parameters β, σ², τ² and φ
- The only difference from a full GP model is the NNGP prior on w(s)
Hierarchical spatial regression with NNGP
Full Bayesian model:
N(y | Xβ + w, τ² I) × N(w | 0, C̃(σ², φ)) × N(β | μ_β, V_β) × IG(τ² | a_τ, b_τ) × IG(σ² | a_σ, b_σ) × Unif(φ | a_φ, b_φ)
Gibbs sampler:
- Conjugate full conditionals for β, τ², σ² and the w(s_i)
- Metropolis step for updating φ
Posterior predictive distribution at any location s via composition sampling:
∫ N(y(s) | x(s)' β + w(s), τ²) N(w(s) | a(s)' w_{N(s)}, d(s)) p(w, β, τ², σ², φ | y) d(w, β, τ², σ², φ)
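To show the flavour of one conjugate update, here is a deliberately scalar toy: the full conditional of an intercept β_0 in y_i = β_0 + w_i + ε_i with prior N(μ_0, v_0), all other parameters held fixed (an assumed simplification of the actual block update for β, not the slides' algorithm):

```python
def intercept_full_conditional(y, w, tau2, mu0, v0):
    """Posterior mean and variance of beta_0 given w and tau2 in
    y_i = beta_0 + w_i + eps_i, with prior beta_0 ~ N(mu0, v0)."""
    n = len(y)
    prec = 1.0 / v0 + n / tau2                     # posterior precision
    mean = (mu0 / v0 + sum(yi - wi for yi, wi in zip(y, w)) / tau2) / prec
    return mean, 1.0 / prec

y = [1.9, 2.1, 2.0, 1.8, 2.2]
w = [0.0] * 5
mean, var = intercept_full_conditional(y, w, tau2=0.1, mu0=0.0, v0=1e6)
assert abs(mean - 2.0) < 1e-4   # nearly flat prior: close to the sample mean
```

With a nearly flat prior the posterior mean approaches the average of the residuals y_i - w_i, as expected; inside the Gibbs sampler this draw alternates with the updates of τ², σ², φ and the w(s_i).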
Choosing m
- Run the NNGP in parallel for a few values of m
- Choose m based on model evaluation metrics
- Our results suggest that typically m ≤ 20 yields excellent approximations to the full GP
Storage and computation
Storage:
- Never need to store the n × n distance matrix; store smaller m × m matrices
- Total storage requirement is O(nm²)
Computation:
- Only involves inverting small m × m matrices
- Total flop count per iteration of the Gibbs sampler is O(nm³)
Since m ≪ n, the NNGP offers great scalability for large datasets.
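A back-of-envelope comparison makes the orders of magnitude concrete (assuming 8-byte doubles and a cubic dense-Cholesky cost; the exact constants depend on precision and on exploiting symmetry):

```python
n, m = 100_000, 10
dense_storage = n * n * 8          # bytes for a dense n x n double matrix
nngp_storage = n * m * m * 8       # n small m x m neighbor matrices
dense_flops = n ** 3               # dense Cholesky, order of magnitude
nngp_flops = n * m ** 3            # n solves of size m per Gibbs iteration
print(dense_storage / 1e9, "GB vs", nngp_storage / 1e9, "GB")
print("flop ratio ~", dense_flops // nngp_flops)
```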
Simulation experiments
- 2500 locations on a unit square
- y(s) = β_0 + β_1 x(s) + w(s) + ε(s)
- Single covariate x(s) generated as iid N(0, 1)
- Spatial effects generated from GP(0, σ² R(·, · | φ)), where R(·, · | φ) is the exponential correlation function with decay φ
- Candidate models: full GP, Gaussian Predictive Process (GPP) with 64 knots, and NNGP
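A generator for such a synthetic dataset can be sketched as follows (illustrative only: the parameter values and sizes are assumed for the demo, and the dense Cholesky used to draw w is fine at toy size even though it is exactly what the NNGP avoids at scale):

```python
import math
import random

random.seed(7)

def cholesky(C):
    """Lower-triangular L with L L' = C (dense; fine for toy n)."""
    n = len(C)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = C[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

n = 50                                             # toy size, not 2500
locs = [(random.random(), random.random()) for _ in range(n)]
beta0, beta1, sigma2, tau2, phi = 1.0, 5.0, 1.0, 0.1, 12.0   # assumed values
C = [[sigma2 * math.exp(-phi * math.dist(u, v)) for v in locs] for u in locs]
L = cholesky(C)
z = [random.gauss(0.0, 1.0) for _ in range(n)]
w = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]  # w ~ N(0, C)
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [beta0 + beta1 * x[i] + w[i] + random.gauss(0.0, math.sqrt(tau2))
     for i in range(n)]
```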
Fitted surfaces
Figure: Univariate synthetic data analysis (panels: true w; full GP; GPP, 64 knots; NNGP, m = 10; NNGP, m = 20)
Parameter estimates (posterior medians with 95% credible intervals)

Parameter | NNGP, m = 10 | NNGP, m = 20 | Predictive Process, 64 knots | Full Gaussian Process
β_0 | (0.62, 1.31) | 1.03 (0.65, 1.34) | 1.30 (0.54, 2.03) | 1.03 (0.69, 1.34)
β_1 | (4.99, 5.03) | 5.01 (4.99, 5.03) | 5.03 (4.99, 5.06) | 5.01 (4.99, 5.03)
σ² | (0.78, 1.23) | 0.94 (0.77, 1.20) | 1.29 (0.96, 2.00) | 0.94 (0.76, 1.23)
τ² | (0.08, 0.13) | 0.10 (0.08, 0.13) | 0.08 (0.04, 0.13) | 0.10 (0.08, 0.12)
φ | (9.70, 16.77) | (9.99, 17.15) | 5.61 (3.48, 8.09) | (9.92, 17.50)
Model evaluation
Metrics compared: DIC score, RMSPE and run time (minutes), for NNGP (m = 10 and m = 20), the Predictive Process (64 knots) and the full Gaussian Process.
- NNGP performs at par with the full GP
- GPP oversmooths and performs much worse, both in parameter estimation and in model comparison
- NNGP yields huge computational gains
Multivariate spatial data
Point-referenced spatial data often come as multivariate measurements at each location. Examples:
- Environmental monitoring: stations yield measurements on ozone, NO, CO, and PM 2.5
- Forestry: measurements of stand characteristics such as age, total biomass, and average tree diameter
- Atmospheric modeling: at a given site we observe surface temperature, precipitation and wind speed
We anticipate dependence between measurements both at a particular location and across locations.
Multivariate spatial linear model
Spatial linear model for q-variate spatial data:
y_i(s) = x_i(s)' β_i + w_i(s) + ε_i(s), for i = 1, 2, ..., q
- ε(s) = (ε_1(s), ε_2(s), ..., ε_q(s))' ~ N(0, E), where E is the q × q noise covariance matrix
- w(s) = (w_1(s), w_2(s), ..., w_q(s))' is modeled as a q-variate Gaussian process
Spatially varying coefficients
- Often the relationship between the (univariate) spatial response and the covariates varies across space
- The regression coefficients can then be modeled as spatial processes
- Spatially varying coefficient (SVC) model: y(s) = x(s)' β(s) + ε(s)
- Even though the response is univariate, β(s) is modeled as a p-variate GP
Multivariate GPs
- Cov(w(s_i), w(s_j)) = C(s_i, s_j | θ), a q × q cross-covariance matrix
- Choices for the function C(·, · | θ): multivariate Matérn; linear model of coregionalization
- For data observed at n locations, all choices lead to a dense nq × nq matrix C = Var(w(s_1), w(s_2), ..., w(s_n))
- Not scalable when nq is large
Multivariate NNGPs
Cholesky factor approach similar to the univariate case:
(w(s_1)', w(s_2)', ..., w(s_n)')' = A (w(s_1)', w(s_2)', ..., w(s_n)')' + (η(s_1)', η(s_2)', ..., η(s_n)')',
i.e. w = A w + η, with η ~ N(0, D), D = diag(D_1, D_2, ..., D_n), and A block strictly lower triangular with blocks A_ij.
The only differences: the w(s_i) and η(s_i) are q × 1 vectors, and the A_ij and D_i are q × q matrices.
Multivariate NNGPs
- Choose neighbor sets N(i) for each location s_i
- Set A_ij = 0 if j ∉ N(i)
- Solve for the non-zero A_ij from the mq × mq linear system: Σ_{j ∈ N(i)} A_ij w(s_j) = E(w(s_i) | {w(s_j) : j ∈ N(i)})
- Multivariate NNGP: w ~ N(0, C̃), where C̃^{-1} = (I - A)' D^{-1} (I - A)
- C̃^{-1} is sparse with O(nm²) non-zero q × q blocks
- det(C̃) = ∏_{i=1}^n det(D_i)
- Storage and computational needs remain linear in n
U.S. Forest biomass data
Figure panels: observed biomass; NDVI
- Forest biomass data from measurements at 114,371 plots
- NDVI (greenness) is used to predict forest biomass
U.S. Forest biomass data
Non-spatial model: Biomass = β_0 + β_1 NDVI + error; β̂_0 = 1.043, β̂_1 =
Figure panels: residuals; variogram of residuals
Strong spatial pattern among the residuals
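The variogram diagnostic used here is easy to compute. A minimal empirical semivariogram (an illustrative sketch that bins 0.5 (r_i - r_j)² by pairwise distance; the smooth toy surface stands in for spatially patterned residuals):

```python
import math

def semivariogram(locs, vals, n_bins=5):
    """Empirical semivariogram: mean of 0.5 (v_i - v_j)^2, binned by distance."""
    pairs = []
    for i in range(len(locs)):
        for j in range(i + 1, len(locs)):
            pairs.append((math.dist(locs[i], locs[j]),
                          0.5 * (vals[i] - vals[j]) ** 2))
    max_d = max(d for d, _ in pairs)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for d, g in pairs:
        k = min(int(n_bins * d / max_d), n_bins - 1)
        sums[k] += g
        counts[k] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

# A spatially smooth surface gives a semivariogram that rises with distance,
# which is the "strong spatial pattern" signature seen in the residuals.
grid = [(x / 5.0, y / 5.0) for x in range(6) for y in range(6)]
smooth = [x + y for x, y in grid]
gamma = semivariogram(grid, smooth)
assert gamma[0] < gamma[-1]
```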
Forest biomass dataset
- n ≈ 10^5, so a full GP requires ≈ 40 GB of storage and ≈ 140 hours per iteration
- We use a spatially varying coefficients NNGP model
- Model: Biomass(s) = β_0(s) + β_1(s) NDVI(s) + ε(s)
- w(s) = (β_0(s), β_1(s))' ~ bivariate NNGP(0, C(· | θ)), m = 5
- Time ≈ 6 seconds per iteration
- Full inferential output: 41 hours (25,000 MCMC iterations)
Forest biomass data
Figure panels: observed biomass; fitted biomass; β_0(s); β_NDVI(s)
Reducing parameter dimensionality
- The Gibbs sampler for the NNGP updates w(s_1), w(s_2), ..., w(s_n) sequentially
- The dimension of the MCMC for this sequential algorithm is O(n)
- If the number of data locations n is very large, this high-dimensional MCMC can converge slowly
- Although each iteration of the NNGP model is very fast, many more MCMC iterations may be required
Collapsed NNGP
Same model: y(s) = x(s)' β + w(s) + ε(s), with w(s) ~ NNGP(0, C̃(· | θ)) and ε(s) iid N(0, τ²)
Vector form: y ~ N(Xβ + w, τ² I); w ~ N(0, C̃(θ))
Collapsed model: marginalizing out w gives y ~ N(Xβ, τ² I + C̃(θ))
Collapsed NNGP
Model: y ~ N(Xβ, τ² I + C̃(θ))
- Only involves a few parameters: β, τ² and θ = (σ², φ)
- Drastically reduces the MCMC dimensionality
- Gibbs sampler updates are based on sparse linear systems involving C̃^{-1}
- Improved MCMC convergence
- Can recover the posterior distribution of w | y
- The complexity of the algorithm depends on the design of the data locations and is not guaranteed to be O(n)
Response NNGP
w(s) ~ GP(0, C(·, · | θ)) implies y(s) ~ GP(x(s)' β, Σ(·, · | τ², θ)), where Σ(s_i, s_j) = C(s_i, s_j | θ) + τ² δ(s_i = s_j) (δ is the Kronecker delta).
- We can directly derive the NNGP covariance function corresponding to Σ(·, ·)
- Σ̃ is then the NNGP covariance matrix for the n locations; response model: y ~ N(Xβ, Σ̃)
- Storage and computations are guaranteed to be O(n)
- Low-dimensional MCMC, improved convergence
- Cannot coherently recover w | y
Comparison of NNGP models

Property                 | Sequential | Collapsed | Response
O(n) time                | Yes        | No        | Yes
Recovery of w given y    | Yes        | Yes       | No
Parameter dimensionality | High       | Low       | Low
Comparison of NNGP models
Figure: Run time per iteration as a function of the number of locations for the different NNGP models

Comparison of NNGP models
Figure: MCMC convergence diagnostics using the Gelman-Rubin shrink factor for the different NNGP models on a simulated dataset
Summary of Nearest Neighbor Gaussian Processes
- Sparsity-inducing Gaussian process, constructed from sparse Cholesky factors based on m nearest neighbors
- Scalability: storage, inverse and determinant of the NNGP covariance matrix are all O(n)
- Proper Gaussian process: allows inference using hierarchical spatial models and predictions at arbitrary spatial resolution
- Closely approximates full GP inference; does not oversmooth like low rank models
- Extension to multivariate NNGP
- Collapsed and response NNGP models with improved MCMC convergence
- spNNGP package in R for analyzing large spatial data using NNGP models
More informationJournal of Statistical Software
JSS Journal of Statistical Software April 2007, Volume 19, Issue 4. http://www.jstatsoft.org/ spbayes: An R Package for Univariate and Multivariate Hierarchical Point-referenced Spatial Models Andrew O.
More informationGaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency
Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency Chris Paciorek March 11, 2005 Department of Biostatistics Harvard School of
More informationFusing point and areal level space-time data. data with application to wet deposition
Fusing point and areal level space-time data with application to wet deposition Alan Gelfand Duke University Joint work with Sujit Sahu and David Holland Chemical Deposition Combustion of fossil fuel produces
More informationComparing Non-informative Priors for Estimation and Prediction in Spatial Models
Environmentrics 00, 1 12 DOI: 10.1002/env.XXXX Comparing Non-informative Priors for Estimation and Prediction in Spatial Models Regina Wu a and Cari G. Kaufman a Summary: Fitting a Bayesian model to spatial
More informationST 740: Linear Models and Multivariate Normal Inference
ST 740: Linear Models and Multivariate Normal Inference Alyson Wilson Department of Statistics North Carolina State University November 4, 2013 A. Wilson (NCSU STAT) Linear Models November 4, 2013 1 /
More informationHierarchical Linear Models
Hierarchical Linear Models Statistics 220 Spring 2005 Copyright c 2005 by Mark E. Irwin The linear regression model Hierarchical Linear Models y N(Xβ, Σ y ) β σ 2 p(β σ 2 ) σ 2 p(σ 2 ) can be extended
More informationScalable MCMC for the horseshoe prior
Scalable MCMC for the horseshoe prior Anirban Bhattacharya Department of Statistics, Texas A&M University Joint work with James Johndrow and Paolo Orenstein September 7, 2018 Cornell Day of Statistics
More informationWrapped Gaussian processes: a short review and some new results
Wrapped Gaussian processes: a short review and some new results Giovanna Jona Lasinio 1, Gianluca Mastrantonio 2 and Alan Gelfand 3 1-Università Sapienza di Roma 2- Università RomaTRE 3- Duke University
More informationBayesian Learning in Undirected Graphical Models
Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul
More informationBayesian inference & process convolution models Dave Higdon, Statistical Sciences Group, LANL
1 Bayesian inference & process convolution models Dave Higdon, Statistical Sciences Group, LANL 2 MOVING AVERAGE SPATIAL MODELS Kernel basis representation for spatial processes z(s) Define m basis functions
More informationMultivariate spatial models and the multikrig class
Multivariate spatial models and the multikrig class Stephan R Sain, IMAGe, NCAR ENAR Spring Meetings March 15, 2009 Outline Overview of multivariate spatial regression models Case study: pedotransfer functions
More informationMotivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University
Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined
More informationStatistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling
1 / 27 Statistical Machine Learning Lecture 8: Markov Chain Monte Carlo Sampling Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 27 Monte Carlo Integration The big question : Evaluate E p(z) [f(z)]
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationGeostatistical Modeling for Large Data Sets: Low-rank methods
Geostatistical Modeling for Large Data Sets: Low-rank methods Whitney Huang, Kelly-Ann Dixon Hamil, and Zizhuang Wu Department of Statistics Purdue University February 22, 2016 Outline Motivation Low-rank
More informationThe Bayesian approach to inverse problems
The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu
More information17 : Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo
More informationTowards a Complete Picture of Covariance Functions on Spheres Cross Time
Towards a Complete Picture of Covariance Functions on Spheres Cross Time Philip White 1 and Emilio Porcu 2 arxiv:1807.04272v1 [stat.me] 11 Jul 2018 1 Department of Statistical Science, Duke University,
More informationBayesian non-parametric model to longitudinally predict churn
Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics
More information20: Gaussian Processes
10-708: Probabilistic Graphical Models 10-708, Spring 2016 20: Gaussian Processes Lecturer: Andrew Gordon Wilson Scribes: Sai Ganesh Bandiatmakuri 1 Discussion about ML Here we discuss an introduction
More informationA Bayesian perspective on GMM and IV
A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all
More informationCBMS Lecture 1. Alan E. Gelfand Duke University
CBMS Lecture 1 Alan E. Gelfand Duke University Introduction to spatial data and models Researchers in diverse areas such as climatology, ecology, environmental exposure, public health, and real estate
More informationParameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets
Parameter Estimation in the Spatio-Temporal Mixed Effects Model Analysis of Massive Spatio-Temporal Data Sets Matthias Katzfuß Advisor: Dr. Noel Cressie Department of Statistics The Ohio State University
More informationSTAT 518 Intro Student Presentation
STAT 518 Intro Student Presentation Wen Wei Loh April 11, 2013 Title of paper Radford M. Neal [1999] Bayesian Statistics, 6: 475-501, 1999 What the paper is about Regression and Classification Flexible
More informationHastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model
UNIVERSITY OF TEXAS AT SAN ANTONIO Hastings-within-Gibbs Algorithm: Introduction and Application on Hierarchical Model Liang Jing April 2010 1 1 ABSTRACT In this paper, common MCMC algorithms are introduced
More informationVariable Selection in Structured High-dimensional Covariate Spaces
Variable Selection in Structured High-dimensional Covariate Spaces Fan Li 1 Nancy Zhang 2 1 Department of Health Care Policy Harvard University 2 Department of Statistics Stanford University May 14 2007
More informationA Bayesian Treatment of Linear Gaussian Regression
A Bayesian Treatment of Linear Gaussian Regression Frank Wood December 3, 2009 Bayesian Approach to Classical Linear Regression In classical linear regression we have the following model y β, σ 2, X N(Xβ,
More informationGauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA
JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter
More information