A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging


1 A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging
Cheng Li, DSAP, National University of Singapore
Joint work with Rajarshi Guhaniyogi (UC Santa Cruz), Terrance D. Savitsky (US Bureau of Labor Statistics), and Sanvesh Srivastava (U Iowa)

2 Outline Background of GP D&C Bayes Method Theory Experiments Future Work

3 Background: Gaussian Process
A Gaussian process is a commonly used tool to model functions (surfaces).
- Let s_1, ..., s_n be n arbitrary distinct locations in a spatial domain D.
- f(·) is the mean function (which we want to model); f_i = f(s_i) (i = 1, ..., n) are the function values at the locations s_i.
- A Gaussian process assumes that f = (f_1, ..., f_n)' jointly follows N(0, C), where (C)_{ij} = C(s_i, s_j), 1 ≤ i, j ≤ n, and C(·,·) is the covariance kernel function.
- Examples of covariance kernel functions: the exponential C(s, s') = τ² exp(−φ‖s − s'‖), the squared exponential C(s, s') = τ² exp(−φ‖s − s'‖²), and the Matérn C(s, s') = τ² 2^{1−ν}/Γ(ν) (φ‖s − s'‖)^ν K_ν(φ‖s − s'‖).

4 Fitting Gaussian Process Regression
Consider fitting a function f with noise: y(s) = f(s) + ε(s).
- Observed locations S = (s_1, ..., s_n); y = (y(s_1), ..., y(s_n))'; f = (f(s_1), ..., f(s_n))'.
- We impose a Gaussian process (GP) prior on f: f ~ N(0, C(S, S)), with (C(S, S))_{ij} = C(s_i, s_j), 1 ≤ i, j ≤ n.
- ε(s) is a white noise process independent of f, with Var(ε(s)) = σ².
- Let f* = f(s*) be the function value at a testing location s*. Then the joint distribution of (y, f*) is
  (y, f*)' ~ N( 0, [ C(S, S) + σ² I_n, C(S, s*); C(s*, S), C(s*, s*) ] ).
- The GP posterior of f* given the observed data y is Gaussian, with
  E[f* | y] = C(s*, S) [C(S, S) + σ² I_n]^{−1} y,
  Var[f* | y] = C(s*, s*) − C(s*, S) [C(S, S) + σ² I_n]^{−1} C(S, s*).
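
Below is a minimal numpy sketch of these kriging equations, assuming the exponential kernel from the previous slide; it illustrates the formulas above and is not the authors' implementation (all function names are ours).

```python
import numpy as np

def exp_kernel(S1, S2, tau2=1.0, phi=1.0):
    """Exponential covariance C(s, s') = tau^2 * exp(-phi * ||s - s'||)."""
    d = np.linalg.norm(S1[:, None, :] - S2[None, :, :], axis=-1)
    return tau2 * np.exp(-phi * d)

def gp_posterior(S, y, S_star, tau2, phi, sigma2):
    """Posterior mean and covariance of f at test locations S_star."""
    V = exp_kernel(S, S, tau2, phi) + sigma2 * np.eye(len(S))  # C(S,S) + sigma^2 I_n
    k = exp_kernel(S, S_star, tau2, phi)                       # C(S, s*)
    mean = k.T @ np.linalg.solve(V, y)
    cov = exp_kernel(S_star, S_star, tau2, phi) - k.T @ np.linalg.solve(V, k)
    return mean, cov
```

Solving the linear system rather than forming the matrix inverse is standard practice, but the cost is still cubic in n, which is exactly the bottleneck discussed next.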

5 Long-standing Challenge in Fitting GP
- Inverting the matrix C(S, S) + σ² I_n requires O(n³) computational cost and O(n²) storage cost.
- For Bayesian methods, learning the parameters in the model (such as the parameters in the kernel C(·,·) and σ²) often requires inverting the matrix C(S, S) + σ² I_n repeatedly.
- For commonly used kernels, such as the squared exponential and Matérn kernels, the matrix C(S, S) tends to become nearly singular when n grows large, say n ≥ 10⁴.
- This is one of the main motivations driving research on scalable GP methods.

6 Existing Methods to Speed up GP
The general idea is to simplify the structure of C(·,·).
- Low-rank methods: fit the spatial surface f using r chosen basis functions or knots. Fixed-rank kriging (Cressie & Johannesson 2008 JRSSB), predictive process (Banerjee et al. 2008 JRSSB, Finley et al. 2009 CSDA).
- Compactly supported covariance kernels: covariance tapering (Kaufman, Schervish & Nychka 2008 JASA).
- Sparsity in the inverse covariance matrix: composite likelihood (Stein, Chi & Welty 2004 JRSSB), nearest neighbor GP (Datta et al. 2016 JASA, Datta et al. 2016 AOAS).
- Block-structured covariance: fitting GP on sub-regions (Nychka et al. 2015 JCGS, Katzfuss 2017 JASA, Guhaniyogi and Banerjee 2018 JCGS).
- Related methods in machine learning: subset of regressors (Quinonero-Candela & Rasmussen 2005 JMLR); inducing point methods (Wilson & Nickisch 2015 ICML), etc.

7 Problems with Existing Methods
- Low-rank methods are popular for large spatial datasets. In this work, we compare mainly with the modified predictive process (MPP) (Finley et al. 2009 CSDA).
- MPP reduces the computational complexity of GP from O(n³) to O(nr²), where r is the number of knots (chosen basis functions).
- However, a very large n also requires a much larger r (though still far smaller than n) to achieve the same level of accuracy. For example, if n is about a million, then the number of knots needs to be a few thousand.
- Nearest neighbor GP (NNGP, Datta et al. 2016 JASA, Datta et al. 2016 AOAS) works well for rough surfaces but not as well for smooth surfaces.

8 Goals
- We need fast and scalable computational methods that fit GPs to spatial datasets with millions of observations in a reasonable amount of time (e.g. about a day).
- We develop under the Bayesian framework, with posterior samplers that are easy to implement in practice.
- We need theoretical guarantees for the validity of uncertainty quantification.

9 Outline Background of GP D&C Bayes Method Theory Experiments Future Work

10 Divide-and-Conquer Approach for Big Data
General steps:
- Divide the massive dataset of size n into k non-overlapping subsets;
- Perform statistical inference (estimation, posterior sampling, etc.) on each of the k subsets;
- Combine the k subset results into a global result (averaged estimator, posterior distribution, etc.).
Divide-and-conquer is simple and effective: it allows parallel computing and significantly reduces computational costs locally, since each machine only handles data of size m = n/k. A minimal partitioning sketch follows.
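
The dividing step is just a uniform random partition of the data indices; a short numpy sketch (our own illustration, with hypothetical names):

```python
import numpy as np

def random_partition(n, k, seed=0):
    """Uniformly partition indices {0, ..., n-1} into k disjoint, near-equal subsets."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

# Each subset of size m = n/k can then be handled by its own worker in parallel.
subsets = random_partition(n=10_000, k=20)
```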

11 Divide and Conquer Bayes
- Consensus Monte Carlo (CMC, Scott et al. 2016)
- Semiparametric density product (SDP, Neiswanger et al. 2014)
- Wasserstein posterior (WASP, Srivastava et al. 2015), PIE (Li et al. 2017)
- Double-parallel MCMC (DP-MCMC, Xue & Liang 2017)
Basic idea: when the data are all independent,
Posterior ∝ Likelihood × Prior ∝ [∏_{j=1}^k (jth subset likelihood)] × prior
= ∏_{j=1}^k [(jth subset likelihood) × prior^{1/k}]   (CMC, SDP)
= ∏_{j=1}^k [(jth subset likelihood)^k × prior]^{1/k}   (WASP, PIE, DP-MCMC)

12 Divide and Conquer Bayes Idea: Run MCMC in parallel on disjoint subset data, and combine the subset posterior distributions into a global posterior.

13 Divide and Conquer Bayes
PIE (Li, Srivastava, and Dunson 2017): average the subset posterior empirical quantiles. This leads to the Wasserstein-2 barycenter, as an approximation to the full-data posterior.

14 PIE Algorithm in 3 Steps
For a 1-dimensional parameter ξ ∈ R:
- Step 1: Run MCMC on the k subsets on k machines in parallel. Draw posterior samples from each subset posterior; obtain the posterior samples for ξ.
- Step 2: Find the qth empirical quantile of ξ for each of the k subset posterior distributions, where q ∈ (0, 1) ranges over a fine grid on (0, 1).
- Step 3: Average the k subset quantiles.
Output: averaged quantiles for ξ = a combined distribution for ξ, as an approximation to the true posterior of ξ based on the full data, together with posterior credible intervals. For independent data, this method (as well as many similar ones) is asymptotically valid for approximating the true full-data posterior of ξ. A sketch of the combination step follows.
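
Steps 2-3 amount to averaging quantile functions on a grid; a minimal numpy sketch (our illustration, not the authors' code):

```python
import numpy as np

GRID = np.linspace(0.001, 0.999, 999)   # fine grid of quantile levels q in (0, 1)

def pie_combine(subset_samples, grid=GRID):
    """Average the k subset posterior quantile functions over the grid (Steps 2-3).

    subset_samples: list of k 1-d arrays of MCMC draws for a scalar parameter xi.
    Returns the quantile function of the combined (barycenter) posterior on the grid.
    """
    quantiles = np.stack([np.quantile(draws, grid) for draws in subset_samples])
    return quantiles.mean(axis=0)

# A 95% credible interval from the combined posterior:
# lo, hi = np.interp([0.025, 0.975], GRID, pie_combine(samples_by_subset))
```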

15 Wasserstein Barycenter
The Wasserstein-2 (W2) distance between two measures µ and ν:
W2²(µ, ν) = inf { E‖X − Y‖² : law(X) = µ, law(Y) = ν }.
For univariate distribution functions F1 and F2, their W2 distance has an equivalent form:
W2²(F1, F2) = ∫₀¹ [F1⁻¹(q) − F2⁻¹(q)]² dq,
where F1⁻¹, F2⁻¹ are the quantile functions of F1, F2. The Wasserstein-2 barycenter of k distributions µ1, ..., µk is
µ̄ = argmin_{µ ∈ P2} (1/k) ∑_{j=1}^k W2²(µ, µj),
where P2 is the space of all distributions with finite 2nd moment.
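
For two empirical distributions with the same number of atoms, the quantile-function formula reduces to matching sorted samples; a one-line check of this identity (our illustration):

```python
import numpy as np

def w2_squared(x, y):
    """Squared W2 distance between two equal-size empirical distributions:
    the integral of squared quantile differences becomes a mean over sorted pairs."""
    return np.mean((np.sort(x) - np.sort(y)) ** 2)
```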

16 Outline Background of GP D&C Bayes Method Theory Experiments Future Work

17 A GP-based Spatial Model
In this work, we consider a generalized GP regression model with spatial predictors:
y(s) = x(s)'β + w(s) + ε(s).
- Observed locations S = {s_1, ..., s_n}; y = (y(s_1), ..., y(s_n))'; w = (w(s_1), ..., w(s_n))'.
- x(s) is a p-dimensional predictor vector at location s; X = (x(s_1), ..., x(s_n))'.
- We impose a Gaussian process (GP) prior on w: w ~ N(0, C_α(S, S)), with (C_α(S, S))_{ij} = C_α(s_i, s_j), 1 ≤ i, j ≤ n.
- ε(s) is a white noise process independent of w and x, with Var(ε(s)) = σ².
- Let s* be a testing location with s* ∉ S. Let w* = w(s*), y* = y(s*), x* = x(s*).
- The prior for β is N(µ_β, Σ_β). Priors are also imposed on α and σ².

18 Posterior Sampling Algorithm
Iterate the following steps:
- Sample β given y, X, α, σ² from N(m_β, V_β), where
V_β = [X' V(α, σ²)⁻¹ X + Σ_β⁻¹]⁻¹,
m_β = V_β [X' V(α, σ²)⁻¹ y + Σ_β⁻¹ µ_β],
V(α, σ²) = C_α(S, S) + σ² I_n.
- Sample (α, σ²) given y, X, β using random walk Metropolis-Hastings.
Then w*, y* can be drawn as
w* | y, X, β, α, σ² ~ N(m*, V*), where
m* = C_α(s*, S) V(α, σ²)⁻¹ (y − Xβ),
V* = C_α(s*, s*) − C_α(s*, S) V(α, σ²)⁻¹ C_α(S, s*),
y* | y, X, x*, w*, β, α, σ² ~ N(x*'β + w*, σ²).
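
A sketch of the conjugate β update in numpy (our illustration; the random-walk update for (α, σ²) is omitted):

```python
import numpy as np

def gibbs_beta(y, X, V, mu_beta, Sigma_beta, rng):
    """Draw beta | y, X, alpha, sigma^2 from N(m_beta, V_beta),
    where V = C_alpha(S, S) + sigma^2 I_n for the current (alpha, sigma^2)."""
    Vinv_X = np.linalg.solve(V, X)                    # V^{-1} X
    Sinv = np.linalg.inv(Sigma_beta)
    V_beta = np.linalg.inv(X.T @ Vinv_X + Sinv)
    m_beta = V_beta @ (X.T @ np.linalg.solve(V, y) + Sinv @ mu_beta)
    return rng.multivariate_normal(m_beta, V_beta)
```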

19 Distributed Kriging (DISK)
What can we do for big n, like n = 10⁶? We propose a D&C Bayesian method called DISK. In short, DISK = (existing D&C Bayes) + (spatially dependent data). We fit the GP-based spatial model on k different subsets of the data, and then combine the results.
Partition scheme: uniform sampling, so that every subset is representative of the full data.
For example, if PIE (posterior interval estimation, Li et al. 2017) is used, then for each parameter in (β, α, σ², w*, y*), the marginal posterior is obtained by the following steps:

20 DISK with PIE
- Partition the locations into k subsets S = ∪_{j=1}^k S_j, with S_j ∩ S_{j'} = ∅. Let n = km, where n is the total sample size and m is the subset size.
- Fit the GP regression model on the jth subset (j = 1, ..., k),
y(s_{ji}) = x(s_{ji})'β + w(s_{ji}) + ε(s_{ji}), i = 1, ..., m,
using the modified subset posterior
π_m(β, α, σ² | y_j, X_j) ∝ {l_j(β, α, σ²)}^k π(β, α, σ²),
l_j(β, α, σ²) = N(y_j | X_j β, V_j(α, σ²)), V_j(α, σ²) = C_α(S_j, S_j) + σ² I_m,
where (y_j, X_j) is the jth subset data at locations S_j.
- Take the qth empirical quantiles of the marginals of π_m(β, α, σ² | y_j, X_j) for j = 1, ..., k, and then average them.
- Perform the same combination for the predictive posteriors of w* and y*.
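
The distinguishing ingredient is the subset likelihood raised to the power k, so each subset posterior carries the full-data amount of information; a sketch with (α, σ²) held fixed (our illustration, not the authors' code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def disk_subset_logpost(beta, y_j, X_j, C_j, sigma2, k, log_prior):
    """Modified subset log-posterior: k times the subset log-likelihood plus the log-prior.
    C_j = C_alpha(S_j, S_j) for the current alpha; (alpha, sigma2) are held fixed here."""
    V_j = C_j + sigma2 * np.eye(len(y_j))              # V_j(alpha, sigma^2)
    loglik = multivariate_normal.logpdf(y_j, mean=X_j @ beta, cov=V_j)
    return k * loglik + log_prior(beta)
```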

21 Outline Background of GP D&C Bayes Method Theory Experiments Future Work

22 Theory
Our theoretical results mainly focus on the posterior of w*, the value of the function w at a testing location s*. We present upper bounds for the Bayes L2 risk, and a Bayesian version of the bias-variance tradeoff. We first show the results for the simplified model y(s) = w(s) + ε(s) without the x(s)'β part, and then generalize to the full model y(s) = x(s)'β + w(s) + ε(s). To ease the technical difficulty, we assume that the finite-dimensional parameters (α, σ²) are fixed (because their estimation is notoriously difficult). Furthermore, we assume that the testing location s* is selected independently from the same distribution that generates the training locations S = {s_1, ..., s_n}.

23 Bayes L2 Risk
The Bayes L2 risk of the DISK posterior is defined as
Bayes L2 risk = E_S E^S E_{s*} [w(s*) − w*(s*)]²,
where E_{s*} is the expectation w.r.t. the sampled testing location, E^S is the DISK posterior expectation given the training locations S, and E_S is the expectation w.r.t. the training locations. This Bayes L2 risk can be decomposed into 3 parts:
Bayes L2 risk = bias² + var_mean + var_DISK,
namely the squared bias, the variance of the subset posterior means (var_mean), and the mean of the subset posterior variances (var_DISK). The two variance terms arise from the law of total variance: var(Y) = var[E(Y | X)] + E[var(Y | X)].

24 Some Notation
- Let P_s be the sampling distribution of s_1, ..., s_n, s*. Let {φ_i}_{i=1}^∞ be an orthonormal basis of L²(P_s).
- The kernel has the series expansion C_α(s, s') = ∑_{i=1}^∞ µ_i φ_i(s) φ_i(s') w.r.t. P_s for any s, s' ∈ D, where µ_1 ≥ µ_2 ≥ ... ≥ 0 are the eigenvalues of C_α.
- The trace of the kernel C_α is defined as tr(C_α) = ∑_{i=1}^∞ µ_i.
- Any f ∈ L²(P_s) can be written as f(s) = ∑_{i=1}^∞ θ_i φ_i(s), where θ_i = ⟨f, φ_i⟩_{L²(P_s)}.
- The reproducing kernel Hilbert space (RKHS) H attached to C_α is the space of all functions f ∈ L²(P_s) such that the H-norm ‖f‖²_H = ∑_{i=1}^∞ θ_i²/µ_i < ∞.

25 Bayes Bias-Variance Tradeoff
More technical assumptions:
- Assumption 1: The true surface w* lies in the RKHS (reproducing kernel Hilbert space) of C_α. (This can be relaxed by using a sieve approximation.)
- Assumption 2: The covariance kernel C_α satisfies tr(C_α) < ∞.
- Assumption 3: There are positive constants ρ and r such that E_{P_s}[φ_i^{2r}(s)] ≤ ρ^{2r} for every i = 1, 2, ..., and var[ε(s)] ≤ σ² < ∞ for all s ∈ D.
We derive upper bounds for the three terms in the Bayes L2 risk, and show that (roughly speaking) for a given total sample size n, as the number of subsets k increases, var_DISK decreases while bias² + var_mean increases, so k trades the two off against each other.

26 Bayes Bias-Variance Tradeoff
The three terms admit explicit upper bounds. Each bound combines a leading term of order σ²‖w*‖²_H/n (or σ²/n) with an infimum over d ∈ N of remainder terms built from tr(C_α) tr_d(C_α), A b(m, d, r) ρ^{2r}, and γ(·) evaluated at points proportional to σ²/n; for example,
bias² ≤ (8σ²/n)‖w*‖²_H + ‖w*‖²_H inf_{d ∈ N} {·},
where the infimum collects the remainder terms. Here N is the set of all positive integers, A is a positive constant,
b(m, d, r) = max{ √(max(r, log d)), max(r, log d) / m^{1/2 − 1/r} },
γ(a) = ∑_{i=1}^∞ µ_i / (µ_i + a) for any a > 0, and tr_d(C_α) = ∑_{i=d+1}^∞ µ_i.

27 Examples of Covariance Kernels
Implications of the previous results for some kernel classes (with r > 4); in each case k may grow polynomially in n, up to logarithmic factors, at an explicit rate depending on r:
- Finite rank kernels: if µ_{d*+1} = µ_{d*+2} = ... = 0 for some integer d*, then the Bayes L2 risk is O(1/n).
- Exponentially decaying kernels: if µ_i ≤ c_µ exp(−c'_µ i^κ) for some constants c_µ > 0, c'_µ > 0, κ > 0 and all i ∈ N, then the Bayes L2 risk is O((log n)^{1/κ}/n).
- Polynomially decaying kernels: if µ_i ≤ c_µ i^{−2ν} for some constant c_µ > 0, an admissible range of ν depending on r, and all i ∈ N, then the Bayes L2 risk decays polynomially in n at a rate determined by ν (slightly slower than minimax; see the next slide).

28 Examples of Covariance Kernels
Some comments:
- The rates for finite rank kernels and exponentially decaying kernels are near minimax optimal, up to logarithmic factors.
- The rate for polynomially decaying kernels is suboptimal, but the difference is small: the minimax optimal rate is O(n^{−2ν/(2ν+1)}), and the gap between the two rates is a small polynomial factor in n.
- The sub-optimality is due to our fixed stochastic approximation (the subset likelihood is raised to the fixed power k). One can possibly attain the optimal rate by using an adaptive power in the subset likelihood.

29 Examples of Covariance Kernels
(Convenient) generalization to the full model y(s) = x(s)'β + w(s) + ε(s):
Extra assumption: x(·) is a deterministic function; the prior on β is normal N(µ_β, Σ_β) and is independent of the prior on w(·), which is GP(0, C_α(·,·)).
Based on this assumption, we can do a change of kernel:
Č_α(s_1, s_2) = Cov{ x(s_1)'β + w(s_1), x(s_2)'β + w(s_2) } = x(s_1)' Σ_β x(s_2) + C_α(s_1, s_2).
Our theoretical results hold for the mean function x(s)'β + w(s) in the full model, as long as all the previous conditions on C_α are satisfied by the new kernel Č_α. Since we only focus on the predictive loss for the mean function x(s)'β + w(s), we do not need any identification condition separating β and w(·).
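
The change of kernel is essentially one line of code; a sketch (our illustration, with hypothetical names, where C_alpha and x are callables):

```python
def kernel_check(C_alpha, x, Sigma_beta):
    """Marginalizing beta ~ N(mu_beta, Sigma_beta) out of x(s)'beta + w(s) gives
    the augmented kernel C_check(s1, s2) = x(s1)' Sigma_beta x(s2) + C_alpha(s1, s2)."""
    return lambda s1, s2: x(s1) @ Sigma_beta @ x(s2) + C_alpha(s1, s2)
```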

30 Outline Background of GP D&C Bayes Method Theory Experiments Future Work

31 Simulation Model 1: Smooth Surface
- D = [−2, 2] × [−2, 2]. f(s) = exp{−(s − 1)²} + exp{−0.8(s + 1)²} − 0.05 sin{8(s + 0.1)} for s ∈ [−2, 2]. The true surface is w*(s) = f(s_1) f(s_2), s = (s_1, s_2) ∈ D.
- The response is y(s) = β + w*(s) + ε, ε ~ N(0, σ²), with β = 1 and σ² = 0.1.
- We use the Matérn kernel with ν = 1/2: C(s, s') = τ² exp(−φ‖s − s'‖) for s, s' ∈ D, with inverse-Gamma priors on σ² and τ² and a uniform prior on φ.
- In the two simulation scenarios, we use n = 10⁴ and n = 10⁶, respectively. The training locations are sampled uniformly over D.
- We use l testing locations, also sampled uniformly over D. The Bayes L2 risks (and other criteria) are estimated by averaging over the l testing locations. All methods are replicated over repeated Monte Carlo simulations. A data-generating sketch follows.
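
A short numpy sketch of this data-generating process (our illustration; β = 1 and σ² = 0.1 as reconstructed above):

```python
import numpy as np

def f(s):
    """1-d test function on [-2, 2] from the slide above."""
    return (np.exp(-(s - 1) ** 2) + np.exp(-0.8 * (s + 1) ** 2)
            - 0.05 * np.sin(8 * (s + 0.1)))

rng = np.random.default_rng(1)
n, beta, sigma2 = 10_000, 1.0, 0.1            # the n = 10^4 scenario
S = rng.uniform(-2.0, 2.0, size=(n, 2))       # training locations, uniform over D
w_star = f(S[:, 0]) * f(S[:, 1])              # true surface w*(s) = f(s1) f(s2)
y = beta + w_star + rng.normal(0.0, np.sqrt(sigma2), n)
```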

32 Methods Compared
- LatticeKrig (Nychka et al. 2015) using the LatticeKrig package in R, with 3 resolutions.
- Locally approximated Gaussian process (laGP, Gramacy and Apley 2015) using the laGP package in R with method set to "nn".
- Full-rank Gaussian process (GP) using the spBayes package in R with the full data.
- Modified predictive process (MPP, Finley et al. 2015) using the spBayes package in R with the full data.
- Nearest neighbor Gaussian process (NNGP, Datta et al. 2016) using the spNNGP package in R, with three choices of the number of nearest neighbors (NN).
- DISK + CMC (Scott et al. 2016) + (full GP or MPP).
- DISK + PIE (Li et al. 2017) + (full GP or MPP).
The first two methods are frequentist; the others are all Bayesian.

33 Simulation Results When n = 10⁴
[Table: prediction MSE averaged over the l testing locations, comparing laGP, LatticeKrig, full GP, MPP (two knot sizes r), NNGP (three neighbor sizes), DISK-CMC(MPP) for each knot size, DISK-PIE(GP), and DISK-PIE(MPP) for each knot size, with k = 10, 20, 30, 40, 50 subsets for the DISK variants.]

34 [Figure: the true surface w* alongside posterior quantile surfaces (2.5%, 50%, 97.5%) from the full GP, DISK (GP) for several k, MPP, and DISK (MPP) for two knot sizes.]

35 Bias-Variance Tradeoff
[Figure: squared bias, variance, and L2 risk versus the number of subsets k, for DISK (GP), DISK (MPP) with two knot sizes, the full GP, and laGP.]
When n = 10⁶, full-data GP, LatticeKrig, MPP, and NNGP are not applicable; only laGP is left to compare with DISK.
[Table: prediction MSE averaged over the l testing locations for laGP and two DISK(MPP) settings of (r, k).]

36 Simulation Results When n = 10⁴
[Table: parametric inference and predictive interval coverage — estimates of β, τ², σ², φ and 95% PI coverage, comparing the truth with laGP, LatticeKrig, full GP, MPP (two knot sizes), NNGP (three neighbor sizes), DISK+PIE (GP), and DISK+PIE (MPP) with two knot sizes.]

37 Simulation Results When n = 10⁶
[Table: parametric inference and predictive interval coverage — estimates of β, τ², σ², φ, MSPE, and 95% PI coverage, comparing the truth with laGP and DISK+PIE (MPP) under two knot sizes r.]
r is the number of knots; k is the number of subsets. Full GP, LatticeKrig, MPP, and NNGP fail to run for n = 10⁶.

38 Simulation Model 2: Rough Surface
- The true surface w*(·) is a rough sample path drawn from the Matérn kernel with ν = 1/2: C(s, s') = τ² exp(−φ‖s − s'‖) for s, s' in the square domain D. We fit the model using the same kernel.
- The response is y(s) = β + w*(s) + ε, ε ~ N(0, σ²), with β = 1, σ² = 0.1, τ² = 1, φ = 9, and inverse-Gamma priors on σ² and τ² and a uniform prior on φ.
- Training sample size n = 10⁴; the training locations are sampled uniformly over D. The l testing locations are also sampled uniformly over D, and the Bayes L2 risks (and other criteria) are estimated by averaging over them.
- We only compare DISK(MPP) (with a single choice of k) and laGP.

39 Results for Simulation 2
[Table: parametric inference and predictive interval coverage, with one fixed k for DISK. The 95% credible intervals from DISK(MPP), for two knot sizes r, cover the true β = 1, σ² = 0.1, τ² = 1, and φ = 9. MSPE, 95% PI coverage, and 95% PI length are reported for laGP and DISK(MPP); DISK(MPP) attains near-nominal coverage with wider predictive intervals.]

40 Sea Surface Temperature Data
- SST on the west coast of mainland U.S.A., Canada, and Alaska, between 40 and 65 degrees north latitude, over a band of west longitudes. The dataset is obtained from the NODC World Ocean Database.
- We choose roughly a million spatial observations in the selected domain; 10⁶ observations are used as training data, and the remainder as testing data.
- Model: y(s) = β_0 + β_1 latitude + w(s) + ε(s), ε(s) ~ N(0, σ²).
- For DISK+CMC(MPP) and DISK+PIE(MPP), we show the results for one choice of the number of knots r and the number of subsets k.

41 SST Data: Predicted Surfaces
[Figure: the observed SST surface ("truth"), the laGP prediction, and the 2.5%, 50%, and 97.5% quantile surfaces from CMC and DISK.]
- The PMSEs of CMC(MPP) and DISK(MPP) are almost the same, and both are comparable to laGP (a frequentist GP method). DISK(MPP) has wider predictive intervals.
- Other Bayesian GP methods do not work at this scale, including MPP and NNGP.

42 SST Data: Numerical Results
Parametric inference and prediction in the SST data. DISK-CMC and DISK-PIE use MPP-based modeling with a fixed number of knots and subsets.
[Table: 95% credible intervals for β_0, β_1, σ², τ² under laGP, DISK-CMC, and DISK-PIE; MSPE, 95% PI coverage, and 95% PI length for the three methods. DISK-PIE attains coverage close to the nominal 95% with wider intervals than laGP, while DISK-CMC undercovers.]

43 Outline Background of GP D&C Bayes Method Theory Experiments Future Work

44 Future Work
- Theory for the finite-dimensional parameters (α, σ²), and for the joint distribution of the parametric and nonparametric parts.
- Theory for the frequentist coverage of DISK posterior credible intervals.
- The choice of partition distribution; the influence of the number of knots in MPP.
- Extension to spatio-temporal processes.
Manuscript at arXiv:1712.09767.

45 Thank you! & Questions?
