arxiv: v1 [stat.me] 28 Dec 2017


A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

Rajarshi Guhaniyogi 1, Cheng Li 2, Terrance D. Savitsky 3, and Sanvesh Srivastava 4

arXiv:1712.09767v1 [stat.ME] 28 Dec 2017

1 Department of Applied Mathematics and Statistics, UC Santa Cruz
2 Department of Statistics and Applied Probability, National University of Singapore
3 U. S. Bureau of Labor Statistics
4 Department of Statistics and Actuarial Science, The University of Iowa

November 2018

Abstract

Flexible hierarchical Bayesian modeling of massive data is challenging due to poorly scaling computations in large sample size settings. This article is motivated by spatial process models for analyzing geostatistical data, which typically entail computations that become prohibitive as the number of spatial locations becomes large. We propose a three-step divide-and-conquer strategy within the Bayesian paradigm to achieve massive scalability for any spatial process model. We partition the data into a large number of subsets, apply a readily available Bayesian spatial process model on every subset in parallel, and optimally combine the posterior distributions estimated across all the subsets into a pseudo posterior distribution that conditions on the entire data. The combined pseudo posterior distribution is used for predicting the responses at arbitrary locations and for performing posterior inference on the model parameters and the residual spatial surface. We call this approach Distributed Kriging (DISK). It offers significant advantages in applications where the entire data are or can be stored across multiple machines. Under the standard theoretical setup, we show that if the number of subsets is not too large, then the Bayes $L_2$-risk of estimating the true residual spatial surface using the DISK posterior distribution decays to zero at a nearly optimal rate. While DISK is a general approach to distributed nonparametric regression, we focus on its applications in spatial statistics and demonstrate its empirical performance using a stationary full-rank and a nonstationary low-rank model based on a Gaussian process (GP) prior. A variety of simulations and a geostatistical analysis of the Pacific Ocean sea surface temperature data validate our theoretical results.

Keywords: Distributed Bayesian inference; Gaussian process; modified predictive process; large and complex spatial data; Wasserstein distance; Wasserstein barycenter.

guhaniyogi@ucsc.edu stalic@nus.edu.sg savitsky.terrance@bls.gov sanvesh-srivastava@uiowa.edu

1 Introduction

A fundamental challenge in geostatistics is the analysis of massive spatially-referenced data. Due to the recent influx of data with complex spatial associations, sophisticated spatial modeling has become an enormously active area of research; see, for example, [8, 3, 4]. Massive spatial data provide scientists with an unprecedented opportunity to hypothesize and test complex theories. This leads to the implementation of rather complex hierarchical GP-based models that are computationally intractable for large $n$, where $n$ is the number of spatial locations, due to the $O(n^3)$ computational cost and the $O(n^2)$ storage cost. This article develops a general distributed Bayesian approach, called Distributed Kriging (DISK), that uses the divide-and-conquer technique to boost the scalability of any state-of-the-art spatial process model based on a GP prior by multiple folds.

The literature on process-based modeling of massive spatial data is large, so we only provide a selective review. Briefly, these methods seek dimension reduction by endowing the spatial covariance matrix with either a low-rank or a sparse structure. Low-rank structures represent the spatial surface using $r$ a priori chosen basis functions. They include fixed-rank kriging [], or predictive process and its variants [6, 4, 34, 5, 6]; see [78] and [4] for comprehensive reviews. The time complexity for fitting spatial models with a low-rank structure decreases from $O(n^3)$ to $O(nr^2)$ floating point operations (flops); however, practical considerations show that when $n$ is large, $r$ must grow roughly as $O(\sqrt{n})$ for accurate estimation, implying that $O(nr^2)$ flops are also expensive in low-rank structures. On the other hand, sparse structures rely on the intuition that the spatial correlation between two distantly located observations is nearly zero, so little information is lost by assuming conditional independence given the intermediate locations. For example, covariance tapering [4, 9, 5, 63] uses compactly supported covariance functions to create sparse spatial covariance matrices that approximate the full covariance matrix. Alternately, one could introduce sparsity in the inverse covariance (precision) matrix using conditional independence assumptions or composite likelihoods [75, 59, 7,, 5, 7, 36]. In related literature on computer experiments, localized approximations of GP models are proposed; see, for example, [3, 8, 54]. GP-based modeling using low-rank or sparse structures has also received significant attention in machine learning; see [55, ] for recent reviews.

Some variants of dimension-reduction methods partition the spatial domain into sub-regions containing fewer spatial locations. Each sub-region is modeled using a GP, and these GPs are then hierarchically combined by borrowing information across the sub-regions. Examples include non-stationary models [4], multi-level and multi-resolution models [7, 5, 4, 35], and the Bayesian Treed GP models [3]. These models usually achieve scalability by assuming block independence at some level of the hierarchy, usually across sub-regions, but may lose scalability when they borrow information across sub-regions. In an unrelated thread, [45] propose parameter estimation in the GP-based geostatistical model using resampling based on stochastic approximation. Besides being fully frequentist in nature, it is unclear how such an idea would be extended to enable analysis of more general nonstationary models with massive data.

The proposed DISK framework is a three-step approach for distributed Bayesian inference in any model based on a spatial process. First, we divide the $n$ spatial locations into $k$ subsets such that each subset has representative samples from all regions of the spatial domain. Second, we choose any spatial model and estimate the posterior distributions for inference and prediction in parallel across the $k$ subsets after raising the likelihood to a power of $k$ in each subset. The pseudo posterior distribution obtained after modifying the likelihood for each subset of data is referred to as the subset posterior. Since each subset posterior distribution conditions on a $1/k$-fraction of the full data, raising the likelihood to the power $k$ ensures that the variance of each subset posterior distribution is of the same order (as a function of $n$) as that of the full data posterior distribution. Third, the $k$ subset posterior distributions are combined into a single pseudo probability distribution, called the DISK pseudo posterior (henceforth, DISK posterior), that conditions on the full data and replaces the computationally expensive full data posterior distribution for the purpose of prediction and inference.

Computationally, the main innovations are in the first and third steps, where general partitioning and combining schemes are unavailable in process-based modeling of spatial data. Theoretically, we provide guarantees on the rate of decay of the Bayes $L_2$-risk in estimating the true residual spatial surface using the DISK posterior as a function of $n$, $k$, and analytic properties of the true spatial surface. This involves two new upper bounds. First, an upper bound for the Bayes risk of the DISK posterior is developed assuming each subset size approaches infinity.
Second, we provide an in-depth analysis of the bias-variance tradeoff in estimating the true spatial surface using the DISK posterior and develop upper bounds on $k$ as a function of $n$ that lead to near optimal performance as $n$ tends to infinity.

Motivated by large and complex data, there has been significant interest in adopting the divide-and-conquer technique for distributed Bayesian inference [5, 76, 77, 3, 6, 33, 48, 8]. DISK is significantly different from these as it is based on combining the collection of $k$ subset posterior distributions through their barycenter, a notion of geometric center that generalizes the Euclidean mean to a space of probability measures. There are recent approaches based on combining subset inferences, either through the posterior means and covariance matrices of parameters across the $k$ subsets [6] or by employing a product of experts [5, 7]. These approaches are intuitively appealing for GP regression, but lead to theoretically sub-optimal uncertainty quantification [7] and are less suited when a GP or its derivative is embedded in a more general hierarchical model; for example, GP-based classification [37]. Recent developments on distributed variational GP [6] are impractical for n > 7 and no theoretical results are available on the quality of approximation of the full posterior distribution. There are recent works on combining subset posterior distributions through their geometric centers, such as the mean or the median, but they are restricted to parametric models [46, 67, 44, 6, 47, 68]. Extensions to general nonparametric models, including those based on stochastic processes, are missing, except for some empirical results on nonparametric regression using GP [46, 67] and theoretical guarantees for both these methods in the Gaussian white noise model [7].

Combining subset posterior distributions of a stochastic process is challenging for two major reasons. First, it requires estimation of a function, an infinite dimensional parameter.

Second, and most importantly, stochastic process models induce complex dependencies among observations that are challenging to capture with a combination of subset posterior distributions that are estimated independently without accounting for the inter-subset dependence.

Divide-and-conquer nonparametric regression, which includes kriging as a special case, has received significant attention lately in the optimization literature, though the Bayesian literature is relatively sparse. The bias-variance decomposition of the $L_2$-risk in the estimation of the true regression function in divide-and-conquer kernel ridge regression is known [84, 8]. Bayesian divide-and-conquer nonparametric regression has been mostly studied from the theoretical perspective [, 64, 65, 7]. Filling the methodological gap, the DISK framework provides a general approach to enhance the scalability of any process-based model for Bayesian nonparametric regression, including models based on spatial processes. For example, if the application of spatial process models is feasible for a subset of size $m$, then one can run them on $k$ subsets in parallel and DISK allows prediction and inference using $n = mk$ spatial locations. The values of $m$ and $k$ depend mainly upon the available computational resources and the model, but our theoretical results provide guidance on choosing $k$ depending on the analytic properties of the true spatial process.

For clarity of exposition, we illustrate the empirical performance of the DISK framework with the stationary GP and the modified predictive process (MPP) [4] priors. The MPP is a low-rank nonstationary GP prior that allows accurate modeling of spatial surfaces whose variability or smoothness changes with the location. MPP, like any low-rank model, faces computational bottlenecks when $n$ is large, and either computational efficiency or accuracy worsens when MPP is applied to even $10^4$ observations.
Our numerical results establish that DISK with MPP scales to $10^6$ observations without compromising on either computational efficiency or accuracy in inference and prediction. We expect this conclusion to hold for other popular structured GP priors.

The remainder of the manuscript is organized as follows. In Section 2 we outline a Bayesian hierarchical mixed model framework that incorporates models based on both the full-rank and the low-rank GP priors. Our DISK approach will work with posterior samples from such models. Section 3 develops the framework for DISK, discusses how to compute the DISK posterior distribution, and offers theoretical insights into DISK for general GPs and their approximations. A detailed simulation study followed by an analysis of the Pacific Ocean sea surface temperature data are presented in Section 4 to justify the use of DISK for real data. Finally, Section 5 discusses what DISK achieves and proposes a number of future directions to explore.

2 Hierarchical Bayesian inference for GP-based spatial models

Consider the standard univariate spatial regression model for the data observed at location $s$ in a compact spatial domain $D$,

$$y(s) = x(s)^T \beta + w(s) + \epsilon(s), \tag{1}$$

where $y(s)$ is a univariate response at $s$, $x(s)$ is a $p \times 1$ predictor vector at $s$, $\beta$ is a $p \times 1$ vector of predictor coefficients, $w(s)$ is the realization of an unknown spatial function $w(\cdot)$ at $s$, and $\epsilon(s)$ is the realization of a white-noise process $\epsilon(\cdot)$ at $s$ that is independent of $w(\cdot)$. The Bayesian implementation of the model in (1) customarily assumes (a) that $\beta$ a priori follows a Gaussian distribution with mean $\mu_\beta$ and covariance matrix $\Sigma_\beta$ and (b) that $w(\cdot)$ and $\epsilon(\cdot)$ a priori follow mean-zero GPs with covariance functions $C_\alpha(s, s')$ and $D_\alpha(s, s')$ that model $\mathrm{cov}\{w(s), w(s')\}$ and $\mathrm{cov}\{\epsilon(s), \epsilon(s')\}$, respectively, where $\alpha$ are the process parameters indexing the two families of covariance functions and $s, s' \in D$; therefore, the model parameters are $\Omega = \{\alpha, \beta\}$. If $\beta = 0$ in (1), then we obtain the setup for Bayesian nonparametric regression using a GP prior, with $s$ as covariates and $y(s)$ as the response.

The training data consist of $n$ predictors and responses, denoted as $\{x(s_1), y(s_1)\}, \ldots, \{x(s_n), y(s_n)\}$, observed at $n$ spatial locations, denoted as $S = \{s_1, \ldots, s_n\}$. Standard Markov chain Monte Carlo (MCMC) algorithms exist for performing posterior inference on $\Omega$ and the values of $w(\cdot)$ at a given set of locations $S^* = \{s_1^*, \ldots, s_l^*\}$, where $S^* \cap S = \emptyset$, and for predicting $y(s^*)$ for any $s^* \in S^*$ [4]. Given $S$, the prior assumptions on $w(\cdot)$ and $\epsilon(\cdot)$ imply that $w^T = \{w(s_1), \ldots, w(s_n)\}$ and $\epsilon^T = \{\epsilon(s_1), \ldots, \epsilon(s_n)\}$ are independent and follow $N\{0, C(\alpha)\}$ and $N\{0, D(\alpha)\}$, respectively, where $N(m, V)$ denotes the density of a multivariate Gaussian distribution of appropriate dimension with mean $m$ and covariance matrix $V$, and the $(i, j)$th entries of $C(\alpha)$ and $D(\alpha)$ are $C_\alpha(s_i, s_j)$ and $D_\alpha(s_i, s_j)$, respectively. The hierarchy in (1) is completed by assuming that $\alpha$ a priori follows a distribution with density $\pi(\alpha)$.

If $y = \{y(s_1), \ldots, y(s_n)\}^T$ is the $n \times 1$ response vector and $X = [x(s_1) : \cdots : x(s_n)]^T$ is the $n \times p$ matrix of predictors, where $p < n$, then the MCMC algorithm for sampling $\Omega$, $w^{*T} = \{w(s_1^*), \ldots, w(s_l^*)\}$, and $y^{*T} = \{y(s_1^*), \ldots, y(s_l^*)\}$ cycles through the following three steps until sufficient samples are drawn post convergence:

1. Integrate over $w$ in (1) and (a) sample $\beta$ given $y$, $X$, and $\alpha$ from $N(m_\beta, V_\beta)$, where

$$V_\beta = \{X^T V(\alpha)^{-1} X + \Sigma_\beta^{-1}\}^{-1}, \quad m_\beta = V_\beta \{X^T V(\alpha)^{-1} y + \Sigma_\beta^{-1} \mu_\beta\}, \tag{2}$$

and $V(\alpha) = C(\alpha) + D(\alpha)$; and (b) sample $\alpha$ given $y$, $X$, and $\beta$ using the Metropolis-Hastings algorithm with a normal random walk proposal.

2. Sample $w^*$ given $y$, $X$, $\alpha$, and $\beta$ from $N(m_*, V_*)$, where

$$V_* = C_{*,*}(\alpha) - C_*(\alpha) V(\alpha)^{-1} C_*(\alpha)^T, \quad m_* = C_*(\alpha) V(\alpha)^{-1} (y - X\beta), \tag{3}$$

$C_*(\alpha)$ and $C_{*,*}(\alpha)$ are $l \times n$ and $l \times l$ matrices, respectively, and the $(i, j)$th entries of $C_{*,*}(\alpha)$ and $C_*(\alpha)$ are $C_\alpha(s_i^*, s_j^*)$ and $C_\alpha(s_i^*, s_j)$, respectively.

3. Sample $y^*$ given $\alpha$, $\beta$, and $w^*$ from $N\{X^*\beta + w^*, D^*(\alpha)\}$, where $X^{*T} = [x(s_1^*) : \cdots : x(s_l^*)]$ and $D^*(\alpha)$ is the $l \times l$ matrix with $(i, j)$th entry $D_\alpha(s_i^*, s_j^*)$.
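The conditional draws for the regression coefficients and for the spatial effects at new locations can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the exponential covariance function, the nugget $D(\alpha) = \tau^2 I$, and the helper names `exp_cov`, `sample_beta`, and `krige_w` are hypothetical choices made only for this example, and the Metropolis-Hastings update of $\alpha$ is omitted.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=1.0):
    """Assumed exponential covariance: sigma2 * exp(-phi * distance)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def sample_beta(y, X, V, mu_beta, Sigma_beta, rng):
    """Draw beta from N(m_beta, V_beta) with w integrated out."""
    Vinv_X = np.linalg.solve(V, X)            # V(alpha)^{-1} X
    Vinv_y = np.linalg.solve(V, y)            # V(alpha)^{-1} y
    Sb_inv = np.linalg.inv(Sigma_beta)
    V_beta = np.linalg.inv(X.T @ Vinv_X + Sb_inv)
    m_beta = V_beta @ (X.T @ Vinv_y + Sb_inv @ mu_beta)
    return rng.multivariate_normal(m_beta, V_beta), m_beta, V_beta

def krige_w(y, X, beta, V, C_star, C_star_star):
    """Mean and covariance of w* given y, X, alpha, and beta."""
    m_star = C_star @ np.linalg.solve(V, y - X @ beta)
    V_star = C_star_star - C_star @ np.linalg.solve(V, C_star.T)
    return m_star, V_star

rng = np.random.default_rng(0)
n, l, p, tau2 = 50, 5, 2, 0.1
S = rng.uniform(size=(n, 2))                  # training locations
S_star = rng.uniform(size=(l, 2))             # prediction locations
X = rng.normal(size=(n, p))
V = exp_cov(S, S) + tau2 * np.eye(n)          # V(alpha) = C(alpha) + D(alpha)
y = X @ np.array([1.0, -1.0]) + rng.multivariate_normal(np.zeros(n), V)

beta, m_beta, V_beta = sample_beta(y, X, V, np.zeros(p), 100.0 * np.eye(p), rng)
m_star, V_star = krige_w(y, X, beta, V, exp_cov(S_star, S), exp_cov(S_star, S_star))
```

Every call to `np.linalg.solve(V, ...)` costs on the order of $n^3$ operations for a dense $V(\alpha)$, which is the bottleneck that motivates the remainder of the paper.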

Many Bayesian spatial models can be formulated in terms of (1) by assuming different forms of $C_\alpha(s, s')$ and $D_\alpha(s, s')$; see [4] and supplementary material for details on the MCMC algorithm.

MCMC computations face a major computational bottleneck due to matrix inversions involving $C(\alpha)$. The steps 1(a), 1(b), and 2 for sampling $\Omega$ and $w^*$ involve inversion of $C(\alpha) + D(\alpha)$. Irrespective of the form of $D(\alpha)$, if no additional assumptions are made on the structure of $C(\alpha)$, then the three steps require $O(n^3)$ flops in computation and $O(n^2)$ memory units in storage in every MCMC iteration. Spatial models with this form of posterior computations are based on a full-rank GP prior. In practice, if $n \geq 10^4$, then posterior computations in a model based on a full-rank GP prior are infeasible due to numerical issues in matrix inversions involving an unstructured $C(\alpha)$.

This problem is solved by imposing additional structure on $C(\alpha)$. Our focus is on those methods that impose a sparse or low-rank structure on the covariance function of a GP prior [55, 4]. Every method in this class expresses the covariance function in terms of $r \ll n$ basis functions, in turn inducing a low-rank GP prior. We use the MPP as a representative example of this class of methods. Let $S^{(0)} = \{s_1^{(0)}, \ldots, s_r^{(0)}\}$ be a set of $r$ locations, known as the knots, which may or may not intersect with $S$. Let $c(s, S^{(0)}) = \{C_\alpha(s, s_1^{(0)}), \ldots, C_\alpha(s, s_r^{(0)})\}^T$ be an $r \times 1$ vector and $C(S^{(0)})$ be an $r \times r$ matrix whose $(i, j)$th entry is $C_\alpha(s_i^{(0)}, s_j^{(0)})$. Using $c(s_1, S^{(0)}), \ldots, c(s_n, S^{(0)})$ and $C(S^{(0)})$, define the diagonal matrix $\delta = \mathrm{diag}\{\delta(s_1), \ldots, \delta(s_n)\}$ with

$$\delta(s_i) = C_\alpha(s_i, s_i) - c^T(s_i, S^{(0)}) \, C(S^{(0)})^{-1} c(s_i, S^{(0)}), \quad i = 1, \ldots, n. \tag{4}$$

Let $1(a = b) = 1$ if $a = b$ and $0$ otherwise. Then, the MPP is a GP with the covariance function

$$\tilde{C}_\alpha(s, s') = c^T(s, S^{(0)}) \, C(S^{(0)})^{-1} c(s', S^{(0)}) + \delta(s) \, 1(s = s'), \quad s, s' \in D, \tag{5}$$

where $\tilde{C}_\alpha(s, s')$ depends on the covariance function of the parent GP and the selected $r$ knots, which define $C(S^{(0)})$, $c^T(s, S^{(0)})$, and $c^T(s', S^{(0)})$. We have used a tilde in (5) to distinguish the covariance function of a low-rank GP prior from that of its parent full-rank GP. If $\tilde{C}(\alpha)$ is a matrix with $(i, j)$th entry $\tilde{C}_\alpha(s_i, s_j)$, then the posterior computations using MPP, a low-rank GP prior, replace $C(\alpha)$ by $\tilde{C}(\alpha)$ in the steps 1(a), 1(b), and 2. The rank-$r$ structure imposed by $C(S^{(0)})$ implies that the $\tilde{C}(\alpha)^{-1}$ computation requires $O(nr^2)$ flops using the Woodbury formula [38].

Spatial models based on a low-rank GP prior, including MPP, suffer from computational bottlenecks in massive data settings. The computational complexity of posterior computations for a rank-$r$ GP prior is $O(nr^2)$, which is linear in $n$; however, practical considerations often necessitate that $r = O(\sqrt{n})$ for accurate inference and prediction. This severely limits the computational advantages of low-rank GP priors, including MPP, especially in applications with large $n$ [5]. The next few sections develop our DISK framework, which is key in extending any GP-based spatial model to massive data using the divide-and-conquer technique without compromising either on the quality of inference and prediction or on the computational cost.
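To make the MPP covariance concrete, the following Python sketch assembles the low-rank term and the diagonal correction for a set of locations and knots. It is illustrative only: the exponential parent covariance, the randomly placed knots, and the names `exp_cov` and `mpp_cov` are assumptions of this example rather than choices made in the paper, and the parent covariance is formed densely just to read off its diagonal.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=1.0):
    """Assumed parent covariance: sigma2 * exp(-phi * distance)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def mpp_cov(S, knots, cov=exp_cov):
    """Return the MPP covariance matrix and the diagonal correction delta."""
    c = cov(S, knots)                          # n x r, row i is c(s_i, knots)^T
    C0 = cov(knots, knots)                     # r x r covariance among knots
    low_rank = c @ np.linalg.solve(C0, c.T)    # c^T C0^{-1} c for every pair
    # delta(s_i) = parent variance at s_i minus the low-rank variance at s_i
    delta = np.diag(cov(S, S)) - np.diag(low_rank)
    return low_rank + np.diag(delta), delta

rng = np.random.default_rng(1)
S = rng.uniform(size=(200, 2))                 # n = 200 locations
knots = rng.uniform(size=(15, 2))              # r = 15 knots
C_tilde, delta = mpp_cov(S, knots)
```

The correction restores the parent GP's marginal variance on the diagonal, so `np.diag(C_tilde)` matches the parent variance while the off-diagonal part keeps rank at most $r$.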

3 Distributed Kriging

3.1 First step: partitioning of spatial locations

We partition the $n$ spatial locations into $k$ subsets. The value of $k$ depends on the chosen spatial model, and it is large enough to ensure efficient posterior computations on any subset. The default partitioning scheme is to randomly allocate the locations into $k$ subsets, but we specify a technical condition later which ensures that every subset has locations from all regions of the spatial domain. For theoretical and notational simplicity, we also assume that the $k$ subsets are non-overlapping and that every subset has $m$ spatial locations so that $n = mk$. Let $S_j$ be the set of spatial locations in subset $j$ $(j = 1, \ldots, k)$. Our simplifying assumptions imply that $\cup_{j=1}^{k} S_j = S$, $S_j \cap S_{j'} = \emptyset$ for $j \neq j'$, and $S_j = \{s_{j1}, \ldots, s_{jm}\}$, where $s_{ji} = s_{i'}$ for some $s_{i'} \in S$ and for every $j = 1, \ldots, k$ and $i = 1, \ldots, m$. Denote the data in the $j$th partition as $\{y_j, X_j\}$ $(j = 1, \ldots, k)$, where $y_j = \{y(s_{j1}), \ldots, y(s_{jm})\}^T$ is an $m \times 1$ vector and $X_j = [x(s_{j1}) : \cdots : x(s_{jm})]^T$ is an $m \times p$ matrix of predictors corresponding to the spatial locations in $S_j$ with $p < m$.

The univariate spatial regression model using either a full-rank or a low-rank GP prior for the data observed at any location $s_{ji} \in S_j \subset D$ is given by

$$y(s_{ji}) = x(s_{ji})^T \beta + w(s_{ji}) + \epsilon(s_{ji}), \quad i = 1, \ldots, m. \tag{6}$$

Let $w_j^T = \{w(s_{j1}), \ldots, w(s_{jm})\}$ and $\epsilon_j^T = \{\epsilon(s_{j1}), \ldots, \epsilon(s_{jm})\}$ be the realizations of the GP $w(\cdot)$ and the white-noise process $\epsilon(\cdot)$, respectively, in the $j$th subset. After marginalizing over $w_j$ in the GP-based model for the $j$th subset, the likelihood of $\Omega = \{\alpha, \beta\}$ is given by

$$\ell_j(\Omega) = N\{y_j \mid X_j \beta, V_j(\alpha)\}, \tag{7}$$

where $N(y \mid m, V)$ represents the multivariate normal density of $y$ with mean $m$ and covariance matrix $V$, $V_j(\alpha) = C_j(\alpha) + D_j(\alpha)$ and $\tilde{V}_j(\alpha) = \tilde{C}_j(\alpha) + D_j(\alpha)$ for full-rank and low-rank GP priors, respectively, and $C_j(\alpha)$, $\tilde{C}_j(\alpha)$, $D_j(\alpha)$ are obtained by extending the definitions of $C(\alpha)$, $\tilde{C}(\alpha)$, $D(\alpha)$ to the $j$th subset. In a model based on a full-rank or low-rank GP prior, the likelihood of $w_j$ given $y_j$, $X_j$, and $\Omega$ is

$$\ell_j(w_j) = N\{y_j - X_j \beta \mid w_j, D_j(\alpha)\}. \tag{8}$$

The likelihoods in (7) and (8) are used for defining the posterior distributions for $\beta$, $\alpha$, $w^*$, $y^*$, called $j$th subset posterior distributions, using a full-rank or a low-rank GP prior in subset $j$.
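The default random partitioning scheme described above can be sketched in a few lines of Python. The function name `random_partition` and the restriction to $n = mk$ are assumptions of this example; the sketch does not enforce the technical condition that every subset covers all regions of the spatial domain.

```python
import random

def random_partition(n, k, seed=0):
    """Randomly split location indices 0..n-1 into k disjoint subsets of equal size m."""
    if n % k != 0:
        raise ValueError("this sketch assumes n = m * k")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)      # random allocation of locations
    m = n // k
    return [sorted(idx[j * m:(j + 1) * m]) for j in range(k)]

subsets = random_partition(n=12, k=3)
```

Each entry of `subsets` holds the indices of the rows of $y$ and $X$ that would form $y_j$ and $X_j$ for one subset.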

3.2 Second step: sampling from subset posterior distributions

We define subset posterior distributions by modifying the likelihoods in (7) and (8). More precisely, the density of the $j$th subset posterior distribution of $\Omega$ is given by

$$\pi_m(\Omega \mid y_j) = \frac{\{\ell_j(\Omega)\}^k \pi(\Omega)}{\int \{\ell_j(\Omega)\}^k \pi(\Omega) \, d\Omega}, \tag{9}$$

where we assume that $\int \{\ell_j(\Omega)\}^k \pi(\Omega) \, d\Omega < \infty$, and the subscript $m$ denotes that the density conditions on $m$ samples in the $j$th subset. The modification of the likelihood to yield the subset posterior density in (9) is called stochastic approximation [46]. Raising the likelihood to the power of $k$ is equivalent to replicating every $y(s_{ji})$ $k$ times $(i = 1, \ldots, m)$. Thus, stochastic approximation accounts for the fact that the $j$th subset posterior distribution conditions on a $1/k$-fraction $(k = n/m)$ of the full data and ensures that its variance is of the same order (as a function of $n$) as that of the full data posterior distribution. This is a common strategy adopted in divide-and-conquer based Bayesian inference in parametric models; see, for example, [44, 8] for recent applications.

With the proposed stochastic approximation in (9), the full conditional densities of the $j$th subset posterior distributions for prediction and inference follow from their full data counterparts. The $j$th full conditional densities of $\beta$ and $\alpha$ in the GP-based models are given by

$$\pi_m(\beta \mid y_j, \alpha) = \frac{\{\ell_j(\Omega)\}^k \pi(\beta)}{\int \{\ell_j(\Omega)\}^k \pi(\beta) \, d\beta}, \quad \pi_m(\alpha \mid y_j, \beta) = \frac{\{\ell_j(\Omega)\}^k \pi(\alpha)}{\int \{\ell_j(\Omega)\}^k \pi(\alpha) \, d\alpha}, \tag{10}$$

where $\pi(\beta) = N(\mu_\beta, \Sigma_\beta)$, $\pi(\alpha)$ is the prior density of $\alpha$, and we assume that $\int \{\ell_j(\Omega)\}^k \pi(\beta) \, d\beta$ and $\int \{\ell_j(\Omega)\}^k \pi(\alpha) \, d\alpha$ are finite. The $j$th full conditional densities of $y^*$ and $w^*$ are calculated after modifying the likelihood of $w_j$ in (7) using stochastic approximation.
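The variance-matching effect of stochastic approximation can be checked in a toy setting. The sketch below is not the spatial model: it assumes a Gaussian mean with known variance `sigma2` and a flat prior, for which raising the likelihood of $m$ observations to a power multiplies the posterior precision by that power. The function name `posterior_variance` is hypothetical.

```python
def posterior_variance(sigma2, m, power=1):
    """Posterior variance of a Gaussian mean under a flat prior when the
    likelihood of m observations (known variance sigma2) is raised to `power`."""
    return sigma2 / (power * m)

n, k, sigma2 = 10_000, 20, 4.0
m = n // k                                                # subset size

full = posterior_variance(sigma2, n)                      # full data posterior
plain_subset = posterior_variance(sigma2, m)              # subset, unmodified likelihood
powered_subset = posterior_variance(sigma2, m, power=k)   # subset, likelihood raised to k
```

In this toy model the unmodified subset posterior is $k$ times more dispersed than the full data posterior, while the powered subset posterior has the same variance as the full data posterior.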
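Given $y_j$, $X_j$, and $\Omega$, straightforward calculation yields that the $j$th subset posterior predictive density of $w^*$ is

$$\pi_m(w^* \mid y_j, \Omega) = N(w^* \mid m_{*j}, V_{*j}), \tag{11}$$

where

$$V_{*j} = C_{*,*}(\alpha) - C_{*j}(\alpha) V_j(\alpha)^{-1} C_{*j}(\alpha)^T, \quad m_{*j} = C_{*j}(\alpha) V_j(\alpha)^{-1} (y_j - X_j \beta), \tag{12}$$

where, after stochastic approximation, $V_j(\alpha) = C_j(\alpha) + k^{-1} D_j(\alpha)$ and $\tilde{V}_j(\alpha) = \tilde{C}_j(\alpha) + k^{-1} D_j(\alpha)$ for full-rank and low-rank GP priors, respectively, and $C_{*,*}(\alpha)$, $C_{*j}(\alpha)$ are $l \times l$, $l \times m$ matrices obtained by extending the definition in (3) to subset $j$ for full-rank and low-rank GP priors with covariance functions $C_\alpha(\cdot, \cdot)$ and $\tilde{C}_\alpha(\cdot, \cdot)$, respectively. We note that the stochastic approximation exponent, $k$, scales $D_j(\alpha)$ in $V_j(\alpha)$ so that the uncertainty in the subset posterior distributions is scaled to that of the full data posterior. The $j$th subset posterior predictive density of $y^*$ given the samples of $w^*$ and $\Omega$ in the $j$th subset is $N\{y^* \mid X^* \beta + w^*, D^*(\alpha)\}$, where $D^*(\alpha)$ is the $l \times l$ matrix of noise covariances $D_\alpha(s_i^*, s_{i'}^*)$ at the locations in $S^*$.

We employ the same three-step sampling algorithm introduced earlier, specialized to subset $j$ $(j = 1, \ldots, k)$, sampling $\{\beta, \alpha, y^*, w^*\}$ in each subset across multiple MCMC iterations; see supplementary material for detailed derivations of subset posterior sampling algorithms in the full-rank and low-rank GP priors.

The computational complexity of $j$th subset posterior computations follows from their full data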

counterparts if we replace $n$ by $m$. Specifically, the computational complexities for sampling a subset posterior distribution are $O(m^3)$ and $O(mr^2)$ flops if the model in (6) uses a full-rank or a low-rank GP prior, respectively. Since the subset posterior computations are performed in parallel across $k$ subsets, the computational complexities for obtaining $B$ post burn-in subset posterior samples from $k$ subsets are $O(km^3) = O(nm^2)$ and $O(kmr^2) = O(nr^2)$ flops in models based on full-rank and low-rank GP priors, respectively. The combination step of subset posteriors using the DISK framework outlined below is more widely applicable compared to other divide-and-conquer type approaches because it does not rely on any model- or data-specific assumptions, such as independence, except that every subset posterior distribution has a density with respect to the Lebesgue measure and has finite second moments.

3.3 Third step: combination of subset posterior distributions

3.3.1 Wasserstein distance and Wasserstein barycenter

The combination step relies on the Wasserstein barycenter, so we provide some background on this topic. Let $(\Theta, \rho)$ be a complete separable metric space and $P(\Theta)$ be the space of all probability measures on $\Theta$. The Wasserstein space of order 2 is a set of probability distributions defined as

$$P_2(\Theta) = \left\{ \mu \in P(\Theta) : \int_\Theta \rho^2(\theta, \theta_0) \, \mu(d\theta) < \infty \right\},$$

where $\theta_0 \in \Theta$ is arbitrary and $P_2(\Theta)$ does not depend on the choice of $\theta_0$. The Wasserstein distance of order 2, denoted as $W_2$, metrizes $P_2(\Theta)$. Let $\mu, \nu$ be two probability measures in $P_2(\Theta)$ and $\Pi(\mu, \nu)$ be the set of all probability measures on $\Theta \times \Theta$ with marginals $\mu$ and $\nu$; then the $W_2$ distance between $\mu$ and $\nu$ is defined as

$$W_2(\mu, \nu) = \left\{ \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Theta \times \Theta} \rho^2(x, y) \, d\pi(x, y) \right\}^{1/2}. \tag{13}$$

Let $\nu_1, \ldots, \nu_k \in P_2(\Theta)$; then the Wasserstein barycenter of $\nu_1, \ldots, \nu_k$ is defined as

$$\overline{\nu} = \underset{\nu \in P_2(\Theta)}{\mathrm{argmin}} \sum_{j=1}^{k} \frac{1}{k} W_2^2(\nu, \nu_j). \tag{14}$$

It is known that $\overline{\nu} \in P_2(\Theta)$ is the unique solution of a linear program [].
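For intuition, the $W_2$ distance has a simple form on the real line with the Euclidean metric: the optimal coupling of two empirical measures with the same number of atoms pairs their sorted values. The sketch below uses this standard one-dimensional fact; the function name `w2_empirical_1d` is hypothetical and the equal-sample-size restriction is an assumption of the example.

```python
import math

def w2_empirical_1d(x, y):
    """W2 distance between two 1-D empirical measures with equally many atoms."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    xs, ys = sorted(x), sorted(y)          # monotone (sorted) coupling is optimal in 1-D
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

d = w2_empirical_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

Here the second sample is the first one shifted by one unit after sorting, so the distance equals the size of the shift.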
If $\nu_1, \ldots, \nu_k$ in (14) represent the $k$ subset posterior distributions, then the Wasserstein barycenter $\overline{\nu}$ provides a general notion of obtaining the mean of $k$ subset posterior distributions. If the $k$ subset posteriors are combined using $\overline{\nu}$, then $\overline{\nu}$ has finite second moments, conditions on the full data, and does not rely on model-specific or data-specific assumptions. If $\nu_1, \ldots, \nu_k$ are analytically intractable but MCMC samples are available from them, then an empirical approximation of the Wasserstein barycenter can be estimated by solving a sparse linear program or by averaging empirical subset posterior quantiles [4, 9, 67, 44, 69].
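The quantile-averaging route can be sketched for a single scalar parameter as follows. This is a minimal illustration, assuming NumPy and simulated Gaussian draws standing in for MCMC output from $k = 4$ subset posteriors; the function name `disk_quantiles` and the grid size `xi` are choices made for this example.

```python
import numpy as np

def disk_quantiles(subset_samples, xi=0.01):
    """Average subset posterior quantiles on the grid xi, 2*xi, ..., 1 - xi."""
    q = np.arange(1, round(1 / xi)) * xi
    per_subset = np.stack([np.quantile(s, q) for s in subset_samples])  # k x len(q)
    return q, per_subset.mean(axis=0)

rng = np.random.default_rng(2)
# hypothetical draws of a scalar parameter from k = 4 subset posteriors
draws = [rng.normal(loc=mu, scale=1.0, size=5000) for mu in (-0.2, 0.0, 0.1, 0.3)]
q, bary_q = disk_quantiles(draws)
```

The vector `bary_q` is an empirical estimate of the quantile function of the one-dimensional barycenter, from which credible intervals or pseudo samples can be read off.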

3.3.2 Combination scheme

In the DISK framework, we combine the collection of posterior samples from the $k$ subset posterior distributions for $\beta$, $\alpha$, $w^*$, and $y^*$ through their respective Wasserstein barycenters. Our combination scheme relies on the following key result. If $\theta$ is a one-dimensional functional of the full data posterior distribution and the $j$th subset posterior distribution for $\theta$ is denoted by $\nu_j$, then the $q$th quantile of the Wasserstein barycenter for $\theta$, denoted as $\overline{\nu}$, is estimated from the collection of $k$ subset posterior samples as

$$\hat{\overline{\nu}}^{\,q} = \frac{1}{k} \sum_{j=1}^{k} \hat{\nu}_j^{\,q}, \quad q = \xi, 2\xi, \ldots, 1 - \xi, \tag{15}$$

where $\xi$ is the grid-size of the quantiles, $\hat{\nu}_j^{\,q}$ is the estimate of the $q$th quantile of $\nu_j$ obtained using MCMC samples from $\nu_j$, and $\hat{\overline{\nu}}^{\,q}$ is the estimate of the $q$th quantile of $\overline{\nu}$ [44].

The post burn-in samples from the $k$ subset posterior distributions are combined using (15). Let $\{\beta_{jb}\}_{b=1}^{B}$, $\{\alpha_{jb}\}_{b=1}^{B}$, $\{w_{jb}^*\}_{b=1}^{B}$, $\{y_{jb}^*\}_{b=1}^{B}$ $(j = 1, \ldots, k)$ be the collection of $B$ post burn-in MCMC samples from the $k$ subsets; let $\beta_{jib}$ and $\alpha_{ji'b}$ be the $b$th post burn-in MCMC samples for the $i$th and $i'$th marginals of $\beta$ and $\alpha$ from their $j$th subset posteriors, where $i = 1, \ldots, p$, $i' = 1, \ldots, s$, $p$ is the dimension of $\beta$, and $s$ is the dimension of $\alpha$. If $\overline{\beta}$, $\overline{\alpha}$, $\overline{w}^*$, and $\overline{y}^*$ represent the random variables that follow the DISK posterior distributions for $\beta$, $\alpha$, $w^*$, and $y^*$, then MCMC-based estimates of the DISK posterior are obtained through their quantile estimates as follows:

1. Use (15) to estimate the $q$th quantiles of the $i$th and $i'$th marginals of the DISK posteriors for $\beta$ and $\alpha$, respectively, as

$$\hat{\overline{\beta}}_i^{\,q} = \frac{1}{k} \sum_{j=1}^{k} \hat{\beta}_{ji}^{\,q}, \quad i = 1, \ldots, p, \qquad \hat{\overline{\alpha}}_{i'}^{\,q} = \frac{1}{k} \sum_{j=1}^{k} \hat{\alpha}_{ji'}^{\,q}, \quad i' = 1, \ldots, s, \tag{16}$$

where $\hat{\beta}_{ji}^{\,q}$ and $\hat{\alpha}_{ji'}^{\,q}$ are the estimates of the $q$th quantiles of the $i$th and $i'$th marginals of the $j$th subset posterior distributions for $\beta$ and $\alpha$ obtained from $\{\beta_{jb}\}_{b=1}^{B}$ and $\{\alpha_{jb}\}_{b=1}^{B}$.

2. Use (15) to estimate the pointwise $q$th quantiles of $\overline{w}(s_i^*)$ and $\overline{y}(s_i^*)$ as

$$\hat{\overline{w}}^{\,q}(s_i^*) = \frac{1}{k} \sum_{j=1}^{k} \hat{w}_j^{\,q}(s_i^*), \qquad \hat{\overline{y}}^{\,q}(s_i^*) = \frac{1}{k} \sum_{j=1}^{k} \hat{y}_j^{\,q}(s_i^*), \quad i = 1, \ldots, l, \tag{17}$$

where $\hat{w}_j^{\,q}(s_i^*)$ and $\hat{y}_j^{\,q}(s_i^*)$ are the estimates of the $q$th quantiles of $w(s_i^*)$ and $y(s_i^*)$ of the $j$th subset posterior distributions for $w(s_i^*)$ and $y(s_i^*)$ obtained from $\{w_{jb}^*\}_{b=1}^{B}$ and $\{y_{jb}^*\}_{b=1}^{B}$.

A key feature of the DISK combination scheme is that given the subset posterior samples, the combination step is agnostic to the choice of a model. Specifically, (15) remains the same for models based on a full-rank GP prior or a low-rank GP prior, such as MPP, given MCMC samples from the $k$ subset posterior distributions. Since the averaging over $k$ subsets takes $O(k)$ flops and $k < n$,

the total time for computing the empirical quantile estimates of the DISK posterior in inference or prediction requires $O(k) + O(m^3)$ and $O(k) + O(nm^2)$ flops in models based on full-rank and low-rank GP priors. Assuming that we have abundant computational resources, $k$ is chosen large enough so that $O(m^3)$ computations are feasible. This would enable applications of the DISK framework in models based on both full-rank and low-rank GP priors in massive $n$ settings.

3.4 Illustrative example: linear regression

We develop intuitions behind the DISK posterior in this section by studying its theoretical properties in multivariate linear regression. The spatial linear model in (1) reduces to a linear regression model with error variance 1 if $\epsilon(s)$ follows $N(0, 1)$ and there is no spatial effect; that is, $w(s) = 0$ for every $s \in D$. A flat prior on $\beta$ implies that the full data posterior distribution of $\beta$ given $y$, denoted as $\Pi_n$, has density $N\{\hat{\beta}, (X^T X)^{-1}\}$, where $\hat{\beta} = (X^T X)^{-1} X^T y$. Using notations similar to the earlier sections, the $j$th subset posterior distribution has density $N\{\hat{\beta}_j, k^{-1}(X_j^T X_j)^{-1}\}$, where $\hat{\beta}_j = (X_j^T X_j)^{-1} X_j^T y_j$ $(j = 1, \ldots, k)$ and the factor of $k^{-1}$ in the posterior covariance matrix comes from stochastic approximation. The Wasserstein barycenter of the $k$ subset posterior distributions, denoted as $\overline{\Pi}_n$, is Gaussian with density $N(\overline{m}, \overline{V})$, where $\overline{m} = \frac{1}{k} \sum_{j=1}^{k} \hat{\beta}_j$ and $\overline{V}$ is such that

$$\frac{1}{k} \sum_{j=1}^{k} \left\{ \overline{V}^{1/2} \, k^{-1} (X_j^T X_j)^{-1} \, \overline{V}^{1/2} \right\}^{1/2} = \overline{V}; \tag{18}$$

see Section 6.3 in []. The DISK framework replaces $N\{\hat{\beta}, (X^T X)^{-1}\}$ with $N(\overline{m}, \overline{V})$ for inference and predictions in massive data settings.

The covariance matrix $\overline{V}$ in (18) is estimated via fixed point iterations; however, $\overline{V}$ is analytically tractable when $X_1, \ldots, X_k$ have identical left and right singular vectors []. If $U$ and $V$ are $m \times p$ and $p \times p$ orthogonal matrices and $D_j = \mathrm{diag}(d_{j1}, \ldots, d_{jp})$ is a $p \times p$ diagonal matrix containing the singular values of $X_j$, then $X_j = U D_j V^T$, $X_j^T X_j = V D_j^2 V^T$, and $X^T X = V \left( \sum_{j=1}^{k} D_j^2 \right) V^T$. The mean vector and covariance matrix of $\overline{\Pi}_n$ reduce to

$$\overline{m} = \frac{1}{k} \sum_{j=1}^{k} V D_j^{-1} U^T y_j \quad \text{and} \quad \overline{V} = V \left( k^{-3/2} \sum_{j=1}^{k} D_j^{-1} \right)^{2} V^T.$$
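The closed-form barycenter covariance for the shared-singular-vector case can be checked numerically against the fixed-point equation for the barycenter of Gaussians. The sketch below is illustrative: the orthogonal matrix `Vmat` and the singular values in `D` are simulated, and the helper `sym_sqrt` is a hypothetical name for a symmetric matrix square root.

```python
import numpy as np

def sym_sqrt(A):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

rng = np.random.default_rng(3)
k, p = 4, 3
Vmat, _ = np.linalg.qr(rng.normal(size=(p, p)))    # shared right singular vectors
D = rng.uniform(1.0, 3.0, size=(k, p))             # row j holds d_j1, ..., d_jp

# subset posterior covariances k^{-1} (X_j^T X_j)^{-1} = k^{-1} V D_j^{-2} V^T
V_sub = [Vmat @ np.diag(1.0 / (k * d ** 2)) @ Vmat.T for d in D]

# closed form: V (k^{-3/2} sum_j D_j^{-1})^2 V^T
V_bar = Vmat @ np.diag((k ** -1.5 * (1.0 / D).sum(axis=0)) ** 2) @ Vmat.T

# left side of the fixed-point equation evaluated at the closed form
R = sym_sqrt(V_bar)
lhs = sum(sym_sqrt(R @ Vj @ R) for Vj in V_sub) / k
```

If the closed form is correct, `lhs` reproduces `V_bar` up to floating point error, which is what the fixed-point characterization requires.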
Let $\beta_0$ be the true value of $\beta$ and $\overline{\beta}$, $\beta$ be the random variables with distributions $\overline{\Pi}_n$, $\Pi_n$. Assume that $c_D m \leq d_{jr}^2$ for some universal positive constant $c_D$ and for $j = 1, \ldots, k$ and $r = 1, \ldots, p$; then the Bayes $L_2$-risk of the DISK posterior in estimating $\beta_0$ is

$$E_{\beta_0} \|\overline{\beta} - \beta_0\|^2 = E_{\beta_0} E\{\|\overline{\beta} - \beta_0\|^2 \mid y\} \leq \frac{2p}{c_D n} = O\left(E_{\beta_0} \|\beta - \beta_0\|^2\right),$$

where $p$ is fixed, $\|\cdot\|$ is the Euclidean norm, and $E_{\beta_0}$ represents expectation with respect to the density of $y$ obtained by fixing $\beta = \beta_0$ in (1). This shows that the Bayes $L_2$-risk of the DISK posterior in estimating $\beta_0$ is of the same order as the Bayes $L_2$-risk of the full data posterior distribution. A result like the one above is crucial in developing intuition on the DISK posterior as an alternative to the full data posterior, and such results are generally difficult to come by in the literature for the models presented in Section 2. In the next two sections, we develop theoretical guarantees for the

DISK posterior as an alternative to the full posterior in estimating the residual spatial surface.

3.5 Bayes $L_2$-risk of DISK: General convergence rates

Consider the special case $w(\cdot) = 0$ in the spatial regression model. The regression model reduces to a finite dimensional model with parameters $\Omega = \{\beta, \alpha\}$, and the DISK posterior reduces to the recently developed WASP method [68]. If $\Omega_0$ is the true value of $\Omega$, then it is known that when the data are independent, WASP converges in probability to a Dirac measure centered at $\Omega_0$, or to the full data posterior distribution of $\Omega$, at a near optimal rate under certain assumptions as $n, m \to \infty$ [68, 44]. In models based on spatial processes, however, inference on the infinite dimensional true residual spatial surface, denoted as $w_0(\cdot)$, is of primary importance, and no formal results are available in this regard. A notable exception is [7], which shows that combination using the Wasserstein barycenter has optimal Bayes risk and adapts to the smoothness of $w_0$ in the Gaussian white noise model. This model is a special case of the spatial regression model with additional smoothness assumptions on $w_0$. We focus on providing theoretical guarantees for the DISK posterior distribution for estimating $w_0$ using the spatial regression model with a GP prior on $w$, including low-rank GP priors such as MPP, as $n, m \to \infty$.

Recall that $S^*$ is the set of $l$ reference locations in $D$ and $S^* \cap S = \emptyset$. Let $w_0^* = \{w_0(s_1^*), \ldots, w_0(s_l^*)\}^T$ be the true residual spatial surface generating the data at the locations in $S^*$ and $w^* = \{w(s_1^*), \ldots, w(s_l^*)\}^T$ be the realization of the GP $w(\cdot)$ at the locations in $S^*$. Following the standard theoretical setup in [73], we assume that $\beta = 0$, $\alpha$ is known, and $D_\alpha = \tau^2 I$, where $\tau^2$ is the known non-spatial error variance, so that the model reduces to

$$y(s_i) = w(s_i) + \epsilon(s_i), \quad \epsilon(s_i) \sim N(0, \tau^2), \quad i = 1, \ldots, n, \quad w(\cdot) \sim GP\{0, C_\alpha(\cdot, \cdot)\}. \qquad (9)$$

This setup subsumes the low-rank GP priors, hence MPP, as MPP is a GP with covariance function $C_\alpha(\cdot, \cdot)$ in (5). We also assume that the data are generated using $y(s) = w_0(s) + \epsilon(s)$, $s \in D$. Adapting our discussion in Section 3 for the models in (6) to the one in (9), we have that $y_j$ given $w_j$ follows the Gaussian distribution with density $N(w_j, k^{-1} \tau^2 I)$ after stochastic approximation, and the GP prior on $w(\cdot)$ implies that, after integrating over $w_j$,

$$y_j \mid w^* \sim N(A_j w^*, \Sigma_j), \quad A_j = C_{*,j}^T C_{*,*}^{-1}, \quad \Sigma_j = k^{-1} \tau^2 I + C_{j,j} - C_{*,j}^T C_{*,*}^{-1} C_{*,j},$$

where $C_{j,j}$, $C_{*,j}$, and $C_{*,*}$ are the covariance matrices defined earlier. Let $A$ and $\Sigma$ represent the full data versions of $A_j$ and $\Sigma_j$. For any $b \in R^l$, we define two norms

$$\|b\|_{S_j} = \left( \frac{1}{m} b^T A_j^T \Sigma_j^{-1} A_j b \right)^{1/2}, \quad \|b\|_S = \left( \frac{1}{n} b^T A^T \Sigma^{-1} A b \right)^{1/2}.$$

Equivalently, for a generic function $b(\cdot)$ defined on the domain $D$, if $b$ in these definitions is the functional evaluation of $b(\cdot)$ at the testing locations $S^*$, then $\|\cdot\|_{S_j}$ and $\|\cdot\|_S$ can be viewed as two Banach norms on the space of functions with the domain $D$.
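The two norms above can be evaluated numerically once a covariance function is fixed. The following is a minimal sketch, not the authors' implementation: it assumes one-dimensional locations, an exponential covariance (the Matérn kernel with smoothness 1/2), a random partition into equal-sized subsets, and illustrative values for $n$, $k$, $l$, and $\tau^2$. The helper names `exp_kernel` and `subset_norm` are hypothetical.

```python
import numpy as np

def exp_kernel(x, y, sigma2=1.0, phi=0.3):
    """Exponential covariance between two sets of 1-D locations (illustrative choice)."""
    dist = np.abs(x[:, None] - y[None, :])
    return sigma2 * np.exp(-dist / phi)

def subset_norm(b, s_sub, s_ref, tau2, k):
    """Evaluate {b^T A^T Sigma^{-1} A b / m}^{1/2} for one subset; k = 1 gives the full data norm."""
    C_rs = exp_kernel(s_ref, s_sub)             # l x m, plays the role of C_{*,j}
    C_rr = exp_kernel(s_ref, s_ref)             # l x l, plays the role of C_{*,*}
    C_ss = exp_kernel(s_sub, s_sub)             # m x m, plays the role of C_{j,j}
    A = np.linalg.solve(C_rr, C_rs).T           # m x l, A_j = C_{*,j}^T C_{*,*}^{-1}
    Sigma = (tau2 / k) * np.eye(len(s_sub)) + C_ss - C_rs.T @ np.linalg.solve(C_rr, C_rs)
    Ab = A @ b
    return float(np.sqrt(Ab @ np.linalg.solve(Sigma, Ab) / len(s_sub)))

rng = np.random.default_rng(0)
n, k, l, tau2 = 200, 4, 5, 0.1                  # assumed toy sizes
s = rng.uniform(0.0, 1.0, n)                    # observed locations S
s_ref = np.linspace(0.1, 0.9, l)                # reference locations S*
subsets = np.array_split(rng.permutation(n), k) # random partition S_1, ..., S_k
b = rng.normal(size=l)

full_norm = subset_norm(b, s, s_ref, tau2, k=1)
sub_norms = [subset_norm(b, s[idx], s_ref, tau2, k) for idx in subsets]
ratios = [v / full_norm for v in sub_norms]     # subset-to-full norm ratios
print(ratios)
```

Comparing the entries of `ratios` across subsets gives a rough empirical sense of how closely a given partition makes the subset norms track the full data norm.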

Based on the definitions and notation introduced previously, we make the following five assumptions for deriving the general convergence rates of the DISK posterior:

A.1 Compact domain. The spatial domain $D$ is a compact space in the $\|\cdot\|_2$ metric.

A.2 Norm equivalence. The partitions $S_1, \ldots, S_k$ of $S$ are such that there exist universal positive constants $H_l < 1 < H_u$ independent of $j$ such that $H_l \|\cdot\|_S \le \|\cdot\|_{S_j} \le H_u \|\cdot\|_S$ for $j = 1, \ldots, k$.

A.3 Metric entropy. Suppose that $\epsilon_m$ is a positive sequence that satisfies (i) $m \epsilon_m^2 \ge 1$ for all $m \ge 1$; (ii) $\epsilon_m \to 0$ as $m \to \infty$; (iii) with a slight abuse of notation, for every $r > 1$, there is a set $F_r$ such that for all $m \ge 1$, $D(\epsilon_m, F_r, \|\cdot\|_S) \le e^{m \epsilon_m^2 H_l^2 r^2}$ and $\Pi(F_r) \ge 1 - e^{-2 m \epsilon_m^2 r^2}$, where $D(\epsilon, F_r, \|\cdot\|_S)$ is the minimum number of $\|\cdot\|_S$-balls of radius $\epsilon$ that cover $F_r$.

A.4 Prior thickness. For the $\epsilon_m$ sequence in Assumption A.3 and for all $m \ge 1$, the prior assigns positive mass to any small neighborhood around $w_0^*$: $\Pi(w^* : \|w^* - w_0^*\|_S \le \epsilon_m) \ge e^{-m H_u^2 \epsilon_m^2}$.

A.5 The metrics $\|\cdot\|_2$ and $\|\cdot\|_S$ are equivalent in that $C_l \|\cdot\|_S \le \|\cdot\|_2 \le C_u \|\cdot\|_S$ for some positive universal constants $C_l$ and $C_u$.

Assumption A.1 is common to all models based on GP priors. Assumption A.2 specifies a technical condition on the partitioning scheme so that the realizations of the GP observed in the $j$th subset are similar to those in the full data, where such similarity is described in terms of the norms $\|\cdot\|_{S_j}$ and $\|\cdot\|_S$. Assumption A.3 regulates the complexity of the sequence of sets $F_r$ in terms of $\|\cdot\|_S$-metric entropy and specifies a condition on the probability assigned by the GP prior to $F_r$, ensuring that the prior probability of $F_r$ under the Gaussian measure induced by the GP prior increases with increasing $\|\cdot\|_S$-metric entropy of $F_r$. The subscript $r$ here should not be confused with the number of knots in MPP. Assumption A.4 says that the GP prior assigns positive probability to arbitrarily small $\|\cdot\|_S$-neighborhoods around the true parameter $w_0^*$. Assumption A.5 is a technical condition that is used in upper bounding the Bayes $L_2$-risk of the Wasserstein barycenter in the estimation of $w_0^*$ if we have Bayes $L_2$-risk upper bounds for the subset posterior distributions.

We define the Bayes $L_2$-risk in the estimation of $w_0^*$ using the full data posterior as

$$E_{S,S^*} \{ E ( \|w^* - w_0^*\|_2^2 \mid y ) \} = E_{S,S^*} \int \|w^* - w_0^*\|_2^2 \, d\Pi_n(w^* \mid y),$$

where $E_{S,S^*}$ is the expectation under the true space varying function $w_0$ with respect to the density of $y$ conditional on $S, S^*$ in (9). The decay rate of this risk is known under assumptions that are similar to A.1, A.3, and A.4 and are obtained by replacing $m$ by $n$ [73]. We present two theorems below that describe the asymptotic properties of the DISK posterior measured in this Bayes $L_2$-risk, for the spatial model specified in (9) with a prior $\Pi$ on $w(\cdot)$ that satisfies Assumptions A.1-A.5. The first theorem describes the Bayes $L_2$-risk of each subset posterior distribution in our case and is based on a proposition in [73].
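As a worked step for the risk definition above: when $\alpha$ and $\tau^2$ are known, the posterior of $w^*$ under a GP prior with Gaussian errors is itself Gaussian, so the inner posterior expectation has a closed form. Writing the posterior mean and covariance as $\mu_n$ and $V_n$ (symbols introduced here only for illustration, not taken from the source), the standard bias-variance identity for a Gaussian vector gives:

```latex
\int \|w^* - w_0^*\|_2^2 \, d\Pi_n(w^* \mid y)
  = \|\mu_n - w_0^*\|_2^2 + \operatorname{tr}(V_n),
\qquad \text{if } \Pi_n(\cdot \mid y) = N(\mu_n, V_n).
```

The first term measures how far the posterior mean is from the truth at the reference locations and the second is the total posterior variance; the Bayes $L_2$-risk averages their sum over the data.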

Theorem 3.1 If Assumptions A.1-A.5 hold for the $j$th subset posterior $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$, then there exists a positive constant $c(H_l)$ that only depends on $H_l$, such that as $m \to \infty$,

$$E_{S,S^*} \{ E ( \|w^* - w_0^*\|_2^2 \mid y_j ) \} \le C_u^2 \, c(H_l) \, \epsilon_m^2,$$

where $E_{S,S^*}$ is the expectation under the true space varying function $w_0$ with respect to the subset $y_j$ of size $m$ conditional on $S_j, S^*$.

The proof of this theorem is in the supplementary material, along with the other proofs in this section. Theorem 3.1 holds for any $\epsilon_m$ sequence that satisfies Assumptions A.3 and A.4. Explicit expressions for $\epsilon_m$ are available if $w_0$ and $F_r$ are restricted to classes of functions with known regularity and $\Pi$ is a GP prior with the Matérn or squared exponential covariance kernel. For any $a, b > 0$, let $C^a[0,1]^d$ and $H^b[0,1]^d$ be the Hölder and Sobolev spaces of functions on $[0,1]^d$ with regularity indices $a$ and $b$, respectively. Define $D = [0,1]^d$ and $C_\alpha$ to be the Matérn kernel with

$$C_\alpha(s, s') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\|s - s'\|}{\phi} \right)^{\nu} K_\nu \left( \frac{\|s - s'\|}{\phi} \right) \quad \text{for } s, s' \in D,$$

where $K_\nu$ is a modified Bessel function of the second kind with order $\nu$, which controls the process smoothness, $\phi$ is a lengthscale parameter that controls the decay in spatial correlation, and $\Gamma$ is the Gamma function. If $w_0 \in C^b[0,1]^d \cap H^b[0,1]^d$ and $F_r \subset C^b[0,1]^d \cap H^b[0,1]^d$ for $r > 1$, then $\epsilon_m = m^{-\min(\nu, b)/(2\nu + d)}$, provided $\min(\nu, b) > d/2$. Similarly, if $D = [0,1]^d$, $C_\alpha$ is the squared exponential kernel with $C_\alpha(s, s') = \sigma^2 e^{-\phi \|s - s'\|^2}$, and $w_0$ is an analytic function on $D$, then $\epsilon_m = (\log m)^{1/2} / m^{1/2}$; see [73], including Theorem 5. The squared exponential kernel is not relevant to spatial statistics, but we provide this additional result for a more general audience, especially in machine learning.

The second theorem below provides an upper bound on the Bayes $L_2$-risk of the DISK posterior, $\bar{\Pi}(\cdot \mid y_1, \ldots, y_k)$, using the upper bounds on the $k$ subset posterior distributions.

Theorem 3.2 If Assumptions A.1-A.5 hold for all subset posteriors $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$, then as $m \to \infty$,

$$E_{S,S^*} \{ E ( \|w^* - w_0^*\|_2^2 \mid y_1, \ldots, y_k ) \} = E_{S,S^*} \int \|w^* - w_0^*\|_2^2 \, d\bar{\Pi}(w^* \mid y_1, \ldots, y_k) \le C_u^2 \, c(H_l) \, \epsilon_m^2,$$

where $E_{S,S^*}$ is the expectation under the true space varying function $w_0$ with respect to the full dataset of size $n$ conditional on $S, S^*$.

The $\epsilon_m$ sequence here is the same as in Theorem 3.1. If $k \asymp \log^a n$ for some $a > 0$, then $m \asymp n / \log^a n$. With this choice of $m$ and $k$, our previous discussion implies that $\epsilon_m = n^{-c} (\log n)^{ac}$ for the Matérn kernel, where $c = \min(\nu, b)/(2\nu + d)$, and $\epsilon_m = (\log n)^{a/2 + 1/2} / n^{1/2}$ for the squared exponential kernel. Both these rates are minimax optimal up to log factors in the estimation of $w_0$; see [73] for proofs.

In applications, we are also interested in estimating functions of $w_0$. An attractive property of the DISK posterior is that its theoretical guarantees extend to a large class of functions of $w_0$. Let $f$ be any function that maps $w$ to $f(w)$ and that is bounded almost linearly by the metric $\|\cdot\|_2$. Then, we have the following corollary from a direct application of Lemma 8.5 in [8].

Corollary 3.3 Suppose that Assumptions A.1-A.5 hold for all subset posteriors $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$. Let $f$ be a continuous function that maps $R^l$ to $R^l$ and satisfies fw C f + w w for any w R l, where $C_f > 0$ is a fixed constant. Let f Π y,..., y k represent the DISK posterior of $f(w^*)$, then as $m \to \infty$, f fw df Πf y,..., y k = O p ɛ m, where $O_p$ is in the probability measure under the true space varying function $w_0$ with respect to the full dataset of size $n$ conditional on $S, S^*$.

3.6 Bayes $L_2$-risk of DISK: Bias-variance tradeoff and the choice of $k$

While Section 3.5 describes asymptotic optimality results for the DISK posterior when $n, m \to \infty$, a common problem in applications is the choice of $k$ for a large $n$. The risk for the DISK posterior derived in Theorem 3.2 and Corollary 3.3 is applicable to any prior distribution that satisfies Assumptions A.3 and A.4. If $k \asymp \log^a n$ for some $a > 0$, then the DISK posterior gives near minimax optimal performance in the estimation of $w_0$ for the two covariance kernels. In practice, however, we want to choose a $k$ that is much larger than $\log^a n$ because of the abundance of computational resources. If the number of subsets $k$ is very small, then the biases in the subset posterior distributions are small due to a large $m$, but the variance of the DISK posterior is large due to the small $k$. In contrast, if $k$ is very large, then the biases in the subset posterior distributions are large due to a small subset size $m$, but the variance of the DISK posterior can be small due to the large $k$. An optimal choice of $k$ balances the bias-variance tradeoff and minimizes the risk of the DISK posterior.

We introduce some definitions used in stating the results in this section. Let $P_s$ be a probability distribution over $D$ and $L_2(P_s)$ be the $L_2$ space under $P_s$. The inner product in $L_2(P_s)$ is defined as $\langle f, g \rangle_{L_2(P_s)} = E_{P_s}(fg)$ for any $f, g \in L_2(P_s)$, and $\{\phi_i(s) : i = 1, 2, \ldots\}$ is an orthonormal basis with respect to $P_s$. Assume that the kernel has the series expansion $C_\alpha(s, s') = \sum_{i=1}^{\infty} \mu_i \phi_i(s) \phi_i(s')$ with respect to $P_s$ for any $s, s' \in D$, where $\mu_1 \ge \mu_2 \ge \cdots$ are the eigenvalues of $C_\alpha$. The trace of the kernel $C_\alpha$ is defined as $tr(C_\alpha) = \sum_{i=1}^{\infty} \mu_i$. Any $f \in L_2(P_s)$ has the series expansion $f(s) = \sum_{i=1}^{\infty} \theta_i \phi_i(s)$, where $\theta_i = \langle f, \phi_i \rangle_{L_2(P_s)}$. The reproducing kernel Hilbert space (RKHS) $H$ attached to $C_\alpha$ is the space of all functions $f \in L_2(P_s)$ such that the $H$-norm satisfies $\|f\|_H^2 = \sum_{i=1}^{\infty} \theta_i^2 / \mu_i < \infty$. The RKHS $H$ is the completion of the linear space of functions defined as $\sum_{i=1}^{I} a_i C_\alpha(s_i, \cdot)$, where $I$ is a positive integer, $s_i \in D$, and $a_i \in R$ ($i = 1, \ldots, I$); see [74] for more details.

Simplifying our setup in Section 3.5, we consider a random design scheme with the observed locations $S = \{s_1, \ldots, s_n\}$ and $S^* = \{s^*\}$, where $s_1, \ldots, s_n, s^*$ are mutually independent and follow $P_s$. The assumptions we impose below are used to derive analytic bounds for the bias and variance terms associated with the Bayes $L_2$-risk, and they are stronger than those in Section 3.5.

B.1 RKHS. The true function $w_0$ is an element of the RKHS $H$ attached to the kernel $C_\alpha$.

B.2 Trace class kernel. $tr(C_\alpha) < \infty$.

B.3 Moment condition. There are positive constants $\rho$ and, with a slight abuse of notation, $r$ such that $E_{P_s}\{\phi_i^{2r}(s)\} \le \rho^{2r}$ for every $i = 1, 2, \ldots, \infty$, and $var\{\epsilon(s)\} \le \tau^2 < \infty$ for any $s \in D$.

Assumption B.1 is a stronger assumption than Assumption A.4 in Section 3.5. Assumption B.1 is not required in Section 3.5 because the DISK posterior can learn any continuous $w_0$ for a large class of GP priors as $n \to \infty$, even if $w_0 \notin H$. In general, the RKHS $H$ can be a much smaller space relative to the support of the GP prior, in the sense that the GP prior can assign zero probability to $H$ and positive probability to any neighborhood of arbitrary size around any continuous $w_0$. While we use Assumption B.1 mainly for technical simplicity in this section, it can possibly be relaxed by considering sieves with increasing Hilbert norms; see, for example, Assumption B and the accompanying theorem in Zhang et al. [84]. In Assumption B.2, $tr(C_\alpha)$ measures the size of the covariance kernel and imposes conditions on the regularity of the functions that the DISK posterior can learn. Assumption B.3 controls the error in approximating $C_\alpha(s, s')$ by a finite sum, and the superscript $r$ here should not be confused with the number of knots in MPP or the $r$ in Assumption A.3. Our results are valid for any error distribution that guarantees $var\{\epsilon(s)\} \le \tau^2$ for every $s \in D$, and this is trivially satisfied in (9).

We examine the Bayes $L_2$-risk of the DISK posterior for estimating $w_0$ in (9). Under the setup of (9), let $E_{s^*}$, $E_0$, $E_S$, and $E_{0 \mid S}$ respectively be the expectations with respect to the distributions of $s^*$, $(S, y)$, $S$, and $y$ given $S$. If $\bar{w}(s^*)$ is a random variable that follows the DISK posterior for estimating $w_0(s^*)$, then $\bar{w}(s^*)$ has the density $N(\bar{m}, \bar{v})$, where

$$\bar{m} = \frac{1}{k} \sum_{j=1}^{k} c_{j,*}^T \left( C_{j,j} + \frac{\tau^2}{k} I \right)^{-1} y_j, \quad \bar{v}^{1/2} = \frac{1}{k} \sum_{j=1}^{k} v_j^{1/2}, \quad v_j = c_{*,*} - c_{j,*}^T \left( C_{j,j} + \frac{\tau^2}{k} I \right)^{-1} c_{j,*},$$

$c_{*,*} = cov\{w(s^*), w(s^*)\}$, and $c_{j,*}^T = [cov\{w(s_{j1}), w(s^*)\}, \ldots, cov\{w(s_{jm}), w(s^*)\}]$. The Bayes $L_2$-risk of the DISK posterior in estimating $w_0$ is $E_0 [ E_{s^*} \{\bar{w}(s^*) - w_0(s^*)\}^2 ]$, and it is decomposed into the squared bias, the variance of the mean of the DISK posterior, and the variance of the DISK posterior as

$$bias^2 = E_{s^*} E_S \{ c_*^T (kL + \tau^2 I)^{-1} w_0 - w_0(s^*) \}^2, \quad var_{mean} = \tau^2 E_{s^*} E_S \{ c_*^T (kL + \tau^2 I)^{-2} c_* \}, \quad var_{DISK} = E_{s^*} E_S (\bar{v}),$$

where $c_*^T = (c_{1,*}^T, \ldots, c_{k,*}^T)$, $w_{0j}^T = \{w_0(s_{j1}), \ldots, w_0(s_{jm})\}$ ($j = 1, \ldots, k$), $w_0^T = (w_{01}^T, \ldots, w_{0k}^T)$, and $L$ is a block-diagonal matrix with $C_{1,1}, \ldots, C_{k,k}$ along the diagonal. The next theorem describes the asymptotic behavior of each of the three terms in this decomposition.

Theorem 3.4 If Assumptions B.1-B.3 hold, then

$$\text{Bayes } L_2 \text{ risk} = E_S E_{0 \mid S} E_{s^*} \{\bar{w}(s^*) - w_0(s^*)\}^2 = bias^2 + var_{mean} + var_{DISK},$$

$$bias^2 \le \frac{8 \tau^2}{n} \|w_0\|_H^2 + \|w_0\|_H^2 \inf_{d \in N} \left[ \frac{8n}{\tau^2} \rho^4 \, tr(C_\alpha) \, tr(C_\alpha^d) + \mu_1 \left\{ \frac{A \, b(m, d, r) \, \rho^2 \, \gamma(\tau^2 / n)}{\sqrt{m}} \right\}^r \right],$$

var mean n + 4 w H k τ k var DISK τ τ n γ n inf d N [ n w H + τ n γ [ + inf d N µ d+ + n τ ρ 4 trc α trcα d + τ, n trc d α + trc α Abm, d, rρ γ τ n m } r ] + Abm, d, rρ γ τ } r ] n, 5 m

where $N$ is the set of all positive integers, $A$ is a positive constant,

$$b(m, d, r) = \max \left\{ \sqrt{\max(r, \log d)}, \ \frac{\max(r, \log d)}{m^{1/2 - 1/r}} \right\}, \quad \gamma(a) = \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + a} \ \text{ for any } a > 0, \quad tr(C_\alpha^d) = \sum_{i=d+1}^{\infty} \mu_i.$$

Theorem 3.4 is based on arguments similar to those of Zhang et al. [84], who derived risk bounds for the frequentist divide-and-conquer estimator in kernel ridge regression. We, however, consider the divide-and-conquer scheme in the Bayesian context, so our risk bound in Theorem 3.4 involves two variance terms, including the DISK posterior variance term $var_{DISK}$, which has not been considered by Zhang et al. [84] or the rest of the divide-and-conquer literature because of their focus on frequentist point estimation of $w_0$. The function $\gamma(a)$ measures the effective dimensionality of $C_\alpha$ with respect to $L_2(P_s)$ [83]. The function $tr(C_\alpha^d)$ describes the tail behavior of the eigenvalues of $C_\alpha$ [84]. The upper bounds for $bias^2$ and $var_{mean}$ are the DISK analogues of the upper bounds in Lemma 6 and Lemma 7, respectively, of Zhang et al. [84]. From the risk bounds in Theorem 3.4, one can see that the first and second terms in the upper bound for $var_{DISK}$ are dominated by the last and second terms in the upper bounds for $var_{mean}$ and $bias^2$, respectively. Therefore, if we use the DISK posterior mean or a random draw from the DISK posterior to estimate $w_0$, we will observe a bias-variance tradeoff similar to the one described in [84]: as $k$ increases, the squared bias of the estimate increases, while the variation between the subset estimates decreases. Theoretically, this implies the existence of an optimal $k^*$ such that the Bayes $L_2$-risk of the DISK posterior decreases for $k \le k^*$ and increases for $k > k^*$. Our empirical results in Section 4 demonstrate this bias-variance tradeoff.

For three types of commonly used kernels, the next theorem provides conditions on $k$ such that the Bayes $L_2$-risk bounded in Theorem 3.4 is nearly minimax optimal. The covariance kernel $C_\alpha$ is a degenerate kernel of rank $d$ if there is some constant positive integer $d$ such that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_d > 0$ and $\mu_{d+1} = \mu_{d+2} = \cdots = \mu_\infty = 0$. The covariance kernels in the subset of regressors approximation [55] and the predictive process [6] are degenerate, with their ranks equaling the number of selected regressors and knots, respectively. The squared exponential kernel is very popular in machine learning. Its RKHS belongs to the class of RKHSs of kernels with exponentially decaying eigenvalues. Similarly, the class of RKHSs of kernels with polynomially decaying eigenvalues includes the Sobolev spaces with different orders of smoothness and the RKHS of the Matérn kernel. This kernel is most

relevant for spatial applications, but we provide the other two results for a more general audience.

Theorem 3.5 If Assumptions B.1-B.3 hold and $r > 4$ in Assumption B.3, then, as $n \to \infty$,

(i) if $C_\alpha$ is a degenerate kernel of rank $d$ and $k \le c \, n^{(r-4)/(r-2)} / (\log n)^{r/(r-2)}$ for some constant $c > 0$, then the $L_2$-risk of the DISK posterior satisfies $E_{s^*} E_S E_{0 \mid S} \{\bar{w}(s^*) - w_0(s^*)\}^2 = O(n^{-1})$;

(ii) if $\mu_i \le c_{1\mu} \exp(-c_{2\mu} i)$ for some constants $c_{1\mu} > 0$, $c_{2\mu} > 0$ and all $i \in N$, and for some constant $c > 0$, $k \le c \, n^{(r-4)/(r-2)} / (\log n)^{3r/(r-2)}$, then the $L_2$-risk of the DISK posterior satisfies $E_{s^*} E_S E_{0 \mid S} \{\bar{w}(s^*) - w_0(s^*)\}^2 = O(\log n / n)$; and

(iii) if $\mu_i \le c_\mu i^{-2\nu}$ for some constants $c_\mu > 0$, $\nu > r/(r-4)$ and all $i \in N$, and for some constant $c > 0$, $k \le c \, n^{\{(r-4)\nu - r\}/\{(r-2)\nu\}} / (\log n)^{r/(r-2)}$, then the $L_2$-risk of the DISK posterior satisfies $E_{s^*} E_S E_{0 \mid S} \{\bar{w}(s^*) - w_0(s^*)\}^2 = O(n^{-(2\nu - 1)/(2\nu)})$.

The rates of decay of the $L_2$-risks in (i) and (ii) are known to be minimax optimal [57, 84, 8], whereas the rate of decay of the $L_2$-risk in (iii) is slightly larger than the minimax optimal rate by a factor of $n^{1/\{2\nu(2\nu+1)\}}$ [84]. The main advantage of the DISK posterior over its non-Bayesian counterparts is that it achieves optimal performance in the first two cases and near optimal performance in the third case while being free of tuning parameter selection. In most applications $D$ is also compact, so that $r$ in Assumption B.3 can be taken as infinity. This implies that the upper bounds on $k$ in (i), (ii), and (iii) reduce to $k = O(n / \log n)$, $k = O(n / \log^3 n)$, and $k = O(n^{(\nu - 1)/\nu} / \log n)$, respectively.

Theoretical results similar to Theorems 3.4 and 3.5 are well studied in frequentist divide-and-conquer kernel ridge regression, but developing Bayesian analogues of these results remains an active area of research. Cheng and Shang have studied the theoretical properties of divide-and-conquer Bayesian nonparametric regression under various theoretical setups [, 64, 65]. Recently, Szabo and van Zanten [7] have explored the convergence rate of the $L_2$-risk and coverage in the classical Gaussian white noise model.
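To make the DISK posterior at a single test location concrete, the sketch below follows the Section 3.6 description: each subset is fitted with noise variance $\tau^2 / k$, and the combined density uses the average of the subset means and the squared average of the subset standard deviations. It is an illustration under stated assumptions, not the authors' sampler: the covariance parameters are treated as known, the kernel is exponential, the locations are one-dimensional, and the truth $w_0(s) = \sin(2 \pi s)$ and all function names are hypothetical.

```python
import numpy as np

def exp_kernel(x, y, sigma2=1.0, phi=0.3):
    """Exponential covariance between two sets of 1-D locations (illustrative choice)."""
    dist = np.abs(x[:, None] - y[None, :])
    return sigma2 * np.exp(-dist / phi)

def subset_posterior(s_sub, y_sub, s_star, tau2, k):
    """Posterior mean and variance of w(s*) from one subset, using noise variance tau2 / k."""
    star = np.array([s_star])
    C_jj = exp_kernel(s_sub, s_sub)
    c_j = exp_kernel(s_sub, star)[:, 0]          # covariances between subset and s*
    c_ss = exp_kernel(star, star)[0, 0]          # prior variance at s*
    K = C_jj + (tau2 / k) * np.eye(len(s_sub))
    mean = c_j @ np.linalg.solve(K, y_sub)
    var = c_ss - c_j @ np.linalg.solve(K, c_j)
    return mean, var

def disk_posterior(s, y, s_star, tau2, k, rng):
    """Combine k subset posteriors: average the means and average the standard deviations."""
    subsets = np.array_split(rng.permutation(len(s)), k)
    stats = [subset_posterior(s[i], y[i], s_star, tau2, k) for i in subsets]
    means = np.array([m for m, _ in stats])
    sds = np.sqrt(np.array([v for _, v in stats]))
    return means.mean(), sds.mean() ** 2         # (m_bar, v_bar)

rng = np.random.default_rng(0)
n, k, tau2, s_star = 400, 4, 0.1, 0.25           # assumed toy values
s = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * s) + rng.normal(scale=np.sqrt(tau2), size=n)
m_bar, v_bar = disk_posterior(s, y, s_star, tau2, k, rng)
print(m_bar, v_bar)
```

Rerunning the last lines over a grid of `k` values and comparing `m_bar` with the assumed truth is one simple way to observe the bias-variance tradeoff discussed in this section; with `k = 1` the function reduces to the usual full data GP posterior at `s_star`.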
Focusing on the practical issue, we have developed a general method and related sampling algorithms for extending any method for Bayesian nonparametric regression based on a GP prior to massive data settings using the divide-and-conquer technique. Theorem 3.4 describes the $L_2$-risk of our method, and Theorem 3.5 provides guidance on choosing the number of subsets. Under certain assumptions on $w_0$, our theoretical results reduce to those obtained in previous works; however, none of the previous works focuses on developing a computational method that is as widely applicable as DISK and is grounded in Bayesian asymptotic theory.

4 Experiments

4.1 Simulation setup

We compare DISK with its competitors on synthetic data based on its performance in learning the process parameters, interpolating the unobserved residual spatial surface, and predicting at new locations. This section presents two simulation studies. The first Simulation and second
