arxiv: v1 [stat.me] 28 Dec 2017
A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

Rajarshi Guhaniyogi (1), Cheng Li (2), Terrance D. Savitsky (3), and Sanvesh Srivastava (4)

(1) Department of Applied Mathematics and Statistics, UC Santa Cruz
(2) Department of Statistics and Applied Probability, National University of Singapore
(3) U. S. Bureau of Labor Statistics
(4) Department of Statistics and Actuarial Science, The University of Iowa

November 2018

Abstract

Flexible hierarchical Bayesian modeling of massive data is challenging due to poorly scaling computations in large sample size settings. This article is motivated by spatial process models for analyzing geostatistical data, which typically entail computations that become prohibitive as the number of spatial locations becomes large. We propose a three-step divide-and-conquer strategy within the Bayesian paradigm to achieve massive scalability for any spatial process model. We partition the data into a large number of subsets, apply a readily available Bayesian spatial process model on every subset in parallel, and optimally combine the posterior distributions estimated across all the subsets into a pseudo posterior distribution that conditions on the entire data. The combined pseudo posterior distribution is used for predicting the responses at arbitrary locations and for performing posterior inference on the model parameters and the residual spatial surface. We call this approach Distributed Kriging (DISK). It offers significant advantages in applications where the entire data are or can be stored across multiple machines. Under the standard theoretical setup, we show that if the number of subsets is not too large, then the Bayes $L_2$-risk of estimating the true residual spatial surface using the DISK posterior distribution decays to zero at a nearly optimal rate.
While DISK is a general approach to distributed nonparametric regression, we focus on its applications in spatial statistics and demonstrate its empirical performance using a stationary full-rank and a nonstationary low-rank model based on a Gaussian process (GP) prior. A variety of simulations and a geostatistical analysis of the Pacific Ocean sea surface temperature data validate our theoretical results.

Keywords: Distributed Bayesian inference; Gaussian process; modified predictive process; large and complex spatial data; Wasserstein distance; Wasserstein barycenter.

guhaniyogi@ucsc.edu, stalic@nus.edu.sg, savitsky.terrance@bls.gov, sanvesh-srivastava@uiowa.edu
1 Introduction

A fundamental challenge in geostatistics is the analysis of massive spatially-referenced data. Due to the recent influx of data with complex spatial associations, sophisticated spatial modeling has become an enormously active area of research; see, for example, [8, 3, 4]. Massive spatial data provide scientists with an unprecedented opportunity to hypothesize and test complex theories. This leads to the implementation of rather complex hierarchical GP-based models that are computationally intractable for large $n$, where $n$ is the number of spatial locations, due to the $O(n^3)$ computational cost and the $O(n^2)$ storage cost. This article develops a general distributed Bayesian approach, called Distributed Kriging (DISK), for boosting the scalability of any state-of-the-art spatial process model based on a GP prior by multiple folds using the divide-and-conquer technique.

The literature on process-based modeling of massive spatial data is large, so we only provide a selective review. Briefly, these methods seek dimension reduction by endowing the spatial covariance matrix with either a low-rank or a sparse structure. Low-rank structures represent the spatial surface using $r$ a priori chosen basis functions. They include fixed-rank kriging [], or the predictive process and its variants [6, 4, 34, 5, 6]; see [78] and [4] for comprehensive reviews. The time complexity for fitting spatial models with a low-rank structure decreases from $O(n^3)$ to $O(nr^2)$ floating point operations (flops); however, practical considerations show that when $n$ is large, $r$ must grow roughly as $O(\sqrt{n})$ for accurate estimation, implying that $O(nr^2)$ flops are also expensive in low-rank structures. On the other hand, sparse structures rely on the intuition that the spatial correlation between two distantly located observations is nearly zero, so little information is lost by assuming conditional independence given the intermediate locations.
For example, covariance tapering [4, 9, 5, 63] uses compactly supported covariance functions to create sparse spatial covariance matrices that approximate the full covariance matrix. Alternatively, one could introduce sparsity in the inverse covariance (precision) matrix using conditional independence assumptions or composite likelihoods [75, 59, 7,, 5, 7, 36]. In related literature on computer experiments, localized approximations of GP models are proposed; see, for example, [3, 8, 54]. GP-based modeling using low-rank or sparse structures has also received significant attention in machine learning; see [55, ] for recent reviews.

Some variants of dimension-reduction methods partition the spatial domain into sub-regions containing fewer spatial locations. Each sub-region is modeled using a GP, and these GPs are then hierarchically combined by borrowing information across the sub-regions. Examples include non-stationary models [4], multi-level and multi-resolution models [7, 5, 4, 35], and the Bayesian Treed GP models [3]. These models usually achieve scalability by assuming block-independence at some level of the hierarchy, usually across sub-regions, but may lose scalability when they borrow information across sub-regions. In an unrelated thread, [45] propose parameter estimation in the GP-based geostatistical model using resampling based on stochastic approximation. Besides being fully frequentist in nature, it is less clear how such an idea would be extended to enable analysis of more general nonstationary models with massive data.
The proposed DISK framework is a three-step approach for distributed Bayesian inference in any model based on a spatial process. First, we divide the $n$ spatial locations into $k$ subsets such that each subset has representative samples from all regions of the spatial domain. Second, we choose any spatial model and estimate the posterior distributions for inference and prediction in parallel across the $k$ subsets after raising the likelihood to a power of $k$ in each subset. The pseudo posterior distribution obtained after modifying the likelihood for each subset of data is referred to as the subset posterior. Since each subset posterior distribution conditions on a $1/k$-fraction of the full data, raising the likelihood to the power $k$ ensures that the variance of each subset posterior distribution is of the same order (as a function of $n$) as that of the full data posterior distribution. Third, the $k$ subset posterior distributions are combined into a single pseudo probability distribution, called the DISK pseudo posterior (henceforth, DISK posterior), that conditions on the full data and replaces the computationally expensive full data posterior distribution for the purpose of prediction and inference.

Computationally, the main innovations are in the first and third steps, where general partitioning and combining schemes are unavailable in process-based modeling of spatial data. Theoretically, we provide guarantees on the rate of decay of the Bayes $L_2$-risk in estimating the true residual spatial surface using the DISK posterior as a function of $n$, $k$, and analytic properties of the true spatial surface. This involves two new upper bounds. First, an upper bound for the Bayes risk of the DISK posterior is developed assuming each subset size approaches infinity.
Second, we provide an in-depth analysis of the bias-variance tradeoff in estimating the true spatial surface using the DISK posterior and develop upper bounds on $k$ as a function of $n$ that lead to near optimal performance as $n$ tends to infinity.

Motivated by large and complex data, there has been significant interest in adopting the divide-and-conquer technique for distributed Bayesian inference [5, 76, 77, 3, 6, 33, 48, 8]. DISK differs significantly from these because it combines the collection of $k$ subset posterior distributions through their barycenter, a notion of geometric center that generalizes the Euclidean mean to a space of probability measures. There are recent approaches based on combining subset inferences, either through the posterior means and covariance matrices of parameters across the $k$ subsets [6] or by employing a product of experts [5, 7]. These approaches are intuitively appealing for GP regression, but lead to theoretically sub-optimal uncertainty quantification [7] and are less suited when a GP or its derivative is embedded in a more general hierarchical model; for example, GP-based classification [37]. Recent developments on distributed variational GP [6] are impractical for $n > 10^7$, and no theoretical results are available on the quality of approximation of the full posterior distribution. There are recent works on combining subset posterior distributions through their geometric centers, such as the mean or the median, but they are restricted to parametric models [46, 67, 44, 6, 47, 68]. Extensions to general nonparametric models, including those based on stochastic processes, are missing, except for some empirical results on nonparametric regression using GP [46, 67] and theoretical guarantees for both these methods in the Gaussian white noise model [7]. Combining subset posterior distributions of a stochastic process is challenging for two major reasons.
First, it requires estimation of a function, an infinite dimensional parameter. Second, and most importantly, stochastic process models induce complex dependencies among observations that are challenging to capture with a combination of subset posterior distributions that are estimated independently without accounting for the inter-subset dependence.

Divide-and-conquer nonparametric regression, which includes kriging as a special case, has received significant attention lately in the optimization literature, though the Bayesian literature is relatively lightly populated. The bias-variance decomposition of the $L_2$-risk in the estimation of the true regression function in divide-and-conquer kernel ridge regression is known [84, 8]. Bayesian divide-and-conquer nonparametric regression has been mostly studied from the theoretical perspective [, 64, 65, 7]. Filling the methodological gap, the DISK framework provides a general approach to enhance the scalability of any process-based model for Bayesian nonparametric regression, including models based on spatial processes. For example, if the application of spatial process models is feasible for a subset of size $m$, then one can run them on $k$ subsets in parallel and DISK allows prediction and inference using $n = mk$ spatial locations. The values of $m$ and $k$ depend mainly upon the available computational resources and the model, but our theoretical results provide guidance on choosing $k$ depending on the analytic properties of the true spatial process.

For clarity of exposition, we illustrate the empirical performance of the DISK framework with the stationary GP and the modified predictive process (MPP) [4] priors. The MPP is a low-rank nonstationary GP prior that allows accurate modeling of spatial surfaces whose variability or smoothness changes with the location. MPP, like any low-rank model, faces computational bottlenecks when $n$ is large, and either computational efficiency or accuracy worsens when MPP is applied to even $10^4$ observations.
Our numerical results establish that DISK with MPP scales to $10^6$ observations without compromising on either computational efficiency or accuracy in inference and prediction. We expect this conclusion to hold for other popular structured GP priors.

The remainder of the manuscript is organized as follows. In Section 2 we outline a Bayesian hierarchical mixed model framework that incorporates models based on both the full-rank and the low-rank GP priors. Our DISK approach works with posterior samples from such models. Section 3 develops the framework for DISK, discusses how to compute the DISK posterior distribution, and offers theoretical insights into DISK for general GPs and their approximations. A detailed simulation study followed by an analysis of the Pacific Ocean sea surface temperature data is presented in Section 4 to justify the use of DISK for real data. Finally, Section 5 discusses what DISK achieves and proposes a number of future directions to explore.

2 Hierarchical Bayesian inference for GP-based spatial models

Consider the standard univariate spatial regression model for the data observed at location $s$ in a compact spatial domain $D$,

$$ y(s) = x(s)^T \beta + w(s) + \epsilon(s), \tag{1} $$
where $y(s)$ is a univariate response at $s$, $x(s)$ is a $p \times 1$ predictor vector at $s$, $\beta$ is a $p \times 1$ predictor coefficient, $w(s)$ is the realization of an unknown spatial function $w(\cdot)$ at $s$, and $\epsilon(s)$ is the realization of a white-noise process $\epsilon(\cdot)$ at $s$ and is independent of $w(\cdot)$. The Bayesian implementation of the model in (1) customarily assumes (a) that $\beta$ a priori follows a Gaussian distribution with mean $\mu_\beta$ and covariance matrix $\Sigma_\beta$ and (b) that $w(\cdot)$ and $\epsilon(\cdot)$ a priori follow mean 0 GPs with covariance functions $C_\alpha(s, s')$ and $D_\alpha(s, s')$ that model $\mathrm{cov}\{w(s), w(s')\}$ and $\mathrm{cov}\{\epsilon(s), \epsilon(s')\}$, respectively, where $\alpha$ are the process parameters indexing the two families of covariance functions and $s, s' \in D$; therefore, the model parameters are $\Omega = \{\alpha, \beta\}$. If $\beta = 0$ in (1), then we obtain the setup for Bayesian nonparametric regression using a GP prior, with $s$ as covariates and $y(s)$ as the response. The training data consist of $n$ predictors and responses, denoted as $\{x(s_1), y(s_1)\}, \ldots, \{x(s_n), y(s_n)\}$, observed at $n$ spatial locations, denoted as $S = \{s_1, \ldots, s_n\}$.

Standard Markov chain Monte Carlo (MCMC) algorithms exist for performing posterior inference on $\Omega$ and the values of $w(\cdot)$ at a given set of locations $S^* = \{s^*_1, \ldots, s^*_l\}$, where $S^* \cap S = \emptyset$, and for predicting $y(s^*)$ for any $s^* \in S^*$ [4]. Given $S$, the prior assumptions on $w(\cdot)$ and $\epsilon(\cdot)$ imply that $w^T = \{w(s_1), \ldots, w(s_n)\}$ and $\epsilon^T = \{\epsilon(s_1), \ldots, \epsilon(s_n)\}$ are independent and follow $N\{0, C(\alpha)\}$ and $N\{0, D(\alpha)\}$, respectively, where $N(m, V)$ denotes the density of a multivariate Gaussian distribution of appropriate dimension with mean $m$ and covariance matrix $V$, and the $(i, j)$th entries of $C(\alpha)$ and $D(\alpha)$ are $C_\alpha(s_i, s_j)$ and $D_\alpha(s_i, s_j)$, respectively. The hierarchy in (1) is completed by assuming that $\alpha$ a priori follows a distribution with density $\pi(\alpha)$. If $y = \{y(s_1), \ldots, y(s_n)\}^T$ is the $n \times 1$ response vector and $X = [x(s_1) : \cdots : x(s_n)]^T$ is the $n \times p$ matrix of predictors, where $p < n$, then the MCMC algorithm for sampling $\Omega$, $w^{*T} = \{w(s^*_1), \ldots, w(s^*_l)\}$, and $y^{*T} = \{y(s^*_1), \ldots, y(s^*_l)\}$ cycles through the following three steps until sufficient samples are drawn post convergence:
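1. (a) Integrate over $w$ in (1) and sample $\beta$ given $y$, $X$, and $\alpha$ from $N(m_\beta, V_\beta)$, where

$$ V_\beta = \{X^T V(\alpha)^{-1} X + \Sigma_\beta^{-1}\}^{-1}, \quad m_\beta = V_\beta \{X^T V(\alpha)^{-1} y + \Sigma_\beta^{-1} \mu_\beta\}, \tag{2} $$

and $V(\alpha) = C(\alpha) + D(\alpha)$; and (b) sample $\alpha$ given $y$, $X$, and $\beta$ using the Metropolis-Hastings algorithm with a normal random walk proposal.

2. Sample $w^*$ given $y$, $X$, $\alpha$, and $\beta$ from $N(m_*, V_*)$, where

$$ V_* = C_{*,*}(\alpha) - C_*(\alpha) V(\alpha)^{-1} C_*(\alpha)^T, \quad m_* = C_*(\alpha) V(\alpha)^{-1} (y - X\beta), \tag{3} $$

$C_*(\alpha)$ and $C_{*,*}(\alpha)$ are $l \times n$ and $l \times l$ matrices, respectively, and the $(i, j)$th entries of $C_{*,*}(\alpha)$ and $C_*(\alpha)$ are $C_\alpha(s^*_i, s^*_j)$ and $C_\alpha(s^*_i, s_j)$, respectively.

3. Sample $y^*$ given $\alpha$, $\beta$, and $w^*$ from $N\{X^* \beta + w^*, D_{*,*}(\alpha)\}$, where $X^{*T} = [x(s^*_1) : \cdots : x(s^*_l)]$ and $D_{*,*}(\alpha)$ is the $l \times l$ analogue of $D(\alpha)$ at the locations in $S^*$.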
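The Gaussian draws in steps 1(a) and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the exponential covariance `exp_cov`, the nugget form $D(\alpha) = \tau^2 I$, the toy data, and all function names are assumptions made for the example, and $\alpha$ is held fixed (the Metropolis-Hastings update in step 1(b) is omitted).

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=1.0):
    """Exponential covariance sigma2 * exp(-phi * distance); an assumed C_alpha."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def beta_conditional(y, X, V, mu_beta, Sigma_beta):
    """Mean and covariance of beta given y, X, alpha, following (2)."""
    Sigma_inv = np.linalg.inv(Sigma_beta)
    V_beta = np.linalg.inv(X.T @ np.linalg.solve(V, X) + Sigma_inv)
    m_beta = V_beta @ (X.T @ np.linalg.solve(V, y) + Sigma_inv @ mu_beta)
    return m_beta, V_beta

def w_star_conditional(y, X, beta, V, C_star, C_star_star):
    """Mean and covariance of w* given y, X, alpha, beta, following (3)."""
    m_star = C_star @ np.linalg.solve(V, y - X @ beta)
    V_star = C_star_star - C_star @ np.linalg.solve(V, C_star.T)
    return m_star, (V_star + V_star.T) / 2  # symmetrize against round-off

# Toy data: n training locations, l prediction locations, p predictors.
rng = np.random.default_rng(0)
n, l, p, tau2 = 50, 5, 2, 0.1
S, S_star = rng.uniform(size=(n, 2)), rng.uniform(size=(l, 2))
X = rng.normal(size=(n, p))
V = exp_cov(S, S) + tau2 * np.eye(n)          # V(alpha) = C(alpha) + D(alpha)
y = X @ np.array([1.0, -0.5]) + rng.multivariate_normal(np.zeros(n), V)

# Step 1(a): draw beta; step 2: draw w* at the prediction locations.
m_beta, V_beta = beta_conditional(y, X, V, np.zeros(p), 100.0 * np.eye(p))
beta = rng.multivariate_normal(m_beta, V_beta)
m_star, V_star = w_star_conditional(
    y, X, beta, V, exp_cov(S_star, S), exp_cov(S_star, S_star))
w_star = rng.multivariate_normal(m_star, V_star)
```

Every call to `np.linalg.solve(V, ...)` costs $O(n^3)$ flops for a dense $n \times n$ matrix, which is the bottleneck that the rest of the paper addresses.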
Many Bayesian spatial models can be formulated in terms of (1) by assuming different forms of $C_\alpha(s, s')$ and $D_\alpha(s, s')$; see [4] and the supplementary material for details on the MCMC algorithm.

MCMC computations face a major computational bottleneck due to matrix inversions involving $C(\alpha)$. The steps 1(a), 1(b), and 2 for sampling $\Omega$ and $w^*$ involve inversion of $C(\alpha) + D(\alpha)$. Irrespective of the form of $D(\alpha)$, if no additional assumptions are made on the structure of $C(\alpha)$, then the three steps require $O(n^3)$ flops in computation and $O(n^2)$ memory units in storage in every MCMC iteration. Spatial models with this form of posterior computations are based on a full-rank GP prior. In practice, if $n$ is of the order of $10^4$ or larger, then posterior computations in a model based on a full-rank GP prior are infeasible due to numerical issues in matrix inversions involving an unstructured $C(\alpha)$.

This problem is solved by imposing additional structure on $C(\alpha)$. Our focus is on those methods that impose a low-rank structure on the covariance function of a GP prior [55, 4]. Every method in this class expresses the covariance function in terms of $r \ll n$ basis functions, in turn inducing a low-rank GP prior. We use the MPP as a representative example of this class of methods. Let $S^{(0)} = \{s^{(0)}_1, \ldots, s^{(0)}_r\}$ be a set of $r$ locations, known as the knots, which may or may not intersect with $S$. Let $c(s, S^{(0)}) = \{C_\alpha(s, s^{(0)}_1), \ldots, C_\alpha(s, s^{(0)}_r)\}^T$ be an $r \times 1$ vector and $C(S^{(0)})$ be an $r \times r$ matrix whose $(i, j)$th entry is $C_\alpha(s^{(0)}_i, s^{(0)}_j)$. Using $c(s_1, S^{(0)}), \ldots, c(s_n, S^{(0)})$ and $C(S^{(0)})$, define the diagonal matrix $\delta = \mathrm{diag}\{\delta(s_1), \ldots, \delta(s_n)\}$ with

$$ \delta(s_i) = C_\alpha(s_i, s_i) - c^T(s_i, S^{(0)}) \, C(S^{(0)})^{-1} \, c(s_i, S^{(0)}), \quad i = 1, \ldots, n. \tag{4} $$

Let $1(a = b) = 1$ if $a = b$ and 0 otherwise. Then, the MPP is a GP with the covariance function

$$ \tilde{C}_\alpha(s, s') = c^T(s, S^{(0)}) \, C(S^{(0)})^{-1} \, c(s', S^{(0)}) + \delta(s) \, 1(s = s'), \quad s, s' \in D, \tag{5} $$

where $\tilde{C}_\alpha(s, s')$ depends on the covariance function of the parent GP and the selected $r$ knots, which define $C(S^{(0)})$, $c^T(s, S^{(0)})$, and $c^T(s', S^{(0)})$. We have used a tilde in (5) to distinguish the covariance function of a low-rank GP prior from that of its parent full-rank GP.
If $\tilde{C}(\alpha)$ is a matrix with $(i, j)$th entry $\tilde{C}_\alpha(s_i, s_j)$, then the posterior computations using MPP, a low-rank GP prior, replace $C(\alpha)$ by $\tilde{C}(\alpha)$ in the steps 1(a), 1(b), and 2. The rank-$r$ structure imposed by $C(S^{(0)})$ implies that computing $\tilde{C}(\alpha)^{-1}$ requires $O(nr^2)$ flops using the Woodbury formula [38].

Spatial models based on a low-rank GP prior, including MPP, suffer from computational bottlenecks in massive data settings. The computational complexity of posterior computations for a rank-$r$ GP prior is $O(nr^2)$, which is linear in $n$; however, practical considerations often necessitate that $r = O(\sqrt{n})$ for accurate inference and prediction. This severely limits the computational advantages of low-rank GP priors, including MPP, especially in applications with large $n$ [5]. The next few sections develop our DISK framework, which is key in extending any GP-based spatial model to massive data using the divide-and-conquer technique without compromising either on the quality of inference and prediction or on the computational cost.
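The MPP covariance in (4) and (5) and the Woodbury-based solve can be sketched as follows. This is an illustrative sketch under stated assumptions: the exponential parent covariance, the random knots, the nugget $D(\alpha) = \tau^2 I$, and the names `mpp_cov` and `woodbury_solve` are all hypothetical choices for the example. The dense $n \times n$ matrix is formed only to check the fast solve against a direct one.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=1.0):
    """Exponential covariance; an assumed parent covariance function."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def mpp_cov(S, knots, cov=exp_cov):
    """Dense MPP covariance at S, built from (4) and (5)."""
    c = cov(S, knots)                 # n x r matrix with rows c(s_i, knots)^T
    C_knots = cov(knots, knots)       # r x r knot covariance
    low_rank = c @ np.linalg.solve(C_knots, c.T)
    parent_diag = np.array([cov(s[None, :], s[None, :])[0, 0] for s in S])
    delta = parent_diag - np.diag(low_rank)       # equation (4)
    return low_rank + np.diag(delta), c, C_knots, delta

def woodbury_solve(c, C_knots, d, b):
    """Solve (c C_knots^{-1} c^T + diag(d)) x = b using only r x r solves."""
    Dinv_b = b / d
    Dinv_c = c / d[:, None]
    inner = C_knots + c.T @ Dinv_c    # r x r system
    return Dinv_b - Dinv_c @ np.linalg.solve(inner, c.T @ Dinv_b)

rng = np.random.default_rng(0)
n, r, tau2 = 200, 10, 0.1
S, knots = rng.uniform(size=(n, 2)), rng.uniform(size=(r, 2))
C_tilde, c, C_knots, delta = mpp_cov(S, knots)
b = rng.normal(size=n)
x_fast = woodbury_solve(c, C_knots, delta + tau2, b)           # O(n r^2)
x_dense = np.linalg.solve(C_tilde + tau2 * np.eye(n), b)       # O(n^3)
```

The correction $\delta$ restores the parent variance on the diagonal, so `np.diag(C_tilde)` equals the parent variance (1 in this toy setup) at every location.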
3 Distributed Kriging

3.1 First step: partitioning of spatial locations

We partition the $n$ spatial locations into $k$ subsets. The value of $k$ depends on the chosen spatial model, and it is large enough to ensure efficient posterior computations on any subset. The default partitioning scheme is to randomly allocate the locations into $k$ subsets, but we specify a technical condition later which ensures that every subset has locations from all regions of the spatial domain. For theoretical and notational simplicity, we also assume that the $k$ subsets are non-overlapping and that every subset has $m$ spatial locations so that $n = mk$.

Let $S_j$ be the set of spatial locations in subset $j$ ($j = 1, \ldots, k$). Our simplifying assumptions imply that $\cup_{j=1}^k S_j = S$, $S_j \cap S_{j'} = \emptyset$ for $j \neq j'$, and $S_j = \{s_{j1}, \ldots, s_{jm}\}$, where $s_{ji} = s_{i'}$ for some $s_{i'} \in S$ and for every $j = 1, \ldots, k$ and $i = 1, \ldots, m$. Denote the data in the $j$th partition as $\{y_j, X_j\}$ ($j = 1, \ldots, k$), where $y_j = \{y(s_{j1}), \ldots, y(s_{jm})\}^T$ is an $m \times 1$ vector and $X_j = [x(s_{j1}) : \cdots : x(s_{jm})]^T$ is an $m \times p$ matrix of predictors corresponding to the spatial locations in $S_j$ with $p < m$.

The univariate spatial regression model using either a full-rank or a low-rank GP prior for the data observed at any location $s_{ji} \in S_j \subset D$ is given by

$$ y(s_{ji}) = x(s_{ji})^T \beta + w(s_{ji}) + \epsilon(s_{ji}), \quad i = 1, \ldots, m. \tag{6} $$

Let $w_j^T = \{w(s_{j1}), \ldots, w(s_{jm})\}$ and $\epsilon_j^T = \{\epsilon(s_{j1}), \ldots, \epsilon(s_{jm})\}$ be the realizations of the GP $w(\cdot)$ and the white-noise process $\epsilon(\cdot)$, respectively, in the $j$th subset. After marginalizing over $w_j$ in the GP-based model for the $j$th subset, the likelihood of $\Omega = \{\alpha, \beta\}$ is given by

$$ l_j(\Omega) = N\{y_j \mid X_j \beta, V_j(\alpha)\}, \tag{7} $$

where $N(y \mid m, V)$ represents the multivariate normal density of $y$ with mean $m$ and covariance matrix $V$, $V_j(\alpha) = C_j(\alpha) + D_j(\alpha)$ and $\tilde{V}_j(\alpha) = \tilde{C}_j(\alpha) + D_j(\alpha)$ for full-rank and low-rank GP priors, respectively, and $C_j(\alpha)$, $\tilde{C}_j(\alpha)$, $D_j(\alpha)$ are obtained by extending the definitions of $C(\alpha)$, $\tilde{C}(\alpha)$, $D(\alpha)$ to the $j$th subset. In a model based on a full-rank or a low-rank GP prior, the likelihood of $w_j$ given $y_j$, $X_j$, and $\Omega$ is

$$ l_j(w_j) = N\{y_j - X_j \beta - w_j \mid 0, D_j(\alpha)\}. \tag{8} $$

The likelihoods in (7) and (8) are used for defining the posterior distributions for $\beta$, $\alpha$, $w^*$, $y^*$, called the $j$th subset posterior distributions, using a full-rank or a low-rank GP prior in subset $j$.
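The default random partitioning scheme can be sketched as below. This is a minimal illustration assuming $n = mk$ exactly, as in the simplifying assumptions of this section; the function name `random_partition` and the toy locations and responses are hypothetical.

```python
import numpy as np

def random_partition(n, k, seed=0):
    """Randomly split indices 0..n-1 into k disjoint subsets of size m = n // k."""
    if n % k != 0:
        raise ValueError("this sketch assumes n = m * k")
    rng = np.random.default_rng(seed)
    return rng.permutation(n).reshape(k, n // k)   # row j holds the indices of S_j

n, k = 1200, 6
rng = np.random.default_rng(1)
S = rng.uniform(size=(n, 2))       # toy locations in the unit square
y = rng.normal(size=n)             # toy responses

subsets = random_partition(n, k)
S_sub = [S[idx] for idx in subsets]    # locations S_1, ..., S_k
y_sub = [y[idx] for idx in subsets]    # responses y_1, ..., y_k
```

Because the allocation is uniform at random, each subset tends to contain locations from all regions of the domain when $m$ is not too small, which is the property that the norm-equivalence condition stated later formalizes.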
3.2 Second step: sampling from subset posterior distributions

We define subset posterior distributions by modifying the likelihoods in (7) and (8). More precisely, the density of the $j$th subset posterior distribution of $\Omega$ is given by

$$ \pi_m(\Omega \mid y_j) = \frac{\{l_j(\Omega)\}^k \pi(\Omega)}{\int \{l_j(\Omega)\}^k \pi(\Omega) \, d\Omega}, \tag{9} $$

where we assume that $\int \{l_j(\Omega)\}^k \pi(\Omega) \, d\Omega < \infty$, and the subscript $m$ denotes that the density conditions on $m$ samples in the $j$th subset. The modification of the likelihood to yield the subset posterior density in (9) is called stochastic approximation [46]. Raising the likelihood to the power of $k$ is equivalent to replicating every $y(s_{ji})$ $k$ times ($i = 1, \ldots, m$). Thus, stochastic approximation accounts for the fact that the $j$th subset posterior distribution conditions on a $1/k$-fraction ($k = n/m$) of the full data and ensures that its variance is of the same order (as a function of $n$) as that of the full data posterior distribution. This is a common strategy adopted in divide-and-conquer based Bayesian inference in parametric models; see, for example, [44, 8] for recent applications.

With the proposed stochastic approximation in (9), the full conditional densities of the $j$th subset posterior distributions for prediction and inference follow from their full data counterparts. The $j$th full conditional densities of $\beta$ and $\alpha$ in the GP-based models are given by

$$ \pi_m(\beta \mid y_j, \alpha) = \frac{\{l_j(\Omega)\}^k \pi(\beta)}{\int \{l_j(\Omega)\}^k \pi(\beta) \, d\beta}, \quad \pi_m(\alpha \mid y_j, \beta) = \frac{\{l_j(\Omega)\}^k \pi(\alpha)}{\int \{l_j(\Omega)\}^k \pi(\alpha) \, d\alpha}, \tag{10} $$

where $\pi(\beta) = N(\mu_\beta, \Sigma_\beta)$, $\pi(\alpha)$ is the prior density of $\alpha$, and we assume that $\int \{l_j(\Omega)\}^k \pi(\beta) \, d\beta$ and $\int \{l_j(\Omega)\}^k \pi(\alpha) \, d\alpha$ are finite. The $j$th full conditional densities of $y^*$ and $w^*$ are calculated after modifying the likelihood of $w_j$ in (8) using stochastic approximation.
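A short worked step may help to see what powering does to a Gaussian likelihood. The derivation below is a standard identity rather than a result specific to this paper; $\mu$ and $V$ are generic symbols introduced only for this illustration.

```latex
\{N(y \mid \mu, V)\}^{k}
  \;\propto\; \exp\!\Big\{-\tfrac{k}{2}\,(y-\mu)^{T} V^{-1} (y-\mu)\Big\}
  \;=\; \exp\!\Big\{-\tfrac{1}{2}\,(y-\mu)^{T} \big(k^{-1}V\big)^{-1} (y-\mu)\Big\}
  \;\propto\; N\big(y \mid \mu,\; k^{-1}V\big),
```

where proportionality is as a function of $\mu$ for fixed $V$. Applied to the likelihood of $w_j$ with $\mu = X_j\beta + w_j$ and $V = D_j(\alpha)$, raising the likelihood to the power $k$ amounts to replacing the noise covariance $D_j(\alpha)$ by $k^{-1} D_j(\alpha)$.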
Given $y_j$, $X_j$, and $\Omega$, straightforward calculation yields that the $j$th subset posterior predictive density of $w^*$ is $\pi_m(w^* \mid y_j, \Omega) = N(w^* \mid m_{j*}, V_{j*})$, where

$$ V_{j*} = C_{*,*}(\alpha) - C_{*j}(\alpha) V_j(\alpha)^{-1} C_{*j}(\alpha)^T, \quad m_{j*} = C_{*j}(\alpha) V_j(\alpha)^{-1} (y_j - X_j \beta), \tag{11} $$

where, in (11), $V_j(\alpha) = C_j(\alpha) + k^{-1} D_j(\alpha)$ and $\tilde{V}_j(\alpha) = \tilde{C}_j(\alpha) + k^{-1} D_j(\alpha)$ for full-rank and low-rank GP priors, respectively, and $C_{*,*}(\alpha)$, $C_{*j}(\alpha)$ are $l \times l$, $l \times m$ matrices obtained by extending the definitions in (3) to subset $j$ for full-rank and low-rank GP priors with covariance functions $C_\alpha(\cdot, \cdot)$ and $\tilde{C}_\alpha(\cdot, \cdot)$, respectively. We note that the stochastic approximation exponent, $k$, scales $D_j(\alpha)$ in $V_j(\alpha)$ so that the uncertainty in the subset posterior distributions is scaled to that of the full data posterior. The $j$th subset posterior predictive density of $y^*$ given the samples of $w^*$ and $\Omega$ in the $j$th subset is $N\{y^* \mid X^* \beta + w^*, D_{*,*}(\alpha)\}$. We employ the same three-step sampling algorithm introduced earlier, specialized to subset $j$ ($j = 1, \ldots, k$), sampling $\{\beta, \alpha, y^*, w^*\}$ in each subset across multiple MCMC iterations; see the supplementary material for detailed derivations of subset posterior sampling algorithms in the full-rank and low-rank GP priors.

The computational complexity of the $j$th subset posterior computations follows from their full data counterparts if we replace $n$ by $m$. Specifically, the computational complexities for sampling a subset posterior distribution are $O(m^3)$ and $O(mr^2)$ flops if the model in (6) uses a full-rank or a low-rank GP prior, respectively. Since the subset posterior computations are performed in parallel across $k$ subsets, the computational complexities for obtaining $B$ post burn-in subset posterior samples from $k$ subsets are $O(km^3) = O(nm^2)$ and $O(kmr^2) = O(nr^2)$ flops in models based on full-rank and low-rank GP priors, respectively.

The combination step of subset posteriors using the DISK framework outlined below is more widely applicable than other divide-and-conquer type approaches because it does not rely on any model- or data-specific assumptions, such as independence, except that every subset posterior distribution has a density with respect to the Lebesgue measure and has finite second moments.

3.3 Third step: combination of subset posterior distributions

3.3.1 Wasserstein distance and Wasserstein barycenter

The combination step relies on the Wasserstein barycenter, so we provide some background on this topic. Let $(\Theta, \rho)$ be a complete separable metric space and $P(\Theta)$ be the space of all probability measures on $\Theta$. The Wasserstein space of order 2 is a set of probability distributions defined as

$$ P_2(\Theta) = \Big\{ \mu \in P(\Theta) : \int_\Theta \rho^2(\theta, \theta_0) \, \mu(d\theta) < \infty \Big\}, \tag{12} $$

where $\theta_0 \in \Theta$ is arbitrary and $P_2(\Theta)$ does not depend on the choice of $\theta_0$. The Wasserstein distance of order 2, denoted as $W_2$, metrizes $P_2(\Theta)$. Let $\mu$, $\nu$ be two probability measures in $P_2(\Theta)$ and $\Pi(\mu, \nu)$ be the set of all probability measures on $\Theta \times \Theta$ with marginals $\mu$ and $\nu$; then the $W_2$ distance between $\mu$ and $\nu$ is defined as

$$ W_2(\mu, \nu) = \Big\{ \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Theta \times \Theta} \rho^2(x, y) \, d\pi(x, y) \Big\}^{1/2}. \tag{13} $$

Let $\nu_1, \ldots, \nu_k \in P_2(\Theta)$; then the Wasserstein barycenter of $\nu_1, \ldots, \nu_k$ is defined as

$$ \bar{\nu} = \operatorname*{argmin}_{\nu \in P_2(\Theta)} \sum_{j=1}^k \frac{1}{k} W_2^2(\nu, \nu_j). \tag{14} $$

It is known that $\bar{\nu} \in P_2(\Theta)$ is the unique solution of a linear program [].

If $\nu_1, \ldots, \nu_k$ in (14) represent the $k$ subset posterior distributions, then the Wasserstein barycenter $\bar{\nu}$ provides a general notion of the mean of the $k$ subset posterior distributions. If the $k$ subset posteriors are combined using $\bar{\nu}$, then $\bar{\nu}$ has finite second moments, conditions on the full data, and does not rely on model-specific or data-specific assumptions. If $\nu_1, \ldots, \nu_k$ are analytically intractable but MCMC samples are available from them, then an empirical approximation of the Wasserstein barycenter can be estimated by solving a sparse linear program or by averaging empirical subset posterior quantiles [4, 9, 67, 44, 69].
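For a one-dimensional quantity, the $W_2$ barycenter has a quantile function equal to the average of the individual quantile functions, which is what makes quantile averaging practical. The sketch below illustrates this on toy Gaussian "subset posteriors"; the function name `barycenter_quantiles`, the grid size, and the toy means and standard deviations are assumptions for the example only.

```python
import numpy as np

def barycenter_quantiles(subset_samples, xi=0.05):
    """Average empirical quantiles over subsets on the grid xi, 2*xi, ..., 1 - xi."""
    q = np.arange(1, int(round(1.0 / xi))) * xi
    subset_q = np.array([np.quantile(s, q) for s in subset_samples])  # k x len(q)
    return q, subset_q.mean(axis=0)

rng = np.random.default_rng(0)
means = np.array([-1.0, 0.0, 2.0])   # toy subset posterior means
sds = np.array([0.5, 1.0, 1.5])      # toy subset posterior standard deviations
samples = [rng.normal(mu, sd, size=200_000) for mu, sd in zip(means, sds)]

q, bary_q = barycenter_quantiles(samples, xi=0.05)
# For one-dimensional Gaussians the barycenter is Gaussian with the averaged
# mean and the averaged standard deviation, so its median is close to
# means.mean() and its interquartile range is close to 1.349 * sds.mean().
```

Note that the barycenter averages standard deviations rather than variances, so it differs from both a mixture of the subset posteriors and a product of their densities.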
3.3.2 Combination scheme

In the DISK framework, we combine the collection of posterior samples from the $k$ subset posterior distributions for $\beta$, $\alpha$, $w^*$, and $y^*$ through their respective Wasserstein barycenters. Our combination scheme relies on the following key result. If $\theta$ is a one-dimensional functional of the full data posterior distribution and the $j$th subset posterior distribution for $\theta$ is denoted by $\nu_j$, then the $q$th quantile of the Wasserstein barycenter for $\theta$, denoted as $\bar{\nu}$, is estimated from the collection of $k$ subset posterior samples as

$$ \hat{\bar{\nu}}^q = \frac{1}{k} \sum_{j=1}^k \hat{\nu}_j^q, \quad q = \xi, 2\xi, \ldots, 1 - \xi, \tag{15} $$

where $\xi$ is the grid-size of the quantiles, $\hat{\nu}_j^q$ is the estimate of the $q$th quantile of $\nu_j$ obtained using MCMC samples from $\nu_j$, and $\hat{\bar{\nu}}^q$ is the estimate of the $q$th quantile of $\bar{\nu}$ [44].

The post burn-in samples from the $k$ subset posterior distributions are combined using (15). Let $\{\beta_{jb}\}_{b=1}^B$, $\{\alpha_{jb}\}_{b=1}^B$, $\{w^*_{jb}\}_{b=1}^B$, $\{y^*_{jb}\}_{b=1}^B$ ($j = 1, \ldots, k$) be the collection of $B$ post burn-in MCMC samples from the $k$ subsets; $\beta_{jib}$ and $\alpha_{ji'b}$ be the $b$th post burn-in MCMC samples for the $i$th and $i'$th marginals of $\beta$ and $\alpha$ from their $j$th subset posteriors, where $i = 1, \ldots, p$, $i' = 1, \ldots, s$, $p$ is the dimension of $\beta$, and $s$ is the dimension of $\alpha$. If $\bar{\beta}$, $\bar{\alpha}$, $\bar{w}^*$, and $\bar{y}^*$ represent the random variables that follow the DISK posterior distributions for $\beta$, $\alpha$, $w^*$, and $y^*$, then MCMC-based estimates of the DISK posterior are obtained through their quantile estimates as follows:

1. Use (15) to estimate the $q$th quantiles of the $i$th and $i'$th marginals of the DISK posteriors for $\beta$ and $\alpha$, respectively, as

$$ \hat{\bar{\beta}}_i^q = \frac{1}{k} \sum_{j=1}^k \hat{\beta}_{ji}^q, \quad i = 1, \ldots, p, \qquad \hat{\bar{\alpha}}_{i'}^q = \frac{1}{k} \sum_{j=1}^k \hat{\alpha}_{ji'}^q, \quad i' = 1, \ldots, s, \tag{16} $$

where $\hat{\beta}_{ji}^q$ and $\hat{\alpha}_{ji'}^q$ are the estimates of the $q$th quantiles of the $i$th and $i'$th marginals of the $j$th subset posterior distributions for $\beta$ and $\alpha$ obtained from $\{\beta_{jb}\}_{b=1}^B$ and $\{\alpha_{jb}\}_{b=1}^B$.

2. Use (15) to estimate the pointwise $q$th quantiles of $w(s^*_i)$ and $y(s^*_i)$ as

$$ \hat{\bar{w}}^q(s^*_i) = \frac{1}{k} \sum_{j=1}^k \hat{w}_j^q(s^*_i), \quad \hat{\bar{y}}^q(s^*_i) = \frac{1}{k} \sum_{j=1}^k \hat{y}_j^q(s^*_i), \quad i = 1, \ldots, l, \tag{17} $$

where $\hat{w}_j^q(s^*_i)$ and $\hat{y}_j^q(s^*_i)$ are the estimates of the $q$th quantiles of $w(s^*_i)$ and $y(s^*_i)$ of the $j$th subset posterior distributions for $w(s^*_i)$ and $y(s^*_i)$ obtained from $\{w^*_{jb}\}_{b=1}^B$ and $\{y^*_{jb}\}_{b=1}^B$.

A key feature of the DISK combination scheme is that given the subset posterior samples, the combination step is agnostic to the choice of a model. Specifically, (15) remains the same for models based on a full-rank GP prior or a low-rank GP prior, such as MPP, given MCMC samples from the $k$ subset posterior distributions. Since the averaging over $k$ subsets takes $O(k)$ flops and $k < n$,
the total time for computing the empirical quantile estimates of the DISK posterior in inference or prediction requires $O(k) + O(m^3)$ and $O(k) + O(nm)$ flops in models based on full-rank and low-rank GP priors. Assuming that we have abundant computational resources, $k$ is chosen large enough so that $O(m^3)$ computations are feasible. This would enable applications of the DISK framework in models based on both full-rank and low-rank GP priors in massive $n$ settings.

3.4 Illustrative example: linear regression

We develop intuitions behind the DISK posterior in this section by studying its theoretical properties in multivariate linear regression. The spatial linear model in (1) reduces to a linear regression model with error variance 1 if $\epsilon(s)$ follows $N(0, 1)$ and there is no spatial effect; that is, $w(s) = 0$ for every $s \in D$. A flat prior on $\beta$ implies that the full data posterior distribution of $\beta$ given $y$, denoted as $\Pi_n$, has density $N\{\hat{\beta}, (X^T X)^{-1}\}$, where $\hat{\beta} = (X^T X)^{-1} X^T y$. Using notation similar to the earlier sections, the $j$th subset posterior distribution has density $N\{\hat{\beta}_j, k^{-1} (X_j^T X_j)^{-1}\}$, where $\hat{\beta}_j = (X_j^T X_j)^{-1} X_j^T y_j$ ($j = 1, \ldots, k$) and the factor of $k^{-1}$ in the posterior covariance matrix comes from stochastic approximation. The Wasserstein barycenter of the $k$ subset posterior distributions, denoted as $\bar{\Pi}_n$, is Gaussian with density $N(\bar{m}, \bar{V})$, where $\bar{m} = \frac{1}{k} \sum_{j=1}^k \hat{\beta}_j$ and $\bar{V}$ is such that

$$ \sum_{j=1}^k \frac{1}{k} \Big\{ \bar{V}^{1/2} \, k^{-1} (X_j^T X_j)^{-1} \, \bar{V}^{1/2} \Big\}^{1/2} = \bar{V}; \tag{18} $$

see Section 6.3 in []. The DISK framework replaces $N\{\hat{\beta}, (X^T X)^{-1}\}$ with $N(\bar{m}, \bar{V})$ for inference and predictions in massive data settings.

The covariance matrix $\bar{V}$ in (18) is estimated via fixed point iterations; however, $\bar{V}$ is analytically tractable when $X_1, \ldots, X_k$ have identical left and right singular vectors []. If $U$ and $V$ are $m \times p$ and $p \times p$ orthogonal matrices and $D_j = \mathrm{diag}(d_{j1}, \ldots, d_{jp})$ is a $p \times p$ diagonal matrix containing the singular values of $X_j$, then $X_j = U D_j V^T$, $X_j^T X_j = V D_j^2 V^T$, and $X^T X = V \big( \sum_{j=1}^k D_j^2 \big) V^T$. The mean vector and covariance matrix of $\bar{\Pi}_n$ reduce to

$$ \bar{m} = \frac{1}{k} \sum_{j=1}^k V D_j^{-1} U^T y_j \quad \text{and} \quad \bar{V} = V \Big( \frac{1}{k^{3/2}} \sum_{j=1}^k D_j^{-1} \Big)^2 V^T. $$
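The fixed-point equation (18) and its closed form under shared singular vectors can be checked numerically. The sketch below uses one commonly used fixed-point map for the covariance of a Gaussian $W_2$ barycenter; this particular iteration, the helper names `sqrtm_psd` and `gaussian_barycenter_cov`, and the toy design matrices are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def sqrtm_psd(A):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh((A + A.T) / 2)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_barycenter_cov(covs, n_iter=100, tol=1e-12):
    """Iterate V <- V^{-1/2} (mean_j (V^{1/2} C_j V^{1/2})^{1/2})^2 V^{-1/2}."""
    V = np.mean(covs, axis=0)          # positive definite starting value
    for _ in range(n_iter):
        R = sqrtm_psd(V)
        R_inv = np.linalg.inv(R)
        M = np.mean([sqrtm_psd(R @ C @ R) for C in covs], axis=0)
        V_new = R_inv @ M @ M @ R_inv
        if np.linalg.norm(V_new - V) < tol:
            return V_new
        V = V_new
    return V

# Toy subsets X_j = U D_j Vmat^T sharing left and right singular vectors.
rng = np.random.default_rng(0)
k, m, p = 4, 30, 3
U, _ = np.linalg.qr(rng.normal(size=(m, p)))       # m x p, orthonormal columns
Vmat, _ = np.linalg.qr(rng.normal(size=(p, p)))    # p x p orthogonal
d = rng.uniform(1.0, 3.0, size=(k, p))             # singular values d_{jr}
X_sub = [U @ np.diag(d[j]) @ Vmat.T for j in range(k)]

# Subset posterior covariances k^{-1} (X_j^T X_j)^{-1}.
covs = [np.linalg.inv(X.T @ X) / k for X in X_sub]
V_bar = gaussian_barycenter_cov(covs)

# Closed form stated in the text for shared singular vectors.
closed = Vmat @ np.diag((k ** -1.5 * (1.0 / d).sum(axis=0)) ** 2) @ Vmat.T
```

In this commuting case the iteration reaches the closed form after a single update; for general design matrices several iterations are typically needed.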
Let β be the true value of β and β, β be the random variables with distributions Π, Π n. Assume that c D m dr for some universal positive constant c D and for =,.., k, and r =,..., p, then the Bayes L -risk of the DISK posterior in estimating β is E β β β = E β E β β y } pc D n = O E β β β, where p is fixed, is the Euclidean norm, and E β represents expectation with respect to the density of y obtained by fixing β = β in. This shows that the Bayes L -risk of the DISK posterior in estimating β is upper bounded by the Bayes L -risk of full data posterior distribution. A result like the one above is crucial in developing intuition on the DISK posterior as an alternative to the full data posterior and are generally difficult to come by in the literature for models presented in Section. In the next two sections, we develop theoretical guarantees for the
12 DISK posterior as an alternative to the full posterior in estimating the residual spatial surface. 3.5 Bayes L -risk of DISK: General convergence rates Consider a special case when w = in. The regression model reduces to a finite dimensional model with parameters Ω = β, α} and the DISK posterior reduces to the recently developed WASP method [68]. If Ω is the true value of Ω, then it is known that when the data are independent, WASP converges in probability to a Dirac measure centered at Ω or to the full data posterior distribution of Ω at a near optimal rate, under certain assumptions as n, m [68, 44]; however, in models based on spatial process, inference on the infinite dimensional true residual spatial surface, denoted as w, is of primary importance and no formal results are available in the regard. A notable exception is [7], which shows that combination using Wasserstein barycenter has optimal Bayes risk and adapts to the smoothness of w in the Gaussian white noise model. This model is a special case of with additional smoothness assumptions on w. We focus on providing theoretical guarantees for the DISK posterior distribution for estimating w using the spatial regression model in with a GP prior on w, including the low-rank GP prior such as MPP, as n, m. Recall that S is the set of l reference locations in D and S S =. Let w = w s,..., w s l }T be the true residual spatial surface generating the data at the locations in S and w = ws,..., ws l }T be the realization of GP w at the locations in S. Following the standard theoretical setup in [73], we assume that β =, α is known, and Dα = τ I, where τ is the known non-spatial error variance, so that reduces to ys i = ws i + ɛs i, ɛs i N, τ, i =,..., n, w GP, C α, }. 9 This setup subsumes the low-rank GP priors, hence MPP, as MPP is a GP with covariance function C α, in 5. We also assume that the data are generated using ys = w s + ɛs, s D. Adapting our discussing in Section 3. 
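for the models in (6) to the one in (9), we have that $y_j$ given $w_j$ follows the Gaussian distribution with density $N(w_j, k^{-1}\tau^2 I)$ after stochastic approximation, and the GP prior on $w(\cdot)$ implies that, after integrating over $w_j$,

$$y_j \mid w^* \sim N(A_j w^*, \Sigma_j), \quad A_j = C_{*,j}^T C_{*,*}^{-1}, \quad \Sigma_j = k^{-1}\tau^2 I + C_{j,j} - C_{*,j}^T C_{*,*}^{-1} C_{*,j},$$

where $C_{*,*}$, $C_{*,j}$, and $C_{j,j}$ are the covariance matrices defined earlier. Let $A$ and $\Sigma$ represent the full data versions of $A_j$ and $\Sigma_j$. For any $b \in \mathbb{R}^l$, we define two norms

$$\|b\|_{S_j} = \left( \frac{1}{m}\, b^T A_j^T \Sigma_j^{-1} A_j b \right)^{1/2}, \quad \|b\|_{S} = \left( \frac{1}{n}\, b^T A^T \Sigma^{-1} A b \right)^{1/2}.$$

Equivalently, for a generic function $b(\cdot)$ defined on the domain $D$, if $b$ in these definitions is the functional evaluation of $b(\cdot)$ at the testing locations $S^*$, then $\|\cdot\|_{S_j}$ and $\|\cdot\|_S$ can be viewed as two Banach norms on the space of functions with the domain $D$.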
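As a rough numerical illustration of the quantities above, the following Python sketch builds $A_j$ and $\Sigma_j$ for a single subset and evaluates the subset norm of a vector $b$. It assumes a one-dimensional domain, a squared exponential covariance, and the reading $A_j = C_{j,*} C_{*,*}^{-1}$, $\Sigma_j = k^{-1}\tau^2 I + C_{j,j} - C_{j,*} C_{*,*}^{-1} C_{j,*}^T$, and $\|b\|_{S_j} = (m^{-1} b^T A_j^T \Sigma_j^{-1} A_j b)^{1/2}$. The function names, kernel parameters, and locations are hypothetical and chosen only for illustration.

```python
import numpy as np

def sq_exp_kernel(a, b, sigma2=1.0, phi=20.0):
    """Squared exponential covariance sigma2 * exp(-phi * (a - b)^2) for 1-D locations."""
    diff = a[:, None] - b[None, :]
    return sigma2 * np.exp(-phi * diff ** 2)

def subset_norm(b, s_j, s_star, k, tau2):
    """Subset norm (b^T A_j^T Sigma_j^{-1} A_j b / m)^{1/2} under the assumed reading."""
    m = len(s_j)
    C_ss = sq_exp_kernel(s_star, s_star)   # l x l covariance at the testing locations
    C_js = sq_exp_kernel(s_j, s_star)      # m x l cross-covariance
    C_jj = sq_exp_kernel(s_j, s_j)         # m x m covariance within the subset
    # A_j = C_js @ inv(C_ss), computed with a linear solve instead of an explicit inverse
    A_j = np.linalg.solve(C_ss, C_js.T).T
    # Noise variance tau2 / k reflects the stochastic approximation in each subset
    Sigma_j = (tau2 / k) * np.eye(m) + C_jj - A_j @ C_js.T
    Ab = A_j @ b
    return float(np.sqrt(Ab @ np.linalg.solve(Sigma_j, Ab) / m))

rng = np.random.default_rng(0)
s_star = np.linspace(0.1, 0.9, 5)          # l = 5 hypothetical testing locations
s_j = rng.uniform(0.0, 1.0, size=40)       # one subset with m = 40 locations
b = np.sin(2 * np.pi * s_star)             # an arbitrary vector in R^l
norm_b = subset_norm(b, s_j, s_star, k=10, tau2=0.1)
print(norm_b)
```

The full data norm would follow the same pattern with all $n$ locations and the full data versions of $A_j$ and $\Sigma_j$.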
Based on the definitions and notation introduced previously, we make the following five assumptions for deriving the general convergence rates of the DISK posterior:

A.1 (Compact domain) The spatial domain $D$ is a compact space in the $\|\cdot\|_2$ metric.

A.2 (Norm equivalence) The partitions $S_1, \ldots, S_k$ of $S$ are such that there exist universal positive constants $H_l < 1 < H_u$ independent of $j$ such that $H_l \|\cdot\|_S \le \|\cdot\|_{S_j} \le H_u \|\cdot\|_S$ for $j = 1, \ldots, k$.

A.3 (Metric entropy) Suppose that $\epsilon_m$ is a positive sequence that satisfies (i) $m\epsilon_m^2 \ge 1$ for all $m \ge 1$; (ii) $\epsilon_m \to 0$ as $m \to \infty$; (iii) with a slight abuse of notation, for every $r > 1$, there is a set $F_r$ such that for all $m \ge 1$, $D(\epsilon_m, F_r, \|\cdot\|_S) \le e^{m \epsilon_m^2 H_l^2 r^2}$ and $\Pi(F_r) \ge 1 - e^{-2 m \epsilon_m^2 r^2}$, where $D(\epsilon, F_r, \|\cdot\|_S)$ is the minimum number of $\|\cdot\|_S$-balls of radius $\epsilon$ that cover $F_r$.

A.4 (Prior thickness) For the $\epsilon_m$ sequence in Assumption A.3 and for all $m \ge 1$, the prior assigns positive mass to any small neighborhood around $w_0^*$: $\Pi(w^* : \|w^* - w_0^*\|_S \le \epsilon_m) \ge e^{-m H_u^2 \epsilon_m^2}$.

A.5 The metrics $\|\cdot\|_2$ and $\|\cdot\|_S$ are equivalent in that $C_l \|\cdot\|_S \le \|\cdot\|_2 \le C_u \|\cdot\|_S$ for some positive universal constants $C_l$ and $C_u$.

Assumption A.1 is common to all models based on GP priors. Assumption A.2 specifies a technical condition on the partitioning scheme so that the realizations of the GP observed in the $j$th subset are similar to those in the full data, where such similarity is described in terms of the norms $\|\cdot\|_{S_j}$ and $\|\cdot\|_S$. Assumption A.3 regulates the complexity of the sequence of sets $F_r$ in terms of $\|\cdot\|_S$-metric entropy and specifies a condition on the probability assigned by the GP prior to $F_r$, ensuring that the prior probability of $F_r$ under the Gaussian measure induced by the GP prior increases with increasing $\|\cdot\|_S$-metric entropy of $F_r$. The subscript $r$ here should not be confused with the number of knots in MPP. Assumption A.4 says that the GP prior assigns positive probability to an arbitrarily small $\|\cdot\|_S$-neighborhood around the true parameter $w_0^*$.
Assumption A.5 is a technical condition that is used in upper bounding the Bayes $L_2$-risk of the Wasserstein barycenter in the estimation of $w_0^*$ when Bayes $L_2$-risk upper bounds are available for the subset posterior distributions. We define the Bayes $L_2$-risk in the estimation of $w_0^*$ using the full data posterior as

$$E_{S,S^*}\{E(\|w^* - w_0^*\|_2^2 \mid y)\} = E_{S,S^*} \int \|w^* - w_0^*\|_2^2 \, d\Pi_n(w^* \mid y),$$

where $E_{S,S^*}$ is the expectation under the true space varying function $w_0$ with respect to the density of $y$ conditional on $S, S^*$ in (9). The decay rate of this risk is known under assumptions that are similar to A.1, A.3, and A.4 and are obtained by replacing $m$ by $n$ [73]. We present two theorems below that describe the asymptotic properties of the DISK posterior measured in this Bayes $L_2$-risk, for the spatial model specified in (9) with a prior $\Pi$ on $w$ that satisfies Assumptions A.1-A.5. The first theorem describes the Bayes $L_2$-risk of each subset posterior distribution in our case and is based on a proposition in [73].
Theorem 3.1 If Assumptions A.1-A.5 hold for the $j$th subset posterior $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$, then there exists a positive constant $c(H_l)$ that only depends on $H_l$, such that as $m \to \infty$,

$$E_{S,S^*}\{E(\|w^* - w_0^*\|_2^2 \mid y_j)\} \le C_u^2\, c(H_l)\, \epsilon_m^2,$$

where $E_{S,S^*}$ is the expectation under the true space varying function $w_0$ with respect to the subset $y_j$ of size $m$ conditional on $S, S^*$.

The proof of this theorem is in the supplementary material, along with other proofs in this section. Theorem 3.1 holds for any $\epsilon_m$ sequence that satisfies Assumptions A.3 and A.4. Explicit expressions for $\epsilon_m$ are available if $w_0$ and $F_r$ are restricted to classes of functions with known regularity and $\Pi$ is a GP prior with the Matérn or squared exponential covariance kernel. For any $a, b > 0$, let $C^a[0,1]^d$ and $H^b[0,1]^d$ be the Hölder and Sobolev spaces of functions on $[0,1]^d$ with regularity index $a$ and $b$, respectively. Define $D = [0,1]^d$ and $C_\alpha$ to be the Matérn kernel with

$$C_\alpha(s, s') = \frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)} \left( \frac{\|s - s'\|}{\phi} \right)^{\nu} K_\nu\!\left( \frac{\|s - s'\|}{\phi} \right), \quad s, s' \in D,$$

where $K_\nu$ is a modified Bessel function of the second kind with order $\nu$ that controls the process smoothness, $\phi$ is a lengthscale parameter that controls the decay in spatial correlation, and $\Gamma$ is the Gamma function. If $w_0 \in C^b[0,1]^d \cap H^b[0,1]^d$ and $F_r \subset C^b[0,1]^d \cap H^b[0,1]^d$ for $r > 1$, then $\epsilon_m = m^{-\min(\nu, b)/(2\nu + d)}$ provided $b > 0$ and $\min(\nu, b) > d/2$. Similarly, if $D = [0,1]^d$, $C_\alpha$ is the squared exponential kernel with $C_\alpha(s, s') = \sigma^2 e^{-\phi \|s - s'\|^2}$, and $w_0$ is an analytic function on $D$, then $\epsilon_m = (\log m)^{1/2} / m^{1/2}$; see Theorem 5 and the related results in [73]. The squared exponential kernel is not relevant to spatial statistics, but we provide this additional result for a more general audience, especially in machine learning.

The second theorem below provides an upper bound on the Bayes $L_2$-risk of the DISK posterior, $\bar\Pi(\cdot \mid y_1, \ldots, y_k)$, using the upper bounds on the $k$ subset posterior distributions.

Theorem 3.2 If Assumptions A.1-
A.5 hold for all subset posteriors $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$, then as $m \to \infty$,

$$E_{S,S^*}\{E(\|w^* - w_0^*\|_2^2 \mid y_1, \ldots, y_k)\} = E_{S,S^*} \int \|w^* - w_0^*\|_2^2 \, d\bar\Pi(w^* \mid y_1, \ldots, y_k) \le C_u^2\, c(H_l)\, \epsilon_m^2,$$

where $E_{S,S^*}$ is the expectation under the true space varying function $w_0$ with respect to the full dataset of size $n$ conditional on $S, S^*$.

The $\epsilon_m$ sequence here is the same as in Theorem 3.1. If $k \asymp \log^a n$ for some $a > 0$, then $m \asymp n / \log^a n$. With this choice of $m, k$, our previous discussion implies that $\epsilon_m = n^{-c} \log^{ac} n$ for the Matérn kernel, where $c = \min(\nu, b)/(2\nu + d)$, and $\epsilon_m = (\log n)^{a/2 + 1/2} / n^{1/2}$ for the squared exponential kernel. Both these rates are minimax optimal up to log factors in the estimation of $w_0$; see [73] for proofs.

In applications, we are also interested in estimating functions of $w_0^*$. An attractive property of the DISK posterior is that its theoretical guarantees extend to a large class of functions of $w_0^*$. Let $f$ be any function that maps $w^*$ to $f(w^*)$ and is bounded almost linearly in the $\|\cdot\|_2$ metric. Then, we have the following corollary from a direct application of Lemma 8.5 in [8].
Corollary 3.3 Suppose that Assumptions A.1-A.5 hold for all subset posteriors $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$. Let $f$ be a continuous function that maps $\mathbb{R}^l$ to $\mathbb{R}^l$ and satisfies $\|f(w^*)\|_2 \le C_f (1 + \|w^* - w_0^*\|_2)$ for any $w^* \in \mathbb{R}^l$, where $C_f > 0$ is a fixed constant. Let $f\#\bar\Pi(\cdot \mid y_1, \ldots, y_k)$ represent the DISK posterior of $f(w^*)$; then as $m \to \infty$,

$$\int \|f - f(w_0^*)\|_2^2 \, d\, f\#\bar\Pi(f \mid y_1, \ldots, y_k) = O_p(\epsilon_m^2),$$

where $O_p$ is in the probability measure under the true space varying function $w_0$ with respect to the full dataset of size $n$ conditional on $S, S^*$.

3.6 Bayes $L_2$-risk of DISK: Bias-variance tradeoff and the choice of $k$

While Section 3.5 describes asymptotic optimality results for the DISK posterior when $n, m \to \infty$, a common problem in applications is the choice of $k$ for a large $n$. The risk for the DISK posterior derived in Theorem 3.2 and Corollary 3.3 is applicable to any prior distribution that satisfies Assumptions A.3 and A.4. If $k \asymp \log^a n$ for some $a > 0$, then the DISK posterior gives near minimax optimal performance in the estimation of $w_0$ for the two covariance kernels. In practice, however, we want to choose a $k$ that is much larger than $\log^a n$ due to the abundance of computational resources. If the number of subsets $k$ is very small, then the biases in the subset posterior distributions are small due to a large $m$, but the variance of the DISK posterior is large due to the small $k$. In contrast, if $k$ is very large, then the biases in the subset posterior distributions are large due to a small subset size $m$, but the variance of the DISK posterior can be small due to the large $k$. An optimal choice of $k$ balances the bias-variance tradeoff and minimizes the risk of the DISK posterior.

We introduce some definitions used in stating the results in this section. Let $P_s$ be a probability distribution over $D$ and $L_2(P_s)$ be the $L_2$ space under $P_s$. The inner product in $L_2(P_s)$ is defined as $\langle f, g \rangle_{L_2(P_s)} = E_{P_s}(fg)$ for any $f, g \in L_2(P_s)$, and $\{\phi_i(s) : i = 1, 2, \ldots\}$ is an orthonormal basis with respect to $P_s$.
Assume that the kernel has the series expansion $C_\alpha(s, s') = \sum_{i=1}^{\infty} \mu_i \phi_i(s) \phi_i(s')$ with respect to $P_s$ for any $s, s' \in D$, where $\mu_1 \ge \mu_2 \ge \ldots$ are the eigenvalues of $C_\alpha$. The trace of the kernel $C_\alpha$ is defined as $\mathrm{tr}(C_\alpha) = \sum_{i=1}^{\infty} \mu_i$. Any $f \in L_2(P_s)$ has the series expansion $f(s) = \sum_{i=1}^{\infty} \theta_i \phi_i(s)$, where $\theta_i = \langle f, \phi_i \rangle_{L_2(P_s)}$. The reproducing kernel Hilbert space (RKHS) $H$ attached to $C_\alpha$ is the space of all functions $f \in L_2(P_s)$ such that the $H$-norm $\|f\|_H^2 = \sum_{i=1}^{\infty} \theta_i^2 / \mu_i < \infty$. The RKHS $H$ is the completion of the linear space of functions defined as $\sum_{i=1}^{I} a_i C_\alpha(s_i, \cdot)$, where $I$ is a positive integer, $s_i \in D$, and $a_i \in \mathbb{R}$ ($i = 1, \ldots, I$); see [74] for greater details.

Simplifying our setup in Section 3.5, we consider a random design scheme with the observed locations $S = \{s_1, \ldots, s_n\}$ and $S^* = \{s^*\}$, where $s_1, \ldots, s_n, s^*$ are mutually independent and follow $P_s$. The assumptions we impose below are used to derive analytic bounds for the bias and variance terms associated with the Bayes $L_2$-risk, and they are stronger than those in Section 3.5.

B.1 (RKHS) The true function $w_0$ is an element of the RKHS $H$ attached to the kernel $C_\alpha$.

B.2 (Trace class kernel) $\mathrm{tr}(C_\alpha) < \infty$.
B.3 (Moment condition) There are positive constants $\rho$ and, with a slight abuse of notation, $r$ such that $E_{P_s}\{\phi_i^{2r}(s)\} \le \rho^{2r}$ for every $i = 1, 2, \ldots, \infty$, and $\mathrm{var}\{\epsilon(s)\} \le \tau^2 < \infty$ for any $s \in D$.

Assumption B.1 is a stronger assumption than Assumption A.4 in Section 3.5. Assumption B.1 is not required in Section 3.5 because the DISK posterior can learn any continuous $w_0$ for a large class of GP priors as $n \to \infty$, even if $w_0 \notin H$. In general, the RKHS $H$ can be a much smaller space relative to the support of the GP prior, in the sense that the GP prior can assign zero probability to $H$ and positive probability to any neighborhood of arbitrary size around any continuous $w_0$. While we use Assumption B.1 mainly for technical simplicity in this section, it can possibly be relaxed by considering sieves with increasing Hilbert norms; see, for example, the relaxed assumption and the corresponding theorem in Zhang et al. [84]. In Assumption B.2, $\mathrm{tr}(C_\alpha)$ measures the size of the covariance kernel and imposes conditions on the regularity of the functions that the DISK posterior can learn. Assumption B.3 controls the error in approximating $C_\alpha(s, s')$ by a finite sum, and the superscript $r$ here should not be confused with the number of knots in MPP or the $r$ in Assumption A.3. Our results are valid for any error distribution that guarantees $\mathrm{var}\{\epsilon(s)\} \le \tau^2$ for every $s \in D$, and this condition is trivially satisfied in (9).

We examine the Bayes $L_2$-risk of the DISK posterior for estimating $w_0$ in (9). Under the setup of (9), let $E_{s^*}$, $E_0$, $E_S$, and $E_{0|S}$ respectively be the expectations with respect to the distributions of $s^*$, $(S, y)$, $S$, and $y$ given $S$. If $\bar w(s^*)$ is a random variable that follows the DISK posterior for estimating $w_0(s^*)$, then $\bar w(s^*)$ has the density $N(\bar m, \bar v)$, where

$$\bar m = \frac{1}{k} \sum_{j=1}^{k} c_{j,*}^T \left( C_{j,j} + \frac{\tau^2}{k} I \right)^{-1} y_j, \quad \bar v^{1/2} = \frac{1}{k} \sum_{j=1}^{k} v_j^{1/2}, \quad v_j = c_{*,*} - c_{j,*}^T \left( C_{j,j} + \frac{\tau^2}{k} I \right)^{-1} c_{j,*}, \qquad (3)$$

$c_{*,*} = \mathrm{cov}\{w(s^*), w(s^*)\}$, and $c_{j,*}^T = [\mathrm{cov}\{w(s_{j1}), w(s^*)\}, \ldots, \mathrm{cov}\{w(s_{jm}), w(s^*)\}]$.
The Bayes L - risk of the DISK posterior in estimating w is E [ Es ws w s } ], and it is decomposed into squared bias, variance of mean of the DISK posterior, and variance of the DISK posterior terms as bias = E s E S c T k L +τ I w w s }, var mean = τ E s E S c T k L +τ I c }, vardisk = E s E S v, 4 where c T = c T,,..., ct k,, w = w s,..., w s k } =,..., k, w T = w,..., w k }, and L is a block-diagonal matrix with C,,..., C k,k along the diagonal. The next theorem describes the asymptotic behavior of each of the three terms in 4. Theorem 3.4 If Assumptions B. B.3 hold, then Bayes L risk = E S E S E s ws w s } = bias + var mean + var DISK, [ bias 8τ n w H + w H inf 8n Abm, d, rρ ρ 4 trc α trcα d γ τ } r ] n + µ, d N m τ 6
17 var mean n + 4 w H k τ k var DISK τ τ n γ n inf d N [ n w H + τ n γ [ + inf d N µ d+ + n τ ρ 4 trc α trcα d + τ, n trc d α + trc α where N is the set of all positive integers, A is a positive constant, maxr, maxr, log d bm, d, r = max log d,, γa = i= Abm, d, rρ γ τ n m } r ] + Abm, d, rρ γ τ } r ] n, 5 m m / /r µ i µ i + a for any a >, trcd α = i=d+ Theorem 3.4 is based on arguments similar to Theorem in Zhang et al. [84], which has derived the risk bounds for the frequentist divide-and-conquer estimator in kernel ridge regression. We however are considering the divide-and-conquer scheme in the Bayesian context, so our risk bound in Theorem 3.4 involves two variance terms, including the DISK posterior variance term var DISK, which has not been considered by Zhang et al. [84] and other divide-and-conquer literature before due to their interests in frequentist point estimation of w. The function γa measures the effective dimensionality of C α with respect to L P s [83]. The function trc d α describes the tail behavior of the eigenvalues of C α [84]. The upper bound for bias and var mean are the DISK analogues of the upper bounds in Lemma 6 and Lemma 7, respectively, of Zhang et al. [84]. From the risk bounds in Theorem 3.4, one can see that the first and second terms in the upper bound for var DISK are dominated by the last and second terms in the upper bounds for var mean and bias, respectively. Therefore, if we use the DISK posterior mean or a random draw from the DISK posterior to estimate w, we will observe the bias-variance tradeoff phenomenon similar to the one described in [84]: as k increases, the squared bias of the estimate increases, while the variation between the subset estimates decreases. Theoretically, this implies the existence of an optimal k such that the Bayes L -risk of the DISK posterior decreases for k k and increases for k > k. Our empirical results in Section 4 demonstrate this bias-variance tradeoff. 
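The pointwise DISK posterior $N(\bar m, \bar v)$ used in this section can be sketched numerically. The Python code below is only an illustration under stated assumptions: a one-dimensional domain, a squared exponential covariance with hypothetical parameters, a random partition into $k$ subsets, and the reading $\bar m = k^{-1} \sum_j c_{j,*}^T (C_{j,j} + k^{-1}\tau^2 I)^{-1} y_j$, $\bar v^{1/2} = k^{-1} \sum_j v_j^{1/2}$, with $v_j = c_{*,*} - c_{j,*}^T (C_{j,j} + k^{-1}\tau^2 I)^{-1} c_{j,*}$. The function and variable names are ours and do not come from the authors' software.

```python
import numpy as np

def sq_exp_kernel(a, b, sigma2=1.0, phi=20.0):
    """Squared exponential covariance sigma2 * exp(-phi * (a - b)^2) for 1-D locations."""
    diff = a[:, None] - b[None, :]
    return sigma2 * np.exp(-phi * diff ** 2)

def disk_pointwise(s, y, s_star, k, tau2, rng):
    """Combine k subset posteriors of w(s_star) into N(m_bar, v_bar)."""
    star = np.array([s_star])
    c_ss = sq_exp_kernel(star, star)[0, 0]                # prior variance at s_star
    subsets = np.array_split(rng.permutation(len(s)), k)  # random partition of the data
    means, sds = [], []
    for idx in subsets:
        C_jj = sq_exp_kernel(s[idx], s[idx])
        c_j = sq_exp_kernel(s[idx], star)[:, 0]
        # Stochastic approximation: each subset uses noise variance tau2 / k
        K = C_jj + (tau2 / k) * np.eye(len(idx))
        sol = np.linalg.solve(K, np.column_stack([y[idx], c_j]))
        means.append(c_j @ sol[:, 0])                     # subset posterior mean
        v_j = c_ss - c_j @ sol[:, 1]                      # subset posterior variance
        sds.append(np.sqrt(max(v_j, 0.0)))                # guard against rounding error
    m_bar = float(np.mean(means))                         # average of subset means
    v_bar = float(np.mean(sds) ** 2)                      # square of the average subset sd
    return m_bar, v_bar

# Hypothetical synthetic data: w_0(s) = sin(2 * pi * s) observed with noise variance tau2
rng = np.random.default_rng(0)
n, tau2 = 600, 0.01
s = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * s) + rng.normal(0.0, np.sqrt(tau2), size=n)

results = {}
for k in (1, 6, 30):
    results[k] = disk_pointwise(s, y, s_star=0.3, k=k, tau2=tau2, rng=rng)
    print(k, results[k])
```

Varying `k` in the loop gives a quick empirical feel for how the combined mean and variance respond to the number of subsets. With `k = 1` the sketch reduces to the usual GP posterior at the test location, and it makes no attempt to reproduce the experiments reported in the paper.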
For three types of commonly used kernels, the next theorem provides conditions on $k$ such that the Bayes $L_2$-risk in Theorem 3.4 is nearly minimax optimal. The covariance kernel $C_\alpha$ is a degenerate kernel of rank $d$ if there is some constant positive integer $d$ such that $\mu_1 \ge \mu_2 \ge \ldots \ge \mu_d > 0$ and $\mu_{d+1} = \mu_{d+2} = \ldots = \mu_\infty = 0$. The covariance kernels in the subset of regressors approximation [55] and the predictive process [6] are degenerate, with their ranks equaling the number of selected regressors and knots, respectively. The squared exponential kernel is very popular in machine learning. Its RKHS belongs to the class of RKHSs of kernels with exponentially decaying eigenvalues. Similarly, the class of RKHSs of kernels with polynomially decaying eigenvalues includes the Sobolev spaces with different orders of smoothness and the RKHS of the Matérn kernel. This kernel is most
18 relevant for spatial applications, but we provide the other two results for a more general audience. Theorem 3.5 If Assumptions B. B.3 hold and r > 4 in Assumption B.3, then, as n, i if C α is a degenerate kernel of rank d and k cn r 4 r /log n r r for some constant c >, then the L -risk of DISK posterior satisfies E s E S E S ws w s } = O n ; ii if µ i c µ exp c µ i for some constants c µ >, c µ > and all i N, and for some constant c >, k cn r 4 3r r /log n r, then the L -risk of DISK posterior satisfies E s E S E S ws w s } = O log n/n ; and iii if µ i c µ i ν for some constants c µ >, ν > r r 4 k cn r 4ν r r ν /log n r w s } = O n ν ν. and all i N, and for some constant c >, r, then the L -risk of DISK posterior satisfies E s E S E S ws The rate of decay of the L -risks in i and ii are known to be minimax optimal [57, 84, 8], whereas the rate of decay of the L -risk in iii is slightly larger than the minimax optimal rate by a factor of n νν+ [84]. The main advantage of the DISK posterior over its non-bayesian counterparts is that it achieves optimal performance in the first two cases and a near optimal performance in the third case while being free of tuning parameter selection. In most applications D is also compact, so that r in Assumption B.3 can be taken as infinity. This implies that the upper bounds on k in i, ii, and iii reduce to k = On/ log n, k = On/ log 3 n, and k = On ν ν / log n, respectively. Theoretical results similar to Theorems 3.4 and 3.5 are well studied in frequentist divide-andconquer kernel ridge regression, but developing Bayesian analogues of these results remains an active area of research. Cheng and Shang have studied the theoretical properties of divide-and-conquer Bayesian nonparametric regression under various theoretical setups [, 64, 65]. Recently, Szabo and van Zanten [7] have explored the convergence rate of the L -risk and coverage in the classical Gaussian white noise model. 
Focusing on the practical issue, we have developed a general method and related sampling algorithms for extending any method for Bayesian nonparametric regression based on a GP prior to massive data settings using the divide-and-conquer technique. Theorem 3.4 describes the $L_2$-risk of our method, and Theorem 3.5 provides guidance on choosing the number of subsets. Under certain assumptions on $w_0$, our theoretical results reduce to those obtained in previous works; however, none of the previous works focuses on developing a computational method that is as widely applicable as DISK and is grounded in Bayesian asymptotic theory.

4 Experiments

4.1 Simulation setup

We compare DISK with its competitors on synthetic data based on its performance in learning the process parameters, interpolating the unobserved residual spatial surface, and predicting at new locations. This section presents two simulation studies. The first (Simulation 1) and second
More informationMonte Carlo Studies. The response in a Monte Carlo study is a random variable.
Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationSparse Linear Models (10/7/13)
STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationA Process over all Stationary Covariance Kernels
A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationBayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling
Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation
More informationAsymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands
Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department
More informationLinear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,
Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,
More informationHandbook of Spatial Statistics Chapter 2: Continuous Parameter Stochastic Process Theory by Gneiting and Guttorp
Handbook of Spatial Statistics Chapter 2: Continuous Parameter Stochastic Process Theory by Gneiting and Guttorp Marcela Alfaro Córdoba August 25, 2016 NCSU Department of Statistics Continuous Parameter
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationBayesian Linear Regression
Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective
More informationLong-Run Covariability
Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips
More informationBasics of Point-Referenced Data Models
Basics of Point-Referenced Data Models Basic tool is a spatial process, {Y (s), s D}, where D R r Chapter 2: Basics of Point-Referenced Data Models p. 1/45 Basics of Point-Referenced Data Models Basic
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationEffective Dimension and Generalization of Kernel Learning
Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance
More informationGaussian with mean ( µ ) and standard deviation ( σ)
Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationVCMC: Variational Consensus Monte Carlo
VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object
More informationSpatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields
Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields 1 Introduction Jo Eidsvik Department of Mathematical Sciences, NTNU, Norway. (joeid@math.ntnu.no) February
More information4 Bias-Variance for Ridge Regression (24 points)
Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,
More informationKernel Sequential Monte Carlo
Kernel Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) * equal contribution April 25, 2016 1 / 37 Section
More informationBayesian Linear Models
Bayesian Linear Models Sudipto Banerjee 1 and Andrew O. Finley 2 1 Department of Forestry & Department of Geography, Michigan State University, Lansing Michigan, U.S.A. 2 Biostatistics, School of Public
More informationABC methods for phase-type distributions with applications in insurance risk problems
ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon
More informationBasic Sampling Methods
Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution
More informationAn introduction to Bayesian statistics and model calibration and a host of related topics
An introduction to Bayesian statistics and model calibration and a host of related topics Derek Bingham Statistics and Actuarial Science Simon Fraser University Cast of thousands have participated in the
More informationMARKOV CHAIN MONTE CARLO
MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with
More informationSequential Monte Carlo Methods for Bayesian Computation
Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter
More informationDivide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates
: A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationIntroduction to Machine Learning. Lecture 2
Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationCOS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION
COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationRecurrent Latent Variable Networks for Session-Based Recommendation
Recurrent Latent Variable Networks for Session-Based Recommendation Panayiotis Christodoulou Cyprus University of Technology paa.christodoulou@edu.cut.ac.cy 27/8/2017 Panayiotis Christodoulou (C.U.T.)
More information