arxiv: v1 [stat.me] 28 Dec 2017

A Divide-and-Conquer Bayesian Approach to Large-Scale Kriging

Rajarshi Guhaniyogi (1), Cheng Li (2), Terrance D. Savitsky (3), and Sanvesh Srivastava (4)

arXiv:1712.09767v1 [stat.ME] 28 Dec 2017

(1) Department of Applied Mathematics and Statistics, UC Santa Cruz
(2) Department of Statistics and Applied Probability, National University of Singapore
(3) U. S. Bureau of Labor Statistics
(4) Department of Statistics and Actuarial Science, The University of Iowa

November, 8

Abstract

Flexible hierarchical Bayesian modeling of massive data is challenging due to poorly scaling computations in large sample size settings. This article is motivated by spatial process models for analyzing geostatistical data, which typically entail computations that become prohibitive as the number of spatial locations becomes large. We propose a three-step divide-and-conquer strategy within the Bayesian paradigm to achieve massive scalability for any spatial process model. We partition the data into a large number of subsets, apply a readily available Bayesian spatial process model on every subset in parallel, and optimally combine the posterior distributions estimated across all the subsets into a pseudo posterior distribution that conditions on the entire data. The combined pseudo posterior distribution is used for predicting the responses at arbitrary locations and for performing posterior inference on the model parameters and the residual spatial surface. We call this approach Distributed Kriging (DISK). It offers significant advantages in applications where the entire data are or can be stored across multiple machines. Under the standard theoretical setup, we show that if the number of subsets is not too large, then the Bayes $L_2$-risk of estimating the true residual spatial surface using the DISK posterior distribution decays to zero at a nearly optimal rate.
While DISK is a general approach to distributed nonparametric regression, we focus on its applications in spatial statistics and demonstrate its empirical performance using a stationary full-rank and a nonstationary low-rank model based on a Gaussian process (GP) prior. A variety of simulations and a geostatistical analysis of the Pacific Ocean sea surface temperature data validate our theoretical results.

Keywords: Distributed Bayesian inference; Gaussian process; modified predictive process; large and complex spatial data; Wasserstein distance; Wasserstein barycenter.

guhaniyogi@ucsc.edu, stalic@nus.edu.sg, savitsky.terrance@bls.gov, sanvesh-srivastava@uiowa.edu

1 Introduction

A fundamental challenge in geostatistics is the analysis of massive spatially-referenced data. Due to the recent influx of data with complex spatial associations, sophisticated spatial modeling has become an enormously active area of research; see, for example, [8, 3, 4]. Massive spatial data provide scientists with an unprecedented opportunity to hypothesize and test complex theories. This leads to the implementation of rather complex hierarchical GP-based models that are computationally intractable for large $n$, where $n$ is the number of spatial locations, due to the $O(n^3)$ computational cost and the $O(n^2)$ storage cost. This article develops a general distributed Bayesian approach, called Distributed Kriging (DISK), for boosting the scalability of any state-of-the-art spatial process model based on a GP prior to multiple folds using the divide-and-conquer technique.

The literature on process-based modeling of massive spatial data is large, so we only provide a selective review. Briefly, these methods seek dimension reduction by endowing the spatial covariance matrix with either a low-rank or a sparse structure. Low-rank structures represent the spatial surface using $r$ a priori chosen basis functions. They include fixed-rank kriging [], or predictive process and its variants [6, 4, 34, 5, 6]; see [78] and [4] for comprehensive reviews. The time complexity for fitting spatial models with a low-rank structure decreases from $O(n^3)$ to $O(nr^2)$ floating point operations (flops); however, practical considerations show that when $n$ is large, $r$ must grow roughly as $O(\sqrt{n})$ for accurate estimation, implying that $O(nr^2)$ flops are also expensive in low-rank structures. On the other hand, sparse structures rest on the intuition that the spatial correlation between two distantly located observations is nearly zero, so little information is lost by assuming conditional independence given the intermediate locations.
For example, covariance tapering [4, 9, 5, 63] uses compactly supported covariance functions to create sparse spatial covariance matrices that approximate the full covariance matrix. Alternatively, one could introduce sparsity in the inverse covariance (precision) matrix using conditional independence assumptions or composite likelihoods [75, 59, 7,, 5, 7, 36]. In the related literature on computer experiments, localized approximations of GP models have been proposed; see, for example, [3, 8, 54]. GP-based modeling using low-rank or sparse structures has also received significant attention in machine learning; see [55, ] for recent reviews. Some variants of dimension-reduction methods partition the spatial domain into sub-regions containing fewer spatial locations. Each sub-region is modeled using a GP, and the sub-region models are then hierarchically combined by borrowing information across the sub-regions. Examples include non-stationary models [4], multi-level and multi-resolution models [7, 5, 4, 35], and the Bayesian Treed GP models [3]. These models usually achieve scalability by assuming block independence at some level of the hierarchy, usually across sub-regions, but may lose scalability when they borrow across sub-regions. In an unrelated thread, [45] propose parameter estimation in the GP-based geostatistical model using resampling based on stochastic approximation. Besides being fully frequentist in nature, it is unclear how such an idea would be extended to enable analysis of more general nonstationary models with massive data.

The proposed DISK framework is a three-step approach for distributed Bayesian inference in any model based on a spatial process. First, we divide the $n$ spatial locations into $k$ subsets such that each subset has representative samples from all regions of the spatial domain. Second, we choose any spatial model and estimate the posterior distributions for inference and prediction in parallel across the $k$ subsets after raising the likelihood to the power of $k$ in each subset. The pseudo posterior distribution obtained after modifying the likelihood for each subset of data is referred to as the subset posterior. Since each subset posterior distribution conditions on a $1/k$-fraction of the full data, raising the likelihood to the power $k$ ensures that the variance of each subset posterior distribution is of the same order (as a function of $n$) as that of the full data posterior distribution. Third, the $k$ subset posterior distributions are combined into a single pseudo probability distribution, called the DISK pseudo posterior (henceforth, DISK posterior), that conditions on the full data and replaces the computationally expensive full data posterior distribution for the purpose of prediction and inference. Computationally, the main innovations are in the first and third steps, where general partitioning and combining schemes are unavailable in process-based modeling of spatial data. Theoretically, we provide guarantees on the rate of decay of the Bayes $L_2$-risk in estimating the true residual spatial surface using the DISK posterior as a function of $n$, $k$, and analytic properties of the true spatial surface. This involves two new upper bounds. First, an upper bound for the Bayes risk of the DISK posterior is developed assuming that each subset size tends to infinity.
Second, we provide an in-depth analysis of the bias-variance tradeoff in estimating the true spatial surface using the DISK posterior and develop upper bounds on $k$ as a function of $n$ that lead to near optimal performance as $n$ tends to infinity.

Motivated by large and complex data, there has been significant interest in adopting the divide-and-conquer technique for distributed Bayesian inference [5, 76, 77, 3, 6, 33, 48, 8]. DISK is significantly different from these as it is based on combining the collection of $k$ subset posterior distributions through their barycenter, a notion of geometric center that generalizes the Euclidean mean to a space of probability measures. There are recent approaches based on combining subset inferences, either through the posterior means and covariance matrices of parameters across the $k$ subsets [6] or by employing a product of experts [5, 7]. These approaches are intuitively appealing for GP regression, but lead to theoretically sub-optimal uncertainty quantification [7] and are less suited when a GP or its derivative is embedded in a more general hierarchical model; for example, GP-based classification [37]. Recent developments on distributed variational GP [6] are impractical for n > 7, and no theoretical results are available on the quality of approximation of the full posterior distribution. There are recent works on combining subset posterior distributions through their geometric centers, such as the mean or the median, but they are restricted to parametric models [46, 67, 44, 6, 47, 68]. Extensions to general nonparametric models, including those based on stochastic processes, are missing, except for some empirical results on nonparametric regression using GP [46, 67] and theoretical guarantees for both these methods in the Gaussian white noise model [7]. Combining subset posterior distributions of a stochastic process is challenging for two major reasons.
First, it requires estimation of a function, an infinite dimensional parameter.

Second, and most importantly, stochastic process models induce complex dependencies among observations that are challenging to capture with a combination of subset posterior distributions that are estimated independently without accounting for the inter-subset dependence.

Divide-and-conquer nonparametric regression, which includes kriging as a special case, has received significant attention lately in the optimization literature, though the Bayesian literature is relatively sparse. The bias-variance decomposition of the $L_2$-risk in the estimation of the true regression function in divide-and-conquer kernel ridge regression is known [84, 8]. Bayesian divide-and-conquer nonparametric regression has been mostly studied from the theoretical perspective [, 64, 65, 7]. Filling the methodological gap, the DISK framework provides a general approach to enhance the scalability of any process-based model for Bayesian nonparametric regression, including models based on spatial processes. For example, if the application of spatial process models is feasible for a subset of size $m$, then one can run them on $k$ subsets in parallel and DISK allows prediction and inference using $n = mk$ spatial locations. The values of $m$ and $k$ depend mainly upon the available computational resources and the model, but our theoretical results provide guidance on choosing $k$ depending on the analytic properties of the true spatial process.

For clarity of exposition, we illustrate the empirical performance of the DISK framework with the stationary GP and the modified predictive process (MPP) [4] priors. The MPP is a low-rank nonstationary GP prior that allows accurate modeling of spatial surfaces whose variability or smoothness changes with the location. MPP, like any low-rank model, faces computational bottlenecks when $n$ is large, and either computational efficiency or accuracy worsens when MPP is applied to even $10^4$ observations.
Our numerical results establish that DISK with MPP scales to $10^6$ observations without compromising on either computational efficiency or accuracy in inference and prediction. We expect this conclusion to hold for other popular structured GP priors.

The remainder of the manuscript is organized as follows. In Section 2 we outline a Bayesian hierarchical mixed model framework that incorporates models based on both the full-rank and the low-rank GP priors. Our DISK approach will work with posterior samples from such models. Section 3 develops the framework for DISK, discusses how to compute the DISK posterior distribution, and offers theoretical insights into DISK for general GPs and their approximations. A detailed simulation study followed by an analysis of the Pacific Ocean sea surface temperature data is presented in Section 4 to justify the use of DISK for real data. Finally, Section 5 discusses what DISK achieves and proposes a number of future directions to explore.

2 Hierarchical Bayesian inference for GP-based spatial models

Consider the standard univariate spatial regression model for the data observed at location $s$ in a compact spatial domain $D$,

$$y(s) = x(s)^T \beta + w(s) + \epsilon(s), \qquad (1)$$

where $y(s)$ is a univariate response at $s$, $x(s)$ is a $p \times 1$ predictor vector at $s$, $\beta$ is a $p \times 1$ vector of predictor coefficients, $w(s)$ is the realization of an unknown spatial function $w$ at $s$, and $\epsilon(s)$ is the realization of a white-noise process $\epsilon$ at $s$ and is independent of $w$. The Bayesian implementation of the model in (1) customarily assumes (a) that $\beta$ a priori follows a Gaussian distribution with mean $\mu_\beta$ and covariance matrix $\Sigma_\beta$ and (b) that $w$ and $\epsilon$ a priori follow mean-zero GPs with covariance functions $C_\alpha(s, s')$ and $D_\alpha(s, s')$ that model $\mathrm{cov}\{w(s), w(s')\}$ and $\mathrm{cov}\{\epsilon(s), \epsilon(s')\}$, respectively, where $\alpha$ are the process parameters indexing the two families of covariance functions and $s, s' \in D$; therefore, the model parameters are $\Omega = \{\alpha, \beta\}$. If $\beta = 0$ in (1), then we obtain the setup for Bayesian nonparametric regression using a GP prior, with $s$ as covariates and $y(s)$ as the response.

The training data consist of $n$ predictors and responses, denoted as $\{x(s_1), y(s_1)\}, \ldots, \{x(s_n), y(s_n)\}$, observed at $n$ spatial locations, denoted as $S = \{s_1, \ldots, s_n\}$. Standard Markov chain Monte Carlo (MCMC) algorithms exist for performing posterior inference on $\Omega$ and the values of $w$ at a given set of locations $S^* = \{s_1^*, \ldots, s_l^*\}$, where $S^* \cap S = \emptyset$, and for predicting $y(s^*)$ for any $s^* \in S^*$ [4]. Given $S$, the prior assumptions on $w$ and $\epsilon$ imply that $w^T = \{w(s_1), \ldots, w(s_n)\}$ and $\epsilon^T = \{\epsilon(s_1), \ldots, \epsilon(s_n)\}$ are independent and follow $N\{0, C(\alpha)\}$ and $N\{0, D(\alpha)\}$, respectively, where $N(m, V)$ denotes the density of a multivariate Gaussian distribution of appropriate dimension with mean $m$ and covariance matrix $V$, and the $(i,j)$th entries of $C(\alpha)$ and $D(\alpha)$ are $C_\alpha(s_i, s_j)$ and $D_\alpha(s_i, s_j)$, respectively. The hierarchy in (1) is completed by assuming that $\alpha$ a priori follows a distribution with density $\pi(\alpha)$. If $y = \{y(s_1), \ldots, y(s_n)\}^T$ is the $n \times 1$ response vector and $X = [x(s_1) : \cdots : x(s_n)]^T$ is the $n \times p$ matrix of predictors, where $p < n$, then the MCMC algorithm for sampling $\Omega$, $w_*^T = \{w(s_1^*), \ldots, w(s_l^*)\}$, and $y_*^T = \{y(s_1^*), \ldots, y(s_l^*)\}$ cycles through the following three steps until sufficient samples are drawn post convergence:
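1. Integrate over $w$ in (1) and (a) sample $\beta$ given $y$, $X$, and $\alpha$ from $N(m_\beta, V_\beta)$, where

$$V_\beta = \{X^T V(\alpha)^{-1} X + \Sigma_\beta^{-1}\}^{-1}, \quad m_\beta = V_\beta \{X^T V(\alpha)^{-1} y + \Sigma_\beta^{-1} \mu_\beta\}, \qquad (2)$$

and $V(\alpha) = C(\alpha) + D(\alpha)$; and (b) sample $\alpha$ given $y$, $X$, and $\beta$ using the Metropolis-Hastings algorithm with a normal random walk proposal.

2. Sample $w_*$ given $y$, $X$, $\alpha$, and $\beta$ from $N(m_*, V_*)$, where

$$V_* = C_{*,*}(\alpha) - C_*(\alpha) V(\alpha)^{-1} C_*(\alpha)^T, \quad m_* = C_*(\alpha) V(\alpha)^{-1} (y - X\beta), \qquad (3)$$

$C_*(\alpha)$ and $C_{*,*}(\alpha)$ are $l \times n$ and $l \times l$ matrices, respectively, and the $(i,j)$th entries of $C_{*,*}(\alpha)$ and $C_*(\alpha)$ are $C_\alpha(s_i^*, s_j^*)$ and $C_\alpha(s_i^*, s_j)$, respectively.

3. Sample $y_*$ given $\alpha$, $\beta$, and $w_*$ from $N\{X_* \beta + w_*, D_*(\alpha)\}$, where $X_*^T = [x(s_1^*) : \cdots : x(s_l^*)]$.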
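The Gaussian full conditionals in steps 1(a) and 2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the covariance matrices are assumed to be supplied by the caller, and linear solves are used in place of explicit inverses of $V(\alpha)$.

```python
import numpy as np

def beta_full_conditional(X, y, V, mu_beta, Sigma_beta):
    """Mean and covariance of beta given y, X, alpha, as in (2).

    V is the n x n matrix V(alpha) = C(alpha) + D(alpha).
    """
    Vinv_X = np.linalg.solve(V, X)            # V(alpha)^{-1} X
    Vinv_y = np.linalg.solve(V, y)            # V(alpha)^{-1} y
    Sigma_inv = np.linalg.inv(Sigma_beta)     # p x p, cheap since p < n
    V_beta = np.linalg.inv(X.T @ Vinv_X + Sigma_inv)
    m_beta = V_beta @ (X.T @ Vinv_y + Sigma_inv @ mu_beta)
    return m_beta, V_beta

def w_star_full_conditional(C_star, C_star_star, V, resid):
    """Mean and covariance of w_* given y, X, alpha, beta, as in (3).

    C_star is l x n, C_star_star is l x l, resid is y - X beta.
    """
    m_star = C_star @ np.linalg.solve(V, resid)
    V_star = C_star_star - C_star @ np.linalg.solve(V, C_star.T)
    return m_star, V_star
```

Both functions are dominated by the solves against the $n \times n$ matrix $V(\alpha)$, which is the source of the cubic cost in $n$ discussed in the text.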

Many Bayesian spatial models can be formulated in terms of (1) by assuming different forms of $C_\alpha(s, s')$ and $D_\alpha(s, s')$; see [4] and the supplementary material for details on the MCMC algorithm.

MCMC computations face a major computational bottleneck due to matrix inversions involving $C(\alpha)$. The steps 1(a), 1(b), and 2 for sampling $\Omega$ and $w_*$ involve inversion of $C(\alpha) + D(\alpha)$. Irrespective of the form of $D(\alpha)$, if no additional assumptions are made on the structure of $C(\alpha)$, then the three steps require $O(n^3)$ flops in computation and $O(n^2)$ memory units in storage in every MCMC iteration. Spatial models with this form of posterior computations are based on a full-rank GP prior. In practice, if $n \geq 10^4$, then posterior computations in a model based on a full-rank GP prior are infeasible due to numerical issues in matrix inversions involving an unstructured $C(\alpha)$.

This problem is solved by imposing additional structure on $C(\alpha)$. Our focus is on those methods that impose a sparse or low-rank structure on the covariance function of a GP prior [55, 4]. Every method in this class expresses the covariance function in terms of $r \ll n$ basis functions, in turn inducing a low-rank GP prior. We use the MPP as a representative example of this class of methods. Let $S^{(0)} = \{s_1^{(0)}, \ldots, s_r^{(0)}\}$ be a set of $r$ locations, known as the knots, which may or may not intersect with $S$. Let $c(s, S^{(0)}) = \{C_\alpha(s, s_1^{(0)}), \ldots, C_\alpha(s, s_r^{(0)})\}^T$ be an $r \times 1$ vector and $C(S^{(0)})$ be an $r \times r$ matrix whose $(i,j)$th entry is $C_\alpha(s_i^{(0)}, s_j^{(0)})$. Using $c(s_1, S^{(0)}), \ldots, c(s_n, S^{(0)})$ and $C(S^{(0)})$, define the diagonal matrix $\delta = \mathrm{diag}\{\delta(s_1), \ldots, \delta(s_n)\}$ with

$$\delta(s_i) = C_\alpha(s_i, s_i) - c^T(s_i, S^{(0)}) \, C(S^{(0)})^{-1} \, c(s_i, S^{(0)}), \quad i = 1, \ldots, n. \qquad (4)$$

Let $1(a = b) = 1$ if $a = b$ and $0$ otherwise. Then, the MPP is a GP with the covariance function

$$\tilde{C}_\alpha(s, s') = c^T(s, S^{(0)}) \, C(S^{(0)})^{-1} \, c(s', S^{(0)}) + \delta(s) \, 1(s = s'), \quad s, s' \in D, \qquad (5)$$

where $\tilde{C}_\alpha(s, s')$ depends on the covariance function of the parent GP and the selected $r$ knots, which define $C(S^{(0)})$, $c^T(s, S^{(0)})$, and $c^T(s', S^{(0)})$. We have used a tilde in (5) to distinguish the covariance function of a low-rank GP prior from that of its parent full-rank GP.
If $\tilde{C}(\alpha)$ is a matrix with $(i,j)$th entry $\tilde{C}_\alpha(s_i, s_j)$, then the posterior computations using MPP, a low-rank GP prior, replace $C(\alpha)$ by $\tilde{C}(\alpha)$ in steps 1(a), 1(b), and 2. The rank-$r$ structure imposed by $C(S^{(0)})$ implies that computing $\tilde{C}(\alpha)^{-1}$ requires $O(nr^2)$ flops using the Woodbury formula [38].

Spatial models based on a low-rank GP prior, including MPP, suffer from computational bottlenecks in massive data settings. The computational complexity of posterior computations for a rank-$r$ GP prior is $O(nr^2)$, which is linear in $n$; however, practical considerations often necessitate that $r = O(\sqrt{n})$ for accurate inference and prediction. This severely limits the computational advantages of low-rank GP priors, including MPP, especially in applications with large $n$ [5]. The next few sections develop our DISK framework, which is key in extending any GP-based spatial model to massive data using the divide-and-conquer technique without compromising either on the quality of inference and prediction or on the computational cost.
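To make the MPP construction in (4) and (5) and the Woodbury argument concrete, the sketch below builds the low-rank pieces and solves a linear system in $O(nr^2)$ flops. Several choices are assumptions made only for illustration: the parent covariance is taken to be exponential, $D(\alpha)$ is taken to be $\tau^2 I$, and the function names are hypothetical.

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=1.0):
    """Exponential covariance sigma2 * exp(-phi * ||a - b||) between rows of A and B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def mpp_parts(S, knots, sigma2=1.0, phi=1.0):
    """Return c (n x r), C(S0) (r x r), and the diagonal correction delta from (4)."""
    c = exp_cov(S, knots, sigma2, phi)
    C0 = exp_cov(knots, knots, sigma2, phi)
    # c(s_i)^T C(S0)^{-1} c(s_i) for every location s_i
    low_rank_var = np.einsum("ij,ij->i", c, np.linalg.solve(C0, c.T).T)
    delta = sigma2 - low_rank_var             # C_alpha(s_i, s_i) = sigma2 here
    return c, C0, delta

def mpp_solve(c, C0, delta, tau2, b):
    """Solve {C_tilde(alpha) + tau2 * I} x = b via the Woodbury formula.

    Only an r x r system is factorized, so the cost is O(n r^2).
    """
    a = delta + tau2                          # diagonal part of the matrix
    Ainv_b = b / a
    Ainv_c = c / a[:, None]
    small = C0 + c.T @ Ainv_c                 # r x r capacitance matrix
    return Ainv_b - Ainv_c @ np.linalg.solve(small, c.T @ Ainv_b)
```

The dense $n \times n$ matrix $\tilde{C}(\alpha)$ is never formed in `mpp_solve`; it is only needed if one wants to check the result against a direct solve on a small problem.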

3 Distributed Kriging

3.1 First step: partitioning of spatial locations

We partition the $n$ spatial locations into $k$ subsets. The value of $k$ depends on the chosen spatial model, and it is large enough to ensure efficient posterior computations on any subset. The default partitioning scheme is to randomly allocate the locations into $k$ subsets, but we specify a technical condition later which ensures that every subset has locations from all regions of the spatial domain. For theoretical and notational simplicity, we also assume that the $k$ subsets are non-overlapping and that every subset has $m$ spatial locations so that $n = mk$. Let $S_j$ be the set of spatial locations in subset $j$ ($j = 1, \ldots, k$). Our simplifying assumptions imply that $\cup_{j=1}^k S_j = S$, $S_j \cap S_{j'} = \emptyset$ for $j \neq j'$, and $S_j = \{s_{j1}, \ldots, s_{jm}\}$, where $s_{ji} = s_{i'}$ for some $s_{i'} \in S$ and for every $j = 1, \ldots, k$ and $i = 1, \ldots, m$. Denote the data in the $j$th partition as $\{y_j, X_j\}$ ($j = 1, \ldots, k$), where $y_j = \{y(s_{j1}), \ldots, y(s_{jm})\}^T$ is an $m \times 1$ vector and $X_j = [x(s_{j1}) : \cdots : x(s_{jm})]^T$ is an $m \times p$ matrix of predictors corresponding to the spatial locations in $S_j$ with $p < m$.

The univariate spatial regression model using either a full-rank or a low-rank GP prior for the data observed at any location $s_{ji} \in S_j \subset D$ is given by

$$y(s_{ji}) = x(s_{ji})^T \beta + w(s_{ji}) + \epsilon(s_{ji}), \quad i = 1, \ldots, m. \qquad (6)$$

Let $w_j^T = \{w(s_{j1}), \ldots, w(s_{jm})\}$ and $\epsilon_j^T = \{\epsilon(s_{j1}), \ldots, \epsilon(s_{jm})\}$ be the realizations of the GP $w$ and the white-noise process $\epsilon$, respectively, in the $j$th subset. After marginalizing over $w_j$ in the GP-based model for the $j$th subset, the likelihood of $\Omega = \{\alpha, \beta\}$ is given by

$$l_j(\Omega) = N\{y_j \mid X_j \beta, V_j(\alpha)\}, \qquad (7)$$

where $N(y \mid m, V)$ represents the multivariate normal density of $y$ with mean $m$ and covariance matrix $V$, $V_j(\alpha) = C_j(\alpha) + D_j(\alpha)$ and $\tilde{V}_j(\alpha) = \tilde{C}_j(\alpha) + D_j(\alpha)$ for full-rank and low-rank GP priors, respectively, and $C_j(\alpha)$, $\tilde{C}_j(\alpha)$, $D_j(\alpha)$ are obtained by extending the definitions of $C(\alpha)$, $\tilde{C}(\alpha)$, $D(\alpha)$ to the $j$th subset. In a model based on a full-rank or low-rank GP prior, the likelihood of $w_j$ given $y_j$, $X_j$, and $\Omega$ is

$$l_j(w_j) = N\{y_j - X_j \beta \mid w_j, D_j(\alpha)\}. \qquad (8)$$

The likelihoods in (7) and (8) are used for defining the posterior distributions for $\beta$, $\alpha$, $w_*$, $y_*$, called the $j$th subset posterior distributions, using a full-rank or a low-rank GP prior in subset $j$.
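The default random partitioning scheme described in this step can be sketched as follows. The helper name and the requirement that $k$ divide $n$ are assumptions that mirror the simplifying setup $n = mk$; the sketch does not enforce the technical condition on spatial coverage that is introduced later.

```python
import numpy as np

def random_partition(n, k, seed=0):
    """Randomly split location indices 0..n-1 into k disjoint subsets of size m = n / k.

    Returns a (k, m) integer array whose j-th row indexes the locations in S_j.
    """
    if n % k != 0:
        raise ValueError("this sketch assumes n = m * k with integer m")
    rng = np.random.default_rng(seed)
    return rng.permutation(n).reshape(k, n // k)
```

Given arrays `y` and `X` for the full data, the $j$th subset would then be `y[parts[j]]` and `X[parts[j]]`, where `parts = random_partition(n, k)`.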

3.2 Second step: sampling from subset posterior distributions

We define subset posterior distributions by modifying the likelihoods in (7) and (8). More precisely, the density of the $j$th subset posterior distribution of $\Omega$ is given by

$$\pi_m(\Omega \mid y_j) = \frac{\{l_j(\Omega)\}^k \pi(\Omega)}{\int \{l_j(\Omega)\}^k \pi(\Omega) \, d\Omega}, \qquad (9)$$

where we assume that $\int \{l_j(\Omega)\}^k \pi(\Omega) \, d\Omega < \infty$, and the subscript $m$ denotes that the density conditions on $m$ samples in the $j$th subset. The modification of the likelihood to yield the subset posterior density in (9) is called stochastic approximation [46]. Raising the likelihood to the power of $k$ is equivalent to replicating every $y(s_{ji})$ $k$ times ($i = 1, \ldots, m$). Thus, stochastic approximation accounts for the fact that the $j$th subset posterior distribution conditions on a $1/k$-fraction ($k = n/m$) of the full data and ensures that its variance is of the same order (as a function of $n$) as that of the full data posterior distribution. This is a common strategy adopted in divide-and-conquer based Bayesian inference in parametric models; see, for example, [44, 8] for recent applications.

With the proposed stochastic approximation in (9), the full conditional densities of the $j$th subset posterior distributions for prediction and inference follow from their full data counterparts. The $j$th full conditional densities of $\beta$ and $\alpha$ in the GP-based models are given by

$$\pi_m(\beta \mid y_j, \alpha) = \frac{\{l_j(\Omega)\}^k \pi(\beta)}{\int \{l_j(\Omega)\}^k \pi(\beta) \, d\beta}, \quad \pi_m(\alpha \mid y_j, \beta) = \frac{\{l_j(\Omega)\}^k \pi(\alpha)}{\int \{l_j(\Omega)\}^k \pi(\alpha) \, d\alpha}, \qquad (10)$$

where $\pi(\beta) = N(\mu_\beta, \Sigma_\beta)$, $\pi(\alpha)$ is the prior density of $\alpha$, and we assume that $\int \{l_j(\Omega)\}^k \pi(\beta) \, d\beta$ and $\int \{l_j(\Omega)\}^k \pi(\alpha) \, d\alpha$ are finite. The $j$th full conditional densities of $y_*$ and $w_*$ are calculated after modifying the likelihood of $w_j$ in (7) using stochastic approximation.
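For a Gaussian likelihood, raising the density to the power $k$ has the same effect, as a function of the mean, as dividing the noise covariance by $k$. The sketch below checks this numerically under the assumption $D_j(\alpha) = \tau^2 I$; the function names are hypothetical and the check is an illustration, not part of the sampling algorithm.

```python
import numpy as np

def gauss_logpdf_diag(y, mean, var):
    """Log density of independent Gaussians with a common variance var."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)

def log_power_likelihood(y, w, tau2, k):
    """log of {N(y | w, tau2 * I)}^k, the stochastic approximation of the likelihood."""
    return k * gauss_logpdf_diag(y, w, tau2)

def log_rescaled_likelihood(y, w, tau2, k):
    """log of N(y | w, (tau2 / k) * I), the same kernel in w with rescaled noise."""
    return gauss_logpdf_diag(y, w, tau2 / k)
```

The two functions differ only by a constant that does not involve `w`, so they lead to the same conditional distribution for the spatial effects once the prior is attached.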
Given $y_j$, $X_j$, and $\Omega$, straightforward calculation yields that the $j$th subset posterior predictive density of $w_*$ is $\pi_m(w_* \mid y_j, \Omega) = N(w_* \mid m_{*j}, V_{*j})$, where

$$V_{*j} = C_{*,*}(\alpha) - C_{*j}(\alpha) V_{j,k}(\alpha)^{-1} C_{*j}(\alpha)^T, \quad m_{*j} = C_{*j}(\alpha) V_{j,k}(\alpha)^{-1} (y_j - X_j \beta), \qquad (11)$$

where $V_{j,k}(\alpha) = C_j(\alpha) + k^{-1} D_j(\alpha)$ and $\tilde{V}_{j,k}(\alpha) = \tilde{C}_j(\alpha) + k^{-1} D_j(\alpha)$ for full-rank and low-rank GP priors, respectively, and $C_{*,*}(\alpha)$, $C_{*j}(\alpha)$ are $l \times l$, $l \times m$ matrices obtained by extending the definition in (3) to subset $j$ for full-rank and low-rank GP priors with covariance functions $C_\alpha(\cdot, \cdot)$ and $\tilde{C}_\alpha(\cdot, \cdot)$, respectively. We note that the stochastic approximation exponent, $k$, scales $D_j(\alpha)$ in $V_{j,k}(\alpha)$ so that the uncertainty in the subset posterior distributions is scaled to that of the full data posterior. The $j$th subset posterior predictive density of $y_*$ given the samples of $w_*$ and $\Omega$ in the $j$th subset is $N\{y_* \mid X_* \beta + w_*, D_*(\alpha)\}$. We employ the same three-step sampling algorithm as introduced earlier, specialized to subset $j$ ($j = 1, \ldots, k$), sampling $\{\beta_j, \alpha_j, y_{*j}, w_{*j}\}$ in each subset across multiple MCMC iterations; see the supplementary material for detailed derivations of subset posterior sampling algorithms for the full-rank and low-rank GP priors. The computational complexity of the $j$th subset posterior computations follows from their full data

counterparts if we replace $n$ by $m$. Specifically, the computational complexities for sampling a subset posterior distribution are $O(m^3)$ and $O(mr^2)$ flops if the model in (6) uses a full-rank or a low-rank GP prior, respectively. Since the subset posterior computations are performed in parallel across $k$ subsets, the computational complexities for obtaining $B$ post burn-in subset posterior samples from $k$ subsets are $O(km^3) = O(nm^2)$ and $O(kmr^2) = O(nr^2)$ flops in models based on full-rank and low-rank GP priors, respectively. The combination step of subset posteriors using the DISK framework outlined below is more widely applicable compared to other divide-and-conquer type approaches because it does not rely on any model- or data-specific assumptions, such as independence, except that every subset posterior distribution has a density with respect to the Lebesgue measure and has finite second moments.

3.3 Third step: combination of subset posterior distributions

3.3.1 Wasserstein distance and Wasserstein barycenter

The combination step relies on the Wasserstein barycenter, so we provide some background on this topic. Let $(\Theta, \rho)$ be a complete separable metric space and $\mathcal{P}(\Theta)$ be the space of all probability measures on $\Theta$. The Wasserstein space of order 2 is a set of probability distributions defined as

$$\mathcal{P}_2(\Theta) = \left\{ \mu \in \mathcal{P}(\Theta) : \int_\Theta \rho^2(\theta_0, \theta) \, \mu(d\theta) < \infty \right\}, \qquad (12)$$

where $\theta_0 \in \Theta$ is arbitrary and $\mathcal{P}_2(\Theta)$ does not depend on the choice of $\theta_0$. The Wasserstein distance of order 2, denoted as $W_2$, metrizes $\mathcal{P}_2(\Theta)$. Let $\mu, \nu$ be two probability measures in $\mathcal{P}_2(\Theta)$ and $\Pi(\mu, \nu)$ be the set of all probability measures on $\Theta \times \Theta$ with marginals $\mu$ and $\nu$; then the $W_2$ distance between $\mu$ and $\nu$ is defined as

$$W_2(\mu, \nu) = \left\{ \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Theta \times \Theta} \rho^2(x, y) \, d\pi(x, y) \right\}^{1/2}. \qquad (13)$$

Let $\nu_1, \ldots, \nu_k \in \mathcal{P}_2(\Theta)$; then the Wasserstein barycenter of $\nu_1, \ldots, \nu_k$ is defined as

$$\bar{\nu} = \operatorname*{argmin}_{\nu \in \mathcal{P}_2(\Theta)} \sum_{j=1}^k \frac{1}{k} W_2^2(\nu, \nu_j). \qquad (14)$$

It is known that $\bar{\nu} \in \mathcal{P}_2(\Theta)$ is the unique solution of a linear program [].

If $\nu_1, \ldots, \nu_k$ in (14) represent the $k$ subset posterior distributions, then the Wasserstein barycenter $\bar{\nu}$ provides a general notion of the mean of the $k$ subset posterior distributions. If the $k$ subset posteriors are combined using $\bar{\nu}$, then $\bar{\nu}$ has finite second moments, conditions on the full data, and does not rely on model-specific or data-specific assumptions. If $\nu_1, \ldots, \nu_k$ are analytically intractable but MCMC samples are available from them, then an empirical approximation of the Wasserstein barycenter can be estimated by solving a sparse linear program or by averaging empirical subset posterior quantiles [4, 9, 67, 44, 69].
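In one dimension, both the $W_2$ distance between empirical measures and their barycenter reduce to operations on sorted samples and quantiles. The sketch below illustrates this special case; the function names are hypothetical, and `w2_empirical_1d` assumes the two samples have the same number of atoms.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W_2 distance between two 1-D empirical measures with equally many atoms.

    In one dimension the optimal coupling matches sorted samples.
    """
    x, y = np.sort(x), np.sort(y)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return np.sqrt(np.mean((x - y) ** 2))

def barycenter_quantiles_1d(samples, grid):
    """Quantiles of the 1-D Wasserstein barycenter on a grid of probabilities.

    Each element of samples holds draws from one subset posterior; the
    barycenter's quantile function is the average of the subset quantile functions.
    """
    return np.mean([np.quantile(s, grid) for s in samples], axis=0)
```

For example, the barycenter of two samples that differ by a pure shift of 2 is the sample shifted by 1, and the $W_2$ distance between the two samples is 2.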

3.3.2 Combination scheme

In the DISK framework, we combine the collection of posterior samples from the $k$ subset posterior distributions for $\beta$, $\alpha$, $w_*$, and $y_*$ through their respective Wasserstein barycenters. Our combination scheme relies on the following key result. If $\theta$ is a one-dimensional functional of the full data posterior distribution and the $j$th subset posterior distribution for $\theta$ is denoted by $\nu_j$, then the $q$th quantile of the Wasserstein barycenter for $\theta$, denoted as $\bar{\nu}$, is estimated from the collection of $k$ subset posterior samples as

$$\hat{\bar{\nu}}^q = \frac{1}{k} \sum_{j=1}^k \hat{\nu}_j^q, \quad q = \xi, 2\xi, \ldots, 1 - \xi, \qquad (15)$$

where $\xi$ is the grid-size of the quantiles, $\hat{\nu}_j^q$ is the estimate of the $q$th quantile of $\nu_j$ obtained using MCMC samples from $\nu_j$, and $\hat{\bar{\nu}}^q$ is the estimate of the $q$th quantile of $\bar{\nu}$ [44].

The post burn-in samples from the $k$ subset posterior distributions are combined using (15). Let $\{\beta_{jb}\}_{b=1}^B$, $\{\alpha_{jb}\}_{b=1}^B$, $\{w_{*jb}\}_{b=1}^B$, $\{y_{*jb}\}_{b=1}^B$ ($j = 1, \ldots, k$) be the collection of $B$ post burn-in MCMC samples from the $k$ subsets; let $\beta_{jib}$ and $\alpha_{ji'b}$ be the $b$th post burn-in MCMC samples for the $i$th and $i'$th marginals of $\beta$ and $\alpha$ from their $j$th subset posteriors, where $i = 1, \ldots, p$, $i' = 1, \ldots, s$, $p$ is the dimension of $\beta$, and $s$ is the dimension of $\alpha$. If $\bar{\beta}$, $\bar{\alpha}$, $\bar{w}_*$, and $\bar{y}_*$ represent the random variables that follow the DISK posterior distributions for $\beta$, $\alpha$, $w_*$, and $y_*$, then MCMC-based estimates of the DISK posterior are obtained through their quantile estimates as follows:

1. Use (15) to estimate the $q$th quantiles of the $i$th and $i'$th marginals of the DISK posteriors for $\beta$ and $\alpha$, respectively, as

$$\hat{\bar{\beta}}_i^q = \frac{1}{k} \sum_{j=1}^k \hat{\beta}_{ji}^q, \quad i = 1, \ldots, p, \qquad \hat{\bar{\alpha}}_{i'}^q = \frac{1}{k} \sum_{j=1}^k \hat{\alpha}_{ji'}^q, \quad i' = 1, \ldots, s, \qquad (16)$$

where $\hat{\beta}_{ji}^q$ and $\hat{\alpha}_{ji'}^q$ are the estimates of the $q$th quantiles of the $i$th and $i'$th marginals of the $j$th subset posterior distributions for $\beta$ and $\alpha$ obtained from $\{\beta_{jb}\}_{b=1}^B$ and $\{\alpha_{jb}\}_{b=1}^B$.
2. Use (15) to estimate the pointwise $q$th quantiles of $\bar{w}(s_i^*)$ and $\bar{y}(s_i^*)$ as

$$\hat{\bar{w}}^q(s_i^*) = \frac{1}{k} \sum_{j=1}^k \hat{w}_j^q(s_i^*), \qquad \hat{\bar{y}}^q(s_i^*) = \frac{1}{k} \sum_{j=1}^k \hat{y}_j^q(s_i^*), \quad i = 1, \ldots, l, \qquad (17)$$

where $\hat{w}_j^q(s_i^*)$ and $\hat{y}_j^q(s_i^*)$ are the estimates of the $q$th quantiles of the $j$th subset posterior distributions for $w(s_i^*)$ and $y(s_i^*)$ obtained from $\{w_{*jb}\}_{b=1}^B$ and $\{y_{*jb}\}_{b=1}^B$.

A key feature of the DISK combination scheme is that, given the subset posterior samples, the combination step is agnostic to the choice of a model. Specifically, (15) remains the same for models based on a full-rank GP prior or a low-rank GP prior, such as MPP, given MCMC samples from the $k$ subset posterior distributions. Since the averaging over $k$ subsets takes $O(k)$ flops and $k < n$,

the total time for computing the empirical quantile estimates of the DISK posterior in inference or prediction requires $O(k) + O(m^3)$ and $O(k) + O(nm^2)$ flops in models based on full-rank and low-rank GP priors. Assuming that we have abundant computational resources, $k$ is chosen large enough so that $O(m^3)$ computations are feasible. This would enable applications of the DISK framework in models based on both full-rank and low-rank GP priors in massive $n$ settings.

3.4 Illustrative example: linear regression

We develop intuitions behind the DISK posterior in this section by studying its theoretical properties in multivariate linear regression. The spatial linear model in (1) reduces to a linear regression model with error variance 1 if $\epsilon(s)$ follows $N(0, 1)$ and there is no spatial effect; that is, $w(s) = 0$ for every $s \in D$. A flat prior on $\beta$ implies that the full data posterior distribution of $\beta$ given $y$, denoted as $\Pi_n$, has density $N\{\hat{\beta}, (X^T X)^{-1}\}$, where $\hat{\beta} = (X^T X)^{-1} X^T y$. Using notation similar to the earlier sections, the $j$th subset posterior distribution has density $N\{\hat{\beta}_j, k^{-1} (X_j^T X_j)^{-1}\}$, where $\hat{\beta}_j = (X_j^T X_j)^{-1} X_j^T y_j$ ($j = 1, \ldots, k$) and the factor of $k^{-1}$ in the posterior covariance matrix comes from stochastic approximation. The Wasserstein barycenter of the $k$ subset posterior distributions, denoted as $\bar{\Pi}_n$, is Gaussian with density $N(\bar{m}, \bar{V})$, where $\bar{m} = \frac{1}{k} \sum_{j=1}^k \hat{\beta}_j$ and $\bar{V}$ is such that

$$\frac{1}{k} \sum_{j=1}^k \left\{ \bar{V}^{1/2} \, k^{-1} (X_j^T X_j)^{-1} \, \bar{V}^{1/2} \right\}^{1/2} = \bar{V}; \qquad (18)$$

see Section 6.3 in []. The DISK framework replaces $N\{\hat{\beta}, (X^T X)^{-1}\}$ with $N(\bar{m}, \bar{V})$ for inference and predictions in massive data settings.

The covariance matrix $\bar{V}$ in (18) is estimated via fixed point iterations; however, $\bar{V}$ is analytically tractable when $X_1, \ldots, X_k$ have identical left and right singular vectors []. If $U$ and $V$ are $m \times p$ and $p \times p$ orthogonal matrices and $D_j = \mathrm{diag}(d_{j1}, \ldots, d_{jp})$ is a $p \times p$ diagonal matrix containing the singular values of $X_j$, then $X_j = U D_j V^T$, $X_j^T X_j = V D_j^2 V^T$, and $X^T X = V \left( \sum_{j=1}^k D_j^2 \right) V^T$. The mean vector and covariance matrix of $\bar{\Pi}_n$ reduce to

$$\bar{m} = \frac{1}{k} \sum_{j=1}^k V D_j^{-1} U^T y_j \quad \text{and} \quad \bar{V} = V \left( \frac{1}{k^{3/2}} \sum_{j=1}^k D_j^{-1} \right)^2 V^T.$$
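The closed form for the barycenter covariance under shared singular vectors can be checked numerically against the fixed-point equation (18). The sketch below is only an illustration: the dimensions, seed, and helper names are arbitrary assumptions, and the symmetric square root is computed through an eigendecomposition.

```python
import numpy as np

def spd_sqrt(A):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def barycenter_cov_closed_form(V, D_list):
    """V (k^{-3/2} sum_j D_j^{-1})^2 V^T, with each D_j given as a vector of singular values."""
    k = len(D_list)
    e = sum(1.0 / d for d in D_list) / k ** 1.5
    return (V * e ** 2) @ V.T

def fixed_point_residual(V_bar, subset_covs):
    """Largest absolute entry of the left side of (18) minus V_bar."""
    root = spd_sqrt(V_bar)
    avg = sum(spd_sqrt(root @ S @ root) for S in subset_covs) / len(subset_covs)
    return np.max(np.abs(avg - V_bar))

# Hypothetical design matrices sharing left and right singular vectors.
rng = np.random.default_rng(0)
m, p, k = 30, 3, 4
U, _ = np.linalg.qr(rng.normal(size=(m, p)))      # m x p, orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(p, p)))      # p x p orthogonal
D_list = [rng.uniform(1.0, 2.0, size=p) for _ in range(k)]
X_list = [(U * d) @ V.T for d in D_list]          # X_j = U D_j V^T

# Subset posterior covariances k^{-1} (X_j^T X_j)^{-1} after stochastic approximation.
subset_covs = [np.linalg.inv(X.T @ X) / k for X in X_list]
V_bar = barycenter_cov_closed_form(V, D_list)
residual = fixed_point_residual(V_bar, subset_covs)
```

A `residual` at the level of floating point rounding indicates that the closed form satisfies (18) for this configuration.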
Let $\beta_0$ be the true value of $\beta$ and $\bar{\beta}$, $\beta$ be the random variables with distributions $\bar{\Pi}_n$, $\Pi_n$. Assume that $c_D m \leq d_{jr}^2$ for some universal positive constant $c_D$ and for $j = 1, \ldots, k$ and $r = 1, \ldots, p$; then the Bayes $L_2$-risk of the DISK posterior in estimating $\beta_0$ is

$$E_{\beta_0} \|\bar{\beta} - \beta_0\|_2^2 = E_{\beta_0} \left\{ E\left( \|\bar{\beta} - \beta_0\|_2^2 \mid y \right) \right\} \leq \frac{2p}{c_D n} = O\left( E_{\beta_0} \|\beta - \beta_0\|_2^2 \right),$$

where $p$ is fixed, $\|\cdot\|_2$ is the Euclidean norm, and $E_{\beta_0}$ represents expectation with respect to the density of $y$ obtained by fixing $\beta = \beta_0$ in (1). This shows that the Bayes $L_2$-risk of the DISK posterior in estimating $\beta_0$ is upper bounded, up to a constant, by the Bayes $L_2$-risk of the full data posterior distribution.

A result like the one above is crucial in developing intuition on the DISK posterior as an alternative to the full data posterior, and such results are generally difficult to come by in the literature for the models presented in Section 2. In the next two sections, we develop theoretical guarantees for the

DISK posterior as an alternative to the full posterior in estimating the residual spatial surface.

3.5 Bayes $L_2$-risk of DISK: General convergence rates

Consider a special case when $w = 0$ in (1). The regression model reduces to a finite dimensional model with parameters $\Omega = \{\beta, \alpha\}$ and the DISK posterior reduces to the recently developed WASP method [68]. If $\Omega_0$ is the true value of $\Omega$, then it is known that when the data are independent, WASP converges in probability to a Dirac measure centered at $\Omega_0$ or to the full data posterior distribution of $\Omega$ at a near optimal rate, under certain assumptions as $n, m \to \infty$ [68, 44]; however, in models based on a spatial process, inference on the infinite dimensional true residual spatial surface, denoted as $w_0$, is of primary importance and no formal results are available in this regard. A notable exception is [7], which shows that combination using the Wasserstein barycenter has optimal Bayes risk and adapts to the smoothness of $w_0$ in the Gaussian white noise model. This model is a special case of (1) with additional smoothness assumptions on $w_0$.

We focus on providing theoretical guarantees for the DISK posterior distribution for estimating $w_0$ using the spatial regression model in (1) with a GP prior on $w$, including a low-rank GP prior such as MPP, as $n, m \to \infty$. Recall that $S^*$ is the set of $l$ reference locations in $D$ and $S^* \cap S = \emptyset$. Let $w_{0*} = \{w_0(s_1^*), \ldots, w_0(s_l^*)\}^T$ be the true residual spatial surface generating the data at the locations in $S^*$ and $w_* = \{w(s_1^*), \ldots, w(s_l^*)\}^T$ be the realization of the GP $w$ at the locations in $S^*$. Following the standard theoretical setup in [73], we assume that $\beta = 0$, $\alpha$ is known, and $D(\alpha) = \tau^2 I$, where $\tau^2$ is the known non-spatial error variance, so that (1) reduces to

$$y(s_i) = w(s_i) + \epsilon(s_i), \quad \epsilon(s_i) \sim N(0, \tau^2), \quad i = 1, \ldots, n, \quad w \sim GP\{0, C_\alpha(\cdot, \cdot)\}. \qquad (19)$$

This setup subsumes the low-rank GP priors, hence MPP, as MPP is a GP with covariance function $\tilde{C}_\alpha(\cdot, \cdot)$ in (5). We also assume that the data are generated using $y(s) = w_0(s) + \epsilon(s)$, $s \in D$. Adapting our discussion in Section 3
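for the models in (6) to the one in (9), we have that $y_j$ given $w_j$ follows the Gaussian distribution with density $N(w_j, k^{-1}\tau^2 I)$ after the stochastic approximation described earlier, and the GP prior on $w(\cdot)$ implies that, after integrating over $w_j$,
\[ y_j \mid w^* \sim N(A_j w^*, \Sigma_j), \quad A_j = C_{*,j}^T C_{*,*}^{-1}, \quad \Sigma_j = k^{-1}\tau^2 I + C_{j,j} - C_{*,j}^T C_{*,*}^{-1} C_{*,j}, \tag{10} \]
where $C_{j,j}$, $C_{*,j}$, and $C_{*,*}$ are the covariance blocks defined earlier. Let $A$ and $\Sigma$ represent the full data versions of $A_j$ and $\Sigma_j$ in (10). For any $b \in R^l$, we define two norms
\[ \|b\|_{S_j} = \Bigl( \frac{1}{m}\, b^T A_j^T \Sigma_j^{-1} A_j\, b \Bigr)^{1/2}, \quad \|b\|_{S} = \Bigl( \frac{1}{n}\, b^T A^T \Sigma^{-1} A\, b \Bigr)^{1/2}. \tag{11} \]
Equivalently, for a generic function $b(\cdot)$ defined on the domain $D$, if $b$ in (11) is the evaluation of $b(\cdot)$ at the testing locations $S^*$, then $\|\cdot\|_{S_j}$ and $\|\cdot\|_S$ defined in (11) can be viewed as two Banach norms on the space of functions with the domain $D$.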
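The marginalization step behind the distribution of $y_j$ given $w^*$ can be written out as a short derivation. The sketch below assumes the notation $C_{j,j} = \mathrm{cov}(w_j, w_j)$, $C_{*,j} = \mathrm{cov}(w^*, w_j)$, and $C_{*,*} = \mathrm{cov}(w^*, w^*)$ for the GP covariance blocks, which is our reading of the symbols above rather than something restated in this section. It uses only the standard formula for conditioning a multivariate Gaussian.

```latex
\begin{align*}
w_j \mid w^* &\sim N\bigl(C_{*,j}^{T} C_{*,*}^{-1} w^*,\; C_{j,j} - C_{*,j}^{T} C_{*,*}^{-1} C_{*,j}\bigr)
  && \text{(Gaussian conditioning under the GP prior)} \\
y_j \mid w_j &\sim N\bigl(w_j,\; k^{-1}\tau^{2} I\bigr)
  && \text{(subset likelihood raised to the power } k\text{)} \\
y_j \mid w^* &\sim N\bigl(A_j w^*,\; \Sigma_j\bigr), \quad
  A_j = C_{*,j}^{T} C_{*,*}^{-1}, \quad
  \Sigma_j = k^{-1}\tau^{2} I + C_{j,j} - C_{*,j}^{T} C_{*,*}^{-1} C_{*,j}
  && \text{(means compose, variances add)}
\end{align*}
```

Setting $k = 1$ and replacing the subset by the full data gives the full data versions $A$ and $\Sigma$ that enter the norm $\|\cdot\|_S$.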

Based on the definitions and notation introduced previously, we make the following five assumptions for deriving the general convergence rates of the DISK posterior:

A.1 (Compact domain) The spatial domain $D$ is a compact space in the $\|\cdot\|_2$ metric.

A.2 (Norm equivalence) The partitions $S_1, \ldots, S_k$ of $S$ are such that there exist universal positive constants $H_l < 1 < H_u$, independent of $j$, such that $H_l \|\cdot\|_S \le \|\cdot\|_{S_j} \le H_u \|\cdot\|_S$ for $j = 1, \ldots, k$.

A.3 (Metric entropy) Suppose that $\epsilon_m$ is a positive sequence that satisfies (i) $m\epsilon_m^2 \ge 1$ for all $m \ge 1$; (ii) $\epsilon_m \to 0$ as $m \to \infty$; (iii) with a slight abuse of notation, for every $r > 1$, there is a set $F_r$ such that for all $m \ge 1$, $D(\epsilon_m, F_r, \|\cdot\|_S) \le e^{m \epsilon_m^2 H_l^2 r^2}$ and $\Pi(F_r) \ge 1 - e^{-2 m \epsilon_m^2 r^2}$, where $D(\epsilon, F_r, \|\cdot\|_S)$ is the minimum number of $\|\cdot\|_S$-balls of radius $\epsilon$ that cover $F_r$.

A.4 (Prior thickness) For the $\epsilon_m$ sequence in Assumption A.3 and for all $m \ge 1$, the prior assigns positive mass to any small neighborhood around $w_0$: $\Pi(w : \|w - w_0\|_S \le \epsilon_m) \ge e^{-m H_u^2 \epsilon_m^2}$.

A.5 The metrics $\|\cdot\|_2$ and $\|\cdot\|_S$ are equivalent in that $C_l \|\cdot\|_S \le \|\cdot\|_2 \le C_u \|\cdot\|_S$ for some positive universal constants $C_l$ and $C_u$.

Assumption A.1 is common to all models based on GP priors. Assumption A.2 specifies a technical condition on the partitioning scheme so that the realizations of the GP observed in the $j$th subset are similar to those in the full data, where such similarity is described in terms of the norms $\|\cdot\|_{S_j}$ and $\|\cdot\|_S$. Assumption A.3 regulates the complexity of the sequence of sets $F_r$ in terms of $\|\cdot\|_S$-metric entropy and specifies a condition on the probability assigned by the GP prior to $F_r$, ensuring that the prior probability of $F_r$ under the Gaussian measure induced by the GP prior increases with increasing $\|\cdot\|_S$-metric entropy of $F_r$. The subscript $r$ here should not be confused with the number of knots in MPP. Assumption A.4 says that the GP prior assigns positive probability to an arbitrarily small $\|\cdot\|_S$-neighborhood around the true parameter $w_0$.
Assumption A.5 is a technical condition that is used in upper bounding the Bayes $L_2$-risk of the Wasserstein barycenter in the estimation of $w_0^*$ when Bayes $L_2$-risk upper bounds are available for the subset posterior distributions. We define the Bayes $L_2$-risk in the estimation of $w_0^*$ using the full data posterior as
\[ E_{S,S^*} E\{\|w^* - w_0^*\|_2^2 \mid y\} = E_{S,S^*} \int \|w^* - w_0^*\|_2^2 \, d\Pi_n(w^* \mid y), \tag{12} \]
where $E_{S,S^*}$ is the expectation under the true space-varying function $w_0$ with respect to the density of $y$ conditional on $S, S^*$ in (9). The decay rate of the risk in (12) is known under assumptions that are similar to A.1, A.3, and A.4, obtained by replacing $m$ with $n$ [73]. We present two theorems below that describe the asymptotic properties of the DISK posterior measured in this Bayes $L_2$-risk, for the spatial model specified in (9) with a prior $\Pi$ on $w$ that satisfies Assumptions A.1–A.5. The first theorem describes the Bayes $L_2$-risk of each subset posterior distribution in our case and is based on a proposition in [73].

Theorem 3.1 If Assumptions A.1–A.5 hold for the $j$th subset posterior $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$, then there exists a positive constant $c(H_l)$ that only depends on $H_l$, such that as $m \to \infty$,
\[ E_{S,S^*} E\{\|w^* - w_0^*\|_2^2 \mid y_j\} \le C_u^2\, c(H_l)\, \epsilon_m^2, \]
where $E_{S,S^*}$ is the expectation under the true space-varying function $w_0$ with respect to the subset $y_j$ of size $m$ conditional on $S, S^*$.

The proof of this theorem is in the supplementary material, along with the other proofs in this section. Theorem 3.1 holds for any $\epsilon_m$ sequence that satisfies Assumptions A.3 and A.4. Explicit expressions for $\epsilon_m$ are available if $w_0$ and $F_r$ are restricted to classes of functions with known regularity and $\Pi$ is a GP prior with the Matérn or squared exponential covariance kernel. For any $a, b > 0$, let $C^a[0,1]^d$ and $H^b[0,1]^d$ be the Hölder and Sobolev spaces of functions on $[0,1]^d$ with regularity indices $a$ and $b$, respectively. Define $D = [0,1]^d$ and $C_\alpha$ to be the Matérn kernel with
\[ C_\alpha(s, s') = \frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)} \Bigl(\frac{\|s - s'\|}{\phi}\Bigr)^{\nu} K_\nu\Bigl(\frac{\|s - s'\|}{\phi}\Bigr) \quad \text{for } s, s' \in D, \]
where $K_\nu$ is a modified Bessel function of the second kind with order $\nu$, which controls the process smoothness, $\phi$ is a lengthscale parameter that controls the decay in spatial correlation, and $\Gamma$ is the Gamma function. If $w_0 \in C^b[0,1]^d \cap H^b[0,1]^d$ and $F_r \subset C^b[0,1]^d \cap H^b[0,1]^d$ for $r > 1$, then $\epsilon_m = m^{-\min(\nu, b)/(2\nu + d)}$ provided $b > 0$ and $\min(\nu, b) > d/2$. Similarly, if $D = [0,1]^d$, $C_\alpha$ is the squared exponential kernel with $C_\alpha(s, s') = \sigma^2 e^{-\phi \|s - s'\|^2}$, and $w_0$ is an analytic function on $D$, then $\epsilon_m = (\log m)^{1/2}/\sqrt{m}$; see Theorem 5 and the companion result for the squared exponential kernel in [73]. The squared exponential kernel is not relevant to spatial statistics, but we provide this additional result for a more general audience, especially in machine learning.

The second theorem below provides an upper bound on the Bayes $L_2$-risk of the DISK posterior, $\bar\Pi(\cdot \mid y_1, \ldots, y_k)$, using the upper bounds on the $k$ subset posterior distributions.

Theorem 3.2 If Assumptions A.1–
A.5 hold for all subset posteriors $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$, then as $m \to \infty$,
\[ E_{S,S^*} E\{\|w^* - w_0^*\|_2^2 \mid y_1, \ldots, y_k\} = E_{S,S^*} \int \|w^* - w_0^*\|_2^2 \, d\bar\Pi(w^* \mid y_1, \ldots, y_k) \le C_u^2\, c(H_l)\, \epsilon_m^2, \]
where $E_{S,S^*}$ is the expectation under the true space-varying function $w_0$ with respect to the full dataset of size $n$ conditional on $S, S^*$.

The $\epsilon_m$ sequence here is the same as in Theorem 3.1. If $k \asymp \log^a n$ for some $a > 0$, then $m \asymp n/\log^a n$. With this choice of $m$ and $k$, our previous discussion implies that $\epsilon_m = n^{-c} (\log n)^{ac}$ for the Matérn kernel, where $c = \min(\nu, b)/(2\nu + d)$, and $\epsilon_m = (\log n)^{a/2 + 1/2}/\sqrt{n}$ for the squared exponential kernel. Both these rates are minimax optimal up to log factors in the estimation of $w_0$; see [73] for proofs.

In applications, we are also interested in estimating functions of $w^*$. An attractive property of the DISK posterior is that its theoretical guarantees extend to a large class of functions of $w^*$. Let $f$ be any function that maps $w^*$ to $f(w^*)$ and is bounded almost linearly in the $\|\cdot\|_2$ metric. Then, we have the following corollary from a direct application of Lemma 8.5 in [8].

Corollary 3.3 Suppose that Assumptions A.1–A.5 hold for all subset posteriors $\Pi_m(\cdot \mid y_j)$ with $j = 1, \ldots, k$. Let $f$ be a continuous function that maps $R^l$ to $R^l$ and satisfies $\|f(w^*)\|_2 \le C_f (1 + \|w^* - w_0^*\|_2)$ for any $w^* \in R^l$, where $C_f > 0$ is a fixed constant. Let $f \sharp \bar\Pi(\cdot \mid y_1, \ldots, y_k)$ represent the DISK posterior of $f(w^*)$. Then, as $m \to \infty$,
\[ \int \|f - f(w_0^*)\|_2 \, d f \sharp \bar\Pi(f \mid y_1, \ldots, y_k) = O_p(\epsilon_m), \]
where $O_p$ is in the probability measure under the true space-varying function $w_0$ with respect to the full dataset of size $n$ conditional on $S, S^*$.

3.6 Bayes $L_2$-risk of DISK: Bias-variance tradeoff and the choice of $k$

While Section 3.5 describes asymptotic optimality results for the DISK posterior when $n, m \to \infty$, a common problem in applications is the choice of $k$ for a large $n$. The risk for the DISK posterior derived in Theorem 3.2 and Corollary 3.3 is applicable to any prior distribution that satisfies Assumptions A.3 and A.4. If $k \asymp \log^a n$ for some $a > 0$, then the DISK posterior gives near minimax optimal performance in the estimation of $w_0$ for the two covariance kernels. In practice, however, we want to choose a $k$ that is much larger than $\log^a n$ because computational resources are abundant. If the number of subsets $k$ is very small, then the biases in the subset posterior distributions are small due to a large $m$, but the variance of the DISK posterior is large due to the small $k$. In contrast, if $k$ is very large, then the biases in the subset posterior distributions are large due to a small subset size $m$, but the variance of the DISK posterior can be small due to the large $k$. An optimal choice of $k$ balances the bias-variance tradeoff and minimizes the risk of the DISK posterior.

We introduce some definitions used in stating the results in this section. Let $P_s$ be a probability distribution over $D$ and $L_2(P_s)$ be the $L_2$ space under $P_s$, with the inner product $\langle f, g \rangle_{L_2(P_s)} = E_{P_s}(fg)$ for any $f, g \in L_2(P_s)$, and let $\{\phi_i(s) : i = 1, 2, \ldots\}$ be an orthonormal basis with respect to $P_s$.
Assume that the kernel has the series expansion $C_\alpha(s, s') = \sum_{i=1}^{\infty} \mu_i \phi_i(s) \phi_i(s')$ with respect to $P_s$ for any $s, s' \in D$, where $\mu_1 \ge \mu_2 \ge \cdots$ are the eigenvalues of $C_\alpha$. The trace of the kernel $C_\alpha$ is defined as $\mathrm{tr}(C_\alpha) = \sum_{i=1}^{\infty} \mu_i$. Any $f \in L_2(P_s)$ has the series expansion $f(s) = \sum_{i=1}^{\infty} \theta_i \phi_i(s)$, where $\theta_i = \langle f, \phi_i \rangle_{L_2(P_s)}$. The reproducing kernel Hilbert space (RKHS) $H$ attached to $C_\alpha$ is the space of all functions $f \in L_2(P_s)$ such that the $H$-norm satisfies $\|f\|_H^2 = \sum_{i=1}^{\infty} \theta_i^2/\mu_i < \infty$. The RKHS $H$ is the completion of the linear space of functions defined as $\sum_{i=1}^{I} a_i C_\alpha(s_i, \cdot)$, where $I$ is a positive integer, $s_i \in D$, and $a_i \in R$ ($i = 1, \ldots, I$); see [74] for greater details. Simplifying our setup in Section 3.5, we consider a random design scheme with the observed locations $S = \{s_1, \ldots, s_n\}$ and $S^* = \{s^*\}$, where $s_1, \ldots, s_n, s^*$ are mutually independent and follow $P_s$. The assumptions we impose below are used to derive analytic bounds for the bias and variance terms associated with the Bayes $L_2$-risk, and they are stronger than those in Section 3.5.

B.1 (RKHS) The true function $w_0$ is an element of the RKHS $H$ attached to the kernel $C_\alpha$.

B.2 (Trace class kernel) $\mathrm{tr}(C_\alpha) < \infty$.

B.3 (Moment condition) There are positive constants $\rho$ and, with a slight abuse of notation, $r$ such that $E_{P_s}\{\phi_i^{2r}(s)\} \le \rho^{2r}$ for every $i = 1, 2, \ldots, \infty$, and $\mathrm{var}\{\epsilon(s)\} \le \tau^2 < \infty$ for any $s \in D$.

Assumption B.1 is a stronger assumption than Assumption A.4 in Section 3.5. Assumption B.1 is not required in Section 3.5 because the DISK posterior can learn any continuous $w_0$ for a large class of GP priors as $n \to \infty$, even if $w_0 \notin H$. In general, the RKHS $H$ can be a much smaller space relative to the support of the GP prior, in the sense that the GP prior can assign zero probability to $H$ and positive probability to any neighborhood of arbitrary size around any continuous $w_0$. While we use Assumption B.1 mainly for technical simplicity in this section, it can possibly be relaxed by considering sieves with increasing Hilbert norms; see, for example, the relaxed assumption and accompanying theorem in Zhang et al. [84]. In Assumption B.2, $\mathrm{tr}(C_\alpha)$ measures the size of the covariance kernel and imposes conditions on the regularity of the functions that the DISK posterior can learn. Assumption B.3 controls the error in approximating $C_\alpha(s, s')$ by a finite sum, and the superscript $r$ here should not be confused with the number of knots in MPP or the $r$ in Assumption A.3. Our results are valid for any error distribution that guarantees $\mathrm{var}\{\epsilon(s)\} \le \tau^2$ for every $s \in D$, which is trivially satisfied in (9).

We examine the Bayes $L_2$-risk of the DISK posterior for estimating $w_0$ in (9). Under the setup of (9), let $E_{s^*}$, $E_0$, $E_S$, and $E_{0 \mid S}$ respectively be the expectations with respect to the distributions of $s^*$, $(S, y)$, $S$, and $y$ given $S$. If $\bar w(s^*)$ is a random variable that follows the DISK posterior for estimating $w_0(s^*)$, then $\bar w(s^*)$ has the density $N(\bar m, \bar v)$, where
\[ \bar m = \frac{1}{k} \sum_{j=1}^{k} c_{j,*}^T \Bigl(C_{j,j} + \frac{\tau^2}{k} I\Bigr)^{-1} y_j, \quad \bar v^{1/2} = \frac{1}{k} \sum_{j=1}^{k} v_j^{1/2}, \quad v_j = c_{*,*} - c_{j,*}^T \Bigl(C_{j,j} + \frac{\tau^2}{k} I\Bigr)^{-1} c_{j,*}, \tag{13} \]
$c_{*,*} = \mathrm{cov}\{w(s^*), w(s^*)\}$, and $c_{j,*}^T = [\mathrm{cov}\{w(s_{j1}), w(s^*)\}, \ldots, \mathrm{cov}\{w(s_{jm}), w(s^*)\}]$.
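The combination rule for a single test location can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it uses an exponential covariance function with hypothetical parameters `sigma2` and `phi`, a random partition into `k` subsets, simulated data, and plain Python linear algebra to stay dependency-free. The helper names (`exp_kernel`, `solve`, `subset_posterior`, `disk_posterior`) are ours.

```python
import math
import random


def exp_kernel(s, t, sigma2=1.0, phi=3.0):
    """Exponential covariance sigma2 * exp(-phi * |s - t|); an illustrative choice."""
    return sigma2 * math.exp(-phi * abs(s - t))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def subset_posterior(locs, ys, s_star, tau2, k):
    """Posterior mean and variance of w(s_star) from one subset.

    The noise variance is tau2 / k, which mimics raising the subset
    likelihood to the power k (stochastic approximation).
    """
    m = len(locs)
    cov = [[exp_kernel(locs[a], locs[b]) + (tau2 / k if a == b else 0.0)
            for b in range(m)] for a in range(m)]
    c_vec = [exp_kernel(s, s_star) for s in locs]
    weights = solve(cov, c_vec)  # (C_jj + tau2 / k * I)^{-1} c_j*
    mean = sum(w * y for w, y in zip(weights, ys))
    var = exp_kernel(s_star, s_star) - sum(w * c for w, c in zip(weights, c_vec))
    return mean, max(var, 0.0)


def disk_posterior(locs, ys, s_star, tau2, k, seed=0):
    """Combine k subset posteriors: average the means and the standard deviations."""
    idx = list(range(len(locs)))
    random.Random(seed).shuffle(idx)
    parts = [idx[j::k] for j in range(k)]  # random partition into k subsets
    stats = [subset_posterior([locs[i] for i in p], [ys[i] for i in p],
                              s_star, tau2, k) for p in parts]
    m_bar = sum(mean for mean, _ in stats) / k
    sd_bar = sum(math.sqrt(var) for _, var in stats) / k
    return m_bar, sd_bar ** 2


if __name__ == "__main__":
    rng = random.Random(1)
    locs = [rng.random() for _ in range(80)]
    ys = [math.sin(2 * math.pi * s) + rng.gauss(0.0, 0.3) for s in locs]
    for k in (1, 4, 8):
        print(k, disk_posterior(locs, ys, s_star=0.5, tau2=0.09, k=k))
```

With `k = 1` the function returns the usual full data GP posterior mean and variance at `s_star`, which is a convenient sanity check on the subset formulas.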
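The Bayes $L_2$-risk of the DISK posterior in estimating $w_0$ is $E_0 [ E_{s^*} \{\bar w(s^*) - w_0(s^*)\}^2 ]$, and it is decomposed into the squared bias, the variance of the mean of the DISK posterior, and the variance of the DISK posterior as
\[ \mathrm{bias}^2 = E_{s^*} E_S \{ c^T (kL + \tau^2 I)^{-1} w_0 - w_0(s^*) \}^2, \quad \mathrm{var}_{\mathrm{mean}} = \tau^2 E_{s^*} E_S \{ c^T (kL + \tau^2 I)^{-2} c \}, \quad \mathrm{var}_{\mathrm{DISK}} = E_{s^*} E_S (\bar v), \tag{14} \]
where $c^T = (c_{1,*}^T, \ldots, c_{k,*}^T)$, $w_{0j} = \{w_0(s_{j1}), \ldots, w_0(s_{jm})\}^T$ ($j = 1, \ldots, k$), $w_0^T = (w_{01}^T, \ldots, w_{0k}^T)$, and $L$ is a block-diagonal matrix with $C_{1,1}, \ldots, C_{k,k}$ along the diagonal. The next theorem describes the asymptotic behavior of each of the three terms in (14).

Theorem 3.4 If Assumptions B.1–B.3 hold, then
\begin{align*}
\text{Bayes } L_2 \text{ risk} &= E_S E_{0 \mid S} E_{s^*} \{\bar w(s^*) - w_0(s^*)\}^2 = \mathrm{bias}^2 + \mathrm{var}_{\mathrm{mean}} + \mathrm{var}_{\mathrm{DISK}}, \\
\mathrm{bias}^2 &\le \frac{8\tau^2}{n} \|w_0\|_H^2 + \|w_0\|_H^2 \inf_{d \in N} \Bigl[ \frac{8n}{\tau^2} \rho^4\, \mathrm{tr}(C_\alpha)\, \mathrm{tr}(C_\alpha^d) + \mu_1 \Bigl\{ \frac{A\, b(m, d, r)\, \rho^2\, \gamma(\tau^2/n)}{\sqrt{m}} \Bigr\}^r \Bigr], \\
\mathrm{var}_{\mathrm{mean}} &\le \frac{2n + 4\|w_0\|_H^2}{k} \inf_{d \in N} \Bigl[ \mu_{d+1} + \frac{12n}{\tau^2} \rho^4\, \mathrm{tr}(C_\alpha)\, \mathrm{tr}(C_\alpha^d) + \Bigl\{ \frac{A\, b(m, d, r)\, \rho^2\, \gamma(\tau^2/n)}{\sqrt{m}} \Bigr\}^r \Bigr] + \frac{12\tau^2}{kn} \|w_0\|_H^2 + \frac{12\tau^2}{n} \gamma\Bigl(\frac{\tau^2}{n}\Bigr), \\
\mathrm{var}_{\mathrm{DISK}} &\le \frac{\tau^2}{n} \gamma\Bigl(\frac{\tau^2}{n}\Bigr) + \inf_{d \in N} \Bigl[ \mathrm{tr}(C_\alpha^d) + \mathrm{tr}(C_\alpha) \Bigl\{ \frac{A\, b(m, d, r)\, \rho^2\, \gamma(\tau^2/n)}{\sqrt{m}} \Bigr\}^r \Bigr], \tag{15}
\end{align*}
where $N$ is the set of all positive integers, $A$ is a positive constant,
\[ b(m, d, r) = \max\Bigl\{ \sqrt{\max(r, \log d)},\; \frac{\max(r, \log d)}{m^{1/2 - 1/r}} \Bigr\}, \quad \gamma(a) = \sum_{i=1}^{\infty} \frac{\mu_i}{\mu_i + a} \text{ for any } a > 0, \quad \mathrm{tr}(C_\alpha^d) = \sum_{i=d+1}^{\infty} \mu_i. \]

Theorem 3.4 is based on arguments similar to those behind the main theorem of Zhang et al. [84], who derived risk bounds for the frequentist divide-and-conquer estimator in kernel ridge regression. We, however, consider the divide-and-conquer scheme in the Bayesian context, so our risk bound in Theorem 3.4 involves two variance terms, including the DISK posterior variance term $\mathrm{var}_{\mathrm{DISK}}$, which has not been considered by Zhang et al. [84] or the other divide-and-conquer literature because of their interest in frequentist point estimation of $w_0$. The function $\gamma(a)$ measures the effective dimensionality of $C_\alpha$ with respect to $L_2(P_s)$ [83]. The function $\mathrm{tr}(C_\alpha^d)$ describes the tail behavior of the eigenvalues of $C_\alpha$ [84]. The upper bounds for $\mathrm{bias}^2$ and $\mathrm{var}_{\mathrm{mean}}$ are the DISK analogues of the upper bounds in Lemma 6 and Lemma 7, respectively, of Zhang et al. [84]. From the risk bounds in Theorem 3.4, one can see that the first and second terms in the upper bound for $\mathrm{var}_{\mathrm{DISK}}$ are dominated by the last and second terms in the upper bounds for $\mathrm{var}_{\mathrm{mean}}$ and $\mathrm{bias}^2$, respectively. Therefore, if we use the DISK posterior mean or a random draw from the DISK posterior to estimate $w_0$, we will observe a bias-variance tradeoff similar to the one described in [84]: as $k$ increases, the squared bias of the estimate increases, while the variation between the subset estimates decreases. Theoretically, this implies the existence of an optimal $k^*$ such that the Bayes $L_2$-risk of the DISK posterior decreases for $k \le k^*$ and increases for $k > k^*$. Our empirical results in Section 4 demonstrate this bias-variance tradeoff.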
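The quantities $\gamma(a)$ and $\mathrm{tr}(C_\alpha^d)$ are easy to evaluate numerically once a decay pattern for the eigenvalues is fixed. The sketch below assumes polynomially decaying eigenvalues $\mu_i = i^{-2\nu}$ truncated at a finite number of terms, which is an illustrative choice rather than a kernel taken from the text; the function names are ours.

```python
def poly_eigenvalues(nu, num_terms=100_000):
    """Illustrative eigenvalues mu_i = i^(-2 nu), i = 1, ..., num_terms."""
    return [i ** (-2.0 * nu) for i in range(1, num_terms + 1)]


def effective_dimension(mu, a):
    """gamma(a) = sum_i mu_i / (mu_i + a), truncated to the supplied eigenvalues."""
    return sum(m / (m + a) for m in mu)


def tail_trace(mu, d):
    """tr(C_alpha^d) = sum_{i > d} mu_i, truncated to the supplied eigenvalues."""
    return sum(mu[d:])


if __name__ == "__main__":
    mu = poly_eigenvalues(nu=1.0)
    tau2 = 1.0
    for n in (100, 1_600, 25_600):
        gamma_n = effective_dimension(mu, tau2 / n)
        print(n, round(gamma_n, 2), round(tail_trace(mu, 10), 4))
```

For $\mu_i = i^{-2\nu}$ the effective dimension grows roughly like $(n/\tau^2)^{1/(2\nu)}$, so with $\nu = 1$ multiplying $n$ by 16 should roughly quadruple $\gamma(\tau^2/n)$.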
For three types of commonly used kernels, the next theorem provides conditions on $k$ such that the Bayes $L_2$-risk in (15) is nearly minimax optimal. The covariance kernel $C_\alpha$ is a degenerate kernel of rank $d$ if there is some constant positive integer $d$ such that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_d > 0$ and $\mu_{d+1} = \mu_{d+2} = \cdots = \mu_\infty = 0$. The covariance kernels in the subset of regressors approximation [55] and the predictive process [6] are degenerate, with their ranks equaling the number of selected regressors and knots, respectively. The squared exponential kernel is very popular in machine learning. Its RKHS belongs to the class of RKHSs of kernels with exponentially decaying eigenvalues. Similarly, the class of RKHSs of kernels with polynomially decaying eigenvalues includes the Sobolev spaces with different orders of smoothness and the RKHS of the Matérn kernel. This kernel is most
relevant for spatial applications, but we provide the other two results for a more general audience.

Theorem 3.5 If Assumptions B.1–B.3 hold and $r > 4$ in Assumption B.3, then, as $n \to \infty$,

(i) if $C_\alpha$ is a degenerate kernel of rank $d$ and $k \le c\, n^{(r-4)/(r-2)} / (\log n)^{r/(r-2)}$ for some constant $c > 0$, then the $L_2$-risk of the DISK posterior satisfies $E_{s^*} E_S E_{0 \mid S} \{\bar w(s^*) - w_0(s^*)\}^2 = O(n^{-1})$;

(ii) if $\mu_i \le c_{1\mu} \exp(-c_{2\mu} i^2)$ for some constants $c_{1\mu} > 0$, $c_{2\mu} > 0$ and all $i \in N$, and for some constant $c > 0$, $k \le c\, n^{(r-4)/(r-2)} / (\log n)^{3r/(r-2)}$, then the $L_2$-risk of the DISK posterior satisfies $E_{s^*} E_S E_{0 \mid S} \{\bar w(s^*) - w_0(s^*)\}^2 = O(\sqrt{\log n}/n)$; and

(iii) if $\mu_i \le c_\mu i^{-2\nu}$ for some constants $c_\mu > 0$, $\nu > r/(r-4)$ and all $i \in N$, and for some constant $c > 0$, $k \le c\, n^{\{(r-4)\nu - r\}/\{(r-2)\nu\}} / (\log n)^{r/(r-2)}$, then the $L_2$-risk of the DISK posterior satisfies $E_{s^*} E_S E_{0 \mid S} \{\bar w(s^*) - w_0(s^*)\}^2 = O(n^{-(2\nu-1)/(2\nu)})$.

The rates of decay of the $L_2$-risks in (i) and (ii) are known to be minimax optimal [57, 84, 8], whereas the rate of decay of the $L_2$-risk in (iii) is slightly larger than the minimax optimal rate by a factor of $n^{1/\{2\nu(2\nu+1)\}}$ [84]. The main advantage of the DISK posterior over its non-Bayesian counterparts is that it achieves optimal performance in the first two cases and near optimal performance in the third case while being free of tuning parameter selection. In most applications $D$ is also compact, so that $r$ in Assumption B.3 can be taken as infinity. This implies that the upper bounds on $k$ in (i), (ii), and (iii) reduce to $k = O(n/\log n)$, $k = O(n/\log^3 n)$, and $k = O(n^{(\nu-1)/\nu}/\log n)$, respectively.

Theoretical results similar to Theorems 3.4 and 3.5 are well studied in frequentist divide-and-conquer kernel ridge regression, but developing Bayesian analogues of these results remains an active area of research. Cheng and Shang have studied the theoretical properties of divide-and-conquer Bayesian nonparametric regression under various theoretical setups [, 64, 65]. Recently, Szabo and van Zanten [7] have explored the convergence rate of the $L_2$-risk and coverage in the classical Gaussian white noise model.
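To get a feel for the limiting upper bounds on $k$, the orders above can be evaluated for a given $n$. The sketch below ignores the unspecified constants in the $O(\cdot)$ statements, so the numbers indicate orders of magnitude only. The function name and dictionary keys are ours, and the polynomial case uses the exponent $(\nu - 1)/\nu$ as we read it from the text.

```python
import math


def max_subsets_order(n, nu=2.0):
    """Orders of the upper bounds on k as r -> infinity, with constants dropped."""
    log_n = math.log(n)
    return {
        "degenerate": n / log_n,                       # case (i)
        "exponential": n / log_n ** 3,                 # case (ii)
        "polynomial": n ** ((nu - 1.0) / nu) / log_n,  # case (iii)
    }


if __name__ == "__main__":
    for n in (10**4, 10**6):
        print(n, {key: round(val) for key, val in max_subsets_order(n).items()})
```

The comparison makes the practical message visible: kernels with faster eigenvalue decay tolerate many more subsets than polynomially decaying ones at the same sample size.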
Focusing on the practical issue, we have developed a general method and related sampling algorithms for extending any method for Bayesian nonparametric regression based on a GP prior to massive data settings using the divide-and-conquer technique. Theorem 3.4 describes the $L_2$-risk of our method, and Theorem 3.5 provides guidance on choosing the number of subsets. Under certain assumptions on $w_0$, our theoretical results reduce to those obtained in previous works; however, none of the previous works focuses on developing a computational method that is as widely applicable as DISK and is grounded in Bayesian asymptotic theory.

4 Experiments

4.1 Simulation setup

We compare DISK with its competitors on synthetic data based on its performance in learning the process parameters, interpolating the unobserved residual spatial surface, and predicting at new locations. This section presents two simulation studies. The first Simulation and second


More information

Quantile Regression for Extraordinarily Large Data

Quantile Regression for Extraordinarily Large Data Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile regression Two-step

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information
