ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES


Submitted to the Annals of Statistics

ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES

By Anirban Bhattacharya, Debdeep Pati and David Dunson
Duke University, Florida State University
(Postdoctoral scholar, Duke University; Assistant Professor, Florida State University; Professor, Duke University)

In nonparametric regression problems involving multiple predictors, there is typically interest in estimating an anisotropic multivariate regression surface in the important predictors while discarding the unimportant ones. Our focus is on defining a Bayesian procedure that leads to the minimax optimal rate of posterior contraction (up to a log factor) adapting to the unknown dimension and anisotropic smoothness of the true surface. We propose such an approach based on a Gaussian process prior with dimension-specific scalings, which are assigned carefully chosen hyperpriors. We additionally show that using a homogeneous Gaussian process with a single bandwidth leads to a sub-optimal rate in anisotropic cases.

AMS 2000 subject classifications: Primary 62G07, 62G20; secondary 60K35.
Keywords and phrases: Adaptive, Anisotropic, Bayesian nonparametrics, Function estimation, Gaussian process, Rate of convergence.

1. Introduction. Gaussian processes (Rasmussen, 2004; van der Vaart and van Zanten, 2008b) are widely used as priors on functions due to tractable posterior computation and attractive theoretical properties. The law of a mean zero Gaussian process (GP) $W_t$ is entirely characterized by its covariance kernel $c(s, t) = \mathrm{E}(W_s W_t)$. A squared exponential covariance kernel, given by $c(s, t) = \exp(-a \|s - t\|^2)$, is commonly used in the literature. Given $n$ independent observations, the optimal rate of estimation of a $d$-variable function that is only known to be $\alpha$-smooth is $n^{-\alpha/(2\alpha + d)}$ (Stone, 1982). The quality of estimation thus improves with increasing smoothness of the true function, while it deteriorates with increasing dimensionality. In practice, the smoothness $\alpha$ is typically unknown, and one would like an estimation procedure that adapts to any possible $\alpha > 0$. Accordingly, a great deal of effort has gone into developing adaptive estimation methods that are rate-optimal for every regularity level of the unknown function. The literature on adaptive estimation in a minimax setting was initiated by Lepski in a series of papers (Lepski, 1990, 1991, 1992); see also Birgé (2001) for a discussion of this topic.

We also refer the reader to Hoffmann and Lepski (2002), which contains an extensive list of developments in the frequentist literature on adaptive estimation. There is a growing literature on Bayesian adaptation over the last decade; previous work includes Belitser and Ghosal (2003); de Jonge and van Zanten (2010); Ghosal, Lember and van der Vaart (2003, 2008); Huang (2004); Kruijer, Rousseau and van der Vaart (2010); Rousseau (2010); Scricciolo (2006); Shen and Ghosal (2011).

A key idea in frequentist adaptive estimation is to narrow down the search for an optimal estimator within a class of estimators indexed by a smoothness or bandwidth parameter, and to make a data-driven choice of the proper bandwidth. In a Bayesian context, one would instead place a prior on the bandwidth parameter and model-average across different values of the bandwidth through the posterior distribution. The parameter $a$ in the squared exponential covariance kernel $c$ plays the role of a scaling or inverse bandwidth. van der Vaart and van Zanten (2009) showed that with a gamma prior on $a^d$, one obtains the minimax rate of posterior contraction $n^{-\alpha/(2\alpha+d)}$, up to a logarithmic factor, for $\alpha$-smooth functions adaptively over all $\alpha > 0$.

In most multivariate applications, the isotropic smoothness assumption seems too restrictive. Potentially, one can incorporate a separate scaling variable $a_j$ for each dimension using the covariance kernel $c(s, t) = \exp(-\sum_{j=1}^{d} a_j |s_j - t_j|^2)$, intuitively enabling better approximation of anisotropic functions. Such kernels, going by the name automatic relevance determination (ARD), have been heavily used in the machine learning community; see, for example, Rasmussen (2004) and references therein. Zou et al. (2010) and Savitsky, Vannucci and Sha (2011) recently considered such a model, with point mass mixture priors on the $a_j$'s. Although this is an attractive approach with encouraging empirical performance, there has not been any theoretical study of the asymptotic properties of related models in a Bayesian framework.

In the frequentist literature, minimax rates of convergence in anisotropic Sobolev, Besov and Hölder spaces have been studied in Birgé (1986); Ibragimov and Khasminski (1981); Nussbaum (1985), with adaptive estimation procedures developed in Barron, Birgé and Massart (1999); Hoffmann and Lepski (2002); Kerkyacharian, Lepski and Picard (2001); Klutchnikoff (2005), among others. The traditional way of dealing with anisotropy is to employ a separate bandwidth or scaling parameter for the different dimensions, and to choose an optimal combination of scales in a data-driven way. However, the multidimensional nature of the problem makes optimal bandwidth selection more difficult than in the isotropic case, as there is no natural ordering among the estimators with multiple bandwidths (Lepski and Levit, 1999).
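To make the role of the dimension-specific scalings concrete, the following sketch (our own illustration in Python/NumPy, not code from the paper; all names are ours) evaluates the ARD squared exponential kernel above and draws one sample path of the correspondingly rescaled process on a random design.

```python
import numpy as np

def ard_se_kernel(s, t, a):
    """ARD squared exponential kernel c(s, t) = exp(-sum_j a_j (s_j - t_j)^2).

    s, t : arrays of shape (n, d) and (m, d); a : length-d vector of
    inverse squared bandwidths (the per-dimension scalings a_j).
    """
    diff2 = (s[:, None, :] - t[None, :, :]) ** 2        # shape (n, m, d)
    return np.exp(-np.tensordot(diff2, a, axes=([2], [0])))

rng = np.random.default_rng(0)
d = 2
a = np.array([50.0, 1.0])            # rougher (less smooth) along the first axis
grid = rng.uniform(size=(200, d))    # random design on [0, 1]^2
K = ard_se_kernel(grid, grid, a)
# One draw from the mean-zero GP with this kernel (jitter for numerical stability).
path = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))
```

A large $a_j$ shortens the correlation length along axis $j$, which is exactly the "stretching" of less smooth directions discussed later in Section 3.3.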

It is known (Hoffmann and Lepski, 2002) that the minimax rate of convergence for a function with smoothness $\alpha_i$ along the $i$th dimension is given by $n^{-\alpha_0/(2\alpha_0+1)}$, where $\alpha_0^{-1} = \sum_{i=1}^{d} \alpha_i^{-1}$ is an exponent of global smoothness (Birgé, 1986). When $\alpha_i = \alpha$ for all $i = 1, \ldots, d$, one recovers the optimal rate for isotropic classes. On the contrary, if the true function belongs to an anisotropic class, the assumption of isotropy leads to a loss of efficiency which becomes more and more accentuated in higher dimensions. In addition, if the true function depends on a subset of coordinates $I = \{i_1, \ldots, i_{d_0}\} \subset \{1, \ldots, d\}$ for some $1 \le d_0 \le d$, the minimax rate further improves to $n^{-\alpha_{0I}/(2\alpha_{0I}+1)}$, with $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$.

The objective of this article is to study whether one can fully adapt to this larger class of functions in a Bayesian framework using dimension-specific rescalings of a homogeneous Gaussian process, referred to as a multi-bandwidth Gaussian process from now on. We answer the question in the affirmative, establishing rate adaptiveness of the posterior distribution in a variety of settings involving a multi-bandwidth Gaussian process through a novel prior specification on the vector of bandwidths. For simplicity of exposition, we initially study the problem in two parts: (i) adaptive estimation over anisotropic Hölder functions of $d$ arguments, and (ii) adaptive estimation over functions that can possibly depend on fewer coordinates and have isotropic Hölder smoothness over the remaining coordinates. The proposed prior specifications for the two cases are intuitively interpretable and can be easily connected to prescribe a unified prior leading to adaptivity over (i) and (ii) combined.

Although our prior specification involving dimension-specific bandwidth parameters leads to adaptivity, a stronger result is required to conclude that a single bandwidth would be inadequate for the above classes of functions. We prove that the optimal prior choice in the isotropic case leads to a suboptimal convergence rate if the true function has anisotropic smoothness, by obtaining a lower bound on the posterior contraction rate. Previous results on posterior lower bounds in nonparametric problems include Castillo (2008); van der Vaart and van Zanten (2011).

The remainder of the paper is organized as follows. In Section 2, we introduce relevant notations and conventions used throughout the paper. The multi-bandwidth Gaussian process is introduced in Section 3. Sections 3.1 and 3.2 discuss the main developments, with applications to anisotropic Gaussian process mean regression and logistic Gaussian process density estimation described in Section 3.4. Section 3.5 establishes the necessity of the multi-bandwidth Gaussian process by showing a lower-bound result. In Sections 4.1 and 4.2, we study various properties of rescaled Gaussian processes which are crucially used in the proofs of the main theorems in Section 5.
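As a quick numerical illustration of the rate formulas above (our own worked example, not from the paper), the snippet below computes the harmonic-mean smoothness $\alpha_{0I}$ and the corresponding rate exponent, and checks that the isotropic case reduces to $\alpha/(2\alpha+d)$.

```python
import numpy as np

def rate_exponent(alpha, I=None):
    """Exponent e in the anisotropic minimax rate n^{-e}.

    alpha : per-coordinate smoothnesses (alpha_1, ..., alpha_d);
    I     : coordinates the function depends on (default: all of them).
    Uses alpha_{0I}^{-1} = sum_{j in I} alpha_j^{-1}, e = alpha_{0I}/(2 alpha_{0I} + 1).
    """
    alpha = np.asarray(alpha, dtype=float)
    if I is not None:
        alpha = alpha[list(I)]
    a0 = 1.0 / np.sum(1.0 / alpha)
    return a0 / (2.0 * a0 + 1.0)

# Isotropic check: alpha_i = 2 in d = 4 gives alpha_0 = 1/2 and exponent
# 1/4 = alpha/(2 alpha + d). Dropping two irrelevant coordinates improves it to 1/3.
print(rate_exponent([2, 2, 2, 2]))            # 0.25
print(rate_exponent([2, 2, 2, 2], I=[0, 1]))  # 0.333... = 2/(2*2 + 2)
```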

2. Preliminaries. To keep the notation clean, we shall only use boldface for $\mathbf{a}$, $\mathbf{b}$ and $\boldsymbol{\alpha}$ to denote vectors. We shall make frequent use of the following multi-index notations. For vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, let $a_{\cdot} = \sum_{j=1}^{d} a_j$, $\prod \mathbf{a} = \prod_{j=1}^{d} a_j$, $\mathbf{a}! = \prod_{j=1}^{d} a_j!$, $\bar{a} = \max_j a_j$, $\underline{a} = \min_j a_j$, $\mathbf{a}./\mathbf{b} = (a_1/b_1, \ldots, a_d/b_d)^{\mathrm{T}}$, $\mathbf{a} \cdot \mathbf{b} = (a_1 b_1, \ldots, a_d b_d)^{\mathrm{T}}$ and $\mathbf{a}^{\mathbf{b}} = \prod_{j=1}^{d} a_j^{b_j}$. Denote $\mathbf{a} \le \mathbf{b}$ if $a_j \le b_j$ for all $j = 1, \ldots, d$. For $\mathbf{n} = (n_1, \ldots, n_d)$, let $D^{\mathbf{n}} f$ denote the mixed partial derivative of $f$ of order $(n_1, \ldots, n_d)$.

Let $C[0,1]^d$ and $C^{\beta}[0,1]^d$ denote the space of all continuous functions and the Hölder space of $\beta$-smooth functions $f : [0,1]^d \to \mathbb{R}$, respectively, endowed with the supremum norm $\|f\|_{\infty} = \sup_{t \in [0,1]^d} |f(t)|$. For $\beta > 0$, the Hölder space $C^{\beta}[0,1]^d$ consists of functions $f \in C[0,1]^d$ that have bounded mixed partial derivatives up to order $\lfloor \beta \rfloor$, with the partial derivatives of order $\lfloor \beta \rfloor$ being Lipschitz continuous of order $\beta - \lfloor \beta \rfloor$. Also, denote by $H^{\beta}[0,1]^d$ the Sobolev space of functions $f : [0,1]^d \to \mathbb{R}$ that are restrictions of a function $f : \mathbb{R}^d \to \mathbb{R}$ with Fourier transform $\hat{f}(\lambda) = (2\pi)^{-d} \int e^{-i\langle\lambda, t\rangle} f(t) \, dt$ such that $\int (1 + \|\lambda\|^2)^{\beta/2} |\hat{f}(\lambda)| \, d\lambda < \infty$. Note that with the above convention, the inverse Fourier transform is $f(t) = \int e^{i\langle\lambda, t\rangle} \hat{f}(\lambda) \, d\lambda$.

Next, we define an anisotropic Hölder class of functions previously used in Barron, Birgé and Massart (1999); Klutchnikoff (2005). For a function $f \in C[0,1]^d$, $x \in [0,1]^d$ and $1 \le i \le d$, let $f_i(\cdot \mid x)$ denote the univariate function $y \mapsto f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_d)$. For a vector of positive numbers $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$, the anisotropic Hölder space $C^{\boldsymbol{\alpha}}[0,1]^d$ consists of functions $f$ which satisfy, for some $L > 0$,

(2.1)  $\max_{1 \le i \le d} \sup_{x \in [0,1]^d} \sum_{j=0}^{\lfloor \alpha_i \rfloor} \| D^j f_i(\cdot \mid x) \|_{\infty} \le L$,

and, for any $y \in [0,1]$, $h$ small enough that $y + h \in [0,1]$, and all $1 \le i \le d$,

(2.2)  $\sup_{x \in [0,1]^d} \big| D^{\lfloor \alpha_i \rfloor} f_i(y + h \mid x) - D^{\lfloor \alpha_i \rfloor} f_i(y \mid x) \big| \le L |h|^{\alpha_i - \lfloor \alpha_i \rfloor}$.

For $t \in \mathbb{R}^d$ and a subset $I \subset \{1, \ldots, d\}$ of size $|I| = d'$ with $1 \le d' \le d$, let $t_I$ denote the vector of size $d'$ consisting of the coordinates $(t_j : j \in I)$. Let $C[0,1]^I$ denote the subset of $C[0,1]^d$ consisting of functions $f$ such that $f(t) = g(t_I)$ for some function $g \in C[0,1]^{d'}$. Also, let $C^{\boldsymbol{\alpha}}[0,1]^I$ denote the subset of $C^{\boldsymbol{\alpha}}[0,1]^d$ consisting of functions $f$ such that $f(t) = g(t_I)$ for some function $g \in C^{\boldsymbol{\alpha}_I}[0,1]^{d'}$.
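For readers who like to see the conventions operationally, here is a small sketch (ours, not from the paper) of the multi-index operations just defined, written in Python/NumPy with names mirroring the notation.

```python
import numpy as np
from math import factorial

a = np.array([2.0, 3.0, 0.5])
b = np.array([1.0, 2.0, 4.0])

a_dot   = a.sum()                    # a. = sum_j a_j
a_prod  = a.prod()                   # prod(a) = prod_j a_j
a_bar   = a.max()                    # max_j a_j
a_under = a.min()                    # min_j a_j
a_div_b = a / b                      # a./b, elementwise quotient
a_mul_b = a * b                      # a.b, elementwise product
a_pow_b = np.prod(a ** b)            # a^b = prod_j a_j^{b_j}
n_fact  = np.prod([factorial(n) for n in (2, 0, 3)])  # n! for a multi-index n
```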

The $\epsilon$-covering number $N(\epsilon, S, d)$ of a semi-metric space $S$ relative to the semi-metric $d$ is the minimal number of balls of radius $\epsilon$ needed to cover $S$. The logarithm of the covering number is referred to as the entropy. We write $\lesssim$ for inequality up to a constant multiple. Let $\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$ denote the standard normal density, and let $\phi_{\sigma}(x) = (1/\sigma)\phi(x/\sigma)$. Let an asterisk denote convolution, e.g., $(\phi_{\sigma} * f)(y) = \int \phi_{\sigma}(y - x) f(x) \, dx$. Let $\mathbb{R}_+$ denote the set of non-negative real numbers and $\mathbb{R}_{>0}$ the positive reals. Denote by $\mathcal{S}^{d-1}$ the $(d-1)$-dimensional simplex $\{x \in \mathbb{R}^d : x_i \ge 0,\ 1 \le i \le d,\ \sum_{i=1}^{d} x_i = 1\}$. Unless otherwise stated, $c, C, C', C_1, C_2, \ldots, K, K_1, K_2, \ldots$ shall denote global constants whose precise values are irrelevant for our purposes.

3. Main results. Let $W = \{W_t : t \in [0,1]^d\}$ be a centered homogeneous Gaussian process with covariance function $\mathrm{E}(W_s W_t) = c(s - t)$. A detailed review of the facts on Gaussian processes relevant to the present application can be found in van der Vaart and van Zanten (2008b). If $c : \mathbb{R}^d \to \mathbb{R}$ is continuous, then by Bochner's theorem there exists a finite positive measure $\nu$ on $\mathbb{R}^d$, called the spectral measure of $W$, such that

$c(t) = \int_{\mathbb{R}^d} e^{i\langle\lambda, t\rangle} \, \nu(d\lambda)$,

where for $u, v \in \mathbb{C}^d$, $\langle u, v \rangle$ denotes the complex inner product. As in van der Vaart and van Zanten (2009), we shall restrict ourselves to processes with spectral measure $\nu$ having sub-exponential tails, i.e., for some $\delta > 0$,

(3.1)  $\int e^{\delta \|\lambda\|} \, \nu(d\lambda) < \infty$.

The spectral measure $\nu$ of the squared exponential covariance kernel $c(t) = \exp(-\|t\|^2)$ has density $f(\lambda) = (2^d \pi^{d/2})^{-1} \exp(-\|\lambda\|^2/4)$ with respect to the Lebesgue measure, which clearly satisfies (3.1).

Let $\mathbb{H}$ be the RKHS of $W$; see van der Vaart and van Zanten (2008b) for a review of the relevant facts. van der Vaart and van Zanten (2008a) showed that the rate of posterior contraction with a Gaussian process prior $W$, in a metric where appropriate testing is possible, is determined by the behavior of the concentration function $\phi_{w_0}(\epsilon)$ for $\epsilon$ close to zero, where

(3.2)  $\phi_{w_0}(\epsilon) = \inf_{h \in \mathbb{H} : \|h - w_0\|_{\infty} \le \epsilon} \frac{1}{2}\|h\|^2_{\mathbb{H}} - \log P\big( \|W\|_{\infty} \le \epsilon \big)$.
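As a side sanity check on the spectral representation (ours, not part of the paper), the following snippet numerically verifies in one dimension that the stated spectral density $f(\lambda) = e^{-\lambda^2/4}/(2\sqrt{\pi})$ reproduces $c(t) = e^{-t^2}$ via Bochner's formula.

```python
import numpy as np

# Spectral density of the 1-d squared exponential kernel c(t) = exp(-t^2),
# i.e. f(lambda) = exp(-lambda^2/4) / (2 sqrt(pi)) as stated above.
f = lambda lam: np.exp(-lam ** 2 / 4) / (2 * np.sqrt(np.pi))

lam = np.linspace(-40.0, 40.0, 200001)   # fine grid; Gaussian tails are negligible
h = lam[1] - lam[0]
for t in [0.0, 0.5, 1.0, 2.0]:
    vals = np.cos(lam * t) * f(lam)      # real part of e^{i lambda t} f(lambda)
    c_t = np.sum((vals[1:] + vals[:-1]) / 2) * h   # trapezoidal rule
    assert abs(c_t - np.exp(-t ** 2)) < 1e-8
print("Bochner representation check passed")
```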

We tacitly assume that there is a given statistical problem in which the true parameter $g_0$ is a known function of $w_0$. van der Vaart and van Zanten (2009) studied rescaled Gaussian processes $W^A = \{W_{At} : t \in [0,1]^d\}$ for a real positive random variable $A$ stochastically independent of $W$, and showed that with a gamma prior on $A^d$, one obtains the minimax-optimal rate of convergence $n^{-\alpha/(2\alpha+d)}$ (up to a logarithmic factor) for $\alpha$-smooth functions. Since their prior specification does not involve the unknown smoothness $\alpha$, the procedure is fully adaptive. The key result of van der Vaart and van Zanten (2009) was to construct sets $B_n \subset C[0,1]^d$ so that given $\alpha > 0$, a function $w_0 \in C^{\alpha}[0,1]^d$, and a constant $C > 1$, there exists a constant $D > 0$ such that, for every sufficiently large $n$,

(3.3)  $\log N(\bar{\epsilon}_n, B_n, \|\cdot\|_{\infty}) \le D n \bar{\epsilon}_n^2$,
(3.4)  $P(W^A \notin B_n) \le e^{-C n \epsilon_n^2}$,
(3.5)  $P(\|W^A - w_0\|_{\infty} \le \epsilon_n) \ge e^{-n \epsilon_n^2}$,

with $\epsilon_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$.

In this article, we shall study multi-bandwidth Gaussian processes of the form $\{W^{\mathbf{a}}_t = W_{\mathbf{a} \cdot t} : t \in [0,1]^d\}$ for a vector of rescalings (or inverse bandwidths) $\mathbf{a} = (a_1, \ldots, a_d)^{\mathrm{T}}$ with $a_j > 0$ for all $j = 1, \ldots, d$. We consider the two function classes defined in Section 2:

(i) the Hölder class of functions $C^{\boldsymbol{\alpha}}[0,1]^d$ with anisotropic smoothness ($\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$);
(ii) the Hölder class of functions $C^{\alpha}[0,1]^I$ with isotropic smoothness that can possibly depend on fewer dimensions ($\alpha > 0$ and $I \subset \{1, \ldots, d\}$).

For a continuous function in the support of a Gaussian process, the probability assigned to a sup-norm neighborhood of the function is controlled by the centered small ball probability and by how well the function can be approximated from the RKHS of the process (Section 5 of van der Vaart and van Zanten (2008b)). With the target class of functions as in (i) or (ii), a single scaling seems inadequate, and it is intuitively appealing to introduce multiple bandwidth parameters to enlarge the RKHS and facilitate improved approximation from the RKHS.

The main technical challenge for adaptation in our setting is to find a joint prior on $\mathbf{a}$ and devise sets $B_n$ (henceforth called sieves) so that (3.3)-(3.5) are satisfied with $w_0$ in the above function classes (i)-(ii) and $\epsilon_n$ the optimal rate of convergence for the same. To that end, we propose a novel class of joint priors on the rescaling vector $\mathbf{a}$ that leads to adaptation over function classes (i) and (ii) in Sections 3.1 and 3.2, respectively.

Connections between the two prior choices are discussed, and a unified framework is prescribed for the function class $\{C^{\boldsymbol{\alpha}}[0,1]^I : \boldsymbol{\alpha} \in \mathbb{R}_{>0}^d,\ I \subset \{1, \ldots, d\}\}$ combining (i) and (ii). The construction of the sieves $B_n$ is laid out in Section 5. With such $B_n$, one can use standard results to establish adaptive minimax rates of convergence in various statistical settings; refer to the discussion following Theorem 3.1 in van der Vaart and van Zanten (2009). Some such specific applications are described in Section 3.4.

3.1. Adaptive estimation of anisotropic functions. Let $\mathbf{A} = (A_1, \ldots, A_d)^{\mathrm{T}}$ be a random vector in $\mathbb{R}^d$ with each $A_j$ a non-negative random variable stochastically independent of $W$. We can then define a scaled process $W^{\mathbf{A}} = \{W_{\mathbf{A} \cdot t} : t \in [0,1]^d\}$, to be interpreted as a Borel measurable map into $C[0,1]^d$ equipped with the sup-norm. The basic idea here is to scale the different dimensions by different amounts, so that the resulting process becomes suitable for approximating functions having different smoothness along the different coordinate axes.

We shall define a joint distribution on $\mathbf{A}$ induced through the following hierarchical specification. Let $\Theta = (\Theta_1, \ldots, \Theta_d)$ denote a random vector with a density supported on the simplex $\mathcal{S}^{d-1}$.

(PA1) Draw $\Theta \sim \mathrm{Dir}(\beta_1, \ldots, \beta_d)$ for some $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)$.
(PA2) Given $\Theta = \theta$, draw $A_j^{1/\theta_j} \sim g$ independently, where $g$ is a density on the positive real line satisfying

(3.6)  $C_1 x^{p} \exp(-D_1 x \log^q x) \le g(x) \le C_2 x^{p} \exp(-D_2 x \log^q x)$

for positive constants $C_1, C_2, D_1, D_2$, non-negative constants $p, q$, and every sufficiently large $x > 0$. In particular, $g$ corresponds to a $\mathrm{gamma}(b_1, b_2)$ distribution if $b_1 = p + 1$, $D_1 = D_2 = b_2$, $q = 0$ and $C_1 = C_2$ in (3.6). For notational simplicity, we shall assume $g$ to be $\mathrm{gamma}(b_1, b_2)$ from now on, noting that the main results would all hold for the general form of $g$ above. Let $\pi_{\mathbf{A}}$ denote the joint prior on $\mathbf{A}$ induced through (PA1)-(PA2), so that $\pi_{\mathbf{A}}(\mathbf{a}) = \int \prod_{j=1}^{d} \pi(a_j \mid \theta_j) \, d\pi(\theta)$. We now state our main theorem for the anisotropic smoothness class in (i), with a detailed proof provided in Section 5; a sketch of sampling from this prior is given below.
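The hierarchical prior (PA1)-(PA2) is straightforward to simulate. The sketch below (our own illustration, not code from the paper) draws the rescaling vector with $g$ taken to be the $\mathrm{gamma}(b_1, b_2)$ special case.

```python
import numpy as np

def sample_A_anisotropic(d, beta, b1, b2, rng):
    """One draw of the rescaling vector A from (PA1)-(PA2).

    (PA1): Theta ~ Dirichlet(beta_1, ..., beta_d) on the simplex S^{d-1}.
    (PA2): given Theta = theta, A_j^{1/theta_j} ~ gamma(b1, b2), i.e.
           A_j = G_j^{theta_j} with G_j ~ gamma(b1, rate b2) independent.
    """
    theta = rng.dirichlet(beta)
    G = rng.gamma(shape=b1, scale=1.0 / b2, size=d)  # NumPy's scale = 1/rate
    return G ** theta, theta

rng = np.random.default_rng(1)
A, theta = sample_A_anisotropic(d=3, beta=np.ones(3), b1=1.0, b2=1.0, rng=rng)
# Smaller theta_j => lighter tails for A_j (less stretching of axis j);
# larger theta_j => heavier tails, suited to less smooth directions.
```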

Theorem 3.1. Let $W$ be a centered homogeneous Gaussian random field on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1), and let $W^{\mathbf{A}}$ denote the multi-bandwidth process with $\mathbf{A} \sim \pi_{\mathbf{A}}$ as in (PA1)-(PA2). Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive numbers and $\alpha_0 = (\sum_{i=1}^{d} \alpha_i^{-1})^{-1}$. Suppose $w_0$ belongs to the anisotropic Hölder space $C^{\boldsymbol{\alpha}}[0,1]^d$. Then for every constant $C > 1$, there exist Borel measurable subsets $B_n$ of $C[0,1]^d$ and a constant $D > 0$ such that, for every sufficiently large $n$, the conditions (3.3)-(3.5) are satisfied by $W^{\mathbf{A}}$ with $\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha_0/(2\alpha_0+1)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$.

3.2. Adaptive dimension reduction. We next consider the smoothness class in (ii), namely $C^{\alpha}[0,1]^I$ for $I \subset \{1, \ldots, d\}$ and $\alpha > 0$. If the true function has isotropic smoothness in the dimensions it depends on, it is intuitively clear that one does not need a separate scaling for each of the dimensions. Indeed, had we known the true coordinates $I \subset \{1, \ldots, d\}$, we could have scaled only the dimensions in $I$ by a positive random variable $A$, and a slight modification of the results in van der Vaart and van Zanten (2009) would imply that a gamma prior on $A^{|I|}$ leads to adaptation. With that motivation, consider a joint prior $\pi_{\mathbf{A}}$ on $\mathbf{A}$ induced through the following hierarchical scheme (a simulation sketch follows Theorem 3.2 below):

(PD1) draw $d'$ uniformly on $\{0, \ldots, d\}$;
(PD2) given $d'$, draw a subset $S \subset \{1, \ldots, d\}$ with $|S| = d'$ uniformly from all subsets of size $d'$;
(PD3) generate a pair of random variables $(A, B)$ with $A^{d'} \sim \mathrm{gamma}(b_1, b_2)$ and $B$ drawn from a fixed compactly supported distribution;
(PD4) set $A_j = A$ for $j \in S$ and $A_j = B$ for $j \notin S$.

In particular, one can fix $B$ to be any constant $c > 0$ in (PD3), which corresponds to the $\delta_c(\cdot)$ distribution. We next state our main result on adaptive dimension reduction. The proof of the following Theorem 3.2 has elements in common with the proof of Theorem 3.1, and hence only a sketch of the proof is provided in Section 5.

Theorem 3.2. Let $W$ be a centered homogeneous Gaussian random field on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1), and let $W^{\mathbf{A}}$ denote the multi-bandwidth process with $\mathbf{A} \sim \pi_{\mathbf{A}}$ as in (PD1)-(PD4). Suppose $w_0$ belongs to the Hölder space $C^{\alpha}[0,1]^I$ for some subset $I$ of $\{1, \ldots, d\}$ and $\alpha > 0$. Then for every constant $C > 1$, there exist Borel measurable subsets $B_n$ of $C[0,1]^d$ and a constant $D > 0$ such that, for every sufficiently large $n$, the conditions (3.3)-(3.5) are satisfied by $W^{\mathbf{A}}$ with $\epsilon_n = n^{-\alpha/(2\alpha+d_0)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d_0)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$ and $d_0 = |I|$.
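The dimension-reduction prior (PD1)-(PD4) can be simulated as follows (our own sketch, fixing $B$ at the point mass $\delta_c$ as the text allows).

```python
import numpy as np

def sample_A_dimension_reduction(d, b1, b2, c, rng):
    """One draw of A from (PD1)-(PD4), fixing B = c (a point mass).

    (PD1): d' uniform on {0, ..., d};
    (PD2): S a uniformly chosen subset of {1, ..., d} of size d';
    (PD3): A^{d'} ~ gamma(b1, b2), so A = G^{1/d'} with G ~ gamma;
    (PD4): A_j = A on S, A_j = c off S.
    """
    d_prime = rng.integers(0, d + 1)
    S = rng.choice(d, size=d_prime, replace=False)
    A_vec = np.full(d, float(c))
    if d_prime > 0:
        G = rng.gamma(shape=b1, scale=1.0 / b2)
        A_vec[S] = G ** (1.0 / d_prime)
    return A_vec, set(S.tolist())

rng = np.random.default_rng(2)
A_vec, S = sample_A_dimension_reduction(d=5, b1=1.0, b2=1.0, c=1.0, rng=rng)
```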

3.3. Connections between cases (i) and (ii). The joint distributions on $\mathbf{A}$ specified in (PA1)-(PA2) and (PD1)-(PD4) are closely connected. To begin with, note that if we set $A_j = A$ and $\theta_j = 1/d$ for $j = 1, \ldots, d$ in (PA1)-(PA2), one reduces to a gamma prior on $A^d$: the optimal prior choice in the isotropic case (van der Vaart and van Zanten, 2009).

In the anisotropic case, our proposed prior can be motivated as follows. Recall that the purpose of rescaling is to traverse the sample paths of an infinitely smooth stochastic process over a larger domain, making the process more suitable for less smooth functions. If the true function has anisotropic smoothness, then we would like to stretch more in those directions where the function is less smooth. For smaller $\theta_j$'s, the marginal distribution of $a_j$ has lighter tails compared to larger values of $\theta_j$. We would thus like $\theta_j$ to assume smaller values in the directions $j$ where the function is smoother, and larger values in the less smooth directions. Without further constraints on $\theta$, it is not possible to separate the scale of $\mathbf{A}$ from $\theta$. This motivates us to constrain $\theta$ to the simplex, which serves as a weak identifiability condition.

In the limit as $\theta_j \to 0$, the distribution of $a_j$ converges to a point mass, since $a_j = G^{\theta_j} \to 1$ for $G \sim g$. Accordingly, if the true function does not depend on a set of $(d - d')$ dimensions, we would set $\theta_j = 0$ for those dimensions and choose the remaining $\theta_j$'s from a $(d'-1)$-dimensional simplex. In particular, if the function has isotropic smoothness in the remaining $d'$ coordinates, one can simply choose $\theta_j = 1/d'$ for those dimensions. This reduces to our prior choice in (PD1)-(PD4).

The dimensionality reduction in Section 3.2 deals with finitely many models and could alternatively be studied as a model selection problem (Ghosal, Lember and van der Vaart, 2008). However, the connection established above allows us to treat anisotropy and dimension reduction under a single framework, with the dimension reduction paradigm recognized as a limiting case of the anisotropic framework. We exploit this connection to propose a unified framework for adaptively estimating functions which possibly depend on fewer coordinates and have anisotropic smoothness in the remaining ones, i.e., functions in $C^{\boldsymbol{\alpha}}[0,1]^I$ for $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$ and $I \subset \{1, \ldots, d\}$. In particular, we continue to consider rescaled Gaussian processes $W^{\mathbf{A}}$ with the following prior on $\mathbf{A}$ (a simulation sketch follows the list):

(P1) draw $d'$ uniformly on $\{0, \ldots, d\}$;
(P2) given $d'$, draw a subset $S \subset \{1, \ldots, d\}$ with $|S| = d'$ uniformly from all subsets of size $d'$;
(P3) draw $\Theta = (\theta_1, \ldots, \theta_{d'}) \in \mathcal{S}^{d'-1}$ from a $\mathrm{Dir}(\beta_1, \ldots, \beta_{d'})$ distribution;
(P4) given $S$ and $\Theta = \theta$, draw $A_j^{1/\theta_j} \sim \mathrm{gamma}(b_1, b_2)$ independently for $j \in S$, and fix $A_j = c$ for some $c > 0$ for $j \notin S$.
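The unified prior (P1)-(P4) combines the two earlier samplers; here is a minimal sketch (ours, with a common Dirichlet parameter chosen for simplicity).

```python
import numpy as np

def sample_A_unified(d, beta0, b1, b2, c, rng):
    """One draw of A from the unified prior (P1)-(P4).

    Combines subset selection (P1)-(P2) with the Dirichlet-scaled gamma
    draws of (PA1)-(PA2) on the selected coordinates (P3)-(P4).
    beta0 : common Dirichlet parameter (our simplifying choice).
    """
    d_prime = rng.integers(0, d + 1)                     # (P1)
    S = rng.choice(d, size=d_prime, replace=False)       # (P2)
    A_vec = np.full(d, float(c))                         # A_j = c off S  (P4)
    if d_prime > 0:
        theta = rng.dirichlet(np.full(d_prime, beta0))   # (P3)
        G = rng.gamma(shape=b1, scale=1.0 / b2, size=d_prime)
        A_vec[S] = G ** theta                            # A_j^{1/theta_j} ~ gamma  (P4)
    return A_vec

rng = np.random.default_rng(3)
print(sample_A_unified(d=4, beta0=1.0, b1=1.0, b2=1.0, c=1.0, rng=rng))
```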

A unification of Theorems 3.1 and 3.2 is provided in the following Theorem 3.3; the proof is omitted as it is similar to those of the previous theorems.

Theorem 3.3. Consider $W^{\mathbf{A}}$ with $\mathbf{A} \sim \pi_{\mathbf{A}}$ as in (P1)-(P4). Suppose $w_0$ belongs to $C^{\boldsymbol{\alpha}}[0,1]^I$ for some subset $I$ of $\{1, \ldots, d\}$ and $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$. Let $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$. Then for every constant $C > 1$, the conclusions of Theorem 3.1 are satisfied by $W^{\mathbf{A}}$ with $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$.

In Theorem 3.3, the exponents of the logarithmic terms increase linearly with the dimension and decrease with $\alpha_{0I}$. In particular, $\kappa_1$ and $\kappa_2$ can be estimated by $\frac{d+1}{\alpha_{0I}+2}$ and $\frac{(\alpha_{0I}+3)(d+1)}{2(\alpha_{0I}+2)}$, respectively.

3.4. Rates of convergence in specific settings. Theorem 3.3 is in the same spirit as Theorem 3.1 of van der Vaart and van Zanten (2009) (see also Theorem 2.2 of de Jonge and van Zanten (2010)) and can be used to derive rates of posterior contraction in a variety of statistical problems involving Gaussian random fields. We shall consider a couple of specific problems, with the message that similar results can be obtained for a large class of problems.

We first consider a regression problem where, given independent response variables $y_i$ and covariates $x_i \in [0,1]^d$, the responses are modeled as random perturbations around a smooth regression surface, i.e.,

(3.7)  $y_i = \mu(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2)$.

As motivated before, the true regression surface $\mu_0$ might depend only on a subset of variables $I$ and have anisotropic smoothness in those variables. Accordingly, we assume $\mu_0 \in C^{\boldsymbol{\alpha}}[0,1]^I$ for some $I$ and $\boldsymbol{\alpha}$. Also, assume the true value $\sigma_0$ of $\sigma$ lies in an interval $[a, b] \subset [0, \infty)$. We use the law of $W^{\mathbf{A}}$ as a prior on $\mu$, with the prior on $\mathbf{A}$ as in (P1)-(P4), and we assume a prior on $\sigma$ supported on $[a, b]$. Denote the posterior distribution by $\Pi(\cdot \mid y_1, \ldots, y_n)$. Let $\|\mu\|_n^2 = n^{-1} \sum_{i=1}^{n} \mu^2(x_i)$ denote the $L_2$ norm corresponding to the empirical distribution of the design points. The posterior is said to contract at a rate $\epsilon_n$ if, for every sufficiently large $M$,

(3.8)  $\mathrm{E}_{\mu_0, \sigma_0} \Pi\big[ (\mu, \sigma) : \|\mu - \mu_0\|_n + |\sigma - \sigma_0| > M \epsilon_n \mid y_1, \ldots, y_n \big] \to 0$.
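Conditionally on $\mathbf{A} = \mathbf{a}$ and $\sigma$, model (3.7) with the $W^{\mathbf{A}}$ prior is ordinary Gaussian process regression. As a minimal sketch (ours, not the paper's implementation; a full posterior would also mix over $\mathbf{a}$ and $\sigma$, e.g. by MCMC), the following computes the conditional posterior mean of $\mu$ under a squared exponential kernel with fixed rescalings $\mathbf{a}$.

```python
import numpy as np

def gp_posterior_mean(X, y, Xstar, a, sigma2):
    """Posterior mean of mu at Xstar in model (3.7), for fixed rescalings a.

    Prior: mu ~ GP(0, c) with c(s, t) = exp(-sum_j a_j (s_j - t_j)^2).
    Standard GP regression: E[mu(x*) | y] = k* (K + sigma2 I)^{-1} y.
    """
    def kern(S, T):
        diff2 = (S[:, None, :] - T[None, :, :]) ** 2
        return np.exp(-np.tensordot(diff2, a, axes=([2], [0])))

    K = kern(X, X) + sigma2 * np.eye(len(X))
    return kern(Xstar, X) @ np.linalg.solve(K, y)

rng = np.random.default_rng(4)
X = rng.uniform(size=(100, 2))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(100)   # depends on x_1 only
mu_hat = gp_posterior_mean(X, y, X, a=np.array([30.0, 0.5]), sigma2=0.01)
```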

Theorem 3.4. Consider the nonparametric regression model (3.7) with the law of $W^{\mathbf{A}}$ as a prior on $\mu$ and the prior on $\mathbf{A}$ as in (P1)-(P4). Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive numbers and $I$ a subset of $\{1, \ldots, d\}$. If $\mu_0 \in C^{\boldsymbol{\alpha}}[0,1]^I$, then (3.8) holds with $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} \log^{\kappa} n$, where $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$ and $\kappa > 0$ is a positive constant.

Thus, one obtains the minimax optimal rate up to a log factor, adapting to the unknown dimensionality and anisotropic smoothness.

Remark 3.5. Consider the random design setting, i.e., when $x_i \sim Q$ i.i.d., where $Q$ is a probability measure on $[0,1]^d$. Then the conclusion of Theorem 3.4 holds with $\|\mu\|_n$ replaced by $\|\mu\|_{2,Q}$, where $\|\mu\|_{2,Q}^2 = \int_{[0,1]^d} \mu^2(t) q(t) \, dt$ and $q$ is the density of $Q$ with respect to the Lebesgue measure on $[0,1]^d$.

Remark 3.6. Theorem 3.4 guarantees adaptive estimation of the regression surface. It is often of interest additionally to select the important variables affecting the response $y$. In the context of variable selection in a normal linear model, Barbieri and Berger (2004) advocated using the median probability model, consisting of the variables with posterior marginal inclusion probability greater than or equal to one half, and proved the predictive optimality of such models. The same approach could be followed here for variable selection; however, the issue of optimality needs to be studied.

A similar adaptation result holds for density estimation using the logistic Gaussian process, where an unknown density $g$ on the hypercube $[0,1]^d$ is modeled as

(3.9)  $g(t) = \frac{e^{\mu(t)}}{\int_{[0,1]^d} e^{\mu(s)} \, ds}$

for a function $\mu : \mathbb{R}^d \to \mathbb{R}$. Suppose $X_1, \ldots, X_n$ are drawn i.i.d. from a continuous, everywhere positive density $g_0$ on $[0,1]^d$.

Theorem 3.7. Suppose one uses the law of $W^{\mathbf{A}}$ as a prior on $\mu$ in (3.9), with the prior on $\mathbf{A}$ as in (P1)-(P4). Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive numbers and $I$ a subset of $\{1, \ldots, d\}$. If $\mu_0 = \log g_0 \in C^{\boldsymbol{\alpha}}[0,1]^I$, then the posterior contracts at the rate $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} \log^{\kappa} n$ with respect to the Hellinger distance, where $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$.

The proofs of Theorems 3.4 and 3.7 follow in a straightforward manner from our main results in Theorems 3.1 and 3.2. We do not provide a proof here, since the steps are very similar to those in Section 3 of van der Vaart and van Zanten (2008a).
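The logistic transformation (3.9) is easy to implement on a grid. A minimal numerical sketch (ours, not from the paper) follows; the max-subtraction is a standard stabilization trick.

```python
import numpy as np

def logistic_gp_density(mu_vals, cell_volume):
    """Normalize exp(mu) on a grid: a discrete version of (3.9).

    mu_vals : values of a function mu on a regular grid over [0,1]^d;
    cell_volume : volume of one grid cell (for the Riemann-sum integral).
    """
    w = np.exp(mu_vals - mu_vals.max())        # stabilized exponentiation
    return w / (w.sum() * cell_volume)

# Example on [0,1]: a density proportional to exp(sin(2*pi*t)).
t = np.linspace(0, 1, 1001)
g = logistic_gp_density(np.sin(2 * np.pi * t), cell_volume=t[1] - t[0])
assert abs(np.sum(g) * (t[1] - t[0]) - 1.0) < 1e-12   # integrates to one
```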

3.5. Lower bounds on posterior contraction rates. We now demonstrate, in the regression function estimation setting (3.7), that a Gaussian process prior with a single bandwidth leads to a sub-optimal rate of convergence when the true regression function $\mu_0$ has anisotropic smoothness. For simplicity of exposition, we will not consider estimating $\sigma$ here; it is assumed to be fixed and known. Assume $x_i \sim Q$ i.i.d., where $Q$ is a distribution on $[0,1]^d$ admitting a density $q$ with respect to the Lebesgue measure on $[0,1]^d$. Consider the following assumption on the true regression function $\mu_0$.

(LBT) Let $I = \{1, 2, \ldots, d-1\}$ with $d \ge 5$. Assume $\mu_0(t) = \tilde{\mu}_0(t_I) \eta(t_d)$, where $\tilde{\mu}_0 \in C^{\alpha}[0,1]^{d-1}$ has support $[v, w]^{d-1}$ for $0 < v < w < 1$ and $\eta : [0,1] \to \mathbb{R}$ is infinitely differentiable with support $[v, w]$. Assume further that there exists $0 < u < 1$ such that, for any $\|\lambda_I\| > 1$ and large $|\lambda_d|$, the Fourier transform of $\mu_0$ satisfies

(3.10)  $|\hat{\mu}_0(\lambda)| \ge K \|\lambda_I\|^{-k} \exp\{-|\lambda_d|^u\}$

for $k = \alpha + \delta + 1$ with $\delta \in (0, 1)$ and $K > 0$. An explicit construction of such a function $\mu_0$ is provided in Remark 3.8 below.

Remark 3.8. Let $\tilde{\mu}_0$ be a function as in the statement of Theorem 8 in van der Vaart and van Zanten (2011). Let $\eta : [0,1] \to \mathbb{R}$ be given by $\eta(x) = \Psi\{2(x - v)/(w - v) - 1\}$, where $\Psi(t) = \exp\{-1/(1 - t^2)\} \chi_{\{|t| < 1\}}$. $\Psi$ is an example of a $C^{\infty}$ bump function with support $[-1, 1]$ (see, for example, Section 13 of Tu (2010)), and $\eta$ shifts and scales $\Psi$ to have support $[v, w]$. From Johnson (2007), $\hat{\Psi}(\lambda)$ decays like $\exp(-C\sqrt{|\lambda|})$ for large $|\lambda|$. With these choices, if we set $\mu_0(t) = \tilde{\mu}_0(t_I) \eta(t_d)$, then (3.10) is clearly satisfied.

In the following, we shall consider a rescaled Gaussian process $W^A$ for a positive random variable $A$ stochastically independent of $W$. Consider

(LBP) $W^A$ as the prior for the regression function $\mu$ in (3.7), with a prior on $A$ specified by $A^d \sim h$, where $h$ is a $\mathrm{gamma}(b_1, b_2)$ density.

Recall that (LBP) results in the minimax rate of contraction adaptively over $\mu_0$ in the isotropic $\alpha$-Hölder class of $d$ variables, for any $\alpha > 0$. In the following Theorem 3.9, we show that this specification involving a single bandwidth leads to a sub-optimal rate with respect to the $L_2$ norm corresponding to the covariate distribution, $\|w\|_2^2 = \int_{[0,1]^d} w^2(x) q(x) \, dx$, for a true regression function $\mu_0$ satisfying (LBT).
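The bump function $\Psi$ and its shifted, rescaled version $\eta$ from Remark 3.8 can be written out directly; the sketch below (ours) evaluates them and verifies the support $[v, w]$.

```python
import numpy as np

def bump(t):
    """C-infinity bump Psi(t) = exp(-1/(1 - t^2)) on |t| < 1, zero outside."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

def eta(x, v, w):
    """Shifted/rescaled bump with support [v, w], as in Remark 3.8."""
    return bump(2.0 * (np.asarray(x) - v) / (w - v) - 1.0)

x = np.linspace(0, 1, 1001)
vals = eta(x, v=0.2, w=0.8)
assert vals[x <= 0.2].max() == 0.0 and vals[x >= 0.8].max() == 0.0
```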

Theorem 3.9. Consider (3.7) with (LBP) as the prior, and assume the true regression function $\mu_0$ satisfies (LBT) for some appropriately chosen $\alpha > 0$. Then

(3.11)  $\Pi\big( \|\mu - \mu_0\|_2 \le n^{-\alpha/(2\alpha+d)} \log^{t_0} n \mid Y_1, \ldots, Y_n, x_1, \ldots, x_n \big) \to 0$ a.s.

as $n \to \infty$, for some constant $t_0 > 0$.

Since $\tilde{\mu}_0 \in C^{\alpha}[0,1]^{d-1}$ and $\eta$ is infinitely differentiable, the minimax rate for estimating $\mu_0$ is faster than $n^{-\alpha/(2\alpha+d)} \log^{t_0} n$ by a power of $n$. Hence Theorem 3.9 shows that a single bandwidth leads, in this case, to a sub-optimal rate of posterior convergence.

4. Auxiliary results. In this section, we present a number of auxiliary results that are crucially used to prove the main results.

4.1. Properties of the multi-bandwidth Gaussian process. We first summarize some properties of the RKHS of the scaled process $W^{\mathbf{a}}$ for a fixed vector of scales $\mathbf{a}$. Lemmata 4.1-4.4 generalize the results in Section 4 of van der Vaart and van Zanten (2009) from a single scaling to a vector of scales; we briefly sketch the proofs, emphasizing the places where we differ substantially from van der Vaart and van Zanten (2009).

Assume that the spectral measure $\nu$ of $W$ has a spectral density $f$. For $\mathbf{a} \in \mathbb{R}_{>0}^d$, the rescaled process $W^{\mathbf{a}}$ has spectral measure $\nu_{\mathbf{a}}$ given by $\nu_{\mathbf{a}}(B) = \nu(B./\mathbf{a})$. Further, $\nu_{\mathbf{a}}$ admits a spectral density $f_{\mathbf{a}}$, with $f_{\mathbf{a}}(\lambda) = (\prod \mathbf{a})^{-1} f(\lambda./\mathbf{a})$. For $w_0 \in C[0,1]^d$, define $\phi^{\mathbf{a}}_{w_0}(\epsilon)$ to be the concentration function of the rescaled Gaussian process $W^{\mathbf{a}}$. As a straightforward extension of Lemmas 4.1 and 4.2 in van der Vaart and van Zanten (2009), the RKHS of the process $W^{\mathbf{a}}$ can be characterized as below.

Lemma 4.1. The RKHS $\mathbb{H}^{\mathbf{a}}$ of the process $\{W^{\mathbf{a}}_t : t \in [0,1]^d\}$ consists of the real parts of the functions

$t \mapsto \int e^{i\langle\lambda, t\rangle} g(\lambda) \, \nu_{\mathbf{a}}(d\lambda)$,

where $g$ runs over the complex Hilbert space $L_2(\nu_{\mathbf{a}})$. Further, the RKHS norm of the element in the above display is given by $\|g\|_{L_2(\nu_{\mathbf{a}})}$.

Lemma 4.3 of van der Vaart and van Zanten (2009) shows that for any isotropic Hölder smooth function $w$, convolutions with an appropriately chosen class of higher-order kernels indexed by the scaling parameter $a$ belong to the RKHS. This suggests that by driving the bandwidth $1/a$ to zero, one can obtain improved approximations to any Hölder smooth function. The following Lemma 4.2 illustrates the usefulness of separate bandwidths for each dimension when approximating anisotropic Hölder functions from the RKHS.

Lemma 4.2. Assume $\nu$ has a density with respect to the Lebesgue measure which is bounded away from zero on a neighborhood of the origin. Let $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$ be given. Then, for any subset $I$ of $\{1, \ldots, d\}$ and $w \in C^{\boldsymbol{\alpha}}[0,1]^I$, there exist constants $C$ and $D$ depending only on $\nu$ and $w$ such that, for the $a_i$'s large enough,

$\inf\Big\{ \|h\|^2_{\mathbb{H}^{\mathbf{a}}} : \|h - w\|_{\infty} \le C \sum_{i \in I} a_i^{-\alpha_i} \Big\} \le D \prod \mathbf{a}$.

Proof. We shall prove the result for $w \in C^{\boldsymbol{\alpha}}[0,1]^d$ and sketch an argument for extending the proof to any $w \in C^{\boldsymbol{\alpha}}[0,1]^I$. Let $\psi_j$, $j = 1, \ldots, d$, be a set of higher-order kernels which satisfy $\int \psi_j(t_j) \, dt_j = 1$, $\int t_j^k \psi_j(t_j) \, dt_j = 0$ for any positive integer $k$, and $\int |t_j|^{\alpha_j} |\psi_j(t_j)| \, dt_j < \infty$. Define $\psi : \mathbb{R}^d \to \mathbb{C}$ by $\psi(t) = \psi_1(t_1) \cdots \psi_d(t_d)$, so that $\int_{\mathbb{R}^d} \psi(t) \, dt = 1$, $\int_{\mathbb{R}^d} t^{\mathbf{k}} \psi(t) \, dt = 0$ for any non-zero multi-index $\mathbf{k} = (k_1, \ldots, k_d)$, and the functions $\hat{\psi}/f$ and $\hat{\psi}^2/f$ are uniformly bounded, where $\hat{\psi}$ denotes the Fourier transform of $\psi$. For a vector of positive numbers $\mathbf{a} = (a_1, \ldots, a_d)$, let $\psi_{\mathbf{a}}(t) = (\prod \mathbf{a}) \, \psi(\mathbf{a} \cdot t)$, where $\prod \mathbf{a} = \prod_{j=1}^{d} a_j$. Proceeding as in Lemma 4.3 of van der Vaart and van Zanten (2009), one can show that the convolution $\psi_{\mathbf{a}} * w$ is contained in the RKHS $\mathbb{H}^{\mathbf{a}}$ and that the squared RKHS norm of $\psi_{\mathbf{a}} * w$ is bounded by $D \prod \mathbf{a}$, with $D$ depending only on $\nu$ and $w$. Thus, the proof of Lemma 4.2 will be complete if we can show that $\|\psi_{\mathbf{a}} * w - w\|_{\infty} \le C \sum_{j=1}^{d} a_j^{-\alpha_j}$. We have, for any $t \in \mathbb{R}^d$,

$\psi_{\mathbf{a}} * w(t) - w(t) = \int \psi(s) \{w(t - s./\mathbf{a}) - w(t)\} \, ds$.

For $1 \le j \le d-1$, let $u^{(j)}$ denote the vector in $\mathbb{R}^d$ with $u^{(j)}_i = 0$ for $i = 1, \ldots, j$ and $u^{(j)}_i = 1$ for $i = j+1, \ldots, d$. For any two vectors $x, y \in \mathbb{R}^d$, we can navigate from $x$ to $y$ in a piecewise linear fashion, traveling parallel to one coordinate axis at a time. The vertices of the path are given by $x^{(0)} = x$, $x^{(j)} = u^{(j)} \cdot x + (1 - u^{(j)}) \cdot y$ for $j = 1, \ldots, d-1$, and $x^{(d)} = y$. A multivariate Taylor expansion of $w(t - s./\mathbf{a})$ around $w(t)$ cannot take advantage of the anisotropic smoothness of $w$ across the different coordinate axes.

Letting $x = t$, $y = t - s./\mathbf{a}$ and $x^{(j)}$, $j = 0, 1, \ldots, d$, as above, let us instead write $w(y) - w(x)$ in the following telescoping form:

$w(y) - w(x) = \sum_{j=1}^{d} \big\{ w(x^{(j)}) - w(x^{(j-1)}) \big\} = \sum_{j=1}^{d} \big\{ w_j(t_j - s_j/a_j \mid x^{(j)}) - w_j(t_j \mid x^{(j)}) \big\}$,

where the functions $w_j$ are the univariate sections defined in Section 2, with $w_j(t \mid x) = w(x_1, \ldots, x_{j-1}, t, x_{j+1}, \ldots, x_d)$ for $t \in \mathbb{R}$ and $x \in \mathbb{R}^d$. Thus,

$w(t - s./\mathbf{a}) - w(t) = \sum_{j=1}^{d} \bigg[ \sum_{i=1}^{\lfloor \alpha_j \rfloor} D^i w_j(t_j \mid x^{(j)}) \frac{(-s_j/a_j)^i}{i!} + S_j(t_j, s_j/a_j) \bigg]$,

where $|S_j(t_j, s_j/a_j)| \le K |s_j|^{\alpha_j} a_j^{-\alpha_j}$ by (2.2), for a constant $K$ depending on $\nu$ and $w$ but not on $t$ and $s$. Combining the above, and using that the polynomial terms integrate to zero by the moment conditions on $\psi$, we have

$\int \psi(s) \{w(t - s./\mathbf{a}) - w(t)\} \, ds = \sum_{j=1}^{d} \int \psi_j(s_j) S_j(t_j, s_j/a_j) \, ds_j \le C \sum_{j=1}^{d} a_j^{-\alpha_j}$.

If $w \in C^{\boldsymbol{\alpha}}[0,1]^I$ for some subset $I$ of $\{1, \ldots, d\}$ with $|I| = d'$, so that $w(t) = w_0(t_I)$ for some $w_0 \in C^{\boldsymbol{\alpha}_I}[0,1]^{d'}$, then the conclusion follows trivially from the observation $\psi_{\mathbf{a}} * w = \psi_{\mathbf{a}_I} * w_0$.

We next study the centered small ball probability of the rescaled process and the metric entropy of the unit RKHS ball.

Lemma 4.3. For any positive $a_0$, there exist constants $C$ and $\epsilon_0 > 0$ such that, for $\underline{a} \ge a_0$ and $\epsilon < \epsilon_0$,

$-\log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big) \le C \prod \mathbf{a} \Big( \log \frac{\bar{a}}{\epsilon} \Big)^{d+1}$.

Proof. This follows from Theorem 2 in Kuelbs and Li (1993) and Lemma 4.6 in van der Vaart and van Zanten (2009). Proceeding as in Lemma 4.6 of van der Vaart and van Zanten (2009) and using Lemma 4.4 below, we obtain

(4.1)  $\phi^{\mathbf{a}}_0(\epsilon) + \log 0.5 \le K_1 \prod \mathbf{a} \Big( \log \frac{\phi^{\mathbf{a}}_0(\epsilon)}{\epsilon} \Big)^{1+d}$

for some constant $K_1 > 0$. Note that with $L = [0, a_1] \times \cdots \times [0, a_d]$,

$\phi^{\mathbf{a}}_0(\epsilon) = -\log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big) = -\log P\big( \sup_{t \in L} |W_t| \le \epsilon \big) \le -\log P\big( \sup_{t \in [0, \bar{a}]^d} |W_t| \le \epsilon \big) \le K_2 \Big( \frac{\bar{a}}{\epsilon} \Big)^{\tau}$

for some constants $K_2$ and $\tau > 0$, where the last inequality follows from the proof of Lemma 4.6 in van der Vaart and van Zanten (2009). Inserting this bound in (4.1), we obtain the desired result.

Let $\mathbb{H}^{\mathbf{a}}_1$ denote the unit ball in the RKHS of $W^{\mathbf{a}}$.

Lemma 4.4. There exists a constant $K$, depending only on $\nu$ and $d$, such that, for $\epsilon < 1/2$,

$\log N\big( \epsilon, \mathbb{H}^{\mathbf{a}}_1, \|\cdot\|_{\infty} \big) \le K \prod \mathbf{a} \Big( \log \frac{1}{\epsilon} \Big)^{d+1}$.

Proof. By Lemma 4.1, an element of $\mathbb{H}^{\mathbf{a}}_1$ can be written as the real part of the function $h : [0,1]^d \to \mathbb{C}$ given by

(4.2)  $h(t) = \int e^{i\langle\lambda, t\rangle} \psi(\lambda) \, \nu_{\mathbf{a}}(d\lambda)$

for $\psi : \mathbb{R}^d \to \mathbb{C}$ a function with $\int |\psi(\lambda)|^2 \, \nu_{\mathbf{a}}(d\lambda) \le 1$. For $z \in \mathbb{C}^d$, continue to denote by $h$ the function $z \mapsto \int e^{\langle\lambda, z\rangle} \psi(\lambda) \, \nu_{\mathbf{a}}(d\lambda)$. Using the Cauchy-Schwarz inequality and the change of variable theorem,

(4.3)  $|h(z)|^2 \le \int e^{\langle\lambda, 2\mathbf{a} \cdot \mathrm{Re}(z)\rangle} \, \nu(d\lambda)$,

where $\mathrm{Re}(z)$ denotes the vector whose $j$th element is the real part of $z_j$ for $j = 1, \ldots, d$, and $\mathbf{a} \cdot \mathrm{Re}(z) = (a_1 \mathrm{Re}(z_1), \ldots, a_d \mathrm{Re}(z_d))^{\mathrm{T}}$. From (4.3) and the dominated convergence theorem, any $h \in \mathbb{H}^{\mathbf{a}}_1$ can be analytically extended to $\Gamma = \{z \in \mathbb{C}^d : \|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2 < \delta\}$. Clearly, $\Gamma$ contains a strip $\Omega$ in $\mathbb{C}^d$ given by $\Omega = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j,\ j = 1, \ldots, d\}$ with $R_j = \delta/(6 a_j \sqrt{d})$. Also, for every $z \in \Omega$, $h$ satisfies the uniform bound $|h(z)|^2 \le \int e^{\delta\|\lambda\|} \, \nu(d\lambda) =: C^2$.

Let $R = (R_1, \ldots, R_d)^{\mathrm{T}}$. Partition $T = [0,1]^d$ into rectangles $\Gamma_1, \ldots, \Gamma_m$ with centers $\{t_1, t_2, \ldots, t_m\}$ such that, given any $z \in T$, there exists $\Gamma_j$ with center $t_j = (t_{j1}, \ldots, t_{jd})^{\mathrm{T}}$ satisfying $|z_i - t_{ji}| \le R_i/4$, $i = 1, \ldots, d$.

Consider the piecewise polynomials $P = \sum_{j=1}^{m} P_{j,\gamma_j} 1_{\Gamma_j}$ with $P_{j,\gamma_j}(t) = \sum_{\mathbf{n}_{\cdot} \le k} \gamma_{j,\mathbf{n}} (t - t_j)^{\mathbf{n}}$. A finite set of functions $\mathcal{P}_{\mathbf{a}}$ is obtained by discretizing the coefficients $\gamma_{j,\mathbf{n}}$, for each $j$ and $\mathbf{n}$, over a grid of mesh width $\epsilon/R^{\mathbf{n}}$ in the interval $[-C/R^{\mathbf{n}}, C/R^{\mathbf{n}}]$, with $R^{\mathbf{n}} = R_1^{n_1} \cdots R_d^{n_d}$ and $C$ defined as above. Choosing $m \asymp \prod_{j=1}^{d}(1/R_j)$ and $k$ such that $k \asymp \log(1/\epsilon)$ and $(2/3)^k \le K\epsilon$ for some constant $K$, the collection $\mathcal{P}_{\mathbf{a}}$ can be shown to form a $K\epsilon$-net for $\mathbb{H}^{\mathbf{a}}_1$ using (4.4) and (4.5):

(4.4)  $\sum_{\mathbf{n}_{\cdot} > k} \Big| \frac{D^{\mathbf{n}} h_{\psi}(t_i)}{\mathbf{n}!} (z - t_i)^{\mathbf{n}} \Big| \le C \sum_{\mathbf{n}_{\cdot} > k} \frac{(R/2)^{\mathbf{n}}}{R^{\mathbf{n}}} \le K C (2/3)^k$,

(4.5)  $\Big| \sum_{\mathbf{n}_{\cdot} \le k} \frac{D^{\mathbf{n}} h_{\psi}(t_i)}{\mathbf{n}!} (z - t_i)^{\mathbf{n}} - P_{i,\gamma_i}(z) \Big| \le K\epsilon$.

The details are similar to Lemma 4.5 in van der Vaart and van Zanten (2009) and hence omitted.

Lemma 4.7 of van der Vaart and van Zanten (2009) exploited a containment relation among unit RKHS balls with different scalings to construct the sieves $B_n$. Such a result sufficed in the single-bandwidth case by exploiting the total ordering of $\mathbb{R}_+$. However, the result can only be generalized with respect to the partial order on $\mathbb{R}_+^d$, and one does not obtain a straightforward generalization of their sieve construction in the multi-bandwidth case, since the entropy of their sieve blows up when one tries to control the joint probability of the rescaling vector $\mathbf{a}$ lying outside a hyper-rectangle in $\mathbb{R}_+^d$. This problem is fundamentally due to the curse of dimensionality, and one needs a more careful construction of the sieve to avoid it. In the proof of Lemma 4.4, a collection of piecewise polynomials is used to cover the unit RKHS ball $\mathbb{H}^{\mathbf{a}}_1$. The main idea in the next Lemma 4.5 is to exploit the fact that the same set of piecewise polynomials can also be used to cover $\mathbb{H}^{\mathbf{b}}_1$ for $\mathbf{b}$ sufficiently close to $\mathbf{a}$. We then come up with a careful choice of a compact subset $Q$ of $\mathbb{R}_+^d$ that balances the metric entropy of the collection of unit RKHS balls $\mathbb{H}^{\mathbf{a}}_1$ with $\mathbf{a} \in Q$ against the complement probability of $Q$ under the joint prior on $\mathbf{a}$.

For $u \in \mathbb{R}_+^d$, let $C_u$ denote the rectangle in the positive quadrant given by $\mathbf{a} \le u$, i.e., $0 \le a_j \le u_j$ for all $j = 1, \ldots, d$. For a fixed $r > 1$, let $Q^{(r)}$ consist of vectors $\mathbf{a}$ with $a_j \le r^{\theta_j}$, $j = 1, \ldots, d$, for some $\theta \in \mathcal{S}^{d-1}$.

It is easy to see that $Q^{(r)}$ is a union of rectangles $C_{r^{\theta}}$ with $\theta$ varying over $\mathcal{S}^{d-1}$:

$Q^{(r)} = \bigcup_{\theta \in \mathcal{S}^{d-1}} C_{r^{\theta}}$.

The outer boundary of $Q^{(r)}$ consists of points $\mathbf{a}$ with $0 < a_j \le r$ for all $j = 1, \ldots, d$ and $\prod \mathbf{a} = r$ (see Figure 1). By Lemma 4.4, for any such $\mathbf{a}$ on the outer boundary of $Q^{(r)}$, the metric entropy of $\mathbb{H}^{\mathbf{a}}_1$ is bounded by a constant multiple of $r \log^{d+1}(1/\epsilon)$. Our technique was motivated by the observation that the entropy remains of the same order even if one takes a union over the outer boundary of $Q^{(r)}$. We present a stronger result in Lemma 4.5, which states that the entropy remains of the same order even if the union is taken over all of $Q^{(r)}$.

Fig 1. Left panel: for $d = 2$ and fixed $r > 1$, the rectangles $C_{r^{\theta}} = \{0 \le \mathbf{a} \le r^{\theta}\}$ for different values of $\theta \in \mathcal{S}^{d-1}$. Right panel: the region $Q^{(r)}$ (shaded) resulting from the union of all such rectangles.
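Although not stated in the paper, the definition of $Q^{(r)}$ yields a simple membership test: $\mathbf{a} \in Q^{(r)}$ exactly when $\prod_j \max(a_j, 1) \le r$, since taking logarithms shows that a suitable $\theta$ on the simplex exists if and only if $\sum_j \max(\log a_j, 0) \le \log r$. The sketch below (ours) implements this observation and constructs an explicit witness $\theta$.

```python
import numpy as np

def in_Q(a, r):
    """Membership a in Q^(r): exists theta in the simplex with a_j <= r^{theta_j}.

    Taking logs, such a theta exists iff sum_j max(log a_j, 0) <= log r,
    i.e. prod_j max(a_j, 1) <= r.
    """
    return np.prod(np.maximum(a, 1.0)) <= r

def witness_theta(a, r):
    """Construct an explicit theta in S^{d-1} certifying a in Q^(r)."""
    t = np.maximum(np.log(a), 0.0) / np.log(r)   # minimal exponents needed
    t += (1.0 - t.sum()) / len(t)                # spread slack; only relaxes a_j <= r^theta_j
    return t

a, r = np.array([2.0, 0.5, 3.0]), 10.0
assert in_Q(a, r)                                # 2 * 1 * 3 = 6 <= 10
theta = witness_theta(a, r)
assert abs(theta.sum() - 1.0) < 1e-12 and np.all(a <= r ** theta)
```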

Lemma 4.5. Let $\nu$ satisfy (3.1) and fix $r \ge 1$. Then there exists a constant $K$, depending on $\nu$ and $d$ only, such that, for $\epsilon < 1/2$,

$\log N\Big( \epsilon, \bigcup_{\mathbf{a} \in Q^{(r)}} \mathbb{H}^{\mathbf{a}}_1, \|\cdot\|_{\infty} \Big) \le K r \Big( \log \frac{1}{\epsilon} \Big)^{d+1}$.

Proof. Let $Q = Q^{(r)}$ and fix $\mathbf{a} \in Q$. Let $I$ denote the subset of $\{1, \ldots, d\}$ such that $a_j \le 1$ for all $j \in I$ and $a_j > 1$ for all $j \notin I$. Let $\Omega_{\mathbf{a}} = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j,\ j = 1, \ldots, d\}$, with $R_j = \delta/(6 a_j \sqrt{d})$ if $j \notin I$ and $R_j = \delta/(6\sqrt{d})$ if $j \in I$. Note that for $z \in \Omega_{\mathbf{a}}$,

(4.6)  $\|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2^2 = \sum_{j=1}^{d} 4a_j^2 \mathrm{Re}(z_j)^2 \le \sum_{j \in I} 4R_j^2 + \sum_{j \notin I} 4a_j^2 R_j^2 \le \frac{\delta^2}{9}$.

Hence $\|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2 \le \delta/3$ for $z \in \Omega_{\mathbf{a}}$. Following the argument after the display in (4.3), it thus follows that any function $h \in \mathbb{H}^{\mathbf{a}}_1$ has an analytic extension to $\Omega_{\mathbf{a}}$.

Let $\mathbf{b} \in Q$ satisfy $\max_j |a_j - b_j| \le 1$. We shall show that any $h \in \mathbb{H}^{\mathbf{b}}_1$ can also be extended analytically to the same strip $\Omega_{\mathbf{a}}$ by showing that $\|2\mathbf{b} \cdot \mathrm{Re}(z)\|_2 < \delta$ on $\Omega_{\mathbf{a}}$. To that end, for $z \in \Omega_{\mathbf{a}}$, first observe that

(4.7)  $\|2\mathbf{b} \cdot \mathrm{Re}(z)\|_2 \le \|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2 + \|2(\mathbf{b} - \mathbf{a}) \cdot \mathrm{Re}(z)\|_2$.

The first term on the right-hand side is bounded by $\delta/3$ by (4.6). To tackle the second term, we use $|b_j - a_j| \le 1 \le a_j$ for $j \notin I$ to conclude that

(4.8)  $\|2(\mathbf{b} - \mathbf{a}) \cdot \mathrm{Re}(z)\|_2^2 = \sum_{j=1}^{d} 4(b_j - a_j)^2 \mathrm{Re}(z_j)^2 \le \sum_{j \in I} 4R_j^2 + \sum_{j \notin I} 4a_j^2 R_j^2 \le \frac{\delta^2}{9}$.

Combining (4.7) and (4.8), $\|2\mathbf{b} \cdot \mathrm{Re}(z)\|_2 \le 2\delta/3$ on $\Omega_{\mathbf{a}}$, proving our claim. Since the same tail estimate as in (4.4) works for any $h \in \mathbb{H}^{\mathbf{b}}_1$, it follows from (4.5) that the set of functions $\mathcal{P}_{\mathbf{a}}$ forms a $K\epsilon$-net for $\mathbb{H}^{\mathbf{b}}_1$. Let $\mathcal{A}$ be a set of points in $Q$ such that for any $\mathbf{b} \in Q$ there exists $\mathbf{a} \in \mathcal{A}$ with $\max_j |a_j - b_j| \le 1$; one can clearly find such an $\mathcal{A}$ with $|\mathcal{A}| \lesssim r^d$. The proof is completed by observing that $\bigcup_{\mathbf{a} \in \mathcal{A}} \mathcal{P}_{\mathbf{a}}$ forms a $K\epsilon$-net for $\bigcup_{\mathbf{a} \in Q} \mathbb{H}^{\mathbf{a}}_1$.

4.2. Results for the lower bound. We now prove Lemma 4.6, which enables us to derive a lower bound on the concentration function $\phi^a(\epsilon)$ for a fixed $a > 0$. This lower bound, coupled with the model identifiability of (3.7), yields a lower bound on the posterior concentration rate.

Denote by $\mathbb{H}^a$ the reproducing kernel Hilbert space of the Gaussian process $W^a$. The key to obtaining a lower bound on the concentration function $\phi^a_{\mu_0}(\epsilon)$ is to find a lower bound on $\inf_{h \in \mathbb{H}^a : \|h - \mu_0\|_2 \le \epsilon} \frac{1}{2}\|h\|^2_{\mathbb{H}^a}$ when $\mu_0$ has anisotropic smoothness as in (LBT). Let $h^a_{\psi}$ be the real part of the function $t \mapsto \int e^{i\langle\lambda, t\rangle} \psi(\lambda) f_a(\lambda) \, d\lambda$, where $f_a(\lambda)$ denotes the spectral density corresponding to the spectral measure $\nu_a$ of $W^a$, so that $f_a(\lambda) = a^{-d} f(\lambda/a)$ with $f(\lambda) = e^{-\|\lambda\|^2/4}/(2^d \pi^{d/2})$. From van der Vaart and van Zanten (2009), the RKHS $\mathbb{H}^a$ consists of the functions $h^a_{\psi}$ for $\psi \in L_2(\nu_a)$. For given $\epsilon > 0$, let $\psi \in L_2(\nu_a)$ be such that $\|h^a_{\psi} - \mu_0\|_2 < \epsilon$.

Lemma 4.6. If $\mu_0$ satisfies (LBT) and $a < \epsilon^{-1/\alpha}$, then

(4.9)  $\|h^a_{\psi}\|_{\mathbb{H}^a} \gtrsim a^{d/2} e^{c(1/\epsilon)^{\min(\delta_1, \delta_2)}}$,

where $\delta_1 = \frac{2}{k - 0.5(d-1)} - \frac{2}{\alpha}$ and $\delta_2 = \frac{1}{2\{k - 0.5(d-1)\}}$.

Proof. Using Proposition A.1 in Appendix A, let $r$ be a function that equals $1$ on the support of $\mu_0$, itself has support inside $[0,1]^d$, and whose Fourier transform satisfies $|\hat{r}(\lambda)| \lesssim e^{-K\sqrt{\|\lambda\|}}$ for large $\|\lambda\|$. Then $h^a_{\psi} r$ has support inside $[0,1]^d$ and $\mu_0 r = \mu_0$, so that

$\|h^a_{\psi} r - \mu_0\|_{2,\mathbb{R}} \le \|h^a_{\psi} - \mu_0\|_2$,

where $\|\cdot\|_{2,\mathbb{R}}$ is the norm of $L_2(\mathbb{R}^d)$ and $\|\cdot\|_2$ the norm of $L_2[0,1]^d$. The function $h^a_{\psi} r$ has Fourier transform $(\psi_1 f_a) * \hat{r}$, where $\psi_1(\lambda) = \psi(-\lambda)$. Hence, by Parseval's identity, $\|(\psi_1 f_a) * \hat{r} - \hat{\mu}_0\|_{2,\mathbb{R}} < \epsilon$. Letting $\chi_{K_1, K_2}$ denote the indicator of the set $\{\lambda \in \mathbb{R}^d : \|\lambda_I\| > K_1,\ \lambda_d \in [K_2, 2K_2]\}$, we have

$\|\{(\psi_1 f_a) * \hat{r}\}\, \chi_{2K_1, K_2}\|_{2,\mathbb{R}} \ge \|\hat{\mu}_0 \chi_{2K_1, K_2}\|_{2,\mathbb{R}} - \epsilon$.

Also, by (3.10),

$\|\hat{\mu}_0 \chi_{2K_1, K_2}\|^2_{2,\mathbb{R}} \ge C \Big( \frac{1}{K_1} \Big)^{2k - (d-1)} \exp\{-cK_2^u\}$.

Using Proposition B.1 in Appendix B with $2K_1$ instead of $K_1$, $A_1 = K_1$, $f = \psi_1 f_a$ and $g = \hat{r}$,

(4.10)  $\|\psi_1 f_a \chi_{K_1, K_2}\|_{2,\mathbb{R}} \int_{\|t_I\| \le K_1} |\hat{r}(t)| \, dt \ge C \Big( \frac{1}{K_1} \Big)^{k - 0.5(d-1)} e^{-cK_2^u} - \epsilon - \|\psi_1 f_a\|_{2,\mathbb{R}} \int_{\|t_I\| > K_1} |\hat{r}(t)| \, dt$.

Since $\|h^a_{\psi}\|_{\mathbb{H}^a} = \|\psi \sqrt{f_a}\|_{2,\mathbb{R}}$ and $f_a$ is symmetric about zero,

(4.11)  $\|\psi_1 f_a\|_{2,\mathbb{R}} \le C a^{-d/2} \|h^a_{\psi}\|_{\mathbb{H}^a}$.

Also,

(4.12)  $\|\psi_1 f_a \chi_{K_1, K_2}\|^2_{2,\mathbb{R}} = \int_{\|\lambda_I\| > K_1} (\psi_1^2 f_a) f_a \le \Big( \sup_{\|\lambda_I\| > K_1} f_a \Big) \|h^a_{\psi}\|^2_{\mathbb{H}^a} \le C a^{-d} e^{-K_1^2/4a^2} \|h^a_{\psi}\|^2_{\mathbb{H}^a}$.

Combining (4.10)-(4.12),

(4.13)  $C \Big( \frac{1}{K_1} \Big)^{k - 0.5(d-1)} e^{-cK_2^u} - \epsilon \le C a^{-d/2} \|h^a_{\psi}\|_{\mathbb{H}^a} \Big( e^{-K_1^2/8a^2} \int_{\|t_I\| \le K_1} |\hat{r}(t)| \, dt + \int_{\|t_I\| > K_1} |\hat{r}(t)| \, dt \Big)$.

Fix $K_2$ to be a large number. Choose $K_1$ large enough, depending on $\epsilon$, such that $C(1/K_1)^{k - 0.5(d-1)} e^{-cK_2^u} = 2\epsilon$, so that $K_1 = (\tilde{C}/\epsilon)^{1/(k - 0.5(d-1))}$ for some $\tilde{C} > 0$. Then

(4.14)  $\int_{\|t_I\| > K_1} |\hat{r}(t)| \, dt \lesssim e^{-K\sqrt{K_1}} \lesssim e^{-K(1/\epsilon)^{1/\{2(k - 0.5(d-1))\}}}$.

We have $k < \alpha + 0.5(d-1)$, since $k = \alpha + \delta + 1$ for $\delta \in (0,1)$ and $d \ge 5$, so that for $a < \epsilon^{-1/\alpha}$,

$\frac{K_1^2}{8a^2} > C \Big( \frac{1}{\epsilon} \Big)^{\frac{2}{k - 0.5(d-1)} - \frac{2}{\alpha}}$,

implying

(4.15)  $e^{-K_1^2/8a^2} \le e^{-C(1/\epsilon)^{\frac{2}{k - 0.5(d-1)} - \frac{2}{\alpha}}}$.

Now (4.13), (4.14) and (4.15) imply $\|h^a_{\psi}\|_{\mathbb{H}^a} \gtrsim a^{d/2} e^{c(1/\epsilon)^{\min(\delta_1, \delta_2)}}$ if $a < \epsilon^{-1/\alpha}$.

5. Proof of main results. We shall only provide a detailed proof of Theorem 3.1 and sketch the main steps in the proof of Theorem 3.2.

5.1. Proof of Theorem 3.1. Let us begin by observing that

$P\big( \|W^{\mathbf{A}} - w_0\|_{\infty} \le 2\epsilon \big) = \int P\big( \|W^{\mathbf{a}} - w_0\|_{\infty} \le 2\epsilon \big) \pi_{\mathbf{A}}(d\mathbf{a}) = \int \Big\{ \int P\big( \|W^{\mathbf{a}} - w_0\|_{\infty} \le 2\epsilon \big) \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$.

As in van der Vaart and van Zanten (2009), we first derive bounds on the non-centered small ball probability for a fixed rescaling $\mathbf{a}$, and then integrate over the distribution of $\mathbf{a}$ to derive the same for $W^{\mathbf{A}}$. Given $\mathbf{a} \in \mathbb{R}_{>0}^d$, recall the definitions of the centered and non-centered concentration functions of the process $W^{\mathbf{a}}$:

(5.1)  $\phi^{\mathbf{a}}_0(\epsilon) = -\log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big), \qquad \phi^{\mathbf{a}}_{w_0}(\epsilon) = \inf_{h \in \mathbb{H}^{\mathbf{a}} : \|h - w_0\|_{\infty} \le \epsilon} \frac{1}{2}\|h\|^2_{\mathbb{H}^{\mathbf{a}}} - \log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big)$.

For a fixed $\mathbf{a}$, the non-centered small ball probability of $W^{\mathbf{a}}$ can be bounded in terms of the concentration function as follows (van der Vaart and van Zanten, 2008b):

$P\big( \|W^{\mathbf{a}} - w_0\|_{\infty} \le 2\epsilon \big) \ge e^{-\phi^{\mathbf{a}}_{w_0}(\epsilon)}$.

Now, suppose that $w_0 \in C^{\boldsymbol{\alpha}}[0,1]^d$ for some $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$. From Lemmas 4.2 and 4.3, it follows that for every $a_0 > 0$, there exist positive constants $\epsilon_0 < 1/2$, $C$, $D$ and $E$ that depend only on $w_0$ and $\nu$ such that, for $\underline{a} > a_0$, $\epsilon < \epsilon_0$ and $C \sum_{i=1}^{d} a_i^{-\alpha_i} < \epsilon$,

$\phi^{\mathbf{a}}_{w_0}(\epsilon) \le D \prod \mathbf{a} + E \prod \mathbf{a} \Big( \log \frac{\bar{a}}{\epsilon} \Big)^{1+d} \le K_1 \prod \mathbf{a} \Big( \log \frac{\bar{a}}{\epsilon} \Big)^{1+d}$,

with $K_1$ depending only on $a_0$, $\nu$ and $d$. Thus, for $\epsilon < \min\{\epsilon_0, C_1 a_0^{-\bar{\alpha}}\}$, by (5.1), for constants $K_2, \ldots, K_6 > 0$ and $C_2, \ldots, C_6 > 0$,

$P\big( \|W^{\mathbf{A}} - w_0\|_{\infty} \le 2\epsilon \big) \ge \int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} e^{-\phi^{\mathbf{a}}_{w_0}(\epsilon)} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\ge \int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} e^{-K_1 \prod \mathbf{a} \log^{1+d}(\bar{a}/\epsilon)} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\ge C_2 e^{-K_2 (1/\epsilon)^{1/\alpha_0} \log^{1+d}(1/\epsilon)} \int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$.

Let $\Gamma$ denote the region in the simplex $\mathcal{S}^{d-1}$ given by $\Gamma = \{\theta \in \mathcal{S}^{d-1} : \tau < \theta_j - \alpha_0/\alpha_j < 2\tau,\ j = 1, \ldots, d-1\}$. Since $\sum_{j=1}^{d} \alpha_0/\alpha_j = 1$, we can choose $\tau > 0$ small enough to guarantee that any $\theta$ satisfying this set of inequalities lies inside the simplex. Moreover, with $\theta_d = 1 - \sum_{j=1}^{d-1} \theta_j$, one has $\alpha_0/\alpha_d - 2(d-1)\tau < \theta_d < \alpha_0/\alpha_d - (d-1)\tau$.

Choosing $\tau = C_3/\log(1/\epsilon)$, one can show that $\sum_{j=1}^{d} (1/\epsilon)^{1/(\alpha_j \theta_j)} \le C_4 (1/\epsilon)^{1/\alpha_0}$ for any $\theta \in \Gamma$. Now,

$\int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\gtrsim \int_{\theta \in \Gamma} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} e^{-K_3 \sum_{j=1}^{d} a_j^{1/\theta_j}} \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\ge \int_{\theta \in \Gamma} e^{-K_4 \sum_{j=1}^{d} (1/\epsilon)^{1/(\alpha_j \theta_j)}} \pi(\theta) \, d\theta \ge \int_{\theta \in \Gamma} e^{-K_4 C_4 (1/\epsilon)^{1/\alpha_0}} \pi(\theta) \, d\theta \ge C_5 e^{-K_5 (1/\epsilon)^{1/\alpha_0}}$.

The last inequality in the above display uses the fact that $\Gamma$ contains a hyper-cube of width $\tau$, so that its $\pi$-mass is at least polynomial in $\tau$. Hence,

(5.2)  $P\big( \|W^{\mathbf{A}} - w_0\|_{\infty} \le 2\epsilon \big) \ge C_6 e^{-K_6 (1/\epsilon)^{1/\alpha_0} \log^{1+d}(1/\epsilon)}$.

Let $B_1$ denote the unit sup-norm ball of $C[0,1]^d$. For a vector $\theta \in \mathcal{S}^{d-1}$ and positive constants $M$, $r > 1$ and $\epsilon$, let $B_{\theta} = B_{\theta}(M, r, \epsilon)$ denote the set

$B_{\theta} = \bigcup_{\mathbf{a} \le r^{\theta}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big) + \epsilon B_1$,

where $r^{\theta}$ denotes the vector whose $j$th element is $r^{\theta_j}$. We further let

$B = \bigcup_{\mathbf{a} \in Q^{(r)}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big) + \epsilon B_1 = \bigcup_{\theta \in \mathcal{S}^{d-1}} \bigcup_{\mathbf{a} \le r^{\theta}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big) + \epsilon B_1$.

Let us first calculate the probability $P(W^{\mathbf{A}} \notin B_{\theta} \mid \theta)$. Note that

$P(W^{\mathbf{A}} \notin B_{\theta} \mid \theta) = \int P(W^{\mathbf{a}} \notin B_{\theta}) \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \le \int_{\mathbf{a} \le r^{\theta}} P(W^{\mathbf{a}} \notin B_{\theta}) \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} + P(\mathbf{A} \not\le r^{\theta} \mid \theta)$,

where $P(\mathbf{A} \not\le r^{\theta} \mid \theta)$ is shorthand for the probability that at least one $A_j > r^{\theta_j}$, given $\theta$.

To tackle the first term in the last display, note that $B_{\theta}$ contains the set $M\mathbb{H}^{\mathbf{a}}_1 + \epsilon B_1$ for any $\mathbf{a} \le r^{\theta}$. Hence, for any $\mathbf{a} \le r^{\theta}$, by Borell's inequality,

$P(W^{\mathbf{a}} \notin B_{\theta}) \le P\big( W^{\mathbf{a}} \notin M\mathbb{H}^{\mathbf{a}}_1 + \epsilon B_1 \big) \le 1 - \Phi\big\{ M + \Phi^{-1}\big( e^{-\phi^{\mathbf{a}}_0(\epsilon)} \big) \big\} \le 1 - \Phi\big\{ M + \Phi^{-1}\big( e^{-\phi^{r^{\theta}}_0(\epsilon)} \big) \big\} \le e^{-\phi^{r^{\theta}}_0(\epsilon)}$,

if $M \ge -2\Phi^{-1}\big( e^{-\phi^{r^{\theta}}_0(\epsilon)} \big)$. The penultimate inequality in the above display follows from the fact that, with $T = [0,1]^d$,

$e^{-\phi^{\mathbf{a}}_0(\epsilon)} = P\big( \sup_{t \in \mathbf{a} \cdot T} |W_t| \le \epsilon \big) \ge P\big( \sup_{t \in r^{\theta} \cdot T} |W_t| \le \epsilon \big) = e^{-\phi^{r^{\theta}}_0(\epsilon)}$.

By Lemma 4.10 of van der Vaart and van Zanten (2009), $-\Phi^{-1}(u) \le \{2\log(1/u)\}^{1/2}$ for $u \in (0,1)$. Hence, the last inequality in the above display remains valid if we choose $M \ge 4\sqrt{\phi^{r^{\theta}}_0(\epsilon)}$.

Since $A_j^{1/\theta_j}$ follows a gamma distribution given $\theta_j$, in view of Lemma 4.9 of van der Vaart and van Zanten (2009), for $r$ larger than a positive constant depending only on the parameters of the gamma distribution,

$P(A_j > r^{\theta_j} \mid \theta) \le C_1 r^{D_1} e^{-D_2 r}$.

Combining the above, and since $B$ contains $B_{\theta}$ for every $\theta \in \mathcal{S}^{d-1}$,

(5.3)  $P(W^{\mathbf{A}} \notin B) = \int_{\theta} P(W^{\mathbf{A}} \notin B \mid \theta) \pi(\theta) \, d\theta \le \int_{\theta} P(W^{\mathbf{A}} \notin B_{\theta} \mid \theta) \pi(\theta) \, d\theta \le C_2 r^{D_1} e^{-D_2 r} + e^{-D_3 r \{\log(r/\epsilon)\}^{d+1}}$.

From Lemma 4.5, the entropy of $B$ can be estimated as

(5.4)  $\log N(2\epsilon, B, \|\cdot\|_{\infty}) \le \log N\Big( \epsilon, \bigcup_{\theta \in \mathcal{S}^{d-1}} \bigcup_{\mathbf{a} \le r^{\theta}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big), \|\cdot\|_{\infty} \Big) \le K r \Big( \log \frac{M}{\epsilon} \Big)^{d+1}$.

Thus (5.2), (5.3) and (5.4) can be simultaneously satisfied if we choose, for constants $\kappa, \kappa_1, \kappa_2 > 0$,

$\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)} \log^{\kappa}(n), \qquad r_n = n^{1/(2\alpha_0+1)} \log^{\kappa_1}(n), \qquad M_n = r_n \log^{\kappa_2}(n)$.

5.2. Proof of Theorem 3.2. For ease of notation, we shall make the simplifying assumption that the random variable $B$ is degenerate at $1$. For $a > 0$ and $S \subset \{1, \ldots, d\}$, let $\mathbb{H}^{a,S}$ denote the RKHS of $W^{\mathbf{a}}$, where $a_j = a$ for $j \in S$ and $a_j = 1$ for $j \notin S$. For a subset $S \subset \{1, \ldots, d\}$ with $|S| = d'$ and given positive constants $M, r, \xi, \epsilon$, let

$B_S = B_S(M, r, \xi, \epsilon) = \Big[ M \Big( \frac{r}{\xi} \Big)^{d'} \mathbb{H}^{r,S}_1 + \epsilon B_1 \Big] \cup \Big[ \bigcup_{a < \xi} \big( M \mathbb{H}^{a,S}_1 \big) + \epsilon B_1 \Big]$.

Since, given $S$, $A^{d'} \sim \mathrm{gamma}(b_1, b_2)$, it can be shown that, for some constant $C_1 > 0$,

$P(W^{\mathbf{A}} \notin B_S \mid S) \le e^{-C_1 r^{d'}}$.

The dominating term in the $\epsilon$-entropy of $B_S$ is bounded by $C_2 r^{d'} \log^{1+d}(C_3 M/\epsilon)$. When calculating the concentration probability around $w_0 \in C^{\alpha}[0,1]^I$, we simply use the fact that $\mathrm{pr}(S = I) > 0$. Combining the above, the sieves $B_n$ are constructed as

$B_n = \bigcup_{d'=1}^{d} \bigcup_{S : |S| = d'} B_S(M^S_n, r^S_n, \xi_n, \epsilon_n)$,

where, for constants $\kappa, \kappa_1 > 0$, $\epsilon_n = n^{-\alpha/(2\alpha+d_0)} \log^{\kappa} n$, $r^S_n = \big( n^{d_0/(2\alpha+d_0)} \big)^{1/|S|} \log^{\kappa_1}(n)$ and $(M^S_n)^2 = (r^S_n)^{|S|} \log(r^S_n/\epsilon_n)$.


More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS. By Chao Gao and Harrison H. Zhou Yale University

RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS. By Chao Gao and Harrison H. Zhou Yale University Submitted to the Annals o Statistics RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS By Chao Gao and Harrison H. Zhou Yale University A novel block prior is proposed or adaptive Bayesian estimation.

More information

Distance between multinomial and multivariate normal models

Distance between multinomial and multivariate normal models Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

arxiv: v1 [math.fa] 14 Jul 2018

arxiv: v1 [math.fa] 14 Jul 2018 Construction of Regular Non-Atomic arxiv:180705437v1 [mathfa] 14 Jul 2018 Strictly-Positive Measures in Second-Countable Locally Compact Non-Atomic Hausdorff Spaces Abstract Jason Bentley Department of

More information

Sobolev spaces. May 18

Sobolev spaces. May 18 Sobolev spaces May 18 2015 1 Weak derivatives The purpose of these notes is to give a very basic introduction to Sobolev spaces. More extensive treatments can e.g. be found in the classical references

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

Bayesian Single Index Model with Covariates Missing at Random

Bayesian Single Index Model with Covariates Missing at Random Bayesian Single Index Model with Covariates Missing at Random Abstract Bayesian single index model is a highly promising dimension reduction tool for an interpretable modeling of the non linear relationship

More information

Stable Process. 2. Multivariate Stable Distributions. July, 2006

Stable Process. 2. Multivariate Stable Distributions. July, 2006 Stable Process 2. Multivariate Stable Distributions July, 2006 1. Stable random vectors. 2. Characteristic functions. 3. Strictly stable and symmetric stable random vectors. 4. Sub-Gaussian random vectors.

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

Notation. General. Notation Description See. Sets, Functions, and Spaces. a b & a b The minimum and the maximum of a and b

Notation. General. Notation Description See. Sets, Functions, and Spaces. a b & a b The minimum and the maximum of a and b Notation General Notation Description See a b & a b The minimum and the maximum of a and b a + & a f S u The non-negative part, a 0, and non-positive part, (a 0) of a R The restriction of the function

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t))

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t)) Notations In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics Each particle is characterized by its position an velocity x i t, v i t R d R d at time

More information

AW -Convergence and Well-Posedness of Non Convex Functions

AW -Convergence and Well-Posedness of Non Convex Functions Journal of Convex Analysis Volume 10 (2003), No. 2, 351 364 AW -Convergence Well-Posedness of Non Convex Functions Silvia Villa DIMA, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy villa@dima.unige.it

More information

Adaptive Bayesian procedures using random series priors

Adaptive Bayesian procedures using random series priors Adaptive Bayesian procedures using random series priors WEINING SHEN 1 and SUBHASHIS GHOSAL 2 1 Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030 2 Department

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Nonparametric Bayesian Uncertainty Quantification

Nonparametric Bayesian Uncertainty Quantification Nonparametric Bayesian Uncertainty Quantification Lecture 1: Introduction to Nonparametric Bayes Aad van der Vaart Universiteit Leiden, Netherlands YES, Eindhoven, January 2017 Contents Introduction Recovery

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 While these notes are under construction, I expect there will be many typos. The main reference for this is volume 1 of Hörmander, The analysis of liner

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Fourier Transform & Sobolev Spaces

Fourier Transform & Sobolev Spaces Fourier Transform & Sobolev Spaces Michael Reiter, Arthur Schuster Summer Term 2008 Abstract We introduce the concept of weak derivative that allows us to define new interesting Hilbert spaces the Sobolev

More information

Nonparametric Bayes tensor factorizations for big data

Nonparametric Bayes tensor factorizations for big data Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

LECTURE 5: THE METHOD OF STATIONARY PHASE

LECTURE 5: THE METHOD OF STATIONARY PHASE LECTURE 5: THE METHOD OF STATIONARY PHASE Some notions.. A crash course on Fourier transform For j =,, n, j = x j. D j = i j. For any multi-index α = (α,, α n ) N n. α = α + + α n. α! = α! α n!. x α =

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Duality of multiparameter Hardy spaces H p on spaces of homogeneous type

Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Yongsheng Han, Ji Li, and Guozhen Lu Department of Mathematics Vanderbilt University Nashville, TN Internet Analysis Seminar 2012

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

We denote the space of distributions on Ω by D ( Ω) 2.

We denote the space of distributions on Ω by D ( Ω) 2. Sep. 1 0, 008 Distributions Distributions are generalized functions. Some familiarity with the theory of distributions helps understanding of various function spaces which play important roles in the study

More information

arxiv:submit/ [math.st] 6 May 2011

arxiv:submit/ [math.st] 6 May 2011 A Continuous Mapping Theorem for the Smallest Argmax Functional arxiv:submit/0243372 [math.st] 6 May 2011 Emilio Seijo and Bodhisattva Sen Columbia University Abstract This paper introduces a version of

More information

Tools from Lebesgue integration

Tools from Lebesgue integration Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given

More information

Local Minimax Testing

Local Minimax Testing Local Minimax Testing Sivaraman Balakrishnan and Larry Wasserman Carnegie Mellon June 11, 2017 1 / 32 Hypothesis testing beyond classical regimes Example 1: Testing distribution of bacteria in gut microbiome

More information

Bayesian Sparse Linear Regression with Unknown Symmetric Error

Bayesian Sparse Linear Regression with Unknown Symmetric Error Bayesian Sparse Linear Regression with Unknown Symmetric Error Minwoo Chae 1 Joint work with Lizhen Lin 2 David B. Dunson 3 1 Department of Mathematics, The University of Texas at Austin 2 Department of

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

Adaptivity to Local Smoothness and Dimension in Kernel Regression

Adaptivity to Local Smoothness and Dimension in Kernel Regression Adaptivity to Local Smoothness and Dimension in Kernel Regression Samory Kpotufe Toyota Technological Institute-Chicago samory@tticedu Vikas K Garg Toyota Technological Institute-Chicago vkg@tticedu Abstract

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

arxiv: v2 [math.st] 18 Feb 2013

arxiv: v2 [math.st] 18 Feb 2013 Bayesian optimal adaptive estimation arxiv:124.2392v2 [math.st] 18 Feb 213 using a sieve prior Julyan Arbel 1,2, Ghislaine Gayraud 1,3 and Judith Rousseau 1,4 1 Laboratoire de Statistique, CREST, France

More information

Nonlinear Systems and Control Lecture # 12 Converse Lyapunov Functions & Time Varying Systems. p. 1/1

Nonlinear Systems and Control Lecture # 12 Converse Lyapunov Functions & Time Varying Systems. p. 1/1 Nonlinear Systems and Control Lecture # 12 Converse Lyapunov Functions & Time Varying Systems p. 1/1 p. 2/1 Converse Lyapunov Theorem Exponential Stability Let x = 0 be an exponentially stable equilibrium

More information

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June

More information

Invariant measures for iterated function systems

Invariant measures for iterated function systems ANNALES POLONICI MATHEMATICI LXXV.1(2000) Invariant measures for iterated function systems by Tomasz Szarek (Katowice and Rzeszów) Abstract. A new criterion for the existence of an invariant distribution

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

A note on some approximation theorems in measure theory

A note on some approximation theorems in measure theory A note on some approximation theorems in measure theory S. Kesavan and M. T. Nair Department of Mathematics, Indian Institute of Technology, Madras, Chennai - 600 06 email: kesh@iitm.ac.in and mtnair@iitm.ac.in

More information

State Space Representation of Gaussian Processes

State Space Representation of Gaussian Processes State Space Representation of Gaussian Processes Simo Särkkä Department of Biomedical Engineering and Computational Science (BECS) Aalto University, Espoo, Finland June 12th, 2013 Simo Särkkä (Aalto University)

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

C m Extension by Linear Operators

C m Extension by Linear Operators C m Extension by Linear Operators by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Partially supported

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES INTRODUCTION TO REAL ANALYSIS II MATH 433 BLECHER NOTES. As in earlier classnotes. As in earlier classnotes (Fourier series) 3. Fourier series (continued) (NOTE: UNDERGRADS IN THE CLASS ARE NOT RESPONSIBLE

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information