ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES


Submitted to the Annals of Statistics

ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES

By Anirban Bhattacharya, Debdeep Pati and David Dunson
Duke University, Florida State University
(Postdoctoral scholar, Duke University; Assistant Professor, Florida State University; Professor, Duke University)

In nonparametric regression problems involving multiple predictors, there is typically interest in estimating an anisotropic multivariate regression surface in the important predictors while discarding the unimportant ones. Our focus is on defining a Bayesian procedure that leads to the minimax optimal rate of posterior contraction (up to a log factor) adapting to the unknown dimension and anisotropic smoothness of the true surface. We propose such an approach based on a Gaussian process prior with dimension-specific scalings, which are assigned carefully chosen hyperpriors. We additionally show that using a homogeneous Gaussian process with a single bandwidth leads to a sub-optimal rate in anisotropic cases.

AMS 2000 subject classifications: Primary 62G07, 62G20; secondary 60K35.
Keywords and phrases: Adaptive, Anisotropic, Bayesian nonparametrics, Function estimation, Gaussian process, Rate of convergence.

1. Introduction. Gaussian processes (Rasmussen, 2004; van der Vaart and van Zanten, 2008b) are widely used as priors on functions due to tractable posterior computation and attractive theoretical properties. The law of a mean zero Gaussian process (GP) $W_t$ is entirely characterized by its covariance kernel $c(s, t) = \mathrm{E}(W_s W_t)$. A squared exponential covariance kernel, given by $c(s, t) = \exp(-a \|s - t\|^2)$, is commonly used in the literature. Given $n$ independent observations, the optimal rate of estimation of a $d$-variable function that is only known to be $\alpha$-smooth is $n^{-\alpha/(2\alpha + d)}$ (Stone, 1982). The quality of estimation thus improves with increasing smoothness of the true function, while it deteriorates with increasing dimensionality. In practice, the smoothness $\alpha$ is typically unknown, and one would like an estimation procedure that adapts to any possible $\alpha > 0$. Accordingly, a great deal of effort has gone into developing adaptive estimation methods that are rate-optimal for every regularity level of the unknown function. The literature on adaptive estimation in a minimax setting was initiated by Lepski in a series of papers (Lepski, 1990, 1991, 1992); see also Birgé (2001) for a discussion of this topic.

We also refer the reader to Hoffmann and Lepski (2002), which contains an extensive list of developments in the frequentist literature on adaptive estimation. There is a growing literature on Bayesian adaptation over the last decade; previous work includes Belitser and Ghosal (2003); de Jonge and van Zanten (2010); Ghosal, Lember and van der Vaart (2003, 2008); Huang (2004); Kruijer, Rousseau and van der Vaart (2010); Rousseau (2010); Scricciolo (2006); Shen and Ghosal (2011).

A key idea in frequentist adaptive estimation is to narrow down the search for an optimal estimator within a class of estimators indexed by a smoothness or bandwidth parameter, and to make a data-driven choice of the proper bandwidth. In a Bayesian context, one would instead place a prior on the bandwidth parameter and model-average across different values of the bandwidth through the posterior distribution. The parameter $a$ in the squared exponential covariance kernel $c$ plays the role of a scaling or inverse bandwidth. van der Vaart and van Zanten (2009) showed that with a gamma prior on $a^d$, one obtains the minimax rate of posterior contraction $n^{-\alpha/(2\alpha+d)}$, up to a logarithmic factor, for $\alpha$-smooth functions adaptively over all $\alpha > 0$.

In most multivariate applications, the isotropic smoothness assumption seems too restrictive. Potentially, one can incorporate a separate scaling variable $a_j$ for each dimension using the covariance kernel $c(s, t) = \exp(-\sum_{j=1}^{d} a_j |s_j - t_j|^2)$, intuitively enabling better approximation of anisotropic functions. Such kernels, going by the name automatic relevance determination (ARD), have been heavily used in the machine learning community; see, for example, Rasmussen (2004) and references therein. Zou et al. (2010) and Savitsky, Vannucci and Sha (2011) recently considered such a model, with point mass mixture priors on the $a_j$'s. Although this is an attractive approach with encouraging empirical performance, there has not been any theoretical study of the asymptotic properties of related models in a Bayesian framework.

In the frequentist literature, minimax rates of convergence in anisotropic Sobolev, Besov and Hölder spaces have been studied in Birgé (1986); Ibragimov and Khasminski (1981); Nussbaum (1985), with adaptive estimation procedures developed in Barron, Birgé and Massart (1999); Hoffmann and Lepski (2002); Kerkyacharian, Lepski and Picard (2001); Klutchnikoff (2005), among others. The traditional way of dealing with anisotropy is to employ a separate bandwidth or scaling parameter for the different dimensions, and to choose an optimal combination of scales in a data-driven way. However, the multidimensional nature of the problem makes optimal bandwidth selection more difficult than in the isotropic case, as there is no natural ordering among the estimators with multiple bandwidths (Lepski and Levit, 1999).
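To make the role of the dimension-specific scalings concrete, the following sketch (our own illustration in Python/NumPy, not code from the paper; all names are ours) evaluates the ARD squared exponential kernel above and draws one sample path of the correspondingly rescaled process on a random design.

```python
import numpy as np

def ard_se_kernel(s, t, a):
    """ARD squared exponential kernel c(s, t) = exp(-sum_j a_j (s_j - t_j)^2).

    s, t : arrays of shape (n, d) and (m, d); a : length-d vector of
    inverse squared bandwidths (the per-dimension scalings a_j).
    """
    diff2 = (s[:, None, :] - t[None, :, :]) ** 2        # shape (n, m, d)
    return np.exp(-np.tensordot(diff2, a, axes=([2], [0])))

rng = np.random.default_rng(0)
d = 2
a = np.array([50.0, 1.0])            # rougher (less smooth) along the first axis
grid = rng.uniform(size=(200, d))    # random design on [0, 1]^2
K = ard_se_kernel(grid, grid, a)
# One draw from the mean-zero GP with this kernel (jitter for numerical stability).
path = rng.multivariate_normal(np.zeros(len(grid)), K + 1e-8 * np.eye(len(grid)))
```

A large $a_j$ shortens the correlation length along axis $j$, which is exactly the "stretching" of less smooth directions discussed later in Section 3.3.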

It is known (Hoffmann and Lepski, 2002) that the minimax rate of convergence for a function with smoothness $\alpha_i$ along the $i$th dimension is given by $n^{-\alpha_0/(2\alpha_0+1)}$, where $\alpha_0^{-1} = \sum_{i=1}^{d} \alpha_i^{-1}$ is an exponent of global smoothness (Birgé, 1986). When $\alpha_i = \alpha$ for all $i = 1, \ldots, d$, one recovers the optimal rate for isotropic classes. On the contrary, if the true function belongs to an anisotropic class, the assumption of isotropy leads to a loss of efficiency which becomes more and more accentuated in higher dimensions. In addition, if the true function depends on a subset of coordinates $I = \{i_1, \ldots, i_{d_0}\} \subset \{1, \ldots, d\}$ for some $1 \le d_0 \le d$, the minimax rate further improves to $n^{-\alpha_{0I}/(2\alpha_{0I}+1)}$, with $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$.

The objective of this article is to study whether one can fully adapt to this larger class of functions in a Bayesian framework using dimension-specific rescalings of a homogeneous Gaussian process, referred to as a multi-bandwidth Gaussian process from now on. We answer the question in the affirmative, establishing rate adaptiveness of the posterior distribution in a variety of settings involving a multi-bandwidth Gaussian process through a novel prior specification on the vector of bandwidths. For simplicity of exposition, we initially study the problem in two parts: (i) adaptive estimation over anisotropic Hölder functions of $d$ arguments, and (ii) adaptive estimation over functions that can possibly depend on fewer coordinates and have isotropic Hölder smoothness over the remaining coordinates. The proposed prior specifications for the two cases are intuitively interpretable and can be easily connected to prescribe a unified prior leading to adaptivity over (i) and (ii) combined.

Although our prior specification involving dimension-specific bandwidth parameters leads to adaptivity, a stronger result is required to conclude that a single bandwidth would be inadequate for the above classes of functions. We prove that the optimal prior choice in the isotropic case leads to a suboptimal convergence rate if the true function has anisotropic smoothness, by obtaining a lower bound on the posterior contraction rate. Previous results on posterior lower bounds in nonparametric problems include Castillo (2008); van der Vaart and van Zanten (2011).

The remainder of the paper is organized as follows. In Section 2, we introduce relevant notations and conventions used throughout the paper. The multi-bandwidth Gaussian process is introduced in Section 3. Sections 3.1 and 3.2 discuss the main developments, with applications to anisotropic Gaussian process mean regression and logistic Gaussian process density estimation described in Section 3.4. Section 3.5 establishes the necessity of the multi-bandwidth Gaussian process by showing a lower-bound result. In Sections 4.1 and 4.2, we study various properties of rescaled Gaussian processes which are crucially used in the proofs of the main theorems in Section 5.
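As a quick numerical illustration of the rate formulas above (our own worked example, not from the paper), the snippet below computes the harmonic-mean smoothness $\alpha_{0I}$ and the corresponding rate exponent, and checks that the isotropic case reduces to $\alpha/(2\alpha+d)$.

```python
import numpy as np

def rate_exponent(alpha, I=None):
    """Exponent e in the anisotropic minimax rate n^{-e}.

    alpha : per-coordinate smoothnesses (alpha_1, ..., alpha_d);
    I     : coordinates the function depends on (default: all of them).
    Uses alpha_{0I}^{-1} = sum_{j in I} alpha_j^{-1}, e = alpha_{0I}/(2 alpha_{0I} + 1).
    """
    alpha = np.asarray(alpha, dtype=float)
    if I is not None:
        alpha = alpha[list(I)]
    a0 = 1.0 / np.sum(1.0 / alpha)
    return a0 / (2.0 * a0 + 1.0)

# Isotropic check: alpha_i = 2 in d = 4 gives alpha_0 = 1/2 and exponent
# 1/4 = alpha/(2 alpha + d). Dropping two irrelevant coordinates improves it to 1/3.
print(rate_exponent([2, 2, 2, 2]))            # 0.25
print(rate_exponent([2, 2, 2, 2], I=[0, 1]))  # 0.333... = 2/(2*2 + 2)
```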

2. Preliminaries. To keep the notation clean, we shall only use boldface for $\mathbf{a}$, $\mathbf{b}$ and $\boldsymbol{\alpha}$ to denote vectors. We shall make frequent use of the following multi-index notations. For vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^d$, let $a_{\cdot} = \sum_{j=1}^{d} a_j$, $\prod \mathbf{a} = \prod_{j=1}^{d} a_j$, $\mathbf{a}! = \prod_{j=1}^{d} a_j!$, $\bar{a} = \max_j a_j$, $\underline{a} = \min_j a_j$, $\mathbf{a}./\mathbf{b} = (a_1/b_1, \ldots, a_d/b_d)^{\mathrm{T}}$, $\mathbf{a} \cdot \mathbf{b} = (a_1 b_1, \ldots, a_d b_d)^{\mathrm{T}}$ and $\mathbf{a}^{\mathbf{b}} = \prod_{j=1}^{d} a_j^{b_j}$. Denote $\mathbf{a} \le \mathbf{b}$ if $a_j \le b_j$ for all $j = 1, \ldots, d$. For $\mathbf{n} = (n_1, \ldots, n_d)$, let $D^{\mathbf{n}} f$ denote the mixed partial derivative of $f$ of order $(n_1, \ldots, n_d)$.

Let $C[0,1]^d$ and $C^{\beta}[0,1]^d$ denote the space of all continuous functions and the Hölder space of $\beta$-smooth functions $f : [0,1]^d \to \mathbb{R}$, respectively, endowed with the supremum norm $\|f\|_{\infty} = \sup_{t \in [0,1]^d} |f(t)|$. For $\beta > 0$, the Hölder space $C^{\beta}[0,1]^d$ consists of functions $f \in C[0,1]^d$ that have bounded mixed partial derivatives up to order $\lfloor \beta \rfloor$, with the partial derivatives of order $\lfloor \beta \rfloor$ being Lipschitz continuous of order $\beta - \lfloor \beta \rfloor$. Also, denote by $H^{\beta}[0,1]^d$ the Sobolev space of functions $f : [0,1]^d \to \mathbb{R}$ that are restrictions of a function $f : \mathbb{R}^d \to \mathbb{R}$ with Fourier transform $\hat{f}(\lambda) = (2\pi)^{-d} \int e^{-i\langle\lambda, t\rangle} f(t) \, dt$ such that $\int (1 + \|\lambda\|^2)^{\beta/2} |\hat{f}(\lambda)| \, d\lambda < \infty$. Note that with the above convention, the inverse Fourier transform is $f(t) = \int e^{i\langle\lambda, t\rangle} \hat{f}(\lambda) \, d\lambda$.

Next, we define an anisotropic Hölder class of functions previously used in Barron, Birgé and Massart (1999); Klutchnikoff (2005). For a function $f \in C[0,1]^d$, $x \in [0,1]^d$ and $1 \le i \le d$, let $f_i(\cdot \mid x)$ denote the univariate function $y \mapsto f(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_d)$. For a vector of positive numbers $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$, the anisotropic Hölder space $C^{\boldsymbol{\alpha}}[0,1]^d$ consists of functions $f$ which satisfy, for some $L > 0$,

(2.1)  $\max_{1 \le i \le d} \sup_{x \in [0,1]^d} \sum_{j=0}^{\lfloor \alpha_i \rfloor} \| D^j f_i(\cdot \mid x) \|_{\infty} \le L$,

and, for any $y \in [0,1]$, $h$ small enough that $y + h \in [0,1]$, and all $1 \le i \le d$,

(2.2)  $\sup_{x \in [0,1]^d} \big| D^{\lfloor \alpha_i \rfloor} f_i(y + h \mid x) - D^{\lfloor \alpha_i \rfloor} f_i(y \mid x) \big| \le L |h|^{\alpha_i - \lfloor \alpha_i \rfloor}$.

For $t \in \mathbb{R}^d$ and a subset $I \subset \{1, \ldots, d\}$ of size $|I| = d'$ with $1 \le d' \le d$, let $t_I$ denote the vector of size $d'$ consisting of the coordinates $(t_j : j \in I)$. Let $C[0,1]^I$ denote the subset of $C[0,1]^d$ consisting of functions $f$ such that $f(t) = g(t_I)$ for some function $g \in C[0,1]^{d'}$. Also, let $C^{\boldsymbol{\alpha}}[0,1]^I$ denote the subset of $C^{\boldsymbol{\alpha}}[0,1]^d$ consisting of functions $f$ such that $f(t) = g(t_I)$ for some function $g \in C^{\boldsymbol{\alpha}_I}[0,1]^{d'}$.
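For readers who like to see the conventions operationally, here is a small sketch (ours, not from the paper) of the multi-index operations just defined, written in Python/NumPy with names mirroring the notation.

```python
import numpy as np
from math import factorial

a = np.array([2.0, 3.0, 0.5])
b = np.array([1.0, 2.0, 4.0])

a_dot   = a.sum()                    # a. = sum_j a_j
a_prod  = a.prod()                   # prod(a) = prod_j a_j
a_bar   = a.max()                    # max_j a_j
a_under = a.min()                    # min_j a_j
a_div_b = a / b                      # a./b, elementwise quotient
a_mul_b = a * b                      # a.b, elementwise product
a_pow_b = np.prod(a ** b)            # a^b = prod_j a_j^{b_j}
n_fact  = np.prod([factorial(n) for n in (2, 0, 3)])  # n! for a multi-index n
```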

The $\epsilon$-covering number $N(\epsilon, S, d)$ of a semi-metric space $S$ relative to the semi-metric $d$ is the minimal number of balls of radius $\epsilon$ needed to cover $S$. The logarithm of the covering number is referred to as the entropy. We write $\lesssim$ for inequality up to a constant multiple. Let $\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$ denote the standard normal density, and let $\phi_{\sigma}(x) = (1/\sigma)\phi(x/\sigma)$. Let an asterisk denote convolution, e.g., $(\phi_{\sigma} * f)(y) = \int \phi_{\sigma}(y - x) f(x) \, dx$. Let $\mathbb{R}_+$ denote the set of non-negative real numbers and $\mathbb{R}_{>0}$ the positive reals. Denote by $\mathcal{S}^{d-1}$ the $(d-1)$-dimensional simplex $\{x \in \mathbb{R}^d : x_i \ge 0,\ 1 \le i \le d,\ \sum_{i=1}^{d} x_i = 1\}$. Unless otherwise stated, $c, C, C', C_1, C_2, \ldots, K, K_1, K_2, \ldots$ shall denote global constants whose precise values are irrelevant for our purposes.

3. Main results. Let $W = \{W_t : t \in [0,1]^d\}$ be a centered homogeneous Gaussian process with covariance function $\mathrm{E}(W_s W_t) = c(s - t)$. A detailed review of the facts on Gaussian processes relevant to the present application can be found in van der Vaart and van Zanten (2008b). If $c : \mathbb{R}^d \to \mathbb{R}$ is continuous, then by Bochner's theorem there exists a finite positive measure $\nu$ on $\mathbb{R}^d$, called the spectral measure of $W$, such that

$c(t) = \int_{\mathbb{R}^d} e^{i\langle\lambda, t\rangle} \, \nu(d\lambda)$,

where for $u, v \in \mathbb{C}^d$, $\langle u, v \rangle$ denotes the complex inner product. As in van der Vaart and van Zanten (2009), we shall restrict ourselves to processes with spectral measure $\nu$ having sub-exponential tails, i.e., for some $\delta > 0$,

(3.1)  $\int e^{\delta \|\lambda\|} \, \nu(d\lambda) < \infty$.

The spectral measure $\nu$ of the squared exponential covariance kernel $c(t) = \exp(-\|t\|^2)$ has density $f(\lambda) = (2^d \pi^{d/2})^{-1} \exp(-\|\lambda\|^2/4)$ with respect to the Lebesgue measure, which clearly satisfies (3.1).

Let $\mathbb{H}$ be the RKHS of $W$; see van der Vaart and van Zanten (2008b) for a review of the relevant facts. van der Vaart and van Zanten (2008a) showed that the rate of posterior contraction with a Gaussian process prior $W$, in a metric where appropriate testing is possible, is determined by the behavior of the concentration function $\phi_{w_0}(\epsilon)$ for $\epsilon$ close to zero, where

(3.2)  $\phi_{w_0}(\epsilon) = \inf_{h \in \mathbb{H} : \|h - w_0\|_{\infty} \le \epsilon} \frac{1}{2}\|h\|^2_{\mathbb{H}} - \log P\big( \|W\|_{\infty} \le \epsilon \big)$.
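As a side sanity check on the spectral representation (ours, not part of the paper), the following snippet numerically verifies in one dimension that the stated spectral density $f(\lambda) = e^{-\lambda^2/4}/(2\sqrt{\pi})$ reproduces $c(t) = e^{-t^2}$ via Bochner's formula.

```python
import numpy as np

# Spectral density of the 1-d squared exponential kernel c(t) = exp(-t^2),
# i.e. f(lambda) = exp(-lambda^2/4) / (2 sqrt(pi)) as stated above.
f = lambda lam: np.exp(-lam ** 2 / 4) / (2 * np.sqrt(np.pi))

lam = np.linspace(-40.0, 40.0, 200001)   # fine grid; Gaussian tails are negligible
h = lam[1] - lam[0]
for t in [0.0, 0.5, 1.0, 2.0]:
    vals = np.cos(lam * t) * f(lam)      # real part of e^{i lambda t} f(lambda)
    c_t = np.sum((vals[1:] + vals[:-1]) / 2) * h   # trapezoidal rule
    assert abs(c_t - np.exp(-t ** 2)) < 1e-8
print("Bochner representation check passed")
```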

We tacitly assume that there is a given statistical problem in which the true parameter $g_0$ is a known function of $w_0$. van der Vaart and van Zanten (2009) studied rescaled Gaussian processes $W^A = \{W_{At} : t \in [0,1]^d\}$ for a real positive random variable $A$ stochastically independent of $W$, and showed that with a gamma prior on $A^d$, one obtains the minimax-optimal rate of convergence $n^{-\alpha/(2\alpha+d)}$ (up to a logarithmic factor) for $\alpha$-smooth functions. Since their prior specification does not involve the unknown smoothness $\alpha$, the procedure is fully adaptive. The key result of van der Vaart and van Zanten (2009) was to construct sets $B_n \subset C[0,1]^d$ so that given $\alpha > 0$, a function $w_0 \in C^{\alpha}[0,1]^d$, and a constant $C > 1$, there exists a constant $D > 0$ such that, for every sufficiently large $n$,

(3.3)  $\log N(\bar{\epsilon}_n, B_n, \|\cdot\|_{\infty}) \le D n \bar{\epsilon}_n^2$,
(3.4)  $P(W^A \notin B_n) \le e^{-C n \epsilon_n^2}$,
(3.5)  $P(\|W^A - w_0\|_{\infty} \le \epsilon_n) \ge e^{-n \epsilon_n^2}$,

with $\epsilon_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$.

In this article, we shall study multi-bandwidth Gaussian processes of the form $\{W^{\mathbf{a}}_t = W_{\mathbf{a} \cdot t} : t \in [0,1]^d\}$ for a vector of rescalings (or inverse bandwidths) $\mathbf{a} = (a_1, \ldots, a_d)^{\mathrm{T}}$ with $a_j > 0$ for all $j = 1, \ldots, d$. We consider the two function classes defined in Section 2:

(i) the Hölder class of functions $C^{\boldsymbol{\alpha}}[0,1]^d$ with anisotropic smoothness ($\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$);
(ii) the Hölder class of functions $C^{\alpha}[0,1]^I$ with isotropic smoothness that can possibly depend on fewer dimensions ($\alpha > 0$ and $I \subset \{1, \ldots, d\}$).

For a continuous function in the support of a Gaussian process, the probability assigned to a sup-norm neighborhood of the function is controlled by the centered small ball probability and by how well the function can be approximated from the RKHS of the process (Section 5 of van der Vaart and van Zanten (2008b)). With the target class of functions as in (i) or (ii), a single scaling seems inadequate, and it is intuitively appealing to introduce multiple bandwidth parameters to enlarge the RKHS and facilitate improved approximation from the RKHS.

The main technical challenge for adaptation in our setting is to find a joint prior on $\mathbf{a}$ and devise sets $B_n$ (henceforth called sieves) so that (3.3)-(3.5) are satisfied with $w_0$ in the above function classes (i)-(ii) and $\epsilon_n$ the optimal rate of convergence for the same. To that end, we propose a novel class of joint priors on the rescaling vector $\mathbf{a}$ that leads to adaptation over function classes (i) and (ii) in Sections 3.1 and 3.2, respectively.

Connections between the two prior choices are discussed, and a unified framework is prescribed for the function class $\{C^{\boldsymbol{\alpha}}[0,1]^I : \boldsymbol{\alpha} \in \mathbb{R}_{>0}^d,\ I \subset \{1, \ldots, d\}\}$ combining (i) and (ii). The construction of the sieves $B_n$ is laid out in Section 5. With such $B_n$, one can use standard results to establish adaptive minimax rates of convergence in various statistical settings; refer to the discussion following Theorem 3.1 in van der Vaart and van Zanten (2009). Some such specific applications are described in Section 3.4.

3.1. Adaptive estimation of anisotropic functions. Let $\mathbf{A} = (A_1, \ldots, A_d)^{\mathrm{T}}$ be a random vector in $\mathbb{R}^d$ with each $A_j$ a non-negative random variable stochastically independent of $W$. We can then define a scaled process $W^{\mathbf{A}} = \{W_{\mathbf{A} \cdot t} : t \in [0,1]^d\}$, to be interpreted as a Borel measurable map into $C[0,1]^d$ equipped with the sup-norm. The basic idea here is to scale the different dimensions by different amounts, so that the resulting process becomes suitable for approximating functions having different smoothness along the different coordinate axes.

We shall define a joint distribution on $\mathbf{A}$ induced through the following hierarchical specification. Let $\Theta = (\Theta_1, \ldots, \Theta_d)$ denote a random vector with a density supported on the simplex $\mathcal{S}^{d-1}$.

(PA1) Draw $\Theta \sim \mathrm{Dir}(\beta_1, \ldots, \beta_d)$ for some $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_d)$.
(PA2) Given $\Theta = \theta$, draw $A_j^{1/\theta_j} \sim g$ independently, where $g$ is a density on the positive real line satisfying

(3.6)  $C_1 x^{p} \exp(-D_1 x \log^q x) \le g(x) \le C_2 x^{p} \exp(-D_2 x \log^q x)$

for positive constants $C_1, C_2, D_1, D_2$, non-negative constants $p, q$, and every sufficiently large $x > 0$. In particular, $g$ corresponds to a $\mathrm{gamma}(b_1, b_2)$ distribution if $b_1 = p + 1$, $D_1 = D_2 = b_2$, $q = 0$ and $C_1 = C_2$ in (3.6). For notational simplicity, we shall assume $g$ to be $\mathrm{gamma}(b_1, b_2)$ from now on, noting that the main results would all hold for the general form of $g$ above. Let $\pi_{\mathbf{A}}$ denote the joint prior on $\mathbf{A}$ induced through (PA1)-(PA2), so that $\pi_{\mathbf{A}}(\mathbf{a}) = \int \prod_{j=1}^{d} \pi(a_j \mid \theta_j) \, d\pi(\theta)$. We now state our main theorem for the anisotropic smoothness class in (i), with a detailed proof provided in Section 5; a sketch of sampling from this prior is given below.
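The hierarchical prior (PA1)-(PA2) is straightforward to simulate. The sketch below (our own illustration, not code from the paper) draws the rescaling vector with $g$ taken to be the $\mathrm{gamma}(b_1, b_2)$ special case.

```python
import numpy as np

def sample_A_anisotropic(d, beta, b1, b2, rng):
    """One draw of the rescaling vector A from (PA1)-(PA2).

    (PA1): Theta ~ Dirichlet(beta_1, ..., beta_d) on the simplex S^{d-1}.
    (PA2): given Theta = theta, A_j^{1/theta_j} ~ gamma(b1, b2), i.e.
           A_j = G_j^{theta_j} with G_j ~ gamma(b1, rate b2) independent.
    """
    theta = rng.dirichlet(beta)
    G = rng.gamma(shape=b1, scale=1.0 / b2, size=d)  # NumPy's scale = 1/rate
    return G ** theta, theta

rng = np.random.default_rng(1)
A, theta = sample_A_anisotropic(d=3, beta=np.ones(3), b1=1.0, b2=1.0, rng=rng)
# Smaller theta_j => lighter tails for A_j (less stretching of axis j);
# larger theta_j => heavier tails, suited to less smooth directions.
```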

Theorem 3.1. Let $W$ be a centered homogeneous Gaussian random field on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1), and let $W^{\mathbf{A}}$ denote the multi-bandwidth process with $\mathbf{A} \sim \pi_{\mathbf{A}}$ as in (PA1)-(PA2). Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive numbers and $\alpha_0 = (\sum_{i=1}^{d} \alpha_i^{-1})^{-1}$. Suppose $w_0$ belongs to the anisotropic Hölder space $C^{\boldsymbol{\alpha}}[0,1]^d$. Then for every constant $C > 1$, there exist Borel measurable subsets $B_n$ of $C[0,1]^d$ and a constant $D > 0$ such that, for every sufficiently large $n$, the conditions (3.3)-(3.5) are satisfied by $W^{\mathbf{A}}$ with $\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha_0/(2\alpha_0+1)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$.

3.2. Adaptive dimension reduction. We next consider the smoothness class in (ii), namely $C^{\alpha}[0,1]^I$ for $I \subset \{1, \ldots, d\}$ and $\alpha > 0$. If the true function has isotropic smoothness in the dimensions it depends on, it is intuitively clear that one does not need a separate scaling for each of the dimensions. Indeed, had we known the true coordinates $I \subset \{1, \ldots, d\}$, we could have scaled only the dimensions in $I$ by a positive random variable $A$, and a slight modification of the results in van der Vaart and van Zanten (2009) would imply that a gamma prior on $A^{|I|}$ leads to adaptation. With that motivation, consider a joint prior $\pi_{\mathbf{A}}$ on $\mathbf{A}$ induced through the following hierarchical scheme (a simulation sketch follows Theorem 3.2 below):

(PD1) draw $d'$ uniformly on $\{0, \ldots, d\}$;
(PD2) given $d'$, draw a subset $S \subset \{1, \ldots, d\}$ with $|S| = d'$ uniformly from all subsets of size $d'$;
(PD3) generate a pair of random variables $(A, B)$ with $A^{d'} \sim \mathrm{gamma}(b_1, b_2)$ and $B$ drawn from a fixed compactly supported distribution;
(PD4) set $A_j = A$ for $j \in S$ and $A_j = B$ for $j \notin S$.

In particular, one can fix $B$ to be any constant $c > 0$ in (PD3), which corresponds to the $\delta_c(\cdot)$ distribution. We next state our main result on adaptive dimension reduction. The proof of the following Theorem 3.2 has elements in common with the proof of Theorem 3.1, and hence only a sketch of the proof is provided in Section 5.

Theorem 3.2. Let $W$ be a centered homogeneous Gaussian random field on $\mathbb{R}^d$ with spectral measure $\nu$ that satisfies (3.1), and let $W^{\mathbf{A}}$ denote the multi-bandwidth process with $\mathbf{A} \sim \pi_{\mathbf{A}}$ as in (PD1)-(PD4). Suppose $w_0$ belongs to the Hölder space $C^{\alpha}[0,1]^I$ for some subset $I$ of $\{1, \ldots, d\}$ and $\alpha > 0$. Then for every constant $C > 1$, there exist Borel measurable subsets $B_n$ of $C[0,1]^d$ and a constant $D > 0$ such that, for every sufficiently large $n$, the conditions (3.3)-(3.5) are satisfied by $W^{\mathbf{A}}$ with $\epsilon_n = n^{-\alpha/(2\alpha+d_0)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha/(2\alpha+d_0)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$ and $d_0 = |I|$.
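The dimension-reduction prior (PD1)-(PD4) can be simulated as follows (our own sketch, fixing $B$ at the point mass $\delta_c$ as the text allows).

```python
import numpy as np

def sample_A_dimension_reduction(d, b1, b2, c, rng):
    """One draw of A from (PD1)-(PD4), fixing B = c (a point mass).

    (PD1): d' uniform on {0, ..., d};
    (PD2): S a uniformly chosen subset of {1, ..., d} of size d';
    (PD3): A^{d'} ~ gamma(b1, b2), so A = G^{1/d'} with G ~ gamma;
    (PD4): A_j = A on S, A_j = c off S.
    """
    d_prime = rng.integers(0, d + 1)
    S = rng.choice(d, size=d_prime, replace=False)
    A_vec = np.full(d, float(c))
    if d_prime > 0:
        G = rng.gamma(shape=b1, scale=1.0 / b2)
        A_vec[S] = G ** (1.0 / d_prime)
    return A_vec, set(S.tolist())

rng = np.random.default_rng(2)
A_vec, S = sample_A_dimension_reduction(d=5, b1=1.0, b2=1.0, c=1.0, rng=rng)
```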

3.3. Connections between cases (i) and (ii). The joint distributions on $\mathbf{A}$ specified in (PA1)-(PA2) and (PD1)-(PD4) are closely connected. To begin with, note that if we set $A_j = A$ and $\theta_j = 1/d$ for $j = 1, \ldots, d$ in (PA1)-(PA2), one reduces to a gamma prior on $A^d$: the optimal prior choice in the isotropic case (van der Vaart and van Zanten, 2009).

In the anisotropic case, our proposed prior can be motivated as follows. Recall that the purpose of rescaling is to traverse the sample paths of an infinitely smooth stochastic process over a larger domain, making the process more suitable for less smooth functions. If the true function has anisotropic smoothness, then we would like to stretch more in those directions where the function is less smooth. For smaller $\theta_j$'s, the marginal distribution of $a_j$ has lighter tails compared to larger values of $\theta_j$. We would thus like $\theta_j$ to assume smaller values in the directions $j$ where the function is smoother, and larger values in the less smooth directions. Without further constraints on $\theta$, it is not possible to separate the scale of $\mathbf{A}$ from $\theta$. This motivates us to constrain $\theta$ to the simplex, which serves as a weak identifiability condition.

In the limit as $\theta_j \to 0$, the distribution of $a_j$ converges to a point mass, since $a_j = G^{\theta_j} \to 1$ for $G \sim g$. Accordingly, if the true function does not depend on a set of $(d - d')$ dimensions, we would set $\theta_j = 0$ for those dimensions and choose the remaining $\theta_j$'s from a $(d'-1)$-dimensional simplex. In particular, if the function has isotropic smoothness in the remaining $d'$ coordinates, one can simply choose $\theta_j = 1/d'$ for those dimensions. This reduces to our prior choice in (PD1)-(PD4).

The dimensionality reduction in Section 3.2 deals with finitely many models and could alternatively be studied as a model selection problem (Ghosal, Lember and van der Vaart, 2008). However, the connection established above allows us to treat anisotropy and dimension reduction under a single framework, with the dimension reduction paradigm recognized as a limiting case of the anisotropic framework. We exploit this connection to propose a unified framework for adaptively estimating functions which possibly depend on fewer coordinates and have anisotropic smoothness in the remaining ones, i.e., functions in $C^{\boldsymbol{\alpha}}[0,1]^I$ for $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$ and $I \subset \{1, \ldots, d\}$. In particular, we continue to consider rescaled Gaussian processes $W^{\mathbf{A}}$ with the following prior on $\mathbf{A}$ (a simulation sketch follows the list):

(P1) draw $d'$ uniformly on $\{0, \ldots, d\}$;
(P2) given $d'$, draw a subset $S \subset \{1, \ldots, d\}$ with $|S| = d'$ uniformly from all subsets of size $d'$;
(P3) draw $\Theta = (\theta_1, \ldots, \theta_{d'}) \in \mathcal{S}^{d'-1}$ from a $\mathrm{Dir}(\beta_1, \ldots, \beta_{d'})$ distribution;
(P4) given $S$ and $\Theta = \theta$, draw $A_j^{1/\theta_j} \sim \mathrm{gamma}(b_1, b_2)$ independently for $j \in S$, and fix $A_j = c$ for some $c > 0$ for $j \notin S$.
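The unified prior (P1)-(P4) combines the two earlier samplers; here is a minimal sketch (ours, with a common Dirichlet parameter chosen for simplicity).

```python
import numpy as np

def sample_A_unified(d, beta0, b1, b2, c, rng):
    """One draw of A from the unified prior (P1)-(P4).

    Combines subset selection (P1)-(P2) with the Dirichlet-scaled gamma
    draws of (PA1)-(PA2) on the selected coordinates (P3)-(P4).
    beta0 : common Dirichlet parameter (our simplifying choice).
    """
    d_prime = rng.integers(0, d + 1)                     # (P1)
    S = rng.choice(d, size=d_prime, replace=False)       # (P2)
    A_vec = np.full(d, float(c))                         # A_j = c off S  (P4)
    if d_prime > 0:
        theta = rng.dirichlet(np.full(d_prime, beta0))   # (P3)
        G = rng.gamma(shape=b1, scale=1.0 / b2, size=d_prime)
        A_vec[S] = G ** theta                            # A_j^{1/theta_j} ~ gamma  (P4)
    return A_vec

rng = np.random.default_rng(3)
print(sample_A_unified(d=4, beta0=1.0, b1=1.0, b2=1.0, c=1.0, rng=rng))
```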

A unification of Theorems 3.1 and 3.2 is provided in the following Theorem 3.3; the proof is omitted as it is similar to those of the previous theorems.

Theorem 3.3. Consider $W^{\mathbf{A}}$ with $\mathbf{A} \sim \pi_{\mathbf{A}}$ as in (P1)-(P4). Suppose $w_0$ belongs to $C^{\boldsymbol{\alpha}}[0,1]^I$ for some subset $I$ of $\{1, \ldots, d\}$ and $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$. Let $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$. Then for every constant $C > 1$, the conclusions of Theorem 3.1 are satisfied by $W^{\mathbf{A}}$ with $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} (\log n)^{\kappa_1}$ and $\bar{\epsilon}_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} (\log n)^{\kappa_2}$ for constants $\kappa_1, \kappa_2 > 0$.

In Theorem 3.3, the exponents of the logarithmic terms increase linearly with the dimension and decrease with $\alpha_{0I}$. In particular, $\kappa_1$ and $\kappa_2$ can be estimated by $\frac{d+1}{\alpha_{0I}+2}$ and $\frac{(\alpha_{0I}+3)(d+1)}{2(\alpha_{0I}+2)}$, respectively.

3.4. Rates of convergence in specific settings. Theorem 3.3 is in the same spirit as Theorem 3.1 of van der Vaart and van Zanten (2009) (see also Theorem 2.2 of de Jonge and van Zanten (2010)) and can be used to derive rates of posterior contraction in a variety of statistical problems involving Gaussian random fields. We shall consider a couple of specific problems, with the message that similar results can be obtained for a large class of problems.

We first consider a regression problem where, given independent response variables $y_i$ and covariates $x_i \in [0,1]^d$, the responses are modeled as random perturbations around a smooth regression surface, i.e.,

(3.7)  $y_i = \mu(x_i) + \epsilon_i, \qquad \epsilon_i \sim \mathrm{N}(0, \sigma^2)$.

As motivated before, the true regression surface $\mu_0$ might depend only on a subset of variables $I$ and have anisotropic smoothness in those variables. Accordingly, we assume $\mu_0 \in C^{\boldsymbol{\alpha}}[0,1]^I$ for some $I$ and $\boldsymbol{\alpha}$. Also, assume the true value $\sigma_0$ of $\sigma$ lies in an interval $[a, b] \subset [0, \infty)$. We use the law of $W^{\mathbf{A}}$ as a prior on $\mu$, with the prior on $\mathbf{A}$ as in (P1)-(P4), and we assume a prior on $\sigma$ supported on $[a, b]$. Denote the posterior distribution by $\Pi(\cdot \mid y_1, \ldots, y_n)$. Let $\|\mu\|_n^2 = n^{-1} \sum_{i=1}^{n} \mu^2(x_i)$ denote the $L_2$ norm corresponding to the empirical distribution of the design points. The posterior is said to contract at a rate $\epsilon_n$ if, for every sufficiently large $M$,

(3.8)  $\mathrm{E}_{\mu_0, \sigma_0} \Pi\big[ (\mu, \sigma) : \|\mu - \mu_0\|_n + |\sigma - \sigma_0| > M \epsilon_n \mid y_1, \ldots, y_n \big] \to 0$.
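Conditionally on $\mathbf{A} = \mathbf{a}$ and $\sigma$, model (3.7) with the $W^{\mathbf{A}}$ prior is ordinary Gaussian process regression. As a minimal sketch (ours, not the paper's implementation; a full posterior would also mix over $\mathbf{a}$ and $\sigma$, e.g. by MCMC), the following computes the conditional posterior mean of $\mu$ under a squared exponential kernel with fixed rescalings $\mathbf{a}$.

```python
import numpy as np

def gp_posterior_mean(X, y, Xstar, a, sigma2):
    """Posterior mean of mu at Xstar in model (3.7), for fixed rescalings a.

    Prior: mu ~ GP(0, c) with c(s, t) = exp(-sum_j a_j (s_j - t_j)^2).
    Standard GP regression: E[mu(x*) | y] = k* (K + sigma2 I)^{-1} y.
    """
    def kern(S, T):
        diff2 = (S[:, None, :] - T[None, :, :]) ** 2
        return np.exp(-np.tensordot(diff2, a, axes=([2], [0])))

    K = kern(X, X) + sigma2 * np.eye(len(X))
    return kern(Xstar, X) @ np.linalg.solve(K, y)

rng = np.random.default_rng(4)
X = rng.uniform(size=(100, 2))
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(100)   # depends on x_1 only
mu_hat = gp_posterior_mean(X, y, X, a=np.array([30.0, 0.5]), sigma2=0.01)
```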

Theorem 3.4. Consider the nonparametric regression model (3.7) with the law of $W^{\mathbf{A}}$ as a prior on $\mu$ and the prior on $\mathbf{A}$ as in (P1)-(P4). Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive numbers and $I$ a subset of $\{1, \ldots, d\}$. If $\mu_0 \in C^{\boldsymbol{\alpha}}[0,1]^I$, then (3.8) holds with $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} \log^{\kappa} n$, where $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$ and $\kappa > 0$ is a positive constant.

Thus, one obtains the minimax optimal rate up to a log factor, adapting to the unknown dimensionality and anisotropic smoothness.

Remark 3.5. Consider the random design setting, i.e., when $x_i \sim Q$ i.i.d., where $Q$ is a probability measure on $[0,1]^d$. Then the conclusion of Theorem 3.4 holds with $\|\mu\|_n$ replaced by $\|\mu\|_{2,Q}$, where $\|\mu\|_{2,Q}^2 = \int_{[0,1]^d} \mu^2(t) q(t) \, dt$ and $q$ is the density of $Q$ with respect to the Lebesgue measure on $[0,1]^d$.

Remark 3.6. Theorem 3.4 guarantees adaptive estimation of the regression surface. It is often of interest additionally to select the important variables affecting the response $y$. In the context of variable selection in a normal linear model, Barbieri and Berger (2004) advocated using the median probability model, consisting of the variables with posterior marginal inclusion probability greater than or equal to one half, and proved the predictive optimality of such models. The same approach could be followed here for variable selection; however, the issue of optimality needs to be studied.

A similar adaptation result holds for density estimation using the logistic Gaussian process, where an unknown density $g$ on the hypercube $[0,1]^d$ is modeled as

(3.9)  $g(t) = \frac{e^{\mu(t)}}{\int_{[0,1]^d} e^{\mu(s)} \, ds}$

for a function $\mu : \mathbb{R}^d \to \mathbb{R}$. Suppose $X_1, \ldots, X_n$ are drawn i.i.d. from a continuous, everywhere positive density $g_0$ on $[0,1]^d$.

Theorem 3.7. Suppose one uses the law of $W^{\mathbf{A}}$ as a prior on $\mu$ in (3.9), with the prior on $\mathbf{A}$ as in (P1)-(P4). Let $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)$ be a vector of positive numbers and $I$ a subset of $\{1, \ldots, d\}$. If $\mu_0 = \log g_0 \in C^{\boldsymbol{\alpha}}[0,1]^I$, then the posterior contracts at the rate $\epsilon_n = n^{-\alpha_{0I}/(2\alpha_{0I}+1)} \log^{\kappa} n$ with respect to the Hellinger distance, where $\alpha_{0I}^{-1} = \sum_{j \in I} \alpha_j^{-1}$.

The proofs of Theorems 3.4 and 3.7 follow in a straightforward manner from our main results in Theorems 3.1 and 3.2. We do not provide a proof here, since the steps are very similar to those in Section 3 of van der Vaart and van Zanten (2008a).
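The logistic transformation (3.9) is easy to implement on a grid. A minimal numerical sketch (ours, not from the paper) follows; the max-subtraction is a standard stabilization trick.

```python
import numpy as np

def logistic_gp_density(mu_vals, cell_volume):
    """Normalize exp(mu) on a grid: a discrete version of (3.9).

    mu_vals : values of a function mu on a regular grid over [0,1]^d;
    cell_volume : volume of one grid cell (for the Riemann-sum integral).
    """
    w = np.exp(mu_vals - mu_vals.max())        # stabilized exponentiation
    return w / (w.sum() * cell_volume)

# Example on [0,1]: a density proportional to exp(sin(2*pi*t)).
t = np.linspace(0, 1, 1001)
g = logistic_gp_density(np.sin(2 * np.pi * t), cell_volume=t[1] - t[0])
assert abs(np.sum(g) * (t[1] - t[0]) - 1.0) < 1e-12   # integrates to one
```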

3.5. Lower bounds on posterior contraction rates. We now demonstrate, in the regression function estimation setting (3.7), that a Gaussian process prior with a single bandwidth leads to a sub-optimal rate of convergence when the true regression function $\mu_0$ has anisotropic smoothness. For simplicity of exposition, we will not consider estimating $\sigma$ here; it is assumed to be fixed and known. Assume $x_i \sim Q$ i.i.d., where $Q$ is a distribution on $[0,1]^d$ admitting a density $q$ with respect to the Lebesgue measure on $[0,1]^d$. Consider the following assumption on the true regression function $\mu_0$.

(LBT) Let $I = \{1, 2, \ldots, d-1\}$ with $d \ge 5$. Assume $\mu_0(t) = \tilde{\mu}_0(t_I) \eta(t_d)$, where $\tilde{\mu}_0 \in C^{\alpha}[0,1]^{d-1}$ has support $[v, w]^{d-1}$ for $0 < v < w < 1$ and $\eta : [0,1] \to \mathbb{R}$ is infinitely differentiable with support $[v, w]$. Assume further that there exists $0 < u < 1$ such that, for any $\|\lambda_I\| > 1$ and large $|\lambda_d|$, the Fourier transform of $\mu_0$ satisfies

(3.10)  $|\hat{\mu}_0(\lambda)| \ge K \|\lambda_I\|^{-k} \exp\{-|\lambda_d|^u\}$

for $k = \alpha + \delta + 1$ with $\delta \in (0, 1)$ and $K > 0$. An explicit construction of such a function $\mu_0$ is provided in Remark 3.8 below.

Remark 3.8. Let $\tilde{\mu}_0$ be a function as in the statement of Theorem 8 in van der Vaart and van Zanten (2011). Let $\eta : [0,1] \to \mathbb{R}$ be given by $\eta(x) = \Psi\{2(x - v)/(w - v) - 1\}$, where $\Psi(t) = \exp\{-1/(1 - t^2)\} \chi_{\{|t| < 1\}}$. $\Psi$ is an example of a $C^{\infty}$ bump function with support $[-1, 1]$ (see, for example, Section 13 of Tu (2010)), and $\eta$ shifts and scales $\Psi$ to have support $[v, w]$. From Johnson (2007), $\hat{\Psi}(\lambda)$ decays like $\exp(-C\sqrt{|\lambda|})$ for large $|\lambda|$. With these choices, if we set $\mu_0(t) = \tilde{\mu}_0(t_I) \eta(t_d)$, then (3.10) is clearly satisfied.

In the following, we shall consider a rescaled Gaussian process $W^A$ for a positive random variable $A$ stochastically independent of $W$. Consider

(LBP) $W^A$ as the prior for the regression function $\mu$ in (3.7), with a prior on $A$ specified by $A^d \sim h$, where $h$ is a $\mathrm{gamma}(b_1, b_2)$ density.

Recall that (LBP) results in the minimax rate of contraction adaptively over $\mu_0$ in the isotropic $\alpha$-Hölder class of $d$ variables, for any $\alpha > 0$. In the following Theorem 3.9, we show that this specification involving a single bandwidth leads to a sub-optimal rate with respect to the $L_2$ norm corresponding to the covariate distribution, $\|w\|_2^2 = \int_{[0,1]^d} w^2(x) q(x) \, dx$, for a true regression function $\mu_0$ satisfying (LBT).
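The bump function $\Psi$ and its shifted, rescaled version $\eta$ from Remark 3.8 can be written out directly; the sketch below (ours) evaluates them and verifies the support $[v, w]$.

```python
import numpy as np

def bump(t):
    """C-infinity bump Psi(t) = exp(-1/(1 - t^2)) on |t| < 1, zero outside."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = np.abs(t) < 1
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

def eta(x, v, w):
    """Shifted/rescaled bump with support [v, w], as in Remark 3.8."""
    return bump(2.0 * (np.asarray(x) - v) / (w - v) - 1.0)

x = np.linspace(0, 1, 1001)
vals = eta(x, v=0.2, w=0.8)
assert vals[x <= 0.2].max() == 0.0 and vals[x >= 0.8].max() == 0.0
```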

Theorem 3.9. Consider (3.7) with (LBP) as the prior, and assume the true regression function $\mu_0$ satisfies (LBT) for some appropriately chosen $\alpha > 0$. Then

(3.11)  $\Pi\big( \|\mu - \mu_0\|_2 \le n^{-\alpha/(2\alpha+d)} \log^{t_0} n \mid Y_1, \ldots, Y_n, x_1, \ldots, x_n \big) \to 0$ a.s.

as $n \to \infty$, for some constant $t_0 > 0$.

Since $\tilde{\mu}_0 \in C^{\alpha}[0,1]^{d-1}$ and $\eta$ is infinitely differentiable, the minimax rate for estimating $\mu_0$ is faster than $n^{-\alpha/(2\alpha+d)} \log^{t_0} n$ by a power of $n$. Hence Theorem 3.9 shows that a single bandwidth leads, in this case, to a sub-optimal rate of posterior convergence.

4. Auxiliary results. In this section, we present a number of auxiliary results that are crucially used to prove the main results.

4.1. Properties of the multi-bandwidth Gaussian process. We first summarize some properties of the RKHS of the scaled process $W^{\mathbf{a}}$ for a fixed vector of scales $\mathbf{a}$. Lemmata 4.1-4.4 generalize the results in Section 4 of van der Vaart and van Zanten (2009) from a single scaling to a vector of scales; we briefly sketch the proofs, emphasizing the places where we differ substantially from van der Vaart and van Zanten (2009).

Assume that the spectral measure $\nu$ of $W$ has a spectral density $f$. For $\mathbf{a} \in \mathbb{R}_{>0}^d$, the rescaled process $W^{\mathbf{a}}$ has spectral measure $\nu_{\mathbf{a}}$ given by $\nu_{\mathbf{a}}(B) = \nu(B./\mathbf{a})$. Further, $\nu_{\mathbf{a}}$ admits a spectral density $f_{\mathbf{a}}$, with $f_{\mathbf{a}}(\lambda) = (\prod \mathbf{a})^{-1} f(\lambda./\mathbf{a})$. For $w_0 \in C[0,1]^d$, define $\phi^{\mathbf{a}}_{w_0}(\epsilon)$ to be the concentration function of the rescaled Gaussian process $W^{\mathbf{a}}$. As a straightforward extension of Lemmas 4.1 and 4.2 in van der Vaart and van Zanten (2009), the RKHS of the process $W^{\mathbf{a}}$ can be characterized as below.

Lemma 4.1. The RKHS $\mathbb{H}^{\mathbf{a}}$ of the process $\{W^{\mathbf{a}}_t : t \in [0,1]^d\}$ consists of the real parts of the functions

$t \mapsto \int e^{i\langle\lambda, t\rangle} g(\lambda) \, \nu_{\mathbf{a}}(d\lambda)$,

where $g$ runs over the complex Hilbert space $L_2(\nu_{\mathbf{a}})$. Further, the RKHS norm of the element in the above display is given by $\|g\|_{L_2(\nu_{\mathbf{a}})}$.

Lemma 4.3 of van der Vaart and van Zanten (2009) shows that for any isotropic Hölder smooth function $w$, convolutions with an appropriately chosen class of higher-order kernels indexed by the scaling parameter $a$ belong to the RKHS. This suggests that by driving the bandwidth $1/a$ to zero, one can obtain improved approximations to any Hölder smooth function. The following Lemma 4.2 illustrates the usefulness of separate bandwidths for each dimension when approximating anisotropic Hölder functions from the RKHS.

Lemma 4.2. Assume $\nu$ has a density with respect to the Lebesgue measure which is bounded away from zero on a neighborhood of the origin. Let $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$ be given. Then, for any subset $I$ of $\{1, \ldots, d\}$ and $w \in C^{\boldsymbol{\alpha}}[0,1]^I$, there exist constants $C$ and $D$ depending only on $\nu$ and $w$ such that, for the $a_i$'s large enough,

$\inf\Big\{ \|h\|^2_{\mathbb{H}^{\mathbf{a}}} : \|h - w\|_{\infty} \le C \sum_{i \in I} a_i^{-\alpha_i} \Big\} \le D \prod \mathbf{a}$.

Proof. We shall prove the result for $w \in C^{\boldsymbol{\alpha}}[0,1]^d$ and sketch an argument for extending the proof to any $w \in C^{\boldsymbol{\alpha}}[0,1]^I$. Let $\psi_j$, $j = 1, \ldots, d$, be a set of higher-order kernels which satisfy $\int \psi_j(t_j) \, dt_j = 1$, $\int t_j^k \psi_j(t_j) \, dt_j = 0$ for any positive integer $k$, and $\int |t_j|^{\alpha_j} |\psi_j(t_j)| \, dt_j < \infty$. Define $\psi : \mathbb{R}^d \to \mathbb{C}$ by $\psi(t) = \psi_1(t_1) \cdots \psi_d(t_d)$, so that $\int_{\mathbb{R}^d} \psi(t) \, dt = 1$, $\int_{\mathbb{R}^d} t^{\mathbf{k}} \psi(t) \, dt = 0$ for any non-zero multi-index $\mathbf{k} = (k_1, \ldots, k_d)$, and the functions $\hat{\psi}/f$ and $\hat{\psi}^2/f$ are uniformly bounded, where $\hat{\psi}$ denotes the Fourier transform of $\psi$. For a vector of positive numbers $\mathbf{a} = (a_1, \ldots, a_d)$, let $\psi_{\mathbf{a}}(t) = (\prod \mathbf{a}) \, \psi(\mathbf{a} \cdot t)$, where $\prod \mathbf{a} = \prod_{j=1}^{d} a_j$. Proceeding as in Lemma 4.3 of van der Vaart and van Zanten (2009), one can show that the convolution $\psi_{\mathbf{a}} * w$ is contained in the RKHS $\mathbb{H}^{\mathbf{a}}$ and that the squared RKHS norm of $\psi_{\mathbf{a}} * w$ is bounded by $D \prod \mathbf{a}$, with $D$ depending only on $\nu$ and $w$. Thus, the proof of Lemma 4.2 will be complete if we can show that $\|\psi_{\mathbf{a}} * w - w\|_{\infty} \le C \sum_{j=1}^{d} a_j^{-\alpha_j}$. We have, for any $t \in \mathbb{R}^d$,

$\psi_{\mathbf{a}} * w(t) - w(t) = \int \psi(s) \{w(t - s./\mathbf{a}) - w(t)\} \, ds$.

For $1 \le j \le d-1$, let $u^{(j)}$ denote the vector in $\mathbb{R}^d$ with $u^{(j)}_i = 0$ for $i = 1, \ldots, j$ and $u^{(j)}_i = 1$ for $i = j+1, \ldots, d$. For any two vectors $x, y \in \mathbb{R}^d$, we can navigate from $x$ to $y$ in a piecewise linear fashion, traveling parallel to one coordinate axis at a time. The vertices of the path are given by $x^{(0)} = x$, $x^{(j)} = u^{(j)} \cdot x + (1 - u^{(j)}) \cdot y$ for $j = 1, \ldots, d-1$, and $x^{(d)} = y$. A multivariate Taylor expansion of $w(t - s./\mathbf{a})$ around $w(t)$ cannot take advantage of the anisotropic smoothness of $w$ across the different coordinate axes.

Letting $x = t$, $y = t - s./\mathbf{a}$ and $x^{(j)}$, $j = 0, 1, \ldots, d$, as above, let us instead write $w(y) - w(x)$ in the following telescoping form:

$w(y) - w(x) = \sum_{j=1}^{d} \big\{ w(x^{(j)}) - w(x^{(j-1)}) \big\} = \sum_{j=1}^{d} \big\{ w_j(t_j - s_j/a_j \mid x^{(j)}) - w_j(t_j \mid x^{(j)}) \big\}$,

where the functions $w_j$ are the univariate sections defined in Section 2, with $w_j(t \mid x) = w(x_1, \ldots, x_{j-1}, t, x_{j+1}, \ldots, x_d)$ for $t \in \mathbb{R}$ and $x \in \mathbb{R}^d$. Thus,

$w(t - s./\mathbf{a}) - w(t) = \sum_{j=1}^{d} \bigg[ \sum_{i=1}^{\lfloor \alpha_j \rfloor} D^i w_j(t_j \mid x^{(j)}) \frac{(-s_j/a_j)^i}{i!} + S_j(t_j, s_j/a_j) \bigg]$,

where $|S_j(t_j, s_j/a_j)| \le K |s_j|^{\alpha_j} a_j^{-\alpha_j}$ by (2.2), for a constant $K$ depending on $\nu$ and $w$ but not on $t$ and $s$. Combining the above, and using that the polynomial terms integrate to zero by the moment conditions on $\psi$, we have

$\int \psi(s) \{w(t - s./\mathbf{a}) - w(t)\} \, ds = \sum_{j=1}^{d} \int \psi_j(s_j) S_j(t_j, s_j/a_j) \, ds_j \le C \sum_{j=1}^{d} a_j^{-\alpha_j}$.

If $w \in C^{\boldsymbol{\alpha}}[0,1]^I$ for some subset $I$ of $\{1, \ldots, d\}$ with $|I| = d'$, so that $w(t) = w_0(t_I)$ for some $w_0 \in C^{\boldsymbol{\alpha}_I}[0,1]^{d'}$, then the conclusion follows trivially from the observation $\psi_{\mathbf{a}} * w = \psi_{\mathbf{a}_I} * w_0$.

We next study the centered small ball probability of the rescaled process and the metric entropy of the unit RKHS ball.

Lemma 4.3. For any positive $a_0$, there exist constants $C$ and $\epsilon_0 > 0$ such that, for $\underline{a} \ge a_0$ and $\epsilon < \epsilon_0$,

$-\log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big) \le C \prod \mathbf{a} \Big( \log \frac{\bar{a}}{\epsilon} \Big)^{d+1}$.

Proof. This follows from Theorem 2 in Kuelbs and Li (1993) and Lemma 4.6 in van der Vaart and van Zanten (2009). Proceeding as in Lemma 4.6 of van der Vaart and van Zanten (2009) and using Lemma 4.4 below, we obtain

(4.1)  $\phi^{\mathbf{a}}_0(\epsilon) + \log 0.5 \le K_1 \prod \mathbf{a} \Big( \log \frac{\phi^{\mathbf{a}}_0(\epsilon)}{\epsilon} \Big)^{1+d}$

for some constant $K_1 > 0$. Note that with $L = [0, a_1] \times \cdots \times [0, a_d]$,

$\phi^{\mathbf{a}}_0(\epsilon) = -\log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big) = -\log P\big( \sup_{t \in L} |W_t| \le \epsilon \big) \le -\log P\big( \sup_{t \in [0, \bar{a}]^d} |W_t| \le \epsilon \big) \le K_2 \Big( \frac{\bar{a}}{\epsilon} \Big)^{\tau}$

for some constants $K_2$ and $\tau > 0$, where the last inequality follows from the proof of Lemma 4.6 in van der Vaart and van Zanten (2009). Inserting this bound in (4.1), we obtain the desired result.

Let $\mathbb{H}^{\mathbf{a}}_1$ denote the unit ball in the RKHS of $W^{\mathbf{a}}$.

Lemma 4.4. There exists a constant $K$, depending only on $\nu$ and $d$, such that, for $\epsilon < 1/2$,

$\log N\big( \epsilon, \mathbb{H}^{\mathbf{a}}_1, \|\cdot\|_{\infty} \big) \le K \prod \mathbf{a} \Big( \log \frac{1}{\epsilon} \Big)^{d+1}$.

Proof. By Lemma 4.1, an element of $\mathbb{H}^{\mathbf{a}}_1$ can be written as the real part of the function $h : [0,1]^d \to \mathbb{C}$ given by

(4.2)  $h(t) = \int e^{i\langle\lambda, t\rangle} \psi(\lambda) \, \nu_{\mathbf{a}}(d\lambda)$

for $\psi : \mathbb{R}^d \to \mathbb{C}$ a function with $\int |\psi(\lambda)|^2 \, \nu_{\mathbf{a}}(d\lambda) \le 1$. For $z \in \mathbb{C}^d$, continue to denote by $h$ the function $z \mapsto \int e^{\langle\lambda, z\rangle} \psi(\lambda) \, \nu_{\mathbf{a}}(d\lambda)$. Using the Cauchy-Schwarz inequality and the change of variable theorem,

(4.3)  $|h(z)|^2 \le \int e^{\langle\lambda, 2\mathbf{a} \cdot \mathrm{Re}(z)\rangle} \, \nu(d\lambda)$,

where $\mathrm{Re}(z)$ denotes the vector whose $j$th element is the real part of $z_j$ for $j = 1, \ldots, d$, and $\mathbf{a} \cdot \mathrm{Re}(z) = (a_1 \mathrm{Re}(z_1), \ldots, a_d \mathrm{Re}(z_d))^{\mathrm{T}}$. From (4.3) and the dominated convergence theorem, any $h \in \mathbb{H}^{\mathbf{a}}_1$ can be analytically extended to $\Gamma = \{z \in \mathbb{C}^d : \|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2 < \delta\}$. Clearly, $\Gamma$ contains a strip $\Omega$ in $\mathbb{C}^d$ given by $\Omega = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j,\ j = 1, \ldots, d\}$ with $R_j = \delta/(6 a_j \sqrt{d})$. Also, for every $z \in \Omega$, $h$ satisfies the uniform bound $|h(z)|^2 \le \int e^{\delta\|\lambda\|} \, \nu(d\lambda) =: C^2$.

Let $R = (R_1, \ldots, R_d)^{\mathrm{T}}$. Partition $T = [0,1]^d$ into rectangles $\Gamma_1, \ldots, \Gamma_m$ with centers $\{t_1, t_2, \ldots, t_m\}$ such that, given any $z \in T$, there exists $\Gamma_j$ with center $t_j = (t_{j1}, \ldots, t_{jd})^{\mathrm{T}}$ satisfying $|z_i - t_{ji}| \le R_i/4$, $i = 1, \ldots, d$.

Consider the piecewise polynomials $P = \sum_{j=1}^{m} P_{j,\gamma_j} 1_{\Gamma_j}$ with $P_{j,\gamma_j}(t) = \sum_{\mathbf{n}_{\cdot} \le k} \gamma_{j,\mathbf{n}} (t - t_j)^{\mathbf{n}}$. A finite set of functions $\mathcal{P}_{\mathbf{a}}$ is obtained by discretizing the coefficients $\gamma_{j,\mathbf{n}}$, for each $j$ and $\mathbf{n}$, over a grid of mesh width $\epsilon/R^{\mathbf{n}}$ in the interval $[-C/R^{\mathbf{n}}, C/R^{\mathbf{n}}]$, with $R^{\mathbf{n}} = R_1^{n_1} \cdots R_d^{n_d}$ and $C$ defined as above. Choosing $m \asymp \prod_{j=1}^{d}(1/R_j)$ and $k$ such that $k \asymp \log(1/\epsilon)$ and $(2/3)^k \le K\epsilon$ for some constant $K$, the collection $\mathcal{P}_{\mathbf{a}}$ can be shown to form a $K\epsilon$-net for $\mathbb{H}^{\mathbf{a}}_1$ using (4.4) and (4.5):

(4.4)  $\sum_{\mathbf{n}_{\cdot} > k} \Big| \frac{D^{\mathbf{n}} h_{\psi}(t_i)}{\mathbf{n}!} (z - t_i)^{\mathbf{n}} \Big| \le C \sum_{\mathbf{n}_{\cdot} > k} \frac{(R/2)^{\mathbf{n}}}{R^{\mathbf{n}}} \le K C (2/3)^k$,

(4.5)  $\Big| \sum_{\mathbf{n}_{\cdot} \le k} \frac{D^{\mathbf{n}} h_{\psi}(t_i)}{\mathbf{n}!} (z - t_i)^{\mathbf{n}} - P_{i,\gamma_i}(z) \Big| \le K\epsilon$.

The details are similar to Lemma 4.5 in van der Vaart and van Zanten (2009) and hence omitted.

Lemma 4.7 of van der Vaart and van Zanten (2009) exploited a containment relation among unit RKHS balls with different scalings to construct the sieves $B_n$. Such a result sufficed in the single-bandwidth case by exploiting the total ordering of $\mathbb{R}_+$. However, the result can only be generalized with respect to the partial order on $\mathbb{R}_+^d$, and one does not obtain a straightforward generalization of their sieve construction in the multi-bandwidth case, since the entropy of their sieve blows up when one tries to control the joint probability of the rescaling vector $\mathbf{a}$ lying outside a hyper-rectangle in $\mathbb{R}_+^d$. This problem is fundamentally due to the curse of dimensionality, and one needs a more careful construction of the sieve to avoid it. In the proof of Lemma 4.4, a collection of piecewise polynomials is used to cover the unit RKHS ball $\mathbb{H}^{\mathbf{a}}_1$. The main idea in the next Lemma 4.5 is to exploit the fact that the same set of piecewise polynomials can also be used to cover $\mathbb{H}^{\mathbf{b}}_1$ for $\mathbf{b}$ sufficiently close to $\mathbf{a}$. We then come up with a careful choice of a compact subset $Q$ of $\mathbb{R}_+^d$ that balances the metric entropy of the collection of unit RKHS balls $\mathbb{H}^{\mathbf{a}}_1$ with $\mathbf{a} \in Q$ against the complement probability of $Q$ under the joint prior on $\mathbf{a}$.

For $u \in \mathbb{R}_+^d$, let $C_u$ denote the rectangle in the positive quadrant given by $\mathbf{a} \le u$, i.e., $0 \le a_j \le u_j$ for all $j = 1, \ldots, d$. For a fixed $r > 1$, let $Q^{(r)}$ consist of vectors $\mathbf{a}$ with $a_j \le r^{\theta_j}$, $j = 1, \ldots, d$, for some $\theta \in \mathcal{S}^{d-1}$.

It is easy to see that $Q^{(r)}$ is a union of rectangles $C_{r^{\theta}}$ with $\theta$ varying over $\mathcal{S}^{d-1}$:

$Q^{(r)} = \bigcup_{\theta \in \mathcal{S}^{d-1}} C_{r^{\theta}}$.

The outer boundary of $Q^{(r)}$ consists of points $\mathbf{a}$ with $0 < a_j \le r$ for all $j = 1, \ldots, d$ and $\prod \mathbf{a} = r$ (see Figure 1). By Lemma 4.4, for any such $\mathbf{a}$ on the outer boundary of $Q^{(r)}$, the metric entropy of $\mathbb{H}^{\mathbf{a}}_1$ is bounded by a constant multiple of $r \log^{d+1}(1/\epsilon)$. Our technique was motivated by the observation that the entropy remains of the same order even if one takes a union over the outer boundary of $Q^{(r)}$. We present a stronger result in Lemma 4.5, which states that the entropy remains of the same order even if the union is taken over all of $Q^{(r)}$.

Fig 1. Left panel: for $d = 2$ and fixed $r > 1$, the rectangles $C_{r^{\theta}} = \{0 \le \mathbf{a} \le r^{\theta}\}$ for different values of $\theta \in \mathcal{S}^{d-1}$. Right panel: the region $Q^{(r)}$ (shaded) resulting from the union of all such rectangles.
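Although not stated in the paper, the definition of $Q^{(r)}$ yields a simple membership test: $\mathbf{a} \in Q^{(r)}$ exactly when $\prod_j \max(a_j, 1) \le r$, since taking logarithms shows that a suitable $\theta$ on the simplex exists if and only if $\sum_j \max(\log a_j, 0) \le \log r$. The sketch below (ours) implements this observation and constructs an explicit witness $\theta$.

```python
import numpy as np

def in_Q(a, r):
    """Membership a in Q^(r): exists theta in the simplex with a_j <= r^{theta_j}.

    Taking logs, such a theta exists iff sum_j max(log a_j, 0) <= log r,
    i.e. prod_j max(a_j, 1) <= r.
    """
    return np.prod(np.maximum(a, 1.0)) <= r

def witness_theta(a, r):
    """Construct an explicit theta in S^{d-1} certifying a in Q^(r)."""
    t = np.maximum(np.log(a), 0.0) / np.log(r)   # minimal exponents needed
    t += (1.0 - t.sum()) / len(t)                # spread slack; only relaxes a_j <= r^theta_j
    return t

a, r = np.array([2.0, 0.5, 3.0]), 10.0
assert in_Q(a, r)                                # 2 * 1 * 3 = 6 <= 10
theta = witness_theta(a, r)
assert abs(theta.sum() - 1.0) < 1e-12 and np.all(a <= r ** theta)
```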

Lemma 4.5. Let $\nu$ satisfy (3.1) and fix $r \ge 1$. Then there exists a constant $K$, depending on $\nu$ and $d$ only, such that, for $\epsilon < 1/2$,

$\log N\Big( \epsilon, \bigcup_{\mathbf{a} \in Q^{(r)}} \mathbb{H}^{\mathbf{a}}_1, \|\cdot\|_{\infty} \Big) \le K r \Big( \log \frac{1}{\epsilon} \Big)^{d+1}$.

Proof. Let $Q = Q^{(r)}$ and fix $\mathbf{a} \in Q$. Let $I$ denote the subset of $\{1, \ldots, d\}$ such that $a_j \le 1$ for all $j \in I$ and $a_j > 1$ for all $j \notin I$. Let $\Omega_{\mathbf{a}} = \{z \in \mathbb{C}^d : |\mathrm{Re}(z_j)| \le R_j,\ j = 1, \ldots, d\}$, with $R_j = \delta/(6 a_j \sqrt{d})$ if $j \notin I$ and $R_j = \delta/(6\sqrt{d})$ if $j \in I$. Note that for $z \in \Omega_{\mathbf{a}}$,

(4.6)  $\|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2^2 = \sum_{j=1}^{d} 4a_j^2 \mathrm{Re}(z_j)^2 \le \sum_{j \in I} 4R_j^2 + \sum_{j \notin I} 4a_j^2 R_j^2 \le \frac{\delta^2}{9}$.

Hence $\|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2 \le \delta/3$ for $z \in \Omega_{\mathbf{a}}$. Following the argument after the display in (4.3), it thus follows that any function $h \in \mathbb{H}^{\mathbf{a}}_1$ has an analytic extension to $\Omega_{\mathbf{a}}$.

Let $\mathbf{b} \in Q$ satisfy $\max_j |a_j - b_j| \le 1$. We shall show that any $h \in \mathbb{H}^{\mathbf{b}}_1$ can also be extended analytically to the same strip $\Omega_{\mathbf{a}}$ by showing that $\|2\mathbf{b} \cdot \mathrm{Re}(z)\|_2 < \delta$ on $\Omega_{\mathbf{a}}$. To that end, for $z \in \Omega_{\mathbf{a}}$, first observe that

(4.7)  $\|2\mathbf{b} \cdot \mathrm{Re}(z)\|_2 \le \|2\mathbf{a} \cdot \mathrm{Re}(z)\|_2 + \|2(\mathbf{b} - \mathbf{a}) \cdot \mathrm{Re}(z)\|_2$.

The first term on the right-hand side is bounded by $\delta/3$ by (4.6). To tackle the second term, we use $|b_j - a_j| \le 1 \le a_j$ for $j \notin I$ to conclude that

(4.8)  $\|2(\mathbf{b} - \mathbf{a}) \cdot \mathrm{Re}(z)\|_2^2 = \sum_{j=1}^{d} 4(b_j - a_j)^2 \mathrm{Re}(z_j)^2 \le \sum_{j \in I} 4R_j^2 + \sum_{j \notin I} 4a_j^2 R_j^2 \le \frac{\delta^2}{9}$.

Combining (4.7) and (4.8), $\|2\mathbf{b} \cdot \mathrm{Re}(z)\|_2 \le 2\delta/3$ on $\Omega_{\mathbf{a}}$, proving our claim. Since the same tail estimate as in (4.4) works for any $h \in \mathbb{H}^{\mathbf{b}}_1$, it follows from (4.5) that the set of functions $\mathcal{P}_{\mathbf{a}}$ forms a $K\epsilon$-net for $\mathbb{H}^{\mathbf{b}}_1$. Let $\mathcal{A}$ be a set of points in $Q$ such that for any $\mathbf{b} \in Q$ there exists $\mathbf{a} \in \mathcal{A}$ with $\max_j |a_j - b_j| \le 1$; one can clearly find such an $\mathcal{A}$ with $|\mathcal{A}| \lesssim r^d$. The proof is completed by observing that $\bigcup_{\mathbf{a} \in \mathcal{A}} \mathcal{P}_{\mathbf{a}}$ forms a $K\epsilon$-net for $\bigcup_{\mathbf{a} \in Q} \mathbb{H}^{\mathbf{a}}_1$.

4.2. Results for the lower bound. We now prove Lemma 4.6, which enables us to derive a lower bound on the concentration function $\phi^a(\epsilon)$ for a fixed $a > 0$. This lower bound, coupled with the model identifiability of (3.7), yields a lower bound on the posterior concentration rate.

Denote by $\mathbb{H}^a$ the reproducing kernel Hilbert space of the Gaussian process $W^a$. The key to obtaining a lower bound on the concentration function $\phi^a_{\mu_0}(\epsilon)$ is to find a lower bound on $\inf_{h \in \mathbb{H}^a : \|h - \mu_0\|_2 \le \epsilon} \frac{1}{2}\|h\|^2_{\mathbb{H}^a}$ when $\mu_0$ has anisotropic smoothness as in (LBT). Let $h^a_{\psi}$ be the real part of the function $t \mapsto \int e^{i\langle\lambda, t\rangle} \psi(\lambda) f_a(\lambda) \, d\lambda$, where $f_a(\lambda)$ denotes the spectral density corresponding to the spectral measure $\nu_a$ of $W^a$, so that $f_a(\lambda) = a^{-d} f(\lambda/a)$ with $f(\lambda) = e^{-\|\lambda\|^2/4}/(2^d \pi^{d/2})$. From van der Vaart and van Zanten (2009), the RKHS $\mathbb{H}^a$ consists of the functions $h^a_{\psi}$ for $\psi \in L_2(\nu_a)$. For given $\epsilon > 0$, let $\psi \in L_2(\nu_a)$ be such that $\|h^a_{\psi} - \mu_0\|_2 < \epsilon$.

Lemma 4.6. If $\mu_0$ satisfies (LBT) and $a < \epsilon^{-1/\alpha}$, then

(4.9)  $\|h^a_{\psi}\|_{\mathbb{H}^a} \gtrsim a^{d/2} e^{c(1/\epsilon)^{\min(\delta_1, \delta_2)}}$,

where $\delta_1 = \frac{2}{k - 0.5(d-1)} - \frac{2}{\alpha}$ and $\delta_2 = \frac{1}{2\{k - 0.5(d-1)\}}$.

Proof. Using Proposition A.1 in Appendix A, let $r$ be a function that equals $1$ on the support of $\mu_0$, itself has support inside $[0,1]^d$, and whose Fourier transform satisfies $|\hat{r}(\lambda)| \lesssim e^{-K\sqrt{\|\lambda\|}}$ for large $\|\lambda\|$. Then $h^a_{\psi} r$ has support inside $[0,1]^d$ and $\mu_0 r = \mu_0$, so that

$\|h^a_{\psi} r - \mu_0\|_{2,\mathbb{R}} \le \|h^a_{\psi} - \mu_0\|_2$,

where $\|\cdot\|_{2,\mathbb{R}}$ is the norm of $L_2(\mathbb{R}^d)$ and $\|\cdot\|_2$ the norm of $L_2[0,1]^d$. The function $h^a_{\psi} r$ has Fourier transform $(\psi_1 f_a) * \hat{r}$, where $\psi_1(\lambda) = \psi(-\lambda)$. Hence, by Parseval's identity, $\|(\psi_1 f_a) * \hat{r} - \hat{\mu}_0\|_{2,\mathbb{R}} < \epsilon$. Letting $\chi_{K_1, K_2}$ denote the indicator of the set $\{\lambda \in \mathbb{R}^d : \|\lambda_I\| > K_1,\ \lambda_d \in [K_2, 2K_2]\}$, we have

$\|\{(\psi_1 f_a) * \hat{r}\}\, \chi_{2K_1, K_2}\|_{2,\mathbb{R}} \ge \|\hat{\mu}_0 \chi_{2K_1, K_2}\|_{2,\mathbb{R}} - \epsilon$.

Also, by (3.10),

$\|\hat{\mu}_0 \chi_{2K_1, K_2}\|^2_{2,\mathbb{R}} \ge C \Big( \frac{1}{K_1} \Big)^{2k - (d-1)} \exp\{-cK_2^u\}$.

Using Proposition B.1 in Appendix B with $2K_1$ instead of $K_1$, $A_1 = K_1$, $f = \psi_1 f_a$ and $g = \hat{r}$,

(4.10)  $\|\psi_1 f_a \chi_{K_1, K_2}\|_{2,\mathbb{R}} \int_{\|t_I\| \le K_1} |\hat{r}(t)| \, dt \ge C \Big( \frac{1}{K_1} \Big)^{k - 0.5(d-1)} e^{-cK_2^u} - \epsilon - \|\psi_1 f_a\|_{2,\mathbb{R}} \int_{\|t_I\| > K_1} |\hat{r}(t)| \, dt$.

Since $\|h^a_{\psi}\|_{\mathbb{H}^a} = \|\psi \sqrt{f_a}\|_{2,\mathbb{R}}$ and $f_a$ is symmetric about zero,

(4.11)  $\|\psi_1 f_a\|_{2,\mathbb{R}} \le C a^{-d/2} \|h^a_{\psi}\|_{\mathbb{H}^a}$.

Also,

(4.12)  $\|\psi_1 f_a \chi_{K_1, K_2}\|^2_{2,\mathbb{R}} = \int_{\|\lambda_I\| > K_1} (\psi_1^2 f_a) f_a \le \Big( \sup_{\|\lambda_I\| > K_1} f_a \Big) \|h^a_{\psi}\|^2_{\mathbb{H}^a} \le C a^{-d} e^{-K_1^2/4a^2} \|h^a_{\psi}\|^2_{\mathbb{H}^a}$.

Combining (4.10)-(4.12),

(4.13)  $C \Big( \frac{1}{K_1} \Big)^{k - 0.5(d-1)} e^{-cK_2^u} - \epsilon \le C a^{-d/2} \|h^a_{\psi}\|_{\mathbb{H}^a} \Big( e^{-K_1^2/8a^2} \int_{\|t_I\| \le K_1} |\hat{r}(t)| \, dt + \int_{\|t_I\| > K_1} |\hat{r}(t)| \, dt \Big)$.

Fix $K_2$ to be a large number. Choose $K_1$ large enough, depending on $\epsilon$, such that $C(1/K_1)^{k - 0.5(d-1)} e^{-cK_2^u} = 2\epsilon$, so that $K_1 = (\tilde{C}/\epsilon)^{1/(k - 0.5(d-1))}$ for some $\tilde{C} > 0$. Then

(4.14)  $\int_{\|t_I\| > K_1} |\hat{r}(t)| \, dt \lesssim e^{-K\sqrt{K_1}} \lesssim e^{-K(1/\epsilon)^{1/\{2(k - 0.5(d-1))\}}}$.

We have $k < \alpha + 0.5(d-1)$, since $k = \alpha + \delta + 1$ for $\delta \in (0,1)$ and $d \ge 5$, so that for $a < \epsilon^{-1/\alpha}$,

$\frac{K_1^2}{8a^2} > C \Big( \frac{1}{\epsilon} \Big)^{\frac{2}{k - 0.5(d-1)} - \frac{2}{\alpha}}$,

implying

(4.15)  $e^{-K_1^2/8a^2} \le e^{-C(1/\epsilon)^{\frac{2}{k - 0.5(d-1)} - \frac{2}{\alpha}}}$.

Now (4.13), (4.14) and (4.15) imply $\|h^a_{\psi}\|_{\mathbb{H}^a} \gtrsim a^{d/2} e^{c(1/\epsilon)^{\min(\delta_1, \delta_2)}}$ if $a < \epsilon^{-1/\alpha}$.

5. Proof of main results. We shall only provide a detailed proof of Theorem 3.1 and sketch the main steps in the proof of Theorem 3.2.

5.1. Proof of Theorem 3.1. Let us begin by observing that

$P\big( \|W^{\mathbf{A}} - w_0\|_{\infty} \le 2\epsilon \big) = \int P\big( \|W^{\mathbf{a}} - w_0\|_{\infty} \le 2\epsilon \big) \pi_{\mathbf{A}}(d\mathbf{a}) = \int \Big\{ \int P\big( \|W^{\mathbf{a}} - w_0\|_{\infty} \le 2\epsilon \big) \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$.

As in van der Vaart and van Zanten (2009), we first derive bounds on the non-centered small ball probability for a fixed rescaling $\mathbf{a}$, and then integrate over the distribution of $\mathbf{a}$ to derive the same for $W^{\mathbf{A}}$. Given $\mathbf{a} \in \mathbb{R}_{>0}^d$, recall the definitions of the centered and non-centered concentration functions of the process $W^{\mathbf{a}}$:

(5.1)  $\phi^{\mathbf{a}}_0(\epsilon) = -\log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big), \qquad \phi^{\mathbf{a}}_{w_0}(\epsilon) = \inf_{h \in \mathbb{H}^{\mathbf{a}} : \|h - w_0\|_{\infty} \le \epsilon} \frac{1}{2}\|h\|^2_{\mathbb{H}^{\mathbf{a}}} - \log P\big( \|W^{\mathbf{a}}\|_{\infty} \le \epsilon \big)$.

For a fixed $\mathbf{a}$, the non-centered small ball probability of $W^{\mathbf{a}}$ can be bounded in terms of the concentration function as follows (van der Vaart and van Zanten, 2008b):

$P\big( \|W^{\mathbf{a}} - w_0\|_{\infty} \le 2\epsilon \big) \ge e^{-\phi^{\mathbf{a}}_{w_0}(\epsilon)}$.

Now, suppose that $w_0 \in C^{\boldsymbol{\alpha}}[0,1]^d$ for some $\boldsymbol{\alpha} \in \mathbb{R}_{>0}^d$. From Lemmas 4.2 and 4.3, it follows that for every $a_0 > 0$, there exist positive constants $\epsilon_0 < 1/2$, $C$, $D$ and $E$ that depend only on $w_0$ and $\nu$ such that, for $\underline{a} > a_0$, $\epsilon < \epsilon_0$ and $C \sum_{i=1}^{d} a_i^{-\alpha_i} < \epsilon$,

$\phi^{\mathbf{a}}_{w_0}(\epsilon) \le D \prod \mathbf{a} + E \prod \mathbf{a} \Big( \log \frac{\bar{a}}{\epsilon} \Big)^{1+d} \le K_1 \prod \mathbf{a} \Big( \log \frac{\bar{a}}{\epsilon} \Big)^{1+d}$,

with $K_1$ depending only on $a_0$, $\nu$ and $d$. Thus, for $\epsilon < \min\{\epsilon_0, C_1 a_0^{-\bar{\alpha}}\}$, by (5.1), for constants $K_2, \ldots, K_6 > 0$ and $C_2, \ldots, C_6 > 0$,

$P\big( \|W^{\mathbf{A}} - w_0\|_{\infty} \le 2\epsilon \big) \ge \int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} e^{-\phi^{\mathbf{a}}_{w_0}(\epsilon)} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\ge \int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} e^{-K_1 \prod \mathbf{a} \log^{1+d}(\bar{a}/\epsilon)} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\ge C_2 e^{-K_2 (1/\epsilon)^{1/\alpha_0} \log^{1+d}(1/\epsilon)} \int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$.

Let $\Gamma$ denote the region in the simplex $\mathcal{S}^{d-1}$ given by $\Gamma = \{\theta \in \mathcal{S}^{d-1} : \tau < \theta_j - \alpha_0/\alpha_j < 2\tau,\ j = 1, \ldots, d-1\}$. Since $\sum_{j=1}^{d} \alpha_0/\alpha_j = 1$, we can choose $\tau > 0$ small enough to guarantee that any $\theta$ satisfying this set of inequalities lies inside the simplex. Moreover, with $\theta_d = 1 - \sum_{j=1}^{d-1} \theta_j$, one has $\alpha_0/\alpha_d - 2(d-1)\tau < \theta_d < \alpha_0/\alpha_d - (d-1)\tau$.

Choosing $\tau = C_3/\log(1/\epsilon)$, one can show that $\sum_{j=1}^{d} (1/\epsilon)^{1/(\alpha_j \theta_j)} \le C_4 (1/\epsilon)^{1/\alpha_0}$ for any $\theta \in \Gamma$. Now,

$\int_{\theta} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\gtrsim \int_{\theta \in \Gamma} \Big\{ \int_{a_1 = (C_1/\epsilon)^{1/\alpha_1}}^{2(C_1/\epsilon)^{1/\alpha_1}} \cdots \int_{a_d = (C_1/\epsilon)^{1/\alpha_d}}^{2(C_1/\epsilon)^{1/\alpha_d}} e^{-K_3 \sum_{j=1}^{d} a_j^{1/\theta_j}} \, d\mathbf{a} \Big\} \pi(\theta) \, d\theta$
$\ge \int_{\theta \in \Gamma} e^{-K_4 \sum_{j=1}^{d} (1/\epsilon)^{1/(\alpha_j \theta_j)}} \pi(\theta) \, d\theta \ge \int_{\theta \in \Gamma} e^{-K_4 C_4 (1/\epsilon)^{1/\alpha_0}} \pi(\theta) \, d\theta \ge C_5 e^{-K_5 (1/\epsilon)^{1/\alpha_0}}$.

The last inequality in the above display uses the fact that $\Gamma$ contains a hyper-cube of width $\tau$, so that its $\pi$-mass is at least polynomial in $\tau$. Hence,

(5.2)  $P\big( \|W^{\mathbf{A}} - w_0\|_{\infty} \le 2\epsilon \big) \ge C_6 e^{-K_6 (1/\epsilon)^{1/\alpha_0} \log^{1+d}(1/\epsilon)}$.

Let $B_1$ denote the unit sup-norm ball of $C[0,1]^d$. For a vector $\theta \in \mathcal{S}^{d-1}$ and positive constants $M$, $r > 1$ and $\epsilon$, let $B_{\theta} = B_{\theta}(M, r, \epsilon)$ denote the set

$B_{\theta} = \bigcup_{\mathbf{a} \le r^{\theta}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big) + \epsilon B_1$,

where $r^{\theta}$ denotes the vector whose $j$th element is $r^{\theta_j}$. We further let

$B = \bigcup_{\mathbf{a} \in Q^{(r)}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big) + \epsilon B_1 = \bigcup_{\theta \in \mathcal{S}^{d-1}} \bigcup_{\mathbf{a} \le r^{\theta}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big) + \epsilon B_1$.

Let us first calculate the probability $P(W^{\mathbf{A}} \notin B_{\theta} \mid \theta)$. Note that

$P(W^{\mathbf{A}} \notin B_{\theta} \mid \theta) = \int P(W^{\mathbf{a}} \notin B_{\theta}) \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} \le \int_{\mathbf{a} \le r^{\theta}} P(W^{\mathbf{a}} \notin B_{\theta}) \pi(\mathbf{a} \mid \theta) \, d\mathbf{a} + P(\mathbf{A} \not\le r^{\theta} \mid \theta)$,

where $P(\mathbf{A} \not\le r^{\theta} \mid \theta)$ is shorthand for the probability that at least one $A_j > r^{\theta_j}$, given $\theta$.

To tackle the first term in the last display, note that $B_{\theta}$ contains the set $M\mathbb{H}^{\mathbf{a}}_1 + \epsilon B_1$ for any $\mathbf{a} \le r^{\theta}$. Hence, for any $\mathbf{a} \le r^{\theta}$, by Borell's inequality,

$P(W^{\mathbf{a}} \notin B_{\theta}) \le P\big( W^{\mathbf{a}} \notin M\mathbb{H}^{\mathbf{a}}_1 + \epsilon B_1 \big) \le 1 - \Phi\big\{ M + \Phi^{-1}\big( e^{-\phi^{\mathbf{a}}_0(\epsilon)} \big) \big\} \le 1 - \Phi\big\{ M + \Phi^{-1}\big( e^{-\phi^{r^{\theta}}_0(\epsilon)} \big) \big\} \le e^{-\phi^{r^{\theta}}_0(\epsilon)}$,

if $M \ge -2\Phi^{-1}\big( e^{-\phi^{r^{\theta}}_0(\epsilon)} \big)$. The penultimate inequality in the above display follows from the fact that, with $T = [0,1]^d$,

$e^{-\phi^{\mathbf{a}}_0(\epsilon)} = P\big( \sup_{t \in \mathbf{a} \cdot T} |W_t| \le \epsilon \big) \ge P\big( \sup_{t \in r^{\theta} \cdot T} |W_t| \le \epsilon \big) = e^{-\phi^{r^{\theta}}_0(\epsilon)}$.

By Lemma 4.10 of van der Vaart and van Zanten (2009), $-\Phi^{-1}(u) \le \{2\log(1/u)\}^{1/2}$ for $u \in (0,1)$. Hence, the last inequality in the above display remains valid if we choose $M \ge 4\sqrt{\phi^{r^{\theta}}_0(\epsilon)}$.

Since $A_j^{1/\theta_j}$ follows a gamma distribution given $\theta_j$, in view of Lemma 4.9 of van der Vaart and van Zanten (2009), for $r$ larger than a positive constant depending only on the parameters of the gamma distribution,

$P(A_j > r^{\theta_j} \mid \theta) \le C_1 r^{D_1} e^{-D_2 r}$.

Combining the above, and since $B$ contains $B_{\theta}$ for every $\theta \in \mathcal{S}^{d-1}$,

(5.3)  $P(W^{\mathbf{A}} \notin B) = \int_{\theta} P(W^{\mathbf{A}} \notin B \mid \theta) \pi(\theta) \, d\theta \le \int_{\theta} P(W^{\mathbf{A}} \notin B_{\theta} \mid \theta) \pi(\theta) \, d\theta \le C_2 r^{D_1} e^{-D_2 r} + e^{-D_3 r \{\log(r/\epsilon)\}^{d+1}}$.

From Lemma 4.5, the entropy of $B$ can be estimated as

(5.4)  $\log N(2\epsilon, B, \|\cdot\|_{\infty}) \le \log N\Big( \epsilon, \bigcup_{\theta \in \mathcal{S}^{d-1}} \bigcup_{\mathbf{a} \le r^{\theta}} \big( M \mathbb{H}^{\mathbf{a}}_1 \big), \|\cdot\|_{\infty} \Big) \le K r \Big( \log \frac{M}{\epsilon} \Big)^{d+1}$.

Thus (5.2), (5.3) and (5.4) can be simultaneously satisfied if we choose, for constants $\kappa, \kappa_1, \kappa_2 > 0$,

$\epsilon_n = n^{-\alpha_0/(2\alpha_0+1)} \log^{\kappa}(n), \qquad r_n = n^{1/(2\alpha_0+1)} \log^{\kappa_1}(n), \qquad M_n = r_n \log^{\kappa_2}(n)$.

5.2. Proof of Theorem 3.2. For ease of notation, we shall make the simplifying assumption that the random variable $B$ is degenerate at $1$. For $a > 0$ and $S \subset \{1, \ldots, d\}$, let $\mathbb{H}^{a,S}$ denote the RKHS of $W^{\mathbf{a}}$, where $a_j = a$ for $j \in S$ and $a_j = 1$ for $j \notin S$. For a subset $S \subset \{1, \ldots, d\}$ with $|S| = d'$ and given positive constants $M, r, \xi, \epsilon$, let

$B_S = B_S(M, r, \xi, \epsilon) = \Big[ M \Big( \frac{r}{\xi} \Big)^{d'} \mathbb{H}^{r,S}_1 + \epsilon B_1 \Big] \cup \Big[ \bigcup_{a < \xi} \big( M \mathbb{H}^{a,S}_1 \big) + \epsilon B_1 \Big]$.

Since, given $S$, $A^{d'} \sim \mathrm{gamma}(b_1, b_2)$, it can be shown that, for some constant $C_1 > 0$,

$P(W^{\mathbf{A}} \notin B_S \mid S) \le e^{-C_1 r^{d'}}$.

The dominating term in the $\epsilon$-entropy of $B_S$ is bounded by $C_2 r^{d'} \log^{1+d}(C_3 M/\epsilon)$. When calculating the concentration probability around $w_0 \in C^{\alpha}[0,1]^I$, we simply use the fact that $\mathrm{pr}(S = I) > 0$. Combining the above, the sieves $B_n$ are constructed as

$B_n = \bigcup_{d'=1}^{d} \bigcup_{S : |S| = d'} B_S(M^S_n, r^S_n, \xi_n, \epsilon_n)$,

where, for constants $\kappa, \kappa_1 > 0$, $\epsilon_n = n^{-\alpha/(2\alpha+d_0)} \log^{\kappa} n$, $r^S_n = \big( n^{d_0/(2\alpha+d_0)} \big)^{1/|S|} \log^{\kappa_1}(n)$ and $(M^S_n)^2 = (r^S_n)^{|S|} \log(r^S_n/\epsilon_n)$.


More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS. By Chao Gao and Harrison H. Zhou Yale University

RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS. By Chao Gao and Harrison H. Zhou Yale University Submitted to the Annals o Statistics RATE EXACT BAYESIAN ADAPTATION WITH MODIFIED BLOCK PRIORS By Chao Gao and Harrison H. Zhou Yale University A novel block prior is proposed or adaptive Bayesian estimation.

More information

Distance between multinomial and multivariate normal models

Distance between multinomial and multivariate normal models Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating

More information

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define

(1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define Homework, Real Analysis I, Fall, 2010. (1) Consider the space S consisting of all continuous real-valued functions on the closed interval [0, 1]. For f, g S, define ρ(f, g) = 1 0 f(x) g(x) dx. Show that

More information

arxiv: v1 [math.fa] 14 Jul 2018

arxiv: v1 [math.fa] 14 Jul 2018 Construction of Regular Non-Atomic arxiv:180705437v1 [mathfa] 14 Jul 2018 Strictly-Positive Measures in Second-Countable Locally Compact Non-Atomic Hausdorff Spaces Abstract Jason Bentley Department of

More information

Sobolev spaces. May 18

Sobolev spaces. May 18 Sobolev spaces May 18 2015 1 Weak derivatives The purpose of these notes is to give a very basic introduction to Sobolev spaces. More extensive treatments can e.g. be found in the classical references

More information

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents

MATH MEASURE THEORY AND FOURIER ANALYSIS. Contents MATH 3969 - MEASURE THEORY AND FOURIER ANALYSIS ANDREW TULLOCH Contents 1. Measure Theory 2 1.1. Properties of Measures 3 1.2. Constructing σ-algebras and measures 3 1.3. Properties of the Lebesgue measure

More information

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

More information

Bayesian Single Index Model with Covariates Missing at Random

Bayesian Single Index Model with Covariates Missing at Random Bayesian Single Index Model with Covariates Missing at Random Abstract Bayesian single index model is a highly promising dimension reduction tool for an interpretable modeling of the non linear relationship

More information

Stable Process. 2. Multivariate Stable Distributions. July, 2006

Stable Process. 2. Multivariate Stable Distributions. July, 2006 Stable Process 2. Multivariate Stable Distributions July, 2006 1. Stable random vectors. 2. Characteristic functions. 3. Strictly stable and symmetric stable random vectors. 4. Sub-Gaussian random vectors.

More information

REAL AND COMPLEX ANALYSIS

REAL AND COMPLEX ANALYSIS REAL AND COMPLE ANALYSIS Third Edition Walter Rudin Professor of Mathematics University of Wisconsin, Madison Version 1.1 No rights reserved. Any part of this work can be reproduced or transmitted in any

More information

Notation. General. Notation Description See. Sets, Functions, and Spaces. a b & a b The minimum and the maximum of a and b

Notation. General. Notation Description See. Sets, Functions, and Spaces. a b & a b The minimum and the maximum of a and b Notation General Notation Description See a b & a b The minimum and the maximum of a and b a + & a f S u The non-negative part, a 0, and non-positive part, (a 0) of a R The restriction of the function

More information

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS

PROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t))

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t)) Notations In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics Each particle is characterized by its position an velocity x i t, v i t R d R d at time

More information

AW -Convergence and Well-Posedness of Non Convex Functions

AW -Convergence and Well-Posedness of Non Convex Functions Journal of Convex Analysis Volume 10 (2003), No. 2, 351 364 AW -Convergence Well-Posedness of Non Convex Functions Silvia Villa DIMA, Università di Genova, Via Dodecaneso 35, 16146 Genova, Italy villa@dima.unige.it

More information

Adaptive Bayesian procedures using random series priors

Adaptive Bayesian procedures using random series priors Adaptive Bayesian procedures using random series priors WEINING SHEN 1 and SUBHASHIS GHOSAL 2 1 Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030 2 Department

More information

L p Spaces and Convexity

L p Spaces and Convexity L p Spaces and Convexity These notes largely follow the treatments in Royden, Real Analysis, and Rudin, Real & Complex Analysis. 1. Convex functions Let I R be an interval. For I open, we say a function

More information

Nonparametric Bayesian Uncertainty Quantification

Nonparametric Bayesian Uncertainty Quantification Nonparametric Bayesian Uncertainty Quantification Lecture 1: Introduction to Nonparametric Bayes Aad van der Vaart Universiteit Leiden, Netherlands YES, Eindhoven, January 2017 Contents Introduction Recovery

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

3 (Due ). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due ). Show that the open disk x 2 + y 2 < 1 is a countable union of planar elementary sets. Show that the closed disk x 2 + y 2 1 is a countable

More information

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018

EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 EXPOSITORY NOTES ON DISTRIBUTION THEORY, FALL 2018 While these notes are under construction, I expect there will be many typos. The main reference for this is volume 1 of Hörmander, The analysis of liner

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

Fourier Transform & Sobolev Spaces

Fourier Transform & Sobolev Spaces Fourier Transform & Sobolev Spaces Michael Reiter, Arthur Schuster Summer Term 2008 Abstract We introduce the concept of weak derivative that allows us to define new interesting Hilbert spaces the Sobolev

More information

Nonparametric Bayes tensor factorizations for big data

Nonparametric Bayes tensor factorizations for big data Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

LECTURE 5: THE METHOD OF STATIONARY PHASE

LECTURE 5: THE METHOD OF STATIONARY PHASE LECTURE 5: THE METHOD OF STATIONARY PHASE Some notions.. A crash course on Fourier transform For j =,, n, j = x j. D j = i j. For any multi-index α = (α,, α n ) N n. α = α + + α n. α! = α! α n!. x α =

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Duality of multiparameter Hardy spaces H p on spaces of homogeneous type

Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Duality of multiparameter Hardy spaces H p on spaces of homogeneous type Yongsheng Han, Ji Li, and Guozhen Lu Department of Mathematics Vanderbilt University Nashville, TN Internet Analysis Seminar 2012

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

We denote the space of distributions on Ω by D ( Ω) 2.

We denote the space of distributions on Ω by D ( Ω) 2. Sep. 1 0, 008 Distributions Distributions are generalized functions. Some familiarity with the theory of distributions helps understanding of various function spaces which play important roles in the study

More information

arxiv:submit/ [math.st] 6 May 2011

arxiv:submit/ [math.st] 6 May 2011 A Continuous Mapping Theorem for the Smallest Argmax Functional arxiv:submit/0243372 [math.st] 6 May 2011 Emilio Seijo and Bodhisattva Sen Columbia University Abstract This paper introduces a version of

More information

Tools from Lebesgue integration

Tools from Lebesgue integration Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given

More information

Local Minimax Testing

Local Minimax Testing Local Minimax Testing Sivaraman Balakrishnan and Larry Wasserman Carnegie Mellon June 11, 2017 1 / 32 Hypothesis testing beyond classical regimes Example 1: Testing distribution of bacteria in gut microbiome

More information

Bayesian Sparse Linear Regression with Unknown Symmetric Error

Bayesian Sparse Linear Regression with Unknown Symmetric Error Bayesian Sparse Linear Regression with Unknown Symmetric Error Minwoo Chae 1 Joint work with Lizhen Lin 2 David B. Dunson 3 1 Department of Mathematics, The University of Texas at Austin 2 Department of

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure?

2 (Bonus). Let A X consist of points (x, y) such that either x or y is a rational number. Is A measurable? What is its Lebesgue measure? MA 645-4A (Real Analysis), Dr. Chernov Homework assignment 1 (Due 9/5). Prove that every countable set A is measurable and µ(a) = 0. 2 (Bonus). Let A consist of points (x, y) such that either x or y is

More information

Adaptivity to Local Smoothness and Dimension in Kernel Regression

Adaptivity to Local Smoothness and Dimension in Kernel Regression Adaptivity to Local Smoothness and Dimension in Kernel Regression Samory Kpotufe Toyota Technological Institute-Chicago samory@tticedu Vikas K Garg Toyota Technological Institute-Chicago vkg@tticedu Abstract

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

arxiv: v2 [math.st] 18 Feb 2013

arxiv: v2 [math.st] 18 Feb 2013 Bayesian optimal adaptive estimation arxiv:124.2392v2 [math.st] 18 Feb 213 using a sieve prior Julyan Arbel 1,2, Ghislaine Gayraud 1,3 and Judith Rousseau 1,4 1 Laboratoire de Statistique, CREST, France

More information

Nonlinear Systems and Control Lecture # 12 Converse Lyapunov Functions & Time Varying Systems. p. 1/1

Nonlinear Systems and Control Lecture # 12 Converse Lyapunov Functions & Time Varying Systems. p. 1/1 Nonlinear Systems and Control Lecture # 12 Converse Lyapunov Functions & Time Varying Systems p. 1/1 p. 2/1 Converse Lyapunov Theorem Exponential Stability Let x = 0 be an exponentially stable equilibrium

More information

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June

More information

Invariant measures for iterated function systems

Invariant measures for iterated function systems ANNALES POLONICI MATHEMATICI LXXV.1(2000) Invariant measures for iterated function systems by Tomasz Szarek (Katowice and Rzeszów) Abstract. A new criterion for the existence of an invariant distribution

More information

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics

Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:

More information

A note on some approximation theorems in measure theory

A note on some approximation theorems in measure theory A note on some approximation theorems in measure theory S. Kesavan and M. T. Nair Department of Mathematics, Indian Institute of Technology, Madras, Chennai - 600 06 email: kesh@iitm.ac.in and mtnair@iitm.ac.in

More information

State Space Representation of Gaussian Processes

State Space Representation of Gaussian Processes State Space Representation of Gaussian Processes Simo Särkkä Department of Biomedical Engineering and Computational Science (BECS) Aalto University, Espoo, Finland June 12th, 2013 Simo Särkkä (Aalto University)

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

THE INVERSE FUNCTION THEOREM

THE INVERSE FUNCTION THEOREM THE INVERSE FUNCTION THEOREM W. PATRICK HOOPER The implicit function theorem is the following result: Theorem 1. Let f be a C 1 function from a neighborhood of a point a R n into R n. Suppose A = Df(a)

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 57 Table of Contents 1 Sparse linear models Basis Pursuit and restricted null space property Sufficient conditions for RNS 2 / 57

More information

C m Extension by Linear Operators

C m Extension by Linear Operators C m Extension by Linear Operators by Charles Fefferman Department of Mathematics Princeton University Fine Hall Washington Road Princeton, New Jersey 08544 Email: cf@math.princeton.edu Partially supported

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES

INTRODUCTION TO REAL ANALYSIS II MATH 4332 BLECHER NOTES INTRODUCTION TO REAL ANALYSIS II MATH 433 BLECHER NOTES. As in earlier classnotes. As in earlier classnotes (Fourier series) 3. Fourier series (continued) (NOTE: UNDERGRADS IN THE CLASS ARE NOT RESPONSIBLE

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information