arxiv: v1 [math.st] 15 Dec 2017


A THEORETICAL FRAMEWORK FOR BAYESIAN NONPARAMETRIC REGRESSION: ORTHONORMAL RANDOM SERIES AND RATES OF CONTRACTION

By Fangzheng Xie, Wei Jin, and Yanxun Xu

Johns Hopkins University

We develop a unifying framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated $L_2$-distance without assuming the regression function space to be uniformly bounded. The framework is built upon orthonormal random series in a flexible manner. A general theorem for deriving rates of contraction for Bayesian nonparametric regression is provided under the proposed framework. As specific applications, we obtain the near-parametric rate of contraction for the squared-exponential Gaussian process when the true function is analytic, the adaptive rates of contraction for the sieve prior, and the adaptive-and-exact rates of contraction for the un-modified block prior when the true function is $\alpha$-smooth. Extensions to wavelet series priors and fixed-design regression problems are also discussed.

MSC 2010 subject classifications: Primary 62G08; secondary 62C10, 62G20.

Keywords and phrases: Bayesian nonparametric regression, integrated $L_2$-distance, orthonormal random series, rate of contraction.

1. Introduction. Consider the standard nonparametric regression problem $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$, where the predictors $(x_i)_{i=1}^n$ are referred to as design points and take values in $[0,1]^p \subset \mathbb{R}^p$, the $e_i$'s are independent and identically distributed (i.i.d.) mean-zero Gaussian noises with $\mathrm{var}(e_i) = \sigma^2$, and the $y_i$'s are the responses. We follow the popular Bayesian approach by assigning $f$ a carefully selected prior, and perform inference tasks by finding the posterior distribution of $f$ given the observations $(x_i, y_i)_{i=1}^n$.

We propose a theoretical framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated $L_2$-distance

$$\|f - g\|_2 = \left\{ \int_{\mathcal{X}} [f(x) - g(x)]^2 \, dx \right\}^{1/2}.$$

The regression function $f$ is assumed to be represented using a set of appropriate orthonormal basis functions $(\psi_k)_{k=1}^\infty$: $f(x) = \sum_{k=1}^\infty \beta_k \psi_k(x)$. The coefficients $(\beta_k)_{k=1}^\infty$ are allowed to range over the entire real line $\mathbb{R}$ and the space of regression functions is allowed to be unbounded, including the renowned Gaussian process priors as special examples.

Rates of contraction of posterior distributions for Bayesian nonparametric priors have been studied extensively. Following the earliest framework on generic rates of contraction theorems with i.i.d. data proposed by [13], specific examples for density estimation via Dirichlet process mixture models [5, 14, 17, 30] and location-scale mixture models [ , 38] are discussed. For nonparametric regression, the rates of contraction had not been discussed until [15], who develop a generic framework for fixed-design nonparametric regression to study rates of contraction with respect to the empirical $L_2$-distance. There are extensive studies for various priors that fall into this framework, including location-scale mixture priors [9], conditional Gaussian tensor-product splines [10], and Gaussian processes [35, 37], among which adaptive rates are obtained in [9, 10, 37].

Although it is interesting to achieve adaptive rates of contraction with respect to the empirical $L_2$-distance for nonparametric regression, this might be restrictive since the empirical $L_2$-distance quantifies the convergence of functions only at the given design points. In nonparametric regression, one also expects that the error between the estimated function and the true function can be globally small over the whole design space [39], i.e., small mean-squared error for out-of-sample prediction. Therefore the integrated $L_2$-distance is a natural choice. For Gaussian processes, [33], [40], and [4] provide contraction rates for nonparametric regression with respect to the integrated $L_2$-distance. A novel spike-and-slab wavelet series prior is constructed in [43] to achieve adaptive contraction with respect to the stronger $L_\infty$-distance.
These examples, however, take advantage of their respective prior structures and may not be easily generalized. Although a generic framework for the integrated $L_2$-loss is presented in [20], the prior there is imposed on a uniformly bounded function space and hence rules out some popular priors, e.g., the Gaussian process prior [5]. It is therefore natural to ask the following fundamental question: for Bayesian nonparametric regression, can one build a unifying framework to study rates of contraction for various priors with respect to the integrated $L_2$-distance without assuming the uniform boundedness of the regression function space?

In this paper we provide a positive answer to this question. Inspired by [11], we build the general framework upon series expansion of functions with respect to a set of orthonormal basis functions. The prior is not necessarily supported on a uniformly bounded function space. A general rate of contraction theorem with respect to the integrated $L_2$-distance for

Bayesian nonparametric regression is provided under the proposed framework. Specific applications that fall into this framework include the classical squared-exponential Gaussian process prior [5], the sieve prior [6, 8], and the un-modified block prior [11]. In particular, for the block prior regression, rather than modifying the block prior by conditioning on a truncated function space as in [11] with a known upper bound for the unknown true regression function, we prove that the un-modified block prior automatically yields rate-exact Bayesian adaptation for nonparametric regression without such a truncation. We also discuss some natural extensions of the proposed framework, including the wavelet series and the fixed-design regression problems. The analyses of the above applications and extensions under the proposed framework also generalize their respective results in the literature. This is made clear in sections 3 and 4.

The literature on rates of contraction for Bayesian nonparametric models is largely based on the prior-concentration-and-testing procedure proposed by [ ] and [13]. The major challenge of establishing a general framework for Bayesian nonparametric regression is that, unlike the Hellinger distance for the density estimation problem and the empirical $L_2$-distance for the fixed-design regression problem, the integrated $L_2$-distance for the random-design regression problem is not locally testable. The exact definition of local testability of a distance is provided later, but a locally testable distance is the key ingredient for the prior-concentration-and-testing procedure to work. It turns out that by modifying the testing procedure, we only require the integrated $L_2$-distance to be locally testable over an appropriate class of functions. The above arguments are made precise in section 2.

The layout of this paper is as follows.
In section 2 we introduce the random series framework for Bayesian nonparametric regression and present the main result concerning rates of contraction. As applications of the main result, we derive the rates of contraction of various renowned priors for nonparametric regression in the literature with substantial improvements in section 3. Section 4 elaborates on extensions to wavelet series priors as well as fixed-design regression problems. The technical proofs of the main results are deferred to section 5.

Notations. For $1 \le r \le \infty$, we use $\|\cdot\|_r$ to denote both the $\ell_r$-norm on any finite dimensional Euclidean space and the integrated $L_r$-norm of a measurable function with respect to the Lebesgue measure. In particular, for any function $f \in L_2([0,1]^p)$, we use $\|f\|_2$ to denote the integrated $L_2$-norm defined by $\|f\|_2^2 = \int_{[0,1]^p} f^2(x) \, dx$. We follow the convention that when $r = 2$, the subscript is omitted, i.e., $\|\cdot\|_2 = \|\cdot\|$. The Hilbert space

$\ell_2$ denotes the space of sequences that are square-summable. We use $\lfloor x \rfloor$ to denote the maximal integer no greater than $x$, and $\lceil x \rceil$ to denote the minimum integer no less than $x$. The notations $a \lesssim b$ and $a \gtrsim b$ denote the inequalities up to a positive multiplicative constant, and we write $a \asymp b$ if $a \lesssim b$ and $a \gtrsim b$. Throughout, capital letters $C, C_1, C_2, \ldots$ and $D, D_1, D_2, \ldots$ are used to denote generic positive constants and their values might change from line to line unless particularly specified, but are universal and unimportant for the analysis.

We refer to $\mathcal{P}$ as a statistical model if it consists of a class of densities on a sample space $\mathcal{X}$ with respect to some underlying $\sigma$-finite measure. Given a frequentist statistical model $\mathcal{P}$ and the i.i.d. data $(w_i)_{i=1}^n$ from some $P \in \mathcal{P}$, the prior and the posterior distribution on $\mathcal{P}$ are always denoted by $\Pi(\cdot)$ and $\Pi(\cdot \mid w_1, \ldots, w_n)$, respectively. Given a function $f : \mathcal{X} \to \mathbb{R}$, we use $\mathbb{P}_n f = n^{-1} \sum_{i=1}^n f(x_i)$ to denote the empirical measure of $f$, and $\mathbb{G}_n f = n^{-1/2} \sum_{i=1}^n [f(x_i) - \mathbb{E} f(x_i)]$ to denote the empirical process of $f$, given the i.i.d. data $(x_i)_{i=1}^n$. With a slight abuse of notation, when applying to a set of design points $(x_i)_{i=1}^n$, we also denote by $\mathbb{P}_n f = n^{-1} \sum_{i=1}^n f(x_i)$ and $\mathbb{G}_n f = n^{-1/2} \sum_{i=1}^n [f(x_i) - \mathbb{E} f(x_i)]$ the empirical measure and empirical process, even when the design points $(x_i)_{i=1}^n$ are deterministic. In particular, $\phi$ denotes the probability density function of the univariate standard normal distribution, and we use the shorthand notation $\phi_\sigma(y) = \phi(y/\sigma)/\sigma$ to denote the density of $N(0, \sigma^2)$. For a metric space $(\mathcal{F}, d)$ and any $\epsilon > 0$, the $\epsilon$-covering number of $(\mathcal{F}, d)$, denoted by $N(\epsilon, \mathcal{F}, d)$, is defined to be the minimum number of $\epsilon$-balls of the form $\{g \in \mathcal{F} : d(f, g) < \epsilon\}$ that are needed to cover $\mathcal{F}$.

2. The framework and main results. Consider the nonparametric regression model $y_i = f(x_i) + e_i$, where $(e_i)_{i=1}^n$ are i.i.d. mean-zero Gaussian noises with $\mathrm{var}(e_i) = \sigma^2$, and $(x_i)_{i=1}^n$ are design points taking values in $[0,1]^p$.
Throughout sections 2 and 3 the design points $(x_i)_{i=1}^n$ are assumed to be independently and uniformly sampled. The framework presented in this paper can be easily generalized to the case where the design points are independently sampled from a density function that is bounded away from $0$ and $\infty$. We assume that the responses $y_i$'s are generated from $y_i = f_0(x_i) + e_i$ for some unknown $f_0 \in L_2([0,1]^p)$, thus the data $\mathcal{D}_n = (x_i, y_i)_{i=1}^n$ can be regarded as i.i.d. samples from a distribution $P_0$ with joint density $p_0(x, y) = \phi_\sigma(y - f_0(x))$. Throughout we assume that the variance $\sigma^2$ of the noises is known, but our framework can be easily extended to the case where $\sigma^2$ is unknown by placing a prior on $\sigma^2$ that is supported on a compact interval contained in $(0, \infty)$.
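As a concrete illustration of the data-generating model just described, the following Python sketch draws uniform design points on $[0,1]$ (the case $p = 1$) and responses $y_i = f_0(x_i) + e_i$ with known $\sigma$. The function `f0`, the seed, and the values of `n` and `sigma` are assumptions chosen only for illustration; none of them come from the paper.

```python
import math
import random
from typing import List, Tuple


def f0(x: float) -> float:
    """Hypothetical true regression function (an illustrative assumption)."""
    return math.sin(2.0 * math.pi * x) + 0.5 * math.cos(4.0 * math.pi * x)


def simulate(n: int, sigma: float, seed: int = 0) -> List[Tuple[float, float]]:
    """Draw D_n = (x_i, y_i), i = 1..n, with x_i ~ Uniform[0, 1), e_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = f0(x) + rng.gauss(0.0, sigma)
        data.append((x, y))
    return data


def log_density(x: float, y: float, sigma: float) -> float:
    """log p_0(x, y) = log phi_sigma(y - f0(x)); the uniform design density is 1."""
    z = (y - f0(x)) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    for x, y in simulate(n=5, sigma=0.3):
        print(f"x={x:.3f}  y={y:.3f}  log p0={log_density(x, y, 0.3):.3f}")
```

Because the design density is uniform, the joint log-density reduces to the Gaussian term alone, which is why the log-likelihood ratio used below only involves $\phi_\sigma$.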

The regression model is parametrized by the function $f$, and we follow the standard nonparametric Bayes procedure by assigning a prior distribution $\Pi$ on $f$ as follows. Let $(\psi_k)_{k=1}^\infty$ be a set of orthonormal basis functions in $L_2([0,1]^p)$. Since $f_0 \in L_2([0,1]^p)$, we also impose the prior $\Pi$ on a sub-class of $L_2([0,1]^p)$ and write $f$ in terms of the orthonormal series expansion $f(x) = \sum_{k=1}^\infty \beta_k \psi_k(x)$, where the coefficients $(\beta_k)_{k=1}^\infty$ are assigned a prior distribution such that $\sum_{k=1}^\infty \beta_k^2 < \infty$ a.s. The prior on $(\beta_k)_{k=1}^\infty$ can be very general as long as it yields realizations that are square-summable with probability one. We shall also assume that the true function $f_0$ yields a series expansion $f_0(x) = \sum_{k=1}^\infty \beta_{0k} \psi_k(x)$, where $\sum_{k=1}^\infty \beta_{0k}^2 < \infty$. We require that the basis functions are uniformly bounded, i.e., $\sup_k \|\psi_k\|_\infty < \infty$. The widely used tensor-product Fourier basis satisfies this requirement:

$$\psi_{k_1 \cdots k_p}(x_1, \ldots, x_p) = \prod_{j=1}^p \psi^1_{k_j}(x_j),$$

where $\psi^1_1(x) = 1$, $\psi^1_{2k}(x) = \sqrt{2}\sin(2k\pi x)$, and $\psi^1_{2k+1}(x) = \sqrt{2}\cos(2k\pi x)$, $k = 1, 2, \ldots$.

Remark 2.1. We remark that the prior $\Pi$ is supported on a function space that may not be uniformly bounded and the setup here is more general than that in [20]. The proposed framework includes the well-known Gaussian process prior as a special case. Let $K : [0,1]^p \times [0,1]^p \to (0, \infty)$ be a positive definite covariance function that yields an eigenfunction expansion $K(x, x') = \sum_{k=1}^\infty \lambda_k \psi_k(x) \psi_k(x')$, where $(\psi_k)_{k=1}^\infty$ is a set of orthonormal eigenfunctions of $K$, and $(\lambda_k)_{k=1}^\infty$ are the corresponding eigenvalues. For a Gaussian process prior $\mathrm{GP}(0, K)$ on $f$, the Karhunen-Loève theorem [36] yields the following representation of $f$: $f(x) = \sum_{k=1}^\infty \beta_k \psi_k(x)$, where $\beta_k \sim N(0, \lambda_k)$ independently.

Before presenting the main result for studying convergence of Bayesian nonparametric regression under the proposed framework, let us first recall the extensively used prior-concentration-and-testing procedure for deriving

rates of contraction proposed by [13]. The general setup is as follows. Suppose $(w_i)_{i=1}^n \subset \mathcal{W}$ are i.i.d. data sampled from a distribution $P_0$ that yields a density $p_0$ with respect to some underlying $\sigma$-finite measure. Let $\Theta$ be the parameter space (typically infinite dimensional for nonparametric problems) such that $p_0 = p_{\theta_0}$ for some $\theta_0 \in \Theta$. By imposing a prior $\Pi$ on the parameter $\theta \in \Theta$ equipped with a suitable measurable structure, the posterior distribution can be written as

$$\Pi(A \mid \mathcal{D}_n) = \frac{\int_A \exp[\Lambda_n(\theta \mid \mathcal{D}_n)] \, \Pi(d\theta)}{\int_\Theta \exp[\Lambda_n(\theta \mid \mathcal{D}_n)] \, \Pi(d\theta)}$$

for any measurable $A$ in $\Theta$, where $\Lambda_n$ is the log-likelihood ratio $\Lambda_n(\theta \mid \mathcal{D}_n) = \sum_{i=1}^n \log \frac{p_\theta(w_i)}{p_0(w_i)}$. An appropriate distance $d$ on $\Theta$ is crucial to the study of rates of contraction, since typically different distances yield different rates, and when $\Theta$ is infinite dimensional, different distances are possibly not equivalent. In [13], a distance that satisfies the local testability condition is required.

Definition 2.1 ([13]). Given a parameter space $\Theta$, we say that a distance $d : \Theta \times \Theta \to [0, \infty)$ is locally testable if there exists a sequence of test functions $\phi_n : \mathcal{W}^n \to [0,1]$ such that the following holds for some constant $C > 0$ when $n$ is sufficiently large, whenever $\theta_1 \ne \theta_0$:

$$\mathbb{E}_0 \phi_n \le \exp[-C n d^2(\theta_0, \theta_1)], \qquad \sup_{\{\theta : d(\theta, \theta_1) \le \xi d(\theta_0, \theta_1)\}} \mathbb{E}_\theta (1 - \phi_n) \le \exp[-C n d^2(\theta_0, \theta_1)],$$

where $\xi \in (0,1)$ is a fixed constant.

When the parameter space $\Theta$ itself is a class of densities, i.e., $\{p_\theta : \theta \in \Theta\}$ is parametrized by $p$ itself, it is proved that the Hellinger distance is locally testable [3]. The widely used empirical $L_2$-distance for fixed-design nonparametric regression is also locally testable (see, for example, [15]). For a locally testable distance $d$ on $\Theta$, in order that the posterior distribution $\Pi(\cdot \mid \mathcal{D}_n)$ contracts to $\theta_0$ at rate $\epsilon_n$, i.e., $\Pi(d(\theta, \theta_0) > M\epsilon_n \mid \mathcal{D}_n) \to 0$ in $P_0$-probability for some large constant $M > 0$, the prior-concentration-and-testing procedure requires that the following hold with some constant $D > 0$ for sufficiently large $n$:

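1. The prior concentration condition holds:

$$\Pi\left( \mathbb{E}_0 \log \frac{p_0}{p_\theta} \le \epsilon_n^2, \; \mathbb{E}_0 \left[ \log \frac{p_0}{p_\theta} \right]^2 \le \epsilon_n^2 \right) \ge e^{-D n \epsilon_n^2}. \qquad (2.1)$$

2. There exists a sequence $(\Theta_n)_{n=1}^\infty$ of subsets of $\Theta$ (often referred to as the sieves) and test functions $(\phi_n)_{n=1}^\infty$ such that

$$\Pi(\Theta_n^c) \le e^{-D n \epsilon_n^2}, \qquad \mathbb{E}_0 \phi_n \to 0, \qquad \sup_{\theta \in \Theta_n \cap \{d(\theta, \theta_0) > M\epsilon_n\}} \mathbb{E}_\theta (1 - \phi_n) \to 0.$$

The major challenge for establishing a general framework to study rates of contraction for nonparametric regression with respect to the integrated $L_2$-distance is that $\|\cdot\|_2$ is not locally testable when the space of functions is not uniformly bounded. To resolve this issue, we modify the general procedure above by imposing more assumptions on the sieves. For convenience we use $\Lambda_n = \Lambda_n(f \mid \mathcal{D}_n)$ to denote the log-likelihood ratio

$$\Lambda_n(f \mid \mathcal{D}_n) = \sum_{i=1}^n \log \frac{p_f(x_i, y_i)}{p_0(x_i, y_i)},$$

where $p_f(x, y) = \phi_\sigma(y - f(x))$ is the density parametrized by the regression function $f$. For a fixed integer $m$ and a positive constant $\delta$, we denote by $\mathcal{F}_m(\delta)$ a class of functions in $L_2([0,1]^p)$ that satisfies the following property:

$$\mathcal{F}_m(\delta) \subset \left\{ f(x) = \sum_{k=1}^\infty \beta_k \psi_k(x) : \sum_{k=m+1}^\infty |\beta_k - \beta_{0k}| \le \delta \right\}. \qquad (2.2)$$

The sieves $(\mathcal{F}_n)_{n=1}^\infty$ are constructed by letting $\mathcal{F}_n = \mathcal{F}_{m_n}(\delta_n)$ for certain sequences $(m_n)_{n=1}^\infty \subset \mathbb{N}_+$ and $(\delta_n)_{n=1}^\infty \subset (0, \infty)$. Rather than considering the supremum over $\{f : \|f - f_1\|_2 \le \xi \|f_1 - f_0\|_2\}$ as in [13], in the proposed framework we consider the supremum over the smaller set $\{f : \|f - f_1\|_2 \le \xi \|f_1 - f_0\|_2\} \cap \mathcal{F}_m(\delta)$ and obtain the following weaker result analogous to the local testability condition.

Lemma 2.1. Let $\mathcal{F}_m(\delta)$ satisfy (2.2). Then for any $f_1 \in \mathcal{F}_m(\delta)$ with $\sqrt{n}\,\|f_1 - f_0\|_2 > 1$, there exists a test function $\phi_n : (\mathcal{X} \times \mathcal{Y})^n \to [0,1]$ such that

$$\mathbb{E}_0 \phi_n \le \exp(-C n \|f_1 - f_0\|_2^2),$$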

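$$\sup_{\{f \in \mathcal{F}_m(\delta) : \|f - f_1\|_2 \le \xi \|f_0 - f_1\|_2\}} \mathbb{E}_f (1 - \phi_n) \le \exp(-C n \|f_1 - f_0\|_2^2) + \exp\left( -\frac{C n \|f_1 - f_0\|_2^2}{m \|f_1 - f_0\|_2^2 + \delta^2} \right)$$

for some constant $C > 0$ and $\xi \in (0,1)$.

The following lemma constructs a test function that tests against the large composite alternative $\mathcal{F}_{m_n}(\delta_n) \cap \{\|f - f_0\|_2 > M\epsilon_n\}$. In particular, it is useful to estimate the type I error $\mathbb{E}_0 \phi_n$ and the type II error $\sup_{f \in \mathcal{F}_m(\delta) \cap \{\|f - f_0\|_2 > M\epsilon_n\}} \mathbb{E}_f (1 - \phi_n)$. We also observe that this result is similar to the testing procedure described above.

Lemma 2.2. Suppose that $\mathcal{F}_m(\delta)$ satisfies (2.2) for an $m \in \mathbb{N}_+$ and a $\delta > 0$. Let $(\epsilon_n)_{n=1}^\infty$ be a sequence with $n\epsilon_n^2 \to \infty$. Then there exists a sequence of test functions $(\phi_n)_{n=1}^\infty$ such that

$$\mathbb{E}_0 \phi_n \le \sum_{j=M}^\infty N_{nj} \exp(-C n j^2 \epsilon_n^2),$$

$$\sup_{\{f \in \mathcal{F}_m(\delta) : \|f - f_0\|_2 > M\epsilon_n\}} \mathbb{E}_f (1 - \phi_n) \le \exp(-C M^2 n \epsilon_n^2) + \exp\left( -\frac{C M^2 n \epsilon_n^2}{m M^2 \epsilon_n^2 + \delta^2} \right),$$

where $N_{nj} = N(\xi j \epsilon_n, S_{nj}(\epsilon_n), \|\cdot\|_2)$ is the covering number of $S_{nj}(\epsilon_n) = \{f \in \mathcal{F}_m(\delta) : j\epsilon_n < \|f - f_0\|_2 \le (j+1)\epsilon_n\}$, and $C$ is some positive constant.

The prior concentration condition (2.1) is a very important ingredient in the study of Bayes theory. It guarantees that the denominator in the posterior distribution, $\int_\Theta \exp[\Lambda_n(\theta \mid \mathcal{D}_n)] \, \Pi(d\theta)$, can be lower bounded by $e^{-D' n \epsilon_n^2}$ for some constant $D' > 0$ with large probability (see, for example, lemma 8.1 in [13]). In the context of normal regression, the Kullback-Leibler divergence is proportional to the squared integrated $L_2$-distance between two regression functions. Motivated by this observation, we establish the following lemma that yields an exponential lower bound for the denominator $\int \exp(\Lambda_n) \, \Pi(df)$ under the proposed framework.

Lemma 2.3. Denote

$$B_n(m, \epsilon, \omega) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k \psi_k(x) : \|f - f_0\|_2 < \epsilon, \; \sum_{k=m+1}^\infty |\beta_k - \beta_{0k}| \le \omega \right\}$$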

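for any $\epsilon, \omega > 0$ and $m \in \mathbb{N}_+$. Suppose sequences $(\epsilon_n)_{n=1}^\infty, (\omega_n)_{n=1}^\infty \subset (0, \infty)$, and $(k_n)_{n=1}^\infty$ satisfy $\epsilon_n \to 0$, $n\epsilon_n^2 \to \infty$, $k_n \epsilon_n^2 = O(1)$, and $\omega_n = O(1)$. Then for any constant $C > 0$,

$$P_0\left( \int \exp(\Lambda_n) \, \Pi(df) \le \Pi(B_n(k_n, \epsilon_n, \omega_n)) \exp\left[ -\left( C + \frac{1}{\sigma^2} \right) n \epsilon_n^2 \right] \right) \to 0.$$

In some cases it is also straightforward to consider the prior concentration with respect to the stronger $\|\cdot\|_\infty$-norm. For example, for a wide class of Gaussian process priors, the prior concentration $\Pi(\|f - f_0\|_\infty < \epsilon)$ has been extensively studied (see, for example, [16, 35, 37] for more details).

Lemma 2.4. Suppose the sequence $(\epsilon_n)_{n=1}^\infty$ satisfies $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$. Then for any constant $C > 0$,

$$P_0\left( \int \exp(\Lambda_n) \, \Pi(df) \le \Pi(\|f - f_0\|_\infty < \epsilon_n) \exp\left[ -\left( C + \frac{1}{\sigma^2} \right) n \epsilon_n^2 \right] \right) \to 0.$$

Now we present the main result regarding the rates of contraction for Bayesian nonparametric regression under the orthonormal random series framework. The proof is based on the modification of the prior-concentration-and-testing procedure.

Theorem 2.1 (Generic Contraction). Let $(\epsilon_n)_{n=1}^\infty$ and $(\underline{\epsilon}_n)_{n=1}^\infty$ be sequences such that $\min(n\epsilon_n^2, n\underline{\epsilon}_n^2) \to \infty$ as $n \to \infty$, with $0 \le \underline{\epsilon}_n \le \epsilon_n \to 0$. Assume that the sieve $(\mathcal{F}_{m_n}(\delta_n))_{n=1}^\infty$ satisfies (2.2), $m_n \epsilon_n^2 \to 0$, and $\delta_n = O(1)$. In addition, assume that there exist another two sequences $(k_n)_{n=1}^\infty \subset \mathbb{N}_+$ and $(\omega_n)_{n=1}^\infty \subset (0, \infty)$ such that $k_n \underline{\epsilon}_n^2 = O(1)$ and $\omega_n = O(1)$. Suppose the following conditions hold for some constant $D > 0$ and sufficiently large $n$ and $M$:

$$\sum_{j=M}^\infty N_{nj} \exp(-D n j^2 \epsilon_n^2) \to 0, \qquad (2.3)$$

$$\Pi(\mathcal{F}^c_{m_n}(\delta_n)) \lesssim \exp\left[ -\left( 2D + \frac{1}{\sigma^2} \right) n \underline{\epsilon}_n^2 \right], \qquad (2.4)$$

$$\Pi(B_n(k_n, \underline{\epsilon}_n, \omega_n)) \ge \exp(-D n \underline{\epsilon}_n^2), \qquad (2.5)$$

where $N_{nj} = N(\xi j \epsilon_n, S_{nj}(\epsilon_n), \|\cdot\|_2)$ is the covering number of $S_{nj} = \{f \in \mathcal{F}_{m_n}(\delta_n) : j\epsilon_n < \|f - f_0\|_2 \le (j+1)\epsilon_n\}$, and $B_n(k_n, \underline{\epsilon}_n, \omega_n)$ is defined in lemma 2.3. Then $\mathbb{E}_0[\Pi(\|f - f_0\|_2 > M\epsilon_n \mid \mathcal{D}_n)] \to 0$.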
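To get a feel for how the growth conditions on the sequences interact, the following Python sketch evaluates $n\epsilon_n^2$ and $m_n\epsilon_n^2$ for the $\alpha$-smooth rate $\epsilon_n = n^{-\alpha/(2\alpha+1)}$ and a truncation level $m_n$ of order $n^{1/(2\alpha+1)}$, the choices that appear in section 3. The helper names and the rounding of $m_n$ are illustrative assumptions. Since $m_n\epsilon_n^2 \asymp n^{(1-2\alpha)/(2\alpha+1)}$, the requirement $m_n\epsilon_n^2 \to 0$ holds exactly when $\alpha > 1/2$.

```python
import math
from typing import Tuple


def rate(n: int, alpha: float) -> float:
    """eps_n = n^{-alpha / (2 alpha + 1)} for alpha-smooth functions."""
    return n ** (-alpha / (2.0 * alpha + 1.0))


def truncation(n: int, alpha: float) -> int:
    """Truncation level m_n of order n^{1 / (2 alpha + 1)}, rounded up (an assumption)."""
    return math.ceil(n ** (1.0 / (2.0 * alpha + 1.0)))


def check(n: int, alpha: float) -> Tuple[float, float]:
    """Return (n * eps_n^2, m_n * eps_n^2); the first should grow, the second shrink."""
    eps = rate(n, alpha)
    return n * eps ** 2, truncation(n, alpha) * eps ** 2


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        for n in (10 ** 3, 10 ** 6):
            grow, shrink = check(n, alpha)
            print(f"alpha={alpha}  n={n}  n*eps^2={grow:.3g}  m*eps^2={shrink:.3g}")
```

At the boundary value $\alpha = 1/2$ the product $m_n\epsilon_n^2$ stays of order one, which is consistent with the smoothness requirement $\alpha > 1/2$ imposed throughout section 3.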

Remark 2.2. In light of lemma 2.4, by exploiting the proof of theorem 2.1 we remark that when the assumptions and conditions in theorem 2.1 hold with (2.5) replaced by $\Pi(\|f - f_0\|_\infty < \underline{\epsilon}_n) \ge \exp(-D n \underline{\epsilon}_n^2)$, it also holds that $\mathbb{E}_0[\Pi(\|f - f_0\|_2 > M\epsilon_n \mid \mathcal{D}_n)] \to 0$ for sufficiently large $M > 0$.

3. Applications. In this section we consider three concrete priors on $f$ for the nonparametric regression problem $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$. For simplicity the design points are assumed to be one-dimensional, i.e., $(x_i)_{i=1}^n \subset [0,1]$. For some of the examples, the results presented in this section can be easily generalized to the case where the design points are multi-dimensional by considering the tensor-product Fourier basis. The results for these applications under the proposed framework also generalize their respective counterparts in the literature.

The three concrete priors considered are the squared-exponential Gaussian process, the sieve prior, and the un-modified block prior. The basis functions $(\psi_k)_{k=1}^\infty$ are assumed to be the Fourier basis, i.e., $\psi_1(x) = 1$, $\psi_{2k}(x) = \sqrt{2}\sin(2k\pi x)$, and $\psi_{2k+1}(x) = \sqrt{2}\cos(2k\pi x)$. For the squared-exponential Gaussian process, the underlying true regression function $f_0(x) = \sum_{k=1}^\infty \beta_{0k}\psi_k(x)$ that governs $P_0$ is assumed to satisfy $\sum_{k=1}^\infty \beta_{0k}^2 e^{k^2/4} < \infty$. The resulting $f_0$ is not only infinitely differentiable, but can also be analytically extended to a strip in the complex plane $\mathbb{C}$ that contains $[0,1]$ (see, for example, exercise I.4.4 in [1]). For the other two priors, the underlying true regression function $f_0$ that governs $P_0$ is assumed to be $\alpha$-smooth in a wide sense. Specifically, a function $f$ is said to be an $\alpha$-smooth function if it lies in one of the following function classes:

$\alpha$-Sobolev function: A square-integrable function $f \in L_2([0,1])$ is said to be an $\alpha$-Sobolev function for $\alpha > 1/2$ if, under the Fourier basis $(\psi_k)_{k=1}^\infty$ above, the coefficients $\beta_k = \int_0^1 f(x)\psi_k(x)\,dx$ satisfy $\sum_{k=1}^\infty k^{2\alpha}\beta_k^2 < \infty$.
$\alpha$-Hölder function: A square-integrable function $f \in L_2([0,1])$ is said to be an $\alpha$-Hölder function for $\alpha > 1/2$ if, under the Fourier basis $(\psi_k)_{k=1}^\infty$ above, the coefficients $\beta_k = \int_0^1 f(x)\psi_k(x)\,dx$ satisfy $\sum_{k=1}^\infty k^{\alpha}|\beta_k| < \infty$.

Roughly speaking, when $\alpha \in \mathbb{N}_+$, a function is $\alpha$-Sobolev if it has square-integrable derivatives up to order $\alpha - 1$, and is $\alpha$-Hölder if it has Lipschitz continuous derivatives up to order $\alpha - 1$. When the regression function $f_0$ is $\alpha$-smooth, it is proved that the minimax-optimal rate of convergence is

$\epsilon_n = n^{-\alpha/(2\alpha+1)}$ [31], and the rate of contraction using a Bayes procedure cannot be faster than the minimax-optimal rate. We derive the rates of contraction for the sieve prior and the block prior under the framework proposed in section 2 and show that they are minimax-optimal, possibly up to a logarithmic factor.

3.1. Squared-exponential Gaussian process with near parametric rate. In this subsection we consider one of the most widely used Gaussian processes [5], $\mathrm{GP}(0, K)$ with the covariance function $K(x, x') = \exp[-(x - x')^2]$ of the squared-exponential form. We claim that such a Gaussian process prior indeed falls into the proposed framework. The eigen-system of the squared-exponential covariance function has been studied in the literature (see, for example, [41]). Under the Fourier basis, $K$ yields the eigen-expansion $K(x, x') = \sum_{k=1}^\infty \lambda_k \psi_k(x)\psi_k(x')$, where the eigenvalues $(\lambda_k)_{k=1}^\infty$ decay as $\lambda_k \asymp \exp(-k^2/4)$. It follows by the Karhunen-Loève theorem that $f$ yields a series expansion $f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x)$, where $\beta_k \sim N(0, \lambda_k)$. Given constants $c, Q > 0$, define the function class

$$A_c(Q) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x) : \sum_{k=1}^\infty \beta_k^2 \exp\left( \frac{k^2}{c} \right) \le Q^2 \right\}.$$

The function class $A_c(Q)$ is closely related to the reproducing kernel Hilbert space (RKHS) associated with $\mathrm{GP}(0, K)$, a concept playing a fundamental role in the literature of Bayes theory. For a complete and thorough review of RKHS from a Bayesian perspective, we refer to [36]. For the squared-exponential Gaussian process regression, the following property regarding the corresponding RKHS is available by applying theorem 4.1 in [36].

Lemma 3.1. Let $\mathbb{H}$ be the RKHS associated with the squared-exponential Gaussian process $\mathrm{GP}(0, K)$, where $K(x, x') = \exp[-(x - x')^2]$. Then $A_4(Q) \subset \mathbb{H}$ for any $Q > 0$.

Under the squared-exponential Gaussian process prior $\Pi$, the rate of contraction for $f_0 \in A_4(Q)$ is $1/\sqrt{n}$ up to a logarithmic factor.

Theorem 3.1.
Consider the nonparametric regression $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$, where $(x_i)_{i=1}^n$ are uniformly sampled from $[0,1]$, $(e_i)_{i=1}^n$ are i.i.d. mean-zero Gaussian noises with $\mathrm{var}(e_i) = \sigma^2$, and $f_0 \in A_4(Q)$ for some $Q > 0$. Suppose $f$ follows the squared-exponential Gaussian process prior

$\Pi = \mathrm{GP}(0, K)$. Then there exists some sufficiently large constant $M > 0$ such that

$$\mathbb{E}_0\left[ \Pi\left( \|f - f_0\|_2 > M n^{-1/2} \log n \mid \mathcal{D}_n \right) \right] \to 0.$$

The major technique for proving theorem 3.1 using theorem 2.1 is the construction of an appropriate sieve $\mathcal{F}_{m_n}(\delta_n)$. Specifically, the sieve is taken as

$$\mathcal{F}_{m_n}(Q) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x) : \sum_{k=1}^\infty (\beta_k - \beta_{0k})^2 e^{k^2/8} \le Q^2 \right\},$$

with the sequences $m_n \asymp \log n$ and $\delta_n = Q$. The probability $\Pi(\mathcal{F}^c_{m_n}(Q))$ is bounded by Markov's inequality, and the covering number $N_{nj}$ is bounded by the following lemma.

Lemma 3.2. For all $\epsilon > 0$, it holds that

$$\log N\left( \epsilon, \left\{ (\beta_1, \beta_2, \ldots) \in \ell_2 : \sum_{k=1}^\infty \beta_k^2 e^{k^2/c} \le Q^2 \right\}, \|\cdot\|_2 \right) \lesssim \left( \log \frac{1}{\epsilon} \right)^{3/2}.$$

Remark 3.1. The contraction rate $n^{-1/2}\log n$ is the same as the parametric rate $1/\sqrt{n}$ up to a logarithmic factor (the renowned Bernstein-von Mises theorem; see, for example, chapter 10 in [34]). This is in accordance with the following interesting fact: the space of functions that admit an analytic extension to an open subset of the complex plane is only slightly larger than finite-dimensional spaces in terms of their covering numbers (see, for example, proposition C.9 in [16]). The result obtained in theorem 3.1 is similar to that in [33]. However, we put the squared-exponential Gaussian process regression in a more general framework that is quite flexible and allows other non-Gaussian priors, as will be seen in subsection 3.2.

3.2. Sieve prior regression with adaptive rate. In this subsection the true regression function $f_0$ is assumed to be in the intersection of the $\alpha$-Hölder ball $C_\alpha(Q)$ and the $\alpha$-Sobolev ball $H_\alpha(Q)$, where $\alpha > 1/2$ is the smoothness level,

$$C_\alpha(Q) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x) : \sum_{k=1}^\infty k^{\alpha}|\beta_k| \le Q \right\}, \qquad H_\alpha(Q) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x) : \sum_{k=1}^\infty k^{2\alpha}\beta_k^2 \le Q^2 \right\},$$

and $Q > 0$ is some constant. In subsection 3.1, we provide a near-parametric rate of contraction for an analytic (in particular, infinitely differentiable) function $f_0 \in A_4(Q)$ with the squared-exponential Gaussian process prior. Nevertheless, when $f_0$ is only smooth up to level $\alpha$, i.e., $f_0$ is either $\alpha$-Sobolev or $\alpha$-Hölder, the resulting rate of contraction is only polynomial in $1/\log n$ and is much slower than the minimax-optimal rate $n^{-\alpha/(2\alpha+1)}$ [33]. In this subsection we consider estimating an $\alpha$-smooth function $f_0$ without knowing its smoothness level $\alpha$ a priori. In the literature of Bayes theory, such a procedure is referred to as adaptive. We shall also assume that a lower bound $\underline{\alpha}$ for $\alpha$ is known: $\alpha \ge \underline{\alpha} > 1/2$.

The sieve prior [1, 6, 8] is another popular prior in the literature of Bayesian nonparametric theory. It is a class of hierarchical priors that first draw an integer-valued random variable serving as the number of terms to be used in a finite sum, and then sample the term-wise parameters given the number of terms. The sieve prior typically does not depend on the smoothness level of the true function, often yields minimax-optimal rates of contraction up to a logarithmic factor in many other nonparametric problems (e.g., density estimation [6, 8] and fixed-design regression [1]), and hence is fully adaptive. However, the adaptive rates of contraction for the sieve prior in random-design regression with respect to the integrated $L_2$-distance have not been established. In this subsection we address this issue within the orthonormal random series framework proposed in section 2.

We now formally construct the sieve prior for nonparametric regression. Let $f(x)$ be represented by a set of orthonormal basis functions $(\psi_k)_{k=1}^\infty$ in $L_2([0,1])$: $f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x)$.
The coefficients $(\beta_k)_{k=1}^\infty$ are assigned a prior distribution as follows: first sample an integer-valued random variable $N$ from a density function $\pi_N$ (with respect to the counting measure on $\mathbb{N}_+$), and then given $N = m$ the coefficients $\beta_k$'s are independently sampled according to

$$\Pi(\beta_k \in A \mid N = m) = \mathbb{1}(k \le m) \int_A k^{\gamma} g(k^{\gamma}\beta) \, d\beta + \mathbb{1}(k > m)\,\delta_0(A),$$

where $A$ is any measurable set, $g$ is an exponential power density $g(x) \propto \exp(-\tau_0 |x|^{\tau})$ for some $\tau, \tau_0 > 0$ [9], and $\gamma \in (1/2, \underline{\alpha})$. We further require that

$$\pi_N(m) \ge \exp(-b_0 m \log m) \quad \text{and} \quad \sum_{N=m+1}^\infty \pi_N(N) \le \exp(-b_1 m \log m)$$

for some constants $b_1, b_0$. The zero-truncated Poisson distribution $\pi_N(m) = (e^{\lambda} - 1)^{-1} \lambda^m / m! \; \mathbb{1}(m \ge 1)$ satisfies the above condition [38].
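The two-stage construction above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it fixes $\tau = 1$, so that $g$ is a Laplace density with rate `tau0`, uses the zero-truncated Poisson law for $N$, and takes the Fourier basis normalised to be orthonormal on $[0,1]$. The function names and all numerical parameter values are hypothetical.

```python
import math
import random
from typing import List


def sample_zero_truncated_poisson(lam: float, rng: random.Random) -> int:
    """Draw N with pmf lam^m / ((e^lam - 1) m!) on m >= 1 by inverting the CDF."""
    u = rng.random()
    m = 1
    pmf = lam / math.expm1(lam)  # P(N = 1)
    cdf = pmf
    while u > cdf and pmf > 0.0:
        m += 1
        pmf *= lam / m  # P(N = m) = P(N = m - 1) * lam / m
        cdf += pmf
    return m


def sample_sieve_coefficients(lam: float, gamma: float, tau0: float,
                              rng: random.Random) -> List[float]:
    """Draw (beta_1, ..., beta_N); all coefficients beyond N are exactly zero.

    Given N = m, beta_k = k^{-gamma} * Z_k for k <= m, where Z_k has density
    proportional to exp(-tau0 * |x|) (exponential power density with tau = 1,
    an illustrative choice), so beta_k has density k^gamma * g(k^gamma * beta).
    """
    n_terms = sample_zero_truncated_poisson(lam, rng)
    betas = []
    for k in range(1, n_terms + 1):
        z = rng.expovariate(tau0) * rng.choice((-1.0, 1.0))  # Laplace draw
        betas.append(k ** (-gamma) * z)
    return betas


def fourier_basis(index: int, x: float) -> float:
    """psi_1 = 1, psi_{2k} = sqrt(2) sin(2 pi k x), psi_{2k+1} = sqrt(2) cos(2 pi k x)."""
    if index == 1:
        return 1.0
    k = index // 2
    if index % 2 == 0:
        return math.sqrt(2.0) * math.sin(2.0 * math.pi * k * x)
    return math.sqrt(2.0) * math.cos(2.0 * math.pi * k * x)


def evaluate(betas: List[float], x: float) -> float:
    """f(x) = sum_k beta_k psi_k(x) for the finitely many nonzero coefficients."""
    return sum(b * fourier_basis(k, x) for k, b in enumerate(betas, start=1))


if __name__ == "__main__":
    rng = random.Random(0)
    betas = sample_sieve_coefficients(lam=5.0, gamma=0.75, tau0=1.0, rng=rng)
    print(len(betas), [round(evaluate(betas, x / 4.0), 3) for x in range(5)])
```

Note that `gamma` must be chosen without reference to the true smoothness $\alpha$; only the known lower bound $\underline{\alpha}$ enters, which is what makes the prior adaptive.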

The following theorem shows that the sieve prior constructed above is adaptive and the rate of contraction with respect to the integrated $L_2$-distance is minimax-optimal up to a logarithmic factor.

Theorem 3.2. Consider the nonparametric regression $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$, where $(x_i)_{i=1}^n$ are uniformly sampled from $[0,1]$, $(e_i)_{i=1}^n$ are i.i.d. mean-zero Gaussian noises with $\mathrm{var}(e_i) = \sigma^2$, and $f_0 \in C_\alpha(Q) \cap H_\alpha(Q)$ for some $\alpha > 1/2$ and $Q > 0$. Suppose $f$ is assigned the prior $\Pi$ given above. Then there exists some sufficiently large constant $M > 0$ such that

$$\mathbb{E}_0\left[ \Pi\left( \|f - f_0\|_2 > M n^{-\alpha/(2\alpha+1)} (\log n)^t \mid \mathcal{D}_n \right) \right] \to 0$$

for any $t > \alpha/(2\alpha+1)$.

3.3. Block prior regression with adaptive and exact rate. In the literature of adaptive Bayesian procedures, the minimax-optimal rates of contraction are often obtained with an extra logarithmic factor. It typically requires significant work to obtain the exact minimax-optimal rate. Gao and Zhou [11] elegantly construct a modified block prior that yields rate-adaptive (i.e., the prior does not depend on the smoothness level) and rate-exact contraction for a wide class of nonparametric problems, without the redundant logarithmic factor. Nevertheless, for nonparametric regression, [11] modifies the block prior by conditioning on the space of uniformly bounded functions. Requiring a known upper bound for the unknown $f_0$ when constructing the prior is restrictive since it eliminates the popular Gaussian process priors. Besides the theoretical concern, the block prior itself is also a conditional Gaussian process and such a modification is inconvenient for implementation. In this subsection, we address this issue by showing that for nonparametric regression such a modification is not necessary. Namely, the un-modified block prior itself also yields the exact minimax-optimal rate of contraction for $\alpha$-Sobolev $f_0 \in H_\alpha(Q)$ and it does not depend on the smoothness level $\alpha$ of the true regression function.

We now describe the block prior.
Given a sequence $\beta = (\beta_1, \beta_2, \ldots)$ in the square-summable sequence space $\ell_2$, define the $l$th block $B_l$ to be the integer index set $B_l = \{k_l, \ldots, k_{l+1} - 1\}$ and $n_l = |B_l| = k_{l+1} - k_l$, where $k_l = \lceil e^l \rceil$. We use $\beta_l = (\beta_j : j \in B_l) \in \mathbb{R}^{n_l}$ to denote the coefficients with index lying in the $l$th block $B_l$. Let $(\psi_k)_{k=1}^\infty$ be the Fourier basis, i.e., $\psi_1(x) = 1$, $\psi_{2k}(x) = \sqrt{2}\sin(2\pi k x)$, and $\psi_{2k+1}(x) = \sqrt{2}\cos(2\pi k x)$, $k \in \mathbb{N}_+$. The block prior is a prior distribution on $f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x)$ induced by the following distribution on the Fourier coefficients $(\beta_k)_{k=1}^\infty$:

$A_l \sim g_l$ independently for each $l$,

$\beta_l \mid A_l \sim N(0, A_l I_{n_l})$ independently for each $l$,

where $(g_l)_{l=0}^\infty$ is a sequence of densities satisfying the following properties:

1. There exists $c_1 > 0$ such that for any $l$ and $t \in [e^{-l^2}, e^{-l}]$,
$$g_l(t) \ge \exp(-c_1 e^l). \qquad (3.1)$$
2. There exists $c_2 > 0$ such that for any $l$,
$$\int_0^\infty t\, g_l(t) \, dt \le 4\exp(-c_2 l^2). \qquad (3.2)$$
3. There exists $c_3 > 0$ such that for any $l$,
$$\int_{e^{-l^2}}^\infty g_l(t) \, dt \le \exp(-c_3 e^l). \qquad (3.3)$$

The existence of a sequence of densities $(g_l)_{l=0}^\infty$ satisfying (3.1), (3.2), and (3.3) is verified in [11] (see proposition 2.1 in [11]). Our major improvement for the block prior regression is the following theorem, which shows that the un-modified block prior yields rate-exact Bayesian adaptation for nonparametric regression.

Theorem 3.3. Consider the nonparametric regression $y_i = f(x_i) + e_i$, $i = 1, \ldots, n$, where $(x_i)_{i=1}^n$ are uniformly sampled from $[0,1]$, $(e_i)_{i=1}^n$ are i.i.d. mean-zero Gaussian noises with $\mathrm{var}(e_i) = \sigma^2$, and $f_0 \in H_\alpha(Q)$ for some $\alpha > 1/2$ and $Q > 0$. Suppose $f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x)$ is assigned the block prior $\Pi$ as described above. Then there exists some sufficiently large constant $M > 0$ such that

$$\mathbb{E}_0\left[ \Pi\left( \|f - f_0\|_2 > M n^{-\alpha/(2\alpha+1)} \mid \mathcal{D}_n \right) \right] \to 0.$$

Rather than using the sieve $\mathcal{F}_n$ proposed in theorem 2.1 in [11], which does not necessarily satisfy (2.2), we construct $\mathcal{F}_{m_n}(\delta_n)$ in a slightly different fashion:

$$\mathcal{F}_{m_n}(Q) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k\psi_k(x) : \sum_{k=1}^\infty (\beta_k - \beta_{0k})^2 k^{2\alpha} \le Q^2 \right\}$$

with $m_n \asymp n^{1/(2\alpha+1)}$ and $\delta_n = Q$. The covering number $N_{nj}$ can be bounded using the metric entropy for Sobolev balls (for example, see lemma 6.4 in [3]), and the remaining conditions in theorem 2.1 can be verified using similar techniques as in [11].
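The block structure and the two-stage draw $A_l \sim g_l$, $\beta_l \mid A_l \sim N(0, A_l I_{n_l})$ can be sketched in Python as below. Two points are assumptions made only for illustration: the block boundary is taken as $e^l$ rounded up to an integer, and the mixing density $g_l$ is replaced by a placeholder (a uniform law on $[e^{-l^2}, e^{-l}]$), since the text only states the properties (3.1) to (3.3) that $g_l$ must satisfy. The placeholder is not claimed to satisfy those properties.

```python
import math
import random
from typing import List


def block_start(l: int) -> int:
    """First index k_l of block l, taken here as e^l rounded up (an assumption)."""
    return math.ceil(math.exp(l))


def block_indices(l: int) -> range:
    """B_l = {k_l, ..., k_{l+1} - 1}; its length is n_l = k_{l+1} - k_l."""
    return range(block_start(l), block_start(l + 1))


def sample_block_prior(num_blocks: int, rng: random.Random) -> List[float]:
    """Draw the coefficients of blocks l = 0, ..., num_blocks - 1.

    Placeholder mixing law: A_l ~ Uniform[exp(-l^2), exp(-l)]. This is NOT the
    density g_l of the paper; it only makes the hierarchical structure visible.
    """
    betas: List[float] = []
    for l in range(num_blocks):
        a_l = rng.uniform(math.exp(-l * l), math.exp(-l))  # stand-in for A_l ~ g_l
        sd = math.sqrt(a_l)
        # Given A_l, all coefficients in block l are i.i.d. N(0, A_l).
        betas.extend(rng.gauss(0.0, sd) for _ in block_indices(l))
    return betas


if __name__ == "__main__":
    for l in range(4):
        idx = block_indices(l)
        print(f"block {l}: indices {idx.start}..{idx.stop - 1}, size {len(idx)}")
    print(len(sample_block_prior(3, random.Random(0))))
```

The block sizes grow geometrically, so a single variance parameter $A_l$ is shared by exponentially many coefficients at high frequencies.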

4. Extensions.

4.1. Extension to wavelet series. Besides the popular Fourier basis, another widely used class of orthonormal basis functions for $L_2([0,1])$ are the wavelet basis functions. They are related to the Besov space [4], which is more flexible. For a complete and thorough review of wavelets from a statistical perspective, we refer to [18].

Let $(\psi_{jk})_{j \in \mathbb{N}, k \in I_j}$ be an orthonormal basis of compactly supported wavelets for $L_2([0,1])$, with $j$ referring to the so-called resolution level, and $k$ to the dilation (see, for example, section E.3 in [16]). We adopt the convention that the index set $I_j$ for the $j$th resolution level runs through $\{0, 1, \ldots, 2^j - 1\}$. The exact definition and specific formulas for the wavelet basis are not of great interest to us. We shall assume that the wavelet basis functions $\psi_{jk}$ are appropriately selected such that for any $f(x) = \sum_{j=0}^\infty \sum_{k \in I_j} \beta_{jk}\psi_{jk}(x)$, the following inequalities hold [6, 7, 19]:

$$\|f\|_2 = \left( \sum_{j=0}^\infty \sum_{k \in I_j} \beta_{jk}^2 \right)^{1/2} \le \sum_{j=0}^\infty \left( \sum_{k \in I_j} \beta_{jk}^2 \right)^{1/2},$$

$$\|f\|_\infty = \sup_{x \in [0,1]} \left| \sum_{j=0}^\infty \sum_{k \in I_j} \beta_{jk}\psi_{jk}(x) \right| \lesssim \sum_{j=0}^\infty 2^{j/2} \max_{k \in I_j} |\beta_{jk}|.$$

For the construction of a specific wavelet basis $(\psi_{jk})$ satisfying the above conditions, we refer to [7].

Remark 4.1. We remark that, technically speaking, the wavelet random series does not fall into the framework proposed in section 2, since unlike the Fourier basis, the wavelet basis functions are typically not uniformly bounded in $\|\cdot\|_\infty$: $\sup_{j,k} \|\psi_{jk}\|_\infty = \infty$. However, the supremum norm of the wavelet series can be upper bounded in terms of the wavelet coefficients. Such a feature allows us to modify the framework in section 2 and build up its wavelet counterpart.

Similar to the framework proposed in section 2, we also assume that the true regression function $f_0$ yields a wavelet series expansion $f_0(x) = \sum_{j=0}^\infty \sum_{k \in I_j} \beta_{0jk}\psi_{jk}(x)$.
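The coefficient-based bound on the supremum norm can be checked numerically for a simple wavelet family. The Python sketch below uses Haar wavelets $\psi_{jk}(x) = 2^{j/2}\psi(2^j x - k)$ purely as an illustrative assumption (the text does not fix a particular wavelet basis, and the Haar scaling function is omitted here). For Haar wavelets the supports within one resolution level are disjoint, so the bound holds with constant one, while $\|\psi_{jk}\|_\infty = 2^{j/2}$ grows without bound in $j$.

```python
import random
from typing import Dict, Tuple

Coeffs = Dict[Tuple[int, int], float]


def haar(j: int, k: int, x: float) -> float:
    """Haar wavelet psi_jk(x) = 2^{j/2} psi(2^j x - k), psi = +1 on [0, 1/2), -1 on [1/2, 1)."""
    t = (2 ** j) * x - k
    scale = 2.0 ** (j / 2.0)
    if 0.0 <= t < 0.5:
        return scale
    if 0.5 <= t < 1.0:
        return -scale
    return 0.0


def evaluate(coeffs: Coeffs, x: float) -> float:
    """f(x) = sum_{j,k} beta_jk psi_jk(x) for a finite table of coefficients."""
    return sum(beta * haar(j, k, x) for (j, k), beta in coeffs.items())


def sup_norm_bound(coeffs: Coeffs) -> float:
    """sum_j 2^{j/2} max_k |beta_jk|, the coefficient-based bound on ||f||_inf."""
    level_max: Dict[int, float] = {}
    for (j, _k), beta in coeffs.items():
        level_max[j] = max(level_max.get(j, 0.0), abs(beta))
    return sum(2.0 ** (j / 2.0) * m for j, m in level_max.items())


def random_coeffs(levels: int, rng: random.Random) -> Coeffs:
    """Hypothetical coefficients beta_jk ~ N(0, 4^{-j}) for j < levels, k in I_j."""
    return {(j, k): rng.gauss(0.0, 2.0 ** (-j))
            for j in range(levels) for k in range(2 ** j)}


if __name__ == "__main__":
    coeffs = random_coeffs(6, random.Random(0))
    grid = [i / 2048.0 for i in range(2048)]
    print(max(abs(evaluate(coeffs, x)) for x in grid), sup_norm_bound(coeffs))
```

This is the mechanism behind the wavelet analogue of the sieve condition: controlling $\sum_j 2^{j/2}\max_k|\beta_{jk} - \beta_{0jk}|$ controls $\|f - f_0\|_\infty$ even though the basis itself is unbounded.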

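For a given integer $J$ and positive real $\delta > 0$, analogous to (2.2), we require that the class of functions $F_J(\delta)$ satisfies the following property:
$$F_J(\delta) \subset \left\{ f(x) = \sum_{j=0}^\infty\sum_{k\in I_j}\beta_{jk}\psi_{jk}(x) : \sum_{j=J}^\infty 2^{j/2}\max_{k\in I_j}|\beta_{jk} - \beta_{0jk}| \leq \delta \right\} \tag{4.1}$$
holds for all $J \in \mathbb{N}_+$ and $\delta \in (0,\infty)$. The sieves $(F_n)_{n=1}^\infty$ are then constructed by taking $F_n = F_{J_n}(\delta_n)$ for carefully chosen sequences $(J_n)_{n=1}^\infty \subset \mathbb{N}$ and $(\delta_n)_{n=1}^\infty \subset (0,\infty)$. We provide the corresponding local testing result below for the wavelet series, which is similar to lemma 2.1.

Lemma 4.1. Let $F_J(\delta)$ satisfy (4.1). Then for any $f_1 \in F_J(\delta)$ with $\sqrt{n}\|f_1 - f_0\|_2 > 1$, there exists a test function $\phi_n : (\mathcal{X}\times\mathcal{Y})^n \to [0,1]$ such that
$$E_0\phi_n \leq \exp\left(-Cn\|f_1 - f_0\|_2^2\right),$$
$$\sup_{\{f\in F_J(\delta):\ \|f - f_1\|_2 \leq \xi\|f_0 - f_1\|_2\}} E_f(1 - \phi_n) \leq \exp\left(-Cn\|f_1 - f_0\|_2^2\right) + 2\exp\left(-\frac{Cn\|f_1 - f_0\|_2^2}{2^J\|f_1 - f_0\|_2^2 + \delta^2}\right)$$
for some constant $C > 0$ and $\xi \in (0,1)$.

The following generic rate of contraction theorem can be proved following the exact same lines as that for theorem 2.1. For any $J \in \mathbb{N}_+$ and $\epsilon, \omega \in (0,\infty)$, define
$$B_n(J, \epsilon, \omega) = \left\{ f = \sum_{j=0}^\infty\sum_{k\in I_j}\beta_{jk}\psi_{jk} : \|f - f_0\|_2 < \epsilon,\ \sum_{j=J}^\infty 2^{j/2}\max_{k\in I_j}|\beta_{jk} - \beta_{0jk}| \leq \omega \right\}.$$

Theorem 4.1 (Generic Contraction, Wavelet Series). Let $(\epsilon_n)_{n=1}^\infty$ and $(\underline{\epsilon}_n)_{n=1}^\infty$ be sequences such that $\min(n\epsilon_n^2, n\underline{\epsilon}_n^2) \to \infty$ as $n \to \infty$ with $0 \leq \underline{\epsilon}_n \leq \epsilon_n \to 0$. Assume that the sieve $(F_{J_n}(\delta_n))_{n=1}^\infty$ satisfies (4.1), $2^{J_n}\epsilon_n^2 \to 0$, and $\delta_n = O(1)$. In addition, assume that there exist another two sequences $(j_n)_{n=1}^\infty \subset \mathbb{N}_+$, $(\omega_n)_{n=1}^\infty \subset (0,\infty)$ such that $2^{j_n}\underline{\epsilon}_n^2 = O(1)$ and $\omega_n = O(1)$. Suppose the conditions (2.3), (2.4), and
$$\Pi\left(B_n(j_n, \underline{\epsilon}_n, \omega_n)\right) \geq \exp\left(-Dn\underline{\epsilon}_n^2\right) \tag{4.2}$$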

hold for some constant $D > 0$ and sufficiently large $n$ and $M$, with $F_{J_n}(\delta_n)$ in place of $F_{m_n}(\delta_n)$. Then
$$E_0\left[\Pi\left(\|f - f_0\|_2 > M\epsilon_n \mid D_n\right)\right] \to 0.$$

As discussed in section 4.2 in [11], the block prior can be easily extended to the wavelet series. Given a function $f$ with wavelet series expansion $f(x) = \sum_{j=0}^\infty\sum_{k\in I_j}\beta_{jk}\psi_{jk}(x)$, the block prior for wavelet series is introduced through the wavelet coefficients $\beta_{jk}$ as follows:
$$A_j \sim g_j \quad \text{independently for each } j, \qquad \beta_j \mid A_j \sim \mathrm{N}(0, A_j I_{n_j}) \quad \text{independently for each } j,$$
where $\beta_j = (\beta_{jk} : k \in I_j)$, $n_j = |I_j| = 2^j$, and $g_j$ is given by
$$g_j(t) = \begin{cases} e^{j^2\log 2}\left(e^{-2^j\log 2} - T_j\right)t + T_j, & 0 \leq t \leq e^{-j^2\log 2}, \\ e^{-2^j\log 2}, & e^{-j^2\log 2} < t \leq e^{-j\log 2}, \\ 0, & t > e^{-j\log 2}, \end{cases}$$
$$T_j = \exp\left[(1 + j^2)\log 2\right] - 2\exp\left[(-2^j + j^2 - j)\log 2\right] + e^{-2^j\log 2}.$$

We further assume that $f_0$ lies in the $(2,2,\alpha)$-Besov ball $B^\alpha_{2,2}(Q)$ defined as follows [11] for some $\alpha > 1/2$ and $Q > 0$:
$$B^\alpha_{2,2}(Q) = \left\{ f(x) = \sum_{j=0}^\infty\sum_{k\in I_j}\beta_{jk}\psi_{jk}(x) : \sum_{j=0}^\infty 2^{2\alpha j}\sum_{k\in I_j}\beta_{jk}^2 \leq Q^2 \right\}.$$
For the block prior regression via wavelet series, the rate-exact Bayesian adaptation also holds.

Theorem 4.2. Consider the nonparametric regression $y_i = f(x_i) + e_i$, $i = 1,\ldots,n$, where $(x_i)_{i=1}^n$ are uniformly sampled from $[0,1]$, $(e_i)_{i=1}^n$ are i.i.d. mean-zero Gaussian noises with $\mathrm{var}(e_i) = \sigma^2$, and $f_0 \in B^\alpha_{2,2}(Q)$ for some $\alpha > 1/2$ and $Q > 0$. Suppose $f(x) = \sum_{j=0}^\infty\sum_{k\in I_j}\beta_{jk}\psi_{jk}(x)$ is imposed the block prior for wavelet series $\Pi$ as described above. Then there exists some sufficiently large constant $M > 0$ such that
$$E_0\left[\Pi\left(\|f - f_0\|_2 > M n^{-\alpha/(2\alpha+1)} \mid D_n\right)\right] \to 0.$$

4.2. Extension to fixed-design regression. So far, the design points $(x_i)_{i=1}^n$ in this paper for the nonparametric regression problem $y_i = f(x_i) + e_i$ are assumed to be randomly sampled over $[0,1]^p$. This is referred to as the

random-design regression problem. There are, however, many cases where the design points $(x_i)_{i=1}^n$ are fixed and can be controlled. One example is the design and analysis of computer experiments [8, 7]. To emulate a computer model, the design points are typically manipulated so that they are reasonably spread. In some physical experiments the design points can also be required to be fixed [3]. In this subsection we extend the framework in section 2 to the fixed-design regression problem in a simplified context.

The Bayesian literature dealing with the fixed-design regression problem typically obtains rates of contraction with respect to the empirical $L_2$-distance. We show that when the design points are reasonably selected, contraction in the integrated $L_2$-distance is also obtainable.

Suppose that the design points $(x_i)_{i=1}^n \subset [0,1]$ are one-dimensional and are not randomly sampled, but fixed instead. Intuitively, the design points need to be relatively spread so that the global behavior of the true signal $f_0$ can be recovered as much as possible. Formally, we require that the design points satisfy [4, 43]
$$\sup_{x\in[0,1]}\left|\frac{1}{n}\sum_{i=1}^n \mathbb{1}(x_i \leq x) - x\right| = O\left(\frac{1}{n}\right). \tag{4.3}$$
A simple example of such a design $(x_i)_{i=1}^n$ is the univariate equidistant design, i.e., $x_i = (i - 1/2)/n$. It clearly satisfies (4.3) (see, for example, [43]).

Within the orthonormal random series framework in section 2, we assume that $f$ yields the Fourier series expansion $f(x) = \sum_{k=1}^\infty\beta_k\psi_k(x)$, where $\psi_1(x) = 1$, $\psi_{2k}(x) = \sqrt{2}\sin(2k\pi x)$, and $\psi_{2k+1}(x) = \sqrt{2}\cos(2k\pi x)$, $k \in \mathbb{N}_+$. We also assume that $f_0(x) = \sum_{k=1}^\infty\beta_{0k}\psi_k(x)$ lies in the $\alpha$-Hölder space $C_\alpha(Q)$ with $\alpha > 1$. Now consider the sieve
$$F_{m_n}(\delta) = \left\{ f(x) = \sum_{k=1}^\infty\beta_k\psi_k(x) : \sum_{k=m_n+1}^\infty |\beta_k - \beta_{0k}|\,k^\alpha \leq \delta \right\} \tag{4.4}$$
with $\alpha > 1$. Then clearly $F_{m_n}(\delta)$ satisfies condition (2.2). The following lemma is the key ingredient in bridging the gap between the empirical $L_2$-distance and the integrated $L_2$-distance.
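Condition (4.3) is easy to evaluate for a concrete design. The sketch below computes the exact supremum with the standard Kolmogorov-Smirnov order-statistic formula and applies it to the equidistant design $x_i = (i - 1/2)/n$; the helper names `design_discrepancy` and `equidistant_design` are hypothetical and chosen for illustration only.

```python
import numpy as np


def design_discrepancy(x):
    """sup over x in [0,1] of |(1/n) * #{i : x_i <= x} - x| for design points in [0,1]."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    i = np.arange(1, n + 1)
    # The supremum is attained at, or just to the left of, a design point.
    return float(max(np.max(i / n - x), np.max(x - (i - 1) / n)))


def equidistant_design(n):
    """Univariate equidistant design x_i = (i - 1/2) / n, i = 1, ..., n."""
    return (np.arange(1, n + 1) - 0.5) / n


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, design_discrepancy(equidistant_design(n)), 1.0 / (2 * n))
```

For the equidistant design the discrepancy equals $1/(2n)$, so (4.3) holds with constant $1/2$.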

Lemma 4.2. Suppose the design points $(x_i)_{i=1}^n$ are fixed and satisfy (4.3). Let $F_{m_n}(\delta)$ be defined by (4.4) with a sequence $(m_n)_{n=1}^\infty$ such that $m_n \to \infty$ and $m_n/n \to 0$. Suppose $f$ and $f_1 \in F_{m_n}(\delta)$. Then for sufficiently large $n$,
$$\left|P_n(f - f_1)^2 - \|f - f_1\|_2^2\right| \leq \frac{Cm_n}{n}\|f - f_1\|_2^2 + \frac{C\delta}{n}\|f - f_1\|_2,$$
$$\left|P_n(f_0 - f_1)^2 - \|f_0 - f_1\|_2^2\right| \leq \frac{Cm_n}{n}\|f_0 - f_1\|_2^2 + \frac{C\delta}{n}\|f_0 - f_1\|_2$$
hold for some universal constant $C > 0$.

For the fixed-design regression problem, it turns out that the integrated $L_2$-distance is locally testable over the sieve $F_{m_n}(\delta_n)$. The underlying reason is that by forcing the design points to satisfy (4.3), the empirical $L_2$-distance can be bounded by the integrated $L_2$-distance from both below and above, as suggested by lemma 4.2.

Lemma 4.3. Suppose the design points $(x_i)_{i=1}^n$ are fixed and satisfy (4.3). Let $F_{m_n}(\delta)$ be defined as in (4.4) with $m_n \to \infty$, $m_n/n \to 0$, and $\delta$ some constant. Then for any $f_1 \in F_{m_n}(\delta)$ with $\sqrt{n}\|f_1 - f_0\|_2 > 1$, there exists a test function $\phi_n : \mathcal{Y}^n \to [0,1]$ such that
$$E_0\phi_n \leq \exp\left(-Cn\|f_1 - f_0\|_2^2\right), \qquad \sup_{\{f\in F_{m_n}(\delta):\ \|f - f_1\|_2 \leq \xi\|f_0 - f_1\|_2\}} E_f(1 - \phi_n) \leq \exp\left(-Cn\|f_1 - f_0\|_2^2\right)$$
for some constant $C > 0$ and $\xi \in (0,1)$.

Now we present the generic rate of contraction theorem for fixed-design regression. We remark that the distance we use is the integrated $L_2$-distance rather than the empirical $L_2$-distance widely used in the Bayesian literature.

Theorem 4.3 (Generic Contraction, Fixed-design). Suppose the design points $(x_i)_{i=1}^n$ are fixed and satisfy (4.3). Let $(\epsilon_n)_{n=1}^\infty$ and $(\underline{\epsilon}_n)_{n=1}^\infty$ be sequences such that $\min(n\epsilon_n^2, n\underline{\epsilon}_n^2) \to \infty$ as $n \to \infty$ with $0 \leq \underline{\epsilon}_n \leq \epsilon_n \to 0$. Let the sieves $(F_{m_n}(\delta))_{n=1}^\infty$ be defined by (4.4), where $m_n \to \infty$ and $m_n/n \to 0$. Suppose the conditions (2.3), (2.4), and
$$\Pi\left(\|f - f_0\|_\infty < \underline{\epsilon}_n\right) \geq \exp\left(-Dn\underline{\epsilon}_n^2\right)$$
hold for some constant $D > 0$ and sufficiently large $n$ and $M$, with the sequence $(\delta_n)_{n=1}^\infty$ fixed at a constant $\delta$. Then
$$E_0\left[\Pi\left(\|f - f_0\|_2 > M\epsilon_n \mid D_n\right)\right] \to 0.$$

Let us, for a concrete example, consider the squared-exponential Gaussian process prior for fixed-design nonparametric regression.
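The link between the empirical and the integrated $L_2$-distance in lemma 4.2 can be illustrated in an idealized special case. The sketch below assumes the orthonormal trigonometric basis $\psi_1 = 1$, $\psi_{2k} = \sqrt{2}\sin(2k\pi x)$, $\psi_{2k+1} = \sqrt{2}\cos(2k\pi x)$ and the equidistant design $x_i = (i - 1/2)/n$, and it considers a finite trigonometric sum only, so it is not a substitute for the lemma, which also handles an infinite tail controlled by $\delta$. The function names are hypothetical.

```python
import numpy as np


def fourier_basis(num_terms, x):
    """Columns psi_1, ..., psi_{num_terms} of the assumed trigonometric basis evaluated at x."""
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < num_terms:
        cols.append(np.sqrt(2.0) * np.sin(2 * np.pi * k * x))
        if len(cols) < num_terms:
            cols.append(np.sqrt(2.0) * np.cos(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)


def empirical_vs_integrated(beta, n):
    """Return (P_n g^2, ||g||_2^2) for g = sum_k beta_k psi_k on the equidistant design."""
    beta = np.asarray(beta, dtype=float)
    x = (np.arange(1, n + 1) - 0.5) / n
    g = fourier_basis(beta.size, x) @ beta
    empirical = float(np.mean(g ** 2))    # P_n g^2
    integrated = float(np.sum(beta ** 2)) # ||g||_2^2 by orthonormality
    return empirical, integrated


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    beta = rng.normal(size=21)            # frequencies up to 10
    print(empirical_vs_integrated(beta, n=64))
    print(empirical_vs_integrated(beta, n=12))  # too few points: aliasing can separate the two
```

When twice the highest frequency is smaller than $n$, the two quantities coincide up to rounding, because every nonconstant frequency appearing in $g^2$ averages to zero over the equidistant design. With too few design points they can differ, which is consistent with the requirement $m_n/n \to 0$.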

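Theorem 4.4. Consider the nonparametric regression $y_i = f(x_i) + e_i$, $i = 1,\ldots,n$, where $(x_i)_{i=1}^n$ are fixed design points on $[0,1]$ satisfying (4.3), $(e_i)_{i=1}^n$ are i.i.d. mean-zero Gaussian with $\mathrm{var}(e_i) = \sigma^2$, and $f_0 \in H_\alpha(Q)$ for some $\alpha > 1$ and $Q > 0$. Suppose $f$ is imposed the squared-exponential Gaussian process prior $\Pi = \mathrm{GP}(0, K)$. Then there exists some sufficiently large constant $M > 0$ such that
$$E_0\left[\Pi\left(\|f - f_0\|_2 > M n^{-1/2}\log n \mid D_n\right)\right] \to 0.$$

Remark 4.2. For the squared-exponential Gaussian process regression, the rate of contraction with respect to the integrated $L_2$-distance for $f_0 \in A_4(Q)$ has been studied in the literature. A similar result is obtained in theorem 3.1. In contrast, we remark that for the fixed-design regression problem, theorem 4.4 is new and original, and provides a stronger result compared to the existing literature (see, for example, theorem 10 in [33]).

5. Proofs of main results.

Proof of lemma 2.1. We first observe the following fact: for any $f(x) = \sum_k\beta_k\psi_k(x) \in F_m(\delta)$, the following holds:
$$\|f - f_0\|_\infty \lesssim \sqrt{m}\,\|f - f_0\|_2 + \delta. \tag{5.1}$$
In fact, by the Cauchy-Schwarz inequality,
$$\|f - f_0\|_\infty \lesssim \sum_{k=1}^\infty|\beta_k - \beta_{0k}| \leq \sqrt{m}\left[\sum_{k=1}^m(\beta_k - \beta_{0k})^2\right]^{1/2} + \sum_{k=m+1}^\infty|\beta_k - \beta_{0k}| \leq \sqrt{m}\,\|f - f_0\|_2 + \delta,$$
and hence $\|f - f_0\|_\infty^2 \lesssim m\|f - f_0\|_2^2 + \delta^2$. Let us take $\xi = 1/(4\sqrt{2})$. Define the test function to be
$$\phi_n = \mathbb{1}\left\{\sum_{i=1}^n y_i[f_1(x_i) - f_0(x_i)] - \frac{1}{2}nP_n(f_1^2 - f_0^2) \geq \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}\right\},$$
with the test statistic
$$T_n = \sum_{i=1}^n y_i[f_1(x_i) - f_0(x_i)] - \frac{1}{2}nP_n(f_1^2 - f_0^2) - \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}.$$

We first consider the type I error probability. Under $P_0$, we have $y_i = f_0(x_i) + e_i$, where the $e_i$ are i.i.d. $\mathrm{N}(0,\sigma^2)$ noises. Therefore, there exists a constant $C_1 > 0$ such that $P_0(e_i > t) \leq \exp(-4C_1t^2)$ for all $t > 0$. Then for a sequence $(a_i)_{i=1}^n \in \mathbb{R}^n$, the Chernoff bound yields
$$P_0\left(\sum_{i=1}^n a_ie_i \geq t\right) \leq \exp\left(-\frac{4C_1t^2}{\sum_{i=1}^n a_i^2}\right).$$
Now we set $a_i = f_1(x_i) - f_0(x_i)$ and
$$t = \frac{1}{2}nP_n(f_1 - f_0)^2 + \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}.$$
Clearly,
$$t^2 \geq nP_n(f_1 - f_0)^2\left[\frac{1}{4}nP_n(f_1 - f_0)^2 + \frac{1}{128}n\|f_1 - f_0\|_2^2\right].$$
Then under $P_0(\cdot \mid x_1,\ldots,x_n)$, we have
$$E_0(\phi_n \mid x_1,\ldots,x_n) = P_0\left(\sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] \geq t \,\Big|\, x_1,\ldots,x_n\right) \leq \exp\left[-C_1nP_n(f_1 - f_0)^2 - \frac{C_1}{32}n\|f_1 - f_0\|_2^2\right] \leq \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right).$$
It follows that the unconditional error can be bounded:
$$E_0\phi_n \leq \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right).$$

We next consider the type II error probability. Under $P_f$, we have $y_i = f(x_i) + e_i$ with the $e_i$ being i.i.d. mean-zero Gaussian. Then for any $f$ with $\|f - f_1\|_2 \leq \|f_0 - f_1\|_2/(4\sqrt{2}) \leq \|f_0 - f_1\|_2/4$, we have
$$E_f(1 - \phi_n) \leq E\left[\mathbb{1}\left\{P_n(f - f_1)^2 \leq \frac{1}{16}P_n(f_1 - f_0)^2\right\}E_f(1 - \phi_n \mid x_1,\ldots,x_n)\right] + P\left(P_n(f - f_1)^2 > \frac{1}{16}P_n(f_1 - f_0)^2\right).$$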

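When $P_n(f - f_1)^2 \leq P_n(f_1 - f_0)^2/16$, we have
$$\begin{aligned} T_n + \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2} &= \sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] + nP_n(f - f_1)(f_1 - f_0) + \frac{1}{2}nP_n(f_1 - f_0)^2 \\ &\geq \sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] + \frac{1}{2}nP_n(f_1 - f_0)^2 - n\sqrt{P_n(f - f_1)^2}\sqrt{P_n(f_1 - f_0)^2} \\ &\geq \sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] + \frac{1}{4}nP_n(f_1 - f_0)^2. \end{aligned}$$
Now set
$$R = R(x_1,\ldots,x_n) = \frac{1}{4}nP_n(f_1 - f_0)^2 - \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}.$$
Then given $R \geq \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}$, we use the Chernoff bound to obtain
$$P_f(T_n < 0 \mid x_1,\ldots,x_n) \leq P\left(\sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] \leq -R \,\Big|\, x_1,\ldots,x_n\right) \leq \exp\left(-\frac{4C_1R^2}{nP_n(f_1 - f_0)^2}\right) \leq \exp\left(-\frac{C_1n\|f_1 - f_0\|_2^2}{32}\right).$$
On the other hand,
$$P\left(R < \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}\right) = P\left(P_n(f_1 - f_0)^2 < \frac{1}{2}\|f_1 - f_0\|_2^2\right) = P\left(G_n(f_1 - f_0)^2 < -\frac{\sqrt{n}}{2}\|f_1 - f_0\|_2^2\right).$$
It follows that
$$\begin{aligned} &E\left[\mathbb{1}\left\{P_n(f - f_1)^2 \leq \frac{1}{16}P_n(f_1 - f_0)^2\right\}E_f(1 - \phi_n \mid x_1,\ldots,x_n)\right] \\ &\quad\leq E\left[\mathbb{1}\left\{R \geq \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2},\ P_n(f - f_1)^2 \leq \frac{1}{16}P_n(f_1 - f_0)^2\right\}P_f(T_n < 0 \mid x_1,\ldots,x_n)\right] \\ &\qquad + P\left(R < \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}\right) \\ &\quad\leq \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right) + P\left(G_n(f_1 - f_0)^2 < -\frac{\sqrt{n}}{2}\|f_1 - f_0\|_2^2\right). \end{aligned}$$
Using Bernstein's inequality (Lemma 19.32 in [34]), we obtain the tail probability of the empirical process $G_n(f_1 - f_0)^2$:
$$\begin{aligned} P\left(G_n(f_1 - f_0)^2 < -\frac{\sqrt{n}}{2}\|f_1 - f_0\|_2^2\right) &\leq \exp\left(-\frac{1}{4}\cdot\frac{n\|f_1 - f_0\|_2^4/4}{E(f_1 - f_0)^4 + \|f_1 - f_0\|_2^2\|f_1 - f_0\|_\infty^2/2}\right) \\ &\leq \exp\left(-\frac{n\|f_1 - f_0\|_2^2}{24\|f_1 - f_0\|_\infty^2}\right) \leq \exp\left(-\frac{\tilde{C}n\|f_1 - f_0\|_2^2}{m\|f_1 - f_0\|_2^2 + \delta^2}\right) \end{aligned}$$
for some constant $\tilde{C} > 0$, where we use the relation (5.1). On the other hand, when $P_n(f - f_1)^2 > P_n(f_1 - f_0)^2/16$, we again use Bernstein's inequality and the fact that $f \in \{f \in F_m(\delta) : \|f - f_1\|_2^2 \leq 2^{-5}\|f_0 - f_1\|_2^2\}$ to compute
$$\begin{aligned} P\left(P_n(f - f_1)^2 > \frac{1}{16}P_n(f_1 - f_0)^2\right) &= P\left(P_n\left[(f - f_1)^2 - \frac{1}{16}(f_1 - f_0)^2\right] - \|f - f_1\|_2^2 + \frac{1}{16}\|f_1 - f_0\|_2^2 > \frac{1}{16}\|f_1 - f_0\|_2^2 - \|f - f_1\|_2^2\right) \\ &\leq P\left(G_n\left[(f - f_1)^2 - \frac{1}{16}(f_1 - f_0)^2\right] > \frac{\sqrt{n}}{32}\|f_1 - f_0\|_2^2\right) \\ &\leq \exp\left(-\frac{1}{4}\cdot\frac{n\|f_1 - f_0\|_2^4/1024}{\|g\|_2^2 + \|f_1 - f_0\|_2^2\|g\|_\infty/32}\right), \end{aligned}$$
where $g = (f - f_1)^2 - (f_1 - f_0)^2/16$. We further compute
$$\|g\|_2^2 \lesssim E(f - f_1)^4 + E(f_1 - f_0)^4 \leq \|f - f_1\|_\infty^2\|f - f_1\|_2^2 + \|f_1 - f_0\|_\infty^2\|f_1 - f_0\|_2^2 \lesssim \left(m\|f_1 - f_0\|_2^2 + \delta^2\right)\|f_0 - f_1\|_2^2,$$
where we use (5.1), the fact that $\|f - f_1\|_2 \leq \|f_0 - f_1\|_2$, and that
$$\|f - f_1\|_\infty^2 \lesssim \|f - f_0\|_\infty^2 + \|f_0 - f_1\|_\infty^2 \lesssim m\|f - f_0\|_2^2 + m\|f_0 - f_1\|_2^2 + \delta^2 \lesssim m\|f_1 - f_0\|_2^2 + \delta^2.$$
Similarly, we obtain
$$\|g\|_\infty \leq \|f - f_1\|_\infty^2 + \frac{1}{16}\|f_1 - f_0\|_\infty^2 \lesssim m\|f_0 - f_1\|_2^2 + \delta^2.$$
Therefore, we end up with
$$P\left(P_n(f - f_1)^2 > \frac{1}{16}P_n(f_1 - f_0)^2\right) \leq \exp\left(-\frac{\tilde{C}n\|f_1 - f_0\|_2^2}{m\|f_1 - f_0\|_2^2 + \delta^2}\right),$$
where $\tilde{C} > 0$ is some constant. Hence we obtain the following exponential bounds for the type I and type II error probabilities:
$$E_0\phi_n \leq \exp\left(-Cn\|f_1 - f_0\|_2^2\right), \qquad E_f(1 - \phi_n) \leq \exp\left(-Cn\|f_1 - f_0\|_2^2\right) + 2\exp\left(-\frac{Cn\|f_1 - f_0\|_2^2}{m\|f_1 - f_0\|_2^2 + \delta^2}\right)$$
for some constant $C > 0$ whenever $\|f - f_1\|_2^2 \leq \|f_1 - f_0\|_2^2/32$. Taking the supremum of the type II error over $f \in \{f \in F_m(\delta) : \|f - f_1\|_2^2 \leq \|f_1 - f_0\|_2^2/32\}$ completes the proof.

Proof of lemma 2.2. We partition the alternative set into a disjoint union:
$$\{f \in F_m(\delta) : \|f - f_0\|_2 > M\epsilon_n\} \subset \bigcup_{j=M}^\infty\{f \in F_m(\delta) : j\epsilon_n < \|f - f_0\|_2 \leq (j+1)\epsilon_n\} := \bigcup_{j=M}^\infty S_{nj}(\epsilon_n).$$
For each $S_{nj}(\epsilon_n)$, we can find $N_{nj} = N(\xi j\epsilon_n, S_{nj}(\epsilon_n), \|\cdot\|_2)$-many functions $f_{njl} \in S_{nj}(\epsilon_n)$ such that
$$S_{nj}(\epsilon_n) \subset \bigcup_{l=1}^{N_{nj}}\{f \in F_m(\delta) : \|f - f_{njl}\|_2 \leq \xi j\epsilon_n\}.$$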

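Since for each $f_{njl}$ we have $f_{njl} \in S_{nj}(\epsilon_n)$, implying that $\|f_{njl} - f_0\|_2 > j\epsilon_n$, we obtain the final decomposition of the alternative
$$S_{nj}(\epsilon_n) \subset \bigcup_{l=1}^{N_{nj}}\{f \in F_m(\delta) : \|f - f_{njl}\|_2 \leq \xi\|f_0 - f_{njl}\|_2\}.$$
Now we apply lemma 2.1 to construct an individual test function $\phi_{njl}$ for each $f_{njl}$ satisfying the following property:
$$E_0\phi_{njl} \leq \exp\left(-Cnj^2\epsilon_n^2\right), \qquad \sup_{\{f\in F_m(\delta):\ \|f - f_{njl}\|_2 \leq \xi\|f_0 - f_{njl}\|_2\}} E_f(1 - \phi_{njl}) \leq \exp\left(-Cnj^2\epsilon_n^2\right) + 2\exp\left(-\frac{Cnj^2\epsilon_n^2}{mj^2\epsilon_n^2 + \delta^2}\right),$$
where we have used the fact that $\|f_{njl} - f_0\|_2 > j\epsilon_n$. Now define $\phi_n = \sup_{j\geq M}\max_{1\leq l\leq N_{nj}}\phi_{njl}$. Then the type I error probability can be upper bounded using the union bound:
$$E_0\phi_n \leq \sum_{j=M}^\infty\sum_{l=1}^{N_{nj}}E_0\phi_{njl} \leq \sum_{j=M}^\infty\sum_{l=1}^{N_{nj}}\exp\left(-Cnj^2\epsilon_n^2\right) = \sum_{j=M}^\infty N_{nj}\exp\left(-Cnj^2\epsilon_n^2\right).$$
The type II error probability can also be upper bounded:
$$\begin{aligned} \sup_{\{f\in F_m(\delta):\ \|f - f_0\|_2 > M\epsilon_n\}} E_f(1 - \phi_n) &\leq \sup_{j\geq M}\ \sup_{l=1,\ldots,N_{nj}}\ \sup_{\{f\in F_m(\delta):\ \|f - f_{njl}\|_2 \leq \xi\|f_0 - f_{njl}\|_2\}} E_f(1 - \phi_{njl}) \\ &\leq \sup_{j\geq M}\ \sup_{l=1,\ldots,N_{nj}}\left[\exp\left(-Cj^2n\epsilon_n^2\right) + 2\exp\left(-\frac{Cnj^2\epsilon_n^2}{mj^2\epsilon_n^2 + \delta^2}\right)\right] \\ &\leq \exp\left(-CM^2n\epsilon_n^2\right) + 2\exp\left(-\frac{CnM^2\epsilon_n^2}{mM^2\epsilon_n^2 + \delta^2}\right). \end{aligned}$$

Proof of lemma 2.3. Denote the re-normalized restriction of $\Pi$ on $B_n = B_n(k_n, \epsilon_n, \omega_n)$ by $\Pi(\cdot \mid B_n)$, and define the random variables $(V_{ni})_{i=1}^n$, $(W_{ni})_{i=1}^n$ by
$$V_{ni} = \int[f_0(x_i) - f(x_i)]\,\Pi(df \mid B_n), \qquad W_{ni} = \frac{1}{2}\int[f(x_i) - f_0(x_i)]^2\,\Pi(df \mid B_n).$$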

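Then by Jensen's inequality,
$$\begin{aligned} H_n^c &:= \left\{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df) \leq \Pi(B_n)\exp\left[-\left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\} \\ &\subset \left\{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df \mid B_n) \leq \exp\left[-\left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\} \\ &\subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n(e_iV_{ni} + W_{ni}) \geq \left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right\} \\ &\subset \left\{\sum_{i=1}^n e_iV_{ni} \geq C\sigma^2n\epsilon_n^2\right\} \cup \left\{\sum_{i=1}^n W_{ni} > n\epsilon_n^2\right\}. \end{aligned}$$
Now we use the Chernoff bound for Gaussian random variables to obtain the conditional probability bound for the first event given the design points $(x_i)_{i=1}^n$:
$$P_0\left(\sum_{i=1}^n e_iV_{ni} \geq C\sigma^2n\epsilon_n^2 \,\Big|\, x_1,\ldots,x_n\right) \leq \exp\left(-\frac{C^2\sigma^4n\epsilon_n^4}{2\sigma^2P_nV_{ni}^2}\right).$$
Since over the function class $B_n$ we have $\|f - f_0\|_2 \leq \epsilon_n$, $k_n\epsilon_n^2 = O(1)$, $\omega_n = O(1)$, and
$$\|f - f_0\|_\infty \lesssim \sum_{k=1}^\infty|\beta_k - \beta_{0k}| \leq \sqrt{k_n}\left[\sum_{k=1}^{k_n}(\beta_k - \beta_{0k})^2\right]^{1/2} + \sum_{k=k_n+1}^\infty|\beta_k - \beta_{0k}| \leq \sqrt{k_n}\,\|f - f_0\|_2 + \omega_n = O(1),$$
it follows from Fubini's theorem that
$$E V_{ni}^2 \leq \int\|f_0 - f\|_2^2\,\Pi(df \mid B_n) \leq \epsilon_n^2,$$
$$E V_{ni}^4 \leq E\left[\int[f_0(x) - f(x)]^4\,\Pi(df \mid B_n)\right] \leq \int\|f - f_0\|_\infty^2\|f - f_0\|_2^2\,\Pi(df \mid B_n) \lesssim \int\|f - f_0\|_2^2\,\Pi(df \mid B_n) \leq \epsilon_n^2.$$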

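Hence by Chebyshev's inequality,
$$P\left(\left|P_nV_{ni}^2 - EV_{ni}^2\right| > \epsilon\epsilon_n^2\right) \leq \frac{1}{n\epsilon^2\epsilon_n^4}\mathrm{var}(V_{ni}^2) \leq \frac{1}{n\epsilon^2\epsilon_n^4}EV_{ni}^4 \lesssim \frac{1}{n\epsilon^2\epsilon_n^2} \to 0$$
for any $\epsilon > 0$, i.e., $P_nV_{ni}^2 \leq EV_{ni}^2 + o_P(\epsilon_n^2) \leq \epsilon_n^2(1 + o_P(1))$, and hence
$$\exp\left(-\frac{C^2\sigma^4n\epsilon_n^4}{2\sigma^2P_nV_{ni}^2}\right) \leq \exp\left(-\frac{C^2\sigma^4n\epsilon_n^2}{2\sigma^2(1 + o_P(1))}\right) \to 0$$
in probability. Therefore, by the dominated convergence theorem, the unconditional probability goes to 0:
$$P_0\left(\sum_{i=1}^n e_iV_{ni} \geq C\sigma^2n\epsilon_n^2\right) = E\left[P_0\left(\sum_{i=1}^n e_iV_{ni} \geq C\sigma^2n\epsilon_n^2 \,\Big|\, x_1,\ldots,x_n\right)\right] \leq E\left[\exp\left(-\frac{C^2\sigma^4n\epsilon_n^4}{2\sigma^2P_nV_{ni}^2}\right)\right] \to 0.$$
For the second event we use Bernstein's inequality. Since
$$EW_{ni} = \frac{1}{2}\int\|f - f_0\|_2^2\,\Pi(df \mid B_n) \leq \frac{1}{2}\epsilon_n^2,$$
$$EW_{ni}^2 = \frac{1}{4}E\left[\int(f(x) - f_0(x))^2\,\Pi(df \mid B_n)\right]^2 \leq \frac{1}{4}E\left[\int(f(x) - f_0(x))^4\,\Pi(df \mid B_n)\right] \leq \frac{1}{4}\int\|f - f_0\|_\infty^2\|f - f_0\|_2^2\,\Pi(df \mid B_n) \lesssim \epsilon_n^2,$$
we have
$$\begin{aligned} P\left(\sum_{i=1}^n W_{ni} > n\epsilon_n^2\right) &\leq P\left(\sum_{i=1}^n(W_{ni} - EW_{ni}) > \frac{1}{2}n\epsilon_n^2\right) = P\left(\frac{1}{\sqrt{n}}\sum_{i=1}^n(W_{ni} - EW_{ni}) > \frac{1}{2}\sqrt{n}\epsilon_n^2\right) \\ &\leq \exp\left(-\frac{1}{4}\cdot\frac{n\epsilon_n^4/4}{EW_{ni}^2 + \epsilon_n^2\|W_{ni}\|_\infty/2}\right) \leq \exp\left(-\frac{\hat{C}_1n\epsilon_n^2}{1 + \|W_{ni}\|_\infty}\right), \end{aligned}$$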

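where
$$\|W_{ni}\|_\infty = \sup_{x\in[0,1]^p}\frac{1}{2}\int[f(x) - f_0(x)]^2\,\Pi(df \mid B_n).$$
Since for any $f \in B_n$, $\|f - f_0\|_\infty = O(1)$, it follows that $\|W_{ni}\|_\infty = O(1)$, and hence $P(\sum_iW_{ni} > n\epsilon_n^2) \to 0$. We conclude that
$$P(H_n^c) \leq P_0\left(\sum_{i=1}^n e_iV_{ni} \geq C\sigma^2n\epsilon_n^2\right) + P\left(\sum_{i=1}^n W_{ni} > n\epsilon_n^2\right) \to 0.$$

Proof of lemma 2.4. Denote by $\Pi(\cdot \mid B_n) = \Pi(\cdot \cap B_n)/\Pi(B_n)$ the re-normalized restriction of $\Pi$ on $B_n = \{\|f - f_0\|_\infty < \epsilon_n\}$, and let
$$V_{ni} = \int[f_0(x_i) - f(x_i)]\,\Pi(df \mid B_n), \qquad W_{ni} = \frac{1}{2}\int[f(x_i) - f_0(x_i)]^2\,\Pi(df \mid B_n).$$
Similar to the proof of lemma 2.3, we obtain
$$\begin{aligned} H_n^c &= \left\{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df) \leq \Pi(B_n)\exp\left[-\left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\} \\ &\subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n(e_iV_{ni} + W_{ni}) \geq \left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right\} \subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n e_iV_{ni} \geq \left(C + \frac{1}{2\sigma^2}\right)n\epsilon_n^2\right\}, \end{aligned}$$
where we use the fact that $W_{ni} \leq \frac{1}{2}\int\|f - f_0\|_\infty^2\,\Pi(df \mid B_n) \leq \epsilon_n^2/2$ in the last step. Conditioning on the design points $(x_i)_{i=1}^n$, we have
$$\begin{aligned} P_0(H_n^c \mid x_1,\ldots,x_n) &\leq \exp\left[-\left(C + \frac{1}{2\sigma^2}\right)^2\frac{\sigma^4n\epsilon_n^4}{2\sigma^2P_nV_{ni}^2}\right] \\ &\leq \exp\left\{-\left(C + \frac{1}{2\sigma^2}\right)^2\frac{\sigma^4n\epsilon_n^4}{2\sigma^2}\left[P_n\left(\int\|f - f_0\|_\infty\,\Pi(df \mid B_n)\right)^2\right]^{-1}\right\} \\ &\leq \exp\left\{-\left(C + \frac{1}{2\sigma^2}\right)^2\frac{\sigma^4n\epsilon_n^4}{2\sigma^2}\left[\left(\int\|f - f_0\|_\infty\,\Pi(df \mid B_n)\right)^2\right]^{-1}\right\} \\ &\leq \exp\left[-\left(C + \frac{1}{2\sigma^2}\right)^2\frac{\sigma^4n\epsilon_n^2}{2\sigma^2}\right] \to 0. \end{aligned}$$
The proof is completed by applying the dominated convergence theorem.

Proof of theorem 2.1. Let $\phi_n$ be the test function given by lemma 2.2 and
$$H_n = \left\{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df) \geq \exp\left[-\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2\right]\right\}.$$
It follows from condition (2.5) that
$$H_n^c \subset \left\{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df) < \Pi(B_n(k_n, \underline{\epsilon}_n, \omega_n))\exp\left[-\left(\frac{D}{2} + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2\right]\right\},$$
and hence $P_0(H_n^c) = o(1)$ by lemma 2.3. Now we decompose the expected value of the posterior probability:
$$\begin{aligned} E_0\left[\Pi(\|f - f_0\|_2 > M\epsilon_n \mid D_n)\right] &\leq E_0\left[(1 - \phi_n)\mathbb{1}(H_n)\Pi(\|f - f_0\|_2 > M\epsilon_n \mid D_n)\right] + E_0\phi_n + E_0\left[(1 - \phi_n)\mathbb{1}(H_n^c)\right] \\ &\leq E_0\left[(1 - \phi_n)\mathbb{1}(H_n)\frac{\int_{\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)}{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df)}\right] + E_0\phi_n + P_0(H_n^c). \end{aligned}$$
By (2.3) and lemma 2.2, the type I error probability $E_0\phi_n \to 0$. It suffices to bound the first term. Observe that on the event $H_n$, the denominator in the square bracket can be lower bounded:
$$\begin{aligned} &E_0\left[(1 - \phi_n)\mathbb{1}(H_n)\frac{\int_{\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)}{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df)}\right] \\ &\quad\leq \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2\right]E_0\left[(1 - \phi_n)\int_{F_{m_n}(\delta_n)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)\right] \\ &\qquad + \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2\right]E_0\left[\int_{F^c_{m_n}(\delta_n)}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)\right]. \end{aligned}$$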

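By Fubini's theorem and lemma 2.2, we have
$$\begin{aligned} E_0\left[(1 - \phi_n)\int_{F_{m_n}(\delta_n)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)\right] &= \int_{F_{m_n}(\delta_n)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}E_0\left[(1 - \phi_n)\exp(\Lambda_n(f \mid D_n))\right]\Pi(df) \\ &\leq \sup_{f\in F_{m_n}(\delta_n)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}E_f(1 - \phi_n) \\ &\leq \exp\left(-CM^2n\epsilon_n^2\right) + 2\exp\left(-\frac{CM^2n\epsilon_n^2}{m_n\epsilon_n^2M^2 + \delta_n^2}\right) \lesssim \exp\left(-\tilde{C}M^2n\epsilon_n^2\right) \end{aligned}$$
for some constant $\tilde{C} > 0$ for sufficiently large $n$, since $m_n\epsilon_n^2 \to 0$ and $\delta_n = O(1)$ by assumption. For the integral on $F^c_{m_n}(\delta_n)$, we apply Fubini's theorem to obtain
$$E_0\left[\int_{F^c_{m_n}(\delta_n)}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)\right] = \int_{F^c_{m_n}(\delta_n)}E_0\left[\prod_{i=1}^n\frac{p_f(x_i, y_i)}{p_0(x_i, y_i)}\right]\Pi(df) \leq \Pi\left(F^c_{m_n}(\delta_n)\right).$$
Hence we proceed to compute
$$\begin{aligned} &E_0\left[(1 - \phi_n)\mathbb{1}(H_n)\frac{\int_{\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f \mid D_n))\,\Pi(df)}{\int\exp(\Lambda_n(f \mid D_n))\,\Pi(df)}\right] \\ &\quad\lesssim \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2 - \tilde{C}M^2n\epsilon_n^2\right] + \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2 - \left(2D + \frac{1}{\sigma^2}\right)n\underline{\epsilon}_n^2\right] \to 0 \end{aligned}$$
as long as $M$ is sufficiently large, where (2.4) is applied.

SUPPLEMENTARY MATERIAL

Supplement to "A theoretical framework for Bayesian nonparametric regression: orthonormal random series and rates of contraction" (doi: to be determined; .pdf). The supplementary material contains the remaining proofs and additional technical results.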
