arxiv: v1 [math.st] 15 Dec 2017

Size: px
Start display at page:

Download "arxiv: v1 [math.st] 15 Dec 2017"

Transcription

1 A THEORETICAL FRAMEWORK FOR BAYESIAN NONPARAMETRIC REGRESSION: ORTHONORMAL RANDOM SERIES AND RATES OF CONTRACTION arxiv: v1 math.st] 15 Dec 017 By Fangzheng Xie, Wei Jin, and Yanxun Xu Johns Hopkins University We develop a unifying framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated L -distance without assuming the regression function space to be uniformly bounded. The framework is built upon orthonormal random series in a flexible manner. A general theorem for deriving rates of contraction for Bayesian nonparametric regression is provided under the proposed framework. As specific applications, we obtain the near-parametric rate of contraction for the squared-exponential Gaussian process when the true function is analytic, the adaptive rates of contraction for the sieve prior, and the adaptive-and-exact rates of contraction for the un-modified block prior when the true function is α-smooth. Extensions to wavelet series priors and fixeddesign regression problems are also discussed. 1. Introduction. Consider the standard nonparametric regression problem y i = fx i +e i, i = 1,,n, where the set of predictors x i n are referred to as design points and take values in 0,1] p R p, e i s are independent and identically distributed i.i.d. mean-zero Gaussian noises with vare i = σ, and y i s are the responses. We follow the popular Bayesian approach by assigning f a carefully-selected prior, and perform inference tasks by finding the posterior distribution of f given the observations x i,y i n. We propose a theoretical framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated L - distance { 1/ f g = fx gx] dx}. X The regression function f is assumed to be represented using a set of appropriate orthonormal basis functions ψ k : fx = k β kψ k x. The MSC 010 subject classifications: Primary 6G08; secondary 6C10, 6G0 Keywords and phrases: Bayesian nonparametric regression, integrated L -distance, orthonormal random series, rate of contraction 1

2 F. XIE, W. JIN, AND Y. XU coefficients β k are allowed to range over the entire real line R and the space of regression functions is allowed to be unbounded, including the renowned Gaussian process priors as special examples. Rates of contraction of posterior distributions for Bayesian nonparametric priors have been studied extensively. Following the earliest framework on generic rates of contraction theorems with i.i.d. data proposed by 13], specific examples for density estimation via Dirichlet process mixture models 5, 14, 17, 30] and location-scale mixture models, 38] are discussed. For nonparametric regression, the rates of contraction had not been discussed until 15], who develop a generic framework for fixed-design nonparametric regression to study rates of contraction with respect to the empirical L -distance. There are extensive studies for various priors that fall into this framework, including location-scale mixture priors 9], conditional Gaussian tensor-product splines 10], and Gaussian processes 35, 37], among which adaptive rates are obtained in 9, 10, 37]. Although it is interesting to achieve adaptive rates of contraction with respect to the empirical L -distance for nonparametric regression, this might be restrictive since the empirical L -distance quantifies the convergence of functions only at the given design points. In nonparametric regression, one also expects that the error between the estimated function and the true function can be globally small over the whole design space 39], i.e., small mean-squared error for out-of-sample prediction. Therefore the integrated L -distance is a natural choice. For Gaussian processes, 33], 40], and 4] provide contraction rates for nonparametric regression with respect to the integerated L -distance. A novel spike-and-slab wavelet series prior is constructed in 43] to achieve adaptive contraction with respect to the stronger L -distance. These examples however, take advantages of their respective prior structures and may not be easily generalized. Although a generic framework for theintegrated L -loss is presentedin 0], the priorthereis imposed on a uniformly bounded function space and hence rules out some popular priors, e.g., the popular Gaussian process prior 5]. It is therefore natural to ask the following fundamental question: for Bayesian nonparametric regression, can one build a unifying framework to study rates of contraction for various priors with respect to the integrated L -distance without assuming the uniform boundedness of the regression function space? In this paper we provide a positive answer to this question. Inspired by 11], we build the general framework upon series expansion of functions with respect to a set of orthonormal basis functions. The prior is not necessarily supported on a uniformly bounded function space. A general rate of contraction theorem with respect to the integrated L -distance for

3 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 3 Bayesian nonparametric regression is provided under the proposed framework. Specific applications that fall into this framework include the classical squared-exponential Gaussian process prior 5], the sieve prior 6, 8], and the un-modified block prior 11], etc. In particular, for the block prior regression, rather than modifying the block prior by conditioning on a truncated function space as in 11] with a known upper bound for the unknown true regression function, we prove that the un-modified block prior automatically yields rate-exact Bayesian adaptation for nonparametric regression without such a truncation. We also discuss some natural extensions of the proposed framework, including the wavelet series and the fixed-design regression problems. The analyses of the above applications and extensions under the proposed framework also generalize their respective results in the literature. This is made clear in sections 3 and 4. The literatures regarding rates of contraction for Bayesian nonparmatric models are largely based on the cutting-edge prior-concentration-and-tesing procedure proposed by ] and 13]. The major challenge of establishing a general framework for Bayesian nonparametric regression is that unlike the Hellinger distance for density estimation problem and the empirical L - distanceforfixed-regressionproblem,theintegrated L -distanceforrandomdesign regression problem is not locally testable. The exact definition of local testability of a distance is provided later, but a locally testable distance is the key ingredient for the prior-concentration-and-tesing procedure to work. It turns out that by modifying the testing procedure, we only require the integrated L -distance to be locally testable over an appropriate class of functions. The above arguments are made precise in section. The layout of this paper is as follows. In section we introduce the random series framework for Bayesian nonparametric regression and present the main result concerning rates of contraction. As applications of the main result, we derive the rates of contraction of various renowned priors for nonparametric regression in the literature with substantial improvements in section 3. Section 4 elaborates on extensions to wavelet series priors as well as fixed-design regression problems. The technical proofs of the main results are deferred to section 5. Notations. For 1 r, we use r to denote both the l r -norm on any finite dimensional Euclidean space and the integrated L r -norm of a measurable function with respect to the Lebesgue measure. In particular, for any function f L 0,1] p, we use f to denote the integrated L - norm defined to be f = 0,1] p f xdx. We follow the convention that when r =, the subscript is omitted, i.e., =. The Hilbert space

4 4 F. XIE, W. JIN, AND Y. XU l denotes the space of sequences that are squared-summable. We use x to denote the maximal integer no greater than x, and x to denote the minimum integer no less than x. The notations a b and a b denote the inequalitiesuptoapositivemultiplicativeconstant,andwewritea bifa b and a b. Throughout capital letters C,C 1, C,C,D,D 1, D,D, are used to denote generic positive constants and their values might change from line to line unless particularly specified, but are universal and unimportant for the analysis. We refer to P as a statistical model if it consists of a class of densities on a sample space X with respect to some underlying σ-finite measure. Given a frequentist statistical model P and the i.i.d. data w i n from some P P, the prior and the posterior distribution on P are always denoted by Π and Π w 1,,w n, respectively. Given a function f : X R, we use P n f = n 1 n fx i to denote the empirical measure of f, and G n f = n 1/ n fx i Efx i ] to denote the empirical process of f, given the i.i.d. data x i n. With a slight abuse of notations, when applying to a set of design points x i n, we also denote P nf = n 1 n fx i and G n f = n 1/ n fx i Efx i ] to be the empirical measure and empirical process, even when the design points x i n are deterministic. In particular, φ denotes the probability density function of the univariate standard normal distribution, and we use the shorthand notation φ σ y = φy/σ/σ to denote the density of N0,σ. For a metric space F,d, for any ǫ > 0, the ǫ-covering number of F,d, denoted by Nǫ,F,d, is defined to be the minimum number of ǫ-balls of the form {g F : df,g < ǫ} that are needed to cover F.. The framework and main results. Consider the nonparametric regression model: y i = fx i +e i, where e i n are i.i.d. mean-zero Gaussian noises with vare i = σ, and x i n are design points taking values in 0,1] p. Throughout sections and 3 the design points x i n are assumed to be independently and uniformly sampled. The framework presented in this paper can be easily generalized to the case where the design points are independently sampled from a density function that is bounded away from 0 and. We assume that the responses y i s are generated from y i = f 0 x i +e i for some unknown f 0 L 0,1] p, thus the data D n = x i,y i n can be regarded as i.i.d. samples from a distribution P 0 with joint density p 0 x,y = φ σ y f 0 x. Throughout we assume that the variance σ of the noises is known, but our framework can be easily extended to the case where σ is unknown by placing a prior on σ that is supported on a compact interval contained in 0,.

5 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 5 Theregression model is parametrized by the function f, and we follow the standard nonparametric Bayes procedure by assigning a prior distribution Π on f as follows. Let ψ k be a set of orthonormal basis functions in L 0,1] p. Since f 0 L 0,1] p, we also impose the prior Π on a sub-class of L 0,1] p and write f in terms of the orthonormal series expansion fx = β k ψ k x, where the coefficients β k are imposed a prior distribution such that β k < a.s.. The prior on β k can be very general as long as it yields realizations that are squared-summable with probability one. We shall also assume that the true function f 0 yields a series expansion f 0 x = β 0k ψ k x, where β 0k <. We require that ψ k are uniformly bounded. The widely used tensor-product Fourier basis satisfies this requirement: ψ k1 k p x 1,,x p = p ψk 1 j x j, where ψ1 1x = 1, ψ1 k x = sinkπx, and ψ1 k+1 x = coskπx, k = 1,,. Remark.1. We remark that the prior Π is supported on a function space that may not beuniformlyboundedand thesetup hereis moregeneral than that in 0]. The proposed framework includes the well-known Gaussian process prior as a special case. Let K : 0,1] p 0,1] p 0, be a positive definite covariance function that yields an eigenfunction expansion Kx,x = λ kψ k xψ k x, where ψ k is a set of orthonormal eigenfunctions of K, and λ k are the corresponding eigenvalues. For a Gaussian process prior GP0, K on f, the Karhunen-Loève theorem 36] yields the following representation of f: fx = j=1 β k ψ k x, where β k N0,λ k independently. Before presenting the main result for studying convergence of Bayesian nonparametric regression under the proposed framework, let us first recall the extensively used prior-concentration-and-testing procedure for deriving

6 6 F. XIE, W. JIN, AND Y. XU rates of contraction proposed by 13]. The general setup is as follows. Suppose w i n W are i.i.d. data sampled from a distribution P 0 that yields a density p 0 with respect to some underlying σ-finite measure. Let Θ be the parameter space typically infinite dimensional for nonparametric problems such that p 0 = p θ0 for some θ 0 Θ. By imposing a prior Π on the parameter θ Θ equipped with a suitable measurable structure, the posterior distribution can be written as ΠA D n = AexpΛ nθ D n ]Πdθ Θ expλ nθ D n ]Πdθ, for any measurable A in Θ, where Λ n is the log-likelihood ratio Λ n θ D n = n log p θw i p 0 w i. An appropriate distance d on Θ is crucial to study of rates of contraction, since typically different distances would yield different rates, and when Θ is infinite dimensional, different distances are possibly not equivalent. In 13], a distance that satisfies the local testability condition is required. Definition.1 13]. Given a parameter space Θ, we say that a distance d : Θ Θ 0, is locally testable, if there exists a sequence of test functions φ n : W n 0,1] such that the following holds for some constant C > 0 when n is sufficiently large whenever θ 1 θ 0 : E 0 φ n exp Cndθ 0,θ 1, sup E θ 1 φ n exp Cndθ 0,θ 1, dθ,θ 1 ξdθ 0,θ 1 where ξ 0,1 is a fixed constant. When the parameter space Θ itself is a class of densities, i.e., {p θ : θ Θ} is parametrized by p itself, it is proved that the Hellinger distance is locally testable 3]. The widely-used empirical L -distance for fixed-design nonparametric regression is also locally testable see, for example, 15]. For a locally testable distance d on Θ, in order that the posterior distribution Π D n contracts to θ 0 at rate ǫ n, i.e., Πdθ,θ 0 > Mǫ n D n 0 in P 0 -probability for some large constant M > 0, the prior-concentrationand-testing procedure requires that the following hold with some constant D > 0 for sufficiently large n:

7 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 7 1. The prior concentration condition holds: Π E 0 log p 0 ǫ n p,e 0 log p ] 0.1 ǫ n e Dnǫ n. θ p θ. There exists a sequence Θ n n=1 of subsets of Θ often referred to as the sieves and test functions φ n n=1 such that ΠΘ c n e Dnǫ n, E 0 φ n 0, sup E θ 1 φ n 0. θ Θ n {dθ,θ 0 >Mǫ n} The major challenge for establishing a general framework to study rates of contraction for nonparametric regression with respect to the integrated L - distance is that is not locally testable when the space of functions is not uniformly bounded. To resolve this issue, we modify the general procedure above by imposing more assumptions on the sieves. For convenience we use Λ n = Λ n f D n to denote the log-likelihood ratio Λ n f D n = n log p fx i,y i p 0 x i,y i, where p f x,y = φ σ y fx is the density parametrized by the regression function f. For a fixed integer m and a positive constant δ, we denote F m δ by a class of functions in L 0,1] p that satisfies the following property: { }. F m δ fx = β k ψ k x : β k β 0k δ. k=m+1 The sieves F n n=1 are constructed by letting F n = F mn δ n for certain sequences m n n=1 N + and δ n n=1 0,. Rather than considering the supremum over {f : f f 0 ξ f 1 f 0 } as in 13], in the proposed framework we consider the supremum over the smaller {f : f f 0 ξ f 1 f 0 } F m δ and obtain the following weaker result analogous to the local testability condition. Lemma.1. Let F m δ satisfies.. Then for any f 1 F m δ with n f1 f 0 > 1, there exists a test function φ n : X Y n 0,1] such that E 0 φ n exp Cn f 1 f 0,

8 8 F. XIE, W. JIN, AND Y. XU sup E f 1 φ n exp Cn f 1 f 0 {f F mδ: f f 1 ξ f 0 f 1 } +exp Cn f 1 f 0 m f 1 f 0 +δ for some constant C > 0 and ξ 0,1. The following lemma constructs a test function that tests against the large composite alternative F mn δ n { f f 0 > Mǫ n }. In particular, it is useful to estimate the type I error E 0 φ n and the type II error sup f Fmδ { f f 0 >Mǫ n}e f 1 φ n. We also observe that this result is similar to the testing procedure described above. Lemma.. Suppose that F m δ satisfies. for an m N + and a δ > 0. Let ǫ n n=1 be a sequence with nǫ n. Then there exists a sequence of test functions φ n n=1 such that E 0 φ n N nj exp Cnj ǫ n, j=m sup E f 1 φ n exp CM nǫ n {f F mδ: f f 0 >Mǫ n} +exp CM nǫ n mm ǫ n +δ, where N nj = Nξjǫ n,s nj ǫ n, is the covering number of S nj ǫ n = {f F m δ : jǫ n < f f 0 j +1ǫ n }, and C is some positive constant. The prior concentration condition.1 is a very important ingredient in the study of Bayes theory. It guarantees that the denominator in the posterior distribution Θ expλ nθ D n ]Πdθ can be lower bounded by e D nǫ n for some constant D > 0 with large probability see, for example, lemma 8.1 in 13]. In the context of normal regression, the Kullback-Leibler divergence is proportional to the integrated L -distance between two regression functions. Motivated by this observation, we establish the following lemma that yields exponential lower bound for the denominator expλ n Πdf under the proposed framework. Lemma.3. Denote { B n m,ǫ,ω = fx = β k ψ k x : f f 0 < ǫ, k=m+1 β k β 0k ω }

9 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 9 for any ǫ,δ > 0 and m N +. Suppose sequences ǫ n n=1,ω n n=1 0,, and k n n=1 satisfy ǫ n 0, nǫ n, k n ǫ n = O1, and ω n = O1. Then for any constant C > 0, P 0 expλ n Πdf ΠB n k n,ǫ n,ω n exp C + 1σ ] nǫ n 0. In some cases it is also straightforward to consider the prior concentration with respect to the stronger -norm. For example, for a wide class of Gaussian process priors, the prior concentration Π f f 0 < ǫ has been extensively studied see, for example, 16, 35, 37] for more details. Lemma.4. Suppose the sequence ǫ n n=1 satisfies ǫ n 0 and nǫ n. Then for any constant C > 0, P 0 expλ n Πdf Π f f 0 < ǫ n exp C + 1σ ] nǫ n 0. Now we present the main result regarding the rates of contraction for Bayesian nonparametric regression under the orthonormal random series framework. The proof is based on the modification of the prior-concentrationand-testing procedure. Theorem.1 Generic Contraction. Let ǫ n n=1 and ǫ n n=1 be sequences such that minnǫ n,nǫ n as n, with 0 ǫ n ǫ n 0. Assume that the sieve F mn δ n n=1 satisfies., m nǫ n 0, and δ n = O1. In addition, assume that there exist another two sequences k n n=1 N +, ω n n=1 0, such that k nǫ n = O1 and ω n = O1. Suppose the following conditions hold for some constant D > 0 and sufficiently large n and M:.3.4 N nj exp Dnj ǫ n 0, j=m ΠF c m n δ n exp D + 1σ ] nǫ n,.5 ΠB n k n,ǫ n,ω n exp Dnǫ n, where N nj = Nξjǫ n,s nj ǫ n, is the covering number of S nj = {f F mn δ n : jǫ n < f f 0 j + 1ǫ n ]}, and B n k n,ǫ n,ω n is defined in lemma.3. Then E 0 Π f f 0 > Mǫ n D n ] 0.

10 10 F. XIE, W. JIN, AND Y. XU Remark.. In light of lemma.4, by exploiting the proof of theorem.1 we remark that when the assumptions and conditions in theorem.1 hold with.5 replaced by Π f f 0 < ǫ n exp Dnǫ n, it also holds that E 0 Π f f 0 > Mǫ n D n ] 0 for sufficiently large M > Applications. In this section we consider three concrete priors on f for the nonparametric regression problem y i = fx i + e i, i = 1,,n. For simplicity the design points are assumed to be one-dimensional, i.e., x i n 0,1]. For some of the examples, the results presented in this section can be easily generalized to the case where the design points are multi-dimensional by considering the tensor-product Fourier basis. The results for these applications under the proposed framework also generalize their respective counterparts in the literature. The three concrete priors considered are the squared-exponential Gaussian process, the sieve prior, and the un-modified block prior. The basis functions ψ k are assumed to be the Fourier basis, i.e., ψ 1x = 1, ψ k x = sinkπx, and ψ k+1 x = coskπx. For the squaredexponentialgaussianprocess,theunderlyingtrueregressionfunctionf 0 x = β 0kψ k x that governs P 0 is assumed to satisfy β 0k ek /4 <. The resulting f 0 is not only infinitely differentiable, but can also be analytically extended to a complex strip in the complex plane C that contains 0,1] see, for example, exercise I.4.4 in 1]. For the other two priors, the underlying true regression function f 0 that governs P 0 is assumed to be αsmooth in a wide sense. Specifically, a function f is said to be an α-smooth function, if it lies in one of the following function classes: α-sobolev function: A squared-integrable function f L 0,1] is said to be an α-sobolev function for α > 1/, if under the Fourier basis ψ k = 1, sinkπx, coskπx,k N +, the coefficients β k = 1 0 fxψ kxdx satisfy kα βk <. α-hölder function: A squared-integrable function f L 0,1] is said to be an α-hölder function for α > 1/, if under the Fourier basis ψ k = 1, sinkπx, coskπx,k N +, the coefficients β k = 1 0 fxψ kxdx satisfy kα β k <. Roughly speaking when α N +, a function is α-sobolev if it has squaredintegrable derivatives up to order α 1, and is α-hölder if it has Lipschitz continuous derivatives up to order α 1. When the regression function f 0 is α-smooth, it is proved that the minimax-optimal rate of convergence is

11 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 11 ǫ n = n α/α+1 31], and the rate of contraction using a Bayes procedure cannot be faster than the minimax-optimal rate. We derive the rates of contraction for the sieve prior and the block prior under the framework proposed in section and show that they are minimax-optimal possibly up to a logarithmic factor Squared-exponential Gaussian process with near parametric rate. In this subsection we consider one of the most widely used and perhaps the most popular Gaussian processes 5] GP0, K with the covariance function Kx,x = exp x x ] of the squared-exponential form. We claim that such a Gaussian process prior indeed falls into the proposed framework. The eigen-system of the squared-exponential covariance function has been studied in the literature see, for example, 41]. Under the Fourier basis, the K yields the following eigen-expansion Kx,x = λ kψ k xψ k x, where the eigenvalues λ k decay at λ k exp k /4. It follows by the Karhunen-Loèvetheoremthatf yieldsaseriesexpansionfx = β kψ k x, where β k N0,λ k. Given constant c,q > 0, define the function class A c Q = { fx = β k ψ k x : } k βk exp Q. c The function class A c Q is closely related to the reproducing kernel Hilbert space RKHS associated with GP0, K, a concept playing a fundamental role in the literature of Bayes theory. For a complete and thorough review of RKHS from a Bayesian perspective, we refer to 36]. For the squaredexponential Gaussian process regression, the following property regarding the corresponding RKHS is available by applying theorem 4.1 in 36]. Lemma 3.1. Let H be the RKHS associated with the squared-exponential Gaussian process GP0,K, where Kx,x = exp x x ]. Then A 4 Q H for any Q > 0. Under the squared-exponential Gaussian process prior Π, the rate of contraction for f 0 A 4 Q is 1/ n up to a logarithmic factor. Theorem 3.1. Consider the nonparametric regression y i = fx i + e i, i = 1,,n, where x i n are uniformly sampled from 0,1], e i n are i.i.d. mean-zero Gaussian noises with vare i = σ, and f 0 A 4 Q for some Q > 0. Suppose f follows the squared-exponential Gaussian process prior

12 1 F. XIE, W. JIN, AND Y. XU Π = GP0,K. Then there exists some sufficiently large constant M > 0 such that ] E 0 Π f f 0 > Mn 1/ logn D n 0. The major technique for proving theorem 3.1 using theorem.1 is the construction of an appropriate sieve F mn δ n. Specifically, the sieve F mn δ n is taken as F mn Q = fx = β k ψ k x : β k β 0k e k /8 Q, with the sequences m n logn and δ n = Q. The probability ΠF c m n Q is bounded by the Markov s inequality, and the covering number N nj is bounded by the following lemma: Lemma 3.. For all ǫ > 0, it holds that { } logn ǫ, β 1,β, l : βk /c ek Q, log 1 ǫ 3/. Remark 3.1. The contraction rate n 1/ logn is the same as the parametric rate 1/ n up to a logarithmic factor the renowned Bernstein-von Mises theorem, see, for example, chapter 10 in 34]. This is in accordance with the following interesting fact: the space of functions that yield analytic extension to an open subset of the complex plane is only slightly larger than finite-dimensional spaces in terms of their covering numbers see, for example, proposition C.9 in 16]. The result obtained in theorem 3.1 is similar to that in 33]. However, we put the squared-exponential Gaussian process regression in a more general framework that is quite flexible and allows other non-gaussian priors, as will be seen in subsection Sieve prior regression with adaptive rate. In this subsection the true regression function f 0 is assumed to be in the intersection of the α-hölder ball C α Q and the α-sobolev ball H α Q, where α > 1/ is the smoothness level, { } C α Q = fx = β k ψ k x : k α β k Q, H α Q = { fx = β k ψ k x : } k α βk Q,

13 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 13 and Q > 0 is some constant. In subsection 3.1, we provide a near-parametric rate of contraction to an analytic function f 0 A 4 Q with the squaredexponential Gaussian process prior, which is infinitely differentiable. Nevertheless when f 0 is only smooth up to level α, i.e., f is either α-sobolev or α-hölder, the resulting rate of contraction is only a polynomial of logn and is much slower than the minimax-optimal rate n α/α+1 33]. In this subsection we consider estimating an α-smooth function f 0 without knowing its smoothness level α a priori. In the literature of Bayes theory, such a procedure is referred to as adaptive. We shall also assume that a lower bound α for α is known: α α > 1/. The sieve prior 1, 6, 8] is another popular prior in the literature of Bayesian nonparametric theory. It is a class of hierarchical priors that first draw an integer-valued random variable serving as the number of terms to be used in a finite sum, and then sample the term-wise parameters given the number of terms. The sieve prior typically does not depend on the smoothness level of the true function, often yields minimax-optimal rates of contraction up to a logarithmic factor in many other nonparametric problems e.g., density estimation 6, 8] and fixed-design regression 1], and hence is fully adaptive. However, the adaptive rates of contraction for the sieve prior in random-design regression with respect to the integrated L -distance hasnot beenestablished. In this subsection weaddressthis issue within the orthonormal random series framework proposed in Section. We now formally construct the sieve prior for nonparametric regression. Let fx be represented by a set of orthonormal basis functions ψ k in L 0,1]: fx = β kψ k x. The coefficients β k are imposed a prior distribution as follows: first sample an integer-valued random variable N from a density function π N with respect to the counting measure on N +, and then given N = m the coefficients β k s are independently sampled according to Πβ k A N = m = ½k m k γ gk γ βdβ + ½k > mδ 0 A, where A is any measurable set, g is an exponential power density gx exp τ 0 x τ for some τ,τ 0 > 0 9], and γ 1/,α. We further require that π N m exp b 0 mlogm and π N m exp b 1 mlogm A N=m+1 for some constants b 1, b 0. The zero-truncated Poisson distribution π N m = e λ 1 1 λ m /m!½m 1 satisfies the above condition 38].

14 14 F. XIE, W. JIN, AND Y. XU The following theorem shows that the sieve prior constructed above is adaptive and the rate of contraction with respect to the integrated L - distance is minimax-optimal up to a logarithmic factor. Theorem 3.. Consider the nonparametric regression y i = fx i + e i, i = 1,,n, where x i n are uniformly sampled from 0,1], e i n are i.i.d. mean-zero Gaussian with vare i = σ, and f 0 C α Q H α Q for some α > 1/ and Q > 0. Suppose f is imposed the prior Π given above. Then there exists some sufficiently large constant M > 0 such that ] E 0 Π f f 0 > Mn α/α+1 logn t D n 0 for any t > α/α Block prior regression with adaptive and exact rate. In the literature of adaptive Bayesian procedure, the minimax-optimal rates of contraction are often obtained with an extra logarithmic factor. It typically requires significant work to obtain the exact minimax-optimal rate. Gao and Zhou 11] elegantly construct a modified block prior that yields rate-adaptive i.e., the prior does not depend on the smoothness level and rate-exact contraction for a wide class of nonparametric problems, without the redundant logarithmic factor. Nevertheless, for nonparametric regression, 11] modifies the block prior by conditioning on the space of uniformly bounded functions. Requiring a known upper bound for the unknown f 0 when constructing the prior is restrictive since it eliminates the popular Gaussian process priors. Besides the theoretical concern, the block prior itself is also a conditional Gaussian process and such a modification is inconvenient for implementation. In this section, we address this issue by showing that for nonparametric regression such a modification is not necessary. Namely, the un-modified block prior itself also yields the exact minimax-optimal rate of contraction for α-sobolev f 0 H α Q and it does not depend on the smoothness level α of the true regression function. We now describe the block prior. Given a sequence β = β 1,β, in the squared-summable sequence space l, define the lth block B l to be the integer index set B l = {k l,,k l+1 1} and n l = B l = k l+1 k l, where k l = e l. We use β l = β j : j B l R n l to denote the coefficients with index lying in the lth block B l. Let ψ k be the Fourier basis, i.e., ψ 1 x = 1, ψ k x = sinπkx, and ψ k+1 x = cosπkx, k N +. The block prior is a prior distribution on fx = β kψ k x induced by the following distribution on the Fourier coefficients β k : A l g l independently for each l,

15 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 15 β l A l N0,A l I nl independently for each l, where g l l=0 is a sequence of densities satisfying the following properties: 1. There exists c 1 > 0 such that for any l and t e l,e l ], 3.1 g l t exp c 1 e l.. There exists c > 0 such that for any l, 3. 0 tg l tdt 4exp c l. 3. There exists c 3 > 0 such that for any l, 3.3 e l g l tdt exp c 3 e l. The existence of a sequence of densities g l l=0 satisfying 3.1, 3., and 3.3 is verified in 11] see proposition.1 in 11]. Our major improvement for the block prior regression is the following theorem, which shows that the un-modified block prior yields rate-exact Bayesian adaptation for nonparametric regression. Theorem 3.3. Consider the nonparametric regression y i = fx i + e i, i = 1,,n, where x i n are uniformly sampled from 0,1], e i n are i.i.d. mean-zero Gaussian noises with vare i = σ, and f 0 H α Q for some α > 1/ and Q > 0. Suppose fx = β kψ k x is imposed the block prior Π as described above. Then there exists some sufficiently large constant M > 0 such that ] E 0 Π f f 0 > Mn α/α+1 D n 0. Rather than using the sieve F n proposed in theorem.1 in 11], which does not necessarily satisfy., weconstructf mn δ n inaslightly different fashion: { } F mn Q = fx = β k ψ k x : β k β 0k k α Q with m n n 1/α+1 and δ n = Q. The covering number N nj can be bounded using the metric entropy for Sobolev balls for example, see lemma 6.4 in 3], and the rest conditions in.1 can be verified using similar techniques as in 11].

16 16 F. XIE, W. JIN, AND Y. XU 4. Extensions Extension to wavelet series. Besides the popular Fourier basis, another widely-used class of orthonormal basis functions for L 0,1] are the wavelet basis functions. They are related to the Besov space 4], which is more flexible. For a complete and thorough review of wavelets from a statistical perspective, we refer to 18]. Letψ jk j N,k Ij beanorthonormalbasisofcompactlysupportedwavelets for L 0,1], with j referringto the so-called resolution level, and k to the dilation see, for example, section E.3 in 16]. We adopt the convention that the index set I j for the jth resolution level runs through {0,1,, j 1}. The exact definition and specific formulas for the wavelet basis are not of great interest to us. We shall assume that the wavelet basis ψ jk s are appropriately selected such that for any fx = j=0 following inequalities hold 6, 7, 19]: k I j β jk ψ jk x, the 1/ f = βjk 1/ β jk j=0 k I j j=0 k Ij f = sup β jk ψ jk x x 0,1] j=0 k I j j/ max β jk. k I j j=0 For the construction of a specific wavelet basis ψ kj satisfying the above conditions, we refer to 7]. Remark 4.1. We remark that technically speaking the wavelet random series does not fall into the framework proposed in section, since unlike the Fourier basis, the wavelet basis functions are typically not uniformly bounded in : sup k,j ψ kj =. However, the supremum norm of the wavelet series can be upper bounded in terms of the wavelet coefficients. Such a feature allows us to modify the framework in section and build up its wavelet counterpart. Similar to the framework proposed in section, we also assume that the true regression function f 0 yields a wavelet series expansion f 0 x = j=0 k I j β 0jk ψ jk x.

17 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 17 For a given integer J and positive real δ > 0, analogous to., we require that the class of functions F J δ satisfies the following property: F J δ fx = 4.1 β jk ψ jk x : j/ max β jk β 0jk δ k I j j=0 k I j j=j holds forall J N + andδ 0,. Thesieves F n n=1 arethen constructed by taking F n = F Jn δ n for carefully chosen sequences J n n=1 N and δ n n=1 0,. We provide the corresponding local testing result below for the wavelet series, which is similar to lemma.1. Lemma 4.1. Let F J δ satisfies 4.1. Then for any f 1 F J δ with n f1 f 0 > 1, there exists a test function φ n : X Y n 0,1] such that E 0 φ n exp Cn f 1 f 0, sup E f 1 φ n exp Cn f 1 f 0 {f F J δ: f f 1 ξ f 0 f 1 } +exp Cn f 1 f 0 J f 1 f 0 +δ for some constant C > 0 and ξ 0,1. The following generic rate of contraction theorem can be proved following the exact same lines of that for theorem.1. For any J N + and ǫ,ω 0,, define B n J,ǫ,ω = f = β jk ψ jk : f f 0 < ǫ, j/ max β jk β 0jk ω k I j. j=0 k I j j=j Theorem 4.1 Generic Contraction, Wavelet Series. Let ǫ n n=1 and ǫ n n=1 be sequences such that minnǫ n,nǫ n as n with 0 ǫ n ǫ n 0. Assume that the sieve F Jn δ n n=1 satisfies 4.1 and Jn ǫ n 0 and δ n = O1. In addition, assume that there exist another two sequences j n n=1 N +, ω n n=1 0, such that jn ǫ n = O1 and ω n = O1. Suppose the conditions.3,.4, and 4. ΠB n j n,ǫ n,ω n exp Dnǫ n

18 18 F. XIE, W. JIN, AND Y. XU hold for some constant D > 0 and sufficiently large n and M, with F Jn δ n in replace of F mn δ n. Then E 0 Π f f 0 > Mǫ n D n ] 0. As discussed in section 4. in 11], the block prior can be easily extended to the wavelet series. Given a function f with wavelet series expansion fx = j=0 k I j β jk ψ jk x, the block prior for wavelet series is introduced through the wavelet coefficients β jk s as follows: A j g j independently for each j, β j A j N0,A k I nk, independently for each j, where β j = β jk : k I j, n k = I k = j, and g j is given by e jlog e j log T j t+t j, 0 t e jlog, g j t = e j log, e jlog < t e jlog, 0, t > e jlog, T j = exp 1+j log ] exp j +j jlog ] +e j log. We further assume that f 0 lies in the,,α-besov ball B α, Q defined as follows 11] for some α > 1/ and Q > 0: B α, Q = fx = β jk ψ jk x : j=0 k I j αj βjk Q. k I j For the block prior regression via wavelet series, the rate-exact Bayesian adaptation also holds. Theorem 4.. Consider the nonparametric regression y i = fx i + e i, i = 1,,n, where x i n are uniformly sampled from 0,1], e i n are i.i.d. mean-zero Gaussian noises with vare i = σ, and f 0 B α, Q for some α > 1/ and Q > 0. Suppose fx = j=0 k I j β jk ψ jk x is imposed the block prior for wavelet series Π as described above. Then there exists some sufficiently large constant M > 0 such that ] E 0 Π f f 0 > Mn α/α+1 D n 0. j=0 4.. Extension to fixed-design regression. Sofar,thedesignpointsx i n in this paper for the nonparametric regression problem y i = fx i +e i are assumed to be randomly sampled over 0,1] p. This is referred to as the

19 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 19 random-design regression problem. There are, however, many cases where the design points x i n are fixed and can be controlled. One of the examples is the design and analysis of computer experiments 8, 7]. To emulate a computer model, the design points are typically manipulated so that they are reasonably spread. In some physical experiments the design points can also be required to be fixed 3]. In this subsection we extend the framework in section to the fixed-design regression problem in a simplified context. The Bayesian literatures dealing with fixed-design regression problem typically obtain rates of contraction with respect to the empirical L -distance. We show that when the design points are reasonably selected, the integrated L -distance contraction is also obtainable. Suppose that the design points x i n 0,1] are one-dimensional and are not randomly sampled, but fixed instead. Intuitively, the design points need to be relatively spread so that the global behavior of the true signal f 0 can berecovered as much as possible. Formally, we requirethat thedesign points satisfy 4, 43] 1 n sup ½x x i x n = O. n x 0,1] Asimpleexampleofsuchdesignx i n istheunivariateequidistancedesign, i.e., x i = i 1//n. It clearly satisfies 4.3 see, for example, 43]. Within the orthonormal random series framework in section, we assume that f yields Fourier series expansion fx = β k ψ k x, where ψ 1 x = 1, ψ k x = sinkπx, and ψ k+1 x = coskπx, k N +. We also assume that f 0 x = β 0kψ k x lies in the α-hölder space C α Q with α > 1. Now consider the sieve F mn δ = fx = 4.4 β k ψ k x : β k β 0k k α δ with α > 1. Then clearly F mn δ satisfies condition.. The following lemma is the key ingredient in bridging the gap between the empirical L -distance and the integrated L -distance.

20 0 F. XIE, W. JIN, AND Y. XU Lemma 4.. Suppose the design points x i n are fixed and satisfy 4.3. Let F mn δ be defined by 4.4 with a sequence m n n=1 such that m n and m n /n 0. Suppose f and f 1 F mn δ. Then for sufficiently large n, P n f f 1 f f 1 Cm n n f f 1 + Cδ n f f 1, P n f 0 f 1 f 0 f 1 Cm n n f 0 f 1 + Cδ n f 0 f 1 hold for some universal constant C > 0. For the fixed-design regression problem, it turns out that the integrated L -distance is locally testable over the sieve F mn δ n. Theunderlyingreason is that by forcing the design points to satisfy 4.3, the empirical L -distance can be bounded by the integrated L -distance from both below and above, as suggested by lemma 4.. Lemma 4.3. Suppose the design points x i n are fixed and satisfy 4.3. Let F mn δ be defined as 4.4 with m n, m n /n 0, and δ is some constant. Then for any f 1 F mn δ with n f 1 f 0 > 1, there exists a test function φ n : Y n 0,1] such that E 0 φ n exp Cn f 1 f 0, sup E f 1 φ n exp Cn f 1 f 0 {f F mn δ: f f 1 ξ f 0 f 1 } for some constant C > 0 and ξ in 0,1. Now we present the generic rate of contraction theorem for fixed-design regression. We remark that the distance we use is the integrated L -distance rather than the empirical L -distance widely used in the Bayesian literature. Theorem 4.3 Generic Contraction, Fixed-design. Suppose the design points x i n are fixed and satisfy 4.3. Let ǫ n n=1 and ǫ n n=1 be sequences such that minnǫ n,nǫ n as n with 0 ǫ n ǫ n 0. Let the sieves F mn δ n=1 be defined by 4.4, where m n and m n /n 0. Suppose the conditions.3,.4, and Π f f 0 < ǫ n exp Dnǫ n hold for some constant D > 0 and sufficiently large n and M, with the sequence δ n n=1 fixed at a constant δ. Then E 0 Π f f 0 > Mǫ n D n ] 0. Let us, for a concrete example, consider the squared-exponential Gaussian process prior for fixed-design nonparametric regression.

21 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 1 Theorem 4.4. Consider the nonparametric regression y i = fx i + e i, i = 1,,n, where x i n are fixed design points on 0,1] satisfying 4.3, e i n are i.i.d. mean-zero Gaussian with vare i = σ, and f 0 H α Q for some α > 1 and Q > 0. Suppose f is imposed the squared-exponential Gaussian process prior Π = GP0, K. Then there exists some sufficiently large constant M > 0 such that ] E 0 Π f f 0 > Mn 1/ logn D n 0. Remark 4.. For the squared-exponential Gaussian process regression, the rate of contraction with respect to the integrated L -distance for f 0 A 4 Q has been studied in the literature. A similar result is obtained in theorem 3.1. In contrast, we remark that for the fixed-design regression problem, theorem 4.4 is new and original, and provides a stronger result compared to the existing literature see, for example, theorem 10 in 33]. 5. Proofs of main results. Proof of lemma.1. Wefirstobservethefollowingfact:foranyfx = k β kψ k x F m δ, the following holds: 5.1 f f 0 m f f 0 +δ. In fact, by the Cauchy-Schwartz inequality, f f 0 β k β 0k m β k β 0k + m f f 0 +δ, k=m+1 β k β 0k and hence, f f 0 m f f 0 +δ. Let us take ξ = 1/4. Define the test function to be { n φ n = ½ y i f 1 x i f 0 x i 1 np nf1 f 0 n + 8 f } 1 f 0 npn f 1 f 0, with the test statistic to be n T n = y i f 1 x i f 0 x i 1 np nf1 f 0

22 F. XIE, W. JIN, AND Y. XU n 8 f 1 f 0 npn f 1 f 0. WefirstconsiderthetypeIerrorprobability.UnderP 0,wehavey i = f 0 x i + e i, where e i s are i.i.d. N0,σ noises. Therefore, there exists a constant C 1 > 0 such that P 0 e i > t exp 4C 1 t for all t > 0. Then for a sequence a i n Rn, Chernoff bound yields P 0 n a i e i t exp 4C 1t n. a i Now we set a i = f 1 x i f 0 x i and t = 1 n np nf 1 f f 1 f 0 npn f 1 f 0. Clearly, 1 t np n f 1 f 0 4 np nf 1 f ] 18 n f 1 f 0. Then under P 0 x 1,,x n, we have n E 0 φ n x 1,,x n = P 0 e i f 1 x i f 0 x i ] t x 1,,x n exp C 1 np n f 1 f 0 C ] 1 3 n f 1 f 0 exp C 1 3 n f 1 f 0. It follows that the unconditioning error can be bounded: E 0 φ n exp C 1 3 n f 1 f 0. We next consider the type II error probability. Under P f, we have y i = fx i + e i with e i s being i.i.d. mean-zero Gaussian. Then for any f with f f 1 f 0 f 1 /4 f 0 f 1 /4, we have { E f 1 φ n E ½ P n f f 1 1 } ] 16 P nf 1 f 0 E f 1 φ n x 1,,x n +P P n f f 1 > 116 P nf 1 f 0.

23 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 3 When P n f f 1 P n f 1 f 0 /16, we have n T n + 8 f 1 f 0 npn f 1 f 0 n = e i f 1 x i f 0 x i ]+np n f f 1 f 1 f np nf 1 f 0 n e i f 1 x i f 0 x i ]+ 1 np nf 1 f 0 n P n f f 1 P n f 1 f 0 n e i f 1 x i f 0 x i ]+ 1 4 np nf 1 f 0. Now set R = Rx 1,,x n = 1 4 np nf 1 f 0 n 8 f 1 f 0 npn f 1 f 0. Then given R n f 1 f 0 npn f 1 f 0 /8, we use the Chernoff bound to obtain n P f T n < 0 x 1,,x n P e i f 1 x i f 0 x i ] R x 1,,x n 4C 1 R exp np n f 1 f 0 exp C 1n f 1 f 0. 3 On the other hand, n P R < 8 f 1 f 0 npn f 1 f 0 = P P n f 1 f 0 < 1 f 1 f 0 n = P G n f 1 f 0 < f 1 f 0. It follows that E ½ { P n f f 1 1 E ½ { R } 16 P nf 1 f 0 n 8 f 1 f 0 npn f 1 f 0, ] E f 1 φ n x 1,,x n

24 4 F. XIE, W. JIN, AND Y. XU P n f f 1 1 } ] 16 P nf 1 f 0 P f T n < 0 x 1,,x n n +P R < 8 f 1 f 0 npn f 1 f 0 exp C 1 n 3 n f 1 f 0 +P G n f 1 f 0 < f 1 f 0. Using Bernstein s inequality Lemma 19.3 in 34], we obtain the tail probability of the empirical process G n f 1 f 0 n P G n f 1 f 0 < f 1 f 0 exp 1 n f 1 f 0 4 /4 4Ef 1 f f 1 f 0 f 1 f 0 / exp n f 1 f 0 4 f 1 f 0 exp C n f 1 f 0 m f 1 f 0, +δ forsomeconstant C > 0,whereweusetherelation5.1.Ontheotherhand, when P n f f 1 > P n f 1 f 0 /16, we again use Bernstein s inequality and the fact that f {f F m δ : f f 1 5 f 0 f 1 } to compute P P n f f 1 > 116 P nf 1 f 0 = P P n f f ] f 1 f 0 f f f 1 f 0 > 1 16 f 1 f 0 f f 1 P G n f f 1 1 ] n 16 f 1 f 0 > 3 f 1 f 0 exp 1 n f 1 f 0 4 /104 4 g + f 1 f 0 g, /3 where g = f f 1 f 1 f 0 /16. We further compute g f f f 1 f 0 f f 1 f f f 1 f 0 f 1 f 0

25 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 5 f f 1 f f 1 + f 1 f 0 f 1 f 0 m f 1 f 0 +δ f 0 f 1, where we use 5.1, the fact that f f 1 f 0 f 1, and that f f 1 f f 0 + f 0 f 1 m f 1 f 0 +m f 0 f 1 +δ m f 1 f 0 +δ. Similarly, we obtain on the other hand, g f f f 1 f 0 = f f f 1 f 0 m f 0 f 1 +δ. Therefore, we end up with P P n f f 1 > 116 P nf 1 f 0 exp C n f 1 f 0 m f 1 f 0, +δ where C > 0 is some constant. Hence we obtain the following exponential bound for type I and type II error probabilities: E 0 φ n exp Cn f 1 f 0 E1 φ n exp Cn f 1 f 0 +exp Cn f 1 f 0 m f 1 f 0 +δ for some constant C > 0 whenever f f 1 f 1 f 0 /3. Taking the supremum of the type II error over f {f F m δ : f f 1 f 1 f 0 /3} completes the proof. Proof of lemma.. We partition the alternative set into disjoint union {f F m δ : f f 0 > Mǫ n } {f F m δ : jǫ n < f f 0 j +1ǫ n } := j=m j=m S nj ǫ n. For each S nj ǫ n, we can find N nj = N ξjǫ n,s nj ǫ n, -many functions f njl S nj ǫ n such that S nj ǫ n N nj l=1 {f F m δ : f f njl ξjǫ n }.

26 6 F. XIE, W. JIN, AND Y. XU Since for each f njl, we have f njl S nj ǫ n, implying that f njl f 0 > jǫ n, we obtain the final decomposition of the alternative S nj ǫ n N nj l=1 {f F m δ : f f njl ξ f 0 f njl }. Now we apply lemma.1 to construct individual test function φ njl for each f njl satisfying the following property E 0 φ njl exp Cnj ǫ n, sup {f F mδ: f f njl ξ f 0 f njl } E f 1 φ njl exp Cnj ǫ n +exp Cnj ǫ n mj ǫ n +δ. where we have used the fact that f njl f 0 > jǫ n. Now define φ n = sup j M max 1 l Nnj φ njl. Then the type I error probability can be upper bounded using the union bound E 0 φ n N nj E 0 φ njl j=m l=1 N nj exp Cnj ǫ n = j=m l=1 N nj exp Cnj ǫ n. j=m The type II error probability can also be upper bounded: sup E f 1 φ n {f F mδ: f f 0 >Mǫ n} sup j M sup j M sup l=1,,n nj sup l=1,,n nj exp CM nǫ n+exp sup {f F mδ: f f njl ξ f 0 f njl } E f 1 φ njl exp Cj nǫ n +exp Cnj ǫ n mj ǫ n +δ CnM ǫ n mm ǫ n +δ. ] Proof of lemma.3. Denotethere-normalizedrestrictionofΠonB n = B n k n,ǫ n,ω n to be Π B n, and the random variables V ni n, W ni n to be V ni = f 0 x i fx i Πdf B n, W ni = 1 fx i f 0 x i Πdf B n.

27 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 7 Then by Jensen s inequality { Hn c : = expλ n f D n Πdf ΠB n exp { expλ n f D n Πdf B n exp { 1 σ { 1 σ n e i V ni +W ni C + 1σ nǫ n C + 1σ C + 1σ nǫ n } } { n n } e i V ni Cnǫ n W ni nǫ n, Now we use the Chernoff bound for Gaussian random variables to obtain the conditional probability bound for the first event given the design points x i n : P 0 n e i V ni Cσ nǫ n x 1,,x n exp C σ 4 nǫ 4 n P n Vni. Since over the function class B n, we have f f 0 ǫ n, k n ǫ n = O1, ω n = O1, and f f 0 β k β 0k kn ] 1/ k n β k β 0k + k n f f 0 +ω n = O1, k=k n+1 it follows from Fubini s theorem that EVni f 0 f Πdf B n ǫ n, E ] Vni 4 E f 0 x fx 4 Πdf B n f f 0 f f 0 Πdf B n ǫ n f f 0 Πdf B n ǫ n. β k β 0k nǫ n ]} ]}

28 8 F. XIE, W. JIN, AND Y. XU Hence by the Chebyshev s inequality, P P n Vni E Vni > ǫ n ǫ 1 nǫ ǫ 4 varvni 1 n nǫ 4 nǫ EV ni 4 1 nǫ n for any ǫ > 0, i.e., P n V ni EV ni +o Pǫ n ǫ n1+o P 1, and hence, exp C σ 4 nǫ 4 n P n Vni = exp C σ 4 nǫ n 0 1+o P 1 0 in probability. Therefore by the dominated convergence theorem the unconditional probability goes to 0: n n ] P 0 e i V ni Cσ nǫ n = E P 0 e i V ni Cσ nǫ n x 1,,x n E exp C σ 4 nǫ 4 ] n P n Vni 0. For the second event we use the Bernstein s inequality. Since EW ni = 1 f f 0 Πdf B n 1 ǫ n, EWni = 1 ] 4 E fx f 0 x Πdf B n 1 ] 4 E fx f 0 x 4 Πdf B n 1 f f 0 4 f f 0 Πdf B n ǫ n, then n n P W ni > nǫ n P W ni EW ni > 1 nǫ n 1 n = P W ni EW ni > 1 nǫ n n exp 1 nǫ 4 n/4 4EWni +ǫ n W ni / Ĉ 1 nǫ n exp, 1+ W ni

29 where A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 9 1 W ni = sup x 0,1] p fx f 0 x Πdf B n. Since for any f B n, f f 0 = O1, it follows that W ni = O1, and hence, P i W ni > nǫ n 0. We conclude that n n PHn c P 0 e i V ni Cσ nǫ n +P W ni > nǫ n 0. Proof of lemma.4. Denote Π B n = Π B n /ΠB n to be the re-normalized restriction of Π on B n = { f f 0 < ǫ n }, and V ni = f 0 x i fx i Πdf B n, W ni = 1 fx i f 0 x i Πdf B n. Similar to the proof of lemma.3, we obtain { Hn c = expλ n f D n Πdf ΠB n exp { 1 n e i V ni +W ni C + 1σ } nǫ n σ { 1 σ n e i V ni C + 1 } σ nǫ n, C + 1σ ]} nǫ n where we use the fact that W ni 1/ f f 0 Πdf B n ǫ n/ for all f B n in the last step. Conditioning on the design points x i n, we have P 0 Hn c x 1,,x n exp C + 1 ] σ 4 nǫ 4 n σ P n Vni { exp C + 1 σ σ 4 nǫ 4 n P n { exp C + 1 σ σ 4 nǫ 4 n ] } 1 f f 0 Πdf B n ] } 1 f f 0 Πdf B n

30 30 F. XIE, W. JIN, AND Y. XU exp C + 1 ] σ σ 4 nǫ n 0. The proof is completed by applying the dominated convergence theorem. Proof of theorem.1. Let φ n be the test function given by lemma. and { 3D H n = expλ n f D n Πdf exp + 1 ]} σ nǫ n. It follows from condition.5 that { D Hn c expλ n Πdf < ΠB n k n,ǫ n,ω n exp + 1 ]} σ nǫ n, and hence, P 0 H c n = o1 by lemma.3. Now we decompose the expected value of the posterior probability E 0 Π f f 0 > Mǫ n D n ] E 0 1 φ n ½H n Π f f 0 > Mǫ n D n ] +E 0 φ n +E 0 1 φ n ½Hn c ] { f f E 0 1 φ n ½H n 0 >Mǫ expλ ] n} nf D n Πdf expλn f D n Πdf +E 0 φ n +P 0 H c n. By.3 and lemma. the type I error probability E 0 φ n 0. It suffices to bound the first term. Observe that on the event H n, the denominator in the square bracket can be lower bounded: E 0 1 φ n ½H n 3D exp + 1 σ E 0 1 φ n 3D +exp + 1 σ { f f 0 >Mǫ n} expλ nf D n Πdf expλn f D n Πdf ] nǫ n F mn δ n { f f 0 >Mǫ n} nǫ n ] expλ n f D n Πdf ] E 0 F c mn δn expλ n f D n Πdf ] ].

31 A THEORETICAL FRAMEWORK FOR BAYESIAN REGRESSION 31 By Fubini s theorem, lemma. we have E 0 1 φ n F mn δ n { f f 0 >Mǫ n} F mn δ n { f f 0 >Mǫ n} expλ n f D n Πdf E 0 1 φ n expλ n D n ]Πdf sup E f 1 φ n f F mn δ n { f f 0 >Mǫ n} exp CM nǫ n +exp CM nǫ n m n ǫ n M +δ n exp CM nǫ n. for some constatn C > 0 for sufficiently large n, since m n ǫ n 0 and δ n = O1 by assumption. For the integral on Fm c n δ n, we apply Fubini s theorem to obtain ] n ] p f x i,y i E 0 expλ n f D n Πdf = Πdf Fmn c δn p 0 x i,y i F c mn δn E 0 ΠF c m n δ n. Hence we proceed to compute { f f E 0 1 φ n ½H n 0 >Mǫ expλ ] n} nf D n Πdf expλn f D n Πdf 3D exp + 1 ] σ nǫ n CM nǫ n 3D +exp + 1 nǫ D n σ + 1σ ] nǫ n 0 as long as M is sufficiently large, where.4 is applied. SUPPLEMENTARY MATERIAL Supplement to A theoretical framework for Bayesian nonparametric regression: orthonormal random series and rates of contraction doi: to be determined;.pdf. The supplementary material contains the remaining proofs and additional technical results. ]

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Nonparametric Bayesian Uncertainty Quantification

Nonparametric Bayesian Uncertainty Quantification Nonparametric Bayesian Uncertainty Quantification Lecture 1: Introduction to Nonparametric Bayes Aad van der Vaart Universiteit Leiden, Netherlands YES, Eindhoven, January 2017 Contents Introduction Recovery

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University Submitted to the Annals of Statistics DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS By Subhashis Ghosal North Carolina State University First I like to congratulate the authors Botond Szabó, Aad van der

More information

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors

Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Ann Inst Stat Math (29) 61:835 859 DOI 1.17/s1463-8-168-2 Asymptotic properties of posterior distributions in nonparametric regression with non-gaussian errors Taeryon Choi Received: 1 January 26 / Revised:

More information

Bayesian Aggregation for Extraordinarily Large Dataset

Bayesian Aggregation for Extraordinarily Large Dataset Bayesian Aggregation for Extraordinarily Large Dataset Guang Cheng 1 Department of Statistics Purdue University www.science.purdue.edu/bigdata Department Seminar Statistics@LSE May 19, 2017 1 A Joint Work

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

The Heine-Borel and Arzela-Ascoli Theorems

The Heine-Borel and Arzela-Ascoli Theorems The Heine-Borel and Arzela-Ascoli Theorems David Jekel October 29, 2016 This paper explains two important results about compactness, the Heine- Borel theorem and the Arzela-Ascoli theorem. We prove them

More information

BAYESIAN DETECTION OF IMAGE BOUNDARIES

BAYESIAN DETECTION OF IMAGE BOUNDARIES Submitted to the Annals of Statistics arxiv: arxiv:1508.05847 BAYESIAN DETECTION OF IMAGE BOUNDARIES By Meng Li and Subhashis Ghosal Duke University and North Carolina State University arxiv:1508.05847v3

More information

Bayesian Nonparametric Point Estimation Under a Conjugate Prior

Bayesian Nonparametric Point Estimation Under a Conjugate Prior University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin

ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE. By Michael Nussbaum Weierstrass Institute, Berlin The Annals of Statistics 1996, Vol. 4, No. 6, 399 430 ASYMPTOTIC EQUIVALENCE OF DENSITY ESTIMATION AND GAUSSIAN WHITE NOISE By Michael Nussbaum Weierstrass Institute, Berlin Signal recovery in Gaussian

More information

Overview of normed linear spaces

Overview of normed linear spaces 20 Chapter 2 Overview of normed linear spaces Starting from this chapter, we begin examining linear spaces with at least one extra structure (topology or geometry). We assume linearity; this is a natural

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535

Additive functionals of infinite-variance moving averages. Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Additive functionals of infinite-variance moving averages Wei Biao Wu The University of Chicago TECHNICAL REPORT NO. 535 Departments of Statistics The University of Chicago Chicago, Illinois 60637 June

More information

Kernel Method: Data Analysis with Positive Definite Kernels

Kernel Method: Data Analysis with Positive Definite Kernels Kernel Method: Data Analysis with Positive Definite Kernels 2. Positive Definite Kernel and Reproducing Kernel Hilbert Space Kenji Fukumizu The Institute of Statistical Mathematics. Graduate University

More information

JUHA KINNUNEN. Harmonic Analysis

JUHA KINNUNEN. Harmonic Analysis JUHA KINNUNEN Harmonic Analysis Department of Mathematics and Systems Analysis, Aalto University 27 Contents Calderón-Zygmund decomposition. Dyadic subcubes of a cube.........................2 Dyadic cubes

More information

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9

MAT 570 REAL ANALYSIS LECTURE NOTES. Contents. 1. Sets Functions Countability Axiom of choice Equivalence relations 9 MAT 570 REAL ANALYSIS LECTURE NOTES PROFESSOR: JOHN QUIGG SEMESTER: FALL 204 Contents. Sets 2 2. Functions 5 3. Countability 7 4. Axiom of choice 8 5. Equivalence relations 9 6. Real numbers 9 7. Extended

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

9 Brownian Motion: Construction

9 Brownian Motion: Construction 9 Brownian Motion: Construction 9.1 Definition and Heuristics The central limit theorem states that the standard Gaussian distribution arises as the weak limit of the rescaled partial sums S n / p n of

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Problem set 1, Real Analysis I, Spring, 2015.

Problem set 1, Real Analysis I, Spring, 2015. Problem set 1, Real Analysis I, Spring, 015. (1) Let f n : D R be a sequence of functions with domain D R n. Recall that f n f uniformly if and only if for all ɛ > 0, there is an N = N(ɛ) so that if n

More information

Nonparametric estimation using wavelet methods. Dominique Picard. Laboratoire Probabilités et Modèles Aléatoires Université Paris VII

Nonparametric estimation using wavelet methods. Dominique Picard. Laboratoire Probabilités et Modèles Aléatoires Université Paris VII Nonparametric estimation using wavelet methods Dominique Picard Laboratoire Probabilités et Modèles Aléatoires Université Paris VII http ://www.proba.jussieu.fr/mathdoc/preprints/index.html 1 Nonparametric

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Chapter One. The Calderón-Zygmund Theory I: Ellipticity

Chapter One. The Calderón-Zygmund Theory I: Ellipticity Chapter One The Calderón-Zygmund Theory I: Ellipticity Our story begins with a classical situation: convolution with homogeneous, Calderón- Zygmund ( kernels on R n. Let S n 1 R n denote the unit sphere

More information

Stochastic Spectral Approaches to Bayesian Inference

Stochastic Spectral Approaches to Bayesian Inference Stochastic Spectral Approaches to Bayesian Inference Prof. Nathan L. Gibson Department of Mathematics Applied Mathematics and Computation Seminar March 4, 2011 Prof. Gibson (OSU) Spectral Approaches to

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University

RATE-OPTIMAL GRAPHON ESTIMATION. By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Submitted to the Annals of Statistics arxiv: arxiv:0000.0000 RATE-OPTIMAL GRAPHON ESTIMATION By Chao Gao, Yu Lu and Harrison H. Zhou Yale University Network analysis is becoming one of the most active

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

arxiv: v1 [math.st] 26 Mar 2012

arxiv: v1 [math.st] 26 Mar 2012 POSTERIOR CONSISTENCY OF THE BAYESIAN APPROACH TO LINEAR ILL-POSED INVERSE PROBLEMS arxiv:103.5753v1 [math.st] 6 Mar 01 By Sergios Agapiou Stig Larsson and Andrew M. Stuart University of Warwick and Chalmers

More information

Optimal Estimation of a Nonsmooth Functional

Optimal Estimation of a Nonsmooth Functional Optimal Estimation of a Nonsmooth Functional T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania http://stat.wharton.upenn.edu/ tcai Joint work with Mark Low 1 Question Suppose

More information

Notes 6 : First and second moment methods

Notes 6 : First and second moment methods Notes 6 : First and second moment methods Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Roc, Sections 2.1-2.3]. Recall: THM 6.1 (Markov s inequality) Let X be a non-negative

More information

ICES REPORT Model Misspecification and Plausibility

ICES REPORT Model Misspecification and Plausibility ICES REPORT 14-21 August 2014 Model Misspecification and Plausibility by Kathryn Farrell and J. Tinsley Odena The Institute for Computational Engineering and Sciences The University of Texas at Austin

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Distance between multinomial and multivariate normal models

Distance between multinomial and multivariate normal models Chapter 9 Distance between multinomial and multivariate normal models SECTION 1 introduces Andrew Carter s recursive procedure for bounding the Le Cam distance between a multinomialmodeland its approximating

More information

arxiv: v1 [math.pr] 22 Dec 2018

arxiv: v1 [math.pr] 22 Dec 2018 arxiv:1812.09618v1 [math.pr] 22 Dec 2018 Operator norm upper bound for sub-gaussian tailed random matrices Eric Benhamou Jamal Atif Rida Laraki December 27, 2018 Abstract This paper investigates an upper

More information

Nonparametric Inference In Functional Data

Nonparametric Inference In Functional Data Nonparametric Inference In Functional Data Zuofeng Shang Purdue University Joint work with Guang Cheng from Purdue Univ. An Example Consider the functional linear model: Y = α + where 1 0 X(t)β(t)dt +

More information

Priors for the frequentist, consistency beyond Schwartz

Priors for the frequentist, consistency beyond Schwartz Victoria University, Wellington, New Zealand, 11 January 2016 Priors for the frequentist, consistency beyond Schwartz Bas Kleijn, KdV Institute for Mathematics Part I Introduction Bayesian and Frequentist

More information

Topics in Harmonic Analysis Lecture 1: The Fourier transform

Topics in Harmonic Analysis Lecture 1: The Fourier transform Topics in Harmonic Analysis Lecture 1: The Fourier transform Po-Lam Yung The Chinese University of Hong Kong Outline Fourier series on T: L 2 theory Convolutions The Dirichlet and Fejer kernels Pointwise

More information

Adaptive Bayesian procedures using random series priors

Adaptive Bayesian procedures using random series priors Adaptive Bayesian procedures using random series priors WEINING SHEN 1 and SUBHASHIS GHOSAL 2 1 Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030 2 Department

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued

Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Segment Description of Turbulence

Segment Description of Turbulence Dynamics of PDE, Vol.4, No.3, 283-291, 2007 Segment Description of Turbulence Y. Charles Li Communicated by Y. Charles Li, received August 25, 2007. Abstract. We propose a segment description for turbulent

More information

Lecture 6: September 19

Lecture 6: September 19 36-755: Advanced Statistical Theory I Fall 2016 Lecture 6: September 19 Lecturer: Alessandro Rinaldo Scribe: YJ Choe Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Multiscale Frame-based Kernels for Image Registration

Multiscale Frame-based Kernels for Image Registration Multiscale Frame-based Kernels for Image Registration Ming Zhen, Tan National University of Singapore 22 July, 16 Ming Zhen, Tan (National University of Singapore) Multiscale Frame-based Kernels for Image

More information

BALANCING GAUSSIAN VECTORS. 1. Introduction

BALANCING GAUSSIAN VECTORS. 1. Introduction BALANCING GAUSSIAN VECTORS KEVIN P. COSTELLO Abstract. Let x 1,... x n be independent normally distributed vectors on R d. We determine the distribution function of the minimum norm of the 2 n vectors

More information

Some functional (Hölderian) limit theorems and their applications (II)

Some functional (Hölderian) limit theorems and their applications (II) Some functional (Hölderian) limit theorems and their applications (II) Alfredas Račkauskas Vilnius University Outils Statistiques et Probabilistes pour la Finance Université de Rouen June 1 5, Rouen (Rouen

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

arxiv: v3 [math.na] 18 Sep 2016

arxiv: v3 [math.na] 18 Sep 2016 Infinite-dimensional integration and the multivariate decomposition method F. Y. Kuo, D. Nuyens, L. Plaskota, I. H. Sloan, and G. W. Wasilkowski 9 September 2016 arxiv:1501.05445v3 [math.na] 18 Sep 2016

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

Metric Spaces and Topology

Metric Spaces and Topology Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies

More information

A NEW PROOF OF THE ATOMIC DECOMPOSITION OF HARDY SPACES

A NEW PROOF OF THE ATOMIC DECOMPOSITION OF HARDY SPACES A NEW PROOF OF THE ATOMIC DECOMPOSITION OF HARDY SPACES S. DEKEL, G. KERKYACHARIAN, G. KYRIAZIS, AND P. PETRUSHEV Abstract. A new proof is given of the atomic decomposition of Hardy spaces H p, 0 < p 1,

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t))

at time t, in dimension d. The index i varies in a countable set I. We call configuration the family, denoted generically by Φ: U (x i (t) x j (t)) Notations In this chapter we investigate infinite systems of interacting particles subject to Newtonian dynamics Each particle is characterized by its position an velocity x i t, v i t R d R d at time

More information

Optimality of Poisson Processes Intensity Learning with Gaussian Processes

Optimality of Poisson Processes Intensity Learning with Gaussian Processes Journal of Machine Learning Research 16 (2015) 2909-2919 Submitted 9/14; Revised 3/15; Published 12/15 Optimality of Poisson Processes Intensity Learning with Gaussian Processes Alisa Kirichenko Harry

More information

Gaussian Approximations for Probability Measures on R d

Gaussian Approximations for Probability Measures on R d SIAM/ASA J. UNCERTAINTY QUANTIFICATION Vol. 5, pp. 1136 1165 c 2017 SIAM and ASA. Published by SIAM and ASA under the terms of the Creative Commons 4.0 license Gaussian Approximations for Probability Measures

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

Real Variables # 10 : Hilbert Spaces II

Real Variables # 10 : Hilbert Spaces II randon ehring Real Variables # 0 : Hilbert Spaces II Exercise 20 For any sequence {f n } in H with f n = for all n, there exists f H and a subsequence {f nk } such that for all g H, one has lim (f n k,

More information

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2)

X n D X lim n F n (x) = F (x) for all x C F. lim n F n(u) = F (u) for all u C F. (2) 14:17 11/16/2 TOPIC. Convergence in distribution and related notions. This section studies the notion of the so-called convergence in distribution of real random variables. This is the kind of convergence

More information

ECO Class 6 Nonparametric Econometrics

ECO Class 6 Nonparametric Econometrics ECO 523 - Class 6 Nonparametric Econometrics Carolina Caetano Contents 1 Nonparametric instrumental variable regression 1 2 Nonparametric Estimation of Average Treatment Effects 3 2.1 Asymptotic results................................

More information

4 Uniform convergence

4 Uniform convergence 4 Uniform convergence In the last few sections we have seen several functions which have been defined via series or integrals. We now want to develop tools that will allow us to show that these functions

More information

Krzysztof Burdzy University of Washington. = X(Y (t)), t 0}

Krzysztof Burdzy University of Washington. = X(Y (t)), t 0} VARIATION OF ITERATED BROWNIAN MOTION Krzysztof Burdzy University of Washington 1. Introduction and main results. Suppose that X 1, X 2 and Y are independent standard Brownian motions starting from 0 and

More information

BAYESIAN DETECTION OF IMAGE BOUNDARIES. Duke University and North Carolina State University

BAYESIAN DETECTION OF IMAGE BOUNDARIES. Duke University and North Carolina State University Submitted to the Annals of Statistics arxiv: arxiv:1508.05847 BAYESIAN DETECTION OF IMAGE BOUNDARIES By Meng Li and Subhashis Ghosal Duke University and North Carolina State University Detecting boundary

More information

7 Influence Functions

7 Influence Functions 7 Influence Functions The influence function is used to approximate the standard error of a plug-in estimator. The formal definition is as follows. 7.1 Definition. The Gâteaux derivative of T at F in the

More information

Tools from Lebesgue integration

Tools from Lebesgue integration Tools from Lebesgue integration E.P. van den Ban Fall 2005 Introduction In these notes we describe some of the basic tools from the theory of Lebesgue integration. Definitions and results will be given

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1

Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1 Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be

More information

Bayesian Consistency for Markov Models

Bayesian Consistency for Markov Models Bayesian Consistency for Markov Models Isadora Antoniano-Villalobos Bocconi University, Milan, Italy. Stephen G. Walker University of Texas at Austin, USA. Abstract We consider sufficient conditions for

More information

Decouplings and applications

Decouplings and applications April 27, 2018 Let Ξ be a collection of frequency points ξ on some curved, compact manifold S of diameter 1 in R n (e.g. the unit sphere S n 1 ) Let B R = B(c, R) be a ball with radius R 1. Let also a

More information

Real Analysis Notes. Thomas Goller

Real Analysis Notes. Thomas Goller Real Analysis Notes Thomas Goller September 4, 2011 Contents 1 Abstract Measure Spaces 2 1.1 Basic Definitions........................... 2 1.2 Measurable Functions........................ 2 1.3 Integration..............................

More information

Probability and Measure

Probability and Measure Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability

More information

Convergence of latent mixing measures in nonparametric and mixture models 1

Convergence of latent mixing measures in nonparametric and mixture models 1 Convergence of latent mixing measures in nonparametric and mixture models 1 XuanLong Nguyen xuanlong@umich.edu Department of Statistics University of Michigan Technical Report 527 (Revised, Jan 25, 2012)

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Hilbert Spaces Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Vector Space. Vector space, ν, over the field of complex numbers,

More information

Asymptotic efficiency of simple decisions for the compound decision problem

Asymptotic efficiency of simple decisions for the compound decision problem Asymptotic efficiency of simple decisions for the compound decision problem Eitan Greenshtein and Ya acov Ritov Department of Statistical Sciences Duke University Durham, NC 27708-0251, USA e-mail: eitan.greenshtein@gmail.com

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information