Bayesian Nonparametric Point Estimation Under a Conjugate Prior

University of Pennsylvania ScholarlyCommons, Statistics Papers, Wharton Faculty Research, 5-15-2002.

Recommended citation: Li, X., & Zhao, L. H. (2002). Bayesian Nonparametric Point Estimation Under a Conjugate Prior. Statistics & Probability Letters, 58(1), 23-30. http://dx.doi.org/10.1016/s0167-7152(02)00097-4

Bayesian nonparametric point estimation under a conjugate prior

Xuefeng Li and Linda H. Zhao
Statistics Department, University of Pennsylvania

Abstract: Estimation of a nonparametric regression function at a point is considered. The function is assumed to lie in a Sobolev space, $S_q$, of order $q$. The asymptotic squared-error performance of Bayes estimators corresponding to Gaussian priors is investigated as the sample size, $n$, increases. It is shown that for any such fixed prior on $S_q$ the Bayes procedures do not attain the optimal minimax rate over balls in $S_q$. This result complements that in Zhao (2000) for estimating the entire regression function, but the proof is rather different.

Key words: nonparametric regression, point estimation, Bayesian procedure, Gaussian prior, optimal rate.

Research supported in part by NSF grant DMS-9971848. Linda Zhao, 3000 SH-DH, 3620 Locust Walk, Philadelphia, PA 19104-6302.

1 Introduction

Within the past two decades nonparametric regression has become an important, widely used statistical methodology. More recently there has been increasing interest in the possibility of effectively using a Bayesian approach in such settings. This paper takes one step in that direction. We investigate an aspect of the performance of the Bayes estimator for a natural conjugate prior. (These priors correspond to infinite-dimensional Gaussian distributions.) Of interest is the asymptotic performance of the estimator of the regression function, $f$, at a given point, $x_0$. As is customary, we assume that the regression function lies in a standard function space, in this case a Sobolev space of specified smoothness. Consistent with this, we take the prior to be supported on this Sobolev space. We show that for any such prior the Bayes estimators for sample size $n$ do not attain the optimal minimax rate of squared error risk.

Zhao (2000) demonstrates an analogous deficiency of Bayes procedures for the problem of estimating the entire regression function. For that problem she also constructs a nonconjugate prior distribution whose Bayes estimators do attain the optimal minimax rate. It follows, however, from Cai, Low and Zhao (2001) that these estimators will not attain the optimal minimax rate for estimating $f(x_0)$. The question is thus an open one of whether there exists a prior on the Sobolev space whose Bayes procedures attain this rate for estimating $f(x_0)$. We suspect such priors exist but have not (yet) succeeded in constructing them. For further background on problems of this nature and additional references we refer the reader to Zhao (2000).

We close the introduction by noting two additional features of the results in the present manuscript. First, we prove here an additional result showing that there do exist Gaussian priors supported outside the given Sobolev space whose estimators of $f(x_0)$ do attain the optimal minimax rate. This result is analogous to one in Zhao (2000) for estimating all of $f$; however, the appropriate priors are different in the two problems. (This is consistent with the fact that the minimax rates are also different.) Second, although the main theorem here has

analogies to the main result in Zhao (2000), the proof is rather different.

2 Preliminaries

In a standard nonparametric regression problem one observes $(x_i, z_i)$, $i = 1, \ldots, n$, where
$$z_i = f(x_i) + \epsilon_i, \qquad i = 1, \ldots, n. \tag{1}$$
Here we take $x_i = i/(n+1)$ to be equally spaced on $[0,1]$ and we take $\epsilon_i$ i.i.d. $N(0, \sigma^2)$. For simplicity assume $\sigma^2 = 1$. Our goal is to estimate $f(x_0)$, the value of $f$ at a given point $x_0 \in [0,1]$. The loss function is squared error:
$$L(\hat f(x_0), f(x_0)) = (\hat f(x_0) - f(x_0))^2. \tag{2}$$

Donoho, Liu and MacGibbon (1990), Brown and Low (1996) and Brown and Zhao (2001) show the following equivalence results. Suppose $f$ is expressed in an orthonormal basis $\{\varphi_i(x)\}$ on $L^2 = \{f : \int_0^1 f^2(x)\,dx < \infty\}$, i.e.
$$f(x) = \sum \theta_i \varphi_i(x).$$
Then we can construct $\{y_i\}$ as a (randomized) function of $\{z_i\}$ such that
$$y_i = \theta_i + \frac{1}{\sqrt n}\,\epsilon_i, \quad i = 1, \ldots, \qquad \epsilon_i \text{ i.i.d. } N(0,1). \tag{3}$$
Estimating a functional such as $f(x_0)$ is asymptotically equivalent to estimating the matching functional of $\theta = \{\theta_i\}$. (Brown and Zhao (2001) give an equivalence construction that is also valid if the $x_i$ are themselves observations of i.i.d. random variables on $[0,1]$.)
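The equivalence construction itself is not needed below, but the flavor of the sequence model (3) is easy to see numerically. The following minimal simulation sketch (ours, not from the paper; the test function and all numerical values are arbitrary choices) computes empirical coefficients of the raw data against the Fourier basis defined next; they behave like $\theta_i$ plus $N(0, 1/n)$ noise:

```python
# Sketch (ours): the sequence model (3) in action. Empirical Fourier
# coefficients of the regression data are approximately theta_i + N(0, 1/n).
# The test function f and all numerical values are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = np.arange(1, n + 1) / (n + 1)                  # equally spaced design on (0, 1)
f = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t)
z = f(x) + rng.standard_normal(n)                  # model (1) with sigma^2 = 1

def phi(i, t):
    """Fourier basis: phi_0 = 1, phi_{2k-1} = sqrt(2) cos(2 pi k t), phi_{2k} = sqrt(2) sin(2 pi k t)."""
    if i == 0:
        return np.ones_like(t)
    k = (i + 1) // 2
    trig = np.cos if i % 2 == 1 else np.sin
    return np.sqrt(2.0) * trig(2 * np.pi * k * t)

y = np.array([np.mean(z * phi(i, x)) for i in range(6)])
print(np.round(y, 3))   # roughly (0, 0, 0.707, 0.354, 0, 0), each +/- O(1/sqrt(n))
```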

To be explicit we take $\varphi_i$ to be the usual Fourier basis on $[0,1]$. Thus, for $x \in [0,1]$,
$$\varphi_0(x) = 1, \qquad \varphi_{2k-1}(x) = 2^{1/2}\cos(2\pi kx), \qquad \varphi_{2k}(x) = 2^{1/2}\sin(2\pi kx), \quad k = 1, 2, \ldots$$
Then $\theta_i = \int_0^1 f(x)\varphi_i(x)\,dx$, $i = 0, \ldots$. If $\hat\theta = \{\hat\theta_i\}$ is an estimator of $\theta = \{\theta_i\}$ then
$$\hat f(x_0) = \sum a_i \hat\theta_i, \qquad a_i = \varphi_i(x_0),$$
is the matching estimator of $f(x_0)$. This can conveniently be rewritten as $\hat f(x_0) = a'\hat\theta$ where $a = \{a_i\}$, $\hat\theta = \{\hat\theta_i\}$. If $T$ denotes an estimator of $f(x_0)$ then the risk function is of course $R(T, f(x_0)) = E(T - f(x_0))^2$.

When $f$ is assumed to be in a Sobolev ball
$$S_q(B) = \Big\{\{\theta_i\} : \sum i^{2q}\theta_i^2 \le B\Big\}$$
with $q > 1/2$, the optimal minimax rate in the present problem is known to be $n^{-(2q-1)/2q}$, i.e.,
$$0 < \inf_T \sup_{\theta \in S_q(B)} n^{\frac{2q-1}{2q}} R(T, f(x_0)) < \infty. \tag{4}$$
This rate was established by Wahba (1975); see also Donoho and Low (1992). We assume throughout that $q > 1/2$.

3 Main results

How does a Bayesian estimator perform for this nonparametric point estimation problem? We are especially interested in whether the Bayes solution resulting from a Gaussian prior possesses the optimal property defined in (4). Zhao (2000) dealt with Bayesian estimation of the entire function. We establish similar results for estimating $f(x_0)$ under squared error loss. The general results have points of similarity, but there are also some differences, and the methods of proof are rather different.

A product Gaussian prior
$$\pi(\theta) = \bigotimes N(0, \tau_i^2) \tag{5}$$
has support on $S_q = \{\{\theta_i\} : \sum i^{2q}\theta_i^2 < \infty\}$ if and only if $\sum i^{2q}\tau_i^2 < \infty$. It is straightforward to compute the posterior mean of the prior to be
$$\hat\theta_i = \frac{\tau_i^2}{\tau_i^2 + \frac 1n}\, y_i. \tag{6}$$
For details of both assertions see Zhao (2000). The Bayes estimator of $f(x_0)$ is then easily calculated to be
$$\hat T = \sum a_i \hat\theta_i = a'\hat\theta. \tag{7}$$
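For illustration, the shrinkage rule (6) and the point estimate (7) can be coded directly in the sequence model. The sketch below is ours, not from the paper: the helper name `bayes_point_estimate` is hypothetical, it uses the polynomial prior variances $\tau_i^2 = i^{-2p}$ that appear in Theorem 3.1 below, truncates to finitely many coordinates, and omits the $i = 0$ coordinate for brevity.

```python
# Sketch (ours, hypothetical helper): conjugate posterior mean (6) and the
# Bayes point estimate (7), with tau_i^2 = i^{-2p} and a finite truncation.
import numpy as np

def bayes_point_estimate(y, n, x0, p):
    """Shrink each coordinate by tau_i^2 / (tau_i^2 + 1/n), then sum a_i theta_hat_i."""
    i = np.arange(1, len(y) + 1)             # coordinates i = 1, 2, ... (i = 0 omitted)
    tau2 = i ** (-2.0 * p)
    theta_hat = tau2 / (tau2 + 1.0 / n) * y  # posterior mean (6)
    k = (i + 1) // 2
    a = np.where(i % 2 == 1,
                 np.sqrt(2.0) * np.cos(2 * np.pi * k * x0),
                 np.sqrt(2.0) * np.sin(2 * np.pi * k * x0))  # a_i = phi_i(x0)
    return float(np.sum(a * theta_hat))      # Bayes estimate (7)

# Toy usage: a fixed theta in S_q and one draw from the sequence model (3).
rng = np.random.default_rng(1)
n, q = 10_000, 1.0
i = np.arange(1, 501)
theta = i ** (-(q + 0.51))                   # sum i^{2q} theta_i^2 < infinity
y = theta + rng.standard_normal(i.size) / np.sqrt(n)
print(bayes_point_estimate(y, n, x0=0.3, p=q))
```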

We first show that there exist independent normal priors whose Bayes procedures attain the optimal minimax rate, but these priors are not supported on $S_q$.

Theorem 3.1 Let $\tau_i^2 = i^{-2p}$ in (5). When $p > \max(\frac q2, \frac 12)$ the Bayes estimator $\hat T$ in (7) has the minimax rate $n^{-m(p,q)}$, where $m(p,q) = \min(1 - \frac{1}{2p}, \frac{2q-1}{2p})$. To be more precise,
$$0 < \limsup_n \sup_{\theta \in S_q(B)} n^{m(p,q)} R(a'\hat\theta, f(x_0)) < \infty.$$
In particular, the Bayes estimator attains the optimal minimax rate if and only if $p = q$.

Proof: Take $B = 1$ with no loss of generality. The risk function can be written as
$$R(a'\hat\theta, a'\theta) = \mathrm{Var}(a'\hat\theta) + \mathrm{Bias}^2(a'\hat\theta, a'\theta). \tag{8}$$
Note that if $b_i > 0$, then
$$\sup_{\sum b_i\theta_i^2 \le 1} \Big(\sum w_i\theta_i\Big)^2 = \sum \frac{w_i^2}{b_i}. \tag{9}$$
Then, for the squared bias in (8),
$$\sup_{\sum i^{2q}\theta_i^2 \le 1} \mathrm{Bias}^2(a'\hat\theta, a'\theta) = \sup_{\sum i^{2q}\theta_i^2 \le 1}\Big[\sum a_i\Big(\frac{\tau_i^2}{\tau_i^2 + 1/n}\,\theta_i - \theta_i\Big)\Big]^2 \tag{10}$$
$$= \Big[\sup_{\sum i^{2q}\theta_i^2 \le 1} \sum \frac{a_i\theta_i}{1 + n\tau_i^2}\Big]^2 = \sum a_i^2 (1 + ni^{-2p})^{-2}\, i^{-2q} \asymp n^{-\frac{2q-1}{2p}}, \tag{11}$$
where the last assertion comes from (9) and from
$$a_{2k-1}^2 + a_{2k}^2 = 1, \qquad k \ge 1. \tag{12}$$
For the variance term,
$$\mathrm{Var}(a'\hat\theta) = \frac 1n \sum a_i^2\Big(\frac{i^{-2p}}{i^{-2p} + \frac 1n}\Big)^2 \asymp n^{-(1 - \frac{1}{2p})},$$
where we have again used (12). Hence the minimax rate for $a'\hat\theta$ is $n^{-\min(1 - \frac{1}{2p}, \frac{2q-1}{2p})}$. When $p = q$, $\min(1 - \frac{1}{2p}, \frac{2q-1}{2p})$ achieves its maximum value $\frac{2q-1}{2q}$, for which the corresponding rate is just the optimal rate. $\Box$

Remark: Among priors with $\tau_i^2 = i^{-2p}$ the Bayes estimator is rate-optimal only when $p = q$. But in that case both the prior and the posterior distribution have measure 0 on the space $S_q$ of interest.
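Theorem 3.1 can be checked numerically. The sketch below (ours, not from the paper) evaluates the exact worst-case risk of the Bayes rule over the ball with $B = 1$, using (9) for the supremum of the squared bias and summing the variance directly, then fits a log-log slope that should be close to $-m(p,q)$. The truncation level $N$, the point $x_0$, and the grid of $n$ values are arbitrary choices.

```python
# Sketch (ours): worst-case risk of the Bayes rule with tau_i^2 = i^{-2p}
# over the Sobolev ball with B = 1. By (9), the sup of the squared bias is
# sum_i a_i^2 (1 + n i^{-2p})^{-2} i^{-2q}; the variance is free of theta.
import numpy as np

def worst_case_risk(n, p, q, x0=0.3, N=200_000):
    i = np.arange(1, N + 1)
    k = (i + 1) // 2
    a2 = 2.0 * np.where(i % 2 == 1,
                        np.cos(2 * np.pi * k * x0) ** 2,
                        np.sin(2 * np.pi * k * x0) ** 2)     # a_i^2 = phi_i(x0)^2
    bias2 = np.sum(a2 * i ** (-2.0 * q) / (1.0 + n * i ** (-2.0 * p)) ** 2)
    shrink = n * i ** (-2.0 * p) / (1.0 + n * i ** (-2.0 * p))
    var = np.sum(a2 * shrink ** 2) / n
    return bias2 + var

p = q = 1.0
ns = np.array([1e3, 1e4, 1e5, 1e6])
risks = np.array([worst_case_risk(n, p, q) for n in ns])
slope = np.polyfit(np.log(ns), np.log(risks), 1)[0]
print(slope, -min(1 - 1 / (2 * p), (2 * q - 1) / (2 * p)))   # both near -0.5
```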

The next theorem builds on Theorem 3.1 and gives a more general result about Bayesian approaches.

Theorem 3.2 There does not exist a Gaussian prior on $S_q$ such that the corresponding sequence of Bayes procedures attains the optimal minimax rate. That is, if $\Sigma$ is the covariance matrix of a Gaussian measure on $S_q$, then the Bayes estimator $\hat T$ of $f(x_0) = a'\theta$ must have
$$\limsup_n \sup_{\theta \in S_q(B)} n^{\frac{2q-1}{2q}} R(\hat T, a'\theta) = \infty. \tag{13}$$

Before we prove the theorem let us derive some basic facts as lemmas.

Lemma 3.1 Let $D$, $W$ be positive definite $m \times m$ matrices and $b$ an $m$-dimensional vector. Then
$$\sup_{\xi' D\xi \le 1} (b'W\xi)^2 = b'WD^{-1}Wb.$$

Proof: This standard result is the matrix generalization of (9). We omit the proof.

Lemma 3.2 Let $\mathcal P_m = \{P : P \text{ is an } m \times m \text{ positive definite matrix with maximum eigenvalue} \le 1\}$ and let $u$ be a unit vector. Then
$$\inf_{P \in \mathcal P_m} \{u'P^2u + \mathrm{Tr}(P^{-1} - I)\} > 0.889. \tag{14}$$

Proof: Write $P = O\Lambda O'$, with $O$ an orthogonal matrix and $\Lambda = \mathrm{Diag}(\lambda_i)$. Here $\{\lambda_i\}$ are the eigenvalues of $P$. Let $v = O'u$. Then $v$ is also a unit vector, i.e. $\sum v_i^2 = 1$, and
$$u'P^2u + \mathrm{Tr}(P^{-1} - I) = v'\Lambda^2 v + \sum \frac{1}{\lambda_i} - m = \sum \lambda_i^2 v_i^2 + \sum \frac{1}{\lambda_i} - m.$$
Hence
$$\inf_{P \in \mathcal P_m} \{u'P^2u + \mathrm{Tr}(P^{-1} - I)\} \ge \inf_{0 < \lambda_i \le 1,\ \sum v_i^2 = 1} \Big\{\sum \lambda_i^2 v_i^2 + \sum \frac{1}{\lambda_i} - m\Big\}. \tag{15}$$
If $1/2 < v^2 \le 1$ the function $\lambda^2 v^2 + 1/\lambda$ attains its minimum over $0 < \lambda \le 1$ at $\lambda = 1/(2v^2)^{1/3}$. If $0 \le v^2 \le 1/2$ then this function attains its minimum on this region at $\lambda = 1$. At most one $v_j^2$ can satisfy $v_j^2 > 1/2$ since $\sum v_j^2 = 1$. Suppose there is one such $v_j$. Then
$$\inf_{0 < \lambda_i \le 1,\ \sum v_i^2 = 1} \Big\{\sum \lambda_i^2 v_i^2 + \sum \frac{1}{\lambda_i} - m\Big\} \ge \inf_{1/2 < v_j^2 \le 1} \big\{(2^{-2/3} + 2^{1/3})\, v_j^{2/3} - v_j^2\big\} = .88988.$$
On the other hand, if all $v_j^2 \le 1/2$, then
$$\inf_{0 < \lambda_i \le 1,\ \sum v_i^2 = 1} \Big\{\sum \lambda_i^2 v_i^2 + \sum \frac{1}{\lambda_i} - m\Big\} \ge \sum v_j^2 + m - m = 1. \qquad \Box$$
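A quick numerical sanity check of Lemma 3.2 (ours, not from the paper): after the diagonalization step in the proof, the objective can be minimized per coordinate at $\lambda_i = \min(1, (2v_i^2)^{-1/3})$, so the infimum can be evaluated directly over many directions $v$. Random directions give the value 1; concentrating all mass on one coordinate attains the extremal value $\approx .88988$.

```python
# Sketch (ours): numerical check of Lemma 3.2. After diagonalization the
# objective is sum_i lambda_i^2 v_i^2 + sum_i 1/lambda_i - m over
# 0 < lambda_i <= 1, minimized coordinatewise at lambda_i = min(1, (2 v_i^2)^(-1/3)).
import numpy as np

def objective_min(v):
    with np.errstate(divide="ignore"):       # v_i = 0 gives lambda_i = min(1, inf) = 1
        lam = np.minimum(1.0, (2.0 * v ** 2) ** (-1.0 / 3.0))
    return np.sum(lam ** 2 * v ** 2) + np.sum(1.0 / lam) - v.size

rng = np.random.default_rng(2)
m = 50
vals = []
for _ in range(2_000):
    v = rng.standard_normal(m)
    v /= np.linalg.norm(v)                   # random unit direction
    vals.append(objective_min(v))

e1 = np.zeros(m)
e1[0] = 1.0                                  # all mass on one coordinate: the worst case
print(min(vals), objective_min(e1))          # ~1.0 and ~0.88988, both > 0.889
```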

Proof of Theorem 3.2:

1. It suffices to consider priors having mean 0. Since $f(x_0) = a'\theta$, the Bayes estimator will be $\hat T = a'\hat\theta$ with
$$\hat\theta = \Sigma\Big(\frac In + \Sigma\Big)^{-1} Y.$$
Now,
$$R(\hat T, a'\theta) \ge \mathrm{Bias}^2(\hat T, a'\theta) = \Big[a'\Big(\Sigma\Big(\frac In + \Sigma\Big)^{-1} - I\Big)\theta\Big]^2 = (a'(I + n\Sigma)^{-1}\theta)^2 = (a'V\theta)^2 \tag{16}$$
with $V$ defined as
$$V = (I + n\Sigma)^{-1}. \tag{17}$$
Notice that all eigenvalues of $V$ are between 0 and 1 and
$$\Sigma = n^{-1}(V^{-1} - I). \tag{18}$$
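The matrix identities behind (16)-(18) are elementary but easy to mis-transcribe, so here is a short numerical verification (ours, not from the paper), with an arbitrary prior covariance:

```python
# Sketch (ours): verify Sigma (I/n + Sigma)^{-1} - I = -(I + n Sigma)^{-1} = -V
# and the identity (18), Sigma = n^{-1}(V^{-1} - I).
import numpy as np

rng = np.random.default_rng(3)
m, n = 6, 100
A = rng.standard_normal((m, m))
Sigma = A @ A.T + 0.1 * np.eye(m)            # an arbitrary positive definite Sigma
I = np.eye(m)
V = np.linalg.inv(I + n * Sigma)
print(np.allclose(Sigma @ np.linalg.inv(I / n + Sigma) - I, -V))   # True: (16)-(17)
print(np.allclose(Sigma, (np.linalg.inv(V) - I) / n))              # True: (18)
```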

Given an integer $m$, let $D$ denote the $(m \times m)$ diagonal matrix with diagonal entries $(m+i)^{2q}$, $i = 1, \ldots, m$. Let $W$ denote the $(m \times m)$ matrix composed of the $(m+1)$th, $\ldots$, $2m$th rows and columns of $V$, and let $b$ consist of the corresponding coordinates of $a$. Note that $D$, $W$, $b$ all depend on $m$, but the dependence is suppressed in the notation. From (16) and (17),
$$\sup_{\sum i^{2q}\theta_i^2 \le 1} \mathrm{Bias}^2(\hat T, a'\theta) \ge \sup_{\sum_{i=m+1}^{2m} i^{2q}\theta_i^2 \le 1:\ \theta_i = 0 \text{ if } i \notin [m+1, 2m]} \mathrm{Bias}^2(\hat T, a'\theta) = \sup_{\xi' D\xi \le 1} (b'W\xi)^2, \tag{19}$$
where $\xi$ corresponds to the vector in the previous expression having coordinates $(\theta_{m+1}, \ldots, \theta_{2m})$. Hence, by Lemma 3.1,
$$\sup_{\theta \in S_q(1)} R(\hat T, a'\theta) \ge b'WD^{-1}Wb. \tag{20}$$

2. Recall that $a_{2k-1}^2 + a_{2k}^2 = 1$, $k = 1, \ldots$. Hence in (20) $\|b\|^2 \approx m/2$. More precisely, for all $m > 100$,
$$\|b\|^2 \ge .49m. \tag{21}$$
Suppose that the assertion of the theorem is false. Then by (20) there is a $c < \infty$ such that
$$n^{\frac{2q-1}{2q}}\, b'WD^{-1}Wb \le c. \tag{22}$$
Note that $D^{-1} - (2m)^{-2q} I$ is positive semi-definite. Hence (21) and (22) imply
$$n^{\frac{2q-1}{2q}}\,(.49m)(2m)^{-2q}\, u'W^2u \le c, \tag{23}$$
where $u = b/\|b\|$ is a unit vector. Now take
$$m = \Big[n^{\frac{1}{2q}}\Big(\frac{.4 \times .49}{2^{2q}\,c}\Big)^{\frac{1}{2q-1}}\Big]. \tag{24}$$
Then from (23),
$$u'W^2u \le .4. \tag{25}$$

3. Now, for $m$ as in (24),
$$\sum_{i=m+1}^{2m} i^{2q}\,\Sigma_{ii} = \frac 1n \sum_{i=m+1}^{2m} i^{2q}\big((V^{-1})_{ii} - 1\big) \ge \frac{m^{2q}}{n} \sum_{i=m+1}^{2m} \big((V^{-1})_{ii} - 1\big) \ge c_1\, \mathrm{Tr}(W^{-1} - I_m), \tag{26}$$
since $(V^{-1})_{ii} \ge (W^{-1})_{ii}$ for these $i$ (the inverse of a principal submatrix is dominated by the corresponding block of the full inverse), and where $c_1 > 0$ is independent of $m$ because, by (24), $m^{2q}/n$ is bounded away from 0. Apply Lemma 3.2 and (25) to get $\mathrm{Tr}(W^{-1} - I_m) \ge .48$. Hence
$$\sum_{i=m+1}^{2m} i^{2q}\,\Sigma_{ii} \ge .48\, c_1. \tag{27}$$
As $n \to \infty$ there is an infinite sequence of arbitrarily large corresponding values of $m$ given by (24). Thus (27) yields
$$\sum_{i=1}^{\infty} i^{2q}\,\Sigma_{ii} = \infty.$$
This establishes that the Gaussian prior is not supported on $S_q$. $\Box$

Remark: Note that the preceding proof establishes a slightly stronger fact than claimed in (13). Namely, for any Gaussian prior on $S_q$ the squared bias does not converge at the optimal rate.

Remark: The preceding results explicitly concern one special formulation of $S_q$. However, the same general results can be shown to apply to the more general problem of estimating $f$ at a point with parameter space $\{f = \sum \theta_i \varphi_i(x) : \sum c_i \theta_i^2 < \infty\}$, where $c_i \asymp i^{2q}$. Only minor modifications of the preceding proof are needed.

References

[1] Brown, L.D. and Low, M. (1996). Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist. 24, 2384-2398.

[2] Brown, L.D. and Zhao, L. (2001). Direct asymptotic equivalence of nonparametric regression and the infinite dimensional location problem. Technical report. (Available from www-stat.wharton.upenn.edu/~lzhao.)

[3] Cai, T.T., Low, M.G. and Zhao, L. (2001). Tradeoffs between global and local risks in nonparametric function estimation. Technical report. (Available from www-stat.wharton.upenn.edu/~lzhao.)

[4] Donoho, D.L., Liu, R. and MacGibbon, B. (1990). Minimax risk over hyperrectangles and implications. Ann. Statist. 18, 1416-1437.

[5] Donoho, D.L. and Low, M.G. (1992). Renormalization exponents and optimal pointwise rates of convergence. Ann. Statist. 20, 944-970.

[6] Wahba, G. (1975). Smoothing noisy data by spline functions. Numer. Math. 24, 383-393.

[7] Zhao, L. (2000). Bayesian aspects of some nonparametric problems. Ann. Statist. 28, 532-552.