Nonparametric Estimation: Part I


Nonparametric Estimation: Part I
Regression in Function Space
Yanjun Han, Department of Electrical Engineering, Stanford University (yjhan@stanford.edu)
November 13, 2015 (Pray for Paris)

Outline
1. Nonparametric regression problem
2. Regression in function spaces: regression in Hölder balls; regression in Sobolev balls
3. Adaptive estimation: adapt to unknown smoothness; adapt to unknown differential inequalities
2 / 65

The statistical model
Given a family of distributions $\{P_f(\cdot)\}_{f \in \mathcal{F}}$ and an observation $Y \sim P_f$, we aim to:
1. estimate $f$;
2. estimate a functional $F(f)$ of $f$;
3. do hypothesis testing: given a partition $\mathcal{F} = \cup_{j=1}^N \mathcal{F}_j$, decide which $\mathcal{F}_j$ the true function $f$ belongs to.
3 / 65

The risk
The risk of using $\hat f$ to estimate $\theta(f)$ is
$$R(\hat f, f) = \Psi^{-1}\Big( E_f\, \Psi\big( d(\hat f(Y), \theta(f)) \big) \Big),$$
where $\Psi$ is a nondecreasing function on $[0,\infty)$ with $\Psi(0) = 0$, and $d(\cdot,\cdot)$ is some metric.
The minimax approach:
$$R(\hat f, \mathcal{F}) = \sup_{f \in \mathcal{F}} R(\hat f, f), \qquad R^*(\mathcal{F}) = \inf_{\hat f} R(\hat f, \mathcal{F}).$$
4 / 65

Nonparametric regression
Problem: recover a function $f: [0,1]^d \to \mathbb{R}$ in a given set $\mathcal{F} \subset L_2([0,1]^d)$ from noisy observations
$$y_i = f(x_i) + \sigma \xi_i, \qquad i = 1, 2, \dots, n.$$
Some options:
- Grid $\{x_i\}_{i=1}^n$: deterministic design (equidistant grid, general case), random design
- Noise $\{\xi_i\}_{i=1}^n$: iid $N(0,1)$, general iid case, with dependence
- Noise level $\sigma$: known, unknown
- Function space: Hölder ball, Sobolev space, Besov space
- Risk function: pointwise risk (confidence interval), integrated risk ($L_q$ risk, $1 \le q \le \infty$, with normalization):
$$R_q(\hat f, f) = \Big( E_f \int_{[0,1]^d} |\hat f(x) - f(x)|^q \, dx \Big)^{1/q}, \ 1 \le q < \infty; \qquad R_\infty(\hat f, f) = E_f \Big( \operatorname{ess\,sup}_{x \in [0,1]^d} |\hat f(x) - f(x)| \Big), \ q = \infty.$$
5 / 65

Equivalence between models
Under mild smoothness conditions, Brown et al. proved the asymptotic equivalence of the following models:
- Regression model: for iid $N(0,1)$ noise $\{\xi_i\}_{i=1}^n$, $y_i = f(i/n) + \sigma \xi_i$, $i = 1, 2, \dots, n$
- Gaussian white noise model: $dY_t = f(t)\,dt + \frac{\sigma}{\sqrt{n}}\, dB_t$, $t \in [0,1]$
- Poisson process: generate $N \sim \mathrm{Poi}(n)$ iid samples from a common density $g$ ($g = f^2$, $\sigma = 1/2$)
- Density estimation model: generate $n$ iid samples from a common density $g$ ($g = f^2$, $\sigma = 1/2$)
6 / 65

Bias-variance decomposition
Deterministic error (bias) and stochastic error of an estimator $\hat f(\cdot)$:
$$b(x) = E_f \hat f(x) - f(x), \qquad s(x) = \hat f(x) - E_f \hat f(x).$$
Analysis of the $L_q$ risk. For $1 \le q < \infty$:
$$R_q(\hat f, f) = \big( E_f \|\hat f - f\|_q^q \big)^{1/q} = \big( E_f \|b + s\|_q^q \big)^{1/q} \le \|b\|_q + \big( E \|s\|_q^q \big)^{1/q}.$$
For $q = \infty$:
$$R_\infty(\hat f, f) \le \|b\|_\infty + E \|s\|_\infty.$$
7 / 65

1. Nonparametric regression problem
2. Regression in function spaces: regression in Hölder balls; regression in Sobolev balls
3. Adaptive estimation: adapt to unknown smoothness; adapt to unknown differential inequalities
8 / 65

The first example
Consider the following regression model:
$$y_i = f(i/n) + \sigma \xi_i, \qquad i = 1, 2, \dots, n,$$
where $\{\xi_i\}_{i=1}^n \overset{iid}{\sim} N(0,1)$, and $f \in H_1^s(L)$ for some known $s \in (0,1]$ and $L > 0$, where the Hölder ball is defined as
$$H_1^s(L) = \{ f \in C[0,1] : |f(x) - f(y)| \le L|x - y|^s, \ \forall x, y \in [0,1] \}.$$
9 / 65

A window estimate
To estimate the value of $f$ at $x$, consider the window $B_x = [x - h/2, x + h/2]$; a natural estimator takes the form
$$\hat f(x) = \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} y_i,$$
where $n(B_x)$ denotes the number of points $x_i$ in $B_x$.
Deterministic error:
$$|b(x)| = |E_f \hat f(x) - f(x)| = \Big| \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} f(x_i) - f(x) \Big| \le \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} |f(x_i) - f(x)| \le \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} L |x_i - x|^s \le L h^s.$$
Stochastic error:
$$s(x) = \hat f(x) - E_f \hat f(x) = \frac{\sigma}{n(B_x)} \sum_{i: x_i \in B_x} \xi_i.$$
10 / 65
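A minimal numerical sketch of this window estimate in Python (not from the slides; the test function, noise level, and evaluation point are illustrative choices):

```python
import numpy as np

def window_estimator(x_grid, y, x, h):
    """Local average of the observations y_i over the window B_x = [x - h/2, x + h/2]."""
    mask = np.abs(x_grid - x) <= h / 2
    return y[mask].mean()

# Synthetic data y_i = f(i/n) + sigma * xi_i on the equidistant grid (illustrative choices).
rng = np.random.default_rng(0)
n, sigma, s = 1000, 0.5, 0.8
x_grid = np.arange(1, n + 1) / n
f = lambda t: np.abs(t - 0.5) ** s              # an s-Hoelder function
y = f(x_grid) + sigma * rng.standard_normal(n)

h = n ** (-1 / (2 * s + 1))                      # window size of the optimal order (next slide)
x0 = 0.3
print(window_estimator(x_grid, y, x0, h), f(x0))
```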

Optimal window size: $1 \le q < \infty$
Bounding the integrated risk: since $|b(x)| \le L h^s$ and $s(x) = \frac{\sigma}{n(B_x)} \sum_{i: x_i \in B_x} \xi_i$,
$$R_q(\hat f, f) \le \|b\|_q + \big( E\|s\|_q^q \big)^{1/q} \lesssim L h^s + \frac{\sigma}{\sqrt{nh}}.$$
Optimal window size and risk ($1 \le q < \infty$):
$$h \asymp n^{-\frac{1}{2s+1}}, \qquad R_q(\hat f, H_1^s(L)) \lesssim n^{-\frac{s}{2s+1}}.$$
11 / 65
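For completeness, the balancing step behind the displayed choice of $h$ (a routine computation not spelled out on the slide, treating $\sigma$ and $L$ as constants):
$$L h^s = \frac{\sigma}{\sqrt{nh}} \iff h^{s + 1/2} = \frac{\sigma}{L \sqrt{n}} \iff h = \Big( \frac{\sigma^2}{L^2 n} \Big)^{\frac{1}{2s+1}} \asymp n^{-\frac{1}{2s+1}}, \qquad L h^s = L^{\frac{1}{2s+1}} \Big( \frac{\sigma^2}{n} \Big)^{\frac{s}{2s+1}} \asymp n^{-\frac{s}{2s+1}}.$$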

Optimal window size: $q = \infty$
Again $|b(x)| \le L h^s$ and $s(x) = \frac{\sigma}{n(B_x)} \sum_{i: x_i \in B_x} \xi_i$.
Fact: for $N(0,1)$ random variables $\{\xi_i\}_{i=1}^M$ (possibly correlated), there exists a constant $C$ such that for any $w \ge 1$,
$$P\Big( \max_{1 \le i \le M} |\xi_i| \ge C w \sqrt{\ln M} \Big) \le \exp\Big( -\frac{w^2 \ln M}{2} \Big).$$
Proof: apply the union bound and $P(|N(0,1)| > x) \le 2 \exp(-x^2/2)$.
Corollary: $E\|s\|_\infty \lesssim \sigma \sqrt{\frac{\ln n}{nh}}$ (with $M = O(n^2)$).
Optimal window size and risk ($q = \infty$):
$$h \asymp \Big( \frac{\ln n}{n} \Big)^{\frac{1}{2s+1}}, \qquad R_\infty(\hat f, H_1^s(L)) \lesssim \Big( \frac{\ln n}{n} \Big)^{\frac{s}{2s+1}}.$$
12 / 65

Bias-variance tradeoff Figure 1: Bias-variance tradeoff 13 / 65

General Hölder ball
Consider the following regression problem:
$$y_\iota = f(x_\iota) + \sigma \xi_\iota, \qquad f \in H_d^s(L), \qquad \iota = (i_1, i_2, \dots, i_d) \in \{1, 2, \dots, m\}^d,$$
where $x_{(i_1, i_2, \dots, i_d)} = (i_1/m, i_2/m, \dots, i_d/m)$, $n = m^d$, and $\{\xi_\iota\}$ are iid $N(0,1)$ noises.
The general Hölder ball $H_d^s(L)$ is defined as
$$H_d^s(L) = \Big\{ f: [0,1]^d \to \mathbb{R} : \ \|D^k(f)(x) - D^k(f)(x')\| \le L \|x - x'\|^\alpha, \ \forall x, x' \Big\},$$
where $s = k + \alpha$, $k \in \mathbb{N}$, $\alpha \in (0,1]$, and $D^k(f)(\cdot)$ is the vector function comprised of all partial derivatives of $f$ of full order $k$.
- $s > 0$: modulus of smoothness
- $d$: dimensionality
14 / 65

Local polynomial approximation
As before, consider the cube $B_x$ centered at $x$ with edge size $h$, where $h \to 0$ and $nh^d \to \infty$. The simple average no longer works!
Local polynomial approximation: for $x_\iota \in B_x$, design weights $w_\iota(x)$ such that if the true $f \in P_d^k$, i.e., a polynomial of full degree $k$ in $d$ variables, we have
$$f(x) = \sum_{\iota: x_\iota \in B_x} w_\iota(x) f(x_\iota).$$
Lemma. There exist weights $\{w_\iota(x)\}_{\iota: x_\iota \in B_x}$ which depend continuously on $x$ and satisfy
$$\|w(x)\|_1 \le C_1, \qquad \|w(x)\|_2 \le \frac{C_2}{\sqrt{n(B_x)}} = \frac{C_2}{\sqrt{nh^d}},$$
where $C_1, C_2$ are two universal constants depending only on $k$ and $d$.
15 / 65

The estimator
Based on the weights $\{w_\iota(x)\}$, construct a linear estimator
$$\hat f(x) = \sum_{\iota: x_\iota \in B_x} w_\iota(x) y_\iota.$$
Deterministic error:
$$|b(x)| = \Big| \sum_{\iota: x_\iota \in B_x} w_\iota(x) f(x_\iota) - f(x) \Big| = \inf_{p \in P_d^k} \Big| \sum_{\iota: x_\iota \in B_x} w_\iota(x) \big( f(x_\iota) - p(x_\iota) \big) - \big( f(x) - p(x) \big) \Big| \le \big( 1 + \|w(x)\|_1 \big) \inf_{p \in P_d^k} \|f - p\|_{\infty, B_x}.$$
Stochastic error:
$$|s(x)| = \sigma \Big| \sum_{\iota: x_\iota \in B_x} w_\iota(x) \xi_\iota \Big| \lesssim \frac{\sigma}{\sqrt{nh^d}} \cdot \frac{\big| \sum_{\iota: x_\iota \in B_x} w_\iota(x) \xi_\iota \big|}{\|w(x)\|_2}.$$
16 / 65
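One standard way to realize such weights is a local least-squares polynomial fit; the sketch below ($d = 1$ for simplicity, function names are ours) produces weights that reproduce polynomials of degree at most $k$ inside the window, in the spirit of the lemma, though it is not necessarily the exact construction behind it:

```python
import numpy as np

def local_poly_weights(x_grid, x, h, k):
    """Effective weights w_i(x) of a degree-k local least-squares fit
    over the window B_x = [x - h/2, x + h/2] (d = 1)."""
    idx = np.where(np.abs(x_grid - x) <= h / 2)[0]
    # Design matrix of centered monomials (x_i - x)^j, j = 0, ..., k
    X = np.vander(x_grid[idx] - x, N=k + 1, increasing=True)
    # The fitted value at x is the constant coefficient of the fit,
    # i.e. the first row of the pseudo-inverse applied to the window data.
    w = np.linalg.pinv(X)[0]
    return idx, w

def local_poly_estimate(x_grid, y, x, h, k):
    idx, w = local_poly_weights(x_grid, x, h, k)
    return w @ y[idx]
```

If the true $f$ restricted to the window is a polynomial of degree at most $k$ (and the window contains at least $k+1$ points), the least-squares fit is exact and $\sum_\iota w_\iota(x) f(x_\iota) = f(x)$, which is the defining property required above.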

Rate of convergence
Bounding the deterministic error: by Taylor expansion,
$$\inf_{p \in P_d^k} \|f - p\|_{\infty, B_x} \le \max_{z \in B_x} \frac{\|D^k f(\eta_z) - D^k f(x)\| \cdot \|z - x\|^k}{k!} \lesssim L h^s.$$
Hence $\|b\|_q \lesssim L h^s$ for $1 \le q \le \infty$.
Bounding the stochastic error: as before,
$$\big( E\|s\|_q^q \big)^{1/q} \lesssim \frac{\sigma}{\sqrt{nh^d}} \ (1 \le q < \infty), \qquad E\|s\|_\infty \lesssim \sigma \sqrt{\frac{\ln n}{nh^d}}.$$
Theorem. The optimal window size $h$ and the corresponding risk are given by
$$h \asymp \begin{cases} n^{-\frac{1}{2s+d}}, & 1 \le q < \infty, \\ \big( \frac{\ln n}{n} \big)^{\frac{1}{2s+d}}, & q = \infty, \end{cases} \qquad R_q(\hat f, H_d^s(L)) \lesssim \begin{cases} n^{-\frac{s}{2s+d}}, & 1 \le q < \infty, \\ \big( \frac{\ln n}{n} \big)^{\frac{s}{2s+d}}, & q = \infty, \end{cases}$$
and the resulting estimator is minimax rate-optimal (see later).
17 / 65

Why approximation?
The general linear estimator takes the form
$$\hat f_{\mathrm{lin}}(x) = \sum_{\iota \in \{1, 2, \dots, m\}^d} w_\iota(x) y_\iota.$$
Observations:
- The estimator $\hat f_{\mathrm{lin}}$ is unbiased for $f$ if and only if $f(x) = \sum_\iota w_\iota(x) f(x_\iota)$ for all $x \in [0,1]^d$
- Plugging in $x = x_\iota$ yields that $z = \{f(x_\iota)\}$ is a solution to $\big( w_\iota(x_\kappa) - \delta_{\iota\kappa} \big)_{\iota, \kappa}\, z = 0$
- Denote by $\{z^{(k)}\}_{k=1}^M$ all linearly independent solutions to the previous equation; then
$$f_k(x) = \sum_\iota w_\iota(x) z_\iota^{(k)}, \qquad k = 1, \dots, M,$$
constitutes an approximation basis for $H_d^s(L)$.
18 / 65

Why polynomial?
Definition (Kolmogorov n-width). For a linear normed space $(X, \|\cdot\|)$ and a subset $K \subset X$, the Kolmogorov $n$-width of $K$ is defined as
$$d_n(K) \equiv d_n(K, X) = \inf_{V_n} \sup_{x \in K} \inf_{y \in V_n} \|x - y\|,$$
where $V_n \subset X$ ranges over subspaces of dimension $n$.
Theorem (Kolmogorov n-width of $H_d^s(L)$).
$$d_n(H_d^s(L), C[0,1]^d) \asymp n^{-s/d}.$$
The piecewise polynomial basis on cubes of edge size $h$ achieves the optimal rate (so other bases do not help):
$$d_{\Theta(1) h^{-d}}(H_d^s(L), C[0,1]^d) \asymp (h^{-d})^{-s/d} = h^s.$$
19 / 65

Methods in nonparametric regression
Function space (this lecture):
- Kernel estimates: $\hat f(x) = \sum_\iota K_h(x, x_\iota) y_\iota$ (Nadaraya, Watson, ...)
- Local polynomial kernel estimates: $\hat f(x) = \sum_{k=1}^M \phi_k(x) c_k(x)$, where $(c_1(x), \dots, c_M(x)) = \arg\min_c \sum_\iota \big( y_\iota - \sum_{k=1}^M c_k(x) \phi_k(x_\iota) \big)^2 K_h(x - x_\iota)$ (Stone, ...)
- Penalized spline estimates: $\hat f = \arg\min_g \sum_\iota |y_\iota - g(x_\iota)|^2 + \lambda \|g^{(s)}\|_2^2$ (Speckman, Ibragimov, Khas'minskii, ...); a discrete sketch of this penalized estimate follows below
- Nonlinear estimates (Lepski, Nemirovski, ...)
Transformed space (next lecture):
- Fourier transform: projection estimates (Pinsker, Efromovich, ...)
- Wavelet transform: shrinkage estimates (Donoho, Johnstone, ...)
20 / 65
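As a concrete (assumed, not taken from the slides) discrete analogue of the penalized spline estimate above, one can replace $\|g''\|_2^2$ by a squared second-difference penalty on the grid values; the smoothing parameter lam below is an arbitrary illustrative choice:

```python
import numpy as np

def penalized_fit(y, lam):
    """Discrete roughness-penalized least squares:
    g_hat = argmin_g ||y - g||_2^2 + lam * ||D2 g||_2^2,
    where D2 is the second-difference matrix (a discrete stand-in for g'')."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)          # shape (n-2, n), rows [1, -2, 1]
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

# Usage: smooth a noisy sample path on the equidistant grid.
rng = np.random.default_rng(1)
n = 200
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
g_hat = penalized_fit(y, lam=50.0)
```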

1. Nonparametric regression problem
2. Regression in function spaces: regression in Hölder balls; regression in Sobolev balls
3. Adaptive estimation: adapt to unknown smoothness; adapt to unknown differential inequalities
21 / 65

Regression in Sobolev space
Consider the following regression problem:
$$y_\iota = f(x_\iota) + \sigma \xi_\iota, \qquad f \in S_d^{k,p}(L), \qquad \iota = (i_1, i_2, \dots, i_d) \in \{1, 2, \dots, m\}^d,$$
where $x_{(i_1, i_2, \dots, i_d)} = (i_1/m, i_2/m, \dots, i_d/m)$, $n = m^d$, and $\{\xi_\iota\}$ are iid $N(0,1)$ noises.
The Sobolev ball $S_d^{k,p}(L)$ is defined as
$$S_d^{k,p}(L) = \Big\{ f: [0,1]^d \to \mathbb{R} : \ \|D^k(f)\|_p \le L \Big\},$$
where $D^k(f)(\cdot)$ is the vector function comprised of all partial derivatives (in the sense of distributions) of $f$ of order $k$.
Parameters:
- $d$: dimensionality
- $k$: order of differentiation
- $p$: $p > d$ to ensure the continuous embedding $S_d^{k,p}(L) \subset C[0,1]^d$
- $q$: norm of the risk
22 / 65

Minimax lower bound
Theorem (minimax lower bound). The minimax risk in the Sobolev ball regression problem over all estimators is
$$R_q^*(S_d^{k,p}(L), n) \asymp \begin{cases} n^{-\frac{k}{2k+d}}, & q < (1 + \frac{2k}{d}) p, \\ \big( \frac{\ln n}{n} \big)^{\frac{k - d/p + d/q}{2(k - d/p) + d}}, & q \ge (1 + \frac{2k}{d}) p. \end{cases}$$
Theorem (linear minimax lower bound). The minimax risk in the Sobolev ball regression problem over all linear estimators is
$$R_q^{\mathrm{lin}}(S_d^{k,p}(L), n) \asymp \begin{cases} n^{-\frac{k}{2k+d}}, & q \le p, \\ n^{-\frac{k - d/p + d/q}{2(k - d/p + d/q) + d}}, & p < q < \infty, \\ \big( \frac{\ln n}{n} \big)^{\frac{k - d/p + d/q}{2(k - d/p + d/q) + d}}, & q = \infty. \end{cases}$$
23 / 65

Start from linear estimates
Consider the linear estimator given by local polynomial approximation with window size $h$ as before.
Stochastic error:
$$\big( E\|s\|_q^q \big)^{1/q} \lesssim \frac{\sigma}{\sqrt{nh^d}} \ (1 \le q < \infty), \qquad E\|s\|_\infty \lesssim \sigma \sqrt{\frac{\ln n}{nh^d}}.$$
Deterministic error: corresponds to the polynomial approximation error.
Fact: for $f \in S_d^{k,p}(L)$, there exists a constant $C > 0$ such that
$$\|D^{k-1} f(x) - D^{k-1} f(y)\| \le C \|x - y\|^{1 - d/p} \Big( \int_B \|D^k f(z)\|^p \, dz \Big)^{1/p}, \qquad x, y \in B.$$
The Taylor polynomial then yields
$$|b(x)| \lesssim h^{k - d/p} \Big( \int_{B_x} \|D^k f(z)\|^p \, dz \Big)^{1/p}.$$
24 / 65

Linear estimator: integrated bias
Integrated bias:
$$\|b\|_q^q \lesssim h^{(k - d/p) q} \int_{[0,1]^d} \Big( \int_{B_x} \|D^k f(z)\|^p \, dz \Big)^{q/p} dx.$$
Note that
$$\int_{[0,1]^d} \int_{B_x} \|D^k f(z)\|^p \, dz \, dx = h^d \int_{[0,1]^d} \|D^k f(x)\|^p \, dx \le h^d L^p.$$
Fact:
$$a_1^r + a_2^r + \cdots + a_n^r \le \begin{cases} (a_1 + a_2 + \cdots + a_n)^r / n^{r-1}, & 0 < r \le 1, \\ (a_1 + a_2 + \cdots + a_n)^r, & r > 1. \end{cases}$$
- Case $q/p \le 1$: $\|b\|_q^q \lesssim h^{(k - d/p) q} L^q h^{d \frac{q}{p}} = L^q h^{kq}$ (regular case)
- Case $q/p > 1$: $\|b\|_q^q \lesssim h^{(k - d/p) q} L^q h^d = L^q h^{(k - d/p + d/q) q}$ (sparse case)
25 / 65

Linear estimator: optimal risk
In summary, we have
$$\|b\|_q \lesssim \begin{cases} L h^k, & q \le p, \\ L h^{k - d/p + d/q}, & p < q, \end{cases} \qquad \|s\|_q \lesssim \begin{cases} \frac{\sigma}{\sqrt{nh^d}}, & q < \infty, \\ \sigma \sqrt{\frac{\ln n}{nh^d}}, & q = \infty. \end{cases}$$
Theorem (optimal linear risk).
$$R_q(\hat f_{\mathrm{lin}}, S_d^{k,p}(L)) \asymp \begin{cases} n^{-\frac{k}{2k+d}}, & q \le p, \\ n^{-\frac{k - d/p + d/q}{2(k - d/p + d/q) + d}}, & p < q < \infty, \\ \big( \frac{\ln n}{n} \big)^{\frac{k - d/p}{2(k - d/p) + d}}, & q = \infty. \end{cases}$$
Alternative proof via Theorem (Sobolev embedding): for $d \le p < q$, we have $S_d^{k,p}(L) \subset S_d^{k - d/p + d/q,\, q}(L')$.
26 / 65

Minimax lower bound: tool
Theorem (Fano's inequality). Suppose $H_1, \dots, H_N$ are probability distributions (hypotheses) on a sample space $(\Omega, \mathcal{F})$, and there exists a decision rule $D: \Omega \to \{1, 2, \dots, N\}$ such that $H_i(\omega : D(\omega) = i) \ge 1 - \delta_i$, $1 \le i \le N$. Then
$$\max_{1 \le i, j \le N} D_{KL}(H_i \| H_j) \ge \Big( 1 - \frac{1}{N} \sum_{i=1}^N \delta_i \Big) \ln(N - 1) - \ln 2.$$
Apply to nonparametric regression:
- Suppose the convergence rate $r_n$ is attainable; construct $N$ functions (hypotheses) $f_1, \dots, f_N \in \mathcal{F}$ such that $\|f_i - f_j\|_q > 4 r_n$ for any $i \ne j$
- Decision rule: after obtaining $\hat f$, choose $j$ such that $\|\hat f - f_j\|_q \le 2 r_n$
- As a result, $\delta_i \le 1/2$, and Fano's inequality gives
$$\frac{1}{2\sigma^2} \max_{1 \le i, j \le N} \sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \ge \frac{1}{2} \ln(N - 1) - \ln 2.$$
27 / 65
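The left-hand side of the last display comes from the KL divergence between two Gaussian regression experiments; spelling out this implicit step, with $P_f$ denoting the joint law of $\{y_\iota\}$ under $f$:
$$D_{KL}(P_{f_i} \| P_{f_j}) = \sum_\iota D_{KL}\big( N(f_i(x_\iota), \sigma^2) \,\big\|\, N(f_j(x_\iota), \sigma^2) \big) = \frac{1}{2\sigma^2} \sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2.$$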

Minimax lower bound: sparse case
Suppose $q \ge (1 + \frac{2k}{d}) p$.
- Fix a smooth function $g$ supported on $[0,1]^d$
- Divide $[0,1]^d$ into $h^{-d}$ disjoint cubes of size $h$, and construct $N = h^{-d}$ hypotheses: $f_j$ is supported on the $j$-th cube and equals $h^s g(x/h)$ on that cube (with translation)
- To ensure $f \in S_d^{k,p}(L)$, set $s = k - d/p$
- For $i \ne j$, $r_n \asymp \|f_i - f_j\|_q \asymp h^{k - d/p + d/q}$, and
$$\sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \asymp h^{2(k - d/p)} \cdot n h^d = n h^{2(k - d/p) + d}$$
- Fano's inequality gives
$$n h^{2(k - d/p) + d} \gtrsim \ln N \asymp \ln h^{-1} \ \Longrightarrow \ r_n \gtrsim \Big( \frac{\ln n}{n} \Big)^{\frac{k - d/p + d/q}{2(k - d/p) + d}}.$$
28 / 65

Minimax lower bound: regular case
Suppose $q < (1 + \frac{2k}{d}) p$.
- Fix a smooth function $g$ supported on $[0,1]^d$, and set $g_h(x) = h^s g(x/h)$
- Divide $[0,1]^d$ into $h^{-d}$ disjoint cubes of size $h$, and construct $N = 2^{h^{-d}}$ hypotheses: $f_j = \sum_{i=1}^{h^{-d}} \epsilon_i g_{h,i}$, where $g_{h,i}$ is the translation of $g_h$ to the $i$-th cube
- Can choose $M = 2^{\Theta(h^{-d})}$ hypotheses such that for $i \ne j$, $f_i$ differs from $f_j$ on at least $\Theta(h^{-d})$ cubes
- To ensure $f \in S_d^{k,p}(L)$, set $s = k$
- For $i \ne j$, $r_n \asymp \|f_i - f_j\|_q \asymp h^k$, and $\sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \asymp h^{2k} n = n h^{2k}$
- Fano's inequality gives
$$n h^{2k} \gtrsim \ln M \asymp h^{-d} \ \Longrightarrow \ r_n \gtrsim \Big( \frac{1}{n} \Big)^{\frac{k}{2k+d}}.$$
29 / 65
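For completeness, the choice of $h$ implicit in this and the previous slide comes from equating the two sides of Fano's bound (a routine computation):
$$\text{regular: } n h^{2k} \asymp h^{-d} \Rightarrow h \asymp n^{-\frac{1}{2k+d}}, \ r_n \asymp h^k \asymp n^{-\frac{k}{2k+d}}; \qquad \text{sparse: } n h^{2(k - d/p) + d} \asymp \ln n \Rightarrow h \asymp \Big( \frac{\ln n}{n} \Big)^{\frac{1}{2(k - d/p) + d}}, \ r_n \asymp h^{k - d/p + d/q} \asymp \Big( \frac{\ln n}{n} \Big)^{\frac{k - d/p + d/q}{2(k - d/p) + d}}.$$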

Construct minimax estimator
Why does the linear estimator fail in the sparse case? Too much variance in the flat region!
Suppose we know that the true function $f$ is supported on a cube of size $h$; then
$$\|b\|_q \lesssim h^{k - d/p + d/q}, \qquad \big( E\|s\|_q^q \big)^{1/q} \lesssim \frac{1}{\sqrt{nh^d}} h^{d/q},$$
so $h \asymp n^{-\frac{1}{2(k - d/p) + d}}$, and Nemirovski's construction yields the optimal risk $\Theta\big( n^{-\frac{k - d/p + d/q}{2(k - d/p) + d}} \big)$.
In both cases, construct $\hat f$ as the solution to the following optimization problem:
$$\hat f = \arg\min_{g \in S_d^{k,p}(L)} \|g - y\|_{\mathcal{B}} = \arg\min_{g \in S_d^{k,p}(L)} \max_{B \in \mathcal{B}} \frac{1}{\sqrt{n(B)}} \Big| \sum_{\iota: x_\iota \in B} \big( g(x_\iota) - y_\iota \big) \Big|,$$
where $\mathcal{B}$ is the set of all cubes in $[0,1]^d$ with nodes belonging to $\{x_\iota\}$.
30 / 65

Estimator analysis
Question 1: given a linear estimator $\hat f$ at $x$ of window size $h$, which size $h'$ of $B \in \mathcal{B}$ achieves the maximum of $\big| \sum_{\iota: x_\iota \in B} (\hat f(x_\iota) - y_\iota) \big| / \sqrt{n(B)}$?
- If $h' \le h$, the value $\max\{ \sqrt{n(B)}\, b_h(x), |N(0,1)| \}$ increases with $h'$
- If $h' \ge h$, the value is close to zero
- Hence, $h' \approx h$ achieves the maximum
Question 2: which window size $h$ at $x$ achieves $\min_h \|\hat f_h - y\|_{\mathcal{B}}$?
- Answer: the $h$ that achieves the bias-variance balance locally at $x$
- The $1/\sqrt{n(B)}$ term helps to achieve spatial homogeneity
Theorem (optimality of the estimator).
$$R_q(\hat f, S_d^{k,p}(L)) \lesssim \Big( \frac{\ln n}{n} \Big)^{\min\{ \frac{k}{2k+d},\, \frac{k - d/p + d/q}{2(k - d/p) + d} \}}.$$
Hence $\hat f$ is minimax rate-optimal (with a logarithmic gap in the regular case).
31 / 65

Proof of the optimality
First observation: $\|\hat f - f\|_{\mathcal{B}} \le \|\hat f - y\|_{\mathcal{B}} + \|f - y\|_{\mathcal{B}} \le 2 \|\xi\|_{\mathcal{B}} \lesssim \sqrt{\ln n}$, and $e \equiv \hat f - f \in S_d^{k,p}(2L)$.
Definition (regular cube). A cube $B \subset [0,1]^d$ is called a regular cube if
$$e(B) \ge C [h(B)]^{k - d/p} \, \Omega(e, B),$$
where $C > 0$ is a suitably chosen constant, $e(B) = \max_{x \in B} |e(x)|$, $h(B)$ is the edge size of $B$, and
$$\Omega(e, B) = \Big( \int_B \|D^k e(x)\|^p \, dx \Big)^{1/p}.$$
One can show that $[0,1]^d$ can be (roughly) partitioned into maximal regular cubes with $\ge$ replaced by $=$ (i.e., balanced bias and variance).
32 / 65

Property of regular cubes
Lemma. If the cube $B$ is regular and
$$\sup_{B' \in \mathcal{B},\, B' \subset B} \frac{1}{\sqrt{n(B')}} \Big| \sum_{\iota: x_\iota \in B'} e(x_\iota) \Big| \lesssim \sqrt{\ln n}, \qquad \text{then} \qquad e(B) \lesssim \sqrt{\frac{\ln n}{n(B)}}.$$
Proof:
- Since $B$ is regular, there exists a polynomial $e_k(x)$ on $B$, in $d$ variables and of degree no more than $k$, such that $e(B) \ge 4 \|e - e_k\|_{\infty, B}$
- On one hand, $\|e_k\|_{\infty, B} \le 5 e(B)/4$, and Markov's inequality for polynomials implies that $\|D e_k\|_{\infty, B} \lesssim e(B)/h(B)$
- On the other hand, $|e_k(x_0)| \ge 3 e(B)/4$ for some $x_0 \in B$; the derivative bound implies that there exists $B' \subset B$ with $h(B') \gtrsim h(B)$ such that $|e_k(x)| \ge e(B)/2$ on $B'$ (hence $|e(x)| \ge e(B)/4$ on $B'$)
- Choosing this $B'$ in the assumption completes the proof
33 / 65

Upper bound for the $L_q$ risk
Since $[0,1]^d$ can be (roughly) partitioned into regular cubes $\{B_i\}_{i \ge 1}$ such that $e(B_i) \asymp [h(B_i)]^{k - d/p} \, \Omega(e, B_i)$, we have
$$\|e\|_q^q = \sum_{i \ge 1} \int_{B_i} |e(x)|^q \, dx \lesssim \sum_{i \ge 1} [h(B_i)]^d \, e^q(B_i).$$
The previous lemma asserts that $e(B_i) \lesssim \sqrt{\frac{\ln n}{n [h(B_i)]^d}}$; cancelling out $h(B_i)$ and $e(B_i)$, we get
$$\|e\|_q^q \lesssim \Big( \frac{\ln n}{n} \Big)^{\frac{q(k - d/p + d/q)}{2(k - d/p) + d}} \sum_{i \ge 1} \Omega(e, B_i)^{\frac{d(q-2)}{2(k - d/p) + d}} \lesssim \Big( \frac{\ln n}{n} \Big)^{\frac{q(k - d/p + d/q)}{2(k - d/p) + d}} \Big( \sum_{i \ge 1} \Omega(e, B_i)^p \Big)^{\frac{d(q-2)}{p(2(k - d/p) + d)}}$$
if $q \ge q^* = (1 + \frac{2k}{d}) p$. For $1 \le q < q^*$ we use $\|e\|_q \le \|e\|_{q^*}$. Q.E.D.
34 / 65

1. Nonparametric regression problem
2. Regression in function spaces: regression in Hölder balls; regression in Sobolev balls
3. Adaptive estimation: adapt to unknown smoothness; adapt to unknown differential inequalities
35 / 65

Adaptive estimation
Drawbacks of the previous estimator:
- Computationally intractable
- Requires perfect knowledge of the parameters
Adaptive estimation: given a family of function classes, find an estimator which (nearly) attains the optimal risk simultaneously without knowing which class the true function belongs to.
In the sequel we will consider the family $\{ S_d^{k,p}(L) \}_{k \le S,\, p > d,\, L > 0}$.
36 / 65

Data-driven window size
Consider again a linear estimate with window size $h$ locally at $x$; recall that
$$|\hat f(x) - f(x)| \lesssim \inf_{p \in P_d^k} \|f - p\|_{\infty, B_h(x)} + \frac{\sigma |N(0,1)|}{\sqrt{nh^d}}.$$
- The optimal window size $h(x)$ should balance these two terms
- The stochastic term can be upper bounded (with overwhelming probability) by $s_n(h) = w \sigma \sqrt{\frac{\ln n}{nh^d}}$, which depends only on known parameters and $h$, where the constant $w > 0$ is large enough
- But the bias term depends on the unknown local property of $f$ on $B_h(x)$!
37 / 65

Bias-variance tradeoff: revisit Figure 2: Bias-variance tradeoff 38 / 65

Lepski's trick
Lepski's adaptive scheme: construct a family of local polynomial approximation estimators $\{\hat f_h\}$ with all window sizes $h \in I$. Then use $\hat f_{\hat h(x)}(x)$ as the estimate of $f(x)$, where
$$\hat h(x) \equiv \sup\{ h \in I : |\hat f_h(x) - \hat f_{h'}(x)| \le 4 s_n(h'), \ \forall h' \in (0, h), \ h' \in I \}.$$
Denote by $h^*(x)$ the optimal window size, i.e., $b_n(h^*) = s_n(h^*)$.
Existence of $\hat h(x)$: clearly $h^*(x)$ satisfies the condition, since for $h' \in (0, h^*)$,
$$|\hat f_{h^*}(x) - \hat f_{h'}(x)| \le |\hat f_{h^*}(x) - f(x)| + |f(x) - \hat f_{h'}(x)| \le b_n(h^*) + s_n(h^*) + b_n(h') + s_n(h') \le 4 s_n(h').$$
Performance of $\hat h(x)$:
$$|\hat f_{\hat h(x)}(x) - f(x)| \le |\hat f_{h^*(x)}(x) - f(x)| + |\hat f_{\hat h(x)}(x) - \hat f_{h^*(x)}(x)| \le b_n(h^*) + s_n(h^*) + 4 s_n(h^*) = 6 s_n(h^*).$$
39 / 65
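A minimal sketch of Lepski's rule at a single point, assuming the simple window estimator from earlier, a known noise level sigma, and an arbitrary geometric bandwidth grid (all illustrative choices, with the constant 4 taken from the slide):

```python
import numpy as np

def window_estimate(x_grid, y, x, h):
    mask = np.abs(x_grid - x) <= h / 2
    return y[mask].mean()

def lepski_bandwidth(x_grid, y, x, sigma, w=1.0):
    """Lepski's rule at x: the largest bandwidth h whose estimate stays within
    4 * s_n(h') of the estimates at all smaller bandwidths h'."""
    n = len(y)
    bandwidths = np.geomspace(4.0 / n, 0.5, 30)          # candidates, smallest first
    s_n = lambda h: w * sigma * np.sqrt(np.log(n) / (n * h))
    estimates = [window_estimate(x_grid, y, x, h) for h in bandwidths]
    h_hat, f_hat = bandwidths[0], estimates[0]
    for j in range(1, len(bandwidths)):
        # check |f_hat_h - f_hat_h'| <= 4 s_n(h') for all smaller h'
        if all(abs(estimates[j] - estimates[i]) <= 4 * s_n(bandwidths[i]) for i in range(j)):
            h_hat, f_hat = bandwidths[j], estimates[j]
        else:
            break
    return h_hat, f_hat
```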

Lepski's estimator is adaptive!
Properties of Lepski's scheme:
- Adaptive to parameters: agnostic to $p$, $k$, $L$
- Spatially adaptive: still works when there is spatial inhomogeneity, or when only a portion of $f$ is to be estimated
Theorem (adaptive optimality). Suppose that the modulus of smoothness $k$ is upper bounded by a known hyper-parameter $S$. Then
$$R_q(\hat f, S_d^{k,p}(L)) \lesssim \Big( \frac{\ln n}{n} \Big)^{\min\{ \frac{k}{2k+d},\, \frac{k - d/p + d/q}{2(k - d/p) + d} \}},$$
and $\hat f$ is adaptively optimal in the sense that
$$\inf_{\hat f} \sup_{k \le S,\, p > d,\, L > 0,\, B \subset [0,1]^d} \frac{R_{q,B}(\hat f, S_d^{k,p}(L))}{\inf_{\hat f'} R_{q,B}(\hat f', S_d^{k,p}(L))} \asymp (\ln n)^{\frac{S}{2S+d}}.$$
40 / 65

Covering of the ideal window
By the property of the data-driven window size $\hat h(x)$, it suffices to consider the ideal window size $h^*(x)$ satisfying
$$s_n(h^*(x)) \asymp [h^*(x)]^{k - d/p} \, \Omega(f, B_{h^*(x)}(x)).$$
Lemma. There exist $\{x_i\}_{i=1}^M$ and a partition $\{V_i\}_{i=1}^M$ of $[0,1]^d$ such that
1. the cubes $\{B_{h^*(x_i)}(x_i)\}_{i=1}^M$ are pairwise disjoint;
2. for every $x \in V_i$, we have $B_{h^*(x)}(x) \supset B_{h^*(x_i)}(x_i)$ and $h^*(x) \ge \frac{1}{2} \max\{ h^*(x_i), \|x - x_i\| \}$.
41 / 65

Analysis of the estimator
Using the covering of the local windows, we have
$$\|\hat f - f\|_q^q \lesssim \int_{[0,1]^d} s_n(h^*(x))^q \, dx = \sum_{i=1}^M \int_{V_i} s_n(h^*(x))^q \, dx.$$
Recall that $s_n(h) \asymp \sqrt{\frac{\ln n}{nh^d}}$ and $h^*(x) \ge \max\{ h^*(x_i), \|x - x_i\| \}/2$ for $x \in V_i$, so
$$\|\hat f - f\|_q^q \lesssim \Big( \frac{\ln n}{n} \Big)^{q/2} \sum_{i=1}^M \int_{V_i} [\max\{ h^*(x_i), \|x - x_i\| \}]^{-dq/2} \, dx \lesssim \Big( \frac{\ln n}{n} \Big)^{q/2} \sum_{i=1}^M \int_0^\infty r^{d-1} [\max\{ h^*(x_i), r \}]^{-dq/2} \, dr \lesssim \Big( \frac{\ln n}{n} \Big)^{q/2} \sum_{i=1}^M [h^*(x_i)]^{d - dq/2}.$$
Plugging in $s_n(h^*(x)) \asymp [h^*(x)]^{k - d/p} \, \Omega(f, B_{h^*(x)}(x))$ and using the disjointness of $\{B_{h^*(x_i)}(x_i)\}_{i=1}^M$ completes the proof. Q.E.D.
42 / 65

Experiment: Blocks Figure 3: Blocks: original, noisy and reconstructed signal 43 / 65

Experiment: Bumps Figure 4: Bumps: original, noisy and reconstructed signal 44 / 65

Experiment: Heavysine Figure 5: Heavysine: original, noisy and reconstructed signal 45 / 65

Experiment: Doppler Figure 6: Doppler: original, noisy and reconstructed signal 46 / 65

Generalization: anisotropic case
What if the modulus of smoothness differs in different directions? Now we should replace the cube with a rectangle $(h_1, \dots, h_d)$!
Deterministic and stochastic terms:
$$b_h(x) \le C_1 \sum_{i=1}^d h_i^{s_i} \equiv b(h), \qquad s_h(x) \le C_2 \sqrt{\frac{\ln n}{n V(h)}} \equiv s(h),$$
where $V(h) = \prod_{i=1}^d h_i$. Suppose $b(h^*) = s(h^*)$.
Naive guess:
$$\hat h(x) = \arg\max_h \{ V(h) : |\hat f_h(x) - \hat f_{h'}(x)| \le 4 s(h'), \ \forall h', \ V(h') \le V(h) \}$$
- doesn't work: $h^*$ may not satisfy this condition
- still doesn't work if we replace $V(h') \le V(h)$ by $h' \le h$: $\hat f_{\hat h}$ may not be close to $\hat f_{h^*}$
47 / 65

Adaptive regime
Adaptive regime (Kerkyacharian-Lepski-Picard '01):
$$\hat h(x) = \arg\max_h \{ V(h) : |\hat f_{h' \vee h}(x) - \hat f_{h'}(x)| \le 4 s(h'), \ \forall h', \ V(h') \le V(h) \},$$
where $h \vee h' = \max\{h, h'\}$ (coordinate-wise).
$h^*$ satisfies the condition: for $V(h') \le V(h^*)$,
$$|\hat f_{h' \vee h^*}(x) - \hat f_{h'}(x)| \le b(h' \vee h^*) + b(h') + s(h' \vee h^*) + s(h') \lesssim \sum_{i: h'_i < h^*_i} \big( (h'_i)^{s_i} + (h^*_i)^{s_i} \big) + 2 s(h') \le 4 s(h').$$
$\hat f_{\hat h}$ is close to $\hat f_{h^*}$: since $V(h^*) \le V(\hat h)$,
$$|\hat f_{\hat h}(x) - f(x)| \le |\hat f_{\hat h}(x) - \hat f_{\hat h \vee h^*}(x)| + |\hat f_{\hat h \vee h^*}(x) - \hat f_{h^*}(x)| + |\hat f_{h^*}(x) - f(x)| \lesssim \sum_{i: h^*_i > \hat h_i} \big( \hat h_i^{s_i} + (h^*_i)^{s_i} \big) + 4 s(h^*) + 2 s(h^*) \le 8 s(h^*).$$
48 / 65

More on Lepski's trick
General adaptive rule: consider a family of models $(\mathcal{X}^n, \mathcal{F}^{(s)})_{s \in I}$ and rate-optimal minimax estimators $\hat f_n^{(s)}$ for each $s$, with minimax risk $r_s(n)$. For simplicity suppose $r_s(n)$ decreases with $s$.
Definition (adaptive estimator). Choose an increasing sequence $\{s_i\}_{i=1}^{k(n)}$ from $I$ with $k(n) \to \infty$, select
$$\hat k = \sup\{ 1 \le k \le k(n) : d(\hat f_{s_k}, \hat f_{s_i}) \le C r_{s_i}(n), \ \forall 1 \le i < k \}$$
for some prescribed constant $C$, and choose $\hat f = \hat f_{s_{\hat k}}$.
Theorem (Lepski '90). If $r_s(n) \asymp n^{-r(s)}$, then under mild conditions,
$$R_n(\hat f, \mathcal{F}^{(s)}) \lesssim n^{-r(s)} (\ln n)^{r(s)/2}, \qquad s \in I.$$
As a result, the sub-optimality gap is at most $(\ln n)^{\sup_{s \in I} r(s)/2}$.
49 / 65

More on adaptive schemes
Instead of exploiting the monotonicity of bias and variance, we can also estimate the risk directly. Given $\{\hat f_h\}_{h \in I}$, suppose $\hat r(h)$ is a good estimator of the risk of $\hat f_h$; then we can set
$$\hat h = \arg\min_h \hat r(h).$$
- In Gaussian models, $\hat r(h)$ can be Stein's Unbiased Risk Estimator (SURE): zero bias and small variance
- Lepski's method: define $\{\hat f_{h,h'}\}_{(h,h') \in I \times I}$ from $\{\hat f_h\}_{h \in I}$, and
$$\hat r(h) = \sup_{h': s(h') \le s(h)} \Big( \|\hat f_{h,h'} - \hat f_{h'}\| - Q(s(h')) \Big)_+ + 2 Q(s(h)),$$
where $Q$ is some function related to the variance; $\hat f_{h,h'}$ is usually the convolution of estimators, and $\hat f_{h,h'} = \hat f_{h'}$ or $\hat f_{h \vee h'}$ in the previous examples.
50 / 65
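For concreteness, a sketch of the SURE route for a linear smoother $\hat f_h = S_h y$ under iid $N(0, \sigma^2)$ noise, using the standard identity $\hat r(h) = \|y - S_h y\|_2^2/n + 2\sigma^2 \operatorname{tr}(S_h)/n - \sigma^2$; the moving-average smoother below is an illustrative choice, not the estimator from the slides:

```python
import numpy as np

def window_smoother_matrix(n, h):
    """Row-normalized 0/1 smoother matrix of the moving-average (window) estimator on the grid i/n."""
    x = np.arange(1, n + 1) / n
    S = (np.abs(x[:, None] - x[None, :]) <= h / 2).astype(float)
    return S / S.sum(axis=1, keepdims=True)

def sure(y, S, sigma):
    """Unbiased estimate of E||S y - f||^2 / n for the linear smoother f_hat = S y:
    ||y - S y||^2 / n + 2 sigma^2 tr(S) / n - sigma^2."""
    n = len(y)
    resid = y - S @ y
    return resid @ resid / n + 2 * sigma**2 * np.trace(S) / n - sigma**2

def select_bandwidth(y, sigma, bandwidths):
    n = len(y)
    risks = [sure(y, window_smoother_matrix(n, h), sigma) for h in bandwidths]
    return bandwidths[int(np.argmin(risks))]
```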

1. Nonparametric regression problem
2. Regression in function spaces: regression in Hölder balls; regression in Sobolev balls
3. Adaptive estimation: adapt to unknown smoothness; adapt to unknown differential inequalities
51 / 65

Further generalization
Another point of view on the Sobolev ball $S_1^{k,p}(L)$: there exists a polynomial $r(x) = x^k$ of degree $k$ such that
$$\Big\| r\Big( \frac{d}{dx} \Big) f \Big\|_p \le L, \qquad \forall f \in S_1^{k,p}(L).$$
Generalization: what if $r(\cdot)$ is an unknown monic polynomial of degree no more than $k$?
For a modulated signal, e.g., $f(x) = g(x) \sin(\omega x + \phi)$ with $g \in S_1^{k,p}(L)$:
- $f$ does not belong to any Sobolev ball for large $\omega$
- However, we can write $f(x) = f_+(x) + f_-(x)$, where
$$f_\pm(x) = \pm \frac{g(x) \exp(\pm i(\omega x + \phi))}{2i},$$
and by induction we have
$$\Big\| \Big( \frac{d}{dx} \mp i\omega \Big)^k f_\pm \Big\|_p = \frac{\|g^{(k)}\|_p}{2} \le \frac{L}{2}.$$
52 / 65
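The induction step behind the last display, spelled out (the derivative term $\pm i\omega g^{(j)}$ produced by the exponential is cancelled by the shift $\mp i\omega$):
$$\Big( \frac{d}{dx} \mp i\omega \Big) \Big[ \frac{g^{(j)}(x)\, e^{\pm i(\omega x + \phi)}}{\pm 2i} \Big] = \frac{\big( g^{(j+1)}(x) \pm i\omega\, g^{(j)}(x) \mp i\omega\, g^{(j)}(x) \big) e^{\pm i(\omega x + \phi)}}{\pm 2i} = \frac{g^{(j+1)}(x)\, e^{\pm i(\omega x + \phi)}}{\pm 2i}.$$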

General regression model
Definition (signal satisfying differential inequalities). A function $f: [0,1] \to \mathbb{R}$ belongs to the class $W^{l,k,p}(L)$ if and only if $f = \sum_{i=1}^l f_i$, and there exist monic polynomials $r_1, \dots, r_l$ of degree $k$ such that
$$\sum_{i=1}^l \Big\| r_i\Big( \frac{d}{dx} \Big) f_i \Big\|_p \le L.$$
General regression problem: recover $f \in W^{l,k,p}(L)$ from noisy observations
$$y_i = f(i/n) + \sigma \xi_i, \qquad i = 1, 2, \dots, n, \qquad \xi_i \overset{iid}{\sim} N(0,1),$$
where we only know:
- the noise level $\sigma$
- an upper bound $S \ge kl$
53 / 65

Recovering approach
The risk function:
- Note that now we cannot recover the whole function $f$ (e.g., $f \equiv 0$ and $f(x) = \sin(2\pi n x)$ correspond to the same model)
- The discrete $q$-norm: $\|\hat f - f\|_q = \big( \frac{1}{n} \sum_{i=1}^n |\hat f(i/n) - f(i/n)|^q \big)^{1/q}$
Recovering approach:
- Discretization: transform the differential inequalities into inequalities for finite differences
- Sequence estimation: given a window size, recover the discrete sequence satisfying an unknown difference inequality
- Adaptive window size: apply Lepski's trick to choose the optimal window size adaptively
54 / 65

Discretization: from derivative to difference
Lemma. For any $f \in C^{k-1}[0,1]$ and any monic polynomial $r(z)$ of degree $k$, there corresponds another monic polynomial $\eta(z)$ of degree $k$ such that
$$\|\eta(\Delta) f^n\|_p \lesssim n^{-k + 1/p} \Big\| r\Big( \frac{d}{dx} \Big) f \Big\|_p,$$
where $f^n = \{f(i/n)\}_{i=1}^n$, and $(\Delta \phi)_t = \phi_{t-1}$ is the backward shift operator.
Applying this to our regression problem:
- The sequence $f^n$ can be written as $f^n = \sum_{i=1}^l f^{n,i}$
- There correspond monic polynomials $\eta_i$ of degree $k$ such that
$$\sum_{i=1}^l \|\eta_i(\Delta) f^{n,i}\|_p \lesssim L n^{-k + 1/p}.$$
55 / 65
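A quick numerical check of the lemma in the simplest case $r(z) = z^k$, where one may take $\eta(\Delta) = (1 - \Delta)^k$, the $k$-th backward difference (the test function and the normalization of the sequence norm are our own illustrative choices, so only the order in $n$ matters):

```python
import numpy as np

def backward_diff_norm(f_samples, k, p):
    """||(1 - Delta)^k f^n||_p: k-th order differences of the samples
    (forward differences equal backward ones up to an index shift)."""
    d = np.diff(f_samples, n=k)
    return np.sum(np.abs(d) ** p) ** (1 / p)

k, p = 3, 2
f = lambda t: np.sin(2 * np.pi * t) + t**4
fk = lambda t: -(2 * np.pi) ** 3 * np.cos(2 * np.pi * t) + 24 * t   # f'''(t), the right-hand side

for n in [200, 400, 800, 1600]:
    t = np.arange(1, n + 1) / n
    lhs = backward_diff_norm(f(t), k, p)
    rhs = n ** (-k + 1 / p) * np.mean(np.abs(fk(t)) ** p) ** (1 / p)  # discretized ||f^(k)||_p, up to constants
    print(n, lhs, rhs)   # both shrink at the same rate n^{-k + 1/p}
```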

Sequence estimation: window estimate
Given a sequence of observations $\{y_t\}$, consider using $\{y_t\}_{|t| \le T}$ to estimate $f^n[0]$, where $T$ is the window size.
The estimator: $\hat f^n[0] = \sum_{t=-T}^T w_t y_t$.
- On one hand, the filter $\{w_t\}_{|t| \le T}$ should have a small $L_2$ norm to suppress the noise
- On the other hand, if $\eta(\Delta) f^n \approx 0$ with a known $\eta$, the filter should be designed such that the error term only consists of iid noises
The approach (Goldenshluger-Nemirovski '97): the filter $\{w_t\}_{|t| \le T}$ is the solution to the following optimization problem:
$$\min_w \ \Big\| F\Big( \Big\{ \sum_{t=-T}^T w_t y_{t+s} - y_s \Big\}_{|s| \le T} \Big) \Big\|_\infty \qquad \text{s.t.} \quad \|F(w^T)\|_1 \le \frac{C}{\sqrt{T}},$$
where $F(\{\phi_t\}_{|t| \le T})$ denotes the discrete Fourier transform of $\{\phi_t\}_{|t| \le T}$.
56 / 65

Frequency domain Figure 7: The observations (upper panel) and the noises (lower panel) 57 / 65

Performance analysis
Theorem. If $f^n = \sum_{i=1}^l f^{n,i}$ and there exist monic polynomials $\eta_i$ of degree $k$ such that $\sum_{i=1}^l \|\eta_i(\Delta) f^{n,i}\|_{p,T} \le \epsilon$, then
$$|f^n[0] - \hat f^n[0]| \lesssim T^{k - 1/p} \epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}},$$
where $\Theta_T(\xi)$ is the supremum of $O(T^2)$ $N(0,1)$ random variables and is thus of order $\sqrt{\ln T}$.
This is a uniform result (the $\eta_i$ are unknown), and the $\sqrt{\ln T}$ gap is unavoidable to achieve this uniformity.
Plugging in $\epsilon \asymp n^{-k + 1/p} \|f\|_{p,B}$ from the discretization step yields
$$|f^n[m] - \hat f^n[m]| \lesssim \Big( \frac{T}{n} \Big)^{k - 1/p} \|f\|_{p,B} + \frac{\Theta_T(\xi)}{\sqrt{T}},$$
where $B$ is a segment with center $m/n$ and length $\asymp T/n$.
58 / 65

Polynomial multiplication
Begin with the Hölder ball case, where we know that $\|(1 - \Delta)^k f^n\|_p \le \epsilon$.
Lemma (writing convolution as polynomial multiplication).
1. There exists $w^T(z) = \sum_{|t| \le T} w_t z^t$ such that $1 - w^T(z)$ is divisible by $(1 - z)^k$, so that
$$(1 - w^T(\Delta)) p = \frac{1 - w^T(\Delta)}{(1 - \Delta)^k} (1 - \Delta)^k p = 0, \qquad \forall p \in P_1^{k-1}.$$
Moreover, $\|w^T\|_1 \lesssim 1$, $\|w^T\|_2 \lesssim 1/\sqrt{T}$, and $\|F(w^T)\|_1 \lesssim 1/\sqrt{T}$.
2. In the general case, there exists $\{\eta_t\}_{|t| \le T}$ with $\eta^T(z) = \sum_{|t| \le T} \eta_t z^t \equiv 1 - w^T(z)$ such that $\|F(w^T)\|_1 \lesssim 1/\sqrt{T}$, and for each $i = 1, 2, \dots, l$ we can factor $\eta^T(z) = \eta_i(z) \rho_i(z)$ with $\|\rho_i\| \lesssim T^{k-1}$.
If we knew $\eta^T(z)$, we could just use $\hat f^n_*[0] = [w^T(\Delta) y]_0$ to estimate $f^n[0]$.
59 / 65

Optimization solution
Performance of the ideal filter $w_*^T$ from the previous lemma (with $\eta^T = 1 - w_*^T$) and its estimate $\hat f_*^n = w_*^T(\Delta) y$:
$$f - \hat f_*^n = (1 - w_*^T(\Delta)) f + w_*^T(\Delta) \xi = \eta^T(\Delta) f + w_*^T(\Delta) \xi = \sum_{i=1}^l \eta^T(\Delta) f_i + w_*^T(\Delta) \xi = \sum_{i=1}^l \rho_i(\Delta) \big( \eta_i(\Delta) f_i \big) + w_*^T(\Delta) \xi.$$
Fact:
$$\|F(f - \hat f_*^n)\|_\infty \lesssim \sqrt{T} \Big( T^{k - 1/p} \epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}} \Big), \qquad \|\eta^T(\Delta) f\|_\infty \lesssim T^{k - 1/p} \epsilon.$$
Observation: for the optimization solution $\hat w^T$ and $\hat f^n = \hat w^T(\Delta) y$, by definition $\|F(f - \hat f^n)\|_\infty \le \|F(f - \hat f_*^n)\|_\infty$, thus
$$\|F((1 - \hat w^T(\Delta)) f)\|_\infty \le \|F(f - \hat f^n)\|_\infty + \|F(\hat w^T(\Delta) \xi)\|_\infty \lesssim \sqrt{T} \Big( T^{k - 1/p} \epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}} \Big).$$
60 / 65

Performance analysis
Stochastic error:
$$|s| = |[\hat w^T(\Delta) \xi]_0| \le \|F(\hat w^T)\|_1 \|F(\xi)\|_\infty \lesssim \frac{\Theta_T(\xi)}{\sqrt{T}}.$$
Deterministic error:
$$|b| = |[(1 - \hat w^T(\Delta)) f]_0| \le |[(1 - \hat w^T(\Delta)) \eta^T(\Delta) f]_0| + |[(1 - \hat w^T(\Delta)) w_*^T(\Delta) f]_0| \le \|\eta^T(\Delta) f\|_\infty + \|F(\eta^T(\Delta) f)\|_\infty \|F(\hat w^T)\|_1 + \|F((1 - \hat w^T(\Delta)) f)\|_\infty \|F(w_*^T)\|_1 \lesssim T^{k - 1/p} \epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}},$$
and we are done. Q.E.D.
61 / 65

Adaptive window size
Apply Lepski's trick to select $\hat T[m] = n \hat h[m]$:
$$\hat h[m] = \sup\Big\{ h \in (0,1) : |\hat f^{\,n,nh}[m] - \hat f^{\,n,nh'}[m]| \le C \sqrt{\frac{\ln n}{nh'}}, \ \forall h' \in (0, h) \Big\}.$$
Theorem. Suppose $S \ge kl$; then
$$R_q(\hat f, W^{l,k,p}(L)) \lesssim \Big( \frac{\ln n}{n} \Big)^{\min\{ \frac{k}{2k+1},\, \frac{k - 1/p + 1/q}{2(k - 1/p) + 1} \}},$$
and $\hat f$ is adaptively optimal in the sense that
$$\inf_{\hat f} \sup_{kl \le S,\, p > 0,\, L > 0,\, B \subset [0,1]} \frac{R_{q,B}(\hat f, W^{l,k,p}(L))}{\inf_{\hat f'} R_{q,B}(\hat f', W^{l,k,p}(L))} \asymp (\ln n)^{\frac{S}{2S+1}}.$$
62 / 65

Summary
Key points in nonparametric regression:
- Bias-variance tradeoff, bandwidth selection
- Choice of approximation basis
- Nonlinear estimators
Adaptive methods:
- Lepski's trick to adapt to unknown smoothness and spatial inhomogeneity
- Estimation of a sequence satisfying an unknown differential inequality
63 / 65

Duality to functional estimation
Shrinkage vs. approximation:
- Shrinkage: increase bias to reduce variance
- Approximation: increase variance to reduce bias
Some more dualities:
- Nonparametric regression: known approximation order & variance; unknown function, bandwidth & bias
- Functional estimation: known function, bandwidth & bias; unknown approximation order & variance
Similarity: kernel estimation, unbiasedness on polynomials.
Related questions: necessity of nonlinear estimators? Applications of Lepski's trick? Logarithmic phase transition? ...
64 / 65

References
Nonparametric regression:
- A. Nemirovski, Topics in Non-Parametric Statistics. Lectures on Probability Theory and Statistics (Saint-Flour 1998), Lecture Notes in Mathematics, vol. 1738, Springer, 2000.
- A. B. Tsybakov, Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
- L. Györfi, et al., A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.
Approximation theory:
- R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer Science & Business Media, 1993.
- G. G. Lorentz, M. von Golitschek, and Y. Makovoz, Constructive Approximation: Advanced Problems. Berlin: Springer, 1996.
Function space:
- O. V. Besov, V. P. Il'in, and S. M. Nikol'skii, Integral Representations of Functions and Embedding Theorems. V. H. Winston, 1978.
65 / 65