Nonparametric Estimation, Part I: Regression in Function Space

Yanjun Han
Department of Electrical Engineering, Stanford University
yjhan@stanford.edu

November 13, 2015 (Pray for Paris)
Outline

1. Nonparametric regression problem
2. Regression in function spaces
   - Regression in Hölder balls
   - Regression in Sobolev balls
3. Adaptive estimation
   - Adapt to unknown smoothness
   - Adapt to unknown differential inequalities
The statistical model

Given a family of distributions $\{P_f(\cdot)\}_{f\in\mathcal F}$ and an observation $Y\sim P_f$, we aim to:
1. estimate $f$;
2. estimate a functional $F(f)$ of $f$;
3. do hypothesis testing: given a partition $\mathcal F=\bigcup_{j=1}^N \mathcal F_j$, decide which $\mathcal F_j$ the true function $f$ belongs to.
The risk

The risk of using $\hat f$ to estimate $\theta(f)$ is
$$R(\hat f, f) = \Psi^{-1}\big(\mathbb E_f\, \Psi(d(\hat f(Y), \theta(f)))\big)$$
where $\Psi$ is a nondecreasing function on $[0,\infty)$ with $\Psi(0)=0$, and $d(\cdot,\cdot)$ is some metric.

The minimax approach:
$$R(\hat f, \mathcal F) = \sup_{f\in\mathcal F} R(\hat f, f), \qquad R^*(\mathcal F) = \inf_{\hat f} R(\hat f, \mathcal F).$$
Nonparametric regression

Problem: recover a function $f: [0,1]^d\to\mathbb R$ in a given set $\mathcal F\subseteq L^2([0,1]^d)$ via noisy observations
$$y_i = f(x_i) + \sigma\xi_i, \qquad i=1,2,\dots,n.$$

Some options:
- Grid $\{x_i\}_{i=1}^n$: deterministic design (equidistant grid, general case), random design
- Noise $\{\xi_i\}_{i=1}^n$: iid $\mathcal N(0,1)$, general iid case, with dependence
- Noise level $\sigma$: known, unknown
- Function space: Hölder ball, Sobolev space, Besov space
- Risk function: pointwise risk (confidence interval), integrated risk ($L^q$ risk, $1\le q\le\infty$, with normalization)
$$R_q(\hat f, f) = \begin{cases} \Big(\mathbb E_f \int_{[0,1]^d} |\hat f(x)-f(x)|^q\, dx\Big)^{1/q}, & 1\le q<\infty \\ \mathbb E_f\big(\operatorname{ess\,sup}_{x\in[0,1]^d} |\hat f(x)-f(x)|\big), & q=\infty \end{cases}$$
Equivalence between models

Under mild smoothness conditions, Brown et al. proved the asymptotic equivalence between the following models:
- Regression model: for iid $\mathcal N(0,1)$ noise $\{\xi_i\}_{i=1}^n$,
$$y_i = f(i/n) + \sigma\xi_i, \qquad i=1,2,\dots,n$$
- Gaussian white noise model:
$$dY_t = f(t)\,dt + \frac{\sigma}{\sqrt n}\, dB_t, \qquad t\in[0,1]$$
- Poisson process: generate $N\sim\mathrm{Poi}(n)$ iid samples from a common density $g$ ($g=f^2$, $\sigma=1/2$)
- Density estimation model: generate $n$ iid samples from a common density $g$ ($g=f^2$, $\sigma=1/2$)
Bias-variance decomposition

Deterministic error (bias) and stochastic error of an estimator $\hat f(\cdot)$:
$$b(x) = \mathbb E_f \hat f(x) - f(x), \qquad s(x) = \hat f(x) - \mathbb E_f \hat f(x).$$

Analysis of the $L^q$ risk:
- For $1\le q<\infty$:
$$R_q(\hat f, f) = \big(\mathbb E_f \|\hat f - f\|_q^q\big)^{1/q} = \big(\mathbb E_f \|b+s\|_q^q\big)^{1/q} \le \|b\|_q + \big(\mathbb E \|s\|_q^q\big)^{1/q}$$
- For $q=\infty$:
$$R_\infty(\hat f, f) \le \|b\|_\infty + \mathbb E\|s\|_\infty$$
Outline

1. Nonparametric regression problem
2. Regression in function spaces
   - Regression in Hölder balls
   - Regression in Sobolev balls
3. Adaptive estimation
   - Adapt to unknown smoothness
   - Adapt to unknown differential inequalities
The first example

Consider the following regression model:
$$y_i = f(i/n) + \sigma\xi_i, \qquad i=1,2,\dots,n$$
where $\{\xi_i\}_{i=1}^n$ are iid $\mathcal N(0,1)$, and $f\in H_1^s(L)$ for some known $s\in(0,1]$ and $L>0$, where the Hölder ball is defined as
$$H_1^s(L) = \{f\in C[0,1]: |f(x)-f(y)|\le L|x-y|^s,\ \forall x,y\in[0,1]\}.$$
A window estimate

To estimate the value of $f$ at $x$, consider the window $B_x = [x-h/2, x+h/2]$; then a natural estimator takes the form
$$\hat f(x) = \frac{1}{n(B_x)}\sum_{i:\, x_i\in B_x} y_i,$$
where $n(B_x)$ denotes the number of points $x_i$ in $B_x$.
- Deterministic error:
$$|b(x)| = \Big|\frac{1}{n(B_x)}\sum_{i:\, x_i\in B_x} f(x_i) - f(x)\Big| \le \frac{1}{n(B_x)}\sum_{i:\, x_i\in B_x} |f(x_i)-f(x)| \le \frac{1}{n(B_x)}\sum_{i:\, x_i\in B_x} L|x_i-x|^s \le Lh^s$$
- Stochastic error:
$$s(x) = \hat f(x) - \mathbb E_f\hat f(x) = \frac{\sigma}{n(B_x)}\sum_{i:\, x_i\in B_x} \xi_i$$
Optimal window size: $1\le q<\infty$

Bounding the integrated risk: since $|b(x)|\le Lh^s$ and $s(x) = \frac{\sigma}{n(B_x)}\sum_{i:\,x_i\in B_x}\xi_i$,
$$R_q(\hat f, f) \le \|b\|_q + \big(\mathbb E\|s\|_q^q\big)^{1/q} \lesssim Lh^s + \frac{\sigma}{\sqrt{nh}}.$$

Optimal window size and risk ($1\le q<\infty$):
$$h \asymp n^{-\frac{1}{2s+1}}, \qquad R_q(\hat f, H_1^s(L)) \lesssim n^{-\frac{s}{2s+1}}$$
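To make the window estimate concrete, here is a minimal numerical sketch in Python; the target function, noise level, and grid size are hypothetical choices for illustration, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, sigma = 1000, 1.0, 0.3
x = np.arange(1, n + 1) / n                     # equidistant design x_i = i/n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)   # hypothetical f

def window_estimate(x0, h):
    """Average the y_i whose design points fall in B_x0 = [x0 - h/2, x0 + h/2]."""
    mask = np.abs(x - x0) <= h / 2
    return y[mask].mean()                       # nonempty as long as h >= 1/n

h = n ** (-1 / (2 * s + 1))                     # rate-optimal window for H^s_1(L)
fhat = np.array([window_estimate(x0, h) for x0 in x])
```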
Optimal window size: $q=\infty$

Again $|b(x)|\le Lh^s$ and $s(x) = \frac{\sigma}{n(B_x)}\sum_{i:\,x_i\in B_x}\xi_i$.

Fact: for $\mathcal N(0,1)$ random variables $\{\xi_i\}_{i=1}^M$ (possibly correlated), there exists a constant $C$ such that for any $w\ge 1$,
$$\mathbb P\Big(\max_{1\le i\le M} |\xi_i| \ge Cw\sqrt{\ln M}\Big) \le \exp\Big(-\frac{w^2\ln M}{2}\Big)$$
Proof: apply the union bound and $\mathbb P(|\mathcal N(0,1)|>x)\le 2\exp(-x^2/2)$.

Corollary: $\mathbb E\|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh}}$ (with $M = O(n^2)$).

Optimal window size and risk ($q=\infty$):
$$h \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{1}{2s+1}}, \qquad R_\infty(\hat f, H_1^s(L)) \lesssim \Big(\frac{\ln n}{n}\Big)^{\frac{s}{2s+1}}$$
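A quick Monte Carlo illustration (not part of the proof) that the maximum of $M$ standard normals indeed grows like $\sqrt{\ln M}$:

```python
import numpy as np

rng = np.random.default_rng(0)
for M in (10**2, 10**4, 10**6):
    xi = rng.standard_normal(M)
    # empirical maximum vs. the sqrt(2 ln M) prediction from the Gaussian tail bound
    print(M, np.abs(xi).max(), np.sqrt(2 * np.log(M)))
```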
Bias-variance tradeoff

[Figure 1: Bias-variance tradeoff]
General Hölder ball

Consider the following regression problem:
$$y_\iota = f(x_\iota) + \sigma\xi_\iota, \qquad f\in H_d^s(L), \quad \iota = (i_1, i_2, \dots, i_d)\in\{1,2,\dots,m\}^d$$
where
- $x_{(i_1,i_2,\dots,i_d)} = (i_1/m, i_2/m, \dots, i_d/m)$, $n = m^d$
- $\{\xi_\iota\}$ are iid $\mathcal N(0,1)$ noises

The general Hölder ball $H_d^s(L)$ is defined as
$$H_d^s(L) = \Big\{f: [0,1]^d\to\mathbb R:\ \|D^k(f)(x) - D^k(f)(x')\| \le L\|x-x'\|^\alpha,\ \forall x, x'\Big\}$$
where $s = k+\alpha$, $k\in\mathbb N$, $\alpha\in(0,1]$, and $D^k(f)(\cdot)$ is the vector function comprised of all partial derivatives of $f$ of full order $k$.
- $s>0$: modulus of smoothness
- $d$: dimensionality
Local polynomial approximation

As before, consider the cube $B_x$ centered at $x$ with edge size $h$, where $h\to 0$ and $nh^d\to\infty$. A simple average no longer works!

Local polynomial approximation: for $x_\iota\in B_x$, design weights $w_\iota(x)$ such that if the true $f\in\mathcal P_d^k$, i.e., a polynomial of full degree $k$ in $d$ variables, we have
$$f(x) = \sum_{\iota:\, x_\iota\in B_x} w_\iota(x) f(x_\iota)$$

Lemma. There exist weights $\{w_\iota(x)\}_{\iota:\, x_\iota\in B_x}$ which depend continuously on $x$ and satisfy
$$\|w(x)\|_1 \le C_1, \qquad \|w(x)\|_2 \le \frac{C_2}{\sqrt{n(B_x)}} = \frac{C_2}{\sqrt{nh^d}}$$
where $C_1, C_2$ are two universal constants depending only on $k$ and $d$.
The estimator

Based on the weights $\{w_\iota(x)\}$, construct a linear estimator
$$\hat f(x) = \sum_{\iota:\, x_\iota\in B_x} w_\iota(x) y_\iota$$
- Deterministic error: since the weights reproduce polynomials, for any $p\in\mathcal P_d^k$,
$$|b(x)| = \Big|\sum_{\iota:\, x_\iota\in B_x} w_\iota(x) f(x_\iota) - f(x)\Big| = \inf_{p\in\mathcal P_d^k}\Big|\sum_{\iota:\, x_\iota\in B_x} w_\iota(x)\big(f(x_\iota)-p(x_\iota)\big) - \big(f(x)-p(x)\big)\Big| \le (1+\|w(x)\|_1)\inf_{p\in\mathcal P_d^k}\|f-p\|_{\infty, B_x}$$
- Stochastic error: since $\sum_\iota w_\iota(x)\xi_\iota / \|w(x)\|_2 \sim \mathcal N(0,1)$,
$$s(x) = \sigma\sum_{\iota:\, x_\iota\in B_x} w_\iota(x)\xi_\iota \lesssim \frac{\sigma}{\sqrt{nh^d}}\,|\mathcal N(0,1)|$$
Rate of convergence

Bounding the deterministic error: by Taylor expansion,
$$\inf_{p\in\mathcal P_d^k}\|f-p\|_{\infty, B_x} \le \max_{z\in B_x}\Big\|\frac{\big(D^k f(\eta_z) - D^k f(x)\big)(z-x)^k}{k!}\Big\| \lesssim Lh^s$$
Hence $\|b\|_q \lesssim Lh^s$ for $1\le q\le\infty$.

Bounding the stochastic error: as before,
$$\big(\mathbb E\|s\|_q^q\big)^{1/q} \lesssim \frac{\sigma}{\sqrt{nh^d}}\ \ (1\le q<\infty), \qquad \mathbb E\|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh^d}}$$

Theorem. The optimal window size $h$ and the corresponding risk are given by
$$h \asymp \begin{cases} n^{-\frac{1}{2s+d}}, & 1\le q<\infty \\ \big(\frac{\ln n}{n}\big)^{\frac{1}{2s+d}}, & q=\infty \end{cases} \qquad R_q(\hat f, H_d^s(L)) \lesssim \begin{cases} n^{-\frac{s}{2s+d}}, & 1\le q<\infty \\ \big(\frac{\ln n}{n}\big)^{\frac{s}{2s+d}}, & q=\infty \end{cases}$$
and the resulting estimator is minimax rate-optimal (see later).
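A sketch of the local polynomial estimator in $d=1$: a least-squares polynomial fit inside each window is one concrete realization of the weights $w_\iota(x)$ (the weights it induces reproduce polynomials of degree at most $k$ exactly). The target function and parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, k = 1000, 0.3, 2                      # k: polynomial degree in each window
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

def local_poly_estimate(x0, h):
    """Least-squares fit of a degree-k polynomial on the window B_x0, evaluated at x0."""
    mask = np.abs(x - x0) <= h / 2
    coeffs = np.polyfit(x[mask] - x0, y[mask], deg=k)
    return coeffs[-1]                           # constant term = fitted value at x0

s = k + 1.0                                     # s = k + alpha with alpha = 1, d = 1
h = n ** (-1 / (2 * s + 1))                     # rate-optimal window for 1 <= q < infinity
fhat = np.array([local_poly_estimate(x0, h) for x0 in x])
```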
Why approximation?

The general linear estimator takes the form
$$\hat f_{\text{lin}}(x) = \sum_{\iota\in\{1,2,\dots,m\}^d} w_\iota(x) y_\iota$$
Observations:
- The estimator $\hat f_{\text{lin}}$ is unbiased for $f$ if and only if
$$f(x) = \sum_\iota w_\iota(x) f(x_\iota), \qquad \forall x\in[0,1]^d$$
- Plugging in $x = x_\iota$ yields that $z = \{f(x_\iota)\}$ is a solution to
$$\big(w_\iota(x_\kappa) - \delta_{\iota\kappa}\big)_{\iota,\kappa}\, z = 0$$
- Denote by $\{z^{(k)}\}_{k=1}^M$ all linearly independent solutions to the previous equation; then
$$f_k(x) = \sum_\iota w_\iota(x) z_\iota^{(k)}, \qquad k=1,\dots,M$$
constitutes an approximation basis for $H_d^s(L)$.
Why polynomial?

Definition (Kolmogorov n-width). For a linear normed space $(X, \|\cdot\|)$ and a subset $K\subseteq X$, the Kolmogorov $n$-width of $K$ is defined as
$$d_n(K) \triangleq d_n(K, X) = \inf_{V_n}\, \sup_{x\in K}\, \inf_{y\in V_n} \|x-y\|$$
where $V_n\subseteq X$ ranges over subspaces of dimension $n$.

Theorem (Kolmogorov n-width of $H_d^s(L)$).
$$d_n(H_d^s(L), C[0,1]^d) \asymp n^{-s/d}.$$

The piecewise polynomial basis on cubes with edge size $h$ achieves the optimal rate (so other bases do not help):
$$d_{\Theta(1)h^{-d}}(H_d^s(L), C[0,1]^d) \asymp (h^{-d})^{-s/d} = h^s$$
Methods in nonparametric regression

Function space (this lecture):
- Kernel estimates: $\hat f(x) = \sum_\iota K_h(x, x_\iota) y_\iota$ (Nadaraya, Watson, ...); see the sketch below
- Local polynomial kernel estimates: $\hat f(x) = \sum_{k=1}^M \phi_k(x) c_k(x)$, where $(c_1(x), \dots, c_M(x)) = \arg\min_c \sum_\iota \big(y_\iota - \sum_{k=1}^M c_k(x)\phi_k(x_\iota)\big)^2 K_h(x - x_\iota)$ (Stone, ...)
- Penalized spline estimates: $\hat f = \arg\min_g \sum_\iota \big(y_\iota - g(x_\iota)\big)^2 + \lambda\|g^{(s)}\|_2^2$ (Speckman, Ibragimov, Khas'minskii, ...)
- Nonlinear estimates (Lepski, Nemirovski, ...)

Transformed space (next lecture):
- Fourier transform: projection estimates (Pinsker, Efromovich, ...)
- Wavelet transform: shrinkage estimates (Donoho, Johnstone, ...)
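As a concrete instance of the first item (added here as an illustration), a Nadaraya-Watson kernel estimate with a Gaussian kernel; the signal and bandwidth are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, h = 1000, 0.3, 0.05
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

def nadaraya_watson(x0):
    """Kernel-weighted local average: sum_i K_h(x0, x_i) y_i / sum_i K_h(x0, x_i)."""
    K = np.exp(-0.5 * ((x - x0) / h) ** 2)      # Gaussian kernel with bandwidth h
    return (K * y).sum() / K.sum()

fhat = np.array([nadaraya_watson(x0) for x0 in x])
```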
Outline

1. Nonparametric regression problem
2. Regression in function spaces
   - Regression in Hölder balls
   - Regression in Sobolev balls
3. Adaptive estimation
   - Adapt to unknown smoothness
   - Adapt to unknown differential inequalities
Regression in Sobolev space

Consider the following regression problem:
$$y_\iota = f(x_\iota) + \sigma\xi_\iota, \qquad f\in S_d^{k,p}(L), \quad \iota = (i_1, i_2, \dots, i_d)\in\{1,2,\dots,m\}^d$$
where
- $x_{(i_1,i_2,\dots,i_d)} = (i_1/m, i_2/m, \dots, i_d/m)$, $n = m^d$
- $\{\xi_\iota\}$ are iid $\mathcal N(0,1)$ noises

The Sobolev ball $S_d^{k,p}(L)$ is defined as
$$S_d^{k,p}(L) = \Big\{f: [0,1]^d\to\mathbb R:\ \|D^k(f)\|_p \le L\Big\}$$
where $D^k(f)(\cdot)$ is the vector function comprised of all partial derivatives (in the sense of distributions) of $f$ of order $k$.

Parameters:
- $d$: dimensionality
- $k$: order of differentiation
- $p$: $p > d$ to ensure the continuous embedding $S_d^{k,p}(L)\subseteq C[0,1]^d$
- $q$: norm of the risk
Minimax lower bound

Theorem (Minimax lower bound). The minimax risk for the Sobolev ball regression problem over all estimators is
$$R_q^*(S_d^{k,p}(L), n) \asymp \begin{cases} n^{-\frac{k}{2k+d}}, & q < (1+\frac{2k}{d})p \\ \big(\frac{\ln n}{n}\big)^{\frac{k-d/p+d/q}{2(k-d/p)+d}}, & q \ge (1+\frac{2k}{d})p \end{cases}$$

Theorem (Linear minimax lower bound). The minimax risk over all linear estimators is
$$R_q^{\text{lin}}(S_d^{k,p}(L), n) \asymp \begin{cases} n^{-\frac{k}{2k+d}}, & q \le p \\ n^{-\frac{k-d/p+d/q}{2(k-d/p+d/q)+d}}, & p < q < \infty \\ \big(\frac{\ln n}{n}\big)^{\frac{k-d/p}{2(k-d/p)+d}}, & q = \infty \end{cases}$$
Start from linear estimates

Consider the linear estimator given by local polynomial approximation with window size $h$ as before.
- Stochastic error:
$$\big(\mathbb E\|s\|_q^q\big)^{1/q} \lesssim \frac{\sigma}{\sqrt{nh^d}}\ \ (1\le q<\infty), \qquad \mathbb E\|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh^d}}$$
- Deterministic error: corresponds to the polynomial approximation error.

Fact. For $f\in S_d^{k,p}(L)$, there exists a constant $C>0$ such that
$$\|D^{k-1}f(x) - D^{k-1}f(y)\| \le C\|x-y\|^{1-d/p}\Big(\int_B \|D^k f(z)\|^p\, dz\Big)^{1/p}, \qquad \forall x, y\in B$$

The Taylor polynomial then yields:
$$|b(x)| \lesssim h^{k-d/p}\Big(\int_{B_x} \|D^k f(z)\|^p\, dz\Big)^{1/p}$$
Linear estimator: integrated bias

Integrated bias:
$$\|b\|_q^q \lesssim h^{(k-d/p)q}\int_{[0,1]^d}\Big(\int_{B_x} \|D^k f(z)\|^p\, dz\Big)^{q/p} dx$$
Note that
$$\int_{[0,1]^d}\int_{B_x} \|D^k f(z)\|^p\, dz\, dx = h^d \int_{[0,1]^d} \|D^k f(x)\|^p\, dx \le h^d L^p$$

Fact.
$$a_1^r + a_2^r + \dots + a_n^r \le \begin{cases} (a_1+a_2+\dots+a_n)^r / n^{r-1}, & 0 < r \le 1 \\ (a_1+a_2+\dots+a_n)^r, & r > 1 \end{cases}$$

- Case $q/p\le 1$: $\|b\|_q^q \lesssim h^{(k-d/p)q}\, L^q\, h^{d\frac qp} = L^q h^{kq}$ (regular case)
- Case $q/p>1$: $\|b\|_q^q \lesssim h^{(k-d/p)q}\, L^q\, h^d = L^q h^{(k-d/p+d/q)q}$ (sparse case)
Linear estimator: optimal risk

In summary, we have
$$\|b\|_q \lesssim \begin{cases} Lh^k, & q\le p \\ Lh^{k-d/p+d/q}, & p < q \end{cases} \qquad \|s\|_q \lesssim \begin{cases} \frac{\sigma}{\sqrt{nh^d}}, & q<\infty \\ \sigma\sqrt{\frac{\ln n}{nh^d}}, & q=\infty \end{cases}$$

Theorem (Optimal linear risk).
$$R_q(\hat f_{\text{lin}}, S_d^{k,p}(L)) \lesssim \begin{cases} n^{-\frac{k}{2k+d}}, & q\le p \\ n^{-\frac{k-d/p+d/q}{2(k-d/p+d/q)+d}}, & p<q<\infty \\ \big(\frac{\ln n}{n}\big)^{\frac{k-d/p}{2(k-d/p)+d}}, & q=\infty \end{cases}$$

Alternative proof:

Theorem (Sobolev embedding). For $d\le p < q$, we have $S_d^{k,p}(L)\subseteq S_d^{k-d/p+d/q,\, q}(L')$.
Minimax lower bound: tool

Theorem (Fano's inequality). Suppose $H_1,\dots,H_N$ are probability distributions (hypotheses) on a sample space $(\Omega,\mathcal F)$, and there exists a decision rule $D: \Omega\to\{1,2,\dots,N\}$ such that $H_i(\omega: D(\omega)\neq i)\le\delta_i$ for $1\le i\le N$. Then
$$\max_{1\le i,j\le N} D_{\mathrm{KL}}(H_i\|H_j) \ge \Big(1 - \frac1N\sum_{i=1}^N \delta_i\Big)\ln(N-1) - \ln 2$$

Apply to nonparametric regression:
- Suppose the convergence rate $r_n$ is attainable; construct $N$ functions (hypotheses) $f_1,\dots,f_N\in\mathcal F$ such that $\|f_i-f_j\|_q > 4r_n$ for any $i\neq j$
- Decision rule: after obtaining $\hat f$, choose $j$ such that $\|\hat f - f_j\|_q \le 2r_n$
- As a result, $\delta_i\le 1/2$, and Fano's inequality gives
$$\frac{1}{2\sigma^2}\max_{1\le i,j\le N}\sum_\iota |f_i(x_\iota)-f_j(x_\iota)|^2 \ge \frac12\ln(N-1) - \ln 2$$
Minimax lower bound: sparse case

Suppose $q\ge(1+\frac{2k}{d})p$. Fix a smooth function $g$ supported on $[0,1]^d$.
- Divide $[0,1]^d$ into $h^{-d}$ disjoint cubes of size $h$, and construct $N = h^{-d}$ hypotheses: $f_j$ is supported on the $j$-th cube and equals $h^s g(x/h)$ on that cube (with translation)
- To ensure $f\in S_d^{k,p}(L)$, set $s = k - d/p$
- For $i\neq j$, $r_n \asymp \|f_i - f_j\|_q \asymp h^{k-d/p+d/q}$, and
$$\sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \asymp h^{2(k-d/p)}\cdot nh^d = nh^{2(k-d/p)+d}$$
- Fano's inequality gives
$$nh^{2(k-d/p)+d} \gtrsim \ln N \asymp \ln h^{-1} \implies r_n \gtrsim \Big(\frac{\ln n}{n}\Big)^{\frac{k-d/p+d/q}{2(k-d/p)+d}}$$
Minimax lower bound: regular case

Suppose $q < (1+\frac{2k}{d})p$. Fix a smooth function $g$ supported on $[0,1]^d$, and set $g_h(x) = h^s g(x/h)$.
- Divide $[0,1]^d$ into $h^{-d}$ disjoint cubes of size $h$, and consider the $N = 2^{h^{-d}}$ hypotheses $f_\epsilon = \sum_{i=1}^{h^{-d}} \epsilon_i g_{h,i}$, $\epsilon_i\in\{0,1\}$, where $g_{h,i}$ is the translation of $g_h$ to the $i$-th cube
- One can choose $M = 2^{\Theta(h^{-d})}$ hypotheses such that for $i\neq j$, $f_i$ differs from $f_j$ on at least $\Theta(h^{-d})$ cubes
- To ensure $f\in S_d^{k,p}(L)$, set $s = k$
- For $i\neq j$, $r_n \asymp \|f_i-f_j\|_q \asymp h^k$, and
$$\sum_\iota |f_i(x_\iota)-f_j(x_\iota)|^2 \lesssim h^{2k}\cdot n = nh^{2k}$$
- Fano's inequality gives
$$nh^{2k} \gtrsim \ln M \asymp h^{-d} \implies r_n \gtrsim \Big(\frac1n\Big)^{\frac{k}{2k+d}}$$
Construct minimax estimator

Why does the linear estimator fail in the sparse case? Too much variance in the flat region!
- Suppose we knew that the true function $f$ is supported on a single cube of size $h$; then
$$\|b\|_q \lesssim h^{k-d/p+d/q}, \qquad \big(\mathbb E\|s\|_q^q\big)^{1/q} \lesssim \frac{h^{d/q}}{\sqrt{nh^d}}$$
so $h \asymp n^{-\frac{1}{2(k-d/p)+d}}$ yields the optimal risk $\Theta(n^{-\frac{k-d/p+d/q}{2(k-d/p)+d}})$

In both cases, Nemirovski's construction gives $\hat f$ as the solution to the following optimization problem:
$$\hat f = \arg\min_{g\in S_d^{k,p}(L)} \|g-y\|_{\mathcal B} = \arg\min_{g\in S_d^{k,p}(L)}\, \max_{B\in\mathcal B}\, \frac{1}{\sqrt{n(B)}}\Big|\sum_{\iota:\, x_\iota\in B} \big(g(x_\iota) - y_\iota\big)\Big|$$
where $\mathcal B$ is the set of all cubes in $[0,1]^d$ with nodes belonging to $\{x_\iota\}$.
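The inner $\|\cdot\|_{\mathcal B}$ norm is easy to compute in $d=1$; the hard part is the outer minimization over the Sobolev ball. A brute-force sketch of the norm (quadratic time; a hypothetical helper, not the slides' implementation):

```python
import numpy as np

def b_norm(r):
    """||r||_B in d = 1: max over all discrete windows B (index intervals) of
    |sum of residuals over B| / sqrt(#points in B)."""
    n = len(r)
    c = np.concatenate(([0.0], np.cumsum(r)))   # prefix sums of the residuals
    best = 0.0
    for i in range(n):                          # O(n^2) loop over intervals [i, j)
        for j in range(i + 1, n + 1):
            best = max(best, abs(c[j] - c[i]) / np.sqrt(j - i))
    return best

# Usage: for a candidate fit g evaluated on the grid, the objective is
# b_norm(g_values - y); minimizing it over the Sobolev ball is the hard step.
```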
Estimator analysis

Question 1: given a linear estimator $\hat f$ at $x$ with window size $h$, which size $h'$ of $B\in\mathcal B$ achieves the maximum of $\frac{1}{\sqrt{n(B)}}\big|\sum_{\iota:\, x_\iota\in B}(\hat f(x_\iota) - y_\iota)\big|$?
- If $h'\gtrsim h$, the value behaves like $\max\{\sqrt{n(B)}\, b_h(x), |\mathcal N(0,1)|\}$ and increases with $h'$
- If $h'\lesssim h$, the value is close to zero
- Hence a window of size $h'\gtrsim h$ achieves the maximum

Question 2: which window size $h$ at $x$ minimizes $\|\hat f_h - y\|_{\mathcal B}$?
- Answer: the $h$ that achieves the bias-variance balance locally at $x$
- The $1/\sqrt{n(B)}$ normalization helps to achieve spatial homogeneity

Theorem (Optimality of the estimator).
$$R_q(\hat f, S_d^{k,p}(L)) \lesssim \Big(\frac{\ln n}{n}\Big)^{\min\{\frac{k}{2k+d},\, \frac{k-d/p+d/q}{2(k-d/p)+d}\}}$$
Hence $\hat f$ is minimax rate-optimal (with a logarithmic gap in the regular case).
Proof of the optimality

First observation:
$$\|\hat f - f\|_{\mathcal B} \le \|\hat f - y\|_{\mathcal B} + \|f - y\|_{\mathcal B} \le 2\|\xi\|_{\mathcal B} \lesssim \sqrt{\ln n}$$
and $e \triangleq \hat f - f \in S_d^{k,p}(2L)$.

Definition (Regular cube). A cube $B\subseteq[0,1]^d$ is called regular if
$$e(B) \ge C[h(B)]^{k-d/p}\,\Omega(e, B)$$
where $C>0$ is a suitably chosen constant, $e(B) = \max_{x\in B} |e(x)|$, $h(B)$ is the edge size of $B$, and
$$\Omega(e, B) = \Big(\int_B \|D^k e(x)\|^p\, dx\Big)^{1/p}.$$

One can show that $[0,1]^d$ can be (roughly) partitioned into maximal regular cubes with $\ge$ replaced by $=$ (i.e., with balanced bias and variance).
Property of regular cubes

Lemma. If the cube $B$ is regular, we have
$$\sup_{B'\in\mathcal B,\, B'\subseteq B}\, \frac{1}{\sqrt{n(B')}}\Big|\sum_{\iota:\, x_\iota\in B'} e(x_\iota)\Big| \gtrsim e(B)\sqrt{n(B)};$$
combined with $\|e\|_{\mathcal B}\lesssim\sqrt{\ln n}$, this yields
$$e(B) \lesssim \sqrt{\frac{\ln n}{n(B)}}.$$

Proof:
- Since $B$ is regular, there exists a polynomial $e_k(x)$ on $B$ in $d$ variables of degree at most $k$ such that $\|e - e_k\|_{\infty,B} \le e(B)/4$
- On one hand, $\|e_k\|_{\infty,B} \le 5e(B)/4$, and Markov's inequality for polynomials implies $\|De_k\|_{\infty,B} \lesssim e(B)/h(B)$
- On the other hand, $|e_k(x_0)| \ge 3e(B)/4$ for some $x_0\in B$; the derivative bound then implies that there exists $B'\subseteq B$ with $h(B') \gtrsim h(B)$ such that $|e_k(x)| \ge e(B)/2$ on $B'$
- Choosing this $B'$ in the supremum completes the proof
Upper bound for the $L^q$ risk

Since $[0,1]^d$ can be (roughly) partitioned into regular cubes $\{B_i\}_{i\ge1}$ such that $e(B_i) \asymp [h(B_i)]^{k-d/p}\,\Omega(e, B_i)$, we have
$$\|e\|_q^q = \sum_{i\ge1}\int_{B_i} |e(x)|^q\, dx \le \sum_{i\ge1} [h(B_i)]^d\, e^q(B_i)$$
The previous lemma asserts that $e(B_i) \lesssim \sqrt{\frac{\ln n}{n[h(B_i)]^d}}$; canceling out $h(B_i)$ and $e(B_i)$ gives
$$\|e\|_q^q \lesssim \Big(\frac{\ln n}{n}\Big)^{\frac{q(k-d/p+d/q)}{2(k-d/p)+d}} \sum_{i\ge1} \Omega(e, B_i)^{\frac{d(q-2)}{2(k-d/p)+d}} \lesssim \Big(\frac{\ln n}{n}\Big)^{\frac{q(k-d/p+d/q)}{2(k-d/p)+d}} \Big(\sum_{i\ge1} \Omega(e, B_i)^p\Big)^{\frac{d(q-2)}{p(2(k-d/p)+d)}}$$
if $q \ge q^* = (1+\frac{2k}{d})p$. For $1\le q < q^*$ we use $\|e\|_q \le \|e\|_{q^*}$. Q.E.D.
Outline

1. Nonparametric regression problem
2. Regression in function spaces
   - Regression in Hölder balls
   - Regression in Sobolev balls
3. Adaptive estimation
   - Adapt to unknown smoothness
   - Adapt to unknown differential inequalities
Adaptive estimation

Drawbacks of the previous estimator:
- Computationally intractable
- Requires perfect knowledge of the parameters

Adaptive estimation: given a family of function classes $\{\mathcal F_n\}$, find an estimator which (nearly) attains the optimal risk simultaneously, without knowing which $\mathcal F_n$ the true function belongs to.

In the sequel we will consider $\{\mathcal F_n\} = \{S_d^{k,p}(L)\}_{k\le S,\; p>d,\; L>0}$.
Data-driven window size

Consider again a linear estimate with window size $h$ locally at $x$; recall that
$$|\hat f(x) - f(x)| \lesssim \inf_{p\in\mathcal P_d^k}\|f-p\|_{\infty, B_h(x)} + \frac{\sigma|\mathcal N(0,1)|}{\sqrt{nh^d}}$$
- The optimal window size $h(x)$ should balance these two terms
- The stochastic term can be upper bounded (with overwhelming probability) by $s_n(h) = w\sigma\sqrt{\frac{\ln n}{nh^d}}$, which depends only on known parameters and $h$; here the constant $w>0$ is large enough
- But the bias term depends on the unknown local behavior of $f$ on $B_h(x)$!
Bias-variance tradeoff: revisited

[Figure 2: Bias-variance tradeoff]
Lepski's trick

Lepski's adaptive scheme: construct a family of local polynomial approximation estimators $\{\hat f_h\}$ over all window sizes $h\in I$. Then use $\hat f_{\hat h(x)}(x)$ as the estimate of $f(x)$, where
$$\hat h(x) \triangleq \sup\{h\in I: |\hat f_h(x) - \hat f_{h'}(x)| \le 4s_n(h'),\ \forall h'\in(0,h),\ h'\in I\}.$$

Denote by $h^*(x)$ the optimal window size, i.e., $b_n(h^*) = s_n(h^*)$.
- Existence of $\hat h(x)$: clearly $h^*(x)$ satisfies the condition, since for $h'\in(0,h^*)$
$$|\hat f_{h^*}(x) - \hat f_{h'}(x)| \le |\hat f_{h^*}(x) - f(x)| + |f(x) - \hat f_{h'}(x)| \le b_n(h^*) + s_n(h^*) + b_n(h') + s_n(h') \le 4s_n(h')$$
- Performance of $\hat h(x)$: since $\hat h(x)\ge h^*(x)$,
$$|\hat f_{\hat h(x)}(x) - f(x)| \le |\hat f_{h^*(x)}(x) - f(x)| + |\hat f_{\hat h(x)}(x) - \hat f_{h^*(x)}(x)| \le b_n(h^*) + s_n(h^*) + 4s_n(h^*) = 6s_n(h^*)$$
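A minimal sketch of Lepski's rule over a finite bandwidth grid, using the local average of the first example as the base estimator; the grid, the constant $w$, and the test signal are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, w = 1000, 0.3, 2.0
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

def fhat(x0, h):                                # base estimator: local average
    mask = np.abs(x - x0) <= h / 2
    return y[mask].mean()

def s_n(h):                                     # stochastic-error bound s_n(h)
    return w * sigma * np.sqrt(np.log(n) / (n * h))

def lepski(x0, grid):
    """Largest h in the grid whose estimate agrees with every smaller bandwidth
    up to 4 * s_n(h'); mirrors the definition of h-hat(x) above."""
    grid = np.sort(grid)
    chosen = grid[0]
    for i in range(1, len(grid)):
        h = grid[i]
        if all(abs(fhat(x0, h) - fhat(x0, hp)) <= 4 * s_n(hp) for hp in grid[:i]):
            chosen = h
        else:
            break
    return chosen

grid = np.geomspace(5 / n, 0.5, 20)             # geometric grid, a common choice
est = np.array([fhat(x0, lepski(x0, grid)) for x0 in x[::50]])
```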
Lepski's estimator is adaptive!

Properties of Lepski's scheme:
- Adaptive to parameters: agnostic to $p$, $k$, $L$
- Spatially adaptive: still works when there is spatial inhomogeneity, or when we only estimate a portion of $f$

Theorem (Adaptive optimality). Suppose the modulus of smoothness $k$ is upper bounded by a known hyper-parameter $S$. Then
$$R_q(\hat f, S_d^{k,p}(L)) \lesssim \Big(\frac{\ln n}{n}\Big)^{\min\{\frac{k}{2k+d},\, \frac{k-d/p+d/q}{2(k-d/p)+d}\}}$$
and $\hat f$ is adaptively optimal in the sense that
$$\inf_{\hat f}\, \sup_{k\le S,\, p>d,\, L>0,\, B\subseteq[0,1]^d}\, \frac{R_{q,B}(\hat f, S_d^{k,p}(L))}{\inf_{\hat f'} R_{q,B}(\hat f', S_d^{k,p}(L))} \asymp (\ln n)^{\frac{S}{2S+d}}.$$
Covering of the ideal window

By the property of the data-driven window size $\hat h(x)$, it suffices to consider the ideal window size $h^*(x)$ satisfying
$$s_n(h^*(x)) \asymp [h^*(x)]^{k-d/p}\,\Omega(f, B_{h^*(x)}(x))$$

Lemma. There exist $\{x_i\}_{i=1}^M$ and a partition $\{V_i\}_{i=1}^M$ of $[0,1]^d$ such that
1. The cubes $\{B_{h^*(x_i)}(x_i)\}_{i=1}^M$ are pairwise disjoint
2. For every $x\in V_i$, we have
$$B_{h^*(x)}(x) \subseteq B_{h^*(x_i)}(x_i) \quad\text{and}\quad h^*(x) \ge \frac12\max\{h^*(x_i), \|x - x_i\|\}$$
Analysis of the estimator

Using the covering of the local windows, we have
$$\|\hat f - f\|_q^q \lesssim \int_{[0,1]^d} s_n(h^*(x))^q\, dx = \sum_{i=1}^M \int_{V_i} s_n(h^*(x))^q\, dx$$
Recall that $s_n(h) \asymp \sqrt{\frac{\ln n}{nh^d}}$ and $h^*(x) \ge \max\{h^*(x_i), \|x-x_i\|\}/2$ for $x\in V_i$; hence
$$\|\hat f - f\|_q^q \lesssim \Big(\frac{\ln n}{n}\Big)^{q/2}\sum_{i=1}^M \int_{V_i} \big[\max\{h^*(x_i), \|x-x_i\|\}\big]^{-dq/2}\, dx \lesssim \Big(\frac{\ln n}{n}\Big)^{q/2}\sum_{i=1}^M \int_0^\infty r^{d-1}\big[\max\{h^*(x_i), r\}\big]^{-dq/2}\, dr \lesssim \Big(\frac{\ln n}{n}\Big)^{q/2}\sum_{i=1}^M [h^*(x_i)]^{d-dq/2}$$
Plugging in $s_n(h^*(x_i)) \asymp [h^*(x_i)]^{k-d/p}\,\Omega(f, B_{h^*(x_i)}(x_i))$ and using the disjointness of $\{B_{h^*(x_i)}(x_i)\}_{i=1}^M$ completes the proof. Q.E.D.
Experiment: Blocks

[Figure 3: Blocks: original, noisy, and reconstructed signal]

Experiment: Bumps

[Figure 4: Bumps: original, noisy, and reconstructed signal]

Experiment: HeaviSine

[Figure 5: HeaviSine: original, noisy, and reconstructed signal]

Experiment: Doppler

[Figure 6: Doppler: original, noisy, and reconstructed signal]
Generalization: anisotropic case

What if the modulus of smoothness differs across directions? Then we should replace the cube with a rectangle with edge sizes $(h_1,\dots,h_d)$!

Deterministic and stochastic terms:
$$b_h(x) \le C_1\sum_{i=1}^d h_i^{s_i} \triangleq b(h), \qquad s_h(x) \le C_2\sqrt{\frac{\ln n}{nV(h)}} \triangleq s(h)$$
where $V(h) = \prod_{i=1}^d h_i$. Suppose $b(h^*) = s(h^*)$. The naive guess
$$\hat h(x) = \arg\max_h\{V(h): |\hat f_h(x) - \hat f_{h'}(x)| \le 4s(h'),\ \forall h',\ V(h')\le V(h)\}$$
does not work:
- $h^*$ may not satisfy this condition
- it still does not work if we replace $V(h')\le V(h)$ by $h'\le h$: $\hat f_{\hat h}$ may not be close to $\hat f_{h^*}$
Adaptive regime

Adaptive regime (Kerkyacharian-Lepski-Picard '01):
$$\hat h(x) = \arg\max_h\{V(h): |\hat f_{h'\vee h}(x) - \hat f_{h'}(x)| \le 4s(h'),\ \forall h',\ V(h')\le V(h)\}$$
where $h'\vee h = \max\{h', h\}$ (coordinate-wise).
- $h^*$ satisfies the condition: for $V(h')\le V(h^*)$,
$$|\hat f_{h'\vee h^*}(x) - \hat f_{h'}(x)| \le b(h'\vee h^*) + b(h') + s(h'\vee h^*) + s(h') \le \sum_{i:\, h_i' < h_i^*}\big((h_i')^{s_i} + (h_i^*)^{s_i}\big) + 2s(h') \le 4s(h')$$
- $\hat f_{\hat h}$ is close to $\hat f_{h^*}$: since $V(h^*)\le V(\hat h)$,
$$|\hat f_{\hat h}(x) - f(x)| \le |\hat f_{\hat h}(x) - \hat f_{\hat h\vee h^*}(x)| + |\hat f_{\hat h\vee h^*}(x) - \hat f_{h^*}(x)| + |\hat f_{h^*}(x) - f(x)| \le \sum_{i:\, h_i^* > \hat h_i}\big((\hat h_i)^{s_i} + (h_i^*)^{s_i}\big) + 4s(h^*) + 2s(h^*) \le 8s(h^*)$$
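A compact sketch of the Kerkyacharian-Lepski-Picard rule in $d=2$ with rectangular local averages; the grid of bandwidth pairs and the test surface are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 64, 0.3
u = np.arange(1, m + 1) / m
g1, g2 = np.meshgrid(u, u, indexing='ij')
Y = np.sin(2 * np.pi * g1) * np.cos(np.pi * g2) + sigma * rng.standard_normal((m, m))

def fhat(i, j, h):
    """Local average over an h[0] x h[1] rectangle centered at grid point (i, j)."""
    r1, r2 = max(1, int(h[0] * m / 2)), max(1, int(h[1] * m / 2))
    return Y[max(0, i - r1):i + r1 + 1, max(0, j - r2):j + r2 + 1].mean()

def s_n(h, w=2.0):                              # stochastic bound with V(h) = h1 * h2
    return w * sigma * np.sqrt(np.log(m * m) / (m * m * h[0] * h[1]))

def klp(i, j, grid):
    """Pick the largest-volume h consistent with fhat at h' v h for every
    smaller-volume h' (v = coordinate-wise max), as in the adaptive regime above."""
    best = grid[0]
    for h in sorted(grid, key=lambda r: r[0] * r[1]):
        ok = all(abs(fhat(i, j, np.maximum(h, hp)) - fhat(i, j, hp)) <= 4 * s_n(hp)
                 for hp in grid if hp[0] * hp[1] <= h[0] * h[1])
        if ok:
            best = h
    return best

hs = np.geomspace(4 / m, 0.5, 4)
grid = [np.array([a, b]) for a in hs for b in hs]
est = fhat(32, 32, klp(32, 32, grid))
```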
More on Lepski's trick

General adaptive rule: consider a family of models $(\mathcal X^n, \mathcal F^{(s)})_{s\in I}$ and rate-optimal minimax estimators $\hat f_n^{(s)}$ for each $s$, with minimax risk $r_s(n)$. For simplicity suppose $r_s(n)$ decreases with $s$.

Definition (Adaptive estimator). Choose an increasing sequence $\{s_i\}_{i=1}^{k(n)}$ from $I$ with $k(n)\to\infty$, select
$$\hat k = \sup\{1\le k\le k(n): d(\hat f_{s_k}, \hat f_{s_i}) \le C r_{s_i}(n),\ \forall\, 1\le i<k\}$$
for some prescribed constant $C$, and choose $\hat f = \hat f_{s_{\hat k}}$.

Theorem (Lepski '90). If $r_s(n) \asymp n^{-r(s)}$, then under mild conditions,
$$R_n(\hat f, \mathcal F^{(s)}) \lesssim n^{-r(s)}(\ln n)^{\frac{r(s)}{2}}, \qquad \forall s\in I.$$
As a result, the sub-optimality gap is at most $(\ln n)^{\sup_{s\in I} r(s)/2}$.
More on adaptive schemes

Instead of exploiting the monotonicity of bias and variance, we can also estimate the risk directly. Given $\{\hat f_h\}_{h\in I}$, suppose $\hat r(h)$ is a good estimator of the risk of $\hat f_h$; then we can use
$$\hat h = \arg\min_h \hat r(h)$$
- In Gaussian models, $\hat r(h)$ can be Stein's Unbiased Risk Estimator (SURE): zero bias and small variance (a sketch appears below)
- Lepski's method: define $\{\hat f_{h,h'}\}_{(h,h')\in I\times I}$ from $\{\hat f_h\}_{h\in I}$, and
$$\hat r(h) = \sup_{h':\, s(h')\le s(h)}\Big(\|\hat f_{h,h'} - \hat f_{h'}\| - Q(s(h'))\Big)_+ + 2Q(s(h))$$
where $Q$ is some function related to the variance; $\hat f_{h,h'}$ is usually a convolution of estimators, and $\hat f_{h,h'} = \hat f_{h'}$ or $\hat f_{h\vee h'}$ in the previous examples
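A sketch of the SURE route for a family of linear smoothers $\hat f_h = W_h y$ (here the boxcar local average, whose diagonal entries $W_{ii} = 1/n(B_{x_i})$ make the trace easy); the signal is hypothetical and $\sigma$ is assumed known:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 1000, 0.3
x = np.arange(1, n + 1) / n
y = np.sin(2 * np.pi * x) + sigma * rng.standard_normal(n)

def sure(h):
    """Unbiased estimate of E||f_h - f||^2 / n for the linear smoother f_h = W_h y:
    ||y - W_h y||^2 / n + (2 sigma^2 / n) tr(W_h) - sigma^2."""
    fh, trace = np.empty(n), 0.0
    for i in range(n):
        mask = np.abs(x - x[i]) <= h / 2
        fh[i] = y[mask].mean()
        trace += 1.0 / mask.sum()               # W_ii = 1 / #points in the window
    return np.mean((y - fh) ** 2) + 2 * sigma**2 * trace / n - sigma**2

grid = np.geomspace(5 / n, 0.5, 20)
h_hat = min(grid, key=sure)                     # data-driven bandwidth
```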
Outline

1. Nonparametric regression problem
2. Regression in function spaces
   - Regression in Hölder balls
   - Regression in Sobolev balls
3. Adaptive estimation
   - Adapt to unknown smoothness
   - Adapt to unknown differential inequalities
Further generalization

Another point of view on the Sobolev ball $S_1^{k,p}(L)$: there exists a polynomial $r(x) = x^k$ of degree $k$ such that
$$\Big\|r\Big(\frac{d}{dx}\Big)f\Big\|_p \le L, \qquad \forall f\in S_1^{k,p}(L)$$
Generalization: what if $r(\cdot)$ is an unknown monic polynomial of degree at most $k$?

For a modulated signal, e.g., $f(x) = g(x)\sin(\omega x+\phi)$ with $g\in S_1^{k,p}(L)$:
- It does not belong to any Sobolev ball for large $\omega$
- However, we can write $f(x) = f_+(x) + f_-(x)$, where
$$f_\pm(x) = \pm\frac{g(x)\exp(\pm i(\omega x+\phi))}{2i}$$
and by induction we have
$$\Big\|\Big(\frac{d}{dx} \mp i\omega\Big)^k f_\pm\Big\|_p = \frac{\|g^{(k)}\|_p}{2} \le \frac L2$$
General regression model

Definition (Signals satisfying differential inequalities). A function $f: [0,1]\to\mathbb R$ belongs to the class $\mathcal W^{l,k,p}(L)$ if and only if $f = \sum_{i=1}^l f_i$ and there exist monic polynomials $r_1,\dots,r_l$ of degree $k$ such that
$$\sum_{i=1}^l \Big\|r_i\Big(\frac{d}{dx}\Big)f_i\Big\|_p \le L.$$

General regression problem: recover $f\in\mathcal W^{l,k,p}(L)$ from noisy observations
$$y_i = f(i/n) + \sigma\xi_i, \qquad i=1,2,\dots,n, \quad \xi_i \overset{\text{iid}}{\sim} \mathcal N(0,1)$$
where we only know:
- the noise level $\sigma$
- an upper bound $S \ge kl$
Recovering approach

The risk function:
- Note that now we cannot recover the whole function $f$ (e.g., $f\equiv 0$ and $f(x) = \sin(2\pi nx)$ correspond to the same model)
- The discrete $q$-norm:
$$\|\hat f - f\|_q = \Big(\frac1n\sum_{i=1}^n |\hat f(i/n) - f(i/n)|^q\Big)^{1/q}$$

Recovering approach:
1. Discretization: transform the differential inequalities into inequalities on finite differences
2. Sequence estimation: given a window size, recover the discrete sequence satisfying an unknown difference inequality
3. Adaptive window size: apply Lepski's trick to choose the optimal window size adaptively
Discretization: from derivative to difference

Lemma. For any $f\in C^{k-1}[0,1]$ and any monic polynomial $r(z)$ of degree $k$, there corresponds another monic polynomial $\eta(z)$ of degree $k$ such that
$$\|\eta(\Delta)f^n\|_p \lesssim n^{-k+1/p}\Big\|r\Big(\frac{d}{dx}\Big)f\Big\|_p$$
where $f^n = \{f(i/n)\}_{i=1}^n$ and $(\Delta\phi)_t = \phi_{t-1}$ is the backward shift operator.

Applying this to our regression problem:
- The sequence $f^n$ can be written as $f^n = \sum_{i=1}^l f^{n,i}$
- There correspond monic polynomials $\eta_i$ of degree $k$ such that
$$\sum_{i=1}^l \|\eta_i(\Delta)f^{n,i}\|_p \lesssim Ln^{-k+1/p}$$
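A two-line numerical check of the discrete analogue: the $k$-th difference operator annihilates polynomial sequences of degree $k-1$, exactly as $r(d/dx) = (d/dx)^k$ annihilates $\mathcal P_1^{k-1}$ (np.diff computes the forward difference, which differs from the backward-shift form $(1-\Delta)^k$ only by an index shift and sign):

```python
import numpy as np

k = 3
t = np.arange(20, dtype=float)
f_n = 2.0 + t - 0.5 * t**2                      # a degree-2 polynomial sequence
print(np.diff(f_n, n=k))                        # all zeros up to floating point
```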
Sequence estimation: window estimate

Given a sequence of observations $\{y_t\}$, consider using $\{y_t\}_{|t|\le T}$ to estimate $f^n[0]$, where $T$ is the window size.
- The estimator: $\hat f^n[0] = \sum_{t=-T}^T w_t y_t$
- On one hand, the filter $\{w_t\}_{|t|\le T}$ should have a small $L^2$ norm to suppress the noise
- On the other hand, if $\eta(\Delta)f^n \approx 0$ with a known $\eta$, the filter should be designed such that the error term consists only of iid noises

The approach (Goldenshluger-Nemirovski '97): the filter $\{w_t\}_{|t|\le T}$ is the solution to the following optimization problem:
$$\min_w\ \Big\|\mathcal F\Big(\Big\{\sum_{t=-T}^T w_t y_{t+s} - y_s\Big\}_{|s|\le T}\Big)\Big\| \quad \text{s.t.}\quad \|\mathcal F(w)\|_1 \le \frac{C}{\sqrt T}$$
where $\mathcal F(\{\phi_t\}_{|t|\le T})$ denotes the discrete Fourier transform of $\{\phi_t\}_{|t|\le T}$.
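A small numerical sketch of this filter-design problem, with two reading choices hedged explicitly: we take the outer norm to be the sup-norm of the DFT of the residual, and we pick an arbitrary normalization for the constraint level $C/\sqrt T$; the data and all constants are hypothetical, and a general-purpose solver stands in for the convex program:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T = 8
# observations y_s for |s| <= 2T, stored so that y[i] corresponds to s = i - 2T
y = np.cos(0.7 * np.arange(-2 * T, 2 * T + 1)) + 0.1 * rng.standard_normal(4 * T + 1)

def residual(w):
    """r_s = sum_{|t|<=T} w_t y_{t+s} - y_s for |s| <= T (w[j] stores w_{j-T})."""
    return np.array([w @ y[s + T: s + 3 * T + 1] - y[s + 2 * T]
                     for s in range(-T, T + 1)])

obj = lambda w: np.abs(np.fft.fft(residual(w))).max()        # assumed outer norm
con = {"type": "ineq",                                        # ||F(w)||_1 <= C/sqrt(T)
       "fun": lambda w: 1.0 / np.sqrt(T) - np.abs(np.fft.fft(w)).sum() / (2 * T + 1)}
res = minimize(obj, np.zeros(2 * T + 1), constraints=[con], method="SLSQP")
w_hat = res.x                                   # then fhat_n[0] = w_hat @ y[T:3*T+1]
```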
Frequency domain

[Figure 7: The observations (upper panel) and the noises (lower panel)]
Performance analysis

Theorem. If $f^n = \sum_{i=1}^l f^{n,i}$ and there exist monic polynomials $\eta_i$ of degree $k$ such that $\sum_{i=1}^l \|\eta_i(\Delta)f^{n,i}\|_{p,T} \le \epsilon$, we have
$$|f^n[0] - \hat f^n[0]| \lesssim T^{k-1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt T}$$
where $\Theta_T(\xi)$ is the supremum of $O(T^2)$ $|\mathcal N(0,1)|$ random variables and is thus of order $\sqrt{\ln T}$.

This is a uniform result (the $\eta_i$ are unknown), and the $\sqrt{\ln T}$ gap is unavoidable if one insists on this uniformity.

Plugging in $\epsilon \lesssim n^{-k+1/p}\|f\|_{p,B}$ from the discretization step yields
$$|f^n[m] - \hat f^n[m]| \lesssim \Big(\frac Tn\Big)^{k-1/p}\|f\|_{p,B} + \frac{\Theta_T(\xi)}{\sqrt T}$$
where $B$ is the segment of length $\asymp T/n$ centered at $m/n$.
Polynomial multiplication

Begin with the Hölder ball case, where we know that $\|(1-\Delta)^k f^n\| \le \epsilon$.

Lemma. Writing convolution as polynomial multiplication, we have:
- There exists $w^T(z) = \sum_{1\le|t|\le T} w_t z^t$ such that $1 - w^T(z)$ is divisible by $(1-z)^k$:
$$(1 - w^T(\Delta))p = \frac{1-w^T(\Delta)}{(1-\Delta)^k}(1-\Delta)^k p = 0, \qquad \forall p\in\mathcal P_1^{k-1}$$
Moreover, $\|w^T\|_1 \lesssim 1$, $\|w^T\|_2 \lesssim 1/\sqrt T$, and $\|\mathcal F(w^T)\|_1 \lesssim 1/\sqrt T$
- There exists $\{\eta_t\}_{|t|\le T}$ with $\eta^T(z) = \sum_{|t|\le T}\eta_t z^t \triangleq 1 - w^T(z)$ such that $\|\mathcal F(w^T)\|_1 \lesssim 1/\sqrt T$, and for each $i=1,2,\dots,l$ we have $\eta^T(z) = \eta_i(z)\rho_i(z)$ with $\|\rho_i\| \lesssim T^{k-1}$
- If we knew $\eta^T(z)$, we could just use $\hat f^n[0] = [w^T(\Delta)y]_0$ to estimate $f^n[0]$
Optimization solution

Performance of the oracle filter $w^T$:
$$f - \hat f^n = (1 - w^T(\Delta))f + w^T(\Delta)\xi = \eta^T(\Delta)f + w^T(\Delta)\xi = \sum_{i=1}^l \rho_i(\Delta)\big(\eta_i(\Delta)f_i\big) + w^T(\Delta)\xi$$

Fact.
$$\|\mathcal F(f - \hat f^n)\| \lesssim \sqrt T\Big(T^{k-1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt T}\Big), \qquad \|\eta^T(\Delta)f\| \lesssim T^{k-1/p}\epsilon$$

Observation: denoting by $\hat w^T$ the solution of the optimization problem, by definition $\|\mathcal F(f - \hat w^T(\Delta)y)\| \le \|\mathcal F(f - \hat f^n)\|$; thus
$$\|\mathcal F((1 - \hat w^T(\Delta))f)\| \le \|\mathcal F(f - \hat f^n)\| + \|\mathcal F(\hat w^T(\Delta)\xi)\| \lesssim \sqrt T\Big(T^{k-1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt T}\Big)$$
Performance analysis

Stochastic error:
$$|s| = |[\hat w^T(\Delta)\xi]_0| \le \|\mathcal F(\hat w^T)\|_1\,\|\mathcal F(\xi)\|_\infty \lesssim \frac{\Theta_T(\xi)}{\sqrt T}$$

Deterministic error:
$$|b| = |[(1-\hat w^T(\Delta))f]_0| \le |[(1-\hat w^T(\Delta))\eta^T(\Delta)f]_0| + |[(1-\hat w^T(\Delta))w^T(\Delta)f]_0|$$
$$\le \|\eta^T(\Delta)f\|_\infty + \|\mathcal F(\eta^T(\Delta)f)\|_\infty\,\|\mathcal F(\hat w^T)\|_1 + \|\mathcal F((1-\hat w^T(\Delta))f)\|_\infty\,\|\mathcal F(w^T)\|_1 \lesssim T^{k-1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt T}$$
and we are done. Q.E.D.
Adaptive window size

Apply Lepski's trick to select $\hat T[m] = n\hat h[m]$:
$$\hat h[m] = \sup\Big\{h\in(0,1): |\hat f^{n,nh}[m] - \hat f^{n,nh'}[m]| \le C\sqrt{\frac{\ln n}{nh'}},\ \forall h'\in(0,h)\Big\}$$

Theorem. Suppose $S\ge kl$; then
$$R_q(\hat f, \mathcal W^{l,k,p}(L)) \lesssim \Big(\frac{\ln n}{n}\Big)^{\min\{\frac{k}{2k+1},\, \frac{k-1/p+1/q}{2(k-1/p)+1}\}}$$
and $\hat f$ is adaptively optimal in the sense that
$$\inf_{\hat f}\, \sup_{kl\le S,\, p>0,\, L>0,\, B\subseteq[0,1]}\, \frac{R_{q,B}(\hat f, \mathcal W^{l,k,p}(L))}{\inf_{\hat f'} R_{q,B}(\hat f', \mathcal W^{l,k,p}(L))} \asymp (\ln n)^{\frac{S}{2S+1}}.$$
Summary

Key points in nonparametric regression:
- Bias-variance tradeoff, bandwidth selection
- Choice of the approximation basis
- Nonlinear estimators
- Adaptive methods:
  - Lepski's trick to adapt to unknown smoothness and spatial inhomogeneity
  - Estimation of a sequence satisfying an unknown differential inequality
Duality to functional estimation

Shrinkage vs. approximation:
- Shrinkage: increase bias to reduce variance
- Approximation: increase variance to reduce bias

Some more dualities:
- Nonparametric regression: known approximation order & variance; unknown function, bandwidth & bias
- Functional estimation: known function, bandwidth & bias; unknown approximation order & variance
- Similarity: kernel estimation, unbiasedness for polynomials

Related questions:
- Necessity of nonlinear estimators?
- Applications of Lepski's trick?
- Logarithmic phase transition?
- ...
References

Nonparametric regression:
- A. Nemirovski, Topics in Non-Parametric Statistics. Lectures on Probability Theory and Statistics, Saint-Flour 1998. Springer, 2000.
- A. B. Tsybakov, Introduction to Nonparametric Estimation. Springer Science & Business Media, 2008.
- L. Györfi et al., A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

Approximation theory:
- R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer Science & Business Media, 1993.
- G. G. Lorentz, M. von Golitschek, and Y. Makovoz, Constructive Approximation: Advanced Problems. Berlin: Springer, 1996.

Function spaces:
- O. V. Besov, V. P. Il'in, and S. M. Nikol'skii, Integral Representations of Functions and Embedding Theorems. V. H. Winston, 1978.