Nonparametric Estimation: Part I


1 Nonparametric Estimation: Part I
Regression in Function Space
Yanjun Han
Department of Electrical Engineering, Stanford University
yjhan@stanford.edu
November 13, 2015 (Pray for Paris)

2 Outline 1 Nonparametric regression problem 2 Regression in function spaces Regression in Hölder balls Regression in Sobolev balls 3 Adaptive estimation Adapt to unknown smoothness Adapt to unknown differential inequalities 2 / 65

3 The statistical model
Given a family of distributions \{P_f\}_{f \in \mathcal{F}} and an observation Y \sim P_f, we aim to:
1. estimate f;
2. estimate a functional F(f) of f;
3. do hypothesis testing: given a partition \mathcal{F} = \bigcup_{j=1}^N \mathcal{F}_j, decide which \mathcal{F}_j the true function f belongs to.
3 / 65

4 The risk
The risk of using \hat f to estimate \theta(f) is
    R(\hat f, f) = \Psi^{-1}\big( \mathbb{E}_f \Psi(d(\hat f(Y), \theta(f))) \big)
where \Psi is a nondecreasing function on [0, \infty) with \Psi(0) = 0, and d(\cdot,\cdot) is some metric.
The minimax approach:
    R(\hat f, \mathcal{F}) = \sup_{f \in \mathcal{F}} R(\hat f, f),  \qquad  R^*(\mathcal{F}) = \inf_{\hat f} R(\hat f, \mathcal{F})
4 / 65

5 Nonparametric regression
Problem: recover a function f: [0,1]^d \to \mathbb{R} in a given set \mathcal{F} \subset L_2([0,1]^d) via noisy observations
    y_i = f(x_i) + \sigma \xi_i,  \quad i = 1, 2, \ldots, n
Some options:
Grid \{x_i\}_{i=1}^n: deterministic design (equidistant grid, general case), random design
Noise \{\xi_i\}_{i=1}^n: iid N(0,1), general iid case, with dependence
Noise level \sigma: known, unknown
Function space: Hölder ball, Sobolev space, Besov space
Risk function: pointwise risk (confidence interval), integrated risk (L_q risk, 1 \le q \le \infty, with normalization):
    R_q(\hat f, f) = \Big( \mathbb{E}_f \int_{[0,1]^d} |\hat f(x) - f(x)|^q \, dx \Big)^{1/q},  \quad 1 \le q < \infty
    R_\infty(\hat f, f) = \mathbb{E}_f \Big( \operatorname{ess\,sup}_{x \in [0,1]^d} |\hat f(x) - f(x)| \Big),  \quad q = \infty
5 / 65
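As a concrete illustration (not part of the slides), here is a minimal simulation sketch of the equidistant-design model and a Monte Carlo approximation of the integrated L_q distance; the test function, noise level, and grid sizes below are arbitrary choices.

```python
# A minimal sketch of the regression model y_i = f(i/n) + sigma * xi_i and of
# an approximate integrated L_q distance; all concrete choices (f, sigma, n, q)
# are illustrative and not taken from the lecture.
import numpy as np

def simulate(f, n, sigma, rng):
    """Observations y_i = f(i/n) + sigma * xi_i with iid N(0,1) noise on the grid i/n."""
    x = np.arange(1, n + 1) / n
    return x, f(x) + sigma * rng.standard_normal(n)

def lq_distance(g1, g2, q=2.0, grid_size=10_000):
    """Riemann-sum approximation of ( int_0^1 |g1 - g2|^q dx )^(1/q)."""
    t = np.linspace(0.0, 1.0, grid_size)
    return np.mean(np.abs(g1(t) - g2(t)) ** q) ** (1.0 / q)

# example usage
rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
x, y = simulate(f, n=500, sigma=0.5, rng=rng)
```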

6 Equivalence between models
Under mild smoothness conditions, Brown et al. proved the asymptotic equivalence between the following models:
Regression model: for iid N(0,1) noise \{\xi_i\}_{i=1}^n,  y_i = f(i/n) + \sigma \xi_i,  i = 1, 2, \ldots, n
Gaussian white noise model: dY_t = f(t)\,dt + \frac{\sigma}{\sqrt{n}}\,dB_t,  t \in [0,1]
Poisson process: generate N \sim \mathrm{Poi}(n) iid samples from a common density g  (g = f^2, \sigma = 1/2)
Density estimation model: generate n iid samples from a common density g  (g = f^2, \sigma = 1/2)
6 / 65

7 Bias-variance decomposition
Deterministic error (bias) and stochastic error of an estimator \hat f(\cdot):
    b(x) = \mathbb{E}_f \hat f(x) - f(x),  \qquad  s(x) = \hat f(x) - \mathbb{E}_f \hat f(x)
Analysis of the L_q risk:
For 1 \le q < \infty:
    R_q(\hat f, f) = \big( \mathbb{E}_f \|\hat f - f\|_q^q \big)^{1/q} = \big( \mathbb{E}_f \|b + s\|_q^q \big)^{1/q} \le \|b\|_q + \big( \mathbb{E} \|s\|_q^q \big)^{1/q}
For q = \infty:
    R_\infty(\hat f, f) \le \|b\|_\infty + \mathbb{E}\|s\|_\infty
7 / 65
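A small Monte Carlo sketch of this decomposition at a single point: the helper below takes any pointwise estimator (the signature estimator(x_grid, y, x0) is an assumption for illustration) and estimates b(x_0) and the size of s(x_0) by repeated simulation; it is not part of the lecture.

```python
# Monte Carlo sketch of the pointwise decomposition: estimate the bias
# b(x0) = E f_hat(x0) - f(x0) and the standard deviation of the stochastic
# term s(x0) = f_hat(x0) - E f_hat(x0), for any estimator with the assumed
# signature estimator(x_grid, y, x0).
import numpy as np

def bias_and_std(estimator, f, x0, n=500, sigma=1.0, n_rep=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n
    vals = np.empty(n_rep)
    for r in range(n_rep):
        y = f(x) + sigma * rng.standard_normal(n)
        vals[r] = estimator(x, y, x0)
    return vals.mean() - f(x0), vals.std()
```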

8 1 Nonparametric regression problem 2 Regression in function spaces Regression in Hölder balls Regression in Sobolev balls 3 Adaptive estimation Adapt to unknown smoothness Adapt to unknown differential inequalities 8 / 65

9 The first example
Consider the following regression model:
    y_i = f(i/n) + \sigma \xi_i,  \quad i = 1, 2, \ldots, n
where \{\xi_i\}_{i=1}^n are iid N(0,1), and f \in H^s_1(L) for some known s \in (0,1] and L > 0, and the Hölder ball is defined as
    H^s_1(L) = \{ f \in C[0,1] : |f(x) - f(y)| \le L|x - y|^s, \ \forall x, y \in [0,1] \}.
9 / 65

10 A window estimate
To estimate the value of f at x, consider the window B_x = [x - h/2, x + h/2]; then a natural estimator takes the form
    \hat f(x) = \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} y_i,
where n(B_x) denotes the number of points x_i in B_x.
Deterministic error:
    |b(x)| = |\mathbb{E}_f \hat f(x) - f(x)| = \Big| \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} (f(x_i) - f(x)) \Big|
           \le \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} |f(x_i) - f(x)| \le \frac{1}{n(B_x)} \sum_{i: x_i \in B_x} L|x_i - x|^s \le Lh^s
Stochastic error:
    s(x) = \hat f(x) - \mathbb{E}_f \hat f(x) = \frac{\sigma}{n(B_x)} \sum_{i: x_i \in B_x} \xi_i
10 / 65
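A minimal sketch of this window average in code (the fallback for an empty window is an implementation choice, not from the slides); it can be plugged into the Monte Carlo helper above.

```python
# Window (moving-average) estimator: average the observations whose design
# points fall in B_x = [x0 - h/2, x0 + h/2].
import numpy as np

def window_estimator(x_grid, y, x0, h):
    mask = np.abs(x_grid - x0) <= h / 2
    if not mask.any():                      # empty window: fall back to the nearest point
        mask = np.abs(x_grid - x0) == np.abs(x_grid - x0).min()
    return y[mask].mean()

# e.g. bias_and_std(lambda x, y, x0: window_estimator(x, y, x0, h=0.1), f, x0=0.5)
```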

11 Optimal window size: 1 \le q < \infty
Bounding the integrated risk: from
    |b(x)| \le Lh^s,  \qquad  s(x) = \frac{\sigma}{n(B_x)} \sum_{i: x_i \in B_x} \xi_i,
we get
    R_q(\hat f, f) \le \|b\|_q + (\mathbb{E}\|s\|_q^q)^{1/q} \lesssim Lh^s + \frac{\sigma}{\sqrt{nh}}.
Optimal window size and risk (1 \le q < \infty):
    h \asymp n^{-\frac{1}{2s+1}},  \qquad  R_q(\hat f, H^s_1(L)) \asymp n^{-\frac{s}{2s+1}}
11 / 65
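For completeness, the bias-variance balance behind this choice worked out explicitly (constants suppressed; this elaboration is not on the slide):

```latex
% Equate the bias and stochastic bounds and solve for h:
\[
  L h^s \asymp \frac{\sigma}{\sqrt{nh}}
  \;\Longleftrightarrow\;
  h^{s+\frac12} \asymp \frac{\sigma}{L\sqrt{n}}
  \;\Longleftrightarrow\;
  h \asymp \Big(\frac{\sigma}{L\sqrt{n}}\Big)^{\frac{2}{2s+1}} \asymp n^{-\frac{1}{2s+1}},
\]
\[
  \text{so that}\qquad
  R_q(\hat f, H^s_1(L)) \lesssim L h^s
  \asymp L^{\frac{1}{2s+1}}\Big(\frac{\sigma}{\sqrt{n}}\Big)^{\frac{2s}{2s+1}}
  \asymp n^{-\frac{s}{2s+1}}.
\]
```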

12 Optimal window size: q = \infty
    |b(x)| \le Lh^s,  \qquad  s(x) = \frac{\sigma}{n(B_x)} \sum_{i: x_i \in B_x} \xi_i
Fact: for N(0,1) rvs \{\xi_i\}_{i=1}^M (possibly correlated), there exists a constant C such that for any w \ge 1,
    \mathbb{P}\Big( \max_{1 \le i \le M} |\xi_i| \ge Cw\sqrt{\ln M} \Big) \le \exp\Big( -\frac{w^2 \ln M}{2} \Big)
Proof: apply the union bound and \mathbb{P}(|N(0,1)| > x) \le 2\exp(-x^2/2).
Corollary: \mathbb{E}\|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh}}  (with M = O(n^2)).
Optimal window size and risk (q = \infty):
    h \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{1}{2s+1}},  \qquad  R_\infty(\hat f, H^s_1(L)) \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{s}{2s+1}}
12 / 65
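Spelled out, the union-bound step reads as follows (the explicit constant is an illustrative sufficient choice, not from the slide):

```latex
% Union bound over the M variables, then absorb the factor 2M for C large enough:
\[
  \mathbb{P}\Big(\max_{1\le i\le M}|\xi_i| \ge Cw\sqrt{\ln M}\Big)
  \le 2M \exp\Big(-\frac{C^2 w^2 \ln M}{2}\Big)
  \le \exp\Big(-\frac{w^2 \ln M}{2}\Big),
\]
\[
  \text{since } w \ge 1 \text{ and, e.g., } C^2 \ge 5 \text{ gives }
  2M \le M^{(C^2-1)/2} \le \exp\Big(\tfrac{(C^2-1)w^2\ln M}{2}\Big) \quad (M \ge 2).
\]
```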

13 Bias-variance tradeoff Figure 1: Bias-variance tradeoff 13 / 65

14 General Hölder ball
Consider the following regression problem:
    y_\iota = f(x_\iota) + \sigma \xi_\iota,  \quad f \in H^s_d(L),  \quad \iota = (i_1, i_2, \ldots, i_d) \in \{1, 2, \ldots, m\}^d
where x_{(i_1, i_2, \ldots, i_d)} = (i_1/m, i_2/m, \ldots, i_d/m), n = m^d, and \{\xi_\iota\} are iid N(0,1) noises.
The general Hölder ball H^s_d(L) is defined as
    H^s_d(L) = \{ f: [0,1]^d \to \mathbb{R} : \|D^k(f)(x) - D^k(f)(x')\| \le L\|x - x'\|^\alpha, \ \forall x, x' \}
where s = k + \alpha, k \in \mathbb{N}, \alpha \in (0,1], and D^k(f)(\cdot) is the vector function comprised of all partial derivatives of f with full order k.
s > 0: modulus of smoothness
d: dimensionality
14 / 65

15 Local polynomial approximation
As before, consider the cube B_x centered at x with edge size h, where h \to 0 and nh^d \to \infty.
A simple average no longer works!
Local polynomial approximation: for x_\iota \in B_x, design weights w_\iota(x) such that if the true f \in \mathcal{P}^k_d, i.e., a polynomial of full degree k in d variables, we have
    f(x) = \sum_{\iota: x_\iota \in B_x} w_\iota(x) f(x_\iota)
Lemma. There exist weights \{w_\iota(x)\}_{\iota: x_\iota \in B_x} which depend continuously on x and satisfy
    \|w(x)\|_1 \le C_1,  \qquad  \|w(x)\|_2 \le \frac{C_2}{\sqrt{n(B_x)}} = \frac{C_2}{\sqrt{nh^d}}
where C_1, C_2 are two universal constants depending only on k and d.
15 / 65
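A concrete one-dimensional sketch of such a construction: the weights are the first row of the least-squares "hat" operator of a degree-k polynomial fit on the window. This explicit least-squares choice is an illustration consistent with, but not literally taken from, the slide.

```python
# 1-D local polynomial estimator: fit a degree-k polynomial by least squares
# to the observations inside B_x = [x0 - h/2, x0 + h/2]; centering the design
# at x0 makes the fitted intercept equal the estimate at x0.
import numpy as np

def local_poly_estimator(x_grid, y, x0, h, k=1):
    mask = np.abs(x_grid - x0) <= h / 2
    xs, ys = x_grid[mask], y[mask]
    V = np.vander(xs - x0, N=k + 1, increasing=True)   # columns 1, (x-x0), ..., (x-x0)^k
    coef, *_ = np.linalg.lstsq(V, ys, rcond=None)
    return coef[0]
```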

16 The estimator
Based on the weights \{w_\iota(x)\}, construct a linear estimator
    \hat f(x) = \sum_{\iota: x_\iota \in B_x} w_\iota(x) y_\iota
Deterministic error:
    |b(x)| = \Big| \sum_{\iota: x_\iota \in B_x} w_\iota(x) f(x_\iota) - f(x) \Big|
           = \inf_{p \in \mathcal{P}^k_d} \Big| \sum_{\iota: x_\iota \in B_x} w_\iota(x)(f(x_\iota) - p(x_\iota)) - (f(x) - p(x)) \Big|
           \le (1 + \|w(x)\|_1) \inf_{p \in \mathcal{P}^k_d} \|f - p\|_{\infty, B_x}
Stochastic error:
    |s(x)| = \sigma \Big| \sum_{\iota: x_\iota \in B_x} w_\iota(x) \xi_\iota \Big|
           \lesssim \frac{\sigma}{\sqrt{nh^d}} \Big| \sum_{\iota: x_\iota \in B_x} \frac{w_\iota(x)}{\|w(x)\|_2} \xi_\iota \Big|
16 / 65

17 Rate of convergence
Bounding the deterministic error: by Taylor expansion,
    \inf_{p \in \mathcal{P}^k_d} \|f - p\|_{\infty, B_x} \le \max_{z \in B_x} \frac{\|D^k f(\eta_z) - D^k f(x)\| \, \|z - x\|^k}{k!} \lesssim Lh^s.
Hence \|b\|_q \lesssim Lh^s for 1 \le q \le \infty.
Bounding the stochastic error: as before,
    (\mathbb{E}\|s\|_q^q)^{1/q} \lesssim \frac{\sigma}{\sqrt{nh^d}}  (1 \le q < \infty),  \qquad  \mathbb{E}\|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh^d}}
Theorem. The optimal window size h and the corresponding risk are given by
    h \asymp n^{-\frac{1}{2s+d}},  \quad R_q(\hat f, H^s_d(L)) \asymp n^{-\frac{s}{2s+d}}  \quad (1 \le q < \infty);
    h \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{1}{2s+d}},  \quad R_\infty(\hat f, H^s_d(L)) \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{s}{2s+d}}  \quad (q = \infty),
and the resulting estimator is minimax rate-optimal (see later).
17 / 65

18 Why approximation?
The general linear estimator takes the form
    \hat f_{\mathrm{lin}}(x) = \sum_{\iota \in \{1,2,\ldots,m\}^d} w_\iota(x) y_\iota
Observations:
The estimator \hat f_{\mathrm{lin}} is unbiased for f if and only if
    f(x) = \sum_\iota w_\iota(x) f(x_\iota),  \quad \forall x \in [0,1]^d
Plugging in x = x_\iota yields that z = \{f(x_\iota)\} is a solution to
    (w_\iota(x_\kappa) - \delta_{\iota\kappa})_{\iota,\kappa} \, z = 0
Denote by \{z^{(k)}\}_{k=1}^M all linearly independent solutions to the previous equation; then
    f_k(x) = \sum_\iota w_\iota(x) z^{(k)}_\iota,  \quad k = 1, \ldots, M
constitutes an approximation basis for H^s_d(L).
18 / 65

19 Why polynomial?
Definition (Kolmogorov n-width). For a linear normed space (X, \|\cdot\|) and a subset K \subset X, the Kolmogorov n-width of K is defined as
    d_n(K) \equiv d_n(K, X) = \inf_{V_n} \sup_{x \in K} \inf_{y \in V_n} \|x - y\|
where V_n \subset X has dimension n.
Theorem (Kolmogorov n-width for H^s_d(L)).
    d_n(H^s_d(L), C[0,1]^d) \asymp n^{-s/d}.
The piecewise polynomial basis in each cube with edge size h achieves the optimal rate (so other bases do not help):
    d_{\Theta(1) h^{-d}}(H^s_d(L), C[0,1]^d) \asymp (h^{-d})^{-s/d} = h^s
19 / 65

20 Methods in nonparametric regression
Function space (this lecture):
Kernel estimates: \hat f(x) = \sum_\iota K_h(x, x_\iota) y_\iota  (Nadaraya, Watson, ...)
Local polynomial kernel estimates: \hat f(x) = \sum_{k=1}^M \phi_k(x) c_k(x), where (c_1(x), \ldots, c_M(x)) = \arg\min_c \sum_\iota \big( y_\iota - \sum_{k=1}^M c_k(x)\phi_k(x_\iota) \big)^2 K_h(x - x_\iota)  (Stone, ...)
Penalized spline estimates: \hat f = \arg\min_g \sum_\iota (y_\iota - g(x_\iota))^2 + \lambda \|g^{(s)}\|_2^2  (Speckman, Ibragimov, Khas'minskii, ...)
Nonlinear estimates (Lepski, Nemirovski, ...)
Transformed space (next lecture):
Fourier transform: projection estimates (Pinsker, Efromovich, ...)
Wavelet transform: shrinkage estimates (Donoho, Johnstone, ...)
20 / 65

21 1 Nonparametric regression problem 2 Regression in function spaces Regression in Hölder balls Regression in Sobolev balls 3 Adaptive estimation Adapt to unknown smoothness Adapt to unknown differential inequalities 21 / 65

22 Regression in Sobolev space
Consider the following regression problem:
    y_\iota = f(x_\iota) + \sigma \xi_\iota,  \quad f \in S^{k,p}_d(L),  \quad \iota = (i_1, i_2, \ldots, i_d) \in \{1, 2, \ldots, m\}^d
where x_{(i_1, i_2, \ldots, i_d)} = (i_1/m, i_2/m, \ldots, i_d/m), n = m^d, and \{\xi_\iota\} are iid N(0,1) noises.
The Sobolev ball S^{k,p}_d(L) is defined as
    S^{k,p}_d(L) = \{ f: [0,1]^d \to \mathbb{R} : \|D^k(f)\|_p \le L \}
where D^k(f)(\cdot) is the vector function comprised of all partial derivatives (in the sense of distributions) of f with order k.
Parameters:
d: dimensionality
k: order of differentiation
p: p \ge d to ensure the continuous embedding S^{k,p}_d(L) \subset C[0,1]^d
q: norm of the risk
22 / 65

23 Minimax lower bound
Theorem (Minimax lower bound). The minimax risk in the Sobolev ball regression problem over all estimators is
    R^*_q(S^{k,p}_d(L), n) \asymp n^{-\frac{k}{2k+d}}  \quad for \ q < (1 + \tfrac{2k}{d})p;
    R^*_q(S^{k,p}_d(L), n) \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{k - d/p + d/q}{2(k - d/p) + d}}  \quad for \ q \ge (1 + \tfrac{2k}{d})p.
Theorem (Linear minimax lower bound). The minimax risk in the Sobolev ball regression problem over all linear estimators is
    R^{\mathrm{lin}}_q(S^{k,p}_d(L), n) \asymp n^{-\frac{k}{2k+d}}  \quad for \ q \le p;
    \asymp n^{-\frac{k - d/p + d/q}{2(k - d/p + d/q) + d}}  \quad for \ p < q < \infty;
    \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{k - d/p}{2(k - d/p) + d}}  \quad for \ q = \infty.
23 / 65

24 Start from linear estimates
Consider the linear estimator given by local polynomial approximation with window size h as before.
Stochastic error:
    (\mathbb{E}\|s\|_q^q)^{1/q} \lesssim \frac{\sigma}{\sqrt{nh^d}}  (1 \le q < \infty),  \qquad  \mathbb{E}\|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh^d}}
Deterministic error: corresponds to the polynomial approximation error.
Fact. For f \in S^{k,p}_d(L), there exists a constant C > 0 such that
    \|D^{k-1}f(x) - D^{k-1}f(y)\| \le C \|x - y\|^{1 - d/p} \Big( \int_B \|D^k f(z)\|^p \, dz \Big)^{1/p},  \quad x, y \in B
The Taylor polynomial yields:
    |b(x)| \lesssim h^{k - d/p} \Big( \int_{B_x} \|D^k f(z)\|^p \, dz \Big)^{1/p}
24 / 65

25 Linear estimator: integrated bias
Integrated bias:
    \|b\|_q^q \lesssim h^{(k - d/p)q} \int_{[0,1]^d} \Big( \int_{B_x} \|D^k f(z)\|^p \, dz \Big)^{q/p} dx
Note that
    \int_{[0,1]^d} \int_{B_x} \|D^k f(z)\|^p \, dz \, dx = h^d \int_{[0,1]^d} \|D^k f(x)\|^p \, dx \le h^d L^p
Fact. For a_i \ge 0,
    a_1^r + a_2^r + \cdots + a_n^r \le (a_1 + a_2 + \cdots + a_n)^r / n^{r-1}  \quad (0 < r \le 1);
    a_1^r + a_2^r + \cdots + a_n^r \le (a_1 + a_2 + \cdots + a_n)^r  \quad (r > 1).
Case q/p \le 1:  \|b\|_q^q \lesssim h^{(k - d/p)q} L^q h^{dq/p} = L^q h^{kq}  (regular case)
Case q/p > 1:  \|b\|_q^q \lesssim h^{(k - d/p)q} L^q h^d = L^q h^{(k - d/p + d/q)q}  (sparse case)
25 / 65

26 Linear estimator: optimal risk
In summary, we have
    \|b\|_q \lesssim Lh^k  (q \le p),  \qquad  \|b\|_q \lesssim Lh^{k - d/p + d/q}  (p < q);
    \|s\|_q \lesssim \frac{\sigma}{\sqrt{nh^d}}  (q < \infty),  \qquad  \|s\|_\infty \lesssim \sigma\sqrt{\frac{\ln n}{nh^d}}  (q = \infty).
Theorem (Optimal linear risk).
    R_q(\hat f_{\mathrm{lin}}, S^{k,p}_d(L)) \asymp n^{-\frac{k}{2k+d}}  \quad for \ q \le p;
    \asymp n^{-\frac{k - d/p + d/q}{2(k - d/p + d/q) + d}}  \quad for \ p < q < \infty;
    \asymp \Big(\frac{\ln n}{n}\Big)^{\frac{k - d/p}{2(k - d/p) + d}}  \quad for \ q = \infty.
Alternative proof: Theorem (Sobolev embedding). For d \le p < q, we have S^{k,p}_d(L) \subset S^{k - d/p + d/q, q}_d(L').
26 / 65

27 Minimax lower bound: tool
Theorem (Fano's inequality). Suppose H_1, \ldots, H_N are probability distributions (hypotheses) on a sample space (\Omega, \mathcal{F}), and there exists a decision rule D: \Omega \to \{1, 2, \ldots, N\} such that H_i(\omega: D(\omega) \neq i) \le \delta_i, 1 \le i \le N. Then
    \max_{1 \le i, j \le N} D_{\mathrm{KL}}(H_i \| H_j) \ge \Big( 1 - \frac{1}{N}\sum_{i=1}^N \delta_i \Big) \ln(N-1) - \ln 2
Apply to nonparametric regression:
Suppose the convergence rate r_n is attainable; construct N functions (hypotheses) f_1, \ldots, f_N \in \mathcal{F} such that \|f_i - f_j\|_q > 4r_n for any i \neq j.
Decision rule: after obtaining \hat f, choose j such that \|\hat f - f_j\|_q \le 2r_n.
As a result, \delta_i \le 1/2, and Fano's inequality gives
    \frac{1}{2\sigma^2} \max_{1 \le i, j \le N} \sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \ge \frac{1}{2}\ln(N-1) - \ln 2
27 / 65

28 Minimax lower bound: sparse case
Suppose q \ge (1 + \tfrac{2k}{d})p. Fix a smooth function g supported on [0,1]^d.
Divide [0,1]^d into h^{-d} disjoint cubes of size h, and construct N = h^{-d} hypotheses: f_j supported on the j-th cube, and equal to h^s g(x/h) on that cube (with translation).
To ensure f \in S^{k,p}_d(L), set s = k - d/p.
For i \neq j,
    r_n \lesssim \|f_i - f_j\|_q \asymp h^{k - d/p + d/q},  \qquad  \sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \lesssim h^{2(k - d/p)} \cdot nh^d = nh^{2(k - d/p) + d}
Fano's inequality gives
    nh^{2(k - d/p) + d} \gtrsim \ln N \asymp \ln h^{-1}  \quad \Longrightarrow \quad  r_n \lesssim \Big(\frac{\ln n}{n}\Big)^{\frac{k - d/p + d/q}{2(k - d/p) + d}}
28 / 65

29 Minimax lower bound: regular case
Suppose q < (1 + \tfrac{2k}{d})p. Fix a smooth function g supported on [0,1]^d, and set g_h(x) = h^s g(x/h).
Divide [0,1]^d into h^{-d} disjoint cubes of size h, and construct N = 2^{h^{-d}} hypotheses: f_j = \sum_{i=1}^{h^{-d}} \epsilon_i g_{h,i}, where g_{h,i} is the translation of g_h to the i-th cube.
One can choose M = 2^{\Theta(h^{-d})} hypotheses such that for i \neq j, f_i differs from f_j on at least \Theta(h^{-d}) cubes.
To ensure f \in S^{k,p}_d(L), set s = k.
For i \neq j,
    r_n \lesssim \|f_i - f_j\|_q \asymp h^k,  \qquad  \sum_\iota |f_i(x_\iota) - f_j(x_\iota)|^2 \lesssim h^{2k} \cdot n = nh^{2k}
Fano's inequality gives
    nh^{2k} \gtrsim \ln M \asymp h^{-d}  \quad \Longrightarrow \quad  r_n \lesssim \Big(\frac{1}{n}\Big)^{\frac{k}{2k+d}}
29 / 65

30 Construct minimax estimator
Why does the linear estimator fail in the sparse case? Too much variance in the flat region!
Suppose we knew that the true function f is supported on a cube of size h; then
    \|b\|_q \lesssim h^{k - d/p + d/q},  \qquad  (\mathbb{E}\|s\|_q^q)^{1/q} \lesssim \frac{h^{d/q}}{\sqrt{nh^d}},
so h \asymp n^{-\frac{1}{2(k - d/p) + d}}, and Nemirovski's construction yields the optimal risk \Theta\big(n^{-\frac{k - d/p + d/q}{2(k - d/p) + d}}\big).
In both cases, construct \hat f as the solution to the following optimization problem:
    \hat f = \arg\min_{g \in S^{k,p}_d(L)} \|g - y\|_{\mathcal{B}} = \arg\min_{g \in S^{k,p}_d(L)} \max_{B \in \mathcal{B}} \frac{1}{\sqrt{n(B)}} \Big| \sum_{\iota: x_\iota \in B} (g(x_\iota) - y_\iota) \Big|
where \mathcal{B} is the set of all cubes in [0,1]^d with nodes belonging to \{x_\iota\}.
30 / 65

31 Estimator analysis
Question 1: given a linear estimator \hat f at x with window size h, which size h' of B \in \mathcal{B} achieves the maximum of |\sum_{\iota: x_\iota \in B}(\hat f(x_\iota) - y_\iota)| / \sqrt{n(B)}?
If h' \lesssim h, the value \approx \max\{\sqrt{n(B)}\, b_h(x), |N(0,1)|\} increases with h'
If h' \gtrsim h, the value is close to zero
Hence, h' \asymp h achieves the maximum.
Question 2: which window size h at x achieves \min_h \|\hat f_h - y\|_{\mathcal{B}}?
Answer: the h achieving the bias-variance balance locally at x.
The 1/\sqrt{n(B)} term helps to achieve spatial homogeneity.
Theorem (Optimality of the estimator).
    R_q(\hat f, S^{k,p}_d(L)) \lesssim \Big(\frac{\ln n}{n}\Big)^{\min\{\frac{k}{2k+d}, \frac{k - d/p + d/q}{2(k - d/p) + d}\}}
Hence \hat f is minimax rate-optimal (with a logarithmic gap in the regular case).
31 / 65

32 Proof of the optimality
First observation:
    \|\hat f - f\|_{\mathcal{B}} \le \|\hat f - y\|_{\mathcal{B}} + \|f - y\|_{\mathcal{B}} \le 2\|\xi\|_{\mathcal{B}} \lesssim \sqrt{\ln n},
and e \triangleq \hat f - f \in S^{k,p}_d(2L).
Definition (Regular cube). A cube B \subset [0,1]^d is called regular if
    e(B) \le C [h(B)]^{k - d/p} \, \Omega(e, B)
where C > 0 is a suitably chosen constant, e(B) = \max_{x \in B} |e(x)|, h(B) is the edge size of B, and
    \Omega(e, B) = \Big( \int_B \|D^k e(x)\|^p \, dx \Big)^{1/p}.
One can show that [0,1]^d can be (roughly) partitioned into maximal regular cubes with \le replaced by = (i.e., balanced bias and variance).
32 / 65

33 Property of regular cubes
Lemma. If the cube B is regular and
    \sup_{B' \in \mathcal{B},\, B' \subset B} \frac{1}{\sqrt{n(B')}} \Big| \sum_{\iota: x_\iota \in B'} e(x_\iota) \Big| \lesssim \sqrt{\ln n},
then
    e(B) \lesssim \sqrt{\frac{\ln n}{n(B)}}.
Proof:
Since B is regular, there exists a polynomial e_k(x) on B, of d variables and degree no more than k, such that \|e - e_k\|_{\infty, B} \le e(B)/4.
On one hand, \|e_k\|_{\infty, B} \le 5e(B)/4, and Markov's inequality for polynomials implies that \|De_k\| \lesssim e(B)/h(B).
On the other hand, |e_k(x_0)| \ge 3e(B)/4 for some x_0 \in B; the derivative bound implies that there exists B' \subset B with h(B') \asymp h(B) such that |e_k(x)| \ge e(B)/2 on B'.
Choosing this B' in the assumption completes the proof.
33 / 65

34 Upper bound for the L_q risk
Since [0,1]^d can be (roughly) partitioned into regular cubes \{B_i\}_{i \ge 1} such that e(B_i) \asymp [h(B_i)]^{k - d/p} \Omega(e, B_i), we have
    \|e\|_q^q = \sum_{i} \int_{B_i} |e(x)|^q dx \lesssim \sum_{i} [h(B_i)]^d \, e(B_i)^q
The previous lemma asserts that e(B_i) \lesssim \sqrt{\frac{\ln n}{n [h(B_i)]^d}}; cancelling out h(B_i) and e(B_i) gives
    \|e\|_q^q \lesssim \Big(\frac{\ln n}{n}\Big)^{\frac{q(k - d/p + d/q)}{2(k - d/p) + d}} \sum_{i} \Omega(e, B_i)^{\frac{d(q-2)}{2(k - d/p) + d}}
            \lesssim \Big(\frac{\ln n}{n}\Big)^{\frac{q(k - d/p + d/q)}{2(k - d/p) + d}} \Big( \sum_{i} \Omega(e, B_i)^p \Big)^{\frac{d(q-2)}{p(2(k - d/p) + d)}}
if q \ge q^* = (1 + \tfrac{2k}{d})p. For 1 \le q < q^* we use \|e\|_q \le \|e\|_{q^*}. Q.E.D.
34 / 65

35 1 Nonparametric regression problem 2 Regression in function spaces Regression in Hölder balls Regression in Sobolev balls 3 Adaptive estimation Adapt to unknown smoothness Adapt to unknown differential inequalities 35 / 65

36 Adaptive estimation
Drawbacks of the previous estimator:
Computationally intractable
Requires perfect knowledge of the parameters
Adaptive estimation: given a family of function classes \{\mathcal{F}_n\}, find an estimator which (nearly) attains the optimal risk simultaneously, without knowing which \mathcal{F}_n the true function belongs to.
In the sequel we will consider \{\mathcal{F}_n\} = \{ S^{k,p}_d(L) : k \le S, p > d, L > 0 \}.
36 / 65

37 Data-driven window size
Consider again a linear estimate with window size h locally at x; recall that
    |\hat f(x) - f(x)| \lesssim \inf_{p \in \mathcal{P}^k_d} \|f - p\|_{\infty, B_h(x)} + \frac{\sigma |N(0,1)|}{\sqrt{nh^d}}
The optimal window size h(x) should balance these two terms.
The stochastic term can be upper bounded (with overwhelming probability) by s_n(h) = w\sigma\sqrt{\frac{\ln n}{nh^d}}, which depends only on known parameters and h, where the constant w > 0 is large enough.
But the bias term depends on the unknown local property of f on B_h(x)!
37 / 65

38 Bias-variance tradeoff: revisit Figure 2: Bias-variance tradeoff 38 / 65

39 Lepski s trick Lepski s adaptive scheme Construct a family of local polynomial approximation estimators {ˆf h } with all window size h I. Then use ˆfĥ(x) (x) as the estimate of f (x), where ĥ(x) sup{h I : ˆf h (x) ˆf h (x) 4s n (h ), h (0, h), h I }. Denote by h (x) the optimal window size b n (h ) = s n (h ) Existence of ĥ(x): clearly h (x) satisfies the condition, for h (0, h ) ˆf h (x) ˆf h (x) ˆf h (x) f + f ˆf h (x) Performance of ĥ(x): b n (h ) + s n (h ) + b n (h ) + s n (h ) 4s n (h ), ˆfĥ(x) (x) f ˆf h (x)(x) f + ˆfĥ(x) (x) ˆf h (x)(x) b n (h ) + s n (h ) + 4s n (h ) = 6s n (h ) 39 / 65

40 Lepski s estimator is adaptive! Properties of Lepski s scheme: Adaptive to parameters: agnostic to p, k, L Spatially adaptive: still work when there is spatial inhomogeneity / only estimate a portion of f Theorem (Adaptive optimality) Suppose that the modulus of smoothness k is upper bounded by a known hyper-parameter S: R q (ˆf, S k,p d (L)) ( ln n and ˆf is adaptively optimal in the sense that n ) min{ k 2k+d, k d/p+d/q 2(k d/p)+d } inf ˆf sup k S,p>d,L>0,B [0,1] d R q,b(ˆf, S k,p d (L)) inf ˆf R q,b (ˆf, S k,p d (L)) (ln n) S 2S+d. 40 / 65

41 Covering of the ideal window
By the property of the data-driven window size \hat h(x), it suffices to consider the ideal window size h^*(x) satisfying
    s_n(h^*(x)) \asymp [h^*(x)]^{k - d/p} \, \Omega(f, B_{h^*(x)}(x))
Lemma. There exist \{x_i\}_{i=1}^M and a partition \{V_i\}_{i=1}^M of [0,1]^d such that
1. the cubes \{B_{h^*(x_i)}(x_i)\}_{i=1}^M are pairwise disjoint;
2. for every x \in V_i, we have B_{h^*(x)}(x) \supseteq B_{h^*(x_i)}(x_i) and
    h^*(x) \ge \tfrac{1}{2} \max\{ h^*(x_i), \|x - x_i\| \}.
41 / 65

42 Analysis of the estimator
Using the covering of the local windows, we have
    \|\hat f - f\|_q^q \lesssim \int_{[0,1]^d} s_n(h^*(x))^q \, dx = \sum_{i=1}^M \int_{V_i} s_n(h^*(x))^q \, dx
Recall s_n(h) \asymp \sqrt{\frac{\ln n}{nh^d}} and h^*(x) \ge \max\{h^*(x_i), \|x - x_i\|\}/2 for x \in V_i, so
    \|\hat f - f\|_q^q \lesssim \Big(\frac{\ln n}{n}\Big)^{q/2} \sum_{i=1}^M \int_{V_i} [\max\{h^*(x_i), \|x - x_i\|\}]^{-dq/2} \, dx
                     \lesssim \Big(\frac{\ln n}{n}\Big)^{q/2} \sum_{i=1}^M \int_0^\infty r^{d-1} [\max\{h^*(x_i), r\}]^{-dq/2} \, dr
                     \lesssim \Big(\frac{\ln n}{n}\Big)^{q/2} \sum_{i=1}^M [h^*(x_i)]^{d - dq/2}
Plugging in s_n(h^*(x_i)) \asymp [h^*(x_i)]^{k - d/p} \Omega(f, B_{h^*(x_i)}(x_i)) and using the disjointness of \{B_{h^*(x_i)}(x_i)\}_{i=1}^M completes the proof. Q.E.D.
42 / 65

43 Experiment: Blocks Figure 3: Blocks: original, noisy and reconstructed signal 43 / 65

44 Experiment: Bumps Figure 4: Bumps: original, noisy and reconstructed signal 44 / 65

45 Experiment: Heavysine Figure 5: Heavysine: original, noisy and reconstructed signal 45 / 65

46 Experiment: Doppler Figure 6: Doppler: original, noisy and reconstructed signal 46 / 65

47 Generalization: anisotropic case
What if the modulus of smoothness differs in different directions? Now we should replace the cube with a rectangle (h_1, \ldots, h_d)!
Deterministic and stochastic terms:
    b_h(x) \le C_1 \sum_{i=1}^d h_i^{s_i} \triangleq b(h),  \qquad  s_h(x) \le C_2 \sqrt{\frac{\ln n}{n V(h)}} \triangleq s(h)
where V(h) = \prod_{i=1}^d h_i. Suppose b(h^*) = s(h^*).
Naive guess:
    \hat h(x) = \arg\max_h \{ V(h) : |\hat f_h(x) - \hat f_{h'}(x)| \le 4 s(h'), \ \forall h', V(h') \le V(h) \}
doesn't work: h^* may not satisfy this condition.
It still doesn't work if we replace V(h') \le V(h) by h' \le h: \hat f_{\hat h} may not be close to \hat f_{h^*}.
47 / 65

48 Adaptive regime
Adaptive regime (Kerkyacharian-Lepski-Picard '01):
    \hat h(x) = \arg\max_h \{ V(h) : |\hat f_{h \vee h'}(x) - \hat f_{h'}(x)| \le 4 s(h'), \ \forall h', V(h') \le V(h) \}
where h \vee h' = \max\{h, h'\} (coordinate-wise).
h^* satisfies the condition: for V(h') \le V(h^*),
    |\hat f_{h^* \vee h'}(x) - \hat f_{h'}(x)| \le |b_{h^* \vee h'}(x) - b_{h'}(x)| + s(h^* \vee h') + s(h')
        \lesssim \sum_{i: h'_i < h^*_i} \big( (h'_i)^{s_i} + (h^*_i)^{s_i} \big) + 2 s(h') \le 4 s(h')
\hat f_{\hat h} is close to \hat f_{h^*}: since V(h^*) \le V(\hat h),
    |\hat f_{\hat h} - f| \le |\hat f_{\hat h} - \hat f_{\hat h \vee h^*}| + |\hat f_{\hat h \vee h^*} - \hat f_{h^*}| + |\hat f_{h^*} - f|
        \lesssim \sum_{i: h^*_i > \hat h_i} \big( \hat h_i^{s_i} + (h^*_i)^{s_i} \big) + 4 s(h^*) + 2 s(h^*) \lesssim 8 s(h^*)
48 / 65

49 More on Lepski s trick General adaptive rule: consider a family of models (X n, F (s) ) s I and rate-optimal minimax estimators ˆf n (s) for each s with minimax risk r s (n). For simplicity suppose r s (n) decreases with s. Definition (Adaptive estimator) Choose increasing sequence {s i } k(n) i=1 from I with k(n), select ˆk = sup{1 k k(n) : d(ˆf sk, ˆf si ) Cr i (n), 1 i < k} for some prescribed constant C, and choose ˆf = ˆf sˆk. Theorem (Lepski 90) If r s (n) n r(s), then under mild conditions, R n (ˆf, F (s) ) n r(s) (ln n) r(s) 2, s I. As a result, the sub-optimality gap is at most (ln n) sup s I r(s)/2. 49 / 65

50 More on adaptive schemes
Instead of exploiting the monotonicity of bias and variance, we can also estimate the risk directly. Given \{\hat f_h\}_{h \in I}, suppose \hat r(h) is a good estimator of the risk of \hat f_h; then we can write
    \hat h = \arg\min_h \hat r(h)
In Gaussian models, \hat r(h) can be Stein's Unbiased Risk Estimator (SURE): zero bias and small variance.
Lepski's method: define \{\hat f_{h,h'}\}_{(h,h') \in I \times I} from \{\hat f_h\}_{h \in I}, and
    \hat r(h) = \sup_{h': s(h') \le s(h)} \Big( \|\hat f_{h,h'} - \hat f_{h'}\| - Q(s(h')) \Big)_+ + 2Q(s(h))
where Q is some function related to the variance.
\hat f_{h,h'} is usually the convolution of estimators, and \hat f_{h,h'} = \hat f_h or \hat f_{h \vee h'} in the previous example.
50 / 65
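A minimal sketch of the risk-estimation route via SURE for a linear smoother \hat y = S_h y with known sigma; the moving-average smoother and the bandwidth grid below are illustrative choices, and the formula \|y - S_h y\|^2 + 2\sigma^2 \mathrm{tr}(S_h) - n\sigma^2 is the standard unbiased risk estimate for Gaussian noise at the design points.

```python
# Bandwidth selection by minimizing Stein's Unbiased Risk Estimate for a
# linear smoother y_hat = S_h y, assuming Gaussian noise with known sigma.
import numpy as np

def smoother_matrix(x_grid, h):
    """Row i averages the observations with |x_j - x_i| <= h/2 (window smoother)."""
    D = np.abs(x_grid[:, None] - x_grid[None, :]) <= h / 2
    return D / D.sum(axis=1, keepdims=True)

def sure(y, S, sigma):
    """Unbiased estimate of the in-sample risk E ||S y - f||^2."""
    resid = y - S @ y
    return resid @ resid + 2 * sigma**2 * np.trace(S) - len(y) * sigma**2

def sure_bandwidth(x_grid, y, sigma, bandwidths):
    risks = [sure(y, smoother_matrix(x_grid, h), sigma) for h in bandwidths]
    return bandwidths[int(np.argmin(risks))]
```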

51 1 Nonparametric regression problem 2 Regression in function spaces Regression in Hölder balls Regression in Sobolev balls 3 Adaptive estimation Adapt to unknown smoothness Adapt to unknown differential inequalities 51 / 65

52 Further generalization
Another point of view on the Sobolev ball S^{k,p}_1(L): there exists a monic polynomial r(x) = x^k of degree k such that
    \| r(\tfrac{d}{dx}) f \|_p \le L,  \quad \forall f \in S^{k,p}_1(L)
Generalization: what if r(\cdot) is an unknown monic polynomial of degree no more than k?
For a modulated signal, e.g., f(x) = g(x)\sin(\omega x + \phi) with g \in S^{k,p}_d(L):
f does not belong to any Sobolev ball for large \omega.
However, we can write f(x) = f_+(x) + f_-(x), where
    f_\pm(x) = \pm \frac{g(x)\exp(\pm i(\omega x + \phi))}{2i},
and by induction we have
    \Big\| \Big( \frac{d}{dx} \mp i\omega \Big)^k f_\pm \Big\|_p = \frac{\|g^{(k)}\|_p}{2} \le \frac{L}{2}
52 / 65

53 General regression model
Definition (Signal satisfying differential inequalities). A function f: [0,1] \to \mathbb{R} belongs to the class W^{l,k,p}(L) if and only if f = \sum_{i=1}^l f_i and there exist monic polynomials r_1, \ldots, r_l of degree k such that
    \sum_{i=1}^l \Big\| r_i\Big(\frac{d}{dx}\Big) f_i \Big\|_p \le L.
General regression problem: recover f \in W^{l,k,p}(L) from noisy observations
    y_i = f(i/n) + \sigma \xi_i,  \quad i = 1, 2, \ldots, n,  \quad \xi_i \ \text{iid} \ N(0,1)
where we only know:
the noise level \sigma;
an upper bound S \ge kl.
53 / 65

54 Recovering approach
The risk function:
Note that now we cannot recover the whole function f (e.g., f \equiv 0 and f(x) = \sin(2\pi n x) correspond to the same model).
The discrete q-norm: \|\hat f - f\|_q = \big( \frac{1}{n} \sum_{i=1}^n |\hat f(i/n) - f(i/n)|^q \big)^{1/q}
Recovering approach:
Discretization: transform the differential inequalities into inequalities of finite differences
Sequence estimation: given a window size, recover the discrete sequence satisfying an unknown difference inequality
Adaptive window size: apply Lepski's trick to choose the optimal window size adaptively
54 / 65

55 Discretization: from derivative to difference
Lemma. For any f \in C^{k-1}[0,1] and any monic polynomial r(z) of degree k, there corresponds another monic polynomial \eta(z) of degree k such that
    \| \eta(\Delta) f^n \|_p \lesssim n^{-k + 1/p} \Big\| r\Big(\frac{d}{dx}\Big) f \Big\|_p
where f^n = \{f(i/n)\}_{i=1}^n, and (\Delta \phi)_t = \phi_{t-1} is the backward shift operator.
Applying to our regression problem:
The sequence f^n can be written as f^n = \sum_{i=1}^l f^{n,i}.
There correspond monic polynomials \eta_i of degree k such that
    \sum_{i=1}^l \| \eta_i(\Delta) f^{n,i} \|_p \lesssim L n^{-k + 1/p}
55 / 65

56 Sequence estimation: window estimate
Given a sequence of observations \{y_t\}, consider using \{y_t\}_{|t| \le T} to estimate f^n[0], where T is the window size.
The estimator: \hat f^n[0] = \sum_{t=-T}^T w_t y_t.
On one hand, the filter \{w_t\}_{|t| \le T} should have a small L_2 norm to suppress the noise.
On the other hand, if \eta(\Delta) f^n \approx 0 with a known \eta, the filter should be designed such that the error term only consists of iid noises.
The approach (Goldenshluger-Nemirovski '97): the filter \{w_t\}_{|t| \le T} is the solution to the following optimization problem:
    \min_w \ \Big\| F\Big( \Big\{ \sum_{t=-T}^T w_t y_{t+s} - y_s \Big\}_{|s| \le T} \Big) \Big\|_\infty  \quad \text{s.t.} \quad \|F(w^T)\|_1 \le \frac{C}{\sqrt{T}}
where F(\{\phi_t\}_{|t| \le T}) denotes the discrete Fourier transform of \{\phi_t\}_{|t| \le T}.
56 / 65

57 Frequency domain Figure 7: The observations (upper panel) and the noises (lower panel) 57 / 65

58 Performance analysis
Theorem. If f^n = \sum_{i=1}^l f^{n,i} and there exist monic polynomials \eta_i of degree k such that \sum_{i=1}^l \|\eta_i(\Delta) f^{n,i}\|_{p,T} \le \epsilon, then
    |f^n[0] - \hat f^n[0]| \lesssim T^{k - 1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}}
where \Theta_T(\xi) is the supremum of O(T^2) |N(0,1)| random variables and is thus of order \sqrt{\ln T}.
This is a uniform result (the \eta_i are unknown), and the logarithmic gap is unavoidable for achieving this uniformity.
Plugging in \epsilon \asymp n^{-k + 1/p}\|f\|_{p,B} from the discretization step yields
    |f^n[m] - \hat f^n[m]| \lesssim \Big(\frac{T}{n}\Big)^{k - 1/p}\|f\|_{p,B} + \frac{\Theta_T(\xi)}{\sqrt{T}}
where B is a segment with center m/n and length \asymp T/n.
58 / 65

59 Polynomial multiplication
Begin with the Hölder ball case, where we know that \|(1 - \Delta)^k f^n\| \le \epsilon.
Lemma. Writing convolution as polynomial multiplication, we have:
1 - w^T(z) = 1 - \sum_{|t| \le T} w_t z^t can be divided by (1 - z)^k:
    (1 - w^T(\Delta)) p = \frac{1 - w^T(\Delta)}{(1 - \Delta)^k} (1 - \Delta)^k p = 0,  \quad \forall p \in \mathcal{P}^{k-1}_1
Moreover, \|w^T\|_1 \lesssim 1, \|w^T\|_2 \lesssim 1/\sqrt{T}, and \|F(w^T)\|_1 \lesssim 1/\sqrt{T}.
There exists \{\eta_t\}_{|t| \le T} with \eta^T(z) = \sum_{|t| \le T} \eta_t z^t \triangleq 1 - w^T(z) such that \|F(w^T)\|_1 \lesssim 1/\sqrt{T}, and for each i = 1, 2, \ldots, l we have \eta^T(z) = \eta_i(z)\rho_i(z) with \|\rho_i\| \lesssim T^{k-1}.
If we knew \eta^T(z), we could just use \hat f^n[0] = [w^T(\Delta) y]_0 to estimate f^n[0].
59 / 65

60 Optimization solution
Performance of \hat f^n:
    f - \hat f^n = (1 - w^T(\Delta)) f + w^T(\Delta)\xi = \eta^T(\Delta) f + w^T(\Delta)\xi
               = \sum_{i=1}^l \eta^T(\Delta) f_i + w^T(\Delta)\xi = \sum_{i=1}^l \rho_i(\Delta)\big(\eta_i(\Delta) f_i\big) + w^T(\Delta)\xi
Fact:
    \|F(f - \hat f^n)\|_\infty \lesssim \sqrt{T}\Big( T^{k - 1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}} \Big),  \qquad  \|\eta^T(\Delta) f\| \lesssim T^{k - 1/p}\epsilon
Observation: since the computed filter minimizes the objective by definition, its objective value is no larger than that of the ideal filter; thus
    \|F((1 - w^T(\Delta)) f)\|_\infty \le \|F(f - \hat f^n)\|_\infty + \|F(w^T(\Delta)\xi)\|_\infty \lesssim \sqrt{T}\Big( T^{k - 1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}} \Big)
60 / 65

61 Performance analysis
Stochastic error:
    |s| = |[w^T(\Delta)\xi]_0| \le \|F(w^T)\|_1 \|F(\xi)\|_\infty \lesssim \frac{\Theta_T(\xi)}{\sqrt{T}}
Deterministic error:
    |b| = |[(1 - w^T(\Delta)) f]_0| \le |[(1 - w^T(\Delta))\eta^T(\Delta) f]_0| + |[(1 - w^T(\Delta)) w^T(\Delta) f]_0|
        \le \|\eta^T(\Delta) f\| + \|F(\eta^T(\Delta) f)\| \, \|F(w^T)\|_1 + \|F((1 - w^T(\Delta)) f)\| \, \|F(w^T)\|_1
        \lesssim T^{k - 1/p}\epsilon + \frac{\Theta_T(\xi)}{\sqrt{T}},
and we are done. Q.E.D.
61 / 65

62 Adaptive window size
Apply Lepski's trick to select \hat T[m] = n\hat h[m]:
    \hat h[m] = \sup\Big\{ h \in (0,1) : |\hat f^{n,nh}[m] - \hat f^{n,nh'}[m]| \le C\sqrt{\frac{\ln n}{nh'}}, \ \forall h' \in (0,h) \Big\}
Theorem. Suppose S \ge kl; then
    R_q(\hat f, W^{l,k,p}(L)) \lesssim \Big(\frac{\ln n}{n}\Big)^{\min\{\frac{k}{2k+1}, \frac{k - 1/p + 1/q}{2(k - 1/p) + 1}\}}
and \hat f is adaptively optimal in the sense that
    \sup_{kl \le S,\, p > 0,\, L > 0,\, B \subset [0,1]} \frac{R_{q,B}(\hat f, W^{l,k,p}(L))}{\inf_{\hat f'} R_{q,B}(\hat f', W^{l,k,p}(L))} \lesssim (\ln n)^{\frac{S}{2S+1}}.
62 / 65

63 Summary
Key points in nonparametric regression:
Bias-variance tradeoff, bandwidth selection
Choice of approximation basis
Nonlinear estimators
Adaptive methods:
Lepski's trick to adapt to unknown smoothness and spatial inhomogeneity
Estimation of a sequence satisfying an unknown difference inequality
63 / 65

64 Duality to functional estimation Shrinkage vs. Approximation: Shrinkage: increase bias to reduce variance Approximation: increase variance to reduce bias Some more dualities: Nonparametric regression: known approximation order & variance, unknown function, bandwidth & bias Functional estimation: known function, bandwidth & bias, unknown approximation order & variance Similarity: kernel estimation, unbiasedness of polynomials Related questions: Necessity of nonlinear estimators? Applications of Lepski s trick? Logarithmic phase transition? / 65

65 References
Nonparametric regression:
A. Nemirovski, Topics in Non-Parametric Statistics. Lectures on Probability Theory and Statistics, Saint-Flour 1998.
A. B. Tsybakov, Introduction to Nonparametric Estimation. Springer Science & Business Media.
L. Györfi, et al., A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
Approximation theory:
R. A. DeVore and G. G. Lorentz, Constructive Approximation. Springer Science & Business Media.
G. G. Lorentz, M. von Golitschek, and Y. Makovoz, Constructive Approximation: Advanced Problems. Berlin: Springer.
Function space:
O. V. Besov, V. P. Il'in, and S. M. Nikol'skii, Integral Representations of Functions and Embedding Theorems. V. H. Winston.
65 / 65
