Nonparametric estimation using wavelet methods. Dominique Picard. Laboratoire Probabilités et Modèles Aléatoires Université Paris VII
2. Nonparametric estimation
3. Examples of nonparametrics

Estimating a probability density. We observe $X_1,\dots,X_n$ i.i.d. with distribution $P$ having a density $f$ with respect to Lebesgue measure. Our aim is to estimate $f$.

Regression framework. We observe $(X_1,Y_1),\dots,(X_n,Y_n)$ i.i.d., with $X_i$ uniform on $[0,1]$ and
$$Y_i = f(X_i) + \epsilon_i, \qquad \epsilon_i \sim N(0,1).$$
Our aim is to estimate $f$.
4. Examples of nonparametrics

White noise model:
$$dY^\epsilon_t = f(t)\,dt + \epsilon\,dW_t, \qquad t \in [0,1], \quad \epsilon = 1/\sqrt{n}.$$
We observe: for every $\varphi \in L^2([0,1])$,
$$Y_\varphi = \int_0^1 \varphi(t) f(t)\,dt + \epsilon\,\xi_\varphi,$$
where $(\xi_\varphi,\xi_\eta)$ is centered Gaussian with covariance matrix
$$\begin{pmatrix} \|\varphi\|^2 & \langle \varphi,\eta\rangle \\ \langle \varphi,\eta\rangle & \|\eta\|^2 \end{pmatrix}.$$
Our aim is to estimate $f$.
5. Examples of nonparametrics (more involved)

SDE: $dX_t = b(t)\,dt + f(t)\,dW_t$. We observe $X_{i/n}$, $i = 1,\dots,n$. Our aim is to estimate $f$.
6. Examples of nonparametrics (more involved)

Inverse models:
$$dY^\epsilon_t = Af(t)\,dt + \epsilon\,dW_t, \qquad t \in [0,1], \quad \epsilon = 1/\sqrt{n}.$$
We observe: for every $\varphi \in L^2([0,1])$,
$$Y_\varphi = \int_0^1 \varphi(t)\,Af(t)\,dt + \epsilon\,\xi_\varphi,$$
where $A$ is a known linear operator, for instance a convolution $Af(s) = \int g(s-t)f(t)\,dt$, or a Radon transform... Our aim is to estimate $f$.
7. Why is it difficult?

Estimating a probability density: we observe $X_1,\dots,X_n$ i.i.d. with distribution $P$ having a density $f$ with respect to Lebesgue measure, and we want to estimate $f$.

Easy: estimate $F(x) = P(X_i \le x)$ by the empirical distribution function
$$\hat F_n(x) = \frac{1}{n}\sum_{i=1}^n 1_{(-\infty,x]}(X_i), \qquad \hat F_n(x) - F(x) = \frac{\xi_n(x)}{\sqrt n},$$
where $\{\xi_n(x),\ x\in\mathbb R\}$ converges in law to $\{B^0(F(x)),\ x\in\mathbb R\}$, a Brownian bridge (Kolmogorov--Smirnov).

Two obstructions to differentiation: $\hat F_n(x)$ is not differentiable, and $B^0(F(x))$ is not differentiable either.
8. Parzen kernel method

$$\hat K_{h_n}(x) = \frac{1}{h_n}\int K\Big(\frac{x-y}{h_n}\Big)\,d\hat F_n(y) = \frac{1}{n h_n}\sum_{i=1}^n K\Big(\frac{x-X_i}{h_n}\Big).$$
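As a concrete illustration (ours, not from the slides), here is a minimal Python sketch of the Parzen estimator, using the box kernel $K = \frac{1}{2M}1_{[-M,M]}$, which satisfies the assumptions used later ($\int K = 1$, compact support, bounded); the function name `parzen_kde` is a hypothetical label.

```python
import numpy as np

def parzen_kde(x, data, h, M=1.0):
    """Parzen kernel density estimate at points x.

    Uses the box kernel K = 1/(2M) on [-M, M]: integral 1,
    compactly supported, bounded.
    """
    x = np.atleast_1d(x)
    u = (x[:, None] - data[None, :]) / h               # (x - X_i) / h_n
    K = np.where(np.abs(u) <= M, 1.0 / (2 * M), 0.0)   # box kernel values
    return K.sum(axis=1) / (len(data) * h)             # (1/(n h)) sum_i K(...)
```

On an evenly spread sample from the uniform density on $[0,1]$, the estimate away from the boundary is close to 1, as expected.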
9. [Figure: kernel density estimates of the same sample (N = 300; last panel N = 3000) for decreasing bandwidths 3, 1, 0.7, 0.4, 0.3, 0.1 and the default bandwidth.]
10. Minimax framework

Our aim is to estimate $f \in V$. We have a loss function $l(\hat f, f)$ (for instance $l(\hat f,f) = \|\hat f - f\|_p^p$, or $\|\hat f - f\|_\infty$).

$f^*$ is minimax (exactly) if
$$\sup_V E_f\, l(f^*, f) = \inf_{\hat f}\sup_V E_f\, l(\hat f, f).$$

$f^*_n$ is minimax (up to constants) if
$$c\,\inf_{\hat f_n}\sup_V E^n_f\, l(\hat f_n, f) \le \sup_V E^n_f\, l(f^*_n, f) \le C\,\inf_{\hat f_n}\sup_V E^n_f\, l(\hat f_n, f), \qquad \forall n.$$
11. Minimax framework

Two steps: find a rate $r(n)$.

1. Lower bound: for any estimator $\hat f_n$,
$$\sup_V E^n_f\, l(\hat f_n, f) \ge c\, r(n).$$
2. Upper bound: construct an estimation method $f^*_n$ with
$$\sup_V E^n_f\, l(f^*_n, f) \le C\, r(n).$$
12. Lower bound

Models: density (1), regression (2), white noise model (3).
$$V = V_\alpha(L) = \Big\{f : [0,1]\to\mathbb R,\ \sup_{|x-y|\le\delta}|f(x)-f(y)| \le L\delta^\alpha\ \forall\delta,\ |f(0)| \le L\Big\}, \qquad l(\hat f, f) = \|\hat f - f\|_p^p.$$

Theorem 1. For $0 < \alpha \le 1$, $p \in [1,\infty)$, in models 1, 2 or 3,
$$\inf_{\hat f_n}\sup_{V_\alpha(L)} E^n_f \|\hat f_n - f\|_p^p \ge c\, n^{-\frac{p\alpha}{1+2\alpha}} =: c\, r(n).$$

Proof: p. 155 and more in HKPT.
13. Lower bound

The proof consists in finding a collection $\Gamma$ of functions (as big as possible) with the following requirements:
1. $\Gamma \subset V_\alpha(L)$;
2. $\|f - g\|_p^p \ge \delta$ for all $f \ne g \in \Gamma$;
3. $d(P^n_f, P^n_g) \le 1/2$ for all $f \ne g \in \Gamma$.
14. Upper bound

$$\hat K_{h_n}(x) = \frac{1}{n h_n}\sum_{i=1}^n K\Big(\frac{x-X_i}{h_n}\Big) \quad \text{density (Rosenblatt 1956)},$$
$$\hat K_{h_n}(x) = \frac{1}{n h_n}\sum_{i=1}^n K\Big(\frac{x-X_i}{h_n}\Big) Y_i \quad \text{regression (Nadaraya--Watson 1964)},$$
$$\hat K_{h_n}(x) = \frac{1}{h_n}\int_0^1 K\Big(\frac{x-t}{h_n}\Big)\,dY^\epsilon_t \quad \text{white noise}.$$

Theorem 2. For $0 < \alpha \le 1$, $p \in [1,\infty)$, in models 1, 2 or 3, with $\int K(x)\,dx = 1$, $K$ compactly supported in $[-M,M]$, $\|K\|_\infty < \infty$: taking $f^*_n = \hat K_{h_n}$ with $h_n = n^{-\frac{1}{1+2\alpha}}$,
$$\sup_{V_\alpha(L)} E^n_f \|f^*_n - f\|_p^p \le C\, n^{-\frac{p\alpha}{1+2\alpha}} = C\, r(n).$$
15. [Figure: the same kernel density estimates as on slide 9.]
16. Upper bound ($p = 2$, model (1))

$$E^n_f(\hat K_{h_n}(x) - f(x))^2 = E^n_f(\hat K_{h_n}(x) - K_{h_n}\star f(x))^2 + (K_{h_n}\star f(x) - f(x))^2.$$

Balance bias and variance. Variance:
$$E^n_f(\hat K_{h_n}(x) - K_{h_n}\star f(x))^2 \le \frac{1}{n}\int K_{h_n}(x-y)^2 f(y)\,dy \le \frac{L\int K^2}{n h_n} \le \frac{L\,2M\|K\|_\infty^2}{n h_n}.$$
17. Balance bias and variance

Bias:
$$K_{h_n}\star f(x) - f(x) = \int K_{h_n}(u)[f(x-u) - f(x)]\,du,$$
hence
$$|K_{h_n}\star f(x) - f(x)| \le 2M\|K\|_\infty \sup_{|u|\le Mh_n}|f(x-u) - f(x)| \le 2M\|K\|_\infty\, L\,(2Mh_n)^\alpha.$$
18. Balance bias and variance

$$E^n_f(\hat K_{h_n}(x) - f(x))^2 = E^n_f(\hat K_{h_n}(x) - K_{h_n}\star f(x))^2 + (K_{h_n}\star f(x) - f(x))^2 \le \frac{L\,2M\|K\|_\infty^2}{n h_n} + 4M^2\|K\|_\infty^2 L^2 (2Mh_n)^{2\alpha}.$$

Proof for general $p$: Rosenthal inequality, see HKPT.

Observe that what is needed is in fact: $f \in V \Rightarrow \|K_h \star f - f\|_p \le C h^\alpha$.
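A quick numerical sanity check (ours, not in the slides): minimizing the bound above over $h$ recovers the bandwidth order $h_n \asymp n^{-1/(1+2\alpha)}$ used in Theorem 2. The constants $L$, $M$, $\|K\|_\infty$ below are arbitrary illustrative values.

```python
import numpy as np

def mse_bound(h, n, alpha, L=1.0, M=0.5, K_sup=1.0):
    """Variance term + squared-bias term of the pointwise risk bound."""
    var = L * 2 * M * K_sup**2 / (n * h)
    bias2 = 4 * M**2 * K_sup**2 * L**2 * (2 * M * h) ** (2 * alpha)
    return var + bias2

n, alpha = 10_000, 0.5
hs = np.logspace(-4, 0, 401)                      # candidate bandwidths
h_star = hs[np.argmin(mse_bound(hs, n, alpha))]   # numerical minimizer
h_theory = n ** (-1 / (1 + 2 * alpha))            # n^(-1/(1+2*alpha))
```

The numerical minimizer tracks the theoretical order; changing `n` or `alpha` moves both together.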
19. Orthogonal series methods

$f \in L^2([0,1])$, $E = \{\psi_i,\ i\in\mathbb N\}$ an orthonormal basis of $L^2([0,1],dt)$,
$$f = \sum_i \theta_i \psi_i, \qquad x_i = \int \psi_i\,dY, \quad i \in \mathbb N.$$
General estimator:
$$\hat f = \sum_{i\in A} \hat\theta_i \psi_i.$$
Two choices: $A$ and $\hat\theta_i$.
20. Orthogonal series methods

$A = \{0,\dots,K\}$ (generally), and
$$\hat\theta_i = \frac{1}{n}\sum_{j=1}^n \psi_i(X_j) \quad \text{density},$$
$$\hat\theta_i = \frac{1}{n}\sum_{j=1}^n \psi_i(X_j)\,Y_j \quad \text{regression},$$
$$\hat\theta_i = \int_0^1 \psi_i(t)\,dY^\epsilon_t \quad \text{white noise},$$
giving the estimator $\hat f_K$.
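The density version of this projection estimator can be sketched in a few lines of Python (our illustration; the cosine system is one convenient orthonormal basis of $L^2([0,1])$, not one prescribed by the slides):

```python
import numpy as np

def cosine_basis(i, t):
    """Orthonormal cosine basis of L^2([0,1]): psi_0 = 1, psi_i = sqrt(2) cos(i pi t)."""
    if i == 0:
        return np.ones_like(t)
    return np.sqrt(2) * np.cos(i * np.pi * t)

def series_density_estimate(x, data, K):
    """Projection estimator: theta_hat_i = (1/n) sum_j psi_i(X_j),
    f_hat_K = sum_{i <= K} theta_hat_i psi_i."""
    est = np.zeros_like(x, dtype=float)
    n = len(data)
    for i in range(K + 1):
        theta_hat = cosine_basis(i, data).sum() / n   # empirical coefficient
        est += theta_hat * cosine_basis(i, x)
    return est
```

On an evenly spread sample from the uniform density, all estimated coefficients beyond $\hat\theta_0 = 1$ are nearly zero, so the estimate is close to 1.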
21. Upper bounds

We assume $f$ belongs to a polynomially tailed compact domain: for $s > 0$ fixed,
$$V = \Big\{f = \sum_k \theta_k\psi_k :\ \sum_{k>K}\theta_k^2 \le M^2 K^{-2s},\ \forall K\Big\}.$$
22. Upper bounds

For $f \in V(s,M) = \{f = \sum_k \theta_k\psi_k :\ \sum_{k>K}\theta_k^2 \le M^2 K^{-2s},\ \forall K\}$:
$$E^n_f\|\hat f_K - f\|^2 = \sum_{k\le K} E^n_f(\hat\theta_k - \theta_k)^2 + \sum_{k>K}\theta_k^2 \le (K+1)\frac{1}{n} + M^2 K^{-2s}.$$
Optimized for $K_s = c\, n^{\frac{1}{1+2s}}$ (decreasing in $s$; $K_0 = c\,n$):
$$\sup_{f\in V(s,M)} E\|\hat f_{K_s} - f\|^2 \le c\, n^{-\frac{2s}{1+2s}}.$$
23. Lower bounds

Minimax lower bound (Pinsker):
$$\inf_{\mathrm{Est}}\sup_V E\|\mathrm{Est} - f\|_2^2 \ge c_0\, n^{-\frac{2s}{1+2s}}.$$
Hence
$$\sup_{f\in V(s,M)} E\|\hat f_{K_s} - f\|^2 \le c\, n^{-\frac{2s}{1+2s}}$$
says that $\hat f_{K_s}$ is rate optimal over $V$, but $K_s = c\,n^{\frac{1}{1+2s}}$ depends on $s$.
24. Kernels versus series

Calculations are easier for series methods (both in proofs and in computation). Tuning parameters: $K \approx h^{-1}$, which gives an interpretation of the bandwidth parameter as the dimension of the problem. On the other hand, the space $V$ depends on the basis and on the numbering within the basis, and only allows an $L^2$ loss function.
25. Bases and functional spaces
26. Trigonometric basis and Sobolev spaces

Orthonormal basis of $L^2([0,1])$ of periodic functions:
$$\psi_0 = 1, \qquad \psi_{2k}(x) = \sqrt2\cos 2k\pi x, \qquad \psi_{2k+1}(x) = \sqrt2\sin 2k\pi x.$$
Let $\beta \in \mathbb N$; define the Sobolev space
$$W(\beta,L) = \Big\{f:[0,1]\to\mathbb R :\ f^{(\beta-1)} \text{ absolutely continuous},\ \int (f^{(\beta)})^2(x)\,dx \le L^2\Big\},$$
$$W_{per}(\beta,L) = \{f \in W(\beta,L),\ f \text{ periodic}\}.$$
27. Trigonometric basis and Sobolev spaces

Let
$$\Theta((a_j),Q) = \Big\{\theta\in\ell^2 :\ \sum_j a_j^2\theta_j^2 \le Q^2\Big\}.$$
We have $W_{per}(\beta,L) = \Theta((a_j),Q) =: \Theta(\beta,Q)$, with
$$a_j = j^\beta\ (j \text{ even}), \qquad a_j = (j-1)^\beta\ (j \text{ odd}), \qquad Q = \frac{L}{\pi^\beta}.$$
28. Trigonometric basis and Sobolev spaces

$$\Theta(\beta,Q) = \Big\{\theta\in\ell^2 :\ \sum_j j^{2\beta}\theta_j^2 \le Q^2\Big\}, \qquad \Theta(\beta,Q) \subset V(\beta,Q):$$
$$\sum_{j>K}\theta_j^2 \le \sum_{j>K}\Big[\frac{j}{K}\Big]^{2\beta}\theta_j^2 \le K^{-2\beta}\sum_j j^{2\beta}\theta_j^2.$$
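A small numerical check (ours) of the inclusion argument above: for any coefficient sequence the tail inequality holds, because $(j/K)^{2\beta}\ge 1$ whenever $j > K$.

```python
import numpy as np

# Check: sum_{j>K} theta_j^2 <= K^(-2*beta) * sum_j j^(2*beta) * theta_j^2
rng = np.random.default_rng(0)
beta, K = 2, 10
j = np.arange(1, 2001)
theta = rng.standard_normal(j.size) / j ** (beta + 1)   # ellipsoid-type decay
tail = (theta[j > K] ** 2).sum()
bound = K ** (-2 * beta) * ((j ** (2 * beta)) * theta ** 2).sum()
```

The bound holds for every sequence, not just this randomly drawn one; the decay rate only controls whether both sides are finite.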
29. Examples of bases: Haar wavelet basis

$$\psi_{-1,0} := 1_{[0,1]} =: \varphi, \qquad \psi(x) := 1_{[0,1/2)} - 1_{[1/2,1]},$$
$$\psi_{j,k}(x) := 2^{j/2}\psi(2^j x - k), \qquad j\in\mathbb N,\ k\in\{0,\dots,2^j-1\}.$$
$\{\varphi\}\cup\{\psi_{j,k},\ j\in\mathbb N,\ k\in\{0,\dots,2^j-1\}\}$ is an orthonormal basis of $L^2([0,1])$.
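A short Python check (our illustration) that the Haar functions above are orthonormal. The inner products are computed by a midpoint rule, which is exact here because the integrands are piecewise constant on dyadic cells that the grid subdivides.

```python
import numpy as np

def haar_psi(j, k, x):
    """Haar wavelet psi_{j,k}(x) = 2^(j/2) * psi(2^j x - k) on [0, 1]."""
    y = 2.0**j * x - k
    return 2.0 ** (j / 2) * (np.where((0 <= y) & (y < 0.5), 1.0, 0.0)
                             - np.where((0.5 <= y) & (y < 1.0), 1.0, 0.0))

# Midpoint grid on [0, 1]; fine enough to resolve levels j = 0, 1 exactly.
x = (np.arange(2**12) + 0.5) / 2**12
dx = 1.0 / 2**12
g00 = np.sum(haar_psi(0, 0, x) ** 2) * dx               # <psi_00, psi_00>
g11 = np.sum(haar_psi(1, 1, x) ** 2) * dx               # <psi_11, psi_11>
g_cross = np.sum(haar_psi(0, 0, x) * haar_psi(1, 1, x)) * dx  # <psi_00, psi_11>
```

The diagonal inner products come out as 1 and the cross product as 0, matching orthonormality.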
31. Haar wavelet basis and Besov spaces

Let us define
$$\varphi_{j,k}(x) := 2^{j/2}\varphi(2^j x - k), \qquad j\in\mathbb N,\ k\in\{0,\dots,2^j-1\},$$
$$V_j := \Big\{f = \sum_{k\in\{0,\dots,2^j-1\}} \alpha_{jk}\varphi_{j,k},\ \alpha_{jk}\in\mathbb R\Big\}, \qquad W_j := \Big\{f = \sum_{k\in\{0,\dots,2^j-1\}} \beta_{jk}\psi_{j,k},\ \beta_{jk}\in\mathbb R\Big\}.$$
32. We have easily (with the convention $\psi_{-1,0} = \varphi$):
$$V_j = V_{j-1}\oplus W_{j-1},$$
$$f\in V_J \iff f = \sum_{j=-1}^{J-1}\sum_{k\in\{0,\dots,2^j-1\}}\beta_{jk}\psi_{j,k},$$
$$f\in L^2([0,1]) \iff f = \sum_{j=-1}^{\infty}\sum_{k\in\{0,\dots,2^j-1\}}\beta_{jk}\psi_{j,k}, \qquad \sum_{j,k}\beta_{jk}^2 = \|f\|_2^2 < \infty.$$
33. Wavelet estimators versus kernel

Defined as an orthogonal series estimator:
$$\hat f_J = \sum_{j=-1}^{J-1}\sum_{k\in\{0,\dots,2^j-1\}}\hat\beta_{jk}\psi_{j,k},$$
$$\hat\beta_{jk} = \frac1n\sum_{i=1}^n\psi_{j,k}(X_i) \quad \text{density},$$
$$\hat\beta_{jk} = \frac1n\sum_{i=1}^n\psi_{j,k}(X_i)\,Y_i \quad \text{regression},$$
$$\hat\beta_{jk} = \int_0^1\psi_{j,k}(t)\,dY^\epsilon_t \quad \text{white noise}.$$
34. Wavelet estimators versus kernel

Defined as a kernel estimator:
$$\hat f_J(x) = \sum_{k\in\{0,\dots,2^J-1\}}\hat\alpha_{Jk}\varphi_{J,k}(x),$$
$$\hat\alpha_{Jk} = \frac1n\sum_{i=1}^n\varphi_{J,k}(X_i) \quad \text{density},$$
$$\hat\alpha_{Jk} = \frac1n\sum_{i=1}^n\varphi_{J,k}(X_i)\,Y_i \quad \text{regression},$$
$$\hat\alpha_{Jk} = \int_0^1\varphi_{J,k}(t)\,dY^\epsilon_t \quad \text{white noise}.$$
35. $$K_J(t,x) = \sum_{k\in\{0,\dots,2^J-1\}}\varphi_{J,k}(t)\varphi_{J,k}(x) = 2^J\sum_{k\in\{0,\dots,2^J-1\}}\varphi(2^Jt-k)\varphi(2^Jx-k) = 2^J K(2^Jt, 2^Jx),$$
with $K(t,x) = \sum_k\varphi(t-k)\varphi(x-k)$. Then
$$\hat f_J(x) = \frac1n\sum_{i=1}^n K_J(X_i,x) \quad \text{density},$$
$$\hat f_J(x) = \frac1n\sum_{i=1}^n K_J(X_i,x)\,Y_i \quad \text{regression},$$
$$\hat f_J(x) = \int_0^1 K_J(t,x)\,dY^\epsilon_t \quad \text{white noise}.$$
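For the Haar scaling function the projection kernel has an explicit form: $K_J(t,x) = 2^J$ if $t$ and $x$ fall in the same dyadic interval $[k2^{-J},(k+1)2^{-J})$ and $0$ otherwise, so the density estimator $\hat f_J$ is just a histogram on $2^J$ dyadic bins. A minimal sketch (ours):

```python
import numpy as np

def haar_K_J(t, x, J):
    """K_J(t, x) = sum_k phi_{J,k}(t) phi_{J,k}(x) for phi = 1_[0,1):
    equals 2^J iff t and x share the same dyadic bin of length 2^-J."""
    return np.where(np.floor(2**J * t) == np.floor(2**J * x), 2.0**J, 0.0)

def haar_density_estimate(x, data, J):
    """f_hat_J(x) = (1/n) sum_i K_J(X_i, x), i.e. a dyadic histogram."""
    x = np.atleast_1d(x)
    return haar_K_J(data[None, :], x[:, None], J).mean(axis=1)
```

On an evenly spread sample from the uniform density, each dyadic bin receives its expected share of points, so the estimate is close to 1.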
36. Wavelet bases and Besov spaces

In this case the polynomial-tail compactness condition reads
$$V(s,M) = \Big\{f = \sum_{j=-1}^\infty\sum_{k\in\{0,\dots,2^j-1\}}\beta_{jk}\psi_{j,k} :\ \sum_{j\ge J}\sum_k\beta_{jk}^2 \le M^2\,2^{-2Js},\ \forall J\in\mathbb N\Big\}.$$
Let us define the Besov space
$$B^s_{2,\infty} = \Big\{f = \sum_{j=-1}^\infty\sum_k\beta_{jk}\psi_{j,k} :\ \sup_{J\in\mathbb N} 2^{Js}\Big[\sum_k\beta_{Jk}^2\Big]^{1/2} < \infty\Big\}.$$
$V(s,M)$ is then equivalent to a ball of this space.
37. Wavelet bases and Besov spaces

For $f$ in the ball
$$B^s_{2,\infty}(M) = \Big\{f = \sum_{j=-1}^\infty\sum_k\beta_{jk}\psi_{j,k} :\ \sup_{J\in\mathbb N}2^{Js}\Big[\sum_k\beta_{Jk}^2\Big]^{1/2} \le M\Big\},$$
taking $J_s$ with $2^{J_s} = n^{\frac{1}{1+2s}}$:
$$\sup_{f\in B^s_{2,\infty}(M)} E\|\hat f_{J_s} - f\|^2 \le c\, n^{-\frac{2s}{1+2s}}.$$
38. Lower bounds

Minimax lower bound:
$$\inf_{\mathrm{Est}}\sup_{f\in B^s_{2,\infty}(M)} E\|\mathrm{Est} - f\|_2^2 \ge c_0\, n^{-\frac{2s}{1+2s}}.$$
Hence
$$\sup_{f\in B^s_{2,\infty}(M)} E\|\hat f_{J_s} - f\|^2 \le c\, n^{-\frac{2s}{1+2s}}$$
says that $\hat f_{J_s}$ is rate optimal over $B^s_{2,\infty}(M)$, but $J_s$, with $2^{J_s} = n^{\frac{1}{1+2s}}$, depends on $s$.
39. Wavelet bases and Besov spaces

More generally,
$$B^s_{p,\infty} = \Big\{f = \sum_{j=-1}^\infty\sum_k\beta_{jk}\psi_{j,k} :\ \sup_{j\in\mathbb N}2^{j(s+\frac12-\frac1p)}\Big[\sum_k|\beta_{jk}|^p\Big]^{1/p} < \infty\Big\},$$
$$B^s_{p,q} = \Big\{f = \sum_{j=-1}^\infty\sum_k\beta_{jk}\psi_{j,k} :\ \sum_{j}2^{j(s+\frac12-\frac1p)q}\Big[\sum_k|\beta_{jk}|^p\Big]^{q/p} < \infty\Big\},$$
$$B^s_{\infty,\infty} = \Big\{f = \sum_{j=-1}^\infty\sum_k\beta_{jk}\psi_{j,k} :\ \sup_{j\in\mathbb N}2^{j(s+\frac12)}\sup_k|\beta_{jk}| < \infty\Big\}.$$
40. Besov spaces versus Hölder spaces

$$V_\alpha(L) = \Big\{f:[0,1]\to\mathbb R,\ \sup_{|x-y|\le\delta}|f(x)-f(y)|\le L\delta^\alpha\ \forall\delta,\ |f(0)|\le L\Big\},$$
$$f\in V_\alpha(L) \Rightarrow |\beta_{jk}| \le M\,2^{-j(\alpha+\frac12)} \iff f\in B^\alpha_{\infty,\infty} \Rightarrow f\in B^\alpha_{2,\infty},$$
i.e. $V_\alpha \subset B^\alpha_{\infty,\infty} \subset B^\alpha_{2,\infty}$.
41. Besov spaces: remarks

Wavelets. Other wavelets exist, compactly supported or not, generally more regular than the Haar wavelets; see HKPT Chapters 5, 6, 7, 8.

Besov spaces. There are conditions on the wavelets ensuring that the Besov spaces defined earlier do coincide (see HKPT, Theorem 9.4, p. 119).

Sparsity. The Besov conditions are among the conditions called sparsity conditions (meaning essentially that, in the representation of $f$, only a few coefficients are meaningful).
42. Besov spaces: embeddings

$$q \le q' \ \Rightarrow\ B^s_{p,q} \subset B^s_{p,q'} \quad \text{(comparison of } \ell_q \text{ norms)},$$
$$p \le p' \ \Rightarrow\ B^s_{p,q} \subset B^{s'}_{p',q} \ \text{ if }\ s - \tfrac1p = s' - \tfrac1{p'} \quad \text{(comparison of } \ell_p \text{ norms)},$$
$$p \ge p',\ \text{compactly supported case:}\ B^s_{p,q} \subset B^s_{p',q} \quad \text{(convexity inequalities)}.$$
Compactly supported case, in general:
$$B^s_{p,q} \subset B^{\,s-[\frac1p-\frac1{p'}]_+}_{p',q}.$$
43. Thresholding estimates in wavelet systems
44. Thresholding

Estimators:
$$\hat f = \sum_{j=-1}^{J}\sum_{k\in\{0,\dots,2^j-1\}}\hat\beta_{jk}\,I\Big\{|\hat\beta_{jk}| \ge \kappa\sqrt{\tfrac{\log n}{n}}\Big\}\psi_{j,k}, \qquad 2^J = \frac{n}{\log n},$$
with
$$\hat\beta_{jk} = \frac1n\sum_{i=1}^n\psi_{j,k}(X_i) \quad \text{density},$$
$$\hat\beta_{jk} = \frac1n\sum_{i=1}^n\psi_{j,k}(X_i)\,Y_i \quad \text{regression},$$
$$\hat\beta_{jk} = \int_0^1\psi_{j,k}(t)\,dY^\epsilon_t \quad \text{white noise}.$$
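In the sequence-space form of the white noise model, the hard-thresholding rule above is a few lines of Python (our sketch; the tuning constant $\kappa = 2$ is an arbitrary illustrative choice):

```python
import numpy as np

def hard_threshold_estimate(y, n, kappa=2.0):
    """Keep the empirical coefficients y_i exceeding kappa * sqrt(log n / n);
    zero out the rest (hard thresholding)."""
    thr = kappa * np.sqrt(np.log(n) / n)
    return np.where(np.abs(y) >= thr, y, 0.0)

# Sparse signal observed in Gaussian noise of size 1/sqrt(n).
rng = np.random.default_rng(1)
n = 10_000
theta = np.zeros(500)
theta[:5] = 1.0                                          # a few large coefficients
y = theta + rng.standard_normal(theta.size) / np.sqrt(n)  # noisy observations
theta_hat = hard_threshold_estimate(y, n)
```

With $n = 10^4$ the threshold $\kappa\sqrt{\log n/n} \approx 0.06$ sits well above the noise level $1/\sqrt n = 0.01$, so the five large coefficients are kept and the pure-noise coefficients are killed.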
45. Thresholding = adaptation

For all $s > s_0$:
$$\sup_{f\in B^s_{\infty,\infty}(M)} E\|\hat f - f\|_2^2 \le c\Big(\frac{\log n}{n}\Big)^{\frac{2s}{1+2s}}, \qquad \sup_{f\in B^s_{2,\infty}(M)} E\|\hat f - f\|_2^2 \le c\Big(\frac{\log n}{n}\Big)^{\frac{2s}{1+2s}}.$$
46. Thresholding = adaptation, general result

Theorem 1. For $p \ge 1$, $r \ge 1$, $\pi \ge 1$, $\kappa \ge \kappa_0$, $s > 1/p$, there exist constants $c(M)$ such that:

if $s < \frac{\pi}{2}\big(\frac1p - \frac1\pi\big)_+$,
$$\sup_{f\in B^s_{p,r}} E\|\hat f_n - f\|_\pi^\pi \le c(M)\,(\log n)^{\delta(s,p,r)}\Big[\frac{\log n}{n}\Big]^{\frac{(s-\frac1p+\frac1\pi)\pi}{2s-\frac2p+1}};$$

if $s \ge \frac{\pi}{2}\big(\frac1p - \frac1\pi\big)_+$,
$$\sup_{f\in B^s_{p,r}} E\|\hat f_n - f\|_\pi^\pi \le c(M)\,(\log n)^{\delta(s,p,r)}\Big[\frac{\log n}{n}\Big]^{\frac{s\pi}{2s+1}}.$$
47. Sequence space models, thresholding estimates
48. White noise model

$$dY_t = f(t)\,dt + \epsilon\,dW_t, \qquad t\in[0,1],\quad f\in L^2([0,1],dt).$$
$E = \{\psi_i,\ i\in\mathbb N\}$ an orthonormal basis of $L^2([0,1],dt)$,
$$f = \sum_i\theta_i\psi_i, \qquad x_i = \int\psi_i\,dY, \quad i\in\mathbb N,$$
$$x_i = \theta_i + \epsilon\,v_i, \quad i\in\mathbb N,$$
where the $v_i$'s are i.i.d. $N(0,1)$ and $\theta = (\theta_i)_{i\in\mathbb N}\in\ell^2(\mathbb N)$.
49. General framework

$f = \sum_{i\in\mathbb N}\theta_i e_i$ (unknown) is randomly observed, meaning that we can estimate $\theta_i$ by $\hat\theta^n_i$ for $i\in\Lambda_n$, with the following properties:
50. $$E^n_f|\hat\theta^n_i - \theta_i|^{2p} \le C\,\sigma_i^{2p}\,c(n)^{2p},$$
$$P^n\big(|\hat\theta^n_i - \theta_i| \ge \kappa\,\sigma_i\,c(n)/2\big) \le C\big(c(n)^{2p}\wedge c(n)^4\big),$$
for $i\in\Lambda_n$, with $c(n)\to 0$.
51. Examples

$c(n) = \sqrt{\log(n)/n}$, $|\Lambda_n| = c(n)^{-2}$ is the most common choice.

Density estimation model (Donoho, Johnstone, Kerkyacharian, Picard): for wavelet bases,
$$\hat\theta_{jk} = \frac1n\sum_{i=1}^n\psi_{jk}(X_i).$$
Regression model.
52. Examples

More delicate models (wavelet bases): stationary processes; evolutionary spectra (Neumann and von Sachs 1997); locally stationary processes (Donoho, Mallat and von Sachs 1998; Mallat, Papanicolaou and Zhang 1998); partially observed diffusion models (Hoffmann 1999); multivariate extensions with $t\in[0,1]^d$ (Donoho 1997, Neumann 1998); Markov chain models (hidden or not) (Clémençon 2000).
53. Thresholding

Estimators:
$$\hat f_n = \sum_{i\in\Lambda_n}\hat\theta^n_i\,1\big(|\hat\theta^n_i| \ge \kappa\,c(n)\big)\,e_i.$$
54. Maxisets

Definition 1. The maxiset associated with the sequence $\hat q_n$, the loss function $\rho$, the rate $\alpha_n$ and the constant $T$ is the following set:
$$MS(\hat q_n,\rho,\alpha_n)(T) := \Big\{\theta\in\Theta :\ \sup_n E^n_\theta\,\rho(\hat q_n, q(\theta))\,(\alpha_n)^{-1} \le T\Big\}.$$

Examples. For parametric regular sequences of models, we generally have $MS(\hat q_n,\rho,n^{-1/2})(T) = \Theta$ for various loss functions and a large enough constant $T$.
55. Density estimation: linear kernel methods

$X_1,\dots,X_n$ i.i.d. with density $f$. $\rho(\hat f_n,f) = \|\hat f_n - f\|_p^p$. $\Theta = \{f \text{ density},\ \|f\|_p \le R\}$.
$$\hat E_{j(n)}(x) := \frac1n\sum_{i=1}^n E_{j(n)}(x,X_i), \qquad j(n):\ 2^{j(n)} = n^{1-\alpha}, \quad \alpha\in(0,1).$$
56. $$MS(\hat E_{j(n)},\ \|\cdot\|_p^p,\ n^{-\alpha p/2}) :=: \Theta\cap B_{s,p}, \qquad s = \frac{\alpha}{2(1-\alpha)}, \ \text{ or }\ \alpha = \frac{2s}{1+2s}.$$
Kerkyacharian, Picard, Statistics & Probability Letters (1993).
57. $MS(\hat E_{j(n)},\ \|\cdot\|_p^p,\ \alpha_n) :=: \Theta\cap B_{s,p}$ means:

(i) for any $T$, there exists $M$ such that $MS(\hat E_{j(n)},\|\cdot\|_p^p,\alpha_n)(T) \subset \Theta\cap B_{s,p}(M)$;
(ii) for any $M$, there exists $T$ such that $MS(\hat E_{j(n)},\|\cdot\|_p^p,\alpha_n)(T) \supset \Theta\cap B_{s,p}(M)$.
58. Maxisets for thresholding estimators

Model: general framework, $f$ randomly observed. $\rho(\hat f_n,f) = \|\hat f_n - f\|_p^p$. $\Theta = \{f,\ \|f\|_p \le R\}$.
$$\hat f_n = \sum_{i\in\Lambda_n}\hat\theta^n_i\,1\big(|\hat\theta^n_i| \ge \sigma_i\,\kappa\,c(n)\big)\,e_i.$$
59. $p = 2$, $(e_i)$ an ordinary orthonormal basis. Define the weak space
$$\ell_{q,\infty}(E) = \Big\{f = \sum_n\theta_n e_n\in X :\ \sup_{\lambda>0}\lambda^q\,\mathrm{card}\{n :\ |\theta_n| > \lambda\} < \infty\Big\}.$$
For $0 < s$, $0 < r \le 2$, $|\Lambda_n| = c(n)^{-r}$:
$$MS(\hat f_n,\ \|\cdot\|_2^2,\ c(n)^{\alpha}) :=: \ell_{q,\infty}(E)\cap B_{\alpha/r,2}, \qquad \alpha = \frac{2s}{1+2s},\ s = \frac1q - \frac12.$$
60. 2001: Cohen, DeVore, Kerkyacharian, Picard.

$$B_{s,p} = \Big\{f\in L_p :\ \sup_{j\ge0}2^{js}\|E_jf - f\|_p < \infty\Big\},$$
$$B_{u,2} = \Big\{f = \sum_i\theta_i e_i\in L_2 :\ \sup_n n^u\Big\|\sum_{i=n}^\infty\theta_ie_i\Big\|_2 < \infty\Big\}.$$
61. $$B_{u,p} = \Big\{f = \sum_i\theta_i e_i\in L_p :\ \sup_n n^u\Big\|\sum_{i=n}^\infty\theta_ie_i\Big\|_p < \infty\Big\}.$$
More informationECE598: Information-theoretic methods in high-dimensional statistics Spring 2016
ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture : Mutual Information Method Lecturer: Yihong Wu Scribe: Jaeho Lee, Mar, 06 Ed. Mar 9 Quick review: Assouad s lemma
More informationNon linear estimation in anisotropic multiindex denoising II
Non linear estimation in anisotropic multiindex denoising II Gérard Kerkyacharian, Oleg Lepski, Dominique Picard Abstract In dimension one, it has long been observed that the minimax rates of convergences
More informationWavelets and modular inequalities in variable L p spaces
Wavelets and modular inequalities in variable L p spaces Mitsuo Izuki July 14, 2007 Abstract The aim of this paper is to characterize variable L p spaces L p( ) (R n ) using wavelets with proper smoothness
More informationNonlinear tensor product approximation
ICERM; October 3, 2014 1 Introduction 2 3 4 Best multilinear approximation We are interested in approximation of a multivariate function f (x 1,..., x d ) by linear combinations of products u 1 (x 1 )
More informationKernel Density Estimation
EECS 598: Statistical Learning Theory, Winter 2014 Topic 19 Kernel Density Estimation Lecturer: Clayton Scott Scribe: Yun Wei, Yanzhen Deng Disclaimer: These notes have not been subjected to the usual
More informationIdeal denoising within a family of tree-structured wavelet estimators
Electronic Journal of Statistics Vol. 5 (2011) 829 855 ISSN: 1935-7524 DOI: 10.1214/11-EJS628 Ideal denoising within a family of tree-structured wavelet estimators Florent Autin Université d Aix-Marseille
More informationWavelets and numerical methods. Politecnico di Torino Corso Duca degli Abruzzi, 24 Torino, 10129, Italia
Wavelets and numerical methods Claudio Canuto Anita Tabacco Politecnico di Torino Corso Duca degli Abruzzi, 4 Torino, 9, Italia 7 Wavelets and Numerical methods Claudio Canuto http://calvino.polito.it/~ccanuto
More informationWavelet-based density estimation in a heteroscedastic convolution model
Noname manuscript No. (will be inserted by the editor Wavelet-based density estimation in a heteroscedastic convolution model Christophe Chesneau Jalal Fadili Received: Abstract We consider a heteroscedastic
More informationMATH 5640: Fourier Series
MATH 564: Fourier Series Hung Phan, UMass Lowell September, 8 Power Series A power series in the variable x is a series of the form a + a x + a x + = where the coefficients a, a,... are real or complex
More informationAALBORG UNIVERSITY. Compactly supported curvelet type systems. Kenneth N. Rasmussen and Morten Nielsen. R November 2010
AALBORG UNIVERSITY Compactly supported curvelet type systems by Kenneth N Rasmussen and Morten Nielsen R-2010-16 November 2010 Department of Mathematical Sciences Aalborg University Fredrik Bajers Vej
More informationSobolev Spaces. Chapter Hölder spaces
Chapter 2 Sobolev Spaces Sobolev spaces turn out often to be the proper setting in which to apply ideas of functional analysis to get information concerning partial differential equations. Here, we collect
More informationBayesian Nonparametric Point Estimation Under a Conjugate Prior
University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda
More informationNonparametric regression with martingale increment errors
S. Gaïffas (LSTA - Paris 6) joint work with S. Delattre (LPMA - Paris 7) work in progress Motivations Some facts: Theoretical study of statistical algorithms requires stationary and ergodicity. Concentration
More informationDensity estimators for the convolution of discrete and continuous random variables
Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract
More informationFunction Spaces. 1 Hilbert Spaces
Function Spaces A function space is a set of functions F that has some structure. Often a nonparametric regression function or classifier is chosen to lie in some function space, where the assume structure
More informationNONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION
NONPARAMETRIC DENSITY ESTIMATION WITH RESPECT TO THE LINEX LOSS FUNCTION R. HASHEMI, S. REZAEI AND L. AMIRI Department of Statistics, Faculty of Science, Razi University, 67149, Kermanshah, Iran. ABSTRACT
More informationAggregation of Spectral Density Estimators
Aggregation of Spectral Density Estimators Christopher Chang Department of Mathematics, University of California, San Diego, La Jolla, CA 92093-0112, USA; chrchang@alumni.caltech.edu Dimitris Politis Department
More informationFast learning rates for plug-in classifiers under the margin condition
Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,
More informationA DATA-DRIVEN BLOCK THRESHOLDING APPROACH TO WAVELET ESTIMATION
The Annals of Statistics 2009, Vol. 37, No. 2, 569 595 DOI: 10.1214/07-AOS538 Institute of Mathematical Statistics, 2009 A DATA-DRIVEN BLOCK THRESHOLDING APPROACH TO WAVELET ESTIMATION BY T. TONY CAI 1
More information(a). Bumps (b). Wavelet Coefficients
Incorporating Information on Neighboring Coecients into Wavelet Estimation T. Tony Cai Bernard W. Silverman Department of Statistics Department of Mathematics Purdue University University of Bristol West
More informationON BLOCK THRESHOLDING IN WAVELET REGRESSION: ADAPTIVITY, BLOCK SIZE, AND THRESHOLD LEVEL
Statistica Sinica 12(2002), 1241-1273 ON BLOCK THRESHOLDING IN WAVELET REGRESSION: ADAPTIVITY, BLOCK SIZE, AND THRESHOLD LEVEL T. Tony Cai University of Pennsylvania Abstract: In this article we investigate
More informationImplementation of Sparse Wavelet-Galerkin FEM for Stochastic PDEs
Implementation of Sparse Wavelet-Galerkin FEM for Stochastic PDEs Roman Andreev ETH ZÜRICH / 29 JAN 29 TOC of the Talk Motivation & Set-Up Model Problem Stochastic Galerkin FEM Conclusions & Outlook Motivation
More informationNonparametric Modal Regression
Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric
More informationAn Introduction to Wavelets and some Applications
An Introduction to Wavelets and some Applications Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France An Introduction to Wavelets and some Applications p.1/54
More informationNonparametric Function Estimation with Infinite-Order Kernels
Nonparametric Function Estimation with Infinite-Order Kernels Arthur Berg Department of Statistics, University of Florida March 15, 2008 Kernel Density Estimation (IID Case) Let X 1,..., X n iid density
More informationSpring 2012 Math 541B Exam 1
Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote
More informationProbability and Measure
Part II Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2018 84 Paper 4, Section II 26J Let (X, A) be a measurable space. Let T : X X be a measurable map, and µ a probability
More information