Minimax lower bounds I


1 Minimax lower bounds I Kyoung Hee Kim Sungshin University

2 Outline: 1 Preliminaries; 2 General strategy; 3 Le Cam, Assouad; Appendix

3 Setting
- Family of probability measures {P_θ : θ ∈ Θ} on a sigma-field A
- Estimator ˆθ: a measurable map from Ω to Θ
- L(ˆθ, θ): loss function
- Maximum risk: R(Θ, ˆθ) := sup_{θ∈Θ} E_θ L(ˆθ, θ)
- Minimax risk, with a minimax estimator ˆθ_mm:
  R(Θ) = inf_{˜θ} sup_{θ∈Θ} E_θ L(˜θ, θ) = sup_{θ∈Θ} E_θ L(ˆθ_mm, θ).

4 Why minimax?
- Uniformity excludes super-efficient estimators
- Optimal rates reveal the difficulty of the model of interest
- Guidance on the choice of estimators

5 Various distances
P, Q: two probability measures with densities p, q w.r.t. ν.
- Total variation distance: V(P, Q) = sup_{A∈A} |P(A) − Q(A)| = sup_{A∈A} ∫_A (p − q) dν
- Squared Hellinger distance: h²(P, Q) = ∫ (p^{1/2} − q^{1/2})² dν
- Kullback-Leibler (KL) divergence: KL(P, Q) = ∫ p log(p/q) dν
- Chi-squared distance: χ²(P, Q) = ∫ p²/q dν − 1
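As a quick numerical illustration (a sketch, not part of the slides; the grid and the choice P = N(0, 1), Q = N(0.5, 1) are only for illustration), the four distances can be approximated by Riemann sums of their defining integrals:

```python
import numpy as np

# A numerical sketch (not from the slides): the four distances between
# P = N(0, 1) and Q = N(0.5, 1), approximated by Riemann sums on a fine grid.
x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]

def normal_pdf(t, mu, sigma=1.0):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = normal_pdf(x, 0.0)
q = normal_pdf(x, 0.5)

V = 0.5 * np.sum(np.abs(p - q)) * dx                # total variation distance
h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx    # squared Hellinger distance
KL = np.sum(p * np.log(p / q)) * dx                 # Kullback-Leibler divergence
chi2 = np.sum(p ** 2 / q) * dx - 1.0                # chi-squared distance

print(f"V = {V:.4f}, h^2 = {h2:.4f}, KL = {KL:.4f}, chi^2 = {chi2:.4f}")
```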

6 Relation between distances
P, Q: two probability measures with densities p, q w.r.t. ν.
1. ½ h²(P, Q) ≤ V(P, Q) ≤ h(P, Q) √(1 − h²(P, Q)/4)
2. h²(P, Q) ≤ KL(P, Q) ≤ χ²(P, Q)
3. h²(Pⁿ, Qⁿ) ≤ n h²(P, Q)
4. (Pinsker) V(P, Q) ≤ √(KL(P, Q)/2), and 1 − V(P, Q) ≥ ½ exp(−KL(P, Q)).
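The stated relations can be sanity-checked numerically. The sketch below (not from the slides) uses closed-form expressions for two unit-variance normal distributions and their n-fold products; the choices of delta and n are illustrative:

```python
import math

# Sanity check of relations 1-4 (a sketch, not from the slides), using closed
# forms for P = N(0, 1), Q = N(delta, 1) and their n-fold product measures.
delta, n = 0.8, 5
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

V = 2.0 * Phi(delta / 2.0) - 1.0                # total variation
h2 = 2.0 * (1.0 - math.exp(-delta ** 2 / 8.0))  # squared Hellinger
h = math.sqrt(h2)
KL = delta ** 2 / 2.0                           # Kullback-Leibler
chi2 = math.exp(delta ** 2) - 1.0               # chi-squared
h2_n = 2.0 * (1.0 - (1.0 - h2 / 2.0) ** n)      # Hellinger between the n-fold products

assert 0.5 * h2 <= V <= h * math.sqrt(1.0 - h2 / 4.0)  # relation 1
assert h2 <= KL <= chi2                                 # relation 2
assert h2_n <= n * h2                                   # relation 3
assert V <= math.sqrt(KL / 2.0)                         # Pinsker
assert 1.0 - V >= 0.5 * math.exp(-KL)                   # second bound in relation 4
print("all stated inequalities hold for this example")
```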

7 Relation to Bayes estimators
- Let ˆθ_Π be the Bayes estimator under a prior Π.
- A least favourable prior is one whose Bayes risk (BR) is maximal over all prior distributions.
- If the Bayes risk of ˆθ_Π equals the maximum risk of ˆθ_Π, then ˆθ_Π is minimax and Π is least favourable.
- If the maximum risk of an estimator ˆθ equals lim_n of the Bayes risks of ˆθ_{Π_n} for some sequence of priors Π_n (a limit which is at least the Bayes risk of any prior), then ˆθ is minimax.

8 Examples of minimax estimators
1. Sample mean. Let X_i be i.i.d. N(θ, 1), i = 1, ..., n, and let L(ˆθ, θ) = (ˆθ − θ)². The sample mean X̄ is minimax: with the sequence of priors Π_b = N(µ, b), the Bayes estimator is ˆθ_{Π_b} = (n X̄ + µ/b)/(n + 1/b), and the posterior variance 1/(n + 1/b) converges to 1/n = Var(X̄) as b → ∞.
2. James-Stein estimator. Let X ~ N_d(θ, I_d) with d ≥ 3, and let L(ˆθ, θ) = Σ_{i=1}^d (ˆθ_i − θ_i)². A minimax estimator (dominating X) is ˆθ_mm = (1 − (d − 2)/Σ_{i=1}^d X_i²) X. But ˆθ_mm is not the unique minimax estimator (and is not admissible); for instance, the positive-part estimator (1 − (d − 2)/Σ_{i=1}^d X_i²)_+ X dominates ˆθ_mm.

9 Figure: risk functions for d = 4, plotted against θ: X (MLE, UMVUE), the James-Stein estimator (minimax), and the positive-part James-Stein estimator (minimax).
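The risk comparison in the figure can be approximated by Monte Carlo. The following sketch (not from the slides; d = 4, the number of replications, and the grid of ||θ|| values are illustrative) estimates the risks of X, the James-Stein estimator, and its positive-part version:

```python
import numpy as np

# Monte Carlo sketch (not from the slides) of the risks in the figure:
# X ~ N_d(theta, I_d) with d = 4 and squared-error loss.
rng = np.random.default_rng(0)
d, n_rep = 4, 200_000

def empirical_risks(theta):
    X = theta + rng.standard_normal((n_rep, d))
    sq_norm = np.sum(X ** 2, axis=1, keepdims=True)
    shrink = 1.0 - (d - 2) / sq_norm                 # James-Stein shrinkage factor
    js = shrink * X                                  # James-Stein estimator
    js_plus = np.maximum(shrink, 0.0) * X            # positive-part James-Stein
    risk = lambda est: np.mean(np.sum((est - theta) ** 2, axis=1))
    return risk(X), risk(js), risk(js_plus)

for norm in [0.0, 1.0, 3.0]:
    theta = np.full(d, norm / np.sqrt(d))
    r_x, r_js, r_jsp = empirical_risks(theta)
    print(f"||theta|| = {norm:.1f}: X (MLE) {r_x:.3f}, JS {r_js:.3f}, JS+ {r_jsp:.3f}")
```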

10 Minimax optimality
- It is usually difficult to find the minimax risk and a minimax estimator exactly.
- Typically we are satisfied if we find a good lower bound l(n) on R(Θ) and a good upper bound u(n) attained by a specific estimator ˜θ, so that
  l(n) ≤ R(Θ) ≤ sup_{θ∈Θ} E_θ L(˜θ, θ) ≤ u(n)    (1)
  with lim_n u(n)/l(n) ≤ C.
- If ˜θ satisfies (1), then ˜θ is called a minimax optimal estimator.

11 Possible questions
1. In a nonparametric regression problem where the regression function θ satisfies some smoothness condition (e.g. twice continuously differentiable), with the squared error loss at one point, what is a minimax lower bound? That is,
   sup_{θ∈Θ} E_θ (˜θ(x_0) − θ(x_0))² ≥ ... ?
2. In a density estimation problem where a density f on R is assumed to be s times differentiable, with the integrated squared error loss, what is a minimax lower bound? That is,
   sup_{f∈F_s} E_f ∫ (˜f(x) − f(x))² dx ≥ ... ?

12 Lower bounds via TV distance
- d is a metric on Θ, and d(θ_0, θ_1) ≥ 2δ for θ_0, θ_1 ∈ Θ.
- Denote A_0 := {ω : d(θ_0, ˆθ(ω)) < δ}.
(1) A risk bound implies a TV bound. Suppose P_θ{d(θ, ˆθ) ≥ δ} ≤ ε for all θ ∈ Θ. Then P_{θ_0}(A_0) ≥ 1 − ε and P_{θ_1}(A_0) ≤ ε (since d(θ_0, ˆθ) < δ forces d(θ_1, ˆθ) > δ by the triangle inequality), which shows
  V(P_{θ_0}, P_{θ_1}) ≥ P_{θ_0}(A_0) − P_{θ_1}(A_0) ≥ 1 − 2ε.

13 Lower bounds via TV distance
- d is a metric on Θ, and d(θ_0, θ_1) ≥ 2δ for θ_0, θ_1 ∈ Θ.
- Denote A_0 := {ω : d(θ_0, ˆθ(ω)) < δ}.
(2) A TV bound implies a risk bound. Suppose V(P_{θ_0}, P_{θ_1}) < 1 − 2ε. Then sup_{θ∈Θ} P_θ(d(θ, ˆθ) ≥ δ) =: R(Θ, ˆθ) > ε, since
  2 R(Θ, ˆθ) ≥ P_{θ_0}(d(θ_0, ˆθ) ≥ δ) + P_{θ_1}(d(θ_1, ˆθ) ≥ δ)
            ≥ P_{θ_0}(A_0^c) + P_{θ_1}(A_0)
            ≥ 1 − sup_A (P_{θ_0}(A) − P_{θ_1}(A)) > 2ε.

14 Lower bounds via testing
- The above argument uses the loss L(ξ, θ) = 1_{d(ξ,θ) ≥ δ}, for which L(ξ, θ_0) + L(ξ, θ_1) ≥ 1 for all ξ whenever d(θ_0, θ_1) ≥ 2δ.
- For a general loss function: suppose inf_{ξ∈Θ} (L(ξ, θ_0) + L(ξ, θ_1)) ≥ d(θ_0, θ_1) for a pair θ_0 and θ_1 in Θ. Then
  inf_{ξ∈Θ} ( L(ξ, θ_0)/d(θ_0, θ_1) + L(ξ, θ_1)/d(θ_0, θ_1) ) ≥ 1.

15 Lower bounds via testing
2 R(Θ, ˆθ) ≥ E_{θ_0} L(ˆθ, θ_0) + E_{θ_1} L(ˆθ, θ_1)
           = d(θ_0, θ_1) [ E_{θ_0} L(ˆθ, θ_0)/d(θ_0, θ_1) + E_{θ_1} L(ˆθ, θ_1)/d(θ_0, θ_1) ]
           ≥ d(θ_0, θ_1) inf_{f_0 + f_1 ≥ 1, f_0, f_1 ≥ 0} [ E_{θ_0} f_0 + E_{θ_1} f_1 ].
Note that
  inf_{f_0 + f_1 ≥ 1, f_0, f_1 ≥ 0} (E_{θ_0} f_0 + E_{θ_1} f_1) = ∫ (p_{θ_0} ∧ p_{θ_1}) dν = 1 − ½ ∫ |p_{θ_0} − p_{θ_1}| dν = 1 − V(P_{θ_0}, P_{θ_1}).
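The identity at the bottom of the slide can be checked directly for discrete distributions: the likelihood-ratio test attains the infimum, and the resulting sum of errors equals 1 − V(P_{θ_0}, P_{θ_1}). A small sketch (not from the slides; the two random distributions are illustrative):

```python
import numpy as np

# Sketch (not from the slides): for two discrete distributions, the smallest sum of
# testing errors over tests with f_0 + f_1 = 1 equals sum(min(p, q)) = 1 - V(P, Q).
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))      # a random distribution on 6 points
q = rng.dirichlet(np.ones(6))

affinity = np.minimum(p, q).sum()  # integral of p ∧ q
tv = 0.5 * np.abs(p - q).sum()     # total variation V(P, Q)

# the infimum is attained by the likelihood-ratio test: decide theta_1 where q > p
type_I = p[q > p].sum()            # E_{theta_0} f_0 with f_0 = 1{q > p}
type_II = q[q <= p].sum()          # E_{theta_1} f_1 with f_1 = 1{q <= p}

print(f"sum of errors of the LR test = {type_I + type_II:.4f}")
print(f"affinity = {affinity:.4f}, 1 - V = {1.0 - tv:.4f}")
```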

16 Figure: densities of N(θ_0, 1) and N(θ_1, 1); the test that compares the observation with the midpoint (θ_0 + θ_1)/2 has the shaded type I and type II error regions.

17 Testing affinity (TA)
- Testing affinity: ∫ (p_{θ_0} ∧ p_{θ_1}) dν =: ‖P_{θ_0} ∧ P_{θ_1}‖_1
- TA = E_{θ_0} 1_{p_{θ_0} ≤ p_{θ_1}} + E_{θ_1} 1_{p_{θ_1} < p_{θ_0}} = sum of the two testing errors
- If P_{θ_0} = P_{θ_1}, then TA = 1, meaning testing is impossible.
- If supp(P_{θ_0}) = [0, 1] and supp(P_{θ_1}) = [2, 3], then TA = 0, meaning a perfect test exists.
- Calculation of TA for product measures P_{θ_0} = P_0^n, P_{θ_1} = P_1^n:
  (1 − TA)² ≤ n h²(P_0, P_1), with h²(P_0, P_1) ≤ KL(P_0, P_1) ≤ χ²(P_0, P_1) for P_0 ≪ P_1.
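A small numerical sketch of the product-measure bound (not from the slides; P_0 = N(0, 1), P_1 = N(δ, 1) with δ = 0.5 are illustrative choices), using the closed-form TA and Hellinger distance for normals:

```python
import math

# Sketch (not from the slides): TA for n i.i.d. observations from P_0 = N(0, 1)
# versus P_1 = N(delta, 1), compared with the bound (1 - TA)^2 <= n h^2(P_0, P_1).
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

delta = 0.5
h2 = 2.0 * (1.0 - math.exp(-delta ** 2 / 8.0))  # squared Hellinger for one observation
for n in [1, 5, 20, 50]:
    # TA = 1 - V(P_0^n, P_1^n); for equal-variance normals V = 2 Phi(sqrt(n) delta / 2) - 1
    ta = 2.0 * Phi(-math.sqrt(n) * delta / 2.0)
    print(f"n = {n:3d}: TA = {ta:.4f}, (1 - TA)^2 = {(1.0 - ta) ** 2:.4f} <= n h^2 = {n * h2:.4f}")
```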

18 Reduction strategy
- Restrict attention to a finite subset Θ_F := {θ_α : α ∈ A} ⊂ Θ, where A is a finite index set.
- With an estimator ˆθ, define ˆα := arg min_{α∈A} L(ˆθ, θ_α).
- Assume, with a metric d and some constant δ > 0,
  inf_{ξ∈Θ} (L(ξ, θ_α) + L(ξ, θ_β)) ≥ δ d(θ_α, θ_β) > 0 for all α ≠ β ∈ A.    (2)

19 Reduction strategy
Reduce the maximum risk to the average risk over the finite sub-parameter space:
R(Θ, ˆθ) = ½ sup_{θ∈Θ} E_θ 2L(ˆθ, θ)
         ≥ ½ max_{α∈A} E_{θ_α} 2L(ˆθ, θ_α)
         ≥ ½ max_{α∈A} E_{θ_α} ( L(ˆθ, θ_α) + L(ˆθ, θ_ˆα) )   (by definition of ˆα)
         ≥ (δ/2) max_{α∈A} E_{θ_α} d(θ_ˆα, θ_α)               (by condition (2))
         ≥ (δ/(2|A|)) Σ_{α∈A} E_{θ_α} d(θ_ˆα, θ_α).           (bound the max by the average)

20 Two-point tests
Let A = {0, 1} and d(θ_α, θ_β) = 1_{θ_α ≠ θ_β} for α, β ∈ A. By the above calculation,
R(Θ, ˆθ) ≥ (δ/4) (E_{θ_0} d(θ_ˆα, θ_0) + E_{θ_1} d(θ_ˆα, θ_1))
         = (δ/4) (P_{θ_0}(ˆα = 1) + P_{θ_1}(ˆα = 0))
         ≥ (δ/4) inf { E_{θ_0} f_0 + E_{θ_1} f_1 : 0 ≤ f_0, f_1 ≤ 1, f_0 + f_1 = 1 }
         = (δ/4) ‖P_{θ_0} ∧ P_{θ_1}‖_1.
Thus
R(Θ) ≥ (δ/4) ‖P_{θ_0} ∧ P_{θ_1}‖_1 ≥ (δ/4) (1 − h(P_{θ_0}, P_{θ_1})).    (3)

21 Le Cam's Lemma
Construct A := {θ_0, θ_1} ⊂ Θ such that
1. inf_{ξ∈Θ} (L(ξ, θ_0) + L(ξ, θ_1)) ≥ δ,
2. ‖P_{θ_0} ∧ P_{θ_1}‖_1 ≥ c > 0.
Then, for every estimator ˆθ,
  sup_{θ∈Θ} E_θ L(ˆθ, θ) ≥ c δ / 4.

22 Examples
(1) Normal location model. Let Y_1, ..., Y_n ~ N(θ, 1), where θ ∈ Θ ⊆ R. Then for any estimator ˜θ,
  sup_{θ∈Θ} E_θ (˜θ − θ)² ≥ C n^{−1}.
- P_θ = P̄_θ^n (i.i.d. sample) with P̄_θ = N(θ, 1), and loss L(ˆθ, θ) = (ˆθ − θ)².
- (loss cond.) L(ξ, θ_0) + L(ξ, θ_1) ≥ ½ L(θ_0, θ_1) ≥(?) δ
- (testing cond.) V²(P_{θ_0}, P_{θ_1}) ≤ h²(P_{θ_0}, P_{θ_1}) ≤ n h²(P̄_{θ_0}, P̄_{θ_1}) ≤ n KL(P̄_{θ_0}, P̄_{θ_1}) = n (θ_1 − θ_0)²/2 ≤(?) 1/2
- Fix θ_0 ∈ R and take θ_1 = θ_0 + 1/√n. Then (2) is satisfied with δ = 1/(2n), and h²(P_{θ_0}, P_{θ_1}) ≤ 1/2. By (3), we have R(Θ) ≥ (1/(8n)) (1 − 1/√2).
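The two-point bound for this example can be evaluated numerically. The sketch below (not from the slides; constants are illustrative) computes the bound (3) with θ_1 = θ_0 + 1/√n, using the exact Hellinger tensorization from the appendix, and compares it with the risk 1/n of the sample mean:

```python
import math

# Sketch (not from the slides): Le Cam's two-point bound for Y_1, ..., Y_n ~ N(theta, 1)
# with theta_1 = theta_0 + 1/sqrt(n), compared with the risk 1/n of the sample mean.
for n in [10, 100, 1000]:
    delta = 1.0 / (2.0 * n)                               # loss separation (theta_1 - theta_0)^2 / 2
    h2_single = 2.0 * (1.0 - math.exp(-1.0 / (8.0 * n)))  # h^2 between N(theta_0,1) and N(theta_1,1)
    h2_product = 2.0 * (1.0 - (1.0 - h2_single / 2.0) ** n)  # exact tensorization (appendix lemma)
    lower = (delta / 4.0) * (1.0 - math.sqrt(h2_product))    # bound (3)
    print(f"n = {n:5d}: lower bound {lower:.2e}, risk of the sample mean 1/n = {1.0 / n:.2e}")
```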

23 (2) Nonparametric regression
Let Y_i = θ(x_i) + ε_i for i = 1, ..., n, where ε_i ~ N(0, 1), x_i = i/n, and θ ∈ Θ, with Θ the set of all twice continuously differentiable functions on [0, 1] with |θ''(x)| ≤ M. Then for any estimator ˜θ and any x_0 ∈ [0, 1],
  sup_{θ∈Θ} E_θ (˜θ(x_0) − θ(x_0))² ≥ C n^{−4/5}.
- P_θ = ⊗_{i=1}^n P_{θ,i} with P_{θ,i} = N(θ(i/n), 1), and loss L(ˆθ, θ) = (ˆθ(x_0) − θ(x_0))².
- (loss cond.) L(ξ, θ_0) + L(ξ, θ_1) ≥ ½ L(θ_0, θ_1) ≥(?) δ
- (testing cond.) h²(P_{θ_0}, P_{θ_1}) ≤ KL(P_{θ_0}, P_{θ_1}) ≤(?) 1/2

24 Let θ_0 ≡ 0 on [0, 1] and θ_1(x) = M h_n² K((x − x_0)/h_n), where
  K(t) = a exp( −1/(1 − (2t)²) ) 1_{|2t| ≤ 1},
a is a normalizing constant so that K is a kernel, and h_n = c n^{−1/5}.
- Check θ_0, θ_1 ∈ Θ.
- Check the loss condition: L(ξ, θ_0) + L(ξ, θ_1) ≥ ½ L(θ_0, θ_1) = (M²/2) h_n⁴ K²(0) = (M²/2) h_n⁴ a² exp(−2).
- Check the testing condition:
  KL(P_{θ_0}, P_{θ_1}) = ∫ Π_{i=1}^n φ(u_i) log [ Π_{i=1}^n φ(u_i) / Π_{i=1}^n φ(u_i − θ_1(x_i)) ] du_1 ... du_n
                       = ∫ Π_{i=1}^n φ(u_i) Σ_{i=1}^n ( −u_i θ_1(x_i) + ½ θ_1(x_i)² ) du_1 ... du_n

25 KL(P θ0,p θ1 ) = 1 n θ 1 (x i ) 2 2 i=1 = M 2 n h 4 2 na 2 exp i=1 M 2 2 h4 na 2 n i=1 2 ) 2 1 4( xi x 0 h n 1 {x0 hn 2 i n x 0+ hn 2 } M 2 2 h4 na 2 nh n = 1 2 a2 nh 5 n 1 { xi x 0 hn 2 } Since h n 1/5, by choosing a sufficiently small, we have the lower bound of order h 4 n n 4/5. 22 / 44

26 Multiple tests
Let A = {0, 1}^m and d(θ_α, θ_β) = H(α, β) = Σ_{k=1}^m 1_{α_k ≠ β_k} for α, β ∈ A, e.g. α = (0, 0, ..., 1, 0, 1, ..., 1).
Start from
  R(Θ, ˆθ) ≥ (δ/(2|A|)) Σ_{α∈A} E_{θ_α} d(θ_ˆα, θ_α).

27 R(Θ, ˆθ) ≥ (δ/(2|A|)) Σ_{α∈A} E_{θ_α} d(θ_ˆα, θ_α)
  = (δ/(2·2^m)) Σ_{α∈A} Σ_{k=1}^m E_{θ_α} 1_{ˆα_k ≠ α_k}
  = (δ/4) Σ_{k=1}^m [ 2^{−(m−1)} Σ_{α∈A_{0,k}} E_{θ_α} 1_{ˆα_k ≠ 0} + 2^{−(m−1)} Σ_{α∈A_{1,k}} E_{θ_α} 1_{ˆα_k ≠ 1} ]
  = (δ/4) Σ_{k=1}^m ( E_{P̄_{0,k}} 1_{ˆα_k ≠ 0} + E_{P̄_{1,k}} 1_{ˆα_k ≠ 1} )
  ≥ (δ/4) Σ_{k=1}^m ‖P̄_{0,k} ∧ P̄_{1,k}‖_1
  ≥ (δ/4) m min_{H(α,β)=1} ‖P_{θ_α} ∧ P_{θ_β}‖_1,
where P̄_{i,k} = 2^{−(m−1)} Σ_{α∈A_{i,k}} P_{θ_α} and A_{i,k} = {α ∈ A : α_k = i} for i = 0, 1.

28 Assouad's Lemma
Construct A := {θ_α : α ∈ {0, 1}^m} ⊂ Θ such that
1. inf_{ξ∈Θ} (L(ξ, θ_α) + L(ξ, θ_β)) ≥ δ H(α, β) for all α, β ∈ {0, 1}^m,
2. ‖P_{θ_α} ∧ P_{θ_β}‖_1 ≥ c > 0 whenever H(α, β) = 1.
Then, for every estimator ˆθ,
  sup_{θ∈Θ} E_θ L(ˆθ, θ) ≥ c δ m / 4.
Notation: H(α, β) = ‖α − β‖_0 = Σ_{k=1}^m 1_{α_k ≠ β_k}.

29 Examples
(3) Gaussian sequence model. Let Y_k = θ_k + σ ε_k, where ε_k ~ N(0, 1), σ > 0, and θ = (θ_1, θ_2, ...) ∈ Θ = {θ : Σ_k k^{2s} θ_k² ≤ M}, with M a constant. Then for any estimator ˜θ,
  sup_{θ∈Θ} E_θ ‖˜θ − θ‖_2² ≥ C σ^{4s/(1+2s)}.
- P_{θ_k} = N(θ_k, σ²) for k = 1, 2, ..., and P_θ = ⊗_{k∈N} P_{θ_k}.
- Loss L(θ, t) = Σ_k (θ_k − t_k)², satisfying L(θ_α, t) + L(θ_β, t) ≥ ½ L(θ_α, θ_β).

30 We need to construct θ_α ∈ Θ, α ∈ {0, 1}^m, such that
1. Σ_k (θ_{α,k} − θ_{β,k})² ≥ δ ‖α − β‖_0 for all α, β ∈ {0, 1}^m,
2. ‖P_{θ_α} ∧ P_{θ_β}‖_1 ≥ c > 0 whenever ‖α − β‖_0 = 1.
Construct a finite subset by setting
  θ_α = (θ_{α,1}, θ_{α,2}, ..., θ_{α,m}, 0, 0, ...) = ε (α_1, α_2, ..., α_m, 0, 0, ...).
Then, with δ = ½ ε²,
  ½ Σ_k (θ_{α,k} − θ_{β,k})² = ½ ε² H(α, β).

31 TA calculation: w.l.o.g. α_1 ≠ β_1; then
  ‖P_{θ_α} ∧ P_{θ_β}‖_1 = ‖(P_0 ⊗ Q) ∧ (P_ε ⊗ Q)‖_1 = ‖P_0 ∧ P_ε‖_1,
where Q is the common law of the remaining coordinates (including ⊗_{k>m} P_0). With normal distributions, taking ε = c_0 σ for σ > 0,
  ‖P_0 ∧ P_ε‖_1 = 2 ∫_{−∞}^{ε/2} φ_{σ²}(x − ε) dx = 2 (1 − Φ(ε/(2σ))) > 0.

32 Check θ_α ∈ Θ by letting m ≍ (M ε^{−2})^{1/(1+2s)}, since
  Σ_k k^{2s} θ_{α,k}² ≤ Σ_{k=1}^m k^{2s} ε² ≲ m^{1+2s} ε² ≤ M.
The lower bound is then of order m δ ≍ m ε² ≍ ε^{4s/(1+2s)} ≍ σ^{4s/(1+2s)}, which gives n^{−2s/(1+2s)} with the usual choice σ = n^{−1/2}.
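Putting the pieces together, the Assouad bound for this example can be computed numerically. A sketch (not from the slides; s = 1, M = 1, c_0 = 1 are illustrative constants, and the function name assouad_bound is my own):

```python
import math

# Sketch (not from the slides): the Assouad bound for the Gaussian sequence model,
# with illustrative constants; assouad_bound is my own helper name.
def assouad_bound(sigma, s=1.0, M=1.0, c0=1.0):
    eps = c0 * sigma
    m = max(1, int((M / eps ** 2) ** (1.0 / (1.0 + 2.0 * s))))  # number of perturbed coordinates
    delta = 0.5 * eps ** 2                                       # loss separation per flipped bit
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    c = 2.0 * (1.0 - Phi(eps / (2.0 * sigma)))                   # TA for one flipped coordinate
    return c * delta * m / 4.0                                   # Assouad: c * delta * m / 4

for n in [100, 1000, 10000]:
    sigma = n ** -0.5
    print(f"n = {n:6d}: lower bound {assouad_bound(sigma):.2e}, "
          f"rate n^(-2/3) = {n ** (-2.0 / 3.0):.2e}  (s = 1)")
```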

33 (4) Smooth density estimation
Let Y_1, ..., Y_n be an i.i.d. sample from a density f ∈ F_s, where F_s is a class of s-times differentiable densities satisfying the Hölder-type condition
  ∫ ( f^{(s−1)}(x + h) − f^{(s−1)}(x) )² dx ≤ (M |h|)².
Then
  sup_{f∈F_s} E_f ∫ (˜f(x) − f(x))² dx ≥ C n^{−2s/(1+2s)}.
- P_f = P̄_f^n, where P̄_f has density f.
- Loss L(f, g) = ∫ (f(x) − g(x))² dx, satisfying L(ξ, f_α) + L(ξ, f_β) ≥ ½ L(f_α, f_β).

34 Typical construction: small perturbations of a base density,
  f_α(x) = f_0(x) + ε Σ_{k=1}^m α_k ϕ_k(x).
For the loss separation condition,
  ½ ∫ (f_α(x) − f_β(x))² dx = ½ ε² ∫ ( Σ_{k=1}^m (α_k − β_k) ϕ_k(x) )² dx ≥(?) δ Σ_{k=1}^m (α_k − β_k)².
This is convenient if ∫ ϕ_k ϕ_{k'} = V 1_{k=k'}, so that δ = V ε²/2.

35 Ibragimov and Has'minskii (1983)
Construct perturbed uniform densities on [0, 1]:
  f_0(x) = 1,  ϕ_k(x) = ϕ(mx − k) for k = 1, ..., m − 1,
where
  ϕ(x) = c_0 exp( −1/x − 1/(1 − 2x) ) 1_{0 ≤ x ≤ 1/2},  extended by ϕ(−x) = −ϕ(x),
and c_0 is small enough that ϕ ∈ F.
Perturbation function ϕ: infinitely differentiable, with ∫_{−1/2}^{1/2} ϕ(x) dx = 0, so
  ∫_0^1 f_α(x) dx = ∫_0^1 [ 1 + ε Σ_k α_k ϕ_k(x) ] dx = 1,  with f_α(x) ≥ c_1 > 0.
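A small numerical sketch of this construction (not from the slides; the constants c_0, ε, m and the grid are illustrative) verifies that f_α integrates to one and stays bounded away from zero:

```python
import numpy as np

# Sketch (not from the slides): the perturbed uniform densities f_alpha on [0, 1];
# the constants c0, eps, m and the grid are illustrative.
c0, eps, m = 0.05, 0.5, 8

def phi(x):
    # odd, infinitely differentiable bump supported on [-1/2, 1/2], with integral 0
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    pos = (x > 0.0) & (x < 0.5)
    out[pos] = c0 * np.exp(-1.0 / x[pos] - 1.0 / (1.0 - 2.0 * x[pos]))
    neg = (x > -0.5) & (x < 0.0)
    out[neg] = -c0 * np.exp(1.0 / x[neg] - 1.0 / (1.0 + 2.0 * x[neg]))
    return out

def f_alpha(x, alpha):
    # perturbed uniform density: 1 + eps * sum_k alpha_k * phi(m x - k)
    pert = sum(a_k * phi(m * x - k) for k, a_k in enumerate(alpha, start=1))
    return 1.0 + eps * pert

rng = np.random.default_rng(1)
alpha = rng.integers(0, 2, size=m - 1)             # a vertex of {0, 1}^(m-1)
x = np.linspace(0.0, 1.0, 100001)
fx = f_alpha(x, alpha)
print("average of f_alpha on [0,1] ~", fx.mean())  # integral should be ~ 1
print("minimum of f_alpha          ~", fx.min())   # bounded away from 0
```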

36 Figure: the perturbation function ϕ.

37 Figure: the perturbation added on the first interval.

38 Figure: the perturbation added on every interval, giving a perturbed density f_α.

39 If ∫ ϕ_k ϕ_{k'} = 1_{k=k'} C/m, then δ = C ε²/(2m).
For the testing condition with ‖α − β‖_0 = 1, we need (per observation)
  χ²(P_{f_β}, P_{f_α}) := ∫ (f_α − f_β)²/f_α dx ≤ (1/c_1) ∫ (f_α − f_β)² dx ≍ (1/c_1) δ ‖α − β‖_0 = δ/c_1 ≲ 1/n,
that is, δ = C ε²/(2m) ≲ 1/n, which yields a lower bound of order m δ ≍ ε² ≍ m/n up to constants.

40 Check f_α ∈ F_s. Since ϕ_k^{(s−1)}(x) = m^{s−1} ϕ^{(s−1)}(mx − k),
  ∫ ( f_α^{(s−1)}(x + h) − f_α^{(s−1)}(x) )² dx = ε² ∫ ( Σ_k α_k ( ϕ_k^{(s−1)}(x + h) − ϕ_k^{(s−1)}(x) ) )² dx
    ≲ ε² m^{2s−2} ∫ ( ϕ^{(s−1)}(x + mh) − ϕ^{(s−1)}(x) )² dx
    ≤ ε² m^{2s−2} (M m|h|)²   (since ϕ ∈ F)
    = M² ε² m^{2s} h² ≤(?) (M |h|)²,
which holds if m ≲ ε^{−1/s}.
From the testing condition, 1/n ≍ ε²/m ≍ ε^{2+1/s}; that is, ε ≍ n^{−s/(1+2s)}. The lower bound is then ε² ≍ n^{−2s/(1+2s)}.
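The rate bookkeeping in the last two lines can be checked numerically: with ε ≍ n^{−s/(1+2s)} and m ≍ ε^{−1/s}, both the smoothness constraint ε² m^{2s} ≲ 1 and the testing constraint n ε²/m ≲ 1 hold, so the lower bound ε² ≍ n^{−2s/(1+2s)} is achieved. A sketch (not from the slides; the values of s and n are illustrative):

```python
# Sketch (not from the slides): checking the rate bookkeeping. With eps ~ n^(-s/(1+2s))
# and m ~ eps^(-1/s), both constraints stay O(1), so the bound eps^2 ~ n^(-2s/(1+2s)) holds.
for s in [1, 2]:
    for n in [10 ** 4, 10 ** 6]:
        eps = n ** (-s / (1.0 + 2.0 * s))
        m = max(1, round(eps ** (-1.0 / s)))
        smoothness = eps ** 2 * m ** (2 * s)   # should stay O(1)  (eps^2 m^{2s} <= const)
        testing = n * eps ** 2 / m             # should stay O(1)  (eps^2 / m <= const / n)
        print(f"s = {s}, n = {n:>7}: eps^2 = {eps ** 2:.2e}, "
              f"eps^2 m^(2s) = {smoothness:.2f}, n eps^2 / m = {testing:.2f}")
```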

41 Lemma
V(P, Q) = ½ ∫ |p − q| dν = 1 − ∫ (p ∧ q) dν.
Proof.
1. Let A_0 = {x ∈ X : p(x) ≥ q(x)}. Then V ≥ P(A_0) − Q(A_0) = ½ ∫ |p − q| dν = 1 − ∫ (p ∧ q) dν.
2. For all A ∈ A,
   ∫_A (p − q) dν = ∫_{A ∩ A_0} (p − q) + ∫_{A ∩ A_0^c} (p − q) ≤ max{ ∫_{A_0} (p − q), ∫_{A_0^c} (q − p) } = ½ ∫ |p − q|.
Combining 1 and 2 proves the claim.

42 Lemma
½ h²(P, Q) ≤ V(P, Q) ≤ h(P, Q) √(1 − h²(P, Q)/4).
Proof.
1. ½ h²(P, Q) = 1 − ∫ √(pq) = 1 − ( ∫_{p>q} √(pq) + ∫_{q≥p} √(pq) ) ≤ 1 − ( ∫_{p>q} q + ∫_{q≥p} p ) = 1 − ∫ (p ∧ q) = V(P, Q).
2. (1 − ½ h²(P, Q))² = ( ∫ √((p ∧ q)(p ∨ q)) )² ≤ ∫ (p ∧ q) · ∫ (p ∨ q) = (1 − V)(1 + V) = 1 − V².
Thus V(P, Q)² ≤ 1 − (1 − ½ h²(P, Q))² = h²(P, Q) (1 − ¼ h²(P, Q)).

43 Lemma
h²(P, Q) ≤ KL(P, Q) ≤ χ²(P, Q).
Proof.
1. KL(P, Q) = ∫_{pq>0} p log(p/q) = −2 ∫_{pq>0} p log √(q/p) ≥ −2 ∫_{pq>0} p ( √(q/p) − 1 ) = −2 ∫ √(pq) + 2 = h²(P, Q).
2. KL(P, Q) = ∫_{pq>0} p log(p/q) ≤ ∫_{pq>0} p ( p/q − 1 ) ≤ ∫ p²/q dν − 1 = χ²(P, Q).

44 Lemma
h²(Pⁿ, Qⁿ) = 2 ( 1 − (1 − h²(P, Q)/2)ⁿ ) ≤ n h²(P, Q).
Proof.
1. 1 − ½ h²(Pⁿ, Qⁿ) = ∫ √(pⁿ qⁿ) (the product densities) = ( ∫ √(pq) )ⁿ = ( 1 − ½ h²(P, Q) )ⁿ.
2. Use (1 − a)ⁿ ≥ 1 − na for 0 ≤ a < 1.

45 Lemma (Pinsker)
V(P, Q) ≤ √( KL(P, Q)/2 ).
Proof.
1. Let ψ(x) = x log x − x + 1 for x ≥ 0, with 0 log 0 := 0. Note that ψ(0) = 1, ψ(1) = 0, ψ'(1) = 0, ψ''(x) = 1/x ≥ 0, and ψ(x) ≥ 0 for all x ≥ 0.
2. Use (4/3 + 2x/3) ψ(x) ≥ (x − 1)² for x ≥ 0. If P ≪ Q,
   V(P, Q) = ½ ∫ |p − q| = ½ ∫_{q>0} q |p/q − 1|
           ≤ ½ ∫_{q>0} q √( (4/3 + 2p/(3q)) ψ(p/q) )
           ≤ ½ ( ∫_{q>0} (4q/3 + 2p/3) )^{1/2} ( ∫_{q>0} q ψ(p/q) )^{1/2}
           = ½ √2 · ( KL(P, Q) )^{1/2} = √( KL(P, Q)/2 ).

46 Figure: ψ(x) = x log x − x + 1 compared with (x − 1)² and with (4/3 + 2x/3) ψ(x), illustrating the inequality (4/3 + 2x/3) ψ(x) ≥ (x − 1)².

47 Lemma
1 − V(P, Q) ≥ ½ exp( −KL(P, Q) ).
Proof.
1. W.l.o.g. KL(P, Q) < ∞ (i.e. P ≪ Q).
2. ( ∫ √(pq) )² = exp( 2 log ∫_{pq>0} p √(q/p) ) ≥ exp( 2 ∫_{pq>0} p log √(q/p) ) = exp( −KL(P, Q) ).
3. Use 2 ∫ (p ∧ q) ≥ ∫ (p ∧ q) · ∫ (p ∨ q) ≥ ( ∫ √((p ∧ q)(p ∨ q)) )² = ( ∫ √(pq) )² ≥ exp( −KL(P, Q) ), together with 2 ∫ (p ∧ q) = 2 (1 − V(P, Q)).

48 Theorem
Suppose Π is a distribution on Θ such that ∫ R(θ, δ_Π) dΠ(θ) = sup_θ R(θ, δ_Π). Then
1. δ_Π is minimax.
2. Π is least favourable.
Proof.
(i) Let δ be any other procedure. Then sup_θ R(θ, δ) ≥ ∫ R(θ, δ) dΠ(θ) ≥ ∫ R(θ, δ_Π) dΠ(θ) = sup_θ R(θ, δ_Π).
(ii) Let Π' be some other distribution on Θ. Then ∫ R(θ, δ_{Π'}) dΠ'(θ) ≤ ∫ R(θ, δ_Π) dΠ'(θ) ≤ sup_θ R(θ, δ_Π) = ∫ R(θ, δ_Π) dΠ(θ).

49 Theorem
Suppose {Π_n} is a sequence of prior distributions with Bayes risks ∫ R(θ, δ_n) dΠ_n(θ) =: r_{Π_n}, and that δ is an estimator for which sup_θ R(θ, δ) = lim_n r_{Π_n}. Then
1. δ is minimax.
2. The sequence {Π_n} is least favourable.
Proof.
(i) Suppose δ' is any other estimator. Then sup_θ R(θ, δ') ≥ ∫ R(θ, δ') dΠ_n(θ) ≥ r_{Π_n}, and this holds for every n. Hence sup_θ R(θ, δ') ≥ sup_θ R(θ, δ), so δ is minimax.
(ii) If Π is any distribution, then ∫ R(θ, δ_Π) dΠ(θ) ≤ ∫ R(θ, δ) dΠ(θ) ≤ sup_θ R(θ, δ) = lim_n r_{Π_n}. The claim holds by definition.
