Theoretical Statistics. Lecture 14.

Peter Bartlett

Metric entropy.
1. Chaining: Dudley's entropy integral

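Recall: Sub-Gaussian processes

Definition: A stochastic process $\theta \mapsto X_\theta$ with indexing set $T$ is sub-Gaussian with respect to a metric $d$ on $T$ if, for all $\theta, \theta' \in T$ and all $\lambda \in \mathbb{R}$,
\[ E \exp\bigl(\lambda (X_\theta - X_{\theta'})\bigr) \le \exp\left( \frac{\lambda^2 d(\theta,\theta')^2}{2} \right). \]

Lemma [Finite Classes]: For $X_\theta$ sub-Gaussian wrt $d$ on $T$, and $A$ a set of pairs from $T$,
\[ E \max_{(\theta,\theta') \in A} (X_\theta - X_{\theta'}) \le \max_{(\theta,\theta') \in A} d(\theta,\theta') \sqrt{2 \log |A|}. \]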
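As a quick sanity check of the Finite Classes lemma, the following sketch (not part of the lecture) simulates $m$ independent standard Gaussians $X_1, \dots, X_m$ together with the constant $X_0 = 0$. Each pair $(i, 0)$ then has $d = 1$, so the lemma predicts $E \max_i X_i \le \sqrt{2 \log m}$. The function name, the value of $m$, the number of trials and the seed are all illustrative assumptions.

```python
import math
import random

def mean_max_gaussian(m, trials, seed=0):
    """Monte Carlo estimate of E max_{i<=m} X_i for independent N(0,1) X_i."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(m))
    return total / trials

m = 50
estimate = mean_max_gaussian(m, trials=2000)
# Lemma bound: (max d) * sqrt(2 log|A|) with d = 1 and |A| = m pairs (i, 0).
bound = math.sqrt(2.0 * math.log(m))
print(f"estimate={estimate:.3f}  bound={bound:.3f}")
```

The estimate should sit below the bound; the gap reflects that the lemma holds for arbitrary dependence between the increments, not only for independent ones.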

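Recall: Covering number bound

Theorem: Consider a zero-mean process $X_\theta$ that is sub-Gaussian wrt the metric $d$ on $T$. Suppose that the diameter of $T$ is $D = \sup_{\theta,\theta'} d(\theta,\theta')$. Then for any $\epsilon$,
\[ E \sup_\theta X_\theta \le 2 E \sup_{d(\theta,\theta') \le \epsilon} (X_\theta - X_{\theta'}) + 2 D \sqrt{\log N(\epsilon, T, d)}. \]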

Dudley's entropy integral

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then
\[ E \sup_\theta X_\theta \le 8\sqrt{2} \int_0^\infty \sqrt{\log N(\epsilon, T, d)} \, d\epsilon. \]
Note that we can always rewrite the integral as an integral from $0$ to the diameter of $T$.
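To get a feel for the size of the entropy integral, here is a minimal numerical sketch. It assumes a hypothetical entropy curve $\log N(\epsilon) = \log(1/\epsilon)$ on $(0, 1]$ (roughly the behaviour of an interval), for which the integral has the closed form $\Gamma(3/2) = \sqrt{\pi}/2$. The helper `entropy_integral` and the midpoint rule are illustrative choices, not something from the lecture.

```python
import math

def entropy_integral(log_cover, upper, steps=100_000):
    """Midpoint-rule approximation of int_0^upper sqrt(log N(eps)) d eps.

    log_cover(eps) must return log N(eps) >= 0 for 0 < eps <= upper.
    """
    h = upper / steps
    return h * sum(math.sqrt(log_cover((j + 0.5) * h)) for j in range(steps))

# Hypothetical entropy curve: log N(eps) = log(1/eps) on (0, 1].
integral = entropy_integral(lambda eps: math.log(1.0 / eps), upper=1.0)
dudley_bound = 8.0 * math.sqrt(2.0) * integral  # right-hand side of the theorem
print(integral, dudley_bound)
```

The integrand blows up as $\epsilon \to 0$, but only logarithmically, so the integral is finite and the midpoint rule handles it without special treatment.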

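Dudley's entropy integral: Proof

As before,
\[ E \sup_\theta X_\theta = E \sup_\theta (X_\theta - X_{\theta'}) \le E \sup_{\theta,\theta'} (X_\theta - X_{\theta'}), \]
and choosing $\hat\theta \in \hat T$ (a minimal $\epsilon$-cover) with $d(\hat\theta, \theta) \le \epsilon$ (and similarly $\hat\theta'$ for $\theta'$), we have
\begin{align*}
X_\theta - X_{\theta'} &= X_\theta - X_{\hat\theta} + X_{\hat\theta} - X_{\hat\theta'} + X_{\hat\theta'} - X_{\theta'} \\
&\le 2 \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + \sup_{\hat\theta,\hat\theta' \in \hat T} (X_{\hat\theta} - X_{\hat\theta'}).
\end{align*}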

Dudley's entropy integral: Proof

Consider bounding $E \sup_{\hat\theta,\hat\theta'} (X_{\hat\theta} - X_{\hat\theta'})$. Previously, we bounded the supremum over the $\epsilon$-cover $\hat T$ (for which the diameter is that of $T$). Instead, we consider a sequence of progressively better approximations to elements of $\hat T$ (which leads to sets with progressively smaller diameters).

Suppose the diameter of $\hat T$ is $D$. We first define $\hat T_k = \hat T$, and think of it as a $(2^{-k} D)$-cover of $\hat T$, where $k = \lceil \log_2(D/\epsilon) \rceil$ ensures that $2^{-k} D \le \epsilon$. Then we define
\[ \hat T_{i-1} = \text{a minimal } (2^{-(i-1)} D)\text{-cover of } \hat T_i, \]
for $i$ going from $k$ down to $1$, so that $\hat T_{k-1}, \dots, \hat T_0$ are defined in turn. Notice that $\hat T_0$ is a minimal $D$-cover of $\hat T_1$, so $|\hat T_0| = 1$.

[PICTURE]

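Dudley's entropy integral: Proof

Pick $\hat\theta_k = \hat\theta$, and then pick $\hat\theta_{i-1} \in \hat T_{i-1}$ as the best approximation of $\hat\theta_i$. We can write $\hat\theta_{i-1} = f_{i-1}(\hat\theta_i)$, where $f_{i-1} : \hat T_i \to \hat T_{i-1}$ is the best approximation operator. Then we can write
\[ X_{\hat\theta} = X_{\hat\theta_k} = X_{\hat\theta_0} + \sum_{i=1}^k (X_{\hat\theta_i} - X_{\hat\theta_{i-1}}) \]
and, using the same notation for $\hat\theta'$ (and noting that $\hat\theta_0 = \hat\theta'_0$ because $\hat T_0$ has a single element), we have
\[ X_{\hat\theta} - X_{\hat\theta'} = X_{\hat\theta_k} - X_{\hat\theta'_k} = \sum_{i=1}^k (X_{\hat\theta_i} - X_{\hat\theta_{i-1}}) - \sum_{i=1}^k (X_{\hat\theta'_i} - X_{\hat\theta'_{i-1}}). \]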
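The nested covers and best-approximation maps can be sketched on a toy example. The code below works on a finite subset of the real line with $d(x, y) = |x - y|$ and uses a greedy cover in place of a minimal one; that substitution is an assumption made for simplicity and does not affect the two properties checked here (the coarsest level has one element, and each link of the chain has length at most $2^{-(i-1)} D$). All function names are hypothetical.

```python
import math

def greedy_cover(points, radius):
    """Greedy radius-cover of a finite subset of the real line (not necessarily minimal)."""
    centers = []
    for x in points:
        if all(abs(x - c) > radius for c in centers):
            centers.append(x)
    return centers

def nearest(x, centers):
    """Best-approximation map f_{i-1}: closest element of the coarser cover."""
    return min(centers, key=lambda c: abs(x - c))

def build_chain_levels(fine_cover, eps):
    """Return (D, k, [T_0, ..., T_k]) with T_k = fine_cover and T_{i-1} a cover of T_i."""
    D = max(fine_cover) - min(fine_cover)
    k = max(1, math.ceil(math.log2(D / eps)))
    levels = [None] * (k + 1)
    levels[k] = list(fine_cover)
    for i in range(k, 0, -1):
        levels[i - 1] = greedy_cover(levels[i], 2.0 ** (-(i - 1)) * D)
    return D, k, levels

fine = [j / 64 for j in range(65)]  # stand-in for the eps-cover T-hat
D, k, levels = build_chain_levels(fine, eps=1 / 64)

# Follow one chain theta_k -> theta_{k-1} -> ... -> theta_0 and record link lengths.
theta = fine[37]
links = []
for i in range(k, 0, -1):
    coarser = nearest(theta, levels[i - 1])
    links.append((i, abs(theta - coarser)))
    theta = coarser
print(k, [len(level) for level in levels], links)
```

Each recorded pair `(i, length)` corresponds to one term $X_{\hat\theta_i} - X_{\hat\theta_{i-1}}$ of the telescoping sum, and the chain always ends at the single point of the coarsest level.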

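Dudley's entropy integral: Proof

Thus,
\[ E \sup_{\hat\theta,\hat\theta' \in \hat T} (X_{\hat\theta} - X_{\hat\theta'}) \le 2 \sum_{i=1}^k E \sup_{\hat\theta_i \in \hat T_i} \bigl( X_{\hat\theta_i} - X_{f_{i-1}(\hat\theta_i)} \bigr). \]
Since $d(\hat\theta_i, \hat\theta_{i-1}) \le 2^{-(i-1)} D$, the Finite Lemma shows that
\begin{align*}
E \sup_{\hat\theta_i \in \hat T_i} \bigl( X_{\hat\theta_i} - X_{f_{i-1}(\hat\theta_i)} \bigr) &\le 2^{-(i-1)} D \sqrt{2 \log |\hat T_i|} \\
&\le 2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D, T)}.
\end{align*}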

Dudley's entropy integral: Proof

Finally, since $\log N(2^{-i} D) \le \log N(u)$ for $u \le 2^{-i} D$, we can bound the area of the rectangle from $(2^{-(i+1)} D, 0)$ to $(2^{-i} D, \sqrt{2 \log N(2^{-i} D)})$ by the integral under $\sqrt{2 \log N(u)}$ for $u$ in that interval (which has length $2^{-(i+1)} D$):
\begin{align*}
2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D)} &= 4 \cdot 2^{-(i+1)} D \sqrt{2 \log N(2^{-i} D)} \\
&\le 4 \int_{2^{-(i+1)} D}^{2^{-i} D} \sqrt{2 \log N(u, T)} \, du.
\end{align*}
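The rectangle-versus-integral comparison can be checked numerically. The sketch below assumes a hypothetical nonincreasing entropy curve $N(u) = (1 + 2/u)^p$ and compares the chaining sum $\sum_{i=1}^k 2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D)}$ with $4\sqrt{2} \int_{2^{-(k+1)} D}^{D/2} \sqrt{\log N(u)} \, du$, which is what summing the inequality over $i$ gives. The function names and the values of $D$, $k$ and $p$ are illustrative.

```python
import math

def sqrt_log_cover(u, p=1):
    """Hypothetical nonincreasing entropy curve: sqrt(log N(u)) with N(u) = (1 + 2/u)^p."""
    return math.sqrt(p * math.log(1.0 + 2.0 / u))

def chaining_sum(D, k):
    """sum_{i=1}^k 2^{-(i-1)} D sqrt(2 log N(2^{-i} D))."""
    return sum(
        2.0 ** (-(i - 1)) * D * math.sqrt(2.0) * sqrt_log_cover(2.0 ** (-i) * D)
        for i in range(1, k + 1)
    )

def integral_bound(D, k, steps=100_000):
    """4 sqrt(2) * int_{2^{-(k+1)} D}^{D/2} sqrt(log N(u)) du via the midpoint rule."""
    lo, hi = 2.0 ** (-(k + 1)) * D, D / 2.0
    h = (hi - lo) / steps
    area = h * sum(sqrt_log_cover(lo + (j + 0.5) * h) for j in range(steps))
    return 4.0 * math.sqrt(2.0) * area

D, k = 2.0, 5
lhs = chaining_sum(D, k)
rhs = integral_bound(D, k)
print(lhs, rhs)
```

Any nonincreasing curve would do here; the inequality uses only monotonicity of $N$.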

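Dudley's entropy integral: Proof

Combining, we have
\begin{align*}
E \sup_\theta X_\theta &\le 2 E \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + 2 \sum_{i=1}^k E \sup_{\hat\theta_i \in \hat T_i} \bigl( X_{\hat\theta_i} - X_{f_{i-1}(\hat\theta_i)} \bigr) \\
&\le 2 E \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + 2 \sum_{i=1}^k 2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D, T)} \\
&\le 2 E \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + 8\sqrt{2} \int_{2^{-(k+1)} D}^{D/2} \sqrt{\log N(u, T)} \, du.
\end{align*}
When $\epsilon \to 0$, the first term goes to zero and (since $k = \lceil \log_2(D/\epsilon) \rceil$) the second term approaches the integral from $0$ to $D/2$, which gives the result.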

Dudley's entropy integral

We actually proved the following result:

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then
\[ E \sup_\theta X_\theta \le 2 E \sup_{d(\theta,\theta') \le \delta} (X_\theta - X_{\theta'}) + 8\sqrt{2} \int_{\delta/2}^{D/2} \sqrt{\log N(\epsilon, T, d)} \, d\epsilon. \]
When the entropy integral does not exist (because $N(\epsilon, T, d)$ grows too quickly as $\epsilon \to 0$), this can still give a useful bound.

Dudley's entropy integral

When does the entropy integral exist? Suppose $T$ has diameter $D$ and $\log N(\epsilon, T, d) = O(\epsilon^{-d})$. Then
\[ \int_0^D \sqrt{\log N(\epsilon, T, d)} \, d\epsilon \le C \int_0^D \epsilon^{-d/2} \, d\epsilon = \frac{C}{1 - d/2} D^{1 - d/2}, \]
provided that $d < 2$. The integral does not exist otherwise.
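The role of the threshold $d = 2$ is easy to see from the truncated integral $\int_\delta^D \epsilon^{-d/2} \, d\epsilon$, which has a closed form. The sketch below (an illustration only; the function name is hypothetical, and `d` is the exponent in $\log N(\epsilon) = O(\epsilon^{-d})$, not the metric) evaluates it for shrinking $\delta$.

```python
import math

def truncated_entropy_integral(d, D, delta):
    """Closed form of int_delta^D eps^(-d/2) d eps for 0 < delta < D."""
    if d == 2:
        return math.log(D / delta)
    a = 1.0 - d / 2.0
    return (D ** a - delta ** a) / a

D = 1.0
for d in (1, 2, 3):
    values = [truncated_entropy_integral(d, D, 10.0 ** (-j)) for j in (2, 4, 8)]
    print(d, [round(v, 3) for v in values])
```

For $d < 2$ the values converge to $D^{1 - d/2} / (1 - d/2)$ as $\delta \to 0$; for $d = 2$ they grow like $\log(1/\delta)$, and for $d > 2$ like $\delta^{1 - d/2}$.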

Entropy Integral: Lipschitz parameterized class

Suppose that $F$ is a parameterized class, $F = \{ f(\theta, \cdot) : \theta \in \Theta \}$, where $\Theta = B_2 \subset \mathbb{R}^p$. The parameterization is $L$-Lipschitz wrt Euclidean distance on $\Theta$, so that for all $x$,
\[ |f(\theta, x) - f(\theta', x)| \le L \|\theta - \theta'\|_2. \]
Suppose also that $F = -F$ (that is, $F$ is closed under negations).

Theorem:
\[ E \|R_n\|_F = O\left( L \sqrt{\frac{p}{n}} \right). \]
NB: We've lost the log factor.
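As a concrete instance, consider the linear class $f(\theta, x) = \langle \theta, x \rangle$ with $\|\theta\|_2 \le 1$ and inputs of unit norm, which is $1$-Lipschitz in $\theta$ and closed under negation. For this class the supremum is available in closed form, $\sup_\theta |\frac{1}{n} \sum_i \epsilon_i \langle \theta, x_i \rangle| = \| \sum_i \epsilon_i x_i \|_2 / n$, so the Rademacher complexity can be estimated by Monte Carlo. The sketch below is an illustration under these assumptions; the data, seeds and function names are made up for the example, and the inputs are held fixed.

```python
import math
import random

def rademacher_linear(xs, trials, seed=0):
    """Monte Carlo estimate of E sup_{|theta|_2 <= 1} |(1/n) sum_i eps_i <theta, x_i>|.

    For the linear class the supremum equals |sum_i eps_i x_i|_2 / n.
    """
    rng = random.Random(seed)
    n, p = len(xs), len(xs[0])
    total = 0.0
    for _ in range(trials):
        s = [0.0] * p
        for x in xs:
            sign = 1.0 if rng.random() < 0.5 else -1.0
            for j in range(p):
                s[j] += sign * x[j]
        total += math.sqrt(sum(v * v for v in s)) / n
    return total / trials

def unit_vector(rng, p):
    """Random direction on the unit sphere in R^p."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

p, n, L = 5, 200, 1.0
data_rng = random.Random(1)
xs = [unit_vector(data_rng, p) for _ in range(n)]
estimate = rademacher_linear(xs, trials=500)
rate = L * math.sqrt(p / n)  # the theorem's rate, without its constant
print(estimate, rate)
```

For this particular class the estimate is in fact of order $1/\sqrt{n}$, below the general $L\sqrt{p/n}$ rate, which only uses the Lipschitz parameterization.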

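Entropy Integral: Lipschitz parameterized class

Recall that
\[ n E \|R_n\|_F = E \sup_{f \in F} |\langle \epsilon, f(X_1^n) \rangle| = E \sup_{f \in F} \langle \epsilon, f(X_1^n) \rangle = E \sup_\theta \langle \epsilon, f(\theta, X_1^n) \rangle, \]
which is sub-Gaussian wrt the Euclidean distance on $\mathbb{R}^n$. Also, recall that
\[ N(\delta, f(\Theta, X_1^n), \|\cdot\|_2) \le N(\delta / (L\sqrt{n}), \Theta, \|\cdot\|_2) \le (1 + 2L\sqrt{n}/\delta)^p. \]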

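Entropy Integral: Lipschitz parameterized class

Hence,
\begin{align*}
E \|R_n\|_F &\le \frac{8\sqrt{2}}{n} \int_0^\infty \sqrt{\log N\left( \frac{\epsilon}{L\sqrt{n}}, \Theta, \|\cdot\|_2 \right)} \, d\epsilon \\
&= \frac{8\sqrt{2} L}{\sqrt{n}} \int_0^\infty \sqrt{\log N(\epsilon, \Theta, \|\cdot\|_2)} \, d\epsilon \\
&\le \frac{8\sqrt{2p} L}{\sqrt{n}} \int_0^2 \sqrt{\log\left( 1 + \frac{2}{\epsilon} \right)} \, d\epsilon \\
&\le \frac{8\sqrt{2p} L}{\sqrt{n}} \int_0^2 \sqrt{\log\left( \frac{4}{\epsilon} \right)} \, d\epsilon.
\end{align*}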

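Entropy Integral: Lipschitz parameterized class

Integrating by parts (with the substitution $y = \sqrt{\log(4/\epsilon)}$, so that $\epsilon = 4 e^{-y^2}$),
\begin{align*}
E \|R_n\|_F &\le \frac{8\sqrt{2p} L}{\sqrt{n}} \int_0^2 \sqrt{\log\left( \frac{4}{\epsilon} \right)} \, d\epsilon \\
&= \frac{8\sqrt{2p} L}{\sqrt{n}} \left( \left[ -4 e^{-y^2} y \right]_{\sqrt{\log 2}}^{\infty} + 4 \int_{\sqrt{\log 2}}^{\infty} e^{-y^2} \, dy \right) \\
&\le 16 \left( \sqrt{2 \log 2} + \sqrt{2\pi} \right) \frac{\sqrt{p} L}{\sqrt{n}} \\
&< 59 L \sqrt{\frac{p}{n}}.
\end{align*}
The third line uses $[-4 e^{-y^2} y]_{\sqrt{\log 2}}^{\infty} = 2\sqrt{\log 2}$ and $\int_{\sqrt{\log 2}}^{\infty} e^{-y^2} \, dy \le \sqrt{\pi}/2$.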
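The integration-by-parts step can be verified numerically. The sketch below compares a midpoint-rule value of $\int_0^2 \sqrt{\log(4/\epsilon)} \, d\epsilon$ with the closed form $2\sqrt{\log 2} + 2\sqrt{\pi} \, \mathrm{erfc}(\sqrt{\log 2})$ obtained from the boundary term and the Gaussian tail integral, and evaluates the constant $16(\sqrt{2 \log 2} + \sqrt{2\pi})$. The helper name and step count are illustrative choices.

```python
import math

def midpoint_integral(f, lo, hi, steps=200_000):
    """Midpoint-rule approximation of int_lo^hi f."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (j + 0.5) * h) for j in range(steps))

numeric = midpoint_integral(lambda e: math.sqrt(math.log(4.0 / e)), 0.0, 2.0)

a = math.sqrt(math.log(2.0))
# With y = sqrt(log(4/eps)):
#   [-4 y e^{-y^2}] from a to infinity equals 2a, and
#   4 * int_a^inf e^{-y^2} dy equals 2 sqrt(pi) erfc(a).
closed_form = 2.0 * a + 2.0 * math.sqrt(math.pi) * math.erfc(a)

# Constant obtained after bounding the tail integral by sqrt(pi)/2.
constant = 16.0 * (math.sqrt(2.0 * math.log(2.0)) + math.sqrt(2.0 * math.pi))
print(numeric, closed_form, 8.0 * math.sqrt(2.0) * closed_form, constant)
```

Bounding the tail integral by the full half-line integral is what makes the final constant roughly twice the value obtained from the exact integral.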

Entropy Integral: VC-class

Theorem: For $F$ a class of $\{0, 1\}$-valued functions with VC-dimension $d$,
\[ E \|R_n\|_F = O\left( \sqrt{\frac{d}{n}} \right). \]
Compare with the consequence of Sauer's Lemma: $O(\sqrt{d \log(n/d) / n})$. We lose the log factor.

Note: This leads to a faster rate (without the log factor) in the proof of the Glivenko-Cantelli Theorem:
\[ \Pr\left( \|F_n - F\|_\infty \ge \frac{c}{\sqrt{n}} + t \right) \le 2 \exp\left( -\frac{n t^2}{8} \right). \]

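Entropy Integral: VC-class

We have
\begin{align*}
E \|R_n\|_F &\le \frac{8\sqrt{2}}{n} E \int_0^{2\sqrt{n}} \sqrt{\log N(\epsilon, F(X_1^n), \|\cdot\|_2)} \, d\epsilon \\
&\le \frac{8\sqrt{2}}{n} E \int_0^{2\sqrt{n}} \sqrt{\log N(\epsilon/\sqrt{n}, F, \|\cdot\|_{L_2(P_n)})} \, d\epsilon \\
&= \frac{8\sqrt{2}}{\sqrt{n}} E \int_0^{2} \sqrt{\log N(\epsilon, F, \|\cdot\|_{L_2(P_n)})} \, d\epsilon,
\end{align*}
where
\[ \|f - g\|_{L_2(P_n)}^2 = \frac{1}{n} \sum_{i=1}^n (f(X_i) - g(X_i))^2. \]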

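Entropy Integral: VC-class

Fact (due to Haussler):
\[ N(\epsilon, F, \|\cdot\|_{L_2(P_n)}) \le c \, d \, (16e)^d \epsilon^{-2d}. \]
Hence,
\begin{align*}
E \|R_n\|_F &\le \frac{8\sqrt{2}}{\sqrt{n}} E \int_0^2 \sqrt{\log N(\epsilon, F, \|\cdot\|_{L_2(P_n)})} \, d\epsilon \\
&\le \frac{8\sqrt{2}}{\sqrt{n}} E \int_0^2 \sqrt{\log\left( c \, d \, (16e)^d \epsilon^{-2d} \right)} \, d\epsilon \\
&\le c' \sqrt{\frac{d}{n}},
\end{align*}
where $c'$ is a universal constant (not the same as the $c$ in Haussler's bound).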
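To see why the last integral scales like $\sqrt{d}$, one can evaluate $\int_0^2 \sqrt{\log(c \, d \, (16e)^d \epsilon^{-2d})} \, d\epsilon$ numerically and divide by $\sqrt{d}$. The sketch below does this under the assumption $c = 1$ (the value of the constant in Haussler's bound is not given here, so this is a placeholder); the function name is hypothetical.

```python
import math

def vc_entropy_integral(d, c=1.0, steps=100_000):
    """Midpoint approximation of int_0^2 sqrt(log(c d (16e)^d eps^(-2d))) d eps."""
    h = 2.0 / steps
    log_const = math.log(c * d) + d * math.log(16.0 * math.e)
    total = 0.0
    for j in range(steps):
        eps = (j + 0.5) * h
        # log(c d (16e)^d eps^(-2d)) = log(c d) + d log(16e) - 2 d log(eps)
        total += math.sqrt(log_const - 2.0 * d * math.log(eps))
    return h * total

ratios = {d: vc_entropy_integral(d) / math.sqrt(d) for d in (1, 10, 100, 1000)}
for d, r in ratios.items():
    print(d, round(r, 3))
```

The ratio stays within a narrow range as $d$ varies because the only term not proportional to $d$ inside the logarithm is $\log(c \, d)$, which is of lower order.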

An aside: Generic Chaining

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then for any probability distribution $\mu$ on $T$,
\[ E \sup_\theta X_\theta \le c \sup_{\theta \in T} \int_0^\infty \sqrt{\log \frac{1}{\mu(B(\theta, \epsilon))}} \, d\epsilon. \]

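An aside: Generic Chaining

Talagrand's $\gamma_2$:

Theorem: For $X_\theta$ as above and
\[ \gamma_2(T, d) = \inf_\mu \sup_{\theta \in T} \int_0^\infty \sqrt{\log \frac{1}{\mu(B(\theta, \epsilon))}} \, d\epsilon, \]
we have
\[ E \sup_\theta X_\theta \le c \, \gamma_2(T, d). \]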

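Sudakov's Lower Bound

Theorem: For a zero-mean Gaussian process $X_\theta$ defined on $T$, define the variance pseudometric $d(\theta, \theta')^2 = \mathrm{Var}(X_\theta - X_{\theta'})$. Then
\[ E \sup_\theta X_\theta \ge \sup_{\epsilon > 0} \frac{\epsilon}{2} \sqrt{\log M(\epsilon, T, d)}, \]
where $M$ denotes the packing number.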

Sudakov's Lower Bound

Compare with the entropy integral:

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then
\[ E \sup_\theta X_\theta \le 8\sqrt{2} \int_0^\infty \sqrt{\log N(\epsilon, T, d)} \, d\epsilon. \]
Suppose that $\mathrm{Var}(X_\theta - X_{\theta'})$ is on the same scale as $d(\theta, \theta')^2$ (think of the Gaussian example of a sub-Gaussian process: this is precisely the variance). Then, modulo constants, the lower bound is the area of the largest rectangle that can fit under the curve $(\epsilon, \sqrt{\log N(\epsilon)})$, whereas the upper bound is the area under the curve.
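The rectangle-versus-area picture can be made concrete with a small computation. The sketch below assumes a hypothetical entropy curve $N(\epsilon) = (1 + 2/\epsilon)^p$ on $(0, 2]$, ignores the constants $1/2$ and $8\sqrt{2}$, and treats packing and covering numbers as interchangeable, all of which are simplifications for illustration only.

```python
import math

def curve(eps, p=3):
    """Hypothetical entropy curve sqrt(log N(eps)) with N(eps) = (1 + 2/eps)^p."""
    return math.sqrt(p * math.log(1.0 + 2.0 / eps))

upper, steps = 2.0, 100_000
h = upper / steps
grid = [(j + 0.5) * h for j in range(steps)]

area_under_curve = h * sum(curve(e) for e in grid)   # Dudley-type quantity
largest_rectangle = max(e * curve(e) for e in grid)  # Sudakov-type quantity
print(largest_rectangle, area_under_curve)
```

Because the curve is nonincreasing, any rectangle with corners at the origin and on the curve fits beneath it, so the rectangle can never exceed the area; how close the two are depends on how quickly the curve decays.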
