Theoretical Statistics. Lecture 12.

Size: px

Start display at page:

Download "Theoretical Statistics. Lecture 12."

Marylou Andrews
5 years ago
Views:

1 Theoretical Statistics. Lecture 12. Peter Bartlett Uniform laws of large numbers: Bounding Rademacher complexity. 1. Metric entropy. 2. Canonical Rademacher and Gaussian processes 1

2 Recall: Covering numbers A pseudometric is like a metric, but we don t insist that d(x,y) = 0 impliesx = y. Definition: Anǫ-cover of a subsett of a pseudometric space(s,d) is a set ˆT T such that for each t T there is a ˆt ˆT such that d(t,ˆt) ǫ. The ǫ-covering number of T is N(ǫ,T,d) = min{ ˆT : ˆT is an ǫ-cover oft}. A set T is totally bounded if, for all ǫ > 0, N(ǫ,T,d) <. The function ǫ log N(ǫ, T, d) is the metric entropy of T. If lim ǫ 0 logn(ǫ)/log(1/ǫ) exists, it is called the metric dimension. 2

3 Covering numbers Intuition: A d-dimensional set has metric dimension d. (N(ǫ) = Θ(1/ǫ d ).) Example: ([0,1] d,l ) hasn(ǫ) = Θ(1/ǫ d ). 3

4 Packing numbers Definition: An ǫ-packing of a subset T of a pseudometric space (S,d) is a subset ˆT T such that each pair s, t ˆT satisfies d(s, t) > ǫ. The ǫ-packing number oft is M(ǫ,T,d) = max{ ˆT : ˆT is anǫ-packing oft}. 4

5 Covering and packing numbers Theorem: For all ǫ > 0,M(2ǫ) N(ǫ) M(ǫ). Thus, the scaling of the covering and packing numbers is the same. 5

6 Covering and packing numbers: Proof For the first inequality, consider a minimal ǫ-cover ˆT. Any two elements of a2ǫ-packing oft cannot be withinǫof the same element of ˆT. (Otherwise, the triangle inequality shows that they are within2ǫ of each other.) Thus, there can be no more than one element of a2ǫ packing for each of the N(ǫ) elements of ˆT. That is,m(2ǫ) N(ǫ). For the second inequality, consider an ǫ-packing ˆT of size M(ǫ). Since it is maximal, no other point s T can be added for which some t ˆT has d(s,t) > ǫ. Thus, ˆT is anǫ-cover. So the minimal ǫ-cover has size N(ǫ) M(ǫ). 6

7 Covering and packing numbers: Example Theorem: Let be a norm onr d and let B be the unit ball. Then 1 ǫ d N(ǫ,B, ) ( ) d 2 ǫ +1. 7

8 Covering and packing numbers of a norm ball: Proof Lower bound: Consider an ǫ-cover {x 1,...,x N } of sizen = N(ǫ,B), and notice that so N B (x i +ǫb), i=1 vol(b) N(ǫ,B)vol(ǫB) = N(ǫ,B)ǫ d vol(b), and hence N(ǫ,B) 1/ǫ d. 8

9 Covering and packing numbers of a norm ball: Proof Upper bound: Consider a maximal ǫ-packing {x 1,...,x M } of size M = M(ǫ,B). Since it s a packing, the ballsx i +(ǫ/2)b are disjoint. Each of these balls is contained in(1+ǫ/2)b. Thus, so M i=1 (x i + ǫ 2 B ) (1+ǫ/2)B, Mvol((ǫ/2)B) vol((1+ǫ/2)b) ( ǫ dvol(b) ( M 1+ 2) 2) ǫ dvol(b). and hence N(ǫ,B) M(ǫ,B) (2/ǫ+1) d. 9

10 Example: smoothly parameterized functions Let F be a parameterized class of functions, F = {f(θ, ) : θ Θ}. Let Θ be a norm onθand let F be a norm onf. Suppose that the mapping θ f(θ, ) is L-Lipschitz, that is, f(θ, ) f(θ, ) F L θ θ Θ. Then N(ǫ,F, F ) N(ǫ/L,Θ, Θ ). 10

11 Example: smoothly parameterized functions A Lipschitz parameterization allows us to translates a cover of the parameter space into a cover of the function space. Example: If F is smoothly parameterized by a (compact set of) d parameters, then N(ǫ,F) = O(1/ǫ d ). 11

12 Example: 1-dimensional Lipschitz functions LetF be the set ofl-lipschitz functions mapping from[0,1] to[0,1]. Then in the infinity norm f = sup x [0,1] f(x), logn(ǫ,f, ) = Θ(L/ǫ). Proof idea: form an ǫ grid of the y-axis, and anǫ/l grid of the x-axis, and consider all functions that are piecewise linear on this grid, where all pieces have slopes+l or L. There are 1/ǫ starting points, and for each starting point there are 2 L/ǫ slope choices. It s easy to show that this set is ano(ǫ) packing and ano(ǫ) cover. 12

13 Example: d-dimensional Lipschitz functions Let F d be the set of L-Lipschitz functions (wrt ) mapping from[0,1] d to[0, 1]. Then logn(ǫ,f d, ) = Θ ( (L/ǫ) d). Note the exponential dependence on the dimension. 13

14 Canonical Rademacher and Gaussian Processes Definition: Fix a set T R n. 1. The canonical Gaussian process is the stochastic process G θ = g,θ = n g i θ i, i=1 where g i N(0,1) i.i.d. 2. The canonical Rademacher process is the stochastic process R θ = ǫ,θ = n ǫ i θ i, i=1 where theǫ i are i.i.d. and uniform on{±1}. 14

15 Canonical Rademacher and Gaussian Processes Definition: A stochastic process θ X θ with indexing set T is sub- Gaussian with respect to a metric d ont if, for all θ,θ T and all λ R, ( λ Eexp(λ(X θ X θ )) 2 d(θ,θ ) 2 ) exp. 2 The canonical Rademacher and Gaussian processes are sub-gaussian wrt the Euclidean metric. 15

16 Canonical Rademacher and Gaussian Processes Indeed: G θ G θ = g,θ θ, which isn(0, θ θ 2 ), and hence its moment generating function is equal to the upper bound. R θ R θ = ǫ,θ θ, which, by the bounded differences property, is sub-gaussian with parameter θ θ 2. 16

17 An aside: Orlicz norms Definition: For1 α 2, the α-orlicz norm of a random variable X is { ( ) } X α X ψα = inf C > 0 : Eexp 2. C α Theorem: There are constants c 1,c 2 such that, for all X and all t 1, ( ) t α Pr( X t) 2exp c 1 X α ψ α and conversely, Pr( X t) cexp( t α /K α ) implies X ψα c 2 K. Sub-Gaussian means X θ X θ ψ 2 Ld(θ,θ )., 17

18 Canonical Gaussian and Rademacher processes Theorem: ForT R n, Esup θ T R θ π 2 Esup G θ c lognesupr θ. θ T θ T 18

19 Canonical Gaussian and Rademacher processes Proof of first inequality: Esup θ T G θ = Esup θ T = Esup θ T n g i θ i i=1 n ǫ i g i θ i i=1 n Esup ǫ i E[ g i ]θ i θ T i=1 2 = π Esup R θ. θ T 19

20 Canonical Gaussian and Rademacher processes: Example ForΘthe l 1 -ball inr n, Esup ǫ,θ = E ǫ = 1. θ [where we ve used the duality ofl 1 and l (equivalently, that Hölder s inequality is tight).] Also, Esup g,θ = E g 2lnn. θ The Gaussian and Rademacher complexities are a logn factor apart in this case. 20

21 Canonical Gaussian and Rademacher processes: Example To see the last inequality, we generalize the Finite Lemma to the sub-gaussian case: Lemma: Forg with independent sub-gaussian components, Emax g,a max a 2log A. a A a A In this case, A = {e i : 1 i n}, so max a A a = 1 and A = n. 21

22 Canonical Gaussian and Rademacher processes: Example Proof: exp ( ) λemax g,a a A Eexp ( ) λmax g,a a A = Emax a A exp(λ g,a ) a AEexp(λ g,a ) A exp ( λ 2 R 2 /2 ), since g i is sub-gaussian (here, R 2 = max a A a 2 2). Picking λ 2 = 2log A /R 2 gives the result. 22

Theoretical Statistics. Lecture 14.

Theoretical Statistics. Lecture 14. Peter Bartlett Metric entropy. 1. Chaining: Dudley s entropy integral 1 Recall: Sub-Gaussian processes Definition: A stochastic process θ X θ with indexing set T is