Lecture 9: October 25, Lower bounds for minimax rates via multiple hypotheses


Information and Coding Theory, Autumn 2017
Lecturer: Madhur Tulsiani

In this lecture, we extend the ideas from the previous lecture to develop lower bounds for minimax rates using lower bounds for testing multiple hypotheses. Recall that for a random variable $V$ uniformly distributed over a set of hypotheses $\mathcal{V}$, the probability of error of any classifier $T(x)$, with input $x$ drawn from $P_v^n$ for a randomly chosen $v \in \mathcal{V}$, is lower bounded as

$$\mathbb{P}[T(x) \neq V] \;\geq\; 1 - \frac{n \cdot \mathbb{E}_{v, v' \in \mathcal{V}}\left[D(P_v \| P_{v'})\right] + 1}{\log |\mathcal{V}|}.$$

Recall also that the above bound can be used to lower bound the minimax risk for a loss function $\ell(\hat{\theta}, \theta) = \Phi(d(\hat{\theta}, \theta))$ and a set of distributions $\Pi$. We proved in the last lecture that for a set of distributions $\{P_v\}_{v \in \mathcal{V}} \subseteq \Pi$ with the property that $d(\theta(P_v), \theta(P_{v'})) \geq 2\delta$ for all $v \neq v'$, we have

$$M_n(\Pi, \ell) \;=\; \inf_{\hat{\theta}} \sup_{P \in \Pi} \mathbb{E}_{x \sim P^n}\left[\ell(\hat{\theta}(x), \theta(P))\right] \;\geq\; \Phi(\delta) \cdot \inf_{T} \left\{ \mathbb{P}[T(x) \neq V] \right\}.$$

To use the above bounds, we need to come up with a set of distributions which are far apart in terms of the property $\theta$ (so that the second bound is large), but close on average in terms of KL-divergence (so that the first bound is large). This is also known as the local Fano method, since we derived the first bound using Fano's inequality, and we apply it by using (a local bound on) the KL-divergence for every pair of distributions $P_v, P_{v'}$ (recall that we used convexity of KL-divergence to reduce to the local setting). You can find other variants of this method in the notes by Duchi [Duc6].

1 Gaussian mean estimation

While binary hypothesis testing was used to show a bound for estimating the mean of Bernoulli random variables, the multiple hypotheses setting is often useful when considering high-dimensional problems. We take $\Pi$ to be the set of $d$-dimensional Gaussian distributions as below:

$$\Pi = \left\{ N(\mu, I_d) \;\middle|\; \mu \in \mathbb{R}^d \right\}.$$
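As a numerical aside on the two bounds recalled at the start of the lecture, the following is a minimal sketch of how they can be evaluated for given parameters. The function names are hypothetical and not part of the notes, KL-divergence is assumed to be measured in bits (so the logarithm is base 2), and the result is clipped at zero because the bound is vacuous when it becomes negative.

```python
import math


def fano_error_lower_bound(n, avg_kl_bits, num_hypotheses):
    """Evaluate 1 - (n * avg_kl + 1) / log2|V|, clipped at 0.

    n              -- number of i.i.d. samples
    avg_kl_bits    -- average pairwise KL-divergence E[D(P_v || P_v')], in bits
    num_hypotheses -- |V|, the number of hypotheses
    """
    if num_hypotheses < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (n * avg_kl_bits + 1.0) / math.log2(num_hypotheses)
    return max(0.0, bound)


def minimax_lower_bound(phi, delta, n, avg_kl_bits, num_hypotheses):
    """Phi(delta) times the testing bound, for a 2*delta-separated family."""
    return phi(delta) * fano_error_lower_bound(n, avg_kl_bits, num_hypotheses)
```

For instance, `minimax_lower_bound(lambda t: t * t, 0.5, 10, 0.2, 1024)` evaluates the bound for squared loss with $\delta = 1/2$, ten samples, and 1024 hypotheses at average divergence 0.2 bits. Increasing `n` with everything else fixed drives the bound to zero, which is why the separation $\delta$ has to shrink with $n$.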

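Let the property $\theta$ be the mean as before, and let $\ell(\hat{\theta}, \theta) = \|\hat{\theta} - \theta\|_2^2$. We first check the expected loss for the empirical mean estimator.

Proposition 1 Let $\hat{\theta}(x_1, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^n x_i$. Then, for any $\mu \in \mathbb{R}^d$, we have that

$$\mathbb{E}_{X \sim (N(\mu, I_d))^n} \left\| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right\|_2^2 = \frac{d}{n}.$$

Proof: The proof is similar to the case of Bernoulli random variables. By the (pairwise) independence of the $n$ samples, each cross term has expectation zero, and we have that

$$\mathbb{E}_{X \sim (N(\mu, I_d))^n} \left\| \frac{1}{n} \sum_{i=1}^n X_i - \mu \right\|_2^2 = \frac{1}{n^2} \sum_{i=1}^n \mathbb{E}\left[ \|X_i - \mu\|_2^2 \right] + \frac{1}{n^2} \sum_{i \neq j} \mathbb{E}\left[ \langle X_i - \mu, X_j - \mu \rangle \right] = \frac{1}{n^2} \cdot n \cdot \mathbb{E}_{X \sim N(\mu, I_d)}\left[ \|X - \mu\|_2^2 \right] = \frac{1}{n^2} \cdot n \cdot d = \frac{d}{n}.$$

We will apply the local Fano method to prove the optimality of the above bound in terms of both $d$ and $n$. We first need the following expression for the KL-divergence of two normal distributions.

Exercise 2 Prove (using the chain rule) that

$$D(N(\mu, I_d) \,\|\, N(\mu', I_d)) = \frac{1}{2 \ln 2} \cdot \|\mu - \mu'\|_2^2.$$

Thus, we need to find a large collection of distributions, or equivalently a large collection of means, such that for any two $\mu \neq \mu'$ the distance $\|\mu - \mu'\|_2$ is somewhat large (to lower bound the loss), but $\|\mu - \mu'\|_2$ is still small on average (to upper bound the average KL-divergence). This is actually quite easy for the setting above, but we will take a slightly longer route through covering and packing numbers to illustrate a general method.

2 Covering and packing numbers

Definition 3 Let $S$ be a set of points with a metric $d(\cdot, \cdot)$. A collection of points $C \subseteq S$ is called a $\delta$-covering of $S$ (with respect to the metric $d$) if

$$\forall x \in S \;\; \exists y \in C \quad d(x, y) \leq \delta.$$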

A set of points $P \subseteq S$ is called a $\delta$-packing if

$$\forall x, y \in P, \; x \neq y \quad d(x, y) > \delta.$$

The size of the minimal $\delta$-covering, denoted as $N(\delta, S, d)$, is called the $\delta$-covering number of $S$, and the size of the maximal $\delta$-packing, denoted as $M(\delta, S, d)$, is called the $\delta$-packing number. The quantity $\log N(\delta, S, d)$ is also called the metric entropy of $S$.

We will take the required collection of means to be a scaled copy of a $(1/2)$-packing of the unit ball in $\mathbb{R}^d$ (under the Euclidean distance). We will show a lower bound on the size of this collection (the packing number) by using a relationship between the packing and covering numbers.

Exercise 4 For any set $S$, metric $d$ and $\delta > 0$, show that

$$M(2\delta, S, d) \;\leq\; N(\delta, S, d) \;\leq\; M(\delta, S, d).$$

(Hint: First prove that an optimal $\delta$-packing must also be a $\delta$-covering.)

Let $B_d(x, r)$ denote the ball in $\mathbb{R}^d$ of radius $r$ (in the Euclidean distance) with its center at $x$. We know that $\mathrm{Vol}(B_d(x, r)) = c_d \cdot r^d$ for some constant $c_d > 0$. Note that if $C \subseteq B_d(0, 1)$ is a $\delta$-covering of $B_d(0, 1)$, then $B_d(0, 1) \subseteq \bigcup_{x \in C} B_d(x, \delta)$. Thus, taking $C$ to be a minimal $\delta$-covering, we have

$$c_d = \mathrm{Vol}(B_d(0, 1)) \;\leq\; \sum_{x \in C} \mathrm{Vol}(B_d(x, \delta)) = N(\delta, B_d(0, 1), \|\cdot\|_2) \cdot c_d \cdot \delta^d.$$

Combining with the previous relationship between covering and packing numbers, this gives

$$M(\delta, B_d(0, 1), \|\cdot\|_2) \;\geq\; N(\delta, B_d(0, 1), \|\cdot\|_2) \;\geq\; \left(\frac{1}{\delta}\right)^d.$$

Thus, there exists a $(1/2)$-packing of $B_d(0, 1)$ of size at least $2^d$. We will use this to prove the lower bound for mean estimation.

3 Back to lower bounds for mean estimation

We can now get back to the lower bound. Let $\mathcal{V}$ be a $(1/2)$-packing of $B_d(0, 1)$ of size at least $2^d$. We consider the set of distributions

$$\left\{ N(4\delta \cdot v, I_d) \;\middle|\; v \in \mathcal{V} \right\}.$$

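Since $\mathcal{V}$ is a $(1/2)$-packing, we have that for all distinct $P, P'$ in this collection, $\|\theta(P) - \theta(P')\|_2 \geq 2\delta$. Also, since $\|v - v'\|_2 \leq 2$ for any $v, v' \in \mathcal{V}$, we get that for any $P, P'$ in the collection, the means are at distance at most $8\delta$. Hence,

$$D(P \| P') = \frac{1}{2 \ln 2} \cdot \|\mu - \mu'\|_2^2 \;\leq\; \frac{1}{2 \ln 2} \cdot (64 \delta^2) = \frac{32 \delta^2}{\ln 2}.$$

Applying the lower bound on minimax loss in terms of KL-divergences (with $\Phi(\delta) = \delta^2$ and $\log |\mathcal{V}| \geq d$) gives

$$M_n(\Pi, \ell) \;\geq\; \delta^2 \cdot \left( 1 - \frac{n \cdot (32 \delta^2 / \ln 2) + 1}{\log |\mathcal{V}|} \right) \;\geq\; \delta^2 \cdot \left( 1 - \frac{n \cdot (32 \delta^2 / \ln 2) + 1}{d} \right).$$

For instance, choosing $\delta^2 = \frac{d \ln 2}{128 n}$ makes $n \cdot (32 \delta^2 / \ln 2) = d/4$, so for $d \geq 4$ the bracketed factor is at least $1/2$ and $M_n(\Pi, \ell) \geq \frac{\ln 2}{256} \cdot \frac{d}{n}$, which matches the upper bound of $d/n$ up to a constant factor.

4 Sparse mean estimation

From the previous examples, it seems like the empirical mean estimator is always the best one, and the role of information theory is primarily for proving lower bounds. However, it can also serve as a guide for the right bound to aim for. Consider the set of normal distributions where the mean has at most one non-zero coordinate:

$$\Pi = \left\{ N(\mu, I_d) \;\middle|\; \mu \in \mathbb{R}^d, \; \|\mu\|_0 \leq 1 \right\}.$$

Let $\theta(P) = \mathbb{E}_{x \sim P}[x]$ be the mean, and let $\ell(\hat{\theta}, \theta) = \|\hat{\theta} - \theta\|_2^2$ as before.

Exercise 5 Let $\mathcal{V} = \{e_1, \ldots, e_d\}$ be the set of standard basis vectors in $\mathbb{R}^d$. Use the set of means $\mu_v = \delta \cdot v$ for $v \in \mathcal{V}$, with a suitably chosen scale $\delta > 0$, to show that there exists a constant $c$ such that

$$M_n(\Pi, \ell) \;\geq\; c \cdot \frac{\log d}{n}.$$

The optimal estimator for the above problem actually extends the definition of the mean as the minimizer of the total square distance (from the sample points). For a sample $X \in (\mathbb{R}^d)^n$ and $j \in [d]$, we define $\hat{\mu}^{(j)}$ as a vector which is equal to the empirical mean in the $j$-th coordinate and zero elsewhere, i.e.,

$$\hat{\mu}^{(j)}_k = \begin{cases} \frac{1}{n} \sum_{i=1}^n (X_i)_j & \text{if } k = j \\ 0 & \text{otherwise.} \end{cases}$$

We take the estimator to be

$$\hat{\mu} = \operatorname*{arg\,min}_{\mu \in \{0, \hat{\mu}^{(1)}, \ldots, \hat{\mu}^{(d)}\}} \left\{ \sum_{i=1}^n \|X_i - \mu\|_2^2 \right\}.$$

This estimator indeed achieves an expected loss of $O((\log d)/n)$, but we will not discuss the proof here.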
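The estimator defined above can be tried out in a small simulation. The following is a minimal sketch using only the Python standard library. The function names, the default dimensions, and the choice of a true mean with a single coordinate equal to `signal` are assumptions made for illustration, not part of the notes. It uses the identity $\sum_i \|X_i - \hat{\mu}^{(j)}\|_2^2 = \sum_i \|X_i\|_2^2 - n \cdot (\hat{\mu}^{(j)}_j)^2$ to evaluate the objective for each candidate.

```python
import random


def coordinate_means(samples):
    """Empirical mean of each coordinate (the plain empirical mean estimator)."""
    n, d = len(samples), len(samples[0])
    return [sum(x[j] for x in samples) / n for j in range(d)]


def sparse_mean_estimate(samples):
    """argmin over {0, mu^(1), ..., mu^(d)} of sum_i ||X_i - mu||^2."""
    n, d = len(samples), len(samples[0])
    means = coordinate_means(samples)
    total_sq = sum(v * v for x in samples for v in x)
    best, best_obj = [0.0] * d, total_sq  # objective of the candidate mu = 0
    for j in range(d):
        # Objective of mu^(j): sum_i ||X_i||^2 - n * means[j]^2.
        obj = total_sq - n * means[j] ** 2
        if obj < best_obj:
            best = [0.0] * d
            best[j] = means[j]
            best_obj = obj
    return best


def squared_error(estimate, mu):
    return sum((a - b) ** 2 for a, b in zip(estimate, mu))


def simulate(d=50, n=100, trials=200, signal=1.0, seed=0):
    """Average squared error of the sparse and plain estimators (illustrative)."""
    rng = random.Random(seed)
    mu = [0.0] * d
    mu[0] = signal  # assumed true mean: one non-zero coordinate
    risk_sparse = risk_plain = 0.0
    for _ in range(trials):
        samples = [[rng.gauss(m, 1.0) for m in mu] for _ in range(n)]
        risk_sparse += squared_error(sparse_mean_estimate(samples), mu)
        risk_plain += squared_error(coordinate_means(samples), mu)
    return risk_sparse / trials, risk_plain / trials
```

Calling `simulate()` returns the two average losses. Under these assumptions one would expect the plain empirical mean to incur a loss close to $d/n$, while the sparse estimator should do noticeably better, in line with the $O((\log d)/n)$ rate stated above.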

References

[Duc6] John Duchi. Lecture notes on Information Theory and Statistics, 2016.
