Minimax lower bounds


Maxim Raginsky
December 4, 2013

Now that we have a good handle on the performance of ERM and its variants, it is time to ask whether we can do better. For example, consider binary classification: we observe $n$ i.i.d. training samples from an unknown joint distribution $P$ on $\mathcal{X} \times \{0,1\}$, where $\mathcal{X}$ is some feature space, and for a fixed class $\mathcal{F}$ of candidate classifiers $f : \mathcal{X} \to \{0,1\}$ we let $\hat{f}_n$ be the ERM solution

$$\hat{f}_n = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n 1_{\{f(X_i) \neq Y_i\}}. \qquad (1)$$

If $\mathcal{F}$ is a VC class with VC dimension $V$, then the excess risk of $\hat{f}_n$ over the best-in-class performance $L^*(\mathcal{F}) \triangleq \inf_{f \in \mathcal{F}} L(f)$ satisfies

$$L(\hat{f}_n) \le L^*(\mathcal{F}) + C\left(\sqrt{\frac{V}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$$

with probability at least $1-\delta$, where $C > 0$ is some absolute constant. Integrating, we also get the following bound on the expected excess risk:

$$\mathbb{E}\big[L(\hat{f}_n) - L^*(\mathcal{F})\big] \le C\sqrt{\frac{V}{n}}, \qquad (2)$$

for some constant $C > 0$. Crucially, the bound (2) holds for all possible joint distributions $P$ on $\mathcal{X} \times \{0,1\}$, and the right-hand side is independent of $P$: it depends only on the properties of the class $\mathcal{F}$! Thus, we deduce the following remarkable distribution-free guarantee for ERM: for any VC class $\mathcal{F}$, the ERM algorithm (1) satisfies

$$\sup_{P \in \mathcal{P}(\mathcal{X} \times \{0,1\})} \mathbb{E}_P\big[L_P(\hat{f}_n) - L_P^*(\mathcal{F})\big] \le C\sqrt{\frac{V}{n}}. \qquad (3)$$

(We have used the subscript $P$ to explicitly indicate the fact that the quantity under the supremum depends on the underlying distribution $P$. In the sequel, we will often drop the subscript to keep the notation uncluttered.)

Let's take a moment to reflect on the significance of the bound (3). What it says is that, regardless of how weird the stochastic relationship between the feature $X \in \mathcal{X}$ and the label $Y \in \{0,1\}$ might be, as long as we scale our ambition back and aim at approaching the performance of the best classifier in some VC class $\mathcal{F}$, the ERM algorithm will produce a good classifier with a uniform $O(\sqrt{V/n})$ guarantee on its excess risk.
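As a concrete illustration of the ERM rule (1), here is a minimal Python sketch (not taken from the notes; the data-generating distribution, the threshold class, and all names are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data: X uniform on [0, 1], Y = 1{X >= 0.3} with labels flipped w.p. 0.1.
    n = 500
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (X >= 0.3).astype(int) ^ (rng.uniform(size=n) < 0.1).astype(int)

    # A small class F of threshold classifiers f_t(x) = 1{x >= t}; its VC dimension is 1.
    thresholds = np.linspace(0.0, 1.0, 101)

    def empirical_risk(t):
        # (1/n) sum_i 1{f_t(X_i) != Y_i}, the quantity minimized in (1).
        return np.mean((X >= t).astype(int) != Y)

    # ERM: pick the classifier in F with the smallest empirical risk.
    risks = np.array([empirical_risk(t) for t in thresholds])
    t_hat = thresholds[np.argmin(risks)]
    print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")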

At this point, we stop and ask ourselves: could this bound be too pessimistic, even when we are so lucky that the optimal (Bayes) classifier happens to be in $\mathcal{F}$? (Recall that the Bayes classifier for a given $P$ has the form

$$f^*(x) = \begin{cases} 1, & \text{if } \eta(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases}$$

where $\eta(x) = \mathbb{E}[Y|X=x] = P(Y=1|X=x)$ is the regression function.) Let $\mathcal{P}^*(\mathcal{F})$ denote the subset of $\mathcal{P}(\mathcal{X} \times \{0,1\})$ consisting of all joint distributions of $X \in \mathcal{X}$ and $Y \in \{0,1\}$ such that $f^* \in \mathcal{F}$. Then from (2) we have

$$\sup_{P \in \mathcal{P}^*(\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \le C\sqrt{\frac{V}{n}}, \qquad (4)$$

where $\hat{f}_n$ is the ERM solution (1). However, we know that if the relationship between $X$ and $Y$ is deterministic, i.e., if $Y = f^*(X)$, then ERM performs much better. More precisely, let $\mathcal{P}_0(\mathcal{F})$ be the zero-error class:

$$\mathcal{P}_0(\mathcal{F}) \triangleq \big\{ P \in \mathcal{P}^*(\mathcal{F}) : Y = f^*(X) \text{ a.s.} \big\}.$$

Then one can show that

$$\sup_{P \in \mathcal{P}_0(\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \le C\,\frac{V}{n}, \qquad (5)$$

a much better bound than the global bound (4) (see, e.g., the book by Vapnik [Vap98]). This suggests that the performance of ERM is somehow tied to how sharp the behavior of $\eta$ is around the decision boundary that separates the sets $\{x \in \mathcal{X} : \eta(x) \ge 1/2\}$ and $\{x \in \mathcal{X} : \eta(x) < 1/2\}$. To see whether this is the case, let us define, for any $h \in [0,1]$, the class of distributions

$$\mathcal{P}(h,\mathcal{F}) \triangleq \big\{ P \in \mathcal{P}^*(\mathcal{F}) : |2\eta(X) - 1| \ge h \text{ a.s.} \big\}$$

(in that case, the distributions in $\mathcal{P}(h,\mathcal{F})$ are said to satisfy the Massart noise condition with margin $h$). We have already seen the two extreme cases:

- $h = 0$: this gives $\mathcal{P}(0,\mathcal{F}) = \mathcal{P}^*(\mathcal{F})$ (the bound $|2\eta - 1| \ge 0$ holds trivially for any $P$).
- $h = 1$: this gives the zero-error regime $\mathcal{P}(1,\mathcal{F}) = \mathcal{P}_0(\mathcal{F})$ (if $|2\eta - 1| \ge 1$ a.s., then $\eta$ can take only the values $0$ and $1$ a.s.).

However, intermediate values of $h$ are also of interest: if a distribution $P$ belongs to $\mathcal{P}(h,\mathcal{F})$ for some $0 < h < 1$, then its regression function $\eta$ makes a jump of size at least $h$ as we cross the decision boundary. With this in mind, for any $n \in \mathbb{N}$ and $h \in [0,1]$ let us define the minimax risk

$$R_n(h,\mathcal{F}) \triangleq \inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big], \qquad (6)$$

where the infimum is over all learning algorithms based on $n$ i.i.d. training samples. The term "minimax" indicates that we are minimizing over all admissible learning algorithms, while maximizing over all distributions in a given class. The following result was proved by Pascal Massart and Élodie Nédélec [MN06]:

Theorem 1. Let $\mathcal{F}$ be a VC class of binary-valued functions on $\mathcal{X}$ with VC dimension $V$. Then for any $n \ge V$ and any $h \in [0,1]$ we have the lower bound

$$R_n(h,\mathcal{F}) \ge c \min\left( \sqrt{\frac{V}{n}}, \frac{V}{nh} \right), \qquad (7)$$

where $c > 0$ is some absolute constant.

Let us examine some implications:

- When $h = 0$, the right-hand side of (7) is equal to $c\sqrt{V/n}$. Thus, without any further assumptions, ERM is as good as it gets (it is minimax-optimal, up to multiplicative constants).
- When $h = 1$ (the zero-error case), the right-hand side of (7) is equal to $cV/n$, which matches the upper bound (5) up to constants. Thus, if we happen to know that we are in a zero-error situation, ERM is minimax-optimal as well.
- For intermediate values of $h$, the lower bound depends on the relative sizes of $h$, $V$, and $n$. In particular, if $h \ge \sqrt{V/n}$, we have the minimax lower bound $R_n(h,\mathcal{F}) \ge cV/(nh)$. Alternatively, for a fixed $h \in (0,1)$, we may think of $n = V/h^2$ as the cutoff sample size, beyond which the effect of the margin condition on $\eta$ can be spotted and exploited by a learning algorithm.

In the same paper, Massart and Nédélec obtain the following upper bound on ERM:

$$\sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \le \begin{cases} C\sqrt{\dfrac{V}{n}}, & \text{if } h \le \sqrt{V/n} \\[1ex] C\,\dfrac{V}{nh}\left(1 + \log\dfrac{nh^2}{V}\right), & \text{if } h > \sqrt{V/n}. \end{cases} \qquad (8)$$

Thus, ERM is nearly minimax-optimal (we say "nearly" because of the extra log factor in the above bound); in fact, as Massart and Nédélec show, the log factor is unavoidable when the function class $\mathcal{F}$ is rich in a certain sense. The proof of the above upper bound is rather involved and technical, and we will not get into it here.

The appearance of the logarithmic term in (8) is rather curious. Given the lower bound of Theorem 1, one may be tempted to dismiss it as an artifact of the analysis used by Massart and Nédélec. However, as we will now see, in certain situations this logarithmic term is unavoidable. To that end, we first need a definition: we say that a class $\mathcal{F}$ of binary-valued functions $f : \mathcal{X} \to \{0,1\}$ is $(N,D)$-rich, for some $N, D \in \mathbb{N}$, if there exist $N$ distinct points $x_1, \ldots, x_N \in \mathcal{X}$, such that the projection

$$\mathcal{F}(x^N) \triangleq \big\{ (f(x_1), \ldots, f(x_N)) : f \in \mathcal{F} \big\}$$

of $\mathcal{F}$ onto $x^N$ contains all binary strings of Hamming weight[1] $D$. Some examples:

- If $\mathcal{F}$ is a VC class with VC dimension $V$, then it is $(V,D)$-rich for all $1 \le D \le V$. This follows directly from the definitions.

[1] The Hamming weight of a binary string is the number of nonzero bits it has.

- A nontrivial example, and one that is relevant to statistical learning, is as follows. Let $\mathcal{F}$ be the collection of indicators of all halfspaces in $\mathbb{R}^d$, for some $d \ge 2$. There is a result in computational geometry which says that, for any $N \ge d+1$, one can find $N$ distinct points $x_1, \ldots, x_N$, such that $\mathcal{F}(x^N)$ contains all strings in $\{0,1\}^N$ with Hamming weight up to, and including, $\lfloor d/2 \rfloor$. Consequently, $\mathcal{F}$ is $(N, \lfloor d/2 \rfloor)$-rich for all $N \ge d+1$.

We can now state the following result [MN06]:

Theorem 2. Given some $D \ge 1$, suppose that $\mathcal{F}$ is $(N,D)$-rich for all $N \ge 4D$. Then

$$R_n(h,\mathcal{F}) \ge c\,(1-h)\,\frac{D}{nh}\left[1 + \log\frac{nh^2}{D}\right] \qquad (9)$$

for any $\sqrt{D/n} \le h < 1$, where $c > 0$ is some absolute constant.

We will present the proofs of Theorems 1 and 2 in Sections 2 and 3, after giving some necessary background on minimax lower bounds.

1 Preparation: minimax lower bounds for statistical estimation

To prove Theorem 1, we need some background on minimax lower bounds for statistical estimation problems. The presentation here follows an excellent paper by Bin Yu [Yu97]. The setting is as follows: we have an indexed set $\{P_\theta : \theta \in \Theta\}$ of probability distributions on some set $\mathcal{Z}$, where $\Theta$ is some parameter set. For our purposes, it suffices to consider the case when $\mathcal{Z}$ is a finite set. We observe a random sample $Z$ from one of the $P_\theta$'s, and our goal is to estimate $\theta$. We will measure the quality of any estimator $\hat\theta = \hat\theta(Z)$ by its expected risk

$$\mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] = \sum_{z \in \mathcal{Z}} P_\theta(z)\, d(\theta, \hat\theta(z)),$$

where $\mathbb{E}_\theta[\cdot]$ denotes expectation with respect to $P_\theta$, and $d : \Theta \times \Theta \to \mathbb{R}_+$ is a given loss function. Here, we will assume that $d$ is a pseudometric on $\Theta$, i.e., it has the following properties:

1. Symmetry: $d(\theta,\theta') = d(\theta',\theta)$ for any $\theta, \theta' \in \Theta$.
2. Triangle inequality: $d(\theta,\theta') \le d(\theta,\eta) + d(\eta,\theta')$ for any $\theta, \theta', \eta \in \Theta$.

(We do not require that $d(\theta,\theta') = 0$ if and only if $\theta = \theta'$, in which case $d$ would be a metric.)

Since we do not know ahead of time which of the $\theta$'s we will be facing, it is natural to expect the worst and seek an estimator $\hat\theta$ that minimizes the worst-case risk $\sup_{\theta \in \Theta} \mathbb{E}_\theta[d(\theta, \hat\theta(Z))]$. With this in mind, let us define the minimax risk

$$M(\Theta) \triangleq \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big],$$

where the infimum is over all estimators $\hat\theta = \hat\theta(Z)$. We are particularly interested in tight lower bounds on $M(\Theta)$, since they will give us some idea of how difficult the estimation problem is.
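To make the definition of $M(\Theta)$ concrete, here is a small illustrative sketch (not from the notes): for a toy family of two distributions on a three-point alphabet, with the 0-1 loss $d(\theta,\theta') = 1_{\{\theta \neq \theta'\}}$, it computes the best worst-case risk over all deterministic estimators $\hat\theta : \mathcal{Z} \to \Theta$ by brute-force enumeration. (Allowing randomized estimators could only lower this value; the point is just to exhibit the inf-sup structure.)

    import itertools
    import numpy as np

    # Toy problem: Theta = {0, 1}, Z = {0, 1, 2}, and a single observation Z ~ P_theta.
    P = {0: np.array([0.6, 0.3, 0.1]),
         1: np.array([0.2, 0.3, 0.5])}

    def loss(t, s):
        # 0-1 loss on the parameter space (a pseudometric).
        return float(t != s)

    def worst_case_risk(estimator):
        # sup over theta of E_theta[ d(theta, estimator(Z)) ].
        return max(sum(P[t][z] * loss(t, estimator[z]) for z in range(3)) for t in P)

    # Enumerate all deterministic estimators Z -> Theta and keep the best worst-case risk.
    M = min(worst_case_risk(est) for est in itertools.product([0, 1], repeat=3))
    print(f"best worst-case risk over deterministic estimators: {M:.3f}")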

A simple yet powerful technique for getting lower bounds is the two-point method introduced by Lucien Le Cam:

Lemma 1. For any two $\theta, \theta' \in \Theta$ and any estimator $\hat\theta$,

$$\mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] + \mathbb{E}_{\theta'}\big[d(\theta', \hat\theta(Z))\big] \ge d(\theta,\theta') \sum_{z \in \mathcal{Z}} \min\big(P_\theta(z), P_{\theta'}(z)\big). \qquad (10)$$

Proof. Consider an arbitrary $z \in \mathcal{Z}$. If $P_\theta(z) \ge P_{\theta'}(z)$, then

$$P_\theta(z)\, d(\theta, \hat\theta(z)) + P_{\theta'}(z)\, d(\theta', \hat\theta(z)) = P_{\theta'}(z)\big[d(\theta, \hat\theta(z)) + d(\theta', \hat\theta(z))\big] + \big[P_\theta(z) - P_{\theta'}(z)\big] d(\theta, \hat\theta(z)) \ge P_{\theta'}(z)\big[d(\theta, \hat\theta(z)) + d(\theta', \hat\theta(z))\big] \ge P_{\theta'}(z)\, d(\theta,\theta'),$$

where the second step is because $P_\theta(z) \ge P_{\theta'}(z)$ and $d(\cdot,\cdot)$ is nonnegative, while the third step is by the triangle inequality. By the same token, if $P_\theta(z) \le P_{\theta'}(z)$, then

$$P_\theta(z)\, d(\theta, \hat\theta(z)) + P_{\theta'}(z)\, d(\theta', \hat\theta(z)) \ge P_\theta(z)\, d(\theta,\theta').$$

In either case, the right-hand side equals $\min(P_\theta(z), P_{\theta'}(z))\, d(\theta,\theta')$; summing over all $z \in \mathcal{Z}$, we get (10).

The sum on the right-hand side of (10) can be expressed in terms of the so-called total variation distance between $P_\theta$ and $P_{\theta'}$. For any two probability distributions $P, Q \in \mathcal{P}(\mathcal{Z})$, the total variation distance is

$$\|P - Q\|_{\mathrm{TV}} \triangleq \frac{1}{2} \sum_{z \in \mathcal{Z}} |P(z) - Q(z)|. \qquad (11)$$

Moreover,

$$\|P - Q\|_{\mathrm{TV}} = 1 - \sum_{z \in \mathcal{Z}} \min\big(P(z), Q(z)\big). \qquad (12)$$

To prove (12), we just use the definition (11): if we let $\mathcal{Z}^* \triangleq \{z \in \mathcal{Z} : P(z) \ge Q(z)\}$, then

$$\|P - Q\|_{\mathrm{TV}} = \frac{1}{2} \sum_{z \in \mathcal{Z}^*} (P(z) - Q(z)) + \frac{1}{2} \sum_{z \notin \mathcal{Z}^*} (Q(z) - P(z)) = \frac{1}{2}\big(P(\mathcal{Z}^*) - Q(\mathcal{Z}^*)\big) + \frac{1}{2}\big(Q(\mathcal{Z}^{*c}) - P(\mathcal{Z}^{*c})\big) = \frac{1}{2}\big(P(\mathcal{Z}^*) - Q(\mathcal{Z}^*)\big) + \frac{1}{2}\big(P(\mathcal{Z}^*) - Q(\mathcal{Z}^*)\big) = P(\mathcal{Z}^*) - Q(\mathcal{Z}^*) = \sum_{z : P(z) \ge Q(z)} (P(z) - Q(z)) = \sum_{z : P(z) \ge Q(z)} \big(P(z) - \min(P(z),Q(z))\big) + \underbrace{\sum_{z : P(z) < Q(z)} \big(P(z) - \min(P(z),Q(z))\big)}_{=0} = \sum_{z \in \mathcal{Z}} \big(P(z) - \min(P(z),Q(z))\big) = 1 - \sum_{z \in \mathcal{Z}} \min\big(P(z),Q(z)\big),$$

and we are done. Thus, we can rewrite the bound (10) as

$$\mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] + \mathbb{E}_{\theta'}\big[d(\theta', \hat\theta(Z))\big] \ge d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big). \qquad (13)$$
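A quick numerical sanity check of the identity (12), using randomly generated distributions on a finite alphabet (illustrative code, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(1)

    # Two random probability distributions on a finite alphabet of size 6.
    P = rng.dirichlet(np.ones(6))
    Q = rng.dirichlet(np.ones(6))

    tv_def = 0.5 * np.sum(np.abs(P - Q))        # definition (11)
    tv_alt = 1.0 - np.sum(np.minimum(P, Q))     # identity (12)
    assert np.isclose(tv_def, tv_alt)
    print(f"TV by (11): {tv_def:.6f}, by (12): {tv_alt:.6f}")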

Let us examine some consequences. If we take the supremum of both sides of (13) over the choices of $\theta$ and $\theta'$, then we see that, for any estimator $\hat\theta$,

$$\sup_{\theta \in \Theta} \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] \ge \frac{1}{2} \sup_{\theta \neq \theta'} \Big\{ d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big\}. \qquad (14)$$

Notice that the right-hand side does not depend on the choice of $\hat\theta$, so we can take the infimum of both sides of (14) over $\hat\theta$ and obtain the following:

Corollary 1. The minimax risk $M(\Theta)$ can be lower-bounded as

$$M(\Theta) \ge \frac{1}{2} \sup_{\theta \neq \theta'} \Big\{ d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big\}. \qquad (15)$$

The two-point bound (15) gives us an idea of when the problem of estimating $\theta$ based on observations $Z \sim P_\theta$ is difficult: when there exists at least one pair of distinct parameter values $\theta, \theta' \in \Theta$ such that $d(\theta,\theta')$ is large, while the total variation distance $\|P_\theta - P_{\theta'}\|_{\mathrm{TV}}$ is small. This means that the observation $Z$ does not give us enough information to reliably distinguish between $\theta$ and $\theta'$, but the consequences of mistaking one for the other are severe (since $d(\theta,\theta')$ is large).

Another useful consequence of Lemma 1 comes about when we consider the Bayesian setting, i.e., when we have a prior distribution $\pi$ on $\Theta$ and consider the average risk of any estimator $\hat\theta$:

$$\mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] \triangleq \int_\Theta \pi(d\theta)\, \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big].$$

Then we have the following lower bound on $M(\Theta)$:

$$M(\Theta) \ge \inf_{\hat\theta} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big].$$

To prove this, we simply note that, for any $\hat\theta$,

$$\mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] = \int_\Theta \pi(d\theta)\, \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] \le \sup_{\theta \in \Theta} \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big].$$

(In fact, under suitable regularity conditions, one can prove that the minimax risk is equal to the worst-case Bayes risk:

$$M(\Theta) = \inf_{\hat\theta} \sup_{\pi \in \mathcal{P}(\Theta)} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] = \sup_{\pi \in \mathcal{P}(\Theta)} \inf_{\hat\theta} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big],$$

where the first equality always holds, while the second equality is valid whenever the conditions of the minimax theorem are satisfied.)

With these definitions, we have the following:[2]

Corollary 2. Let $\pi$ be any prior distribution on $\Theta$, and let $\mu$ be any joint probability distribution of a random pair $(\theta,\theta') \in \Theta \times \Theta$, such that the marginal distributions of both $\theta$ and $\theta'$ are equal to $\pi$. Then

$$M(\Theta) \ge \inf_{\hat\theta} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] \ge \frac{1}{2}\, \mathbb{E}_\mu\Big[ d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big]. \qquad (16)$$

[2] I have learned this particular formulation from my colleagues Yihong Wu (here at Illinois) and Yury Polyanskiy (MIT).

Proof. Take the expectation of both sides of (13) with respect to $\mu$ and then use the fact that, under $\mu$, both $\theta$ and $\theta'$ have the same marginal distribution $\pi$.

In many cases, the analysis of minimax lower bounds can be reduced to the following problem: suppose that the parameter set $\Theta$ can be identified with the $m$-dimensional binary hypercube $\{0,1\}^m$ for some $m$, and the objective is to determine every bit of the underlying unknown parameter $\theta \in \{0,1\}^m$ based on an observation $Z \sim P_\theta$. In that setting, a key result known as Assouad's lemma says that the difficulty of estimating the entire bit string $\theta$ is related to the difficulty of estimating each bit of $\theta$ separately, assuming you already know all the other bits:

Theorem 3 (Assouad's lemma). Suppose that $\Theta = \{0,1\}^m$ and consider the Hamming metric

$$d_H(\theta,\theta') \triangleq \sum_{i=1}^m 1_{\{\theta_i \neq \theta'_i\}}. \qquad (17)$$

Then

$$M(\Theta) \ge \frac{m}{2}\Big(1 - \max_{\theta,\theta' :\, d_H(\theta,\theta') = 1} \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\Big). \qquad (18)$$

Proof. Notice that we can write $d_H(\theta,\theta')$ as a sum $\sum_{i=1}^m d_i(\theta,\theta')$, where $d_i(\theta,\theta') \triangleq 1_{\{\theta_i \neq \theta'_i\}}$, and each $d_i$ is a pseudometric. Now let $\pi$ be the uniform distribution on $\Theta = \{0,1\}^m$, and for each $i \in \{1,\ldots,m\}$ let $\mu_i$ be the joint distribution of a random pair $(\theta,\theta') \in \Theta \times \Theta$, under which $\theta \sim \pi$ and $\theta'$ differs from $\theta$ only in the $i$-th bit, i.e., $\theta \sim \pi$ and $d_i(\theta,\theta') = 1$. Then the marginal distribution of $\theta'$ under $\mu_i$ is

$$\sum_{\theta \in \{0,1\}^m} \mu_i(\theta,\theta') = \sum_{\theta \in \{0,1\}^m} \frac{1}{2^m}\, 1_{\{\theta_i \neq \theta'_i \text{ and } \theta_j = \theta'_j,\, j \neq i\}} = \frac{1}{2^m} = \pi(\theta'),$$

since for each $\theta'$ there is exactly one $\theta$ that differs from it in the $i$-th bit only. Now the idea is to apply Corollary 2 separately to each bit:

$$M(\Theta) \ge \inf_{\hat\theta} \mathbb{E}_\pi\big[d_H(\theta, \hat\theta(Z))\big] = \inf_{\hat\theta} \sum_{i=1}^m \mathbb{E}_\pi\big[d_i(\theta, \hat\theta(Z))\big] \ge \sum_{i=1}^m \inf_{\hat\theta} \mathbb{E}_\pi\big[d_i(\theta, \hat\theta(Z))\big] \ge \sum_{i=1}^m \frac{1}{2}\, \mathbb{E}_{\mu_i}\Big[ d_i(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big],$$

where the last step is by Corollary 2. Since $d_i(\theta,\theta') = 1$ under $\mu_i$ for every $i$, we have

$$M(\Theta) \ge \frac{1}{2} \sum_{i=1}^m \mathbb{E}_{\mu_i}\big[1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big] \ge \frac{m}{2} \min_{\theta,\theta' :\, d_H(\theta,\theta') = 1} \big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) = \frac{m}{2}\Big(1 - \max_{\theta,\theta' :\, d_H(\theta,\theta') = 1} \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\Big),$$

and we are done.
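As an illustration of Assouad's lemma (my own example, not from the notes), take $P_\theta$ to be a product of $m$ Bernoulli$((1 + (2\theta_i - 1)h)/2)$ coins, observed once. For $\theta, \theta'$ at Hamming distance 1, $\|P_\theta - P_{\theta'}\|_{\mathrm{TV}}$ reduces to the total variation distance between the two coin biases, which equals $h$, so (18) gives $M(\Theta) \ge (m/2)(1-h)$. The sketch below verifies the total variation computation by brute force:

    import itertools
    import numpy as np

    m, h = 4, 0.3

    def P(theta):
        # Product distribution on {0,1}^m whose i-th marginal is Bernoulli((1 + (2*theta_i - 1)*h)/2).
        biases = [(1 + (2 * t - 1) * h) / 2 for t in theta]
        return {z: np.prod([b if zi == 1 else 1 - b for b, zi in zip(biases, z)])
                for z in itertools.product([0, 1], repeat=m)}

    def tv(p, q):
        return 0.5 * sum(abs(p[z] - q[z]) for z in p)

    # Maximum TV over pairs (theta, theta') at Hamming distance 1; here it equals h.
    max_tv = 0.0
    for theta in itertools.product([0, 1], repeat=m):
        for i in range(m):
            theta2 = list(theta)
            theta2[i] = 1 - theta2[i]
            max_tv = max(max_tv, tv(P(theta), P(tuple(theta2))))

    print(f"max TV over neighbors: {max_tv:.4f} (equals h = {h})")
    print(f"Assouad lower bound (18): {0.5 * m * (1 - max_tv):.4f}")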

In order to apply Assouad's lemma, we need to have an upper bound on the total variation distance between any two $P_\theta$ and $P_{\theta'}$ with $d_H(\theta,\theta') = 1$. More often than not, it is easier to work with another distance between probability distributions, the so-called Hellinger distance. For any two $P, Q \in \mathcal{P}(\mathcal{Z})$, their squared Hellinger distance is given by

$$H^2(P,Q) \triangleq \sum_{z \in \mathcal{Z}} \big(\sqrt{P(z)} - \sqrt{Q(z)}\big)^2.$$

The total variation distance can be both upper- and lower-bounded by Hellinger:

$$\frac{1}{2} H^2(P,Q) \le \|P - Q\|_{\mathrm{TV}} \le H(P,Q). \qquad (19)$$

Moreover, for any $n$ pairs of distributions $(P_1,Q_1), \ldots, (P_n,Q_n)$,

$$H^2(P_1 \otimes \cdots \otimes P_n,\, Q_1 \otimes \cdots \otimes Q_n) \le \sum_{i=1}^n H^2(P_i, Q_i). \qquad (20)$$

Armed with these facts, we can prove the following version of Assouad's lemma:

Corollary 3. Let $\{P_\theta : \theta \in \{0,1\}^m\}$ be a collection of probability distributions on some set $\mathcal{Z}$ indexed by the vertices of the binary hypercube $\Theta = \{0,1\}^m$. Suppose that there exists some constant $\alpha > 0$, such that

$$H^2(P_\theta, P_{\theta'}) \le \alpha, \quad \text{if } d_H(\theta,\theta') = 1. \qquad (21)$$

Consider the problem of estimating the parameter $\theta \in \Theta$ based on $n$ i.i.d. observations from $P_\theta$, where the loss is measured by the Hamming distance. Then the corresponding minimax risk, which we denote by $M_n(\Theta)$, is lower-bounded as

$$M_n(\Theta) \ge \frac{m}{2}\big(1 - \sqrt{\alpha n}\big). \qquad (22)$$

Proof. For any $\theta \in \Theta$, let $P_\theta^{\otimes n}$ denote the product of $n$ copies of $P_\theta$, i.e., the joint distribution of $n$ i.i.d. samples from $P_\theta$. For any two $\theta, \theta' \in \Theta$ with $d_H(\theta,\theta') = 1$, we have

$$\|P_\theta^{\otimes n} - P_{\theta'}^{\otimes n}\|_{\mathrm{TV}} \le H\big(P_\theta^{\otimes n}, P_{\theta'}^{\otimes n}\big) \le \sqrt{n H^2(P_\theta, P_{\theta'})} \le \sqrt{\alpha n},$$

where the first step is by (19), the second is by (20), and the third is by (21). The bound (22) then follows from Assouad's lemma.

Assouad's lemma is a wonderful tool, but it is only applicable when the parameter space $\Theta$ is a binary hypercube and the metric $d$ is the Hamming metric. To handle other situations, we need different methods. Let's recall what we had seen earlier: the minimax risk $M(\Theta)$ is large if we can find at least two distinct points $\theta, \theta' \in \Theta$, such that (a) the distance $d(\theta,\theta')$ is large and (b) the distributions $P_\theta$ and $P_{\theta'}$ are statistically close to one another, i.e., the total variation distance $\|P_\theta - P_{\theta'}\|_{\mathrm{TV}}$ is small.
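Before moving on, here is a quick numerical check (illustrative only) of the Hellinger bounds (19) and the tensorization inequality (20), using the convention $H^2(P,Q) = \sum_z (\sqrt{P(z)} - \sqrt{Q(z)})^2$ adopted above:

    import numpy as np

    rng = np.random.default_rng(2)

    def hellinger_sq(P, Q):
        # H^2(P, Q) = sum_z (sqrt(P(z)) - sqrt(Q(z)))^2.
        return np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)

    def tv(P, Q):
        return 0.5 * np.sum(np.abs(P - Q))

    P, Q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    H2 = hellinger_sq(P, Q)
    assert 0.5 * H2 <= tv(P, Q) <= np.sqrt(H2) + 1e-12          # the two bounds in (19)

    # Tensorization (20): H^2 of an n-fold product is at most the sum of the marginal H^2's.
    n = 3
    P_prod, Q_prod = P, Q
    for _ in range(n - 1):
        P_prod, Q_prod = np.outer(P_prod, P).ravel(), np.outer(Q_prod, Q).ravel()
    assert hellinger_sq(P_prod, Q_prod) <= n * H2 + 1e-12
    print(f"H^2 = {H2:.4f}, TV = {tv(P, Q):.4f}, H^2 of the 3-fold product = {hellinger_sq(P_prod, Q_prod):.4f}")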

Let's see what happens when we follow this logic further and consider any two $\theta, \theta' \in \Theta$ that are separated from each other by a distance of at least $\varepsilon > 0$:

Lemma 2. Fix $\varepsilon > 0$, and let $\Theta_0 \subseteq \Theta$ be any set which is $\varepsilon$-separated, i.e., $d(\theta,\theta') \ge \varepsilon$ for any two distinct $\theta, \theta' \in \Theta_0$. Then

$$M(\Theta) \ge \frac{\varepsilon}{2} \inf_{\tilde\theta} \sup_{\theta \in \Theta_0} P_\theta\big(\tilde\theta \neq \theta\big), \qquad (23)$$

where the infimum is over all estimators $\tilde\theta$ that take values in $\Theta_0$.

Proof. Fix any $\varepsilon$-separated set $\Theta_0$. Then, since $\Theta_0 \subseteq \Theta$, we have $M(\Theta) \ge M(\Theta_0)$. Given any estimator $\hat\theta = \hat\theta(Z)$ taking values in $\Theta$, define another estimator $\tilde\theta = \tilde\theta(Z)$ taking values in $\Theta_0$ as follows:

$$\tilde\theta \triangleq \operatorname*{argmin}_{\theta' \in \Theta_0} d(\hat\theta, \theta').$$

Then, for any $\theta \in \Theta_0$,

$$d(\tilde\theta, \theta) \le d(\tilde\theta, \hat\theta) + d(\theta, \hat\theta) \le 2\, d(\theta, \hat\theta),$$

where the first step is by the triangle inequality, while the second step is by the definition of $\tilde\theta$. Consequently,

$$M(\Theta) \ge M(\Theta_0) \ge \frac{1}{2} \inf_{\hat\theta} \sup_{\theta \in \Theta_0} \mathbb{E}_\theta\big[d(\tilde\theta, \theta)\big]. \qquad (24)$$

Now let us consider the event $\{z \in \mathcal{Z} : \tilde\theta(z) \neq \theta\}$ for some $\theta \in \Theta_0$. Since $\tilde\theta \in \Theta_0$ by construction, and $\Theta_0$ is $\varepsilon$-separated, we have

$$\{z \in \mathcal{Z} : \tilde\theta(z) \neq \theta\} \subseteq \{z \in \mathcal{Z} : d(\tilde\theta(z), \theta) \ge \varepsilon\}.$$

Using this fact and Markov's inequality, we can write

$$\mathbb{E}_\theta\big[d(\tilde\theta, \theta)\big] \ge \varepsilon\, P_\theta\big(d(\tilde\theta, \theta) \ge \varepsilon\big) \ge \varepsilon\, P_\theta\big(\tilde\theta \neq \theta\big). \qquad (25)$$

Using this in (24), we get (23).

Lemma 2 is at its most effective when we can find a finite set $\Theta_0$ which is $\varepsilon$-separated and has large cardinality. The reason for the latter requirement is that then it should not be too hard to arrange things in such a way that the probability of error $P_\theta(\tilde\theta \neq \theta)$ is uniformly bounded away from zero for any estimator $\tilde\theta$. In particular, if we can find an $\varepsilon$-separated set $\Theta_0 \subseteq \Theta$, such that

$$\inf_{\tilde\theta} \max_{\theta \in \Theta_0} P_\theta\big(\tilde\theta \neq \theta\big) \ge \alpha$$

for some absolute constant $\alpha > 0$, then we can lower-bound the minimax risk $M(\Theta)$ by $\alpha\varepsilon/2$. So now we need tight lower bounds on the probability of error when testing multiple hypotheses that hold for any estimation procedure. Bounds of this sort rely, in one way or another, on a fundamental result from information theory known as Fano's inequality [CT06] (or Fano's lemma, as statisticians call it [Yu97]). We will use a particular version, due to Lucien Birgé [Bir05]. To state it, we first need to define relative entropy (or Kullback-Leibler divergence). Given any two probability distributions $P, Q$ on $\mathcal{Z}$, the relative entropy between $P$ and $Q$ is defined as

$$D(P \| Q) \triangleq \begin{cases} \displaystyle\sum_{z \in \mathcal{Z}} P(z) \log\frac{P(z)}{Q(z)}, & \text{if } \operatorname{supp}(P) \subseteq \operatorname{supp}(Q) \\ +\infty, & \text{otherwise,} \end{cases} \qquad (26)$$

where $\operatorname{supp}(P) \triangleq \{z \in \mathcal{Z} : P(z) > 0\}$ is the support of $P$. Here are some useful properties:

- $D(P \| Q) \ge 0$, with equality if and only if $P \equiv Q$;
- for any $n$ pairs of distributions $(P_i, Q_i)$, $1 \le i \le n$, we have

$$D(P_1 \otimes \cdots \otimes P_n \,\|\, Q_1 \otimes \cdots \otimes Q_n) = \sum_{i=1}^n D(P_i \| Q_i). \qquad (27)$$

Also, one can bound the total variation distance in terms of the KL divergence as follows:

$$\|P - Q\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2} D(P \| Q)}. \qquad (28)$$

This inequality is usually referred to as Pinsker's inequality after Mark S. Pinsker, although Pinsker didn't prove it in this form. The inequality (28), with the optimal constant $1/2$ in front of $D(P \| Q)$, was obtained independently by I. Csiszár, J. H. B. Kemperman, and S. Kullback.

With these preliminaries out of the way, we can state Birgé's bound:

Lemma 3 (Birgé). Let $\{P_i\}_{i=0}^N$ be a finite family of probability distributions on $\mathcal{Z}$, and let $\{A_i\}_{i=0}^N$ be a family of disjoint events. Then

$$\min_{0 \le i \le N} P_i(A_i) \le \max\left( \alpha, \frac{K}{\log(N+1)} \right),$$

where $\alpha = 0.71$ and

$$K = \frac{1}{N} \sum_{i=1}^N D(P_i \| P_0).$$

Now let's see how we can apply Birgé's lemma. Let $\Theta_0 = \{\theta_0, \ldots, \theta_N\}$ be some $\varepsilon$-separated subset of $\Theta$. Fix an estimator $\tilde\theta$ taking values in $\Theta_0$. For each $0 \le i \le N$, consider the probability distribution $P_i = P_{\theta_i}$ and the event $A_i = \{z \in \mathcal{Z} : \tilde\theta(z) = \theta_i\}$. Since the elements of $\Theta_0$ are all distinct, the events $A_0, \ldots, A_N$ are disjoint. Now suppose that we have chosen $\Theta_0$ in such a way that

$$K = \frac{1}{N} \sum_{i=1}^N D\big(P_{\theta_i} \,\|\, P_{\theta_0}\big) \le \alpha \log(N+1). \qquad (29)$$

Then, by Birgé's lemma,

$$\max_{\theta \in \Theta_0} P_\theta\big(\tilde\theta \neq \theta\big) = \max_{0 \le i \le N} P_i(A_i^c) = 1 - \min_{0 \le i \le N} P_i(A_i) \ge 1 - \alpha = 0.29.$$

This bound holds for any estimator $\tilde\theta$. Consequently, using Lemma 2, we get the lower bound $M(\Theta) \ge (1-\alpha)\varepsilon/2$.
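Before turning to the proofs, a short numerical check (illustrative, not from the notes) of Pinsker's inequality (28) and of the additivity property (27) for an i.i.d. product:

    import numpy as np

    rng = np.random.default_rng(3)

    def kl(P, Q):
        # D(P || Q) in nats; here supp(P) = supp(Q) = the whole alphabet, so (26) is finite.
        return np.sum(P * np.log(P / Q))

    def tv(P, Q):
        return 0.5 * np.sum(np.abs(P - Q))

    P, Q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    assert tv(P, Q) <= np.sqrt(0.5 * kl(P, Q))                  # Pinsker's inequality (28)

    # Additivity (27) for an i.i.d. pair: D(P x P || Q x Q) = 2 D(P || Q).
    P2, Q2 = np.outer(P, P).ravel(), np.outer(Q, Q).ravel()
    assert np.isclose(kl(P2, Q2), 2 * kl(P, Q))
    print(f"TV = {tv(P, Q):.4f} <= sqrt(D/2) = {np.sqrt(0.5 * kl(P, Q)):.4f}")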

2 Proof of Theorem 1

Roughly speaking, the strategy of the proof will be as follows. We start by observing that we can lower-bound the minimax risk $R_n(h,\mathcal{F})$ by

$$R_n(h,\mathcal{F}) \ge \inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big]$$

for any subset $\mathcal{Q} \subseteq \mathcal{P}(h,\mathcal{F})$. We will then construct $\mathcal{Q}$ in such a way that its elements can be naturally indexed by the vertices of a binary hypercube of dimension $V-1$ (where $V$ is the VC dimension of $\mathcal{F}$), and the expected excess risk for each $P \in \mathcal{Q}$ can be related to the minimax risk for the problem of estimating the corresponding binary string of length $V-1$. At that point, we will be in a position to apply Assouad's lemma.

2.1 Construction of Q

As stated above, our set $\mathcal{Q}$ of "difficult" distributions will consist of $2^{V-1}$ distributions $P_b$, $b \in \{0,1\}^{V-1}$. To construct these distributions, we will first pick the marginal distribution $P_X \in \mathcal{P}(\mathcal{X})$ of the feature $X$, and then specify the conditional distributions $P^{(b)}_{Y|X}$ of the binary label $Y$ given $X$, for each $b \in \{0,1\}^{V-1}$. For now, let us assume that $h > 0$.

We construct $P_X$ as follows. Since $\mathcal{F}$ is a VC class with VC dimension $V$, there exists a set $\{x_1, \ldots, x_V\} \subset \mathcal{X}$ of points that can be shattered by $\mathcal{F}$, i.e., for any binary string $\beta \in \{0,1\}^V$ there exists at least one $f \in \mathcal{F}$, such that $f(x_j) = \beta_j$ for all $1 \le j \le V$. Given a parameter $p \in [0, 1/(V-1)]$, which will be chosen later, we choose $P_X$ so that $P_X(\{x_1,\ldots,x_V\}) = 1$, and

$$P_X(x_j) = \begin{cases} p, & \text{if } 1 \le j \le V-1 \\ 1 - (V-1)p, & \text{if } j = V. \end{cases} \qquad (30)$$

In other words, $P_X$ is a discrete distribution supported on the shattered set $\{x_1, \ldots, x_V\}$ that puts the same mass $p$ on each of the first $V-1$ points and dumps the rest of the probability mass, equal to $1-(V-1)p$, on the last point $x_V$. Owing to our condition on $p$, this is a valid probability distribution.

Next we choose $P^{(b)}_{Y|X}$ for each $b \in \{0,1\}^{V-1}$ as follows: for a fixed $b$, the conditional distribution of $Y$ given $X = x$ is

$$\begin{cases} \mathrm{Bernoulli}\Big(\dfrac{1 + (2b_j - 1)h}{2}\Big), & \text{if } x = x_j \text{ and } j \in \{1,\ldots,V-1\} \\ \mathrm{Bernoulli}(0), & \text{otherwise.} \end{cases}$$

In other words, for a fixed $b$, we have

$$\eta_b(x) \triangleq P^{(b)}(Y = 1 | X = x) = \begin{cases} \dfrac{1-h}{2}, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 0 \\ \dfrac{1+h}{2}, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (31)$$

Therefore, the corresponding Bayes classifier, which we denote by $f^*_b$, is given by

$$f^*_b(x) = \begin{cases} 0, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 0 \\ 1, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (32)$$

That is, the output of the Bayes classifier on each $x_j$, $1 \le j \le V-1$, is simply equal to the bit value $b_j$, and it is zero everywhere else.
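To make the construction of $\mathcal{Q}$ concrete, here is a minimal sketch (an illustration with made-up function and variable names, not code from the notes) that draws $n$ training pairs from $P_b$ for a given bit string $b$, mass $p$, and margin $h$, following (30) and (31):

    import numpy as np

    def sample_from_P_b(b, p, h, n, rng):
        # Draw n i.i.d. pairs (X, Y) from P_b as defined by (30) and (31).
        # X takes values in {0, ..., V-1}, standing in for the shattered points x_1, ..., x_V.
        V = len(b) + 1
        assert 0 < p <= 1 / (V - 1) and 0 < h <= 1
        px = np.array([p] * (V - 1) + [1 - (V - 1) * p])                    # marginal (30)
        eta = np.array([(1 + (2 * bj - 1) * h) / 2 for bj in b] + [0.0])    # regression function (31)
        X = rng.choice(V, size=n, p=px)
        Y = (rng.uniform(size=n) < eta[X]).astype(int)
        return X, Y

    rng = np.random.default_rng(4)
    b = (1, 0, 1, 1)                     # a vertex of {0,1}^{V-1}, here with V = 5
    X, Y = sample_from_P_b(b, p=0.1, h=0.4, n=8, rng=rng)
    print(X, Y)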

It remains to check that the resulting set $\mathcal{Q} = \{P_b : b \in \{0,1\}^{V-1}\}$ is contained in $\mathcal{P}(h,\mathcal{F})$. First of all, from (31) we see that $|2\eta_b(x) - 1| \ge h$ for all $x$ (indeed, $|2\eta_b(x) - 1| = h$ when $x \in \{x_1,\ldots,x_{V-1}\}$, and $|2\eta_b(x) - 1| = 1$ otherwise). Secondly, because $\{x_1,\ldots,x_V\}$ is shattered by $\mathcal{F}$, there exists at least one $f \in \mathcal{F}$ such that $f^*_b(x) = f(x)$ for all $x \in \{x_1,\ldots,x_V\}$. Thus, $\mathcal{Q} \subset \mathcal{P}(h,\mathcal{F})$, as claimed.

2.2 Reduction to an estimation problem on the binary hypercube

Now that we have constructed $\mathcal{Q}$, we will show that the problem of learning a classifier when the $n$ training samples are drawn i.i.d. from some $P_b \in \mathcal{Q}$ is at least as difficult as reconstructing the underlying bit string $b$; indeed, intuitively this makes sense, given the structure of the Bayes classifier $f^*_b$ corresponding to $P_b$.

We start by noting that, for any classifier $f : \mathcal{X} \to \{0,1\}$ and any distribution $P$ on $\mathcal{X} \times \{0,1\}$, we have[3]

$$L(f) - L(f^*) = \mathbb{E}\big[|2\eta(X) - 1|\, 1_{\{f(X) \neq f^*(X)\}}\big].$$

From this, we see that if $P \in \mathcal{P}(h,\mathcal{F})$, then

$$L(f) - L(f^*) \ge h\, \mathbb{E}\big[1_{\{f(X) \neq f^*(X)\}}\big] = h\, \|f - f^*\|_{L_1},$$

where the $L_1$ norm is computed w.r.t. the marginal distribution $P_X$. Hence, we have

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] = \inf_{\hat{f}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\big[L(\hat{f}_n) - L(f^*_b)\big] \ge h \inf_{\hat{f}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\|\hat{f}_n - f^*_b\|_{L_1}, \qquad (33)$$

where the $L_1$ norm is w.r.t. the distribution $P_X$ defined in (30), and $\mathbb{E}_b[\cdot]$ denotes expectation w.r.t. $P_b$.

Now we are ready to carry out the reduction to an estimation problem on the binary hypercube $\{0,1\}^{V-1}$. Given any candidate estimator $\hat{f}_n$, let $\hat{b}_n$ be an estimator that takes values in $\{0,1\}^{V-1}$, and is defined as follows:

$$\hat{b}_n \triangleq \operatorname*{argmin}_{b \in \{0,1\}^{V-1}} \|\hat{f}_n - f^*_b\|_{L_1}. \qquad (34)$$

In other words, $\hat{b}_n$ is the binary string that indexes the element of $\{f^*_b : b \in \{0,1\}^{V-1}\}$ which is the closest to $\hat{f}_n$ in the $L_1$ norm. Then, for any $b$,

$$\|f^*_{\hat{b}_n} - f^*_b\|_{L_1} \le \|f^*_{\hat{b}_n} - \hat{f}_n\|_{L_1} + \|\hat{f}_n - f^*_b\|_{L_1} \le 2\|\hat{f}_n - f^*_b\|_{L_1}, \qquad (35)$$

where the first step is by the triangle inequality, while the second step is by (34). Combining (35) with (33), we get

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{h}{2} \inf_{\hat{b}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\|f^*_{\hat{b}_n} - f^*_b\|_{L_1}, \qquad (36)$$

where the infimum on the right-hand side is over all estimators $\hat{b}_n$ that take values in $\{0,1\}^{V-1}$ based on $n$ i.i.d. samples from one of the $P_b$'s.

[3] Exercise: prove this!
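The identity in footnote [3] can be checked numerically on a finite feature space; the sketch below (with arbitrary illustrative choices of $P_X$, $\eta$, and $f$) compares $L(f) - L(f^*)$ with $\mathbb{E}[|2\eta(X) - 1|\, 1_{\{f(X) \neq f^*(X)\}}]$:

    import numpy as np

    rng = np.random.default_rng(5)

    # A finite feature space {0, ..., k-1} with marginal px and regression function eta.
    k = 6
    px = rng.dirichlet(np.ones(k))
    eta = rng.uniform(size=k)

    def risk(f):
        # L(f) = P(f(X) != Y) = sum_x px(x) * [ (1 - eta(x)) 1{f(x) = 1} + eta(x) 1{f(x) = 0} ].
        return np.sum(px * np.where(f == 1, 1 - eta, eta))

    f_star = (eta >= 0.5).astype(int)        # Bayes classifier
    f = rng.integers(0, 2, size=k)           # an arbitrary competing classifier

    lhs = risk(f) - risk(f_star)
    rhs = np.sum(px * np.abs(2 * eta - 1) * (f != f_star))
    assert np.isclose(lhs, rhs)
    print(f"L(f) - L(f*) = {lhs:.4f} = E[|2 eta - 1| 1{{f != f*}}] = {rhs:.4f}")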

Now let us inspect the $L_1$ norm $\|f^*_b - f^*_{b'}\|_{L_1}$ for any two $b, b'$. Using (32), we have

$$\|f^*_b - f^*_{b'}\|_{L_1} = \int_{\mathcal{X}} |f^*_b(x) - f^*_{b'}(x)|\, P_X(dx) = \sum_{j=1}^V P_X(x_j)\, |f^*_b(x_j) - f^*_{b'}(x_j)| = p \sum_{j=1}^{V-1} |f^*_b(x_j) - f^*_{b'}(x_j)| = p \sum_{j=1}^{V-1} 1_{\{b_j \neq b'_j\}} = p\, d_H(b,b').$$

Substituting this into (36) gives

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{ph}{2} \inf_{\hat{b}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\big[d_H(\hat{b}_n, b)\big], \qquad (37)$$

as claimed. Now we are in a position to apply Assouad's lemma.

2.3 Applying Assouad's lemma

In order to apply Assouad's lemma, we need an upper bound on the squared Hellinger distance $H^2(P_b, P_{b'})$ for all $b, b'$ with $d_H(b,b') = 1$. This is a simple computation. In fact, for any two $b, b' \in \{0,1\}^{V-1}$ we have

$$H^2(P_b, P_{b'}) = \sum_{j=1}^V \sum_{y \in \{0,1\}} \Big(\sqrt{P_b(x_j,y)} - \sqrt{P_{b'}(x_j,y)}\Big)^2 = \sum_{j=1}^{V-1} P_X(x_j) \sum_{y \in \{0,1\}} \Big(\sqrt{P^{(b)}_{Y|X}(y|x_j)} - \sqrt{P^{(b')}_{Y|X}(y|x_j)}\Big)^2 = p \sum_{j=1}^{V-1} H^2\Big( \mathrm{Bernoulli}\Big(\tfrac{1+(2b_j-1)h}{2}\Big),\, \mathrm{Bernoulli}\Big(\tfrac{1+(2b'_j-1)h}{2}\Big) \Big).$$

For each $j \in \{1,\ldots,V-1\}$, the $j$-th term in the above summation is nonzero if and only if $b_j \neq b'_j$, in which case it is equal to the squared Hellinger distance between the $\mathrm{Bernoulli}\big(\tfrac{1-h}{2}\big)$ and $\mathrm{Bernoulli}\big(\tfrac{1+h}{2}\big)$ distributions. Thus,

$$H^2(P_b, P_{b'}) = p\, H^2\Big( \mathrm{Bernoulli}\Big(\tfrac{1-h}{2}\Big),\, \mathrm{Bernoulli}\Big(\tfrac{1+h}{2}\Big) \Big)\, d_H(b,b') = 2p\Big(1 - \sqrt{(1-h)(1+h)}\Big)\, d_H(b,b') = 2p\Big(1 - \sqrt{1-h^2}\Big)\, d_H(b,b').$$
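A one-line numerical check (illustrative) of the Bernoulli Hellinger computation above, together with the bound $2(1 - \sqrt{1-h^2}) \le 2h^2$ that is used in the next step:

    import numpy as np

    for h in [0.1, 0.3, 0.7, 0.95]:
        a, b = (1 - h) / 2, (1 + h) / 2
        H2 = (np.sqrt(a) - np.sqrt(b)) ** 2 + (np.sqrt(1 - a) - np.sqrt(1 - b)) ** 2
        closed_form = 2 * (1 - np.sqrt(1 - h ** 2))
        assert np.isclose(H2, closed_form) and closed_form <= 2 * h ** 2 + 1e-12
        print(f"h = {h}: H^2 = {H2:.4f} <= 2 h^2 = {2 * h ** 2:.4f}")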

In particular, the collection of distributions $\{P_b : b \in \{0,1\}^{V-1}\}$ satisfies the condition (21) of Corollary 3 with

$$\alpha = 2p\big(1 - \sqrt{1-h^2}\big) \le 2ph^2.$$

Therefore, applying the bound (22), we get

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{p(V-1)h}{4}\Big(1 - \sqrt{2nph^2}\Big).$$

If we let $p = 2/(9nh^2)$, the term in parentheses will be equal to $1/3$, and

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{V-1}{54nh},$$

assuming that the condition $p \le 1/(V-1)$ holds. This will be the case if $h \ge \sqrt{2(V-1)/(9n)}$, and in particular whenever $h \ge \sqrt{(V-1)/n}$. Therefore,

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{V-1}{54nh}, \quad \text{if } h \ge \sqrt{(V-1)/n}. \qquad (38)$$

If $h \le \sqrt{(V-1)/n}$, we can use the above construction with $\tilde{h} = \sqrt{(V-1)/n}$. In that case, because $\mathcal{P}(\tilde{h},\mathcal{F}) \subseteq \mathcal{P}(h,\mathcal{F})$ whenever $h \le \tilde{h}$, we see that

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(\tilde{h},\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{V-1}{54n\tilde{h}} = \frac{1}{54}\sqrt{\frac{V-1}{n}}, \quad \text{if } h \le \sqrt{(V-1)/n}. \qquad (39)$$

Putting together (38) and (39), we get the bound of Theorem 1.

3 Proof of Theorem 2

In its broad outline, the proof is very similar to the proof of Theorem 1. Fix some $N \ge 4D$. Since $\mathcal{F}$ is $(N,D)$-rich, there exist $N$ distinct points $x_1, \ldots, x_N \in \mathcal{X}$, such that

$$\{0,1\}^N_D \triangleq \big\{ b \in \{0,1\}^N : \mathrm{wt}(b) = D \big\} \subseteq \mathcal{F}(x^N),$$

where $\mathrm{wt}(b)$ denotes the Hamming weight of the binary string $b$. Let $P_X$ be the uniform distribution on $\{x_1,\ldots,x_N\}$. Also, for each $b \in \{0,1\}^N_D$, let

$$\eta_b(x_i) \triangleq \frac{1 + (2b_i - 1)h}{2}, \qquad 1 \le i \le N.$$

Now for each such $b$ define a distribution $P_b \in \mathcal{P}(\mathcal{X} \times \{0,1\})$ by $P_b = P_X \otimes P^{(b)}_{Y|X}$, where

$$P^{(b)}_{Y|X}(1|x_i) = \frac{1 + (2b_i - 1)h}{2} \equiv \eta_b(x_i).$$

In other words, the conditional distribution $P^{(b)}_{Y|X=x_i}$ is a Bernoulli measure with bias $\frac{1+h}{2}$ if $b_i = 1$ and $\frac{1-h}{2}$ if $b_i = 0$. The corresponding Bayes classifier is given by $f^*_b(x_i) = b_i$ for each $i$, as before. Moreover,

since $\mathcal{F}$ is $(N,D)$-rich, for any $b \in \{0,1\}^N_D$ there exists at least one $f \in \mathcal{F}$ such that $f(x_i) = b_i = f^*_b(x_i)$ for every $i$. Consequently, $\mathcal{Q} = \{P_b : b \in \{0,1\}^N_D\} \subset \mathcal{P}(h,\mathcal{F})$.

Proceeding in the same way as in the proof of Theorem 1, for any subset $\mathcal{C} \subseteq \{0,1\}^N_D$,

$$R_n(h,\mathcal{F}) \ge \frac{h}{2} \inf_{\hat{b}_n \in \mathcal{C}} \max_{b \in \mathcal{C}} \mathbb{E}_b\|f^*_{\hat{b}_n} - f^*_b\|_{L_1} = \frac{h}{2N} \inf_{\hat{b}_n \in \mathcal{C}} \max_{b \in \mathcal{C}} \mathbb{E}_b\big[d_H(\hat{b}_n, b)\big].$$

Since $\{0,1\}^N_D$ is not the entire binary hypercube $\{0,1\}^N$, we cannot use Assouad's lemma. Instead, we will use Lemma 2 in conjunction with Birgé's bound. In order to do that, we must first construct a nice, well-separated subset of $\{0,1\}^N_D$. The following combinatorial result, due to P. Reynaud-Bouret, does the trick. There exists a subset $\mathcal{C} \subset \{0,1\}^N_D$ with the following properties:

1. $d_H(b,b') > D/2$ for any two distinct $b, b' \in \mathcal{C}$;
2. $\log|\mathcal{C}| \ge \kappa D \log\frac{N}{D}$, where $\kappa \ge 0.233$.

Item 1 says that this $\mathcal{C}$ is $D/2$-separated in the Hamming distance. Thus, by Lemma 2,

$$R_n(h,\mathcal{F}) \ge \frac{hD}{4N} \inf_{\hat{b}_n \in \mathcal{C}} \max_{b \in \mathcal{C}} P_b\big(\hat{b}_n \neq b\big) = \frac{hD}{4N}\Big(1 - \sup_{\hat{b}_n \in \mathcal{C}} \min_{b \in \mathcal{C}} P_b\big(\hat{b}_n = b\big)\Big).$$

We are now in a position to apply Birgé's lemma:

$$\sup_{\hat{b}_n \in \mathcal{C}} \min_{b \in \mathcal{C}} P_b\big(\hat{b}_n = b\big) \le \alpha,$$

where $\alpha = 0.71$, provided

$$K = \frac{1}{|\mathcal{C}| - 1} \sum_{b \in \mathcal{C},\, b \neq b_0} D\big(P_b^{\otimes n} \,\|\, P_{b_0}^{\otimes n}\big) \le \alpha \log|\mathcal{C}|, \qquad (40)$$

where $b_0$ is some fixed but arbitrary element of $\mathcal{C}$, and $P_b^{\otimes n}$ is the product of $n$ copies of $P_b$ (recall that our estimators are based on an i.i.d. sample of size $n$).

So now we turn to the analysis of $K$. For any two $b, b' \in \{0,1\}^N$, we have

$$D\big(P_b^{\otimes n} \,\|\, P_{b'}^{\otimes n}\big) = n D(P_b \| P_{b'}) = n \sum_{i=1}^N \sum_{y \in \{0,1\}} P_b(x_i, y) \log\frac{P_b(x_i,y)}{P_{b'}(x_i,y)} = \frac{n}{N} \sum_{i=1}^N \sum_{y \in \{0,1\}} P^{(b)}_{Y|X}(y|x_i) \log\frac{P^{(b)}_{Y|X}(y|x_i)}{P^{(b')}_{Y|X}(y|x_i)} = \frac{n}{N} \sum_{i=1}^N D\Big( \mathrm{Bernoulli}\Big(\tfrac{1+(2b_i-1)h}{2}\Big) \,\Big\|\, \mathrm{Bernoulli}\Big(\tfrac{1+(2b'_i-1)h}{2}\Big) \Big),$$

where the first step is by (27), while the rest follows from the definitions. Now, in the above summation, the $i$-th term is nonzero if and only if $b_i \neq b'_i$, and in that case it is equal to

$$D\Big( \mathrm{Bernoulli}\Big(\tfrac{1+h}{2}\Big) \,\Big\|\, \mathrm{Bernoulli}\Big(\tfrac{1-h}{2}\Big) \Big) = h \log\frac{1+h}{1-h}.$$

Consequently,

$$D\big(P_b^{\otimes n} \,\|\, P_{b'}^{\otimes n}\big) = \frac{n}{N}\, h \log\frac{1+h}{1-h} \sum_{i=1}^N 1_{\{b_i \neq b'_i\}} = \frac{n}{N}\, h \log\frac{1+h}{1-h}\, d_H(b,b').$$

Now, for any two $b, b' \in \{0,1\}^N_D$, $d_H(b,b') \le \mathrm{wt}(b) + \mathrm{wt}(b') = 2D$. Moreover, using the bound $\log t \le t - 1$ for any $t > 0$, we have

$$h \log\frac{1+h}{1-h} \le h\left(\frac{1+h}{1-h} - 1\right) = \frac{2h^2}{1-h}.$$

Therefore, for any two $b, b' \in \mathcal{C}$,

$$D\big(P_b^{\otimes n} \,\|\, P_{b'}^{\otimes n}\big) \le \frac{4nh^2 D}{N(1-h)},$$

which implies that

$$K \le \frac{4nh^2 D}{N(1-h)}.$$

Using this and the fact that $\log|\mathcal{C}| \ge \kappa D \log\frac{N}{D}$ with $\kappa \ge 0.233$, we see that the condition (40) will be satisfied if we can choose some $N \ge 4D$ so that

$$N \log\frac{N}{D} \ge \frac{4nh^2}{\alpha\kappa(1-h)}. \qquad (41)$$

A tedious calculation (see [MN06]) shows that the choice

$$N = \frac{8nh^2}{\alpha\kappa(1-h)\big(1 + \log\frac{nh^2}{D}\big)}$$

does the job, provided $h \ge \sqrt{D/n}$. This completes the proof.

References

[Bir05] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Transactions on Information Theory, 51(4), April 2005.

[CT06] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2nd edition, 2006.

[MN06] P. Massart and É. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326-2366, 2006.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[Yu97] B. Yu. Assouad, Fano, and Le Cam. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam. Springer, 1997.
