Minimax lower bounds


Maxim Raginsky
December 4, 2013

Now that we have a good handle on the performance of ERM and its variants, it is time to ask whether we can do better. For example, consider binary classification: we observe $n$ i.i.d. training samples from an unknown joint distribution $P$ on $\mathcal{X} \times \{0,1\}$, where $\mathcal{X}$ is some feature space, and for a fixed class $\mathcal{F}$ of candidate classifiers $f : \mathcal{X} \to \{0,1\}$ we let $\hat{f}_n$ be the ERM solution

$$\hat{f}_n = \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n 1_{\{f(X_i) \neq Y_i\}}. \qquad (1)$$

If $\mathcal{F}$ is a VC class with VC dimension $V$, then the excess risk of $\hat{f}_n$ over the best-in-class performance $L^*(\mathcal{F}) \triangleq \inf_{f \in \mathcal{F}} L(f)$ satisfies

$$L(\hat{f}_n) \le L^*(\mathcal{F}) + C\left(\sqrt{\frac{V}{n}} + \sqrt{\frac{\log(1/\delta)}{n}}\right)$$

with probability at least $1-\delta$, where $C > 0$ is some absolute constant. Integrating, we also get the following bound on the expected excess risk:

$$\mathbb{E}\big[L(\hat{f}_n) - L^*(\mathcal{F})\big] \le C\sqrt{\frac{V}{n}}, \qquad (2)$$

for some constant $C > 0$. Crucially, the bound (2) holds for all possible joint distributions $P$ on $\mathcal{X} \times \{0,1\}$, and the right-hand side is independent of $P$: it depends only on the properties of the class $\mathcal{F}$! Thus, we deduce the following remarkable distribution-free guarantee for ERM: for any VC class $\mathcal{F}$, the ERM algorithm (1) satisfies

$$\sup_{P \in \mathcal{P}(\mathcal{X} \times \{0,1\})} \mathbb{E}_P\big[L_P(\hat{f}_n) - L_P^*(\mathcal{F})\big] \le C\sqrt{\frac{V}{n}}. \qquad (3)$$

(We have used the subscript $P$ to explicitly indicate the fact that the quantity under the supremum depends on the underlying distribution $P$. In the sequel, we will often drop the subscript to keep the notation uncluttered.)

Let's take a moment to reflect on the significance of the bound (3). What it says is that, regardless of how weird the stochastic relationship between the feature $X \in \mathcal{X}$ and the label $Y \in \{0,1\}$ might be, as long as we scale our ambition back and aim at approaching the performance of the best classifier in some VC class $\mathcal{F}$, the ERM algorithm will produce a good classifier with a uniform $O(\sqrt{V/n})$ guarantee on its excess risk.
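As a concrete illustration of the ERM rule (1), here is a minimal Python sketch (not taken from the notes; the data-generating distribution, the threshold class, and all names are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative data: X uniform on [0, 1], Y = 1{X >= 0.3} with labels flipped w.p. 0.1.
    n = 500
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (X >= 0.3).astype(int) ^ (rng.uniform(size=n) < 0.1).astype(int)

    # A small class F of threshold classifiers f_t(x) = 1{x >= t}; its VC dimension is 1.
    thresholds = np.linspace(0.0, 1.0, 101)

    def empirical_risk(t):
        # (1/n) sum_i 1{f_t(X_i) != Y_i}, the quantity minimized in (1).
        return np.mean((X >= t).astype(int) != Y)

    # ERM: pick the classifier in F with the smallest empirical risk.
    risks = np.array([empirical_risk(t) for t in thresholds])
    t_hat = thresholds[np.argmin(risks)]
    print(f"ERM threshold: {t_hat:.2f}, empirical risk: {risks.min():.3f}")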

At this point, we stop and ask ourselves: could this bound be too pessimistic, even when we are so lucky that the optimal (Bayes) classifier happens to be in $\mathcal{F}$? (Recall that the Bayes classifier for a given $P$ has the form

$$f^*(x) = \begin{cases} 1, & \text{if } \eta(x) \ge 1/2 \\ 0, & \text{otherwise} \end{cases}$$

where $\eta(x) = \mathbb{E}[Y|X=x] = P(Y=1|X=x)$ is the regression function.) Let $\mathcal{P}^*(\mathcal{F})$ denote the subset of $\mathcal{P}(\mathcal{X} \times \{0,1\})$ consisting of all joint distributions of $X \in \mathcal{X}$ and $Y \in \{0,1\}$ such that $f^* \in \mathcal{F}$. Then from (2) we have

$$\sup_{P \in \mathcal{P}^*(\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \le C\sqrt{\frac{V}{n}}, \qquad (4)$$

where $\hat{f}_n$ is the ERM solution (1). However, we know that if the relationship between $X$ and $Y$ is deterministic, i.e., if $Y = f^*(X)$, then ERM performs much better. More precisely, let $\mathcal{P}_0(\mathcal{F})$ be the zero-error class:

$$\mathcal{P}_0(\mathcal{F}) \triangleq \big\{ P \in \mathcal{P}^*(\mathcal{F}) : Y = f^*(X) \text{ a.s.} \big\}.$$

Then one can show that

$$\sup_{P \in \mathcal{P}_0(\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \le C\,\frac{V}{n}, \qquad (5)$$

a much better bound than the global bound (4) (see, e.g., the book by Vapnik [Vap98]). This suggests that the performance of ERM is somehow tied to how sharp the behavior of $\eta$ is around the decision boundary that separates the sets $\{x \in \mathcal{X} : \eta(x) \ge 1/2\}$ and $\{x \in \mathcal{X} : \eta(x) < 1/2\}$. To see whether this is the case, let us define, for any $h \in [0,1]$, the class of distributions

$$\mathcal{P}(h,\mathcal{F}) \triangleq \big\{ P \in \mathcal{P}^*(\mathcal{F}) : |2\eta(X) - 1| \ge h \text{ a.s.} \big\}$$

(in that case, the distributions in $\mathcal{P}(h,\mathcal{F})$ are said to satisfy the Massart noise condition with margin $h$). We have already seen the two extreme cases:

- $h = 0$: this gives $\mathcal{P}(0,\mathcal{F}) = \mathcal{P}^*(\mathcal{F})$ (the bound $|2\eta - 1| \ge 0$ holds trivially for any $P$).
- $h = 1$: this gives the zero-error regime $\mathcal{P}(1,\mathcal{F}) = \mathcal{P}_0(\mathcal{F})$ (if $|2\eta - 1| \ge 1$ a.s., then $\eta$ can take only the values $0$ and $1$ a.s.).

However, intermediate values of $h$ are also of interest: if a distribution $P$ belongs to $\mathcal{P}(h,\mathcal{F})$ for some $0 < h < 1$, then its regression function $\eta$ makes a jump of size at least $h$ as we cross the decision boundary. With this in mind, for any $n \in \mathbb{N}$ and $h \in [0,1]$ let us define the minimax risk

$$R_n(h,\mathcal{F}) \triangleq \inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big], \qquad (6)$$

where the infimum is over all learning algorithms based on $n$ i.i.d. training samples. The term "minimax" indicates that we are minimizing over all admissible learning algorithms, while maximizing over all distributions in a given class. The following result was proved by Pascal Massart and Élodie Nédélec [MN06]:

Theorem 1. Let $\mathcal{F}$ be a VC class of binary-valued functions on $\mathcal{X}$ with VC dimension $V$. Then for any $n \ge V$ and any $h \in [0,1]$ we have the lower bound

$$R_n(h,\mathcal{F}) \ge c \min\left( \sqrt{\frac{V}{n}}, \frac{V}{nh} \right), \qquad (7)$$

where $c > 0$ is some absolute constant.

Let us examine some implications:

- When $h = 0$, the right-hand side of (7) is equal to $c\sqrt{V/n}$. Thus, without any further assumptions, ERM is as good as it gets (it is minimax-optimal, up to multiplicative constants).
- When $h = 1$ (the zero-error case), the right-hand side of (7) is equal to $cV/n$, which matches the upper bound (5) up to constants. Thus, if we happen to know that we are in a zero-error situation, ERM is minimax-optimal as well.
- For intermediate values of $h$, the lower bound depends on the relative sizes of $h$, $V$, and $n$. In particular, if $h \ge \sqrt{V/n}$, we have the minimax lower bound $R_n(h,\mathcal{F}) \ge cV/(nh)$. Alternatively, for a fixed $h \in (0,1)$, we may think of $n = V/h^2$ as the cutoff sample size, beyond which the effect of the margin condition on $\eta$ can be spotted and exploited by a learning algorithm.

In the same paper, Massart and Nédélec obtain the following upper bound on ERM:

$$\sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \le \begin{cases} C\sqrt{\dfrac{V}{n}}, & \text{if } h \le \sqrt{V/n} \\[1ex] C\,\dfrac{V}{nh}\left(1 + \log\dfrac{nh^2}{V}\right), & \text{if } h > \sqrt{V/n}. \end{cases} \qquad (8)$$

Thus, ERM is nearly minimax-optimal (we say "nearly" because of the extra log factor in the above bound); in fact, as Massart and Nédélec show, the log factor is unavoidable when the function class $\mathcal{F}$ is rich in a certain sense. The proof of the above upper bound is rather involved and technical, and we will not get into it here.

The appearance of the logarithmic term in (8) is rather curious. Given the lower bound of Theorem 1, one may be tempted to dismiss it as an artifact of the analysis used by Massart and Nédélec. However, as we will now see, in certain situations this logarithmic term is unavoidable. To that end, we first need a definition: we say that a class $\mathcal{F}$ of binary-valued functions $f : \mathcal{X} \to \{0,1\}$ is $(N,D)$-rich, for some $N, D \in \mathbb{N}$, if there exist $N$ distinct points $x_1, \ldots, x_N \in \mathcal{X}$, such that the projection

$$\mathcal{F}(x^N) \triangleq \big\{ (f(x_1), \ldots, f(x_N)) : f \in \mathcal{F} \big\}$$

of $\mathcal{F}$ onto $x^N$ contains all binary strings of Hamming weight[1] $D$. Some examples:

- If $\mathcal{F}$ is a VC class with VC dimension $V$, then it is $(V,D)$-rich for all $1 \le D \le V$. This follows directly from the definitions.

[1] The Hamming weight of a binary string is the number of nonzero bits it has.

- A nontrivial example, and one that is relevant to statistical learning, is as follows. Let $\mathcal{F}$ be the collection of indicators of all halfspaces in $\mathbb{R}^d$, for some $d \ge 2$. There is a result in computational geometry which says that, for any $N \ge d+1$, one can find $N$ distinct points $x_1, \ldots, x_N$, such that $\mathcal{F}(x^N)$ contains all strings in $\{0,1\}^N$ with Hamming weight up to, and including, $\lfloor d/2 \rfloor$. Consequently, $\mathcal{F}$ is $(N, \lfloor d/2 \rfloor)$-rich for all $N \ge d+1$.

We can now state the following result [MN06]:

Theorem 2. Given some $D \ge 1$, suppose that $\mathcal{F}$ is $(N,D)$-rich for all $N \ge 4D$. Then

$$R_n(h,\mathcal{F}) \ge c\,(1-h)\,\frac{D}{nh}\left[1 + \log\frac{nh^2}{D}\right] \qquad (9)$$

for any $\sqrt{D/n} \le h < 1$, where $c > 0$ is some absolute constant.

We will present the proofs of Theorems 1 and 2 in Sections 2 and 3, after giving some necessary background on minimax lower bounds.

1 Preparation: minimax lower bounds for statistical estimation

To prove Theorem 1, we need some background on minimax lower bounds for statistical estimation problems. The presentation here follows an excellent paper by Bin Yu [Yu97]. The setting is as follows: we have an indexed set $\{P_\theta : \theta \in \Theta\}$ of probability distributions on some set $\mathcal{Z}$, where $\Theta$ is some parameter set. For our purposes, it suffices to consider the case when $\mathcal{Z}$ is a finite set. We observe a random sample $Z$ from one of the $P_\theta$'s, and our goal is to estimate $\theta$. We will measure the quality of any estimator $\hat\theta = \hat\theta(Z)$ by its expected risk

$$\mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] = \sum_{z \in \mathcal{Z}} P_\theta(z)\, d(\theta, \hat\theta(z)),$$

where $\mathbb{E}_\theta[\cdot]$ denotes expectation with respect to $P_\theta$, and $d : \Theta \times \Theta \to \mathbb{R}_+$ is a given loss function. Here, we will assume that $d$ is a pseudometric on $\Theta$, i.e., it has the following properties:

1. Symmetry: $d(\theta,\theta') = d(\theta',\theta)$ for any $\theta, \theta' \in \Theta$.
2. Triangle inequality: $d(\theta,\theta') \le d(\theta,\eta) + d(\eta,\theta')$ for any $\theta, \theta', \eta \in \Theta$.

(We do not require that $d(\theta,\theta') = 0$ if and only if $\theta = \theta'$, in which case $d$ would be a metric.)

Since we do not know ahead of time which of the $\theta$'s we will be facing, it is natural to expect the worst and seek an estimator $\hat\theta$ that minimizes the worst-case risk $\sup_{\theta \in \Theta} \mathbb{E}_\theta[d(\theta, \hat\theta(Z))]$. With this in mind, let us define the minimax risk

$$M(\Theta) \triangleq \inf_{\hat\theta} \sup_{\theta \in \Theta} \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big],$$

where the infimum is over all estimators $\hat\theta = \hat\theta(Z)$. We are particularly interested in tight lower bounds on $M(\Theta)$, since they will give us some idea of how difficult the estimation problem is.
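To make the definition of $M(\Theta)$ concrete, here is a small illustrative sketch (not from the notes): for a toy family of two distributions on a three-point alphabet, with the 0-1 loss $d(\theta,\theta') = 1_{\{\theta \neq \theta'\}}$, it computes the best worst-case risk over all deterministic estimators $\hat\theta : \mathcal{Z} \to \Theta$ by brute-force enumeration. (Allowing randomized estimators could only lower this value; the point is just to exhibit the inf-sup structure.)

    import itertools
    import numpy as np

    # Toy problem: Theta = {0, 1}, Z = {0, 1, 2}, and a single observation Z ~ P_theta.
    P = {0: np.array([0.6, 0.3, 0.1]),
         1: np.array([0.2, 0.3, 0.5])}

    def loss(t, s):
        # 0-1 loss on the parameter space (a pseudometric).
        return float(t != s)

    def worst_case_risk(estimator):
        # sup over theta of E_theta[ d(theta, estimator(Z)) ].
        return max(sum(P[t][z] * loss(t, estimator[z]) for z in range(3)) for t in P)

    # Enumerate all deterministic estimators Z -> Theta and keep the best worst-case risk.
    M = min(worst_case_risk(est) for est in itertools.product([0, 1], repeat=3))
    print(f"best worst-case risk over deterministic estimators: {M:.3f}")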

A simple yet powerful technique for getting lower bounds is the two-point method introduced by Lucien Le Cam:

Lemma 1. For any two $\theta, \theta' \in \Theta$ and any estimator $\hat\theta$,

$$\mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] + \mathbb{E}_{\theta'}\big[d(\theta', \hat\theta(Z))\big] \ge d(\theta,\theta') \sum_{z \in \mathcal{Z}} \min\big(P_\theta(z), P_{\theta'}(z)\big). \qquad (10)$$

Proof. Consider an arbitrary $z \in \mathcal{Z}$. If $P_\theta(z) \ge P_{\theta'}(z)$, then

$$P_\theta(z)\, d(\theta, \hat\theta(z)) + P_{\theta'}(z)\, d(\theta', \hat\theta(z)) = P_{\theta'}(z)\big[d(\theta, \hat\theta(z)) + d(\theta', \hat\theta(z))\big] + \big[P_\theta(z) - P_{\theta'}(z)\big] d(\theta, \hat\theta(z)) \ge P_{\theta'}(z)\big[d(\theta, \hat\theta(z)) + d(\theta', \hat\theta(z))\big] \ge P_{\theta'}(z)\, d(\theta,\theta'),$$

where the second step is because $P_\theta(z) \ge P_{\theta'}(z)$ and $d(\cdot,\cdot)$ is nonnegative, while the third step is by the triangle inequality. By the same token, if $P_\theta(z) \le P_{\theta'}(z)$, then

$$P_\theta(z)\, d(\theta, \hat\theta(z)) + P_{\theta'}(z)\, d(\theta', \hat\theta(z)) \ge P_\theta(z)\, d(\theta,\theta').$$

In either case, the right-hand side equals $\min(P_\theta(z), P_{\theta'}(z))\, d(\theta,\theta')$; summing over all $z \in \mathcal{Z}$, we get (10).

The sum on the right-hand side of (10) can be expressed in terms of the so-called total variation distance between $P_\theta$ and $P_{\theta'}$. For any two probability distributions $P, Q \in \mathcal{P}(\mathcal{Z})$, the total variation distance is

$$\|P - Q\|_{\mathrm{TV}} \triangleq \frac{1}{2} \sum_{z \in \mathcal{Z}} |P(z) - Q(z)|. \qquad (11)$$

Moreover,

$$\|P - Q\|_{\mathrm{TV}} = 1 - \sum_{z \in \mathcal{Z}} \min\big(P(z), Q(z)\big). \qquad (12)$$

To prove (12), we just use the definition (11): if we let $\mathcal{Z}^* \triangleq \{z \in \mathcal{Z} : P(z) \ge Q(z)\}$, then

$$\|P - Q\|_{\mathrm{TV}} = \frac{1}{2} \sum_{z \in \mathcal{Z}^*} (P(z) - Q(z)) + \frac{1}{2} \sum_{z \notin \mathcal{Z}^*} (Q(z) - P(z)) = \frac{1}{2}\big(P(\mathcal{Z}^*) - Q(\mathcal{Z}^*)\big) + \frac{1}{2}\big(Q(\mathcal{Z}^{*c}) - P(\mathcal{Z}^{*c})\big) = \frac{1}{2}\big(P(\mathcal{Z}^*) - Q(\mathcal{Z}^*)\big) + \frac{1}{2}\big(P(\mathcal{Z}^*) - Q(\mathcal{Z}^*)\big) = P(\mathcal{Z}^*) - Q(\mathcal{Z}^*) = \sum_{z : P(z) \ge Q(z)} (P(z) - Q(z)) = \sum_{z : P(z) \ge Q(z)} \big(P(z) - \min(P(z),Q(z))\big) + \underbrace{\sum_{z : P(z) < Q(z)} \big(P(z) - \min(P(z),Q(z))\big)}_{=0} = \sum_{z \in \mathcal{Z}} \big(P(z) - \min(P(z),Q(z))\big) = 1 - \sum_{z \in \mathcal{Z}} \min\big(P(z),Q(z)\big),$$

and we are done. Thus, we can rewrite the bound (10) as

$$\mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] + \mathbb{E}_{\theta'}\big[d(\theta', \hat\theta(Z))\big] \ge d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big). \qquad (13)$$
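A quick numerical sanity check of the identity (12), using randomly generated distributions on a finite alphabet (illustrative code, not part of the notes):

    import numpy as np

    rng = np.random.default_rng(1)

    # Two random probability distributions on a finite alphabet of size 6.
    P = rng.dirichlet(np.ones(6))
    Q = rng.dirichlet(np.ones(6))

    tv_def = 0.5 * np.sum(np.abs(P - Q))        # definition (11)
    tv_alt = 1.0 - np.sum(np.minimum(P, Q))     # identity (12)
    assert np.isclose(tv_def, tv_alt)
    print(f"TV by (11): {tv_def:.6f}, by (12): {tv_alt:.6f}")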

Let us examine some consequences. If we take the supremum of both sides of (13) over the choices of $\theta$ and $\theta'$, then we see that, for any estimator $\hat\theta$,

$$\sup_{\theta \in \Theta} \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] \ge \frac{1}{2} \sup_{\theta \neq \theta'} \Big\{ d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big\}. \qquad (14)$$

Notice that the right-hand side does not depend on the choice of $\hat\theta$, so we can take the infimum of both sides of (14) over $\hat\theta$ and obtain the following:

Corollary 1. The minimax risk $M(\Theta)$ can be lower-bounded as

$$M(\Theta) \ge \frac{1}{2} \sup_{\theta \neq \theta'} \Big\{ d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big\}. \qquad (15)$$

The two-point bound (15) gives us an idea of when the problem of estimating $\theta$ based on observations $Z \sim P_\theta$ is difficult: when there exists at least one pair of distinct parameter values $\theta, \theta' \in \Theta$ such that $d(\theta,\theta')$ is large, while the total variation distance $\|P_\theta - P_{\theta'}\|_{\mathrm{TV}}$ is small. This means that the observation $Z$ does not give us enough information to reliably distinguish between $\theta$ and $\theta'$, but the consequences of mistaking one for the other are severe (since $d(\theta,\theta')$ is large).

Another useful consequence of Lemma 1 comes about when we consider the Bayesian setting, i.e., when we have a prior distribution $\pi$ on $\Theta$ and consider the average risk of any estimator $\hat\theta$:

$$\mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] \triangleq \int_\Theta \pi(d\theta)\, \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big].$$

Then we have the following lower bound on $M(\Theta)$:

$$M(\Theta) \ge \inf_{\hat\theta} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big].$$

To prove this, we simply note that, for any $\hat\theta$,

$$\mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] = \int_\Theta \pi(d\theta)\, \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big] \le \sup_{\theta \in \Theta} \mathbb{E}_\theta\big[d(\theta, \hat\theta(Z))\big].$$

(In fact, under suitable regularity conditions, one can prove that the minimax risk is equal to the worst-case Bayes risk:

$$M(\Theta) = \inf_{\hat\theta} \sup_{\pi \in \mathcal{P}(\Theta)} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] = \sup_{\pi \in \mathcal{P}(\Theta)} \inf_{\hat\theta} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big],$$

where the first equality always holds, while the second equality is valid whenever the conditions of the minimax theorem are satisfied.)

With these definitions, we have the following:[2]

Corollary 2. Let $\pi$ be any prior distribution on $\Theta$, and let $\mu$ be any joint probability distribution of a random pair $(\theta,\theta') \in \Theta \times \Theta$, such that the marginal distributions of both $\theta$ and $\theta'$ are equal to $\pi$. Then

$$M(\Theta) \ge \inf_{\hat\theta} \mathbb{E}_\pi\big[d(\theta, \hat\theta(Z))\big] \ge \frac{1}{2}\, \mathbb{E}_\mu\Big[ d(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big]. \qquad (16)$$

[2] I have learned this particular formulation from my colleagues Yihong Wu (here at Illinois) and Yury Polyanskiy (MIT).

Proof. Take the expectation of both sides of (13) with respect to $\mu$ and then use the fact that, under $\mu$, both $\theta$ and $\theta'$ have the same marginal distribution $\pi$.

In many cases, the analysis of minimax lower bounds can be reduced to the following problem: suppose that the parameter set $\Theta$ can be identified with the $m$-dimensional binary hypercube $\{0,1\}^m$ for some $m$, and the objective is to determine every bit of the underlying unknown parameter $\theta \in \{0,1\}^m$ based on an observation $Z \sim P_\theta$. In that setting, a key result known as Assouad's lemma says that the difficulty of estimating the entire bit string $\theta$ is related to the difficulty of estimating each bit of $\theta$ separately, assuming you already know all the other bits:

Theorem 3 (Assouad's lemma). Suppose that $\Theta = \{0,1\}^m$ and consider the Hamming metric

$$d_H(\theta,\theta') \triangleq \sum_{i=1}^m 1_{\{\theta_i \neq \theta'_i\}}. \qquad (17)$$

Then

$$M(\Theta) \ge \frac{m}{2}\Big(1 - \max_{\theta,\theta' :\, d_H(\theta,\theta') = 1} \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\Big). \qquad (18)$$

Proof. Notice that we can write $d_H(\theta,\theta')$ as a sum $\sum_{i=1}^m d_i(\theta,\theta')$, where $d_i(\theta,\theta') \triangleq 1_{\{\theta_i \neq \theta'_i\}}$, and each $d_i$ is a pseudometric. Now let $\pi$ be the uniform distribution on $\Theta = \{0,1\}^m$, and for each $i \in \{1,\ldots,m\}$ let $\mu_i$ be the joint distribution of a random pair $(\theta,\theta') \in \Theta \times \Theta$, under which $\theta \sim \pi$ and $\theta'$ differs from $\theta$ only in the $i$-th bit, i.e., $\theta \sim \pi$ and $d_i(\theta,\theta') = 1$. Then the marginal distribution of $\theta'$ under $\mu_i$ is

$$\sum_{\theta \in \{0,1\}^m} \mu_i(\theta,\theta') = \sum_{\theta \in \{0,1\}^m} \frac{1}{2^m}\, 1_{\{\theta_i \neq \theta'_i \text{ and } \theta_j = \theta'_j,\, j \neq i\}} = \frac{1}{2^m} = \pi(\theta'),$$

since for each $\theta'$ there is exactly one $\theta$ that differs from it in the $i$-th bit only. Now the idea is to apply Corollary 2 separately to each bit:

$$M(\Theta) \ge \inf_{\hat\theta} \mathbb{E}_\pi\big[d_H(\theta, \hat\theta(Z))\big] = \inf_{\hat\theta} \sum_{i=1}^m \mathbb{E}_\pi\big[d_i(\theta, \hat\theta(Z))\big] \ge \sum_{i=1}^m \inf_{\hat\theta} \mathbb{E}_\pi\big[d_i(\theta, \hat\theta(Z))\big] \ge \sum_{i=1}^m \frac{1}{2}\, \mathbb{E}_{\mu_i}\Big[ d_i(\theta,\theta')\big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) \Big],$$

where the last step is by Corollary 2. Since $d_i(\theta,\theta') = 1$ under $\mu_i$ for every $i$, we have

$$M(\Theta) \ge \frac{1}{2} \sum_{i=1}^m \mathbb{E}_{\mu_i}\big[1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big] \ge \frac{m}{2} \min_{\theta,\theta' :\, d_H(\theta,\theta') = 1} \big(1 - \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\big) = \frac{m}{2}\Big(1 - \max_{\theta,\theta' :\, d_H(\theta,\theta') = 1} \|P_\theta - P_{\theta'}\|_{\mathrm{TV}}\Big),$$

and we are done.
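As an illustration of Assouad's lemma (my own example, not from the notes), take $P_\theta$ to be a product of $m$ Bernoulli$((1 + (2\theta_i - 1)h)/2)$ coins, observed once. For $\theta, \theta'$ at Hamming distance 1, $\|P_\theta - P_{\theta'}\|_{\mathrm{TV}}$ reduces to the total variation distance between the two coin biases, which equals $h$, so (18) gives $M(\Theta) \ge (m/2)(1-h)$. The sketch below verifies the total variation computation by brute force:

    import itertools
    import numpy as np

    m, h = 4, 0.3

    def P(theta):
        # Product distribution on {0,1}^m whose i-th marginal is Bernoulli((1 + (2*theta_i - 1)*h)/2).
        biases = [(1 + (2 * t - 1) * h) / 2 for t in theta]
        return {z: np.prod([b if zi == 1 else 1 - b for b, zi in zip(biases, z)])
                for z in itertools.product([0, 1], repeat=m)}

    def tv(p, q):
        return 0.5 * sum(abs(p[z] - q[z]) for z in p)

    # Maximum TV over pairs (theta, theta') at Hamming distance 1; here it equals h.
    max_tv = 0.0
    for theta in itertools.product([0, 1], repeat=m):
        for i in range(m):
            theta2 = list(theta)
            theta2[i] = 1 - theta2[i]
            max_tv = max(max_tv, tv(P(theta), P(tuple(theta2))))

    print(f"max TV over neighbors: {max_tv:.4f} (equals h = {h})")
    print(f"Assouad lower bound (18): {0.5 * m * (1 - max_tv):.4f}")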

In order to apply Assouad's lemma, we need to have an upper bound on the total variation distance between any two $P_\theta$ and $P_{\theta'}$ with $d_H(\theta,\theta') = 1$. More often than not, it is easier to work with another distance between probability distributions, the so-called Hellinger distance. For any two $P, Q \in \mathcal{P}(\mathcal{Z})$, their squared Hellinger distance is given by

$$H^2(P,Q) \triangleq \sum_{z \in \mathcal{Z}} \big(\sqrt{P(z)} - \sqrt{Q(z)}\big)^2.$$

The total variation distance can be both upper- and lower-bounded by Hellinger:

$$\frac{1}{2} H^2(P,Q) \le \|P - Q\|_{\mathrm{TV}} \le H(P,Q). \qquad (19)$$

Moreover, for any $n$ pairs of distributions $(P_1,Q_1), \ldots, (P_n,Q_n)$,

$$H^2(P_1 \otimes \cdots \otimes P_n,\, Q_1 \otimes \cdots \otimes Q_n) \le \sum_{i=1}^n H^2(P_i, Q_i). \qquad (20)$$

Armed with these facts, we can prove the following version of Assouad's lemma:

Corollary 3. Let $\{P_\theta : \theta \in \{0,1\}^m\}$ be a collection of probability distributions on some set $\mathcal{Z}$ indexed by the vertices of the binary hypercube $\Theta = \{0,1\}^m$. Suppose that there exists some constant $\alpha > 0$, such that

$$H^2(P_\theta, P_{\theta'}) \le \alpha, \quad \text{if } d_H(\theta,\theta') = 1. \qquad (21)$$

Consider the problem of estimating the parameter $\theta \in \Theta$ based on $n$ i.i.d. observations from $P_\theta$, where the loss is measured by the Hamming distance. Then the corresponding minimax risk, which we denote by $M_n(\Theta)$, is lower-bounded as

$$M_n(\Theta) \ge \frac{m}{2}\big(1 - \sqrt{\alpha n}\big). \qquad (22)$$

Proof. For any $\theta \in \Theta$, let $P_\theta^{\otimes n}$ denote the product of $n$ copies of $P_\theta$, i.e., the joint distribution of $n$ i.i.d. samples from $P_\theta$. For any two $\theta, \theta' \in \Theta$ with $d_H(\theta,\theta') = 1$, we have

$$\|P_\theta^{\otimes n} - P_{\theta'}^{\otimes n}\|_{\mathrm{TV}} \le H\big(P_\theta^{\otimes n}, P_{\theta'}^{\otimes n}\big) \le \sqrt{n H^2(P_\theta, P_{\theta'})} \le \sqrt{\alpha n},$$

where the first step is by (19), the second is by (20), and the third is by (21). The bound (22) then follows from Assouad's lemma.

Assouad's lemma is a wonderful tool, but it is only applicable when the parameter space $\Theta$ is a binary hypercube and the metric $d$ is the Hamming metric. To handle other situations, we need different methods. Let's recall what we had seen earlier: the minimax risk $M(\Theta)$ is large if we can find at least two distinct points $\theta, \theta' \in \Theta$, such that (a) the distance $d(\theta,\theta')$ is large and (b) the distributions $P_\theta$ and $P_{\theta'}$ are statistically close to one another, i.e., the total variation distance $\|P_\theta - P_{\theta'}\|_{\mathrm{TV}}$ is small.
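Before moving on, here is a quick numerical check (illustrative only) of the Hellinger bounds (19) and the tensorization inequality (20), using the convention $H^2(P,Q) = \sum_z (\sqrt{P(z)} - \sqrt{Q(z)})^2$ adopted above:

    import numpy as np

    rng = np.random.default_rng(2)

    def hellinger_sq(P, Q):
        # H^2(P, Q) = sum_z (sqrt(P(z)) - sqrt(Q(z)))^2.
        return np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)

    def tv(P, Q):
        return 0.5 * np.sum(np.abs(P - Q))

    P, Q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    H2 = hellinger_sq(P, Q)
    assert 0.5 * H2 <= tv(P, Q) <= np.sqrt(H2) + 1e-12          # the two bounds in (19)

    # Tensorization (20): H^2 of an n-fold product is at most the sum of the marginal H^2's.
    n = 3
    P_prod, Q_prod = P, Q
    for _ in range(n - 1):
        P_prod, Q_prod = np.outer(P_prod, P).ravel(), np.outer(Q_prod, Q).ravel()
    assert hellinger_sq(P_prod, Q_prod) <= n * H2 + 1e-12
    print(f"H^2 = {H2:.4f}, TV = {tv(P, Q):.4f}, H^2 of the 3-fold product = {hellinger_sq(P_prod, Q_prod):.4f}")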

Let's see what happens when we follow this logic further and consider any two $\theta, \theta' \in \Theta$ that are separated from each other by a distance of at least $\varepsilon > 0$:

Lemma 2. Fix $\varepsilon > 0$, and let $\Theta_0 \subseteq \Theta$ be any set which is $\varepsilon$-separated, i.e., $d(\theta,\theta') \ge \varepsilon$ for any two distinct $\theta, \theta' \in \Theta_0$. Then

$$M(\Theta) \ge \frac{\varepsilon}{2} \inf_{\tilde\theta} \sup_{\theta \in \Theta_0} P_\theta\big(\tilde\theta \neq \theta\big), \qquad (23)$$

where the infimum is over all estimators $\tilde\theta$ that take values in $\Theta_0$.

Proof. Fix any $\varepsilon$-separated set $\Theta_0$. Then, since $\Theta_0 \subseteq \Theta$, we have $M(\Theta) \ge M(\Theta_0)$. Given any estimator $\hat\theta = \hat\theta(Z)$ taking values in $\Theta$, define another estimator $\tilde\theta = \tilde\theta(Z)$ taking values in $\Theta_0$ as follows:

$$\tilde\theta \triangleq \operatorname*{argmin}_{\theta' \in \Theta_0} d(\hat\theta, \theta').$$

Then, for any $\theta \in \Theta_0$,

$$d(\tilde\theta, \theta) \le d(\tilde\theta, \hat\theta) + d(\theta, \hat\theta) \le 2\, d(\theta, \hat\theta),$$

where the first step is by the triangle inequality, while the second step is by the definition of $\tilde\theta$. Consequently,

$$M(\Theta) \ge M(\Theta_0) \ge \frac{1}{2} \inf_{\hat\theta} \sup_{\theta \in \Theta_0} \mathbb{E}_\theta\big[d(\tilde\theta, \theta)\big]. \qquad (24)$$

Now let us consider the event $\{z \in \mathcal{Z} : \tilde\theta(z) \neq \theta\}$ for some $\theta \in \Theta_0$. Since $\tilde\theta \in \Theta_0$ by construction, and $\Theta_0$ is $\varepsilon$-separated, we have

$$\{z \in \mathcal{Z} : \tilde\theta(z) \neq \theta\} \subseteq \{z \in \mathcal{Z} : d(\tilde\theta(z), \theta) \ge \varepsilon\}.$$

Using this fact and Markov's inequality, we can write

$$\mathbb{E}_\theta\big[d(\tilde\theta, \theta)\big] \ge \varepsilon\, P_\theta\big(d(\tilde\theta, \theta) \ge \varepsilon\big) \ge \varepsilon\, P_\theta\big(\tilde\theta \neq \theta\big). \qquad (25)$$

Using this in (24), we get (23).

Lemma 2 is at its most effective when we can find a finite set $\Theta_0$ which is $\varepsilon$-separated and has large cardinality. The reason for the latter requirement is that then it should not be too hard to arrange things in such a way that the probability of error $P_\theta(\tilde\theta \neq \theta)$ is uniformly bounded away from zero for any estimator $\tilde\theta$. In particular, if we can find an $\varepsilon$-separated set $\Theta_0 \subseteq \Theta$, such that

$$\inf_{\tilde\theta} \max_{\theta \in \Theta_0} P_\theta\big(\tilde\theta \neq \theta\big) \ge \alpha$$

for some absolute constant $\alpha > 0$, then we can lower-bound the minimax risk $M(\Theta)$ by $\alpha\varepsilon/2$. So now we need tight lower bounds on the probability of error when testing multiple hypotheses that hold for any estimation procedure. Bounds of this sort rely, in one way or another, on a fundamental result from information theory known as Fano's inequality [CT06] (or Fano's lemma, as statisticians call it [Yu97]). We will use a particular version, due to Lucien Birgé [Bir05]. To state it, we first need to define relative entropy (or Kullback-Leibler divergence). Given any two probability distributions $P, Q$ on $\mathcal{Z}$, the relative entropy between $P$ and $Q$ is defined as

$$D(P \| Q) \triangleq \begin{cases} \displaystyle\sum_{z \in \mathcal{Z}} P(z) \log\frac{P(z)}{Q(z)}, & \text{if } \operatorname{supp}(P) \subseteq \operatorname{supp}(Q) \\ +\infty, & \text{otherwise,} \end{cases} \qquad (26)$$

where $\operatorname{supp}(P) \triangleq \{z \in \mathcal{Z} : P(z) > 0\}$ is the support of $P$. Here are some useful properties:

- $D(P \| Q) \ge 0$, with equality if and only if $P \equiv Q$;
- for any $n$ pairs of distributions $(P_i, Q_i)$, $1 \le i \le n$, we have

$$D(P_1 \otimes \cdots \otimes P_n \,\|\, Q_1 \otimes \cdots \otimes Q_n) = \sum_{i=1}^n D(P_i \| Q_i). \qquad (27)$$

Also, one can bound the total variation distance in terms of the KL divergence as follows:

$$\|P - Q\|_{\mathrm{TV}} \le \sqrt{\tfrac{1}{2} D(P \| Q)}. \qquad (28)$$

This inequality is usually referred to as Pinsker's inequality after Mark S. Pinsker, although Pinsker didn't prove it in this form. The inequality (28), with the optimal constant $1/2$ in front of $D(P \| Q)$, was obtained independently by I. Csiszár, J. H. B. Kemperman, and S. Kullback.

With these preliminaries out of the way, we can state Birgé's bound:

Lemma 3 (Birgé). Let $\{P_i\}_{i=0}^N$ be a finite family of probability distributions on $\mathcal{Z}$, and let $\{A_i\}_{i=0}^N$ be a family of disjoint events. Then

$$\min_{0 \le i \le N} P_i(A_i) \le \max\left( \alpha, \frac{K}{\log(N+1)} \right),$$

where $\alpha = 0.71$ and

$$K = \frac{1}{N} \sum_{i=1}^N D(P_i \| P_0).$$

Now let's see how we can apply Birgé's lemma. Let $\Theta_0 = \{\theta_0, \ldots, \theta_N\}$ be some $\varepsilon$-separated subset of $\Theta$. Fix an estimator $\tilde\theta$ taking values in $\Theta_0$. For each $0 \le i \le N$, consider the probability distribution $P_i = P_{\theta_i}$ and the event $A_i = \{z \in \mathcal{Z} : \tilde\theta(z) = \theta_i\}$. Since the elements of $\Theta_0$ are all distinct, the events $A_0, \ldots, A_N$ are disjoint. Now suppose that we have chosen $\Theta_0$ in such a way that

$$K = \frac{1}{N} \sum_{i=1}^N D\big(P_{\theta_i} \,\|\, P_{\theta_0}\big) \le \alpha \log(N+1). \qquad (29)$$

Then, by Birgé's lemma,

$$\max_{\theta \in \Theta_0} P_\theta\big(\tilde\theta \neq \theta\big) = \max_{0 \le i \le N} P_i(A_i^c) = 1 - \min_{0 \le i \le N} P_i(A_i) \ge 1 - \alpha = 0.29.$$

This bound holds for any estimator $\tilde\theta$. Consequently, using Lemma 2, we get the lower bound $M(\Theta) \ge (1-\alpha)\varepsilon/2$.
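Before turning to the proofs, a short numerical check (illustrative, not from the notes) of Pinsker's inequality (28) and of the additivity property (27) for an i.i.d. product:

    import numpy as np

    rng = np.random.default_rng(3)

    def kl(P, Q):
        # D(P || Q) in nats; here supp(P) = supp(Q) = the whole alphabet, so (26) is finite.
        return np.sum(P * np.log(P / Q))

    def tv(P, Q):
        return 0.5 * np.sum(np.abs(P - Q))

    P, Q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
    assert tv(P, Q) <= np.sqrt(0.5 * kl(P, Q))                  # Pinsker's inequality (28)

    # Additivity (27) for an i.i.d. pair: D(P x P || Q x Q) = 2 D(P || Q).
    P2, Q2 = np.outer(P, P).ravel(), np.outer(Q, Q).ravel()
    assert np.isclose(kl(P2, Q2), 2 * kl(P, Q))
    print(f"TV = {tv(P, Q):.4f} <= sqrt(D/2) = {np.sqrt(0.5 * kl(P, Q)):.4f}")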

2 Proof of Theorem 1

Roughly speaking, the strategy of the proof will be as follows. We start by observing that we can lower-bound the minimax risk $R_n(h,\mathcal{F})$ by

$$R_n(h,\mathcal{F}) \ge \inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big]$$

for any subset $\mathcal{Q} \subseteq \mathcal{P}(h,\mathcal{F})$. We will then construct $\mathcal{Q}$ in such a way that its elements can be naturally indexed by the vertices of a binary hypercube of dimension $V-1$ (where $V$ is the VC dimension of $\mathcal{F}$), and the expected excess risk for each $P \in \mathcal{Q}$ can be related to the minimax risk for the problem of estimating the corresponding binary string of length $V-1$. At that point, we will be in a position to apply Assouad's lemma.

2.1 Construction of Q

As stated above, our set $\mathcal{Q}$ of "difficult" distributions will consist of $2^{V-1}$ distributions $P_b$, $b \in \{0,1\}^{V-1}$. To construct these distributions, we will first pick the marginal distribution $P_X \in \mathcal{P}(\mathcal{X})$ of the feature $X$, and then specify the conditional distributions $P^{(b)}_{Y|X}$ of the binary label $Y$ given $X$, for each $b \in \{0,1\}^{V-1}$. For now, let us assume that $h > 0$.

We construct $P_X$ as follows. Since $\mathcal{F}$ is a VC class with VC dimension $V$, there exists a set $\{x_1, \ldots, x_V\} \subset \mathcal{X}$ of points that can be shattered by $\mathcal{F}$, i.e., for any binary string $\beta \in \{0,1\}^V$ there exists at least one $f \in \mathcal{F}$, such that $f(x_j) = \beta_j$ for all $1 \le j \le V$. Given a parameter $p \in [0, 1/(V-1)]$, which will be chosen later, we choose $P_X$ so that $P_X(\{x_1,\ldots,x_V\}) = 1$, and

$$P_X(x_j) = \begin{cases} p, & \text{if } 1 \le j \le V-1 \\ 1 - (V-1)p, & \text{if } j = V. \end{cases} \qquad (30)$$

In other words, $P_X$ is a discrete distribution supported on the shattered set $\{x_1, \ldots, x_V\}$ that puts the same mass $p$ on each of the first $V-1$ points and dumps the rest of the probability mass, equal to $1-(V-1)p$, on the last point $x_V$. Owing to our condition on $p$, this is a valid probability distribution.

Next we choose $P^{(b)}_{Y|X}$ for each $b \in \{0,1\}^{V-1}$ as follows: for a fixed $b$, the conditional distribution of $Y$ given $X = x$ is

$$\begin{cases} \mathrm{Bernoulli}\Big(\dfrac{1 + (2b_j - 1)h}{2}\Big), & \text{if } x = x_j \text{ and } j \in \{1,\ldots,V-1\} \\ \mathrm{Bernoulli}(0), & \text{otherwise.} \end{cases}$$

In other words, for a fixed $b$, we have

$$\eta_b(x) \triangleq P^{(b)}(Y = 1 | X = x) = \begin{cases} \dfrac{1-h}{2}, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 0 \\ \dfrac{1+h}{2}, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (31)$$

Therefore, the corresponding Bayes classifier, which we denote by $f^*_b$, is given by

$$f^*_b(x) = \begin{cases} 0, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 0 \\ 1, & \text{if } x = x_j \text{ for some } j \in \{1,\ldots,V-1\} \text{ and } b_j = 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (32)$$

That is, the output of the Bayes classifier on each $x_j$, $1 \le j \le V-1$, is simply equal to the bit value $b_j$, and it is zero everywhere else.
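To make the construction of $\mathcal{Q}$ concrete, here is a minimal sketch (an illustration with made-up function and variable names, not code from the notes) that draws $n$ training pairs from $P_b$ for a given bit string $b$, mass $p$, and margin $h$, following (30) and (31):

    import numpy as np

    def sample_from_P_b(b, p, h, n, rng):
        # Draw n i.i.d. pairs (X, Y) from P_b as defined by (30) and (31).
        # X takes values in {0, ..., V-1}, standing in for the shattered points x_1, ..., x_V.
        V = len(b) + 1
        assert 0 < p <= 1 / (V - 1) and 0 < h <= 1
        px = np.array([p] * (V - 1) + [1 - (V - 1) * p])                    # marginal (30)
        eta = np.array([(1 + (2 * bj - 1) * h) / 2 for bj in b] + [0.0])    # regression function (31)
        X = rng.choice(V, size=n, p=px)
        Y = (rng.uniform(size=n) < eta[X]).astype(int)
        return X, Y

    rng = np.random.default_rng(4)
    b = (1, 0, 1, 1)                     # a vertex of {0,1}^{V-1}, here with V = 5
    X, Y = sample_from_P_b(b, p=0.1, h=0.4, n=8, rng=rng)
    print(X, Y)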

It remains to check that the resulting set $\mathcal{Q} = \{P_b : b \in \{0,1\}^{V-1}\}$ is contained in $\mathcal{P}(h,\mathcal{F})$. First of all, from (31) we see that $|2\eta_b(x) - 1| \ge h$ for all $x$ (indeed, $|2\eta_b(x) - 1| = h$ when $x \in \{x_1,\ldots,x_{V-1}\}$, and $|2\eta_b(x) - 1| = 1$ otherwise). Secondly, because $\{x_1,\ldots,x_V\}$ is shattered by $\mathcal{F}$, there exists at least one $f \in \mathcal{F}$ such that $f^*_b(x) = f(x)$ for all $x \in \{x_1,\ldots,x_V\}$. Thus, $\mathcal{Q} \subset \mathcal{P}(h,\mathcal{F})$, as claimed.

2.2 Reduction to an estimation problem on the binary hypercube

Now that we have constructed $\mathcal{Q}$, we will show that the problem of learning a classifier when the $n$ training samples are drawn i.i.d. from some $P_b \in \mathcal{Q}$ is at least as difficult as reconstructing the underlying bit string $b$; indeed, intuitively this makes sense, given the structure of the Bayes classifier $f^*_b$ corresponding to $P_b$.

We start by noting that, for any classifier $f : \mathcal{X} \to \{0,1\}$ and any distribution $P$ on $\mathcal{X} \times \{0,1\}$, we have[3]

$$L(f) - L(f^*) = \mathbb{E}\big[|2\eta(X) - 1|\, 1_{\{f(X) \neq f^*(X)\}}\big].$$

From this, we see that if $P \in \mathcal{P}(h,\mathcal{F})$, then

$$L(f) - L(f^*) \ge h\, \mathbb{E}\big[1_{\{f(X) \neq f^*(X)\}}\big] = h\, \|f - f^*\|_{L_1},$$

where the $L_1$ norm is computed w.r.t. the marginal distribution $P_X$. Hence, we have

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] = \inf_{\hat{f}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\big[L(\hat{f}_n) - L(f^*_b)\big] \ge h \inf_{\hat{f}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\|\hat{f}_n - f^*_b\|_{L_1}, \qquad (33)$$

where the $L_1$ norm is w.r.t. the distribution $P_X$ defined in (30), and $\mathbb{E}_b[\cdot]$ denotes expectation w.r.t. $P_b$.

Now we are ready to carry out the reduction to an estimation problem on the binary hypercube $\{0,1\}^{V-1}$. Given any candidate estimator $\hat{f}_n$, let $\hat{b}_n$ be an estimator that takes values in $\{0,1\}^{V-1}$, and is defined as follows:

$$\hat{b}_n \triangleq \operatorname*{argmin}_{b \in \{0,1\}^{V-1}} \|\hat{f}_n - f^*_b\|_{L_1}. \qquad (34)$$

In other words, $\hat{b}_n$ is the binary string that indexes the element of $\{f^*_b : b \in \{0,1\}^{V-1}\}$ which is the closest to $\hat{f}_n$ in the $L_1$ norm. Then, for any $b$,

$$\|f^*_{\hat{b}_n} - f^*_b\|_{L_1} \le \|f^*_{\hat{b}_n} - \hat{f}_n\|_{L_1} + \|\hat{f}_n - f^*_b\|_{L_1} \le 2\|\hat{f}_n - f^*_b\|_{L_1}, \qquad (35)$$

where the first step is by the triangle inequality, while the second step is by (34). Combining (35) with (33), we get

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{h}{2} \inf_{\hat{b}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\|f^*_{\hat{b}_n} - f^*_b\|_{L_1}, \qquad (36)$$

where the infimum on the right-hand side is over all estimators $\hat{b}_n$ that take values in $\{0,1\}^{V-1}$ based on $n$ i.i.d. samples from one of the $P_b$'s.

[3] Exercise: prove this!
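The identity in footnote [3] can be checked numerically on a finite feature space; the sketch below (with arbitrary illustrative choices of $P_X$, $\eta$, and $f$) compares $L(f) - L(f^*)$ with $\mathbb{E}[|2\eta(X) - 1|\, 1_{\{f(X) \neq f^*(X)\}}]$:

    import numpy as np

    rng = np.random.default_rng(5)

    # A finite feature space {0, ..., k-1} with marginal px and regression function eta.
    k = 6
    px = rng.dirichlet(np.ones(k))
    eta = rng.uniform(size=k)

    def risk(f):
        # L(f) = P(f(X) != Y) = sum_x px(x) * [ (1 - eta(x)) 1{f(x) = 1} + eta(x) 1{f(x) = 0} ].
        return np.sum(px * np.where(f == 1, 1 - eta, eta))

    f_star = (eta >= 0.5).astype(int)        # Bayes classifier
    f = rng.integers(0, 2, size=k)           # an arbitrary competing classifier

    lhs = risk(f) - risk(f_star)
    rhs = np.sum(px * np.abs(2 * eta - 1) * (f != f_star))
    assert np.isclose(lhs, rhs)
    print(f"L(f) - L(f*) = {lhs:.4f} = E[|2 eta - 1| 1{{f != f*}}] = {rhs:.4f}")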

Now let us inspect the $L_1$ norm $\|f^*_b - f^*_{b'}\|_{L_1}$ for any two $b, b'$. Using (32), we have

$$\|f^*_b - f^*_{b'}\|_{L_1} = \int_{\mathcal{X}} |f^*_b(x) - f^*_{b'}(x)|\, P_X(dx) = \sum_{j=1}^V P_X(x_j)\, |f^*_b(x_j) - f^*_{b'}(x_j)| = p \sum_{j=1}^{V-1} |f^*_b(x_j) - f^*_{b'}(x_j)| = p \sum_{j=1}^{V-1} 1_{\{b_j \neq b'_j\}} = p\, d_H(b,b').$$

Substituting this into (36) gives

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{ph}{2} \inf_{\hat{b}_n} \max_{b \in \{0,1\}^{V-1}} \mathbb{E}_b\big[d_H(\hat{b}_n, b)\big], \qquad (37)$$

as claimed. Now we are in a position to apply Assouad's lemma.

2.3 Applying Assouad's lemma

In order to apply Assouad's lemma, we need an upper bound on the squared Hellinger distance $H^2(P_b, P_{b'})$ for all $b, b'$ with $d_H(b,b') = 1$. This is a simple computation. In fact, for any two $b, b' \in \{0,1\}^{V-1}$ we have

$$H^2(P_b, P_{b'}) = \sum_{j=1}^V \sum_{y \in \{0,1\}} \Big(\sqrt{P_b(x_j,y)} - \sqrt{P_{b'}(x_j,y)}\Big)^2 = \sum_{j=1}^{V-1} P_X(x_j) \sum_{y \in \{0,1\}} \Big(\sqrt{P^{(b)}_{Y|X}(y|x_j)} - \sqrt{P^{(b')}_{Y|X}(y|x_j)}\Big)^2 = p \sum_{j=1}^{V-1} H^2\Big( \mathrm{Bernoulli}\Big(\tfrac{1+(2b_j-1)h}{2}\Big),\, \mathrm{Bernoulli}\Big(\tfrac{1+(2b'_j-1)h}{2}\Big) \Big).$$

For each $j \in \{1,\ldots,V-1\}$, the $j$-th term in the above summation is nonzero if and only if $b_j \neq b'_j$, in which case it is equal to the squared Hellinger distance between the $\mathrm{Bernoulli}\big(\tfrac{1-h}{2}\big)$ and $\mathrm{Bernoulli}\big(\tfrac{1+h}{2}\big)$ distributions. Thus,

$$H^2(P_b, P_{b'}) = p\, H^2\Big( \mathrm{Bernoulli}\Big(\tfrac{1-h}{2}\Big),\, \mathrm{Bernoulli}\Big(\tfrac{1+h}{2}\Big) \Big)\, d_H(b,b') = 2p\Big(1 - \sqrt{(1-h)(1+h)}\Big)\, d_H(b,b') = 2p\Big(1 - \sqrt{1-h^2}\Big)\, d_H(b,b').$$
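A one-line numerical check (illustrative) of the Bernoulli Hellinger computation above, together with the bound $2(1 - \sqrt{1-h^2}) \le 2h^2$ that is used in the next step:

    import numpy as np

    for h in [0.1, 0.3, 0.7, 0.95]:
        a, b = (1 - h) / 2, (1 + h) / 2
        H2 = (np.sqrt(a) - np.sqrt(b)) ** 2 + (np.sqrt(1 - a) - np.sqrt(1 - b)) ** 2
        closed_form = 2 * (1 - np.sqrt(1 - h ** 2))
        assert np.isclose(H2, closed_form) and closed_form <= 2 * h ** 2 + 1e-12
        print(f"h = {h}: H^2 = {H2:.4f} <= 2 h^2 = {2 * h ** 2:.4f}")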

In particular, the collection of distributions $\{P_b : b \in \{0,1\}^{V-1}\}$ satisfies the condition (21) of Corollary 3 with

$$\alpha = 2p\big(1 - \sqrt{1-h^2}\big) \le 2ph^2.$$

Therefore, applying the bound (22), we get

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{p(V-1)h}{4}\Big(1 - \sqrt{2nph^2}\Big).$$

If we let $p = 2/(9nh^2)$, the term in parentheses will be equal to $1/3$, and

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{Q}} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{V-1}{54nh},$$

assuming that the condition $p \le 1/(V-1)$ holds. This will be the case if $h \ge \sqrt{2(V-1)/(9n)}$, and in particular whenever $h \ge \sqrt{(V-1)/n}$. Therefore,

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{V-1}{54nh}, \quad \text{if } h \ge \sqrt{(V-1)/n}. \qquad (38)$$

If $h \le \sqrt{(V-1)/n}$, we can use the above construction with $\tilde{h} = \sqrt{(V-1)/n}$. In that case, because $\mathcal{P}(\tilde{h},\mathcal{F}) \subseteq \mathcal{P}(h,\mathcal{F})$ whenever $h \le \tilde{h}$, we see that

$$\inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(h,\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \inf_{\hat{f}_n} \sup_{P \in \mathcal{P}(\tilde{h},\mathcal{F})} \mathbb{E}\big[L(\hat{f}_n) - L(f^*)\big] \ge \frac{V-1}{54n\tilde{h}} = \frac{1}{54}\sqrt{\frac{V-1}{n}}, \quad \text{if } h \le \sqrt{(V-1)/n}. \qquad (39)$$

Putting together (38) and (39), we get the bound of Theorem 1.

3 Proof of Theorem 2

In its broad outline, the proof is very similar to the proof of Theorem 1. Fix some $N \ge 4D$. Since $\mathcal{F}$ is $(N,D)$-rich, there exist $N$ distinct points $x_1, \ldots, x_N \in \mathcal{X}$, such that

$$\{0,1\}^N_D \triangleq \big\{ b \in \{0,1\}^N : \mathrm{wt}(b) = D \big\} \subseteq \mathcal{F}(x^N),$$

where $\mathrm{wt}(b)$ denotes the Hamming weight of the binary string $b$. Let $P_X$ be the uniform distribution on $\{x_1,\ldots,x_N\}$. Also, for each $b \in \{0,1\}^N_D$, let

$$\eta_b(x_i) \triangleq \frac{1 + (2b_i - 1)h}{2}, \qquad 1 \le i \le N.$$

Now for each such $b$ define a distribution $P_b \in \mathcal{P}(\mathcal{X} \times \{0,1\})$ by $P_b = P_X \otimes P^{(b)}_{Y|X}$, where

$$P^{(b)}_{Y|X}(1|x_i) = \frac{1 + (2b_i - 1)h}{2} \equiv \eta_b(x_i).$$

In other words, the conditional distribution $P^{(b)}_{Y|X=x_i}$ is a Bernoulli measure with bias $\frac{1+h}{2}$ if $b_i = 1$ and $\frac{1-h}{2}$ if $b_i = 0$. The corresponding Bayes classifier is given by $f^*_b(x_i) = b_i$ for each $i$, as before. Moreover,

since $\mathcal{F}$ is $(N,D)$-rich, for any $b \in \{0,1\}^N_D$ there exists at least one $f \in \mathcal{F}$ such that $f(x_i) = b_i = f^*_b(x_i)$ for every $i$. Consequently, $\mathcal{Q} = \{P_b : b \in \{0,1\}^N_D\} \subset \mathcal{P}(h,\mathcal{F})$.

Proceeding in the same way as in the proof of Theorem 1, for any subset $\mathcal{C} \subseteq \{0,1\}^N_D$,

$$R_n(h,\mathcal{F}) \ge \frac{h}{2} \inf_{\hat{b}_n \in \mathcal{C}} \max_{b \in \mathcal{C}} \mathbb{E}_b\|f^*_{\hat{b}_n} - f^*_b\|_{L_1} = \frac{h}{2N} \inf_{\hat{b}_n \in \mathcal{C}} \max_{b \in \mathcal{C}} \mathbb{E}_b\big[d_H(\hat{b}_n, b)\big].$$

Since $\{0,1\}^N_D$ is not the entire binary hypercube $\{0,1\}^N$, we cannot use Assouad's lemma. Instead, we will use Lemma 2 in conjunction with Birgé's bound. In order to do that, we must first construct a nice, well-separated subset of $\{0,1\}^N_D$. The following combinatorial result, due to P. Reynaud-Bouret, does the trick. There exists a subset $\mathcal{C} \subset \{0,1\}^N_D$ with the following properties:

1. $d_H(b,b') > D/2$ for any two distinct $b, b' \in \mathcal{C}$;
2. $\log|\mathcal{C}| \ge \kappa D \log\frac{N}{D}$, where $\kappa \ge 0.233$.

Item 1 says that this $\mathcal{C}$ is $D/2$-separated in the Hamming distance. Thus, by Lemma 2,

$$R_n(h,\mathcal{F}) \ge \frac{hD}{4N} \inf_{\hat{b}_n \in \mathcal{C}} \max_{b \in \mathcal{C}} P_b\big(\hat{b}_n \neq b\big) = \frac{hD}{4N}\Big(1 - \sup_{\hat{b}_n \in \mathcal{C}} \min_{b \in \mathcal{C}} P_b\big(\hat{b}_n = b\big)\Big).$$

We are now in a position to apply Birgé's lemma:

$$\sup_{\hat{b}_n \in \mathcal{C}} \min_{b \in \mathcal{C}} P_b\big(\hat{b}_n = b\big) \le \alpha,$$

where $\alpha = 0.71$, provided

$$K = \frac{1}{|\mathcal{C}| - 1} \sum_{b \in \mathcal{C},\, b \neq b_0} D\big(P_b^{\otimes n} \,\|\, P_{b_0}^{\otimes n}\big) \le \alpha \log|\mathcal{C}|, \qquad (40)$$

where $b_0$ is some fixed but arbitrary element of $\mathcal{C}$, and $P_b^{\otimes n}$ is the product of $n$ copies of $P_b$ (recall that our estimators are based on an i.i.d. sample of size $n$).

So now we turn to the analysis of $K$. For any two $b, b' \in \{0,1\}^N$, we have

$$D\big(P_b^{\otimes n} \,\|\, P_{b'}^{\otimes n}\big) = n D(P_b \| P_{b'}) = n \sum_{i=1}^N \sum_{y \in \{0,1\}} P_b(x_i, y) \log\frac{P_b(x_i,y)}{P_{b'}(x_i,y)} = \frac{n}{N} \sum_{i=1}^N \sum_{y \in \{0,1\}} P^{(b)}_{Y|X}(y|x_i) \log\frac{P^{(b)}_{Y|X}(y|x_i)}{P^{(b')}_{Y|X}(y|x_i)} = \frac{n}{N} \sum_{i=1}^N D\Big( \mathrm{Bernoulli}\Big(\tfrac{1+(2b_i-1)h}{2}\Big) \,\Big\|\, \mathrm{Bernoulli}\Big(\tfrac{1+(2b'_i-1)h}{2}\Big) \Big),$$

where the first step is by (27), while the rest follows from the definitions. Now, in the above summation, the $i$-th term is nonzero if and only if $b_i \neq b'_i$, and in that case it is equal to

$$D\Big( \mathrm{Bernoulli}\Big(\tfrac{1+h}{2}\Big) \,\Big\|\, \mathrm{Bernoulli}\Big(\tfrac{1-h}{2}\Big) \Big) = h \log\frac{1+h}{1-h}.$$

Consequently,

$$D\big(P_b^{\otimes n} \,\|\, P_{b'}^{\otimes n}\big) = \frac{n}{N}\, h \log\frac{1+h}{1-h} \sum_{i=1}^N 1_{\{b_i \neq b'_i\}} = \frac{n}{N}\, h \log\frac{1+h}{1-h}\, d_H(b,b').$$

Now, for any two $b, b' \in \{0,1\}^N_D$, $d_H(b,b') \le \mathrm{wt}(b) + \mathrm{wt}(b') = 2D$. Moreover, using the bound $\log t \le t - 1$ for any $t > 0$, we have

$$h \log\frac{1+h}{1-h} \le h\left(\frac{1+h}{1-h} - 1\right) = \frac{2h^2}{1-h}.$$

Therefore, for any two $b, b' \in \mathcal{C}$,

$$D\big(P_b^{\otimes n} \,\|\, P_{b'}^{\otimes n}\big) \le \frac{4nh^2 D}{N(1-h)},$$

which implies that

$$K \le \frac{4nh^2 D}{N(1-h)}.$$

Using this and the fact that $\log|\mathcal{C}| \ge \kappa D \log\frac{N}{D}$ with $\kappa \ge 0.233$, we see that the condition (40) will be satisfied if we can choose some $N \ge 4D$ so that

$$N \log\frac{N}{D} \ge \frac{4nh^2}{\alpha\kappa(1-h)}. \qquad (41)$$

A tedious calculation (see [MN06]) shows that the choice

$$N = \frac{8nh^2}{\alpha\kappa(1-h)\big(1 + \log\frac{nh^2}{D}\big)}$$

does the job, provided $h \ge \sqrt{D/n}$. This completes the proof.

References

[Bir05] L. Birgé. A new lower bound for multiple hypothesis testing. IEEE Transactions on Information Theory, 51(4), April 2005.

[CT06] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 2nd edition, 2006.

[MN06] P. Massart and É. Nédélec. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326-2366, 2006.

[Vap98] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[Yu97] B. Yu. Assouad, Fano, and Le Cam. In D. Pollard, E. Torgersen, and G. Yang, editors, Festschrift for Lucien Le Cam. Springer, 1997.
