Improved Guarantees for Agnostic Learning of Disjunctions

Pranjal Awasthi, Avrim Blum, Or Sheffet (Carnegie Mellon University)

Abstract

Given some arbitrary distribution D over {0,1}^n and arbitrary target function c*, the problem of agnostic learning of disjunctions is to achieve an error rate comparable to the error OPT_disj of the best disjunction with respect to (D, c*). Achieving error O(n · OPT_disj) + ε is trivial, and Winnow [13] achieves error O(r · OPT_disj) + ε, where r is the number of relevant variables in the best disjunction. In recent work, Peleg [14] shows how to achieve a bound of Õ(√n · OPT_disj) + ε in polynomial time. In this paper we improve on Peleg's bound, giving a polynomial-time algorithm achieving a bound of O(n^{1/3+α} · OPT_disj) + ε for any constant α > 0. The heart of the algorithm is a method for weak-learning when OPT_disj = O(1/n^{1/3+α}), which can then be fed into existing agnostic boosting procedures to achieve the desired guarantee.

1 Introduction

Learning disjunctions (or conjunctions) over {0,1}^n in the PAC model is a well-studied and easy problem. The simple list-and-cross-off algorithm runs in linear time per example and requires only O(n/ε) examples to achieve error ε (ignoring the logarithmic dependence on the confidence term δ). The similarly efficient Winnow algorithm [13] requires only O((r log n)/ε) examples to learn well when the target function is a disjunction of size r. However, when the data is only mostly consistent with a disjunction, the problem becomes substantially harder. In this agnostic setting, our goal is to produce a hypothesis h whose error rate err_D(h) = Pr_D(h(x) ≠ c*(x)) satisfies err_D(h) ≤ c · OPT_disj + ε, where OPT_disj is the error rate of the best disjunction and c is as small as possible. For example, while Winnow performs well as a function of the number of attribute errors of the best disjunction¹ [12, 1], this can be a factor O(r) worse than the number of mistakes of the best disjunction.

Recently, Feldman [3] has shown that for any constant ε > 0, determining whether the best disjunction for a given dataset S has error at most ε or error at least 1/2 − ε is NP-hard. Even more recently, Feldman et al. [5] extend this hardness result to the problem of agnostically learning disjunctions by the hypothesis class of halfspaces. Thus, these results show that the problem of finding a disjunction (or linear separator) of error at most 1/2 − ε, given that the error OPT_disj of the best disjunction is at most ε, is computationally hard for any constant ε > 0.

Given these hardness results, it is natural to consider what kinds of learning guarantees can be achieved. If the error OPT_disj of the best disjunction is O(1/n) then learning is essentially equivalent to the noise-free case. Peleg [14] shows how to improve this to a bound of Õ(1/√n). In particular, on any given dataset S, his algorithm produces a disjunction of error rate on S at most Õ(√n · OPT_disj(S)).²

¹ The minimum number of variables that would need to be flipped in order to make the data perfectly consistent with a disjunction. This is essentially the same as its hinge loss.
² His results are for the Red-Blue Set-Cover Problem [2], which is equivalent to the problem of approximating the best disjunction, except that positive examples must be classified correctly (i.e., the goal is to approximate the minimum number of mistakes on negatives subject to correctly classifying the positives). The extension to allowing for two-sided error, however, is immediate.
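As background (not part of this paper's contribution), the list-and-cross-off algorithm mentioned above can be rendered in a few lines. The Python sketch below assumes a monotone disjunction over {0,1}^n and uses illustrative names.

```python
def learn_monotone_disjunction(examples):
    """List-and-cross-off: PAC-learn a monotone disjunction from consistent data.
    examples: list of (x, y) with x a 0/1 sequence and y in {+1, -1}.
    Returns the set of variable indices kept in the hypothesis."""
    n = len(examples[0][0])
    candidates = set(range(n))               # start with every variable
    for x, y in examples:
        if y == -1:
            # A negative example cannot set any relevant variable to 1,
            # so every variable it sets to 1 can be crossed off.
            candidates -= {i for i in range(n) if x[i] == 1}
    return candidates

def predict(candidates, x):
    # The hypothesis is the OR of the surviving variables.
    return +1 if any(x[i] == 1 for i in candidates) else -1
```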

In this work, we improve on the result of Peleg [14], achieving a bound of O(n^{1/3+α} · OPT_disj) + ε for any constant α > 0, though our algorithm is not a proper learner (it does not produce a disjunction as its output).³ In particular, our main result is an algorithm for weak-learning under an arbitrary distribution D, under the assumption that the optimal disjunction has error rate O(1/n^{1/3+α}), which we can then feed into the boosting procedures of [7, 9] or the recent ABoostDI booster of [4] to achieve the claimed guarantee. Note that our guarantee holds for any distribution over {0,1}^n. In contrast, most recent work on agnostic learning has been for the case of uniform or other nice distributions [8, 10, 11].

1.1 Our Results

We present a learning algorithm whose error rate is an O(n^{1/3+α}) approximation to that of the best disjunction, for any α > 0. Formally, we prove the following theorem.

Theorem 1. There exists an algorithm that for an arbitrary distribution D over {0,1}^n and arbitrary target function c*: {0,1}^n → {−1, +1}, for every constant α > 0 and every ε, δ > 0, runs in time polynomial in 1/ε, log(1/δ), and n, uses poly(1/ε, log(1/δ), n) random examples from D, and outputs a hypothesis h such that, with probability > 1 − δ,

err_D(h) ≤ O(n^{1/3+α} · OPT_disj) + ε,

where OPT_disj = min_{f ∈ DISJUNCTIONS} err_D(f).

The proof of Theorem 1 is based on finding a weak learner under the assumption that OPT = OPT_disj is O(n^{−(1/3+α)}). In particular, we show:

Theorem 2. There exists an algorithm with the following property. For every distribution D over {0,1}^n and every target function c* such that OPT < n^{−1/3−α} for some constant α > 0, for every δ > 0, the algorithm runs in time t(δ, n), uses m(δ, n) random samples drawn from D, and outputs a hypothesis h such that, with probability > 1 − δ, err_D(h) ≤ 1/2 − γ, where t and m are polynomials in n and 1/δ, and γ = Ω(n^{−2}).

The high-level idea of the algorithm and proof for Theorem 2 is as follows. First, we can assume the target function is balanced (nearly equal probability mass on positive and negative examples) and that similarly no individual variable is noticeably correlated with the target, else weak-learning is immediate. So, for each variable i, the probability mass of positive examples with x_i = 1 is approximately equal to the fraction of negative examples with x_i = 1. Let c_opt denote the (unknown) optimal disjunction, which we may assume is monotone by including negated variables as additional features. Let r denote the number of relevant variables, i.e., the number of variables in c_opt. Also, assume for this discussion that we know the value of OPT = err_D(c_opt). Call an example x good if c*(x) = c_opt(x) and bad otherwise.

Now, since the only negative examples that can have a relevant x_i set to 1 are the bad negatives, this means that for relevant variables i, Pr_{x∼D}(x_i = 1 ∧ c*(x) = −1) = O(OPT). Therefore, Pr_{x∼D}(x_i = 1 ∧ c*(x) = +1) = O(OPT), and so Pr_{x∼D}(x_i = 1) = O(OPT) as well. This means that by estimating Pr_{x∼D}(x_i = 1) for each variable i, we can remove all variables of density ω(OPT) from the system, knowing they are irrelevant.

At this point, we have nearly all the ingredients for the Õ(√n) bound of Peleg [14]. In particular, since all variables have density O(OPT), the average number of variables set to 1 per example is O(OPT · n). Let S′ be the set of examples whose density is at most twice the average (so Pr(S′) ≥ 1/2); we now claim that if OPT = o(1/√n), then either S′ is unbalanced or else some variable x_i must have noticeable correlation with the target over examples in S′.
In particular, since positive examples must have on average at least 1 − O(OPT) relevant variables set to 1, and the good negative examples have zero relevant variables set to 1, the only way for S′ to be balanced and have no relevant variable with noticeable correlation is for the bad negative examples to have, on average, Ω(1/OPT) relevant variables set to 1. But this is not possible, since all examples in S′ have only O(OPT · n) variables set to 1, and 1/OPT ≫ OPT · n for OPT = o(1/√n). So, some hypothesis of the form "if x ∉ S′ then flip a fair coin, else predict x_i" must be a weak learner.

³ This bound hides a low-order term of (log n)^{1/α}. Solving for equality yields α = √(log log n / log n) and a bound of O(n^{1/3+o(1)}).
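The overview above translates into a short procedure. The following Python sketch is a hedged rendering of it: the constants 10 and 2, and the advantage cutoff, are illustrative placeholders rather than the quantities from the analysis, and `weak_learn_sketch` is not the paper's Algorithm 2.

```python
import random

def weak_learn_sketch(examples, opt_guess, adv=0.01):
    """Hedged sketch of the overview's weak learner.
    examples: list of (x, y) with x a 0/1 tuple and y in {+1, -1}.
    opt_guess: guessed error rate of the best disjunction.
    Returns a predictor h(x) -> {+1, -1}, or None if no weak advantage is found."""
    m = len(examples)
    n = len(examples[0][0])

    # Remove variables whose density is much larger than OPT: they cannot be relevant.
    keep = [i for i in range(n)
            if sum(x[i] for x, _ in examples) <= 10 * opt_guess * m]

    # Keep only examples with at most twice the average number of 1s
    # over the surviving variables.
    weight = lambda x: sum(x[i] for i in keep)
    avg = sum(weight(x) for x, _ in examples) / m
    in_S = lambda x: weight(x) <= 2 * avg
    S = [(x, y) for x, y in examples if in_S(x)]

    # Candidate base predictors on S: the two constants and single variables.
    bases = [lambda x: +1, lambda x: -1]
    bases += [(lambda x, i=i: +1 if x[i] == 1 else -1) for i in keep]

    for base in bases:
        # Advantage over random guessing when we flip a fair coin outside S.
        gain = (sum(1 for x, y in S if base(x) == y) - len(S) / 2) / m
        if abs(gain) >= adv:
            sign = +1 if gain > 0 else -1
            def h(x, base=base, sign=sign):
                return sign * base(x) if in_S(x) else random.choice((+1, -1))
            return h
    return None
```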

To improve over the Õ(√n) bound of [14], we do the following. Assume all variables have nearly the same density and all examples have nearly the same density as well. This is not without loss of generality (and the general case adds additional complications that must be addressed), but it simplifies the picture for this high-level sketch. Now, if no individual variable or its complement is a weak predictor, by the above analysis it must be the case that the bad negative examples on average have a substantial number of variables set to 1 in the relevant region (essentially so that the total hinge-loss, i.e., the number of attribute errors, is Ω(m)). Suppose now that one guesses such a bad negative example e and focuses on only those n′ variables set to 1 by e. The disjunction c_opt restricted to this set may now make many mistakes on positive examples (the substantial number of variables set to 1 in the relevant region in e may still be a small fraction of the relevant region). On the other hand, because we have restricted to a relatively small number of variables n′, the average density of examples as a function of n′ has dropped significantly.⁴ As a result, suppose we again discard all examples with a number of 1s in these n′ variables substantially larger than the average. Then, on the remainder, the hinge-loss (attribute errors) caused by the bad negative examples is substantially reduced. This more than makes up for the additional error on positive examples. In particular, we show one can argue that for some bad negative example e, if one performs the above procedure, then with respect to the remaining subset of examples, some variable must be a weak predictor. In the end, the final hypothesis is defined by an example e, a threshold θ, and a variable i, and is of the form "if x · e ≥ θ then flip a coin, else predict x_i". The algorithm then simply searches over all such triples. In the general case (when the variables and the examples do not all have the same density), this is preceded by a preprocessing step that groups variables and examples into a number of buckets and then runs the above algorithm on each bucket.

The rest of the paper is organized as follows. We start off with notation and definitions in Section 2. In Section 3 we prove Theorem 1. We achieve this in two steps: first we show how to get a weak learner for the special case that the examples and variables are fairly homogeneous (all variables set to 1 roughly the same number of times, and all examples with roughly the same number of variables set to 1; actually a somewhat weaker condition than this). We then show how to reduce a general instance to this special case. In Section 3.3 we use existing boosting algorithms combined with this weak learner to prove our main result. Finally, we discuss conclusions and future directions in Section 4.

2 Notation and Preliminaries

Let X = {0,1}^n and let D be the data distribution over X. We have a labeling function c*: X → {−1, +1}, and use c_opt to denote the disjunction of least error with respect to c*. Without loss of generality we may assume c_opt is monotone, and we denote the error rate of c_opt as OPT_disj or simply OPT. That is, OPT = Pr_{x∼D}[c_opt(x) ≠ c*(x)]. For the rest of the paper, we will assume that OPT = Ω(1/√n) (otherwise we can use Peleg's algorithm described in the previous section). We will also assume that we know the value of OPT.⁵ We use r to denote the number of variables in c_opt and we call these the relevant variables. We will call the examples on which c* and c_opt agree good, and those on which they disagree bad.
The examples causing the most difficulty will be the bad negative examples, which can potentially satisfy many relevant variables, thus incurring up to r attribute errors (hinge-loss) and yet be labeled negative.

We assume that the algorithm gets as input 2m examples, out of which m⁺ are positive examples (their label is +1) and m⁻ are negative (their label is −1). We can assume for the goal of weak-learning that m⁺, m⁻ = m(1 ± o(1)), else we have an immediate weak predictor. Let m⁺_bad denote the number of bad positive examples, i.e., positives that do not satisfy c_opt, and let m⁻_bad denote the number of bad negative examples, i.e., negatives that do satisfy c_opt. For convenience of notation (losing at most a factor of 2 in our guarantee) we assume that the error rate of c_opt on both positive and negative examples separately is at most OPT. Given this, we may assume that m⁺_bad ≤ m·OPT(1 + o(1)) and m⁻_bad ≤ m·OPT(1 + o(1)).

Our algorithm will examine a set of Õ(m·n²) hypotheses, of which we will prove that at least one has training error at most 1/2 − Ω(1/n²), under the assumption that OPT is O(1/n^{1/3+α}). In the following we assume that m is sufficiently large, m = Õ(n⁴), so that with high probability this implies error at most 1/2 − Ω(1/n²) over D. In particular, each hypothesis is defined by a training example, a threshold, and a variable, and so by compression bounds [6], Õ(n⁴) training examples are sufficient to produce a weak learner with high probability.

⁴ E.g., given two random vectors with n′ = n^{2/3} 1s, their intersection would have expected size (n′)^{1/2}. Of course, our dataset need not be uniform random examples of the given density, but the fact that all variables have the same density allows one to make a similar argument.
⁵ If OPT is unknown, we can efficiently enumerate over possible guesses for OPT such that one such guess will be within a 1/poly additive factor of the true value. For each guess, we can run our algorithm and test it on a fresh sample to see if its output is a weak learner.
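Footnote 5's guess-and-validate strategy for unknown OPT can be realized by a wrapper of the following form (a sketch only; the grid spacing, the validation threshold, and the `weak_learner` callback are assumptions for illustration):

```python
def guess_opt_and_learn(train, holdout, weak_learner, grid_size=1000, adv=0.005):
    """Try guesses of OPT on a 1/grid_size-spaced grid, run the weak learner
    with each guess, and keep a hypothesis only if it beats random guessing
    on a fresh holdout sample. All numeric choices are illustrative."""
    best = None
    for k in range(1, grid_size + 1):
        guess = k / grid_size
        h = weak_learner(train, guess)
        if h is None:
            continue
        acc = sum(1 for x, y in holdout if h(x) == y) / len(holdout)
        if acc >= 0.5 + adv and (best is None or acc > best[0]):
            best = (acc, h)
    return None if best is None else best[1]
```

For example, the `weak_learn_sketch` function from the earlier sketch could be passed as `weak_learner` here.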

Finally, it will be convenient to think of algorithms that make predictions on only a subset of the domain. If an algorithm predicts on a subset of probability mass p and has error rate 1/2 − γ′ on that subset, then by flipping a fair coin on the remainder, the overall error rate will be 1/2 − γ for γ = p·γ′.

3 Proof of Theorem 1

We first build a weak agnostic learner for the best-disjunction problem. Our weak learner has the guarantee given in Theorem 2, which we restate below.

Theorem 2. There exists an algorithm with the following property. For every distribution D over {0,1}^n and every target function c* such that OPT < n^{−1/3−α} for some constant α > 0, for every δ > 0, the algorithm runs in time t(δ, n), uses m(δ, n) random samples drawn from D, and outputs a hypothesis h such that, with probability > 1 − δ, err_D(h) ≤ 1/2 − γ, where t and m are polynomials in n and 1/δ, and γ = Ω(n^{−2}).

Our algorithm has two stages: a preprocessing step (which we present later in Section 3.2) that ensures that all variables are set to 1 roughly the same number of times and that the bad and good examples have roughly the same number of 1s, and a core algorithm (which we present first in Section 3.1) that operates on data of this form. One aspect of the preprocessing step is that in addition to partitioning examples into buckets, it may involve discarding some relevant variables, yielding a dataset in which only some m/polylog(n) positive examples satisfy c_opt over the variables remaining. Thus, our assumption in Section 3.1 is that while the dataset has the homogeneity properties desired and the fraction of bad negative examples is OPT(1 + o(1)), the fraction of bad positive examples may be as large as 1 − 1/polylog(n). Nonetheless, this will still allow for weak learning.

3.1 (B, α, m̂)-Sparse Instances

As mentioned above, in this section we give a weak-learning algorithm for a dataset that has certain nice homogeneity properties. We call such a dataset a (B, α, m̂)-sparse instance. We begin by describing what these properties are.

The first property is that there exists a positive integer B such that for each variable x_i, the number of positive examples in the instance with x_i = 1 is between B/2 and B, and the number of negative examples with x_i = 1 is between (B/2)(1 − o(1)) and B(1 + o(1)). This property implies that the overall number of 1s in all examples is at most 2nB(1 + o(1)), and therefore an average example has no more than nB(1 + o(1))/m variables set to 1. If the bad negatives were typical examples, we would expect them to contain at most (nB/m)·m⁻_bad·(1 + o(1)) ≤ nB·OPT(1 + o(1)) ones in total. While in general this may not necessarily be the case, we assume for this section that at least they are not too atypical on average. In particular, the second property we assume this instance satisfies is that the overall number of ones present in all the bad negatives is at most n^{1+α}·B·OPT.

Denote by m̂ the number of positive examples that c_opt classifies correctly. The third property is that m̂ ≥ m/n^{o(α)}. If this dataset were our given training set then this would be redundant, as we already assume the stronger condition that the fraction of good positive examples is 1 − O(OPT).⁶ However, m̂ will be of use in later sections, when we call this algorithm as a subroutine on instances defined by only a subset of all the variables. In other words, we show here that even if we allow c_opt to make more mistakes on the positive examples (and in particular, to label almost all positives incorrectly!) yet make at most m·OPT mistakes on the negatives, we are still able to weak-learn.
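Of the three properties of a (B, α, m̂)-sparse instance, only the first refers to directly observable quantities; a direct empirical check of it looks like the sketch below (Python; the explicit `slack` parameter replacing the o(1) terms is an assumption for illustration, and Properties 2 and 3 involve the unknown c_opt, so they remain analysis-only).

```python
def satisfies_first_property(examples, B, slack=0.1):
    """Check Property 1 of a (B, alpha, m_hat)-sparse instance:
    every variable is set to 1 on between B/2 and B positive examples,
    and on between (B/2)(1 - slack) and B(1 + slack) negative examples."""
    n = len(examples[0][0])
    for i in range(n):
        pos = sum(x[i] for x, y in examples if y == +1)
        neg = sum(x[i] for x, y in examples if y == -1)
        if not (B / 2 <= pos <= B):
            return False
        if not ((B / 2) * (1 - slack) <= neg <= B * (1 + slack)):
            return False
    return True
```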
As our analysis shows, the condition we require of m̂ is that the ratio m̂/m dominates the quantity n^{(1+α)/2}·OPT^{3/2} that arises in the analysis below; this ratio also plays a role in the definition of γ, our advantage over a random guess. An instance satisfying all of the above three properties is called a (B, α, m̂)-sparse instance.

Next, we show how to get a weak learner for such sparse instances. We first introduce the following definitions.

Definition 3. Given an example e and a positive integer threshold θ, we define the (e, θ)-restricted domain to be the set of all examples whose intersection with e is strictly smaller than θ, that is, the set of examples x such that x · e < θ.

⁶ Indeed, if the original instance was sparse, we would have m̂ = m⁺(1 − o(1)).

For any hypothesis h, we define the (e, θ)-restricted hypothesis to be h over any example that belongs to the (e, θ)-restricted domain, and "I don't know" (flipping a fair coin) over any other example. In particular, we consider the following:

- the (e, θ)-restricted (+1)-hypothesis: predict +1 if the given example intersects e on less than θ variables;
- the (e, θ)-restricted (−1)-hypothesis: predict −1 if the given example intersects e on less than θ variables;
- the (e, θ)-restricted x_i-hypothesis: predict +1 if the given example intersects e on less than θ variables and has x_i = 1.

We call these n + 2 restricted hypotheses the (e, θ)-restricted base hypotheses. Our weak-learning algorithm enumerates over all pairs (e, θ), where e is a negative example in our training set and θ is an integer between 1 and n. For every such pair, our algorithm checks whether any of the n + 2 restricted hypotheses is an Ω(m̂·OPT/(rm))-weak learner (see Algorithm 1 below). Our next lemma proves that for (B, α, m̂)-sparse instances, this algorithm indeed finds a weak learner. In fact, we show that for every negative example e, it suffices to consider a particular value of θ.

Algorithm 1: A weak learner for sparse instances.
Input: a (B, α, m̂)-sparse instance.
Step 1: For every negative example e in the set and every θ ∈ {1, 2, ..., n}:
  Step 1a: Check if any of the (e, θ)-restricted hypotheses from Definition 3 is a weak learner with error at most 1/2 − Ω(n^{−2}).
  Step 1b: If yes, then output the corresponding hypothesis and halt.
Step 2: If no restricted hypothesis is a weak learner, output "failure".

Lemma 4. Suppose we are given a (B, α, m̂)-sparse instance, and that c_opt makes no more than an n^{−(1/3+α)} fraction of errors on the negative examples. Then there exists a bad negative example e and a threshold θ such that one of the (e, θ)-restricted base hypotheses of Definition 3 has error at most 1/2 − γ for γ = Ω(m̂·OPT/(rm)). Since we may assume OPT > 1/√n, this implies γ = Ω(n^{−2}). Thus Algorithm 1 outputs a hypothesis of error at most 1/2 − Ω(n^{−2}).

Proof: Let m⁺ and m⁻ be the number of positive and negative examples in this sparse instance, where we reserve m to refer to the size of the original dataset of which this sparse instance is a subset. As before, call examples good if they are classified correctly by c_opt, and bad otherwise. We know B = O(m·OPT), because relevant variables have no more than O(m·OPT) occurrences of 1 over the negative examples. Since each good positive example has to have at least one relevant variable set to 1, it must also hold that B = Ω(m̂/r). It follows that r·m·OPT = Ω(m̂). We now show how to find a weak learner given a (B, α, m̂)-sparse instance, based on a bad negative example.

Consider any bad negative example e_i with t_i variables set to 1. If we sum the intersection (i.e., the dot-product) of e_i with each of the positive examples in the instance, we simply get the total number of ones in the positive examples over these t_i variables. As each variable is set to 1 between B/2 and B times, this sum is B′·t_i for some B′ ∈ [B/2, B]. Therefore, the expected intersection of e_i with a random positive example is t_i·B′/m⁺. Set θ_i = β·t_i·B′/m⁺, where β > 1 will be chosen suitably later. Throw out any example which has more than θ_i intersection with e_i. Using Markov's inequality, we deduce that we retain at least m⁺(1 − 1/β) positive examples. The key point of the above is that, focusing on the examples that remain, none of them can contribute more than θ_i hinge-loss (attribute errors) when restricting c_opt to the t_i variables set to 1 by e_i.
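For concreteness, the search performed by Algorithm 1 over pairs (e, θ) and the n + 2 restricted base hypotheses of Definition 3 can be sketched as follows. The advantage threshold `adv` is an illustrative stand-in for the Ω(n^{−2}) bound, and predicting −1 when x_i = 0 inside the restricted domain is one natural reading of the x_i-hypothesis.

```python
import random

def restricted_base_hypotheses(e, theta, n):
    """Yield the n + 2 (e, theta)-restricted base hypotheses of Definition 3.
    Each returns +1/-1 inside the restricted domain and None (meaning: flip
    a fair coin) outside it."""
    def in_domain(x):
        return sum(a & b for a, b in zip(x, e)) < theta   # x . e < theta
    yield lambda x: (+1 if in_domain(x) else None)
    yield lambda x: (-1 if in_domain(x) else None)
    for i in range(n):
        yield (lambda x, i=i:
               (+1 if x[i] == 1 else -1) if in_domain(x) else None)

def algorithm1(examples, adv):
    """Brute-force search over (e, theta) with e a negative training example
    and theta in {1, ..., n}; return the first restricted hypothesis whose
    empirical advantage over random guessing is at least adv."""
    m = len(examples)
    n = len(examples[0][0])
    negatives = [x for x, y in examples if y == -1]
    for e in negatives:
        for theta in range(1, n + 1):
            for h in restricted_base_hypotheses(e, theta, n):
                # Coin flips outside the domain count as 1/2 in expectation.
                correct = sum(1.0 if h(x) == y else (0.5 if h(x) is None else 0.0)
                              for x, y in examples)
                if correct / m - 0.5 >= adv:
                    return lambda x, h=h: h(x) if h(x) is not None else random.choice((+1, -1))
    return None
```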
On the other hand, it is possible that the number of actual errors over positives has increased substantially: perhaps too few of the remaining positive examples share relevant variables with e_i in order for any of the (e_i, θ_i)-restricted hypotheses to be a weak learner. We now argue that this cannot happen simultaneously for all e_i.

Specifically, assume for contradiction that none of the (e_i, θ_i)-restricted base hypotheses yields a weak learner. Consider the total number of 1s contributed by the remaining negative examples over the relevant variables of e_i (the relevant variables that are set to 1 by e_i). As each bad negative contributes at most θ_i such ones, the overall contribution on the negative side is at most θ_i·m·OPT(1 + o(1)) = β·(t_i·B′/m⁺)·m·OPT(1 + o(1)). Since none of the relevant variables set to 1 by e_i gives a weak learner, the number of 1s over the positive side of these relevant variables is no more than 2β·(t_i·B′/m⁺)·m·OPT (see below, at the specification of the value of γ). So even if each occurrence of a 1 comes from a unique positive example, we still have no more

than 2β·(t_i·B′/m⁺)·m·OPT positive examples from the (e_i, θ_i)-restricted domain intersecting e_i over the relevant variables. Therefore, adding back in the positive examples not from the restricted domain, we have no more than 2β·(t_i·B′/m⁺)·m·OPT + m⁺/β positive examples that intersect e_i over the relevant variables.

Consider now a bipartite graph with the good positive examples on one side and the at most m·OPT bad negative examples on the other side, with an edge between positive e_j and negative e_i if e_j intersects e_i over the relevant variables. Since each e_i has degree at most 2β·(t_i·B′/m⁺)·m·OPT + m⁺/β, the total number of edges is at most 2β·(B/m⁺)·m·OPT·Σ_i t_i + m⁺·m·OPT/β, and therefore some good positive example must have degree at most (m·OPT/m̂)·[2β·B·Σ_i t_i/m⁺ + m⁺/β]. On the other hand, since we are given a (B, α, m̂)-sparse instance, we know that every good positive example intersects at least (B/2)(1 − o(1)) negative examples, and moreover that Σ_i t_i ≤ n^{1+α}·B·OPT. Putting this together we have

(B/2)(1 − o(1)) ≤ (m·OPT/m̂)·(1 + o(1))·[2β·B²·n^{1+α}·OPT/m⁺ + m⁺/β].

Setting β = √((m⁺)²/(2B²·n^{1+α}·OPT)), which equalizes the two terms in the sum above, we derive

B/4 ≤ √2·(1 + o(1))·(m/m̂)·B·n^{(1+α)/2}·OPT^{3/2}.

Thus we have

n^{1+α}·OPT³ ≥ (m̂/m)² / (32·(1 + o(1))).

Recall that m̂/m ≥ n^{−o(α)}, so we derive a contradiction, as for sufficiently large n it must hold that

OPT ≥ ((m̂/m)² / (32·(1 + o(1))·n^{1+α}))^{1/3} ≥ n^{−(1+α)/3 − o(α)} > n^{−(1/3+α)}.

In order to complete the proof, we need to verify that indeed β > 1. Recall B = O(m·OPT) and m⁺ ≥ m̂, so m⁺/m ≥ n^{−o(α)}. Thus β² = Ω(1/(n^{1+α+o(α)}·OPT³)) = Ω(n^{2α−o(α)}) by our assumption on OPT.

The last detail is to check what advantage we get over a random guess. Our analysis shows that for some bad negative example e_i, the number of ones over the relevant variables on the positive side is at least 2β·(t_i·B′/m⁺)·m·OPT, whereas on the negative side there can be at most β·(t_i·B′/m⁺)·m·OPT(1 + o(1)) ones. We deduce that at least one of the at most min(r, t_i) relevant variables set to 1 by e_i must give a gap of at least β·(t_i·B′/m⁺)·m·OPT(1 − o(1))/min(r, t_i) ≥ (B/(2m⁺))·m·OPT(1 − o(1)), since β > 1. Finally, using the fact that B = Ω(m̂/r), we get a gap of Ω(m̂·m·OPT/(r·m⁺)), or equivalently an advantage of γ = Ω(m̂·OPT/(rm)). This advantage is trivially Ω(n^{−2(1+o(α))}), or, using the assumption OPT > 1/√n (for otherwise we can apply Peleg's algorithm [14]), we get γ = Ω(n^{−(3/2)(1+o(α))}).

3.2 General Instances

Section 3.1 dealt with nicely behaved (homogeneous) instances. In order to complete the proof of Theorem 2, we need to show how to reduce a general instance to such a (B, α, m̂)-sparse instance. What we show is a (simple) algorithm that partitions a given instance into sub-instances, based on the number of 1s of each example over certain variables (but without looking at the labels of the examples). It outputs a polylog(n)-long list of sub-instances, each containing a noticeable fraction of the domain, and has the following guarantee: either some sub-instance has a trivial weak learner (a noticeably different number of positive versus negative examples, or a variable with noticeable correlation), or some sub-instance is (B, α, m̂)-sparse. Formally, we prove the following lemma.

Lemma 5. There exists a poly((log n)^{O(1/α)}, n, m)-time algorithm that gets as input 2m labeled examples in {0,1}^n and outputs a list of subsets, each containing m/polylog(n) examples, such that either some subset has a trivial weak learner, or some subset is (B, α, m/polylog(n))-sparse.

Combining the algorithm from Lemma 5 with the algorithm presented in Section 3.1, we get our weak-learning algorithm (see Algorithm 2).
We first run the algorithm of Lemma 5, traverse all sub-instances, and check whether any has a trivial weak learner. If not, we run the algorithm for (B, α, m̂)-sparse instances over each sub-instance. Clearly, given the one sub-instance which is sparse, we find a restricted hypothesis with Ω(n^{−2}) advantage over a random guess.

Proof (of Lemma 5): We start by repeating the argument presented in the introduction (Section 1.1). For any relevant variable, no more than m⁻_bad ≤ m·OPT(1 + o(1)) bad negative examples set it to 1. Therefore, as an initial step, we throw out any variable with more than this many occurrences over the negative examples, as it cannot

possibly be a relevant variable. For convenience, redefine n to be the number of variables that remain. Next, we check each individual variable to determine whether it itself is a weak predictor. If not, then each variable is set to 1 on approximately the same number of positive and negative examples. Bucket all the variables according to the number of times they are set to 1, where the j-th bucket contains all the variables that are set to 1 a number of times in the range [2^j, 2^{j+1}). Since there are at most log n buckets, some bucket j must cover at least a 1/log n fraction of the positive examples, in the sense that the disjunction over the relevant variables in this bucket agrees with at least this many good positives. So now, let B′ = 2^{j+1}, and let n′ and r′ be the total number of variables and the number of relevant variables in this bucket, respectively. As we can ignore all examples that are identically 0 over the n′ variables in this bucket, let m′⁺ (resp. m′⁻) be the number of positive (resp. negative) examples covered by the variables in this bucket. Our algorithm adds the remaining examples (over these n′ variables) as one sub-instance to its list. Let the number of these examples be 2m′. As before, if the numbers of positive and negative examples covered by these n′ variables differ significantly, or if some variable is a weak learner (with respect to the set of examples left), then the algorithm halts. Observe that if this sub-instance is (B′, α, m/log n)-sparse, then we are done, no matter what other sub-instances the algorithm adds to its list.

Focusing on the remaining examples, every variable is set to 1 at most B′ many times over the positive examples, so the total number of 1s over the positive examples is at most n′B′. If the resulting instance is not (B′, α, m/log n)-sparse, then the total number of 1s over the bad negative examples is more than (n′)^{1+α}·B′·OPT. So now, our algorithm throws out any example with more than 2n′B′/m′ variables set to 1, and adds the remaining examples to the list of sub-instances. By Markov's inequality, we are guaranteed not to remove more than half of the positive examples, so the remaining sub-instance is sufficiently large. As before, if the remaining subset of examples (over these n′ variables) has a trivial weak learner, we are done. Otherwise, the algorithm continues recursively over this sub-instance: it re-buckets the variables and then removes all examples with too many variables set to 1. Note that each time the algorithm buckets the variables, it needs to recurse over each bucket that covers at least a 1/log n fraction of the positive examples. In the worst case, all of the log n buckets cover this many positive examples, and therefore the branching factor in each bucketing step is log n.

We now show that the depth of the bucket-and-remove recurrence is no more than O(1/α). It is easy to see inductively that at the i-th step of the recursion, we retain Ω(m/(log n)^i) positive examples. Suppose that in the first i steps, no sub-instance is sparse and no weak learner is found. Recall that if r·OPT ≤ 1 we have an immediate weak learner, so it must hold that at the i-th step we still retain at least n_i ≥ 1/OPT variables. Furthermore, as at the i-th step we did not have a sparse instance, it follows that the bad negative examples had more than (n_i)^{1+α}·B′·OPT ones before we threw out examples. Once we remove dense examples, each surviving example contains no more than 2·n_i·B′/m_i ones (where 2m_i is the number of examples in the current sub-instance), so the surviving bad negatives contain no more than 2·n_i·B′·m·OPT/m_i ones in total. Thus, the fraction of the ones over the bad negatives that survives each removal step is no more than 2m/(m_i·(n_i)^α).
As 1/OPT > n^{1/3}, this fraction is at most O(n^{−α/3}·(log n)^i) < n^{−α/6} (for the first O(1/α) iterations). Hence, after 6/α iterations, some relevant variable must be a weak learner.

To complete the proof, note that we take no more than (log n)^{6/α} bucket-and-remove steps. Each such step requires poly(n, m) time for the bucketing, the removal, and the check for a weak learner. We conclude that the running time of this algorithm is poly((log n)^{1/α}, n, m).

Algorithm 2: A weak learner for general instances.
Input: a set of 2m training examples.
Step 1: If any individual variable or either constant hypothesis is a weak learner, output it and halt.
Step 2: Remove any variable which has more than 2m·OPT 1s over the negative examples.
Step 3: Bucket the remaining variables so that bucket j contains the variables with density in [2^j, 2^{j+1}).
Step 4: For every bucket which covers at least a 1/log n fraction of the positive examples:
  Step 4a: Run the algorithm for sparse instances on this bucket. If a weak learner is obtained, output it and halt.
  Step 4b: Let B′ be the density bound (2^{j+1}) of this bucket, let n′ be the number of variables in the bucket, and let 2m′ be the total number of examples with respect to this bucket (ignoring the ones which are identically zero over the n′ variables). Remove all the examples which have more than 2n′B′/m′ 1s over this bucket. Repeat Steps 1-4 on this new instance.
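A compressed Python rendering of Algorithm 2's outer loop is given below. Here `sparse_learner` is a callback standing in for the Algorithm 1 search restricted to the bucket's variables, the advantage cutoff `adv` replaces the Ω(n^{−2}) bound, and the depth cap 6/α mirrors the recursion bound from the proof of Lemma 5; all constants are illustrative rather than those of the analysis.

```python
import math

def advantage(h, examples):
    """Empirical advantage of h over random guessing (h returns +1/-1)."""
    return sum(1 for x, y in examples if h(x) == y) / len(examples) - 0.5

def trivial_weak_learner(examples, adv):
    """Step 1: constant hypotheses and single variables (and their negations)."""
    n = len(examples[0][0])
    cands = [lambda x: +1, lambda x: -1]
    cands += [(lambda x, i=i: +1 if x[i] == 1 else -1) for i in range(n)]
    cands += [(lambda x, i=i: -1 if x[i] == 1 else +1) for i in range(n)]
    for h in cands:
        if advantage(h, examples) >= adv:
            return h
    return None

def algorithm2(examples, opt_guess, alpha, adv, sparse_learner, depth=0):
    """Sketch of Algorithm 2; examples is a list of (x, y) with x a 0/1 tuple."""
    if not examples or depth > int(6 / alpha):
        return None
    n = len(examples[0][0])
    m = len(examples) / 2                         # the input holds 2m points

    h = trivial_weak_learner(examples, adv)       # Step 1
    if h is not None:
        return h

    neg = [x for x, y in examples if y == -1]     # Step 2: drop dense variables
    alive = [i for i in range(n) if sum(x[i] for x in neg) <= 2 * opt_guess * m]

    buckets = {}                                  # Step 3: bucket by density
    for i in alive:
        d = sum(x[i] for x, _ in examples)
        if d > 0:
            buckets.setdefault(int(math.log2(d)), []).append(i)

    pos_total = sum(1 for _, y in examples if y == +1)
    for j, var_ids in buckets.items():            # Step 4
        covered = [(x, y) for x, y in examples if any(x[i] for i in var_ids)]
        if sum(1 for _, y in covered if y == +1) * math.log2(max(n, 2)) < pos_total:
            continue                              # bucket covers too few positives
        h = sparse_learner(covered, var_ids)      # Step 4a
        if h is not None:
            return h
        B = 2 ** (j + 1)                          # Step 4b: drop dense examples, recurse
        mprime = max(len(covered) / 2, 1)
        keepset = set(var_ids)
        mask = lambda x: tuple(x[i] if i in keepset else 0 for i in range(len(x)))
        light = [(mask(x), y) for x, y in covered
                 if sum(x[i] for i in var_ids) <= 2 * len(var_ids) * B / mprime]
        h = algorithm2(light, opt_guess, alpha, adv, sparse_learner, depth + 1)
        if h is not None:
            return h
    return None
```

In an actual run one would pass a wrapper around the Algorithm 1 sketch above as `sparse_learner`, and validate any returned hypothesis on held-out data, as in the OPT-guessing wrapper earlier.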

3.3 Strong Learning

Given Theorem 2, we now prove the main theorem (Theorem 1) by plugging the weak learner into an off-the-shelf boosting algorithm for the agnostic case. We use the recent ABoostDI booster of [4], which converts any algorithm satisfying Theorem 2 into one satisfying Theorem 1. The result in [4] gives a boosting technique for (η, γ)-weak learners. In our context, an (η, γ)-weak learner is an algorithm which, with respect to any distribution D, with high probability produces a hypothesis of error at most 1/2 − γ whenever OPT_disj ≤ 1/2 − η.

Theorem 6 (Feldman [4], Theorem 3.5). There exists an algorithm ABoostDI that, given an (η, γ)-weak learner, for every distribution D and ε > 0, produces, with high probability, a hypothesis h such that err_D(h) ≤ OPT_disj/(1 − 2η) + ε. Furthermore, the running time of the algorithm is T·poly(1/γ, 1/ε), where T is the running time of the weak learner.

As an immediate corollary, we set η = 1/2 − n^{−1/3−α} and obtain a hypothesis h such that err_D(h) ≤ 2n^{1/3+α}·OPT + ε. This concludes the proof of Theorem 1. We note that as an alternative to ABoostDI, one can also use the boosting algorithm of Kalai et al. [9], followed by another boosting algorithm of Gavinsky [7], to get the result in Theorem 1.
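With η set as above (a reconstruction consistent with Theorem 2, since the weak learner applies whenever OPT_disj ≤ n^{−1/3−α} = 1/2 − η), the corollary is the following one-line calculation:

```latex
\mathrm{err}_D(h)
  \;\le\; \frac{\mathrm{OPT}_{\mathrm{disj}}}{1-2\eta} + \epsilon
  \;=\; \frac{\mathrm{OPT}_{\mathrm{disj}}}{2\,n^{-1/3-\alpha}} + \epsilon
  \;=\; \tfrac12\, n^{1/3+\alpha}\,\mathrm{OPT}_{\mathrm{disj}} + \epsilon
  \;\le\; 2\, n^{1/3+\alpha}\,\mathrm{OPT}_{\mathrm{disj}} + \epsilon .
```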
4 Future Directions

In this paper we have presented an algorithm for learning the class of disjunctions in the case that OPT < n^{−(1/3+α)}, achieving an error rate of O(n^{1/3+α}·OPT) + ε. The natural open question is whether one can improve this bound. For example, can one achieve weak agnostic learning for OPT = n^{−1/4}? Or, can one improve the bounds as a function of the number of relevant variables, e.g., making only a factor O(r^{0.9}) times more mistakes than the best disjunction?

An intriguing open question is whether one can extend this technique to other concept classes. For example, consider the class of linear separators over {0,1}^n with weights in {0,1} (i.e., majority-vote or "k of r" functions). Here we do not know how to achieve weak learning even for comparably small values of OPT. The algorithm presented in this paper for disjunctions uses the fact that, in order for individual variables not to be weak hypotheses themselves, the bad negative examples must in some sense point in the direction of the target vector (they must have a high dot-product with the target function vector, if we view the disjunction as a linear threshold function) to a substantially greater extent than the positive examples do. E.g., if a typical positive example has t relevant variables set to 1, then the typical bad negative example must have t/OPT relevant variables set to 1. For the case of majority-vote functions, the difficulty with this approach is that all we can say is that if the positive examples have r/2 + t relevant variables set to 1, then the typical bad negative examples should have at least r/2 + t/OPT relevant variables set to 1, which might not be such a distinction in a multiplicative sense.

On a more general note, our work here uses somewhat non-traditional hypotheses, by using the examples themselves to define slices of the data (focusing on those examples with no more than a certain threshold θ dot-product with some given negative example). Perhaps this might be useful for other learning problems.

Acknowledgments

We would like to thank Aaron Roth and Maria-Florina Balcan for a number of helpful discussions. This work was supported in part by NSF grant CCF.

References

[1] Peter Auer and Manfred K. Warmuth. Tracking the best disjunction. In Proc. 36th Symposium on Foundations of Computer Science (FOCS), 1995.
[2] R. D. Carr, S. Doddi, G. Konjevod, and M. Marathe. On the red-blue set cover problem. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2000.
[3] Vitaly Feldman. Optimal hardness results for maximizing agreements with monomials. SICOMP, 39(2), 2009. Extended abstract in CCC 2006.
[4] Vitaly Feldman. Distribution-specific agnostic boosting. In 1st Symposium on Innovations in Computer Science (ICS), 2010.
[5] Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. Agnostic learning of monomials by halfspaces is hard. In FOCS, 2009.
[6] Sally Floyd and Manfred Warmuth. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3), 1995.
[7] Dmitry Gavinsky. Optimally-smooth adaptive boosting and application to agnostic learning. Journal of Machine Learning Research, 4, 2003.
[8] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6), 2008.
[9] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In STOC '08: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, New York, NY, USA, 2008. ACM.
[10] A. Klivans, R. O'Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In FOCS, 2008.
[11] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. In ICALP, 2009.
[12] N. Littlestone. Redundant noisy attributes, attribute errors, and linear-threshold learning using Winnow. In Proc. 4th Conference on Computational Learning Theory (COLT), Santa Cruz, California, 1991. Morgan Kaufmann.
[13] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.
[14] David Peleg. Approximation algorithms for the Label-Cover_MAX and Red-Blue Set Cover problems. Journal of Discrete Algorithms, 5(1):55-64, 2007.


More information

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers

Ocean 420 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers Ocean 40 Physical Processes in the Ocean Project 1: Hydrostatic Balance, Advection and Diffusion Answers 1. Hydrostatic Balance a) Set all of the levels on one of the coluns to the lowest possible density.

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

I. Understand get a conceptual grasp of the problem

I. Understand get a conceptual grasp of the problem MASSACHUSETTS INSTITUTE OF TECHNOLOGY Departent o Physics Physics 81T Fall Ter 4 Class Proble 1: Solution Proble 1 A car is driving at a constant but unknown velocity,, on a straightaway A otorcycle is

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

arxiv: v1 [math.nt] 14 Sep 2014

arxiv: v1 [math.nt] 14 Sep 2014 ROTATION REMAINDERS P. JAMESON GRABER, WASHINGTON AND LEE UNIVERSITY 08 arxiv:1409.411v1 [ath.nt] 14 Sep 014 Abstract. We study properties of an array of nubers, called the triangle, in which each row

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography

Tight Bounds for Maximal Identifiability of Failure Nodes in Boolean Network Tomography Tight Bounds for axial Identifiability of Failure Nodes in Boolean Network Toography Nicola Galesi Sapienza Università di Roa nicola.galesi@uniroa1.it Fariba Ranjbar Sapienza Università di Roa fariba.ranjbar@uniroa1.it

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J.

Probability and Stochastic Processes: A Friendly Introduction for Electrical and Computer Engineers Roy D. Yates and David J. Probability and Stochastic Processes: A Friendly Introduction for Electrical and oputer Engineers Roy D. Yates and David J. Goodan Proble Solutions : Yates and Goodan,1..3 1.3.1 1.4.6 1.4.7 1.4.8 1..6

More information

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool

Constant-Space String-Matching. in Sublinear Average Time. (Extended Abstract) Wojciech Rytter z. Warsaw University. and. University of Liverpool Constant-Space String-Matching in Sublinear Average Tie (Extended Abstract) Maxie Crocheore Universite de Marne-la-Vallee Leszek Gasieniec y Max-Planck Institut fur Inforatik Wojciech Rytter z Warsaw University

More information

4 = (0.02) 3 13, = 0.25 because = 25. Simi-

4 = (0.02) 3 13, = 0.25 because = 25. Simi- Theore. Let b and be integers greater than. If = (. a a 2 a i ) b,then for any t N, in base (b + t), the fraction has the digital representation = (. a a 2 a i ) b+t, where a i = a i + tk i with k i =

More information

A Bernstein-Markov Theorem for Normed Spaces

A Bernstein-Markov Theorem for Normed Spaces A Bernstein-Markov Theore for Nored Spaces Lawrence A. Harris Departent of Matheatics, University of Kentucky Lexington, Kentucky 40506-0027 Abstract Let X and Y be real nored linear spaces and let φ :

More information

Math Reviews classifications (2000): Primary 54F05; Secondary 54D20, 54D65

Math Reviews classifications (2000): Primary 54F05; Secondary 54D20, 54D65 The Monotone Lindelöf Property and Separability in Ordered Spaces by H. Bennett, Texas Tech University, Lubbock, TX 79409 D. Lutzer, College of Willia and Mary, Williasburg, VA 23187-8795 M. Matveev, Irvine,

More information

arxiv: v2 [math.co] 3 Dec 2008

arxiv: v2 [math.co] 3 Dec 2008 arxiv:0805.2814v2 [ath.co] 3 Dec 2008 Connectivity of the Unifor Rando Intersection Graph Sion R. Blacburn and Stefanie Gere Departent of Matheatics Royal Holloway, University of London Egha, Surrey TW20

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

On Process Complexity

On Process Complexity On Process Coplexity Ada R. Day School of Matheatics, Statistics and Coputer Science, Victoria University of Wellington, PO Box 600, Wellington 6140, New Zealand, Eail: ada.day@cs.vuw.ac.nz Abstract Process

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

Analyzing Simulation Results

Analyzing Simulation Results Analyzing Siulation Results Dr. John Mellor-Cruey Departent of Coputer Science Rice University johnc@cs.rice.edu COMP 528 Lecture 20 31 March 2005 Topics for Today Model verification Model validation Transient

More information

time time δ jobs jobs

time time δ jobs jobs Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor

More information

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5,

Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, Sequence Analysis, WS 14/15, D. Huson & R. Neher (this part by D. Huson) February 5, 2015 31 11 Motif Finding Sources for this section: Rouchka, 1997, A Brief Overview of Gibbs Sapling. J. Buhler, M. Topa:

More information

Midterm 1 Sample Solution

Midterm 1 Sample Solution Midter 1 Saple Solution NOTE: Throughout the exa a siple graph is an undirected, unweighted graph with no ultiple edges (i.e., no exact repeats of the sae edge) and no self-loops (i.e., no edges fro a

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

The Frequent Paucity of Trivial Strings

The Frequent Paucity of Trivial Strings The Frequent Paucity of Trivial Strings Jack H. Lutz Departent of Coputer Science Iowa State University Aes, IA 50011, USA lutz@cs.iastate.edu Abstract A 1976 theore of Chaitin can be used to show that

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information