Efficient Learning of Linear Separators under Bounded Noise


Efficient Learning of Linear Separators under Bounded Noise

Pranjal Awasthi    Maria-Florina Balcan    Nika Haghtalab    Ruth Urner

March 1, 2015

Abstract

We study the learnability of linear separators in $\mathbb{R}^d$ in the presence of bounded (a.k.a. Massart) noise. This is a realistic generalization of the random classification noise model, where the adversary can flip each example $x$ with probability $\eta(x) \le \eta$. We provide the first polynomial time algorithm that can learn linear separators to arbitrarily small excess error in this noise model under the uniform distribution over the unit sphere in $\mathbb{R}^d$, for some constant value of $\eta$. While widely studied in the statistical learning theory community in the context of getting faster convergence rates, computationally efficient algorithms in this model had remained elusive. Our work provides the first evidence that one can indeed design algorithms achieving arbitrarily small excess error in polynomial time under this realistic noise model and thus opens up a new and exciting line of research. We additionally provide lower bounds showing that popular algorithms such as hinge loss minimization and averaging cannot lead to arbitrarily small excess error under Massart noise, even under the uniform distribution. Our work instead makes use of a margin based technique developed in the context of active learning. As a result, our algorithm is also an active learning algorithm whose label complexity depends only logarithmically on the desired excess error $\epsilon$.

1 Introduction

Overview. Linear separators are the most popular classifiers studied in both the theory and practice of machine learning. Designing noise tolerant, polynomial time learning algorithms that achieve arbitrarily small excess error rates for linear separators is a long-standing question in learning theory. In the absence of noise (when the data is realizable) such algorithms exist via linear programming [11]. However, the problem becomes significantly harder in the presence of label noise. In particular, in this work we are concerned with designing algorithms that can achieve error $\mathrm{OPT}+\epsilon$, which is arbitrarily close to $\mathrm{OPT}$, the error of the best linear separator, and run in time polynomial in $\frac{1}{\epsilon}$ and $d$ (as usual, we call $\epsilon$ the excess error). Such strong guarantees are only known for the well studied random classification noise model [7]. In this work, we provide the first algorithm that can achieve arbitrarily small excess error, in truly polynomial time, for bounded noise, also called Massart noise [28], a much more realistic and widely studied noise model in statistical learning theory [9]. We additionally show strong lower bounds under the same noise model for two other computationally efficient learning algorithms (hinge loss minimization and the averaging algorithm), which could be of independent interest.

Motivation. The work on computationally efficient algorithms for learning halfspaces has focused on two different extremes. On one hand, for the very stylized random classification noise model (RCN), where each example $x$ is flipped independently with equal probability $\eta$, several works have provided computationally efficient algorithms that can achieve arbitrarily small excess error in polynomial time [7, 30, 5]; note that all these results crucially exploit the high amount of symmetry present in RCN.

At the other extreme, there has been significant work on much more difficult and adversarial noise models, including the agnostic model [25] and malicious noise models [24]. The best results here, however, not only require additional distributional assumptions about the marginal over the instance space, but they also only achieve much weaker multiplicative approximation guarantees [23, 27, 2]; for example, the best result of this form for the case of the uniform distribution over the unit sphere $S^{d-1}$ achieves excess error $c\cdot\mathrm{OPT}$ [2], for some large constant $c$. While interesting from a technical point of view, guarantees of this form are somewhat troubling from a statistical point of view, as they are inconsistent, in the sense that there is a barrier $O(\mathrm{OPT})$ after which we cannot prove that the excess error further decreases as we get more and more samples. In fact, recent evidence shows that this is unavoidable for polynomial time algorithms in such adversarial noise models [12].

Our Results. In this work we identify a realistic and widely studied noise model in statistical learning theory, the so called Massart noise [9], for which we can prove much stronger guarantees. Massart noise can be thought of as a generalization of the random classification noise model where the label of each example $x$ is flipped independently with probability $\eta(x) < 1/2$. The adversary has control over choosing a different noise rate $\eta(x)$ for every example $x$, with the only constraint that $\eta(x) \le \eta$. From a statistical point of view, it is well known that under this model we can get faster rates compared to worst case joint distributions [9]. In computational learning theory, this noise model was also studied, but under the name of malicious misclassification noise [29, 31]. However, due to its highly asymmetric nature, to date, computationally efficient learning algorithms in this model had remained elusive.

In this work, we provide the first computationally efficient algorithm achieving arbitrarily small excess error for learning linear separators. Formally, we show that there exists a polynomial time algorithm that can learn linear separators to error $\mathrm{OPT}+\epsilon$ and run in time $\mathrm{poly}(d, \frac{1}{\epsilon})$ when the underlying distribution is the uniform distribution over the unit ball in $\mathbb{R}^d$ and the noise rate of each example is upper bounded by a constant $\eta$ (independent of the dimension). As mentioned earlier, a result of this form was previously only known for random classification noise.

From a technical point of view, as opposed to random classification noise, where the error of each classifier scales uniformly under the observed labels, the observed error of classifiers under Massart noise could change drastically in a non-monotonic fashion. This is due to the fact that the adversary has control over choosing a different noise rate $\eta(x) \le \eta$ for every example $x$. As a result, as we show in our work (see Section 4), standard algorithms such as the averaging algorithm [30], which work for random noise, can only achieve a much poorer excess error (as a function of $\eta$) under Massart noise. Technically speaking, this is due to the fact that Massart noise can introduce high correlations between the observed labels and the component orthogonal to the direction of the best classifier.

In the face of these challenges, we take an entirely different approach than previously considered for random classification noise. Specifically, we analyze a recent margin based algorithm of [2]. This algorithm was designed for learning linear separators under agnostic and malicious noise models, and it was shown to achieve an excess error of $c\cdot\mathrm{OPT}$ for a constant $c$. By using new structural insights, we show that there exists a constant $\eta$ (independent of the dimension), so that if we face Massart noise with flipping probability upper bounded by $\eta$, we can use a modification of the algorithm of [2] and achieve arbitrarily small excess error. One way to think about this result is that we define an adaptively chosen sequence of hinge loss minimization problems over smaller and smaller bands around the current guess for the target. We show, by relating the hinge loss and 0/1-loss together with a careful localization analysis, that these will direct us closer and closer to the optimal classifier, allowing us to achieve arbitrarily small excess error rates in polynomial time.

Given that our algorithm is an adaptively chosen sequence of hinge loss minimization problems, one might wonder what guarantee one-shot hinge loss minimization could provide. In Section 5, we show a strong negative result: for every $d$, $\tau$, and $\eta \le 1/2$, there is a noisy distribution $\tilde D$ over $\mathbb{R}^d \times \{-1,1\}$ satisfying Massart noise with parameter $\eta$, and an $\epsilon_0 > 0$, such that $\tau$-hinge loss minimization returns a classifier with excess error $\Omega(\epsilon_0)$. This result could be of independent interest. While there exists earlier work showing that hinge loss minimization can lead to classifiers of large 0/1-loss [6], the lower bounds in that paper employ distributions with significant mass on discrete points with flipped labels (which is not possible under Massart noise) at a very large distance from the optimal classifier. Thus, that result makes strong use of the hinge loss's sensitivity to errors at large distance. Here, we show that hinge loss minimization is bound to fail under much more benign conditions.

One appealing feature of our result is that the algorithm we analyze is in fact naturally adaptable to the active learning or selective sampling scenario (intensively studied in recent years [19, 13, 20]), where the learning algorithm only receives the classifications of examples when it asks for them. We show that, in this model, our algorithm achieves a label complexity whose dependence on the error parameter $\epsilon$ is polylogarithmic (and thus exponentially better than that of any passive algorithm). This provides the first polynomial-time active learning algorithm for learning linear separators under Massart noise. We note that prior to our work only inefficient algorithms could achieve the desired label complexity under Massart noise [4, 20].

Related Work. The agnostic noise model is notoriously hard to deal with computationally, and there is significant evidence that achieving arbitrarily small excess error in polynomial time is hard in this model [1, 18, 12]. For this model, under our distributional assumptions, [23] provides an algorithm that learns linear separators in $\mathbb{R}^d$ to excess error at most $\epsilon$, but whose running time is $\mathrm{poly}(d)\exp(1/\epsilon)$. Recent work shows evidence that this exponential dependence on $1/\epsilon$ is unavoidable in the agnostic case [26]. We side-step this by considering a more structured, yet realistic noise model.

Motivated by the fact that many modern machine learning applications have massive amounts of unannotated or unlabeled data, there has been significant interest in designing active learning algorithms that most efficiently utilize the available data, while minimizing the need for human intervention. Over the past decade there has been substantial progress on understanding the underlying statistical principles of active learning, and several general characterizations have been developed for describing when active learning could have an advantage over the classical passive supervised learning paradigm, both in the noise free settings and in the agnostic case [17, 13, 3, 4, 19, 15, 10, 14, 20]. However, despite many efforts, except for very simple noise models (random classification noise [5] and linear noise [16]), to date there are no known computationally efficient algorithms with provable guarantees in the presence of Massart noise that can achieve arbitrarily small excess error. We note that the work of [21] provides computationally efficient algorithms for both passive and active learning under the assumption that the hinge loss (or other surrogate loss) minimizer aligns with the minimizer of the 0/1-loss. In our work (Section 5), we show that this is not the case under Massart noise, even when the marginal over the instance space is uniform, but we still provide a computationally efficient algorithm for this much more challenging setting.

2 Preliminaries

We consider the binary classification problem; that is, we work on the problem of predicting a binary label $y$ for a given instance $x$. We assume that the data points $(x, y)$ are drawn from an unknown underlying distribution $\tilde D$ over $X \times Y$, where $X = \mathbb{R}^d$ is the instance space and $Y = \{-1, 1\}$ is the label space.

For the purpose of this work, we consider distributions for which the marginal of $\tilde D$ over $X$ is the uniform distribution on the $d$-dimensional unit ball. We work with the class of all homogeneous halfspaces, denoted by $H = \{\operatorname{sign}(w\cdot x) : w \in \mathbb{R}^d\}$. For a given halfspace $w \in H$, we define the error of $w$ with respect to a distribution $D$ by $\operatorname{err}_D(w) = \Pr_{(x,y)\sim D}[\operatorname{sign}(w\cdot x) \ne y]$.

We examine learning halfspaces in the presence of Massart noise. In this setting, we assume that the Bayes optimal classifier is a linear separator $w^*$. Note that $w^*$ can have a non-zero error. Then Massart noise with parameter $\beta > 0$ is the condition that for all $x$, the conditional label probabilities satisfy

$|\Pr(y = 1 \mid x) - \Pr(y = -1 \mid x)| \;\ge\; \beta.$   (1)

Equivalently, we say that $\tilde D$ satisfies Massart noise with parameter $\beta$ if an adversary constructs $\tilde D$ by first taking the distribution $D$ over instances $(x, \operatorname{sign}(w^*\cdot x))$ and then flipping the label of an instance $x$ with probability at most $\frac{1-\beta}{2}$.¹ Also note that under the distribution $\tilde D$, $w^*$ remains the Bayes optimal classifier. In the remainder of this work, we refer to $\tilde D$ as the noisy distribution and to the distribution $D$ over instances $(x, \operatorname{sign}(w^*\cdot x))$ as the clean distribution.

Our goal is then to find a halfspace $w$ that has small excess error as compared to the Bayes optimal classifier $w^*$. That is, for any $\epsilon > 0$, find a halfspace $w$ such that $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \le \epsilon$. Note that the excess error of any classifier $w$ only depends on the points in the region where $w$ and $w^*$ disagree, so $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \le \frac{\theta(w,w^*)}{\pi}$. Additionally, under Massart noise the amount of noise in the disagreement region is also bounded by $\frac{1-\beta}{2}$. It is not difficult to see that under Massart noise,

$\beta\,\frac{\theta(w, w^*)}{\pi} \;\le\; \operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*).$   (2)

In our analysis, we frequently examine the region within a certain margin of a halfspace. For a halfspace $w$ and margin $b$, let $S_{w,b}$ be the set of all points that fall within margin $b$ of $w$, i.e., $S_{w,b} = \{x : |w\cdot x| \le b\}$. For the distributions $D$ and $\tilde D$, we denote the corresponding distributions conditioned on $S_{w,b}$ by $D_{w,b}$ and $\tilde D_{w,b}$, respectively. In the remainder of this work, we refer to the region $S_{w,b}$ as the band.

In our analysis, we use the hinge loss as a convex surrogate for the 0/1-loss. For a halfspace $w$, we use the $\tau$-normalized hinge loss, defined as $\ell_\tau(w, x, y) = \max\{0,\, 1 - \frac{(w\cdot x)\,y}{\tau}\}$. For a labeled sample set $W$, let $\ell_\tau(w, W) = \frac{1}{|W|}\sum_{(x,y)\in W} \ell_\tau(w, x, y)$ be the empirical hinge loss of a vector $w$ with respect to $W$.

3 Computationally Efficient Algorithm for Massart Noise

In this section, we prove our main result for learning halfspaces in the presence of Massart noise. We focus on the case where $D$ is the uniform distribution on the $d$-dimensional unit ball. Our main theorem is as follows.

Theorem 1. Let the Bayes optimal classifier be a halfspace denoted by $w^*$. Assume that the Massart noise condition holds for some $\beta$ larger than a suitable absolute constant (independent of the dimension $d$). Then for any $\epsilon, \delta > 0$, Algorithm 1 with $\lambda = \frac{1}{8}$, $\alpha_k = 3879\,\pi(1-\lambda)^{k-1}$, $b_{k-1} = 3463\,\alpha_k$, and $\tau_k = 536\, d^{-1/4}\, b_{k-1}$ runs in polynomial time, proceeds in $s = O(\log\frac{1}{\epsilon})$ rounds, where in round $k$ it takes $n_k = \mathrm{poly}(d, \exp(k), \log(\frac{1}{\delta}))$ unlabeled samples and $m_k = O(d(d + \log(k/\delta)))$ labels, and with probability $1-\delta$ returns a linear separator whose excess error (compared to $w^*$) is at most $\epsilon$.

¹ Note that the relationship between the Massart noise parameter $\beta$ and the maximum flipping probability $\eta$ discussed in the introduction is $\eta = \frac{1-\beta}{2}$.
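Before turning to the algorithm, the following short sketch makes the setup above concrete: it samples from a Massart-noise distribution of the kind just defined (uniform marginal, labels of $\operatorname{sign}(w^*\cdot x)$ flipped with probability $\eta(x) \le \frac{1-\beta}{2}$) and evaluates the $\tau$-normalized hinge loss. This is only an illustration of the data model; the function names, the particular adversarial choice of $\eta(\cdot)$, and the numeric constants below are ours and not taken from the paper.

```python
import numpy as np

def sample_massart_uniform_ball(n, d, w_star, eta_fn, rng):
    """Draw n labeled examples: x uniform on the d-dimensional unit ball,
    clean label sign(w_star . x), flipped independently with probability
    eta_fn(x), assumed to satisfy eta_fn(x) <= (1 - beta) / 2."""
    g = rng.normal(size=(n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    x = g * (rng.random(n) ** (1.0 / d))[:, None]      # uniform in the unit ball
    clean = np.sign(x @ w_star)
    flip_prob = np.apply_along_axis(eta_fn, 1, x)
    y = np.where(rng.random(n) < flip_prob, -clean, clean)
    return x, y

def tau_hinge_loss(w, X, y, tau):
    """tau-normalized hinge loss: mean of max(0, 1 - y (w . x) / tau)."""
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau).mean()

rng = np.random.default_rng(0)
d, eta = 20, 0.05                       # eta = (1 - beta) / 2, a small constant
w_star = np.eye(d)[0]
# an adversarial eta(x): spend the full noise budget on one side of a direction
# orthogonal to w_star (this is the kind of asymmetry Section 4 exploits)
X, y = sample_massart_uniform_ball(10_000, d, w_star,
                                   lambda x: eta if x[1] > 0 else 0.0, rng)
print(tau_hinge_loss(w_star, X, y, tau=0.1))
```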

Note that in the above theorem and in Algorithm 1, the value of $\beta$ is unknown to the algorithm; therefore, our results are adaptive to values of $\beta$ within the acceptable range defined by the theorem.

The algorithm described above is similar to that of [2] and uses an iterative margin-based approach. The algorithm runs for $s = \log_{\frac{1}{1-\lambda}}(\frac{1}{\epsilon})$ rounds, for a constant $\lambda \in (0, 1]$. By induction, assume that our algorithm produces a hypothesis $w_{k-1}$ at round $k-1$ such that $\theta(w_{k-1}, w^*) \le \alpha_k$. We satisfy the base case by using an algorithm of [27]. At round $k$, we sample $m_k$ labeled examples from the conditional distribution $\tilde D_{w_{k-1},b_{k-1}}$, whose marginal is the uniform distribution over $\{x : |w_{k-1}\cdot x| \le b_{k-1}\}$. We then choose $w_k$ from the set of all hypotheses $B(w_{k-1}, \alpha_k) = \{w : \theta(w, w_{k-1}) \le \alpha_k\}$ such that $w_k$ minimizes the empirical hinge loss over these examples. Subsequently, as we prove in detail later, $\theta(w_k, w^*) \le \alpha_{k+1}$. Note that for any $w$, the excess error of $w$ is at most the error of $w$ on $\tilde D$ when the labels are corrected according to $w^*$, i.e., $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \le \operatorname{err}_D(w)$. Moreover, when $D$ is uniform, $\operatorname{err}_D(w) = \frac{\theta(w,w^*)}{\pi}$. Hence, $\theta(w_s, w^*) \le \pi\epsilon$ implies that $w_s$ has excess error at most $\epsilon$.

The algorithm described below was originally introduced to achieve an error of $c\cdot\operatorname{err}(w^*)$ for some constant $c$ in the presence of adversarial noise. Achieving a small excess error $\operatorname{err}(w^*)+\epsilon$ is a much more ambitious goal, one that requires new technical insights. Our two crucial technical innovations are as follows. We first make the key observation that under Massart noise, the noise rate over any conditional distribution of $\tilde D$ is still at most $\frac{1-\beta}{2}$; therefore, as we focus on the distribution within the band, our noise rate does not increase. Our second technical contribution is a careful choice of parameters. Indeed, the choice of parameters, up to a constant, plays an important role in tolerating a constant amount of Massart noise. Using these insights, we show that the algorithm of [2] can indeed achieve a much stronger guarantee, namely arbitrarily small excess error in the presence of Massart noise. That is, for any $\epsilon$, this algorithm can achieve error of $\operatorname{err}(w^*) + \epsilon$ in the presence of Massart noise.

Algorithm 1  EFFICIENT ALGORITHM FOR ARBITRARILY SMALL EXCESS ERROR UNDER MASSART NOISE

Input: A distribution $\tilde D$; an oracle that returns $x$, and an oracle that returns $y$, for an $(x, y)$ sample from $\tilde D$; permitted excess error $\epsilon$ and probability of failure $\delta$.
Parameters: A learning rate $\lambda$; a sequence of sample sizes $m_k$; a sequence of angles $\alpha_k$ for the hypothesis space; a sequence of widths $b_k$ for the labeled region; a sequence of hinge-loss thresholds $\tau_k$.
Algorithm:
1. Take $\mathrm{poly}(d, \frac{1}{\delta})$ samples and run the $\mathrm{poly}(d, \frac{1}{\delta})$-time algorithm of [27] to find a halfspace $w_0$ with excess error small enough that $\theta(w_0, w^*) \le 3879\pi$. (Refer to Appendix C.)
2. Draw $m_1$ examples $(x, y)$ from $\tilde D$ and put them into a working set $W$.
3. For $k = 1, \dots, \log_{\frac{1}{1-\lambda}}(\frac{1}{\epsilon}) = s$:
   (a) Find $v_k$ such that $\|v_k - w_{k-1}\| \le \alpha_k$ (as a result $v_k \in B(w_{k-1}, \alpha_k)$) that minimizes the empirical hinge loss over $W$ using threshold $\tau_k$; that is, $\ell_{\tau_k}(v_k, W) \le \min_{w\in B(w_{k-1},\alpha_k)} \ell_{\tau_k}(w, W)$.
   (b) Clear the working set $W$.
   (c) Normalize $v_k$ to obtain $w_k = \frac{v_k}{\|v_k\|}$. Until $m_{k+1}$ additional examples are put in $W$, draw an example $x$ from $\tilde D$; if $|w_k\cdot x| \ge b_k$, then reject $x$, else put $(x, y)$ into $W$.
Output: Return $w_s$, which has excess error at most $\epsilon$ with probability $1-\delta$.
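The following is a compact sketch of the margin-based iteration of Algorithm 1, assuming access to the two oracles from the input description. The projected-subgradient routine is only a generic stand-in for the step "minimize the empirical $\tau_k$-hinge loss over $B(w_{k-1}, \alpha_k)$" (any convex optimization method would do), the parameter schedules are passed in as functions because their exact constants are those of Theorem 1, the initialization step and the handling of the first working set are simplified, and all names are ours rather than the paper's.

```python
import numpy as np

def minimize_hinge_in_ball(X, y, w_prev, radius, tau, steps=500, lr=0.1):
    """Approximately minimize the empirical tau-hinge loss over
    {v : ||v - w_prev|| <= radius} by projected subgradient descent.
    A stand-in for the convex optimization step 3(a) of Algorithm 1."""
    v = w_prev.copy()
    for _ in range(steps):
        margins = y * (X @ v) / tau
        active = margins < 1.0                       # points with nonzero hinge loss
        grad = -(y[active, None] * X[active]).sum(axis=0) / (tau * len(y))
        v = v - lr * grad
        diff = v - w_prev                            # project back onto the ball
        norm = np.linalg.norm(diff)
        if norm > radius:
            v = w_prev + diff * (radius / norm)
    return v

def margin_based_algorithm(draw_unlabeled, label_oracle, w0,
                           rounds, m_k, alpha_k, b_k, tau_k):
    """Sketch of Algorithm 1: localize to a band around the current
    hypothesis, minimize hinge loss over nearby halfspaces, renormalize."""
    w = w0 / np.linalg.norm(w0)
    for k in range(1, rounds + 1):
        # collect m_k labeled examples inside the band |w . x| <= b_{k-1}
        X, y = [], []
        while len(X) < m_k(k):
            x = draw_unlabeled()
            if abs(w @ x) <= b_k(k - 1):
                X.append(x)
                y.append(label_oracle(x))
        X, y = np.array(X), np.array(y)
        v = minimize_hinge_in_ball(X, y, w, radius=alpha_k(k), tau=tau_k(k))
        w = v / np.linalg.norm(v)
    return w
```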

Overview of our analysis: Similar to [2], we divide $\operatorname{err}_D(w_k)$ into two parts: the error in the band, i.e., on $x \in S_{w_{k-1},b_{k-1}}$, and the error outside the band, on $x \notin S_{w_{k-1},b_{k-1}}$. We choose $b_{k-1}$ and $\alpha_k$ such that, for every hypothesis $w \in B(w_{k-1}, \alpha_k)$ that is considered at step $k$, the probability mass of the region outside the band on which $w$ and $w_{k-1}$ disagree is very small (Lemma 5). Therefore, the error associated with the region outside the band is also very small. This motivates the design of the algorithm to only minimize the error in the band. Furthermore, the probability mass of the band is also small enough that, for $\operatorname{err}_D(w_k) \le \alpha_{k+1}$ to hold, it suffices for $w_k$ to have a small constant error over the clean distribution restricted to the band, namely $D_{w_{k-1},b_{k-1}}$.

This is where minimizing hinge loss in the band comes in. As minimizing the 0/1-loss is NP-hard, an alternative method for finding a $w_k$ with small error in the band is needed. The hinge loss, being a convex loss function, can be efficiently minimized. So, we can efficiently find a $w_k$ that minimizes the empirical hinge loss of the sample drawn from $\tilde D_{w_{k-1},b_{k-1}}$. To allow the hinge loss to remain a faithful proxy for the 0/1-loss as we focus on bands of smaller width, we use the normalized hinge loss function defined by $\ell_\tau(w, x, y) = \max\{0,\, 1 - \frac{(w\cdot x)\,y}{\tau}\}$. A crucial part of our analysis involves showing that if $w_k$ minimizes the empirical hinge loss of the sample set drawn from $\tilde D_{w_{k-1},b_{k-1}}$, it indeed has a small 0/1-error on $D_{w_{k-1},b_{k-1}}$. To this end, we first show that when $\tau_k$ is proportional to $b_{k-1}$, the hinge loss of $w^*$ on $D_{w_{k-1},b_{k-1}}$, which is an upper bound on its 0/1-error in the band, is itself small (Lemma 1). Next, we notice that under Massart noise, the noise rate in any conditional of the distribution is still at most $\frac{1-\beta}{2}$; therefore, restricting the distribution to the band does not increase the probability of noise in the band. Moreover, the noise points in the band are close to the decision boundary, so, intuitively speaking, they cannot increase the hinge loss too much. Using these insights we can show that the hinge loss of $w_k$ on $\tilde D_{w_{k-1},b_{k-1}}$ is close to its hinge loss on $D_{w_{k-1},b_{k-1}}$ (Lemma 2).

Proof of Theorem 1 and related lemmas. To prove Theorem 1, we first introduce a series of lemmas concerning the behavior of the hinge loss in the band. These lemmas build up towards showing that $w_k$ has error at most a fixed small constant in the band. For ease of exposition, for any $k$, let $D_k$ and $\tilde D_k$ represent $D_{w_{k-1},b_{k-1}}$ and $\tilde D_{w_{k-1},b_{k-1}}$, respectively, and let $\ell(\cdot)$ represent $\ell_{\tau_k}(\cdot)$. Furthermore, let $c = 3463$, so that $b_{k-1} = c\,\alpha_k$. Our first lemma, whose proof appears in Appendix B, provides an upper bound on the true hinge loss of $w^*$ on the clean distribution in the band.

Lemma 1. $\mathbb{E}_{(x,y)\sim D_k}\,\ell(w^*, x, y) \le \frac{2\tau_k}{b_{k-1}}$.

The next lemma compares the true hinge loss of any $w \in B(w_{k-1}, \alpha_k)$ on the two distributions $D_k$ and $\tilde D_k$. It is clear that the difference between the hinge loss on these two distributions is entirely attributed to the noise points and their margin from $w$. A key insight in the proof of this lemma is that, as we concentrate on the band, the probability of seeing a noise point remains under $\frac{1-\beta}{2}$. This is due to the fact that under Massart noise, each label can be changed with probability at most $\frac{1-\beta}{2}$. Furthermore, by concentrating on the band, all points are close to the decision boundary of $w_{k-1}$. Since $w$ is also close in angle to $w_{k-1}$, the points in the band are also close to the decision boundary of $w$. Therefore the hinge loss of the noise points in the band cannot increase the total hinge loss of $w$ by too much.

Lemma 2. For any $w$ such that $w \in B(w_{k-1}, \alpha_k)$, we have $\big|\,\mathbb{E}_{(x,y)\sim \tilde D_k}\,\ell(w, x, y) - \mathbb{E}_{(x,y)\sim D_k}\,\ell(w, x, y)\,\big| \;\le\; 19\sqrt{\tfrac{1-\beta}{2}}\;\frac{b_{k-1}}{\tau_k}$.
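For later reference, the decomposition that the overview above describes can be written out explicitly in the notation of Section 2; the same decomposition is what drives the proof of Theorem 1:

\begin{align*}
\operatorname{err}_D(w_k)
  &= \Pr_{x\sim D}\big[x \notin S_{w_{k-1},b_{k-1}} \text{ and } (w_k\cdot x)(w^*\cdot x) < 0\big]
   + \Pr_{x\sim D}\big[x \in S_{w_{k-1},b_{k-1}} \text{ and } (w_k\cdot x)(w^*\cdot x) < 0\big]\\
  &\le \Pr_{x\sim D}\big[x \notin S_{w_{k-1},b_{k-1}} \text{ and } \operatorname{sign}(w_k\cdot x) \ne \operatorname{sign}(w_{k-1}\cdot x)\big]\\
  &\quad + \Pr_{x\sim D}\big[x \notin S_{w_{k-1},b_{k-1}} \text{ and } \operatorname{sign}(w_{k-1}\cdot x) \ne \operatorname{sign}(w^*\cdot x)\big]
   + \operatorname{err}_{D_k}(w_k)\cdot\Pr_{x\sim D}\big[x \in S_{w_{k-1},b_{k-1}}\big].
\end{align*}

The first two terms are controlled by the choice of $b_{k-1}$ and $\alpha_k$ via Lemma 5, and the last term by Lemma 3 (the error of $w_k$ in the band) together with the probability mass of the band (Lemma 4).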

7 Proof Let N be the set of noise points We have, E (x,y Dk l(w, x, y E (x,y Dk l(w, x, y = E (x,y Dk (l(w, x, y l(w, x, sign(w x E (x,y (1 Dk x N (l(w, x, y l(w, x, y ( w x E (x,y 1 Dk x N k Pr (x N E k (x,y D (x,y (w x (By Cauchy Shwarz Dk k 1 β αk k 1 + b k 1 (By Definition 41 of [] for uniform 1 β b k 1 k ( 1c β b k 1 k (for >, c > 1 For a labele sample set W rawn at ranom from D k, let cleane(w be the set of samples with the labels correcte by w, ie, cleane(w = {(x, sign(w x : for all (x, y W } Then by stanar VC-imension bouns (Proof inclue in Appenix B there is m k O(( + log(k/ such that for any ranomly rawn set W of m k labele samples from D k, with probability 1 B(w k 1, α k, δ (k+k, for any w E (x,y Dk l(w, x, y l(w, W 1 8, (3 E (x,y Dk l(w, x, y l(w, cleane(w 1 8 (4 Our next lemma is a crucial step in our analysis of Algorithm 1 This lemma proves that if w k B(w k 1, α k minimizes the empirical hinge loss on the sample rawn from the noisy istribution in the ban, namely D wk 1,b k 1, then with high probability w k also has a small /1-error with respect to the clean istribution in the ban, ie, D wk 1,b k 1 Lemma 3 There exists m k O(( + log(k/, such that for a ranomly rawn labele sample set W of size m k from D k, an for w k such that w k has the minimum empirical hinge loss on W between the set of all hypothesis in B(w k 1, α k, with probability 1 δ (k+k, err Dk (w k k b k β b k 1 k Proof Sketch First, we note that the true /1-error of w k on any istribution is at most its true hinge loss on that istribution Lemma 1 provies an upper boun on the true hinge loss on istribution D k Therefore, it remains to create a connection between the empirical hinge loss of w k on the sample rawn from D k to its true hinge loss on istribution D k This, we achieve by using the generalization bouns of Equations 3 an 4 to connect the empirical an true hinge loss of w k an w, an using Lemma to connect the hinge of w k an w in the clean an noisy istributions 7

8 Proof of Theorem 1 For ease of exposition, let c = 3463 Recall that λ = 1 8, α k = 3879π(1 λ k 1, b k 1 = cα k, k = 536 ( /4 b k 1, an β > Note that for any w, the excess error of w is at most the error of w on the clean istribution D, ie, err D(w err D(w err D (w Moreover, for uniform istribution D, err D (w = θ(w,w π Hence, to show that w has ɛ excess error, it suffices to show that err D (w ɛ Our goal is to achieve excess error of 3879(1 λ k at roun k This we o inirectly by bouning err D (w k at every step We use inuction For k =, we use the algorithm for aversarial noise moel by [7], which can achieve excess error of ɛ if err D(w ɛ < 56 log(1/ɛ (Refer to Appenix C for more etails For Massart noise, err D(w 1 β So, for our choice of β, this algorithm can achieve excess error of in poly(, 1 δ samples an run-time Furthermore, using Equation, θ(w, w < 3879π Assume that at roun k 1, err D (w k (1 λ k 1 We will show that w k, which is chosen by the algorithm at roun k, also has err D (w k 3879(1 λ k First note that err D (w k (1 λ k 1 implies θ(w k 1, w α k Let S = S wk 1,b k 1 inicate the ban at roun k We ivie the error of w k to two parts, error outsie the ban an error insie of the ban That is err D (w k = Pr x D [x / S an (w k x(w x < ] + Pr x D [x S an (w k x(w x < ] For the first part, ie, error outsie of the ban, Pr x D [x / S an (w k x(w x < ] is at most Pr [x / S an (w k x(w k 1 x < ] + Pr [x / S an (w k 1 x(w x < ] α k c ( x D x D π e, where this inequality hols by the application of Lemma 5 an the fact that θ(w k 1, w k α k an θ(w k 1, w α k For the secon part, ie, error insie the ban Pr [x S an (w k x(w x < ] = err Dk (w k Pr [x S] x D x D err Dk (w k V 1 V b k 1 (By Lemma 4 ( + 1 err Dk (w k c α k, π where the last transition hols by the fact that V 1 V from Lemma 3, to show that err D (w k α k+1 π +1 ( k β b k b k 1 k We simplify this inequality as follows ( k β b k b k 1 k π [8] Replacing an upper boun on err D k (w k, it suffices to show that the following inequality hols ( + 1 c α k + α k c ( π π e α k+1 π c π( e c ( 1 λ Replacing in the rhs, the values of c = 3463, an k = 536( /4 b k 1, we have ( 536( /4 + 1 β π( ( c + e c ( 1/4 8

which, for $d$ larger than a suitable constant, is at most $1-\lambda$. Therefore, $\operatorname{err}_D(w_k) \le 3879(1-\lambda)^k$.

Sample complexity analysis: We require $m_k$ labeled samples in the band $S_{w_{k-1},b_{k-1}}$ at round $k$. By Lemma 4, the probability that a randomly drawn sample from $\tilde D$ falls in $S_{w_{k-1},b_{k-1}}$ is at least $O(b_{k-1}) = O((1-\lambda)^{k-1})$. Therefore, we need $O((1-\lambda)^{-(k-1)}\, m_k)$ unlabeled samples to get $m_k$ examples in the band with probability $1 - \frac{\delta}{8(k+k^2)}$. So, the total unlabeled sample complexity is at most

$\sum_{k=1}^{s} O\!\left((1-\lambda)^{-(k-1)}\, m_k\right) \;\le\; O\!\left(\frac{d}{\epsilon}\left(d + \log\frac{\log(1/\epsilon)}{\delta}\right)\log\frac{1}{\epsilon}\right).$

4 Average Does Not Work

Our algorithm described in the previous section uses convex loss minimization (in our case, hinge loss) in the band as an efficient proxy for minimizing the 0/1 loss. The Average algorithm introduced by [30] is another computationally efficient algorithm that has provable noise tolerance guarantees under certain noise models and distributions. For example, it achieves arbitrarily small excess error in the presence of random classification noise and monotonic noise when the distribution is uniform over the unit sphere. Furthermore, even in the presence of a small amount of malicious noise and less symmetric distributions, Average has been used to obtain a weak learner, which can then be boosted to achieve a non-trivial noise tolerance [27]. It is therefore natural to ask whether the noise tolerance that Average exhibits could be extended to the case of Massart noise under the uniform distribution. We answer this question in the negative. We show that the lack of symmetry in Massart noise presents a significant barrier for the one-shot application of Average, even when the marginal distribution is completely symmetric. Additionally, we also discuss obstacles to incorporating Average as a weak learner within the margin-based technique.

In a nutshell, Average takes $m$ sample points and their respective labels, $W = \{(x_1, y_1), \dots, (x_m, y_m)\}$, and returns $\frac{1}{m}\sum_{i=1}^m x_i y_i$. Our main result in this section shows that for a wide range of distributions that are very symmetric in nature, including the Gaussian and the uniform distribution, there is an instance of Massart noise under which Average cannot achieve an arbitrarily small excess error.

Theorem 2. For any continuous distribution $D$ with a pdf that is a function of the distance from the origin only, there is a noisy distribution $\tilde D$ over $X \times \{-1, 1\}$ that satisfies the Massart noise condition in Equation (1) for some parameter $\beta > 0$, such that Average returns a classifier with excess error $\Omega\!\left(\frac{\beta(1-\beta)}{1+\beta}\right)$.

Proof. Let $w^* = (1, 0, \dots, 0)$ be the target halfspace. Let the noise distribution be such that for all $x$, if $x_1 x_2 < 0$ then we flip the label of $x$ with probability $\frac{1-\beta}{2}$; otherwise we keep the label. Clearly, this satisfies Massart noise with parameter $\beta$. Let $w$ be the expected vector returned by Average. We first show that $w$ is far from $w^*$ in angle; then, using Equation (2), we show that $w$ has large excess error. First we examine the expected component of $w$ that is parallel to $w^*$, i.e., $w\cdot w^* = w_1$. For ease of exposition, we divide our analysis into two cases: one for the regions with no noise (the first and third quadrants) and one for the regions with noise (the second and fourth quadrants).

Let $E$ be the event that $x_1 x_2 > 0$. By symmetry, it is easy to see that $\Pr[E] = 1/2$. Then

$\mathbb{E}[w\cdot w^*] = \Pr(E)\,\mathbb{E}[w\cdot w^* \mid E] + \Pr(\bar E)\,\mathbb{E}[w\cdot w^* \mid \bar E].$

For the first term, for $x \in E$ the label has not changed. So, $\mathbb{E}[w\cdot w^* \mid E] = \mathbb{E}[\,|x_1| \mid E\,] = \int z f(z)\, dz$. For the second term, the label of each point stays the same with probability $\frac{1+\beta}{2}$ and is flipped with probability $\frac{1-\beta}{2}$. Hence, $\mathbb{E}[w\cdot w^* \mid \bar E] = \beta\,\mathbb{E}[\,|x_1| \mid \bar E\,] = \beta \int z f(z)\, dz$. Therefore, the expected parallel component of $w$ is $\mathbb{E}[w\cdot w^*] = \frac{1+\beta}{2}\int z f(z)\, dz$.

Next, we examine $w_\perp$, the orthogonal component of $w$ on the second coordinate. Similarly to the previous case, for the clean regions $\mathbb{E}[w_\perp \mid E] = \mathbb{E}[\,|x_2| \mid E\,] = \int z f(z)\, dz$. Next, for the second and fourth quadrants, which are noisy, we have (using that $x_2\,\operatorname{sign}(x_1) = -|x_2|$ on $\bar E$, and that the two noisy quadrants contribute equally by symmetry)

$\mathbb{E}_{(x,y)\sim\tilde D}[\,x_2\, y \mid x_1 x_2 < 0\,] = \frac{1+\beta}{2}\left(-\int z f(z)\, dz\right) + \frac{1-\beta}{2}\int z f(z)\, dz = -\beta\int z f(z)\, dz.$

So,

$w_\perp = \frac{1}{2}\int z f(z)\, dz + \frac{1}{2}\left(-\beta\int z f(z)\, dz\right) = \frac{1-\beta}{2}\int z f(z)\, dz.$

Therefore $\theta(w, w^*) = \arctan\left(\frac{1-\beta}{1+\beta}\right) \ge \frac{1-\beta}{2(1+\beta)}$. By Equation (2), we have $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \ge \beta\,\frac{\theta(w,w^*)}{\pi} \ge \frac{\beta(1-\beta)}{2\pi(1+\beta)}$.

Our margin-based analysis from Section 3 relies on using hinge-loss minimization in the band at every round to efficiently find a halfspace $w_k$ that is a weak learner for $D_k$, i.e., such that $\operatorname{err}_{D_k}(w_k)$ is at most a small constant, as demonstrated in Lemma 3. Motivated by this more lenient goal of finding a weak learner, one might ask whether Average, as an efficient algorithm for finding low-error halfspaces, can be incorporated into the margin-based technique in the same way as hinge loss minimization. We argue that the margin-based technique is inherently incompatible with Average. The margin-based technique maintains two key properties at every step: first, the angle between $w_k$ and $w_{k-1}$ and the angle between $w_{k-1}$ and $w^*$ are small, and as a result $\theta(w^*, w_k)$ is small; second, $w_k$ is a weak learner, with $\operatorname{err}_{D_{k-1}}(w_k)$ at most a small constant. In our work, hinge loss minimization in the band guarantees both of these properties simultaneously, by limiting its search to the halfspaces that are close in angle to $w_{k-1}$ and limiting its distribution to $\tilde D_{w_{k-1},b_{k-1}}$. However, in the case of Average, as we concentrate on the band $\tilde D_{w_{k-1},b_{k-1}}$ we bias the distribution towards its orthogonal component with respect to $w_{k-1}$. Hence, an upper bound on $\theta(w^*, w_{k-1})$ only serves to ensure that most of the data is orthogonal to $w^*$ as well. Therefore, informally speaking, we lose the signal that could otherwise direct us in the direction of $w^*$. More formally, consider the construction from Theorem 2 with $w_{k-1} = w^* = (1, 0, \dots, 0)$. In the distribution $\tilde D_{w_{k-1},b_{k-1}}$, the component of $w_k$ that is parallel to $w_{k-1}$ scales down with the width of the band, $b_{k-1}$. However, as most of the probability stays in a band passing through the origin for any log-concave (including Gaussian and uniform) distribution, the orthogonal component of $w_k$ remains almost unchanged. Therefore,

$\theta(w_k, w^*) = \theta(w_k, w_{k-1}) \;\ge\; \Omega\!\left(\arctan\frac{1-\beta}{b_{k-1}(1+\beta)}\right) \;=\; \Omega\!\left(\arctan\frac{1-\beta}{(1+\beta)\,\alpha_{k-1}}\right).$
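A small simulation of the construction in the proof of Theorem 2 makes the bias of Average visible: with labels flipped with probability $\frac{1-\beta}{2}$ exactly on the second and fourth quadrants, the averaged vector tilts away from $w^*$ by roughly $\arctan\frac{1-\beta}{1+\beta}$. The sketch below uses the uniform distribution on the two-dimensional unit disk (any radially symmetric density works, per the theorem); the function names and the specific value of $\beta$ are ours, chosen only for illustration.

```python
import numpy as np

def average_learner(X, y):
    """The Average algorithm: return the mean of y_i * x_i."""
    return (y[:, None] * X).mean(axis=0)

def sample_construction(n, beta, rng):
    """Sample the Theorem 2 construction: x uniform on the unit disk,
    clean label sign(x_1); labels are flipped with probability (1-beta)/2
    exactly when x_1 * x_2 < 0 (the second and fourth quadrants)."""
    g = rng.normal(size=(n, 2))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    x = g * np.sqrt(rng.random(n))[:, None]          # uniform on the 2-D unit disk
    y = np.sign(x[:, 0])
    noisy = x[:, 0] * x[:, 1] < 0
    flip = noisy & (rng.random(n) < (1 - beta) / 2)
    y[flip] *= -1
    return x, y

rng = np.random.default_rng(0)
beta = 0.5
X, y = sample_construction(200_000, beta, rng)
w = average_learner(X, y)
angle = np.arccos(w[0] / np.linalg.norm(w))
print(f"angle(w, w*) ~ {angle:.3f} rad;  arctan((1-beta)/(1+beta)) = "
      f"{np.arctan((1 - beta) / (1 + beta)):.3f} rad")
```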

5 Hinge Loss Minimization Does Not Work

Hinge loss minimization is a widely used technique in machine learning. In this section, we show that, perhaps surprisingly, hinge loss minimization does not lead to arbitrarily small excess error even under a very small amount of noise; that is, it is not consistent. (Note that in our setting of Massart noise, consistency is the same as achieving arbitrarily small excess error, since the Bayes optimal classifier is a member of the class of halfspaces.) It has been shown earlier that hinge loss minimization can lead to classifiers of large 0/1-loss [6]. However, the lower bounds in that paper employ distributions with significant mass on discrete points with flipped labels (which is not possible under Massart noise) at a very large distance from the optimal classifier. Thus, that result makes strong use of the hinge loss's sensitivity to errors at large distance. Here, we show that hinge loss minimization is bound to fail under much more benign conditions.

More concretely, we show that for every parameter $\tau$ and arbitrarily small bound on the probability of flipping a label, $\eta = \frac{1-\beta}{2}$, hinge loss minimization is not consistent, even on distributions with a uniform marginal over the unit ball in $\mathbb{R}^d$, with the Bayes optimal classifier being a halfspace and the noise satisfying the Massart noise condition with bound $\eta$. That is, there exists a constant $\epsilon_0$ and a sample size $m(\epsilon_0)$ such that hinge loss minimization returns a classifier of excess error at least $\epsilon_0$ with high probability over samples of size at least $m(\epsilon_0)$.

Hinge loss minimization does approximate the optimal hinge loss. We show that this does not translate into an agnostic learning guarantee for halfspaces with respect to the 0/1-loss, even under very small noise conditions. Let $P_\beta$ be the class of distributions $\tilde D$ with uniform marginal over the unit ball $B_1 \subseteq \mathbb{R}^d$, the Bayes classifier being a halfspace $w^*$, and satisfying the Massart noise condition with parameter $\beta$. Our lower bound for hinge loss minimization is stated as follows.

Theorem 3. For every hinge-loss parameter $\tau$ and every Massart noise parameter $\beta < 1$, there exists a distribution $D_{\tau,\beta} \in P_\beta$ (that is, a distribution over $B_1 \times \{-1, 1\}$ with uniform marginal over $B_1 \subseteq \mathbb{R}^d$, satisfying the $\beta$-Massart condition) such that $\tau$-hinge loss minimization is not consistent on $D_{\tau,\beta}$ with respect to the class of halfspaces. That is, there exists an $\epsilon_0$ and a sample size $m(\epsilon_0)$ such that hinge loss minimization will output a classifier of excess error larger than $\epsilon_0$ (with high probability) over samples of size at least $m(\epsilon_0)$.

Proof idea. To prove the above result, we define a subclass $P_{\alpha,\eta} \subseteq P_\beta$ consisting of well structured distributions. We then show that for every hinge parameter $\tau$ and every bound on the noise $\eta$, there is a distribution $D \in P_{\alpha,\eta}$ on which $\tau$-hinge loss minimization is not consistent.

In the remainder of this section, we use the notation $h_w$ for the classifier associated with a vector $w \in B_1$, that is, $h_w(x) = \operatorname{sign}(w\cdot x)$, since for our geometric construction it is convenient to differentiate between the two. We define a family $P_{\alpha,\eta} \subseteq P_\beta$ of distributions $D_{\alpha,\eta}$, indexed by an angle $\alpha$ and a noise parameter $\eta$, as follows. Let the Bayes optimal classifier be linear, $h^* = h_{w^*}$ for a unit vector $w^*$. Let $h_{w'}$ be the classifier defined by the unit vector $w'$ at angle $\alpha$ from $w^*$. We partition the unit ball into areas A, B and D as in Figure 1. That is, A consists of the two wedges of disagreement between $h_{w^*}$ and $h_{w'}$, and the wedge where the two classifiers agree is divided into B (points that are closer to $h_{w'}$ than to $h_{w^*}$) and D (points that are closer to $h_{w^*}$ than to $h_{w'}$).

Figure 1: The family $P_{\alpha,\eta}$: the unit ball partitioned into areas A, B and D by $h_{w^*}$ and $h_{w'}$.

We now flip the labels of all points in A and B with probability $\eta = \frac{1-\beta}{2}$ and leave the labels deterministic, according to $h_{w^*}$, in the area D. More formally, points at angle between $\alpha/2$ and $\pi/2$ from $w^*$, and points at angle between $\pi + \alpha/2$ and $3\pi/2$ from $w^*$, are labeled $h_{w^*}(x)$ with conditional label probability 1. All other points are labeled $-h_{w^*}(x)$ with probability $\eta$ and $h_{w^*}(x)$ with probability $(1-\eta)$.
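The construction $D_{\alpha,\eta}$ is easy to probe numerically. The sketch below samples from it in two dimensions (matching the proof in Appendix D) and compares the empirical $\tau$-hinge loss of $w^*$ against that of the rotated vector $w'$ and against a scan over unit-vector directions; Theorem 3 guarantees that for suitable $\alpha$ (given $\tau$ and $\eta$) the minimizer is bounded away from the direction of $w^*$. The parameter values and helper names are ours, chosen for illustration only, and the scan over unit vectors is a simplification (hinge loss minimization in the theorem is over all of $B_1$).

```python
import numpy as np

def sample_D_alpha_eta(n, alpha, eta, rng):
    """Sample the Section 5 construction D_{alpha,eta} in 2 dimensions.

    x is uniform on the unit disk; the Bayes classifier is h_{w*} with
    w* = (1, 0), and w' is w* rotated by alpha.  Labels follow sign(w*.x)
    and are flipped with probability eta on the disagreement wedges A and on
    B (agreement points closer to the hyperplane of w'); they are
    deterministic on D (agreement points closer to the hyperplane of w*)."""
    w_star = np.array([1.0, 0.0])
    w_prime = np.array([np.cos(alpha), np.sin(alpha)])
    g = rng.normal(size=(n, 2))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    x = g * np.sqrt(rng.random(n))[:, None]
    m_star, m_prime = x @ w_star, x @ w_prime
    y = np.sign(m_star)
    agree = np.sign(m_star) == np.sign(m_prime)
    noisy_region = ~agree | (agree & (np.abs(m_prime) < np.abs(m_star)))  # A or B
    flip = noisy_region & (rng.random(n) < eta)
    y[flip] *= -1
    return x, y, w_star, w_prime

def tau_hinge(w, X, y, tau):
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau).mean()

rng = np.random.default_rng(1)
alpha, eta, tau = 0.6, 0.4, 0.5          # illustrative values, not from the paper
X, y, w_star, w_prime = sample_D_alpha_eta(200_000, alpha, eta, rng)
print("tau-hinge of w*:", tau_hinge(w_star, X, y, tau))
print("tau-hinge of w':", tau_hinge(w_prime, X, y, tau))
# scan unit vectors to locate the empirical tau-hinge minimizer's angle
thetas = np.linspace(-np.pi / 2, np.pi / 2, 181)
losses = [tau_hinge(np.array([np.cos(t), np.sin(t)]), X, y, tau) for t in thetas]
print("empirical minimizer at angle", thetas[int(np.argmin(losses))])
```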

Clearly, this distribution satisfies the Massart noise condition in Equation (1) with parameter $\beta$. The goal of the above construction is to design distributions where vectors along the direction of $w'$ have smaller hinge loss than those along the direction of $w^*$. Observe that the noise in area A will tend to even out the difference in hinge loss between $w^*$ and $w'$ (since area A is symmetric with respect to these two directions). The noise in area B, however, will help $w'$: since all points in area B are closer to the hyperplane defined by $w'$ than to the one defined by $w^*$, the vector $w^*$ will pay more in hinge loss for the noise in this area. In the corresponding area D of points that are closer to the hyperplane defined by $w^*$ than to the one defined by $w'$ we do not add noise, so the cost for both $w^*$ and $w'$ in this area is small. We show that for every $\alpha$, from a certain noise level $\eta$ on, $w^*$ (or any other vector in its direction) is not the expected hinge minimizer on $D_{\alpha,\eta}$. We then argue that thereby hinge loss minimization will not approximate $w^*$ arbitrarily closely in angle, and can therefore not achieve arbitrarily small excess 0/1-error. Overall, we show that for every (arbitrarily small) bound on the noise $\eta$ and hinge parameter $\tau$, we can choose an angle $\alpha$ such that $\tau$-hinge loss minimization is not consistent for the distribution $D_{\alpha,\eta}$. The details of the proof can be found in the Appendix, Section D.

6 Conclusions

Our work is the first to provide a computationally efficient algorithm under the Massart noise model, a distributional assumption that has been identified in statistical learning to yield fast (statistical) rates of convergence. While both computational and statistical efficiency are crucial in machine learning applications, computational and statistical complexity have been studied under disparate sets of assumptions and models. We view our results on the computational complexity of learning under Massart noise also as a step towards bringing these two lines of research closer together. We hope that this will spur more work identifying situations that lead to both computational and statistical efficiency, to ultimately shed light on the underlying connections and dependencies of these two important aspects of automated learning.

Acknowledgments. This work was supported in part by NSF grants CCF-95319, CCF-, and CCF-1491, a Sloan Research Fellowship, a Microsoft Research Faculty Fellowship, and a Google Research Award.

References

[1] Sanjeev Arora, László Babai, Jacques Stern, and Z. Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Proceedings of the 34th IEEE Annual Symposium on Foundations of Computer Science (FOCS), 1993.

[2] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), 2014.

[3] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

[4] Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), 2007.

[5] Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2013.

[6] Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[7] Avrim Blum, Alan Frieze, Ravi Kannan, and Santosh Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1-2):35–52, 1998.

[8] Karl-Heinz Borgwardt. The Simplex Method, volume 1 of Algorithms and Combinatorics: Study and Research Texts. Springer-Verlag, Berlin, 1987.

[9] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[10] Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), 2007.

[11] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[12] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), 2014.

[13] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems (NIPS), 2005.

[14] Sanjoy Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.

[15] Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems (NIPS), 2007.

[16] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13, 2012.

[17] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3), 1997.

[18] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.

[19] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

[20] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[21] Steve Hanneke and Liu Yang. Surrogate losses in passive and active learning. CoRR, abs/1207.3772, 2014.

[22] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6), 2008.

[23] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), 2008.

[24] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors (extended abstract). In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (STOC), 1988.

[25] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. In Proceedings of the 5th Annual Conference on Computational Learning Theory (COLT), 1992.

[26] Adam R. Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), 2014.

[27] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.

[28] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[29] Ronald L. Rivest and Robert H. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.

[30] Rocco A. Servedio. Efficient Algorithms in Computational Learning Theory. Harvard University, 2001.

[31] Robert H. Sloan. PAC learning, noise, and geometry. In Learning and Geometry: Computational Approaches, pages 1–41. Springer, 1996.

A Probability Lemmas for the Uniform Distribution

The following probability lemmas are used throughout this work. Variations of these lemmas are presented in previous work in terms of their asymptotic behavior [2, 4, 20]. Here, we focus on finding bounds that are tight even where the constants are concerned. Indeed, the improved constants in these bounds are essential to tolerating a constant amount of Massart noise.

Throughout this section, let $D$ be the uniform distribution over a $d$-dimensional ball. Let $f(\cdot)$ indicate the pdf of $D$. For any $d$, let $V_d$ be the volume of a $d$-dimensional unit ball. Ratios between volumes of the unit ball in different dimensions are commonly used to find the probability mass of different regions under the uniform distribution. Note that for any $d$, $V_d = \frac{2\pi}{d}\, V_{d-2}$. The following bound, due to [8], proves useful in our analysis:

$\sqrt{\frac{d}{2\pi}} \;\le\; \frac{V_{d-1}}{V_d} \;\le\; \sqrt{\frac{d+1}{2\pi}}.$

The next lemma provides upper and lower bounds on the probability mass of a band under the uniform distribution.

15 Lemma 4 Let u be any unit vector in R For all a, b [ C, C ], such that C < /, we have b a C V 1 V Pr x D [u x [a, b]] b a V 1 V Proof We have Pr [u x [a, b]] = V 1 x D V b a (1 z ( 1/ z For the upper boun, we note that the integrant is at most 1, so Pr x D [u x [a, b]] V 1 V b a For the lower boun, note that since a, b [ C, C ], the integrant is at least (1 C ( 1/ We know that for any x [, 5], 1 x > 4 x So, assuming that > C, (1 C ( 1/ 4 C ( 1/ C Pr x D [u x [a, b]] b a C V 1 V Lemma 5 Let u an v be two unit vectors in R an let α = θ(u, v Then, c α Pr [sign(u x sign(w x an u x > ] α c ( x D π e Proof Without the loss of generality, we can assume u = (1,,, an w = (cos(α, sin(α,,, Consier the projection of D on the first coorinates Let E be the event we are intereste in We first show that for any x = (x 1, x E, x > c/ Consier x 1 (the other case is symmetric If x E, it must be that x sin(α cα So, x = c α c sin(α c Next, we consier a circle of raius < r < 1 aroun the center, inicate by S(r Let A(r = S(r E be the arc of such circle that is in E Then the length of such arc is the arc-length that falls in the isagreement region, ie, rα, minus the arc-length that falls in the ban of with cα Note, that for every x A(r, x = r, so f(x = V V (1 x ( / = V V (1 r ( / Pr [sign(u x sign(w x an u x > α ] = (rα cα f(r r x D c = /c 1 = V c α = c α π c α π α π α π V 1 /c 1 1 /c ( rc α cα f( cr c r (change of variable z = r /c /c (r 1(1 c r (r 1e r ( (r 1 ( c r r ( / r ( 1( ( c r e ( c r ( 1( ( c r e ( c r 1 [ ] e ( r r= /c r=1 r r 15

16 α π (e c ( e ( / α c ( π e B Proofs of Margin-base Lemmas Proof of Lemma 1 Let L(w = E (x,y Dk l(w, x, y, = k, an b = b k 1 First note that for our choice of b , using Lemma 4 we have that Pr [ w k 1 x < b] b 8539 x D Note that L(w is maximize when w = w k 1 Then L(w (1 a f(a a Pr x D [ w k 1 x < b] (1 a (1 a ( 1/ a b 8539 For the numerator: (1 a (1 a ( 1/ a (1 a e a ( 1/ a 1 e a ( 1/ a 1 ae a ( 1/ a ( π 1 ( 1 erf 1 ( 1 (1 e ( 1 / π ( 1 e ( 1 ( 1 1 ( 1 1 ( 1 π ( 1 8 ( ( 1 1 (( (By Taylor expansion 5463 (By 1 8 ( 1 < 1 4 Where the last inequality follows from the fact that for our choice of parameters 3, so 1 8 ( 1 < 1 5 Therefore, 536( /4 b < L(w b b Proof of Lemma 3 Note that the convex loss minimization proceure returns a vector v k that is not necessarily normalize To consier all vectors in B(w k 1, α k, at step k, the optimization is one over all 16

17 vectors v (of any length such that w k 1 v < α k For all k, α k < 3879π (or 1168, so v k , an as a result l(w k, W l(v k, W We have, err Dk (w k E (x,y Dk l(w k, x, y ( E (x,y l(w Dk k, x, y β b k 1 k (By Lemma l(w k, W β b k 1 k (By Equation l(v k, W β b k 1 k (By v k l(w, W β b k 1 k (By v k minimizing the hinge-loss E (x,y l(w, x, y β b k (By Equation 3 Dk ( k E (x,y Dk l(w, x, y β b k (By Lemma k k b k β b k 1 k (By Lemma 1 Lemma 6 For any constant c, there is m k O(( + log(k/ such that for a ranomly rawn set W of m k labele samples from D k, with probability 1 δ, for any w B(w k+k k 1, α k, E (x,y Dk (l(w, x, y l(w, W c, E (x,y Dk (l(w, x, y l(w, cleane(w c Proof By Lemma H3 of [], l(w, x, y = O( for all (x, y S wk 1,b k 1 an θ(w, w k 1 r k We get the result by applying Lemma H of [] C Initialization We initialize our margin base proceure with the algorithm from [7] The guarantees mentione in [7] ɛ hol as long as the noise rate is η c log 1/ɛ [7] o not explicitly compute the constant but it is easy to check that c 1 56 This can be compute from inequality 17 in the proof of Lemma 16 in [7] We nee the lhs to be at least ɛ / On the rhs, the first term is lower boune by ɛ /51 Hence, we nee the secon term to be at most ɛ The secon term is upper boune by 4c ɛ This implies that c 1/56 D Hinge Loss Minimization In this section, we show that hinge loss minimization is not consistent in our setup, that is, that it oes not lea to arbitrarily small excess error We let B 1 enote the unit ball in R In this section, we will only work with =, thus we set B 1 = B 1 17

18 Recall that the -hinge loss of a vector w R on an example (x, y R { 1, 1} is efine as follows: { } y(w x l (w, x, y = max, 1 For a istribution D over R { 1, 1}, we let L D enote the expecte hinge loss over D, that is L D (w = E (x,y Dl (w, x, y If clear from context, we omit the superscript an write L (w for L D (w Let A be the algorithm that minimizes the empirical -hinge loss over a sample That is, for W = {(x 1, y 1,, (x m, y m }, we have A (W argmin w B1 1 W (x,y W l (w, x, y Hinge loss minimization over halfspaces converges to the optimal hinge loss over all halfspace (it is hinge loss consistent That is, for all ɛ > there is a sample size m(ɛ such that for all istributions D, we have E W Dm[L D (A (W ] min w B 1 L D (w + ɛ In this section, we show that this oes not translate into an agnostic learning guarantee for halfspaces with respect to the /1-loss Moreover, hinge loss minimization is not even consistent with respect to the /1-loss even when restricte to a rather benign classes of istributions P Let P β be the class of istributions D with uniform marginal over the unit ball in R, the Bayes classifier being a halfspace w, an satisfying the Massart noise conition with parameter β We show that there is a istribution D P β an an ɛ an a sample size m such that hinge loss minimization will output a classifier of excess error larger than ɛ on expectation over samples of size larger than m More precisely, for all m m : E W Dm[L D (A (W ] > min w B 1 err D(w + ɛ Formally, our lower boun for hinge loss minimization is state as follows Theorem 3 (Restate For every hinge-loss parameter an every Massart noise parameter β < 1, there exists a istribution D,β P β (that is, a istribution over B 1 { 1, 1} with uniform marginal over B 1 R satisfying the β-massart conition such that -hinge loss minimization is not consistent on P,β with respect to the class of halfspaces That is, there exists an ɛ an a sample size m(ɛ such that hinge loss minimization will output a classifier of excess error larger than ɛ (with high probability over samples of size at least m(ɛ In the section, we use the notation h w for the classifier associate with a vector w B 1, that is h w (x = sign(w x, since for our geometric construction it is convenient to ifferentiate between the two The rest of this section is evote to proving the above theorem A class of istributions 18


More information

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

PETER L. BARTLETT AND MARTEN H. WEGKAMP

PETER L. BARTLETT AND MARTEN H. WEGKAMP CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS PETER L. BARTLETT AND MARTEN H. WEGKAMP Abstract. We consier the problem of binary classification where the classifier can, for a particular cost,

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

Asymptotic Active Learning

Asymptotic Active Learning Asymptotic Active Learning Maria-Florina Balcan Eyal Even-Dar Steve Hanneke Michael Kearns Yishay Mansour Jennifer Wortman Abstract We describe and analyze a PAC-asymptotic model for active learning. We

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

26.1 Metropolis method

26.1 Metropolis method CS880: Approximations Algorithms Scribe: Dave Anrzejewski Lecturer: Shuchi Chawla Topic: Metropolis metho, volume estimation Date: 4/26/07 The previous lecture iscusse they some of the key concepts of

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

From Batch to Transductive Online Learning

From Batch to Transductive Online Learning From Batch to Transductive Online Learning Sham Kakade Toyota Technological Institute Chicago, IL 60637 sham@tti-c.org Adam Tauman Kalai Toyota Technological Institute Chicago, IL 60637 kalai@tti-c.org

More information

Sharp Thresholds. Zachary Hamaker. March 15, 2010

Sharp Thresholds. Zachary Hamaker. March 15, 2010 Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Learning Mixtures of Gaussians with Maximum-a-posteriori Oracle

Learning Mixtures of Gaussians with Maximum-a-posteriori Oracle Satyaki Mahalanabis Dept of Computer Science University of Rochester smahalan@csrochestereu Abstract We consier the problem of estimating the parameters of a mixture of istributions, where each component

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation Binary Discrimination Methos for High Dimensional Data with a Geometric Representation Ay Bolivar-Cime, Luis Miguel Corova-Roriguez Universia Juárez Autónoma e Tabasco, División Acaémica e Ciencias Básicas

More information

Agnostic Learning of Disjunctions on Symmetric Distributions

Agnostic Learning of Disjunctions on Symmetric Distributions Agnostic Learning of Disjunctions on Symmetric Distributions Vitaly Feldman vitaly@post.harvard.edu Pravesh Kothari kothari@cs.utexas.edu May 26, 2014 Abstract We consider the problem of approximating

More information

12.11 Laplace s Equation in Cylindrical and

12.11 Laplace s Equation in Cylindrical and SEC. 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential 593 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential One of the most important PDEs in physics an engineering

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

On the Value of Partial Information for Learning from Examples

On the Value of Partial Information for Learning from Examples JOURNAL OF COMPLEXITY 13, 509 544 (1998) ARTICLE NO. CM970459 On the Value of Partial Information for Learning from Examples Joel Ratsaby* Department of Electrical Engineering, Technion, Haifa, 32000 Israel

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Discriminative Learning can Succeed where Generative Learning Fails

Discriminative Learning can Succeed where Generative Learning Fails Discriminative Learning can Succeed where Generative Learning Fails Philip M. Long, a Rocco A. Servedio, b,,1 Hans Ulrich Simon c a Google, Mountain View, CA, USA b Columbia University, New York, New York,

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:

More information

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning

More information

Estimating Causal Direction and Confounding Of Two Discrete Variables

Estimating Causal Direction and Confounding Of Two Discrete Variables Estimating Causal Direction an Confouning Of Two Discrete Variables This inspire further work on the so calle aitive noise moels. Hoyer et al. (2009) extene Shimizu s ientifiaarxiv:1611.01504v1 [stat.ml]

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Upper and Lower Bounds on ε-approximate Degree of AND n and OR n Using Chebyshev Polynomials

Upper and Lower Bounds on ε-approximate Degree of AND n and OR n Using Chebyshev Polynomials Upper an Lower Bouns on ε-approximate Degree of AND n an OR n Using Chebyshev Polynomials Mrinalkanti Ghosh, Rachit Nimavat December 11, 016 1 Introuction The notion of approximate egree was first introuce

More information

inflow outflow Part I. Regular tasks for MAE598/494 Task 1

inflow outflow Part I. Regular tasks for MAE598/494 Task 1 MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization JMLR: Workshop an Conference Proceeings vol 30 013) 1 On the Complexity of Banit an Derivative-Free Stochastic Convex Optimization Oha Shamir Microsoft Research an the Weizmann Institute of Science oha.shamir@weizmann.ac.il

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Learning convex bodies is hard

Learning convex bodies is hard Learning convex bodies is hard Navin Goyal Microsoft Research India navingo@microsoft.com Luis Rademacher Georgia Tech lrademac@cc.gatech.edu Abstract We show that learning a convex body in R d, given

More information

Margin Based Active Learning

Margin Based Active Learning Margin Based Active Learning Maria-Florina Balcan 1, Andrei Broder 2, and Tong Zhang 3 1 Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. ninamf@cs.cmu.edu 2 Yahoo! Research, Sunnyvale,

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Classification Methods with Reject Option Based on Convex Risk Minimization

Classification Methods with Reject Option Based on Convex Risk Minimization Journal of Machine Learning Research 11 (010) 111-130 Submitte /09; Revise 11/09; Publishe 1/10 Classification Methos with Reject Option Base on Convex Risk Minimization Ming Yuan H. Milton Stewart School

More information

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016 Noname manuscript No. (will be inserte by the eitor) Scaling properties of the number of ranom sequential asorption iterations neee to generate saturate ranom packing arxiv:607.06668v2 [con-mat.stat-mech]

More information

A Unified Theorem on SDP Rank Reduction

A Unified Theorem on SDP Rank Reduction A Unifie heorem on SDP Ran Reuction Anthony Man Cho So, Yinyu Ye, Jiawei Zhang November 9, 006 Abstract We consier the problem of fining a low ran approximate solution to a system of linear equations in

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

arxiv: v2 [cs.cc] 13 Mar 2016

arxiv: v2 [cs.cc] 13 Mar 2016 Complexity Theoretic Limitations on Learning Halfspaces Amit Daniely March 15, 2016 arxiv:1505.05800v2 [cs.cc] 13 Mar 2016 Abstract We stuy the problem of agnostically learning halfspaces which is efine

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Efficient Semi-supervised and Active Learning of Disjunctions

Efficient Semi-supervised and Active Learning of Disjunctions Maria-Florina Balcan ninamf@cc.gatech.edu Christopher Berlind cberlind@gatech.edu Steven Ehrlich sehrlich@cc.gatech.edu Yingyu Liang yliang39@gatech.edu School of Computer Science, College of Computing,

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Physics 505 Electricity and Magnetism Fall 2003 Prof. G. Raithel. Problem Set 3. 2 (x x ) 2 + (y y ) 2 + (z + z ) 2

Physics 505 Electricity and Magnetism Fall 2003 Prof. G. Raithel. Problem Set 3. 2 (x x ) 2 + (y y ) 2 + (z + z ) 2 Physics 505 Electricity an Magnetism Fall 003 Prof. G. Raithel Problem Set 3 Problem.7 5 Points a): Green s function: Using cartesian coorinates x = (x, y, z), it is G(x, x ) = 1 (x x ) + (y y ) + (z z

More information

Structural Risk Minimization over Data-Dependent Hierarchies

Structural Risk Minimization over Data-Dependent Hierarchies Structural Risk Minimization over Data-Depenent Hierarchies John Shawe-Taylor Department of Computer Science Royal Holloway an Befor New College University of Lonon Egham, TW20 0EX, UK jst@cs.rhbnc.ac.uk

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information