Efficient Learning of Linear Separators under Bounded Noise


Efficient Learning of Linear Separators under Bounded Noise

Pranjal Awasthi    Maria-Florina Balcan    Nika Haghtalab    Ruth Urner

March 1, 2015

Abstract

We study the learnability of linear separators in $\mathbb{R}^d$ in the presence of bounded (a.k.a. Massart) noise. This is a realistic generalization of the random classification noise model, where the adversary can flip each example $x$ with probability $\eta(x) \le \eta$. We provide the first polynomial time algorithm that can learn linear separators to arbitrarily small excess error in this noise model under the uniform distribution over the unit sphere in $\mathbb{R}^d$, for some constant value of $\eta$. While widely studied in the statistical learning theory community in the context of getting faster convergence rates, computationally efficient algorithms in this model had remained elusive. Our work provides the first evidence that one can indeed design algorithms achieving arbitrarily small excess error in polynomial time under this realistic noise model and thus opens up a new and exciting line of research. We additionally provide lower bounds showing that popular algorithms such as hinge loss minimization and averaging cannot lead to arbitrarily small excess error under Massart noise, even under the uniform distribution. Our work instead makes use of a margin based technique developed in the context of active learning. As a result, our algorithm is also an active learning algorithm whose label complexity depends only logarithmically on the desired excess error $\epsilon$.

1 Introduction

Overview. Linear separators are the most popular classifiers studied in both the theory and practice of machine learning. Designing noise tolerant, polynomial time learning algorithms that achieve arbitrarily small excess error rates for linear separators is a long-standing question in learning theory. In the absence of noise (when the data is realizable) such algorithms exist via linear programming [11]. However, the problem becomes significantly harder in the presence of label noise. In particular, in this work we are concerned with designing algorithms that can achieve error $\mathrm{OPT}+\epsilon$, which is arbitrarily close to $\mathrm{OPT}$, the error of the best linear separator, and run in time polynomial in $\frac{1}{\epsilon}$ and $d$ (as usual, we call $\epsilon$ the excess error). Such strong guarantees are only known for the well studied random classification noise model [7]. In this work, we provide the first algorithm that can achieve arbitrarily small excess error, in truly polynomial time, for bounded noise, also called Massart noise [28], a much more realistic and widely studied noise model in statistical learning theory [9]. We additionally show strong lower bounds under the same noise model for two other computationally efficient learning algorithms (hinge loss minimization and the averaging algorithm), which could be of independent interest.

Motivation. The work on computationally efficient algorithms for learning halfspaces has focused on two different extremes. On one hand, for the very stylized random classification noise model (RCN), where each example $x$ is flipped independently with equal probability $\eta$, several works have provided computationally efficient algorithms that can achieve arbitrarily small excess error in polynomial time [7, 30, 5]; note that all these results crucially exploit the high amount of symmetry present in RCN.

At the other extreme, there has been significant work on much more difficult and adversarial noise models, including the agnostic model [25] and malicious noise models [24]. The best results here, however, not only require additional distributional assumptions about the marginal over the instance space, but they also only achieve much weaker multiplicative approximation guarantees [23, 27, 2]; for example, the best result of this form for the case of the uniform distribution over the unit sphere $S^{d-1}$ achieves excess error $c\cdot\mathrm{OPT}$ [2], for some large constant $c$. While interesting from a technical point of view, guarantees of this form are somewhat troubling from a statistical point of view, as they are inconsistent, in the sense that there is a barrier $O(\mathrm{OPT})$ after which we cannot prove that the excess error further decreases as we get more and more samples. In fact, recent evidence shows that this is unavoidable for polynomial time algorithms in such adversarial noise models [12].

Our Results. In this work we identify a realistic and widely studied noise model in statistical learning theory, the so called Massart noise [9], for which we can prove much stronger guarantees. Massart noise can be thought of as a generalization of the random classification noise model where the label of each example $x$ is flipped independently with probability $\eta(x) < 1/2$. The adversary has control over choosing a different noise rate $\eta(x)$ for every example $x$, with the only constraint that $\eta(x) \le \eta$. From a statistical point of view, it is well known that under this model we can get faster rates compared to worst case joint distributions [9]. In computational learning theory, this noise model was also studied, but under the name of malicious misclassification noise [29, 31]. However, due to its highly asymmetric nature, to date, computationally efficient learning algorithms in this model had remained elusive.

In this work, we provide the first computationally efficient algorithm achieving arbitrarily small excess error for learning linear separators. Formally, we show that there exists a polynomial time algorithm that can learn linear separators to error $\mathrm{OPT}+\epsilon$ and run in time $\mathrm{poly}(d, \frac{1}{\epsilon})$ when the underlying distribution is the uniform distribution over the unit ball in $\mathbb{R}^d$ and the noise rate of each example is upper bounded by a constant $\eta$ (independent of the dimension). As mentioned earlier, a result of this form was previously only known for random classification noise.

From a technical point of view, as opposed to random classification noise, where the error of each classifier scales uniformly under the observed labels, the observed error of classifiers under Massart noise could change drastically in a non-monotonic fashion. This is due to the fact that the adversary has control over choosing a different noise rate $\eta(x) \le \eta$ for every example $x$. As a result, as we show in our work (see Section 4), standard algorithms such as the averaging algorithm [30], which work for random noise, can only achieve a much poorer excess error (as a function of $\eta$) under Massart noise. Technically speaking, this is due to the fact that Massart noise can introduce high correlations between the observed labels and the component orthogonal to the direction of the best classifier.

In the face of these challenges, we take an entirely different approach than previously considered for random classification noise. Specifically, we analyze a recent margin based algorithm of [2]. This algorithm was designed for learning linear separators under agnostic and malicious noise models, and it was shown to achieve an excess error of $c\cdot\mathrm{OPT}$ for a constant $c$. By using new structural insights, we show that there exists a constant $\eta$ (independent of the dimension), so that if we face Massart noise with flipping probability upper bounded by $\eta$, we can use a modification of the algorithm of [2] and achieve arbitrarily small excess error. One way to think about this result is that we define an adaptively chosen sequence of hinge loss minimization problems over smaller and smaller bands around the current guess for the target. We show, by relating the hinge loss and 0/1-loss together with a careful localization analysis, that these will direct us closer and closer to the optimal classifier, allowing us to achieve arbitrarily small excess error rates in polynomial time.

Given that our algorithm is an adaptively chosen sequence of hinge loss minimization problems, one might wonder what guarantee one-shot hinge loss minimization could provide. In Section 5, we show a strong negative result: for every $d$, $\tau$, and $\eta \le 1/2$, there is a noisy distribution $\tilde D$ over $\mathbb{R}^d \times \{-1,1\}$ satisfying Massart noise with parameter $\eta$, and an $\epsilon_0 > 0$, such that $\tau$-hinge loss minimization returns a classifier with excess error $\Omega(\epsilon_0)$. This result could be of independent interest. While there exists earlier work showing that hinge loss minimization can lead to classifiers of large 0/1-loss [6], the lower bounds in that paper employ distributions with significant mass on discrete points with flipped labels (which is not possible under Massart noise) at a very large distance from the optimal classifier. Thus, that result makes strong use of the hinge loss's sensitivity to errors at large distance. Here, we show that hinge loss minimization is bound to fail under much more benign conditions.

One appealing feature of our result is that the algorithm we analyze is in fact naturally adaptable to the active learning or selective sampling scenario (intensively studied in recent years [19, 13, 20]), where the learning algorithm only receives the classifications of examples when it asks for them. We show that, in this model, our algorithm achieves a label complexity whose dependence on the error parameter $\epsilon$ is polylogarithmic (and thus exponentially better than that of any passive algorithm). This provides the first polynomial-time active learning algorithm for learning linear separators under Massart noise. We note that prior to our work only inefficient algorithms could achieve the desired label complexity under Massart noise [4, 20].

Related Work. The agnostic noise model is notoriously hard to deal with computationally, and there is significant evidence that achieving arbitrarily small excess error in polynomial time is hard in this model [1, 18, 12]. For this model, under our distributional assumptions, [23] provides an algorithm that learns linear separators in $\mathbb{R}^d$ to excess error at most $\epsilon$, but whose running time is $\mathrm{poly}(d)\exp(1/\epsilon)$. Recent work shows evidence that this exponential dependence on $1/\epsilon$ is unavoidable in the agnostic case [26]. We side-step this by considering a more structured, yet realistic noise model.

Motivated by the fact that many modern machine learning applications have massive amounts of unannotated or unlabeled data, there has been significant interest in designing active learning algorithms that most efficiently utilize the available data, while minimizing the need for human intervention. Over the past decade there has been substantial progress on understanding the underlying statistical principles of active learning, and several general characterizations have been developed for describing when active learning could have an advantage over the classical passive supervised learning paradigm, both in the noise free settings and in the agnostic case [17, 13, 3, 4, 19, 15, 10, 14, 20]. However, despite many efforts, except for very simple noise models (random classification noise [5] and linear noise [16]), to date there are no known computationally efficient algorithms with provable guarantees in the presence of Massart noise that can achieve arbitrarily small excess error. We note that the work of [21] provides computationally efficient algorithms for both passive and active learning under the assumption that the hinge loss (or other surrogate loss) minimizer aligns with the minimizer of the 0/1-loss. In our work (Section 5), we show that this is not the case under Massart noise, even when the marginal over the instance space is uniform, but we still provide a computationally efficient algorithm for this much more challenging setting.

2 Preliminaries

We consider the binary classification problem; that is, we work on the problem of predicting a binary label $y$ for a given instance $x$. We assume that the data points $(x, y)$ are drawn from an unknown underlying distribution $\tilde D$ over $X \times Y$, where $X = \mathbb{R}^d$ is the instance space and $Y = \{-1, 1\}$ is the label space.

For the purpose of this work, we consider distributions for which the marginal of $\tilde D$ over $X$ is the uniform distribution on the $d$-dimensional unit ball. We work with the class of all homogeneous halfspaces, denoted by $H = \{\operatorname{sign}(w\cdot x) : w \in \mathbb{R}^d\}$. For a given halfspace $w \in H$, we define the error of $w$ with respect to a distribution $D$ by $\operatorname{err}_D(w) = \Pr_{(x,y)\sim D}[\operatorname{sign}(w\cdot x) \ne y]$.

We examine learning halfspaces in the presence of Massart noise. In this setting, we assume that the Bayes optimal classifier is a linear separator $w^*$. Note that $w^*$ can have a non-zero error. Then Massart noise with parameter $\beta > 0$ is the condition that for all $x$, the conditional label probabilities satisfy

$|\Pr(y = 1 \mid x) - \Pr(y = -1 \mid x)| \;\ge\; \beta.$   (1)

Equivalently, we say that $\tilde D$ satisfies Massart noise with parameter $\beta$ if an adversary constructs $\tilde D$ by first taking the distribution $D$ over instances $(x, \operatorname{sign}(w^*\cdot x))$ and then flipping the label of an instance $x$ with probability at most $\frac{1-\beta}{2}$.¹ Also note that under the distribution $\tilde D$, $w^*$ remains the Bayes optimal classifier. In the remainder of this work, we refer to $\tilde D$ as the noisy distribution and to the distribution $D$ over instances $(x, \operatorname{sign}(w^*\cdot x))$ as the clean distribution.

Our goal is then to find a halfspace $w$ that has small excess error as compared to the Bayes optimal classifier $w^*$. That is, for any $\epsilon > 0$, find a halfspace $w$ such that $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \le \epsilon$. Note that the excess error of any classifier $w$ only depends on the points in the region where $w$ and $w^*$ disagree, so $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \le \frac{\theta(w,w^*)}{\pi}$. Additionally, under Massart noise the amount of noise in the disagreement region is also bounded by $\frac{1-\beta}{2}$. It is not difficult to see that under Massart noise,

$\beta\,\frac{\theta(w, w^*)}{\pi} \;\le\; \operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*).$   (2)

In our analysis, we frequently examine the region within a certain margin of a halfspace. For a halfspace $w$ and margin $b$, let $S_{w,b}$ be the set of all points that fall within margin $b$ of $w$, i.e., $S_{w,b} = \{x : |w\cdot x| \le b\}$. For the distributions $D$ and $\tilde D$, we denote the corresponding distributions conditioned on $S_{w,b}$ by $D_{w,b}$ and $\tilde D_{w,b}$, respectively. In the remainder of this work, we refer to the region $S_{w,b}$ as the band.

In our analysis, we use the hinge loss as a convex surrogate for the 0/1-loss. For a halfspace $w$, we use the $\tau$-normalized hinge loss, defined as $\ell_\tau(w, x, y) = \max\{0,\, 1 - \frac{(w\cdot x)\,y}{\tau}\}$. For a labeled sample set $W$, let $\ell_\tau(w, W) = \frac{1}{|W|}\sum_{(x,y)\in W} \ell_\tau(w, x, y)$ be the empirical hinge loss of a vector $w$ with respect to $W$.

3 Computationally Efficient Algorithm for Massart Noise

In this section, we prove our main result for learning halfspaces in the presence of Massart noise. We focus on the case where $D$ is the uniform distribution on the $d$-dimensional unit ball. Our main theorem is as follows.

Theorem 1. Let the Bayes optimal classifier be a halfspace denoted by $w^*$. Assume that the Massart noise condition holds for some $\beta$ larger than a suitable absolute constant (independent of the dimension $d$). Then for any $\epsilon, \delta > 0$, Algorithm 1 with $\lambda = \frac{1}{8}$, $\alpha_k = 3879\,\pi(1-\lambda)^{k-1}$, $b_{k-1} = 3463\,\alpha_k$, and $\tau_k = 536\, d^{-1/4}\, b_{k-1}$ runs in polynomial time, proceeds in $s = O(\log\frac{1}{\epsilon})$ rounds, where in round $k$ it takes $n_k = \mathrm{poly}(d, \exp(k), \log(\frac{1}{\delta}))$ unlabeled samples and $m_k = O(d(d + \log(k/\delta)))$ labels, and with probability $1-\delta$ returns a linear separator whose excess error (compared to $w^*$) is at most $\epsilon$.

¹ Note that the relationship between the Massart noise parameter $\beta$ and the maximum flipping probability $\eta$ discussed in the introduction is $\eta = \frac{1-\beta}{2}$.
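Before turning to the algorithm, the following short sketch makes the setup above concrete: it samples from a Massart-noise distribution of the kind just defined (uniform marginal, labels of $\operatorname{sign}(w^*\cdot x)$ flipped with probability $\eta(x) \le \frac{1-\beta}{2}$) and evaluates the $\tau$-normalized hinge loss. This is only an illustration of the data model; the function names, the particular adversarial choice of $\eta(\cdot)$, and the numeric constants below are ours and not taken from the paper.

```python
import numpy as np

def sample_massart_uniform_ball(n, d, w_star, eta_fn, rng):
    """Draw n labeled examples: x uniform on the d-dimensional unit ball,
    clean label sign(w_star . x), flipped independently with probability
    eta_fn(x), assumed to satisfy eta_fn(x) <= (1 - beta) / 2."""
    g = rng.normal(size=(n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    x = g * (rng.random(n) ** (1.0 / d))[:, None]      # uniform in the unit ball
    clean = np.sign(x @ w_star)
    flip_prob = np.apply_along_axis(eta_fn, 1, x)
    y = np.where(rng.random(n) < flip_prob, -clean, clean)
    return x, y

def tau_hinge_loss(w, X, y, tau):
    """tau-normalized hinge loss: mean of max(0, 1 - y (w . x) / tau)."""
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau).mean()

rng = np.random.default_rng(0)
d, eta = 20, 0.05                       # eta = (1 - beta) / 2, a small constant
w_star = np.eye(d)[0]
# an adversarial eta(x): spend the full noise budget on one side of a direction
# orthogonal to w_star (this is the kind of asymmetry Section 4 exploits)
X, y = sample_massart_uniform_ball(10_000, d, w_star,
                                   lambda x: eta if x[1] > 0 else 0.0, rng)
print(tau_hinge_loss(w_star, X, y, tau=0.1))
```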

Note that in the above theorem and in Algorithm 1, the value of $\beta$ is unknown to the algorithm; therefore, our results are adaptive to values of $\beta$ within the acceptable range defined by the theorem.

The algorithm described above is similar to that of [2] and uses an iterative margin-based approach. The algorithm runs for $s = \log_{\frac{1}{1-\lambda}}(\frac{1}{\epsilon})$ rounds, for a constant $\lambda \in (0, 1]$. By induction, assume that our algorithm produces a hypothesis $w_{k-1}$ at round $k-1$ such that $\theta(w_{k-1}, w^*) \le \alpha_k$. We satisfy the base case by using an algorithm of [27]. At round $k$, we sample $m_k$ labeled examples from the conditional distribution $\tilde D_{w_{k-1},b_{k-1}}$, whose marginal is the uniform distribution over $\{x : |w_{k-1}\cdot x| \le b_{k-1}\}$. We then choose $w_k$ from the set of all hypotheses $B(w_{k-1}, \alpha_k) = \{w : \theta(w, w_{k-1}) \le \alpha_k\}$ such that $w_k$ minimizes the empirical hinge loss over these examples. Subsequently, as we prove in detail later, $\theta(w_k, w^*) \le \alpha_{k+1}$. Note that for any $w$, the excess error of $w$ is at most the error of $w$ on $\tilde D$ when the labels are corrected according to $w^*$, i.e., $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \le \operatorname{err}_D(w)$. Moreover, when $D$ is uniform, $\operatorname{err}_D(w) = \frac{\theta(w,w^*)}{\pi}$. Hence, $\theta(w_s, w^*) \le \pi\epsilon$ implies that $w_s$ has excess error at most $\epsilon$.

The algorithm described below was originally introduced to achieve an error of $c\cdot\operatorname{err}(w^*)$ for some constant $c$ in the presence of adversarial noise. Achieving a small excess error $\operatorname{err}(w^*)+\epsilon$ is a much more ambitious goal, one that requires new technical insights. Our two crucial technical innovations are as follows. We first make the key observation that under Massart noise, the noise rate over any conditional distribution of $\tilde D$ is still at most $\frac{1-\beta}{2}$; therefore, as we focus on the distribution within the band, our noise rate does not increase. Our second technical contribution is a careful choice of parameters. Indeed, the choice of parameters, up to a constant, plays an important role in tolerating a constant amount of Massart noise. Using these insights, we show that the algorithm of [2] can indeed achieve a much stronger guarantee, namely arbitrarily small excess error in the presence of Massart noise. That is, for any $\epsilon$, this algorithm can achieve error of $\operatorname{err}(w^*) + \epsilon$ in the presence of Massart noise.

Algorithm 1  EFFICIENT ALGORITHM FOR ARBITRARILY SMALL EXCESS ERROR UNDER MASSART NOISE

Input: A distribution $\tilde D$; an oracle that returns $x$, and an oracle that returns $y$, for an $(x, y)$ sample from $\tilde D$; permitted excess error $\epsilon$ and probability of failure $\delta$.
Parameters: A learning rate $\lambda$; a sequence of sample sizes $m_k$; a sequence of angles $\alpha_k$ for the hypothesis space; a sequence of widths $b_k$ for the labeled region; a sequence of hinge-loss thresholds $\tau_k$.
Algorithm:
1. Take $\mathrm{poly}(d, \frac{1}{\delta})$ samples and run the $\mathrm{poly}(d, \frac{1}{\delta})$-time algorithm of [27] to find a halfspace $w_0$ with excess error small enough that $\theta(w_0, w^*) \le 3879\pi$. (Refer to Appendix C.)
2. Draw $m_1$ examples $(x, y)$ from $\tilde D$ and put them into a working set $W$.
3. For $k = 1, \dots, \log_{\frac{1}{1-\lambda}}(\frac{1}{\epsilon}) = s$:
   (a) Find $v_k$ such that $\|v_k - w_{k-1}\| \le \alpha_k$ (as a result $v_k \in B(w_{k-1}, \alpha_k)$) that minimizes the empirical hinge loss over $W$ using threshold $\tau_k$; that is, $\ell_{\tau_k}(v_k, W) \le \min_{w\in B(w_{k-1},\alpha_k)} \ell_{\tau_k}(w, W)$.
   (b) Clear the working set $W$.
   (c) Normalize $v_k$ to obtain $w_k = \frac{v_k}{\|v_k\|}$. Until $m_{k+1}$ additional examples are put in $W$, draw an example $x$ from $\tilde D$; if $|w_k\cdot x| \ge b_k$, then reject $x$, else put $(x, y)$ into $W$.
Output: Return $w_s$, which has excess error at most $\epsilon$ with probability $1-\delta$.
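The following is a compact sketch of the margin-based iteration of Algorithm 1, assuming access to the two oracles from the input description. The projected-subgradient routine is only a generic stand-in for the step "minimize the empirical $\tau_k$-hinge loss over $B(w_{k-1}, \alpha_k)$" (any convex optimization method would do), the parameter schedules are passed in as functions because their exact constants are those of Theorem 1, the initialization step and the handling of the first working set are simplified, and all names are ours rather than the paper's.

```python
import numpy as np

def minimize_hinge_in_ball(X, y, w_prev, radius, tau, steps=500, lr=0.1):
    """Approximately minimize the empirical tau-hinge loss over
    {v : ||v - w_prev|| <= radius} by projected subgradient descent.
    A stand-in for the convex optimization step 3(a) of Algorithm 1."""
    v = w_prev.copy()
    for _ in range(steps):
        margins = y * (X @ v) / tau
        active = margins < 1.0                       # points with nonzero hinge loss
        grad = -(y[active, None] * X[active]).sum(axis=0) / (tau * len(y))
        v = v - lr * grad
        diff = v - w_prev                            # project back onto the ball
        norm = np.linalg.norm(diff)
        if norm > radius:
            v = w_prev + diff * (radius / norm)
    return v

def margin_based_algorithm(draw_unlabeled, label_oracle, w0,
                           rounds, m_k, alpha_k, b_k, tau_k):
    """Sketch of Algorithm 1: localize to a band around the current
    hypothesis, minimize hinge loss over nearby halfspaces, renormalize."""
    w = w0 / np.linalg.norm(w0)
    for k in range(1, rounds + 1):
        # collect m_k labeled examples inside the band |w . x| <= b_{k-1}
        X, y = [], []
        while len(X) < m_k(k):
            x = draw_unlabeled()
            if abs(w @ x) <= b_k(k - 1):
                X.append(x)
                y.append(label_oracle(x))
        X, y = np.array(X), np.array(y)
        v = minimize_hinge_in_ball(X, y, w, radius=alpha_k(k), tau=tau_k(k))
        w = v / np.linalg.norm(v)
    return w
```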

Overview of our analysis: Similar to [2], we divide $\operatorname{err}_D(w_k)$ into two parts: the error in the band, i.e., on $x \in S_{w_{k-1},b_{k-1}}$, and the error outside the band, on $x \notin S_{w_{k-1},b_{k-1}}$. We choose $b_{k-1}$ and $\alpha_k$ such that, for every hypothesis $w \in B(w_{k-1}, \alpha_k)$ that is considered at step $k$, the probability mass of the region outside the band on which $w$ and $w_{k-1}$ disagree is very small (Lemma 5). Therefore, the error associated with the region outside the band is also very small. This motivates the design of the algorithm to only minimize the error in the band. Furthermore, the probability mass of the band is also small enough that, for $\operatorname{err}_D(w_k) \le \alpha_{k+1}$ to hold, it suffices for $w_k$ to have a small constant error over the clean distribution restricted to the band, namely $D_{w_{k-1},b_{k-1}}$.

This is where minimizing hinge loss in the band comes in. As minimizing the 0/1-loss is NP-hard, an alternative method for finding a $w_k$ with small error in the band is needed. The hinge loss, being a convex loss function, can be efficiently minimized. So, we can efficiently find a $w_k$ that minimizes the empirical hinge loss of the sample drawn from $\tilde D_{w_{k-1},b_{k-1}}$. To allow the hinge loss to remain a faithful proxy for the 0/1-loss as we focus on bands of smaller width, we use the normalized hinge loss function defined by $\ell_\tau(w, x, y) = \max\{0,\, 1 - \frac{(w\cdot x)\,y}{\tau}\}$. A crucial part of our analysis involves showing that if $w_k$ minimizes the empirical hinge loss of the sample set drawn from $\tilde D_{w_{k-1},b_{k-1}}$, it indeed has a small 0/1-error on $D_{w_{k-1},b_{k-1}}$. To this end, we first show that when $\tau_k$ is proportional to $b_{k-1}$, the hinge loss of $w^*$ on $D_{w_{k-1},b_{k-1}}$, which is an upper bound on its 0/1-error in the band, is itself small (Lemma 1). Next, we notice that under Massart noise, the noise rate in any conditional of the distribution is still at most $\frac{1-\beta}{2}$; therefore, restricting the distribution to the band does not increase the probability of noise in the band. Moreover, the noise points in the band are close to the decision boundary, so, intuitively speaking, they cannot increase the hinge loss too much. Using these insights we can show that the hinge loss of $w_k$ on $\tilde D_{w_{k-1},b_{k-1}}$ is close to its hinge loss on $D_{w_{k-1},b_{k-1}}$ (Lemma 2).

Proof of Theorem 1 and related lemmas. To prove Theorem 1, we first introduce a series of lemmas concerning the behavior of the hinge loss in the band. These lemmas build up towards showing that $w_k$ has error at most a fixed small constant in the band. For ease of exposition, for any $k$, let $D_k$ and $\tilde D_k$ represent $D_{w_{k-1},b_{k-1}}$ and $\tilde D_{w_{k-1},b_{k-1}}$, respectively, and let $\ell(\cdot)$ represent $\ell_{\tau_k}(\cdot)$. Furthermore, let $c = 3463$, so that $b_{k-1} = c\,\alpha_k$. Our first lemma, whose proof appears in Appendix B, provides an upper bound on the true hinge loss of $w^*$ on the clean distribution in the band.

Lemma 1. $\mathbb{E}_{(x,y)\sim D_k}\,\ell(w^*, x, y) \le \frac{2\tau_k}{b_{k-1}}$.

The next lemma compares the true hinge loss of any $w \in B(w_{k-1}, \alpha_k)$ on the two distributions $D_k$ and $\tilde D_k$. It is clear that the difference between the hinge loss on these two distributions is entirely attributed to the noise points and their margin from $w$. A key insight in the proof of this lemma is that, as we concentrate on the band, the probability of seeing a noise point remains under $\frac{1-\beta}{2}$. This is due to the fact that under Massart noise, each label can be changed with probability at most $\frac{1-\beta}{2}$. Furthermore, by concentrating on the band, all points are close to the decision boundary of $w_{k-1}$. Since $w$ is also close in angle to $w_{k-1}$, the points in the band are also close to the decision boundary of $w$. Therefore the hinge loss of the noise points in the band cannot increase the total hinge loss of $w$ by too much.

Lemma 2. For any $w$ such that $w \in B(w_{k-1}, \alpha_k)$, we have $\big|\,\mathbb{E}_{(x,y)\sim \tilde D_k}\,\ell(w, x, y) - \mathbb{E}_{(x,y)\sim D_k}\,\ell(w, x, y)\,\big| \;\le\; 19\sqrt{\tfrac{1-\beta}{2}}\;\frac{b_{k-1}}{\tau_k}$.
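For later reference, the decomposition that the overview above describes can be written out explicitly in the notation of Section 2; the same decomposition is what drives the proof of Theorem 1:

\begin{align*}
\operatorname{err}_D(w_k)
  &= \Pr_{x\sim D}\big[x \notin S_{w_{k-1},b_{k-1}} \text{ and } (w_k\cdot x)(w^*\cdot x) < 0\big]
   + \Pr_{x\sim D}\big[x \in S_{w_{k-1},b_{k-1}} \text{ and } (w_k\cdot x)(w^*\cdot x) < 0\big]\\
  &\le \Pr_{x\sim D}\big[x \notin S_{w_{k-1},b_{k-1}} \text{ and } \operatorname{sign}(w_k\cdot x) \ne \operatorname{sign}(w_{k-1}\cdot x)\big]\\
  &\quad + \Pr_{x\sim D}\big[x \notin S_{w_{k-1},b_{k-1}} \text{ and } \operatorname{sign}(w_{k-1}\cdot x) \ne \operatorname{sign}(w^*\cdot x)\big]
   + \operatorname{err}_{D_k}(w_k)\cdot\Pr_{x\sim D}\big[x \in S_{w_{k-1},b_{k-1}}\big].
\end{align*}

The first two terms are controlled by the choice of $b_{k-1}$ and $\alpha_k$ via Lemma 5, and the last term by Lemma 3 (the error of $w_k$ in the band) together with the probability mass of the band (Lemma 4).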

7 Proof Let N be the set of noise points We have, E (x,y Dk l(w, x, y E (x,y Dk l(w, x, y = E (x,y Dk (l(w, x, y l(w, x, sign(w x E (x,y (1 Dk x N (l(w, x, y l(w, x, y ( w x E (x,y 1 Dk x N k Pr (x N E k (x,y D (x,y (w x (By Cauchy Shwarz Dk k 1 β αk k 1 + b k 1 (By Definition 41 of [] for uniform 1 β b k 1 k ( 1c β b k 1 k (for >, c > 1 For a labele sample set W rawn at ranom from D k, let cleane(w be the set of samples with the labels correcte by w, ie, cleane(w = {(x, sign(w x : for all (x, y W } Then by stanar VC-imension bouns (Proof inclue in Appenix B there is m k O(( + log(k/ such that for any ranomly rawn set W of m k labele samples from D k, with probability 1 B(w k 1, α k, δ (k+k, for any w E (x,y Dk l(w, x, y l(w, W 1 8, (3 E (x,y Dk l(w, x, y l(w, cleane(w 1 8 (4 Our next lemma is a crucial step in our analysis of Algorithm 1 This lemma proves that if w k B(w k 1, α k minimizes the empirical hinge loss on the sample rawn from the noisy istribution in the ban, namely D wk 1,b k 1, then with high probability w k also has a small /1-error with respect to the clean istribution in the ban, ie, D wk 1,b k 1 Lemma 3 There exists m k O(( + log(k/, such that for a ranomly rawn labele sample set W of size m k from D k, an for w k such that w k has the minimum empirical hinge loss on W between the set of all hypothesis in B(w k 1, α k, with probability 1 δ (k+k, err Dk (w k k b k β b k 1 k Proof Sketch First, we note that the true /1-error of w k on any istribution is at most its true hinge loss on that istribution Lemma 1 provies an upper boun on the true hinge loss on istribution D k Therefore, it remains to create a connection between the empirical hinge loss of w k on the sample rawn from D k to its true hinge loss on istribution D k This, we achieve by using the generalization bouns of Equations 3 an 4 to connect the empirical an true hinge loss of w k an w, an using Lemma to connect the hinge of w k an w in the clean an noisy istributions 7

8 Proof of Theorem 1 For ease of exposition, let c = 3463 Recall that λ = 1 8, α k = 3879π(1 λ k 1, b k 1 = cα k, k = 536 ( /4 b k 1, an β > Note that for any w, the excess error of w is at most the error of w on the clean istribution D, ie, err D(w err D(w err D (w Moreover, for uniform istribution D, err D (w = θ(w,w π Hence, to show that w has ɛ excess error, it suffices to show that err D (w ɛ Our goal is to achieve excess error of 3879(1 λ k at roun k This we o inirectly by bouning err D (w k at every step We use inuction For k =, we use the algorithm for aversarial noise moel by [7], which can achieve excess error of ɛ if err D(w ɛ < 56 log(1/ɛ (Refer to Appenix C for more etails For Massart noise, err D(w 1 β So, for our choice of β, this algorithm can achieve excess error of in poly(, 1 δ samples an run-time Furthermore, using Equation, θ(w, w < 3879π Assume that at roun k 1, err D (w k (1 λ k 1 We will show that w k, which is chosen by the algorithm at roun k, also has err D (w k 3879(1 λ k First note that err D (w k (1 λ k 1 implies θ(w k 1, w α k Let S = S wk 1,b k 1 inicate the ban at roun k We ivie the error of w k to two parts, error outsie the ban an error insie of the ban That is err D (w k = Pr x D [x / S an (w k x(w x < ] + Pr x D [x S an (w k x(w x < ] For the first part, ie, error outsie of the ban, Pr x D [x / S an (w k x(w x < ] is at most Pr [x / S an (w k x(w k 1 x < ] + Pr [x / S an (w k 1 x(w x < ] α k c ( x D x D π e, where this inequality hols by the application of Lemma 5 an the fact that θ(w k 1, w k α k an θ(w k 1, w α k For the secon part, ie, error insie the ban Pr [x S an (w k x(w x < ] = err Dk (w k Pr [x S] x D x D err Dk (w k V 1 V b k 1 (By Lemma 4 ( + 1 err Dk (w k c α k, π where the last transition hols by the fact that V 1 V from Lemma 3, to show that err D (w k α k+1 π +1 ( k β b k b k 1 k We simplify this inequality as follows ( k β b k b k 1 k π [8] Replacing an upper boun on err D k (w k, it suffices to show that the following inequality hols ( + 1 c α k + α k c ( π π e α k+1 π c π( e c ( 1 λ Replacing in the rhs, the values of c = 3463, an k = 536( /4 b k 1, we have ( 536( /4 + 1 β π( ( c + e c ( 1/4 8

which, for $d$ larger than a suitable constant, is at most $1-\lambda$. Therefore, $\operatorname{err}_D(w_k) \le 3879(1-\lambda)^k$.

Sample complexity analysis: We require $m_k$ labeled samples in the band $S_{w_{k-1},b_{k-1}}$ at round $k$. By Lemma 4, the probability that a randomly drawn sample from $\tilde D$ falls in $S_{w_{k-1},b_{k-1}}$ is at least $O(b_{k-1}) = O((1-\lambda)^{k-1})$. Therefore, we need $O((1-\lambda)^{-(k-1)}\, m_k)$ unlabeled samples to get $m_k$ examples in the band with probability $1 - \frac{\delta}{8(k+k^2)}$. So, the total unlabeled sample complexity is at most

$\sum_{k=1}^{s} O\!\left((1-\lambda)^{-(k-1)}\, m_k\right) \;\le\; O\!\left(\frac{d}{\epsilon}\left(d + \log\frac{\log(1/\epsilon)}{\delta}\right)\log\frac{1}{\epsilon}\right).$

4 Average Does Not Work

Our algorithm described in the previous section uses convex loss minimization (in our case, hinge loss) in the band as an efficient proxy for minimizing the 0/1 loss. The Average algorithm introduced by [30] is another computationally efficient algorithm that has provable noise tolerance guarantees under certain noise models and distributions. For example, it achieves arbitrarily small excess error in the presence of random classification noise and monotonic noise when the distribution is uniform over the unit sphere. Furthermore, even in the presence of a small amount of malicious noise and less symmetric distributions, Average has been used to obtain a weak learner, which can then be boosted to achieve a non-trivial noise tolerance [27]. It is therefore natural to ask whether the noise tolerance that Average exhibits could be extended to the case of Massart noise under the uniform distribution. We answer this question in the negative. We show that the lack of symmetry in Massart noise presents a significant barrier for the one-shot application of Average, even when the marginal distribution is completely symmetric. Additionally, we also discuss obstacles to incorporating Average as a weak learner within the margin-based technique.

In a nutshell, Average takes $m$ sample points and their respective labels, $W = \{(x_1, y_1), \dots, (x_m, y_m)\}$, and returns $\frac{1}{m}\sum_{i=1}^m x_i y_i$. Our main result in this section shows that for a wide range of distributions that are very symmetric in nature, including the Gaussian and the uniform distribution, there is an instance of Massart noise under which Average cannot achieve an arbitrarily small excess error.

Theorem 2. For any continuous distribution $D$ with a pdf that is a function of the distance from the origin only, there is a noisy distribution $\tilde D$ over $X \times \{-1, 1\}$ that satisfies the Massart noise condition in Equation (1) for some parameter $\beta > 0$, such that Average returns a classifier with excess error $\Omega\!\left(\frac{\beta(1-\beta)}{1+\beta}\right)$.

Proof. Let $w^* = (1, 0, \dots, 0)$ be the target halfspace. Let the noise distribution be such that for all $x$, if $x_1 x_2 < 0$ then we flip the label of $x$ with probability $\frac{1-\beta}{2}$; otherwise we keep the label. Clearly, this satisfies Massart noise with parameter $\beta$. Let $w$ be the expected vector returned by Average. We first show that $w$ is far from $w^*$ in angle; then, using Equation (2), we show that $w$ has large excess error. First we examine the expected component of $w$ that is parallel to $w^*$, i.e., $w\cdot w^* = w_1$. For ease of exposition, we divide our analysis into two cases: one for the regions with no noise (the first and third quadrants) and one for the regions with noise (the second and fourth quadrants).

Let $E$ be the event that $x_1 x_2 > 0$. By symmetry, it is easy to see that $\Pr[E] = 1/2$. Then

$\mathbb{E}[w\cdot w^*] = \Pr(E)\,\mathbb{E}[w\cdot w^* \mid E] + \Pr(\bar E)\,\mathbb{E}[w\cdot w^* \mid \bar E].$

For the first term, for $x \in E$ the label has not changed. So, $\mathbb{E}[w\cdot w^* \mid E] = \mathbb{E}[\,|x_1| \mid E\,] = \int z f(z)\, dz$. For the second term, the label of each point stays the same with probability $\frac{1+\beta}{2}$ and is flipped with probability $\frac{1-\beta}{2}$. Hence, $\mathbb{E}[w\cdot w^* \mid \bar E] = \beta\,\mathbb{E}[\,|x_1| \mid \bar E\,] = \beta \int z f(z)\, dz$. Therefore, the expected parallel component of $w$ is $\mathbb{E}[w\cdot w^*] = \frac{1+\beta}{2}\int z f(z)\, dz$.

Next, we examine $w_\perp$, the orthogonal component of $w$ on the second coordinate. Similarly to the previous case, for the clean regions $\mathbb{E}[w_\perp \mid E] = \mathbb{E}[\,|x_2| \mid E\,] = \int z f(z)\, dz$. Next, for the second and fourth quadrants, which are noisy, we have (using that $x_2\,\operatorname{sign}(x_1) = -|x_2|$ on $\bar E$, and that the two noisy quadrants contribute equally by symmetry)

$\mathbb{E}_{(x,y)\sim\tilde D}[\,x_2\, y \mid x_1 x_2 < 0\,] = \frac{1+\beta}{2}\left(-\int z f(z)\, dz\right) + \frac{1-\beta}{2}\int z f(z)\, dz = -\beta\int z f(z)\, dz.$

So,

$w_\perp = \frac{1}{2}\int z f(z)\, dz + \frac{1}{2}\left(-\beta\int z f(z)\, dz\right) = \frac{1-\beta}{2}\int z f(z)\, dz.$

Therefore $\theta(w, w^*) = \arctan\left(\frac{1-\beta}{1+\beta}\right) \ge \frac{1-\beta}{2(1+\beta)}$. By Equation (2), we have $\operatorname{err}_{\tilde D}(w) - \operatorname{err}_{\tilde D}(w^*) \ge \beta\,\frac{\theta(w,w^*)}{\pi} \ge \frac{\beta(1-\beta)}{2\pi(1+\beta)}$.

Our margin-based analysis from Section 3 relies on using hinge-loss minimization in the band at every round to efficiently find a halfspace $w_k$ that is a weak learner for $D_k$, i.e., such that $\operatorname{err}_{D_k}(w_k)$ is at most a small constant, as demonstrated in Lemma 3. Motivated by this more lenient goal of finding a weak learner, one might ask whether Average, as an efficient algorithm for finding low-error halfspaces, can be incorporated into the margin-based technique in the same way as hinge loss minimization. We argue that the margin-based technique is inherently incompatible with Average. The margin-based technique maintains two key properties at every step: first, the angle between $w_k$ and $w_{k-1}$ and the angle between $w_{k-1}$ and $w^*$ are small, and as a result $\theta(w^*, w_k)$ is small; second, $w_k$ is a weak learner, with $\operatorname{err}_{D_{k-1}}(w_k)$ at most a small constant. In our work, hinge loss minimization in the band guarantees both of these properties simultaneously, by limiting its search to the halfspaces that are close in angle to $w_{k-1}$ and limiting its distribution to $\tilde D_{w_{k-1},b_{k-1}}$. However, in the case of Average, as we concentrate on the band $\tilde D_{w_{k-1},b_{k-1}}$ we bias the distribution towards its orthogonal component with respect to $w_{k-1}$. Hence, an upper bound on $\theta(w^*, w_{k-1})$ only serves to ensure that most of the data is orthogonal to $w^*$ as well. Therefore, informally speaking, we lose the signal that could otherwise direct us in the direction of $w^*$. More formally, consider the construction from Theorem 2 with $w_{k-1} = w^* = (1, 0, \dots, 0)$. In the distribution $\tilde D_{w_{k-1},b_{k-1}}$, the component of $w_k$ that is parallel to $w_{k-1}$ scales down with the width of the band, $b_{k-1}$. However, as most of the probability stays in a band passing through the origin for any log-concave (including Gaussian and uniform) distribution, the orthogonal component of $w_k$ remains almost unchanged. Therefore,

$\theta(w_k, w^*) = \theta(w_k, w_{k-1}) \;\ge\; \Omega\!\left(\arctan\frac{1-\beta}{b_{k-1}(1+\beta)}\right) \;=\; \Omega\!\left(\arctan\frac{1-\beta}{(1+\beta)\,\alpha_{k-1}}\right).$
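A small simulation of the construction in the proof of Theorem 2 makes the bias of Average visible: with labels flipped with probability $\frac{1-\beta}{2}$ exactly on the second and fourth quadrants, the averaged vector tilts away from $w^*$ by roughly $\arctan\frac{1-\beta}{1+\beta}$. The sketch below uses the uniform distribution on the two-dimensional unit disk (any radially symmetric density works, per the theorem); the function names and the specific value of $\beta$ are ours, chosen only for illustration.

```python
import numpy as np

def average_learner(X, y):
    """The Average algorithm: return the mean of y_i * x_i."""
    return (y[:, None] * X).mean(axis=0)

def sample_construction(n, beta, rng):
    """Sample the Theorem 2 construction: x uniform on the unit disk,
    clean label sign(x_1); labels are flipped with probability (1-beta)/2
    exactly when x_1 * x_2 < 0 (the second and fourth quadrants)."""
    g = rng.normal(size=(n, 2))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    x = g * np.sqrt(rng.random(n))[:, None]          # uniform on the 2-D unit disk
    y = np.sign(x[:, 0])
    noisy = x[:, 0] * x[:, 1] < 0
    flip = noisy & (rng.random(n) < (1 - beta) / 2)
    y[flip] *= -1
    return x, y

rng = np.random.default_rng(0)
beta = 0.5
X, y = sample_construction(200_000, beta, rng)
w = average_learner(X, y)
angle = np.arccos(w[0] / np.linalg.norm(w))
print(f"angle(w, w*) ~ {angle:.3f} rad;  arctan((1-beta)/(1+beta)) = "
      f"{np.arctan((1 - beta) / (1 + beta)):.3f} rad")
```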

5 Hinge Loss Minimization Does Not Work

Hinge loss minimization is a widely used technique in machine learning. In this section, we show that, perhaps surprisingly, hinge loss minimization does not lead to arbitrarily small excess error even under a very small amount of noise; that is, it is not consistent. (Note that in our setting of Massart noise, consistency is the same as achieving arbitrarily small excess error, since the Bayes optimal classifier is a member of the class of halfspaces.) It has been shown earlier that hinge loss minimization can lead to classifiers of large 0/1-loss [6]. However, the lower bounds in that paper employ distributions with significant mass on discrete points with flipped labels (which is not possible under Massart noise) at a very large distance from the optimal classifier. Thus, that result makes strong use of the hinge loss's sensitivity to errors at large distance. Here, we show that hinge loss minimization is bound to fail under much more benign conditions.

More concretely, we show that for every parameter $\tau$ and arbitrarily small bound on the probability of flipping a label, $\eta = \frac{1-\beta}{2}$, hinge loss minimization is not consistent, even on distributions with a uniform marginal over the unit ball in $\mathbb{R}^d$, with the Bayes optimal classifier being a halfspace and the noise satisfying the Massart noise condition with bound $\eta$. That is, there exists a constant $\epsilon_0$ and a sample size $m(\epsilon_0)$ such that hinge loss minimization returns a classifier of excess error at least $\epsilon_0$ with high probability over samples of size at least $m(\epsilon_0)$.

Hinge loss minimization does approximate the optimal hinge loss. We show that this does not translate into an agnostic learning guarantee for halfspaces with respect to the 0/1-loss, even under very small noise conditions. Let $P_\beta$ be the class of distributions $\tilde D$ with uniform marginal over the unit ball $B_1 \subseteq \mathbb{R}^d$, the Bayes classifier being a halfspace $w^*$, and satisfying the Massart noise condition with parameter $\beta$. Our lower bound for hinge loss minimization is stated as follows.

Theorem 3. For every hinge-loss parameter $\tau$ and every Massart noise parameter $\beta < 1$, there exists a distribution $D_{\tau,\beta} \in P_\beta$ (that is, a distribution over $B_1 \times \{-1, 1\}$ with uniform marginal over $B_1 \subseteq \mathbb{R}^d$, satisfying the $\beta$-Massart condition) such that $\tau$-hinge loss minimization is not consistent on $D_{\tau,\beta}$ with respect to the class of halfspaces. That is, there exists an $\epsilon_0$ and a sample size $m(\epsilon_0)$ such that hinge loss minimization will output a classifier of excess error larger than $\epsilon_0$ (with high probability) over samples of size at least $m(\epsilon_0)$.

Proof idea. To prove the above result, we define a subclass $P_{\alpha,\eta} \subseteq P_\beta$ consisting of well structured distributions. We then show that for every hinge parameter $\tau$ and every bound on the noise $\eta$, there is a distribution $D \in P_{\alpha,\eta}$ on which $\tau$-hinge loss minimization is not consistent.

In the remainder of this section, we use the notation $h_w$ for the classifier associated with a vector $w \in B_1$, that is, $h_w(x) = \operatorname{sign}(w\cdot x)$, since for our geometric construction it is convenient to differentiate between the two. We define a family $P_{\alpha,\eta} \subseteq P_\beta$ of distributions $D_{\alpha,\eta}$, indexed by an angle $\alpha$ and a noise parameter $\eta$, as follows. Let the Bayes optimal classifier be linear, $h^* = h_{w^*}$ for a unit vector $w^*$. Let $h_{w'}$ be the classifier defined by the unit vector $w'$ at angle $\alpha$ from $w^*$. We partition the unit ball into areas A, B and D as in Figure 1. That is, A consists of the two wedges of disagreement between $h_{w^*}$ and $h_{w'}$, and the wedge where the two classifiers agree is divided into B (points that are closer to $h_{w'}$ than to $h_{w^*}$) and D (points that are closer to $h_{w^*}$ than to $h_{w'}$).

Figure 1: The family $P_{\alpha,\eta}$: the unit ball partitioned into areas A, B and D by $h_{w^*}$ and $h_{w'}$.

We now flip the labels of all points in A and B with probability $\eta = \frac{1-\beta}{2}$ and leave the labels deterministic, according to $h_{w^*}$, in the area D. More formally, points at angle between $\alpha/2$ and $\pi/2$ from $w^*$, and points at angle between $\pi + \alpha/2$ and $3\pi/2$ from $w^*$, are labeled $h_{w^*}(x)$ with conditional label probability 1. All other points are labeled $-h_{w^*}(x)$ with probability $\eta$ and $h_{w^*}(x)$ with probability $(1-\eta)$.
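The construction $D_{\alpha,\eta}$ is easy to probe numerically. The sketch below samples from it in two dimensions (matching the proof in Appendix D) and compares the empirical $\tau$-hinge loss of $w^*$ against that of the rotated vector $w'$ and against a scan over unit-vector directions; Theorem 3 guarantees that for suitable $\alpha$ (given $\tau$ and $\eta$) the minimizer is bounded away from the direction of $w^*$. The parameter values and helper names are ours, chosen for illustration only, and the scan over unit vectors is a simplification (hinge loss minimization in the theorem is over all of $B_1$).

```python
import numpy as np

def sample_D_alpha_eta(n, alpha, eta, rng):
    """Sample the Section 5 construction D_{alpha,eta} in 2 dimensions.

    x is uniform on the unit disk; the Bayes classifier is h_{w*} with
    w* = (1, 0), and w' is w* rotated by alpha.  Labels follow sign(w*.x)
    and are flipped with probability eta on the disagreement wedges A and on
    B (agreement points closer to the hyperplane of w'); they are
    deterministic on D (agreement points closer to the hyperplane of w*)."""
    w_star = np.array([1.0, 0.0])
    w_prime = np.array([np.cos(alpha), np.sin(alpha)])
    g = rng.normal(size=(n, 2))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    x = g * np.sqrt(rng.random(n))[:, None]
    m_star, m_prime = x @ w_star, x @ w_prime
    y = np.sign(m_star)
    agree = np.sign(m_star) == np.sign(m_prime)
    noisy_region = ~agree | (agree & (np.abs(m_prime) < np.abs(m_star)))  # A or B
    flip = noisy_region & (rng.random(n) < eta)
    y[flip] *= -1
    return x, y, w_star, w_prime

def tau_hinge(w, X, y, tau):
    return np.maximum(0.0, 1.0 - y * (X @ w) / tau).mean()

rng = np.random.default_rng(1)
alpha, eta, tau = 0.6, 0.4, 0.5          # illustrative values, not from the paper
X, y, w_star, w_prime = sample_D_alpha_eta(200_000, alpha, eta, rng)
print("tau-hinge of w*:", tau_hinge(w_star, X, y, tau))
print("tau-hinge of w':", tau_hinge(w_prime, X, y, tau))
# scan unit vectors to locate the empirical tau-hinge minimizer's angle
thetas = np.linspace(-np.pi / 2, np.pi / 2, 181)
losses = [tau_hinge(np.array([np.cos(t), np.sin(t)]), X, y, tau) for t in thetas]
print("empirical minimizer at angle", thetas[int(np.argmin(losses))])
```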

Clearly, this distribution satisfies the Massart noise condition in Equation (1) with parameter $\beta$. The goal of the above construction is to design distributions where vectors along the direction of $w'$ have smaller hinge loss than those along the direction of $w^*$. Observe that the noise in area A will tend to even out the difference in hinge loss between $w^*$ and $w'$ (since area A is symmetric with respect to these two directions). The noise in area B, however, will help $w'$: since all points in area B are closer to the hyperplane defined by $w'$ than to the one defined by $w^*$, the vector $w^*$ will pay more in hinge loss for the noise in this area. In the corresponding area D of points that are closer to the hyperplane defined by $w^*$ than to the one defined by $w'$ we do not add noise, so the cost for both $w^*$ and $w'$ in this area is small. We show that for every $\alpha$, from a certain noise level $\eta$ on, $w^*$ (or any other vector in its direction) is not the expected hinge minimizer on $D_{\alpha,\eta}$. We then argue that thereby hinge loss minimization will not approximate $w^*$ arbitrarily closely in angle, and can therefore not achieve arbitrarily small excess 0/1-error. Overall, we show that for every (arbitrarily small) bound on the noise $\eta$ and hinge parameter $\tau$, we can choose an angle $\alpha$ such that $\tau$-hinge loss minimization is not consistent for the distribution $D_{\alpha,\eta}$. The details of the proof can be found in the Appendix, Section D.

6 Conclusions

Our work is the first to provide a computationally efficient algorithm under the Massart noise model, a distributional assumption that has been identified in statistical learning to yield fast (statistical) rates of convergence. While both computational and statistical efficiency are crucial in machine learning applications, computational and statistical complexity have been studied under disparate sets of assumptions and models. We view our results on the computational complexity of learning under Massart noise also as a step towards bringing these two lines of research closer together. We hope that this will spur more work identifying situations that lead to both computational and statistical efficiency, to ultimately shed light on the underlying connections and dependencies of these two important aspects of automated learning.

Acknowledgments. This work was supported in part by NSF grants CCF-95319, CCF-, and CCF-1491, a Sloan Research Fellowship, a Microsoft Research Faculty Fellowship, and a Google Research Award.

References

[1] Sanjeev Arora, László Babai, Jacques Stern, and Z. Sweedyk. The hardness of approximate optima in lattices, codes, and systems of linear equations. In Proceedings of the 34th IEEE Annual Symposium on Foundations of Computer Science (FOCS), 1993.

[2] Pranjal Awasthi, Maria-Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), 2014.

[3] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.

[4] Maria-Florina Balcan, Andrei Z. Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), 2007.

[5] Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2013.

[6] Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

[7] Avrim Blum, Alan Frieze, Ravi Kannan, and Santosh Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1-2):35–52, 1998.

[8] Karl-Heinz Borgwardt. The Simplex Method, volume 1 of Algorithms and Combinatorics: Study and Research Texts. Springer-Verlag, Berlin, 1987.

[9] Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[10] Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. In Proceedings of the 20th Annual Conference on Learning Theory (COLT), 2007.

[11] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[12] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing (STOC), 2014.

[13] Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems (NIPS), 2005.

[14] Sanjoy Dasgupta. Active learning. Encyclopedia of Machine Learning, 2011.

[15] Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems (NIPS), 2007.

[16] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13, 2012.

[17] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3), 1997.

[18] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006.

[19] Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

[20] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[21] Steve Hanneke and Liu Yang. Surrogate losses in passive and active learning. CoRR, abs/1207.3772, 2014.

[22] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6), 2008.

[23] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), 2008.

[24] Michael J. Kearns and Ming Li. Learning in the presence of malicious errors (extended abstract). In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (STOC), 1988.

[25] Michael J. Kearns, Robert E. Schapire, and Linda Sellie. Toward efficient agnostic learning. In Proceedings of the 5th Annual Conference on Computational Learning Theory (COLT), 1992.

[26] Adam R. Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM), 2014.

[27] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10:2715–2740, 2009.

[28] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.

[29] Ronald L. Rivest and Robert H. Sloan. A formal model of hierarchical concept learning. Information and Computation, 114(1):88–114, 1994.

[30] Rocco A. Servedio. Efficient Algorithms in Computational Learning Theory. Harvard University, 2001.

[31] Robert H. Sloan. PAC learning, noise, and geometry. In Learning and Geometry: Computational Approaches, pages 1–41. Springer, 1996.

A Probability Lemmas for the Uniform Distribution

The following probability lemmas are used throughout this work. Variations of these lemmas are presented in previous work in terms of their asymptotic behavior [2, 4, 20]. Here, we focus on finding bounds that are tight even where the constants are concerned. Indeed, the improved constants in these bounds are essential to tolerating a constant amount of Massart noise.

Throughout this section, let $D$ be the uniform distribution over a $d$-dimensional ball. Let $f(\cdot)$ indicate the pdf of $D$. For any $d$, let $V_d$ be the volume of a $d$-dimensional unit ball. Ratios between volumes of the unit ball in different dimensions are commonly used to find the probability mass of different regions under the uniform distribution. Note that for any $d$, $V_d = \frac{2\pi}{d}\, V_{d-2}$. The following bound, due to [8], proves useful in our analysis:

$\sqrt{\frac{d}{2\pi}} \;\le\; \frac{V_{d-1}}{V_d} \;\le\; \sqrt{\frac{d+1}{2\pi}}.$

The next lemma provides upper and lower bounds on the probability mass of a band under the uniform distribution.

15 Lemma 4 Let u be any unit vector in R For all a, b [ C, C ], such that C < /, we have b a C V 1 V Pr x D [u x [a, b]] b a V 1 V Proof We have Pr [u x [a, b]] = V 1 x D V b a (1 z ( 1/ z For the upper boun, we note that the integrant is at most 1, so Pr x D [u x [a, b]] V 1 V b a For the lower boun, note that since a, b [ C, C ], the integrant is at least (1 C ( 1/ We know that for any x [, 5], 1 x > 4 x So, assuming that > C, (1 C ( 1/ 4 C ( 1/ C Pr x D [u x [a, b]] b a C V 1 V Lemma 5 Let u an v be two unit vectors in R an let α = θ(u, v Then, c α Pr [sign(u x sign(w x an u x > ] α c ( x D π e Proof Without the loss of generality, we can assume u = (1,,, an w = (cos(α, sin(α,,, Consier the projection of D on the first coorinates Let E be the event we are intereste in We first show that for any x = (x 1, x E, x > c/ Consier x 1 (the other case is symmetric If x E, it must be that x sin(α cα So, x = c α c sin(α c Next, we consier a circle of raius < r < 1 aroun the center, inicate by S(r Let A(r = S(r E be the arc of such circle that is in E Then the length of such arc is the arc-length that falls in the isagreement region, ie, rα, minus the arc-length that falls in the ban of with cα Note, that for every x A(r, x = r, so f(x = V V (1 x ( / = V V (1 r ( / Pr [sign(u x sign(w x an u x > α ] = (rα cα f(r r x D c = /c 1 = V c α = c α π c α π α π α π V 1 /c 1 1 /c ( rc α cα f( cr c r (change of variable z = r /c /c (r 1(1 c r (r 1e r ( (r 1 ( c r r ( / r ( 1( ( c r e ( c r ( 1( ( c r e ( c r 1 [ ] e ( r r= /c r=1 r r 15

16 α π (e c ( e ( / α c ( π e B Proofs of Margin-base Lemmas Proof of Lemma 1 Let L(w = E (x,y Dk l(w, x, y, = k, an b = b k 1 First note that for our choice of b , using Lemma 4 we have that Pr [ w k 1 x < b] b 8539 x D Note that L(w is maximize when w = w k 1 Then L(w (1 a f(a a Pr x D [ w k 1 x < b] (1 a (1 a ( 1/ a b 8539 For the numerator: (1 a (1 a ( 1/ a (1 a e a ( 1/ a 1 e a ( 1/ a 1 ae a ( 1/ a ( π 1 ( 1 erf 1 ( 1 (1 e ( 1 / π ( 1 e ( 1 ( 1 1 ( 1 1 ( 1 π ( 1 8 ( ( 1 1 (( (By Taylor expansion 5463 (By 1 8 ( 1 < 1 4 Where the last inequality follows from the fact that for our choice of parameters 3, so 1 8 ( 1 < 1 5 Therefore, 536( /4 b < L(w b b Proof of Lemma 3 Note that the convex loss minimization proceure returns a vector v k that is not necessarily normalize To consier all vectors in B(w k 1, α k, at step k, the optimization is one over all 16

17 vectors v (of any length such that w k 1 v < α k For all k, α k < 3879π (or 1168, so v k , an as a result l(w k, W l(v k, W We have, err Dk (w k E (x,y Dk l(w k, x, y ( E (x,y l(w Dk k, x, y β b k 1 k (By Lemma l(w k, W β b k 1 k (By Equation l(v k, W β b k 1 k (By v k l(w, W β b k 1 k (By v k minimizing the hinge-loss E (x,y l(w, x, y β b k (By Equation 3 Dk ( k E (x,y Dk l(w, x, y β b k (By Lemma k k b k β b k 1 k (By Lemma 1 Lemma 6 For any constant c, there is m k O(( + log(k/ such that for a ranomly rawn set W of m k labele samples from D k, with probability 1 δ, for any w B(w k+k k 1, α k, E (x,y Dk (l(w, x, y l(w, W c, E (x,y Dk (l(w, x, y l(w, cleane(w c Proof By Lemma H3 of [], l(w, x, y = O( for all (x, y S wk 1,b k 1 an θ(w, w k 1 r k We get the result by applying Lemma H of [] C Initialization We initialize our margin base proceure with the algorithm from [7] The guarantees mentione in [7] ɛ hol as long as the noise rate is η c log 1/ɛ [7] o not explicitly compute the constant but it is easy to check that c 1 56 This can be compute from inequality 17 in the proof of Lemma 16 in [7] We nee the lhs to be at least ɛ / On the rhs, the first term is lower boune by ɛ /51 Hence, we nee the secon term to be at most ɛ The secon term is upper boune by 4c ɛ This implies that c 1/56 D Hinge Loss Minimization In this section, we show that hinge loss minimization is not consistent in our setup, that is, that it oes not lea to arbitrarily small excess error We let B 1 enote the unit ball in R In this section, we will only work with =, thus we set B 1 = B 1 17

18 Recall that the -hinge loss of a vector w R on an example (x, y R { 1, 1} is efine as follows: { } y(w x l (w, x, y = max, 1 For a istribution D over R { 1, 1}, we let L D enote the expecte hinge loss over D, that is L D (w = E (x,y Dl (w, x, y If clear from context, we omit the superscript an write L (w for L D (w Let A be the algorithm that minimizes the empirical -hinge loss over a sample That is, for W = {(x 1, y 1,, (x m, y m }, we have A (W argmin w B1 1 W (x,y W l (w, x, y Hinge loss minimization over halfspaces converges to the optimal hinge loss over all halfspace (it is hinge loss consistent That is, for all ɛ > there is a sample size m(ɛ such that for all istributions D, we have E W Dm[L D (A (W ] min w B 1 L D (w + ɛ In this section, we show that this oes not translate into an agnostic learning guarantee for halfspaces with respect to the /1-loss Moreover, hinge loss minimization is not even consistent with respect to the /1-loss even when restricte to a rather benign classes of istributions P Let P β be the class of istributions D with uniform marginal over the unit ball in R, the Bayes classifier being a halfspace w, an satisfying the Massart noise conition with parameter β We show that there is a istribution D P β an an ɛ an a sample size m such that hinge loss minimization will output a classifier of excess error larger than ɛ on expectation over samples of size larger than m More precisely, for all m m : E W Dm[L D (A (W ] > min w B 1 err D(w + ɛ Formally, our lower boun for hinge loss minimization is state as follows Theorem 3 (Restate For every hinge-loss parameter an every Massart noise parameter β < 1, there exists a istribution D,β P β (that is, a istribution over B 1 { 1, 1} with uniform marginal over B 1 R satisfying the β-massart conition such that -hinge loss minimization is not consistent on P,β with respect to the class of halfspaces That is, there exists an ɛ an a sample size m(ɛ such that hinge loss minimization will output a classifier of excess error larger than ɛ (with high probability over samples of size at least m(ɛ In the section, we use the notation h w for the classifier associate with a vector w B 1, that is h w (x = sign(w x, since for our geometric construction it is convenient to ifferentiate between the two The rest of this section is evote to proving the above theorem A class of istributions 18


More information

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d

A new proof of the sharpness of the phase transition for Bernoulli percolation on Z d A new proof of the sharpness of the phase transition for Bernoulli percolation on Z Hugo Duminil-Copin an Vincent Tassion October 8, 205 Abstract We provie a new proof of the sharpness of the phase transition

More information

Database-friendly Random Projections

Database-friendly Random Projections Database-frienly Ranom Projections Dimitris Achlioptas Microsoft ABSTRACT A classic result of Johnson an Linenstrauss asserts that any set of n points in -imensional Eucliean space can be embee into k-imensional

More information

PETER L. BARTLETT AND MARTEN H. WEGKAMP

PETER L. BARTLETT AND MARTEN H. WEGKAMP CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS PETER L. BARTLETT AND MARTEN H. WEGKAMP Abstract. We consier the problem of binary classification where the classifier can, for a particular cost,

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

Asymptotic Active Learning

Asymptotic Active Learning Asymptotic Active Learning Maria-Florina Balcan Eyal Even-Dar Steve Hanneke Michael Kearns Yishay Mansour Jennifer Wortman Abstract We describe and analyze a PAC-asymptotic model for active learning. We

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

26.1 Metropolis method

26.1 Metropolis method CS880: Approximations Algorithms Scribe: Dave Anrzejewski Lecturer: Shuchi Chawla Topic: Metropolis metho, volume estimation Date: 4/26/07 The previous lecture iscusse they some of the key concepts of

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

From Batch to Transductive Online Learning

From Batch to Transductive Online Learning From Batch to Transductive Online Learning Sham Kakade Toyota Technological Institute Chicago, IL 60637 sham@tti-c.org Adam Tauman Kalai Toyota Technological Institute Chicago, IL 60637 kalai@tti-c.org

More information

Sharp Thresholds. Zachary Hamaker. March 15, 2010

Sharp Thresholds. Zachary Hamaker. March 15, 2010 Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Learning Mixtures of Gaussians with Maximum-a-posteriori Oracle

Learning Mixtures of Gaussians with Maximum-a-posteriori Oracle Satyaki Mahalanabis Dept of Computer Science University of Rochester smahalan@csrochestereu Abstract We consier the problem of estimating the parameters of a mixture of istributions, where each component

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation

Binary Discrimination Methods for High Dimensional Data with a. Geometric Representation Binary Discrimination Methos for High Dimensional Data with a Geometric Representation Ay Bolivar-Cime, Luis Miguel Corova-Roriguez Universia Juárez Autónoma e Tabasco, División Acaémica e Ciencias Básicas

More information

Agnostic Learning of Disjunctions on Symmetric Distributions

Agnostic Learning of Disjunctions on Symmetric Distributions Agnostic Learning of Disjunctions on Symmetric Distributions Vitaly Feldman vitaly@post.harvard.edu Pravesh Kothari kothari@cs.utexas.edu May 26, 2014 Abstract We consider the problem of approximating

More information

12.11 Laplace s Equation in Cylindrical and

12.11 Laplace s Equation in Cylindrical and SEC. 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential 593 2. Laplace s Equation in Cylinrical an Spherical Coorinates. Potential One of the most important PDEs in physics an engineering

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

LECTURE NOTES ON DVORETZKY S THEOREM

LECTURE NOTES ON DVORETZKY S THEOREM LECTURE NOTES ON DVORETZKY S THEOREM STEVEN HEILMAN Abstract. We present the first half of the paper [S]. In particular, the results below, unless otherwise state, shoul be attribute to G. Schechtman.

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

On the Value of Partial Information for Learning from Examples

On the Value of Partial Information for Learning from Examples JOURNAL OF COMPLEXITY 13, 509 544 (1998) ARTICLE NO. CM970459 On the Value of Partial Information for Learning from Examples Joel Ratsaby* Department of Electrical Engineering, Technion, Haifa, 32000 Israel

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri ITA, UC San Diego, 9500 Gilman Drive, La Jolla, CA Sham M. Kakae Karen Livescu Karthik Sriharan Toyota Technological Institute at Chicago, 6045 S. Kenwoo Ave., Chicago, IL kamalika@soe.ucs.eu

More information

Discriminative Learning can Succeed where Generative Learning Fails

Discriminative Learning can Succeed where Generative Learning Fails Discriminative Learning can Succeed where Generative Learning Fails Philip M. Long, a Rocco A. Servedio, b,,1 Hans Ulrich Simon c a Google, Mountain View, CA, USA b Columbia University, New York, New York,

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:

More information

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015

1 Active Learning Foundations of Machine Learning and Data Science. Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 10-806 Foundations of Machine Learning and Data Science Lecturer: Maria-Florina Balcan Lecture 20 & 21: November 16 & 18, 2015 1 Active Learning Most classic machine learning methods and the formal learning

More information

Estimating Causal Direction and Confounding Of Two Discrete Variables

Estimating Causal Direction and Confounding Of Two Discrete Variables Estimating Causal Direction an Confouning Of Two Discrete Variables This inspire further work on the so calle aitive noise moels. Hoyer et al. (2009) extene Shimizu s ientifiaarxiv:1611.01504v1 [stat.ml]

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Upper and Lower Bounds on ε-approximate Degree of AND n and OR n Using Chebyshev Polynomials

Upper and Lower Bounds on ε-approximate Degree of AND n and OR n Using Chebyshev Polynomials Upper an Lower Bouns on ε-approximate Degree of AND n an OR n Using Chebyshev Polynomials Mrinalkanti Ghosh, Rachit Nimavat December 11, 016 1 Introuction The notion of approximate egree was first introuce

More information

inflow outflow Part I. Regular tasks for MAE598/494 Task 1

inflow outflow Part I. Regular tasks for MAE598/494 Task 1 MAE 494/598, Fall 2016 Project #1 (Regular tasks = 20 points) Har copy of report is ue at the start of class on the ue ate. The rules on collaboration will be release separately. Please always follow the

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization

On the Complexity of Bandit and Derivative-Free Stochastic Convex Optimization JMLR: Workshop an Conference Proceeings vol 30 013) 1 On the Complexity of Banit an Derivative-Free Stochastic Convex Optimization Oha Shamir Microsoft Research an the Weizmann Institute of Science oha.shamir@weizmann.ac.il

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Learning convex bodies is hard

Learning convex bodies is hard Learning convex bodies is hard Navin Goyal Microsoft Research India navingo@microsoft.com Luis Rademacher Georgia Tech lrademac@cc.gatech.edu Abstract We show that learning a convex body in R d, given

More information

Margin Based Active Learning

Margin Based Active Learning Margin Based Active Learning Maria-Florina Balcan 1, Andrei Broder 2, and Tong Zhang 3 1 Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. ninamf@cs.cmu.edu 2 Yahoo! Research, Sunnyvale,

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

Classification Methods with Reject Option Based on Convex Risk Minimization

Classification Methods with Reject Option Based on Convex Risk Minimization Journal of Machine Learning Research 11 (010) 111-130 Submitte /09; Revise 11/09; Publishe 1/10 Classification Methos with Reject Option Base on Convex Risk Minimization Ming Yuan H. Milton Stewart School

More information

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016

arxiv: v2 [cond-mat.stat-mech] 11 Nov 2016 Noname manuscript No. (will be inserte by the eitor) Scaling properties of the number of ranom sequential asorption iterations neee to generate saturate ranom packing arxiv:607.06668v2 [con-mat.stat-mech]

More information

A Unified Theorem on SDP Rank Reduction

A Unified Theorem on SDP Rank Reduction A Unifie heorem on SDP Ran Reuction Anthony Man Cho So, Yinyu Ye, Jiawei Zhang November 9, 006 Abstract We consier the problem of fining a low ran approximate solution to a system of linear equations in

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

arxiv: v2 [cs.cc] 13 Mar 2016

arxiv: v2 [cs.cc] 13 Mar 2016 Complexity Theoretic Limitations on Learning Halfspaces Amit Daniely March 15, 2016 arxiv:1505.05800v2 [cs.cc] 13 Mar 2016 Abstract We stuy the problem of agnostically learning halfspaces which is efine

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

Efficient Semi-supervised and Active Learning of Disjunctions

Efficient Semi-supervised and Active Learning of Disjunctions Maria-Florina Balcan ninamf@cc.gatech.edu Christopher Berlind cberlind@gatech.edu Steven Ehrlich sehrlich@cc.gatech.edu Yingyu Liang yliang39@gatech.edu School of Computer Science, College of Computing,

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Physics 505 Electricity and Magnetism Fall 2003 Prof. G. Raithel. Problem Set 3. 2 (x x ) 2 + (y y ) 2 + (z + z ) 2

Physics 505 Electricity and Magnetism Fall 2003 Prof. G. Raithel. Problem Set 3. 2 (x x ) 2 + (y y ) 2 + (z + z ) 2 Physics 505 Electricity an Magnetism Fall 003 Prof. G. Raithel Problem Set 3 Problem.7 5 Points a): Green s function: Using cartesian coorinates x = (x, y, z), it is G(x, x ) = 1 (x x ) + (y y ) + (z z

More information

Structural Risk Minimization over Data-Dependent Hierarchies

Structural Risk Minimization over Data-Dependent Hierarchies Structural Risk Minimization over Data-Depenent Hierarchies John Shawe-Taylor Department of Computer Science Royal Holloway an Befor New College University of Lonon Egham, TW20 0EX, UK jst@cs.rhbnc.ac.uk

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information