A New Perspective on Learning Linear Separators with Large L_q L_p Margins


Maria-Florina Balcan, Georgia Institute of Technology
Christopher Berlind, Georgia Institute of Technology

Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Abstract

We give theoretical and empirical results that provide new insights into large margin learning. We prove a bound on the generalization error of learning linear separators with large L_q L_p margins (where L_q and L_p are dual norms) for any finite p ≥ 1. The bound leads to a simple data-dependent sufficient condition for fast learning, in addition to extending and improving upon previous results. We also provide the first study that shows the benefits of taking advantage of margins with p < 2 over margins with p ≥ 2. Our experiments confirm that our theoretical results are relevant in practice.

1 INTRODUCTION

The notion of margin arises naturally in many areas of machine learning. Margins have long been used to motivate the design of algorithms, to give sufficient conditions for fast learning, and to explain unexpected performance of algorithms in practice. Here we are concerned with learning the class of homogeneous linear separators in R^d over distributions with large margins. We use a general notion of margin, the L_q L_p margin, that captures, among others, the notions used in the analyses of Perceptron (p = q = 2) and Winnow (p = ∞, q = 1). For p, q ∈ [1, ∞] with 1/p + 1/q = 1, the L_q L_p margin of a linear classifier x ↦ sign(w · x) with respect to a distribution D is defined as

    γ_{q,p}(D, w) = inf_{x∼D} |w · x| / (‖w‖_q ‖x‖_p).

While previous work has addressed the case of p ≥ 2 both theoretically (Grove et al., 2001; Servedio, 2000; Gentile, 2003) and experimentally (Zhang, 2002), the p < 2 case has been mentioned but much less explored. This gap in the literature is possibly due to the fact that when p < 2 a large margin alone does not guarantee small sample complexity (see Example 1 below for such a case). This leads to the question of whether large L_q L_p margins with p < 2 can lead to small sample complexity, and if so, under what conditions this will happen. Furthermore, are there situations in which taking advantage of margins with p < 2 can lead to better performance than using margins with p ≥ 2?
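For concreteness, the empirical L_q L_p margin of a finite labeled sample can be computed as in the following sketch (illustrative NumPy code, not from the paper; the helper name lq_lp_margin is ours).

```python
import numpy as np

def lq_lp_margin(X, y, w, p):
    """X: (n, d) array of points, y: labels in {-1, +1}, w: weight vector, p: norm exponent."""
    q = np.inf if p == 1 else (1.0 if np.isinf(p) else p / (p - 1.0))  # dual exponent, 1/p + 1/q = 1
    point_norms = np.linalg.norm(X, ord=p, axis=1)   # ||x_i||_p for each point
    functional = y * (X @ w)                         # y_i (w . x_i); equals |w . x_i| when w is consistent
    return np.min(functional / point_norms) / np.linalg.norm(w, ord=q)

# Example: standard basis vectors in R^d labeled by a sign-vector target.
d = 8
w_star = np.ones(d)
X = np.eye(d)
y = np.sign(X @ w_star)
print(lq_lp_margin(X, y, w_star, p=1))       # L_inf L_1 margin: 1.0
print(lq_lp_margin(X, y, w_star, p=np.inf))  # L_1 L_inf margin: 1/d
```

On the standard basis vectors labeled by a sign vector, this gives an L_∞ L_1 margin of 1 and an L_1 L_∞ margin of 1/d, illustrating how strongly the value depends on the choice of p.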
In this work, we answer these three questions using both theoretical and empirical evidence. We first give a bound on the generalization error of linear separators with large L_q L_p margins that holds for any finite p ≥ 1. The result is proved through a new bound on the fat-shattering dimension of linear separators with bounded L_q norm. The bound improves upon previous results by removing a factor of log d when 2 ≤ p < ∞ and extends the previously known bounds to the case of 1 ≤ p < 2. A highlight of this theoretical result is that it gives a simple sufficient condition for fast learning even for the p < 2 case. The condition is related to the L_{2,p} norm of the data matrix and can be estimated from the data.

We then give a concrete family of learning problems in which using the L_∞ L_1 margin gives significantly better sample complexity guarantees than for L_q L_p margins with p > 1. We define a family of distributions over labeled examples and consider the sample complexity of learning the class W_p of linear separators with large L_q L_p margins. By bounding covering numbers, we upper bound the sample complexity of learning W_1 and lower bound the complexity of learning W_p when p > 1, and we show that the upper bound can be significantly smaller than the lower bound.

In addition, we give experimental results supporting our claim that taking advantage of large L_∞ L_1 margins can lead to faster learning. We observe that in the realizable case, the problem of finding a consistent linear separator that maximizes the L_q L_p margin is a convex program (similar to SVM). An extension of this method to the non-realizable case is equivalent to minimizing the L_q-norm regularized hinge loss. We apply these margin-maximization algorithms to both synthetic and real-world data sets and find that maximizing the L_∞ L_1 margin can result in better classifiers than maximizing other margins. We also show that the theoretical condition for fast learning that appears in our generalization bound is favorably satisfied on many real-world data sets.

Related Work. It has long been known that the classic algorithms Perceptron (Rosenblatt, 1958) and Winnow (Littlestone, 1988) have mistake bounds of 1/γ_{2,2}² and Õ(1/γ_{1,∞}²), respectively. A family of quasi-additive algorithms (Grove et al., 2001) interpolates between the behavior of Perceptron and Winnow by defining a Perceptron-like algorithm for any p > 1. While this gives an algorithm for any 1 < p ≤ ∞, the mistake bound of Õ(1/γ_{q,p}²) only applies for p ≥ 2. For small values of p, these algorithms can be used to learn nonlinear separators by using factorizable kernels (Gentile, 2013). A related family (Servedio, 2000) was designed for learning in the PAC model rather than the mistake bound model, but again, guarantees were only given for p ≥ 2.

Other works (Kakade et al., 2009; Cortes et al., 2010; Kloft and Blanchard, 2012; Maurer and Pontil, 2012) bound the Rademacher complexity of classes of linear separators under general forms of regularization. Special cases of each of these regularization methods correspond to L_q-norm regularization, which is closely related to maximizing the L_q L_p margin. Kakade et al. (2009) directly consider the case of L_q-norm regularization but only give Rademacher complexity bounds for the case of p ≥ 2. Both Cortes et al. (2010) and Kloft and Blanchard (2012) give Rademacher complexity bounds that cover the entire range 1 ≤ p ≤ ∞ in the context of multiple kernel learning, but their discussion of excess risk bounds for different choices of p is limited to the p ≥ 2 case, while our work discusses the generalization error over the entire range. Maurer and Pontil (2012) consider the more general setting of block-norm regularized linear classes but only give bounds for the case of p ≥ 2. In contrast to our work, none of the above works give lower bounds on the sample complexity or give concrete evidence of when some values of p will result in faster learning than others.

There are a few other cases in the literature where the p < 2 regime is discussed. Zhang (2002) deals with algorithms for L_q-norm regularized loss minimization and discusses cases in which L_∞-norm regularization may be appropriate. Balcan et al. (2013) study the problem of learning two-sided disjunctions in the semi-supervised setting and note that the regularity assumption inherent in the problem can be interpreted as a large L_∞ L_1 margin assumption. A related problem posed by Blum and Balcan (2007), the two-sided majority with margins problem, has a slightly modified form which constitutes another natural occurrence of a large L_∞ L_1 margin (see supplementary material).

2 PRELIMINARIES

Let D be a distribution over a bounded instance space X ⊆ R^d. A linear separator over X is a classifier h(x) = sign(w · x) for some weight vector w ∈ R^d. We use h* and w* to denote the target function and weight vector, respectively, so that h*(x) = sign(w* · x) gives the label for any instance x, and err_D(h) = Pr_{x∼D}[h(x) ≠ h*(x)] is the generalization error of any hypothesis h. We will often abuse notation and refer to a classifier and its corresponding weight vector interchangeably. We will overload the notation X to represent either a set of n points in R^d or the d × n matrix containing one point per column.

For any point x = (x_1, ..., x_d) ∈ R^d and p ≥ 1, the L_p-norm of x is ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p} and the L_∞-norm is ‖x‖_∞ = max_i |x_i|. Let X_p denote sup_{x∈X} ‖x‖_p, which is finite for any p by our assumption that X is bounded. The L_q-norm is the dual of the L_p-norm if 1/p + 1/q = 1 (so the L_∞-norm and the L_1-norm are duals). In this work, p and q will always denote dual norms. For any weight vector w, the L_q L_p margin of w with respect to D is defined as

    γ_{q,p}(D, w) = inf_{x∼D} |w · x| / (‖w‖_q ‖x‖_p).

We can similarly define γ_{q,p}(X, w) for a set X. We will drop the first argument when referring to the distribution-based definition and the distribution is clear from context. We assume that D has a positive margin; that is, there exists w* such that γ_{q,p}(D, w*) > 0. Note that by Hölder's inequality, |w · x| ≤ ‖w‖_q ‖x‖_p for dual p and q, so γ_{q,p}(D, w) ≤ 1. We also define the L_{a,b} matrix norm

    ‖M‖_{a,b} = ( Σ_{i=1}^{r} ( Σ_{j=1}^{c} |m_{ij}|^a )^{b/a} )^{1/b}

for any r × c matrix M = (m_{ij}). In other words, we take the L_a-norm of each row in the matrix and then take the L_b-norm of the resulting vector of L_a-norms.

2.1 L_q L_p Support Vector Machines

Given a linearly separable set X of n labeled examples, we can solve the convex program

    min_w   ‖w‖_q
    s.t.    y_i (w · x_i) / ‖x_i‖_p ≥ 1,   1 ≤ i ≤ n            (1)

to maximize the L_q L_p margin. Observe that a solution ŵ to this problem has γ_{q,p}(X, ŵ) = 1/‖ŵ‖_q. We call an algorithm that outputs a solution to (1) an L_q L_p SVM due to the close relationship between this problem and the standard support vector machine. If X is not linearly separable, we can introduce nonnegative slack variables in the usual way and solve

    min_{w, ξ≥0}   ‖w‖_q + C Σ_{i=1}^n ξ_i
    s.t.           y_i (w · x_i) / ‖x_i‖_p ≥ 1 − ξ_i,   1 ≤ i ≤ n            (2)

which is equivalent to minimizing the hinge loss with respect to an L_p-normalized data set using L_q-norm regularization on the weight vector.

3 GENERALIZATION BOUND

In this section we give an upper bound on the generalization error of learning linear separators over distributions with large L_q L_p margins. The proof follows from combining a theorem of Bartlett and Shawe-Taylor (1999) with a new bound on the fat-shattering dimension of the class of linear separators with small L_q-norm. We begin with the following definitions.

Definition. For a set F of real-valued functions on X, a finite set {x_1, ..., x_n} ⊆ X is said to be γ-shattered by F if there are real numbers r_1, ..., r_n such that for all b = (b_1, ..., b_n) ∈ {−1, 1}^n, there is a function f_b ∈ F such that

    f_b(x_i) ≥ r_i + γ  if b_i = 1,
    f_b(x_i) ≤ r_i − γ  if b_i = −1.

The fat-shattering dimension of F at scale γ, denoted fat_F(γ), is the size of the largest subset of X which is γ-shattered by F.

Our bound on the fat-shattering dimension will use two lemmas analogous to Lemmas 11 and 12 in (Servedio, 2000).

Lemma 1. Let F = {x ↦ w · x : ‖w‖_q ≤ W_q} with p ≥ 1. If the set {x_1, ..., x_n} ∈ X^n is γ-shattered by F then every b = (b_1, ..., b_n) ∈ {−1, 1}^n satisfies ‖Σ_{i=1}^n b_i x_i‖_p ≥ γn / W_q.

Proof. The proof is identical to that of Lemma 11 in (Servedio, 2000), replacing the radius 1/X_p of F in their lemma with W_q.

The next lemma will depend on the following classical result from probability theory known as the Khintchine inequality.

Theorem 1 (Khintchine). If the random variable σ = (σ_1, ..., σ_n) is uniform over {−1, 1}^n and 0 < p < ∞, then any finite set {z_1, ..., z_n} ⊆ C satisfies

    A_p ( Σ_{i=1}^n |z_i|² )^{1/2}  ≤  ( E |Σ_{i=1}^n σ_i z_i|^p )^{1/p}  ≤  B_p ( Σ_{i=1}^n |z_i|² )^{1/2}

where A_p and B_p are constants depending only on p. The precise optimal constants for A_p and B_p were found by Haagerup (1982), but for our purposes it suffices to note that when p ≥ 1 we have 1/√2 ≤ A_p ≤ 1 and 1 ≤ B_p ≤ √p.

Lemma 2. For any set X = {x_1, ..., x_n} ∈ X^n and any 1 ≤ p < ∞, there is some b = (b_1, ..., b_n) ∈ {−1, 1}^n such that ‖Σ_{i=1}^n b_i x_i‖_p ≤ B_p ‖X‖_{2,p}.

Proof. We will bound the expectation of ‖Σ_i b_i x_i‖_p when b = (b_1, ..., b_n) is uniformly distributed over {−1, 1}^n. We have

    E ‖Σ_{i=1}^n b_i x_i‖_p = E [ ( Σ_{j=1}^d | Σ_{i=1}^n b_i x_{i,j} |^p )^{1/p} ]
                            ≤ ( Σ_{j=1}^d E | Σ_{i=1}^n b_i x_{i,j} |^p )^{1/p}
                            ≤ ( Σ_{j=1}^d B_p^p ( Σ_{i=1}^n x_{i,j}² )^{p/2} )^{1/p}
                            = B_p ‖X‖_{2,p},

where the first inequality is an application of Jensen's inequality and the second uses the Khintchine inequality. The proof is completed by noting that there must be some choice of b for which the value of ‖Σ_i b_i x_i‖_p is smaller than its expectation.

We can use these two lemmas to give an upper bound on the fat-shattering dimension for any finite p.
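As an illustration of how the convex program (1) can be solved in practice, here is a minimal sketch using the cvxpy modeling package (an assumption on our part; the paper only says that standard convex optimization software was used, and the function name lq_lp_svm is ours).

```python
# A hedged sketch of the hard-margin L_q L_p SVM in program (1):
# minimize ||w||_q subject to y_i (w . x_i) / ||x_i||_p >= 1.
# Assumes cvxpy with a conic solver that supports p-norms (e.g., SCS).
import numpy as np
import cvxpy as cp

def lq_lp_svm(X, y, p):
    """X: (n, d) array of points, y: labels in {-1, +1}, p: primal norm exponent."""
    n, d = X.shape
    q = np.inf if p == 1 else (1.0 if np.isinf(p) else p / (p - 1.0))  # dual exponent
    Xn = X / np.linalg.norm(X, ord=p, axis=1, keepdims=True)           # L_p-normalize each point
    w = cp.Variable(d)
    constraints = [cp.multiply(y, Xn @ w) >= 1]                        # functional margin >= 1
    cp.Problem(cp.Minimize(cp.norm(w, q)), constraints).solve(solver=cp.SCS)
    w_hat = w.value
    return w_hat, 1.0 / np.linalg.norm(w_hat, ord=q)                   # gamma_{q,p}(X, w_hat) = 1/||w_hat||_q
```

The soft-margin program (2) can be handled the same way by adding nonnegative slack variables and the penalty term C Σ_i ξ_i to the objective.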

Theorem 2. Let F = {x ↦ w · x : ‖w‖_q ≤ W_q} with 1 ≤ p < ∞. If there is a constant C = C(d, p) independent of n such that ‖X‖_{2,p} ≤ C n^α X_p for any set X of n examples drawn from D, then

    fat_F(γ) ≤ ( C B_p W_q X_p / γ )^{1/(1−α)}.

Proof. Combining Lemmas 1 and 2, we have that any set X = {x_1, ..., x_n} ∈ X^n that is γ-shattered by F satisfies γn / W_q ≤ B_p ‖X‖_{2,p} ≤ C B_p n^α X_p. Solving for n gives us

    n ≤ ( C B_p W_q X_p / γ )^{1/(1−α)}

as an upper bound on the maximum size of any γ-shattered set.

This bound extends and improves upon Theorem 8 in (Servedio, 2000). In their specific setting, W_q = 1/X_p, so we can directly compare their bound

    fat_F(γ) ≤ p² log(4d) / γ²            (3)

for p ≥ 2 to our bound

    fat_F(γ) ≤ ( C B_p / γ )^{1/(1−α)}            (4)

for 1 ≤ p < ∞. Observe that by Minkowski's inequality, any set X satisfies ‖X‖_{2,p} ≤ n^{1/2} X_p if p ≥ 2. In this case, (4) simplifies to (B_p/γ)², which is dimension-independent and improves upon (3) by a factor of log d when p is constant. When 1 ≤ p < 2, (3) does not apply, but (4) still gives a bound that can be small in many cases depending on the relationship between ‖X‖_{2,p} and γ. We will give specific examples in Section 3.1.

The fat-shattering dimension is relevant due to the following theorem of Bartlett and Shawe-Taylor (1999) that relates the generalization performance of a classifier with large margin to the fat-shattering dimension of the associated real-valued function class at a scale of roughly the margin of the classifier.

Theorem 3 (Bartlett & Shawe-Taylor). Let F be a collection of real-valued functions on a set X and let D be a distribution over X. Let X = {x_1, ..., x_n} be a set of examples drawn i.i.d. from D with labels y_i = h*(x_i) for each i. With probability at least 1 − δ, if a classifier h(x) = sign(f(x)) with f ∈ F satisfies y_i f(x_i) ≥ γ > 0 for each x_i ∈ X, then

    err_D(h) ≤ (2/n) ( k log(8en/k) log(32n) + log(8n/δ) ),

where k = fat_F(γ/16).

Now we can state and prove the following theorem which bounds the generalization performance of the L_q L_p SVM algorithm.

Theorem 4. For any distribution D and target w* with γ_{q,p}(D, w*) ≥ γ_{q,p}, if there is a constant C = C(d, p) such that ‖X‖_{2,p} ≤ C n^α X_p for any set X of n examples from D, then there is a polynomial time algorithm that outputs, with probability at least 1 − δ, a classifier h such that

    err_D(h) = O( (1/n) ( ( C B_p / γ_{q,p} )^{1/(1−α)} log² n + log(n/δ) ) ).

Proof. By the definition of the L_q L_p margin, there exists a w (namely, w*/‖w*‖_q) with ‖w‖_q = 1 that achieves margin γ_{q,p} with respect to D. This w has margin at least γ_{q,p} with respect to any set X of n examples from D. A vector ŵ satisfying these properties can be found in polynomial time by solving the convex program (1) and normalizing the solution. Notice that if the sample is normalized to have ‖x‖_p = 1 for every x ∈ X then the L_q L_p margin of ŵ does not change but becomes equal to the functional margin y(ŵ · x) appearing in Theorem 3. Applying Theorem 2 with W_q = 1 and X_p = 1 yields fat_F(γ) ≤ (C B_p / γ)^{1/(1−α)}, and applying Theorem 3 to ŵ gives us the desired result.

Theorem 4 tells us that if the quantity

    ( C B_p / γ_{q,p} )^{1/(1−α)}            (5)

is small for a certain choice of p and q then the L_q L_p SVM will have good generalization. This gives us a data-dependent bound, as (5) depends on the data; specifically, C and α depend on the distribution D alone, while γ_{q,p} depends on the relationship between D and the target w*. As mentioned, if p ≥ 2 then we can use C = 1 and α = 1/2 for any distribution, in which case the bound depends solely on the margin γ_{q,p} (and to a lesser extent on B_p). If p ≤ 2 then any set has ‖X‖_{2,p} ≤ n^{1/p} X_p (this follows by subadditivity of the function z ↦ z^{p/2} when p ≤ 2) and we can obtain a similar dimension-independent bound with C = 1 and α = 1/p. Achieving dimension independence for all distributions comes at the price of the bound becoming uninformative as p → 1, as (5) simplifies to (B_p/γ_{q,p})^q for these values. More interesting situations arise when we consider the quantity (5) for specific distributions, as we will show in the next section.

3.1 Examples

Here we will give some specific learning problems showing when large margins can be helpful and when they are not helpful. We focus on the p < 2 case, as large margins are always helpful when p ≥ 2.
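The data-dependent quantities in Theorem 4 can be computed directly on a sample. The sketch below (illustrative NumPy code, not from the paper; the helper names are ours) computes ‖X‖_{2,p} for a d × n data matrix and the exponent α that makes ‖X‖_{2,p} = n^α X_p hold with C = 1.

```python
import numpy as np

def l2p_matrix_norm(M, p):
    """L_{2,p} norm of a matrix M: L_2 norm of each row, then L_p norm of the row norms."""
    row_norms = np.linalg.norm(M, ord=2, axis=1)
    return np.linalg.norm(row_norms, ord=p)

def implied_alpha(points, p):
    """points: (n, d) array; alpha such that ||X||_{2,p} = n^alpha X_p with C = 1,
    where X is the d x n matrix with one point per column and X_p = max_i ||x_i||_p."""
    n = points.shape[0]
    X_p = np.max(np.linalg.norm(points, ord=p, axis=1))
    return np.log(l2p_matrix_norm(points.T, p) / X_p) / np.log(n)

# For i.i.d. standard Gaussian data the implied alpha is roughly 1/2 for any p >= 1.
pts = np.random.default_rng(0).normal(size=(200, 30))
print(implied_alpha(pts, p=1), implied_alpha(pts, p=2))
```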

Example 1 (Unhelpful margins). First, let D_1 be the uniform distribution over the standard basis vectors in R^d and let w* be any weight vector in {−1, 1}^d. In this case γ_{q,p} = d^{−1/q}, which is a large margin for small p. If n ≤ d, then ‖X‖_{2,p} is roughly n^{1/p} X_p (ignoring log factors), and (5) simplifies to B_p^q d. We could also choose to simplify (5) using C = d^{1/p} and α = 0, which gives us B_p d. Either way, the bound in Theorem 4 becomes Õ(d/n), which is uninformative since n ≤ d. If we take n ≥ d, we can still obtain a bound of Õ(d/n), but this is the same as the worst-case bound based on VC dimension, so the large margin has no advantage. In fact, this example provides a lower bound: even if an algorithm knows the distribution D_1 and is allowed a 1/2 probability of failure, an error of ε cannot be guaranteed with fewer than (1 − 2ε)d examples, because any algorithm can hope for at best an error rate of 1/2 on the examples it has not yet seen.

Example 2 (Helpful margins). As another example, divide the d coordinates into k = o(d) disjoint blocks of equal size and let D_2 be the uniform distribution over examples that have 1's for all coordinates within some block and 0's elsewhere. Taking w* to be a vector in {−1, 1}^d that has the same sign within each block, we have γ_{q,p} = k^{−1/q}. If k < n < d then ‖X‖_{2,p} is roughly k^{1/p − 1/2} √n X_p, and (5) simplifies to B_p² k. When k = o(d) this is a significant improvement over worst-case bounds for any constant choice of p.

Example 3 (An advantage for p < 2). Consider a distribution that is a combination of the previous two examples: with probability 1/2 it returns an example drawn from D_1 and otherwise returns an example from D_2. By including the basis vectors, we have made the margin γ_{q,p} = d^{−1/q}, but as long as k = o(d) the bound on ‖X‖_{2,p} does not change significantly from Example 2, and we can still use C = k^{1/p − 1/2} and α = 1/2. Now (5) simplifies to B_p² k for p = 1, but becomes B_p² k^{2/p − 1} d^{2/q} in general. When k = √d this gives us an error bound of Õ(√d/n) for p = 1 but Õ(d/n) or worse for p ≥ 2. While this upper bound does not imply that generalization error will be worse for p ≥ 2 than it is for p = 1, we show in the next section that for a slightly modified version of this distribution we can obtain sample complexity lower bounds for large margin algorithms with p ≥ 2 that are significantly greater than the upper bound for p = 1.

4 THE CASE FOR L_∞ L_1 MARGINS

Here we give a family of learning problems to show the benefits of using L_∞ L_1 margins over other margins. We do this by defining a distribution D over unlabeled examples in R^d that can be consistently labeled by a variety of potential target functions w*. We then consider a family of large L_q L_p margin concept classes W_p and bound the sample complexity of learning a concept in W_p using covering number bounds. We show that learning W_1 can be much easier than learning W_p for p > 1; for example, with certain parameters for D, having O(√d) examples is sufficient for learning W_1, while learning any other W_p requires Ω(d) examples.

Specifically, let W_p = {w ∈ R^d : ‖w‖_∞ = 1, γ_{q,p}(w) ≥ γ_{q,p}(w*)}, where w* maximizes the L_∞ L_1 margin with respect to D. We restrict our discussion to weight vectors with unit L_∞ norm because normalization does not change the margin (nor does it affect the output of the corresponding classifier). Let the covering number N(ε, W_p, D) be the size of the smallest set V ⊆ W_p such that for every w ∈ W_p there exists a v ∈ V with d_D(w, v) := Pr_{x∼D}[sign(w · x) ≠ sign(v · x)] ≤ ε.

Define the distribution D over {0, 1}^d as follows. Divide the d coordinates into k disjoint blocks of size d/k (assume d/(2k) is an odd integer). Flip a fair coin. If heads, pick a random block and return an example with exactly d/(2k) randomly chosen coordinates set to 1 within the chosen block and all other coordinates set to 0. If tails, return a standard basis vector (exactly one coordinate set to 1) chosen uniformly at random. The target function will be determined by any weight vector w* achieving the maximum L_∞ L_1 margin with respect to D. As we will see, w* can be any vector in {−1, 1}^d with complete agreement within each block; a sampler for this construction is sketched below.

We first give an upper bound on the covering number of W_1.

Proposition 1. For any ε > 0, N(ε, W_1, D) ≤ 2^k.

Proof. Let V_k be the following set. Divide the d coordinates into k disjoint blocks of size d/k. A vector v ∈ {−1, 1}^d is a member of V_k if and only if each block in v is entirely +1's or entirely −1's. We will show that W_1 = V_k, and since |V_k| = 2^k we will have N(ε, W_1, D) ≤ 2^k for any ε. Note that by Hölder's inequality, γ_{q,p}(w) ≤ 1 for any w ∈ R^d. For any w ∈ V_k and any example x drawn from D, we have |w · x| = ‖x‖_1, so γ_{∞,1}(w) = 1, the maximum margin. If w ∉ V_k then either w ∉ {−1, 1}^d or w has sign disagreement within at least one of the k blocks. If w ∉ {−1, 1}^d then γ_{∞,1}(w) = min_x |w · x| / ‖x‖_1 ≤ min_i |w · e_i| = min_i |w_i| < 1. If w has sign disagreement within a block then |w · x| < ‖x‖_1 for any x with 1's in disagreeing coordinates of w, and this results in a margin strictly less than 1.

Now we will give lower bounds on the covering numbers for W_p with p > 1. In the following let H(α) = −α log(α) − (1 − α) log(1 − α), the binary entropy function.
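For concreteness, the following sketch (illustrative NumPy code, not from the paper; helper names are ours) draws samples from the distribution D just defined and builds a block-sign target w*; the last line reuses the hypothetical lq_lp_margin helper from the earlier sketch to check that w* attains the maximum L_∞ L_1 margin.

```python
# An illustrative sampler for the distribution D of Section 4 and a block-sign target.
# With probability 1/2: exactly d/(2k) ones inside a uniformly chosen block;
# otherwise: a uniformly chosen standard basis vector.
import numpy as np

rng = np.random.default_rng(0)

def sample_D(d, k, n):
    block = d // k                                   # block size (the paper assumes d/(2k) is odd)
    X = np.zeros((n, d))
    for i in range(n):
        if rng.random() < 0.5:
            b = rng.integers(k)                      # heads: pick a block
            ones = rng.choice(block, size=block // 2, replace=False)
            X[i, b * block + ones] = 1.0
        else:
            X[i, rng.integers(d)] = 1.0              # tails: standard basis vector
    return X

def block_target(d, k):
    return np.repeat(rng.choice([-1.0, 1.0], size=k), d // k)   # one sign per block

X = sample_D(d=40, k=4, n=200)       # block size 10, so d/(2k) = 5 is odd
w_star = block_target(40, 4)
y = np.sign(X @ w_star)
print(lq_lp_margin(X, y, w_star, p=1))   # L_inf L_1 margin of w* on the sample: 1.0
```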

Proposition 2. If 1 < p < ∞ then for any ε > 0,

    N(ε, W_p, D) ≥ 2^{(1/2 − H(2ε))d − k^{1/q}(d/2)^{1/p} − k}.

Proof. First we show that |W_p| ≥ 2^{d/2 − k^{1/q}(d/2)^{1/p} − k}. Any w* ∈ W_1 has margin γ_{q,p}(w*) = d^{−1/q}, so W_p = {w ∈ R^d : γ_{q,p}(w) ≥ d^{−1/q}}. Note that W_p ⊆ {−1, 1}^d, because if w ∉ {−1, 1}^d then γ_{q,p}(w) ≤ min_i |w · e_i| / ‖w‖_q = min_i |w_i| / ‖w‖_q < d^{−1/q}. Let w ∈ {−1, 1}^d be a weight vector such that in each block there are at least d/k − r positive values and at most r negative values, or vice versa (there are at most r values with whichever sign is in the minority). Clearly w has large margin with respect to any of the basis vectors drawn from D. For the rest of D we have inf_x |w · x| = max(1, d/(2k) − 2r), so w ∈ W_p if and only if max(1, d/(2k) − 2r) ≥ (d/(2k))^{1/p}. For p < ∞ and d > 2k, this happens if and only if

    r ≤ (1/2) ( d/(2k) − (d/(2k))^{1/p} ).

Letting r = ⌊ (1/2) ( d/(2k) − (d/(2k))^{1/p} ) ⌋, we have

    |W_p| = ( 2 Σ_{i=0}^{r} C(d/k, i) )^k ≥ 2^{d/2 − k^{1/q}(d/2)^{1/p} − k}.

Now we can lower bound the covering number using a volume argument by noting that if m is the cardinality of the largest ε-ball around any w ∈ W_p, then N(ε, W_p, D) ≥ |W_p|/m. For any pair w, w' ∈ W_p, d_D(w, w') ≥ h(w, w')/(2d), where h(w, w') is the Hamming distance (the number of coordinates in which w and w' disagree). Therefore, in order for d_D(w, w') ≤ ε we need h(w, w') ≤ 2εd. For any w ∈ W_p the number of w' such that h(w, w') ≤ 2εd is at most Σ_{i=0}^{2εd} C(d, i) ≤ 2^{H(2ε)d}. It follows that

    N(ε, W_p, D) ≥ |W_p| / 2^{H(2ε)d} ≥ 2^{d/2 − H(2ε)d − k^{1/q}(d/2)^{1/p} − k}.

Proposition 3. For any ε > 0, N(ε, W_∞, D) ≥ 2^{(1 − H(2ε))d}.

Proof. First we show that W_∞ = {−1, 1}^d. Any w* ∈ W_1 has margin γ_{1,∞}(w*) = 1/d, so W_∞ = {w ∈ R^d : γ_{1,∞}(w) ≥ 1/d}. For any w ∈ {−1, 1}^d and example x drawn from D, we have ‖w‖_1 = d, ‖x‖_∞ = 1, and |w · x| ≥ 1 (since x has an odd number of coordinates set to 1), resulting in margin γ_{1,∞}(w) ≥ 1/d. If w ∉ {−1, 1}^d then γ_{1,∞}(w) = min_x |w · x| / ‖w‖_1 ≤ min_i |w · e_i| / ‖w‖_1 = min_i |w_i| / ‖w‖_1 < 1/d. To bound the covering number, we use the same volume argument as above. Again, the size of the largest ε-ball around any w ∈ W_∞ is at most 2^{H(2ε)d} (since this bound only requires that every pair w, w' ∈ W_∞ has d_D(w, w') ≥ h(w, w')/(2d)). It follows that N(ε, W_∞, D) ≥ 2^d / 2^{H(2ε)d}.

Using standard distribution-specific sample complexity bounds based on covering numbers (Itai and Benedek, 1991), we have an upper bound of O((1/ε) ln(N(ε, W, D)/δ)) and a lower bound of ln((1 − δ) N(2ε, W, D)) for learning, with probability at least 1 − δ, a concept in W to within ε error. Thus, we have the following results for the sample complexity m of learning W_p with respect to D. If p = 1 then

    m ≤ O( (1/ε) ( k + ln(1/δ) ) );

if 1 < p < ∞ then

    m ≥ (1/2 − H(4ε)) d − k^{1/q} (d/2)^{1/p} − k + ln(1 − δ);

and if p = ∞ then

    m ≥ (1 − H(4ε)) d + ln(1 − δ).

For appropriate values of k and ε relative to d, the sample complexity can be much smaller in the p = 1 case. For example, if k = O(d^{1/4}) and Ω(d^{−1/4}) ≤ ε ≤ 1/40, then (assuming δ is a small constant) having O(√d) examples is sufficient for learning W_1 while at least Ω(d) examples are required to learn W_p for any p > 1.

5 EXPERIMENTS

We performed two empirical studies to support our theoretical results. First, to give further evidence that using L_∞ L_1 margins can lead to faster learning than other margins, we ran experiments on both synthetic and real-world data sets. Using the L_q L_p SVM formulation defined in (1) for linearly separable data and the formulation defined in (2) for non-separable data, both implemented using standard convex optimization software, we ran our algorithms for a range of values of p and a range of sample sizes n on each data set. We report several cases in which maximizing the L_∞ L_1 margin results in faster learning (i.e., smaller sample complexity) than maximizing other margins.

Figure 1 shows results on two synthetic data sets. One is generated using the Blocks distribution family from Section 4 with d = 90 and k = 9. The other uses examples generated from a standard Gaussian distribution in R^100, subject to having L_∞ L_1 margin at least 0.075 with respect to a fixed random target vector in {−1, 1}^d (in other words, Gaussian samples with margin smaller than 0.075 are rejected). In both cases, the error decreases much faster for p < 2 than for large p.

Figure 2 shows results on three data sets from the UCI Machine Learning Repository (Bache and Lichman, 2013). The Fertility data set consists of 100 training examples in R^10, the SPECTF Heart data set has 267 examples in R^44, and we used a subset of the CNAE-9 data set with 240 examples in R^857. In all three cases, better performance was achieved by algorithms with p < 2 than by those with p > 2.
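A rough sketch of the kind of comparison behind Figure 1 (top) is given below; it is illustrative only, not the authors' experimental code, and it reuses the hypothetical helpers sample_D, block_target, and lq_lp_svm from the earlier sketches.

```python
# Train the L_q L_p SVM on the Blocks family (d = 90, k = 9) for several values of p
# and several training set sizes, and estimate the test error on fresh samples.
import numpy as np

d, k, trials = 90, 9, 10
w_star = block_target(d, k)
for p in [1.0, 1.5, 2.0, 5.0, np.inf]:
    for n in [50, 100, 200]:
        errs = []
        for _ in range(trials):
            X_tr = sample_D(d, k, n)
            y_tr = np.sign(X_tr @ w_star)
            w_hat, _ = lq_lp_svm(X_tr, y_tr, p)
            X_te = sample_D(d, k, 2000)
            y_te = np.sign(X_te @ w_star)
            errs.append(np.mean(np.sign(X_te @ w_hat) != y_te))
        print(f"p = {p}, n = {n}, error = {np.mean(errs):.3f}")
```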

Figure 1: Synthetic data results for the Blocks distribution (top) and the Gaussian with margin (bottom). The left column plots generalization error (averaged over 500 trials with different training sets) versus the number of training examples n, while the right column plots error versus p.

The goal of our second experiment was to determine, for real-world data, what parameter α can be used in the bound on ‖X‖_{2,p} in Theorem 4. Specifically, for each data set we want to find α_min = inf{α : ‖X‖_{2,p} ≤ n^α X_p}, the smallest value of α such that the bound holds with C = 1. Recall that for p = 1, α_min can theoretically be as great as 1, while for p ≥ 2 it is at most 1/2. We would like to see whether α_min is often small in real data sets or whether it is close to the theoretical upper bound.

We can estimate α_min for a given set of data by creating a sequence {X_m}_{m=1}^{n} of data matrices, adding to the matrix one point from the data set at a time. For each point in the sequence we can compute

    α_m = log( ‖X_m‖_{2,p} / X_p ) / log m,

a value of α that realizes the bound with equality for this particular data matrix. We repeat this process T times, each with a different random ordering of the data, to find T sequences α_m^i, where 1 ≤ i ≤ T and 1 ≤ m ≤ n. We can then compute α̂_min = max_{i,m} α_m^i, a value of α which realizes the bound for every data matrix considered and which causes the bound to hold with equality in at least one instance. Figure 3 shows a histogram of the resulting estimates on a variety of data sets and for three values of p.

Figure 3: A histogram showing the values of α̂_min on 47 data sets from the UCI repository, for p = 1, p = 2, and p = ∞.

Notice that in the vast majority of cases, the estimate of α_min is less than 1/2. As expected there are more values above 1/2 for p = 1 than for p ≥ 2, but none of the estimates were above 0.7. This gives us evidence that many real data sets are much more favorable for learning with large L_∞ L_1 margins than the worst-case bounds may suggest.
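The estimation procedure just described can be sketched as follows (illustrative code, not the authors'; whether X_p is taken over the prefix or over the full data set, and the choice to start at m = 2 to avoid dividing by log 1, are our assumptions). It reuses the hypothetical l2p_matrix_norm helper from the earlier sketch.

```python
import numpy as np

def estimate_alpha_min(points, p, T=10, seed=0):
    """points: (n, d) array; returns the maximum over T random orderings and prefix sizes m
    of alpha_m = log(||X_m||_{2,p} / X_p) / log m."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    best = -np.inf
    for _ in range(T):
        order = rng.permutation(n)
        for m in range(2, n + 1):                 # m = 1 would divide by log 1 = 0
            Xm = points[order[:m]]                # first m points under this ordering
            X_p = np.max(np.linalg.norm(Xm, ord=p, axis=1))
            alpha_m = np.log(l2p_matrix_norm(Xm.T, p) / X_p) / np.log(m)
            best = max(best, alpha_m)
    return best
```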

Figure 2: Results for the Fertility data (top), SPECTF Heart data (middle), and CNAE-9 data (bottom). The left column plots error (averaged over 100 trials with different training sets and tested on all non-training examples) versus the number of training examples n, while the right column plots error versus p.

6 DISCUSSION

Theorem 4 applies to the realizable case in which the two classes are linearly separable by a positive L_q L_p margin. The result can be extended to the non-realizable case by bounding the empirical Rademacher complexity in terms of the L_{2,p}-norm of the data using the Khintchine inequality. This bound can be seen as a special case of Proposition 2 of Kloft and Blanchard (2012), as our setting is a special case of multiple kernel learning (see supplementary material for details). Kloft and Blanchard (2012) also prove bounds on the population and local Rademacher complexities, although in doing so they introduce an explicit dependence on the dimension (number of kernels).

This highlights a difference in goals between our work and much of the MKL literature. The MKL literature provides bounds that apply to all data distributions while focusing on the p ≥ 2 regime, as this is the most relevant range for MKL. On the other hand, we are interested in understanding what kinds of data lead to fast (and dimension-independent) learning for different notions of margin, and furthermore when one type of margin can be provably better than another. We make steps toward understanding this through our condition on the L_{2,p}-norm and concrete lower bounds on the generalization error, neither of which has been explored in the context of MKL. Carrying over these perspectives into more general settings such as MKL is an exciting direction for further research.

Acknowledgments

We thank Ying Xiao for insightful discussions and the anonymous reviewers for their helpful comments. This work was supported in part by NSF grants CCF and CCF, AFOSR grant FA, ONR grant N, and a Microsoft Research Faculty Fellowship.

References

K. Bache and M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

M.-F. Balcan, C. Berlind, S. Ehrlich, and Y. Liang. Efficient Semi-supervised and Active Learning of Disjunctions. In Proceedings of the International Conference on Machine Learning (ICML), 2013.

P. L. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods. MIT Press, 1999.

A. Blum and M.-F. Balcan. Open Problems in Efficient Semi-Supervised PAC Learning. In Proceedings of the Conference on Learning Theory (COLT), 2007.

C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization Bounds for Learning Kernels. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

C. Gentile. The Robustness of the p-Norm Algorithms. Machine Learning, 53, 2003.

C. Gentile. Personal communication, 2013.

A. J. Grove, N. Littlestone, and D. Schuurmans. General Convergence Results for Linear Discriminant Updates. Machine Learning, 43, 2001.

U. Haagerup. The best constants in the Khintchine inequality. Studia Mathematica, 1982.

A. Itai and G. M. Benedek. Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2), 1991.

S. M. Kakade, K. Sridharan, and A. Tewari. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In Proceedings of Neural Information Processing Systems (NIPS), 2009.

M. Kloft and G. Blanchard. On the Convergence Rate of l_p-Norm Multiple Kernel Learning. Journal of Machine Learning Research, 13, 2012.

N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4), 1988.

A. Maurer and M. Pontil. Structured Sparsity and Generalization. Journal of Machine Learning Research, 13, 2012.

F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), November 1958.

R. A. Servedio. PAC Analogues of Perceptron and Winnow via Boosting the Margin. In Proceedings of the Conference on Learning Theory (COLT), 2000.

T. Zhang. On the dual formulation of regularized linear systems with convex risks. Machine Learning, 46:91-129, 2002.


More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS

DETC2003/DAC AN EFFICIENT ALGORITHM FOR CONSTRUCTING OPTIMAL DESIGN OF COMPUTER EXPERIMENTS Proceedings of DETC 03 ASME 003 Design Engineering Technical Conferences and Comuters and Information in Engineering Conference Chicago, Illinois USA, Setember -6, 003 DETC003/DAC-48760 AN EFFICIENT ALGORITHM

More information

Improvement on the Decay of Crossing Numbers

Improvement on the Decay of Crossing Numbers Grahs and Combinatorics 2013) 29:365 371 DOI 10.1007/s00373-012-1137-3 ORIGINAL PAPER Imrovement on the Decay of Crossing Numbers Jakub Černý Jan Kynčl Géza Tóth Received: 24 Aril 2007 / Revised: 1 November

More information

Introduction to Banach Spaces

Introduction to Banach Spaces CHAPTER 8 Introduction to Banach Saces 1. Uniform and Absolute Convergence As a rearation we begin by reviewing some familiar roerties of Cauchy sequences and uniform limits in the setting of metric saces.

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

LORENZO BRANDOLESE AND MARIA E. SCHONBEK

LORENZO BRANDOLESE AND MARIA E. SCHONBEK LARGE TIME DECAY AND GROWTH FOR SOLUTIONS OF A VISCOUS BOUSSINESQ SYSTEM LORENZO BRANDOLESE AND MARIA E. SCHONBEK Abstract. In this aer we analyze the decay and the growth for large time of weak and strong

More information

Coding Along Hermite Polynomials for Gaussian Noise Channels

Coding Along Hermite Polynomials for Gaussian Noise Channels Coding Along Hermite olynomials for Gaussian Noise Channels Emmanuel A. Abbe IG, EFL Lausanne, 1015 CH Email: emmanuel.abbe@efl.ch Lizhong Zheng LIDS, MIT Cambridge, MA 0139 Email: lizhong@mit.edu Abstract

More information

15-451/651: Design & Analysis of Algorithms October 23, 2018 Lecture #17: Prediction from Expert Advice last changed: October 25, 2018

15-451/651: Design & Analysis of Algorithms October 23, 2018 Lecture #17: Prediction from Expert Advice last changed: October 25, 2018 5-45/65: Design & Analysis of Algorithms October 23, 208 Lecture #7: Prediction from Exert Advice last changed: October 25, 208 Prediction with Exert Advice Today we ll study the roblem of making redictions

More information

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition

A Qualitative Event-based Approach to Multiple Fault Diagnosis in Continuous Systems using Structural Model Decomposition A Qualitative Event-based Aroach to Multile Fault Diagnosis in Continuous Systems using Structural Model Decomosition Matthew J. Daigle a,,, Anibal Bregon b,, Xenofon Koutsoukos c, Gautam Biswas c, Belarmino

More information

1 Probability Spaces and Random Variables

1 Probability Spaces and Random Variables 1 Probability Saces and Random Variables 1.1 Probability saces Ω: samle sace consisting of elementary events (or samle oints). F : the set of events P: robability 1.2 Kolmogorov s axioms Definition 1.2.1

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

Online Learning of Noisy Data with Kernels

Online Learning of Noisy Data with Kernels Online Learning of Noisy Data with Kernels Nicolò Cesa-Bianchi Università degli Studi di Milano cesa-bianchi@dsiunimiit Shai Shalev Shwartz The Hebrew University shais@cshujiacil Ohad Shamir The Hebrew

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Best approximation by linear combinations of characteristic functions of half-spaces

Best approximation by linear combinations of characteristic functions of half-spaces Best aroximation by linear combinations of characteristic functions of half-saces Paul C. Kainen Deartment of Mathematics Georgetown University Washington, D.C. 20057-1233, USA Věra Kůrková Institute of

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP

CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP Submitted to the Annals of Statistics arxiv: arxiv:1706.07237 CONVOLVED SUBSAMPLING ESTIMATION WITH APPLICATIONS TO BLOCK BOOTSTRAP By Johannes Tewes, Dimitris N. Politis and Daniel J. Nordman Ruhr-Universität

More information

7.2 Inference for comparing means of two populations where the samples are independent

7.2 Inference for comparing means of two populations where the samples are independent Objectives 7.2 Inference for comaring means of two oulations where the samles are indeendent Two-samle t significance test (we give three examles) Two-samle t confidence interval htt://onlinestatbook.com/2/tests_of_means/difference_means.ht

More information

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction

AKRON: An Algorithm for Approximating Sparse Kernel Reconstruction : An Algorithm for Aroximating Sarse Kernel Reconstruction Gregory Ditzler Det. of Electrical and Comuter Engineering The University of Arizona Tucson, AZ 8572 USA ditzler@email.arizona.edu Nidhal Carla

More information

Econometrica Supplementary Material

Econometrica Supplementary Material Econometrica Sulementary Material SUPPLEMENT TO WEAKLY BELIEF-FREE EQUILIBRIA IN REPEATED GAMES WITH PRIVATE MONITORING (Econometrica, Vol. 79, No. 3, May 2011, 877 892) BY KANDORI,MICHIHIRO IN THIS SUPPLEMENT,

More information