Rounding Methods for Discrete Linear Classification (Extended Version)


Yann Chevaleyre (LIPN, CNRS UMR 7030, Université Paris Nord, 99 Avenue Jean-Baptiste Clément, Villetaneuse, France)
Frédéric Koriche (CRIL, CNRS UMR 8188, Université d'Artois, Rue Jean Souvraz SP 18, 62307 Lens, France)
Jean-Daniel Zucker, jean-daniel.zucker@ird.fr (INSERM U872, Université Pierre et Marie Curie, 15 Rue de l'Ecole de Médecine, Paris, France)

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the authors.

Abstract

Learning discrete linear classifiers is known as a difficult challenge. In this paper, this learning task is cast as a combinatorial optimization problem: given a training sample formed by positive and negative feature vectors in the Euclidean space, the goal is to find a discrete linear function that minimizes the cumulative hinge loss of the sample. Since this problem is NP-hard, we examine two simple rounding algorithms that discretize the fractional solution of the problem. Generalization bounds are derived for several classes of binary-weighted linear functions, by analyzing the Rademacher complexity of these classes and by establishing approximation bounds for our rounding algorithms. Our methods are evaluated on both synthetic and real-world data.

1. Introduction

Linear classification is a well-studied learning problem in which one needs to extrapolate, from a set of positive and negative examples represented in Euclidean space by their feature vectors, a linear hypothesis $h(x) = \mathrm{sgn}(\langle w, x\rangle - b)$ that correctly classifies future, unseen examples. In the past decades, a wide variety of theoretical results and efficient algorithms have been obtained for learning real-weighted linear functions, also known as perceptrons. Notably, it is well known that the linear classification problem can be cast as a convex optimization problem and solved in polynomial time by support vector machines when the performance of hypotheses is measured by convex loss functions such as the hinge loss (see e.g. Shawe-Taylor and Cristianini, 2000).

Much less is known, however, about learning discrete linear classifiers. Yet integer weights, and in particular {0,1}-valued and {-1,0,1}-valued weights, can play a crucial role in many application domains in which the classifier has to be interpretable by humans. One of the main motivating applications for this work comes from the field of quantitative metagenomics, the study of the collective genome of the micro-organisms inhabiting our body. It is now technically possible to measure the abundance of a bacterial species by measuring the activity of specific tracer genes for that species. Moreover, it is known that the abundance of some bacterial species in our body is related to obesity or leanness. Instead of learning a standard linear classifier to predict obesity, biologists would like to find two small groups of bacterial species such that, if the abundance of the bacteria in the first group exceeds that of the second group, the individual is classified as obese. Given a dataset in which features represent the abundance of specific bacterial species, this problem boils down to learning a linear classifier with {-1,0,1}-valued weights.

In other domains, such as medical diagnosis, the interpretability of predictive models is also a key aspect. The most common diagnostic models are M-of-N rules (Towell and Shavlik, 1993), according to which patients are classified as ill if at least M criteria among N are satisfied. However, learning M-of-N rules is hard (a proof is provided in the appendix).

In binary classification, linear threshold functions with {0,1}-valued weights are equivalent to M-of-N rules. Thus, the theory and the algorithms described in this paper can also be used to learn such rules, as shown in the experimental section.

Perhaps the major obstacle to the development of discrete linear functions lies in the fact that, in the standard distribution-free PAC learning model, the problem of finding an integer-weighted linear function that is consistent with a training set is equivalent to the Zero-One Integer Linear Programming problem (Pitt and Valiant, 1988), which is NP-complete. In order to alleviate this issue, several authors have investigated the learnability of discrete linear functions in distribution-specific models, such as the uniform distribution (Golea and Marchand, 1993a; Köhler et al., 1990; Opper et al., 1990; Venkatesh, 1991) or product distributions (Golea and Marchand, 1993b). Yet, beyond this pioneering work, many questions remain open, especially when the model is distribution-free but the loss functions are convex.

In this paper, we consider just such a scenario by examining the problem of learning binary-weighted linear functions with the hinge loss, a well-known surrogate of the zero-one loss. The key components of the classification problem are a set $C \subseteq \{0,1\}^n$ of boolean vectors from which the learner picks its hypotheses, and a fixed yet hidden probability distribution over the set $\mathbb{R}^n \times \{\pm 1\}$ of examples. (As explained in Section 4, {-1,0,1}-weighted classification can be reduced to {0,1}-weighted classification.) For a hinge parameter $\gamma > 0$, the hinge loss penalizes a hypothesis $c \in C$ on an example $(x, y)$ if its margin $y\langle c, x\rangle$ is less than $\gamma$. The performance of a hypothesis $c \in C$ is measured by its risk, denoted $\mathrm{risk}(c)$, and defined as the expected loss of $c$ on an example $(x, y)$ drawn from the underlying distribution. Typically, $\mathrm{risk}(c)$ is upper-bounded by the sum of two terms: a sample estimate $\widehat{\mathrm{risk}}_\gamma(c)$ of the performance of $c$, and a penalty term $T(C)$ that depends on the hypothesis class $C$ and, potentially, also on the training set. The sample estimate $\widehat{\mathrm{risk}}_\gamma(c)$ is simply the averaged cumulative hinge loss of $c$ on a set $\{(x_i, y_i)\}_{i=1}^m$ of examples drawn independently from the underlying distribution. The penalty term $T(C)$ can be given by the VC dimension of $C$, or by its Rademacher complexity with respect to the size $m$ of the training set. For binary-weighted linear classifiers, the penalty term induced by their Rademacher complexity can be substantially smaller than the penalty term induced by their VC dimension. So, by a simple adaptation of Bartlett and Mendelson's framework (2002), our risk bounds take the form

$\mathrm{risk}(c) \le \widehat{\mathrm{risk}}_\gamma(c) + \frac{2}{\gamma} R_m(C) + \sqrt{\frac{8\ln(2/\delta)}{m}}$   (1)

where $R_m(C)$ is the Rademacher complexity of $C$ with respect to $m$, and $\delta \in (0,1)$ is a confidence parameter.

Ideally, we would like to have at our disposal an efficient algorithm for minimizing $\widehat{\mathrm{risk}}_\gamma$. The resulting minimizer, say $c^*$, would be guaranteed to provide an optimal hypothesis, because the other terms in the risk bound (1) do not depend on the choice of the hypothesis. Unfortunately, because the class $C$ of discrete linear classifiers is not a convex set, the convexity of the hinge loss does not help in finding $c^*$ and, as shown by Theorem 1 in the next section, the optimization problem remains NP-hard. The key message to be gleaned from this paper is that the convexity of the loss function does help in approximating the combinatorial optimization problem, using simple rounding methods. Our first algorithm is a standard randomized rounding (RR) method that starts from a fractional solution $w^*$ in the convex hull of $C$, and then builds $c$ by viewing the fractional value $w^*_i$ as the probability that $c_i$ should be set to 1.
The second algorithm, called greedy rounding (GR), is essentially a derandomization of RR that iteratively rounds the coordinates of the fractional solution while maintaining a constraint on the sum of weights. For the class $C$ of binary-weighted linear functions, we show that the greedy rounding algorithm is guaranteed to return a concept $c \in C$ satisfying

$\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_2}{2\gamma}$

where $X_p = \max_i \|x_i\|_p$ and $\|x\|_p$ is the $L_p$-norm of $x$. We also show that the problem of improving this bound by more than a constant factor is NP-hard. Combining greedy rounding's performance with the Rademacher complexity of $C$ yields the risk bound

$\mathrm{risk}(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_2}{2\gamma} + \frac{2 X_1}{\gamma}\min\left\{1, \sqrt{\frac{n}{m}}\right\} + \sqrt{\frac{8\ln(2/\delta)}{m}}.$

For the subclass $C_k$ of sparse binary-weighted linear functions involving at most $k$ ones among $n$, we show that greedy rounding is guaranteed to return a concept $c \in C_k$ satisfying

$\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty\sqrt{k}}{\gamma}.$

Using the Rademacher complexity of $C_k$, which is substantially smaller than that of $C$, we have

$\mathrm{risk}(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty\sqrt{k}}{\gamma} + \frac{2 k X_\infty}{\gamma}\sqrt{\frac{2\ln n}{m}} + \sqrt{\frac{8\ln(2/\delta)}{m}}.$

Similar results are derived for the randomized rounding algorithm, with less sharp bounds due to the randomization process. We evaluate these rounding methods on both synthetic and real-world datasets, showing good performance in comparison with standard linear optimization methods. The proofs of the preparatory Lemmas 2 and 5 can be found in the appendix.

2. Binary-Weighted Linear Classifiers

Notation. The set of positive integers $\{1, \dots, n\}$ is denoted $[n]$. For a subset $S \subseteq \mathbb{R}^n$, we denote by $\mathrm{conv}(S)$ the convex hull of $S$. For two vectors $u, v \in \mathbb{R}^n$ and $p \ge 1$, the $L_p$-norm of $u$ is denoted $\|u\|_p$ and the inner product between $u$ and $v$ is denoted $\langle u, v\rangle$. Given a vector $u \in \mathbb{R}^n$ and $k \in [n]$, we denote by $u_{1:k}$ the prefix $(u_1, \dots, u_k)$ of $u$. Finally, given a training set $\{(x_i, y_i)\}_{i=1}^m$, we write $X_p = \max_i \|x_i\|_p$.

In this study, we examine classification problems for which the set of instances is the Euclidean space $\mathbb{R}^n$ and the hypothesis class is a subset of $\{0,1\}^n$. Specifically, we focus on the class $C = \{0,1\}^n$ of all binary-weighted linear functions, and on the subclass $C_k$ of all binary-weighted linear functions with at most $k$ ones among $n$. The parameterized loss function $l_\gamma : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ examined in this work is the hinge loss defined by

$l_\gamma(p, y) = \frac{1}{\gamma}\max(0, \gamma - py)$, where $\gamma > 0$.

2.1. Computational Complexity

For a training set $\{(x_i, y_i)\}_{i=1}^m$, the empirical risk of a weight vector $c \in C$, denoted $\widehat{\mathrm{risk}}_\gamma(c)$, is defined by its averaged cumulative loss:

$\widehat{\mathrm{risk}}_\gamma(c) = \frac{1}{m}\sum_{i=1}^m l_\gamma(\langle c, x_i\rangle, y_i).$

By $c^*$ we denote any minimizer of the objective function $\widehat{\mathrm{risk}}_\gamma$ over the hypothesis class. Recall that if the hypothesis class were a convex subset of $\mathbb{R}^n$, then such a minimizer could be found in polynomial time using convex optimization algorithms.
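As a concrete illustration of the quantities just defined, here is a minimal NumPy sketch of the $\gamma$-hinge loss and of the empirical risk of a candidate weight vector. It is our own illustration, not the paper's implementation, and the function names are ours.

```python
import numpy as np

def hinge_loss(p, y, gamma=1.0):
    """gamma-hinge loss: l_gamma(p, y) = (1/gamma) * max(0, gamma - p*y)."""
    return np.maximum(0.0, gamma - p * y) / gamma

def empirical_risk(c, X, y, gamma=1.0):
    """Averaged cumulative hinge loss of weight vector c on the sample (X, y).
    X: (m, n) feature matrix, y: (m,) labels in {-1, +1}, c: (n,) weights."""
    return hinge_loss(X @ c, y, gamma).mean()

# toy usage: risk of an all-ones binary weight vector on a random sample
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 10))
y = np.where(X[:, :3].sum(axis=1) >= 0, 1.0, -1.0)  # labels driven by 3 features
print(empirical_risk(np.ones(10), X, y, gamma=1.0))
```

The same function applies unchanged to fractional vectors $w \in [0,1]^n$, which is how the relaxed problems of Section 3 are evaluated.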
However, for the discrete class $C = \{0,1\}^n$, the next result states that the optimization problem is much harder.

Theorem 1. There exists a constant $\alpha > 0$ such that, unless P = NP, there is no polynomial-time algorithm capable of learning, from any dataset of size $m$, a hypothesis $c \in C$ such that

$\widehat{\mathrm{risk}}_\gamma(c) \le \min_{c' \in C} \widehat{\mathrm{risk}}_\gamma(c') + \frac{\alpha X_2}{\gamma}.$

Proof. In what follows, we denote by $c^*$ any vector in $C$ for which $\widehat{\mathrm{risk}}_\gamma(c^*)$ is minimal. For an undirected graph $G = (V, E)$, the Max-Cut problem is to find a subset $S \subseteq V$ such that the number of edges with one endpoint in $S$ and the other in $V \setminus S$ is maximal. Unless P = NP, no polynomial-time algorithm can achieve an approximation ratio of 0.997 for Max-Cut on 3-regular graphs (Berman and Karpinski, 1999). Based on this result, we first construct a dataset from a 3-regular graph $G = (V, E)$ having an even number of vertices. Our dataset consists of $n = |V| + 1$ features and $m = 2|E|$ examples. The first $|V|$ features are associated with the vertices of $G$. For each edge $(j, j') \in E$, we build two positively labeled examples $x$ and $x'$ in the following way. In the first example $x$, features $j$ and $j'$ are set to $\gamma$, and all other features are set to 0. In the second example $x'$, features $j$ and $j'$ are set to $-\gamma$, feature $|V|+1$ is set to $2\gamma$, and all others are set to 0.

Consider any weight vector $c$ where $c_{|V|+1}$ is equal to 0. Clearly, setting $c_{|V|+1}$ to 1 can only decrease the loss of $c$. Thus, we will assume from now on, without loss of generality, that $c_{|V|+1}$ is always set to 1. Observe that the loss on the two examples $x$ and $x'$ is

$l_\gamma(\langle c, x\rangle, +1) + l_\gamma(\langle c, x'\rangle, +1) = 0$ if $c_j \ne c_{j'}$, and $1$ otherwise.

Let us now define $\mathrm{cut}(c) = |\{(j, j') \in E : c_j \ne c_{j'}\}|$. By viewing $c$ as the characteristic vector of a subset of vertices, $\mathrm{cut}(c)$ is the value of the cut in $G$ induced by this subset. Note that we have $\mathrm{cut}(c) = |E| - 2|E|\,\widehat{\mathrm{risk}}_\gamma(c)$. Thus, minimizing the loss on the dataset maximizes the cut on the graph; consequently, $\mathrm{cut}(c^*)$ is the optimal value of the Max-Cut problem. Finally, suppose by contradiction that for every $\alpha > 0$ there is a polynomial-time algorithm capable of learning, from any dataset of size $m$, a vector $c$ satisfying $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \alpha X_2/\gamma$. In the dataset constructed above, the value of $X_2$ is $\gamma\sqrt{6}$. Thus, for this dataset, we get that $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \alpha\sqrt{6}$, and hence

$\mathrm{cut}(c) \ge \mathrm{cut}(c^*) - 2\alpha\sqrt{6}\,|E|.$   (2)

To this point, Feige et al. (2001) have shown that on 3-regular graphs the optimal cut has a value of at least $|E|/2$. By reporting this value into (2), we obtain

$\mathrm{cut}(c) \ge \mathrm{cut}(c^*) - 4\alpha\sqrt{6}\,\mathrm{cut}(c^*) = \mathrm{cut}(c^*)\left(1 - 4\alpha\sqrt{6}\right).$

Because $\alpha$ can be taken arbitrarily close to 0, this implies that Max-Cut on 3-regular graphs is approximable within any constant factor, which contradicts Berman and Karpinski's (1999) inapproximability result. □

2.2. Rademacher Complexity

Suppose that our training set $S = \{(x_i, y_i)\}_{i=1}^m$ consists of $m$ examples generated by independent draws from some fixed probability distribution on $\mathbb{R}^n \times \{\pm 1\}$. For a class $F$ of real-valued functions $f : \mathbb{R}^n \to \mathbb{R}$, define its Rademacher complexity on $S$ to be

$R_S(F) = \mathbb{E}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right].$

Here, the expectation is over the Rademacher random variables $\sigma_1, \dots, \sigma_m$, which are drawn from $\{\pm 1\}$ with equal probability. Since $S$ is random, we can also take the expectation over the choice of $S$ and define $R_m(F) = \mathbb{E}[R_S(F)]$, which gives a quantity that depends on both the function class and the sample size. As indicated by inequality (1), bounds on the Rademacher complexity of a class immediately yield risk bounds for classifiers picked from that class. For continuous linear functions, sharp Rademacher complexity bounds have been provided by Kakade et al. (2008). We provide here similar bounds for two important classes of discrete linear functions.

Lemma 2. Let $\sigma_1, \dots, \sigma_m$ be Rademacher variables. Then $\sqrt{m/8} \le \mathbb{E}[|\sum_{i=1}^m \sigma_i|] \le \sqrt{m}$ for any $m \ge 1$.

Theorem 3. Let $C = \{0,1\}^n$ be the class of all binary-weighted linear functions. Then

$R_m(C) \le X_1 \min\left\{1, \sqrt{\frac{n}{m}}\right\}.$

This bound is tight up to a constant factor.

Proof. Consider the hypothesis class $F_{p,v} = \{x \mapsto \langle w, x\rangle : w \in \mathbb{R}^n, \|w\|_p \le v\}$. By Theorem 1 in (Kakade et al., 2008), we have $R_m(F_{2,v}) \le v X_2/\sqrt{m}$. Moreover, for any $c \in \{0,1\}^n$ we have $\|c\|_2 \le \sqrt{n}$. It follows that $C \subseteq F_{2,\sqrt{n}}$, and since $\|x\|_2 \le \|x\|_1$, we get that

$R_m(C) \le R_m(F_{2,\sqrt{n}}) \le \sqrt{\frac{n}{m}}\, X_2 \le \sqrt{\frac{n}{m}}\, X_1.$

In addition, for any realization of the Rademacher variables, $\sup_{c \in C} \frac{1}{m}\sum_i \sigma_i\langle c, x_i\rangle \le \frac{1}{m}\sum_i \|x_i\|_1 \le X_1$, so $R_m(C) \le X_1$, which yields the stated minimum.

Now, let us prove that this bound is tight. First, let us rewrite the Rademacher complexity over $m$ samples in a more convenient form:

$R_S(C) = \frac{1}{m}\sum_{j=1}^n \mathbb{E}\left[\sup_{c_j \in \{0,1\}} c_j \sum_{i=1}^m \sigma_i x_{i,j}\right] = \frac{1}{2m}\sum_{j=1}^n \mathbb{E}\left[\sup_{w_j \in \{-1,1\}} w_j \sum_{i=1}^m \sigma_i x_{i,j}\right] = \frac{1}{2m}\sum_{j=1}^n \mathbb{E}\left[\left|\sum_{i=1}^m \sigma_i x_{i,j}\right|\right].$   (3)

For the case $n \ge m$, consider a training set $S$ such that $x_{i,i} = X_1$ for all $i \in [m]$, and zero everywhere else. Clearly, equation (3) implies $R_S(C) = X_1/2$. For the case $n < m$, assume $m$ is a multiple of $n$ and consider a dataset $S$ in which each example contains only one non-zero value, equal to $X_1$, and such that the number of non-zero values per column is $m/n$. Then, by applying Lemma 2 to equation (3), we obtain

$R_S(C) \ge \frac{n X_1}{2m}\sqrt{\frac{m}{8n}} = X_1\sqrt{\frac{n}{32\, m}}.$ □

Theorem 4. For a constant $k > 0$, let $C_k$ be the class of binary-weighted linear functions with at most $k$ ones among $n$. Then

$R_m(C_k) \le k\, X_\infty \sqrt{\frac{2\ln n}{m}}.$

Proof. For a closed convex set $S \subseteq \mathbb{R}^n_+$, consider the hypothesis class $F_S = \{x \mapsto \langle w, x\rangle : w \in S\}$. Using the convex function $F(w) = \sum_{j=1}^n \frac{w_j}{W_1}\ln\frac{w_j}{W_1} + \ln n$, where $W_1 = \max_{w \in S}\|w\|_1$, we get from Theorem 1 in (Kakade et al., 2008):

$R_m(F_S) \le X_\infty W_1 \sqrt{\frac{2\sup\{F(w) : w \in S\}}{m}}.$   (4)

For any $k$, let $S_k = \mathrm{conv}(\{w \in \{0,1\}^n : \|w\|_1 \le k\})$. Because $F$ is convex and $S_k$ is a convex polytope, the supremum of $F$ over $S_k$ is attained at one of the vertices of the polytope. At a vertex with $l \le k$ ones we have $F(w) = \ln n + \sum_{j=1}^{l} \frac{1}{k}\ln\frac{1}{k} = \ln n - \frac{l}{k}\ln k \le \ln n$. The result follows by reporting this value into (4), with $W_1 = k$. □
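The closed form (3) makes the empirical Rademacher complexity of $C$ easy to estimate numerically. The following sketch is our own illustration of that identity, assuming a plain Monte Carlo average over Rademacher draws; the helper name and sampling size are arbitrary.

```python
import numpy as np

def empirical_rademacher_binary(X, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(C) for C = {0,1}^n via equation (3):
    R_S(C) = 1/(2m) * sum_j E | sum_i sigma_i x_{ij} |."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher draws
    col_abs = np.abs(sigmas @ X)            # |sum_i sigma_i x_{ij}| per draw and j
    return col_abs.sum(axis=1).mean() / (2.0 * m)

# compare with the bound of Theorem 3: X_1 * min(1, sqrt(n/m))
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 20))
X1 = np.abs(X).sum(axis=1).max()
bound = X1 * min(1.0, np.sqrt(X.shape[1] / X.shape[0]))
print(empirical_rademacher_binary(X), bound)
```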

Algorithm 1: Randomized Rounding (RR)
Parameters: a set of $m$ examples, a convex set $S$
1. Solve $w^* = \arg\min_{w \in [0,1]^n \cap S} \widehat{\mathrm{risk}}_\gamma(w)$
2. For each $i \in [n]$, set $c_i$ to 1 with probability $w^*_i$
3. Return $c$

[Figure 1. (left) Intersection of the $\ell_1$-ball of radius $k$ and of the $\ell_\infty$-ball of radius 1, for non-negative coordinates. (right) A case where the solution of the convex relaxation coincides with the solution of the original problem.]

3. Rounding Methods

This section exploits the convexity of the hinge loss to derive simple approximation algorithms for minimizing the empirical risk. The overall idea is to first relax the optimization problem so as to obtain a fractional solution $w^*$, and then to round that solution using a deterministic or a randomized method. The convex optimization setting we consider is defined by

$w^* = \arg\min_{w \in [0,1]^n \cap S} \widehat{\mathrm{risk}}_\gamma(w)$   (5)

where $S = \mathbb{R}^n_+$ for the hypothesis class $C = \{0,1\}^n$, and $S = \{w \in \mathbb{R}^n_+ : \|w\|_1 \le k\}$ for the subclass $C_k$ of binary-weighted linear functions with at most $k$ ones among $n$. Note that the empirical risk minimization problem for $C$ can be viewed as an optimization problem over $\mathbb{R}^n_+$ under the $L_\infty$-norm constraint. The problem of minimizing the empirical risk in the convex hull of $C_k$ is illustrated in the left part of Figure 1.

The accuracy of rounding methods depends on the number of non-fractional values in the relaxed solution $w^*$. Indeed, if most weights of $w^*$ are already in $\{0,1\}$, then these values remain unchanged by the rounding phase, and the final approximation $c$ will be close to $w^*$. The right part of Figure 1 illustrates this phenomenon by representing a case where $w^*$ and $c$ coincide: the objective function is represented by ellipses, and the four dots at the corners of the square are the vectors of $\{0,1\}^2$. The hinge parameter also plays an important role in the quality of the rounding process: increasing $\gamma$ increases the likelihood that weights become binary. Taking this to the extreme, if $\gamma \ge X_1$, then the hinge loss is linear inside the $[0,1]^n$ hypercube, and all the convex optimization tasks described above yield solutions with binary weights. We note in passing that a similar phenomenon arises in the Lasso feature selection procedure, where the weight vectors are more likely to fall on a vertex of the $L_1$-ball as the margin increases.

3.1. Randomized Rounding

The randomized rounding (RR) algorithm is one of the most popular approximation schemes for combinatorial optimization (Raghavan and Thompson, 1987; Williamson and Shmoys, 2011). In the setting of our framework, the algorithm starts from the fractional solution $w^*$ of the problem and draws a random concept $c \in C$ by setting each value $c_i$ independently to 1 with probability $w^*_i$ and to 0 with probability $1 - w^*_i$. The following lemma, derived from Bernstein's inequality, states that using $c$ instead of $w^*$ to compute a dot product yields a bounded deviation.

Lemma 5. Let $x \in \mathbb{R}^n$, $w \in [0,1]^n$, and let $c \in \{0,1\}^n$ be a random vector such that $P[c_i = 1] = w_i$ independently for all $i \in [n]$. Then, with probability at least $1 - \delta$, the following inequalities hold:

$|\langle c, x\rangle - \langle w, x\rangle| \le 1.5\,\|x\|_2\sqrt{\ln(2/\delta)} + \tfrac{2}{3}\|x\|_\infty\ln(2/\delta)$, and
$|\langle c, x\rangle - \langle w, x\rangle| \le 1.7\,\|x\|_\infty\sqrt{\|w\|_1\ln(2/\delta)} + \tfrac{2}{3}\|x\|_\infty\ln(2/\delta).$

Theorem 6. Let $c$ be the vector returned by the randomized rounding algorithm. Then, with probability $1 - \delta$, the following hold.

For the class $C$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{1}{\gamma}\left(1.5\, X_2\sqrt{\ln(2m/\delta)} + \tfrac{2}{3} X_\infty\ln(2m/\delta)\right).$

For the class $C_k$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty}{\gamma}\left(1.7\sqrt{\|w^*\|_1\ln(2m/\delta)} + \tfrac{2}{3}\ln(2m/\delta)\right).$

Proof. Since the $\gamma$-hinge loss is $1/\gamma$-Lipschitz, we have

$\widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(c^*) \le \widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(w^*) \le \frac{1}{\gamma m}\sum_{i=1}^m |\langle c, x_i\rangle - \langle w^*, x_i\rangle|.$   (6)

By the union bound, $P[\exists i \in [m] : |\langle c, x_i\rangle - \langle w^*, x_i\rangle| \ge t] \le \sum_{i=1}^m P[|\langle c, x_i\rangle - \langle w^*, x_i\rangle| \ge t]$. The result follows by applying Lemma 5 with confidence $\delta/m$ to each example and reporting into (6) the values $t = 1.5\, X_2\sqrt{\ln(2m/\delta)} + \tfrac{2}{3}X_\infty\ln(2m/\delta)$ and $t = 1.7\, X_\infty\sqrt{\|w^*\|_1\ln(2m/\delta)} + \tfrac{2}{3}X_\infty\ln(2m/\delta)$, respectively. □
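Once slack variables are introduced for the hinge terms, the relaxation (5) is a linear program, so any LP solver can play the role that CPLEX plays in the experiments of Section 4. The sketch below is ours: it uses SciPy's linprog rather than CPLEX, the function names are illustrative, and it simply chains the fractional solution with Algorithm 1.

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxation(X, y, gamma=1.0, k=None):
    """Fractional solution of (5): minimize the mean gamma-hinge loss over
    w in [0,1]^n, optionally under the L1 constraint sum_j w_j <= k.
    Decision variables are stacked as v = (w_1..w_n, xi_1..xi_m),
    where xi_i >= max(0, 1 - y_i <w, x_i> / gamma) are the hinge slacks."""
    m, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(m) / m])         # objective: mean slack
    A = np.hstack([-(y[:, None] * X) / gamma, -np.eye(m)])    # -y_i x_i/gamma . w - xi_i <= -1
    b = -np.ones(m)
    if k is not None:
        A = np.vstack([A, np.concatenate([np.ones(n), np.zeros(m)])])
        b = np.append(b, float(k))
    bounds = [(0.0, 1.0)] * n + [(0.0, None)] * m
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[:n]

def randomized_rounding(w_star, rng=None):
    """Algorithm 1: set each coordinate c_i to 1 with probability w*_i."""
    rng = rng or np.random.default_rng()
    return (rng.random(w_star.shape) < w_star).astype(float)

rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 10.0, size=(100, 15))
target = np.zeros(15); target[:5] = 1.0
y = np.where(X @ target >= 0, 1.0, -1.0)
w_star = solve_relaxation(X, y, gamma=1.0, k=5)
print(np.round(w_star, 2))
print(randomized_rounding(w_star, rng))
```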

Algorithm 2: Greedy Rounding (GR)
Parameters: a set of $m$ examples, a convex set $S$ as in (5)
1. Solve $w^* = \arg\min_{w \in [0,1]^n \cap S} \widehat{\mathrm{risk}}_\gamma(w)$
2. For $k = 1$ to $n$, set
   $A_k \leftarrow \{a \in \{0,1\} : \sum_{i=1}^m (\theta_{i,k-1} + x_{i,k}(a - w^*_k))^2 \le \sum_{i=1}^m (\theta_{i,k-1}^2 + x_{i,k}^2\, w^*_k(1 - w^*_k))\}$
   $c_k \leftarrow \arg\min_{a \in A_k} \widehat{\mathrm{risk}}_\gamma(c_1, \dots, c_{k-1}, a, w^*_{k+1}, \dots, w^*_n)$
3. Return $c = (c_1, \dots, c_n)$

3.2. Greedy Rounding

Despite its relatively weak guarantees, the randomized rounding procedure can be used as a building block for constructing more efficient algorithms. Specifically, the new approximation scheme we propose, called greedy rounding (GR), is essentially a derandomization of RR with some improvements. As described in Algorithm 2, the procedure starts again by computing the fractional solution $w^*$ of the optimization task (line 1). Then, the coordinates of $w^*$ are rounded in a sequential manner by simply maintaining a constraint on the admissible values (line 2). The algorithm uses a matrix $\Theta = [\theta_{i,k}]$ of parameters defined as follows. For any $k \in [n]$, let $c_{1:k} = (c_1, \dots, c_k)$ be the prefix of the vector $c$ built at the end of step $k$. Then $\theta_{i,k} = \sum_{j=1}^k x_{i,j}(c_j - w^*_j)$ for each $i \in [m]$, with $\theta_{i,0} = 0$. The next result, which we call the derandomization lemma, shows that at each step of the rounding procedure there is a value $a \in \{0,1\}$ which does not increase the cumulative squared deviation too much.

Lemma 7. For any $k \le n$ and any $c_1, \dots, c_{k-1} \in \{0,1\}$, there exists $a \in \{0,1\}$ such that

$\sum_{i=1}^m \left(\theta_{i,k-1} + x_{i,k}(a - w^*_k)\right)^2 \le \sum_{i=1}^m \left(\theta_{i,k-1}^2 + x_{i,k}^2\, w^*_k(1 - w^*_k)\right).$

Proof. Let $z$ be a random vector taking values in $\{0,1\}^n$ such that $P[z_j = 1] = w^*_j$ independently for all $j \in [n]$. Clearly, we have $\mathbb{E}[z - w^*] = 0$. For any $i \in [m]$, let $f_i(z, w^*) = \langle z - w^*, x_i\rangle^2$. We can observe that $\mathbb{E}[f_i(z, w^*)] = \sum_{j=1}^n x_{i,j}^2 w^*_j(1 - w^*_j)$. Taking conditional expectations, we have

$\mathbb{E}[f_i(z, w^*) \mid z_{1:k-1} = c_{1:k-1}] = \mathbb{E}\left[\left(\langle c_{1:k-1} - w^*_{1:k-1}, x_{i,1:k-1}\rangle + \langle z_{k:n} - w^*_{k:n}, x_{i,k:n}\rangle\right)^2\right].$

In the right-hand side of this equation, the squared sum is equal to the sum of the squares, because the cross term $\langle c_{1:k-1} - w^*_{1:k-1}, x_{i,1:k-1}\rangle\,\langle z_{k:n} - w^*_{k:n}, x_{i,k:n}\rangle$ is null in expectation. We get that

$\mathbb{E}[f_i(z, w^*) \mid z_{1:k-1} = c_{1:k-1}] = \theta_{i,k-1}^2 + \sum_{j=k}^n x_{i,j}^2 w^*_j(1 - w^*_j).$

Now, let $U_1, \dots, U_n$ denote independent random variables taking values in some domain $D \subseteq \mathbb{R}$, and let $g$ be a function from $D^n$ into $\mathbb{R}$. By the definition of conditional expectation, we know that for any $j \in [n]$ there exists a value $u \in D$ such that $\mathbb{E}[g(U_1, \dots, U_n) \mid U_j = u] \le \mathbb{E}[g(U_1, \dots, U_n)]$. Applying this observation to the function $g(z) = \sum_{i=1}^m f_i(z, w^*)$, conditioned on $z_{1:k-1} = c_{1:k-1}$, there exists a value $c_k \in \{0,1\}$ such that

$\mathbb{E}\left[\textstyle\sum_{i=1}^m f_i(z, w^*) \mid z_{1:k} = c_{1:k}\right] \le \mathbb{E}\left[\textstyle\sum_{i=1}^m f_i(z, w^*) \mid z_{1:k-1} = c_{1:k-1}\right],$

which is exactly the claimed inequality with $a = c_k$. □
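A compact implementation of Algorithm 2, following the derandomization test of Lemma 7, might look as follows. This is our sketch: it assumes the fractional solution w_star comes from a routine such as solve_relaxation above, and the small numerical tolerance is ours.

```python
import numpy as np

def empirical_risk(w, X, y, gamma=1.0):
    return np.maximum(0.0, 1.0 - y * (X @ w) / gamma).mean()

def greedy_rounding(w_star, X, y, gamma=1.0):
    """Algorithm 2: derandomized rounding of the fractional solution w*.
    theta[i] tracks sum_{j <= k} x_{ij} * (c_j - w*_j) for the prefix fixed so far."""
    m, n = X.shape
    c = w_star.astype(float).copy()
    theta = np.zeros(m)
    for k in range(n):
        xk, wk = X[:, k], w_star[k]
        budget = np.sum(theta ** 2 + xk ** 2 * wk * (1.0 - wk))
        # Admissible values (the set A_k of Lemma 7); the lemma guarantees it is non-empty.
        admissible = [a for a in (0.0, 1.0)
                      if np.sum((theta + xk * (a - wk)) ** 2) <= budget + 1e-9]
        # Among admissible values, keep the one of smallest empirical risk for the
        # mixed vector (c_1, ..., c_{k-1}, a, w*_{k+1}, ..., w*_n).
        risks = []
        for a in admissible:
            c[k] = a
            risks.append(empirical_risk(c, X, y, gamma))
        c[k] = admissible[int(np.argmin(risks))]
        theta = theta + xk * (c[k] - wk)
    return c
```

Chained after solve_relaxation from the previous sketch, greedy_rounding(solve_relaxation(X, y), X, y) returns a binary weight vector in the spirit of Theorem 8 below.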

Based on this lemma, the approximation guarantees of the greedy rounding algorithm are summarized in the next theorem. Interestingly, a comparison with the lower bound for the class $C$ obtained in Theorem 1 reveals that the approximation bound of GR for this class is tight up to a constant factor. In other words, GR is an optimal approximation algorithm for the class of binary-weighted linear functions.

Theorem 8. Let $c$ be the vector returned by the GR algorithm. Then:

For the class $C$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_2}{2\gamma}.$
For the class $C_k$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty\sqrt{k}}{\gamma}.$

Proof. Since the $\gamma$-hinge loss is $1/\gamma$-Lipschitz, we can use inequality (6) to derive that

$\widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(c^*) \le \frac{1}{\gamma m}\sum_{i=1}^m |\langle c, x_i\rangle - \langle w^*, x_i\rangle| = \frac{1}{\gamma m}\sum_{i=1}^m |\theta_{i,n}|.$   (7)

Now, by application of Lemma 7, at each step $k$ of GR the set $A_k$ is non-empty, and by construction $c_k \in A_k$. An induction over $k$ therefore yields $\sum_{i=1}^m \theta_{i,n}^2 \le \sum_{i=1}^m\sum_{j=1}^n x_{i,j}^2 w^*_j(1 - w^*_j)$. Let $R = \widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(c^*)$. Reporting this inequality into (7) and using the Cauchy-Schwarz inequality, we obtain

$R \le \frac{1}{\gamma m}\sqrt{m\sum_{i=1}^m \theta_{i,n}^2} \le \frac{1}{\gamma}\sqrt{\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^n x_{i,j}^2 w^*_j(1 - w^*_j)}.$

For the class $C$, using the fact that $w^*_j(1 - w^*_j) \le 1/4$, we have $R \le X_2/(2\gamma)$. For the class $C_k$, using Hölder's inequality and the fact that $\|w^*\|_1 \le k$, we have $\sum_j x_{i,j}^2 w^*_j(1 - w^*_j) \le \|x_i\|_\infty^2 \sum_j w^*_j \le X_\infty^2\, k$, and hence $R \le X_\infty\sqrt{k}/\gamma$. □

4. Experiments

We tested the empirical performance of our algorithms by conducting experiments on a synthetic problem and several real-world domains. Besides the randomized rounding (RR) algorithm and the greedy rounding (GR) algorithm, we evaluated the behavior of two fractional optimization techniques: the convex optimization method (Cvx) that returns the fractional solution of the problem specified by (5), and the support vector machine (L1-SVM) that solves the $\ell_1$-constrained version of the problem. For small datasets, we also evaluated MIP (mixed integer programming), which computes the exact solution of the combinatorial problem. In our implementation, we used the linear programming software CPLEX to obtain the fractional solutions of the convex optimization tasks.

4.1. Synthetic Data

In order to validate different aspects of our algorithms, we designed a simple artificial dataset generator. Called with parameters $(m, n, k, \eta)$, the generator builds a dataset composed of $m$ examples, each with $n$ features. Examples are drawn from a uniform distribution over $[-10, 10]^n$. The generator also draws at random a target function with exactly $k$ ones, and each example is labeled with respect to this target. Finally, the coordinates of each example are perturbed with Gaussian noise of standard deviation $\eta$.

We first evaluated the generalization performance of the optimization algorithms. Setting $k = 10$, $n = 100$ and $\eta = 0.1$, we generated datasets with an increasing number of examples and plotted the generalization (zero-one) loss measured on test data (upper part of Figure 2). Next, we evaluated the robustness of our algorithms with respect to irrelevant attributes. Setting $k = 10$, $m = 50$ and $\eta = 0.1$, we generated datasets with $n$ varying up to 800, and plotted again the generalization zero-one loss measured on test data (lower part of Figure 2). It is apparent that GR and RR perform significantly better than L1-SVM, which is not surprising because the target concepts have $\{0,1\}$ weights. On synthetic data, GR is slightly less accurate than RR, whose performance is close to Cvx.

[Figure 2. Test error rates on synthetic data, as a function of the number of examples (upper part) and of the number of irrelevant features (lower part).]
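A minimal version of the generator described above, under the naming conventions of this section (our sketch; the labeling rule sgn of the inner product with the target is our reading of the text):

```python
import numpy as np

def make_synthetic(m, n, k, eta, seed=0):
    """Generator of Section 4.1: m examples with n features drawn uniformly on
    [-10, 10]^n, labeled by a hidden target with exactly k ones, then perturbed
    by Gaussian noise of standard deviation eta."""
    rng = np.random.default_rng(seed)
    target = np.zeros(n)
    target[rng.choice(n, size=k, replace=False)] = 1.0
    X = rng.uniform(-10.0, 10.0, size=(m, n))
    y = np.where(X @ target >= 0, 1.0, -1.0)
    X = X + rng.normal(scale=eta, size=X.shape)   # perturb the coordinates
    return X, y, target

X, y, target = make_synthetic(m=200, n=100, k=10, eta=0.1)
print(X.shape, y[:10], int(target.sum()))
```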

4.2. Metagenomic Data

In metagenomic classification, discrete linear functions have a natural interpretation in terms of bacterial abundance. We used a real-world dataset containing 38 individuals and 69 features. The dataset is divided into two well-balanced classes: obese and non-obese people. Each feature represents the abundance of a bacterial species. As mentioned in the introduction, the weight of each feature captures a qualitative effect, encoded by a value in $\{-1, 0, +1\}$ (negative effect, null effect, positive effect). Let POS (respectively NEG) denote the group of bacterial species whose feature has a weight of $+1$ (respectively $-1$). If the abundance of the bacteria in POS is greater than the abundance of the bacteria in NEG, then the individual is classified as obese.

In order to learn ternary-weighted linear functions with our algorithms, we used a simple trick that reduces the classification task to a binary-weighted learning problem. The idea is to duplicate attributes in the following way: to each instance $x \in \mathbb{R}^n$ we associate an instance $x' \in \mathbb{R}^d$ where $d = 2n$ and $x' = (x_1, -x_1, x_2, -x_2, \dots, x_n, -x_n)$. Given a binary-weighted concept $c' \in \{0,1\}^d$, the corresponding ternary-weighted concept $c \in \{-1,0,+1\}^n$ is recovered by setting $c_i = c'_{2i-1} - c'_{2i}$. Based on this transformation, it is easy to see that $l_\gamma(\langle c', x'\rangle, y) = l_\gamma(\langle c, x\rangle, y)$. So, if $c'$ minimizes the empirical risk on the set $\{(x'_t, y_t)\}$, then $c$ minimizes the empirical risk on $\{(x_t, y_t)\}$. If, in addition, $c'$ is $k$-sparse, then $c$ is $k$-sparse.
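The duplication trick is a one-liner in practice. The sketch below (ours; function names are illustrative) builds the duplicated instances, recovers the ternary concept, and checks that inner products, and hence hinge losses, are preserved.

```python
import numpy as np

def duplicate_features(X):
    """Map each instance x to x' = (x_1, -x_1, ..., x_n, -x_n), so that a
    binary-weighted concept on the 2n duplicated features encodes a
    ternary-weighted concept on the original n features."""
    m, n = X.shape
    Xd = np.empty((m, 2 * n))
    Xd[:, 0::2] = X
    Xd[:, 1::2] = -X
    return Xd

def to_ternary(c_binary):
    """Recover c_i = c'_{2i-1} - c'_{2i} in {-1, 0, +1} from a {0,1}^{2n} concept."""
    return c_binary[0::2] - c_binary[1::2]

# <c', x'> = <c, x>, so the hinge losses are preserved by the reduction.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
c_bin = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
assert np.allclose(duplicate_features(X) @ c_bin, X @ to_ternary(c_bin))
```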
The test error rates of the algorithms on this task are reported in Table 1. Test errors were measured by conducting 10-fold cross validation, averaged over 10 experiments. In light of these results, it is apparent that RR slightly outperforms both L1-SVM and Cvx, which clearly overfit the data, even in presence of the $L_1$-ball constraint for the first two rows. Unsurprisingly, the MIP solver generated a model superior to the others. For larger values of $k$, the mixed integer program did not finish in reasonable time, so we left the corresponding entries of the table blank. In a nutshell, we can conclude that accuracy does not suffer from switching to ternary weights, but this learning task looks challenging.

Table 1. Test error rates and average running time (in seconds) on metagenomic data.
            L1-SVM   Cvx    RR      GR      MIP
run time    0.0s     0.0s   0.04s   0.98s   13s

4.3. Colon Cancer

To demonstrate the performance of discrete linear classifiers for gene selection, we applied our algorithms to publicly available microarray data on colon cancer. The dataset consists of 62 samples, of which 22 are normal and 40 are from colon cancer tissues. The genes are already pre-filtered, the dataset consisting of 2,000 genes. We launched our algorithms with $k = 15$ to select 15 genes only. We did not report results for GR because each run of GR took a huge amount of time. Instead, RR_5 is a variant of randomized rounding that selects the best out of 5 random roundings at each time step. It turns out that RR_5 achieves a much better error rate than RR in this case, but not on the datasets of the previous subsections. Thus, we obtain a concept much simpler than the linear hypothesis generated by the SVM, with comparable accuracy (Table 2).

Table 2. Test error rates and average running time (in seconds) on colon cancer data.
            L1-SVM   Cvx     RR      RR_5
run time    -        0.04s   0.04s   -

4.4. Mushrooms

Finally, we ran experiments on the mushrooms dataset to evaluate how M-of-N rules are learnt using rounding algorithms. This dataset contains features which are all nominal. We transformed these features into binary features and ran our discrete linear learning algorithms on this dataset, without imposing any cardinality constraint. The M-of-N rule shown in Table 3 was produced, with an accuracy of 98%. We also ran 10 times 10-fold cross validation with our algorithms (see Table 4). Algorithm Cvx achieves a perfect classification. Here, GR outperforms RR, but running RR several times and choosing the best solution considerably improves the accuracy results.

Table 3. M-of-N rule learnt on the mushrooms dataset.
If at least 3 of the following conditions are met, then the mushroom is poisonous:
  bruises = yes
  odor ∈ {almond, foul, musty, none, pungent}
  gill attachment = attached
  gill spacing = crowded
  stalk root = rooted
  stalk color above ring = pink
  stalk color below ring = pink
  ring number = one
  ring type ∈ {large, pendant}
  spore print color = brown
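The equivalence between M-of-N rules and $\{0,1\}$-weighted threshold functions used throughout the paper is direct: a rule such as the one of Table 3 fires exactly when $\langle c, x\rangle \ge M$ for the indicator vector $c$ of the selected binary conditions. A small illustration (ours; the encoding of the conditions into a binary vector is hypothetical):

```python
import numpy as np

def m_of_n_predict(x_binary, selected, M):
    """Classify as positive iff at least M of the selected binary conditions hold,
    i.e. iff <c, x> >= M for the {0,1} indicator vector c of 'selected'."""
    c = np.zeros(x_binary.shape[-1])
    c[selected] = 1.0
    return np.where(x_binary @ c >= M, 1, -1)

# 10 binarized conditions as in Table 3; an example satisfying conditions 0, 3 and 7
x = np.zeros(10); x[[0, 3, 7]] = 1
print(m_of_n_predict(x, selected=range(10), M=3))   # -> 1 (poisonous)
```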

Table 4. Test error rates and average running time (in seconds) on the mushrooms data.
            Cvx    RR     RR_20   GR
run time    -      0.6s   0.74s   11s

5. Supplementary Material

The results and proofs presented in this section only appear in the extended version of this work.

Theorem 9. For a training set $\{(x_i, y_i)\}_{i=1}^m$, checking whether there exists an M-of-N rule achieving zero training error is NP-hard.

Proof. By reduction from the exact 3-cover problem (X3C). Let $U = \{1, \dots, m\}$ be a set of $m$ elements and $\mathcal{C} = \{C_1, \dots, C_n\}$ a collection of 3-subsets of $U$. There exists an exact 3-cover of $U$ iff there exists $\mathcal{C}' \subseteq \mathcal{C}$ such that each element of $U$ is covered exactly once by some subset of $\mathcal{C}'$. Let us build a dataset consisting of $2m + 3$ examples and $n + 2$ features. As usual, the $i$-th example is $x_i = (x_{i,1}, \dots, x_{i,n+2})$ and its label is $y_i$.

To begin, let us describe the first $2m$ examples. For each $i \in U$, we have a positive example $x_i$ and a negative example $x_{m+i}$. We set $x_{i,n+2} = 1$ and $x_{i,n+1} = x_{m+i,n+1} = x_{m+i,n+2} = 0$. Also, for all $i \in \{1, \dots, m\}$ and $j \in \{1, \dots, n\}$, we have $x_{i,j} = x_{m+i,j} = 1$ if $i \in C_j$, and $x_{i,j} = x_{m+i,j} = 0$ otherwise. The last three examples are built as follows. We have $y_{2m+1} = +1$ and $y_{2m+2} = y_{2m+3} = -1$, with $x_{2m+1,n+1} = x_{2m+1,n+2} = 1$, $x_{2m+2,n+1} = 1$ and $x_{2m+3,n+2} = 1$. All other attributes of the last three examples are set to zero. This construction is summarized in Table 5.

Let us now prove that $\mathcal{C}'$ is a solution to the exact 3-cover problem if and only if there is an M-of-N rule classifying the data correctly. We start with the "only if" part. It is straightforward to check that if $\mathcal{C}'$ is a solution to the X3C problem, then the rule "if at least 2 features among the subset $\mathcal{C}' \cup \{n+1, n+2\}$ are set to one, then the example is positive" correctly classifies the data. Now the "if" part. Assume there exists some learnt rule correctly classifying the data. To correctly classify the last three examples, such a rule must necessarily be of the form "if at least 2 features among a subset $\mathcal{C}' \cup \{n+1, n+2\}$ are set to one, then the example is positive". So each positive example must be covered by at least two features of the rule. Let us show that each of the first $m$ (positive) examples is covered by exactly two features. First, note that each example $x_i$ for $i \in \{1, \dots, m\}$ is covered once by feature $n+2$. Assume by contradiction that example $x_i$ is also covered more than once by the first $n$ features. Then the negative example $x_{m+i}$ would be covered at least twice and hence incorrectly classified. Thus, each element of $U$ is covered exactly once by $\mathcal{C}'$, which therefore constitutes an exact 3-set cover. □

[Table 5. Summary of the construction used in the proof of Theorem 9.]
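The reduction used in the proof of Theorem 9 is easy to materialize. The sketch below is ours, following the construction described above; it builds the $2m+3$ examples and $n+2$ features from an X3C instance indexed from 0.

```python
import numpy as np

def x3c_to_dataset(m, subsets):
    """Build the dataset of Theorem 9 from an X3C instance over U = {0,...,m-1},
    where 'subsets' is a list of n three-element subsets of U.
    Returns X of shape (2m+3, n+2) and labels y in {+1, -1}."""
    n = len(subsets)
    X = np.zeros((2 * m + 3, n + 2))
    y = np.empty(2 * m + 3)
    for i in range(m):
        covered = [j for j, Cj in enumerate(subsets) if i in Cj]
        X[i, covered] = 1.0;  X[i, n + 1] = 1.0;  y[i] = +1      # positive example
        X[m + i, covered] = 1.0;                  y[m + i] = -1  # its negative twin
    X[2 * m, n] = X[2 * m, n + 1] = 1.0;  y[2 * m] = +1          # forces threshold 2
    X[2 * m + 1, n] = 1.0;                y[2 * m + 1] = -1
    X[2 * m + 2, n + 1] = 1.0;            y[2 * m + 2] = -1
    return X, y

X, y = x3c_to_dataset(6, [{0, 1, 2}, {3, 4, 5}, {0, 3, 4}])
print(X.shape, y)
```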

Proof of Lemma 2. Let $z$ be a binomial random variable with parameters $p = 1/2$ and $m$, and note that $\sum_{i=1}^m \sigma_i$ has the same distribution as $2z - m$, so $\mathbb{E}[|\sum_{i=1}^m \sigma_i|] = 2\,\mathbb{E}[|z - \mathbb{E}[z]|]$. For even $m$, the mean deviation of $z$ is $\mathrm{MD} = \mathbb{E}[|z - \mathbb{E}[z]|] = \frac{m}{2}\binom{m}{m/2}2^{-m}$. From Stirling's formula we get $\binom{m}{m/2}2^{-m} \ge \frac{1}{2\sqrt{m}}$, and thus $\mathrm{MD} \ge \sqrt{m}/4$. Consequently, for even $m$, $\mathbb{E}[|\sum_{i=1}^m \sigma_i|] \ge \sqrt{m}/2 \ge \sqrt{m/8}$. If $m$ is odd, then by Jensen's inequality $\mathbb{E}[|\sum_{i=1}^m \sigma_i|] \ge \mathbb{E}[|\sum_{i=1}^{m-1}\sigma_i|] \ge \sqrt{m-1}/2 \ge \sqrt{m/8}$ for $m \ge 3$, and the case $m = 1$ is immediate. Let us now derive the upper bound:

$\mathbb{E}\left[\left|\sum_{i=1}^m \sigma_i\right|\right] \le \sqrt{\mathbb{E}\left[\left(\sum_{i=1}^m\sigma_i\right)^2\right]} = \sqrt{\mathrm{Var}\left(\sum_{i=1}^m \sigma_i\right)} = \sqrt{m}.$ □

Proof of Lemma 5. Let $z \in \mathbb{R}^n$ be the random vector defined by $z_i = x_i(c_i - w_i)$. Then we have $\mathbb{E}[z_i] = 0$ and $\mathbb{E}[z_i^2] = x_i^2(w_i - w_i^2) = x_i^2 w_i(1 - w_i) \le x_i^2/4$. Because the variables $z_i$ are independent zero-mean random variables with $|z_i| \le \|x\|_\infty$, we can apply Bernstein's inequality:

$P\left[\left|\sum_{i=1}^n z_i\right| > t\right] \le 2\exp\left(-\frac{t^2/2}{\sum_i \mathbb{E}[z_i^2] + \|x\|_\infty t/3}\right).$

To derive the first bound, we start by noting that $\sum_i \mathbb{E}[z_i^2] \le \|x\|_2^2/4$. Thus, rewriting Bernstein's inequality and bounding it by $\delta$ yields

$2\exp\left(-\frac{t^2/2}{\|x\|_2^2/4 + \|x\|_\infty t/3}\right) \le \delta.$

Taking the logarithm, this last inequality becomes $\frac{t^2}{2} - \frac{\|x\|_\infty\ln(2/\delta)}{3}\,t - \frac{\|x\|_2^2\ln(2/\delta)}{4} \ge 0$, a polynomial inequality of the form $at^2 + bt + c \ge 0$ with $a = 1/2$, $b = -\frac{\|x\|_\infty}{3}\ln(2/\delta)$ and $c = -\frac{\|x\|_2^2}{4}\ln(2/\delta)$. Any $t$ larger than the positive root $t^* = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$ satisfies this inequality, and hence Bernstein's bound. Looking for an upper bound on $t^*$ with a more readable form, we get

$t^* = -b + \sqrt{b^2 - 2c} \le -2b + \sqrt{-2c} = \frac{2}{3}\|x\|_\infty\ln(2/\delta) + \frac{\|x\|_2}{\sqrt{2}}\sqrt{\ln(2/\delta)} \le \frac{2}{3}\|x\|_\infty\ln(2/\delta) + 1.5\,\|x\|_2\sqrt{\ln(2/\delta)},$

where the first inequality holds because $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$.

Let us now prove the second bound with the exact same technique. Note that $\sum_i \mathbb{E}[z_i^2] \le \sum_i x_i^2 w_i \le \|x\|_\infty^2\|w\|_1$. Plugging this into Bernstein's bound, we again obtain a polynomial inequality $at^2 + bt + c \ge 0$, now with $a = 1/2$, $b = -\frac{\|x\|_\infty}{3}\ln(2/\delta)$ and $c = -\|x\|_\infty^2\|w\|_1\ln(2/\delta)$. Looking for an upper bound on the positive root, we get

$t^* = -b + \sqrt{b^2 - 2c} \le \frac{2}{3}\|x\|_\infty\ln(2/\delta) + \sqrt{2\|w\|_1\|x\|_\infty^2\ln(2/\delta)} \le \frac{2}{3}\|x\|_\infty\ln(2/\delta) + 1.7\,\|x\|_\infty\sqrt{\|w\|_1\ln(2/\delta)},$

where the last inequality holds because $\sqrt{2} \le 1.7$. □

References

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

P. Berman and M. Karpinski. On some tighter inapproximability results (extended abstract). In Automata, Languages and Programming, 26th International Colloquium (ICALP). Springer, 1999.

Y. Chevaleyre, F. Koriche, and J.-D. Zucker. Rounding methods for discrete linear classification (extended version). Technical report, HAL, 2013.

U. Feige, M. Karpinski, and M. Langberg. A note on approximating Max-Bisection on regular graphs. Information Processing Letters, 79(4):181-188, 2001.

M. Golea and M. Marchand. Average case analysis of the clipped Hebb rule for nonoverlapping perceptron networks. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT'93). ACM, 1993a.

M. Golea and M. Marchand. On learning perceptrons with binary weights. Neural Computation, 5(5):767-782, 1993b.

S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.

H. Köhler, S. Diederich, W. Kinzel, and M. Opper. Learning algorithm for a neural network with binary synapses. Zeitschrift für Physik B Condensed Matter, 78:333-342, 1990.

M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581-L586, 1990.

L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, 1988.

P. Raghavan and C. D. Thompson. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365-374, 1987.

J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71-101, 1993.

S. Venkatesh. On learning binary weights for majority functions. In Proceedings of the 4th Annual Workshop on Computational Learning Theory (COLT). Morgan Kaufmann, 1991.

D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011.


More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup) Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Complex Quadratic Optimization and Semidefinite Programming

Complex Quadratic Optimization and Semidefinite Programming Coplex Quadratic Optiization and Seidefinite Prograing Shuzhong Zhang Yongwei Huang August 4 Abstract In this paper we study the approxiation algoriths for a class of discrete quadratic optiization probles

More information

time time δ jobs jobs

time time δ jobs jobs Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information