Rounding Methods for Discrete Linear Classification (Extended Version)


Yann Chevaleyre (LIPN, CNRS UMR 7030, Université Paris Nord, 99 Avenue Jean-Baptiste Clément, Villetaneuse, France)
Frédéric Koriche (CRIL, CNRS UMR 8188, Université d'Artois, Rue Jean Souvraz SP 18, 62307 Lens, France)
Jean-Daniel Zucker, jean-daniel.zucker@ird.fr (INSERM U872, Université Pierre et Marie Curie, 15 Rue de l'Ecole de Médecine, Paris, France)

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the authors.

Abstract

Learning discrete linear classifiers is known as a difficult challenge. In this paper, this learning task is cast as a combinatorial optimization problem: given a training sample formed by positive and negative feature vectors in the Euclidean space, the goal is to find a discrete linear function that minimizes the cumulative hinge loss of the sample. Since this problem is NP-hard, we examine two simple rounding algorithms that discretize the fractional solution of the problem. Generalization bounds are derived for several classes of binary-weighted linear functions, by analyzing the Rademacher complexity of these classes and by establishing approximation bounds for our rounding algorithms. Our methods are evaluated on both synthetic and real-world data.

1. Introduction

Linear classification is a well-studied learning problem in which one needs to extrapolate, from a set of positive and negative examples represented in Euclidean space by their feature vectors, a linear hypothesis $h(x) = \mathrm{sgn}(\langle w, x\rangle - b)$ that correctly classifies future, unseen examples. In the past decades, a wide variety of theoretical results and efficient algorithms have been obtained for learning real-weighted linear functions, also known as perceptrons. Notably, it is well known that the linear classification problem can be cast as a convex optimization problem and solved in polynomial time by support vector machines when the performance of hypotheses is measured by convex loss functions such as the hinge loss (see e.g. Shawe-Taylor and Cristianini, 2000).

Much less is known, however, about learning discrete linear classifiers. Yet integer weights, and in particular {0,1}-valued and {-1,0,1}-valued weights, can play a crucial role in many application domains in which the classifier has to be interpretable by humans. One of the main motivating applications for this work comes from the field of quantitative metagenomics, the study of the collective genome of the micro-organisms inhabiting our body. It is now technically possible to measure the abundance of a bacterial species by measuring the activity of specific tracer genes for that species. Moreover, it is known that the abundance of some bacterial species in our body is related to obesity or leanness. Instead of learning a standard linear classifier to predict obesity, biologists would like to find two small groups of bacterial species such that, if the abundance of the bacteria in the first group exceeds that of the second group, the individual is classified as obese. Given a dataset in which features represent the abundance of specific bacterial species, this problem boils down to learning a linear classifier with {-1,0,1}-valued weights.

In other domains, such as medical diagnosis, the interpretability of predictive models is also a key aspect. The most common diagnostic models are M-of-N rules (Towell and Shavlik, 1993), according to which patients are classified as ill if at least M criteria among N are satisfied. However, learning M-of-N rules is hard (a proof is provided in the appendix).

In binary classification, linear threshold functions with {0,1}-valued weights are equivalent to M-of-N rules. Thus, the theory and the algorithms described in this paper can also be used to learn such rules, as shown in the experimental section.

Perhaps the major obstacle to the development of discrete linear functions lies in the fact that, in the standard distribution-free PAC learning model, the problem of finding an integer-weighted linear function that is consistent with a training set is equivalent to the Zero-One Integer Linear Programming problem (Pitt and Valiant, 1988), which is NP-complete. In order to alleviate this issue, several authors have investigated the learnability of discrete linear functions in distribution-specific models, such as the uniform distribution (Golea and Marchand, 1993a; Köhler et al., 1990; Opper et al., 1990; Venkatesh, 1991) or product distributions (Golea and Marchand, 1993b). Yet, beyond this pioneering work, many questions remain open, especially when the model is distribution-free but the loss functions are convex.

In this paper, we consider just such a scenario by examining the problem of learning binary-weighted linear functions with the hinge loss, a well-known surrogate of the zero-one loss. The key components of the classification problem are a set $C \subseteq \{0,1\}^n$ of boolean vectors from which the learner picks its hypotheses, and a fixed yet hidden probability distribution over the set $\mathbb{R}^n \times \{\pm 1\}$ of examples. (As explained in Section 4, {-1,0,1}-weighted classification can be reduced to {0,1}-weighted classification.) For a hinge parameter $\gamma > 0$, the hinge loss penalizes a hypothesis $c \in C$ on an example $(x, y)$ if its margin $y\langle c, x\rangle$ is less than $\gamma$. The performance of a hypothesis $c \in C$ is measured by its risk, denoted $\mathrm{risk}(c)$, and defined as the expected loss of $c$ on an example $(x, y)$ drawn from the underlying distribution. Typically, $\mathrm{risk}(c)$ is upper-bounded by the sum of two terms: a sample estimate $\widehat{\mathrm{risk}}_\gamma(c)$ of the performance of $c$, and a penalty term $T(C)$ that depends on the hypothesis class $C$ and, potentially, also on the training set. The sample estimate $\widehat{\mathrm{risk}}_\gamma(c)$ is simply the averaged cumulative hinge loss of $c$ on a set $\{(x_i, y_i)\}_{i=1}^m$ of examples drawn independently from the underlying distribution. The penalty term $T(C)$ can be given by the VC dimension of $C$, or by its Rademacher complexity with respect to the size $m$ of the training set. For binary-weighted linear classifiers, the penalty term induced by their Rademacher complexity can be substantially smaller than the penalty term induced by their VC dimension. So, by a simple adaptation of Bartlett and Mendelson's framework (2002), our risk bounds take the form

$\mathrm{risk}(c) \le \widehat{\mathrm{risk}}_\gamma(c) + \frac{2}{\gamma} R_m(C) + \sqrt{\frac{8\ln(2/\delta)}{m}}$   (1)

where $R_m(C)$ is the Rademacher complexity of $C$ with respect to $m$, and $\delta \in (0,1)$ is a confidence parameter.

Ideally, we would like to have at our disposal an efficient algorithm for minimizing $\widehat{\mathrm{risk}}_\gamma$. The resulting minimizer, say $c^*$, would be guaranteed to provide an optimal hypothesis, because the other terms in the risk bound (1) do not depend on the choice of the hypothesis. Unfortunately, because the class $C$ of discrete linear classifiers is not a convex set, the convexity of the hinge loss does not help in finding $c^*$ and, as shown by Theorem 1 in the next section, the optimization problem remains NP-hard. The key message to be gleaned from this paper is that the convexity of the loss function does help in approximating the combinatorial optimization problem, using simple rounding methods. Our first algorithm is a standard randomized rounding (RR) method that starts from a fractional solution $w^*$ in the convex hull of $C$, and then builds $c$ by viewing the fractional value $w^*_i$ as the probability that $c_i$ should be set to 1.
The second algorithm, called greedy rounding (GR), is essentially a derandomization of RR that iteratively rounds the coordinates of the fractional solution while maintaining a constraint on the sum of weights. For the class $C$ of binary-weighted linear functions, we show that the greedy rounding algorithm is guaranteed to return a concept $c \in C$ satisfying

$\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_2}{2\gamma}$

where $X_p = \max_i \|x_i\|_p$ and $\|x\|_p$ is the $L_p$-norm of $x$. We also show that the problem of improving this bound by more than a constant factor is NP-hard. Combining greedy rounding's performance with the Rademacher complexity of $C$ yields the risk bound

$\mathrm{risk}(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_2}{2\gamma} + \frac{2 X_1}{\gamma}\min\left\{1, \sqrt{\frac{n}{m}}\right\} + \sqrt{\frac{8\ln(2/\delta)}{m}}.$

For the subclass $C_k$ of sparse binary-weighted linear functions involving at most $k$ ones among $n$, we show that greedy rounding is guaranteed to return a concept $c \in C_k$ satisfying

$\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty\sqrt{k}}{\gamma}.$

Using the Rademacher complexity of $C_k$, which is substantially smaller than that of $C$, we have

$\mathrm{risk}(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty\sqrt{k}}{\gamma} + \frac{2 k X_\infty}{\gamma}\sqrt{\frac{2\ln n}{m}} + \sqrt{\frac{8\ln(2/\delta)}{m}}.$

Similar results are derived for the randomized rounding algorithm, with less sharp bounds due to the randomization process. We evaluate these rounding methods on both synthetic and real-world datasets, showing good performance in comparison with standard linear optimization methods. The proofs of the preparatory Lemmas 2 and 5 can be found in the appendix.

2. Binary-Weighted Linear Classifiers

Notation. The set of positive integers $\{1, \dots, n\}$ is denoted $[n]$. For a subset $S \subseteq \mathbb{R}^n$, we denote by $\mathrm{conv}(S)$ the convex hull of $S$. For two vectors $u, v \in \mathbb{R}^n$ and $p \ge 1$, the $L_p$-norm of $u$ is denoted $\|u\|_p$ and the inner product between $u$ and $v$ is denoted $\langle u, v\rangle$. Given a vector $u \in \mathbb{R}^n$ and $k \in [n]$, we denote by $u_{1:k}$ the prefix $(u_1, \dots, u_k)$ of $u$. Finally, given a training set $\{(x_i, y_i)\}_{i=1}^m$, we write $X_p = \max_i \|x_i\|_p$.

In this study, we examine classification problems for which the set of instances is the Euclidean space $\mathbb{R}^n$ and the hypothesis class is a subset of $\{0,1\}^n$. Specifically, we focus on the class $C = \{0,1\}^n$ of all binary-weighted linear functions, and on the subclass $C_k$ of all binary-weighted linear functions with at most $k$ ones among $n$. The parameterized loss function $l_\gamma : \mathbb{R} \times \{\pm 1\} \to \mathbb{R}$ examined in this work is the hinge loss defined by

$l_\gamma(p, y) = \frac{1}{\gamma}\max(0, \gamma - py)$, where $\gamma > 0$.

2.1. Computational Complexity

For a training set $\{(x_i, y_i)\}_{i=1}^m$, the empirical risk of a weight vector $c \in C$, denoted $\widehat{\mathrm{risk}}_\gamma(c)$, is defined by its averaged cumulative loss:

$\widehat{\mathrm{risk}}_\gamma(c) = \frac{1}{m}\sum_{i=1}^m l_\gamma(\langle c, x_i\rangle, y_i).$

By $c^*$ we denote any minimizer of the objective function $\widehat{\mathrm{risk}}_\gamma$ over the hypothesis class. Recall that if the hypothesis class were a convex subset of $\mathbb{R}^n$, then such a minimizer could be found in polynomial time using convex optimization algorithms.
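As a concrete illustration of the quantities just defined, here is a minimal NumPy sketch of the $\gamma$-hinge loss and of the empirical risk of a candidate weight vector. It is our own illustration, not the paper's implementation, and the function names are ours.

```python
import numpy as np

def hinge_loss(p, y, gamma=1.0):
    """gamma-hinge loss: l_gamma(p, y) = (1/gamma) * max(0, gamma - p*y)."""
    return np.maximum(0.0, gamma - p * y) / gamma

def empirical_risk(c, X, y, gamma=1.0):
    """Averaged cumulative hinge loss of weight vector c on the sample (X, y).
    X: (m, n) feature matrix, y: (m,) labels in {-1, +1}, c: (n,) weights."""
    return hinge_loss(X @ c, y, gamma).mean()

# toy usage: risk of an all-ones binary weight vector on a random sample
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 10))
y = np.where(X[:, :3].sum(axis=1) >= 0, 1.0, -1.0)  # labels driven by 3 features
print(empirical_risk(np.ones(10), X, y, gamma=1.0))
```

The same function applies unchanged to fractional vectors $w \in [0,1]^n$, which is how the relaxed problems of Section 3 are evaluated.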
However, for the discrete class $C = \{0,1\}^n$, the next result states that the optimization problem is much harder.

Theorem 1. There exists a constant $\alpha > 0$ such that, unless P = NP, there is no polynomial-time algorithm capable of learning, from any dataset of size $m$, a hypothesis $c \in C$ such that

$\widehat{\mathrm{risk}}_\gamma(c) \le \min_{c' \in C} \widehat{\mathrm{risk}}_\gamma(c') + \frac{\alpha X_2}{\gamma}.$

Proof. In what follows, we denote by $c^*$ any vector in $C$ for which $\widehat{\mathrm{risk}}_\gamma(c^*)$ is minimal. For an undirected graph $G = (V, E)$, the Max-Cut problem is to find a subset $S \subseteq V$ such that the number of edges with one endpoint in $S$ and the other in $V \setminus S$ is maximal. Unless P = NP, no polynomial-time algorithm can achieve an approximation ratio of 0.997 for Max-Cut on 3-regular graphs (Berman and Karpinski, 1999). Based on this result, we first construct a dataset from a 3-regular graph $G = (V, E)$ having an even number of vertices. Our dataset consists of $n = |V| + 1$ features and $m = 2|E|$ examples. The first $|V|$ features are associated with the vertices of $G$. For each edge $(j, j') \in E$, we build two positively labeled examples $x$ and $x'$ in the following way. In the first example $x$, features $j$ and $j'$ are set to $\gamma$, and all other features are set to 0. In the second example $x'$, features $j$ and $j'$ are set to $-\gamma$, feature $|V|+1$ is set to $2\gamma$, and all others are set to 0.

Consider any weight vector $c$ where $c_{|V|+1}$ is equal to 0. Clearly, setting $c_{|V|+1}$ to 1 can only decrease the loss of $c$. Thus, we will assume from now on, without loss of generality, that $c_{|V|+1}$ is always set to 1. Observe that the loss on the two examples $x$ and $x'$ is

$l_\gamma(\langle c, x\rangle, +1) + l_\gamma(\langle c, x'\rangle, +1) = 0$ if $c_j \ne c_{j'}$, and $1$ otherwise.

Let us now define $\mathrm{cut}(c) = |\{(j, j') \in E : c_j \ne c_{j'}\}|$. By viewing $c$ as the characteristic vector of a subset of vertices, $\mathrm{cut}(c)$ is the value of the cut in $G$ induced by this subset. Note that we have $\mathrm{cut}(c) = |E| - 2|E|\,\widehat{\mathrm{risk}}_\gamma(c)$. Thus, minimizing the loss on the dataset maximizes the cut on the graph; consequently, $\mathrm{cut}(c^*)$ is the optimal value of the Max-Cut problem. Finally, suppose by contradiction that for every $\alpha > 0$ there is a polynomial-time algorithm capable of learning, from any dataset of size $m$, a vector $c$ satisfying $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \alpha X_2/\gamma$. In the dataset constructed above, the value of $X_2$ is $\gamma\sqrt{6}$. Thus, for this dataset, we get that $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \alpha\sqrt{6}$, and hence

$\mathrm{cut}(c) \ge \mathrm{cut}(c^*) - 2\alpha\sqrt{6}\,|E|.$   (2)

To this point, Feige et al. (2001) have shown that on 3-regular graphs the optimal cut has a value of at least $|E|/2$. By reporting this value into (2), we obtain

$\mathrm{cut}(c) \ge \mathrm{cut}(c^*) - 4\alpha\sqrt{6}\,\mathrm{cut}(c^*) = \mathrm{cut}(c^*)\left(1 - 4\alpha\sqrt{6}\right).$

Because $\alpha$ can be taken arbitrarily close to 0, this implies that Max-Cut on 3-regular graphs is approximable within any constant factor, which contradicts Berman and Karpinski's (1999) inapproximability result. □

2.2. Rademacher Complexity

Suppose that our training set $S = \{(x_i, y_i)\}_{i=1}^m$ consists of $m$ examples generated by independent draws from some fixed probability distribution on $\mathbb{R}^n \times \{\pm 1\}$. For a class $F$ of real-valued functions $f : \mathbb{R}^n \to \mathbb{R}$, define its Rademacher complexity on $S$ to be

$R_S(F) = \mathbb{E}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^m \sigma_i f(x_i)\right].$

Here, the expectation is over the Rademacher random variables $\sigma_1, \dots, \sigma_m$, which are drawn from $\{\pm 1\}$ with equal probability. Since $S$ is random, we can also take the expectation over the choice of $S$ and define $R_m(F) = \mathbb{E}[R_S(F)]$, which gives a quantity that depends on both the function class and the sample size. As indicated by inequality (1), bounds on the Rademacher complexity of a class immediately yield risk bounds for classifiers picked from that class. For continuous linear functions, sharp Rademacher complexity bounds have been provided by Kakade et al. (2008). We provide here similar bounds for two important classes of discrete linear functions.

Lemma 2. Let $\sigma_1, \dots, \sigma_m$ be Rademacher variables. Then $\sqrt{m/8} \le \mathbb{E}[|\sum_{i=1}^m \sigma_i|] \le \sqrt{m}$ for any $m \ge 1$.

Theorem 3. Let $C = \{0,1\}^n$ be the class of all binary-weighted linear functions. Then

$R_m(C) \le X_1 \min\left\{1, \sqrt{\frac{n}{m}}\right\}.$

This bound is tight up to a constant factor.

Proof. Consider the hypothesis class $F_{p,v} = \{x \mapsto \langle w, x\rangle : w \in \mathbb{R}^n, \|w\|_p \le v\}$. By Theorem 1 in (Kakade et al., 2008), we have $R_m(F_{2,v}) \le v X_2/\sqrt{m}$. Moreover, for any $c \in \{0,1\}^n$ we have $\|c\|_2 \le \sqrt{n}$. It follows that $C \subseteq F_{2,\sqrt{n}}$, and since $\|x\|_2 \le \|x\|_1$, we get that

$R_m(C) \le R_m(F_{2,\sqrt{n}}) \le \sqrt{\frac{n}{m}}\, X_2 \le \sqrt{\frac{n}{m}}\, X_1.$

In addition, for any realization of the Rademacher variables, $\sup_{c \in C} \frac{1}{m}\sum_i \sigma_i\langle c, x_i\rangle \le \frac{1}{m}\sum_i \|x_i\|_1 \le X_1$, so $R_m(C) \le X_1$, which yields the stated minimum.

Now, let us prove that this bound is tight. First, let us rewrite the Rademacher complexity over $m$ samples in a more convenient form:

$R_S(C) = \frac{1}{m}\sum_{j=1}^n \mathbb{E}\left[\sup_{c_j \in \{0,1\}} c_j \sum_{i=1}^m \sigma_i x_{i,j}\right] = \frac{1}{2m}\sum_{j=1}^n \mathbb{E}\left[\sup_{w_j \in \{-1,1\}} w_j \sum_{i=1}^m \sigma_i x_{i,j}\right] = \frac{1}{2m}\sum_{j=1}^n \mathbb{E}\left[\left|\sum_{i=1}^m \sigma_i x_{i,j}\right|\right].$   (3)

For the case $n \ge m$, consider a training set $S$ such that $x_{i,i} = X_1$ for all $i \in [m]$, and zero everywhere else. Clearly, equation (3) implies $R_S(C) = X_1/2$. For the case $n < m$, assume $m$ is a multiple of $n$ and consider a dataset $S$ in which each example contains only one non-zero value, equal to $X_1$, and such that the number of non-zero values per column is $m/n$. Then, by applying Lemma 2 to equation (3), we obtain

$R_S(C) \ge \frac{n X_1}{2m}\sqrt{\frac{m}{8n}} = X_1\sqrt{\frac{n}{32\, m}}.$ □

Theorem 4. For a constant $k > 0$, let $C_k$ be the class of binary-weighted linear functions with at most $k$ ones among $n$. Then

$R_m(C_k) \le k\, X_\infty \sqrt{\frac{2\ln n}{m}}.$

Proof. For a closed convex set $S \subseteq \mathbb{R}^n_+$, consider the hypothesis class $F_S = \{x \mapsto \langle w, x\rangle : w \in S\}$. Using the convex function $F(w) = \sum_{j=1}^n \frac{w_j}{W_1}\ln\frac{w_j}{W_1} + \ln n$, where $W_1 = \max_{w \in S}\|w\|_1$, we get from Theorem 1 in (Kakade et al., 2008):

$R_m(F_S) \le X_\infty W_1 \sqrt{\frac{2\sup\{F(w) : w \in S\}}{m}}.$   (4)

For any $k$, let $S_k = \mathrm{conv}(\{w \in \{0,1\}^n : \|w\|_1 \le k\})$. Because $F$ is convex and $S_k$ is a convex polytope, the supremum of $F$ over $S_k$ is attained at one of the vertices of the polytope. At a vertex with $l \le k$ ones we have $F(w) = \ln n + \sum_{j=1}^{l} \frac{1}{k}\ln\frac{1}{k} = \ln n - \frac{l}{k}\ln k \le \ln n$. The result follows by reporting this value into (4), with $W_1 = k$. □
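The closed form (3) makes the empirical Rademacher complexity of $C$ easy to estimate numerically. The following sketch is our own illustration of that identity, assuming a plain Monte Carlo average over Rademacher draws; the helper name and sampling size are arbitrary.

```python
import numpy as np

def empirical_rademacher_binary(X, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(C) for C = {0,1}^n via equation (3):
    R_S(C) = 1/(2m) * sum_j E | sum_i sigma_i x_{ij} |."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher draws
    col_abs = np.abs(sigmas @ X)            # |sum_i sigma_i x_{ij}| per draw and j
    return col_abs.sum(axis=1).mean() / (2.0 * m)

# compare with the bound of Theorem 3: X_1 * min(1, sqrt(n/m))
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 20))
X1 = np.abs(X).sum(axis=1).max()
bound = X1 * min(1.0, np.sqrt(X.shape[1] / X.shape[0]))
print(empirical_rademacher_binary(X), bound)
```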

Algorithm 1: Randomized Rounding (RR)
Parameters: a set of $m$ examples, a convex set $S$
1. Solve $w^* = \arg\min_{w \in [0,1]^n \cap S} \widehat{\mathrm{risk}}_\gamma(w)$
2. For each $i \in [n]$, set $c_i$ to 1 with probability $w^*_i$
3. Return $c$

[Figure 1. (left) Intersection of the $\ell_1$-ball of radius $k$ and of the $\ell_\infty$-ball of radius 1, for non-negative coordinates. (right) A case where the solution of the convex relaxation coincides with the solution of the original problem.]

3. Rounding Methods

This section exploits the convexity of the hinge loss to derive simple approximation algorithms for minimizing the empirical risk. The overall idea is to first relax the optimization problem so as to obtain a fractional solution $w^*$, and then to round that solution using a deterministic or a randomized method. The convex optimization setting we consider is defined by

$w^* = \arg\min_{w \in [0,1]^n \cap S} \widehat{\mathrm{risk}}_\gamma(w)$   (5)

where $S = \mathbb{R}^n_+$ for the hypothesis class $C = \{0,1\}^n$, and $S = \{w \in \mathbb{R}^n_+ : \|w\|_1 \le k\}$ for the subclass $C_k$ of binary-weighted linear functions with at most $k$ ones among $n$. Note that the empirical risk minimization problem for $C$ can be viewed as an optimization problem over $\mathbb{R}^n_+$ under the $L_\infty$-norm constraint. The problem of minimizing the empirical risk in the convex hull of $C_k$ is illustrated in the left part of Figure 1.

The accuracy of rounding methods depends on the number of non-fractional values in the relaxed solution $w^*$. Indeed, if most weights of $w^*$ are already in $\{0,1\}$, then these values remain unchanged by the rounding phase, and the final approximation $c$ will be close to $w^*$. The right part of Figure 1 illustrates this phenomenon by representing a case where $w^*$ and $c$ coincide: the objective function is represented by ellipses, and the four dots at the corners of the square are the vectors of $\{0,1\}^2$. The hinge parameter also plays an important role in the quality of the rounding process: increasing $\gamma$ increases the likelihood that weights become binary. Taking this to the extreme, if $\gamma \ge X_1$, then the hinge loss is linear inside the $[0,1]^n$ hypercube, and all the convex optimization tasks described above yield solutions with binary weights. We note in passing that a similar phenomenon arises in the Lasso feature selection procedure, where the weight vectors are more likely to fall on a vertex of the $L_1$-ball as the margin increases.

3.1. Randomized Rounding

The randomized rounding (RR) algorithm is one of the most popular approximation schemes for combinatorial optimization (Raghavan and Thompson, 1987; Williamson and Shmoys, 2011). In the setting of our framework, the algorithm starts from the fractional solution $w^*$ of the problem and draws a random concept $c \in C$ by setting each value $c_i$ independently to 1 with probability $w^*_i$ and to 0 with probability $1 - w^*_i$. The following lemma, derived from Bernstein's inequality, states that using $c$ instead of $w^*$ to compute a dot product yields a bounded deviation.

Lemma 5. Let $x \in \mathbb{R}^n$, $w \in [0,1]^n$, and let $c \in \{0,1\}^n$ be a random vector such that $P[c_i = 1] = w_i$ independently for all $i \in [n]$. Then, with probability at least $1 - \delta$, the following inequalities hold:

$|\langle c, x\rangle - \langle w, x\rangle| \le 1.5\,\|x\|_2\sqrt{\ln(2/\delta)} + \tfrac{2}{3}\|x\|_\infty\ln(2/\delta)$, and
$|\langle c, x\rangle - \langle w, x\rangle| \le 1.7\,\|x\|_\infty\sqrt{\|w\|_1\ln(2/\delta)} + \tfrac{2}{3}\|x\|_\infty\ln(2/\delta).$

Theorem 6. Let $c$ be the vector returned by the randomized rounding algorithm. Then, with probability $1 - \delta$, the following hold.

For the class $C$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{1}{\gamma}\left(1.5\, X_2\sqrt{\ln(2m/\delta)} + \tfrac{2}{3} X_\infty\ln(2m/\delta)\right).$

For the class $C_k$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty}{\gamma}\left(1.7\sqrt{\|w^*\|_1\ln(2m/\delta)} + \tfrac{2}{3}\ln(2m/\delta)\right).$

Proof. Since the $\gamma$-hinge loss is $1/\gamma$-Lipschitz, we have

$\widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(c^*) \le \widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(w^*) \le \frac{1}{\gamma m}\sum_{i=1}^m |\langle c, x_i\rangle - \langle w^*, x_i\rangle|.$   (6)

By the union bound, $P[\exists i \in [m] : |\langle c, x_i\rangle - \langle w^*, x_i\rangle| \ge t] \le \sum_{i=1}^m P[|\langle c, x_i\rangle - \langle w^*, x_i\rangle| \ge t]$. The result follows by applying Lemma 5 with confidence $\delta/m$ to each example and reporting into (6) the values $t = 1.5\, X_2\sqrt{\ln(2m/\delta)} + \tfrac{2}{3}X_\infty\ln(2m/\delta)$ and $t = 1.7\, X_\infty\sqrt{\|w^*\|_1\ln(2m/\delta)} + \tfrac{2}{3}X_\infty\ln(2m/\delta)$, respectively. □
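Once slack variables are introduced for the hinge terms, the relaxation (5) is a linear program, so any LP solver can play the role that CPLEX plays in the experiments of Section 4. The sketch below is ours: it uses SciPy's linprog rather than CPLEX, the function names are illustrative, and it simply chains the fractional solution with Algorithm 1.

```python
import numpy as np
from scipy.optimize import linprog

def solve_relaxation(X, y, gamma=1.0, k=None):
    """Fractional solution of (5): minimize the mean gamma-hinge loss over
    w in [0,1]^n, optionally under the L1 constraint sum_j w_j <= k.
    Decision variables are stacked as v = (w_1..w_n, xi_1..xi_m),
    where xi_i >= max(0, 1 - y_i <w, x_i> / gamma) are the hinge slacks."""
    m, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(m) / m])         # objective: mean slack
    A = np.hstack([-(y[:, None] * X) / gamma, -np.eye(m)])    # -y_i x_i/gamma . w - xi_i <= -1
    b = -np.ones(m)
    if k is not None:
        A = np.vstack([A, np.concatenate([np.ones(n), np.zeros(m)])])
        b = np.append(b, float(k))
    bounds = [(0.0, 1.0)] * n + [(0.0, None)] * m
    res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
    return res.x[:n]

def randomized_rounding(w_star, rng=None):
    """Algorithm 1: set each coordinate c_i to 1 with probability w*_i."""
    rng = rng or np.random.default_rng()
    return (rng.random(w_star.shape) < w_star).astype(float)

rng = np.random.default_rng(0)
X = rng.uniform(-10.0, 10.0, size=(100, 15))
target = np.zeros(15); target[:5] = 1.0
y = np.where(X @ target >= 0, 1.0, -1.0)
w_star = solve_relaxation(X, y, gamma=1.0, k=5)
print(np.round(w_star, 2))
print(randomized_rounding(w_star, rng))
```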

Algorithm 2: Greedy Rounding (GR)
Parameters: a set of $m$ examples, a convex set $S$ as in (5)
1. Solve $w^* = \arg\min_{w \in [0,1]^n \cap S} \widehat{\mathrm{risk}}_\gamma(w)$
2. For $k = 1$ to $n$, set
   $A_k \leftarrow \{a \in \{0,1\} : \sum_{i=1}^m (\theta_{i,k-1} + x_{i,k}(a - w^*_k))^2 \le \sum_{i=1}^m (\theta_{i,k-1}^2 + x_{i,k}^2\, w^*_k(1 - w^*_k))\}$
   $c_k \leftarrow \arg\min_{a \in A_k} \widehat{\mathrm{risk}}_\gamma(c_1, \dots, c_{k-1}, a, w^*_{k+1}, \dots, w^*_n)$
3. Return $c = (c_1, \dots, c_n)$

3.2. Greedy Rounding

Despite its relatively weak guarantees, the randomized rounding procedure can be used as a building block for constructing more efficient algorithms. Specifically, the new approximation scheme we propose, called greedy rounding (GR), is essentially a derandomization of RR with some improvements. As described in Algorithm 2, the procedure starts again by computing the fractional solution $w^*$ of the optimization task (line 1). Then, the coordinates of $w^*$ are rounded in a sequential manner by simply maintaining a constraint on the admissible values (line 2). The algorithm uses a matrix $\Theta = [\theta_{i,k}]$ of parameters defined as follows. For any $k \in [n]$, let $c_{1:k} = (c_1, \dots, c_k)$ be the prefix of the vector $c$ built at the end of step $k$. Then $\theta_{i,k} = \sum_{j=1}^k x_{i,j}(c_j - w^*_j)$ for each $i \in [m]$, with $\theta_{i,0} = 0$. The next result, which we call the derandomization lemma, shows that at each step of the rounding procedure there is a value $a \in \{0,1\}$ which does not increase the cumulative squared deviation too much.

Lemma 7. For any $k \le n$ and any $c_1, \dots, c_{k-1} \in \{0,1\}$, there exists $a \in \{0,1\}$ such that

$\sum_{i=1}^m \left(\theta_{i,k-1} + x_{i,k}(a - w^*_k)\right)^2 \le \sum_{i=1}^m \left(\theta_{i,k-1}^2 + x_{i,k}^2\, w^*_k(1 - w^*_k)\right).$

Proof. Let $z$ be a random vector taking values in $\{0,1\}^n$ such that $P[z_j = 1] = w^*_j$ independently for all $j \in [n]$. Clearly, we have $\mathbb{E}[z - w^*] = 0$. For any $i \in [m]$, let $f_i(z, w^*) = \langle z - w^*, x_i\rangle^2$. We can observe that $\mathbb{E}[f_i(z, w^*)] = \sum_{j=1}^n x_{i,j}^2 w^*_j(1 - w^*_j)$. Taking conditional expectations, we have

$\mathbb{E}[f_i(z, w^*) \mid z_{1:k-1} = c_{1:k-1}] = \mathbb{E}\left[\left(\langle c_{1:k-1} - w^*_{1:k-1}, x_{i,1:k-1}\rangle + \langle z_{k:n} - w^*_{k:n}, x_{i,k:n}\rangle\right)^2\right].$

In the right-hand side of this equation, the squared sum is equal to the sum of the squares, because the cross term $\langle c_{1:k-1} - w^*_{1:k-1}, x_{i,1:k-1}\rangle\,\langle z_{k:n} - w^*_{k:n}, x_{i,k:n}\rangle$ is null in expectation. We get that

$\mathbb{E}[f_i(z, w^*) \mid z_{1:k-1} = c_{1:k-1}] = \theta_{i,k-1}^2 + \sum_{j=k}^n x_{i,j}^2 w^*_j(1 - w^*_j).$

Now, let $U_1, \dots, U_n$ denote independent random variables taking values in some domain $D \subseteq \mathbb{R}$, and let $g$ be a function from $D^n$ into $\mathbb{R}$. By the definition of conditional expectation, we know that for any $j \in [n]$ there exists a value $u \in D$ such that $\mathbb{E}[g(U_1, \dots, U_n) \mid U_j = u] \le \mathbb{E}[g(U_1, \dots, U_n)]$. Applying this observation to the function $g(z) = \sum_{i=1}^m f_i(z, w^*)$, conditioned on $z_{1:k-1} = c_{1:k-1}$, there exists a value $c_k \in \{0,1\}$ such that

$\mathbb{E}\left[\textstyle\sum_{i=1}^m f_i(z, w^*) \mid z_{1:k} = c_{1:k}\right] \le \mathbb{E}\left[\textstyle\sum_{i=1}^m f_i(z, w^*) \mid z_{1:k-1} = c_{1:k-1}\right],$

which is exactly the claimed inequality with $a = c_k$. □
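A compact implementation of Algorithm 2, following the derandomization test of Lemma 7, might look as follows. This is our sketch: it assumes the fractional solution w_star comes from a routine such as solve_relaxation above, and the small numerical tolerance is ours.

```python
import numpy as np

def empirical_risk(w, X, y, gamma=1.0):
    return np.maximum(0.0, 1.0 - y * (X @ w) / gamma).mean()

def greedy_rounding(w_star, X, y, gamma=1.0):
    """Algorithm 2: derandomized rounding of the fractional solution w*.
    theta[i] tracks sum_{j <= k} x_{ij} * (c_j - w*_j) for the prefix fixed so far."""
    m, n = X.shape
    c = w_star.astype(float).copy()
    theta = np.zeros(m)
    for k in range(n):
        xk, wk = X[:, k], w_star[k]
        budget = np.sum(theta ** 2 + xk ** 2 * wk * (1.0 - wk))
        # Admissible values (the set A_k of Lemma 7); the lemma guarantees it is non-empty.
        admissible = [a for a in (0.0, 1.0)
                      if np.sum((theta + xk * (a - wk)) ** 2) <= budget + 1e-9]
        # Among admissible values, keep the one of smallest empirical risk for the
        # mixed vector (c_1, ..., c_{k-1}, a, w*_{k+1}, ..., w*_n).
        risks = []
        for a in admissible:
            c[k] = a
            risks.append(empirical_risk(c, X, y, gamma))
        c[k] = admissible[int(np.argmin(risks))]
        theta = theta + xk * (c[k] - wk)
    return c
```

Chained after solve_relaxation from the previous sketch, greedy_rounding(solve_relaxation(X, y), X, y) returns a binary weight vector in the spirit of Theorem 8 below.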

Based on this lemma, the approximation guarantees of the greedy rounding algorithm are summarized in the next theorem. Interestingly, a comparison with the lower bound for the class $C$ obtained in Theorem 1 reveals that the approximation bound of GR for this class is tight up to a constant factor. In other words, GR is an optimal approximation algorithm for the class of binary-weighted linear functions.

Theorem 8. Let $c$ be the vector returned by the GR algorithm. Then:

For the class $C$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_2}{2\gamma}.$
For the class $C_k$:   $\widehat{\mathrm{risk}}_\gamma(c) \le \widehat{\mathrm{risk}}_\gamma(c^*) + \frac{X_\infty\sqrt{k}}{\gamma}.$

Proof. Since the $\gamma$-hinge loss is $1/\gamma$-Lipschitz, we can use inequality (6) to derive that

$\widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(c^*) \le \frac{1}{\gamma m}\sum_{i=1}^m |\langle c, x_i\rangle - \langle w^*, x_i\rangle| = \frac{1}{\gamma m}\sum_{i=1}^m |\theta_{i,n}|.$   (7)

Now, by application of Lemma 7, at each step $k$ of GR the set $A_k$ is non-empty, and by construction $c_k \in A_k$. An induction over $k$ therefore yields $\sum_{i=1}^m \theta_{i,n}^2 \le \sum_{i=1}^m\sum_{j=1}^n x_{i,j}^2 w^*_j(1 - w^*_j)$. Let $R = \widehat{\mathrm{risk}}_\gamma(c) - \widehat{\mathrm{risk}}_\gamma(c^*)$. Reporting this inequality into (7) and using the Cauchy-Schwarz inequality, we obtain

$R \le \frac{1}{\gamma m}\sqrt{m\sum_{i=1}^m \theta_{i,n}^2} \le \frac{1}{\gamma}\sqrt{\frac{1}{m}\sum_{i=1}^m\sum_{j=1}^n x_{i,j}^2 w^*_j(1 - w^*_j)}.$

For the class $C$, using the fact that $w^*_j(1 - w^*_j) \le 1/4$, we have $R \le X_2/(2\gamma)$. For the class $C_k$, using Hölder's inequality and the fact that $\|w^*\|_1 \le k$, we have $\sum_j x_{i,j}^2 w^*_j(1 - w^*_j) \le \|x_i\|_\infty^2 \sum_j w^*_j \le X_\infty^2\, k$, and hence $R \le X_\infty\sqrt{k}/\gamma$. □

4. Experiments

We tested the empirical performance of our algorithms by conducting experiments on a synthetic problem and several real-world domains. Besides the randomized rounding (RR) algorithm and the greedy rounding (GR) algorithm, we evaluated the behavior of two fractional optimization techniques: the convex optimization method (Cvx) that returns the fractional solution of the problem specified by (5), and the support vector machine (L1-SVM) that solves the $\ell_1$-constrained version of the problem. For small datasets, we also evaluated MIP (mixed integer programming), which computes the exact solution of the combinatorial problem. In our implementation, we used the linear programming software CPLEX to obtain the fractional solutions of the convex optimization tasks.

4.1. Synthetic Data

In order to validate different aspects of our algorithms, we designed a simple artificial dataset generator. Called with parameters $(m, n, k, \eta)$, the generator builds a dataset composed of $m$ examples, each with $n$ features. Examples are drawn from a uniform distribution over $[-10, 10]^n$. The generator also draws at random a target function with exactly $k$ ones, and each example is labeled with respect to this target. Finally, the coordinates of each example are perturbed with Gaussian noise of standard deviation $\eta$.

We first evaluated the generalization performance of the optimization algorithms. Setting $k = 10$, $n = 100$ and $\eta = 0.1$, we generated datasets with an increasing number of examples and plotted the generalization (zero-one) loss measured on test data (upper part of Figure 2). Next, we evaluated the robustness of our algorithms with respect to irrelevant attributes. Setting $k = 10$, $m = 50$ and $\eta = 0.1$, we generated datasets with $n$ varying up to 800, and plotted again the generalization zero-one loss measured on test data (lower part of Figure 2). It is apparent that GR and RR perform significantly better than L1-SVM, which is not surprising because the target concepts have $\{0,1\}$ weights. On synthetic data, GR is slightly less accurate than RR, whose performance is close to Cvx.

[Figure 2. Test error rates on synthetic data, as a function of the number of examples (upper part) and of the number of irrelevant features (lower part).]
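A minimal version of the generator described above, under the naming conventions of this section (our sketch; the labeling rule sgn of the inner product with the target is our reading of the text):

```python
import numpy as np

def make_synthetic(m, n, k, eta, seed=0):
    """Generator of Section 4.1: m examples with n features drawn uniformly on
    [-10, 10]^n, labeled by a hidden target with exactly k ones, then perturbed
    by Gaussian noise of standard deviation eta."""
    rng = np.random.default_rng(seed)
    target = np.zeros(n)
    target[rng.choice(n, size=k, replace=False)] = 1.0
    X = rng.uniform(-10.0, 10.0, size=(m, n))
    y = np.where(X @ target >= 0, 1.0, -1.0)
    X = X + rng.normal(scale=eta, size=X.shape)   # perturb the coordinates
    return X, y, target

X, y, target = make_synthetic(m=200, n=100, k=10, eta=0.1)
print(X.shape, y[:10], int(target.sum()))
```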

4.2. Metagenomic Data

In metagenomic classification, discrete linear functions have a natural interpretation in terms of bacterial abundance. We used a real-world dataset containing 38 individuals and 69 features. The dataset is divided into two well-balanced classes: obese and non-obese people. Each feature represents the abundance of a bacterial species. As mentioned in the introduction, the weight of each feature captures a qualitative effect, encoded by a value in $\{-1, 0, +1\}$ (negative effect, null effect, positive effect). Let POS (respectively NEG) denote the group of bacterial species whose feature has a weight of $+1$ (respectively $-1$). If the abundance of the bacteria in POS is greater than the abundance of the bacteria in NEG, then the individual is classified as obese.

In order to learn ternary-weighted linear functions with our algorithms, we used a simple trick that reduces the classification task to a binary-weighted learning problem. The idea is to duplicate attributes in the following way: to each instance $x \in \mathbb{R}^n$ we associate an instance $x' \in \mathbb{R}^d$ where $d = 2n$ and $x' = (x_1, -x_1, x_2, -x_2, \dots, x_n, -x_n)$. Given a binary-weighted concept $c' \in \{0,1\}^d$, the corresponding ternary-weighted concept $c \in \{-1,0,+1\}^n$ is recovered by setting $c_i = c'_{2i-1} - c'_{2i}$. Based on this transformation, it is easy to see that $l_\gamma(\langle c', x'\rangle, y) = l_\gamma(\langle c, x\rangle, y)$. So, if $c'$ minimizes the empirical risk on the set $\{(x'_t, y_t)\}$, then $c$ minimizes the empirical risk on $\{(x_t, y_t)\}$. If, in addition, $c'$ is $k$-sparse, then $c$ is $k$-sparse.
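The duplication trick is a one-liner in practice. The sketch below (ours; function names are illustrative) builds the duplicated instances, recovers the ternary concept, and checks that inner products, and hence hinge losses, are preserved.

```python
import numpy as np

def duplicate_features(X):
    """Map each instance x to x' = (x_1, -x_1, ..., x_n, -x_n), so that a
    binary-weighted concept on the 2n duplicated features encodes a
    ternary-weighted concept on the original n features."""
    m, n = X.shape
    Xd = np.empty((m, 2 * n))
    Xd[:, 0::2] = X
    Xd[:, 1::2] = -X
    return Xd

def to_ternary(c_binary):
    """Recover c_i = c'_{2i-1} - c'_{2i} in {-1, 0, +1} from a {0,1}^{2n} concept."""
    return c_binary[0::2] - c_binary[1::2]

# <c', x'> = <c, x>, so the hinge losses are preserved by the reduction.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
c_bin = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
assert np.allclose(duplicate_features(X) @ c_bin, X @ to_ternary(c_bin))
```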
The test error rates of the algorithms on this task are reported in Table 1. Test errors were measured by conducting 10-fold cross validation, averaged over 10 experiments. In light of these results, it is apparent that RR slightly outperforms both L1-SVM and Cvx, which clearly overfit the data, even in presence of the $L_1$-ball constraint for the first two rows. Unsurprisingly, the MIP solver generated a model superior to the others. For larger values of $k$, the mixed integer program did not finish in reasonable time, so we left the corresponding entries of the table blank. In a nutshell, we can conclude that accuracy does not suffer from switching to ternary weights, but this learning task looks challenging.

Table 1. Test error rates and average running time (in seconds) on metagenomic data.
            L1-SVM   Cvx    RR      GR      MIP
run time    0.0s     0.0s   0.04s   0.98s   13s

4.3. Colon Cancer

To demonstrate the performance of discrete linear classifiers for gene selection, we applied our algorithms to publicly available microarray data on colon cancer. The dataset consists of 62 samples, of which 22 are normal and 40 are from colon cancer tissues. The genes are already pre-filtered, the dataset consisting of 2,000 genes. We launched our algorithms with $k = 15$ to select 15 genes only. We did not report results for GR because each run of GR took a huge amount of time. Instead, RR_5 is a variant of randomized rounding that selects the best out of 5 random roundings at each time step. It turns out that RR_5 achieves a much better error rate than RR in this case, but not on the datasets of the previous subsections. Thus, we obtain a concept much simpler than the linear hypothesis generated by the SVM, with comparable accuracy (Table 2).

Table 2. Test error rates and average running time (in seconds) on colon cancer data.
            L1-SVM   Cvx     RR      RR_5
run time    -        0.04s   0.04s   -

4.4. Mushrooms

Finally, we ran experiments on the mushrooms dataset to evaluate how M-of-N rules are learnt using rounding algorithms. This dataset contains features which are all nominal. We transformed these features into binary features and ran our discrete linear learning algorithms on this dataset, without imposing any cardinality constraint. The M-of-N rule shown in Table 3 was produced, with an accuracy of 98%. We also ran 10 times 10-fold cross validation with our algorithms (see Table 4). Algorithm Cvx achieves a perfect classification. Here, GR outperforms RR, but running RR several times and choosing the best solution considerably improves the accuracy results.

Table 3. M-of-N rule learnt on the mushrooms dataset.
If at least 3 of the following conditions are met, then the mushroom is poisonous:
  bruises = yes
  odor ∈ {almond, foul, musty, none, pungent}
  gill attachment = attached
  gill spacing = crowded
  stalk root = rooted
  stalk color above ring = pink
  stalk color below ring = pink
  ring number = one
  ring type ∈ {large, pendant}
  spore print color = brown
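The equivalence between M-of-N rules and $\{0,1\}$-weighted threshold functions used throughout the paper is direct: a rule such as the one of Table 3 fires exactly when $\langle c, x\rangle \ge M$ for the indicator vector $c$ of the selected binary conditions. A small illustration (ours; the encoding of the conditions into a binary vector is hypothetical):

```python
import numpy as np

def m_of_n_predict(x_binary, selected, M):
    """Classify as positive iff at least M of the selected binary conditions hold,
    i.e. iff <c, x> >= M for the {0,1} indicator vector c of 'selected'."""
    c = np.zeros(x_binary.shape[-1])
    c[selected] = 1.0
    return np.where(x_binary @ c >= M, 1, -1)

# 10 binarized conditions as in Table 3; an example satisfying conditions 0, 3 and 7
x = np.zeros(10); x[[0, 3, 7]] = 1
print(m_of_n_predict(x, selected=range(10), M=3))   # -> 1 (poisonous)
```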

Table 4. Test error rates and average running time (in seconds) on the mushrooms data.
            Cvx    RR     RR_20   GR
run time    -      0.6s   0.74s   11s

5. Supplementary Material

The results and proofs presented in this section only appear in the extended version of this work.

Theorem 9. For a training set $\{(x_i, y_i)\}_{i=1}^m$, checking whether there exists an M-of-N rule achieving zero training error is NP-hard.

Proof. By reduction from the exact 3-cover problem (X3C). Let $U = \{1, \dots, m\}$ be a set of $m$ elements and $\mathcal{C} = \{C_1, \dots, C_n\}$ a collection of 3-subsets of $U$. There exists an exact 3-cover of $U$ iff there exists $\mathcal{C}' \subseteq \mathcal{C}$ such that each element of $U$ is covered exactly once by some subset of $\mathcal{C}'$. Let us build a dataset consisting of $2m + 3$ examples and $n + 2$ features. As usual, the $i$-th example is $x_i = (x_{i,1}, \dots, x_{i,n+2})$ and its label is $y_i$.

To begin, let us describe the first $2m$ examples. For each $i \in U$, we have a positive example $x_i$ and a negative example $x_{m+i}$. We set $x_{i,n+2} = 1$ and $x_{i,n+1} = x_{m+i,n+1} = x_{m+i,n+2} = 0$. Also, for all $i \in \{1, \dots, m\}$ and $j \in \{1, \dots, n\}$, we have $x_{i,j} = x_{m+i,j} = 1$ if $i \in C_j$, and $x_{i,j} = x_{m+i,j} = 0$ otherwise. The last three examples are built as follows. We have $y_{2m+1} = +1$ and $y_{2m+2} = y_{2m+3} = -1$, with $x_{2m+1,n+1} = x_{2m+1,n+2} = 1$, $x_{2m+2,n+1} = 1$ and $x_{2m+3,n+2} = 1$. All other attributes of the last three examples are set to zero. This construction is summarized in Table 5.

Let us now prove that $\mathcal{C}'$ is a solution to the exact 3-cover problem if and only if there is an M-of-N rule classifying the data correctly. We start with the "only if" part. It is straightforward to check that if $\mathcal{C}'$ is a solution to the X3C problem, then the rule "if at least 2 features among the subset $\mathcal{C}' \cup \{n+1, n+2\}$ are set to one, then the example is positive" correctly classifies the data. Now the "if" part. Assume there exists some learnt rule correctly classifying the data. To correctly classify the last three examples, such a rule must necessarily be of the form "if at least 2 features among a subset $\mathcal{C}' \cup \{n+1, n+2\}$ are set to one, then the example is positive". So each positive example must be covered by at least two features of the rule. Let us show that each of the first $m$ (positive) examples is covered by exactly two features. First, note that each example $x_i$ for $i \in \{1, \dots, m\}$ is covered once by feature $n+2$. Assume by contradiction that example $x_i$ is also covered more than once by the first $n$ features. Then the negative example $x_{m+i}$ would be covered at least twice and hence incorrectly classified. Thus, each element of $U$ is covered exactly once by $\mathcal{C}'$, which therefore constitutes an exact 3-set cover. □

[Table 5. Summary of the construction used in the proof of Theorem 9.]
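The reduction used in the proof of Theorem 9 is easy to materialize. The sketch below is ours, following the construction described above; it builds the $2m+3$ examples and $n+2$ features from an X3C instance indexed from 0.

```python
import numpy as np

def x3c_to_dataset(m, subsets):
    """Build the dataset of Theorem 9 from an X3C instance over U = {0,...,m-1},
    where 'subsets' is a list of n three-element subsets of U.
    Returns X of shape (2m+3, n+2) and labels y in {+1, -1}."""
    n = len(subsets)
    X = np.zeros((2 * m + 3, n + 2))
    y = np.empty(2 * m + 3)
    for i in range(m):
        covered = [j for j, Cj in enumerate(subsets) if i in Cj]
        X[i, covered] = 1.0;  X[i, n + 1] = 1.0;  y[i] = +1      # positive example
        X[m + i, covered] = 1.0;                  y[m + i] = -1  # its negative twin
    X[2 * m, n] = X[2 * m, n + 1] = 1.0;  y[2 * m] = +1          # forces threshold 2
    X[2 * m + 1, n] = 1.0;                y[2 * m + 1] = -1
    X[2 * m + 2, n + 1] = 1.0;            y[2 * m + 2] = -1
    return X, y

X, y = x3c_to_dataset(6, [{0, 1, 2}, {3, 4, 5}, {0, 3, 4}])
print(X.shape, y)
```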

Proof of Lemma 2. Let $z$ be a binomial random variable with parameters $p = 1/2$ and $m$, and note that $\sum_{i=1}^m \sigma_i$ has the same distribution as $2z - m$, so $\mathbb{E}[|\sum_{i=1}^m \sigma_i|] = 2\,\mathbb{E}[|z - \mathbb{E}[z]|]$. For even $m$, the mean deviation of $z$ is $\mathrm{MD} = \mathbb{E}[|z - \mathbb{E}[z]|] = \frac{m}{2}\binom{m}{m/2}2^{-m}$. From Stirling's formula we get $\binom{m}{m/2}2^{-m} \ge \frac{1}{2\sqrt{m}}$, and thus $\mathrm{MD} \ge \sqrt{m}/4$. Consequently, for even $m$, $\mathbb{E}[|\sum_{i=1}^m \sigma_i|] \ge \sqrt{m}/2 \ge \sqrt{m/8}$. If $m$ is odd, then by Jensen's inequality $\mathbb{E}[|\sum_{i=1}^m \sigma_i|] \ge \mathbb{E}[|\sum_{i=1}^{m-1}\sigma_i|] \ge \sqrt{m-1}/2 \ge \sqrt{m/8}$ for $m \ge 3$, and the case $m = 1$ is immediate. Let us now derive the upper bound:

$\mathbb{E}\left[\left|\sum_{i=1}^m \sigma_i\right|\right] \le \sqrt{\mathbb{E}\left[\left(\sum_{i=1}^m\sigma_i\right)^2\right]} = \sqrt{\mathrm{Var}\left(\sum_{i=1}^m \sigma_i\right)} = \sqrt{m}.$ □

Proof of Lemma 5. Let $z \in \mathbb{R}^n$ be the random vector defined by $z_i = x_i(c_i - w_i)$. Then we have $\mathbb{E}[z_i] = 0$ and $\mathbb{E}[z_i^2] = x_i^2(w_i - w_i^2) = x_i^2 w_i(1 - w_i) \le x_i^2/4$. Because the variables $z_i$ are independent zero-mean random variables with $|z_i| \le \|x\|_\infty$, we can apply Bernstein's inequality:

$P\left[\left|\sum_{i=1}^n z_i\right| > t\right] \le 2\exp\left(-\frac{t^2/2}{\sum_i \mathbb{E}[z_i^2] + \|x\|_\infty t/3}\right).$

To derive the first bound, we start by noting that $\sum_i \mathbb{E}[z_i^2] \le \|x\|_2^2/4$. Thus, rewriting Bernstein's inequality and bounding it by $\delta$ yields

$2\exp\left(-\frac{t^2/2}{\|x\|_2^2/4 + \|x\|_\infty t/3}\right) \le \delta.$

Taking the logarithm, this last inequality becomes $\frac{t^2}{2} - \frac{\|x\|_\infty\ln(2/\delta)}{3}\,t - \frac{\|x\|_2^2\ln(2/\delta)}{4} \ge 0$, a polynomial inequality of the form $at^2 + bt + c \ge 0$ with $a = 1/2$, $b = -\frac{\|x\|_\infty}{3}\ln(2/\delta)$ and $c = -\frac{\|x\|_2^2}{4}\ln(2/\delta)$. Any $t$ larger than the positive root $t^* = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$ satisfies this inequality, and hence Bernstein's bound. Looking for an upper bound on $t^*$ with a more readable form, we get

$t^* = -b + \sqrt{b^2 - 2c} \le -2b + \sqrt{-2c} = \frac{2}{3}\|x\|_\infty\ln(2/\delta) + \frac{\|x\|_2}{\sqrt{2}}\sqrt{\ln(2/\delta)} \le \frac{2}{3}\|x\|_\infty\ln(2/\delta) + 1.5\,\|x\|_2\sqrt{\ln(2/\delta)},$

where the first inequality holds because $\sqrt{u + v} \le \sqrt{u} + \sqrt{v}$.

Let us now prove the second bound with the exact same technique. Note that $\sum_i \mathbb{E}[z_i^2] \le \sum_i x_i^2 w_i \le \|x\|_\infty^2\|w\|_1$. Plugging this into Bernstein's bound, we again obtain a polynomial inequality $at^2 + bt + c \ge 0$, now with $a = 1/2$, $b = -\frac{\|x\|_\infty}{3}\ln(2/\delta)$ and $c = -\|x\|_\infty^2\|w\|_1\ln(2/\delta)$. Looking for an upper bound on the positive root, we get

$t^* = -b + \sqrt{b^2 - 2c} \le \frac{2}{3}\|x\|_\infty\ln(2/\delta) + \sqrt{2\|w\|_1\|x\|_\infty^2\ln(2/\delta)} \le \frac{2}{3}\|x\|_\infty\ln(2/\delta) + 1.7\,\|x\|_\infty\sqrt{\|w\|_1\ln(2/\delta)},$

where the last inequality holds because $\sqrt{2} \le 1.7$. □

References

P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.

P. Berman and M. Karpinski. On some tighter inapproximability results (extended abstract). In Automata, Languages and Programming, 26th International Colloquium (ICALP). Springer, 1999.

Y. Chevaleyre, F. Koriche, and J.-D. Zucker. Rounding methods for discrete linear classification (extended version). Technical report, HAL, 2013.

U. Feige, M. Karpinski, and M. Langberg. A note on approximating Max-Bisection on regular graphs. Information Processing Letters, 79(4):181-188, 2001.

M. Golea and M. Marchand. Average case analysis of the clipped Hebb rule for nonoverlapping perceptron networks. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT'93). ACM, 1993a.

M. Golea and M. Marchand. On learning perceptrons with binary weights. Neural Computation, 5(5):767-782, 1993b.

S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), 2008.

H. Köhler, S. Diederich, W. Kinzel, and M. Opper. Learning algorithm for a neural network with binary synapses. Zeitschrift für Physik B Condensed Matter, 78:333-342, 1990.

M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581-L586, 1990.

L. Pitt and L. G. Valiant. Computational limitations on learning from examples. Journal of the ACM, 35(4):965-984, 1988.

P. Raghavan and C. D. Thompson. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365-374, 1987.

J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge University Press, 2000.

G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71-101, 1993.

S. Venkatesh. On learning binary weights for majority functions. In Proceedings of the 4th Annual Workshop on Computational Learning Theory (COLT). Morgan Kaufmann, 1991.

D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. Cambridge University Press, 2011.


More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical IEEE TRANSACTIONS ON INFORMATION THEORY Large Alphabet Source Coding using Independent Coponent Analysis Aichai Painsky, Meber, IEEE, Saharon Rosset and Meir Feder, Fellow, IEEE arxiv:67.7v [cs.it] Jul

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing

More information

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup) Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:

More information

On the Use of A Priori Information for Sparse Signal Approximations

On the Use of A Priori Information for Sparse Signal Approximations ITS TECHNICAL REPORT NO. 3/4 On the Use of A Priori Inforation for Sparse Signal Approxiations Oscar Divorra Escoda, Lorenzo Granai and Pierre Vandergheynst Signal Processing Institute ITS) Ecole Polytechnique

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Complex Quadratic Optimization and Semidefinite Programming

Complex Quadratic Optimization and Semidefinite Programming Coplex Quadratic Optiization and Seidefinite Prograing Shuzhong Zhang Yongwei Huang August 4 Abstract In this paper we study the approxiation algoriths for a class of discrete quadratic optiization probles

More information

time time δ jobs jobs

time time δ jobs jobs Approxiating Total Flow Tie on Parallel Machines Stefano Leonardi Danny Raz y Abstract We consider the proble of optiizing the total ow tie of a strea of jobs that are released over tie in a ultiprocessor

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs International Cobinatorics Volue 2011, Article ID 872703, 9 pages doi:10.1155/2011/872703 Research Article On the Isolated Vertices and Connectivity in Rando Intersection Graphs Yilun Shang Institute for

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information