PAC-Bayesian Learning of Linear Classifiers


Pascal Germain, Alexandre Lacasse, François Laviolette, Mario Marchand
Département d'informatique et de génie logiciel, Université Laval, Québec, Canada, G1V 0A6

Abstract

We present a general PAC-Bayes theorem from which all known PAC-Bayes risk bounds are obtained as particular cases. We also propose different learning algorithms for finding linear classifiers that minimize these bounds. These learning algorithms are generally competitive with both AdaBoost and the SVM.

1. Introduction

For the classification problem, we are given a training set of examples, each generated according to the same but unknown distribution D, and the goal is to find a classifier that minimizes the true risk, i.e., the generalization error or the expected loss. Since the true risk is defined only with respect to the unknown distribution D, we are automatically confronted with the problem of specifying exactly what we should optimize on the training data to find a classifier having the smallest possible true risk. Many different specifications of what should be optimized on the training data have been provided by using different inductive principles, but the final guarantee on the true risk always comes with a so-called risk bound that holds uniformly over a set of classifiers. Hence, the formal justification of a learning strategy has always come a posteriori via a risk bound. Since a risk bound can be computed from what a classifier achieves on the training data, it automatically suggests the following optimization problem for learning algorithms: given a risk upper bound, find a classifier that minimizes it.

Despite the enormous impact they had on our understanding of learning, the VC bounds are generally very loose. These bounds are characterized by the fact that their data-dependencies come only through the training error of the classifiers. The fact that there also exist VC lower bounds that are asymptotically identical to the corresponding upper bounds suggests that significantly tighter bounds can only come through extra data-dependent properties, such as the distribution of margins achieved by a classifier on the training data. Among the data-dependent bounds that have been proposed recently, the PAC-Bayes bounds (McAllester, 2003; Seeger, 2002; Langford, 2005; Catoni, 2007) seem to be especially tight. These bounds thus appear to be a good starting point for the design of a bound-minimizing algorithm.

In this paper, we present a general PAC-Bayes theorem and show that all known PAC-Bayes bounds are corollaries of this general theorem. When spherical Gaussians, over the space of linear classifiers, are used for priors and posteriors, we show that the Gibbs classifier that minimizes any of the above-mentioned PAC-Bayes risk bounds is obtained from the linear classifier that minimizes a non-convex objective function. We also propose two different learning algorithms for finding linear classifiers that minimize PAC-Bayes risk bounds, and a third algorithm that uses cross-validation to determine the value of a parameter which is present in the risk bound of Catoni (2007). The first algorithm uses a non-informative prior to construct a classifier from all the training data. The second algorithm uses a fraction of the training set to construct an informative prior that is used to learn the final linear classifier on the remaining fraction of the training data.
The third algorithm is, like the first one, based on a non-informative prior, but uses the cross-validation methodology to choose one of the bound's parameters. The idea of using a fraction of the training data to construct a prior has been proposed in Ambroladze et al. (2006) for the problem of choosing the hyperparameter values of the SVM. In contrast, the priors are used here to directly minimize a PAC-Bayes bound.

Our extensive experiments indicate that the second and third algorithms are competitive with both AdaBoost and the SVM, and are generally much more effective than the first algorithm in their ability at producing classifiers with small true risk.

2. Simplified PAC-Bayesian Theory

We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space Y = {-1, +1}. An example is an input-output pair (x, y) where x ∈ X and y ∈ Y. Throughout the paper, we adopt the PAC setting where each example (x, y) is drawn according to a fixed, but unknown, distribution D on X × Y.

The risk R(h) of any classifier h: X → Y is defined as the probability that h misclassifies an example drawn according to D. Given a training set S of m examples, the empirical risk R_S(h) of any classifier h is defined by the frequency of training errors of h on S. Hence

    R(h) \overset{def}{=} \mathbf{E}_{(x,y)\sim D}\, I\big(h(x) \neq y\big), \qquad
    R_S(h) \overset{def}{=} \frac{1}{m}\sum_{i=1}^{m} I\big(h(x_i) \neq y_i\big),

where I(a) = 1 if predicate a is true and 0 otherwise.

After observing the training set S, the task of the learner is to choose a posterior distribution Q over a space H of classifiers such that the Q-weighted majority vote classifier B_Q will have the smallest possible risk. On any input example x, the output B_Q(x) of the majority vote classifier B_Q (sometimes called the Bayes classifier) is given by

    B_Q(x) \overset{def}{=} \operatorname{sgn}\Big[ \mathbf{E}_{h\sim Q}\, h(x) \Big],

where sgn(s) = +1 if s > 0 and sgn(s) = -1 otherwise.

The output of the deterministic majority vote classifier B_Q is closely related to the output of a stochastic classifier called the Gibbs classifier G_Q. To classify an input example x, the Gibbs classifier G_Q chooses randomly a (deterministic) classifier h according to Q to classify x. The true risk R(G_Q) and the empirical risk R_S(G_Q) of the Gibbs classifier are thus given by

    R(G_Q) = \mathbf{E}_{h\sim Q}\, R(h); \qquad R_S(G_Q) = \mathbf{E}_{h\sim Q}\, R_S(h).

Any bound for R(G_Q) can straightforwardly be turned into a bound for the risk of the majority vote R(B_Q). Indeed, whenever B_Q misclassifies x, at least half of the classifiers (under measure Q) misclassify x. It follows that the error rate of G_Q is at least half the error rate of B_Q. Hence R(B_Q) ≤ 2 R(G_Q). As shown in Langford and Shawe-Taylor (2003), this factor of 2 can sometimes be reduced to (1 + ε).

The following theorem gives both an upper and a lower bound on R(G_Q) by upper-bounding D(R_S(G_Q), R(G_Q)) for any convex function D: [0,1] × [0,1] → R.

Theorem 2.1. For any distribution D, for any set H of classifiers, for any prior distribution P of support H, for any δ ∈ (0,1], and for any convex function D: [0,1] × [0,1] → R, we have

    \Pr_{S\sim D^m}\Bigg( \forall Q \text{ on } H:\;
      D\big(R_S(G_Q), R(G_Q)\big) \le
      \frac{1}{m}\bigg[ \mathrm{KL}(Q\|P)
        + \ln\Big( \frac{1}{\delta}\,\mathbf{E}_{S\sim D^m}\,\mathbf{E}_{h\sim P}\,
          e^{\,m\,D(R_S(h),\,R(h))} \Big) \bigg] \Bigg) \ge 1 - \delta,

where KL(Q‖P) \overset{def}{=} \mathbf{E}_{h\sim Q} \ln\frac{Q(h)}{P(h)} is the Kullback-Leibler divergence between Q and P.

Proof. Since \mathbf{E}_{h\sim P}\, e^{\,m\,D(R_S(h),\,R(h))} is a non-negative random variable, Markov's inequality gives

    \Pr_{S\sim D^m}\Big( \mathbf{E}_{h\sim P}\, e^{\,m\,D(R_S(h),\,R(h))}
      \le \frac{1}{\delta}\,\mathbf{E}_{S\sim D^m}\,\mathbf{E}_{h\sim P}\, e^{\,m\,D(R_S(h),\,R(h))} \Big) \ge 1 - \delta.

Hence, by taking the logarithm on each side of the innermost inequality and by transforming the expectation over P into an expectation over Q, we obtain

    \Pr_{S\sim D^m}\Bigg( \forall Q:\;
      \ln \mathbf{E}_{h\sim Q}\Big[ \frac{P(h)}{Q(h)}\, e^{\,m\,D(R_S(h),\,R(h))} \Big]
      \le \ln\Big( \frac{1}{\delta}\,\mathbf{E}_{S\sim D^m}\,\mathbf{E}_{h\sim P}\,
        e^{\,m\,D(R_S(h),\,R(h))} \Big) \Bigg) \ge 1 - \delta.

The theorem then follows from two applications of Jensen's inequality: one exploiting the concavity of ln(x) and the second the convexity of D.

Theorem 2.1 provides a tool to derive PAC-Bayesian risk bounds. Each such bound is obtained by using a particular convex function D: [0,1] × [0,1] → R and by upper-bounding \mathbf{E}_{S\sim D^m}\mathbf{E}_{h\sim P}\, e^{\,m\,D(R_S(h),\,R(h))}. For example, a slightly tighter PAC-Bayes bound than the one derived by Seeger (2002) and Langford (2005) can be obtained from Theorem 2.1 by using D(q, p) = kl(q, p), where

    \mathrm{kl}(q, p) \overset{def}{=} q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}.
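
For concreteness, here is a minimal sketch (ours, not code from the paper; the function name and the clamping constant are our choices) of the divergence kl(q, p) between two Bernoulli distributions, which is the convex function D used in the next corollary:

```python
import math

def binary_kl(q, p):
    """kl(q, p) = q ln(q/p) + (1 - q) ln((1 - q)/(1 - p)),
    with the usual convention 0 ln 0 = 0."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)   # keep p away from 0 and 1
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

# Example: divergence between an empirical risk of 0.1 and a candidate true risk of 0.3.
print(binary_kl(0.1, 0.3))
```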

Corollary 2.1. For any distribution D, for any set H of classifiers, for any distribution P of support H, and for any δ ∈ (0,1], we have

    \Pr_{S\sim D^m}\Big( \forall Q \text{ on } H:\;
      \mathrm{kl}\big(R_S(G_Q),\, R(G_Q)\big) \le
      \frac{1}{m}\Big[ \mathrm{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta} \Big] \Big) \ge 1 - \delta,

where

    \xi(m) \overset{def}{=} \sum_{k=0}^{m} \binom{m}{k} \big(k/m\big)^{k} \big(1 - k/m\big)^{m-k}.

Proof. The corollary immediately follows from Theorem 2.1 by choosing D(q, p) = kl(q, p). Indeed, in that case we have

    \mathbf{E}_{S\sim D^m}\,\mathbf{E}_{h\sim P}\, e^{\,m\,\mathrm{kl}(R_S(h),\,R(h))}
    = \mathbf{E}_{h\sim P}\,\mathbf{E}_{S\sim D^m}
        \left(\frac{R_S(h)}{R(h)}\right)^{m R_S(h)}
        \left(\frac{1 - R_S(h)}{1 - R(h)}\right)^{m(1 - R_S(h))}
    = \mathbf{E}_{h\sim P} \sum_{k=0}^{m}
        \Pr_{S\sim D^m}\!\big(R_S(h) = \tfrac{k}{m}\big)
        \left(\frac{k/m}{R(h)}\right)^{k}
        \left(\frac{1 - k/m}{1 - R(h)}\right)^{m-k}
    = \mathbf{E}_{h\sim P} \sum_{k=0}^{m} \binom{m}{k} \big(k/m\big)^{k} \big(1 - k/m\big)^{m-k}
    = \xi(m),

where the last equality arises from the fact that m R_S(h) is a binomial random variable of mean m R(h). See Banerjee (2006) for a very similar proof.

Note also that we retrieve the exact formulation of the PAC-Bayes bound of Langford (2005) if we upper-bound ξ(m) by m + 1. However, ξ(m) ∈ Θ(√m). The PAC-Bayes bound of McAllester (2003) can be obtained by using D(q, p) = 2(q - p)².

Let us now consider functions that are linear in the empirical risk, i.e., functions of the form D(q, p) = F(p) - C·q for convex F. As the next corollary shows, this choice for D gives a PAC-Bayes bound whose minimum is obtained for Gibbs classifiers minimizing a simple linear combination of R_S(G_Q) and KL(Q‖P). The next corollary has also been found by Catoni (2007).

Corollary 2.2. For any distribution D, any set H of classifiers, any distribution P of support H, any δ ∈ (0,1], and any positive real number C, we have

    \Pr_{S\sim D^m}\Bigg( \forall Q \text{ on } H:\;
      R(G_Q) \le \frac{1}{1 - e^{-C}}
      \bigg\{ 1 - \exp\Big( -\Big[ C\,R_S(G_Q)
        + \frac{1}{m}\Big( \mathrm{KL}(Q\|P) + \ln\tfrac{1}{\delta} \Big) \Big] \Big) \bigg\} \Bigg) \ge 1 - \delta.

Proof. Put D(q, p) = F(p) - C·q for some function F to be defined. Then

    \mathbf{E}_{S\sim D^m}\,\mathbf{E}_{h\sim P}\, e^{\,m\,D(R_S(h),\,R(h))}
    = \mathbf{E}_{h\sim P}\, e^{\,m F(R(h))}\, \mathbf{E}_{S\sim D^m}\, e^{-C m R_S(h)}
    = \mathbf{E}_{h\sim P}\, e^{\,m F(R(h))} \sum_{k=0}^{m}
        \Pr_{S\sim D^m}\!\big(R_S(h) = \tfrac{k}{m}\big)\, e^{-Ck}
    = \mathbf{E}_{h\sim P}\, e^{\,m F(R(h))} \sum_{k=0}^{m}
        \binom{m}{k} R(h)^{k} \big(1 - R(h)\big)^{m-k} e^{-Ck}
    = \mathbf{E}_{h\sim P}\, e^{\,m F(R(h))} \big( R(h)\, e^{-C} + 1 - R(h) \big)^{m},

and the result follows easily from Theorem 2.1 when F is the convex function F(p) = \ln\big[ 1 / \big(1 - p\,(1 - e^{-C})\big) \big].

It is interesting to compare the bounds of Corollaries 2.1 and 2.2. A nice property of the bound of Corollary 2.2 is the fact that its minimization is obtained from the Gibbs classifier G_Q that minimizes C m R_S(G_Q) + KL(Q‖P). As we will see, this minimization problem is closely related to the one solved by the SVM when Q is an isotropic Gaussian over the space of linear classifiers. Minimizing the bound given by Corollary 2.1 does not appear to be as simple, because the upper bound on R(G_Q) is not an explicit function of R_S(G_Q) and KL(Q‖P). However, this upper bound does not depend on an arbitrary constant such as C in Corollary 2.2, which gives a computational advantage to Corollary 2.1, since several bound minimizations (one for each value of C) would be needed in the case of Corollary 2.2. The tightness of these bounds can be compared with the following proposition.

Proposition 2.1. For any 0 ≤ R_S ≤ R < 1, we have

    \max_{C \ge 0}\Big\{ -\ln\big[ 1 - R\,(1 - e^{-C}) \big] - C\,R_S \Big\} = \mathrm{kl}(R_S, R).

Consequently, if we omit the ln ξ(m) term, Corollary 2.1 always gives a bound which is tighter than or equal to the one given by Corollary 2.2. On the other hand, there always exist values of C for which Corollary 2.2 gives a tighter bound than Corollary 2.1.

The next lemma shows that the bound of Corollary 2.2 has the interesting property of having an analytical expression of the optimal posterior Q* for every prior P.

Lemma 2.1. For any set H of classifiers, any prior P of support H, and any positive real number C, the posterior Q* that minimizes the upper bound on R(G_Q) of Corollary 2.2 has a density given by the following Boltzmann distribution:

    Q^*(h) = \frac{1}{Z}\, P(h)\, e^{-C\,m\,R_S(h)},

where m denotes the number of training examples in S and Z is a normalizing constant.
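
For concreteness, the following sketch (ours, not code from the paper; the function names, the bisection depth and the example values are arbitrary) evaluates both bounds from the empirical Gibbs risk, the KL divergence, m and δ: Corollary 2.1 by numerically inverting kl(q, ·), with ξ(m) computed in log space, and Corollary 2.2 in closed form for a fixed C.

```python
import math

def binary_kl(q, p):
    """kl(q, p) for Bernoulli parameters q and p (same function as above)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    out = 0.0
    if q > 0.0:
        out += q * math.log(q / p)
    if q < 1.0:
        out += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return out

def log_xi(m):
    """ln xi(m), with xi(m) = sum_{k=0}^m C(m,k) (k/m)^k (1 - k/m)^(m-k)."""
    logs = []
    for k in range(m + 1):
        term = math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
        if k > 0:
            term += k * math.log(k / m)
        if k < m:
            term += (m - k) * math.log((m - k) / m)
        logs.append(term)
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))

def bound_corollary_21(risk_s, kl_qp, m, delta=0.05):
    """Upper bound of Corollary 2.1: sup { eps : kl(risk_s, eps) <= rhs }."""
    rhs = (kl_qp + log_xi(m) - math.log(delta)) / m
    lo, hi = risk_s, 1.0 - 1e-12
    for _ in range(100):              # bisection; eps -> kl(risk_s, eps) is increasing
        mid = 0.5 * (lo + hi)
        if binary_kl(risk_s, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

def bound_corollary_22(risk_s, kl_qp, m, C, delta=0.05):
    """Catoni-style bound of Corollary 2.2 for a fixed trade-off constant C."""
    inside = C * risk_s + (kl_qp + math.log(1.0 / delta)) / m
    return (1.0 - math.exp(-inside)) / (1.0 - math.exp(-C))

# Example with arbitrary values of the empirical Gibbs risk and KL divergence.
print(bound_corollary_21(0.1, 5.0, 1000))
print(bound_corollary_22(0.1, 5.0, 1000, C=1.0))
```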

Proof. We present here a proof for the case where H is countable, but the result also holds for the continuous case. For any fixed C, m, and P, the distribution Q minimizing the bound of Corollary 2.2 is the same as the one minimizing B(Q), where

    B(Q) \overset{def}{=} C\,m \sum_{h\in H} Q(h)\,R_S(h) + \mathrm{KL}(Q\|P),

under the constraint \sum_{h\in H} Q(h) = 1. At optimality, Q must satisfy the Lagrange constraints, namely that there exists λ ∈ R such that for any h ∈ H we have

    \lambda = \frac{\partial B}{\partial Q(h)} = C\,m\,R_S(h) + 1 + \ln\frac{Q(h)}{P(h)}.

Consequently,

    Q^*(h) = P(h)\, e^{\lambda - 1 - C\,m\,R_S(h)} = \frac{1}{Z}\, P(h)\, e^{-C\,m\,R_S(h)},

where Z is a normalizing constant.

It is well known that Bayes classifiers resulting from a Boltzmann distribution can only be expressed via integral formulations. Such integrals can be approximated by some Markov chain Monte Carlo sampling but, since the mixing time is unknown, we have no real control on the precision of the approximation. For this reason, we restrict ourselves here to the case where the posterior Q is chosen from a parameterized set of distributions. Building on the previous work of Langford and Shawe-Taylor (2003) and Langford (2005), we will focus on isotropic Gaussian distributions of linear classifiers since, in this case, we have an exact analytical expression for B_Q, G_Q, R_S(B_Q), R_S(G_Q), and KL(Q‖P) in terms of the parameters of the posterior Q. These analytic expressions will enable us to perform our computations without performing any Monte Carlo sampling.

3. Specialization to Linear Classifiers

Let us apply Corollaries 2.1 and 2.2 to linear classifiers that are defined over a space of features. Here we suppose that each x ∈ X is mapped to a feature vector φ(x) = (φ_1(x), φ_2(x), ...), where each φ_i is given explicitly as a real-valued function or given implicitly by using a Mercer kernel k: X × X → R. In the latter case, we have k(x, x') = φ(x)·φ(x') for all (x, x') ∈ X × X. Each linear classifier h_w is identified by a real-valued weight vector w. The output h_w(x) of h_w on any x ∈ X is given by

    h_w(x) = \operatorname{sgn}\big( w\cdot\phi(x) \big).

The task of the learner is to produce a posterior Q over the set of all possible weight vectors. If each possible feature vector φ has N components, the set of all possible weight vectors is R^N. Let Q(v) denote the posterior density evaluated at weight vector v. We restrict ourselves to the case where the learner is going to produce a posterior Q_w, parameterized by a chosen weight vector w, such that for any weight vectors v and u we have Q_w(v) = Q_w(u) whenever ‖v - w‖ = ‖u - w‖. Posteriors Q_w satisfying this property are said to be symmetric about w. It can easily be shown that, for any Q_w symmetric about w and for any feature vector φ,

    \operatorname{sgn}\Big( \mathbf{E}_{v\sim Q_w} \operatorname{sgn}\big( v\cdot\phi \big) \Big)
    = \operatorname{sgn}\big( w\cdot\phi \big).        (1)

In other words, for any input example, the output of the majority vote classifier B_{Q_w}, given by the left-hand side of Equation (1), is the same as the one given by the linear classifier h_w whenever Q_w is symmetric about w. Consequently, R(h_w) = R(B_{Q_w}) ≤ 2 R(G_{Q_w}), so that Corollaries 2.1 and 2.2 provide upper bounds on R(h_w) for these posteriors.

Building on the previous work of Langford and Shawe-Taylor (2003) and Langford (2005), we choose both the prior P_{w_p} and the posterior Q_w to be spherical Gaussians with identity covariance matrix, respectively centered on w_p and on w. Hence, for any weight vector v ∈ R^N:

    Q_w(v) = \left( \frac{1}{\sqrt{2\pi}} \right)^{N} \exp\Big( -\tfrac{1}{2}\, \|v - w\|^2 \Big).

Thus, the posterior is parameterized by a weight vector w that will be chosen by the learner based on the values of R_S(G_{Q_w}) and KL(Q_w‖P_{w_p}). Here, the weight vector w_p that parameterizes the prior P_{w_p} represents prior knowledge that the learner might have about the classification task (i.e., about a good direction for linear separators).
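
As a quick numerical illustration of Equation (1) (ours, not from the paper; the dimension, the number of sampled classifiers and the random seed are arbitrary), one can draw weight vectors from the spherical Gaussian posterior Q_w and check that the resulting majority vote agrees with the linear classifier h_w:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20                                   # feature-space dimension (arbitrary)
w = rng.normal(size=N)                   # the weight vector parameterizing the posterior
phi = rng.normal(size=N)                 # feature vector phi(x) of some input example

# Draw classifiers v ~ Q_w = N(w, I) and take the Q_w-weighted majority vote on phi(x).
V = w + rng.normal(size=(100_000, N))    # each row is one sampled linear classifier
majority_vote = np.sign(np.mean(np.sign(V @ phi)))

# Equation (1): the majority vote of a posterior symmetric about w is h_w itself.
print(majority_vote, np.sign(w @ phi))   # the two signs agree
```
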
We therefore have w_p = 0 in the absence of prior knowledge, so that P_0 is the non-informative prior. Alternatively, we might set aside a subset S' of the training data S and choose w_p such that R_{S'}(G_{P_{w_p}}) is small. By performing simple Gaussian integrals, as in Langford (2005), we find

    \mathrm{KL}(Q_w\|P_{w_p}) = \tfrac{1}{2}\,\|w - w_p\|^2, \qquad
    R(G_{Q_w}) = \mathbf{E}_{(x,y)\sim D}\, \Phi\big(\|w\|\,\Gamma_w(x,y)\big), \qquad
    R_S(G_{Q_w}) = \frac{1}{m}\sum_{i=1}^{m} \Phi\big(\|w\|\,\Gamma_w(x_i,y_i)\big),        (2)

where Γ_w(x, y) denotes the normalized margin of w on (x, y), i.e.,

    \Gamma_w(x,y) \overset{def}{=} \frac{y\, w\cdot\phi(x)}{\|w\|\,\|\phi(x)\|},

and where Φ(a) denotes the probability that X > a when X is a N(0,1) random variable, i.e.,

    \Phi(a) \overset{def}{=} \frac{1}{\sqrt{2\pi}} \int_{a}^{\infty} \exp\big(-x^2/2\big)\, dx.

3.1. Two Objective Functions to Minimize

By using the above expressions for R_S(G_{Q_w}) and KL(Q_w‖P_{w_p}), Corollaries 2.1 and 2.2 both provide upper bounds on R(G_{Q_w}) and on R(h_w), since R(h_w) ≤ 2 R(G_{Q_w}). Hence, each bound depends on the same quantities: the empirical risk measure R_S(G_{Q_w}), and KL(Q_w‖P_{w_p}), which acts as a regularizer.

Minimizing the upper bound given by Corollary 2.1, in the case of linear classifiers, amounts to finding w that minimizes the following objective function:

    B(S, w, \delta) \overset{def}{=} \sup\Big\{ \epsilon :\;
      \mathrm{kl}\big(R_S(G_{Q_w}),\, \epsilon\big) \le
      \frac{1}{m}\Big[ \mathrm{KL}(Q_w\|P_{w_p}) + \ln\frac{\xi(m)}{\delta} \Big] \Big\},        (F2.1)

for a fixed value of the confidence parameter δ (say δ = 0.05). Consequently, our problem is to find the weight vector w that minimizes B subject to the constraints

    \mathrm{kl}\big(R_S(G_{Q_w}),\, B\big) = \frac{1}{m}\Big[ \mathrm{KL}(Q_w\|P_{w_p}) + \ln\frac{\xi(m)}{\delta} \Big]        (3)

    B > R_S(G_{Q_w}).        (4)

Minimizing the bound of Corollary 2.2, in the case of linear classifiers, amounts to finding w that minimizes the simple objective function

    C\,m\,R_S(G_{Q_w}) + \mathrm{KL}(Q_w\|P_{w_p})
    = C \sum_{i=1}^{m} \Phi\!\left( \frac{y_i\, w\cdot\phi(x_i)}{\|\phi(x_i)\|} \right)
      + \frac{1}{2}\,\|w - w_p\|^2,        (F2.2)

for some fixed choice of C and w_p. In the absence of prior knowledge, w_p = 0 and the regularizer becomes identical to the one used by the SVM. Indeed, the learning strategy used by the soft-margin SVM consists in finding w that minimizes

    C \sum_{i=1}^{m} \max\big( 0,\; 1 - y_i\, w\cdot\phi(x_i) \big) + \frac{1}{2}\,\|w\|^2,

for some fixed choice of C. Thus, for w_p = 0, both learning strategies are identical except for the fact that the convex SVM hinge loss, max(0, 1 - t), is replaced by the non-convex probit loss, Φ(t). Hence, the objective function minimized by the soft-margin SVM is a convex relaxation of objective function F2.2. Each learning strategy has its potential drawback: the single local minimum of the soft-margin SVM solution might be suboptimal, whereas the non-convex PAC-Bayes bound might present several local minima.

Observe that B, the objective function F2.1, is defined only implicitly in terms of w via the constraints given by Equations (3) and (4). This optimization problem appears to be more involved than the unconstrained optimization of objective function F2.2 that arises from Corollary 2.2. However, it also appears to be more relevant since, according to Proposition 2.1, the upper bound given by Corollary 2.1 is somewhat tighter than the one given by Corollary 2.2, apart from the presence of a ln ξ(m) term. The optimization of objective function F2.1 also has the advantage of not depending on any constant C like the one present in objective function F2.2.

3.2. Gradient Descent of the PAC-Bayes Bound

We are now concerned with the problem of minimizing the non-convex objective functions F2.1 (for fixed w_p) and F2.2 (for fixed C and w_p). As a first approach, it makes sense to minimize these objective functions by gradient descent. More specifically, we have used the Polak-Ribière conjugate gradient descent algorithm implemented in the GNU Scientific Library (GSL).

The gradient with respect to w of objective function F2.1 is obtained by computing the partial derivative of both sides of Equation (3) with respect to w_j (the jth component of w). After solving for ∂B/∂w_j, we find that the gradient is given by

    \nabla_w B = \frac{B\,(1-B)}{B - R_S(G_{Q_w})}
      \left[ \frac{1}{m}\,(w - w_p)
        + \ln\!\left( \frac{B\,\big(1 - R_S(G_{Q_w})\big)}{R_S(G_{Q_w})\,(1-B)} \right)
          \cdot \frac{1}{m}\sum_{i=1}^{m}
            \Phi'\!\left( \frac{y_i\, w\cdot\phi(x_i)}{\|\phi(x_i)\|} \right)
            \frac{y_i\,\phi(x_i)}{\|\phi(x_i)\|} \right],        (5)

where Φ'(t) denotes the first derivative of Φ evaluated at t. We have observed that objective function F2.1 tends to have only one local minimum, even if it is not convex. We have therefore used a single gradient descent run to minimize F2.1. The gradient of objective function F2.2 is

    C \sum_{i=1}^{m} \Phi'\!\left( \frac{y_i\, w\cdot\phi(x_i)}{\|\phi(x_i)\|} \right)
      \frac{y_i\,\phi(x_i)}{\|\phi(x_i)\|}
    + (w - w_p).        (6)
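
Before turning to the restart strategy discussed next, here is a small illustration of this minimization (ours, not the authors' code: the paper minimizes these objectives with the Polak-Ribière conjugate-gradient routine of GSL, whereas this sketch applies plain fixed-step gradient descent to F2.2 with an arbitrary step size, toy data, and the non-informative prior w_p = 0):

```python
import numpy as np
from scipy.special import erfc

def probit_loss(t):
    """Phi(t): tail probability of a standard Gaussian, Pr(X > t)."""
    return 0.5 * erfc(t / np.sqrt(2.0))

def objective_F22(w, X, y, C, w_prior):
    """F2.2 = C * sum_i Phi(y_i w.x_i / ||x_i||) + ||w - w_p||^2 / 2."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return C * np.sum(probit_loss(margins)) + 0.5 * np.sum((w - w_prior) ** 2)

def gradient_F22(w, X, y, C, w_prior):
    """Equation (6): gradient of F2.2 with respect to w."""
    norms = np.linalg.norm(X, axis=1)
    margins = y * (X @ w) / norms
    d_phi = -np.exp(-margins ** 2 / 2.0) / np.sqrt(2.0 * np.pi)   # Phi'(t)
    grad_risk = (X * (d_phi * y / norms)[:, None]).sum(axis=0)
    return C * grad_risk + (w - w_prior)

# Toy data (rows of X play the role of the feature vectors phi(x_i)).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))
w, w_prior, C, step = np.zeros(5), np.zeros(5), 1.0, 0.01
for _ in range(1000):
    w -= step * gradient_F22(w, X, y, C, w_prior)
print(objective_F22(w, X, y, C, w_prior))
```
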
Since this objective function might have several local minima, especially for large values of C, each minimization of objective function F2.2 consisted of k different gradient-descent runs, where each run was initiated from a new, randomly-chosen, starting position. In the results presented here, we have used k = 10 for the smaller values of C and k = 100 for the larger ones.

4. Proposed Learning Algorithms

We propose three algorithms that can be used either with the primal variables (i.e., the components of w) or the dual variables α_1, ..., α_m that appear in the linear expansion w = Σ_{i=1}^m α_i y_i φ(x_i). In this latter case, the features are implicitly given by a Mercer kernel k(x, x') = φ(x)·φ(x') for all (x, x') ∈ X × X. The objective functions F2.1 and F2.2, with their gradients (Equations 5 and 6), can then straightforwardly be expressed in terms of k and the dual variables (this is true if w_p can be expanded in terms of examples that do not belong to the training set).

The first algorithm, called PBGD1, uses the prior P_0 (i.e., with w_p = 0) to learn a posterior Q_w by minimizing the bound value of Corollary 2.1 (objective function F2.1). In this paper, every bound computation has been performed with δ = 0.05.

The second algorithm, called PBGD2, was studied to investigate whether it is worthwhile to use a fraction x of the training data to construct an informative prior P_{w_p}, for some w_p ≠ 0, that will be used to learn a posterior Q_w on the remaining (1 - x) fraction of the training data. In its first stage, PBGD2 minimizes the objective function F2.2 by using a fraction x of the training data to construct one posterior for each value of C ∈ {10^k : k = 0, ..., 6}. Note that a large value for C attempts to generate a w for which the training error of G_w is small. Each posterior is constructed with the same non-informative prior used for PBGD1 (i.e., with w_p = 0). Then, each of these seven posteriors is used as a prior P_{w_p}, with w_p ≠ 0, for learning a posterior Q_w by minimizing the objective function F2.1 on the remaining (1 - x) fraction of the training data. From the union bound argument, the δ in Corollary 2.1 needs to be replaced by δ/7 to get a bound that holds uniformly for these seven priors. Empirically, we have observed that the best fraction x used for constructing the prior was 1/2. Hence, we report here only the results for x = 1/2.

For the third algorithm, called PBGD3, we always used the prior P_0 to minimize the objective function F2.2. But, instead of using the solution obtained for the value of C that gave the smallest bound of Corollary 2.2, we performed 10-fold cross-validation on the training set to find the best value for C and then used that value of C to find the classifier that minimizes objective function F2.2. Hence, PBGD3 follows the same cross-validation learning methodology normally employed with the SVM, but uses the probit loss instead of the hinge loss. To compute the risk bound for the linear classifier returned by PBGD3 and the other comparison algorithms (AdaBoost and the SVM), we performed a line search, along the direction of the weight vector w of the returned classifier, to find the norm ‖w‖ that minimizes the bound of Corollary 2.1 (this is justified by the fact that the bound holds uniformly for all weight vectors w). For each bound computation, we used the non-informative prior P_0.

4.1. PBGD with Respect to Primal Variables

For the sake of comparison, all learning algorithms of this subsection produce a linear classifier h_w on the set of basis functions {φ_1, φ_2, ...} known as decision stumps. Each decision stump φ_i is a threshold classifier that depends on a single attribute: its output is +b if the tested attribute exceeds a threshold value t, and -b otherwise, where b ∈ {-1, +1}. For each attribute, at most ten equally-spaced possible values for t were determined a priori.
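
As an illustration of this feature map (ours, not the authors' code; the placement of the thresholds over each attribute's observed range and the toy matrix are our choices), the following sketch expands a data matrix into the corresponding decision-stump outputs:

```python
import numpy as np

def decision_stump_features(X, n_thresholds=10):
    """Map each example to the outputs of all decision stumps.

    For every attribute j and every threshold t (here equally spaced over the
    observed range of the attribute), emit +1 if X[:, j] > t else -1, together
    with the negated stump (b = -1).  Returns an array of shape
    (n_examples, n_attributes * n_thresholds * 2).
    """
    columns = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        thresholds = np.linspace(lo, hi, n_thresholds + 2)[1:-1]   # interior values only
        for t in thresholds:
            stump = np.where(X[:, j] > t, 1.0, -1.0)
            columns.append(stump)      # b = +1
            columns.append(-stump)     # b = -1
    return np.column_stack(columns)

X = np.random.default_rng(2).normal(size=(6, 3))
print(decision_stump_features(X).shape)   # (6, 60)
```
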
We have compared the three PBGD algorithms to AdaBoost (Schapire et al., 1998) because the latter is a standard and efficient algorithm when used with decision stumps. Since AdaBoost is an algorithm that minimizes the exponential risk Σ_{i=1}^m exp(-y_i w·φ(x_i)), it never chooses a w for which there exists a training example where -y_i w·φ(x_i) is very large. This is to be contrasted with the PBGD algorithms, for which the empirical risk R_S(G_{Q_w}) has the sigmoidal shape of Equation (2) and never exceeds one. We thus anticipate that AdaBoost and the PBGD algorithms will select different weight vectors w on many data sets.

The results obtained for all three algorithms are summarized in Table 1. Except for MNIST, all data sets were taken from the UCI repository. Each data set was randomly split into a training set S of |S| examples and a testing set T of |T| examples. The number n of attributes for each data set is also specified. For AdaBoost, the number of boosting rounds was fixed to 200. For all algorithms, R_T(h_w) refers to the frequency of errors, measured on the testing set T, of the linear classifier h_w returned by the learner. For the PBGD algorithms, G_T(w), defined as R_T(G_{Q_w}), refers to the empirical risk on T of the Gibbs classifier. The "Bnd" columns refer to the PAC-Bayes bound of Corollary 2.1, computed on the training set. All bounds hold with confidence 1 - δ = 0.95. For PBGD1, PBGD3 and AdaBoost, the bound is computed on all the training data with the non-informative prior P_0.
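
The line search over the norm of w described in Section 4 can be sketched as follows. This is our illustration, not the paper's code: it assumes that probit_loss and bound_corollary_21 from the earlier sketches are in scope, and the grid of candidate norms is arbitrary.

```python
import numpy as np

def gibbs_empirical_risk(w, X, y):
    """R_S(G_{Q_w}) for the spherical Gaussian posterior centred on w."""
    margins = y * (X @ w) / np.linalg.norm(X, axis=1)
    return float(np.mean(probit_loss(margins)))

def bound_after_line_search(w_learned, X, y, delta=0.05):
    """Keep the direction of w_learned fixed, scan its norm, and return the
    smallest value of the Corollary 2.1 bound (non-informative prior P_0)."""
    direction = w_learned / np.linalg.norm(w_learned)
    best = 1.0
    for norm in np.linspace(0.1, 100.0, 200):   # arbitrary grid of candidate norms
        risk_s = gibbs_empirical_risk(norm * direction, X, y)
        kl_qp = 0.5 * norm ** 2                 # KL(Q_w || P_0) = ||w||^2 / 2
        best = min(best, bound_corollary_21(risk_s, kl_qp, len(y), delta))
    return best

# Example (reusing the toy X, y and the learned w from the gradient-descent sketch):
# print(bound_after_line_search(w, X, y))
```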

Table 1. Summary of results for linear classifiers on decision stumps. For each data set (Usvotes, Credit-A, Glass, Haberman, Heart, Sonar, BreastCancer, Tic-tac-toe, Ionosphere, Wdbc, four MNIST digit pairs, Letter:AvsB, Letter:DvsO, Letter:OvsQ, Adult, Mushroom), the table reports |S|, |T| and n, the values of R_T(h_w) and Bnd for AdaBoost, the values of R_T(h_w), G_T(w) and Bnd for PBGD1, PBGD2 and PBGD3, and the statistically significantly better (SSB) comparisons.

For PBGD2, the bound is computed on the second half of the training data with the prior P_{w_p} constructed from the first half and, as explained in Section 4, with δ replaced by δ/7. Note that the bound values for the classifiers returned by PBGD2 are generally much lower than those for the classifiers produced by the other algorithms. This almost always materializes in a smaller testing error for the linear classifier produced by PBGD2. To our knowledge, these training set bounds for PBGD2 are the smallest ones obtained for any learning algorithm producing linear classifiers.

To determine whether or not a difference of empirical risk measured on the testing set T is statistically significant, we have used the test set bound method of Langford (2005), based on the binomial tail inversion, with a confidence level of 95%. It turns out that no algorithm has succeeded in choosing a linear classifier h_w which was statistically significantly better (SSB) than the one chosen by another algorithm, except for the few cases listed in the SSB column of Table 1. Overall, AdaBoost, PBGD2 and PBGD3 are very competitive with one another, with no clear winner, and are generally superior to PBGD1. We therefore see an advantage in using half of the training data to learn a good prior over using a non-informative prior and keeping all the data to learn the posterior.

4.2. PBGD with Respect to Dual Variables

In this subsection, we compare the PBGD algorithms to the soft-margin SVM. Here, all four learning algorithms produce a linear classifier on a feature space defined by the RBF kernel k satisfying k(x, x') = exp(-½ ‖x - x'‖² / γ²) for all (x, x') ∈ X × X. The results obtained for all algorithms are summarized in Table 2. All the data sets are the same as those of the previous subsection, and the notation used in this table is identical to the one used for Table 1.

Table 2. Summary of results for linear classifiers with an RBF kernel, on the same data sets and with the same columns as Table 1, with the SVM in place of AdaBoost.

For the SVM and PBGD3, the kernel parameter γ and the soft-margin parameter C were chosen by 10-fold cross-validation on the training set S among the set of values proposed by Ambroladze et al. (2006). PBGD1 and PBGD2 also tried the same set of values for γ, but used the bound of Corollary 2.1 to select a good value. Since several values of γ are considered, the bound of PBGD1 is therefore computed with δ divided by the number of γ values tried. For PBGD2, the bound is, as stated before, computed on the second half of the training data, with δ replaced by δ divided by 7 times the number of γ values tried. Again, the bound values for the classifiers returned by PBGD2 are generally much lower than those for the classifiers produced by the other algorithms. To our knowledge, the training set bounds for PBGD2 are the smallest ones obtained for any learning algorithm producing linear classifiers.

The same method as in the previous subsection was used to determine whether or not a difference of empirical risk measured on the testing set T is statistically significant. It turns out that no algorithm has chosen a linear classifier h_w which was statistically significantly better than the choices of the others, except for the few cases listed in the SSB column. Thus, the SVM and PBGD3 are very competitive with one another, with no clear winner, and are both slightly superior to PBGD2, and a bit more than slightly superior to PBGD1.
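
To make the dual-variable computations of Section 4.2 concrete, the following sketch (ours, not the authors' code; the kernel width γ, the dual vector α and the toy data are arbitrary) evaluates the two quantities entering the bounds, R_S(G_{Q_w}) and KL(Q_w‖P_0), purely from the Gram matrix of the RBF kernel when w = Σ_i α_i y_i φ(x_i):

```python
import numpy as np
from scipy.special import erfc

def probit_loss(t):
    return 0.5 * erfc(t / np.sqrt(2.0))   # Phi(t), as in the earlier sketches

def rbf_gram(X, gamma):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 gamma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-0.5 * d2 / gamma ** 2)

def gibbs_risk_and_kl_dual(alpha, y, K):
    """R_S(G_{Q_w}) and KL(Q_w || P_0) for w = sum_i alpha_i y_i phi(x_i)."""
    coeff = alpha * y
    functional = K @ coeff                              # w . phi(x_i) for every i
    margins = y * functional / np.sqrt(np.diag(K))      # divide by ||phi(x_i)||
    kl = 0.5 * coeff @ K @ coeff                        # ||w||^2 / 2
    return float(np.mean(probit_loss(margins))), float(kl)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = np.sign(X[:, 0])
K = rbf_gram(X, gamma=1.0)
alpha = np.full(50, 0.1)                                # arbitrary dual vector
print(gibbs_risk_and_kl_dual(alpha, y, K))
```
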
5. Conclusion

We have shown that the standard PAC-Bayes risk bounds (McAllester, 2003; Seeger, 2002; Langford, 2005; Catoni, 2007) are specializations of Theorem 2.1 that are obtained by choosing a particular convex function D that binds the Gibbs true risk to its empirical estimate. Moreover, when spherical Gaussians over spaces of linear classifiers are used for priors and posteriors, we have shown that the Gibbs classifier G_{Q_w} that minimizes the PAC-Bayes bound of Corollary 2.1 (resp. 2.2) is obtained from the weight vector w that minimizes the non-convex objective function F2.1 (resp. F2.2). When the prior is non-informative, a simple convex relaxation of F2.2 gives the objective function which is minimized by the soft-margin SVM.

We have proposed two learning algorithms (PBGD1 and PBGD2) for finding linear classifiers that minimize the bound of Corollary 2.1, and another algorithm (PBGD3) that uses the cross-validation methodology to determine the value of the parameter C in the objective function F2.2. PBGD1 uses a non-informative prior to construct the final classifier from all the training data. In contrast, PBGD2 uses a fraction of the training set to construct an informative prior that is used to learn the final linear classifier on the remaining fraction of the training data. Our extensive experiments indicate that PBGD2 is generally much more effective than PBGD1 at producing classifiers with small true risk. Moreover, the training set risk bounds obtained for PBGD2 are, to our knowledge, the smallest obtained so far for any learning algorithm producing linear classifiers. In fact, PBGD2 is a learning algorithm producing classifiers having a good guarantee without the need of using any test set for that purpose. This opens the way to a feasible learning strategy that uses all the available data for training. Our results also indicate that PBGD2 and PBGD3 are competitive with both AdaBoost and the soft-margin SVM at producing classifiers with small true risk. However, as a consequence of the non-convexity of the objective function F2.2, PBGD2 and PBGD3 are slower than AdaBoost and the SVM.

Acknowledgements

Work supported by NSERC Discovery grants.

References

Ambroladze, A., Parrado-Hernández, E., & Shawe-Taylor, J. (2006). Tighter PAC-Bayes bounds. Proceedings of the 2006 Conference on Neural Information Processing Systems (NIPS-06).

Banerjee, A. (2006). On Bayesian bounds. ICML '06: Proceedings of the 23rd International Conference on Machine Learning.

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Monograph Series of the Institute of Mathematical Statistics.

Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6.

Langford, J., & Shawe-Taylor, J. (2003). PAC-Bayes & margins. In S. Thrun, S. Becker and K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.

McAllester, D. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26.

Seeger, M. (2002). PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3.


More information

Variations sur la borne PAC-bayésienne

Variations sur la borne PAC-bayésienne Variations sur la borne PAC-bayésienne Pascal Germain INRIA Paris Équipe SIRRA Séminaires du département d informatique et de génie logiciel Université Laval 11 juillet 2016 Pascal Germain INRIA/SIRRA

More information

Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees

Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Saple Guarantees Jean Honorio CSAIL, MIT Cabridge, MA 0239, USA jhonorio@csail.it.edu Toi Jaakkola CSAIL, MIT Cabridge, MA

More information

Geometrical intuition behind the dual problem

Geometrical intuition behind the dual problem Based on: Geoetrical intuition behind the dual proble KP Bennett, EJ Bredensteiner, Duality and Geoetry in SVM Classifiers, Proceedings of the International Conference on Machine Learning, 2000 1 Geoetrical

More information

Interactive Markov Models of Evolutionary Algorithms

Interactive Markov Models of Evolutionary Algorithms Cleveland State University EngagedScholarship@CSU Electrical Engineering & Coputer Science Faculty Publications Electrical Engineering & Coputer Science Departent 2015 Interactive Markov Models of Evolutionary

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Testing equality of variances for multiple univariate normal populations

Testing equality of variances for multiple univariate normal populations University of Wollongong Research Online Centre for Statistical & Survey Methodology Working Paper Series Faculty of Engineering and Inforation Sciences 0 esting equality of variances for ultiple univariate

More information

Principal Components Analysis

Principal Components Analysis Principal Coponents Analysis Cheng Li, Bingyu Wang Noveber 3, 204 What s PCA Principal coponent analysis (PCA) is a statistical procedure that uses an orthogonal transforation to convert a set of observations

More information

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA)

Bayesian Learning. Chapter 6: Bayesian Learning. Bayes Theorem. Roles for Bayesian Methods. CS 536: Machine Learning Littman (Wu, TA) Bayesian Learning Chapter 6: Bayesian Learning CS 536: Machine Learning Littan (Wu, TA) [Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6] Bayes Theore MAP, ML hypotheses MAP learners Miniu

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

THE CONSTRUCTION OF GOOD EXTENSIBLE RANK-1 LATTICES. 1. Introduction We are interested in approximating a high dimensional integral [0,1]

THE CONSTRUCTION OF GOOD EXTENSIBLE RANK-1 LATTICES. 1. Introduction We are interested in approximating a high dimensional integral [0,1] MATHEMATICS OF COMPUTATION Volue 00, Nuber 0, Pages 000 000 S 0025-578(XX)0000-0 THE CONSTRUCTION OF GOOD EXTENSIBLE RANK- LATTICES JOSEF DICK, FRIEDRICH PILLICHSHAMMER, AND BENJAMIN J. WATERHOUSE Abstract.

More information

On the Impact of Kernel Approximation on Learning Accuracy

On the Impact of Kernel Approximation on Learning Accuracy On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu

More information

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies

Approximation in Stochastic Scheduling: The Power of LP-Based Priority Policies Approxiation in Stochastic Scheduling: The Power of -Based Priority Policies Rolf Möhring, Andreas Schulz, Marc Uetz Setting (A P p stoch, r E( w and (B P p stoch E( w We will assue that the processing

More information

Prediction by random-walk perturbation

Prediction by random-walk perturbation Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information