PAC-Bayesian Learning of Linear Classifiers


Pascal Germain (Pascal.Germain.1@ulaval.ca), Alexandre Lacasse (Alexandre.Lacasse@ift.ulaval.ca), François Laviolette (Francois.Laviolette@ift.ulaval.ca), Mario Marchand (Mario.Marchand@ift.ulaval.ca)
Département d'informatique et de génie logiciel, Université Laval, Québec, Canada, G1V 0A6

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the authors/owners.

Abstract

We present a general PAC-Bayes theorem from which all known PAC-Bayes risk bounds are obtained as particular cases. We also propose different learning algorithms for finding linear classifiers that minimize these bounds. These learning algorithms are generally competitive with both AdaBoost and the SVM.

1. Introduction

For the classification problem, we are given a training set of $m$ examples, each generated according to the same but unknown distribution $D$, and the goal is to find a classifier that minimizes the true risk, i.e., the generalization error or expected loss. Since the true risk is defined only with respect to the unknown distribution $D$, we are automatically confronted with the problem of specifying exactly what we should optimize on the training data to find a classifier having the smallest possible true risk. Many different specifications of what should be optimized on the training data have been provided using different inductive principles, but the final guarantee on the true risk always comes with a so-called risk bound that holds uniformly over a set of classifiers. Hence, the formal justification of a learning strategy has always come a posteriori via a risk bound. Since a risk bound can be computed from what a classifier achieves on the training data, it automatically suggests the following optimization problem for learning algorithms: given a risk upper bound, find a classifier that minimizes it.

Despite the enormous impact they have had on our understanding of learning, the VC bounds are generally very loose. These bounds are characterized by the fact that their data-dependence comes only through the training error of the classifiers. The fact that there also exist VC lower bounds, asymptotically identical to the corresponding upper bounds, suggests that significantly tighter bounds can only come through extra data-dependent properties, such as the distribution of margins achieved by a classifier on the training data. Among the data-dependent bounds that have been proposed recently, the PAC-Bayes bounds (McAllester, 2003; Seeger, 2002; Langford, 2005; Catoni, 2007) seem to be especially tight. These bounds thus appear to be a good starting point for the design of a bound-minimizing algorithm.

In this paper, we present a general PAC-Bayes theorem and show that all known PAC-Bayes bounds are corollaries of this general theorem. When spherical Gaussians over the space of linear classifiers are used for priors and posteriors, we show that the Gibbs classifier that minimizes any of the above-mentioned PAC-Bayes risk bounds is obtained from the linear classifier that minimizes a non-convex objective function. We also propose two different learning algorithms for finding linear classifiers that minimize PAC-Bayes risk bounds, and a third algorithm that uses cross-validation to determine the value of a parameter present in the risk bound of Catoni (2007). The first algorithm uses a non-informative prior to construct a classifier from all the training data.
The second algorithm uses a fraction of the training set to construct an informative prior that is used to learn the final linear classifier on the remaining fraction of the training data. The third algorithm is, as the first one, based on a non-informative prior but uses the cross-validation methodology to choose one of the bound's parameters. The idea of using a fraction of the training data to construct a prior was proposed in Ambroladze et al. (2006) for the problem of choosing the hyperparameter values of the SVM. In contrast, the priors are used here to directly minimize a PAC-Bayes bound.

Our extensive experiments indicate that the second and third algorithms are competitive with both AdaBoost and the SVM and are generally much more effective than the first algorithm in their ability to produce classifiers with small true risk.

2. Simplified PAC-Bayesian Theory

We consider binary classification problems where the input space $\mathcal X$ consists of an arbitrary subset of $\mathbb R^n$ and the output space $\mathcal Y = \{-1,+1\}$. An example is an input-output pair $(x,y)$ where $x \in \mathcal X$ and $y \in \mathcal Y$. Throughout the paper, we adopt the PAC setting where each example $(x,y)$ is drawn according to a fixed, but unknown, distribution $D$ on $\mathcal X \times \mathcal Y$.

The risk $R(h)$ of any classifier $h\colon \mathcal X \to \mathcal Y$ is defined as the probability that $h$ misclassifies an example drawn according to $D$. Given a training set $S$ of $m$ examples, the empirical risk $R_S(h)$ of any classifier $h$ is defined by the frequency of training errors of $h$ on $S$. Hence

$$R(h) \;\stackrel{\text{def}}{=}\; \mathbb E_{(x,y)\sim D}\, I\big(h(x)\neq y\big), \qquad R_S(h) \;\stackrel{\text{def}}{=}\; \frac1m \sum_{i=1}^{m} I\big(h(x_i)\neq y_i\big),$$

where $I(a) = 1$ if predicate $a$ is true and $0$ otherwise.

After observing the training set $S$, the task of the learner is to choose a posterior distribution $Q$ over a space $\mathcal H$ of classifiers such that the $Q$-weighted majority vote classifier $B_Q$ will have the smallest possible risk. On any input example $x$, the output $B_Q(x)$ of the majority vote classifier $B_Q$ (sometimes called the Bayes classifier) is given by

$$B_Q(x) \;\stackrel{\text{def}}{=}\; \operatorname{sgn}\Big[\,\mathbb E_{h\sim Q}\, h(x)\Big],$$

where $\operatorname{sgn}(s) = +1$ if $s > 0$ and $\operatorname{sgn}(s) = -1$ otherwise. The output of the deterministic majority vote classifier $B_Q$ is closely related to the output of a stochastic classifier called the Gibbs classifier $G_Q$. To classify an input example $x$, the Gibbs classifier $G_Q$ chooses a deterministic classifier $h$ randomly according to $Q$ and uses it to classify $x$. The true risk $R(G_Q)$ and the empirical risk $R_S(G_Q)$ of the Gibbs classifier are thus given by

$$R(G_Q) \;=\; \mathbb E_{h\sim Q}\, R(h)\,; \qquad R_S(G_Q) \;=\; \mathbb E_{h\sim Q}\, R_S(h).$$

Any bound for $R(G_Q)$ can straightforwardly be turned into a bound for the risk of the majority vote $R(B_Q)$. Indeed, whenever $B_Q$ misclassifies $x$, at least half of the classifiers (under measure $Q$) misclassify $x$. It follows that the error rate of $G_Q$ is at least half the error rate of $B_Q$. Hence $R(B_Q) \le 2R(G_Q)$. As shown in Langford and Shawe-Taylor (2003), this factor of 2 can sometimes be reduced to $(1+\epsilon)$.

The following theorem gives both an upper and a lower bound on $R(G_Q)$ by upper-bounding $\mathcal D(R_S(G_Q), R(G_Q))$ for any convex function $\mathcal D\colon [0,1]\times[0,1] \to \mathbb R$.

Theorem 2.1. For any distribution $D$, for any set $\mathcal H$ of classifiers, for any prior distribution $P$ of support $\mathcal H$, for any $\delta \in (0,1]$, and for any convex function $\mathcal D\colon [0,1]\times[0,1] \to \mathbb R$, we have

$$\Pr_{S\sim D^m}\Big(\forall\, Q \text{ on } \mathcal H\colon\;\; \mathcal D\big(R_S(G_Q), R(G_Q)\big) \;\le\; \frac1m\Big[\operatorname{KL}(Q\|P) + \ln\Big(\tfrac1\delta\, \mathbb E_{h\sim P}\, \mathbb E_{S\sim D^m}\, e^{m\,\mathcal D(R_S(h),R(h))}\Big)\Big]\Big) \;\ge\; 1-\delta,$$

where $\operatorname{KL}(Q\|P) \stackrel{\text{def}}{=} \mathbb E_{h\sim Q} \ln\frac{Q(h)}{P(h)}$ is the Kullback-Leibler divergence between $Q$ and $P$.

Proof. Since $\mathbb E_{h\sim P}\, e^{m\,\mathcal D(R_S(h),R(h))}$ is a non-negative random variable, Markov's inequality gives

$$\Pr_{S\sim D^m}\Big(\mathbb E_{h\sim P}\, e^{m\,\mathcal D(R_S(h),R(h))} \;\le\; \tfrac1\delta\, \mathbb E_{S\sim D^m}\, \mathbb E_{h\sim P}\, e^{m\,\mathcal D(R_S(h),R(h))}\Big) \;\ge\; 1-\delta.$$

Hence, by taking the logarithm on each side of the innermost inequality and by transforming the expectation over $P$ into an expectation over $Q$, we obtain

$$\Pr_{S\sim D^m}\Big(\forall\, Q\colon\;\; \ln \mathbb E_{h\sim Q}\Big[\tfrac{P(h)}{Q(h)}\, e^{m\,\mathcal D(R_S(h),R(h))}\Big] \;\le\; \ln\Big(\tfrac1\delta\, \mathbb E_{S\sim D^m}\, \mathbb E_{h\sim P}\, e^{m\,\mathcal D(R_S(h),R(h))}\Big)\Big) \;\ge\; 1-\delta.$$

The theorem then follows from two applications of Jensen's inequality: one exploiting the concavity of $\ln(x)$ and the other the convexity of $\mathcal D$.

Theorem 2.1 provides a tool to derive PAC-Bayesian risk bounds. Each such bound is obtained by using a particular convex function $\mathcal D\colon [0,1]\times[0,1]\to\mathbb R$ and by upper-bounding $\mathbb E_{h\sim P}\, \mathbb E_{S\sim D^m}\, e^{m\,\mathcal D(R_S(h),R(h))}$. For example, a slightly tighter PAC-Bayes bound than the one derived by Seeger (2002) and Langford (2005) can be obtained from Theorem 2.1 by using $\mathcal D(q,p) = \operatorname{kl}(q,p)$, where

$$\operatorname{kl}(q,p) \;\stackrel{\text{def}}{=}\; q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}.$$
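To make the factor-of-two relation between the majority vote and the Gibbs classifier concrete, here is a small numerical sketch (ours, not part of the paper; it assumes NumPy is available and the function name is our own) that estimates both risks for a finite set of voters and checks that the majority-vote risk never exceeds twice the Gibbs risk on a sample.

```python
import numpy as np

def gibbs_and_majority_risk(votes, y, q):
    """votes: (n_voters, n_examples) array of +/-1 predictions,
    y: (n_examples,) labels in {-1,+1}, q: posterior weights summing to 1."""
    errors = (votes != y)                  # per-voter, per-example mistakes
    gibbs = float(np.mean(q @ errors))     # R(G_Q): Q-average of the individual risks
    bq = np.sign(q @ votes)                # Q-weighted majority vote B_Q (0 on ties)
    majority = float(np.mean(bq != y))     # R(B_Q), counting ties as errors
    return gibbs, majority

rng = np.random.default_rng(0)
votes = rng.choice([-1, 1], size=(5, 1000))
y = rng.choice([-1, 1], size=1000)
q = np.ones(5) / 5
gibbs, majority = gibbs_and_majority_risk(votes, y, q)
assert majority <= 2 * gibbs               # R(B_Q) <= 2 R(G_Q), as argued above
```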

Corollary 2.1. For any distribution $D$, for any set $\mathcal H$ of classifiers, for any prior distribution $P$ of support $\mathcal H$, and for any $\delta \in (0,1]$, we have

$$\Pr_{S\sim D^m}\Big(\forall\, Q \text{ on } \mathcal H\colon\;\; \operatorname{kl}\big(R_S(G_Q), R(G_Q)\big) \;\le\; \frac1m\Big[\operatorname{KL}(Q\|P) + \ln\frac{\xi(m)}{\delta}\Big]\Big) \;\ge\; 1-\delta,$$

where

$$\xi(m) \;\stackrel{\text{def}}{=}\; \sum_{k=0}^{m}\binom{m}{k}\Big(\frac km\Big)^{k}\Big(1-\frac km\Big)^{m-k}.$$

Proof. The corollary immediately follows from Theorem 2.1 by choosing $\mathcal D(q,p) = \operatorname{kl}(q,p)$. Indeed, in that case we have

$$\mathbb E_{h\sim P}\, \mathbb E_{S\sim D^m}\, e^{m\,\operatorname{kl}(R_S(h),R(h))} \;=\; \mathbb E_{h\sim P}\, \mathbb E_{S\sim D^m} \Big(\frac{R_S(h)}{R(h)}\Big)^{mR_S(h)} \Big(\frac{1-R_S(h)}{1-R(h)}\Big)^{m(1-R_S(h))}$$
$$=\; \mathbb E_{h\sim P} \sum_{k=0}^{m} \Pr_{S\sim D^m}\Big(R_S(h)=\frac km\Big) \Big(\frac{k/m}{R(h)}\Big)^{k} \Big(\frac{1-k/m}{1-R(h)}\Big)^{m-k} \;=\; \mathbb E_{h\sim P} \sum_{k=0}^{m} \binom{m}{k}\Big(\frac km\Big)^{k}\Big(1-\frac km\Big)^{m-k} \;=\; \xi(m),$$

where the last equality arises from the fact that $m R_S(h)$ is a binomial random variable of mean $m R(h)$. See Banerjee (2006) for a very similar proof.

Note also that we retrieve the exact formulation of the PAC-Bayes bound of Langford (2005) if we upper-bound $\xi(m)$ by $m+1$; however, $\xi(m) \in \Theta(\sqrt m)$. The PAC-Bayes bound of McAllester (2003) can be obtained by using $\mathcal D(q,p) = 2(q-p)^2$.

Let us now consider functions that are linear in the empirical risk, i.e., functions of the form $\mathcal D(q,p) = \mathcal F(p) - Cq$ for convex $\mathcal F$. As the next corollary shows, this choice for $\mathcal D$ gives a PAC-Bayes bound whose minimum is obtained for Gibbs classifiers minimizing a simple linear combination of $R_S(G_Q)$ and $\operatorname{KL}(Q\|P)$. The next corollary has also been found by Catoni (2007, Th. 1.2.1).

Corollary 2.2. For any distribution $D$, any set $\mathcal H$ of classifiers, any prior distribution $P$ of support $\mathcal H$, any $\delta \in (0,1]$, and any positive real number $C$, we have

$$\Pr_{S\sim D^m}\Big(\forall\, Q \text{ on } \mathcal H\colon\;\; R(G_Q) \;\le\; \frac{1}{1-e^{-C}}\Big\{1 - \exp\Big(-\Big[C\,R_S(G_Q) + \frac1m\Big(\operatorname{KL}(Q\|P) + \ln\frac1\delta\Big)\Big]\Big)\Big\}\Big) \;\ge\; 1-\delta.$$

Proof. Put $\mathcal D(q,p) = \mathcal F(p) - Cq$ for some function $\mathcal F$ to be defined. Then

$$\mathbb E_{h\sim P}\, \mathbb E_{S\sim D^m}\, e^{m\,\mathcal D(R_S(h),R(h))} \;=\; \mathbb E_{h\sim P}\, e^{m\mathcal F(R(h))}\, \mathbb E_{S\sim D^m}\, e^{-C m R_S(h)} \;=\; \mathbb E_{h\sim P}\, e^{m\mathcal F(R(h))} \sum_{k=0}^{m} \Pr_{S\sim D^m}\Big(R_S(h)=\frac km\Big)\, e^{-Ck}$$
$$=\; \mathbb E_{h\sim P}\, e^{m\mathcal F(R(h))} \sum_{k=0}^{m} \binom{m}{k}\, R(h)^{k} \big(1-R(h)\big)^{m-k} e^{-Ck} \;=\; \mathbb E_{h\sim P}\, e^{m\mathcal F(R(h))} \big(R(h)\,e^{-C} + 1 - R(h)\big)^{m},$$

and the result follows easily from Theorem 2.1 when $\mathcal F$ is the convex function $\mathcal F(p) = \ln\frac{1}{1 - p\,(1-e^{-C})}$.

It is interesting to compare the bounds of Corollaries 2.1 and 2.2. A nice property of the bound of Corollary 2.2 is that its minimization is achieved by the Gibbs classifier $G_Q$ that minimizes $C\,m\,R_S(G_Q) + \operatorname{KL}(Q\|P)$. As we will see, this minimization problem is closely related to the one solved by the SVM when $Q$ is an isotropic Gaussian over the space of linear classifiers. Minimizing the bound given by Corollary 2.1 does not appear to be as simple, because the upper bound on $R(G_Q)$ is not an explicit function of $R_S(G_Q)$ and $\operatorname{KL}(Q\|P)$. However, that upper bound does not depend on an arbitrary constant such as $C$ in Corollary 2.2, which gives a computational advantage to Corollary 2.1, since several bound minimizations (one for each value of $C$) would be needed in the case of Corollary 2.2. The tightness of these bounds can be compared with the following proposition.

Proposition 2.1. For any $0 \le R_S \le R < 1$, we have

$$\max_{C \ge 0}\Big\{ -\ln\big[1 - R\,(1-e^{-C})\big] - C\,R_S \Big\} \;=\; \operatorname{kl}(R_S, R).$$

Consequently, by omitting $\frac1m\ln\xi(m)$, Corollary 2.1 always gives a bound which is tighter than or equal to the one given by Corollary 2.2. On the other hand, there always exist values of $C$ for which Corollary 2.2 gives a tighter bound than Corollary 2.1.

The next lemma shows that the bound of Corollary 2.2 has the interesting property of admitting an analytical expression for the optimal posterior $Q^*$ for every prior $P$.

Lemma 2.1. For any set $\mathcal H$ of classifiers, any prior $P$ of support $\mathcal H$, and any positive real number $C$, the posterior $Q^*$ that minimizes the upper bound on $R(G_Q)$ of Corollary 2.2 has a density given by the following Boltzmann distribution:

$$Q^*(h) \;=\; \frac1Z\, P(h)\, e^{-C\,m\,R_S(h)},$$

where $m$ denotes the number of training examples in $S$ and $Z$ is a normalizing constant.
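The corollaries and the lemma are easy to evaluate numerically. The sketch below (ours, assuming NumPy and SciPy; the function names are not from the paper) computes $\xi(m)$, inverts the kl of Corollary 2.1 with a root finder, evaluates the closed-form bound of Corollary 2.2, illustrates Proposition 2.1 by minimizing the latter over a grid of $C$, and computes the Boltzmann posterior of Lemma 2.1 for a finite hypothesis set.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import gammaln

def kl(q, p):
    """Binary KL divergence kl(q, p) between Bernoulli(q) and Bernoulli(p)."""
    q, p = np.clip(q, 1e-12, 1 - 1e-12), np.clip(p, 1e-12, 1 - 1e-12)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def log_xi(m):
    """log of xi(m) = sum_{k=0}^m C(m,k) (k/m)^k (1 - k/m)^(m-k)."""
    k = np.arange(1, m)                        # the k = 0 and k = m terms both equal 1
    log_terms = (gammaln(m + 1) - gammaln(k + 1) - gammaln(m - k + 1)
                 + k * np.log(k / m) + (m - k) * np.log(1 - k / m))
    return np.logaddexp.reduce(np.concatenate(([0.0, 0.0], log_terms)))

def bound_cor21(r_s, kl_qp, m, delta=0.05):
    """Corollary 2.1: largest R such that kl(r_s, R) <= (KL(Q||P) + ln(xi(m)/delta)) / m."""
    rhs = (kl_qp + log_xi(m) - np.log(delta)) / m
    if kl(r_s, 1 - 1e-9) <= rhs:
        return 1.0
    return brentq(lambda r: kl(r_s, r) - rhs, r_s, 1 - 1e-9)

def bound_cor22(r_s, kl_qp, m, C, delta=0.05):
    """Corollary 2.2 (Catoni): explicit bound for a fixed trade-off constant C > 0."""
    return (1 - np.exp(-(C * r_s + (kl_qp + np.log(1 / delta)) / m))) / (1 - np.exp(-C))

def boltzmann_posterior(prior, emp_risks, C, m):
    """Optimal posterior of Lemma 2.1 over a finite hypothesis set:
    Q*(h) proportional to P(h) * exp(-C * m * R_S(h))."""
    log_q = np.log(prior) - C * m * np.asarray(emp_risks)
    log_q -= log_q.max()                       # stabilize before exponentiating
    q = np.exp(log_q)
    return q / q.sum()

# Proposition 2.1 in action: optimizing C brings Corollary 2.2 up to the kl bound,
# the remaining gap being the ln(xi(m))/m term that only Corollary 2.1 pays.
r_s, kl_qp, m = 0.10, 5.0, 1000
print(bound_cor21(r_s, kl_qp, m))
print(min(bound_cor22(r_s, kl_qp, m, C) for C in np.logspace(-2, 2, 400)))
```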

Proof. We present here a proof for the case where $\mathcal H$ is countable, but the lemma also holds in the continuous case. For any fixed $C$, $m$, and $P$, the distribution $Q$ minimizing the bound of Corollary 2.2 is the same as the one minimizing $\mathcal B(Q)$, where

$$\mathcal B(Q) \;\stackrel{\text{def}}{=}\; C\,m \sum_{h\in\mathcal H} Q(h)\,R_S(h) \;+\; \operatorname{KL}(Q\|P),$$

under the constraint $\sum_{h\in\mathcal H} Q(h) = 1$. At optimality, $Q$ must satisfy the Lagrange conditions, namely that there exists $\lambda \in \mathbb R$ such that for any $h \in \mathcal H$ we have

$$\lambda \;=\; \frac{\partial \mathcal B}{\partial Q(h)} \;=\; C\,m\,R_S(h) + 1 + \ln\frac{Q(h)}{P(h)}.$$

Consequently,

$$Q(h) \;=\; P(h)\, e^{\lambda - 1 - C\,m\,R_S(h)} \;=\; \frac1Z\, P(h)\, e^{-C\,m\,R_S(h)},$$

where $Z$ is a normalizing constant.

It is well known that Bayes classifiers resulting from a Boltzmann distribution can only be expressed via integral formulations. Such integrals can be approximated by some Markov chain Monte Carlo sampling but, since the mixing time is unknown, we have no real control on the precision of the approximation. For this reason, we restrict ourselves here to the case where the posterior $Q$ is chosen from a parameterized set of distributions. Building on the previous work of Langford and Shawe-Taylor (2003) and Langford (2005), we will focus on isotropic Gaussian distributions of linear classifiers since, in this case, we have an exact analytical expression for $B_Q$, $G_Q$, $R_S(B_Q)$, $R_S(G_Q)$, and $\operatorname{KL}(Q\|P)$ in terms of the parameters of the posterior $Q$. These analytical expressions will enable us to perform our computations without any Monte Carlo sampling.

3. Specialization to Linear Classifiers

Let us apply Corollaries 2.1 and 2.2 to linear classifiers defined over a space of features. Here we suppose that each $x \in \mathcal X$ is mapped to a feature vector $\boldsymbol\phi(x) = (\phi_1(x), \phi_2(x), \ldots)$, where each $\phi_i$ is given explicitly as a real-valued function or implicitly by using a Mercer kernel $k\colon \mathcal X \times \mathcal X \to \mathbb R$. In the latter case, we have $k(x, x') = \boldsymbol\phi(x)\cdot\boldsymbol\phi(x')$ for all $(x, x') \in \mathcal X \times \mathcal X$. Each linear classifier $h_{\mathbf w}$ is identified by a real-valued weight vector $\mathbf w$. The output $h_{\mathbf w}(x)$ of $h_{\mathbf w}$ on any $x \in \mathcal X$ is given by

$$h_{\mathbf w}(x) \;=\; \operatorname{sgn}\big(\mathbf w \cdot \boldsymbol\phi(x)\big).$$

The task of the learner is to produce a posterior $Q$ over the set of all possible weight vectors. If each possible feature vector $\boldsymbol\phi$ has $N$ components, the set of all possible weight vectors is $\mathbb R^N$. Let $Q(\mathbf v)$ denote the posterior density evaluated at weight vector $\mathbf v$. We restrict ourselves to the case where the learner produces a posterior $Q_{\mathbf w}$, parameterized by a chosen weight vector $\mathbf w$, such that for any weight vectors $\mathbf v$ and $\mathbf u$ we have $Q_{\mathbf w}(\mathbf v) = Q_{\mathbf w}(\mathbf u)$ whenever $\mathbf v \cdot \mathbf w = \mathbf u \cdot \mathbf w$. Posteriors $Q_{\mathbf w}$ satisfying this property are said to be symmetric about $\mathbf w$. It can easily be shown that for any $Q_{\mathbf w}$ symmetric about $\mathbf w$ and for any feature vector $\boldsymbol\phi$:

$$\operatorname{sgn}\Big(\mathbb E_{\mathbf v\sim Q_{\mathbf w}} \operatorname{sgn}\big(\mathbf v \cdot \boldsymbol\phi\big)\Big) \;=\; \operatorname{sgn}\big(\mathbf w \cdot \boldsymbol\phi\big). \qquad (1)$$

In other words, for any input example, the output of the majority vote classifier $B_{Q_{\mathbf w}}$ given by the left-hand side of Equation (1) is the same as the one given by the linear classifier $h_{\mathbf w}$ whenever $Q_{\mathbf w}$ is symmetric about $\mathbf w$. Consequently, $R(h_{\mathbf w}) = R(B_{Q_{\mathbf w}}) \le 2R(G_{Q_{\mathbf w}})$, so Corollaries 2.1 and 2.2 provide upper bounds on $R(h_{\mathbf w})$ for these posteriors.

Building on the previous work of Langford and Shawe-Taylor (2003) and Langford (2005), we choose both the prior $P_{\mathbf w_p}$ and the posterior $Q_{\mathbf w}$ to be spherical Gaussians with identity covariance matrix, respectively centered on $\mathbf w_p$ and on $\mathbf w$. Hence, for any weight vector $\mathbf v \in \mathbb R^N$:

$$Q_{\mathbf w}(\mathbf v) \;=\; \Big(\frac{1}{\sqrt{2\pi}}\Big)^{N} \exp\Big(-\frac12\,\|\mathbf v - \mathbf w\|^2\Big).$$

Thus, the posterior is parameterized by a weight vector $\mathbf w$ that will be chosen by the learner based on the values of $R_S(G_{Q_{\mathbf w}})$ and $\operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p})$. Here, the weight vector $\mathbf w_p$ that parameterizes the prior $P_{\mathbf w_p}$ represents prior knowledge that the learner might have about the classification task, i.e., about a good direction for linear separators.
We therefore have $\mathbf w_p = \mathbf 0$ in the absence of prior knowledge, so that $P_{\mathbf 0}$ is the non-informative prior. Alternatively, we might set aside a subset $S'$ of the training data $S$ and choose $\mathbf w_p$ such that $R_{S'}(G_{P_{\mathbf w_p}})$ is small. By performing simple Gaussian integrals, as in Langford (2005), we find

$$\operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p}) \;=\; \frac12\,\|\mathbf w - \mathbf w_p\|^2,$$
$$R(G_{Q_{\mathbf w}}) \;=\; \mathbb E_{(x,y)\sim D}\, \Phi\big(\|\mathbf w\|\, \Gamma_{\mathbf w}(x,y)\big), \qquad R_S(G_{Q_{\mathbf w}}) \;=\; \frac1m \sum_{i=1}^{m} \Phi\big(\|\mathbf w\|\, \Gamma_{\mathbf w}(x_i,y_i)\big),$$

where $\Gamma_{\mathbf w}(x,y)$ denotes the normalized margin of $\mathbf w$ on $(x,y)$, i.e.,

$$\Gamma_{\mathbf w}(x,y) \;\stackrel{\text{def}}{=}\; \frac{y\,\mathbf w \cdot \boldsymbol\phi(x)}{\|\mathbf w\|\,\|\boldsymbol\phi(x)\|},$$

and where $\Phi(a)$ denotes the probability that $X > a$ when $X$ is a $N(0,1)$ random variable, i.e.,

$$\Phi(a) \;\stackrel{\text{def}}{=}\; \frac{1}{\sqrt{2\pi}} \int_a^{\infty} \exp\Big(-\frac{x^2}{2}\Big)\, dx. \qquad (2)$$
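Under this isotropic Gaussian choice, the two quantities entering the bounds have the closed forms just given. A small sketch (ours, assuming NumPy and SciPy; function names are our own) that evaluates them:

```python
import numpy as np
from scipy.stats import norm

def gibbs_empirical_risk(w, features, y):
    """R_S(G_{Q_w}) of Eq. (2): average probit loss Phi(y_i w.phi(x_i)/||phi(x_i)||).
    features: (m, N) matrix whose i-th row is phi(x_i); y: labels in {-1,+1}."""
    margins = y * (features @ w) / np.linalg.norm(features, axis=1)
    return float(np.mean(norm.sf(margins)))          # Phi(a) = Pr(N(0,1) > a)

def kl_posterior_prior(w, w_p):
    """KL(Q_w || P_{w_p}) between the two unit-variance spherical Gaussians."""
    return 0.5 * float(np.sum((w - w_p) ** 2))
```

Plugged into the bound computations sketched earlier, these give the two quantities that the objective functions of the next subsection trade off.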

3.1. Two objective functions to minimize

By using the above expressions for $R_S(G_{Q_{\mathbf w}})$ and $\operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p})$, Corollaries 2.1 and 2.2 both provide upper bounds on $R(G_{Q_{\mathbf w}})$ and on $R(h_{\mathbf w})$, since $R(h_{\mathbf w}) \le 2R(G_{Q_{\mathbf w}})$. Each bound depends on the same quantities: the empirical risk measure $R_S(G_{Q_{\mathbf w}})$, and $\operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p})$, which acts as a regularizer.

Minimizing the upper bound given by Corollary 2.1, in the case of linear classifiers, amounts to finding the $\mathbf w$ that minimizes the following objective function

$$B(S, \mathbf w, \delta) \;\stackrel{\text{def}}{=}\; \sup\Big\{\epsilon \;:\; \operatorname{kl}\big(R_S(G_{Q_{\mathbf w}}),\, \epsilon\big) \;\le\; \frac1m\Big[\operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p}) + \ln\frac{\xi(m)}{\delta}\Big]\Big\} \qquad (F2.1)$$

for a fixed value of the confidence parameter $\delta$ (say, $\delta = 0.05$). Consequently, our problem is to find the weight vector $\mathbf w$ that minimizes $B$ subject to the constraints

$$\operatorname{kl}\big(R_S(G_{Q_{\mathbf w}}),\, B\big) \;=\; \frac1m\Big[\operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p}) + \ln\frac{\xi(m)}{\delta}\Big] \qquad (3)$$
$$B \;>\; R_S(G_{Q_{\mathbf w}}). \qquad (4)$$

Minimizing the bound of Corollary 2.2, in the case of linear classifiers, amounts to finding the $\mathbf w$ that minimizes the simple objective function

$$C\,m\,R_S(G_{Q_{\mathbf w}}) + \operatorname{KL}(Q_{\mathbf w}\|P_{\mathbf w_p}) \;=\; C\sum_{i=1}^{m}\Phi\Big(\frac{y_i\,\mathbf w\cdot\boldsymbol\phi(x_i)}{\|\boldsymbol\phi(x_i)\|}\Big) + \frac12\,\|\mathbf w - \mathbf w_p\|^2 \qquad (F2.2)$$

for some fixed choice of $C$ and $\mathbf w_p$. In the absence of prior knowledge, $\mathbf w_p = \mathbf 0$ and the regularizer becomes identical to the one used by the SVM. Indeed, the learning strategy used by the soft-margin SVM consists in finding the $\mathbf w$ that minimizes

$$C\sum_{i=1}^{m}\max\big(0,\, 1 - y_i\,\mathbf w\cdot\boldsymbol\phi(x_i)\big) + \frac12\,\|\mathbf w\|^2,$$

for some fixed choice of $C$. Thus, for $\mathbf w_p = \mathbf 0$, both learning strategies are identical except for the fact that the convex SVM hinge loss is replaced by the non-convex probit loss $\Phi(\cdot)$. Hence, the objective function minimized by the soft-margin SVM is a convex relaxation of objective function F2.2. Each learning strategy has its potential drawback: the single local minimum of the soft-margin SVM solution might be suboptimal, whereas the non-convex PAC-Bayes bound might present several local minima.

Observe that $B$, the objective function F2.1, is defined only implicitly in terms of $\mathbf w$ via the constraints given by Equations (3) and (4). This optimization problem appears to be more involved than the unconstrained optimization of objective function F2.2 that arises from Corollary 2.2. However, it also appears to be more relevant since, according to Proposition 2.1, the upper bound given by Corollary 2.1 is somewhat tighter than the one given by Corollary 2.2, apart from the presence of a $\ln\xi(m)$ term. The optimization of objective function F2.1 also has the advantage of not depending on any constant $C$ like the one present in objective function F2.2.

4. Gradient Descent of the PAC-Bayes Bound

We are now concerned with the problem of minimizing the non-convex objective functions F2.1 (for fixed $\mathbf w_p$ and $\delta$) and F2.2 (for fixed $C$ and $\mathbf w_p$). As a first approach, it makes sense to minimize these objective functions by gradient descent. More specifically, we have used the Polak-Ribière conjugate gradient descent algorithm implemented in the GNU Scientific Library (GSL).

The gradient with respect to $\mathbf w$ of objective function F2.1 is obtained by computing the partial derivative of both sides of Equation (3) with respect to $w_j$ (the $j$-th component of $\mathbf w$). After solving for $\partial B/\partial w_j$, we find that the gradient is given by

$$\nabla B \;=\; \frac{B\,(1-B)}{B - R_S}\Bigg(\frac1m\,(\mathbf w - \mathbf w_p) \;+\; \ln\Big[\frac{B\,(1-R_S)}{R_S\,(1-B)}\Big]\, \frac1m \sum_{i=1}^{m} \Phi'\Big(\frac{y_i\,\mathbf w\cdot\boldsymbol\phi(x_i)}{\|\boldsymbol\phi(x_i)\|}\Big)\, \frac{y_i\,\boldsymbol\phi(x_i)}{\|\boldsymbol\phi(x_i)\|}\Bigg), \qquad (5)$$

where $R_S$ is shorthand for $R_S(G_{Q_{\mathbf w}})$ and $\Phi'(t)$ denotes the first derivative of $\Phi$ evaluated at $t$. We have observed that objective function F2.1 tends to have only one local minimum, even though it is not convex. We have therefore used a single gradient-descent run to minimize F2.1. The gradient of objective function F2.2 is

$$C\sum_{i=1}^{m} \Phi'\Big(\frac{y_i\,\mathbf w\cdot\boldsymbol\phi(x_i)}{\|\boldsymbol\phi(x_i)\|}\Big)\, \frac{y_i\,\boldsymbol\phi(x_i)}{\|\boldsymbol\phi(x_i)\|} \;+\; (\mathbf w - \mathbf w_p). \qquad (6)$$
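The paper minimizes these objectives with Polak-Ribière conjugate gradients from the GSL; as a simplified stand-in, the sketch below (ours, assuming NumPy and SciPy) evaluates F2.2 and the gradient of Equation (6) and runs plain gradient descent from a random start.

```python
import numpy as np
from scipy.stats import norm

def f22_value_and_grad(w, features, y, C, w_p):
    """Objective F2.2 = C * sum_i Phi(margin_i) + 0.5 * ||w - w_p||^2 and its gradient (Eq. 6)."""
    norms = np.linalg.norm(features, axis=1)
    margins = y * (features @ w) / norms
    value = C * np.sum(norm.sf(margins)) + 0.5 * np.sum((w - w_p) ** 2)
    dloss = -norm.pdf(margins) * y / norms     # Phi'(t) = -pdf(t), chain rule through the margin
    grad = C * (features.T @ dloss) + (w - w_p)
    return value, grad

def minimize_f22(features, y, C, w_p, lr=0.05, steps=5000, seed=0):
    """Plain gradient descent on the non-convex F2.2 from one random starting point."""
    w = np.random.default_rng(seed).normal(size=features.shape[1])
    for _ in range(steps):
        _, grad = f22_value_and_grad(w, features, y, C, w_p)
        w -= lr * grad
    return w
```

The several random restarts described next simply amount to calling such a minimizer with different seeds and keeping the solution with the smallest objective value.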
Since objective function F2.2 might have several local minima, especially for large values of $C$, each minimization of F2.2 consisted of $k$ different gradient-descent runs, each initiated from a new, randomly chosen starting position.

In the results presented here, we have used $k = 10$ for $C \le 10$ and $k = 100$ for $C > 10$.

4.1. Proposed learning algorithms

We propose three algorithms that can be used either with the primal variables (i.e., the components of $\mathbf w$) or with the dual variables $\alpha_1, \ldots, \alpha_m$ that appear in the linear expansion $\mathbf w = \sum_{i=1}^{m} \alpha_i\, y_i\, \boldsymbol\phi(x_i)$. In this latter case, the features are implicitly given by a Mercer kernel $k(x,x') = \boldsymbol\phi(x)\cdot\boldsymbol\phi(x')$ for all $(x,x') \in \mathcal X \times \mathcal X$. The objective functions F2.1 and F2.2, with their gradients (Eqs. 5 and 6), can then straightforwardly be expressed in terms of $k$ and the dual variables (this is true provided $\mathbf w_p$ can be expanded in terms of examples that do not belong to the training set).

The first algorithm, called PBGD1, uses the prior $P_{\mathbf 0}$ (i.e., with $\mathbf w_p = \mathbf 0$) to learn a posterior $Q_{\mathbf w}$ by minimizing the bound value of Corollary 2.1 (objective function F2.1). In this paper, every bound computation has been performed with $\delta = 0.05$.

The second algorithm, called PBGD2, was studied to investigate whether it is worthwhile to use a fraction $x$ of the training data to construct an informative prior $P_{\mathbf w_p}$, for some $\mathbf w_p \ne \mathbf 0$, that will be used to learn a posterior $Q_{\mathbf w}$ on the remaining $1-x$ fraction of the training data. In its first stage, PBGD2 minimizes the objective function F2.2 using a fraction $x$ of the training data to construct one posterior for each value of $C \in \{10^k : k = 0, \ldots, 6\}$. Note that a large value of $C$ attempts to generate a $\mathbf w$ for which the training error of $G_{\mathbf w}$ is small. Each posterior is constructed with the same non-informative prior used for PBGD1 (i.e., with $\mathbf w_p = \mathbf 0$). Then, each of these seven posteriors is used as a prior $P_{\mathbf w_p}$, with $\mathbf w_p \ne \mathbf 0$, for learning a posterior $Q_{\mathbf w}$ by minimizing the objective function F2.1 on the remaining fraction $1-x$ of the training data. By the union bound argument, the $\delta$ in Corollary 2.1 needs to be replaced by $\delta/7$ to get a bound that holds uniformly for these seven priors. Empirically, we have observed that the best fraction $x$ for constructing the prior was $1/2$. Hence, we report here only the results for $x = 1/2$.

For the third algorithm, called PBGD3, we always used the prior $P_{\mathbf 0}$ to minimize the objective function F2.2. But, instead of using the solution obtained for the value of $C$ that gave the smallest bound of Corollary 2.2, we performed 10-fold cross-validation on the training set to find the best value of $C$ and then used that value to find the classifier that minimizes objective function F2.2. Hence PBGD3 follows the same cross-validation learning methodology normally employed with the SVM but uses the probit loss instead of the hinge loss.

To compute the risk bound for the linear classifier returned by PBGD3 and by the other comparison algorithms (AdaBoost and the SVM), we performed a line search, along the direction of the weight vector $\mathbf w$ of the returned classifier, to find the norm $\|\mathbf w\|$ that minimizes the bound of Corollary 2.1 (this is justified by the fact that the bound holds uniformly for all weight vectors $\mathbf w$). For each bound computation, we used the non-informative prior $P_{\mathbf 0}$.

4.2. PBGD with Respect to Primal Variables

For the sake of comparison, all learning algorithms of this subsection produce a linear classifier $h_{\mathbf w}$ on the set of basis functions $\{\phi_1, \phi_2, \ldots\}$ known as decision stumps. Each decision stump $\phi_i$ is a threshold classifier that depends on a single attribute: its output is $+b$ if the tested attribute exceeds a threshold value $t$, and $-b$ otherwise, where $b \in \{-1, +1\}$. For each attribute, at most ten equally-spaced possible values for $t$ were determined a priori.
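As an illustration of this feature map, here is a sketch (ours, assuming NumPy; the exact threshold placement is an assumption) that turns raw attributes into decision-stump outputs:

```python
import numpy as np

def stump_features(X, n_thresholds=10):
    """Decision-stump feature map: for each attribute and each of up to n_thresholds
    equally spaced thresholds t, one column per polarity b in {-1,+1}, with value
    +b when the attribute exceeds t and -b otherwise."""
    columns = []
    for j in range(X.shape[1]):
        lo, hi = X[:, j].min(), X[:, j].max()
        for t in np.linspace(lo, hi, n_thresholds + 2)[1:-1]:   # interior thresholds only
            s = np.where(X[:, j] > t, 1.0, -1.0)
            columns.append(s)        # polarity b = +1
            columns.append(-s)       # polarity b = -1
    return np.column_stack(columns)
```

A linear classifier $h_{\mathbf w}$ on these columns is then a weighted vote of decision stumps, which is the representation shared by AdaBoost and the PBGD algorithms in the comparison below.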
We have compared the three PBGD algorithms to AdaBoost (Schapire et al., 1998) because the latter is a standard and efficient algorithm when used with decision stumps. Since AdaBoost minimizes the exponential risk $\frac1m\sum_{i=1}^{m}\exp\!\big(-y_i\,\mathbf w\cdot\boldsymbol\phi(x_i)\big)$, it never chooses a $\mathbf w$ for which there exists a training example where $-y_i\,\mathbf w\cdot\boldsymbol\phi(x_i)$ is very large. This is to be contrasted with the PBGD algorithms, for which the empirical risk $R_S(G_{Q_{\mathbf w}})$ has the sigmoidal shape of Equation (2) and never exceeds one. We thus anticipate that AdaBoost and the PBGD algorithms will select different weight vectors $\mathbf w$ on many data sets.

The results obtained for all the algorithms are summarized in Table 1. Except for MNIST, all data sets were taken from the UCI repository. Each data set was randomly split into a training set $S$ of $|S|$ examples and a testing set $T$ of $|T|$ examples. The number $n$ of attributes for each data set is also specified. For AdaBoost, the number of boosting rounds was fixed to 200. For all algorithms, $R_T(\mathbf w)$ refers to the frequency of errors, measured on the testing set $T$, of the linear classifier $h_{\mathbf w}$ returned by the learner. For the PBGD algorithms, $G_T(\mathbf w) \stackrel{\text{def}}{=} R_T(G_{Q_{\mathbf w}})$ refers to the empirical risk on $T$ of the Gibbs classifier. The "Bnd" columns refer to the PAC-Bayes bound of Corollary 2.1, computed on the training set. All bounds hold with confidence $1-\delta = 0.95$. For PBGD1, PBGD3, and AdaBoost, the bound is computed on all the training data with the non-informative prior $P_{\mathbf 0}$. For PBGD2, the bound is computed on the second half of the training data, with the prior $P_{\mathbf w_p}$ constructed from the first half and, as explained in Section 4.1, with $\delta$ replaced by $\delta/7$.

Table. Suary of results for linear classifiers on decision stups. Dataset a AdaBoost PBGD 2 PBGD2 3 PBGD3 SSB Nae S T n R T w Bnd R T w G T w Bnd R T w G T w Bnd R T w G T w Bnd Usvotes 235 200 6 0.055 0.346 0.085 0.03 0.207 0.060 0.058 0.65 0.060 0.057 0.26 Credit-A 353 300 5 0.70 0.504 0.77 0.243 0.375 0.87 0.9 0.272 0.43 0.59 0.420 Glass 07 07 9 0.78 0.636 0.96 0.346 0.562 0.68 0.76 0.395 0.50 0.226 0.58 Haberan 44 50 3 0.260 0.590 0.273 0.283 0.422 0.267 0.287 0.465 0.273 0.386 0.424 Heart 50 47 3 0.259 0.569 0.70 0.250 0.46 0.90 0.205 0.379 0.84 0.24 0.473 Sonar 04 04 60 0.23 0.644 0.269 0.376 0.579 0.73 0.68 0.547 0.25 0.209 0.622 BreastCancer 343 340 9 0.053 0.295 0.04 0.058 0.29 0.047 0.054 0.04 0.044 0.048 0.90 Tic-tac-toe 479 479 9 0.357 0.483 0.294 0.384 0.462 0.207 0.208 0.302 0.207 0.27 0.474 2,3<a, Ionosphere 76 75 34 0.20 0.602 0.20 0.223 0.425 0.09 0.29 0.347 0.03 0.25 0.557 Wdbc 285 284 30 0.049 0.447 0.042 0.099 0.272 0.049 0.048 0.47 0.035 0.05 0.39 MNIST:0vs8 500 96 784 0.008 0.528 0.05 0.052 0.9 0.0 0.06 0.062 0.006 0.0 0.262 MNIST:vs7 500 922 784 0.03 0.54 0.020 0.055 0.84 0.05 0.06 0.050 0.06 0.07 0.233 MNIST:vs8 500 936 784 0.025 0.552 0.037 0.097 0.247 0.027 0.030 0.087 0.08 0.037 0.305 3< MNIST:2vs3 500 905 784 0.047 0.558 0.046 0.8 0.264 0.040 0.044 0.05 0.034 0.048 0.356 Letter:AvsB 500 055 6 0.00 0.254 0.009 0.050 0.80 0.007 0.0 0.065 0.007 0.044 0.80 Letter:DvsO 500 058 6 0.036 0.378 0.043 0.24 0.34 0.033 0.039 0.090 0.024 0.038 0.360 Letter:OvsQ 500 036 6 0.038 0.43 0.06 0.70 0.357 0.053 0.053 0.06 0.042 0.049 0.454 Adult 809 0000 4 0.49 0.394 0.68 0.96 0.270 0.69 0.69 0.209 0.59 0.60 0.364 a<,2 Mushroo 4062 4062 22 0.000 0.200 0.046 0.065 0.30 0.06 0.07 0.030 0.002 0.004 0.50 a,3<2< with the non inforative prior P 0. For PBGD2, the bound is coputed on the second half of the training data with the prior P wp constructed fro the first half, and, as explain in Section 4., with replaced by /7. Note that the bounds values for the classifiers returned by PBGD2 are generally uch lower those for the classifiers produced by the other algoriths. This alost always aterializes in a saller testing error for the linear classifier produced by PBGD2. To our knowledge, these training set bounds for PBGD2 are the sallest ones obtained for any learning algorith producing linear classifiers. To deterine whether or not a difference of epirical risk easured on the testing set T is statistically significant, we have used the test set bound ethod of Langford 2005 based on the binoial tail inversion with a confidence level of 95%. It turns out that no algorith has succeeded in choosing a linear classifier h w which was statistically significantly better SSB than the one chosen by another algorith except for the few cases that are list in the colun SSB of Table. Overall, AdaBoost and PBGD2and3 are very copetitive to one another with no clear winner and are generally superior to PBGD. We therefore see an advantage in using half of the training data to learn a good prior over using a non-inforative prior and keeping all the data to learn the posterior. 4.3. PBGD with Respect to Dual Variables In this subsection, we copare the PBGD algoriths to the soft-argin SVM. Here, all four learning algoriths are producing a linear classifier on a feature space defined by the RBF kernel k satisfying kx, x = exp 2 x x 2 /γ 2 x, x X X. The results obtained for all algoriths are suarized in Table 2. All the data sets are the sae as those of the previous subsection. 
The notation used in this table is identical to the one used for Table 1. For the SVM and PBGD3, the kernel parameter $\gamma$ and the soft-margin parameter $C$ were chosen by 10-fold cross-validation on the training set $S$ among the set of values proposed by Ambroladze et al. (2006). PBGD1 and PBGD2 also tried the same set of values for $\gamma$ but used the bound of Corollary 2.1 to select a good value. Since we consider 5 different values of $\gamma$, the bound of PBGD1 is therefore computed with $\delta$ replaced by $\delta/5$. For PBGD2, the bound is, as stated before, computed on the second half of the training data, but with $\delta$ replaced by $\delta/(7\cdot 5)$. Again, the bound values for the classifiers returned by PBGD2 are generally much lower than those for the classifiers produced by the other algorithms. To our knowledge, the training-set bounds for PBGD2 are the smallest ones obtained for any learning algorithm producing linear classifiers.

The same method as in the previous subsection was used to determine whether or not a difference of empirical risk measured on the testing set $T$ is statistically significant. It turns out that no algorithm has chosen a linear classifier $h_{\mathbf w}$ that was statistically significantly better than the choices of the others, except for the few cases listed in the SSB column. Thus, the SVM and PBGD3 are very competitive with one another, with no clear winner, and are both slightly superior to PBGD2, and a bit more than slightly superior to PBGD1.

Table 2. Summary of results for linear classifiers with an RBF kernel. Same data sets and layout as Table 1, with the SVM (s) in place of AdaBoost. [Numerical entries omitted.]

5. Conclusion

We have shown that the standard PAC-Bayes risk bounds (McAllester, 2003; Seeger, 2002; Langford, 2005; Catoni, 2007) are specializations of Theorem 2.1 obtained by choosing a particular convex function $\mathcal D$ that binds the Gibbs true risk to its empirical estimate. Moreover, when spherical Gaussians over spaces of linear classifiers are used for priors and posteriors, we have shown that the Gibbs classifier $G_{Q_{\mathbf w}}$ that minimizes the PAC-Bayes bound of Corollary 2.1 (resp. 2.2) is obtained from the weight vector $\mathbf w$ that minimizes the non-convex objective function F2.1 (resp. F2.2). When the prior is non-informative, a simple convex relaxation of F2.2 gives the objective function minimized by the soft-margin SVM.

We have proposed two learning algorithms (PBGD1 and PBGD2) for finding linear classifiers that minimize the bound of Corollary 2.1, and another algorithm (PBGD3) that uses the cross-validation methodology to determine the value of the parameter $C$ in the objective function F2.2. PBGD1 uses a non-informative prior to construct the final classifier from all the training data. In contrast, PBGD2 uses a fraction of the training set to construct an informative prior that is used to learn the final linear classifier on the remaining fraction of the training data. Our extensive experiments indicate that PBGD2 is generally much more effective than PBGD1 at producing classifiers with small true risk. Moreover, the training-set risk bounds obtained for PBGD2 are, to our knowledge, the smallest obtained so far for any learning algorithm producing linear classifiers. In fact, PBGD2 is a learning algorithm that produces classifiers having a good guarantee without needing any test set for that purpose. This opens the way to a feasible learning strategy that uses all the available data for training. Our results also indicate that PBGD2 and PBGD3 are competitive with both AdaBoost and the soft-margin SVM at producing classifiers with small true risk. However, as a consequence of the non-convexity of the objective function F2.2, PBGD2 and PBGD3 are slower than AdaBoost and the SVM.

Acknowledgements

Work supported by NSERC Discovery grants 262067 and 022405.

References

Ambroladze, A., Parrado-Hernández, E., & Shawe-Taylor, J. (2006). Tighter PAC-Bayes bounds. Proceedings of the 2006 Conference on Neural Information Processing Systems (NIPS-06) (pp. 9-16).

Banerjee, A. (2006). On Bayesian bounds. ICML '06: Proceedings of the 23rd International Conference on Machine Learning (pp. 81-88).

Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Monograph series of the Institute of Mathematical Statistics, http://arxiv.org/abs/0712.0248.

Langford, J. (2005). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, 273-306.

Langford, J., & Shawe-Taylor, J. (2003). PAC-Bayes & margins. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (pp. 423-430). Cambridge, MA: MIT Press.

McAllester, D. (2003). PAC-Bayesian stochastic model selection. Machine Learning, 51, 5-21.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26, 1651-1686.

Seeger, M. (2002). PAC-Bayesian generalization bounds for Gaussian processes. Journal of Machine Learning Research, 3, 233-269.
However, as a consequence of the nonconvexity of the objective function F 2.2, PBGD2 and PBGD3 are slower than AdaBoost and the SVM. Acknowledgeents Work supported by NSRC Discovery grants 262067 and 022405. References Abroladze, A., Parrado-Hernández,., & Shawe- Taylor, J. 2006. Tighter PAC-Bayes bounds. Proceedings of the 2006 conference on Neural Inforation Processing Systes NIPS-06 pp. 9 6. Banerjee, A. 2006. On bayesian bounds. ICML 06: Proceedings of the 23rd international conference on Machine learning pp. 8 88. Catoni, O. 2007. PAC-Bayesian surpevised classification: the therodynaics of statistical learning. Monograph series of the Institute of Matheatical Statistics, http://arxiv.org/abs/072.0248. Langford, J. 2005. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6, 273 306. Langford, J., & Shawe-Taylor, J. 2003. PAC-Bayes & argins. In S. T. S. Becker and K. Oberayer ds., Advances in neural inforation processing systes 5, 423 430. Cabridge, MA: MIT Press. McAllester, D. 2003. PAC-Bayesian stochastic odel selection. Machine Learning, 5, 5 2. Schapire, R.., Freund, Y., Bartlett, P., & Lee, W. S. 998. Boosting the argin: A new explanation for the effectiveness of voting ethods. The Annals of Statistics, 26, 65 686. Seeger, M. 2002. PAC-Bayesian generalization bounds for gaussian processes. Journal of Machine Learning Research, 3, 233 269.