Bounds on the Generalization Performance of Kernel Machines Ensembles

Theodoros Evgeniou, Luis Perez-Breva, Massimiliano Pontil, Tomaso Poggio
Center for Biological and Computational Learning, MIT, 45 Carleton Street E25-201, Cambridge, MA

Abstract

We study the problem of learning using combinations of machines. In particular, we present new theoretical bounds on the generalization performance of voting ensembles of kernel machines. Special cases considered are bagging and support vector machines. We present experimental results supporting the theoretical bounds, and describe characteristics of kernel machine ensembles suggested by the experimental findings. We also show how such ensembles can be used for fast training with very large datasets.

1. Introduction

Two major recent advances in learning theory are support vector machines (SVM) (Vapnik, 1998) and ensemble methods such as boosting and bagging (Breiman, 1996; Schapire et al., 1998). Distribution-independent bounds on the generalization performance of these two techniques have been suggested recently (Shawe-Taylor & Cristianini, 1998; Bartlett, 1998; Schapire et al., 1998), and similarities between these bounds in terms of a geometric quantity known as the margin have been proposed. More recently, bounds on the generalization performance of SVM based on cross-validation have been derived (Vapnik, 1998; Chapelle & Vapnik, 1999). These bounds also depend on geometric quantities other than the margin (such as the radius of the smallest sphere containing the support vectors).

In this paper we study the generalization performance of ensembles of general kernel machines using cross-validation arguments. The kernel machines considered are learning machines of the form

    f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i),                                                         (1)

where (x_i, y_i), i = 1, ..., l, are the training points and K is a kernel function, for example a Gaussian - see (Vapnik, 1998; Wahba, 1990) for a number of kernels. The coefficients \alpha_i are learned by solving the following optimization problem:

    \max_{\alpha} \sum_{i=1}^{l} S(\alpha_i) - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K_{ij}
    subject to:  0 \leq \alpha_i \leq C                                                               (2)

where S(\cdot) is a cost function, C a constant, and we have defined K_{ij} = K(x_i, x_j). SVM are a particular case of these machines, obtained for S(\alpha_i) = \alpha_i. For SVM, points for which \alpha_i \neq 0 are called support vectors. Notice that the bias term (the threshold b in the general case of machines f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i) + b) is incorporated in the kernel K.

We study the cross-validation error of general voting ensembles of kernel machines (2). Particular cases are bagging of kernel machines, each trained on a different subsample of the initial training set, and voting of kernel machines each using a different kernel (or different subsets of features/components of the initial input features). We first present bounds on the generalization error of such ensembles, and then discuss experiments where the derived bounds are used for model selection. We also show experimental results demonstrating that a validation set can be used for model selection for kernel machines and their ensembles, without having to decrease the training set size in order to create a validation set. Finally, we show how such ensembles can be used for fast training with very large data sets.
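As an illustration of the machine form (1), the following minimal sketch (not part of the paper) shows that a trained SVM with a Gaussian kernel can be written as f(x) = \sum_i \alpha_i K(x, x_i) + b, with the \alpha_i nonzero only on the support vectors. It assumes scikit-learn's SVC and an arbitrary synthetic dataset; C and \sigma are placeholder values.

```python
# Minimal sketch: reconstruct f(x) from the dual solution of a Gaussian-kernel SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

sigma = 2.0
gamma = 1.0 / (2.0 * sigma ** 2)          # Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
svm = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X, y)

def gaussian_kernel(A, B):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# f(x) = sum_i alpha_i K(x, x_i) + b; scikit-learn stores y_i * alpha_i for the
# support vectors in dual_coef_ and b in intercept_.
K = gaussian_kernel(X, svm.support_vectors_)
f_manual = K @ svm.dual_coef_[0] + svm.intercept_[0]

# Should agree with the library's own evaluation of the machine.
assert np.allclose(f_manual, svm.decision_function(X), atol=1e-6)
print("max |f_manual - decision_function| =",
      np.abs(f_manual - svm.decision_function(X)).max())
```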

2. Generalization Performance of Kernel Machines Ensembles

The theoretical results of the paper are based on the cross-validation (or leave-one-out) error. The cross-validation procedure consists of removing one point at a time from the training set, training a machine on the remaining points, and then testing on the removed one. The number of errors counted throughout this process, L((x_1, y_1), ..., (x_l, y_l)), is called the cross-validation error. It is known that this quantity provides an estimate of the generalization performance of a machine (Wahba, 1990; Vapnik, 1998). In particular, the expectation of the generalization error of a machine trained using l points is bounded by the expectation of the cross-validation error of a machine trained on l+1 points (Luntz and Brailovsky theorem (Vapnik, 1998)).

We begin with some known results on the cross-validation error of kernel machines. The following theorem is from (Jaakkola & Haussler, 1998):

Theorem 2.1  The cross-validation error of a kernel machine (2) is upper bounded as:

    L((x_1, y_1), ..., (x_l, y_l)) \leq \sum_{i=1}^{l} \theta(\alpha_i K(x_i, x_i) - y_i f(x_i))        (3)

where \theta is the Heaviside function, and f is the optimal function found by solving the maximization problem (2).

In the particular case of SVM where the data are separable, (3) can be bounded by geometric quantities, namely (Vapnik, 1998):

    \sum_{i=1}^{l} \theta(\alpha_i K(x_i, x_i) - y_i f(x_i)) \leq \frac{D_{sv}^2}{\rho^2}               (4)

where D_{sv} is the radius of the smallest sphere in the feature space induced by kernel K (Wahba, 1990; Vapnik, 1998), centered at the origin, containing the support vectors, and \rho is the margin (\rho^2 = 1 / \|f\|_K^2) of the SVM. Using this result, the following theorem is a direct application of the Luntz and Brailovsky theorem (Vapnik, 1998):

Theorem 2.2  The average generalization error of an SVM (with zero threshold b, and in the separable case) trained on l points is upper bounded by

    \frac{1}{l+1} E\left( \frac{D_{sv}^2(l)}{\rho^2(l)} \right),

where the expectation E is taken with respect to the probability of a training set of size l.

Notice that this result shows that the performance of an SVM does not depend only on the margin, but also on other geometric quantities, namely the radius D_{sv}. In the non-separable case, it can be shown (the proof is similar to that of corollary 2.2 below) that equation (4) can be written as:

    \sum_{i=1}^{l} \theta(\alpha_i K(x_i, x_i) - y_i f(x_i)) \leq EE_1 + \frac{D_{sv}^2}{\rho^2}         (5)

where EE_1 is the hard margin empirical error of the SVM (the number of training points with y f(x) < 1).

We now extend these results to the case of ensembles of kernel machines. We consider the general case where each of the machines in the ensemble uses a different kernel. Let T be the number of machines, and let K_t be the kernel used by machine t. Notice that, as a special case, appropriate choices of K_t lead to machines that may use different subsets of the original features. Let f^{(t)}(x) be the optimal solution of machine t (real-valued), and \alpha_i^{(t)} the optimal weight that machine t assigns to point (x_i, y_i) (after solving problem (2)). We consider ensembles that are linear combinations of the individual machines. In particular, the separating surface of the ensemble is:

    F(x) = \sum_{t=1}^{T} c_t f^{(t)}(x)                                                                (6)

and the classification is done by taking the sign of this function. The coefficients c_t are not learned (i.e. c_t = 1/T), and \sum_{t=1}^{T} c_t = 1 (for scaling reasons), c_t > 0. All parameters (the C's and the kernels) are fixed before training. In the particular case of bagging, the subsampling of the training data should be deterministic. By this we mean that when the bounds are used for model (parameter) selection, the same subsample sets of the data need to be used for each model. These subsamples, however, are still random ones. We believe that the results presented below also hold (with minor modifications) in the general case where the subsampling is always random.

We now consider the cross-validation error of such ensembles.
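The following sketch (an illustration, not the paper's code) evaluates the right-hand side of the theorem 2.1 bound for a single SVM. It assumes scikit-learn's SVC, labels in {-1, +1}, and a Gaussian kernel (so K(x, x) = 1); the theorem is stated for machines without a bias term, and here the trained bias is simply kept inside f(x_i) as an approximation.

```python
# Minimal sketch of the theorem 2.1 leave-one-out bound:
#     LOO <= sum_i theta(alpha_i K(x_i, x_i) - y_i f(x_i))
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y01 = make_classification(n_samples=200, n_features=10, random_state=0)
y = 2 * y01 - 1                                  # map labels to {-1, +1}

gamma = 0.1
svm = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X, y)

f = svm.decision_function(X)                     # f(x_i) on every training point
alpha = np.zeros(len(X))
alpha[svm.support_] = np.abs(svm.dual_coef_[0])  # alpha_i (nonzero only for support vectors)
K_ii = np.ones(len(X))                           # Gaussian kernel: K(x, x) = 1

loo_bound = np.sum(alpha * K_ii - y * f >= 0)    # theta(0) taken as 1
print("Theorem 2.1 leave-one-out bound:", loo_bound, "errors out of", len(X))
```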
Theorem 2.3  The cross-validation error L((x_1, y_1), ..., (x_l, y_l)) of a kernel machine ensemble is upper bounded by:

    \sum_{i=1}^{l} \theta\left( \sum_{t=1}^{T} c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) \right)     (7)

The proof of this theorem is based on the following lemma, shown in Vapnik (1998) and in Jaakkola and Haussler (1998):

Lemma 2.1  Let \alpha_i be the coefficient of the solution f(x) of machine (2) corresponding to point (x_i, y_i), \alpha_i \neq 0. Let f^i(x) be the solution of machine (2) found when point (x_i, y_i) is removed from the training set. Then:

    y_i f^i(x_i) \geq y_i f(x_i) - \alpha_i K(x_i, x_i).

Using lemma 2.1 we can now prove theorem 2.3.

Proof of theorem 2.3:  Let F^i(x) = \sum_{t=1}^{T} c_t f^{(t)}_i(x) be the final machine trained with all the initial training data except (x_i, y_i). Lemma 2.1 gives that

    y_i F^i(x_i) = \sum_{t=1}^{T} c_t y_i f^{(t)}_i(x_i) \geq \sum_{t=1}^{T} c_t \left( y_i f^{(t)}(x_i) - \alpha_i^{(t)} K_t(x_i, x_i) \right) = y_i F(x_i) - \sum_{t=1}^{T} c_t \alpha_i^{(t)} K_t(x_i, x_i),

therefore

    \theta(-y_i F^i(x_i)) \leq \theta\left( \sum_{t=1}^{T} c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) \right),

so the leave-one-out error \sum_{i=1}^{l} \theta(-y_i F^i(x_i)) is not more than \sum_{i=1}^{l} \theta( \sum_{t=1}^{T} c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) ), which proves the theorem.  \Box

Notice that the bound has the same form as bound (3): for each point (x_i, y_i) we only need to take into account its corresponding parameters \alpha_i^{(t)} and remove their effects from the value of F(x_i).

The cross-validation error can also be bounded using geometric quantities. To this purpose we introduce one more parameter that we call the ensemble margin (in contrast to the margin of a single SVM). For each point (x_i, y_i) we define its ensemble margin to be simply y_i F(x_i). This is exactly the definition of margin in (Schapire et al., 1998). For any given \delta > 0 we define EE_\delta to be the number of training points with ensemble margin < \delta (the empirical error with margin \delta), and N_\delta the set of the remaining training points - the ones with ensemble margin \geq \delta. Finally, we denote by D_t(\delta) the radius of the smallest sphere in the feature space induced by kernel K_t, centered at the origin, containing the points of machine t with \alpha_i^{(t)} \neq 0 and ensemble margin larger than \delta (in the case of SVM, these are the support vectors of machine t with ensemble margin larger than \delta).

A simple consequence of theorem 2.3 and of the inequality K_t(x_i, x_i) \leq D_t^2(\delta) for points x_i with \alpha_i^{(t)} \neq 0 and ensemble margin y_i F(x_i) \geq \delta is the following:

Corollary 2.1  For any \delta > 0 the cross-validation error L((x_1, y_1), ..., (x_l, y_l)) of a kernel machine ensemble is upper bounded by:

    EE_\delta + \frac{1}{\delta} \sum_{t=1}^{T} c_t D_t^2(\delta) \left( \sum_{i \in N_\delta} \alpha_i^{(t)} \right)                        (8)

Proof:  For each training point (x_i, y_i) with ensemble margin y_i F(x_i) < \delta we upper bound \theta( \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) ) by 1 (this is a trivial bound). For the remaining points (the points in N_\delta) we show that:

    \theta\left( \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) \right) \leq \frac{1}{\delta} \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i).

If \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) < 0, then \theta(\cdot) = 0 \leq \frac{1}{\delta} \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i). On the other hand, if \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) \geq 0, then \theta(\cdot) = 1, while \frac{1}{\delta} \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) \geq \frac{1}{\delta} y_i F(x_i) \geq 1. So in both cases the inequality holds. Therefore:

    \sum_{i=1}^{l} \theta\left( \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) - y_i F(x_i) \right) \leq EE_\delta + \frac{1}{\delta} \sum_{i \in N_\delta} \sum_t c_t \alpha_i^{(t)} K_t(x_i, x_i) \leq EE_\delta + \frac{1}{\delta} \sum_{t=1}^{T} c_t D_t^2(\delta) \left( \sum_{i \in N_\delta} \alpha_i^{(t)} \right),

which proves the corollary.  \Box

Notice that equation (8) holds for any \delta > 0, so the best bound is obtained for the minimum of the right hand side with respect to \delta > 0. Using the Luntz and Brailovsky theorem, theorems 2.3 and 2.1 provide bounds on the generalization performance of general kernel machine ensembles like that of theorem 2.2.
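The theorem 2.3 bound can be computed directly from the trained machines of a bagged ensemble. The following sketch (an illustration under the same assumptions as the earlier ones: scikit-learn's SVC, labels in {-1, +1}, Gaussian kernel so K_t(x_i, x_i) = 1, the bias kept inside each f^{(t)} as an approximation) evaluates it for T SVMs trained on 40% random subsamples with fixed c_t = 1/T.

```python
# Minimal sketch of the ensemble leave-one-out bound of theorem 2.3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1
l, T, gamma = len(X), 30, 0.1
rng = np.random.default_rng(0)

F = np.zeros(l)            # ensemble output F(x_i) on the training points
penalty = np.zeros(l)      # sum_t c_t alpha_i^(t) K_t(x_i, x_i)
c_t = 1.0 / T

for t in range(T):
    idx = rng.choice(l, size=int(0.4 * l), replace=False)    # 40% subsample, as in section 3
    svm = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
    F += c_t * svm.decision_function(X)
    alpha = np.zeros(l)
    alpha[idx[svm.support_]] = np.abs(svm.dual_coef_[0])      # alpha_i^(t), mapped to full-set indices
    penalty += c_t * alpha                                    # K_t(x_i, x_i) = 1 for the Gaussian kernel

loo_bound = np.sum(penalty - y * F >= 0)                      # theta(0) taken as 1
print("Theorem 2.3 bound:", loo_bound, "out of", l, "training points")
```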

We now consider the particular case of SVM ensembles. In this case, choosing for example \delta = 1, (8) becomes:

Corollary 2.2  The leave-one-out error of an ensemble of SVMs is upper bounded by:

    L((x_1, y_1), ..., (x_l, y_l)) \leq EE_1 + \sum_{t=1}^{T} c_t \frac{D_t^2}{\rho_t^2}                 (9)

where EE_1 is the margin empirical error with ensemble margin 1, D_t is the radius of the smallest sphere centered at the origin, in the feature space induced by kernel K_t, containing the support vectors of machine t, and \rho_t is the margin of SVM t. This is because clearly D_t \geq D_t(\delta) for any \delta, and \sum_{i \in N_\delta} \alpha_i^{(t)} \leq \sum_{i=1}^{l} \alpha_i^{(t)} = \frac{1}{\rho_t^2} (see (Vapnik, 1998) for a proof of this equality).

A number of remarks can be made from equation (9). First, notice that the generalization performance of the SVM ensemble now depends on the average (a convex combination of) D^2/\rho^2 of the individual machines. In some cases this may be smaller than the D^2/\rho^2 of a single SVM. For example, suppose we train many SVMs on different subsamples of the training points and we want to compare such an ensemble with a single SVM using all the points. If all SVMs (the single one, as well as the individual ones of the ensemble) use most of their training points as support vectors, then clearly the D^2 of each SVM in the ensemble is smaller than that of the single SVM. Moreover, the margin of each SVM in the ensemble is expected to be larger than that of the single SVM using all the points. So the average D^2/\rho^2 in this case is expected to be smaller than that of the single SVM. Another case where an ensemble of SVMs may be better than a single SVM is the one where there are outliers among the training data: if the individual SVMs are trained on subsamples of the training data, some of the machines may have smaller D^2/\rho^2 because they do not use some of the outliers. In general it is not clear when ensembles of kernel machines are better than single machines. The bounds in this section may provide some insight into this question.

Notice also how the ensemble margin \delta plays a role in the generalization performance of kernel machine ensembles. This margin is also shown to be important for boosting (Schapire et al., 1998). Finally, notice that all the results discussed hold for the case where there is no bias (threshold b), or where the bias is included in the kernel (as discussed in the introduction). In the experiments discussed below we also use the results for cases where the bias is not regularized, which is common in practice. It may be possible to use recent theoretical results (Chapelle & Vapnik, 1999) on the leave-one-out bounds of SVM when the bias b is taken into account in order to study the generalization performance of kernel machine ensembles with the bias b.

3. Experiments

To test how tight the presented bounds are, we conducted a number of experiments using data sets from UCI,^1 as well as the US Postal Service (USPS) data set (LeCun et al., 1989). We show results for some of the sets in Figures 1-2. For each data set we split the overall set into training and testing sets (the sizes are shown in the figures) in 50 different (random) ways, and for each split:

1. We trained one SVM with b = 0 using all the training data, computed the leave-one-out bound given by theorem 2.1, and then computed the test performance using the test set.

2. We repeated (1), this time with b \neq 0.

3. We trained 30 SVMs with b = 0, each using a random subsample of size 40% of the training data (bagging), computed the leave-one-out bound given by theorem 2.3 using c_t = 1/30, and then computed the test performance using the test set.

4. We repeated (3), this time with b \neq 0.

We then averaged the test performances and the leave-one-out bounds over the 50 training-testing splits, and computed the standard deviations. All machines were trained using a Gaussian kernel, and we repeated the procedure for a number of different \sigma's of the Gaussian and for a fixed C (shown in the figures). We show the averages and standard deviations of the results in the figures.
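The sketch below (an illustration only, using a synthetic dataset in place of the UCI sets and a single train/test split instead of 50) mirrors the structure of this protocol: for each \sigma of the Gaussian kernel, it compares a single SVM trained on all the data with a voting ensemble of 30 SVMs, each trained on a 40% random subsample, with fixed c_t = 1/30.

```python
# Minimal sketch of the single-SVM vs. bagging comparison of section 3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
rng = np.random.default_rng(0)
T, C = 30, 1.0

for sigma in [0.5, 1.0, 2.0, 4.0, 8.0]:
    gamma = 1.0 / (2.0 * sigma ** 2)
    single = SVC(C=C, kernel="rbf", gamma=gamma).fit(Xtr, ytr)

    # Voting ensemble: average of the real-valued outputs, classified by its sign.
    F = np.zeros(len(Xte))
    for t in range(T):
        idx = rng.choice(len(Xtr), size=int(0.4 * len(Xtr)), replace=False)
        svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(Xtr[idx], ytr[idx])
        F += svm.decision_function(Xte) / T

    err_single = np.mean(single.predict(Xte) != yte)
    err_bag = np.mean((F > 0).astype(int) != yte)
    print(f"log(sigma)={np.log(sigma):+.2f}  single SVM error={err_single:.3f}  "
          f"bagging error={err_bag:.3f}")
```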
In all figures we use the following notation: top left: bagging with b = 0; top right: single SVM with b = 0; bottom left: bagging with b \neq 0; and bottom right: single SVM with b \neq 0. In all plots the solid line is the mean test performance and the dashed line is the error bound computed using the leave-one-out theorems (theorems 2.1 and 2.3). The dotted line is the validation set error discussed below. For simplicity, only one error bar (standard deviation over the 50 training-testing splits) is shown (the others were similar). The cost parameter C used is given in each of the figures. The horizontal axis is the natural logarithm of the \sigma of the Gaussian kernel used, while the vertical axis is the error.

An interesting observation is that the bounds are always tighter for bagging than they are for a single SVM. This is an interesting experimental finding for which we do not have a theoretical explanation. It may be because the generalization performance of a machine is related to the expected leave-one-out error of the machine (Vapnik, 1998), and by combining many machines, each using a different (random) subset of the training data, we better approximate the expected leave-one-out error than when we compute the leave-one-out error of a single machine only. This finding can practically justify the use of ensembles of machines for model selection: parameter selection using the leave-one-out bounds presented in this paper is easier for ensembles of machines than it is for single machines.

^1 mlearn/mlrepository.html

Figure 1. Breast cancer data: see text for description.

Another interesting observation is that the bounds seem to work similarly in the case where the bias b is not 0. In this case, as before, the bounds are tighter for ensembles of machines than they are for single machines.

In all our experiments we found that the bounds presented here (equations (3) and (7)) get looser when the parameter C used during training is large. A representative example is shown in figure 3. This result can be understood by looking at the leave-one-out bound for a single SVM (equation (3)). Let (x_i, y_i) be a support vector for which y_i f(x_i) < 1. It is known (Vapnik, 1998) that for these support vectors the coefficient \alpha_i is C. If C is such that C K(x_i, x_i) > 1 (for the Gaussian kernel this reduces to C > 1, as K(x, x) = 1), then clearly \theta(C K(x_i, x_i) - y_i f(x_i)) = 1. In this case the bound in equation (3) effectively counts all support vectors with margin less than one (plus some of the ones on the margin, y f(x) = 1). This means that for large C (in the case of Gaussian kernels, for example, for any C > 1), the bounds of this paper are effectively similar to (not larger than) another known leave-one-out bound for SVMs, namely one that uses the number of all support vectors to bound generalization performance (Vapnik, 1998). So our experimental results effectively show that the number of support vectors does not provide a good estimate of the generalization performance of SVMs and their ensembles.

Figure 2. USPS data: see text for description.

4. Validation Set for Model Selection

Instead of using bounds on the generalization performance of learning machines like the ones discussed above, an alternative approach for model selection is to use a validation set to choose the parameters of the machines. We consider first the simple case where we have N machines and we choose the best one based on the error they make on a fixed validation set of size V. This can be thought of as a special case where we consider as our hypothesis space the set of the N machines, and we then train by simply picking the machine with the smallest empirical error (in this case this is the validation error). It is known that if VE_i is the validation error of machine i and TE_i is its true test error, then for all N machines simultaneously the following bound holds with probability 1 - \eta (Devroye et al., 1996; Vapnik, 1998):

    TE_i \leq VE_i + \sqrt{ \frac{\log N - \log \frac{\eta}{4}}{V} }                                     (10)

So how accurately we pick the best machine using the validation set depends, as expected, on the number of machines N and on the size V of the validation set. The bound suggests that a validation set can be used to accurately estimate the generalization performance of a relatively small number of machines (i.e. a small number of parameter values examined), as is often done in practice.
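The following sketch (illustrative numbers only; the validation errors below are made up) shows how the best of N machines is picked by validation error and how large the uniform deviation term of bound (10) is for given N, V, and \eta.

```python
# Minimal sketch of model selection with a validation set and of the deviation
# term sqrt((log N - log(eta/4)) / V) in bound (10).
import numpy as np

val_errors = {0.5: 0.21, 1.0: 0.17, 2.0: 0.15, 4.0: 0.16, 8.0: 0.19}  # sigma -> validation error
N = len(val_errors)          # number of machines (parameter values) compared
V = 300                      # validation set size
eta = 0.05                   # bound holds with probability 1 - eta

best_sigma = min(val_errors, key=val_errors.get)
deviation = np.sqrt((np.log(N) - np.log(eta / 4.0)) / V)

print(f"selected sigma = {best_sigma}, validation error = {val_errors[best_sigma]:.3f}")
print(f"with probability {1 - eta:.2f}, each true test error is within "
      f"{deviation:.3f} of its validation error")
```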

Figure 3. USPS data: using a large C (C = 50). In this case the bounds do not work; see text for an explanation.

We used this observation for parameter selection for SVMs and for their ensembles. Experimentally, we followed a slightly different procedure from what is suggested by bound (10): for each machine (that is, for each \sigma of the Gaussian kernel in our case, both for a single SVM and for an ensemble of machines) we split the training set (for each training-testing split of the overall data set, as described above) into a smaller training set and a validation set (70% and 30% respectively). We trained each machine using the new, smaller training set, and measured the performance of the machine on the validation set. Unlike what bound (10) suggests, instead of comparing the validation performance with the generalization performance of the machines trained on the smaller training set (which is the case for which bound (10) holds), we compared the validation performance with the test performance of the machine trained using all of the initial (larger) training set. This way we did not have to use fewer points for training the machines, which is a typical drawback of using a validation set, and we could compare the validation performance with the leave-one-out bounds and the test performance of the exact same machines we used in the previous section.

We show the results of these experiments in figures 1-2: see the dotted lines in the plots. We observe that although the validation error is that of a machine trained on a smaller training set, it still provides a very good estimate of the test performance of the machines trained on the whole training set. In all cases, including the case of C > 1 for which the leave-one-out bounds discussed above did not work well, the validation set error provided a very good estimate of the test performance of the machines.

Figure 4. When the coefficients of the second layer are learned using a linear SVM, the system is less sensitive to changes of the \sigma of the Gaussian kernel used by the individual machines of the ensemble. The solid line is one SVM, the dotted line is an ensemble of 30 SVMs with fixed c_t = 1/30, and the dashed line is an ensemble of 30 SVMs with the coefficients c_t learned. The horizontal axis shows the natural logarithm of the \sigma of the Gaussian kernel. The data used is the heart data set. The threshold b is non-zero for this experiment.

5. Other Ensembles

The ensemble kernel machines (6) considered so far are voting combinations where the coefficients c_t in (6) of the linear combination of the machines are fixed. We now consider the case where these coefficients are also learned from the training subsets. In particular, we consider the following architecture:

a. A number T of kernel machines is trained as before (for example using different training data, or different parameters).

b. The T outputs (real-valued in our experiments, but they could also be thresholded - binary) of the machines at each of the training points are computed.

c. A linear machine (i.e. a linear SVM) is trained using as inputs the outputs of the T machines on the training data, and as labels the original training labels. The solution is used as the coefficients c_t of the linear combination of the T machines.

Notice that for this type of machine the leave-one-out bound of theorem 2.3 does not hold, since the theorem assumes fixed coefficients c_t. A validation set can still be used for model selection for these machines. On the other hand, an important characteristic of this type of ensemble is that, independently of what kernels/parameters each of the individual machines of the ensemble uses, the second-layer machine (which finds the coefficients c_t) always uses a linear kernel.
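A minimal sketch of steps (a)-(c) above is given below; it is an illustration only (scikit-learn's SVC and LinearSVC, an arbitrary synthetic dataset and placeholder parameters), not the paper's implementation.

```python
# Minimal sketch of the two-layer (adaptive combination) architecture of section 5.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
rng = np.random.default_rng(0)
T, gamma = 30, 0.1

# (a) Train T kernel machines on different subsamples of the training data.
machines = []
for t in range(T):
    idx = rng.choice(len(Xtr), size=int(0.4 * len(Xtr)), replace=False)
    machines.append(SVC(C=1.0, kernel="rbf", gamma=gamma).fit(Xtr[idx], ytr[idx]))

# (b) Compute the T real-valued outputs at each training (and test) point.
Ztr = np.column_stack([m.decision_function(Xtr) for m in machines])
Zte = np.column_stack([m.decision_function(Xte) for m in machines])

# (c) A linear machine trained on these outputs gives the combination coefficients c_t.
combiner = LinearSVC(C=1.0, max_iter=10000).fit(Ztr, ytr)
c = combiner.coef_[0]

err = np.mean(combiner.predict(Zte) != yte)
print("learned coefficients c_t (first 5):", np.round(c[:5], 3))
print("test error of the adaptive combination:", round(err, 3))
```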

This may imply that the overall architecture is not very sensitive to the kernel/parameters of the machines of the ensemble. We tested this hypothesis experimentally by comparing how the test performance of this type of machine changes with the \sigma of the Gaussian kernel used by the individual machines of the ensemble, and compared the behavior with that of single machines and of ensembles of machines with fixed c_t. In figure 5 we show two examples. In our experiments, for all data sets except one, learning the coefficients c_t of the combination of the machines using a linear machine (we used a linear SVM) made the overall machine less sensitive to changes of the parameters of the individual machines (the \sigma of the Gaussian kernel). This can be a practically useful characteristic of the architecture outlined in this section: for example, the kernel parameters of the machines of the ensemble need not be tuned accurately.

6. Ensembles versus Single Machines

So far we concentrated on the theoretical and experimental characteristics of ensembles of kernel machines. We now discuss how ensembles compare with single machines. Table 1 shows the test performance of one SVM compared with that of an ensemble of 30 SVMs combined with c_t = 1/30 and an ensemble of 30 SVMs combined using a linear SVM, for some UCI data sets (characteristic results). For the tables of this section we use, for convenience, the following notation: VCC stands for Voting Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are fixed; ACC stands for Adaptive Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are learned (adapted). We only consider SVMs and ensembles of SVMs with the threshold b. The table shows mean test errors and standard deviations for the best parameters of the machines (\sigma's of the Gaussians and parameter C, decided using the validation set performance in this case), hence different from figures 1-2, which were for a given C.

Table 1. Average errors and standard deviations (percentages) of the best machines (best \sigma of the Gaussian kernel and best C), chosen according to the validation set performances. The performances of the machines are about the same. VCC and ACC use 30 SVM classifiers.

    Data Set    SVM         VCC        ACC
    Breast      25.5 ±        ±          ± 4
    thyroid      5.1 ±        ±          ± 2.7
    diabetes    23   ±        ±          ± 1.8
    heart       15.4 ±        ±          ± 3.2

As the results show, the best SVM and the best ensembles we found have about the same test performance. Therefore, with appropriate tuning of the parameters of the machines, combining SVMs does not lead to a performance improvement compared to a single SVM. Although the best SVM and the best ensemble (that is, after accurate parameter tuning) perform similarly, an important difference of the ensembles compared to a single machine is that, in the case of bagging, the training of the ensemble consists of a large number of (parallelizable) small-training-set kernel machines. This implies that one can obtain performance similar to that of a single machine by training many faster machines using smaller training sets. This can be an important practical advantage of ensembles of machines, especially in the case of large data sets. Table 2 compares the test performance of a single SVM with that of an ensemble of SVMs each trained with as little as 1% of the initial training set (for one data set).

Table 2. Comparison between the error rates of a single SVM and the error rates of VCC and ACC of 100 SVMs for different percentages of subsampled data. The last data set is from (Osuna et al., 1997).

    Data Set    VCC 5%      VCC 1%     SVM
    Diabetes                  ± 1.6
    Thyroid                   ± 2.5
    Faces
For fixed c_t the performance decreases only slightly in all cases (thyroid, which we show, was the only data set in our experiments for which the change was significant in the case of VCC), while for the architecture of section 5 the performance does not decrease even with 1% training data: this is because the linear machine used to learn the coefficients c_t uses all the training data. Even in this last case the overall machine can still be faster than a single machine, since the second-layer learning machine is a linear one, and fast training methods exist for the particular case of linear machines (Platt, 1998). Finally, it may be the case that ensembles of machines perform better for some problems in the presence of outliers (as discussed in section 2), or, if the ensemble consists of machines that use different kernels and/or different input features, in the presence of irrelevant features. The leave-one-out bounds presented in this paper may be used for identifying these cases and for better understanding how bagging and general ensemble methods work (Breiman, 1996; Schapire et al., 1998).
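The fast-training idea described above can be sketched as follows (an illustration only: joblib is assumed for parallelism, and the dataset, subsample fraction, and parameters are arbitrary placeholders, not the paper's setup).

```python
# Minimal sketch: train many SVMs on small subsamples in parallel and vote them
# with fixed c_t = 1/T, instead of training one SVM on the full training set.
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
T, frac, gamma = 100, 0.05, 0.05
rng = np.random.default_rng(0)
subsamples = [rng.choice(len(Xtr), size=int(frac * len(Xtr)), replace=False) for _ in range(T)]

def train_one(idx):
    # Each machine sees only a small subsample, so each individual fit is fast.
    return SVC(C=1.0, kernel="rbf", gamma=gamma).fit(Xtr[idx], ytr[idx])

machines = Parallel(n_jobs=-1)(delayed(train_one)(idx) for idx in subsamples)

# Voting combination with fixed coefficients c_t = 1/T.
F = np.mean([m.decision_function(Xte) for m in machines], axis=0)
err_vcc = np.mean((F > 0).astype(int) != yte)
print(f"VCC of {T} SVMs on {int(100 * frac)}% subsamples: test error = {err_vcc:.3f}")
```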

7. Conclusions

We presented theoretical bounds on the generalization error of ensembles of kernel machines. Our results apply to the quite general case where each of the machines in the ensemble is trained on different subsets of the training data and/or uses different kernels or input features. Experimental results supporting our theoretical findings have been presented. A number of observations have been made from the experiments. We summarize some of them below:

1. The leave-one-out bounds for ensembles of machines have a form similar to that for single machines. In the particular case of SVMs, the bounds are based on an average geometric quantity of the individual SVMs of the ensemble (average margin and average radius of the sphere containing the support vectors).

2. The leave-one-out bounds presented are experimentally found to be tighter than the equivalent ones for single machines.

3. For SVMs, the leave-one-out bounds based on the number of support vectors are experimentally found not to be tight.

4. Experimentally we found that a validation set can be used for accurate model selection without having to decrease the size of the training set in order to create a validation set.

5. With accurate parameter tuning (model selection), single SVMs and ensembles of SVMs perform similarly.

6. Ensembles of machines for which the coefficients of the combination are also learned from the data are less sensitive to changes of parameters (i.e. the kernel) than single machines are.

7. Fast (parallel) training without significant loss of performance relative to single whole-large-training-set machines can be achieved using ensembles of machines.

A number of questions and research directions are open. An important theoretical question is how the bounds and experiments presented in this paper are related to Breiman's (1996) bias-variance analysis of bagging. For example, the fact that single SVMs perform about the same as ensembles of SVMs may be because SVMs are stable machines, i.e., they have small variance (Breiman, 1996). Experiments on the bias-variance decomposition of SVMs and kernel machines may lead to further understanding of these machines. The theoretical results we presented here approach the problem of learning with ensembles of machines from a different perspective - that of studying the leave-one-out geometric bounds of these machines. On the practical side, further experiments using very large data sets are needed to support our experimental finding that ensembles of machines can be used for fast training without significant loss in performance. Finally, other theoretical questions are how to extend the bounds of section 2 to the type of machines discussed in section 5, and how to use more recent leave-one-out bounds for SVM (Chapelle & Vapnik, 1999) to better characterize the performance of ensembles of machines.

References

Bartlett, P. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44.

Breiman, L. (1996). Bagging predictors. Machine Learning, 26.

Chapelle, O., & Vapnik, V. (1999). Model selection for support vector machines. Proceedings of the Twelfth Conference on Neural Information Processing Systems.

Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York, NY: Springer.

Jaakkola, T., & Haussler, D. (1998). Exploiting generative models in discriminative classifiers. Proceedings of the Eleventh Conference on Neural Information Processing Systems.

LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. J. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1.
Osuna, E., Freund, R., & Girosi, F. (1997). Support vector machines: Training and applications. A.I. Memo 1602, AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA.

Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In C. Burges & B. Scholkopf (Eds.), Advances in kernel methods - support vector learning. Cambridge, MA: MIT Press.

Schapire, R., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26.

Shawe-Taylor, J., & Cristianini, N. (1998). Robust bounds on generalization from the margin distribution (Technical Report NC2-TR). NeuroCOLT2, Royal Holloway, University of London, UK.

Vapnik, V. (1998). Statistical learning theory. New York, NY: Wiley.

Wahba, G. (1990). Splines models for observational data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia.


princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Lecture 3 Stat102, Spring 2007

Lecture 3 Stat102, Spring 2007 Lecture 3 Stat0, Sprng 007 Chapter 3. 3.: Introducton to regresson analyss Lnear regresson as a descrptve technque The least-squares equatons Chapter 3.3 Samplng dstrbuton of b 0, b. Contnued n net lecture

More information

1 Definition of Rademacher Complexity

1 Definition of Rademacher Complexity COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #9 Scrbe: Josh Chen March 5, 2013 We ve spent the past few classes provng bounds on the generalzaton error of PAClearnng algorths for the

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach

A Bayes Algorithm for the Multitask Pattern Recognition Problem Direct Approach A Bayes Algorthm for the Multtask Pattern Recognton Problem Drect Approach Edward Puchala Wroclaw Unversty of Technology, Char of Systems and Computer etworks, Wybrzeze Wyspanskego 7, 50-370 Wroclaw, Poland

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Learning with Tensor Representation

Learning with Tensor Representation Report No. UIUCDCS-R-2006-276 UILU-ENG-2006-748 Learnng wth Tensor Representaton by Deng Ca, Xaofe He, and Jawe Han Aprl 2006 Learnng wth Tensor Representaton Deng Ca Xaofe He Jawe Han Department of Computer

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Hidden Markov Models

Hidden Markov Models Hdden Markov Models Namrata Vaswan, Iowa State Unversty Aprl 24, 204 Hdden Markov Model Defntons and Examples Defntons:. A hdden Markov model (HMM) refers to a set of hdden states X 0, X,..., X t,...,

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information