Leave One Out Error, Stability, and Generalization of Voting Combinations of Classifiers


Machine Learning, 55, 71-97, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Leave One Out Error, Stability, and Generalization of Voting Combinations of Classifiers

THEODOROS EVGENIOU  theodoros.evgeniou@insead.edu
Technology Management, INSEAD, Boulevard de Constance, Fontainebleau, France

MASSIMILIANO PONTIL  pontil@dii.unisi.it
DII, University of Siena, Via Roma 56, Siena, Italy

ANDRÉ ELISSEEFF  andre.elisseeff@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, Tübingen, Germany

Editor: Robert E. Schapire

Abstract. We study the leave-one-out and generalization errors of voting combinations of learning machines. A special case considered is a variant of bagging. We analyze in detail combinations of kernel machines, such as support vector machines, and present theoretical estimates of their leave-one-out error. We also derive novel bounds on the stability of combinations of any classifiers. These bounds can be used to formally show that, for example, bagging increases the stability of unstable learning machines. We report experiments supporting the theoretical findings.

Keywords: cross-validation, bagging, combinations of machines, stability

1. Introduction

Studying the generalization performance of ensembles of learning machines has been the topic of ongoing research in recent years (Breiman, 1996; Schapire et al., 1998; Friedman, Hastie, & Tibshirani, 1998). There is a lot of experimental work showing that combining learning machines, for example using boosting or bagging methods (Breiman, 1996; Schapire et al., 1998), very often leads to improved generalization performance. A number of theoretical explanations have also been proposed (Schapire et al., 1998; Breiman, 1996), but more work on this aspect is still needed.

Two important theoretical tools for studying the generalization performance of learning machines are the leave-one-out (or cross validation) error of the machines, and the stability of the machines (Bousquet & Elisseeff, 2002; Boucheron, Lugosi, & Massart, 2000). The second, although an older tool (Devroye & Wagner, 1979; Devroye, Györfi, & Lugosi, 1996), has become important only recently with the work of Kearns and Ron (1999) and Bousquet and Elisseeff (2002). Stability has also been discussed extensively in the work of Breiman (1996). The theory in Breiman (1996) is that bagging increases performance because it reduces the variance of the base learning machines, although it does not always increase the bias (Breiman, 1996).

The definition of the variance in Breiman (1996) is similar in spirit to that of stability we use in this paper. The key difference is that in Breiman (1996) the variance of a learning machine is defined in an asymptotic way and is not used to derive any non-asymptotic bounds on the generalization error of bagging machines, while here we define stability for finite samples like it is done in Bousquet and Elisseeff (2002) and we also derive such non-asymptotic bounds. The intuition given by Breiman (1996) gives interesting insights: the effect of bagging depends on the stability of the base classifier. Stability means here changes in the output of the classifier when the training set is perturbed. If the base classifiers are stable, then bagging is not expected to decrease the generalization error. On the other hand, if the base classifier is unstable, such as often occurs with decision trees, the generalization performance is supposed to increase with bagging. Despite experimental evidence, the insights in Breiman (1996) had not been supported by a general theory linking stability to the generalization error of bagging, which is what Section 5 below is about.

In this paper we study the generalization performance of ensembles of kernel machines using both leave-one-out and stability arguments. We consider the general case where each of the machines in the ensemble uses a different kernel and different subsets of the training set. The ensemble is a convex combination of the individual machines. A particular case of this scheme is that of bagging kernel machines. Unlike standard bagging (Breiman, 1996), this paper considers combinations of the real outputs of the classifiers, and each machine is trained on a different and small subset of the initial training set chosen by randomly subsampling from the initial training set. Each machine in the ensemble uses in general a different kernel. As a special case, appropriate choices of these kernels lead to machines that may use different subsets of the initial input features, or different input representations in general.

We derive theoretical bounds for the generalization error of the ensembles based on a leave-one-out error estimate. We also present results on the stability of combinations of classifiers, which we apply to the case of bagging kernel machines. They can also be applied to bagging learning machines other than kernel machines, showing formally that bagging can increase the stability of the learning machines when these are not stable, and decrease it otherwise. An implication of this result is that it can be easier to control the generalization error of bagging machines. For example the leave-one-out error is a better estimate of their test error, something that we experimentally observe.

The paper is organized as follows. Section 2 gives the basic notation and background. In Section 3 we present bounds for a leave-one-out error of kernel machine ensembles. These bounds are used for model selection experiments in Section 4. In Section 5 we discuss the algorithmic stability of ensembles, and present a formal analysis of how bagging influences the stability of learning machines. The results can also provide a justification of the experimental findings of Section 4. Section 6 discusses other ways of combining learning machines.

2. Background and notations

In this section we recall the main features of kernel machines. For a more detailed account see Vapnik (1998), Schölkopf, Burges, and Smola (1998), and Evgeniou, Pontil, and Poggio (2000). For an account consistent with our notation see Evgeniou, Pontil, and Poggio (2000).

Kernel machine classifiers are the minimizers of functionals of the form:

H[f] = (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + λ ‖f‖²_K,   (1)

where we use the following notation:

- Let X ⊆ R^n be the input set; the pairs (x_i, y_i) ∈ X × {−1, 1}, i = 1, ..., l, are sampled independently and identically according to an unknown probability distribution P(x, y). The set D_l = {(x_1, y_1), ..., (x_l, y_l)} is the training set.
- f is a function R^n → R belonging to a Reproducing Kernel Hilbert Space (RKHS) H defined by kernel K, and ‖f‖²_K is the norm of f in this space. See Vapnik (1998) and Wahba (1990) for a number of kernels. The classification is done by taking the sign of this function.
- V(y, f(x)) is the loss function. The choice of this function determines different learning techniques, each leading to a different learning algorithm (for computing the coefficients α_i see below).
- λ is called the regularization parameter and is a positive constant.

Machines of this form have been motivated in the framework of statistical learning theory. Under rather general conditions (Evgeniou, Pontil, & Poggio, 2000) the solution of Eq. (1) is of the form

f(x) = Σ_{i=1}^l α_i y_i K(x_i, x).   (2)

The coefficients α_i in Eq. (2) are learned by solving the following optimization problem:

max_α H(α) = Σ_{i=1}^l S(α_i) − (1/2) Σ_{i,j=1}^l α_i α_j y_i y_j K(x_i, x_j)
subject to: 0 ≤ α_i ≤ C, i = 1, ..., l,   (3)

where S(·) is a continuous and concave function (strictly concave if the matrix K(x_i, x_j) is not strictly positive definite) and C = 1/(2lλ) is a constant. Thus, H(α) is strictly concave and the above optimization problem has a unique solution.

Support Vector Machines (SVMs) are a particular case of these machines for S(α) = α. This corresponds to a loss function V in (1) of the form θ(1 − yf(x))(1 − yf(x)), where θ is the Heaviside function: θ(x) = 1 if x > 0, and zero otherwise. The points for which α_i > 0 are called support vectors. Notice that the bias term (the threshold b in the general case of machines f(x) = Σ_{i=1}^l α_i y_i K(x_i, x) + b) is incorporated in the kernel K, and it is therefore also regularized. Notice also that the function S(·) in (3) can take general forms leading to machines other than SVMs, but in the general case the optimization of (3) may be computationally inefficient.
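As a concrete illustration of the solution form (2) in the SVM special case, the following sketch trains an SVM and evaluates f(x) = Σ_i α_i y_i K(x_i, x) explicitly from the learned coefficients. This is a minimal example assuming scikit-learn and a synthetic dataset (neither is used in the paper); note that, unlike the formulation above, scikit-learn keeps an unregularized bias b, which is added separately.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative data and parameters (assumptions, not from the paper).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}

sigma = 1.0
gamma = 1.0 / (2 * sigma ** 2)                  # Gaussian kernel K(x, x') = exp(-gamma ||x - x'||^2)
svm = SVC(C=0.5, kernel="rbf", gamma=gamma).fit(X, y)

def gaussian_kernel(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# scikit-learn stores alpha_i * y_i for the support vectors; all other alpha_i are zero,
# so Eq. (2) can be evaluated explicitly (plus the unregularized bias kept by scikit-learn).
alpha_y = svm.dual_coef_.ravel()
f_manual = gaussian_kernel(X, svm.support_vectors_) @ alpha_y + svm.intercept_[0]
print(np.allclose(f_manual, svm.decision_function(X)))   # True: the two computations agree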

2.1. Kernel machine ensembles

Given a learning algorithm such as an SVM or an ensemble of SVMs, we define f_{D_l} to be the solution of the algorithm when the training set D_l = {(x_i, y_i), i = 1, ..., l} is used. We denote by D_l^i the training set obtained by removing the point (x_i, y_i) from D_l, that is the set D_l \ {(x_i, y_i)}. When it is clear in the text we will denote f_{D_l} by f and f_{D_l^i} by f^i.

We consider the general case where each of the machines in the ensemble uses a different kernel and different subsets D_{r,t} of the training set D_l, where r refers to the size of the subset and t = 1, ..., T to the machine that uses it to learn. Let f_{D_{r,t}}(x) be the optimal solution of machine t using a kernel K^(t). We denote by α_i^(t) the optimal weight that machine t assigns to point (x_i, y_i) (after solving optimization problem (3)).

We consider ensembles that are convex combinations of the individual machines. The decision function of the ensemble is given by

F_{r,T}(x) = Σ_{t=1}^T c_t f_{D_{r,t}}(x)   (4)

with c_t ≥ 0 and Σ_{t=1}^T c_t = 1 (for scaling reasons). The coefficients c_t are not learned and all parameters (the C's and kernels) are fixed before training. The classification is done by taking the sign of F_{r,T}(x). Below, for simplicity, we will denote with capital F the combination F_{r,T}. In Section 5 we will consider only the case that c_t = 1/T, for simplicity.

In the following, the sets D_{r,t} will be identically sampled according to the uniform distribution and without replacement from the training set D_l. We will denote by E_{D_r ⊂ D_l} the expectation with respect to the subsampling from D_l according to the uniform distribution (without replacement), and sometimes we write f_{D_{r,t} ⊂ D_l} rather than f_{D_{r,t}} to make clear which training set has been used during learning. The letter r will always refer to the number of elements in D_{r,t}.

2.2. Leave-one-out error

If θ is, as before, the Heaviside function, then the leave-one-out error of f on D_l is defined by

Loo_{D_l}(f) = (1/l) Σ_{i=1}^l θ(−y_i f^i(x_i))   (5)

Notice that for simplicity there is a small abuse of notation here, since the leave-one-out error typically refers to a learning method while here we use the solution f in the notation. The leave-one-out error provides an estimate of the average generalization performance of a machine. It is known that the expectation of the generalization error of a machine trained using l points is equal to the expectation of the leave-one-out error of a machine trained on l + 1 points. This is summarized by the following theorem, originally due to Luntz and Brailovsky (see Vapnik, 1998).

Table 1. Notation.

f : Real valued prediction rule of one learning machine, f : X → R
V(f, y) : Loss function
P(x, y) : Probability distribution underlying the data
D_l : Set of i.i.d. examples sampled from P(x, y), D_l = {(x_i, y_i) ∈ X × {−1, 1}}_{i=1}^l
D_l^i : The set D_l \ {(x_i, y_i)}
f_{D_l} : Learning machine (e.g. SVM) trained on D_l; also denoted by f
Loo_{D_l}(f) : Leave-one-out error of f on the data set D_l
π_δ(x) : Soft margin loss, π_δ(x) = 0 if x < −δ, 1 if x > 0, and x/δ + 1 if −δ ≤ x ≤ 0
Loo_{δ,D_l}(f) : Leave-one-out error with soft margin π_δ
β_l : Uniform stability of f
D_{r,t} or D_{r,t} ⊂ D_l : Set of r points sampled uniformly from D_l, used by machine t, t = 1, ..., T
D_r ⊂ D_l : Set of r points sampled uniformly from D_l
(D_{r,t} ⊂ D_l)^i : Original D_{r,t} with point (x_i, y_i) removed
F_{r,T}, or just F : Ensemble of T machines, F_{r,T} = Σ_{t=1}^T c_t f_{D_{r,t}}
F̂ : Expected combination of machines, E_{D_r ⊂ D_l}[f_{D_r}]
DLoo_{D_l}(F) : Deterministic leave-one-out error
DLoo_{δ,D_l}(F) : Deterministic leave-one-out error with soft margin π_δ

Theorem 2.1. Suppose f_{D_l} is the outcome of a deterministic learning algorithm. Then

E_{D_l}[E_{(x,y)}[θ(−y f_{D_l}(x))]] = E_{D_{l+1}}[Loo_{D_{l+1}}(f_{D_{l+1}})].

As observed (Kearns & Ron, 1999), this theorem can be extended to general learning algorithms by adding a randomizing preprocessing step. The way the leave-one-out error is computed can however be different depending on the randomness. Consider the previous ensemble of kernel machines (4). The data sets D_{r,t}, t = 1, ..., T are drawn randomly from the training set D_l. We can then compute a leave-one-out error estimate, for example, in either of the following ways:

1. For i = 1, ..., l, remove (x_i, y_i) from D_l and sample new data sets D_{r,t}, t = 1, ..., T from D_l^i. Compute the f_{D_{r,t} ⊂ D_l^i} and then average the errors of the resulting ensemble machine computed on (x_i, y_i). This leads to the classical definition of the leave-one-out error and can be computed as:

Loo_{D_l}(F) = (1/l) Σ_{i=1}^l θ(−y_i (1/T) Σ_{t=1}^T f_{D_{r,t} ⊂ D_l^i}(x_i))   (6)

2. For i = 1, ..., l, remove (x_i, y_i) from each D_{r,t} ⊂ D_l. Compute the f_{(D_{r,t} ⊂ D_l)^i} and average the errors of the resulting ensemble machine computed on (x_i, y_i). Note that we have used the notation (D_{r,t} ⊂ D_l)^i to denote the set D_{r,t} ⊂ D_l where (x_i, y_i) has been removed.

This leads to what we will call a deterministic version of the leave-one-out error, in short det-leave-one-out error, or DLoo:

DLoo_{D_l}(F) = (1/l) Σ_{i=1}^l θ(−y_i (1/T) Σ_{t=1}^T f_{(D_{r,t} ⊂ D_l)^i}(x_i))   (7)

Note that the first computation requires re-sampling new data sets for each leave-one-out round, while the second computation uses the same subsampled data sets for each leave-one-out round, removing at most one point from each of them. In a sense, the det-leave-one-out error is then more deterministic than the classical computation (6). In this paper, we will consider mainly the det-leave-one-out error, for which we will derive easy-to-compute bounds and from which we will bound the generalization error of ensemble machines. Finally notice that the size of the subsampling is implicit in the notation DLoo_{D_l}(F): r is fixed in this paper so there is no need to complicate the notation further.

3. Leave-one-out error estimates of kernel machine ensembles

We begin with some known results about the leave-one-out error of kernel machines. The following theorem is from Jaakkola and Haussler (1998):

Theorem 3.1. The leave-one-out error of a kernel machine (3) is upper bounded as:

Loo_{D_l}(f) ≤ (1/l) Σ_{i=1}^l θ(α_i K(x_i, x_i) − y_i f_{D_l}(x_i))   (8)

where f_{D_l} is the optimal function found by solving problem (3) on the whole training set.

In the particular case of SVMs where the data are separable, the r.h.s. of Eq. (8) can be bounded by geometric quantities, namely (Vapnik, 1998):

Loo_{D_l}(f) ≤ (1/l) Σ_{i=1}^l θ(α_i K(x_i, x_i) − y_i f_{D_l}(x_i)) ≤ (1/l) (d²_sv / ρ²)   (9)

where d_sv is the radius of the smallest sphere in the feature space induced by kernel K (Wahba, 1990; Vapnik, 1998), centered at the origin, containing the support vectors, that is d_sv = max_{i: α_i > 0} √(K(x_i, x_i)), and ρ is the margin (ρ² = 1/‖f‖²_K) of the SVM.

Using this result, the next theorem is a direct application of Theorem 2.1:

Theorem 3.2. Suppose that the data are separable by the SVM. Then, the average generalization error of an SVM trained on l points is upper bounded by

(1/(l + 1)) E_{D_l}[ d²_sv(l) / ρ²(l) ],

where the expectation E is taken with respect to the probability of a training set D_l of size l.

Notice that this result shows that the performance of the SVM does not depend only on the margin, but also on other geometric quantities, namely the radius d_sv.

We now extend these results to the case of ensembles of kernel machines. In the particular case of bagging, the subsampling of the training data should be deterministic. By this we mean that when the bounds on the leave-one-out error are used for model (parameter) selection, for each model the same subsample sets of the data need to be used. These subsamples, however, are still random ones. We believe that the results presented below also hold (with minor modifications) in the general case that the subsampling is always random. We now consider the det-leave-one-out error of such ensembles.

Theorem 3.3. The det-leave-one-out error of a kernel machine ensemble is upper bounded by:

DLoo_{D_l}(F) ≤ (1/l) Σ_{i=1}^l θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ).   (10)

The proof of this Theorem is based on the following lemma shown in Vapnik (1998) and Jaakkola and Haussler (1998):

Lemma 3.1. Let α_i be the coefficient of the solution f(x) of machine (3) corresponding to point (x_i, y_i), α_i > 0. Let f^i(x) be the solution of machine (3) found when the data point (x_i, y_i) is removed from the training set. Then

y_i f^i(x_i) ≥ y_i f(x_i) − α_i K(x_i, x_i).

Using Lemma 3.1 we can now prove Theorem 3.3.

Proof of Theorem 3.3: Let F^i(x) = Σ_{t=1}^T c_t f^{(t)i}(x) be the ensemble machine trained with all initial training data except (x_i, y_i) (the subsets D_{r,t} are the original ones, only (x_i, y_i) is removed from them). Lemma 3.1 gives that

y_i F^i(x_i) = y_i Σ_{t=1}^T c_t f^{(t)i}(x_i) ≥ Σ_{t=1}^T c_t [ y_i f^(t)(x_i) − α_i^(t) K^(t)(x_i, x_i) ] = y_i F(x_i) − Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i),

from which it follows that:

θ(−y_i F^i(x_i)) ≤ θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ).

Therefore the leave-one-out error (1/l) Σ_{i=1}^l θ(−y_i F^i(x_i)) is not more than

(1/l) Σ_{i=1}^l θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ),

which proves the Theorem.
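The bound (10) is cheap to evaluate because it only needs the machines trained once on their original subsets. The sketch below, under the same illustrative assumptions as the earlier snippet (scikit-learn, a Gaussian kernel, labels in {−1, +1}), builds a bagged ensemble of SVMs on subsamples drawn without replacement and evaluates the right hand side of (10); for the Gaussian kernel K(x_i, x_i) = 1. It is a sketch of the computation, not the authors' implementation.

import numpy as np
from sklearn.svm import SVC

def dloo_bound(X, y, T=30, frac=0.4, C=0.5, gamma=0.5, seed=0):
    """Right hand side of Eq. (10) for a bagged ensemble with c_t = 1/T."""
    rng = np.random.default_rng(seed)
    l = len(y)
    r = int(frac * l)
    c_t = 1.0 / T
    F = np.zeros(l)              # ensemble output F(x_i) on the training points
    corr = np.zeros(l)           # sum_t c_t * alpha_i^(t) * K^(t)(x_i, x_i)
    for _ in range(T):
        idx = rng.choice(l, size=r, replace=False)           # D_{r,t}: subsample without replacement
        svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
        F += c_t * svm.decision_function(X)
        alpha = np.abs(svm.dual_coef_).ravel()               # alpha_i^(t) of machine t (support vectors only)
        corr[idx[svm.support_]] += c_t * alpha * 1.0         # K(x_i, x_i) = 1 for the Gaussian kernel
    # theta(sum_t c_t alpha_i^(t) K^(t)(x_i, x_i) - y_i F(x_i)), averaged over i
    return np.mean(corr - y * F > 0)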

Notice that the bound has the same form as the bound in Eq. (8): for each point (x_i, y_i) we only need to take into account its corresponding parameters α_i^(t) and remove their effects from the value of F(x_i).

The det-leave-one-out error can also be bounded using geometric quantities. To this purpose we introduce one more parameter that we call the ensemble margin (in contrast to the margin of a single SVM). For each point (x_i, y_i) we define its ensemble margin to be y_i F(x_i). This is exactly the definition of margin in Schapire et al. (1998). For any given δ > 0 we define Err_δ to be the empirical error with ensemble margin less than δ,

Err_δ(F) = (1/l) Σ_{i=1}^l θ(−y_i F(x_i) + δ),

and by N_δ the set of the remaining training points, the ones with ensemble margin ≥ δ. Finally, we denote by d_{t(δ)} the radius of the smallest sphere in the feature space induced by kernel K^(t), centered at the origin, which contains the points of machine t with α_i^(t) > 0 and ensemble margin larger than δ.

Corollary 3.1. For any δ > 0 the det-leave-one-out error of a kernel machine ensemble is upper bounded by:

DLoo_{D_l}(F) ≤ Err_δ(F) + (1/l) (1/δ) Σ_{t=1}^T c_t d²_{t(δ)} ( Σ_{i ∈ N_δ} α_i^(t) )   (11)

Proof: For each training point (x_i, y_i) with ensemble margin y_i F(x_i) < δ we upper bound θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) with 1 (this is a trivial bound). For the remaining points (the points in N_δ) we show that:

θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) ≤ (1/δ) Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i).   (12)

In the case that Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) < 0, Eq. (12) is trivially satisfied. If Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ≥ 0, then

θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) = 1,

while

(1/δ) Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) ≥ (1/δ) y_i F(x_i) ≥ (1/δ) δ = 1.

So in both cases inequality (12) holds. Therefore:

Σ_{i=1}^l θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) ≤ l Err_δ + (1/δ) Σ_{i ∈ N_δ} Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) ≤ l Err_δ + (1/δ) Σ_{t=1}^T c_t d²_{t(δ)} ( Σ_{i ∈ N_δ} α_i^(t) ).

The statement of the corollary follows by applying Theorem 3.3.

Notice that Eq. (11) holds for any δ > 0, so the best bound is obtained for the minimum of the right hand side with respect to δ > 0. Using Theorem 2.1, Theorems 3.3 and 3.1 provide bounds on the average generalization performance of general kernel machine ensembles like that of Theorem 3.2. We now consider the particular case of SVM ensembles. In this case we have the following:

Corollary 3.2. Suppose that each SVM in the ensemble separates the data set used during training. Then, the det-leave-one-out error of an ensemble of SVMs is upper bounded by:

DLoo_{D_l}(F) ≤ Err_1(F) + (1/l) Σ_{t=1}^T c_t (d²_t / ρ²_t)   (13)

where Err_1 is the margin empirical error with ensemble margin 1, d_t is the radius of the smallest sphere, centered at the origin, in the feature space induced by kernel K^(t), containing the support vectors of machine t, and ρ_t is the margin of the t-th SVM.

Proof: We choose δ = 1 in (11). Clearly we have that d_t ≥ d_{t(δ)} for any δ, and Σ_{i ∈ N_δ} α_i^(t) ≤ Σ_{i=1}^l α_i^(t) = 1/ρ²_t (see Vapnik (1998) for a proof of this equality).

Notice that the average generalization performance of the SVM ensemble now depends on the average (convex combination of) D²/ρ² of the individual machines.
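A hedged computational sketch of bound (13), under the same illustrative assumptions as the previous snippets (scikit-learn, Gaussian kernel, labels in {−1, +1}): each machine contributes d²_t/ρ²_t, where 1/ρ²_t is computed as the RKHS norm ‖f_t‖²_K of its solution and d²_t = max_i K(x_i, x_i) over its support vectors (equal to 1 for the Gaussian kernel). Note that Corollary 3.2 formally requires each SVM to separate its subsample; the code evaluates the right hand side regardless.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def corollary_3_2_bound(X, y, T=30, frac=0.4, C=0.5, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    l = len(y)
    c_t = 1.0 / T
    F = np.zeros(l)
    geom = 0.0
    for _ in range(T):
        idx = rng.choice(l, size=int(frac * l), replace=False)
        svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
        F += c_t * svm.decision_function(X)
        K_sv = rbf_kernel(svm.support_vectors_, gamma=gamma)
        ay = svm.dual_coef_.ravel()                 # alpha_i * y_i on the support vectors
        inv_rho2 = ay @ K_sv @ ay                   # ||f_t||_K^2 = 1 / rho_t^2
        d2 = np.max(np.diag(K_sv))                  # d_t^2 (= 1 for the Gaussian kernel)
        geom += c_t * d2 * inv_rho2
    err_1 = np.mean(y * F < 1)                      # empirical error with ensemble margin 1
    return err_1 + geom / l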

In some cases this may be smaller than the D²/ρ² of a single SVM. For example, suppose we train many SVMs on different sub-samples of the training points and we want to compare such an ensemble with a single SVM using all the points. If all SVMs (the single one, as well as the individual ones of the ensemble) have most of their training points as support vectors, then clearly the D² of each SVM in the ensemble is smaller than that of the single SVM. Moreover the margin of each SVM in the ensemble is expected to be larger than that of the single SVM using all the points. So the average D²/ρ² in this case is expected to be smaller than that of the single SVM.

Another case where an ensemble of SVMs may be better than a single SVM is the one where there are outliers among the training data. If the individual SVMs are trained on subsamples of the training data, some of the machines may have a smaller D²/ρ² because they do not use some of the outliers, which of course also depends on the choice of C for each of the machines. In general it is not clear when ensembles of kernel machines are better than single machines. The bounds in this section may provide some insight into this question.

Finally, we remark that all the results discussed hold for the case that there is no bias (threshold b), or the case where the bias is included in the kernel (as discussed in the introduction). In the experiments discussed below we use the results also in the case that the bias is not regularized (as discussed in Section 2 this means that the separating function includes a bias b, so it is f(x) = Σ_{i=1}^l α_i y_i K(x_i, x) + b), which is common in practice. Recent work in Chapelle and Vapnik (1999) may be used to extend our results to an ensemble of kernel machines with the bias not regularized: whether this can be done is an open question.

4. Experiments

To test how tight the bounds we presented are, we conducted a number of experiments using datasets from UCI, as well as the US Postal Service (USPS) dataset (LeCun et al., 1990). We show results for some of the sets in figures 1-5. For each dataset we split the overall set into training and testing (the sizes are shown in the figures) in 50 different (random) ways, and for each split:

1. We trained one SVM with b = 0 using all training data, computed the leave-one-out bound given by Theorem 3.1, and then computed the test performance using the test set.
2. We repeated (1), this time with b ≠ 0.
3. We trained 30 SVMs with b = 0, each using a random subsample of size 40% of the training data (bagging), computed the leave-one-out bound given by Theorem 3.3 using c_t = 1/30, and then computed the test performance using the test set.
4. We repeated (3), this time with b ≠ 0.

We then averaged over the 50 training-testing splits the test performances and the leave-one-out bounds found, and computed the standard deviations. All machines were trained using a Gaussian kernel, and we repeated the procedure for a number of different σ's of the Gaussian, and for a fixed value of the parameter C (selected by hand so that it is less than 1 in figures 1-5, and more than 1 in figure 6, for reasons explained below; for simplicity we used the same value of C in figures 1-5, C = 0.5, but we found the same trend for other small values of C, C < 1).

We show the averages and standard deviations of the results in figures 1 to 5. In all figures we use the following notation: top left figure: bagging with b = 0; top right figure: single SVM with b = 0; bottom left figure: bagging with b ≠ 0; bottom right figure: single SVM with b ≠ 0. In each plot the solid line is the mean test performance and the dashed line is the bound computed using the leave-one-out Theorems 3.1 and 3.3. The dotted line is the validation set error discussed below. The horizontal axis shows the logarithm of the σ of the Gaussian kernel used. For simplicity, only one error bar (standard deviation over the 50 training-testing splits) is shown (the others were similar). Notice that even for training-testing splits for which the error is one standard deviation away from the mean over the 50 runs (i.e. instead of plotting the graphs through the center of the error bars, we plot them at the end of the bars) the bounds for combinations of machines are still tighter than for single machines in figures 3 to 5. The cost parameter C used is given in each of the figures. The horizontal axis is the natural logarithm of the σ of the Gaussian kernel used, while the vertical axis is the error.

[Figure 1 here: four error plots for the Breast Cancer data (C = 0.5, Train = 200, Test = 77); panels: bagging leave-one-out with b = 0, single SVM leave-one-out with b = 0, bagging leave-one-out with b ≠ 0, single SVM leave-one-out with b ≠ 0; horizontal axes: sigma.]

Figure 1. Breast cancer data: Top left figure: bagging with b = 0; Top right figure: single SVM with b = 0; Bottom left figure: bagging with b ≠ 0; Bottom right figure: single SVM with b ≠ 0. In each plot the solid line is the mean test performance and the dashed line is the bound computed using the leave-one-out Theorems 3.1 and 3.3. The dotted line is the validation set error discussed below. The horizontal axis shows the logarithm of the σ of the Gaussian kernel used.

[Figure 2 here: four error plots for the Thyroid data (C = 0.5, Train = 140, Test = 75), same panel layout as figure 1; horizontal axes: sigma.]

Figure 2. Thyroid data: Notation as in figure 1.

An interesting observation is that the bounds are always tighter for the case of bagging than they are for the case of a single SVM. This is an interesting experimental finding for which we provide a possible theoretical explanation in the next section. This finding can practically justify the use of ensembles of machines for model selection: parameter selection using the leave-one-out bounds presented in this paper is easier for ensembles of machines than it is for single machines. Another interesting observation is that the bounds seem to work similarly in the case that the bias b is not 0. In this case, as before, the bounds are tighter for ensembles of machines than they are for single machines.

Experimentally we found that the bounds presented here do not work well in the case that the C parameter used is large (C = 100). An example is shown in figure 6. Consider the leave-one-out bound for a single SVM given by Theorem 3.1. Let (x_i, y_i) be a support vector for which y_i f(x_i) < 1. It is known (Vapnik, 1998) that for these support vectors the coefficient α_i is C. If C is such that C K(x_i, x_i) > 1 (for example consider a Gaussian kernel with K(x, x) = 1 and any C > 1), then clearly θ(C K(x_i, x_i) − y_i f(x_i)) = 1.

[Figure 3 here: four error plots for the Diabetes data (C = 0.5, Train = 468, Test = 300), same panel layout as figure 1; horizontal axes: sigma.]

Figure 3. Diabetes data: Notation as in figure 1.

In this case the bound of Theorem 3.1 effectively counts all support vectors outside the margin (plus some of the ones on the margin, i.e. yf(x) = 1). This means that for large C (in the case of Gaussian kernels this can be, for example, for any C > 1), the bounds of this paper are effectively similar to (not larger than) another known leave-one-out bound for SVMs, namely one that uses the number of all support vectors to bound the generalization performance (Vapnik, 1998). So effectively our experimental results show that the number of support vectors does not provide a good estimate of the generalization performance of the SVMs and their ensembles.

5. Stability of ensemble methods

We now present a theoretical explanation of the experimental finding that the leave-one-out bound is tighter for the case of ensemble machines than it is for single machines. The analysis is done within the framework of stability and learning (Bousquet & Elisseeff, 2002). It has been proposed in the past that bagging increases the stability of the learning methods (Breiman, 1996). Here we provide a formal argument for this.

[Figure 4 here: four error plots for the Heart data (C = 0.5, Train = 170, Test = 100), same panel layout as figure 1; horizontal axes: sigma.]

Figure 4. Heart data: Notation as in figure 1.

As before, we denote by D_l^i the training set D_l without the example point (x_i, y_i). We use the following notion of stability defined in Bousquet and Elisseeff (2002).

Definition (Uniform stability). We say that a learning method is β_l stable with respect to a loss function V and training sets of size l if the following holds:

∀ i ∈ {1, ..., l}, ∀ D_l, ∀ (x, y) : |V(f_{D_l}(x), y) − V(f_{D_l^i}(x), y)| ≤ β_l.

Roughly speaking, the cost of a learning machine on a new (test) point (x, y) should not change by more than β_l when we train the machine with any training set of size l and when we train the machine with the same training set but one training point (any point) removed. Notice that this definition is useful mainly for real-valued loss functions V. To use it for classification machines we need to start with the real valued output (2) before thresholding.
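A hedged way to get a feel for this definition is to probe it empirically: retrain the machine with one point removed and record the largest change of the real-valued output over a set of evaluation points. This only gives a lower estimate of β_l (the definition takes a supremum over all training sets and removed points), the data and parameters are illustrative assumptions, and the unregularized bias kept by scikit-learn means the comparison with the SVM stability bound quoted on the following pages (Cκ²) is indicative only.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, random_state=1)
y = 2 * y - 1
C, gamma = 0.5, 0.5
full = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)          # f_{D_l}

beta_hat = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i                              # D_l^i: training set without point i
    loo = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[keep], y[keep])
    change = np.abs(full.decision_function(X) - loo.decision_function(X)).max()
    beta_hat = max(beta_hat, change)

print(beta_hat)        # empirical lower estimate of the uniform stability beta_l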

[Figure 5 here: four error plots for the Postal (USPS) data (C = 0.5, Train = 791, Test = 2007), same panel layout as figure 1; horizontal axes: sigma.]

Figure 5. USPS data: Notation as in figure 1.

We define for any given constant δ the leave-one-out error Loo_δ on a training set D_l to be:

Loo_{δ,D_l}(f) = (1/l) Σ_{i=1}^l π_δ(−y_i f_{D_l^i}(x_i)),

where the function π_δ(x) is 0 for x < −δ, 1 for x > 0, and x/δ + 1 for −δ ≤ x ≤ 0 (a soft margin function). For ensemble machines, we will consider again a definition similar to (7):

DLoo_{δ,D_l}(F) = (1/l) Σ_{i=1}^l π_δ(−y_i (1/T) Σ_{t=1}^T f_{(D_{r,t} ⊂ D_l)^i}(x_i)).

Notice that for δ → 0 we get the leave-one-out errors that we defined in Section 2, namely Eqs. (5) and (7), and clearly DLoo_{0,D_l}(F) ≤ DLoo_{δ,D_l}(F) for all δ > 0.
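A minimal sketch of the soft margin function π_δ and of the corresponding error: the inputs are assumed to be the real-valued outputs on the left-out points (however they were obtained) and labels in {−1, +1}.

import numpy as np

def pi_delta(x, delta):
    # 0 for x < -delta, x/delta + 1 for -delta <= x <= 0, 1 for x > 0
    return np.clip(x / delta + 1.0, 0.0, 1.0)

def loo_soft_margin(f_loo, y, delta):
    # Loo_{delta,D_l}(f) = (1/l) * sum_i pi_delta(-y_i * f^i(x_i));
    # with ensemble outputs in place of f^i(x_i) this gives DLoo_{delta,D_l}(F).
    return np.mean(pi_delta(-y * f_loo, delta))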

[Figure 6 here: four error plots for the Postal (USPS) data (C = 100, Train = 791, Test = 2007), same panel layout as figure 1; horizontal axes: sigma.]

Figure 6. USPS data: Using a large C (C = 100). In this case the bounds do not work; see the text for an explanation. Notation as in figure 1.

Let β_l be the stability of the kernel machine for the real valued output w.r.t. the ℓ1 norm, that is:

∀ i ∈ {1, ..., l}, ∀ D_l, ∀ x : |f_{D_l}(x) − f_{D_l^i}(x)| ≤ β_l.

For SVMs it is known (Bousquet & Elisseeff, 2002) that β_l is upper bounded by Cκ², where κ = sup_{x ∈ X} √(K(x, x)) is assumed to be finite. The bound on the stability of an SVM is not explicitly dependent on the size of the training set l. However, the value of C is often chosen such that C is small for large l. In the former experiments, C is fixed for all machines, which are trained on learning sets of the same size. This means that they all have the same stability for the ℓ1 norm.

We first state a bound on the expected error of a single kernel machine in terms of its Loo_δ error. The following theorem is from Bousquet and Elisseeff (2002).

Theorem 5.1. For any given δ, with probability 1 − η the generalization misclassification error of an algorithm that is β_l stable w.r.t. the ℓ1 norm is bounded as:

E_{(x,y)}[θ(−y f_{D_l}(x))] ≤ Loo_{δ,D_l}(f_{D_l}) + β_l + √( (l/2) (2β_l/δ + 1/l)² ln(1/η) ),

where β_l is assumed to be a non-increasing function of l.

Notice that the bound holds for a given constant δ. One can derive a bound that holds uniformly for all δ and therefore use the best δ (i.e. the empirical margin of the classifier) (Bousquet & Elisseeff, 2002). For an SVM, the value of β_l is equal to Cκ². Theorem 5.1 then provides the following bound:

E_{(x,y)}[θ(−y f_{D_l}(x))] ≤ Loo_{δ,D_l}(f_{D_l}) + Cκ² + √( (l/2) (2Cκ²/δ + 1/l)² ln(1/η) ).

The value of C is often a function of l. Depending on the way C decreases with l, this bound can be tight or loose.

We now study a similar generalization bound for an ensemble of machines where each machine uses only r points drawn randomly with the uniform distribution from the training set. We consider only the case where the coefficients c_t of (4) are all 1/T (so taking the average machine, like in standard bagging (Breiman, 1996)). Such an ensemble is very close to the original idea of bagging, despite some differences, namely that in standard bagging each machine uses a training set of size equal to the size of the original set, created by random subsampling with replacement, instead of using only r points. We will consider the expected combination F̂ defined as:

F̂(x) = E_{D_r ⊂ D_l}[f_{D_r}(x)]

where the expectation is taken with respect to the training data D_r of size r drawn uniformly from D_l. The stability bounds we present below hold for this expected combination and not for the finite combination considered so far; as mentioned below, how close these two are is an open question. The leave-one-out error we define for this expectation is again like in (7) (as in Eq. (7), the size r of the subsamples is for simplicity not included in the notation):

DLoo_{δ,D_l}(F̂) = (1/l) Σ_{i=1}^l π_δ(−y_i E_{D_r ⊂ D_l}[f_{D_r^i}(x_i)])

which is different from the standard leave-one-out error:

(1/l) Σ_{i=1}^l π_δ(−y_i E_{D_r ⊂ D_l^i}[f_{D_r}(x_i)]),

which corresponds to (6).

As an extreme case, when T → ∞:

DLoo_{δ,D_l}( (1/T) Σ_{t=1}^T f_{D_{r,t}} ) → DLoo_{δ,D_l}(F̂)   (14)

This relation motivates the choice of our method of calculation for the leave-one-out estimate in Section 3. Indeed the right hand side of the equation corresponds to the quantity that we have bounded in Sections 3 and 4 and that ultimately we would like to relate to the stability of the base machine. It is an open question to measure how fast the convergence in Eq. (14) is. As we discuss below, and as also mentioned in Breiman (1996), increasing T beyond a certain value (typically small, i.e. 100) does not influence the performance of bagging, which may imply that the convergence (14) is fast.

We then have the following bound on the expected error of ensemble combinations:

Theorem 5.2. For any given δ, with probability 1 − η the generalization misclassification error of the expected combination of classifiers F̂, each using a subsample of size r of the training set and each having a stability β_r w.r.t. the ℓ1 norm, is bounded as:

E_{(x,y)}[θ(−y F̂(x))] ≤ DLoo_{δ,D_l}(F̂) + (r/l) β_r + √( (1/(2l)) (2β_r r/δ + 1)² ln(1/η) )

Proof: We will apply the stability Theorem 5.1 to the following algorithm: On a set of size l, the algorithm is the same as the expected ensemble machine we consider. On a training set of size l − 1, it adds a dummy input pair (x_0, y_0) and uses the same sampling scheme as the one used with D_l. That is, D_r is sampled from D_{l−1} ∪ {(x_0, y_0)} with the same distribution as it is sampled from D_l in the definition of F̂. When (x_0, y_0) is drawn in D_r, it is not used in training, so that f_{D_r} is replaced by f_{D_r \ {(x_0, y_0)}}. The new algorithm, which we will call G, can then be expressed as G(x) = E_{D_r ⊂ D_l}[f_{D_r}(x)], and G^i, its outcome on the set D_l^i, is equal to G^i(x) = E_{D_r ⊂ D_l}[f_{D_r^i}(x)], where (x_i, y_i) plays the role of the dummy pair (x_0, y_0) previously mentioned. The resulting algorithm then has the same behavior on training sets of size l as the ensemble machine we consider, and the classical leave-one-out error for G corresponds to the det-leave-one-out error we have defined previously for F̂. From that perspective, it is sufficient to show that G is (r/l)β_r stable w.r.t. the ℓ1 norm and to apply Theorem 5.1. We have:

G − G^i = E_{D_r ⊂ D_l}[f_{D_r}] − E_{D_r ⊂ D_l}[f_{D_r^i}]

where D_r^i = D_r \ (x_i, y_i). We have by definition:

G − G^i = ∫ f_{D_r} dP − ∫ f_{D_r^i} dP,

where P denotes here the distribution over the sampling of D_r from D_l.

Defining the indicator function 1_A of the set A to be 1_A(z) = 1 iff z ∈ A, we decompose each of the integrals as follows:

G − G^i = ∫ f_{D_r} 1_{(x_i,y_i) ∈ D_r} dP + ∫ f_{D_r} 1_{(x_i,y_i) ∉ D_r} dP − ∫ f_{D_r^i} 1_{(x_i,y_i) ∈ D_r} dP − ∫ f_{D_r^i} 1_{(x_i,y_i) ∉ D_r} dP.

Clearly, if (x_i, y_i) ∉ D_r, then D_r^i = D_r, so that:

G − G^i = ∫ f_{D_r} 1_{(x_i,y_i) ∈ D_r} dP − ∫ f_{D_r^i} 1_{(x_i,y_i) ∈ D_r} dP,

which implies, for every x,

|G(x) − G^i(x)| ≤ β_r ∫ 1_{(x_i,y_i) ∈ D_r} dP = β_r P[(x_i, y_i) ∈ D_r],

where the probability is taken with respect to the random subsampling of the data set D_r from D_l. Since this subsampling is done without replacement, this probability is equal to r/l, which finally gives a bound of (r/l)β_r on the stability of G = E_{D_r ⊂ D_l}[f_{D_r}]. This result plugged into the previous theorem gives the final bound.

This theorem holds for ensemble combinations that are theoretically defined from the expectation E_{D_r ⊂ D_l}[f_{D_r}]. Notice that the hypotheses do not require that the combination is formed only by the same type of machines. In particular, one can imagine an ensemble of different kernel machines with different kernels. We formalize this remark in the following:

Theorem 5.3. Let F̂_S be a finite combination of SVMs f^s, s = 1, ..., S, with different kernels K_1, ..., K_S:

F̂_S = (1/S) Σ_{s=1}^S E_{D_r ⊂ D_l}[f^s_{D_r}]   (15)

where f^s_{D_r ⊂ D_l} is an SVM with kernel K_s learned on D_r. Denote as before by DLoo_{δ,D_l}(F̂_S) the det-leave-one-out error of F̂_S computed with the function π_δ. Assume that each of the f^s_{D_r ⊂ D_l} is learned with the same C on a subset D_r of size r drawn from D_l with a uniform distribution. For any given δ, with probability 1 − η, the generalization misclassification error is bounded as:

E_{(x,y)}[θ(−y F̂_S(x))] ≤ DLoo_{δ,D_l}(F̂_S) + (r/l) Cκ² + √( (1/(2l)) (2Cκ² r/δ + 1)² ln(1/η) ),

where κ² = (1/S) Σ_{s=1}^S sup_{x ∈ X} K_s(x, x).

Proof: As before, we study

G − G^i = (1/S) Σ_{s=1}^S ( E_{D_r ⊂ D_l}[f^s_{D_r}] − E_{D_r ⊂ D_l}[f^s_{D_r^i}] ).

Following the same calculations as in the previous theorem for each of the summands, we have:

|G − G^i| ≤ (1/S) Σ_{s=1}^S β_{r,s} ∫ 1_{(x_i,y_i) ∈ D_r} dP,

where β_{r,s} denotes the stability of an SVM with kernel K_s on a set of size r, and P is the distribution over the sampling of D_r from D_l. As before, since (x_i, y_i) appears in D_r only a fraction r/l of the time on average, we have the following bound:

|G − G^i| ≤ (1/S) Σ_{s=1}^S β_{r,s} (r/l).

Replacing β_{r,s} by its value for the case of SVMs yields a bound on the generalization error of G in terms of its leave-one-out error. This translates for F̂_S into a bound on its generalization error in terms of its det-leave-one-out error, which is the statement of the theorem.

Notice that Theorem 5.3 holds for combinations of kernel machines where for each kernel we use many machines trained on subsamples of the training set. So it is an ensemble of ensembles (see Eq. (15)).

Compared to what has been derived for a single SVM, combining SVMs provides a tighter bound on the generalization error. This result can then be interpreted as an explanation of the better estimation of the test error by the det-leave-one-out error for ensemble methods. The bounds given by the previous theorems have the form:

E_{(x,y)}[θ(−y F(x))] ≤ DLoo_{δ,D_l}(F) + O( (C_r κ² r/δ) √(ln(1/η) / l) ),

while the bound for a single SVM is:

E_{(x,y)}[θ(−y f(x))] ≤ Loo_{δ,D_l}(f) + O( (C_l κ²/δ) √(l ln(1/η)) ).

We have indexed the parameters C with an index that indicates that the SVMs are not learned with the same training set size in the first and in the second case. In the experiments, the same C was used for all SVMs (C_l = C_r). The bound derived for a combination of SVMs is then tighter than for a single SVM by a factor of r/l. The improvement is because the stability of the combination of SVMs is better than the stability of a single SVM. This is true if we assume that both SVMs are trained with the same C, but the discussion becomes more tricky if different C's are used during learning. The stability of SVMs depends indeed on the way the value of C is determined. For a single SVM, C is generally a function of l, and for a combination of SVMs, C also depends on the size of the subsampled learning sets D_{r,t}. In Theorem 5.2, we have seen that the stability of the combination of machines was smaller than (r/l)β_r, where β_r is equal to Cκ² for SVMs. If this stability is better than the stability of a single machine, then combining the functions f_{D_{r,t}} provides a better bound. However, in the other case, the bound gets worse. We have the following corollary whose proof is direct:

Corollary 5.1. If a learning system is β_l stable and β_l/β_r < r/l, then combining these learning systems does not provide a better bound on the difference between the test error and the leave-one-out error. Conversely, if β_l/β_r > r/l, then combining these learning systems leads to a better bound on the difference between the test error and the leave-one-out error.

This corollary gives an indication that combining machines should not be used if the stability of the single machine is very good. Notice that the corollary is about bounds, and not about whether the generalization error for bagging, or the actual difference between the test and leave-one-out errors, is always smaller for unstable machines (and larger for stable ones): this depends on how tight the bounds are in every case. However, it is not often the case that we have a highly stable single machine, and therefore typically bagging improves stability. In such a situation, the bounds presented in this paper show that we have better control of the generalization error for combinations of SVMs, in the sense that the leave-one-out and the empirical errors are closer to the test error. The bounds presented do not necessarily imply that the generalization error of bagging is less than that of single machines. Similar remarks have already been made by Breiman (1996) for bagging, where similar considerations of stability are experimentally discussed.

Another remark that can be made from the work of Breiman is that bagging does not improve performance after a certain number of bagged predictors. On the other hand, it does not reduce performance either. This experimentally derived statement can be translated in our framework as: when T increases, the stability of the combined learning system tends to the stability of the expectation E_{D_r ⊂ D_l}[f_{D_r}], which does not improve after T has passed a certain value. This value may correspond to the convergence of the finite sum (1/T) Σ_{t=1}^T f_{D_{r,t}} to its expectation w.r.t. D_{r,t} ⊂ D_l.

Finally, it is worthwhile noticing that the stability analysis of this section also holds for the empirical error. Indeed, for a β_l stable algorithm, as is underlined in Bousquet and Elisseeff (2002), the leave-one-out error and the empirical error are related by:

Loo_{δ,D_l}(f) ≤ Err_0(f_{D_l}) + β_l,

where Err_0(f_{D_l}) is the empirical error on the learning set D_l.

Using this inequality in Theorems 5.2 and 5.3 for the algorithm G, we can bound the generalization error of F in terms of the empirical error and the stability of the machines.

6. Other ensembles and error estimates

6.1. Validation set for model selection

Instead of using bounds on the generalization performance of learning machines like the ones discussed above, an alternative approach for model selection is to use a validation set to choose the parameters of the machines. We consider first the simple case where we have N machines and we choose the best one based on the error they make on a fixed validation set of size V. This can be thought of as a special case where we consider as hypothesis space the set of the N machines, and then we train by simply picking the machine with the smallest empirical error (in this case this is the validation error). It is known that if VE_i is the validation error of machine i and TE_i is its true test error, then for all N machines simultaneously the following bound holds with probability 1 − η (Devroye, Györfi, & Lugosi, 1996; Vapnik, 1998):

TE_i ≤ VE_i + √( (log N − log(η/4)) / V ).   (16)

So how accurately we pick the best machine using the validation set depends, as expected, on the number of machines N and on the size V of the validation set. The bound suggests that a validation set can be used to accurately estimate the generalization performance of a relatively small number of machines (i.e. a small number of parameter values examined), as is often done in practice.

We used this observation for parameter selection for SVMs and for their ensembles. Experimentally we followed a slightly different procedure from what is suggested by bound (16). For each machine (that is, for each σ of the Gaussian kernel in our case, both for a single SVM and for an ensemble of machines) we split the training set (for each training-testing split of the overall dataset as described above) into a smaller training set and a validation set (70-30% respectively). We trained each machine using the new, smaller training set, and measured the performance of the machine on the validation set. Unlike what bound (16) suggests, instead of comparing the validation performance found with the generalization performance of the machines trained on the smaller training set (which is the case for which bound (16) holds), we compared the validation performance with the test performance of the machine trained using all the initial (larger) training set. This way we did not have to use fewer points for training the machines, which is a typical drawback of using a validation set, and we could compare the validation performance with the leave-one-out bounds and the test performance of the exact same machines we used in Section 4.
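A hedged sketch of the simple selection procedure suggested by bound (16), with an illustrative 70-30 split and grid of σ values (the experiments above deviate from this, as just described); scikit-learn is assumed.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def select_by_validation(X, y, sigmas, C=0.5, eta=0.05, seed=0):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    V, N = len(y_val), len(sigmas)
    slack = np.sqrt((np.log(N) - np.log(eta / 4)) / V)   # bound (16): TE_i <= VE_i + slack, for all N machines at once
    results = []
    for sigma in sigmas:
        svm = SVC(C=C, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X_tr, y_tr)
        ve = np.mean(svm.predict(X_val) != y_val)        # validation error VE_i
        results.append((sigma, ve, ve + slack))
    return min(results, key=lambda t: t[1])              # machine with the smallest validation error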

We show the results of these experiments in figures 1-5 (see the dotted lines in the plots). We observe that although the validation error is that of a machine trained on a smaller training set, it still provides a very good estimate of the test performance of the machines trained on the whole training set. In all cases, including the case of C > 1 for which the leave-one-out bounds discussed above did not work well, the validation set provided a very good estimate of the test performance of the machines.

6.2. Adaptive combinations of learning machines

The ensembles of kernel machines (4) considered so far are voting combinations where the coefficients c_t in (4) of the linear combination of the machines are fixed. We now consider the case where these coefficients are also learned. In particular we consider the following two-layer architecture:

1. A number T of kernel machines is trained as before (for example using different training data, or different parameters). Let f_t(x), t = 1, ..., T be the machines.
2. The T outputs (real valued in our experiments, but they could also be thresholded binary ones) of the machines at each of the training points are computed.
3. A linear machine (i.e. a linear SVM) is trained using as inputs the outputs of the T machines on the training data, and as labels the original training labels. The solution is used as the coefficients c_t of the linear combination of the T machines.

In this case the ensemble machine F(x) is a kernel machine itself, which is trained using as kernel the function K(x, x′) = Σ_{t=1}^T f_t(x) f_t(x′). Notice that since each of the machines f_t(x) depends on the data, the kernel K is also data dependent. Therefore the stability parameter of the ensemble machine is more difficult to compute (when a data point is left out the kernel K changes). Likewise the leave-one-out bound of Theorem 3.3 does not hold, since the theorem assumes fixed coefficients c_t.

On the other hand, an important characteristic of this type of ensembles is that, independently of what kernels/parameters each of the individual machines of the ensemble uses, the second layer machine (which finds the coefficients c_t) always uses a linear kernel. This may imply that the overall architecture is less sensitive to the kernel/parameters of the machines of the ensemble. We tested this hypothesis experimentally by comparing how the test performance of this type of machines changes with the σ of the Gaussian kernel used by the individual machines of the ensemble, and compared the behavior with that of single machines and ensembles of machines with fixed c_t. In figure 7 we show two examples.

In our experiments, for all datasets except one, learning the coefficients c_t of the combination of the machines using a linear machine (we used a linear SVM) made the overall machine less sensitive to changes of the parameters of the individual machines (the σ of the Gaussian kernel). This can be a useful characteristic of the architecture outlined in this section. For example the choice of the kernel parameters of the machines of the ensembles need not be tuned accurately.
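A hedged sketch of this two-layer architecture (again assuming scikit-learn and illustrative parameters, not the paper's implementation): the first layer is a set of kernel machines trained on random subsamples, the second layer is a linear SVM trained on their real-valued outputs, and its weights play the role of the coefficients c_t.

import numpy as np
from sklearn.svm import SVC, LinearSVC

def train_adaptive_combination(X, y, T=30, frac=0.4, C=0.5, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    l = len(y)
    machines = []
    for _ in range(T):
        idx = rng.choice(l, size=int(frac * l), replace=False)
        machines.append(SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx]))
    # First-layer real-valued outputs on the training points become the second-layer inputs.
    Z = np.column_stack([m.decision_function(X) for m in machines])
    combiner = LinearSVC(C=1.0).fit(Z, y)                 # learns the combination coefficients c_t

    def predict(X_new):
        Z_new = np.column_stack([m.decision_function(X_new) for m in machines])
        return combiner.predict(Z_new)

    return predict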

[Figure 7 here: two error plots, Heart data (left) and Diabetes data (right); horizontal axes: sigma.]

Figure 7. When the coefficients of the second layer are learned using a linear SVM the system is less sensitive to changes of the σ of the Gaussian kernel used by the individual machines of the ensemble. The solid line is one SVM, the dotted line is an ensemble of 30 SVMs with fixed c_t = 1/30, and the dashed line is an ensemble of 30 SVMs with the coefficients c_t learned. The horizontal axis shows the natural logarithm of the σ of the Gaussian kernel. Left is the Heart dataset, and right is the Diabetes one. The threshold b is non-zero for these experiments.

6.3. Ensembles versus single machines

So far we concentrated on the theoretical and experimental characteristics of ensembles of kernel machines. We now discuss how ensembles compare with single machines. Table 2 shows the test performance of one SVM compared with that of an ensemble of 30 SVMs combined with c_t = 1/30 and an ensemble of 30 SVMs combined using a linear SVM, for some UCI datasets (characteristic results). For the tables of this section we use, for convenience, the following notation: VCC stands for Voting Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are fixed. ACC stands for Adaptive Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are learned-adapted. We only consider SVMs and ensembles of SVMs with the threshold b. The table shows mean test errors and standard deviations for the best (decided using the validation set performance in this case) parameters of the machines (σ's of the Gaussians and parameter C, hence different from figures 1-5 which were for a given C).

Table 2. Average errors and standard deviations (percentages) of the best machines (best σ of the Gaussian kernel and best C) chosen according to the validation set performances. The performances of the machines are about the same. VCC and ACC use 30 SVMs. (Some entries are illegible in this transcription.)

Dataset    SVM          VCC        ACC
Breast     25.5 ± …     … ± …      … ± 4.0
Thyroid    5.1 ± …      … ± …      … ± 2.7
Diabetes   23.0 ± …     … ± …      … ± 1.8
Heart      15.4 ± …     … ± …      … ± 3.2

As the results show, the best SVM and the best ensembles we found have about the same test performance. Therefore, with appropriate tuning of the parameters of the machines, combining SVMs does not lead to performance improvement compared to a single SVM.

Although the best SVM and the best ensemble (that is, after accurate parameter tuning) perform similarly, an important difference of the ensembles compared to a single machine is that the training of the ensemble consists of a large number of (parallelizable) small-training-set kernel machines in the case of bagging. This implies that one can gain performance similar to that of a single machine by training many faster machines using smaller training sets, although the actual testing may be slower since the size of the union of support vectors of the combination of machines is expected to be larger than the number of support vectors of a single machine using all the training data. This can be an important practical advantage of ensembles of machines, especially in the case of large datasets. Table 3 compares the test performance of a single SVM with that of an ensemble of SVMs each trained with as low as 1% of the initial training set (for one dataset; for the other ones we could not use 1% because the size of the original dataset was small, so 1% of it was only a couple of points). For fixed c_t the performance decreases only slightly in all cases (Thyroid, that we show, was the only dataset we found in our experiments for which the change was significant for the case of VCC), while in the case of the architecture of Section 6.2 even with 1% training data the performance does not decrease. This is because the linear machine used to learn the coefficients c_t uses all the training data. Even in this last case the overall machine can still be faster than a single machine, since the second layer learning machine is a linear one, and fast training methods for the particular case of linear machines exist (Platt, 1998).

Table 3. Comparison between error rates of a single SVM vs. error rates of VCC and ACC of 100 SVMs for different percentages of subsampled data. The last dataset is from Osuna, Freund, and Girosi (1997). (Most entries are illegible in this transcription.)

Dataset    VCC 10%   VCC 5%   VCC 1%   ACC 10%   ACC 5%   ACC 1%   SVM
Diabetes   …         …        …        …         …        …        … ± 1.6
Thyroid    …         …        …        …         …        …        … ± 2.5
Faces      …         …        …        …         …        …        0.5

7. Conclusions

We presented theoretical bounds on the generalization error of ensembles of kernel machines such as SVMs. Our results apply to the general case where each of the machines in the ensemble is trained on different subsets of the training data and/or uses different kernels or input features. A special case of ensembles is that of bagging. The bounds were derived within the frameworks of cross validation and of stability and learning. They involve two main quantities: the det-leave-one-out error estimate and the stability parameter of the ensembles.


More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014 COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Edge Isoperimetric Inequalities

Edge Isoperimetric Inequalities November 7, 2005 Ross M. Rchardson Edge Isopermetrc Inequaltes 1 Four Questons Recall that n the last lecture we looked at the problem of sopermetrc nequaltes n the hypercube, Q n. Our noton of boundary

More information

1 Matrix representations of canonical matrices

1 Matrix representations of canonical matrices 1 Matrx representatons of canoncal matrces 2-d rotaton around the orgn: ( ) cos θ sn θ R 0 = sn θ cos θ 3-d rotaton around the x-axs: R x = 1 0 0 0 cos θ sn θ 0 sn θ cos θ 3-d rotaton around the y-axs:

More information

TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES

TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES SVANTE JANSON Abstract. We gve explct bounds for the tal probabltes for sums of ndependent geometrc or exponental varables, possbly wth dfferent

More information

Statistics II Final Exam 26/6/18

Statistics II Final Exam 26/6/18 Statstcs II Fnal Exam 26/6/18 Academc Year 2017/18 Solutons Exam duraton: 2 h 30 mn 1. (3 ponts) A town hall s conductng a study to determne the amount of leftover food produced by the restaurants n the

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Chapter 6. Supplemental Text Material

Chapter 6. Supplemental Text Material Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Maximizing the number of nonnegative subsets

Maximizing the number of nonnegative subsets Maxmzng the number of nonnegatve subsets Noga Alon Hao Huang December 1, 213 Abstract Gven a set of n real numbers, f the sum of elements of every subset of sze larger than k s negatve, what s the maxmum

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Affine transformations and convexity

Affine transformations and convexity Affne transformatons and convexty The purpose of ths document s to prove some basc propertes of affne transformatons nvolvng convex sets. Here are a few onlne references for background nformaton: http://math.ucr.edu/

More information

Min Cut, Fast Cut, Polynomial Identities

Min Cut, Fast Cut, Polynomial Identities Randomzed Algorthms, Summer 016 Mn Cut, Fast Cut, Polynomal Identtes Instructor: Thomas Kesselhem and Kurt Mehlhorn 1 Mn Cuts n Graphs Lecture (5 pages) Throughout ths secton, G = (V, E) s a mult-graph.

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Linear Classification, SVMs and Nearest Neighbors

Linear Classification, SVMs and Nearest Neighbors 1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Excess Error, Approximation Error, and Estimation Error

Excess Error, Approximation Error, and Estimation Error E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple

More information

1 Definition of Rademacher Complexity

1 Definition of Rademacher Complexity COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #9 Scrbe: Josh Chen March 5, 2013 We ve spent the past few classes provng bounds on the generalzaton error of PAClearnng algorths for the

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Exercises. 18 Algorithms

Exercises. 18 Algorithms 18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)

More information

Formulas for the Determinant

Formulas for the Determinant page 224 224 CHAPTER 3 Determnants e t te t e 2t 38 A = e t 2te t e 2t e t te t 2e 2t 39 If 123 A = 345, 456 compute the matrx product A adj(a) What can you conclude about det(a)? For Problems 40 43, use

More information

Support Vector Machines. Vibhav Gogate The University of Texas at dallas

Support Vector Machines. Vibhav Gogate The University of Texas at dallas Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Evaluation of simple performance measures for tuning SVM hyperparameters

Evaluation of simple performance measures for tuning SVM hyperparameters Evaluaton of smple performance measures for tunng SVM hyperparameters Kabo Duan, S Sathya Keerth, Aun Neow Poo Department of Mechancal Engneerng, Natonal Unversty of Sngapore, 0 Kent Rdge Crescent, 960,

More information

This column is a continuation of our previous column

This column is a continuation of our previous column Comparson of Goodness of Ft Statstcs for Lnear Regresson, Part II The authors contnue ther dscusson of the correlaton coeffcent n developng a calbraton for quanttatve analyss. Jerome Workman Jr. and Howard

More information

Graph Reconstruction by Permutations

Graph Reconstruction by Permutations Graph Reconstructon by Permutatons Perre Ille and Wllam Kocay* Insttut de Mathémathques de Lumny CNRS UMR 6206 163 avenue de Lumny, Case 907 13288 Marselle Cedex 9, France e-mal: lle@ml.unv-mrs.fr Computer

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

On the correction of the h-index for career length

On the correction of the h-index for career length 1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

PHYS 705: Classical Mechanics. Calculus of Variations II

PHYS 705: Classical Mechanics. Calculus of Variations II 1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary

More information

Lecture 4: September 12

Lecture 4: September 12 36-755: Advanced Statstcal Theory Fall 016 Lecture 4: September 1 Lecturer: Alessandro Rnaldo Scrbe: Xao Hu Ta Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer: These notes have not been

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information