Leave One Out Error, Stability, and Generalization of Voting Combinations of Classifiers


Machine Learning, 55, 71-97, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

Leave One Out Error, Stability, and Generalization of Voting Combinations of Classifiers

THEODOROS EVGENIOU  theodoros.evgeniou@insead.edu
Technology Management, INSEAD, Boulevard de Constance, Fontainebleau, France

MASSIMILIANO PONTIL  pontil@dii.unisi.it
DII, University of Siena, Via Roma 56, Siena, Italy

ANDRÉ ELISSEEFF  andre.elisseeff@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, Tübingen, Germany

Editor: Robert E. Schapire

Abstract. We study the leave-one-out and generalization errors of voting combinations of learning machines. A special case considered is a variant of bagging. We analyze in detail combinations of kernel machines, such as support vector machines, and present theoretical estimates of their leave-one-out error. We also derive novel bounds on the stability of combinations of any classifiers. These bounds can be used to formally show that, for example, bagging increases the stability of unstable learning machines. We report experiments supporting the theoretical findings.

Keywords: cross-validation, bagging, combinations of machines, stability

1. Introduction

Studying the generalization performance of ensembles of learning machines has been the topic of ongoing research in recent years (Breiman, 1996; Schapire et al., 1998; Friedman, Hastie, & Tibshirani, 1998). There is a lot of experimental work showing that combining learning machines, for example using boosting or bagging methods (Breiman, 1996; Schapire et al., 1998), very often leads to improved generalization performance. A number of theoretical explanations have also been proposed (Schapire et al., 1998; Breiman, 1996), but more work on this aspect is still needed.

Two important theoretical tools for studying the generalization performance of learning machines are the leave-one-out (or cross validation) error of the machines, and the stability of the machines (Bousquet & Elisseeff, 2002; Boucheron, Lugosi, & Massart, 2000). The second, although an older tool (Devroye & Wagner, 1979; Devroye, Györfi, & Lugosi, 1996), has become important only recently with the work of Kearns and Ron (1999) and Bousquet and Elisseeff (2002). Stability has also been discussed extensively in the work of Breiman (1996). The theory in Breiman (1996) is that bagging increases performance because it reduces the variance of the base learning machines, although it does not always increase the bias (Breiman, 1996).

The definition of the variance in Breiman (1996) is similar in spirit to that of stability we use in this paper. The key difference is that in Breiman (1996) the variance of a learning machine is defined in an asymptotic way and is not used to derive any non-asymptotic bounds on the generalization error of bagging machines, while here we define stability for finite samples like it is done in Bousquet and Elisseeff (2002) and we also derive such non-asymptotic bounds. The intuition given by Breiman (1996) gives interesting insights: the effect of bagging depends on the stability of the base classifier. Stability means here changes in the output of the classifier when the training set is perturbed. If the base classifiers are stable, then bagging is not expected to decrease the generalization error. On the other hand, if the base classifier is unstable, such as often occurs with decision trees, the generalization performance is supposed to increase with bagging. Despite experimental evidence, the insights in Breiman (1996) had not been supported by a general theory linking stability to the generalization error of bagging, which is what Section 5 below is about.

In this paper we study the generalization performance of ensembles of kernel machines using both leave-one-out and stability arguments. We consider the general case where each of the machines in the ensemble uses a different kernel and different subsets of the training set. The ensemble is a convex combination of the individual machines. A particular case of this scheme is that of bagging kernel machines. Unlike standard bagging (Breiman, 1996), this paper considers combinations of the real outputs of the classifiers, and each machine is trained on a different and small subset of the initial training set chosen by randomly subsampling from the initial training set. Each machine in the ensemble uses in general a different kernel. As a special case, appropriate choices of these kernels lead to machines that may use different subsets of the initial input features, or different input representations in general.

We derive theoretical bounds for the generalization error of the ensembles based on a leave-one-out error estimate. We also present results on the stability of combinations of classifiers, which we apply to the case of bagging kernel machines. They can also be applied to bagging learning machines other than kernel machines, showing formally that bagging can increase the stability of the learning machines when these are not stable, and decrease it otherwise. An implication of this result is that it can be easier to control the generalization error of bagging machines. For example the leave-one-out error is a better estimate of their test error, something that we experimentally observe.

The paper is organized as follows. Section 2 gives the basic notation and background. In Section 3 we present bounds for a leave-one-out error of kernel machine ensembles. These bounds are used for model selection experiments in Section 4. In Section 5 we discuss the algorithmic stability of ensembles, and present a formal analysis of how bagging influences the stability of learning machines. The results can also provide a justification of the experimental findings of Section 4. Section 6 discusses other ways of combining learning machines.

2. Background and notations

In this section we recall the main features of kernel machines. For a more detailed account see Vapnik (1998), Schölkopf, Burges, and Smola (1998), and Evgeniou, Pontil, and Poggio (2000). For an account consistent with our notation see Evgeniou, Pontil, and Poggio (2000).

Kernel machine classifiers are the minimizers of functionals of the form:

H[f] = (1/l) Σ_{i=1}^l V(y_i, f(x_i)) + λ ‖f‖²_K,   (1)

where we use the following notation:

- Let X ⊆ R^n be the input set; the pairs (x_i, y_i) ∈ X × {−1, 1}, i = 1, ..., l, are sampled independently and identically according to an unknown probability distribution P(x, y). The set D_l = {(x_1, y_1), ..., (x_l, y_l)} is the training set.
- f is a function R^n → R belonging to a Reproducing Kernel Hilbert Space (RKHS) H defined by kernel K, and ‖f‖²_K is the norm of f in this space. See Vapnik (1998) and Wahba (1990) for a number of kernels. The classification is done by taking the sign of this function.
- V(y, f(x)) is the loss function. The choice of this function determines different learning techniques, each leading to a different learning algorithm (for computing the coefficients α_i see below).
- λ is called the regularization parameter and is a positive constant.

Machines of this form have been motivated in the framework of statistical learning theory. Under rather general conditions (Evgeniou, Pontil, & Poggio, 2000) the solution of Eq. (1) is of the form

f(x) = Σ_{i=1}^l α_i y_i K(x_i, x).   (2)

The coefficients α_i in Eq. (2) are learned by solving the following optimization problem:

max_α H(α) = Σ_{i=1}^l S(α_i) − (1/2) Σ_{i,j=1}^l α_i α_j y_i y_j K(x_i, x_j)
subject to: 0 ≤ α_i ≤ C, i = 1, ..., l,   (3)

where S(·) is a continuous and concave function (strictly concave if the matrix K(x_i, x_j) is not strictly positive definite) and C = 1/(2lλ) is a constant. Thus, H(α) is strictly concave and the above optimization problem has a unique solution.

Support Vector Machines (SVMs) are a particular case of these machines for S(α) = α. This corresponds to a loss function V in (1) of the form θ(1 − yf(x))(1 − yf(x)), where θ is the Heaviside function: θ(x) = 1 if x > 0, and zero otherwise. The points for which α_i > 0 are called support vectors. Notice that the bias term (the threshold b in the general case of machines f(x) = Σ_{i=1}^l α_i y_i K(x_i, x) + b) is incorporated in the kernel K, and it is therefore also regularized. Notice also that the function S(·) in (3) can take general forms leading to machines other than SVMs, but in the general case the optimization of (3) may be computationally inefficient.
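As a concrete illustration of the solution form (2) in the SVM special case, the following sketch trains an SVM and evaluates f(x) = Σ_i α_i y_i K(x_i, x) explicitly from the learned coefficients. This is a minimal example assuming scikit-learn and a synthetic dataset (neither is used in the paper); note that, unlike the formulation above, scikit-learn keeps an unregularized bias b, which is added separately.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative data and parameters (assumptions, not from the paper).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = 2 * y - 1                                   # labels in {-1, +1}

sigma = 1.0
gamma = 1.0 / (2 * sigma ** 2)                  # Gaussian kernel K(x, x') = exp(-gamma ||x - x'||^2)
svm = SVC(C=0.5, kernel="rbf", gamma=gamma).fit(X, y)

def gaussian_kernel(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

# scikit-learn stores alpha_i * y_i for the support vectors; all other alpha_i are zero,
# so Eq. (2) can be evaluated explicitly (plus the unregularized bias kept by scikit-learn).
alpha_y = svm.dual_coef_.ravel()
f_manual = gaussian_kernel(X, svm.support_vectors_) @ alpha_y + svm.intercept_[0]
print(np.allclose(f_manual, svm.decision_function(X)))   # True: the two computations agree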

2.1. Kernel machine ensembles

Given a learning algorithm such as an SVM or an ensemble of SVMs, we define f_{D_l} to be the solution of the algorithm when the training set D_l = {(x_i, y_i), i = 1, ..., l} is used. We denote by D_l^i the training set obtained by removing the point (x_i, y_i) from D_l, that is the set D_l \ {(x_i, y_i)}. When it is clear in the text we will denote f_{D_l} by f and f_{D_l^i} by f^i.

We consider the general case where each of the machines in the ensemble uses a different kernel and different subsets D_{r,t} of the training set D_l, where r refers to the size of the subset and t = 1, ..., T to the machine that uses it to learn. Let f_{D_{r,t}}(x) be the optimal solution of machine t using a kernel K^(t). We denote by α_i^(t) the optimal weight that machine t assigns to point (x_i, y_i) (after solving optimization problem (3)).

We consider ensembles that are convex combinations of the individual machines. The decision function of the ensemble is given by

F_{r,T}(x) = Σ_{t=1}^T c_t f_{D_{r,t}}(x)   (4)

with c_t ≥ 0 and Σ_{t=1}^T c_t = 1 (for scaling reasons). The coefficients c_t are not learned and all parameters (the C's and kernels) are fixed before training. The classification is done by taking the sign of F_{r,T}(x). Below, for simplicity, we will denote with capital F the combination F_{r,T}. In Section 5 we will consider only the case that c_t = 1/T, for simplicity.

In the following, the sets D_{r,t} will be identically sampled according to the uniform distribution and without replacement from the training set D_l. We will denote by E_{D_r ⊂ D_l} the expectation with respect to the subsampling from D_l according to the uniform distribution (without replacement), and sometimes we write f_{D_{r,t} ⊂ D_l} rather than f_{D_{r,t}} to make clear which training set has been used during learning. The letter r will always refer to the number of elements in D_{r,t}.

2.2. Leave-one-out error

If θ is, as before, the Heaviside function, then the leave-one-out error of f on D_l is defined by

Loo_{D_l}(f) = (1/l) Σ_{i=1}^l θ(−y_i f^i(x_i))   (5)

Notice that for simplicity there is a small abuse of notation here, since the leave-one-out error typically refers to a learning method while here we use the solution f in the notation. The leave-one-out error provides an estimate of the average generalization performance of a machine. It is known that the expectation of the generalization error of a machine trained using l points is equal to the expectation of the leave-one-out error of a machine trained on l + 1 points. This is summarized by the following theorem, originally due to Luntz and Brailovsky (see Vapnik, 1998).

Table 1. Notation.

f : Real valued prediction rule of one learning machine, f : X → R
V(f, y) : Loss function
P(x, y) : Probability distribution underlying the data
D_l : Set of i.i.d. examples sampled from P(x, y), D_l = {(x_i, y_i) ∈ X × {−1, 1}}_{i=1}^l
D_l^i : The set D_l \ {(x_i, y_i)}
f_{D_l} : Learning machine (e.g. SVM) trained on D_l; also denoted by f
Loo_{D_l}(f) : Leave-one-out error of f on the data set D_l
π_δ(x) : Soft margin loss, π_δ(x) = 0 if x < −δ, 1 if x > 0, and x/δ + 1 if −δ ≤ x ≤ 0
Loo_{δ,D_l}(f) : Leave-one-out error with soft margin π_δ
β_l : Uniform stability of f
D_{r,t} or D_{r,t} ⊂ D_l : Set of r points sampled uniformly from D_l, used by machine t, t = 1, ..., T
D_r ⊂ D_l : Set of r points sampled uniformly from D_l
(D_{r,t} ⊂ D_l)^i : Original D_{r,t} with point (x_i, y_i) removed
F_{r,T}, or just F : Ensemble of T machines, F_{r,T} = Σ_{t=1}^T c_t f_{D_{r,t}}
F̂ : Expected combination of machines, E_{D_r ⊂ D_l}[f_{D_r}]
DLoo_{D_l}(F) : Deterministic leave-one-out error
DLoo_{δ,D_l}(F) : Deterministic leave-one-out error with soft margin π_δ

Theorem 2.1. Suppose f_{D_l} is the outcome of a deterministic learning algorithm. Then

E_{D_l}[E_{(x,y)}[θ(−y f_{D_l}(x))]] = E_{D_{l+1}}[Loo_{D_{l+1}}(f_{D_{l+1}})].

As observed (Kearns & Ron, 1999), this theorem can be extended to general learning algorithms by adding a randomizing preprocessing step. The way the leave-one-out error is computed can however be different depending on the randomness. Consider the previous ensemble of kernel machines (4). The data sets D_{r,t}, t = 1, ..., T are drawn randomly from the training set D_l. We can then compute a leave-one-out error estimate, for example, in either of the following ways:

1. For i = 1, ..., l, remove (x_i, y_i) from D_l and sample new data sets D_{r,t}, t = 1, ..., T from D_l^i. Compute the f_{D_{r,t} ⊂ D_l^i} and then average the errors of the resulting ensemble machine computed on (x_i, y_i). This leads to the classical definition of the leave-one-out error and can be computed as:

Loo_{D_l}(F) = (1/l) Σ_{i=1}^l θ(−y_i (1/T) Σ_{t=1}^T f_{D_{r,t} ⊂ D_l^i}(x_i))   (6)

2. For i = 1, ..., l, remove (x_i, y_i) from each D_{r,t} ⊂ D_l. Compute the f_{(D_{r,t} ⊂ D_l)^i} and average the errors of the resulting ensemble machine computed on (x_i, y_i). Note that we have used the notation (D_{r,t} ⊂ D_l)^i to denote the set D_{r,t} ⊂ D_l where (x_i, y_i) has been removed.

This leads to what we will call a deterministic version of the leave-one-out error, in short det-leave-one-out error, or DLoo:

DLoo_{D_l}(F) = (1/l) Σ_{i=1}^l θ(−y_i (1/T) Σ_{t=1}^T f_{(D_{r,t} ⊂ D_l)^i}(x_i))   (7)

Note that the first computation requires re-sampling new data sets for each leave-one-out round, while the second computation uses the same subsampled data sets for each leave-one-out round, removing at most one point from each of them. In a sense, the det-leave-one-out error is then more deterministic than the classical computation (6). In this paper, we will consider mainly the det-leave-one-out error, for which we will derive easy-to-compute bounds and from which we will bound the generalization error of ensemble machines. Finally notice that the size of the subsampling is implicit in the notation DLoo_{D_l}(F): r is fixed in this paper so there is no need to complicate the notation further.

3. Leave-one-out error estimates of kernel machine ensembles

We begin with some known results about the leave-one-out error of kernel machines. The following theorem is from Jaakkola and Haussler (1998):

Theorem 3.1. The leave-one-out error of a kernel machine (3) is upper bounded as:

Loo_{D_l}(f) ≤ (1/l) Σ_{i=1}^l θ(α_i K(x_i, x_i) − y_i f_{D_l}(x_i))   (8)

where f_{D_l} is the optimal function found by solving problem (3) on the whole training set.

In the particular case of SVMs where the data are separable, the r.h.s. of Eq. (8) can be bounded by geometric quantities, namely (Vapnik, 1998):

Loo_{D_l}(f) ≤ (1/l) Σ_{i=1}^l θ(α_i K(x_i, x_i) − y_i f_{D_l}(x_i)) ≤ (1/l) (d²_sv / ρ²)   (9)

where d_sv is the radius of the smallest sphere in the feature space induced by kernel K (Wahba, 1990; Vapnik, 1998), centered at the origin, containing the support vectors, that is d_sv = max_{i: α_i > 0} √(K(x_i, x_i)), and ρ is the margin (ρ² = 1/‖f‖²_K) of the SVM.

Using this result, the next theorem is a direct application of Theorem 2.1:

Theorem 3.2. Suppose that the data are separable by the SVM. Then, the average generalization error of an SVM trained on l points is upper bounded by

(1/(l + 1)) E_{D_l}[ d²_sv(l) / ρ²(l) ],

where the expectation E is taken with respect to the probability of a training set D_l of size l.

Notice that this result shows that the performance of the SVM does not depend only on the margin, but also on other geometric quantities, namely the radius d_sv.

We now extend these results to the case of ensembles of kernel machines. In the particular case of bagging, the subsampling of the training data should be deterministic. By this we mean that when the bounds on the leave-one-out error are used for model (parameter) selection, for each model the same subsample sets of the data need to be used. These subsamples, however, are still random ones. We believe that the results presented below also hold (with minor modifications) in the general case that the subsampling is always random. We now consider the det-leave-one-out error of such ensembles.

Theorem 3.3. The det-leave-one-out error of a kernel machine ensemble is upper bounded by:

DLoo_{D_l}(F) ≤ (1/l) Σ_{i=1}^l θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ).   (10)

The proof of this Theorem is based on the following lemma shown in Vapnik (1998) and Jaakkola and Haussler (1998):

Lemma 3.1. Let α_i be the coefficient of the solution f(x) of machine (3) corresponding to point (x_i, y_i), α_i > 0. Let f^i(x) be the solution of machine (3) found when the data point (x_i, y_i) is removed from the training set. Then

y_i f^i(x_i) ≥ y_i f(x_i) − α_i K(x_i, x_i).

Using Lemma 3.1 we can now prove Theorem 3.3.

Proof of Theorem 3.3: Let F^i(x) = Σ_{t=1}^T c_t f^{(t)i}(x) be the ensemble machine trained with all initial training data except (x_i, y_i) (the subsets D_{r,t} are the original ones, only (x_i, y_i) is removed from them). Lemma 3.1 gives that

y_i F^i(x_i) = y_i Σ_{t=1}^T c_t f^{(t)i}(x_i) ≥ Σ_{t=1}^T c_t [ y_i f^(t)(x_i) − α_i^(t) K^(t)(x_i, x_i) ] = y_i F(x_i) − Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i),

from which it follows that:

θ(−y_i F^i(x_i)) ≤ θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ).

Therefore the leave-one-out error (1/l) Σ_{i=1}^l θ(−y_i F^i(x_i)) is not more than

(1/l) Σ_{i=1}^l θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ),

which proves the Theorem.
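The bound (10) is cheap to evaluate because it only needs the machines trained once on their original subsets. The sketch below, under the same illustrative assumptions as the earlier snippet (scikit-learn, a Gaussian kernel, labels in {−1, +1}), builds a bagged ensemble of SVMs on subsamples drawn without replacement and evaluates the right hand side of (10); for the Gaussian kernel K(x_i, x_i) = 1. It is a sketch of the computation, not the authors' implementation.

import numpy as np
from sklearn.svm import SVC

def dloo_bound(X, y, T=30, frac=0.4, C=0.5, gamma=0.5, seed=0):
    """Right hand side of Eq. (10) for a bagged ensemble with c_t = 1/T."""
    rng = np.random.default_rng(seed)
    l = len(y)
    r = int(frac * l)
    c_t = 1.0 / T
    F = np.zeros(l)              # ensemble output F(x_i) on the training points
    corr = np.zeros(l)           # sum_t c_t * alpha_i^(t) * K^(t)(x_i, x_i)
    for _ in range(T):
        idx = rng.choice(l, size=r, replace=False)           # D_{r,t}: subsample without replacement
        svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
        F += c_t * svm.decision_function(X)
        alpha = np.abs(svm.dual_coef_).ravel()               # alpha_i^(t) of machine t (support vectors only)
        corr[idx[svm.support_]] += c_t * alpha * 1.0         # K(x_i, x_i) = 1 for the Gaussian kernel
    # theta(sum_t c_t alpha_i^(t) K^(t)(x_i, x_i) - y_i F(x_i)), averaged over i
    return np.mean(corr - y * F > 0)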

Notice that the bound has the same form as the bound in Eq. (8): for each point (x_i, y_i) we only need to take into account its corresponding parameters α_i^(t) and remove their effects from the value of F(x_i).

The det-leave-one-out error can also be bounded using geometric quantities. To this purpose we introduce one more parameter that we call the ensemble margin (in contrast to the margin of a single SVM). For each point (x_i, y_i) we define its ensemble margin to be y_i F(x_i). This is exactly the definition of margin in Schapire et al. (1998). For any given δ > 0 we define Err_δ to be the empirical error with ensemble margin less than δ,

Err_δ(F) = (1/l) Σ_{i=1}^l θ(−y_i F(x_i) + δ),

and by N_δ the set of the remaining training points, the ones with ensemble margin ≥ δ. Finally, we denote by d_{t(δ)} the radius of the smallest sphere in the feature space induced by kernel K^(t), centered at the origin, which contains the points of machine t with α_i^(t) > 0 and ensemble margin larger than δ.

Corollary 3.1. For any δ > 0 the det-leave-one-out error of a kernel machine ensemble is upper bounded by:

DLoo_{D_l}(F) ≤ Err_δ(F) + (1/l) (1/δ) Σ_{t=1}^T c_t d²_{t(δ)} ( Σ_{i ∈ N_δ} α_i^(t) )   (11)

Proof: For each training point (x_i, y_i) with ensemble margin y_i F(x_i) < δ we upper bound θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) with 1 (this is a trivial bound). For the remaining points (the points in N_δ) we show that:

θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) ≤ (1/δ) Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i).   (12)

In the case that Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) < 0, Eq. (12) is trivially satisfied. If Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ≥ 0, then

θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) = 1,

while

(1/δ) Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) ≥ (1/δ) y_i F(x_i) ≥ (1/δ) δ = 1.

So in both cases inequality (12) holds. Therefore:

Σ_{i=1}^l θ( Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) − y_i F(x_i) ) ≤ l Err_δ + (1/δ) Σ_{i ∈ N_δ} Σ_{t=1}^T c_t α_i^(t) K^(t)(x_i, x_i) ≤ l Err_δ + (1/δ) Σ_{t=1}^T c_t d²_{t(δ)} ( Σ_{i ∈ N_δ} α_i^(t) ).

The statement of the corollary follows by applying Theorem 3.3.

Notice that Eq. (11) holds for any δ > 0, so the best bound is obtained for the minimum of the right hand side with respect to δ > 0. Using Theorem 2.1, Theorems 3.3 and 3.1 provide bounds on the average generalization performance of general kernel machine ensembles like that of Theorem 3.2. We now consider the particular case of SVM ensembles. In this case we have the following:

Corollary 3.2. Suppose that each SVM in the ensemble separates the data set used during training. Then, the det-leave-one-out error of an ensemble of SVMs is upper bounded by:

DLoo_{D_l}(F) ≤ Err_1(F) + (1/l) Σ_{t=1}^T c_t (d²_t / ρ²_t)   (13)

where Err_1 is the margin empirical error with ensemble margin 1, d_t is the radius of the smallest sphere, centered at the origin, in the feature space induced by kernel K^(t), containing the support vectors of machine t, and ρ_t is the margin of the t-th SVM.

Proof: We choose δ = 1 in (11). Clearly we have that d_t ≥ d_{t(δ)} for any δ, and Σ_{i ∈ N_δ} α_i^(t) ≤ Σ_{i=1}^l α_i^(t) = 1/ρ²_t (see Vapnik (1998) for a proof of this equality).

Notice that the average generalization performance of the SVM ensemble now depends on the average (convex combination of) D²/ρ² of the individual machines.
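A hedged computational sketch of bound (13), under the same illustrative assumptions as the previous snippets (scikit-learn, Gaussian kernel, labels in {−1, +1}): each machine contributes d²_t/ρ²_t, where 1/ρ²_t is computed as the RKHS norm ‖f_t‖²_K of its solution and d²_t = max_i K(x_i, x_i) over its support vectors (equal to 1 for the Gaussian kernel). Note that Corollary 3.2 formally requires each SVM to separate its subsample; the code evaluates the right hand side regardless.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def corollary_3_2_bound(X, y, T=30, frac=0.4, C=0.5, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    l = len(y)
    c_t = 1.0 / T
    F = np.zeros(l)
    geom = 0.0
    for _ in range(T):
        idx = rng.choice(l, size=int(frac * l), replace=False)
        svm = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx])
        F += c_t * svm.decision_function(X)
        K_sv = rbf_kernel(svm.support_vectors_, gamma=gamma)
        ay = svm.dual_coef_.ravel()                 # alpha_i * y_i on the support vectors
        inv_rho2 = ay @ K_sv @ ay                   # ||f_t||_K^2 = 1 / rho_t^2
        d2 = np.max(np.diag(K_sv))                  # d_t^2 (= 1 for the Gaussian kernel)
        geom += c_t * d2 * inv_rho2
    err_1 = np.mean(y * F < 1)                      # empirical error with ensemble margin 1
    return err_1 + geom / l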

In some cases this may be smaller than the D²/ρ² of a single SVM. For example, suppose we train many SVMs on different sub-samples of the training points and we want to compare such an ensemble with a single SVM using all the points. If all SVMs (the single one, as well as the individual ones of the ensemble) have most of their training points as support vectors, then clearly the D² of each SVM in the ensemble is smaller than that of the single SVM. Moreover the margin of each SVM in the ensemble is expected to be larger than that of the single SVM using all the points. So the average D²/ρ² in this case is expected to be smaller than that of the single SVM.

Another case where an ensemble of SVMs may be better than a single SVM is the one where there are outliers among the training data. If the individual SVMs are trained on subsamples of the training data, some of the machines may have a smaller D²/ρ² because they do not use some of the outliers, which of course also depends on the choice of C for each of the machines. In general it is not clear when ensembles of kernel machines are better than single machines. The bounds in this section may provide some insight into this question.

Finally, we remark that all the results discussed hold for the case that there is no bias (threshold b), or the case where the bias is included in the kernel (as discussed in the introduction). In the experiments discussed below we use the results also in the case that the bias is not regularized (as discussed in Section 2 this means that the separating function includes a bias b, so it is f(x) = Σ_{i=1}^l α_i y_i K(x_i, x) + b), which is common in practice. Recent work in Chapelle and Vapnik (1999) may be used to extend our results to an ensemble of kernel machines with the bias not regularized: whether this can be done is an open question.

4. Experiments

To test how tight the bounds we presented are, we conducted a number of experiments using datasets from UCI, as well as the US Postal Service (USPS) dataset (LeCun et al., 1990). We show results for some of the sets in figures 1-5. For each dataset we split the overall set into training and testing (the sizes are shown in the figures) in 50 different (random) ways, and for each split:

1. We trained one SVM with b = 0 using all training data, computed the leave-one-out bound given by Theorem 3.1, and then computed the test performance using the test set.
2. We repeated (1), this time with b ≠ 0.
3. We trained 30 SVMs with b = 0, each using a random subsample of size 40% of the training data (bagging), computed the leave-one-out bound given by Theorem 3.3 using c_t = 1/30, and then computed the test performance using the test set.
4. We repeated (3), this time with b ≠ 0.

We then averaged over the 50 training-testing splits the test performances and the leave-one-out bounds found, and computed the standard deviations. All machines were trained using a Gaussian kernel, and we repeated the procedure for a number of different σ's of the Gaussian, and for a fixed value of the parameter C (selected by hand so that it is less than 1 in figures 1-5, and more than 1 in figure 6, for reasons explained below; for simplicity we used the same value of C in figures 1-5, C = 0.5, but we found the same trend for other small values of C, C < 1).

We show the averages and standard deviations of the results in figures 1 to 5. In all figures we use the following notation: top left figure: bagging with b = 0; top right figure: single SVM with b = 0; bottom left figure: bagging with b ≠ 0; bottom right figure: single SVM with b ≠ 0. In each plot the solid line is the mean test performance and the dashed line is the bound computed using the leave-one-out Theorems 3.1 and 3.3. The dotted line is the validation set error discussed below. The horizontal axis shows the logarithm of the σ of the Gaussian kernel used. For simplicity, only one error bar (standard deviation over the 50 training-testing splits) is shown (the others were similar). Notice that even for training-testing splits for which the error is one standard deviation away from the mean over the 50 runs (i.e. instead of plotting the graphs through the center of the error bars, we plot them at the end of the bars) the bounds for combinations of machines are still tighter than for single machines in figures 3 to 5. The cost parameter C used is given in each of the figures. The horizontal axis is the natural logarithm of the σ of the Gaussian kernel used, while the vertical axis is the error.

[Figure 1 here: four error plots for the Breast Cancer data (C = 0.5, Train = 200, Test = 77); panels: bagging leave-one-out with b = 0, single SVM leave-one-out with b = 0, bagging leave-one-out with b ≠ 0, single SVM leave-one-out with b ≠ 0; horizontal axes: sigma.]

Figure 1. Breast cancer data: Top left figure: bagging with b = 0; Top right figure: single SVM with b = 0; Bottom left figure: bagging with b ≠ 0; Bottom right figure: single SVM with b ≠ 0. In each plot the solid line is the mean test performance and the dashed line is the bound computed using the leave-one-out Theorems 3.1 and 3.3. The dotted line is the validation set error discussed below. The horizontal axis shows the logarithm of the σ of the Gaussian kernel used.

[Figure 2 here: four error plots for the Thyroid data (C = 0.5, Train = 140, Test = 75), same panel layout as figure 1; horizontal axes: sigma.]

Figure 2. Thyroid data: Notation as in figure 1.

An interesting observation is that the bounds are always tighter for the case of bagging than they are for the case of a single SVM. This is an interesting experimental finding for which we provide a possible theoretical explanation in the next section. This finding can practically justify the use of ensembles of machines for model selection: parameter selection using the leave-one-out bounds presented in this paper is easier for ensembles of machines than it is for single machines. Another interesting observation is that the bounds seem to work similarly in the case that the bias b is not 0. In this case, as before, the bounds are tighter for ensembles of machines than they are for single machines.

Experimentally we found that the bounds presented here do not work well in the case that the C parameter used is large (C = 100). An example is shown in figure 6. Consider the leave-one-out bound for a single SVM given by Theorem 3.1. Let (x_i, y_i) be a support vector for which y_i f(x_i) < 1. It is known (Vapnik, 1998) that for these support vectors the coefficient α_i is C. If C is such that C K(x_i, x_i) > 1 (for example consider a Gaussian kernel with K(x, x) = 1 and any C > 1), then clearly θ(C K(x_i, x_i) − y_i f(x_i)) = 1.

[Figure 3 here: four error plots for the Diabetes data (C = 0.5, Train = 468, Test = 300), same panel layout as figure 1; horizontal axes: sigma.]

Figure 3. Diabetes data: Notation as in figure 1.

In this case the bound of Theorem 3.1 effectively counts all support vectors outside the margin (plus some of the ones on the margin, i.e. yf(x) = 1). This means that for large C (in the case of Gaussian kernels this can be, for example, for any C > 1), the bounds of this paper are effectively similar to (not larger than) another known leave-one-out bound for SVMs, namely one that uses the number of all support vectors to bound the generalization performance (Vapnik, 1998). So effectively our experimental results show that the number of support vectors does not provide a good estimate of the generalization performance of the SVMs and their ensembles.

5. Stability of ensemble methods

We now present a theoretical explanation of the experimental finding that the leave-one-out bound is tighter for the case of ensemble machines than it is for single machines. The analysis is done within the framework of stability and learning (Bousquet & Elisseeff, 2002). It has been proposed in the past that bagging increases the stability of the learning methods (Breiman, 1996). Here we provide a formal argument for this.

[Figure 4 here: four error plots for the Heart data (C = 0.5, Train = 170, Test = 100), same panel layout as figure 1; horizontal axes: sigma.]

Figure 4. Heart data: Notation as in figure 1.

As before, we denote by D_l^i the training set D_l without the example point (x_i, y_i). We use the following notion of stability defined in Bousquet and Elisseeff (2002).

Definition (Uniform stability). We say that a learning method is β_l stable with respect to a loss function V and training sets of size l if the following holds:

∀ i ∈ {1, ..., l}, ∀ D_l, ∀ (x, y) : |V(f_{D_l}(x), y) − V(f_{D_l^i}(x), y)| ≤ β_l.

Roughly speaking, the cost of a learning machine on a new (test) point (x, y) should not change by more than β_l when we train the machine with any training set of size l and when we train the machine with the same training set but one training point (any point) removed. Notice that this definition is useful mainly for real-valued loss functions V. To use it for classification machines we need to start with the real valued output (2) before thresholding.
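A hedged way to get a feel for this definition is to probe it empirically: retrain the machine with one point removed and record the largest change of the real-valued output over a set of evaluation points. This only gives a lower estimate of β_l (the definition takes a supremum over all training sets and removed points), the data and parameters are illustrative assumptions, and the unregularized bias kept by scikit-learn means the comparison with the SVM stability bound quoted on the following pages (Cκ²) is indicative only.

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, random_state=1)
y = 2 * y - 1
C, gamma = 0.5, 0.5
full = SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)          # f_{D_l}

beta_hat = 0.0
for i in range(len(y)):
    keep = np.arange(len(y)) != i                              # D_l^i: training set without point i
    loo = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[keep], y[keep])
    change = np.abs(full.decision_function(X) - loo.decision_function(X)).max()
    beta_hat = max(beta_hat, change)

print(beta_hat)        # empirical lower estimate of the uniform stability beta_l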

[Figure 5 here: four error plots for the Postal (USPS) data (C = 0.5, Train = 791, Test = 2007), same panel layout as figure 1; horizontal axes: sigma.]

Figure 5. USPS data: Notation as in figure 1.

We define for any given constant δ the leave-one-out error Loo_δ on a training set D_l to be:

Loo_{δ,D_l}(f) = (1/l) Σ_{i=1}^l π_δ(−y_i f_{D_l^i}(x_i)),

where the function π_δ(x) is 0 for x < −δ, 1 for x > 0, and x/δ + 1 for −δ ≤ x ≤ 0 (a soft margin function). For ensemble machines, we will consider again a definition similar to (7):

DLoo_{δ,D_l}(F) = (1/l) Σ_{i=1}^l π_δ(−y_i (1/T) Σ_{t=1}^T f_{(D_{r,t} ⊂ D_l)^i}(x_i)).

Notice that for δ → 0 we get the leave-one-out errors that we defined in Section 2, namely Eqs. (5) and (7), and clearly DLoo_{0,D_l}(F) ≤ DLoo_{δ,D_l}(F) for all δ > 0.
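A minimal sketch of the soft margin function π_δ and of the corresponding error: the inputs are assumed to be the real-valued outputs on the left-out points (however they were obtained) and labels in {−1, +1}.

import numpy as np

def pi_delta(x, delta):
    # 0 for x < -delta, x/delta + 1 for -delta <= x <= 0, 1 for x > 0
    return np.clip(x / delta + 1.0, 0.0, 1.0)

def loo_soft_margin(f_loo, y, delta):
    # Loo_{delta,D_l}(f) = (1/l) * sum_i pi_delta(-y_i * f^i(x_i));
    # with ensemble outputs in place of f^i(x_i) this gives DLoo_{delta,D_l}(F).
    return np.mean(pi_delta(-y * f_loo, delta))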

[Figure 6 here: four error plots for the Postal (USPS) data (C = 100, Train = 791, Test = 2007), same panel layout as figure 1; horizontal axes: sigma.]

Figure 6. USPS data: Using a large C (C = 100). In this case the bounds do not work; see the text for an explanation. Notation as in figure 1.

Let β_l be the stability of the kernel machine for the real valued output w.r.t. the ℓ1 norm, that is:

∀ i ∈ {1, ..., l}, ∀ D_l, ∀ x : |f_{D_l}(x) − f_{D_l^i}(x)| ≤ β_l.

For SVMs it is known (Bousquet & Elisseeff, 2002) that β_l is upper bounded by Cκ², where κ = sup_{x ∈ X} √(K(x, x)) is assumed to be finite. The bound on the stability of an SVM is not explicitly dependent on the size of the training set l. However, the value of C is often chosen such that C is small for large l. In the former experiments, C is fixed for all machines, which are trained on learning sets of the same size. This means that they all have the same stability for the ℓ1 norm.

We first state a bound on the expected error of a single kernel machine in terms of its Loo_δ error. The following theorem is from Bousquet and Elisseeff (2002).

Theorem 5.1. For any given δ, with probability 1 − η the generalization misclassification error of an algorithm that is β_l stable w.r.t. the ℓ1 norm is bounded as:

E_{(x,y)}[θ(−y f_{D_l}(x))] ≤ Loo_{δ,D_l}(f_{D_l}) + β_l + √( (l/2) (2β_l/δ + 1/l)² ln(1/η) ),

where β_l is assumed to be a non-increasing function of l.

Notice that the bound holds for a given constant δ. One can derive a bound that holds uniformly for all δ and therefore use the best δ (i.e. the empirical margin of the classifier) (Bousquet & Elisseeff, 2002). For an SVM, the value of β_l is equal to Cκ². Theorem 5.1 then provides the following bound:

E_{(x,y)}[θ(−y f_{D_l}(x))] ≤ Loo_{δ,D_l}(f_{D_l}) + Cκ² + √( (l/2) (2Cκ²/δ + 1/l)² ln(1/η) ).

The value of C is often a function of l. Depending on the way C decreases with l, this bound can be tight or loose.

We now study a similar generalization bound for an ensemble of machines where each machine uses only r points drawn randomly with the uniform distribution from the training set. We consider only the case where the coefficients c_t of (4) are all 1/T (so taking the average machine, like in standard bagging (Breiman, 1996)). Such an ensemble is very close to the original idea of bagging, despite some differences, namely that in standard bagging each machine uses a training set of size equal to the size of the original set, created by random subsampling with replacement, instead of using only r points. We will consider the expected combination F̂ defined as:

F̂(x) = E_{D_r ⊂ D_l}[f_{D_r}(x)]

where the expectation is taken with respect to the training data D_r of size r drawn uniformly from D_l. The stability bounds we present below hold for this expected combination and not for the finite combination considered so far; as mentioned below, how close these two are is an open question. The leave-one-out error we define for this expectation is again like in (7) (as in Eq. (7), the size r of the subsamples is for simplicity not included in the notation):

DLoo_{δ,D_l}(F̂) = (1/l) Σ_{i=1}^l π_δ(−y_i E_{D_r ⊂ D_l}[f_{D_r^i}(x_i)])

which is different from the standard leave-one-out error:

(1/l) Σ_{i=1}^l π_δ(−y_i E_{D_r ⊂ D_l^i}[f_{D_r}(x_i)]),

which corresponds to (6).

As an extreme case, when T → ∞:

DLoo_{δ,D_l}( (1/T) Σ_{t=1}^T f_{D_{r,t}} ) → DLoo_{δ,D_l}(F̂)   (14)

This relation motivates the choice of our method of calculation for the leave-one-out estimate in Section 3. Indeed the right hand side of the equation corresponds to the quantity that we have bounded in Sections 3 and 4 and that ultimately we would like to relate to the stability of the base machine. It is an open question to measure how fast the convergence in Eq. (14) is. As we discuss below, and as also mentioned in Breiman (1996), increasing T beyond a certain value (typically small, i.e. 100) does not influence the performance of bagging, which may imply that the convergence (14) is fast.

We then have the following bound on the expected error of ensemble combinations:

Theorem 5.2. For any given δ, with probability 1 − η the generalization misclassification error of the expected combination of classifiers F̂, each using a subsample of size r of the training set and each having a stability β_r w.r.t. the ℓ1 norm, is bounded as:

E_{(x,y)}[θ(−y F̂(x))] ≤ DLoo_{δ,D_l}(F̂) + (r/l) β_r + √( (1/(2l)) (2β_r r/δ + 1)² ln(1/η) )

Proof: We will apply the stability Theorem 5.1 to the following algorithm: On a set of size l, the algorithm is the same as the expected ensemble machine we consider. On a training set of size l − 1, it adds a dummy input pair (x_0, y_0) and uses the same sampling scheme as the one used with D_l. That is, D_r is sampled from D_{l−1} ∪ {(x_0, y_0)} with the same distribution as it is sampled from D_l in the definition of F̂. When (x_0, y_0) is drawn in D_r, it is not used in training, so that f_{D_r} is replaced by f_{D_r \ {(x_0, y_0)}}. The new algorithm, which we will call G, can then be expressed as G(x) = E_{D_r ⊂ D_l}[f_{D_r}(x)], and G^i, its outcome on the set D_l^i, is equal to G^i(x) = E_{D_r ⊂ D_l}[f_{D_r^i}(x)], where (x_i, y_i) plays the role of the dummy pair (x_0, y_0) previously mentioned. The resulting algorithm then has the same behavior on training sets of size l as the ensemble machine we consider, and the classical leave-one-out error for G corresponds to the det-leave-one-out error we have defined previously for F̂. From that perspective, it is sufficient to show that G is (r/l)β_r stable w.r.t. the ℓ1 norm and to apply Theorem 5.1. We have:

G − G^i = E_{D_r ⊂ D_l}[f_{D_r}] − E_{D_r ⊂ D_l}[f_{D_r^i}]

where D_r^i = D_r \ (x_i, y_i). We have by definition:

G − G^i = ∫ f_{D_r} dP − ∫ f_{D_r^i} dP,

where P denotes here the distribution over the sampling of D_r from D_l.

Defining the indicator function 1_A of the set A to be 1_A(z) = 1 iff z ∈ A, we decompose each of the integrals as follows:

G − G^i = ∫ f_{D_r} 1_{(x_i,y_i) ∈ D_r} dP + ∫ f_{D_r} 1_{(x_i,y_i) ∉ D_r} dP − ∫ f_{D_r^i} 1_{(x_i,y_i) ∈ D_r} dP − ∫ f_{D_r^i} 1_{(x_i,y_i) ∉ D_r} dP.

Clearly, if (x_i, y_i) ∉ D_r, then D_r^i = D_r, so that:

G − G^i = ∫ f_{D_r} 1_{(x_i,y_i) ∈ D_r} dP − ∫ f_{D_r^i} 1_{(x_i,y_i) ∈ D_r} dP,

which implies, for every x,

|G(x) − G^i(x)| ≤ β_r ∫ 1_{(x_i,y_i) ∈ D_r} dP = β_r P[(x_i, y_i) ∈ D_r],

where the probability is taken with respect to the random subsampling of the data set D_r from D_l. Since this subsampling is done without replacement, this probability is equal to r/l, which finally gives a bound of (r/l)β_r on the stability of G = E_{D_r ⊂ D_l}[f_{D_r}]. This result plugged into the previous theorem gives the final bound.

This theorem holds for ensemble combinations that are theoretically defined from the expectation E_{D_r ⊂ D_l}[f_{D_r}]. Notice that the hypotheses do not require that the combination is formed only by the same type of machines. In particular, one can imagine an ensemble of different kernel machines with different kernels. We formalize this remark in the following:

Theorem 5.3. Let F̂_S be a finite combination of SVMs f^s, s = 1, ..., S, with different kernels K_1, ..., K_S:

F̂_S = (1/S) Σ_{s=1}^S E_{D_r ⊂ D_l}[f^s_{D_r}]   (15)

where f^s_{D_r ⊂ D_l} is an SVM with kernel K_s learned on D_r. Denote as before by DLoo_{δ,D_l}(F̂_S) the det-leave-one-out error of F̂_S computed with the function π_δ. Assume that each of the f^s_{D_r ⊂ D_l} is learned with the same C on a subset D_r of size r drawn from D_l with a uniform distribution. For any given δ, with probability 1 − η, the generalization misclassification error is bounded as:

E_{(x,y)}[θ(−y F̂_S(x))] ≤ DLoo_{δ,D_l}(F̂_S) + (r/l) Cκ² + √( (1/(2l)) (2Cκ² r/δ + 1)² ln(1/η) ),

where κ² = (1/S) Σ_{s=1}^S sup_{x ∈ X} K_s(x, x).

Proof: As before, we study

G − G^i = (1/S) Σ_{s=1}^S ( E_{D_r ⊂ D_l}[f^s_{D_r}] − E_{D_r ⊂ D_l}[f^s_{D_r^i}] ).

Following the same calculations as in the previous theorem for each of the summands, we have:

|G − G^i| ≤ (1/S) Σ_{s=1}^S β_{r,s} ∫ 1_{(x_i,y_i) ∈ D_r} dP,

where β_{r,s} denotes the stability of an SVM with kernel K_s on a set of size r, and P is the distribution over the sampling of D_r from D_l. As before, since (x_i, y_i) appears in D_r only a fraction r/l of the time on average, we have the following bound:

|G − G^i| ≤ (1/S) Σ_{s=1}^S β_{r,s} (r/l).

Replacing β_{r,s} by its value for the case of SVMs yields a bound on the generalization error of G in terms of its leave-one-out error. This translates for F̂_S into a bound on its generalization error in terms of its det-leave-one-out error, which is the statement of the theorem.

Notice that Theorem 5.3 holds for combinations of kernel machines where for each kernel we use many machines trained on subsamples of the training set. So it is an ensemble of ensembles (see Eq. (15)).

Compared to what has been derived for a single SVM, combining SVMs provides a tighter bound on the generalization error. This result can then be interpreted as an explanation of the better estimation of the test error by the det-leave-one-out error for ensemble methods. The bounds given by the previous theorems have the form:

E_{(x,y)}[θ(−y F(x))] ≤ DLoo_{δ,D_l}(F) + O( (C_r κ² r/δ) √(ln(1/η) / l) ),

while the bound for a single SVM is:

E_{(x,y)}[θ(−y f(x))] ≤ Loo_{δ,D_l}(f) + O( (C_l κ²/δ) √(l ln(1/η)) ).

We have indexed the parameters C with an index that indicates that the SVMs are not learned with the same training set size in the first and in the second case. In the experiments, the same C was used for all SVMs (C_l = C_r). The bound derived for a combination of SVMs is then tighter than for a single SVM by a factor of r/l. The improvement is because the stability of the combination of SVMs is better than the stability of a single SVM. This is true if we assume that both SVMs are trained with the same C, but the discussion becomes more tricky if different C's are used during learning. The stability of SVMs depends indeed on the way the value of C is determined. For a single SVM, C is generally a function of l, and for a combination of SVMs, C also depends on the size of the subsampled learning sets D_{r,t}. In Theorem 5.2, we have seen that the stability of the combination of machines was smaller than (r/l)β_r, where β_r is equal to Cκ² for SVMs. If this stability is better than the stability of a single machine, then combining the functions f_{D_{r,t}} provides a better bound. However, in the other case, the bound gets worse. We have the following corollary whose proof is direct:

Corollary 5.1. If a learning system is β_l stable and β_l/β_r < r/l, then combining these learning systems does not provide a better bound on the difference between the test error and the leave-one-out error. Conversely, if β_l/β_r > r/l, then combining these learning systems leads to a better bound on the difference between the test error and the leave-one-out error.

This corollary gives an indication that combining machines should not be used if the stability of the single machine is very good. Notice that the corollary is about bounds, and not about whether the generalization error for bagging, or the actual difference between the test and leave-one-out errors, is always smaller for unstable machines (and larger for stable ones): this depends on how tight the bounds are in every case. However, it is not often the case that we have a highly stable single machine, and therefore typically bagging improves stability. In such a situation, the bounds presented in this paper show that we have better control of the generalization error for combinations of SVMs, in the sense that the leave-one-out and the empirical errors are closer to the test error. The bounds presented do not necessarily imply that the generalization error of bagging is less than that of single machines. Similar remarks have already been made by Breiman (1996) for bagging, where similar considerations of stability are experimentally discussed.

Another remark that can be made from the work of Breiman is that bagging does not improve performance after a certain number of bagged predictors. On the other hand, it does not reduce performance either. This experimentally derived statement can be translated in our framework as: when T increases, the stability of the combined learning system tends to the stability of the expectation E_{D_r ⊂ D_l}[f_{D_r}], which does not improve after T has passed a certain value. This value may correspond to the convergence of the finite sum (1/T) Σ_{t=1}^T f_{D_{r,t}} to its expectation w.r.t. D_{r,t} ⊂ D_l.

Finally, it is worthwhile noticing that the stability analysis of this section also holds for the empirical error. Indeed, for a β_l stable algorithm, as is underlined in Bousquet and Elisseeff (2002), the leave-one-out error and the empirical error are related by:

Loo_{δ,D_l}(f) ≤ Err_0(f_{D_l}) + β_l,

where Err_0(f_{D_l}) is the empirical error on the learning set D_l.

Using this inequality in Theorems 5.2 and 5.3 for the algorithm G, we can bound the generalization error of F in terms of the empirical error and the stability of the machines.

6. Other ensembles and error estimates

6.1. Validation set for model selection

Instead of using bounds on the generalization performance of learning machines like the ones discussed above, an alternative approach for model selection is to use a validation set to choose the parameters of the machines. We consider first the simple case where we have N machines and we choose the best one based on the error they make on a fixed validation set of size V. This can be thought of as a special case where we consider as hypothesis space the set of the N machines, and then we train by simply picking the machine with the smallest empirical error (in this case this is the validation error). It is known that if VE_i is the validation error of machine i and TE_i is its true test error, then for all N machines simultaneously the following bound holds with probability 1 − η (Devroye, Györfi, & Lugosi, 1996; Vapnik, 1998):

TE_i ≤ VE_i + √( (log N − log(η/4)) / V ).   (16)

So how accurately we pick the best machine using the validation set depends, as expected, on the number of machines N and on the size V of the validation set. The bound suggests that a validation set can be used to accurately estimate the generalization performance of a relatively small number of machines (i.e. a small number of parameter values examined), as is often done in practice.

We used this observation for parameter selection for SVMs and for their ensembles. Experimentally we followed a slightly different procedure from what is suggested by bound (16). For each machine (that is, for each σ of the Gaussian kernel in our case, both for a single SVM and for an ensemble of machines) we split the training set (for each training-testing split of the overall dataset as described above) into a smaller training set and a validation set (70-30% respectively). We trained each machine using the new, smaller training set, and measured the performance of the machine on the validation set. Unlike what bound (16) suggests, instead of comparing the validation performance found with the generalization performance of the machines trained on the smaller training set (which is the case for which bound (16) holds), we compared the validation performance with the test performance of the machine trained using all the initial (larger) training set. This way we did not have to use fewer points for training the machines, which is a typical drawback of using a validation set, and we could compare the validation performance with the leave-one-out bounds and the test performance of the exact same machines we used in Section 4.
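A hedged sketch of the simple selection procedure suggested by bound (16), with an illustrative 70-30 split and grid of σ values (the experiments above deviate from this, as just described); scikit-learn is assumed.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def select_by_validation(X, y, sigmas, C=0.5, eta=0.05, seed=0):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    V, N = len(y_val), len(sigmas)
    slack = np.sqrt((np.log(N) - np.log(eta / 4)) / V)   # bound (16): TE_i <= VE_i + slack, for all N machines at once
    results = []
    for sigma in sigmas:
        svm = SVC(C=C, kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X_tr, y_tr)
        ve = np.mean(svm.predict(X_val) != y_val)        # validation error VE_i
        results.append((sigma, ve, ve + slack))
    return min(results, key=lambda t: t[1])              # machine with the smallest validation error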

We show the results of these experiments in figures 1-5 (see the dotted lines in the plots). We observe that although the validation error is that of a machine trained on a smaller training set, it still provides a very good estimate of the test performance of the machines trained on the whole training set. In all cases, including the case of C > 1 for which the leave-one-out bounds discussed above did not work well, the validation set provided a very good estimate of the test performance of the machines.

6.2. Adaptive combinations of learning machines

The ensembles of kernel machines (4) considered so far are voting combinations where the coefficients c_t in (4) of the linear combination of the machines are fixed. We now consider the case where these coefficients are also learned. In particular we consider the following two-layer architecture:

1. A number T of kernel machines is trained as before (for example using different training data, or different parameters). Let f_t(x), t = 1, ..., T be the machines.
2. The T outputs (real valued in our experiments, but they could also be thresholded binary ones) of the machines at each of the training points are computed.
3. A linear machine (i.e. a linear SVM) is trained using as inputs the outputs of the T machines on the training data, and as labels the original training labels. The solution is used as the coefficients c_t of the linear combination of the T machines.

In this case the ensemble machine F(x) is a kernel machine itself, which is trained using as kernel the function K(x, x′) = Σ_{t=1}^T f_t(x) f_t(x′). Notice that since each of the machines f_t(x) depends on the data, the kernel K is also data dependent. Therefore the stability parameter of the ensemble machine is more difficult to compute (when a data point is left out the kernel K changes). Likewise the leave-one-out bound of Theorem 3.3 does not hold, since the theorem assumes fixed coefficients c_t.

On the other hand, an important characteristic of this type of ensembles is that, independently of what kernels/parameters each of the individual machines of the ensemble uses, the second layer machine (which finds the coefficients c_t) always uses a linear kernel. This may imply that the overall architecture is less sensitive to the kernel/parameters of the machines of the ensemble. We tested this hypothesis experimentally by comparing how the test performance of this type of machines changes with the σ of the Gaussian kernel used by the individual machines of the ensemble, and compared the behavior with that of single machines and ensembles of machines with fixed c_t. In figure 7 we show two examples.

In our experiments, for all datasets except one, learning the coefficients c_t of the combination of the machines using a linear machine (we used a linear SVM) made the overall machine less sensitive to changes of the parameters of the individual machines (the σ of the Gaussian kernel). This can be a useful characteristic of the architecture outlined in this section. For example the choice of the kernel parameters of the machines of the ensembles need not be tuned accurately.
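A hedged sketch of this two-layer architecture (again assuming scikit-learn and illustrative parameters, not the paper's implementation): the first layer is a set of kernel machines trained on random subsamples, the second layer is a linear SVM trained on their real-valued outputs, and its weights play the role of the coefficients c_t.

import numpy as np
from sklearn.svm import SVC, LinearSVC

def train_adaptive_combination(X, y, T=30, frac=0.4, C=0.5, gamma=0.5, seed=0):
    rng = np.random.default_rng(seed)
    l = len(y)
    machines = []
    for _ in range(T):
        idx = rng.choice(l, size=int(frac * l), replace=False)
        machines.append(SVC(C=C, kernel="rbf", gamma=gamma).fit(X[idx], y[idx]))
    # First-layer real-valued outputs on the training points become the second-layer inputs.
    Z = np.column_stack([m.decision_function(X) for m in machines])
    combiner = LinearSVC(C=1.0).fit(Z, y)                 # learns the combination coefficients c_t

    def predict(X_new):
        Z_new = np.column_stack([m.decision_function(X_new) for m in machines])
        return combiner.predict(Z_new)

    return predict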

[Figure 7 here: two error plots, Heart data (left) and Diabetes data (right); horizontal axes: sigma.]

Figure 7. When the coefficients of the second layer are learned using a linear SVM the system is less sensitive to changes of the σ of the Gaussian kernel used by the individual machines of the ensemble. The solid line is one SVM, the dotted line is an ensemble of 30 SVMs with fixed c_t = 1/30, and the dashed line is an ensemble of 30 SVMs with the coefficients c_t learned. The horizontal axis shows the natural logarithm of the σ of the Gaussian kernel. Left is the Heart dataset, and right is the Diabetes one. The threshold b is non-zero for these experiments.

6.3. Ensembles versus single machines

So far we concentrated on the theoretical and experimental characteristics of ensembles of kernel machines. We now discuss how ensembles compare with single machines. Table 2 shows the test performance of one SVM compared with that of an ensemble of 30 SVMs combined with c_t = 1/30 and an ensemble of 30 SVMs combined using a linear SVM, for some UCI datasets (characteristic results). For the tables of this section we use, for convenience, the following notation: VCC stands for Voting Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are fixed. ACC stands for Adaptive Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are learned-adapted. We only consider SVMs and ensembles of SVMs with the threshold b. The table shows mean test errors and standard deviations for the best (decided using the validation set performance in this case) parameters of the machines (σ's of the Gaussians and parameter C, hence different from figures 1-5 which were for a given C).

Table 2. Average errors and standard deviations (percentages) of the best machines (best σ of the Gaussian kernel and best C) chosen according to the validation set performances. The performances of the machines are about the same. VCC and ACC use 30 SVMs. (Some entries are illegible in this transcription.)

Dataset    SVM          VCC        ACC
Breast     25.5 ± …     … ± …      … ± 4.0
Thyroid    5.1 ± …      … ± …      … ± 2.7
Diabetes   23.0 ± …     … ± …      … ± 1.8
Heart      15.4 ± …     … ± …      … ± 3.2

As the results show, the best SVM and the best ensembles we found have about the same test performance. Therefore, with appropriate tuning of the parameters of the machines, combining SVMs does not lead to performance improvement compared to a single SVM.

Although the best SVM and the best ensemble (that is, after accurate parameter tuning) perform similarly, an important difference of the ensembles compared to a single machine is that the training of the ensemble consists of a large number of (parallelizable) small-training-set kernel machines in the case of bagging. This implies that one can gain performance similar to that of a single machine by training many faster machines using smaller training sets, although the actual testing may be slower since the size of the union of support vectors of the combination of machines is expected to be larger than the number of support vectors of a single machine using all the training data. This can be an important practical advantage of ensembles of machines, especially in the case of large datasets. Table 3 compares the test performance of a single SVM with that of an ensemble of SVMs each trained with as low as 1% of the initial training set (for one dataset; for the other ones we could not use 1% because the size of the original dataset was small, so 1% of it was only a couple of points). For fixed c_t the performance decreases only slightly in all cases (Thyroid, that we show, was the only dataset we found in our experiments for which the change was significant for the case of VCC), while in the case of the architecture of Section 6.2 even with 1% training data the performance does not decrease. This is because the linear machine used to learn the coefficients c_t uses all the training data. Even in this last case the overall machine can still be faster than a single machine, since the second layer learning machine is a linear one, and fast training methods for the particular case of linear machines exist (Platt, 1998).

Table 3. Comparison between error rates of a single SVM vs. error rates of VCC and ACC of 100 SVMs for different percentages of subsampled data. The last dataset is from Osuna, Freund, and Girosi (1997). (Most entries are illegible in this transcription.)

Dataset    VCC 10%   VCC 5%   VCC 1%   ACC 10%   ACC 5%   ACC 1%   SVM
Diabetes   …         …        …        …         …        …        … ± 1.6
Thyroid    …         …        …        …         …        …        … ± 2.5
Faces      …         …        …        …         …        …        0.5

7. Conclusions

We presented theoretical bounds on the generalization error of ensembles of kernel machines such as SVMs. Our results apply to the general case where each of the machines in the ensemble is trained on different subsets of the training data and/or uses different kernels or input features. A special case of ensembles is that of bagging. The bounds were derived within the frameworks of cross validation and of stability and learning. They involve two main quantities: the det-leave-one-out error estimate and the stability parameter of the ensembles.


More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #16 Scribe: Yannan Wang April 3, 2014 COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #16 Scrbe: Yannan Wang Aprl 3, 014 1 Introducton The goal of our onlne learnng scenaro from last class s C comparng wth best expert and

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Edge Isoperimetric Inequalities

Edge Isoperimetric Inequalities November 7, 2005 Ross M. Rchardson Edge Isopermetrc Inequaltes 1 Four Questons Recall that n the last lecture we looked at the problem of sopermetrc nequaltes n the hypercube, Q n. Our noton of boundary

More information

1 Matrix representations of canonical matrices

1 Matrix representations of canonical matrices 1 Matrx representatons of canoncal matrces 2-d rotaton around the orgn: ( ) cos θ sn θ R 0 = sn θ cos θ 3-d rotaton around the x-axs: R x = 1 0 0 0 cos θ sn θ 0 sn θ cos θ 3-d rotaton around the y-axs:

More information

TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES

TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES TAIL BOUNDS FOR SUMS OF GEOMETRIC AND EXPONENTIAL VARIABLES SVANTE JANSON Abstract. We gve explct bounds for the tal probabltes for sums of ndependent geometrc or exponental varables, possbly wth dfferent

More information

Statistics II Final Exam 26/6/18

Statistics II Final Exam 26/6/18 Statstcs II Fnal Exam 26/6/18 Academc Year 2017/18 Solutons Exam duraton: 2 h 30 mn 1. (3 ponts) A town hall s conductng a study to determne the amount of leftover food produced by the restaurants n the

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Chapter 6. Supplemental Text Material

Chapter 6. Supplemental Text Material Chapter 6. Supplemental Text Materal S6-. actor Effect Estmates are Least Squares Estmates We have gven heurstc or ntutve explanatons of how the estmates of the factor effects are obtaned n the textboo.

More information

COS 521: Advanced Algorithms Game Theory and Linear Programming

COS 521: Advanced Algorithms Game Theory and Linear Programming COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton

More information

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens

THE CHINESE REMAINDER THEOREM. We should thank the Chinese for their wonderful remainder theorem. Glenn Stevens THE CHINESE REMAINDER THEOREM KEITH CONRAD We should thank the Chnese for ther wonderful remander theorem. Glenn Stevens 1. Introducton The Chnese remander theorem says we can unquely solve any par of

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Structure and Drive Paul A. Jensen Copyright July 20, 2003

Structure and Drive Paul A. Jensen Copyright July 20, 2003 Structure and Drve Paul A. Jensen Copyrght July 20, 2003 A system s made up of several operatons wth flow passng between them. The structure of the system descrbes the flow paths from nputs to outputs.

More information

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.

For now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results. Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson

More information

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg

princeton univ. F 17 cos 521: Advanced Algorithm Design Lecture 7: LP Duality Lecturer: Matt Weinberg prnceton unv. F 17 cos 521: Advanced Algorthm Desgn Lecture 7: LP Dualty Lecturer: Matt Wenberg Scrbe: LP Dualty s an extremely useful tool for analyzng structural propertes of lnear programs. Whle there

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Maximizing the number of nonnegative subsets

Maximizing the number of nonnegative subsets Maxmzng the number of nonnegatve subsets Noga Alon Hao Huang December 1, 213 Abstract Gven a set of n real numbers, f the sum of elements of every subset of sze larger than k s negatve, what s the maxmum

More information

The Expectation-Maximization Algorithm

The Expectation-Maximization Algorithm The Expectaton-Maxmaton Algorthm Charles Elan elan@cs.ucsd.edu November 16, 2007 Ths chapter explans the EM algorthm at multple levels of generalty. Secton 1 gves the standard hgh-level verson of the algorthm.

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Affine transformations and convexity

Affine transformations and convexity Affne transformatons and convexty The purpose of ths document s to prove some basc propertes of affne transformatons nvolvng convex sets. Here are a few onlne references for background nformaton: http://math.ucr.edu/

More information

Min Cut, Fast Cut, Polynomial Identities

Min Cut, Fast Cut, Polynomial Identities Randomzed Algorthms, Summer 016 Mn Cut, Fast Cut, Polynomal Identtes Instructor: Thomas Kesselhem and Kurt Mehlhorn 1 Mn Cuts n Graphs Lecture (5 pages) Throughout ths secton, G = (V, E) s a mult-graph.

More information

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016

U.C. Berkeley CS294: Spectral Methods and Expanders Handout 8 Luca Trevisan February 17, 2016 U.C. Berkeley CS94: Spectral Methods and Expanders Handout 8 Luca Trevsan February 7, 06 Lecture 8: Spectral Algorthms Wrap-up In whch we talk about even more generalzatons of Cheeger s nequaltes, and

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

Section 8.3 Polar Form of Complex Numbers

Section 8.3 Polar Form of Complex Numbers 80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the

More information

Assortment Optimization under MNL

Assortment Optimization under MNL Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Linear Classification, SVMs and Nearest Neighbors

Linear Classification, SVMs and Nearest Neighbors 1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

Excess Error, Approximation Error, and Estimation Error

Excess Error, Approximation Error, and Estimation Error E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple

More information

1 Definition of Rademacher Complexity

1 Definition of Rademacher Complexity COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #9 Scrbe: Josh Chen March 5, 2013 We ve spent the past few classes provng bounds on the generalzaton error of PAClearnng algorths for the

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Exercises. 18 Algorithms

Exercises. 18 Algorithms 18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)

More information

Formulas for the Determinant

Formulas for the Determinant page 224 224 CHAPTER 3 Determnants e t te t e 2t 38 A = e t 2te t e 2t e t te t 2e 2t 39 If 123 A = 345, 456 compute the matrx product A adj(a) What can you conclude about det(a)? For Problems 40 43, use

More information

Support Vector Machines. Vibhav Gogate The University of Texas at dallas

Support Vector Machines. Vibhav Gogate The University of Texas at dallas Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Finding Dense Subgraphs in G(n, 1/2)

Finding Dense Subgraphs in G(n, 1/2) Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

MMA and GCMMA two methods for nonlinear optimization

MMA and GCMMA two methods for nonlinear optimization MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015

CS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015 CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

Composite Hypotheses testing

Composite Hypotheses testing Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Evaluation of simple performance measures for tuning SVM hyperparameters

Evaluation of simple performance measures for tuning SVM hyperparameters Evaluaton of smple performance measures for tunng SVM hyperparameters Kabo Duan, S Sathya Keerth, Aun Neow Poo Department of Mechancal Engneerng, Natonal Unversty of Sngapore, 0 Kent Rdge Crescent, 960,

More information

This column is a continuation of our previous column

This column is a continuation of our previous column Comparson of Goodness of Ft Statstcs for Lnear Regresson, Part II The authors contnue ther dscusson of the correlaton coeffcent n developng a calbraton for quanttatve analyss. Jerome Workman Jr. and Howard

More information

Graph Reconstruction by Permutations

Graph Reconstruction by Permutations Graph Reconstructon by Permutatons Perre Ille and Wllam Kocay* Insttut de Mathémathques de Lumny CNRS UMR 6206 163 avenue de Lumny, Case 907 13288 Marselle Cedex 9, France e-mal: lle@ml.unv-mrs.fr Computer

More information

Foundations of Arithmetic

Foundations of Arithmetic Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

On the correction of the h-index for career length

On the correction of the h-index for career length 1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

Bayesian predictive Configural Frequency Analysis

Bayesian predictive Configural Frequency Analysis Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

PHYS 705: Classical Mechanics. Calculus of Variations II

PHYS 705: Classical Mechanics. Calculus of Variations II 1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary

More information

Lecture 4: September 12

Lecture 4: September 12 36-755: Advanced Statstcal Theory Fall 016 Lecture 4: September 1 Lecturer: Alessandro Rnaldo Scrbe: Xao Hu Ta Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer: These notes have not been

More information

The exam is closed book, closed notes except your one-page cheat sheet.

The exam is closed book, closed notes except your one-page cheat sheet. CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces

More information