Leave One Out Error, Stability, and Generalization of Voting Combinations of Classifiers
Machine Learning, 55, 71-97, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

THEODOROS EVGENIOU theodoros.evgeniou@insead.edu
Technology Management, INSEAD, Boulevard de Constance, Fontainebleau, France

MASSIMILIANO PONTIL pontil@dii.unisi.it
DII, University of Siena, Via Roma 56, Siena, Italy

ANDRÉ ELISSEEFF andre.elisseeff@tuebingen.mpg.de
Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, Tübingen, Germany

Editor: Robert E. Schapire

Abstract. We study the leave-one-out and generalization errors of voting combinations of learning machines. A special case considered is a variant of bagging. We analyze in detail combinations of kernel machines, such as support vector machines, and present theoretical estimates of their leave-one-out error. We also derive novel bounds on the stability of combinations of any classifiers. These bounds can be used to formally show that, for example, bagging increases the stability of unstable learning machines. We report experiments supporting the theoretical findings.

Keywords: cross-validation, bagging, combinations of machines, stability

1. Introduction

Studying the generalization performance of ensembles of learning machines has been the topic of ongoing research in recent years (Breiman, 1996; Schapire et al., 1998; Friedman, Hastie, & Tibshirani, 1998). There is a lot of experimental work showing that combining learning machines, for example using boosting or bagging methods (Breiman, 1996; Schapire et al., 1998), very often leads to improved generalization performance. A number of theoretical explanations have also been proposed (Schapire et al., 1998; Breiman, 1996), but more work on this aspect is still needed. Two important theoretical tools for studying the generalization performance of learning machines are the leave-one-out (or cross-validation) error of the machines, and the stability of the machines (Bousquet & Elisseeff, 2002; Boucheron, Lugosi, & Massart, 2000).
The second, although an older tool (Devroye & Wagner, 1979; Devroye, Györfi, & Lugosi, 1996), has become important only recently with the work of Kearns and Ron (1999) and Bousquet and Elisseeff (2002). Stability has also been discussed extensively in the work of Breiman (1996). The theory in Breiman (1996) is that bagging increases performance because it reduces the variance of the base learning machines, although it does not always increase the bias (Breiman, 1996).
The definition of the variance in Breiman (1996) is similar in spirit to that of the stability we use in this paper. The key difference is that in Breiman (1996) the variance of a learning machine is defined in an asymptotic way and is not used to derive any non-asymptotic bounds on the generalization error of bagging machines, while here we define stability for finite samples as is done in Bousquet and Elisseeff (2002), and we also derive such non-asymptotic bounds. The intuition given by Breiman (1996) provides interesting insights: the effect of bagging depends on the stability of the base classifier. Stability here means changes in the output of the classifier when the training set is perturbed. If the base classifiers are stable, then bagging is not expected to decrease the generalization error. On the other hand, if the base classifier is unstable, as often occurs with decision trees, the generalization performance is supposed to increase with bagging. Despite experimental evidence, the insights in Breiman (1996) had not been supported by a general theory linking stability to the generalization error of bagging, which is what Section 5 below is about. In this paper we study the generalization performance of ensembles of kernel machines using both leave-one-out and stability arguments. We consider the general case where each of the machines in the ensemble uses a different kernel and different subsets of the training set. The ensemble is a convex combination of the individual machines. A particular case of this scheme is that of bagging kernel machines. Unlike standard bagging (Breiman, 1996), this paper considers combinations of the real outputs of the classifiers, and each machine is trained on a different and small subset of the initial training set, chosen by randomly subsampling from it. Each machine in the ensemble uses in general a different kernel. As a special case, appropriate choices of these kernels lead to machines that may use different subsets of the initial input features, or different input representations in general.
We derive theoretical bounds on the generalization error of the ensembles based on a leave-one-out error estimate. We also present results on the stability of combinations of classifiers, which we apply to the case of bagging kernel machines. They can also be applied to bagging learning machines other than kernel machines, showing formally that bagging can increase the stability of the learning machines when these are not stable, and decrease it otherwise. An implication of this result is that it can be easier to control the generalization error of bagging machines. For example, the leave-one-out error is a better estimate of their test error, something that we observe experimentally. The paper is organized as follows. Section 2 gives the basic notation and background. In Section 3 we present bounds on a leave-one-out error of kernel machine ensembles. These bounds are used for model selection experiments in Section 4. In Section 5 we discuss the algorithmic stability of ensembles, and present a formal analysis of how bagging influences the stability of learning machines. The results can also provide a justification of the experimental findings of Section 4. Section 6 discusses other ways of combining learning machines.

2. Background and notations

In this section we recall the main features of kernel machines. For a more detailed account see Vapnik (1998) and Schölkopf, Burges, and Smola (1998); for an account consistent with our notation see Evgeniou, Pontil, and Poggio (2000).
Kernel machine classifiers are the minimizers of functionals of the form:

    H[f] = \frac{1}{l} \sum_{i=1}^{l} V(y_i, f(x_i)) + \lambda \|f\|_K^2,    (1)

where we use the following notation: Let X \subseteq R^n be the input set; the pairs (x_i, y_i) \in X \times \{-1, 1\}, i = 1, ..., l, are sampled independently and identically according to an unknown probability distribution P(x, y). The set D_l = \{(x_1, y_1), ..., (x_l, y_l)\} is the training set. f is a function R^n \to R belonging to a Reproducing Kernel Hilbert Space (RKHS) H defined by the kernel K, and \|f\|_K^2 is the norm of f in this space. See Vapnik (1998) and Wahba (1990) for a number of kernels. The classification is done by taking the sign of this function. V(y, f(x)) is the loss function. The choice of this function determines different learning techniques, each leading to a different learning algorithm (for computing the coefficients \alpha_i, see below). \lambda is called the regularization parameter and is a positive constant. Machines of this form have been motivated in the framework of statistical learning theory. Under rather general conditions (Evgeniou, Pontil, & Poggio, 2000) the solution of Eq. (1) is of the form

    f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x).    (2)

The coefficients \alpha_i in Eq. (2) are learned by solving the following optimization problem:

    \max_\alpha H(\alpha) = \sum_{i=1}^{l} S(\alpha_i) - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
    subject to: 0 \le \alpha_i \le C,  i = 1, ..., l,    (3)

where S(\cdot) is a continuous and concave function (strictly concave if the matrix K(x_i, x_j) is not strictly positive definite) and C = \frac{1}{2 l \lambda} is a constant. Thus H(\alpha) is strictly concave and the above optimization problem has a unique solution. Support Vector Machines (SVMs) are a particular case of these machines for S(\alpha) = \alpha. This corresponds to a loss function V in (1) of the form \theta(1 - y f(x))(1 - y f(x)), where \theta is the Heaviside function: \theta(x) = 1 if x > 0, and zero otherwise. The points for which \alpha_i > 0 are called support vectors. Notice that the bias term (the threshold b in the general case of machines f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b) is incorporated in the kernel K, and it is therefore also regularized.
Notice also that the function S(\cdot) in (3) can take general forms, leading to machines other than SVMs, but in the general case the optimization of (3) may be computationally inefficient.
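To make Eq. (2) concrete, here is a small numpy sketch of the decision function of a single kernel machine with a Gaussian kernel; the coefficients \alpha are assumed to have already been obtained by solving (3), and all function and variable names are ours, not the paper's:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); note K(x, x) = 1."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_machine(alpha, y, X_train, sigma=1.0):
    """Return f with f(x) = sum_i alpha_i y_i K(x_i, x), as in Eq. (2).

    Classification is done by taking the sign of f; the bias is assumed
    folded into the kernel, as discussed above.
    """
    def f(X_test):
        return (alpha * y) @ gaussian_kernel(X_train, X_test, sigma)
    return f
```

For a toy training set of two points with opposite labels, the resulting f is positive near the positive point and negative near the negative one, as expected from (2).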
2.1. Kernel machine ensembles

Given a learning algorithm such as an SVM or an ensemble of SVMs, we define f_{D_l} to be the solution of the algorithm when the training set D_l = \{(x_i, y_i), i = 1, ..., l\} is used. We denote by D_l^i the training set obtained by removing the point (x_i, y_i) from D_l, that is, the set D_l \setminus \{(x_i, y_i)\}. When it is clear in the text we will denote f_{D_l} by f and f_{D_l^i} by f^i. We consider the general case where each of the machines in the ensemble uses a different kernel and different subsets D_{r,t} of the training set D_l, where r refers to the size of the subset and t = 1, ..., T to the machine that uses it to learn. Let f_{D_{r,t}}(x) be the optimal solution of machine t using a kernel K^{(t)}. We denote by \alpha_i^{(t)} the optimal weight that machine t assigns to the point (x_i, y_i) (after solving optimization problem (3)). We consider ensembles that are convex combinations of the individual machines. The decision function of the ensemble is given by

    F_{r,T}(x) = \sum_{t=1}^{T} c_t f_{D_{r,t}}(x)    (4)

with c_t \ge 0 and \sum_{t=1}^{T} c_t = 1 (for scaling reasons). The coefficients c_t are not learned, and all parameters (the C's and the kernels) are fixed before training. The classification is done by taking the sign of F_{r,T}(x). Below, for simplicity, we will denote the combination F_{r,T} by the capital letter F. In Section 5 we will consider only the case c_t = \frac{1}{T} for simplicity. In the following, the sets D_{r,t} will be identically sampled according to the uniform distribution and without replacement from the training set D_l. We will denote by E_{D_r \subset D_l} the expectation with respect to the subsampling from D_l according to the uniform distribution (without replacement), and sometimes we write f_{D_{r,t} \subset D_l} rather than f_{D_{r,t}} to make clear which training set has been used during learning.
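The construction above can be sketched as follows: draw the T subsets D_{r,t} uniformly without replacement and form the convex combination (4). Here the trained base machines are passed in simply as real-valued prediction functions, and the helper names are ours:

```python
import numpy as np

def draw_subsets(rng, l, r, T):
    """T index sets D_{r,t} of size r, drawn uniformly WITHOUT replacement
    from {0, ..., l-1}, as in the subsampling scheme described above."""
    return [rng.choice(l, size=r, replace=False) for _ in range(T)]

def ensemble(machines, c=None):
    """F(x) = sum_t c_t f_t(x) with c_t >= 0 and sum_t c_t = 1 (Eq. 4).

    `machines` is a list of real-valued prediction functions; by default
    the coefficients are uniform, c_t = 1/T, the case used in Section 5.
    """
    T = len(machines)
    c = [1.0 / T] * T if c is None else list(c)
    assert all(ct >= 0 for ct in c) and abs(sum(c) - 1.0) < 1e-9
    def F(x):
        return sum(ct * f(x) for ct, f in zip(c, machines))
    return F
```

Classification by the ensemble is then sign(F(x)), exactly as for a single machine.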
The letter r will always refer to the number of elements in D_{r,t}.

2.2. Leave-one-out error

If \theta is, as before, the Heaviside function, then the leave-one-out error of f on D_l is defined by

    Loo_{D_l}(f) = \frac{1}{l} \sum_{i=1}^{l} \theta(-y_i f^i(x_i)).    (5)

Notice that for simplicity there is a small abuse of notation here, since the leave-one-out error typically refers to a learning method, while here we use the solution f in the notation. The leave-one-out error provides an estimate of the average generalization performance of a machine. It is known that the expectation of the generalization error of a machine trained using l points is equal to the expectation of the leave-one-out error of a machine trained on l + 1 points. This is summarized by the following theorem, originally due to Luntz and Brailovsky (see Vapnik, 1998).
Table 1. Notation.

    f                                Real-valued prediction rule of one learning machine, f : X \to R
    V(f, y)                          Loss function
    P(x, y)                          Probability distribution underlying the data
    D_l                              Set of i.i.d. examples sampled from P(x, y), D_l = \{(x_i, y_i) \in X \times \{-1, 1\}\}_{i=1}^{l}
    D_l^i                            The set D_l \setminus \{(x_i, y_i)\}
    f_{D_l}                          Learning machine (e.g., SVM) trained on D_l; also denoted by f
    Loo_{D_l}(f)                     Leave-one-out error of f on the data set D_l
    \pi_\delta(x)                    Soft margin loss: \pi_\delta(x) = 0 if x < -\delta, 1 if x > 0, and x/\delta + 1 if -\delta \le x \le 0
    Loo_{\delta,D_l}(f)              Leave-one-out error with soft margin \pi_\delta
    \beta_l                          Uniform stability of f
    D_{r,t} or D_{r,t} \subset D_l   Set of r points sampled uniformly from D_l, used by machine t, t = 1, ..., T
    D_r \subset D_l                  Set of r points sampled uniformly from D_l
    (D_{r,t} \subset D_l)^i          Original D_{r,t} with the point (x_i, y_i) removed
    F_{r,T}, or just F               Ensemble of T machines, F_{r,T} = \sum_{t=1}^{T} c_t f_{D_{r,t}}
    \hat{F}                          Expected combination of machines, E_{D_r \subset D_l}[f_{D_r}]
    DLoo_{D_l}(F)                    Deterministic leave-one-out error
    DLoo_{\delta,D_l}(F)             Deterministic leave-one-out error with soft margin \pi_\delta

Theorem 2.1. Suppose f_{D_l} is the outcome of a deterministic learning algorithm. Then

    E_{D_l}[E_{(x,y)}[\theta(-y f_{D_l}(x))]] = E_{D_{l+1}}[Loo_{D_{l+1}}(f_{D_{l+1}})].

As observed by Kearns and Ron (1999), this theorem can be extended to general learning algorithms by adding a randomizing preprocessing step. The way the leave-one-out error is computed can, however, differ depending on the randomness. Consider the previous ensemble of kernel machines (4). The data sets D_{r,t}, t = 1, ..., T, are drawn randomly from the training set D_l. We can then compute a leave-one-out error estimate, for example, in either of the following ways:

1. For i = 1, ..., l, remove (x_i, y_i) from D_l and sample new data sets D_{r,t}, t = 1, ..., T, from D_l^i. Compute the f_{D_{r,t} \subset D_l^i} and then average the errors of the resulting ensemble machine computed on (x_i, y_i). This leads to the classical definition of the leave-one-out error, which can be computed as:

    Loo_{D_l}(F) = \frac{1}{l} \sum_{i=1}^{l} \theta\left(-y_i \sum_{t=1}^{T} c_t f_{D_{r,t} \subset D_l^i}(x_i)\right)    (6)

2. For i = 1, ..., l, remove (x_i, y_i) from each D_{r,t} \subset D_l.
Compute the f_{(D_{r,t} \subset D_l)^i} and then average the errors of the resulting ensemble machine computed on (x_i, y_i). Note that we have used the notation (D_{r,t} \subset D_l)^i to denote the set D_{r,t} \subset D_l where (x_i, y_i) has been
removed. This leads to what we will call a deterministic version of the leave-one-out error, in short det-leave-one-out, or DLoo:

    DLoo_{D_l}(F) = \frac{1}{l} \sum_{i=1}^{l} \theta\left(-y_i \sum_{t=1}^{T} c_t f_{(D_{r,t} \subset D_l)^i}(x_i)\right)    (7)

Note that the first computation requires re-sampling new data sets for each leave-one-out round, while the second computation uses the same subsampled data sets for each leave-one-out round, removing at most one point from each of them. In a sense, the det-leave-one-out error is then more deterministic than the classical computation (6). In this paper we will consider mainly the det-leave-one-out error, for which we will derive easy-to-compute bounds and from which we will bound the generalization error of ensemble machines. Finally, notice that the size of the subsampling is implicit in the notation DLoo_{D_l}(F): r is fixed in this paper, so there is no need to complicate the notation further.

3. Leave-one-out error estimates of kernel machine ensembles

We begin with some known results about the leave-one-out error of kernel machines. The following theorem is from Jaakkola and Haussler (1998):

Theorem 3.1. The leave-one-out error of a kernel machine (3) is upper bounded as:

    Loo_{D_l}(f) \le \frac{1}{l} \sum_{i=1}^{l} \theta(\alpha_i K(x_i, x_i) - y_i f_{D_l}(x_i))    (8)

where f_{D_l} is the optimal function found by solving problem (3) on the whole training set. In the particular case of SVMs, when the data are separable the r.h.s. of Eq. (8) can be bounded by geometric quantities, namely (Vapnik, 1998):

    Loo_{D_l}(f) \le \frac{1}{l} \sum_{i=1}^{l} \theta(\alpha_i K(x_i, x_i) - y_i f_{D_l}(x_i)) \le \frac{1}{l} \frac{d_{sv}^2}{\rho^2}    (9)

where d_{sv} is the radius of the smallest sphere, centered at the origin in the feature space induced by the kernel K (Wahba, 1990; Vapnik, 1998), containing the support vectors, that is, d_{sv} = \max_{i : \alpha_i > 0} \sqrt{K(x_i, x_i)}, and \rho is the margin (\rho^2 = \frac{1}{\|f\|_K^2}) of the SVM. Using this result, the next theorem is a direct application of Theorem 2.1:

Theorem 3.2. Suppose that the data is separable by the SVM. Then, the average generalization error of an SVM trained on l points is upper bounded by

    \frac{1}{l+1} E_{D_{l+1}}\left[\frac{d_{sv}^2(l+1)}{\rho^2(l+1)}\right],
where the expectation E is taken with respect to the probability of a training set D_{l+1} of size l + 1 (consistently with Theorem 2.1), and d_{sv}(l+1) and \rho(l+1) are the radius and the margin of the SVM trained on D_{l+1}. Notice that this result shows that the performance of the SVM does not depend only on the margin, but also on other geometric quantities, namely the radius d_{sv}. We now extend these results to the case of ensembles of kernel machines. In the particular case of bagging, the subsampling of the training data should be deterministic. By this we mean that when the bounds on the leave-one-out error are used for model (parameter) selection, for each model the same subsample sets of the data need to be used. These subsamples, however, are still random ones. We believe that the results presented below also hold (with minor modifications) in the general case where the subsampling is always random. We now consider the det-leave-one-out error of such ensembles.

Theorem 3.3. The det-leave-one-out error of a kernel machine ensemble is upper bounded by:

    DLoo_{D_l}(F) \le \frac{1}{l} \sum_{i=1}^{l} \theta\left(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)\right).    (10)

The proof of this theorem is based on the following lemma, shown in Vapnik (1998) and Jaakkola and Haussler (1998):

Lemma 3.1. Let \alpha_i be the coefficient of the solution f(x) of machine (3) corresponding to the point (x_i, y_i), \alpha_i > 0. Let f^i(x) be the solution of machine (3) found when the data point (x_i, y_i) is removed from the training set. Then

    y_i f^i(x_i) \ge y_i f(x_i) - \alpha_i K(x_i, x_i).

Using Lemma 3.1 we can now prove Theorem 3.3.

Proof of Theorem 3.3: Let F^i(x) = \sum_{t=1}^{T} c_t f^{(t)i}(x) be the ensemble machine trained with all the initial training data except (x_i, y_i) (the subsets D_{r,t} are the original ones; only (x_i, y_i) is removed from them). Lemma 3.1 gives that

    y_i F^i(x_i) = \sum_{t=1}^{T} c_t y_i f^{(t)i}(x_i) \ge \sum_{t=1}^{T} c_t [y_i f^{(t)}(x_i) - \alpha_i^{(t)} K^{(t)}(x_i, x_i)]
                 = y_i F(x_i) - \sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i),

from which it follows that:

    \theta(-y_i F^i(x_i)) \le \theta\left(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)\right).
Therefore the leave-one-out error \sum_{i=1}^{l} \theta(-y_i F^i(x_i)) is not more than

    \sum_{i=1}^{l} \theta\left(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)\right),

which proves the theorem.

Notice that the bound has the same form as the bound in Eq. (8): for each point (x_i, y_i) we only need to take into account its corresponding parameters \alpha_i^{(t)} and remove their effects from the value of F(x_i). The det-leave-one-out error can also be bounded using geometric quantities. To this purpose we introduce one more parameter, which we call the ensemble margin (in contrast to the margin of a single SVM). For each point (x_i, y_i) we define its ensemble margin to be y_i F(x_i). This is exactly the definition of margin in Schapire et al. (1998). For any given \delta > 0 we define Err_\delta to be the empirical error with ensemble margin less than \delta,

    Err_\delta(F) = \frac{1}{l} \sum_{i=1}^{l} \theta(-y_i F(x_i) + \delta),

and by N_\delta the set of the remaining training points, the ones with ensemble margin \ge \delta. Finally, we denote by d_{t(\delta)} the radius of the smallest sphere in the feature space induced by kernel K^{(t)}, centered at the origin, which contains the points of machine t with \alpha_i^{(t)} > 0 and ensemble margin larger than \delta.^1

Corollary 3.1. For any \delta > 0 the det-leave-one-out error of a kernel machine ensemble is upper bounded by:

    DLoo_{D_l}(F) \le Err_\delta(F) + \frac{1}{l} \frac{1}{\delta} \sum_{t=1}^{T} c_t d_{t(\delta)}^2 \left(\sum_{i \in N_\delta} \alpha_i^{(t)}\right)    (11)

Proof: For each training point (x_i, y_i) with ensemble margin y_i F(x_i) < \delta we upper bound \theta(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)) with 1 (this is a trivial bound). For the remaining points (the points in N_\delta) we show that:

    \theta\left(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)\right) \le \frac{1}{\delta} \sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i).    (12)
In the case that \sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i) < 0, Eq. (12) is trivially satisfied. If \sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i) \ge 0, then

    \theta\left(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)\right) = 1,

while

    \frac{1}{\delta} \sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) \ge \frac{y_i F(x_i)}{\delta} \ge 1,

since y_i F(x_i) \ge \delta for the points in N_\delta. So in both cases inequality (12) holds. Therefore:

    \sum_{i=1}^{l} \theta\left(\sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i) - y_i F(x_i)\right) \le l\,Err_\delta + \frac{1}{\delta} \sum_{i \in N_\delta} \sum_{t=1}^{T} c_t \alpha_i^{(t)} K^{(t)}(x_i, x_i)
        \le l\,Err_\delta + \frac{1}{\delta} \sum_{t=1}^{T} c_t d_{t(\delta)}^2 \left(\sum_{i \in N_\delta} \alpha_i^{(t)}\right).

The statement of the corollary follows by applying Theorem 3.3. Notice that Eq. (11) holds for any \delta > 0, so the best bound is obtained for the minimum of the right-hand side with respect to \delta > 0. Using Theorem 2.1, Theorems 3.3 and 3.1 provide bounds on the average generalization performance of general kernel machine ensembles like that of Theorem 3.2. We now consider the particular case of SVM ensembles, for which we have the following:

Corollary 3.2. Suppose that each SVM in the ensemble separates the data set used during its training. Then, the det-leave-one-out error of an ensemble of SVMs is upper bounded by:

    DLoo_{D_l}(F) \le Err_1(F) + \frac{1}{l} \sum_{t=1}^{T} c_t \frac{d_t^2}{\rho_t^2}    (13)

where Err_1 is the margin empirical error with ensemble margin 1, d_t is the radius of the smallest sphere centered at the origin, in the feature space induced by kernel K^{(t)}, containing the support vectors of machine t, and \rho_t is the margin of the t-th SVM.

Proof: We choose \delta = 1 in (11). Clearly we have that d_t \ge d_{t(\delta)} for any \delta, and \sum_{i \in N_\delta} \alpha_i^{(t)} \le \sum_{i=1}^{l} \alpha_i^{(t)} = \frac{1}{\rho_t^2} (see Vapnik (1998) for a proof of this equality).

Notice that the average generalization performance of the SVM ensemble now depends on the average (convex combination of) d^2/\rho^2 of the individual machines. In some cases this
may be smaller than the d^2/\rho^2 of a single SVM. For example, suppose we train many SVMs on different sub-samples of the training points and we want to compare such an ensemble with a single SVM using all the points. If all SVMs (the single one, as well as the individual ones of the ensemble) have most of their training points as support vectors, then clearly the d^2 of each SVM in the ensemble is smaller than that of the single SVM. Moreover, the margin of each SVM in the ensemble is expected to be larger than that of the single SVM using all the points. So the average d^2/\rho^2 in this case is expected to be smaller than that of the single SVM. Another case where an ensemble of SVMs may be better than a single SVM is the one where there are outliers among the training data. If the individual SVMs are trained on subsamples of the training data, some of the machines may have a smaller d^2/\rho^2 because they do not use some of the outliers; this of course also depends on the choice of C for each of the machines. In general it is not clear when ensembles of kernel machines are better than single machines. The bounds in this section may provide some insight into this question. Finally, we remark that all the results discussed hold for the case that there is no bias (threshold b), or the case where the bias is included in the kernel (as discussed in the introduction). In the experiments discussed below we also use the results in the case that the bias is not regularized (as discussed in Section 2, this means that the separating function includes a bias b, so it is f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b), which is common in practice. Recent work in Chapelle and Vapnik (1999) may be used to extend our results to an ensemble of kernel machines with the bias not regularized: whether this can be done is an open question.

4. Experiments

To test how tight the bounds we presented are, we conducted a number of experiments using datasets from UCI,^2 as well as the US Postal Service (USPS) dataset (LeCun et al., 1990). We show results for some of the sets in figures 1-5.
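The experiments below repeatedly evaluate the bound of Theorem 3.3; note that the right-hand side of Eq. (10) needs only quantities available after a single training of each machine. A numpy sketch, under our own array-layout assumption that \alpha_i^{(t)} is stored as zero whenever machine t did not train on point i:

```python
import numpy as np

def dloo_bound(alpha, K_diag, F_train, y, c=None):
    """Right-hand side of Eq. (10):
        (1/l) sum_i theta( sum_t c_t alpha_i^(t) K^(t)(x_i, x_i) - y_i F(x_i) ).

    alpha, K_diag: (T, l) arrays, where alpha[t, i] is machine t's coefficient
    for point i (zero when point i is not in D_{r,t}) and
    K_diag[t, i] = K^(t)(x_i, x_i).  F_train: (l,) ensemble outputs F(x_i).
    theta(u) = 1 if u > 0 else 0, as defined in Section 2.
    """
    T, l = np.shape(alpha)
    c = np.full(T, 1.0 / T) if c is None else np.asarray(c, dtype=float)
    correction = (c[:, None] * alpha * K_diag).sum(axis=0)
    return float(np.mean(correction - y * F_train > 0))
```

Intuitively, `correction` removes the contribution of each point's own coefficients from the ensemble output before checking whether the point would be misclassified.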
For each dataset we split the overall set into training and testing sets (the sizes are shown in the figures) in 50 different (random) ways, and for each split:

1. We trained one SVM with b = 0 using all the training data, computed the leave-one-out bound given by Theorem 3.1, and then computed the test performance using the test set.
2. We repeated (1), this time with b \ne 0.
3. We trained 30 SVMs with b = 0, each using a random subsample of size 40% of the training data (bagging), computed the leave-one-out bound given by Theorem 3.3 using c_t = 1/30, and then computed the test performance using the test set.
4. We repeated (3), this time with b \ne 0.

We then averaged over the 50 training-testing splits the test performances and the leave-one-out bounds found, and computed the standard deviations. All machines were trained using a Gaussian kernel, and we repeated the procedure for a number of different \sigma's of the Gaussian and for a fixed value of the parameter C (selected by hand so that it is less than 1 in figures 1-5, and more than 1 in figure 6, for reasons explained below; for simplicity we
Figure 1. Breast cancer data (C = 0.5, Train = 200, Test = 77): Top left: bagging with b = 0; top right: single SVM with b = 0; bottom left: bagging with b \ne 0; bottom right: single SVM with b \ne 0. In each plot the solid line is the mean test performance and the dashed line is the bound computed using the leave-one-out Theorems 3.1 and 3.3. The dotted line is the validation set error discussed below. The horizontal axis shows the logarithm of the \sigma of the Gaussian kernel used.

used the same value of C in figures 1-5, C = 0.5, but we found the same trend for other small values of C, C < 1). We show the averages and standard deviations of the results in figures 1 to 5. In all figures we use the following notation: top left figure: bagging with b = 0; top right figure: single SVM with b = 0; bottom left figure: bagging with b \ne 0; bottom right figure: single SVM with b \ne 0. In each plot the solid line is the mean test performance and the dashed line is the bound computed using the leave-one-out Theorems 3.1 and 3.3. The dotted line is the validation set error discussed below. The horizontal axis shows the logarithm of the \sigma of the Gaussian kernel used. For simplicity, only one error bar (standard deviation over the 50 training-testing splits) is shown (the others were similar). Notice that even for training-testing splits for which the error is one standard deviation away from the mean over the 50 runs (i.e., instead of plotting the graphs through the center of the bars, we plot them at the end of the bars), the bounds for combinations of machines are still tighter than for single machines in figures 3 to 5. The cost parameter C used is given
Figure 2. Thyroid data (C = 0.5, Train = 140, Test = 75): Notation as in figure 1.

in each of the figures. The horizontal axis is the natural logarithm of the \sigma of the Gaussian kernel used, while the vertical axis is the error. An interesting observation is that the bounds are always tighter for the case of bagging than they are for the case of a single SVM. This is an interesting experimental finding for which we provide a possible theoretical explanation in the next section. This finding can practically justify the use of ensembles of machines for model selection: parameter selection using the leave-one-out bounds presented in this paper is easier for ensembles of machines than it is for single machines. Another interesting observation is that the bounds seem to work similarly in the case that the bias b is not 0. In this case, as before, the bounds are tighter for ensembles of machines than they are for single machines. Experimentally we found that the bounds presented here do not work well in the case that the C parameter used is large (C = 100). An example is shown in figure 6. Consider the leave-one-out bound for a single SVM given by Theorem 3.1. Let (x_i, y_i) be a support vector for which y_i f(x_i) < 1. It is known (Vapnik, 1998) that for these support vectors the coefficient \alpha_i is C. If C is such that C K(x_i, x_i) > 1 (for example, consider a Gaussian
Figure 3. Diabetes data (C = 0.5, Train = 468, Test = 300): Notation as in figure 1.

kernel with K(x, x) = 1 and any C > 1), then clearly \theta(C K(x_i, x_i) - y_i f(x_i)) = 1. In this case the bound of Theorem 3.1 effectively counts all support vectors outside the margin (plus some of the ones on the margin, i.e., with y f(x) = 1). This means that for large C (in the case of Gaussian kernels this can be, for example, for any C > 1), the bounds of this paper are effectively similar to (not larger than) another known leave-one-out bound for SVMs, namely one that uses the number of all support vectors to bound the generalization performance (Vapnik, 1998). So effectively our experimental results show that the number of support vectors does not provide a good estimate of the generalization performance of SVMs and their ensembles.

5. Stability of ensemble methods

We now present a theoretical explanation of the experimental finding that the leave-one-out bound is tighter for the case of ensemble machines than it is for single machines. The analysis is done within the framework of stability and learning (Bousquet & Elisseeff, 2002). It has
Figure 4. Heart data (C = 0.5, Train = 170, Test = 100): Notation as in figure 1.

been proposed in the past that bagging increases the stability of the learning methods (Breiman, 1996). Here we provide a formal argument for this. As before, we denote by D_l^i the training set D_l without the example point (x_i, y_i). We use the following notion of stability defined in Bousquet and Elisseeff (2002).

Definition (Uniform stability). We say that a learning method is \beta_l stable with respect to a loss function V and training sets of size l if the following holds:

    \forall i \in \{1, ..., l\}, \forall D_l, \forall (x, y) : |V(f_{D_l}(x), y) - V(f_{D_l^i}(x), y)| \le \beta_l.

Roughly speaking, the cost of a learning machine on a new (test) point (x, y) should not change by more than \beta_l when we train the machine with any training set of size l and when we train the machine with the same training set but with one training point (any point) removed. Notice that this definition is useful mainly for real-valued loss functions V. To use it for classification machines we need to start with the real-valued output (2) before thresholding.
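Uniform stability is a worst case over all training sets and test points, so it cannot be computed exactly; one can, however, lower-bound it empirically by retraining with each point left out and probing the change of the real-valued output on a finite set of test inputs. A rough sketch (the `train` callable and the probe set are hypothetical placeholders, not anything defined in the paper):

```python
import numpy as np

def empirical_stability(train, D, x_probe):
    """Empirical lower bound on the uniform (l_1) stability:
        max_i max_{x in x_probe} |f_{D_l}(x) - f_{D_l^i}(x)|.

    `train` maps a list of (x, y) pairs to a real-valued prediction rule.
    Since the probe set is finite and only one training set is tried, this
    only LOWER-bounds the true beta_l.
    """
    f_full = train(D)
    full_out = np.array([f_full(x) for x in x_probe])
    worst = 0.0
    for i in range(len(D)):
        f_i = train(D[:i] + D[i + 1:])          # leave point i out
        out_i = np.array([f_i(x) for x in x_probe])
        worst = max(worst, float(np.abs(full_out - out_i).max()))
    return worst
```

For a toy "learner" that predicts the mean training label everywhere, removing a single point changes the output by at most on the order of 1/l, illustrating why averaging-type machines are stable.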
Figure 5. USPS data (C = 0.5, Train = 791, Test = 2007): Notation as in figure 1.

We define for any given constant \delta the leave-one-out error Loo_\delta on a training set D_l to be:

    Loo_{\delta, D_l}(f) = \frac{1}{l} \sum_{i=1}^{l} \pi_\delta(-y_i f_{D_l^i}(x_i)),

where the function \pi_\delta(x) is 0 for x < -\delta, 1 for x > 0, and \frac{x}{\delta} + 1 for -\delta \le x \le 0 (a soft margin function).^3 For ensemble machines, we will consider again a definition similar to (7):

    DLoo_{\delta, D_l}(F) = \frac{1}{l} \sum_{i=1}^{l} \pi_\delta\left(-y_i \frac{1}{T} \sum_{t=1}^{T} f_{(D_{r,t} \subset D_l)^i}(x_i)\right).

Notice that for \delta \to 0 we recover the leave-one-out errors that we defined in Section 2, namely Eqs. (5) and (7), and clearly DLoo_{0, D_l}(F) \le DLoo_{\delta, D_l}(F) for all \delta > 0.
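The soft margin function \pi_\delta is just a clipped linear ramp, and it upper bounds the Heaviside function \theta used in (5) and (7); a one-line sketch:

```python
import numpy as np

def soft_margin_loss(x, delta):
    """pi_delta(x): 0 for x < -delta, x/delta + 1 on [-delta, 0], 1 for x > 0.

    Clipping the ramp x/delta + 1 to [0, 1] reproduces all three cases at once.
    """
    return np.clip(np.asarray(x, dtype=float) / delta + 1.0, 0.0, 1.0)
```

Because \pi_\delta(x) \ge \theta(x) pointwise, replacing \theta by \pi_\delta can only increase the leave-one-out estimate, which is the inequality DLoo_{0,D_l}(F) \le DLoo_{\delta,D_l}(F) noted above.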
Figure 6. USPS data (C = 100, Train = 791, Test = 2007): Using a large C. In this case the bounds do not work; see the text for an explanation. Notation as in figure 1.

Let \beta_l be the stability of the kernel machine for the real-valued output with respect to the l_1 norm, that is:

    \forall i \in \{1, ..., l\}, \forall D_l, \forall x : |f_{D_l}(x) - f_{D_l^i}(x)| \le \beta_l.

For SVMs it is known (Bousquet & Elisseeff, 2002) that \beta_l is upper bounded by C \kappa^2, where \kappa^2 = \sup_{x \in X} K(x, x) is assumed to be finite. The bound on the stability of an SVM is not explicitly dependent on the size of the training set l. However, the value of C is often chosen such that C is small for large l. In the former experiments, C is fixed for all machines, which are trained on learning sets of the same size. This means that they all have the same stability for the l_1 norm. We first state a bound on the expected error of a single kernel machine in terms of its Loo_\delta error. The following theorem is from Bousquet and Elisseeff (2002).
Theorem 5.1. For any given \delta, with probability 1 - \eta the generalization misclassification error of an algorithm that is \beta_l stable with respect to the l_1 norm is bounded as:

    E_{(x,y)}[\theta(-y f_{D_l}(x))] \le Loo_{\delta, D_l}(f_{D_l}) + \beta_l + \sqrt{\frac{l}{2} \left(\frac{2 \beta_l}{\delta} + \frac{1}{l}\right)^2 \ln\left(\frac{1}{\eta}\right)},

where \beta_l is assumed to be a non-increasing function of l.

Notice that the bound holds for a given constant \delta. One can derive a bound that holds uniformly for all \delta and therefore use the best \delta (i.e., the empirical margin of the classifier) (Bousquet & Elisseeff, 2002). For an SVM, the value of \beta_l is bounded by C\kappa^2, so Theorem 5.1 provides the following bound:

    E_{(x,y)}[\theta(-y f_{D_l}(x))] \le Loo_{\delta, D_l}(f_{D_l}) + C\kappa^2 + \sqrt{\frac{l}{2} \left(\frac{2 C\kappa^2}{\delta} + \frac{1}{l}\right)^2 \ln\left(\frac{1}{\eta}\right)}.

The value of C is often a function of l. Depending on the way C decreases with l, this bound can be tight or loose. We now study a similar generalization bound for an ensemble of machines where each machine uses only r points drawn randomly with the uniform distribution from the training set. We consider only the case where the coefficients c_t of (4) are all \frac{1}{T} (so taking the average machine, as in standard bagging (Breiman, 1996)). Such an ensemble is very close to the original idea of bagging despite some differences, namely that in standard bagging each machine uses a training set of size equal to the size of the original set, created by random subsampling with replacement, instead of using only r points. We will consider the expected combination \hat{F} defined as:^4

    \hat{F}(x) = E_{D_r \subset D_l}[f_{D_r}(x)],

where the expectation is taken with respect to the training data D_r of size r drawn uniformly from D_l. The stability bounds we present below hold for this expected combination and not for the finite combination considered so far; as mentioned below, how close these two are is an open question. The leave-one-out error we define for this expectation is again as in (7) (as in Eq.
(7), the size r of the subsamples is for simplicity not included in the notation):

    DLoo_{\delta, D_l}(\hat{F}) = \frac{1}{l} \sum_{i=1}^{l} \pi_\delta\left(-y_i E_{D_r \subset D_l}\left[f_{D_r^i}(x_i)\right]\right),

which is different from the standard leave-one-out error:

    \frac{1}{l} \sum_{i=1}^{l} \pi_\delta\left(-y_i E_{D_r \subset D_l^i}\left[f_{D_r}(x_i)\right]\right),
which corresponds to (6). As an extreme case, when T \to \infty:

    DLoo_{\delta, D_l}\left(\frac{1}{T} \sum_{t=1}^{T} f_{D_{r,t}}\right) \to DLoo_{\delta, D_l}(\hat{F}).    (14)

This relation motivates the choice of our method of calculation for the leave-one-out error estimate in Section 3. Indeed, the right-hand side of the equation corresponds to the quantity that we have bounded in Sections 3 and 4, and that ultimately we would like to relate to the stability of the base machine. It is an open question to measure how fast the convergence in Eq. (14) is. As we discuss below, and as also mentioned in Breiman (1996), increasing T beyond a certain value (typically small, i.e., 100) does not influence the performance of bagging, which may imply that the convergence (14) is fast. We then have the following bound on the expected error of ensemble combinations:

Theorem 5.2. For any given \delta, with probability 1 - \eta the generalization misclassification error of the expected combination of classifiers \hat{F}, each using a subsample of size r of the training set and each having a stability \beta_r with respect to the l_1 norm, is bounded as:

    E_{(x,y)}[\theta(-y \hat{F}(x))] \le DLoo_{\delta, D_l}(\hat{F}) + \frac{r}{l} \beta_r + \sqrt{\frac{l}{2} \left(\frac{2 r \beta_r}{l \delta} + \frac{1}{l}\right)^2 \ln\left(\frac{1}{\eta}\right)}.

Proof: We will apply the stability Theorem 5.1 to the following algorithm: On a set of size l, the algorithm is the same as the expected ensemble machine we consider. On a training set of size l - 1, it adds a dummy input pair (x_0, y_0) and uses the same sampling scheme as the one used with D_l. That is, D_r is sampled from D_{l-1} \cup \{(x_0, y_0)\} with the same distribution as it is sampled from D_l in the definition of \hat{F}. When (x_0, y_0) is drawn in D_r, it is not used in training, so that f_{D_r} is replaced by f_{D_r \setminus \{(x_0, y_0)\}}. The new algorithm, which we will call G, can then be expressed as G(x) = E_{D_r \subset D_l}[f_{D_r}(x)], and G^i, its outcome on the set D_l^i, is equal to G^i(x) = E_{D_r \subset D_l}[f_{D_r^i}(x)], where (x_i, y_i) plays the role of the dummy pair (x_0, y_0) previously mentioned.
The resulting algorithm then has the same behavior on training sets of size l as the ensemble machine we consider, and the classical leave-one-out error for G corresponds to the det-leave-one-out error we have defined previously for F̂. From that perspective, it is sufficient to show that G is (r/l)β_r stable w.r.t. the l_1 norm and to apply Theorem 5.1. We have:

$$G - G^i = E_{D_r \subseteq D_l}[ f_{D_r} ] - E_{D_r \subseteq D_l}[ f_{D_r^i} ]$$

where D_r^i = D_r \ (x_i, y_i). We have by definition:

$$G - G^i = \int f_{D_r}\, dP - \int f_{D_r^i}\, dP$$
where P denotes here the distribution over the sampling of D_r from D_l. Defining the function 1_A of the set A to be 1_A(z) = 1 iff z ∈ A, we decompose each of the integrals as follows:

$$G - G^i = \int f_{D_r} 1_{(x_i,y_i)\in D_r}\, dP + \int f_{D_r} 1_{(x_i,y_i)\notin D_r}\, dP - \int f_{D_r^i} 1_{(x_i,y_i)\in D_r}\, dP - \int f_{D_r^i} 1_{(x_i,y_i)\notin D_r}\, dP$$

Clearly, if (x_i, y_i) ∉ D_r, then D_r = D_r^i, so that:

$$G - G^i = \int f_{D_r} 1_{(x_i,y_i)\in D_r}\, dP - \int f_{D_r^i} 1_{(x_i,y_i)\in D_r}\, dP \le \beta_r \int 1_{(x_i,y_i)\in D_r}\, dP = \beta_r\, P[(x_i,y_i)\in D_r]$$

where the probability is taken with respect to the random subsampling of the data set D_r from D_l. Since this subsampling is done without replacement, this probability is equal to r/l, which finally gives the bound (r/l)β_r on the stability of G = E_{D_r⊆D_l}[f_{D_r}]. This result plugged into the previous theorem gives the final bound.

This theorem holds for ensemble combinations that are theoretically defined from the expectation E_{D_r⊆D_l}[f_{D_r}]. Notice that the hypotheses do not require that the combination is formed by only one type of machine. In particular, one can imagine an ensemble of different kernel machines with different kernels. We formalize this remark in the following theorem.

Theorem 5.3. Let F̂_S be a finite combination of SVMs f^s, s = 1, ..., S, with different kernels K_1, ..., K_S:

$$\hat F_S = \frac{1}{S}\sum_{s=1}^{S} E_{D_r \subseteq D_l}\big[ f^s_{D_r} \big] \qquad (15)$$

where f^s_{D_r} is an SVM with kernel K_s learned on D_r. Denote as before by DLoo_{δ,D_l}(F̂_S) the det-leave-one-out error of F̂_S computed with the function π_δ. Assume that each of the f^s_{D_r} is learned with the same C on a subset D_r of size r drawn from D_l with a uniform distribution. For any given δ, with probability 1 − η, the generalization misclassification error is bounded as:

$$E_{(x,y)}[\theta(-y\hat F_S(x))] \le \mathrm{DLoo}_{\delta,D_l}(\hat F_S) + \frac{r}{2l} C\kappa^2 + \sqrt{\frac{r^2}{2l}\left(\frac{C\kappa^2}{\delta} + \frac{1}{r}\right)^2 \ln\frac{1}{\eta}},$$

where κ² = (1/S) Σ_{s=1}^S sup_{x∈X} K_s(x, x).
Proof: As before, we study

$$G - G^i = \frac{1}{S}\sum_{s=1}^{S}\left( E_{D_r \subseteq D_l}\big[ f^s_{D_r} \big] - E_{D_r \subseteq D_l}\big[ f^s_{D_r^i} \big] \right)$$

Following the same calculations as in the previous theorem for each of the summands, we have:

$$G - G^i \le \frac{1}{S}\sum_{s=1}^{S} \beta_{r,s} \int 1_{(x_i,y_i)\in D_r}\, dP,$$

where β_{r,s} denotes the stability of an SVM with kernel K_s on a set of size r, and P is the distribution over the sampling of D_r from D_l. As before, since (x_i, y_i) appears in D_r only r/l times on average, we have the following bound:

$$G - G^i \le \frac{1}{S}\sum_{s=1}^{S} \beta_{r,s}\, \frac{r}{l}.$$

Replacing β_{r,s} by its value for the case of SVMs yields a bound on the generalization error of G in terms of its leave-one-out error. This translates for F̂_S into a bound on its generalization error in terms of its det-leave-one-out error, which is the statement of the theorem.

Notice that Theorem 5.3 holds for combinations of kernel machines where, for each kernel, we use many machines trained on subsamples of the training set, so it is an ensemble of ensembles (see Eq. (15)). Compared to what has been derived for a single SVM, combining SVMs provides a tighter bound on the generalization error. This result can then be interpreted as an explanation of the better estimation of the test error by the det-leave-one-out error for ensemble methods. The bounds given by the previous theorems have the form:

$$E_{(x,y)}[\theta(-yF(x))] \le \mathrm{DLoo}_{\delta,D_l}(F) + O\!\left(\frac{r\, C_r\, \kappa^2}{\delta}\sqrt{\frac{\ln(1/\eta)}{2l}}\right)$$

while the bound for a single SVM is:

$$E_{(x,y)}[\theta(-yf(x))] \le \mathrm{Loo}_{\delta,D_l}(f) + O\!\left(\frac{l\, C_l\, \kappa^2}{\delta}\sqrt{\frac{\ln(1/\eta)}{2l}}\right)$$
We have indexed the parameter C to indicate that the SVMs are not learned with the same training set size in the first and in the second case. In the experiments, the same C was used for all SVMs (C_l = C_r). The bound derived for a combination of SVMs is then tighter than that for a single SVM by a factor of r/l. The improvement arises because the stability of the combination of SVMs is better than the stability of a single SVM. This is true if we assume that both SVMs are trained with the same C, but the discussion becomes trickier if different C's are used during learning. The stability of SVMs indeed depends on the way the value of C is determined. For a single SVM, C_l is generally a function of l, and for a combination of SVMs, C also depends on the size of the subsampled learning sets D_{r,t}. In Theorem 5.2, we have seen that the stability of the combination of machines is smaller than (r/l)β_r, where β_r is equal to Cκ²/2 for SVMs. If this stability is better than the stability of a single machine, then combining the functions f_{D_{r,t}} provides a better bound. However, in the other case, the bound gets worse. We have the following corollary, whose proof is direct:

Corollary 5.1. If a learning system is β_l stable and β_l/β_r < r/l, then combining these learning systems does not provide a better bound on the difference between the test error and the leave-one-out error. Conversely, if β_l/β_r > r/l, then combining these learning systems leads to a better bound on the difference between the test error and the leave-one-out error.

This corollary gives an indication that combining machines should not be used if the stability of the single machine is very good. Notice that the corollary is about bounds, and not about whether the generalization error for bagging, or the actual difference between the test error and the leave-one-out error, is always smaller for unstable machines (and larger for stable ones): this depends on how tight the bounds are in every case.
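The condition in Corollary 5.1 is straightforward to evaluate numerically. A minimal sketch, using the SVM stability β = Cκ²/2 from this section; the values of C, κ, r and l below are hypothetical, chosen only for illustration:

```python
# Numerical check of the condition in Corollary 5.1, using the SVM
# stability beta = C * kappa^2 / 2 discussed in this section.

def svm_stability(C, kappa):
    """Uniform stability (w.r.t. the l1 norm) of an SVM: C * kappa^2 / 2."""
    return C * kappa ** 2 / 2.0

l, r = 1000, 100        # full training set size, subsample size (hypothetical)
C, kappa = 1.0, 1.0     # same C for the single and the subsample machines

beta_l = svm_stability(C, kappa)            # single machine trained on l points
beta_r = svm_stability(C, kappa)            # each subsample machine (r points)
combination_stability = (r / l) * beta_r    # stability of the combination

# Corollary 5.1: the combination gives a better bound iff beta_l / beta_r > r / l.
print(beta_l / beta_r > r / l)  # True: with a fixed C, bagging tightens the bound
```

With a fixed C the ratio β_l/β_r equals 1, which exceeds r/l whenever r < l, so the combination's stability (r/l)β_r is smaller than β_l; if instead C were scaled so that β_l/β_r < r/l, the inequality would flip.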
However, it is not often the case that we have a highly stable single machine, and therefore typically bagging improves stability. In such a situation, the bounds presented in this paper show that we have better control of the generalization error for a combination of SVMs, in the sense that the leave-one-out and the empirical errors are closer to the test error. The bounds presented do not necessarily imply that the generalization error of bagging is less than that of single machines. Similar remarks have already been made by Breiman (1996) for bagging, where similar considerations of stability are experimentally discussed. Another remark that can be made from the work of Breiman is that bagging does not improve performance after a certain number of bagged predictors. On the other hand, it does not reduce performance either. This experimentally derived statement can be translated in our framework as: when T increases, the stability of the combined learning system tends to the stability of the expectation E_{D_r⊆D_l}[f_{D_r}], which does not improve after T has passed a certain value. This value may correspond to the convergence of the finite sum (1/T) Σ_t f_{D_{r,t}} to its expectation w.r.t. D_{r,t} ⊆ D_l. Finally, it is worth noticing that the stability analysis of this section also holds for the empirical error. Indeed, for a β_l stable algorithm, as is underlined in Bousquet and Elisseeff (2002), the leave-one-out error and the empirical error are related by:

$$\mathrm{Loo}_{\delta,D_l}(f) \le \mathrm{Err}_0(f_{D_l}) + \beta_l,$$
where Err_0(f_{D_l}) is the empirical error on the learning set D_l. Using this inequality in Theorems 5.2 and 5.3 for the algorithm G, we can bound the generalization error of F in terms of the empirical error and the stability of the machines.

6. Other ensembles and error estimates

6.1. Validation set for model selection

Instead of using bounds on the generalization performance of learning machines like the ones discussed above, an alternative approach for model selection is to use a validation set to choose the parameters of the machines. We consider first the simple case where we have N machines and we choose the best one based on the error they make on a fixed validation set of size V. This can be thought of as a special case where we consider as hypothesis space the set of the N machines, and then we "train" by simply picking the machine with the smallest empirical error (in this case this is the validation error). It is known that if VE_i is the validation error of machine i and TE_i is its true test error, then for all N machines simultaneously the following bound holds with probability 1 − η (Devroye, Györfi, & Lugosi, 1996; Vapnik, 1998):

$$TE_i \le VE_i + \sqrt{\frac{\log N - \log\frac{\eta}{4}}{V}} \qquad (16)$$

So how accurately we pick the best machine using the validation set depends, as expected, on the number of machines N and on the size V of the validation set. The bound suggests that a validation set can be used to accurately estimate the generalization performance of a relatively small number of machines (i.e. a small number of parameter values examined), as is often done in practice. We used this observation for parameter selection for SVMs and for their ensembles. Experimentally we followed a slightly different procedure from what is suggested by bound (16). For each machine (that is, for each σ of the Gaussian kernel in our case, both for a single SVM and for an ensemble of machines) we split the training set (for each training-testing split of the overall dataset as described above) into a smaller training set and a validation set (70–30%, respectively).
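The deviation term of bound (16) is easy to evaluate directly. A small sketch, taking the term as reconstructed above; the values of N, V and η are arbitrary illustrations:

```python
import math

def validation_gap(N, V, eta):
    """Deviation term of bound (16): how far the validation error of the
    best of N machines can lie from its test error, with probability 1 - eta."""
    return math.sqrt((math.log(N) - math.log(eta / 4.0)) / V)

# A handful of candidate machines and a modest validation set keep the gap small.
print(validation_gap(N=10, V=300, eta=0.05))     # roughly 0.15
# A very large grid of parameter values widens the gap for the same V.
print(validation_gap(N=10000, V=300, eta=0.05))  # roughly 0.21
```

As the text notes, the gap grows only logarithmically in N but shrinks as the square root of V, which is why a validation set works well when few parameter values are compared.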
We trained each machine using the new, smaller training set, and measured the performance of the machine on the validation set. Unlike what bound (16) suggests, instead of comparing the validation performance with the generalization performance of the machines trained on the smaller training set (which is the case for which bound (16) holds), we compared the validation performance with the test performance of the machine trained using the whole initial (larger) training set. This way we did not have to use fewer points for training the machines, which is a typical drawback of using a validation set, and we could compare the validation performance with the leave-one-out bounds and the test performance of the exact same machines we used in Section 4. We show the results of these experiments in figures 1–5 (see the dotted lines in the plots). We observe that although the validation error is that of a machine trained on a smaller
training set, it still provides a very good estimate of the test performance of the machines trained on the whole training set. In all cases, including the case of C > 1 for which the leave-one-out bounds discussed above did not work well, the validation set provided a very good estimate of the test performance of the machines.

6.2. Adaptive combinations of learning machines

The ensembles of kernel machines (4) considered so far are voting combinations where the coefficients c_t in (4) of the linear combination of the machines are fixed. We now consider the case where these coefficients are also learned. In particular, we consider the following two-layer architecture:

1. A number T of kernel machines is trained as before (for example using different training data, or different parameters). Let f_t(x), t = 1, ..., T, be the machines.
2. The T outputs (real valued in our experiments, but they could also be thresholded binary) of the machines at each of the training points are computed.
3. A linear machine (i.e. a linear SVM) is trained using as inputs the outputs of the T machines on the training data, and as labels the original training labels. The solution is used as the coefficients c_t of the linear combination of the T machines.

In this case the ensemble machine F(x) is a kernel machine itself, trained using as kernel the function K(x, x') = Σ_{t=1}^T f_t(x) f_t(x'). Notice that since each of the machines f_t(x) depends on the data, the kernel K is also data dependent. Therefore the stability parameter of the ensemble machine is more difficult to compute (when a data point is left out, the kernel K changes). Likewise, the leave-one-out bound of Theorem 3.3 does not hold, since the theorem assumes fixed coefficients c_t. On the other hand, an important characteristic of this type of ensemble is that, independently of what kernels/parameters each of the individual machines of the ensemble uses, the second layer machine (which finds the coefficients c_t) always uses a linear kernel.
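The three steps above can be sketched as follows. This is an illustration using scikit-learn, not the authors' code; the synthetic dataset and the choices of T, r and the kernel parameters are made up:

```python
# Sketch of the adaptive combination (ACC): T kernel machines trained on
# random subsamples, then a linear SVM learns the combination coefficients
# c_t from their real-valued outputs on the training set.
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

T, r = 30, 100                    # number of machines, subsample size (illustrative)
machines = []
for t in range(T):
    idx = rng.choice(len(X), size=r, replace=False)  # subsample without replacement
    machines.append(SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx]))

# Step 2: real-valued outputs of the T machines at every training point.
F = np.column_stack([m.decision_function(X) for m in machines])

# Step 3: a linear SVM on these outputs; its weights are the coefficients c_t.
second_layer = LinearSVC(C=1.0).fit(F, y)
c = second_layer.coef_.ravel()    # learned combination coefficients
print(c.shape)                    # (30,)
```

Evaluating the ensemble at a new point x amounts to computing the T decision values f_t(x) and taking their learned linear combination, which is exactly the two-layer architecture described above.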
The fact that the second layer always uses a linear kernel may imply that the overall architecture is less sensitive to the kernels/parameters of the machines of the ensemble. We tested this hypothesis experimentally by comparing how the test performance of this type of machine changes with the σ of the Gaussian kernel used by the individual machines of the ensemble, and compared the behavior with that of single machines and of ensembles of machines with fixed c_t. In figure 7 we show two examples. In our experiments, for all datasets except one, learning the coefficients c_t of the combination of the machines using a linear machine (we used a linear SVM) made the overall machine less sensitive to changes of the parameters of the individual machines (the σ of the Gaussian kernel). This can be a useful characteristic of the architecture outlined in this section. For example, the kernel parameters of the machines of the ensembles need not be tuned accurately.
Figure 7. When the coefficients of the second layer are learned using a linear SVM, the system is less sensitive to changes of the σ of the Gaussian kernel used by the individual machines of the ensemble. The solid line is one SVM, the dotted line is an ensemble of 30 SVMs with fixed c_t = 1/30, and the dashed line is an ensemble of 30 SVMs with the coefficients c_t learned. The horizontal axis shows the natural logarithm of the σ of the Gaussian kernel. Left is the Heart dataset, and right is the Diabetes one. The threshold b is non-zero for these experiments.

6.3. Ensembles versus single machines

So far we have concentrated on the theoretical and experimental characteristics of ensembles of kernel machines. We now discuss how ensembles compare with single machines. Table 2 shows the test performance of one SVM compared with that of an ensemble of 30 SVMs combined with c_t = 1/30 and an ensemble of 30 SVMs combined using a linear SVM, for some UCI datasets (characteristic results). For the tables of this section we use, for convenience, the following notation: VCC stands for Voting Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are fixed; ACC stands for Adaptive Combinations of Classifiers, meaning that the coefficients c_t of the combination of the machines are learned-adapted. We only consider SVMs and ensembles of SVMs with the threshold b. The table shows mean test errors and standard deviations for the best (decided using the validation set

Table 2. Average errors and standard deviations (percentages) of the best machines (best σ of the Gaussian kernel and best C) chosen according to the validation set performances. The performances of the machines are about the same. VCC and ACC use 30 SVMs.

Dataset     SVM        VCC       ACC
Breast      25.5 ± …   … ± …     … ± 4.0
Thyroid     5.1 ± …    … ± …     … ± 2.7
Diabetes    23.0 ± …   … ± …     … ± 1.8
Heart       15.4 ± …   … ± …     … ± 3.2
Table 3. Comparison between error rates of a single SVM vs. error rates of VCC and ACC of 100 SVMs, for different percentages of subsampled data. The last dataset is from Osuna, Freund, and Girosi (1997).

Dataset     VCC 10%   VCC 5%   VCC 1%   ACC 10%   ACC 5%   ACC 1%   SVM
Diabetes    …         …        …        …         …        …        … ± 1.6
Thyroid     …         …        …        …         …        …        … ± 2.5
Faces       …         …        …        …         …        …        0.5

performance in this case) parameters of the machines (σ's of the Gaussians and the parameter C, hence different from figures 1–5, which were for a given C). As the results show, the best SVM and the best ensembles we found have about the same test performance. Therefore, with appropriate tuning of the parameters of the machines, combining SVMs does not lead to a performance improvement compared to a single SVM. Although the best SVM and the best ensemble (that is, after accurate parameter tuning) perform similarly, an important difference of the ensembles compared to a single machine is that, in the case of bagging, the training of the ensemble consists of a large number of (parallelizable) small-training-set kernel machines. This implies that one can obtain performance similar to that of a single machine by training many faster machines using smaller training sets, although the actual testing may be slower, since the size of the union of the support vectors of the combination of machines is expected to be larger than the number of support vectors of a single machine using all the training data. This can be an important practical advantage of ensembles of machines, especially in the case of large datasets. Table 3 compares the test performance of a single SVM with that of an ensemble of SVMs each trained with as little as 1% of the initial training set (for one dataset; for the other ones we could not use 1% because the size of the original dataset was small, so 1% of it was only a couple of points).
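An experiment in the spirit of Table 3 can be sketched as follows: a voting combination (VCC, fixed c_t = 1/T) of SVMs, each trained on a small fraction of the training data, compared with a single SVM on the full set. The dataset, fractions and parameters below are illustrative, not the paper's setup:

```python
# VCC with small subsamples: T SVMs, each trained on a fraction of D_l,
# combined by voting, versus one SVM trained on all of D_l.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)
rng = np.random.default_rng(1)

def vcc_error(frac, T=100):
    """Test error of a vote over T SVMs, each seeing a fraction of the training set."""
    r = max(int(frac * len(Xtr)), 10)
    votes = np.zeros(len(Xte))
    for _ in range(T):
        idx = rng.choice(len(Xtr), size=r, replace=False)  # subsample without replacement
        f = SVC(kernel="rbf", C=1.0).fit(Xtr[idx], ytr[idx])
        votes += f.decision_function(Xte)                  # real-valued vote
    pred = (votes > 0).astype(int)
    return float(np.mean(pred != yte))

single = SVC(kernel="rbf", C=1.0).fit(Xtr, ytr)
print("single SVM:", float(np.mean(single.predict(Xte) != yte)))
for frac in (0.10, 0.05):
    print(f"VCC {frac:.0%}:", vcc_error(frac))
```

On such synthetic data the voting combination of small-subsample machines typically stays close to the single machine's error, which is the qualitative behavior reported in Table 3; each of the T small fits is also much cheaper and trivially parallelizable.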
For fixed c_t the performance decreases only slightly in all cases (Thyroid, which we show, was the only dataset we found in our experiments for which the change was significant in the case of VCC), while in the case of the adaptive architecture of Section 6.2, even with 1% training data the performance does not decrease. This is because the linear machine used to learn the coefficients c_t uses all the training data. Even in this last case the overall machine can still be faster than a single machine, since the second layer learning machine is a linear one, and fast training methods exist for the particular case of linear machines (Platt, 1998).

7. Conclusions

We presented theoretical bounds on the generalization error of ensembles of kernel machines such as SVMs. Our results apply to the general case where each of the machines in the ensemble is trained on different subsets of the training data and/or uses different kernels or input features. A special case of ensembles is that of bagging. The bounds were derived within the frameworks of cross validation and of stability and learning. They involve two main quantities: the det-leave-one-out estimate and the stability parameter of the ensembles.
80 Chapter 8 Secton 8 Polar Form of Complex Numbers From prevous classes, you may have encountered magnary numbers the square roots of negatve numbers and, more generally, complex numbers whch are the
More informationAssortment Optimization under MNL
Assortment Optmzaton under MNL Haotan Song Aprl 30, 2017 1 Introducton The assortment optmzaton problem ams to fnd the revenue-maxmzng assortment of products to offer when the prces of products are fxed.
More informationThe Geometry of Logit and Probit
The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.
More informationComplete subgraphs in multipartite graphs
Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G
More informationLinear Regression Analysis: Terminology and Notation
ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented
More informationLinear Classification, SVMs and Nearest Neighbors
1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush
More informationCollege of Computer & Information Science Fall 2009 Northeastern University 20 October 2009
College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:
More informationExcess Error, Approximation Error, and Estimation Error
E0 370 Statstcal Learnng Theory Lecture 10 Sep 15, 011 Excess Error, Approxaton Error, and Estaton Error Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton So far, we have consdered the fnte saple
More information1 Definition of Rademacher Complexity
COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture #9 Scrbe: Josh Chen March 5, 2013 We ve spent the past few classes provng bounds on the generalzaton error of PAClearnng algorths for the
More informationSupporting Information
Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to
More informationExercises. 18 Algorithms
18 Algorthms Exercses 0.1. In each of the followng stuatons, ndcate whether f = O(g), or f = Ω(g), or both (n whch case f = Θ(g)). f(n) g(n) (a) n 100 n 200 (b) n 1/2 n 2/3 (c) 100n + log n n + (log n)
More informationFormulas for the Determinant
page 224 224 CHAPTER 3 Determnants e t te t e 2t 38 A = e t 2te t e 2t e t te t 2e 2t 39 If 123 A = 345, 456 compute the matrx product A adj(a) What can you conclude about det(a)? For Problems 40 43, use
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More informationWhich Separator? Spring 1
Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal
More informationFinding Dense Subgraphs in G(n, 1/2)
Fndng Dense Subgraphs n Gn, 1/ Atsh Das Sarma 1, Amt Deshpande, and Rav Kannan 1 Georga Insttute of Technology,atsh@cc.gatech.edu Mcrosoft Research-Bangalore,amtdesh,annan@mcrosoft.com Abstract. Fndng
More informationThe Order Relation and Trace Inequalities for. Hermitian Operators
Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence
More informationMMA and GCMMA two methods for nonlinear optimization
MMA and GCMMA two methods for nonlnear optmzaton Krster Svanberg Optmzaton and Systems Theory, KTH, Stockholm, Sweden. krlle@math.kth.se Ths note descrbes the algorthms used n the author s 2007 mplementatons
More informationx = , so that calculated
Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to
More informationCS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015
CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research
More information/ n ) are compared. The logic is: if the two
STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More informationChapter 11: Simple Linear Regression and Correlation
Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests
More informationChapter 9: Statistical Inference and the Relationship between Two Variables
Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,
More informationIntroduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:
CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and
More informationEvaluation of simple performance measures for tuning SVM hyperparameters
Evaluaton of smple performance measures for tunng SVM hyperparameters Kabo Duan, S Sathya Keerth, Aun Neow Poo Department of Mechancal Engneerng, Natonal Unversty of Sngapore, 0 Kent Rdge Crescent, 960,
More informationThis column is a continuation of our previous column
Comparson of Goodness of Ft Statstcs for Lnear Regresson, Part II The authors contnue ther dscusson of the correlaton coeffcent n developng a calbraton for quanttatve analyss. Jerome Workman Jr. and Howard
More informationGraph Reconstruction by Permutations
Graph Reconstructon by Permutatons Perre Ille and Wllam Kocay* Insttut de Mathémathques de Lumny CNRS UMR 6206 163 avenue de Lumny, Case 907 13288 Marselle Cedex 9, France e-mal: lle@ml.unv-mrs.fr Computer
More informationFoundations of Arithmetic
Foundatons of Arthmetc Notaton We shall denote the sum and product of numbers n the usual notaton as a 2 + a 2 + a 3 + + a = a, a 1 a 2 a 3 a = a The notaton a b means a dvdes b,.e. ac = b where c s an
More informationLinear Approximation with Regularization and Moving Least Squares
Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...
More informationOn the correction of the h-index for career length
1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat
More informationEEE 241: Linear Systems
EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they
More informationSTAT 3008 Applied Regression Analysis
STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,
More informationCIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationBayesian predictive Configural Frequency Analysis
Psychologcal Test and Assessment Modelng, Volume 54, 2012 (3), 285-292 Bayesan predctve Confgural Frequency Analyss Eduardo Gutérrez-Peña 1 Abstract Confgural Frequency Analyss s a method for cell-wse
More informationSupport Vector Machines
CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at
More informationAdditional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty
Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,
More informationPsychology 282 Lecture #24 Outline Regression Diagnostics: Outliers
Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.
More informationPHYS 705: Classical Mechanics. Calculus of Variations II
1 PHYS 705: Classcal Mechancs Calculus of Varatons II 2 Calculus of Varatons: Generalzaton (no constrant yet) Suppose now that F depends on several dependent varables : We need to fnd such that has a statonary
More informationLecture 4: September 12
36-755: Advanced Statstcal Theory Fall 016 Lecture 4: September 1 Lecturer: Alessandro Rnaldo Scrbe: Xao Hu Ta Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer: These notes have not been
More informationThe exam is closed book, closed notes except your one-page cheat sheet.
CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces
More information