Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters


Journal of Machine Learning Research 8 (2007). Submitted /06; Published 4/07.

Gavin C. Cawley (GCC@CMP.UEA.AC.UK)
Nicola L. C. Talbot (NLCT@CMP.UEA.AC.UK)
School of Computing Sciences, University of East Anglia, Norwich, United Kingdom NR4 7TJ

Editors: Isabelle Guyon and Amir Saffari

Abstract

While the model parameters of a kernel machine are typically given by the solution of a convex optimisation problem, with a single global optimum, the selection of good values for the regularisation and kernel parameters is much less straightforward. Fortunately the leave-one-out cross-validation procedure can be performed, or at least approximated, very efficiently in closed form for a wide variety of kernel learning methods, providing a convenient means for model selection. Leave-one-out cross-validation based estimates of performance, however, generally exhibit a relatively high variance and are therefore prone to over-fitting. In this paper, we investigate the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically. Results obtained on a suite of thirteen real-world and synthetic benchmark data sets clearly demonstrate the benefit of this approach.

Keywords: model selection, kernel methods, Bayesian regularisation

1. Introduction

Leave-one-out cross-validation (Lachenbruch and Mickey, 1968; Luntz and Brailovsky, 1969; Stone, 1974) provides the basis for computationally efficient model selection strategies for a variety of kernel learning methods, including the Support Vector Machine (SVM) (Cortes and Vapnik, 1995; Chapelle et al., 2002), Gaussian Process (GP) (Rasmussen and Williams, 2006; Sundararajan and Keerthi, 2001), Least-Squares Support Vector Machine (LS-SVM) (Suykens and Vandewalle, 1999; Cawley and Talbot, 2004), Kernel Fisher Discriminant (KFD) analysis (Mika et al., 1999; Cawley and Talbot, 2003; Saadi et al., 2004; Bo et al., 2006) and Kernel Logistic Regression (KLR) (Keerthi et al., 2005; Cawley and Talbot, 2007). These methods have proved highly successful for kernel machines having only a small number of hyper-parameters to optimise, as demonstrated by the set of models achieving the best average score in the WCCI-2006 performance prediction challenge (Cawley, 2006; Guyon et al., 2006). Unfortunately, while leave-one-out cross-validation estimators have been shown to be almost unbiased (Luntz and Brailovsky, 1969), they are known to exhibit a relatively high variance (e.g., Kohavi, 1995). A kernel with many hyper-parameters, for instance those used in Automatic Relevance Determination (ARD) (e.g., Rasmussen and Williams, 2006) or feature scaling methods (Chapelle et al., 2002; Bo et al., 2006), may provide sufficient degrees of freedom to over-fit leave-one-out cross-validation based model selection criteria, resulting in performance inferior to that obtained using a less flexible kernel function.

In this paper, we investigate the novel use of regularisation (Tikhonov and Arsenin, 1977) of the hyper-parameters in model selection in order to ameliorate the effects of the high variance of leave-one-out cross-validation based selection criteria, and so improve predictive performance. The regularisation term corresponds to a zero-mean Gaussian prior over the values of the kernel parameters, representing a preference for smooth kernel functions, and hence a relatively simple classifier. The regularisation parameters introduced in this step are integrated out analytically in the style of Buntine and Weigend (1991), to provide a Bayesian model selection criterion that can be optimised in a straightforward manner via, for example, scaled conjugate gradient descent (Williams, 1991).

The paper is structured as follows: the remainder of this section provides a brief overview of the least-squares support vector machine, including the use of leave-one-out cross-validation based model selection procedures, given in sufficient detail to ensure the reproducibility of the results. Section 2 describes the use of Bayesian regularisation to prevent over-fitting at the second level of inference, that is, model selection. Section 3 presents results obtained over a suite of thirteen benchmark data sets, which demonstrate the utility of this approach. Section 4 discusses the results, and finally the work is summarised and directions for further work are outlined in Section 5.

1.1 Least Squares Support Vector Machine

In the remainder of this section, we provide a brief overview of the least-squares support vector machine (Suykens and Vandewalle, 1999), used as the testbed for the investigation of the role of regularisation in the model selection process described in this study. Given training data, $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{\ell}$, where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, we seek to construct a linear discriminant, $f(x) = w \cdot \phi(x) + b$, in a feature space, $\mathcal{F}$, defined by a fixed transformation of the input space, $\phi : \mathcal{X} \to \mathcal{F}$. The parameters of the linear discriminant, $(w, b)$, are given by the minimiser of a regularised (Tikhonov and Arsenin, 1977) least-squares training criterion,

\[ L = \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{\ell}\left[y_i - w \cdot \phi(x_i) - b\right]^2, \qquad (1) \]

where $\mu$ is a regularisation parameter controlling the bias-variance trade-off (Geman et al., 1992). Rather than specify the feature space directly, it is instead induced by a kernel function, $\mathcal{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which evaluates the inner product between the projections of the data onto the feature space, that is, $\mathcal{K}(x, x') = \phi(x) \cdot \phi(x')$. The interpretation of a kernel as an inner product in a fixed feature space is valid for any Mercer kernel (Mercer, 1909), for which the Gram matrix, $K = [k_{ij} = \mathcal{K}(x_i, x_j)]_{i,j=1}^{\ell}$, is positive semi-definite, that is,

\[ a^T K a \geq 0, \qquad \forall\, a \in \mathbb{R}^{\ell}. \]

The Gram matrix effectively encodes the spatial relationships between the projections of the data in the feature space, $\mathcal{F}$. A linear model can thus be constructed implicitly in the feature space using only the information contained in the Gram matrix, without explicitly evaluating the positions of the data in the feature space via the transformation $\phi(\cdot)$.
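As a concrete companion to the Gram matrix just described, the following minimal sketch (not from the original paper; the helper name rbf_gram and all values are illustrative) builds the Gram matrix of the spherical RBF kernel used throughout the paper and checks the Mercer condition $a^T K a \geq 0$ numerically:

```python
import numpy as np

def rbf_gram(X, eta):
    """Gram matrix K_ij = exp(-eta * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-eta * np.maximum(sq_dists, 0.0))  # clip tiny negatives

X = np.random.randn(50, 3)                    # 50 patterns, 3 features
K = rbf_gram(X, eta=0.5)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # PSD up to rounding error
```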

Indeed, the representer theorem (Kimeldorf and Wahba, 1971) shows that the solution of the optimisation problem (1) can be written as an expansion over the training patterns,

\[ w = \sum_{i=1}^{\ell} \alpha_i \phi(x_i) \quad \Longrightarrow \quad f(x) = \sum_{i=1}^{\ell} \alpha_i \mathcal{K}(x_i, x) + b. \]

The advantage of the kernel trick then becomes apparent; a linear model can be constructed in an extremely rich, high- (possibly infinite-) dimensional feature space, using only finite-dimensional quantities, such as the Gram matrix, $K$. The kernel trick also allows the construction of statistical models that operate directly on structured data, for instance strings, trees and graphs, leading to the current interest in kernel learning methods in computational biology (Schölkopf et al., 2004) and text processing (Joachims, 2002). The Radial Basis Function (RBF) kernel,

\[ \mathcal{K}(x, x') = \exp\left\{ -\eta \|x - x'\|^2 \right\}, \]

is commonly encountered in practical applications of kernel learning methods; here $\eta$ is a kernel parameter controlling the sensitivity of the kernel function. The feature space for the radial basis function kernel consists of the positive orthant of an infinite-dimensional unit hyper-sphere (e.g., Shawe-Taylor and Cristianini, 2004). The Gram matrix for the radial basis function kernel is thus of full rank (Micchelli, 1986), and so the kernel model is able to form an arbitrary shattering of the data.

1.1.1 A DUAL TRAINING ALGORITHM

The basic training algorithm for the least-squares support vector machine (Suykens and Vandewalle, 1999) views the regularised loss function (1) as a constrained minimisation problem:

\[ \min_{w, b, \varepsilon} \; \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{\ell} \varepsilon_i^2 \qquad \text{subject to} \qquad \varepsilon_i = y_i - w \cdot \phi(x_i) - b. \]

The primal Lagrangian for this constrained optimisation problem gives the unconstrained minimisation problem defined by the following regularised loss function,

\[ L = \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{\ell} \varepsilon_i^2 - \sum_{i=1}^{\ell} \alpha_i \left\{ w \cdot \phi(x_i) + b + \varepsilon_i - y_i \right\}, \]

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{\ell}) \in \mathbb{R}^{\ell}$ is a vector of Lagrange multipliers. The optimality conditions for this problem can be expressed as follows:

\[ \frac{\partial L}{\partial w} = 0 \;\Longrightarrow\; w = \sum_{i=1}^{\ell} \alpha_i \phi(x_i), \qquad (2) \]
\[ \frac{\partial L}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{\ell} \alpha_i = 0, \qquad (3) \]
\[ \frac{\partial L}{\partial \varepsilon_i} = 0 \;\Longrightarrow\; \alpha_i = \frac{\varepsilon_i}{\mu}, \quad \forall\, i \in \{1, 2, \ldots, \ell\}, \qquad (4) \]
\[ \frac{\partial L}{\partial \alpha_i} = 0 \;\Longrightarrow\; w \cdot \phi(x_i) + b + \varepsilon_i - y_i = 0, \quad \forall\, i \in \{1, 2, \ldots, \ell\}. \qquad (5) \]

Using (2) and (4) to eliminate $w$ and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_{\ell})$ from (5), we find that

\[ \sum_{j=1}^{\ell} \alpha_j \phi(x_j) \cdot \phi(x_i) + b + \mu \alpha_i = y_i, \quad \forall\, i \in \{1, 2, \ldots, \ell\}. \qquad (6) \]

Noting that $\mathcal{K}(x, x') = \phi(x) \cdot \phi(x')$, the system of linear equations, (6) and (3), can be written more concisely in matrix form as

\[ \begin{bmatrix} K + \mu I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \]

where $K = [k_{ij} = \mathcal{K}(x_i, x_j)]_{i,j=1}^{\ell}$, $I$ is the identity matrix and $\mathbf{1}$ is a column vector of ones. The optimal parameters for the model of the conditional mean can then be obtained with a computational complexity of $O(\ell^3)$ operations, using direct methods, such as Cholesky decomposition (Golub and Van Loan, 1996).

1.1.2 EFFICIENT IMPLEMENTATION VIA CHOLESKY DECOMPOSITION

A more efficient training algorithm can be obtained by taking advantage of the special structure of the system of linear equations (Suykens et al., 2002). The system of linear equations to be solved in fitting a least-squares support vector machine is given by

\[ \begin{bmatrix} M & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \qquad (7) \]

where $M = K + \mu I$. Unfortunately the matrix on the left-hand side is not positive definite, and so we cannot solve this system of linear equations directly using the Cholesky decomposition. However, the first row of (7) can be re-written as

\[ M \left( \alpha + M^{-1} \mathbf{1} b \right) = y. \qquad (8) \]

Rearranging (8), we see that $\alpha = M^{-1}(y - \mathbf{1}b)$; using this result to eliminate $\alpha$, the second row of (7) can be written as

\[ \mathbf{1}^T M^{-1} \mathbf{1}\, b = \mathbf{1}^T M^{-1} y. \]

The system of linear equations can then be re-written as

\[ \begin{bmatrix} M & 0 \\ \mathbf{1}^T & \mathbf{1}^T M^{-1} \mathbf{1} \end{bmatrix} \begin{bmatrix} \alpha + M^{-1}\mathbf{1}b \\ b \end{bmatrix} = \begin{bmatrix} y \\ \mathbf{1}^T M^{-1} y \end{bmatrix}. \qquad (9) \]

In this case, the matrix on the left-hand side is positive definite, as $M = K + \mu I$ is positive definite and $\mathbf{1}^T M^{-1} \mathbf{1}$ is positive, since the inverse of a positive definite matrix is also positive definite. The revised system of linear equations (9) can be solved as follows: first solve

\[ M \rho = \mathbf{1} \qquad \text{and} \qquad M \nu = y, \qquad (10) \]

which may be performed efficiently using the Cholesky factorisation of $M$. The model parameters of the least-squares support vector machine are then given by

\[ b = \frac{\mathbf{1}^T \nu}{\mathbf{1}^T \rho} \qquad \text{and} \qquad \alpha = \nu - \rho b. \]

The two systems of linear equations (10) can be solved efficiently using the Cholesky decomposition of $M = R^T R$, where $R$ is the upper triangular Cholesky factor of $M$.
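A minimal sketch of this training procedure (not from the original paper; the function name lssvm_fit is illustrative, and SciPy's cho_factor/cho_solve stand in for a hand-rolled Cholesky solver):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lssvm_fit(K, y, mu):
    """LS-SVM training via Equations (9)-(10): factorise M = K + mu*I once,
    solve M rho = 1 and M nu = y, then recover b and alpha."""
    ell = K.shape[0]
    cf = cho_factor(K + mu * np.eye(ell))
    rho = cho_solve(cf, np.ones(ell))
    nu = cho_solve(cf, y)
    b = np.sum(nu) / np.sum(rho)   # b = (1^T nu) / (1^T rho)
    alpha = nu - rho * b
    return alpha, b
```

A single Cholesky factorisation serves both right-hand sides, which is the point of the reformulation in (9) and (10).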

1.2 Leave-One-Out Cross-Validation

Cross-validation (Stone, 1974) is commonly used to obtain a reliable estimate of the test error for performance estimation or for use as a model selection criterion. The most common form, $k$-fold cross-validation, partitions the available data into $k$ disjoint subsets. In each iteration a classifier is trained on a different combination of $k-1$ subsets and the unused subset is used to estimate the test error rate. The $k$-fold cross-validation estimate of the test error rate is then simply the average of the test error rates observed in each of the $k$ iterations, or folds. The most extreme form of cross-validation, where $k = \ell$ such that the test partition in each fold consists of only a single pattern, is known as leave-one-out cross-validation (Lachenbruch and Mickey, 1968) and has been shown to provide an almost unbiased estimate of the test error rate (Luntz and Brailovsky, 1969). Leave-one-out cross-validation is, however, computationally expensive; in the case of the least-squares support vector machine a naïve implementation has a complexity of $O(\ell^4)$ operations. Leave-one-out cross-validation is therefore normally only used in circumstances where the available data are extremely scarce, such that the computational expense is no longer prohibitive. In this case the inherently high variance of the leave-one-out estimator (Kohavi, 1995) is offset by the minimal decrease in the size of the training set in each fold, and so it may provide a more reliable estimate of generalisation performance than conventional $k$-fold cross-validation. Fortunately leave-one-out cross-validation of least-squares support vector machines can be performed in closed form with a computational complexity of only $O(\ell^3)$ operations (Cawley and Talbot, 2004). Leave-one-out cross-validation can then be used in medium to large scale applications, where there may be a few thousand data points, although the relatively high variance of this estimator remains potentially problematic.

1.2.1 VIRTUAL LEAVE-ONE-OUT CROSS-VALIDATION

The optimal values of the parameters of a least-squares support vector machine are given by the solution of a system of linear equations:

\[ \begin{bmatrix} K + \mu I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}. \qquad (11) \]

The matrix on the left-hand side of (11) can be decomposed into the block-matrix representation

\[ \begin{bmatrix} K + \mu I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} = \begin{bmatrix} c_{11} & c_1^T \\ c_1 & C_1 \end{bmatrix} = C. \]

Let $[\alpha^{(-i)}; b^{(-i)}]$ represent the parameters of the least-squares support vector machine during the $i$-th iteration of the leave-one-out cross-validation procedure; then in the first iteration, in which the first training pattern is excluded,

\[ \begin{bmatrix} \alpha^{(-1)} \\ b^{(-1)} \end{bmatrix} = C_1^{-1} \left[ y_2, \ldots, y_{\ell}, 0 \right]^T. \]

The leave-one-out prediction for the first training pattern is then given by

\[ \hat{y}_1^{(-1)} = c_1^T \begin{bmatrix} \alpha^{(-1)} \\ b^{(-1)} \end{bmatrix} = c_1^T C_1^{-1} \left[ y_2, \ldots, y_{\ell}, 0 \right]^T. \]

Considering the last $\ell$ equations in the system of linear equations (11), it is clear that

\[ \left[ c_1 \;\; C_1 \right] \left[ \alpha_1, \ldots, \alpha_{\ell}, b \right]^T = \left[ y_2, \ldots, y_{\ell}, 0 \right]^T, \]

and so

\[ \hat{y}_1^{(-1)} = c_1^T C_1^{-1} \left[ c_1 \;\; C_1 \right] \left[ \alpha^T, b \right]^T = c_1^T C_1^{-1} c_1 \alpha_1 + c_1^T \left[ \alpha_2, \ldots, \alpha_{\ell}, b \right]^T. \]

Noting, from the first equation in the system of linear equations (11), that $y_1 = c_{11}\alpha_1 + c_1^T[\alpha_2, \ldots, \alpha_{\ell}, b]^T$, we have

\[ \hat{y}_1^{(-1)} = y_1 - \alpha_1 \left( c_{11} - c_1^T C_1^{-1} c_1 \right). \]

Finally, via the block matrix inversion lemma,

\[ \begin{bmatrix} c_{11} & c_1^T \\ c_1 & C_1 \end{bmatrix}^{-1} = \begin{bmatrix} \kappa^{-1} & -\kappa^{-1} c_1^T C_1^{-1} \\ -\kappa^{-1} C_1^{-1} c_1 & C_1^{-1} + \kappa^{-1} C_1^{-1} c_1 c_1^T C_1^{-1} \end{bmatrix}, \]

where $\kappa = c_{11} - c_1^T C_1^{-1} c_1$, and noting that the system of linear equations (11) is insensitive to permutations of the ordering of the equations and of the unknowns, we have that

\[ y_i - \hat{y}_i^{(-i)} = \frac{\alpha_i}{\left[ C^{-1} \right]_{ii}}. \qquad (12) \]

This means that, assuming the system of linear equations (11) is solved via explicit inversion of $C$, a leave-one-out cross-validation estimate of an appropriate model selection criterion can be evaluated using information already available as a by-product of training the least-squares support vector machine on the entire data set (cf. Sundararajan and Keerthi, 2001).

1.2.2 EFFICIENT IMPLEMENTATION VIA CHOLESKY FACTORISATION

The leave-one-out cross-validation behaviour of the least-squares support vector machine is described by (12). The coefficients of the kernel expansion, $\alpha$, can be found efficiently via Cholesky factorisation, as described in Section 1.1.2. However, we must also determine the diagonal elements of $C^{-1}$ in an efficient manner. Using the block matrix inversion formula, we obtain

\[ C^{-1} = \begin{bmatrix} M & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix}^{-1} = \begin{bmatrix} M^{-1} + M^{-1}\mathbf{1} S_M^{-1} \mathbf{1}^T M^{-1} & -M^{-1}\mathbf{1} S_M^{-1} \\ -S_M^{-1}\mathbf{1}^T M^{-1} & S_M^{-1} \end{bmatrix}, \]

where $M = K + \mu I$ and $S_M = -\mathbf{1}^T M^{-1} \mathbf{1} = -\mathbf{1}^T \rho$ is the Schur complement of $M$. The inverse of the positive definite matrix, $M$, can be computed efficiently from its Cholesky factorisation, via the SYMINV algorithm (Seaks, 1972), for example using the LAPACK routine DTRTRI (Anderson et al., 1999). Let $R = [r_{ij}]_{i,j=1}^{\ell}$ be the lower triangular Cholesky factor of the positive definite matrix $M$, such that $M = RR^T$. Furthermore, let $S = [s_{ij}]_{i,j=1}^{\ell} = R^{-1}$, where

\[ s_{ii} = \frac{1}{r_{ii}} \qquad \text{and} \qquad s_{ij} = -\frac{1}{r_{ii}} \sum_{k=j}^{i-1} r_{ik} s_{kj} \quad (i > j), \]

represent the (lower triangular) inverse of the Cholesky factor. The inverse of $M$ is then given by $M^{-1} = S^T S$. In the case of efficient leave-one-out cross-validation of least-squares support vector machines, we are principally concerned only with the diagonal elements of $C^{-1}$, given by

\[ \left[ M^{-1} \right]_{ii} = \sum_{j=i}^{\ell} s_{ji}^2 \quad \Longrightarrow \quad \left[ C^{-1} \right]_{ii} = \sum_{j=i}^{\ell} s_{ji}^2 + \frac{\rho_i^2}{S_M}, \quad \forall\, i \in \{1, 2, \ldots, \ell\}. \]
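The virtual leave-one-out computation is compact enough to sketch directly (not from the original paper; for clarity this version inverts $C$ explicitly, the $O(\ell^3)$ route of Section 1.2.1, rather than the Cholesky route just described):

```python
import numpy as np

def loo_residuals(K, y, mu):
    """Leave-one-out residuals r_i = y_i - yhat_i^(-i) = alpha_i / [C^{-1}]_ii,
    Equation (12), obtained from a single fit on the full training set."""
    ell = K.shape[0]
    C = np.block([[K + mu * np.eye(ell), np.ones((ell, 1))],
                  [np.ones((1, ell)),    np.zeros((1, 1))]])
    Cinv = np.linalg.inv(C)
    alpha = (Cinv @ np.append(y, 0.0))[:ell]   # model parameters [alpha; b]
    return alpha / np.diag(Cinv)[:ell]
```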

The computational complexity of the basic training algorithm is $O(\ell^3)$ operations, being dominated by the evaluation of the Cholesky factor. However, the computational complexity of the analytic leave-one-out cross-validation procedure, when performed as a by-product of the training algorithm, is only $O(\ell)$ operations. The computational expense of the leave-one-out cross-validation procedure therefore becomes increasingly negligible as the training set becomes larger.

1.3 Model Selection

The virtual leave-one-out cross-validation procedure described in the previous section provides the basis for a simple automated model selection strategy for the least-squares support vector machine. Perhaps the most basic model selection criterion is provided by the Predicted REsidual Sum of Squares (PRESS) criterion (Allen, 1974), which is simply the leave-one-out estimate of the sum-of-squares error,

\[ Q(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \left[ y_i - \hat{y}_i^{(-i)} \right]^2. \]

In the majority of practical applications of kernel learning methods, a minimum of the model selection criterion is found via a simple grid-search procedure. However, this is rarely necessary and often highly inefficient, as a grid search spends a large amount of time investigating hyper-parameter values outside the neighbourhood of the global optimum. A more efficient approach uses the Nelder-Mead simplex algorithm (Nelder and Mead, 1965), as implemented by the fminsearch function of the MATLAB optimisation toolbox. An alternative, easily implemented, approach uses conjugate gradient methods, with the required gradient information estimated by the method of finite differences, as implemented by the fminunc function from the MATLAB optimisation toolbox. In this study, however, we use scaled conjugate gradient descent (Williams, 1991), with the required gradient information evaluated analytically, as this is approximately twice as efficient.

1.3.1 PARTIAL DERIVATIVES OF THE PRESS MODEL SELECTION CRITERION

Let $\theta = \{\theta_1, \ldots, \theta_n\} = \{\mu, \eta_1, \ldots, \eta_d\}$ represent the vector of hyper-parameters for a least-squares support vector machine, where $\eta_1, \ldots, \eta_d$ represent the kernel parameters. The PRESS statistic (Allen, 1974) can be written as

\[ Q(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \left[ r_i^{(-i)} \right]^2, \qquad \text{where} \qquad r_i^{(-i)} = y_i - \hat{y}_i^{(-i)} = \frac{\alpha_i}{[C^{-1}]_{ii}}. \]

Using the chain rule, the partial derivative of the PRESS statistic with respect to an individual hyper-parameter, $\theta_j$, is given by

\[ \frac{\partial Q(\theta)}{\partial \theta_j} = \sum_{i=1}^{\ell} \frac{\partial Q(\theta)}{\partial r_i^{(-i)}} \frac{\partial r_i^{(-i)}}{\partial \theta_j}, \]

where

\[ \frac{\partial Q(\theta)}{\partial r_i^{(-i)}} = r_i^{(-i)} \qquad \text{and} \qquad \frac{\partial r_i^{(-i)}}{\partial \theta_j} = \frac{1}{[C^{-1}]_{ii}} \frac{\partial \alpha_i}{\partial \theta_j} - \frac{\alpha_i}{[C^{-1}]_{ii}^2} \frac{\partial [C^{-1}]_{ii}}{\partial \theta_j}, \]

such that

\[ \frac{\partial Q(\theta)}{\partial \theta_j} = \sum_{i=1}^{\ell} r_i^{(-i)} \left\{ \frac{1}{[C^{-1}]_{ii}} \frac{\partial \alpha_i}{\partial \theta_j} - \frac{\alpha_i}{[C^{-1}]_{ii}^2} \frac{\partial [C^{-1}]_{ii}}{\partial \theta_j} \right\}. \]
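The gradient-free route mentioned above is equally easy to sketch (again not from the paper; this reuses the hypothetical rbf_gram and loo_residuals helpers from the earlier sketches, with SciPy's Nelder-Mead standing in for MATLAB's fminsearch):

```python
import numpy as np
from scipy.optimize import minimize

def press(log2_theta, X, y):
    """PRESS statistic Q(theta) under a log2 parameterisation of (mu, eta)."""
    mu, eta = 2.0 ** log2_theta
    r = loo_residuals(rbf_gram(X, eta), y, mu)
    return 0.5 * np.sum(r ** 2)

# synthetic sample with labels in {-1, +1} (illustrative only)
X = np.random.randn(100, 3)
y = np.sign(X[:, 0] + 0.3 * np.random.randn(100))

res = minimize(press, x0=np.zeros(2), args=(X, y), method="Nelder-Mead")
mu_opt, eta_opt = 2.0 ** res.x
```

The analytic gradient used in this study is assembled next.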

We begin by deriving the partial derivatives of the model parameters, $[\alpha^T, b]^T$, with respect to the hyper-parameter $\theta_j$. The model parameters are given by the solution of a system of linear equations, such that $[\alpha^T, b]^T = C^{-1}[y^T, 0]^T$. Using the following identity for the partial derivatives of the inverse of a matrix,

\[ \frac{\partial C^{-1}}{\partial \theta_j} = -C^{-1} \frac{\partial C}{\partial \theta_j} C^{-1}, \qquad (13) \]

we obtain

\[ \frac{\partial [\alpha^T, b]^T}{\partial \theta_j} = -C^{-1} \frac{\partial C}{\partial \theta_j} C^{-1} \left[ y^T, 0 \right]^T = -C^{-1} \frac{\partial C}{\partial \theta_j} \left[ \alpha^T, b \right]^T. \]

Note the computational complexity of evaluating the partial derivatives of the model parameters is $O(\ell^2)$, as only two successive matrix-vector products are required. The partial derivatives of the diagonal elements of $C^{-1}$ can be found using the inverse matrix derivative identity (13). For a kernel parameter, $\partial C / \partial \eta_j$ will generally be fully dense, and so the computational complexity of evaluating the diagonal elements of $\partial C^{-1} / \partial \eta_j$ will be $O(\ell^3)$ operations. If, on the other hand, we consider the regularisation parameter, $\mu$, we have that

\[ \frac{\partial C}{\partial \mu} = \begin{bmatrix} I & 0 \\ 0^T & 0 \end{bmatrix}, \]

and so the computation of the partial derivatives of the model parameters with respect to the regularisation parameter is slightly simplified,

\[ \frac{\partial [\alpha^T, b]^T}{\partial \mu} = -C^{-1} \left[ \alpha^T, 0 \right]^T. \]

More importantly, as $\partial C / \partial \mu$ is diagonal, the diagonal elements of (13) can be evaluated with a computational complexity of only $O(\ell^2)$ operations. This suggests that it may be more efficient to adopt different strategies for optimising the regularisation parameter, $\mu$, and the vector of kernel parameters, $\eta$ (cf. Saadi et al., 2004). For a kernel parameter, $\eta_j$, the partial derivatives of $C$ with respect to $\eta_j$ are given by the partial derivatives of the kernel matrix, that is,

\[ \frac{\partial C}{\partial \eta_j} = \begin{bmatrix} \partial K / \partial \eta_j & 0 \\ 0^T & 0 \end{bmatrix}. \]

For the spherical radial basis function kernel used in this study, the partial derivative with respect to the kernel parameter is given by

\[ \frac{\partial \mathcal{K}(x, x')}{\partial \eta} = -\mathcal{K}(x, x') \, \|x - x'\|^2. \]

Finally, since the regularisation parameter, $\mu$, and the scale parameter of the radial basis function kernel are strictly positive quantities, in order to permit the use of an unconstrained optimisation procedure we adopt the parameterisation $\tilde{\theta}_j = \log_2 \theta_j$, such that

\[ \frac{\partial Q(\theta)}{\partial \tilde{\theta}_j} = \frac{\partial Q(\theta)}{\partial \theta_j} \frac{\partial \theta_j}{\partial \tilde{\theta}_j}, \qquad \text{where} \qquad \frac{\partial \theta_j}{\partial \tilde{\theta}_j} = \theta_j \log 2. \]
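Putting the pieces of this subsection together, a sketch of the analytic PRESS gradient (not from the paper; explicit inverses are used for clarity rather than speed, and the caller supplies dK_deta = -K * sq_dists for the spherical RBF kernel):

```python
import numpy as np

def press_gradient(K, dK_deta, y, mu, eta):
    """Gradient of the PRESS statistic w.r.t. log2(mu) and log2(eta)."""
    ell = K.shape[0]
    C = np.block([[K + mu * np.eye(ell), np.ones((ell, 1))],
                  [np.ones((1, ell)),    np.zeros((1, 1))]])
    Cinv = np.linalg.inv(C)
    params = Cinv @ np.append(y, 0.0)              # [alpha; b]
    alpha, dg = params[:ell], np.diag(Cinv)[:ell]
    r = alpha / dg                                 # LOO residuals, Eq. (12)

    grad = []
    for dC in (np.pad(np.eye(ell), ((0, 1), (0, 1))),   # dC/dmu
               np.pad(dK_deta,     ((0, 1), (0, 1)))):   # dC/deta
        dparams = -Cinv @ (dC @ params)                  # derivative of [alpha; b]
        ddg = np.diag(-Cinv @ dC @ Cinv)[:ell]           # identity (13), diagonal
        dr = dparams[:ell] / dg - alpha * ddg / dg ** 2
        grad.append(np.sum(r * dr))
    # chain rule for the log2 parameterisation: d(theta)/d(theta~) = theta * ln 2
    return np.array(grad) * np.array([mu, eta]) * np.log(2.0)
```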

1.3.2 AUTOMATIC RELEVANCE DETERMINATION

Automatic Relevance Determination (ARD) (e.g., Rasmussen and Williams, 2006), also known as feature scaling (Chapelle et al., 2002; Bo et al., 2006), aims to identify informative input features as a natural consequence of optimising the model selection criterion. This can be most easily achieved using an elliptical radial basis function kernel,

\[ \mathcal{K}(x, x') = \exp\left\{ -\sum_{i=1}^{d} \eta_i \left[ x_i - x_i' \right]^2 \right\}, \]

that incorporates individual scaling factors for each input dimension. The partial derivatives with respect to the kernel parameters are then given by

\[ \frac{\partial \mathcal{K}(x, x')}{\partial \eta_i} = -\mathcal{K}(x, x') \left[ x_i - x_i' \right]^2. \]

Generalisation performance is likely to be enhanced if irrelevant features are down-weighted. It is therefore hoped that minimising the model selection criterion will lead to very small values for the scaling factors associated with redundant input features, allowing them to be identified and pruned from the model.

2. Bayesian Regularisation in Model Selection

In order to overcome the observed over-fitting in model selection using leave-one-out cross-validation based methods, we propose to add a regularisation term (Tikhonov and Arsenin, 1977) to the model selection criterion, penalising solutions where the kernel parameters take on unduly large values. The regularised model selection criterion is then given by

\[ M(\theta) = \zeta Q(\theta) + \xi \Omega(\theta), \qquad (14) \]

where $\xi$ and $\zeta$ are additional regularisation parameters, $Q(\theta)$ is the model selection criterion, in this case the PRESS statistic, and $\Omega(\theta)$ is a regularisation term,

\[ Q(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \left[ y_i - \hat{y}_i^{(-i)} \right]^2 \qquad \text{and} \qquad \Omega(\theta) = \frac{1}{2} \sum_{i=1}^{d} \eta_i^2. \]

In this study we have left the regularisation parameter, $\mu$, unregularised. However, we have now introduced two further regularisation parameters, $\xi$ and $\zeta$, for which good values must also be found. This problem may be solved by taking a Bayesian approach, adopting an ignorance prior and integrating out the additional regularisation parameters analytically in the style of Buntine and Weigend (1991). Adapting the approach taken by Williams (1995), the regularised model selection criterion (14) can be interpreted, by taking the negative logarithm and neglecting additive constants, as the posterior density in the space of the hyper-parameters,

\[ P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta). \]

Here $P(\mathcal{D} \mid \theta)$ represents the likelihood with respect to the hyper-parameters and $P(\theta)$ represents our prior beliefs regarding the hyper-parameters, in this case that they should have a small magnitude, corresponding to a relatively simple model.

These quantities can be expressed as

\[ P(\mathcal{D} \mid \theta) = \frac{1}{Z_Q} \exp\{-\zeta Q(\theta)\} \qquad \text{and} \qquad P(\theta) = \frac{1}{Z_\Omega} \exp\{-\xi \Omega(\theta)\}, \]

where $Z_Q$ and $Z_\Omega$ are the appropriate normalising constants. Assuming the data represent an i.i.d. sample, the likelihood in this case is Gaussian,

\[ P(\mathcal{D} \mid \theta) = \prod_{i=1}^{\ell} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\left[ y_i - \hat{y}_i^{(-i)} \right]^2}{2\sigma^2} \right\}, \qquad \text{where} \quad \zeta = \frac{1}{\sigma^2} \quad \Longrightarrow \quad Z_Q = \left( \frac{2\pi}{\zeta} \right)^{\ell/2}. \]

Likewise, the prior is a Gaussian centred on the origin,

\[ P(\theta) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi/\xi}} \exp\left\{ -\frac{\xi}{2}\eta_i^2 \right\}, \qquad \text{such that} \qquad Z_\Omega = \left( \frac{2\pi}{\xi} \right)^{d/2}. \]

Minimising (14) is thus equivalent to maximising the posterior density with respect to the hyper-parameters. Note that the use of a prior over the hyper-parameters is in accordance with normal Bayesian practice and has been investigated in the case of Gaussian process classifiers by Williams and Barber (1998). The combination of frequentist and Bayesian approaches at the first and second levels of inference is, however, somewhat unusual. The marginal likelihood is dependent on the assumptions of the model, which may not be completely appropriate; cross-validation based procedures may therefore be more robust in the case of model mis-specification (Wahba, 1990). It seems reasonable for the model to be less sensitive to assumptions at the second level of inference than at the first, and so the proposed approach represents a pragmatic combination of techniques.

2.1 Elimination of Second Level Regularisation Parameters ξ and ζ

Under the evidence framework proposed by MacKay (1992a,b,c), the hyper-parameters $\xi$ and $\zeta$ are determined by maximising the marginal likelihood, also known as the Bayesian evidence for the model. In this study, however, we opt to integrate out the hyper-parameters analytically, extending the work of Buntine and Weigend (1991) and Williams (1995) to consider Bayesian regularisation at the second level of inference, namely the selection of good values for the hyper-parameters. We begin with the prior over the hyper-parameters, which depends on $\xi$,

\[ P(\theta \mid \xi) = \frac{1}{Z_\Omega(\xi)} \exp\{-\xi \Omega\}. \]

The regularisation parameter $\xi$ may then be integrated out analytically using a suitable prior, $P(\xi)$,

\[ P(\theta) = \int P(\theta \mid \xi) P(\xi) \, d\xi. \]

The improper Jeffreys prior, $P(\xi) \propto 1/\xi$, is an appropriate ignorance prior in this case, as $\xi$ is a scale parameter; noting that $\xi$ is strictly positive,

\[ p(\theta) = \frac{1}{(2\pi)^{d/2}} \int_0^\infty \xi^{d/2 - 1} \exp\{-\xi \Omega\} \, d\xi. \]

Using the Gamma integral, $\int_0^\infty x^{\nu-1} e^{-\mu x} \, dx = \Gamma(\nu)/\mu^\nu$ (Gradshteyn and Ryzhik, 1994, equation 3.384), we obtain

\[ p(\theta) = \frac{\Gamma(d/2)}{(2\pi)^{d/2}\, \Omega^{d/2}} \quad \Longrightarrow \quad -\log p(\theta) \propto \frac{d}{2} \log \Omega. \]

Finally, adopting a similar procedure to eliminate $\zeta$, we obtain a revised model selection criterion with Bayesian regularisation,

\[ L = \frac{\ell}{2} \log Q(\theta) + \frac{d}{2} \log \Omega(\theta), \qquad (15) \]

in which the regularisation parameters have been eliminated. As before, this criterion can be optimised via standard methods, such as the Nelder-Mead simplex algorithm (Nelder and Mead, 1965) or scaled conjugate gradient descent (Williams, 1991). The partial derivatives of the proposed Bayesian model selection criterion are given by

\[ \frac{\partial L}{\partial \theta} = \frac{\ell}{2 Q(\theta)} \frac{\partial Q(\theta)}{\partial \theta} + \frac{d}{2 \Omega(\theta)} \frac{\partial \Omega(\theta)}{\partial \theta}, \qquad \text{where} \qquad \frac{\partial \Omega(\theta)}{\partial \eta_i} = \eta_i. \]

The additional computational expense involved in Bayesian regularisation of the model selection criterion is only $O(d)$ operations, which is extremely small in comparison with the $O(\ell^3)$ operations involved in obtaining the leave-one-out error (including the cost of training the model on the entire data set). Per iteration of the model selection process, the cost of the Bayesian regularisation is therefore minimal. There seems little reason to suppose that the regularisation will have an adverse effect on convergence, and this seems to be the case in practice.
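A minimal sketch of the criterion (15) (not from the paper; ard_gram is a hypothetical elliptical-RBF analogue of the earlier rbf_gram helper, and loo_residuals is the sketch from Section 1.2):

```python
import numpy as np

def ard_gram(X, eta):
    """Elliptical RBF Gram matrix with one scale factor per input feature."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(eta * diffs ** 2, axis=2))

def bayes_criterion(log2_theta, X, y):
    """L = (ell/2) log Q + (d/2) log Omega, Equation (15); the regularisation
    parameter mu itself is left unregularised, as in the text."""
    mu, eta = 2.0 ** log2_theta[0], 2.0 ** log2_theta[1:]
    r = loo_residuals(ard_gram(X, eta), y, mu)
    Q = 0.5 * np.sum(r ** 2)          # PRESS statistic
    Omega = 0.5 * np.sum(eta ** 2)    # Gaussian prior term
    return 0.5 * len(y) * np.log(Q) + 0.5 * eta.size * np.log(Omega)
```

Minimising this function over log2_theta, for example with the Nelder-Mead sketch shown earlier, implements the proposed model selection procedure.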

2.2 Relationship with the Evidence Framework

Under the evidence framework of MacKay (1992a,b,c), the regularisation parameters, $\xi$ and $\zeta$, are selected so as to maximise the marginal likelihood, also known as the Bayesian evidence, for the model. The log-evidence is given by

\[ \log P(\mathcal{D}) = -\xi \Omega(\theta) - \zeta Q(\theta) - \frac{1}{2} \log |A| + \frac{d}{2} \log \xi + \frac{\ell}{2} \log \zeta - \frac{\ell}{2} \log 2\pi, \]

where $A$ is the Hessian of the regularised model selection criterion (14) with respect to the hyper-parameters, $\theta$. Setting the partial derivatives of the log-evidence with respect to the regularisation parameters, $\xi$ and $\zeta$, equal to zero, we obtain the familiar update formulae,

\[ \xi^{\text{new}} = \frac{\gamma}{2\Omega(\theta)} \qquad \text{and} \qquad \zeta^{\text{new}} = \frac{\ell - \gamma}{2 Q(\theta)}, \]

where $\gamma$ is the number of well-determined hyper-parameters, that is, the hyper-parameters for which the optimal value is primarily determined by the log-likelihood term, $Q(\theta)$, rather than by the regulariser, $\Omega(\theta)$. In the case of the L2 regularisation term, corresponding to a Gaussian prior, the number of well-determined hyper-parameters is given by

\[ \gamma = \sum_{j=1}^{n} \frac{\lambda_j}{\lambda_j + \xi}, \]

where $\lambda_1, \ldots, \lambda_n$ represent the eigenvalues of the Hessian of the unregularised model selection criterion, $Q(\theta)$, with respect to the kernel parameters. Comparing the partial derivatives of the regularised model selection criterion (14) with those of the Bayesian criterion (15) reveals that the Bayesian regularisation scheme is equivalent to optimising the regularised model selection criterion (14) assuming that the regularisation parameters, $\xi$ and $\zeta$, are continuously updated according to the update rules

\[ \xi^{\text{eff}} = \frac{d}{2\Omega(\theta)} \qquad \text{and} \qquad \zeta^{\text{eff}} = \frac{\ell}{2Q(\theta)}. \]

This exactly corresponds to the "cheap and cheerful" approximation of the evidence framework suggested by MacKay (1994), which assumes that all of the hyper-parameters are well-determined and that the number of hyper-parameters is small in relation to the number of training patterns. Since $\gamma \leq d$, it seems self-evident that the proposed Bayesian regularisation scheme will be prone to a degree of under-fitting, especially in the case of a feature scaling kernel with many redundant features. The theoretical and practical pros and cons of the integrate-out approach and the evidence framework are discussed in some detail by MacKay (1994) and Bishop (1995) and references therein. However, the integrate-out approach does not require the evaluation of the Hessian matrix of the original selection criterion, $Q(\theta)$, which is likely to prove computationally prohibitive.

3. Results

In this section, we present experimental results demonstrating the benefits of the proposed model selection strategy incorporating Bayesian regularisation to overcome the inherent high variance of leave-one-out cross-validation based selection criteria. Table 2 shows a comparison of the error rates of least-squares support vector machines, using model selection procedures with and without Bayesian regularisation (LS-SVM-BR and LS-SVM respectively), over the suite of thirteen public domain benchmark data sets used in the study by Mika et al. (2000); the data sets are summarised in Table 1. Results obtained using a Gaussian process classifier based on the expectation propagation method (EP-GPC) (Rasmussen and Williams, 2006) are also provided for comparison. The same set of 100 random partitions of the data (20 in the case of the larger image and splice benchmarks) to form training and test sets used in that study are also used here. In each case, model selection is performed independently for each realisation of the data set, such that the standard errors reflect the variability of both the training algorithm and the model selection procedure with changes in the sampling of the data. Both conventional spherical and elliptical radial basis function kernels are used for all kernel learning methods, so that the effectiveness of each algorithm for automatic relevance determination can be assessed. The use of multiple training/test partitions allows an estimate of the statistical significance of differences in performance between algorithms to be computed. Let $\hat{x}$ and $\hat{y}$ represent the means of the performance statistic for a pair of competing algorithms, and $e_x$ and $e_y$ the corresponding standard errors; then the $z$ statistic is computed as

\[ z = \frac{\hat{y} - \hat{x}}{\sqrt{e_x^2 + e_y^2}}. \]

The $z$-score can then be converted to a significance level via the normal cumulative distribution function, such that $z = 1.64$ corresponds to a 95% significance level. All statements of statistical significance in the remainder of this section refer to a 95% level of significance.
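For concreteness, a small sketch of this comparison (not from the paper; the error rates and standard errors below are placeholders, not values from Table 2):

```python
import numpy as np
from scipy.stats import norm

def z_score(mean_x, se_x, mean_y, se_y):
    """z statistic for the difference in mean test error of two algorithms."""
    return (mean_y - mean_x) / np.sqrt(se_x ** 2 + se_y ** 2)

z = z_score(10.5, 0.15, 11.0, 0.20)   # placeholder means and standard errors
print(z, norm.cdf(z))                 # z = 1.64 corresponds to the 95% level
```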

Data Set      | Training Patterns | Testing Patterns | Replications | Input Features
Banana        |  400 | 4900 | 100 |  2
Breast cancer |  200 |   77 | 100 |  9
Diabetes      |  468 |  300 | 100 |  8
Flare solar   |  666 |  400 | 100 |  9
German        |  700 |  300 | 100 | 20
Heart         |  170 |  100 | 100 | 13
Image         | 1300 | 1010 |  20 | 18
Ringnorm      |  400 | 7000 | 100 | 20
Splice        | 1000 | 2175 |  20 | 60
Thyroid       |  140 |   75 | 100 |  5
Titanic       |  150 | 2051 | 100 |  3
Twonorm       |  400 | 7000 | 100 | 20
Waveform      |  400 | 4600 | 100 | 21

Table 1: Details of the data sets used in the empirical comparison.

3.1 Performance of Models Based on the Spherical RBF Kernel

The results shown in the first three data columns of Table 2 show the performance of LS-SVM, LS-SVM-BR and EP-GPC models based on the spherical Gaussian kernel. The performance of LS-SVM models with and without Bayesian regularisation is very similar, with neither model proving significantly better than the other on any of the data sets. This seems reasonable given that only two hyper-parameters are optimised during model selection, so there is little scope for over-fitting the PRESS model selection criterion and the regularisation term will have little effect. The LS-SVM model with Bayesian regularisation is significantly out-performed by the Gaussian process classifier on one benchmark (banana), but performs significantly better on a further four (ringnorm, splice, twonorm, waveform). Demšar (2006) recommends the use of the Wilcoxon signed rank test for assessing the statistical significance of differences in performance over multiple data sets. According to this test, neither the LS-SVM-BR nor the EP-GPC is statistically superior at the 95% level of significance.

3.2 Performance of Models Based on the Elliptical RBF Kernel

The performances of LS-SVM, LS-SVM-BR and EP-GPC models based on the elliptical Gaussian kernel, which includes a separate scale parameter for each input feature, are shown in the last three columns of Table 2. Before evaluating the effects of Bayesian regularisation in model selection, it is worth noting that the use of elliptical RBF kernels does not generally improve performance. For the LS-SVM, the elliptical kernel produces significantly better results on only two benchmarks (image and splice) and significantly worse results on a further eight (banana, breast cancer, diabetes, german, heart, ringnorm, twonorm, waveform), with the degradation in performance being very large in some cases (e.g., heart). This seems likely to be a result of the additional degrees of freedom involved in the model selection process, allowing over-fitting of the PRESS model selection criterion as a result of its inherently high variance. Note that fully Bayesian approaches, such as the Gaussian process classifier, are also unable to reliably select kernel parameters for the elliptical RBF kernel. The elliptical kernel is significantly better on only three benchmarks (flare solar, image and splice), while being significantly worse on six (breast cancer, diabetes, heart, ringnorm, twonorm and waveform).

[Table 2 reports, for each of the thirteen data sets, the mean test error rate ± standard error of the LS-SVM, LS-SVM-BR and EP-GPC under both the radial basis function and the automatic relevance determination kernels; the individual numerical entries are not recoverable from this copy.]

Table 2: Error rates of the least-squares support vector machine, with and without Bayesian regularisation of the model selection criterion, in this case the PRESS statistic (Allen, 1974), and Gaussian process classifiers over thirteen benchmark data sets (Rätsch et al., 2001), using both standard radial basis function and automatic relevance determination kernels. The results for the EP-GPC were obtained using the MATLAB software accompanying the book by Rasmussen and Williams (2006). The results for each method are presented in the form of the mean error rate over test data for 100 realisations of each data set (20 in the case of the image and splice data sets), along with the associated standard error. The best results are shown in boldface and the second best in italics (without implication of statistical significance).

In the case of the elliptical RBF kernel, the use of Bayesian regularisation in model selection has a dramatic effect on the performance of LS-SVM models, with the LS-SVM-BR model proving significantly better than the conventional LS-SVM on nine of the thirteen benchmarks (breast cancer, diabetes, flare solar, german, heart, ringnorm, splice, twonorm and waveform) without being significantly worse on any of the remaining four data sets. This demonstrates that over-fitting in model selection, due to the larger number of kernel parameters, is likely to be the significant factor causing the relatively poor performance of models with the elliptical RBF kernel. Again, the Gaussian process classifier is significantly better than the LS-SVM with Bayesian regularisation on the banana and twonorm data sets, but is significantly worse on four of the remaining eleven (diabetes, heart, ringnorm and splice). Again, according to the Wilcoxon signed rank test, neither the LS-SVM-BR nor the EP-GPC is statistically superior at the 95% level of significance. However, the magnitude of the difference in performance between the LS-SVM-BR and EP-GPC approaches tends to be greatest when the LS-SVM-BR out-performs the EP-GPC, most notably on the heart, splice and ringnorm data sets. This provides some support for the observation of Wahba (1990) that cross-validation based model selection procedures should be more robust against model mis-specification (see also Rasmussen and Williams, 2006).

4. Discussion

The experimental evaluation presented in the previous section demonstrates that over-fitting can occur in model selection, due to the variance of the model selection criterion. In many cases the minimum of the selection criterion using the elliptical RBF kernel is lower than that achievable using the spherical RBF kernel; however, this results in a degradation in generalisation performance. Using the PRESS statistic, the over-fitting is likely to be most severe in cases with a small number of training patterns, as the variance of the leave-one-out estimator decreases as the sample size becomes larger. Using the standard LS-SVM, the elliptical RBF kernel only out-performs the spherical RBF kernel on two of the thirteen data sets, image and splice, which also happen to be the two largest data sets in terms of the number of training patterns. The greatest degradation in performance is obtained on the heart benchmark, the third smallest. The heart data set also has a relatively large number of input features (13). A large number of input features introduces many additional degrees of freedom to optimise in the model selection criterion, and so will generally tend to encourage over-fitting. However, there may be a compact subset of highly relevant features, with the remainder being almost entirely uninformative. In this case the advantage of suppressing the noisy inputs is so great that it overcomes the predisposition towards over-fitting, and so results in improved generalisation (as observed in the case of the image and splice benchmarks). Whether the use of an elliptical RBF kernel will improve or degrade generalisation largely depends on such characteristics of the data, which are not known a priori, and so it seems prudent to consider a range of kernel functions and select the best via cross-validation.

The experimental results indicate that Bayesian regularisation of the hyper-parameters is generally beneficial, without at this stage providing a complete solution to the problem of over-fitting the model selection criterion. The effectiveness of the Bayesian regularisation scheme is to a large extent dependent on the appropriateness of the prior imposed on the hyper-parameters. There is no reason to assume that the simple Gaussian prior used here is in any sense optimal, and this is an issue where further research is necessary (see Section 4.2).

The comparison of the integrate-out approach and the evidence framework highlights a deficiency of the simple Gaussian prior: it suggests that the integrate-out approach is likely to result in mild over-regularisation of the hyper-parameters in the presence of a large number of irrelevant features, as the corresponding hyper-parameters will be ill-determined.

The LS-SVM with Bayesian regularisation of the hyper-parameters does not significantly out-perform the expectation propagation based Gaussian process classifier over the suite of thirteen benchmark data sets considered. This is not wholly surprising, as the EP-GPC is at least very close to the state of the art; indeed it is interesting that the EP-GPC does not out-perform such a comparatively simple model. The EP-GPC uses the marginal likelihood as the model selection criterion, which gives the probability of the data given the assumptions of the model (Rasmussen and Williams, 2006). Cross-validation based approaches, on the other hand, provide an estimate of generalisation performance that does not depend on the model assumptions, and so may be more robust against model mis-specification (Wahba, 1990). The no-free-lunch theorems suggest that, at least in terms of generalisation performance, there is a lack of inherent superiority of one classification method over another in the absence of a priori assumptions regarding the data. This implies that if one classifier performs better than another on a particular data set, it is because the inductive biases of that classifier provide a better fit to the particular pattern recognition task, rather than because of its superiority in a more general sense. A model with strong inductive biases is likely to benefit when these biases are well suited to the data, but will perform badly when they are not. While a model with weak inductive biases will be more robust, it is less likely to perform conspicuously well on any given data set. This means there are complementary advantages and disadvantages to both approaches.

4.1 Relationship to Existing Work

The use of a prior over the hyper-parameters is in accordance with normal Bayesian practice and has been used in Gaussian process classification (Williams and Barber, 1998). The problem of over-fitting in model selection has also been addressed by Qi et al. (2004), in the case of selecting informative features for a logistic regression model using an Automatic Relevance Determination (ARD) prior (cf. Tipping, 2001). In this case, the Expectation Propagation method (Minka, 2001) is used to obtain a deterministic approximation of the posterior, and also (as a by-product) a leave-one-out performance estimate. The latter is then used to implement a form of early stopping (e.g., Sarle, 1995) to prevent the over-fitting that results from direct optimisation of the marginal likelihood until convergence. It seems likely that this approach would also be beneficial in the case of tuning the hyper-parameters of the covariance function of a Gaussian process model, using either the leave-one-out estimate arising from the EP approximation, or an approximate leave-one-out estimate from the Laplace approximation (cf. Cawley and Talbot, 2007).

4.2 Directions for Further Research

In this paper, the regularisation term corresponds to a simple spherical Gaussian prior over the kernel parameters. One direction of research would be to investigate alternative regularisation terms.

The first possibility would be to use a regularisation term corresponding to a separable Laplace prior,

\[ \Omega(\theta) = \sum_{i=1}^{d} |\eta_i| \quad \Longrightarrow \quad p(\theta) = \prod_{i=1}^{d} \frac{\xi}{2} \exp\{-\xi |\eta_i|\}. \]

Setting the derivative of the regularised model selection criterion (14) to zero, we obtain

\[ \left| \frac{\partial Q}{\partial \eta_i} \right| = \frac{\xi}{\zeta} \;\; \text{if} \;\; \eta_i > 0 \qquad \text{and} \qquad \left| \frac{\partial Q}{\partial \eta_i} \right| < \frac{\xi}{\zeta} \;\; \text{if} \;\; \eta_i = 0, \]

which implies that if the sensitivity of the leave-one-out error, $Q(\theta)$, falls below $\xi/\zeta$, the value of the hyper-parameter, $\eta_i$, will be set exactly to zero, effectively pruning that input from the model. In this way explicit feature selection may be obtained as a consequence of (regularised) model selection. The model selection criterion with Bayesian regularisation then becomes

\[ L = \frac{\ell}{2} \log Q(\theta) + N \log \Omega(\theta), \]

where $N$ is the number of input features with non-zero scale factors. This potentially overcomes the propensity towards under-fitting the data that might be expected when using the Gaussian prior, as the pruning action of the Laplace prior means that the values of all remaining hyper-parameters are well-determined by the data. In the case of the Laplace prior, the integrate-out approach is exactly equivalent to continuous updates of the hyper-parameters according to the update formulae under the evidence framework (Williams, 1995).

Alternatively, defining a prior over the function implemented by a model seems more in accordance with Bayesian ideas than choosing a prior over the parameters of the model. The use of a prior over the hyper-parameters based on the smoothness of the resulting model therefore also provides a potential direction for future research. In this case, the regularisation term might take the form

\[ \Omega(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{d} \left[ \frac{\partial^2 \hat{y}_i}{\partial x_{ij}^2} \right]^2, \]

directly penalising models with excess curvature. This regularisation term corresponds to curvature-driven smoothing in multi-layer perceptron networks (Bishop, 1993), except that the model output $\hat{y}$ is viewed as a function of the hyper-parameters, rather than of the weights. A penalty term based on the first-order partial derivatives is also feasible (cf. Drucker and Le Cun, 1992).
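A minimal sketch of the Laplace-prior criterion proposed above (not from the paper; Q is assumed to have been computed as in the earlier sketches, and the tolerance used to count active features is an assumption):

```python
import numpy as np

def laplace_criterion(Q, eta, ell, tol=1e-12):
    """L = (ell/2) log Q + N log Omega, with Omega = sum(|eta_i|) and N the
    number of non-zero scale factors. A practical optimiser would need to
    handle the kink at eta_i = 0, e.g., by sub-gradient or coordinate-wise
    updates."""
    N = int(np.count_nonzero(np.abs(eta) > tol))
    Omega = np.sum(np.abs(eta))
    return 0.5 * ell * np.log(Q) + N * np.log(Omega)
```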

5. Conclusion

Leave-one-out cross-validation has proved to be an effective means of model selection for a variety of kernel learning methods, provided the number of hyper-parameters to be tuned is relatively small. The use of kernel functions with large numbers of parameters often provides sufficient degrees of freedom to over-fit the model selection criterion, leading to poor generalisation. In this paper, we have proposed the use of regularisation at the second level of inference, that is, model selection. The use of Bayesian regularisation is shown to be effective in reducing over-fitting, by ensuring the values of the kernel parameters remain small, giving a smoother kernel and hence a less complex classifier. This is achieved with only a minimal computational expense, as the additional regularisation parameters are integrated out analytically using a reference prior. While a fully Bayesian model selection strategy is conceptually more elegant, it may also be less robust to model mis-specification. The use of leave-one-out cross-validation in model selection and Bayesian methods at the next level seems to be a pragmatic compromise. The effectiveness of this approach is clearly demonstrated in the experimental evaluation where, on average, the LS-SVM with Bayesian regularisation out-performs the expectation-propagation based Gaussian process classifier, using both spherical and elliptical RBF kernels.

Acknowledgments

The authors would like to thank the organisers of the WCCI model selection workshop and performance prediction challenge and the NIPS multi-level inference workshop and model selection game, and fellow participants for the stimulating discussions that have helped to shape this work. We also thank Carl Rasmussen and Chris Williams for their advice regarding the EP-GPC, and the anonymous reviewers for their detailed and constructive comments that have significantly improved this paper.

References

D. M. Allen. The relationship between variable selection and prediction. Technometrics, 16:125-127, 1974.

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM Press, third edition, 1999.

C. M. Bishop. Curvature-driven smoothing: a learning algorithm for feedforward networks. IEEE Transactions on Neural Networks, 4(5), September 1993.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

L. Bo, L. Wang, and L. Jiao. Feature scaling for kernel Fisher discriminant analysis using leave-one-out cross-validation. Neural Computation, 18:961-978, April 2006.

W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5, 1991.

G. C. Cawley. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-2006), Vancouver, BC, Canada, July 2006.

G. C. Cawley and N. L. C. Talbot. Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36(11), November 2003.

G. C. Cawley and N. L. C. Talbot. Fast leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10), December 2004.

G. C. Cawley and N. L. C. Talbot. Approximate leave-one-out cross-validation for kernel logistic regression. Machine Learning (submitted), 2007.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131-159, 2002.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.

H. Drucker and Y. Le Cun. Improving generalization performance using double back-propagation. IEEE Transactions on Neural Networks, 3(6):991-997, 1992.

S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, third edition, 1996.

I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series and Products. Academic Press, fifth edition, 1994.

I. Guyon, A. R. Saffari Azar Alamdari, G. Dror, and J. Buhmann. Performance prediction challenge. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-2006), Vancouver, BC, Canada, July 2006.

T. Joachims. Learning to Classify Text using Support Vector Machines - Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.

S. S. Keerthi, K. B. Duan, S. K. Shevade, and A. N. Poo. A fast dual algorithm for kernel logistic regression. Machine Learning, 61(1-3):151-165, November 2005.

G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Conference on Artificial Intelligence (IJCAI), pages 1137-1143, San Mateo, CA, 1995. Morgan Kaufmann.

P. A. Lachenbruch and M. R. Mickey. Estimation of error rates in discriminant analysis. Technometrics, 10(1):1-11, February 1968.

A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition (in Russian). Techicheskaya Kibernetica, 3, 1969.

D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992a.

D. J. C. MacKay. A practical Bayesian framework for backprop networks. Neural Computation, 4(3), 1992b.

D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5), 1992c.

D. J. C. MacKay. Hyperparameters: optimise or integrate out? In G. Heidbreder, editor, Maximum Entropy and Bayesian Methods. Kluwer, 1994.

J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, A, 209:415-446, 1909.

C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing, volume IX, pages 41-48. IEEE Press, New York, 1999.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Müller. Invariant feature extraction and classification in feature spaces. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.

T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence, pages 362-369. Morgan Kaufmann, 2001.

J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1965.

Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, July 2004.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.

K. Saadi, N. L. C. Talbot, and G. C. Cawley. Optimally regularised kernel Fisher discriminant analysis. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR-2004), volume 2, Cambridge, United Kingdom, August 2004.

W. S. Sarle. Stopped training and other remedies for overfitting. In Proceedings of the 27th Symposium on the Interface of Computer Science and Statistics, Pittsburgh, PA, USA, June 1995.

B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.

T. Seaks. SYMINV: an algorithm for the inversion of a positive definite matrix by the Cholesky decomposition. Econometrica, 40(5):961-962, September 1972.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, B 36(1):111-147, 1974.

S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103-1118, May 2001.

J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3), June 1999.

J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. John Wiley, New York, 1977.

M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, June 2001.

G. Wahba. Spline Models for Observational Data. SIAM Press, Philadelphia, PA, 1990.

C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, December 1998.

P. M. Williams. A Marquardt algorithm for choosing the step size in backpropagation learning with conjugate gradients. Technical Report CSRP-229, University of Sussex, February 1991.

P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117-143, 1995.


More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Conjugacy and the Exponential Family

Conjugacy and the Exponential Family CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the

More information

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION. Machine Learning

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION. Machine Learning CHAPTER 3 GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION Machne Learnng Copyrght c 205. Tom M. Mtche. A rghts reserved. *DRAFT OF September 23, 207* *PLEASE DO NOT DISTRIBUTE

More information

Short-Term Load Forecasting for Electric Power Systems Using the PSO-SVR and FCM Clustering Techniques

Short-Term Load Forecasting for Electric Power Systems Using the PSO-SVR and FCM Clustering Techniques Energes 20, 4, 73-84; do:0.3390/en40073 Artce OPEN ACCESS energes ISSN 996-073 www.mdp.com/journa/energes Short-Term Load Forecastng for Eectrc Power Systems Usng the PSO-SVR and FCM Custerng Technques

More information

Lower Bounding Procedures for the Single Allocation Hub Location Problem

Lower Bounding Procedures for the Single Allocation Hub Location Problem Lower Boundng Procedures for the Snge Aocaton Hub Locaton Probem Borzou Rostam 1,2 Chrstoph Buchhem 1,4 Fautät für Mathemat, TU Dortmund, Germany J. Faban Meer 1,3 Uwe Causen 1 Insttute of Transport Logstcs,

More information

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Delay tomography for large scale networks

Delay tomography for large scale networks Deay tomography for arge scae networks MENG-FU SHIH ALFRED O. HERO III Communcatons and Sgna Processng Laboratory Eectrca Engneerng and Computer Scence Department Unversty of Mchgan, 30 Bea. Ave., Ann

More information

WAVELET-BASED IMAGE COMPRESSION USING SUPPORT VECTOR MACHINE LEARNING AND ENCODING TECHNIQUES

WAVELET-BASED IMAGE COMPRESSION USING SUPPORT VECTOR MACHINE LEARNING AND ENCODING TECHNIQUES WAVELE-BASED IMAGE COMPRESSION USING SUPPOR VECOR MACHINE LEARNING AND ENCODING ECHNIQUES Rakb Ahmed Gppsand Schoo of Computng and Informaton echnoogy Monash Unversty, Gppsand Campus Austraa. Rakb.Ahmed@nfotech.monash.edu.au

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2 Note 2 Lng fong L Contents Ken Gordon Equaton. Probabty nterpretaton......................................2 Soutons to Ken-Gordon Equaton............................... 2 2 Drac Equaton 3 2. Probabty nterpretaton.....................................

More information

Uncertainty Specification and Propagation for Loss Estimation Using FOSM Methods

Uncertainty Specification and Propagation for Loss Estimation Using FOSM Methods Uncertanty Specfcaton and Propagaton for Loss Estmaton Usng FOSM Methods J.W. Baer and C.A. Corne Dept. of Cv and Envronmenta Engneerng, Stanford Unversty, Stanford, CA 94305-400 Keywords: Sesmc, oss estmaton,

More information

3. Stress-strain relationships of a composite layer

3. Stress-strain relationships of a composite layer OM PO I O U P U N I V I Y O F W N ompostes ourse 8-9 Unversty of wente ng. &ech... tress-stran reatonshps of a composte ayer - Laurent Warnet & emo Aerman.. tress-stran reatonshps of a composte ayer Introducton

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

9 Adaptive Soft K-Nearest-Neighbour Classifiers with Large Margin

9 Adaptive Soft K-Nearest-Neighbour Classifiers with Large Margin 9 Adaptve Soft -Nearest-Neghbour Cassfers wth Large argn Abstract- A nove cassfer s ntroduced to overcome the mtatons of the -NN cassfcaton systems. It estmates the posteror cass probabtes usng a oca Parzen

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA RESEARCH ARTICLE MOELING FIXE OS BETTING FOR FUTURE EVENT PREICTION Weyun Chen eartment of Educatona Informaton Technoogy, Facuty of Educaton, East Chna Norma Unversty, Shangha, CHINA {weyun.chen@qq.com}

More information

Optimum Selection Combining for M-QAM on Fading Channels

Optimum Selection Combining for M-QAM on Fading Channels Optmum Seecton Combnng for M-QAM on Fadng Channes M. Surendra Raju, Ramesh Annavajjaa and A. Chockangam Insca Semconductors Inda Pvt. Ltd, Bangaore-56000, Inda Department of ECE, Unversty of Caforna, San

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION European Journa of Mathematcs and Computer Scence Vo. No. 1, 2017 ON AUTOMATC CONTNUTY OF DERVATONS FOR BANACH ALGEBRAS WTH NVOLUTON Mohamed BELAM & Youssef T DL MATC Laboratory Hassan Unversty MORO CCO

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y) Secton 1.5 Correlaton In the prevous sectons, we looked at regresson and the value r was a measurement of how much of the varaton n y can be attrbuted to the lnear relatonshp between y and x. In ths secton,

More information

Characterizing Probability-based Uniform Sampling for Surrogate Modeling

Characterizing Probability-based Uniform Sampling for Surrogate Modeling th Word Congress on Structura and Mutdscpnary Optmzaton May 9-4, 3, Orando, Forda, USA Characterzng Probabty-based Unform Sampng for Surrogate Modeng Junqang Zhang, Souma Chowdhury, Ache Messac 3 Syracuse

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Unified Subspace Analysis for Face Recognition

Unified Subspace Analysis for Face Recognition Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Non-linear Canonical Correlation Analysis Using a RBF Network

Non-linear Canonical Correlation Analysis Using a RBF Network ESANN' proceedngs - European Smposum on Artfcal Neural Networks Bruges (Belgum), 4-6 Aprl, d-sde publ., ISBN -97--, pp. 57-5 Non-lnear Canoncal Correlaton Analss Usng a RBF Network Sukhbnder Kumar, Elane

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors Stat60: Bayesan Modelng and Inference Lecture Date: February, 00 Reference Prors Lecturer: Mchael I. Jordan Scrbe: Steven Troxler and Wayne Lee In ths lecture, we assume that θ R; n hgher-dmensons, reference

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Cyclic Codes BCH Codes

Cyclic Codes BCH Codes Cycc Codes BCH Codes Gaos Feds GF m A Gaos fed of m eements can be obtaned usng the symbos 0,, á, and the eements beng 0,, á, á, á 3 m,... so that fed F* s cosed under mutpcaton wth m eements. The operator

More information

Sparse Gaussian Processes Using Backward Elimination

Sparse Gaussian Processes Using Backward Elimination Sparse Gaussan Processes Usng Backward Elmnaton Lefeng Bo, Lng Wang, and Lcheng Jao Insttute of Intellgent Informaton Processng and Natonal Key Laboratory for Radar Sgnal Processng, Xdan Unversty, X an

More information

Maxent Models & Deep Learning

Maxent Models & Deep Learning Maxent Models & Deep Learnng 1. Last bts of maxent (sequence) models 1.MEMMs vs. CRFs 2.Smoothng/regularzaton n maxent models 2. Deep Learnng 1. What s t? Why s t good? (Part 1) 2. From logstc regresson

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

IV. Performance Optimization

IV. Performance Optimization IV. Performance Optmzaton A. Steepest descent algorthm defnton how to set up bounds on learnng rate mnmzaton n a lne (varyng learnng rate) momentum learnng examples B. Newton s method defnton Gauss-Newton

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Predicting Model of Traffic Volume Based on Grey-Markov

Predicting Model of Traffic Volume Based on Grey-Markov Vo. No. Modern Apped Scence Predctng Mode of Traffc Voume Based on Grey-Marov Ynpeng Zhang Zhengzhou Muncpa Engneerng Desgn & Research Insttute Zhengzhou 5005 Chna Abstract Grey-marov forecastng mode of

More information

NONLINEAR SYSTEM IDENTIFICATION BASE ON FW-LSSVM

NONLINEAR SYSTEM IDENTIFICATION BASE ON FW-LSSVM Journa of heoretca and Apped Informaton echnoogy th February 3. Vo. 48 No. 5-3 JAI & LLS. A rghts reserved. ISSN: 99-8645 www.jatt.org E-ISSN: 87-395 NONLINEAR SYSEM IDENIFICAION BASE ON FW-LSSVM, XIANFANG

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry Quantum Runge-Lenz ector and the Hydrogen Atom, the hdden SO(4) symmetry Pasca Szrftgser and Edgardo S. Cheb-Terrab () Laboratore PhLAM, UMR CNRS 85, Unversté Le, F-59655, France () Mapesoft Let's consder

More information

MODEL TUNING WITH THE USE OF HEURISTIC-FREE GMDH (GROUP METHOD OF DATA HANDLING) NETWORKS

MODEL TUNING WITH THE USE OF HEURISTIC-FREE GMDH (GROUP METHOD OF DATA HANDLING) NETWORKS MODEL TUNING WITH THE USE OF HEURISTIC-FREE (GROUP METHOD OF DATA HANDLING) NETWORKS M.C. Schrver (), E.J.H. Kerchoffs (), P.J. Water (), K.D. Saman () () Rswaterstaat Drecte Zeeand () Deft Unversty of

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Analysis of CMPP Approach in Modeling Broadband Traffic

Analysis of CMPP Approach in Modeling Broadband Traffic Anayss of Approach n Modeng Broadband Traffc R.G. Garroppo, S. Gordano, S. Lucett, and M. Pagano Department of Informaton Engneerng, Unversty of Psa Va Dotsav - 566 Psa - Itay {r.garroppo, s.gordano, s.ucett,

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

Gaussian process classification: a message-passing viewpoint

Gaussian process classification: a message-passing viewpoint Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP

More information

A General Column Generation Algorithm Applied to System Reliability Optimization Problems

A General Column Generation Algorithm Applied to System Reliability Optimization Problems A Genera Coumn Generaton Agorthm Apped to System Reabty Optmzaton Probems Lea Za, Davd W. Cot, Department of Industra and Systems Engneerng, Rutgers Unversty, Pscataway, J 08854, USA Abstract A genera

More information

D hh ν. Four-body charm semileptonic decay. Jim Wiss University of Illinois

D hh ν. Four-body charm semileptonic decay. Jim Wiss University of Illinois Four-body charm semeptonc decay Jm Wss Unversty of Inos D hh ν 1 1. ector domnance. Expected decay ntensty 3. SU(3) apped to D s φν 4. Anaytc forms for form factors 5. Non-parametrc form factors 6. Future

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,

More information

Some results on a cross-section in the tensor bundle

Some results on a cross-section in the tensor bundle Hacettepe Journa of Matematcs and Statstcs Voume 43 3 214, 391 397 Some resuts on a cross-secton n te tensor bunde ydın Gezer and Murat tunbas bstract Te present paper s devoted to some resuts concernng

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information