Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters


Journal of Machine Learning Research 8 (2007). Submitted /06; Published 4/07.

Gavin C. Cawley (GCC@CMP.UEA.AC.UK)
Nicola L. C. Talbot (NLCT@CMP.UEA.AC.UK)
School of Computing Sciences, University of East Anglia, Norwich, United Kingdom NR4 7TJ

Editors: Isabelle Guyon and Amir Saffari

Abstract

While the model parameters of a kernel machine are typically given by the solution of a convex optimisation problem, with a single global optimum, the selection of good values for the regularisation and kernel parameters is much less straightforward. Fortunately the leave-one-out cross-validation procedure can be performed, or at least approximated, very efficiently in closed form for a wide variety of kernel learning methods, providing a convenient means for model selection. Leave-one-out cross-validation based estimates of performance, however, generally exhibit a relatively high variance and are therefore prone to over-fitting. In this paper, we investigate the novel use of Bayesian regularisation at the second level of inference, adding a regularisation term to the model selection criterion corresponding to a prior over the hyper-parameter values, where the additional regularisation parameters are integrated out analytically. Results obtained on a suite of thirteen real-world and synthetic benchmark data sets clearly demonstrate the benefit of this approach.

Keywords: model selection, kernel methods, Bayesian regularisation

1. Introduction

Leave-one-out cross-validation (Lachenbruch and Mickey, 1968; Luntz and Brailovsky, 1969; Stone, 1974) provides the basis for computationally efficient model selection strategies for a variety of kernel learning methods, including the Support Vector Machine (SVM) (Cortes and Vapnik, 1995; Chapelle et al., 2002), Gaussian Process (GP) (Rasmussen and Williams, 2006; Sundararajan and Keerthi, 2001), Least-Squares Support Vector Machine (LS-SVM) (Suykens and Vandewalle, 1999; Cawley and Talbot, 2004), Kernel Fisher Discriminant (KFD) analysis (Mika et al., 1999; Cawley and Talbot, 2003; Saadi et al., 2004; Bo et al., 2006) and Kernel Logistic Regression (KLR) (Keerthi et al., 2005; Cawley and Talbot, 2007). These methods have proved highly successful for kernel machines having only a small number of hyper-parameters to optimise, as demonstrated by the set of models achieving the best average score in the WCCI-2006 performance prediction challenge (Cawley, 2006; Guyon et al., 2006). Unfortunately, while leave-one-out cross-validation estimators have been shown to be almost unbiased (Luntz and Brailovsky, 1969), they are known to exhibit a relatively high variance (e.g., Kohavi, 1995). A kernel with many hyper-parameters, for instance those used in Automatic Relevance Determination (ARD) (e.g., Rasmussen and Williams, 2006) or feature scaling methods (Chapelle et al., 2002; Bo et al., 2006), may provide sufficient degrees of freedom to over-fit leave-one-out cross-validation based model selection criteria, resulting in performance inferior to that obtained using a less flexible kernel function.

In this paper, we investigate the novel use of regularisation (Tikhonov and Arsenin, 1977) of the hyper-parameters in model selection in order to ameliorate the effects of the high variance of leave-one-out cross-validation based selection criteria, and so improve predictive performance. The regularisation term corresponds to a zero-mean Gaussian prior over the values of the kernel parameters, representing a preference for smooth kernel functions, and hence a relatively simple classifier. The regularisation parameters introduced in this step are integrated out analytically in the style of Buntine and Weigend (1991), to provide a Bayesian model selection criterion that can be optimised in a straightforward manner via, for example, scaled conjugate gradient descent (Williams, 1991).

The paper is structured as follows: the remainder of this section provides a brief overview of the least-squares support vector machine, including the use of leave-one-out cross-validation based model selection procedures, given in sufficient detail to ensure the reproducibility of the results. Section 2 describes the use of Bayesian regularisation to prevent over-fitting at the second level of inference, that is, model selection. Section 3 presents results obtained over a suite of thirteen benchmark data sets, which demonstrate the utility of this approach. Section 4 discusses the results, and finally the work is summarised and directions for further work are outlined in Section 5.

1.1 Least Squares Support Vector Machine

In the remainder of this section, we provide a brief overview of the least-squares support vector machine (Suykens and Vandewalle, 1999), used as the testbed for the investigation of the role of regularisation in the model selection process described in this study. Given training data, $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{\ell}$, where $x_i \in \mathcal{X} \subset \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, we seek to construct a linear discriminant, $f(x) = w \cdot \phi(x) + b$, in a feature space, $\mathcal{F}$, defined by a fixed transformation of the input space, $\phi : \mathcal{X} \to \mathcal{F}$. The parameters of the linear discriminant, $(w, b)$, are given by the minimiser of a regularised (Tikhonov and Arsenin, 1977) least-squares training criterion,

\[ L = \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{\ell}\left[y_i - w \cdot \phi(x_i) - b\right]^2, \qquad (1) \]

where $\mu$ is a regularisation parameter controlling the bias-variance trade-off (Geman et al., 1992). Rather than specify the feature space directly, it is instead induced by a kernel function, $\mathcal{K} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which evaluates the inner product between the projections of the data onto the feature space, that is, $\mathcal{K}(x, x') = \phi(x) \cdot \phi(x')$. The interpretation of a kernel as an inner product in a fixed feature space is valid for any Mercer kernel (Mercer, 1909), for which the Gram matrix, $K = [k_{ij} = \mathcal{K}(x_i, x_j)]_{i,j=1}^{\ell}$, is positive semi-definite, that is,

\[ a^T K a \geq 0, \qquad \forall\, a \in \mathbb{R}^{\ell}. \]

The Gram matrix effectively encodes the spatial relationships between the projections of the data in the feature space, $\mathcal{F}$. A linear model can thus be constructed implicitly in the feature space using only the information contained in the Gram matrix, without explicitly evaluating the positions of the data in the feature space via the transformation $\phi(\cdot)$.
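As a concrete companion to the Gram matrix just described, the following minimal sketch (not from the original paper; the helper name rbf_gram and all values are illustrative) builds the Gram matrix of the spherical RBF kernel used throughout the paper and checks the Mercer condition $a^T K a \geq 0$ numerically:

```python
import numpy as np

def rbf_gram(X, eta):
    """Gram matrix K_ij = exp(-eta * ||x_i - x_j||^2) for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-eta * np.maximum(sq_dists, 0.0))  # clip tiny negatives

X = np.random.randn(50, 3)                    # 50 patterns, 3 features
K = rbf_gram(X, eta=0.5)
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # PSD up to rounding error
```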

Indeed, the representer theorem (Kimeldorf and Wahba, 1971) shows that the solution of the optimisation problem (1) can be written as an expansion over the training patterns,

\[ w = \sum_{i=1}^{\ell} \alpha_i \phi(x_i) \quad \Longrightarrow \quad f(x) = \sum_{i=1}^{\ell} \alpha_i \mathcal{K}(x_i, x) + b. \]

The advantage of the kernel trick then becomes apparent; a linear model can be constructed in an extremely rich, high- (possibly infinite-) dimensional feature space, using only finite-dimensional quantities, such as the Gram matrix, $K$. The kernel trick also allows the construction of statistical models that operate directly on structured data, for instance strings, trees and graphs, leading to the current interest in kernel learning methods in computational biology (Schölkopf et al., 2004) and text processing (Joachims, 2002). The Radial Basis Function (RBF) kernel,

\[ \mathcal{K}(x, x') = \exp\left\{ -\eta \|x - x'\|^2 \right\}, \]

is commonly encountered in practical applications of kernel learning methods; here $\eta$ is a kernel parameter controlling the sensitivity of the kernel function. The feature space for the radial basis function kernel consists of the positive orthant of an infinite-dimensional unit hyper-sphere (e.g., Shawe-Taylor and Cristianini, 2004). The Gram matrix for the radial basis function kernel is thus of full rank (Micchelli, 1986), and so the kernel model is able to form an arbitrary shattering of the data.

1.1.1 A DUAL TRAINING ALGORITHM

The basic training algorithm for the least-squares support vector machine (Suykens and Vandewalle, 1999) views the regularised loss function (1) as a constrained minimisation problem:

\[ \min_{w, b, \varepsilon} \; \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{\ell} \varepsilon_i^2 \qquad \text{subject to} \qquad \varepsilon_i = y_i - w \cdot \phi(x_i) - b. \]

The primal Lagrangian for this constrained optimisation problem gives the unconstrained minimisation problem defined by the following regularised loss function,

\[ L = \frac{1}{2}\|w\|^2 + \frac{1}{2\mu}\sum_{i=1}^{\ell} \varepsilon_i^2 - \sum_{i=1}^{\ell} \alpha_i \left\{ w \cdot \phi(x_i) + b + \varepsilon_i - y_i \right\}, \]

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{\ell}) \in \mathbb{R}^{\ell}$ is a vector of Lagrange multipliers. The optimality conditions for this problem can be expressed as follows:

\[ \frac{\partial L}{\partial w} = 0 \;\Longrightarrow\; w = \sum_{i=1}^{\ell} \alpha_i \phi(x_i), \qquad (2) \]
\[ \frac{\partial L}{\partial b} = 0 \;\Longrightarrow\; \sum_{i=1}^{\ell} \alpha_i = 0, \qquad (3) \]
\[ \frac{\partial L}{\partial \varepsilon_i} = 0 \;\Longrightarrow\; \alpha_i = \frac{\varepsilon_i}{\mu}, \quad \forall\, i \in \{1, 2, \ldots, \ell\}, \qquad (4) \]
\[ \frac{\partial L}{\partial \alpha_i} = 0 \;\Longrightarrow\; w \cdot \phi(x_i) + b + \varepsilon_i - y_i = 0, \quad \forall\, i \in \{1, 2, \ldots, \ell\}. \qquad (5) \]

Using (2) and (4) to eliminate $w$ and $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_{\ell})$ from (5), we find that

\[ \sum_{j=1}^{\ell} \alpha_j \phi(x_j) \cdot \phi(x_i) + b + \mu \alpha_i = y_i, \quad \forall\, i \in \{1, 2, \ldots, \ell\}. \qquad (6) \]

Noting that $\mathcal{K}(x, x') = \phi(x) \cdot \phi(x')$, the system of linear equations, (6) and (3), can be written more concisely in matrix form as

\[ \begin{bmatrix} K + \mu I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \]

where $K = [k_{ij} = \mathcal{K}(x_i, x_j)]_{i,j=1}^{\ell}$, $I$ is the identity matrix and $\mathbf{1}$ is a column vector of ones. The optimal parameters for the model of the conditional mean can then be obtained with a computational complexity of $O(\ell^3)$ operations, using direct methods, such as Cholesky decomposition (Golub and Van Loan, 1996).

1.1.2 EFFICIENT IMPLEMENTATION VIA CHOLESKY DECOMPOSITION

A more efficient training algorithm can be obtained by taking advantage of the special structure of the system of linear equations (Suykens et al., 2002). The system of linear equations to be solved in fitting a least-squares support vector machine is given by

\[ \begin{bmatrix} M & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}, \qquad (7) \]

where $M = K + \mu I$. Unfortunately the matrix on the left-hand side is not positive definite, and so we cannot solve this system of linear equations directly using the Cholesky decomposition. However, the first row of (7) can be re-written as

\[ M \left( \alpha + M^{-1} \mathbf{1} b \right) = y. \qquad (8) \]

Rearranging (8), we see that $\alpha = M^{-1}(y - \mathbf{1}b)$; using this result to eliminate $\alpha$, the second row of (7) can be written as

\[ \mathbf{1}^T M^{-1} \mathbf{1}\, b = \mathbf{1}^T M^{-1} y. \]

The system of linear equations can then be re-written as

\[ \begin{bmatrix} M & 0 \\ \mathbf{1}^T & \mathbf{1}^T M^{-1} \mathbf{1} \end{bmatrix} \begin{bmatrix} \alpha + M^{-1}\mathbf{1}b \\ b \end{bmatrix} = \begin{bmatrix} y \\ \mathbf{1}^T M^{-1} y \end{bmatrix}. \qquad (9) \]

In this case, the matrix on the left-hand side is positive definite, as $M = K + \mu I$ is positive definite and $\mathbf{1}^T M^{-1} \mathbf{1}$ is positive, since the inverse of a positive definite matrix is also positive definite. The revised system of linear equations (9) can be solved as follows: first solve

\[ M \rho = \mathbf{1} \qquad \text{and} \qquad M \nu = y, \qquad (10) \]

which may be performed efficiently using the Cholesky factorisation of $M$. The model parameters of the least-squares support vector machine are then given by

\[ b = \frac{\mathbf{1}^T \nu}{\mathbf{1}^T \rho} \qquad \text{and} \qquad \alpha = \nu - \rho b. \]

The two systems of linear equations (10) can be solved efficiently using the Cholesky decomposition of $M = R^T R$, where $R$ is the upper triangular Cholesky factor of $M$.
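A minimal sketch of this training procedure (not from the original paper; the function name lssvm_fit is illustrative, and SciPy's cho_factor/cho_solve stand in for a hand-rolled Cholesky solver):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lssvm_fit(K, y, mu):
    """LS-SVM training via Equations (9)-(10): factorise M = K + mu*I once,
    solve M rho = 1 and M nu = y, then recover b and alpha."""
    ell = K.shape[0]
    cf = cho_factor(K + mu * np.eye(ell))
    rho = cho_solve(cf, np.ones(ell))
    nu = cho_solve(cf, y)
    b = np.sum(nu) / np.sum(rho)   # b = (1^T nu) / (1^T rho)
    alpha = nu - rho * b
    return alpha, b
```

A single Cholesky factorisation serves both right-hand sides, which is the point of the reformulation in (9) and (10).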

1.2 Leave-One-Out Cross-Validation

Cross-validation (Stone, 1974) is commonly used to obtain a reliable estimate of the test error for performance estimation or for use as a model selection criterion. The most common form, $k$-fold cross-validation, partitions the available data into $k$ disjoint subsets. In each iteration a classifier is trained on a different combination of $k-1$ subsets and the unused subset is used to estimate the test error rate. The $k$-fold cross-validation estimate of the test error rate is then simply the average of the test error rates observed in each of the $k$ iterations, or folds. The most extreme form of cross-validation, where $k = \ell$ such that the test partition in each fold consists of only a single pattern, is known as leave-one-out cross-validation (Lachenbruch and Mickey, 1968) and has been shown to provide an almost unbiased estimate of the test error rate (Luntz and Brailovsky, 1969). Leave-one-out cross-validation is, however, computationally expensive; in the case of the least-squares support vector machine a naïve implementation has a complexity of $O(\ell^4)$ operations. Leave-one-out cross-validation is therefore normally only used in circumstances where the available data are extremely scarce, such that the computational expense is no longer prohibitive. In this case the inherently high variance of the leave-one-out estimator (Kohavi, 1995) is offset by the minimal decrease in the size of the training set in each fold, and so it may provide a more reliable estimate of generalisation performance than conventional $k$-fold cross-validation. Fortunately leave-one-out cross-validation of least-squares support vector machines can be performed in closed form with a computational complexity of only $O(\ell^3)$ operations (Cawley and Talbot, 2004). Leave-one-out cross-validation can then be used in medium to large scale applications, where there may be a few thousand data points, although the relatively high variance of this estimator remains potentially problematic.

1.2.1 VIRTUAL LEAVE-ONE-OUT CROSS-VALIDATION

The optimal values of the parameters of a least-squares support vector machine are given by the solution of a system of linear equations:

\[ \begin{bmatrix} K + \mu I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ b \end{bmatrix} = \begin{bmatrix} y \\ 0 \end{bmatrix}. \qquad (11) \]

The matrix on the left-hand side of (11) can be decomposed into the block-matrix representation

\[ \begin{bmatrix} K + \mu I & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix} = \begin{bmatrix} c_{11} & c_1^T \\ c_1 & C_1 \end{bmatrix} = C. \]

Let $[\alpha^{(-i)}; b^{(-i)}]$ represent the parameters of the least-squares support vector machine during the $i$-th iteration of the leave-one-out cross-validation procedure; then in the first iteration, in which the first training pattern is excluded,

\[ \begin{bmatrix} \alpha^{(-1)} \\ b^{(-1)} \end{bmatrix} = C_1^{-1} \left[ y_2, \ldots, y_{\ell}, 0 \right]^T. \]

The leave-one-out prediction for the first training pattern is then given by

\[ \hat{y}_1^{(-1)} = c_1^T \begin{bmatrix} \alpha^{(-1)} \\ b^{(-1)} \end{bmatrix} = c_1^T C_1^{-1} \left[ y_2, \ldots, y_{\ell}, 0 \right]^T. \]

Considering the last $\ell$ equations in the system of linear equations (11), it is clear that

\[ \left[ c_1 \;\; C_1 \right] \left[ \alpha_1, \ldots, \alpha_{\ell}, b \right]^T = \left[ y_2, \ldots, y_{\ell}, 0 \right]^T, \]

and so

\[ \hat{y}_1^{(-1)} = c_1^T C_1^{-1} \left[ c_1 \;\; C_1 \right] \left[ \alpha^T, b \right]^T = c_1^T C_1^{-1} c_1 \alpha_1 + c_1^T \left[ \alpha_2, \ldots, \alpha_{\ell}, b \right]^T. \]

Noting, from the first equation in the system of linear equations (11), that $y_1 = c_{11}\alpha_1 + c_1^T[\alpha_2, \ldots, \alpha_{\ell}, b]^T$, we have

\[ \hat{y}_1^{(-1)} = y_1 - \alpha_1 \left( c_{11} - c_1^T C_1^{-1} c_1 \right). \]

Finally, via the block matrix inversion lemma,

\[ \begin{bmatrix} c_{11} & c_1^T \\ c_1 & C_1 \end{bmatrix}^{-1} = \begin{bmatrix} \kappa^{-1} & -\kappa^{-1} c_1^T C_1^{-1} \\ -\kappa^{-1} C_1^{-1} c_1 & C_1^{-1} + \kappa^{-1} C_1^{-1} c_1 c_1^T C_1^{-1} \end{bmatrix}, \]

where $\kappa = c_{11} - c_1^T C_1^{-1} c_1$, and noting that the system of linear equations (11) is insensitive to permutations of the ordering of the equations and of the unknowns, we have that

\[ y_i - \hat{y}_i^{(-i)} = \frac{\alpha_i}{\left[ C^{-1} \right]_{ii}}. \qquad (12) \]

This means that, assuming the system of linear equations (11) is solved via explicit inversion of $C$, a leave-one-out cross-validation estimate of an appropriate model selection criterion can be evaluated using information already available as a by-product of training the least-squares support vector machine on the entire data set (cf. Sundararajan and Keerthi, 2001).

1.2.2 EFFICIENT IMPLEMENTATION VIA CHOLESKY FACTORISATION

The leave-one-out cross-validation behaviour of the least-squares support vector machine is described by (12). The coefficients of the kernel expansion, $\alpha$, can be found efficiently via Cholesky factorisation, as described in Section 1.1.2. However, we must also determine the diagonal elements of $C^{-1}$ in an efficient manner. Using the block matrix inversion formula, we obtain

\[ C^{-1} = \begin{bmatrix} M & \mathbf{1} \\ \mathbf{1}^T & 0 \end{bmatrix}^{-1} = \begin{bmatrix} M^{-1} + M^{-1}\mathbf{1} S_M^{-1} \mathbf{1}^T M^{-1} & -M^{-1}\mathbf{1} S_M^{-1} \\ -S_M^{-1}\mathbf{1}^T M^{-1} & S_M^{-1} \end{bmatrix}, \]

where $M = K + \mu I$ and $S_M = -\mathbf{1}^T M^{-1} \mathbf{1} = -\mathbf{1}^T \rho$ is the Schur complement of $M$. The inverse of the positive definite matrix, $M$, can be computed efficiently from its Cholesky factorisation, via the SYMINV algorithm (Seaks, 1972), for example using the LAPACK routine DTRTRI (Anderson et al., 1999). Let $R = [r_{ij}]_{i,j=1}^{\ell}$ be the lower triangular Cholesky factor of the positive definite matrix $M$, such that $M = RR^T$. Furthermore, let $S = [s_{ij}]_{i,j=1}^{\ell} = R^{-1}$, where

\[ s_{ii} = \frac{1}{r_{ii}} \qquad \text{and} \qquad s_{ij} = -\frac{1}{r_{ii}} \sum_{k=j}^{i-1} r_{ik} s_{kj} \quad (i > j), \]

represent the (lower triangular) inverse of the Cholesky factor. The inverse of $M$ is then given by $M^{-1} = S^T S$. In the case of efficient leave-one-out cross-validation of least-squares support vector machines, we are principally concerned only with the diagonal elements of $C^{-1}$, given by

\[ \left[ M^{-1} \right]_{ii} = \sum_{j=i}^{\ell} s_{ji}^2 \quad \Longrightarrow \quad \left[ C^{-1} \right]_{ii} = \sum_{j=i}^{\ell} s_{ji}^2 + \frac{\rho_i^2}{S_M}, \quad \forall\, i \in \{1, 2, \ldots, \ell\}. \]
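The virtual leave-one-out computation is compact enough to sketch directly (not from the original paper; for clarity this version inverts $C$ explicitly, the $O(\ell^3)$ route of Section 1.2.1, rather than the Cholesky route just described):

```python
import numpy as np

def loo_residuals(K, y, mu):
    """Leave-one-out residuals r_i = y_i - yhat_i^(-i) = alpha_i / [C^{-1}]_ii,
    Equation (12), obtained from a single fit on the full training set."""
    ell = K.shape[0]
    C = np.block([[K + mu * np.eye(ell), np.ones((ell, 1))],
                  [np.ones((1, ell)),    np.zeros((1, 1))]])
    Cinv = np.linalg.inv(C)
    alpha = (Cinv @ np.append(y, 0.0))[:ell]   # model parameters [alpha; b]
    return alpha / np.diag(Cinv)[:ell]
```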

The computational complexity of the basic training algorithm is $O(\ell^3)$ operations, being dominated by the evaluation of the Cholesky factor. However, the computational complexity of the analytic leave-one-out cross-validation procedure, when performed as a by-product of the training algorithm, is only $O(\ell)$ operations. The computational expense of the leave-one-out cross-validation procedure therefore becomes increasingly negligible as the training set becomes larger.

1.3 Model Selection

The virtual leave-one-out cross-validation procedure described in the previous section provides the basis for a simple automated model selection strategy for the least-squares support vector machine. Perhaps the most basic model selection criterion is provided by the Predicted REsidual Sum of Squares (PRESS) criterion (Allen, 1974), which is simply the leave-one-out estimate of the sum-of-squares error,

\[ Q(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \left[ y_i - \hat{y}_i^{(-i)} \right]^2. \]

In the majority of practical applications of kernel learning methods, a minimum of the model selection criterion is found via a simple grid-search procedure. However, this is rarely necessary and often highly inefficient, as a grid search spends a large amount of time investigating hyper-parameter values outside the neighbourhood of the global optimum. A more efficient approach uses the Nelder-Mead simplex algorithm (Nelder and Mead, 1965), as implemented by the fminsearch function of the MATLAB optimisation toolbox. An alternative, easily implemented, approach uses conjugate gradient methods, with the required gradient information estimated by the method of finite differences, as implemented by the fminunc function from the MATLAB optimisation toolbox. In this study, however, we use scaled conjugate gradient descent (Williams, 1991), with the required gradient information evaluated analytically, as this is approximately twice as efficient.

1.3.1 PARTIAL DERIVATIVES OF THE PRESS MODEL SELECTION CRITERION

Let $\theta = \{\theta_1, \ldots, \theta_n\} = \{\mu, \eta_1, \ldots, \eta_d\}$ represent the vector of hyper-parameters for a least-squares support vector machine, where $\eta_1, \ldots, \eta_d$ represent the kernel parameters. The PRESS statistic (Allen, 1974) can be written as

\[ Q(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \left[ r_i^{(-i)} \right]^2, \qquad \text{where} \qquad r_i^{(-i)} = y_i - \hat{y}_i^{(-i)} = \frac{\alpha_i}{[C^{-1}]_{ii}}. \]

Using the chain rule, the partial derivative of the PRESS statistic with respect to an individual hyper-parameter, $\theta_j$, is given by

\[ \frac{\partial Q(\theta)}{\partial \theta_j} = \sum_{i=1}^{\ell} \frac{\partial Q(\theta)}{\partial r_i^{(-i)}} \frac{\partial r_i^{(-i)}}{\partial \theta_j}, \]

where

\[ \frac{\partial Q(\theta)}{\partial r_i^{(-i)}} = r_i^{(-i)} \qquad \text{and} \qquad \frac{\partial r_i^{(-i)}}{\partial \theta_j} = \frac{1}{[C^{-1}]_{ii}} \frac{\partial \alpha_i}{\partial \theta_j} - \frac{\alpha_i}{[C^{-1}]_{ii}^2} \frac{\partial [C^{-1}]_{ii}}{\partial \theta_j}, \]

such that

\[ \frac{\partial Q(\theta)}{\partial \theta_j} = \sum_{i=1}^{\ell} r_i^{(-i)} \left\{ \frac{1}{[C^{-1}]_{ii}} \frac{\partial \alpha_i}{\partial \theta_j} - \frac{\alpha_i}{[C^{-1}]_{ii}^2} \frac{\partial [C^{-1}]_{ii}}{\partial \theta_j} \right\}. \]
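The gradient-free route mentioned above is equally easy to sketch (again not from the paper; this reuses the hypothetical rbf_gram and loo_residuals helpers from the earlier sketches, with SciPy's Nelder-Mead standing in for MATLAB's fminsearch):

```python
import numpy as np
from scipy.optimize import minimize

def press(log2_theta, X, y):
    """PRESS statistic Q(theta) under a log2 parameterisation of (mu, eta)."""
    mu, eta = 2.0 ** log2_theta
    r = loo_residuals(rbf_gram(X, eta), y, mu)
    return 0.5 * np.sum(r ** 2)

# synthetic sample with labels in {-1, +1} (illustrative only)
X = np.random.randn(100, 3)
y = np.sign(X[:, 0] + 0.3 * np.random.randn(100))

res = minimize(press, x0=np.zeros(2), args=(X, y), method="Nelder-Mead")
mu_opt, eta_opt = 2.0 ** res.x
```

The analytic gradient used in this study is assembled next.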

We begin by deriving the partial derivatives of the model parameters, $[\alpha^T, b]^T$, with respect to the hyper-parameter $\theta_j$. The model parameters are given by the solution of a system of linear equations, such that $[\alpha^T, b]^T = C^{-1}[y^T, 0]^T$. Using the following identity for the partial derivatives of the inverse of a matrix,

\[ \frac{\partial C^{-1}}{\partial \theta_j} = -C^{-1} \frac{\partial C}{\partial \theta_j} C^{-1}, \qquad (13) \]

we obtain

\[ \frac{\partial [\alpha^T, b]^T}{\partial \theta_j} = -C^{-1} \frac{\partial C}{\partial \theta_j} C^{-1} \left[ y^T, 0 \right]^T = -C^{-1} \frac{\partial C}{\partial \theta_j} \left[ \alpha^T, b \right]^T. \]

Note the computational complexity of evaluating the partial derivatives of the model parameters is $O(\ell^2)$, as only two successive matrix-vector products are required. The partial derivatives of the diagonal elements of $C^{-1}$ can be found using the inverse matrix derivative identity (13). For a kernel parameter, $\partial C / \partial \eta_j$ will generally be fully dense, and so the computational complexity of evaluating the diagonal elements of $\partial C^{-1} / \partial \eta_j$ will be $O(\ell^3)$ operations. If, on the other hand, we consider the regularisation parameter, $\mu$, we have that

\[ \frac{\partial C}{\partial \mu} = \begin{bmatrix} I & 0 \\ 0^T & 0 \end{bmatrix}, \]

and so the computation of the partial derivatives of the model parameters with respect to the regularisation parameter is slightly simplified,

\[ \frac{\partial [\alpha^T, b]^T}{\partial \mu} = -C^{-1} \left[ \alpha^T, 0 \right]^T. \]

More importantly, as $\partial C / \partial \mu$ is diagonal, the diagonal elements of (13) can be evaluated with a computational complexity of only $O(\ell^2)$ operations. This suggests that it may be more efficient to adopt different strategies for optimising the regularisation parameter, $\mu$, and the vector of kernel parameters, $\eta$ (cf. Saadi et al., 2004). For a kernel parameter, $\eta_j$, the partial derivatives of $C$ with respect to $\eta_j$ are given by the partial derivatives of the kernel matrix, that is,

\[ \frac{\partial C}{\partial \eta_j} = \begin{bmatrix} \partial K / \partial \eta_j & 0 \\ 0^T & 0 \end{bmatrix}. \]

For the spherical radial basis function kernel used in this study, the partial derivative with respect to the kernel parameter is given by

\[ \frac{\partial \mathcal{K}(x, x')}{\partial \eta} = -\mathcal{K}(x, x') \, \|x - x'\|^2. \]

Finally, since the regularisation parameter, $\mu$, and the scale parameter of the radial basis function kernel are strictly positive quantities, in order to permit the use of an unconstrained optimisation procedure we adopt the parameterisation $\tilde{\theta}_j = \log_2 \theta_j$, such that

\[ \frac{\partial Q(\theta)}{\partial \tilde{\theta}_j} = \frac{\partial Q(\theta)}{\partial \theta_j} \frac{\partial \theta_j}{\partial \tilde{\theta}_j}, \qquad \text{where} \qquad \frac{\partial \theta_j}{\partial \tilde{\theta}_j} = \theta_j \log 2. \]
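Putting the pieces of this subsection together, a sketch of the analytic PRESS gradient (not from the paper; explicit inverses are used for clarity rather than speed, and the caller supplies dK_deta = -K * sq_dists for the spherical RBF kernel):

```python
import numpy as np

def press_gradient(K, dK_deta, y, mu, eta):
    """Gradient of the PRESS statistic w.r.t. log2(mu) and log2(eta)."""
    ell = K.shape[0]
    C = np.block([[K + mu * np.eye(ell), np.ones((ell, 1))],
                  [np.ones((1, ell)),    np.zeros((1, 1))]])
    Cinv = np.linalg.inv(C)
    params = Cinv @ np.append(y, 0.0)              # [alpha; b]
    alpha, dg = params[:ell], np.diag(Cinv)[:ell]
    r = alpha / dg                                 # LOO residuals, Eq. (12)

    grad = []
    for dC in (np.pad(np.eye(ell), ((0, 1), (0, 1))),   # dC/dmu
               np.pad(dK_deta,     ((0, 1), (0, 1)))):   # dC/deta
        dparams = -Cinv @ (dC @ params)                  # derivative of [alpha; b]
        ddg = np.diag(-Cinv @ dC @ Cinv)[:ell]           # identity (13), diagonal
        dr = dparams[:ell] / dg - alpha * ddg / dg ** 2
        grad.append(np.sum(r * dr))
    # chain rule for the log2 parameterisation: d(theta)/d(theta~) = theta * ln 2
    return np.array(grad) * np.array([mu, eta]) * np.log(2.0)
```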

1.3.2 AUTOMATIC RELEVANCE DETERMINATION

Automatic Relevance Determination (ARD) (e.g., Rasmussen and Williams, 2006), also known as feature scaling (Chapelle et al., 2002; Bo et al., 2006), aims to identify informative input features as a natural consequence of optimising the model selection criterion. This can be most easily achieved using an elliptical radial basis function kernel,

\[ \mathcal{K}(x, x') = \exp\left\{ -\sum_{i=1}^{d} \eta_i \left[ x_i - x_i' \right]^2 \right\}, \]

that incorporates individual scaling factors for each input dimension. The partial derivatives with respect to the kernel parameters are then given by

\[ \frac{\partial \mathcal{K}(x, x')}{\partial \eta_i} = -\mathcal{K}(x, x') \left[ x_i - x_i' \right]^2. \]

Generalisation performance is likely to be enhanced if irrelevant features are down-weighted. It is therefore hoped that minimising the model selection criterion will lead to very small values for the scaling factors associated with redundant input features, allowing them to be identified and pruned from the model.

2. Bayesian Regularisation in Model Selection

In order to overcome the observed over-fitting in model selection using leave-one-out cross-validation based methods, we propose to add a regularisation term (Tikhonov and Arsenin, 1977) to the model selection criterion, penalising solutions where the kernel parameters take on unduly large values. The regularised model selection criterion is then given by

\[ M(\theta) = \zeta Q(\theta) + \xi \Omega(\theta), \qquad (14) \]

where $\xi$ and $\zeta$ are additional regularisation parameters, $Q(\theta)$ is the model selection criterion, in this case the PRESS statistic, and $\Omega(\theta)$ is a regularisation term,

\[ Q(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \left[ y_i - \hat{y}_i^{(-i)} \right]^2 \qquad \text{and} \qquad \Omega(\theta) = \frac{1}{2} \sum_{i=1}^{d} \eta_i^2. \]

In this study we have left the regularisation parameter, $\mu$, unregularised. However, we have now introduced two further regularisation parameters, $\xi$ and $\zeta$, for which good values must also be found. This problem may be solved by taking a Bayesian approach, adopting an ignorance prior and integrating out the additional regularisation parameters analytically in the style of Buntine and Weigend (1991). Adapting the approach taken by Williams (1995), the regularised model selection criterion (14) can be interpreted, by taking the negative logarithm and neglecting additive constants, as the posterior density in the space of the hyper-parameters,

\[ P(\theta \mid \mathcal{D}) \propto P(\mathcal{D} \mid \theta) P(\theta). \]

Here $P(\mathcal{D} \mid \theta)$ represents the likelihood with respect to the hyper-parameters and $P(\theta)$ represents our prior beliefs regarding the hyper-parameters, in this case that they should have a small magnitude, corresponding to a relatively simple model.

These quantities can be expressed as

\[ P(\mathcal{D} \mid \theta) = \frac{1}{Z_Q} \exp\{-\zeta Q(\theta)\} \qquad \text{and} \qquad P(\theta) = \frac{1}{Z_\Omega} \exp\{-\xi \Omega(\theta)\}, \]

where $Z_Q$ and $Z_\Omega$ are the appropriate normalising constants. Assuming the data represent an i.i.d. sample, the likelihood in this case is Gaussian,

\[ P(\mathcal{D} \mid \theta) = \prod_{i=1}^{\ell} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\left[ y_i - \hat{y}_i^{(-i)} \right]^2}{2\sigma^2} \right\}, \qquad \text{where} \quad \zeta = \frac{1}{\sigma^2} \quad \Longrightarrow \quad Z_Q = \left( \frac{2\pi}{\zeta} \right)^{\ell/2}. \]

Likewise, the prior is a Gaussian centred on the origin,

\[ P(\theta) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi/\xi}} \exp\left\{ -\frac{\xi}{2}\eta_i^2 \right\}, \qquad \text{such that} \qquad Z_\Omega = \left( \frac{2\pi}{\xi} \right)^{d/2}. \]

Minimising (14) is thus equivalent to maximising the posterior density with respect to the hyper-parameters. Note that the use of a prior over the hyper-parameters is in accordance with normal Bayesian practice and has been investigated in the case of Gaussian process classifiers by Williams and Barber (1998). The combination of frequentist and Bayesian approaches at the first and second levels of inference is, however, somewhat unusual. The marginal likelihood is dependent on the assumptions of the model, which may not be completely appropriate; cross-validation based procedures may therefore be more robust in the case of model mis-specification (Wahba, 1990). It seems reasonable for the model to be less sensitive to assumptions at the second level of inference than at the first, and so the proposed approach represents a pragmatic combination of techniques.

2.1 Elimination of Second Level Regularisation Parameters ξ and ζ

Under the evidence framework proposed by MacKay (1992a,b,c), the hyper-parameters $\xi$ and $\zeta$ are determined by maximising the marginal likelihood, also known as the Bayesian evidence for the model. In this study, however, we opt to integrate out the hyper-parameters analytically, extending the work of Buntine and Weigend (1991) and Williams (1995) to consider Bayesian regularisation at the second level of inference, namely the selection of good values for the hyper-parameters. We begin with the prior over the hyper-parameters, which depends on $\xi$,

\[ P(\theta \mid \xi) = \frac{1}{Z_\Omega(\xi)} \exp\{-\xi \Omega\}. \]

The regularisation parameter $\xi$ may then be integrated out analytically using a suitable prior, $P(\xi)$,

\[ P(\theta) = \int P(\theta \mid \xi) P(\xi) \, d\xi. \]

The improper Jeffreys prior, $P(\xi) \propto 1/\xi$, is an appropriate ignorance prior in this case, as $\xi$ is a scale parameter; noting that $\xi$ is strictly positive,

\[ p(\theta) = \frac{1}{(2\pi)^{d/2}} \int_0^\infty \xi^{d/2 - 1} \exp\{-\xi \Omega\} \, d\xi. \]

Using the Gamma integral, $\int_0^\infty x^{\nu-1} e^{-\mu x} \, dx = \Gamma(\nu)/\mu^\nu$ (Gradshteyn and Ryzhik, 1994, equation 3.384), we obtain

\[ p(\theta) = \frac{\Gamma(d/2)}{(2\pi)^{d/2}\, \Omega^{d/2}} \quad \Longrightarrow \quad -\log p(\theta) \propto \frac{d}{2} \log \Omega. \]

Finally, adopting a similar procedure to eliminate $\zeta$, we obtain a revised model selection criterion with Bayesian regularisation,

\[ L = \frac{\ell}{2} \log Q(\theta) + \frac{d}{2} \log \Omega(\theta), \qquad (15) \]

in which the regularisation parameters have been eliminated. As before, this criterion can be optimised via standard methods, such as the Nelder-Mead simplex algorithm (Nelder and Mead, 1965) or scaled conjugate gradient descent (Williams, 1991). The partial derivatives of the proposed Bayesian model selection criterion are given by

\[ \frac{\partial L}{\partial \theta} = \frac{\ell}{2 Q(\theta)} \frac{\partial Q(\theta)}{\partial \theta} + \frac{d}{2 \Omega(\theta)} \frac{\partial \Omega(\theta)}{\partial \theta}, \qquad \text{where} \qquad \frac{\partial \Omega(\theta)}{\partial \eta_i} = \eta_i. \]

The additional computational expense involved in Bayesian regularisation of the model selection criterion is only $O(d)$ operations, which is extremely small in comparison with the $O(\ell^3)$ operations involved in obtaining the leave-one-out error (including the cost of training the model on the entire data set). Per iteration of the model selection process, the cost of the Bayesian regularisation is therefore minimal. There seems little reason to suppose that the regularisation will have an adverse effect on convergence, and this seems to be the case in practice.
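A minimal sketch of the criterion (15) (not from the paper; ard_gram is a hypothetical elliptical-RBF analogue of the earlier rbf_gram helper, and loo_residuals is the sketch from Section 1.2):

```python
import numpy as np

def ard_gram(X, eta):
    """Elliptical RBF Gram matrix with one scale factor per input feature."""
    diffs = X[:, None, :] - X[None, :, :]
    return np.exp(-np.sum(eta * diffs ** 2, axis=2))

def bayes_criterion(log2_theta, X, y):
    """L = (ell/2) log Q + (d/2) log Omega, Equation (15); the regularisation
    parameter mu itself is left unregularised, as in the text."""
    mu, eta = 2.0 ** log2_theta[0], 2.0 ** log2_theta[1:]
    r = loo_residuals(ard_gram(X, eta), y, mu)
    Q = 0.5 * np.sum(r ** 2)          # PRESS statistic
    Omega = 0.5 * np.sum(eta ** 2)    # Gaussian prior term
    return 0.5 * len(y) * np.log(Q) + 0.5 * eta.size * np.log(Omega)
```

Minimising this function over log2_theta, for example with the Nelder-Mead sketch shown earlier, implements the proposed model selection procedure.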

2.2 Relationship with the Evidence Framework

Under the evidence framework of MacKay (1992a,b,c), the regularisation parameters, $\xi$ and $\zeta$, are selected so as to maximise the marginal likelihood, also known as the Bayesian evidence, for the model. The log-evidence is given by

\[ \log P(\mathcal{D}) = -\xi \Omega(\theta) - \zeta Q(\theta) - \frac{1}{2} \log |A| + \frac{d}{2} \log \xi + \frac{\ell}{2} \log \zeta - \frac{\ell}{2} \log 2\pi, \]

where $A$ is the Hessian of the regularised model selection criterion (14) with respect to the hyper-parameters, $\theta$. Setting the partial derivatives of the log-evidence with respect to the regularisation parameters, $\xi$ and $\zeta$, equal to zero, we obtain the familiar update formulae,

\[ \xi^{\text{new}} = \frac{\gamma}{2\Omega(\theta)} \qquad \text{and} \qquad \zeta^{\text{new}} = \frac{\ell - \gamma}{2 Q(\theta)}, \]

where $\gamma$ is the number of well-determined hyper-parameters, that is, the hyper-parameters for which the optimal value is primarily determined by the log-likelihood term, $Q(\theta)$, rather than by the regulariser, $\Omega(\theta)$. In the case of the L2 regularisation term, corresponding to a Gaussian prior, the number of well-determined hyper-parameters is given by

\[ \gamma = \sum_{j=1}^{n} \frac{\lambda_j}{\lambda_j + \xi}, \]

where $\lambda_1, \ldots, \lambda_n$ represent the eigenvalues of the Hessian of the unregularised model selection criterion, $Q(\theta)$, with respect to the kernel parameters. Comparing the partial derivatives of the regularised model selection criterion (14) with those of the Bayesian criterion (15) reveals that the Bayesian regularisation scheme is equivalent to optimising the regularised model selection criterion (14) assuming that the regularisation parameters, $\xi$ and $\zeta$, are continuously updated according to the update rules

\[ \xi^{\text{eff}} = \frac{d}{2\Omega(\theta)} \qquad \text{and} \qquad \zeta^{\text{eff}} = \frac{\ell}{2Q(\theta)}. \]

This exactly corresponds to the "cheap and cheerful" approximation of the evidence framework suggested by MacKay (1994), which assumes that all of the hyper-parameters are well-determined and that the number of hyper-parameters is small in relation to the number of training patterns. Since $\gamma \leq d$, it seems self-evident that the proposed Bayesian regularisation scheme will be prone to a degree of under-fitting, especially in the case of a feature scaling kernel with many redundant features. The theoretical and practical pros and cons of the integrate-out approach and the evidence framework are discussed in some detail by MacKay (1994) and Bishop (1995) and references therein. However, the integrate-out approach does not require the evaluation of the Hessian matrix of the original selection criterion, $Q(\theta)$, which is likely to prove computationally prohibitive.

3. Results

In this section, we present experimental results demonstrating the benefits of the proposed model selection strategy incorporating Bayesian regularisation to overcome the inherent high variance of leave-one-out cross-validation based selection criteria. Table 2 shows a comparison of the error rates of least-squares support vector machines, using model selection procedures with and without Bayesian regularisation (LS-SVM-BR and LS-SVM respectively), over the suite of thirteen public domain benchmark data sets used in the study by Mika et al. (2000); the data sets are summarised in Table 1. Results obtained using a Gaussian process classifier based on the expectation propagation method (EP-GPC) (Rasmussen and Williams, 2006) are also provided for comparison. The same set of 100 random partitions of the data (20 in the case of the larger image and splice benchmarks) to form training and test sets used in that study are also used here. In each case, model selection is performed independently for each realisation of the data set, such that the standard errors reflect the variability of both the training algorithm and the model selection procedure with changes in the sampling of the data. Both conventional spherical and elliptical radial basis function kernels are used for all kernel learning methods, so that the effectiveness of each algorithm for automatic relevance determination can be assessed. The use of multiple training/test partitions allows an estimate of the statistical significance of differences in performance between algorithms to be computed. Let $\hat{x}$ and $\hat{y}$ represent the means of the performance statistic for a pair of competing algorithms, and $e_x$ and $e_y$ the corresponding standard errors; then the $z$ statistic is computed as

\[ z = \frac{\hat{y} - \hat{x}}{\sqrt{e_x^2 + e_y^2}}. \]

The $z$-score can then be converted to a significance level via the normal cumulative distribution function, such that $z = 1.64$ corresponds to a 95% significance level. All statements of statistical significance in the remainder of this section refer to a 95% level of significance.
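For concreteness, a small sketch of this comparison (not from the paper; the error rates and standard errors below are placeholders, not values from Table 2):

```python
import numpy as np
from scipy.stats import norm

def z_score(mean_x, se_x, mean_y, se_y):
    """z statistic for the difference in mean test error of two algorithms."""
    return (mean_y - mean_x) / np.sqrt(se_x ** 2 + se_y ** 2)

z = z_score(10.5, 0.15, 11.0, 0.20)   # placeholder means and standard errors
print(z, norm.cdf(z))                 # z = 1.64 corresponds to the 95% level
```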

Data Set      | Training Patterns | Testing Patterns | Replications | Input Features
Banana        |  400 | 4900 | 100 |  2
Breast cancer |  200 |   77 | 100 |  9
Diabetes      |  468 |  300 | 100 |  8
Flare solar   |  666 |  400 | 100 |  9
German        |  700 |  300 | 100 | 20
Heart         |  170 |  100 | 100 | 13
Image         | 1300 | 1010 |  20 | 18
Ringnorm      |  400 | 7000 | 100 | 20
Splice        | 1000 | 2175 |  20 | 60
Thyroid       |  140 |   75 | 100 |  5
Titanic       |  150 | 2051 | 100 |  3
Twonorm       |  400 | 7000 | 100 | 20
Waveform      |  400 | 4600 | 100 | 21

Table 1: Details of the data sets used in the empirical comparison.

3.1 Performance of Models Based on the Spherical RBF Kernel

The results shown in the first three data columns of Table 2 show the performance of LS-SVM, LS-SVM-BR and EP-GPC models based on the spherical Gaussian kernel. The performance of LS-SVM models with and without Bayesian regularisation is very similar, with neither model proving significantly better than the other on any of the data sets. This seems reasonable given that only two hyper-parameters are optimised during model selection, so there is little scope for over-fitting the PRESS model selection criterion and the regularisation term will have little effect. The LS-SVM model with Bayesian regularisation is significantly out-performed by the Gaussian process classifier on one benchmark (banana), but performs significantly better on a further four (ringnorm, splice, twonorm, waveform). Demšar (2006) recommends the use of the Wilcoxon signed rank test for assessing the statistical significance of differences in performance over multiple data sets. According to this test, neither the LS-SVM-BR nor the EP-GPC is statistically superior at the 95% level of significance.

3.2 Performance of Models Based on the Elliptical RBF Kernel

The performances of LS-SVM, LS-SVM-BR and EP-GPC models based on the elliptical Gaussian kernel, which includes a separate scale parameter for each input feature, are shown in the last three columns of Table 2. Before evaluating the effects of Bayesian regularisation in model selection, it is worth noting that the use of elliptical RBF kernels does not generally improve performance. For the LS-SVM, the elliptical kernel produces significantly better results on only two benchmarks (image and splice) and significantly worse results on a further eight (banana, breast cancer, diabetes, german, heart, ringnorm, twonorm, waveform), with the degradation in performance being very large in some cases (e.g., heart). This seems likely to be a result of the additional degrees of freedom involved in the model selection process, allowing over-fitting of the PRESS model selection criterion as a result of its inherently high variance. Note that fully Bayesian approaches, such as the Gaussian process classifier, are also unable to reliably select kernel parameters for the elliptical RBF kernel. The elliptical kernel is significantly better on only three benchmarks (flare solar, image and splice), while being significantly worse on six (breast cancer, diabetes, heart, ringnorm, twonorm and waveform).

[Table 2 reports, for each of the thirteen data sets, the mean test error rate ± standard error of the LS-SVM, LS-SVM-BR and EP-GPC under both the radial basis function and the automatic relevance determination kernels; the individual numerical entries are not recoverable from this copy.]

Table 2: Error rates of the least-squares support vector machine, with and without Bayesian regularisation of the model selection criterion, in this case the PRESS statistic (Allen, 1974), and Gaussian process classifiers over thirteen benchmark data sets (Rätsch et al., 2001), using both standard radial basis function and automatic relevance determination kernels. The results for the EP-GPC were obtained using the MATLAB software accompanying the book by Rasmussen and Williams (2006). The results for each method are presented in the form of the mean error rate over test data for 100 realisations of each data set (20 in the case of the image and splice data sets), along with the associated standard error. The best results are shown in boldface and the second best in italics (without implication of statistical significance).

In the case of the elliptical RBF kernel, the use of Bayesian regularisation in model selection has a dramatic effect on the performance of LS-SVM models, with the LS-SVM-BR model proving significantly better than the conventional LS-SVM on nine of the thirteen benchmarks (breast cancer, diabetes, flare solar, german, heart, ringnorm, splice, twonorm and waveform) without being significantly worse on any of the remaining four data sets. This demonstrates that over-fitting in model selection, due to the larger number of kernel parameters, is likely to be the significant factor causing the relatively poor performance of models with the elliptical RBF kernel. Again, the Gaussian process classifier is significantly better than the LS-SVM with Bayesian regularisation on the banana and twonorm data sets, but is significantly worse on four of the remaining eleven (diabetes, heart, ringnorm and splice). Again, according to the Wilcoxon signed rank test, neither the LS-SVM-BR nor the EP-GPC is statistically superior at the 95% level of significance. However, the magnitude of the difference in performance between the LS-SVM-BR and EP-GPC approaches tends to be greatest when the LS-SVM-BR out-performs the EP-GPC, most notably on the heart, splice and ringnorm data sets. This provides some support for the observation of Wahba (1990) that cross-validation based model selection procedures should be more robust against model mis-specification (see also Rasmussen and Williams, 2006).

4. Discussion

The experimental evaluation presented in the previous section demonstrates that over-fitting can occur in model selection, due to the variance of the model selection criterion. In many cases the minimum of the selection criterion using the elliptical RBF kernel is lower than that achievable using the spherical RBF kernel; however, this results in a degradation in generalisation performance. Using the PRESS statistic, the over-fitting is likely to be most severe in cases with a small number of training patterns, as the variance of the leave-one-out estimator decreases as the sample size becomes larger. Using the standard LS-SVM, the elliptical RBF kernel only out-performs the spherical RBF kernel on two of the thirteen data sets, image and splice, which also happen to be the two largest data sets in terms of the number of training patterns. The greatest degradation in performance is obtained on the heart benchmark, the third smallest. The heart data set also has a relatively large number of input features (13). A large number of input features introduces many additional degrees of freedom to optimise in the model selection criterion, and so will generally tend to encourage over-fitting. However, there may be a compact subset of highly relevant features, with the remainder being almost entirely uninformative. In this case the advantage of suppressing the noisy inputs is so great that it overcomes the predisposition towards over-fitting, and so results in improved generalisation (as observed in the case of the image and splice benchmarks). Whether the use of an elliptical RBF kernel will improve or degrade generalisation largely depends on such characteristics of the data, which are not known a priori, and so it seems prudent to consider a range of kernel functions and select the best via cross-validation.

The experimental results indicate that Bayesian regularisation of the hyper-parameters is generally beneficial, without at this stage providing a complete solution to the problem of over-fitting the model selection criterion. The effectiveness of the Bayesian regularisation scheme is to a large extent dependent on the appropriateness of the prior imposed on the hyper-parameters. There is no reason to assume that the simple Gaussian prior used here is in any sense optimal, and this is an issue where further research is necessary (see Section 4.2).

The comparison of the integrate-out approach and the evidence framework highlights a deficiency of the simple Gaussian prior: it suggests that the integrate-out approach is likely to result in mild over-regularisation of the hyper-parameters in the presence of a large number of irrelevant features, as the corresponding hyper-parameters will be ill-determined.

The LS-SVM with Bayesian regularisation of the hyper-parameters does not significantly out-perform the expectation propagation based Gaussian process classifier over the suite of thirteen benchmark data sets considered. This is not wholly surprising, as the EP-GPC is at least very close to the state of the art; indeed it is interesting that the EP-GPC does not out-perform such a comparatively simple model. The EP-GPC uses the marginal likelihood as the model selection criterion, which gives the probability of the data given the assumptions of the model (Rasmussen and Williams, 2006). Cross-validation based approaches, on the other hand, provide an estimate of generalisation performance that does not depend on the model assumptions, and so may be more robust against model mis-specification (Wahba, 1990). The no-free-lunch theorems suggest that, at least in terms of generalisation performance, there is a lack of inherent superiority of one classification method over another in the absence of a priori assumptions regarding the data. This implies that if one classifier performs better than another on a particular data set, it is because the inductive biases of that classifier provide a better fit to the particular pattern recognition task, rather than because of its superiority in a more general sense. A model with strong inductive biases is likely to benefit when these biases are well suited to the data, but will perform badly when they are not. While a model with weak inductive biases will be more robust, it is less likely to perform conspicuously well on any given data set. This means there are complementary advantages and disadvantages to both approaches.

4.1 Relationship to Existing Work

The use of a prior over the hyper-parameters is in accordance with normal Bayesian practice and has been used in Gaussian process classification (Williams and Barber, 1998). The problem of over-fitting in model selection has also been addressed by Qi et al. (2004), in the case of selecting informative features for a logistic regression model using an Automatic Relevance Determination (ARD) prior (cf. Tipping, 2001). In this case, the Expectation Propagation method (Minka, 2001) is used to obtain a deterministic approximation of the posterior, and also (as a by-product) a leave-one-out performance estimate. The latter is then used to implement a form of early stopping (e.g., Sarle, 1995) to prevent the over-fitting that results from direct optimisation of the marginal likelihood until convergence. It seems likely that this approach would also be beneficial in the case of tuning the hyper-parameters of the covariance function of a Gaussian process model, using either the leave-one-out estimate arising from the EP approximation, or an approximate leave-one-out estimate from the Laplace approximation (cf. Cawley and Talbot, 2007).

4.2 Directions for Further Research

In this paper, the regularisation term corresponds to a simple spherical Gaussian prior over the kernel parameters. One direction of research would be to investigate alternative regularisation terms.

The first possibility would be to use a regularisation term corresponding to a separable Laplace prior,

\[ \Omega(\theta) = \sum_{i=1}^{d} |\eta_i| \quad \Longrightarrow \quad p(\theta) = \prod_{i=1}^{d} \frac{\xi}{2} \exp\{-\xi |\eta_i|\}. \]

Setting the derivative of the regularised model selection criterion (14) to zero, we obtain

\[ \left| \frac{\partial Q}{\partial \eta_i} \right| = \frac{\xi}{\zeta} \;\; \text{if} \;\; \eta_i > 0 \qquad \text{and} \qquad \left| \frac{\partial Q}{\partial \eta_i} \right| < \frac{\xi}{\zeta} \;\; \text{if} \;\; \eta_i = 0, \]

which implies that if the sensitivity of the leave-one-out error, $Q(\theta)$, falls below $\xi/\zeta$, the value of the hyper-parameter, $\eta_i$, will be set exactly to zero, effectively pruning that input from the model. In this way explicit feature selection may be obtained as a consequence of (regularised) model selection. The model selection criterion with Bayesian regularisation then becomes

\[ L = \frac{\ell}{2} \log Q(\theta) + N \log \Omega(\theta), \]

where $N$ is the number of input features with non-zero scale factors. This potentially overcomes the propensity towards under-fitting the data that might be expected when using the Gaussian prior, as the pruning action of the Laplace prior means that the values of all remaining hyper-parameters are well-determined by the data. In the case of the Laplace prior, the integrate-out approach is exactly equivalent to continuous updates of the hyper-parameters according to the update formulae under the evidence framework (Williams, 1995).

Alternatively, defining a prior over the function implemented by a model seems more in accordance with Bayesian ideas than choosing a prior over the parameters of the model. The use of a prior over the hyper-parameters based on the smoothness of the resulting model therefore also provides a potential direction for future research. In this case, the regularisation term might take the form

\[ \Omega(\theta) = \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{d} \left[ \frac{\partial^2 \hat{y}_i}{\partial x_{ij}^2} \right]^2, \]

directly penalising models with excess curvature. This regularisation term corresponds to curvature-driven smoothing in multi-layer perceptron networks (Bishop, 1993), except that the model output $\hat{y}$ is viewed as a function of the hyper-parameters, rather than of the weights. A penalty term based on the first-order partial derivatives is also feasible (cf. Drucker and Le Cun, 1992).
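A minimal sketch of the Laplace-prior criterion proposed above (not from the paper; Q is assumed to have been computed as in the earlier sketches, and the tolerance used to count active features is an assumption):

```python
import numpy as np

def laplace_criterion(Q, eta, ell, tol=1e-12):
    """L = (ell/2) log Q + N log Omega, with Omega = sum(|eta_i|) and N the
    number of non-zero scale factors. A practical optimiser would need to
    handle the kink at eta_i = 0, e.g., by sub-gradient or coordinate-wise
    updates."""
    N = int(np.count_nonzero(np.abs(eta) > tol))
    Omega = np.sum(np.abs(eta))
    return 0.5 * ell * np.log(Q) + N * np.log(Omega)
```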

5. Conclusion

Leave-one-out cross-validation has proved to be an effective means of model selection for a variety of kernel learning methods, provided the number of hyper-parameters to be tuned is relatively small. The use of kernel functions with large numbers of parameters often provides sufficient degrees of freedom to over-fit the model selection criterion, leading to poor generalisation. In this paper, we have proposed the use of regularisation at the second level of inference, that is, model selection. The use of Bayesian regularisation is shown to be effective in reducing over-fitting, by ensuring the values of the kernel parameters remain small, giving a smoother kernel and hence a less complex classifier. This is achieved with only a minimal computational expense, as the additional regularisation parameters are integrated out analytically using a reference prior. While a fully Bayesian model selection strategy is conceptually more elegant, it may also be less robust to model mis-specification. The use of leave-one-out cross-validation in model selection and Bayesian methods at the next level seems to be a pragmatic compromise. The effectiveness of this approach is clearly demonstrated in the experimental evaluation where, on average, the LS-SVM with Bayesian regularisation out-performs the expectation-propagation based Gaussian process classifier, using both spherical and elliptical RBF kernels.

Acknowledgments

The authors would like to thank the organisers of the WCCI model selection workshop and performance prediction challenge and the NIPS multi-level inference workshop and model selection game, and fellow participants for the stimulating discussions that have helped to shape this work. We also thank Carl Rasmussen and Chris Williams for their advice regarding the EP-GPC, and the anonymous reviewers for their detailed and constructive comments that have significantly improved this paper.

References

D. M. Allen. The relationship between variable selection and prediction. Technometrics, 16:125-127, 1974.

E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM Press, third edition, 1999.

C. M. Bishop. Curvature-driven smoothing: a learning algorithm for feedforward networks. IEEE Transactions on Neural Networks, 4(5), September 1993.

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

L. Bo, L. Wang, and L. Jiao. Feature scaling for kernel Fisher discriminant analysis using leave-one-out cross-validation. Neural Computation, 18:961-978, April 2006.

W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems, 5, 1991.

G. C. Cawley. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-2006), Vancouver, BC, Canada, July 2006.

G. C. Cawley and N. L. C. Talbot. Efficient leave-one-out cross-validation of kernel Fisher discriminant classifiers. Pattern Recognition, 36(11), November 2003.

G. C. Cawley and N. L. C. Talbot. Fast leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10), December 2004.

G. C. Cawley and N. L. C. Talbot. Approximate leave-one-out cross-validation for kernel logistic regression. Machine Learning (submitted), 2007.

O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131-159, 2002.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.

H. Drucker and Y. Le Cun. Improving generalization performance using double back-propagation. IEEE Transactions on Neural Networks, 3(6):991-997, 1992.

S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, third edition, 1996.

I. S. Gradshteyn and I. M. Ryzhik. Table of Integrals, Series and Products. Academic Press, fifth edition, 1994.

I. Guyon, A. R. Saffari Azar Alamdari, G. Dror, and J. Buhmann. Performance prediction challenge. In Proceedings of the International Joint Conference on Neural Networks (IJCNN-2006), Vancouver, BC, Canada, July 2006.

T. Joachims. Learning to Classify Text using Support Vector Machines - Methods, Theory and Algorithms. Kluwer Academic Publishers, 2002.

S. S. Keerthi, K. B. Duan, S. K. Shevade, and A. N. Poo. A fast dual algorithm for kernel logistic regression. Machine Learning, 61(1-3):151-165, November 2005.

G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95, 1971.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Conference on Artificial Intelligence (IJCAI), pages 1137-1143, San Mateo, CA, 1995. Morgan Kaufmann.

P. A. Lachenbruch and M. R. Mickey. Estimation of error rates in discriminant analysis. Technometrics, 10(1):1-11, February 1968.

A. Luntz and V. Brailovsky. On estimation of characters obtained in statistical procedure of recognition (in Russian). Techicheskaya Kibernetica, 3, 1969.

D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992a.

D. J. C. MacKay. A practical Bayesian framework for backprop networks. Neural Computation, 4(3), 1992b.

D. J. C. MacKay. The evidence framework applied to classification networks. Neural Computation, 4(5), 1992c.

D. J. C. MacKay. Hyperparameters: optimise or integrate out? In G. Heidbreder, editor, Maximum Entropy and Bayesian Methods. Kluwer, 1994.

J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society of London, A, 209:415-446, 1909.

C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11-22, 1986.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing, volume IX, pages 41-48. IEEE Press, New York, 1999.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. J. Smola, and K.-R. Müller. Invariant feature extraction and classification in feature spaces. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 2000.

T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence, pages 362-369. Morgan Kaufmann, 2001.

J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1965.

Y. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani. Predictive automatic relevance determination by expectation propagation. In Proceedings of the 21st International Conference on Machine Learning, Banff, Alberta, Canada, July 2004.

C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.

K. Saadi, N. L. C. Talbot, and G. C. Cawley. Optimally regularised kernel Fisher discriminant analysis. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR-2004), volume 2, Cambridge, United Kingdom, August 2004.

W. S. Sarle. Stopped training and other remedies for overfitting. In Proceedings of the 27th Symposium on the Interface of Computer Science and Statistics, Pittsburgh, PA, USA, June 1995.

B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.

T. Seaks. SYMINV: an algorithm for the inversion of a positive definite matrix by the Cholesky decomposition. Econometrica, 40(5):961-962, September 1972.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, B 36(1):111-147, 1974.

S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103-1118, May 2001.

J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3), June 1999.

J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.

A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. John Wiley, New York, 1977.

M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, June 2001.

G. Wahba. Spline Models for Observational Data. SIAM Press, Philadelphia, PA, 1990.

C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, December 1998.

P. M. Williams. A Marquardt algorithm for choosing the step size in backpropagation learning with conjugate gradients. Technical Report CSRP-229, University of Sussex, February 1991.

P. M. Williams. Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1):117-143, 1995.


More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012

MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012 MLE and Bayesan Estmaton Je Tang Department of Computer Scence & Technology Tsnghua Unversty 01 1 Lnear Regresson? As the frst step, we need to decde how we re gong to represent the functon f. One example:

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

Conjugacy and the Exponential Family

Conjugacy and the Exponential Family CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the

More information

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION. Machine Learning

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION. Machine Learning CHAPTER 3 GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION Machne Learnng Copyrght c 205. Tom M. Mtche. A rghts reserved. *DRAFT OF September 23, 207* *PLEASE DO NOT DISTRIBUTE

More information

Short-Term Load Forecasting for Electric Power Systems Using the PSO-SVR and FCM Clustering Techniques

Short-Term Load Forecasting for Electric Power Systems Using the PSO-SVR and FCM Clustering Techniques Energes 20, 4, 73-84; do:0.3390/en40073 Artce OPEN ACCESS energes ISSN 996-073 www.mdp.com/journa/energes Short-Term Load Forecastng for Eectrc Power Systems Usng the PSO-SVR and FCM Custerng Technques

More information

Lower Bounding Procedures for the Single Allocation Hub Location Problem

Lower Bounding Procedures for the Single Allocation Hub Location Problem Lower Boundng Procedures for the Snge Aocaton Hub Locaton Probem Borzou Rostam 1,2 Chrstoph Buchhem 1,4 Fautät für Mathemat, TU Dortmund, Germany J. Faban Meer 1,3 Uwe Causen 1 Insttute of Transport Logstcs,

More information

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements CS 750 Machne Learnng Lecture 5 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square CS 750 Machne Learnng Announcements Homework Due on Wednesday before the class Reports: hand n before

More information

Lecture 3: Dual problems and Kernels

Lecture 3: Dual problems and Kernels Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM

More information

Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II Lecture 10 Support Vector Machnes II 22 February 2016 Taylor B. Arnold Yale Statstcs STAT 365/665 1/28 Notes: Problem 3 s posted and due ths upcomng Frday There was an early bug n the fake-test data; fxed

More information

Delay tomography for large scale networks

Delay tomography for large scale networks Deay tomography for arge scae networks MENG-FU SHIH ALFRED O. HERO III Communcatons and Sgna Processng Laboratory Eectrca Engneerng and Computer Scence Department Unversty of Mchgan, 30 Bea. Ave., Ann

More information

WAVELET-BASED IMAGE COMPRESSION USING SUPPORT VECTOR MACHINE LEARNING AND ENCODING TECHNIQUES

WAVELET-BASED IMAGE COMPRESSION USING SUPPORT VECTOR MACHINE LEARNING AND ENCODING TECHNIQUES WAVELE-BASED IMAGE COMPRESSION USING SUPPOR VECOR MACHINE LEARNING AND ENCODING ECHNIQUES Rakb Ahmed Gppsand Schoo of Computng and Informaton echnoogy Monash Unversty, Gppsand Campus Austraa. Rakb.Ahmed@nfotech.monash.edu.au

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2

Note 2. Ling fong Li. 1 Klein Gordon Equation Probablity interpretation Solutions to Klein-Gordon Equation... 2 Note 2 Lng fong L Contents Ken Gordon Equaton. Probabty nterpretaton......................................2 Soutons to Ken-Gordon Equaton............................... 2 2 Drac Equaton 3 2. Probabty nterpretaton.....................................

More information

Uncertainty Specification and Propagation for Loss Estimation Using FOSM Methods

Uncertainty Specification and Propagation for Loss Estimation Using FOSM Methods Uncertanty Specfcaton and Propagaton for Loss Estmaton Usng FOSM Methods J.W. Baer and C.A. Corne Dept. of Cv and Envronmenta Engneerng, Stanford Unversty, Stanford, CA 94305-400 Keywords: Sesmc, oss estmaton,

More information

3. Stress-strain relationships of a composite layer

3. Stress-strain relationships of a composite layer OM PO I O U P U N I V I Y O F W N ompostes ourse 8-9 Unversty of wente ng. &ech... tress-stran reatonshps of a composte ayer - Laurent Warnet & emo Aerman.. tress-stran reatonshps of a composte ayer Introducton

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

9 Adaptive Soft K-Nearest-Neighbour Classifiers with Large Margin

9 Adaptive Soft K-Nearest-Neighbour Classifiers with Large Margin 9 Adaptve Soft -Nearest-Neghbour Cassfers wth Large argn Abstract- A nove cassfer s ntroduced to overcome the mtatons of the -NN cassfcaton systems. It estmates the posteror cass probabtes usng a oca Parzen

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography

CSci 6974 and ECSE 6966 Math. Tech. for Vision, Graphics and Robotics Lecture 21, April 17, 2006 Estimating A Plane Homography CSc 6974 and ECSE 6966 Math. Tech. for Vson, Graphcs and Robotcs Lecture 21, Aprl 17, 2006 Estmatng A Plane Homography Overvew We contnue wth a dscusson of the major ssues, usng estmaton of plane projectve

More information

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA RESEARCH ARTICLE MOELING FIXE OS BETTING FOR FUTURE EVENT PREICTION Weyun Chen eartment of Educatona Informaton Technoogy, Facuty of Educaton, East Chna Norma Unversty, Shangha, CHINA {weyun.chen@qq.com}

More information

Optimum Selection Combining for M-QAM on Fading Channels

Optimum Selection Combining for M-QAM on Fading Channels Optmum Seecton Combnng for M-QAM on Fadng Channes M. Surendra Raju, Ramesh Annavajjaa and A. Chockangam Insca Semconductors Inda Pvt. Ltd, Bangaore-56000, Inda Department of ECE, Unversty of Caforna, San

More information

Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support

More information

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION

ON AUTOMATIC CONTINUITY OF DERIVATIONS FOR BANACH ALGEBRAS WITH INVOLUTION European Journa of Mathematcs and Computer Scence Vo. No. 1, 2017 ON AUTOMATC CONTNUTY OF DERVATONS FOR BANACH ALGEBRAS WTH NVOLUTON Mohamed BELAM & Youssef T DL MATC Laboratory Hassan Unversty MORO CCO

More information

1 Convex Optimization

1 Convex Optimization Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,

More information

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y) Secton 1.5 Correlaton In the prevous sectons, we looked at regresson and the value r was a measurement of how much of the varaton n y can be attrbuted to the lnear relatonshp between y and x. In ths secton,

More information

Characterizing Probability-based Uniform Sampling for Surrogate Modeling

Characterizing Probability-based Uniform Sampling for Surrogate Modeling th Word Congress on Structura and Mutdscpnary Optmzaton May 9-4, 3, Orando, Forda, USA Characterzng Probabty-based Unform Sampng for Surrogate Modeng Junqang Zhang, Souma Chowdhury, Ache Messac 3 Syracuse

More information

Feature Selection: Part 1

Feature Selection: Part 1 CSE 546: Machne Learnng Lecture 5 Feature Selecton: Part 1 Instructor: Sham Kakade 1 Regresson n the hgh dmensonal settng How do we learn when the number of features d s greater than the sample sze n?

More information

Unified Subspace Analysis for Face Recognition

Unified Subspace Analysis for Face Recognition Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA

More information

Chapter 9: Statistical Inference and the Relationship between Two Variables

Chapter 9: Statistical Inference and the Relationship between Two Variables Chapter 9: Statstcal Inference and the Relatonshp between Two Varables Key Words The Regresson Model The Sample Regresson Equaton The Pearson Correlaton Coeffcent Learnng Outcomes After studyng ths chapter,

More information

Non-linear Canonical Correlation Analysis Using a RBF Network

Non-linear Canonical Correlation Analysis Using a RBF Network ESANN' proceedngs - European Smposum on Artfcal Neural Networks Bruges (Belgum), 4-6 Aprl, d-sde publ., ISBN -97--, pp. 57-5 Non-lnear Canoncal Correlaton Analss Usng a RBF Network Sukhbnder Kumar, Elane

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors Stat60: Bayesan Modelng and Inference Lecture Date: February, 00 Reference Prors Lecturer: Mchael I. Jordan Scrbe: Steven Troxler and Wayne Lee In ths lecture, we assume that θ R; n hgher-dmensons, reference

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Cyclic Codes BCH Codes

Cyclic Codes BCH Codes Cycc Codes BCH Codes Gaos Feds GF m A Gaos fed of m eements can be obtaned usng the symbos 0,, á, and the eements beng 0,, á, á, á 3 m,... so that fed F* s cosed under mutpcaton wth m eements. The operator

More information

Sparse Gaussian Processes Using Backward Elimination

Sparse Gaussian Processes Using Backward Elimination Sparse Gaussan Processes Usng Backward Elmnaton Lefeng Bo, Lng Wang, and Lcheng Jao Insttute of Intellgent Informaton Processng and Natonal Key Laboratory for Radar Sgnal Processng, Xdan Unversty, X an

More information

Maxent Models & Deep Learning

Maxent Models & Deep Learning Maxent Models & Deep Learnng 1. Last bts of maxent (sequence) models 1.MEMMs vs. CRFs 2.Smoothng/regularzaton n maxent models 2. Deep Learnng 1. What s t? Why s t good? (Part 1) 2. From logstc regresson

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

IV. Performance Optimization

IV. Performance Optimization IV. Performance Optmzaton A. Steepest descent algorthm defnton how to set up bounds on learnng rate mnmzaton n a lne (varyng learnng rate) momentum learnng examples B. Newton s method defnton Gauss-Newton

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Predicting Model of Traffic Volume Based on Grey-Markov

Predicting Model of Traffic Volume Based on Grey-Markov Vo. No. Modern Apped Scence Predctng Mode of Traffc Voume Based on Grey-Marov Ynpeng Zhang Zhengzhou Muncpa Engneerng Desgn & Research Insttute Zhengzhou 5005 Chna Abstract Grey-marov forecastng mode of

More information

NONLINEAR SYSTEM IDENTIFICATION BASE ON FW-LSSVM

NONLINEAR SYSTEM IDENTIFICATION BASE ON FW-LSSVM Journa of heoretca and Apped Informaton echnoogy th February 3. Vo. 48 No. 5-3 JAI & LLS. A rghts reserved. ISSN: 99-8645 www.jatt.org E-ISSN: 87-395 NONLINEAR SYSEM IDENIFICAION BASE ON FW-LSSVM, XIANFANG

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry Quantum Runge-Lenz ector and the Hydrogen Atom, the hdden SO(4) symmetry Pasca Szrftgser and Edgardo S. Cheb-Terrab () Laboratore PhLAM, UMR CNRS 85, Unversté Le, F-59655, France () Mapesoft Let's consder

More information

MODEL TUNING WITH THE USE OF HEURISTIC-FREE GMDH (GROUP METHOD OF DATA HANDLING) NETWORKS

MODEL TUNING WITH THE USE OF HEURISTIC-FREE GMDH (GROUP METHOD OF DATA HANDLING) NETWORKS MODEL TUNING WITH THE USE OF HEURISTIC-FREE (GROUP METHOD OF DATA HANDLING) NETWORKS M.C. Schrver (), E.J.H. Kerchoffs (), P.J. Water (), K.D. Saman () () Rswaterstaat Drecte Zeeand () Deft Unversty of

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Analysis of CMPP Approach in Modeling Broadband Traffic

Analysis of CMPP Approach in Modeling Broadband Traffic Anayss of Approach n Modeng Broadband Traffc R.G. Garroppo, S. Gordano, S. Lucett, and M. Pagano Department of Informaton Engneerng, Unversty of Psa Va Dotsav - 566 Psa - Itay {r.garroppo, s.gordano, s.ucett,

More information

APPENDIX A Some Linear Algebra

APPENDIX A Some Linear Algebra APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)

Some Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS) Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

Gaussian process classification: a message-passing viewpoint

Gaussian process classification: a message-passing viewpoint Gaussan process classfcaton: a message-passng vewpont Flpe Rodrgues fmpr@de.uc.pt November 014 Abstract The goal of ths short paper s to provde a message-passng vewpont of the Expectaton Propagaton EP

More information

A General Column Generation Algorithm Applied to System Reliability Optimization Problems

A General Column Generation Algorithm Applied to System Reliability Optimization Problems A Genera Coumn Generaton Agorthm Apped to System Reabty Optmzaton Probems Lea Za, Davd W. Cot, Department of Industra and Systems Engneerng, Rutgers Unversty, Pscataway, J 08854, USA Abstract A genera

More information

D hh ν. Four-body charm semileptonic decay. Jim Wiss University of Illinois

D hh ν. Four-body charm semileptonic decay. Jim Wiss University of Illinois Four-body charm semeptonc decay Jm Wss Unversty of Inos D hh ν 1 1. ector domnance. Expected decay ntensty 3. SU(3) apped to D s φν 4. Anaytc forms for form factors 5. Non-parametrc form factors 6. Future

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Errors for Linear Systems

Errors for Linear Systems Errors for Lnear Systems When we solve a lnear system Ax b we often do not know A and b exactly, but have only approxmatons  and ˆb avalable. Then the best thng we can do s to solve ˆx ˆb exactly whch

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute

More information

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH

DETERMINATION OF UNCERTAINTY ASSOCIATED WITH QUANTIZATION ERRORS USING THE BAYESIAN APPROACH Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata Proceedngs, XVII IMEKO World Congress, June 7, 3, Dubrovn, Croata TC XVII IMEKO World Congress Metrology n the 3rd Mllennum June 7, 3,

More information

Some results on a cross-section in the tensor bundle

Some results on a cross-section in the tensor bundle Hacettepe Journa of Matematcs and Statstcs Voume 43 3 214, 391 397 Some resuts on a cross-secton n te tensor bunde ydın Gezer and Murat tunbas bstract Te present paper s devoted to some resuts concernng

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information