Sparse Gaussian Processes Using Backward Elimination
Liefeng Bo, Ling Wang, and Licheng Jiao

Institute of Intelligent Information Processing and National Key Laboratory for Radar Signal Processing, Xidian University, Xi'an, China
{blf018, wlp}@163.com

Abstract. Gaussian Processes (GPs) have state-of-the-art performance in regression. In GPs, all the basis functions are required for prediction; hence their test speed is slower than that of other learning algorithms such as support vector machines (SVMs), the relevance vector machine (RVM), adaptive sparseness (AS), etc. To overcome this limitation, we present a backward elimination algorithm, called GPs-BE, that recursively selects the basis functions for GPs until some stop criterion is satisfied. By integrating rank-1 updates, GPs-BE can be implemented at a reasonable cost. Extensive empirical comparisons confirm the feasibility and validity of the proposed algorithm.

1 Introduction

Covariance functions have a great effect on the performance of GPs. The experiments performed by Williams [1] and Rasmussen [2] have shown that the following covariance function works well in practice:

$$C(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\sum_{p=1}^{d} \theta_p \left(x_i^p - x_j^p\right)^2\right) \qquad (1.1)$$

where $\theta_p$ is a scaling factor. If some variable is unimportant or irrelevant for regression, the associated scaling factor will be made small; otherwise it will be made large. The key advantage of GPs is that the hyperparameters of the covariance function can be optimized by maximizing the evidence. This is not the case in other kernel-based learning methods such as support vector machines (SVMs) [3]. In SVMs, an extra model selection criterion, e.g. the cross-validation score, is required for choosing hyperparameters, which is intractable when a large number of hyperparameters are involved.
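For instance, the covariance (1.1) can be evaluated for a batch of inputs as follows. This is an illustrative sketch of ours, not code from the paper; the function name and the numeric values are assumptions:

```python
import numpy as np

def ard_covariance(X1, X2, theta):
    """Covariance (1.1): C(x_i, x_j) = exp(-sum_p theta_p (x_i^p - x_j^p)^2),
    with one non-negative scaling factor theta_p per input dimension."""
    diff = X1[:, None, :] - X2[None, :, :]          # pairwise per-dimension differences
    return np.exp(-(diff ** 2 * theta).sum(axis=-1))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
theta = np.array([1.0, 0.01])   # second variable treated as nearly irrelevant
K = ard_covariance(X, X, theta)
```

With a small scaling factor on the second variable, points that differ only in that variable stay highly correlated (K[0, 2] = exp(-0.01) ≈ 0.99), while the same difference in the first variable decorrelates them (K[0, 1] = exp(-1) ≈ 0.37) — which is how evidence maximization can effectively prune irrelevant inputs.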
Though GPs are very successful, they also have some shortcomings: (1) the computational cost of GPs is $O(l^3)$, where $l$ is the number of training samples, which seems to prohibit the application of GPs to large datasets; (2) all the basis functions are required for prediction, hence the test speed is slower than that of other learning algorithms such as SVMs, the relevance vector machine (RVM) [4], adaptive sparseness (AS) [5], etc. Some researchers have tried to deal with these shortcomings. In 2000, Smola et al. [6] presented sparse greedy Gaussian processes (SGGPs), whose computational

J. Wang et al. (Eds.): ISNN 2006, LNCS 3971, pp. 1083-1088, 2006. Springer-Verlag Berlin Heidelberg 2006
cost is $O(kn^2 l)$, where $n$ is the number of basis functions and $k$ is a constant factor. In 2002, Csató et al. also proposed sparse on-line Gaussian processes (SOGPs) [7], which achieve good sparseness and low complexity simultaneously. However, both SGGPs and SOGPs throw away the key advantage of GPs; as a result, they have difficulties in handling the hyperparameters.

This paper focuses on the second shortcoming of GPs above. We propose a backward elimination algorithm (GPs-BE) that recursively removes the basis function with the smallest leave-one-out score at the current step until some stop criterion is satisfied. GPs-BE has a reasonable computational complexity thanks to a rank-1 update formula. GPs-BE is performed after the GP is trained; hence all the advantages of GPs are preserved. Extensive empirical comparisons show that our method greatly reduces the number of basis functions of GPs almost without sacrificing performance.

2 Gaussian Processes

Let $Z = \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ be a set of $l$ empirical samples drawn from

$$y_i = f(\mathbf{x}_i, \mathbf{w}) + \varepsilon_i, \quad i = 1, \ldots, l \qquad (2.1)$$

where the $\varepsilon_i$ are independent samples from some noise process, further assumed to be zero-mean Gaussian with variance $\sigma^2$. We further assume

$$f(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{l} w_i C(\mathbf{x}, \mathbf{x}_i) \qquad (2.2)$$

According to Bayesian inference, the posterior probability of $\mathbf{w}$ can be expressed as

$$P(\mathbf{w} \mid Z) = \frac{P(Z \mid \mathbf{w})\, P(\mathbf{w})}{P(Z)} \qquad (2.3)$$

Maximizing the log-posterior is equivalent to minimizing the following objective function

$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} L(\mathbf{w}) = \arg\min_{\mathbf{w}} \left[ \mathbf{w}^T \left( \mathbf{C}^T \mathbf{C} + \sigma^2 \mathbf{I} \right) \mathbf{w} - 2\, \mathbf{w}^T \mathbf{C}^T \mathbf{y} \right] \qquad (2.4)$$

where $\mathbf{I}$ is the identity matrix. Hyperparameters are chosen by maximizing the following evidence

$$P(Z \mid \boldsymbol{\theta}, \sigma^2) = (2\pi)^{-l/2} \left| \sigma^2 \mathbf{I} + \mathbf{C}\mathbf{C}^T \right|^{-1/2} \exp\left( -\tfrac{1}{2}\, \mathbf{y}^T \left( \sigma^2 \mathbf{I} + \mathbf{C}\mathbf{C}^T \right)^{-1} \mathbf{y} \right) \qquad (2.5)$$

In related Bayesian models this quantity is known as the marginal likelihood, and its maximization as the type-II maximum likelihood method [8]. Williams [9] has demonstrated that this model is equivalent to Gaussian Processes (GPs) with the covariance $\left( \sigma^2 \mathbf{I} + \mathbf{C}\mathbf{C}^T \right)$; hence we call it GPs in this paper.
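As a concrete illustration of (2.2) and (2.4), the MAP weights are obtained by solving a regularized linear system in the kernel basis. This is our sketch with made-up data and hyperparameter values, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 1))           # 30 one-dimensional inputs
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.standard_normal(30)

theta = np.array([5.0])                             # scaling factor in (1.1), chosen by hand
sigma2 = 0.05 ** 2                                  # assumed noise variance

diff = X[:, None, :] - X[None, :, :]
C = np.exp(-(diff ** 2 * theta).sum(axis=-1))       # l x l matrix of basis functions C(x, x_i)

# Minimizing (2.4) gives w = (C^T C + sigma^2 I)^{-1} C^T y
w = np.linalg.solve(C.T @ C + sigma2 * np.eye(len(X)), C.T @ y)
pred = C @ w                                        # f(x, w) from (2.2) at the training inputs
```

Note that every training point contributes one basis function, which is exactly the test-time cost that GPs-BE sets out to reduce.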
3 Backward Elimination for Gaussian Processes

In GPs, all the basis functions are used for prediction; therefore GPs are inferior to neural networks, SVMs and the RVM in testing speed, which seems to prohibit their application in some fields. Here, GPs-BE is proposed to overcome this problem: it selects the basis functions by a backward elimination technique after the training procedure. GPs-BE is a backward greedy algorithm that recursively removes the basis function with the smallest leave-one-out score at the current step until some stop criterion is satisfied.

For convenience of derivation, we reformulate (2.4) as

$$\mathbf{w} = \mathbf{H}^{-1} \mathbf{b} \qquad (3.1)$$

where $\mathbf{H} = \mathbf{C}^T \mathbf{C} + \sigma^2 \mathbf{I}$ and $\mathbf{b} = \mathbf{C}^T \mathbf{y}$. Let $\Delta f^{(k)}$ be the increment of $L$ with the $k$-th basis function deleted; then the following theorem holds true.

Theorem 3.1: $\Delta f^{(k)} = w_k^2 / R_{kk}$, where $\mathbf{R} = \mathbf{H}^{-1}$ and $R_{kk}$ denotes the $k$-th diagonal element of $\mathbf{H}^{-1}$.

We call $\Delta f^{(k)}$ the leave-one-out score. At each step we remove the basis function with the smallest leave-one-out score. The index of the basis function to be deleted is obtained by

$$s = \arg\min_{k \in P} \left( \Delta f^{(k)} \right) \qquad (3.2)$$

where $P$ is the set of indices of the remaining basis functions. Note that the $(l+1)$-th variable, i.e. the bias, is preserved during the backward elimination process. When one basis function is deleted, we need to update the matrix $\mathbf{R}$ and the vector $\mathbf{w}$. In terms of a rank-1 update, $\mathbf{R}$ and $\mathbf{w}$ can be formulated as

$$(\mathbf{R}')_{ij} = R_{ij} - \frac{R_{is} R_{sj}}{R_{ss}}, \quad i, j \neq s, \qquad (3.3)$$

$$(\mathbf{w}')_i = \sum_{j \neq s} \left( R_{ij} - \frac{R_{is} R_{sj}}{R_{ss}} \right) b_j, \quad i \neq s. \qquad (3.4)$$

Together with $\mathbf{w} = \mathbf{R}\mathbf{b}$, (3.4) simplifies to

$$(\mathbf{w}')_i = w_i - \frac{R_{is}}{R_{ss}}\, w_s, \quad i \neq s. \qquad (3.5)$$

Suppose that $\Delta_t$ is the increment of $f$ at the $t$-th iteration; then we terminate the backward elimination procedure if

$$\Delta_t \geq \varepsilon f \qquad (3.6)$$

where $\varepsilon$ is a small positive threshold. The detailed backward elimination procedure is summarized in Fig. 3.1.
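Theorem 3.1 and the updates (3.3)-(3.5) lend themselves to a direct numerical check, and together they yield the whole elimination loop. The sketch below is ours, not the authors' code: it uses a random positive definite matrix in place of $\mathbf{C}^T\mathbf{C} + \sigma^2\mathbf{I}$, verifies the leave-one-out score against brute-force refitting, and then runs the backward pass; the tolerance `tol` is a placeholder for the stop criterion (3.6), whose exact threshold is not reproduced here:

```python
import numpy as np

def loo_scores(R, w):
    # Theorem 3.1: deleting basis function k raises the minimum of L by w_k^2 / R_kk
    return w ** 2 / np.diag(R)

def eliminate(R, w, s):
    """Rank-1 downdate of R = H^{-1} and w = R b after deleting index s,
    following (3.3) and (3.5)."""
    keep = [i for i in range(len(w)) if i != s]
    w_new = w[keep] - R[keep, s] / R[s, s] * w[s]                               # (3.5)
    R_new = R[np.ix_(keep, keep)] - np.outer(R[keep, s], R[s, keep]) / R[s, s]  # (3.3)
    return R_new, w_new, keep

def gps_be(H, b, tol):
    """Backward elimination loop: repeatedly delete the basis function with
    the smallest leave-one-out score until the best candidate's score
    exceeds tol (a stand-in for stop criterion (3.6))."""
    R = np.linalg.inv(H)
    w = R @ b                       # (3.1)
    idx = list(range(len(b)))       # indices of the remaining basis functions
    while len(idx) > 1:
        scores = loo_scores(R, w)
        s = int(np.argmin(scores))  # (3.2)
        if scores[s] > tol:
            break
        R, w, keep = eliminate(R, w, s)
        idx = [idx[i] for i in keep]
    return idx, w

# --- numerical check of Theorem 3.1 against brute-force refitting ---
rng = np.random.default_rng(1)
l = 8
A = rng.standard_normal((l, l))
H = A @ A.T + 0.1 * np.eye(l)       # positive definite stand-in for C^T C + sigma^2 I
b = rng.standard_normal(l)
R, w = np.linalg.inv(H), np.linalg.solve(H, b)

def min_loss(Hs, bs):
    # minimum of w^T H w - 2 w^T b, attained at w = H^{-1} b; equals -b^T H^{-1} b
    return -bs @ np.linalg.solve(Hs, bs)

k = 3
keep = [i for i in range(l) if i != k]
delta_brute = min_loss(H[np.ix_(keep, keep)], b[keep]) - min_loss(H, b)
assert np.isclose(delta_brute, loo_scores(R, w)[k])

# --- run the full backward pass ---
idx, w_final = gps_be(H, b, tol=np.inf)   # np.inf: eliminate down to one basis function
assert len(idx) == 1
# the downdated w still matches a direct refit on the surviving subset
assert np.allclose(w_final, np.linalg.solve(H[np.ix_(idx, idx)], b[idx]))
```

The rank-1 downdate is what keeps the cost reasonable: once $\mathbf{R} = \mathbf{H}^{-1}$ is available, each deletion costs $O(m^2)$ for $m$ remaining basis functions instead of a fresh $O(m^3)$ inversion.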
Algorithm 1: GPs-BE
1. Compute the index of the basis function to be removed by (3.2);
2. Update the matrix R and the vector w by (3.3) and (3.5);
3. Remove the index resulting from Step 1;
4. If (3.6) is satisfied, stop; otherwise, go to Step 1.

Fig. 3.1. Flow chart of backward elimination

4 Empirical Study

In order to evaluate the performance of GPs-BE, we compare it with GPs, GPs-U, SVMs, RVM and AS on four benchmark datasets, i.e. Friedman1 [10], Boston Housing, Abalone and Computer Activity [11]. GPs-U denotes GPs whose covariance function uses the same scaling factor for all variables. Before the experiments, all the training data are scaled to [-1, 1] and the testing data are adjusted using the same linear transformation. For the Friedman1 and Boston Housing datasets, the results are averaged over 100 random splits of the full datasets. For the Abalone and Computer Activity datasets, the results are averaged over 10 random splits of the mother datasets. The free parameters in GPs, GPs-BE and GPs-U are optimized by maximizing the evidence. The free parameters in RVM, SVMs and AS are selected by a 10-fold cross-validation procedure.

Table 4.1. Characteristics of the four benchmark datasets

Abbr.  Problem            Attributes  Total Size  Training Size  Testing Size
FRI    Friedman1
BOH    Boston Housing
ABA    Abalone
COA    Computer Activity

Table 4.2. Mean of the testing errors of the six algorithms

Problem  GPs  GPs-BE  GPs-U  RVM  SVMs  AS
FRI
BOH
ABA
COA

Table 4.3. Mean of the number of basis functions of the six algorithms on the benchmark datasets

Problem  GPs  GPs-BE  GPs-U  RVM  SVMs  AS
FRI
BOH
ABA
COA
Table 4.4. Runtime of the six algorithms on the benchmark datasets

Problem  GPs  GPs-BE  GPs-U  RVM  SVMs  AS
FRI
BOH
ABA
COA

From Table 4.2 we see that GPs-BE and GPs obtain similar generalization performance and are significantly better than GPs-U, RVM, SVMs and AS on two of the regression tasks, i.e. Friedman1 and Computer Activity. On the remaining two tasks, all six approaches perform similarly. Since GPs-U is often superior to SGGPs and SOGPs in terms of generalization performance, GPs-BE is expected to generalize better than SGGPs and SOGPs. Table 4.3 shows that the number of basis functions of GPs-BE approaches that of RVM and AS, and is significantly smaller than that of GPs, GPs-U and SVMs. Table 4.4 shows that the runtime of GPs-BE approaches that of GPs, GPs-U and AS, and is significantly smaller than that of RVM and SVMs.

An alternative is to select the basis functions by the forward selection proposed in [12-13]. Table 4.5 compares our method with forward selection under the same stop criterion.

Table 4.5. Comparison of backward elimination and forward selection

Problem          Backward Elimination  Forward Selection
FRI
BOH
ABA
COA
Normalized Mean

Table 4.5 shows that backward elimination outperforms forward selection in both performance and number of basis functions under the same stop criterion. In summary, GPs-BE greatly reduces the number of basis functions of GPs almost without sacrificing performance or increasing the runtime. Moreover, GPs-BE is better than GPs-U in performance, which further indicates that GPs-BE performs better than SGGPs and SOGPs. GPs-BE is better than SVMs in all three aspects. GPs-BE is also better than RVM and AS in performance, with a similar number of basis functions and runtime. Finally, backward elimination outperforms forward selection under the same stop criterion.

5 Conclusion

This paper presents a backward elimination algorithm to select the basis functions for GPs.
By integrating rank-1 updates, we can implement GPs-BE at a reasonable cost. The results show that GPs-BE greatly reduces the number of basis functions of GPs
almost without sacrificing performance or increasing the runtime. Comparisons with forward selection show that GPs-BE obtains better performance and fewer basis functions under the same stop criterion.

This research is supported by the National Natural Science Foundation of China under grants … and …, and by National 973 Project grant 2001CB…

References

1. Williams, C. K. I., Rasmussen, C. E.: Gaussian Processes for Regression. Advances in Neural Information Processing Systems 8 (1996)
2. Rasmussen, C. E.: Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. Ph.D. thesis, Dept. of Computer Science, University of Toronto (1996)
3. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1995)
4. Tipping, M. E.: Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research 1 (2001)
5. Figueiredo, M. A. T.: Adaptive Sparseness for Supervised Learning. IEEE Trans. Pattern Analysis and Machine Intelligence 25 (2003)
6. Smola, A. J., Bartlett, P. L.: Sparse Greedy Gaussian Process Regression. Advances in Neural Information Processing Systems 13 (2000)
7. Csató, L., Opper, M.: Sparse On-line Gaussian Processes. Neural Computation 14 (2002)
8. Berger, J. O.: Statistical Decision Theory and Bayesian Analysis. 2nd edn. Springer (1985)
9. Williams, C. K. I.: Prediction with Gaussian Processes: from Linear Regression to Linear Prediction and Beyond. Learning and Inference in Graphical Models (1998)
10. Friedman, J. H.: Multivariate Adaptive Regression Splines. Annals of Statistics 19 (1991)
11. Blake, C. L., Merz, C. J.: UCI Repository of Machine Learning Databases. Technical Report, University of California, Department of Information and Computer Science, Irvine, CA (1998)
12. Chen, S., Cowan, C. F. N., Grant, P. M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Function Networks. IEEE Trans. Neural Networks 2 (1991)
13. Bo, L. F., Wang, L., Jiao, L. C.: Sparse Bayesian Learning Based on an Efficient Subset Selection. Lecture Notes in Computer Science 3173 (2004)
More informationMAXIMUM A POSTERIORI TRANSDUCTION
MAXIMUM A POSTERIORI TRANSDUCTION LI-WEI WANG, JU-FU FENG School of Mathematcal Scences, Peng Unversty, Bejng, 0087, Chna Center for Informaton Scences, Peng Unversty, Bejng, 0087, Chna E-MIAL: {wanglw,
More information1 Motivation and Introduction
Instructor: Dr. Volkan Cevher EXPECTATION PROPAGATION September 30, 2008 Rce Unversty STAT 63 / ELEC 633: Graphcal Models Scrbes: Ahmad Beram Andrew Waters Matthew Nokleby Index terms: Approxmate nference,
More informationEEE 241: Linear Systems
EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they
More informationImprove Multi-Instance Neural Networks through Feature Selection
Improve Mult-Instance Neural Networks through Feature Selecton Mn-Lng Zhang and Zh-Hua Zhou* State Key Laboratory for Novel Software Technology, Nanjng Unversty, Nanjng 210093, Chna Abstract. Mult-nstance
More informationMarkov Chain Monte Carlo (MCMC), Gibbs Sampling, Metropolis Algorithms, and Simulated Annealing Bioinformatics Course Supplement
Markov Chan Monte Carlo MCMC, Gbbs Samplng, Metropols Algorthms, and Smulated Annealng 2001 Bonformatcs Course Supplement SNU Bontellgence Lab http://bsnuackr/ Outlne! Markov Chan Monte Carlo MCMC! Metropols-Hastngs
More informationRegression Using Support Vector Machines: Basic Foundations
Regresson Usng Support Vector Machnes: Basc Foundatons Techncal Report December 004 Aly Farag and Refaat M Mohamed Computer Vson and Image Processng Laboratory Electrcal and Computer Engneerng Department
More informationEvaluation of simple performance measures for tuning SVM hyperparameters
Evaluaton of smple performance measures for tunng SVM hyperparameters Kabo Duan, S Sathya Keerth, Aun Neow Poo Department of Mechancal Engneerng, Natonal Unversty of Sngapore, 0 Kent Rdge Crescent, 960,
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More informationResearch Article Green s Theorem for Sign Data
Internatonal Scholarly Research Network ISRN Appled Mathematcs Volume 2012, Artcle ID 539359, 10 pages do:10.5402/2012/539359 Research Artcle Green s Theorem for Sgn Data Lous M. Houston The Unversty of
More informationUsing T.O.M to Estimate Parameter of distributions that have not Single Exponential Family
IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran
More informationNumber of cases Number of factors Number of covariates Number of levels of factor i. Value of the dependent variable for case k
ANOVA Model and Matrx Computatons Notaton The followng notaton s used throughout ths chapter unless otherwse stated: N F CN Y Z j w W Number of cases Number of factors Number of covarates Number of levels
More informationA New Refinement of Jacobi Method for Solution of Linear System Equations AX=b
Int J Contemp Math Scences, Vol 3, 28, no 17, 819-827 A New Refnement of Jacob Method for Soluton of Lnear System Equatons AX=b F Naem Dafchah Department of Mathematcs, Faculty of Scences Unversty of Gulan,
More informationLINEAR REGRESSION ANALYSIS. MODULE VIII Lecture Indicator Variables
LINEAR REGRESSION ANALYSIS MODULE VIII Lecture - 7 Indcator Varables Dr. Shalabh Department of Maematcs and Statstcs Indan Insttute of Technology Kanpur Indcator varables versus quanttatve explanatory
More informationChapter 15 Student Lecture Notes 15-1
Chapter 15 Student Lecture Notes 15-1 Basc Busness Statstcs (9 th Edton) Chapter 15 Multple Regresson Model Buldng 004 Prentce-Hall, Inc. Chap 15-1 Chapter Topcs The Quadratc Regresson Model Usng Transformatons
More informationComputation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models
Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,
More informationStatistical Foundations of Pattern Recognition
Statstcal Foundatons of Pattern Recognton Learnng Objectves Bayes Theorem Decson-mang Confdence factors Dscrmnants The connecton to neural nets Statstcal Foundatons of Pattern Recognton NDE measurement
More informationAdmin NEURAL NETWORKS. Perceptron learning algorithm. Our Nervous System 10/25/16. Assignment 7. Class 11/22. Schedule for the rest of the semester
0/25/6 Admn Assgnment 7 Class /22 Schedule for the rest of the semester NEURAL NETWORKS Davd Kauchak CS58 Fall 206 Perceptron learnng algorthm Our Nervous System repeat untl convergence (or for some #
More informationConjugacy and the Exponential Family
CS281B/Stat241B: Advanced Topcs n Learnng & Decson Makng Conjugacy and the Exponental Famly Lecturer: Mchael I. Jordan Scrbes: Bran Mlch 1 Conjugacy In the prevous lecture, we saw conjugate prors for the
More informationPROPERTIES I. INTRODUCTION. Finite element (FE) models are widely used to predict the dynamic characteristics of aerospace
FINITE ELEMENT MODEL UPDATING USING BAYESIAN FRAMEWORK AND MODAL PROPERTIES Tshldz Marwala 1 and Sbusso Sbs I. INTRODUCTION Fnte element (FE) models are wdely used to predct the dynamc characterstcs of
More informationComputing MLE Bias Empirically
Computng MLE Bas Emprcally Kar Wa Lm Australan atonal Unversty January 3, 27 Abstract Ths note studes the bas arses from the MLE estmate of the rate parameter and the mean parameter of an exponental dstrbuton.
More informationUnified Subspace Analysis for Face Recognition
Unfed Subspace Analyss for Face Recognton Xaogang Wang and Xaoou Tang Department of Informaton Engneerng The Chnese Unversty of Hong Kong Shatn, Hong Kong {xgwang, xtang}@e.cuhk.edu.hk Abstract PCA, LDA
More information8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore
8/5/17 Data Modelng Patrce Koehl Department of Bologcal Scences atonal Unversty of Sngapore http://www.cs.ucdavs.edu/~koehl/teachng/bl59 koehl@cs.ucdavs.edu Data Modelng Ø Data Modelng: least squares Ø
More information