Kernel Maximum a Posteriori Classification with Error Bound Analysis

Zenglin Xu, Kaizhu Huang, Jianke Zhu, Irwin King, and Michael R. Lyu

Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong
{zlxu,kzhuang,jkzhu,king,lyu}@cse.cuhk.edu.hk

Abstract. Kernel methods have been widely used in data classification. Many kernel-based classifiers, like the Kernel Support Vector Machine (KSVM), assume that data can be separated by a hyperplane in the feature space. These methods do not consider the data distribution. This paper proposes a novel Kernel Maximum A Posteriori (KMAP) classification method, which adopts a Gaussian density assumption in the feature space and can be regarded as a more generalized classification method than other kernel-based classifiers such as Kernel Fisher Discriminant Analysis (KFDA). We also adopt robust methods for parameter estimation. In addition, the error bound analysis for KMAP indicates the effectiveness of the Gaussian density assumption in the feature space. Furthermore, KMAP achieves very promising results on eight UCI benchmark data sets against the competitive methods.

1 Introduction

Recently, kernel methods have been regarded as state-of-the-art classification approaches [1]. The basic idea of kernel methods in supervised learning is to map data from an input space to a high-dimensional feature space in order to make the data more separable. Classical kernel-based classifiers include the Kernel Support Vector Machine (KSVM) [2], Kernel Fisher Discriminant Analysis (KFDA) [3], and the Kernel Minimax Probability Machine [4,5]. The rationale behind them is that linear discriminant functions in the feature space can represent complex separating surfaces when mapped back to the original input space. However, one drawback of KSVM is that it does not consider the data distribution and cannot directly output probabilities or confidences for classification. It is therefore hard to apply in systems that reason under uncertainty.

On the other hand, in statistical pattern recognition, probability densities can be estimated from data. Future examples are then assigned to the class with the Maximum A Posteriori probability (MAP) [6]. One typical probability density function is the Gaussian, which is easy to handle. However, the Gaussian assumption cannot be easily satisfied in the input space, where it is hard to deal with non-linearly separable problems.

M. Ishikawa et al. (Eds.): ICONIP 2007, Part I, LNCS 4984, pp. 841-850, 2008. (c) Springer-Verlag Berlin Heidelberg 2008

To solve these problems, we propose a Kernel Maximum a Posteriori (KMAP) classification method under a Gaussianity assumption in the feature space. Different from KSVM, we make the Gaussian density assumption, which implies that data can be separated by more complex surfaces in the feature space. Generally, distributions other than the Gaussian can also be assumed in the feature space. However, under a distribution with a complex form, it is hard to obtain a closed-form solution and easy to fall into over-fitting. Moreover, with the Gaussian assumption, a kernelized version of our model can be derived without knowing the explicit form of the mapping functions. In addition, to indicate the effectiveness of our assumption, we calculate a separability measure and the error bound for bi-category data sets. The error bound analysis shows that the Gaussian density assumption can be more easily satisfied in the feature space.

This paper is organized as follows. Section 2 derives the MAP decision rules in the feature space and analyzes their separability measures and upper error bounds. Section 3 presents the experiments against other classifiers. Section 4 reviews the related work. Section 5 draws conclusions and lists possible future research directions.

2 Main Results

In this section, our MAP classification model is derived. Then, we adopt a special regularization to estimate the parameters. The kernel trick is used to compute our model. Last, the separability measure and the error bound are calculated in the kernel-induced feature space.

2.1 Model Formulation

Under the Gaussian distribution assumption, the conditional density function for each class C_i (1 \le i \le m) is written as

p(\Phi(x) \mid C_i) = \frac{1}{(2\pi)^{N/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\Phi(x) - \mu_i)^T \Sigma_i^{-1} (\Phi(x) - \mu_i) \right\},    (1)

where \Phi(x) is the image of x in the feature space, N is the dimension of the feature space (N could be infinite), \mu_i and \Sigma_i are the mean and the covariance matrix of C_i, respectively, and |\Sigma_i| is the determinant of the covariance matrix.

According to Bayes' theorem, the posterior probability of class C_i is calculated by

P(C_i \mid x) = \frac{p(x \mid C_i)\, P(C_i)}{\sum_{j=1}^{m} p(x \mid C_j)\, P(C_j)}.    (2)

Based on Eq. (2), the decision rule can be formulated as

x \in C_w \quad \text{if} \quad P(C_w \mid x) = \max_{1 \le j \le m} P(C_j \mid x).    (3)

This means that a test data point is assigned to the class with the maximum posterior P(C_w | x), i.e., the MAP. Since the MAP is computed in the kernel-induced feature space, the resulting model is named the KMAP classification.
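
To make Eqs. (1)-(3) concrete, here is a minimal sketch (ours, not part of the paper) of the MAP rule for a finite-dimensional feature space; the function name and the toy means, covariances, and priors are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_classify(phi_x, means, covs, priors):
    """Assign a feature-space point phi_x to the class with the maximum
    posterior probability, following Eqs. (1)-(3)."""
    # Class-conditional Gaussian densities p(Phi(x) | C_i), Eq. (1)
    likelihoods = np.array([multivariate_normal.pdf(phi_x, mean=m, cov=S)
                            for m, S in zip(means, covs)])
    # Posterior P(C_i | x) by Bayes' theorem, Eq. (2)
    posteriors = likelihoods * priors
    posteriors /= posteriors.sum()
    # Decision rule, Eq. (3): the class with the largest posterior wins
    return int(np.argmax(posteriors)), posteriors

# Toy usage: two Gaussian classes in a 2-D "feature space"
means = [np.zeros(2), np.array([2.0, 2.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
priors = np.array([0.5, 0.5])
label, post = map_classify(np.array([1.8, 1.9]), means, covs, priors)
print(label, post)  # expected: class 1 with high confidence
```

The posterior vector here is exactly the kind of confidence KMAP exposes; thresholding it implements the rejection option described next.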

KMAP provides not only a class label but also the probability of a data point belonging to that class. This probability can be viewed as a confidence in classifying new data points and can be used in statistical systems that reason under uncertainty: if the confidence is lower than some specified threshold, the system can refuse to make an inference. Many kernel learning methods, including KSVM, cannot output such probabilities.

The decision rule can be further formulated as minimizing the discriminant

g_i(\Phi(x)) = (\Phi(x) - \mu_i)^T \Sigma_i^{-1} (\Phi(x) - \mu_i) + \log |\Sigma_i|.    (4)

The intuitive meaning of this function is that a class is more likely to be assigned to an unlabeled data point when the Mahalanobis distance from the data point to the class center is smaller.

2.2 Parameter Estimation

In order to compute the Mahalanobis distance function, the mean vector and the covariance matrix of each class must be estimated. Typically, the mean vector \mu_i and the within-class covariance matrix \Sigma_i are obtained by maximum likelihood estimation. In the feature space, they are formulated as

\mu_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \Phi(x_j),    (5)

\Sigma_i = S_i = \frac{1}{n_i} \sum_{j=1}^{n_i} (\Phi(x_j) - \mu_i)(\Phi(x_j) - \mu_i)^T,    (6)

where n_i is the cardinality of the set of data points belonging to C_i. Directly employing S_i as the covariance matrix generates quadratic discriminant functions in the feature space; in this case, KMAP is denoted KMAP-M.

However, the covariance estimation problem is clearly ill-posed, because the number of data points in each class is usually much smaller than the number of dimensions of the kernel-induced feature space. The treatment of this ill-posed problem is to introduce regularization, and there are several kinds of regularization methods. One of them is to replace each individual within-class covariance matrix by the average, i.e., \Sigma_i = S = \frac{1}{m} \sum_{i=1}^{m} S_i + rI, where I is the identity matrix and r is a regularization coefficient. This method substantially reduces the number of free parameters to be estimated. Moreover, it reduces the discriminant function between two classes to a linear one, so a linear discriminant analysis method is obtained.
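
The estimators of Eqs. (5)-(6) and the averaged-covariance regularizer can be sketched as follows with an explicit finite-dimensional feature map; the paper never forms Phi explicitly, so this is only a sanity check under that simplifying assumption, with function names of our choosing.

```python
import numpy as np

def class_statistics(Phi):
    """ML mean and covariance of one class, Eqs. (5)-(6).
    Phi has shape (n_i, N): one mapped sample Phi(x_j) per row."""
    mu = Phi.mean(axis=0)            # Eq. (5)
    D = Phi - mu
    S = D.T @ D / Phi.shape[0]       # Eq. (6): divide by n_i (the MLE)
    return mu, S

def averaged_covariance(S_list, r):
    """Shared covariance S = (1/m) sum_i S_i + r I; using it for every
    class turns the discriminant between two classes into a linear one."""
    S = sum(S_list) / len(S_list)
    return S + r * np.eye(S.shape[0])

# Toy usage with two classes in a 3-D feature space
rng = np.random.default_rng(0)
Phi1 = rng.normal(size=(50, 3))
Phi2 = 1.0 + rng.normal(size=(40, 3))
(mu1, S1), (mu2, S2) = class_statistics(Phi1), class_statistics(Phi2)
S_shared = averaged_covariance([S1, S2], r=1e-3)
```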

Alternatively, we can estimate the covariance matrix by combining the above linear discriminant function with the quadratic one. Instead of estimating the covariance matrix in the input space [7], we apply this method in the feature space. The formulation in the feature space is as follows:

\tilde{\Sigma}_i = (1 - \eta)\, \hat{\Sigma}_i + \frac{\eta}{n}\, \mathrm{trace}(\hat{\Sigma}_i)\, I,    (7)

where \hat{\Sigma}_i = (1 - \theta) S_i + \theta S. In these equations, \theta (0 \le \theta \le 1) is a coefficient balancing the linear discriminant term and the quadratic one, while \eta (0 \le \eta \le 1) determines the shrinkage towards a multiple of the identity matrix. This approach is more flexible in adjusting the effect of the regularization. The corresponding KMAP is denoted KMAP-R.

2.3 Kernel Calculation

We now derive methods to calculate the Mahalanobis distance (Eq. (4)) using the kernel trick, i.e., we only need to formulate the function in inner-product form, regardless of the explicit mapping function. To do this, the spectral representation of the covariance matrix, \Sigma_i = \sum_{j=1}^{N} \Lambda_{ij} \Omega_{ij} \Omega_{ij}^T, is utilized, where \Lambda_{ij} \in \mathbb{R} is the j-th eigenvalue of \Sigma_i and \Omega_{ij} \in \mathbb{R}^N is the eigenvector associated with \Lambda_{ij}. However, the small eigenvalues degrade the performance of the function overwhelmingly, because they are underestimated due to the small number of examples. In this paper, we therefore estimate only the k largest eigenvalues and replace each remaining eigenvalue by a nonnegative number h_i. Thus Eq. (4) can be reformulated as

g_i(\Phi(x)) = \frac{1}{h_i} \left[ g_{i1}(\Phi(x)) - g_{i2}(\Phi(x)) \right] + g_{i3}(\Phi(x)),    (8)

where

g_{i1}(\Phi(x)) = \sum_{j=1}^{N} [\Omega_{ij}^T (\Phi(x) - \mu_i)]^2,
g_{i2}(\Phi(x)) = \sum_{j=1}^{k} \left( 1 - \frac{h_i}{\Lambda_{ij}} \right) [\Omega_{ij}^T (\Phi(x) - \mu_i)]^2,
g_{i3}(\Phi(x)) = \log \left( h_i^{N-k} \prod_{j=1}^{k} \Lambda_{ij} \right).

In the following, we show that g_{i1}(\Phi(x)), g_{i2}(\Phi(x)), and g_{i3}(\Phi(x)) can all be written in kernel form. To formulate these expressions, we need to calculate the eigenvalues \Lambda_{ij} and eigenvectors \Omega_{ij}. The eigenvectors lie in the space spanned by the training samples, i.e., each eigenvector \Omega_{ij} can be written as a linear combination of the training samples:

\Omega_{ij} = \sum_{l=1}^{n_i} \gamma_{ij}^{(l)} \Phi(x_l) = U_i \gamma_{ij},    (9)

where \gamma_{ij} = (\gamma_{ij}^{(1)}, \gamma_{ij}^{(2)}, \ldots, \gamma_{ij}^{(n_i)})^T is an n_i-dimensional column vector and U_i = (\Phi(x_1), \ldots, \Phi(x_{n_i})).
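
As the next paragraph notes, the pairs (\Lambda_{ij}, \gamma_{ij}) of Eq. (9) come from the class kernel block, so they can be computed kernel-PCA style. A sketch under our assumptions (an RBF kernel and the standard normalization that makes each \Omega_{ij} unit length; function names are ours):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def class_eigensystem(X_i, gamma):
    """Eigenvalues Lambda_ij of the class covariance in the feature space
    and the expansion coefficients gamma_ij of Eq. (9), computed from the
    centered kernel block of class C_i."""
    n = X_i.shape[0]
    K = rbf_kernel(X_i, X_i, gamma)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = H @ K @ H                           # centered kernel block
    lam, alpha = np.linalg.eigh(Kc / n)      # Lambda_ij and unit vectors
    lam, alpha = lam[::-1], alpha[:, ::-1]   # sort in descending order
    # Rescale so that Omega_ij = U_i gamma_ij has unit norm: with gamma = alpha,
    # ||Omega_ij||^2 = alpha_j^T Kc alpha_j = n * Lambda_ij, so divide each
    # alpha_j by sqrt(n * Lambda_ij).
    gammas = np.zeros_like(alpha)
    pos = lam > 1e-12
    gammas[:, pos] = alpha[:, pos] / np.sqrt(n * lam[pos])
    return lam, gammas
```

Only the k leading pairs are kept downstream; the remaining eigenvalues are replaced by h_i, as in Eq. (8).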

It is easy to prove that \gamma_{ij} and \Lambda_{ij} are actually an eigenvector and the corresponding eigenvalue of \frac{1}{n_i} G^{(i)}, where G^{(i)} is the block of the kernel matrix G relevant to class C_i. We omit the proof due to space limitations. Accordingly, we can express g_{i1}(\Phi(x)) in kernel form:

g_{i1}(\Phi(x)) = \sum_{j=1}^{n_i} \gamma_{ij}^T U_i^T (\Phi(x) - \mu_i)(\Phi(x) - \mu_i)^T U_i \gamma_{ij}
               = \sum_{j=1}^{n_i} \left[ \gamma_{ij}^T \left( K_x - \frac{1}{n_i} \sum_{l=1}^{n_i} K_{x_l} \right) \right]^2
               = \left\| K_x - \frac{1}{n_i} \sum_{l=1}^{n_i} K_{x_l} \right\|^2,    (10)

where K_x = (K(x_1, x), \ldots, K(x_{n_i}, x))^T. In the same way, g_{i2}(\Phi(x)) can be formulated as

g_{i2}(\Phi(x)) = \sum_{j=1}^{k} \left( 1 - \frac{h_i}{\Lambda_{ij}} \right) \Omega_{ij}^T (\Phi(x) - \mu_i)(\Phi(x) - \mu_i)^T \Omega_{ij}.    (11)

Substituting (9) into the above, we have

g_{i2}(\Phi(x)) = \sum_{j=1}^{k} \left( 1 - \frac{h_i}{\Lambda_{ij}} \right) \gamma_{ij}^T \left( K_x - \frac{1}{n_i} \sum_{l=1}^{n_i} K_{x_l} \right) \left( K_x - \frac{1}{n_i} \sum_{l=1}^{n_i} K_{x_l} \right)^T \gamma_{ij}.    (12)

Now the Mahalanobis distance function g_i(\Phi(x)) in the feature space can finally be written in kernel form, where N in g_{i3}(\Phi(x)) is substituted by the cardinality n_i. The time complexity of KMAP is dominated by the eigenvalue decomposition, which scales as O(n^3). Thus KMAP has the same complexity as KFDA.
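
Combining Eqs. (8) and (10)-(12) gives the fully kernelized discriminant. A hedged sketch, reusing rbf_kernel and class_eigensystem from the previous sketch; k and h are free parameters here (Section 3.2 of the paper sets h_i = \Lambda_{i,k+1}):

```python
import numpy as np

def kmap_score(x, X_i, gamma, k, h, lam, gammas):
    """Kernelized Mahalanobis discriminant g_i(Phi(x)) of Eq. (8), with
    lam, gammas produced by class_eigensystem(X_i, gamma). A test point
    is assigned to the class whose score is smallest (cf. Eq. (4))."""
    n = X_i.shape[0]
    K = rbf_kernel(X_i, X_i, gamma)
    Kx = rbf_kernel(X_i, x[None, :], gamma).ravel()
    v = Kx - K.mean(axis=1)                    # K_x - (1/n_i) sum_l K_{x_l}
    g1 = v @ v                                 # Eq. (10)
    proj = gammas[:, :k].T @ v                 # gamma_ij^T (...), Eq. (12)
    g2 = ((1.0 - h / lam[:k]) * proj ** 2).sum()
    g3 = (n - k) * np.log(h) + np.log(lam[:k]).sum()  # N replaced by n_i
    return (g1 - g2) / h + g3
```

Classifying a point then amounts to evaluating kmap_score once per class and taking the argmin, which mirrors the MAP rule under equal priors.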

2.4 Connection to Other Kernel Methods

In the following, we show the connection between KMAP and other kernel-based methods. In the regularization scheme of Eq. (7), other kernel-based classification methods can be derived by varying the settings of \theta and \eta. When (\theta = 0, \eta = 0), the KMAP model is a quadratic discriminant method in the kernel-induced feature space; when (\theta = 1, \eta = 0), it is a kernel discriminant method; and when (\theta = 0, \eta = 1) or (\theta = 1, \eta = 1), it is the nearest mean classifier. Therefore, by varying \theta and \eta, different models can be generated from combinations of the quadratic discriminant, linear discriminant, and nearest mean methods.

We now consider the special case \theta = 1 and \eta = 0. If both classes of a binary problem are assumed to share the same covariance structure, i.e., \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}, a linear discriminant function results. Assuming all classes have the same prior probabilities, g_i(\Phi(x)) can be derived as

g_i(\Phi(x)) = (\Phi(x) - \mu_i)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\Phi(x) - \mu_i), \quad i = 1, 2.

Dropping the quadratic term common to both classes, we can rewrite this as g_i(\Phi(x)) = w_i \Phi(x) + b_i, where w_i = -4 \mu_i^T (\Sigma_1 + \Sigma_2)^{-1} and b_i = 2 \mu_i^T (\Sigma_1 + \Sigma_2)^{-1} \mu_i. The decision hyperplane is given by f(\Phi(x)) = g_1(\Phi(x)) - g_2(\Phi(x)), which, up to a scaling constant that does not affect the decision, is

f(\Phi(x)) = (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} \Phi(x) - \frac{1}{2} (\mu_1 - \mu_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mu_1 + \mu_2).    (13)

Eq. (13) is exactly the solution of KFDA [3]. Therefore, KFDA can be viewed as a special case of KMAP in which all classes share the same covariance structure.

Remark. KMAP provides a rich class of kernel-based classification algorithms under different regularization settings. This makes KMAP a flexible framework for classification, adaptive to the data distribution.

2.5 Separability Measures and Error Bounds

To measure the separability of different classes of data in the feature space, the Kullback-Leibler divergence (a.k.a. K-L distance) between two Gaussians is adopted. The K-L divergence is defined as

d_{KL}[p_i(\Phi(x)), p_j(\Phi(x))] = \int p_i(\Phi(x)) \ln \frac{p_i(\Phi(x))}{p_j(\Phi(x))} \, d\Phi(x).    (14)

Since the K-L divergence is not symmetric, a two-way divergence is used to measure the distance between two distributions:

d_{ij} = d_{KL}[p_i(\Phi(x)), p_j(\Phi(x))] + d_{KL}[p_j(\Phi(x)), p_i(\Phi(x))].    (15)

Following [6], it can be proved that

d_{ij} = \frac{1}{2} (\mu_i - \mu_j)^T (\Sigma_i^{-1} + \Sigma_j^{-1}) (\mu_i - \mu_j) + \frac{1}{2} \mathrm{trace}(\Sigma_i^{-1} \Sigma_j + \Sigma_j^{-1} \Sigma_i - 2I),    (16)

which can be evaluated using the kernel trick of Section 2.3.

The Bayesian decision rule guarantees the lowest average error rate, as presented in the following:

P(\text{correct}) = \sum_{i=1}^{m} \int_{R_i} p(\Phi(x) \mid C_i) P(C_i) \, d\Phi(x),    (17)

where R_i is the decision region of class C_i. We implement the Bhattacharyya bound in the feature space for the Gaussian density. Following [6], we have

P(\text{error}) \le \sqrt{P(C_1) P(C_2)} \, \exp\{-q(0.5)\},    (18)

where

q(0.5) = \frac{1}{8} (\mu_2 - \mu_1)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\mu_2 - \mu_1) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_1 + \Sigma_2}{2} \right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}.    (19)
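
For two Gaussians with known parameters, Eqs. (16), (18), and (19) reduce to a few lines. The sketch below (ours) works directly with explicit means and covariances, whereas the paper evaluates the same quantities in the feature space through the kernel forms of Section 2.3.

```python
import numpy as np

def symmetric_kl(mu_i, S_i, mu_j, S_j):
    """Two-way K-L divergence d_ij between two Gaussians, Eq. (16)."""
    d = mu_i - mu_j
    Si_inv, Sj_inv = np.linalg.inv(S_i), np.linalg.inv(S_j)
    quad = 0.5 * d @ (Si_inv + Sj_inv) @ d
    tr = 0.5 * np.trace(Si_inv @ S_j + Sj_inv @ S_i - 2.0 * np.eye(len(d)))
    return quad + tr

def bhattacharyya_bound(mu1, S1, mu2, S2, p1=0.5, p2=0.5):
    """Upper bound on P(error) for two Gaussian classes, Eqs. (18)-(19)."""
    d = mu2 - mu1
    S = 0.5 * (S1 + S2)
    q = (d @ np.linalg.inv(S) @ d) / 8.0 + 0.5 * np.log(
        np.linalg.det(S) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return np.sqrt(p1 * p2) * np.exp(-q)
```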

Using the results in Section 2.3, the Bhattacharyya error bound can be easily calculated in the kernel-induced feature space.

3 Experiments

In this section, we report experiments evaluating the separability measure, the error bound, and the prediction performance of the proposed KMAP.

3.1 Synthetic Data

We compare the separability measure and the error bounds on three synthetic data sets. The description of these data sets can be found in [8]. The data sets are named according to their characteristics and are plotted in Fig. 1. We map the data with an RBF kernel into a feature space where Gaussian distributions are approximately satisfied, and then calculate separability measures on all data sets according to Eq. (16). The separability values for Overlap, Bumpy, and Relevance in the original input space are 14.94, 5.16, and 2.18, respectively; the corresponding values in the feature space are 30.88, 5.87, and 3631, respectively. The results indicate that data become more separable after being mapped into the feature space, especially for the Relevance data set.

For data in the kernel-induced feature space, the error bounds are calculated according to Eq. (18). Figure 1 also plots the prediction rates and the upper error bounds for data in the input space and in the feature space, respectively. It can be observed that the error bounds are more valid in the feature space than in the input space.

3.2 Benchmark Data

Experimental Setup. In this experiment, KSVM, KFDA, the Modified Quadratic Discriminant Function (MQDF) [9], and Kernel Fisher's Quadratic Discriminant Analysis (KFQDA) [10] are employed as the competitive algorithms. We implement two variants of KMAP, i.e., KMAP-M and KMAP-R. The properties of the eight UCI benchmark data sets are described in Table 1. In all kernel methods, a Gaussian RBF kernel is used. The parameter C of KSVM and the parameter \gamma of the RBF kernel are tuned by 10-fold cross-validation. In KMAP, we select the k leading eigenvalue-eigenvector pairs according to their cumulative contribution to the covariance matrix, i.e., k is the smallest index l such that \sum_{q=1}^{l} \Lambda_q / \sum_{q=1}^{n} \Lambda_q \ge \alpha; in MQDF, the range of k is relatively small and we select k by cross-validation. PCA is used as the regularization method in KFQDA and the cumulative decay ratio is set to 99%; the regularization parameter r is set to ... in KFDA.
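
The tuning protocol for the KSVM baseline is a plain grid search under 10-fold cross-validation. A sketch with scikit-learn, which is our choice of tooling (the paper names neither an implementation nor the search grids, so both are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # stands in for one of the UCI sets
param_grid = {
    "C": 10.0 ** np.arange(-2, 4),       # illustrative grid for C
    "gamma": 10.0 ** np.arange(-3, 2),   # illustrative grid for the RBF width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)  # 10-fold CV
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```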

Fig. 1. The data plots of Overlap, Bumpy, and Relevance (panels (a)-(c)) and the comparison of data separability in the input space and the feature space (panel (d): prediction error and error bound (%) across the data sets, with curves input_bound, input_rate, feature_bound, and feature_rate).

Table 1. Data set information: the eight UCI benchmark sets are Iono, Breast, Twonorm, Sonar, Pima, Iris, Wine, and Segment, each listed with its number of samples, features, and classes.

In both KMAP and MQDF, h_i takes the value of \Lambda_{i,k+1}. In KMAP-R, the extra parameters (\theta, \eta) are tuned by cross-validation. All experimental results are obtained over 10 runs, and each run is executed with 10-fold cross-validation on each data set.

Experimental Results. Table 2 reports the average prediction accuracy with standard errors on each data set for all algorithms. It can be observed that both variants of KMAP outperform MQDF, which is an MAP method in the input space. This also empirically validates that the separability among different classes of data becomes larger, and that the upper error bounds get tighter and more accurate, after data are mapped to the high-dimensional feature space. Moreover, the performance of KMAP is competitive with that of other kernel methods. In particular, KMAP-R achieves better prediction accuracy than all other methods on most of the data sets.

The reason is that the regularization methods in KMAP favorably capture the prior distribution of the data, since the Gaussian assumption in the feature space can fit a very complex distribution in the input space.

Table 2. The prediction results of KMAP and other methods: average accuracy (%) with standard errors for KSVM, MQDF, KFDA, KFQDA, KMAP-M, and KMAP-R on the eight data sets, together with the per-method averages (e.g., KSVM attains 94.1% on Iono, 96.5% on Breast, 96.1% on Twonorm, 86.6% on Sonar, 77.9% on Pima, and 98.8% on Wine).

4 Related Work

In statistical pattern recognition, the probability density function can first be estimated from data; future examples are then assigned to the class with the MAP. One typical example is the Quadratic Discriminant Function (QDF) [11], which is derived from the multivariate normal distribution and achieves the minimum mean error rate under the Gaussian distribution. In [9], a Modified Quadratic Discriminant Function (MQDF) less sensitive to estimation error is proposed, and [7] improves the performance of QDF by covariance matrix interpolation.

Unlike QDF, another type of classifier does not assume the probability density functions in advance, but is designed directly on the data samples. An example is Fisher Discriminant Analysis (FDA), which maximizes the between-class covariance while minimizing the within-class variance; it can be derived as a Bayesian classifier under a Gaussian assumption on the data. [3] develops Kernel Fisher Discriminant Analysis (KFDA) by extending FDA to a non-linear space via the kernel trick. To supplement the statistical justification of KFDA, [10] extends the maximum likelihood method and Bayes classification to their kernel generalizations under a Gaussian Hilbert-space assumption. The authors do not directly kernelize the quadratic forms in terms of kernel values; instead, they use an explicit mapping function to map the data to a high-dimensional space, so that the kernel matrix is used as the input data of FDA. The derived model is named Kernel Fisher's Quadratic Discriminant Analysis (KFQDA).

5 Conclusion and Future Work

In this paper, we present a novel kernel classifier named Kernel Maximum a Posteriori (KMAP) classification, which implements a Gaussian distribution assumption in the kernel-induced feature space. Compared to state-of-the-art classifiers, the advantages of KMAP are that prior information about the distribution is incorporated and that it can output a probability or confidence when making a decision.

Moreover, KMAP can be regarded as a more generalized classification method than other kernel-based methods such as KFDA. In addition, the error bound analysis illustrates that the Gaussian distribution assumption is more easily satisfied in the feature space than in the input space. More importantly, KMAP with proper regularization achieves very promising performance. In future work, we plan to incorporate the probability information into both the kernel function and the classifier.

Acknowledgments. The work described in this paper is fully supported by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4205/04E and Project No. CUHK4235/04E).

References

1. Schölkopf, B., Smola, A.: Learning with Kernels. MIT Press, Cambridge (2002)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Chichester (1998)
3. Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Müller, K.: Fisher discriminant analysis with kernels. In: Proceedings of the IEEE Neural Networks for Signal Processing Workshop, pp. 41-48 (1999)
4. Lanckriet, G.R.G., El Ghaoui, L., Bhattacharyya, C., Jordan, M.I.: A robust minimax approach to classification. Journal of Machine Learning Research 3, 555-582 (2002)
5. Huang, K., Yang, H., King, I., Lyu, M.R., Chan, L.: The minimum error minimax probability machine. Journal of Machine Learning Research 5, 1253-1286 (2004)
6. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience (2000)
7. Friedman, J.H.: Regularized discriminant analysis. Journal of the American Statistical Association 84(405), 165-175 (1989)
8. Centeno, T.P., Lawrence, N.D.: Optimising kernel parameters and regularisation coefficients for non-linear discriminant analysis. Journal of Machine Learning Research 7, 455-491 (2006)
9. Kimura, F., Takashina, K., Tsuruoka, S., Miyake, Y.: Modified quadratic discriminant functions and the application to Chinese character recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 9(1), 149-153 (1987)
10. Huang, S.Y., Hwang, C.R., Lin, M.H.: Kernel Fisher's discriminant analysis in Gaussian Reproducing Kernel Hilbert Space. Technical report, Academia Sinica, Taiwan, R.O.C. (2005)
11. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)
