Production Systems and Information Engineering, Volume 5 (2009), pp. 41-50.

ESTIMATION OF MISCLASSIFICATION ERROR USING BAYESIAN CLASSIFIERS

PÉTER BARABÁS
University of Miskolc, Hungary
Department of Information Technology
barabas@iit.uni-miskolc.hu

LÁSZLÓ KOVÁCS
University of Miskolc, Hungary
Department of Information Technology
kovacs@iit.uni-miskolc.hu

[Received January 2009 and accepted April 2009]

Abstract. Bayesian classifiers provide relatively good performance compared with other, more complex algorithms. The misclassification ratio is very low for trained samples, but in the case of outliers the misclassification error may increase significantly. Using the summation hack in the Bayesian classification algorithm can reduce the misclassification rate for untrained samples. The goal of this paper is to analyze the applicability of the summation hack in Bayesian classifiers in general.

Keywords: Bayesian classifier, summation hack, polynomial distribution, misclassification error.

1. Introduction

The Bayesian classification method is a generative statistical classifier. Studies comparing classification algorithms have found that the simple or Naive Bayesian classifier provides relatively good performance compared with other, more complex algorithms. Accuracy of classification is a very important property of a classifier, and its measure can be separated into two parts: accuracy in the case of trained samples and accuracy in the case of untrained samples. Naive Bayesian classification is generally very accurate in the first case, since all testing samples have been trained before and contain no outliers; in the second case the efficiency is worse due to outliers. In [1], the role of outliers in classification methods is examined: Naive Bayesian classification is sensitive to outliers, and they can cause misclassification. Usage of the summation hack can reduce the effect of outliers. The goal of our research is to analyze the generalization capability of Bayesian classification using the summation hack. In the second part a short summary of Naive Bayesian classification is given. In the third part the concept of the summation hack is introduced and examined. In the fourth part the classification methods are
analyzed considering the misclassification error. Finally, the test results and conclusions are summarized in the last section.

It is assumed that the objects to be classified are described by n-dimensional pattern vectors x = (x_1, \dots, x_n) \in \mathbb{R}^n. The dimensions correspond to the attributes of the objects. Every pattern vector is associated with a class label c_i, where the total number of classes is m. The class label c_i denotes that the object belongs to the i-th class. Thus, a classifier can be regarded as a function

    g(x): \mathbb{R}^n \to \{c_1, \dots, c_m\}.   (1.1)

The optimal classification function is aimed at minimizing the misclassification risk [2]. The risk value R depends on the probability of the different classes and on the misclassification cost of the classes:

    R(g(x) \mid x) = \sum_{c_i} b(g(x) \mid c_i)\, P(c_i \mid x),   (1.2)

where P(c_i \mid x) denotes the conditional probability of c_i for the pattern vector x, and b(c_j \mid c_i) denotes the cost of deciding in favor of c_j instead of the correct class c_i. The cost function b usually has the following simplified form:

    b(c_j \mid c_i) = \begin{cases} 0, & \text{if } c_i = c_j \\ 1, & \text{if } c_i \neq c_j. \end{cases}   (1.3)

Using this kind of cost function, the misclassification error value can be given by

    R(g(x) \mid x) = 1 - P(g(x) \mid x).   (1.4)

The optimal classification function minimizes the value R(g(x) \mid x). As

    \sum_{c_i} P(c_i \mid x) = 1,   (1.5)

it follows that if

    P(g(x) \mid x) \to \max,   (1.6)

then R(g(x) \mid x) takes its minimal value. The decision rule which minimizes the average risk is the Bayes rule, which assigns the pattern vector x to the class that has the greatest probability for x [3].
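The decision rule of (1.2)-(1.6) can be illustrated with a small sketch. The posterior values below are invented for illustration only; under the 0-1 cost function (1.3), the risk of deciding for a class equals one minus its posterior, so minimizing the risk is the same as maximizing the posterior.

```python
# Bayes decision rule under the 0-1 cost function b of (1.3):
# deciding for class c costs R(c|x) = 1 - P(c|x), so the risk-minimizing
# choice is the class with the largest posterior probability.
# The posterior values below are hypothetical.

posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}  # P(c_i | x), sums to 1

def risk(c):
    """0-1 misclassification risk of deciding in favor of class c."""
    return 1.0 - posterior[c]

bayes_class = min(posterior, key=risk)  # equivalently: max posterior
print(bayes_class, risk(bayes_class))   # prints: c2 0.5
```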
2. Bayes classification

A Bayesian classifier is based on Bayes' theorem, which relates the conditional and marginal probabilities of two random events. Let A and B denote events. The conditional probability P(A \mid B) is the probability of event A given the occurrence of event B. The marginal probability is the unconditional probability P(A) of event A, regardless of whether event B occurs or not. The simplified version of Bayes' theorem can be written for events A and B as follows:

    P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.   (2.1)

Here \bar{A} denotes the complementary event of A, called "not A". Let A_1, A_2, A_3, \dots be a partition of the event space. The general form of the theorem is given as:

    P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)}.   (2.2)

Let C = \{c_i\} denote the set of classes. The observable properties of the objects are described by the vector x. An object with properties x has to be classified into the class for which the probability P(c_i \mid x) is maximal. On the basis of Bayes' theorem:

    P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)}.   (2.3)

Since P(x) is the same for all i, we only have to maximize the expression P(x \mid c_i)\, P(c_i). The value P(c_i) is given a priori or can be estimated with relative frequencies from the samples. According to the assumption of Naive Bayes classification, the attributes in a given class are conditionally independent of every other attribute, so the joint probability model can be expressed as

    P(x_1, \dots, x_n \mid c_i) = \prod_{k=1}^{n} P(x_k \mid c_i).   (2.4)

Using the above equation, the probability of class c_i for an object featured by vector x is equal to
    P(c_i \mid x) = \frac{P(c_i)\, \prod_{k=1}^{n} P(x_k \mid c_i)}{P(x)}.   (2.5)

The class label c^* for which P(c^* \mid x) is maximal [5] is:

    c^* = \operatorname{argmax}_{c \in C} P(c \mid x) = \operatorname{argmax}_{c \in C} P(c) \prod_{k=1}^{n} P(x_k \mid c).   (2.6)

If a given class and feature value never occur together in the training set, then the relative frequency will be zero. Thus, the total probability is also set to zero. One of the simplest solutions to this problem is to add 1 to all occurrence counts of the given attribute. In the case of a large number of samples the distortion of the probabilities is marginal, and the information loss caused by the zero tag can be eliminated successfully. This technique is called Laplace estimation [4]. A more refined solution is to add p_i instead of 1 to the relative frequencies, where p_i is the relative frequency of the i-th attribute value in the global teaching set, not only in the set belonging to class c_i.

3. Summation hack

Outliers in the classification can indicate faulty data which cause misclassification. The use of the summation hack is an optional method to reduce the misclassification error. The summation hack is an ad-hoc replacement of a product by a sum in a probabilistic expression [1]. This hack is usually explained as a device to cope with outliers, with no formal derivation. The note [1] shows that the hack does make sense probabilistically, and can best be thought of as replacing an outlier-sensitive likelihood with an outlier-tolerant one.

Let us define a vector x with components x_1, x_2, \dots, x_n and a class c. In Bayes classification, where the vector values are conditionally independent:

    P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c).   (3.1)

In this case the probability is sensitive to outliers in individual dimensions, so if any P(x_i \mid c) value is equal to 0, the product will be zero. Using the summation hack we get the following:
    P(x \mid c) \approx \sum_{i=1}^{n} P(x_i \mid c).   (3.2)

In this case the result will be zero if and only if all P(x_i \mid c) values are equal to 0. Using (2.6) and (3.2), the computation of the winner class is based upon the following formula:

    c^* = \operatorname{argmax}_{c \in C} P(c) \sum_{i=1}^{n} P(x_i \mid c).   (3.3)

Applying the summation hack, the error of classification can be reduced. In every equation above, the true probabilities are replaced with their approximated values, where

    P(e) = \lim_{t \to \infty} \frac{k_e}{t},   (3.4)

and t is the total number of trials and k_e is the number of trials in which event e occurred. If the number of test events approaches infinity, the relative frequency value converges to the probability value. In many classification tasks only a small number of samples is given [6]; the number of tests is low, so a larger approximation error will arise in the calculations. We can write the probability as follows:

    P(x_i = v_i \mid c) = \hat{P}(x_i = v_i \mid c) + \Delta_i,   (3.5)

where \hat{P} denotes the relative frequency and \Delta_i means the error of approximation. The cumulated classification error in the case of the summation hack can be computed by summing the error elements \Delta_i. This error value differs from the classification error for the product of probabilities, as the latter is calculated in the following form:

    \prod_{i=1}^{n} \left( \hat{P}(x_i = v_i \mid c) + \Delta_i \right) - \prod_{i=1}^{n} \hat{P}(x_i = v_i \mid c).   (3.6)
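The contrast between the product rule (2.6) and the summation-hack rule (3.3) can be sketched on hypothetical per-dimension likelihoods. All numbers below are invented for illustration; the 0.0 in the second dimension plays the role of an outlier, i.e. an attribute value untrained for that class.

```python
# Product rule (2.6) vs. summation hack (3.3) on hypothetical numbers.
# The zero likelihood in class c1's second dimension models an outlier:
# an attribute value never seen together with c1 in the training set.

probs = {                      # P(x_i | c) for each class, invented
    "c1": [0.6, 0.0, 0.5],     # one untrained attribute value
    "c2": [0.2, 0.1, 0.1],
}
prior = {"c1": 0.5, "c2": 0.5}

def product_score(c):
    s = prior[c]
    for p in probs[c]:
        s *= p                 # a single zero factor zeroes the whole score
    return s

def sum_score(c):
    return prior[c] * sum(probs[c])  # zero only if every term is zero

print(max(prior, key=product_score))  # prints: c2 (the zero eliminates c1)
print(max(prior, key=sum_score))      # prints: c1 (the outlier is tolerated)
```

The otherwise well-matching class c1 loses under the product rule solely because of the single untrained dimension, while the summation hack lets the remaining dimensions outvote the outlier.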
4. Analysis of the approximation error

The main cause of misclassification is the error of the approximated probability values shown in formula (3.4). To calculate the error value, the following model is applied. Let \{c_i\} be the set of classes and \{a_j\} the set of attributes, where an attribute may be of vector value. A test case is described by an (a, c) pair, where c denotes the class related to the attribute a. The unknown probability that a_j belongs to c_i is denoted by p_{ij}. The relative frequency of the event that a_j belongs to c_i is denoted by g_{ij}. In the calculations the p_{ij} are approximated by the g_{ij}. The classification of the attribute can be regarded as a stochastic event, where P(p_{ij}, g_{ij}) denotes the probability that g_{ij} will be used in the calculations instead of p_{ij}. Let X(x_1, \dots, x_r) be an r-dimensional stochastic variable, where x_i denotes the number of attributes classified as c_i. X has a polynomial (multinomial) distribution:

    P(x_1 = k_1, x_2 = k_2, \dots, x_r = k_r) = \frac{N!}{k_1!\, k_2! \cdots k_r!}\, P_1^{k_1} P_2^{k_2} \cdots P_r^{k_r},   (4.1)

where

    \sum_{i=1}^{r} k_i = N, \qquad \sum_{i=1}^{r} P_i = 1.   (4.2)

A given frequency vector g(k_1, k_2, \dots, k_r) has different P probabilities for the different probability tuples p(p_1, p_2, \dots, p_r). The tuple p(p_1, p_2, \dots, p_r) with maximal P value is assumed to be the real probability value tuple. As the maximum likelihood approximation of the probability is the frequency value, the relative frequencies are the best approximations of the real probabilities:

    P_i = \frac{k_i}{N}.   (4.3)

The probability of other p vectors can also be calculated with this formula. For the case r = 2 the resulting P distribution function is shown in Fig. 1. In the next step, the approximation error of the product P is calculated. It is clear that the larger the difference between p and g, the higher the error value is. On the other hand, the lower the difference between p and g, the higher the probability of this pair is. In the investigation, the average error value is calculated in the following way:
    \varepsilon(g) = \sum_{p} P(p, g)\, \varepsilon(p, g),   (4.4)

where \varepsilon(p, g) denotes the error value of matching p with g, P(p, g) the probability of matching p with g, and \varepsilon(g) the average error related to the frequency vector g.

Figure 1. Probability function for the case r = 2

In the binomial test case, the error formula for p can be computed as follows:

    \varepsilon(p, g) = \left| p(1 - p) - \frac{k}{N}\left(1 - \frac{k}{N}\right) \right|.   (4.5)

Fig. 2 shows the error function for the test binomial case. The number of attempts is N = 100, where the number of attributes belonging to class c_1 is k = 30. As can be seen in the figure, the minimum error occurs at p = 0.3. Since the function is symmetric, another minimum point can be found at p = 0.7.
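The two minima quoted for Fig. 2 can be checked numerically. The sketch below evaluates (4.5) in the form given above (an assumption reconstructed from the stated behavior: zeros at p = k/N and at its mirror image, with symmetry about p = 0.5) for the test parameters N = 100, k = 30:

```python
# Error function (4.5) for the binomial test case, N = 100 trials and
# k = 30 hits: eps(p) = |p(1-p) - (k/N)(1 - k/N)|. Because p(1-p) is
# symmetric around p = 0.5, eps vanishes both at p = 0.3 and at p = 0.7.

N, k = 100, 30
g = k / N  # relative frequency, g = 0.3

def eps(p):
    """Approximation error of matching true probability p with frequency g."""
    return abs(p * (1 - p) - g * (1 - g))

print(eps(0.3), eps(0.7))  # both minima evaluate to 0.0
```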
Figure 2. Error function for the case p = 0.3

In the Naive Bayesian classifier the accuracy depends strongly on the number of attempts: the larger the test pool, the better the accuracy. Fig. 3 shows the mean value error function for different N values. The results show that for small values of N the use of the summation hack can improve the accuracy, but for a larger test pool the Naive Bayesian classifier is the dominant one.

5. Test results

In the first tests [7] the reference points were generated with a uniform distribution in space. The winner was the Naive Bayesian classifier in the teaching and testing phases equally. The teaching accuracy had values from 80% to 100%, depending on the environment parameters. Using the summation hack, this accuracy decreased by about 10%. The testing accuracy is far lower: it is between 40 and 70 percent in the case of the Naive Bayesian classifier, and lower using the summation hack. The relatively large range of result values can be explained by the overtraining of the model, which can be controlled by the correct choice of environment parameters.
In later tests the reference points were generated sparsely, so the space has a small region with a relatively large number of reference points, and outside this region there are only a few reference points.

Figure 3. Mean value error function for different (k, N) values

In the case of this distribution, the accuracy of the classifiers changed. The teaching accuracy of the Naive Bayesian classifier remained highly similar to the other cases, and the use of the summation hack brought the accuracy up to that of the Naive Bayesian. In the testing phase, experience shows that in some cases the summation hack solution can improve the efficiency of classification, and in many cases it exceeds the Naive Bayesian. This confirms the assumption that the usage of the summation hack in Bayesian classification can increase accuracy when the samples contain a great number of untrained attribute values.

The accuracy of classification depends on many parameters of the environment. One of the most important factors is the maximum attribute value parameter. Fig. 4 shows the accuracy functions for the following maximum attribute parameter values: 20 (NB20, SH20), 100 (NB100, SH100) and 500 (NB500, SH500). The notation NB stands for the Naive Bayesian algorithm and SH for the modified Bayesian algorithm. The accuracy of both algorithms increased with the size of the training set.
Figure 4. Relative accuracy of the algorithms according to the number of teaching samples (accuracy in % against training set sizes 30, 60, 120, 250, 500 and 1000, for the series NB20, SH20, NB100, SH100, NB500 and SH500)

6. Conclusions

The summation hack is an alternative to the Naive Bayesian classifier in the presence of larger probability approximation errors. Taking a decision tree as a reference classifier, we have compared the Naive Bayesian classifier with the Bayesian classifier using the summation hack. The test results show that both methods can yield the same accuracy as the decision tree method in the case of large training sets.

REFERENCES

[1] THOMAS P. MINKA: The summation hack as an outlier model, technical note, August 22, 2003.
[2] HOLMSTRÖM L., KOISTINEN P., LAAKSONEN J., OJA E.: Neural and Statistical Classifiers - Taxonomy and Two Case Studies, IEEE Trans. on Neural Networks, Vol. 8, No. 1, 1997.
[3] KOVÁCS L., TERSTYÁNSZKI G.: Improved Classification Algorithm for the Counter Propagation Network, Proceedings of IJCNN 2000, Como, Italy.
[4] JOAQUIM P. MARQUES DE SÁ: Applied Statistics Using SPSS, Statistica, Matlab and R, Springer, 2007, pp. 223-268.
[5] FUCHUN PENG, DALE SCHUURMANS, SHAOJUN WANG: Augmenting Naive Bayes Classifiers with Statistical Language Models, Information Retrieval, 7, Kluwer Academic Publishers, 2004, Netherlands, pp. 314-345.
[6] ROBERT P.W. DUIN: Small sample size generalization, 9th Scandinavian Conference on Image Analysis, June 6-9, 1995, Uppsala, Sweden.
[7] BARABÁS P., KOVÁCS L.: Usability of summation hack in Bayes Classification, 9th International Symposium of Hungarian Researchers on Computational Intelligence and Informatics, November 6-8, 2008, Budapest, Hungary.