Ensemble Confidence Estimates Posterior Probability


Michael Muhlbaier, Apostolos Topalis, and Robi Polikar
Electrical and Computer Engineering, Rowan University, Mullica Hill Rd., Glassboro, NJ 08028, USA
{muhlba6, topali5}@students.rowan.edu, polikar@rowan.edu

Abstract. We have previously introduced the Learn++ algorithm, which provides surprisingly promising performance for incremental learning as well as data fusion applications. In this contribution we show that the algorithm can also be used to estimate the posterior probability, or the confidence of its decision, on each test instance. On three increasingly difficult tests that are specifically designed to compare the posterior probability estimates of the algorithm to those of the optimal Bayes classifier, we have observed that the estimated posterior probability approaches that of the Bayes classifier as the number of classifiers in the ensemble increases. This satisfying and intuitively expected outcome shows that ensemble systems can also be used to estimate the confidence of their output.

1 Introduction

Ensemble / multiple classifier systems have enjoyed increasing attention and popularity over the last decade due to their favorable performances and/or other advantages over single-classifier based systems. In particular, ensemble based systems have been shown, among other things, to successfully generate strong classifiers from weak classifiers, to resist over-fitting problems [1, 2], and to provide an intuitive structure for data fusion [2-4] as well as for incremental learning problems [5]. One area that has received somewhat less attention, however, is the confidence estimation potential of such systems. By their very character of generating multiple classifiers for a given database, ensemble systems provide a natural setting for estimating the confidence of the classification system in its generalization performance.

In this contribution, we show how our previously introduced algorithm Learn++ [5], inspired by AdaBoost but specifically modified for incremental learning applications, can also be used to determine its own confidence on any given test data instance. We estimate the posterior probability of the class chosen by the ensemble using a weighted softmax approach, and use that estimate as the confidence measure. We empirically show on three increasingly difficult datasets that, as additional classifiers are added to the ensemble, the posterior probability of the class chosen by the ensemble approaches that of the optimal Bayes classifier. It is important to note that the method of ensemble confidence estimation being proposed is not specific to Learn++, but can be applied to any ensemble based system.

N.C. Oza et al. (Eds.): MCS 2005, LNCS 3541, pp. 326-335, 2005. Springer-Verlag Berlin Heidelberg 2005

2 Learn++

In ensemble approaches using a voting mechanism to combine classifier outputs, the individual classifiers vote on the class they predict. The final classification is then determined as the class that receives the highest total vote from all classifiers. Learn++ uses weighted majority voting, a rather non-democratic voting scheme, where each classifier receives a voting weight based on its training performance. One novelty of the Learn++ algorithm is its ability to incrementally learn from newly introduced data. For brevity, this feature of the algorithm is not discussed here, and interested readers are referred to [4, 5]. Instead, we briefly explain the algorithm and discuss how it can be used to determine its confidence as an estimate of the posterior probability on classifying test data.

For each dataset D_k that consecutively becomes available to Learn++, the inputs to the algorithm are (i) a sequence of m_k training data instances x_{k,i} along with their correct labels y_i, (ii) a classification algorithm BaseClassifier, and (iii) an integer T_k specifying the maximum number of classifiers to be generated using that database. If the algorithm is seeing its first database (k = 1), a data distribution D_1, from which training instances will be drawn, is initialized to be uniform, making the probability of any instance being selected equal. If k > 1, a distribution initialization sequence initializes the data distribution. The algorithm then adds T_k classifiers to the ensemble, starting at t = eT_k + 1, where eT_k denotes the number of classifiers that currently exist in the ensemble. The pseudocode of the algorithm is given in Fig. 1.

For each iteration t, the instance weights w_t from the previous iteration are first normalized (Step 1) to create a weight distribution D_t. A hypothesis h_t is generated using a subset of D_k drawn from D_t (Step 2). The error ε_t of h_t is calculated: if ε_t > ½, the algorithm deems the current classifier h_t too weak, discards it, and returns to Step 2; otherwise, it computes the normalized error β_t (Step 3). The weighted majority voting algorithm is called to obtain the composite hypothesis H_t of the ensemble (Step 4). H_t represents the ensemble decision of the first t hypotheses generated thus far. The error E_t of H_t is then computed and normalized (Step 5). The instance weights w_t are finally updated according to the performance of H_t (Step 6), such that the weights of instances correctly classified by H_t are reduced and those that are misclassified are effectively increased. This ensures that the ensemble focuses on those regions of the feature space that are yet to be learned. We note that H_t allows Learn++ to make its distribution update based on the ensemble decision, as opposed to AdaBoost, which makes its update based on the current hypothesis h_t.

Input: For each dataset D_k, k = 1, 2, ..., K
  - A sequence of i = 1, ..., m_k instances x_{k,i} with labels y_i ∈ Y = {1, ..., c}
  - Weak learning algorithm BaseClassifier
  - Integer T_k, specifying the number of iterations

Do for k = 1, 2, ..., K:
  If k = 1, initialize w_1(i) = D_1(i) = 1/m_1 for all i, and eT_1 = 0.
  Else, go to Step 5 to evaluate the current ensemble on the new dataset D_k, update the weights, and recall the current number of classifiers eT_k = Σ_{j=1}^{k-1} T_j.

  Do for t = eT_k + 1, eT_k + 2, ..., eT_k + T_k:
    1. Set D_t = w_t / Σ_i w_t(i) so that D_t is a distribution.
    2. Call BaseClassifier with a subset of D_k randomly chosen using D_t.
    3. Obtain h_t : X → Y, and calculate its error ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i).
       If ε_t > ½, discard h_t and go to Step 2. Otherwise, compute the normalized error β_t = ε_t / (1 - ε_t).
    4. Call weighted majority voting to obtain the composite hypothesis
       H_t = arg max_{y ∈ Y} Σ_{t: h_t(x_i) = y} log(1/β_t)
    5. Compute the error of the composite hypothesis E_t = Σ_{i: H_t(x_i) ≠ y_i} D_t(i).
    6. Set B_t = E_t / (1 - E_t), 0 < B_t < 1, and update the instance weights:
       w_{t+1}(i) = w_t(i) × B_t if H_t(x_i) = y_i, and w_{t+1}(i) = w_t(i) otherwise.

Call weighted majority voting to obtain the final hypothesis
  H_final = arg max_{y ∈ Y} Σ_{k=1}^{K} Σ_{t: h_t(x_i) = y} log(1/β_t)

Fig. 1. Learn++ Algorithm
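To make the steps above concrete, here is a minimal Python sketch of the training loop in Fig. 1, restricted to the single-database case (k = 1). It assumes integer class labels 0, ..., c-1 and a scikit-learn MLP as the BaseClassifier; names such as train_learnpp and subset_frac, and the choice of drawing half of the data for each hypothesis, are illustrative assumptions rather than the paper's prescriptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def weighted_vote(classifiers, log_weights, X, n_classes):
    """Composite hypothesis H_t: weighted majority vote with weights log(1/beta_t)."""
    votes = np.zeros((len(X), n_classes))
    for clf, w in zip(classifiers, log_weights):
        votes[np.arange(len(X)), clf.predict(X)] += w
    return votes.argmax(axis=1)

def train_learnpp(X, y, n_classifiers=30, subset_frac=0.5, seed=0):
    """Sketch of the Learn++ loop of Fig. 1 for a single dataset (k = 1)."""
    rng = np.random.default_rng(seed)
    m, n_classes = len(X), int(y.max()) + 1          # assumes labels 0..c-1
    w = np.full(m, 1.0 / m)                          # uniform initial weights D_1
    classifiers, log_weights = [], []
    while len(classifiers) < n_classifiers:
        D = w / w.sum()                              # Step 1: normalize to a distribution
        idx = rng.choice(m, size=int(subset_frac * m), replace=True, p=D)   # Step 2
        h = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500).fit(X[idx], y[idx])
        eps = D[h.predict(X) != y].sum()             # Step 3: weighted error of h_t
        if eps > 0.5:                                # too weak: discard and retry
            continue
        beta = max(eps / (1.0 - eps), 1e-10)
        classifiers.append(h)
        log_weights.append(np.log(1.0 / beta))
        H = weighted_vote(classifiers, log_weights, X, n_classes)   # Step 4
        E = D[H != y].sum()                          # Step 5: composite error
        B = E / (1.0 - E) if E < 1.0 else 1.0
        w = np.where(H == y, w * B, w)               # Step 6: shrink weights of correct instances
    return classifiers, log_weights
```

The returned (classifiers, log_weights) pair carries everything needed for the weighted majority vote H_final and, as discussed in Section 3, for the ensemble confidence estimate.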

3 Confidence as an Estimate of Posterior Probability

In applications where the data distribution is known, an optimal Bayes classifier can be used, for which the posterior probability of the chosen class can be calculated; a quantity which can then be interpreted as a measure of confidence [6]. The posterior probability of class ω_j given instance x is classically defined using the Bayes rule as

  P(ω_j | x) = P(x | ω_j) P(ω_j) / Σ_{k=1}^{N} P(x | ω_k) P(ω_k)    (1)

Since class distributions are rarely known in practice, posterior probabilities must be estimated. While there are several techniques for density estimation [7], such techniques are difficult to apply to large dimensional problems.

A method that can estimate the Bayesian posterior probability would therefore prove to be a most valuable tool in evaluating classifier performance. Several methods have been proposed for this purpose [6-9]. One example is the softmax model [8], commonly used with classifiers whose outputs are binary encoded, as such outputs can be mapped into an estimate of the posterior class probability using

  C_j(x) = e^{A_j(x)} / Σ_{k=1}^{N} e^{A_k(x)} ≈ P(ω_j | x)    (2)

where A_j(x) represents the output for class j, and N is the number of classes. C_j(x) is then the confidence of the classifier in predicting class ω_j for instance x, which is an estimate of the posterior probability P(ω_j | x). The softmax function essentially takes the exponential of the output and normalizes it to the [0, 1] range by summing over the exponentials of all outputs. This model is generally believed to provide good estimates if the classifier is well trained using sufficiently dense training data.

In an effort to generate a measure of confidence for an ensemble of classifiers in general, and for Learn++ in particular, we expand the softmax concept by using the individual classifier weights in place of a single expert's output. The ensemble confidence, estimating the posterior probability, can therefore be calculated as

  C_j(x) = e^{F_j(x)} / Σ_{k=1}^{N} e^{F_k(x)} ≈ P(ω_j | x)    (3)

where

  F_j(x) = Σ_t log(1/β_t) if h_t(x) = ω_j, and 0 otherwise    (4)

The confidence C_j(x) associated with class ω_j for instance x is therefore the exponential of the sum of the weights of the classifiers that selected class ω_j, divided by the sum of the aforementioned exponentials corresponding to each class. The significance of this confidence estimation scheme is in its consideration of the diversity in the classifier decisions: in calculating the confidence of class ω_j, the confidence will increase if the classifiers that did not choose class ω_j have varying decisions, as opposed to a common decision, that is, if the evidence against class ω_j is not strong. On the other hand, the confidence will decrease if the classifiers that did not choose class ω_j have a common decision, that is, if there is strong evidence against class ω_j.
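Assuming the (classifiers, log_weights) pair produced by a Learn++-style training loop such as the sketch in Section 2, Eqs. (3) and (4) translate into a few lines of Python: F_j(x) accumulates the voting weights log(1/β_t) of the classifiers that chose class j, and the confidences are the softmax of these sums. The function name ensemble_confidence is ours.

```python
import numpy as np

def ensemble_confidence(classifiers, log_weights, X, n_classes):
    """Return an (n_samples, n_classes) array of confidences C_j(x), Eqs. (3)-(4)."""
    F = np.zeros((len(X), n_classes))
    for clf, w in zip(classifiers, log_weights):
        F[np.arange(len(X)), clf.predict(X)] += w    # accumulate evidence F_j(x)
    F = F - F.max(axis=1, keepdims=True)             # stabilize the exponentials
    expF = np.exp(F)
    return expF / expF.sum(axis=1, keepdims=True)    # softmax over classes, Eq. (3)
```

Because F_j(x) is exactly the total vote for class j, the ensemble's decision is arg max_j F_j(x), and its confidence is the corresponding C_j(x); subtracting the row maximum before exponentiating leaves the softmax unchanged but keeps it numerically stable.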

4 Simulation Results

In order to find out if and how well the Learn++ ensemble confidence approximates the Bayesian posterior probability, the modified softmax approach was analyzed on three increasingly difficult problems. In order to calculate the theoretical Bayesian posterior probabilities, and hence compare the Learn++ confidences to the Bayesian probabilities, experimental data were generated from Gaussian distributions. For training, random instances were selected from each class distribution, using which an ensemble of 30 MLP classifiers was generated with Learn++. The data and classifier generation process was then repeated and averaged, with randomly selected data, to ensure generality. For each simulation, we also benchmark the results by calculating the mean square error between the Learn++ and Bayes confidences over the entire grid of the feature space, with each classifier added to the ensemble.

4.1 Experiment 1

A two-feature, three-class problem, where each class has a known Gaussian distribution, is shown in Fig. 2. In this experiment classes 1, 2, and 3 have a variance of 0.5 and are centered at [-, ], [, ], and [, -], respectively. Since the distribution is known (and is Gaussian), the actual posterior probability can be calculated from Equation 1, given the known likelihood P(x | ω_j), which can be calculated as

  P(x | ω_j) = 1 / ((2π)^{d/2} |Σ_j|^{1/2}) · exp( -½ (x - μ_j)^T Σ_j^{-1} (x - μ_j) )    (5)

where d is the dimensionality, and μ_j and Σ_j are the mean and the covariance matrix of the distribution from which the j-th class data are generated. Each class was equally likely, hence P(ω_j) = 1/3. For each instance over the entire grid of the feature space shown in Fig. 2, we calculated the posterior probability of the class chosen by the Bayes classifier and plotted it as a confidence surface, as shown in Fig. 3a. Calculating the confidences of the Learn++ decisions on the same feature space provided the plot in Fig. 3b, indicating that the ensemble confidence surface closely approximates that of the Bayes classifier.

Fig. 2. Data distributions used in Experiment 1
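Since the class densities are Gaussian with known parameters, the Bayes confidence surface of Fig. 3a follows directly from Eqs. (1) and (5). The sketch below does this on a regular grid; the class centers are placeholders rather than the exact values used in the experiment, and equal priors are assumed as in the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

means = [np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([1.0, -1.0])]  # placeholder centers
cov = 0.5 * np.eye(2)                                   # variance 0.5 on each feature, as in Experiment 1

xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
grid = np.column_stack([xx.ravel(), yy.ravel()])

# Eq. (5): Gaussian likelihoods; Eq. (1): posteriors (the equal priors P(w_j) = 1/3 cancel out)
likelihoods = np.column_stack([multivariate_normal(mean=m, cov=cov).pdf(grid) for m in means])
posteriors = likelihoods / likelihoods.sum(axis=1, keepdims=True)
bayes_confidence = posteriors.max(axis=1)               # posterior of the class the Bayes classifier picks
bayes_surface = bayes_confidence.reshape(xx.shape)      # the surface plotted in Fig. 3a
```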

Fig. 3. (a) Bayesian and (b) Learn++ confidence surface for Experiment 1

It is interesting to note that the confidences in both cases plummet around the decision boundaries and approach 1 away from the decision boundaries, an outcome that makes intuitive sense. To quantitatively determine how closely the Learn++ confidence approximates that of the Bayes classifier, and how this approximation changes with each additional classifier, the mean squared error (MSE) was calculated between the ideal Bayesian confidence surface and the Learn++ confidence over the entire grid of the feature space, for each additional classifier added to the ensemble. As seen in Fig. 4, the MSE between the two decreases as new classifiers are added to the ensemble, an expected but nevertheless immensely satisfying outcome. Furthermore, the decrease in the error is exponential and rather monotonic, and does not appear to indicate any over-fitting, at least for as many as 30 classifiers added to the ensemble.

The ensemble confidence was then compared to that of a single MLP classifier, where the confidence was calculated using the MLP's raw output values. The mean squared error was calculated between the resulting confidence and the Bayesian confidence, and has been plotted as a dotted line in Fig. 4 in comparison to the Learn++ confidence. The single MLP differs from classifiers generated using the Learn++ algorithm on two accounts. First, the single MLP is trained using all of the training data, whereas each classifier in the Learn++ ensemble is trained on only a subset of the training data. Also, the Learn++ confidence is based on the discrete decision of each classifier. If there were only one classifier in the ensemble, all classifiers would agree, resulting in a confidence of 1. Therefore, the confidence of a single MLP can only be calculated based on the (softmax normalized) actual output values, unlike Learn++, which uses a weighted vote of the discrete output labels.
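The benchmark behind Fig. 4 (and Figs. 7 and 10) can be sketched by re-evaluating the ensemble confidence after each classifier is added and comparing it with the Bayes surface. The snippet below reuses the hypothetical ensemble_confidence function and the grid and bayes_confidence arrays from the earlier sketches.

```python
import numpy as np

def mse_vs_ensemble_size(classifiers, log_weights, grid, bayes_confidence, n_classes):
    """MSE between the Learn++ and Bayes confidence surfaces, after each added classifier."""
    errors = []
    for t in range(1, len(classifiers) + 1):
        conf = ensemble_confidence(classifiers[:t], log_weights[:t], grid, n_classes)
        ensemble_surface = conf.max(axis=1)          # confidence of the class the ensemble picks
        errors.append(np.mean((ensemble_surface - bayes_confidence) ** 2))
    return np.array(errors)                          # expected to decay as in Figs. 4, 7 and 10
```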

4.2 Experiment 2

To further characterize the behavior of this confidence estimation scheme, Experiment 1 was repeated by increasing the variances of the class distributions from 0.5 to 0.75, resulting in a more overlapping distribution (Fig. 5) and a tougher classification problem. Learn++ was trained with data generated from this distribution, and its confidence was calculated over the entire grid of the feature space and plotted in comparison to that of the Bayes classifier in Fig. 6. We note that the low-confidence valleys around the decision boundaries are wider in this case, an expected outcome of the increased variance.

Fig. 4. Mean square error as a function of number of classifiers - Experiment 1

Fig. 5. Data distributions used in Experiment 2

Fig. 6. (a) Bayesian and (b) Learn++ confidence surface for Experiment 2

Fig. 7. Mean square error as a function of number of classifiers - Experiment 2

Fig. 7 shows that the MSE between the Bayes and Learn++ confidences is once again decreasing as new classifiers are added to the ensemble. Fig. 7 also compares the Learn++ performance to that of a single MLP, shown as the dotted line, as described above.

4.3 Experiment 3

Finally, an additional class was added to the distribution from Experiment 2, with a variance of 0.5 and mean at [ ] (Fig. 8), making it an even more challenging classification problem due to the additional overlap between classes. Similar to the previous two experiments, an ensemble of 30 classifiers was generated by Learn++ and trained with data drawn from the above distribution. The confidence of the ensemble over the entire feature space was calculated and plotted in comparison with the posterior probability based confidence of the Bayes classifier over the same feature space. Fig. 9 shows these confidence plots, where the Learn++ based ensemble confidence (Fig. 9b) closely approximates that of Bayes (Fig. 9a).

Fig. 8. Data distributions used in Experiment 3

Fig. 9. (a) Bayesian and (b) Learn++ confidence surface for Experiment 3

Fig. 9 indicates that Learn++ assigns a larger peak confidence to the middle class than the Bayes classifier. Since the Learn++ confidence is based on the discrete decision of each classifier, when a test instance is presented from this portion of the space, most classifiers agree on the middle class, resulting in a high confidence. However, the Bayesian confidence is based on the distribution of the particular class and the distribution overlap of the surrounding classes, thus lowering the confidence. Finally, the MSE between the Learn++ confidence and the Bayesian confidence, plotted in Fig. 10 as a function of ensemble population, shows the now-familiar characteristic of decreasing error with each new classifier added to the ensemble. For comparison, a single MLP was also trained on the same data, and its mean squared error with respect to the Bayesian confidence is shown by a dotted line.

Fig. 10. Mean square error as a function of number of classifiers - Experiment 3

5 Conclusions and Discussions

In this contribution we have shown that the confidence of an ensemble based classification algorithm in its own decision can easily be calculated as an exponentially normalized ratio of the weights. Furthermore, we have shown, on three experiments of increasingly difficult Gaussian distributions, that the confidence calculated in this way approximates the posterior probability of the class chosen by the optimal Bayes classifier. In each case, we have observed that the confidences calculated by Learn++ approximated the Bayes posterior probabilities rather well. However, in order to quantitatively assess exactly how close the approximation was, we have also computed the mean square error between the two over the entire grid of the feature space on which the two classifiers were evaluated. We have plotted this error as a function of the number of classifiers in the ensemble, and noticed that the error decreased exponentially and monotonically as the number of classifiers increased; an intuitive, yet quite satisfying outcome. No over-fitting effects were observed after as many as 30 classifiers, and the final confidences estimated by Learn++ were typically within a few percent of the posterior probabilities calculated for the Bayes classifier. While these results were obtained by using Learn++ as the ensemble algorithm, they should generalize well to other ensemble and/or boosting based algorithms.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. ECS-0239090, "CAREER: An Ensemble of Classifiers Approach for Incremental Learning."

References

1. Kuncheva L.I., Combining Pattern Classifiers: Methods and Algorithms, Hoboken, NJ: Wiley Interscience, 2004.
2. Freund Y. and Schapire R., A decision theoretic generalization of on-line learning and an application to boosting, Computer and System Sciences, vol. 57, no. 1, pp. 119-139, 1997.
3. Kuncheva L.I., A theoretical study on six classifier fusion strategies, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 281-286, 2002.
4. Lewitt M. and Polikar R., An ensemble approach for data fusion with Learn++, Proc. 4th Int. Workshop on Multiple Classifier Systems (Windeatt T. and Roli F., eds.), LNCS vol. 2709, pp. 176-186, Berlin: Springer, 2003.
5. Polikar R., Udpa L., Udpa S., and Honavar V., Learn++: An incremental learning algorithm for supervised neural networks, IEEE Trans. on Systems, Man and Cybernetics (C), vol. 31, no. 4, pp. 497-508, 2001.
6. Duin R.P., Tax M., Classifier conditional posterior probabilities, Lecture Notes in Computer Science, LNCS vol. 1451, pp. 611-619, Berlin: Springer, 1998.
7. Duda R., Hart P., Stork D., Pattern Classification, 2/e, Chap. 3 & 4, New York, NY: Wiley Interscience, 2001.
8. Alpaydin E. and Jordan M., Local linear perceptrons for classification, IEEE Transactions on Neural Networks, vol. 7, no. 3, pp. 788-792, 1996.
9. Wilson D., Martinez T., Combining cross-validation and confidence to measure fitness, Proc. IEEE Int. Joint Conf. on Neural Networks, pp. 1409-1414, 1999.