On the optimality of Naïve Bayes with dependent binary features

Pattern Recognition Letters 27 (2006)

On the optimality of Naïve Bayes with dependent binary features

Ludmila I. Kuncheva *

School of Informatics, University of Wales, Dean Street, Bangor, Gwynedd LL57 1UT, UK

* E-mail address: L.I.Kuncheva@bangor.ac.uk

Received 2 July 2005; received in revised form 7 November 2005; available online 3 January 2006
Communicated by Prof. F. Roli

Abstract

While the Naïve Bayes classifier (NB) is Bayes-optimal for independent features, we prove that it is also optimal for two equiprobable classes and two features with equal class-conditional covariances. Although strict optimality does not extend to three features, equal covariances are expected to be beneficial in higher-dimensional spaces. © 2005 Elsevier B.V. All rights reserved.

Keywords: Statistical pattern recognition; Naïve Bayes classifier (NB); Optimality of NB; Dependent binary features

1. Introduction

The Naïve Bayes classifier (NB), also called idiot's Bayes, continues to receive a lot of praise in the literature due to its simplicity and accuracy (Langley et al., 1992; Hand and Yu, 2001; Jamain and Hand, 2005). NB is Bayes-optimal, i.e., it guarantees minimum classification error, when the features in the problem are independent. However, it is well documented that NB is consistently good far beyond this optimality condition. The word most used to describe its performance is "surprising". Many studies look for explanations of this phenomenon and try to establish necessary and sufficient conditions for the optimality of NB. While important arguments and results have already been formulated (details are given in Section 2 below), there is no generally valid set of such necessary and sufficient conditions.

This study looks for further insights into the optimality of NB for binary features. The motivation came from a problem in veterinary medicine where the signs measured on cattle are binary and the diagnosis is one of two possible classes, BSE or not BSE. BSE is a notifiable fatal neurodegenerative disease in cattle which has no known cure. There was a BSE epidemic in Britain in the 1990s and, with the first BSE case diagnosed in the USA at the end of 2003, the problem has become one of global importance. The curious aspect of this problem was that there was no data set as such, only estimates of the class-conditional probabilities by domain experts. Given are only the marginal probabilities, $P(x_i = 1 \mid \omega_k)$, where $x_i$ is the $i$th sign (feature), value 1 means that the sign is present in the animal, and $\omega_k$ is the class label (BSE or not BSE). As no further information is available, the features must be considered as independent, hence NB can be applied as the optimal classifier. Assuming that the probability estimates are the exact probabilities, the question is how different is NB from the true optimal Bayes classifier? As there is no data set, an experimental verification is not possible. This prompted the question of how much dependence NB can tolerate and still be optimal.

The rest of the paper is organized as follows. Section 2 gives a brief account of some important results from the literature explaining why NB works when the independence assumption does not hold. In Section 3, a two-feature two-class problem is considered. We prove that NB is optimal if the dependencies between the two features are the same for both classes.

Unfortunately, this result does not extend beyond two features. Section 4 looks into three binary features and brings into the study non-pairwise measures of dependence: $Q_{123}$, divergence and two distances between probability distributions. Section 5 summarizes the results and outlines other possible explanation routes.

2. Naïve Bayes: why is it so successful?

Let $x = [x_1, \ldots, x_n]^T$ be a feature vector. To label it in one of the $c$ classes of the problem, $\omega_1, \ldots, \omega_c$, we use the posterior probability $P(\omega_k \mid x)$. Choosing the class corresponding to the largest posterior probability for the respective $x$ guarantees minimum error across the whole space spanned by the $n$ features. We shall call this the Bayes classifier, and the corresponding error, the Bayes error, $E_B$. The posterior probabilities are calculated as

$$P(\omega_k \mid x) = \frac{p(x \mid \omega_k) P(\omega_k)}{p(x)},$$

where $p(x \mid \omega_k)$ is the class-conditional probability density function (pdf) conditioned on $\omega_k$, $P(\omega_k)$ is the prior probability for $\omega_k$, and $p(x)$ is the unconditional pdf. The Naïve Bayes classifier (NB) assumes conditional independence between the features and calculates the class-conditional pdf as a product of $n$ individual pdfs

$$p(x \mid \omega_k) = \prod_{i=1}^{n} p(x_i \mid \omega_k). \qquad (1)$$

If the independence assumption does not hold, then the approximation of $p(x \mid \omega_k)$ is inaccurate, which may lead to misclassifications. The surprising success of NB has been attributed to various estimation properties (Hand and Yu, 2001; Domingos and Pazzani, 1997). NB estimates fewer parameters than other popular models, therefore it is less prone to overtraining, especially for small sample sizes. The traditional pre-selection of features tends to eliminate correlated features anyway, therefore the independence assumption may nearly hold for the remaining feature subset.

The most important explanation, though, lies in the fact that conditional independence is only a sufficient but not a necessary condition for optimality of NB (Hand and Yu, 2001; Domingos and Pazzani, 1996, 1997; Zhang, 2004; Rish, 2001; Rish et al., 2001). Indeed, the accuracy of the approximation is irrelevant as long as for any $x$ the largest posterior probability corresponds to the same class as with the true posterior probabilities. In fact, if there are more than two classes, even the order of the other posterior probabilities is irrelevant. It seems, though, that the quest for quantifying the degree of dependence which NB can tolerate, started by Langley et al. (1992), is still ongoing. Domingos and Pazzani (1996, 1997), Rish (2001), Rish et al. (2001), Zhang (2004) and others have identified cases where NB is optimal and other cases where it is not.

Consider, for example, continuous-valued variables and two Gaussian classes. If the covariance matrices for the classes are diagonal, then the features are independent and NB is the optimal model. The class-conditional pdfs can be decomposed into individual Gaussians and calculated as in (1). If the features are dependent and the covariance matrices are equal, $\Sigma_1 = \Sigma_2$, then NB will also recover the correct (linear) classification boundary despite the flawed approximation of the pdfs. There are, however, limits on the ability of NB to reach near-optimal performance. Take, for example, a linearly separable pair of non-Gaussian classes. A linear classifier trained by the perceptron algorithm is guaranteed to learn the classification boundary while NB is not.
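To make the two decision rules concrete, here is a minimal sketch of the Bayes rule and the NB product approximation of Eq. (1) for discrete features. The joint pmfs and priors below are illustrative values of my own, not taken from the paper.

```python
from itertools import product

# Hypothetical joint pmfs over two binary features, one dict per class.
pmf = [
    {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.25, (1, 1): 0.25},  # class w1
    {(0, 0): 0.10, (0, 1): 0.40, (1, 0): 0.30, (1, 1): 0.20},  # class w2
]
priors = [0.5, 0.5]

def marginal(p, i, v):
    """P(x_i = v | w_k), obtained by summing the joint pmf over the other feature."""
    return sum(q for key, q in p.items() if key[i] == v)

for x in product((0, 1), repeat=2):
    bayes = [priors[k] * pmf[k][x] for k in range(2)]                 # true P(x|w_k)P(w_k)
    naive = [priors[k] * marginal(pmf[k], 0, x[0]) * marginal(pmf[k], 1, x[1])
             for k in range(2)]                                       # Eq. (1) approximation
    print(x, "Bayes -> w%d" % (bayes.index(max(bayes)) + 1),
             "NB -> w%d" % (naive.index(max(naive)) + 1))
```

NB is optimal exactly when the two printed labels agree at every $x$; the approximated scores themselves may differ from the true ones without affecting the decisions.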
3. Optimality of NB for two binary features

Let $x = [x_1, x_2]^T$, where $x_1, x_2 \in \{0, 1\}$. We assume that we have complete knowledge of the true probabilities in the problem, i.e., all $P(x \mid \omega_j)$, $j = 1, 2$, are given¹ for all four values of $x$. To facilitate the algebraic manipulations, the eight probabilities are denoted by $a, b, \ldots, h$ as shown in Table 1(a) and (b).

¹ We shall denote probability mass functions by capital $P$.

While dependence is not defined in a unique, commonly agreed way in the space of categorical variables, for quantitative variables such measures exist. To evaluate dependence, we shall treat the 0 and the 1 as numbers and calculate the covariance between the two binary features, separately for each class. The mean of $x_1$ given class $\omega_1$ is $\mu_1 = 0 \cdot (a + b) + 1 \cdot (c + d) = c + d$. The mean of $x_2$ is, respectively, $\mu_2 = b + d$. The covariance is the expectation of $(x_1 - \mu_1)(x_2 - \mu_2)$ (summed across the four values of $x$ and weighted by the respective probability):

$$
\begin{aligned}
\mathrm{Cov}(x_1, x_2 \mid \omega_1) ={}& a\,(0 - (c + d))(0 - (b + d)) + b\,(0 - (c + d))(1 - (b + d)) \\
&+ c\,(1 - (c + d))(0 - (b + d)) + d\,(1 - (c + d))(1 - (b + d)) \\
={}& ad - bc. \qquad (2)
\end{aligned}
$$
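As a quick sanity check of Eq. (2), the covariance computed from its definition can be compared against the closed form $ad - bc$; this helper is a sketch of mine, not code from the paper.

```python
def binary_covariance(a, b, c, d):
    """Covariance of two 0/1 features with joint pmf
    P(0,0) = a, P(0,1) = b, P(1,0) = c, P(1,1) = d (a + b + c + d = 1)."""
    joint = {(0, 0): a, (0, 1): b, (1, 0): c, (1, 1): d}
    mu1 = c + d   # E[x1]
    mu2 = b + d   # E[x2]
    cov = sum(p * (x1 - mu1) * (x2 - mu2) for (x1, x2), p in joint.items())
    assert abs(cov - (a * d - b * c)) < 1e-12   # agrees with Eq. (2)
    return cov
```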

Table 1
Class-conditional probability mass functions for classes ω1 and ω2 for two binary features

(a) Class ω1        x2 = 0    x2 = 1
    x1 = 0          a         b
    x1 = 1          c         d
(a + b + c + d = 1)

(b) Class ω2        x2 = 0    x2 = 1
    x1 = 0          e         f
    x1 = 1          g         h
(e + f + g + h = 1)

Proposition. Let $x = [x_1, x_2]^T$, where $x_1, x_2 \in \{0, 1\}$, and let $\omega_1$ and $\omega_2$ be the classes of interest with $P(\omega_1) = P(\omega_2) = \frac{1}{2}$. If $\mathrm{Cov}(x_1, x_2 \mid \omega_1) = \mathrm{Cov}(x_1, x_2 \mid \omega_2)$, then the Naïve Bayes classifier (NB) is optimal for this problem.

Proof. For NB to make a mistake for some $x$, one of the following must be true:

$$P(x \mid \omega_1) P(\omega_1) > P(x \mid \omega_2) P(\omega_2) \ \text{and}\ P(x_1 \mid \omega_1) P(x_2 \mid \omega_1) P(\omega_1) < P(x_1 \mid \omega_2) P(x_2 \mid \omega_2) P(\omega_2),$$

or

$$P(x \mid \omega_1) P(\omega_1) < P(x \mid \omega_2) P(\omega_2) \ \text{and}\ P(x_1 \mid \omega_1) P(x_2 \mid \omega_1) P(\omega_1) > P(x_1 \mid \omega_2) P(x_2 \mid \omega_2) P(\omega_2).$$

Without loss of generality, consider $x = [0, 0]^T$ and let

$$a > e. \qquad (3)$$

For NB to make a mistake and assign $\omega_2$ to $x = [0, 0]^T$, we must have

$$\tfrac{1}{2} P(x_1 = 0 \mid \omega_1) P(x_2 = 0 \mid \omega_1) < \tfrac{1}{2} P(x_1 = 0 \mid \omega_2) P(x_2 = 0 \mid \omega_2), \qquad (4)$$

$$(a + b)(a + c) < (e + f)(e + g),$$
$$(a + b)(a + c) - (e + f)(e + g) < 0,$$
$$a^2 + ac + ab + bc - e^2 - eg - ef - fg < 0. \qquad (5)$$

From the equality of the covariances, $ad - bc = eh - fg$,

$$bc = ad - eh + fg. \qquad (6)$$

Substituting in (5),

$$a^2 + ac + ab + ad - eh + fg - e^2 - eg - ef - fg < 0,$$
$$a(a + b + c + d) - e(e + f + g + h) < 0,$$
$$a < e. \qquad (7)$$

The above result contradicts (3), therefore (4) cannot be true and NB makes the same decision as the Bayes classifier would. The same argument holds for the other three values of $x$. □

Unfortunately, this optimality argument does not hold even for the simple case where the two classes are not equiprobable. Denote $p = P(\omega_1)$ and, respectively, $1 - p = P(\omega_2)$. An error will occur for $x = [0, 0]^T$ if $pa > (1 - p)e$ and $p(a + c)(a + b) < (1 - p)(e + g)(e + f)$. The former is equivalent to

$$pa - (1 - p)e > 0. \qquad (8)$$

Developing the latter (using (6)) leads to

$$pa - (1 - p)e < (2p - 1)\underbrace{(eh - gf)}_{\text{the covariance}}. \qquad (9)$$

In order to force a contradiction, the right-hand side of (9) should be required to be negative. For this to hold, we need either $p > 0.5$ and negative covariance, or $p < 0.5$ and positive covariance. Notice that if the covariance is zero (independence), then the contradiction is in place for any $p$, and NB is optimal. It is interesting to find out whether there are conditions on $p$ and the sign of the covariance for which NB is guaranteed to be optimal. Suppose that we do require $p > 0.5$ and a negative covariance. In this case, for $pa > (1 - p)e$, NB is guaranteed to make the correct decision because (8) and (9) cannot hold simultaneously. It is possible that $pb > (1 - p)f$ (this can be shown by an example), therefore

$$pb - (1 - p)f > 0. \qquad (10)$$

The corresponding inequality for NB is

$$pb - (1 - p)f < -(2p - 1)(eh - gf). \qquad (11)$$

As the right-hand side is positive, this inequality may or may not hold together with (10). This argument shows that there is no simple condition that guarantees optimality of NB for two classes, two binary features and equal class-conditional covariances when the prior probabilities are not equal.
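The Proposition lends itself to a direct numerical check. The sketch below is my own (rejection sampling is used to impose approximately equal covariances); it confirms that NB matches the Bayes classifier at all four points when $p = 0.5$, and can be rerun with other priors to find counterexamples.

```python
import random

def normalized(n):
    """A random pmf with n cells."""
    v = [random.random() for _ in range(n)]
    s = sum(v)
    return [x / s for x in v]

def nb_scores(p):
    """NB scores (products of marginals) for p = [a, b, c, d] of Table 1."""
    a, b, c, d = p
    return {(0, 0): (a + b) * (a + c), (0, 1): (a + b) * (b + d),
            (1, 0): (c + d) * (a + c), (1, 1): (c + d) * (b + d)}

def nb_matches_bayes(p1, p2, prior):
    """True iff NB picks the Bayes-optimal class at all four points."""
    cells = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}
    m1, m2 = nb_scores(p1), nb_scores(p2)
    for x, i in cells.items():
        bayes = prior * p1[i] >= (1 - prior) * p2[i]
        naive = prior * m1[x] >= (1 - prior) * m2[x]
        if bayes != naive:
            return False
    return True

hits, trials = 0, 200
for _ in range(trials):
    while True:   # rejection sampling: covariances ad-bc and eh-fg nearly equal
        p1, p2 = normalized(4), normalized(4)
        if abs((p1[0]*p1[3] - p1[1]*p1[2]) - (p2[0]*p2[3] - p2[1]*p2[2])) < 1e-3:
            break
    hits += nb_matches_bayes(p1, p2, prior=0.5)
print(hits, "of", trials, "agree")   # expected: all, up to the 1e-3 tolerance
```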

To evaluate the extent to which the prior probability influences the optimality of NB in the two-feature two-class case, simulation experiments were carried out. 10,000 random sets of probabilities $a, b, \ldots, h$ and $p$ were generated so that the class-conditional covariances were equal. Fig. 1(a) shows the scatterplot of the differences between the NB error and the Bayes error, $E_{NB} - E_B$, for the 10,000 data points versus the prior probability. Another 10,000 sets of probabilities were generated, this time without the restriction of equal covariances. The scatterplot is given in Fig. 1(b).

Fig. 1. Scatterplot of the error difference between the Naïve Bayes classifier (NB) and the Bayes (optimal) classifier versus the prior probability for class ω1. (a) Equal covariances and (b) no restrictions.

It is clear that the restriction of equal covariances, although not guaranteeing optimality, brings NB very close to the Bayes (optimal) classifier. For equiprobable classes, $p = 0.5$, NB is indeed optimal, as seen in Fig. 1(a). If the covariances are not equal, the largest discrepancies between NB and the optimal classifier occur for $p$ about 0.5. We also calculated the mean of $E_{NB} - E_B$. For equal covariances, this value was 0.0018 (std 0.0066), and for the unrestricted probabilities the mean was 0.0135 (std 0.0320). NB was not optimal in about 13% of the cases with equal covariances and in about 32% of the cases with no restriction. This shows that equal covariances are a strong prerequisite for near-optimal performance of NB.

4. Optimality of NB for three binary features

It is interesting to see whether the results from the previous section carry forward to more than two features. Let $x = [x_1, x_2, x_3]^T$, where $x_1, x_2, x_3 \in \{0, 1\}$. Consider again the two-class problem with $P(\omega_1) = P(\omega_2) = \frac{1}{2}$. Assume that the pairwise covariances are equal across the two classes for all pairs of features, i.e.,

$$\mathrm{Cov}(x_i, x_j \mid \omega_1) = \mathrm{Cov}(x_i, x_j \mid \omega_2) = C_{ij}, \quad i, j = 1, 2, 3, \ i \neq j. \qquad (12)$$

The following example demonstrates that NB is not optimal in this case. Table 2 shows the class-conditional probability mass functions for the two classes. The class-conditional pairwise covariances are equal for $\omega_1$ and $\omega_2$: 0.0434 between $x_1$ and $x_2$, 0.0203 between $x_1$ and $x_3$, and 0.0088 between $x_2$ and $x_3$. The marginal distributions are given in the bottom rows of Table 2. Using these, we have

$$P([0, 0, 0]^T \mid \omega_1) < P([0, 0, 0]^T \mid \omega_2) \quad (0.276 < 0.2856)$$

and

$$P(x_1 = 0 \mid \omega_1) P(x_2 = 0 \mid \omega_1) P(x_3 = 0 \mid \omega_1) > P(x_1 = 0 \mid \omega_2) P(x_2 = 0 \mid \omega_2) P(x_3 = 0 \mid \omega_2) \quad (0.2226 > 0.1974).$$

While the Bayes classifier will label $x = [0, 0, 0]^T$ in $\omega_2$, NB will label it in $\omega_1$. This shows that having equal class-conditional covariances is not a sufficient condition for optimality of NB for three features, even for equiprobable classes.

To estimate the effect of equal covariances on the optimality of NB, we carried out a simulation study for the problem with three binary features and two equiprobable classes. 10,000 random sets of probability mass functions, $P(x \mid \omega_1)$ and $P(x \mid \omega_2)$, were generated where the three covariances, $C_{12}$, $C_{13}$ and $C_{23}$, were the same for both classes. (The details of the generation procedure are given in Appendix A; a sketch of the per-pair error computation is given below.) Fig. 2 shows the sorted values of $E_{NB} - E_B$ for the 20,000 points. The curve with the equal-covariance restriction lies underneath the curve for the distributions generated without the restriction, indicating that equal covariances are beneficial.
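The error comparison behind these curves can be sketched as follows (my own code, not from the paper): for a pair of joint pmfs over $\{0,1\}^3$ and equiprobable classes, the Bayes error collects the smaller half of the probability mass at each $x$, while the NB error collects the mass of whichever class the product-of-marginals rule rejects.

```python
from itertools import product

def marg(pmf, i, v):
    """P(x_i = v) under the joint pmf (keys are tuples over {0,1}^3)."""
    return sum(p for key, p in pmf.items() if key[i] == v)

def error_rates(pmf1, pmf2):
    """(Bayes error, NB error) for two equiprobable classes."""
    e_bayes = e_nb = 0.0
    for x in product((0, 1), repeat=3):
        e_bayes += 0.5 * min(pmf1[x], pmf2[x])
        s1 = s2 = 1.0
        for i, v in enumerate(x):
            s1 *= marg(pmf1, i, v)
            s2 *= marg(pmf2, i, v)
        e_nb += 0.5 * (pmf2[x] if s1 >= s2 else pmf1[x])
    return e_bayes, e_nb   # E_NB - E_B >= 0 always
```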
In about 92% of the cases generated without the restriction, NB makes at least one mistake out of the eight possible values of $x$. For the equal-covariances case this figure is about 57%. Also, the mean error differences are 0.0564 (std 0.0486) for the unrestricted case and 0.0157 (std 0.0294) for the equal-covariances case. The results show that equal dependencies of second order are insufficient to guarantee optimality of NB for three features. An additional measure of dependency may therefore be useful. Our first choice was the three-way Q statistic (Yule, 1900).

Table 2
Joint and marginal class-conditional probability mass functions for classes ω1 and ω2 for three binary features and equal covariances across the two classes. Parts (a) and (b) list $P(x)$ for the eight combinations of $(x_1, x_2, x_3)$ for classes ω1 and ω2, respectively; the bottom rows give the marginals $P(x_i = 1 \mid \omega_1)$ and $P(x_i = 1 \mid \omega_2)$. (The numerical entries are not reproduced here.)

Fig. 2. Sorted values of the error differences for the 10,000 points with and without equal covariances.

Consider the two-way table with probabilities as in Table 1(a). The odds ratio is $r = ad/bc$. Q is meant to serve as a correlation measure varying between −1 and 1, with the value 0 corresponding to independence. The transformation which achieves this is $(r - 1)/(r + 1)$, leading to

$$Q = \frac{ad - bc}{ad + bc}. \qquad (13)$$

The numerator of Q is the covariance between $x_1$ and $x_2$ as in (2), so for independent variables $Q = 0$. For a three-way table, there are two odds ratios, $r_1$ and $r_2$, e.g., for $x_3 = 0$ and $x_3 = 1$, respectively. Their ratio, $r_1 / r_2$, is now normalized to give the three-way Q. Denote by $P_{uvw}$ the probability that $x_1 = u$, $x_2 = v$ and $x_3 = w$, where $u, v, w \in \{0, 1\}$. Then

$$Q_{123} = \frac{P_{111} P_{100} P_{010} P_{001} - P_{000} P_{110} P_{101} P_{011}}{P_{111} P_{100} P_{010} P_{001} + P_{000} P_{110} P_{101} P_{011}}. \qquad (14)$$

The information included in $Q_{123}$ is additional to the two-way Q's, which is the reason to include it in this study. The question is whether similar values of $Q_{123}$ for $\omega_1$ and $\omega_2$ correspond to NB being close to optimal. Fig. 3(a) plots the error differences, $E_{NB} - E_B$, against the difference $Q_{123}(\omega_1) - Q_{123}(\omega_2)$ for the 10,000 data points with equal covariances, and Fig. 3(b) gives the plot for the unrestricted case. The equal-covariances case shows a marked pattern whereby similar $Q_{123}$'s for the two classes bring NB close to optimality (lower error difference). This pattern is not clearly visible for the general case in Fig. 3(b). The pattern in Fig. 3(a) suggests that there might be an optimality condition where the corresponding class-conditional covariances are equal and also $Q_{123}(\omega_1) = Q_{123}(\omega_2)$.

Fig. 3. Scatterplot of the error difference between the Naïve Bayes classifier (NB) and the Bayes (optimal) classifier versus $Q_{123}(\omega_1) - Q_{123}(\omega_2)$, with and without equal covariances for the two classes. (a) Equal covariances and (b) no restrictions.
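Eqs. (13) and (14) translate directly into code; the following small helpers are a sketch of mine, with the joint pmf represented as a dict keyed by $(x_1, x_2, x_3)$ tuples.

```python
def q_two_way(a, b, c, d):
    """Two-way Q of Eq. (13) for the 2x2 pmf of Table 1(a)."""
    return (a * d - b * c) / (a * d + b * c)

def q123(pmf):
    """Three-way Q of Eq. (14); pmf maps (x1, x2, x3) in {0,1}^3 to P_uvw."""
    even = pmf[1, 1, 1] * pmf[1, 0, 0] * pmf[0, 1, 0] * pmf[0, 0, 1]
    odd = pmf[0, 0, 0] * pmf[1, 1, 0] * pmf[1, 0, 1] * pmf[0, 1, 1]
    return (even - odd) / (even + odd)
```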

Distances between distributions have been extensively used in pattern recognition for estimating bounds on the Bayes error of classifiers and, specifically, as criteria for feature selection (Webb, 1999; van der Heijden et al., 2004; Devijver and Kittler, 1982). We carried out simulation experiments as before with three distances in place of $Q_{123}$: divergence, the Bhattacharyya distance and the Matusita distance. All three of them estimate how far apart the joint distribution is from the distribution reconstructed through the independence model. The farther apart these distributions are, the larger the dependence between the features. Let $M_1$ and $M_2$ be the values of such a measure for $\omega_1$ and $\omega_2$, respectively. Again, we want to find out whether similar values of $M_1$ and $M_2$ mean that NB is close to optimality.

Let $x$ be a discrete variable. We consider two probability mass functions: the true joint distribution, $P^*(x)$, and the distribution derived through the independence model, $P_{NB}(x)$. NB uses the latter while the Bayes classifier uses the former. Divergence is a popular measure akin to the Kullback-Leibler divergence (Webb, 1999; van der Heijden et al., 2004). We used the discrete calculation

$$\mathrm{Divergence} = \sum_x \left( P^*(x) - P_{NB}(x) \right) \log \frac{P^*(x)}{P_{NB}(x)}. \qquad (15)$$

The results with the Bhattacharyya and Matusita distances were very similar to those with divergence, so we left them out of this paper. Fig. 4 shows the scatterplots of $E_{NB} - E_B$ versus $\mathrm{Divergence}(\omega_1) - \mathrm{Divergence}(\omega_2)$. Interestingly, there is no evidence that similar degrees of feature dependence for the two classes bring NB close to optimality, for either the general case or the case of equal covariances. This goes to show that $Q_{123}$ indeed complements the covariances in that it provides additional insight into possible optimality conditions for NB. Divergence, on the other hand, is not indicative in this respect. The span of the divergence differences is only slightly reduced for the distributions with equal covariances. The difference between plots (a) and (b) in Fig. 4 merely shows what we already observed: that NB is closer to the optimal classifier when the covariances between features are mirrored for the two classes.
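Eq. (15) compares the true joint pmf with its independence-model reconstruction. A minimal sketch of mine, assuming strictly positive cell probabilities so the logarithm is defined:

```python
import math

def nb_reconstruction(pmf):
    """Independence-model pmf P_NB built from the marginals of the joint pmf."""
    n = len(next(iter(pmf)))
    marg = [{v: sum(p for key, p in pmf.items() if key[i] == v) for v in (0, 1)}
            for i in range(n)]
    out = {}
    for x in pmf:
        q = 1.0
        for i, v in enumerate(x):
            q *= marg[i][v]
        out[x] = q
    return out

def divergence(pmf):
    """Discrete divergence of Eq. (15) between P* and P_NB."""
    pnb = nb_reconstruction(pmf)
    return sum((pmf[x] - pnb[x]) * math.log(pmf[x] / pnb[x]) for x in pmf)
```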
5. Conclusions

This paper looks for optimality conditions for the Naïve Bayes classifier (NB). In Section 3 we prove that, for two binary features and two equiprobable classes, NB is optimal for dependent features as long as the covariances for the two classes are equal. We also show that this optimality does not hold for different prior probabilities for the classes. Despite not being optimal, NB is close to the optimal classifier for the case of equal covariances (Fig. 1). Unfortunately, this optimality condition does not carry forward to three features, as shown in Section 4. Equal covariances again ensure that NB is closer to optimality than it is in the general case, but there is outstanding dependency of higher order not accounted for by the pairwise covariances. We picked four measures of non-pairwise dependency and looked into the hypothesis that similar values of these measures for the two classes may complement our optimality condition. Only the three-way $Q_{123}$ showed potential to be considered for the optimality condition (Fig. 3). Disappointingly, the popular divergence measure did not appear to be useful (Fig. 4).

The above results are intended as a step in the ongoing quest for building a set of optimality conditions for NB (Langley et al., 1992; Domingos and Pazzani, 1997; Hand and Yu, 2001). One problem with this topic is that, while the notion of independence is well defined, there is no agreed measure of dependency between two discrete features. We could have taken the Q statistic, or the correlation between the variables, or a myriad of measures available in the statistical literature (Sneath and Sokal, 1973). The problem is even worse when it comes to measuring dependency for three or more features.

There are several issues here. First, the fact that a relationship cannot be proven beyond the case of two features using pairwise correlations does not mean that such a relationship does not exist. A relationship may exist with another measure. Second, we found an interesting pattern for $Q_{123}$ which may prove to be a way forward in expanding (or rather relaxing) the sufficient optimality conditions for NB. However, this seems to be a dead end, as there is no definition of higher-order Q's. Third, divergence (the Bhattacharyya and Matusita distances as well) did not show any potential worthy of further exploration. According to these measures, NB optimality does not depend on how similar the feature dependencies are in the class-conditional distributions. The lack of a direct relationship between feature dependence and classification accuracy of NB has been well documented in the literature through extensive experimental studies. Although it has been hypothesized that the distribution of dependencies is more important than the magnitude of the dependencies itself (Zhang, 2004), there is little evidence in support of this.

Fourth, as Q is limited to up to three features and divergence-like measures are not particularly responsive, it is not clear how an optimality condition can be formulated in the general case. We can speculate that mirrored covariances for the two classes will be beneficial in the higher-dimensional cases. It is possible that different optimality conditions hold for odd and even numbers of features.

Defining rigorous and general conditions to explain why NB behaves as the optimal classifier is an intriguing research topic fuelled by curiosity rather than practicality. When a data set is available, it is easier to train and test NB than to calculate measures to estimate the chances of NB being optimal. The intuition is different when we come back to the veterinary problem which inspired this study. Since there is no data set, and NB is the only reasonable option, it will be reassuring to discover further conditions which might hold in reality and under which NB is optimal.

Acknowledgements

I am grateful to Dr. Peter D. Cockcroft, Department of Clinical Veterinary Medicine, University of Cambridge, UK, for introducing the BSE problem and data to our group. I would also like to thank Chris Whitaker, School of Psychology, University of Wales, Bangor, for the very helpful discussion on the Q statistic.

Appendix A

The algorithm below was used to randomly generate class-conditional distributions for two classes and three binary features such that the pairwise covariances are equal for $\omega_1$ and $\omega_2$. The probability mass function (pmf) for class $\omega_1$ for three binary features is shown in Table 3. Denote the corresponding pmf values for class $\omega_2$ by the capital letters $A, \ldots, H$.

Table 3
Class-conditional probability mass function for class ω1 for three binary features

x3 = 0          x2 = 0    x2 = 1
    x1 = 0      a         b
    x1 = 1      c         d

x3 = 1          x2 = 0    x2 = 1
    x1 = 0      e         f
    x1 = 1      g         h

Step 1. Generate $A, \ldots, H$ from a uniform random distribution within the unit interval and normalize so that the sum is 1.

Step 2. Form the two-way tables and find the covariances $C_{12}$, $C_{13}$ and $C_{23}$ for $\omega_2$. For example, the table for $x_1$ and $x_2$ will have entries $(A + E)$, $(B + F)$, $(C + G)$ and $(D + H)$, and the respective covariance will be

$$C_{12} = (A + E)(D + H) - (B + F)(C + G).$$

Step 3. Find the pmf for class $\omega_1$, i.e., calculate $a, b, \ldots, h$. As these eight parameters are bound by three equations for the covariances and one normalizing equation, four of the eight parameters can be drawn randomly. Suppose we pick randomly $e$, $f$, $g$ and $h$ and scale them so that their sum equals a random number between 0 and 1. This scaling is needed so that there is room for the other four parameters between 0 and 1. Then $a$, $b$, $c$ and $d$ are obtained as the solution of the following system of simultaneous equations:

$$(a + e)(d + h) - (b + f)(c + g) = C_{12},$$
$$(a + b)(g + h) - (c + d)(e + f) = C_{13},$$
$$(a + c)(f + h) - (b + d)(e + g) = C_{23},$$
$$a + b + c + d + e + f + g + h = 1.$$

The solution is

$$Z_1 = \frac{(1 - e - f - g - h)(g + h) - C_{13}}{e + f + g + h},$$
$$Z_2 = \frac{(1 - e - f - g - h)(f + h) - C_{23}}{e + f + g + h},$$
$$d = C_{12} - (1 - Z_1 - Z_2 - f - g - h)h + Z_1 Z_2 + Z_1 f + Z_2 g + fg,$$
$$a = d - Z_1 - Z_2 + 1 - e - f - g - h,$$
$$b = Z_2 - d, \qquad c = Z_1 - d.$$

Step 4. Being probabilities, $a, \ldots, h$ must be between 0 and 1. This is not guaranteed by the solution of the system at Step 3. Therefore, the last step is to check whether all values are in [0, 1]. If not, the new pmf $(a, \ldots, h)$ is discarded and the procedure starts again from Step 1.
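Steps 1-4 translate almost line for line into code. The sketch below is my rendering of the procedure (variable names follow Table 3, with cells a..h for class ω1 and A..H for class ω2):

```python
import random

def generate_equal_cov_pair():
    """Appendix A: a pmf pair for classes w1 and w2 over three binary features
    with equal pairwise covariances C12, C13 and C23."""
    while True:
        # Step 1: random normalized pmf A..H for class w2.
        raw = [random.random() for _ in range(8)]
        A, B, C, D, E, F, G, H = [v / sum(raw) for v in raw]

        # Step 2: pairwise covariances of w2 from the collapsed two-way tables.
        C12 = (A + E) * (D + H) - (B + F) * (C + G)
        C13 = (A + B) * (G + H) - (C + D) * (E + F)
        C23 = (A + C) * (F + H) - (B + D) * (E + G)

        # Step 3: draw e, f, g, h with sum equal to a random s in (0, 1),
        # then solve the system for a, b, c, d.
        quad = [random.random() for _ in range(4)]
        s = random.random()
        e, f, g, h = [v * s / sum(quad) for v in quad]
        Z1 = ((1 - s) * (g + h) - C13) / s
        Z2 = ((1 - s) * (f + h) - C23) / s
        d = C12 - (1 - Z1 - Z2 - f - g - h) * h + Z1*Z2 + Z1*f + Z2*g + f*g
        a = d - Z1 - Z2 + 1 - s
        b, c = Z2 - d, Z1 - d

        # Step 4: accept only if all eight values are valid probabilities.
        if all(0.0 <= v <= 1.0 for v in (a, b, c, d, e, f, g, h)):
            return (a, b, c, d, e, f, g, h), (A, B, C, D, E, F, G, H)
```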
References

Devijver, P., Kittler, J., 1982. Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs, NJ.
Domingos, P., Pazzani, M., 1996. Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In: Proc. 13th Internat. Conf. on Machine Learning.
Domingos, P., Pazzani, M., 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning 29.
Hand, D.J., Yu, K., 2001. Idiot's Bayes: not so stupid after all? Internat. Statist. Rev. 69.
Jamain, A., Hand, D.J., 2005. The Naïve Bayes mystery: A classification detective story. Pattern Recognition Lett.
Langley, P., Iba, W., Thompson, K., 1992. An analysis of Bayesian classifiers. In: Proc. 10th National Conf. on Artificial Intelligence.
Rish, I., 2001. An empirical study of the Naïve Bayes classifier. In: Proc. Internat. Joint Conf. on Artificial Intelligence, Workshop on Empirical Methods in AI.

Rish, I., Hellerstein, J., Thathachar, J., 2001. An analysis of data characteristics that affect Naïve Bayes performance. Tech. Rep. RC21993, IBM T.J. Watson Research Center.
Sneath, P., Sokal, R., 1973. Numerical Taxonomy. W.H. Freeman & Co.
van der Heijden, F., Duin, R.P.W., de Ridder, D., Tax, D.M.J., 2004. Classification, Parameter Estimation and State Estimation. Wiley, England.
Webb, A., 1999. Statistical Pattern Recognition. Arnold, London, England.
Yule, G., 1900. On the association of attributes in statistics. Phil. Trans. A 194.
Zhang, H., 2004. The optimality of Naïve Bayes. In: Proc. 17th Internat. FLAIRS Conf., Florida, USA.


More information

Convex Analysis and Economic Theory Winter 2018

Convex Analysis and Economic Theory Winter 2018 Division of the Humanities and Social Sciences Ec 181 KC Border Conve Analysis and Economic Theory Winter 2018 Toic 16: Fenchel conjugates 16.1 Conjugate functions Recall from Proosition 14.1.1 that is

More information

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning TNN-2009-P-1186.R2 1 Uncorrelated Multilinear Princial Comonent Analysis for Unsuervised Multilinear Subsace Learning Haiing Lu, K. N. Plataniotis and A. N. Venetsanooulos The Edward S. Rogers Sr. Deartment

More information

1 Probability Spaces and Random Variables

1 Probability Spaces and Random Variables 1 Probability Saces and Random Variables 1.1 Probability saces Ω: samle sace consisting of elementary events (or samle oints). F : the set of events P: robability 1.2 Kolmogorov s axioms Definition 1.2.1

More information

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming

Maximum Entropy and the Stress Distribution in Soft Disk Packings Above Jamming Maximum Entroy and the Stress Distribution in Soft Disk Packings Above Jamming Yegang Wu and S. Teitel Deartment of Physics and Astronomy, University of ochester, ochester, New York 467, USA (Dated: August

More information

Comptes rendus de l Academie bulgare des Sciences, Tome 59, 4, 2006, p POSITIVE DEFINITE RANDOM MATRICES. Evelina Veleva

Comptes rendus de l Academie bulgare des Sciences, Tome 59, 4, 2006, p POSITIVE DEFINITE RANDOM MATRICES. Evelina Veleva Comtes rendus de l Academie bulgare des ciences Tome 59 4 6 353 36 POITIVE DEFINITE RANDOM MATRICE Evelina Veleva Abstract: The aer begins with necessary and suicient conditions or ositive deiniteness

More information

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i

For q 0; 1; : : : ; `? 1, we have m 0; 1; : : : ; q? 1. The set fh j(x) : j 0; 1; ; : : : ; `? 1g forms a basis for the tness functions dened on the i Comuting with Haar Functions Sami Khuri Deartment of Mathematics and Comuter Science San Jose State University One Washington Square San Jose, CA 9519-0103, USA khuri@juiter.sjsu.edu Fax: (40)94-500 Keywords:

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

2 K. ENTACHER 2 Generalized Haar function systems In the following we x an arbitrary integer base b 2. For the notations and denitions of generalized

2 K. ENTACHER 2 Generalized Haar function systems In the following we x an arbitrary integer base b 2. For the notations and denitions of generalized BIT 38 :2 (998), 283{292. QUASI-MONTE CARLO METHODS FOR NUMERICAL INTEGRATION OF MULTIVARIATE HAAR SERIES II KARL ENTACHER y Deartment of Mathematics, University of Salzburg, Hellbrunnerstr. 34 A-52 Salzburg,

More information

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at A Scaling Result for Exlosive Processes M. Mitzenmacher Λ J. Sencer We consider the following balls and bins model, as described in [, 4]. Balls are sequentially thrown into bins so that the robability

More information