Regularization and Averaging of the Selective Naïve Bayes classifier

Marc Boullé

Abstract — The Naïve Bayes classifier has proved to be very effective on many real data applications. Its performance usually benefits from an accurate estimation of univariate conditional probabilities and from variable selection. However, although variable selection is a desirable feature, it is prone to overfitting. In this paper, we introduce a new regularization technique to select the most probable subset of variables and propose a new model averaging method. The weighting scheme on the models reduces to a weighting scheme on the variables, and finally results in a Naïve Bayes with "soft variable selection". Extensive experimental results show that the averaged regularized classifier outperforms the initial Selective Naïve Bayes classifier.

I. INTRODUCTION

The Naïve Bayes modeling approach is based on the assumption that the variables are independent within each output label, and simply relies on the estimation of univariate conditional probabilities. The evaluation of the probabilities for numeric variables has already been discussed in the literature [10, 18, 22]. Experiments demonstrate that even a simple Equal Width discretization with 10 bins brings superior performance compared to the assumption of a Gaussian distribution. Using the Bayes optimal MODL discretization method [4] to estimate the conditional probabilities has proved to be very efficient in detecting irrelevant variables [5]. Similar improvements can be achieved in the case of categorical variables, using the Bayes optimal value grouping method presented in [2] and extended in [3].

The naïve independence assumption can harm the performance when violated. In order to better deal with highly correlated variables, the Selective Naïve Bayes approach [17] uses a greedy forward search to select the variables. The accuracy is evaluated directly on the training set, and variables are selected as long as they do not degrade the accuracy. One problem with this approach is that it does not prevent the selection of irrelevant variables having no effect on the accuracy, or even of redundant variables having an insignificant or no effect on the accuracy. In [5], the area under the ROC curve [11] is used as a selection criterion and exhibits better predictive performance than the accuracy criterion, with fewer selected variables.

Although the Selective Naïve Bayes approach performs quite well on datasets with a reasonable number of variables, it does not scale to very large datasets with hundreds of thousands of instances and thousands of variables, such as in marketing applications. The problem comes both from the search algorithm, whose complexity is quadratic in the number of variables, and from the selection process, which is subject to overfitting. In this paper, we present a new regularization technique to compromise between the number of selected variables and the performance of the classifier. The new criterion is optimized owing to a search heuristic with super-linear algorithmic complexity in the number of instances and variables. We also present a new model averaging method, inspired from the Bayesian Model Averaging approach [15]. We show that averaging the models turns into averaging the contribution of the variables in the case of the Selective Naïve Bayes classifier. Finally, we proceed with extensive experiments to evaluate our method.

(M. Boullé is with France Telecom R&D, 2, avenue Pierre Marzin, Lannion Cedex, France; e-mail: marc.boulle@francetelecom.com.)
The remainder of the paper is organized as follows. Section II presents the regularization technique and Section III the model averaging technique. Section IV browses related work. Section V proceeds with extensive experimental evaluations. The Appendix summarizes the method and its results on the Performance Prediction Challenge [14].

II. REGULARIZATION

After introducing the aim of regularization, this section formally states the assumptions and notations, applies the Bayesian approach to derive a new evaluation criterion for variable selection, and finally presents the search algorithm used to optimize this criterion.

A. Introduction

The Naïve Bayes classifier is a very robust algorithm. It can hardly overfit the data, since no hypothesis space is explored during the learning process. The Selective Naïve Bayes classifier reduces the strong bias of the naïve independence assumption, owing to variable selection. The objective is to search among all the subsets of variables, in order to find the best possible classifier compliant with the Naïve Bayes assumption. The size of the searched hypothesis space grows exponentially with the number of variables, which might cause overfitting. Experiments show that during the variable selection process, the last added variables raise the "complexity" of the classifier while having an insignificant impact on the evaluation criterion (the area under the ROC curve for example). These slight improvements during the training step can be detrimental to the predictive performance in test.

We propose to tackle this overfitting problem by relying on a Bayesian approach, where the best model is found by maximizing the probability P(Model | Data) of the model given the data. Using the Bayes rule, and since the probability P(Data) is constant while varying the model, this is equivalent to maximizing P(Model) P(Data | Model). In the following, we describe how we compute the likelihood of the models and propose a prior distribution for variable selection.

B. Assumptions and Notation

Let X = (X_1, X_2, ..., X_K) be the vector of the K explanatory variables and Y the class variable. Let y_1, y_2, ..., y_J be the J class values of Y. Let N be the number of instances and D = {D_n} the labeled database containing the instances D_n = (x^(n), y^(n)). Let M = {M_m} be the set of all the potential Selective Naïve Bayes models. Each model M_m is described by K parameter values a_k, where a_k is 1 if variable X_k is selected in model M_m and 0 otherwise.

Let P(y_j) be the prior probabilities of the class values, and P(X_k | y_j) the conditional probability distributions of the explanatory variables given the class values. We assume that the prior probabilities P(y_j) and the conditional probability distributions P(X_k | y_j) are known: the purpose of the method is to select the best subset of variables for Naïve Bayes classification. In the experimental section, the P(y_j) are estimated by counting and the P(X_k | y_j) are computed from the contingency tables resulting from the preprocessing of the explanatory variables. This preprocessing is performed using the MODL discretization method [4] for the numeric variables and the MODL grouping method [3] for the categorical variables. The conditional probabilities are estimated using an m-estimate (support + m*p) / (coverage + m), with m = J/N and p = 1/J, in order to avoid zero probabilities. The MODL preprocessing methods are based on a Bayesian approach: a space of discretization (or grouping) models is defined and a prior distribution on this model space is proposed, which leads to a Bayes optimal evaluation criterion of discretization (or grouping) models.

C. Likelihood of Models

The Naïve Bayes classifier assigns to each instance the class value having the highest conditional probability

P(y_j | X) = P(y_j) P(X | y_j) / P(X).   (1)

Using the assumption that the explanatory variables are independent conditionally to the class variable, we get

P(y_j | X) = P(y_j) ∏_{k=1..K} P(X_k | y_j) / P(X).   (2)

For a given model M_m, the class conditional probability estimation P_m(y_j | X) turns into

P_m(y_j | X) = P(y_j) ∏_{k=1..K} P(X_k | y_j)^a_k / P(X),   (3)

P_m(y_j | X) = P(y_j) ∏_{k=1..K} P(X_k | y_j)^a_k / ( Σ_{j'=1..J} P(y_j') ∏_{k=1..K} P(X_k | y_j')^a_k ).   (4)

Equation (4) provides the class conditional probability distribution for each model M_m on the basis of the parameter values a_k of the model. For a given instance D_n, the probability of observing the class value y^(n) given the explanatory values x^(n) and given the model M_m is P_m(y^(n) | X = x^(n)). The likelihood of the model is obtained by computing the product of these quantities over the whole dataset. The negative log-likelihood of the model is given by

−log(P(D | M_m)) = −Σ_{n=1..N} log(P_m(y^(n) | X = x^(n))).   (5)

This quantity turns out to be the sum over the dataset of the Information Loss Function [21], a popular criterion for the evaluation of probabilistic predictions. It is minimized when the true probabilities are predicted.
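To make the computation of equations (2)-(5) concrete, here is a minimal Python/NumPy sketch, assuming the univariate conditional probabilities P(X_k | y_j) have already been estimated (e.g., by the MODL preprocessing) and stored as per-instance log probabilities; the function names and array layout are illustrative, not the paper's implementation.

```python
import numpy as np

def selective_nb_log_posteriors(log_prior, log_cond, selected):
    """Class log posteriors of a Selective Naive Bayes model (eq. 4).

    log_prior : (J,) array, log P(y_j)
    log_cond  : (N, K, J) array, log P(x_k^(n) | y_j) for each instance n,
                variable k and class j (precomputed from the preprocessing)
    selected  : (K,) boolean array, the a_k parameters of the model
    """
    # log P(y_j) + sum over the selected variables of log P(x_k | y_j), eq. (3)
    scores = log_prior + log_cond[:, selected, :].sum(axis=1)   # (N, J)
    # normalize over the classes, eq. (4)
    scores -= np.logaddexp.reduce(scores, axis=1, keepdims=True)
    return scores                                               # log P_m(y_j | x^(n))

def negative_log_likelihood(log_prior, log_cond, selected, y):
    """Negative log-likelihood of the model on the dataset (eq. 5).

    y : (N,) integer array of observed class indices y^(n)
    """
    log_post = selective_nb_log_posteriors(log_prior, log_cond, selected)
    return -log_post[np.arange(len(y)), y].sum()
```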
D. Prior for Variable Selection

The parameters of a variable selection model M_m are the Boolean values a_k. We propose a hierarchic prior, by first choosing the number of selected variables and second choosing the subset of selected variables.

For the number of selected variables k, we propose a uniform prior between 0 and K variables, representing (K + 1) equiprobable alternatives. For the choice of the k variables, we assign the same probability to every subset of k variables. The number of combinations C(K, k) seems the natural way to compute this prior, but it has the disadvantage of being symmetric: beyond K/2 variables, every new variable makes the selection more probable. Thus, adding irrelevant variables is favored, provided that this has an insignificant impact on the likelihood of the model. As we prefer simpler models, we propose to use the number of combinations with replacement C(K + k − 1, k). Taking the negative log of this prior, we get the following code length for the variable selection models:

−log(P(M_m)) = log(K + 1) + log(C(K + k − 1, k)).   (6)

Using this prior, the "informational cost" of the first selected variables is about log(K), and about log(2) for the last ones.
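A small sketch of the prior code length of equation (6); the function and argument names are illustrative.

```python
from math import comb, log

def prior_code_length(n_selected, n_variables):
    """Negative log prior -log P(M) of a variable selection model (eq. 6).

    Uniform prior on the number of selected variables (K + 1 alternatives),
    then a uniform prior over the C(K + k - 1, k) multisets of k variables.
    """
    K, k = n_variables, n_selected
    return log(K + 1) + log(comb(K + k - 1, k))
```

With this prior, selecting one more variable increases the code length by log((K + k) / (k + 1)), which decreases from about log(K) for the first selected variable to about log(2) for the last ones, matching the informational costs quoted above.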

E. Posterior Distribution of the Models

The posterior probability of a model is evaluated as the product of the prior and the likelihood. This is equivalent to a MDL approach [20], where the code length of the model plus that of the data given the model has to be minimized:

log(K + 1) + log(C(K + k − 1, k)) − Σ_{n=1..N} log(P_m(y^(n) | X = x^(n))).   (7)

The first two terms encode the complexity of the model and the last one the fit of the data. The compromise is found by minimizing this criterion.

We can notice a trend of increasing attention to the predicted probabilities in the evaluation criteria proposed for variable selection: whereas the accuracy criterion focuses only on the majority class, the area under the ROC curve evaluates the correct ordering of the predicted probabilities; our regularized criterion evaluates the correctness of all the predicted probabilities (not only their rank) and introduces a regularization term to balance the complexity of the models.

F. An Efficient Search Heuristic

Many heuristics have been used for variable selection. The greedy Forward Selection heuristic evaluates all the variables, starting from an empty set of variables. The best variable is added to the current selection, and the process is iterated until no new variable improves the evaluation criterion. This heuristic may fall into local optima and has a quadratic time complexity with respect to the number of variables. The Forward Backward Selection heuristic allows one to add or drop one variable at each step, in order to avoid local optima. The Fast Forward Selection heuristic evaluates each variable one at a time, and adds it to the selection as soon as this improves the criterion. This last heuristic is time effective, but its results exhibit a large variance caused by the dependence on the order of the variables.

We introduce a new search heuristic called Fast Forward Backward Selection (FFWBW), based on a mix of the preceding approaches. It consists in a sequence of Fast Forward Selection and Fast Backward Selection steps. The variables are randomly reordered between each step, and evaluated only once during each Forward or Backward search. This process is iterated as long as two successive (Forward and Backward) search steps bring at least one improvement of the criterion. Each search step requires O(K) evaluations. The whole process converges very quickly, so that it still requires O(K) evaluations in practice. Each evaluation of a Selective Naïve Bayes model requires O(KN) steps, mainly to evaluate all the class conditional probabilities. Using the additivity of these probabilities with respect to the addition or deletion of variables, the total time complexity of the FFWBW heuristic can be reduced down to O(KN).

In order to further reduce both the possibility of local optima and the variance of the results, this FFWBW heuristic is embedded into a multi-start (MS) algorithm, by repeating the search heuristic starting from several random orderings of the variables. The number of repetitions is set to log2(KN), which offers a reasonable compromise between time complexity and quality of the optimization. Overall, the time complexity of the MS(FFWBW) heuristic is O(KN log(KN)).

Algorithm MS(FFWBW)
  Multi-start: repeat log2(KN) times
    o Start with an empty subset of variables
    o Fast Forward Backward Selection:
        Initialize an empty subset of variables
        Repeat until no improvement:
          - Randomly reorder the variables
          - Fast Forward Selection
          - Randomly reorder the variables
          - Fast Backward Selection
    o Update the best subset of variables if improved
  Return the best subset of variables
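The following Python sketch illustrates the MS(FFWBW) heuristic above. It is a simplified illustration, not the paper's implementation: the `evaluate` callback stands for the regularized criterion of equation (7) (prior code length plus negative log-likelihood), and the incremental score updates that give the O(KN) behavior are omitted for clarity.

```python
import math
import random

def ms_ffwbw(n_variables, n_instances, evaluate, rng=random.Random(0)):
    """Multi-start Fast Forward Backward Selection (sketch).

    evaluate(selected) must return the criterion of eq. (7) for a
    frozenset of selected variable indices (lower is better).
    """
    n_starts = max(1, int(math.log2(n_variables * n_instances)))
    best_subset, best_cost = frozenset(), evaluate(frozenset())

    for _ in range(n_starts):
        subset, cost = frozenset(), evaluate(frozenset())
        improved = True
        while improved:
            improved = False
            # Fast Forward Selection: scan the variables once, in random order
            order = list(range(n_variables))
            rng.shuffle(order)
            for k in order:
                if k not in subset:
                    candidate = subset | {k}
                    c = evaluate(candidate)
                    if c < cost:
                        subset, cost, improved = candidate, c, True
            # Fast Backward Selection: scan the selected variables once
            order = list(subset)
            rng.shuffle(order)
            for k in order:
                candidate = subset - {k}
                c = evaluate(candidate)
                if c < cost:
                    subset, cost, improved = candidate, c, True
        if cost < best_cost:
            best_subset, best_cost = subset, cost
    return best_subset, best_cost
```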
III. MODEL AVERAGING

Model averaging consists in combining the predictions of an ensemble of classifiers in order to reduce the predictive error. This section introduces a new model averaging method applied to the Selective Naïve Bayes classifier.

A. From Bayesian Model Averaging to Expectation

Most inductive methods ignore the uncertainty in model selection and are over-confident about their predictive performance. The Bayesian Model Averaging (BMA) approach [15] provides a consistent method of accounting for model uncertainty, by weighting the models by their posterior probability. For a given variable of interest Φ, this weighting scheme is computed according to the following formula:

P(Φ | D) = Σ_m P(Φ | M_m, D) P(M_m | D).   (8)

This formula can be rewritten using only the prior probabilities and the likelihood of the models:

P(Φ | D) = Σ_m P(Φ | M_m, D) P(M_m) P(D | M_m) / ( Σ_m P(M_m) P(D | M_m) ).   (9)

Let f(M_m, D) = P(Φ | M_m, D) and f(D) = P(Φ | D). Using these notations, the BMA formula can be interpreted as the expectation of the function f for the posterior distribution of the models:

E(f) = Σ_m f(M_m, D) P(M_m | D).   (10)

We propose to extend the BMA approach to the case where f is not restricted to be a probability function.

B. Expectation of the class conditional information

The Naïve Bayes classifier provides an estimation of the class conditional probabilities. These estimated probabilities are the natural candidates for averaging. For a given model M_m defined by the variable selection {a_k}, we have

f(M_m, D) = P(Y) ∏_{k=1..K} P(X_k | Y)^a_k / P(X).   (11)

Let I(M_m, D) = −log(f(M_m, D)) be the class conditional information.
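A minimal sketch of the expectation of equation (10) when f is the vector of class conditional probabilities of equation (11): given a collection of sampled models, their (unnormalized) log posteriors and their per-class predictions, the averaged prediction is the posterior-weighted mean. Array shapes and names are illustrative.

```python
import numpy as np

def bma_average(log_posteriors, model_predictions):
    """Posterior-weighted average of per-model class probabilities (eq. 8-10).

    log_posteriors    : (M,) unnormalized log P(M_m) + log P(D | M_m)
    model_predictions : (M, J) class probabilities P(y_j | x) of each model
    """
    w = np.exp(log_posteriors - log_posteriors.max())   # avoid overflow
    w /= w.sum()                                         # P(M_m | D)
    return w @ model_predictions                         # (J,) averaged P(y_j | x)
```

As discussed below, such posterior weights tend to be sharply peaked around the MAP model, which motivates the compression-based weights introduced in the next subsection.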

Whereas the expectation of f relates to a (weighted) arithmetic mean of the class conditional probabilities, the expectation of I relates to a (weighted) geometric mean of these probabilities. This puts more emphasis on the magnitude of the estimated probabilities. Taking the negative log of (11), we obtain

I(M_m, D) = I(Y) − I(X) + Σ_{k=1..K} a_k I(X_k | Y).   (12)

We are looking for the expectation of this conditional information:

E(I) = Σ_m I(M_m, D) P(M_m | D) / Σ_m P(M_m | D),   (13)

E(I) = I(Y) − I(X) + Σ_m P(M_m | D) ( Σ_{k=1..K} a_k I(X_k | Y) ) / Σ_m P(M_m | D),   (14)

E(I) = I(Y) − I(X) + Σ_{k=1..K} ( Σ_m a_k P(M_m | D) / Σ_m P(M_m | D) ) I(X_k | Y).   (15)

Let b_k = Σ_m a_k P(M_m | D) / Σ_m P(M_m | D). We have b_k ∈ [0, 1]. The b_k coefficients are computed using (7), on the basis of the prior probabilities and of the likelihood of the models. Using these coefficients, the expectation of the conditional information is

E(I) = I(Y) − I(X) + Σ_{k=1..K} b_k I(X_k | Y).   (16)

The averaged model thus provides the following estimation of the class conditional probabilities:

P(Y | X) = P(Y) ∏_{k=1..K} P(X_k | Y)^b_k / P(X).   (17)

It is noteworthy that the expectation of the conditional information in (16) is similar to the conditional information estimated by each individual model in (12). The weighting scheme on the models reduces to a weighting scheme on the variables. When the MAP model is selected, the variables have a weight of 1 when selected and 0 otherwise: this is a "hard selection" of the variables. When the above averaging is applied, each variable has a weight in [0, 1], which can be interpreted as a "soft selection".

C. Expectation with Compression Coefficients

Using the posterior probabilities to weight the models in the averaging approach presents some practical disadvantages. When the posterior distribution is sharply peaked around the MAP, averaging is almost the same as selecting the MAP model. These peaked posterior distributions are more and more likely to happen when the number of instances rises, since a few tens of instances better classified by a model are sufficient to increase its likelihood by several orders of magnitude. Therefore, the algorithmic overhead is not valuable if averaging turns out to be the same as selecting the MAP.

We propose an alternative weighting scheme, whose objective is to better account for the set of all models. Let us first introduce the compression coefficient c(M_m, D) of a model. The Bayesian model selection approach we use to derive the criterion (7) is equivalent to a MDL model selection approach, where

l(M_m) + l(D | M_m) = −log(P(M_m)) − log(P(D | M_m)).   (18)

Let M_∅ be the "null" model, with no variable selected. The null model estimates the class conditional probabilities by their prior probabilities, ignoring all the explanatory variables. The code length of the null model can be interpreted as the quantity of information necessary to describe the classes, when no explanatory data is used to induce the model. Applying (4), we have

l(M_∅) + l(D | M_∅) = log(K + 1) + N ent(Y),   (19)

where ent(Y) = −Σ_{j=1..J} P(y_j) log(P(y_j)) is the class entropy. Each model M_m can potentially exploit the explanatory data to better "compress" the class conditional information. The ratio of the code length of a model to that of the null model stands for a relative gain in compression efficiency. We define the compression coefficient c(M_m, D) of a model as follows:

c(M_m, D) = 1 − ( l(M_m) + l(D | M_m) ) / ( l(M_∅) + l(D | M_∅) ).   (20)

The compression coefficient is 0 for the null model, is maximal when the true class conditional probabilities are correctly estimated, and tends to 1 in the case of separable classes. This coefficient can be negative for models which provide an estimation worse than that of the null model.
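A short sketch of the compression coefficient of equations (19)-(20), reusing the hypothetical criterion value of the earlier sketches; names are illustrative.

```python
import numpy as np

def null_model_cost(class_priors, n_instances, n_variables):
    """Code length of the null model, l(M_0) + l(D | M_0) (eq. 19)."""
    ent_y = -np.sum(class_priors * np.log(class_priors))
    return np.log(n_variables + 1) + n_instances * ent_y

def compression_coefficient(model_cost, null_cost):
    """Compression coefficient c(M, D) of a model (eq. 20).

    model_cost : l(M) + l(D | M), i.e. the criterion of eq. (7) for the model
    null_cost  : the same criterion for the null model (eq. 19)
    """
    return 1.0 - model_cost / null_cost
```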
In our heuristic attempt to better account for all the models, we replace the posterior probabilities by their related compression coefficients in the weighting scheme. Let us focus again on the variable weights b_k introduced in our first model averaging method. Dividing the posterior probabilities by that of the null model, we get

b_k = [ Σ_m a_k ( P(M_m | D) / P(M_∅ | D) ) ] / [ Σ_m ( P(M_m | D) / P(M_∅ | D) ) ].   (21)

We introduce new c_k coefficients by taking the log of the probability ratios and normalizing by the code length of the null model. We obtain

c_k = Σ_m a_k c(M_m, D) / Σ_m c(M_m, D).   (22)

In the implementation, we ignore the "bad" models and consider the positive compression coefficients only.
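The following sketch computes the variable weights of equations (15) and (22) from a pool of evaluated models and applies the resulting "soft selection" in the averaged prediction of equation (17); it assumes the same array layout as the earlier likelihood sketch and is only an illustration of the weighting scheme, not the paper's implementation.

```python
import numpy as np

def variable_weights(selections, log_posteriors=None, compressions=None):
    """Per-variable weights reducing model averaging to variable averaging.

    selections     : (M, K) boolean matrix, a_k for each evaluated model
    log_posteriors : (M,) unnormalized log posteriors -> b_k weights (eq. 15/21)
    compressions   : (M,) compression coefficients    -> c_k weights (eq. 22)
    """
    if compressions is not None:
        w = np.clip(compressions, 0.0, None)      # keep positive coefficients only
    else:
        w = np.exp(log_posteriors - np.max(log_posteriors))
    w = w / w.sum()
    return w @ selections                          # (K,) weights in [0, 1]

def averaged_log_posteriors(log_prior, log_cond, weights):
    """Soft-selection Naive Bayes prediction (eq. 17), log P(y_j | x^(n))."""
    scores = log_prior + (log_cond * weights[None, :, None]).sum(axis=1)
    return scores - np.logaddexp.reduce(scores, axis=1, keepdims=True)
```

With the MAP model alone, the weights are exactly 0 or 1 and the prediction reduces to the hard selection of the MAP variables.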

Mainly, the principle of this new heuristic weighting scheme consists in smoothing the peaked posterior probability distribution with the log function.

D. An Efficient Algorithm for Model Averaging

We have previously introduced two model averaging methods, which rely on the expectation of the class conditional information (with standard probabilistic weights or with compression-based weights). The calculation of this expectation requires the evaluation of all the variable selection models, which is not computationally feasible as soon as the number of variables goes beyond about 20. This expectation can heuristically be evaluated by sampling the posterior distribution of the models and accounting only for the sampled models in the weighting scheme. We propose to reuse the MS(FFWBW) search heuristic to perform this sampling. This heuristic is effective for finding high probability models and searching in their neighborhood. The repetition of the search from several random starting points (in the multi-start meta-heuristic) brings diversity and allows escaping local optima. We use the whole set of models evaluated during the search to estimate the expectation. This sampling strategy is biased in favor of the most probable models, but this has little impact, since the most probable models contribute the most to the weights.

The overhead in the time complexity of the learning algorithm is negligible, since the only need is to collect the posterior probability of the models and to compute the weights in the averaging formula. Concerning the deployment of the averaged model, the overhead is also negligible, since the initial Naïve Bayes estimation of the class conditional probabilities is just extended with variable weights.

IV. RELATED WORK

This section briefly reviews three popular alternative ensemble methods: Boosting, Bagging and Bayesian Model Averaging.

A. Introduction

The predictive performance of a classifier can be understood using the bias-variance decomposition [16]. The bias evaluates the fit capacity: classifiers with low bias such as neural networks can approximate any data distribution, whereas classifiers with strong bias such as the Naïve Bayes classifier cannot fit complex data. The variance results from the statistical uncertainty around the training set. Model averaging methods aim at reducing the bias or the variance part of the predictive error by combining the predictions of an ensemble of classifiers. They mainly differ by the way they sample the distribution of the models and by their weighting scheme.

B. Boosting

The Boosting averaging method [12] exploits a weight distribution on the instances. Initially, all the instances have the same weight, and a classifier is trained on the initial training set. The instances that are incorrectly classified are given a higher weight, and a new training set is sampled from the weight-updated dataset. This process is repeated for a given number of iterations, and the averaged classifier is built using weights based on the predictive performance of the individual classifiers. The Boosting method is theoretically founded to reduce the training bias, but without guarantee against overfitting.

C. Bagging

The Bagging (Bootstrap Aggregating) averaging method [7] exploits a set of data samples obtained owing to a bootstrap process (N instances are randomly selected from the training set, with replacement). The classifier is trained on each bootstrap dataset, and the averaged classifier gives the same weight to all the trained models. The bootstrap process is a way of sampling the model space around reasonably good and equiprobable models.
It is theoretically founded to reduce the variance part of the predictive error. The random selection of data samples has been extended to variables in the Random Forests classifier [8]: each node of a classification tree is trained on a new randomly sampled subset of variables. A similar approach is used in [23], where a sequence of Naïve Bayes classifiers is iteratively trained on randomly selected variables. The probability of selecting each variable is increased or decreased according to the performance of the previously trained classifier. By default, the number of training iterations is equal to the number of variables, and each classifier has the same weight.

D. Bayesian Model Averaging

The Bayesian Model Averaging (BMA) method [15] aims at accounting for model uncertainty. Whereas the MAP approach retrieves the most probable model given the data, the BMA approach exploits every model in the model space, weighted by its posterior probability. This approach relies on the definition of a prior distribution on the models, on an efficient computation technique to estimate the model posterior probabilities, and on an effective method to sample the posterior distribution. Apart from these technical difficulties, the BMA approach is an appealing technique, with strong theoretical results concerning the optimality of its long-run performance [19].

The BMA approach has been applied to the Naïve Bayes classifier in [9]. Apart from the differences in the weighting scheme, their method (DC) differs from ours mainly in the initial assumptions. The DC method does not manage numeric variables and assumes multinomial distributions with Dirichlet priors for the categorical variables. Structure modularity of the Bayesian network is also assumed: each selection of a variable is independent from the others. The DC approach estimates the full data distribution (explanatory and class variables), whereas we focus on the class conditional probabilities. Once the prior hyper-parameters are fixed, the DC method allows computing an exact model averaging, whereas we rely on a heuristic to estimate the averaged model.

Compared to the DC method, our method is not restricted to categorical attributes and does not need any hyper-parameter.

V. EXPERIMENTS

This section presents an experimental evaluation of the performance of the Selective Naïve Bayes induction methods described in the previous sections. We first comment on the rationale behind the compression-based model averaging approach before proceeding with extensive experiments.

A. The Waveform example

The Waveform dataset [6] contains 5000 instances and 21 numeric variables, with 3 classes of waves. We use 70% of this dataset to train a Selective Naïve Bayes classifier, using the Bayes regularization introduced in Section II. The MODL preprocessing determines that 2 variables (the 1st and the 21st) are irrelevant. The variable selection problem consists in finding the most probable subset of variables among about half a million (2^19) potential subsets. In order to study the posterior distribution of the models, all these subsets are evaluated. The MAP model selects 8 variables (5, 6, 9, 10, 11, 12, 13, 17). The posterior distribution is very sharp everywhere, not only around the MAP. Variable 18 is first selected in the 3rd model, which is about 40 times less probable than the MAP model. Variable 4 is first selected in the 10th model, about 4000 times less probable than the MAP model. Figure 1 displays the repartition function of the posterior probabilities, using a normalized log scale (compression coefficients). Using this logarithmic transformation, the posterior distribution is flattened and can be visualized. The MAP model is several orders of magnitude more probable than the minimum a posteriori model, which is the null one. This huge range of probabilities is projected onto a [0%, 62%] range of compression coefficients.

Fig. 1. Repartition function of the compression coefficients (normalized posterior probabilities) of half a million variable selection models evaluated for the Waveform dataset, sorted by increasing compression coefficient. For example, the 10% of models on the left represent the models having the lowest compression coefficients. (x axis: % of models; y axis: compression coefficient.)

A closer look at the posterior distribution shows that most of the good models (in the top 50%) contain around 10 variables. Figure 2 displays the selected variables in the top 200 models (0.05%). Five variables (5, 9, 10, 11, 12) among the 8 MAP variables are always selected, and the other models exploit a diversity of subsets of variables. The potential benefit of model averaging is to account for all these models, with higher weights for the most probable models. In the Waveform example, averaging using the posterior probabilities to weight the models is almost the same as selecting the MAP model (which itself is hard to find with a heuristic search). The compression-based model averaging exploits a flattened distribution of the weights (see Figure 1), and thus enables accounting for a large number of models. This averaging approach is evaluated in the next section.

Fig. 2. Index of the selected variables in the 200 most probable Selective Naïve Bayes models for the Waveform dataset. Each line represents a model, where the selected variables are shown in black.

B. Experimental Setup

The experiments aim at comparing the performance of the model averaging methods versus the MAP method, the standard Selective Naïve Bayes (SNB) and Naïve Bayes (NB) methods. All the classifiers except the last one exploit the same MODL preprocessing.
The evaluated methods are:
- SNB(CMA): model averaging using compression-based weights in the expectation formula,
- SNB(MA): model averaging using the expectation formula with posterior probability weights,
- SNB(MAP): the MAP SNB model,
- SNB(AUC): optimization of the area under the ROC curve,
- SNB(ACC): optimization of the accuracy,
- NB: Naïve Bayes with MODL preprocessing,
- NB(EF): Naïve Bayes with 10-bin Equal Frequency discretization and no value grouping.

All the SNB classifiers are optimized with the same MS(FFWBW) search heuristic, except SNB(ACC), which is based on the greedy Forward Selection heuristic. The DC method [9], similar to the SNB(MA) approach, was not evaluated since it is restricted to categorical attributes.

We evaluate three criteria of increasing complexity: accuracy (ACC), area under the ROC curve (AUC) and informational loss function (ILF). The ACC criterion focuses on the most probable class (which is a fair way to evaluate the class predictions, but not very discriminating in the case of highly unbalanced datasets), the AUC criterion focuses on the ranking of the class probabilities, and the ILF criterion on the class probabilities themselves. We normalize the ILF using the compression rate CR = 1 − ILF/Entropy (similar to (20), without the prior regularization penalty). The entropy has the same value as the ILF when the prior class probabilities are predicted on every instance. The normalized CR criterion mainly ranges between 0 (prediction not better than the basic prediction of the class priors) and 1 (prediction of the true class probabilities in the case of perfectly separable classes). It can be negative when the predicted probabilities are worse than the basic prior predictions.
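As a minimal illustration of these three criteria, here is a sketch assuming NumPy and scikit-learn; the binary case is shown for the AUC, and the names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluation_criteria(y_true, proba, class_priors):
    """ACC, AUC and normalized compression rate CR = 1 - ILF / Entropy.

    y_true       : (N,) integer class indices
    proba        : (N, J) predicted class probabilities
    class_priors : (J,) prior class probabilities
    """
    acc = accuracy_score(y_true, proba.argmax(axis=1))
    auc = roc_auc_score(y_true, proba[:, 1])                        # binary case
    ilf = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))   # information loss
    entropy = -np.sum(class_priors * np.log(class_priors))          # ILF of the prior predictor
    return acc, auc, 1.0 - ilf / entropy
```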

TABLE I
UCI DATASETS
Name          Instances   Numeric variables   Categorical variables   Classes   Majority accuracy
Abalone, Adult, Australian, Breast, Crx, German, Glass, Heart, Hepatitis, HorseColic, Hypothyroid, Ionosphere, Iris, LED, LED, Letter, Mushroom, PenDigits, Pima, Satimage, Segmentation, SickEuthyroid, Sonar, Spam, Thyroid, TicTacToe, Vehicle, Waveform, Wine, Yeast

We conduct the experiments on two collections of datasets: 30 datasets from the repository at the University of California at Irvine [1] and 10 datasets from the NIPS 2003 Feature Selection Challenge [13] and the WCCI 2006 Performance Prediction Challenge [14]. A summary of some properties of these datasets is given in Table I for the UCI datasets and in Table II for the Challenge datasets. We use stratified 10-fold cross validation to evaluate the criteria. A two-tailed Student test at the 5% confidence level is performed in order to evaluate the significant wins or losses of the SNB(CMA) method versus each other method.

TABLE II
CHALLENGE DATASETS
Name          Instances   Numeric variables   Categorical variables   Classes   Majority accuracy
Arcene, Dexter, Dorothea, Gisette, Madelon, Ada, Gina, Hiva, Nova, Sylva

C. Results

We collect and average the three criteria owing to the stratified 10-fold cross validation, for the seven evaluated methods on the forty datasets. We summarize these results using the mean of each criterion, the number of significant wins or losses of the SNB(CMA) method and the average rank of each method. The results are presented in Table III for the UCI datasets and in Table IV for the Challenge datasets.

TABLE III
EVALUATION OF THE METHODS ON THE UCI DATASETS
Method      ACC (M, W/L, R)   AUC (M, W/L, R)   CR (M, W/L, R)
SNB(CMA)
SNB(MA)     .817   9/   /   /6   2.6
SNB(MAP)           /    /   /6   3.6
SNB(AUC)    .820   8/   /   /4   4.4
SNB(ACC)    .817   5/   /   /2   4.5
NB                 /    /   /2   5.3
NB(EF)             /    /   /2   5.3
The evaluated criteria are the accuracy (ACC), the area under the ROC curve (AUC) and the compression rate (CR). The results are summarized across the datasets using the mean (M), the number of wins and losses for the SNB(CMA) method (W/L) and the average rank (R).

TABLE IV
EVALUATION OF THE METHODS ON THE CHALLENGE DATASETS
Method      ACC (M, W/L, R)   AUC (M, W/L, R)   CR (M, W/L, R)
SNB(CMA)
SNB(MA)     .872   3/   /   /0   2.3
SNB(MAP)    .865   4/   /   /0   3.7
SNB(AUC)    .872   6/   /   /0   4.1
SNB(ACC)    .875   3/   /   /0   4.3
NB          .841   7/   /   /0   5.9
NB(EF)      .823   9/   /   /0   6.7
The evaluated criteria are the accuracy (ACC), the area under the ROC curve (AUC) and the compression rate (CR). The results are summarized across the datasets using the mean (M), the number of wins and losses for the SNB(CMA) method (W/L) and the average rank (R).

The results of the two NB methods are reported mainly as a sanity check. The MODL preprocessing in the NB classifier exhibits better performance than the Equal Frequency discretization method in the NB(EF) classifier.
All the SNB methods exploit the same MODL preprocessing, allowing a fair comparison. The experiments confirm the benefit of selecting the variables, using the standard selection methods SNB(ACC) and SNB(AUC). These two methods achieve comparable results, with an emphasis on their respective optimized criterion. They significantly improve the results of the NB methods, especially for the estimation of the class conditional probabilities.

The three regularized methods SNB(MAP), SNB(MA) and SNB(CMA) focus on the estimation of the class conditional probabilities, which are evaluated using the compression rate criterion. They clearly outperform the other methods on this criterion. However, the SNB(MAP) method is not better than the two standard SNB methods for the accuracy and AUC criteria. The MAP method increases the bias of the models by penalizing the complex models, leading to a decayed fit of the data. The model averaging approach exploited in the SNB(MA) method offers only slight enhancements compared to the SNB(MAP) method. This confirms the analysis drawn from the Waveform case study.

The compression-based averaging method SNB(CMA) strongly dominates all the other methods on all the criteria. On average, the number of significant wins is about 10 times the number of significant losses, and amounts to more than half of the 40 datasets. On the 10 challenge datasets, which have very large numbers of variables, the SNB(CMA) method always gets the best results on the AUC and CR criteria, and almost always on the accuracy criterion. The domination of SNB(CMA) increases with the complexity of the criteria: it is noteworthy for accuracy (ACC), important for the ordering of the class conditional probabilities (AUC) and very large for the prediction of the class conditional probabilities (CR). This shows that the regularized and averaged Naïve Bayes becomes effective for conditional probability estimation, whereas the standard Naïve Bayes is usually considered to be poor at estimating these probabilities.

VI. CONCLUSION

The Naïve Bayes classifier is a popular method that is often highly effective on real datasets and is competitive with, or even sometimes outperforms, much more sophisticated classifiers. This paper confirms the potential benefit of variable selection to obtain still better performance. We have proposed a new regularization method, founded on a Bayesian approach, to select the best subset of variables. We have also introduced a new model averaging method, using compression-based weights. Extensive experiments on many datasets demonstrate the effectiveness of the new method, which considerably enhances the predictive performance of the raw Naïve Bayes classifier, especially for the estimation of the class conditional probabilities.

APPENDIX

This appendix summarizes the method and its results on the Performance Prediction Challenge [14].

Title: Regularized and Averaged Selective Naïve Bayes Classifier

Name, address, email: Marc Boullé, France Telecom R&D, 2, avenue Pierre Marzin, Lannion Cedex, France, marc.boulle@francetelecom.com

Acronym of my best entry: SNB(CMA) + 10k F(3) tv

References:
M. Boullé, "Regularization and Averaging of the Selective Naïve Bayes classifier", International Joint Conference on Neural Networks.
M. Boullé, "MODL: a Bayes Optimal Discretization Method for Continuous Attributes", Machine Learning, to be published.

Method: Our method is based on the Naive Bayes assumption. All the input features are preprocessed using the Bayes optimal MODL discretization method.
We use a Bayesian approach to compromise between the number of selected features and the performance of the Selective Naïve Bayes classifier: this provides a regularized feature selection criterion. The feature selection search is performed using alternate forward selection and backward elimination searches on randomly ordered feature sets: this provides a fast search heuristic, with super-linear time complexity with respect to the number of instances and features.

Finally, our method introduces a variant of feature selection: feature "soft" selection. Whereas feature "hard" selection gives a Boolean weight to the features according to whether they are selected or not, our method gives a continuous weight between 0 and 1 to each feature. This weighting scheme of the features comes from a new classifier averaging method, derived from Bayesian Model Averaging.

The method computes the posterior probabilities of the classes, which is convenient when the classical accuracy criterion or the area under the ROC curve is evaluated. For the challenge, the Balanced Error Rate (BER) is the main criterion. In order to improve the BER criterion, we adjusted the decision threshold in a post-optimization step: we still predict the class having the highest posterior probability, but we artificially adjust the class prior probabilities in order to optimize the BER criterion on the train dataset.
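A small sketch of this post-optimization step, assuming binary classes and NumPy; the grid of prior adjustments is an illustrative choice, not the challenge implementation.

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """BER: average of the per-class error rates."""
    classes = np.unique(y_true)
    errors = [np.mean(y_pred[y_true == c] != c) for c in classes]
    return float(np.mean(errors))

def tune_prior_adjustment(proba, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the class-1 prior weight minimizing the BER on the train set.

    proba : (N, 2) posterior probabilities of the averaged classifier
    """
    best_w, best_ber = 0.5, 1.0
    for w in grid:
        scores = proba * np.array([1.0 - w, w])   # artificially re-weighted priors
        y_pred = scores.argmax(axis=1)
        ber = balanced_error_rate(y_true, y_pred)
        if ber < best_ber:
            best_w, best_ber = w, ber
    return best_w, best_ber
```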

For the challenge, several trials of feature construction have been performed in order to evaluate the computational and statistical scalability of the method, and to naively attempt to escape the naïve Bayes assumption:
- 10k F(2): 10,000 features constructed for each dataset, each one the sum of two randomly selected initial features,
- 100k F(2): 100,000 constructed features (sums of two features),
- 10k F(3): 10,000 constructed features (sums of three features).

The performance prediction guess is computed using a stratified tenfold cross-validation.

Results: In the challenge, we rank 7th as a group and our best entry is 26th, according to the average rank computed by the organizers. On 2 of the 5 datasets (ADA and SYLVA), our best entry ranks 1st, as shown in Table V. Our method is highly scalable and resistant to noisy or redundant features: it is able to quickly process very large numbers of constructed features without decreasing the predictive performance. Its main limitation comes from the Naïve Bayes assumption. However, when the constructed features allow one to "partially" break the naïve Bayes assumption, the method succeeds in significantly improving its performance. This is the case for example for the GINA dataset, which does not fit the naïve Bayes assumption well: adding randomly constructed features improves the BER from 12.83% down to 7.30%. The AUC criterion, which evaluates the ranking of the class posterior probabilities, indicates high performance for our method.

Code: Our implementation was done in C++.

Keywords: Discretization, Bayesianism, Naïve Bayes, Wrapper, Regularization, Model Averaging

TABLE V
RESULTS OF OUR BEST ENTRY ON THE PERFORMANCE PREDICTION CHALLENGE DATASETS
Dataset      Our best entry (Test AUC, Test BER, BER guess, Guess error, Test score)      The challenge best entry (Test AUC, Test BER, BER guess, Guess error, Test score)
ADA (1)
GINA
HIVA
NOVA
SYLVA (1)
Overall (26.4)                                                                            (6.2)

REFERENCES
[1] C.L. Blake and C.J. Merz, UCI Repository of machine learning databases, UC Irvine, Department of Information and Computer Science.
[2] M. Boullé, "A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes", Journal of Machine Learning Research, 6, 2005a.
[3] M. Boullé, "A Grouping Method for Categorical Attributes Having Very Large Number of Values", in P. Perner and A. Imiya (Eds.), Machine Learning and Data Mining in Pattern Recognition, Springer Verlag, LNAI 3587, 2005b.
[4] M. Boullé, "MODL: a Bayes Optimal Discretization Method for Continuous Attributes", Machine Learning, to be published.
[5] M. Boullé, "An Enhanced Selective Naïve Bayes Method with Optimal Discretization", Feature Extraction, Foundations and Applications, II(25), to be published.
[6] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, California: Wadsworth International.
[7] L. Breiman, "Bagging predictors", Machine Learning, 24(2), 1996.
[8] L. Breiman, "Random forests", Machine Learning, 45(1), 2001.
[9] D. Dash and G.F. Cooper, "Exact model averaging with naïve Bayesian classifiers", Proceedings of the Nineteenth International Conference on Machine Learning, 2002.
[10] J. Dougherty, R. Kohavi and M. Sahami, "Supervised and Unsupervised Discretization of Continuous Features", Proceedings of the 12th International Conference on Machine Learning, San Francisco, CA: Morgan Kaufmann, 1995.
[11] T. Fawcett, "ROC Graphs: Notes and Practical Considerations for Researchers", Technical Report HPL, HP Laboratories.
[12] Y. Freund and R.E. Schapire, "Experiments with a New Boosting Algorithm", Proceedings of the Thirteenth International Conference on Machine Learning, 1996.
[13] I. Guyon, "Design of experiments of the NIPS 2003 variable selection benchmark".
[14] I. Guyon, "Model Selection Workshop and Performance Prediction Challenge", IEEE World Congress on Computational Intelligence.
[15] J.A. Hoeting, D. Madigan, A.E. Raftery and C.T. Volinsky, "Bayesian model averaging: A tutorial", Statistical Science, 14(4), 1999.
[16] R. Kohavi and D.H. Wolpert, "Bias Plus Variance Decomposition for Zero-One Loss Functions", Proceedings of the Thirteenth International Conference on Machine Learning, 1996.
[17] P. Langley and S. Sage, "Induction of Selective Bayesian Classifiers", Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1994.
[18] H. Liu, F. Hussain, C.L. Tan and M. Dash, "Discretization: An Enabling Technique", Data Mining and Knowledge Discovery, 6(4), 2002.
[19] A.E. Raftery and Y. Zheng, "Long-Run Performance of Bayesian Model Averaging", Technical Report no. 433, Department of Statistics, University of Washington.
[20] J. Rissanen, "Modeling by shortest data description", Automatica, 14, 1978.
[21] I.H. Witten and E. Frank, Data Mining, Morgan Kaufmann, 2000.
[22] Y. Yang and G.I. Webb, "A comparative study of discretization methods for naïve-Bayes classifiers", Proceedings of the Pacific Rim Knowledge Acquisition Workshop.
[23] Z. Zheng, "Naïve Bayesian Classifier Committees", Proceedings of ECML'98, Berlin: Springer Verlag, 1998.


More information

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China

Department of Electronic and Optical Engineering, Ordnance Engineering College, Shijiazhuang, , China 6th International Conference on Machinery, Materials, Environent, Biotechnology and Coputer (MMEBC 06) Solving Multi-Sensor Multi-Target Assignent Proble Based on Copositive Cobat Efficiency and QPSO Algorith

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

INTELLECTUAL DATA ANALYSIS IN AIRCRAFT DESIGN

INTELLECTUAL DATA ANALYSIS IN AIRCRAFT DESIGN INTELLECTUAL DATA ANALYSIS IN AIRCRAFT DESIGN V.A. Koarov 1, S.A. Piyavskiy 2 1 Saara National Research University, Saara, Russia 2 Saara State Architectural University, Saara, Russia Abstract. This article

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks 1 Optial Resource Allocation in Multicast Device-to-Device Counications Underlaying LTE Networks Hadi Meshgi 1, Dongei Zhao 1 and Rong Zheng 2 1 Departent of Electrical and Coputer Engineering, McMaster

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

MULTIAGENT Resource Allocation (MARA) is the

MULTIAGENT Resource Allocation (MARA) is the EDIC RESEARCH PROPOSAL 1 Designing Negotiation Protocols for Utility Maxiization in Multiagent Resource Allocation Tri Kurniawan Wijaya LSIR, I&C, EPFL Abstract Resource allocation is one of the ain concerns

More information

Correlated Bayesian Model Fusion: Efficient Performance Modeling of Large-Scale Tunable Analog/RF Integrated Circuits

Correlated Bayesian Model Fusion: Efficient Performance Modeling of Large-Scale Tunable Analog/RF Integrated Circuits Correlated Bayesian odel Fusion: Efficient Perforance odeling of Large-Scale unable Analog/RF Integrated Circuits Fa Wang and Xin Li ECE Departent, Carnegie ellon University, Pittsburgh, PA 53 {fwang,

More information

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS ISSN 1440-771X AUSTRALIA DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS An Iproved Method for Bandwidth Selection When Estiating ROC Curves Peter G Hall and Rob J Hyndan Working Paper 11/00 An iproved

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

Support Vector Machines. Maximizing the Margin

Support Vector Machines. Maximizing the Margin Support Vector Machines Support vector achines (SVMs) learn a hypothesis: h(x) = b + Σ i= y i α i k(x, x i ) (x, y ),..., (x, y ) are the training exs., y i {, } b is the bias weight. α,..., α are the

More information

Research in Area of Longevity of Sylphon Scraies

Research in Area of Longevity of Sylphon Scraies IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution Testing approxiate norality of an estiator using the estiated MSE and bias with an application to the shape paraeter of the generalized Pareto distribution J. Martin van Zyl Abstract In this work the norality

More information

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models 2014 IEEE International Syposiu on Inforation Theory Copression and Predictive Distributions for Large Alphabet i.i.d and Markov odels Xiao Yang Departent of Statistics Yale University New Haven, CT, 06511

More information

Understanding Machine Learning Solution Manual

Understanding Machine Learning Solution Manual Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network 565 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 59, 07 Guest Editors: Zhuo Yang, Junie Ba, Jing Pan Copyright 07, AIDIC Servizi S.r.l. ISBN 978-88-95608-49-5; ISSN 83-96 The Italian Association

More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

OBJECTIVES INTRODUCTION

OBJECTIVES INTRODUCTION M7 Chapter 3 Section 1 OBJECTIVES Suarize data using easures of central tendency, such as the ean, edian, ode, and idrange. Describe data using the easures of variation, such as the range, variance, and

More information

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE Proceeding of the ASME 9 International Manufacturing Science and Engineering Conference MSEC9 October 4-7, 9, West Lafayette, Indiana, USA MSEC9-8466 MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL

More information

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Soft-margin SVM can address linearly separable problems with outliers

Soft-margin SVM can address linearly separable problems with outliers Non-linear Support Vector Machines Non-linearly separable probles Hard-argin SVM can address linearly separable probles Soft-argin SVM can address linearly separable probles with outliers Non-linearly

More information

A New Approach to Solving Dynamic Traveling Salesman Problems

A New Approach to Solving Dynamic Traveling Salesman Problems A New Approach to Solving Dynaic Traveling Salesan Probles Changhe Li 1 Ming Yang 1 Lishan Kang 1 1 China University of Geosciences(Wuhan) School of Coputer 4374 Wuhan,P.R.China lchwfx@yahoo.co,yanging72@gail.co,ang_whu@yahoo.co

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Low complexity bit parallel multiplier for GF(2 m ) generated by equally-spaced trinomials

Low complexity bit parallel multiplier for GF(2 m ) generated by equally-spaced trinomials Inforation Processing Letters 107 008 11 15 www.elsevier.co/locate/ipl Low coplexity bit parallel ultiplier for GF generated by equally-spaced trinoials Haibin Shen a,, Yier Jin a,b a Institute of VLSI

More information

Kernel-Based Nonparametric Anomaly Detection

Kernel-Based Nonparametric Anomaly Detection Kernel-Based Nonparaetric Anoaly Detection Shaofeng Zou Dept of EECS Syracuse University Eail: szou@syr.edu Yingbin Liang Dept of EECS Syracuse University Eail: yliang6@syr.edu H. Vincent Poor Dept of

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning by Prof. Seungchul Lee isystes Design Lab http://isystes.unist.ac.kr/ UNIST Table of Contents I.. Probabilistic Linear Regression I... Maxiu Likelihood Solution II... Maxiu-a-Posteriori

More information

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS Jochen Till, Sebastian Engell, Sebastian Panek, and Olaf Stursberg Process Control Lab (CT-AST), University of Dortund,

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction

C na (1) a=l. c = CO + Clm + CZ TWO-STAGE SAMPLE DESIGN WITH SMALL CLUSTERS. 1. Introduction TWO-STGE SMPLE DESIGN WITH SMLL CLUSTERS Robert G. Clark and David G. Steel School of Matheatics and pplied Statistics, University of Wollongong, NSW 5 ustralia. (robert.clark@abs.gov.au) Key Words: saple

More information

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair Proceedings of the 6th SEAS International Conference on Siulation, Modelling and Optiization, Lisbon, Portugal, Septeber -4, 006 0 A Siplified Analytical Approach for Efficiency Evaluation of the eaving

More information

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup) Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains

More information

Figure 1: Equivalent electric (RC) circuit of a neurons membrane

Figure 1: Equivalent electric (RC) circuit of a neurons membrane Exercise: Leaky integrate and fire odel of neural spike generation This exercise investigates a siplified odel of how neurons spike in response to current inputs, one of the ost fundaental properties of

More information

3.3 Variational Characterization of Singular Values

3.3 Variational Characterization of Singular Values 3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

An Introduction to Meta-Analysis

An Introduction to Meta-Analysis An Introduction to Meta-Analysis Douglas G. Bonett University of California, Santa Cruz How to cite this work: Bonett, D.G. (2016) An Introduction to Meta-analysis. Retrieved fro http://people.ucsc.edu/~dgbonett/eta.htl

More information

Estimation of the Mean of the Exponential Distribution Using Maximum Ranked Set Sampling with Unequal Samples

Estimation of the Mean of the Exponential Distribution Using Maximum Ranked Set Sampling with Unequal Samples Open Journal of Statistics, 4, 4, 64-649 Published Online Septeber 4 in SciRes http//wwwscirporg/ournal/os http//ddoiorg/436/os4486 Estiation of the Mean of the Eponential Distribution Using Maiu Ranked

More information

LONG-TERM PREDICTIVE VALUE INTERVAL WITH THE FUZZY TIME SERIES

LONG-TERM PREDICTIVE VALUE INTERVAL WITH THE FUZZY TIME SERIES Journal of Marine Science and Technology, Vol 19, No 5, pp 509-513 (2011) 509 LONG-TERM PREDICTIVE VALUE INTERVAL WITH THE FUZZY TIME SERIES Ming-Tao Chou* Key words: fuzzy tie series, fuzzy forecasting,

More information

Machine Learning: Fisher s Linear Discriminant. Lecture 05

Machine Learning: Fisher s Linear Discriminant. Lecture 05 Machine Learning: Fisher s Linear Discriinant Lecture 05 Razvan C. Bunescu chool of Electrical Engineering and Coputer cience bunescu@ohio.edu Lecture 05 upervised Learning ask learn an (unkon) function

More information