Compression-Based Averaging of Selective Naive Bayes Classifiers


Journal of Machine Learning Research 8 (2007). Submitted 1/07; Published 7/07.

Marc Boullé
France Telecom R&D
2, avenue Pierre Marzin, Lannion, France
MARC.BOULLE@ORANGE-FTGROUP.COM

Editor: Isabelle Guyon

Abstract

The naive Bayes classifier has proved to be very effective on many real data applications. Its performance usually benefits from an accurate estimation of univariate conditional probabilities and from variable selection. However, although variable selection is a desirable feature, it is prone to overfitting. In this paper, we introduce a Bayesian regularization technique to select the most probable subset of variables compliant with the naive Bayes assumption. We also study the limits of Bayesian model averaging in the case of the naive Bayes assumption and introduce a new weighting scheme based on the ability of the models to conditionally compress the class labels. The weighting scheme on the models reduces to a weighting scheme on the variables, and finally results in a naive Bayes classifier with soft variable selection. Extensive experiments show that the compression-based averaged classifier outperforms the Bayesian model averaging scheme.

Keywords: naive Bayes, Bayesian, model selection, model averaging

1. Introduction

The naive Bayes classification approach (see Langley et al., 1992; Mitchell, 1997; Domingos and Pazzani, 1997; Hand and Yu, 2001) is based on the assumption that the variables are independent within each output label, and simply relies on the estimation of univariate conditional probabilities. The evaluation of the probabilities for numeric variables has already been discussed in the literature (see Dougherty et al., 1995; Liu et al., 2002; Yang and Webb, 2002). Experiments demonstrate that even a simple equal width discretization brings superior performance compared to assuming a Gaussian distribution.

The naive independence assumption can harm the performance when violated. In order to better deal with highly correlated variables, the selective naive Bayes approach of Langley and Sage (1994) exploits a wrapper approach (see Kohavi and John, 1997) to select the subset of variables which optimizes the classification accuracy. In the method of Boullé (2006a), the area under the receiver operating characteristic (ROC) curve (see Fawcett, 2003) is used as a selection criterion and exhibits a better predictive performance than the accuracy criterion. Although the selective naive Bayes approach performs quite well on data sets with a reasonable number of variables, it does not scale on very large data sets with hundreds of thousands of instances and thousands of variables, such as in marketing applications. The problem comes both from the search algorithm, whose complexity is quadratic in the number of variables, and from the selection process, which is prone to overfitting.

In this paper, we present a new regularization technique to compromise between the number of selected variables and the performance of the classifier. The resulting variable selection criterion is optimized owing to an efficient search heuristic whose computational complexity is $O(KN \log(KN))$, where N is the number of instances and K the number of variables. We also apply the Bayesian model averaging approach of Hoeting et al. (1999) and extend it with a compression-based averaging scheme, which better accounts for the distribution of the models. We show that averaging the models turns into averaging the contribution of the variables in the case of the selective naive Bayes classifier. Finally we proceed with extensive experiments to evaluate our method.

The remainder of this paper, which extends the 2006 IJCNN conference paper (Boullé, 2006c), is organized as follows. Section 2 introduces the assumptions and recalls the principles of the naive Bayes and selective naive Bayes classifiers. Section 3 presents the regularization technique for variable selection based on Bayesian model selection and Section 4 applies the Bayesian model averaging method to selective naive Bayes classifiers. In Section 5, the new selective naive Bayes classifiers are evaluated on an illustrative example. Section 6 analyzes the limits of Bayesian model averaging and proposes a new model averaging technique based on model compression coefficients. Section 7 proceeds with extensive experimental evaluations and Section 8 reports the results obtained in the performance prediction challenge organized by Guyon et al. (2006c). Finally, Section 9 concludes this paper and outlines research directions.

2. Selective Naive Bayes Classifier

This section formally states the assumptions and notations and recalls the naive Bayes and selective naive Bayes approaches.

2.1 Assumptions and Notation

Let $X = (X_1, X_2, \ldots, X_K)$ be the vector of the K explanatory variables and Y the class variable. Let $\lambda_1, \lambda_2, \ldots, \lambda_J$ be the J class labels of Y. Let N be the number of instances and $D = \{D_1, D_2, \ldots, D_N\}$ the labeled database containing the instances $D_n = (x^{(n)}, y^{(n)})$. Let $M = \{M_m\}$ be the set of all the potential selective naive Bayes models. Each model $M_m$ is described by K parameter values $a_{mk}$, where $a_{mk}$ is 1 if variable k is selected in model $M_m$ and 0 otherwise.

Let us denote by $P(\lambda_j)$ the prior probabilities $P(Y = \lambda_j)$ of the class values, and by $P(X_k \mid \lambda_j)$ the conditional probability distributions $P(X_k \mid Y = \lambda_j)$ of the explanatory variables given the class values. We assume that the prior probabilities $P(\lambda_j)$ and the conditional probability distributions $P(X_k \mid \lambda_j)$ are known once the preprocessing is performed. In the paper, the class conditional probabilities are estimated using the MODL discretization method of Boullé (2006b) for the numeric variables and the MODL grouping method of Boullé (2005a,b) for the categorical variables. MODL stands for minimum optimized description length and refers to the principle of minimum description length (MDL) of Rissanen (1978) as a model selection technique. More specifically, the MODL preprocessing methods exploit a maximum a posteriori (MAP) technique (see Robert, 1997) to select the most probable model of discretization (resp. value grouping) given the input data.

The choice of the prior distribution of the models is optimized for the task of data preparation, and the search algorithms are deeply optimized. Using the Bayes optimal MODL preprocessing methods to estimate the conditional probabilities has proved to be very efficient in detecting irrelevant variables (see Boullé, 2006a).

In the experimental section, the $P(\lambda_j)$ are estimated by counting and the $P(X_k \mid \lambda_j)$ are computed using the contingency tables resulting from the preprocessing of the explanatory variables. The conditional probabilities are estimated using an m-estimate, (support + mp)/(coverage + m) with m = J/N and p = 1/J, in order to avoid zero probabilities.

2.2 Naive Bayes Classifier

The naive Bayes classifier assigns to each instance the class value having the highest conditional probability

$$P(\lambda_j \mid X) = \frac{P(\lambda_j) P(X \mid \lambda_j)}{P(X)}.$$

Using the assumption that the explanatory variables are independent conditionally on the class variable, we get

$$P(\lambda_j \mid X) = \frac{P(\lambda_j) \prod_{k=1}^{K} P(X_k \mid \lambda_j)}{P(X)}. \tag{1}$$

In classification problems, Equation (1) is sufficient to predict the most probable class given the input data, since P(X) is constant. In problems where a prediction score is needed, the class conditional probability can be estimated using

$$P(\lambda_j \mid X) = \frac{P(\lambda_j) \prod_{k=1}^{K} P(X_k \mid \lambda_j)}{\sum_{i=1}^{J} P(\lambda_i) \prod_{k=1}^{K} P(X_k \mid \lambda_i)}. \tag{2}$$

The naive Bayes classifier is poor at predicting the true class conditional probabilities, since the independence assumption is usually violated in real data applications. However, Hand and Yu (2001) show that the prediction score given by Equation (2) often provides an effective ranking of the instances for each class value.
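As an illustration of Equation (2), the following minimal Python sketch computes the naive Bayes prediction on preprocessed inputs. It is not the implementation used in the paper: it assumes the class priors and the per-variable conditional probability tables have already been estimated from the MODL preprocessing, and the helper names are ours. The m-estimate formula of Section 2.1 is copied verbatim as a smoothing helper.

```python
import numpy as np

def m_estimate(support, coverage, m, p):
    """m-estimate smoothing of a conditional probability, as in Section 2.1:
    (support + m*p) / (coverage + m), used to avoid zero probabilities."""
    return (support + m * p) / (coverage + m)

def nb_predict_proba(x, priors, cond_tables):
    """Class conditional probabilities of Equation (2) for one instance.

    x           : length-K sequence of discretized/grouped value indexes
    priors      : length-J array of P(lambda_j)
    cond_tables : list of K arrays of shape (J, V_k) with
                  cond_tables[k][j, v] = P(X_k = v | lambda_j)
    """
    J = len(priors)
    scores = np.array([
        priors[j] * np.prod([cond_tables[k][j, v] for k, v in enumerate(x)])
        for j in range(J)
    ])
    return scores / scores.sum()  # normalization over classes, Equation (2)

# Toy usage: 2 classes, 2 preprocessed variables with 3 and 2 value groups.
priors = np.array([0.7, 0.3])
cond_tables = [np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]]),
               np.array([[0.6, 0.4], [0.2, 0.8]])]
print(nb_predict_proba([2, 1], priors, cond_tables))
```

In practice the products are computed in log space for numerical stability, as done in the criterion sketch of Section 3.4 below.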

2.3 Selective Naive Bayes Classifier

The selective naive Bayes classifier reduces the strong bias of the naive independence assumption, owing to variable selection. The objective is to search among all the subsets of variables, in order to find the best possible classifier compliant with the naive Bayes assumption. Langley and Sage (1994) propose to evaluate the selection process with the accuracy criterion, estimated on the train data set. However, this criterion suffers from some limits, even when the predictive performance is the only concern. In case of a skewed distribution of class labels for example, the accuracy may never be better than the majority accuracy, so that the selection process ends with an empty set of variables. This problem also arises when several consecutive selected variables are necessary to improve the accuracy. In the method proposed by Langley and Sage (1994), the selection process is iterated as long as there is no decay in the accuracy. This solution raises new problems, such as the selection of irrelevant variables with no effect on accuracy, or even the selection of redundant variables with either insignificant effect or no effect on accuracy.

Provost et al. (1998) propose to use receiver operating characteristic (ROC) analysis rather than the accuracy to evaluate induction models. This ROC criterion, estimated on the train data set (as in Langley and Sage, 1994), is used by Boullé (2006a) to assess the quality of variable selection for the naive Bayes classifier. The method exploits the forward selection algorithm to select the variables, starting from an empty subset of variables. At each step of the algorithm, the variable which brings the best increase of the area under the ROC curve (AUC) is chosen, and the selection process stops as soon as this area does not rise anymore. This allows capturing slight enhancements in the learning process and helps avoid the selection of redundant variables or probes that have no effect on the ROC curve.

Altogether, the variable selection method can be implemented in $O(K^2 N \log N)$ time. The preprocessing step needs $O(KN \log N)$ to discretize or group the values of all the variables. The forward selection process requires $O(K^2 N \log N)$ time, owing to the decomposability of the naive Bayes formula on the variables. The $O(N \log N)$ term in the complexity is due to the evaluation of the area under the ROC curve, based on the sort of the training instances.

3. MAP Approach for Variable Selection

After introducing the aim of regularization, this section applies the Bayesian approach to derive a new evaluation criterion for variable selection and presents the search algorithm used to optimize this criterion.

3.1 Introduction

The naive Bayes classifier is a very robust algorithm. It can hardly overfit the data, since no hypothesis space is explored during the learning process. By contrast, the selective naive Bayes classifier explores the space of all subsets of variables to reduce the strong bias of the naive independence assumption. The size of the searched hypothesis space grows exponentially with the number of variables, which might cause overfitting. Experiments show that during the variable selection process, the last added variables raise the complexity of the classifier while having an insignificant impact on the evaluation criterion (AUC for example). These slight improvements during the training step, which have an insignificant impact on the test performance, are detrimental to the ease of deployment of the models and to their understandability.

We propose to tackle this overfitting problem by relying on a Bayesian approach, where the MAP model is found by maximizing the probability P(Model | Data) of the model given the data. In the following, we describe how we compute the likelihood of the models P(Data | Model) and propose a prior distribution P(Model) for variable selection.

3.2 Likelihood of Models

For a given model $M_m$ parameterized by the set of selected variable indicators $\{a_{mk}\}$, the estimation of the class conditional probability $P_m(\lambda_j \mid X)$ turns into

$$P_m(\lambda_j \mid X) = \frac{P(\lambda_j) \prod_{k=1}^{K} P(X_k \mid \lambda_j)^{a_{mk}}}{P(X)} = \frac{P(\lambda_j) \prod_{k=1}^{K} P(X_k \mid \lambda_j)^{a_{mk}}}{\sum_{i=1}^{J} P(\lambda_i) \prod_{k=1}^{K} P(X_k \mid \lambda_i)^{a_{mk}}}. \tag{3}$$

Equation (3) provides the class conditional probability distribution for each model $M_m$ on the basis of the parameter values $a_{mk}$ of the model. For a given instance $D_n$, the probability of observing the class value $y^{(n)}$ given the explanatory values $x^{(n)}$ and given the model $M_m$ is $P_m(Y = y^{(n)} \mid X = x^{(n)})$. The likelihood of the model is obtained by computing the product of these quantities on the whole data set. The negative log-likelihood of the model is given by

$$-\log P(D \mid M_m) = -\sum_{n=1}^{N} \log P_m(Y = y^{(n)} \mid X = x^{(n)}).$$

3.3 Prior for Variable Selection

The parameters of a variable selection model $M_m$ are the Boolean values $a_{mk}$. We propose a hierarchic prior, by first choosing the number of selected variables and second choosing the subset of selected variables. For the number $K_m$ of variables, we propose to use a uniform prior between 0 and K variables, representing (K + 1) equiprobable alternatives. For the choice of the $K_m$ variables, we assign the same probability to every subset of $K_m$ variables. The number of combinations $\binom{K}{K_m}$ seems the natural way to compute this prior, but it has the disadvantage of being symmetric. Beyond K/2 variables, every new variable makes the selection more probable. Thus, adding irrelevant variables is favored, provided that this has an insignificant impact on the likelihood of the model. As we prefer simpler models, we propose to use the number of combinations with replacement $\binom{K + K_m - 1}{K_m}$. Taking the negative log of this prior, we get the following code length $l(M_m)$ for the variable selection models:

$$l(M_m) = \log(K + 1) + \log \binom{K + K_m - 1}{K_m}.$$

Using this prior, the informational cost is about log K for the first selected variables and about log 2 for the last ones. To summarize our prior, each number $K_m$ of selected variables is equiprobable, and for a given $K_m$, each subset of $K_m$ variables randomly chosen with replacement is equiprobable. This means that each specific small subset of variables has a greater probability than each specific large subset of variables, since the number of variable subsets of a given size grows with $K_m$.

3.4 Posterior Distribution of the Models

The posterior probability of a model $M_m$ is evaluated as the product of the prior and the likelihood. This is equivalent to the MDL approach of Rissanen (1978), where the code length of the model plus the data given the model has to be minimized:

$$l(M_m) + l(D \mid M_m) = \log(K + 1) + \log \binom{K + K_m - 1}{K_m} - \sum_{n=1}^{N} \log P_m(y^{(n)} \mid X = x^{(n)}). \tag{4}$$

The first two terms encode the complexity of the model and the last one the fit of the data. The compromise is found by minimizing this criterion.

We can notice a trend of increasing attention to the predicted probabilities in the evaluation criteria proposed for variable selection. Whereas the accuracy criterion focuses only on the majority class and the area under the ROC curve evaluates only the correct ordering of the predicted probabilities, our regularized criterion evaluates the correctness of all the predicted probabilities (not only their rank) and introduces a regularization term to balance the complexity of the models.
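To make criterion (4) concrete, here is a small Python sketch that evaluates it for a variable selection represented as a boolean mask. The helper names and data layout are illustrative and not taken from the author's implementation; class labels are assumed to be integers in 0..J-1 and inputs to be discretized value indexes.

```python
import math
import numpy as np

def selection_prior_cost(K, K_m):
    """Prior code length l(M_m) of Section 3.3: log(K+1) for the uniform choice
    of the number of selected variables, plus log C(K+K_m-1, K_m) for the
    multiset of K_m selected variables."""
    return math.log(K + 1) + math.lgamma(K + K_m) - math.lgamma(K_m + 1) - math.lgamma(K)

def selection_criterion(mask, X, y, priors, cond_tables):
    """Criterion (4): prior code length plus negative log-likelihood of the
    class labels under the selective naive Bayes model defined by `mask`."""
    K = len(mask)
    K_m = int(np.sum(mask))
    nll = 0.0
    for x, label in zip(X, y):
        # log P_m(y | x) with only the selected variables, Equation (3)
        log_scores = np.log(priors)
        for k in range(K):
            if mask[k]:
                log_scores = log_scores + np.log(cond_tables[k][:, x[k]])
        log_scores -= np.logaddexp.reduce(log_scores)  # normalize over classes
        nll -= log_scores[label]
    return selection_prior_cost(K, K_m) + nll
```

The log-gamma form of the binomial coefficient avoids overflow for large K, and the log-sum-exp normalization keeps the likelihood computation stable.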

3.5 An Efficient Search Heuristic

Many heuristics have been used for variable selection (see Guyon et al., 2006b). The greedy forward selection heuristic evaluates all the variables, starting from an empty set of variables. The best variable is added to the current selection, and the process is iterated until no new variable improves the evaluation criterion. This heuristic may fall into local optima and has a quadratic time complexity with respect to the number of variables. The forward backward selection heuristic allows adding or dropping one variable at each step, in order to avoid local optima. The fast forward selection heuristic evaluates each variable one at a time, and adds it to the selection as soon as this improves the criterion. This last heuristic is time effective, but its results exhibit a large variance caused by the dependence on the order of the variables.

Algorithm 1 Algorithm MS(FFWBW)
Require: X = (X_1, X_2, ..., X_K)  {set of input variables}
Ensure: B  {best subset of variables}
 1: B ← ∅  {start with an empty subset of variables}
 2: for Step = 1 to log2(KN) do
 3:   {fast forward backward selection}
 4:   S ← ∅  {initialize an empty subset of variables}
 5:   Iter ← 0
 6:   repeat
 7:     Iter ← Iter + 1
 8:     X ← Shuffle(X)  {randomly reorder the variables to add}
 9:     {fast forward selection}
10:     for X_k ∈ X do
11:       if cost(S ∪ {X_k}) < cost(S) then
12:         S ← S ∪ {X_k}
13:       end if
14:     end for
15:     X ← Shuffle(X)  {randomly reorder the variables to remove}
16:     {fast backward selection}
17:     for X_k ∈ X do
18:       if cost(S \ {X_k}) < cost(S) then
19:         S ← S \ {X_k}
20:       end if
21:     end for
22:   until no improvement or Iter ≥ MaxIter
23:   {update best subset of variables}
24:   if cost(S) < cost(B) then
25:     B ← S
26:   end if
27: end for
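For illustration, here is a minimal Python transcription of Algorithm 1 under stated assumptions: `cost` stands for criterion (4) evaluated on the training data, and the incremental O(N) update of the class conditional probabilities when a variable is added or removed (which gives the overall O(KN) complexity discussed in the next section) is abstracted away behind that function.

```python
import math
import random

def ms_ffwbw(variables, cost, n_instances, max_iter=5, seed=0):
    """Multi-start fast forward backward selection, following Algorithm 1.

    variables   : list of candidate variable identifiers
    cost        : function mapping a frozenset of variables to the value of
                  criterion (4) on the training data (lower is better)
    n_instances : N, used to set the number of random restarts to log2(K*N)
    """
    rng = random.Random(seed)
    best, best_cost = frozenset(), cost(frozenset())
    n_starts = max(1, int(math.log2(len(variables) * n_instances)))
    for _ in range(n_starts):
        selected, current = set(), cost(frozenset())
        for _ in range(max_iter):
            improved = False
            # fast forward pass: try to add each variable once, in random order
            for v in rng.sample(variables, len(variables)):
                if v not in selected:
                    trial = cost(frozenset(selected | {v}))
                    if trial < current:
                        selected.add(v)
                        current, improved = trial, True
            # fast backward pass: try to remove each selected variable once
            for v in rng.sample(variables, len(variables)):
                if v in selected:
                    trial = cost(frozenset(selected - {v}))
                    if trial < current:
                        selected.remove(v)
                        current, improved = trial, True
            if not improved:
                break
        if current < best_cost:
            best, best_cost = frozenset(selected), current
    return best
```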

We introduce a new search heuristic called fast forward backward selection (FFWBW), based on a mix of the preceding approaches. It consists in a sequence of fast forward selection and fast backward selection steps. The variables are randomly reordered between each step, and evaluated only once during each forward or backward search. This process is iterated as long as two successive (forward and backward) search steps bring at least one improvement of the criterion, or until the iteration number exceeds a given parameter MaxIter. In practice, the whole process converges very quickly, in one or two steps in most of the cases. Setting MaxIter = 5 for example is sufficient to bound the worst case complexity without decreasing the quality of the search algorithm.

Evaluating a selective naive Bayes model requires $O(KN)$ computation time, mainly to evaluate all the class conditional probabilities. According to Equation (3), these class conditional probabilities can be updated in O(1) per instance and O(N) for the whole data set when one variable is added or removed from the current subset of selected variables. Each fast forward selection or fast backward selection step considers O(K) additions or removals of variables and requires $O(KN)$ computation time. The total time complexity of the FFWBW heuristic is $O(KN)$, since the number of search steps is bounded by the constant parameter MaxIter.

In order to further reduce both the possibility of local optima and the variance of the results, this FFWBW heuristic is embedded into a multi-start (MS) algorithm, by repeating the search heuristic starting from several random orderings of the variables. The number of repetitions is set to $\log_2(KN)$, which offers a reasonable compromise between time complexity and quality of the optimization. Overall, the time complexity of the MS(FFWBW) heuristic is $O(KN \log KN)$. The heuristic is detailed in Algorithm 1.

4. Bayesian Model Averaging of Selective Naive Bayes Classifiers

Model averaging consists in combining the predictions of an ensemble of classifiers in order to reduce the prediction error. This section recalls the principles of Bayesian model averaging and applies this averaging scheme to the selective naive Bayes classifier.

4.1 Bayesian Model Averaging

The Bayesian model averaging (BMA) method (Hoeting et al., 1999) aims at accounting for model uncertainty. Whereas the MAP approach retrieves the most probable model given the data, the BMA approach exploits every model in the model space, weighted by its posterior probability. This approach relies on the definition of a prior distribution on the models, on an efficient computation technique to estimate the model posterior probabilities and on an effective method to sample the posterior distribution. Apart from these technical difficulties, the BMA approach is an appealing technique, with strong theoretical results concerning the optimality of its long-run performance, as shown by Raftery and Zheng (2003).

The BMA approach has been applied to the naive Bayes classifier by Dash and Cooper (2002). Apart from the differences in the weighting scheme, their method (DC) differs from ours mainly in the initial assumptions. The DC method does not manage the numeric variables and assumes multinomial distributions with Dirichlet priors for the categorical variables, which requires the choice of hyper-parameters for each variable. Structure modularity of the Bayesian network is also assumed: each selection of a variable is independent from the others. The DC approach estimates the full data distribution (explanatory and class variables), whereas we focus on the class conditional probabilities. Once the prior hyper-parameters are fixed, the DC method allows computing an exact model averaging, whereas we rely on a heuristic to estimate the averaged model. Compared to the DC method, our method is not restricted to categorical attributes and does not need any hyper-parameter.

4.2 From Bayesian Model Averaging to Expectation

For a given variable of interest $\Delta$, the BMA approach averages the predictions of all the models weighted by their posterior probability:

$$P(\Delta \mid D) = \sum_m P(\Delta \mid M_m, D) P(M_m \mid D).$$

This formula can be written using only the prior probabilities and the likelihood of the models:

$$P(\Delta \mid D) = \frac{\sum_m P(\Delta \mid M_m, D) P(M_m) P(D \mid M_m)}{\sum_m P(M_m) P(D \mid M_m)}.$$

Let $f(M_m, D) = P(\Delta \mid M_m, D)$ and $f(D) = P(\Delta \mid D)$. Using these notations, the BMA formula can be interpreted as the expectation of the function f for the posterior distribution of the models,

$$E(f) = \sum_m f(M_m, D) P(M_m \mid D).$$

We propose to extend the BMA approach to the case where f is not restricted to be a probability function.

4.3 Expectation of the Class Conditional Information

The selective naive Bayes classifier provides an estimation of the class conditional probabilities. These estimated probabilities are the natural candidates for averaging. For a given model $M_m$ defined by the variable selection $\{a_{mk}\}$, we have

$$f(M_m, D) = \frac{P(Y) \prod_{k=1}^{K} P(X_k \mid Y)^{a_{mk}}}{P(X)}. \tag{5}$$

Let $I(M_m, D) = -\log f(M_m, D)$ be the class conditional information. Whereas the expectation of f relates to a (weighted) arithmetic mean of the class conditional probabilities, the expectation of I relates to a (weighted) geometric mean of these probabilities. This puts more emphasis on the magnitude of the estimated probabilities. Taking the negative log of (5), we obtain

$$I(M_m, D) = I(Y) - I(X) + \sum_{k=1}^{K} a_{mk} I(X_k \mid Y), \tag{6}$$

where $I(Y) = -\log P(Y)$, $I(X) = -\log P(X)$ and $I(X_k \mid Y) = -\log P(X_k \mid Y)$. We are looking for the expectation of this conditional information:

$$E(I) = \frac{\sum_m I(M_m, D) P(M_m \mid D)}{\sum_m P(M_m \mid D)} = I(Y) - I(X) + \frac{\sum_m P(M_m \mid D) \sum_{k=1}^{K} a_{mk} I(X_k \mid Y)}{\sum_m P(M_m \mid D)} = I(Y) - I(X) + \sum_{k=1}^{K} I(X_k \mid Y) \frac{\sum_m a_{mk} P(M_m \mid D)}{\sum_m P(M_m \mid D)}.$$

Let

$$b_k = \frac{\sum_m a_{mk} P(M_m \mid D)}{\sum_m P(M_m \mid D)}.$$

We have $b_k \in [0, 1]$. The $b_k$ coefficients are computed using (4), on the basis of the prior probabilities and of the likelihood of the models. Using these coefficients, the expectation of the conditional information is

$$E(I) = I(Y) - I(X) + \sum_{k=1}^{K} b_k I(X_k \mid Y). \tag{7}$$

The averaged model thus provides the following estimation for the class conditional probabilities:

$$P(Y \mid X) = \frac{P(Y) \prod_{k=1}^{K} P(X_k \mid Y)^{b_k}}{P(X)}.$$

It is noteworthy that the expectation of the conditional information in (7) is similar to the conditional information estimated by each individual model in (6). The weighting scheme on the models reduces to a weighting scheme on the variables. When the MAP model is selected, the variables have a weight of 1 when selected and 0 otherwise: this is a hard selection of the variables. When the above averaging scheme is applied, each variable has a weight in [0, 1], which can be interpreted as a soft selection.

4.4 An Efficient Algorithm for Model Averaging

We have previously introduced a model averaging method which relies on the expectation of the class conditional information. The calculation of this expectation requires the evaluation of all the variable selection models, which is not computationally feasible as soon as the number of variables goes beyond about 20. This expectation can heuristically be evaluated by sampling the posterior distribution of the models and accounting only for the sampled models in the weighting scheme. We propose to reuse the MS(FFWBW) search heuristic to perform this sampling. This heuristic is effective for finding high probability models and searching in their neighborhood. The repetition of the search from several random starting points (in the multi-start meta-heuristic) brings diversity and allows escaping local optima. We use the whole set of models evaluated during the search to estimate the expectation.

Although this sampling strategy is biased by the search heuristic, it has the advantage of being simple and computationally tractable. The overhead in the time complexity of the learning algorithm is negligible, since the only need is to collect the posterior probability of the models and to compute the weights in the averaging formula. Concerning the deployment of the averaged model, the overhead is also negligible, since the initial naive Bayes estimation of the class conditional probabilities is just extended with variable weights.
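A minimal sketch of this weighting scheme follows, assuming the search heuristic has recorded, for each evaluated model, its selection mask and its value of criterion (4); the array names are illustrative. Working with differences of criterion values before exponentiating avoids numerical underflow.

```python
import numpy as np

def bma_variable_weights(masks, criterion_values):
    """Soft variable weights b_k = sum_m a_mk P(M_m|D) / sum_m P(M_m|D).

    masks            : (M, K) boolean array, one row per evaluated model
    criterion_values : (M,) array of l(M_m) + l(D|M_m), i.e. the negative
                       log posterior up to an additive constant
    """
    masks = np.asarray(masks, dtype=float)
    c = np.asarray(criterion_values, dtype=float)
    log_post = -(c - c.min())   # unnormalized log posterior, shifted for stability
    w = np.exp(log_post)
    w /= w.sum()
    return masks.T @ w          # one b_k in [0, 1] per variable
```

Deployment then only requires raising each $P(X_k \mid Y)$ to the power $b_k$ in the naive Bayes formula, as in Equation (7).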

5. Evaluation on an Illustrative Example

This section describes the waveform data set, introduces three evaluation criteria and illustrates the behavior of each variant of the selective naive Bayes classifier.

5.1 The Waveform Data Set

The waveform data set introduced by Breiman et al. (1984) contains 5000 instances, 21 continuous variables and 3 equidistributed class values. Each instance is defined as a linear combination of two out of the three triangular waveforms pictured in Figure 1, with randomly generated coefficients and Gaussian noise. Figure 2 plots 10 random instances from each class.

Figure 1: Basic waveforms used to generate the 21 input variables of the waveform data set.

Figure 2: Waveform data. Class 1 combines waveforms 1 and 2, class 2 combines waveforms 1 and 3, and class 3 combines waveforms 2 and 3.

Learning on the waveform data set is generally considered a difficult task in pattern recognition, with reported accuracy of 86.8% using a Bayes optimal classifier. The input variables are correlated, which violates the naive Bayes assumption. Selecting the best subset of variables compliant with the naive Bayes assumption is a challenging problem.

5.2 The Evaluation Criteria

We evaluate three criteria of increasing complexity: accuracy (ACC), area under the ROC curve (AUC) and informational loss function (ILF). The ACC criterion evaluates the accuracy of the prediction, no matter whether its conditional probability is 51% or 100%. The AUC criterion (see Fawcett, 2003) evaluates the ranking of the class conditional probabilities. In a two-class problem, the AUC is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Extending the AUC criterion to multi-class problems is not a trivial task and has led to computationally intensive methods (see for example Deng et al., 2006). In our experiments, we use the approach of Provost and Domingos (2001) to calculate the multi-class AUC, by computing each one-against-the-others two-class AUC and weighting them by the class prior probabilities $P(\lambda_j)$. Although this version of multi-class AUC is sensitive to the class distribution, it is easy to compute, which motivates our choice.
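For illustration, here is a Python sketch of this prior-weighted one-against-the-others AUC. The two-class AUC is computed from the rank-sum (Mann-Whitney) statistic, which is one standard way to obtain it; the function names are ours, not the paper's.

```python
import numpy as np

def two_class_auc(scores, positive):
    """AUC of the ranking induced by `scores` for the boolean `positive` labels,
    computed from the rank-sum statistic (tied scores get average ranks)."""
    scores = np.asarray(scores, dtype=float)
    positive = np.asarray(positive, dtype=bool)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average ranks for ties
        tie = scores == s
        ranks[tie] = ranks[tie].mean()
    n_pos, n_neg = positive.sum(), (~positive).sum()
    if n_pos == 0 or n_neg == 0:
        return 0.5
    return (ranks[positive].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def multiclass_auc(proba, y, priors):
    """Prior-weighted one-against-the-others AUC, as described in Section 5.2.
    proba: (N, J) predicted class probabilities, y: (N,) labels in 0..J-1."""
    y = np.asarray(y)
    return sum(priors[j] * two_class_auc(proba[:, j], y == j)
               for j in range(len(priors)))
```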

The ILF criterion (see Witten and Frank, 2000) evaluates the probabilistic prediction owing to the negative log of the predicted class conditional probabilities, $-\log P_m(Y = y^{(n)} \mid X = x^{(n)})$. The empirical mean of the ILF criterion is equal to

$$\mathrm{ILF}(M_m) = -\frac{1}{N} \sum_{n=1}^{N} \log P_m(Y = y^{(n)} \mid X = x^{(n)}).$$

The predicted class conditional probabilities in the ILF criterion are given by Equation (2) for the naive Bayes classifier and by Equation (3) for the selective naive Bayes classifier.

Let $M_\emptyset$ be the null model, with no variable selected. The null model estimates the class conditional probabilities by their prior probabilities, ignoring all the explanatory variables. For the null model $M_\emptyset$, we obtain

$$\mathrm{ILF}(M_\emptyset) = -\frac{1}{N} \sum_{n=1}^{N} \log P(Y = y^{(n)}) = -\sum_{j=1}^{J} P(\lambda_j) \log P(\lambda_j) = H(Y),$$

where H(Y) is the Shannon (1948) entropy of the class variable. We introduce a compression rate to normalize the ILF criterion, using

$$\mathrm{CR}(M_m) = 1 - \mathrm{ILF}(M_m)/\mathrm{ILF}(M_\emptyset) = 1 - \mathrm{ILF}(M_m)/H(Y).$$

The normalized CR criterion mainly ranges between 0 (prediction not better than the basic prediction of the class priors) and 1 (prediction of the true class probabilities in case of perfectly separable classes). It can be negative when the predicted probabilities are worse than the basic prior predictions.
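Both quantities are straightforward to compute; a short Python sketch is given below, with array shapes indicated in the comments (the function names are ours).

```python
import numpy as np

def ilf(proba, y):
    """Informational loss function: mean negative log predicted probability of
    the true class. proba: (N, J) predicted probabilities, y: (N,) labels."""
    y = np.asarray(y)
    return -np.mean(np.log(proba[np.arange(len(y)), y]))

def compression_rate(proba, y, priors):
    """CR(M) = 1 - ILF(M) / H(Y), where H(Y) is the class entropy, i.e. the ILF
    of the null model that always predicts the class priors."""
    entropy = -np.sum(priors * np.log(priors))
    return 1.0 - ilf(proba, y) / entropy
```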

5.3 Evaluation on the Waveform Data Set

We use 70% of the waveform data set to train the classifiers and 30% to test them. These evaluations are merely illustrative; extensive experiments are reported in Section 7. In the case of the waveform data set, the MODL preprocessing method determines that 2 variables (the 1st and the 21st) are irrelevant, and the naive Bayes classifier uses all the 19 remaining variables.

We evaluate four variants of selective naive Bayes methods. The SNB(ACC) method of Langley and Sage (1994) optimizes the train accuracy and the SNB(AUC) method of Boullé (2006a) optimizes the area under the ROC curve on the train data set. The SNB(MAP) method introduced in Section 3 selects the most probable subset of variables compliant with the naive Bayes assumption, and the SNB(BMA) method described in Section 4 averages the selective naive Bayes classifiers weighted by their posterior probability. In this experiment, we evaluate exhaustively the half a million models related to the $2^{19}$ possible variable subsets. This allows us to focus on the variable selection criterion and to avoid the potential bias of the optimization algorithms.

The selected subsets of variables are pictured in Figure 3. The SNB(ACC) method selects 12 variables and the SNB(AUC) method 18 variables. The SNB(MAP) method, which focuses on a subset of variables compliant with the naive Bayes assumption, selects only 8 variables, which turns out to be a subset of the variables selected by the alternative methods.

Figure 3: Variables selected by the selective naive Bayes classifiers for the waveform data set.

The predictive performance for the ACC, AUC and ILF criteria is reported in Figure 4. In multi-criteria analysis, a solution dominates (or is non-inferior to) another one if it is better for all criteria. A solution that cannot be dominated is Pareto optimal: any improvement of one of the criteria causes a deterioration on another criterion (see Pareto, 1906). The Pareto surface is the set of all the Pareto optimal solutions.

Figure 4: Evaluation of the predictive performance of selective naive Bayes classifiers on the waveform data set.

The SNB(ACC) method is slightly better than the NB method on the ACC criterion. Owing to its small subset of selected variables, it manages to reduce the redundancy between the variables and to significantly improve the estimation of the class conditional probabilities, as reported by its ILF evaluation. The SNB(AUC) method gets the same AUC performance as the NB method with one variable less. The SNB(MAP) and SNB(BMA) methods almost directly optimize the ILF criterion on the train data set, with a regularization term related to the model prior. They get the best ILF evaluation on the test set, but are dominated by the NB and SNB(ACC) methods on the two other criteria, as shown in Figure 4. Almost all the methods are Pareto optimal: none of them is the best on the three evaluated criteria.

Compared to the other variable selection methods, the SNB(MAP) method truly focuses on complying with the naive independence assumption. This results in a much smaller subset of variables and a better prediction of the class conditional probabilities, at the expense of a decay on the other criteria.

The SNB(BMA) method exploits a model averaging approach which results in soft variable selection. Figure 3 shows the weights of each variable. Surprisingly, the selected variables are almost the same as the 8 variables selected by the SNB(MAP) method. Compared to the hard variable selection scheme, the soft variable selection exhibits mainly one minor change: a new variable (V20) is selected with a small weight. The other modifications of the variable weights are insignificant: two variables (V6 and V17) decrease their weight from 1.0 to 0.99 and three variables (V7, V18 and V19) appear with tiny weights. Since the variable selection is almost the same as in the MAP approach, the model averaging approach does not bring any significant improvement in the evaluation results, as shown in Figure 4.

6. Compression Model Averaging of Selective Naive Bayes Classifiers

This section analyzes the limits of Bayesian model averaging and proposes a new weighting scheme that better exploits the posterior distribution of the models.

6.1 Understanding the Limits of Bayesian Model Averaging

We use again the waveform data set to explain why the Bayesian model averaging method fails to outperform the MAP method.

Figure 5: Index of the selected variables in the 200 most probable selective naive Bayes models for the waveform data set. Each line represents a model, where the variables are in black when selected.

The variable selection problem consists in finding the most probable subset of variables compliant with the naive Bayes assumption among about half a million ($2^{19}$) potential subsets. In order to study the posterior distribution of the models, all these subsets are evaluated. The MAP model selects 8 variables (V5, V6, V9, V10, V11, V12, V13, V17). A close look at the posterior distribution shows that most of the good models (in the top 50%) contain around 10 variables. Figure 5 displays the selected variables in the top 200 models (0.05%). Five variables (V5, V9, V10, V11, V12) among the 8 MAP variables are always selected, and the other models exploit a diversity of subsets of variables.

The potential benefit of model averaging is to account for all these models, with higher weights for the most probable models. However, the posterior distribution is very sharp everywhere, not only around the MAP. Variable V18 is first selected in the 3rd model, which is about 40 times less probable than the MAP model. Variable V4 is first selected in the 10th model, about 4000 times less probable than the MAP model. Figure 6 displays the repartition function of the posterior probabilities, using a log scale. Using this logarithmic transformation, the posterior distribution is flattened and can be visualized. The MAP model is many orders of magnitude more probable than the minimum a posteriori model, which selects no variable.

Figure 6: Repartition function of the posterior probabilities (log10 P(Model) + log10 P(Data | Model)) of the half a million variable selection models evaluated for the waveform data set, sorted by increasing posterior probability. For example, the 10% models on the left represent the models having the lowest posterior probability.

In the waveform example, averaging using the posterior probabilities to weight the models is almost the same as selecting the MAP model (which itself is hard to find with a heuristic search), and model averaging is almost useless. In theory, BMA is optimal (see Raftery and Zheng, 2003), but this optimality result assumes that the true distribution of the data belongs to the space of models. In the case of the selective naive Bayes classifier, this assumption is violated on most real data sets and BMA fails to build effective model averaging.

6.2 Model Averaging with Compression Coefficients

When the posterior distribution is sharply peaked around the MAP, averaging is almost the same as selecting the MAP model. These peaked posterior distributions are more and more likely to happen when the number of instances rises, since a few tens of instances better classified by a model are sufficient to increase its likelihood by several orders of magnitude. Therefore, the algorithmic overhead is not valuable if averaging turns out to be the same as selecting the MAP model.

In order to have a theoretical insight into the relation between MAP and BMA, let us analyze again the model selection criterion (4). It is closely related to the ILF criterion described in Section 5.2, according to

$$l(M_m) + l(D \mid M_m) = -\log P(M_m) + N \cdot \mathrm{ILF}(M_m).$$

For the null model $M_\emptyset$, with no variable selected, we have

$$l(M_\emptyset) + l(D \mid M_\emptyset) = -\log P(M_\emptyset) + N \cdot H(Y).$$

The posterior probability $P(M_m \mid D)$ of a model $M_m$ relative to that of the null model is thus

$$\frac{P(M_m \mid D)}{P(M_\emptyset \mid D)} = \frac{P(M_m)}{P(M_\emptyset)} \left( \frac{e^{H(Y)}}{e^{\mathrm{ILF}(M_m)}} \right)^N. \tag{8}$$

Equation (8) shows that the posterior probability of the models is exponentially peaked when N goes to infinity. Small improvements in the estimation of the conditional entropy bring very large differences in the posterior probability of the models, which explains why Bayesian model averaging is asymptotically equivalent to selecting the MAP model.

We propose an alternative weighting scheme, whose objective is to better account for the set of all models. Let us precisely define the compression coefficient $c(M_m, D)$ of a model. The model selection criterion $l(M_m) + l(D \mid M_m)$ defined in Equation (4) represents the quantity of information required to encode the model plus the class values given the model. The code length of the null model $M_\emptyset$ can be interpreted as the quantity of information necessary to describe the classes, when no explanatory data is used to induce the model. Each model $M_m$ can potentially exploit the explanatory data to better compress the class conditional information. The ratio of the code length of a model to that of the null model stands for a relative gain in compression efficiency. We define the compression coefficient $c(M_m, D)$ of a model as follows:

$$c(M_m, D) = 1 - \frac{l(M_m) + l(D \mid M_m)}{l(M_\emptyset) + l(D \mid M_\emptyset)}.$$

The compression coefficient is 0 for the null model, is maximal when the true class conditional probabilities are correctly estimated and tends to 1 in case of separable classes. This coefficient can be negative for models which provide an estimation worse than that of the null model.

In our heuristic attempt to better account for all the models, we replace the posterior probabilities by their related compression coefficients in the weighting scheme. Let us focus again on the variable weights $b_k$ introduced in Section 4 in our first model averaging method. Dividing the posterior probabilities by those of the null model, we get

$$b_k = \frac{\sum_m a_{mk} \frac{P(M_m \mid D)}{P(M_\emptyset \mid D)}}{\sum_m \frac{P(M_m \mid D)}{P(M_\emptyset \mid D)}}.$$

We introduce new $c_k$ coefficients by taking the log of the probability ratios and normalizing by the code length of the null model. We obtain

$$c_k = \frac{\sum_m a_{mk}\, c(M_m, D)}{\sum_m c(M_m, D)}.$$

Mainly, the principle of this new heuristic weighting scheme consists in smoothing the exponentially peaked posterior probability distribution of Equation (8) with the log function. In the implementation, we ignore the bad models and consider the positive compression coefficients only. We evaluate the compression-based model averaging (CMA) model using the model averaging algorithm introduced in Section 4.4.
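A minimal sketch of the compression-based weights follows, reusing the criterion values collected during the search together with the cost of the null model; as in the BMA sketch of Section 4.4, the array names are illustrative.

```python
import numpy as np

def cma_variable_weights(masks, criterion_values, null_cost):
    """Compression-based soft variable weights c_k of Section 6.2.

    masks            : (M, K) boolean array of variable selections
    criterion_values : (M,) array of l(M_m) + l(D|M_m) for the evaluated models
    null_cost        : l(M_0) + l(D|M_0) of the model with no selected variable
    """
    masks = np.asarray(masks, dtype=float)
    compression = 1.0 - np.asarray(criterion_values, dtype=float) / null_cost
    keep = compression > 0              # models worse than the null model are ignored
    if not np.any(keep):
        return np.zeros(masks.shape[1])
    w = compression[keep]
    return masks[keep].T @ w / w.sum()  # c_k = sum_m a_mk c(M_m,D) / sum_m c(M_m,D)
```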

6.3 Evaluation on the Waveform Data Set

We use the protocol introduced in Section 5 to evaluate the SNB(CMA) compression model averaging method on the waveform data set, with an exhaustive evaluation of all the models to avoid the potential bias of the optimization algorithms. Figure 7 shows the weights of each variable resulting from the soft variable selection of the SNB(CMA) compression model averaging method. Contrary to the SNB(BMA) method, the averaging has a significant impact on variable selection.

Figure 7: Variables selected by the SNB(CMA) method and the alternative selective naive Bayes classifiers for the waveform data set.

Instead of hard selecting about half of the variables as in the SNB(MAP) method, the SNB(CMA) method selects all the variables with weights around 0.5. Interestingly, the variable selection pattern is similar to that of the alternative variable selection methods, in a smoothed version. A central group of variables is emphasized around variable V11, between two less important groups of variables around variables V5 and V17.

Figure 8: Evaluation of the predictive performance of selective naive Bayes classifiers on the waveform data set.

In the waveform data set, all the variables are informative, but the most probable subsets of variables compliant with the naive Bayes assumption select only half of the variables. In other words, whereas many good SNB classifiers are available, none of them is able to account for all the information contained in the variables. Since the BMA model is almost the same as the MAP model, it fails to perform better than one single classifier. Our CMA approach averages complementary subsets of variables and exploits more information than the BMA approach. This smoothed variable selection results in improved performance, as shown in Figure 8. The SNB(CMA) method is the best one: it dominates all the other methods on the three evaluated criteria.

7. Experiments

This section presents an experimental evaluation of the performance of the selective naive Bayes methods described in the previous sections.

7.1 Experimental Setup

The experiments aim at comparing the performance of the model averaging methods versus the MAP method, the standard selective naive Bayes (SNB) and naive Bayes (NB) methods. All the classifiers except the last one exploit the same MODL preprocessing, allowing a fair comparison. The evaluated methods are:

No variable selection
- NB(EF): NB with 10 bins equal frequency discretization and no value grouping,
- NB: NB with MODL preprocessing,

Variable selection
- SNB(ACC): optimization of the accuracy,
- SNB(AUC): optimization of the area under the ROC curve,
- SNB(MAP): MAP SNB model,

Variable selection and model averaging
- SNB(BMA): Bayesian model averaging,
- SNB(CMA): compression-based model averaging (implemented in a tool available as a shareware).

The three last SNB classifiers (SNB(MAP), SNB(BMA) and SNB(CMA)) represent our contribution in this paper. All the SNB classifiers are optimized with the same MS(FFWBW) search heuristic, except the SNB(ACC), which is based on the greedy forward selection heuristic. The DC method (Dash and Cooper, 2002), similar to the SNB(BMA) approach, was not evaluated since it is restricted to categorical attributes. The evaluated criteria are the same as for the waveform data set: accuracy (ACC), area under the ROC curve (AUC) and informational loss function (ILF), with its compression rate (CR) normalization.

Table 1: UCI data sets (name, number of instances, numbers of numerical and categorical variables, number of classes, majority accuracy). The 30 data sets are: Abalone, Adult, Australian, Breast, Crx, German, Glass, Heart, Hepatitis, HorseColic, Hypothyroid, Ionosphere, Iris, LED, LED17, Letter, Mushroom, PenDigits, Pima, Satimage, Segmentation, SickEuthyroid, Sonar, Spam, Thyroid, TicTacToe, Vehicle, Waveform, Wine, Yeast.

We conduct the experiments on two collections of data sets: 30 data sets from the repository at the University of California at Irvine (Blake and Merz, 1996) and 10 data sets from the NIPS 2003 feature selection challenge (Guyon et al., 2006a) and the IJCNN 2006 performance prediction challenge (Guyon et al., 2006c). A summary of some properties of these data sets is given in Table 1 for the UCI data sets and in Table 2 for the challenge data sets. We use stratified 10-fold cross validation to evaluate the criteria. A two-tailed Student test at the 5% confidence level is performed in order to evaluate the significant wins or losses of the SNB(CMA) method versus each other method.

7.2 Results

We collect and average the three criteria owing to the stratified 10-fold cross validation, for the seven evaluated methods on the forty data sets. The results are presented in Table 3 for the UCI data sets and in Table 4 for the challenge data sets.

Table 2: Challenge data sets (name, number of instances, numbers of numerical and categorical variables, number of classes, majority accuracy). The 10 data sets are: Arcene, Dexter, Dorothea, Gisette, Madelon, Ada, Gina, Hiva, Nova, Sylva.

The results are summarized across the data sets using the mean, the number of wins and losses (W/L) for the SNB(CMA) method and the average rank, for each of the three evaluation criteria.

Table 3: Evaluation of the methods on the UCI data sets (mean, wins/losses and average rank of each method for the ACC, AUC and CR criteria).

The three ways of aggregating the results (mean, W/L and rank) are consistent, and we choose to display the mean of each criterion to ease the interpretation. Figure 9 summarizes the results for the UCI data sets and Figure 10 for the challenge data sets.

The results of the two NB methods are reported mainly as a sanity check. The MODL preprocessing in the NB classifier exhibits better performance than the equal frequency discretization method in the NB(EF) classifier. The experiments confirm the benefit of selecting the variables, using the standard selection methods SNB(ACC) and SNB(AUC). These two methods achieve comparable results, with an emphasis on their respective optimized criterion. They significantly improve the results of the NB methods, especially for the estimation of the class conditional probabilities measured by the CR criterion. It is noteworthy that the NB and NB(EF) classifiers obtain poor CR results. Their mean CR result is less than 0 in the case of the challenge data sets, which means that their estimation of the class conditional probabilities is worse than that of the null model (which selects no variable).

Table 4: Evaluation of the methods on the challenge data sets (mean, wins/losses and average rank of each method for the ACC, AUC and CR criteria).

Figure 9: Mean of the ACC, AUC and CR evaluation criteria on the 30 UCI data sets.

The three regularized methods SNB(MAP), SNB(BMA) and SNB(CMA) focus on the estimation of the class conditional probabilities, which are evaluated using the compression rate criterion. They clearly outperform the other methods on this criterion, especially for the challenge data sets where the improvement amounts to about 50%. However, the SNB(MAP) method is not better than the two standard SNB methods for the accuracy and AUC criteria. The MAP method increases the bias of the models by penalizing the complex models, leading to a decayed fit of the data.

Figure 10: Mean of the ACC, AUC and CR evaluation criteria on the 10 challenge data sets.


More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative

More information

Not so naive Bayesian classification

Not so naive Bayesian classification Not so naive Bayesian classification Geoff Webb Monash University, Melbourne, Australia http://www.csse.monash.edu.au/ webb Not so naive Bayesian classification p. 1/2 Overview Probability estimation provides

More information

A Bayesian Approach to Concept Drift

A Bayesian Approach to Concept Drift A Bayesian Approach to Concept Drift Stephen H. Bach Marcus A. Maloof Department of Computer Science Georgetown University Washington, DC 20007, USA {bach, maloof}@cs.georgetown.edu Abstract To cope with

More information

Exact model averaging with naive Bayesian classifiers

Exact model averaging with naive Bayesian classifiers Exact model averaging with naive Bayesian classifiers Denver Dash ddash@sispittedu Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15213 USA Gregory F

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Statistical learning. Chapter 20, Sections 1 3 1

Statistical learning. Chapter 20, Sections 1 3 1 Statistical learning Chapter 20, Sections 1 3 Chapter 20, Sections 1 3 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012

Machine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012 Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

Statistical Learning. Philipp Koehn. 10 November 2015

Statistical Learning. Philipp Koehn. 10 November 2015 Statistical Learning Philipp Koehn 10 November 2015 Outline 1 Learning agents Inductive learning Decision tree learning Measuring learning performance Bayesian learning Maximum a posteriori and maximum

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

Machine Learning, Midterm Exam: Spring 2009 SOLUTION

Machine Learning, Midterm Exam: Spring 2009 SOLUTION 10-601 Machine Learning, Midterm Exam: Spring 2009 SOLUTION March 4, 2009 Please put your name at the top of the table below. If you need more room to work out your answer to a question, use the back of

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)

More information

ECE 5984: Introduction to Machine Learning

ECE 5984: Introduction to Machine Learning ECE 5984: Introduction to Machine Learning Topics: Classification: Logistic Regression NB & LR connections Readings: Barber 17.4 Dhruv Batra Virginia Tech Administrativia HW2 Due: Friday 3/6, 3/15, 11:55pm

More information

Behavioral Data Mining. Lecture 2

Behavioral Data Mining. Lecture 2 Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events

More information

Alternative prior assumptions for improving the performance of naïve Bayesian classifiers

Alternative prior assumptions for improving the performance of naïve Bayesian classifiers Data Min Knowl Disc DOI 1.17/s1618-8-11-6 Alternative prior assumptions for improving the performance of naïve Bayesian classifiers Tzu-Tsung Wong Received: 2 July 27 / Accepted: 14 May 28 Springer Science+Business

More information

Model Averaging With Holdout Estimation of the Posterior Distribution

Model Averaging With Holdout Estimation of the Posterior Distribution Model Averaging With Holdout stimation of the Posterior Distribution Alexandre Lacoste alexandre.lacoste.1@ulaval.ca François Laviolette francois.laviolette@ift.ulaval.ca Mario Marchand mario.marchand@ift.ulaval.ca

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

Pattern Recognition 44 (2011) Contents lists available at ScienceDirect. Pattern Recognition. journal homepage:

Pattern Recognition 44 (2011) Contents lists available at ScienceDirect. Pattern Recognition. journal homepage: Pattern Recognition 44 (211) 141 147 Contents lists available at ScienceDirect Pattern Recognition journal homepage: www.elsevier.com/locate/pr Individual attribute prior setting methods for naïve Bayesian

More information

Stats notes Chapter 5 of Data Mining From Witten and Frank

Stats notes Chapter 5 of Data Mining From Witten and Frank Stats notes Chapter 5 of Data Mining From Witten and Frank 5 Credibility: Evaluating what s been learned Issues: training, testing, tuning Predicting performance: confidence limits Holdout, cross-validation,

More information

PDEEC Machine Learning 2016/17

PDEEC Machine Learning 2016/17 PDEEC Machine Learning 2016/17 Lecture - Model assessment, selection and Ensemble Jaime S. Cardoso jaime.cardoso@inesctec.pt INESC TEC and Faculdade Engenharia, Universidade do Porto Nov. 07, 2017 1 /

More information

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each

More information

Feature Engineering, Model Evaluations

Feature Engineering, Model Evaluations Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

y Xw 2 2 y Xw λ w 2 2

y Xw 2 2 y Xw λ w 2 2 CS 189 Introduction to Machine Learning Spring 2018 Note 4 1 MLE and MAP for Regression (Part I) So far, we ve explored two approaches of the regression framework, Ordinary Least Squares and Ridge Regression:

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

The Bayes classifier

The Bayes classifier The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal

More information

Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier

Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Finite Mixture Model of Bounded Semi-naive Bayesian Networks Classifier Kaizhu Huang, Irwin King, and Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Machine Learning

Machine Learning Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 1, 2011 Today: Generative discriminative classifiers Linear regression Decomposition of error into

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

Active and Semi-supervised Kernel Classification

Active and Semi-supervised Kernel Classification Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),

More information

Naïve Bayes. Vibhav Gogate The University of Texas at Dallas

Naïve Bayes. Vibhav Gogate The University of Texas at Dallas Naïve Bayes Vibhav Gogate The University of Texas at Dallas Supervised Learning of Classifiers Find f Given: Training set {(x i, y i ) i = 1 n} Find: A good approximation to f : X Y Examples: what are

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use? Today Statistical Learning Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Continuous case Learning Parameters for a Bayesian Network Naive Bayes Maximum Likelihood estimates

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Learning with Probabilities

Learning with Probabilities Learning with Probabilities CS194-10 Fall 2011 Lecture 15 CS194-10 Fall 2011 Lecture 15 1 Outline Bayesian learning eliminates arbitrary loss functions and regularizers facilitates incorporation of prior

More information

Bayesian Model Averaging Naive Bayes (BMA-NB): Averaging over an Exponential Number of Feature Models in Linear Time

Bayesian Model Averaging Naive Bayes (BMA-NB): Averaging over an Exponential Number of Feature Models in Linear Time Bayesian Model Averaging Naive Bayes (BMA-NB): Averaging over an Exponential Number of Feature Models in Linear Time Ga Wu Australian National University Canberra, Australia wuga214@gmail.com Scott Sanner

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Tractable Bayesian Learning of Tree Augmented Naive Bayes Models

Tractable Bayesian Learning of Tree Augmented Naive Bayes Models Tractable Bayesian Learning of Tree Augmented Naive Bayes Models Jesús Cerquides Dept. de Matemàtica Aplicada i Anàlisi, Universitat de Barcelona Gran Via 585, 08007 Barcelona, Spain cerquide@maia.ub.es

More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

Data Mining Part 4. Prediction

Data Mining Part 4. Prediction Data Mining Part 4. Prediction 4.3. Fall 2009 Instructor: Dr. Masoud Yaghini Outline Introduction Bayes Theorem Naïve References Introduction Bayesian classifiers A statistical classifiers Introduction

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

MLE/MAP + Naïve Bayes

MLE/MAP + Naïve Bayes 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University MLE/MAP + Naïve Bayes MLE / MAP Readings: Estimating Probabilities (Mitchell, 2016)

More information