Proceedings of International Joint Conference on Neural Networks, Dallas, Texas, USA, August 4-9, 2013

Self-Adaptive Probability Estimation for Naive Bayes Classification

Jia Wu, Zhihua Cai, and Xingquan Zhu

Jia Wu is with the Quantum Computation & Intelligent Systems Research Centre, Faculty of Engineering & Information Technology, University of Technology Sydney, Australia, and is affiliated with the Department of Computer Science, China University of Geosciences Wuhan, China (email: jia.wu@student.uts.edu.au). Xingquan Zhu is with the Quantum Computation & Intelligent Systems Research Centre, Faculty of Engineering & Information Technology, University of Technology Sydney, Australia (email: xingquan.zhu@uts.edu.au). Zhihua Cai is with the Department of Computer Science, China University of Geosciences Wuhan, China (email: zhcai@cug.edu.cn). This work is supported, in part, by an Australian Research Council Discovery Project under Grant No. DP, and by the National Natural Science Foundation of China.

Abstract: Probability estimation from a given set of training examples is crucial for learning Naive Bayes (NB) classifiers. When the number of training examples is insufficient, the estimation suffers from the zero-frequency problem, which prevents NB classifiers from classifying instances whose conditional probabilities are zero. Laplace-estimate and M-estimate are two common methods that alleviate the zero-frequency problem by adding fixed terms to the probability estimation so that no conditional probability becomes zero. A major issue with this type of design is that the fixed terms are pre-specified without considering the uniqueness of the underlying training data. In this paper, we propose an Artificial Immune System (AIS) based self-adaptive probability estimation method, namely AISENB, which uses AIS to automatically and self-adaptively select the optimal terms and values for probability estimation. The immune system based evolutionary computation process, including initialization, clone, mutation, and crossover, ensures that AISENB can adjust itself to the data without explicit specification of functional or distributional forms for the underlying model. Experimental results and comparisons on 36 benchmark datasets demonstrate that AISENB significantly outperforms traditional probability estimation based Naive Bayes classification approaches.

I. INTRODUCTION

Bayesian network is a popular learning tool for decision making [1]. Naive Bayes (NB) [2], as a special case of a Bayesian network, has been widely used in many real-world learning tasks, especially for high dimensional data such as text classification [3] and web mining [4].

Given a training set D = {x_1, ..., x_N} with N instances, each instance contains n attribute values and a class label. We use x_i = {x_{i,1}, ..., x_{i,j}, ..., x_{i,n}, y_i} to denote the ith instance in the dataset D, with x_{i,j} denoting the jth attribute value and y_i denoting the class label of the instance. The class space Y = {c_1, ..., c_k, ..., c_L} denotes the set of labels to which each instance may belong, and c_k denotes the kth label of the class space. The attribute set of the dataset is denoted by A = {a_1, ..., a_j, ..., a_n}, where a_j denotes the jth attribute. Each attribute can be a discrete random variable (with a number of discrete values) or a continuous random variable. In this paper, we focus only on categorical (or nominal) attributes, and for any attribute a_j we use a_j^τ, τ = 1, ..., |a_j|, to denote the τth attribute value of a_j, where |a_j| denotes the total number of distinct values of a_j.
For each instance x_i, its value satisfies x_{i,j} ∈ a_j. For ease of understanding, we also use (x_i, y_i) as a shorthand for an instance and its class label, use x_i as a shorthand for the instance, and use a_j as a shorthand for the jth attribute. For an instance (x_i, y_i) in the training set D, its class label satisfies y_i ∈ Y, whereas a test instance x_t only contains attribute values and its class label y_t needs to be predicted by the learning model. Using the conditional independence assumption, NB classifies an instance according to the rule defined in Eq. (1):

  c(x_t) = \arg\max_{c_k \in Y} P(c_k) \prod_{j=1}^{n} P(x_{t,j} \mid c_k)    (1)

Building an NB classifier is not difficult, because one only needs to calculate the class probability P(c_k) and the conditional probability values P(x_{t,j} | c_k) from the training examples in D. When calculating a conditional probability value P(x_{t,j} | c_k), a necessary step is to observe the distribution of the random variable x_{t,j} conditioned on the given class label c_k. Depending on whether attribute a_j is a discrete or a continuous random variable, the conditional probabilities are modeled either by some continuous probability distribution over the range of the attribute's values or by converting numeric attribute values into a discrete space using discretization approaches. Because a continuous probability distribution requires a predefined distribution model (such as a Gaussian distribution) to be employed in the learning process, and adopting the wrong distribution model can severely deteriorate the learning model, discretization is the most common solution for handling continuous attributes in NB.

The limitation of using discretization is that the resulting model cannot classify instances for which the conditional probability of an attribute value is zero [5]. This is because the product in Eq. (1) becomes zero, so no class can be selected to classify instance x_t. Many reasons can result in a zero conditional probability. The most common one is zero frequency, which means that the value x_{t,j} exists in attribute a_j's domain (i.e., x_{t,j} ∈ a_j) but never appears in the training data D. In practice, this frequently happens when the size of the training dataset is small or the domain of an attribute a_j is large [6]. In order to solve this problem, Laplace-estimate [7] and M-estimate [8] are usually applied for probability estimation by adding a very small value to each conditional probability value. Jiang et al. [9] conducted an extensive empirical study on the performance of several commonly used Laplace-estimate and M-estimate settings when different Bayesian classifiers (NB [2], TAN [10], AODE [11], and HNB [1]) are used as the base classifier, respectively.
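To make the decision rule in Eq. (1) concrete, the following minimal Python sketch applies it to one test instance using pre-computed class priors and per-class conditional probability tables; the names nb_classify, class_prior, and cond_prob are illustrative and not part of the paper.

```python
import math

def nb_classify(x_t, class_prior, cond_prob):
    """Classify one test instance with the Naive Bayes rule of Eq. (1).

    class_prior: dict mapping class label c_k -> P(c_k)
    cond_prob:   dict mapping (j, value, c_k) -> P(x_{t,j} = value | c_k)
    x_t:         list of n attribute values
    Log-probabilities are used to avoid numerical underflow; the arg-max is unchanged.
    """
    best_label, best_score = None, float("-inf")
    for c_k, prior in class_prior.items():
        score = math.log(prior)
        for j, value in enumerate(x_t):
            p = cond_prob.get((j, value, c_k), 0.0)
            if p == 0.0:                  # zero frequency: the whole product collapses
                score = float("-inf")
                break
            score += math.log(p)
        if best_label is None or score > best_score:
            best_label, best_score = c_k, score
    return best_label
```

The explicit zero check makes the zero-frequency problem visible: a single unseen attribute value collapses the whole product for that class, which is exactly what Laplace-estimate and M-estimate are designed to avoid.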

Lowd and Domingos [12] proposed a Naive Bayes model with M-estimate as an alternative to Bayesian networks for general probability estimation tasks. While M-estimate based NB (MENB) has been shown to outperform Laplace-estimate based NB (LENB) in some cases, MENB has two important parameters, m and p, which have to be carefully defined and have a significant impact on the performance of the NB classifier. The two parameters are correlated, with m, a positive integer called the equivalent sample size, controlling the shift towards p. Choosing suitable parameter values for m and p is, unfortunately, a problem dependent task that requires users to have good experience. Despite their important roles, there is no consistent methodology to help identify optimal m and p values for probability estimation, and most of the time those values are arbitrarily set within some predefined ranges. For example, the m value is often empirically set to 1, and p, the base rate or prior estimate of the probability, is typically set by hypothesizing a uniform distribution [13]. In [9], Jiang et al. proposed a nested M-estimate method. Zadrozny and Elkan [14] set p to a constant 1/10. These parameter setting methods for M-estimate have achieved good accuracy on some specific datasets. However, in many real-world applications, the assumption behind each fixed setting is often violated.

Motivated by the above observations, in this paper we propose to use the Artificial Immune System (AIS) [15] mechanism for self-adaptive probability estimation in Naive Bayes classification. Our method uses AIS principles to design an automated search strategy that finds optimal m and p parameters for each dataset. The unique immune system computation process, including initialization, clone, mutation, and crossover, ensures that our method can adjust itself to the data without any explicit specification of functional or distributional form for the underlying model. The experiments and comparisons on 36 UCI benchmark datasets [16], which are commonly used to validate classification algorithms [17], demonstrate that AIS based probability Estimation for Naive Bayes (AISENB) classification can successfully find optimal parameter combinations for probability estimation, and that AISENB consistently outperforms other state-of-the-art NB algorithms.

The remainder of the paper is structured as follows. Section II reviews related work on probability estimation in NB and briefly describes artificial immune systems and their connection to parameter search. In Section III, we propose a new estimation structure for Naive Bayes, including the calculation of the best values for conditional probabilities using an artificial immune system algorithm. Section IV reports experimental setups and comparisons, and we conclude the paper in Section V.

II. RELATED WORK

A. Probability Estimation for Naive Bayes

The Naive Bayes classifier can handle both nominal and numeric attributes. A numeric attribute a_j is normally discretized into a number of intervals over the range of the attribute values, so that the whole attribute a_j is treated as nominal (or categorical) for probability estimation.
A basic approach to estimating these probabilities can be defined as follows:

  n_k = \sum_{x_i \in D;\ y_i = c_k} 1, \qquad p(c_k) = \frac{n_k}{N}    (2)

  p(a_j^\tau \mid c_k) = \frac{n_k^{(j,\tau)}}{n_k}    (3)

  n_k^{(j,\tau)} = \sum_{x_i \in D;\ y_i = c_k;\ x_{i,j} = a_j^\tau} 1    (4)

where N is the total number of instances in the training set D, n_k denotes the number of instances in D whose class label equals c_k, and n_k^{(j,τ)} denotes the number of instances in D whose class label equals c_k and whose jth attribute value equals a_j^τ, as defined in Eq. (4).

1) Laplace-estimate: A limited amount of training data, or the use of discretization, can give rise to the zero-frequency problem, so that instances with zero conditional probabilities cannot be classified. One way to solve the zero-frequency problem is Laplace-estimate. This estimation method introduces a prior term for each attribute a_j such that no attribute has a zero conditional probability value, as given in Eq. (5) and Eq. (6):

  p(c_k) = \frac{n_k + 1}{N + L}    (5)

  p(a_j^\tau \mid c_k) = \frac{n_k^{(j,\tau)} + 1}{n_k + |a_j|}    (6)

where |a_j| is the number of distinct values of attribute a_j and L is the number of classes in D.

2) M-estimate: In Laplace-estimate, the fixed terms are added to p(c_k) and p(a_j^τ | c_k) without taking the sizes of the different classes into consideration. For example, consider a binary classification problem (L = 2) in which the majority class has 1000 instances and the minority class has only 2 instances. The ratio between the class probabilities estimated without and with Laplace-estimate is about 0.999 for the majority class, a trivial change, whereas for the minority class the ratio is about 0.5, a very significant change. In other words, Laplace-estimate has a much larger impact on minority classes than on majority classes [12]. To solve this problem, M-estimate introduces two parameters, m and p, to adjust the impact of the extra terms in the probability estimation as follows:

  p(c_k) = \frac{n_k + m \cdot p}{N + m}    (7)

  p(a_j^\tau \mid c_k) = \frac{n_k^{(j,\tau)} + m \cdot p}{n_k + m}    (8)

where p is the base rate, or prior estimate, of the probability, and m (also called the equivalent sample size) is a positive integer controlling the shift towards p.
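As a concrete illustration of Eqs. (2)-(8), the following Python sketch computes the class prior and one conditional probability from raw counts under plain frequency, Laplace-estimate, and M-estimate; the function names and argument names are illustrative, not from the paper.

```python
def class_prior(n_k, N, L, method="mle", m=1, p=None):
    """Estimate p(c_k) from counts.

    n_k: number of training instances with class c_k
    N:   total number of training instances
    L:   number of classes
    """
    if method == "mle":        # Eq. (2): plain frequency, may be zero
        return n_k / N
    if method == "laplace":    # Eq. (5)
        return (n_k + 1) / (N + L)
    if method == "m":          # Eq. (7): base rate p, equivalent sample size m
        p = 1.0 / L if p is None else p
        return (n_k + m * p) / (N + m)
    raise ValueError(method)


def cond_prob(n_jk, n_k, v_j, method="mle", m=1, p=None):
    """Estimate p(a_j^tau | c_k) from counts.

    n_jk: number of class-c_k instances whose jth attribute value equals a_j^tau
    n_k:  number of class-c_k instances
    v_j:  number of distinct values of attribute a_j, i.e. |a_j|
    """
    if method == "mle":        # Eq. (3)
        return n_jk / n_k
    if method == "laplace":    # Eq. (6)
        return (n_jk + 1) / (n_k + v_j)
    if method == "m":          # Eq. (8)
        p = 1.0 / v_j if p is None else p
        return (n_jk + m * p) / (n_k + m)
    raise ValueError(method)
```

For an attribute value never observed with class c_k (n_jk = 0), plain frequency returns 0, Laplace-estimate returns 1/(n_k + |a_j|), and M-estimate returns m·p/(n_k + m), so the zero-frequency problem disappears as soon as m·p > 0.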

To further take the class size into consideration, Jiang et al. [13] set p to a uniform distribution: for p(c_k), m = 1 and p = 1/L, and for p(a_j^τ | c_k), m = 1 and p = 1/|a_j|. Eq. (7) and Eq. (8) can then be rewritten as follows:

  p(c_k) = \frac{n_k + 1/L}{N + 1}    (9)

  p(a_j^\tau \mid c_k) = \frac{n_k^{(j,\tau)} + 1/|a_j|}{n_k + 1}    (10)

Zadrozny and Elkan [14] propose to set m and p to constants, namely m = 1 and p = 1/10:

  p(c_k) = \frac{n_k + 1/10}{N + 1}    (11)

  p(a_j^\tau \mid c_k) = \frac{n_k^{(j,\tau)} + 1/10}{n_k + 1}    (12)

Jiang et al. [9] propose a nested parameter setting method, in which p(c_k) is estimated as in Eq. (9), but for p(a_j^τ | c_k) the parameter p is set to p(a_j^τ), which is in turn estimated by M-estimate:

  p(a_j^\tau) = \frac{n^{(j,\tau)} + m \cdot p}{N + m}, \qquad n^{(j,\tau)} = \sum_{x_i \in D;\ x_{i,j} = a_j^\tau} 1    (13)

where m = 1 and p = 1/|a_j|, and n^{(j,τ)} denotes the number of instances in D whose jth attribute value equals a_j^τ. Eq. (8) can then be rewritten to estimate p(a_j^τ | c_k) as follows:

  p(a_j^\tau \mid c_k) = \frac{n_k^{(j,\tau)} + (n^{(j,\tau)} + 1/|a_j|)/(N + 1)}{n_k + 1}    (14)

B. AIS: Artificial Immune Systems

The human immune system contains two major parts: (1) humoral immunity, which deals with infectious agents in the blood and body tissues, and (2) cell-mediated immunity, which deals with body cells that have been infected. In general, the humoral system is managed by B-cells (with help from T-cells), and the cell-mediated system is managed by T-cells [18]. In this paper, we only consider the humoral part of the natural immune system, and the action of T-cells is not modeled. When pathogens invade the body, antibodies produced by B-cells respond to the detection of a foreign protein, or antigen [19]. This response process can be explained by clonal selection theory [20], illustrated in Figure 1.

Fig. 1. A simple view of the immune response: when a B-cell (the middle rings on the left) recognizes an antigen (lozenge) with a certain affinity, the system responds with proliferation, differentiation, and variation of the B-cell, which then secretes antibodies. Antibodies with high affinity become memory cells; the others become effector cells.

The clonal selection performed by the B-cells of the human immune system is the fundamental mechanism on which AIS is modeled. When AIS is used for classification, the shape-space representation, which aims at quantitatively describing the interactions among immune cells, is commonly used for modeling antibodies and antigens [21]. AIS has been widely used in various areas of research, including pattern recognition [22], clustering [23], optimization [24], and remote sensing [25]. However, few applications have been reported for Bayesian networks. In this paper, we propose a new AIS based probability estimation method that achieves high classification accuracy for M-estimate based NB.

III. AISENB: ARTIFICIAL IMMUNE SYSTEM BASED PROBABILITY ESTIMATION FOR NB

A. Problem Definition

In this paper, we focus on the calculation of the conditional probability p(x_{t,j} | c_k) and the class probability p(c_k) by using M-estimate with optimal values for the parameters m and p. While existing M-estimate based approaches define the m and p values without considering the uniqueness of the underlying training data, we treat the selection of optimal m and p values as an optimization process. From Eq. (7) and Eq. (8), in order to obtain the class label of a test instance x_t in the test set T, each conditional probability p(x_{t,j} | c_k), j = 1, ..., n, k = 1, ..., L (where n denotes the number of attributes and L the number of class labels), and the class probability p(c_k) need to be calculated.
Assume that the calculation of each conditional probability value p(x_{t,j} | c_k) has its own optimal pair of values <m_j, p_j>. Then n pairs <m_j, p_j>, j = 1, ..., n, are needed to finish the classification process, while one additional pair is needed for the class probability p(c_k). For ease of understanding, we use a single vector <m, p>, with m = <m_1, ..., m_{n+1}> and p = <p_1, ..., p_{n+1}>, to denote the n + 1 pairs <m_j, p_j>, j = 1, ..., n + 1. As a result, M-estimate based NB classification can be translated into the following optimization problem:

  c(x_t) = \arg\max_{c_k \in Y} P(c_k) \prod_{j=1}^{n} P(x_{t,j} \mid c_k)
  s.t.  p(x_{t,j} \mid c_k) = \frac{n_k^{(j,t)} + m_j p_j}{n_k + m_j},  1 \le j \le n,
        p(c_k) = \frac{n_k + m_j p_j}{N + m_j},  j = n + 1,
        m_j \in \{1, \ldots, +\infty\},  0 < p_j \le 1    (15)

where the tuple <m_j, p_j> denotes the jth element of <m, p>, and n_k^{(j,t)} denotes the number of training instances with class label c_k whose jth attribute value equals x_{t,j}.

B. AISENB

In order to use AIS to improve the performance of the M-estimate method for NB, we regard the <m, p> vector used for probability estimation as the antibody in an immune response process, and use AIS to obtain the optimal m and p vector.
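The sketch below, under the same count-based representation as the earlier examples, shows how a single antibody, i.e., one pair of vectors m and p of length n + 1, would be applied when classifying a test instance according to Eq. (15); mapping the last entry (index n) to the class prior and the dictionary layout of the counts are assumptions of this sketch.

```python
from math import log

def aisenb_classify(x_t, counts, n_k, N, m, p):
    """Classify x_t with per-attribute M-estimate parameters, as in Eq. (15).

    counts[(j, value, c_k)]: number of class-c_k instances whose jth attribute equals value
    n_k[c_k]:                number of class-c_k instances
    N:                       total number of training instances
    m, p:                    lists of length n+1; entry j < n is used for attribute j,
                             entry n is used for the class prior
    """
    n = len(x_t)
    best_label, best_score = None, float("-inf")
    for c_k in n_k:
        prior = (n_k[c_k] + m[n] * p[n]) / (N + m[n])
        score = log(prior)
        for j in range(n):
            n_jt = counts.get((j, x_t[j], c_k), 0)
            score += log((n_jt + m[j] * p[j]) / (n_k[c_k] + m[j]))
        if best_label is None or score > best_score:
            best_label, best_score = c_k, score
    return best_label
```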

TABLE I
MAPPING BETWEEN THE IMMUNE SYSTEM AND AISENB

  Immune Response      AISENB
  Antigens             Training instances in the datasets
  Antibody             Parameter vector <m, p>
  Shape-space          Possible values of the data vectors
  Affinity             The classification accuracy obtained using the vector <m, p> on the test data
  Clonal Expansion     Reproduction of parameter vectors that are well matched with antigens
  Affinity Maturation  Specific mutation and crossover of the <m, p> vector and removal of the lowest stimulated parameter vectors
  Immune Memory        Memory set of mutated and crossed parameter vectors
  Metadynamics         Continual removal and creation of parameter vectors

Antigens in AISENB are simulated as feature attribute vectors that are presented to the system during the training and testing process. In particular, AISENB has its own specific representation for Naive Bayes classification. The antibodies, i.e., candidate <m, p> vectors with good affinity (classification accuracy), experience a form of clonal expansion after being presented with input data (analogous to antigens). After antibodies are cloned, they go through a mutation process with a specifically designed mutation function. In order to increase the diversity of the population during evolution and to ensure that the algorithm can search for the global optimum, a crossover operation, designed specifically for our probability estimation problem, is also adopted in AISENB. The evolving optimization process of the AIS system helps discover the candidate <m, p> vector with the best classification accuracy for NB classification. Table I summarizes the mapping between the immune system and AISENB.

Like the Laplace-estimate and M-estimate methods, AISENB estimates probabilities through additive terms, but it adjusts the parameters m and p adaptively. Before introducing the detailed algorithm design, we briefly define the following notation. Let W = {w_1, ..., w_H} represent the set of antibodies, where H is the number of antibodies and w_h is a single antibody. We use w_h = {w_{h,1}, ..., w_{h,n+1}} to denote the hth antibody in the antibody set W, with w_{h,j} denoting the jth value of w_h. Let w_c represent the memory cell, i.e., the antibody with the best affinity. In AISENB, W represents the set of <m, p> vectors, <M, P> = {<m_1, p_1>, ..., <m_H, p_H>}; w_h corresponds to <m_h, p_h> = {<m_{h,1}, p_{h,1}>, ..., <m_{h,n+1}, p_{h,n+1}>}, the hth antibody in <M, P>, with <m_{h,j}, p_{h,j}> (analogous to w_{h,j}) denoting its jth value; and w_c corresponds to the <m_c, p_c> with the best classification accuracy. The training set D^a = {x^a_1, ..., x^a_{N^a}} represents the set of antigens, with N^a antigens, in which x^a_i represents the ith antigen.

Fig. 2. The AISENB classification system: the training data are preprocessed and the antibody set W is initialized; the antibody population then evolves through the clone process, a memory cell w_c is developed to complete the training on the antigens of generation N, and when the stopping condition is met the optimal memory cell w_c is obtained and used to classify the test data.

We use the AIS method to learn the optimal w_c in AISENB, with no prior assumption or information about the parameters (we expect AIS to find the optimal parameters automatically). After obtaining the best individual w_c, we build the AISENB classifier to classify the test data.
The detailed process of our new algorithm, AISENB, is described as follows:

1) Initialization: For the individuals in W, we first determine the antibody population size H and generate every individual w_h, h = 1, 2, ..., H, in the antibody population through a random mechanism: the p_{h,j} value of each individual w_h is set to a random number in (0, 1], while the m_{h,j} value is set to a random positive integer. In addition, in the experiments a certain portion (e.g., 80%) of the training instances D is used as the antigen set D^a to learn w_c, and the remaining instances form the affinity test set D^b.

2) Clone of AISENB:

Calculation of the affinity function: The affinity of the hth individual of the tth generation, (w_h)^t, is the classification accuracy obtained by AISENB when using (w_h)^t to carry out the probability estimation. The affinity function is calculated as

  f[(w_h)^t] = \frac{1}{N^b} \sum_{i=1}^{N^b} \delta[c(x^b_i), y^b_i]    (16)

where c(x^b_i) is the classification result for the ith instance of the affinity test set D^b (which contains N^b instances) produced by the AISENB classifier based on individual (w_h)^t, y^b_i is the actual class label of the ith instance, and δ[c(x^b_i), y^b_i] is one if c(x^b_i) = y^b_i and zero otherwise.

Antibody selection: We sort the individuals in the initial antibody population according to the affinity of each individual, and then choose the individual (w_c)^t with the best affinity in the tth generation as the memory antibody.
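A minimal sketch of the initialization step and of the affinity function in Eq. (16), assuming each antibody is stored as a pair of lists (m, p); the helper names and the upper bound m_max used for the random positive integers are assumptions of the sketch, since the paper only requires m to be a positive integer.

```python
import random

def init_population(H, n, m_max=8):
    """Step 1 (initialization): H antibodies, each an <m, p> pair of length n + 1.

    p values are drawn from (0, 1]; m values are random positive integers.
    """
    population = []
    for _ in range(H):
        m = [random.randint(1, m_max) for _ in range(n + 1)]
        p = [random.uniform(1e-6, 1.0) for _ in range(n + 1)]  # approximates (0, 1]
        population.append((m, p))
    return population

def affinity(classify_fn, D_b):
    """Step 2 (Eq. 16): accuracy of one antibody's classifier on the affinity test set D_b.

    classify_fn: callable mapping an attribute-value list to a predicted class label,
                 built from one antibody's <m, p> vectors
    D_b:         list of (x, y) pairs held out from the training data
    """
    correct = sum(1 for x, y in D_b if classify_fn(x) == y)
    return correct / len(D_b)
```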

Algorithm 1 AISENB (probability estimation by AIS)
Input: clone factor c; threshold T; crossover factor CR; maximum evolution generation MaxGen; affinity test set D^b; antibody population W; antigen population D^a
Output: the target class label c(x_t) of a test instance x_t
 1: W ← for each individual w_h, set each p_{h,j} to a random number in (0, 1] and each m_{h,j} to a random positive integer
 2: while t ≤ MaxGen and f[(w_c)^{t+1}] − f[(w_c)^t] ≥ T do
 3:   f[(w_h)^t] ← apply the antigen population D^a and the affinity test set D^b to antibody (w_h)^t and calculate its affinity
 4:   (w_c)^t ← rank the whole antibody population (W)^t by f[(w_h)^t] and find the (w_c)^t with the best affinity
 5:   (W^r)^t ← select the c antibodies with the lowest affinity to obtain the temporary antibody set
 6:   (W^c)^t ← clone (w_c)^t with clone factor c to obtain the clone antibody set
 7:   (W)^t ← ((W)^t \ (W^r)^t) ∪ (W^c)^t
 8:   for each (w_h)^t in (W)^t do
 9:     (v_h)^{t+1} ← apply two randomly selected antibodies (w_{r1})^t and (w_{r2})^t to (w_h)^t and obtain the mutation individual
10:     (u_h)^{t+1} ← apply CR and (v_h)^{t+1} to (w_h)^t and obtain the crossover individual
11:     (w_h)^{t+1} ← apply (u_h)^{t+1} to (w_h)^t and obtain the new individual of the (t+1)th generation
12:   end for
13: end while
14: c(x_t) ← apply w_c to the test instance x_t and predict its class label

Antibody clone: To ensure that the population size of every generation is fixed, the best individual (w_c)^t is cloned according to the clone factor c. After that, the individuals with the lowest affinity are replaced by the clone set, at the same rate c, to preserve the population size.

3) Evolution of AISENB:

Antibody mutation: The mutation operation is applied to the individuals of the tth generation, (W)^t: an intermediate generation is formed from new variant individuals derived from the parent generation. For any individual (w_h)^t of the tth generation, the new variant individual (v_h)^{t+1} is generated as

  (v_h)^{t+1} = (w_h)^t + F \, [(w_{r1})^t - (w_{r2})^t]    (17)

where r1 and r2 are randomly selected indices that differ from h. F, the variation factor during the evolution process, can be obtained adaptively for different clones [25]:

  F = 1 - f[(w_h)^t]    (18)

where f[(w_h)^t] denotes the affinity of the hth individual of the tth generation. In the mutation process, it is necessary to keep the perturbation zero-mean, because this reduces the possibility of producing infeasible solutions.

Antibody crossover: To avoid the loss of diversity among individuals in the population and to enhance the convergence speed of AISENB, the hth individual of the tth generation is crossed with its corresponding variant vector obtained by the mutation strategy. To ensure that the trial individual obtained by the crossover operation carries the evolved features, at least one dimension of (m_{h,j})^{t+1} and (p_{h,j})^{t+1} in the trial vector (u_h)^{t+1} must be provided by the mutation vector (v_h)^{t+1}. We use the crossover probability factor CR to decide which dimensions are provided by the mutation vector and which by the target vector. The crossover is defined as

  (u_{h,j})^{t+1} = \begin{cases} (v_{h,j})^{t+1}, & \text{if } rand(j) \le CR \text{ or } j = rand(h) \\ (w_{h,j})^t, & \text{if } rand(j) > CR \text{ and } j \ne rand(h) \end{cases}    (19)

where rand(j) is a uniformly distributed random number in [0, 1], 0 ≤ j < n + 1, n is the number of attributes, and rand(h) is a random integer in [1, n + 1], which ensures that at least one dimension of the m and p variables of the trial vector is provided by the variant vector (v_h)^{t+1}; otherwise no new individual could be generated, and the target vector (w_h)^t and the vector obtained by the crossover operation would be identical.
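The following Python sketch illustrates the mutation and crossover operators of Eqs. (17)-(19) on one antibody, treating the <m, p> pair as two parallel lists; clipping the mutated values back to their valid ranges is an assumption of the sketch, added because Eq. (15) requires m_j to be a positive integer and 0 < p_j ≤ 1.

```python
import random

def mutate(w_h, w_r1, w_r2, affinity_h):
    """Eqs. (17)-(18): differential-style mutation with adaptive factor F = 1 - affinity."""
    m_h, p_h = w_h
    m_r1, p_r1 = w_r1
    m_r2, p_r2 = w_r2
    F = 1.0 - affinity_h
    # Clip back to the feasible region of Eq. (15): m_j >= 1 (integer), 0 < p_j <= 1.
    m_v = [max(1, round(m_h[j] + F * (m_r1[j] - m_r2[j]))) for j in range(len(m_h))]
    p_v = [min(1.0, max(1e-6, p_h[j] + F * (p_r1[j] - p_r2[j]))) for j in range(len(p_h))]
    return m_v, p_v

def crossover(w_h, v_h, CR):
    """Eq. (19): each dimension comes from the mutant with probability CR;
    one forced dimension (rand(h)) guarantees the trial vector differs from the target."""
    m_h, p_h = w_h
    m_v, p_v = v_h
    dims = len(m_h)
    forced = random.randrange(dims)   # 0-indexed counterpart of rand(h) in [1, n+1]
    m_u, p_u = [], []
    for j in range(dims):
        if random.random() <= CR or j == forced:
            m_u.append(m_v[j]); p_u.append(p_v[j])
        else:
            m_u.append(m_h[j]); p_u.append(p_h[j])
    return m_u, p_u
```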
4) Antibody Population Update: To determine whether the crossover individual (u_h)^{t+1} can replace the target individual vector (w_h)^t as the new individual (w_h)^{t+1} of the (t+1)th generation, the AIS algorithm adopts a greedy search strategy. More specifically, the crossover individual (u_h)^{t+1} is chosen as the offspring if its affinity is better than that of the target individual (w_h)^t; otherwise, the individual (w_h)^t is kept in the (t+1)th generation. The system then chooses the individual (w_c)^{t+1} with the best affinity in the (t+1)th generation as the new memory antibody.

The evolutionary process for the population consists of steps 2 to 4 above. AIS repeats this process until the evolution exceeds the pre-set maximum number of generations MaxGen, or the same optimal result is obtained continuously for more than a given threshold number T of generations.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experimental Conditions

We conduct our experiments using WEKA [26] and validate the algorithm performance on 36 UCI benchmark datasets [16], which represent a wide range of domains and data characteristics and are described in Table II. In our experiments, we replace all missing attribute values using the unsupervised attribute filter ReplaceMissingValues in WEKA, and apply the unsupervised filter Discretize in WEKA to discretize numeric attributes into nominal attributes.

We now introduce the baseline algorithms and their abbreviations used in our experiments. MENB with the symbol Po denotes MENB using an (m, p) setting identified by our experimental analysis.

1. MENB-Eq9,10: NB with M-estimate (MENB) using the setting of m and p from the literature [13], i.e., Eq. (9) and Eq. (10).
2. MENB-Eq9,14: MENB using the setting of m and p from the literature [9], i.e., Eq. (9) and Eq. (14).

3. MENB-Eq11,12: MENB using the setting of m and p from the literature [14], i.e., Eq. (11) and Eq. (12).
4. MENB-Po[1,0.0001]: MENB with the setting (m = 1, p = 0.0001).
5. MENB-Po[1,0.05]: MENB with the setting (m = 1, p = 0.05).
6. MENB-Po[7,0.1]: MENB with the setting (m = 7, p = 0.1).
7. MENB-Po[8,0.1]: MENB with the setting (m = 8, p = 0.1).
8. LENB: NB with Laplace-estimate.
9. AISENB: Artificial Immune System based self-adaptive probability estimation (the proposed method).

In our experiments, the maximum evolution generation MaxGen is set to 100 and the size of the antibody population is set to 50. The clone rate c is set to 5%, and the crossover factor CR in the crossover process is set to 0.6. The threshold T is set to .

TABLE II
DETAILED INFORMATION OF EXPERIMENTAL DATA: DATASET, MISSING VALUES (Y/N), NUMERIC ATTRIBUTES (Y/N)

  anneal          Y  Y      credit-g        N  Y      kr-vs-kp        N  N      sick            Y  Y
  anneal.orig     Y  Y      diabetes        N  Y      labor           Y  Y      sonar           N  Y
  audiology       Y  N      glass           N  Y      letter          N  Y      soybean         Y  N
  autos           Y  Y      heart-c         Y  Y      lymph           N  Y      splice          N  N
  balance-scale   N  Y      heart-h         Y  Y      mushroom        Y  N      vehicle         N  Y
  breast-cancer   Y  N      heart-statlog   N  Y      primary-tumor   Y  N      vote            Y  N
  breast-w        Y  N      hepatitis       Y  Y      segment         N  Y      vowel           N  Y
  colic           Y  Y      hypothyroid     Y  Y                                waveform        N  Y
  colic.orig      Y  Y      ionosphere      N  Y                                zoo             N  Y
  credit-a        Y  Y      iris            N  Y

B. Cross-test for Parameter Setting

This part of the experiment evaluates the effect of using different m and p values (the parameter selection in M-estimate) on Naive Bayes classification. We compare algorithm performance based on Laplace-estimate and on M-estimate using Eq. (7) and Eq. (8), and carry out a cross-test over m values (m = 1, 2, ..., 8) and p values (ranging from 0.0001 to 0.5) via 10 runs of 10-fold cross validation. Figure 3 reports the average classification accuracy of MENB (with m and p values from the above ranges) over the entire set of 36 datasets. The results show that the best parameter setting over all 36 datasets is <m = 1, p = 0.05> (average accuracy 82.74%).

Fig. 3. The average accuracy (%) of LENB (82.16%) and of MENB with different m and p values on the 36 UCI benchmark datasets (m = 1, 2, ..., 8; p ranging from 0.0001 to 0.5).

While the above analysis shows that m = 1 and p = 0.05 result in good overall performance, this parameter setting may not guarantee the best performance on every individual dataset. In order to examine this hypothesis, we take <m = 1, p = 0.05> as one possible optimal parameter combination and carry out a further experimental analysis to find other optimal parameter combinations. We analyze the frequency with which each parameter value m, p, and each combination of the two, is optimal on each dataset; the result is shown in Table III. The frequency values in Table III show, for example, that the combination <m = 1, p = 0.0001> is optimal on four datasets. This result indicates that attribute dependencies and distributions differ as the dataset varies. According to Table III, the single parameter value with the maximum frequency is m = 1 (frequency 17) among the m values, and p = 0.1 (frequency 18) among the p values, so <m = 1, p = 0.1> is considered a second optimal combination. This experimentally determined setting coincides with the M-estimate setting used in [14]. <m = 1, p = 0.0001>, <m = 7, p = 0.1>, and <m = 8, p = 0.1> are regarded as further possible optimal parameter combinations, each with a frequency of 4.
On the basis of the discussion above, we identify four possible optimal parameter combinations: <m = 1, p = 0.0001>, <m = 7, p = 0.1>, <m = 8, p = 0.1>, and <m = 1, p = 0.05>.

TABLE III
FREQUENCY OF EACH OPTIMAL PARAMETER (m, p) COMBINATION OVER THE 36 UCI DATASETS
(rows: p values; columns: m = 1, ..., 8; the last column and last row give the marginal frequencies of each p and m value)

TABLE IV
EXPERIMENTAL RESULTS OF AISENB VERSUS NAIVE BAYES WITH LAPLACE-ESTIMATE (LENB), M-ESTIMATE USING THE (m, p) SETTINGS FROM THE LITERATURE [13] (MENB-Eq9,10), [9] (MENB-Eq9,14), [14] (MENB-Eq11,12), AND M-ESTIMATE USING THE (m, p) SETTINGS FROM OUR EXPERIMENTAL ANALYSIS (MENB-Po[1,0.05], MENB-Po[1,0.0001], MENB-Po[7,0.1], MENB-Po[8,0.1]): CLASSIFICATION ACCURACY AND STANDARD DEVIATION ON THE 36 DATASETS
(* indicates a statistically significant degradation relative to AISENB via 10 runs of 10-fold cross validation with a 95% confidence level using a t-test; the summary w/t/l counts against AISENB are 0/26/10 for LENB, 0/28/8 for MENB-Eq9,10, 0/30/6 for MENB-Eq9,14, 0/29/7 for MENB-Eq11,12, 0/28/8 for MENB-Po[1,0.05], 0/29/7 for MENB-Po[1,0.0001], 0/24/12 for MENB-Po[7,0.1], and 0/24/12 for MENB-Po[8,0.1])

C. Accuracy Analysis for MENB in Connection with LENB

The experimental results in Figure 3 show that different parameter settings in M-estimate can affect classifier performance. NB with M-estimate (MENB) overall performs better than NB with Laplace-estimate (LENB). Compared with LENB (accuracy 82.16%), MENB with (m = 1, p ≤ 0.1), (m = 2, p ≤ 0.1), (m = 3, p ≤ 0.1), and (m = 4, 0.01 ≤ p ≤ 0.1) all achieve higher average classification accuracy, and MENB with (m = 1, p = 0.05), at 82.74%, performs clearly better than the other parameter settings. The classification accuracy of MENB decreases as the m value grows: over the 36 datasets, for any given p value, the larger the m value, the lower the classification accuracy. In other words, MENB with m = 1 is optimal.
This is consistent with previous studies, where the m values in the literature [13], [9], and [14] are all set to 1.

D. The Accuracy of AISENB

In this section, we compare our AISENB based probability estimation method with the M-estimate scheme (using the m and p settings from the literature [13], [9], and [14]), M-estimate with the possible optimal parameter settings identified by our experimental analysis, and the Laplace-estimate scheme. Table IV reports the accuracies of AISENB, MENB, and LENB. For MENB and LENB, we use the m and p values reported in the literature. The average accuracy and standard deviation are summarized at the bottom of the table. A * symbol in Table IV indicates that the classification performance of the corresponding algorithm is statistically significantly lower than that of AISENB (at the 95% confidence level under a t-test). The w/t/l value in Table IV reports the number of times that the corresponding algorithm wins, ties, and loses over all 36 benchmark datasets, compared to AISENB.

The detailed results in Table IV show that the proposed AISENB method is highly competitive with the Naive Bayes classifier using Laplace-estimate, M-estimate with settings from the literature, and M-estimate with the optimal parameter settings from our experimental analysis. Several major findings can be highlighted as follows.

1. AISENB significantly outperforms LENB, with 10 wins and 0 losses. The average classification accuracy over the 36 datasets for AISENB (83.24%) is higher than that of LENB (82.16%).

2. AISENB outperforms MENB-Eq9,10 with 8 wins and 0 losses, MENB-Eq9,14 with 6 wins and 0 losses, and MENB-Eq11,12 with 7 wins and 0 losses. The average classification accuracy of AISENB (83.24%) is higher than that of MENB-Eq9,10 (82.60%), MENB-Eq9,14 (82.63%), and MENB-Eq11,12 (82.68%).

3. AISENB significantly outperforms MENB-Po[7,0.1] (12 wins and 0 losses), MENB-Po[8,0.1] (12 wins and 0 losses), and MENB-Po[1,0.05] (8 wins and 0 losses), and outperforms MENB-Po[1,0.0001] in accuracy (7 wins and 0 losses). The average classification accuracy of AISENB (83.24%) is higher than that of MENB-Po[7,0.1] (81.40%), MENB-Po[8,0.1] (81.22%), MENB-Po[1,0.05] (82.74%), and MENB-Po[1,0.0001] (82.61%).

Considering that AISENB has an adaptive probability estimation mechanism whereas MENB requires a number of parameter settings, AISENB is overall more effective and more stable than existing NB methods. It is also worth mentioning that existing research [28] has found strong attribute dependencies in the kr-vs-kp dataset; on this dataset our method AISENB achieves 91.10% accuracy, whereas the accuracy of the other methods is about 87.80%.

V. CONCLUSION

In this paper, we studied two typical probability estimation methods, Laplace-estimate (LENB) and M-estimate (MENB), for Naive Bayes classification. Our analysis shows that MENB performs better than LENB, but its accuracy is unstable with respect to different m and p parameter values. This unstable performance under different parameter settings motivated us to design a self-adaptive parameter selection algorithm for probability estimation in NB classification. To address this challenge, we proposed an artificial immune system (AIS) based method to adaptively estimate the probabilities, and validated the proposed design on 36 benchmark datasets. The experimental results and comparisons demonstrated that our method (AISENB) outperforms the state-of-the-art probability estimation algorithms and can indeed achieve optimal parameter selection for different datasets.

REFERENCES

[1] L. Jiang, H. Zhang and Z. Cai, A Novel Bayes Model: Hidden Naive Bayes, IEEE Transactions on Knowledge and Data Engineering, vol. 21.
[2] H.H. Shan and A. Banerjee, Mixed-membership naive Bayes models, Data Mining and Knowledge Discovery, vol. 23, pp. 1-62.
[3] S.B. Kim, K.S. Han, H.C. Rim and K.U Seoul, Some Effective Techniques for Naive Bayes Text Classification, IEEE Transactions on Knowledge and Data Engineering, vol. 18.
[4] C. Zhang, G.R. Xue, Y. Yu and H.Y. Zha, Web-scale classification with naive bayes, in Proceedings of the 18th International Conference on World Wide Web, WWW 09, ACM Press, USA, 2009.
[5] G.I. Webb, J.R. Boughton, F. Zheng, K.M. Ting and H. Salem, Learning by extrapolation from marginal to full-multivariate probability distributions: decreasingly naive Bayesian classification, Machine Learning, vol. 86.
[6] J. Duan, Z. Lin, W. Yi and M. Lu, Scaling Up the Accuracy of Bayesian Classifier Based on Frequent Itemsets by M-estimate, in Proceedings of Artificial Intelligence and Computational Intelligence, AICI 10, Springer Press, China, 2010.
[7] B. Cestnik, Estimating Probabilities: A Crucial Task in Machine Learning, in Proceedings of the 9th European Conference on Artificial Intelligence, ECAI 90, IOS Press, Sweden, 1990.
[8] T.M. Mitchell, Machine Learning, McGraw-Hill Publishers.
[9] L. Jiang, D. Wang and Z.
Cai, Scaling Up the Accuracy of Bayesian Network Classifiers by M-Estimate, in Proceedings of the 3rd International Conference on Intelligent Computing, ICIC 07, Springer Press, China, 2007, pp [10] N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, vol. 29, pp , [11] G.I. Webb, J. Boughton and Z. Wang, Not So Naive Bayes: Aggregating One-Dependence Estimators, Machine Learning, vol. 58 pp. 5-24, [12] D. Lowd, P. Domingos, Naive Bayes models for probability estimation, in Proceeding of 22nd International Conference on Machine Learning, ICML 05, ACM Press, Bonn, Germany, 2005, pp [13] L. Jiang, H. Zhang, Z.H. Cai and D. Wang, Weighted Averaged One-Dependence Estimators, Journal of experimental and Theoretical Artificial Intelligence, vol. 24, pp , [14] B. Zadrozny and C. Elkan, Learning and making decisions when costs and probabilities are both unknown, in Proceedings of 7th ACM SIGKDD international conference on Knowledge discovery and data mining, SIGKDD 01, ACM Press, USA, 2001, pp [15] L.N. De Castro and J. Timmis, Artificial Immune Systems: A New Computational Intelligence Approach, Springer Verlag, [16] C. Merz, P. Murphy, and D. Aha, UCI repository of machine learning databases. in Dept of ICS, University of California, Irvine, [17] G.B. Huang, X.J. Ding and H.M. Zhou, Optimization method based extreme learning machine for classification, Neurocomputing, vol. 74, pp , [18] A. Watkins, J. Timmis, Artificial Immune Recognition System (AIRS): Revisions and Refinements, in: Proceedings of 1st International Conference on Artificial Immune Systems, ICARIS 20, Canterbury, England, 2002, pp [19] J. Yang, X.J. Liu, T. Li, G. Liang and S.J. Liu, Distributed agents model for intrusion detection based on AIS, Knowledge-Based Systems, vol. 22, pp , [20] R.H. Shang, L.C. Jiao, F. Liu and W.P. Ma, A Novel Immune Clonal Algorithm for MO Problems, IEEE Transactions on Evolutionary Computation, vol. 16, pp , [21] S. Ozsen and S. Gunes, Attribute weighting via genetic algorithms for attribute weighted artificial immune system (AWAIS) and its application to heart disease and liver disorders problems, Expert Systems with Applications, vol. 36, pp , [22] J.S. Yuan, L.W. Zhang, C.Z. Zhao, Z. Li and Y.H. Zhang, An Improved Self-organization Antibody Network for Pattern Recognition and Its Performance Study, Pattern Recognition, vol. 321, pp , [23] L. de Mello Honorio, A.M.L. da Silva and D.A. Barbosa, A Cluster and Gradient-Based Artificial Immune System Applied in Optimization Scenarios, IEEE Transactions on Evolutionary Computation, vol. 16, pp , [24] K.M. Woldemariam, Vaccine-Enhanced Artificial Immune System for Multimodal Function Optimization, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 40, pp , [25] Y.F. Zhong and L.P. Zhang, An Adaptive Artificial Immune Network for Supervised Classification of Multi-/Hyperspectral Remote Sensing Imagery, IEEE Transactions on Geoscience and remote sensing, vol. 50, pp , [26] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.), San Francisco, CA: Morgan Kaufmann, [27] C. Nadeau and Y. Bengio, Inference for the generalization error, Machine Learning, vol. 52, pp , [28] R. Kohavi, Scaling Up the Accuracy of Naive-Bayes Classifiers:A Decision-Tree Hybrid, in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, KDD 96, AAAI Press, USA, 1996, pp


More information

Keywords: Multimode process monitoring, Joint probability, Weighted probabilistic PCA, Coefficient of variation.

Keywords: Multimode process monitoring, Joint probability, Weighted probabilistic PCA, Coefficient of variation. 2016 International Conference on rtificial Intelligence: Techniques and pplications (IT 2016) ISBN: 978-1-60595-389-2 Joint Probability Density and Weighted Probabilistic PC Based on Coefficient of Variation

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

Feature Selection with Fuzzy Decision Reducts

Feature Selection with Fuzzy Decision Reducts Feature Selection with Fuzzy Decision Reducts Chris Cornelis 1, Germán Hurtado Martín 1,2, Richard Jensen 3, and Dominik Ślȩzak4 1 Dept. of Mathematics and Computer Science, Ghent University, Gent, Belgium

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Neural Network Construction using Grammatical Evolution

Neural Network Construction using Grammatical Evolution 2005 IEEE International Symposium on Signal Processing and Information Technology Neural Network Construction using Grammatical Evolution Ioannis G. Tsoulos (1), Dimitris Gavrilis (2), Euripidis Glavas

More information

Using HDDT to avoid instances propagation in unbalanced and evolving data streams

Using HDDT to avoid instances propagation in unbalanced and evolving data streams Using HDDT to avoid instances propagation in unbalanced and evolving data streams IJCNN 2014 Andrea Dal Pozzolo, Reid Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V Chawla and Gianluca Bontempi 07/07/2014

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Pairwise Naive Bayes Classifier

Pairwise Naive Bayes Classifier LWA 2006 Pairwise Naive Bayes Classifier Jan-Nikolas Sulzmann Technische Universität Darmstadt D-64289, Darmstadt, Germany sulzmann@ke.informatik.tu-darmstadt.de Abstract Class binarizations are effective

More information

Top-k Parametrized Boost

Top-k Parametrized Boost Top-k Parametrized Boost Turki Turki 1,4, Muhammad Amimul Ihsan 2, Nouf Turki 3, Jie Zhang 4, Usman Roshan 4 1 King Abdulaziz University P.O. Box 80221, Jeddah 21589, Saudi Arabia tturki@kau.edu.sa 2 Department

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington

Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each

More information

A Mixed Strategy for Evolutionary Programming Based on Local Fitness Landscape

A Mixed Strategy for Evolutionary Programming Based on Local Fitness Landscape WCCI 200 IEEE World Congress on Computational Intelligence July, 8-23, 200 - CCIB, Barcelona, Spain CEC IEEE A Mixed Strategy for Evolutionary Programming Based on Local Fitness Landscape Liang Shen and

More information

Online Estimation of Discrete Densities using Classifier Chains

Online Estimation of Discrete Densities using Classifier Chains Online Estimation of Discrete Densities using Classifier Chains Michael Geilke 1 and Eibe Frank 2 and Stefan Kramer 1 1 Johannes Gutenberg-Universtität Mainz, Germany {geilke,kramer}@informatik.uni-mainz.de

More information

Weight Initialization Methods for Multilayer Feedforward. 1

Weight Initialization Methods for Multilayer Feedforward. 1 Weight Initialization Methods for Multilayer Feedforward. 1 Mercedes Fernández-Redondo - Carlos Hernández-Espinosa. Universidad Jaume I, Campus de Riu Sec, Edificio TI, Departamento de Informática, 12080

More information

OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION

OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION Na Lin, Haixin Sun Xiamen University Key Laboratory of Underwater Acoustic Communication and Marine Information Technology, Ministry

More information

Uwe Aickelin and Qi Chen, School of Computer Science and IT, University of Nottingham, NG8 1BB, UK {uxa,

Uwe Aickelin and Qi Chen, School of Computer Science and IT, University of Nottingham, NG8 1BB, UK {uxa, On Affinity Measures for Artificial Immune System Movie Recommenders Proceedings RASC-2004, The 5th International Conference on: Recent Advances in Soft Computing, Nottingham, UK, 2004. Uwe Aickelin and

More information

Blind Source Separation Using Artificial immune system

Blind Source Separation Using Artificial immune system American Journal of Engineering Research (AJER) e-issn : 2320-0847 p-issn : 2320-0936 Volume-03, Issue-02, pp-240-247 www.ajer.org Research Paper Open Access Blind Source Separation Using Artificial immune

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Support Vector Machine via Nonlinear Rescaling Method

Support Vector Machine via Nonlinear Rescaling Method Manuscript Click here to download Manuscript: svm-nrm_3.tex Support Vector Machine via Nonlinear Rescaling Method Roman Polyak Department of SEOR and Department of Mathematical Sciences George Mason University

More information

P leiades: Subspace Clustering and Evaluation

P leiades: Subspace Clustering and Evaluation P leiades: Subspace Clustering and Evaluation Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl Data management and exploration group, RWTH Aachen University, Germany {assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de

More information

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm

Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm Zhenqiu Liu, Dechang Chen 2 Department of Computer Science Wayne State University, Market Street, Frederick, MD 273,

More information

Ensemble Pruning via Individual Contribution Ordering

Ensemble Pruning via Individual Contribution Ordering Ensemble Pruning via Individual Contribution Ordering Zhenyu Lu, Xindong Wu +, Xingquan Zhu @, Josh Bongard Department of Computer Science, University of Vermont, Burlington, VT 05401, USA + School of

More information

Class Prior Estimation from Positive and Unlabeled Data

Class Prior Estimation from Positive and Unlabeled Data IEICE Transactions on Information and Systems, vol.e97-d, no.5, pp.1358 1362, 2014. 1 Class Prior Estimation from Positive and Unlabeled Data Marthinus Christoffel du Plessis Tokyo Institute of Technology,

More information

Machine Learning 2010

Machine Learning 2010 Machine Learning 2010 Concept Learning: The Logical Approach Michael M Richter Email: mrichter@ucalgary.ca 1 - Part 1 Basic Concepts and Representation Languages 2 - Why Concept Learning? Concepts describe

More information

Exact model averaging with naive Bayesian classifiers

Exact model averaging with naive Bayesian classifiers Exact model averaging with naive Bayesian classifiers Denver Dash ddash@sispittedu Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15213 USA Gregory F

More information

Ensemble determination using the TOPSIS decision support system in multi-objective evolutionary neural network classifiers

Ensemble determination using the TOPSIS decision support system in multi-objective evolutionary neural network classifiers Ensemble determination using the TOPSIS decision support system in multi-obective evolutionary neural network classifiers M. Cruz-Ramírez, J.C. Fernández, J. Sánchez-Monedero, F. Fernández-Navarro, C.

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D.

Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules. M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Detecting Anomalous and Exceptional Behaviour on Credit Data by means of Association Rules M. Delgado, M.D. Ruiz, M.J. Martin-Bautista, D. Sánchez 18th September 2013 Detecting Anom and Exc Behaviour on

More information

On Improving the k-means Algorithm to Classify Unclassified Patterns

On Improving the k-means Algorithm to Classify Unclassified Patterns On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,

More information

Necessary Intransitive Likelihood-Ratio Classifiers. Gang Ji, Jeff Bilmes

Necessary Intransitive Likelihood-Ratio Classifiers. Gang Ji, Jeff Bilmes Necessary Intransitive Likelihood-Ratio Classifiers Gang Ji, Jeff Bilmes {gang,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 9895-2500 UW Electrical Engineering UWEE Technical

More information

Microarray Data Analysis: Discovery

Microarray Data Analysis: Discovery Microarray Data Analysis: Discovery Lecture 5 Classification Classification vs. Clustering Classification: Goal: Placing objects (e.g. genes) into meaningful classes Supervised Clustering: Goal: Discover

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information

Pattern-Based Decision Tree Construction

Pattern-Based Decision Tree Construction Pattern-Based Decision Tree Construction Dominique Gay, Nazha Selmaoui ERIM - University of New Caledonia BP R4 F-98851 Nouméa cedex, France {dominique.gay, nazha.selmaoui}@univ-nc.nc Jean-François Boulicaut

More information

Classification Based on Logical Concept Analysis

Classification Based on Logical Concept Analysis Classification Based on Logical Concept Analysis Yan Zhao and Yiyu Yao Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada S4S 0A2 E-mail: {yanzhao, yyao}@cs.uregina.ca Abstract.

More information

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer

More information

Fast heterogeneous boosting

Fast heterogeneous boosting Fast heterogeneous boosting Norbert Jankowski Department of Informatics, Nicolaus Copernicus University, Poland norbert@is.umk.pl Abstract he main goal of this paper is introduction of fast heterogeneous

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Cluster Kernels for Semi-Supervised Learning

Cluster Kernels for Semi-Supervised Learning Cluster Kernels for Semi-Supervised Learning Olivier Chapelle, Jason Weston, Bernhard Scholkopf Max Planck Institute for Biological Cybernetics, 72076 Tiibingen, Germany {first. last} @tuebingen.mpg.de

More information

A Posteriori Corrections to Classification Methods.

A Posteriori Corrections to Classification Methods. A Posteriori Corrections to Classification Methods. Włodzisław Duch and Łukasz Itert Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland; http://www.phys.uni.torun.pl/kmk

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Sparse Support Vector Machines by Kernel Discriminant Analysis

Sparse Support Vector Machines by Kernel Discriminant Analysis Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

A FUZZY NEURAL NETWORK MODEL FOR FORECASTING STOCK PRICE

A FUZZY NEURAL NETWORK MODEL FOR FORECASTING STOCK PRICE A FUZZY NEURAL NETWORK MODEL FOR FORECASTING STOCK PRICE Li Sheng Institute of intelligent information engineering Zheiang University Hangzhou, 3007, P. R. China ABSTRACT In this paper, a neural network-driven

More information

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems c World Scientific Publishing Company

International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems c World Scientific Publishing Company International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems c World Scientific Publishing Company UNSUPERVISED LEARNING OF BAYESIAN NETWORKS VIA ESTIMATION OF DISTRIBUTION ALGORITHMS: AN

More information

Data Mining Part 4. Prediction

Data Mining Part 4. Prediction Data Mining Part 4. Prediction 4.3. Fall 2009 Instructor: Dr. Masoud Yaghini Outline Introduction Bayes Theorem Naïve References Introduction Bayesian classifiers A statistical classifiers Introduction

More information

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA 315 C. K. Lowe-Ma, A. E. Chen, D. Scholl Physical & Environmental Sciences, Research and Advanced Engineering Ford Motor Company, Dearborn, Michigan, USA

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information