Finite Mixture Model of Bounded Semi-Naive Bayesian Networks for Classification

Kaizhu Huang, Irwin King and Michael R. Lyu
Department of Computer Science and Engineering
The Chinese University of Hong Kong
Shatin, New Territories, Hong Kong
{kzhuang,king,lyu}@cse.cuhk.edu.hk

Abstract: The Naive Bayesian (NB) network classifier, a probabilistic model with a strong assumption of conditional independence among features, shows surprisingly competitive prediction performance even when compared with some state-of-the-art classifiers. With a looser assumption of conditional independence, the Semi-Naive Bayesian (SNB) network classifier is superior to the NB classifier when features are combined. However, the structure of the SNB is still strongly constrained, which may generate inaccurate distributions for some datasets. A natural way to improve the SNB is to extend it with a mixture approach. However, traditional SNBs learn their final structure from data with local heuristic procedures, whereas the Expectation-Maximization (EM) method used in mixture models obtains the structure iteratively; it is difficult to integrate a local heuristic into the maximization step, since the resulting procedure may not converge. In this paper we first develop a Bounded Semi-Naive Bayesian network (B-SNB) model, which restricts the number of variables that can be joined into a combined feature. As opposed to the local property of traditional SNB models, our model enjoys a global nature and maintains a polynomial time cost. Overcoming the difficulty of integrating SNBs into a mixture model, we then propose an algorithm to extend the B-SNB into a finite mixture structure, named the Mixture of Bounded Semi-Naive Bayesian networks (MBSNB). We give theoretical derivations, an outline of the algorithm, an analysis of the algorithm, and a set of experiments to demonstrate the usefulness of MBSNB in some classification tasks.

The novel finite MBSNB network shows good speed-up, the ability to converge, and an increase in prediction accuracy.

I. INTRODUCTION

Learning accurate classifiers is one of the basic problems in machine learning and data analysis. In this problem, a pre-classified dataset $D = \{\{\mathbf{x}^1, C^1\}, \{\mathbf{x}^2, C^2\}, \ldots, \{\mathbf{x}^N, C^N\}\}$ is given, where $\mathbf{x}^i = \{A^i_1, A^i_2, \ldots, A^i_n\} \in \mathbb{R}^n$ is an $n$-dimensional training sample in real space and $C^i \in \Omega$ is the class label assigned to $\mathbf{x}^i$ in a category space. The objective is to find a mapping function $F: \mathbb{R}^n \rightarrow \Omega$ satisfying $F(\mathbf{x}^i) = C^i$. Many methods have been proposed to handle this problem, among them statistical neural networks [23], Support Vector Machines [2] [36] and decision trees [29]. The Naive Bayesian network (NB) [8] [20] shows good performance on this problem even when compared with state-of-the-art classifiers such as C4.5. Under the assumption that the attributes are independent given the class label, i.e., $P(A_i, A_j \mid C) = P(A_i \mid C)P(A_j \mid C)$ for $1 \le i \ne j \le n$, NB classifies a specific sample into the class with the largest joint probability $P(C, A_1, A_2, \ldots, A_n)$. Figure 1 shows the graphical structure of NB. This joint probability can be decomposed into a product form based on the independence assumption, and the mapping function is written as follows:

$$c = \arg\max_{C_i} P(C_i, A_1, A_2, \ldots, A_n) = \arg\max_{C_i} P(C_i) \prod_{j=1}^{n} P(A_j \mid C_i) \qquad (1)$$

The success of NB is somewhat unexpected, since its independence assumption is violated in many cases. A typical example is the so-called Xor problem. In this problem, the attributes are two binary variables $A$ and $B$; when $A$ is the same as $B$, the class label $C$ is set to 1, otherwise $C$ is set to 0. Thus attribute $A$ is not independent of $B$ given the class variable $C$, and NB encounters problems in classifying the Xor data. The reason is that $P(C=0)$, $P(C=1)$, $P(A=0 \mid C=0)$, $P(A=1 \mid C=0)$, $P(A=0 \mid C=1)$ and $P(A=1 \mid C=1)$ will all be nearly 0.5 when the data samples are sufficient. It will then be hard to assign any data point to class 0 or 1, since the estimated joint probabilities according to Equation (1) for both classes will be about $0.5 \times 0.5 \times 0.5 = 0.125$.
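As a concrete illustration of the decision rule in Equation (1), the following sketch fits a discrete naive Bayes classifier by frequency counting and classifies with the product form above. It is a minimal sketch, not the authors' implementation; the function names and the small probability floor used for unseen values are our own choices.

```python
import numpy as np
from collections import defaultdict

def train_nb(X, y):
    """Fit class priors P(C) and per-attribute conditionals P(A_j | C)
    by frequency counting on a discrete dataset (NumPy arrays assumed)."""
    classes = np.unique(y)
    priors = {c: float(np.mean(y == c)) for c in classes}
    cond = defaultdict(dict)            # cond[c][j][value] = P(A_j = value | C = c)
    for c in classes:
        Xc = X[y == c]
        for j in range(X.shape[1]):
            vals, counts = np.unique(Xc[:, j], return_counts=True)
            cond[c][j] = dict(zip(vals, counts / counts.sum()))
    return priors, cond

def predict_nb(x, priors, cond):
    """Equation (1): argmax_C P(C) * prod_j P(A_j | C)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= cond[c][j].get(v, 1e-9)   # small floor for unseen values
        scores[c] = score
    return max(scores, key=scores.get)
```

On Xor-like data the per-class scores computed this way come out nearly equal, which is exactly the failure mode described above.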

Fig. 1. Naive Bayesian Network Classifier. $A_i$, $1 \le i \le n$, is an attribute. Each attribute is independent of the other attributes given the class label $C$.

Fig. 2. Semi-Naive Bayesian Network Classifier. $B_i$, $1 \le i \le m$, is a combined attribute. Each combined attribute $B_i$ is independent of the other combined attributes $B_j$, $i \ne j$, given the class label $C$.

Furthermore, so-called Semi-Naive Bayesian networks have been proposed to remedy violations of NB's assumption by joining attributes into several combined attributes, based on a conditional independence assumption among the combined attributes. Performance improvements have been demonstrated in [17] [25]. Figure 2 gives a graphical illustration of a Semi-Naive Bayesian network; here the conditional independence holds among the combined attributes. However, even though SNB loosens the constraint of NB, the relaxation is slight: SNB is still strongly restricted by its conditional independence assumption among the combined attributes.

How to relax the SNB's strong constraint effectively and efficiently has become a popular topic. One possible way to solve this problem is to search for independence or dependence relationships among the attributes rather than imposing a strong assumption on them. This is the main idea of the so-called unrestricted Bayesian network (BN) [28]. Unfortunately, empirical results have shown that searching for an unrestricted BN structure does not give better results than NB. This is partly because unrestricted BN structures are prone to overfitting (overfitting is the phenomenon in which a classifier fits the training dataset perfectly while showing low prediction accuracy on new data). Furthermore, searching for an unrestricted BN structure is in general an NP-complete problem [3].

Another possible way is to upgrade the SNB into a mixture structure, where a hidden variable is used to coordinate its component SNB structures. Mixture approaches have achieved great success in expanding the expressive power of their restricted components and bringing in better performance; the Gaussian Mixture Model [22] is such an example. To the best of our knowledge, in contrast with the popularity of searching for unrestricted BNs to relax the constraints of SNB, no one has done this mixture upgrading work on the SNB structure. This situation is perhaps partly due to the difficulty of inducing mixture parameters for SNB. The core technique for mixture models is the Expectation-Maximization (EM) method, which requires a globally optimal algorithm to obtain the structure or parameters of each component. An optimization algorithm with such a global nature guarantees the maximization step in EM and thus guarantees that the objective function increases iteratively; in this way, the iterative EM process reaches a convergence point. However, traditional SNB algorithms [17] [25] often search their structures with heuristic methods, because of the large search space of SNB. While these heuristic approaches decrease the total time cost, they bring in a local nature as well, which prevents them from being used inside a mixture structure.

In this paper, we first propose a Bounded SNB (B-SNB) model. As opposed to the local heuristics of traditional SNBs, B-SNB is shown to be of a global nature and has a polynomial time cost when given a bound on the maximum number of attributes that can be joined into a combined attribute (we call such a combined attribute a large attribute in our B-SNB model).

We search for the B-SNB structure with a combinatorial optimization method. As far as we know, this is the first purely combinatorial formulation of the learning problem for constructing an SNB structure, and the first algorithm with a global nature for SNB as well. With this B-SNB model, we provide an algorithm to perform the mixture upgrading, based on the EM method. Our experimental results show that this upgrading really brings an increase in prediction rate for some benchmark datasets.

This paper is organized as follows. In Section II, we give a short review of related work. In Section III, we describe our B-SNB model in detail. In Section IV, we discuss the mixture of B-SNB model and give an induction algorithm. In Section V, a complexity analysis is given. Experimental results showing the advantages of our model are presented in Section VI. A discussion is given in Section VII. Finally, we conclude this paper in Section VIII.

II. RELATED WORK

Since the invention of NB, many BN classifiers have been developed, of both restricted and unrestricted types. Among the restricted BN classifiers are Semi-Naive Bayesian network classifiers [17] [25], Selective Naive Bayesian network classifiers [21], the Recursive Bayesian classifier [19], Tree Augmented Naive Bayesian network classifiers [11], Limited Bayesian network classifiers [30] and Adjusted Probability Naive Bayesian network classifiers [37]; for inducing unrestricted Bayesian network classifiers, K2 [5] is a popular algorithm. Since our focus is on restricted BNs, in the following we first give a short review of the restricted ones and then shift the focus to mixture issues.

NB's success triggered the popularity of Bayesian network classifiers, and a number of algorithms have been developed to relax the strong assumption of NB. Kononenko [17] invented a Semi-Naive Bayesian network classifier which attempts to combine values of attributes to overcome the shortcomings of NB. However, this proposition did not show significantly better performance in experiments on two datasets: on one dataset, Kononenko's algorithm has the same performance as NB, and on the other only one percentage point is gained compared with NB. Rather than combining values, Pazzani [25] combined attributes heuristically.

He used either a forward sequential selection and joining strategy or a backward sequential elimination and joining strategy. Because of this time-consuming search process, Pazzani's algorithm is limited to combining only two attributes. Even under this restriction, Pazzani's algorithm shows a large increase in classification accuracy compared with NB. Langley and Sage proposed a Selective Bayesian Classifier to eliminate dependent attributes from NB [21]. Different from Pazzani's approach, they throw away one of two dependent attributes, whereas Pazzani joins the two; this model's performance is shown to be slightly worse than Pazzani's approach. Langley [19] proposed a Recursive Bayesian classifier to adapt NB to some non-linearly separable problems; however, it did not provide a significant benefit on naturally occurring databases [25]. Friedman et al. [11] developed the so-called Tree Augmented Naive Bayesian (TAN) network classifier, which integrates the Chow-Liu tree (CLT) [4] technique with NB. A Chow-Liu tree is a tree structure in which each node (attribute) is assumed to have only one other node (attribute) as its parent; in such a configuration, a globally optimal tree structure can be found. By using the Chow-Liu tree technique, TAN enjoys global optimality under the TAN assumption, i.e., that each attribute has only the class label and one other attribute as its parents. Its performance is demonstrated to be good against state-of-the-art classification methods in machine learning.

All of the above models are strongly restricted. For example, Pazzani's method can only combine two attributes, and in TAN each attribute has only one attribute as its parent node. These restrictions make induction in these models easy and convenient, but at the same time they confine the models' expressive power, which is the main advantage of Bayesian networks. Given that unrestricted Bayesian networks, with their powerful expressive ability, do not bring an increase in prediction accuracy, finding another path to upgrade restricted Bayesian networks becomes important. Meila and Jordan [24] proposed a mixture of trees (MT) model to expand the Chow-Liu tree's expressive power based on the EM algorithm. Their model is empirically shown to outperform other models such as C4.5 and the mixture of factorial distributions in prediction accuracy on a number of datasets.

Fig. 3. Mixture structure of Bayesian Network Classifiers. $BN_i$, $1 \le i \le m$, is a restricted Bayesian network and $Z$ is the choice variable.

Motivated by this work, we ask whether it is possible to expand SNB classifiers into a mixture structure. In Figure 3, $Z$ is a choice variable, which is used to condition the component restricted Bayesian networks. Learning the mixture structure and parameters is usually done with EM algorithms [12] [22]. To maintain the convergence of EM, we need a globally optimal, or at least sub-optimal, algorithm for constructing the component restricted Bayesian networks. This global optimality ensures that the value of the objective function increases in each iteration, which in turn ensures the convergence of the iterative process. In practice, Bayesian network classifiers are usually learned with heuristic methods that search for a good network structure rather than an optimal one (an exception is TAN, whose mixture upgrading can actually be considered as MT), mainly to save computational cost. Thus, in this paper, we first use a combinatorial optimization technique to develop a sub-optimal algorithm with polynomial time cost for the Bounded Semi-Naive Bayesian network. Borrowing combinatorial optimization techniques for learning structure from data was first reported in [14] and [31], which aimed at finding an approximation of the optimal hypergraph structure by combinatorial techniques. Their contribution is mainly theoretical rather than practical, since their approximation ratio to the optimal solution is about 1/324 even under a very strong constraint. Breaking through this bottleneck for mixture upgrading, we then propose a mixture model of Bounded Semi-Naive Bayesian networks, which is shown to outperform NB, CLT and SNB in our experiments.

In fact, Thiesson et al. [34] have proposed a mixture of general Bayesian networks. However, its performance cannot be expected to be very promising, since its components, unrestricted Bayesian network classifiers, have not been shown to be better than NB.

III. BOUNDED SEMI-NAIVE BAYESIAN NETWORK

Our Bounded Semi-Naive Bayesian network model is defined as follows.

Definition 1 (B-SNB Model): Given a set of $N$ independent observations $D = \{\mathbf{x}^1, \ldots, \mathbf{x}^N\}$ and a bound $K$, where $\mathbf{x}^i = (A^i_1, A^i_2, \ldots, A^i_n)$ is an $n$-dimensional vector and $A_1, A_2, \ldots, A_n$ are called variables or attributes, a B-SNB is a maximum-likelihood Bayesian network which satisfies the following conditions:

1) It is composed of $m$ large attributes $B_1, B_2, \ldots, B_m$, $1 \le m \le n$, where each large attribute $B_l = \{A_{l_1}, A_{l_2}, \ldots, A_{l_{k_l}}\}$ is a subset of the attribute set $\{A_1, A_2, \ldots, A_n\}$.

2) There is no overlap among the large attributes and their union forms the attribute set. That is,

$$B_i \cap B_j = \emptyset \ \text{for}\ i \ne j,\ 1 \le i, j \le m, \qquad B_1 \cup B_2 \cup \ldots \cup B_m = \{A_1, A_2, \ldots, A_n\} \qquad (2)$$

3) $B_i$ is independent of $B_j$ for $i \ne j$, namely $P(B_i, B_j) = P(B_i)P(B_j)$ for $i \ne j$, $1 \le i, j \le m$.

4) The cardinality of each large attribute $B_l$ ($1 \le l \le m$) is not greater than $K$.

If each large attribute has the same cardinality $K$, we call the B-SNB a $K$-regular B-SNB. According to the above definition, the distribution encoded by this network, denoted by $S$, can be written as:

$$S(A_1, A_2, \ldots, A_n) = \prod_{j=1}^{m} P(B_j) \qquad (3)$$
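To make Definition 1 concrete, the sketch below represents a B-SNB as a partition of the attribute indices into large attributes and evaluates the product in Equation (3) from empirical frequency tables. This is our own illustrative code, under the assumption that the attributes are discrete; the partition itself is supplied by the structure-learning step described later.

```python
import numpy as np
from collections import Counter

def fit_bsnb(X, partition):
    """Estimate P(B_l) for each large attribute B_l by empirical frequency.
    `partition` is a list of index tuples, e.g. [(0, 2), (1, 3)] for K = 2."""
    N = X.shape[0]
    tables = []
    for block in partition:
        counts = Counter(tuple(row) for row in X[:, list(block)])
        tables.append({config: c / N for config, c in counts.items()})
    return tables

def bsnb_prob(x, partition, tables):
    """Equation (3): S(A_1, ..., A_n) = prod_l P(B_l)."""
    p = 1.0
    for block, table in zip(partition, tables):
        p *= table.get(tuple(x[i] for i in block), 0.0)
    return p
```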

Except for Item 4), the B-SNB model definition is the definition of the traditional SNB. We argue that this constraint on the cardinality is necessary: $K$ cannot be set to a very large value, or it will incur an overfitting problem. It can be verified that when $K$ is set to $n$, the B-SNB distribution becomes the empirical distribution, which is extremely unreliable as a representation of the dataset, especially when the number of samples is not sufficient. We should stress that the above model is described in a supervised way for classification purposes. When using B-SNB for classification tasks, we first partition the pre-classified dataset into sub-datasets by the class label and then train a different B-SNB structure for each class. From this viewpoint, Item 3) is actually a conditional independence statement given the class variable, since the independence is assumed within a sub-dataset with a uniform class label.

A. Learning the Optimal B-SNB from Data

According to the Maximum Likelihood Estimation criterion, the data log-likelihood over the best B-SNB for a given dataset should be larger than over any other B-SNB. The data log-likelihood over a specific B-SNB $S$, denoted by $l_S$, can be written as:

$$l_S = \sum_{i=1}^{N} \log(S(\mathbf{x}^i)) \qquad (4)$$

In general, the optimal B-SNB for a dataset $D$ can be estimated in two steps. The first step is to learn an optimal B-SNB structure from $D$; the second step is to learn the optimal parameters for this structure, where the B-SNB parameters are the probabilities of each large attribute, i.e., $P(B_j)$. It is easy to show that the sample frequency of a large attribute $B_j$ is the maximum-likelihood estimator of the probability $P(B_j)$ when a specific B-SNB structure is given (see the Appendix). Thus the key problem in learning the optimal B-SNB is the structure learning problem, namely how to find the best $m$ large attributes. However, the total number of possible B-SNB structures for an $n$-dimensional dataset is huge. It can be verified that this quantity is

$$\sum_{\{k_1, k_2, \ldots, k_n\} \in G} \frac{n!}{k_1! \, k_2! \cdots k_n!}, \quad \text{where } G = \{\{k_1, k_2, \ldots, k_n\} : k_1 + k_2 + \ldots + k_n = n,\ 0 \le k_i \le K \text{ for } 1 \le i \le n\}.$$

Such a huge search space for the optimal B-SNB makes it nearly impossible to use greedy methods, especially since $K$ has to be set to a small value for reliable probability estimation of the large attributes. This is why the current SNB models [25] [17] have to take heuristic approaches to search for the SNB structure. However, as mentioned in Section I, these heuristic methods bring a local nature together with the lower time cost, which is the main obstacle to the mixture upgrading. Different from the local heuristic methods, we introduce another restriction to reduce the search space, which also maintains the global nature of the solution of the B-SNB problem.

B. Reducing the B-SNB Search Space

To reduce the search space, we first state two lemmas.

Lemma 1: The maximum log-likelihood of a specific B-SNB $S$ for a dataset $D$, denoted by $l_S$, can be written in the following form:

$$l_S = -\sum_{i=1}^{m} \hat{H}(B_i), \qquad (5)$$

where $\hat{H}(B_i)$ is the entropy of the large attribute $B_i$ under the empirical distribution of $D$. The entropy of a $k$-large attribute $\{X_1, X_2, \ldots, X_k\}$ is defined as:

$$\hat{H}(X_1, X_2, \ldots, X_k) = -\sum_{X_1, \ldots, X_k} \hat{P}(x_1, \ldots, x_k) \log \hat{P}(x_1, \ldots, x_k), \qquad (6)$$

where the lower-case $x_i$ represents an assignment of a value to the variable $X_i$, $1 \le i \le k$, and $\hat{P}(x_1, \ldots, x_k)$ is the empirical probability that the large attribute $(X_1, X_2, \ldots, X_k)$ takes the value $(x_1, x_2, \ldots, x_k)$.

Lemma 2: Let $\mu$ and $\mu'$ be two B-SNBs over a dataset $D$. If $\mu'$ is coarser than $\mu$, then $\mu'$ provides a better approximation than $\mu$ over $D$. Coarseness is defined as follows: if $\mu'$ can be obtained by combining large attributes of $\mu$ without splitting any large attribute of $\mu$, then $\mu'$ is coarser than $\mu$.
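Lemma 1 reduces structure scoring to summing empirical entropies of the large attributes. The following sketch (our own, assuming discrete attributes stored in a NumPy array) computes Equation (6) for one large attribute and the score of Equation (5) for a candidate partition; constant factors do not affect the argmax over structures.

```python
import numpy as np
from collections import Counter

def empirical_entropy(X, block):
    """Entropy of a large attribute (a list of column indices) under the
    empirical distribution, Equation (6)."""
    counts = Counter(tuple(row) for row in X[:, list(block)])
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -float(np.sum(p * np.log(p)))

def bsnb_score(X, partition):
    """Lemma 1: the maximized log-likelihood of a B-SNB structure is the
    negative sum of the large-attribute entropies, Equation (5)."""
    return -sum(empirical_entropy(X, block) for block in partition)
```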

Lemma 1 states that when the B-SNB structure is given, the maximum-likelihood estimator of the parameters is the sample frequency of the large attributes, and the log-likelihood is the negative sum of the entropies of the large attributes. Lemma 2 states that a higher-order approximation is superior to a lower-order one under a given bound $K$ on the cardinality of the large attributes: within the bound $K$ used to keep the probability estimation reliable, a higher-order approximation keeps more information about the dataset than a lower-order one. For example, $P(a,b,c)P(d,e,f)$ is more accurate than $P(a,b)P(c)P(d,e,f)$ for approximating $P(a,b,c,d,e,f)$, when each sub-item probability can be estimated reliably. Proofs of Lemma 1 and Lemma 2 are given in the Appendix.

According to Lemma 2, given a bound $K$ that guarantees reliable probability estimation of the large attributes, we should use large attributes with as high a cardinality as possible; otherwise it is likely that some large attributes with small cardinalities can be combined into a new large attribute within the bound $K$, and the new SNB will be coarser than the old one. From this viewpoint, we add the constraint that the cardinality of each large attribute should be exactly the bound $K$. This constraint is reasonable, since no SNB coarser than a $K$-regular B-SNB exists when the bound is $K$. It should be noted that a $K$-regular B-SNB is not always better than a non-$K$-regular SNB whose largest cardinality is no more than $K$, since obviously some non-$K$-regular SNBs cannot be combined into a $K$-regular SNB. However, the number of possible B-SNBs after this restriction is about $n!/(K!)^{[n/K]}$, a much smaller number than the original one (here $[x]$ means rounding $x$ to the nearest integer).

To show that the reduction is very useful, we plot the number of possible B-SNBs ($K = 2$) in the search space without and with the reduction in Figure 4; the bottom subfigure of Figure 4 also plots the ratio of the two search-space sizes. Two observations can be made from this figure. First, the search space is greatly reduced by our method. Second, although the search space is much reduced, it is still expensive to perform a greedy search. Our method is therefore to transform the B-SNB combinatorial optimization problem into an Integer Programming (IP) problem and then find the optimal solution. We further approximate the solution of the IP with a Linear Programming (LP) method, which can be solved at a polynomial computational cost.
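To get a feel for the reduction, one can count the candidate structures by brute force for small $n$. The sketch below (our own) counts unordered partitions of $n$ attributes into large attributes of cardinality at most $K$ versus exactly $K$; the absolute numbers need not match the paper's formulas, which count structures slightly differently, but the gap between the two counts illustrates the reduction plotted in Figure 4.

```python
from math import comb

def count_partitions(n, K, exact=False):
    """Count partitions of a set of n attributes into large attributes of
    cardinality at most K (exact=False) or exactly K (exact=True)."""
    if n == 0:
        return 1
    if exact:
        sizes = [K] if n >= K else []   # returns 0 when n is not a multiple of K
    else:
        sizes = range(1, min(K, n) + 1)
    # fix the block containing the first attribute, choose its remaining members
    return sum(comb(n - 1, k - 1) * count_partitions(n - k, K, exact) for k in sizes)

if __name__ == "__main__":
    for n in (6, 10, 14):
        print(n, count_partitions(n, 2), count_partitions(n, 2, exact=True))
```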

Fig. 4. Search space of the B-SNB for $K = 2$. (a) The original search space. (b) The search space after the reduction. (c) The ratio of the search space without reduction to that with reduction. The vertical axes of the three subfigures are on a log scale.

C. Transforming into an Integer Programming Problem

We first state our B-SNB optimization problem under the Maximum Likelihood Estimation criterion when the cardinality of each large attribute is constrained to be exactly the bound $K$.

B-SNB Optimization Problem: From the attribute set, find $m = [n/K]$ subsets of cardinality $K$, satisfying the B-SNB conditions, that maximize the log-likelihood in Equation (5).

We write this B-SNB optimization problem as the following IP problem:

$$\min \sum_{V_1, V_2, \ldots, V_K} x_{V_1, V_2, \ldots, V_K} \, \hat{H}(V_1, V_2, \ldots, V_K), \qquad (7)$$

under the constraints:

$$\sum_{V_1, V_2, \ldots, V_{K-1}} x_{V_1, V_2, \ldots, V_K} = 1 \quad (\forall\, V_K), \qquad (8)$$

$$x_{V_1, V_2, \ldots, V_K} \in \{0, 1\}. \qquad (9)$$

Here $V_1, V_2, \ldots, V_K$ represent any $K$ attributes. Equations (8) and (9) state that each attribute belongs to exactly one large attribute, i.e., when it occurs in one large attribute it must not occur in another, since there is no overlap among the large attributes. We approximate the solution of the IP via the Linear Programming (LP) method, which can be solved in polynomial time. By relaxing $x_{V_1, V_2, \ldots, V_K} \in \{0, 1\}$ into $0 \le x_{V_1, V_2, \ldots, V_K} \le 1$, the IP problem is transformed into an LP problem; a rounding procedure is then applied to the LP solution to obtain an integer solution. It should be stressed that solving the IP problem directly is infeasible: IP problems with as few as 40 variables can be beyond the abilities of even the most sophisticated computers [35]. Let $X$ denote the set of all possible large attributes $\{V_1, V_2, \ldots, V_K\}$. The rounding scheme is as follows:

Rounding Scheme:
1) Set the maximum $x_{V_1, V_2, \ldots, V_K}$ over the large attributes in $X$ to 1, record its subscript as a large attribute $\{V_{M_1}, V_{M_2}, \ldots, V_{M_K}\}$, and delete $\{V_{M_1}, V_{M_2}, \ldots, V_{M_K}\}$ from $X$.
2) Set to 0 the coefficients $x_{V_1, V_2, \ldots, V_K}$ of all large attributes that overlap with $\{V_{M_1}, V_{M_2}, \ldots, V_{M_K}\}$, and delete these large attributes from $X$.
3) Go to 1) until all the attributes are covered.

Approximating the IP solution by LP may reduce the accuracy of the SNB, but it decreases the computational cost. As shown in [13] for two real-world datasets, the LP solution is a satisfactory approximation of the IP problem.
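A minimal sketch of the LP relaxation plus rounding is given below, using scipy.optimize.linprog as the LP solver and the empirical_entropy helper from the earlier sketch as the cost of each candidate $K$-subset. It is our own illustration of Equations (7)-(9) and the rounding scheme, assumes $n$ is divisible by $K$ (Subsection III-D treats the remainder), and makes no claim to match the authors' Matlab implementation.

```python
import numpy as np
from itertools import combinations
from scipy.optimize import linprog

def learn_bsnb_structure(X, K):
    """Pick disjoint K-subsets (large attributes) covering all attributes by
    solving the LP relaxation of Equations (7)-(9) and rounding the result."""
    n = X.shape[1]
    cands = list(combinations(range(n), K))                  # candidate large attributes
    cost = np.array([empirical_entropy(X, c) for c in cands])  # objective of Eq. (7)
    # constraint (8): each attribute is covered with total weight 1
    A_eq = np.zeros((n, len(cands)))
    for j, c in enumerate(cands):
        A_eq[list(c), j] = 1.0
    res = linprog(cost, A_eq=A_eq, b_eq=np.ones(n), bounds=(0, 1), method="highs")
    x = res.x.copy()
    # rounding scheme: take the largest x, fix its subset, zero out overlaps, repeat
    chosen, covered = [], set()
    while len(covered) < n and np.any(x > 0):
        j = int(np.argmax(x))
        block = cands[j]
        chosen.append(block)
        covered.update(block)
        for i, c in enumerate(cands):
            if set(c) & set(block):
                x[i] = 0.0
    return chosen
```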

D. When $n/K$ Is Not an Integer

Problems arise when $n$ cannot be divided by $K$ exactly, i.e., $(n \bmod K) = l \ne 0$. In this case we cannot find a $K$-regular B-SNB, since one large attribute will always have only $l$ attributes. To solve this problem, we propose two modified versions of the previous algorithm.

The first modification is to delete $l$ attributes that are least important for classification; the classification result is then based on the remaining $n - l$ attributes. The least important $l$-cardinality large attribute is the one with the maximum entropy. A simple example illustrates this deletion. Assume $l = 1$ and each attribute is a 0-1 variable. The least important attribute is clearly the one with almost equal frequencies for 0 and 1, i.e., $P(A_i = 0) \approx 0.5$ and $P(A_i = 1) \approx 0.5$, since this attribute is not very helpful for discrimination; such an attribute is exactly the one with the maximum entropy. In summary, the first modification approach is as follows:

1) Assume $(n \bmod K) = l \ne 0$. Among all $l$-cardinality subsets of the attribute set, select the one with the maximum entropy. Denote this $l$-subset by $B^l_{\max} = \{A_{\max_1}, A_{\max_2}, \ldots, A_{\max_l}\}$ (the superscript $l$ in $B^l_{\max}$ indicates that this is an $l$-large attribute). Let $W = \{A_1, A_2, \ldots, A_n\} \setminus B^l_{\max}$.
2) Perform the optimization on the attribute set $W$ as shown in Subsection III-C.

Denote the resulting $K$-cardinality large attributes found by this modification by $B^K_i$, $1 \le i \le [n/K]$. The classification mapping function is given by:

$$c = \arg\max_{C_j} P(C_j) \prod_{i=1}^{[n/K]} P_{C_j}(B^K_i) \qquad (10)$$

The second approach first extracts an $l$-large attribute from the attribute set and then searches for a regular B-SNB over the remaining $n - l$ attributes. Different from the first approach, the final classification decision is still based on all $n$ attributes. Since the $l$-large attribute is involved in classification, it should be the most important $l$-cardinality large attribute for classification, which is the one with the minimum entropy: from Lemma 1 we know that to maximize the log-likelihood, the entropy of every large attribute should be as small as possible. That is why we first select the $l$-cardinality large attribute with the minimum entropy among all $l$-cardinality large attributes. The algorithm is as follows:

1) Assume $(n \bmod K) = l \ne 0$. Among all $l$-subsets of the attribute set, select the one with the minimum entropy. Denote this $l$-subset by $B^l_{\min} = \{A_{\min_1}, A_{\min_2}, \ldots, A_{\min_l}\}$ (the superscript $l$ in $B^l_{\min}$ indicates that this is an $l$-large attribute). Let $W = \{A_1, A_2, \ldots, A_n\} \setminus B^l_{\min}$.
2) Perform the optimization on the attribute set $W$ as shown in Subsection III-C.

The final classification mapping function is given by:

$$c = \arg\max_{C_j} P(C_j) \, P_{C_j}(B^l_{\min}) \prod_{i=1}^{[n/K]} P_{C_j}(B^K_i) \qquad (11)$$

The advantage of the first modification approach is that the structure obtained after deleting the $l$-cardinality large attribute has a global nature over the remaining $n - l$ attributes, which greatly benefits the convergence of the mixture upgrading. For the second approach, with the $l$-large attribute involved in the classification decision rule, the resulting structure has a local property, since we must first choose this $l$-large attribute. On the other hand, keeping all the attributes in the decision rule retains the corresponding attribute information, which may help classification. Considering the characteristics of these two approaches, in this paper we use the second approach when constructing a separate B-SNB classifier and the first approach when constructing the B-SNBs inside the mixture structure.

IV. THE MIXTURE OF BOUNDED SEMI-NAIVE BAYESIAN NETWORK

As mentioned in the previous sections, the key to the mixture upgrading of SNB networks lies in the algorithm used to optimize the SNB problem. Different from traditional local heuristic SNB approaches, we perform direct combinatorial optimization on the network, which gives the B-SNB a global nature and thus makes it possible to upgrade SNB into a finite mixture structure. In this section, we first define the Mixture of Bounded Semi-Naive Bayesian networks (MBSNB) model, then state the optimization problem of the MBSNB model, and finally give a theoretical derivation leading to an optimization algorithm for this problem under the EM framework [18].

Definition 2: The Mixture of Bounded Semi-Naive Bayesian networks model is defined as a distribution of the form:

$$Q(\mathbf{x}) = \sum_{k=1}^{r} \lambda_k S_k(\mathbf{x}) \qquad (12)$$

where $\lambda_k \ge 0$, $k = 1, \ldots, r$, $\sum_{k=1}^{r} \lambda_k = 1$, and $r$ is the number of components in the mixture. $S_k$ represents the distribution of the $k$th component $K$-Bounded Semi-Naive network, and $\lambda_k$ is called the component coefficient.

Optimization Problem of MBSNB: Given a set of $N$ independent observations $D = \{\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^N\}$ and a bound $K$, find the mixture of $K$-Bounded-SNB model $Q^*$ which satisfies

$$Q^* = \arg\max_{Q} \sum_{i=1}^{N} \log Q(\mathbf{x}^i). \qquad (13)$$

We use a derivation similar to that of [24] to find the solution of this optimization problem. According to the EM algorithm, finding the solution of (13) is equivalent to maximizing the following complete log-likelihood function:

$$l_c(\mathbf{x}^{1,\ldots,N}, z^{1,\ldots,N} \mid Q) = \sum_{i=1}^{N} \sum_{k=1}^{r} \log \big(\lambda_k S_k(\mathbf{x}^i)\big)^{\delta_{k,z^i}} = \sum_{i=1}^{N} \sum_{k=1}^{r} \delta_{k,z^i} \big(\log \lambda_k + \log S_k(\mathbf{x}^i)\big) \qquad (14)$$

where $z$ is the choice variable, which can be seen as a hidden variable determining the choice of the component Semi-Naive structure, and $\delta_{k,z^i}$ equals 1 when $z^i$ takes the $k$th value of the choice variable and 0 otherwise. We use the EM algorithm to find the solution of Equation (14). First, taking the expectation of Equation (14) with respect to $z$, we obtain

$$E[l_c(\mathbf{x}^{1,\ldots,N}, z^{1,\ldots,N} \mid Q)] = \sum_{i=1}^{N} \sum_{k=1}^{r} E(\delta_{k,z^i} \mid D) \big(\log \lambda_k + \log S_k(\mathbf{x}^i)\big) \qquad (15)$$

where $E(\delta_{k,z^i} \mid D)$ is the posterior probability given the $i$th observation, which can be calculated as:

$$E(\delta_{k,z^i} \mid D) = P(z^i \mid V = \mathbf{x}^i) = \frac{\lambda_k S_k(\mathbf{x}^i)}{\sum_{k'} \lambda_{k'} S_{k'}(\mathbf{x}^i)}. \qquad (16)$$

We define

$$\gamma_k(i) = E(\delta_{k,z^i} \mid D), \quad \Gamma_k = \sum_{i=1}^{N} \gamma_k(i), \quad P_k(\mathbf{x}^i) = \frac{\gamma_k(i)}{\Gamma_k}.$$

Thus we obtain the expectation:

$$E[l_c(\mathbf{x}^{1,\ldots,N}, z^{1,\ldots,N} \mid Q)] = \sum_{k=1}^{r} \Gamma_k \log \lambda_k + \sum_{k=1}^{r} \Gamma_k \sum_{i=1}^{N} P_k(\mathbf{x}^i) \log S_k(\mathbf{x}^i). \qquad (17)$$
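The E step of Equation (16), together with the quantities $\Gamma_k$ and $P_k(\mathbf{x}^i)$ used in Equation (17), can be computed as in the sketch below. This is our own illustration; each component is assumed to be a callable returning $S_k(\mathbf{x})$, for example the bsnb_prob evaluation from Section III.

```python
import numpy as np

def e_step(X, components, lambdas):
    """E step: responsibilities E[delta_{k,z^i} | D] of Equation (16), together
    with Gamma_k = sum_i gamma_k(i) and P_k(x^i) = gamma_k(i) / Gamma_k."""
    N, r = len(X), len(components)
    resp = np.zeros((N, r))
    for k in range(r):
        resp[:, k] = lambdas[k] * np.array([components[k](x) for x in X])
    resp /= resp.sum(axis=1, keepdims=True)   # gamma_k(i)
    Gamma = resp.sum(axis=0)                  # Gamma_k
    P = resp / Gamma                          # P_k(x^i), per-component sample weights
    return resp, Gamma, P
```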

We then perform the maximization step on (17) with respect to the parameters. The first part of (17) is easily maximized by the Lagrange method under the constraint $\sum_{k=1}^{r} \lambda_k = 1$, yielding:

$$\lambda_k = \frac{\Gamma_k}{N}, \quad k = 1, \ldots, r. \qquad (18)$$

If we consider $P_k(\mathbf{x}^i)$ as the probability of each observation under the $k$th component B-SNB, the latter part of Equation (17) is in fact a B-SNB network optimization problem, which can be solved by the algorithm proposed in Section III. It can also be seen here that the B-SNB solution must have a global nature; otherwise it is quite possible that this maximization will not exceed the maximum value of the previous M step, and the EM algorithm will not converge to a fixed point. Our B-SNB solution, based on direct combinatorial optimization, gives the resulting B-SNB structure a global nature, even though the LP approximation reduces this optimality slightly. Therefore the finite mixture upgrading based on our B-SNB model has good convergence performance; we demonstrate this in the experiments as well. The optimization process is shown as Algorithm 1.

V. COMPUTATIONAL COMPLEXITY ANALYSIS

In this section, we conduct a simple computational complexity analysis, first for B-SNB and then for MBSNB. Strong empirical evidence shows that classical LP optimization methods such as the simplex method take only $O(w)$ iterations to find an optimal solution with $w$ equality constraints [1], and each iteration costs $O(wN)$ arithmetic operations, where $N$ here denotes the number of variables to be solved. For our LP problem of Equation (7) in B-SNB, there are in total $N = C_n^K$ variables $x_{V_1, V_2, \ldots, V_K}$ to be solved, and $w$ equals $n$. Accordingly, the computational cost of our B-SNB optimization process is about $n^2 C_n^K$. On the other hand, $q^K C_n^K$ operations are needed to calculate the $K$-variable entropies in Equation (7), where $q$ is the maximum number of values a variable can take. Accordingly, the total cost of finding the optimal B-SNB is $(n^2 + q^K) C_n^K$, which is an $O(n^{K+2})$ time cost when $K \ll n$.

Algorithm 1: Mixture of Semi-Naive Bayesian networks (Mix-Semi)

  Algorithm Mix-Semi(D, Q0)
    input: dataset D = {x^1, x^2, ..., x^N}, model Q0 = {r, S_k, lambda_k, k = 1, ..., r}
    repeat
      E step: compute gamma_k(i) and P_k(x^i) for k = 1, ..., r and i = 1, ..., N
      M step:
        for k = 1, ..., r do
          lambda_k <- Gamma_k / N
          S_k <- B-SNB(P_k)
        endfor
    until convergence
    output: model Q = {r, S_k, lambda_k, k = 1, ..., r}

In the traditional SNB [17], by contrast, the computational cost is exponential: the number of iterations over the training dataset is approximately equal to the number of values of all attributes. For a simple example in which every variable has $q$ values, the combination cost is $q^n$, an exponential cost. As the variable dimension grows, the cost difference between B-SNB and Kononenko's SNB becomes larger and larger. On the other hand, the approach of Pazzani [25] is impractical for combining even three attributes, although it has a reported cost of $O(n^3)$ when combining two attributes; although it would be possible to consider joining three (or more) attributes, the computational complexity makes it impractical for most databases [25]. The accuracy of Pazzani's SNB may therefore be limited in this sense. Table I summarizes the above analysis; here "Max" is the maximum number of variables that can be involved in a large attribute. The total computational cost of the mixture of B-SNBs, MBSNB, is thus $O(r n^{K+2})$ per EM iteration.
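A Python skeleton of Algorithm 1, reusing the e_step sketch above, might look as follows. The helper learn_bsnb(X, K, weights) is hypothetical: it stands in for the weighted B-SNB learner of Section III (structure search plus frequency estimation under the weights $P_k$) and is not part of the paper's pseudocode.

```python
import numpy as np

def mix_semi(X, r, K, n_iter=20, seed=0):
    """Skeleton of Algorithm 1 (Mix-Semi). learn_bsnb(X, K, weights) is a
    hypothetical helper returning a callable S_k(x) fitted to weighted data."""
    rng = np.random.default_rng(seed)
    N = len(X)
    lambdas = np.full(r, 1.0 / r)
    # random responsibilities play the role of the initial model Q_0
    P = rng.dirichlet(np.ones(r), size=N)
    components = [learn_bsnb(X, K, weights=P[:, k]) for k in range(r)]
    for _ in range(n_iter):                               # "repeat ... until convergence"
        resp, Gamma, P = e_step(X, components, lambdas)   # E step
        lambdas = Gamma / N                               # Equation (18)
        components = [learn_bsnb(X, K, weights=P[:, k])   # M step: refit each
                      for k in range(r)]                  # component B-SNB
    return components, lambdas
```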

TABLE I
COMPUTATION COST

  Method      Cost           Max
  B-SNB       O(n^{K+2})     K
  Kononenko   O(q^n)         N - 1
  Pazzani     O(n^3)         2

VI. EXPERIMENTS

In this section, we first propose a set of pre-processing methods to handle some important implementation issues for our models. We then describe our experimental setup in Subsection VI-B. In Subsection VI-C, we demonstrate that our models are superior to other approaches in prediction accuracy. Finally, in Subsection VI-D, we show that our mixture model has good convergence performance.

A. Pre-Processing Methods

To implement our algorithms, four main issues have to be handled: numeric attributes, zero counts, missing values and parameter selection. We deal with these issues as follows.

Numeric attributes are discretized into discrete attributes, since our algorithms can only handle discrete attributes. We discretize each numeric attribute into five equal intervals. Although this approach performs slightly less accurately than a more informed one [7], it is sufficient for evaluating the main approaches in this paper.

Zero counts occur when a given class and attribute value never co-occur in the training dataset. This can cause problems when estimating the probabilities in the classification mapping function from frequencies. For example, in NB's classification mapping function, if some value of an attribute $A_k$ never occurs, the estimated $P(A_k \mid C)$ will be zero; consequently, the joint probability on the right-hand side of Equation (1) will be 0, whatever the other terms $P(A_i \mid C)$ are. This is an especially serious problem when implementing the B-SNB and MBSNB algorithms, since it is more likely that some values of a large attribute never occur. This happens in two situations: first, some configurations of a large attribute may never occur; second, some values of a given attribute may never occur.

To tackle the first issue, once we find such a configuration of a given large attribute, we degrade the related probability calculation of that large attribute to the product of its single-attribute probabilities. For example, let $B = \{A_{l_1}, A_{l_2}, \ldots, A_{l_k}\}$ be a large attribute whose configuration $\{A_{l_1} = 0, A_{l_2} = 1, \ldots, A_{l_k} = 1\}$ never occurs. We replace the probability calculation of $P(A_{l_1}, A_{l_2}, \ldots, A_{l_k} \mid C)$ with $\prod_{i=1}^{k} P(A_{l_i} \mid C)$. In this way, on sparse databases, the B-SNB can at least maintain the performance of NB, since it degrades towards the NB classifier when many configurations of large attributes are absent. To tackle the second issue, we use the popular Laplace correction method [27]. The corrected empirical estimate of $P(A_j = a_{jk} \mid C_i)$ is $(n_{ijk} + f)/(n_i + f n_j)$ instead of the uncorrected $n_{ijk}/n_i$, where $a_{jk}$ is a value of attribute $A_j$, $n_{ijk}$ is the number of times class $C_i$ and value $a_{jk}$ of attribute $A_j$ occur together, $n_i$ is the number of observations with class label $C_i$, and $n_j$ is the number of values of attribute $A_j$. As in [6] [16], we take $f = 1/N$, where $N$ is the number of samples in the training database. The correction for a large attribute is similar.

Missing values are simply treated as another discrete value of the corresponding attribute. This is reasonable for datasets in which missing values are those that cannot be determined as any of the values of the specific attribute.

Parameter selection must be done for the cardinality $K$ of the large attributes in the B-SNB and MBSNB approaches. Small values are usually chosen for $K$, mainly because it is not reliable to estimate the probability of a large attribute by its empirical probability when its cardinality is too big; as mentioned above, a large attribute with large cardinality is more likely to encounter zero-count problems. In our experiments, we set $K$ to 2 or 3.
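The two zero-count fixes described above can be written down directly; the sketch below is our own illustration, with hypothetical argument names, of the Laplace-corrected estimate and of the fallback that degrades an unseen large-attribute configuration to a product of single-attribute probabilities.

```python
import numpy as np

def laplace_estimate(n_ijk, n_i, n_j, N):
    """Laplace-corrected estimate of P(A_j = a_jk | C_i):
    (n_ijk + f) / (n_i + f * n_j), with f = 1/N as in the paper."""
    f = 1.0 / N
    return (n_ijk + f) / (n_i + f * n_j)

def large_attribute_prob(config, block, joint_table, marginal_tables):
    """Fallback for an unseen configuration of a large attribute: degrade to
    the product of the single-attribute conditional probabilities."""
    if config in joint_table:
        return joint_table[config]
    return float(np.prod([marginal_tables[i][v] for i, v in zip(block, config)]))
```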

B. Experimental Setup

1) Datasets: To evaluate the performance of our B-SNB and MBSNB models, we conduct a series of experiments on 7 databases, 6 of which come from the UCI Machine Learning Repository [26]. The remaining dataset, called Xor, is generated synthetically: its class variable is determined by the first two binary attributes, and four other binary attributes are created randomly. Table II gives short descriptions of the datasets used in this paper; detailed information about them can be found in [26]. To examine the performance of our approaches, we use 5-fold Cross Validation (CV) [15] for the small and medium-sized datasets.

TABLE II
DESCRIPTION OF THE DATASETS USED IN THE EXPERIMENTS

  Dataset      Variables   Classes   Train   Test
  Xor          6           2         2000    CV-5
  Vote         15          2         435     CV-5
  Tic-tac-toe  9           2         958     CV-5
  Vehicle      18          4         946     CV-5
  Segment      19          7         2310    30%
  Post         8           3         90      CV-5
  Iris         4           3         150     CV-5

2) Experimental Environment: Platform: Windows 2000; software: Matlab 6.1; hardware: 1.4 GHz Pentium 3 processor, 512 MB RAM.

C. Prediction Accuracy

We train an MBSNB model $Q_{C_i}$ for each class $C_i$ of every dataset and use the Bayes formula

$$c(\mathbf{x}) = \arg\max_{C_i} P(C_i) Q_{C_i}(\mathbf{x}) \qquad (19)$$

to classify a new instance $\mathbf{x}$.
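Given one trained MBSNB density per class, Equation (19) is a one-line decision rule; the short sketch below (our own, with class_models assumed to be a dict of per-class callables $Q_{C_i}$) makes it explicit.

```python
def classify(x, class_priors, class_models):
    """Equation (19): c(x) = argmax_i P(C_i) * Q_{C_i}(x)."""
    scores = {c: class_priors[c] * class_models[c](x) for c in class_priors}
    return max(scores, key=scores.get)
```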

Our initial experiments show that our B-SNB model outperforms the Naive Bayesian classifier; the comparisons are shown in Figure 5. The B-SNB model can be further improved by the mixture upgrading, as shown in Figure 6. To evaluate our mixture model's performance, we also compare it with two other competitive methods, CLT and C4.5. The experiments show that the mixture model of B-SNB performs better than CLT, as shown in Figure 7, and outperforms C4.5 slightly, as shown in Figure 8.

Fig. 5. Scatter plot comparing NB and B-SNB (classification error, %). Points below the diagonal line correspond to datasets where B-SNB performs better, and points above the diagonal line correspond to datasets where NB performs better. 2-BSNB means that $K$, the cardinality of the large attributes, is set to 2, and similarly for 3-BSNB.

Table III summarizes the prediction results of the main approaches in this paper, and the average accuracy of each approach is shown in Figure 9. Although MBSNB does not have the best prediction performance on every dataset, its overall performance is the best, as shown in Figures 6, 7 and 8. It is interesting to notice that on the Post dataset, 2-BSNB performs significantly better than its 2-MBSNB upgrading. This may be due to two possible reasons. One is the sparsity of this dataset: there are only 90 samples for 8 attributes and 3 classes, which makes the probability estimation of the large attributes, even the 2-cardinality ones, unreliable and may influence the result of the mixture model. The other possible reason is that 2-BSNB is a special case of 2-MBSNB; the performance decrease may imply that the dataset is not a multi-modal one. This problem is actually one of how to select the number of mixture components; a further discussion is given in Section VII.

Fig. 6. Scatter plot comparing B-SNB and MBSNB (classification error, %). Points below the diagonal line correspond to datasets where MBSNB performs better, and points above the diagonal line correspond to datasets where B-SNB performs better. 2-BSNB and 2-MBSNB mean that $K$, the cardinality of the large attributes, is set to 2, and similarly for 3-BSNB and 3-MBSNB.

TABLE III
PREDICTION ACCURACY OF THE PRIMARY APPROACHES IN THIS PAPER (%)

  Dataset      NB      CLT     2-BSNB   3-BSNB   C4.5    2-MBSNB   3-MBSNB
  Xor          54.5    100     100      99.5     100     99.50     99.50
  Tic-tac-toe  70.77   73.17   72.65    78.39    84.84   88.33     79.38
  Vote         90.11   91.26   92.40    92.64    94.18   93.10     94.00
  Vehicle      54.26   59.27   58.21    69.15    57.40   58.51     63.64
  Segment      88.29   91.33   91.90    89.16    90.61   91.47     90.90
  Post         61.11   62.22   63.44    57.76    57.78   58.89     61.12
  Iris         92.00   92.67   94.00    92.00    95.34   94.00     92.00
  Average      73.01   81.42   81.80    82.66    82.88   83.35     82.96

Fig. 7. Scatter plot comparing CLT and MBSNB (classification error, %). Points below the diagonal line correspond to datasets where MBSNB performs better, and points above the diagonal line correspond to datasets where CLT performs better. 2-MBSNB means that $K$, the cardinality of the large attributes, is set to 2, and similarly for 3-MBSNB.

D. Convergence Performance

In this subsection, we examine the convergence performance of our mixture model. As discussed in the previous sections, the B-SNB optimization method greatly influences the mixture model's convergence. Figures 10 and 11 show the convergence curves for 4 datasets used in our experiments. Except for the Vote dataset, the three other datasets show good convergence curves. The zigzag on Vote is caused by the LP approximation of the IP solution, which slightly reduces the global optimality of the final B-SNB model. However, even though some zigzags occur, the overall trend of the Vote curve is towards a convergence point. This implies that the approximation method is successful and that our B-SNB model maintains a good global nature.

VII. DISCUSSION

There are two main issues to discuss here, which are also topics of our future work. The first is the IP approximation based on the LP technique; the second is the choice of the number of components for the mixture structure.

Fig. 8. Scatter plot comparing C4.5 and MBSNB (classification error, %). Points below the diagonal line correspond to datasets where MBSNB performs better, and points above the diagonal line correspond to datasets where C4.5 performs better. 2-MBSNB means that $K$, the cardinality of the large attributes, is set to 2, and similarly for 3-MBSNB.

Fig. 9. Average error of the main approaches in this paper.

The LP approximation method for finding the IP solution may reduce the optimality of the final B-SNB structure. However, compared with the infeasibility of solving the IP directly, this approximation gives us a polynomial time cost. Although a hard analysis of the approximation level of the LP is not available for the time being, our experiments show that the approximation maintains the convergence of the finite mixture upgrading, which implies a good approximation level for our approach.

Although in this paper we mainly discuss a finite mixture of B-SNBs, this does not mean that choosing the number of components is unimportant. This is an open problem for mixture models, and many researchers are working on it [9] [10] [32] [33].

Fig. 10. Convergence performance on the (a) Tic-tac-toe and (b) Vote datasets (normalized log-likelihood versus iteration, one curve per class).

Fig. 11. Convergence performance on the (a) Vehicle and (b) Segment datasets (normalized log-likelihood versus iteration, one curve per class).

In this paper, we set the number of components based on intuitive considerations. For databases with more attributes and a large number of training samples, such as the Tic-tac-toe, Vote, Vehicle and Segment datasets, we simply set the component number to 10; for databases with a small number of attributes or a small number of training samples, such as the Xor, Post and Iris datasets, we set the component number to the smaller value 5. This is partly out of consideration for resistance to overfitting: in a small database, a large number of components is more likely to incur an overfitting problem. Obviously, the number of components is one of the factors that influence MBSNB's performance. As mentioned in Subsection VI-C, the B-SNB model is the special case of MBSNB with the component number set to 1; if a dataset is single-modal, a mixture model will not be a suitable model for it. How to select the component number is part of our future work.

VIII. CONCLUSION

The Semi-Naive Bayesian network classifier, a restricted probabilistic model, performs well in expanding the Naive Bayesian classifier, which is itself competitive with state-of-the-art classifiers such as C4.5. Mixture models have demonstrated success in representing accurate distributions in real applications, so it is promising to upgrade the Semi-Naive Bayesian network into a mixture model. However, because of its extraordinarily large search space, the traditional Semi-Naive Bayesian network has to use local heuristic methods to learn its structure from data. This local property prevents the traditional methods from being used in a mixture upgrading, since it does not guarantee that the value of the optimization function is greater than its value in the previous step, and thus does not guarantee the convergence of the EM process.

In this paper, we break through this bottleneck for the mixture upgrading of the Semi-Naive Bayesian network. We propose a Bounded Semi-Naive Bayesian network and transform its optimization problem into an Integer Programming problem. Our Semi-Naive Bayesian model is shown to enjoy a global nature while maintaining a polynomial time cost. We then upgrade it into a finite mixture model; to the best of our knowledge, this is the first mixture model for Semi-Naive Bayesian networks. Our experimental results show that this mixture model has good convergence performance and really brings an increase in prediction accuracy as well.

IX. APPENDIX

Proof of Lemma 1: Let $S$ be a specific B-SNB with $n$ variables or attributes, denoted by $A_i$, $1 \le i \le n$.

Its large attributes are denoted by $B_i$, $1 \le i \le m$, and we use $(B_1, \ldots, B_m)$ as shorthand for $(B_1, B_2, \ldots, B_{m-1}, B_m)$. The log-likelihood over a dataset can be written as follows:

$$l_S(\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^s) = \sum_{j=1}^{s} \log P(\mathbf{x}^j) = \sum_{j=1}^{s} \log \Big( \prod_{i=1}^{m} P(B_i) \Big) = \sum_{i=1}^{m} \sum_{j=1}^{s} \log P(B_i) = \sum_{i=1}^{m} \sum_{B_i} \hat{P}(B_i) \log P(B_i) \qquad (20)$$

The above term is maximized when $P(B_i)$ is estimated by $\hat{P}(B_i)$, the empirical probability of the large attribute $B_i$; this is easily obtained by maximizing $l_S$ with respect to $P(B_i)$. Thus,

$$l_{S\max} = \sum_{i=1}^{m} \sum_{B_i} \hat{P}(B_i) \log \hat{P}(B_i) = -\sum_{i=1}^{m} \hat{H}(B_i).$$

Proof of Lemma 2: We consider only a simple case; the proof of the general case is much the same. Consider one partition $\mu = (B_1, B_2, \ldots, B_m)$ and another partition $\mu_1 = (B_1, B_2, \ldots, B_{m-1}, B_{m_1}, B_{m_2})$, where

$$B_{m_1} \cap B_{m_2} = \emptyset \quad \text{and} \quad B_{m_1} \cup B_{m_2} = B_m.$$

According to the proof of Lemma 1 above, we have:

$$l_{S_\mu \max} = -\sum_{i=1}^{m} \hat{H}(B_i)$$