Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers


Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Khalid M. Salama and Alex A. Freitas

School of Computing, University of Kent, Canterbury, UK.

December 5, 2013

Abstract

Bayesian Multi-nets (BMNs) are a special kind of Bayesian network (BN) classifier that consists of several local Bayesian networks, one for each predictable class, to model an asymmetric set of variable dependencies given each class value. Deterministic methods using greedy local search are the most frequently used methods for learning the structure of BMNs by optimizing a scoring function. Ant Colony Optimization (ACO) is a meta-heuristic global search method for solving combinatorial optimization problems, inspired by the behavior of real ant colonies. In this paper, we propose two novel ACO-based algorithms, with two different approaches, for building BMN classifiers: ABC-Miner_mn and ABC-Miner_mn_g. The former uses a local learning approach, in which the ACO algorithm completes the construction of one local BN at a time. The latter uses a global approach, which involves building a complete BMN classifier by each single ant in the colony. We experimentally evaluate the performance of our ant-based algorithms on 33 benchmark classification datasets, where our proposed algorithms are shown to be significantly better than other commonly used deterministic algorithms for learning various Bayesian classifiers in the literature, as well as competitive with other well-known classification algorithms.

Keywords: Ant Colony Optimization (ACO), Classification, Bayesian Network Classifiers, Bayesian Multi-nets.

1 Introduction

Bayesian network (BN) classifiers are a well-known kind of graphical probabilistic model that aims to perform class prediction by modeling the dependency relationships between the predictor attributes (variables) in a dataset given the class variable. To classify a given case, the BN classifier computes the posterior probability of each available class value, given the values of the attributes of the case, and then labels the case with the class having the highest posterior probability [1, 2].

Several Bayesian network classifiers have been introduced in the literature: Naïve-Bayes, Tree Augmented Naïve-Bayes (TAN), BN Augmented Naïve-Bayes (BAN) and General Bayesian Networks (GBN) [3, 2], where a single (possibly complex) network structure is built to model the variable dependencies from the whole dataset. In contrast, a Bayesian Multi-net (BMN) classifier consists of several local networks, one for each available class value. Each local Bayesian network in a class-based BMN has a different set of (in)dependencies amongst the variables with respect to a specific class value, which promotes the model's predictive effectiveness and reduces its structural complexity [2, 4].

Ant Colony Optimization (ACO) [5] is a meta-heuristic typically used for solving combinatorial optimization problems, inspired by observations of the intelligent behavior (in particular, finding the shortest path between two points) of ant colonies in nature. ACO has been effectively used for learning general-purpose BNs (rather than BN classifiers) [6, 7, 8, 9], as well as different types of classification models [10, 11, 12, 13, 14, 15]. However, ABC-Miner (where ABC stands for Ant-based Bayesian Classifier), recently introduced by the authors in [16, 17], is the only algorithm to use Ant Colony Optimization for learning BN classifiers, by constructing a BAN structure with at most k dependencies from a dataset. The ABC-Miner algorithm has outperformed conventional deterministic algorithms in terms of predictive accuracy [16, 17].

The motivation behind this work is the following. Although learning the optimal Bayesian network structure from a dataset is NP-hard [18], most of the algorithms used in the literature for building Bayesian network classifiers rely on greedy local search and deterministic techniques, while several stochastic heuristic global search algorithms can be effectively applied to the task of building high-quality networks in an acceptable computational time. ACO has been successful in solving several types of data mining problems, including classification rule induction [12, 19, 13, 10, 11, 14] and BN classifier structure learning [16, 17]. Therefore, we continue exploiting the ACO paradigm and exploring different techniques for building Bayesian classifiers.

Unlike ABC-Miner [16, 17], which builds a single BAN model, in this paper we propose two novel ACO-based algorithms that build BMN classifiers, where a different Bayesian network is built for each class label. These two algorithms follow different approaches to building BMN classifiers, and are called ABC-Miner_mn and ABC-Miner_mn_g. The former uses a local learning approach, in which the ACO algorithm completes the construction of one local BN at a time. The latter uses a global approach, which involves building a complete BMN classifier in a single ant trial of the ACO algorithm. We describe all the elements necessary to tackle our target learning problem using ACO, and experimentally compare the performance of our ACO-based algorithms for building BMN classifiers with the recently introduced ABC-Miner, with other conventional algorithms for learning BN classifiers used in the literature, and with three other types of classification algorithms: rule induction, decision tree construction and support vector machines. Note that this work is the first to utilize the ACO meta-heuristic for building class-based BMN classifiers.

The structure of the paper is as follows. Section 2 gives a brief overview of concepts and methods from the relevant areas of Ant Colony Optimization, Bayesian networks and Bayesian multi-net classifiers. In Section 3, we define the essential elements for applying the ACO meta-heuristic to the problem of learning BMN classifiers. We propose ABC-Miner_mn, which uses a local approach, and ABC-Miner_mn_g, which uses a global approach, for learning BMN classifiers with ACO algorithms in Sections 4 and 5, respectively. Section 6 discusses our experimental methodology, while the results and their analysis are presented in Section 7. Finally, we conclude with some general remarks and provide possible directions for future research in Section 8.

2 Background

2.1 Ant Colony Optimization

Social insect swarms are distributed systems that, in spite of the simple behavior of their individuals, produce a group behavior that is able to effectively accomplish complex tasks. Inspired by the behavior of natural ant colonies, Dorigo et al. [5, 20, 21] defined an artificial ant colony meta-heuristic that can be applied to solve optimization problems, called Ant Colony Optimization (ACO). The main idea is to utilize a swarm of simple individuals that use collective behavior to achieve a certain goal, such as finding the shortest path between a food source and the nest. ACO algorithms simulate the behavior of real ants using a colony of artificial ants, which cooperate in finding good solutions to optimization problems. The outline of the basic ACO procedure is presented in Algorithm 1.

Algorithm 1 Pseudo-code of the basic ACO algorithm.

    Begin ACO
      ConstructionGraph ← problem definition;
      Initialize();
      best ← φ; /* best solution found so far */
      repeat
        current ← ant.ConstructSolution();
        ApplyLocalSearch(current);
        if Quality(current) > Quality(best) then
          best ← current;
        end if
        ant.UpdatePheromone(current);
      until termination condition
      return best;
    End
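To make the control flow of Algorithm 1 concrete, the following is a minimal, runnable Python sketch of the construct/evaluate/update cycle. The toy problem (maximizing the number of 1s in a bit string), the pheromone model and all identifiers are illustrative assumptions, not part of the algorithms proposed in this paper; the optional local search step is omitted for brevity.

    import random

    N_BITS, COLONY_SIZE, MAX_ITERATIONS, RHO = 12, 10, 60, 0.15

    def construct_solution(tau):
        # Each bit is a decision component, chosen in proportion to its pheromone.
        return [1 if random.random() < t else 0 for t in tau]

    def quality(solution):
        # Toy quality function: the fraction of bits set to 1.
        return sum(solution) / len(solution)

    def aco():
        tau = [0.5] * N_BITS            # uniform initial pheromone
        best, best_quality = None, -1.0
        for _ in range(MAX_ITERATIONS):
            # every ant in the colony constructs and evaluates one candidate
            colony = [construct_solution(tau) for _ in range(COLONY_SIZE)]
            current = max(colony, key=quality)
            if quality(current) > best_quality:
                best, best_quality = current, quality(current)
            # pheromone update: evaporation plus reinforcement of the best solution
            for i in range(N_BITS):
                tau[i] = (1 - RHO) * tau[i] + RHO * best[i]
        return best, best_quality

    print(aco())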

Each artificial ant creates candidate solutions to the problem at hand. Communication between the artificial ants in the colony is performed by depositing pheromone on the parts of the search space visited while constructing a solution, indicating its quality, in order to guide the search. The iterative process of building candidate solutions, evaluating their quality and updating pheromone values allows an ACO algorithm to converge to near-optimal solutions. In order to apply the ACO meta-heuristic to a given optimization problem, the following elements should be defined in advance:

- Construction Graph: an appropriate representation of the problem's search space that contains the available decision components, with which an ant can incrementally construct a candidate solution.
- Heuristic Information: problem-specific knowledge that is associated with each decision component in the construction graph and influences the choice of which components will be used for constructing candidate solutions.
- State Transition Formula: a probabilistic transition rule that uses the heuristic value η and the pheromone amount τ associated with the decision components for an ant to move in the search space from one state to another (i.e., to select a decision component) during the solution construction process.
- Quality Evaluation Function: a problem-specific function by which the quality of a constructed candidate solution is evaluated for the purpose of the pheromone update. The higher the quality of a constructed solution, the more pheromone is deposited on the decision components used in that solution, which encourages other ants to select those decision components in future iterations.
- Pheromone Update Strategy: a formula for pheromone reinforcement and evaporation. Pheromone reinforcement is applied to the decision components occurring in the constructed solution, in proportion to its quality, while pheromone evaporation is applied to the whole construction graph to avoid stagnation and early convergence.
- Local Optimizer: an optional local search procedure to improve the quality of a constructed solution. It can be applied to each candidate solution, or just to the best solution in the colony to reduce computational time.

All the aforementioned elements for designing an ACO-based algorithm are concretely defined later, in the context of learning Bayesian multi-net classifiers, in Section 3.

2.2 Bayesian Networks

Bayesian networks are knowledge representation and reasoning tools that model dependency and independency relationships amongst variables in a specific domain [22]. A directed acyclic graph (DAG) is used to model the network structure G, where the nodes represent the domain variables and the edges between the nodes represent statistical dependencies between these variables. In addition, a conditional probability table (CPT) is obtained for each variable with respect to its parents in the network. The set of CPTs represents the parameters Θ of the network, which quantify the dependency relationships between the variables. A BN(G, Θ) specifies a joint probability distribution over the set of variables X that is formulated in the product form:

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid Parents(X_i), \Theta, G)    (1)

Learning a Bayesian network from a dataset (in which the attributes are referred to as variables) consists of two phases: learning the network structure, and then learning the parameters of the network. Parameter learning is a relatively straightforward process for any given BN structure with specified (in)dependencies between variables, since the values of a CPT can be estimated directly from the data by the relative frequencies of each variable with respect to its parents. The CPT of variable X_i encodes the likelihood of each value of this variable given each combination of values of its parents in the graph G. On the other hand, the aim of network structure learning is to find the graph G that best fits a given dataset D in terms of P(D | G). This is known as the scoring-based approach [23, 22]. The algorithm uses a scoring function that evaluates each G with respect to D, searching for the network structure that maximizes the value of the scoring function. K2, MDL, KL, BDEu and several other scoring functions can be used for this task [24, 25, 26]. A recent, very comprehensive review of approaches and issues in learning Bayesian networks is presented by Daly et al. in [22]. For further information about Bayesian networks, the reader is referred to [25, 23], which also provide a detailed discussion of the subject.

2.3 Bayesian Multi-net Classifiers

Bayesian Network Classifiers (BNCs) [27, 2, 4] are a special kind of probabilistic network, which focus on answering queries about the probability of a specific node: the class attribute. A Bayesian multi-net (BMN) is composed of the prior probability distribution of the class node and a set of local Bayesian networks, where each local BN corresponds to a value that the class variable can take [3, 2, 28]. Unlike other types of BNs, a BMN allows different (in)dependency relationships amongst the variables of a dataset to be represented separately, with respect to each individual class value, as shown in Figure 1. For instance, for a given pair of variables X_i and X_j, the actual dependency between them might be best represented as (X_i → X_j) given one class value and as (X_j → X_i) given another class value, while there might be no dependency between them given yet another class value. Even if the same kind of dependency (with the same direction) is represented in two local BNs for different class values, the precise values of the CPTs associated with that dependency may differ between the two local BNs, since each one is learnt from a different data subset (as discussed next). This should lead to better modeling for reasoning and class prediction [2].

A BMN classifier is learnt by partitioning the training set D into |C| subsets, where |C| is the number of values in the domain of the class attribute. Each subset D_l contains only the instances labeled with class value l.

Then, a general (not by itself a classifier) Bayesian network BN_l is built for each class value C = l, with variables X = {X_1, X_2, ..., X_n}, using the subset D_l, to capture the variable-dependency relationships given that specific class value. Thus, a BMN classifier is a set of local BNs {BN_1, BN_2, ..., BN_|C|} that, together with the prior probability distribution of C, classifies an instance x by choosing the class C(x) that maximizes the posterior probability, as shown in Equation 2:

C(x) = \arg\max_{l \in C} P(x = x_1, x_2, \ldots, x_n \mid BN_l) \, P(C = l)    (2)

P(x = x_1, x_2, \ldots, x_n \mid BN_l) = \prod_{i=1}^{n} P(x_i \mid Pa(X_i), BN_l)    (3)

According to Equations 2 and 3, in order to classify an instance using a BMN classifier, we select the class of the local BN that maximizes the probability of the instance. In other words, we ask each local BN for the probability of the instance, and assign to the instance the class of the local BN that answers with the highest probability, which means that the instance is more likely to belong to the domain of that class, in terms of the variable-dependency structure and parameters, than to any other class.

In BMNs, building the local model for each data subset can be performed as a general BN learning process. A well-known greedy algorithm for learning a BN structure is Algorithm-B, proposed in [29], which searches for the network structure that optimizes the value of a scoring metric, such as K2, cross-entropy or the Kullback distance [2, 30]. The Chow-Liu tree multi-net [2, 4] is a notable algorithm for learning local BNs, due to its simplicity and effectiveness; it builds a tree-like network structure based on the mutual information between the variables. An extension of the Chow-Liu tree multi-net was proposed in [30], which involves maximizing the cross-class divergence. The Bayesian class-matched multi-net algorithm [31] is another extension, which uses a scoring function based on detection-rejection behavior. The recursive Bayesian classifier induction algorithm [32] can build multiple local BNs for the same class by further dividing its data subset recursively.

Figure 1: A Bayesian multi-net that consists of three different local BNs, one for each class value. Each local BN has a different network structure that asserts the variable dependencies in the data subset of its corresponding class value.
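As an illustration of Equations 2 and 3, the following is a minimal Python sketch of classification with a BMN. The dictionary-based encoding of a local BN (each variable mapped to its parents and CPT) is an assumption made purely for this example; log-probabilities are summed to avoid numerical underflow.

    import math

    def log_likelihood(instance, local_bn):
        # Equation 3: sum of log P(x_i | Pa(X_i), BN_l) over all variables.
        total = 0.0
        for var, (parents, cpt) in local_bn.items():
            parent_values = tuple(instance[p] for p in parents)
            total += math.log(cpt[(instance[var], parent_values)])
        return total

    def classify(instance, multinet, priors):
        # Equation 2: the class whose local BN maximizes likelihood times prior.
        return max(priors, key=lambda c: log_likelihood(instance, multinet[c])
                                         + math.log(priors[c]))

    # Toy multi-net over binary variables A and B:
    # the local BN of class 0 models A -> B; that of class 1 keeps them independent.
    multinet = {
        0: {"A": ((), {(0, ()): 0.7, (1, ()): 0.3}),
            "B": (("A",), {(0, (0,)): 0.9, (1, (0,)): 0.1,
                           (0, (1,)): 0.2, (1, (1,)): 0.8})},
        1: {"A": ((), {(0, ()): 0.4, (1, ()): 0.6}),
            "B": ((), {(0, ()): 0.5, (1, ()): 0.5})},
    }
    print(classify({"A": 1, "B": 1}, multinet, {0: 0.5, 1: 0.5}))  # -> 1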

2.4 ACO Related Work

Ant Colony Optimization has contributed effectively to tackling the classification problem. Parpinelli et al. proposed Ant-Miner [10], the first ant-based classification algorithm, which learns a set of classification rules of the form [IF conditions THEN class]. Several extensions, such as AntMiner+ by Martens et al. [12], cAnt-Miner by Otero et al. [19, 13], and multi-pheromone Ant-Miner by Salama et al. [33, 11, 14, 34], have been introduced in the literature. A recent survey of Ant-Miner and its related work is presented in [35]. Besides, Ant-Tree-Miner by Otero et al. [36] and cACDT by Boryczka et al. [37] are recently introduced ACO algorithms for building decision trees for classification. Note that our work also utilizes ant-based algorithms to handle classification problems, yet with a very different approach: learning Bayesian networks to be used as classifiers.

ACO has also been employed for learning general-purpose BNs, rather than classifiers, in several works, including ACO-B by de Campos et al. [7], MMACO by Pinto et al. [38, 9], ACO-E by Daly et al. [39, 8], CHAINACO and K2ACO by Yanghui et al. [6] and, recently, HACO-B by Junzhong et al. [40]. These various works are briefly discussed in [17]. However, in the area of Bayesian classification, the authors have recently introduced ABC-Miner [16], extended in [17], at present the only algorithm that tackles the classification problem by using ACO to learn a BN classifier with the structure of a BAN, with at most k dependencies for each variable node. The ant-based ABC-Miner algorithm was shown to outperform other conventional BN classifier learners proposed in the literature in terms of predictive accuracy, overall, across 25 benchmark classification datasets from the well-known UCI dataset repository. For a detailed discussion of the algorithm and its results, the reader is referred to [16, 17].

Note that, unlike the current work, ABC-Miner learns a single BN model from a given dataset to perform the classification task, whereas here we use ACO to learn a BMN for a given dataset, which consists of several local BN models, one for each class value: a different and more elaborate technique for Bayesian classification. Besides, we have recently introduced an ACO-based algorithm that learns cluster-based BMNs [41], rather than the class-based BMNs that are the focus of the current work. Cluster-based BMN refers to partitioning the dataset into arbitrary data subsets and learning a local BN classifier for each subset. While the procedure of ABC-Miner is comparable to the algorithms proposed in our current work, the ACO clustering-based BMN algorithm [41] is very different, for the following reason. In the algorithms proposed here, ACO is employed to find the optimal structure of the local BNs, similarly to ABC-Miner, in which the ACO algorithm optimizes the structure of BAN classifiers. In [41], on the other hand, ACO is utilized to produce the data clusters, rather than to learn the local BN classifiers, while a simple Naïve-Bayes classifier is constructed for each data cluster.

3 ACO Elements for Learning BMN Classifiers

In this work, we are concerned with building the structure of a Bayesian multi-net. Unlike ABC-Miner, which builds a single BAN model to represent the variable (in)dependency relationships with respect to the class variable, our target is to build several local Bayesian networks BN_l, one for each class value l ∈ C, to compose a BMN classifier. Each local BN should model the (in)dependencies between the input variables (attributes) with respect to its specific class value. This approach explicitly encodes asymmetric class-based (in)dependency assertions amongst the variables that cannot be represented in the structure of a single Bayesian network, which in turn aims to improve the predictive power of the Bayesian model and to simplify its interpretability as well.

The construction of a BMN classifier is carried out using two new ant-based algorithms, ABC-Miner_mn and ABC-Miner_mn_g, which utilize two different approaches. While each approach is described later in a separate section, the ACO meta-heuristic elements common to both algorithms are presented in this section. It is important to highlight that one of the most important elements of any ACO-based algorithm is the function by which the quality of a constructed candidate solution is evaluated. However, the quality evaluation functions used in the two proposed algorithms are not the same, since the type of constructed solution differs from one approach to the other. ABC-Miner_mn finishes the construction of a general local BN before it proceeds to the next local BN, while in ABC-Miner_mn_g each ant builds a complete BMN classifier at once. Thus, each algorithm uses a different quality evaluation function, which is discussed later in the context of each proposed approach.

3.1 Construction Graph

As in ABC-Miner, the decision components in the search space are all the edges X → Y, where X ≠ Y and X, Y belong to the input attributes of the data subset D_l. These components (edges) represent the variable (attribute) dependencies in the candidate local BN constructed by an ant. An ant should select a good and valid combination of these decision components (variable-dependency relationships) to construct a candidate solution (BN structure). A representation of the construction graph for a dataset with five input attributes (besides the class attribute) is shown in Figure 2.

At the beginning, the pheromone amount of each decision component is initialized to the same value, given by 1/TotalEdges, where TotalEdges is the number of edges in a complete graph (with an edge between every pair of nodes). In addition, the heuristic value for each edge X → Y is set using the mutual information, which is computed as follows:

I(X, Y \mid D_l) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}    (4)

Mutual information is a measure of the correlation between two variables.
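A minimal sketch of how the heuristic in Equation 4 can be computed on a class subset D_l is given below; the list-of-dicts dataset format is an assumption made for illustration.

    import math
    from collections import Counter

    def mutual_information(subset, x, y):
        # Equation 4 on the class-l subset: empirical p(x), p(y) and p(x, y).
        n = len(subset)
        px = Counter(r[x] for r in subset)
        py = Counter(r[y] for r in subset)
        pxy = Counter((r[x], r[y]) for r in subset)
        mi = 0.0
        for (vx, vy), count in pxy.items():
            p_joint = count / n
            mi += p_joint * math.log(p_joint / ((px[vx] / n) * (py[vy] / n)))
        return mi

    # D_l holds only instances of one class value; X and Y are perfectly correlated.
    D_l = [{"X": 0, "Y": 0}, {"X": 0, "Y": 0}, {"X": 1, "Y": 1}, {"X": 1, "Y": 1}]
    print(mutual_information(D_l, "X", "Y"))  # log(2), the maximum for this case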

This function is used as the local heuristic information associated with the edges in order to lead the ants, during the search, towards edges between correlated variables when building a local BN. Unlike the conditional mutual information used by ABC-Miner, the calculation of the mutual information in this context is more accurate with respect to the local BN. Since the mutual information between variables is calculated using only the subset D_l, which contains only the instances labeled with class value l, it is implicitly conditioned on that class value, and so may change for the same two variables across different local BNs (i.e., different class values). By contrast, the conditional mutual information between two variables is conditioned on the class variable as a whole and computed over the whole dataset, which makes the current approach to calculating the mutual information more class-specific.

Figure 2: Matrix representations of the construction graph for a dataset with five input attributes. The elements on the left represent the number of parents that each variable can have in the network. The decision components on the right are the edges used to build the network structure; each has a current pheromone amount τ and heuristic information η.

3.2 State Transition Rule

The selection of edges is performed according to the following probabilistic state transition formula, standard in the ACO literature [5]:

P_{ij} = \frac{[\tau_{ij}(t)]^{\alpha} [\eta_{ij}]^{\beta}}{\sum_{a \in I} \sum_{b \in J} [\tau_{ab}(t)]^{\alpha} [\eta_{ab}]^{\beta}}    (5)

In Equation 5, P_ij is the probability of selecting the edge i → j, τ_ij(t) is the current amount of pheromone associated with edge i → j at iteration t, and η_ij is the heuristic information for edge i → j. Each edge a → b in the denominator represents a valid selection in the set of available edges. An edge a → b is valid to be added to the local BN being constructed if the following two criteria are satisfied: 1) its inclusion does not create a directed cycle; 2) the upper limit of k_b parents for the child variable b (discussed in the next subsection) is not exceeded by its inclusion. After the ant adds a valid edge to the local BN, all the invalid edges are eliminated from the construction graph.
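The following is a minimal Python sketch of one edge selection under Equation 5, including the two validity tests just described (no directed cycle, parent limit respected). The data structures, and the way α and β are passed in, are assumptions made for illustration.

    import random

    def creates_cycle(edges, new_edge):
        # Adding (i, j) closes a directed cycle iff j can already reach i.
        i, j = new_edge
        stack, seen = [j], set()
        while stack:
            node = stack.pop()
            if node == i:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(b for (a, b) in edges if a == node)
        return False

    def valid_edges(candidates, edges, n_parents, k_limit):
        # Keep only edges that respect the child's parent limit and acyclicity.
        return [(i, j) for (i, j) in candidates
                if n_parents[j] < k_limit[j] and not creates_cycle(edges, (i, j))]

    def select_edge(candidates, tau, eta, alpha, beta):
        # Equation 5: roulette-wheel selection with weights [tau]^alpha * [eta]^beta.
        weights = [(tau[e] ** alpha) * (eta[e] ** beta) for e in candidates]
        threshold = random.uniform(0.0, sum(weights))
        cumulative = 0.0
        for edge, w in zip(candidates, weights):
            cumulative += w
            if threshold <= cumulative:
                return edge
        return candidates[-1]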

The exponents α and β are used to adjust the relative emphases of the pheromone (τ) and heuristic information (η), respectively. Our proposed algorithms adopt the "ants with personality" technique [11]: some ants give more emphasis to pheromone information, while others give more emphasis to heuristic information. The α_i and β_i parameters of each ant are independently drawn from a Gaussian distribution centered at 2 with a standard deviation of 1, as in [11]. This approach aims to advance exploration and promote diversity in the colony during the search process.

3.3 k-Parents Selection

In order to build the structure of a Bayesian network, the number of parents (dependencies) that a node (variable) can have in the network must be specified first. This maximum number of parents is usually fixed for all the variables in the network, and is typically set to a small value, both for the sake of computational efficiency and to avoid over-fitting. Instead of having the user select the optimal number of dependencies that a variable in the BN can have (at most k parents for each node), this selection is carried out by the ants in the ACO algorithm in a self-adaptive manner. Moreover, we allow each variable to have its own number of parents, selected independently prior to solution construction.

As shown on the left side of Figure 2, the numbers of dependencies are treated as decision components as well. The selection of the k_i value is made probabilistically from a list of available numbers. The user only specifies the max_parents parameter, and all the integer values from 1 to this parameter are available for the ant to use in the BN classifier construction. More precisely, if the number of variables is n, the ant selects n values for the k-parents limit, one for each variable, where variable i is restricted to have at most k_i parents during the network construction. Later, the ant updates the pheromone on each value k_i (i = 1, 2, ..., n) after solution creation, according to the quality of the BN classifier that used value k_i for variable i during solution construction. This pheromone amount influences the selection probability of this value by subsequent ants, leading to convergence on a near-optimal value of k_i dependencies for each variable i.
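A minimal sketch of these two self-adaptive choices, made by each ant before construction starts, is given below; the function and parameter names (including the positivity floor on α and β) are assumptions made for illustration.

    import random

    def draw_personality(mu=2.0, sigma=1.0, floor=0.1):
        # 'Ants with personality': alpha and beta drawn independently from a
        # Gaussian centred at 2 with standard deviation 1, floored to stay positive.
        alpha = max(floor, random.gauss(mu, sigma))
        beta = max(floor, random.gauss(mu, sigma))
        return alpha, beta

    def select_k_limits(k_pheromone):
        # k_pheromone[var][k] holds the pheromone on parent-limit value k for var;
        # one k_i is sampled per variable in proportion to these amounts.
        return {var: random.choices(list(tau), weights=list(tau.values()))[0]
                for var, tau in k_pheromone.items()}

    # max_parents = 3: uniform initial pheromone over the values {1, 2, 3}.
    k_pheromone = {v: {1: 1.0, 2: 1.0, 3: 1.0} for v in ("X1", "X2", "X3")}
    print(draw_personality(), select_k_limits(k_pheromone))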

3.4 Local Optimization

The best solution tbest constructed by the ants in the colony at the current iteration t undergoes local search, which aims to improve its quality with respect to the quality evaluation function used by each learning approach. The local search procedure works as follows. It temporarily removes one edge at a time from the constructed BN, considering edges in the reverse order of their addition to the BN (i.e., removing first the last edge that was added). If the removal improves the quality of the BN, the edge is removed permanently; otherwise it is added back. The procedure then proceeds to the next edge, and this process is repeated until all the edges have been tested for removal. The BN with the highest quality, with respect to the quality measure used, is thus obtained and used for the pheromone update.

3.5 Pheromone Update

After the current iteration's best solution has been optimized via local search, pheromone amounts are updated on the two types of decision components in the construction graph, i.e., on the network edges as well as on the numbers of parents (variable dependencies) to be selected for each node when constructing a BN classifier structure. The pheromone update affects the probability of selecting the decision components for building further candidate solutions according to the quality of the previously constructed ones, and consists of two phases: pheromone deposit and evaporation. Concerning the pheromone deposit, the pheromone amount on each decision component (edge i → j) is increased according to the quality of two constructed solutions, the iteration best tbest and the global best gbest, using the following weighted reinforcement strategy:

\tau_{ij}(t+1) = \begin{cases} \tau_{ij}(t) + \phi_1 \cdot Q_{tbest}(t) & \text{if } i \to j \in tbest \\ \tau_{ij}(t) + \phi_2 \cdot Q_{gbest}(t) & \text{if } i \to j \in gbest \end{cases}    (6)

where φ_1 and φ_2 represent the intensities of the pheromone to be deposited in iteration t according to the quality of the iteration-best and global-best solutions, respectively, and are calculated as:

\phi_1 = \frac{max\_iterations - t}{max\_iterations}, \qquad \phi_2 = \frac{t}{max\_iterations}    (7)

Hence, in early iterations more weight is given to the iteration best than to the global best (as max_iterations − t is greater than t), which improves search diversity. As the iterations progress, the quality of the global best increases and it gains more weight (since t increases and max_iterations − t decreases), leading to convergence towards a good solution. Note that the two pheromone deposit conditions in Equation 6 are both applied to the same edge if it belongs to both tbest and gbest, and that φ_1 + φ_2 = 1 at any given iteration t. Pheromone normalization (to simulate evaporation, as in [10]) is then applied to all the τ_ij values by dividing each τ_ij by the total amount of pheromone on the edges of the construction graph. An analogous normalization is also carried out for the decision components representing the maximum numbers of dependencies.
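A minimal sketch of Equations 6 and 7, followed by the normalization step, is given below; the dictionary representation of the pheromone matrix is an assumption made for illustration.

    def update_pheromone(tau, tbest_edges, gbest_edges, q_tbest, q_gbest,
                         t, max_iterations):
        # Equation 7: early iterations favour the iteration best (phi1),
        # later iterations favour the global best (phi2); phi1 + phi2 = 1.
        phi1 = (max_iterations - t) / max_iterations
        phi2 = t / max_iterations
        # Equation 6: both deposits apply when an edge is in both solutions.
        for edge in tau:
            if edge in tbest_edges:
                tau[edge] += phi1 * q_tbest
            if edge in gbest_edges:
                tau[edge] += phi2 * q_gbest
        # Normalization over the whole construction graph simulates evaporation.
        total = sum(tau.values())
        for edge in tau:
            tau[edge] /= total
        return tau

    tau = {("A", "B"): 0.25, ("B", "A"): 0.25, ("A", "C"): 0.25, ("C", "A"): 0.25}
    print(update_pheromone(tau, {("A", "B")}, {("A", "B"), ("A", "C")},
                           q_tbest=0.8, q_gbest=0.9, t=10, max_iterations=100))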

4 A Local ACO Approach for Learning BMN Classifiers

4.1 The ABC-Miner_mn Algorithm Outline

The BMN classifier learning procedure in ABC-Miner_mn is carried out in a local fashion. The Ant Colony Optimization meta-heuristic is utilized to build, independently, a local Bayesian network BN_l for class value l, using the subset D_l of the training set D, which contains only the instances labeled with class value l. This is repeated for each class value l in the domain of the class variable C. We treat each local process as a separate problem, in which a general Bayesian network is built; each local BN by itself is not a classifier. Nevertheless, the quality evaluation takes into account that this local BN will play a part in a global classification problem solved by the finally produced BMN classifier. The overall process of ABC-Miner_mn is shown in Algorithm 2.

Algorithm 2 Pseudo-code of ABC-Miner_mn.

    Begin
      D ← training set; BMC ← φ;
      for l = 1 to |C| do
        D_l ← Subset(D, l);
        BN_l(gbest) ← φ; Q(gbest) ← 0; t ← 1;
        InitializePheromoneAmounts();
        InitializeHeuristicValues();
        repeat
          BN_l(tbest) ← φ; Q(tbest) ← 0;
          for i = 1 to colony_size do
            BN_l(i) ← ant(i).CreateSolution(D_l);
            Q(i) ← ComputeQuality(BN_l(i), D);
            if Q(i) > Q(tbest) then
              BN_l(tbest) ← BN_l(i); Q(tbest) ← Q(i);
            end if
          end for
          PerformLocalSearch(BN_l(tbest));
          UpdatePheromone(BN_l(tbest));
          if Q(tbest) > Q(gbest) then
            BN_l(gbest) ← BN_l(tbest); Q(gbest) ← Q(tbest);
          end if
          t ← t + 1;
        until t = max_iterations or Convergence()
        append BN_l(gbest) to BMC;
      end for
      return BMC;
    End

In essence, each ant(i) in the colony creates a candidate solution BN_l(i), i.e., a local BN, for a given subset D_l, and then the quality of the constructed solution is evaluated.

The best solution BN_l(tbest) produced in the colony at iteration t is selected to undergo local search before the ant updates the pheromone trail according to the quality of its solution, Q(tbest). After that, the iteration-best solution BN_l(tbest) is compared with the global-best solution BN_l(gbest) to keep track of the best solution found so far. This set of steps constitutes one iteration of the repeat-until loop and is repeated until the same solution has been generated for a number of consecutive trials specified by the conv_iterations parameter (used by the Convergence() function in Algorithm 2 to test for convergence) or until max_iterations is reached. When this loop terminates, BN_l(gbest) is taken as the local Bayesian network BN_l for class l and is appended to the Bayesian multi-net classifier BMC. This process is repeated to build a local BN for each class value l ∈ C, producing a complete BMN classifier.

4.2 Local Bayesian Network Construction

To create a candidate local BN, each ant starts with an empty BN, i.e., an edge-less network structure. As shown in Algorithm 3, which presents the pseudo-code of the BN creation procedure, the ant then adds edges (representing variable dependencies) to construct a general Bayesian network. Unlike in ABC-Miner, which starts with a Naïve-Bayes structure and constructs a BAN classifier, the purpose of this procedure is to construct a local BN that will become part of the overall BMN classifier. The selection of edges is performed according to the probabilistic state transition formula presented in Equation 5. After the ant adds a valid edge to the local BN, all the invalid edges are eliminated from the construction graph, and the ant keeps adding edges to the current solution until no valid edges are available. When the structure of a candidate local BN has been completely constructed by an ant, the CPTs of the network are learnt using the data subset D_l. Hence, all the parameters of BN_l are conditioned on the class value l, which makes each local BN in the BMN class-dependent.

Algorithm 3 Pseudo-code of the local BN creation procedure.

    Begin CreateBN
      BN_l(i) ← φ; D_l ← INPUT;
      k_list ← ant(i).SelectMaxParentsForEachVariable();
      while GetValidEdges() ≠ φ do
        {i → j} ← ant(i).SelectEdgeProbabilistically();
        BN_l(i) ← BN_l(i) ∪ {i → j};
        RemoveInvalidEdges(BN_l(i), k_j);
      end while
      BN_l(i).LearnParameters(D_l);
      return BN_l(i);
    End
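The LearnParameters step reduces to counting relative frequencies in D_l. The following is a minimal sketch; the Laplace smoothing and the list-of-dicts data format are assumptions made for illustration, not the paper's specification.

    from collections import Counter

    def learn_cpt(subset, var, parents, domain):
        # P(var = v | parents = pv), estimated by relative frequencies in D_l.
        joint = Counter((r[var], tuple(r[p] for p in parents)) for r in subset)
        parent_counts = Counter(tuple(r[p] for p in parents) for r in subset)
        cpt = {}
        for pv, n_pv in parent_counts.items():
            for v in domain[var]:
                # Laplace smoothing (an assumption) avoids zero probabilities.
                cpt[(v, pv)] = (joint[(v, pv)] + 1) / (n_pv + len(domain[var]))
        return cpt

    D_l = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
    print(learn_cpt(D_l, "B", ["A"], {"A": [0, 1], "B": [0, 1]}))  # CPT of B given A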

4.3 Evaluating the Quality of a Candidate Local BN

Since the target of the algorithm is to build a classifier, the scoring function that the algorithm tries to maximize should be related to the classification effectiveness of the produced model. This reasoning was behind selecting classification accuracy as the scoring function by which a candidate solution is evaluated in ABC-Miner. However, we should recall that each ant in the ABC-Miner algorithm creates a complete BN classifier, which can be evaluated directly using that scoring function. In contrast, since ABC-Miner_mn learns each local BN in the multi-net separately, classification accuracy cannot be used as a scoring function, because each ant builds a general Bayesian network BN_l (not a classifier) from just the subset D_l of the dataset D for class value l. Consequently, the constructed local BN cannot be evaluated as a classifier until all the local BNs have been learnt and the complete multi-net has been composed. Hence, a candidate solution (local BN) built by an ant must be evaluated as a general Bayesian network.

One of the most widely used scoring metrics for building and evaluating Bayesian networks is K2, a metric based on uniform prior scoring [26]. It is used by Algorithm-B [29] and ACO-B [7], a greedy and an ant-based algorithm, respectively, both aiming to build conventional Bayesian networks. However, ABC-Miner_mn does not aim to build a BN as a final product; it builds a local BN to serve as part of a classifier. On the other hand, some discriminative scoring functions, such as the Kullback distance for cross-class divergence, have been used to learn BMNs [30]. In such cases, however, the algorithm starts with a complete initial structure for all the local BNs in the BMN (unlike the current local approach) and incrementally modifies this structure using greedy search; such a complete structure can be evaluated as a classifier using a discriminative objective function.

Accordingly, the target is to learn a local BN that not only maximizes the marginal likelihood P(D_l | BN_l) of the data subset D_l, but also minimizes the marginal likelihood P(D − D_l | BN_l) of the remaining part of the training set. This aims to increase the discriminative power of the local BNs with respect to the class values, so that an instance x labeled with class value l has a high probability in Bayesian network BN_l and a low probability in the other local BNs when the multi-net is used as a classifier. Therefore, we propose a scoring function by which a candidate local BN is evaluated, based on maximizing the difference between the average marginal log-likelihoods of the positive and negative instances of the whole training set, given the local BN currently being constructed. The function is formulated as follows:

Q(BN_l) = \frac{\sum_{i=1}^{|x^+|} LL(x_i^+ \mid BN_l)}{|x^+|} - \frac{\sum_{j=1}^{|x^-|} LL(x_j^- \mid BN_l)}{|x^-|}    (8)

where LL(x | BN) is the log-likelihood of instance x given BN, x^+ are the instances in the training set labeled with the class value l whose local BN is currently being constructed, x^- are the instances in the training set not labeled with class value l, |x^+| is the number of instances labeled with l, and |x^-| is the number of instances not labeled with l.
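A minimal sketch of Equation 8 is shown below; log_likelihood is assumed to compute Equation 3, as in the sketch of Section 2.3.

    def local_bn_quality(local_bn, positives, negatives, log_likelihood):
        # Equation 8: average log-likelihood of the class-l instances minus the
        # average log-likelihood of all other instances, given the local BN.
        avg_pos = sum(log_likelihood(x, local_bn) for x in positives) / len(positives)
        avg_neg = sum(log_likelihood(x, local_bn) for x in negatives) / len(negatives)
        # Maximizing this margin makes BN_l assign high probability to its own
        # class's instances and low probability to the rest.
        return avg_pos - avg_neg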

Although only the subset D_l is used to learn the CPTs of a local BN (as shown in Algorithm 3), the whole training set D is used to evaluate the quality of the constructed local BN (as shown in Algorithm 2), in order to discriminate between the instances belonging and not belonging to class l. This is done by calculating the difference between the average log-likelihood of the instances belonging to the current class and that of the instances belonging to the other classes (as shown in Equation 8), and the optimization objective is to maximize this difference.

5 A Global ACO Approach for Learning BMN Classifiers

5.1 The ABC-Miner_mn_g Algorithm Outline

The global approach works differently from its local counterpart. A single run of the Ant Colony Optimization meta-heuristic is utilized to build the whole solution: each ant builds a complete Bayesian multi-net (BMN) classifier as a candidate solution at once. This is accomplished by building a candidate local BN for each class value l ∈ C in each single ant trial, appending these local BNs to the BMN classifier, and then evaluating the resulting final BMN classifier as a whole. Unlike ABC-Miner_mn, which has to finish building the local Bayesian network BN_l completely before starting to build BN_{l+1}, in ABC-Miner_mn_g each ant builds the whole BMN classifier BMC (including all the local BNs) before evaluation and pheromone update are performed.

As shown in Algorithm 4, each ant(i) in the colony creates a candidate solution BMC(i), i.e., a complete BMN classifier, and then the quality of the constructed BMN is evaluated. The best solution BMC(tbest) produced in the colony at iteration t is selected to undergo local search before the ant updates the pheromone trail according to the quality of its solution, Q(tbest). After that, the iteration-best solution BMC(tbest) is compared with the global-best solution BMC(gbest) to keep track of the best solution found so far. This is iteratively repeated until the termination conditions are met, at which point BMC(gbest) is returned as the final solution.

In order to apply such a global approach using ACO, we use several construction graphs for a given dataset: more precisely, |C| construction graphs, one for each class value l ∈ C. The initialization of pheromone amounts and heuristic information (mutual information) is performed on all the construction graphs. As for solution creation, an ant uses each construction graph to build the corresponding local BN, one for each class value. Once an ant has created the structure of the local BN_l and learnt its parameters, it proceeds to the construction of the local BN_{l+1}, using the (l+1)-th construction graph and the data subset D_{l+1} for parameter learning. The ant proceeds until it has constructed a local BN for every class value, composing a complete BMN classifier. At this point, the quality of the complete solution can be evaluated, and the pheromone is updated on all the construction graphs according to the quality of the constructed BMN classifier.

Algorithm 4 Pseudo-code of ABC-Miner_mn_g.

    Begin
      D ← training set; t ← 1;
      BMC(gbest) ← φ; Q(gbest) ← 0;
      InitializePheromoneAmounts();
      InitializeHeuristicValues();
      repeat
        BMC(tbest) ← φ; Q(tbest) ← 0;
        for i = 1 to colony_size do
          BMC(i) ← φ;
          for l = 1 to |C| do
            D_l ← Subset(D, l);
            BN_l(i) ← ant(i).CreateSolution(D_l);
            append BN_l(i) to BMC(i);
          end for
          Q(i) ← ComputeQuality(BMC(i), D);
          if Q(i) > Q(tbest) then
            BMC(tbest) ← BMC(i); Q(tbest) ← Q(i);
          end if
        end for
        PerformLocalSearch(BMC(tbest));
        UpdatePheromone(BMC(tbest));
        if Q(tbest) > Q(gbest) then
          BMC(gbest) ← BMC(tbest); Q(gbest) ← Q(tbest);
        end if
        t ← t + 1;
      until t = max_iterations or Convergence()
      return BMC(gbest);
    End

Then a subsequent ant can perform a new solution construction trial. Figure 3 shows a diagram comparing the local and the global approaches to building a BMN classifier.

Figure 3: The two proposed ACO approaches for building a BMN classifier for a 3-class dataset: (a) the local approach, where one local BN is constructed at a time and the best local BNs constructed over n iterations for each class are merged at the end to compose the final BMN; and (b) the global approach, where a complete candidate BMN is constructed at each iteration and the best BMN constructed over n iterations is the final solution. The diagram is simplified by assuming that the size of the colony is 1, so that only one solution is constructed at each iteration.

5.2 Evaluating the Quality of a Candidate BMN

The quality of a candidate BMN classifier is evaluated directly (as a classifier) using the accuracy measure, a conventional measure of classification performance, which is computed as follows:

Accuracy = \frac{|Correctly\ Classified\ Cases|}{|Training\ Set|}    (9)

Note that the use of classification accuracy as a solution quality evaluation function is valid in ABC-Miner_mn_g, as in ABC-Miner, since in both algorithms the constructed candidate solution represents a classifier, which can be evaluated using any measure of classification performance. This is in contrast to ABC-Miner_mn, which builds a general Bayesian network as a candidate solution for a local BN of the BMN classifier.
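A minimal sketch of Equation 9 is given below; classify is assumed to implement Equation 2, as in the sketch of Section 2.3, and the training set is assumed to be a list of (attributes, label) pairs.

    def bmn_accuracy(multinet, priors, training_set, classify):
        # Equation 9: fraction of training cases the candidate BMN labels correctly.
        correct = sum(1 for attributes, label in training_set
                      if classify(attributes, multinet, priors) == label)
        return correct / len(training_set)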

6 Experimental Methodology

6.1 Comparative Evaluations

We compare the predictive accuracy of our proposed ant-based algorithms for learning BMN classifiers with that of our previously introduced ABC-Miner, which learns BAN classifiers, as well as with three other widely used Bayesian classifiers, namely Naïve-Bayes, Tree Augmented Naïve-Bayes (TAN) and General Bayesian Network (GBN). TAN allows each node in the BN to have one parent besides the class variable, producing a tree-like BN structure. Unlike the other BN classifier learners, GBN treats the class variable as an ordinary node: it builds a general BN, finds the Markov blanket of the class node and uses it as a Bayesian classifier [2].

A variation of the Chow-Liu (CL) tree algorithm [3] is used for building TANs, as follows. First, it computes the conditional mutual information I(X, Y | C) between each pair of variables X and Y given the class variable C. Then it builds a complete undirected graph connecting all the input variables, in which the weight of edge X−Y is I(X, Y | C), and finds the maximum weighted spanning tree of that graph. After that, it chooses a root variable and sets the directions of all edges to point away from it. Finally, it adds one edge from the class node to each of the other variables, completing a TAN classifier. As for the construction of GBNs, Algorithm-B [29] is used to build a general Bayesian network; the algorithm uses greedy search to optimize the K2 scoring function, and the Markov blanket of the class node is then extracted from the BN to be used as a classifier.

As for BMN algorithms, we compare our algorithms with the Chow-Liu tree multi-net (CL-Tree MN) [4]. This algorithm builds a tree-like network structure, one for each class value, using the mutual information between the variables, by finding the maximum weighted spanning tree as in the CL-tree algorithm [4]. In addition, we implemented BMN-K2, which uses a greedy hill-climbing (GHC) approach to learn the local Bayesian networks of the BMN classifier.

That is, for each local BN in the BMN, the algorithm starts with an empty (edge-less) network and then repeatedly adds the edge that leads to the greatest increase in the quality of the BN classifier being constructed, according to the K2 scoring function. Table 1 presents the main properties of the Bayesian classification algorithms used.

Table 1: Summary of the BN classification algorithms used in the experiments.

    Algorithm        Type     Search Strategy      Output   Optimization
    Naïve-Bayes      Determ.  -                    NB       -
    CL-Tree          Determ.  Max. Spanning Tree   TAN      Cond. Mut. Info.
    Algorithm-B      Determ.  Greedy Hill Climb.   GBN      K2 Function
    CL-Tree MN       Determ.  Max. Spanning Tree   BMN      Mut. Info.
    BMN-K2           Determ.  Greedy Hill Climb.   BMN      K2 Function
    ABC-Miner        Stoch.   Ant Colony Optim.    BAN      Predictive Acc.
    ABC-Miner_mn     Stoch.   Ant Colony Optim.    BMN      Likelihood Diff.
    ABC-Miner_mn_g   Stoch.   Ant Colony Optim.    BMN      Predictive Acc.

Besides the Bayesian algorithms, we compare our proposed ACO algorithms with three well-known classification algorithms: JRip, J48 (the WEKA tool's implementations of Ripper and C4.5, respectively) and SVM [42]. JRip is a classification rule induction algorithm, which learns a classification model consisting of a list of classification rules. J48 builds classification decision trees, in which the internal nodes test the input attribute values of the dataset and the leaf nodes are the classes to be predicted. An SVM maps the cases into a higher-dimensional feature space and then finds the best hyperplane for separating cases of different classes, where the best hyperplane is the one with the greatest possible margin (a gap separating cases of different classes in the data space). SVMs have the disadvantage of producing black-box models that can hardly be interpreted by users [42, 43].

6.2 Experiment Setup

The experiments were carried out using stratified 10-fold cross-validation [42]. For the stochastic ACO-based algorithms, we ran each algorithm 10 times for each cross-validation fold, using a different random seed to initialize the search each time. Each deterministic algorithm was run just once per fold. The parameter configuration used in our experiments is shown in Table 2. Note that the max_iterations parameter refers to the maximum number of iterations used to build a single local BN in the ABC-Miner_mn algorithm.

Table 2: Parameter settings of ABC-Miner_mn used in the experiments.

    Parameter        Value
    max_iterations   100
    colony_size      10
    conv_iterations  15
    max_parents      3

The value of this parameter is multiplied by the number of class values when used with ABC-Miner_mn_g. For the greedy BMN-K2 algorithm, on the other hand, max_iterations refers to the maximum number of solution evaluations that the algorithm performs during the hill-climbing search to build a single local BN. It is set to 1000, which equals max_iterations multiplied by colony_size in our stochastic ant-based algorithms. For the sake of a fair comparison, we limit each algorithm to the same fixed number of solution evaluations for constructing the BN classifier. However, the maximum number might not be used up completely: the ACO-based algorithms may use fewer iterations if they converge earlier, and the greedy algorithms may also stop earlier if they get stuck in a local optimum. Note that the results for ABC-Miner in the current experiments differ slightly from those in [17], because here we used different parameter settings and different training/testing folds.

Unlike in our ant-based algorithms, the number of parents (k dependencies) must be specified for Algorithm-B and BMN-K2. We set it to 3, which is the maximum number of parents used in our ACO algorithms, where the number of parents is selected dynamically at each iteration (see Section 3.3). In our experiments, we used the WEKA [42] implementations of Ripper (JRip), C4.5 (J48) and SVM (SMO), with their default parameter settings.

6.3 Datasets

The performance of ABC-Miner_mn and ABC-Miner_mn_g was evaluated on 33 public-domain datasets from the University of California at Irvine (UCI) dataset repository [44]. The main characteristics of the datasets, namely the number of cases, the number of predictor attributes and the number of classes, are shown in Table 3. For the BN classification algorithms, datasets with continuous attributes were discretized in a pre-processing step using the well-known J48-Disc algorithm [42], applied to the training sets. For the other algorithms, the original datasets with real-valued attributes were used.
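For concreteness, the following is a minimal sketch of the evaluation protocol of Section 6.2: stratified 10-fold cross-validation with 10 seeded runs per fold for a stochastic algorithm. The use of scikit-learn and the train_and_test callable are assumptions made purely for illustration; they are not the tools used in the paper.

    import math
    import statistics
    from sklearn.model_selection import StratifiedKFold

    def evaluate(X, y, train_and_test, n_folds=10, n_seeds=10):
        # train_and_test(train_X, train_y, test_X, test_y, seed) -> accuracy
        accuracies = []
        folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
        for train_idx, test_idx in folds.split(X, y):
            for seed in range(n_seeds):   # 10 seeds per fold for stochastic runs
                accuracies.append(train_and_test(X[train_idx], y[train_idx],
                                                 X[test_idx], y[test_idx], seed))
        # mean and standard error, as reported in Table 4
        return (statistics.mean(accuracies),
                statistics.stdev(accuracies) / math.sqrt(len(accuracies)))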

Table 3: Description of the datasets used in the experiments, giving the number of cases, predictor attributes and classes of each dataset. The 33 UCI datasets are: abalone, balance scale, breast cancer (wisconsin), car evaluation, chess (rook vs. pawn), contraceptive method choice, statlog credit (australian), statlog credit (german), dermatology, ecoli, glass, hayes-roth, heart (cleveland), heart (statlog), hepatitis, ionosphere, iris, lung cancer, monks, mushrooms, nursery, parkinsons, page blocks classification, pima diabetes, post-operative patient, segmentation, soybean, SPECT heart, tic-tac-toe, voting records, wine, yeast and zoo. [Per-dataset figures omitted.]

7 Computational Results and Analysis

7.1 Predictive Accuracy Results

Table 4 reports the mean and standard error (mean ± standard error) of the predictive accuracy values obtained by 10-fold cross-validation on the 33 datasets, with the highest accuracy for each dataset shown in bold face. The last row shows the average rank of each algorithm in terms of predictive accuracy. The average rank for a given algorithm g is obtained by first computing the rank of g on each dataset individually, and then averaging these individual ranks across all the datasets. Note that the lower the rank value, the better the algorithm.

According to the results in Table 4, our proposed ACO-based algorithm for learning BMNs with a global approach, ABC-Miner_mn_g, obtained the best overall rank in terms of predictive accuracy, 2.9, and achieved the best predictive results on 13 of the 33 datasets. SVM came second with an overall rank of 3.2, achieving the best predictive results on 10 datasets. Our proposed ACO-based local approach for learning BMNs, ABC-Miner_mn, came third with an overall rank of 3.6, achieving the best predictive results on 6 datasets. J48 and JRip came in fourth and fifth place with overall ranks of 5.2 and 6.5, achieving the best result on 5 and 7 datasets, respectively.

Table 4: Predictive accuracy % (mean ± standard error) results of the eleven algorithms (Naïve-Bayes, CL-Tree, Algorithm-B, CL-Tree MN, BMN-K2, ABC-Miner, ABC-Miner_mn, ABC-Miner_mn_g, JRip, J48 and SVM) on the 33 datasets, together with each algorithm's average rank in the last row. [Per-dataset figures omitted.]


More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Cryptanalysis of PKP: A New Approach

Cryptanalysis of PKP: A New Approach Cryptanaysis of PKP: A New Approach Éiane Jaumes and Antoine Joux DCSSI 18, rue du Dr. Zamenhoff F-92131 Issy-es-Mx Cedex France eiane.jaumes@wanadoo.fr Antoine.Joux@ens.fr Abstract. Quite recenty, in

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels

Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channels Iterative Decoding Performance Bounds for LDPC Codes on Noisy Channes arxiv:cs/060700v1 [cs.it] 6 Ju 006 Chun-Hao Hsu and Achieas Anastasopouos Eectrica Engineering and Computer Science Department University

More information

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries

First-Order Corrections to Gutzwiller s Trace Formula for Systems with Discrete Symmetries c 26 Noninear Phenomena in Compex Systems First-Order Corrections to Gutzwier s Trace Formua for Systems with Discrete Symmetries Hoger Cartarius, Jörg Main, and Günter Wunner Institut für Theoretische

More information

Evolutionary Product-Unit Neural Networks for Classification 1

Evolutionary Product-Unit Neural Networks for Classification 1 Evoutionary Product-Unit Neura Networs for Cassification F.. Martínez-Estudio, C. Hervás-Martínez, P. A. Gutiérrez Peña A. C. Martínez-Estudio and S. Ventura-Soto Department of Management and Quantitative

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

Can Active Learning Experience Be Transferred?

Can Active Learning Experience Be Transferred? Can Active Learning Experience Be Transferred? Hong-Min Chu Department of Computer Science and Information Engineering, Nationa Taiwan University E-mai: r04922031@csie.ntu.edu.tw Hsuan-Tien Lin Department

More information

Reliability Improvement with Optimal Placement of Distributed Generation in Distribution System

Reliability Improvement with Optimal Placement of Distributed Generation in Distribution System Reiabiity Improvement with Optima Pacement of Distributed Generation in Distribution System N. Rugthaicharoencheep, T. Langtharthong Abstract This paper presents the optima pacement and sizing of distributed

More information

Nonlinear Gaussian Filtering via Radial Basis Function Approximation

Nonlinear Gaussian Filtering via Radial Basis Function Approximation 51st IEEE Conference on Decision and Contro December 10-13 01 Maui Hawaii USA Noninear Gaussian Fitering via Radia Basis Function Approximation Huazhen Fang Jia Wang and Raymond A de Caafon Abstract This

More information

Support Vector Machine and Its Application to Regression and Classification

Support Vector Machine and Its Application to Regression and Classification BearWorks Institutiona Repository MSU Graduate Theses Spring 2017 Support Vector Machine and Its Appication to Regression and Cassification Xiaotong Hu As with any inteectua project, the content and views

More information

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search

Two Birds With One Stone: An Efficient Hierarchical Framework for Top-k and Threshold-based String Similarity Search Two Birds With One Stone: An Efficient Hierarchica Framework for Top-k and Threshod-based String Simiarity Search Jin Wang Guoiang Li Dong Deng Yong Zhang Jianhua Feng Department of Computer Science and

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

A proposed nonparametric mixture density estimation using B-spline functions

A proposed nonparametric mixture density estimation using B-spline functions A proposed nonparametric mixture density estimation using B-spine functions Atizez Hadrich a,b, Mourad Zribi a, Afif Masmoudi b a Laboratoire d Informatique Signa et Image de a Côte d Opae (LISIC-EA 4491),

More information

Efficiently Generating Random Bits from Finite State Markov Chains

Efficiently Generating Random Bits from Finite State Markov Chains 1 Efficienty Generating Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

Uniformly Reweighted Belief Propagation: A Factor Graph Approach

Uniformly Reweighted Belief Propagation: A Factor Graph Approach Uniformy Reweighted Beief Propagation: A Factor Graph Approach Henk Wymeersch Chamers University of Technoogy Gothenburg, Sweden henkw@chamers.se Federico Penna Poitecnico di Torino Torino, Itay federico.penna@poito.it

More information

arxiv: v1 [cs.cv] 25 Oct 2017

arxiv: v1 [cs.cv] 25 Oct 2017 Crop Panning using Stochastic Visua Optimization Gunjan Sehga *, Bindu Gupta, Kausha Paneri, Karamjit Singh, Geetika Sharma, Gautam Shroff TCS Research, India arxiv:1710.09077v1 [cs.cv] 25 Oct 2017 ABSTRACT

More information

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008

arxiv: v2 [cond-mat.stat-mech] 14 Nov 2008 Random Booean Networks Barbara Drosse Institute of Condensed Matter Physics, Darmstadt University of Technoogy, Hochschustraße 6, 64289 Darmstadt, Germany (Dated: June 27) arxiv:76.335v2 [cond-mat.stat-mech]

More information

A Statistical Framework for Real-time Event Detection in Power Systems

A Statistical Framework for Real-time Event Detection in Power Systems 1 A Statistica Framework for Rea-time Event Detection in Power Systems Noan Uhrich, Tim Christman, Phiip Swisher, and Xichen Jiang Abstract A quickest change detection (QCD) agorithm is appied to the probem

More information

Integrating Factor Methods as Exponential Integrators

Integrating Factor Methods as Exponential Integrators Integrating Factor Methods as Exponentia Integrators Borisav V. Minchev Department of Mathematica Science, NTNU, 7491 Trondheim, Norway Borko.Minchev@ii.uib.no Abstract. Recenty a ot of effort has been

More information

Separation of Variables and a Spherical Shell with Surface Charge

Separation of Variables and a Spherical Shell with Surface Charge Separation of Variabes and a Spherica She with Surface Charge In cass we worked out the eectrostatic potentia due to a spherica she of radius R with a surface charge density σθ = σ cos θ. This cacuation

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

Discrete Applied Mathematics

Discrete Applied Mathematics Discrete Appied Mathematics 159 (2011) 812 825 Contents ists avaiabe at ScienceDirect Discrete Appied Mathematics journa homepage: www.esevier.com/ocate/dam A direct barter mode for course add/drop process

More information

Appendix A: MATLAB commands for neural networks

Appendix A: MATLAB commands for neural networks Appendix A: MATLAB commands for neura networks 132 Appendix A: MATLAB commands for neura networks p=importdata('pn.xs'); t=importdata('tn.xs'); [pn,meanp,stdp,tn,meant,stdt]=prestd(p,t); for m=1:10 net=newff(minmax(pn),[m,1],{'tansig','purein'},'trainm');

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

Chemical Kinetics Part 2

Chemical Kinetics Part 2 Integrated Rate Laws Chemica Kinetics Part 2 The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates the rate

More information

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance

Research of Data Fusion Method of Multi-Sensor Based on Correlation Coefficient of Confidence Distance Send Orders for Reprints to reprints@benthamscience.ae 340 The Open Cybernetics & Systemics Journa, 015, 9, 340-344 Open Access Research of Data Fusion Method of Muti-Sensor Based on Correation Coefficient

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea- Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process Management,

More information

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm

Asymptotic Properties of a Generalized Cross Entropy Optimization Algorithm 1 Asymptotic Properties of a Generaized Cross Entropy Optimization Agorithm Zijun Wu, Michae Koonko, Institute for Appied Stochastics and Operations Research, Caustha Technica University Abstract The discrete

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997

Stochastic Automata Networks (SAN) - Modelling. and Evaluation. Paulo Fernandes 1. Brigitte Plateau 2. May 29, 1997 Stochastic utomata etworks (S) - Modeing and Evauation Pauo Fernandes rigitte Pateau 2 May 29, 997 Institut ationa Poytechnique de Grenobe { IPG Ecoe ationae Superieure d'informatique et de Mathematiques

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

Two-sample inference for normal mean vectors based on monotone missing data

Two-sample inference for normal mean vectors based on monotone missing data Journa of Mutivariate Anaysis 97 (006 6 76 wwweseviercom/ocate/jmva Two-sampe inference for norma mean vectors based on monotone missing data Jianqi Yu a, K Krishnamoorthy a,, Maruthy K Pannaa b a Department

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

A Bayesian Framework for Learning Rule Sets for Interpretable Classification

A Bayesian Framework for Learning Rule Sets for Interpretable Classification Journa of Machine Learning Research 18 (2017) 1-37 Submitted 1/16; Revised 2/17; Pubished 8/17 A Bayesian Framework for Learning Rue Sets for Interpretabe Cassification Tong Wang Cynthia Rudin Finae Doshi-Veez

More information

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness

Schedulability Analysis of Deferrable Scheduling Algorithms for Maintaining Real-Time Data Freshness 1 Scheduabiity Anaysis of Deferrabe Scheduing Agorithms for Maintaining Rea-Time Data Freshness Song Han, Deji Chen, Ming Xiong, Kam-yiu Lam, Aoysius K. Mok, Krithi Ramamritham UT Austin, Emerson Process

More information

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017 In-pane shear stiffness of bare stee deck through she finite eement modes G. Bian, B.W. Schafer June 7 COLD-FORMED STEEL RESEARCH CONSORTIUM REPORT SERIES CFSRC R-7- SDII Stee Diaphragm Innovation Initiative

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 4 5 6 7 8 9 3 3 3 33 34 35 36 37 38 39 4 4 4 43 44 45 46 47 48 49 5 5 5 53 54 Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing

More information

VI.G Exact free energy of the Square Lattice Ising model

VI.G Exact free energy of the Square Lattice Ising model VI.G Exact free energy of the Square Lattice Ising mode As indicated in eq.(vi.35), the Ising partition function is reated to a sum S, over coections of paths on the attice. The aowed graphs for a square

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

The Binary Space Partitioning-Tree Process Supplementary Material

The Binary Space Partitioning-Tree Process Supplementary Material The inary Space Partitioning-Tree Process Suppementary Materia Xuhui Fan in Li Scott. Sisson Schoo of omputer Science Fudan University ibin@fudan.edu.cn Schoo of Mathematics and Statistics University of

More information

Multilayer Kerceptron

Multilayer Kerceptron Mutiayer Kerceptron Zotán Szabó, András Lőrincz Department of Information Systems, Facuty of Informatics Eötvös Loránd University Pázmány Péter sétány 1/C H-1117, Budapest, Hungary e-mai: szzoi@csetehu,

More information

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn Automobie Prices in Market Equiibrium Berry, Pakes and Levinsohn Empirica Anaysis of demand and suppy in a differentiated products market: equiibrium in the U.S. automobie market. Oigopoistic Differentiated

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Chemical Kinetics Part 2. Chapter 16

Chemical Kinetics Part 2. Chapter 16 Chemica Kinetics Part 2 Chapter 16 Integrated Rate Laws The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Efficient Generation of Random Bits from Finite State Markov Chains

Efficient Generation of Random Bits from Finite State Markov Chains Efficient Generation of Random Bits from Finite State Markov Chains Hongchao Zhou and Jehoshua Bruck, Feow, IEEE Abstract The probem of random number generation from an uncorreated random source (of unknown

More information

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA

T.C. Banwell, S. Galli. {bct, Telcordia Technologies, Inc., 445 South Street, Morristown, NJ 07960, USA ON THE SYMMETRY OF THE POWER INE CHANNE T.C. Banwe, S. Gai {bct, sgai}@research.tecordia.com Tecordia Technoogies, Inc., 445 South Street, Morristown, NJ 07960, USA Abstract The indoor power ine network

More information

Appendix for Stochastic Gradient Monomial Gamma Sampler

Appendix for Stochastic Gradient Monomial Gamma Sampler Appendix for Stochastic Gradient Monomia Gamma Samper A The Main Theorem We provide the foowing theorem to characterize the stationary distribution of the stochastic process with SDEs in (3) Theorem 3

More information

arxiv: v2 [cs.si] 7 Mar 2017

arxiv: v2 [cs.si] 7 Mar 2017 IMA Journa of Compex Networks (2018) Page 1 of 26 doi:10.1093/comnet/xxx000 Evauating baance on socia networks from their simpe cyces arxiv:1606.03347v2 [cs.si] 7 Mar 2017 PIERRE-LOUIS GISCARD University

More information

On the Goal Value of a Boolean Function

On the Goal Value of a Boolean Function On the Goa Vaue of a Booean Function Eric Bach Dept. of CS University of Wisconsin 1210 W. Dayton St. Madison, WI 53706 Lisa Heerstein Dept of CSE NYU Schoo of Engineering 2 Metrotech Center, 10th Foor

More information

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance

Efficient Similarity Search across Top-k Lists under the Kendall s Tau Distance Efficient Simiarity Search across Top-k Lists under the Kenda s Tau Distance Koninika Pa TU Kaisersautern Kaisersautern, Germany pa@cs.uni-k.de Sebastian Miche TU Kaisersautern Kaisersautern, Germany smiche@cs.uni-k.de

More information

Auxiliary Gibbs Sampling for Inference in Piecewise-Constant Conditional Intensity Models

Auxiliary Gibbs Sampling for Inference in Piecewise-Constant Conditional Intensity Models uxiiary Gibbs Samping for Inference in Piecewise-Constant Conditiona Intensity Modes Zhen Qin University of Caifornia, Riverside zqin001@cs.ucr.edu Christian R. Sheton University of Caifornia, Riverside

More information

HYDROGEN ATOM SELECTION RULES TRANSITION RATES

HYDROGEN ATOM SELECTION RULES TRANSITION RATES DOING PHYSICS WITH MATLAB QUANTUM PHYSICS Ian Cooper Schoo of Physics, University of Sydney ian.cooper@sydney.edu.au HYDROGEN ATOM SELECTION RULES TRANSITION RATES DOWNLOAD DIRECTORY FOR MATLAB SCRIPTS

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

Emmanuel Abbe Colin Sandon

Emmanuel Abbe Colin Sandon Detection in the stochastic bock mode with mutipe custers: proof of the achievabiity conjectures, acycic BP, and the information-computation gap Emmanue Abbe Coin Sandon Abstract In a paper that initiated

More information

Testing for the Existence of Clusters

Testing for the Existence of Clusters Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers

More information

C. Fourier Sine Series Overview

C. Fourier Sine Series Overview 12 PHILIP D. LOEWEN C. Fourier Sine Series Overview Let some constant > be given. The symboic form of the FSS Eigenvaue probem combines an ordinary differentia equation (ODE) on the interva (, ) with a

More information

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS

LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL HARMONICS MASSACHUSETTS INSTITUTE OF TECHNOLOGY Physics Department Physics 8.07: Eectromagnetism II October 7, 202 Prof. Aan Guth LECTURE NOTES 9 TRACELESS SYMMETRIC TENSOR APPROACH TO LEGENDRE POLYNOMIALS AND SPHERICAL

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks

Power Control and Transmission Scheduling for Network Utility Maximization in Wireless Networks ower Contro and Transmission Scheduing for Network Utiity Maximization in Wireess Networks Min Cao, Vivek Raghunathan, Stephen Hany, Vinod Sharma and. R. Kumar Abstract We consider a joint power contro

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information