Learning Bayesian Networks Does Not Have to Be NP-Hard

Learning Bayesian Networks Does Not Have to Be NP-Hard Norbert Dojer Institute of Informatics, Warsaw University, Banacha, 0-097 Warszawa, Poland dojer@mimuw.edu.pl Abstract. We propose an algorithm for learning an optimal Bayesian network from data. Our method is addressed to biological applications, where usually datasets are small but sets of random variables are large. Moreover we assume that there is no need to examine the acyclicity of the graph. We provide polynomial bounds (with respect to the number of random variables) for time complexity of our algorithm for two generally used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. 1 Introduction The framework of Bayesian networks is widely used in computational molecular biology. In particular, it appears attractive in the field of inferring a gene regulatory network structure from microarray expression data. Researchers dealing with Bayesian networks have generally accepted that without restrictive assumptions, learning Bayesian networks from data is NPhard with respect to the number of network vertices. This belief originates from a number of discouraging complexity results, which emerged over the last few years. Chickering [1] shows that learning an optimal network structure is NPhard for the BDe scoring criterion with an arbitrary prior distribution, even when each node has at most two parents and the dataset is trivial. In this approach a reduction of a known NP-complete problem is based on constructing a complicated prior distribution for the BDe score. Chickering et al. [] show that learning an optimal network structure is NP-hard for large datasets, even when each node has at most three parents and a scoring criterion favors the simplest model able to represent the distribution underlying the dataset exactly (as most of scores used in practice do). By a large dataset the authors mean the one which reflects the underlying distribution exactly. Moreover, this distribution is assumed to be directly accessible, so the complexity of scanning the dataset does not influence the result. In other papers [3,4], NP-hardness of the problem of learning Bayesian networks from data with some constraints on the structure of a network is shown. On the other hand, the known algorithms allow to learn the structure of optimal networks having up to 0-40 vertices [5]. Consequently, a large amount R. Královic and P. Urzyczyn (Eds.): MFCS 006, LNCS 416, pp. 305 314, 006. c Springer-Verlag Berlin Heidelberg 006

306 N. Dojer of work has been dedicated to heuristic search techniques of identifying good models [6,7]. In the present paper we propose a new algorithm for learning an optimal network structure. Our method is addressed to the case when a dataset is small and there is no need to examine the acyclicity of the graph. The first condition does not bother in biological applications, because expenses of microarray experiments restrict the amount of expression data. The second assumption is satisfied in many cases, e.g. when working with dynamic Bayesian networks or when the sets of regulators and regulatees are disjoint. We investigate the worst-case time complexity of our algorithm for generally used scoring criteria. We show that the algorithm works in polynomial time for both MDL and BDe scores. Experiments with real biological data show that, even for large floral genomes, the algorithm for the MDL score works in a reasonable time. Algorithm A Bayesian network (BN) N is a representation of a joint distribution of a set of discrete random variables X = {X 1,...,X n }. The representation consists of two components: a directed acyclic graph G =(X, E) encoding conditional (in-)dependencies a family θ of conditional distributions P (X i Pa i ), where The joint distribution of X is given by Pa i = {Y X (Y,X i ) E} P (X) = n P (X i Pa i ) (1) i=1 The problem of learning a BN is understood as follows: given a multiset of X-instances D = {x 1,...,x N } find a network graph G that best matches D. The notion of a good match is formalized by means of a scoring function S(G : D) having positive values and minimized for the best matching network. Thus the point is to find a directed acyclic graph G with the set of vertices X minimizing S(G : D). In the present paper we consider the case when there is no need to examine the acyclicity of the graph, for example: When edges must be directed according to a given partial order, in particular when the sets of potential regulators and regulatees are disjoint. When dealing with dynamic Bayesian networks. A dynamic BN describes stochastic evolution of a set of random variables over discretized time. Therefore conditional distributions refer to random variables in neighboring time points. The acyclicity constraint is relaxed, because the unrolled graph

Learning Bayesian Networks Does Not Have to Be NP-Hard 307 (with a copy of each variable in each time point) is always acyclic (see [7,8] for more details). The following considerations apply to dynamic BNs as well. In the sequel we consider some assumptions on the form of a scoring function. The first one states that S(G : D) decomposes into a sum over the set of random variables of local scores, depending on the values of a variable and its parents in the graph only. Assumption 1 (additivity). S(G : D) = n i=1 s(x i, Pa i : D {Xi} Pa i ),where D Y denotes the restriction of D to the values of the members of Y X. When there is no need to examine the acyclicity of the graph, this assumption allows to compute the parents set of each variable independently. Thus the point is to find Pa i minimizing s(x i, Pa i : D {Xi} Pa i )foreachi. Let us fix a dataset D and a random variable X. WedenotebyX the set of potential parents of X (possibly smaller than X due to given constraints on the structure of the network). To simplify the notation we continue to write s(pa) for s(x, Pa : D {X} Pa ). The following assumption expresses the fact that scoring functions decompose into components: g penalizing the complexity of a network and d evaluating the possibility of explaining data by a network. Assumption (splitting). s(pa) =g(pa) +d(pa) for some functions g, d : P(X) R + satisfying Pa Pa = g(pa) g(pa ). This assumption is used in the following algorithm to avoid considering networks with inadequately large component g. Algorithm 1. 1. Pa :=. for each P X chosen according to g(p) (a) if s(p) <s(pa) thenpa := P (b) if g(p) s(pa) then return Pa; stop In the above algorithm choosing according to g(p) means choosing increasingly with respect to the value of the component g of the local score. Theorem 1. Suppose that the scoring function satisfies Assumptions 1-. Then Algorithm 1 applied to each random variable finds an optimal network. Proof. The theorem follows from the fact that all the potential parents sets P omitted by the algorithm satisfy s(p) g(p) s(pa), where Pa is the returned set.

308 N. Dojer A disadvantage of the above algorithm is that finding a proper subset P X involves computing g(p ) for all -successors P of previously chosen subsets. It may be avoided when a further assumption is imposed. Assumption 3 (uniformity). Pa = Pa = g(pa) =g(pa ). The above assumption suggests the notation ĝ( Pa ) = g(pa). The following algorithm uses the uniformity of g to reduce the number of computations of the component g. Algorithm. 1. Pa :=. for p =1ton (a) if ĝ(p) s(pa) then return Pa; stop (b) P = argmin {Y X : Y =p}s(y) (c) if s(p) <s(pa) thenpa := P Theorem. Suppose that the scoring function satisfies Assumptions 1-3. Then Algorithm applied to each random variable finds an optimal network. Proof. The theorem follows from the fact that all the potential parents sets P omitted by the algorithm satisfy s(p) ĝ( P ) s(pa), where Pa is the returned set. The following sections are devoted to two generally used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. We consider possible decompositions of the local scores and analyze the computational cost of the above algorithms. 3 Minimal Description Length The Minimal Description Length (MDL) scoring criterion originates from information theory [9]. A network N is viewed here as a model of compression of a dataset D. The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data. Let us fix a dataset D = {x 1,...,x N } and a random variable X. Recall the decomposition s(pa) = g(pa) + d(pa) of the local score for X. IntheMDL score g(pa) stands for the length of the description of the local part of the network (i.e. the edges ingoing to X and the conditional distribution P (X Pa)) and d(pa) is the length of the compressed version of X-values in D. Let k Y denote the cardinality of the set V Y of possible values of the random variable Y X. Thuswehave g(pa) = Pa log n + log N (k X 1) k Y Y Pa

Learning Bayesian Networks Does Not Have to Be NP-Hard 309 where log N is the number of bits we use for each numeric parameter of the conditional distribution. This formula satisfies Assumption but fails to satisfy Assumption 3. Therefore Algorithm 1 can be used to learn an optimal network, but Algorithm cannot. However, for many applications we may assume that all the random variables attain values from the same set V of cardinality k. In this case we obtain the formula g(pa) = Pa log n + log N (k 1)k Pa which satisfies Assumption 3. For simplicity, we continue to work under this assumption. The general case may be handled in much the same way. Compression with respect to the network model is understood as follows: when encoding the X-values, the values of Pa-instances are assumed to be known. Thus the optimal encoding length is given by d(pa) =N H(X Pa) where H(X Pa) = P (v, v)logp (v v) is the conditional entropy Pa of X given Pa (the distributions are estimated from D). Since all the assumptions from the previous section are satisfied, Algorithm may be applied to learn the optimal network. Let us turn to the analysis of its complexity. Theorem 3. The worst-case time complexity of Algorithm for the MDL score is O(n log k N N log k N). Proof. The proof consists in finding a number p satisfying ĝ(p) s( ). Given v V,wedenotebyN v the number of samples in D with X = v. By Jensen s Theorem we have d( ) =N H(X ) =N N v N log N N log 1=Nlog k N v Hence s( ) N log k +(k 1) log N and it suffices to find p satisfying p log n + k p (k 1) log N N log k +(k 1) log N It is easily seen that the above assertion holds for p log k N whenever N>5. Therefore we do not have to consider subsets with more than log k N parent variables. Since there are O(n p ) subsets with at most p parents and a subset with p elements may be examined in pn steps, the total complexity of the algorithm is O(n log k N N log k N). 4 Bayesian-Dirichlet Equivalence The Bayesian-Dirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [10]. Given a dataset D the optimal network structure G maximizes the posterior conditional probability P (G D). We have

310 N. Dojer P (G D) P (G)P (D G) =P (G) P (D G,θ)P (θ G)dθ where P (G) and P (θ G) are prior probability distributions on graph structures and conditional distributions parameters, respectively, and P (D G,θ)isevaluated due to (1). Heckerman et al. [6], following Cooper and Herskovits [10], identified a set of independence assumptions making possible decomposition of the integral in the above formula into a product over X. Under this condition, together with a similar one regarding decomposition of P (G), the scoring criterion S(G : D) = log P (G) log P (D G) obtained by taking log of the above term satisfies Assumption 1. Moreover, the decomposition s(pa) = g(pa)+ d(pa) of the local scores appears as well, with the components g and d derived from log P (G) and log P (D G), respectively. The distribution P ((X, E)) α E with a penalty parameter 0 <α<1in general is used as a prior over the network structures. This choice results in the function g( Pa ) = Pa log α 1 satisfying Assumptions and 3. However, it should be noticed that there are also used priors which satisfy neither Assumption nor 3, e.g. P (G) α Δ(G,G0),whereΔ(G, G 0 )isthecardinality of the symmetric difference between the sets of edges in G andintheprior network G 0. The Dirichlet distribution is generally used as a prior over the conditional distributions parameters. It yields d(pa) =log Γ ( (H v,v + N v,v )) Γ ( Γ (H v,v ) H v,v) Γ (H v,v + N v,v ) Pa where Γ is the Gamma function, N v,v denotes the number of samples in D with X = v and Pa = v, andh v,v is the corresponding hyperparameter of the Dirichlet distribution. Setting all the hyperparameters to 1 yields d(pa) =log Pa = Pa (k 1+ N v,v)! 1 = (k 1)! N v,v! ( log(k 1+ N v,v )! log(k 1)! ) log N v,v! where k = V. For simplicity, we continue to work under this assumption. The general case may be handled in a similar way. The following result allows to refine the decomposition of the local score into the sum of the components g and d.

Learning Bayesian Networks Does Not Have to Be NP-Hard 311 Proposition 1. Define d min = (log(k 1+N v)! log(k 1)! log N v!), where N v denotes the number of samples in D with X = v. Thend(Pa) d min for each Pa X. Proof. Fix Pa X. Givenv V Pa,wedenotebyN v the number of samples in D with Pa = v. Wehave d(pa) = ( log(k 1+N v )! log(k 1)! ) log N v,v! = Pa = k 1+N v log i N v,v log i Pa i=1 k 1+N v,v N v,v log i log i = N v,v log k 1+i = i Pa = = N v,v Pa i=1 i=1 i=1 log( k 1 i N v log k 1+i i = = +1) ( k 1+Nv Pa i=1 N v log( k 1 i i=1 +1)= ) N v log i log i = i=1 (log(k 1+N v )! log(k 1)! log N v!) = d min The first inequality follows from the fact that given N v the sum k 1+N v,v log i maximizes when N v = N v,v for some v V. Similarly, the second inequality holds, because given N v the sum N v,v Pa i=1 log( k 1 i +1) minimizes when N v = N v,v for some v V Pa. By the above proposition, the decomposition of the local score given by s(pa) = g (Pa) +d (Pa) with the components g (Pa) =g(pa) +d min and d (Pa) = d(pa) d min satisfies all the assumptions required by Algorithm. Let us turn to the analysis of its complexity. Theorem 4. The worst-case time complexity of Algorithm for the BDe score with the decomposition of the local score given by s(pa) =g (Pa) +d (Pa) is O(n N log α 1 k N log α 1 k).

31 N. Dojer Proof. The proof consists in finding a number p satisfying ĝ (p) s( ). We have d ( ) =d( ) d min = ( (k 1+N)! = log ) log N v! (k 1)! = (log(k 1+N)! log(k 1)!) k 1+N =loge = log i k k 1+N k 1+N ln i k k 1+ N k log i k 1+ N k ( ( k+n k 1+ N log e ln xdx k k ( log (k 1+N ) v)! log N v! = (k 1)! (log(k 1+N v )! log(k 1)!) = k 1+N v log i log i +(N k N k )log(k + N k ) = k 1 ln i +( N k N k )ln(k + N k ) k )) k 1+ N k ln xdx+ ln xdx = k 1+ N k ( =loge (k + N)(ln(k + N) 1) k(ln k 1) ( k (k 1+ N ) ) k )(ln(k 1+N k ) 1) (k 1)(ln(k 1) 1) = k + N = N log k + N log k k + N +log (1 + N k )k (1 + N N log k k k )k k The first inequality follows from the fact that the sum log N v! minimizes when the values of N v are uniformly distributed over v. The last inequality is due to the fact that k, and consequently k k k. Thus s( ) N log k + d min, so we need p satisfying p log α 1 N log k, i.e. p N log α 1 k. Therefore we do not have to consider subsets with more than N log α 1 k parent variables. Since there are O(n p )subsetswithatmostpparents and a subset with p elements may be examined in pn steps, the total complexity of the algorithm is O(n N log α 1 k N log α 1 k). 5 Application to Biological Data Microarray experiments measure simultaneously expression levels of thousands of genes. Learning Bayesian networks from microarray data aims at discovering gene regulatory dependencies.

Learning Bayesian Networks Does Not Have to Be NP-Hard 313 We tested the time complexity of learning an optimal dynamic BN on microarray experiments data of Arabidopsis thaliana ( 3 000 genes). There were 0 samples of time series data discretized into 3 expression levels. We ran our program on a single Pentium 4 CPU.8 GHz. The computation time was 48 hours for the MDL score and 170 hours for the BDe score with a parameter log α 1 = 5. The program failed to terminate in a reasonable time for the BDe score with a parameter log α 1 = 3, i.e α 0.05. 6 Conclusion The aim of this work was to provide an effective algorithm for learning optimal Bayesian network from data, applicable to microarray expression measurements. We have proposed a method addressed to the case when a dataset is small and there is no need to examine the acyclicity of the graph. We have provided polynomial bounds (with respect to the number of random variables) for time complexity of our algorithm for two generally used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. An application to real biological data shows that our algorithm works with large floral genomes in a reasonable time for the MDL score but fails to terminate for the BDe score with a sensible value of the penalty parameter. Acknowledgements The author is grateful to Jerzy Tiuryn for comments on previous versions of this work and useful discussions related to the topic. I also thank Bartek Wilczyński for implementing and running our learning algorithm. This research was founded by Polish State Committee for Scientific Research under grant no. 3T11F018. References 1. Chickering, D.M.: Learning Bayesian networks is NP-complete. In Fisher, D., Lenz, H.J., eds.: Learning from Data: Artificial Inteligence and Statistics V. Springer- Verlag (1996). Chickering, D.M., Heckerman, D., Meek, C.: Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research 5 (004) 187 1330 3. Dasgupta, S.: Learning polytrees. In: Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), San Francisco, CA, Morgan Kaufmann Publishers (1999) 134 141 4. Meek, C.: Finding a path is harder than finding a tree. J. Artificial Intelligence Res. 15 (001) 383 389 (electronic) 5. Ott, S., Imoto, S., Miyano, S.: Finding optimal models for small gene networks. Pac. Symp. Biocomput. (004) 557 567 6. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 0(3) (1995) 197 43

314 N. Dojer 7. Neapolitan, R.E.: Learning Bayesian Networks. Prentice Hall (003) 8. Friedman, N., Murphy, K., Russell, S.: Learning the structure of dynamic probabilistic networks. In Cooper, G.F., Moral, S., eds.: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Inteligence. (1998) 139 147 9. Lam, W., Bacchus, F.: Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence 10(3) (1994) 69 93 10. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9 (199) 309 347