Learning Bayesian Networks Does Not Have to Be NP-Hard


Norbert Dojer
Institute of Informatics, Warsaw University, Banacha 2, 02-097 Warszawa, Poland
dojer@mimuw.edu.pl

Abstract. We propose an algorithm for learning an optimal Bayesian network from data. Our method is addressed to biological applications, where datasets are usually small but the sets of random variables are large. Moreover, we assume that there is no need to examine the acyclicity of the graph. We provide polynomial bounds (with respect to the number of random variables) for the time complexity of our algorithm for two commonly used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence.

R. Královič and P. Urzyczyn (Eds.): MFCS 2006, LNCS 4162, pp. 305-314, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

The framework of Bayesian networks is widely used in computational molecular biology. In particular, it appears attractive in the field of inferring a gene regulatory network structure from microarray expression data.

Researchers dealing with Bayesian networks have generally accepted that, without restrictive assumptions, learning Bayesian networks from data is NP-hard with respect to the number of network vertices. This belief originates from a number of discouraging complexity results which emerged over the last few years. Chickering [1] shows that learning an optimal network structure is NP-hard for the BDe scoring criterion with an arbitrary prior distribution, even when each node has at most two parents and the dataset is trivial. In this approach a reduction from a known NP-complete problem is based on constructing a complicated prior distribution for the BDe score. Chickering et al. [2] show that learning an optimal network structure is NP-hard for large datasets, even when each node has at most three parents and the scoring criterion favors the simplest model able to represent the distribution underlying the dataset exactly (as most scores used in practice do). By a large dataset the authors mean one which reflects the underlying distribution exactly. Moreover, this distribution is assumed to be directly accessible, so the complexity of scanning the dataset does not influence the result. In other papers [3,4], NP-hardness of learning Bayesian networks from data under some constraints on the structure of the network is shown.

On the other hand, the known algorithms allow learning the structure of optimal networks having up to 20-40 vertices [5]. Consequently, a large amount

of work has been dedicated to heuristic search techniques for identifying good models [6,7].

In the present paper we propose a new algorithm for learning an optimal network structure. Our method is addressed to the case when a dataset is small and there is no need to examine the acyclicity of the graph. The first condition is not an obstacle in biological applications, because the expense of microarray experiments restricts the amount of expression data. The second assumption is satisfied in many cases, e.g. when working with dynamic Bayesian networks or when the sets of regulators and regulatees are disjoint.

We investigate the worst-case time complexity of our algorithm for commonly used scoring criteria. We show that the algorithm works in polynomial time for both MDL and BDe scores. Experiments with real biological data show that, even for large floral genomes, the algorithm for the MDL score works in a reasonable time.

2 Algorithm

A Bayesian network (BN) $\mathcal{N}$ is a representation of a joint distribution of a set of discrete random variables $\mathbf{X} = \{X_1,\dots,X_n\}$. The representation consists of two components:

- a directed acyclic graph $G = (\mathbf{X}, E)$ encoding conditional (in-)dependencies,
- a family $\theta$ of conditional distributions $P(X_i \mid \mathrm{Pa}_i)$, where $\mathrm{Pa}_i = \{Y \in \mathbf{X} \mid (Y, X_i) \in E\}$.

The joint distribution of $\mathbf{X}$ is given by

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_i). \qquad (1)$$

The problem of learning a BN is understood as follows: given a multiset of $\mathbf{X}$-instances $D = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}$, find a network graph $G$ that best matches $D$. The notion of a good match is formalized by means of a scoring function $S(G : D)$, which takes positive values and is minimized for the best matching network. Thus the point is to find a directed acyclic graph $G$ with the set of vertices $\mathbf{X}$ minimizing $S(G : D)$.

In the present paper we consider the case when there is no need to examine the acyclicity of the graph, for example:

- When edges must be directed according to a given partial order, in particular when the sets of potential regulators and regulatees are disjoint.
- When dealing with dynamic Bayesian networks. A dynamic BN describes the stochastic evolution of a set of random variables over discretized time, so conditional distributions refer to random variables in neighboring time points. The acyclicity constraint is relaxed, because the unrolled graph (with a copy of each variable in each time point) is always acyclic (see [7,8] for more details); a small sketch illustrating this appears below.

The following considerations apply to dynamic BNs as well.
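The second point is easy to see directly: if every parent of a variable at time $t+1$ is taken from time $t$, then every edge of the unrolled graph goes from an earlier slice to a later one, so any choice of parent sets yields an acyclic graph and no acyclicity test is needed during the search. A minimal self-contained Python sketch of this observation (the variable names and the DFS check below are illustrative, not part of the paper):

    def unrolled_edges(parents, T):
        """parents maps each variable to an arbitrary set of parent variables,
        read as 'parent at time t' -> 'child at time t+1'. Returns the edge
        list of the graph unrolled over time points 0..T-1."""
        return [((p, t), (child, t + 1))
                for child, pa in parents.items()
                for p in pa
                for t in range(T - 1)]

    def is_acyclic(edges):
        """Depth-first search cycle check for a directed graph given as an edge list."""
        graph = {}
        for u, v in edges:
            graph.setdefault(u, []).append(v)
            graph.setdefault(v, [])
        WHITE, GREY, BLACK = 0, 1, 2
        color = {u: WHITE for u in graph}

        def dfs(u):
            color[u] = GREY
            for v in graph[u]:
                if color[v] == GREY or (color[v] == WHITE and not dfs(v)):
                    return False
            color[u] = BLACK
            return True

        return all(dfs(u) for u in graph if color[u] == WHITE)

    # Even a 'cyclic-looking' regulatory structure (X and Y regulate each other,
    # X also regulates itself) unrolls into an acyclic graph.
    parents = {"X": {"X", "Y"}, "Y": {"X"}}
    print(is_acyclic(unrolled_edges(parents, T=5)))   # True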

In the sequel we consider some assumptions on the form of a scoring function. The first one states that $S(G : D)$ decomposes into a sum, over the set of random variables, of local scores depending only on the values of a variable and its parents in the graph.

Assumption 1 (additivity). $S(G : D) = \sum_{i=1}^{n} s(X_i, \mathrm{Pa}_i : D|_{\{X_i\}\cup\mathrm{Pa}_i})$, where $D|_{\mathbf{Y}}$ denotes the restriction of $D$ to the values of the members of $\mathbf{Y} \subseteq \mathbf{X}$.

When there is no need to examine the acyclicity of the graph, this assumption allows us to compute the parents set of each variable independently. Thus the point is to find $\mathrm{Pa}_i$ minimizing $s(X_i, \mathrm{Pa}_i : D|_{\{X_i\}\cup\mathrm{Pa}_i})$ for each $i$.

Let us fix a dataset $D$ and a random variable $X$. We denote by $\mathbf{X}'$ the set of potential parents of $X$ (possibly smaller than $\mathbf{X}$ due to given constraints on the structure of the network). To simplify the notation we continue to write $s(\mathrm{Pa})$ for $s(X, \mathrm{Pa} : D|_{\{X\}\cup\mathrm{Pa}})$.

The following assumption expresses the fact that scoring functions decompose into two components: $g$ penalizing the complexity of a network and $d$ evaluating the possibility of explaining the data by a network.

Assumption 2 (splitting). $s(\mathrm{Pa}) = g(\mathrm{Pa}) + d(\mathrm{Pa})$ for some functions $g, d : \mathcal{P}(\mathbf{X}') \to \mathbb{R}_+$ satisfying $\mathrm{Pa} \subseteq \mathrm{Pa}' \Longrightarrow g(\mathrm{Pa}) \le g(\mathrm{Pa}')$.

This assumption is used in the following algorithm to avoid considering networks with an inadequately large component $g$.

Algorithm 1.
1. $\mathrm{Pa} := \emptyset$
2. for each $P \subseteq \mathbf{X}'$ chosen according to $g(P)$
   (a) if $s(P) < s(\mathrm{Pa})$ then $\mathrm{Pa} := P$
   (b) if $g(P) \ge s(\mathrm{Pa})$ then return $\mathrm{Pa}$; stop

In the above algorithm, choosing according to $g(P)$ means choosing in increasing order with respect to the value of the component $g$ of the local score.

Theorem 1. Suppose that the scoring function satisfies Assumptions 1-2. Then Algorithm 1 applied to each random variable finds an optimal network.

Proof. The theorem follows from the fact that all the potential parent sets $P$ omitted by the algorithm satisfy $s(P) \ge g(P) \ge s(\mathrm{Pa})$, where $\mathrm{Pa}$ is the returned set.
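To make the enumeration order and the stopping rule of Algorithm 1 concrete, here is a minimal Python sketch under Assumptions 1-2. It is only an illustration (the names are not the paper's, and a practical implementation would generate candidate sets lazily in order of increasing $g$ instead of materializing and sorting all of them):

    from itertools import combinations

    def algorithm1(candidates, g, d):
        """Return a parent set minimizing s(P) = g(P) + d(P) over subsets of
        `candidates`, assuming g is monotone with respect to set inclusion
        (Assumption 2)."""
        def s(P):
            return g(P) + d(P)

        best = frozenset()                      # step 1: Pa := empty set
        best_score = s(best)
        subsets = [frozenset(c) for r in range(len(candidates) + 1)
                   for c in combinations(candidates, r)]
        for P in sorted(subsets, key=g):        # step 2: visit P in order of g(P)
            if s(P) < best_score:               # (a) remember a better parent set
                best, best_score = P, s(P)
            if g(P) >= best_score:              # (b) every later P' has s(P') >= g(P') >= g(P)
                return best
        return best

Theorem 1 is what justifies the early return in step (b): every subset that is never scored satisfies $s(P) \ge g(P) \ge s(\mathrm{Pa})$ for the returned set $\mathrm{Pa}$.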

A disadvantage of the above algorithm is that finding a proper subset $P \subseteq \mathbf{X}'$ involves computing $g(P')$ for all $\subseteq$-successors $P'$ of previously chosen subsets. This may be avoided when a further assumption is imposed.

Assumption 3 (uniformity). $|\mathrm{Pa}| = |\mathrm{Pa}'| \Longrightarrow g(\mathrm{Pa}) = g(\mathrm{Pa}')$.

The above assumption suggests the notation $\hat g(|\mathrm{Pa}|) = g(\mathrm{Pa})$. The following algorithm uses the uniformity of $g$ to reduce the number of computations of the component $g$.

Algorithm 2.
1. $\mathrm{Pa} := \emptyset$
2. for $p = 1$ to $n$
   (a) if $\hat g(p) \ge s(\mathrm{Pa})$ then return $\mathrm{Pa}$; stop
   (b) $P := \operatorname{argmin}_{\{\mathbf{Y} \subseteq \mathbf{X}' : |\mathbf{Y}| = p\}}\, s(\mathbf{Y})$
   (c) if $s(P) < s(\mathrm{Pa})$ then $\mathrm{Pa} := P$

Theorem 2. Suppose that the scoring function satisfies Assumptions 1-3. Then Algorithm 2 applied to each random variable finds an optimal network.

Proof. The theorem follows from the fact that all the potential parent sets $P$ omitted by the algorithm satisfy $s(P) \ge \hat g(|P|) \ge s(\mathrm{Pa})$, where $\mathrm{Pa}$ is the returned set.

The following sections are devoted to two commonly used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. We consider possible decompositions of the local scores and analyze the computational cost of the above algorithms.
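Before analyzing the two scores, here is the corresponding minimal Python sketch of Algorithm 2 (again only an illustration with made-up names; $\hat g$ is passed as a function of the cardinality $p$ and is assumed non-decreasing, as Assumptions 2-3 guarantee):

    from itertools import combinations

    def algorithm2(candidates, g_hat, d):
        """Level-wise variant of Algorithm 1 under Assumptions 1-3.
        g_hat(p) is the common penalty of every parent set of size p, so the
        local score of a set Y is s(Y) = g_hat(len(Y)) + d(Y)."""
        def s(Y):
            return g_hat(len(Y)) + d(Y)

        best = frozenset()                        # step 1: Pa := empty set
        best_score = s(best)
        for p in range(1, len(candidates) + 1):   # step 2
            if g_hat(p) >= best_score:            # (a) stopping rule
                return best
            # (b) best-scoring parent set of cardinality exactly p
            P = min((frozenset(c) for c in combinations(candidates, p)), key=s)
            if s(P) < best_score:                 # (c) keep it if it improves
                best, best_score = P, s(P)
        return best

Step (b) still scans all sets of size $p$, which is where the $O(n^p)$ factor in the running time comes from; the point of the stopping rule in step (a) is that, as the next two sections show, $p$ never has to exceed $\log_k N$ for the MDL score and $N \log_{\alpha^{-1}} k$ for the BDe score, which yields the polynomial bounds in $n$.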

3 Minimal Description Length

The Minimal Description Length (MDL) scoring criterion originates from information theory [9]. A network $\mathcal{N}$ is viewed here as a model of compression of a dataset $D$. The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data.

Let us fix a dataset $D = \{\mathbf{x}_1,\dots,\mathbf{x}_N\}$ and a random variable $X$. Recall the decomposition $s(\mathrm{Pa}) = g(\mathrm{Pa}) + d(\mathrm{Pa})$ of the local score for $X$. In the MDL score, $g(\mathrm{Pa})$ stands for the length of the description of the local part of the network (i.e. the edges ingoing to $X$ and the conditional distribution $P(X \mid \mathrm{Pa})$) and $d(\mathrm{Pa})$ is the length of the compressed version of the $X$-values in $D$.

Let $k_Y$ denote the cardinality of the set $V_Y$ of possible values of the random variable $Y \in \mathbf{X}$. Thus we have

$$g(\mathrm{Pa}) = |\mathrm{Pa}| \log n + \frac{\log N}{2}\,(k_X - 1) \prod_{Y \in \mathrm{Pa}} k_Y,$$

where $\frac{\log N}{2}$ is the number of bits we use for each numeric parameter of the conditional distribution. This formula satisfies Assumption 2 but fails to satisfy Assumption 3. Therefore Algorithm 1 can be used to learn an optimal network, but Algorithm 2 cannot. However, for many applications we may assume that all the random variables attain values from the same set $V$ of cardinality $k$. In this case we obtain the formula

$$g(\mathrm{Pa}) = |\mathrm{Pa}| \log n + \frac{\log N}{2}\,(k - 1)\, k^{|\mathrm{Pa}|},$$

which satisfies Assumption 3. For simplicity, we continue to work under this assumption. The general case may be handled in much the same way.

Compression with respect to the network model is understood as follows: when encoding the $X$-values, the values of the $\mathrm{Pa}$-instances are assumed to be known. Thus the optimal encoding length is given by

$$d(\mathrm{Pa}) = N \cdot H(X \mid \mathrm{Pa}),$$

where $H(X \mid \mathrm{Pa}) = -\sum_{v \in V_{\mathrm{Pa}}} \sum_{v' \in V} P(v, v') \log P(v' \mid v)$ is the conditional entropy of $X$ given $\mathrm{Pa}$ (the distributions are estimated from $D$).

Since all the assumptions from the previous section are satisfied, Algorithm 2 may be applied to learn the optimal network. Let us turn to the analysis of its complexity.

Theorem 3. The worst-case time complexity of Algorithm 2 for the MDL score is $O(n^{\log_k N} N \log_k N)$.

Proof. The proof consists in finding a number $p$ satisfying $\hat g(p) \ge s(\emptyset)$. Given $v \in V$, we denote by $N_v$ the number of samples in $D$ with $X = v$. By Jensen's inequality we have

$$d(\emptyset) = N \cdot H(X \mid \emptyset) = N \sum_{v \in V} \frac{N_v}{N} \log \frac{N}{N_v} \le N \log \Bigl(\sum_{v \in V} 1\Bigr) = N \log k.$$

Hence $s(\emptyset) \le N \log k + (k-1)\frac{\log N}{2}$ and it suffices to find $p$ satisfying

$$p \log n + k^p (k-1)\frac{\log N}{2} \;\ge\; N \log k + (k-1)\frac{\log N}{2}.$$

It is easily seen that the above assertion holds for $p \ge \log_k N$ whenever $N > 5$. Therefore we do not have to consider subsets with more than $\log_k N$ parent variables. Since there are $O(n^p)$ subsets with at most $p$ parents and a subset with $p$ elements may be examined in $pN$ steps, the total complexity of the algorithm is $O(n^{\log_k N} N \log_k N)$.
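The two MDL components translate directly into code. Below is a sketch under the same simplifying assumption as above (every variable takes values in a common set of size $k$); the column-based data layout, the function names and the use of base-2 logarithms are illustrative choices, not the paper's implementation:

    import math
    from collections import Counter

    def mdl_g(pa_size, n, N, k):
        """Description length of the local model: |Pa| log n bits to name the
        parents plus (log N)/2 bits for each of the (k-1) * k^|Pa| parameters."""
        return pa_size * math.log2(n) + 0.5 * math.log2(N) * (k - 1) * k ** pa_size

    def mdl_d(x_column, pa_columns):
        """Compressed length of the X-values given the parent values,
        N * H(X | Pa), with the distributions estimated from the data."""
        N = len(x_column)
        pa_configs = list(zip(*pa_columns)) if pa_columns else [()] * N
        joint = Counter(zip(pa_configs, x_column))   # counts of (Pa = v, X = v')
        pa_marg = Counter(pa_configs)                # counts of Pa = v
        return -sum(c * math.log2(c / pa_marg[v]) for (v, _), c in joint.items())

    # Local MDL score of a candidate parent set Pa for variable X:
    #   s(Pa) = mdl_g(len(pa_columns), n, N, k) + mdl_d(x_column, pa_columns)

By Theorem 3, these functions never need to be evaluated on candidate parent sets with more than $\log_k N$ elements.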

4 Bayesian-Dirichlet Equivalence

The Bayesian-Dirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [10]. Given a dataset $D$, the optimal network structure $G$ maximizes the posterior conditional probability $P(G \mid D)$. We have

$$P(G \mid D) \propto P(G)\, P(D \mid G) = P(G) \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta,$$

where $P(G)$ and $P(\theta \mid G)$ are prior probability distributions on graph structures and conditional distribution parameters, respectively, and $P(D \mid G, \theta)$ is evaluated according to (1).

Heckerman et al. [6], following Cooper and Herskovits [10], identified a set of independence assumptions making possible the decomposition of the integral in the above formula into a product over $\mathbf{X}$. Under this condition, together with a similar one regarding the decomposition of $P(G)$, the scoring criterion

$$S(G : D) = -\log P(G) - \log P(D \mid G),$$

obtained by taking $-\log$ of the above term, satisfies Assumption 1. Moreover, the decomposition $s(\mathrm{Pa}) = g(\mathrm{Pa}) + d(\mathrm{Pa})$ of the local scores appears as well, with the components $g$ and $d$ derived from $-\log P(G)$ and $-\log P(D \mid G)$, respectively.

The distribution $P((\mathbf{X}, E)) \propto \alpha^{|E|}$, with a penalty parameter $0 < \alpha < 1$, is in general used as a prior over the network structures. This choice results in the function $g(|\mathrm{Pa}|) = |\mathrm{Pa}| \log \alpha^{-1}$, satisfying Assumptions 2 and 3. However, it should be noticed that priors satisfying neither Assumption 2 nor 3 are also used, e.g. $P(G) \propto \alpha^{\Delta(G, G_0)}$, where $\Delta(G, G_0)$ is the cardinality of the symmetric difference between the sets of edges in $G$ and in the prior network $G_0$.

The Dirichlet distribution is generally used as a prior over the conditional distribution parameters. It yields

$$d(\mathrm{Pa}) = -\log \prod_{v \in V_{\mathrm{Pa}}} \left( \frac{\Gamma\bigl(\sum_{v' \in V} H_{v,v'}\bigr)}{\Gamma\bigl(\sum_{v' \in V} (H_{v,v'} + N_{v,v'})\bigr)} \prod_{v' \in V} \frac{\Gamma(H_{v,v'} + N_{v,v'})}{\Gamma(H_{v,v'})} \right),$$

where $\Gamma$ is the Gamma function, $N_{v,v'}$ denotes the number of samples in $D$ with $X = v'$ and $\mathrm{Pa} = v$, and $H_{v,v'}$ is the corresponding hyperparameter of the Dirichlet distribution.

Setting all the hyperparameters to 1 yields

$$d(\mathrm{Pa}) = \log \prod_{v \in V_{\mathrm{Pa}}} \frac{\bigl(k - 1 + \sum_{v' \in V} N_{v,v'}\bigr)!}{(k-1)!\, \prod_{v' \in V} N_{v,v'}!} = \sum_{v \in V_{\mathrm{Pa}}} \Bigl( \log\bigl(k - 1 + \textstyle\sum_{v' \in V} N_{v,v'}\bigr)! - \log(k-1)! - \sum_{v' \in V} \log N_{v,v'}! \Bigr),$$

where $k = |V|$. For simplicity, we continue to work under this assumption. The general case may be handled in a similar way.
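The all-hyperparameters-equal-to-1 form of $d(\mathrm{Pa})$ is convenient to evaluate through log-factorials. A sketch with illustrative names (math.lgamma(m + 1) computes $\log m!$ in natural logarithms, so the penalty $\log \alpha^{-1}$ passed to bde_g below must be expressed in the same base for $g$ and $d$ to be comparable):

    import math
    from collections import Counter

    def log_fact(m):
        return math.lgamma(m + 1)                    # natural log of m!

    def bde_d(x_column, pa_columns, k):
        """BDe data component with all Dirichlet hyperparameters set to 1:
        sum over parent configurations v of
        log(k-1+N_v)! - log(k-1)! - sum_{v'} log N_{v,v'}!"""
        N = len(x_column)
        pa_configs = list(zip(*pa_columns)) if pa_columns else [()] * N
        joint = Counter(zip(pa_configs, x_column))   # N_{v,v'}
        pa_marg = Counter(pa_configs)                # N_v = sum_{v'} N_{v,v'}
        score = 0.0
        for v, n_v in pa_marg.items():
            score += log_fact(k - 1 + n_v) - log_fact(k - 1)
            score -= sum(log_fact(c) for (u, _), c in joint.items() if u == v)
        return score

    def bde_g(pa_size, log_alpha_inv):
        """Structure component coming from the prior P(G) proportional to alpha^|E|."""
        return pa_size * log_alpha_inv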

The following result allows us to refine the decomposition of the local score into the sum of components $g'$ and $d'$.

Proposition 1. Define $d_{\min} = \sum_{v \in V} \bigl( \log(k-1+N_v)! - \log(k-1)! - \log N_v! \bigr)$, where $N_v$ denotes the number of samples in $D$ with $X = v$. Then $d(\mathrm{Pa}) \ge d_{\min}$ for each $\mathrm{Pa} \subseteq \mathbf{X}'$.

Proof. Fix $\mathrm{Pa} \subseteq \mathbf{X}'$. Given $v \in V_{\mathrm{Pa}}$, we denote by $N_v$ the number of samples in $D$ with $\mathrm{Pa} = v$. We have

$$\begin{aligned}
d(\mathrm{Pa}) &= \sum_{v \in V_{\mathrm{Pa}}} \Bigl( \log(k-1+N_v)! - \log(k-1)! - \sum_{v' \in V} \log N_{v,v'}! \Bigr) \\
&= \sum_{v \in V_{\mathrm{Pa}}} \Bigl( \sum_{i=1}^{N_v} \log(k-1+i) - \sum_{v' \in V} \sum_{i=1}^{N_{v,v'}} \log i \Bigr) \\
&\ge \sum_{v \in V_{\mathrm{Pa}}} \sum_{v' \in V} \Bigl( \sum_{i=1}^{N_{v,v'}} \log(k-1+i) - \sum_{i=1}^{N_{v,v'}} \log i \Bigr) \\
&= \sum_{v' \in V} \sum_{v \in V_{\mathrm{Pa}}} \sum_{i=1}^{N_{v,v'}} \log\Bigl(\frac{k-1}{i} + 1\Bigr) \\
&\ge \sum_{v' \in V} \sum_{i=1}^{N_{v'}} \log\Bigl(\frac{k-1}{i} + 1\Bigr) \\
&= \sum_{v' \in V} \bigl( \log(k-1+N_{v'})! - \log(k-1)! - \log N_{v'}! \bigr) = d_{\min},
\end{aligned}$$

where in the last two lines $N_{v'}$ denotes, as in the statement of the proposition, the number of samples in $D$ with $X = v'$. The first inequality follows from the fact that, given $N_v$, the sum $\sum_{v' \in V} \sum_{i=1}^{N_{v,v'}} \log(k-1+i)$ is maximized when $N_v = N_{v,v'}$ for some $v' \in V$. Similarly, the second inequality holds because, given $N_{v'}$, the sum $\sum_{v \in V_{\mathrm{Pa}}} \sum_{i=1}^{N_{v,v'}} \log\bigl(\frac{k-1}{i} + 1\bigr)$ is minimized when $N_{v'} = N_{v,v'}$ for some $v \in V_{\mathrm{Pa}}$.

By the above proposition, the decomposition of the local score given by $s(\mathrm{Pa}) = g'(\mathrm{Pa}) + d'(\mathrm{Pa})$, with the components $g'(\mathrm{Pa}) = g(\mathrm{Pa}) + d_{\min}$ and $d'(\mathrm{Pa}) = d(\mathrm{Pa}) - d_{\min}$, satisfies all the assumptions required by Algorithm 2.
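A quick numeric sanity check of Proposition 1 on made-up counts (nothing below comes from the paper's data; natural logarithms are used, which rescales both sides by the same factor):

    import math

    def log_fact(m):
        return math.lgamma(m + 1)                    # natural log of m!

    # Toy counts for k = 2 values of X and two parent configurations:
    # N_joint[v][v'] = number of samples with Pa = v and X = v'.
    k = 2
    N_joint = {"pa0": {"x0": 3, "x1": 1},
               "pa1": {"x0": 0, "x1": 4}}

    # d(Pa) = sum over v of ( log(k-1+N_v)! - log(k-1)! - sum_{v'} log N_{v,v'}! )
    d_pa = sum(log_fact(k - 1 + sum(row.values())) - log_fact(k - 1)
               - sum(log_fact(c) for c in row.values())
               for row in N_joint.values())

    # d_min is the same expression computed from the X-value counts alone.
    N_x = {"x0": 3, "x1": 5}                         # column sums of N_joint
    d_min = sum(log_fact(k - 1 + c) - log_fact(k - 1) - log_fact(c)
                for c in N_x.values())

    print(round(d_pa, 3), round(d_min, 3), d_pa >= d_min)   # 4.605 3.178 True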

Let us turn to the analysis of its complexity.

Theorem 4. The worst-case time complexity of Algorithm 2 for the BDe score with the decomposition of the local score given by $s(\mathrm{Pa}) = g'(\mathrm{Pa}) + d'(\mathrm{Pa})$ is $O(n^{N \log_{\alpha^{-1}} k} N^2 \log_{\alpha^{-1}} k)$.

Proof. The proof consists in finding a number $p$ satisfying $\hat g'(p) \ge s(\emptyset)$. We have

$$\begin{aligned}
d'(\emptyset) &= d(\emptyset) - d_{\min} \\
&= \Bigl( \log\frac{(k-1+N)!}{(k-1)!} - \sum_{v \in V} \log N_v! \Bigr) - \sum_{v \in V} \Bigl( \log\frac{(k-1+N_v)!}{(k-1)!} - \log N_v! \Bigr) \\
&= \bigl( \log(k-1+N)! - \log(k-1)! \bigr) - \sum_{v \in V} \bigl( \log(k-1+N_v)! - \log(k-1)! \bigr) \\
&= \sum_{i=k}^{k-1+N} \log i - \sum_{v \in V} \sum_{i=k}^{k-1+N_v} \log i \\
&\le \sum_{i=k}^{k-1+N} \log i - k \sum_{i=k}^{k-1+\lfloor N/k \rfloor} \log i - \bigl(N - k\lfloor N/k \rfloor\bigr) \log\bigl(k + \lfloor N/k \rfloor\bigr) \\
&\le \log e \left( \int_k^{k+N} \ln x\, dx - k \int_{k-1}^{k-1+N/k} \ln x\, dx \right) \\
&= \log e \Bigl( (k+N)\bigl(\ln(k+N) - 1\bigr) - k(\ln k - 1) - k\Bigl( \bigl(k-1+\tfrac{N}{k}\bigr)\bigl(\ln\bigl(k-1+\tfrac{N}{k}\bigr) - 1\bigr) - (k-1)\bigl(\ln(k-1) - 1\bigr) \Bigr) \Bigr) \\
&= N \log\frac{k+N}{k-1+N/k} + \log\frac{(1 + N/k)^k}{\bigl(1 + \frac{N}{k^2 - k}\bigr)^{k^2 - k}} \\
&\le N \log k.
\end{aligned}$$

The first inequality follows from the fact that the sum $\sum_{v \in V} \bigl( \log(k-1+N_v)! - \log(k-1)! \bigr)$ is minimized when the values $N_v$ are uniformly distributed over $v$; the second one bounds the sums by integrals of $\ln x$. The last inequality is due to the fact that $k \ge 2$, and consequently $k^2 - k \ge k$.

Thus $s(\emptyset) \le N \log k + d_{\min}$, so we need $p$ satisfying $p \log \alpha^{-1} \ge N \log k$, i.e. $p \ge N \log_{\alpha^{-1}} k$. Therefore we do not have to consider subsets with more than $N \log_{\alpha^{-1}} k$ parent variables. Since there are $O(n^p)$ subsets with at most $p$ parents and a subset with $p$ elements may be examined in $pN$ steps, the total complexity of the algorithm is $O(n^{N \log_{\alpha^{-1}} k} N^2 \log_{\alpha^{-1}} k)$.
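As a purely illustrative instance of the bound obtained in the proof (the numbers are chosen here for concreteness and are not taken from the paper's experiments): with $k = 2$ values per variable, $N = 30$ observations and a structure prior with $\alpha = 1/8$, so that $\log_2 \alpha^{-1} = 3$,

$$p \;\ge\; N \log_{\alpha^{-1}} k \;=\; \frac{N \log k}{\log \alpha^{-1}} \;=\; \frac{30 \cdot 1}{3} \;=\; 10,$$

so Algorithm 2 never has to score parent sets with more than 10 elements, while strengthening the penalty to $\alpha = 1/64$ (i.e. $\log_2 \alpha^{-1} = 6$) halves this bound to 5. The ratio $\log k / \log \alpha^{-1}$ does not depend on the base of the logarithm, provided $g$ and $d$ use the same one; since the bound enters the running time through the exponent of $n$, the choice of $\alpha$ strongly affects feasibility, as the experiments reported next illustrate.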

5 Application to Biological Data

Microarray experiments simultaneously measure the expression levels of thousands of genes. Learning Bayesian networks from microarray data aims at discovering gene regulatory dependencies.

We tested the time complexity of learning an optimal dynamic BN on microarray experiment data of Arabidopsis thaliana (about 23 000 genes). There were 20 samples of time series data, discretized into 3 expression levels. We ran our program on a single Pentium 4 2.8 GHz CPU. The computation time was 48 hours for the MDL score and 170 hours for the BDe score with the parameter $\log \alpha^{-1} = 5$. The program failed to terminate in a reasonable time for the BDe score with the parameter $\log \alpha^{-1} = 3$, i.e. $\alpha \approx 0.05$.

6 Conclusion

The aim of this work was to provide an effective algorithm for learning an optimal Bayesian network from data, applicable to microarray expression measurements. We have proposed a method addressed to the case when a dataset is small and there is no need to examine the acyclicity of the graph. We have provided polynomial bounds (with respect to the number of random variables) on the time complexity of our algorithm for two commonly used scoring criteria: Minimal Description Length and Bayesian-Dirichlet equivalence. An application to real biological data shows that our algorithm handles large floral genomes in a reasonable time for the MDL score but fails to terminate for the BDe score with a sensible value of the penalty parameter.

Acknowledgements

The author is grateful to Jerzy Tiuryn for comments on previous versions of this work and useful discussions related to the topic. I also thank Bartek Wilczyński for implementing and running our learning algorithm. This research was funded by the Polish State Committee for Scientific Research under grant no. 3T11F018.

References

1. Chickering, D.M.: Learning Bayesian networks is NP-complete. In Fisher, D., Lenz, H.J., eds.: Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag (1996)
2. Chickering, D.M., Heckerman, D., Meek, C.: Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research 5 (2004) 1287-1330
3. Dasgupta, S.: Learning polytrees. In: Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), San Francisco, CA, Morgan Kaufmann Publishers (1999) 134-141
4. Meek, C.: Finding a path is harder than finding a tree. J. Artificial Intelligence Res. 15 (2001) 383-389 (electronic)
5. Ott, S., Imoto, S., Miyano, S.: Finding optimal models for small gene networks. Pac. Symp. Biocomput. (2004) 557-567
6. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20(3) (1995) 197-243

7. Neapolitan, R.E.: Learning Bayesian Networks. Prentice Hall (2003)
8. Friedman, N., Murphy, K., Russell, S.: Learning the structure of dynamic probabilistic networks. In Cooper, G.F., Moral, S., eds.: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. (1998) 139-147
9. Lam, W., Bacchus, F.: Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence 10(3) (1994) 269-293
10. Cooper, G.F., Herskovits, E.: A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9 (1992) 309-347