
COMP538: Introduction to Bayesian Networks
Lecture 9: Optimal Structure Learning

Nevin L. Zhang
lzhang@cse.ust.hk
Department of Computer Science and Engineering
Hong Kong University of Science and Technology

Spring 2007

Introduction

A good structure learning algorithm should, among other things, discover the truth provided there is sufficient data.

Formulation of this intuition: suppose sufficient data are sampled from a true BN model. A good learning algorithm should be able to reconstruct the true model from the data.

Objective of this lecture: assume (complete) data generated by a BN, and discuss when and how the generating model can be reconstructed from the data.

References

Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. In Proc. 11th Conf. on Uncertainty in Artificial Intelligence, 87-98.
Chickering, D. M. (2002). Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research, 2:445-498.
Chickering, D. M. (2002b). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507-554.
Kocka, T. and Castelo, R. (2001). Improved learning of Bayesian networks. In Proc. 17th Conf. on Uncertainty in Artificial Intelligence, 269-276.
Meek, C. (1997). Graphical models: Selecting causal and statistical models. PhD thesis, Carnegie Mellon University.

Outline

1 Model Equivalence
  Conditions for Model Equivalence
  Representing the Equivalence Class of Models
  Model Equivalence and Scoring Functions
2 Model Inclusion
  Model Inclusion and Scoring Functions
3 Optimality Conditions
4 Greedy Equivalence Search (GES)

Model Equivalence

BNs represent joint probabilities. Two different BNs are equivalent if they represent the same joint probability.

Equivalence of BN structures: let S and S' be two BN models (DAG structures) over variables V. We say S and S' are equivalent if for any parameterization θ of S, there exists a parameterization θ' of S' such that P(V | S, θ) = P(V | S', θ'), and vice versa.

In words, S can represent any joint distribution that S' can, and vice versa.

Model Equivalence

Examples:
- X → Y and X ← Y are equivalent. (Show this.)
- In a DAG, if we drop the directions of all edges, we get its skeleton. Trees with the same skeleton are equivalent.

Equivalent models have the same maximized likelihood.
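A one-line verification of the first example (a derivation added here, not on the slide): any parameterization of X → Y can be converted, via Bayes' rule, into a parameterization of X ← Y with the same joint, and vice versa.

```latex
% Factorizations of the same joint for X -> Y and X <- Y:
P(X,Y) = P(X)\,P(Y \mid X) = P(Y)\,P(X \mid Y),
\qquad\text{where}\quad
P(Y) = \sum_{x} P(x)\,P(Y \mid x),
\qquad
P(X \mid Y) = \frac{P(X)\,P(Y \mid X)}{P(Y)}.
```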

Model Equivalence and the Markov Property

Theorem (9.1) (Meek 1995). Two BN models S and S' are equivalent iff they imply, by the global Markov property, the same set of conditional independencies.

Model Equivalence and V-Structures

In a DAG, a v-structure is a local pattern X → Z ← Y such that X and Y are not adjacent.

[Figure: two equivalent DAGs over the variables A, S, T, L, B, X, R, D.]

Theorem (9.2) (Verma and Pearl 1991). Two BN models are equivalent iff they have the same skeleton and the same v-structures.
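The theorem gives a direct computational test for equivalence. Below is a minimal illustrative sketch (not part of the lecture; the dictionary-of-parents representation and variable names are assumptions) that compares the skeletons and v-structures of two DAGs.

```python
# Theorem 9.2 as a test: two DAGs are equivalent iff they share the same
# skeleton and the same v-structures. DAGs are given as {node: set of parents}.

def skeleton(dag):
    """Set of undirected edges, each represented as a frozenset {X, Y}."""
    return {frozenset((child, parent))
            for child, parents in dag.items() for parent in parents}

def v_structures(dag):
    """Triples (X, Z, Y) with X -> Z <- Y and X, Y not adjacent."""
    skel = skeleton(dag)
    vs = set()
    for z, parents in dag.items():
        for x in parents:
            for y in parents:
                if x < y and frozenset((x, y)) not in skel:
                    vs.add((x, z, y))
    return vs

def equivalent(dag1, dag2):
    return (skeleton(dag1) == skeleton(dag2)
            and v_structures(dag1) == v_structures(dag2))

# X -> Y -> Z vs. X <- Y -> Z: same skeleton, no v-structures, so equivalent.
chain    = {'X': set(), 'Y': {'X'}, 'Z': {'Y'}}
fork     = {'X': {'Y'}, 'Y': set(), 'Z': {'Y'}}
collider = {'X': set(), 'Y': {'X', 'Z'}, 'Z': set()}
print(equivalent(chain, fork))      # True
print(equivalent(chain, collider))  # False: collider has the v-structure X -> Y <- Z
```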

Model Equivalence and Arc Reversal

[Figure: two equivalent DAGs over the variables A, S, T, L, B, X, R, D.]

In a DAG, an arc X → Y is covered if pa(Y) = pa(X) ∪ {X}.

Theorem (9.3) (Chickering 1995). Two BN models are equivalent iff there exists a sequence of covered arc reversals which converts one into the other.

Example: the arcs A → T and S → L are covered and can be reversed; there are several other equivalent models. The arcs T → R, L → R, R → D, B → D cannot be reversed because they are not covered.
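A small sketch of the covered-arc test and of reversing a covered arc (illustrative only; the parent-map representation and the reconstructed arc directions of the slide's example network are assumptions, not taken verbatim from the lecture).

```python
# X -> Y is covered iff pa(Y) = pa(X) ∪ {X}.  Reversing a covered arc keeps the
# graph acyclic and yields an equivalent model (Theorem 9.3).

def is_covered(dag, x, y):
    """dag maps each node to its set of parents; tests the arc x -> y."""
    assert x in dag[y], f"{x} -> {y} is not an arc of the DAG"
    return dag[y] == dag[x] | {x}

def reverse_arc(dag, x, y):
    """Return a new parent map with x -> y replaced by y -> x."""
    new = {node: set(parents) for node, parents in dag.items()}
    new[y].discard(x)
    new[x].add(y)
    return new

# Reconstructed example network: A -> T is covered (pa(T) = {A} = pa(A) ∪ {A}),
# but T -> R is not, because pa(R) = {T, L} while pa(T) ∪ {T} = {A, T}.
example = {'A': set(), 'S': set(), 'T': {'A'}, 'L': {'S'}, 'B': {'S'},
           'R': {'T', 'L'}, 'X': {'R'}, 'D': {'R', 'B'}}
print(is_covered(example, 'A', 'T'))  # True
print(is_covered(example, 'T', 'R'))  # False
```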

Representing the Equivalence Class of Models: PDAGs

cl(S) denotes the class of all BN models equivalent to S.

One way to represent the class (by Theorem 9.2): the skeleton plus the v-structures. The result is a graph consisting of undirected as well as directed edges, called an acyclic partially directed graph (PDAG).

[Figure: a DAG over A, S, T, L, B, X, R, D and PDAGs representing its equivalence class.]

Compelled Edges

A directed edge in a BN structure S is compelled if it is in all structures equivalent to S.

[Figure: example DAG over A, S, T, L, B, X, R, D.]

By Theorem 9.2, all edges participating in v-structures are compelled. Some other edges might also be compelled. Example: R → X. Any model with R ← X is not equivalent to the structure shown. (Why?)

Essential Graphs

The essential graph of a BN structure S is a PDAG
- whose skeleton is the same as that of S, and
- in which the compelled edges, and only those edges, are directed.

It is also called a DAG pattern (Spirtes et al. 1993), a completed PDAG (Chickering 2002), and a maximally oriented graph (Chickering 1995). It is used to represent the equivalence class during learning.

[Figure: a DAG over A, S, T, L, B, X, R, D and its essential graph.]

The essential graph of a tree-structured BN model is its skeleton.

Computing Essential Graphs

Algorithm for computing the essential graph of a BN structure S:
1 Compute the skeleton of S and orient only the edges participating in v-structures.
2 Orient compelled edges (note: we must not create additional v-structures). While more edges can be oriented:
  (a) For each X → Z — Y such that X and Y are not adjacent, orient Z — Y as Z → Y.
  (b) For each X — Y such that there is a directed path from X to Y, orient X — Y as X → Y.
  (c) For each X — Z — Y such that X and Y are not adjacent, X → W, Y → W, and Z — W, orient Z — W as Z → W.

Understanding the Rules

Rule (a): if we oriented Z ← Y, we would have an additional v-structure X → Z ← Y.
Rule (b): if we oriented X ← Y, we would have a directed cycle.
Rule (c): the situation is X — Z — Y with X → W, Y → W, and Z — W. If we oriented Z ← W, we would need X → Z and Y → Z to avoid directed cycles, but this would create an additional v-structure X → Z ← Y.
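Putting the rules together, here is an illustrative sketch of Step 2 (not the lecture's code; the graph representation and helper names are assumptions): starting from a PDAG in which only v-structure edges are directed, it repeatedly applies rules (a)-(c) until no further undirected edge can be oriented.

```python
# `directed` holds (X, Y) pairs meaning X -> Y; `undirected` holds frozensets {X, Y}.

def orient_compelled(directed, undirected):
    directed, undirected = set(directed), set(undirected)

    def adjacent(a, b):
        return (a, b) in directed or (b, a) in directed or frozenset((a, b)) in undirected

    def has_directed_path(src, dst):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(b for (a, b) in directed if a == node)
        return False

    def forced(u, v):
        """True if the undirected edge {u, v} must be oriented u -> v."""
        # Rule (a): some X -> u directed, with X and v not adjacent.
        if any(b == u and not adjacent(a, v) for (a, b) in directed):
            return True
        # Rule (b): there is already a directed path from u to v.
        if has_directed_path(u, v):
            return True
        # Rule (c): u - X and u - Y undirected, X -> v and Y -> v, X and Y not adjacent.
        neighbours = [n for e in undirected if u in e for n in e if n != u]
        for x in neighbours:
            for y in neighbours:
                if x != y and not adjacent(x, y) and (x, v) in directed and (y, v) in directed:
                    return True
        return False

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            u, v = tuple(edge)
            for a, b in ((u, v), (v, u)):
                if edge in undirected and forced(a, b):
                    undirected.remove(edge)
                    directed.add((a, b))
                    changed = True
    return directed, undirected

# With the v-structure T -> R <- L already oriented and R - X undirected,
# rule (a) compels R -> X, because T and X are not adjacent.
d, u = orient_compelled({('T', 'R'), ('L', 'R')}, {frozenset(('R', 'X'))})
print(('R', 'X') in d)   # True
```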

Theorem (9.4) (Meek 1995). Step 2 of the algorithm is sound and complete:
- (Soundness) The edges oriented by the algorithm are compelled edges, and they are oriented correctly.
- (Completeness) The algorithm orients all compelled edges.

Notes: the algorithm is important if we want to search with essential graphs (Chickering 2002), which we do not do in this class. However, we will use this algorithm and these results in the next lecture.

Equivalence-Invariant Scoring Functions

A scoring function is equivalence invariant if it gives the same score to equivalent models. Such functions are sometimes also said to be likelihood equivalent or score equivalent.

Equivalence-invariant scoring functions can be used to score equivalence classes; other scoring functions cannot.

The BIC Score

The BIC score is equivalence invariant. Recall:

BIC(S | D) = log P(D | S, θ*) − (d/2) log N

By the definition of equivalence, equivalent models have the same maximized likelihood. By Theorem 9.3, equivalent models have the same complexity: a covered arc reversal does not change the number of parameters d (prove this).

The marginal likelihood (i.e., the CH) score is equivalence invariant if one sets the parameter priors properly (Heckerman et al. 1994).

Footnote: Heckerman, D., Geiger, D. and Chickering, D. M. (1994). Learning Bayesian networks: The combination of knowledge and statistical data. Proc. 10th Conf. on Uncertainty in Artificial Intelligence, 293-301.
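As a concrete illustration (a minimal sketch under assumed data structures, not the lecture's code), the BIC score of a discrete BN can be computed from counts; running it on two equivalent structures shows the equivalence invariance numerically.

```python
# BIC(S | D) = log P(D | S, theta*) - (d/2) log N, where the maximized
# log-likelihood is a sum of count * log(conditional MLE) terms.

import math
from collections import Counter

def bic_score(dag, data, cardinalities):
    """dag: {node: tuple of parents}; data: list of dicts; cardinalities: {node: #values}."""
    n = len(data)
    loglik, dim = 0.0, 0
    for node, parents in dag.items():
        joint = Counter(tuple(row[v] for v in (node,) + tuple(parents)) for row in data)
        margin = Counter(tuple(row[v] for v in parents) for row in data)
        # Maximized log-likelihood contribution of this family.
        for config, count in joint.items():
            loglik += count * math.log(count / margin[config[1:]])
        # Number of free parameters for this family.
        q = 1
        for p in parents:
            q *= cardinalities[p]
        dim += (cardinalities[node] - 1) * q
    return loglik - 0.5 * dim * math.log(n)

# Equivalent structures get the same BIC score (equivalence invariance):
data = [{'X': 0, 'Y': 0}, {'X': 0, 'Y': 1}, {'X': 1, 'Y': 1}, {'X': 1, 'Y': 1}]
cards = {'X': 2, 'Y': 2}
print(bic_score({'X': (), 'Y': ('X',)}, data, cards))   # X -> Y
print(bic_score({'Y': (), 'X': ('Y',)}, data, cards))   # X <- Y, same score
```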

Outline

1 Model Equivalence
  Conditions for Model Equivalence
  Representing the Equivalence Class of Models
  Model Equivalence and Scoring Functions
2 Model Inclusion
  Model Inclusion and Scoring Functions
3 Optimality Conditions
4 Greedy Equivalence Search (GES)

Model Inclusion

BN model S includes model S' if all conditional independence statements valid in S are valid in S' as well.

Example: consider models over X, Y, and Z:
- Left: X → Y, Z (with Z disconnected)
- Right: X → Y → Z

All conditional independencies valid in the model on the right are true in the model on the left. The model on the right includes the model on the left.

Model Inclusion and Equivalence

S and S' are equivalent iff S includes S' and S' includes S.

We say that S strictly includes S' if S includes S' but S' does not include S.

Exercise: show that X → Y → Z does not include X → Y ← Z, and that X → Y ← Z does not include X → Y → Z.

Model Inclusion

Theorem (9.5) (Chickering 2002b). A BN structure S includes S' iff there exists a sequence of covered arc reversals and arc additions which converts S' into S.

Example: the model X → Y → Z includes the model X → Y, Z.
- Covered arc reversal: X → Y, Z becomes X ← Y, Z.
- Arc addition: X ← Y, Z becomes X ← Y → Z (which is equivalent to X → Y → Z).

Corollary (9.1). A BN structure S includes S' iff S can represent any joint distribution that S' can.

Inclusion Boundary

The lower inclusion boundary IB−(S) of model S consists of all models S' such that
- S strictly includes S', and
- there is no model S'' such that S strictly includes S'' and S'' strictly includes S'.

The upper inclusion boundary IB+(S) of model S consists of all models S' such that
- S' strictly includes S, and
- there is no model S'' such that S' strictly includes S'' and S'' strictly includes S.

The inclusion boundary is IB(S) = IB−(S) ∪ IB+(S).

Inclusion Boundary

According to Theorem 9.5:
- IB+(S) consists of all models that can be obtained from S via a series of covered arc reversals, the addition of a single arc, and another series of covered arc reversals.
- IB−(S) consists of all models that can be obtained from S via a series of covered arc reversals, the removal of a single arc, and another series of covered arc reversals.

Example of IB

The next slide shows all DAGs over three variables and depicts the equivalence and inclusion relations: the DAGs are grouped into equivalence classes, and two equivalence classes are adjacent if one can be obtained from the other by a single arc addition or deletion.

[Figure: all DAGs over three variables, grouped into equivalence classes; adjacent classes differ by a single arc addition or deletion.]

Useless Edges

A fact: let X and Y be two non-adjacent nodes in a DAG. It is possible to add an edge between X and Y, either X → Y or X ← Y, without creating a directed cycle. (Exercise: prove this.)

Consider a joint probability distribution P and a DAG S. Suppose adding an edge X → Y to S does not induce cycles. We say that adding the edge to S is useless w.r.t. P if X ⊥_P Y | pa_S(Y). Otherwise we say that adding the edge to S is useful w.r.t. P.

Locally Consistent Scoring Functions

Now let P be the joint probability distribution from which the data D were sampled. A scoring function is locally consistent if
- adding an edge that is useful w.r.t. P to a model increases its score, and
- adding an edge that is useless w.r.t. P to a model decreases its score.

BIC is Locally Consistent

The BIC score is locally consistent when the sample size is sufficiently large. Recall:

BIC(S | D) = log P(D | S, θ*) − (d/2) log N = −N Σ_i H_P̂(X_i | pa_S(X_i)) − (d/2) log N

where P̂ is the empirical distribution.

Without loss of generality, suppose X2 ∉ pa_S(X1) and consider adding the edge X2 → X1 to S.

If adding X2 → X1 to S is useful w.r.t. P, it must also be useful w.r.t. P̂ when the sample is large enough. Hence

H_P̂(X1 | pa_S(X1)) > H_P̂(X1 | pa_S(X1), X2).

Therefore adding the arc to S increases the score when N is large: the difference in the first term grows faster than the difference in the second term.

BIC is Locally Consistent (continued from previous slide)

If adding X2 → X1 to S is useless w.r.t. P, it must be, when the sample is large enough, useless w.r.t. P̂ except for some random noise. Hence

H_P̂(X1 | pa_S(X1)) ≈ H_P̂(X1 | pa_S(X1), X2).

Therefore adding the arc to S decreases the score when N is large: there is (almost) no difference in the first term, but d in the second term becomes larger.

The Bayesian score and the marginal likelihood (CH) score are also locally consistent when the sample size is sufficiently large: asymptotically, they behave the same as BIC.
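A small numeric illustration of the argument (an assumed synthetic setup, not from the slides): the log-likelihood gain from a useful edge is N · (H(X1) − H(X1 | X2)), which grows linearly in N, while the extra BIC penalty grows only logarithmically.

```python
import math, random
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())

def cond_entropy(pairs):
    """Empirical H(X1 | X2) from a list of (x1, x2) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    margin = Counter(x2 for _, x2 in pairs)
    return -sum(c / n * math.log(c / margin[x2]) for (x1, x2), c in joint.items())

random.seed(0)
for n in (100, 1000, 10000):
    # X2 ~ Bernoulli(0.5); X1 copies X2 with probability 0.8, so the edge X2 -> X1 is useful.
    pairs = []
    for _ in range(n):
        x2 = random.random() < 0.5
        x1 = x2 if random.random() < 0.8 else (not x2)
        pairs.append((x1, x2))
    gain = n * (entropy([x1 for x1, _ in pairs]) - cond_entropy(pairs))
    penalty = 0.5 * math.log(n)   # one extra free parameter when binary X2 becomes a parent of binary X1
    print(n, round(gain, 1), round(penalty, 1))
# The likelihood gain grows roughly linearly in N, the penalty only logarithmically,
# so for large enough N the model with the useful edge scores higher.
```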

Outline

1 Model Equivalence
  Conditions for Model Equivalence
  Representing the Equivalence Class of Models
  Model Equivalence and Scoring Functions
2 Model Inclusion
  Model Inclusion and Scoring Functions
3 Optimality Conditions
4 Greedy Equivalence Search (GES)

Optimality Conditions

If a DAG S is a perfect map of a joint probability distribution P, we say that P is faithful to S.

Theorem (9.6) (Castelo and Kocka 2002, Chickering 2002b). Consider a hill-climbing algorithm Alg that uses scoring function f and is based on data D. Suppose:
1 D was sampled from a distribution P that is faithful to a BN model S, and the sample size is sufficiently large;
2 the scoring function f is equivalence invariant and locally consistent;
3 the models that Alg examines at each step include
  - all models in the lower inclusion boundary of the current model, and
  - all models that can be obtained from the current model by adding a single arc.
Then Alg will reach a model in the equivalence class cl(S) and stop there.

Some Notes

The conditions are strong; they are always violated in practice.

Compared to the straightforward hill-climbing neighborhood, we evaluate some neighbors less and some neighbors more: we do not evaluate arc reversals, but we have to consider not only arc removals but the whole lower inclusion boundary.

In practice, one usually uses the whole inclusion boundary, and one might even use arc reversals. Empirical evaluations of these possibilities are yet to be done.

Proof of Theorem 9.6

Claim 1: under the conditions of the theorem, if a model S' is not equivalent to S, then there must exist another model S'' such that
- S'' can be obtained from S' by adding a single arc, or S'' is in the lower inclusion boundary of S', and
- f(S'') > f(S').

Proof of Theorem 9.6 (continued)

Claim 1 and the third condition of the theorem imply that if the current model is not equivalent to S, Alg will always find a model that is strictly better than the current model.

Since there are only finitely many possible models, Alg will eventually reach a model and stop there. The final model must be in the equivalence class cl(S); otherwise, Alg would continue, according to Claim 1.

Proof of Claim 1

Two cases: (1) S' includes S; (2) S' does not include S.

Case 1: S' includes S. By Theorem 9.5, we can reach S from S' by a series of covered arc reversals and arc deletions. Let S̃ and S'' be the models we get before and after the FIRST arc removal, respectively. Evidently, S'' is in the lower inclusion boundary of S̃, and hence also of S' (which is equivalent to S̃).

Because f is equivalence invariant, we have f(S') = f(S̃). It is also clear that the arc removed from S̃ is useless w.r.t. P. Because the scoring function is locally consistent and D is sufficiently large,

f(S'') > f(S̃) = f(S').

Hence Claim 1 is true in this case.

Proof of Claim 1 (continued)

Case 2: S' does not include S. Then there must exist two nodes X and Y such that
1 X and Y are not adjacent in S';
2 X and Y are not d-separated by pa_{S'}(Y) in S;
3 adding the arc X → Y to S' does not induce directed cycles. (Let S'' be the resulting model.) (Exercise: prove this.)

Because P is faithful to S, the second property implies that, under P, X and Y are not conditionally independent given pa_{S'}(Y). Hence adding the arc X → Y to S' is useful w.r.t. P. Because f is locally consistent,

f(S'') > f(S').

Hence Claim 1 is also true in this case. Claim 1 is proved.

Some Notes

In Theorem 9.6, the requirement on Alg can be relaxed as follows: at each step, Alg finds a model better than the current model if such models exist.

This relaxation allows one to consider stochastic hill-climbing (Kocka and Castelo 2002), which is more efficient than the standard hill-climber.

Outline

1 Model Equivalence
  Conditions for Model Equivalence
  Representing the Equivalence Class of Models
  Model Equivalence and Scoring Functions
2 Model Inclusion
  Model Inclusion and Scoring Functions
3 Optimality Conditions
4 Greedy Equivalence Search (GES)

Greedy Equivalence Search (GES)

Proposed by Meek (1997). Start with the empty model, i.e., the model with no arcs.

Phase I: repeat the following until a local maximum is reached. Examine all models in the UPPER inclusion boundary of the current model and pick the one with the best score.

Phase II: repeat the following until a local maximum is reached. Examine all models in the LOWER inclusion boundary of the current model and pick the one with the best score.
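The sketch below is a heavily simplified illustration of the two-phase idea (a toy version built on assumptions, not Chickering's algorithm): it searches directly over DAGs and replaces the upper and lower inclusion-boundary moves with their simplest special cases, single arc additions and deletions, scored by BIC. Full GES operates on essential graphs and also exploits covered arc reversals.

```python
import math
from collections import Counter
from itertools import permutations

def bic(dag, data, cards):
    """BIC score of a discrete BN; dag maps each node to a tuple of parents."""
    n, loglik, dim = len(data), 0.0, 0
    for node, parents in dag.items():
        joint = Counter(tuple(r[v] for v in (node, *parents)) for r in data)
        margin = Counter(tuple(r[v] for v in parents) for r in data)
        loglik += sum(c * math.log(c / margin[k[1:]]) for k, c in joint.items())
        dim += (cards[node] - 1) * math.prod(cards[p] for p in parents)
    return loglik - 0.5 * dim * math.log(n)

def creates_cycle(dag, x, y):
    """Would adding x -> y create a directed cycle? (Is y an ancestor of x?)"""
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(dag[node])          # walk to the parents (ancestors of x)
    return False

def ges_sketch(variables, data, cards):
    dag = {v: () for v in variables}                      # start with the empty model
    # Phase I: greedily add the single arc that improves the score the most.
    improved = True
    while improved:
        improved, best, best_score = False, None, bic(dag, data, cards)
        for x, y in permutations(variables, 2):
            if x not in dag[y] and not creates_cycle(dag, x, y):
                cand = {**dag, y: dag[y] + (x,)}
                s = bic(cand, data, cards)
                if s > best_score:
                    best, best_score, improved = cand, s, True
        if improved:
            dag = best
    # Phase II: greedily delete the single arc that improves the score the most.
    improved = True
    while improved:
        improved, best, best_score = False, None, bic(dag, data, cards)
        for y in variables:
            for x in dag[y]:
                cand = {**dag, y: tuple(p for p in dag[y] if p != x)}
                s = bic(cand, data, cards)
                if s > best_score:
                    best, best_score, improved = cand, s, True
        if improved:
            dag = best
    return dag

# Example usage with synthetic data from the chain X -> Y -> Z (illustrative only):
import random
random.seed(1)
rows = []
for _ in range(2000):
    x = random.random() < 0.5
    y = x if random.random() < 0.9 else (not x)
    z = y if random.random() < 0.9 else (not y)
    rows.append({'X': x, 'Y': y, 'Z': z})
print(ges_sketch(['X', 'Y', 'Z'], rows, {'X': 2, 'Y': 2, 'Z': 2}))
```

On chain-generated data like this, the toy version typically recovers a DAG in the equivalence class of the generating chain; unlike full GES, however, it can get stuck when the backward phase would need covered arc reversals to expose a removable arc.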

Discussions

In the first phase, the algorithm finds a model that includes the true model. In the second phase, it reduces it to the true model.

According to Theorem 9.6, one needs only to add arcs in the first phase. Why does the algorithm do more, then?

At some point the search goes through a complex graph, in the worst case the complete graph. With finite data, it matters how complex this most complex graph is: it determines whether you end up in a local or a global optimum. It is generally believed that doing more than just adding edges may help keep the most complex intermediate graph less complex.

Implementation Details

Generating all models in the inclusion boundary of a model by directly applying the theorem might be computationally expensive: consider the model with 100 disjoint arcs. There are 2^100 equivalent DAGs representing the same model.

Solution: use essential graphs (one graph per class) to represent the equivalence classes of DAGs. See Chickering (2002b) for how to search inclusion boundaries implicitly by using search operators on essential graphs.

Limitations and Empirical Results

Despite the optimality result, local maxima are still a problem: we never have infinite data, and data usually are not generated by joint distributions faithful to DAGs.

The chart (Chickering 2002b) on the next slide shows that GES often cannot reconstruct the generative model, but it is much better than hill-climbing with DAGs (D-space), the algorithm we described in the previous lecture. E-space stands for another algorithm based on essential graphs by Chickering (2002).

Empirical Results

[Chart (Chickering 2002b): comparison of structure recovery by GES, D-space hill-climbing, and E-space search.]