Bayesian Networks: Representation, Variable Elimination

CS 6375: Machine Learning Class Notes
Instructor: Vibhav Gogate, The University of Texas at Dallas

We can view a Bayesian network as a compact representation of a joint probability distribution (the computational view). We can also view it as a graphical representation of a set of conditional independence statements (the conditional independence view). As we will show, the two views are equivalent. (For more details, please refer to [1] and [2].)

1 Computational View

A Bayesian network is a directed acyclic graph (DAG, i.e., a directed graph that has no directed cycles) G(X,E), where X = {X_1,...,X_n} is a set of random variables and E is a set of directed edges. A node X_j is a parent of node X_i if there is a directed edge from X_j to X_i. We will denote by pa(X_i) the set of parents of X_i. Let D(X_i) denote the set of finite values that X_i can take, i.e., the domain of X_i. For example, if X_i is a binary variable, then D(X_i) = {True, False}. Each node in a Bayesian network is associated with a conditional probability table or CPT, whose functional form is P(X_i | pa(X_i)) (i.e., the variable given its parents). A Bayesian network B represents the following joint probability distribution:

    Pr(x) = ∏_{i=1}^{n} P(x_i | π(x, pa(X_i)))    (1)

where x = (x_1,...,x_n) is an assignment to all variables in X and π(x, pa(X_i)) is the projection of x on the parents of X_i. (For instance, the projection of the assignment (X_1 = 0, X_2 = 0, X_3 = 1) on {X_1, X_3} is the assignment (X_1 = 0, X_3 = 1).)

Example 1. Figure 1 shows a Bayesian network over five binary variables {A,B,C,D,E} and its CPTs. From Eq. (1), we can see that the probability of an assignment to all random variables can be calculated by multiplying the probability values obtained by projecting the assignment on each CPT. For example, the probability of (A = 0, B = 0, C = 0, D = 1, E = 0) is 0.6 × 1 × 0.2 × 0.8 × 1 = 0.096.

Exercise 1. Compute the probability of (A = 1, B = 1, C = 0, D = 0, E = 0) and (A = 0, B = 1, C = 1, D = 0, E = 0).
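To make Eq. (1) concrete, here is a minimal sketch (not part of the original notes) that encodes the CPTs of Figure 1, shown below, as Python dictionaries and multiplies the entries selected by a full assignment. The dictionary encoding and the helper name joint_probability are my own illustrative choices.

# Parents and CPTs of the Figure 1 network. Each CPT maps
# (child value, tuple of parent values) -> probability.
parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("A", "C")}
cpts = {
    "A": {(0, ()): 0.6, (1, ()): 0.4},
    "B": {(0, (0,)): 1.0, (1, (0,)): 0.0, (0, (1,)): 0.1, (1, (1,)): 0.9},
    "C": {(0, (0,)): 0.2, (1, (0,)): 0.8, (0, (1,)): 0.7, (1, (1,)): 0.3},
    "D": {(0, (0, 0)): 0.2, (1, (0, 0)): 0.8, (0, (0, 1)): 0.1, (1, (0, 1)): 0.9,
          (0, (1, 0)): 0.3, (1, (1, 0)): 0.7, (0, (1, 1)): 0.5, (1, (1, 1)): 0.5},
    "E": {(0, (0, 0)): 1.0, (1, (0, 0)): 0.0, (0, (0, 1)): 0.5, (1, (0, 1)): 0.5,
          (0, (1, 0)): 0.7, (1, (1, 0)): 0.3, (0, (1, 1)): 0.0, (1, (1, 1)): 1.0},
}

def joint_probability(x):
    """Eq. (1): multiply, for every variable, the CPT entry selected by
    projecting the full assignment x on the variable and its parents."""
    p = 1.0
    for var, pa in parents.items():
        p *= cpts[var][(x[var], tuple(x[q] for q in pa))]
    return p

# Example 1: P(A=0, B=0, C=0, D=1, E=0) = 0.6 * 1 * 0.2 * 0.8 * 1
print(joint_probability({"A": 0, "B": 0, "C": 0, "D": 1, "E": 0}))  # 0.096 (up to floating point)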

Figure 1: A Bayesian network and its CPTs. The DAG has edges A → B, A → C, B → D, C → D, A → E and C → E.

    P(A):
      A=0: 0.6    A=1: 0.4

    P(B | A):
      A    B=0   B=1
      0    1     0
      1    0.1   0.9

    P(C | A):
      A    C=0   C=1
      0    0.2   0.8
      1    0.7   0.3

    P(D | B, C):
      B  C    D=0   D=1
      0  0    0.2   0.8
      0  1    0.1   0.9
      1  0    0.3   0.7
      1  1    0.5   0.5

    P(E | A, C):
      A  C    E=0   E=1
      0  0    1     0
      0  1    0.5   0.5
      1  0    0.7   0.3
      1  1    0     1

Space Complexity

The space required by a CPT is exponential in the number of its parents. Assuming that the Bayesian network has n variables, the space required to store it is O(n · exp(max_i |pa(X_i)| + 1)). Thus, if the number of parents for each node is bounded by a constant, the Bayesian network can be specified in polynomial space.

Forget the O-notation for the time being. Let us calculate how many memory units are actually needed to represent a CPT on our computer. Assume that we need one memory unit to store a real number. Then, the number of memory units required to represent a CPT P(X_i | pa(X_i)) is:

    M[P(X_i | pa(X_i))] = (|D(X_i)| − 1) × ∏_{X_j ∈ pa(X_i)} |D(X_j)|    (2)

where |D(X_i)| denotes the cardinality of D(X_i), i.e., the domain size of X_i.

How did we arrive at this number? Notice that to represent a marginal distribution over k points, we have to specify probabilities for some arbitrarily selected k − 1 points only; we can calculate the probability of the k-th point by subtracting the sum of the known probabilities over the k − 1 points from 1. Thus, to represent a marginal distribution over a variable X_i, we need |D(X_i)| − 1 memory units. This gives us the first term in Equation (2). To specify a CPT for a variable, we have to represent a conditional marginal distribution for each possible assignment to its parents. The number of possible assignments to the parents equals the product of the domain sizes of the variables in the parent set. This gives us the second term in Equation (2).

The number of memory units required to represent a Bayesian network B is simply a sum over the memory units required by each of its CPTs:

    M[B] = Σ_i M[P(X_i | pa(X_i))]

Example 2. The number of memory units required to represent the CPT P(C | A) in Figure 1 is 2.

Exercise 2. Calculate the number of memory units required for representing the remaining CPTs in Figure 1. Also, calculate the number of memory units required to represent the Bayesian network.
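The following is a small sketch of Equation (2); the function and variable names are illustrative only, and the per-CPT counts it prints can be checked against Example 2 and the Solutions section.

domain_size = {"A": 2, "B": 2, "C": 2, "D": 2, "E": 2}
parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("A", "C")}

def cpt_memory(var):
    # Eq. (2): (|D(X_i)| - 1) times the product of the parents' domain sizes.
    units = domain_size[var] - 1
    for p in parents[var]:
        units *= domain_size[p]
    return units

print({v: cpt_memory(v) for v in parents})   # {'A': 1, 'B': 2, 'C': 2, 'D': 4, 'E': 4}
print(sum(cpt_memory(v) for v in parents))   # 13 memory units for the whole network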

2 Conditional Independence View

A Bayesian network DAG is a compact representation of the following conditional independence statements: each variable is conditionally independent of its non-descendants given its parents. (In a DAG G, if there is a directed path from X_i to X_j, then X_j is a descendant of X_i. If X_j is neither a parent nor a descendant of X_i, then X_j is a non-descendant of X_i.)

We will use the notation I(X,Z,Y) to denote the statement that the set of variables X is conditionally independent of Y given Z. Recall that by definition, I(X,Z,Y) is equivalent to each of the following two statements:

    Pr(x | y, z) = Pr(x | z)
    Pr(x, y | z) = Pr(x | z) Pr(y | z)

Using the notation above, a Bayesian network represents the following conditional independence statements: I(X_i, pa(X_i), nd(X_i)) for all variables X_i ∈ X, where nd(X_i) is the set of non-descendants of X_i. We will often refer to these statements as Markovian statements or Markovian assumptions.

We mentioned before that the two views (a Bayesian network as a compact representation of a joint probability distribution, and a Bayesian network as a compact representation of a set of conditional independence statements) are equivalent. Next, we show why this is the case.

Given a DAG G, let (X_1,...,X_n) be a topological ordering of the variables in G. (Recall that a topological ordering is an ordering in which parents are ordered before their children. For example, the ordering (A,B,C,D,E) is a topological ordering of the Bayesian network given in Figure 1. Similarly, the ordering (A,C,B,E,D) is also a topological ordering.) By the product rule of probability, the probability of a full assignment x = (x_1,...,x_n) to all variables is given by:

    Pr(x) = ∏_{i=1}^{n} P(x_i | x_1,...,x_{i−1})    (3)

From the Markovian assumption, we know that each variable is conditionally independent of its non-descendants given its parents. Since we have ordered the variables along a topological ordering, the set {X_1,...,X_{i−1}} includes only parents and non-descendants of X_i. Therefore, for all variables X_i, we have:

    P(x_i | x_1,...,x_{i−1}) = P(x_i | π(x, pa(X_i)))    (4)

Substituting Equation (4) in Equation (3), we get:

    Pr(x) = ∏_{i=1}^{n} P(x_i | π(x, pa(X_i)))    (5)

Exercise 3. Enumerate the Markovian assumptions represented by the Bayesian network given in Figure 1.
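For readers who want to check their answer to Exercise 3 programmatically, here is a small sketch (my own encoding, not from the notes) that enumerates the Markovian statements I(X_i, pa(X_i), nd(X_i)) of the Figure 1 DAG, where the non-descendants are the variables that are neither X_i, nor parents of X_i, nor descendants of X_i.

parents = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"A", "C"}}
children = {v: {c for c, ps in parents.items() if v in ps} for v in parents}

def descendants(v):
    """All nodes reachable from v along directed edges (depth-first search)."""
    stack, seen = list(children[v]), set()
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(children[u])
    return seen

for v in sorted(parents):
    nd = set(parents) - {v} - parents[v] - descendants(v)
    print(f"I({v}, {sorted(parents[v])}, {sorted(nd)})")
# Matches the solution of Exercise 3, e.g. I(B, ['A'], ['C', 'E']).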

We can expand the set of conditional independence statements by applying the following five properties of conditional independence to the Markovian statements. These properties can be easily proved from first principles:

1. Symmetry: I(X,Z,Y) ⟹ I(Y,Z,X)
2. Decomposition: I(X,Z,Y ∪ W) ⟹ I(X,Z,Y)
3. Weak Union: I(X,Z,Y ∪ W) ⟹ I(X,Z ∪ W,Y)
4. Contraction: I(X,Z ∪ Y,W) and I(X,Z,Y) ⟹ I(X,Z,Y ∪ W)
5. Intersection (for any positive distribution): I(X,Z ∪ W,Y) and I(X,Z ∪ Y,W) ⟹ I(X,Z,Y ∪ W)

Example 3. Applying the symmetry property to the statement I(B, A, {C, E}) yields the statement I({C, E}, A, B). Similarly, applying decomposition to the statement I(B, A, {C, E}) yields the two statements I(B, A, E) and I(B, A, C).

Algorithm 1 describes a scheme for computing the set of all possible conditional independence statements represented by a Bayesian network.

Algorithm 1: CI statements of a BN
  Input: A Bayesian network B
  Output: A set of conditional independence statements
  begin
    CI = Markovian statements represented by B
    repeat
      M = new statements derivable from CI by applying the five conditional independence properties
      CI = CI ∪ M
    until CI does not change
    return CI
  end

Thus, given a Bayesian network and a conditional independence statement, we can check whether the Bayesian network satisfies the statement by checking if it is included in CI. Why is this important? Remember that a Bayesian network models a real-world domain. Based on prior knowledge or expert knowledge about the domain, we might expect some conditional independence property to hold in a Bayesian network, and if this property is not satisfied by the given DAG, then we may want to change/add/delete some edges of the DAG. By doing so, our Bayesian network will better capture the subtleties or important structure in the domain.

Unfortunately, this approach of constructing the set CI and then checking if the statement is included in CI is quite expensive and exponential in the worst case (because the cardinality of CI can be exponential in the number of variables). It turns out that we can use a much simpler graphical test, called the d-separation test, for this purpose.
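Before moving to d-separation, here is a tiny sketch (my own encoding, not the notes' code) of the ingredients of Algorithm 1: statements are stored as triples of frozensets, and the symmetry and decomposition properties generate the new statements of Example 3 from I(B, A, {C, E}).

from itertools import combinations

def symmetry(stmt):
    x, z, y = stmt
    return [(y, z, x)]

def decomposition(stmt):
    # I(X, Z, Y ∪ W) implies I(X, Z, S) for every nonempty proper subset S of Y ∪ W.
    x, z, yw = stmt
    return [(x, z, frozenset(s))
            for r in range(1, len(yw))
            for s in combinations(sorted(yw), r)]

stmt = (frozenset({"B"}), frozenset({"A"}), frozenset({"C", "E"}))
print(symmetry(stmt))        # I({C, E}, A, B)
print(decomposition(stmt))   # I(B, A, {C}) and I(B, A, {E}), as in Example 3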

2.1 D-separation test

The d-separation test for Bayesian networks takes three subsets of variables X, Y and Z and returns True only if I(X,Z,Y). Algorithm 2 describes this test; it operates as follows. In the repeat-until loop, the algorithm iteratively deletes all leaf nodes that are not in X ∪ Y ∪ Z (a leaf node is a node with no child nodes) until no more nodes can be deleted. Note that when you delete a leaf node, you remove all of its incoming edges. Moreover, notice that because we are deleting nodes, some nodes which were not leaf nodes in the original DAG may become leaf nodes. Then, the algorithm deletes all outgoing edges from nodes in Z and replaces each directed edge with an undirected edge, yielding an undirected graph. The algorithm returns True if X and Y are disconnected in the undirected graph and False otherwise. We can show that:

Theorem 1. Algorithm 2 is correct and returns True only if I(X,Z,Y).

Proof. The proof is complicated and beyond the scope of this class. If you are interested in the proof, please refer to [1].

Algorithm 2: D-separation test
  Input: A DAG G and three sets of variables X, Z and Y
  Output: True only if I(X,Z,Y) and False otherwise
  begin
    repeat
      Delete a leaf node W from G such that W ∉ X ∪ Y ∪ Z
    until no more nodes can be deleted
    Delete from G all edges outgoing from nodes in Z
    Replace each directed edge in G with an undirected edge
    if X and Y are disconnected in G then return True
    else return False
  end

(Figure 2: Demonstration of the d-separation test, adapted from [1]. For the DAG on the left, X = {A,S}, Y = {X,D} and Z = {B,P}. For the DAG on the right, X = {C,T}, Y = {B} and Z = {S,X}.)

Example 4. Figure 2 depicts two examples of the d-separation test. Both tests are positive.
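Below is a sketch of Algorithm 2 under my own graph encoding (a set of directed parent-child pairs); it is not the authors' code. It prunes leaves outside X ∪ Y ∪ Z, drops edges leaving Z, and then checks whether X and Y are disconnected in the resulting undirected graph. The Figure 1 edge set used in the test at the bottom is read off the CPTs.

def d_separated(edges, X, Y, Z):
    """edges: set of directed (parent, child) pairs; X, Y, Z: sets of nodes."""
    edges = set(edges)
    nodes = {u for e in edges for u in e} | X | Y | Z
    keep = X | Y | Z
    # Phase 1: repeatedly delete leaf nodes (no children) outside X u Y u Z,
    # together with their incoming edges.
    while True:
        leaves = {v for v in nodes
                  if v not in keep and all(p != v for (p, c) in edges)}
        if not leaves:
            break
        nodes -= leaves
        edges = {(p, c) for (p, c) in edges if c not in leaves}
    # Phase 2: delete edges leaving Z, then treat the rest as undirected.
    adj = {v: set() for v in nodes}
    for p, c in edges:
        if p not in Z:
            adj[p].add(c)
            adj[c].add(p)
    # Phase 3: X and Y are d-separated iff no node of Y is reachable from X.
    frontier, reached = set(X), set(X)
    while frontier:
        frontier = {w for v in frontier for w in adj[v]} - reached
        reached |= frontier
    return reached.isdisjoint(Y)

# Figure 1 network: does d-separation support the Markovian statement I(B, {A}, {C, E})?
fig1 = {("A", "B"), ("A", "C"), ("A", "E"), ("B", "D"), ("C", "D"), ("C", "E")}
print(d_separated(fig1, {"B"}, {"C", "E"}, {"A"}))  # True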

2.2 I-maps, D-maps and P-maps

Since a Bayesian network G can use only polynomial space to model a joint probability distribution Pr, it may not be able to capture all the dependencies or independencies in Pr. In this section, we describe three notions, I-map, D-map and P-map, that capture the relationship between the conditional independence statements represented by G and Pr.

Let I_G denote the set of conditional independence statements represented by a Bayesian network G and let I_Pr be the set of conditional independence statements that are true in the probability distribution Pr. Then,

Definition 1. G is an I-map (independency map) of Pr iff I_G ⊆ I_Pr. G is a D-map (dependency map) of Pr iff I_Pr ⊆ I_G. G is a P-map (perfect map) of Pr iff G is both an I-map and a D-map of Pr.

An I-map implies that every independence declared by d-separation on G holds in the distribution Pr. A D-map implies that lack of d-separation in G implies a dependence in Pr. These definitions are not useful by themselves because a complete Bayesian network is an I-map while a Bayesian network with no edges is a D-map. (Let o be a topological ordering over a Bayesian network. A Bayesian network is complete if all variables ordered before X_i in o are parents of X_i.) Therefore, in practice we enforce minimality. Formally, G is a minimal I-map of Pr if G ceases to be an I-map when we delete any edge from G. Similarly, G is a minimal D-map of Pr if G ceases to be a D-map when we add any edge to G.

3 Common Queries over Bayesian networks

Since a Bayesian network represents a joint probability distribution (i.e., complete knowledge about the world), it can be used to answer any query defined over the random variables. Some common query types are:

Probability of Evidence (PE). We define evidence as an assignment to a subset of variables in the Bayesian network. Given evidence, the aim of this task, as the name suggests, is to compute the probability of evidence. Formally, let E be the set of evidence variables, E = e be the evidence and let Z = X \ E be the set of non-evidence variables. Then

    Pr(e) = Σ_{z ∈ D(Z)} Pr(c(z,e)) = Σ_{z ∈ D(Z)} ∏_{i=1}^{n} P(π(c(z,e), {X_i}) | π(c(z,e), pa(X_i)))    (6)

where c(z,e) denotes the composition of the assignments z and e, and D(Z) is the Cartesian product of the domains of all variables in Z.

We have to use cumbersome notation in Equation (6) in order to be mathematically precise. To better understand the equation, recall that the probability of an event is a sum over all possible assignments in the joint distribution that are consistent with the event. To calculate it, we fix each evidence variable to the given value and sum over all possible assignments to the non-evidence variables.
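Here is a brute-force sketch of Eq. (6) (mine, for illustration, reusing the dictionary encoding of the Figure 1 CPTs from the earlier sketch): fix the evidence, enumerate every assignment to the non-evidence variables, and sum the joint probabilities. This is exponential in the number of non-evidence variables and therefore only usable on tiny networks.

from itertools import product

parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("A", "C")}
cpts = {  # (child value, tuple of parent values) -> probability, Figure 1 numbers
    "A": {(0, ()): 0.6, (1, ()): 0.4},
    "B": {(0, (0,)): 1.0, (1, (0,)): 0.0, (0, (1,)): 0.1, (1, (1,)): 0.9},
    "C": {(0, (0,)): 0.2, (1, (0,)): 0.8, (0, (1,)): 0.7, (1, (1,)): 0.3},
    "D": {(0, (0, 0)): 0.2, (1, (0, 0)): 0.8, (0, (0, 1)): 0.1, (1, (0, 1)): 0.9,
          (0, (1, 0)): 0.3, (1, (1, 0)): 0.7, (0, (1, 1)): 0.5, (1, (1, 1)): 0.5},
    "E": {(0, (0, 0)): 1.0, (1, (0, 0)): 0.0, (0, (0, 1)): 0.5, (1, (0, 1)): 0.5,
          (0, (1, 0)): 0.7, (1, (1, 0)): 0.3, (0, (1, 1)): 0.0, (1, (1, 1)): 1.0},
}

def prob_evidence(evidence):
    hidden = [v for v in parents if v not in evidence]
    total = 0.0
    for values in product((0, 1), repeat=len(hidden)):
        x = dict(evidence, **dict(zip(hidden, values)))   # composition c(z, e)
        p = 1.0
        for var, pa in parents.items():
            p *= cpts[var][(x[var], tuple(x[q] for q in pa))]
        total += p
    return total

print(prob_evidence({"E": 1}))  # 0.444 for Figure 1 with E = true (my own arithmetic)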

Posterior Marginal Probabilities (MAR). Given evidence e and a variable X_i, the aim here is to compute Pr(X_i = x_i | e), which is defined as:

    Pr(x_i | e) = Pr(c(x_i,e)) / Pr(e)    (7)

Since the posterior marginal probability is a ratio of two probabilities, algorithms used for computing the probability of evidence can be used to compute it.

Most Probable Explanation (MPE). Given evidence e, the aim here is to find an assignment to all the non-evidence variables that has the highest probability. Formally, we have to find:

    arg max_z Pr(c(z,e))    (8)

Algorithms used to solve these queries or tasks are called inference algorithms.

4 Variable Elimination: An Exact Inference Algorithm

Before we derive variable elimination, we briefly describe three inference operators: summation, multiplication and maximization.

4.1 Inference Operators

Function or Factor. Given a set of variables Y, a function or factor φ(Y) is a mapping from all possible assignments to Y to a non-negative real number. We will refer to Y as the scope of the function.

Multiplication or Product. The product of two functions, φ_1 having scope X_1 and φ_2 having scope X_2, is a function φ having scope X = X_1 ∪ X_2. The value associated with each assignment x ∈ D(X) is given by:

    φ(x) = φ_1(π(x,X_1)) × φ_2(π(x,X_2))

Example 5. The product of two functions φ_1(A,B) and φ_2(B,C) is the function φ(A,B,C) given below.

    φ_1(A,B):
    A B  value
    0 0  5
    0 1  7
    1 0  3
    1 1  4

    φ_2(B,C):
    B C  value
    0 0  2
    0 1  9
    1 0  3
    1 1  1

    φ(A,B,C) = φ_1(A,B) × φ_2(B,C):
    A B C  value
    0 0 0  5 × 2 = 10
    0 0 1  5 × 9 = 45
    0 1 0  7 × 3 = 21
    0 1 1  7 × 1 = 7
    1 0 0  3 × 2 = 6
    1 0 1  3 × 9 = 27
    1 1 0  4 × 3 = 12
    1 1 1  4 × 1 = 4
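The sketch below implements the product operator on the tables of Example 5. The factor representation (a scope tuple plus a dictionary from value tuples to numbers) is my own choice, not the notes'.

from itertools import product

def multiply(scope1, table1, scope2, table2):
    scope = tuple(dict.fromkeys(scope1 + scope2))       # union of scopes, order preserved
    table = {}
    for vals in product((0, 1), repeat=len(scope)):
        x = dict(zip(scope, vals))
        # Multiply the entries selected by projecting x on each factor's scope.
        table[vals] = (table1[tuple(x[v] for v in scope1)]
                       * table2[tuple(x[v] for v in scope2)])
    return scope, table

phi1 = {(0, 0): 5, (0, 1): 7, (1, 0): 3, (1, 1): 4}      # phi_1(A, B)
phi2 = {(0, 0): 2, (0, 1): 9, (1, 0): 3, (1, 1): 1}      # phi_2(B, C)
scope, phi = multiply(("A", "B"), phi1, ("B", "C"), phi2)
print(scope)            # ('A', 'B', 'C')
print(phi[(0, 0, 1)])   # 5 * 9 = 45, as in Example 5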

Sum-out. The sum-out of a variable X_i from a function φ(X) such that X_i ∈ X is a function φ' defined over the set of variables X' = X \ {X_i}. The value of each assignment x' ∈ D(X') is given by:

    φ'(x') = Σ_{x_i} φ(c(x_i,x'))

Example 6. Summing out A from the function φ(A,B) yields the function φ'(B) given below:

    φ(A,B):            φ'(B) = Σ_a φ(a,b):
    A B  value         B  value
    0 0  5             0  5 + 3 = 8
    0 1  7             1  7 + 4 = 11
    1 0  3
    1 1  4

Max-out. The max-out of a variable X_i from a function φ(X) such that X_i ∈ X is a function φ' defined over the set of variables X' = X \ {X_i}. The value of each assignment x' ∈ D(X') is given by:

    φ'(x') = max_{x_i} φ(c(x_i,x'))

Example 7. Maxing out A from the function φ(A,B) yields the function φ'(B) given below:

    φ(A,B):            φ'(B) = max_a φ(a,b):
    A B  value         B  value
    0 0  5             0  max(5,3) = 5
    0 1  7             1  max(7,4) = 7
    1 0  3
    1 1  4
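The following sketch (again with my own factor encoding) implements sum-out and max-out through a single elimination helper, reproducing Examples 6 and 7.

def eliminate(scope, table, var, combine):
    """Remove `var` from the factor, combining entries that agree on the rest
    of the scope with `combine` (addition for sum-out, max for max-out)."""
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    new_table = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        new_table[key] = p if key not in new_table else combine(new_table[key], p)
    return new_scope, new_table

phi = {(0, 0): 5, (0, 1): 7, (1, 0): 3, (1, 1): 4}           # phi(A, B)
print(eliminate(("A", "B"), phi, "A", lambda a, b: a + b))   # Example 6: {(0,): 8, (1,): 11}
print(eliminate(("A", "B"), phi, "A", max))                  # Example 7: {(0,): 5, (1,): 7}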

4.2 Example

We explain the main concepts in variable elimination using an example. Consider the Bayesian network given in Figure 1 and let E = true be evidence. By definition,

    P(E = true) = Σ_{a,b,c,d} P(A = a) P(B = b | A = a) P(C = c | A = a) P(D = d | B = b, C = c) P(E = true | A = a, C = c)    (9)

However, notice that since E is already assigned a value, we can rewrite the CPT P(E | A, C) as the function φ(A,C):

    A C  φ(a,c)
    0 0  0
    0 1  0.5
    1 0  0.3
    1 1  1

Rewriting Equation (9) using φ, we get

    P(E = true) = Σ_{a,b,c,d} P(A = a) P(B = b | A = a) P(C = c | A = a) P(D = d | B = b, C = c) φ(A = a, C = c)    (10)

Distributing sums over products along the order (A,B,C,D), we get

    P(E = true) = Σ_a P(A = a) Σ_b P(B = b | A = a) Σ_c P(C = c | A = a) φ(A = a, C = c) ( Σ_d P(D = d | B = b, C = c) )    (11)

We obtain Equation (11) from Equation (10) by applying the distributive law. Recall that the distributive law states that:

    a·b_1 + a·b_2 = a·(b_1 + b_2)

Summing out D from P(D = d | B = b, C = c) yields a function φ(B,C) defined over B and C:

    B C D  P(d | b, c)        B C  φ(b,c)
    0 0 0  0.2                0 0  1
    0 0 1  0.8                0 1  1
    0 1 0  0.1                1 0  1
    0 1 1  0.9                1 1  1
    1 0 0  0.3
    1 0 1  0.7
    1 1 0  0.5
    1 1 1  0.5

Substituting Σ_d P(D = d | B = b, C = c) with φ(B = b, C = c), we get

    P(E = true) = Σ_a P(A = a) Σ_b P(B = b | A = a) Σ_c P(C = c | A = a) φ(A = a, C = c) φ(B = b, C = c)    (12)

Multiplying P(C = c | A = a), φ(A = a, C = c) and φ(B = b, C = c) yields a function φ(A,B,C), and we can rewrite Equation (12) as:

    P(E = true) = Σ_a P(A = a) Σ_b P(B = b | A = a) Σ_c φ(A = a, B = b, C = c)    (13)

Summing out C from φ(A = a, B = b, C = c) yields a function φ(A,B). Substituting it in Equation (13), we get

    P(E = true) = Σ_a P(A = a) Σ_b P(B = b | A = a) φ(A = a, B = b)    (14)

Summing out B from P(B = b | A = a) φ(A = a, B = b) yields a function φ(A). Substituting it in Equation (14), we get

    P(E = true) = Σ_a P(A = a) φ(A = a)    (15)

Summing out A from P(A = a) φ(A = a) yields a number, which equals the probability of E = true.

4.3 The Algorithm (for the PE task)

Algorithm 3: Variable Elimination
  Input: A Bayesian network and evidence E = e
  Output: Probability of evidence
  begin
    for each CPT that mentions one or more evidence variables do
      Instantiate the evidence and remove the evidence variables
    // Let Φ = {φ_1,...,φ_m} be the resulting set of functions. Note that these functions include the original CPTs as well as the evidence-instantiated CPTs
    Let o = (X_1,...,X_n) be an ordering of the non-evidence variables
    for i = 1 to n do
      Multiply all functions in Φ that mention X_i, yielding a function φ
      Sum out X_i from φ, yielding a function φ_new
      Remove all functions that mention X_i from Φ
      Add φ_new to Φ
    return ∏_{φ ∈ Φ} φ
  end

Next, we generalize the process described in the previous subsection, yielding Algorithm 3. The algorithm takes as input a Bayesian network and evidence, and outputs the probability of evidence. First, we instantiate evidence in every CPT that mentions one or more evidence variables. Namely, we remove from the CPT all tuples that are not consistent with the evidence and then project the CPT on the non-evidence variables in its scope. Let Φ denote the set (or database) of all valid functions (which is the union of the set of evidence-instantiated CPTs and the CPTs which do not have any evidence variables). Then, given an ordering o of variables, the algorithm eliminates the variables one by one, along o. A variable X_i is eliminated by computing the product of all the functions in Φ that mention X_i, and then summing out X_i from this product. This creates a new function, whose scope is the union of the scopes of all functions that mention X_i, minus X_i. The algorithm then deletes the functions involving X_i from Φ, adds the newly created function to it and continues. After all variables are eliminated, the algorithm returns the product of all functions in Φ.
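Below is a compact sketch of Algorithm 3, using the same dictionary-based factor encoding as the earlier sketches; it is an illustration under my assumptions, not the authors' code. The factor list at the bottom is Figure 1 with the evidence E = 1 already instantiated (the last factor is φ(A,C) from Section 4.2), and the elimination order (D, C, B, A) matches the worked example.

from itertools import product

def multiply(f1, f2):
    (s1, t1), (s2, t2) = f1, f2
    scope = tuple(dict.fromkeys(s1 + s2))
    table = {}
    for vals in product((0, 1), repeat=len(scope)):
        x = dict(zip(scope, vals))
        table[vals] = t1[tuple(x[v] for v in s1)] * t2[tuple(x[v] for v in s2)]
    return scope, table

def sum_out(f, var):
    scope, table = f
    i = scope.index(var)
    out = {}
    for vals, p in table.items():
        key = vals[:i] + vals[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return scope[:i] + scope[i + 1:], out

def variable_elimination(factors, order):
    """factors: list of (scope, table) pairs; order: elimination order."""
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        if not bucket:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    # All remaining factors have empty scope; their product is P(e).
    result = 1.0
    for _, table in factors:
        result *= table[()]
    return result

factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),                                  # P(A)
    (("A", "B"), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.1, (1, 1): 0.9}),  # P(B | A)
    (("A", "C"), {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.7, (1, 1): 0.3}),  # P(C | A)
    (("B", "C", "D"), {(0, 0, 0): 0.2, (0, 0, 1): 0.8, (0, 1, 0): 0.1, (0, 1, 1): 0.9,
                       (1, 0, 0): 0.3, (1, 0, 1): 0.7, (1, 1, 0): 0.5, (1, 1, 1): 0.5}),  # P(D | B, C)
    (("A", "C"), {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 1.0}),  # phi(A, C), E instantiated to 1
]
print(variable_elimination(factors, ["D", "C", "B", "A"]))  # P(E = true) = 0.444

The printed value agrees with the brute-force probability-of-evidence sketch given earlier, which is a useful sanity check on both.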

4.4 Complexity

The time and space complexity of eliminating a variable X_i is linear in the size of the function obtained by computing the product of all functions that mention X_i. Since the size of this function is exponential in the cardinality of the union of the scopes of the functions that mention X_i, the time and space complexity of eliminating a variable is exponential in the number of variables that co-occur with X_i in one or more functions. The complexity of variable elimination is the sum of the complexity of eliminating each variable along the ordering o.

It is possible to compute the complexity of variable elimination graphically. More precisely, the complexity can be determined using an undirected graph associated with a Bayesian network, called its primal graph (or moral graph).

Definition 2 (Primal Graph). Given a set of functions Φ defined over a set of non-evidence variables X, the primal graph is a graph G(X,E) which has the variables of X as its vertices and an edge between every two variables that appear together in the scope of a function φ ∈ Φ.

Example 8. Given evidence E = e, the set Φ associated with the Bayesian network given in Figure 1 equals {P(A), P(B | A), P(C | A), P(D | B, C), φ(A,C)}. The primal graph for Φ is given below:

(Figure: the primal graph over {A, B, C, D}, with edges A-B, A-C, B-C, B-D and C-D.)

To compute the complexity of variable elimination, we have to convert the primal graph into an induced graph.

Definition 3 (Induced Graph). Given a primal graph G(X,E) and an ordering o = (X_1,...,X_n) over the vertices of G, the induced graph of G along o is the graph G'(X,E') (i.e., it has the same vertices as G) obtained as follows:

  Initialize G' = G
  For i = 1 to n do
    Form a clique between all vertices X_j that are neighbors of X_i but ordered after it in o (i.e., j > i)

The width of a node X_i in an induced graph is the number of neighbors of X_i that are connected to X_i but ordered after it. The width of an induced graph is the maximum width over its nodes.

Example 9. The induced graph along the ordering (A,B,C,D) is given below. It is the same as the primal graph.

(Figure: the induced graph along the ordering (A,B,C,D), identical to the primal graph of Example 8.)

The induced graph along the ordering (B,C,D,A) is given below. The new edges, which we will call induced edges, are drawn using dashed lines.

(Figure: the induced graph along the ordering (B,C,D,A); eliminating B first connects its later-ordered neighbors C, D and A, adding the induced edge A-D, drawn dashed.)

Intuitively, the scheme used to construct the induced graph schematically follows the variable elimination algorithm. When we eliminate a variable X_i, we create a function whose scope equals all nodes that are neighbors of X_i but ordered after it. At each variable X_i, the graph over the variables ordered after it is the primal graph w.r.t. the current set of functions Φ.

From the discussion above, it is clear that the complexity of variable elimination along an ordering o is exponential in the width of the induced graph along o. Formally,

Theorem 2. The time and space complexity of variable elimination along an ordering o is O(n exp(w)), where n is the number of variables and w is the width of the induced graph along o.

Thus, in order to reduce the complexity of variable elimination, we should try to find an ordering that yields an induced graph of minimum width. This minimum width is called the treewidth. In other words, given a primal graph G, the best we can do in terms of complexity is O(n exp(w*)), where w* is the treewidth of G. Unfortunately, the problem of finding the treewidth of a graph is NP-hard, and therefore in practice we have to resort to various heuristic alternatives. Some notable heuristics are the min-degree heuristic and the min-fill heuristic, described next.
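Before turning to the heuristics, here is a small sketch (my own encoding) that builds the induced graph of Definition 3 along a given ordering and returns its width; the expected outputs in the comments are my own computation on the primal graph of Example 8.

def induced_width(edges, order):
    """edges: set of undirected pairs of the primal graph; order: elimination order."""
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    width = 0
    eliminated = set()
    for v in order:
        later = {u for u in adj[v] if u not in eliminated}   # later-ordered neighbors
        width = max(width, len(later))
        for a in later:                                      # form a clique over them
            adj[a] |= later - {a}
        eliminated.add(v)
    return width

primal = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")}   # Example 8
print(induced_width(primal, ["A", "B", "C", "D"]))  # 2: no induced edges, as in Example 9
print(induced_width(primal, ["B", "C", "D", "A"]))  # 3: eliminating B first adds the edge A-D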

The min-fill heuristic:
  G = primal graph
  For i = 1 to n do
    1. Select a variable X which, when eliminated, will add the least number of edges to the induced graph, breaking ties randomly. Place X at position i in the ordering.
    2. Update G by creating a clique out of the nodes that are neighbors of X and are not yet ordered.

The min-degree heuristic:
  G = primal graph
  For i = 1 to n do
    1. Select a variable X with the smallest degree, breaking ties randomly. Place X at position i in the ordering.
    2. Update G by creating a clique out of the nodes that are neighbors of X and are not yet ordered.

Various empirical studies have shown that the min-fill heuristic yields orderings having smaller width than the min-degree heuristic. (A small sketch of the min-fill heuristic is given at the end of this section.)

Exercise 4. Draw the induced graph for the primal graph given below along the ordering (A,B,C,D,E,F). What is the complexity of performing variable elimination along this ordering?

(Figure: a primal graph over the variables {A, B, C, D, E, F}.)

4.5 Variable Elimination for MAR and MPE tasks

Variable elimination can be used to compute conditional marginal probabilities in a straightforward manner. Since the conditional marginal probability is a ratio of two probabilities, all we have to do is run variable elimination twice: once to compute the numerator and once to compute the denominator. To compute the marginal distribution over a set of variables Y, all we have to do is eliminate all non-evidence variables except Y. You can verify that at termination, the product ∏_{φ ∈ Φ} φ yields the unnormalized marginal P(Y, e), which can be normalized to obtain P(Y | e).

To compute the MPE value using variable elimination, all we have to do is replace the sum-out operation by the max-out operation. At termination, the algorithm will output the MPE value. To compute the MPE assignment, we then walk back along the reverse order of elimination and find the value for each variable that yielded the MPE value.
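As promised above, here is a minimal sketch of the min-fill heuristic (the graph encoding and tie-breaking are my own choices; the min-degree heuristic is obtained by swapping the scoring function for the node degree).

def min_fill_order(edges):
    adj = {}
    for u, v in sorted(edges):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    order = []
    while adj:
        def fill(v):
            # Number of missing edges among v's current neighbors.
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill)                  # ties broken arbitrarily here
        order.append(v)
        for a in adj[v]:                        # connect v's neighbors into a clique
            adj[a] |= adj[v] - {a}
        for a in adj[v]:
            adj[a].discard(v)
        del adj[v]
    return order

primal = {("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D")}   # Example 8
print(min_fill_order(primal))   # e.g. ['A', 'B', 'C', 'D']: every step has zero fill-in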

Remarks. Since the complexity of variable elimination is exponential in the treewidth of the primal graph associated with the Bayesian network, it is infeasible when the treewidth is large. Unfortunately, in practice, large-treewidth Bayesian networks are the norm rather than the exception. Therefore, in practice, we have to rely on approximate inference algorithms. There are two classes of approximate inference algorithms: belief propagation based and sampling (or simulation) based. Belief propagation algorithms are loosely based on variable elimination; they iteratively apply the elimination operation in an approximate manner.

References

[1] A. Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press, 2009.
[2] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.

5 Solutions

1. The probabilities are 0.4 × 0.9 × 0.7 × 0.3 × 0.7 = 0.05292 and 0.6 × 0 × 0.8 × 0.5 × 0.5 = 0, respectively.

2. The number of memory units required to represent the CPTs associated with A, B, C, D and E is 1, 2, 2, 4 and 4, respectively. The number of memory units required to represent the Bayesian network is 1 + 2 + 2 + 4 + 4 = 13.

3. The Markovian assumptions associated with the Bayesian network given in Figure 1 are:

    I(A, ∅, ∅)              (16)
    I(B, A, {C, E})         (17)
    I(C, A, B)              (18)
    I(D, {B, C}, {A, E})    (19)
    I(E, {A, C}, {B, D})    (20)

4. The induced graph along the ordering (A,B,C,D,E,F) is given below.

(Figure: the induced graph for Exercise 4 along the ordering (A,B,C,D,E,F).)

The widths of the nodes A, B, C, D, E and F are 2, 3, 3, 2, 1 and 0, respectively. The width of the ordering is the maximum width among the nodes, which is 3. Therefore, the complexity of running variable elimination along this ordering is O(6 exp(3)).