Introduction to Discrete Probability Theory and Bayesian Networks


Dr Michael Ashcroft
October 10, 2012

This document remains the property of Inatas. Reproduction in whole or in part without the written permission of Inatas is strictly forbidden.

Contents

1 Introduction to Discrete Probability
  1.1 Discrete Probability Spaces
    1.1.1 Sample Spaces, Outcomes and Events
    1.1.2 Probability Functions
  1.2 The probabilities of events
  1.3 Random Variables
  1.4 Combinations of events
  1.5 Conditional Probability
  1.6 Independence
  1.7 Conditional Independence
  1.8 The Chain Rule
  1.9 Bayes Theorem
  1.10 The Dirichlet Distribution
  1.11 Parameter Likelihood
  1.12 Prior and Posteriori Distributions and Conjugate Priors
  1.13 Maximum Likelihood (ML) and Maximum Aposteriori (MAP)
2 Introduction to Bayesian Networks
  2.1 Bayesian Networks
  2.2 D-Separation, The Markov Blanket and Markov Equivalence
  2.3 Potentials
  2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm
  2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm
  2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling
3 Parameter Learning
  3.1 The Dirichlet Distribution
  3.2 Parameter Dirichlet Distribution

4 Structure Learning
  4.1 Search Spaces
    4.1.1 Ordered DAG Topologies
    4.1.2 DAG Topologies
    4.1.3 Markov Equivalence Classes of DAG Topologies
  4.2 The Bayesian Scoring Criterion (BS)
  4.3 The Bayesian Equivalent Scoring Criterion (BSe)

1 Introduction to Discrete Probability

1.1 Discrete Probability Spaces

A discrete probability space is a pair <S, P>, where S is a sample space and P is a probability function.

1.1.1 Sample Spaces, Outcomes and Events

An outcome is a value that the stochastic system we are modeling can take. The sample space of our model is the set of all outcomes. An event is a subset of the sample space. (So events are also sets of outcomes.)

Example 1. The sample space corresponding to rolling a die is {1, 2, 3, 4, 5, 6}. The outcomes of this sample space are 1, 2, 3, 4, 5 and 6. The events of this sample space are the members of its power set: ∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {3, 4}, {3, 5}, {3, 6}, {4, 5}, {4, 6}, {5, 6}, {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {1, 4, 5}, {1, 4, 6}, {1, 5, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 4, 5}, {2, 4, 6}, {2, 5, 6}, {3, 4, 5}, {3, 4, 6}, {3, 5, 6}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {2, 4, 5, 6}, {3, 4, 5, 6}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}, {1, 3, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}.

1.1.2 Probability Functions

A probability function is a function p from S to the real numbers such that:

(i) 0 ≤ p(s) ≤ 1, for each s ∈ S.

(ii) ∑_{s ∈ S} p(s) = 1.

Example 2. If a six-sided die is fair, the sample space associated with the result of a single throw is {1, 2, 3, 4, 5, 6} and the probability function is:

p(1) = 1/6
p(2) = 1/6

p(3) = 1/6
p(4) = 1/6
p(5) = 1/6
p(6) = 1/6

Example 3. Now imagine a die that is not fair: it has twice the probability of coming up six as it does of coming up any other number. The probability function for such a die would be:

p(1) = 1/7
p(2) = 1/7
p(3) = 1/7
p(4) = 1/7
p(5) = 1/7
p(6) = 2/7

1.2 The probabilities of events

The probability of an event E is defined:

p(E) = ∑_{o ∈ E} p(o)

Example 4. Continuing Example 2 (the fair die), the event E, that the roll of the die produces an odd number, is {1, 3, 5}. Therefore:

p(E) = ∑_{o ∈ E} p(o) = p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 1/2

Example 5. Likewise, for Example 3 (the biased die), the event F, that the roll of the die produces 5 or 6, is {5, 6}. Therefore:

p(F) = ∑_{o ∈ F} p(o) = p(5) + p(6) = 1/7 + 2/7 = 3/7

Notice that an event represents the disjunctive claim that one of the outcomes that are its members occurred. So event E ({1, 3, 5}) represents the event that the roll of the die produced 1 or 3 or 5.

1.3 Random Variables

Note: A random variable is neither random nor a variable! Often, we are interested in numerical values that are connected with our outcomes. We use random variables to model these. A random variable is a function from a sample space to the real numbers.

Example 6. Suppose a coin is flipped three times. Let X(t) be the random variable that equals the number of heads that appear when t is the outcome. Then X(t) takes the following values:

X(HHH) = 3
X(HHT) = X(HTH) = X(THH) = 2
X(HTT) = X(THT) = X(TTH) = 1
X(TTT) = 0

Notice that a random variable divides the sample space into a disjoint and exhaustive set of events, each mapped to a unique real number, r. Let us term the event mapped to r E_r. The probability distribution of a random variable X on a sample space S is the set of ordered pairs <r, p(X = r)> for all r ∈ X(S), where p(X = r) = ∑_{o ∈ E_r} p(o). This is the probability that an outcome o occurred such that X(o) = r, and is often characterized by saying that X took the value r. As we would expect:

(i) 0 ≤ p(E_r) ≤ 1, for each r ∈ X(S).

(ii) ∑_{r ∈ X(S)} p(E_r) = 1.
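This construction can be sketched in a few lines of Python; the fair-coin probabilities below are an illustrative assumption, since Example 6 itself does not specify them:

from itertools import product

# Sample space for three ordered coin tosses, each outcome equally likely (assumed).
sample_space = [''.join(t) for t in product('HT', repeat=3)]
p = {o: 1 / len(sample_space) for o in sample_space}

# Random variable X: number of heads in the outcome.
def X(outcome):
    return outcome.count('H')

# Probability distribution of X: pairs <r, p(X = r)>.
distribution = {}
for o in sample_space:
    distribution[X(o)] = distribution.get(X(o), 0) + p[o]

print(distribution)  # {3: 0.125, 2: 0.375, 1: 0.375, 0: 0.125}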

Placing this in table form gives us a familiar discrete probability distribution.

Points to note about random variables:

1. A random variable and its probability distribution together constitute a probability space.

2. A function from the codomain of a random variable to the real numbers is itself a random variable.

1.4 Combinations of events

Some theorems:

1. p(¬E) = 1 - p(E)

2. p(E and F) = p(E, F) = p(E ∩ F)

3. p(E or F) = p(E ∪ F) = p(E) + p(F) - p(E ∩ F)

Theorem 1 should be obvious! Note, though, that combined with the definition of a probability distribution, it entails that p(∅) = 0. Let's look at an example for theorem 3.

Example 7. Returning to Example 4 with the fair die, let the event E be {1, 3, 5} (that the roll of the die produces an odd number), and event F be {5, 6} (that the roll of the die produces 5 or 6). We want to calculate the probability that one of these events occurs, which is to say that the event E ∪ F occurs. Using theorem 3 we see:

p(E ∪ F) = p(E) + p(F) - p(E ∩ F)
         = (p(1) + p(3) + p(5)) + (p(5) + p(6)) - p(5)
         = p(1) + p(3) + p(5) + p(6)
         = 4/6

Which is as it should be!
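As a minimal sketch, the theorems above can be checked numerically in Python using the fair-die probabilities of Example 2 (the helper function p_event is an illustrative name, not something from the text):

# Fair die from Example 2.
p = {o: 1/6 for o in range(1, 7)}

def p_event(event):
    # p(E) = sum of p(o) over outcomes o in E.
    return sum(p[o] for o in event)

E = {1, 3, 5}   # roll is odd
F = {5, 6}      # roll is 5 or 6

# Theorem 1: complement.
assert abs(p_event(set(p) - E) - (1 - p_event(E))) < 1e-12

# Theorem 3: p(E or F) = p(E) + p(F) - p(E and F).
lhs = p_event(E | F)
rhs = p_event(E) + p_event(F) - p_event(E & F)
assert abs(lhs - rhs) < 1e-12
print(lhs)  # 0.666... = 4/6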

1.5 Conditional Probability

The conditional probability of one event, E, given another, F, is denoted p(E | F). It is defined:

p(E | F) = p(E ∩ F) / p(F)

Example 8. Continuing Example 7, we can calculate the probability that the roll of the die produces 5 or 6 given that it produces an odd number:

p(F | E) = p(E ∩ F) / p(E) = p(5) / (p(1) + p(3) + p(5)) = (1/6) / (3/6) = 1/3

1.6 Independence

Two events, E and F, are independent if and only if p(E, F) = p(E)p(F). That is to say, the probability of both events occurring is simply the probability of the first event occurring multiplied by the probability that the second event occurs. Likewise, two random variables X and Y are independent if and only if p(X = r1, Y = r2) = p(X = r1)p(Y = r2), for all real numbers r1 and r2.

Independence is of great practical importance: it significantly simplifies working out complex probabilities. Where a number of events are independent, we can quickly calculate their joint probability distribution from their individual probabilities.

Example 9. Imagine we are examining the results of the (ordered) tosses of three coins. Given that the possible results of each coin are {H, T}, the sample space for our model will be {H, T}^3. Let us define three random variables, X1, X2, X3. X1 maps outcomes to 1 if the first coin lands heads, and 0 otherwise. X2 and X3 do likewise for the second and third coins. Now assume we are given the following information:

p(X1 = 1) = p(X2 = 1) = p(X3 = 1) = 0.1

If we also know that these random variables are independent, then we can immediately calculate the joint probability distribution for the three random variables from these three values alone (remembering that p(Xn = 0) = 1 - p(Xn = 1)):

1. p(X1 = 1, X2 = 1, X3 = 1) = 0.1 × 0.1 × 0.1 = 0.001
2. p(X1 = 1, X2 = 1, X3 = 0) = 0.1 × 0.1 × 0.9 = 0.009
3. p(X1 = 1, X2 = 0, X3 = 1) = 0.1 × 0.9 × 0.1 = 0.009
4. p(X1 = 1, X2 = 0, X3 = 0) = 0.1 × 0.9 × 0.9 = 0.081
5. p(X1 = 0, X2 = 1, X3 = 1) = 0.9 × 0.1 × 0.1 = 0.009
6. p(X1 = 0, X2 = 1, X3 = 0) = 0.9 × 0.1 × 0.9 = 0.081
7. p(X1 = 0, X2 = 0, X3 = 1) = 0.9 × 0.9 × 0.1 = 0.081
8. p(X1 = 0, X2 = 0, X3 = 0) = 0.9 × 0.9 × 0.9 = 0.729

If we do not know these random variables are independent, we require much more information. In fact, we will need to have the values for each of the entries in the joint probability distribution. Notice that:

- Our storage requirements have jumped from linear in the number of random variables to exponential. (Very bad.)

- Our computational complexity has fallen from linear in the number of random variables to constant. (Good, but we could obtain this in the earlier case as well, if we kept the probabilities after we calculated them.)

Typically, the probability distributions that are of interest to us are such that this exponential storage complexity renders them intractable. Some methods for dealing with this, such as the naive Bayes classifier, simply assume independence among the random variables they are modeling. But this can lead to significantly lower accuracy from the model.

1.7 Conditional Independence

Analogously to independence, we say that two events, E and F, are conditionally independent given another, G, if and only if p(G) ≠ 0 and one of the following holds:

1. p(E | F ∩ G) = p(E | G), where p(E | G) ≠ 0 and p(F | G) ≠ 0; or

2. p(E | G) = 0 or p(F | G) = 0.

Example 10. Say we have 13 objects. Each object is either black (B) or white (W), each object has either a 1 or a 2 written on it, and each object is either a square (□) or a diamond (◇). The objects are:

B1□, B1□, B1◇, B2□, B2□, B2□, B2□, B2◇, B2◇, W1□, W1◇, W2□, W2◇

If we are interested in the characteristics of a randomly drawn object and assume all objects have an equal chance of being drawn, then using the techniques we have already looked at, we can see that the event, E1, that a randomly selected object has a 1 written on it is not independent of the event, E□, that such an object is square. But they are conditionally independent given the event, EB, that the object is black (and, in fact, also given the event that the object is white):

p(E1) = 5/13
p(E1 | E□) = 3/8
p(E1 | EB) = 3/9 = 1/3
p(E1 | E□ ∩ EB) = 2/6 = 1/3

There is little more to say about conditional independence at this point, but soon it will take center stage as a means of obtaining the accuracy of using the full joint distribution of the random variables we are modeling while avoiding the complexity issues that accompany this.

1.8 The Chain Rule

The chain rule for events says that given n events, E1, E2, ..., En, defined on the same sample space S:

p(E1, E2, ..., En) = p(En | En-1, En-2, ..., E1) ... p(E2 | E1)p(E1)

Applied to random variables, this gives us that for n random variables, X1, X2, ..., Xn, defined on the same sample space S:

p(X1 = x1, X2 = x2, ..., Xn = xn) = p(Xn = xn | Xn-1 = xn-1, Xn-2 = xn-2, ..., X1 = x1) ... p(X2 = x2 | X1 = x1)p(X1 = x1)

It is straightforward to prove this rule using the rule for conditional probability.
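The numbers in Example 10 can be checked by enumeration; the following Python sketch assumes the object list reconstructed above and uses exact fractions:

from fractions import Fraction

# Each object from Example 10: (colour, number, shape).
objects = (
    [('B', 1, 'square')] * 2 + [('B', 2, 'square')] * 4 +
    [('B', 1, 'diamond')] * 1 + [('B', 2, 'diamond')] * 2 +
    [('W', 1, 'square')] * 1 + [('W', 2, 'square')] * 1 +
    [('W', 1, 'diamond')] * 1 + [('W', 2, 'diamond')] * 1
)

def p(pred, given=lambda o: True):
    # Conditional probability under uniform draws: |pred and given| / |given|.
    pool = [o for o in objects if given(o)]
    return Fraction(sum(pred(o) for o in pool), len(pool))

one = lambda o: o[1] == 1
square = lambda o: o[2] == 'square'
black = lambda o: o[0] == 'B'

print(p(one))                                    # 5/13
print(p(one, square))                            # 3/8: not independent
print(p(one, black))                             # 1/3
print(p(one, lambda o: square(o) and black(o)))  # 1/3: conditionally independent given black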

1.9 Bayes Theorem

Bayes theorem is:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | ¬F)p(¬F))

Proof:

1. By the definition of conditional probability, p(F | E) = p(E ∩ F) / p(E) and p(E | F) = p(E ∩ F) / p(F).

2. Therefore, p(E ∩ F) = p(F | E)p(E) = p(E | F)p(F).

3. Therefore, p(F | E) = p(E | F)p(F) / p(E).

4. p(E) = p(E ∩ S) = p(E ∩ (F ∪ ¬F)) = p((E ∩ F) ∪ (E ∩ ¬F)).

5. (E ∩ F) and (E ∩ ¬F) are disjoint (otherwise F ∩ ¬F ≠ ∅), so p(E) = p((E ∩ F) ∪ (E ∩ ¬F)) = p(E | F)p(F) + p(E | ¬F)p(¬F).

6. Therefore p(F | E) = p(E | F)p(F) / p(E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | ¬F)p(¬F)). (Bayes theorem)

Example 11. Suppose 1 person in 100,000 has a particular rare disease. There exists a diagnostic test for this disease that is accurate 99% of the time when given to those who have the disease and 99.5% of the time when given to those who do not. Given this information, we can find the probability that someone who tests positive for the disease actually has the disease. Let E be the event that someone tests positive for the disease and F be the event that a person has the disease. We want to find p(F | E). We know that p(F) = 1/100,000 = 0.00001 and so p(¬F) = 0.99999. We also know that p(E | F) = 99/100 = 0.99, so p(¬E | F) = 0.01. Likewise we know that p(¬E | ¬F) = 0.995, so p(E | ¬F) = 0.005. So by Bayes theorem:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | ¬F)p(¬F)) = (0.99)(0.00001) / ((0.99)(0.00001) + (0.005)(0.99999)) ≈ 0.002

Notice that the result is not intuitively obvious. Most people, if told only the information we had available, assume that testing positive means a very high probability of having the disease.

1.10 The Dirichlet Distribution

The Dirichlet distribution is a multivariate distribution parametrized by a vector α of positive reals. It is often said that Dirichlet distributions represent the probabilities related with seeing value a_i occur x_i out of N = ∑_n x_n times. If the probability of a random variable, X, taking particular values from the set {a_1, a_2, ..., a_n} is given by a Dirichlet distribution dir(x_1, x_2, ..., x_n) then:

p(X = a_i) = x_i / ∑_n x_n

In the special case where the random variables are binary, the Dirichlet distribution is also called the Beta distribution.

1.11 Parameter Likelihood

The likelihood of a set of parameter values given some observed outcomes is the probability of those observed outcomes given those parameter values. Imagine that there are two types of coins, C1 and C2. The first type is fair and the other weighted. The probabilities associated with each type coming up heads or tails when tossed are given by the Beta distributions C1: beta(1, 1) and C2: beta(3, 1). Now imagine that we have a coin and obtain the result (H, H, H). What is the likelihood of the two parameter vectors given this data set?

P(HHH | C1) = (1/2)(1/2)(1/2) = 1/8
P(HHH | C2) = (3/4)(3/4)(3/4) = 27/64

So:

L(C1 | HHH) = 1/8
L(C2 | HHH) = 27/64

1.12 Prior and Posteriori Distributions and Conjugate Priors

Consider the general problem of inferring a distribution for a parameter θ given some datum or data x. From Bayes theorem, this posterior distribution is equal to the product of the likelihood function L(θ | x) (which is equal to P(x | θ)) and the prior distribution p(θ), normalized (divided) by the probability of the data p(x) (which, as in Bayes theorem above, equals p(x | θ)p(θ) + p(x | ¬θ)p(¬θ)):

p(θ | x) = p(x | θ)p(θ) / (p(x | θ)p(θ) + p(x | ¬θ)p(¬θ))

For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values) with respect to a likelihood function. Such a choice is a conjugate prior for the likelihood function. The Dirichlet distribution is self-conjugate, which means that if the prior and the likelihood function are both Dirichlet, then the posteriori will also be Dirichlet. Other self-conjugate distributions include the Gaussian.
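The likelihood calculation of section 1.11 can be reproduced with a short Python sketch; the helper functions are illustrative names, and the probability of heads is taken to be the mean of the given Beta distribution:

# Mean of a beta(a, b) prior gives the probability of heads for each coin type.
def p_heads(a, b):
    return a / (a + b)

def likelihood(a, b, data):
    # L(coin | data) = P(data | coin) for an i.i.d. sequence of 'H'/'T' tosses.
    ph = p_heads(a, b)
    result = 1.0
    for toss in data:
        result *= ph if toss == 'H' else 1 - ph
    return result

data = "HHH"
print(likelihood(1, 1, data))  # C1: (1/2)^3 = 0.125
print(likelihood(3, 1, data))  # C2: (3/4)^3 = 0.421875

These values match L(C1 | HHH) = 1/8 and L(C2 | HHH) = 27/64 above.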

In the Dirichlet case, it is useful to think of the parameters of a conjugate prior as corresponding to pseudo-observations. That is to say, by assigning our two coins the prior distributions beta(1, 1) and beta(3, 1) we are saying that in the first case, our prior understanding of the behaviour of coins of type C1 is as if we had observed two tosses and seen one come up heads and the other tails. Likewise for the second case, our prior understanding of the behaviour of coins of type C2 is as if we had observed four tosses and seen three come up heads and one tails.

The process of calculating the posteriori distribution given certain data and a Dirichlet prior is extremely easy. We simply count the observed number of times the variable takes specific values and add these numbers to the appropriate parameters of the prior. So if we knew we were dealing with a coin of type C1, then since we observed three heads, our posteriori distribution for coins of type C1 would be dir(1 + 3, 1) = dir(4, 1).

Note that the prior parameters (also called the hyper-parameters) play a very important role in specifying how resistant to emendation the distribution is given new evidence. Another way of thinking about this is that they represent our confidence in the probabilities given. Imagine we had given coins of type C1 the prior distribution dir(1000, 1000), that is to say, our prior knowledge is equivalent to observing 2000 tosses, of which 1000 came up heads and 1000 tails. The prior probability of a coin of type C1 coming up heads would still be 1000/2000 = 0.5. But given the observed three heads, the posteriori would be dir(1003, 1000), giving a probability of heads of 1003/2003 ≈ 0.5007: significantly more resistant to emendation!

1.13 Maximum Likelihood (ML) and Maximum Aposteriori (MAP)

An important question is how to decide which distributions we should use to model random variables. Because traditional hypothesis testing from statistical theory often fails to scale up, in machine learning it is common to use Maximum Likelihood (ML) and Maximum Aposteriori (MAP) methods.

When working with ML, we choose the hypothesis that has the largest likelihood. So in our example with the two types of coins, we would choose the hypothesis that the coin in question was of type C2, since L(C1 | HHH) = 1/8 < 27/64 = L(C2 | HHH).

The example highlights the obvious problems with ML. In the real world it is overwhelmingly likely that a given coin is (approximately) fair. There are weighted coins (of various types), but these are extremely rare. So deciding on the basis of three heads that a coin is weighted is very questionable.

MAP solves this problem by introducing prior knowledge in the form of a prior distribution. So we might specify that, a priori, the probability of a coin being type C1 is 999/1000 and type C2 is 1/1000. We now calculate the aposteriori probability that our coin is of type C1 or type C2:

p(C1 | HHH) = P(HHH | C1)p(C1) / (P(HHH | C1)p(C1) + P(HHH | C2)p(C2))
            = (1/8)(0.999) / ((1/8)(0.999) + (27/64)(0.001))
            = 0.124875 / 0.125296875
            ≈ 0.997

p(C2 | HHH) = P(HHH | C2)p(C2) / (P(HHH | C1)p(C1) + P(HHH | C2)p(C2))
            = (27/64)(0.001) / 0.125296875
            ≈ 0.003

So by taking into account our prior knowledge of the distribution of weighted and unweighted coins in the world, we decide overwhelmingly in favor of the hypothesis that the coin is unweighted.

It is also possible to use likelihood and aposteriori probabilities to rank the relative fitness of hypotheses, with the purpose of modeling the behaviour of phenomena with weighted sets of distributions.

2 Introduction to Bayesian Networks

2.1 Bayesian Networks

A Bayesian Network is a model of a system, which in turn consists of a number of random variables. It consists of:

1. A directed acyclic graph (DAG), within which each random variable is represented by a node. The topology of this DAG must meet the Markov Condition: each node must be conditionally independent of its nondescendants given its parents.

2. A set of conditional probability distributions, one for each node, which give the probability of the random variable represented by the given node taking particular values given the values the random variables represented by the node's parents take.

Examine the DAG in Figure 1 and the information in Table 1. From the chain rule, we know that the joint probability distribution of the random variables is p(A, B, C, D, E) = p(E | D, C, B, A)p(D | C, B, A)p(C | B, A)p(B | A)p(A). But given the conditional independencies present in P, we know that:

p(C | B, A) = p(C | A)
p(D | C, B, A) = p(D | C, B)
p(E | D, C, B, A) = p(E | C)

So we know that p(A, B, C, D, E) = p(E | C)p(D | C, B)p(C | A)p(B | A)p(A). This may not seem a huge improvement, but it is. It means we can calculate the full joint distribution from the (normally much, much smaller) conditional probability tables associated with each node. As the networks get bigger, the advantages of such a method become crucial. What we have done is pull the joint probability distribution apart by its conditional independencies. We now have a means of obtaining tractable calculations using the full joint distribution.

It has been proven that every discrete probability distribution (and many continuous ones) can be represented by a Bayesian Network, and that every Bayesian Network represents some probability distribution. Of course, if there are no conditional independencies in the joint probability distribution, representing it with a Bayesian Network gains us nothing. But in practice, while independence relationships between random variables in a system we are interested in modeling are rare (and assumptions regarding such independence dangerous), conditional independencies are plentiful.

Some important points about Bayesian Networks:

- Bayesian Networks provide much more information than simple classifiers (like neural networks, or support vector machines, etc). Most importantly, when used to predict the value a random variable will take, they return a probability distribution rather than simply specifying which value is most probable. Obviously, there are many advantages to this.

- Bayesian Networks have an easily understandable and informative physical interpretation (unlike neural networks, or support vector machines, etc, which are effectively black boxes to all but experts). We will see one advantage of this in the next section.

- We can use Bayesian Networks simply to model the correlations and conditional independencies between the random variables of systems. But generally we are interested in inferring the probability distributions of a subset of the random variables of the network given knowledge of the values taken by another (possibly empty) subset.

- Bayesian Networks can also be extended to Influence Diagrams, with decision and utility nodes, in order to perform automated decision making.
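To make the saving concrete, here is a Python sketch that assembles the joint distribution of Figure 1's five variables from the factorization p(E | C)p(D | C, B)p(C | A)p(B | A)p(A). The variables are assumed binary and every CPT number is invented for illustration; only the structure comes from the text.

# Hypothetical CPTs for the DAG of Figure 1 (all variables binary; values 0/1).
# Each table gives p(node = 1 | parent values); p(node = 0 | ...) = 1 - that.
p_A = 0.3
p_B = {0: 0.2, 1: 0.7}          # p(B=1 | A)
p_C = {0: 0.5, 1: 0.1}          # p(C=1 | A)
p_D = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}  # p(D=1 | B, C)
p_E = {0: 0.25, 1: 0.8}         # p(E=1 | C)

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

def joint(a, b, c, d, e):
    # p(A,B,C,D,E) = p(E|C) p(D|C,B) p(C|A) p(B|A) p(A)
    return (bern(p_E[c], e) * bern(p_D[(b, c)], d) *
            bern(p_C[a], c) * bern(p_B[a], b) * bern(p_A, a))

# Sanity check: the factorized joint sums to 1 over all 2^5 assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(round(total, 10))  # 1.0

Here eleven conditional-probability entries determine all 2^5 joint probabilities, whereas a full joint table over five binary variables has 31 free parameters.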

2.2 D-Separation, The Markov Blanket and Markov Equivalence

The Markov Condition also entails other conditional independencies. Because of the Markov Condition, these conditional independencies have a graph theoretic criterion called D-Separation (which we will not define, as it is difficult). Accordingly, when one set of random variables, Γ, is conditionally independent of another, Δ, given a third, Θ, we will say that the nodes representing the random variables in Γ are D-Separated from Δ by Θ.

The most important case of D-Separation/conditional independence is: a node is D-Separated from the rest of the graph given its parents, its children, and the other parents of its children. Because of this, the parents, children and other parents of a node's children are called the Markov Blanket of the node.

This is important. Imagine we have a node, α, (which is associated with a random variable) whose probability distribution we wish to predict and whose Markov Blanket is the set of nodes, Γ. If we know the value of (the random variables associated with) every node in Γ, then the rest of the network provides no further information regarding the value taken by (the random variable associated with) α. In this way, if we are confident that we can always establish the values of some of the random variables our network is modeling, we can often see that certain of the random variables are superfluous, and we need not continue to include them in the network nor collect information on them. Since, in practice, collecting data on random variables can be costly, this can be very helpful.

We will also say that two DAGs are Markov Equivalent if they have the same D-Separations.

2.3 Potentials

Where V is a set of random variables {v_1, ..., v_n}, let Γ_V be the Cartesian product of the co-domains of the random variables in V. So Γ_V consists of all the possible combinations of values that the random variables of V can take.

Let φ_V be a mapping V × Γ_V → R, such that φ_V(v_i, x) = the ith term of x, where x ∈ Γ_V. I.e. φ_V gives us the value assigned to a particular member of V by a particular member of Γ_V.

If W ⊆ V, let ψ^V_W be a mapping Γ_V → Γ_W, such that φ_W(x, ψ^V_W(y)) = φ_V(x, y), for all x ∈ W, y ∈ Γ_V. So ψ^V_W gives us the member of Γ_W in which all the members of W are assigned the same values as in a particular member of Γ_V.

A potential is an ordered pair <V, F>, where V is a set of random variables, and F is a mapping Γ_V → R.

Figure 1: A DAG with five nodes (edges A → B, A → C, B → D, C → D and C → E).

Node | Conditional Independencies
A    | -
B    | C and E, given A
C    | B, given A
D    | A and E, given B and C
E    | A, B and D, given C

Table 1: Conditional independencies required of the random variables for the DAG in Figure 1 to be a Bayesian Network.

Figure 2: The Markov Blanket of node L (in a larger DAG over nodes A through W).
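Figure 2 shows a Markov Blanket pictorially; given the parent sets of a DAG, the blanket of a node can also be computed directly. A minimal Python sketch, using an invented toy graph rather than the graph of Figure 2:

def markov_blanket(node, parents):
    # parents: dict mapping each node to the set of its parents.
    children = {n for n, ps in parents.items() if node in ps}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= parents[child]          # other parents of the node's children
    blanket.discard(node)
    return blanket

# Toy DAG: A -> C, B -> C, C -> D (not the graph of Figure 2).
parents = {'A': set(), 'B': set(), 'C': {'A', 'B'}, 'D': {'C'}}
print(markov_blanket('C', parents))  # A, B and D
print(markov_blanket('A', parents))  # B and C (B is a co-parent of C)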

Given a set of potentials, {<V_1, F_1>, ..., <V_n, F_n>}, the multiplication of these potentials is itself a potential, <V_α, F_α>, where:

V_α = ∪_{i=1}^{n} V_i

F_α(x) = ∏_{i=1}^{n} F_i(ψ^{V_α}_{V_i}(x))

This is simpler than it appears. We call the set of random variables in a potential the potential's scheme. The scheme of a product of a set of potentials is the union of the schemes of the factors. Likewise, the value assigned by the function of the product to a particular value combination of the random variables is the product of the values assigned by the functions of the factors to the same value combination (restricted to those random variables present in the factor).

Example 12. Take the multiplication of two potentials pot_1 = <{X1, X2}, f> and pot_2 = <{X1, X3}, g>, where all random variables are binary.

Table 2: pot_1, with columns x1, x2 and f(X1 = x1, X2 = x2).

Table 3: pot_2, with columns x1, x3 and g(X1 = x1, X3 = x3).

Where pot_3 = pot_1 × pot_2, we have pot_3 as given in Table 4.

Given a potential, <V, F>, the marginalization of some random variable v ∈ V out of this potential is itself a potential, <V_α, F_α>, where:

V_α = V \ {v}

F_α(x) = ∑_{y ∈ Γ_V : ψ^V_{V_α}(y) = x} F(y)
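Both operations can be sketched in Python, representing a potential as a pair of a variable tuple and a table keyed by value combinations. The numeric entries below are illustrative (they are not the values of Tables 2 to 5), and all variables are assumed binary:

from itertools import product

def multiply(pot1, pot2):
    # Scheme of the product = union of the schemes; values multiply pointwise.
    (v1, f1), (v2, f2) = pot1, pot2
    scheme = tuple(dict.fromkeys(v1 + v2))             # ordered union of variables
    table = {}
    for combo in product((0, 1), repeat=len(scheme)):  # binary variables assumed
        assign = dict(zip(scheme, combo))
        key1 = tuple(assign[v] for v in v1)
        key2 = tuple(assign[v] for v in v2)
        table[combo] = f1[key1] * f2[key2]
    return scheme, table

def marginalize(pot, var):
    # Sum the table over all values of var; var leaves the scheme.
    scheme, table = pot
    new_scheme = tuple(v for v in scheme if v != var)
    new_table = {}
    for combo, value in table.items():
        key = tuple(x for v, x in zip(scheme, combo) if v != var)
        new_table[key] = new_table.get(key, 0) + value
    return new_scheme, new_table

# Illustrative potentials over binary variables X1, X2 and X1, X3.
pot1 = (('X1', 'X2'), {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4})
pot2 = (('X1', 'X3'), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1})

pot3 = multiply(pot1, pot2)       # scheme ('X1', 'X2', 'X3')
pot4 = marginalize(pot1, 'X1')    # scheme ('X2',), values 0.8 and 1.2
print(pot3[0], pot4[1])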

Table 4: pot_3, with columns x1, x2, x3 and h(X1 = x1, X2 = x2, X3 = x3).

Example 13. If pot_4 is the result of marginalizing X1 out of pot_1 from Example 12, then pot_4 is as given in Table 5.

Table 5: pot_4, with columns x2 and i(X2 = x2).

Some points:

- Note that potentials are simply generalizations of probability distributions: the latter are necessarily the former, but not vice versa. In fact, a conditional probability table is a potential, not a distribution.

- Unlike distributions, potentials need not sum to 1.

2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm

Let Γ be a subset of the random variables in our network. Let f be a function that assigns each random variable v ∈ Γ a particular value, f(v), from those that v can take. To obtain the probability that the random variables in Γ take the values assigned to them by f:

1. Perform a topological sort on the DAG. This gives us an ordering where all nodes occur before their descendants. From the definition of a DAG, this is always possible.

2. For each node, n, construct a bucket, b_n. Also construct a null bucket.

3. For each conditional probability distribution in the network:

(a) Create a list of the random variables present in the conditional probability distribution.

(b) For each random variable v ∈ Γ, eliminate all rows corresponding to values other than f(v), and eliminate v from the associated list.

(c) Associate this list with the resulting potential and place this potential in the bucket associated with the highest-ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

4. Proceed in the given order through the buckets:

(a) Create a new potential by multiplying all potentials in the bucket. Associate with this potential a list of random variables that includes all random variables on the lists associated with the original potentials in the bucket.

(b) In this potential, marginalize out (i.e. sum over) the random variable associated with the bucket. Remove the random variable associated with the bucket from the associated list.

(c) Place the resulting potential in the bucket associated with the highest-ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

5. Multiply together the potentials in the null bucket (this is simply scalar multiplication).

To obtain the a posteriori probability that a subset of random variables, Γ, takes particular values given the observation that a second subset, Δ, has taken particular values, we run the algorithm twice: first on Γ ∪ Δ, then on Δ, and we divide the first result by the second.

Some points to note:

- The algorithm can be extended to obtain good estimates of error bars for our probability estimates, and wishing to do so is the main reason for using the algorithm.

- The complexity of the algorithm is dominated by the largest potential, which will be at least the size of the largest conditional probability table and which is, in practice, much, much smaller than the full joint distribution.

- When used to calculate a large number of probabilities (such as the a posteriori probability distributions for each unobserved random variable), the algorithm is relatively inefficient, since, if f is a function from the random variables in the network to the number of values each can take, it must be run f(v) - 1 times for each unobserved random variable, v.

- The algorithm can be run on the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm

The Junction Tree algorithm is the workhorse of Bayesian Network inference algorithms, permitting efficient exact inference. It does not, though, permit the calculation of error bars for our probability estimates. Since the Junction Tree algorithm is a generalization of the Variable Elimination algorithm, there is hope that the extension to the latter that permits us to obtain such error bars can likewise be generalized so as to be utilized in the former. Whether, and if so how, this can be done is an open research question.

This algorithm utilizes a secondary structure formed from the Bayesian Network called a Junction Tree or Join Tree. We first show how to create this structure.

Some definitions:

- A cluster is a maximally connected sub-graph.

- The weight of a node is the number of values its associated random variable has.

- The weight of a cluster is the product of the weights of its constituent nodes.

The Create (an Optimal) Junction Tree Algorithm:

1. Take a copy, G, of the DAG, join all unconnected parents and undirect all edges.

2. While there are still nodes left in G:

(a) Select a node, n, from G, such that n causes the least number of edges to be added in step 2b, breaking ties by choosing the node which induces the cluster with the least weight.

(b) Form a cluster, C, from the node and its neighbors, adding edges as required.

(c) If C is not a sub-graph of a previously stored cluster, store C as a clique.

(d) Remove n from G.

3. Create n trees, each consisting of a single stored clique. Also create a set, S, of sepsets, where these are the intersections of each pair of cliques. Repeat until n - 1 sepsets have been inserted into the forest:

(a) Select from S the sepset, s, that has the largest number of variables in it, breaking ties by calculating the product of the number of values of the random variables in the sets, and choosing the set with the lowest. Further ties can be broken arbitrarily.

(b) Delete s from S.

(c) Insert s between the cliques X and Y only if X and Y are on different trees in the forest. (This merges their two trees into a larger tree, until you are left with a single tree: the Junction Tree.)

Figure 3: A simple Bayesian Network (nodes A through I).

Table 6: Evidence potentials for nodes A to E. A: nothing known; B: observed to be value 1; C: observed to not be value 2; D: soft evidence, with actual probabilities; E: soft evidence, assigning the same probabilities as D.

Before explaining how to perform inference using a Junction Tree, we require some definitions:

Figure 4: The Junction Tree constructed from Figure 3, with cliques {B, E}, {D, E, G}, {G, I}, {F, H}, {C, D, F} and {A, C, D}, connected by the sepsets {E}, {G}, {D}, {F} and {C, D}.

Evidence Potentials

An evidence potential has a singleton set of random variables, and maps the random variable's values to real numbers. If working with hard evidence, it will map values which evidence has ruled out to 0, and all other values to 1 (where at least one value must be mapped to 1). Where all values are mapped to 1, nothing is known about the random variable. Where all values except one are mapped to 0, it is known that the random variable takes the remaining value. If working with soft evidence, values can be mapped to any non-negative real number, but the sum of these must be non-zero. Such a potential assigns the values the probabilities specified by its normalization.

Message Pass

We pass a message from one clique, c_1, to another, c_2, via the intervening sepset, s, as follows:

1. Save the potential associated with s.

2. Marginalize a new potential for s, containing only those variables in s, out of c_1.

3. Assign a new potential to c_2, such that:

pot(c_2)_new = pot(c_2)_old × (pot(s)_new / pot(s)_old)

Collect Evidence

When called on a clique, c, Collect Evidence does the following:

1. Marks c.

2. Calls Collect Evidence recursively on the unmarked neighbors of c, if any.

3. Passes a message from c to the clique that called Collect Evidence, if any.

Disperse Evidence

When called on a clique, c, Disperse Evidence does the following:

1. Marks c.

2. Passes a message to each of the unmarked neighbors of c, if any.

3. Calls Disperse Evidence recursively on the unmarked neighbors of c, if any.

To perform inference on a Junction Tree, we use the following algorithm:

1. Associate with each clique and sepset a potential, whose random variables are those of the clique/sepset, and which associates the value 1 with all value combinations of these random variables.

2. For each node:

(a) Associate with the node an evidence potential representing current knowledge.

(b) Find a clique containing the node and its parents (it is certain to exist) and multiply in the node's conditional probability table to the clique's potential. (By "multiply in" is meant: multiply the node's conditional probability table and the clique's potential, and replace the clique's potential with the result.)

(c) Multiply in the evidence potential associated with the node.

3. Pick an arbitrary root clique, and call Collect Evidence and then Disperse Evidence on this clique.

4. For each node you wish to obtain a posteriori probabilities for:

(a) Select the smallest clique containing this node.

(b) Create a copy of the potential associated with this clique.

(c) Marginalize all other nodes out of the copy.

(d) Normalize the resulting potential. This is the random variable's a posteriori probability distribution.

Some points to note:

- The complexity of the algorithm is dominated by the largest potential associated with a clique, which will be at least the size of, and probably much larger than, the largest conditional probability table. But it is, in practice, much smaller than the full joint distribution. When cliques are relatively small, the algorithm is comparatively efficient. There are also numerous techniques to improve efficiency available in the literature.

- A Junction Tree can be formed from the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

If the network is sufficiently complex, exact inference algorithms will become intractable. In such cases we turn to likelihood sampling. Using this algorithm, given a set of random variables, E, whose values we know (or are assuming), we can estimate a posteriori probabilities for the other random variables, U, in the network:

1. Perform a topological sort on the DAG.

2. Set all random variables in E to the value they are known/assumed to take.

3. For each random variable in U, create a score card, with a number for each value the random variable can take. Initially set all numbers to zero.

4. Repeat:

(a) In the order generated in step 1, for each node in U, randomly assign a value to its random variable using the node's conditional probability table.

(b) Given the values assigned, calculate p(E = e) from the conditional probability tables of the random variables in E. I.e., where Par(v) is the set of random variables associated with the parents of the node associated with random variable v, par(v) are the values these parents have been assigned and E = {E_1, ..., E_n}, calculate:

p(E = e) = ∏_{E_n ∈ E} p(E_n = e_n | Par(E_n) = par(E_n))

(c) For each random variable in U, add p(E = e) to the score for the value it was assigned in this sample.

5. For each random variable in U, normalize its score card. This is an estimate of the random variable's a posteriori probability distribution.

3 Parameter Learning

3.1 The Dirichlet Distribution

The Dirichlet distribution is a multivariate distribution parametrized by a vector α of positive reals. It is often said that Dirichlet distributions represent the probabilities related with seeing value a_i occur x_i out of N = ∑_n x_n times. If the probability of a random variable, X, taking particular values from the set {a_1, a_2, ..., a_n} is given by a Dirichlet distribution dir(x_1, x_2, ..., x_n) then:

p(X = a_i) = x_i / ∑_n x_n

The corresponding probability density function is:

p(f_1, f_2, ..., f_{n-1}) = (Γ(N) / ∏_{k=1}^{n} Γ(x_k)) f_1^{x_1 - 1} f_2^{x_2 - 1} ··· f_n^{x_n - 1}

where:

0 ≤ f_k ≤ 1, ∑_{k=1}^{n} f_k = 1, and f_n = 1 - f_1 - f_2 - ... - f_{n-1}.

Let two binary random variables, X and Y, with values (codomains) {x_1, x_2} and {y_1, y_2}, be represented by the Dirichlet distributions dir(4, 6) and dir(40, 60) respectively. Then p(X = x_1) = p(Y = y_1) = 0.4 and p(X = x_2) = p(Y = y_2) = 0.6. However, our confidence in the probabilities given for Y would be much higher than in those given for X, since so much more of the density lies in the vicinity of these values. We shall also see that, for our purposes, the probabilities for Y would be much more resistant to emendation from new evidence than those for X.

3.2 Parameter Dirichlet Distribution

We can now give the algorithm for learning a network's parameters given data D and graph G:

1. Perform a topological sort on G.

2. For each node, associate a set of Dirichlet distributions, one for each possible combination of values that the random variables associated with the node's parents can take.

3. For each datum, d ∈ D:

(a) For each node, n ∈ G, in the order given by step 1:

i. Find the Dirichlet distribution associated with n that corresponds to the values taken by the random variables associated with the parents of n in d.

ii. In this distribution, add 1 to the parameter corresponding to the value the random variable associated with n takes in d.

To encode prior information and/or enforce a level of conservatism, we can set the initial parameters of the Dirichlets of step 2. Regarding conservatism, we do not want to conclude from a single instance that it is certain a random variable will take a given value. To avoid this, it is often suggested that the parameters all be initialized to 1; however, this is normally rendered irrelevant because of the use of an equivalent sample size (see below).

4 Structure Learning

To score network topologies we require:

1. A search space

2. A set of transitions which will permit us to search the state space

3. A scoring function which we will seek to maximise

4. A search strategy/algorithm

4.1 Search Spaces

There are three search spaces that might be used.

4.1.1 Ordered DAG Topologies

Firstly, we might specify an ordering on the variables and search the topologies that respect this ordering. Justification for this is that the chain rule is valid in any order. For example:

p(X_1, X_2, X_3) = p(X_3 | X_1, X_2)p(X_2 | X_1)p(X_1) = p(X_1 | X_2, X_3)p(X_2 | X_3)p(X_3)

Motivations include the small size of the state space and the ability, because all graphs respect the ordering, to produce compound graphs from a number of high-scoring topologies. Problematically, though, not all conditional independencies potentially present can be represented by the topologies respecting a particular ordering. In cases where such independencies are present, and hence not encoded in the network, the network is more susceptible to noise and more complex than it needs to be. Searching ordered DAG topologies also raises the issues that will be presented below for searching all DAG topologies.

4.1.2 DAG Topologies

We can avoid these issues by searching all possible DAG topologies. This too, though, presents concerns. Remember that graph topologies can be divided into Markov equivalence classes, and that all topologies belonging to the same equivalence class encode the same conditional independencies. A priori, we would like all equivalence classes to be equally likely to be selected. But some equivalence classes have massively more members than others. Therefore, since our state space searches are heuristic, and can get stuck at local maxima, if we search DAG topologies, equivalence classes with many members are much more likely than those with few members to be learnt.

4.1.3 Markov Equivalence Classes of DAG Topologies

This leads to the obvious final search space: equivalence classes of DAG topologies. This is generally the best option and is the option used in high-end Bayesian Network applications.

4.2 The Bayesian Scoring Criterion (BS)

The Bayesian scoring criterion (BS) scores the fitness of a topology by calculating the probability of the data given the topology:

P(d | G) = ∏_{i=1}^{n} ∏_{j ∈ PA(i)} [ Γ(N^(G)_ij) / Γ(N^(G)_ij + M^(G)_ij) ] ∏_{k=1}^{r_i} [ Γ(a^(G)_ijk + s^(G)_ijk) / Γ(a^(G)_ijk) ]

where:

- d is our learning data.
- G is the graph we are scoring.
- n is the number of nodes in the graph.
- PA is a function from a node to the possible value combinations of the parents of that node.
- N^(G)_ij is the sum, for graph G, of the Dirichlet prior parameters for the row of node i's conditional probability table corresponding to its parents' value combination j.
- M^(G)_ij is the sum of the learnt additions to the Dirichlet parameters for the same row.
- r_i is the number of values node i has.
- a^(G)_ijk is the Dirichlet prior parameter corresponding to value k in row j for node i in graph G.
- s^(G)_ijk is the sum of the learnt additions to the same parameter.

Given a new topology, we learn the parameters of the network using the algorithm explained earlier. We then score the topology given these parameters by using the BS. Importantly, the BS is locally updateable: we can calculate the effects that alterations to the topology have on the score, rather than needing to recalculate the score from scratch.

4.3 The Bayesian Equivalent Scoring Criterion (BSe)

To ensure that the procedure outlined in the previous section results in Markov equivalent topologies obtaining equal scores, we must use an equivalent sample size. What this means is that we pick some number, n, and fix it such that the prior parameters in the Dirichlets associated with each node's conditional probability table sum to n. Because the size of the conditional probability distributions is exponential in the number of parents of the node, this often results in either nodes with no, or few, parents having large prior parameters, and hence being resistant to learning from the data, or nodes

with many parents having prior parameters very close to zero, which results in a lack of conservatism. Generally, the second option is chosen.

Using the Bayesian scoring criterion with an equivalent sample size is called using the Bayesian equivalent scoring criterion (BSe).
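As a closing sketch, the per-node factor of the BS from section 4.2 can be computed in log space with the log-gamma function; the prior parameters and counts below are invented for illustration, with an equivalent sample size of 2:

from math import lgamma

def log_bs_node(priors, counts):
    # priors[j][k] = Dirichlet prior a_ijk for parent configuration j, value k.
    # counts[j][k] = learnt addition s_ijk obtained from the data.
    log_score = 0.0
    for a_row, s_row in zip(priors, counts):
        N = sum(a_row)                    # N_ij: sum of prior parameters in the row
        M = sum(s_row)                    # M_ij: sum of learnt counts in the row
        log_score += lgamma(N) - lgamma(N + M)
        for a, s in zip(a_row, s_row):
            log_score += lgamma(a + s) - lgamma(a)
    return log_score

# One binary node with one binary parent: two rows, equivalent sample size 2.
priors = [[0.5, 0.5], [0.5, 0.5]]
counts = [[8, 2], [1, 9]]
print(log_bs_node(priors, counts))
# Summing log_bs_node over all nodes gives log P(d | G) for the whole graph.

Summing this per-node term over every node of a candidate topology, after learning the counts as in section 3.2, gives the log of P(d | G) used to compare topologies.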


More information

Be able to define the following terms and answer basic questions about them:

Be able to define the following terms and answer basic questions about them: CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional

More information

Chapter 2 Class Notes

Chapter 2 Class Notes Chapter 2 Class Notes Probability can be thought of in many ways, for example as a relative frequency of a long series of trials (e.g. flips of a coin or die) Another approach is to let an expert (such

More information

Machine Learning Lecture 14

Machine Learning Lecture 14 Many slides adapted from B. Schiele, S. Roth, Z. Gharahmani Machine Learning Lecture 14 Undirected Graphical Models & Inference 23.06.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

More information

Probability COMP 245 STATISTICS. Dr N A Heard. 1 Sample Spaces and Events Sample Spaces Events Combinations of Events...

Probability COMP 245 STATISTICS. Dr N A Heard. 1 Sample Spaces and Events Sample Spaces Events Combinations of Events... Probability COMP 245 STATISTICS Dr N A Heard Contents Sample Spaces and Events. Sample Spaces........................................2 Events........................................... 2.3 Combinations

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Lecture 3 Probability Basics

Lecture 3 Probability Basics Lecture 3 Probability Basics Thais Paiva STA 111 - Summer 2013 Term II July 3, 2013 Lecture Plan 1 Definitions of probability 2 Rules of probability 3 Conditional probability What is Probability? Probability

More information

Conditional Independence and Factorization

Conditional Independence and Factorization Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008 Readings: K&F: 16.3, 16.4, 17.3 Bayesian Param. Learning Bayesian Structure Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University October 6 th, 2008 10-708 Carlos Guestrin 2006-2008

More information

1 : Introduction. 1 Course Overview. 2 Notation. 3 Representing Multivariate Distributions : Probabilistic Graphical Models , Spring 2014

1 : Introduction. 1 Course Overview. 2 Notation. 3 Representing Multivariate Distributions : Probabilistic Graphical Models , Spring 2014 10-708: Probabilistic Graphical Models 10-708, Spring 2014 1 : Introduction Lecturer: Eric P. Xing Scribes: Daniel Silva and Calvin McCarter 1 Course Overview In this lecture we introduce the concept of

More information

Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning/ Deep Learning Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/32 Lecture 5a Bayesian network April 14, 2016 2/32 Table of contents 1 1. Objectives of Lecture 5a 2 2.Bayesian

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

1 INFO Sep 05

1 INFO Sep 05 Events A 1,...A n are said to be mutually independent if for all subsets S {1,..., n}, p( i S A i ) = p(a i ). (For example, flip a coin N times, then the events {A i = i th flip is heads} are mutually

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Advanced Probabilistic Modeling in R Day 1

Advanced Probabilistic Modeling in R Day 1 Advanced Probabilistic Modeling in R Day 1 Roger Levy University of California, San Diego July 20, 2015 1/24 Today s content Quick review of probability: axioms, joint & conditional probabilities, Bayes

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Lecture 8: Bayesian Networks

Lecture 8: Bayesian Networks Lecture 8: Bayesian Networks Bayesian Networks Inference in Bayesian Networks COMP-652 and ECSE 608, Lecture 8 - January 31, 2017 1 Bayes nets P(E) E=1 E=0 0.005 0.995 E B P(B) B=1 B=0 0.01 0.99 E=0 E=1

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning CS4375 --- Fall 2018 Bayesian a Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell 1 Uncertainty Most real-world problems deal with

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network.

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network. ecall from last time Lecture 3: onditional independence and graph structure onditional independencies implied by a belief network Independence maps (I-maps) Factorization theorem The Bayes ball algorithm

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Discrete Probability. Chemistry & Physics. Medicine

Discrete Probability. Chemistry & Physics. Medicine Discrete Probability The existence of gambling for many centuries is evidence of long-running interest in probability. But a good understanding of probability transcends mere gambling. The mathematics

More information

Introduction to Machine Learning

Introduction to Machine Learning Uncertainty Introduction to Machine Learning CS4375 --- Fall 2018 a Bayesian Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell Most real-world problems deal with

More information

CS626 Data Analysis and Simulation

CS626 Data Analysis and Simulation CS626 Data Analysis and Simulation Instructor: Peter Kemper R 104A, phone 221-3462, email:kemper@cs.wm.edu Today: Probability Primer Quick Reference: Sheldon Ross: Introduction to Probability Models 9th

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

I - Probability. What is Probability? the chance of an event occuring. 1classical probability. 2empirical probability. 3subjective probability

I - Probability. What is Probability? the chance of an event occuring. 1classical probability. 2empirical probability. 3subjective probability What is Probability? the chance of an event occuring eg 1classical probability 2empirical probability 3subjective probability Section 2 - Probability (1) Probability - Terminology random (probability)

More information

Probability Theory and Applications

Probability Theory and Applications Probability Theory and Applications Videos of the topics covered in this manual are available at the following links: Lesson 4 Probability I http://faculty.citadel.edu/silver/ba205/online course/lesson

More information

Discrete Probability and State Estimation

Discrete Probability and State Estimation 6.01, Spring Semester, 2008 Week 12 Course Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Spring Semester, 2008 Week

More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

Theorem 1.7 [Bayes' Law]: Assume that,,, are mutually disjoint events in the sample space s.t.. Then Pr( )

Theorem 1.7 [Bayes' Law]: Assume that,,, are mutually disjoint events in the sample space s.t.. Then Pr( ) Theorem 1.7 [Bayes' Law]: Assume that,,, are mutually disjoint events in the sample space s.t.. Then Pr Pr = Pr Pr Pr() Pr Pr. We are given three coins and are told that two of the coins are fair and the

More information

Outline Conditional Probability The Law of Total Probability and Bayes Theorem Independent Events. Week 4 Classical Probability, Part II

Outline Conditional Probability The Law of Total Probability and Bayes Theorem Independent Events. Week 4 Classical Probability, Part II Week 4 Classical Probability, Part II Week 4 Objectives This week we continue covering topics from classical probability. The notion of conditional probability is presented first. Important results/tools

More information

Graphical Models. Andrea Passerini Statistical relational learning. Graphical Models

Graphical Models. Andrea Passerini Statistical relational learning. Graphical Models Andrea Passerini passerini@disi.unitn.it Statistical relational learning Probability distributions Bernoulli distribution Two possible values (outcomes): 1 (success), 0 (failure). Parameters: p probability

More information

Parameter Learning With Binary Variables

Parameter Learning With Binary Variables With Binary Variables University of Nebraska Lincoln CSCE 970 Pattern Recognition Outline Outline 1 Learning a Single Parameter 2 More on the Beta Density Function 3 Computing a Probability Interval Outline

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Lecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan

Lecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan Lecture 9: Naive Bayes, SVM, Kernels Instructor: Outline 1 Probability basics 2 Probabilistic Interpretation of Classification 3 Bayesian Classifiers, Naive Bayes 4 Support Vector Machines Probability

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Learning Bayesian Networks

Learning Bayesian Networks Learning Bayesian Networks Probabilistic Models, Spring 2011 Petri Myllymäki, University of Helsinki V-1 Aspects in learning Learning the parameters of a Bayesian network Marginalizing over all all parameters

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics EECS 281A / STAT 241A Statistical Learning Theory Solutions to Problem Set 1 Fall 2011 Issued: Thurs, September

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

11. Probability Sample Spaces and Probability

11. Probability Sample Spaces and Probability 11. Probability 11.1 Sample Spaces and Probability 1 Objectives A. Find the probability of an event. B. Find the empirical probability of an event. 2 Theoretical Probabilities 3 Example A fair coin is

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information