Introduction to Discrete Probability Theory and Bayesian Networks


Dr Michael Ashcroft
October 10, 2012

This document remains the property of Inatas. Reproduction in whole or in part without the written permission of Inatas is strictly forbidden.

Contents

1 Introduction to Discrete Probability
  1.1 Discrete Probability Spaces
    1.1.1 Sample Spaces, Outcomes and Events
    1.1.2 Probability Functions
  1.2 The probabilities of events
  1.3 Random Variables
  1.4 Combinations of events
  1.5 Conditional Probability
  1.6 Independence
  1.7 Conditional Independence
  1.8 The Chain Rule
  1.9 Bayes Theorem
  1.10 The Dirichlet Distribution
  1.11 Parameter Likelihood
  1.12 Prior and Posteriori Distributions and Conjugate Priors
  1.13 Maximum Likelihood (ML) and Maximum Aposteriori (MAP)
2 Introduction to Bayesian Networks
  2.1 Bayesian Networks
  2.2 D-Separation, The Markov Blanket and Markov Equivalence
  2.3 Potentials
  2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm
  2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm
  2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling
3 Parameter Learning
  3.1 The Dirichlet Distribution
  3.2 Parameter Dirichlet Distribution

4 Structure Learning
  4.1 Search Spaces
    4.1.1 Ordered DAG Topologies
    4.1.2 DAG Topologies
    4.1.3 Markov Equivalence Classes of DAG Topologies
  4.2 The Bayesian Scoring Criterion (BS)
  4.3 The Bayesian Equivalent Scoring Criterion (BSe)

1 Introduction to Discrete Probability

1.1 Discrete Probability Spaces

A discrete probability space is a pair <S, P>, where S is a sample space and P is a probability function.

1.1.1 Sample Spaces, Outcomes and Events

An outcome is a value that the stochastic system we are modeling can take. The sample space of our model is the set of all outcomes. An event is a subset of the sample space. (So events are also sets of outcomes.)

Example 1. The sample space corresponding to rolling a die is {1, 2, 3, 4, 5, 6}. The outcomes of this sample space are 1, 2, 3, 4, 5 and 6. The events of this sample space are the members of its power set: ∅, {1}, {2}, {3}, {4}, {5}, {6}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2, 3}, {2, 4}, {2, 5}, {2, 6}, {3, 4}, {3, 5}, {3, 6}, {4, 5}, {4, 6}, {5, 6}, {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 3, 4}, {1, 3, 5}, {1, 3, 6}, {1, 4, 5}, {1, 4, 6}, {1, 5, 6}, {2, 3, 4}, {2, 3, 5}, {2, 3, 6}, {2, 4, 5}, {2, 4, 6}, {2, 5, 6}, {3, 4, 5}, {3, 4, 6}, {3, 5, 6}, {4, 5, 6}, {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {1, 2, 4, 5}, {1, 2, 4, 6}, {1, 2, 5, 6}, {1, 3, 4, 5}, {1, 3, 4, 6}, {1, 3, 5, 6}, {1, 4, 5, 6}, {2, 3, 4, 5}, {2, 3, 4, 6}, {2, 3, 5, 6}, {2, 4, 5, 6}, {3, 4, 5, 6}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6}, {1, 2, 4, 5, 6}, {1, 3, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6}.

1.1.2 Probability Functions

A probability function is a function p from S to the real numbers such that:

(i) 0 ≤ p(s) ≤ 1, for each s ∈ S.

(ii) ∑_{s ∈ S} p(s) = 1.

Example 2. If a six-sided die is fair, the sample space associated with the result of a single throw is {1, 2, 3, 4, 5, 6} and the probability function is:

p(1) = 1/6
p(2) = 1/6

p(3) = 1/6
p(4) = 1/6
p(5) = 1/6
p(6) = 1/6

Example 3. Now imagine a die that is not fair: it has twice the probability of coming up six as it does of coming up any other number. The probability function for such a die would be:

p(1) = 1/7
p(2) = 1/7
p(3) = 1/7
p(4) = 1/7
p(5) = 1/7
p(6) = 2/7

1.2 The probabilities of events

The probability of an event E is defined:

p(E) = ∑_{o ∈ E} p(o)

Example 4. Continuing Example 2 (the fair die), the event E, that the roll of the die produces an odd number, is {1, 3, 5}. Therefore:

p(E) = ∑_{o ∈ E} p(o) = p(1) + p(3) + p(5) = 1/6 + 1/6 + 1/6 = 1/2

Example 5. Likewise, for Example 3 (the biased die), the event F, that the roll of the die produces 5 or 6, is {5, 6}. Therefore:

p(F) = ∑_{o ∈ F} p(o) = p(5) + p(6) = 1/7 + 2/7 = 3/7

Notice that an event represents the disjunctive claim that one of the outcomes that are its members occurred. So event E ({1, 3, 5}) represents the event that the roll of the die produced 1 or 3 or 5.

1.3 Random Variables

Note: A random variable is neither random nor a variable! Often, we are interested in numerical values that are connected with our outcomes. We use random variables to model these. A random variable is a function from a sample space to the real numbers.

Example 6. Suppose a coin is flipped three times. Let X(t) be the random variable that equals the number of heads that appear when t is the outcome. Then X(t) takes the following values:

X(HHH) = 3
X(HHT) = X(HTH) = X(THH) = 2
X(HTT) = X(THT) = X(TTH) = 1
X(TTT) = 0

Notice that a random variable divides the sample space into a disjoint and exhaustive set of events, each mapped to a unique real number, r. Let us term the event mapped to r E_r. The probability distribution of a random variable X on a sample space S is the set of ordered pairs <r, p(X = r)> for all r ∈ X(S), where p(X = r) = ∑_{o ∈ E_r} p(o). This is the probability that an outcome o occurred such that X(o) = r, and is often characterized by saying that X took the value r. As we would expect:

(i) 0 ≤ p(E_r) ≤ 1, for each r ∈ X(S).

(ii) ∑_{r ∈ X(S)} p(E_r) = 1.
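This construction can be sketched in a few lines of Python; the fair-coin probabilities below are an illustrative assumption, since Example 6 itself does not specify them:

from itertools import product

# Sample space for three ordered coin tosses, each outcome equally likely (assumed).
sample_space = [''.join(t) for t in product('HT', repeat=3)]
p = {o: 1 / len(sample_space) for o in sample_space}

# Random variable X: number of heads in the outcome.
def X(outcome):
    return outcome.count('H')

# Probability distribution of X: pairs <r, p(X = r)>.
distribution = {}
for o in sample_space:
    distribution[X(o)] = distribution.get(X(o), 0) + p[o]

print(distribution)  # {3: 0.125, 2: 0.375, 1: 0.375, 0: 0.125}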

Placing this in table form gives us a familiar discrete probability distribution.

Points to note about random variables:

1. A random variable and its probability distribution together constitute a probability space.

2. A function from the codomain of a random variable to the real numbers is itself a random variable.

1.4 Combinations of events

Some theorems:

1. p(¬E) = 1 - p(E)

2. p(E and F) = p(E, F) = p(E ∩ F)

3. p(E or F) = p(E ∪ F) = p(E) + p(F) - p(E ∩ F)

Theorem 1 should be obvious! Note, though, that combined with the definition of a probability distribution, it entails that p(∅) = 0. Let's look at an example for theorem 3.

Example 7. Returning to Example 4 with the fair die, let the event E be {1, 3, 5} (that the roll of the die produces an odd number), and event F be {5, 6} (that the roll of the die produces 5 or 6). We want to calculate the probability that one of these events occurs, which is to say that the event E ∪ F occurs. Using theorem 3 we see:

p(E ∪ F) = p(E) + p(F) - p(E ∩ F)
         = (p(1) + p(3) + p(5)) + (p(5) + p(6)) - p(5)
         = p(1) + p(3) + p(5) + p(6)
         = 4/6

Which is as it should be!
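As a minimal sketch, the theorems above can be checked numerically in Python using the fair-die probabilities of Example 2 (the helper function p_event is an illustrative name, not something from the text):

# Fair die from Example 2.
p = {o: 1/6 for o in range(1, 7)}

def p_event(event):
    # p(E) = sum of p(o) over outcomes o in E.
    return sum(p[o] for o in event)

E = {1, 3, 5}   # roll is odd
F = {5, 6}      # roll is 5 or 6

# Theorem 1: complement.
assert abs(p_event(set(p) - E) - (1 - p_event(E))) < 1e-12

# Theorem 3: p(E or F) = p(E) + p(F) - p(E and F).
lhs = p_event(E | F)
rhs = p_event(E) + p_event(F) - p_event(E & F)
assert abs(lhs - rhs) < 1e-12
print(lhs)  # 0.666... = 4/6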

1.5 Conditional Probability

The conditional probability of one event, E, given another, F, is denoted p(E | F). It is defined:

p(E | F) = p(E ∩ F) / p(F)

Example 8. Continuing Example 7, we can calculate the probability that the roll of the die produces 5 or 6 given that it produces an odd number:

p(F | E) = p(E ∩ F) / p(E) = p(5) / (p(1) + p(3) + p(5)) = (1/6) / (3/6) = 1/3

1.6 Independence

Two events, E and F, are independent if and only if p(E, F) = p(E)p(F). That is to say, the probability of both events occurring is simply the probability of the first event occurring multiplied by the probability that the second event occurs. Likewise, two random variables X and Y are independent if and only if p(X = r1, Y = r2) = p(X = r1)p(Y = r2), for all real numbers r1 and r2.

Independence is of great practical importance: it significantly simplifies working out complex probabilities. Where a number of events are independent, we can quickly calculate their joint probability distribution from their individual probabilities.

Example 9. Imagine we are examining the results of the (ordered) tosses of three coins. Given that the possible results of each coin are {H, T}, the sample space for our model will be {H, T}^3. Let us define three random variables, X1, X2, X3. X1 maps outcomes to 1 if the first coin lands heads, and 0 otherwise. X2 and X3 do likewise for the second and third coins. Now assume we are given the following information:

p(X1 = 1) = p(X2 = 1) = p(X3 = 1) = 0.1

If we also know that these random variables are independent, then we can immediately calculate the joint probability distribution for the three random variables from these three values alone (remembering that p(Xn = 0) = 1 - p(Xn = 1)):

1. p(X1 = 1, X2 = 1, X3 = 1) = 0.1 × 0.1 × 0.1 = 0.001
2. p(X1 = 1, X2 = 1, X3 = 0) = 0.1 × 0.1 × 0.9 = 0.009
3. p(X1 = 1, X2 = 0, X3 = 1) = 0.1 × 0.9 × 0.1 = 0.009
4. p(X1 = 1, X2 = 0, X3 = 0) = 0.1 × 0.9 × 0.9 = 0.081
5. p(X1 = 0, X2 = 1, X3 = 1) = 0.9 × 0.1 × 0.1 = 0.009
6. p(X1 = 0, X2 = 1, X3 = 0) = 0.9 × 0.1 × 0.9 = 0.081
7. p(X1 = 0, X2 = 0, X3 = 1) = 0.9 × 0.9 × 0.1 = 0.081
8. p(X1 = 0, X2 = 0, X3 = 0) = 0.9 × 0.9 × 0.9 = 0.729

If we do not know these random variables are independent, we require much more information. In fact, we will need to have the values for each of the entries in the joint probability distribution. Notice that:

- Our storage requirements have jumped from linear in the number of random variables to exponential. (Very bad.)

- Our computational complexity has fallen from linear in the number of random variables to constant. (Good, but we could obtain this in the earlier case as well, if we kept the probabilities after we calculated them.)

Typically, the probability distributions that are of interest to us are such that this exponential storage complexity renders them intractable. Some methods for dealing with this, such as the naive Bayes classifier, simply assume independence among the random variables they are modeling. But this can lead to significantly lower accuracy from the model.

1.7 Conditional Independence

Analogously to independence, we say that two events, E and F, are conditionally independent given another, G, if and only if p(G) ≠ 0 and one of the following holds:

1. p(E | F ∩ G) = p(E | G), where p(E | G) ≠ 0 and p(F | G) ≠ 0; or

2. p(E | G) = 0 or p(F | G) = 0.

Example 10. Say we have 13 objects. Each object is either black (B) or white (W), each object has either a 1 or a 2 written on it, and each object is either a square (□) or a diamond (◇). The objects are:

B1□, B1□, B1◇, B2□, B2□, B2□, B2□, B2◇, B2◇, W1□, W1◇, W2□, W2◇

If we are interested in the characteristics of a randomly drawn object and assume all objects have an equal chance of being drawn, then using the techniques we have already looked at, we can see that the event, E1, that a randomly selected object has a 1 written on it is not independent of the event, E□, that such an object is square. But they are conditionally independent given the event, EB, that the object is black (and, in fact, also given the event that the object is white):

p(E1) = 5/13
p(E1 | E□) = 3/8
p(E1 | EB) = 3/9 = 1/3
p(E1 | E□ ∩ EB) = 2/6 = 1/3

There is little more to say about conditional independence at this point, but soon it will take center stage as a means of obtaining the accuracy of using the full joint distribution of the random variables we are modeling while avoiding the complexity issues that accompany this.

1.8 The Chain Rule

The chain rule for events says that given n events, E1, E2, ..., En, defined on the same sample space S:

p(E1, E2, ..., En) = p(En | En-1, En-2, ..., E1) ... p(E2 | E1)p(E1)

Applied to random variables, this gives us that for n random variables, X1, X2, ..., Xn, defined on the same sample space S:

p(X1 = x1, X2 = x2, ..., Xn = xn) = p(Xn = xn | Xn-1 = xn-1, Xn-2 = xn-2, ..., X1 = x1) ... p(X2 = x2 | X1 = x1)p(X1 = x1)

It is straightforward to prove this rule using the rule for conditional probability.
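The numbers in Example 10 can be checked by enumeration; the following Python sketch assumes the object list reconstructed above and uses exact fractions:

from fractions import Fraction

# Each object from Example 10: (colour, number, shape).
objects = (
    [('B', 1, 'square')] * 2 + [('B', 2, 'square')] * 4 +
    [('B', 1, 'diamond')] * 1 + [('B', 2, 'diamond')] * 2 +
    [('W', 1, 'square')] * 1 + [('W', 2, 'square')] * 1 +
    [('W', 1, 'diamond')] * 1 + [('W', 2, 'diamond')] * 1
)

def p(pred, given=lambda o: True):
    # Conditional probability under uniform draws: |pred and given| / |given|.
    pool = [o for o in objects if given(o)]
    return Fraction(sum(pred(o) for o in pool), len(pool))

one = lambda o: o[1] == 1
square = lambda o: o[2] == 'square'
black = lambda o: o[0] == 'B'

print(p(one))                                    # 5/13
print(p(one, square))                            # 3/8: not independent
print(p(one, black))                             # 1/3
print(p(one, lambda o: square(o) and black(o)))  # 1/3: conditionally independent given black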

1.9 Bayes Theorem

Bayes theorem is:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | ¬F)p(¬F))

Proof:

1. By the definition of conditional probability, p(F | E) = p(E ∩ F) / p(E) and p(E | F) = p(E ∩ F) / p(F).

2. Therefore, p(E ∩ F) = p(F | E)p(E) = p(E | F)p(F).

3. Therefore, p(F | E) = p(E | F)p(F) / p(E).

4. p(E) = p(E ∩ S) = p(E ∩ (F ∪ ¬F)) = p((E ∩ F) ∪ (E ∩ ¬F)).

5. (E ∩ F) and (E ∩ ¬F) are disjoint (otherwise F ∩ ¬F ≠ ∅), so p(E) = p((E ∩ F) ∪ (E ∩ ¬F)) = p(E | F)p(F) + p(E | ¬F)p(¬F).

6. Therefore p(F | E) = p(E | F)p(F) / p(E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | ¬F)p(¬F)). (Bayes theorem)

Example 11. Suppose 1 person in 100,000 has a particular rare disease. There exists a diagnostic test for this disease that is accurate 99% of the time when given to those who have the disease and 99.5% of the time when given to those who do not. Given this information, we can find the probability that someone who tests positive for the disease actually has the disease. Let E be the event that someone tests positive for the disease and F be the event that a person has the disease. We want to find p(F | E). We know that p(F) = 1/100,000 = 0.00001 and so p(¬F) = 0.99999. We also know that p(E | F) = 99/100 = 0.99, so p(¬E | F) = 0.01. Likewise we know that p(¬E | ¬F) = 0.995, so p(E | ¬F) = 0.005. So by Bayes theorem:

p(F | E) = p(E | F)p(F) / (p(E | F)p(F) + p(E | ¬F)p(¬F)) = (0.99)(0.00001) / ((0.99)(0.00001) + (0.005)(0.99999)) ≈ 0.002

Notice that the result is not intuitively obvious. Most people, if told only the information we had available, assume that testing positive means a very high probability of having the disease.

1.10 The Dirichlet Distribution

The Dirichlet distribution is a multivariate distribution parametrized by a vector α of positive reals. It is often said that Dirichlet distributions represent the probabilities related with seeing value a_i occur x_i out of N = ∑_n x_n times. If the probability of a random variable, X, taking particular values from the set {a_1, a_2, ..., a_n} is given by a Dirichlet distribution dir(x_1, x_2, ..., x_n) then:

p(X = a_i) = x_i / ∑_n x_n

In the special case where the random variables are binary, the Dirichlet distribution is also called the Beta distribution.

1.11 Parameter Likelihood

The likelihood of a set of parameter values given some observed outcomes is the probability of those observed outcomes given those parameter values. Imagine that there are two types of coins, C1 and C2. The first type is fair and the other weighted. The probabilities associated with each type coming up heads or tails when tossed are given by the Beta distributions C1: beta(1, 1) and C2: beta(3, 1). Now imagine that we have a coin and obtain the result (H, H, H). What is the likelihood of the two parameter vectors given this data set?

P(HHH | C1) = (1/2)(1/2)(1/2) = 1/8
P(HHH | C2) = (3/4)(3/4)(3/4) = 27/64

So:

L(C1 | HHH) = 1/8
L(C2 | HHH) = 27/64

1.12 Prior and Posteriori Distributions and Conjugate Priors

Consider the general problem of inferring a distribution for a parameter θ given some datum or data x. From Bayes theorem, this posterior distribution is equal to the product of the likelihood function L(θ | x) (which is equal to P(x | θ)) and the prior distribution p(θ), normalized (divided) by the probability of the data p(x) (which, as in Bayes theorem above, equals p(x | θ)p(θ) + p(x | ¬θ)p(¬θ)):

p(θ | x) = p(x | θ)p(θ) / (p(x | θ)p(θ) + p(x | ¬θ)p(¬θ))

For certain choices of the prior, the posterior has the same algebraic form as the prior (generally with different parameter values) with respect to a likelihood function. Such a choice is a conjugate prior for the likelihood function. The Dirichlet distribution is self-conjugate, which means that if the prior and the likelihood function are both Dirichlet, then the posteriori will also be Dirichlet. Other self-conjugate distributions include the Gaussian.
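The likelihood calculation of section 1.11 can be reproduced with a short Python sketch; the helper functions are illustrative names, and the probability of heads is taken to be the mean of the given Beta distribution:

# Mean of a beta(a, b) prior gives the probability of heads for each coin type.
def p_heads(a, b):
    return a / (a + b)

def likelihood(a, b, data):
    # L(coin | data) = P(data | coin) for an i.i.d. sequence of 'H'/'T' tosses.
    ph = p_heads(a, b)
    result = 1.0
    for toss in data:
        result *= ph if toss == 'H' else 1 - ph
    return result

data = "HHH"
print(likelihood(1, 1, data))  # C1: (1/2)^3 = 0.125
print(likelihood(3, 1, data))  # C2: (3/4)^3 = 0.421875

These values match L(C1 | HHH) = 1/8 and L(C2 | HHH) = 27/64 above.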

In the Dirichlet case, it is useful to think of the parameters of a conjugate prior as corresponding to pseudo-observations. That is to say, by assigning our two coins the prior distributions beta(1, 1) and beta(3, 1) we are saying that in the first case, our prior understanding of the behaviour of coins of type C1 is as if we had observed two tosses and seen one come up heads and the other tails. Likewise for the second case, our prior understanding of the behaviour of coins of type C2 is as if we had observed four tosses and seen three come up heads and one tails.

The process of calculating the posteriori distribution given certain data and a Dirichlet prior is extremely easy. We simply count the observed number of times the variable takes specific values and add these numbers to the appropriate parameters of the prior. So if we knew we were dealing with a coin of type C1, then since we observed three heads, our posteriori distribution for coins of type C1 would be dir(1 + 3, 1) = dir(4, 1).

Note that the prior parameters (also called the hyper-parameters) play a very important role in specifying how resistant to emendation the distribution is given new evidence. Another way of thinking about this is that they represent our confidence in the probabilities given. Imagine we had given coins of type C1 the prior distribution dir(1000, 1000), that is to say, our prior knowledge is equivalent to observing 2000 tosses, of which 1000 came up heads and 1000 tails. The prior probability of a coin of type C1 coming up heads would still be 1000/2000 = 0.5. But given the observed three heads, the posteriori would be dir(1003, 1000), giving a probability of heads of 1003/2003 ≈ 0.5007: significantly more resistant to emendation!

1.13 Maximum Likelihood (ML) and Maximum Aposteriori (MAP)

An important question is how to decide which distributions we should use to model random variables. Because traditional hypothesis testing from statistical theory often fails to scale up, in machine learning it is common to use Maximum Likelihood (ML) and Maximum Aposteriori (MAP) methods.

When working with ML, we choose the hypothesis that has the largest likelihood. So in our example with the two types of coins, we would choose the hypothesis that the coin in question was of type C2, since L(C1 | HHH) = 1/8 < 27/64 = L(C2 | HHH).

The example highlights the obvious problems with ML. In the real world it is overwhelmingly likely that a given coin is (approximately) fair. There are weighted coins (of various types), but these are extremely rare. So deciding on the basis of three heads that a coin is weighted is very questionable.

MAP solves this problem by introducing prior knowledge in the form of a prior distribution. So we might specify that, a priori, the probability of a coin being type C1 is 999/1000 and type C2 is 1/1000. We now calculate the aposteriori probability that our coin is of type C1 or type C2:

p(C1 | HHH) = P(HHH | C1)p(C1) / (P(HHH | C1)p(C1) + P(HHH | C2)p(C2))
            = (1/8)(0.999) / ((1/8)(0.999) + (27/64)(0.001))
            = 0.124875 / 0.125296875
            ≈ 0.997

p(C2 | HHH) = P(HHH | C2)p(C2) / (P(HHH | C1)p(C1) + P(HHH | C2)p(C2))
            = (27/64)(0.001) / 0.125296875
            ≈ 0.003

So by taking into account our prior knowledge of the distribution of weighted and unweighted coins in the world, we decide overwhelmingly in favor of the hypothesis that the coin is unweighted.

It is also possible to use likelihood and aposteriori probabilities to rank the relative fitness of hypotheses, with the purpose of modeling the behaviour of phenomena with weighted sets of distributions.

2 Introduction to Bayesian Networks

2.1 Bayesian Networks

A Bayesian Network is a model of a system, which in turn consists of a number of random variables. It consists of:

1. A directed acyclic graph (DAG), within which each random variable is represented by a node. The topology of this DAG must meet the Markov Condition: each node must be conditionally independent of its nondescendants given its parents.

2. A set of conditional probability distributions, one for each node, which give the probability of the random variable represented by the given node taking particular values given the values the random variables represented by the node's parents take.

Examine the DAG in Figure 1 and the information in Table 1. From the chain rule, we know that the joint probability distribution of the random variables is p(A, B, C, D, E) = p(E | D, C, B, A)p(D | C, B, A)p(C | B, A)p(B | A)p(A). But given the conditional independencies present in P, we know that:

p(C | B, A) = p(C | A)
p(D | C, B, A) = p(D | C, B)
p(E | D, C, B, A) = p(E | C)

So we know that p(A, B, C, D, E) = p(E | C)p(D | C, B)p(C | A)p(B | A)p(A). This may not seem a huge improvement, but it is. It means we can calculate the full joint distribution from the (normally much, much smaller) conditional probability tables associated with each node. As the networks get bigger, the advantages of such a method become crucial. What we have done is pull the joint probability distribution apart by its conditional independencies. We now have a means of obtaining tractable calculations using the full joint distribution.

It has been proven that every discrete probability distribution (and many continuous ones) can be represented by a Bayesian Network, and that every Bayesian Network represents some probability distribution. Of course, if there are no conditional independencies in the joint probability distribution, representing it with a Bayesian Network gains us nothing. But in practice, while independence relationships between random variables in a system we are interested in modeling are rare (and assumptions regarding such independence dangerous), conditional independencies are plentiful.

Some important points about Bayesian Networks:

- Bayesian Networks provide much more information than simple classifiers (like neural networks, or support vector machines, etc). Most importantly, when used to predict the value a random variable will take, they return a probability distribution rather than simply specifying which value is most probable. Obviously, there are many advantages to this.

- Bayesian Networks have an easily understandable and informative physical interpretation (unlike neural networks, or support vector machines, etc, which are effectively black boxes to all but experts). We will see one advantage of this in the next section.

- We can use Bayesian Networks simply to model the correlations and conditional independencies between the random variables of systems. But generally we are interested in inferring the probability distributions of a subset of the random variables of the network given knowledge of the values taken by another (possibly empty) subset.

- Bayesian Networks can also be extended to Influence Diagrams, with decision and utility nodes, in order to perform automated decision making.
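To make the saving concrete, here is a Python sketch that assembles the joint distribution of Figure 1's five variables from the factorization p(E | C)p(D | C, B)p(C | A)p(B | A)p(A). The variables are assumed binary and every CPT number is invented for illustration; only the structure comes from the text.

# Hypothetical CPTs for the DAG of Figure 1 (all variables binary; values 0/1).
# Each table gives p(node = 1 | parent values); p(node = 0 | ...) = 1 - that.
p_A = 0.3
p_B = {0: 0.2, 1: 0.7}          # p(B=1 | A)
p_C = {0: 0.5, 1: 0.1}          # p(C=1 | A)
p_D = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}  # p(D=1 | B, C)
p_E = {0: 0.25, 1: 0.8}         # p(E=1 | C)

def bern(p1, value):
    return p1 if value == 1 else 1 - p1

def joint(a, b, c, d, e):
    # p(A,B,C,D,E) = p(E|C) p(D|C,B) p(C|A) p(B|A) p(A)
    return (bern(p_E[c], e) * bern(p_D[(b, c)], d) *
            bern(p_C[a], c) * bern(p_B[a], b) * bern(p_A, a))

# Sanity check: the factorized joint sums to 1 over all 2^5 assignments.
total = sum(joint(a, b, c, d, e)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)
            for d in (0, 1) for e in (0, 1))
print(round(total, 10))  # 1.0

Here eleven conditional-probability entries determine all 2^5 joint probabilities, whereas a full joint table over five binary variables has 31 free parameters.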

2.2 D-Separation, The Markov Blanket and Markov Equivalence

The Markov Condition also entails other conditional independencies. Because of the Markov Condition, these conditional independencies have a graph theoretic criterion called D-Separation (which we will not define, as it is difficult). Accordingly, when one set of random variables, Γ, is conditionally independent of another, Δ, given a third, Θ, we will say that the nodes representing the random variables in Γ are D-Separated from Δ by Θ.

The most important case of D-Separation/conditional independence is: a node is D-Separated from the rest of the graph given its parents, its children, and the other parents of its children. Because of this, the parents, children and other parents of a node's children are called the Markov Blanket of the node.

This is important. Imagine we have a node, α, (which is associated with a random variable) whose probability distribution we wish to predict and whose Markov Blanket is the set of nodes, Γ. If we know the value of (the random variables associated with) every node in Γ, then the rest of the network provides no further information regarding the value taken by (the random variable associated with) α. In this way, if we are confident that we can always establish the values of some of the random variables our network is modeling, we can often see that certain of the random variables are superfluous, and we need not continue to include them in the network nor collect information on them. Since, in practice, collecting data on random variables can be costly, this can be very helpful.

We will also say that two DAGs are Markov Equivalent if they have the same D-Separations.

2.3 Potentials

Where V is a set of random variables {v_1, ..., v_n}, let Γ_V be the Cartesian product of the co-domains of the random variables in V. So Γ_V consists of all the possible combinations of values that the random variables of V can take.

Let φ_V be a mapping V × Γ_V → R, such that φ_V(v_i, x) = the ith term of x, where x ∈ Γ_V. I.e. φ_V gives us the value assigned to a particular member of V by a particular member of Γ_V.

If W ⊆ V, let ψ^V_W be a mapping Γ_V → Γ_W, such that φ_W(x, ψ^V_W(y)) = φ_V(x, y), for all x ∈ W, y ∈ Γ_V. So ψ^V_W gives us the member of Γ_W in which all the members of W are assigned the same values as in a particular member of Γ_V.

A potential is an ordered pair <V, F>, where V is a set of random variables, and F is a mapping Γ_V → R.

Figure 1: A DAG with five nodes (edges A → B, A → C, B → D, C → D and C → E).

Node | Conditional Independencies
A    | -
B    | C and E, given A
C    | B, given A
D    | A and E, given B and C
E    | A, B and D, given C

Table 1: Conditional independencies required of the random variables for the DAG in Figure 1 to be a Bayesian Network.

Figure 2: The Markov Blanket of node L (in a larger DAG over nodes A through W).
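Figure 2 shows a Markov Blanket pictorially; given the parent sets of a DAG, the blanket of a node can also be computed directly. A minimal Python sketch, using an invented toy graph rather than the graph of Figure 2:

def markov_blanket(node, parents):
    # parents: dict mapping each node to the set of its parents.
    children = {n for n, ps in parents.items() if node in ps}
    blanket = set(parents[node]) | children
    for child in children:
        blanket |= parents[child]          # other parents of the node's children
    blanket.discard(node)
    return blanket

# Toy DAG: A -> C, B -> C, C -> D (not the graph of Figure 2).
parents = {'A': set(), 'B': set(), 'C': {'A', 'B'}, 'D': {'C'}}
print(markov_blanket('C', parents))  # A, B and D
print(markov_blanket('A', parents))  # B and C (B is a co-parent of C)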

Given a set of potentials, {<V_1, F_1>, ..., <V_n, F_n>}, the multiplication of these potentials is itself a potential, <V_α, F_α>, where:

V_α = ∪_{i=1}^{n} V_i

F_α(x) = ∏_{i=1}^{n} F_i(ψ^{V_α}_{V_i}(x))

This is simpler than it appears. We call the set of random variables in a potential the potential's scheme. The scheme of a product of a set of potentials is the union of the schemes of the factors. Likewise, the value assigned by the function of the product to a particular value combination of the random variables is the product of the values assigned by the functions of the factors to the same value combination (restricted to those random variables present in the factor).

Example 12. Take the multiplication of two potentials pot_1 = <{X1, X2}, f> and pot_2 = <{X1, X3}, g>, where all random variables are binary.

Table 2: pot_1, with columns x1, x2 and f(X1 = x1, X2 = x2).

Table 3: pot_2, with columns x1, x3 and g(X1 = x1, X3 = x3).

Where pot_3 = pot_1 × pot_2, we have pot_3 as given in Table 4.

Given a potential, <V, F>, the marginalization of some random variable v ∈ V out of this potential is itself a potential, <V_α, F_α>, where:

V_α = V \ {v}

F_α(x) = ∑_{y ∈ Γ_V : ψ^V_{V_α}(y) = x} F(y)
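Both operations can be sketched in Python, representing a potential as a pair of a variable tuple and a table keyed by value combinations. The numeric entries below are illustrative (they are not the values of Tables 2 to 5), and all variables are assumed binary:

from itertools import product

def multiply(pot1, pot2):
    # Scheme of the product = union of the schemes; values multiply pointwise.
    (v1, f1), (v2, f2) = pot1, pot2
    scheme = tuple(dict.fromkeys(v1 + v2))             # ordered union of variables
    table = {}
    for combo in product((0, 1), repeat=len(scheme)):  # binary variables assumed
        assign = dict(zip(scheme, combo))
        key1 = tuple(assign[v] for v in v1)
        key2 = tuple(assign[v] for v in v2)
        table[combo] = f1[key1] * f2[key2]
    return scheme, table

def marginalize(pot, var):
    # Sum the table over all values of var; var leaves the scheme.
    scheme, table = pot
    new_scheme = tuple(v for v in scheme if v != var)
    new_table = {}
    for combo, value in table.items():
        key = tuple(x for v, x in zip(scheme, combo) if v != var)
        new_table[key] = new_table.get(key, 0) + value
    return new_scheme, new_table

# Illustrative potentials over binary variables X1, X2 and X1, X3.
pot1 = (('X1', 'X2'), {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4})
pot2 = (('X1', 'X3'), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1})

pot3 = multiply(pot1, pot2)       # scheme ('X1', 'X2', 'X3')
pot4 = marginalize(pot1, 'X1')    # scheme ('X2',), values 0.8 and 1.2
print(pot3[0], pot4[1])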

Table 4: pot_3, with columns x1, x2, x3 and h(X1 = x1, X2 = x2, X3 = x3).

Example 13. If pot_4 is the result of marginalizing X1 out of pot_1 from Example 12, then pot_4 is as given in Table 5.

Table 5: pot_4, with columns x2 and i(X2 = x2).

Some points:

- Note that potentials are simply generalizations of probability distributions: the latter are necessarily the former, but not vice versa. In fact, a conditional probability table is a potential, not a distribution.

- Unlike distributions, potentials need not sum to 1.

2.4 Exact Inference on a Bayesian Network: The Variable Elimination Algorithm

Let Γ be a subset of the random variables in our network. Let f be a function that assigns each random variable v ∈ Γ a particular value, f(v), from those that v can take. To obtain the probability that the random variables in Γ take the values assigned to them by f:

1. Perform a topological sort on the DAG. This gives us an ordering where all nodes occur before their descendants. From the definition of a DAG, this is always possible.

2. For each node, n, construct a bucket, b_n. Also construct a null bucket.

3. For each conditional probability distribution in the network:

(a) Create a list of the random variables present in the conditional probability distribution.

(b) For each random variable v ∈ Γ, eliminate all rows corresponding to values other than f(v), and eliminate v from the associated list.

(c) Associate this list with the resulting potential and place this potential in the bucket associated with the highest-ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

4. Proceed in the given order through the buckets:

(a) Create a new potential by multiplying all potentials in the bucket. Associate with this potential a list of random variables that includes all random variables on the lists associated with the original potentials in the bucket.

(b) In this potential, marginalize out (i.e. sum over) the random variable associated with the bucket. Remove the random variable associated with the bucket from the associated list.

(c) Place the resulting potential in the bucket associated with the highest-ordered random variable remaining in the list. If there are no random variables remaining, place the potential in the null bucket.

5. Multiply together the potentials in the null bucket (this is simply scalar multiplication).

To obtain the a posteriori probability that a subset of random variables, Γ, takes particular values given the observation that a second subset, Δ, has taken particular values, we run the algorithm twice: first on Γ ∪ Δ, then on Δ, and we divide the first result by the second.

Some points to note:

- The algorithm can be extended to obtain good estimates of error bars for our probability estimates, and wishing to do so is the main reason for using the algorithm.

- The complexity of the algorithm is dominated by the largest potential, which will be at least the size of the largest conditional probability table and which is, in practice, much, much smaller than the full joint distribution.

- When used to calculate a large number of probabilities (such as the a posteriori probability distributions for each unobserved random variable), the algorithm is relatively inefficient, since, if f is a function from the random variables in the network to the number of values each can take, it must be run f(v) - 1 times for each unobserved random variable, v.

- The algorithm can be run on the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.5 Exact Inference on a Bayesian Network: The Junction Tree Algorithm

The Junction Tree algorithm is the workhorse of Bayesian Network inference algorithms, permitting efficient exact inference. It does not, though, permit the calculation of error bars for our probability estimates. Since the Junction Tree algorithm is a generalization of the Variable Elimination algorithm, there is hope that the extension to the latter that permits us to obtain such error bars can likewise be generalized so as to be utilized in the former. Whether, and if so how, this can be done is an open research question.

This algorithm utilizes a secondary structure formed from the Bayesian Network called a Junction Tree or Join Tree. We first show how to create this structure.

Some definitions:

- A cluster is a maximally connected sub-graph.

- The weight of a node is the number of values its associated random variable has.

- The weight of a cluster is the product of the weights of its constituent nodes.

The Create (an Optimal) Junction Tree Algorithm:

1. Take a copy, G, of the DAG, join all unconnected parents and undirect all edges.

2. While there are still nodes left in G:

(a) Select a node, n, from G, such that n causes the least number of edges to be added in step 2b, breaking ties by choosing the node which induces the cluster with the least weight.

(b) Form a cluster, C, from the node and its neighbors, adding edges as required.

(c) If C is not a sub-graph of a previously stored cluster, store C as a clique.

(d) Remove n from G.

3. Create n trees, each consisting of a single stored clique. Also create a set, S, of sepsets, where these are the intersections of each pair of cliques. Repeat until n - 1 sepsets have been inserted into the forest:

(a) Select from S the sepset, s, that has the largest number of variables in it, breaking ties by calculating the product of the number of values of the random variables in the sets, and choosing the set with the lowest. Further ties can be broken arbitrarily.

(b) Delete s from S.

(c) Insert s between the cliques X and Y only if X and Y are on different trees in the forest. (This merges their two trees into a larger tree, until you are left with a single tree: the Junction Tree.)

Figure 3: A simple Bayesian Network (nodes A through I).

Table 6: Evidence potentials for nodes A to E. A: nothing known; B: observed to be value 1; C: observed to not be value 2; D: soft evidence, with actual probabilities; E: soft evidence, assigning the same probabilities as D.

Before explaining how to perform inference using a Junction Tree, we require some definitions:

Figure 4: The Junction Tree constructed from Figure 3, with cliques {B, E}, {D, E, G}, {G, I}, {F, H}, {C, D, F} and {A, C, D}, connected by the sepsets {E}, {G}, {D}, {F} and {C, D}.

Evidence Potentials

An evidence potential has a singleton set of random variables, and maps the random variable's values to real numbers. If working with hard evidence, it will map values which evidence has ruled out to 0, and all other values to 1 (where at least one value must be mapped to 1). Where all values are mapped to 1, nothing is known about the random variable. Where all values except one are mapped to 0, it is known that the random variable takes the remaining value. If working with soft evidence, values can be mapped to any non-negative real number, but the sum of these must be non-zero. Such a potential assigns the values the probabilities specified by its normalization.

Message Pass

We pass a message from one clique, c_1, to another, c_2, via the intervening sepset, s, as follows:

1. Save the potential associated with s.

2. Marginalize a new potential for s, containing only those variables in s, out of c_1.

3. Assign a new potential to c_2, such that:

pot(c_2)_new = pot(c_2)_old × (pot(s)_new / pot(s)_old)

Collect Evidence

When called on a clique, c, Collect Evidence does the following:

1. Marks c.

2. Calls Collect Evidence recursively on the unmarked neighbors of c, if any.

3. Passes a message from c to the clique that called Collect Evidence, if any.

Disperse Evidence

When called on a clique, c, Disperse Evidence does the following:

1. Marks c.

2. Passes a message to each of the unmarked neighbors of c, if any.

3. Calls Disperse Evidence recursively on the unmarked neighbors of c, if any.

To perform inference on a Junction Tree, we use the following algorithm:

1. Associate with each clique and sepset a potential, whose random variables are those of the clique/sepset, and which associates the value 1 with all value combinations of these random variables.

2. For each node:

(a) Associate with the node an evidence potential representing current knowledge.

(b) Find a clique containing the node and its parents (it is certain to exist) and multiply in the node's conditional probability table to the clique's potential. (By "multiply in" is meant: multiply the node's conditional probability table and the clique's potential, and replace the clique's potential with the result.)

(c) Multiply in the evidence potential associated with the node.

3. Pick an arbitrary root clique, and call Collect Evidence and then Disperse Evidence on this clique.

4. For each node you wish to obtain a posteriori probabilities for:

(a) Select the smallest clique containing this node.

(b) Create a copy of the potential associated with this clique.

(c) Marginalize all other nodes out of the copy.

(d) Normalize the resulting potential. This is the random variable's a posteriori probability distribution.

Some points to note:

- The complexity of the algorithm is dominated by the largest potential associated with a clique, which will be at least the size of, and probably much larger than, the largest conditional probability table. But it is, in practice, much smaller than the full joint distribution. When cliques are relatively small, the algorithm is comparatively efficient. There are also numerous techniques to improve efficiency available in the literature.

- A Junction Tree can be formed from the smallest sub-graph containing (the nodes representing) the variables whose a posteriori probabilities we wish to find that is D-Separated from the remainder of the network by (nodes representing) random variables whose values we know.

2.6 Inexact Inference on a Bayesian Network: Likelihood Sampling

If the network is sufficiently complex, exact inference algorithms will become intractable. In such cases we turn to likelihood sampling. Using this algorithm, given a set of random variables, E, whose values we know (or are assuming), we can estimate a posteriori probabilities for the other random variables, U, in the network:

1. Perform a topological sort on the DAG.

2. Set all random variables in E to the value they are known/assumed to take.

3. For each random variable in U, create a score card, with a number for each value the random variable can take. Initially set all numbers to zero.

4. Repeat:

(a) In the order generated in step 1, for each node in U, randomly assign a value to its random variable using the node's conditional probability table.

(b) Given the values assigned, calculate p(E = e) from the conditional probability tables of the random variables in E. I.e., where Par(v) is the set of random variables associated with the parents of the node associated with random variable v, par(v) are the values these parents have been assigned and E = {E_1, ..., E_n}, calculate:

p(E = e) = ∏_{E_n ∈ E} p(E_n = e_n | Par(E_n) = par(E_n))

(c) For each random variable in U, add p(E = e) to the score for the value it was assigned in this sample.

5. For each random variable in U, normalize its score card. This is an estimate of the random variable's a posteriori probability distribution.

3 Parameter Learning

3.1 The Dirichlet Distribution

The Dirichlet distribution is a multivariate distribution parametrized by a vector α of positive reals. It is often said that Dirichlet distributions represent the probabilities related with seeing value a_i occur x_i out of N = ∑_n x_n times. If the probability of a random variable, X, taking particular values from the set {a_1, a_2, ..., a_n} is given by a Dirichlet distribution dir(x_1, x_2, ..., x_n) then:

p(X = a_i) = x_i / ∑_n x_n

The corresponding probability density function is:

p(f_1, f_2, ..., f_{n-1}) = (Γ(N) / ∏_{k=1}^{n} Γ(x_k)) f_1^{x_1 - 1} f_2^{x_2 - 1} ··· f_n^{x_n - 1}

where:

0 ≤ f_k ≤ 1, ∑_{k=1}^{n} f_k = 1, and f_n = 1 - f_1 - f_2 - ... - f_{n-1}.

Let two binary random variables, X and Y, with values (codomains) {x_1, x_2} and {y_1, y_2}, be represented by the Dirichlet distributions dir(4, 6) and dir(40, 60) respectively. Then p(X = x_1) = p(Y = y_1) = 0.4 and p(X = x_2) = p(Y = y_2) = 0.6. However, our confidence in the probabilities given for Y would be much higher than in those given for X, since so much more of the density lies in the vicinity of these values. We shall also see that, for our purposes, the probabilities for Y would be much more resistant to emendation from new evidence than those for X.

3.2 Parameter Dirichlet Distribution

We can now give the algorithm for learning a network's parameters given data D and graph G:

1. Perform a topological sort on G.

2. For each node, associate a set of Dirichlet distributions, one for each possible combination of values that the random variables associated with the node's parents can take.

3. For each datum, d ∈ D:

(a) For each node, n ∈ G, in the order given by step 1:

i. Find the Dirichlet distribution associated with n that corresponds to the values taken by the random variables associated with the parents of n in d.

ii. In this distribution, add 1 to the parameter corresponding to the value the random variable associated with n takes in d.

To encode prior information and/or enforce a level of conservatism, we can set the initial parameters of the Dirichlets of step 2. Regarding conservatism, we do not want to conclude from a single instance that it is certain a random variable will take a given value. To avoid this, it is often suggested that the parameters all be initialized to 1; however, this is normally rendered irrelevant because of the use of an equivalent sample size (see below).

4 Structure Learning

To score network topologies we require:

1. A search space

2. A set of transitions which will permit us to search the state space

3. A scoring function which we will seek to maximise

4. A search strategy/algorithm

4.1 Search Spaces

There are three search spaces that might be used.

4.1.1 Ordered DAG Topologies

Firstly, we might specify an ordering on the variables and search the topologies that respect this ordering. Justification for this is that the chain rule is valid in any order. For example:

p(X_1, X_2, X_3) = p(X_3 | X_1, X_2)p(X_2 | X_1)p(X_1) = p(X_1 | X_2, X_3)p(X_2 | X_3)p(X_3)

Motivations include the small size of the state space and the ability, because all graphs respect the ordering, to produce compound graphs from a number of high-scoring topologies. Problematically, though, not all conditional independencies potentially present can be represented by the topologies respecting a particular ordering. In cases where such independencies are present, and hence not encoded in the network, the network is more susceptible to noise and more complex than it needs to be. Searching ordered DAG topologies also raises the issues that will be presented below for searching all DAG topologies.

4.1.2 DAG Topologies

We can avoid these issues by searching all possible DAG topologies. This too, though, presents concerns. Remember that graph topologies can be divided into Markov equivalence classes, and that all topologies belonging to the same equivalence class encode the same conditional independencies. A priori, we would like all equivalence classes to be equally likely to be selected. But some equivalence classes have massively more members than others. Therefore, since our state space searches are heuristic, and can get stuck at local maxima, if we search DAG topologies, equivalence classes with many members are much more likely than those with few members to be learnt.

4.1.3 Markov Equivalence Classes of DAG Topologies

This leads to the obvious final search space: equivalence classes of DAG topologies. This is generally the best option and is the option used in high-end Bayesian Network applications.

4.2 The Bayesian Scoring Criterion (BS)

The Bayesian scoring criterion (BS) scores the fitness of a topology by calculating the probability of the data given the topology:

P(d | G) = ∏_{i=1}^{n} ∏_{j ∈ PA(i)} [ Γ(N^(G)_ij) / Γ(N^(G)_ij + M^(G)_ij) ] ∏_{k=1}^{r_i} [ Γ(a^(G)_ijk + s^(G)_ijk) / Γ(a^(G)_ijk) ]

where:

- d is our learning data.
- G is the graph we are scoring.
- n is the number of nodes in the graph.
- PA is a function from a node to the possible value combinations of the parents of that node.
- N^(G)_ij is the sum, for graph G, of the Dirichlet prior parameters for the row of node i's conditional probability table corresponding to its parents' value combination j.
- M^(G)_ij is the sum of the learnt additions to the Dirichlet parameters for the same row.
- r_i is the number of values node i has.
- a^(G)_ijk is the Dirichlet prior parameter corresponding to value k in row j for node i in graph G.
- s^(G)_ijk is the sum of the learnt additions to the same parameter.

Given a new topology, we learn the parameters of the network using the algorithm explained earlier. We then score the topology given these parameters by using the BS. Importantly, the BS is locally updateable: we can calculate the effects that alterations to the topology have on the score, rather than needing to recalculate the score from scratch.

4.3 The Bayesian Equivalent Scoring Criterion (BSe)

To ensure that the procedure outlined in the previous section results in Markov equivalent topologies obtaining equal scores, we must use an equivalent sample size. What this means is that we pick some number, n, and fix it such that the prior parameters in the Dirichlets associated with each node's conditional probability table sum to n. Because the size of the conditional probability distributions is exponential in the number of parents of the node, this often results in either nodes with no, or few, parents having large prior parameters, and hence being resistant to learning from the data, or nodes

with many parents having prior parameters very close to zero, which results in a lack of conservatism. Generally, the second option is chosen.

Using the Bayesian scoring criterion with an equivalent sample size is called using the Bayesian equivalent scoring criterion (BSe).
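As a closing sketch, the per-node factor of the BS from section 4.2 can be computed in log space with the log-gamma function; the prior parameters and counts below are invented for illustration, with an equivalent sample size of 2:

from math import lgamma

def log_bs_node(priors, counts):
    # priors[j][k] = Dirichlet prior a_ijk for parent configuration j, value k.
    # counts[j][k] = learnt addition s_ijk obtained from the data.
    log_score = 0.0
    for a_row, s_row in zip(priors, counts):
        N = sum(a_row)                    # N_ij: sum of prior parameters in the row
        M = sum(s_row)                    # M_ij: sum of learnt counts in the row
        log_score += lgamma(N) - lgamma(N + M)
        for a, s in zip(a_row, s_row):
            log_score += lgamma(a + s) - lgamma(a)
    return log_score

# One binary node with one binary parent: two rows, equivalent sample size 2.
priors = [[0.5, 0.5], [0.5, 0.5]]
counts = [[8, 2], [1, 9]]
print(log_bs_node(priors, counts))
# Summing log_bs_node over all nodes gives log P(d | G) for the whole graph.

Summing this per-node term over every node of a candidate topology, after learning the counts as in section 3.2, gives the log of P(d | G) used to compare topologies.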


More information

Be able to define the following terms and answer basic questions about them:

Be able to define the following terms and answer basic questions about them: CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional

More information

Chapter 2 Class Notes

Chapter 2 Class Notes Chapter 2 Class Notes Probability can be thought of in many ways, for example as a relative frequency of a long series of trials (e.g. flips of a coin or die) Another approach is to let an expert (such

More information

Machine Learning Lecture 14

Machine Learning Lecture 14 Many slides adapted from B. Schiele, S. Roth, Z. Gharahmani Machine Learning Lecture 14 Undirected Graphical Models & Inference 23.06.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de

More information

Probability COMP 245 STATISTICS. Dr N A Heard. 1 Sample Spaces and Events Sample Spaces Events Combinations of Events...

Probability COMP 245 STATISTICS. Dr N A Heard. 1 Sample Spaces and Events Sample Spaces Events Combinations of Events... Probability COMP 245 STATISTICS Dr N A Heard Contents Sample Spaces and Events. Sample Spaces........................................2 Events........................................... 2.3 Combinations

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Lecture 3 Probability Basics

Lecture 3 Probability Basics Lecture 3 Probability Basics Thais Paiva STA 111 - Summer 2013 Term II July 3, 2013 Lecture Plan 1 Definitions of probability 2 Rules of probability 3 Conditional probability What is Probability? Probability

More information

Conditional Independence and Factorization

Conditional Independence and Factorization Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008

Readings: K&F: 16.3, 16.4, Graphical Models Carlos Guestrin Carnegie Mellon University October 6 th, 2008 Readings: K&F: 16.3, 16.4, 17.3 Bayesian Param. Learning Bayesian Structure Learning Graphical Models 10708 Carlos Guestrin Carnegie Mellon University October 6 th, 2008 10-708 Carlos Guestrin 2006-2008

More information

1 : Introduction. 1 Course Overview. 2 Notation. 3 Representing Multivariate Distributions : Probabilistic Graphical Models , Spring 2014

1 : Introduction. 1 Course Overview. 2 Notation. 3 Representing Multivariate Distributions : Probabilistic Graphical Models , Spring 2014 10-708: Probabilistic Graphical Models 10-708, Spring 2014 1 : Introduction Lecturer: Eric P. Xing Scribes: Daniel Silva and Calvin McCarter 1 Course Overview In this lecture we introduce the concept of

More information

Rapid Introduction to Machine Learning/ Deep Learning

Rapid Introduction to Machine Learning/ Deep Learning Rapid Introduction to Machine Learning/ Deep Learning Hyeong In Choi Seoul National University 1/32 Lecture 5a Bayesian network April 14, 2016 2/32 Table of contents 1 1. Objectives of Lecture 5a 2 2.Bayesian

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

1 INFO Sep 05

1 INFO Sep 05 Events A 1,...A n are said to be mutually independent if for all subsets S {1,..., n}, p( i S A i ) = p(a i ). (For example, flip a coin N times, then the events {A i = i th flip is heads} are mutually

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Advanced Probabilistic Modeling in R Day 1

Advanced Probabilistic Modeling in R Day 1 Advanced Probabilistic Modeling in R Day 1 Roger Levy University of California, San Diego July 20, 2015 1/24 Today s content Quick review of probability: axioms, joint & conditional probabilities, Bayes

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Lecture 8: Bayesian Networks

Lecture 8: Bayesian Networks Lecture 8: Bayesian Networks Bayesian Networks Inference in Bayesian Networks COMP-652 and ECSE 608, Lecture 8 - January 31, 2017 1 Bayes nets P(E) E=1 E=0 0.005 0.995 E B P(B) B=1 B=0 0.01 0.99 E=0 E=1

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning CS4375 --- Fall 2018 Bayesian a Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell 1 Uncertainty Most real-world problems deal with

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network.

Recall from last time. Lecture 3: Conditional independence and graph structure. Example: A Bayesian (belief) network. ecall from last time Lecture 3: onditional independence and graph structure onditional independencies implied by a belief network Independence maps (I-maps) Factorization theorem The Bayes ball algorithm

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Discrete Probability. Chemistry & Physics. Medicine

Discrete Probability. Chemistry & Physics. Medicine Discrete Probability The existence of gambling for many centuries is evidence of long-running interest in probability. But a good understanding of probability transcends mere gambling. The mathematics

More information

Introduction to Machine Learning

Introduction to Machine Learning Uncertainty Introduction to Machine Learning CS4375 --- Fall 2018 a Bayesian Learning Reading: Sections 13.1-13.6, 20.1-20.2, R&N Sections 6.1-6.3, 6.7, 6.9, Mitchell Most real-world problems deal with

More information

CS626 Data Analysis and Simulation

CS626 Data Analysis and Simulation CS626 Data Analysis and Simulation Instructor: Peter Kemper R 104A, phone 221-3462, email:kemper@cs.wm.edu Today: Probability Primer Quick Reference: Sheldon Ross: Introduction to Probability Models 9th

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

I - Probability. What is Probability? the chance of an event occuring. 1classical probability. 2empirical probability. 3subjective probability

I - Probability. What is Probability? the chance of an event occuring. 1classical probability. 2empirical probability. 3subjective probability What is Probability? the chance of an event occuring eg 1classical probability 2empirical probability 3subjective probability Section 2 - Probability (1) Probability - Terminology random (probability)

More information

Probability Theory and Applications

Probability Theory and Applications Probability Theory and Applications Videos of the topics covered in this manual are available at the following links: Lesson 4 Probability I http://faculty.citadel.edu/silver/ba205/online course/lesson

More information

Discrete Probability and State Estimation

Discrete Probability and State Estimation 6.01, Spring Semester, 2008 Week 12 Course Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Spring Semester, 2008 Week

More information

Bayes Networks 6.872/HST.950

Bayes Networks 6.872/HST.950 Bayes Networks 6.872/HST.950 What Probabilistic Models Should We Use? Full joint distribution Completely expressive Hugely data-hungry Exponential computational complexity Naive Bayes (full conditional

More information

Theorem 1.7 [Bayes' Law]: Assume that,,, are mutually disjoint events in the sample space s.t.. Then Pr( )

Theorem 1.7 [Bayes' Law]: Assume that,,, are mutually disjoint events in the sample space s.t.. Then Pr( ) Theorem 1.7 [Bayes' Law]: Assume that,,, are mutually disjoint events in the sample space s.t.. Then Pr Pr = Pr Pr Pr() Pr Pr. We are given three coins and are told that two of the coins are fair and the

More information

Outline Conditional Probability The Law of Total Probability and Bayes Theorem Independent Events. Week 4 Classical Probability, Part II

Outline Conditional Probability The Law of Total Probability and Bayes Theorem Independent Events. Week 4 Classical Probability, Part II Week 4 Classical Probability, Part II Week 4 Objectives This week we continue covering topics from classical probability. The notion of conditional probability is presented first. Important results/tools

More information

Graphical Models. Andrea Passerini Statistical relational learning. Graphical Models

Graphical Models. Andrea Passerini Statistical relational learning. Graphical Models Andrea Passerini passerini@disi.unitn.it Statistical relational learning Probability distributions Bernoulli distribution Two possible values (outcomes): 1 (success), 0 (failure). Parameters: p probability

More information

Parameter Learning With Binary Variables

Parameter Learning With Binary Variables With Binary Variables University of Nebraska Lincoln CSCE 970 Pattern Recognition Outline Outline 1 Learning a Single Parameter 2 More on the Beta Density Function 3 Computing a Probability Interval Outline

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Lecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan

Lecture 9: Naive Bayes, SVM, Kernels. Saravanan Thirumuruganathan Lecture 9: Naive Bayes, SVM, Kernels Instructor: Outline 1 Probability basics 2 Probabilistic Interpretation of Classification 3 Bayesian Classifiers, Naive Bayes 4 Support Vector Machines Probability

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Learning Bayesian Networks

Learning Bayesian Networks Learning Bayesian Networks Probabilistic Models, Spring 2011 Petri Myllymäki, University of Helsinki V-1 Aspects in learning Learning the parameters of a Bayesian network Marginalizing over all all parameters

More information

Introduction to Bayesian Learning

Introduction to Bayesian Learning Course Information Introduction Introduction to Bayesian Learning Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Apprendimento Automatico: Fondamenti - A.A. 2016/2017 Outline

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory

UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics. EECS 281A / STAT 241A Statistical Learning Theory UC Berkeley Department of Electrical Engineering and Computer Science Department of Statistics EECS 281A / STAT 241A Statistical Learning Theory Solutions to Problem Set 1 Fall 2011 Issued: Thurs, September

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

11. Probability Sample Spaces and Probability

11. Probability Sample Spaces and Probability 11. Probability 11.1 Sample Spaces and Probability 1 Objectives A. Find the probability of an event. B. Find the empirical probability of an event. 2 Theoretical Probabilities 3 Example A fair coin is

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information