Lecture 5: Bayesian Network
Topics of this lecture:
- What is a Bayesian network?
- A simple example
- Formal definition of BN
- A slightly difficult example
- Learning of BN
- An example of learning
- Important topics in using BN
Lec05/2
What is a Bayesian Network? A Bayesian network (BN) is a graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). In the literature, a BN is also called a Bayes network, belief network, etc. Compared with the methods we have studied so far, a BN is more transparent and interpretable, in the sense that given some result, we can identify the most important factor(s), or given some factors (evidence), we can infer the most probable result(s).
A Simple Example (1) Reference: Mathematics in statistical inference of Bayesian network, K. Tanaka, 2009. Shown on the right is a simple BN. There are two factors here that can cause the alarm to activate. Each of them occurs with its own probability, P(A1) or P(A2), and each of them can activate the alarm with a conditional probability, P(A3|A1) or P(A3|A2). Based on these probabilities, we can determine, say, the probability that A1 occurred given that the alarm is activated. In a BN, each event is denoted by a node, and the causal relations between the events are denoted by the edges. [Figure: A1 → A3 ← A2] A1: Burglary; A2: Earthquake; A3: Alarm
A Simple Example (2) To define the BN, we need to find (learn) the following probabilities, using data collected so far or based on the experience of a human expert.

Prior probabilities:
  P(A1=1) = 0.1, P(A1=0) = 0.9
  P(A2=1) = 0.1, P(A2=0) = 0.9

Conditional probabilities P(A3|A1,A2):
         A1=1,A2=1   A1=1,A2=0   A1=0,A2=1   A1=0,A2=0
  A3=1   0.607       0.368       0.135       0.001
  A3=0   0.393       0.632       0.865       0.999

A1: Burglary; A2: Earthquake; A3: Alarm
A Simple Example (3) To make a decision, the simplest (but not efficient) way is to find the joint probability of all events, and then marginalize it based on the evidence. For the simple example given here, we have

P(A1, A2, A3) = P(A3|A1, A2) P(A1, A2)
              = P(A3|A1, A2) P(A1) P(A2)    (1)

The second line uses the fact (we believe) that A1 and A2 are independent. From the joint probability, we can find P(A1,A3), P(A2,A3), and P(A3) via marginalization. Using Bayes' theorem, we can find
P(A1|A3) = P(A1,A3)/P(A3): the probability of A1 when A3 is true
P(A2|A3) = P(A2,A3)/P(A3): the probability of A2 when A3 is true
A Simple Example (4) Using the probabilities defined for the given example, we have
P(A1|A3) = P(A1,A3)/P(A3) = 0.751486
P(A2|A3) = P(A2,A3)/P(A3) = 0.349377
Thus, for this example, when the alarm is activated, the most probable factor is burglary. Note that A1, A2, and A3 can represent other events, and decisions can be made in the same way. For example, a simple BN can be defined for car diagnosis: A3: Car engine does not work; A1: Run out of battery; A2: Engine starter problem.
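The two posterior probabilities above can be reproduced by brute-force enumeration of the joint distribution in Eq. (1). Below is a minimal sketch in Python (the variable names are mine, not from the lecture), using the CPTs from the tables above:

```python
# Brute-force inference in the burglary/earthquake/alarm BN.
# CPTs taken from the tables in the slides.
p_a1 = {1: 0.1, 0: 0.9}
p_a2 = {1: 0.1, 0: 0.9}
# P(A3=1 | A1, A2); P(A3=0 | ...) is the complement.
p_a3 = {(1, 1): 0.607, (1, 0): 0.368, (0, 1): 0.135, (0, 0): 0.001}

def joint(a1, a2, a3):
    """Eq. (1): P(A1,A2,A3) = P(A3|A1,A2) P(A1) P(A2)."""
    p3 = p_a3[(a1, a2)] if a3 == 1 else 1.0 - p_a3[(a1, a2)]
    return p_a1[a1] * p_a2[a2] * p3

# Marginalize to get P(A3=1), P(A1=1,A3=1), P(A2=1,A3=1).
p_a3_1 = sum(joint(a1, a2, 1) for a1 in (0, 1) for a2 in (0, 1))
p_a1_a3 = sum(joint(1, a2, 1) for a2 in (0, 1))
p_a2_a3 = sum(joint(a1, 1, 1) for a1 in (0, 1))

print(round(p_a1_a3 / p_a3_1, 6))  # P(A1=1|A3=1) → 0.751486
print(round(p_a2_a3 / p_a3_1, 6))  # P(A2=1|A3=1) → 0.349377
```

This is exactly the "compute the full joint, then marginalize" strategy of the slide; it is exponential in the number of variables, which is why efficient inference algorithms matter for larger networks.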
Formal Definition of Bayesian Network (1) Reference: Bayesian Network, Maomi Ueno, 2013. Formally, a BN is defined by a 2-tuple (pair) <G, Θ>, where G is a directed acyclic graph (DAG) and Θ is a set of probabilities. Each node of G represents an event or a random variable x_i, and the edge (i, j) exists if the j-th node depends on the i-th node. The joint probability of the whole BN can be found by

P(x) = ∏_{i=1}^{N} P(x_i | π_i)    (2)

where π_i is the set of parent nodes of x_i.
Formal Definition of Bayesian Network (2) In fact, to find the joint probability using Eq. (2), we need some assumptions. Given a graph G, let X, Y, and Z be three disjoint sets of nodes. If all paths between nodes of X and nodes of Y contain at least one node of Z, we say that X and Y are separated by Z. This fact is denoted by I(X, Y | Z)_G. On the other hand, we use I(X, Y | Z) to denote the fact that X and Y are conditionally independent given Z. If I(X, Y | Z)_G is equivalent to I(X, Y | Z) for all disjoint sets of nodes, G is said to be a perfect map (p-map).
Formal Definition of Bayesian Network (3) In fact, a DAG is not powerful enough to represent all kinds of probability models. In the study of BNs, we are more interested in graphs that represent only true conditional independences. Such graphs are called independence maps (I-maps). For an I-map, we have I(X, Y | Z)_G ⇒ I(X, Y | Z). To avoid trivial I-maps (complete graphs), we are more interested in minimal I-maps. A minimal I-map is an I-map with the minimum number of edges (if we delete any edge from G, G will no longer be an I-map).
Formal Definition of Bayesian Network (4) D-separation is an important concept in BNs. This concept is equivalent to conditional independence: if X and Y are d-separated by Z, then X and Y are conditionally independent given Z, and vice versa.

I(X, Y | Z)_d ⇔ I(X, Y | Z)_G    (3)

In fact, Eq. (2) can be proved based on d-separation and the fact that there are no cycles in a BN. For more detail, please refer to the reference book by Maomi Ueno.
A slightly difficult example (1) http://www.norsys.com/networklibrary.html#
A slightly difficult example (2) ASIA is a BN for a fictitious medical example about whether a patient has tuberculosis, lung cancer, or bronchitis, related to their X-ray, dyspnea, visit-to-Asia, and smoking status. For convenience of discussion, we denote the above events by A1~A8 as follows: A1: Visit to Asia; A2: Smoking; A3: Tuberculosis; A4: Lung cancer; A5: Bronchitis; A6: Tuberculosis or cancer; A7: X-ray result; A8: Dyspnea. Reference: Mathematics in statistical inference of Bayesian network, K. Tanaka, 2009.
A slightly difficult example (3)

P(A1): A1=1: 0.01, A1=0: 0.99
P(A2): A2=1: 0.5,  A2=0: 0.5

P(A3|A1):   A1=1   A1=0
  A3=1      0.05   0.01
  A3=0      0.95   0.99

P(A4|A2):   A2=1   A2=0
  A4=1      0.1    0.01
  A4=0      0.9    0.99

P(A5|A2):   A2=1   A2=0
  A5=1      0.6    0.3
  A5=0      0.4    0.7

P(A7|A6):   A6=1   A6=0
  A7=1      0.98   0.05
  A7=0      0.02   0.95
A slightly difficult example (4)

P(A6|A3,A4):
         A3=1,A4=1   A3=1,A4=0   A3=0,A4=1   A3=0,A4=0
  A6=1   1           1           1           0
  A6=0   0           0           0           1

P(A8|A5,A6):
         A5=1,A6=1   A5=1,A6=0   A5=0,A6=1   A5=0,A6=0
  A8=1   0.9         0.8         0.7         0.1
  A8=0   0.1         0.2         0.3         0.9
A slightly difficult example (5) Using Eq. (2), we have

P(A1, …, A8) = P(A8|A5,A6) P(A7|A6) P(A6|A3,A4) P(A5|A2) P(A4|A2) P(A3|A1) P(A1) P(A2)

Marginalizing this joint probability, we can find, for example,

P(A8) = Σ_{A1,A2,A3,A4,A5,A6,A7} P(A1, …, A8)
P(A3, A8) = Σ_{A1,A2,A4,A5,A6,A7} P(A1, …, A8)

Using Bayes' theorem, we can find P(A3|A8) from P(A8) and P(A3,A8), and see to what extent A3 is a factor causing A8.
A slightly difficult example (6) For the given BN, we have the results given below. From these results, we can see that A3 (Tuberculosis) is probably not the main factor causing A8 (Dyspnea).

P(A8):     A8=1: 0.43597,  A8=0: 0.56403

P(A3,A8):  A3=1,A8=1: 0.00821   A3=1,A8=0: 0.02189
           A3=0,A8=1: 0.42776   A3=0,A8=0: 0.56184

P(A3|A8):   A8=1      A8=0
  A3=1      0.01883   0.00388
  A3=0      0.98117   0.99612
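These results can be checked (up to rounding) by summing the factorized joint probability over all 2^8 configurations. A minimal sketch, using the CPTs from the previous slides (variable and function names are mine):

```python
from itertools import product

# CPTs from the slides; only P(...=1 | parents) is stored, complements implied.
pA1, pA2 = 0.01, 0.5
pA3 = {1: 0.05, 0: 0.01}              # P(A3=1 | A1)
pA4 = {1: 0.10, 0: 0.01}              # P(A4=1 | A2)
pA5 = {1: 0.60, 0: 0.30}              # P(A5=1 | A2)
pA7 = {1: 0.98, 0: 0.05}              # P(A7=1 | A6)
pA8 = {(1, 1): 0.9, (1, 0): 0.8, (0, 1): 0.7, (0, 0): 0.1}  # P(A8=1 | A5, A6)

def bern(p, v):
    """P(X=v) for a binary variable with P(X=1)=p."""
    return p if v == 1 else 1.0 - p

def joint(a):
    """Factorized joint probability from Eq. (2)."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    if a6 != (a3 | a4):               # A6 is deterministically "A3 or A4"
        return 0.0
    return (bern(pA1, a1) * bern(pA2, a2) * bern(pA3[a1], a3) *
            bern(pA4[a2], a4) * bern(pA5[a2], a5) *
            bern(pA7[a6], a7) * bern(pA8[(a5, a6)], a8))

states = list(product((0, 1), repeat=8))
p_a8 = sum(joint(a) for a in states if a[7] == 1)                # P(A8=1)
p_a3_a8 = sum(joint(a) for a in states if a[2] == 1 and a[7] == 1)  # P(A3=1, A8=1)
print(round(p_a8, 5), round(p_a3_a8 / p_a8, 5))
```

The enumeration visits only 256 states here, but the cost grows as 2^N, which motivates the efficient inference algorithms mentioned at the end of the lecture.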
Learning of BN (1) Reference: Bayesian Network, Maomi Ueno, 2013. Consider the learning problem of a BN containing N nodes, representing the probability model of x = {x_1, x_2, …, x_N}, where each variable can take a limited number of discrete values. From Eq. (2) we can see that to define a BN, we need to determine the conditional probabilities P(x_i | π(x_i)), where π(x_i) is the set of parent nodes of x_i. If x_i can take r_i values, and π(x_i) can take q_i patterns (the number of combinations of values of all parent nodes), altogether we have Σ_{i=1}^{N} r_i q_i parameters to determine in the learning process.
Learning of BN (2) Denoting the parameters by Θ = {θ_ijk}, i = 1, 2, …, N; j = 1, 2, …, q_i; k = 1, 2, …, r_i, we can estimate Θ from a training set. Suppose that we have already defined the structure of the BN; that is, G is given. Given a set of observations x, the likelihood of Θ is given by

P(x|Θ) = ∏_{i=1}^{N} ∏_{j=1}^{q_i} γ_ij ∏_{k=1}^{r_i} θ_ijk^{n_ijk}    (4)

where

γ_ij = (Σ_{k=1}^{r_i} n_ijk)! / ∏_{k=1}^{r_i} n_ijk!    (5)

is a normalization factor that makes the sum of all probabilities 1. Here we have assumed that the likelihood follows the multinomial distribution.
Learning of BN (3) In Eq. (4), n_ijk is the number of observations in which x_i takes the k-th value when the parent set takes the j-th pattern. To find the MAP (maximum a posteriori) estimate of Θ, we assume that Θ follows the Dirichlet distribution given by

P(Θ) = ∏_{i=1}^{N} ∏_{j=1}^{q_i} δ_ij ∏_{k=1}^{r_i} θ_ijk^{α_ijk − 1}    (6)

where

δ_ij = Γ(Σ_{k=1}^{r_i} α_ijk) / ∏_{k=1}^{r_i} Γ(α_ijk)    (7)

Γ is the gamma function, satisfying Γ(x+1) = xΓ(x), and the α_ijk are hyperparameters. The a priori probability (6) takes the same form as the likelihood given by (4).
Learning of BN (4) From Eqs. (4) and (6), we can find the joint probability P(x, Θ) as follows:

P(x, Θ) ∝ ∏_{i=1}^{N} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{α_ijk + n_ijk − 1}    (8)

Maximizing this probability with respect to Θ, we get the following MAP estimate:

θ_ijk = (α_ijk + n_ijk − 1) / (α_ij + n_ij − r_i)    (9)

where α_ij = Σ_{k=1}^{r_i} α_ijk and n_ij = Σ_{k=1}^{r_i} n_ijk.
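As a quick numerical sanity check of Eq. (9): for a single binary variable (r_i = 2), the posterior (8) reduces to θ^(n1+α1−1) (1−θ)^(n0+α0−1), and a simple grid search over θ recovers the closed-form mode. The counts below are made up purely for illustration:

```python
# Sanity check of Eq. (9) for one binary variable (r_i = 2).
n1, n0 = 7, 3          # toy counts (hypothetical, for illustration only)
a1 = a0 = 0.5          # hyperparameters alpha_ijk

def posterior(theta):
    """Unnormalized posterior, Eq. (8), for a single binary variable."""
    return theta ** (n1 + a1 - 1) * (1 - theta) ** (n0 + a0 - 1)

# Grid search for the maximizer of the posterior.
grid = [i / 10000 for i in range(1, 10000)]
numeric = max(grid, key=posterior)

# Closed-form MAP estimate from Eq. (9): (n1 + a1 - 1) / (n + a1 + a0 - 2).
closed_form = (n1 + a1 - 1) / (n1 + n0 + a1 + a0 - 2)
print(numeric, round(closed_form, 4))
```

Both values agree to grid resolution, confirming that Eq. (9) is the mode of the Dirichlet posterior rather than its mean.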
An example of learning (1) Reference: Bayesian Network, Maomi Ueno, 2013. A BN is given in the right figure, with edges X1 → X2, X1 → X3, {X2, X3} → X4, and X3 → X5, and the following probabilities (complementary entries are implied):

P(X1=1) = 0.8,  P(X1=0) = 0.2
P(X2=1|X1=1) = 0.8,  P(X2=1|X1=0) = 0.2
P(X3=1|X1=1) = 0.8,  P(X3=1|X1=0) = 0.2
P(X4=1|X2=1,X3=1) = 0.8,  P(X4=1|X2=1,X3=0) = 0.6
P(X4=1|X2=0,X3=1) = 0.6,  P(X4=1|X2=0,X3=0) = 0.2
P(X5=1|X3=1) = 0.8,  P(X5=1|X3=0) = 0.2
An example of learning (2) Based on the given probabilities, we can generate as much data as we want. Suppose now that we have the 20 samples (X1 X2 X3 X4 X5) given in the right table; we can then estimate the probabilities from the data using MAP estimation. The results are given on the next page.

Samples: 10111, 00010, 11111, 11110, 11110, 11000, 10000, 00000, 00000, 01111, 00000, 11111, 00111, 11111, 10111, 11100, 11010, 11100, 11010, 11101
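Data like those in the table can be generated by ancestral sampling: draw each node after its parents, using the CPTs of the example BN. A minimal sketch (function and variable names are mine):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# CPTs of the example BN, storing P(...=1 | parents) only.
pX1 = 0.8
pX2 = {1: 0.8, 0: 0.2}                # P(X2=1 | X1)
pX3 = {1: 0.8, 0: 0.2}                # P(X3=1 | X1)
pX4 = {(1, 1): 0.8, (1, 0): 0.6, (0, 1): 0.6, (0, 0): 0.2}  # P(X4=1 | X2, X3)
pX5 = {1: 0.8, 0: 0.2}                # P(X5=1 | X3)

def sample_one():
    """Ancestral sampling: each node is drawn after its parents."""
    x1 = int(random.random() < pX1)
    x2 = int(random.random() < pX2[x1])
    x3 = int(random.random() < pX3[x1])
    x4 = int(random.random() < pX4[(x2, x3)])
    x5 = int(random.random() < pX5[x3])
    return (x1, x2, x3, x4, x5)

data = [sample_one() for _ in range(2000)]
# The empirical frequency of X1=1 should be close to P(X1=1) = 0.8.
print(sum(d[0] for d in data) / len(data))
```

The same generator can be reused for the homework with N_d = 20, 200, and 2,000.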
An example of learning (3) In the MAP estimation, we have assumed that all hyperparameters are 1/2.

                        True value   MAP estimate (α_ijk = 0.5)
P(X2=1|X1=1)            0.8          0.78
P(X2=1|X1=0)            0.2          0.08
P(X3=1|X1=1)            0.8          0.78
P(X3=1|X1=0)            0.2          0.08
P(X4=1|X2=1,X3=1)       0.8          0.65
P(X4=1|X2=1,X3=0)       0.6          0.5
P(X4=1|X2=0,X3=1)       0.6          0.5
P(X4=1|X2=0,X3=0)       0.2          0.41
P(X5=1|X3=1)            0.8          0.42
P(X5=1|X3=0)            0.2          0.34

Mean squared error: 0.028
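Eq. (9) can be applied directly to the 20 samples listed on the previous slide. The sketch below estimates P(X2=1|X1) with all hyperparameters 1/2; note that the values it yields from those particular 20 samples (about 0.81 and 0.10) differ slightly from the table, which was presumably computed from a different random sample, so treat the code as an illustration of the formula rather than a reproduction of the table:

```python
# MAP estimation (Eq. (9)) from the 20 samples on the previous slide,
# with all hyperparameters alpha_ijk = 0.5 (so alpha_ij = 1, r_i = 2).
raw = """10111 00010 11111 11110 11110 11000 10000 00000 00000 01111
         00000 11111 00111 11111 10111 11100 11010 11100 11010 11101"""
data = [tuple(int(c) for c in s) for s in raw.split()]

def map_cond(child, parent, pval, alpha=0.5):
    """MAP estimate of P(X_child=1 | X_parent=pval) via Eq. (9)."""
    rows = [d for d in data if d[parent] == pval]     # parent pattern j
    n1 = sum(d[child] for d in rows)                  # n_ij1
    n = len(rows)                                     # n_ij
    return (n1 + alpha - 1) / (n + 2 * alpha - 2)     # Eq. (9) with r_i = 2

print(round(map_cond(1, 0, 1), 3))   # P(X2=1 | X1=1) → 0.808
print(round(map_cond(1, 0, 0), 3))   # P(X2=1 | X1=0) → 0.1
```

The same function, called for every child/parent pattern, yields the full table of MAP estimates and the mean squared error against the true values.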
Important topics in using BN In this lecture, we have studied the fundamental knowledge related to BNs. To use BNs more efficiently, we should use algorithms that can calculate the desired probabilities quickly, because marginalization can be very expensive when the number of nodes is large. To learn a BN with a proper structure, we need to adopt other algorithms (e.g., meta-heuristics such as genetic algorithms), based on some optimization criterion (e.g., minimum description length, MDL). In any case, to obtain a good BN, we need a large amount of data to guarantee the quality of the estimated probabilities.
Homework
- For the example given on p. 22, write the equation for finding the joint probability based on Eq. (2).
- Write a program to build a joint probability distribution table (JPDT) based on your equation. The JPDT is similar to a truth table: it has two columns, where the left column contains all patterns of the variables and the right column contains the probability values. For the example given here, there are 32 rows.
- Write a program to generate N_d data based on the JPDT.
- Write a program to find the MAP estimate of the parameters using Eq. (9). Again, we can assume that the hyperparameters are all 1/2.
- Try to compare the results obtained with N_d = 20, N_d = 200, and N_d = 2,000, and draw some conclusions.