Lecture 5: Bayesian Network

Topics of this lecture
- What is a Bayesian network?
- A simple example
- Formal definition of BN
- A slightly difficult example
- Learning of BN
- An example of learning
- Important topics in using BN

What is a Bayesian Network? A Bayesian network (BN) is a graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). In the literature, a BN is also called a Bayes network, a belief network, etc. Compared with the methods we have studied so far, a BN is more transparent and interpretable in the sense that
- given some result, we can identify the most important factor(s), or
- given some factors or evidence, we can identify the most probable result(s).

A Simple Example (1) Reference: Mathematics in statistical inference of Bayesian networks, K. Tanaka, 2009. Shown on the right is a simple BN. There are two factors that can cause the alarm to activate. Each of them occurs with its own probability, P(A1) or P(A2), and each of them can activate the alarm with a conditional probability, P(A3|A1) or P(A3|A2). Based on these probabilities we can compute, say, the probability that A1 occurred given that the alarm is activated. In a BN, each event is denoted by a node, and the causal relations between the events are denoted by the edges. (Figure: A1 → A3 ← A2, with A1: Burglary, A2: Earthquake, A3: Alarm.)

A Simple Example (2) To define the BN, we need to find (learn) the following probabilities, using data collected so far or based on the experience of a human expert.

Prior probabilities:
P(A1=1) = 0.1, P(A1=0) = 0.9
P(A2=1) = 0.1, P(A2=0) = 0.9

Conditional probabilities P(A3|A1,A2):
A1=1, A2=1: P(A3=1) = 0.607, P(A3=0) = 0.393
A1=1, A2=0: P(A3=1) = 0.368, P(A3=0) = 0.632
A1=0, A2=1: P(A3=1) = 0.135, P(A3=0) = 0.865
A1=0, A2=0: P(A3=1) = 0.001, P(A3=0) = 0.999

(A1: Burglary, A2: Earthquake, A3: Alarm)

A Simple Example (3) To make a decision, the simplest (but not the most efficient) way is to find the joint probability of all events and then marginalize it based on the evidence. For the simple example given here, we have

P(A1, A2, A3) = P(A3 | A1, A2) P(A1, A2) = P(A3 | A1, A2) P(A1) P(A2)    (1)

The second equality uses the fact (which we believe) that A1 and A2 are independent. From the joint probability we can find P(A1, A3), P(A2, A3), and P(A3) via marginalization. Using Bayes' theorem, we can then find
P(A1 | A3) = P(A1, A3) / P(A3), the probability of A1 when A3 is true, and
P(A2 | A3) = P(A2, A3) / P(A3), the probability of A2 when A3 is true.

A Simple Example (4) Using the probabilities defined for the given example, we have
P(A1 | A3) = P(A1, A3) / P(A3) = 0.751486
P(A2 | A3) = P(A2, A3) / P(A3) = 0.349377
Thus, for this example, when the alarm is activated the most probable factor is burglary. Note that A1, A2, and A3 can represent other events, and decisions can be made in the same way. For example, a simple BN can be defined for car diagnosis: A3: the car engine does not work; A1: the battery has run out; A2: engine starter problem.
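To make the calculation above concrete, here is a minimal sketch in plain Python (added for illustration; the variable and function names are my own, not from the lecture) that builds the joint distribution of Eq. (1) from the tables in "A Simple Example (2)", marginalizes it, and applies Bayes' theorem.

```python
from itertools import product

P_A1 = {1: 0.1, 0: 0.9}                       # P(A1): burglary
P_A2 = {1: 0.1, 0: 0.9}                       # P(A2): earthquake
P_A3 = {                                      # P(A3=1 | A1, A2): alarm
    (1, 1): 0.607, (1, 0): 0.368,
    (0, 1): 0.135, (0, 0): 0.001,
}

def joint(a1, a2, a3):
    """P(A1=a1, A2=a2, A3=a3) = P(A3|A1,A2) P(A1) P(A2), Eq. (1)."""
    p3 = P_A3[(a1, a2)] if a3 == 1 else 1.0 - P_A3[(a1, a2)]
    return P_A1[a1] * P_A2[a2] * p3

# Marginalize to get P(A3=1), P(A1=1, A3=1) and P(A2=1, A3=1).
p_a3 = sum(joint(a1, a2, 1) for a1, a2 in product((0, 1), repeat=2))
p_a1_a3 = sum(joint(1, a2, 1) for a2 in (0, 1))
p_a2_a3 = sum(joint(a1, 1, 1) for a1 in (0, 1))

print("P(A1=1 | A3=1) =", p_a1_a3 / p_a3)     # 0.7514..., as on the slide
print("P(A2=1 | A3=1) =", p_a2_a3 / p_a3)     # 0.3493..., as on the slide
```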

Formal Definition of Bayesian Network (1) Reference: Bayesian Network, Maomi Ueno, 2013. Formally, a BN is defined by a 2-tuple (pair) <G, Θ>, where G is a directed acyclic graph (DAG) and Θ is a set of probabilities. Each node of G represents an event or a random variable x_i, and the edge (i, j) exists if the j-th node depends on the i-th node. The joint probability of the whole BN can be found by

P(x) = \prod_{i=1}^{N} P(x_i \mid \pi_i)    (2)

where \pi_i is the set of parent nodes of x_i.
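As a small illustration of Eq. (2) (added here; the dictionary layout and function name are assumptions of this sketch, not part of the lecture), a BN can be stored as a parent list plus a CPT per node, and the joint probability of a full assignment is then just the product of the local conditional probabilities. The alarm network from the previous example is reused as test data.

```python
parents = {"A1": [], "A2": [], "A3": ["A1", "A2"]}
cpt = {
    # cpt[node][(parent values..., node value)] = conditional probability
    "A1": {(1,): 0.1, (0,): 0.9},
    "A2": {(1,): 0.1, (0,): 0.9},
    "A3": {(1, 1, 1): 0.607, (1, 1, 0): 0.393,
           (1, 0, 1): 0.368, (1, 0, 0): 0.632,
           (0, 1, 1): 0.135, (0, 1, 0): 0.865,
           (0, 0, 1): 0.001, (0, 0, 0): 0.999},
}

def joint_probability(assignment):
    """Eq. (2): P(x) = product over nodes of P(x_i | parents(x_i))."""
    p = 1.0
    for node, pa in parents.items():
        key = tuple(assignment[q] for q in pa) + (assignment[node],)
        p *= cpt[node][key]
    return p

print(joint_probability({"A1": 1, "A2": 0, "A3": 1}))   # 0.1 * 0.9 * 0.368
```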

Formal Definition of Bayesian Network (2) In fact, to find the joint probability using Eq. (2), we need some assumptions. Given a graph G, let X, Y, and Z be three disjoint sets of nodes. If all paths between nodes of X and nodes of Y contain at least one node of Z, we say that X and Y are separated by Z. This fact is denoted by I(X, Y | Z)_G. On the other hand, we use I(X, Y | Z) to denote the fact that X and Y are conditionally independent given Z. If I(X, Y | Z)_G is equivalent to I(X, Y | Z) for all disjoint sets of nodes, G is said to be a perfect map (p-map).

Formal Definition of Bayesian Network (3) In fact, a DAG is not powerful enough to represent all kinds of probability models. In the study of BNs, we are more interested in graphs whose separation statements all correspond to true conditional independences. Such graphs are called independence maps (I-maps). For an I-map, we have I(X, Y | Z)_G ⇒ I(X, Y | Z). To avoid trivial I-maps (complete graphs), we are more interested in minimal I-maps. A minimal I-map is an I-map with the minimum number of edges: if we delete any edge from G, G will no longer be an I-map.

Formal Definition of Bayesian Network (4) D-separation is an important concept in BNs. This concept corresponds to conditional independence: if X and Y are d-separated by Z, then X and Y are conditionally independent given Z, and vice versa,

I(X, Y | Z)_d ⇔ I(X, Y | Z)_G    (3)

In fact, Eq. (2) can be proved based on d-separation and the fact that there are no cycles in a BN. For more detail, please refer to the reference book written by Maomi Ueno.
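The correspondence between graph separation and independence can be checked numerically on the alarm example. The following sketch (an illustration added here, not from the slides) verifies that A1 and A2 are independent when nothing is observed, because their only connecting path goes through the collider A3, and that they become dependent once A3 is observed (the well-known "explaining away" effect).

```python
from itertools import product

P_A1 = {1: 0.1, 0: 0.9}
P_A2 = {1: 0.1, 0: 0.9}
P_A3 = {(1, 1): 0.607, (1, 0): 0.368, (0, 1): 0.135, (0, 0): 0.001}

def joint(a1, a2, a3):
    """Joint distribution of the alarm network, as in Eq. (1)."""
    p3 = P_A3[(a1, a2)] if a3 == 1 else 1.0 - P_A3[(a1, a2)]
    return P_A1[a1] * P_A2[a2] * p3

# Marginal independence: P(A1=1, A2=1) equals P(A1=1) * P(A2=1).
p11 = sum(joint(1, 1, a3) for a3 in (0, 1))
print(p11, P_A1[1] * P_A2[1])                 # both 0.01

# Conditioning on the collider A3 opens the path: P(A1=1, A2=1 | A3=1)
# no longer factorizes into P(A1=1 | A3=1) * P(A2=1 | A3=1).
p_a3 = sum(joint(a1, a2, 1) for a1, a2 in product((0, 1), repeat=2))
p_both = joint(1, 1, 1) / p_a3
p_1 = sum(joint(1, a2, 1) for a2 in (0, 1)) / p_a3
p_2 = sum(joint(a1, 1, 1) for a1 in (0, 1)) / p_a3
print(p_both, p_1 * p_2)                      # about 0.116 vs 0.263
```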

A slightly difficult example (1) http://www.norsys.com/networklibrary.html#

A slightly difficult example (2) ASIA is a BN for a fictitious medical example about whether a patient has tuberculosis, lung cancer, or bronchitis, related to their X-ray result, dyspnea, visit-to-Asia, and smoking status. For convenience of discussion, we denote the events by A1~A8 as follows:
A1: Visit to Asia; A2: Smoking; A3: Tuberculosis; A4: Lung cancer; A5: Bronchitis; A6: Tuberculosis or cancer; A7: X-ray result; A8: Dyspnea.
Reference: Mathematics in statistical inference of Bayesian networks, K. Tanaka, 2009.

A slightly difficult example (3)
P(A1): P(A1=1) = 0.01, P(A1=0) = 0.99
P(A2): P(A2=1) = 0.5, P(A2=0) = 0.5
P(A3|A1): P(A3=1|A1=1) = 0.05, P(A3=1|A1=0) = 0.01 (P(A3=0|A1=1) = 0.95, P(A3=0|A1=0) = 0.99)
P(A4|A2): P(A4=1|A2=1) = 0.1, P(A4=1|A2=0) = 0.01 (P(A4=0|A2=1) = 0.9, P(A4=0|A2=0) = 0.99)
P(A5|A2): P(A5=1|A2=1) = 0.6, P(A5=1|A2=0) = 0.3 (P(A5=0|A2=1) = 0.4, P(A5=0|A2=0) = 0.7)
P(A7|A6): P(A7=1|A6=1) = 0.98, P(A7=1|A6=0) = 0.05 (P(A7=0|A6=1) = 0.02, P(A7=0|A6=0) = 0.95)

A slightly difficult example (4)
P(A6|A3,A4): P(A6=1|A3=1,A4=1) = 1, P(A6=1|A3=1,A4=0) = 1, P(A6=1|A3=0,A4=1) = 1, P(A6=1|A3=0,A4=0) = 0 (i.e., A6 is the logical OR of A3 and A4)
P(A8|A5,A6): P(A8=1|A5=1,A6=1) = 0.9, P(A8=1|A5=1,A6=0) = 0.8, P(A8=1|A5=0,A6=1) = 0.7, P(A8=1|A5=0,A6=0) = 0.1

A slightly difficult example (5) Using Eq. (2), we have

P(A1, ..., A8) = P(A8|A5,A6) P(A7|A6) P(A6|A3,A4) P(A5|A2) P(A4|A2) P(A3|A1) P(A1) P(A2)

Marginalizing this joint probability, we can find, for example,

P(A8) = \sum_{A1, A2, A3, A4, A5, A6, A7} P(A1, ..., A8)
P(A3, A8) = \sum_{A1, A2, A4, A5, A6, A7} P(A1, ..., A8)

Using Bayes' theorem, we can find P(A3|A8) from P(A8) and P(A3, A8), and see to what extent A3 is a factor causing A8.

A slightly difficult example (6) For the given BN, we have the results below. From these results, we can see that A3 (tuberculosis) is probably not the main factor causing A8 (dyspnea).

P(A8): P(A8=1) = 0.43597, P(A8=0) = 0.56403
P(A3, A8): P(A3=1, A8=1) = 0.00821, P(A3=1, A8=0) = 0.02189, P(A3=0, A8=1) = 0.42776, P(A3=0, A8=0) = 0.56184
P(A3|A8): P(A3=1|A8=1) = 0.01883, P(A3=1|A8=0) = 0.00388, P(A3=0|A8=1) = 0.98117, P(A3=0|A8=0) = 0.99612
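The numbers above can be reproduced (up to the rounding used on the slides) by brute-force enumeration of all 2^8 assignments, exactly as prescribed by Eq. (2). The sketch below is an illustration added here; the helper names and data layout are my own.

```python
from itertools import product

p_a1 = 0.01                                   # P(A1=1): visit to Asia
p_a2 = 0.5                                    # P(A2=1): smoking
p_a3 = {1: 0.05, 0: 0.01}                     # P(A3=1 | A1): tuberculosis
p_a4 = {1: 0.10, 0: 0.01}                     # P(A4=1 | A2): lung cancer
p_a5 = {1: 0.60, 0: 0.30}                     # P(A5=1 | A2): bronchitis
p_a6 = {(a3, a4): float(a3 or a4)             # P(A6=1 | A3, A4): logical OR
        for a3 in (0, 1) for a4 in (0, 1)}
p_a7 = {1: 0.98, 0: 0.05}                     # P(A7=1 | A6): X-ray result
p_a8 = {(1, 1): 0.9, (1, 0): 0.8,             # P(A8=1 | A5, A6): dyspnea
        (0, 1): 0.7, (0, 0): 0.1}

def bern(p1, v):
    """P(V=v) for a binary variable V with P(V=1) = p1."""
    return p1 if v == 1 else 1.0 - p1

def joint(a1, a2, a3, a4, a5, a6, a7, a8):
    """Eq. (2): product of the eight local (conditional) probabilities."""
    return (bern(p_a1, a1) * bern(p_a2, a2) * bern(p_a3[a1], a3) *
            bern(p_a4[a2], a4) * bern(p_a5[a2], a5) *
            bern(p_a6[(a3, a4)], a6) * bern(p_a7[a6], a7) *
            bern(p_a8[(a5, a6)], a8))

states = list(product((0, 1), repeat=8))
pA8 = sum(joint(*s) for s in states if s[7] == 1)
pA3A8 = sum(joint(*s) for s in states if s[2] == 1 and s[7] == 1)
print("P(A8=1)        =", round(pA8, 5))           # about 0.43597
print("P(A3=1 | A8=1) =", round(pA3A8 / pA8, 5))   # about 0.0188
```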

Learning of BN (1) Reference: Bayesian Network, Maomi Ueno, 2013. Consider the learning problem of a BN containing N nodes, representing the probability model of x = {x_1, x_2, ..., x_N}, where each variable can take a limited number of discrete values. From Eq. (2) we can see that to define a BN, we need to determine the conditional probabilities P(x_i | \pi(x_i)), where \pi(x_i) is the set of parent nodes of x_i. If x_i can take r_i values and \pi(x_i) can take q_i patterns (the number of combinations of values of all parent nodes), altogether we have \sum_{i=1}^{N} r_i q_i parameters to determine in the learning process.

Learning of BN (2) Denoting the parameters by Θ = {θ_ijk}, i = 1, 2, ..., N; j = 1, 2, ..., q_i; k = 1, 2, ..., r_i, we can estimate Θ from a training set. Suppose that we have already defined the structure of the BN; that is, G is given. Given a set of observations x, the likelihood of Θ is given by

P(x \mid \Theta) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \gamma_{ij} \prod_{k=1}^{r_i} \theta_{ijk}^{n_{ijk}}    (4)

where

\gamma_{ij} = \frac{(\sum_{k=1}^{r_i} n_{ijk})!}{\prod_{k=1}^{r_i} n_{ijk}!}    (5)

is a normalization factor that makes the probabilities sum to 1. Here we have assumed that the likelihood follows the multinomial distribution.

Learning of BN (3) In Eq. (4), n_ijk is the number of observations in which x_i takes its k-th value while its parent set takes its j-th pattern. To find the MAP (maximum a posteriori) estimate of Θ, we assume that Θ follows the Dirichlet distribution

P(\Theta) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \delta_{ij} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}    (6)

where

\delta_{ij} = \frac{\Gamma(\sum_{k=1}^{r_i} \alpha_{ijk})}{\prod_{k=1}^{r_i} \Gamma(\alpha_{ijk})}    (7)

Γ is the gamma function, which satisfies Γ(x + 1) = x Γ(x), and the α_ijk are hyperparameters. The prior (6) takes the same form as the likelihood (4).

Learning of BN (4) From Eqs. (4) and (6), we can find the joint probability P(x, Θ) as follows:

P(x, \Theta) \propto \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} + n_{ijk} - 1}    (8)

Maximizing this probability with respect to Θ, we get the following MAP estimate:

\theta_{ijk} = \frac{\alpha_{ijk} + n_{ijk} - 1}{\alpha_{ij} + n_{ij} - r_i}    (9)

where \alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk} and n_{ij} = \sum_{k=1}^{r_i} n_{ijk}.
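The following is a minimal sketch of Eq. (9) for a single node and a single parent configuration (added for illustration; the function name, the count layout, and the example counts are assumptions, not values from the lecture). It assumes enough observations so that the denominator α_ij + n_ij − r_i is positive.

```python
def map_estimate(counts, alpha=0.5):
    """Eq. (9): theta_ijk = (alpha_ijk + n_ijk - 1) / (alpha_ij + n_ij - r_i).

    counts[k] is n_ijk for k = 0..r_i-1, and every hyperparameter
    alpha_ijk is set to the same value `alpha` (1/2 on the slides).
    """
    r_i = len(counts)
    n_ij = sum(counts)
    alpha_ij = alpha * r_i
    return [(alpha + n_k - 1.0) / (alpha_ij + n_ij - r_i) for n_k in counts]

# Hypothetical counts: in some parent configuration, value 0 was seen
# 3 times and value 1 was seen 7 times.
print(map_estimate([3, 7]))        # [0.2777..., 0.7222...], sums to 1
```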

An example of learning (1) Reference: Bayesian Network, Maomi Ueno, 2013. A BN is given in the figure on the right; its structure is X1 → X2, X1 → X3, {X2, X3} → X4, X3 → X5, with the following probabilities:

P(X1): P(X1=1) = 0.8, P(X1=0) = 0.2
P(X2|X1): P(X2=1|X1=1) = 0.8, P(X2=0|X1=1) = 0.2; P(X2=1|X1=0) = 0.2, P(X2=0|X1=0) = 0.8
P(X3|X1): P(X3=1|X1=1) = 0.8, P(X3=0|X1=1) = 0.2; P(X3=1|X1=0) = 0.2, P(X3=0|X1=0) = 0.8
P(X4|X2,X3): P(X4=1|1,1) = 0.8, P(X4=0|1,1) = 0.2; P(X4=1|1,0) = 0.6, P(X4=0|1,0) = 0.4; P(X4=1|0,1) = 0.6, P(X4=0|0,1) = 0.4; P(X4=1|0,0) = 0.2, P(X4=0|0,0) = 0.8
P(X5|X3): P(X5=1|X3=1) = 0.8, P(X5=0|X3=1) = 0.2; P(X5=1|X3=0) = 0.2, P(X5=0|X3=0) = 0.8

An example of learning (2) Based on the given probabilities, we can generate as many data as we like. Suppose we now have the 20 samples (X1, X2, X3, X4, X5) listed in the table on the right; we can estimate the probabilities from these data using the MAP estimation. The results are given on the next page.

Data (one sample per row, X1 X2 X3 X4 X5):
1: 10111   2: 00010   3: 11111   4: 11110   5: 11110
6: 11000   7: 10000   8: 00000   9: 00000   10: 01111
11: 00000  12: 11111  13: 00111  14: 11111  15: 10111
16: 11100  17: 11010  18: 11100  19: 11010  20: 11101
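Data like the table above can be generated by ancestral (forward) sampling: draw each variable after its parents, following the arrows of the graph. The sketch below (added for illustration; the names are my own) uses the CPTs from "An example of learning (1)".

```python
import random

P_X2 = {1: 0.8, 0: 0.2}                       # P(X2=1 | X1)
P_X3 = {1: 0.8, 0: 0.2}                       # P(X3=1 | X1)
P_X4 = {(1, 1): 0.8, (1, 0): 0.6,             # P(X4=1 | X2, X3)
        (0, 1): 0.6, (0, 0): 0.2}
P_X5 = {1: 0.8, 0: 0.2}                       # P(X5=1 | X3)

def bern(p1):
    """Draw a binary value that is 1 with probability p1."""
    return 1 if random.random() < p1 else 0

def sample_one():
    x1 = bern(0.8)                            # P(X1=1) = 0.8
    x2 = bern(P_X2[x1])
    x3 = bern(P_X3[x1])
    x4 = bern(P_X4[(x2, x3)])
    x5 = bern(P_X5[x3])
    return (x1, x2, x3, x4, x5)

data = [sample_one() for _ in range(20)]      # e.g. N_d = 20, as in the table
for row in data:
    print("".join(map(str, row)))
```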

An example of learning (3) In the MAP estimation, we have assumed that all hyperparameters are 1/2 (α_ijk = 0.5).

Parameter: true value / MAP estimate
P(X2=1|X1=1): 0.8 / 0.78
P(X2=1|X1=0): 0.2 / 0.08
P(X3=1|X1=1): 0.8 / 0.78
P(X3=1|X1=0): 0.2 / 0.08
P(X4=1|X2=1,X3=1): 0.8 / 0.65
P(X4=1|X2=1,X3=0): 0.6 / 0.5
P(X4=1|X2=0,X3=1): 0.6 / 0.5
P(X4=1|X2=0,X3=0): 0.2 / 0.41
P(X5=1|X3=1): 0.8 / 0.42
P(X5=1|X3=0): 0.2 / 0.34
Mean squared error: 0.028

Important topics in using BN In this lecture, we have studied fundamental knowledge related to BNs. To use BNs more efficiently, we should use algorithms that can calculate the desired probabilities quickly, because brute-force marginalization can be very expensive when the number of nodes is large. To learn a BN with a proper structure, we need to adopt other algorithms (e.g., meta-heuristics such as genetic algorithms) based on some optimization criterion (e.g., minimum description length, MDL). In any case, to obtain a good BN, we need a large amount of data to guarantee the quality of the estimated probabilities.

Homework
- For the learning example given above ("An example of learning (1)"), write the equation for finding the joint probability based on Eq. (2).
- Write a program to build a joint probability distribution table (JPDT) based on your equation. The JPDT is similar to a truth table: it has two columns, where the left column contains all patterns of the variables and the right column contains the corresponding probability values. For the example given here, there are 32 rows.
- Write a program to generate N_d data based on the JPDT.
- Write a program to find the MAP estimates of the parameters using Eq. (9). Again, we can assume that the hyperparameters are all 1/2.
- Compare the results obtained with N_d = 20, N_d = 200, and N_d = 2,000, and draw some conclusions.