Learning Bayesian Networks (part 1)
Mark Craven and David Page
Computer Sciences 760, Spring 2018
www.biostat.wisc.edu/~craven/cs760/

Some of the slides in these lectures have been adapted/borrowed from materials developed by Tom Dietterich, Pedro Domingos, Tom Mitchell, David Page, and Jude Shavlik.

Goals for the lecture
You should understand the following concepts:
- the Bayesian network representation
- inference by enumeration
- the parameter learning task for Bayes nets
- the structure learning task for Bayes nets
- maximum likelihood estimation
- Laplace estimates
- m-estimates
- missing data in machine learning: hidden variables, missing at random, missing systematically
- the EM approach to imputing missing values in Bayes net parameter learning

Bayesian network example
Consider the following 5 binary random variables:
- B = a burglary occurs at your house
- E = an earthquake occurs at your house
- A = the alarm goes off
- J = John calls to report the alarm
- M = Mary calls to report the alarm
Suppose we want to answer queries like: what is P(B | M, J)?

Bayesian network example (continued)
The network has edges Burglary -> Alarm, Earthquake -> Alarm, Alarm -> JohnCalls, and Alarm -> MaryCalls, with the following CPTs:
P(B):  b 0.001, ¬b 0.999
P(E):  e 0.001, ¬e 0.999
P(A | B, E):  b,e 0.95;  b,¬e 0.94;  ¬b,e 0.29;  ¬b,¬e 0.001
P(J | A):  a 0.9;  ¬a 0.05
P(M | A):  a 0.7;  ¬a 0.01

Bayesian networks
A BN consists of a Directed Acyclic Graph (DAG) and a set of conditional probability distributions.
In the DAG:
- each node denotes a random variable
- each edge from X to Y represents that X directly influences Y
- formally: each variable X is independent of its non-descendants given its parents
Each node X has a conditional probability distribution (CPD) representing P(X | Parents(X)).

Bayesian networks
Using the chain rule, a joint probability distribution can be expressed as
P(X_1, ..., X_n) = P(X_1) ∏_{i=2..n} P(X_i | X_1, ..., X_{i-1})
A BN provides a compact representation of a joint probability distribution:
P(X_1, ..., X_n) = ∏_{i=1..n} P(X_i | Parents(X_i))

Bayesian networks
For the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls):
P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
A standard table representation of the joint distribution for the alarm example has 2^5 = 32 parameters; the BN representation of this distribution has 20 parameters.

Bayesian networks
Now consider a case with 10 binary random variables. How many parameters does a BN with the following graph structure have?
[figure: a 10-node DAG; summing the per-node CPT sizes gives 2 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 4 + 8 = 42 parameters]
How many parameters does the standard table representation of the joint distribution have? 2^10 = 1024
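As an illustration, here is a minimal Python sketch (my own code, not from the lecture) that stores the alarm network's CPTs from the earlier slide and evaluates one entry of the factored joint by multiplying the five CPD entries:

```python
# P(X = true | parents) for each node; the false case is 1 minus these values.
p_B = 0.001
p_E = 0.001
p_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (B, E)
p_J = {True: 0.9, False: 0.05}                        # keyed by A
p_M = {True: 0.7, False: 0.01}                        # keyed by A

def bern(p_true, value):
    """Probability that a Boolean variable takes `value`, given P(true)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the five CPD entries."""
    return (bern(p_B, b) * bern(p_E, e) * bern(p_A[(b, e)], a) *
            bern(p_J[a], j) * bern(p_M[a], m))

# one entry of the joint: burglary, no earthquake, alarm went off, both called
print(joint(True, False, True, True, True))   # 0.001 * 0.999 * 0.94 * 0.9 * 0.7

# parameter counting as on the slide: the BN stores 2 + 2 + 8 + 4 + 4 = 20
# CPT entries, versus 2**5 = 32 rows in an explicit joint table
```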

Advantages of the Bayesian network representation
- Captures independence and conditional independence where they exist
- Encodes the relevant portion of the full joint among variables where dependencies exist
- Uses a graphical representation that lends insight into the complexity of inference

The inference task in Bayesian networks
Given: values for some variables in the network (the evidence), and a set of query variables
Do: compute the posterior distribution over the query variables
- variables that are neither evidence variables nor query variables are hidden variables
- the BN representation is flexible enough that any set can be the evidence variables and any set can be the query variables

Inference by enumeration
Let a denote A = true and ¬a denote A = false. Suppose we're given the query P(b | j, m): the probability the house is being burglarized given that John and Mary both called.
From the graph structure we can first compute
P(b, j, m) = Σ_{E ∈ {e, ¬e}} Σ_{A ∈ {a, ¬a}} P(b) P(E) P(A | b, E) P(j | A) P(m | A)
summing over the possible values of the hidden variables E and A.

Inference by enumeration
P(b, j, m) = Σ_{E} Σ_{A} P(b) P(E) P(A | b, E) P(j | A) P(m | A)
           = P(b) Σ_{E} P(E) Σ_{A} P(A | b, E) P(j | A) P(m | A)
Plugging in the CPT values (P(b) = 0.001, P(e) = 0.001, etc.):
           = 0.001 × (0.001 × 0.95 × 0.9 × 0.7        [e, a]
                    + 0.001 × 0.05 × 0.05 × 0.01      [e, ¬a]
                    + 0.999 × 0.94 × 0.9 × 0.7        [¬e, a]
                    + 0.999 × 0.06 × 0.05 × 0.01)     [¬e, ¬a]

Inference by enumeration
Now do the equivalent calculation for P(¬b, j, m) and determine P(b | j, m):
P(b | j, m) = P(b, j, m) / P(j, m) = P(b, j, m) / (P(b, j, m) + P(¬b, j, m))

Comments on BN inference
- inference by enumeration is an exact method (i.e. it computes the exact answer to a given query)
- it requires summing over a joint distribution whose size is exponential in the number of variables
- in many cases we can do exact inference efficiently in large networks; key insight: save computation by pushing sums inward
- in general, the Bayes net inference problem is NP-hard
- there are also methods for approximate inference, which get an answer that is close
- in general, the approximate inference problem is also NP-hard, but approximate methods work well for many real-world problems
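A short Python sketch of the full calculation (my own code, not the course's): it sums the factored joint over the hidden variables E and A and then normalizes over B. The approximate answer quoted in the final comment is my own computation with these CPT values.

```python
from itertools import product

# alarm-network CPTs, repeated here so the sketch is self-contained
p_B = 0.001
p_E = 0.001
p_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
p_J = {True: 0.9, False: 0.05}
p_M = {True: 0.7, False: 0.01}

def bern(p_true, v):
    return p_true if v else 1.0 - p_true

def joint(b, e, a, j, m):
    return (bern(p_B, b) * bern(p_E, e) * bern(p_A[(b, e)], a) *
            bern(p_J[a], j) * bern(p_M[a], m))

def prob_b_and_evidence(b, j=True, m=True):
    """P(B=b, j, m): sum the joint over the hidden variables E and A."""
    return sum(joint(b, e, a, j, m)
               for e, a in product([True, False], repeat=2))

# P(b | j, m) = P(b, j, m) / (P(b, j, m) + P(~b, j, m))
num = prob_b_and_evidence(True)
den = num + prob_b_and_evidence(False)
print(num / den)   # roughly 0.31 with these CPT values
```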

The parameter learning task
Given: a set of training instances and the graph structure of a BN (e.g. the Burglary / Earthquake / Alarm / JohnCalls / MaryCalls network)
[table of training instances, one row of variable values per instance; not recoverable from the transcript]
Do: infer the parameters of the CPDs

The structure learning task
Given: a set of training instances
Do: infer the graph structure (and perhaps the parameters of the CPDs too)

Parameter learning and maximum likelihood estimation
Maximum likelihood estimation (MLE): given a model structure (e.g. a Bayes net graph) G and a set of data D, set the model parameters θ to maximize P(D | G, θ), i.e. make the data D look as likely as possible under the model.

Maximum likelihood estimation
Consider trying to estimate the parameter θ (probability of heads) of a biased coin from a sequence of flips:
x = {1, 1, 1, 0, 1, 0, 0, 1, 0, 1}
The likelihood function for θ is given by:
L(θ : x_1, ..., x_n) = θ^{x_1} (1 − θ)^{1 − x_1} ··· θ^{x_n} (1 − θ)^{1 − x_n} = θ^{Σ_i x_i} (1 − θ)^{n − Σ_i x_i}
For h heads in n flips the MLE is h/n.
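A quick sketch (Python, my own code) of the coin example: the log-likelihood is maximized at h/n, here 6/10 = 0.6.

```python
import math

x = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]   # the example flip sequence from the slide

def log_likelihood(theta, flips):
    """log L(theta) = sum_i [x_i log(theta) + (1 - x_i) log(1 - theta)]"""
    return sum(xi * math.log(theta) + (1 - xi) * math.log(1 - theta)
               for xi in flips)

mle = sum(x) / len(x)                          # h / n = 6/10 = 0.6
print(mle, log_likelihood(mle, x))
# any other theta gives a lower log-likelihood, e.g.:
print(log_likelihood(0.5, x) < log_likelihood(mle, x))   # True
```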

MLE in a Bayes net
L(θ : D, G) = P(D | G, θ)
            = ∏_{d ∈ D} P(x_1^(d), x_2^(d), ..., x_n^(d))
            = ∏_{d ∈ D} ∏_i P(x_i^(d) | Parents(x_i^(d)))
            = ∏_i ∏_{d ∈ D} P(x_i^(d) | Parents(x_i^(d)))
The last step shows that the likelihood decomposes into an independent parameter learning problem for each CPD.

Maximum likelihood estimation
Now consider estimating the CPD parameters for B and J in the alarm network given the following data set (8 instances with values for B, J, A, E, M; the table itself is not recoverable from the transcript):
P(b) = 1/8 = 0.125        P(¬b) = 7/8 = 0.875
P(j | a) = 3/4 = 0.75      P(¬j | a) = 1/4 = 0.25
P(j | ¬a) = 2/4 = 0.5      P(¬j | ¬a) = 2/4 = 0.5
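A small sketch of MLE by counting (my Python, not the course's). The slide's data table was lost in transcription, so the rows below are hypothetical, chosen so that the counts reproduce the estimates above (1 burglary in 8 instances, 3 of 4 alarm instances with a John call, 2 of 4 non-alarm instances with a John call).

```python
# hypothetical fully observed data set; each record gives values for B, E, A, J, M
data = [
    {"B": 0, "E": 0, "A": 0, "J": 0, "M": 0},
    {"B": 0, "E": 0, "A": 1, "J": 1, "M": 1},
    {"B": 1, "E": 0, "A": 1, "J": 1, "M": 1},
    {"B": 0, "E": 1, "A": 1, "J": 0, "M": 1},
    {"B": 0, "E": 0, "A": 0, "J": 1, "M": 0},
    {"B": 0, "E": 0, "A": 0, "J": 1, "M": 0},
    {"B": 0, "E": 0, "A": 1, "J": 1, "M": 0},
    {"B": 0, "E": 0, "A": 0, "J": 0, "M": 0},
]

def mle_marginal(var):
    """P(var = 1) = n_1 / n"""
    return sum(d[var] for d in data) / len(data)

def mle_conditional(var, parent, parent_val):
    """P(var = 1 | parent = parent_val) = n_(1, parent_val) / n_(parent_val)"""
    rows = [d for d in data if d[parent] == parent_val]
    return sum(d[var] for d in rows) / len(rows)

print(mle_marginal("B"))                 # 1/8 = 0.125
print(mle_conditional("J", "A", 1))      # P(j | a)  = 3/4
print(mle_conditional("J", "A", 0))      # P(j | ~a) = 2/4
```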

Maximum likelihood estimation
Suppose instead our data set contained no instances with a burglary:
P(b) = 0/8 = 0        P(¬b) = 8/8 = 1
Do we really want to set this to 0?

Maximum a posteriori (MAP) estimation
Instead of estimating parameters strictly from the data, we could start with some prior belief for each. For example, we could use Laplace estimates:
P(X = x) = (n_x + 1) / Σ_{v ∈ Values(X)} (n_v + 1)
where n_v represents the number of occurrences of value v; the added 1s act as pseudocounts.

Maximum a posteriori estimation
A more general form: m-estimates
P(X = x) = (n_x + p_x · m) / (Σ_{v ∈ Values(X)} n_v + m)
where p_x is the prior probability of value x and m is the number of "virtual" instances.

M-estimates example
Now let's estimate the parameters for B using m = 4 and p_b = 0.25 on the data set above (0 burglaries in 8 instances):
P(b) = (0 + 0.25 × 4) / (8 + 4) = 1/12 ≈ 0.08
P(¬b) = (8 + 0.75 × 4) / (8 + 4) = 11/12 ≈ 0.92
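As a sketch (Python, my own helper names), the Laplace and m-estimate formulas above amount to:

```python
def laplace(n_x, counts_all_values):
    """P(X = x) = (n_x + 1) / sum_v (n_v + 1)"""
    return (n_x + 1) / sum(n_v + 1 for n_v in counts_all_values)

def m_estimate(n_x, counts_all_values, p_x, m):
    """P(X = x) = (n_x + p_x * m) / (sum_v n_v + m)"""
    return (n_x + p_x * m) / (sum(counts_all_values) + m)

# 0 burglaries observed in 8 instances, prior p_b = 0.25, m = 4 virtual instances
print(m_estimate(0, [0, 8], 0.25, 4))   # (0 + 1) / 12 ~= 0.08
print(m_estimate(8, [0, 8], 0.75, 4))   # (8 + 3) / 12 ~= 0.92
print(laplace(0, [0, 8]))               # (0 + 1) / 10 = 0.1
```

Note that the Laplace estimate is the special case of the m-estimate with a uniform prior over the values of X and m equal to the number of values.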

Missing data
Commonly in machine learning tasks, some feature values are missing; some variables may not be observable (i.e. hidden) even for training instances.
Values for some variables may be missing at random: what caused the data to be missing does not depend on the missing data itself.
- e.g. someone accidentally skips a question on a questionnaire
- e.g. a sensor fails to record a value due to a power blip
Values for some variables may be missing systematically: the probability of a value being missing depends on the value.
- e.g. a medical test result is missing because a doctor was fairly sure of a diagnosis given earlier test results
- e.g. the graded exams that go missing on the way home from school are those with poor scores

Missing data
Hidden variables and values missing at random are the cases we'll focus on; one solution is to try to impute the missing values.
For values missing systematically, it may be sensible to represent "missing" as an explicit feature value.

Imputing missing data with EM
Given: a data set with some missing values, a model structure, and initial model parameters
Repeat until convergence:
- Expectation (E) step: using the current model, compute the expectation over the missing values
- Maximization (M) step: update the model parameters with those that maximize the probability of the data (MLE or MAP)

Example: EM for parameter learning
Suppose we're given the following initial BN (same structure as the alarm network) and a training set with some missing values (shown as "?" on the slide):
P(B) = 0.1        P(E) = 0.2
P(A | B, E):  b,e 0.9;  b,¬e 0.6;  ¬b,e 0.3;  ¬b,¬e 0.2
P(J | A):  a 0.9;  ¬a 0.2
P(M | A):  a 0.8;  ¬a 0.1
[training-set table with missing values not recoverable from the transcript]

Example: E-step
Using the current model, compute the posterior probability of each missing Alarm value given the instance's observed values, e.g. P(a | ¬b, ¬e, ¬j, ¬m) and P(¬a | ¬b, ¬e, ¬j, ¬m). The per-instance posteriors annotated on the slide include 0.0069, 0.2, 0.98, 0.3, and 0.997 (for A = true), with the complementary values for A = false.

Example: E-step calculations
P(a | ¬b, ¬e, ¬j, ¬m) = P(¬b, ¬e, a, ¬j, ¬m) / [P(¬b, ¬e, a, ¬j, ¬m) + P(¬b, ¬e, ¬a, ¬j, ¬m)]
  = (0.9 × 0.8 × 0.2 × 0.1 × 0.2) / (0.9 × 0.8 × 0.2 × 0.1 × 0.2 + 0.9 × 0.8 × 0.8 × 0.8 × 0.9)
  = 0.00288 / 0.4176 = 0.0069

P(a | ¬b, ¬e, j, ¬m) = P(¬b, ¬e, a, j, ¬m) / [P(¬b, ¬e, a, j, ¬m) + P(¬b, ¬e, ¬a, j, ¬m)]
  = (0.9 × 0.8 × 0.2 × 0.9 × 0.2) / (0.9 × 0.8 × 0.2 × 0.9 × 0.2 + 0.9 × 0.8 × 0.8 × 0.2 × 0.9)
  = 0.02592 / 0.1296 = 0.2

P(a | b, ¬e, j, m) = P(b, ¬e, a, j, m) / [P(b, ¬e, a, j, m) + P(b, ¬e, ¬a, j, m)]
  = (0.1 × 0.8 × 0.6 × 0.9 × 0.8) / (0.1 × 0.8 × 0.6 × 0.9 × 0.8 + 0.1 × 0.8 × 0.4 × 0.2 × 0.1)
  = 0.03456 / 0.0352 = 0.98

Example: M-step
Re-estimate the probabilities using expected counts, e.g.
P(a | ¬b, ¬e) = E#(a, ¬b, ¬e) / E#(¬b, ¬e)
From the expected counts on the slide:
P(a | b, e) = 0.997 / 1
P(a | b, ¬e) = 0.98 / 1
P(a | ¬b, e) = 0.3 / 1
P(a | ¬b, ¬e) = (0.0069 + 0.2 + 0.2 + 0.2 + 0.0069 + 0.2 + 0.2) / 7 ≈ 0.145
giving the updated CPT P(A | B, E):  b,e 0.997;  b,¬e 0.98;  ¬b,e 0.3;  ¬b,¬e 0.145.

Example: M-step (continued)
Re-estimate the probabilities for P(J | A) and P(M | A) in the same way, using expected counts:
P(j | a) = E#(a, j) / E#(a)
  = (0.2 + 0.98 + 0.3 + 0.997 + 0.2) / (0.0069 + 0.2 + 0.98 + 0.2 + 0.3 + 0.2 + 0.997 + 0.0069 + 0.2 + 0.2)
P(j | ¬a) = E#(¬a, j) / E#(¬a)
  = (0.8 + 0.02 + 0.7 + 0.003 + 0.8) / (0.9931 + 0.8 + 0.02 + 0.8 + 0.7 + 0.8 + 0.003 + 0.9931 + 0.8 + 0.8)
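The slide's table of training instances did not survive transcription, so the sketch below (Python, my own code, not the course's) runs EM on a small hypothetical data set with the same initial parameters; its three rows with missing A reproduce the E-step posteriors 0.0069, 0.2, and 0.98 worked out above. For brevity only P(A | B, E) is re-estimated in the M-step; the other CPDs would be handled the same way with expected counts.

```python
from itertools import product

params = {  # initial parameters from the slide
    "B": 0.1, "E": 0.2,
    "A": {(1, 1): 0.9, (1, 0): 0.6, (0, 1): 0.3, (0, 0): 0.2},
    "J": {1: 0.9, 0: 0.2}, "M": {1: 0.8, 0: 0.1},
}
data = [{"B": 0, "E": 0, "A": None, "J": 0, "M": 0},   # A = None marks a missing value
        {"B": 0, "E": 0, "A": None, "J": 1, "M": 0},
        {"B": 1, "E": 0, "A": None, "J": 1, "M": 1},
        {"B": 0, "E": 1, "A": 1,    "J": 1, "M": 1}]

def bern(p, v):
    return p if v else 1 - p

def joint(b, e, a, j, m, th):
    return (bern(th["B"], b) * bern(th["E"], e) * bern(th["A"][(b, e)], a) *
            bern(th["J"][a], j) * bern(th["M"][a], m))

def e_step(row, th):
    """Expected value of A given the observed variables (1.0 or 0.0 if A is observed)."""
    if row["A"] is not None:
        return float(row["A"])
    pa = joint(row["B"], row["E"], 1, row["J"], row["M"], th)
    pna = joint(row["B"], row["E"], 0, row["J"], row["M"], th)
    return pa / (pa + pna)

def m_step(expected_a, th):
    """Re-estimate P(A | B, E) from expected counts (other CPDs done analogously)."""
    new_A = {}
    for b, e in product([0, 1], repeat=2):
        rows = [ea for row, ea in zip(data, expected_a)
                if row["B"] == b and row["E"] == e]
        new_A[(b, e)] = sum(rows) / len(rows) if rows else th["A"][(b, e)]
    return {**th, "A": new_A}

for _ in range(20):   # iterate E and M steps until (near) convergence
    expected_a = [e_step(row, params) for row in data]
    params = m_step(expected_a, params)
print(params["A"])
```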

Convergence of EM
- the E and M steps are iterated until the probabilities converge
- EM will converge to a maximum in the data likelihood (MLE or MAP)
- the maximum may be a local optimum, however
- the optimum found depends on the starting conditions (the initial estimates of the probability parameters)