Bayesian Networks
Computer Sciences 760, Spring 2014
http://pages.cs.wisc.edu/~dpage/cs760/

Motivation
Assume we have five Boolean variables X1, X2, X3, X4, X5. The joint probability is P(X1, X2, X3, X4, X5).
- How many state configurations are there in total?
- How many free parameters do we need to specify the joint distribution?
- Do we get any insight about the variables directly?

Factorized representation (chain rule):
P(X1, X2, X3, X4, X5) = P(X1) P(X2 | X1) P(X3 | X1, X2) P(X4 | X1, X2, X3) P(X5 | X1, X2, X3, X4)
This holds for any distribution and any variable ordering.
- How many free parameters do we need to represent it?

Assume the variables are fully independent:
P(X1, X2, X3, X4, X5) = P(X1) P(X2) P(X3) P(X4) P(X5)
- Pros? Cons?
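A few lines of Python can check these counts; a minimal sketch, assuming five generic binary variables (the names are placeholders, not from the lecture):

```python
# Parameter counts for n binary variables under three representations.
n = 5

# Full joint table: one probability per configuration, minus one for the
# sum-to-one constraint.
full_joint = 2**n - 1                       # 31

# Chain rule P(X1) P(X2|X1) ... P(X5|X1..X4): each factor contributes one
# free parameter per configuration of its conditioning variables.
chain_rule = sum(2**i for i in range(n))    # 1 + 2 + 4 + 8 + 16 = 31

# Full independence P(X1) ... P(X5): one free parameter per variable.
independent = n                             # 5

print(full_joint, chain_rule, independent)  # 31 31 5
```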

Goals for the lecture
- Bayesian network representation
- Bayesian network inference: inference by enumeration
- Bayesian network learning: parameter learning

Bayesian network example
Consider the following 5 binary random variables:
- B = a burglary occurs at your house
- E = an earthquake occurs at your house
- A = the alarm goes off
- J = John calls to report the alarm
- M = Mary calls to report the alarm
Suppose we want to answer queries like: what is P(B | M, J)?

Bayesian network example
The alarm network: Burglary → Alarm ← Earthquake, with Alarm → JohnCalls and Alarm → MaryCalls, and the conditional probability tables

P(B):  P(B=t) = 0.001, P(B=f) = 0.999
P(E):  P(E=t) = 0.001, P(E=f) = 0.999

P(A | B, E):
  B  E   P(A=t)  P(A=f)
  t  t   0.95    0.05
  t  f   0.94    0.06
  f  t   0.29    0.71
  f  f   0.001   0.999

P(J | A):
  A   P(J=t)  P(J=f)
  t   0.9     0.1
  f   0.05    0.95

P(M | A):
  A   P(M=t)  P(M=f)
  t   0.7     0.3
  f   0.01    0.99

Bayesian networks
A BN consists of a Directed Acyclic Graph (DAG) and a set of conditional probability distributions.
- Each node in the DAG denotes a (discrete) random variable.
- Each edge from X to Y represents that X directly influences Y.
- Formally: each variable X is independent of its non-descendants given its parents.
- Each node X has a conditional probability distribution (CPD) representing P(X | Parents(X)).

Bayesian networks
A BN specifies a joint distribution over its variables:
P(X1, ..., Xn) = ∏_i P(Xi | Parents(Xi))

Example (alarm network, with the CPTs shown above): the joint probability that Burglary is true, Earthquake is false, the alarm goes off, and both John and Mary call is
P(b, ¬e, a, j, m) = P(b) P(¬e) P(a | b, ¬e) P(j | a) P(m | a)
                  = 0.001 × 0.999 × 0.94 × 0.9 × 0.7 ≈ 0.00059
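As a concrete check of this factorization, here is a minimal sketch that writes the alarm network's CPTs as plain Python dictionaries and multiplies the relevant entries (the dictionary layout and helper names are mine, not from the lecture):

```python
# Alarm-network CPTs: each entry gives P(child = True | parent configuration).
P_B = 0.001
P_E = 0.001
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (B, E)
P_J = {True: 0.90, False: 0.05}                      # keyed by A
P_M = {True: 0.70, False: 0.01}                      # keyed by A

def bern(p, value):
    """P(X = value) for a binary variable with P(X = True) = p."""
    return p if value else 1.0 - p

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) via the network factorization."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) *
            bern(P_J[a], j) * bern(P_M[a], m))

# The example from the slide: burglary, no earthquake, alarm, both calls.
print(joint(True, False, True, True, True))   # ~0.00059
```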

Bayesian networks
Factorized joint distribution from the BN:
P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)

Factorized joint distribution from the chain rule:
P(B, E, A, J, M) = P(B) P(E | B) P(A | B, E) P(J | B, E, A) P(M | B, E, A, J)

Free-parameter counts for the alarm network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls):
- Chain rule: 1 + 2 + 4 + 8 + 16 = 31
- BN factorization: 1 + 1 + 4 + 2 + 2 = 10
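The per-node counts can be computed from the parent sets alone; a small sketch, assuming all variables are binary:

```python
# Each binary node contributes one free parameter per configuration of its
# parents, i.e. 2**len(parents) parameters.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

bn_params = sum(2**len(ps) for ps in parents.values())   # 1+1+4+2+2 = 10
full_joint_params = 2**len(parents) - 1                  # 2**5 - 1 = 31

print(bn_params, full_joint_params)   # 10 31
```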

Bayesian networks
Consider a case with 10 binary random variables.
- How many free parameters does a BN with a given graph structure have? For the structure shown on the slide (not reproduced here), the per-node counts are 1 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 4 + 2 = 21.
- How many free parameters does the standard table representation of the joint distribution have? 2^10 − 1 = 1023.

Advantages of the Bayesian network representation
- represents a joint distribution compactly, in a factorized way
- represents a set of conditional independence assumptions about a distribution
- uses a graphical representation which provides insight about the variables
- the graph can be viewed as encoding a generative sampling process

Bayesian network example
(The alarm network and its CPTs are as shown above.)
How do we generate a data point from this BN?

The inference task in Bayesian networks
Given: values for some variables in the network (the evidence), and a set of query variables
Do: compute the posterior distribution over the query variables
- Variables that are neither evidence variables nor query variables are hidden variables.
- The BN representation is flexible enough that any set can be the evidence variables and any set can be the query variables.
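One way to answer the question on the slide is ancestral sampling: draw each variable in a topological order, parents before children. A minimal sketch reusing the CPT values above (the representation is an assumption, not the lecture's code):

```python
import random

P_B = 0.001
P_E = 0.001
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def sample_instance(rng=random):
    """Ancestral sampling: draw parents first, then children given parents."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    j = rng.random() < P_J[a]
    m = rng.random() < P_M[a]
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

print(sample_instance())
```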

Inference by enumeration
Let a denote A = true and ¬a denote A = false (and similarly for the other variables).
Query: P(b | j, m), the probability the house is being burglarized given that John and Mary both called.

P(b, j, m) = Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)

i.e., sum over the possible values of the hidden variables E and A: (e, a), (e, ¬a), (¬e, a), (¬e, ¬a).

Inference by enumeration
P(b, j, m) = Σ_e Σ_a P(b) P(e) P(a | b, e) P(j | a) P(m | a)
           = P(b) Σ_e P(e) Σ_a P(a | b, e) P(j | a) P(m | a)
           = 0.001 × (0.001 × 0.95 × 0.9 × 0.7      [e, a]
                    + 0.001 × 0.05 × 0.05 × 0.01    [e, ¬a]
                    + 0.999 × 0.94 × 0.9 × 0.7      [¬e, a]
                    + 0.999 × 0.06 × 0.05 × 0.01)   [¬e, ¬a]

Inference by enumeration
Now do the equivalent calculation for P(¬b, j, m) and determine P(b | j, m):

P(b | j, m) = P(b, j, m) / P(j, m) = P(b, j, m) / (P(b, j, m) + P(¬b, j, m))

Comments on BN inference
- Inference by enumeration is an exact method (i.e., it computes the exact answer to a given query).
- It requires summing over a joint distribution whose size is exponential in the number of variables.
- In many cases we can do exact inference efficiently in large networks, using methods such as variable elimination and junction trees.
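A minimal sketch of inference by enumeration for this query, summing out the hidden variables E and A and then normalizing over B (function and variable names are mine):

```python
from itertools import product

P_B = 0.001
P_E = 0.001
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(p, v):
    return p if v else 1.0 - p

def joint(b, e, a, j, m):
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) *
            bern(P_J[a], j) * bern(P_M[a], m))

def prob_b_given_jm(j=True, m=True):
    """P(B = true | J = j, M = m) by summing out the hidden variables E and A."""
    unnorm = {}
    for b in (True, False):
        unnorm[b] = sum(joint(b, e, a, j, m)
                        for e, a in product((True, False), repeat=2))
    return unnorm[True] / (unnorm[True] + unnorm[False])

# With the CPT values above this is roughly 0.31: a burglary is plausible,
# but far from certain, even when both neighbors call.
print(prob_b_given_jm())
```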

The parameter learning task
Given: a set of training instances and the graph structure of a BN (e.g., the alarm network over Burglary, Earthquake, Alarm, JohnCalls, MaryCalls), for example

  B  E  A  J  M
  f  f  t  f  t
  ...

Do: infer the parameters of the CPDs.

The structure learning task
Given: a set of training instances

  B  E  A  J  M
  f  f  t  f  t
  ...

Do: infer the graph structure (and perhaps the parameters of the CPDs too).

Parameter learning and maximum likelihood estimation
Maximum likelihood estimation (MLE):
- given a model structure (e.g., a Bayes net graph) and a set of data D,
- set the model parameters θ to maximize P(D; θ),
- i.e., make the data D look as likely as possible under the model.

Maximum likelihood estimation
Consider trying to estimate the parameter θ (the probability of heads) of a biased coin from a sequence of flips:
x = 1, 1, 1, 0, 1, 0, 0, 1, 0, 1
The likelihood function for θ is given by
L(θ) = ∏_i θ^(x_i) (1 − θ)^(1 − x_i) = θ^h (1 − θ)^(n − h)
For h heads in n flips, the MLE is θ̂ = h/n.
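A quick numeric check of the coin example; the grid search is only a sanity check that h/n maximizes the likelihood (it is not part of the lecture):

```python
flips = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]   # the sequence from the slide
h, n = sum(flips), len(flips)

def likelihood(theta):
    """L(theta) = theta**h * (1 - theta)**(n - h)."""
    return theta**h * (1.0 - theta)**(n - h)

mle = h / n                                              # 0.6
best_on_grid = max((i / 1000 for i in range(1001)), key=likelihood)
print(mle, best_on_grid)                                 # 0.6 0.6
```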

Maximum likelihood estimation
Now consider estimating the CPD parameters for B and J in the alarm network, given a data set of 8 training instances over B, E, A, J, M (the full table from the slide is not reproduced here). The maximum likelihood estimates are ratios of counts:

P(b)      = 1/8 = 0.125      P(¬b)      = 7/8 = 0.875
P(j | a)  = 3/4 = 0.75       P(¬j | a)  = 1/4 = 0.25
P(j | ¬a) = 2/4 = 0.5        P(¬j | ¬a) = 2/4 = 0.5

Maximum likelihood estimation
Suppose instead our data set contained no instances at all with B = true. Then

P(b) = 0/8 = 0      P(¬b) = 8/8 = 1

Do we really want to set this to 0?
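A minimal counting sketch of these MLE estimates. The eight records below are hypothetical stand-ins, since the slide's data table did not survive transcription; only the counting recipe matters:

```python
# Relative-frequency (MLE) estimates for P(J | A) from a toy data set.
# Hypothetical records, in (B, E, A, J, M) order.
records = [
    (False, False, True,  True,  True),
    (False, False, True,  True,  False),
    (False, True,  True,  True,  True),
    (True,  False, True,  False, True),
    (False, False, False, True,  False),
    (False, False, False, True,  True),
    (False, False, False, False, False),
    (False, True,  False, False, True),
]

def mle_j_given_a(data, a_value):
    """P(J = true | A = a_value) as a ratio of counts."""
    rows = [r for r in data if r[2] == a_value]
    return sum(r[3] for r in rows) / len(rows)

print(mle_j_given_a(records, True))    # 3/4 = 0.75
print(mle_j_given_a(records, False))   # 2/4 = 0.5
# If some configuration never occurs (e.g. no record with B = true), the same
# recipe returns exactly 0, which motivates the smoothed estimates that follow.
```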

Maximum a posteriori (MAP) estimation
Instead of estimating parameters strictly from the data, we could start with some prior belief for each parameter. For example, we could use Laplace estimates:

P(X = x) = (n_x + 1) / Σ_{v ∈ Values(X)} (n_v + 1)

where n_v is the number of occurrences of value v, and the added 1s are pseudocounts.

Maximum a posteriori estimation
A more general form: m-estimates

P(X = x) = (n_x + p_x m) / (Σ_{v ∈ Values(X)} n_v + m)

where p_x is the prior probability of value x and m is the number of "virtual" instances.

M-estimates example
Now let's estimate the parameters for B using m = 4 and p_b = 0.25 (so p_¬b = 0.75), on the data set with no instances where B is true:

P(b)  = (0 + 0.25 × 4) / (8 + 4) = 1/12 ≈ 0.08
P(¬b) = (8 + 0.75 × 4) / (8 + 4) = 11/12 ≈ 0.92

Missing data
Commonly in machine learning tasks, some feature values are missing:
- some variables may not be observable (i.e., hidden) even for training instances
- values for some variables may be missing at random: what caused the data to be missing does not depend on the missing data itself
  - e.g., someone accidentally skips a question on a questionnaire
  - e.g., a sensor fails to record a value due to a power blip
- values for some variables may be missing systematically: the probability of a value being missing depends on the value itself
  - e.g., a medical test result is missing because a doctor was fairly sure of a diagnosis given earlier test results
  - e.g., the graded exams that go missing on the way home from school are those with poor scores
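A small sketch of the m-estimate as a function, checked against the numbers above (the function name is mine; Laplace smoothing falls out as a special case):

```python
def m_estimate(n_x, n_total, p_x, m):
    """m-estimate: (n_x + p_x * m) / (n_total + m)."""
    return (n_x + p_x * m) / (n_total + m)

# The slide's example: 0 of 8 instances have B = true, prior p_b = 0.25, m = 4.
print(m_estimate(0, 8, 0.25, 4))    # 1/12 ~= 0.083
print(m_estimate(8, 8, 0.75, 4))    # 11/12 ~= 0.917

# Laplace smoothing is the special case p_x = 1/k, m = k for k possible values.
print(m_estimate(0, 8, 1/2, 2))     # (0 + 1) / (8 + 2) = 0.1
```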

Missing data
- Hidden variables and values missing at random: these are the cases we'll focus on. One solution: try to impute the values.
- Values missing systematically: it may be more sensible to represent "missing" as an explicit feature value.

Imputing missing data with EM
Given:
- a data set with some missing values
- a model structure and initial model parameters
Repeat until convergence (see the skeleton below):
- Expectation (E) step: using the current model, compute the expectation over the missing values
- Maximization (M) step: update the model parameters with those that maximize the probability of the data (MLE or MAP)
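A schematic skeleton of this loop, assuming the parameters are stored in a flat dict of floats; the e_step and m_step callables are placeholders to be filled in for a concrete model:

```python
def em(data, params, e_step, m_step, tol=1e-6, max_iters=100):
    """Generic EM skeleton: alternate E and M steps until the parameters
    stop changing. e_step returns expectations over the missing values;
    m_step re-estimates the parameters from the completed data (MLE or MAP)."""
    for _ in range(max_iters):
        expectations = e_step(data, params)        # E-step
        new_params = m_step(data, expectations)    # M-step
        if all(abs(new_params[k] - params[k]) < tol for k in params):
            break
        params = new_params
    return params
```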

Example: EM for parameter learning
Suppose we're given the following initial BN over B, E, A, J, M and a training set in which some values are missing (marked "?"; the slide's table is not fully reproduced here).

Initial parameters:
P(B) = 0.1      P(E) = 0.2

P(A | B, E):
  B  E   P(A=t)
  t  t   0.9
  t  f   0.6
  f  t   0.3
  f  f   0.2

P(J | A):  P(J=t | a) = 0.9,  P(J=t | ¬a) = 0.2
P(M | A):  P(M=t | a) = 0.8,  P(M=t | ¬a) = 0.1

Example: E-step
For each missing value, compute its posterior distribution given the instance's observed values and the current parameters; the missing entries are then treated as "soft" (expected) values. In the filled-in training table, the observed t/f entries are kept and the missing entries become distributions such as {t: 0.0069, f: 0.9931}, {t: 0.2, f: 0.8}, {t: 0.98, f: 0.02}, and {t: 0.997, f: 0.003}.

Example: E-step
Worked posteriors for a missing A value, using the initial parameters: each is the joint probability with A = true divided by the sum of the joints with A = true and A = false.

P(a | ¬b, ¬e, ¬j, ¬m) = (0.9 × 0.8 × 0.2 × 0.1 × 0.2) / (0.9 × 0.8 × 0.2 × 0.1 × 0.2 + 0.9 × 0.8 × 0.8 × 0.8 × 0.9)
                      = 0.00288 / 0.4176 ≈ 0.0069

P(a | ¬b, ¬e, j, ¬m) = (0.9 × 0.8 × 0.2 × 0.9 × 0.2) / (0.9 × 0.8 × 0.2 × 0.9 × 0.2 + 0.9 × 0.8 × 0.8 × 0.2 × 0.9)
                     = 0.02592 / 0.1296 = 0.2

P(a | b, ¬e, j, m) = (0.1 × 0.8 × 0.6 × 0.9 × 0.8) / (0.1 × 0.8 × 0.6 × 0.9 × 0.8 + 0.1 × 0.8 × 0.4 × 0.2 × 0.1)
                   = 0.03456 / 0.0352 ≈ 0.98

Example: M-step
Re-estimate the probabilities using expected counts from the filled-in data:

P(a | b, e) = E#(a, b, e) / E#(b, e)

For example, P(a | b, e) = 0.997 / 1 = 0.997. The re-estimated table for P(A | B, E) becomes:

  B  E   P(A=t)
  t  t   0.997
  t  f   0.98
  f  t   0.3
  f  f   0.145

Re-estimate the probabilities for P(J | A) and P(M | A) in the same way.
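A minimal sketch that recomputes these three E-step posteriors for a missing A value from the initial parameters (helper names are mine):

```python
# Initial parameters from the slide.
P_B, P_E = 0.1, 0.2
P_A = {(True, True): 0.9, (True, False): 0.6,
       (False, True): 0.3, (False, False): 0.2}
P_J = {True: 0.9, False: 0.2}
P_M = {True: 0.8, False: 0.1}

def bern(p, v):
    return p if v else 1.0 - p

def joint(b, e, a, j, m):
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) *
            bern(P_J[a], j) * bern(P_M[a], m))

def post_a(b, e, j, m):
    """E-step fill-in: P(A = true | b, e, j, m) under the current parameters."""
    num = joint(b, e, True, j, m)
    return num / (num + joint(b, e, False, j, m))

print(round(post_a(False, False, False, False), 4))   # 0.0069
print(round(post_a(False, False, True,  False), 4))   # 0.2
print(round(post_a(True,  False, True,  True),  4))   # ~0.98
```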

Convergence of EM
- The E and M steps are iterated until the probabilities converge.
- EM will converge to a maximum of the data likelihood (MLE or MAP).
- The maximum may be a local optimum, however; the optimum found depends on the starting conditions (the initial parameter estimates).

Summary
BNs have many advantages:
- easy to encode domain knowledge (direct dependencies, causality)
- can represent uncertainty
- principled methods for dealing with missing values
- can answer arbitrary queries (in theory; in practice this may be intractable)

Next
- structure learning
- BNs as supervised learning