Lecture 2: Belief (Bayesian) networks

Recall from last time: Conditional probabilities

Our probabilistic models will compute and manipulate conditional probabilities. Given two random variables X, Y, we denote by

    p(X = x | Y = y)

the probability of X taking value x given that we know that Y is certain to have value y. This fits the situation when we observe something and want to make an inference about something related but unobserved:

    p(cancer recurs | tumor measurements)
    p(gene expressed > 1.3 | transcription factor concentrations)
    p(collision with obstacle | sensor readings)
    p(word uttered | sound wave)

Why is this the right language for expressing knowledge?

Outline of Lecture 2: Belief (Bayesian) networks
- Bayes ball
- Conditional independence
- What is a belief network?
- Conditional independencies implied by a belief network

COMP-526 Lecture 2, January 5, 2007

Example: Inference problem

The following table describes the effectiveness of a certain drug in a population:

                   Male              Female             Overall
              Recovered  Died   Recovered  Died   Recovered  Died
    Drug used     15      40       90       50      105       90
    No drug       20      40       20       10       40       50

Does the drug help? Good news: the recovery ratio for the whole population increases from 40/50 (no drug) to 105/90 (drug used).
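The recovery ratios from the table can be checked directly. A minimal Python sketch (the `recovery_rate` helper and the dictionary layout are mine, not from the lecture):

```python
# Drug-effectiveness counts from the table above: (recovered, died).
data = {
    ("drug", "male"): (15, 40),
    ("drug", "female"): (90, 50),
    ("no_drug", "male"): (20, 40),
    ("no_drug", "female"): (20, 10),
}

def recovery_rate(treatment, sexes=("male", "female")):
    """Fraction recovered among the given subgroups for a treatment."""
    rec = sum(data[(treatment, s)][0] for s in sexes)
    died = sum(data[(treatment, s)][1] for s in sexes)
    return rec / (rec + died)

# Overall, the drug looks helpful...
overall_drug = recovery_rate("drug")      # 105/195
overall_none = recovery_rate("no_drug")   # 40/90
# ...but within each sex it looks harmful:
male_drug = recovery_rate("drug", ("male",))        # 15/55
male_none = recovery_rate("no_drug", ("male",))     # 20/60
female_drug = recovery_rate("drug", ("female",))    # 90/140
female_none = recovery_rate("no_drug", ("female",)) # 20/30
```

Running this confirms the puzzle the next slide names: the overall rate favors the drug while both per-sex rates favor no drug.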

Simpson's paradox (Pearl)

Looking again at the drug-effectiveness table: the good news is that the recovery ratio for the whole population increases from 40/50 to 105/90. But the recovery ratio decreases for both males and females!

Eliminating the paradox

If we derive correct conditional probabilities based on this data (assuming 50% males in the population), we get:

    p(recovery | drug)    = (1/2) * 15/(15+40) + (1/2) * 90/(90+50) ≈ 0.46
    p(recovery | no drug) = (1/2) * 20/(20+40) + (1/2) * 20/(20+10) = 0.5

Recall from last time: Bayes rule

Bayes rule is very simple but very important for relating conditional probabilities:

    p(x | y) = p(y | x) p(x) / p(y)

Using Bayes rule for inference

Often we want to form a hypothesis about the world based on observable variables. Bayes rule is a useful tool for inferring the posterior probability of a hypothesis based on evidence and a prior belief in the probability of different hypotheses. It is fundamental when viewed as stating the belief given to a hypothesis H given evidence e:

    p(H | e) = p(e | H) p(H) / p(e)

- p(H | e) is called the posterior probability
- p(H) is called the prior probability
- p(e | H) is called the likelihood of the evidence (data)
- p(e) is just a normalizing constant, which can be computed from the requirement that sum_h p(H = h | e) = 1:

    p(e) = sum_h p(e | h) p(h)

Sometimes we write p(H | e) ∝ p(e | H) p(H).
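The elimination step is just the law of total probability with p(male) = p(female) = 1/2. A quick numerical check (helper name mine):

```python
def weighted_recovery(treatment_counts, p_male=0.5):
    """p(recovery | treatment) = sum over sex of p(sex) * p(recovery | treatment, sex).
    treatment_counts = ((male_recovered, male_died), (female_recovered, female_died))."""
    (m_rec, m_died), (f_rec, f_died) = treatment_counts
    return (p_male * m_rec / (m_rec + m_died)
            + (1 - p_male) * f_rec / (f_rec + f_died))

p_rec_drug = weighted_recovery(((15, 40), (90, 50)))     # about 0.46
p_rec_no_drug = weighted_recovery(((20, 40), (20, 10)))  # exactly 0.5
```

With the population properly balanced, the drug no longer looks beneficial, dissolving the paradox.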

Example: Medical diagnosis

A doctor knows that pneumonia causes a fever 95% of the time. She knows that if a person is selected randomly from the population, there is a 10^-7 chance of the person having pneumonia. 1 in 100 people suffer from fever. You go to the doctor complaining about the symptom of having a fever (evidence). What is the probability that pneumonia is the cause of this symptom (hypothesis)?

    p(pneumonia | fever) = p(fever | pneumonia) p(pneumonia) / p(fever)
                         = (0.95 * 10^-7) / 0.01 = 9.5 * 10^-6

Computing conditional probabilities

Typically, we are interested in the posterior joint distribution of some query variables Y given specific values e for some evidence variables E. Let the hidden variables be Z = X \ (Y ∪ E). If we have a joint probability distribution, we can compute the answer by using the definition of conditional probability and marginalizing out the hidden variables:

    p(Y | e) = p(Y, e) / p(e),  where  p(Y, e) = sum_z p(Y, e, z)

This yields the same big problem as before: the joint distribution is too big to handle.

Independence of random variables revisited

We said that two r.v.'s X and Y are independent, denoted X ⊥ Y, if p(x, y) = p(x) p(y). But we also know that p(x, y) = p(x | y) p(y). Hence, two r.v.'s are independent if and only if:

    p(x | y) = p(x)  (and vice versa), for all x ∈ Ω_X, y ∈ Ω_Y

This means that knowledge about Y does not change the uncertainty about X and vice versa. Is there a similar requirement, but less stringent?

Conditional independence

Two random variables X and Y are conditionally independent given Z if:

    p(x | y, z) = p(x | z), for all x, y, z

This means that knowing the value of Y does not change the prediction about X if the value of Z is known. We denote this by X ⊥ Y | Z.
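The diagnosis computation is a one-line application of Bayes rule. A minimal sketch (function and argument names mine):

```python
def posterior(likelihood, prior, evidence):
    """Bayes rule: p(H | e) = p(e | H) * p(H) / p(e)."""
    return likelihood * prior / evidence

# p(fever | pneumonia) = 0.95, p(pneumonia) = 1e-7, p(fever) = 0.01
p_pneumonia_given_fever = posterior(likelihood=0.95, prior=1e-7, evidence=0.01)
```

Despite the strong likelihood, the tiny prior keeps the posterior at roughly one in a hundred thousand, which is the point of the example.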

Example

Consider the medical diagnosis problem with three random variables: P (patient has pneumonia), F (patient has a fever), C (patient has a cough). The full joint distribution has 2^3 - 1 = 7 independent entries. If someone has pneumonia, we can assume that the probability of a cough does not depend on whether they have a fever:

    p(C = 1 | P = 1, F) = p(C = 1 | P = 1)    (1)

The same equality holds if the patient does not have pneumonia:

    p(C = 1 | P = 0, F) = p(C = 1 | P = 0)    (2)

Hence, C and F are conditionally independent given P.

Example (continued)

The joint distribution can now be written as:

    p(C, P, F) = p(C | P, F) p(F | P) p(P) = p(C | P) p(F | P) p(P)

Hence, the joint can be described using 2 + 2 + 1 = 5 numbers instead of 7. Much more important savings happen with more variables!

Naive Bayesian model

A common assumption in early diagnosis is that the symptoms are independent of each other given the disease. Let s_1, ..., s_n be the symptoms exhibited by a patient (e.g. fever, headache, etc.), and let D be the patient's disease. Then by using the naive Bayes assumption, we get:

    p(D, s_1, ..., s_n) = p(D) p(s_1 | D) ... p(s_n | D)

Recursive Bayesian updating

The naive Bayes assumption also allows for a very nice, incremental updating of beliefs as more evidence is gathered. Suppose that after observing symptoms s_1, ..., s_n the probability of D is:

    p(D | s_1, ..., s_n) ∝ p(D) prod_{i=1}^{n} p(s_i | D)

What happens if a new symptom s_{n+1} appears?

    p(D | s_1, ..., s_n, s_{n+1}) ∝ p(D) prod_{i=1}^{n+1} p(s_i | D) = p(D | s_1, ..., s_n) p(s_{n+1} | D)

An even nicer formula can be obtained by taking logs (up to the additive normalizing constant):

    log p(D | s_1, ..., s_n, s_{n+1}) = log p(D | s_1, ..., s_n) + log p(s_{n+1} | D)
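The recursive update can be sketched in code: scoring all symptoms at once gives the same unnormalized log-posterior as scoring some and then adding the new symptom's log-likelihood. The disease prior and symptom likelihood numbers below are made up for illustration, not from the lecture:

```python
import math

# Hypothetical binary disease D and per-symptom likelihoods p(s_i = 1 | D).
prior = {1: 0.01, 0: 0.99}                  # p(D = d)
lik = [{1: 0.9, 0: 0.1}, {1: 0.8, 0: 0.2}]  # lik[i][d] = p(s_i = 1 | D = d)

def log_score(d, symptoms):
    """Unnormalized log-posterior: log p(D=d) + sum_i log p(s_i | D=d)."""
    total = math.log(prior[d])
    for obs, table in zip(symptoms, lik):
        p1 = table[d]
        total += math.log(p1 if obs == 1 else 1.0 - p1)
    return total

def update(log_post, d, obs, table):
    """Incremental step: add log p(s_{n+1} | D=d) for a newly observed symptom."""
    p1 = table[d]
    return log_post + math.log(p1 if obs == 1 else 1.0 - p1)

# Batch scoring of both symptoms equals scoring the first and then updating
# with the second -- the recursive Bayesian update.
batch = log_score(1, [1, 1])
incremental = update(log_score(1, [1]), 1, 1, lik[1])
```

Working in log space, as the slide suggests, also avoids numerical underflow when many symptoms are multiplied together.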

A graphical representation of the naive Bayesian model

    Diagnosis -> Sympt_1, Sympt_2, ..., Sympt_n

- The nodes represent random variables
- The arcs represent influences
- The lack of arcs represents conditional independence relationships

More generally: Bayesian networks

Bayesian networks are a graphical representation of conditional independence relations, using graphs. An example is the alarm network, with variables B (burglary), E (earthquake), A (alarm), R (radio report), C (neighbor calls), edges B -> A, E -> A, E -> R, A -> C, and conditional probability tables:

    p(B):       B=1: 0.01    B=0: 0.99
    p(E):       E=1: 0.005   E=0: 0.995
    p(R | E):   E=0: R=1: 0.0001, R=0: 0.9999    E=1: R=1: 0.65, R=0: 0.35
    p(A | B,E): B=0,E=0: A=1: 0.001, A=0: 0.999    B=0,E=1: A=1: 0.3,  A=0: 0.7
                B=1,E=0: A=1: 0.8,   A=0: 0.2      B=1,E=1: A=1: 0.95, A=0: 0.05
    p(C | A):   A=0: C=1: 0.05, C=0: 0.95    A=1: C=1: 0.7, C=0: 0.3

A graphical representation for probabilistic models

Suppose the world is described by a set of r.v.'s X_1, ..., X_n. Let us define a directed acyclic graph such that each node i corresponds to an r.v. X_i. Since this is a one-to-one mapping, we will use X_i to denote both the node in the graph and the corresponding r.v. Let X_{pi_i} be the set of parents of node X_i in the graph. Suppose that we associate with each node a local function f_i(x_i, x_{pi_i}) with properties similar to those of conditional probability distributions:

    sum_{x_i in Ω_{X_i}} f_i(x_i, x_{pi_i}) = 1 for all x_{pi_i},  and  f_i(x_i, x_{pi_i}) ≥ 0 for all x_i, x_{pi_i}
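The savings seen earlier (5 numbers instead of 7) generalize: for binary variables, each node contributes one free parameter per setting of its parents. A small sketch (the helper name and the choice of n = 10 symptoms are mine, for illustration):

```python
def bn_param_count(num_parents_per_node):
    """Independent parameters of a Bayes net over binary variables:
    each node needs one number p(X=1 | parents) per parent setting."""
    return sum(2 ** k for k in num_parents_per_node)

# Naive Bayes with n symptoms: the disease has no parents, each symptom has one.
n = 10
naive_params = bn_param_count([0] + [1] * n)  # 1 + 2n
full_joint_params = 2 ** (n + 1) - 1          # independent entries of the full joint
```

For 10 symptoms this is 21 numbers against 2047, which is the "much more important savings" the slide promises.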

Representation (continued)

Given a set of such functions, we define a joint probability distribution as follows:

    p(x_1, x_2, ..., x_n) = prod_{i=1}^{n} f_i(x_i, x_{pi_i})

In other words, the joint probability distribution is defined as a product of local functions associated with the nodes. You can check that p(x_1, ..., x_n) is indeed between 0 and 1 and that it sums to 1. By varying the values of the f_i we get different joint distributions. If we try to compute p(x_i | x_{pi_i}), we get f_i(x_i, x_{pi_i})! Hence, at each node, we really have the conditional probability of that variable given its parents: p(x_i | x_{pi_i}).

Example: A Bayesian (belief) network

The alarm network above, together with its conditional probability tables, is a Bayesian (belief) network:
- The nodes represent random variables
- The arcs represent influences
- At each node, we have a conditional probability distribution (CPD) for the corresponding variable given its parents

Factorization

Let G be a DAG over variables X_1, ..., X_n. We say that a joint probability distribution p factorizes according to G if p can be expressed as a product:

    p(x_1, ..., x_n) = prod_{i=1}^{n} p(x_i | x_{pi_i})

The individual factors p(x_i | x_{pi_i}) are called local probabilistic models or conditional probability distributions (CPDs).

Bayesian network definition

A Bayesian network is a DAG G over variables X_1, ..., X_n, together with a distribution p that factorizes over G. p is specified as the set of conditional probability distributions associated with G's nodes.
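The claim that a product of normalized local functions is itself a normalized joint can be verified directly on the alarm network. The CPT numbers are the ones from the lecture's figure; the helper names (`bern`, `joint`) are mine:

```python
from itertools import product

# Alarm-network CPTs; each table stores p(X = 1 | parents).
p_B1, p_E1 = 0.01, 0.005
p_R1_given_E = {0: 0.0001, 1: 0.65}
p_A1_given_BE = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.8, (1, 1): 0.95}
p_C1_given_A = {0: 0.05, 1: 0.7}

def bern(p1, x):
    """Probability of a binary outcome x given p(X = 1) = p1."""
    return p1 if x == 1 else 1 - p1

def joint(b, e, a, c, r):
    """p(B, E, A, C, R) = p(B) p(E) p(A | B, E) p(C | A) p(R | E)."""
    return (bern(p_B1, b) * bern(p_E1, e)
            * bern(p_A1_given_BE[(b, e)], a)
            * bern(p_C1_given_A[a], c)
            * bern(p_R1_given_E[e], r))

# A product of normalized CPTs is a normalized joint distribution:
total = sum(joint(*xs) for xs in product((0, 1), repeat=5))
# Any entry is a five-factor product, e.g. the one computed on the next slide:
entry = joint(b=1, e=0, a=1, c=1, r=0)
```

`total` comes out to 1 (up to floating-point error), and `entry` reproduces the 0.0056 computed in the reasoning example.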

Using a Bayes net for reasoning (1)

Computing any entry in the joint probability table is easy because of the factorization property:

    p(B=1, E=0, A=1, C=1, R=0)
      = p(B=1) p(E=0) p(A=1 | B=1, E=0) p(C=1 | A=1) p(R=0 | E=0)
      = 0.01 * 0.995 * 0.8 * 0.7 * 0.9999 ≈ 0.0056

Computing marginal probabilities is also easy. E.g., what is the probability that a neighbor calls?

    p(C=1) = sum_{e,b,r,a} p(C=1, e, b, r, a) = ...

Using a Bayes net for reasoning (2)

One might want to compute the conditional probability of a variable given evidence that is upstream from it in the graph. E.g., what is the probability of a call in case of a burglary?

    p(C=1 | B=1) = p(C=1, B=1) / p(B=1) = sum_{e,r,a} p(C=1, B=1, e, r, a) / sum_{c,e,r,a} p(c, B=1, e, r, a)

This is called causal reasoning or prediction.

Using a Bayes net for reasoning (3)

We might have some evidence and need an explanation for it. In this case, we compute a conditional probability based on evidence that is downstream in the graph. E.g., suppose we got a call. What is the probability of a burglary? What is the probability of an earthquake?

    p(B=1 | C=1) = p(C=1 | B=1) p(B=1) / p(C=1) = ...
    p(E=1 | C=1) = p(C=1 | E=1) p(E=1) / p(C=1) = ...

This is called evidential reasoning or explanation.

Using a Bayes net for reasoning (4)

Suppose that you now gather more evidence, e.g. the radio announces an earthquake. What happens to the probabilities?

    p(E=1 | C=1, R=1) > p(E=1 | C=1)   and   p(B=1 | C=1, R=1) < p(B=1 | C=1)

This is called explaining away.
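All four reasoning patterns reduce to summing joint entries (inference by enumeration). A self-contained sketch using the figure's CPT numbers (the `prob` helper and its keyword interface are mine):

```python
from itertools import product

# Alarm-network CPTs from the figure; each table stores p(X = 1 | parents).
p_B1, p_E1 = 0.01, 0.005
p_R1_given_E = {0: 0.0001, 1: 0.65}
p_A1_given_BE = {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.8, (1, 1): 0.95}
p_C1_given_A = {0: 0.05, 1: 0.7}

def bern(p1, x):
    return p1 if x == 1 else 1 - p1

def joint(b, e, a, c, r):
    return (bern(p_B1, b) * bern(p_E1, e) * bern(p_A1_given_BE[(b, e)], a)
            * bern(p_C1_given_A[a], c) * bern(p_R1_given_E[e], r))

def prob(**fixed):
    """Marginal of the fixed variables: sum the joint over all unfixed ones."""
    names = ("b", "e", "a", "c", "r")
    total = 0.0
    for xs in product((0, 1), repeat=5):
        assign = dict(zip(names, xs))
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(**assign)
    return total

# Causal reasoning (prediction): p(C=1 | B=1)
p_c_given_b = prob(c=1, b=1) / prob(b=1)
# Evidential reasoning (explanation): p(B=1 | C=1)
p_b_given_c = prob(b=1, c=1) / prob(c=1)
# Explaining away: the radio report makes the burglary explanation less likely.
p_b_given_cr = prob(b=1, c=1, r=1) / prob(c=1, r=1)
```

With these numbers, p(B=1 | C=1) is about 0.10, but once the radio announces an earthquake it drops to about 0.03: the earthquake explains away the call.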

DAGs and independencies

Given a graph G, what sort of independence assumptions does it imply? E.g., consider the alarm network over E, B, R, A, C. In general, the lack of an edge corresponds to the lack of a variable in a conditional probability distribution, so it must imply some independencies.

Implied independencies

Independencies are important because they can help us answer queries more efficiently. E.g., suppose that we want to know the probability of a radio report given that there was a burglary. Do we really need to sum over all values of A, C, E? Given a Bayes net structure G, and given values for evidence variable(s) Y, what can we say about the sets of variables X and Z? Intuitively, the evidence will propagate along paths in the graph, and if it reaches both X and Z, then they are not independent.

A simple case: Indirect connection

    X -> Y -> Z

Think of X as the past, Y as the present and Z as the future. This is a simple Markov chain. We interpret the lack of an edge between X and Z as a conditional independence, X ⊥ Z | Y. Is this justified?

Indirect connection (continued)

Based on the graph structure, we have:

    p(X, Y, Z) = p(X) p(Y | X) p(Z | Y)

Hence, we have:

    p(Z | X, Y) = p(X, Y, Z) / p(X, Y) = p(X) p(Y | X) p(Z | Y) / (p(X) p(Y | X)) = p(Z | Y)

Note that the edges that are present do not imply dependence. But the edges that are missing do imply independence.
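The derivation can be checked numerically on a tiny chain. Only the structure X -> Y -> Z comes from the slide; the CPT numbers below are made up for illustration:

```python
# A small chain X -> Y -> Z with hypothetical binary CPTs.
p_X1 = 0.3
p_Y1_given_X = {0: 0.2, 1: 0.9}
p_Z1_given_Y = {0: 0.1, 1: 0.6}

def bern(p1, v):
    return p1 if v == 1 else 1 - p1

def joint(x, y, z):
    """p(X, Y, Z) = p(X) p(Y | X) p(Z | Y)."""
    return bern(p_X1, x) * bern(p_Y1_given_X[x], y) * bern(p_Z1_given_Y[y], z)

def p_z_given_xy(z, x, y):
    """p(Z = z | X = x, Y = y) by conditioning the joint."""
    return joint(x, y, z) / (joint(x, y, 0) + joint(x, y, 1))

# p(Z | X, Y) does not depend on X: it is exactly p(Z | Y), i.e. X is
# independent of Z given Y.
for y in (0, 1):
    for z in (0, 1):
        assert abs(p_z_given_xy(z, 0, y) - p_z_given_xy(z, 1, y)) < 1e-12
        assert abs(p_z_given_xy(z, 1, y) - bern(p_Z1_given_Y[y], z)) < 1e-12
```

Whatever numbers are placed in the CPTs, the assertions hold, because the independence follows from the factorization, not from the particular values.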