Exact Inference by Complete Enumeration

21 Exact Inference by Complete Enumeration

We open our toolbox of methods for handling probabilities by discussing a brute-force inference method: complete enumeration of all hypotheses, and evaluation of their probabilities. This approach is an exact method, and the difficulty of carrying it out will motivate the smarter exact and approximate methods introduced in the following chapters.

21.1 The burglar alarm

Bayesian probability theory is sometimes called 'common sense, amplified'. When thinking about the following questions, please ask your common sense what it thinks the answers are; we will then see how Bayesian methods confirm your everyday intuition.

Example 21.1. Fred lives in Los Angeles and commutes 60 miles to work. Whilst at work, he receives a phone-call from his neighbour saying that Fred's burglar alarm is ringing. What is the probability that there was a burglar in his house today? While driving home to investigate, Fred hears on the radio that there was a small earthquake that day near his home. 'Oh', he says, feeling relieved, 'it was probably the earthquake that set off the alarm.' What is the probability that there was a burglar in his house? (After Pearl, 1988.)

[Figure 21.1. Belief network for the burglar alarm problem: Burglar and Earthquake are parents of Alarm; Alarm is the parent of Phonecall; Earthquake is the parent of Radio.]

Let's introduce variables b (a burglar was present in Fred's house today), a (the alarm is ringing), p (Fred receives a phonecall from the neighbour reporting the alarm), e (a small earthquake took place today near Fred's house), and r (the radio report of earthquake is heard by Fred). The probability of all these variables might factorize as follows:

    P(b, e, a, p, r) = P(b) P(e) P(a | b, e) P(p | a) P(r | e),    (21.1)

and plausible values for the probabilities are:

1. Burglar probability:

    P(b = 1) = β,   P(b = 0) = 1 − β,    (21.2)

   e.g., β = 0.001 gives a mean burglary rate of once every three years.

2. Earthquake probability:

    P(e = 1) = ε,   P(e = 0) = 1 − ε,    (21.3)

with, e.g., ε = 0.001; our assertion that the earthquakes are independent of burglars, i.e., that the prior probability of b and e is P(b, e) = P(b) P(e), seems reasonable unless we take into account opportunistic burglars who strike immediately after earthquakes.

3. Alarm ringing probability: we assume the alarm will ring if any of the following three events happens: (a) a burglar enters the house and triggers the alarm (let's assume the alarm has a reliability of α_b = 0.99, i.e., 99% of burglars trigger the alarm); (b) an earthquake takes place and triggers the alarm (perhaps α_e = 1% of alarms are triggered by earthquakes?); or (c) some other event causes a false alarm; let's assume the false alarm rate f is 0.001, so Fred has false alarms from non-earthquake causes once every three years. [This type of dependence of a on b and e is known as a noisy-or.] The probabilities of a given b and e are then:

    P(a=0 | b=0, e=0) = 1 − f,                  P(a=1 | b=0, e=0) = f
    P(a=0 | b=1, e=0) = (1−f)(1−α_b),           P(a=1 | b=1, e=0) = 1 − (1−f)(1−α_b)
    P(a=0 | b=0, e=1) = (1−f)(1−α_e),           P(a=1 | b=0, e=1) = 1 − (1−f)(1−α_e)
    P(a=0 | b=1, e=1) = (1−f)(1−α_b)(1−α_e),    P(a=1 | b=1, e=1) = 1 − (1−f)(1−α_b)(1−α_e)

or, in numbers,

    P(a=0 | b=0, e=0) = 0.999,        P(a=1 | b=0, e=0) = 0.001
    P(a=0 | b=1, e=0) = 0.00999,      P(a=1 | b=1, e=0) = 0.99001
    P(a=0 | b=0, e=1) = 0.98901,      P(a=1 | b=0, e=1) = 0.01099
    P(a=0 | b=1, e=1) = 0.0098901,    P(a=1 | b=1, e=1) = 0.9901099.

We assume the neighbour would never phone if the alarm is not ringing [P(p=1 | a=0) = 0], and that the radio is a trustworthy reporter too [P(r=1 | e=0) = 0]; we won't need to specify the probabilities P(p=1 | a=1) or P(r=1 | e=1) in order to answer the questions above, since the outcomes p=1 and r=1 give us certainty, respectively, that a=1 and e=1.
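The noisy-or table above collapses to a one-line formula, P(a=1 | b, e) = 1 − (1−f)(1−α_b)^b (1−α_e)^e. As a sketch (the function name is mine), the numbers can be reproduced in Python:

```python
# Noisy-or model for the alarm: each cause fails to trigger it independently,
# so P(a=1 | b, e) = 1 - (1-f) * (1-alpha_b)**b * (1-alpha_e)**e.
f, alpha_b, alpha_e = 0.001, 0.99, 0.01

def p_alarm(b, e):
    """Probability the alarm rings, given burglar (b) and earthquake (e) indicators."""
    return 1 - (1 - f) * (1 - alpha_b) ** b * (1 - alpha_e) ** e

for b in (0, 1):
    for e in (0, 1):
        print(f"P(a=1 | b={b}, e={e}) = {p_alarm(b, e):.7f}")
```

Running this reproduces the right-hand column of the table, e.g. P(a=1 | b=1, e=0) = 0.99001.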
We can answer the two questions about the burglar by computing the posterior probabilities of all hypotheses given the available information. Let's start by reminding ourselves that before either p or r is observed, the probability that there is a burglar is P(b=1) = β = 0.001, the probability that an earthquake took place is P(e=1) = ε = 0.001, and these two propositions are independent.

First, when p=1, we know that the alarm is ringing: a=1. The posterior probability of b and e becomes:

    P(b, e | a=1) = P(a=1 | b, e) P(b) P(e) / P(a=1).    (21.4)

The numerator's four possible values are:

    P(a=1 | b=0, e=0) P(b=0) P(e=0) = 0.001 × 0.999 × 0.999 = 0.000998
    P(a=1 | b=1, e=0) P(b=1) P(e=0) = 0.99001 × 0.001 × 0.999 = 0.000989
    P(a=1 | b=0, e=1) P(b=0) P(e=1) = 0.01099 × 0.999 × 0.001 = 0.000010979
    P(a=1 | b=1, e=1) P(b=1) P(e=1) = 0.9901099 × 0.001 × 0.001 = 9.9 × 10⁻⁷.

The normalizing constant is the sum of these four numbers, P(a=1) = 0.002, and the posterior probabilities are:

    P(b=0, e=0 | a=1) = 0.4993
    P(b=1, e=0 | a=1) = 0.4947
    P(b=0, e=1 | a=1) = 0.0055
    P(b=1, e=1 | a=1) = 0.0005.    (21.5)
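The four-hypothesis enumeration is short enough to carry out in a few lines of code. A sketch (the variable names are mine; the parameters are those given in the text):

```python
# Complete enumeration of the posterior P(b, e | a=1) for the burglar-alarm network.
beta, eps = 0.001, 0.001               # priors P(b=1), P(e=1)
f, alpha_b, alpha_e = 0.001, 0.99, 0.01

def p_alarm(b, e):
    """Noisy-or probability that the alarm rings given b and e."""
    return 1 - (1 - f) * (1 - alpha_b) ** b * (1 - alpha_e) ** e

# Unnormalized posterior over the four joint hypotheses: likelihood times priors.
joint = {(b, e): p_alarm(b, e) * (beta if b else 1 - beta) * (eps if e else 1 - eps)
         for b in (0, 1) for e in (0, 1)}
Z = sum(joint.values())                # the normalizer, P(a=1)
posterior = {h: w / Z for h, w in joint.items()}

for (b, e), p in sorted(posterior.items()):
    print(f"P(b={b}, e={e} | a=1) = {p:.4f}")
```

The printed values agree with (21.5) to the rounding shown there.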

To answer the question 'what's the probability a burglar was there?' we marginalize over the earthquake variable e:

    P(b=0 | a=1) = P(b=0, e=0 | a=1) + P(b=0, e=1 | a=1) = 0.505
    P(b=1 | a=1) = P(b=1, e=0 | a=1) + P(b=1, e=1 | a=1) = 0.495.    (21.6)

So there is nearly a 50% chance that there was a burglar present. It is important to note that the variables b and e, which were independent a priori, are now dependent: the posterior distribution (21.5) is not a separable function of b and e. This fact is illustrated most simply by studying the effect of learning that e=1.

When we learn e=1, the posterior probability of b is given by P(b | e=1, a=1) = P(b, e=1 | a=1) / P(e=1 | a=1), i.e., by dividing the bottom two rows of (21.5) by their sum, P(e=1 | a=1) = 0.0060. The posterior probability of b is:

    P(b=0 | e=1, a=1) = 0.92
    P(b=1 | e=1, a=1) = 0.08.    (21.7)

There is thus now an 8% chance that a burglar was in Fred's house. It is in accordance with everyday intuition that the probability that b=1 (a possible cause of the alarm) reduces when Fred learns that an earthquake, an alternative explanation of the alarm, has happened.

Explaining away

This phenomenon, that one of the possible causes (b=1) of some data (the data in this case being a=1) becomes less probable when another of the causes (e=1) becomes more probable, even though those two causes were independent variables a priori, is known as 'explaining away'. Explaining away is an important feature of correct inferences, and one that any artificial intelligence should replicate.
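Marginalization and the explaining-away calculation are equally mechanical once the four posterior values are in hand. A sketch (it recomputes the posterior from the text's parameters rather than copying the rounded values of (21.5)):

```python
# Marginalizing the enumerated posterior, and conditioning on e=1 (explaining away).
beta, eps = 0.001, 0.001
f, alpha_b, alpha_e = 0.001, 0.99, 0.01

def p_alarm(b, e):
    """Noisy-or probability that the alarm rings given b and e."""
    return 1 - (1 - f) * (1 - alpha_b) ** b * (1 - alpha_e) ** e

joint = {(b, e): p_alarm(b, e) * (beta if b else 1 - beta) * (eps if e else 1 - eps)
         for b in (0, 1) for e in (0, 1)}
Z = sum(joint.values())
post = {h: w / Z for h, w in joint.items()}

# Marginalize over e to get P(b=1 | a=1), equation (21.6).
p_burglar = post[(1, 0)] + post[(1, 1)]
# Condition on e=1 by renormalizing the two e=1 hypotheses, equation (21.7).
p_burglar_given_quake = post[(1, 1)] / (post[(0, 1)] + post[(1, 1)])

print(f"P(b=1 | a=1)      = {p_burglar:.3f}")              # about 0.495
print(f"P(b=1 | e=1, a=1) = {p_burglar_given_quake:.3f}")  # about 0.08
```

The drop from roughly 0.495 to roughly 0.08 is the explaining-away effect computed exactly.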
If we believe that the neighbour and the radio service are unreliable or capricious, so that we are not certain that the alarm really is ringing or that an earthquake really has happened, the calculations become more complex, but the explaining-away effect persists: the arrival of the earthquake report r simultaneously makes it more probable that the alarm truly is ringing and less probable that the burglar was present.

In summary, we solved the inference questions about the burglar by enumerating all four hypotheses about the variables (b, e), finding their posterior probabilities, and marginalizing to obtain the required inferences about b.

Exercise 21.2. [2] After Fred receives the phone-call about the burglar alarm, but before he hears the radio report, what, from his point of view, is the probability that there was a small earthquake today?

21.2 Exact inference for continuous hypothesis spaces

Many of the hypothesis spaces we will consider are naturally thought of as continuous. For example, the unknown decay length λ of section 3.1 (p. 48) lives in a continuous one-dimensional space; and the unknown mean and standard deviation (µ, σ) of a Gaussian live in a continuous two-dimensional space. In any practical computer implementation, such continuous spaces will necessarily be discretized, however, and so can, in principle, be enumerated, at a grid of parameter values, for example. In figure 3.2 we plotted the likelihood

function for the decay length as a function of λ by evaluating the likelihood at a finely-spaced series of points.

[Figure 21.2. Enumeration of an entire (discretized) hypothesis space for one Gaussian with parameters µ (horizontal axis) and σ (vertical axis).]

A two-parameter model

Let's look at the Gaussian distribution as an example of a model with a two-dimensional hypothesis space. The one-dimensional Gaussian distribution is parameterized by a mean µ and a standard deviation σ:

    P(x | µ, σ) = (1 / (√(2π) σ)) exp(−(x − µ)² / (2σ²)) ≡ Normal(x; µ, σ²).    (21.8)

Figure 21.2 shows an enumeration of one hundred hypotheses about the mean and standard deviation of a one-dimensional Gaussian distribution. These hypotheses are evenly spaced in a ten by ten square grid covering ten values of µ and ten values of σ. Each hypothesis is represented by a picture showing the probability density that it puts on x. We now examine the inference of µ and σ given data points x_n, n = 1, …, N, assumed to be drawn independently from this density.

Imagine that we acquire data, for example the five points shown in figure 21.3. We can now evaluate the posterior probability of each of the one hundred subhypotheses by evaluating the likelihood of each, that is, the value of P({x_n}_{n=1}^5 | µ, σ). The likelihood values are shown diagrammatically in figure 21.4, using the line thickness to encode the value of the likelihood. Subhypotheses with likelihood smaller than e⁻⁸ times the maximum likelihood have been deleted. Using a finer grid, we can represent the same information by plotting the likelihood as a surface plot or contour plot as a function of µ and σ (figure 21.5).

[Figure 21.3. Five datapoints {x_n}_{n=1}^5. The horizontal coordinate is the value of the datum, x_n; the vertical coordinate has no meaning.]
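A grid enumeration like the one behind figure 21.4 is easy to sketch in code. The five datapoints below are illustrative stand-ins, not the actual data of figure 21.3, and the grid spacing is my choice:

```python
# Grid enumeration of the log likelihood over (mu, sigma) for a 1-D Gaussian.
import math

data = [-0.2, 0.5, 1.0, 1.4, 2.3]   # hypothetical datapoints (sample mean is 1.0)

def log_likelihood(mu, sigma, xs):
    """Log of prod_n Normal(x_n; mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

mus = [0.2 * i for i in range(11)]           # grid over mu in [0, 2]
sigmas = [0.1 + 0.1 * j for j in range(10)]  # grid over sigma in [0.1, 1.0]

grid = {(m, s): log_likelihood(m, s, data) for m in mus for s in sigmas}
best = max(grid, key=grid.get)
print("most likely (mu, sigma) on the grid:", best)
```

The grid maximum sits at the grid point nearest the sample mean and the maximum-likelihood standard deviation, exactly as the thickest lines do in figure 21.4.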
A five-parameter mixture model Eyeballing the data (figure 21.3), you might agree that it seems more plausible that they come not from a single Gaussian but from a mixture of two Gaussians, defined by two means, two standard deviations, and two mixing

[Figure 21.4. Likelihood function, given the data of figure 21.3, represented by line thickness. Subhypotheses having likelihood smaller than e⁻⁸ times the maximum likelihood are not shown.]

[Figure 21.5. The likelihood function for the parameters of a Gaussian distribution: surface plot and contour plot of the log likelihood as a function of µ and σ. The data set of N = 5 points had mean x̄ = 1.0 and S = Σ_n (x_n − x̄)² = 1.0.]

coefficients π₁ and π₂, satisfying π₁ + π₂ = 1, π_i ≥ 0:

    P(x | µ₁, σ₁, π₁, µ₂, σ₂, π₂) = (π₁ / (√(2π) σ₁)) exp(−(x − µ₁)² / (2σ₁²)) + (π₂ / (√(2π) σ₂)) exp(−(x − µ₂)² / (2σ₂²)).

[Figure 21.6. Enumeration of the entire (discretized) hypothesis space for a mixture of two Gaussians. The weights of the mixture components are (π₁, π₂) = (0.6, 0.4) in the top half and (0.8, 0.2) in the bottom half. The means µ₁ and µ₂ vary horizontally, and the standard deviations σ₁ and σ₂ vary vertically.]

Let's enumerate the subhypotheses for this alternative model. The parameter space is five-dimensional, so it becomes challenging to represent it on a single page. Figure 21.6 enumerates 800 subhypotheses with different values of the five parameters µ₁, µ₂, σ₁, σ₂, π₁. The means are varied between five values each in the horizontal directions. The standard deviations take on four values each vertically. And π₁ takes on two values vertically. We can represent the inference about these five parameters in the light of the five datapoints as shown in figure 21.7.

If we wish to compare the one-Gaussian model with the mixture-of-two model, we can find the models' posterior probabilities by evaluating the marginal likelihood or evidence for each model H, P({x} | H). The evidence is given by integrating over the parameters θ; the integration can be implemented numerically by summing over the alternative enumerated values of θ,

    P({x} | H) = Σ_θ P(θ) P({x} | θ, H),    (21.9)

where P(θ) is the prior distribution over the grid of parameter values, which I take to be uniform.

For the mixture of two Gaussians this integral is a five-dimensional integral; if it is to be performed at all accurately, the grid of points will need to be much finer than the grids shown in the figures. If the uncertainty about each of K parameters has been reduced by, say, a factor of ten by observing the data, then brute-force integration requires a grid of at least 10^K points. This

exponential growth of computation with model size is the reason why complete enumeration is rarely a feasible computational strategy.

[Figure 21.7. Inferring a mixture of two Gaussians: likelihood function, given the data of figure 21.3, represented by line thickness. The hypothesis space is identical to that shown in figure 21.6. Subhypotheses having likelihood smaller than e⁻⁸ times the maximum likelihood are not shown, hence the blank regions, which correspond to hypotheses that the data have ruled out.]

Exercise 21.3. [1] Imagine fitting a mixture of ten Gaussians to data in a twenty-dimensional space. Estimate the computational cost of implementing inferences for this model by enumeration of a grid of parameter values.
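The evidence sum (21.9) for the mixture model can be sketched by brute force over an 800-point grid of the same shape as figure 21.6 (five means × four standard deviations per component, two mixing weights). The datapoints and grid values here are illustrative stand-ins:

```python
# Evidence P({x} | H) for a two-Gaussian mixture by grid summation, as in (21.9).
import math

data = [-0.2, 0.5, 1.0, 1.4, 2.3]   # hypothetical datapoints

def gauss(x, mu, sigma):
    """Normal(x; mu, sigma^2) density."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mix_likelihood(xs, mu1, s1, pi1, mu2, s2):
    """Likelihood of iid data under the two-component mixture (pi2 = 1 - pi1)."""
    pi2 = 1 - pi1
    return math.prod(pi1 * gauss(x, mu1, s1) + pi2 * gauss(x, mu2, s2) for x in xs)

mus = [0.5 * i for i in range(5)]       # coarse grids; accurate work needs much finer ones
sigmas = [0.25, 0.5, 0.75, 1.0]
pis = [0.6, 0.8]

grid = [(m1, s1, p1, m2, s2)
        for m1 in mus for s1 in sigmas for p1 in pis
        for m2 in mus for s2 in sigmas]             # 5*4*2*5*4 = 800 subhypotheses
# Uniform prior P(theta) = 1/len(grid), so the evidence is the average likelihood.
evidence = sum(mix_likelihood(data, *theta) for theta in grid) / len(grid)
print(f"P(data | mixture H) ~= {evidence:.3e}")
```

With a grid this coarse the number is only indicative; the point is the shape of the computation, a prior-weighted sum over every enumerated θ, and the fact that its cost is the product of the per-parameter grid sizes.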