Probabilistic Reasoning. (Mostly using Bayesian Networks)

Similar documents
CS 380: ARTIFICIAL INTELLIGENCE UNCERTAINTY. Santiago Ontañón

Bayesian Network. Outline. Bayesian Network. Syntax Semantics Exact inference by enumeration Exact inference by variable elimination

Bayesian networks. Chapter 14, Sections 1 4

Quantifying uncertainty & Bayesian networks

Probabilistic Reasoning Systems

Objectives. Probabilistic Reasoning Systems. Outline. Independence. Conditional independence. Conditional independence II.

Uncertainty. 22c:145 Artificial Intelligence. Problem of Logic Agents. Foundations of Probability. Axioms of Probability

Uncertainty and Bayesian Networks

Bayesian networks. Chapter Chapter

Probabilistic representation and reasoning

Probabilistic representation and reasoning

Review: Bayesian learning and inference

Introduction to Artificial Intelligence Belief networks

Informatics 2D Reasoning and Agents Semester 2,

Bayesian networks. Chapter Chapter

Bayesian networks. Chapter Chapter Outline. Syntax Semantics Parameterized distributions. Chapter

PROBABILISTIC REASONING SYSTEMS

Uncertainty. Introduction to Artificial Intelligence CS 151 Lecture 2 April 1, CS151, Spring 2004

Bayesian networks. Chapter AIMA2e Slides, Stuart Russell and Peter Norvig, Completed by Kazim Fouladi, Fall 2008 Chapter 14.

Probabilistic Reasoning. Kee-Eung Kim KAIST Computer Science

Graphical Models - Part I

Bayesian networks. Soleymani. CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2018

14 PROBABILISTIC REASONING

EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016

Bayesian networks: Modeling

CS Belief networks. Chapter

Bayesian Networks. Semantics of Bayes Nets. Example (Binary valued Variables) CSC384: Intro to Artificial Intelligence Reasoning under Uncertainty-III

Pengju XJTU 2016

Directed Graphical Models

This lecture. Reading. Conditional Independence Bayesian (Belief) Networks: Syntax and semantics. Chapter CS151, Spring 2004

Uncertainty. Chapter 13

COMP5211 Lecture Note on Reasoning under Uncertainty

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

Uncertainty. Chapter 13

Uncertainty and Belief Networks. Introduction to Artificial Intelligence CS 151 Lecture 1 continued Ok, Lecture 2!

Artificial Intelligence

Probabilistic Models

Bayesian Belief Network

Another look at Bayesian. inference

Artificial Intelligence Uncertainty

Bayesian Networks. Motivation

Bayes Nets: Independence

Quantifying Uncertainty & Probabilistic Reasoning. Abdulla AlKhenji Khaled AlEmadi Mohammed AlAnsari

Defining Things in Terms of Joint Probability Distribution. Today s Lecture. Lecture 17: Uncertainty 2. Victor R. Lesser

Probabilistic Reasoning

Bayesian Networks BY: MOHAMAD ALSABBAGH

Pengju

CS 5522: Artificial Intelligence II

Axioms of Probability? Notation. Bayesian Networks. Bayesian Networks. Today we ll introduce Bayesian Networks.

CS 188: Artificial Intelligence Fall 2009

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty

Uncertainty. Outline

Probabilistic Models. Models describe how (a portion of) the world works

Bayesian Networks. Philipp Koehn. 6 April 2017

Introduction to Artificial Intelligence. Unit # 11

Bayesian Reasoning. Adapted from slides by Tim Finin and Marie desjardins.

Quantifying Uncertainty

Announcements. CS 188: Artificial Intelligence Spring Probability recap. Outline. Bayes Nets: Big Picture. Graphical Model Notation

Artificial Intelligence Bayes Nets: Independence

CS 5522: Artificial Intelligence II

Outline } Conditional independence } Bayesian networks: syntax and semantics } Exact inference } Approximate inference AIMA Slides cstuart Russell and

Uncertainty. Logic and Uncertainty. Russell & Norvig. Readings: Chapter 13. One problem with logical-agent approaches: C:145 Artificial

Artificial Intelligence Bayesian Networks

Bayesian Networks. Philipp Koehn. 29 October 2015

Bayesian Approaches Data Mining Selected Technique

Y. Xiang, Inference with Uncertain Knowledge 1

COS402- Artificial Intelligence Fall Lecture 10: Bayesian Networks & Exact Inference

COMP9414: Artificial Intelligence Reasoning Under Uncertainty

CS 188: Artificial Intelligence Spring Announcements

CS 343: Artificial Intelligence

Bayesian Networks. Machine Learning, Fall Slides based on material from the Russell and Norvig AI Book, Ch. 14

Outline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car

Uncertainty. CmpE 540 Principles of Artificial Intelligence Pınar Yolum Uncertainty. Sources of Uncertainty

Bayesian Networks. Material used

CSE 473: Artificial Intelligence Autumn 2011

Bayesian Networks. Vibhav Gogate The University of Texas at Dallas

Bayesian networks (1) Lirong Xia

p(x) p(x Z) = y p(y X, Z) = αp(x Y, Z)p(Y Z)

Ch.6 Uncertain Knowledge. Logic and Uncertainty. Representation. One problem with logical approaches: Department of Computer Science

Outline. Uncertainty. Methods for handling uncertainty. Uncertainty. Making decisions under uncertainty. Probability. Uncertainty

Graphical Models - Part II

Example. Bayesian networks. Outline. Example. Bayesian networks. Example contd. Topology of network encodes conditional independence assertions:

Outline. Bayesian networks. Example. Bayesian networks. Example contd. Example. Syntax. Semantics Parameterized distributions. Chapter 14.

Uncertain Knowledge and Reasoning

CS 188: Artificial Intelligence Fall 2008

Probability. CS 3793/5233 Artificial Intelligence Probability 1

Probabilistic Models Bayesian Networks Markov Random Fields Inference. Graphical Models. Foundations of Data Analysis

Brief Intro. to Bayesian Networks. Extracted and modified from four lectures in Intro to AI, Spring 2008 Paul S. Rosenbloom

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Bayesian Networks. Vibhav Gogate The University of Texas at Dallas

CS 188: Artificial Intelligence Spring Announcements

Uncertainty. Chapter 13. Chapter 13 1

Uncertainty. Outline. Probability Syntax and Semantics Inference Independence and Bayes Rule. AIMA2e Chapter 13

COMP9414/ 9814/ 3411: Artificial Intelligence. 14. Uncertainty. Russell & Norvig, Chapter 13. UNSW c AIMA, 2004, Alan Blair, 2012

Bayes Networks. CS540 Bryan R Gibson University of Wisconsin-Madison. Slides adapted from those used by Prof. Jerry Zhu, CS540-1

PROBABILISTIC REASONING Outline

Learning Bayesian Networks (part 1) Goals for the lecture

Uncertainty. Russell & Norvig Chapter 13.

School of EECS Washington State University. Artificial Intelligence

Transcription:

Probabilistic Reasoning (Mostly using Bayesian Networks)

Introduction: Why probabilistic reasoning? The world is not deterministic. (Usually because information is limited.) Ways of coping with uncertainty include default logic, fuzzy logic, and probability theory. Probability theory is well-studied, welldefined, well-supported. Also, systems based on it seem to work in practice.

Outline A running example Review of probability A naïve approach to probabilistic inference Bayesian networks!

An Example: The Burglar Alarm Domain Imagine you live in a nice house in Los Angeles. While you re at work, your neighbor John calls to say your burglar alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar? Uncertainty clearly occurs here. Let us handle it using probability theory

Review of Probability A random variable is just a variable. Can be boolean (e.g., Earthquake [t, f]), range over a discrete set of values (e.g., Earthquake [none, moderate, severe]), or range over integers or reals (e.g., over the Richter scale). A sample space is a set Ω of possible configurations of the world, where each configuration contains different values for the random variables. (E.g., Earthquake=t, Burglar=f, Alarm=t.) An event is any subset of Ω, and can be described using a proposition. (E.g., Earthquake=t Burglar=f.) A probability model assigns a probability P(ω) to each ω Ω (and so to each possible event.)

Axioms of probability For all propositions A, B 0 P(A) 1 P(true) = 1 and P(false) = 0 P(A B) = P(A) + P(B) - P(A B) They imply that Σ P(ω) = 1.0. (If a distribution doesn t follow them, it is ill-defined.)

The Burglar Alarm domain Here is a well-defined joint probability distribution for part of the example domain: Earthquake Earthquake Burglar Burglar Burglar Burglar Alarm 0.0000019 0.00057942 0.00093812 0.000997002 Alarm 0.0000001 0.00141858 0.00005988 0.996005000 It can be used to answer any question about these three variables.

Inference by enumeration Inferring the probability of simple propositions, no evidence: Earthquake Earthquake Burglar Burglar Burglar Burglar Alarm 0.0000019 0.00057942 0.00093812 0.000997002 Alarm 0.0000001 0.00141858 0.00005988 0.996005000 For any proposition, sum up the entries where it is true. P(earthquake) = 0.002

Prior/Posterior probabilities Probabilities relate propositions to agent's own state of knowledge, and change with new evidence: P(alarm) = 0.002516442 P(alarm earthquake) = 0.29 So, these are not assertions about the world. The probability of a proposition before the arrival of any evidence is a prior or unconditional probability. The probability of a proposition given some evidence is a posterior or conditional probability. But what does given some evidence mean, mathematically?

Conditional Probability Definition of conditional probability: P(a b) = P(a b) / P(b) if P(b) > 0 Product rule gives an alternative formulation: P(a b) = P(a b) P(b) = P(b a) P(a) A general version holds for whole distributions, e.g., P(Earthquake, Alarm) = P(Alarm Earthquake) P(Earthquake) (View as a set of 4 2 equations, not matrix multiplication.) Chain rule is derived by successive application of product rule: P(X 1,,X n ) = P(X 1,...,X n-1 ) P(X n X 1,...,X n-1 ) = P(X 1,...,X n-2 ) P(X n-1 X 1,...,X n-2 ) P(X n X 1,...,X n-1 ) = = π i= 1^n P(X i X 1,,X i-1 )

Inference by enumeration, cont. For any proposition, sum the entries where it is true. Earthquake Earthquake Burglar Burglar Burglar Burglar Alarm 0.0000019 0.00057942 0.00093812 0.000997002 Alarm 0.0000001 0.00141858 0.00005988 0.996005000 P(alarm earthquake) = P(alarm,earthquake)/P(earthquake) = 0.00058132/0.002 = 0.29

Inference, generalized Typically, we are interested in the posterior joint distribution of the query variables Y given specific values e for the evidence variables E Let the hidden variables be H = X - Y - E Then the required summation of joint entries is done by summing out the hidden variables: P(Y E = e) = P(Y,E = e) / P(E = e) ΣP(Y,E= e, H = h) (The terms in the summation are joint entries because Y, E and H together exhaust the set of random variables)

Problems Obvious problems with doing inference by summing over the joint: 1. Space complexity to store the joint distribution is O(d n ), where d is the largest arity. 2. Worst-case time complexity is O(d n ). 3. How to find the numbers for O(d n ) entries? (Learning is easier when parameters are few.) We want to make use of regularities in the domain to express that large table more compactly

Independence A and B are independent iff P(A B) = P(A) or P(B A) = P(B) or P(A, B) = P(A) P(B) E.g., P(Earthquake, Burglar) = P(Earthquake)P(Burglar) So, instead of a table with three entries, we could have two tables with just one entry each. Wouldn t it be nice if we could factor the whole joint like this? But absolute independence is rare

Independence, cont Conditional independence occurs when two variables are independent given another variable. E.g. If John and Mary do not communicate when deciding to call you about your alarm, then MaryCalls and JohnCalls are conditionally independent given Alarm. P(MaryCalls,JohnCalls Alarm) = P(MaryCalls Alarm) P(JohnCalls Alarm) Taking all the independencies into account, we can rewrite the full joint as a product of smaller terms: P(MaryCalls, JohnCalls, Alarm, Earthquake, Burglar) = P(MaryCalls Alarm) P(JohnCalls Alarm) P(Alarm Earthquake, Burglar) P(Earthquake)P(Burglar) The full joint would have needed 31 entries. How many does this factorization need?

Bayesian networks Independence lets us express a joint compactly as a product of conditional distributions. This product can be represented using a directed, acyclic graph where: Graph nodes represent probabilistic variables Graph edges represent "direct influences" Each node X i has an associated conditional distribution: P (X i Parents (X i )) (In the simplest case, these will be represented as conditional probability tables, or CPTs.) Such graphs are known as Bayesian networks or belief networks (BNs.) They are a type of graphical model.

An example BN

An Example BN, cont. Topology of network encodes conditional independence assertions: JohnCalls and MaryCalls are conditionally independent given Alarm. Burglar and Earthquake are independent a priori, but dependent given Alarm. (This is known as 'explaining away.')

Local semantics General local semantics: Each node is conditionally independent of its nondescendants given its parents. Each node is conditionally independent of the rest of the network given its Markov blanket: its parents, its children, and its children s parents. (These concepts play a large role during inference.)

Global Semantics The full joint distribution is defined as the product of the local conditional distributions: P (X 1,,X n ) = π n i = 1 P (X i Parents(X i )) As we said, P(j m a b e) = P (j a) P (m a) P (a b, e) P ( b) P ( e)

Interpreting Bayesian networks We constructed our burglar alarm network by thinking about the inter-variable dependencies. The links seem to reflect causal relationships. One of the advantages of Bayesian networks is that they can make sense in this way. This makes them easier for domain experts to construct, or interpret. (Not all types of graphical models share this appealing feature!) Nevertheless, here is an all-purpose algorithm for constructing a Bayesian network by reasoning about dependencies.

Constructing Bayesian networks 1. Choose an ordering of variables X 1,,X n 2. For i = 1 to n add X i to the network select parents from X 1,,X i-1 such that P (X i Parents(X i )) = P (X i X 1,... X i-1 ) This choice of parents guarantees: n P (X 1,,X n ) = π i =1 P (X i X 1,, X i-1 )(chain rule) n = π i =1 P (X i Parents(X i ))(by construction)

Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)?

Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)?No P(A J, M) = P(A J)? P(A J, M) = P(A)?

Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)?No P(A J, M) = P(A J)? P(A J, M) = P(A)? No P(B A, J, M) = P(B A)? P(B A, J, M) = P(B)?

Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)?No P(A J, M) = P(A J)? P(A J, M) = P(A)? No P(B A, J, M) = P(B A)? Yes P(B A, J, M) = P(B)? No P(E B, A,J, M) = P(E A)? P(E B, A, J, M) = P(E A, B)?

Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)?No P(A J, M) = P(A J)? P(A J, M) = P(A)? No P(B A, J, M) = P(B A)? Yes P(B A, J, M) = P(B)? No P(E B, A,J, M) = P(E A)? No P(E B, A, J, M) = P(E A, B)? Yes

Example contd. Deciding conditional independence in noncausal directionsis hard (Causal models and conditional independence seem hardwired for humans!) Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed

Compact conditional distributions CPTs grow exponentially with the number of parents CPTs become infinite with continuous-valued parents or children. Solution: use not tables, but canonical distributions, such as: Deterministic functions Probabilistic density functions, like Gaussians Functions like Noisy-OR. (For discrete parents Y 1 Y n with independent failure probabilities q i, P(X Y 1 Y j, Y j+1 Y n ) = 1 - Π i=1 q i.) Could also encode context-specific independence using rules or trees.

BNs in Action: Driving a Car An example of a Dynamic Bayesian Network (DBN.)

Relational BNs In logic, first-order representations: Can encode things more compactly than propositional ones. (E.g., rules of chess.) Are applicable in new situations, where the set of objects is different. We want to do the same thing with probabilistic representations A general approach: defining network fragments that quantify over objects and putting them together once the set of objects is known. In the burglar alarm example, we might quantify over people, their alarms, and their neighbours. One specific approach: Relational Probabilistic Models (RPMs.)

BNs in action: Relational BNs A relational system for reasoning about papers and citations (as done by Citeseer.) The underlying set of papers is uncertain!

Summary Probability is a rigorous formalism for uncertain knowledge Joint probability distributions allow us to answer all queries by summing over possible worlds, but are impractical in nontrivial domains. Bayesian networks provide a compact representation of joint distributions and a natural representation for (causally induced) conditional independence

Generative Models The models we have looked at today are all examples of generative models. They define a full distribution over everything in the domain, so possible worlds can be generated from them. Generative models can answer any query. An alternative is discriminative models. There, the query and evidence are decided upon in advance and only the conditional distribution P(Q E) is learnt. Which type is better? Current opinion is divided.

A Leading Question Bayesian networks help us represent things compactly but how can that translate into better inference algorithms? How would you do inference using a Bayes net?