Probabilistic Reasoning. Kee-Eung Kim KAIST Computer Science



Outline #1 Acting under uncertainty Probabilities Inference with Probabilities Independence and Bayes Rule Bayesian networks Inference in Bayesian networks

Uncertainty
Let action A_t = leave for airport t minutes before the flight. Will A_t get me there on time?
Problems: partial observability (road state, other drivers' plans, etc.), noisy sensors (often-wrong traffic reports), uncertainty in action outcomes (flat tire, etc.), immense complexity of modeling and predicting traffic.
How can a logical agent faced with such a problem handle uncertain knowledge? The agent's knowledge can at best provide only a degree of belief in the relevant sentences (probability theory is your friend).

Probability
Probability provides a way of summarizing the uncertainty that comes from:
Laziness: failure to enumerate all exceptions, qualifications, etc.
Ignorance: lack of relevant facts, initial conditions, etc.
Degree of belief vs. degree of truth: a probability of 0.8 does not mean "80% true" (that is fuzzy logic/ambiguity); rather, it is an 80% degree of belief (probability/chance).
Making decisions under uncertainty: which action to choose? It depends on my preferences. Utility theory is used to represent and infer preferences. Decision theory = utility theory + probability theory.

Basic Probability
Ω: the sample space, e.g., the 6 possible outcomes of a roll of a die. ω ∈ Ω is a sample point/possible world/atomic event.
A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω s.t. 0 ≤ P(ω) ≤ 1 and Σ_ω P(ω) = 1.
An event A ⊆ Ω: P(A) = Σ_{ω ∈ A} P(ω)
A random variable is a function from sample points to some range. P induces a probability distribution for any random variable X: P(X = x_i) = Σ_{ω : X(ω) = x_i} P(ω)
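These definitions can be sketched in Python for a fair six-sided die (the uniform assignment is an assumption; the slide only fixes the sample space):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                 # sample space Omega
P = {w: Fraction(1, 6) for w in omega}     # an assignment P(w) for every w

assert all(0 <= p <= 1 for p in P.values())
assert sum(P.values()) == 1                # requirements for a probability model

def prob(event):
    """P(A) = sum of P(w) over the sample points w in the event A."""
    return sum(P[w] for w in event)

even = {w for w in omega if w % 2 == 0}    # an event A, a subset of Omega
print(prob(even))                          # 1/2

def dist(X):
    """Induced distribution: P(X = x) = sum over {w : X(w) = x}."""
    d = {}
    for w in omega:
        d[X(w)] = d.get(X(w), Fraction(0)) + P[w]
    return d

X = lambda w: w % 3                        # a random variable Omega -> {0, 1, 2}
print(dist(X)[0])                          # 1/3
```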

Prior Probability
Prior or unconditional probabilities of propositions correspond to belief prior to the arrival of any (new) evidence, e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72.
A probability distribution gives values for all possible assignments: P(Weather) = [0.72, 0.1, 0.08, 0.1].
The joint probability distribution for a set of random variables, P(Weather, Cavity), is a 4 × 2 table of values:

                 Weather = sunny   rain   cloudy   snow
Cavity = true              0.144   0.02    0.016   0.02
Cavity = false             0.576   0.08    0.064   0.08

Every question about a domain can be answered by the joint distribution, because every event is a sum of sample points.
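A quick sketch of "every question can be answered by the joint": the slide's 4 × 2 table as a Python dict, with the prior P(Weather) recovered by marginalizing Cavity out (helper names are mine):

```python
# The slide's joint distribution P(Weather, Cavity), keyed by value pair.
joint = {
    ('sunny', True): 0.144, ('rain', True): 0.02,
    ('cloudy', True): 0.016, ('snow', True): 0.02,
    ('sunny', False): 0.576, ('rain', False): 0.08,
    ('cloudy', False): 0.064, ('snow', False): 0.08,
}

def prob(pred):
    """Any event is a set of atomic events, so its probability is a sum."""
    return sum(p for (w, c), p in joint.items() if pred(w, c))

# Marginalizing recovers the prior P(Weather) = [0.72, 0.1, 0.08, 0.1].
for w in ('sunny', 'rain', 'cloudy', 'snow'):
    print(w, round(prob(lambda w2, c, w=w: w2 == w), 2))
```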

Conditional Probability
Conditional or posterior probabilities, e.g., P(cavity | toothache) = 0.8, i.e., given that toothache is all I know; NOT "if toothache then 80% chance of cavity".
Implementation: P(Cavity | Toothache) = a 2-element vector of 2-element vectors.
If we know more: P(cavity | toothache, cavity) = 1.
New evidence may be irrelevant: P(cavity | toothache, rain) = P(cavity | toothache) = 0.8. Domain knowledge is necessary for this kind of inference.
Useful rules:
Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
Chain rule: P(X_1, …, X_n) = P(X_1, …, X_{n−1}) P(X_n | X_1, …, X_{n−1}) = ∏_i P(X_i | X_1, …, X_{i−1})

Inference by Enumeration
Given the joint distribution:

             toothache           ¬toothache
             catch    ¬catch     catch    ¬catch
cavity       0.108    0.012      0.072    0.008
¬cavity      0.016    0.064      0.144    0.576

For any proposition, sum the atomic events where it is true:
P(cavity ∨ toothache) = ?
P(¬cavity | toothache) = ?
P(Cavity | toothache) = α P(Cavity, toothache) = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables.

Inference by Enumeration
Let X be all the random variables. Typically we want the posterior joint distribution of the query variables Y, given specific values e for the evidence variables E. Let the hidden variables be H = X − Y − E. Then:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
Obvious problems:
Worst-case time complexity O(d^n), where d is the largest arity
Space complexity O(d^n) to store the joint distribution
How to find the numbers for the O(d^n) entries?

Independence
A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B).
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather): 32 entries reduced to 12; for n independent biased coins, 2^n entries reduce to n.
Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Conditional Independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries.
However, if I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: P(catch | toothache, cavity) = P(catch | cavity). The same independence holds if I don't have a cavity: P(catch | toothache, ¬cavity) = P(catch | ¬cavity).
The probe's Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity).
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Writing out the full joint with the chain rule:
P(Toothache, Cavity, Catch) = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity) = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity) ← 2 + 2 + 1 = 5 entries
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
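The claimed independence can be checked numerically against the joint table from the enumeration slide (helper names are mine):

```python
# Checking Catch conditionally independent of Toothache given Cavity,
# using the same joint distribution as the enumeration slide.
vals = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}  # keys are (cavity, toothache, catch)

def prob(pred):
    return sum(p for w, p in vals.items() if pred(*w))

def cond(pred_a, pred_b):
    """P(A | B) = P(A, B) / P(B)."""
    return prob(lambda *w: pred_a(*w) and pred_b(*w)) / prob(pred_b)

# P(catch | toothache, cavity) equals P(catch | cavity), as claimed:
lhs = cond(lambda cav, t, c: c, lambda cav, t, c: t and cav)
rhs = cond(lambda cav, t, c: c, lambda cav, t, c: cav)
print(round(lhs, 3), round(rhs, 3))   # 0.9 0.9
```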

Bayes Rule
From the product rule P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a), we get Bayes' rule:
P(b | a) = P(a | b) P(b) / P(a)
or in distribution form:
P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Useful for obtaining a diagnostic probability from a causal probability: P(cause | effect) = P(effect | cause) P(cause) / P(effect).
Extreme use of conditional independence: the naïve Bayes model.

Bayesian Networks (= Graphical Models)
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
Syntax:
A set of nodes, one per random variable
A directed, acyclic graph (a link means roughly "directly influences")
A conditional distribution for each node given its parents, represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values

Example: The Alarm Network
I have a burglar alarm installed, but it also responds on occasion to earthquakes. My neighbors John and Mary promise to call me at work when they hear the alarm. John calls to say the alarm is ringing, but Mary doesn't. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Knowledge to be encoded into the topology:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call

Semantics of Bayesian Networks
Global semantics defines the full joint distribution as the product of the local conditional distributions: P(x_1, …, x_n) = ∏_i P(x_i | parents(X_i)).
E.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
Local semantics: each node is conditionally independent of its non-descendants given its parents.
Mental exercise: derive the global semantics from the local semantics and vice versa. Global semantics and local semantics are equivalent.
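A minimal sketch of the global semantics for this example. The CPT numbers are the standard textbook (AIMA) values for the alarm network; they are assumptions here, since the transcribed slide does not list them:

```python
# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) as a product of local conditional probabilities.
# CPT values are the standard AIMA alarm-network numbers (assumed).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)   # ≈ 0.000628
```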

Constructing Bayesian Networks
Choose an ordering of the random variables, e.g., MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
P(JohnCalls | MaryCalls) = P(JohnCalls)? No
P(Alarm | JohnCalls, MaryCalls) = P(Alarm | JohnCalls)? No
P(Alarm | JohnCalls, MaryCalls) = P(Alarm)? No
P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm)? Yes
P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary)? No
P(Earthquake | Burglary, Alarm, JohnCalls, MaryCalls) = P(Earthquake | Alarm)? No (!!)
P(Earthquake | Burglary, Alarm, JohnCalls, MaryCalls) = P(Earthquake | Burglary, Alarm)? Yes

Constructing Bayesian Networks
1. Choose an ordering of variables X_1, …, X_n
2. For i = 1 to n:
Add X_i to the network
Select parents from X_1, …, X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, …, X_{i−1})
This choice of parents guarantees the global semantics:
P(X_1, …, X_n) = ∏_i P(X_i | X_1, …, X_{i−1}) (by the chain rule) = ∏_i P(X_i | Parents(X_i)) (by construction)

Inference Example: Car Diagnosis
Initial evidence: the car won't start.
Testable (observed) variables are shown in green; "broken, so fix it" (hypothesis) variables in orange; hidden variables in gray.

Inference by Enumeration in BN
Simple query on the alarm network: given that John called and Mary called, what is the probability of a burglary?
P(B | j, m) = P(B, j, m) / P(j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)
Rewrite the full joint probability using CPT entries:
P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a) = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
Recursive depth-first enumeration: O(n) space, O(d^n) time
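This query can be sketched as a short enumeration, assuming the standard textbook (AIMA) CPT values for the alarm network, which the transcribed slide does not list:

```python
from itertools import product

# Enumerate over the hidden variables e and a, then normalize (alpha).
# CPT values are the standard AIMA alarm-network numbers (assumed).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

def pr(p, v):
    """Probability that a Boolean variable with P(true) = p takes value v."""
    return p if v else 1 - p

unnorm = {}
for b in (True, False):
    unnorm[b] = pr(P_b, b) * sum(
        pr(P_e, e) * pr(P_a[(b, e)], a) * P_j[a] * P_m[a]
        for e, a in product((True, False), repeat=2))

alpha = 1 / sum(unnorm.values())
print(round(alpha * unnorm[True], 3))   # P(b | j, m) ≈ 0.284
```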

Evaluation Tree
Evaluation tree of α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a).
Enumeration is inefficient because of repeated computation: it computes P(j | a) P(m | a) for each value of e.

Inference by Variable Elimination
Variable elimination: carry out the summations right-to-left, storing intermediate results (factors) to avoid recomputation.
[Figure: the expression α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a), annotated with factor shapes: the factors for P(j | a) and P(m | a) are 2-element vectors, the factor for P(a | B, e) is a 2 × 2 × 2 matrix, summing out a yields a 2 × 2 matrix, and summing out e yields a 2-element vector.]

Basic Operations for Variable Elimination
Summing out a variable from a product of factors: move any constant factors outside the summation, then add up the sub-matrices in the pointwise product of the remaining factors. Assuming f_1, …, f_i do not depend on X:
Σ_x f_1 × ⋯ × f_k = f_1 × ⋯ × f_i × (Σ_x f_{i+1} × ⋯ × f_k)
The pointwise product of factors f_1 and f_2 is a factor over the union of their variables, e.g., f(A, B, C) = f_1(A, B) × f_2(B, C), with f(a, b, c) = f_1(a, b) f_2(b, c).
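A minimal sketch of these two operations on Boolean factors (the dict representation and function names are mine, not from the slide):

```python
from itertools import product

def pointwise(f1, vars1, f2, vars2):
    """Pointwise product: f(a, b, c) = f1(a, b) * f2(b, c)."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for assign in product((True, False), repeat=len(out_vars)):
        env = dict(zip(out_vars, assign))
        k1 = tuple(env[v] for v in vars1)
        k2 = tuple(env[v] for v in vars2)
        out[assign] = f1[k1] * f2[k2]
    return out, out_vars

def sum_out(f, vars_, x):
    """Sum variable x out of a factor by adding matching sub-tables."""
    keep = [v for v in vars_ if v != x]
    out = {}
    for assign, p in f.items():
        env = dict(zip(vars_, assign))
        key = tuple(env[v] for v in keep)
        out[key] = out.get(key, 0.0) + p
    return out, keep

# Example: f1(A, B) * f2(B, C), then sum out B.
f1 = {(True, True): 0.3, (True, False): 0.7,
      (False, True): 0.9, (False, False): 0.1}
f2 = {(True, True): 0.2, (True, False): 0.8,
      (False, True): 0.6, (False, False): 0.4}
f3, v3 = pointwise(f1, ['A', 'B'], f2, ['B', 'C'])
f4, v4 = sum_out(f3, v3, 'B')
print(v4, round(f4[(True, True)], 3))   # ['A', 'C'] 0.48
```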

Irrelevant Variables
Consider the query P(JohnCalls | Burglary = true):
P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)
The sum over m is identically 1, so M is irrelevant to the query.
Theorem: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E). Here X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.
(Remember the backward chaining algorithm?)

Complexity of Exact Inference
Singly connected networks (or polytrees): any two nodes are connected by at most one path; the time and space cost of variable elimination is O(d^k n).
Multiply connected networks: we can reduce 3-SAT to inference ⇒ NP-hard. To be exact, the problem is as hard as counting the satisfying assignments of a 3-SAT formula ⇒ #P-hard.

Summary
Uncertainty arises from laziness and ignorance; it is inherent in complex, dynamic, or inaccessible worlds.
Probabilities express the agent's inability to reach a definite decision regarding the truth of a sentence; probabilities summarize the agent's beliefs.
Prior probabilities, conditional probabilities, full joint probability distributions.
Bayesian networks represent conditional independence relationships in the domain in a concise way: the topology of the network encodes conditional independence.
Exact inference by variable elimination: polytime on polytrees, NP-hard on general graphs, and very sensitive to topology.

Bayesian Learning

Outline #2 Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete data Linear regression ML parameter learning with hidden attribute data (EM algorithm)

Full Bayesian Learning
View learning as Bayesian updating of a probability distribution over the hypothesis space. Given H = {h_1, h_2, …} and prior distribution P(H), the j-th observation d_j is the outcome of the random variable D_j, and the training data is d = d_1, …, d_N.
Given the data so far, each hypothesis has a posterior probability P(h_i | d) = α P(d | h_i) P(h_i), where P(d | h_i) is called the likelihood.
Predictions use a likelihood-weighted average over the hypotheses:
P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)
No need to pick one best-guess hypothesis!

Example
Suppose there are five kinds of bags of candies:
10% are h_1: 100% cherry candies
20% are h_2: 75% cherry candies + 25% lime candies
40% are h_3: 50% cherry candies + 50% lime candies
20% are h_4: 25% cherry candies + 75% lime candies
10% are h_5: 100% lime candies
Then we observe candies drawn from some bag. What kind of bag is it? What flavour will the next candy be?
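Full Bayesian updating for this example can be sketched directly; the priors and per-bag cherry fractions are from the slide, while the observation sequence (10 limes in a row) is an illustrative assumption:

```python
# Bayesian updating for the candy-bag example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
cherry = [1.0, 0.75, 0.5, 0.25, 0.0]     # P(cherry | hi)

post = priors[:]
for _ in range(10):                      # observe 10 lime candies (assumed)
    post = [p * (1 - c) for p, c in zip(post, cherry)]   # P(lime|hi) P(hi|d)
    z = sum(post)
    post = [p / z for p in post]         # normalize (the alpha step)

# Prediction averages over all hypotheses; no single best guess is picked.
p_next_lime = sum(p * (1 - c) for p, c in zip(post, cherry))
print([round(p, 3) for p in post])       # posterior mass shifts onto h5
print(round(p_next_lime, 3))             # ≈ 0.973
```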

Posterior/Prediction Probabilities of Hypotheses
Posterior probability: P(h_i | d) = α P(d | h_i) P(h_i)
Prediction probability: P(X | d) = Σ_i P(X | h_i) P(h_i | d)

MAP Approximation
Summing over the hypothesis space is often intractable, e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes. Instead, make predictions based on a single most probable hypothesis.
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d), i.e., maximize P(d | h_i) P(h_i), or log P(d | h_i) + log P(h_i).
The log terms can be viewed as (the negative of) the bits to encode the data given the hypothesis plus the bits to encode the hypothesis; this is the basic idea of minimum description length (MDL) learning.
For deterministic hypotheses, P(d | h_i) is 1 if consistent, 0 otherwise ⇒ MAP = simplest consistent hypothesis.

ML Approximation
For large data sets, the prior becomes irrelevant: the bits to encode the data given the hypothesis dominate the bits to encode the hypothesis.
Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i), i.e., simply get the best fit to the data. This is identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity).
ML is the standard (non-Bayesian) statistical learning method.

ML Parameter Learning in Bayes Nets
A bag from a new manufacturer; what fraction of the candies are cherry? Model this as a Bayesian network with one node, Flavor. The task is to learn P(Flavor = cherry) = θ; any θ ∈ [0, 1] is possible, a continuum of hypotheses h_θ, where θ is the parameter of this simple (binomial) family of models.
Suppose we unwrap N candies: c cherries and ℓ = N − c limes. These are i.i.d. (independent, identically distributed) observations, so
P(d | h_θ) = ∏_{j=1}^{N} P(d_j | h_θ) = θ^c (1 − θ)^ℓ
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(d | h_θ) = log P(d | h_θ) = c log θ + ℓ log(1 − θ)
dL/dθ = c/θ − ℓ/(1 − θ) = 0 ⇒ θ = c/(c + ℓ) = c/N
However, given a small dataset, ML learning can assign zero probability to unobserved events.
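The closed-form result θ = c/N can be checked with a tiny grid search over the log-likelihood (the counts below are hypothetical, for illustration only):

```python
import math

# ML estimate for the one-node candy model: theta = c / N.
c, l = 3, 7                      # hypothetical counts: cherries, limes
N = c + l

def log_lik(theta):
    """L = c log(theta) + l log(1 - theta)."""
    return c * math.log(theta) + l * math.log(1 - theta)

# Grid search over (0, 1) confirms the closed-form maximizer c / N.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_lik)
print(best, c / N)   # 0.3 0.3
```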

ML Learning with Multiple Parameters
Now the red/green wrapper depends probabilistically on the flavor, with parameters θ = P(F = cherry), θ_1 = P(W = red | F = cherry), and θ_2 = P(W = red | F = lime).
Likelihood for, e.g., a cherry candy in a green wrapper: P(F = cherry, W = green | h_{θ,θ1,θ2}) = θ (1 − θ_1)
With N candies, r_c red-wrapped cherries, g_c green-wrapped cherries, r_ℓ red-wrapped limes, and g_ℓ green-wrapped limes:
P(d | h_{θ,θ1,θ2}) = θ^c (1 − θ)^ℓ · θ_1^{r_c} (1 − θ_1)^{g_c} · θ_2^{r_ℓ} (1 − θ_2)^{g_ℓ}

ML Learning with Multiple Parameters
Log likelihood:
L = c log θ + ℓ log(1 − θ) + r_c log θ_1 + g_c log(1 − θ_1) + r_ℓ log θ_2 + g_ℓ log(1 − θ_2)
The derivatives of L each contain only the relevant parameter:
∂L/∂θ = c/θ − ℓ/(1 − θ) = 0 ⇒ θ = c/(c + ℓ)
∂L/∂θ_1 = r_c/θ_1 − g_c/(1 − θ_1) = 0 ⇒ θ_1 = r_c/(r_c + g_c)
∂L/∂θ_2 = r_ℓ/θ_2 − g_ℓ/(1 − θ_2) = 0 ⇒ θ_2 = r_ℓ/(r_ℓ + g_ℓ)
With complete data, the parameters can be learned separately.

Example: Linear Gaussian Model
y = θ_1 x + θ_2 plus Gaussian noise:
P(y | x) = (1/√(2πσ²)) exp(−(y − (θ_1 x + θ_2))² / (2σ²))
Maximizing P(y | x) w.r.t. the parameters θ_1 and θ_2 is equivalent to minimizing Σ_j (y_j − (θ_1 x_j + θ_2))².
That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
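A sketch of this equivalence: a least-squares fit on synthetic data with Gaussian noise (the true parameters and the noise level are illustrative assumptions):

```python
import random

# Fit y = theta1 * x + theta2 by minimizing squared error, which is the
# ML solution under fixed-variance Gaussian noise. Data is synthetic.
random.seed(0)
true_t1, true_t2 = 2.0, -1.0
xs = [i / 10 for i in range(100)]
ys = [true_t1 * x + true_t2 + random.gauss(0, 0.1) for x in xs]

# Closed-form least-squares solution for slope and intercept.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
t1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
t2 = my - t1 * mx
print(round(t1, 2), round(t2, 2))   # close to 2.0 and -1.0
```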

Learning Bayes Nets with Hidden Variables
Candies come from either Bag 1 or Bag 2; the bag is not observed. In addition to Flavor and Wrapper, some candies have a Hole.
Parameters:
θ: probability of a candy coming from Bag 1
θ_F1, θ_F2: conditional probabilities of the flavor being cherry, given Bag 1 or Bag 2
θ_W1, θ_W2: conditional probabilities of the wrapper being red
θ_H1, θ_H2: conditional probabilities of the candy having a hole

Learning Bayes Nets with Hidden Variables
EM (expectation-maximization) algorithm: an iterative method for maximizing the likelihood with hidden variables. Basic idea:
Pretend that we know the parameters of the network and compute the posterior probabilities (E-step)
Then, using those probabilities, maximize the likelihood by adjusting the parameters (M-step)
Repeat the two steps until convergence.
Data: 1000 instances generated from a true model with parameters θ = 0.5, θ_F1 = θ_W1 = θ_H1 = 0.8, θ_F2 = θ_W2 = θ_H2 = 0.3. The counts from the data:

             W = red           W = green
             H = 1   H = 0     H = 1   H = 0
F = cherry     273      93       104      90
F = lime        79     100        94     167

MAY SKIP: Learning Bayes Nets with Hidden Variables
Start with (arbitrary) initial parameters θ^(0), θ_F1^(0), ….
For θ at iteration 1:
Calculate the expected count (E-step): N̂(Bag = 1) = Σ_j P(Bag = 1 | flavor_j, wrapper_j, holes_j), with each term computed by Bayes' rule from the current parameters
Find the maximum-likelihood parameter (M-step): θ^(1) = N̂(Bag = 1) / N [Exercise]

MAY SKIP: Learning Bayes Nets with Hidden Variables
For θ_F1 at iteration 1:
Calculate the expected count (E-step): N̂(Bag = 1, F = cherry) = Σ_{j : flavor_j = cherry} P(Bag = 1 | flavor_j = cherry, wrapper_j, holes_j)
Find the maximum-likelihood parameter (M-step): θ_F1^(1) = N̂(Bag = 1, F = cherry) / N̂(Bag = 1) [Exercise]
Compute the EM steps for the rest of the parameters, and repeat.
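The full EM loop for this model can be sketched as follows, using the slide's count table; the initial parameter values are arbitrary guesses, as the slides suggest, and the variable names are mine:

```python
# EM for the two-bag candy model with hidden variable Bag.
# Observed counts are from the slide's table; True = cherry / red / has-hole.
data = [
    (True,  True,  True,  273), (True,  True,  False, 93),
    (True,  False, True,  104), (True,  False, False, 90),
    (False, True,  True,  79),  (False, True,  False, 100),
    (False, False, True,  94),  (False, False, False, 167),
]
N = sum(n for *_, n in data)  # 1000 instances

# Arbitrary initial parameters; index 0 = Bag 1, index 1 = Bag 2.
theta, tF, tW, tH = 0.6, [0.6, 0.4], [0.6, 0.4], [0.6, 0.4]

def bern(p, v):
    return p if v else 1 - p

for step in range(10):
    # E-step: expected counts using posteriors P(Bag = 1 | f, w, h).
    exp = {'n': [0.0, 0.0], 'f': [0.0, 0.0], 'w': [0.0, 0.0], 'h': [0.0, 0.0]}
    for f, w, h, n in data:
        like = [bern(theta, i == 0) * bern(tF[i], f)
                * bern(tW[i], w) * bern(tH[i], h) for i in (0, 1)]
        z = like[0] + like[1]
        for i in (0, 1):
            wt = n * like[i] / z          # expected count for bag i
            exp['n'][i] += wt
            if f: exp['f'][i] += wt
            if w: exp['w'][i] += wt
            if h: exp['h'][i] += wt
    # M-step: maximum-likelihood parameters from the expected counts.
    theta = exp['n'][0] / N
    tF = [exp['f'][i] / exp['n'][i] for i in (0, 1)]
    tW = [exp['w'][i] / exp['n'][i] for i in (0, 1)]
    tH = [exp['h'][i] / exp['n'][i] for i in (0, 1)]

print(round(theta, 3), [round(v, 3) for v in tF])
```

Each iteration is one E-step plus one M-step; with this initialization, Bag 1 settles on the cherry/red/hole-rich cluster of the data.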

Markov Chain Monte Carlo (MCMC): A Universal Inference Algorithm