Brief Intro. to Bayesian Networks. Extracted and modified from four lectures in Intro to AI, Spring 2008 Paul S. Rosenbloom

Factor Graph Review. Tame combinatorics in many calculations: decoding codes (the origin of factor graphs), Bayesian networks and Markov random fields, HMMs (and signal processing more generally), production match? Decompose functions into a product of nearly independent factors (or local functions), e.g., f(u,w,x,y,z) = f1(u,w,x) f2(x,y,z) f3(z). Many standard state-of-the-art algorithms can be derived from the factor-graph sum-product algorithm: belief propagation in Bayesian networks, the forward-backward algorithm in HMMs, the Kalman filter, the Viterbi algorithm, the FFT, turbo decoding, and the Rete algorithm (or equivalent)?

Why Should We Care? Idea/Hope/Fantasy: factor graphs will provide a uniform layer for implementing and exploring cognitive architectures, while also pointing to novel architectures that are more uniform, integrated and functional. Understand the architecture and its modules/processes as functions and implement them as (nested?) factor graphs. Potential uniform representation and algorithm for symbols, signals and probabilities; potentially combine sequential and static reasoning. Novel(?) and interesting in its own right for pure symbol processing. Have implemented a simplified rule system (in Lisp): it appears to eliminate exponential match problems in rules by reducing what computation is performed (returning compatible bindings of variables rather than instantiations); shows promise for tightly integrating forward (procedural), backward (declarative) and bidirectional (constraint) processing; indicates possibilities for novel hybrid modes; and shows potential to bring additional uniformity, flexibility and extensibility to match algorithms.

Today's Focus. Background on Bayesian networks as a step towards understanding factor graphs (and reading the introductory articles on them).

Basic Concepts of Probability (Kolmogorov's Axioms)
1. P(x) is a real number between 0 and 1 (inclusive) representing the likelihood that x is true: P(x)=1 means certainty, P(x)=0 means impossibility, and 0 < P(x) < 1 means uncertainty (at varying levels). For a fair coin, P(heads)=P(tails)=.5; for a fair die, P(1)=...=P(6)≈.166.
2. P(true)=1 and P(false)=0.
3. P(A ∨ B) = P(A) + P(B) - P(A ∧ B).

More on Probability.
P(x ∧ y) = P(x)P(y) if x and y are independent: P(heads1 ∧ heads2) = P(heads1)P(heads2) = .5 × .5 = .25; P(AinCourse ∧ Raining) = P(AinCourse)P(Raining). There is no general way to compute P(x ∧ y) when x and y are not independent, except that P(x ∧ y) = 0 if x and y are mutually exclusive: P(AinCourse ∧ BinCourse) = 0; P(x ∧ ¬x) = 0. Mutual exclusion makes P(x ∨ y) = P(x) + P(y) by Axiom 3.
P(x) = P(y) if x ≡ y (logically equivalent).
P(x1 ∨ ... ∨ xn) = 1 if the xi are exhaustive: P(heads ∨ tails) = 1; P(x ∨ ¬x) = 1.
P(¬x) = 1 - P(x), since P(x ∨ ¬x) = P(x) + P(¬x) - P(x ∧ ¬x) gives 1 = P(x) + P(¬x) - 0. In general, summing probabilities over any partition yields 1.

Probability Distributions. A probability distribution for a random variable specifies probabilities for all values in its domain, defining a normalized (sums to 1) vector of values. Given the domain of Weather as <sunny,rain,cloudy,snow>, we could specify probabilities for the domain elements as P(Weather=sunny)=.72, P(Weather=rain)=.1, P(Weather=cloudy)=.08, P(Weather=snow)=.1, or as a vector of values: P(Weather) = <.72,.1,.08,.1>. For continuous domains, we get a probability density function: every precise value has essentially zero probability, but the density function shows how the mass is distributed. Will not focus on the continuous case here.
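As a concrete illustration (not from the original slides), the Weather distribution above can be held as a normalized table; the dictionary below is a minimal Python sketch.

```python
# Minimal sketch: the Weather distribution from the slide as a normalized dict.
weather = {"sunny": 0.72, "rain": 0.10, "cloudy": 0.08, "snow": 0.10}

assert abs(sum(weather.values()) - 1.0) < 1e-9  # a proper distribution sums to 1
print(weather["sunny"])  # P(Weather=sunny) = 0.72
```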

Joint Probability Distribution. A joint probability distribution for a set of random variables gives the probability of every atomic event (a complete state of the world) on those random variables. P(Weather,AinClass) is a 4 × 2 matrix of values:

Weather =          sunny   rainy   cloudy   snow
AinClass = true    0.144   0.02    0.016    0.02
AinClass = false   0.576   0.08    0.064    0.08

A full joint probability distribution covers all of the random variables used to describe the world (Weather, Temperature, Cavity, Toothache, AinClass, etc.). Every question about the world is answerable from the full joint distribution.
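For illustration, a minimal Python sketch of this joint distribution, stored as a dict keyed by atomic events (the variable names and layout are an assumption, not part of the slides):

```python
# The P(Weather, AinClass) joint from the slide, keyed by (weather, ainclass).
joint = {
    ("sunny", True): 0.144, ("rainy", True): 0.02,
    ("cloudy", True): 0.016, ("snow", True): 0.02,
    ("sunny", False): 0.576, ("rainy", False): 0.08,
    ("cloudy", False): 0.064, ("snow", False): 0.08,
}

assert abs(sum(joint.values()) - 1.0) < 1e-9  # a full joint sums to 1

# Any question about these variables is answered by summing entries,
# e.g. the marginal P(AinClass = true):
p_a = sum(p for (w, a), p in joint.items() if a)
print(p_a)  # 0.2
```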

Probabilistic Inference by Enumeration. Can perform probabilistic inference by enumerating over the (full) joint probability distribution. Analogous to model checking in propositional logic, but here we are summing probabilities over all atomic events rather than checking for satisfiability in one or all of the models. Start with the joint probability distribution P(Cavity,Toothache,Catch), where Catch refers to the dentist's instrument catching on the tooth. For any proposition φ, sum the probabilities of the atomic events where it is true: P(φ) = Σ_{ω:ω⊨φ} P(ω). P(toothache) = .108 + .012 + .016 + .064 = .2, called the marginal probability of toothache. P(cavity) = .108 + .012 + .072 + .008 = .2. The overall process is called marginalization, or summing out: P(Y) = Σ_z P(Y,z).
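A minimal Python sketch of this enumeration, using the dental joint distribution referenced above (the dict layout is illustrative):

```python
# Inference by enumeration over the P(Cavity, Toothache, Catch) joint;
# each key is an atomic event (cavity, toothache, catch).
joint = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

def prob(holds):
    """P(phi) = sum of P(omega) over atomic events omega where phi holds."""
    return sum(p for omega, p in joint.items() if holds(omega))

print(prob(lambda e: e[1]))  # marginal P(toothache) = 0.2
print(prob(lambda e: e[0]))  # marginal P(cavity)    = 0.2
```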

Inference of Complex Propositions. Can also infer the values of complex propositions in the same manner: sum over all atomic events in which the proposition is true.
P(toothache ∨ cavity) = .108 + .012 + .016 + .064 + .072 + .008 = .28
P(toothache ∧ cavity) = .108 + .012 = .12
P(¬toothache ∨ cavity) = .108 + .012 + .072 + .008 + .144 + .576 = .92
P(¬cavity ∨ toothache) = .108 + .012 + .016 + .064 + .144 + .576 = .92

Probabilities and Evidence. Degrees of belief (subjective probabilities) are centrally affected by evidence. Prior probability refers to degree of belief before seeing any evidence: P(sunny)=.5, P(aincourse)=.45. Conditional probability refers to degree of belief conditioned on seeing (only) particular evidence: P(sunny|losangeles,july)=.9, P(sunny|pittsburgh)=.1, P(aincourse|aonmidterm1)=.6.

Conditional Probability. Definition of conditional probability: P(a|b) = P(a ∧ b) / P(b) if P(b) > 0. E.g., P(cavity|toothache) = P(cavity ∧ toothache)/P(toothache): what percentage of cases with a toothache have a cavity? The product rule gives an alternative formulation: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a). A general version holds for distributions: P(X,Y) = P(X|Y) P(Y). E.g., P(Weather,Cavity) = P(Weather|Cavity) P(Cavity), which is equivalent to a 4 × 2 set of equations, not a matrix multiplication: P(sunny,cavity) = P(sunny|cavity) P(cavity), P(sunny,¬cavity) = P(sunny|¬cavity) P(¬cavity), P(rainy,cavity) = P(rainy|cavity) P(cavity), and so on. Conditional probability can also be referred to as posterior probability.

Inference w/ Conditional Probabilities. Can compute conditional probabilities by enumerating over the joint distribution via P(a|b) = P(a ∧ b) / P(b):
P(cavity|toothache) = P(cavity ∧ toothache)/P(toothache) = (.108 + .012)/(.108 + .012 + .016 + .064) = .12/.2 = .6
P(¬cavity|toothache) = P(¬cavity ∧ toothache)/P(toothache) = (.016 + .064)/(.108 + .012 + .016 + .064) = .08/.2 = .4
1/P(toothache) is the normalization constant (α) for the distribution P(Cavity|toothache); it ensures that the distribution sums to 1.
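The same calculation as a short Python sketch (reusing the dental joint; the code is illustrative, not from the slides):

```python
# P(Cavity | toothache) by enumeration and normalization.
joint = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, False, True): 0.144, (False, False, False): 0.576,
}

p_toothache = sum(p for (c, t, k), p in joint.items() if t)
alpha = 1.0 / p_toothache  # normalization constant

p_cavity = alpha * sum(p for (c, t, k), p in joint.items() if c and t)
p_not_cavity = alpha * sum(p for (c, t, k), p in joint.items() if not c and t)
print(p_cavity, p_not_cavity)  # ≈ 0.6 0.4, which sums to 1
```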

Aside: Conditional Probability and Implication. What is the relationship? They are analogous but not equivalent, i.e., P(a|b) ≠ P(b ⇒ a). Consider cavity and toothache: P(toothache ⇒ cavity) = .92, while P(cavity|toothache) = P(cavity ∧ toothache)/P(toothache) = .12/.2 = .6.

Normalized Inference Procedure. General inference procedure for conditional distributions: P(X|e) = P(X,e)/P(e) = αP(X,e) = αΣ_y P(X,e,y), where X is a single query variable, e is a vector of values of the evidence variables E, y is a vector of values for the remaining unobserved variables Y, and P(X,e,y) is a subset of values from the joint probability distribution. Dentistry example: P(Cavity|toothache) = P(Cavity,toothache)/P(toothache) = αP(Cavity,toothache) = α[P(Cavity,toothache,catch) + P(Cavity,toothache,¬catch)] = (1/.2)[<.108,.016> + <.012,.064>] = 5<.12,.08> = <.6,.4>. The only problem, but a big one, is that this does not scale well: if there are n Boolean variables, space and time are O(2^n). Will look shortly at independence as a means of dealing with this.
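A general-purpose sketch of this procedure in Python, enumerating a stored full joint (the variable names and the helper `query` are illustrative assumptions):

```python
# P(X | e) = alpha * sum_y P(X, e, y), computed by summing out hidden variables.
joint = {  # full joint over (Cavity, Toothache, Catch), as on the earlier slides
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, False, True): 0.144, (False, False, False): 0.576,
}
VARS = ("Cavity", "Toothache", "Catch")

def query(X, evidence):
    """Return the distribution P(X | evidence) over {True, False}."""
    dist = {}
    for x in (True, False):
        total = 0.0
        for event, p in joint.items():
            assign = dict(zip(VARS, event))
            if assign[X] == x and all(assign[v] == val for v, val in evidence.items()):
                total += p  # sum over the unobserved variables y
        dist[x] = total
    alpha = 1.0 / sum(dist.values())
    return {x: alpha * p for x, p in dist.items()}

print(query("Cavity", {"Toothache": True}))  # ≈ {True: 0.6, False: 0.4}
```

As the slide notes, this costs O(2^n) time and space for n Boolean variables.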

Multiple Sources of Evidence. As more sources of evidence accumulate, the conditional probability of cavity will adjust. E.g., if catch is also given, then we have P(cavity|toothache,catch) = P(cavity ∧ toothache ∧ catch)/P(toothache ∧ catch) = .87 (up from .6). In general, as evidence accumulates over time, such as data from an x-ray, the conditional probability of cavity would continue to evolve; probabilistic reasoning is inherently non-monotonic. However, if evidence is irrelevant, it may simply be ignored: P(cavity|toothache,sunny) = P(cavity|toothache) = .6. Such reasoning about independence can be essential in making probabilistic reasoning tractable.

Independence. A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A,B) = P(A) P(B). E.g., P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather), which reduces 2×2×2×4 = 32 entries to 2×2×2 + 4 = 12. For n independent coin tosses, O(2^n) reduces to O(n). Analogous to goal decomposition in planning. Absolute (or marginal) independence is powerful but rare: e.g., dentistry is a large field with hundreds of variables, none of which are absolutely independent from the others. What to do?

Conditional Independence. Two dependent variables may become conditionally independent when conditioned on a third variable. Consider Catch and Toothache: P(catch)=.34 while P(catch|toothache)=.62, so they are dependent. The problem is that while neither causes the other, they correlate because both are caused by a single underlying hidden variable (cavity). If we condition both Catch and Toothache on Cavity, the conditional probabilities become independent: P(catch|toothache, cavity) = P(catch|cavity) = .9 and P(catch|toothache, ¬cavity) = P(catch|¬cavity) = .2. Catch is conditionally independent of Toothache given Cavity: P(Catch|Toothache,Cavity) = P(Catch|Cavity). Additional equivalent statements: P(Toothache|Catch, Cavity) = P(Toothache|Cavity) and P(Toothache, Catch|Cavity) = P(Toothache|Cavity) P(Catch|Cavity).
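This claim can be checked numerically against the dental joint; the following Python sketch is illustrative:

```python
# Verify P(catch | toothache, cavity) = P(catch | cavity) from the joint.
joint = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, False, True): 0.144, (False, False, False): 0.576,
}
def p(cond):  # total probability of atomic events (cavity, toothache, catch) where cond holds
    return sum(pr for (c, t, k), pr in joint.items() if cond(c, t, k))

p_catch_given_t_c = p(lambda c, t, k: k and t and c) / p(lambda c, t, k: t and c)
p_catch_given_c   = p(lambda c, t, k: k and c) / p(lambda c, t, k: c)
print(p_catch_given_t_c, p_catch_given_c)  # both ≈ 0.9
```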

Conditional Independence (cont.). The general rule is P(X,Y|Z) = P(X|Z) P(Y|Z), or P(X|Y,Z) = P(X|Z), or P(Y|X,Z) = P(Y|Z). With conditional independence we can decompose individual large tables into sets of smaller ones, e.g.: P(Toothache, Catch, Cavity) = P(Toothache, Catch|Cavity) P(Cavity) [product rule] = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity) [conditional independence]. The use of conditional independence can reduce the size of the representation of the joint distribution from exponential in n to linear in n, for example, if there are many symptoms all caused by cavities. This is one of the most significant recent advances in AI, as it has made probabilistic reasoning tractable where it was previously understood not to be so. Conditional independence is our most basic and robust form of knowledge about uncertain environments.

Bayes' Rule. Product rule: P(a ∧ b) = P(a|b) P(b) = P(b|a) P(a). Bayes' rule: P(a|b) = P(b|a) P(a) / P(b). Or in distribution form (and with a normalization factor): P(Y|X) = P(X|Y) P(Y) / P(X) = αP(X|Y) P(Y). Useful for deriving diagnostic probabilities from causal probabilities: P(Cause|Effect) = P(Effect|Cause) P(Cause) / P(Effect). E.g., let M be meningitis and S be stiff neck: P(m|s) = P(s|m) P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008. Although the probability of having a stiff neck is very high if one has meningitis (.8), the posterior probability of meningitis given a stiff neck is still very small (.0008)! Causal (and prior) probabilities are often easier to obtain than diagnostic ones, and may also be more robust under shifts in baselines: e.g., in an epidemic P(m|s) may skyrocket, whereas P(s|m) remains fixed (causally determined).
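The meningitis numbers work out as follows; a short Python sketch:

```python
# Bayes' rule for the meningitis example: P(m|s) = P(s|m) P(m) / P(s).
p_s_given_m = 0.8    # causal probability: stiff neck given meningitis
p_m = 0.0001         # prior probability of meningitis
p_s = 0.1            # prior probability of a stiff neck

print(p_s_given_m * p_m / p_s)  # 0.0008
```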

Combining Evidence (for Diagnosis) via Bayes' Rule and Conditional Independence. P(Cavity|toothache,catch) = αP(toothache ∧ catch|Cavity) P(Cavity) = αP(toothache|Cavity) P(catch|Cavity) P(Cavity). This is an example of a naïve Bayes model: P(Cause,Effect_1,...,Effect_n) = P(Cause) Π_i P(Effect_i|Cause). The cost of diagnostic reasoning now grows linearly rather than exponentially in the number of conditionally independent effects. Called naïve because it is often used when the effects are not completely conditionally independent given the cause; such Bayesian classifiers can work surprisingly well even in the absence of complete conditional independence. Often the focus of learning efforts.
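A minimal naïve Bayes sketch in Python, using conditional probabilities consistent with the running dental example (the specific CPT values below are derived from the earlier joint table, not stated on this slide):

```python
# Naive Bayes: P(Cavity | toothache, catch) ∝ P(Cavity) P(toothache|Cavity) P(catch|Cavity).
prior       = {True: 0.2, False: 0.8}   # P(Cavity)
p_toothache = {True: 0.6, False: 0.1}   # P(toothache | Cavity)
p_catch     = {True: 0.9, False: 0.2}   # P(catch | Cavity)

unnorm = {c: prior[c] * p_toothache[c] * p_catch[c] for c in (True, False)}
alpha = 1.0 / sum(unnorm.values())
posterior = {c: alpha * v for c, v in unnorm.items()}
print(posterior)  # ≈ {True: 0.87, False: 0.13}, matching the earlier slide's 0.87
```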

Bayesian Networks. A graphical notation for (conditional) independence; a compact specification of the full joint probability distribution. Also called belief networks, probabilistic networks, causal networks (especially when used without probabilities) and knowledge maps.

Syntax. A set of nodes, one per variable. Links between nodes encode direct influences, yielding a directed, acyclic graph. A conditional distribution for each node given its parents, P(X_i | Parents(X_i)), in the simplest case represented as a conditional probability table (CPT) that provides a distribution over X_i for each combination of parent values; this reduces to prior probabilities for nodes with no parents. Example: City and Month as parents of Weather, with a CPT fragment for P(Weather | City, Month):

City          Month      sunny   rainy
Pittsburgh    January    .4      .2
Pittsburgh    February   .3      .3
Los Angeles   January    .7      .2
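One natural in-memory representation of a CPT is a mapping from parent-value combinations to distributions over the child; the Python sketch below uses the reconstructed numbers above and is purely illustrative:

```python
# P(Weather | City, Month) as a dict keyed by parent values; each row is a
# (partial) distribution over Weather values, as in the table above.
cpt_weather = {
    ("Pittsburgh", "January"):  {"sunny": 0.4, "rainy": 0.2},
    ("Pittsburgh", "February"): {"sunny": 0.3, "rainy": 0.3},
    ("Los Angeles", "January"): {"sunny": 0.7, "rainy": 0.2},
}

# A node with no parents reduces to a prior: a CPT with a single, empty key.
cpt_city = {(): {"Pittsburgh": 0.5, "Los Angeles": 0.5}}  # assumed values, for illustration only

print(cpt_weather[("Los Angeles", "January")]["sunny"])  # 0.7
```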

Independence in Bayesian Networks. The topology of the network encodes (conditional) independence assertions: Weather is independent of the other variables; Toothache and Catch are conditionally independent given Cavity.

Alarm Example. My burglar alarm is fairly reliable, but is sometimes set off by a minor earthquake. Neighbors John and Mary promise to call if they hear the alarm, but sometimes make errors (John may call when he hears the phone ringing, and Mary may miss the alarm when listening to music). John has just called, but Mary hasn't. Is there a burglar? The evidence is subject to errors of both omission and commission. Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. The network topology reflects "causal" knowledge: a burglar can set off the alarm, an earthquake can set off the alarm, the alarm can cause Mary to call, and the alarm can cause John to call.

Alarm Example (Cont.). Only one value is needed for X_i in each row of a CPT because, for Boolean variables, P(false) = 1 - P(true). All factors (possibly infinite in number) not explicitly mentioned are implicitly incorporated into the probabilities: a bird could fly through the window pane, the power could fail, and so on.

Compactness/Efficiency. A CPT for Boolean X_i with k Boolean parents has 2^k rows for the combinations of parent values. If each of the n variables has no more than k parents, the complete network requires O(n·2^k) numbers, i.e., it grows linearly with n vs. O(2^n) for the full joint distribution. For the burglary net, 1+1+4+2+2 = 10 numbers (vs. 2^5 - 1 = 31).
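The burglary-net arithmetic, as a quick Python sketch (parent counts taken from the network described on the surrounding slides):

```python
# Numbers needed: one per CPT row, 2**k rows for a node with k Boolean parents.
parents = {"Burglary": 0, "Earthquake": 0, "Alarm": 2, "JohnCalls": 1, "MaryCalls": 1}

network_numbers = sum(2 ** k for k in parents.values())  # 1 + 1 + 4 + 2 + 2
full_joint_numbers = 2 ** len(parents) - 1               # independent entries in the full joint
print(network_numbers, full_joint_numbers)  # 10 31
```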

Semantics. If correct, the network represents the full joint distribution: P(x_1,...,x_n) = Π_{i=1}^{n} P(x_i | parents(X_i)). E.g., the probability of a complete false alarm (no burglary or earthquake) with two calls is: P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e) = .9 × .7 × .001 × .999 × .998 ≈ .00063.
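The product on this slide, checked in Python (only the five numbers quoted above are used):

```python
# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
p = 0.90 * 0.70 * 0.001 * 0.999 * 0.998
print(p)  # ≈ 0.00063
```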

The Chain Rule and Network Semantics. The chain rule: P(x_1,...,x_n) = P(x_n | x_{n-1},...,x_1) P(x_{n-1},...,x_1) = P(x_n | x_{n-1},...,x_1) P(x_{n-1} | x_{n-2},...,x_1) P(x_{n-2},...,x_1) = P(x_n | x_{n-1},...,x_1) P(x_{n-1} | x_{n-2},...,x_1) ... P(x_2 | x_1) P(x_1) = Π_{i=1}^{n} P(x_i | x_{i-1},...,x_1). The equation for network semantics is equivalent to the chain rule with the restriction that, for every variable X_i in the network, P(X_i | X_{i-1},...,X_1) = P(X_i | Parents(X_i)), given that Parents(X_i) ⊆ {X_{i-1},...,X_1}. This requires labeling the variables consistently with the partial order of the graph.

Constructing Bayesian Networks. Need a method for which locally testing conditional independence guarantees the semantics: P(X_1,...,X_n) = Π_{i=1}^{n} P(X_i | X_1,...,X_{i-1}) = Π_{i=1}^{n} P(X_i | Parents(X_i)).
1. Choose an ordering of variables X_1,...,X_n, preferably one where earlier variables do not depend causally on later ones. (Assures the first step above while respecting causality.)
2. For i = 1 to n: add X_i to the network and select parents from X_1,...,X_{i-1} such that P(X_i | Parents(X_i)) = P(X_i | X_1,...,X_{i-1}). (Assures the second step above, respecting the semantics.)

Car Diagnosis Example

Car Insurance Example

Inference in Bayesian Networks. Exact inference: the basic algorithm is analogous to enumeration in the full joint probability table (FJPT); variable elimination emulates dynamic programming, saving partial computations to avoid redundant calculation. Approximate inference: Monte Carlo methods perform sampling (the more samples, the more accurate); variational methods use a reduced version of the problem; loopy propagation applies polytree-style message passing even in networks with loops (a polytree has at most one undirected path between any two nodes). I believe this is what maps onto the summary-product algorithm in factor graphs.

Enumeration in Bayesian Networks. Compute probabilities from a Bayesian network as if from the FJPT, but without explicitly constructing the table; otherwise we would lose the benefit of decomposing the full table into the network. Consider a simple query on the burglary network: P(B|j,m) = P(B,j,m)/P(j,m) = αP(B,j,m) = αΣ_e Σ_a P(B,e,a,j,m) = αΣ_e Σ_a P(B) P(e) P(a|B,e) P(j|a) P(m|a) = αP(B) Σ_e P(e) Σ_a P(a|B,e) P(j|a) P(m|a). Compute by looping through the variables in a depth-first fashion, multiplying CPT entries as we go. I believe this maps onto the expression-tree (or single-i sum-product) algorithm for restricted factor graphs in the Kschischang et al. article.
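A compact sketch of this enumeration in Python, following the structure of AIMA's ENUMERATION-ASK; the CPT entries not quoted in these slides are assumed to be the standard values from the Russell & Norvig alarm example:

```python
# Inference by enumeration on the burglary network.
CPT = {  # P(var = True | parent values); assumed standard AIMA numbers
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}
ORDER = ["B", "E", "A", "J", "M"]  # parents listed before children

def p_node(var, value, assignment):
    p_true = CPT[var][tuple(assignment[p] for p in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, assignment):
    # Depth-first loop over the variables, multiplying CPT entries and
    # summing out any variable that is neither the query nor evidence.
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in assignment:
        return p_node(first, assignment[first], assignment) * enumerate_all(rest, assignment)
    return sum(p_node(first, v, assignment) *
               enumerate_all(rest, {**assignment, first: v}) for v in (True, False))

def enumeration_ask(X, evidence):
    dist = {v: enumerate_all(ORDER, {**evidence, X: v}) for v in (True, False)}
    alpha = 1.0 / sum(dist.values())
    return {v: alpha * p for v, p in dist.items()}

print(enumeration_ask("B", {"J": True, "M": True}))  # ≈ {True: 0.284, False: 0.716}
```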

Evaluation Tree for P(b|j,m). αP(b) Σ_e P(e) Σ_a P(a|b,e) P(j|a) P(m|a) = .284. Repeated computation in subtrees provides the motivation for the variable elimination algorithm.

Enumeration Algorithm