Bayesian Networks
CompSci 270, Duke University
Ron Parr

Why Joint Distributions are Important
A joint distribution gives P(X1, ..., Xn).
- Classification/diagnosis: suppose X1 = disease and X2, ..., Xn = symptoms.
- Co-occurrence: suppose X3 = lung cancer and X5 = smoking.
- Rare-event detection: suppose X1, ..., Xn = parameters of a credit card transaction; call the card holder if P(x1, ..., xn) is below a threshold.

Modeling Joint Distributions
To do this correctly, we need a full assignment of probabilities to all atomic events.
This is unwieldy in general for discrete variables: n binary variables give 2^n atomic events.
Independence makes this tractable, but it is too strong an assumption (it rarely holds).
Conditional independence is a good compromise: weaker than independence, but it still has great potential to simplify things.

Overview
- Conditional independence
- Bayesian networks
- Variable elimination
- Sampling
- Factor graphs
- Belief propagation
- Undirected models

Conditional Independence
Suppose we know the following:
- The flu causes sinus inflammation.
- Allergies cause sinus inflammation.
- Sinus inflammation causes a runny nose.
- Sinus inflammation causes headaches.
How are these connected?

Example 1: Simple Graphical Structure
Flu → Sinus ← Allergy, with Sinus → Headache and Sinus → Nose.
Knowing Sinus separates the other variables from each other.

Example 2: Naïve Bayes Spam Filter
S (with P(S)) is the parent of the word variables W1, W2, ..., Wn, each with a CPT P(Wi | S).
We will see later why this is a particularly convenient representation. (Does it make a correct assumption?)

Conditional Independence
We say that two variables, A and B, are conditionally independent given C if:
P(A | B, C) = P(A | C), or equivalently P(A, B | C) = P(A | C) P(B | C)
How does this help? We store only a conditional probability table (CPT) for each variable given its parents.
Naïve Bayes (e.g., SpamAssassin) is a special case of this!

Notation Reminder
- P(A | B) is a conditional probability distribution. It is a function: P(A=true | B=true), P(A=true | B=false), P(A=false | B=true), P(A=false | B=false).
- P(A | b) is a probability distribution (and a function of A).
- P(a | B) is a function (of B), but not a distribution.
- P(a | b) is a number.

What is a Bayes Net?
- A directed acyclic graph (DAG).
- Given its parents, each variable is conditionally independent of its non-descendants.
- This implies that the joint probability decomposes: P(x1, ..., xn) = ∏_i P(xi | parents(xi))
- For each node Xi, store P(Xi | parents(Xi)). Call this a conditional probability table (CPT).
- CPT size is exponential in the number of parents.
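As a concrete illustration, here is a minimal Python sketch of a Bayes net as a table of CPTs over the Flu/Allergy/Sinus/Headache/Nose structure from the lecture; the graph matches the slides, but the numeric CPT entries below are made-up placeholders.

```python
# Minimal sketch: a Bayes net as a dict of CPTs. The graph structure matches the
# lecture's Flu network; the CPT numbers are invented purely for illustration.
parents = {"F": [], "A": [], "S": ["A", "F"], "H": ["S"], "N": ["S"]}

# cpt[X][tuple of parent values] = P(X = True | parents)
cpt = {
    "F": {(): 0.10},
    "A": {(): 0.20},
    "S": {(True, True): 0.90, (True, False): 0.60,
          (False, True): 0.70, (False, False): 0.05},
    "H": {(True,): 0.80, (False,): 0.10},
    "N": {(True,): 0.70, (False,): 0.05},
}

def p_atomic(event):
    """P(x1, ..., xn) = product over i of P(xi | parents(xi))."""
    prob = 1.0
    for var, val in event.items():
        p_true = cpt[var][tuple(event[p] for p in parents[var])]
        prob *= p_true if val else 1.0 - p_true
    return prob

print(p_atomic({"F": True, "A": False, "S": True, "H": True, "N": False}))
```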

Real Applications of Bayes Nets
- Diagnosis of lymph node disease
- Used by robots to identify meteorites to study
- Studying the human genome (Alex Hartemink et al.)
- Used in MS Windows
- Many other applications

Space Efficiency
For the Flu/Allergy/Sinus/Headache/Nose network (numbers in parentheses count independent parameters):
- The entire joint distribution has 32 (31) entries.
- P(H | S) and P(N | S) have 4 (2) entries each.
- P(S | A, F) has 8 (4).
- P(A) and P(F) have 2 (1) each.
- The total is 20 (10).
This can require exponentially less space. The space problem is solved for most problems.

Naïve Bayes Space Efficiency
The entire joint distribution has 2^(n+1) (2^(n+1) - 1) numbers vs. 4n + 2 (2n + 1) for the Naïve Bayes network.

How To Build a Bayes Net
- You can always construct a valid Bayes net by inserting variables one at a time.
- How do you pick edges for a new variable? The parents of each new variable are all variables that influence the distribution of the new variable.
- There is no need (yet) for outgoing edges from the new variable.
- One can prove by induction that this preserves the property that all variables are conditionally independent of their non-descendants given their parents.

(Non)Uniqueness of Bayes Nets
The order in which variables are added can lead to different Bayesian networks for the same distribution.
Suppose A and B are independent, but C is a function of A and B:
- Add A, B, then C: parents(A) = parents(B) = {}, parents(C) = {A, B}
- Add C, A, then B: parents(C) = {}, parents(A) = {C}, parents(B) = {A, C}

Case 1: Add A, then B, then C
Suppose A and B are uniform and C = (A ∨ B). The joint distribution is:

A B C  P
0 0 0  0.25
0 0 1  0
0 1 0  0
0 1 1  0.25
1 0 0  0
1 0 1  0.25
1 1 0  0
1 1 1  0.25

Network: A → C ← B, with CPTs P(A), P(B), P(C | A, B):
P(a) = P(¬a) = P(b) = P(¬b) = 0.5
P(c) = 0.75, P(¬c) = 0.25
P(a, b) = P(a, ¬b) = P(¬a, b) = P(¬a, ¬b) = 0.25
P(c | a, b) = P(c | a, ¬b) = P(c | ¬a, b) = 1.0, P(c | ¬a, ¬b) = 0 (showing only the c = true case)

Case 2: Add C, then A, then B
Again A and B are uniform and C = (A ∨ B), with the same joint distribution as above.
Network: C → A, C → B, A → B, with CPTs P(C), P(A | C), P(B | A, C):
P(c) = 0.75, P(¬c) = 0.25
P(a, c) = 0.5, P(a, ¬c) = 0, P(¬a, c) = P(¬a, ¬c) = 0.25
P(a | c) = 2/3, P(¬a | c) = 1/3, P(a | ¬c) = 0, P(¬a | ¬c) = 1
P(b | a, c) = 1/2, P(b | a, ¬c) = P(b | ¬a, ¬c) = 0, P(b | ¬a, c) = 1 (showing only the b = true case)

Atomic Event Probabilities
P(x1, ..., xn) = ∏_i P(xi | parents(xi))
This is true for the distribution P(X1, ..., Xn) and for any individual setting of the variables P(x1, ..., xn).
(Example network: Flu, Allergy, Sinus, Headache, Nose.)
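The atomic-event factorization also gives an easy way to check that the two orderings above describe the same distribution. Here is a small sanity check (my own code, using the probabilities worked out above):

```python
from itertools import product

def joint_order_ABC(a, b, c):
    # parents(A) = parents(B) = {}, parents(C) = {A, B}; C is a deterministic OR
    return 0.5 * 0.5 * (1.0 if c == (a or b) else 0.0)

def joint_order_CAB(a, b, c):
    # parents(C) = {}, parents(A) = {C}, parents(B) = {A, C}, with the CPTs above
    p_c = 0.75 if c else 0.25
    p_a_true = 2 / 3 if c else 0.0                 # P(a | C)
    p_a = p_a_true if a else 1.0 - p_a_true
    if a and c:
        p_b_true = 0.5                             # P(b | a, c)
    elif (not a) and c:
        p_b_true = 1.0                             # P(b | ~a, c)
    else:
        p_b_true = 0.0                             # c = false forces b = false
    p_b = p_b_true if b else 1.0 - p_b_true
    return p_c * p_a * p_b

for a, b, c in product([False, True], repeat=3):
    assert abs(joint_order_ABC(a, b, c) - joint_order_CAB(a, b, c)) < 1e-12
print("Both orderings define the same joint distribution.")
```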

Answering Queries Using Marginalization
P(f | h) = P(f, h) / P(h)   (definition of conditional probability)
         = [ Σ_{S,A,N} P(f, h, S, A, N) ] / [ Σ_{S,A,N,F} P(h, S, A, N, F) ]
         = [ Σ_{S,A,N} P(f) P(A) P(S | A, f) P(h | S) P(N | S) ] / [ Σ_{S,A,N,F} P(F) P(A) P(S | A, F) P(h | S) P(N | S) ]
using the factorization P(h, S, A, N, F) = ∏_x P(x | parents(x)) = P(h | S) P(N | S) P(S | A, F) P(A) P(F).
Doing this naïvely, we need to sum over all atomic events defined over these variables, and there are exponentially many of these.

Digression: Notation I
- Σ_A P(A, B) = P(B), a distribution over B.
- Σ_A P(A, b) = P(b), a probability.
- P(A) P(B) = a function (a lookup table of size 4) from combinations of assignments to A and B to numbers (if A and B are independent, then this function is P(A, B)).
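Returning to the query P(f | h): a brute-force sketch of this naïve marginalization, reusing the hypothetical `parents`, `cpt`, and `p_atomic` tables from the earlier Flu-network sketch.

```python
from itertools import product

VARS = ["F", "A", "S", "H", "N"]

def marginal(fixed):
    """Sum the factored joint over every atomic event consistent with `fixed`."""
    free = [v for v in VARS if v not in fixed]
    total = 0.0
    for vals in product([False, True], repeat=len(free)):
        event = dict(fixed, **dict(zip(free, vals)))
        total += p_atomic(event)        # exponentially many terms in general
    return total

p_f_given_h = marginal({"F": True, "H": True}) / marginal({"H": True})
print(p_f_given_h)
```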

Digression: Notation II
Σ_{B,C} P(A) P(B | A) P(C | B) = P(A), a distribution over A.
How do we compute this? For each value of A, we sum over all combinations of B and C.

Digression: Notation III
Σ_C P(A) P(B | A) P(C | B) = a function of A and B (a table of 4 numbers).
How do we compute this? For each combination of A and B, we sum over all values of C.
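A quick numeric check of these two identities, assuming a simple chain A → B → C with made-up CPT numbers:

```python
# The structure (a chain A -> B -> C) and the numbers below are assumptions,
# used only to verify the two "notation" identities above.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.9, False: 0.1}, False: {True: 0.4, False: 0.6}}  # [a][b]
P_C_given_B = {True: {True: 0.2, False: 0.8}, False: {True: 0.5, False: 0.5}}  # [b][c]

vals = [True, False]

# Notation II: summing over all combinations of B and C leaves P(A).
for a in vals:
    s = sum(P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c] for b in vals for c in vals)
    assert abs(s - P_A[a]) < 1e-12

# Notation III: summing over C only leaves a table over A and B (here, P(A, B)).
table_AB = {(a, b): sum(P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c] for c in vals)
            for a in vals for b in vals}
print(table_AB)   # four numbers
```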

Working Smarter
(Flu, Allergy → Sinus; Sinus → Headache, Nose)
P(h) = Σ_{S,A,N,F} P(h, S, A, N, F)
     = Σ_{S,A,N,F} P(h | S) P(N | S) P(S | A, F) P(A) P(F)
     = Σ_{N,S} P(h | S) P(N | S) Σ_{A,F} P(S | A, F) P(A) P(F)
     = Σ_S P(h | S) Σ_N P(N | S) Σ_{A,F} P(S | A, F) P(A) P(F)
Potential for an exponential reduction in computation.

Computational Efficiency
Σ_{S,A,N,F} P(h, S, A, N, F) = Σ_{S,A,N,F} P(h | S) P(N | S) P(S | A, F) P(A) P(F)
                             = Σ_S P(h | S) Σ_N P(N | S) Σ_{A,F} P(S | A, F) P(A) P(F)
The distributive law allows us to decompose the sum. AKA: variable elimination.
Potential for an exponential reduction in computation costs.

Naïve Bayes Efficiency
Given a set of words, we want to know which is larger: P(s | W1, ..., Wn) or P(¬s | W1, ..., Wn).
Use Bayes' rule:
P(S | W1, ..., Wn) = P(W1, ..., Wn | S) P(S) / P(W1, ..., Wn)

Naïve Bayes Efficiency II
- Observation 1: We can ignore P(W1, ..., Wn).
- Observation 2: P(S) is given.
- Observation 3: P(W1, ..., Wn | S) is easy: P(W1, ..., Wn | S) = ∏_{i=1}^{n} P(Wi | S)
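Putting the three observations together, a minimal sketch of the Naïve Bayes spam decision; the toy vocabulary and word probabilities here are invented for illustration.

```python
# Compare the unnormalized posteriors P(s) * prod_i P(w_i | s) and
# P(~s) * prod_i P(w_i | ~s), ignoring the common denominator P(W1..Wn).
p_spam = 0.4   # assumed prior

p_word_given_spam     = {"free": 0.30, "meeting": 0.02, "winner": 0.20}
p_word_given_not_spam = {"free": 0.05, "meeting": 0.10, "winner": 0.01}

def score(words, prior, likelihoods):
    s = prior
    for w, p in likelihoods.items():
        s *= p if w in words else 1.0 - p    # Observation 3: product of P(W_i | S)
    return s

email = {"free", "winner"}
spam_score = score(email, p_spam, p_word_given_spam)
ham_score  = score(email, 1.0 - p_spam, p_word_given_not_spam)
print("spam" if spam_score > ham_score else "not spam")
# For an actual posterior, normalize: P(s | words) = spam_score / (spam_score + ham_score)
```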

Checkpoint
- BNs can give us an exponential reduction in the space required to represent a joint distribution.
- Storage is exponential in the size of the largest parent set.
- Claim: Parent sets are often reasonable.
- Claim: Inference cost is often reasonable.
- Question: Can we quantify the relationship between structure and inference cost?

Now the Bad News
- In full generality, inference is NP-hard.
- Decision problem: Is P(X) > 0?
- We reduce from 3SAT.
- 3SAT variables map to BN variables.
- Clauses become variables with the corresponding SAT variables as parents.

Reduction
Example formula: (¬X1 ∨ X2 ∨ X3) ∧ (¬X2 ∨ X3 ∨ X4) ∧ ...
Clause node C1 has parents X1, X2, X3; clause node C2 has parents X2, X3, X4; and so on.
Problem: What if we have a large number of clauses? How does this fit into our decision-problem framework?

AND Trees
We could make a single variable which is the AND of all of our clauses, but this would have a CPT that is exponential in the number of clauses.
Instead, implement the conjunction as a tree of AND nodes (A1, A2, A3, ...). This is polynomial.
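A brute-force check (my own code, not the slides' construction) of the property the reduction relies on: with each Xi an independent fair coin and each clause node a deterministic OR of its literals, P(all clause nodes true) > 0 exactly when the formula is satisfiable. The AND tree only serves to keep the CPTs small.

```python
from itertools import product

# (~X1 v X2 v X3) ^ (~X2 v X3 v X4), written as (variable index, negated?) pairs
clauses = [[(0, True), (1, False), (2, False)],
           [(1, True), (2, False), (3, False)]]

def clause_true(clause, bits):
    # A clause node is true iff at least one of its literals is satisfied
    return any(bits[i] != neg for i, neg in clause)

p_all_clauses = sum(0.5 ** 4                        # probability of this atomic event
                    for bits in product([False, True], repeat=4)
                    if all(clause_true(c, bits) for c in clauses))
print(p_all_clauses, p_all_clauses > 0)   # > 0 iff the formula is satisfiable
```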

Checkpoint
- BNs can be very compact.
- Worst case: inference is intractable.
- Hope that the worst case is:
  - avoidable (frequently, but there is no free lunch)
  - easily characterized in some way

Clues in the Graphical Structure
Q: How does the graphical structure relate to our ability to push in summations over variables?
A: We relate summations to graph operations:
- Summing out a variable = removing node(s) from the DAG and creating a new replacement node.
- Relate graph properties to computational efficiency.

Variable Elimination as Graph Operations
We can think of summing out a variable as creating a new super-variable which contains all of that variable's neighbors.

Another Example Network
Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass ← Rain.
P(c) = 0.5
P(s | c) = 0.1, P(s | ¬c) = 0.5
P(r | c) = 0.8, P(r | ¬c) = 0.2
P(w | s, r) = 0.99, P(w | s, ¬r) = 0.9, P(w | ¬s, r) = 0.9, P(w | ¬s, ¬r) = 0.0

Marginal Probabilities
Suppose we want P(W):
P(W) = Σ_{C,S,R} P(C, S, R, W)
     = Σ_{C,S,R} P(C) P(S | C) P(R | C) P(W | R, S)
     = Σ_{S,R} P(W | R, S) Σ_C P(S | C) P(C) P(R | C)
The inner sum is a function of R and S.

Eliminating Cloudy
Summing out Cloudy produces a new factor over Sprinkler and Rain:
P(s, r) = 0.5 · 0.1 · 0.8 + 0.5 · 0.5 · 0.2 = 0.09
P(s, ¬r) = 0.5 · 0.1 · 0.2 + 0.5 · 0.5 · 0.8 = 0.21
P(¬s, r) = 0.5 · 0.9 · 0.8 + 0.5 · 0.5 · 0.2 = 0.41
P(¬s, ¬r) = 0.5 · 0.9 · 0.2 + 0.5 · 0.5 · 0.8 = 0.29
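The same elimination as a code sketch: sum C out of P(C) P(S | C) P(R | C) to form the new factor over (S, R). The CPT numbers are the ones given on the slide.

```python
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: 0.1, False: 0.5}   # P(s = true | C)
P_R_given_C = {True: 0.8, False: 0.2}   # P(r = true | C)

def bern(p_true, x):
    return p_true if x else 1.0 - p_true

# factor_SR[(s, r)] = sum over C of P(C) P(s | C) P(r | C)  -- this is P(S, R)
factor_SR = {(s, r): sum(P_C[c] * bern(P_S_given_C[c], s) * bern(P_R_given_C[c], r)
                         for c in [True, False])
             for s in [True, False] for r in [True, False]}

print(factor_SR)   # approximately 0.09, 0.21, 0.41, 0.29, matching the slide
```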

Eliminating Sprinkler/Rain
Using the factor P(S, R) from above and P(W | S, R):
P(w) = Σ_{S,R} P(w | R, S) P(R, S)
     = 0.09 · 0.99 + 0.21 · 0.9 + 0.41 · 0.9 + 0.29 · 0
     = 0.6471

Dealing With Evidence
Suppose we have observed that the grass is wet. What is the probability that it has rained?
P(R | w) = α P(R, w)
         = α Σ_{C,S} P(C, S, R, w)
         = α Σ_{C,S} P(C) P(S | C) P(R | C) P(w | R, S)
         = α Σ_C P(R | C) P(C) Σ_S P(S | C) P(w | R, S)
Is there a more clever way to deal with w? Only keep the relevant parts.
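Continuing the sketch: finish P(w) with the new factor, then answer the evidence query P(R | w) by normalizing. (This reuses the factor from eliminating Cloudy rather than the slide's exact ordering, but the answer is the same.)

```python
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9,  (False, False): 0.0}

# P(w) = sum over S, R of P(w | S, R) * P(S, R)
p_w = sum(P_W_given_SR[(s, r)] * factor_SR[(s, r)]
          for s in [True, False] for r in [True, False])
print(p_w)   # 0.6471, as on the slide

# P(R | w) = alpha * sum over S of P(w | S, R) * P(S, R), with alpha = 1 / P(w)
unnormalized = {r: sum(P_W_given_SR[(s, r)] * factor_SR[(s, r)] for s in [True, False])
                for r in [True, False]}
p_R_given_w = {r: v / p_w for r, v in unnormalized.items()}
print(p_R_given_w)
```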

Efficiency of Variable Elimination
- Exponential in the largest domain size of the new variables created.
- Equivalently: exponential in the largest function created by pushing in summations (the sum-product algorithm).
- Linear for trees.
- Almost linear for almost-trees.

Naïve Bayes Efficiency
Another way to understand why Naïve Bayes is efficient: it's a tree!

Facts About Variable Elimination
- Picking variables in the optimal order is NP-hard.
- For some networks, there will be no elimination ordering that results in a polynomial-time solution. (This must be the case unless P = NP.)
- It is polynomial for trees.
- You need to get a little fancier if there are a large number of query variables or evidence variables.

Review: Computing Marginal Probabilities
Suppose you want P(X1, X2).
Write: P(X1, X2) = Σ_{X3, ..., Xn} P(X1, ..., Xn)
Using the Bayes net: P(X1, X2) = Σ_{X3, ..., Xn} ∏_i P(Xi | parents(Xi))
Then distribute the sum over the product.

Key Things to Remember
- All variables being summed out must appear to the right of your summation.
- The result of your summation should be a function of the remaining variables.
P(W) = Σ_{C,S,R} P(C, S, R, W)
     = Σ_{C,S,R} P(C) P(S | C) P(R | C) P(W | R, S)
     = Σ_{S,R} P(W | R, S) Σ_C P(S | C) P(C) P(R | C)   (no terms with C outside the inner sum!)
The result is a function of R and S.

Beyond Variable Elimination I
- Variable elimination must be rerun for every new query.
- It is possible to compile a Bayes net into a new data structure to make repeated queries more efficient.
- Recall that inference in trees is linear.
- Define a cluster tree where clusters = sets of original variables; we can infer the original probabilities from the cluster probabilities.
- Intuition: the network C → S, C → R, S → W ← R becomes the cluster tree C - SR - W.

Beyond Variable Elimination II
- Compiling can be expensive if we are forced to make big clusters.
- Big clusters correspond to inefficient variable elimination. (But there is no guarantee that efficient elimination is possible in general.)
- For networks without good elimination schemes:
  - Sampling (discussed briefly)
  - Cutsets (not covered in this class)
  - Variational methods (not covered in this class)
  - Loopy belief propagation (not covered in this class)

Sampling
A Bayes net is an example of a generative model of a probability distribution. Generative models allow one to generate samples from the distribution in a natural way.
Sampling algorithm:
- While some variables are not sampled:
  - Pick a variable x with no unsampled parents.
  - Assign this variable a value drawn from P(x | parents(x)).
- Do this n times. Compute P(a) by counting the fraction of samples in which a is true.
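A sketch of this sampling algorithm (ancestral, or forward, sampling), reusing the hypothetical `parents` and `cpt` tables from the earlier Flu-network sketch:

```python
import random

VARS = ["F", "A", "S", "H", "N"]

def sample_once():
    event = {}
    unsampled = set(VARS)
    while unsampled:                                       # while some variables are not sampled
        var = next(v for v in unsampled
                   if all(p in event for p in parents[v])) # pick one with no unsampled parents
        p_true = cpt[var][tuple(event[p] for p in parents[var])]
        event[var] = random.random() < p_true              # draw from P(var | parents(var))
        unsampled.remove(var)
    return event

samples = [sample_once() for _ in range(100_000)]
print(sum(s["H"] for s in samples) / len(samples))         # estimate of P(H = true) by counting
```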

Sampling Example
Suppose you want to compute P(H). Start with the parentless nodes:
- Flip a coin to decide F based on P(F). (Say F = true.)
- Flip a coin to decide A based on P(A). (Say A = false.)
- Flip a coin to decide S based on P(S | f, ¬a). (Say S = true.)
- Flip a coin to decide H based on P(H | s). (Say H = true.)
This now becomes a single sample for H. We need to repeat the entire process many times to estimate P(H). (There is no need to sample N for this query.)

Sampling with Observed Evidence
Suppose you know H = true and want to know P(F | h). How can we use sampling?
Count the fraction of times F is true/false among samples where H is also true.
But what if we flip a coin for H and it turns out false?
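One simple answer, sketched below, is rejection sampling: throw away any sample that disagrees with the evidence. This reuses `sample_once` from the earlier sketch.

```python
def rejection_estimate_P_F_given_h(n=100_000):
    kept = [s for s in (sample_once() for _ in range(n)) if s["H"]]   # keep samples with H = true
    if not kept:
        return None   # evidence never occurred; need more samples (or a smarter method)
    return sum(s["F"] for s in kept) / len(kept)   # fraction of kept samples with F = true

print(rejection_estimate_P_F_given_h())
```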

Comments on Sampling
- Sampling is the easiest algorithm to implement. We can compute marginal or conditional distributions by counting.
- It is not efficient in general.
- Problem: How do we handle observed values?
  - Rejection sampling: quit and start over when mismatches occur.
  - Importance sampling: use a reweighting trick to compensate for mismatches.
- Low-probability events are still a problem (low importance weights mean that you need many samples to get a good estimate of the probability).
- Much more clever approaches to sampling are possible, though the mismatch between the sampling (proposal) distribution and reality is a constant concern.

Bayes Net Summary
- A Bayes net is a data structure for a joint distribution; it can give an exponential reduction in storage.
- Variable elimination and its variants for tree-ish networks are simple, elegant methods that are efficient for many networks.
- For some networks, we must use approximation.
- BNs are a major success story for modern AI:
  - They do the right thing (no ugly approximations baked in).
  - They exploit structure in the problem to reduce storage and computation.
  - They are not always efficient, but the inefficient cases are well understood.
  - They work and are used in practice.