Bayesian Networks. Why Joint Distributions are Important. CompSci 570 Ron Parr Department of Computer Science Duke University

Size: px

Start display at page:

Download "Bayesian Networks. Why Joint Distributions are Important. CompSci 570 Ron Parr Department of Computer Science Duke University"

Anissa Griffin
5 years ago
Views:

1 Bayesian Networks CompSci 570 Ron Parr Department of Computer Science Duke University Why Joint Distributions are Important Joint distribu6ons gives P(X 1 X n ) Classifica6on/Diagnosis Suppose X 1 =disease X 2 X n = symptoms Co-occurrence Suppose X 3 =lung cancer X 5 =smoking Rare event Detec6on Suppose X 1 X n = parameters of a credit card transac6on Call card holder if P(X 1 X n ) is below threshold? 1

2 Modeling Joint Distributions To do this correctly, we need a full assignment of probabili<es to all atomic events Unwieldy in general for discrete variables: n binary variables = 2 n atomic events Independence makes this tractable, but too strong (rarely holds) Condi<onal independence is a good compromise: Weaker than independence, but s<ll has great poten<al to simplify things Overview Conditional independence Bayesian networks Variable Elimination Sampling 2

3 Condi&onal Independence Suppose we know the following: The flu causes sinus inflammation Allergies cause sinus inflammation inflammation causes a runny nose inflammation causes headaches How are these connected? Example 1: Simple graphical structure Flu Allergy Knowing sinus separates the variables from each other. 3

4 Example 2: Naïve Bayes Spam Filter S P(S) P(W 1 S) P(W n S) W 1 W 2 W n We will see later why this is a particularly convenient representation. (Does it make a correct assumption?) Condi&onal Independence We say that two variables, A and B, are conditionally independent given C if: P(A BC) = P(A C) P(AB C) = P(A C)P(B C) How does this help? We store only a conditional probability table (CPT) of each variable given its parents Naïve Bayes (e.g. Spam Assassin) is a special case of this! 4

5 Notation Reminder P(A B) is a condi6onal prob. distribu6on It is a func6on! P(A=true B=true), P(A=true B=false), P(A=false B=True), P(A=false B=true) P(A b) is a probability distribu6on, func6on P(a B) is a func6on, not a distribu6on P(a b) is a number What is Bayes Net? A directed acyclic graph (DAG) Given parents, each variable is independent of non-descendants Joint probability decomposes: P(x 1...x n ) = P(x i parents(x i ))!! i For each node X i, store P(X i parents(x i )) Call this a CondiFonal Probability Table (CPT) CPT size is exponenfal in number of parents 5

6 Real Applications of Bayes Nets Diagnosis of lymph node disease Used by robots to iden;fy meteorites to study Study the human genome: Alex Hartemink et al. Used in MS Windows Many other applica;ons Space Efficiency Flu Allergy Entire joint distribution as 32 (31) entries P(H S),P(N S) have 4 (2) P(S AF) has 8 (4) P(A) has 2 (1) Total is 20 (10) This can require exponentially less space Space problem is solved for most problems 6

7 Naïve Bayes Space Efficiency S P(S) P(W 1 S) P(W n S) W 1 W 2 W n Entire Joint distribution has 2 n+1 (2 n+1-1) numbers vs. 4n+2 (2n+1) (Non)Uniqueness of Bayes Nets I Suppose you have two variables that are NOT independent Two possible networks: A is parent of B B is parent of A Which is right? There is no wrong answer! Each network can express arbitrary P(AB) Network does NOT encode causal or temporal dynamics 7

8 (Non)Uniqueness of Bayes Nets Can construct valid Bayes net by adding variables incrementally For each new variable, connect all influencing variables as parents new variables never become parents of existing variables (how does this ensure that all variables are conditionally independent of non-descendents given parents?) Different order can lead to different Bayesian networks for the same distribution Suppose A and B are uniform, C=(A B) A B C P P(A) A P(C AB) C P(B) B P(a)=P(a )=P(b)=P(b9)=0.5 P(c)=0.75,P(c )=0.25 P(ab)=P(ab9)=P(a b)=p(a b9)=0.25 P(c ab)=p(c ab9)=p(c a b)=1.0, P(c a b9)=0 (only showing c=true case) Add A, then B, then C case 8

9 Suppose A and B are uniform, C=(A B) A B C P P(A) A C P(B) B P(c)=0.75,P(c )=0.25 P(ac)=0.5, P(ac )=0, P(a c)=p(a c )=0.25 P(a c)=2/3, P(a c)=1/3, P(a c )= P(a c )=0 P(b ac)=1/2, P(b ac )=P(b a c )=0, P(b a c)=1 (only showing b=true case) Add C, then A, then B case Atomic Event Probabili1es P(x 1...x n ) = P(x i parents(x i ))!! i Flu Allergy Recall that this is guaranteed true if we construct net incrementally, so that for each new variable added, we connect all influencing variables as parents 9

10 Doing Things the Hard Way P(f h) = P(fh) P(h) = SAN SANF P( fhsan) = P(hSANF) SAN SANF P( f)p(a)p(s Af)P(h S)P(N S) P(F)P(A)P(S AF)P(h S)P(N S) defn. of conditional probability P(hSANF) = x p(x parents(x))!! = P(h S)P(N S)P(S AF)P(A)P(F) Doing this naïvely, we need to sum over all atomic events defined over these variables. There are exponen?ally many of these. Working Smarter Flu Allergy P(h) =!! SANF P(hSANF) = P(h S)P(N S)P(S AF)P(A)P(F) SANF = P(h S)P(N S) NS AF = P(h S) P(N S) S N AF P(S AF)P(A)P(F) P(S AF)P(A)P(F) Poten:al for exponen:al reduc:on in computa:on. 10

11 Understanding Nota.on Flu Allergy P(h) =!! SANF P(hSANF) = P(h S)P(N S)P(S AF)P(A)P(F) SANF = P(h S)P(N S) NS AF = P(h S) P(N S) S N P(S AF)P(A)P(F) AF P(S AF)P(A)P(F) 8 combina.ons 3 binary variables 2 variables summed out Result is a func.on of S Understanding Notation Flu Allergy P(h) =!! SANF P(hSANF) = P(h S)P(N S)P(S AF)P(A)P(F) SANF = P(h S)P(N S) NS AF = P(h S) P(N S) S N P(S AF)P(A)P(F) AF P(S AF)P(A)P(F) 4 combina:ons 2 binary variables 1 variable summed out Result is a func:on of S 11

12 Understanding Notation Flu Allergy P(h) =!! SANF P(hSANF) = P(h S)P(N S)P(S AF)P(A)P(F) SANF = P(h S)P(N S) NS AF = P(h S) P(N S) S N P(S AF)P(A)P(F) 2 combina:ons of 1 binary variables 1 variable summed out Result is a number AF P(S AF)P(A)P(F) Computa(onal Efficiency P(hSANF) = P(h S)P(N S)P(S AF)P(A)P(F) SANF SANF = P(h S) P(N S)!! S N AF P(S AF)P(A)P(F) The distribu(ve law allows us to decompose the sum. AKA: Variable elimina(on Analogous to Gaussian elimina(on but exponen(ally more expensive Potential for an exponential reduction in computation costs. 12

13 Naïve Bayes Efficiency S P(S) P(W 1 S) P(W n S) W 1 W 2 W n Given a set of words, we want to know which is larger: P(s W 1 W n ) or P( s W 1 W n ). Use Bayes Rule: P(S W 1...W n ) = P(W...W S)P(S) 1 n!! P(W 1...W n ) Naïve Bayes Efficiency II P(W 1 S) S P(S) P(S W 1...W n ) = P(W...W S)P(S) 1 n!! P(W 1...W n ) W 1 W 2 W n Observa;on 1: We can ignore P(W 1 W n ) Observa;on 2: P(S) is given n Observa;on 3: P(W 1 W n S) is easy: P(W 1...W n S) = P(W i S)!! i=1 Note: Can also do variable elimina;on by summing out leaves first. 13

14 Checkpoint BNs can give us an exponen1al reduc1on in the space required to represent a joint distribu1on. Storage is exponen1al in largest parent set. Claim: Parent sets are o@en reasonable. Claim: Inference cost is o@en reasonable. Ques1on: Can we quan1fy rela1onship between structure and inference cost? Now the Bad News In full generality: Computing P(X) is NPhard Decision problem: Is P(X)>0? We reduce from 3SAT 3SAT variables map to BN variables Clauses become variables with the corresponding SAT variables as parents 14

15 Reduction!! (X! X X ) (X! X X ) X1 X2 X3 C1 C2 Problem: What if we have a large number of clauses? How does this fit into our decision problem framework? X4 And Trees We could make a single variable which is the AND of all of our clauses, but this would have CPT that is exponen@al in the number of clauses. X1 X2 C1 A1 Implement as a tree of ANDs. This is polynomial. X3 C2 A3 X4 A2 15

16 Is BN Inference NP Complete? Can show that BN inference is #P hard #P is counting the number of satisfying assignments Idea: Assign variables uniform probability Probability of conjunction of clauses tells us how many assignments are satisfying Checkpoint BNs can be very compact Worst case: Inference is intractable Hope that worst case is: Avoidable (frequently, but no free lunch) Easily characterized in some way 16

17 Clues in the Graphical Structure Q: How does graphical structure relate to our ability to push in summa<ons over variables? A: We relate summa<ons to graph opera<ons Summing out a variable = Removing node(s) from DAG Crea<ng new replacement node Relate graph proper<es to computa<onal efficiency Variable Elimination as Graph Operations We can think of summing out a variable as creating a new super variable which contains all of that variable s neighbors 17

18 Another Example Network Cloudy!! P(c) = 0.5 P(s c) = 0.1!! P(s c!) = 0.5 Sprinkler Rain P(r c) = 0.8!! P(r c!) = 0.2 P(w sr) = 0.99 P(w sr!) = 0.9 P(w s!r) = 0.9!! P(w s!r!) = 0.0 W. Grass Marginal Probabilities Suppose we want P(W): P(W) = P( CSRW)!! CSR = P( C)P(S C)P(R C)P(W RS) CSR SR C = P(W RS) P(S C)P(C)P(R C) 18

19 P(C)=0.5 Sprinkler Eliminating Cloudy Cloudy P(sr) = 0.5*0.1* *0.5*0.2 = 0.09 P(sr!) = 0.5*0.1* *0.5*0.8 = 0.21 P(s!r) = 0.5*0.9* *0.5*0.2 = 0.41 Rain!! P(s!r!) = 0.5*0.9* *0.5*0.8 = 0.29 P(S C) = 0.1!! P(S C!) = 0.5 W. Grass P(R C) = 0.8!! P(R C!) = 0.2 Sprinkler Rain P(W) = P( CSRW)!! CSR = P( C)P(S C)P(R C)P(W RS) CSR SR C = P(W RS) P(S C)P(C)P(R C) W. Grass Elimina'ng Sprinkler/Rain P(sr) = 0.09 Sprinkler Rain P(sr!) = 0.21 P(s!r) = 0.41!! P(s!r!) = 0.29 W. Grass P(w sr) = 0.99 P(w sr!) = 0.9 P(w s!r) = 0.9!! P(w s!r!) = 0.0 P(w) = P(w RS)P(RS) SR = 0.09* * * *0!! =

20 Dealing With Evidence Suppose we have observed that the grass is wet? What is the probability that it has rained? P(R W) = αp(rw)!! = α P( CSRW) CS = α P( C)P(S C)P(R C)P(W RS) CS C = α P(R C)P(C) P(S C)P(W RS) Is there a more clever way to deal with w? Only keep the relevant parts. S Efficiency of Variable Elimina1on Exponential in the largest domain size of new variables created Equivalently: Exponential in largest function created by pushing in summations (sum-product algorithm) Linear for trees Almost linear for almost trees J 20

21 Naïve Bayes Efficiency S P(S) P(W 1 S) P(W n S) W 1 W 2 W n Another way to understand why Naïve Bayes is efficient: It s a tree! Facts About Variable Elimination Picking variables in op9mal order is NP hard For some networks, there will be no elimina9on ordering that results in a poly 9me solu9on (Must be the case unless P=NP) Polynomial for trees Need to get a liele fancier if there are a large number of query variables or evidence variables 21

22 Beyond Variable Elimina0on Variable elimination must be rerun for every new query Possible to compile a Bayes net into a new data structure to make repeated queries more efficient Recall that inference in trees is linear Define a cluster tree where Clusters = sets of original variables Can infer original probs from cluster probs For networks w/o good elimination schemes Sampling (discussed briefly) Cutsets (not covered in this class) Variational methods (not covered in this class) Loopy belief propagation (not covered in this class) Sampling A Bayes net is an example of a genera&ve model of a probability distribu8on Genera8ve models allow one to generate samples from a distribu8on in a natural way Sampling algorithm: Do n 8mes: While some variables are not sampled Pick variable x with no unsampled parents Assign this variable a value from p(x parents(x)) Compute P(a) by coun8ng in what frac8on a is true 22

23 Sampling Example Suppose you want to compute P(H) Start with the parentless nodes: Flip a coin to decide F based on P(F) Flu Allergy F=true Flu Allergy F=true Flu Allergy Flip a coin to decide A based on P(A) F=true A=false F=true A=false Flu Allergy Flip a coin to decide S based on P(S fa ) Flu Allergy 23

24 F=true Flu A=false Allergy F=true Flu A=false Allergy S=true S=true Flip a coin to decide H based on P(H s) This now becomes a single sample for H. Need to repeat encre process man Cmes to escmate P(H)! H=true F=true Flu S=true A=false Allergy No need to sample N for this query Sampling with Observed Evidence Suppose you know H=true Want to know P(F h)? How can we use sampling? F=true Flu S=true A=false Allergy Count fraction of times F is true/false when H is also true H=true But what if we flip a coin For H and it turns out false? 24

25 Comments on Sampling Sampling is the easiest algorithm to implement Can compute marginal or condi5onal distribu5ons by coun5ng Not efficient in general Problem: How do we handle observed values? Rejec5on sampling: Quit and start over when mismatches occur Importance sampling: Use a reweigh5ng trick to compensate for mismatches Low probability events are s5ll a problem (low importance weights mean that you need many samples to get a good es5mate of probability) Much more clever approaches to sampling are possible, though mismatch between sampling (proposal) distribu5on and reality is a constant concern Other Algorithms Exact: Can compile a non-tree BN into an equivalent factor graph by clustering variables No free lunch: Factors may be very large Approximate: If factors get too large, can approximate by splifng them mini-bucket or variahonal approximahons Usually equivalent to making some sort of independence assumphon that may or may not be reasonable Loopy belief propagahon treats non-tree Bayesian networks like trees iterahve algorithm that converges in some cases to something close to the correct answer 25

26 Bayes Net Summary Bayes net = data structure for joint distribution Can give exponential reduction in storage Variable elimination and variants for tree-ish networks: simple, elegant methods efficient for many networks For some networks, must use approximation BNs are a major success story for modern AI BNs do the right thing (no ugly approximations) Exploit structure in problem to reduce storage/computation Not always efficient, but inefficient cases are well understood Work and used in practice 26

Bayesian Networks. CompSci 270 Duke University Ron Parr. Why Joint Distributions are Important

Bayesian Networks. CompSci 270 Duke University Ron Parr. Why Joint Distributions are Important Bayesian Networks CompSci 270 Duke University Ron Parr Why Joint Distributions are Important Joint distributions gives P(X 1 X n ) Classification/Diagnosis Suppose X1=disease X2 Xn = symptoms Co-occurrence