Directed Graphical Models. William W. Cohen Machine Learning



1 Directed Graphical Models William W. Cohen Machine Learning

2 MOTIVATION FOR GRAPHICAL MODELS

3 Recap: A paradox of induction. A black crow seems to support the hypothesis "all crows are black." A pink highlighter supports the hypothesis "all non-black things are non-crows." Thus, a pink highlighter supports the hypothesis "all crows are black." $\forall x\; \mathrm{CROW}(x) \Rightarrow \mathrm{BLACK}(x)$, or equivalently $\forall x\; \neg\mathrm{BLACK}(x) \Rightarrow \neg\mathrm{CROW}(x)$. whut?

4 whut? [Figure: 2×2 table, crows vs. non-crows by black vs. not black.] B = black, C = crow: collect statistics for $P(B=b \mid C=c)$.

5 Logical reasoning versus common-sense reasoning BLACK(jim) FLY(jim) BIRD(jim) EATS(jim,carrion)

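6 Another difficult problem: commonsense reasoning. Tweety is a bird: $B(\mathrm{tweety})$. Most birds can fly: $\forall^{*} x : B(x) \Rightarrow F(x)$ (the star marks "most"). Opus is a penguin: $Pg(\mathrm{opus})$. Penguins are birds: $\forall x,\ Pg(x) \Rightarrow B(x)$. Penguins cannot fly: $\forall x : Pg(x) \Rightarrow \neg F(x)$. We'd like to be able to conclude that Opus cannot fly and Tweety can: $\neg F(\mathrm{opus})$, $F(\mathrm{tweety})$. Logically, though, reading the starred rule as a strict implication also yields $F(\mathrm{opus})$, which contradicts $\neg F(\mathrm{opus})$. This is the problem of default reasoning.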

7 Another difficult problem: commonsense reasoning. Tweety is a bird. Most birds can fly. Opus is a penguin. Penguins are birds. Penguins cannot fly. We'd like to be able to conclude: Opus cannot fly, and Tweety can. $B(\mathrm{tweety})$; $\forall^{*} x : B(x) \wedge \neg Pg(x) \Rightarrow F(x)$; $Pg(\mathrm{opus})$; $\forall x,\ Pg(x) \Rightarrow B(x)$; $\forall x : Pg(x) \Rightarrow \neg F(x)$. $\neg F(\mathrm{opus})$ follows, but $F(\mathrm{tweety})$? NO: $F(\mathrm{tweety})$ is only provable if he's provably NOT a penguin. This is default reasoning.

9 Another difficult problem: commonsense reasoning. Tweety is a bird. Most birds can fly. Opus is a penguin. Penguins are birds. Penguins cannot fly. We'd like to be able to conclude: Opus cannot fly, and Tweety can. $B(\mathrm{tweety})$; $\forall^{*} x : B(x) \wedge \neg Pg(x) \wedge \neg Dodo(x) \wedge \neg Dead(x) \wedge \ldots \Rightarrow F(x)$. $\neg F(\mathrm{opus})$ follows, but $F(\mathrm{tweety})$? NO: $F(\mathrm{tweety})$ is only provable if he's provably not a penguin, and not dead, and so on. This is default reasoning.

10 Recap: The Joint Distribution. Example: Boolean variables A, B, C. Recipe for making a joint distribution of M variables: 1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have $2^M$ rows). 2. For each combination of values, say how probable it is. [Table: columns A, B, C, Prob.]

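11 Another difficult problem: commonsense reasoning. $B(\mathrm{tweety})$; $\forall^{*} x : B(x) \wedge \neg Pg(x) \Rightarrow F(x)$; $Pg(\mathrm{opus})$; $\forall x,\ Pg(x) \Rightarrow B(x)$; $\forall x : Pg(x) \Rightarrow \neg F(x)$. Desired conclusions: $F(\mathrm{tweety})$, $\neg F(\mathrm{opus})$.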

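12 Another difficult problem: commonsense reasoning. $\Pr(F, B, Pg) = \Pr(F \mid B, Pg)\,\Pr(B \mid Pg)\,\Pr(Pg)$. [Figure: network over Pg, B, and F, with nodes annotated by $\Pr(Pg)$, $\Pr(B \mid Pg)$, and $\Pr(F \mid B, Pg)$.]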

13 $\Pr(F, B, Pg) = \Pr(F \mid B, Pg)\,\Pr(B \mid Pg)\,\Pr(Pg)$. This is the joint for an experiment: I pick an object, say in Frick Park, and measure: can it fly, is it a bird, is it a penguin. Tweety is a bird. Most non-penguin birds can fly: $\forall^{*} x : B(x) \wedge \neg Pg(x) \Rightarrow F(x)$. Opus is a penguin. Penguins are birds: $\forall x,\ Pg(x) \Rightarrow B(x)$. Penguins cannot fly. [CPT for Pg: columns Pr(Pg=0), Pr(Pg=1). CPT for B: columns Pg, Pr(B=0 | Pg), Pr(B=1 | Pg). CPT for F: columns Pg, B, Pr(F=0 | Pg,B), Pr(F=1 | Pg,B).]

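14 $\Pr(F, B, Pg) = \Pr(F \mid B, Pg)\,\Pr(B \mid Pg)\,\Pr(Pg)$. Tweety is a bird. Most birds can fly: $\forall^{*} x : B(x) \wedge \neg Pg(x) \Rightarrow F(x)$. Opus is a penguin. Penguins are birds. Penguins cannot fly. [CPT for F: columns Pg, B, Pr(F=0 | Pg,B), Pr(F=1 | Pg,B).] Can Opus fly? $\Pr(F=1 \mid Pg=1) = \sum_{b \in \{0,1\}} \Pr(F=1 \mid B=b, Pg=1)\,\Pr(B=b \mid Pg=1) = \Pr(F=1 \mid B=1, Pg=1)\,\Pr(B=1 \mid Pg=1) + \Pr(F=1 \mid B=0, Pg=1)\,\Pr(B=0 \mid Pg=1)$. Unlikely: since penguins are birds, $\Pr(B=1 \mid Pg=1) = 1$ and $\Pr(B=0 \mid Pg=1) = 0$, and since penguins cannot fly, $\Pr(F=1 \mid B=1, Pg=1) = 0$.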

15 $\Pr(F, B, Pg) = \Pr(F \mid B, Pg)\,\Pr(B \mid Pg)\,\Pr(Pg)$. Tweety is a bird. Most birds can fly. Opus is a penguin. Penguins are birds. Penguins cannot fly. Can Tweety fly? $\Pr(F=1 \mid B=1) = \Pr(F=1, B=1) / \Pr(B=1)$, where $\Pr(F=1, B=1) = \sum_{pg \in \{0,1\}} \Pr(F=1 \mid B=1, Pg=pg)\,\Pr(B=1 \mid Pg=pg)\,\Pr(Pg=pg) = \Pr(F=1 \mid B=1, Pg=1)\,\Pr(B=1 \mid Pg=1)\,\Pr(Pg=1) + \Pr(F=1 \mid B=1, Pg=0)\,\Pr(B=1 \mid Pg=0)\,\Pr(Pg=0)$. The first term is the case "Tweety is a flying penguin"; the second is "Tweety is a flying non-penguin bird." If flying penguins are rare, it depends: do non-penguin birds fly? ($\Pr(F \mid B=1, Pg=0)$); are all or most birds non-penguins? ($\Pr(B=1 \mid Pg=0)$); are non-penguins common? ($\Pr(Pg=0)$).
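The two queries on the last two slides can be checked numerically. The sketch below is only an illustration: the CPT numbers are invented (the slides give none), and the helper names `pr_b` and `joint` are hypothetical. It evaluates Pr(F=1 | Pg=1) by summing over B, and Pr(F=1 | B=1) by summing the joint over Pg and dividing by Pr(B=1).

```python
# Hypothetical CPT values, chosen only for illustration (not from the slides).
p_pg = {1: 0.01, 0: 0.99}                        # Pr(Pg = pg)
p_b1_given_pg = {1: 1.0, 0: 0.2}                 # Pr(B = 1 | Pg = pg)
p_f1_given_b_pg = {(1, 1): 0.0, (0, 1): 0.0,     # Pr(F = 1 | B = b, Pg = pg),
                   (1, 0): 0.9, (0, 0): 0.05}    # keyed by (b, pg)


def pr_b(b, pg):
    """Pr(B = b | Pg = pg)."""
    return p_b1_given_pg[pg] if b == 1 else 1.0 - p_b1_given_pg[pg]


def joint(f, b, pg):
    """Pr(F = f, B = b, Pg = pg) = Pr(F | B, Pg) Pr(B | Pg) Pr(Pg)."""
    pf1 = p_f1_given_b_pg[(b, pg)]
    pf = pf1 if f == 1 else 1.0 - pf1
    return pf * pr_b(b, pg) * p_pg[pg]


# Can Opus fly?  Pr(F=1 | Pg=1) = sum_b Pr(F=1 | B=b, Pg=1) Pr(B=b | Pg=1)
opus = sum(p_f1_given_b_pg[(b, 1)] * pr_b(b, 1) for b in (0, 1))

# Can Tweety fly?  Pr(F=1 | B=1) = Pr(F=1, B=1) / Pr(B=1)
numerator = sum(joint(1, 1, pg) for pg in (0, 1))
denominator = sum(joint(f, 1, pg) for f in (0, 1) for pg in (0, 1))
tweety = numerator / denominator

print("Pr(Opus flies)   =", opus)
print("Pr(Tweety flies) =", round(tweety, 4))
```

With these made-up numbers Opus almost surely cannot fly and Tweety probably can, which is the commonsense conclusion the logical encoding failed to deliver.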

16 Quiz. cid=421 cid=420 $\Pr(F=1 \mid B=1) = \Pr(F=1, B=1) / \Pr(B=1)$, where $\Pr(F=1, B=1) = \sum_{pg \in \{0,1\}} \Pr(F=1 \mid B=1, Pg=pg)\,\Pr(B=1 \mid Pg=pg)\,\Pr(Pg=pg) = \Pr(F=1 \mid B=1, Pg=1)\,\Pr(B=1 \mid Pg=1)\,\Pr(Pg=1) + \Pr(F=1 \mid B=1, Pg=0)\,\Pr(B=1 \mid Pg=0)\,\Pr(Pg=0)$.


17 Another difficult problem: commonsense reasoning. $\Pr(F, B, Pg) = \Pr(F \mid B, Pg)\,\Pr(B \mid Pg)\,\Pr(Pg)$. Have we solved the common-sense reasoning problem? No: how do we (1) choose the conditional probabilities we need to model a task and (2) use them algorithmically to answer questions? No: how do we invent numbers for all the rows of the CPTs? [Figure: network annotated with $\Pr(Pg)$, $\Pr(B \mid Pg)$, and $\Pr(F \mid B, Pg)$.]

18 Another difficult problem: commonsense reasoning. $\Pr(F, B, Pg) = \Pr(F \mid B, Pg)\,\Pr(B \mid Pg)\,\Pr(Pg)$. Have we solved the common-sense reasoning problem? Yes: we use directed graphical models. Semantics: how to specify them. Inference: how to use them. Learning: how to find parameters. [Figure: network annotated with $\Pr(Pg)$, $\Pr(B \mid Pg)$, and $\Pr(F \mid B, Pg)$.]

19 Probabilities and probabilistic inference. Why is logic attractive? - There are well-understood algorithms for reasoning with a logical theory. E.g., we can use a computer to determine if $B(x) \Rightarrow F(x)$. What about probabilities? - We can do some math manually and answer many questions. Not really satisfying. - We can answer questions algorithmically with the joint. E.g., we can compute $\Pr(F=1 \mid B=1)$. But: this is not tractable for large models. - How can we answer questions algorithmically and efficiently? Answer: graphical models.

20 Probabilities and probabilistic inference Directed graphical models - Today: examples and semantics - Wednesday: inference algorithms - Next week: learning in graphical models using graphical models to specify learning algorithms: Naïve Bayes, LDA, HMMs,

21 GRAPHICAL MODELS: SEMANTICS AND DEFINITIONS

22 Example: practical problem 1 made easy. I have 3 standard d20 dice, 1 loaded die. Experiment: (A) pick a d20 uniformly at random, then (B) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)? Network: A → B, with $P(A,B) = P(B \mid A)\,P(A)$.

A        P(A)
Fair     0.75
Loaded   0.25

A        B             P(B | A)
Fair     Critical      0.1
Fair     Noncritical   0.9
Loaded   Critical      0.5
Loaded   Noncritical   0.5

23 Example: practical problem 1 made easy. We have the information we need to answer other questions as well. I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)? What is $\Pr(A=1 \mid B=1)$? This is an example of inference. Network: A → B, with $P(A,B) = P(B \mid A)\,P(A)$, where P(A=fair) = 0.75, P(A=loaded) = 0.25, P(B=critical | A=fair) = 0.1, P(B=noncritical | A=fair) = 0.9, P(B=critical | A=loaded) = 0.5, P(B=noncritical | A=loaded) = 0.5.
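Both questions on this slide can be answered mechanically from the two CPTs. The following is a minimal sketch using the numbers given on the slide; the function names `joint`, `marginal_b`, and `posterior_a` are hypothetical helpers, not part of the lecture.

```python
# CPTs from the slide: P(A) and P(B | A).
p_a = {"fair": 0.75, "loaded": 0.25}
p_b_given_a = {
    ("critical", "fair"): 0.1, ("noncritical", "fair"): 0.9,
    ("critical", "loaded"): 0.5, ("noncritical", "loaded"): 0.5,
}


def joint(a, b):
    """P(A = a, B = b) = P(B = b | A = a) P(A = a)."""
    return p_b_given_a[(b, a)] * p_a[a]


def marginal_b(b):
    """P(B = b), obtained by summing the joint over A."""
    return sum(joint(a, b) for a in p_a)


def posterior_a(a, b):
    """P(A = a | B = b) by Bayes's rule."""
    return joint(a, b) / marginal_b(b)


p_critical = marginal_b("critical")
p_fair_given_critical = posterior_a("fair", "critical")
print(round(p_critical, 4), round(p_fair_given_critical, 4))
```

Up to floating-point rounding, P(B=critical) is 0.75 × 0.1 + 0.25 × 0.5 = 0.2, and P(A=fair | B=critical) is 0.075 / 0.2 = 0.375, so seeing a critical roll makes the loaded die the more likely explanation.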


24 Example: practical problem 1 made easy. In general, any chain-rule decomposition gives a DGM G: G has one node per random variable. If $P(X \mid Y_1, \ldots, Y_k)$ is a factor in the decomposition, then G has edges $Y_1 \to X, \ldots, Y_k \to X$. X is annotated with a conditional probability table (CPT) encoding $P(X=x \mid Y_1=y_1, \ldots, Y_k=y_k)$ for each tuple $(x, y_1, \ldots, y_k)$. Here we applied the chain rule: $P(A,B) = P(B \mid A)\,P(A)$, giving the network A → B with P(A=fair) = 0.75, P(A=loaded) = 0.25, P(B=critical | A=fair) = 0.1, P(B=noncritical | A=fair) = 0.9, P(B=critical | A=loaded) = 0.5, P(B=noncritical | A=loaded) = 0.5.
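The rule on this slide (one node per variable, one edge from each conditioning variable to the conditioned one) is easy to state as code. This is a small sketch under the assumption that a decomposition is written as a list of (child, parents) pairs; the function name `dgm_from_factors` is hypothetical.

```python
def dgm_from_factors(factors):
    """Build (nodes, edges) from a chain-rule decomposition.

    factors: list of (child, parents) pairs, one per factor P(child | parents).
    Each factor contributes an edge parent -> child for every parent.
    """
    nodes = []
    edges = []
    for child, parents in factors:
        for variable in [child, *parents]:
            if variable not in nodes:
                nodes.append(variable)
        edges.extend((parent, child) for parent in parents)
    return nodes, edges


# P(A, B) = P(B | A) P(A) gives the network A -> B.
nodes, edges = dgm_from_factors([("B", ["A"]), ("A", [])])
print(nodes, edges)
```

The CPT attached to each node is then indexed by the child's value together with the values of exactly the parents listed in its factor.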

25 There's more than one network for any distribution. I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die. Network: B → A, with $P(A,B) = P(A \mid B)\,P(B)$. [Table P(B): rows critical, non-critical. Table P(A | B): rows (Critical, Fair), (Noncritical, Fair), (Critical, Loaded), (Noncritical, Loaded).]


26 There's more than one network for any distribution. I have 3 standard d20 dice, 1 loaded die. Experiment: (1) pick a d20 uniformly at random, then (2) roll it. Let A = the d20 picked is fair and B = roll 19 or 20 with that die. What is P(B)? Two networks: A → B with $P(A,B) = P(B \mid A)\,P(A)$, and B → A with $P(A,B) = P(A \mid B)\,P(B)$. The moral: we have several distinct things here: a generative story or causal model; a joint probability distribution, e.g. $P(A,B)$; a decomposition, $P(A,B) = P(B \mid A)\,P(A)$; and another decomposition, $P(A,B) = P(A \mid B)\,P(B)$, which is totally valid! It's usually cleaner to pick one that fits a generative story.

27 There's more than one network for any distribution. $P(A,B,C,D) = P(A \mid B,C,D)\,P(B,C,D) = P(A \mid B,C,D)\,P(B \mid C,D)\,P(C,D) = P(A \mid B,C,D)\,P(B \mid C,D)\,P(C \mid D)\,P(D)$. Alternatively, $P(A,B,C,D) = P(D \mid A,B,C)\,P(A,B,C) = P(D \mid A,B,C)\,P(C \mid A,B)\,P(A,B) = P(D \mid A,B,C)\,P(C \mid A,B)\,P(B \mid A)\,P(A)$. There are lots of decompositions of a model with N variables. They are all correct. Some are better than others.


28 There's more than one network for any distribution. Suppose there are some conditional independencies: $P(A \mid B,C,D) = P(A \mid B)$ and $P(B \mid C,D) = P(B \mid C)$. Then $P(A,B,C,D) = P(A \mid B,C,D)\,P(B,C,D) = P(A \mid B,C,D)\,P(B \mid C,D)\,P(C,D) = P(A \mid B,C,D)\,P(B \mid C,D)\,P(C \mid D)\,P(D) = P(A \mid B)\,P(B \mid C)\,P(C \mid D)\,P(D)$. Compare $P(A,B,C,D) = P(D \mid A,B,C)\,P(A,B,C) = P(D \mid A,B,C)\,P(C \mid A,B)\,P(A,B) = P(D \mid A,B,C)\,P(C \mid A,B)\,P(B \mid A)\,P(A)$. The first decomposition can be simplified and compressed; the second can't.
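To see how much the simplification saves, one can count CPT rows before and after. The sketch below assumes all four variables are Boolean (the slide does not say) and counts one row per joint assignment to a variable and its parents; `cpt_rows` is a hypothetical helper.

```python
def cpt_rows(factors, arity=2):
    """Total CPT rows across all factors.

    factors: list of (child, parents) pairs; parents is a string of
    single-letter variable names. Each CPT has one row per joint assignment
    to the child and its parents, i.e. arity ** (1 + number of parents).
    """
    return sum(arity ** (1 + len(parents)) for _child, parents in factors)


# P(A|B,C,D) P(B|C,D) P(C|D) P(D): the unsimplified chain-rule decomposition.
full = [("A", "BCD"), ("B", "CD"), ("C", "D"), ("D", "")]
# P(A|B) P(B|C) P(C|D) P(D): after applying the conditional independencies.
simplified = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "")]

print(cpt_rows(full), cpt_rows(simplified))
```

Under the Boolean assumption the unsimplified decomposition needs 16 + 8 + 4 + 2 = 30 rows and the simplified one 4 + 4 + 4 + 2 = 14, and the gap widens quickly as variables are added.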


29 The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing a goat! You now can either: stick with your guess; always change doors; or flip a coin and pick a new door randomly according to the coin.

30 Example: practical problem 2

31 The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a prize. Behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing a goat! You now can either stick with your guess or change doors. Variables: A = first guess, B = the money, C = the revealed goat, D = stick or swap?, E = second guess. [Figure: network with edges A → C, B → C, A → E, C → E, D → E, annotated with CPTs P(A), P(B), P(D), P(C | A,B).] P(D): Stick 0.5, Swap 0.5. $P(C=c \mid A=a, B=b)$ = 1.0 if $(a \neq b) \wedge (c \notin \{a,b\})$; 0.5 if $(a = b) \wedge (c \notin \{a,b\})$; 0 otherwise.

32 The (highly practical) Monty Hall problem. $P(E=e \mid A=a, C=c, D=d)$ = 1.0 if $(e = a) \wedge (d = \mathrm{stick})$; 1.0 if $(e \notin \{a,c\}) \wedge (d = \mathrm{swap})$; 0 otherwise. If you stick: you win if your first guess was right. If you swap: you win if your first guess was wrong. $P(C=c \mid A=a, B=b)$ = 1.0 if $(a \neq b) \wedge (c \notin \{a,b\})$; 0.5 if $(a = b) \wedge (c \notin \{a,b\})$; 0 otherwise. [Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess), annotated with CPTs P(A), P(B), P(C | A,B), P(E | A,C,D).]

33 The (highly practical) Monty Hall problem. We could construct the joint and compute $P(E=B \mid D=\mathrm{swap})$: $P(A,B,C,D,E) = P(E \mid A,C,D)\,P(D)\,P(C \mid A,B)\,P(B)\,P(A)$. [Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess), annotated with CPTs P(A), P(B), P(C | A,B), P(E | A,C,D).]

34 The (highly practical) Monty Hall problem. We could construct the joint and compute $P(E=B \mid D=\mathrm{swap})$, again by the chain rule: $P(A,B,C,D,E) = P(E \mid A,B,C,D)\,P(D \mid A,B,C)\,P(C \mid A,B)\,P(B \mid A)\,P(A)$. [Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess), annotated with CPTs P(A), P(B), P(C | A,B), P(E | A,C,D).]
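The query P(E=B | D=swap) can be computed by brute-force enumeration of the factored joint. The sketch below uses the CPTs for C, D, and E as given on the slides; the uniform priors on A and B are an assumption (the transcript shows the tables P(A) and P(B) without their values), and all function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

DOORS = (1, 2, 3)


def p_a(a):
    """First guess. Uniform over doors: an assumption, not from the slides."""
    return Fraction(1, 3)


def p_b(b):
    """Location of the money. Uniform over doors: also an assumption."""
    return Fraction(1, 3)


def p_d(d):
    """Stick or swap, each with probability 0.5 as on the slide."""
    return Fraction(1, 2)


def p_c(c, a, b):
    """P(C=c | A=a, B=b): the host reveals a goat that is not your door."""
    if c in (a, b):
        return Fraction(0)
    return Fraction(1) if a != b else Fraction(1, 2)


def p_e(e, a, c, d):
    """P(E=e | A=a, C=c, D=d): the second guess is determined by the policy."""
    if d == "stick":
        return Fraction(int(e == a))
    return Fraction(int(e not in (a, c)))


def joint(a, b, c, d, e):
    return p_e(e, a, c, d) * p_d(d) * p_c(c, a, b) * p_b(b) * p_a(a)


def p_win_given(d):
    """P(E = B | D = d), by summing the joint over A, B, C, E."""
    worlds = list(product(DOORS, repeat=4))
    num = sum(joint(a, b, c, d, e) for a, b, c, e in worlds if e == b)
    den = sum(joint(a, b, c, d, e) for a, b, c, e in worlds)
    return num / den


print(p_win_given("swap"), p_win_given("stick"))
```

Under these assumptions swapping wins with probability 2/3 and sticking with probability 1/3. The enumeration touches every row of the joint, which is exactly the cost that the inference algorithms promised for later lectures are meant to avoid.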

35 The (highly practical) Monty Hall problem. The joint table has 3*3*3*2*3 = 162 rows. The conditional probability tables (CPTs) shown have 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows. Big questions: why are the CPTs smaller? How much smaller are the CPTs than the joint? Can we compute the answers to queries like $P(E=B \mid d)$ without building the joint probability table, just using the CPTs? [Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess), annotated with CPTs P(A), P(B), P(C | A,B), P(E | A,C,D).]

36 The (highly practical) Monty Hall problem. Why is the CPT representation smaller? Follow the money! (B) $P(E=e \mid A=a, B=b, C=c, D=d) = P(E=e \mid A=a, C=c, D=d)$, which is 1.0 if $(e = a) \wedge (d = \mathrm{stick})$; 1.0 if $(e \notin \{a,c\}) \wedge (d = \mathrm{swap})$; 0 otherwise. E is conditionally independent of B given A, D, C: $E \perp B \mid A, C, D$, written I<E, {A,C,D}, B>. [Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess), annotated with CPTs P(A), P(B), P(C | A,B), P(E | A,C,D).]

37 Conditional Independence formalized. Definition: R and L are conditionally independent given M if for all x, y, z in {T, F}: $P(R=x \mid M=y \wedge L=z) = P(R=x \mid M=y)$. More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets, P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments).
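The set-based definition can be tested directly against a joint table. The sketch below checks the equivalent product form P(S1,S2,S3) P(S3) = P(S1,S3) P(S2,S3), which avoids dividing by zero. The example joint is built from a chain L → M → R with invented numbers; that structure, the numbers, and the helper names `marginal` and `cond_indep` are all assumptions made for illustration.

```python
from itertools import product


def marginal(joint, names, keep):
    """Sum a joint table (dict: assignment tuple -> probability) down to `keep`."""
    idx = [names.index(v) for v in keep]
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def cond_indep(joint, names, s1, s2, s3, tol=1e-9):
    """True if s1 and s2 are conditionally independent given s3.

    Checks P(s1, s2, s3) * P(s3) == P(s1, s3) * P(s2, s3) for every
    assignment, which matches the slide's definition whenever the
    conditioning event has positive probability.
    """
    p123 = marginal(joint, names, s1 + s2 + s3)
    p13 = marginal(joint, names, s1 + s3)
    p23 = marginal(joint, names, s2 + s3)
    p3 = marginal(joint, names, s3)
    n1, n2 = len(s1), len(s2)
    for key, p in p123.items():
        v1, v2, v3 = key[:n1], key[n1:n1 + n2], key[n1 + n2:]
        if abs(p * p3[v3] - p13[v1 + v3] * p23[v2 + v3]) > tol:
            return False
    return True


# Hypothetical chain L -> M -> R with made-up CPT values.
P_L1 = 0.3                      # P(L = 1)
P_M1 = {1: 0.9, 0: 0.2}         # P(M = 1 | L)
P_R1 = {1: 0.7, 0: 0.1}         # P(R = 1 | M)

names = ["L", "M", "R"]
joint = {}
for l, m, r in product((0, 1), repeat=3):
    pl = P_L1 if l else 1 - P_L1
    pm = P_M1[l] if m else 1 - P_M1[l]
    pr = P_R1[m] if r else 1 - P_R1[m]
    joint[(l, m, r)] = pl * pm * pr

print(cond_indep(joint, names, ["R"], ["L"], ["M"]))  # independent given M
print(cond_indep(joint, names, ["R"], ["L"], []))     # dependent with no evidence
```

This check needs the full joint table, so it is only practical for small models; reading independencies off the graph instead is the subject of the later d-separation slides.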

38 The (highly practical) Monty Hall problem. What are the conditional independencies? I<A, {B}, C>? I<A, {C}, B>? I<E, {A,C}, B>? I<D, {E}, B>? [Figure: network with nodes A (first guess), B (the money), C (the goat), D (stick or swap?), E (second guess).]


40 Recap: Bayes Nets Formalized. A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where: - V is a set of vertices. - E is a set of directed edges joining vertices. No directed cycles of any length are allowed. Each vertex in V contains the following information: - The name of a random variable. - A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parent values.

41 Building a Bayes Net. Choose a set of relevant variables. Choose an ordering for them. Assume they're called $X_1, \ldots, X_m$ (where $X_1$ is first in the ordering, etc.). For i = 1 to m: - Add the $X_i$ node to the network. - Set Parents($X_i$) to be a minimal subset of $\{X_1, \ldots, X_{i-1}\}$ such that we have conditional independence of $X_i$ and all other members of $\{X_1, \ldots, X_{i-1}\}$ given Parents($X_i$). - Define the probability table of $P(X_i = k \mid \text{Assignments of Parents}(X_i))$.

42 The general case.
$P(X_1=x_1 \wedge X_2=x_2 \wedge \ldots \wedge X_{n-1}=x_{n-1} \wedge X_n=x_n)$
$= P(X_n=x_n \wedge X_{n-1}=x_{n-1} \wedge \ldots \wedge X_2=x_2 \wedge X_1=x_1)$
$= P(X_n=x_n \mid X_{n-1}=x_{n-1} \wedge \ldots \wedge X_1=x_1) \cdot P(X_{n-1}=x_{n-1} \wedge \ldots \wedge X_1=x_1)$
$= P(X_n=x_n \mid X_{n-1}=x_{n-1} \wedge \ldots \wedge X_1=x_1) \cdot P(X_{n-1}=x_{n-1} \mid X_{n-2}=x_{n-2} \wedge \ldots \wedge X_1=x_1) \cdot P(X_{n-2}=x_{n-2} \wedge \ldots \wedge X_1=x_1)$
$= \ldots$
$= \prod_{i=1}^{n} P\big((X_i=x_i) \mid (X_{i-1}=x_{i-1}) \wedge \ldots \wedge (X_1=x_1)\big)$
$= \prod_{i=1}^{n} P\big((X_i=x_i) \mid \text{Assignments of Parents}(X_i)\big)$
So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.

43 Question: given a network can I find a chain-rule decomposition of the joint?

44 GRAPHICAL MODELS: DETERMINING CONDITIONAL INDEPENDENCIES

45 What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true: each variable is conditionally independent of all its nondescendants in the graph given the value of all its parents. This follows from $P(X_1 \ldots X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i)) = \prod_{i=1}^{n} P(X_i \mid X_1 \ldots X_{i-1})$. But what else does it imply?

46 What Independencies does a Bayes Net Model? Example: Z → Y → X. Given Y, does learning the value of Z tell us nothing new about X? I.e., is $P(X \mid Y, Z)$ equal to $P(X \mid Y)$? Yes. Since we know the value of all of X's parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, $P(Z \mid Y, X) = P(Z \mid Y)$.

47 Quick proof that independence is symmetric. Assume: $P(X \mid Y, Z) = P(X \mid Y)$. Then:
$P(Z \mid X, Y) = \dfrac{P(X, Y \mid Z)\,P(Z)}{P(X, Y)}$ (Bayes's Rule)
$= \dfrac{P(Y \mid Z)\,P(X \mid Y, Z)\,P(Z)}{P(X \mid Y)\,P(Y)}$ (Chain Rule)
$= \dfrac{P(Y \mid Z)\,P(X \mid Y)\,P(Z)}{P(X \mid Y)\,P(Y)}$ (By Assumption)
$= \dfrac{P(Y \mid Z)\,P(Z)}{P(Y)} = P(Z \mid Y)$ (Bayes's Rule)

48 What Independencies does a Bayes Net Model? Let I<X,Y,Z> represent X and Z being conditionally independent given Y. [Figure: network X ← Y → Z.] I<X,Y,Z>? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant.

49 What Independencies does a Bayes Net Model? [Figure: network over U, V, X, Z.] I<X,{U},Z>? No. I<X,{U,V},Z>? Yes. Maybe I<X, S, Z> iff S acts as a cutset between X and Z in an undirected version of the graph?

50 Things get a little more confusing. [Figure: network X → Y ← Z.] X has no parents, so we know all its parents' values trivially. Z is not a descendant of X. So, I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y. What if we do know the value of Y, though? Or one of its descendants?

51 The Burglar Alarm example. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh! [Figure: network Burglar → Alarm ← Earthquake, Alarm → Phone Call.]

52 Things get a lot more confusing. But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar. But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>! [Figure: network Burglar → Alarm ← Earthquake, Alarm → Phone Call.]
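Explaining away can be seen numerically by enumerating the joint for the burglar-alarm network. In the sketch below every CPT value is invented for illustration (the slides give none), and the helper names `bern`, `joint`, and `prob` are hypothetical; only the network structure comes from the slides.

```python
from itertools import product

# Hypothetical CPT values, chosen only for illustration.
P_B = 0.01                                   # Pr(Burglar = 1)
P_E = 0.02                                   # Pr(Earthquake = 1)
P_A = {(1, 1): 0.95, (1, 0): 0.94,           # Pr(Alarm = 1 | Burglar, Earthquake)
       (0, 1): 0.29, (0, 0): 0.001}
P_C = {1: 0.9, 0: 0.05}                      # Pr(Call = 1 | Alarm)


def bern(p, value):
    """Probability that a binary variable with Pr(1) = p takes `value`."""
    return p if value == 1 else 1.0 - p


def joint(b, e, a, c):
    return bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a) * bern(P_C[a], c)


def prob(query, evidence):
    """Pr(query | evidence); both are dicts over the names 'b', 'e', 'a', 'c'."""
    num = den = 0.0
    for b, e, a, c in product((0, 1), repeat=4):
        world = {"b": b, "e": e, "a": a, "c": c}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(b, e, a, c)
        den += p
        if all(world[k] == v for k, v in query.items()):
            num += p
    return num / den


prior = prob({"b": 1}, {})
given_quake_only = prob({"b": 1}, {"e": 1})
after_call = prob({"b": 1}, {"c": 1})
after_call_and_quake = prob({"b": 1}, {"c": 1, "e": 1})
print(prior, given_quake_only, after_call, after_call_and_quake)
```

With these numbers, learning about the earthquake alone leaves the burglary probability at its prior, the phone call raises it sharply, and the earthquake then pulls it most of the way back down: Burglar and Earthquake are independent with no evidence but dependent once the call is observed.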

54 d-separation to the rescue. Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation. Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true: ... I.e., X and Z are dependent iff there exists an unblocked path.

55 A path is "blocked" when... There exists a variable Y on the path such that - it is in the evidence set E - the arcs putting Y in the path are "tail-to-tail". (Unknown "common causes" of X and Z impose dependency.) Or, there exists a variable Y on the path such that - it is in the evidence set E - the arcs putting Y in the path are "tail-to-head". (Unknown "causal chains" connecting X and Z impose dependency.) Or, ...

56 A path is "blocked" when... (the funky case) Or, there exists a variable Y on the path such that - it is NOT in the evidence set E - neither are any of its descendants - the arcs putting Y on the path are "head-to-head". (Known "common symptoms" of X and Z impose dependencies: X may "explain away" Z.)

57 d-separation to the rescue, cont'd. Theorem [Verma & Pearl, 1988]: - If a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then I<X, E, Z>. d-separation can be computed in linear time using a depth-first-search-like algorithm. Be careful: d-separation finds what must be conditionally independent. "Might": variables may actually be independent when they're not d-separated, depending on the actual probabilities involved.
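The three blocking conditions translate almost line by line into code. The sketch below is a brute-force version that enumerates every undirected simple path and tests each interior node; it is not the linear-time algorithm the slide mentions, and it is only meant for small graphs. The graph encoding (a dict from node to list of children) and the function names are assumptions; the example graph is the burglar-alarm network from the earlier slides.

```python
def descendants(graph, node):
    """All nodes reachable from `node` by following directed edges."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def d_separated(graph, x, z, evidence):
    """True if every undirected path from x to z is blocked given `evidence`.

    graph: dict mapping each node to the list of its children.
    """
    evidence = set(evidence)
    edges = {(p, c) for p, children in graph.items() for c in children}
    nodes = set(graph) | {c for _p, c in edges}
    neighbors = {n: set() for n in nodes}
    for p, c in edges:
        neighbors[p].add(c)
        neighbors[c].add(p)

    def blocked(path):
        for prev, y, nxt in zip(path, path[1:], path[2:]):
            head_to_head = (prev, y) in edges and (nxt, y) in edges
            if head_to_head:
                # Blocks unless y or one of its descendants is observed.
                if y not in evidence and not (descendants(graph, y) & evidence):
                    return True
            elif y in evidence:
                # Tail-to-tail or tail-to-head: blocks when y is observed.
                return True
        return False

    def paths(current, path):
        if current == z:
            yield path
            return
        for n in neighbors[current]:
            if n not in path:
                yield from paths(n, path + [n])

    return all(blocked(p) for p in paths(x, [x]))


alarm_net = {
    "Burglar": ["Alarm"],
    "Earthquake": ["Alarm"],
    "Alarm": ["PhoneCall"],
}
print(d_separated(alarm_net, "Burglar", "Earthquake", []))
print(d_separated(alarm_net, "Burglar", "Earthquake", ["PhoneCall"]))
print(d_separated(alarm_net, "Burglar", "PhoneCall", ["Alarm"]))
```

On this network the checks agree with the burglar-alarm discussion: Burglar and Earthquake are d-separated with no evidence, become connected once Phone Call (a descendant of the head-to-head node Alarm) is observed, and Burglar is cut off from Phone Call once Alarm itself is known.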

58 d-separation example. [Figure: network over nodes A, B, C, D, E, F, G, H, I, J.] I<C, {}, D>? I<C, {A}, D>? I<C, {A, B}, D>? I<C, {A, B, J}, D>? I<C, {A, B, E, J}, D>?
