
1 Bayes Nets: Conditional Independence
- Fundamental assumption: beyond the chain rule, the Bayes net encodes conditional independence assumptions,
  P(x_i | x_1, ..., x_{i-1}) = P(x_i | Parents(X_i))
- plus additional conditional independencies that can be read off the graph.
[Diagram: a chain network over W, X, Y, Z.]
- Some CI assumptions come directly from simplifications in the chain rule; other CI assumptions are implied.

2 Bayes Nets: Conditional Independence (2)
- Given a BN, a common yet important question: are two nodes independent given certain evidence?
  - If yes, prove it using probabilistic algebra.
  - If no, give a counterexample.
- Consider the following example: are X and Z independent?  [Chain: X → Y → Z]

3 D-Separation
- Answers independence queries about variables in a BN.
- Study independence properties for triples: causal chain, common cause, and common effect.
- Analyze complex cases in terms of their member triples.
- d-separation: a condition/algorithm for answering these queries.

4 D-Separation: Causal Chains
- X: transfer money, Y: non-zero balance, Z: withdraw money  [X → Y → Z]
  P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
- Is it guaranteed that X is independent of Z?
  - No! Proof by counterexample.
  - Consider the case where transferring money causes a non-zero balance, which causes withdrawal:
    P(y|x) = 1, P(¬y|¬x) = 1, P(z|y) = 1, P(¬z|¬y) = 1, and the evidence P(x) = 0.8
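A quick numeric check of this counterexample, as a minimal sketch (the dictionary encoding and helper names are ours): with these deterministic CPTs, Z simply copies X, so P(z) = 0.8 while P(z|x) = 1, hence X and Z are dependent.

```python
# Verify the causal-chain counterexample X -> Y -> Z numerically.
# CPTs: P(y|x)=1, P(~y|~x)=1, P(z|y)=1, P(~z|~y)=1, prior P(x)=0.8.
from itertools import product

P_x = {True: 0.8, False: 0.2}
# keys are (child value, parent value) -> probability
P_y_given_x = {(True, True): 1.0, (False, True): 0.0,
               (True, False): 0.0, (False, False): 1.0}
P_z_given_y = {(True, True): 1.0, (False, True): 0.0,
               (True, False): 0.0, (False, False): 1.0}

def joint(x, y, z):
    # chain-rule factorization P(x) P(y|x) P(z|y)
    return P_x[x] * P_y_given_x[(y, x)] * P_z_given_y[(z, y)]

P_z = sum(joint(x, y, True) for x, y in product([True, False], repeat=2))
P_zx = sum(joint(True, y, True) for y in [True, False])
print(P_z)               # P(z)   = 0.8
print(P_zx / P_x[True])  # P(z|x) = 1.0, so X and Z are dependent
```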

5 D-Separation: Causal Chains
- X: transfer money, Y: non-zero balance, Z: withdraw money  [X → Y → Z]
  P(X, Y, Z) = P(X) P(Y|X) P(Z|Y)
- Is it guaranteed that X is independent of Z given Y? Yes:
  P(z | x, y) = P(x) P(y|x) P(z|y) / (P(x) P(y|x)) = P(z|y)
- Evidence along the chain blocks the influence.

6 D-Separation: Common Cause
- X: thin attendance, Y: project is due, Z: students are busy  [X ← Y → Z]
  P(X, Y, Z) = P(Y) P(X|Y) P(Z|Y)
- Is it guaranteed that X is independent of Z?
  - No! Proof by counterexample.
  - Consider the scenario where the project being due causes both thin attendance and busy students:
    P(x|y) = 1, P(¬x|¬y) = 1, P(z|y) = 1, P(¬z|¬y) = 1, and the evidence P(y) = 0.8

7 D-Separation: Common Cause
- X: thin attendance, Y: project is due, Z: students are busy  [X ← Y → Z]
  P(X, Y, Z) = P(Y) P(X|Y) P(Z|Y)
- Is it guaranteed that X is independent of Z given Y? Yes: observing the cause blocks the influence between the effects.

8 D-Separation: Common Effect
- X: raining, Y: cricket match, Z: traffic  [X → Z ← Y]
  P(X, Y, Z) = P(X) P(Y) P(Z|X, Y)
- Is it guaranteed that X is independent of Y? Yes: summing the factorization over z gives P(X, Y) = P(X) P(Y), so the causes are independent a priori.

9 D-Separation: Common Effect
- X: raining, Y: cricket match, Z: traffic  [X → Z ← Y]
  P(X, Y, Z) = P(X) P(Y) P(Z|X, Y)
- Is it guaranteed that X is independent of Y given Z?
  - No: seeing traffic puts the rain and the cricket match in competition as explanations.
- This is different from the other cases: observing an effect activates influence between its possible causes.

10 D-Separation: Active/Inactive Paths
- An (undirected) path is active if each triple along it is active; active paths carry dependencies between nodes.
[Figure: the three canonical triples (causal chain, common cause, common effect) in their active and inactive, evidence-observed, configurations.]

11 D-separation
- Given a BN, are two variables X and Y independent (given evidence Z)?
  - Yes, if X and Y are d-separated by Z.
  - No active paths implies independence.
  - A single inactive segment is sufficient to block a path.
- Analyze the graph: check for the three canonical forms along every path (a code sketch of this check follows).
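The recipe above translates into code directly. Below is a minimal sketch (our own encoding: a BN is a dict mapping each node to its list of children, evidence is a set of observed nodes); it enumerates undirected paths and tests every triple along each path, which is fine for small graphs but exponential in general.

```python
# d-separation by path enumeration and triple checks.
def descendants(graph, node):
    out, stack = set(), list(graph.get(node, []))
    while stack:
        n = stack.pop()
        if n not in out:
            out.add(n)
            stack.extend(graph.get(n, []))
    return out

def triple_active(graph, a, b, c, evidence):
    """Is the triple a - b - c active given the evidence?"""
    if b in graph.get(a, []) and c in graph.get(b, []):   # chain a -> b -> c
        return b not in evidence
    if b in graph.get(c, []) and a in graph.get(b, []):   # chain c -> b -> a
        return b not in evidence
    if a in graph.get(b, []) and c in graph.get(b, []):   # common cause a <- b -> c
        return b not in evidence
    # common effect a -> b <- c: active iff b or a descendant of b is observed
    return b in evidence or bool(descendants(graph, b) & evidence)

def d_separated(graph, x, y, evidence):
    """True iff every undirected path between x and y is blocked."""
    undirected = {}
    for n, kids in graph.items():
        undirected.setdefault(n, set()).update(kids)
        for k in kids:
            undirected.setdefault(k, set()).add(n)
    def paths(node, goal, seen):
        if node == goal:
            yield [node]
            return
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                for p in paths(nxt, goal, seen | {nxt}):
                    yield [node] + p
    for path in paths(x, y, {x}):
        if all(triple_active(graph, path[i], path[i+1], path[i+2], evidence)
               for i in range(len(path) - 2)):
            return False   # found an active path
    return True

g = {'X': ['Y'], 'Y': ['Z'], 'Z': []}    # the chain X -> Y -> Z
print(d_separated(g, 'X', 'Z', set()))   # False: the chain is active
print(d_separated(g, 'X', 'Z', {'Y'}))   # True: evidence blocks it
```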


14 Example
Queries on the graph in the figure (nodes X, Y, and two Z nodes):
- X independent of Y?
- X independent of Y given Z?
- X independent of Y given Z? (the other Z)
[Figure: a small example graph over X, Y, and two Z nodes.]

15 Example (2)
Queries on the graph in the figure:
- C independent of D?
- A independent of E given E?
- A independent of D?
- A independent of D given E?
- A independent of D given B, E?
[Figure: example graph over nodes A, B, C, D, E.]

16 Conditional Independence in BNs
[Figure: (a) a node X is conditionally independent of its non-descendants given its parents U_1, ..., U_m; (b) a node X is conditionally independent of all other nodes given its Markov blanket: its parents U_1, ..., U_m, its children Y_1, ..., Y_n, and its children's other parents Z_1j, ..., Z_nj.]

17 Probabilistic Inference in BNs
- The graphical independence representation yields efficient inference schemes.
- We generally want to compute P(X | E), where E is the evidence from sensory measurements (known values for some variables).
- Sometimes we may want to compute just P(X).

18 Probabilistic Inference in BNs
- Enumeration
- Variable elimination
- Approximate inference
  - Sampling

19 Inference by Enumeration
General case:
- Evidence variables: E_1, ..., E_k = e_1, ..., e_k
- Query* variable: Q
- Hidden variables: H_1, ..., H_r
(together, these are all the variables X_1, ..., X_n)
We want: P(Q | e_1, ..., e_k)   (*works fine with multiple query variables, too)
- Step 1: Select the entries consistent with the evidence.
- Step 2: Sum out H to get the joint of the query and the evidence:
  P(Q, e_1, ..., e_k) = Σ_{h_1, ..., h_r} P(Q, h_1, ..., h_r, e_1, ..., e_k)
- Step 3: Normalize, multiplying by 1/Z with Z = Σ_q P(q, e_1, ..., e_k).

20 Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network  [B, E → A → J, M]:
  P(B | j, m) = P(B, j, m) / P(j, m)
              = α P(B, j, m)
              = α Σ_e Σ_a P(B, e, a, j, m)
Rewrite full joint entries using products of CPT entries:
  P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a|B,e) P(j|a) P(m|a)
              = α P(B) Σ_e P(e) Σ_a P(a|B,e) P(j|a) P(m|a)
Recursive depth-first enumeration: O(n) space, O(d^n) time.

21 Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
    extend e with value x_i for X
    Q(x_i) ← Enumerate-All(Vars[bn], e)
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | Pa(Y)) × Enumerate-All(Rest(vars), e)
    else return Σ_y P(y | Pa(Y)) × Enumerate-All(Rest(vars), e_y)
         where e_y is e extended with Y = y
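A runnable Python rendering of this pseudocode, as a sketch: the variable ordering, structure, and CPT values below are the standard burglary example (B, E → A → J, M), and only the probability of True is stored per CPT row.

```python
# Enumeration-Ask / Enumerate-All on the alarm network.
VARS = ['B', 'E', 'A', 'J', 'M']          # topological order
PARENTS = {'B': [], 'E': [], 'A': ['B', 'E'], 'J': ['A'], 'M': ['A']}
CPT = {  # P(var=True | parent values)
    'B': {(): 0.001},
    'E': {(): 0.002},
    'A': {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    'J': {(True,): 0.90, (False,): 0.05},
    'M': {(True,): 0.70, (False,): 0.01},
}

def prob(var, value, e):
    p_true = CPT[var][tuple(e[p] for p in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(vars_left, e):
    if not vars_left:
        return 1.0
    Y, rest = vars_left[0], vars_left[1:]
    if Y in e:
        return prob(Y, e[Y], e) * enumerate_all(rest, e)
    return sum(prob(Y, y, {**e, Y: y}) * enumerate_all(rest, {**e, Y: y})
               for y in (True, False))

def enumeration_ask(X, e):
    q = {x: enumerate_all(VARS, {**e, X: x}) for x in (True, False)}
    z = sum(q.values())
    return {x: p / z for x, p in q.items()}

print(enumeration_ask('B', {'J': True, 'M': True}))
# ≈ {True: 0.284, False: 0.716}
```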

22 Evaluation tree
[Figure: evaluation tree for α P(b) Σ_e P(e) Σ_a P(a|b,e) P(j|a) P(m|a), with P(b) = .001; branches P(e) = .002 / P(¬e) = .998, then P(a|b,e) = .95 / P(¬a|b,e) = .05 and P(a|b,¬e) = .94 / P(¬a|b,¬e) = .06, with leaves P(j|a) = .90 / P(j|¬a) = .05 and the corresponding P(m|a), P(m|¬a) terms.]
Enumeration is inefficient: repeated computation, e.g. it computes P(j|a) P(m|a) for each value of e.

23 Factor Zoo

24 Factor Zoo I
- Joint distribution: P(X, Y)
  - Entries P(x, y) for all x, y
  - Sums to 1

    T     W     P
    hot   sun   0.4
    hot   rain  0.1
    cold  sun   0.2
    cold  rain  0.3

- Selected joint: P(x, Y)
  - A slice of the joint distribution
  - Entries P(x, y) for fixed x, all y
  - Sums to P(x)

    T     W     P
    cold  sun   0.2
    cold  rain  0.3

- Number of capitals = dimensionality of the table

25 Factor Zoo II
- Single conditional: P(Y | x)
  - Entries P(y | x) for fixed x, all y
  - Sums to 1

    T     W     P
    cold  sun   0.4
    cold  rain  0.6

- Family of conditionals: P(Y | X)
  - Multiple conditionals
  - Entries P(y | x) for all x, y
  - Sums to |X|

    T     W     P
    hot   sun   0.8
    hot   rain  0.2
    cold  sun   0.4
    cold  rain  0.6

26 Factor Zoo III
- Specified family: P(y | X)
  - Entries P(y | x) for fixed y, but for all x
  - Sums to... who knows!

    T     W     P
    hot   rain  0.2
    cold  rain  0.6

27 Example: Traffic Domain
- Random variables: R: raining; T: traffic; L: late for class!  [R → T → L]
- P(L) = ?
  P(L) = Σ_{r,t} P(r, t, L) = Σ_{r,t} P(r) P(t|r) P(L|t)
- CPTs:

    P(R):     +r 0.1 | -r 0.9
    P(T | R): +r: +t 0.8, -t 0.2 | -r: +t 0.1, -t 0.9
    P(L | T): +t: +l 0.3, -l 0.7 | -t: +l 0.1, -l 0.9

28 Inference by Enumeration: Procedural Outline
- Track objects called factors.
- Initial factors are the local CPTs (one per node): P(R), P(T|R), P(L|T) above.
- Any known values are selected. E.g., if we know L = +l, the initial factors are:

    P(R):      +r 0.1 | -r 0.9
    P(T | R):  +r: +t 0.8, -t 0.2 | -r: +t 0.1, -t 0.9
    P(+l | T): +t +l 0.3 | -t +l 0.1

- Procedure: join all factors, then eliminate all hidden variables.

29 Operation 1: Join Factors
- First basic operation: joining factors.
- Combining factors is just like a database join:
  - Get all factors over the joining variable.
  - Build a new factor over the union of the variables involved.
- Example: join on R, combining P(R) and P(T|R) into P(R, T):

    +r +t 0.08   (= 0.1 × 0.8)
    +r -t 0.02   (= 0.1 × 0.2)
    -r +t 0.09   (= 0.9 × 0.1)
    -r -t 0.81   (= 0.9 × 0.9)

- Computation for each entry: pointwise products.
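A small sketch of the join operation (the dict-based factor encoding is ours: a factor is a schema listing its variables plus a table keyed by value tuples, with +1/-1 standing for +/-):

```python
# Join two factors by pointwise product over the union of their variables.
from itertools import product

def join(schema1, f1, schema2, f2):
    out_schema = schema1 + [v for v in schema2 if v not in schema1]
    out = {}
    for assign in product((+1, -1), repeat=len(out_schema)):
        row = dict(zip(out_schema, assign))
        key1 = tuple(row[v] for v in schema1)
        key2 = tuple(row[v] for v in schema2)
        out[assign] = f1[key1] * f2[key2]   # pointwise product
    return out_schema, out

P_R = {(+1,): 0.1, (-1,): 0.9}
P_T_given_R = {(+1, +1): 0.8, (+1, -1): 0.2,   # keys are (r, t)
               (-1, +1): 0.1, (-1, -1): 0.9}
schema, P_RT = join(['R'], P_R, ['R', 'T'], P_T_given_R)
print(schema)   # ['R', 'T']
print(P_RT)     # {(1,1): 0.08, (1,-1): 0.02, (-1,1): 0.09, (-1,-1): 0.81}
                # (up to float round-off)
```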

30 Example: Multiple Joins
- Start: P(R), P(T|R), P(L|T)  [R → T → L]
- Join R: gives P(R, T) and leaves P(L|T) untouched:

    P(R, T): +r +t 0.08 | +r -t 0.02 | -r +t 0.09 | -r -t 0.81

- Join T: gives P(R, T, L):

    +r +t +l 0.024 | +r +t -l 0.056
    +r -t +l 0.002 | +r -t -l 0.018
    -r +t +l 0.027 | -r +t -l 0.063
    -r -t +l 0.081 | -r -t -l 0.729

31 Operation 2: Eliminate
- Second basic operation: marginalization.
  - Take a factor and sum out a variable.
  - Shrinks a factor to a smaller one.
  - A projection operation.
- Example: summing R out of P(R, T):

    +r +t 0.08 | +r -t 0.02 | -r +t 0.09 | -r -t 0.81   →   +t 0.17 | -t 0.83
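And a matching sketch of elimination, using the same dict-based factor encoding as the join example above:

```python
# Sum out (marginalize) one variable from a factor.
def sum_out(schema, factor, var):
    i = schema.index(var)
    out_schema = schema[:i] + schema[i+1:]
    out = {}
    for assign, p in factor.items():
        key = assign[:i] + assign[i+1:]
        out[key] = out.get(key, 0.0) + p   # collapse rows differing only in var
    return out_schema, out

P_RT = {(+1, +1): 0.08, (+1, -1): 0.02, (-1, +1): 0.09, (-1, -1): 0.81}
print(sum_out(['R', 'T'], P_RT, 'R'))
# (['T'], {(+1,): 0.17, (-1,): 0.83})
```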

32 Multiple Elimination
- P(R, T, L):

    +r +t +l 0.024 | +r +t -l 0.056 | +r -t +l 0.002 | +r -t -l 0.018
    -r +t +l 0.027 | -r +t -l 0.063 | -r -t +l 0.081 | -r -t -l 0.729

- Sum out R → P(T, L):

    +t +l 0.051 | +t -l 0.119 | -t +l 0.083 | -t -l 0.747

- Sum out T → P(L):

    +l 0.134 | -l 0.866

33 Thus Far: Multiple Join, Multiple Eliminate (= Inference by Enumeration)

34 Marginalizing Early (= Variable Elimination)

35 Marginalizing Early! (aka VE)
Join R → Sum out R → Join T → Sum out T, on the R → T → L network:
- Initial factors: P(R) [+r 0.1 | -r 0.9], P(T|R) [+r: +t 0.8, -t 0.2 | -r: +t 0.1, -t 0.9], P(L|T) [+t: +l 0.3, -l 0.7 | -t: +l 0.1, -l 0.9]
- Join R: P(R, T) [+r +t 0.08 | +r -t 0.02 | -r +t 0.09 | -r -t 0.81], P(L|T)
- Sum out R: P(T) [+t 0.17 | -t 0.83], P(L|T)
- Join T: P(T, L) [+t +l 0.051 | +t -l 0.119 | -t +l 0.083 | -t -l 0.747]
- Sum out T: P(L) [+l 0.134 | -l 0.866]

36 Traffic Domain
P(L) = ?  [R → T → L]
- Inference by Enumeration:
  P(L) = Σ_t Σ_r P(L|t) P(r) P(t|r)
  (join on r, join on t, eliminate r, eliminate t)
- Variable Elimination:
  P(L) = Σ_t P(L|t) Σ_r P(r) P(t|r)
  (join on r, eliminate r, join on t, eliminate t)

37 Evidence
- If there is evidence, start with factors that select that evidence.
- No evidence uses these initial factors:

    P(R):     +r 0.1 | -r 0.9
    P(T | R): +r: +t 0.8, -t 0.2 | -r: +t 0.1, -t 0.9
    P(L | T): +t: +l 0.3, -l 0.7 | -t: +l 0.1, -l 0.9

- Computing P(L | +r), the initial factors become:

    P(+r):     +r 0.1
    P(T | +r): +r +t 0.8 | +r -t 0.2
    P(L | T):  +t: +l 0.3, -l 0.7 | -t: +l 0.1, -l 0.9

- We eliminate all variables other than the query and the evidence.

38 Evidence II
- The result will be a selected joint of query and evidence.
- E.g., for P(L | +r), we would end up with the joint P(+r, L):

    +r +l 0.026 | +r -l 0.074

- Normalize:

    +l 0.26 | -l 0.74

- To get our answer, just normalize this! That's it!

39 General Variable Elimination
- Query: P(Q | E_1 = e_1, ..., E_k = e_k)
- Start with initial factors: local CPTs (but instantiated by evidence).
- While there are still hidden variables (not Q or evidence):
  - Pick a hidden variable H.
  - Join all factors mentioning H.
  - Eliminate (sum out) H.
- Join all remaining factors and normalize. (A sketch of the full loop follows.)
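Here is a compact end-to-end sketch of this loop on the traffic network (same dict-based factors as in the join/eliminate examples; the function names, schemas, and +1/-1 encoding are our own):

```python
# General variable elimination: instantiate evidence, eliminate hidden
# variables one by one, then join the remaining factors and normalize.
from itertools import product

def join(s1, f1, s2, f2):
    schema = s1 + [v for v in s2 if v not in s1]
    out = {}
    for a in product((+1, -1), repeat=len(schema)):
        row = dict(zip(schema, a))
        out[a] = f1[tuple(row[v] for v in s1)] * f2[tuple(row[v] for v in s2)]
    return schema, out

def sum_out(schema, f, var):
    i = schema.index(var)
    out = {}
    for a, p in f.items():
        k = a[:i] + a[i+1:]
        out[k] = out.get(k, 0.0) + p
    return schema[:i] + schema[i+1:], out

def variable_elimination(factors, query, evidence, hidden):
    # 1. Instantiate evidence: keep consistent rows, drop evidence vars.
    for idx, (schema, f) in enumerate(factors):
        for var, val in evidence.items():
            if var in schema:
                i = schema.index(var)
                f = {a[:i] + a[i+1:]: p for a, p in f.items() if a[i] == val}
                schema = schema[:i] + schema[i+1:]
        factors[idx] = (schema, f)
    # 2. For each hidden variable: join its factors, then sum it out.
    for h in hidden:
        touching = [f for f in factors if h in f[0]]
        factors = [f for f in factors if h not in f[0]]
        s, f = touching[0]
        for s2, f2 in touching[1:]:
            s, f = join(s, f, s2, f2)
        factors.append(sum_out(s, f, h))
    # 3. Join the remaining factors and normalize.
    s, f = factors[0]
    for s2, f2 in factors[1:]:
        s, f = join(s, f, s2, f2)
    z = sum(f.values())
    return {a: p / z for a, p in f.items()}

traffic = [(['R'], {(+1,): 0.1, (-1,): 0.9}),
           (['R', 'T'], {(+1, +1): 0.8, (+1, -1): 0.2,
                         (-1, +1): 0.1, (-1, -1): 0.9}),
           (['T', 'L'], {(+1, +1): 0.3, (+1, -1): 0.7,
                         (-1, +1): 0.1, (-1, -1): 0.9})]
print(variable_elimination(list(traffic), 'L', {}, ['R', 'T']))
# ≈ {(+1,): 0.134, (-1,): 0.866}
print(variable_elimination(list(traffic), 'L', {'R': +1}, ['T']))
# ≈ {(+1,): 0.26, (-1,): 0.74}
```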

40 Example
Query P(B | j, m) on the burglary network: choose A first, joining P(a|B,e), P(j|a), P(m|a) and summing out a.  [Shown graphically on the slide.]

41 Example (continued)
Choose E next, joining P(e) with f_1(j, m | B, e) and summing out e; finish with B, then normalize.  [Shown graphically on the slide.]

42 Same Example in Equations

P(B | j, m) ∝ P(B, j, m)                          [marginal can be obtained from joint by summing out]
  = Σ_{e,a} P(B, j, m, e, a)
  = Σ_{e,a} P(B) P(e) P(a|B,e) P(j|a) P(m|a)      [use the Bayes net joint distribution expression]
  = Σ_e P(B) P(e) Σ_a P(a|B,e) P(j|a) P(m|a)      [use x(y+z) = xy + xz]
  = Σ_e P(B) P(e) f_1(j, m | B, e)                [joining on a, then summing out, gives f_1]
  = P(B) Σ_e P(e) f_1(j, m | B, e)                [use x(y+z) = xy + xz]
  = P(B) f_2(j, m | B)                            [joining on e, then summing out, gives f_2]

All we are doing is exploiting uwy + uwz + uxy + uxz + vwy + vwz + vxy + vxz = (u+v)(w+x)(y+z) to improve computational efficiency!

43 Another Variable Elimination Example
Query: P(X_3 | y_1, y_2, y_3), on the network Z → X_1, X_2, X_3 with each X_i → Y_i.
Start by inserting evidence, which gives the following initial factors:
  P(Z), P(X_1|Z), P(X_2|Z), P(X_3|Z), P(y_1|X_1), P(y_2|X_2), P(y_3|X_3)
Eliminate X_1; this introduces the factor f_1(y_1|Z) = Σ_{x_1} P(x_1|Z) P(y_1|x_1), and we are left with:
  P(Z), P(X_2|Z), P(X_3|Z), P(y_2|X_2), P(y_3|X_3), f_1(y_1|Z)
Eliminate X_2; this introduces the factor f_2(y_2|Z) = Σ_{x_2} P(x_2|Z) P(y_2|x_2), and we are left with:
  P(Z), P(X_3|Z), P(y_3|X_3), f_1(y_1|Z), f_2(y_2|Z)
Eliminate Z; this introduces the factor f_3(y_1, y_2, X_3) = Σ_z P(z) P(X_3|z) f_1(y_1|z) f_2(y_2|z), and we are left with:
  P(y_3|X_3), f_3(y_1, y_2, X_3)
No hidden variables left. Join the remaining factors to get:
  f_4(y_1, y_2, y_3, X_3) = P(y_3|X_3) × f_3(y_1, y_2, X_3)
Normalizing over X_3 gives P(X_3 | y_1, y_2, y_3) = f_4(y_1, y_2, y_3, X_3) / Σ_{x_3} f_4(y_1, y_2, y_3, x_3).
Computational complexity critically depends on the largest factor generated in this process. Size of a factor = number of entries in its table. In the example above (assuming binary variables) all factors generated are of size 2, as they each have only one free variable (Z, Z, and X_3, respectively).

44 Variable Elimination Ordering
- For the query P(X_n | y_1, ..., y_n), work through the following two different orderings, as done on the previous slide: Z, X_1, ..., X_{n-1} and X_1, ..., X_{n-1}, Z.
- What is the size of the maximum factor generated for each of the orderings?
- Answer: 2^n versus 2 (assuming binary variables).
- In general: the ordering can greatly affect efficiency.

45 VE: Computational and Space Complexity
- The computational and space complexity of variable elimination is determined by the largest factor.
- The elimination ordering can greatly affect the size of the largest factor; e.g., the previous slide's example: 2^n vs. 2.
- Does there always exist an ordering that only results in small factors? No! Exact inference in BNs is NP-hard.

46 Outline
- Probability theory: basics
- Probabilistic inference
  - Inference by enumeration
- Conditional independence
- Bayesian networks
- Inference in Bayesian networks
  - Exact inference
  - Approximate inference

47 Approximate Inference: Sampling

48 Sampling
- Basic idea:
  - Draw N samples from a sampling distribution S.
  - Compute an approximate posterior probability.
  - Show this converges to the true probability P.
- Why sample? Inference: getting a sample is faster than computing the right answer (e.g. with variable elimination).

49 Sampling Basics
Sampling from a given distribution:
- Step 1: Get a sample u from the uniform distribution over [0, 1), e.g. random() in Python.
- Step 2: Convert this sample u into an outcome for the given distribution by associating each outcome with a sub-interval of [0, 1), with sub-interval size equal to the probability of the outcome.
Example:

    C      P(C)
    red    0.6
    green  0.1
    blue   0.3

If random() returns u = 0.83, then our sample is C = blue.  [The slide also shows the outcomes of sampling 8 times.]
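A minimal sketch of the interval trick described above (the function name is ours):

```python
# Sample from a categorical distribution using one uniform draw.
import random

def sample_categorical(dist):
    """dist: mapping outcome -> probability (must sum to 1)."""
    u = random.random()          # uniform over [0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():
        cumulative += p          # outcome owns [cumulative - p, cumulative)
        if u < cumulative:
            return outcome
    return outcome               # guard against floating-point round-off

dist = {'red': 0.6, 'green': 0.1, 'blue': 0.3}
# u = 0.83 falls in blue's interval [0.7, 1.0)
print(sample_categorical(dist))
```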

50 Sampling in Bayes Nets
- Prior sampling (direct sampling)
- Rejection sampling
- Likelihood weighting
- Gibbs sampling

51 Prior (Direct) Sampling

52 Prior (Direct) Sampling
- Ignore evidence.
- Sample from the joint probability.
- Do inference by counting the right samples.

53 Prior Sampling
[Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler and Rain → WetGrass; sample in the order C, S, R, W]

    P(C):       +c 0.5 | -c 0.5
    P(S | C):   +c: +s 0.1, -s 0.9 | -c: +s 0.5, -s 0.5
    P(R | C):   +c: +r 0.8, -r 0.2 | -c: +r 0.2, -r 0.8
    P(W | S,R): +s +r: +w 0.99, -w 0.01 | +s -r: +w 0.90, -w 0.10
                -s +r: +w 0.90, -w 0.10 | -s -r: +w 0.01, -w 0.99

Samples: +c, -s, +r, +w; -c, +s, -r, +w; ...

54 Prior Sampling
For i = 1, 2, ..., n:
  sample x_i from P(X_i | Parents(X_i))
Return (x_1, x_2, ..., x_n)
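A sketch of prior sampling on the sprinkler network above (CPT numbers as on the slide; helper names are ours):

```python
# Prior sampling: sample each variable in topological order.
import random

def bern(p):
    return random.random() < p

def prior_sample():
    # Each variable is drawn given its already-sampled parents.
    c = bern(0.5)
    s = bern(0.1 if c else 0.5)
    r = bern(0.8 if c else 0.2)
    w = bern({(True, True): 0.99, (True, False): 0.90,
              (False, True): 0.90, (False, False): 0.01}[(s, r)])
    return c, s, r, w

# Estimate P(+w) by counting.
N = 100_000
samples = [prior_sample() for _ in range(N)]
print(sum(w for *_, w in samples) / N)
```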

55 Example
- We'll get a bunch of samples from the BN  [C → S, R → W network]:
    +c, -s, +r, +w
    +c, +s, +r, +w
    -c, +s, +r, -w
    +c, -s, +r, +w
    -c, -s, -r, +w
- If we want to know P(W):
  - We have counts <+w: 4, -w: 1>.
  - Normalize to get P(W) = <+w: 0.8, -w: 0.2>.
  - This will get closer to the true distribution with more samples.
  - Can estimate anything else, too. What about P(C | +w)? P(C | +r, +w)? P(C | -r, -w)?
  - Fast: can use fewer samples if there is less time (what's the drawback?)

56 Prior Sampling Analysis
- This process generates samples with probability
  S_PS(x_1, ..., x_n) = Π_i P(x_i | Parents(X_i)) = P(x_1, ..., x_n),
  i.e. the BN's joint probability.
- Let the number of samples of an event be N_PS(x_1, ..., x_n). Then
  lim_{N→∞} N_PS(x_1, ..., x_n) / N = S_PS(x_1, ..., x_n) = P(x_1, ..., x_n).
- I.e., the sampling procedure is consistent.

57 Rejection Sampling

58 Rejection Sampling
- Let's say we want P(C | +s).
- Tally C outcomes, but ignore (reject) samples which don't have S = +s.
- This is called rejection sampling.
- It is also consistent for conditional probabilities (i.e., correct in the limit).
  Samples: +c, -s, +r, +w / +c, +s, +r, +w / -c, +s, +r, -w / +c, -s, +r, +w / -c, -s, -r, +w

59 Rejection Sampling
IN: evidence instantiation
For i = 1, 2, ..., n:
  sample x_i from P(X_i | Parents(X_i))
  if x_i is not consistent with the evidence:
    reject: return, and no sample is generated in this cycle
Return (x_1, x_2, ..., x_n)
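The same sampler wrapped with a rejection filter, as a sketch (prior_sample is repeated here so the block stays self-contained; this version rejects after drawing the full sample, whereas the pseudocode above rejects early):

```python
# Rejection sampling for P(C | +s) on the sprinkler net.
import random

def bern(p):
    return random.random() < p

def prior_sample():
    c = bern(0.5)
    s = bern(0.1 if c else 0.5)
    r = bern(0.8 if c else 0.2)
    w = bern({(True, True): 0.99, (True, False): 0.90,
              (False, True): 0.90, (False, False): 0.01}[(s, r)])
    return {'C': c, 'S': s, 'R': r, 'W': w}

def rejection_sample(query, evidence, n):
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample()
        if all(sample[var] == val for var, val in evidence.items()):
            counts[sample[query]] += 1   # keep only consistent samples
    kept = sum(counts.values())
    return {v: c / kept for v, c in counts.items()}, kept

est, kept = rejection_sample('C', {'S': True}, 100_000)
print(est, kept)   # ~70% of samples are rejected here; unlikelier
                   # evidence wastes even more of them
```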

60 Likelihood Weighting

61 Likelihood Weighting
- Problem with rejection sampling:
  - If evidence is unlikely, it rejects lots of samples.
  - Evidence is not exploited as you sample.
  - Consider P(Shape | blue): samples (pyramid, green), (pyramid, red), (sphere, blue), (cube, red), (sphere, green); only one matches the evidence.
- Idea: fix the evidence variables and sample the rest.
  - Problem: the sample distribution is not consistent!
  - Solution: weight each sample by the probability of the evidence given its parents.
  - Samples: (pyramid, blue), (pyramid, blue), (sphere, blue), (cube, blue), (sphere, blue)

62 Likelihood Weighting
[Same network and CPTs as slide 53: Cloudy → Sprinkler, Rain → WetGrass]

    P(C):       +c 0.5 | -c 0.5
    P(S | C):   +c: +s 0.1, -s 0.9 | -c: +s 0.5, -s 0.5
    P(R | C):   +c: +r 0.8, -r 0.2 | -c: +r 0.2, -r 0.8
    P(W | S,R): +s +r: +w 0.99, -w 0.01 | +s -r: +w 0.90, -w 0.10
                -s +r: +w 0.90, -w 0.10 | -s -r: +w 0.01, -w 0.99

Samples: +c, +s, +r, +w (evidence variables fixed, with their CPT rows multiplied into the weight)

63 Likelihood Weighting
IN: evidence instantiation
w = 1.0
for i = 1, 2, ..., n:
  if X_i is an evidence variable:
    x_i = observation of X_i
    set w = w * P(x_i | Parents(X_i))
  else:
    sample x_i from P(X_i | Parents(X_i))
return (x_1, x_2, ..., x_n), w
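A sketch of likelihood weighting for the query P(C | +s, +w) on the same network (this version hard-codes which variables may be evidence, which keeps it short; names are ours):

```python
# Likelihood weighting: fix evidence, sample the rest, accumulate weights.
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def weighted_sample(evidence):
    w = 1.0
    c = random.random() < 0.5                  # C is never evidence here
    p_s = 0.1 if c else 0.5
    if 'S' in evidence:
        s = evidence['S']; w *= p_s if s else 1 - p_s
    else:
        s = random.random() < p_s
    r = random.random() < (0.8 if c else 0.2)  # R is never evidence here
    p_w = P_W[(s, r)]
    if 'W' in evidence:
        wg = evidence['W']; w *= p_w if wg else 1 - p_w
    else:
        wg = random.random() < p_w
    return c, w

def lw_estimate(evidence, n=100_000):
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        c, w = weighted_sample(evidence)
        totals[c] += w                         # accumulate sample weights
    z = sum(totals.values())
    return {v: t / z for v, t in totals.items()}

print(lw_estimate({'S': True, 'W': True}))
```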

64 Likelihood Weighting
- Sampling distribution if z is sampled and e is fixed evidence:
  S_WS(z, e) = Π_i P(z_i | Parents(Z_i))
- Now, samples have weights:
  w(z, e) = Π_j P(e_j | Parents(E_j))
- Together, the weighted sampling distribution is consistent:
  S_WS(z, e) · w(z, e) = Π_i P(z_i | Parents(Z_i)) × Π_j P(e_j | Parents(E_j)) = P(z, e)

65 Likelihood Weighting
- Likelihood weighting is good:
  - We have taken evidence into account as we generate the sample.
  - E.g., here, W's value will get picked based on the evidence values of S, R.
  - More of our samples will reflect the state of the world suggested by the evidence.
- Likelihood weighting doesn't solve all our problems:
  - Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence).
- We would like to consider evidence when we sample every variable → Gibbs sampling.

66 Gibbs Sampling

67 Gibbs Sampling
- Procedure: keep track of a full instantiation x_1, x_2, ..., x_n. Start with an arbitrary instantiation consistent with the evidence. Sample one variable at a time, conditioned on all the rest, but keep the evidence fixed. Keep repeating this for a long time.
- Property: in the limit of repeating this infinitely many times, the resulting sample comes from the correct distribution.
- Rationale: both upstream and downstream variables condition on evidence.
- In contrast: likelihood weighting only conditions on upstream evidence, and hence the weights obtained in likelihood weighting can sometimes be very small. The sum of weights over all samples indicates how many effective samples were obtained, so we want high weight.

68 Gibbs Sampling
Example: P(S | +r)
- Step 1: Fix the evidence, R = +r.
- Step 2: Initialize the other variables (C, S, W) randomly.
- Step 3: Repeat:
  - Choose a non-evidence variable X.
  - Resample X from P(X | all other variables).
[Figure: a sequence of full instantiations of C, S, W with R fixed to +r.]
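A sketch of this loop for P(S | +r), resampling from the full-joint ratio, which is affordable on a four-variable net (names and the burn-in choice are ours):

```python
# Gibbs sampling for P(S | +r) on the sprinkler network.
import random

P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.01}

def joint(c, s, r, w):
    pc = 0.5
    ps = (0.1 if c else 0.5); ps = ps if s else 1 - ps
    pr = (0.8 if c else 0.2); pr = pr if r else 1 - pr
    pw = P_W[(s, r)];         pw = pw if w else 1 - pw
    return pc * ps * pr * pw

def gibbs(evidence, n=100_000, burn_in=1000):
    state = {'C': random.random() < 0.5, 'S': random.random() < 0.5,
             'R': None, 'W': random.random() < 0.5}
    state.update(evidence)                   # evidence stays fixed
    hidden = [v for v in state if v not in evidence]
    count = 0
    for t in range(n + burn_in):
        x = random.choice(hidden)
        # P(x=True | all others) is proportional to the joint with x flipped
        args = dict(state)
        args[x] = True;  p1 = joint(args['C'], args['S'], args['R'], args['W'])
        args[x] = False; p0 = joint(args['C'], args['S'], args['R'], args['W'])
        state[x] = random.random() < p1 / (p1 + p0)
        if t >= burn_in:
            count += state['S']
    return count / n

print(gibbs({'R': True}))   # estimate of P(+s | +r), ≈ 0.18 for these CPTs
```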

69 Efficient Resampling of One Variable
- Sample from P(S | +c, +r, -w)  [on the C → S, R → W network]:
  P(S | +c, +r, -w) ∝ P(+c) P(S|+c) P(+r|+c) P(-w|S,+r) ∝ P(S|+c) P(-w|S,+r)
- Many things cancel out; only the CPTs with S remain!
- More generally: only the CPTs that mention the resampled variable need to be considered, and joined together.

70 Bayes Net Sampling Summary
- Prior sampling: estimates P(Q)
- Rejection sampling: estimates P(Q | e)
- Likelihood weighting: estimates P(Q | e)
- Gibbs sampling: estimates P(Q | e)

71 Further Reading on Gibbs Sampling*
- Gibbs sampling produces samples from the query distribution P(Q | e) in the limit of resampling infinitely often.
- Gibbs sampling is a special case of more general methods called Markov chain Monte Carlo (MCMC) methods.
  - Metropolis-Hastings is one of the more famous MCMC methods (in fact, Gibbs sampling is a special case of Metropolis-Hastings).
  - You may read about Monte Carlo methods; they're just sampling.
