Bayesian Networks. Example of Inference. Another Bayes Net Example (from Andrew Moore) Example Bayes Net. Want to know P(A)? Proceed as follows:


1 Bayesian Networks
A Bayesian network (BN) is a graphical representation of the direct dependencies over a set of variables, together with a set of conditional probability tables (CPTs) quantifying the strength of those influences. Bayes nets are effective tools for representing the world and for making inferences where there is uncertainty.
More formally: a BN over variables {X1, X2, ..., Xn} consists of a DAG (directed acyclic graph) whose nodes are the variables. Each node Xi is associated with a conditional probability table (CPT) that specifies P(Xi | Parents(Xi)) for that Xi.
Example Bayes Net
[The slide shows a five-node, binary-valued network ending in the node H, with one CPT per node: a root with probabilities 0.7 / 0.3, three intermediate nodes with entries (0.8, 0.5), (0.7, 0.0) and (0.2, 0.1) for "true given parent true / parent false", and P(H | parent) = 0.9, P(H | ~parent) = 0.1.]
Note that specifying the joint probability for any assignment of values to these variables requires only 9 parameters (rather than 31): that means inference is linear in the number of variables instead of exponential! Moreover, inference is linear in general whenever the dependence structure is a chain.
Example of Inference
Want to know P(A)? Proceed as follows: sum out the other variables one at a time, working along the chain (a sketch of this idea follows this page).
Another Bayes Net Example (from Andrew Moore)
M: Meredith is running the help session
S: It is sunny out
L: The leader of the help session arrives late
Assume that all TAs may arrive late in bad weather. Some TAs may be more likely to be late than others. These are all terms specified in our local distributions!
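To make the chain-inference point concrete, here is a minimal sketch (my own illustration, not taken from the slides). It reuses the CPT numbers listed above, but the assignment of those CPTs to a chain X1 -> X2 -> X3 -> X4 -> X5 is an assumption, since the node labels in the figure are not reproduced here.

from functools import reduce

# Chain-structured net X1 -> X2 -> X3 -> X4 -> X5, all Boolean.
# prior[t] = P(X1 = t); each cpt maps (child_value, parent_value) -> probability.
prior = {True: 0.7, False: 0.3}
cpts = [
    {(True, True): 0.8, (False, True): 0.2, (True, False): 0.5, (False, False): 0.5},
    {(True, True): 0.7, (False, True): 0.3, (True, False): 0.0, (False, False): 1.0},
    {(True, True): 0.2, (False, True): 0.8, (True, False): 0.1, (False, False): 0.9},
    {(True, True): 0.9, (False, True): 0.1, (True, False): 0.1, (False, False): 0.9},
]

def chain_marginal(prior, cpts):
    """Return P(X_last) by pushing the marginal down the chain: O(n) work."""
    marg = dict(prior)
    for cpt in cpts:
        # marginal of child = sum over parent of P(child | parent) * P(parent)
        marg = {v: sum(cpt[(v, p)] * marg[p] for p in (True, False)) for v in (True, False)}
    return marg

print(chain_marginal(prior, cpts))   # a distribution over the last node, summing to 1

Each step does a constant amount of work per variable, which is the sense in which chain inference is linear.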

2 Another Bayes Net Example (from Andrew Moore)
M: Meredith is running the help session
S: It is sunny out
L: The leader of the help session arrives late
Assume that all TAs may arrive late in bad weather. Some TAs may be more likely to be late than others.
Let's begin by writing down knowledge we feel happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6. Lateness is not independent of the weather, and not independent of which TA is running the session.
Another Bayes Net Example (from Andrew Moore)
To specify P(L | S = u, M = v) we need values for the 4 cases of u/v = True/False.
Bayes net example
Because of conditional independence, we only need 6 values in the joint (how many would there be otherwise?). Again, conditional independence leads to computational savings!
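To spell out the count (an elaboration of the slide's rhetorical question, not text from the slide): with the structure S -> L <- M and S independent of M, the joint factors as
P(S, M, L) = P(S) P(M) P(L | S, M)
which needs 1 + 1 + 4 = 6 numbers, versus 2^3 - 1 = 7 for an arbitrary joint distribution over three Boolean variables.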

3 Graphically representing the BN
Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S."
Read the two arrows into L to mean "if I want to know the value of L it may help me to know M and to know S."
Altering the graph
Now let's suppose we have these three events:
M: Meredith is running the help session
L: The leader of the session arrives late
R: The help session concerns Reasoning with Bayes Nets
And we know: Erin has a higher chance of being late than Meredith, and Erin has a higher chance of giving a help session about reasoning with BNs.
Conditional independence
Once you know who the TA is, whether they arrive late does not affect whether the help session concerns Reasoning with Bayes Nets.
What kind of independences exist in our graph?

4 Graphically representing the BN
Let's say we have 5 variables:
M: Meredith is running the help session
L: The leader of the session arrives late
R: The help session concerns Reasoning with Bayes Nets
S: It is sunny out
T: The session starts by 4:00
We know:
T is only directly influenced by L (i.e., T is conditionally independent of R, M, S given L)
L is only directly influenced by M and S (i.e., L is conditionally independent of R given M & S)
R is only directly influenced by M (i.e., R is conditionally independent of L, S given M)
M and S are independent
(The corresponding factorization is spelled out after this page.)
Building a Bayes Net
Step One: add variables.
Step Two: add links. The link structure must be acyclic (the graph is a DAG). Remember that if a node X has parents, you are promising that any non-descendant of X is conditionally independent of X given its parents.
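Putting the stated direct influences together (this factorization is reconstructed from the independence statements above, not copied from a slide):
P(M, S, L, R, T) = P(M) P(S) P(L | M, S) P(R | M) P(T | L)
With Boolean variables this needs 1 + 1 + 4 + 2 + 2 = 10 numbers instead of 2^5 - 1 = 31.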

5 Building a Bayes Net
Step Three: add a probability table for each node. Note that the table for any node X must list P(X | Parents(X)) for each possible combination of parent values.
Each node is conditionally independent of all non-descendants, given its parents. Two unconnected variables may still be correlated. You can deduce many other conditional independence relations from a Bayes Net.
Building a Bayes Net
Note that it is always possible to construct a Bayes Net to represent any distribution over variables X1, X2, ..., Xn, using any ordering of the variables. Take any ordering of the variables (say, the order given). From the chain rule we obtain
P(X1,...,Xn) = P(Xn | X1,...,Xn-1) P(Xn-1 | X1,...,Xn-2) ... P(X1)
Now for each Xi, go through its conditioning set X1,...,Xi-1 and iteratively remove all variables Xj such that Xi is conditionally independent of Xj given the remaining variables. Do this until no more variables can be removed. The final product will specify a Bayes net.
Causal Intuitions
However, some orderings yield BNs with very large parent sets. This requires exponential space, and (as we will see later) exponential time to perform inference. Empirically, and conceptually, a good way to construct a BN is to use an ordering based on causality. This often yields a more natural and compact BN.
Example: Causal Intuitions
Malaria, the flu and a cold all cause aches. So use an ordering in which causes come before effects: Malaria, Flu, Cold, Aches.
P(M,F,C,A) = P(A | M,F,C) P(C | M,F) P(F | M) P(M)
Each of these diseases affects the probability of aches, so the first conditional probability does not change. It is reasonable to assume that these diseases are independent of each other: having or not having one does not change the probability of having the others. So P(C | M,F) = P(C) and P(F | M) = P(F).
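So, with the causal ordering, the factorization and its size are (the parameter count is my own addition, assuming Boolean variables):
P(M,F,C,A) = P(A | M,F,C) P(C) P(F) P(M)
This needs 8 + 1 + 1 + 1 = 11 numbers: one big CPT for Aches plus a single prior for each disease.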

6 Example: Causal Intuitions
This yields a fairly simple Bayes net. We only need one big CPT, involving the family of Aches.
Example: Causal Intuitions
But suppose we build the BN for the distribution using the opposite ordering, i.e., we use the ordering Aches, Cold, Flu, Malaria:
P(A,C,F,M) = P(M | A,C,F) P(F | A,C) P(C | A) P(A)
We can't reduce P(M | A,C,F): the probability of Malaria is clearly affected by knowing Aches. What about knowing Aches and Cold, or Aches and Cold and Flu? The probability of Malaria is affected by both of these additional pieces of knowledge: knowing Cold and Flu lowers the probability that the aches are due to Malaria, since the other conditions explain away the aches!
Example: Causal Intuitions
We obtain a much more complex Bayes net. In fact, we obtain no savings over explicitly representing the full joint distribution (i.e., representing the probability of every atomic event).
Burglary Example
I'm at work, neighbour John calls to say my alarm is ringing, but neighbour Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
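Counting parameters for the reversed ordering (again my own arithmetic, assuming Boolean variables) makes the "no savings" point explicit:
P(A,C,F,M) = P(M | A,C,F) P(F | A,C) P(C | A) P(A)
needs 8 + 4 + 2 + 1 = 15 numbers, exactly the 2^4 - 1 = 15 needed to write down the full joint.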

7 Burglary Example
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
# of params: 1 + 1 + 4 + 2 + 2 = 10 (vs. 2^5 - 1 = 31 for an explicit joint)
Burglary Example
Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake?
P(J | M) = P(J)?
Burglary Example
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)?  P(A | J, M) = P(A)?
Burglary Example
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? No  P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)?  P(B | A, J, M) = P(B)?

8 Burglary Example
Suppose we choose the ordering MaryCalls, JohnCalls, Alarm, Burglary, Earthquake?
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? No  P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes  P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)?  P(E | B, A, J, M) = P(E | A, B)?
Burglary Example
P(J | M) = P(J)? No
P(A | J, M) = P(A | J)? No  P(A | J, M) = P(A)? No
P(B | A, J, M) = P(B | A)? Yes  P(B | A, J, M) = P(B)? No
P(E | B, A, J, M) = P(E | A)? No  P(E | B, A, J, M) = P(E | A, B)? Yes
Burglary Example
Deciding conditional independence is hard in non-causal directions! (Causal models and conditional independence seem hardwired for humans!)
The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed.
Inference in Bayes Nets
Given a Bayes net
P(X1, X2, ..., Xn) = P(Xn | Parents(Xn)) * P(Xn-1 | Parents(Xn-1)) * ... * P(X1 | Parents(X1))
and some evidence E = {a set of values for some of the variables}, we want to compute the new probability distribution P(Xk | E). That is, we want to figure out P(Xk = d | E) for every d in Dom[Xk]. This is a posterior probability, meaning that it is conditioned on the evidence.
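As a concrete instance of this inference task, here is a brute-force enumeration sketch for the burglary network. It is my own illustration: the slides give no numbers for this example, so the CPT values below are the standard ones from Russell & Norvig.

from itertools import product

# Alarm-network CPTs (assumed textbook values, not from these slides).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as the product of the CPT entries."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

def query_burglary(j, m):
    """P(Burglary | JohnCalls=j, MaryCalls=m) by summing the joint over E and A."""
    post = {b: sum(joint(b, e, a, j, m) for e, a in product((True, False), repeat=2))
            for b in (True, False)}
    z = post[True] + post[False]
    return {b: p / z for b, p in post.items()}

print(query_burglary(j=True, m=False))   # P(Burglary | John called, Mary didn't)

Enumeration like this is exponential in the number of unobserved variables; the rest of the lecture is about doing better.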

9 Inference in Bayes Nets
Other examples of this kind of task: computing the probability of different diseases given symptoms, computing the probability of hail storms given different meteorological evidence, etc. In such cases, getting a good estimate of the probability of the unknown event allows us to respond more effectively (gamble rationally).
Inference in Bayes Nets
In the Alarm example:
P(Burglary, Earthquake, Alarm, JohnCalls, MaryCalls) = P(Earthquake) * P(Burglary) * P(Alarm | Earthquake, Burglary) * P(JohnCalls | Alarm) * P(MaryCalls | Alarm)
We may want to infer things like P(Burglary = true | MaryCalls = false, JohnCalls = true).
Variable Elimination
Making inferences can be computationally hairy. Variable elimination (VE) uses the product decomposition that defines a Bayes Net, together with the summing-out rule, to compute posterior probabilities from information already in the network (the CPTs). VE helps to reduce some of the hair in our computation.
Example (Binary valued Variables)
P(A,B,C,D,E,F,G,H,I,J,K) = P(A) x P(B) x P(C|A) x P(D|A,B) x P(E|C) x P(F|D) x P(G) x P(H|E,F) x P(I|F,G) x P(J|H,I) x P(K|I)

10 Example
P(A,B,C,D,E,F,G,H,I,J,K) = P(A) P(B) P(C|A) P(D|A,B) P(E|C) P(F|D) P(G) P(H|E,F) P(I|F,G) P(J|H,I) P(K|I)
Say that our evidence is {H = true, I = false}, and we want to know P(D | h,-i) (h: H is true, -h: H is false).
First, we write the joint as a sum for each value of D (i.e., D = d and D = -d):
Σ_{A,B,C,E,F,G,J,K} P(A,B,C,d,E,F,h,-i,J,K) = P(d,h,-i)
Σ_{A,B,C,E,F,G,J,K} P(A,B,C,-d,E,F,h,-i,J,K) = P(-d,h,-i)
Now we can compute P(h,-i): P(d,h,-i) + P(-d,h,-i) = P(h,-i).
And finally, P(D | h,-i):
P(d | h,-i) = P(d,h,-i) / P(h,-i)
P(-d | h,-i) = P(-d,h,-i) / P(h,-i)
So very good; it appears we only need to compute P(d,h,-i) and P(-d,h,-i) to obtain the conditional probabilities we want.
Example
We start with P(d,h,-i) = Σ_{A,B,C,E,F,G,J,K} P(A,B,C,d,E,F,h,-i,J,K).
Use the Bayes Net product decomposition to rewrite the summation:
Σ_{A,B,C,E,F,G,J,K} P(A) P(B) P(C|A) P(d|A,B) P(E|C) P(F|d) P(G) P(h|E,F) P(-i|F,G) P(J|h,-i) P(K|-i)
Example
Next, rearrange the summations so that we are never summing over terms that do not depend on the summed variable:
= Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)

11 Example
Now start computing, from the innermost sum outward:
Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
Σ_K P(K|-i) = P(k|-i) + P(-k|-i) = c1
c1 Σ_J P(J|h,-i) = c1 (P(j|h,-i) + P(-j|h,-i)) = c1 c2
Example
c1 c2 Σ_G P(G) P(-i|F,G) = c1 c2 (P(g) P(-i|F,g) + P(-g) P(-i|F,-g))
But P(-i|F,g) depends on the value of F, so this is not a single number!
Example
Hmm. Let's try ordering the summations from front to back instead (expanding the outermost variables first):
Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
= P(a) Σ_B P(B) P(d|a,B) Σ_C P(C|a) [remaining sums over E, F, G, J, K]
+ P(-a) Σ_B P(B) P(d|-a,B) Σ_C P(C|-a) [remaining sums over E, F, G, J, K]
Example
= P(a) P(b) P(d|a,b) Σ_C P(C|a) [remaining sums]
+ P(a) P(-b) P(d|a,-b) Σ_C P(C|a) [remaining sums]
+ P(-a) P(b) P(d|-a,b) Σ_C P(C|-a) [remaining sums]
+ P(-a) P(-b) P(d|-a,-b) Σ_C P(C|-a) [remaining sums]

12 Example
Yikes! The size of the sum is doubling as we expand each variable (into its two values v and -v). This approach has exponential complexity. But let's look a bit closer.
Example
= P(a) P(b) P(d|a,b) [Σ_C P(C|a) x remaining sums]
+ P(a) P(-b) P(d|a,-b) [Σ_C P(C|a) x remaining sums]        <- repeated subterm
+ P(-a) P(b) P(d|-a,b) [Σ_C P(C|-a) x remaining sums]
+ P(-a) P(-b) P(d|-a,-b) [Σ_C P(C|-a) x remaining sums]     <- repeated subterm
The bracketed sub-term in the first two lines is identical (it depends only on A = a), and likewise in the last two lines (A = -a).
Example
If we store the value of the sub-terms, we need only compute them once:
= c1 f1 + c2 f1 + c3 f2 + c4 f2
where c1 = P(a)P(b)P(d|a,b), c2 = P(a)P(-b)P(d|a,-b), c3 = P(-a)P(b)P(d|-a,b), c4 = P(-a)P(-b)P(d|-a,-b),
f1 = Σ_C P(C|a) [remaining sums], and f2 = Σ_C P(C|-a) [remaining sums].

13 Example
f1 = Σ_C P(C|a) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
= P(c|a) [Σ_E P(E|c) x remaining sums] + P(-c|a) [Σ_E P(E|-c) x remaining sums]
and again repeated sub-terms appear inside: the inner sums over F, G, J and K depend only on the value of E, so the same sub-terms recur across branches.
Dynamic Programming
Within the computation of sub-terms we obtain more repeated smaller sub-terms. The core idea of dynamic programming is to remember all smaller computations so that they can be reused. This can convert an exponential computation into one that takes only polynomial time. Variable elimination is a dynamic programming technique that computes the sum from the bottom up (starting with the smaller sub-terms and working its way up to the bigger terms).
Relevance (we return to this later)
A brief aside: note that in the sum
Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
we have that Σ_K P(K|-i) = 1 (why?), thus Σ_J P(J|h,-i) Σ_K P(K|-i) = Σ_J P(J|h,-i). Furthermore Σ_J P(J|h,-i) = 1. So we can, in theory, drop these last two terms from the computation, as J and K are not relevant given our query (D) and our evidence (h and -i).
Variable Elimination (VE)
VE works from the inside of our equation out, summing out K, then J, then G, as we tried to before:
Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
When we tried to sum out G, we got here:
c1 c2 Σ_G P(G) P(-i|F,G) = c1 c2 (P(g) P(-i|F,g) + P(-g) P(-i|F,-g))
and we found that this depends on the value of F; it isn't a single number. However, we can still continue with the computation by computing two different numbers, one for each value of F (i.e., for F = f and F = -f).

14 Variable Elimination (VE)
t(f)  = Σ_G P(G) P(-i|f,G)  = P(g) P(-i|f,g)  + P(-g) P(-i|f,-g)
t(-f) = Σ_G P(G) P(-i|-f,G) = P(g) P(-i|-f,g) + P(-g) P(-i|-f,-g)
With these, we can sum out F.
Variable Elimination (VE)
Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
c1 c2 Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G)
= c1 c2 (P(f|d) P(h|E,f) (Σ_G P(G) P(-i|f,G)) + P(-f|d) P(h|E,-f) (Σ_G P(G) P(-i|-f,G)))
= c1 c2 Σ_F P(F|d) P(h|E,F) t(F), using t(f) and t(-f).
Variable Elimination (VE)
c1 c2 (P(f|d) P(h|E,f) t(f) + P(-f|d) P(h|E,-f) t(-f))
Now this is a function of E, so we obtain two new numbers:
s(e)  = c1 c2 (P(f|d) P(h|e,f) t(f)  + P(-f|d) P(h|e,-f) t(-f))
s(-e) = c1 c2 (P(f|d) P(h|-e,f) t(f) + P(-f|d) P(h|-e,-f) t(-f))
Variable Elimination (VE)
Σ_A P(A) Σ_B P(B) P(d|A,B) Σ_C P(C|A) Σ_E P(E|C) Σ_F P(F|d) P(h|E,F) Σ_G P(G) P(-i|F,G) Σ_J P(J|h,-i) Σ_K P(K|-i)
On summing out F we obtain two numbers that represent a function of E; summing out E then gives a function of C, and summing out C a function of A. Finally, we can sum out the remaining variables to obtain the single number we wanted to compute, which is P(d,h,-i).
Now we can repeat the process to compute P(-d,h,-i). Or, instead of doing it twice, we can simply regard D as a variable in the computation. This will result in some computations that depend on the value of D, and we'll obtain different numbers for each value of D. Proceeding in this manner will yield a function of D (i.e., a number for each value of D).

15 Variable Elimination (VE)
In general, at each stage VE computes a table of numbers: one number for each different instantiation of the variables that remain in the sum. The size of these tables is exponential in the number of such variables; e.g., Σ_F P(F|D) P(h|E,F) t(F) depends on the values of D and E, so we obtain |Dom[D]| * |Dom[E]| different numbers in the resulting table.
Factors
We call the tables of values computed by VE factors. Note that the original probabilities that appear in the summation, e.g., P(C|A), are also tables of values (one value for each instantiation of C and A). Thus we also call the original CPTs factors. Each factor is a function of some variables, e.g., P(C|A) = f(A,C): it maps each value of its arguments to a number. A tabular representation of a factor is exponential in the number of variables in the factor.
Operations on Factors
If we examine the summation process we will see that various operations repeatedly occur on factors. Notation: f(X,Y) denotes a factor over the variables X ∪ Y (where X and Y are sets of variables).
The Product of Two Factors
Let f(X,Y) and g(Y,Z) be two factors with the variables Y in common. The product of f and g, denoted h = f x g (or sometimes just h = fg), is defined as h(X,Y,Z) = f(X,Y) x g(Y,Z).
f(A,B):   ab 0.9,  a~b 0.1,  ~ab 0.4,  ~a~b 0.6
g(B,C):   bc 0.7,  b~c 0.3,  ~bc 0.8,  ~b~c 0.2
h(A,B,C): abc 0.63, ab~c 0.27, a~bc 0.08, a~b~c 0.02, ~abc 0.28, ~ab~c 0.12, ~a~bc 0.48, ~a~b~c 0.12

16 Summing a Variable Out of a Factor
Let f(X,Y) be a factor. We can sum out variable X from f to produce a new factor h = Σ_X f, defined as h(Y) = Σ_{x in Dom(X)} f(x,Y).
f(A,B): ab 0.9, a~b 0.1, ~ab 0.4, ~a~b 0.6   ->   h(B) = Σ_A f: b 1.3, ~b 0.7
Restricting a Factor
Let f(X,Y) be a factor. We can restrict factor f to X = a by setting X to the value a and deleting the incompatible elements of f's domain. Define h = f_{X=a} as h(Y) = f(a,Y).
f(A,B): ab 0.9, a~b 0.1, ~ab 0.4, ~a~b 0.6   ->   h(B) = f_{A=a}: b 0.9, ~b 0.1
Variable Elimination: the Algorithm
Given query variable Q, evidence variables E (observed to have values e), and remaining variables Z. Let F be the set of factors in the original CPTs.
1. Replace each factor f in F that mentions a variable in E with its restriction f_{E=e} (this might yield a constant factor).
2. For each Zj in Z — in the order given — eliminate Zj as follows:
(a) Compute the new factor gj = Σ_{Zj} f1 x f2 x ... x fk, where the fi are the factors in F that include Zj.
(b) Remove the factors fi (that mention Zj) from F and add the new factor gj to F.
3. The remaining factors at the end of this process will refer only to the query variable Q. Take their product and normalize to produce P(Q | E).
VE: Example
Factors: f1(A), f2(B), f3(A,B,C), f4(C,D)
Query: P(A)?   Evidence: D = d.   Elimination order: C, B.
Step 1 (Restriction): replace f4(C,D) with f5(C) = f4(C,d).
Step 2 (Eliminate C): compute and add f6(A,B) = Σ_C f5(C) f3(A,B,C) to the list of factors; remove f3(A,B,C) and f5(C).
Step 3 (Eliminate B): compute and add f7(A) = Σ_B f6(A,B) f2(B) to the list of factors; remove f6(A,B) and f2(B).
Step 4 (Normalize the final factors): f7(A), f1(A). The product f1(A) x f7(A) is the (unnormalized) posterior for A, so we normalize: P(A | d) = α f1(A) x f7(A), where α = 1 / Σ_A f1(A) f7(A).
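Here is a minimal executable sketch of the three factor operations just defined (my own illustration of the definitions, using dictionary-based tables; it is not code from the course). A factor is represented as a pair: a tuple of variable names and a table mapping value-tuples to numbers.

from itertools import product

def multiply(f, g):
    """Pointwise product of two factors, over the union of their variables."""
    fv, ft = f
    gv, gt = g
    hv = fv + tuple(v for v in gv if v not in fv)
    ht = {}
    for vals in product([True, False], repeat=len(hv)):
        asg = dict(zip(hv, vals))
        ht[vals] = ft[tuple(asg[v] for v in fv)] * gt[tuple(asg[v] for v in gv)]
    return (hv, ht)

def sum_out(f, var):
    """Sum variable `var` out of factor f."""
    fv, ft = f
    hv = tuple(v for v in fv if v != var)
    ht = {}
    for vals, p in ft.items():
        key = tuple(val for val, name in zip(vals, fv) if name != var)
        ht[key] = ht.get(key, 0.0) + p
    return (hv, ht)

def restrict(f, var, value):
    """Restrict factor f to var = value (dropping var from its scope)."""
    fv, ft = f
    hv = tuple(v for v in fv if v != var)
    ht = {tuple(val for val, name in zip(vals, fv) if name != var): p
          for vals, p in ft.items() if dict(zip(fv, vals))[var] == value}
    return (hv, ht)

# The tables from the slides: f(A,B) and g(B,C).
f = (("A", "B"), {(True, True): 0.9, (True, False): 0.1,
                  (False, True): 0.4, (False, False): 0.6})
g = (("B", "C"), {(True, True): 0.7, (True, False): 0.3,
                  (False, True): 0.8, (False, False): 0.2})
h = multiply(f, g)
print(h[1][(True, True, True)])             # 0.63  (the 'abc' entry above)
print(sum_out(f, "A")[1][(True,)])          # 1.3   (the 'b' entry of Σ_A f)
print(restrict(f, "A", True)[1][(True,)])   # 0.9   (the 'b' entry of f restricted to A=a)

The VE algorithm above is just a loop that repeatedly applies restrict, multiply and sum_out in the chosen elimination order.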

17 Numeric Example
Here's an example with some numbers:
f1(A):   a 0.9,  ~a 0.1
f2(A,B): ab 0.9, a~b 0.1, ~ab 0.4, ~a~b 0.6
f3(B,C): bc 0.7, b~c 0.3, ~bc 0.2, ~b~c 0.8
f4(B) = Σ_A f2(A,B) f1(A):  b 0.85,  ~b 0.15
f5(C) = Σ_B f3(B,C) f4(B):  c 0.625, ~c 0.375
VE: Buckets as a Notational Device
Ordering: C, F, A, B, E, D.
Original factors: f1(A), f2(B), f3(A,B,C), f4(C,D), f5(C,E), f6(D,E,F).
Buckets: 1. C:   2. F:   3. A:   4. B:   5. E:   6. D:
VE: Buckets — place the original factors in the first applicable bucket.
1. C: f3(A,B,C), f4(C,D), f5(C,E)
2. F: f6(D,E,F)
3. A: f1(A)
4. B: f2(B)
5. E:
6. D:
VE: Eliminate the variables in order, placing each new factor in the first applicable bucket.
1. Eliminate C: Σ_C f3(A,B,C) x f4(C,D) x f5(C,E) = f7(A,B,D,E), which goes into bucket 3 (A).
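Spelling out one entry from each derived factor (my own arithmetic, directly from the tables above):
f4(b) = f2(a,b) f1(a) + f2(~a,b) f1(~a) = 0.9 * 0.9 + 0.4 * 0.1 = 0.85
f5(c) = f3(b,c) f4(b) + f3(~b,c) f4(~b) = 0.7 * 0.85 + 0.2 * 0.15 = 0.625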

18 VE: Eliminate the variables in order, placing each new factor in the first applicable bucket.
Ordering: C, F, A, B, E, D.
1. C: f3(A,B,C), f4(C,D), f5(C,E)
2. F: f6(D,E,F)
3. A: f1(A), f7(A,B,D,E)
4. B: f2(B)
5. E:
6. D:
2. Eliminate F: Σ_F f6(D,E,F) = f8(D,E), which goes into bucket 5 (E).
3. Eliminate A: Σ_A f1(A) x f7(A,B,D,E) = f9(B,D,E), which goes into bucket 4 (B).
4. Eliminate B: Σ_B f2(B) x f9(B,D,E) = f10(D,E), which goes into bucket 5 (E).
5. Eliminate E: Σ_E f8(D,E) x f10(D,E) = f11(D), which goes into bucket 6 (D).
f11 is the final answer, once we normalize it.

19 Complexity of Variable Elimination
A hypergraph has vertices just like an ordinary graph, but instead of edges between pairs of vertices it contains hyperedges. A hyperedge is a set of vertices, i.e., it can contain more than two vertices.
Complexity of Variable Elimination
The hypergraph of a Bayes net: the vertices are the nodes of the Bayes net, and the hyperedges are the sets of variables appearing in each CPT, i.e., {Xi} ∪ Parents(Xi).
Complexity of Variable Elimination
P(A,B,C,D,E,F) = P(A) P(B) x P(C|A,B) x P(D|C) x P(E|C) x P(F|D,E)
Variable Elimination in the Hypergraph
To eliminate variable Xi in the hypergraph we:
- remove the vertex Xi;
- create a new hyperedge Hi equal to the union of all of the hyperedges that contain Xi, minus Xi;
- remove all of the hyperedges containing Xi from the hypergraph;
- add the new hyperedge Hi to the hypergraph.

20 Complexity of Variable Elimination
[The slide shows the hypergraph of the example network and the result of eliminating one variable, then another.]
Complexity of Variable Elimination
Notice that at the outset of VE we have a set of factors consisting of the reduced (restricted) CPTs. The unassigned variables give the vertices, and the sets of variables that the factors depend on give the hyperedges, of a hypergraph H1.
Variable Elimination
If the first variable we eliminate is X, then we remove all factors containing X (all hyperedges containing X) and add a new factor whose variables are the union of the variables in the factors containing X (that is, we add a hyperedge that is the union of the removed hyperedges minus X).

21 VE Factors
Ordering: C, F, A, B, E, D.
Factors: f1(A), f2(B), f3(A,B,C), f4(C,D), f5(C,E), f6(D,E,F).
VE: Place the original factors in the first applicable bucket.
1. C: f3(A,B,C), f4(C,D), f5(C,E)
2. F: f6(D,E,F)
3. A: f1(A)
4. B: f2(B)
5. E:
6. D:
VE: Eliminate C, placing the new factor f7(A,B,D,E) in the first applicable bucket (bucket 3, A).
VE: Eliminate F, placing the new factor f8(D,E) in the first applicable bucket (bucket 5, E).

22 VE: Eliminate A, placing the new factor f9(B,D,E) in the first applicable bucket (bucket 4, B).
VE: Eliminate B, placing the new factor f10(D,E) in the first applicable bucket (bucket 5, E).
VE: Eliminate E, placing the new factor f11(D) in the first applicable bucket (bucket 6, D).
Elimination Width
Given an ordering π of the variables and an initial hypergraph H, eliminating these variables yields a sequence of hypergraphs H = H0, H1, H2, ..., Hn, where Hn contains only one vertex (the query variable).
The elimination width of π is the maximum size (number of variables) of any hyperedge in any of the hypergraphs H0, H1, ..., Hn. The elimination width of the previous example was 4 ({A,B,D,E} in H1 and H2).

23 Elimination Width
If the elimination width of an ordering π is k, then the complexity of VE using that ordering is 2^O(k). Elimination width k means that at some stage in the elimination process a factor involving k variables was generated. That factor requires 2^O(k) space to store (so the space complexity of VE is 2^O(k)), and it requires 2^O(k) operations to process (either to compute it in the first place, or to eliminate one of its variables from it), so the time complexity of VE is also 2^O(k). Note that k is the elimination width of this particular ordering.
Tree Width
Given a hypergraph H with vertices {X1, X2, ..., Xn}, the tree width ω of H is the MINIMUM elimination width over the n! different orderings of the Xi, minus 1. Thus VE has best-case complexity 2^O(ω), where ω is the tree width of the initial Bayes net. In the worst case, the tree width is equal to the number of variables.
Different Orderings = Different Elimination Widths
Suppose the query variable is A. Different orderings of the remaining variables for the network shown on the slide give very different elimination widths: one ordering is bad, the other is good.
Tree Width
Exponential in the tree width is the best that VE can do. But finding an ordering that has elimination width equal to the tree width is NP-hard, so in practice there is no point in trying to speed up VE by finding the best possible elimination ordering. Instead, heuristics are used to find orderings with good (low) elimination widths. In practice, this can be very successful: elimination widths can often be relatively small (8-10) even when the network has thousands of variables, so VE can be much more efficient than simply summing the probability of all possible events (which is exponential in the number of variables). Sometimes, however, the tree width is equal to the number of variables.

24 Finding Good Orderings
A polytree is a singly connected Bayes net: in particular, there is only one path between any two nodes. A node can have multiple parents, but there are no cycles (even ignoring edge directions). Good orderings are easy to find for polytrees: at each stage eliminate a singly connected node. Because we have a polytree, we are assured that a singly connected node will exist at each elimination stage, and the size of the factors never increases.
Elimination Ordering: Polytrees
The tree width of a polytree is 1! Eliminating singly connected nodes allows VE to run in time linear in the size of the network. E.g., in the network on the slide, eliminate the outlying nodes and X1, ..., Xk in any order; the result is that no factor is ever larger than the original CPTs. Eliminating the central node before these, however, gives factors that include all of X1, ..., Xk!
Min Fill Heuristic
A fairly effective heuristic is to always eliminate next the variable that creates the smallest-size factor. This is called the min-fill heuristic. (In the slide's example, one candidate node would create a factor of size k+2, another a factor of size 2, and another a factor of size 1.) This heuristic always solves polytrees in linear time. A sketch of the idea in code follows this page.
Relevance
Certain variables have no impact on the query. For example, in the chain network A → B → C, computing P(A) with no evidence requires elimination of B and C. But when you sum out these variables, you compute trivial factors that always evaluate to 1; for example, eliminating C: f4(B) = Σ_C f3(B,C) = Σ_C P(C|B). This is 1 for any value of B (e.g., P(c|b) + P(~c|b) = 1). There is no need to think about B or C for this query.
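A minimal sketch of the min-fill idea (my own illustration, not the course's code): greedily pick, at each step, the variable whose elimination creates the smallest new factor, using the hypergraph view from the previous pages.

def min_fill_ordering(factors, query):
    """factors: list of sets of variable names (the hyperedges / CPT families).
    Returns an elimination ordering chosen greedily to minimize the size of the
    factor created at each step. The query variable is never eliminated."""
    factors = [set(f) for f in factors]
    remaining = set().union(*factors) - {query}
    order = []
    while remaining:
        def new_factor_size(v):
            touched = [f for f in factors if v in f]
            return len(set().union(*touched) - {v}) if touched else 0
        v = min(remaining, key=new_factor_size)
        touched = [f for f in factors if v in f]
        new_edge = set().union(*touched) - {v} if touched else set()
        factors = [f for f in factors if v not in f] + ([new_edge] if new_edge else [])
        order.append(v)
        remaining.remove(v)
    return order

# The running example P(A)P(B)P(C|A,B)P(D|C)P(E|C)P(F|D,E), with query D.
cpt_families = [{"A"}, {"B"}, {"A", "B", "C"}, {"C", "D"}, {"C", "E"}, {"D", "E", "F"}]
print(min_fill_ordering(cpt_families, query="D"))

Note how this avoids eliminating C first, since doing so would immediately create the large {A,B,D,E} factor seen in the bucket example.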

25 Relevance
We can restrict attention to relevant variables. Given query q and evidence E:
- q itself is relevant;
- if any node Z is relevant, its parents are relevant;
- if e in E is a descendant of a relevant node, then e is relevant.
We can restrict our attention to the subnetwork comprising only the relevant variables when evaluating a query P(Q | E).
Relevance: Examples
Query with no evidence: the relevant variables are the query variable and its ancestors.
Query given an observed descendant: the evidence variable, and hence its ancestors (here including G), also become relevant; intuitively, we need to compute the probability of the evidence in order to compute the conditional query.
Query given H = h: the relevant set is again the query variable and its ancestors. The CPTs of variables downstream of the relevant set sum to tables of 1s and drop out, and the remaining factor Σ_G P(G) P(h|G) is a constant: it is at most 1, but it is irrelevant because, once we normalize, it multiplies each value of the query variable equally.
Relevance: Examples
Query of a node given its parent: the algorithm says all variables except H are relevant; but really none are needed except the query and evidence nodes themselves (since the parent cuts off all influence of the others). The algorithm overestimates the relevant set.
Independence in a Bayes Net
Another piece of information we can obtain from a Bayes net is the structure of relationships in the domain. The structure of the BN means that every Xi is conditionally independent of all of its non-descendants given its parents:
P(Xi | S, Par(Xi)) = P(Xi | Par(Xi)) for any subset S of NonDescendants(Xi)

26 More generally...
Conditional independencies can be useful in computation, explanation, and other reasoning related to a BN. How do we determine whether two variables X, Y are independent given a set of evidence variables E? We can use a (simple) graphical property called d-separation.
D-separation: a set of variables E d-separates X and Y if it blocks every undirected path in the BN between X and Y (we'll define "blocks" next). X and Y are conditionally independent given evidence E if E d-separates X and Y. Thus a BN gives us an easy way to tell whether two variables are independent (set E = {}) or conditionally independent given E.
What does it mean to be blocked?
A path is blocked if:
- there exists a variable V on the path such that V is in the evidence set E and the arcs putting V on the path are tail-to-tail (... ← V → ...), or
- there exists a variable V on the path such that V is in the evidence set E and the arcs putting V on the path are tail-to-head (... → V → ...).
What does it mean to be blocked?
If a variable V on the path has its arcs head-to-head (... → V ← ...), the path is still blocked, so long as V is NOT in the evidence set and neither are any of its descendants.

27 D-separation implies conditional independence
Theorem [Verma & Pearl, 1988]: If a set of evidence variables E d-separates X and Z in a Bayesian network's graph, then X is independent of Z given E.
D-Separation: Intuitions
Subway and Therm are dependent; but they are independent given Flu (since Flu blocks the only path between them).
D-Separation: Intuitions
Aches and Fever are dependent; but they are independent given Flu (since Flu blocks the only path). Similarly for Aches and Therm (dependent, but independent given Flu).
D-Separation: Intuitions
Flu and Malaria are independent (given no evidence): Fever blocks the path, since it is not in evidence, nor is its descendant Therm. Flu and Malaria are dependent given Fever (or given Therm): nothing blocks the path now.
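Here is a minimal d-separation test in code (the standard "reachable" procedure from Koller & Friedman; my own illustration, not code from the course), applied to the Flu network used in these intuition slides. The edge directions are assumed to be the causal ones implied above: Subway → Flu, ExoticTrip → Malaria, Flu → Fever ← Malaria, Fever → Therm, Flu → Aches.

from collections import defaultdict, deque

def d_separated(edges, x, y, evidence):
    """True iff x and y are d-separated given `evidence` in the DAG `edges`
    (a list of (parent, child) pairs)."""
    parents, children = defaultdict(set), defaultdict(set)
    for p, c in edges:
        parents[c].add(p)
        children[p].add(c)
    # Phase 1: evidence nodes and their ancestors (needed for the head-to-head rule).
    anc, stack = set(), list(evidence)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])
    # Phase 2: trace active trails starting from x, tracking the travel direction.
    visited, queue = set(), deque([(x, "up")])
    while queue:
        v, d = queue.popleft()
        if (v, d) in visited:
            continue
        visited.add((v, d))
        if v == y and v not in evidence:
            return False                      # found an active trail to y
        if d == "up" and v not in evidence:   # arrived from a child: chain/fork cases
            queue.extend((p, "up") for p in parents[v])
            queue.extend((c, "down") for c in children[v])
        elif d == "down":                     # arrived from a parent
            if v not in evidence:
                queue.extend((c, "down") for c in children[v])
            if v in anc:                      # head-to-head node activated by evidence
                queue.extend((p, "up") for p in parents[v])
    return True

edges = [("Subway", "Flu"), ("ExoticTrip", "Malaria"), ("Flu", "Fever"),
         ("Malaria", "Fever"), ("Fever", "Therm"), ("Flu", "Aches")]
print(d_separated(edges, "Subway", "Therm", set()))           # False: dependent
print(d_separated(edges, "Subway", "Therm", {"Flu"}))         # True: independent given Flu
print(d_separated(edges, "Flu", "Malaria", set()))            # True
print(d_separated(edges, "Flu", "Malaria", {"Therm"}))        # False
print(d_separated(edges, "Subway", "ExoticTrip", {"Therm", "Malaria"}))  # True

The printed results match the (in)dependencies stated on these intuition slides.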

28 D-Separation: Intuitions
Subway and ExoticTrip are independent; they are dependent given Therm; and they are independent given Therm and Malaria. This is for exactly the same reasons as for Flu and Malaria above.
D-Separation Example
[The slide gives a network over several variables (including G and H) and asks, for ten different evidence sets, whether two particular variables are independent given the evidence.]
D-Separation Example
[The answers for the ten cases are: 1. No, 2. No, 3. Yes, 4. Yes, 5. No, 6. Yes, 7. No, 8. Yes, 9. Yes, 10. Yes.]


More information

Probabilistic representation and reasoning

Probabilistic representation and reasoning Probabilistic representation and reasoning Applied artificial intelligence (EDAF70) Lecture 04 2019-02-01 Elin A. Topp Material based on course book, chapter 13, 14.1-3 1 Show time! Two boxes of chocolates,

More information

Probabilistic Reasoning. Kee-Eung Kim KAIST Computer Science

Probabilistic Reasoning. Kee-Eung Kim KAIST Computer Science Probabilistic Reasoning Kee-Eung Kim KAIST Computer Science Outline #1 Acting under uncertainty Probabilities Inference with Probabilities Independence and Bayes Rule Bayesian networks Inference in Bayesian

More information

Bayesian Networks: Representation, Variable Elimination

Bayesian Networks: Representation, Variable Elimination Bayesian Networks: Representation, Variable Elimination CS 6375: Machine Learning Class Notes Instructor: Vibhav Gogate The University of Texas at Dallas We can view a Bayesian network as a compact representation

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic CS 188: Artificial Intelligence Fall 2008 Lecture 16: Bayes Nets III 10/23/2008 Announcements Midterms graded, up on glookup, back Tuesday W4 also graded, back in sections / box Past homeworks in return

More information

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty

Lecture 10: Introduction to reasoning under uncertainty. Uncertainty Lecture 10: Introduction to reasoning under uncertainty Introduction to reasoning under uncertainty Review of probability Axioms and inference Conditional probability Probability distributions COMP-424,

More information

Directed Graphical Models or Bayesian Networks

Directed Graphical Models or Bayesian Networks Directed Graphical Models or Bayesian Networks Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Bayesian Networks One of the most exciting recent advancements in statistical AI Compact

More information

COMPSCI 276 Fall 2007

COMPSCI 276 Fall 2007 Exact Inference lgorithms for Probabilistic Reasoning; OMPSI 276 Fall 2007 1 elief Updating Smoking lung ancer ronchitis X-ray Dyspnoea P lung cancer=yes smoking=no, dyspnoea=yes =? 2 Probabilistic Inference

More information

Example. Bayesian networks. Outline. Example. Bayesian networks. Example contd. Topology of network encodes conditional independence assertions:

Example. Bayesian networks. Outline. Example. Bayesian networks. Example contd. Topology of network encodes conditional independence assertions: opology of network encodes conditional independence assertions: ayesian networks Weather Cavity Chapter 4. 3 oothache Catch W eather is independent of the other variables oothache and Catch are conditionally

More information

The Bucket Elimination Algorithm for Belief Net Inference R Greiner October 22, 2006

The Bucket Elimination Algorithm for Belief Net Inference R Greiner October 22, 2006 The Bucket Elimination Algorithm for Belief Net Inference R Greiner October 22, 2006 In general, the input to the BuckElim algorithm is a belief net (structure plus CPtables) here: the Belief Net shown

More information

Probabilistic representation and reasoning

Probabilistic representation and reasoning Probabilistic representation and reasoning Applied artificial intelligence (EDA132) Lecture 09 2017-02-15 Elin A. Topp Material based on course book, chapter 13, 14.1-3 1 Show time! Two boxes of chocolates,

More information

Graphical Models - Part II

Graphical Models - Part II Graphical Models - Part II Bishop PRML Ch. 8 Alireza Ghane Outline Probabilistic Models Bayesian Networks Markov Random Fields Inference Graphical Models Alireza Ghane / Greg Mori 1 Outline Probabilistic

More information

Probabilistic Graphical Models (I)

Probabilistic Graphical Models (I) Probabilistic Graphical Models (I) Hongxin Zhang zhx@cad.zju.edu.cn State Key Lab of CAD&CG, ZJU 2015-03-31 Probabilistic Graphical Models Modeling many real-world problems => a large number of random

More information

p(x) p(x Z) = y p(y X, Z) = αp(x Y, Z)p(Y Z)

p(x) p(x Z) = y p(y X, Z) = αp(x Y, Z)p(Y Z) Graphical Models Foundations of Data Analysis Torsten Möller Möller/Mori 1 Reading Chapter 8 Pattern Recognition and Machine Learning by Bishop some slides from Russell and Norvig AIMA2e Möller/Mori 2

More information

Bayesian Networks 3 D-separation. D-separation

Bayesian Networks 3 D-separation. D-separation ayesian Networks 3 D-separation 1 D-separation iven a graph, we would like to read off independencies The converse is easier to think about: when does an independence statement not hold? g. when can X

More information

Another look at Bayesian. inference

Another look at Bayesian. inference Another look at Bayesian A general scenario: - Query variables: X inference - Evidence (observed) variables and their values: E = e - Unobserved variables: Y Inference problem: answer questions about the

More information

Reasoning under Uncertainty

Reasoning under Uncertainty Reasoning under Uncertainty Uncertainty This material is covered in chapters 13 and 14 of Russell and Norvig3 rd edition. Chapter 13 gives some basic background on probability from the point of view of

More information

Uncertainty. Introduction to Artificial Intelligence CS 151 Lecture 2 April 1, CS151, Spring 2004

Uncertainty. Introduction to Artificial Intelligence CS 151 Lecture 2 April 1, CS151, Spring 2004 Uncertainty Introduction to Artificial Intelligence CS 151 Lecture 2 April 1, 2004 Administration PA 1 will be handed out today. There will be a MATLAB tutorial tomorrow, Friday, April 2 in AP&M 4882 at

More information

Bayesian networks (1) Lirong Xia

Bayesian networks (1) Lirong Xia Bayesian networks (1) Lirong Xia Random variables and joint distributions Ø A random variable is a variable with a domain Random variables: capital letters, e.g. W, D, L values: small letters, e.g. w,

More information

Artificial Intelligence Methods. Inference in Bayesian networks

Artificial Intelligence Methods. Inference in Bayesian networks Artificial Intelligence Methods Inference in Bayesian networks In which we explain how to build network models to reason under uncertainty according to the laws of probability theory. Dr. Igor rajkovski

More information

Reasoning Under Uncertainty

Reasoning Under Uncertainty Reasoning Under Uncertainty Introduction Representing uncertain knowledge: logic and probability (a reminder!) Probabilistic inference using the joint probability distribution Bayesian networks The Importance

More information

Conditional Independence

Conditional Independence Conditional Independence Sargur Srihari srihari@cedar.buffalo.edu 1 Conditional Independence Topics 1. What is Conditional Independence? Factorization of probability distribution into marginals 2. Why

More information

Intelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks

Intelligent Systems: Reasoning and Recognition. Reasoning with Bayesian Networks Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2016/2017 Lesson 13 24 march 2017 Reasoning with Bayesian Networks Naïve Bayesian Systems...2 Example

More information