BAYESIAN NETWORKS
AIMA2e Chapter 14.1-5 (some topics excluded)

Outline
- Syntax
- Semantics
- Parameterized distributions
- Inference
  - Exact inference (enumeration, variable elimination)
  - Approximate inference (stochastic simulation)

Bayesian networks
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
Syntax:
- Random variables: a set of nodes, one per variable
- Topology: a directed, acyclic graph (link ≈ "directly influences")
- Probabilities: a conditional distribution for each node given its parents: $\mathbf{P}(X_i \mid Parents(X_i))$
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over $X_i$ for each combination of parent values.

Example
Topology of network encodes conditional independence assertions:
[Figure: network over Weather, Cavity, Toothache, Catch. Cavity is the parent of Toothache and Catch; Weather is unconnected.]
- Weather is independent of the other variables
- Toothache and Catch are conditionally independent given Cavity
$P(Toothache, Catch, Cavity, Weather) = P(Toothache \mid Cavity)\, P(Catch \mid Cavity)\, P(Cavity)\, P(Weather)$

Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call

Example contd.
[Figure: Burglary and Earthquake are the parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls.]

P(B) = .001    P(E) = .002

B  E | P(A|B,E)
T  T | .95
T  F | .94
F  T | .29
F  F | .001

A | P(J|A)      A | P(M|A)
T | .90         T | .70
F | .05         F | .01

Compactness
A CPT for Boolean $X_i$ with $k$ Boolean parents has $2^k$ rows for the combinations of parent values.
Each row requires one number $p$ for $X_i = true$ (the number for $X_i = false$ is just $1 - p$).
If each variable has no more than $k$ parents, the complete network requires $O(n \cdot 2^k)$ numbers.
I.e., grows linearly with $n$, vs. $O(2^n)$ for the full joint distribution.
For burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. $2^5 - 1 = 31$).
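The count for the burglary net can be checked with a short sketch. The `parents` dictionary and the function names are illustrative choices, not part of the slides, and every variable is assumed to be Boolean.

```python
# Count the independent numbers needed by a Boolean Bayesian network.
# Assumption: every variable is Boolean, so a node with k parents needs 2**k numbers.
parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

def network_size(parents):
    """Total CPT rows: one number per combination of parent values."""
    return sum(2 ** len(ps) for ps in parents.values())

def full_joint_size(n):
    """Independent entries in the full joint over n Boolean variables."""
    return 2 ** n - 1

print(network_size(parents), full_joint_size(len(parents)))  # 10 31
```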

Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
$\mathbf{P}(X_1, \ldots, X_n) = \prod_{i=1}^{n} \mathbf{P}(X_i \mid Parents(X_i))$
e.g., $P(j \land m \land a \land \lnot b \land \lnot e) = P(j \mid a)\, P(m \mid a)\, P(a \mid \lnot b, \lnot e)\, P(\lnot b)\, P(\lnot e)$
[Figure: burglary network with nodes B, E, A, J, M.]
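As a minimal sketch, the product of local conditionals can be evaluated directly for the burglary network. The CPT numbers below are assumed from the standard AIMA burglary example, and the helper names `pr` and `joint` are hypothetical.

```python
# CPTs of the burglary network, stored as P(var = True | parent values).
# Assumption: numbers follow the standard AIMA burglary example.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def pr(p_true, value):
    """Probability of a Boolean value given P(var = True)."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as the product of the local conditional probabilities."""
    return (pr(P_B, b) * pr(P_E, e) * pr(P_A[(b, e)], a)
            * pr(P_J[a], j) * pr(P_M[a], m))

# P(j ^ m ^ a ^ ~b ^ ~e) = 0.9 * 0.7 * 0.001 * 0.999 * 0.998
print(joint(b=False, e=False, a=True, j=True, m=True))
```

The printed value is about 0.000628. Summing `joint` over all 32 assignments gives 1, which is a quick sanity check that the product defines a proper distribution.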

Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents.
[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and the children's other parents Z_1j, ..., Z_nj.]
Theorem: Local semantics ⇔ global semantics

Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents.
[Figure: node X with parents U_1, ..., U_m, children Y_1, ..., Y_n, and the children's other parents Z_1j, ..., Z_nj.]
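The definition translates into a few lines of code. This is a sketch that assumes the network is given as a dictionary mapping each node to its parents; the dictionary and the function name are illustrative.

```python
# Burglary network structure: node -> list of parents (illustrative encoding).
parents = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

def markov_blanket(node, parents):
    """Parents, children, and children's other parents of `node`."""
    children = [v for v, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)  # the node is not part of its own blanket
    return blanket

print(sorted(markov_blanket("Burglary", parents)))  # ['Alarm', 'Earthquake']
```

Earthquake appears in the blanket of Burglary because the two share the child Alarm.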

Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.
1. Choose an ordering of variables $X_1, \ldots, X_n$
2. For $i = 1$ to $n$:
   add $X_i$ to the network
   select parents from $X_1, \ldots, X_{i-1}$ such that $\mathbf{P}(X_i \mid Parents(X_i)) = \mathbf{P}(X_i \mid X_1, \ldots, X_{i-1})$
This choice of parents guarantees the global semantics:
$\mathbf{P}(X_1, \ldots, X_n) = \prod_{i=1}^{n} \mathbf{P}(X_i \mid X_1, \ldots, X_{i-1})$ (chain rule)
$= \prod_{i=1}^{n} \mathbf{P}(X_i \mid Parents(X_i))$ (by construction)

Example
Suppose we choose the ordering M, J, A, B, E
[Figure: network built in this order over MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.]
$P(J \mid M) = P(J)$? No
$P(A \mid J, M) = P(A \mid J)$? $P(A \mid J, M) = P(A)$? No
$P(B \mid A, J, M) = P(B \mid A)$? Yes
$P(B \mid A, J, M) = P(B)$? No
$P(E \mid B, A, J, M) = P(E \mid A)$? No
$P(E \mid B, A, J, M) = P(E \mid A, B)$? Yes

Example contd.
[Figure: network over MaryCalls, JohnCalls, Alarm, Burglary, Earthquake resulting from the ordering M, J, A, B, E.]
Deciding conditional independence is hard in noncausal directions.
(Causal models and conditional independence seem hardwired for humans!)
Assessing conditional probabilities is hard in noncausal directions.
Network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed.

Example: Car diagnosis
Initial evidence: car won't start
Testable variables (green), "broken, so fix it" variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters
[Figure: network over battery age, alternator broken, fanbelt broken, battery dead, no charging, battery meter, battery flat, no oil, no gas, fuel line blocked, starter broken, lights, oil light, gas gauge, car won't start, dipstick.]

Example: Car insurance
[Figure: network over SocioEcon, Age, GoodStudent, Mileage, RiskAversion, SeniorTrain, ExtraCar, VehicleYear, DrivingSkill, MakeModel, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Ruggedness, Accident, Theft, OwnDamage, Cushioning, OtherCost, OwnCost, MedicalCost, LiabilityCost, PropertyCost.]

Compact conditional distributions
CPT grows exponentially with no. of parents
CPT becomes infinite with continuous-valued parent or child
Solution: canonical distributions that are defined compactly
Deterministic nodes are the simplest case:
$X = f(Parents(X))$ for some function $f$
E.g., Boolean functions
$NorthAmerican \Leftrightarrow Canadian \lor US \lor Mexican$
E.g., numerical relationships among continuous variables
$\frac{\partial Level}{\partial t} = \text{inflow} + \text{precipitation} - \text{outflow} - \text{evaporation}$

Compact conditional distributions contd.
Noisy-OR distributions model multiple noninteracting causes
1) Parents $U_1 \ldots U_k$ include all causes (can add leak node)
2) Independent failure probability $q_i$ for each cause alone
$\Rightarrow P(X \mid U_1 \ldots U_j, \lnot U_{j+1} \ldots \lnot U_k) = 1 - \prod_{i=1}^{j} q_i$

Cold Flu Malaria | P(Fever) | P(¬Fever)
F    F   F       | 0.0      | 1.0
F    F   T       | 0.9      | 0.1
F    T   F       | 0.8      | 0.2
F    T   T       | 0.98     | 0.02 = 0.2 × 0.1
T    F   F       | 0.4      | 0.6
T    F   T       | 0.94     | 0.06 = 0.6 × 0.1
T    T   F       | 0.88     | 0.12 = 0.6 × 0.2
T    T   T       | 0.988    | 0.012 = 0.6 × 0.2 × 0.1

Number of parameters linear in number of parents
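The whole fever table follows from the three failure probabilities. Here is a small sketch that regenerates it; the dictionary `q` and the function name `noisy_or` are illustrative, and no leak node is modelled.

```python
from itertools import product

# Failure probabilities q_i from the fever example: the probability that
# each cause, acting alone, fails to produce fever.
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def noisy_or(present, q):
    """P(effect | exactly the causes in `present` are true) = 1 - prod q_i."""
    p_not = 1.0
    for cause in present:
        p_not *= q[cause]
    return 1.0 - p_not

# Rebuild the full 8-row CPT from the 3 parameters.
for values in product([False, True], repeat=len(q)):
    present = [cause for cause, v in zip(q, values) if v]
    p = noisy_or(present, q)
    print(values, round(p, 3), round(1.0 - p, 3))
```

Eight rows are produced from three numbers, which is the sense in which the parameter count is linear in the number of parents.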

Continuous nodes
Networks may have discrete RVs, continuous RVs, or a mix of the two.
- All continuous (e.g., conditional Gaussian). Linear dynamic systems (Kalman filter).
- Discrete parents, continuous children (e.g., conditional Gaussian). Gaussian mixture models.
- Continuous parents, discrete children (e.g., probit and logit functions). Difficult to deal with.

Inference tasks
Simple queries: compute posterior marginal $\mathbf{P}(X_i \mid \mathbf{E} = \mathbf{e})$
e.g., $P(NoGas \mid Gauge = empty, Lights = on, Starts = false)$
Conjunctive queries: $\mathbf{P}(X_i, X_j \mid \mathbf{E} = \mathbf{e}) = \mathbf{P}(X_i \mid \mathbf{E} = \mathbf{e})\, \mathbf{P}(X_j \mid X_i, \mathbf{E} = \mathbf{e})$
Optimal decisions: decision networks include utility information; probabilistic inference required for $P(outcome \mid action, evidence)$
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?

Inference by enumeration
Slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation
Simple query on the burglary network:
$\mathbf{P}(B \mid j, m) = \mathbf{P}(B, j, m) / P(j, m) = \alpha\, \mathbf{P}(B, j, m) = \alpha \sum_e \sum_a \mathbf{P}(B, e, a, j, m)$
Rewrite full joint entries using product of CPT entries:
$\mathbf{P}(B \mid j, m) = \alpha \sum_e \sum_a \mathbf{P}(B)\, P(e)\, \mathbf{P}(a \mid B, e)\, P(j \mid a)\, P(m \mid a)$
$= \alpha\, \mathbf{P}(B) \sum_e P(e) \sum_a \mathbf{P}(a \mid B, e)\, P(j \mid a)\, P(m \mid a)$
Recursive depth-first enumeration: $O(n)$ space, $O(d^n)$ time
[Figure: burglary network with nodes B, E, A, J, M.]

Enumeration algorithm
function ENUMERATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
    extend e with value x_i for X
    Q(x_i) ← ENUMERATE-ALL(VARS[bn], e)
  return NORMALIZE(Q(X))

function ENUMERATE-ALL(vars, e) returns a real number
  if EMPTY?(vars) then return 1.0
  Y ← FIRST(vars)
  if Y has value y in e
    then return P(y | Pa(Y)) × ENUMERATE-ALL(REST(vars), e)
    else return Σ_y P(y | Pa(Y)) × ENUMERATE-ALL(REST(vars), e_y)
      where e_y is e extended with Y = y
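The pseudocode maps fairly directly onto Python. The sketch below is one possible rendering for Boolean variables, not a reference implementation; the data layout (`VARS`, `PARENTS`, `CPT`) is an illustrative choice and the CPT numbers are assumed from the standard AIMA burglary example.

```python
# Burglary network: variables in topological order, their parents, and
# P(var = True | parent values). Assumption: standard AIMA CPT values.
VARS = ["B", "E", "A", "J", "M"]
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "J": ("A",), "M": ("A",)}
CPT = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def prob(var, value, event):
    """P(var = value | parents(var)), with parent values read from event."""
    p_true = CPT[var][tuple(event[p] for p in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, event):
    """Sum the joint over every variable that has no value in event."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in event:
        return prob(first, event[first], event) * enumerate_all(rest, event)
    return sum(prob(first, y, event) * enumerate_all(rest, {**event, first: y})
               for y in (True, False))

def enumeration_ask(X, evidence):
    """Posterior distribution over Boolean X given the evidence."""
    q = {x: enumerate_all(VARS, {**evidence, X: x}) for x in (True, False)}
    total = sum(q.values())
    return {x: p / total for x, p in q.items()}

print(enumeration_ask("B", {"J": True, "M": True}))
```

With these numbers the posterior probability of a burglary given both calls is roughly 0.284. The recursion depends on `VARS` being in topological order, so that every parent already has a value when `prob` is called.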

Evaluation tree
Enumeration is inefficient: repeated computation
e.g., computes $P(j \mid a)\, P(m \mid a)$ for each value of $e$
[Figure: evaluation tree for the query. The root P(b) = .001 branches on P(e) = .002 and P(¬e) = .998; the next level holds P(a|b,e) = .95, P(¬a|b,e) = .05, P(a|b,¬e) = .94, P(¬a|b,¬e) = .06; every path ends with P(j|a) = .90 or P(j|¬a) = .05 followed by P(m|a) = .70 or P(m|¬a) = .01.]

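Inference by variable elimination
Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation
$\mathbf{P}(B \mid j, m) = \alpha\, \underbrace{\mathbf{P}(B)}_{B} \sum_e \underbrace{P(e)}_{E} \sum_a \underbrace{\mathbf{P}(a \mid B, e)}_{A}\, \underbrace{P(j \mid a)}_{J}\, \underbrace{P(m \mid a)}_{M}$
$= \alpha\, \mathbf{P}(B) \sum_e P(e) \sum_a \mathbf{P}(a \mid B, e)\, P(j \mid a)\, f_M(a)$
$= \alpha\, \mathbf{P}(B) \sum_e P(e) \sum_a \mathbf{P}(a \mid B, e)\, f_J(a)\, f_M(a)$
$= \alpha\, \mathbf{P}(B) \sum_e P(e) \sum_a f_A(a, b, e)\, f_J(a)\, f_M(a)$
$= \alpha\, \mathbf{P}(B) \sum_e P(e)\, f_{\bar{A}JM}(b, e)$ (sum out A)
$= \alpha\, \mathbf{P}(B)\, f_{\bar{E}\bar{A}JM}(b)$ (sum out E)
$= \alpha\, f_B(b) \times f_{\bar{E}\bar{A}JM}(b)$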

Variable elimination: Basic operations
Summing out a variable from a product of factors:
- move any constant factors outside the summation
- add up submatrices in pointwise product of remaining factors
$\sum_x f_1 \times \cdots \times f_k = f_1 \times \cdots \times f_i \sum_x f_{i+1} \times \cdots \times f_k = f_1 \times \cdots \times f_i \times f_{\bar{X}}$
assuming $f_1, \ldots, f_i$ do not depend on $X$
Pointwise product of factors $f_1$ and $f_2$:
$f_1(x_1, \ldots, x_j, y_1, \ldots, y_k) \times f_2(y_1, \ldots, y_k, z_1, \ldots, z_l) = f(x_1, \ldots, x_j, y_1, \ldots, y_k, z_1, \ldots, z_l)$
E.g., $f_1(a, b) \times f_2(b, c) = f(a, b, c)$
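Both operations can be sketched on a simple factor representation. Here a factor is assumed to be a pair `(variables, table)` over Boolean variables, where `table` maps tuples of values to numbers; this layout and the example numbers in `f1` and `f2` are made up for illustration.

```python
from itertools import product

def pointwise_product(f1, f2):
    """Multiply two factors; the result ranges over the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for values in product((True, False), repeat=len(out_vars)):
        row = dict(zip(out_vars, values))
        k1 = tuple(row[v] for v in vars1)
        k2 = tuple(row[v] for v in vars2)
        table[values] = t1[k1] * t2[k2]
    return out_vars, table

def sum_out(var, factor):
    """Eliminate `var` by adding up the entries that agree on all other variables."""
    vars_, t = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for values, p in t.items():
        key = values[:i] + values[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# Illustrative factors f1(A, B) and f2(B, C); the numbers are arbitrary.
f1 = (("A", "B"), {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.9, (False, False): 0.1})
f2 = (("B", "C"), {(True, True): 0.2, (True, False): 0.8,
                   (False, True): 0.6, (False, False): 0.4})

f = pointwise_product(f1, f2)   # factor over (A, B, C)
g = sum_out("B", f)             # factor over (A, C)
print(f[0], g[0])
```

The product has 2^3 entries and summing out B leaves 2^2, matching the "add up submatrices" description.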

Variable elimination algorithm
function ELIMINATION-ASK(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, evidence specified as an event
          bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  factors ← []; vars ← REVERSE(VARS[bn])
  for each var in vars do
    factors ← [MAKE-FACTOR(var, e) | factors]
    if var is a hidden variable then factors ← SUM-OUT(var, factors)
  return NORMALIZE(POINTWISE-PRODUCT(factors))

Irrelevant variables
Consider the query $P(JohnCalls \mid Burglary = true)$
$P(J \mid b) = \alpha\, P(b) \sum_e P(e) \sum_a P(a \mid b, e)\, P(J \mid a) \sum_m P(m \mid a)$
Sum over $m$ is identically 1; $M$ is irrelevant to the query
Thm 1: $Y$ is irrelevant unless $Y \in Ancestors(\{X\} \cup \mathbf{E})$
Here, $X = JohnCalls$, $\mathbf{E} = \{Burglary\}$, and $Ancestors(\{X\} \cup \mathbf{E}) = \{Alarm, Earthquake\}$, so $M$ is irrelevant
[Figure: burglary network with nodes B, E, A, J, M.]
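Theorem 1 suggests a simple pruning step before inference. As a sketch, assuming the network is stored as a node-to-parents dictionary (the names below are illustrative), the relevant set is the query and evidence variables plus all of their ancestors.

```python
# Burglary network structure: node -> parents (illustrative encoding).
PARENTS = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

def ancestors(nodes, parents):
    """All proper ancestors of any node in `nodes`."""
    seen = set()
    stack = list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def relevant_variables(query, evidence, parents):
    """Query and evidence variables together with their ancestors."""
    targets = {query} | set(evidence)
    return targets | ancestors(targets, parents)

relevant = relevant_variables("JohnCalls", ["Burglary"], PARENTS)
print(sorted(set(PARENTS) - relevant))  # ['MaryCalls']
```

Apart from the query and evidence variables themselves, the relevant set contains only Alarm and Earthquake, which is the ancestor set stated on the slide.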

Complexity of exact inference
Singly connected networks (or polytrees):
- any two nodes are connected by at most one (undirected) path
- time and space cost of variable elimination are $O(d^k n)$
Multiply connected networks:
- can reduce 3SAT to exact inference ⇒ NP-hard
- equivalent to counting 3SAT models ⇒ #P-complete

Inference by stochastic simulation
Basic idea:
1) Draw $N$ samples from a sampling distribution $S$
2) Compute an approximate posterior probability $\hat{P}$
3) Show this converges to the true probability $P$
Outline:
- Sampling from an empty network
- Rejection sampling: reject samples disagreeing with evidence
- Likelihood weighting: use evidence to weight samples
[Figure: a coin labelled 0.5.]

Sampling from an empty network
function PRIOR-SAMPLE(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  x ← an event with n elements
  for i = 1 to n do
    x_i ← a random sample from P(X_i | Parents(X_i))
  return x

Example
[Figure: the sprinkler network, repeated over seven slides as one sample is drawn node by node. Cloudy is the parent of Sprinkler and Rain, which are both parents of Wet Grass. CPT entries shown include P(S|C) = .10, P(R|C) = .80 and .20, and P(W|S,R) ranging from .99 to .01.]

Sampling from an empty network contd.
Probability that PRIOR-SAMPLE generates a particular event:
$S_{PS}(x_1 \ldots x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(X_i)) = P(x_1 \ldots x_n)$
i.e., the true prior probability
E.g., $S_{PS}(t, f, t, t) = 0.5 \times 0.9 \times 0.8 \times 0.9 = 0.324 = P(t, f, t, t)$
Let $N_{PS}(x_1 \ldots x_n)$ be the number of samples generated for event $x_1, \ldots, x_n$
Then we have
$\lim_{N \to \infty} \hat{P}(x_1, \ldots, x_n) = \lim_{N \to \infty} N_{PS}(x_1, \ldots, x_n) / N = S_{PS}(x_1, \ldots, x_n) = P(x_1 \ldots x_n)$
That is, estimates derived from PRIOR-SAMPLE are consistent
Shorthand: $\hat{P}(x_1, \ldots, x_n) \approx P(x_1 \ldots x_n)$
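The consistency claim can be checked empirically with a short sketch of PRIOR-SAMPLE on the sprinkler network. The CPT values are assumed from the standard AIMA sprinkler example (they agree with the 0.5, 0.9, 0.8, 0.9 factors used above), and the `NETWORK` layout and function names are illustrative.

```python
import random

# Sprinkler network in topological order:
# (variable, parents, P(var = True | parent values)).
# Assumption: CPT values follow the standard AIMA sprinkler example.
NETWORK = [
    ("Cloudy", (), {(): 0.5}),
    ("Sprinkler", ("Cloudy",), {(True,): 0.10, (False,): 0.50}),
    ("Rain", ("Cloudy",), {(True,): 0.80, (False,): 0.20}),
    ("WetGrass", ("Sprinkler", "Rain"),
     {(True, True): 0.99, (True, False): 0.90,
      (False, True): 0.90, (False, False): 0.01}),
]

def prior_sample(rng):
    """Sample each variable in order, conditioned on its sampled parents."""
    event = {}
    for var, parents, cpt in NETWORK:
        p_true = cpt[tuple(event[p] for p in parents)]
        event[var] = rng.random() < p_true
    return event

def estimate(target, n, rng):
    """Fraction of n prior samples equal to the target event."""
    hits = sum(prior_sample(rng) == target for _ in range(n))
    return hits / n

rng = random.Random(0)
target = {"Cloudy": True, "Sprinkler": False, "Rain": True, "WetGrass": True}
print(estimate(target, 20000, rng))  # should be close to 0.324
```

The estimate fluctuates from run to run; increasing `n` tightens it around 0.324.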

Rejection sampling
$\hat{\mathbf{P}}(X \mid \mathbf{e})$ estimated from samples agreeing with e
function REJECTION-SAMPLING(X, e, bn, N) returns an estimate of P(X | e)
  local variables: N, a vector of counts over X, initially zero
  for j = 1 to N do
    x ← PRIOR-SAMPLE(bn)
    if x is consistent with e then
      N[x] ← N[x] + 1 where x is the value of X in x
  return NORMALIZE(N[X])
E.g., estimate $\mathbf{P}(Rain \mid Sprinkler = true)$ using 100 samples
27 samples have Sprinkler = true
Of these, 8 have Rain = true and 19 have Rain = false.
$\hat{\mathbf{P}}(Rain \mid Sprinkler = true) = NORMALIZE(\langle 8, 19 \rangle) = \langle 0.296, 0.704 \rangle$
Similar to a basic real-world empirical estimation procedure
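A minimal Python sketch of the procedure, again on the sprinkler network. The CPT values are assumed from the standard AIMA sprinkler example and the data layout is illustrative; the sketch also assumes at least one sample agrees with the evidence, since otherwise normalization would divide by zero.

```python
import random

# Sprinkler network in topological order:
# (variable, parents, P(var = True | parent values)).
# Assumption: CPT values follow the standard AIMA sprinkler example.
NETWORK = [
    ("Cloudy", (), {(): 0.5}),
    ("Sprinkler", ("Cloudy",), {(True,): 0.10, (False,): 0.50}),
    ("Rain", ("Cloudy",), {(True,): 0.80, (False,): 0.20}),
    ("WetGrass", ("Sprinkler", "Rain"),
     {(True, True): 0.99, (True, False): 0.90,
      (False, True): 0.90, (False, False): 0.01}),
]

def prior_sample(rng):
    """Sample each variable in order, conditioned on its sampled parents."""
    event = {}
    for var, parents, cpt in NETWORK:
        event[var] = rng.random() < cpt[tuple(event[p] for p in parents)]
    return event

def rejection_sampling(X, evidence, n, rng):
    """Estimate P(X | evidence) from the prior samples that agree with it."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample(rng)
        if all(sample[v] == val for v, val in evidence.items()):
            counts[sample[X]] += 1
    total = sum(counts.values())  # assumed nonzero; see the note above
    return {x: c / total for x, c in counts.items()}

print(rejection_sampling("Rain", {"Sprinkler": True}, 20000, random.Random(0)))
```

Under the assumed CPTs the exact answer is P(Rain = true | Sprinkler = true) = 0.09 / 0.30 = 0.3, so the estimate should land near 0.3, in line with the 0.296 obtained from 100 samples on the slide. About 70% of the samples are thrown away here because P(Sprinkler = true) = 0.3.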

Analysis of rejection sampling
$\hat{\mathbf{P}}(X \mid \mathbf{e}) = \alpha\, \mathbf{N}_{PS}(X, \mathbf{e})$ (algorithm defn.)
$= \mathbf{N}_{PS}(X, \mathbf{e}) / N_{PS}(\mathbf{e})$ (normalized by $N_{PS}(\mathbf{e})$)
$\approx \mathbf{P}(X, \mathbf{e}) / P(\mathbf{e})$ (property of PRIOR-SAMPLE)
$= \mathbf{P}(X \mid \mathbf{e})$ (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates
Problem: hopelessly expensive if $P(\mathbf{e})$ is small
$P(\mathbf{e})$ drops off exponentially with number of evidence variables!

Likelihood weighting
Idea: fix evidence variables, sample only nonevidence variables, and weight each sample by the likelihood it accords the evidence
function LIKELIHOOD-WEIGHTING(X, e, bn, N) returns an estimate of P(X | e)
  local variables: W, a vector of weighted counts over X, initially zero
  for j = 1 to N do
    x, w ← WEIGHTED-SAMPLE(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return NORMALIZE(W[X])

function WEIGHTED-SAMPLE(bn, e) returns an event and a weight
  x ← an event with n elements; w ← 1
  for i = 1 to n do
    if X_i has a value x_i in e
      then w ← w × P(X_i = x_i | Parents(X_i))
      else x_i ← a random sample from P(X_i | Parents(X_i))
  return x, w
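The two functions can be sketched in Python as follows. As before, the sprinkler CPT values are assumed from the standard AIMA example and the data layout is illustrative; the query shown, P(Rain | Sprinkler = true, WetGrass = true), is chosen here only as a demonstration.

```python
import random

# Sprinkler network in topological order:
# (variable, parents, P(var = True | parent values)).
# Assumption: CPT values follow the standard AIMA sprinkler example.
NETWORK = [
    ("Cloudy", (), {(): 0.5}),
    ("Sprinkler", ("Cloudy",), {(True,): 0.10, (False,): 0.50}),
    ("Rain", ("Cloudy",), {(True,): 0.80, (False,): 0.20}),
    ("WetGrass", ("Sprinkler", "Rain"),
     {(True, True): 0.99, (True, False): 0.90,
      (False, True): 0.90, (False, False): 0.01}),
]

def weighted_sample(evidence, rng):
    """Fix evidence variables, sample the rest, and accumulate the weight."""
    event, weight = dict(evidence), 1.0
    for var, parents, cpt in NETWORK:
        p_true = cpt[tuple(event[p] for p in parents)]
        if var in evidence:
            # Evidence is not sampled; it scales the weight by its likelihood.
            weight *= p_true if evidence[var] else 1.0 - p_true
        else:
            event[var] = rng.random() < p_true
    return event, weight

def likelihood_weighting(X, evidence, n, rng):
    """Estimate P(X | evidence) from n weighted samples."""
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        event, weight = weighted_sample(evidence, rng)
        totals[event[X]] += weight
    norm = sum(totals.values())
    return {x: w / norm for x, w in totals.items()}

evidence = {"Sprinkler": True, "WetGrass": True}
print(likelihood_weighting("Rain", evidence, 20000, random.Random(0)))
```

Under the assumed CPTs the exact posterior is 0.0891 / 0.2781, about 0.32, for Rain = true. Unlike rejection sampling, every sample contributes, but samples with small weights contribute little.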

Likelihood weighting example
[Figure: the sprinkler network, repeated over seven slides as one weighted sample is drawn node by node.]
The weight starts at w = 1.0, becomes w = 1.0 × 0.1 once the first evidence variable is reached, and ends at w = 1.0 × 0.1 × 0.99 = 0.099.

Likelihood weighting analysis
Sampling probability for WEIGHTED-SAMPLE is
$S_{WS}(\mathbf{z}, \mathbf{e}) = \prod_{i=1}^{l} P(z_i \mid Parents(Z_i))$
Note: pays attention to evidence in ancestors only ⇒ somewhere "in between" prior and posterior distribution
Weight for a given sample z, e is
$w(\mathbf{z}, \mathbf{e}) = \prod_{i=1}^{m} P(e_i \mid Parents(E_i))$
Weighted sampling probability is
$S_{WS}(\mathbf{z}, \mathbf{e})\, w(\mathbf{z}, \mathbf{e}) = \prod_{i=1}^{l} P(z_i \mid Parents(Z_i)) \prod_{i=1}^{m} P(e_i \mid Parents(E_i)) = P(\mathbf{z}, \mathbf{e})$ (by standard global semantics of network)
Hence likelihood weighting returns consistent estimates but performance still degrades with many evidence variables because a few samples have nearly all the total weight
[Figure: sprinkler network with nodes Cloudy, Sprinkler, Rain, Wet Grass.]