Course Overview. Summary. Outline


Lecture 9: Bayesian Networks
Marco Chiarandini
Department of Mathematics & Computer Science, University of Southern Denmark
Slides by Stuart Russell and Peter Norvig

Course Overview
Introduction: Artificial Intelligence, Intelligent Agents
Search: Uninformed Search, Heuristic Search
Adversarial Search: Minimax search, Alpha-beta pruning
Knowledge representation and Reasoning: Propositional logic, First order logic, Inference
Uncertain knowledge and Reasoning: Probability and Bayesian approach, Bayesian Networks, Hidden Markov Chains, Kalman Filters
Learning: Decision Trees, Maximum Likelihood, EM Algorithm, Learning Bayesian Networks, Neural Networks, Support vector machines

Summary
Interpretations of probability
Axioms of Probability
(Continuous/Discrete) Random Variables
Prior probability, joint probability, conditional or posterior probability, chain rule
Inference by enumeration
How to reduce the computation of inference?

Probability basics: Independence
$A$ and $B$ are independent iff $P(A \mid B) = P(A)$ or $P(B \mid A) = P(B)$ or $P(A, B) = P(A)P(B)$.
[Figure: the network over Cavity, Toothache, Catch, Weather decomposes into {Cavity, Toothache, Catch} and {Weather}]
$P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity)\,P(Weather)$
32 entries reduced to 12; for $n$ independent biased coins, $2^n \to n$.
Absolute independence is powerful but rare.
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Conditional independence
$P(Toothache, Cavity, Catch)$ has $2^3 - 1 = 7$ independent entries.
If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache:
(1) $P(catch \mid toothache, cavity) = P(catch \mid cavity)$
The same independence holds if I haven't got a cavity:
(2) $P(catch \mid toothache, \neg cavity) = P(catch \mid \neg cavity)$
Catch is conditionally independent of Toothache given Cavity:
$P(Catch \mid Toothache, Cavity) = P(Catch \mid Cavity)$
Equivalent statements:
$P(Toothache \mid Catch, Cavity) = P(Toothache \mid Cavity)$
$P(Toothache, Catch \mid Cavity) = P(Toothache \mid Cavity)\,P(Catch \mid Cavity)$

Conditional independence contd.
Write out the full joint distribution using the chain rule:
$P(Toothache, Catch, Cavity)$
$= P(Toothache \mid Catch, Cavity)\,P(Catch, Cavity)$
$= P(Toothache \mid Catch, Cavity)\,P(Catch \mid Cavity)\,P(Cavity)$
$= P(Toothache \mid Cavity)\,P(Catch \mid Cavity)\,P(Cavity)$
I.e., $2 + 2 + 1 = 5$ independent numbers (equations 1 and 2 remove 2).
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in $n$ to linear in $n$.
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
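The factorization above can be checked numerically. The following is a minimal sketch, assuming five illustrative numbers that are not taken from the slides; it rebuilds the 8-entry joint from them and confirms that Catch does not depend on Toothache once Cavity is known. The helper names (`bernoulli`, `prob`, `catch_given`) are hypothetical.

```python
from itertools import product

# Five illustrative numbers (assumed for this sketch, not from the slides).
p_cavity = 0.2
p_toothache = {True: 0.6, False: 0.1}   # P(toothache | Cavity = key)
p_catch = {True: 0.9, False: 0.2}       # P(catch | Cavity = key)


def bernoulli(p_true, value):
    """Probability of a Boolean value given P(value = True)."""
    return p_true if value else 1.0 - p_true


# Joint P(Toothache, Catch, Cavity) built from the three factors.
joint = {
    (t, c, cav): bernoulli(p_toothache[cav], t)
    * bernoulli(p_catch[cav], c)
    * bernoulli(p_cavity, cav)
    for t, c, cav in product([True, False], repeat=3)
}


def prob(predicate):
    """Sum the joint over all atomic events satisfying the predicate."""
    return sum(p for event, p in joint.items() if predicate(*event))


def catch_given(toothache, cavity):
    """P(catch | Toothache = toothache, Cavity = cavity) from the joint."""
    num = prob(lambda t, c, cav: c and t == toothache and cav == cavity)
    den = prob(lambda t, c, cav: t == toothache and cav == cavity)
    return num / den


for cavity in (True, False):
    print(cavity, catch_given(True, cavity), catch_given(False, cavity))
```

The two values printed on each line agree (up to rounding), which is exactly what equations (1) and (2) assert.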

Bayes' Rule
Product rule: $P(a \wedge b) = P(a \mid b)P(b) = P(b \mid a)P(a)$
$\Rightarrow$ Bayes' rule: $P(a \mid b) = \frac{P(b \mid a)P(a)}{P(b)}$
or in distribution form:
$P(Y \mid X) = \frac{P(X \mid Y)P(Y)}{P(X)} = \alpha P(X \mid Y)P(Y)$
Useful for assessing diagnostic probability from causal probability:
$P(Cause \mid Effect) = \frac{P(Effect \mid Cause)P(Cause)}{P(Effect)}$
E.g., let $M$ be meningitis, $S$ be stiff neck:
$P(m \mid s) = \frac{P(s \mid m)P(m)}{P(s)}$
Note: posterior probability of meningitis still very small!

Bayes' Rule and conditional independence
$P(Cavity \mid toothache \wedge catch)$
$= \alpha P(toothache \wedge catch \mid Cavity)\,P(Cavity)$
$= \alpha P(toothache \mid Cavity)\,P(catch \mid Cavity)\,P(Cavity)$
This is an example of a naive Bayes model:
$P(Cause, Effect_1, \ldots, Effect_n) = P(Cause) \prod_i P(Effect_i \mid Cause)$
[Figure: Cavity with children Toothache and Catch; Cause with children Effect_1 ... Effect_n]
Total number of parameters is linear in $n$.

Summary
Probability is a rigorous formalism for uncertain knowledge.
Joint probability distribution specifies probability of every atomic event.
Queries can be answered by summing over atomic events.
For nontrivial domains, we must find a way to reduce the joint size.
Independence and conditional independence provide the tools.
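The naive Bayes computation above, including the normalization constant $\alpha$, can be sketched as follows. The function name `naive_bayes_posterior` and all probabilities are assumptions chosen for illustration; they are not values from the slides.

```python
def naive_bayes_posterior(prior, likelihoods, observed):
    """Posterior over causes given observed Boolean effects.

    prior:       {cause: P(cause)}
    likelihoods: {cause: {effect: P(effect = True | cause)}}
    observed:    {effect: bool}
    """
    unnormalized = {}
    for cause, p in prior.items():
        for effect, value in observed.items():
            q = likelihoods[cause][effect]
            p *= q if value else 1.0 - q
        unnormalized[cause] = p
    alpha = 1.0 / sum(unnormalized.values())  # normalization constant
    return {cause: alpha * p for cause, p in unnormalized.items()}


# Illustrative numbers (assumed, not from the slides).
prior = {"cavity": 0.2, "no_cavity": 0.8}
likelihoods = {
    "cavity": {"toothache": 0.6, "catch": 0.9},
    "no_cavity": {"toothache": 0.1, "catch": 0.2},
}
posterior = naive_bayes_posterior(
    prior, likelihoods, {"toothache": True, "catch": True}
)
print(posterior)
```

Because $\alpha$ is computed from the unnormalized products, $P(toothache \wedge catch)$ never has to be specified separately.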

Outline
Syntax
Semantics
Parameterized distributions

Definition
A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions.
Syntax:
  a set of nodes, one per variable
  a directed, acyclic graph (link ≈ "directly influences")
  a conditional distribution for each node given its parents: $P(X_i \mid Parents(X_i))$
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over $X_i$ for each combination of parent values.

Example
Topology of network encodes conditional independence assertions:
[Figure: Weather (no links); Cavity with children Toothache and Catch]
Weather is independent of the other variables.
Toothache and Catch are conditionally independent given Cavity.

Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglar, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects "causal" knowledge:
  A burglar can set the alarm off
  An earthquake can set the alarm off
  The alarm can cause Mary to call
  The alarm can cause John to call

Example contd.
[Figure: burglary network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of JohnCalls and MaryCalls. CPTs shown: P(B) = .001, P(E) = .002, P(A | B, E), P(J | A), P(M | A)]

Compactness
A CPT for Boolean $X_i$ with $k$ Boolean parents has $2^k$ rows for the combinations of parent values.
Each row requires one number $p$ for $X_i = true$ (the number for $X_i = false$ is just $1 - p$).
If each variable has no more than $k$ parents, the complete network requires $O(n \cdot 2^k)$ numbers.
I.e., grows linearly with $n$, vs. $O(2^n)$ for the full joint distribution.
For the burglary net, $1 + 1 + 4 + 2 + 2 = 10$ numbers (vs. $2^5 - 1 = 31$).

Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i))$
e.g., $P(j \wedge m \wedge a \wedge \neg b \wedge \neg e) = P(j \mid a)\,P(m \mid a)\,P(a \mid \neg b, \neg e)\,P(\neg b)\,P(\neg e)$

Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents.
[Figure: node X with parents U_1 ... U_m, children Y_1 ... Y_n, and the children's other parents Z_1j ... Z_nj]
Theorem: Local semantics $\Leftrightarrow$ global semantics
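The global semantics and the parameter count can be illustrated with a small sketch. The CPT values below are the commonly quoted textbook numbers for the burglary example and are an assumption here, since most of the table did not survive in the text above; the dictionary layout and the name `joint_probability` are likewise hypothetical.

```python
from itertools import product

# {variable: (parents, {parent values: P(variable = True | parents)})}
# CPT values are ASSUMED (usual textbook numbers), not recovered from the slide.
network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}


def joint_probability(event):
    """P(x_1, ..., x_n) as the product of P(x_i | parents(X_i))."""
    p = 1.0
    for var, (parents, cpt) in network.items():
        p_true = cpt[tuple(event[q] for q in parents)]
        p *= p_true if event[var] else 1.0 - p_true
    return p


# One number per CPT row: 1 + 1 + 4 + 2 + 2.
n_parameters = sum(len(cpt) for _, cpt in network.values())

event = {"J": True, "M": True, "A": True, "B": False, "E": False}
print(n_parameters, joint_probability(event))
```

Summing `joint_probability` over all $2^5$ assignments gives 1, which is a quick sanity check that the product of local distributions is a proper joint distribution.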

Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents.
[Figure: node X with parents U_1 ... U_m, children Y_1 ... Y_n, and the children's other parents Z_1j ... Z_nj]

Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.
1. Choose an ordering of variables $X_1, \ldots, X_n$
2. For $i = 1$ to $n$:
     add $X_i$ to the network
     select parents from $X_1, \ldots, X_{i-1}$ such that $P(X_i \mid Parents(X_i)) = P(X_i \mid X_1, \ldots, X_{i-1})$
This choice of parents guarantees the global semantics:
$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$ (chain rule)
$= \prod_{i=1}^{n} P(X_i \mid Parents(X_i))$ (by construction)

Example
Suppose we choose the ordering M, J, A, B, E:
$P(J \mid M) = P(J)$? No
$P(A \mid J, M) = P(A \mid J)$? $P(A \mid J, M) = P(A)$? No
$P(B \mid A, J, M) = P(B \mid A)$? Yes
$P(B \mid A, J, M) = P(B)$? No
$P(E \mid B, A, J, M) = P(E \mid A)$? No
$P(E \mid B, A, J, M) = P(E \mid A, B)$? Yes
[Figure: resulting network over MaryCalls, JohnCalls, Alarm, Burglary, Earthquake]
Deciding conditional independence is hard in noncausal directions. (Causal models and conditional independence seem hardwired for humans!)
Assessing conditional probabilities is hard in noncausal directions.
Network is less compact: $1 + 2 + 4 + 2 + 4 = 13$ numbers needed.

Example: Car insurance
[Figure: car insurance network with nodes SocioEcon, Age, GoodStudent, ExtraCar, Mileage, RiskAversion, VehicleYear, SeniorTrain, DrivingSkill, MakeModel, DrivingHist, Antilock, DrivQuality, Airbag, CarValue, HomeBase, AntiTheft, Ruggedness, Accident, Theft, OwnDamage, Cushioning, OtherCost, OwnCost, MedicalCost, LiabilityCost, PropertyCost]
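Both the Markov blanket definition and the effect of variable ordering on compactness depend only on the graph structure, so they can be sketched with plain parent lists. The parent sets for the ordering M, J, A, B, E follow the Yes/No answers above; the function names are hypothetical.

```python
def markov_blanket(node, parents):
    """Parents + children + children's other parents of a node."""
    children = [c for c, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])
    blanket.discard(node)
    return blanket


def cpt_numbers(parents):
    """Independent numbers needed when all variables are Boolean."""
    return sum(2 ** len(ps) for ps in parents.values())


# Causal ordering B, E, A, J, M.
causal = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

# Ordering M, J, A, B, E with the parent sets selected in the example.
noncausal = {"M": [], "J": ["M"], "A": ["J", "M"], "B": ["A"], "E": ["A", "B"]}

print(markov_blanket("A", causal))
print(cpt_numbers(causal), cpt_numbers(noncausal))
```

The count for the causal structure reproduces the 10 numbers of the burglary net, and the noncausal ordering reproduces the 13 numbers quoted above.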

Compact conditional distributions
CPT grows exponentially with number of parents.
CPT becomes infinite with continuous-valued parent or child.
Solution: canonical distributions that are defined compactly.
Deterministic nodes are the simplest case: $X = f(Parents(X))$ for some function $f$.
E.g., Boolean functions: $NorthAmerican \Leftrightarrow Canadian \vee US \vee Mexican$
E.g., numerical relationships among continuous variables: $\frac{\partial Level}{\partial t} = inflow + precipitation - outflow - evaporation$

Compact conditional distributions contd.
Noisy-OR distributions model multiple noninteracting causes:
1) Parents $U_1 \ldots U_k$ include all causes (can add leak node)
2) Independent failure probability $q_i$ for each cause alone
$\Rightarrow P(X \mid U_1 \ldots U_j, \neg U_{j+1} \ldots \neg U_k) = 1 - \prod_{i=1}^{j} q_i$
[Table: P(Fever) and P(¬Fever) for each combination of Cold, Flu, Malaria]
Number of parameters linear in number of parents.

Hybrid (discrete+continuous) networks
Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)
[Figure: Subsidy? and Harvest are parents of Cost; Cost is the parent of Buys?]
Option 1: discretization, with possibly large errors and large CPTs
Option 2: finitely parameterized canonical families
1) Continuous variable, discrete+continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)

Continuous child variables
Need one conditional density function for the child variable given continuous parents, for each possible assignment to discrete parents.
Most common is the linear Gaussian model, e.g.:
$P(Cost = c \mid Harvest = h, Subsidy? = true) = N(a_t h + b_t, \sigma_t)(c) = \frac{1}{\sigma_t \sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{c - (a_t h + b_t)}{\sigma_t}\right)^2\right)$
Mean Cost varies linearly with Harvest, variance is fixed.
Linear variation is unreasonable over the full range but works OK if the likely range of Harvest is narrow.
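The noisy-OR formula above can be sketched in a few lines. The failure probabilities in `q` are assumed values for illustration, because the numeric Fever table did not survive in the text; the function name `noisy_or` is hypothetical.

```python
from itertools import product


def noisy_or(q, active_causes):
    """P(X = True | exactly the given causes are present).

    q maps each cause to its failure probability q_i, i.e. the chance
    that the cause, acting alone, fails to produce the effect.
    """
    p_all_fail = 1.0
    for cause in active_causes:
        p_all_fail *= q[cause]
    return 1.0 - p_all_fail


# Failure probabilities: ASSUMED values for this sketch.
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

# The full CPT (2^3 rows) is generated from only three parameters.
causes = list(q)
cpt = {}
for values in product([False, True], repeat=len(causes)):
    active = [c for c, on in zip(causes, values) if on]
    cpt[values] = noisy_or(q, active)

for row, p_fever in cpt.items():
    print(row, round(p_fever, 3), round(1.0 - p_fever, 3))
```

With no active cause the product is empty, so the effect has probability 0; a leak node, as mentioned above, is the usual way to allow a nonzero baseline.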

Continuous child variables contd.
[Figure: surface plot of P(c | h, subsidy) over Cost c and Harvest h]
All-continuous network with linear Gaussian distributions $\Rightarrow$ full joint distribution is a multivariate Gaussian.
Discrete+continuous linear Gaussian network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values.

Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a "soft" threshold:
[Figure: cumulative probability of the normal distribution (µ = 0) plotted against x]
Probit distribution uses integral of Gaussian:
$\Phi(x) = \int_{-\infty}^{x} N(0, 1)(x)\,dx$
$P(Buys? = true \mid Cost = c) = \Phi((-c + \mu)/\sigma)$

Why the probit?
1. It's sort of the right shape
2. Can be viewed as hard threshold whose location is subject to noise
[Figure: Cost and Noise feeding into Buys?]

Discrete variable contd.
Sigmoid (or logit) distribution also used in neural networks:
$P(Buys? = true \mid Cost = c) = \frac{1}{1 + \exp\left(-2\,\frac{-c + \mu}{\sigma}\right)}$
Sigmoid has similar shape to probit but much longer tails:
[Figure: cumulative probability of the logistic distribution (location = 0, scale = 1) plotted against x]
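The two soft-threshold models can be compared numerically. This is a minimal sketch using only the standard library; the threshold location `mu` and width `sigma` are assumed values, and $\Phi$ is evaluated through `math.erf`.

```python
import math


def probit(cost, mu, sigma):
    """P(Buys? = true | Cost = cost) = Phi((-cost + mu) / sigma)."""
    z = (-cost + mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sigmoid(cost, mu, sigma):
    """P(Buys? = true | Cost = cost) = 1 / (1 + exp(-2 (-cost + mu) / sigma))."""
    return 1.0 / (1.0 + math.exp(-2.0 * (-cost + mu) / sigma))


mu, sigma = 6.0, 1.0  # assumed threshold location and width

for cost in (2.0, 4.0, 6.0, 8.0, 10.0):
    print(cost, probit(cost, mu, sigma), sigmoid(cost, mu, sigma))
```

Both curves pass through 0.5 at `cost == mu` and decrease with cost; far above the threshold the sigmoid value stays noticeably larger than the probit value, which is the "longer tails" remark above.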

Summary
Bayes nets provide a natural representation for (causally induced) conditional independence.
Topology + CPTs = compact representation of joint distribution.
Generally easy for (non)experts to construct.
Canonical distributions (e.g., noisy-OR) = compact representation of CPTs.
Continuous variables $\Rightarrow$ parameterized distributions (e.g., linear Gaussian).

Inference tasks
Simple queries: compute posterior marginal $P(X_i \mid E = e)$, e.g., $P(NoGas \mid Gauge = empty, Lights = on, Starts = false)$
Conjunctive queries: $P(X_i, X_j \mid E = e) = P(X_i \mid E = e)\,P(X_j \mid X_i, E = e)$
Optimal decisions: decision networks include utility information; probabilistic inference required for $P(outcome \mid action, evidence)$
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?

Inference by enumeration
Sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network:
$P(B \mid j, m) = P(B, j, m)/P(j, m) = \alpha P(B, j, m) = \alpha \sum_e \sum_a P(B, e, a, j, m)$
Rewrite full joint entries using product of CPT entries:
$P(B \mid j, m) = \alpha \sum_e \sum_a P(B)\,P(e)\,P(a \mid B, e)\,P(j \mid a)\,P(m \mid a)$
$= \alpha P(B) \sum_e P(e) \sum_a P(a \mid B, e)\,P(j \mid a)\,P(m \mid a)$
[Figure: burglary network with B and E as parents of A, and A as parent of J and M]
Recursive depth-first enumeration: $O(n)$ space, $O(d^n)$ time.

Enumeration algorithm
function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value x_i of X do
    extend e with value x_i for X
    Q(x_i) ← Enumerate-All(Vars[bn], e)
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | parent(Y)) × Enumerate-All(Rest(vars), e)
    else return Σ_y P(y | parent(Y)) × Enumerate-All(Rest(vars), e_y)
      where e_y is e extended with Y = y

Evaluation tree
[Figure: evaluation tree for P(b | j, m). Root P(b) = .001; branches on e with P(e) = .002 and P(¬e) = .998; then on a with P(a | b, e), P(¬a | b, e), P(a | b, ¬e), P(¬a | b, ¬e); leaves P(j | a) = .90, P(j | ¬a) = .05, P(m | a), P(m | ¬a)]
Enumeration is inefficient: repeated computation, e.g., computes $P(j \mid a)\,P(m \mid a)$ for each value of $e$.

Complexity of exact inference
Singly connected networks (or polytrees):
  any two nodes are connected by at most one (undirected) path
  time and space cost (with variable elimination) are $O(d^k n)$
  hence time and space cost are linear in $n$ when $k$ is bounded by a constant
Multiply connected networks:
  can reduce 3SAT to exact inference $\Rightarrow$ NP-hard
  equivalent to counting 3SAT models $\Rightarrow$ #P-complete
[Figure: 3SAT reduction network with variable nodes A, B, C, D, three clause nodes (1. A v B v C, 2. C v D v A, 3. B v C v D) and an AND node]

Inference by stochastic simulation
Basic idea:
1) Draw $N$ samples from a sampling distribution $S$
2) Compute an approximate posterior probability $\hat{P}$
3) Show this converges to the true probability $P$
Outline:
  Sampling from an empty network
  Rejection sampling: reject samples disagreeing with evidence
  Likelihood weighting: use evidence to weight samples
  Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior
[Figure: a coin with probability 0.5]
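The Enumeration-Ask pseudocode above translates fairly directly into Python. The sketch below assumes the usual textbook CPT values for the burglary network (only a few of them survive in the text) and a dictionary layout chosen for this example; variables are listed in topological order, so parents are always assigned before their children.

```python
# {variable: (parents, {parent values: P(variable = True | parents)})}
# CPT values are ASSUMED (usual textbook numbers for this example).
network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}


def prob(var, value, event):
    """P(var = value | parents(var)), reading parent values from event."""
    parents, cpt = network[var]
    p_true = cpt[tuple(event[p] for p in parents)]
    return p_true if value else 1.0 - p_true


def enumerate_all(variables, event):
    """Sum the joint over all variables not fixed in event."""
    if not variables:
        return 1.0
    first, rest = variables[0], variables[1:]
    if first in event:
        return prob(first, event[first], event) * enumerate_all(rest, event)
    return sum(
        prob(first, value, event) * enumerate_all(rest, {**event, first: value})
        for value in (True, False)
    )


def enumeration_ask(query, evidence):
    """Normalized distribution over a Boolean query variable."""
    q = {
        value: enumerate_all(list(network), {**evidence, query: value})
        for value in (True, False)
    }
    total = sum(q.values())
    return {value: p / total for value, p in q.items()}


posterior = enumeration_ask("B", {"J": True, "M": True})
print(posterior)
```

Like the pseudocode, this recomputes the $P(j \mid a)\,P(m \mid a)$ products for each value of $e$; it is meant to mirror the algorithm, not to be efficient.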

Sampling from an empty network
function Prior-Sample(bn) returns an event sampled from bn
  inputs: bn, a belief network specifying joint distribution P(X_1, ..., X_n)
  x ← an event with n elements
  for i = 1 to n do
    x_i ← a random sample from P(X_i | parents(X_i)) given the values of Parents(X_i) in x
  return x

Example
[Figure: sprinkler network. Cloudy is the parent of Sprinkler and Rain; Sprinkler and Rain are parents of WetGrass. CPTs shown: P(C) = .50, P(S | C), P(R | C), P(W | S, R)]

Sampling from an empty network contd.
Probability that PriorSample generates a particular event:
$S_{PS}(x_1 \ldots x_n) = \prod_{i=1}^{n} P(x_i \mid parents(X_i)) = P(x_1 \ldots x_n)$
i.e., the true prior probability.
E.g., $S_{PS}(t, f, t, t) = P(t, f, t, t)$
Proof: Let $N_{PS}(x_1 \ldots x_n)$ be the number of samples generated for event $x_1, \ldots, x_n$. Then we have
$\lim_{N \to \infty} \hat{P}(x_1, \ldots, x_n) = \lim_{N \to \infty} N_{PS}(x_1, \ldots, x_n)/N$
$= S_{PS}(x_1, \ldots, x_n)$
$= \prod_{i=1}^{n} P(x_i \mid parents(X_i)) = P(x_1 \ldots x_n)$
That is, estimates derived from PriorSample are consistent.
Shorthand: $\hat{P}(x_1, \ldots, x_n) \approx P(x_1 \ldots x_n)$
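The consistency claim above can be illustrated by running Prior-Sample many times and comparing the relative frequency of one event with its exact product-of-CPTs probability. Apart from P(C) = .50, the CPT values below are assumptions, since the tables did not survive in the text; the list layout, the seed, and the sample size are also choices made for this sketch.

```python
import random

# (variable, parents, {parent values: P(variable = True | parents)}),
# listed in topological order. Values other than P(Cloudy) are ASSUMED.
sprinkler_net = [
    ("Cloudy", (), {(): 0.50}),
    ("Sprinkler", ("Cloudy",), {(True,): 0.10, (False,): 0.50}),
    ("Rain", ("Cloudy",), {(True,): 0.80, (False,): 0.20}),
    ("WetGrass", ("Sprinkler", "Rain"),
     {(True, True): 0.99, (True, False): 0.90,
      (False, True): 0.90, (False, False): 0.01}),
]


def prior_sample(bn, rng):
    """Sample each variable in order, given its already-sampled parents."""
    x = {}
    for var, parents, cpt in bn:
        p_true = cpt[tuple(x[p] for p in parents)]
        x[var] = rng.random() < p_true
    return x


def exact_probability(bn, event):
    """S_PS(event): product of P(x_i | parents(X_i))."""
    p = 1.0
    for var, parents, cpt in bn:
        p_true = cpt[tuple(event[q] for q in parents)]
        p *= p_true if event[var] else 1.0 - p_true
    return p


rng = random.Random(0)  # fixed seed so the run is repeatable
n_samples = 100_000
target = {"Cloudy": True, "Sprinkler": False, "Rain": True, "WetGrass": True}

hits = sum(prior_sample(sprinkler_net, rng) == target for _ in range(n_samples))
estimate = hits / n_samples
exact = exact_probability(sprinkler_net, target)
print(estimate, exact)
```

The estimate fluctuates around the exact value with a standard error of roughly $\sqrt{p(1-p)/N}$, so increasing `n_samples` tightens the agreement.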
