Reasoning with Bayesian Networks — Lecture 1: Probability Calculus (NICTA and ANU)
Overview of the Course
Probability calculus, Bayesian networks
Inference by variable elimination, factor elimination, conditioning
Models for graph decomposition
Most likely instantiations
Complexity of probabilistic inference
Compiling Bayesian networks
Inference with local structure
Applications
Probabilities of Worlds
Worlds are modeled with propositional variables
A world is a complete instantiation of the variables
The probabilities of all worlds add up to 1
An event is a set of worlds
Probabilities of Events
$\Pr(\alpha) \stackrel{\text{def}}{=} \sum_{\omega \models \alpha} \Pr(\omega)$
$\Pr(\textit{Earthquake}) = \Pr(\omega_1) + \Pr(\omega_2) + \Pr(\omega_3) + \Pr(\omega_4) = .1$
$\Pr(\neg\textit{Burglary}) = .8$
$\Pr(\textit{Alarm}) = .2442$
Propositional Sentences as Events
$\Pr((\textit{Earthquake} \vee \textit{Burglary}) \wedge \textit{Alarm}) = ?$
Interpret world $\omega$ as a model (satisfying assignment) of propositional sentence $\alpha$
$\Pr(\alpha) \stackrel{\text{def}}{=} \sum_{\omega \models \alpha} \Pr(\omega)$
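A minimal sketch in Python of this worlds-and-events semantics. The joint table below is not printed on these slides; it is a hypothetical reconstruction chosen to be consistent with the numbers the slides quote (.1, .8, .2442, and the conditional beliefs used later).

```python
from itertools import product

# Hypothetical joint distribution over worlds (complete instantiations of
# Earthquake, Burglary, Alarm), reconstructed to match the slides' numbers.
VARS = ("Earthquake", "Burglary", "Alarm")
joint = {
    (True,  True,  True):  0.0190, (True,  True,  False): 0.0010,
    (True,  False, True):  0.0560, (True,  False, False): 0.0240,
    (False, True,  True):  0.1620, (False, True,  False): 0.0180,
    (False, False, True):  0.0072, (False, False, False): 0.7128,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9  # world probabilities sum to 1

def pr(event):
    """Pr(alpha): sum of Pr(omega) over worlds omega satisfying alpha.
    `event` is a predicate mapping a world (a dict) to True/False."""
    return sum(p for world, p in joint.items()
               if event(dict(zip(VARS, world))))

print(pr(lambda w: w["Earthquake"]))    # 0.1
print(pr(lambda w: not w["Burglary"]))  # 0.8
print(pr(lambda w: w["Alarm"]))         # 0.2442
print(pr(lambda w: (w["Earthquake"] or w["Burglary"]) and w["Alarm"]))
```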
Bayes Conditioning
Suppose we're given evidence $\beta$ that Alarm = true
Need to update $\Pr(\cdot)$ to $\Pr(\cdot \mid \beta)$
Bayes Conditioning
$\Pr(\beta \mid \beta) = 1$, $\Pr(\neg\beta \mid \beta) = 0$
Partition worlds into those $\models \beta$ and those $\models \neg\beta$
$\Pr(\omega \mid \beta) = 0$ for all $\omega \models \neg\beta$
Relative beliefs stay put for $\omega \models \beta$
$\sum_{\omega \models \beta} \Pr(\omega) = \Pr(\beta)$ and $\sum_{\omega \models \beta} \Pr(\omega \mid \beta) = 1$
The above imply that we normalize by the constant $\Pr(\beta)$:
$\Pr(\omega \mid \beta) \stackrel{\text{def}}{=} \begin{cases} 0, & \text{if } \omega \models \neg\beta \\ \Pr(\omega)/\Pr(\beta), & \text{if } \omega \models \beta \end{cases}$
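A sketch of this definition in code, reusing the hypothetical `joint` table and `VARS` from the previous snippet:

```python
def condition(joint, evidence):
    """Return Pr(. | beta): worlds falsifying the evidence get probability 0,
    the remaining worlds are scaled by the constant 1/Pr(beta)."""
    pr_beta = sum(p for world, p in joint.items()
                  if evidence(dict(zip(VARS, world))))
    if pr_beta == 0:
        raise ValueError("cannot condition on a zero-probability event")
    return {world: (p / pr_beta if evidence(dict(zip(VARS, world))) else 0.0)
            for world, p in joint.items()}

# Condition on Alarm = true; the updated beliefs again sum to 1.
posterior = condition(joint, lambda w: w["Alarm"])
assert abs(sum(posterior.values()) - 1.0) < 1e-9
```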
Bayes Conditioning
$\Pr(\alpha \mid \beta) = \sum_{\omega \models \alpha} \Pr(\omega \mid \beta)$
$= \sum_{\omega \models \alpha,\ \omega \models \beta} \Pr(\omega \mid \beta) + \sum_{\omega \models \alpha,\ \omega \models \neg\beta} \Pr(\omega \mid \beta)$
$= \sum_{\omega \models \alpha \wedge \beta} \Pr(\omega \mid \beta)$
$= \sum_{\omega \models \alpha \wedge \beta} \Pr(\omega)/\Pr(\beta)$
$= \frac{1}{\Pr(\beta)} \sum_{\omega \models \alpha \wedge \beta} \Pr(\omega)$
$= \frac{\Pr(\alpha \wedge \beta)}{\Pr(\beta)}$
Bayes Conditioning
$\Pr(\textit{Burglary}) = .2$
$\Pr(\textit{Burglary} \mid \textit{Alarm}) \approx .741$
$\Pr(\textit{Burglary} \mid \textit{Alarm} \wedge \textit{Earthquake}) \approx .253$
$\Pr(\textit{Burglary} \mid \textit{Alarm} \wedge \neg\textit{Earthquake}) \approx .957$
Independence
$\Pr(\textit{Earthquake}) = .1$
$\Pr(\textit{Earthquake} \mid \textit{Burglary}) = .1$
Event $\alpha$ is independent of event $\beta$ if $\Pr(\alpha \mid \beta) = \Pr(\alpha)$ or $\Pr(\beta) = 0$
Or equivalently $\Pr(\alpha \wedge \beta) = \Pr(\alpha)\Pr(\beta)$
Independence is symmetric
Conditional Independence
Independence is a dynamic notion
Burglary is independent of Earthquake, but not after accepting evidence Alarm:
$\Pr(\textit{Burglary} \mid \textit{Alarm}) \approx .741$
$\Pr(\textit{Burglary} \mid \textit{Alarm} \wedge \textit{Earthquake}) \approx .253$
$\alpha$ is independent of $\beta$ given $\gamma$ if $\Pr(\alpha \wedge \beta \mid \gamma) = \Pr(\alpha \mid \gamma)\Pr(\beta \mid \gamma)$ or $\Pr(\gamma) = 0$
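A quick numeric check of both notions over the same hypothetical tables (`pr` and `condition` are the helpers sketched above):

```python
def independent(a, b, given=lambda w: True):
    """Test Pr(a & b | g) == Pr(a | g) * Pr(b | g), up to rounding."""
    post = condition(joint, given)
    pr_g = lambda ev: sum(p for world, p in post.items()
                          if ev(dict(zip(VARS, world))))
    return abs(pr_g(lambda w: a(w) and b(w)) - pr_g(a) * pr_g(b)) < 1e-9

E = lambda w: w["Earthquake"]
B = lambda w: w["Burglary"]
A = lambda w: w["Alarm"]
print(independent(E, B))           # True: independent a priori
print(independent(E, B, given=A))  # False: evidence Alarm creates dependence
```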
Variable Independence
For disjoint sets of variables $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$:
$I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$: $\mathbf{X}$ is independent of $\mathbf{Y}$ given $\mathbf{Z}$ in $\Pr$
Means $\mathbf{x}$ is independent of $\mathbf{y}$ given $\mathbf{z}$ for all instantiations $\mathbf{x}, \mathbf{y}, \mathbf{z}$
Probability Calculus: Chain Rule
$\Pr(\alpha_1 \wedge \alpha_2 \wedge \cdots \wedge \alpha_n) = \Pr(\alpha_1 \mid \alpha_2 \wedge \cdots \wedge \alpha_n)\,\Pr(\alpha_2 \mid \alpha_3 \wedge \cdots \wedge \alpha_n)\cdots\Pr(\alpha_n)$
Probability Calculus: Case Analysis
$\Pr(\alpha) = \sum_{i=1}^{n} \Pr(\alpha \wedge \beta_i)$
Or equivalently
$\Pr(\alpha) = \sum_{i=1}^{n} \Pr(\alpha \mid \beta_i)\Pr(\beta_i)$
where the events $\beta_i$ are mutually exclusive and collectively exhaustive
Probability Calculus: Case Analysis
Special case of $n = 2$:
$\Pr(\alpha) = \Pr(\alpha \wedge \beta) + \Pr(\alpha \wedge \neg\beta)$
Or equivalently
$\Pr(\alpha) = \Pr(\alpha \mid \beta)\Pr(\beta) + \Pr(\alpha \mid \neg\beta)\Pr(\neg\beta)$
Probability Calculus: Bayes Rule
$\Pr(\alpha \mid \beta) = \dfrac{\Pr(\beta \mid \alpha)\Pr(\alpha)}{\Pr(\beta)}$
Bayes Rule Example: patient tested positive for a disease
$D$: patient has the disease; $T$: test comes out positive
Would like to know $\Pr(D \mid T)$
We know: $\Pr(D) = 1/1000$, $\Pr(T \mid \neg D) = 2/100$ (false positive), $\Pr(\neg T \mid D) = 5/100$ (false negative)
Using Bayes rule: $\Pr(D \mid T) = \dfrac{\Pr(T \mid D)\Pr(D)}{\Pr(T)}$
Bayes Rule: computing $\Pr(T)$ by case analysis
$\Pr(T) = \Pr(T \mid D)\Pr(D) + \Pr(T \mid \neg D)\Pr(\neg D) = \frac{95}{100}\cdot\frac{1}{1000} + \frac{2}{100}\cdot\frac{999}{1000} = \frac{2093}{100000}$
$\Pr(D \mid T) = \frac{95}{2093} \approx 4.5\%$
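The same calculation as a worked snippet, using exact fractions so the $95/2093$ result is reproduced precisely:

```python
from fractions import Fraction as F

pr_d = F(1, 1000)              # Pr(D): prior probability of disease
pr_t_given_not_d = F(2, 100)   # false positive rate, Pr(T | ~D)
pr_not_t_given_d = F(5, 100)   # false negative rate, Pr(~T | D)
pr_t_given_d = 1 - pr_not_t_given_d

# Case analysis: Pr(T) = Pr(T|D)Pr(D) + Pr(T|~D)Pr(~D)
pr_t = pr_t_given_d * pr_d + pr_t_given_not_d * (1 - pr_d)

# Bayes rule: Pr(D|T) = Pr(T|D)Pr(D) / Pr(T)
pr_d_given_t = pr_t_given_d * pr_d / pr_t
print(pr_t)          # 2093/100000
print(pr_d_given_t)  # 95/2093, about 4.5%
```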
Soft Evidence
So far we've considered hard evidence: that some event has occurred
Soft evidence: a neighbor with a hearing problem calls to tell us they heard the alarm
May not confirm Alarm, but can still increase our belief in Alarm
Two main methods to specify the strength of soft evidence
Soft Evidence: All Things Considered
"After receiving my neighbor's call, my belief in Alarm is now .85"
Soft evidence as a constraint $\Pr'(\beta) = q$
Not a statement about the strength of the evidence per se, but the result of its integration with our initial beliefs
Scale beliefs in worlds $\models \beta$ so they add up to $q$
Scale beliefs in worlds $\models \neg\beta$ so they add up to $1 - q$
Soft Evidence: All Things Considered
For worlds:
$\Pr'(\omega) \stackrel{\text{def}}{=} \begin{cases} \dfrac{q}{\Pr(\beta)}\Pr(\omega), & \text{if } \omega \models \beta \\[4pt] \dfrac{1-q}{\Pr(\neg\beta)}\Pr(\omega), & \text{if } \omega \models \neg\beta \end{cases}$
For events (Jeffrey's rule):
$\Pr'(\alpha) = q \Pr(\alpha \mid \beta) + (1 - q)\Pr(\alpha \mid \neg\beta)$
Bayes conditioning is the special case of $q = 1$
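A compact sketch of Jeffrey's rule over the same hypothetical `joint` table; with $q = 1$ it reduces to ordinary Bayes conditioning:

```python
def jeffrey(joint, beta, q):
    """Shift Pr(beta) to q: scale beta-worlds by q/Pr(beta) and
    non-beta-worlds by (1-q)/Pr(~beta)."""
    pr_beta = sum(p for world, p in joint.items()
                  if beta(dict(zip(VARS, world))))
    scale_in, scale_out = q / pr_beta, (1 - q) / (1 - pr_beta)
    return {world: p * (scale_in if beta(dict(zip(VARS, world))) else scale_out)
            for world, p in joint.items()}

# Neighbor's call: belief in Alarm becomes .85, all things considered.
pr2 = jeffrey(joint, lambda w: w["Alarm"], 0.85)
assert abs(sum(pr2.values()) - 1.0) < 1e-9
```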
Soft Evidence: Nothing Else Considered
Would like to declare the strength of evidence independently of current beliefs
Need the notion of the odds of an event $\beta$: $O(\beta) \stackrel{\text{def}}{=} \dfrac{\Pr(\beta)}{\Pr(\neg\beta)}$
Declare evidence $\beta$ by a Bayes factor $k = \dfrac{O'(\beta)}{O(\beta)}$
"My neighbor's call increases the odds of Alarm by a factor of 4"
Compute $\Pr'(\beta)$ and fall back on Jeffrey's rule to update beliefs
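One way to carry out that last step in code: solve $O'(\beta) = k\,O(\beta)$ for $q = \Pr'(\beta)$ and hand it to the `jeffrey` helper sketched above (the helper name and representation are mine, not from the slides):

```python
def bayes_factor_update(joint, beta, k):
    """Declare evidence of strength k (Bayes factor on the odds of beta),
    convert it to q = Pr'(beta), then apply Jeffrey's rule."""
    pr_beta = sum(p for world, p in joint.items()
                  if beta(dict(zip(VARS, world))))
    odds = pr_beta / (1 - pr_beta)
    q = (k * odds) / (1 + k * odds)  # invert O'(beta) = q/(1-q) = k * O(beta)
    return jeffrey(joint, beta, q)

# Neighbor's call multiplies the odds of Alarm by 4.
pr3 = bayes_factor_update(joint, lambda w: w["Alarm"], 4)
```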
Probabilistic Reasoning
Basic task: belief updating in the face of hard and soft evidence
Can be done in principle using the definitions we've seen
Inefficient: requires joint probability tables (exponential in size)
Also a modeling difficulty:
Specifying joint probabilities is tedious
Beliefs (e.g., independencies) held by human experts are not easily enforced
Capturing Independence Graphically
Evidence $R$ could increase $\Pr(A)$, $\Pr(C)$
But not if we know $A$
$C$ is independent of $R$ given $A$
Capturing Independence Graphically
Parents($V$): variables with an edge to $V$
Descendants($V$): variables to which there is a directed path from $V$
Nondescendants($V$): all variables except $V$, Parents($V$), and Descendants($V$)
Markovian Assumptions
Every variable is conditionally independent of its nondescendants given its parents:
$I(V, \textit{Parents}(V), \textit{Nondescendants}(V))$
Given the direct causes of a variable, our beliefs in that variable are no longer influenced by any other variable, except possibly by its effects
Capturing Independence Graphically
Markov($G$) for the example DAG:
$I(C, A, \{B, E, R\})$
$I(R, E, \{A, B, C\})$
$I(A, \{B, E\}, R)$
$I(B, \emptyset, \{E, R\})$
$I(E, \emptyset, B)$
Conditional Probability Tables
DAG $G$ together with Markov($G$) declares a set of independencies, but does not define $\Pr$
$\Pr$ is uniquely defined if a conditional probability table (CPT) is provided for each variable $X$
CPT: $\Pr(x \mid \mathbf{u})$ for every value $x$ of variable $X$ and every instantiation $\mathbf{u}$ of its parents $\mathbf{U}$
Conditional Probability Tables
For the example network we need the CPTs:
$\Pr(c \mid a)$, $\Pr(r \mid e)$, $\Pr(a \mid b, e)$, $\Pr(e)$, $\Pr(b)$
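A minimal sketch of how such a parametrization might be represented in code. The parent sets follow the slides' example network; the numeric values are the hypothetical ones used in the earlier snippets:

```python
# Each CPT maps (value-of-X, instantiation-of-parents) -> parameter.
# Structure follows the example network: B -> A <- E, A -> C, E -> R.
# Numeric values are illustrative/hypothetical.
cpts = {
    "Earthquake": {(True, ()): 0.1, (False, ()): 0.9},
    "Burglary":   {(True, ()): 0.2, (False, ()): 0.8},
    "Alarm": {  # parents (Burglary, Earthquake)
        (True,  (True,  True)):  0.95, (False, (True,  True)):  0.05,
        (True,  (True,  False)): 0.90, (False, (True,  False)): 0.10,
        (True,  (False, True)):  0.70, (False, (False, True)):  0.30,
        (True,  (False, False)): 0.01, (False, (False, False)): 0.99,
    },
    # CPTs for Call (parent Alarm) and Radio (parent Earthquake) omitted.
}

# Sanity check: parameters sum to 1 for each parent instantiation.
for cpt in cpts.values():
    totals = {}
    for (x, u), theta in cpt.items():
        totals[u] = totals.get(u, 0.0) + theta
    assert all(abs(t - 1.0) < 1e-9 for t in totals.values())
```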
Conditional Probability Tables
Half of the probabilities are redundant: for binary variables, $\theta_{\bar{x} \mid \mathbf{u}} = 1 - \theta_{x \mid \mathbf{u}}$
Only need 10 probabilities for the whole DAG
Bayesian Network
A pair $(G, \Theta)$ where:
$G$ is a DAG over variables $\mathbf{Z}$, called the network structure
$\Theta$ is a set of CPTs, one for each variable in $\mathbf{Z}$, called the network parametrization
Bayesian Network
$\Theta_{X \mid \mathbf{U}}$: CPT for variable $X$ and its parents $\mathbf{U}$
$X\mathbf{U}$: a network family
$\theta_{x \mid \mathbf{u}} = \Pr(x \mid \mathbf{u})$: a network parameter
$\sum_x \theta_{x \mid \mathbf{u}} = 1$ for every parent instantiation $\mathbf{u}$
Bayesian Network
Network instantiation: e.g., $z = a, b, c, d, e$
$\theta_{x \mid \mathbf{u}} \sim z$: instantiations $x\mathbf{u}$ and $z$ are compatible
$\theta_a$, $\theta_{b \mid a}$, $\theta_{c \mid a}$, $\theta_{d \mid b,c}$, $\theta_{e \mid c}$ are all $\sim z$ above
Bayesian Network
The independence constraints imposed by the network structure (DAG) and the numeric constraints imposed by the network parametrization (CPTs) are satisfied by a unique $\Pr$:
$\Pr(\mathbf{z}) \stackrel{\text{def}}{=} \prod_{\theta_{x \mid \mathbf{u}} \sim \mathbf{z}} \theta_{x \mid \mathbf{u}}$ (chain rule)
A Bayesian network is an implicit representation of this probability distribution
Chain Rule Example
$\Pr(\mathbf{z}) \stackrel{\text{def}}{=} \prod_{\theta_{x \mid \mathbf{u}} \sim \mathbf{z}} \theta_{x \mid \mathbf{u}}$ (chain rule)
$\Pr(a, b, c, d, e) = \theta_a\,\theta_{b \mid a}\,\theta_{c \mid a}\,\theta_{d \mid b,c}\,\theta_{e \mid c} = (.6)(.2)(.2)(.9)(.1) = .00216$
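The chain rule is just a product over compatible parameters; a sketch mirroring the slide's five-variable example (the `theta` values below reproduce the numbers quoted above, and only the entries needed for this instantiation are listed):

```python
# Parameters for the slide's five-variable network: A -> B, A -> C,
# B -> D <- C, C -> E. Keys are (variable, parent-instantiation, value).
theta = {
    ("A", (), True): 0.6,
    ("B", (True,), True): 0.2,        # Pr(b | a)
    ("C", (True,), True): 0.2,        # Pr(c | a)
    ("D", (True, True), True): 0.9,   # Pr(d | b, c)
    ("E", (True,), True): 0.1,        # Pr(e | c)
}
parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C"), "E": ("C",)}

def pr_instantiation(z):
    """Chain rule: Pr(z) = product of theta_{x|u} compatible with z."""
    p = 1.0
    for var, val in z.items():
        u = tuple(z[pa] for pa in parents[var])
        p *= theta[(var, u, val)]
    return p

print(pr_instantiation({"A": True, "B": True, "C": True, "D": True, "E": True}))
# 0.6 * 0.2 * 0.2 * 0.9 * 0.1 = 0.00216
```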
Compactness of Bayesian Networks
Let $d$ be the maximum variable domain size, $k$ the maximum number of parents
Size of any CPT is $O(d^{k+1})$
Total number of network parameters is $O(n\,d^{k+1})$ for an $n$-variable network
Compact as long as $k$ is relatively small, compared with the joint probability table (up to $d^n$ entries)
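A quick worked instance of this bound (the numbers are illustrative, not from the slides): for $n = 100$ binary variables ($d = 2$) with at most $k = 4$ parents each, the network needs at most $100 \cdot 2^{5} = 3200$ parameters, whereas the joint probability table would have $2^{100} \approx 1.3 \times 10^{30}$ entries.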
Properties of the Probability Distribution
The distribution $\Pr$ represented by a Bayesian network $(G, \Theta)$ is guaranteed to satisfy every independence assumption in Markov($G$):
$I_{\Pr}(X, \textit{Parents}(X), \textit{Nondescendants}(X))$
It may satisfy more
Additional independencies can be derived using the graphoid axioms
Graphoid Axioms: Symmetry
$I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y}) \iff I_{\Pr}(\mathbf{Y}, \mathbf{Z}, \mathbf{X})$
If learning $\mathbf{y}$ does not influence our belief in $\mathbf{x}$, then learning $\mathbf{x}$ does not influence our belief in $\mathbf{y}$ either
Graphoid Axioms: Decomposition
$I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y}\mathbf{W}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ and $I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{W})$
If learning $\mathbf{y}\mathbf{w}$ does not influence our belief in $\mathbf{x}$, then learning $\mathbf{y}$ alone, or $\mathbf{w}$ alone, does not influence our belief in $\mathbf{x}$ either
If some information is irrelevant, then any part of it is also irrelevant
The reverse does not hold in general: two pieces of information may each be irrelevant on their own, yet their combination may be relevant
Graphoid Axioms: Decomposition
$I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y}\mathbf{W}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ and $I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{W})$
In particular, we can strengthen Markov($G$):
$I_{\Pr}(X, \textit{Parents}(X), \mathbf{W})$ for $\mathbf{W} \subseteq \textit{Nondescendants}(X)$
Useful for proving the chain rule for Bayesian networks from the chain rule of probability calculus
Graphoid Axioms: Weak Union
$I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y}\mathbf{W}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}\mathbf{Y}, \mathbf{W})$
If information $\mathbf{y}\mathbf{w}$ is not relevant to our belief in $\mathbf{x}$, then the partial information $\mathbf{y}$ will not make the rest of the information, $\mathbf{w}$, relevant
In particular, we can strengthen Markov($G$):
$I_{\Pr}(X, \textit{Parents}(X) \cup \mathbf{W}, \textit{Nondescendants}(X) \setminus \mathbf{W})$ for $\mathbf{W} \subseteq \textit{Nondescendants}(X)$
Graphoid Axioms: Contraction
$I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ and $I_{\Pr}(\mathbf{X}, \mathbf{Z}\mathbf{Y}, \mathbf{W}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y}\mathbf{W})$
If, after learning irrelevant information $\mathbf{y}$, information $\mathbf{w}$ is found to be irrelevant to our belief in $\mathbf{x}$, then the combined information $\mathbf{y}\mathbf{w}$ must have been irrelevant to start with
Graphoid Axioms: Intersection
$I_{\Pr}(\mathbf{X}, \mathbf{Z}\mathbf{W}, \mathbf{Y})$ and $I_{\Pr}(\mathbf{X}, \mathbf{Z}\mathbf{Y}, \mathbf{W}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y}\mathbf{W})$ for strictly positive distributions $\Pr$
If information $\mathbf{w}$ is irrelevant given $\mathbf{y}$, and information $\mathbf{y}$ is irrelevant given $\mathbf{w}$, then their combination is irrelevant to start with
Not true in general when there are zero probabilities (determinism)
Graphical Test of Independence
The graphoid axioms produce independencies beyond Markov($G$)
Deriving independencies this way is nontrivial; we need an efficient method
A linear-time graphical test: d-separation
d-separation
$\textit{dsep}_G(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$: every path between $\mathbf{X}$ and $\mathbf{Y}$ is blocked by $\mathbf{Z}$
$\textit{dsep}_G(\mathbf{X}, \mathbf{Z}, \mathbf{Y}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ for every $\Pr$ induced by $G$ (soundness)
Also enjoys weak completeness (discussed later)
d-separation: Paths and Valves (figures: sequential, divergent, and convergent valves along a path)
d-separation: Path Blocking
A path is blocked if any valve on it is closed
Sequential valve $\rightarrow W \rightarrow$ is closed iff $W \in \mathbf{Z}$
Divergent valve $\leftarrow W \rightarrow$ is closed iff $W \in \mathbf{Z}$
Convergent valve $\rightarrow W \leftarrow$ is closed iff $W \notin \mathbf{Z}$ and none of Descendants($W$) appears in $\mathbf{Z}$
d-separation: Examples (figures omitted)
d-separation: Complexity
Need not enumerate the paths between $\mathbf{X}$ and $\mathbf{Y}$
$\textit{dsep}_G(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ iff $\mathbf{X}$ and $\mathbf{Y}$ are disconnected in a new DAG $G'$ obtained by pruning DAG $G$ as follows:
Delete every leaf node $W \notin \mathbf{X} \cup \mathbf{Y} \cup \mathbf{Z}$ (repeatedly)
Delete all edges outgoing from nodes in $\mathbf{Z}$
Linear in the size of $G$
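A sketch of this pruning-based test in Python (the graph representation and helper names are mine, not from the slides):

```python
def d_separated(nodes, edges, X, Y, Z):
    """dsep_G(X, Z, Y) via pruning: repeatedly delete leaves outside
    X, Y, Z; delete edges leaving Z; then check X-Y connectivity,
    treating the remaining edges as undirected."""
    edges, keep = set(edges), set(nodes)
    changed = True
    while changed:  # repeatedly remove leaf nodes not in X, Y, or Z
        changed = False
        for v in list(keep):
            if v not in X | Y | Z and not any(p == v for p, _ in edges):
                keep.discard(v)
                edges = {(p, c) for p, c in edges if c != v}
                changed = True
    edges = {(p, c) for p, c in edges if p not in Z}  # cut edges leaving Z
    # Undirected reachability from X; d-separated iff no node of Y is reached.
    frontier = seen = set(X) & keep
    frontier, seen = set(frontier), set(seen)
    while frontier:
        v = frontier.pop()
        nbrs = {c for p, c in edges if p == v} | {p for p, c in edges if c == v}
        for u in nbrs - seen:
            seen.add(u)
            frontier.add(u)
    return not (seen & Y)

# Example network: B -> A <- E, A -> C, E -> R.
nodes = {"A", "B", "C", "E", "R"}
edges = {("B", "A"), ("E", "A"), ("A", "C"), ("E", "R")}
print(d_separated(nodes, edges, {"C"}, {"R"}, {"A"}))  # True: C indep. of R given A
print(d_separated(nodes, edges, {"B"}, {"E"}, {"A"}))  # False: A opens B -> A <- E
```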
d-separation: Soundness
$\textit{dsep}_G(\mathbf{X}, \mathbf{Z}, \mathbf{Y}) \Rightarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ for every $\Pr$ induced by $G$
Proved by showing that every independence claimed by d-separation can be derived from the graphoid axioms
d-separation: Completeness?
$\textit{dsep}_G(\mathbf{X}, \mathbf{Z}, \mathbf{Y}) \Leftarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$? Does not hold
Consider $X \rightarrow Y \rightarrow Z$: $X$ is not d-separated from $Z$ (given $\emptyset$)
However, if $\theta_{y \mid x} = \theta_{y \mid \bar{x}}$ then $X$ and $Y$ are independent, and so are $X$ and $Z$
Not surprising, as d-separation is based on structure alone
d-separation: Weak Completeness
For every DAG $G$, there exists a parametrization $\Theta$ such that
$\textit{dsep}_G(\mathbf{X}, \mathbf{Z}, \mathbf{Y}) \Leftarrow I_{\Pr}(\mathbf{X}, \mathbf{Z}, \mathbf{Y})$
where $\Pr$ is induced by the Bayesian network $(G, \Theta)$
Implies that one cannot improve on d-separation
Summary
Probabilities of worlds and events, Bayes conditioning, independence, the chain rule, case analysis, Bayes rule, two methods for soft evidence
Capturing independence graphically, Markovian assumptions, conditional probability tables, Bayesian networks, the chain rule for Bayesian networks, graphoid axioms, d-separation