Probabilistic Reasoning. Kee-Eung Kim KAIST Computer Science



Outline #1 Acting under uncertainty Probabilities Inference with Probabilities Independence and Bayes Rule Bayesian networks Inference in Bayesian networks

Uncertainty
Let action A_t = leave for airport t minutes before the flight. Will A_t get me there on time?
Problems: partial observability (road state, other drivers' plans, etc.), noisy sensors (often-wrong traffic reports), uncertainty in action outcomes (flat tire, etc.), immense complexity of modeling and predicting traffic.
How can a logical agent faced with such a problem handle uncertain knowledge? The agent's knowledge can at best provide only a degree of belief in the relevant sentences (probability theory is your friend).

Probability
Probability provides a way of summarizing the uncertainty that comes from:
Laziness: failure to enumerate all exceptions, qualifications, etc.
Ignorance: lack of relevant facts, initial conditions, etc.
Degree of belief vs. degree of truth: a probability of 0.8 does not mean "80% true" (that is fuzzy logic/ambiguity); rather, it is an 80% degree of belief (probability/chance).
Making decisions under uncertainty: which action to choose? It depends on my preferences. Utility theory is used to represent and infer preferences. Decision theory = utility theory + probability theory.

Basic Probability
Ω: the sample space, e.g., the 6 possible outcomes of a roll of a die. ω ∈ Ω is a sample point/possible world/atomic event.
A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω s.t. 0 ≤ P(ω) ≤ 1 and Σ_ω P(ω) = 1.
An event A ⊆ Ω: P(A) = Σ_{ω ∈ A} P(ω)
A random variable is a function from sample points to some range. P induces a probability distribution for any random variable X: P(X = x_i) = Σ_{ω : X(ω) = x_i} P(ω)
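These definitions can be sketched in Python for a fair six-sided die (the uniform assignment is an assumption; the slide only fixes the sample space):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                 # sample space Omega
P = {w: Fraction(1, 6) for w in omega}     # an assignment P(w) for every w

assert all(0 <= p <= 1 for p in P.values())
assert sum(P.values()) == 1                # requirements for a probability model

def prob(event):
    """P(A) = sum of P(w) over the sample points w in the event A."""
    return sum(P[w] for w in event)

even = {w for w in omega if w % 2 == 0}    # an event A, a subset of Omega
print(prob(even))                          # 1/2

def dist(X):
    """Induced distribution: P(X = x) = sum over {w : X(w) = x}."""
    d = {}
    for w in omega:
        d[X(w)] = d.get(X(w), Fraction(0)) + P[w]
    return d

X = lambda w: w % 3                        # a random variable Omega -> {0, 1, 2}
print(dist(X)[0])                          # 1/3
```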

Prior Probability
Prior or unconditional probabilities of propositions correspond to belief prior to the arrival of any (new) evidence, e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72.
A probability distribution gives values for all possible assignments: P(Weather) = [0.72, 0.1, 0.08, 0.1].
The joint probability distribution for a set of random variables, P(Weather, Cavity), is a 4 × 2 table of values:

                 Weather = sunny   rain   cloudy   snow
Cavity = true              0.144   0.02    0.016   0.02
Cavity = false             0.576   0.08    0.064   0.08

Every question about a domain can be answered by the joint distribution, because every event is a sum of sample points.
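A quick sketch of "every question can be answered by the joint": the slide's 4 × 2 table as a Python dict, with the prior P(Weather) recovered by marginalizing Cavity out (helper names are mine):

```python
# The slide's joint distribution P(Weather, Cavity), keyed by value pair.
joint = {
    ('sunny', True): 0.144, ('rain', True): 0.02,
    ('cloudy', True): 0.016, ('snow', True): 0.02,
    ('sunny', False): 0.576, ('rain', False): 0.08,
    ('cloudy', False): 0.064, ('snow', False): 0.08,
}

def prob(pred):
    """Any event is a set of atomic events, so its probability is a sum."""
    return sum(p for (w, c), p in joint.items() if pred(w, c))

# Marginalizing recovers the prior P(Weather) = [0.72, 0.1, 0.08, 0.1].
for w in ('sunny', 'rain', 'cloudy', 'snow'):
    print(w, round(prob(lambda w2, c, w=w: w2 == w), 2))
```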

Conditional Probability
Conditional or posterior probabilities, e.g., P(cavity | toothache) = 0.8, i.e., given that toothache is all I know; NOT "if toothache then 80% chance of cavity".
Implementation: P(Cavity | Toothache) = a 2-element vector of 2-element vectors.
If we know more: P(cavity | toothache, cavity) = 1.
New evidence may be irrelevant: P(cavity | toothache, rain) = P(cavity | toothache) = 0.8. Domain knowledge is necessary for this kind of inference.
Useful rules:
Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
Chain rule: P(X_1, …, X_n) = P(X_1, …, X_{n−1}) P(X_n | X_1, …, X_{n−1}) = ∏_i P(X_i | X_1, …, X_{i−1})

Inference by Enumeration
Given the joint distribution:

             toothache           ¬toothache
             catch    ¬catch     catch    ¬catch
cavity       0.108    0.012      0.072    0.008
¬cavity      0.016    0.064      0.144    0.576

For any proposition, sum the atomic events where it is true:
P(cavity ∨ toothache) = ?
P(¬cavity | toothache) = ?
P(Cavity | toothache) = α P(Cavity, toothache) = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
General idea: compute the distribution on the query variable by fixing the evidence variables and summing over the hidden variables.

Inference by Enumeration
Let X be all the random variables. Typically we want the posterior joint distribution of the query variables Y, given specific values e for the evidence variables E. Let the hidden variables be H = X − Y − E. Then:
P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
Obvious problems:
Worst-case time complexity O(d^n), where d is the largest arity
Space complexity O(d^n) to store the joint distribution
How to find the numbers for the O(d^n) entries?

Independence
A and B are independent iff P(A | B) = P(A), or P(B | A) = P(B), or P(A, B) = P(A) P(B).
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather): 32 entries reduced to 12; for n independent biased coins, 2^n entries reduce to n.
Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

Conditional Independence
P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries.
However, if I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: P(catch | toothache, cavity) = P(catch | cavity). The same independence holds if I don't have a cavity: P(catch | toothache, ¬cavity) = P(catch | ¬cavity).
The probe's Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity).
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
Writing out the full joint with the chain rule:
P(Toothache, Cavity, Catch) = P(Toothache | Catch, Cavity) P(Catch | Cavity) P(Cavity) = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity) ← 2 + 2 + 1 = 5 entries
Conditional independence is our most basic and robust form of knowledge about uncertain environments.
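The claimed independence can be checked numerically against the joint table from the enumeration slide (helper names are mine):

```python
# Checking Catch conditionally independent of Toothache given Cavity,
# using the same joint distribution as the enumeration slide.
vals = {
    (True,  True,  True): 0.108, (True,  True,  False): 0.012,
    (True,  False, True): 0.072, (True,  False, False): 0.008,
    (False, True,  True): 0.016, (False, True,  False): 0.064,
    (False, False, True): 0.144, (False, False, False): 0.576,
}  # keys are (cavity, toothache, catch)

def prob(pred):
    return sum(p for w, p in vals.items() if pred(*w))

def cond(pred_a, pred_b):
    """P(A | B) = P(A, B) / P(B)."""
    return prob(lambda *w: pred_a(*w) and pred_b(*w)) / prob(pred_b)

# P(catch | toothache, cavity) equals P(catch | cavity), as claimed:
lhs = cond(lambda cav, t, c: c, lambda cav, t, c: t and cav)
rhs = cond(lambda cav, t, c: c, lambda cav, t, c: cav)
print(round(lhs, 3), round(rhs, 3))   # 0.9 0.9
```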

Bayes Rule
From the product rule P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a), we get Bayes' rule:
P(b | a) = P(a | b) P(b) / P(a)
or in distribution form:
P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
Useful for obtaining a diagnostic probability from a causal probability: P(cause | effect) = P(effect | cause) P(cause) / P(effect).
Extreme use of conditional independence: the naïve Bayes model.

Bayesian Networks (= Graphical Models)
A simple, graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
Syntax:
A set of nodes, one per random variable
A directed, acyclic graph (a link means roughly "directly influences")
A conditional distribution for each node given its parents, represented as a conditional probability table (CPT) giving the distribution over X_i for each combination of parent values

Example: The Alarm Network
I have a burglar alarm installed, but it also responds on occasion to earthquakes. My neighbors John and Mary promise to call me at work when they hear the alarm. John calls to say the alarm is ringing, but Mary doesn't. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Knowledge to be encoded into the topology:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call

Semantics of Bayesian Networks
Global semantics defines the full joint distribution as the product of the local conditional distributions: P(x_1, …, x_n) = ∏_i P(x_i | parents(X_i)).
E.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
Local semantics: each node is conditionally independent of its non-descendants given its parents.
Mental exercise: derive the global semantics from the local semantics and vice versa. Global semantics and local semantics are equivalent.
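A minimal sketch of the global semantics for this example. The CPT numbers are the standard textbook (AIMA) values for the alarm network; they are assumptions here, since the transcribed slide does not list them:

```python
# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) as a product of local conditional probabilities.
# CPT values are the standard AIMA alarm-network numbers (assumed).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

p = P_j[True] * P_m[True] * P_a[(False, False)] * (1 - P_b) * (1 - P_e)
print(p)   # ≈ 0.000628
```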

Constructing Bayesian Networks
Choose an ordering of the random variables, e.g., MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
P(JohnCalls | MaryCalls) = P(JohnCalls)? No
P(Alarm | JohnCalls, MaryCalls) = P(Alarm | JohnCalls)? No
P(Alarm | JohnCalls, MaryCalls) = P(Alarm)? No
P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm)? Yes
P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary)? No
P(Earthquake | Burglary, Alarm, JohnCalls, MaryCalls) = P(Earthquake | Alarm)? No (!!)
P(Earthquake | Burglary, Alarm, JohnCalls, MaryCalls) = P(Earthquake | Burglary, Alarm)? Yes

Constructing Bayesian Networks
1. Choose an ordering of variables X_1, …, X_n
2. For i = 1 to n:
Add X_i to the network
Select parents from X_1, …, X_{i−1} such that P(X_i | Parents(X_i)) = P(X_i | X_1, …, X_{i−1})
This choice of parents guarantees the global semantics:
P(X_1, …, X_n) = ∏_i P(X_i | X_1, …, X_{i−1}) (by the chain rule) = ∏_i P(X_i | Parents(X_i)) (by construction)

Inference Example: Car Diagnosis
Initial evidence: the car won't start.
Testable (observed) variables are shown in green; "broken, so fix it" (hypothesis) variables in orange; hidden variables in gray.

Inference by Enumeration in BN
Simple query on the alarm network: given that John called and Mary called, what is the probability of a burglary?
P(B | j, m) = P(B, j, m) / P(j, m) = α P(B, j, m) = α Σ_e Σ_a P(B, e, a, j, m)
Rewrite the full joint probability using CPT entries:
P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a) = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
Recursive depth-first enumeration: O(n) space, O(d^n) time
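This query can be sketched as a short enumeration, assuming the standard textbook (AIMA) CPT values for the alarm network, which the transcribed slide does not list:

```python
from itertools import product

# Enumerate over the hidden variables e and a, then normalize (alpha).
# CPT values are the standard AIMA alarm-network numbers (assumed).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | B, E)
P_j = {True: 0.90, False: 0.05}                      # P(j | A)
P_m = {True: 0.70, False: 0.01}                      # P(m | A)

def pr(p, v):
    """Probability that a Boolean variable with P(true) = p takes value v."""
    return p if v else 1 - p

unnorm = {}
for b in (True, False):
    unnorm[b] = pr(P_b, b) * sum(
        pr(P_e, e) * pr(P_a[(b, e)], a) * P_j[a] * P_m[a]
        for e, a in product((True, False), repeat=2))

alpha = 1 / sum(unnorm.values())
print(round(alpha * unnorm[True], 3))   # P(b | j, m) ≈ 0.284
```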

Evaluation Tree
Evaluation tree of α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a).
Enumeration is inefficient because of repeated computation: it computes P(j | a) P(m | a) for each value of e.

Inference by Variable Elimination
Variable elimination: carry out the summations right-to-left, storing intermediate results (factors) to avoid recomputation.
[Figure: the expression α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a), annotated with factor shapes: the factors for P(j | a) and P(m | a) are 2-element vectors, the factor for P(a | B, e) is a 2 × 2 × 2 matrix, summing out a yields a 2 × 2 matrix, and summing out e yields a 2-element vector.]

Basic Operations for Variable Elimination
Summing out a variable from a product of factors: move any constant factors outside the summation, then add up the sub-matrices in the pointwise product of the remaining factors. Assuming f_1, …, f_i do not depend on X:
Σ_x f_1 × ⋯ × f_k = f_1 × ⋯ × f_i × (Σ_x f_{i+1} × ⋯ × f_k)
The pointwise product of factors f_1 and f_2 is a factor over the union of their variables, e.g., f(A, B, C) = f_1(A, B) × f_2(B, C), with f(a, b, c) = f_1(a, b) f_2(b, c).
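A minimal sketch of these two operations on Boolean factors (the dict representation and function names are mine, not from the slide):

```python
from itertools import product

def pointwise(f1, vars1, f2, vars2):
    """Pointwise product: f(a, b, c) = f1(a, b) * f2(b, c)."""
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    out = {}
    for assign in product((True, False), repeat=len(out_vars)):
        env = dict(zip(out_vars, assign))
        k1 = tuple(env[v] for v in vars1)
        k2 = tuple(env[v] for v in vars2)
        out[assign] = f1[k1] * f2[k2]
    return out, out_vars

def sum_out(f, vars_, x):
    """Sum variable x out of a factor by adding matching sub-tables."""
    keep = [v for v in vars_ if v != x]
    out = {}
    for assign, p in f.items():
        env = dict(zip(vars_, assign))
        key = tuple(env[v] for v in keep)
        out[key] = out.get(key, 0.0) + p
    return out, keep

# Example: f1(A, B) * f2(B, C), then sum out B.
f1 = {(True, True): 0.3, (True, False): 0.7,
      (False, True): 0.9, (False, False): 0.1}
f2 = {(True, True): 0.2, (True, False): 0.8,
      (False, True): 0.6, (False, False): 0.4}
f3, v3 = pointwise(f1, ['A', 'B'], f2, ['B', 'C'])
f4, v4 = sum_out(f3, v3, 'B')
print(v4, round(f4[(True, True)], 3))   # ['A', 'C'] 0.48
```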

Irrelevant Variables
Consider the query P(JohnCalls | Burglary = true):
P(J | b) = α P(b) Σ_e P(e) Σ_a P(a | b, e) P(J | a) Σ_m P(m | a)
The sum over m is identically 1, so M is irrelevant to the query.
Theorem: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E). Here X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake}, so MaryCalls is irrelevant.
(Remember the backward chaining algorithm?)

Complexity of Exact Inference
Singly connected networks (or polytrees): any two nodes are connected by at most one path; the time and space cost of variable elimination is O(d^k n).
Multiply connected networks: we can reduce 3-SAT to inference ⇒ NP-hard. To be exact, the problem is as hard as counting the satisfying assignments of a 3-SAT formula ⇒ #P-hard.

Summary
Uncertainty arises from laziness and ignorance; it is inherent in complex, dynamic, or inaccessible worlds.
Probabilities express the agent's inability to reach a definite decision regarding the truth of a sentence; probabilities summarize the agent's beliefs.
Prior probabilities, conditional probabilities, full joint probability distributions.
Bayesian networks represent conditional independence relationships in the domain in a concise way: the topology of the network encodes conditional independence.
Exact inference by variable elimination: polytime on polytrees, NP-hard on general graphs, and very sensitive to topology.

Bayesian Learning

Outline #2 Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete data Linear regression ML parameter learning with hidden attribute data (EM algorithm)

Full Bayesian Learning
View learning as Bayesian updating of a probability distribution over the hypothesis space. Given H = {h_1, h_2, …} and prior distribution P(H), the j-th observation d_j is the outcome of the random variable D_j, and the training data is d = d_1, …, d_N.
Given the data so far, each hypothesis has a posterior probability P(h_i | d) = α P(d | h_i) P(h_i), where P(d | h_i) is called the likelihood.
Predictions use a likelihood-weighted average over the hypotheses:
P(X | d) = Σ_i P(X | d, h_i) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)
No need to pick one best-guess hypothesis!

Example
Suppose there are five kinds of bags of candies:
10% are h_1: 100% cherry candies
20% are h_2: 75% cherry candies + 25% lime candies
40% are h_3: 50% cherry candies + 50% lime candies
20% are h_4: 25% cherry candies + 75% lime candies
10% are h_5: 100% lime candies
Then we observe candies drawn from some bag. What kind of bag is it? What flavour will the next candy be?
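Full Bayesian updating for this example can be sketched directly; the priors and per-bag cherry fractions are from the slide, while the observation sequence (10 limes in a row) is an illustrative assumption:

```python
# Bayesian updating for the candy-bag example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]       # P(h1) .. P(h5)
cherry = [1.0, 0.75, 0.5, 0.25, 0.0]     # P(cherry | hi)

post = priors[:]
for _ in range(10):                      # observe 10 lime candies (assumed)
    post = [p * (1 - c) for p, c in zip(post, cherry)]   # P(lime|hi) P(hi|d)
    z = sum(post)
    post = [p / z for p in post]         # normalize (the alpha step)

# Prediction averages over all hypotheses; no single best guess is picked.
p_next_lime = sum(p * (1 - c) for p, c in zip(post, cherry))
print([round(p, 3) for p in post])       # posterior mass shifts onto h5
print(round(p_next_lime, 3))             # ≈ 0.973
```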

Posterior/Prediction Probabilities of Hypotheses
Posterior probability: P(h_i | d) = α P(d | h_i) P(h_i)
Prediction probability: P(X | d) = Σ_i P(X | h_i) P(h_i | d)

MAP Approximation
Summing over the hypothesis space is often intractable, e.g., there are 18,446,744,073,709,551,616 Boolean functions of 6 attributes. Instead, make predictions based on a single most probable hypothesis.
Maximum a posteriori (MAP) learning: choose h_MAP maximizing P(h_i | d), i.e., maximize P(d | h_i) P(h_i), or log P(d | h_i) + log P(h_i).
The log terms can be viewed as (the negative of) the bits to encode the data given the hypothesis plus the bits to encode the hypothesis; this is the basic idea of minimum description length (MDL) learning.
For deterministic hypotheses, P(d | h_i) is 1 if consistent, 0 otherwise ⇒ MAP = simplest consistent hypothesis.

ML Approximation
For large data sets, the prior becomes irrelevant: the bits to encode the data given the hypothesis dominate the bits to encode the hypothesis.
Maximum likelihood (ML) learning: choose h_ML maximizing P(d | h_i), i.e., simply get the best fit to the data. This is identical to MAP for a uniform prior (which is reasonable if all hypotheses are of the same complexity).
ML is the standard (non-Bayesian) statistical learning method.

ML Parameter Learning in Bayes Nets
A bag from a new manufacturer; what fraction of the candies are cherry? Model this as a Bayesian network with one node, Flavor. The task is to learn P(Flavor = cherry) = θ; any θ ∈ [0, 1] is possible, a continuum of hypotheses h_θ, where θ is the parameter of this simple (binomial) family of models.
Suppose we unwrap N candies: c cherries and ℓ = N − c limes. These are i.i.d. (independent, identically distributed) observations, so
P(d | h_θ) = ∏_{j=1}^{N} P(d_j | h_θ) = θ^c (1 − θ)^ℓ
Maximize this w.r.t. θ, which is easier for the log-likelihood:
L(d | h_θ) = log P(d | h_θ) = c log θ + ℓ log(1 − θ)
dL/dθ = c/θ − ℓ/(1 − θ) = 0 ⇒ θ = c/(c + ℓ) = c/N
However, given a small dataset, ML learning can assign zero probability to unobserved events.
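The closed-form result θ = c/N can be checked with a tiny grid search over the log-likelihood (the counts below are hypothetical, for illustration only):

```python
import math

# ML estimate for the one-node candy model: theta = c / N.
c, l = 3, 7                      # hypothetical counts: cherries, limes
N = c + l

def log_lik(theta):
    """L = c log(theta) + l log(1 - theta)."""
    return c * math.log(theta) + l * math.log(1 - theta)

# Grid search over (0, 1) confirms the closed-form maximizer c / N.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_lik)
print(best, c / N)   # 0.3 0.3
```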

ML Learning with Multiple Parameters
Now the red/green wrapper depends probabilistically on the flavor, with parameters θ = P(F = cherry), θ_1 = P(W = red | F = cherry), and θ_2 = P(W = red | F = lime).
Likelihood for, e.g., a cherry candy in a green wrapper: P(F = cherry, W = green | h_{θ,θ1,θ2}) = θ (1 − θ_1)
With N candies, r_c red-wrapped cherries, g_c green-wrapped cherries, r_ℓ red-wrapped limes, and g_ℓ green-wrapped limes:
P(d | h_{θ,θ1,θ2}) = θ^c (1 − θ)^ℓ · θ_1^{r_c} (1 − θ_1)^{g_c} · θ_2^{r_ℓ} (1 − θ_2)^{g_ℓ}

ML Learning with Multiple Parameters
Log likelihood:
L = c log θ + ℓ log(1 − θ) + r_c log θ_1 + g_c log(1 − θ_1) + r_ℓ log θ_2 + g_ℓ log(1 − θ_2)
The derivatives of L each contain only the relevant parameter:
∂L/∂θ = c/θ − ℓ/(1 − θ) = 0 ⇒ θ = c/(c + ℓ)
∂L/∂θ_1 = r_c/θ_1 − g_c/(1 − θ_1) = 0 ⇒ θ_1 = r_c/(r_c + g_c)
∂L/∂θ_2 = r_ℓ/θ_2 − g_ℓ/(1 − θ_2) = 0 ⇒ θ_2 = r_ℓ/(r_ℓ + g_ℓ)
With complete data, the parameters can be learned separately.

Example: Linear Gaussian Model
y = θ_1 x + θ_2 plus Gaussian noise:
P(y | x) = (1/√(2πσ²)) exp(−(y − (θ_1 x + θ_2))² / (2σ²))
Maximizing P(y | x) w.r.t. the parameters θ_1 and θ_2 is equivalent to minimizing Σ_j (y_j − (θ_1 x_j + θ_2))².
That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.
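A sketch of this equivalence: a least-squares fit on synthetic data with Gaussian noise (the true parameters and the noise level are illustrative assumptions):

```python
import random

# Fit y = theta1 * x + theta2 by minimizing squared error, which is the
# ML solution under fixed-variance Gaussian noise. Data is synthetic.
random.seed(0)
true_t1, true_t2 = 2.0, -1.0
xs = [i / 10 for i in range(100)]
ys = [true_t1 * x + true_t2 + random.gauss(0, 0.1) for x in xs]

# Closed-form least-squares solution for slope and intercept.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
t1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
t2 = my - t1 * mx
print(round(t1, 2), round(t2, 2))   # close to 2.0 and -1.0
```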

Learning Bayes Nets with Hidden Variables
Candies come from either Bag 1 or Bag 2; the bag is not observed. In addition to Flavor and Wrapper, some candies have a Hole.
Parameters:
θ: probability of a candy coming from Bag 1
θ_F1, θ_F2: conditional probabilities of the flavor being cherry, given Bag 1 or Bag 2
θ_W1, θ_W2: conditional probabilities of the wrapper being red
θ_H1, θ_H2: conditional probabilities of the candy having a hole

Learning Bayes Nets with Hidden Variables
EM (expectation-maximization) algorithm: an iterative method for maximizing the likelihood with hidden variables. Basic idea:
Pretend that we know the parameters of the network and compute the posterior probabilities (E-step)
Then, using those probabilities, maximize the likelihood by adjusting the parameters (M-step)
Repeat the two steps until convergence.
Data: 1000 instances generated from a true model with parameters θ = 0.5, θ_F1 = θ_W1 = θ_H1 = 0.8, θ_F2 = θ_W2 = θ_H2 = 0.3. The counts from the data:

             W = red           W = green
             H = 1   H = 0     H = 1   H = 0
F = cherry     273      93       104      90
F = lime        79     100        94     167

MAY SKIP: Learning Bayes Nets with Hidden Variables
Start with (arbitrary) initial parameters θ^(0), θ_F1^(0), ….
For θ at iteration 1:
Calculate the expected count (E-step): N̂(Bag = 1) = Σ_j P(Bag = 1 | flavor_j, wrapper_j, holes_j), with each term computed by Bayes' rule from the current parameters
Find the maximum-likelihood parameter (M-step): θ^(1) = N̂(Bag = 1) / N [Exercise]

MAY SKIP: Learning Bayes Nets with Hidden Variables
For θ_F1 at iteration 1:
Calculate the expected count (E-step): N̂(Bag = 1, F = cherry) = Σ_{j : flavor_j = cherry} P(Bag = 1 | flavor_j = cherry, wrapper_j, holes_j)
Find the maximum-likelihood parameter (M-step): θ_F1^(1) = N̂(Bag = 1, F = cherry) / N̂(Bag = 1) [Exercise]
Compute the EM steps for the rest of the parameters, and repeat.
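The full EM loop for this model can be sketched as follows, using the slide's count table; the initial parameter values are arbitrary guesses, as the slides suggest, and the variable names are mine:

```python
# EM for the two-bag candy model with hidden variable Bag.
# Observed counts are from the slide's table; True = cherry / red / has-hole.
data = [
    (True,  True,  True,  273), (True,  True,  False, 93),
    (True,  False, True,  104), (True,  False, False, 90),
    (False, True,  True,  79),  (False, True,  False, 100),
    (False, False, True,  94),  (False, False, False, 167),
]
N = sum(n for *_, n in data)  # 1000 instances

# Arbitrary initial parameters; index 0 = Bag 1, index 1 = Bag 2.
theta, tF, tW, tH = 0.6, [0.6, 0.4], [0.6, 0.4], [0.6, 0.4]

def bern(p, v):
    return p if v else 1 - p

for step in range(10):
    # E-step: expected counts using posteriors P(Bag = 1 | f, w, h).
    exp = {'n': [0.0, 0.0], 'f': [0.0, 0.0], 'w': [0.0, 0.0], 'h': [0.0, 0.0]}
    for f, w, h, n in data:
        like = [bern(theta, i == 0) * bern(tF[i], f)
                * bern(tW[i], w) * bern(tH[i], h) for i in (0, 1)]
        z = like[0] + like[1]
        for i in (0, 1):
            wt = n * like[i] / z          # expected count for bag i
            exp['n'][i] += wt
            if f: exp['f'][i] += wt
            if w: exp['w'][i] += wt
            if h: exp['h'][i] += wt
    # M-step: maximum-likelihood parameters from the expected counts.
    theta = exp['n'][0] / N
    tF = [exp['f'][i] / exp['n'][i] for i in (0, 1)]
    tW = [exp['w'][i] / exp['n'][i] for i in (0, 1)]
    tH = [exp['h'][i] / exp['n'][i] for i in (0, 1)]

print(round(theta, 3), [round(v, 3) for v in tF])
```

Each iteration is one E-step plus one M-step; with this initialization, Bag 1 settles on the cherry/red/hole-rich cluster of the data.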

Markov Chain Monte Carlo (MCMC): A Universal Inference Algorithm