C:145 Artificial Intelligence@ Uncertainty Readings: Chapter 13 Russell & Norvig. Artificial Intelligence p.1/43 Logic and Uncertainty One problem with logical-agent approaches: Agents almost never have access to the whole truth about their environments. Very often, even in simple worlds, there are important questions for which there is no boolean answer. In that case, an agent must reason under uncertainty. Uncertainty also arises because of an agent s incomplete or incorrect understanding of its environment.
Uncertainty Let action L t = leave for airport t minutes before flight. Will L t get me there on time? Problems partial observability (road state, other drivers plans, etc.) noisy sensors (unreliable traffic reports) uncertainty in action outcomes (flat tire, etc.) immense complexity of modelling and predicting traffic Hence a purely logical approach either 1. risks falsehood ( A 25 will get me there on time ), or 2. leads to conclusions that are too weak for decision making ( A 25 will get me there on time if there s no Artificial Intelligence p.3/43 accident on the way, it doesn t rain, my tires remain intact, Reasoning under Uncertainty A rational agent is one that makes rational decisions (in order to maximize its performance measure). A rational decision depends on the relative importance of various goals, the likelihood they will be achieved the degree to which they will be achieved.
Handling Uncertain Knowledge Reasons FOL-based approaches fail to cope with domains like medical diagnosis: Laziness: too much work to write complete axioms, or too hard to work with the enormous sentences that result. Theoretical Ignorance: The available knowledge of the domain is incomplete. Practical Ignorance: The theoretical knowledge of the domain is complete but some evidential facts are missing. Artificial Intelligence p.5/43 Degrees of Belief In several real-world domains the agent s knowledge can only provide a degree of belief in the relevant sentences. The agent cannot say whether a sentence is true, but only that is true x% of the times. The main tool for handling degrees of belief is Probability Theory. The use of probability summarizes the uncertainty that stems from our laziness or ignorance about the domain.
Probability Theory Probability Theory makes the same ontological commitments as FOL: Every sentence ϕ is either true or false. The degree of belief that ϕ is true is a number P between 0 and 1. P(ϕ) = 1 ϕ is certainly true. P(ϕ) = 0 ϕ is certainly not true. P(ϕ) = 0.65 ϕ is true with a 65% chance. Artificial Intelligence p.7/43 Probability of Facts Let A be a propositional variable, a symbol denoting a proposition that is either true or false. P(A) denotes the probability that A is true in the absence of any other information. Similarly, P( A) = probability that A is false P(A B) = P(A B) = probability that both A and B are true probability that either A or B (or both) are true Examples: P( Blonde) P(Blonde BlueEyed) P(Blonde BlueEyed)
Conditional/Unconditional Probability P(A) is the unconditional (or prior) probability of fact A. An agent can use unconditional probability of A to reason about A only in the absence of no further information. If some further evidence B becomes available, the agent must use the conditional (or posterior) probability: P(A B) the probability of A given that all the agent knows is B. Note: P(A) can be thought as the conditional probability of A with respect to the empty evidence: P(A) = P(A ). Artificial Intelligence p.9/43 Conditional Probabilities The probability of a fact may change as the agent acquires more, or different, information: 1. P(Blonde) 2. P(Blonde Swedish) 3. P(Blonde Kenian) 4. P(Blonde Kenian EuroDescent) 1. If we know nothing about a person, the probability that he/she is blonde equals a certain value, say 0.2. 2. If we know that a person is Swedish the probability that s/he is blonde is much higher, say 0.9. 3. If we know that the person is Kenyan, the probability s/he is blonde much lower, say 0.000003. 4. If we know that the person is Kenyan and not of European descent, the probability s/he is blonde is
The Axioms of Probability Probability Theory is governed by the following axioms: 1. All probabilities are between 0 and 1. for all ϕ, 0 P(ϕ) 1 2. Valid propositions have probability 1. Unsatisfiable propositions have probability 0. P(α α) = 1 P(α α) = 0 3. The probability of disjunction is defined as follows. P(α β) = P(α)+P(β) P(α β) Artificial Intelligence p.11/43 Understanding Axiom 3 P(A B) = P(A)+P(B) P(A B)
Conditional Probabilities Conditional probabilities are defined in terms of unconditional ones, whenever P(B) > 0, P(A B) = P(A B) P(B) The same thing can be written as the product rule: P(A B) = P(A B)P(B) = P(B A)P(A) A and B are independent iff P(A B) = P(A) (or P(B A) = P(B), or P(A B) = P(A)P(B)). Artificial Intelligence p.13/43 Random Variable A random variable is a variable ranging over a certain domain of values. It is discrete if it ranges over a discrete (that is, countable) domain. continuous if it ranges over the real numbers. We will only consider discrete random variables with finite domains. Propositional variables can be seen as random variables over the Boolean domain.
Variable Random Variables Domain Age {1,2,...,120} W eather {sunny, dry, cloudy, rain, snow} Size {small, medium, large} Blonde {true, f alse} The probability that a random variable X has value val a is written as P(X = val) Note: P(X = true) is written simply as P(X). P(X = false) is written as P( X). a In Probability Theory, contrary to First-Order Logic, variables are capitalized and constant values are not. Artificial Intelligence p.15/43 Probability Distribution If X is a random variable, we use the bold case P(X) to denote a vector of values for the probabilites of each individual element that X can take. Example: P(Weather = sunny) = 0.6 P(Weather = rain) = 0.2 P(Weather = cloudy) = 0.18 P(Weather = snow) = 0.02 Then P(W eather) = 0.6, 0.2, 0.18, 0.02 (the value order of sunny, rain, cloudy, snow is assumed). P(Weather) is called a probability distribution for the random
Joint Probability Distribution If X 1,...,X n are random variables, P(X 1,...,X n ) denotes their joint probability distribution (JPD), an n-dimensional matrix specifying the probability of every possible combination of values for X 1,...,X n. Example Sky : {sunny, cloudy, rain, snow} Wind : {true,false} P(W ind, Sky) = sunny cloudy rain snow true 0.30 0.15 0.17 0.01 f alse 0.30 0.05 0.01 0.01 Artificial Intelligence p.17/43 Joint Probability Distribution All relevant probabilities about a vector X 1,...,X n of random variables can be computed from P(X 1,...,X n ). S = sunny S = cloudy S = rain S = snow P(W) W 0.30 0.15 0.17 0.01 0.63 W 0.30 0.05 0.01 0.01 0.37 P(S) 0.60 0.20 0.18 0.02 1.00 P(S = rain W) = 0.17 P(W) = 0.30+0.15+0.17+0.01 = 0.63 P(S = rain) = 0.17+0.01 = 0.18 P(S = rain W) = P(S = rain W)/P(W) = 0.17/0.63 = 0.27
Joint Probability Distribution A joint probability distribution P(X 1,...,X n ) provides complete information about the probabilities of its random variables. However, JPD s are often hard to create (again because of incomplete knowledge of the domain). Even when available, JPD tables are very expensive, or impossible, to store because of their size. A JPD table for n random variables, each ranging over k distinct values, has k n entries! A better approach is to come up with conditional probabilities as needed and compute the others from them. Artificial Intelligence p.19/43 An alternative to JPD: The Bayes Rule Recall that for any fact A and B, P(A B) = P(A B)P(B) = P(B A)P(A) From this we obtain the Bayes Rule: P(B A) = P(A B)P(B) P(A) The rule is useful in practice because it is often easier to figure out P(A B), P(B), P(A) than to figure out P(B A) directly.
Applying the Bayes Rule What is the probability that a patient has meningitis (M) given that he has a stiff neck (S)? P(M S) = P(S M)P(M) P(S) P(S M) is easier to estimate than P(M S) because it refers to causal knowledge: meningitis typically causes stiff neck. P(S M) can be estimated from past medical cases and the knowledge about how meningitis works. Similarly, P(M),P(S) can be estimated from statistical information. Artificial Intelligence p.21/43 Applying the Bayes Rule The Bayes rule is helpful even in absence of (immediate) causal relationships. What is the probability that a blonde (B) is Swedish (S)? P(S B) = P(B S)P(S) P(B) All P(B S),P(S),P(B) are easily estimated from statistical information. P(B S) P(S) P(B) # of blonde Swedish Swedish population = 9 10 Swedish population world population =... # of blondes world population =...
Conditional Independence In terms of exponential explosion, conditional probabilities do not seem any better than JPD s for computing the probability of a fact, given n > 1 pieces of evidence. P(Meningitis StiffNeck Nausea DoubleVision) However, certain facts do not always depend on all the evidence. P(Meningitis StiffNeck Astigmatic) = P(Meningitis StiffNeck) Meningitis and Astigmatic are conditionally independent, given StiffNeck. Artificial Intelligence p.23/43 Bayesian Networks Conditional independence information is crucial to making (automated) probabilistic reasoning feasible. Bayesian Networks are a successful example of probabilistic systems that exploit conditional independence to reason effectively under uncertainty.
Making Probabilistic Reasoning Feasible A joint probability distribution (JPD) contains all the relevant information to reason about the various kinds of probabilities of a set {X 1,...,X n } of random variables. Unfortunately, JPD tables are difficult to create and also very expensive to store. One possibility is to work with conditional probabilities and exploit the fact that many random variables are conditionally independent. Belief Networks are a successful example of probabilistic systems that exploit conditional independence to reason effectively under uncertainty. Artificial Intelligence p.25/43 Review of Some Concepts JPD is a collection of probabilties: P(X 1,...,X n ) = {P(X 1 = x 1 X n = x n ) x i Domain(X i )} Conditional Probability: P(X 1 = x 1 X 2 = x 2 ) = P(X 1=x 1 X 2 =x 2 ) P(X 2 =x 2 ) P(X 1 = x 1 X 2 = x 2 ) = P(X 1 = x 1 X 2 = x 2 )P(X 2 = x 2 ) or P(X 1 = x 1 X 2 = x 2,...,X n = x n ) = P(X 1=x 1 X 2 =x 2 X n =x n ) P(X 2 =x 2 X n =x n ) P(X 1 = x 1 X 2 = x 2 X n = x n ) or = P(X 1 = x 1 X 2 = x 2,...,X n = x n )P(X 2 = x 2 X n = x n ) = n i=1 P(X i = x i X i+1 = x i+1,...,x n = x n )
Review of Some Concepts (2) Conditional Independence: If P(X 1 = x 1 X 2 = x 2,...,X n = x n ) = P(X 1 = x 1 X 2 = x 2,...,X n 1 = x n 1 ) then we say X 1 = x 1 is conditionally indepedent of X n = x n given the evidence of X 2 = x 2,...,X n 1 = x n 1. Artificial Intelligence p.27/43 Belief Networks Let X 1,...,X n be random variables. A belief network (or Bayesian network) for X 1,...,X n is a graph with m nodes such that there is a node for each X i, all the edges between two nodes are directed, there are no cycles, each node has a conditional probability table (CPT), given in terms of the node s parents. The intuitive meaning of an edge from a node X i to a node X j is that X i has a direct influence on X j.
Conditional Probability Tables Each node X i in a belief network has a CPT expressing the probability of X i, given its parents as evidence. CPT for Alarm: Alarm Burglary Earthquake T F T T 0.950 0.050 T F 0.940 0.060 F T 0.290 0.710 F F 0.001 0.999 For instance, the value of P(Alarm Burglary Earthquake) is given in position (1,1) of the table; the value of P( Alarm Burglary Earthquake) is given in position (3,2) of the table. Artificial Intelligence p.29/43 A Belief Network with CPTs Note: The tables only show P(X) here because P( X) = 1 P(X).
The Semantics of Belief Networks There are two, equivalent, ways to interpret a belief network for the variables X 1,...,X n : 1. The network is a representation of the JPD P(X 1,...,X n ). 2. The network is a collection of conditional dependence statements about X 1,...,X n. Interpretation 1 is helpful in understanding how to construct belief networks. Interpretation 2 is helpful in designing inference procedures on top of them. Artificial Intelligence p.31/43 Belief Network as Joints The whole joint P(X 1,...,X n ) can be computed from a belief network for X 1,...,X n and its CPTs. For each tuple x 1,...,x n of possible values for X 1,...,X n, n P(X 1 = x 1 X n = x n ) = P(X i = x i Parents(X i )) i=1 Notation: Parents(X i ) is a subset of {X i = x i 1 i n}, that is, each parent of X i takes the value specified in x 1,...,x n. n P(X 1 = x 1 X 2 = x 2 X n = x n ) = P(X i = x i X i+1 = x i+1,...,x n = x n ) tells us that Parents(X i ) may be {X i+1 = x i+1,...,x n = x n }. i=1
Belief Network as Joints P(X 1 = x 1 X n = x n ) = n P(X i = x i Parents(X i )) i=1 P(J M A B E) = P(J A)P(M A)P(A B E)P( B)P( E) = 0.9 0.7 0.001 0.999 0.998 = 0.00062 Artificial Intelligence p.33/43 BN and Conditional Independence Let {X 1,X 2,...,X n } be any set of nodes in the network so that all the parents of X 1 are in {X 2,...,X n }, no node in {X 2,...,X n } is a descendant of X 1. Let x 1,...,x n be a value assignment for X 1,...,X n. From equation (1) we can show that, P(X 1 = x 1 X 2 = x 2 X n = x n ) = P(X 1 = x 1 Parents(X 1 )) Given its parents as evidence, each node is conditionally independent from any node other than itself and its parents in the network.
BN and Conditional Independence P(X 1 = x 1 X 2 = x 2 X n = x n ) = P(X 1 = x 1 Parents(X 1 )) Examples P(B E) = P(B) P(J M A) = P(J A) P(J A E) = P(J A) P(J A B E) = P(J A) P(J M A B E) = P(J A) Exercise: Find all the conditional independences holding in this network. Artificial Intelligence p.35/43 Constructing BN: General Procedure 1. Choose a set of random variables X i that describe the domain. 2. Order the variables into a list L. 3. Start with an empty network. 4. While there are variables left in L: (a) Remove the first variable X i of L and add a node for it to the network. (b) Choose a minimal set of nodes already in the network which satisfy the conditional dependence property with X i. (c) Make these nodes the parents of X i. (d) Fill in the CPT for X i.
Ordering the Variables The order in which add the variables to the network is important. Wrong orderings produces more complex networks. Artificial Intelligence p.37/43 Ordering the Variables Right A general, effective heuristic for constructing simpler belief networks is to exploit causal links between random variables whenever possible. That is, add the variables to the network so that causes get added before effects.
Locally Structured Systems In addition to being a complete and non-redundant representation of the domain, a belief network is often more compact than a full joint. The reason is that probabilistic domains are often representable as a locally structured system. In a locally structured system, each subcomponent interacts with only a bounded number of other components, regardless of the size of the system. The complexity of local structures generally grows linearly, instead of exponentially. Artificial Intelligence p.39/43 BN and Locally Structured Systems In many real-world domains, each random variable is influenced by at most k others, for some fixed constant k. Assuming that we have n variables and, for simplicity, that they are all Boolean, a joint table will have 2 n entries. In a well constructed belief network, each node will have at most k parents. This means that each node will have a CPT with at most 2 k entries, for a total of 2 k n entries. Example: n=20, k=5. entries in network CPTs 2 5 20 = 640
Inference in Belief Networks Main task of a belief network: Compute the conditional probability of a set of query variables, given exact values for some evidence variables. P(Query Evidence) Belief networks are flexible enough so that any node can serve as either a query or an evidence variable. In general, to decide what actions to take, an agent 1. first gets values for some variables from its percepts, or from its own reasoning, 2. then asks the network about the possible values of the other variables. Artificial Intelligence p.41/43 Probabilistic Inference with BN Belief networks are a very flexible tool for probabilistic inference because they allow several kinds of inference: Diagnostic inference (from effects to causes) Example: Infer P(Burglary JohnCalls) Causal inference (from causes to effects) Infer P(JohnCalls Burglary) Example: Intercausal inference (between causes of a common effect) Example: Infer P(Burglary Alarm Earthquake) Mixed inference (combination of the above) Example: Infer P(Alarm JohnCalls Earthquake)
Types of Inference in Belief Networks Q = query variable E = evidence variable Artificial Intelligence p.43/43