Probability Calculus. Chapter 2: From Propositional to Graded Beliefs


Chapter 2
Probability Calculus

Our purpose in this chapter is to introduce probability calculus, show how it can be used to represent uncertain beliefs, and then show how those beliefs can be changed in the face of new information.

2.1 From Propositional to Graded Beliefs

We have seen in the previous chapter that propositional logic provides a valuable tool for representing beliefs about particular situations. But we have also seen that it is most appropriate for representing categorical beliefs. Specifically, given a propositional knowledge base Δ, one can classify each sentence α as either:

Believed: Δ ⊨ α;
Disbelieved: Δ ⊨ ¬α; or
Neither: Δ ⊭ α and Δ ⊭ ¬α.

This coarse classification of sentences, which can be visualized by examining Figure 2.1, is a consequence of the binary classification imposed by the knowledge base Δ on worlds: a world is either possible or impossible depending on whether it satisfies or contradicts Δ.

[Figure 2.1: Three possible relationships between a knowledge base Δ and a sentence α: (a) Δ ⊨ α since Mods(Δ) ⊆ Mods(α); (b) Δ ⊨ ¬α since Mods(Δ) ⊆ Mods(¬α); and (c) Δ ⊭ α and Δ ⊭ ¬α.]

world   Earthquake   Burglary   Alarm   Pr(·)
ω1      true         true       true    .019
ω2      true         true       false   .001
ω3      true         false      true    .056
ω4      true         false      false   .024
ω5      false        true       true    .162
ω6      false        true       false   .018
ω7      false        false      true    .0072
ω8      false        false      false   .7128

Table 2.1: A state of belief, also known as a joint probability distribution.

One can obtain a much finer classification of sentences through a finer classification of worlds. In particular, we can assign a degree of belief or probability in [0, 1] to each world ω and denote it by Pr(ω). The belief in, or probability of, a sentence α can then be defined as:

    Pr(α) ≝ Σ_{ω ⊨ α} Pr(ω),   (2.1)

which induces a continuous classification on sentences. Consider now Table 2.1, which lists a set of worlds and their corresponding degrees of belief. Table 2.1 is known as a state of belief, or a joint probability distribution, and we will require that the beliefs assigned to worlds add up to 1:

    Σ_ω Pr(ω) = 1.

Based on Table 2.1, we have the following beliefs:

    Pr(Earthquake) = Pr(ω1) + Pr(ω2) + Pr(ω3) + Pr(ω4) = .1
    Pr(Burglary)   = .2
    Pr(Alarm)      = .2442

It is relatively straightforward to establish the following properties of beliefs. First, a bound on the belief in any sentence:

    0 ≤ Pr(α) ≤ 1 for any sentence α.   (2.2)

This follows since every degree of belief must be in [0, 1], leading to 0 ≤ Pr(α), and since the beliefs assigned to worlds must add up to 1, leading to Pr(α) ≤ 1. The second property is a baseline for inconsistent sentences:

    Pr(α) = 0 when α is inconsistent.   (2.3)

This follows since there are no worlds that satisfy α. The third property is a baseline for valid sentences:

    Pr(α) = 1 when α is valid.   (2.4)

This follows since α is satisfied by every world.
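To make Equation 2.1 concrete, here is a minimal Python sketch (not part of the original notes): it stores the joint distribution of Table 2.1 as a dictionary over worlds and computes the belief in a sentence by summing the worlds that satisfy it. The helper name `pr` and the representation of sentences as Python predicates are illustrative choices.

```python
# Joint distribution of Table 2.1: each world maps (Earthquake, Burglary, Alarm) -> Pr(world).
joint = {
    (True,  True,  True):  .0190, (True,  True,  False): .0010,
    (True,  False, True):  .0560, (True,  False, False): .0240,
    (False, True,  True):  .1620, (False, True,  False): .0180,
    (False, False, True):  .0072, (False, False, False): .7128,
}
VARS = ("Earthquake", "Burglary", "Alarm")

def pr(sentence):
    """Equation 2.1: Pr(alpha) is the sum of Pr(omega) over worlds omega satisfying alpha.
    A sentence is any predicate mapping a world (a dict of variable assignments) to True/False."""
    total = 0.0
    for values, p in joint.items():
        world = dict(zip(VARS, values))
        if sentence(world):
            total += p
    return total

assert abs(sum(joint.values()) - 1.0) < 1e-9      # beliefs over worlds add up to 1
print(pr(lambda w: w["Earthquake"]))              # ≈ .1
print(pr(lambda w: w["Burglary"]))                # ≈ .2
print(pr(lambda w: w["Alarm"]))                   # ≈ .2442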

[Figure 2.2: The worlds that satisfy α and those that satisfy ¬α form a partition of the set of all worlds.]

[Figure 2.3: The worlds that satisfy α ∨ β can be partitioned into three sets: those satisfying α ∧ ¬β, α ∧ β, and ¬α ∧ β.]

The following property allows one to compute the belief in a sentence given the belief in its negation:

    Pr(α) + Pr(¬α) = 1.   (2.5)

This follows because every world must either satisfy α or satisfy ¬α, but cannot satisfy both; see Figure 2.2. Consider now Table 2.1 for an example and let α : Burglary. We then have:

    Pr(Burglary)  = Pr(ω1) + Pr(ω2) + Pr(ω5) + Pr(ω6) = .2
    Pr(¬Burglary) = Pr(ω3) + Pr(ω4) + Pr(ω7) + Pr(ω8) = .8

The next property allows us to compute the belief in a disjunction:

    Pr(α ∨ β) = Pr(α) + Pr(β) − Pr(α ∧ β).   (2.6)

This identity is best seen by examining Figure 2.3. If we simply add Pr(α) and Pr(β), we will end up summing the beliefs in worlds that satisfy α ∧ β twice. Hence, by subtracting Pr(α ∧ β), we end up accounting for the belief in every world that satisfies α ∨ β only once. Consider Table 2.1 for an example and let α : Earthquake and β : Burglary. We then have:

    Pr(Earthquake) = Pr(ω1) + Pr(ω2) + Pr(ω3) + Pr(ω4) = .1
    Pr(Burglary)   = Pr(ω1) + Pr(ω2) + Pr(ω5) + Pr(ω6) = .2
    Pr(Earthquake ∧ Burglary) = Pr(ω1) + Pr(ω2) = .02
    Pr(Earthquake ∨ Burglary) = .1 + .2 − .02 = .28
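Continuing the illustrative `pr` helper sketched above (an assumption, not code from the notes), properties 2.5 and 2.6 can be checked numerically against Table 2.1:

```python
e = lambda w: w["Earthquake"]
b = lambda w: w["Burglary"]

# Property 2.5: Pr(alpha) + Pr(not alpha) = 1
assert abs(pr(b) + pr(lambda w: not b(w)) - 1.0) < 1e-9

# Property 2.6 (inclusion-exclusion): Pr(a or b) = Pr(a) + Pr(b) - Pr(a and b)
lhs = pr(lambda w: e(w) or b(w))                            # ≈ .28
rhs = pr(e) + pr(b) - pr(lambda w: e(w) and b(w))
assert abs(lhs - rhs) < 1e-9
```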

The belief in a disjunction α ∨ β can also be computed directly from the belief in α and the belief in β:

    Pr(α ∨ β) = Pr(α) + Pr(β) when α and β are mutually exclusive.

In this case, there is no world that satisfies both α and β. Hence, α ∧ β is inconsistent and Pr(α ∧ β) = 0. A related question is whether we can state a non-trivial logical condition on α and β which would permit one to compute the belief in the conjunction α ∧ β in terms of the belief in α and the belief in β. This turns out to be impossible, but we will later present an interesting non-logical condition for this purpose.

We should stress here that the joint probability distribution is usually too large to allow a direct representation as given in Table 2.1. For example, if we have 20 variables and each has two values, the table will have 1,048,576 entries. And if we have 40 variables, the table will have 1,099,511,627,776 entries! We will discuss in the next chapter, however, a key tool, known as a Bayesian network, for efficiently representing the joint probability distribution.

2.1.1 Notational Conventions

Before we move on to the next subject of updating beliefs, we need to settle some notational conventions. First, it is common to replace the conjoin operator (∧) by a comma (,), so we will often write Pr(α, β) instead of Pr(α ∧ β). Next, it is also common to use the term event to refer to a set of worlds. But we will also use this term when referring to a sentence α, since each sentence denotes a set of worlds, Mods(α). Finally, we will denote variables by upper-case letters (A) and their values by lower-case letters (a). Sets of variables will be denoted by bold-face upper-case letters (A) and their instantiations by bold-face lower-case letters (a). For variable A and value a, we will often write a instead of A = a and, hence, Pr(a) instead of Pr(A = a). For a variable A with values true and false, we may use a to denote A = true and ¬a to denote A = false. Therefore, Pr(A), Pr(A = true), and Pr(a) all represent the same quantity in this case. Similarly, Pr(¬A), Pr(A = false), and Pr(¬a) all represent the same quantity.

2.2 Updating Beliefs

Consider again the state of belief in Table 2.1 and suppose that we now know that the variable Alarm has taken the value true. This piece of information is not compatible with the state of belief, since that state ascribes a belief of only .2442 to the Alarm being true. One of the key questions is then how to update the state of belief so it becomes compatible with this new piece of information, which we will refer to as evidence.

More generally, evidence will be represented by an arbitrary sentence, say β, and our goal is to update the state of belief Pr(·) into a new state of belief, which we will denote by Pr(· | β). Given that β is known for sure, we expect the new state of belief Pr(· | β) to assign a belief of 1 to β: Pr(β | β) = 1. This immediately implies that Pr(¬β | β) = 0 and, hence, every world ω that satisfies ¬β must be assigned the belief 0:

    Pr(ω | β) = 0 for all ω ⊨ ¬β.   (2.7)

To completely define the new state of belief Pr(· | β), all we have to do then is define the new belief in every world ω that satisfies β. We already know that the sum of all such beliefs must be 1:

    Σ_{ω ⊨ β} Pr(ω | β) = 1.   (2.8)

But this leaves us with many options for Pr(ω | β) when world ω satisfies β. Since evidence β tells us nothing about worlds that satisfy β, except that the total belief in them should be 1, it is then reasonable to perturb our beliefs in such worlds as little as possible. To this end, we will insist that our relative beliefs in these worlds stay the same:

    Pr(ω) / Pr(ω′) = Pr(ω | β) / Pr(ω′ | β) for all ω, ω′ ⊨ β with Pr(ω′) ≠ 0.   (2.9)

The constraints expressed by Equations 2.8 and 2.9 leave us with only one option for the new beliefs in worlds that satisfy the evidence β:

    Pr(ω | β) = Pr(ω) / Pr(β) for all ω ⊨ β.

That is, the new beliefs in such worlds are just the result of normalizing our old beliefs, with the normalization constant being our old belief in the evidence, Pr(β). Our new state of belief is now completely defined:

    Pr(ω | β) ≝ 0 if ω ⊨ ¬β;  Pr(ω)/Pr(β) if ω ⊨ β.   (2.10)

The new state of belief Pr(· | β) will be called the result of conditioning the old state Pr on evidence β.
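Equation 2.10 translates directly into code. The sketch below (an illustrative extension of the earlier `joint`/`pr` helpers, not code from the notes) builds the conditioned distribution by zeroing out worlds that contradict the evidence and normalizing the rest by Pr(β):

```python
def condition(joint, vars_, evidence):
    """Equation 2.10: return the new distribution Pr(. | beta).
    `evidence` is a predicate over worlds (the sentence beta)."""
    pr_beta = sum(p for values, p in joint.items()
                  if evidence(dict(zip(vars_, values))))
    if pr_beta == 0:
        raise ValueError("conditioning on evidence with probability zero")
    return {
        values: (p / pr_beta if evidence(dict(zip(vars_, values))) else 0.0)
        for values, p in joint.items()
    }

# Condition Table 2.1 on the evidence Alarm = true.
joint_given_alarm = condition(joint, VARS, lambda w: w["Alarm"])
print(joint_given_alarm[(True, True, True)])   # ≈ .019/.2442 ≈ .0778
```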

Consider now the state of belief in Table 2.1 and suppose that the evidence β is Alarm. The result of conditioning this state of belief on Alarm is given in Table 2.2.

world   Earthquake   Burglary   Alarm   Pr(·)    Pr(· | Alarm)
ω1      true         true       true    .019     .019/.2442
ω2      true         true       false   .001     0
ω3      true         false      true    .056     .056/.2442
ω4      true         false      false   .024     0
ω5      false        true       true    .162     .162/.2442
ω6      false        true       false   .018     0
ω7      false        false      true    .0072    .0072/.2442
ω8      false        false      false   .7128    0

Table 2.2: A state of belief and the result of its conditioning on evidence Alarm.

Let us now examine some of the changes in beliefs that are induced by this new evidence. First, our belief in Burglary increases:

    Pr(Burglary) = .2    Pr(Burglary | Alarm) ≈ .741

And so does our belief in Earthquake:

    Pr(Earthquake) = .1    Pr(Earthquake | Alarm) ≈ .307

One can derive a simple closed form for the updated belief in an arbitrary sentence α given evidence β, without having to explicitly compute the belief Pr(ω | β) for every world ω. The derivation is as follows:

    Pr(α | β)
      = Σ_{ω ⊨ α} Pr(ω | β)                                          by Equation 2.1
      = Σ_{ω ⊨ α, ω ⊨ β} Pr(ω | β) + Σ_{ω ⊨ α, ω ⊨ ¬β} Pr(ω | β)     since ω satisfies β or ¬β but not both
      = Σ_{ω ⊨ α, ω ⊨ β} Pr(ω | β)                                   by Equation 2.10
      = Σ_{ω ⊨ α ∧ β} Pr(ω) / Pr(β)                                  by Equation 2.10
      = (1 / Pr(β)) Σ_{ω ⊨ α ∧ β} Pr(ω)
      = Pr(α ∧ β) / Pr(β)                                            by Equation 2.1.

The closed form,

    Pr(α | β) = Pr(α ∧ β) / Pr(β),   (2.11)

is known as Bayes conditioning. Note that the updated state of belief Pr(· | β) is defined only when Pr(β) ≠ 0.
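As a quick check, the closed form of Equation 2.11 can be wrapped around the earlier illustrative `pr` helper (again an assumed sketch, not code from the notes) and used to reproduce the numbers above:

```python
def pr_given(alpha, beta):
    """Bayes conditioning, Equation 2.11: Pr(alpha | beta) = Pr(alpha & beta) / Pr(beta)."""
    pr_beta = pr(beta)
    if pr_beta == 0:
        raise ValueError("Pr(beta) = 0: conditional belief undefined")
    return pr(lambda w: alpha(w) and beta(w)) / pr_beta

alarm = lambda w: w["Alarm"]
print(pr_given(lambda w: w["Burglary"], alarm))     # ≈ .741
print(pr_given(lambda w: w["Earthquake"], alarm))   # ≈ .307
```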

We will usually avoid stating this condition explicitly in the future, but it should be implicitly assumed. Let us now use Bayes conditioning to further examine some of the belief dynamics in our previous example. In particular, here is how some beliefs would change upon accepting the evidence Earthquake:

    Pr(Burglary) = .2       Pr(Burglary | Earthquake) = .2
    Pr(Alarm)    = .2442    Pr(Alarm | Earthquake)    = .75

That is, the belief in Burglary is not changed, but the belief in Alarm increases. Here are some more belief changes, as a reaction to the evidence Burglary:

    Pr(Alarm)      = .2442    Pr(Alarm | Burglary)      = .905
    Pr(Earthquake) = .1       Pr(Earthquake | Burglary) = .1

The belief in Alarm increases in this case, but the belief in Earthquake stays the same. The above belief dynamics are a property of the state of belief in Table 2.1 and may not hold for other states of belief. For example, it is possible to conceive of a reasonable state of belief in which information about Earthquake would change the belief about Burglary, and vice versa. One of the central questions in building automated reasoning systems is that of synthesizing states of belief that are faithful, i.e., states that correspond to the beliefs held by some human expert. We shall study a major technique for synthesizing faithful states of belief in the next chapter.

Before we move on, let us look at one more example of belief change. We know that the belief in Burglary increases when accepting the evidence Alarm. The question, though, is how such a belief would change further upon obtaining more evidence. Here is what happens when we get a confirmation that an Earthquake took place:

    Pr(Burglary | Alarm) ≈ .741    Pr(Burglary | Alarm ∧ Earthquake) ≈ .253

That is, our belief in a Burglary decreases in this case, as we now have an explanation of Alarm. If, on the other hand, we get a confirmation that there was no Earthquake, our belief in Burglary increases even further:

    Pr(Burglary | Alarm) ≈ .741    Pr(Burglary | Alarm ∧ ¬Earthquake) ≈ .957

As it turns out, some of the above belief changes are not accidental, as they are guaranteed by the method used to construct the state of belief in Table 2.1. We will have more to say about this in the next chapter.
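Using the assumed `pr_given` helper from above, these dynamics (including the "explaining away" effect) can be reproduced directly:

```python
b = lambda w: w["Burglary"]
print(pr_given(b, lambda w: w["Alarm"]))                             # ≈ .741
print(pr_given(b, lambda w: w["Alarm"] and w["Earthquake"]))         # ≈ .253
print(pr_given(b, lambda w: w["Alarm"] and not w["Earthquake"]))     # ≈ .957
```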

2.3 Independence

According to the state of belief in Table 2.1, the evidence Burglary does not change the belief in Earthquake:

    Pr(Earthquake) = .1    Pr(Earthquake | Burglary) = .1

Hence, we say in this case that the state of belief Pr finds Earthquake independent of Burglary. More generally, we will say that Pr finds event α independent of event β iff

    Pr(α | β) = Pr(α) when Pr(β) ≠ 0.   (2.12)

Note that the state of belief in Table 2.1 also finds Burglary independent of Earthquake:

    Pr(Burglary) = .2    Pr(Burglary | Earthquake) = .2

It is indeed a general property that Pr must find event α independent of event β if it also finds β independent of α. Independence satisfies many other properties that we shall explore in depth in future chapters.

Independence provides a general condition under which the belief in a conjunction α ∧ β can be expressed in terms of the belief in α and that in β. Specifically, if Pr finds α independent of β, we must have:

    Pr(α ∧ β) = Pr(α)Pr(β),

which follows immediately from the definition of independence and Bayes conditioning (Equation 2.11). The above equation is sometimes taken as the definition of independence, where the equation Pr(α | β) = Pr(α) is viewed as a consequence. We will sometimes use the above equation to stress the symmetry between α and β in the definition of independence.

It is important here to stress the difference between independence and logical disjointness (mutual exclusiveness), as it is common to mix these two notions. Recall that two events α and β are disjoint (mutually exclusive) iff they share no models: Mods(α) ∩ Mods(β) = ∅. That is, they cannot hold together at the same world. On the other hand, events α and β are independent iff Pr(α ∧ β) = Pr(α)Pr(β). Note that disjointness is an objective property of events, while independence is a property of beliefs. Hence, two people with different beliefs may disagree on whether two events are independent, but they cannot disagree on their disjointness.

2.3.1 Conditional Independence

Independence is a dynamic notion. That is, one may find two events independent at some point, but then find them dependent after obtaining some evidence. For example, we have seen earlier how the state of belief in Table 2.1 finds Burglary independent of Earthquake.
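A direct way to see these definitions at work is to test them against Table 2.1 with the assumed helpers from the earlier sketches:

```python
e = lambda w: w["Earthquake"]
b = lambda w: w["Burglary"]
a = lambda w: w["Alarm"]

def independent(x, y, eps=1e-9):
    """Product form of independence: Pr(x & y) = Pr(x) * Pr(y)."""
    return abs(pr(lambda w: x(w) and y(w)) - pr(x) * pr(y)) < eps

print(independent(e, b))   # True: Earthquake and Burglary are independent under Pr
print(independent(b, a))   # False: Burglary and Alarm are dependent under Pr
```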

This state of belief, however, finds these events dependent on each other after accepting the evidence Alarm:

    Pr(Burglary | Alarm) ≈ .741    Pr(Burglary | Alarm ∧ Earthquake) ≈ .253

That is, the evidence Earthquake changes the belief in Burglary in the presence of evidence Alarm. In general, independent events may become dependent given new evidence, and dependent events may become independent given new evidence. This calls for the following more general definition of independence. We say that a state of belief Pr finds event α conditionally independent of event β given event γ iff

    Pr(α | β ∧ γ) = Pr(α | γ) when Pr(β ∧ γ) ≠ 0.   (2.13)

That is, in the presence of evidence γ, the additional evidence β will not change the belief in α. Conditional independence enables the following more general equation for computing the belief in a conjunction:

    Pr(α ∧ β | γ) = Pr(α | γ)Pr(β | γ).

2.3.2 Variable Independence

We will find it useful in the future to talk about independence between sets of variables. In particular, let X, Y and Z be three disjoint sets of variables. We will say that a state of belief Pr finds X independent of Y given Z, denoted I_Pr(X, Z, Y), to mean that Pr finds α independent of β given γ for all sentences α, β and γ that represent states of variables X, Y, and Z, respectively.

Suppose, for example, that X = {A, B}, Y = {C} and Z = {D, E}. The statement I_Pr(X, Z, Y) is then a compact notation for a number of statements about independence, one for each instantiation of these variables: a ∧ b is independent of c given d ∧ e; a ∧ b is independent of c given d ∧ ¬e; ...; ¬a ∧ ¬b is independent of ¬c given ¬d ∧ ¬e. Using the notation we developed in Section 2.1.1, the statement I_Pr(X, Z, Y) is then asserting that

    Pr(x | y, z) = Pr(x | z) when Pr(y, z) ≠ 0

for all instantiations x, y, and z. This is obviously a much more compact notation, which we will use frequently.

2.4 Further Properties of Beliefs

We will discuss in this section some more properties of beliefs that are commonly used. We start with the Chain Rule:

    Pr(α1 ∧ α2 ∧ ... ∧ αn) = Pr(α1 | α2 ∧ ... ∧ αn) Pr(α2 | α3 ∧ ... ∧ αn) ... Pr(αn).

This rule follows from a repeated application of Bayes conditioning. We will find a major use of the chain rule when we discuss Bayesian networks in the following chapter.
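For the three variables of Table 2.1, the chain rule can be checked numerically with the assumed helpers; the ordering Earthquake, Burglary, Alarm is an arbitrary choice:

```python
# Chain rule: Pr(e & b & a) = Pr(e | b & a) * Pr(b | a) * Pr(a)
lhs = pr(lambda w: e(w) and b(w) and a(w))        # = Pr(omega_1) = .019
rhs = (pr_given(e, lambda w: b(w) and a(w))
       * pr_given(b, a)
       * pr(a))
assert abs(lhs - rhs) < 1e-9
```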

The next important property of beliefs is Case Analysis:

    Pr(α) = Σ_{i=1}^{n} Pr(α ∧ βi),   (2.14)

where the events β1, ..., βn are mutually exclusive and exhaustive.¹ Case analysis holds because the models of α ∧ β1, ..., α ∧ βn form a partition of the models of α. Intuitively, case analysis says that we can compute the belief in event α by adding up our beliefs in a number of non-overlapping cases, α ∧ β1, ..., α ∧ βn, that cover all conditions under which α holds. Another version of case analysis is the following:

    Pr(α) = Σ_{i=1}^{n} Pr(α | βi) Pr(βi),   (2.15)

where the events β1, ..., βn are mutually exclusive and exhaustive. This version is obtained from the first one by applying Bayes conditioning, and calls for considering a number of non-overlapping and exhaustive cases, β1, ..., βn. We compute our belief in α under each one of these cases, Pr(α | βi), and then add up these beliefs after applying the weight of each case, Pr(βi). Two simple and useful forms of case analysis are these:

    Pr(α) = Pr(α ∧ β) + Pr(α ∧ ¬β)
    Pr(α) = Pr(α | β)Pr(β) + Pr(α | ¬β)Pr(¬β).

These equations hold because β and ¬β are mutually exclusive and exhaustive. The main value of case analysis is that, in many situations, computing our beliefs in the cases is easier than computing our belief in α. We shall see many examples of this phenomenon in later chapters.

The last property of beliefs we shall consider is known as Bayes Rule or Bayes Theorem:

    Pr(α | β) = Pr(β | α) Pr(α) / Pr(β),   (2.16)

which follows from applying Bayes conditioning twice. The classical usage of this rule is when event α is perceived to be a cause of event β (for example, α is a disease and β is a symptom) and our goal is to assess our belief in the cause given the symptom. It is common for the belief in an effect given its cause, Pr(β | α), to be more readily available than the belief in a cause given one of its effects, Pr(α | β). Hence, this rule allows us to compute the latter from the former.

To consider an example of Bayes rule, suppose that we have a patient who was just tested for a particular disease and the test came out positive. We know that one in every thousand people has this disease. We also know that the test is not reliable: it has a false positive rate of 2% and a false negative rate of 5%.

¹ That is, Mods(βj) ∩ Mods(βk) = ∅ for j ≠ k, and ∪_{i=1}^{n} Mods(βi) is the set of all worlds.

Our goal is then to assess our belief in the patient having the disease, given that the test came out positive. If we let the variable D stand for "the patient has the disease" and the variable T stand for "the test came out positive," our goal is then to compute Pr(D | T).

From the given information, we know that Pr(D) = .001, since one in every thousand has the disease; this is our belief in the patient having the disease before we run any tests. Since the false positive rate of the test is 2%, we know that Pr(T | ¬D) = .02 and, by Equation 2.5, Pr(¬T | ¬D) = .98. Similarly, since the false negative rate of the test is 5%, we know that Pr(¬T | D) = .05 and Pr(T | D) = .95. Using Bayes rule, we now have

    Pr(D | T) = Pr(T | D) Pr(D) / Pr(T) = (.95)(.001) / Pr(T).

The belief in the test coming out positive for an average individual, Pr(T), is not readily available but can be computed using case analysis:

    Pr(T) = Pr(T | D)Pr(D) + Pr(T | ¬D)Pr(¬D) = (.95)(.001) + (.02)(.999) = .02093,

which leads to

    Pr(D | T) = (.95)(.001) / .02093 ≈ 4.54%.

It turns out that if the test's false positive rate is brought down to 2/1000, the above belief in the disease goes up to around 32.2%.
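The same computation in a few lines of Python (a standalone sketch; the variable names are illustrative):

```python
# Given quantities from the problem statement.
p_d = 0.001          # Pr(D): prior belief in the disease
p_t_given_d = 0.95   # Pr(T | D): 1 - false negative rate
p_t_given_nd = 0.02  # Pr(T | ~D): false positive rate

# Case analysis (Equation 2.15): Pr(T) = Pr(T|D)Pr(D) + Pr(T|~D)Pr(~D)
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Bayes rule (Equation 2.16): Pr(D|T) = Pr(T|D)Pr(D) / Pr(T)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 4))   # ≈ 0.0454, i.e., about 4.54%
```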

Another way to solve the above problem is to construct the state of belief completely and then use it to answer queries. This is feasible in this case because we have only two events of interest, T and D, leading to only four worlds:

world   T       D
ω1      true    true
ω2      true    false
ω3      false   true
ω4      false   false

If we can obtain the belief in each one of these worlds, then we are done, since the belief in any other sentence can be computed mechanically using Equations 2.1 and 2.11. To compute the beliefs in the above worlds, we can use the chain rule:

    Pr(ω1) = Pr(T ∧ D)   = Pr(T | D) Pr(D)
    Pr(ω2) = Pr(T ∧ ¬D)  = Pr(T | ¬D) Pr(¬D)
    Pr(ω3) = Pr(¬T ∧ D)  = Pr(¬T | D) Pr(D)
    Pr(ω4) = Pr(¬T ∧ ¬D) = Pr(¬T | ¬D) Pr(¬D).

All of the above quantities are available directly from the problem statement.
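A sketch of this second route, under the same assumed numbers: build the four-world joint with the chain rule, then answer the query with Equations 2.1 and 2.11.

```python
# Worlds are (T, D) pairs; beliefs come from the chain rule Pr(T, D) = Pr(T | D) Pr(D).
disease_joint = {
    (True,  True):  0.95 * 0.001,   # Pr(omega_1) = Pr(T | D) Pr(D)
    (True,  False): 0.02 * 0.999,   # Pr(omega_2) = Pr(T | ~D) Pr(~D)
    (False, True):  0.05 * 0.001,   # Pr(omega_3) = Pr(~T | D) Pr(D)
    (False, False): 0.98 * 0.999,   # Pr(omega_4) = Pr(~T | ~D) Pr(~D)
}

p_t = sum(p for (t, d), p in disease_joint.items() if t)              # Equation 2.1
p_t_and_d = sum(p for (t, d), p in disease_joint.items() if t and d)
print(p_t_and_d / p_t)   # Equation 2.11: Pr(D | T) ≈ 0.0454, matching Bayes rule
```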

2.5 Soft Evidence

There are two types of evidence that one may encounter: hard evidence and soft evidence. Hard evidence is information to the effect that some event has occurred, which is also the type of evidence we have considered earlier. Soft evidence, on the other hand, is not conclusive: we may get an unreliable testimony that event β occurred, which may increase our belief in β, but not to the point where we would consider it certain. One key issue relating to soft evidence is how to specify it. There are two key methods for this, which we will discuss next.

2.5.1 The "All things considered" Method

One method for specifying soft evidence on event β is by stating the new belief in β after the evidence has been accommodated. For example, we would say "given this soft evidence on β, my belief in β becomes .85." Formally, we are stating that Pr′(β) = .85, where Pr′ denotes the new state of belief after accommodating the evidence. This is sometimes known as the "All things considered" method, since the new belief in β depends not only on the new evidence, but also on our old beliefs. That is, the statement Pr′(β) = .85 is not a statement about the evidence itself, but about the result of its integration with our current beliefs.

Given this method of specifying evidence, computing the new state of belief Pr′ can be done along the same principles we used for Bayes conditioning. In particular, suppose that we obtain some soft evidence on event β which leads us to change our belief in β to q. We will denote such evidence by the pair (β, q) and understand it as imposing the following constraint on the new state of belief Pr′:

    Pr′(β) = q,

which immediately gives the additional constraint Pr′(¬β) = 1 − q. Therefore, we know that we must change the beliefs in worlds that satisfy β so these beliefs add up to q. We also know that we must change the beliefs in worlds that satisfy ¬β so they add up to 1 − q. Again, if we insist on preserving the relative beliefs in worlds that satisfy β, and also on preserving the relative beliefs in worlds that satisfy ¬β, we find ourselves committed to the following definition of Pr′:

    Pr′(ω) = (q / Pr(β)) Pr(ω) if ω ⊨ β;  ((1 − q) / Pr(¬β)) Pr(ω) if ω ⊨ ¬β.

That is, we effectively have to scale our beliefs in the worlds satisfying β using the constant q/Pr(β), and similarly for the worlds satisfying ¬β. There is also a useful closed form for the above definition, which can be derived similarly to Equation 2.11:

    Pr′(α) = q Pr(α ∧ β)/Pr(β) + (1 − q) Pr(α ∧ ¬β)/Pr(¬β),   (2.17)

where Pr′ is the new state of belief after accommodating the soft evidence (β, q). This method of updating a state of belief in the face of soft evidence is known as Jeffrey's Rule. Note that Bayes conditioning is a special case of Jeffrey's rule when q = 1, which is to be expected as they were both derived using the same principle.

Jeffrey's rule has a simple generalization to the case where the evidence concerns a set of mutually exclusive and exhaustive events β1, ..., βn, where the new beliefs in these events are q1, ..., qn, respectively. This soft evidence can be accommodated using the following generalized version of Jeffrey's rule:

    Pr′(α) = Σ_{i=1}^{n} qi Pr(α ∧ βi)/Pr(βi).   (2.18)

Consider the following example, due to Jeffrey. Assume that we are given a piece of cloth C, where its color can be one of: green (c_g), blue (c_b), or violet (c_v). We want to know whether, on the next day, the cloth will be sold (s) or not sold (¬s). Our original state of belief is as follows:

world   S    C     Pr(·)
ω1      s    c_g   .12
ω2      s    c_b   .12
ω3      s    c_v   .32
ω4      ¬s   c_g   .18
ω5      ¬s   c_b   .18
ω6      ¬s   c_v   .08

Therefore, our original belief in the cloth being sold is Pr(s) = .56. Moreover, our original beliefs in the colors c_g, c_b, c_v are .3, .3, and .4, respectively. Assume that we now inspect the cloth by candlelight, and we conclude that our new beliefs in these colors should be .7, .25, and .05, respectively.

If we apply Jeffrey's rule, we get the following new state of belief:

world   S    C     Pr′(·)
ω1      s    c_g   .28
ω2      s    c_b   .10
ω3      s    c_v   .04
ω4      ¬s   c_g   .42
ω5      ¬s   c_b   .15
ω6      ¬s   c_v   .01

Therefore, our new belief in the cloth being sold is now Pr′(s) = .42. We can also obtain this result using the closed form given by Equation 2.18:

    Pr′(s) = (.7)(.12)/.3 + (.25)(.12)/.3 + (.05)(.32)/.4 = .28 + .10 + .04 = .42.
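A minimal sketch of Jeffrey's rule for this example (the dictionary layout and variable names are assumptions, not from the notes):

```python
# Cloth example: worlds are (sold, color) pairs with their original beliefs.
cloth = {
    (True,  "green"): .12, (True,  "blue"): .12, (True,  "violet"): .32,
    (False, "green"): .18, (False, "blue"): .18, (False, "violet"): .08,
}
new_color_beliefs = {"green": .70, "blue": .25, "violet": .05}   # the soft evidence (beta_i, q_i)

old_color_beliefs = {c: sum(p for (s, col), p in cloth.items() if col == c)
                     for c in new_color_beliefs}                 # .3, .3, .4

# Jeffrey's rule on worlds: scale each world by q_i / Pr(beta_i) for its color.
cloth_new = {(s, c): p * new_color_beliefs[c] / old_color_beliefs[c]
             for (s, c), p in cloth.items()}

print(sum(p for (s, c), p in cloth_new.items() if s))   # Pr'(s) ≈ .42
```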

2.5.2 The "Nothing else considered" Method

The second method for specifying soft evidence on event β is based on declaring the strength of this evidence, independently of currently held beliefs. In particular, let us define the odds of event β as follows:

    O(β) ≝ Pr(β) / Pr(¬β).   (2.19)

That is, an odds of 1 indicates that we believe β and ¬β equally, while an odds of 10 indicates that we believe β ten times more than we believe ¬β. Given the notion of odds, we can specify soft evidence on event β by declaring the relative change it induces on the odds of β, that is, by specifying the ratio O′(β)/O(β), where O′(β) is the odds of β after accommodating the evidence, Pr′(β)/Pr′(¬β). The ratio O′(β)/O(β) is known as the Bayes factor. Hence, a Bayes factor of 1 indicates neutral evidence, while a Bayes factor of 2 indicates evidence on β strong enough to double the odds of β. This method of specifying evidence is sometimes known as the "Nothing else considered" method, as it is a statement about the strength of the evidence without any reference to the initial state of belief, since the Bayes factor does not constrain the initial state of belief.

Suppose now that we obtain soft evidence on β whose strength is given by a Bayes factor of k, and our goal is to compute the new state of belief Pr′ that results from accommodating this evidence. If we are able to translate this evidence into a form which is accepted by Jeffrey's rule, then we can use that rule to compute Pr′. This turns out to be possible, as we describe next. First, from the constraint O′(β)/O(β) = k, we get:

    Pr′(β) = k Pr(β) / (k Pr(β) + Pr(¬β)).

Hence, we can view this as a problem of updating the initial state of belief Pr using Jeffrey's rule and the soft evidence given above. That is, what we have done is translate a "Nothing else considered" specification of soft evidence (a constraint on O′(β)/O(β)) into an "All things considered" specification (a constraint on Pr′(β)). Computing Pr′ using Jeffrey's rule and the above soft evidence, we get:

    Pr′(α) = (k Pr(α ∧ β) + Pr(α ∧ ¬β)) / (k Pr(β) + Pr(¬β)),   (2.20)

where Pr′ is the new state of belief after accommodating soft evidence on event β with a Bayes factor of k. Note that Bayes conditioning is a special case of the above rule, obtained when the Bayes factor tends to infinity. Note also that the difference between Equations 2.17 and 2.20 is only in the way soft evidence is specified. The first rule expects the evidence to be specified as a pair (β, q), where q is the new belief in event β, Pr′(β) = q. The second rule expects the evidence to be specified as a pair (β, k), where k = O′(β)/O(β) is a Bayes factor that quantifies the strength of the evidence.

Consider now the following example, due to Pearl, which concerns the alarm of Mr. Holmes' house and the potential of a burglary. The initial state of belief is a joint distribution over the variables Alarm and Burglary, with worlds ω1: Alarm = true, Burglary = true; ω2: Alarm = true, Burglary = false; ω3: Alarm = false, Burglary = true; and ω4: Alarm = false, Burglary = false. One day, Mr. Holmes receives a call from his neighbor, Mrs. Gibbons, saying that she may have heard the alarm of his house going off. Since Mrs. Gibbons suffers from a hearing problem, Mr. Holmes concludes that her testimony increases the odds of the alarm going off by a factor of 4: O′(Alarm)/O(Alarm) = 4. Our goal now is to compute the new belief in a burglary taking place, Pr′(Burglary). Using Equation 2.20 with α : Burglary, β : Alarm, and k = 4, we get:

    Pr′(Burglary) = (4 Pr(Alarm ∧ Burglary) + Pr(¬Alarm ∧ Burglary)) / (4 Pr(Alarm) + Pr(¬Alarm)).

There is a generalization of Equation 2.20 to the case where the soft evidence bears on a set of mutually exclusive and exhaustive events β1, ..., βn. This generalization requires that we define the odds of event βi to event βj:

    O(βi, βj) ≝ Pr(βi) / Pr(βj).

The odds of β, O(β), is then a special case, since O(β, ¬β) = O(β). Given this more general notion of odds, soft evidence bearing on a set of mutually exclusive and exhaustive events β1, ..., βn can then be specified using a set of numbers λ1, ..., λn, with the following interpretation:

    O′(βi, βj) / O(βi, βj) = λi / λj.   (2.21)

That is, we are specifying the relative increase in the odds of βi to βj for every pair of events. Note here that the specific numbers λ1, ..., λn do not matter; only their ratios are important.² Furthermore, each ratio λi/λj is known as the Bayes factor for the events βi and βj. Hence, the numbers λ1, ..., λn are indirectly specifying a set of Bayes factors, one for each pair of distinct events in β1, ..., βn.³

Given the constraints imposed by Equation 2.21 on the new state of belief, one can show that the new belief in any of the events βj must be given by:⁴

    Pr′(βj) = λj Pr(βj) / Σ_{i=1}^{n} λi Pr(βi).

Using these new beliefs and Jeffrey's rule (Equation 2.18), we get the following rule:

    Pr′(α) = Σ_{i=1}^{n} λi Pr(α ∧ βi) / Σ_{i=1}^{n} λi Pr(βi).   (2.22)

This generalizes Equation 2.20, which falls out as a special case when n = 2, β1 = β, β2 = ¬β, λ1 = k, and λ2 = 1. Again, note that the difference between Equation 2.18 and Equation 2.22 is only in the way soft evidence is specified. The first equation expects evidence in the form of a set of mutually exclusive and exhaustive events β1, ..., βn and a corresponding set of beliefs q1, ..., qn, which are interpreted as constraints of the form Pr′(βi) = qi. The second rule expects a different set of numbers λ1, ..., λn, which are interpreted as constraints of the form O′(βi, βj)/O(βi, βj) = λi/λj. Again, we will provide another, possibly more intuitive, interpretation of the numbers λ1, ..., λn in the following section.

To illustrate Equation 2.22, consider the cloth example we discussed in the previous section, and suppose that we have some soft evidence on the mutually exclusive and exhaustive events C = c_g, C = c_b, and C = c_v whose strength is quantified by λ_g : λ_b : λ_v = 7 : 2.5 : .375. We can compute our new belief in the event S = s using Equation 2.22 as follows:

    Pr′(S = s) = ((7)(.12) + (2.5)(.12) + (.375)(.32)) / ((7)(.3) + (2.5)(.3) + (.375)(.4)) = .42.

The above evidence strength was in fact chosen carefully so it leads to the same state of belief Pr′ that we arrived at using Jeffrey's rule.

² See the following section for another interpretation of these ratios.
³ For the special case of n = 2, β2 must be equivalent to ¬β1, O′(β1, β2)/O(β1, β2) = O′(β1)/O(β1), and the ratio λ1/λ2 is then the Bayes factor k discussed earlier.
⁴ This can be shown as follows. By unfolding O′ and O in Equation 2.21, we get Pr′(βi)/λi Pr(βi) = Pr′(βj)/λj Pr(βj). Hence, Pr′(β1)/λ1 Pr(β1) = ... = Pr′(βn)/λn Pr(βn) = k for some constant k. This leads to Pr′(βi) = k λi Pr(βi) for i = 1, ..., n. Adding up the two sides of this equation for i = 1, ..., n, we get 1 = k Σ_{i=1}^{n} λi Pr(βi). For any particular βj, we finally get Pr′(βj) = λj Pr(βj) / Σ_{i=1}^{n} λi Pr(βi).
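The same computation with Equation 2.22, reusing the assumed `cloth` dictionary from the earlier Jeffrey's-rule sketch:

```python
lam = {"green": 7.0, "blue": 2.5, "violet": 0.375}   # the lambda_i (only their ratios matter)

numerator = sum(lam[c] * p for (s, c), p in cloth.items() if s)    # sum_i lambda_i Pr(s & c_i)
denominator = sum(lam[c] * p for (s, c), p in cloth.items())       # sum_i lambda_i Pr(c_i)
print(numerator / denominator)   # Pr'(s) ≈ .42, matching Jeffrey's rule
```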

2.5.3 The Virtual Evidence Method

Equation 2.22 has an alternative semantics based on the notion of virtual evidence, which we explain next. Suppose that we have soft evidence bearing on the set of mutually exclusive and exhaustive events β1, ..., βn. We can model this evidence explicitly by augmenting our language with a new propositional variable V, which represents the event of receiving this soft evidence; for example, V could represent the event of receiving a call from our unreliable neighbor saying that the alarm in our house went off. We can now quantify the strength of this evidence by specifying the probability that we will receive it given each of the events βi:

    Pr(V | βi) = λi, for i = 1, ..., n.

The new state of belief Pr′, after the soft evidence has been accommodated, is now given by Pr(· | V). If we also assume that V is independent of every other event given βi, for i = 1, ..., n, we can then show that Pr(· | V) is indeed equal to Pr′ as given by Equation 2.22. This method is known as the method of virtual evidence, as it is based on introducing a new virtual variable V, which allows one to model the soft evidence in terms of hard evidence on V, where the relationship of V to the events β1, ..., βn is uncertain. Moreover, this uncertainty is captured explicitly using the numbers λ1, ..., λn. According to this method, the ratios of these numbers,

    Pr(V | βi) / Pr(V | βj) = λi / λj,

are interpreted as the odds of receiving evidence V given event βi to receiving it given event βj. Note that these ratios were called Bayes factors in the previous section. We have seen earlier that when the soft evidence bears only on β and ¬β, its strength can be specified using the Bayes factor k = O′(β)/O(β), which corresponds to a setting of λ1 = k and λ2 = 1. This very common case can then be handled using virtual evidence by simply ensuring that:

    Pr(V | β) / Pr(V | ¬β) = k.

The method of virtual evidence is quite important practically, as it allows us to integrate soft evidence using the tools developed for hard evidence. We will rely on this method for accommodating soft evidence in future chapters.
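To illustrate, here is a hedged sketch (my own construction, not from the notes) that models Mrs. Gibbons' call as a virtual-evidence variable V on top of a hypothetical joint over Alarm and Burglary, and checks that conditioning on V matches Equation 2.20 with k = 4. The prior numbers and the two likelihoods for V are made up for illustration; only their ratio matters.

```python
# Hypothetical prior over (Alarm, Burglary); the numbers are illustrative only.
prior = {
    (True,  True):  0.00095, (True,  False): 0.00995,
    (False, True):  0.00005, (False, False): 0.98905,
}
assert abs(sum(prior.values()) - 1.0) < 1e-9

# Virtual evidence: Pr(V | Alarm) / Pr(V | ~Alarm) = 4 (the Bayes factor k).
pv_given_alarm, pv_given_no_alarm = 0.4, 0.1

# Extend each world with V using Pr(V | world) = Pr(V | Alarm-value), then condition on V.
weighted = {w: p * (pv_given_alarm if w[0] else pv_given_no_alarm) for w, p in prior.items()}
pv = sum(weighted.values())
posterior = {w: q / pv for w, q in weighted.items()}
pr_burglary_new = sum(q for (alarm, burg), q in posterior.items() if burg)

# Equation 2.20 with k = 4 gives the same answer.
k = 4
num = k * prior[(True, True)] + prior[(False, True)]
den = k * (prior[(True, True)] + prior[(True, False)]) + (prior[(False, True)] + prior[(False, False)])
assert abs(pr_burglary_new - num / den) < 1e-9
print(pr_burglary_new)
```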


4 Derivations in the Propositional Calculus 4 Derivations in the Propositional Calculus 1. Arguments Expressed in the Propositional Calculus We have seen that we can symbolize a wide variety of statement forms using formulas of the propositional

More information

Applied Logic. Lecture 1 - Propositional logic. Marcin Szczuka. Institute of Informatics, The University of Warsaw

Applied Logic. Lecture 1 - Propositional logic. Marcin Szczuka. Institute of Informatics, The University of Warsaw Applied Logic Lecture 1 - Propositional logic Marcin Szczuka Institute of Informatics, The University of Warsaw Monographic lecture, Spring semester 2017/2018 Marcin Szczuka (MIMUW) Applied Logic 2018

More information

Formal Logic. Critical Thinking

Formal Logic. Critical Thinking ormal Logic Critical hinking Recap: ormal Logic If I win the lottery, then I am poor. I win the lottery. Hence, I am poor. his argument has the following abstract structure or form: If P then Q. P. Hence,

More information

CS206 Lecture 21. Modal Logic. Plan for Lecture 21. Possible World Semantics

CS206 Lecture 21. Modal Logic. Plan for Lecture 21. Possible World Semantics CS206 Lecture 21 Modal Logic G. Sivakumar Computer Science Department IIT Bombay siva@iitb.ac.in http://www.cse.iitb.ac.in/ siva Page 1 of 17 Thu, Mar 13, 2003 Plan for Lecture 21 Modal Logic Possible

More information

Reasoning about uncertainty

Reasoning about uncertainty Reasoning about uncertainty Rule-based systems are an attempt to embody the knowledge of a human expert within a computer system. Human knowledge is often imperfect. - it may be incomplete (missing facts)

More information

Lifted Inference: Exact Search Based Algorithms

Lifted Inference: Exact Search Based Algorithms Lifted Inference: Exact Search Based Algorithms Vibhav Gogate The University of Texas at Dallas Overview Background and Notation Probabilistic Knowledge Bases Exact Inference in Propositional Models First-order

More information

Uncertainty. Chapter 13

Uncertainty. Chapter 13 Uncertainty Chapter 13 Outline Uncertainty Probability Syntax and Semantics Inference Independence and Bayes Rule Uncertainty Let s say you want to get to the airport in time for a flight. Let action A

More information

Equivalent Forms of the Axiom of Infinity

Equivalent Forms of the Axiom of Infinity Equivalent Forms of the Axiom of Infinity Axiom of Infinity 1. There is a set that contains each finite ordinal as an element. The Axiom of Infinity is the axiom of Set Theory that explicitly asserts that

More information

Discrete Probability and State Estimation

Discrete Probability and State Estimation 6.01, Fall Semester, 2007 Lecture 12 Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Fall Semester, 2007 Lecture 12 Notes

More information

Logic. Readings: Coppock and Champollion textbook draft, Ch

Logic. Readings: Coppock and Champollion textbook draft, Ch Logic Readings: Coppock and Champollion textbook draft, Ch. 3.1 3 1. Propositional logic Propositional logic (a.k.a propositional calculus) is concerned with complex propositions built from simple propositions

More information

Modeling and reasoning with uncertainty

Modeling and reasoning with uncertainty CS 2710 Foundations of AI Lecture 18 Modeling and reasoning with uncertainty Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square KB systems. Medical example. We want to build a KB system for the diagnosis

More information

Sample Space: Specify all possible outcomes from an experiment. Event: Specify a particular outcome or combination of outcomes.

Sample Space: Specify all possible outcomes from an experiment. Event: Specify a particular outcome or combination of outcomes. Chapter 2 Introduction to Probability 2.1 Probability Model Probability concerns about the chance of observing certain outcome resulting from an experiment. However, since chance is an abstraction of something

More information

CMPT Machine Learning. Bayesian Learning Lecture Scribe for Week 4 Jan 30th & Feb 4th

CMPT Machine Learning. Bayesian Learning Lecture Scribe for Week 4 Jan 30th & Feb 4th CMPT 882 - Machine Learning Bayesian Learning Lecture Scribe for Week 4 Jan 30th & Feb 4th Stephen Fagan sfagan@sfu.ca Overview: Introduction - Who was Bayes? - Bayesian Statistics Versus Classical Statistics

More information

HANDOUT AND SET THEORY. Ariyadi Wijaya

HANDOUT AND SET THEORY. Ariyadi Wijaya HANDOUT LOGIC AND SET THEORY Ariyadi Wijaya Mathematics Education Department Faculty of Mathematics and Natural Science Yogyakarta State University 2009 1 Mathematics Education Department Faculty of Mathematics

More information

Proof Techniques (Review of Math 271)

Proof Techniques (Review of Math 271) Chapter 2 Proof Techniques (Review of Math 271) 2.1 Overview This chapter reviews proof techniques that were probably introduced in Math 271 and that may also have been used in a different way in Phil

More information

13.4 INDEPENDENCE. 494 Chapter 13. Quantifying Uncertainty

13.4 INDEPENDENCE. 494 Chapter 13. Quantifying Uncertainty 494 Chapter 13. Quantifying Uncertainty table. In a realistic problem we could easily have n>100, makingo(2 n ) impractical. The full joint distribution in tabular form is just not a practical tool for

More information

CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS

CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS 1 Language There are several propositional languages that are routinely called classical propositional logic languages. It is due to the functional dependency

More information

Formalizing Probability. Choosing the Sample Space. Probability Measures

Formalizing Probability. Choosing the Sample Space. Probability Measures Formalizing Probability Choosing the Sample Space What do we assign probability to? Intuitively, we assign them to possible events (things that might happen, outcomes of an experiment) Formally, we take

More information

Basics of Probability

Basics of Probability Basics of Probability Lecture 1 Doug Downey, Northwestern EECS 474 Events Event space E.g. for dice, = {1, 2, 3, 4, 5, 6} Set of measurable events S 2 E.g., = event we roll an even number = {2, 4, 6} S

More information

Belief revision: A vade-mecum

Belief revision: A vade-mecum Belief revision: A vade-mecum Peter Gärdenfors Lund University Cognitive Science, Kungshuset, Lundagård, S 223 50 LUND, Sweden Abstract. This paper contains a brief survey of the area of belief revision

More information

UNCERTAINTY. In which we see what an agent should do when not all is crystal-clear.

UNCERTAINTY. In which we see what an agent should do when not all is crystal-clear. UNCERTAINTY In which we see what an agent should do when not all is crystal-clear. Outline Uncertainty Probabilistic Theory Axioms of Probability Probabilistic Reasoning Independency Bayes Rule Summary

More information

Discrete Probability and State Estimation

Discrete Probability and State Estimation 6.01, Spring Semester, 2008 Week 12 Course Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Spring Semester, 2008 Week

More information

KB Agents and Propositional Logic

KB Agents and Propositional Logic Plan Knowledge-Based Agents Logics Propositional Logic KB Agents and Propositional Logic Announcements Assignment2 mailed out last week. Questions? Knowledge-Based Agents So far, what we ve done is look

More information