CS 188, Fall 2005 Solutions for Assignment 4


1. 13.1 [5pts] Show from first principles that P(A | B ∧ A) = 1.

The first principles needed here are the definition of conditional probability, P(X | Y) = P(X ∧ Y)/P(Y), and the definitions of the logical connectives. It is not enough to say that if B ∧ A is given then A must be true! From the definition of conditional probability, the fact that A ∧ A ⇔ A, and the fact that conjunction is commutative and associative, we have

    P(A | B ∧ A) = P(A ∧ (B ∧ A)) / P(B ∧ A) = P(B ∧ A) / P(B ∧ A) = 1.

2. 13.6 [10pts] Given the full joint distribution shown in Figure 13.3, calculate the following: (a) P(toothache); (b) P(Cavity); (c) P(Toothache | cavity); (d) P(Cavity | toothache ∨ catch).

The main point of this exercise is to become completely familiar with the basic mechanics of answering queries by adding up entries in the joint distribution. It also helps you to understand the various notations of bold versus non-bold P, and uppercase versus lowercase variable names.

(a) This asks for the probability that Toothache is true:
    P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.

(b) This asks for the vector of probability values for the random variable Cavity. It has two values, which we list in the order ⟨true, false⟩. First add up 0.108 + 0.012 + 0.072 + 0.008 = 0.2. Then we have
    P(Cavity) = ⟨0.2, 0.8⟩.

(c) This asks for the vector of probability values for Toothache, given that Cavity is true:
    P(Toothache | cavity) = ⟨(0.108 + 0.012)/0.2, (0.072 + 0.008)/0.2⟩ = ⟨0.6, 0.4⟩.

(d) This asks for the vector of probability values for Cavity, given that either Toothache or Catch is true. First compute
    P(toothache ∨ catch) = 0.108 + 0.012 + 0.016 + 0.064 + 0.072 + 0.144 = 0.416.
Then
    P(Cavity | toothache ∨ catch) = ⟨(0.108 + 0.012 + 0.072)/0.416, (0.016 + 0.064 + 0.144)/0.416⟩ ≈ ⟨0.4615, 0.5385⟩.

3. 13.7 [5pts] Show that the three forms of independence in Equation (13.8) are equivalent.
The three forms are P(a | b) = P(a), P(b | a) = P(b), and P(a ∧ b) = P(a)P(b). From P(a | b) = P(a) we have, multiplying both sides by P(b),
    P(a | b)P(b) = P(a)P(b);
by the product rule, we can rewrite the left-hand side to obtain
    P(a ∧ b) = P(a)P(b).
Hence, the first definition implies the third. We can reverse the process, applying the product rule in the other direction and dividing by P(b), to prove that the third implies the first, provided P(b) is not zero. (If it is zero, the conditional probability is not defined.) Exactly the same argument proves the third from the second, except that we multiply and divide by P(a) rather than P(b). Hence, we have shown equivalence of all three definitions in the case that neither P(a) nor P(b) is zero.
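The summations in 13.6 are easy to mechanize. Here is a small Python sketch (the helper `prob` and the tuple encoding of worlds are our own) that reproduces the four answers from the Figure 13.3 entries:

```python
# Full joint distribution from AIMA Figure 13.3, keyed by
# (toothache, catch, cavity) truth values.
joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint entries consistent with a predicate on (toothache, catch, cavity)."""
    return sum(p for world, p in joint.items() if event(*world))

# (a) P(toothache)
p_toothache = prob(lambda t, c, cav: t)                        # 0.2
# (b) P(Cavity) as the vector <true, false>
p_cavity = [prob(lambda t, c, cav: cav), prob(lambda t, c, cav: not cav)]
# (c) P(Toothache | cavity): condition by dividing by P(cavity)
p_t_given_cav = prob(lambda t, c, cav: t and cav) / prob(lambda t, c, cav: cav)
# (d) P(cavity | toothache or catch)
p_evidence = prob(lambda t, c, cav: t or c)                    # 0.416
p_cav_given_ev = prob(lambda t, c, cav: cav and (t or c)) / p_evidence
```

Conditioning here is just division of two sums over the joint, which is exactly the "add up entries" mechanics described above.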

4. 13.8 [5pts] After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don't have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. Why is it good news that the disease is rare? What are the chances that you actually have the disease?

We are given the following information:
    P(test | disease) = 0.99
    P(¬test | ¬disease) = 0.99
    P(disease) = 0.0001
and the observation test. What the patient is concerned about is P(disease | test). Roughly speaking, the reason it is a good thing that the disease is rare is that P(disease | test) is proportional to P(disease), so a lower prior for disease will mean a lower value for P(disease | test). Roughly speaking, if 10,000 people take the test, we expect 1 to actually have the disease (and most likely test positive), while the rest do not have the disease, but 1% of them (about 100 people) will test positive anyway; so P(disease | test) will be about 1 in 100. More precisely, using Bayes' rule with normalization:

    P(disease | test) = P(test | disease)P(disease) / [P(test | disease)P(disease) + P(test | ¬disease)P(¬disease)]
                      = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999)
                      ≈ 0.009804

The moral is that when the disease is much rarer than the test's false-positive rate, a positive test result does not mean the disease is likely: a false positive remains much more likely. Here is an alternative exercise along the same lines: A doctor says that an infant who predominantly turns the head to the right while lying on the back will be right-handed, and one who turns to the left will be left-handed. Isabella predominantly turned her head to the left. Given that 90% of the population is right-handed, what is Isabella's probability of being right-handed if the test is 90% accurate?
If it is 80% accurate?

The reasoning is the same as above, and the answer is: a 50% probability of being right-handed if the test is 90% accurate, and 69% if the test is 80% accurate.

5. 13.11 [15pts] Suppose you are given a bag containing n unbiased coins. You are told that n − 1 of these coins are normal, with heads on one side and tails on the other, whereas one coin is a fake, with heads on both sides.

(a) Suppose you reach into the bag, pick out a coin uniformly at random, flip it, and get a head. What is the (conditional) probability that the coin you chose is the fake coin?
(b) Suppose you continue flipping the coin for a total of k times after picking it and see k heads. Now what is the conditional probability that you picked the fake coin?
(c) Suppose you wanted to decide whether the chosen coin was fake by flipping it k times. The decision procedure returns FAKE if all k flips come up heads; otherwise it returns NORMAL. What is the (unconditional) probability that this procedure makes an error?

(a) A typical counting argument goes like this: There are n ways to pick a coin and 2 outcomes for each flip (although with the fake coin, the two outcomes of the flip are indistinguishable), so there are 2n total atomic events, each equally likely. Of those, only 2 pick the fake coin, and 2 + (n − 1) result in heads. So the probability of a fake coin given heads, P(fake | heads), is 2/(2 + n − 1) = 2/(n + 1). Often such counting arguments go astray when the situation gets complex, so it may be better to do it more formally:

    P(Fake | heads) = α P(heads | Fake) P(Fake)

                    = α ⟨1.0, 0.5⟩ ⟨1/n, (n − 1)/n⟩
                    = α ⟨1/n, (n − 1)/2n⟩
                    = ⟨2/(n + 1), (n − 1)/(n + 1)⟩

(b) Now there are 2^k · n atomic events, of which 2^k pick the fake coin and 2^k + (n − 1) result in all heads. So the probability of a fake coin given a run of k heads, P(fake | heads^k), is 2^k/(2^k + n − 1). Note that this approaches 1 as k increases, as expected. If k = n = 12, for example, then P(fake | heads^12) = 0.9973. Doing it the formal way:

    P(Fake | heads^k) = α P(heads^k | Fake) P(Fake)
                      = α ⟨1.0, 0.5^k⟩ ⟨1/n, (n − 1)/n⟩
                      = α ⟨1/n, (n − 1)/2^k n⟩
                      = ⟨2^k/(2^k + n − 1), (n − 1)/(2^k + n − 1)⟩

(c) The procedure makes an error if and only if a normal coin is chosen and turns up heads k times in a row. The probability of this is
    P(heads^k | ¬fake) P(¬fake) = (n − 1)/2^k n.

6. 13.15 [5pts] (Adapted from Pearl 1988) You are a witness of a night-time hit-and-run accident involving a taxi in Athens. All taxis in Athens are blue or green. You swear, under oath, that the taxi was blue. Extensive testing shows that under the dim lighting conditions, discrimination between blue and green is 75% reliable. Is it possible to calculate the most likely color for the taxi? (Hint: distinguish carefully between the proposition that the taxi is blue and the proposition that it appears blue.) What now, given that 9 out of 10 Athenian taxis are green?

The relevant aspect of the world can be described by two random variables: B means the taxi was blue, and LB means the taxi looked blue. The information on the reliability of color identification can be written as
    P(LB | B) = 0.75        P(¬LB | ¬B) = 0.75
We need to know the probability that the taxi was blue, given that it looked blue:
    P(B | LB) ∝ P(LB | B)P(B) = 0.75 P(B)
    P(¬B | LB) ∝ P(LB | ¬B)P(¬B) = 0.25 (1 − P(B))
Thus we cannot decide the probability without some information about the prior probability of blue taxis, P(B). For example, if we knew that all taxis were blue, i.e., P(B) = 1, then obviously P(B | LB) = 1.
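The dependence of the answer on the prior P(B) can be made concrete with a couple of lines of Python (a sketch; the function name is ours, and the likelihoods 0.75 and 0.25 are the ones given above):

```python
def p_blue_given_looks_blue(prior_blue):
    """P(B | LB) from the likelihoods P(LB | B) = 0.75 and P(LB | ~B) = 0.25."""
    num = 0.75 * prior_blue                  # P(LB | B) P(B)
    den = num + 0.25 * (1.0 - prior_blue)    # ... + P(LB | ~B) P(~B)
    return num / den
```

For instance, `p_blue_given_looks_blue(1.0)` returns 1.0, matching the all-taxis-blue case, while a prior of 0.1 (9 out of 10 taxis green) gives only 0.25.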
On the other hand, if we adopt Laplace's Principle of Indifference, which states that propositions can be deemed equally likely in the absence of any differentiating information, then we have P(B) = 0.5 and P(B | LB) = 0.75. Usually we will have some differentiating information, so this principle does not apply.

Given that 9 out of 10 taxis are green, and assuming the taxi in question is drawn randomly from the taxi population, we have P(B) = 0.1. Hence
    P(B | LB) ∝ 0.75 × 0.1 = 0.075
    P(¬B | LB) ∝ 0.25 × 0.9 = 0.225
    P(B | LB) = 0.075/(0.075 + 0.225) = 0.25
    P(¬B | LB) = 0.225/(0.075 + 0.225) = 0.75
So, despite the witness's report, the taxi is more likely to have been green.

7. 13.18 [10pts] Text categorization is the task of assigning a given document to one of a fixed set of categories on the basis of the text it contains. Naive Bayes models are often used for this task. In these models, the query variable is the document category, and the effect variables are the presence or absence of each word in the language; the assumption is that words occur independently in documents, with frequencies determined by the document category.

(a) Explain precisely how such a model can be constructed, given as training data a set of documents that have been assigned to categories.
(b) Explain precisely how to categorize a new document.
(c) Is the independence assumption reasonable? Discuss.

This question is essentially previewing material in Chapter 23, but it should be quite straightforward to figure out how to estimate a conditional probability from complete data.

(a) The model consists of the prior probability P(Category) and the conditional probabilities P(Word_i | Category), where Word_i is true iff the document in question contains the ith word in the dictionary. For each category c, P(Category = c) is estimated as the fraction of all documents that are of category c. Similarly, P(Word_i = true | Category = c) is estimated as the fraction of documents of category c that contain word i.

(b) We take advantage of the conditional independence structure of the model:
    P(Category | word_1, ..., word_n) = α P(Category, word_1, ..., word_n)
                                      = α P(Category) Π_i P(word_i | Category)

(c) The independence assumption is clearly violated in practice. For example, the word pair "artificial intelligence" occurs more frequently in any given document category than would be suggested by multiplying the probabilities of "artificial" and "intelligence". This means that the true probability of any given set of words occurring is typically much higher than the model suggests; the effect gets worse for larger sets of words, so the relative category probabilities of documents of different lengths tend to be very unreliable.

8. 14.1 [15pts] Consider the network for car diagnosis shown in the figure.

(a) Extend the network with the Boolean variables IcyWeather and StarterMotor.
(b) Give reasonable conditional probability tables for all the nodes.
(c) How many independent values are contained in the joint probability distribution for eight Boolean nodes, assuming that no conditional independence relations are known to hold among them?
(d) How many independent probability values do your network tables contain?
(e) The conditional distribution for Starts could be described as a noisy-AND distribution. Define this family in general and relate it to the noisy-OR distribution.

Adding variables to an existing net can be done in two ways. Formally speaking, one should insert the variables into the variable ordering and rerun the network construction process from the point where the first new variable appears. Informally speaking, one never really builds a network by a strict ordering. Instead, one asks what variables are direct causes or influences on what other ones, and builds local parent/child graphs that way. It is usually easy to identify where in such a structure the new variable goes, but one must be very careful to check for possible induced dependencies downstream.

(a) IcyWeather is not caused by any of the car-related variables, so it needs no parents. It directly affects the battery and the starter motor. StarterMotor is an additional precondition for Starts. The new network is shown in Figure 1.

(b) Reasonable probabilities may vary a lot depending on the kind of car and perhaps the personal experience of the assessor. The following values indicate the general order of magnitude and relative values that make sense. A reasonable prior for IcyWeather might be 0.05 (perhaps depending on location and season).
    P(Battery | IcyWeather) = 0.95,        P(Battery | ¬IcyWeather) = 0.997
    P(StarterMotor | IcyWeather) = 0.98,   P(StarterMotor | ¬IcyWeather) = 0.999

Figure 1: Car network amended to include IcyWeather and StarterMotor.

    P(Radio | Battery) = 0.9999,    P(Radio | ¬Battery) = 0.05
    P(Ignition | Battery) = 0.998,  P(Ignition | ¬Battery) = 0.01
    P(Gas) = 0.995
    P(Starts | Ignition, StarterMotor, Gas) = 0.9999, other entries 0.0
    P(Moves | Starts) = 0.998

(c) With 8 Boolean variables, the joint has 2^8 − 1 = 255 independent entries.

(d) Given the topology shown in Figure 1, the total number of independent CPT entries is 1 + 2 + 2 + 2 + 2 + 1 + 8 + 2 = 20.

(e) The CPT for Starts describes a set of nearly necessary conditions that are together almost sufficient. That is, all the entries are nearly zero except for the entry where all the conditions are true. That entry will be not quite 1 (because there is always some other possible fault that we didn't think of), but as we add more conditions it gets closer to 1. If we add a Leak node as an extra parent, then the probability is exactly 1 when all parents are true. We can relate noisy-AND to noisy-OR using de Morgan's rule: A ∧ B ⇔ ¬(¬A ∨ ¬B). That is, noisy-AND is the same as noisy-OR except that the polarities of the parent and child variables are reversed. In the noisy-OR case, we have

    P(Y = true | x_1, ..., x_k) = 1 − Π_{i : x_i = true} q_i

where q_i is the probability that the presence of the ith parent fails to cause the child to be true. In the noisy-AND case, we can write

    P(Y = true | x_1, ..., x_k) = Π_{i : x_i = false} r_i

where r_i is the probability that the absence of the ith parent fails to cause the child to be false (e.g., it is magically bypassed by some other mechanism).

9. 14.2 [15pts] In your local nuclear power station, there is an alarm that senses when a temperature gauge exceeds a given threshold. The gauge measures the temperature of the core. Consider the Boolean variables A (alarm sounds), F_A (alarm is faulty), and F_G (gauge is faulty) and the multivalued nodes G (gauge reading) and T (actual core temperature).

(a) Draw a Bayesian network for this domain, given that the gauge is more likely to fail when the core temperature gets too high.
(b) Is your network a polytree?
(c) Suppose there are just two possible actual and measured temperatures, normal and high; the probability that the gauge gives the correct temperature is x when it is working, but y when it is faulty. Give the conditional probability table associated with G.
(d) Suppose the alarm works correctly unless it is faulty, in which case it never sounds. Give the conditional probability table associated with A.
(e) Suppose the alarm and gauge are working and the alarm sounds. Calculate an expression for the probability that the temperature of the core is too high, in terms of the various conditional probabilities in the network.

Figure 2: A Bayesian network for the nuclear alarm problem.

(a) A suitable network is shown in Figure 2. The key aspects are: the failure nodes are parents of the sensor nodes, and the temperature node is a parent of both the gauge node and the gauge-failure node. It is exactly this kind of correlation that makes it difficult for humans to understand what is happening in complex systems with unreliable sensors.

(b) No matter how the network is drawn, it is not a polytree, because the temperature influences the gauge in two ways (directly, and via the gauge-failure node).

(c) The CPT for G is shown below. The wording of the question is a little tricky because F_G means "gauge not working" and ¬F_G means "gauge working".

                    T = Normal          T = High
                   F_G     ¬F_G       F_G     ¬F_G
    G = Normal      y       x        1 − y   1 − x
    G = High      1 − y   1 − x        y       x

(d) The CPT for A is as follows:

                  G = Normal        G = High
                  F_A    ¬F_A      F_A    ¬F_A
    ¬A             1      1         1      0
    A              0      0         0      1

(e) This part actually asks the student to do something usually done by Bayesian network algorithms. The great thing is that doing the calculation without a computer program makes it easy to see the nature of the calculations that the algorithms are systematizing.
It also illustrates the importance of creating complete and correct algorithms. Abbreviating T = High and G = High by T and G, the probability of interest here is P(T | A, ¬F_G, ¬F_A). Because the alarm's behavior is deterministic, we can reason that if the alarm is working and sounds, G must be High. Because F_A and A are d-separated from T given G, we need only calculate P(T | ¬F_G, G). There are several ways to go about doing this. The opportunistic way is to notice that the CPT entries give us P(G | T, ¬F_G), which suggests using the generalized Bayes' rule to switch G and T with ¬F_G as background:

    P(T | ¬F_G, G) ∝ P(G | T, ¬F_G) P(T | ¬F_G)

We then use Bayes' rule again on the last term:
    P(T | ¬F_G, G) ∝ P(G | T, ¬F_G) P(¬F_G | T) P(T)
A similar relationship holds for ¬T:
    P(¬T | ¬F_G, G) ∝ P(G | ¬T, ¬F_G) P(¬F_G | ¬T) P(¬T)
Normalizing, we obtain
    P(T | ¬F_G, G) = P(G | T, ¬F_G)P(¬F_G | T)P(T) / [P(G | T, ¬F_G)P(¬F_G | T)P(T) + P(G | ¬T, ¬F_G)P(¬F_G | ¬T)P(¬T)]

The systematic way to do it is to revert to joint entries (noticing that the subgraph of T, G, and F_G is completely connected, so no loss of efficiency is entailed). We have
    P(T | ¬F_G, G) = P(T, ¬F_G, G) / P(G, ¬F_G) = P(T, ¬F_G, G) / [P(T, G, ¬F_G) + P(¬T, G, ¬F_G)]
Now we use the chain rule to rewrite the joint entries as CPT entries:
    P(T | ¬F_G, G) = P(T)P(¬F_G | T)P(G | T, ¬F_G) / [P(T)P(¬F_G | T)P(G | T, ¬F_G) + P(¬T)P(¬F_G | ¬T)P(G | ¬T, ¬F_G)]
which is of course the same as the expression arrived at above. Letting P(T) = p, P(F_G | T) = g, and P(F_G | ¬T) = h, and reading P(G | T, ¬F_G) = x and P(G | ¬T, ¬F_G) = 1 − x from the CPT in part (c), we get

    P(T | ¬F_G, G) = p(1 − g)x / [p(1 − g)x + (1 − p)(1 − h)(1 − x)]

10. 14.11 [15pts] Consider the query P(Rain | Sprinkler = true, WetGrass = true) and how MCMC can answer it.

(a) How many states does the Markov chain have?
(b) Calculate the transition matrix Q containing q(y → y′) for all y, y′.
(c) What does Q^2, the square of the transition matrix, represent?
(d) What about Q^n as n → ∞?
(e) Explain how to do probabilistic inference in Bayesian networks, assuming that Q^n is available. Is this a practical way to do inference?

(a) There are two uninstantiated Boolean variables (Cloudy and Rain) and therefore four possible states.

(b) First, we compute the sampling distribution for each variable, conditioned on its Markov blanket:

    P(C | r, s) = α P(C) P(s | C) P(r | C)
                = α ⟨0.5, 0.5⟩ ⟨0.1, 0.5⟩ ⟨0.8, 0.2⟩ = α ⟨0.04, 0.05⟩ = ⟨4/9, 5/9⟩
    P(C | ¬r, s) = α P(C) P(s | C) P(¬r | C)
                 = α ⟨0.5, 0.5⟩ ⟨0.1, 0.5⟩ ⟨0.2, 0.8⟩ = α ⟨0.01, 0.2⟩ = ⟨1/21, 20/21⟩
    P(R | c, s, w) = α P(R | c) P(w | s, R)
                   = α ⟨0.8, 0.2⟩
 ⟨0.99, 0.90⟩ = α ⟨0.792, 0.180⟩ = ⟨22/27, 5/27⟩
    P(R | ¬c, s, w) = α P(R | ¬c) P(w | s, R)
                    = α ⟨0.2, 0.8⟩ ⟨0.99, 0.90⟩ = α ⟨0.198, 0.720⟩ = ⟨11/51, 40/51⟩

Strictly speaking, the transition matrix is only well-defined for the variant of MCMC in which the variable to be sampled is chosen randomly. (In the variant where the variables are chosen in a fixed order, the transition probabilities depend on where we are in the ordering.) Now consider the transition matrix.

Entries on the diagonal correspond to self-loops. Such transitions can occur by sampling either variable. For example,
    q((c, r) → (c, r)) = 0.5 P(c | r, s) + 0.5 P(r | c, s, w) = 0.5(4/9) + 0.5(22/27) = 17/27
Entries where one variable is changed must sample that variable. For example,
    q((c, r) → (c, ¬r)) = 0.5 P(¬r | c, s, w) = 0.5(5/27) = 5/54
Entries where both variables change cannot occur. For example,
    q((c, r) → (¬c, ¬r)) = 0
This gives us the following transition matrix, where the transition is from the state given by the row label to the state given by the column label:

                 (c, r)    (c, ¬r)   (¬c, r)   (¬c, ¬r)
    (c, r)       17/27      5/54      5/18        0
    (c, ¬r)      11/27     22/189      0         10/21
    (¬c, r)       2/9        0       59/153      20/51
    (¬c, ¬r)       0        1/42     11/102     310/357

(c) Q^2 represents the probability of going from each state to each state in two steps.

(d) Q^n (as n → ∞) represents the long-run probability of being in each state, starting from each state; for ergodic Q these probabilities are independent of the starting state, so every row of Q^n is the same and represents the posterior distribution over states given the evidence.

(e) We can produce very large powers of Q with very few matrix multiplications: we can get Q^2 with one multiplication, Q^4 with two, and Q^(2^k) with k. Unfortunately, in a network with n Boolean variables, the matrix is of size 2^n × 2^n, so each multiplication takes O(2^(3n)) operations, which makes this approach impractical for large networks.
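The construction of Q and the limiting behavior in (d) can be verified mechanically. The following sketch (helper names like `weight` and `sample_prob` are our own) builds the 4×4 matrix from the sprinkler network's CPTs and squares it repeatedly:

```python
# CPTs of the sprinkler network (AIMA Fig. 14.12); evidence is s = true, w = true.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                      # P(s | C)
P_R = {True: 0.8, False: 0.2}                      # P(r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}  # P(w | S, R)

def weight(c, r):
    """Joint probability of state (Cloudy=c, Rain=r) together with the evidence s, w."""
    return ((P_C if c else 1 - P_C) * P_S[c]
            * (P_R[c] if r else 1 - P_R[c]) * P_W[(True, r)])

states = [(True, True), (True, False), (False, True), (False, False)]

def sample_prob(var, c, r):
    """P(var = true | its Markov blanket), for var in {'C', 'R'}."""
    t, f = (weight(True, r), weight(False, r)) if var == 'C' \
        else (weight(c, True), weight(c, False))
    return t / (t + f)

def q(y, y2):
    """Transition probability: choose C or R uniformly at random, then resample it."""
    (c, r), (c2, r2) = y, y2
    p = 0.0
    if r == r2:                      # could have resampled C
        pc = sample_prob('C', c, r)
        p += 0.5 * (pc if c2 else 1 - pc)
    if c == c2:                      # could have resampled R
        pr = sample_prob('R', c, r)
        p += 0.5 * (pr if r2 else 1 - pr)
    return p

Q = [[q(y, y2) for y2 in states] for y in states]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

Qn = Q
for _ in range(10):                  # Q^(2^10) by repeated squaring, as in (e)
    Qn = matmul(Qn, Qn)
```

After the squaring loop, the rows of `Qn` agree to many decimal places, and summing the Rain = true columns of any row gives P(Rain | s, w) ≈ 0.32, the posterior the chain is designed to converge to.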