Handling Uncertainty


1 Handling Uncertainty

2 Uncertain knowledge. Typical example: Diagnosis.

Name    Toothache  Cavity
Smith   true       true
Mike    true       true
Mary    false      true
Quincy  true       false

Can we certainly derive the diagnostic rule: if Toothache=true then Cavity=true? The problem is that this rule isn't always right. Not all patients with toothache have cavities; some of them have gum disease, an abscess, etc. We could try turning the rule into a causal rule: if Cavity=true then Toothache=true. But this rule isn't necessarily right either; not all cavities cause pain.

3 Belief and Probability. The connection between toothaches and cavities is not a logical consequence in either direction. However, we can attach a degree of belief to the rules. Our main tool for this is probability theory. E.g., we might not know for sure what afflicts a particular patient, but we believe that there is, say, an 80% chance (that is, probability 0.8) that the patient has a cavity if he has a toothache. We usually get this belief from statistical data.

4 Syntax. Basic element: random variable, corresponding to an attribute of the data. E.g., Cavity (do I have a cavity?) is one of <cavity, ¬cavity>; Weather is one of <sunny, rainy, cloudy, snow>. Both Cavity and Weather are discrete random variables. Domain values must be exhaustive and mutually exclusive. Elementary propositions are constructed by the assignment of a value to a random variable: e.g., Cavity = cavity, Weather = sunny.

5 Prior probability and distribution. The prior or unconditional probability associated with a proposition is the degree of belief accorded to it in the absence of any other information. E.g., P(Cavity = cavity) = 0.1 (or abbreviated P(cavity) = 0.1), P(Weather = sunny) = 0.7 (or abbreviated P(sunny) = 0.7). A probability distribution gives values for all possible assignments: P(Weather = sunny) = 0.7, P(Weather = rain) = 0.2, P(Weather = cloudy) = 0.08, P(Weather = snow) = 0.02.

6 Conditional probability. E.g., P(cavity | toothache) = 0.8, i.e., the probability of a cavity given that a toothache is all I know. It can be interpreted as the probability that the rule if Toothache=true then Cavity=true holds. Definition of conditional probability: P(a | b) = P(a ∧ b) / P(b), if P(b) > 0. The product rule gives an alternative formulation: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

7 Bayes' Rule. Product rule: P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a). Bayes' rule: P(a | b) = P(b | a) P(a) / P(b). Useful for assessing a diagnostic probability from a causal probability: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect). Bayes' rule is useful in practice because there are many cases where we have good probability estimates for three of these numbers and need to compute the fourth.

8 Applying Bayes' rule. For example, a doctor knows that meningitis causes the patient to have a stiff neck 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20. So, what do we have in terms of probabilities?

9 Bayes' rule (cont'd).
P(StiffNeck=true | Meningitis=true) = 0.5
P(Meningitis=true) = 1/50,000
P(StiffNeck=true) = 1/20
P(Meningitis=true | StiffNeck=true) = P(StiffNeck=true | Meningitis=true) P(Meningitis=true) / P(StiffNeck=true) = (0.5) * (1/50000) / (1/20) = 0.0002
That is, we expect only 1 in 5000 patients with a stiff neck to have meningitis. This is still a very small chance; the reason is the very small prior probability. Also, observe that P(Meningitis=false | StiffNeck=true) = P(StiffNeck=true | Meningitis=false) P(Meningitis=false) / P(StiffNeck=true). The factor 1/P(StiffNeck=true) is the same for both conditional probabilities; it is called the normalization constant (denoted as α).
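These numbers are easy to sanity-check in code. A minimal Python sketch (the variable names are mine; P(StiffNeck=true | Meningitis=false) is derived via the total-probability rule, which the slides don't show):

```python
# Bayes' rule: P(m | s) = P(s | m) * P(m) / P(s)
p_s_given_m = 0.5      # P(StiffNeck=true | Meningitis=true)
p_m = 1 / 50000        # prior P(Meningitis=true)
p_s = 1 / 20           # prior P(StiffNeck=true)

print(p_s_given_m * p_m / p_s)   # 0.0002, i.e. 1 in 5000

# alpha = 1 / P(s) is shared by both posteriors:
alpha = 1 / p_s
p_s_given_not_m = (p_s - p_s_given_m * p_m) / (1 - p_m)   # total probability
print(alpha * p_s_given_m * p_m)              # P(Meningitis=true  | s) = 0.0002
print(alpha * p_s_given_not_m * (1 - p_m))    # P(Meningitis=false | s) = 0.9998
```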

10 Bayes' rule -- more vars.
P(Cause | Effect1, Effect2) = P(Cause, Effect1, Effect2) / P(Effect1, Effect2)
= α P(Cause, Effect1, Effect2)
= α P(Effect1 | Effect2, Cause) P(Effect2, Cause)
= α P(Effect1 | Effect2, Cause) P(Effect2 | Cause) P(Cause)
Although Effect1 might not be independent of Effect2, it might be that they are independent given the cause. E.g., Effect1 is ability-in-reading and Effect2 is length-of-arms. There is indeed a dependence of ability-in-reading on length-of-arms: people with longer arms read better than those with short arms. However, given the cause Age, ability-in-reading is independent of length-of-arms.

11 Naive Bayes. Two assumptions: attributes (effects) are equally important and conditionally independent (given the class value).
P(Cause | Effect1, Effect2) = α P(Effect1 | Effect2, Cause) P(Effect2 | Cause) P(Cause)
= α P(Effect1 | Cause) P(Effect2 | Cause) P(Cause)
P(Cause | Effect1, ..., Effectn) = α P(Effect1 | Cause) ... P(Effectn | Cause) P(Cause)
This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known). Although the formula is based on assumptions that are almost never correct, this scheme works well in practice!

12 Weather Data. Here we don't really have effects, but rather evidence.

13 Naïve Bayes for classification. Classification learning: what's the probability of the class given an instance?
Instance (evidence E): E1 = e1, E2 = e2, ..., En = en
Class C = {c, ¬c}
Naïve Bayes assumption: the evidence can be split into independent parts (i.e., the attributes of the instance!)
P(c | E) = P(c | e1, e2, ..., en) = P(e1 | c) P(e2 | c) ... P(en | c) P(c) / P(e1, e2, ..., en)

14 The weather data example. For the new instance E = (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True):
P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E)
Don't worry about the 1/P(E); it's α, the normalization constant.

15 The weather data example.
P(play=no | E) = P(Outlook=Sunny | play=no) * P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E) = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0206 / P(E)

16 Normalization constant.
P(play=yes | E) + P(play=no | E) = 1
i.e., 0.0053 / P(E) + 0.0206 / P(E) = 1
i.e., P(E) = 0.0259
So, P(play=yes | E) = 0.0053 / 0.0259 = 20.5% and P(play=no | E) = 0.0206 / 0.0259 = 79.5%.
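Slides 14-16 condense into a few lines of Python. A sketch, assuming the standard 14-instance weather dataset that the counts above come from (the dataset listing and function name are mine):

```python
data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

def posterior(instance):
    scores = {}
    for c in ("yes", "no"):
        rows = [r for r in data if r[-1] == c]
        p = len(rows) / len(data)                             # prior P(c)
        for i, value in enumerate(instance):                  # one factor per attribute
            p *= sum(1 for r in rows if r[i] == value) / len(rows)
        scores[c] = p
    alpha = 1 / sum(scores.values())                          # normalization constant
    return {c: alpha * s for c, s in scores.items()}

print(posterior(("sunny", "cool", "high", True)))   # {'yes': ~0.205, 'no': ~0.795}
```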

17 The zero-frequency problem. What if an attribute value doesn't occur with every class value (e.g., Outlook=overcast for class play=no)? The probability P(Outlook=overcast | play=no) will be zero! Then P(play=no | E) will also be zero, no matter how likely the other values are! Solution: add 1 to the count for every attribute value-class combination (the Laplace estimator), and add k (the number of possible attribute values) to the denominator. For the example above, instead of
P(play=yes | E) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E)
it will be
P(play=yes | E) = ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (10/16) / P(E) = 0.0069 / P(E)
where 3 is the number of possible values for Outlook (and Temp), 2 is the number of possible values for Humidity and Windy, and the prior becomes (9+1)/(14+2) = 10/16.
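In code, the Laplace estimator is a one-line change to the frequency estimate. A sketch (the helper name is hypothetical):

```python
def laplace_prob(count, class_total, n_values):
    # P(attribute=value | class) with add-one (Laplace) smoothing:
    # add 1 to the count, add the number of possible values to the denominator.
    return (count + 1) / (class_total + n_values)

# The corrected factors for play=yes from the example above:
print(laplace_prob(2, 9, 3))    # Outlook=Sunny  -> 3/12  (Outlook has 3 values)
print(laplace_prob(3, 9, 3))    # Temp=Cool      -> 4/12
print(laplace_prob(3, 9, 2))    # Humidity=High  -> 4/11
print(laplace_prob(3, 9, 2))    # Windy=True     -> 4/11  (Windy has 2 values)
print(laplace_prob(9, 14, 2))   # prior P(yes)   -> 10/16
```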

18 Missing values. Training phase: the instance is not included in the frequency count for that attribute value-class combination. Classification phase: the attribute is omitted from the calculation. Example (Outlook missing, Laplace-smoothed counts):
P(play=yes | E) = P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (4/12) * (4/11) * (4/11) * (10/16) / P(E) = 0.0275 / P(E)
P(play=no | E) = P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E) = (2/8) * (5/7) * (4/7) * (6/16) / P(E) = 0.0383 / P(E)
After normalization: P(play=yes | E) = 42%, P(play=no | E) = 58%.
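Handling a missing attribute at classification time just means leaving its factor out of the product; a sketch with the slide's Laplace-smoothed numbers:

```python
# Outlook is unknown, so its factor is simply omitted from each product.
like_yes = (4/12) * (4/11) * (4/11) * (10/16)  # Temp=Cool, Humidity=High, Windy=True | yes
like_no  = (2/8) * (5/7) * (4/7) * (6/16)      # the same attributes | no

alpha = 1 / (like_yes + like_no)
print(alpha * like_yes, alpha * like_no)       # ~0.42 and ~0.58
```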

19 Dealing with numeric attributes. Usual assumption: attributes have a normal (Gaussian) probability distribution given the class. The probability density function for the normal distribution is:
f(x | class) = e^( -(x-µ)^2 / (2σ^2) ) / sqrt(2π σ^2)
We approximate µ by the sample mean: x̄ = (1/n) Σ x_i
We approximate σ^2 by the sample variance: σ^2 = (1/(n-1)) Σ (x_i - x̄)^2
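The density function translates directly into code; a sketch (function name is mine):

```python
import math

def gaussian(x, values):
    # f(x | class): mean and variance are estimated from the class's
    # training values, with n-1 in the variance as on the slide.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]   # temperatures of the yes days
print(gaussian(66, temps_yes))                     # ~0.034, as on the next slide
```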

20 Weather Data (numeric attributes)

outlook   temperature  humidity  windy  play
sunny     85           85        FALSE  no
sunny     80           90        TRUE   no
overcast  83           86        FALSE  yes
rainy     70           96        FALSE  yes
rainy     68           80        FALSE  yes
rainy     65           70        TRUE   no
overcast  64           65        TRUE   yes
sunny     72           95        FALSE  no
sunny     69           70        FALSE  yes
rainy     75           80        FALSE  yes
sunny     75           70        TRUE   yes
overcast  72           90        TRUE   yes
overcast  81           75        FALSE  yes
rainy     71           91        TRUE   no

We need to compute f(temperature=66 | yes):
f(temperature=66 | yes) = e^( -(66-m)^2 / (2*var) ) / sqrt(2*3.14*var)
m = (83+70+68+64+69+75+75+72+81) / 9 = 73
var = ( (83-73)^2 + (70-73)^2 + (68-73)^2 + (64-73)^2 + (69-73)^2 + (75-73)^2 + (75-73)^2 + (72-73)^2 + (81-73)^2 ) / (9-1) = 38
f(temperature=66 | yes) = e^( -(66-73)^2 / (2*38) ) / sqrt(2*3.14*38) = 0.034

21 Weather Data (same table as above). We compute similarly f(humidity=90 | yes):
f(humidity=90 | yes) = e^( -(90-m)^2 / (2*var) ) / sqrt(2*3.14*var)
m = (86+96+80+65+70+80+70+90+75) / 9 = 79
var = ( (86-79)^2 + (96-79)^2 + (80-79)^2 + (65-79)^2 + (70-79)^2 + (80-79)^2 + (70-79)^2 + (90-79)^2 + (75-79)^2 ) / (9-1) = 104
f(humidity=90 | yes) = e^( -(90-79)^2 / (2*104) ) / sqrt(2*3.14*104) = 0.022

22 A new day. Classifying a new day E = (Outlook=Sunny, Temp=66, Humidity=90, Windy=True):
P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=66 | play=yes) * P(Humidity=90 | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E) = (2/9) * 0.034 * 0.022 * (3/9) * (9/14) / P(E) = 0.000036 / P(E)
P(play=no | E) = P(Outlook=Sunny | play=no) * P(Temp=66 | play=no) * P(Humidity=90 | play=no) * P(Windy=True | play=no) * P(play=no) / P(E) = (3/5) * 0.0291 * 0.038 * (3/5) * (5/14) / P(E) = 0.000136 / P(E)
After normalization: P(play=yes | E) = 20.9%, P(play=no | E) = 79.1%.
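Putting the nominal counts and the two densities together reproduces this slide; a self-contained sketch (the value lists are read off the table above):

```python
import math

def gaussian(x, values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
temps_no  = [85, 80, 65, 72, 71]
hums_yes  = [86, 96, 80, 65, 70, 80, 70, 90, 75]
hums_no   = [85, 90, 70, 95, 91]

# E = (Outlook=Sunny, Temp=66, Humidity=90, Windy=True)
like_yes = (2/9) * gaussian(66, temps_yes) * gaussian(90, hums_yes) * (3/9) * (9/14)
like_no  = (3/5) * gaussian(66, temps_no) * gaussian(90, hums_no) * (3/5) * (5/14)

alpha = 1 / (like_yes + like_no)
print(alpha * like_yes, alpha * like_no)   # ~0.20 / ~0.80 (slides: 20.9% / 79.1%;
                                           # the gap is rounding of the densities)
```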

23 Tax Data - Naive Bayes

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Classify: (_, No, Married, 95K, ?) (Apply also the Laplace normalization.)

24 Tax Data - Naive Bayes (cont'd). Classify: (_, No, Married, 95K, ?), with Laplace normalization.
P(Yes | E) = ?
P(Yes) = (3+1)/(10+2) = 0.33
P(Refund=No | Yes) = (3+1)/(3+2) = 0.8
P(Status=Married | Yes) = (0+1)/(3+3) = 0.17
f(income | Yes) = e^( -(x-µ)^2 / (2σ^2) ) / sqrt(2π σ^2)
Approximate µ with (95+85+90)/3 = 90
Approximate σ^2 with ( (95-90)^2 + (85-90)^2 + (90-90)^2 ) / (3-1) = 25
f(income=95 | Yes) = e^( -(95-90)^2 / (2*25) ) / sqrt(2*3.14*25) = 0.048
P(Yes | E) = α * 0.8 * 0.17 * 0.048 * 0.33 = α * 0.00215

25 Tax Data (cont'd).
P(No | E) = ?
P(No) = (7+1)/(10+2) = 0.67
P(Refund=No | No) = (4+1)/(7+2) = 0.556
P(Status=Married | No) = (4+1)/(7+3) = 0.5
f(income | No) = e^( -(x-µ)^2 / (2σ^2) ) / sqrt(2π σ^2)
Approximate µ with (125+100+70+120+60+220+75)/7 = 110
Approximate σ^2 with ( (125-110)^2 + (100-110)^2 + (70-110)^2 + (120-110)^2 + (60-110)^2 + (220-110)^2 + (75-110)^2 ) / (7-1) = 2975
f(income=95 | No) = e^( -(95-110)^2 / (2*2975) ) / sqrt(2*3.14*2975) = 0.00704
P(No | E) = α * 0.556 * 0.5 * 0.00704 * 0.67 = α * 0.00131

26 Tax Data (cont'd).
P(Yes | E) = α * 0.00215
P(No | E) = α * 0.00131
α = 1/(0.00215 + 0.00131) = 288.6
P(Yes | E) = 288.6 * 0.00215 = 0.62
P(No | E) = 288.6 * 0.00131 = 0.38
We predict Yes.
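The same recipe, Laplace counts for the nominal attributes plus a Gaussian for income, reproduces the prediction end to end; a sketch with the slide's records hard-coded:

```python
import math

records = [  # (refund, marital_status, income_in_K, evade)
    ("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes"),
]

def gaussian(x, values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(c, refund, status, income):
    rows = [r for r in records if r[3] == c]
    prior = (len(rows) + 1) / (len(records) + 2)                          # 2 classes
    p_refund = (sum(r[0] == refund for r in rows) + 1) / (len(rows) + 2)  # 2 values
    p_status = (sum(r[1] == status for r in rows) + 1) / (len(rows) + 3)  # 3 values
    return p_refund * p_status * gaussian(income, [r[2] for r in rows]) * prior

s_yes = score("Yes", "No", "Married", 95)
s_no = score("No", "No", "Married", 95)
alpha = 1 / (s_yes + s_no)
print(alpha * s_yes, alpha * s_no)   # ~0.62 vs ~0.38 -> predict Yes
```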

27 Text Categorization. Text categorization is the task of assigning a given document to one of a fixed set of categories, on the basis of the text it contains. Naïve Bayes models are often used for this task. In these models, the query variable is the document category, and the effect variables are the presence or absence of each word in the language. How can such a model be constructed, given as training data a set of documents that have been assigned to categories?

28 Text Categorization. The model consists of the prior probability P(Category) and the conditional probabilities P(Word_i | Category) for every word in the language. For each category c, P(Category = c) is estimated as the fraction of all the training documents that are of that category. Similarly, P(Word_i = true | Category = c) is estimated as the fraction of documents of category c that contain word i. Also, P(Word_i = true | Category = ¬c) is estimated as the fraction of documents not of category c that contain word i.

29 Text Categorization (cont'd). Now we can use naïve Bayes for classifying a new document. Assume Word_1, ..., Word_n are the words occurring in the new document.
P(Category = c | Word_1 = true, ..., Word_n = true) = α * P(Category = c) * Π_{i=1..n} P(Word_i = true | Category = c)
P(Category = ¬c | Word_1 = true, ..., Word_n = true) = α * P(Category = ¬c) * Π_{i=1..n} P(Word_i = true | Category = ¬c)
α is the normalization constant. Observe the similarity with the missing-values case: the new document doesn't contain every word for which we computed probabilities, and the absent words are simply omitted from the product.
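A minimal bag-of-words version of this classifier, sketched under the slide's boolean word model (the toy corpus is made up, and add-one smoothing is my addition to avoid zero counts):

```python
def train(docs):
    # docs: list of (set_of_words, category) pairs
    cats = {c for _, c in docs}
    n_c = {c: sum(cat == c for _, cat in docs) for c in cats}
    prior = {c: n_c[c] / len(docs) for c in cats}          # P(Category = c)
    vocab = set().union(*(words for words, _ in docs))
    # P(Word = true | Category = c), with add-one smoothing
    cond = {c: {w: (sum(w in words for words, cat in docs if cat == c) + 1)
                   / (n_c[c] + 2)
                for w in vocab}
            for c in cats}
    return prior, cond

def classify(words, prior, cond):
    scores = {}
    for c in prior:
        p = prior[c]
        for w in words:
            if w in cond[c]:      # words never seen in training are skipped,
                p *= cond[c][w]   # just like missing values
        scores[c] = p
    alpha = 1 / sum(scores.values())
    return {c: alpha * s for c, s in scores.items()}

docs = [({"cheap", "pills", "buy"}, "spam"), ({"meeting", "notes"}, "ham"),
        ({"buy", "now"}, "spam"), ({"project", "meeting"}, "ham")]
prior, cond = train(docs)
print(classify({"buy", "pills"}, prior, cond))   # spam gets the higher posterior
```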
