Bayesian Decision Theory


1 Bayesian Decision Theory
Berlin Chen, 2005
References:
1. E. Alpaydin, Introduction to Machine Learning, Chapter 3
2. Tom M. Mitchell, Machine Learning, Chapter 6

2 Review: Basic Formulas for Probabilities
Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Sum rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if events A_1, ..., A_n are mutually exclusive and exhaustive (A_i ∧ A_j = ∅ for i ≠ j, and ∑_{i=1}^n P(A_i) = 1), then
P(B) = ∑_{i=1}^n P(B|A_i) P(A_i)

3 Review: Basic Formulas for Probabilities (cont.)
Chain rule: probability of a conjunction of many events A_1, A_2, ..., A_n:
P(A_1 ∧ A_2 ∧ ... ∧ A_n) = P(A_1) P(A_2|A_1) P(A_3|A_1, A_2) ... P(A_n|A_1, ..., A_{n−1})

4 Classification Illustrative Case 1: Credit Scoring
Input x^t = [x_1, x_2]^T, where x_1 is income and x_2 is savings; classes C_1 (high-risk) and C_2 (low-risk).
Given a new application x: choose C_1 if P(C_1|x) > P(C_2|x), and C_2 otherwise; or, equivalently, choose C_1 if P(C_1|x) > 1/2, and C_2 otherwise.
Note that P(C_1|x) + P(C_2|x) = 1.

5 Classification (cont.)
Bayes Classifier: we can use probability theory to make inferences from data:
P(C|x) = P(x|C) P(C) / P(x)
x: observed data (variable); C: class hypothesis
P(C): prior probability of C; P(x): prior probability of x; P(x|C): probability of x given C

6 Classification (cont.)
Calculate the posterior probability of the concept (C_1 or C_2) after having the observation x. Combine the prior and what the data tells us using Bayes' rule:
posterior = likelihood × prior / evidence:  P(C_1|x) = P(x|C_1) P(C_1) / P(x)
C_1 and C_2 are mutually exclusive and exhaustive classes (concepts), so
P(x) = P(x|C_1) P(C_1) + P(x|C_2) P(C_2)
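A minimal Python sketch of this two-class rule. The priors and likelihood functions below are invented for illustration; none of the numbers come from the slides.

```python
# Two-class Bayes classifier: choose the class with the larger posterior.

def bayes_posteriors(prior, likelihood, x):
    """Return [P(C1|x), P(C2|x)] for two exclusive, exhaustive classes."""
    evidence = sum(prior[c] * likelihood[c](x) for c in (0, 1))
    return [prior[c] * likelihood[c](x) / evidence for c in (0, 1)]

prior = [0.7, 0.3]                                  # P(C1), P(C2): assumed
likelihood = [lambda x: 0.2 if x > 5 else 0.8,      # P(x|C1): toy model
              lambda x: 0.9 if x > 5 else 0.1]      # P(x|C2): toy model

post = bayes_posteriors(prior, likelihood, x=6.0)
print(post, "-> choose C1" if post[0] > post[1] else "-> choose C2")
```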

7 Classification (cont.)
Bayes Classifier, extended to K mutually exclusive and exhaustive classes:
P(C_i|x) = P(x|C_i) P(C_i) / P(x) = P(x|C_i) P(C_i) / ∑_{k=1}^K P(x|C_k) P(C_k)
with P(C_i) ≥ 0 and ∑_{i=1}^K P(C_i) = 1

8 Classification (cont.)
Maximum likelihood classifier: score class C_i by the likelihood of the data x under that class:
L(C_i|x) = P(x|C_i)
It gives the same classification result as the Bayes classifier if the prior probabilities P(C_i) are assumed to be equal to each other:
max_i L(C_i|x) = max_i P(x|C_i)  ⇔  max_i P(C_i|x) = max_i P(x|C_i) P(C_i) / P(x)

9 Classification: Illustrative Case 2
Does a patient have cancer or not? A patient takes a lab test whose result x is "+" or "−" (classes: C_1 = cancer, C_2 = no cancer).
1. Suppose the result comes back positive (x = "+").
2. We also know that the test returns a correct positive result (+) in only 98% of the cases in which the disease is actually present (P(+|C_1) = 0.98), and a correct negative result (−) in only 97% of the cases in which the disease is not present (P(−|C_2) = 0.97).
Furthermore, 0.008 of the entire population have this cancer (P(C_1) = 0.008).

10 Classification: Illustrative Case 2 (cont.)
Bayes classifier:
P(C_1|+) = P(+|C_1) P(C_1) / P(+) = 0.98 × 0.008 / (0.98 × 0.008 + 0.03 × 0.992) ≈ 0.21
P(C_2|+) = P(+|C_2) P(C_2) / P(+) = 0.03 × 0.992 / (0.98 × 0.008 + 0.03 × 0.992) ≈ 0.79
So the Bayes classifier chooses C_2 (no cancer).
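The arithmetic on this slide, checked in Python with the numbers given above (P(+|C_1) = 0.98, P(−|C_2) = 0.97, P(C_1) = 0.008):

```python
# Cancer-test posteriors via Bayes rule.
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer = 0.98        # sensitivity, P(+|C1)
p_pos_given_not = 1 - 0.97       # false-positive rate, P(+|C2)

joint_cancer = p_pos_given_cancer * p_cancer   # 0.00784
joint_not = p_pos_given_not * p_not            # 0.02976
evidence = joint_cancer + joint_not            # P(+)

print(joint_cancer / evidence)   # P(C1|+) ~ 0.21
print(joint_not / evidence)      # P(C2|+) ~ 0.79 -> Bayes chooses C2
```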

11 Classification: Illustrative Case 2 (cont.)
Maximum likelihood classifier:
L(C_1|+) = P(+|C_1) = 0.98  >  L(C_2|+) = P(+|C_2) = 0.03
so the maximum likelihood classifier chooses C_1 (cancer); ignoring the prior reverses the decision.

12 Losses and Risks
Decisions are not always perfect. E.g., in a loan application, the loss for a high-risk applicant erroneously accepted (false acceptance) may be different from that for an erroneously rejected low-risk applicant (false rejection).
This is much more critical in other cases, such as medical diagnosis or earthquake prediction.

13 Expected Risk
Taking action α_i hypothesizes that the example x belongs to class C_i; suppose the example actually belongs to some class C_k.
Def: the expected risk for taking action α_i is
R(α_i|x) = ∑_{k=1}^K λ_{ik} P(C_k|x)
A zero-one loss function: λ_{ik} = 0 if i = k, and 1 if i ≠ k. All correct decisions have no loss, and all errors are equally costly.
Choose the action with minimum risk: α = arg min_i R(α_i|x)

14 Expected Risk (cont.)
Under the zero-one loss, choosing the action with minimum risk is equivalent to choosing the class with the highest posterior probability:
R(α_i|x) = ∑_{k≠i} P(C_k|x) = 1 − P(C_i|x)
Choose the action α_i with i = arg max_i P(C_i|x)
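A short sketch of risk-minimizing action selection for a general loss matrix. The posteriors and the asymmetric losses below are assumed numbers, used only to show how a non-zero-one loss can overturn the max-posterior decision:

```python
# lam[i][k] = loss of taking action alpha_i when the true class is C_k.

def min_risk_action(posteriors, lam):
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in lam]
    return min(range(len(risks)), key=risks.__getitem__), risks

posteriors = [0.2, 0.8]              # P(C1|x), P(C2|x): assumed
zero_one = [[0, 1], [1, 0]]          # min risk == max posterior
asymmetric = [[0, 1], [20, 0]]       # declaring C2 when truth is C1 costs 20

print(min_risk_action(posteriors, zero_one))    # action 1 (class C2)
print(min_risk_action(posteriors, asymmetric))  # action 0 (class C1), despite the posterior
```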

15 Expected Risk: Reject Action Involved
Manual decisions? Wrong decisions (misclassifications) may have a very high cost; resort to a manual decision when the automatic system has low certainty of its decision.
Define an additional action of reject (or doubt), α_{K+1}, with loss
λ_{ik} = 0 if i = k;  λ if i = K+1;  1 otherwise,  where 0 < λ < 1, i = 1..K+1, k = 1..K
λ is the loss incurred for choosing the (K+1)st action of reject.

16 Expected Risk: Reject Action Involved (cont.)
The risk for choosing the reject ((K+1)st) action:
R(α_{K+1}|x) = ∑_{k=1}^K λ P(C_k|x) = λ
Recall that the risk for choosing action α_i, i = 1..K, is
R(α_i|x) = ∑_{k≠i} P(C_k|x) = 1 − P(C_i|x)

17 Expected Risk: Reject Action Involved (cont.)
The optimal decision rule is to:
Choose C_i if R(α_i|x) < R(α_k|x) for all k ≠ i, and R(α_i|x) < R(α_{K+1}|x)
Reject if R(α_{K+1}|x) < R(α_i|x) for all i = 1..K
That is: choose C_i if P(C_i|x) > P(C_k|x) for all k ≠ i and P(C_i|x) > 1 − λ; and Reject otherwise.
When λ = 0, we always reject; when λ ≥ 1, we always accept the chosen action.
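The reject rule in code, as a small sketch assuming the posteriors are already available:

```python
def classify_with_reject(posteriors, lam):
    """Return the best class index, or 'reject' if its posterior <= 1 - lambda."""
    assert 0 < lam < 1, "reject is only meaningful for 0 < lambda < 1"
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best if posteriors[best] > 1 - lam else "reject"

print(classify_with_reject([0.55, 0.45], lam=0.3))  # 0.55 <= 0.7 -> 'reject'
print(classify_with_reject([0.90, 0.10], lam=0.3))  # 0.90 > 0.7 -> class 0
```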

18 Discriminant Functions
Classification can be thought of as a set of discriminant functions g_i(x), i = 1..K, one for each class C_i, such that:
Choose C_i if g_i(x) = max_k g_k(x)
g_i(x) can be expressed by using the Bayes classifier (with minimum risk and no additional action of reject):
g_i(x) = −R(α_i|x)
If the zero-one loss function is imposed, g_i(x) can also be expressed by g_i(x) = P(C_i|x).
With the same ranking result, we can use g_i(x) = P(x|C_i) P(C_i).

19 Discriminant Functions (cont.)
The instance space thus can be divided into K decision regions R_1, ..., R_K, where
R_i = { x | g_i(x) = max_k g_k(x) }

20 Discriminant Functions (cont.)
For two-class problems we can merely define a single discriminant function:
g(x) = g_1(x) − g_2(x)
Choose C_1 if g(x) > 0, and C_2 otherwise.

21 Choosing Hypotheses*: MAP Criterion
In machine learning we are interested in finding the best (most probable) hypothesis (classifier) h_c from some hypothesis space H, given the observed training data set X = {x^t, r^t}, t = 1, 2, ..., n:
h_MAP = arg max_{h_c ∈ H} P(h_c|X) = arg max_{h_c ∈ H} P(X|h_c) P(h_c) / P(X) = arg max_{h_c ∈ H} P(X|h_c) P(h_c)
h_MAP is a Maximum a Posteriori (MAP) hypothesis.

22 Choosing Hypotheses*: ML Criterion
If we further assume that every hypothesis is equally probable a priori, e.g., P(h_i) = P(h_j) for all h_i, h_j ∈ H, the above equation can be simplified as:
h_ML = arg max_{h_c ∈ H} P(X|h_c)
h_ML is a Maximum Likelihood (ML) hypothesis; P(X|h_c) is often called the likelihood of the data set X given h_c.

23 Naïve Bayes Classifier
A simplified approach to the Bayes classifier: the attributes x = (x_1, x_2, ..., x_d) of an instance/example are assumed to be independent conditioned on a given class hypothesis.
Naïve Bayes assumption:
P(x|C_i) = P(x_1, x_2, ..., x_d|C_i) = ∏_{j=1}^d P(x_j|C_i)
Naïve Bayes classifier:
c_MAP = arg max_i P(C_i|x) = arg max_i P(x|C_i) P(C_i) / P(x) = arg max_i P(x_1, ..., x_d|C_i) P(C_i) = arg max_i P(C_i) ∏_{j=1}^d P(x_j|C_i)

24 Naïve Bayes Classifier (cont.)
Illustrative case: given a data set of 3-dimensional Boolean examples x = (x_A, x_B, x_C), train a naïve Bayes classifier to predict the classification D.
[Training table: six Boolean examples over attributes A, B, C with label D; individual T/F entries not recoverable from this transcription.]
The maximum likelihood estimates from the table:
P(D=T) = 1/2, P(D=F) = 1/2
P(A=T|D=T) = 1/3, P(A=F|D=T) = 2/3;  P(A=T|D=F) = 1/3, P(A=F|D=F) = 2/3
P(B=T|D=T) = 1/3, P(B=F|D=T) = 2/3;  P(B=T|D=F) = 1/3, P(B=F|D=F) = 2/3
P(C=T|D=T) = 1/3, P(C=F|D=T) = 2/3;  P(C=T|D=F) = 2/3, P(C=F|D=F) = 1/3
What is the predicted probability P(D=T | A=T, B=T, C=F)?
What is the predicted probability P(D=T | B=T)?

25 Naïve Bayes Classifier (cont.)
Illustrative case (cont.)
(1) P(D=T | A=T, B=T, C=F)
 = P(A=T|D=T) P(B=T|D=T) P(C=F|D=T) P(D=T) / ∑_{d∈{T,F}} P(A=T|D=d) P(B=T|D=d) P(C=F|D=d) P(D=d)
 = (1/3)(1/3)(2/3)(1/2) / [(1/3)(1/3)(2/3)(1/2) + (1/3)(1/3)(1/3)(1/2)] = 2/3
(2) P(D=T | B=T)
 = P(B=T|D=T) P(D=T) / [P(B=T|D=T) P(D=T) + P(B=T|D=F) P(D=F)]
 = (1/3)(1/2) / [(1/3)(1/2) + (1/3)(1/2)] = 1/2
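The same two queries, checked with a few lines of Python built directly on the estimates above:

```python
p_d = {True: 0.5, False: 0.5}                       # P(D=T), P(D=F)
p_true = {True:  {'A': 1/3, 'B': 1/3, 'C': 1/3},    # P(attr=T | D=T)
          False: {'A': 1/3, 'B': 1/3, 'C': 2/3}}    # P(attr=T | D=F)

def nb_posterior(observed):
    """P(D=T | observed), observed like {'A': True, 'B': True, 'C': False}."""
    score = {}
    for d in (True, False):
        s = p_d[d]
        for attr, val in observed.items():
            s *= p_true[d][attr] if val else 1 - p_true[d][attr]
        score[d] = s
    return score[True] / (score[True] + score[False])

print(nb_posterior({'A': True, 'B': True, 'C': False}))  # 2/3
print(nb_posterior({'B': True}))                         # 1/2
```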

26 How to Train a Naïve Bayes Classifier
Naive_Bayes_Learn(examples):
  For each target value v_j:
    P̂(v_j) ← maximum likelihood (ML) estimate of P(v_j)
    For each value a of each attribute x:
      P̂(a|v_j) ← maximum likelihood (ML) estimate of P(a|v_j)
Classify_New_Instance(x):
  v_NB = arg max_{v_j ∈ V} P̂(v_j) ∏_{a ∈ x} P̂(a|v_j)
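A runnable version of this pseudocode for discrete attributes, as a sketch using plain frequency (ML) estimates:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, target_value) pairs."""
    prior = Counter(v for _, v in examples)
    cond = defaultdict(Counter)                    # (v, attr) -> value counts
    for attrs, v in examples:
        for a, val in attrs.items():
            cond[(v, a)][val] += 1
    p_v = {v: c / len(examples) for v, c in prior.items()}
    p_a = {k: {val: c / sum(cnt.values()) for val, c in cnt.items()}
           for k, cnt in cond.items()}
    return p_v, p_a

def classify(x, p_v, p_a):
    def score(v):
        s = p_v[v]
        for a, val in x.items():
            s *= p_a.get((v, a), {}).get(val, 0.0)  # unseen value -> 0
        return s
    return max(p_v, key=score)

examples = [({'A': True, 'B': False}, 'yes'), ({'A': False, 'B': False}, 'no')]
p_v, p_a = naive_bayes_learn(examples)
print(classify({'A': True, 'B': False}, p_v, p_a))  # 'yes'
```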

27 Naïve Bayes: Example 2
Consider PlayTennis again, and the new instance
<Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
We want to compute:
v_NB = arg max_{v_j ∈ {yes, no}} P(v_j) P(Outlook=sunny|v_j) P(Temperature=cool|v_j) P(Humidity=high|v_j) P(Wind=strong|v_j)
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.0206
⇒ v_NB = no

28 Dealing with Data Sparseness
What if none of the training instances with target value v_j have attribute value a_i? Then
P̂(a_i|v_j) = 0, and so P̂(v_j) ∏_i P̂(a_i|v_j) = 0
The typical solution is a Bayesian (smoothed) estimate for P̂(a_i|v_j):
P̂(a_i|v_j) = (n_c + m p) / (n + m)
where n is the number of training examples for which v = v_j, n_c is the number of training examples for which v = v_j and a = a_i, p is a prior estimate for P̂(a_i|v_j), and m is the weight given to the prior (i.e., the number of virtual examples).
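The m-estimate as a one-liner, with a worked value (n = 14, p = 0.5, m = 2 are assumed numbers, not from the slide):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed estimate (n_c + m*p) / (n + m) from the slide."""
    return (n_c + m * p) / (n + m)

# An attribute value never seen with this class (n_c = 0) no longer yields 0:
print(m_estimate(n_c=0, n=14, p=0.5, m=2))   # 1/16 = 0.0625
```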

29 Example: Learning to Classify Text
For instance: learn which news articles are of interest, or learn to classify web pages by topic. Naïve Bayes is among the most effective algorithms for this task.
What attributes shall we use to represent text documents? The word occurring in each document position.

30 Example: Learning to Classify Text (cont.)
Target concept: Interesting? : Document → {+, −}
1. Represent each document by a vector of words: one attribute per word position in the document.
2. Learning: use training examples to estimate P(+), P(−), P(doc|+), P(doc|−).
Naïve Bayes conditional independence assumption:
P(doc|v_j) = ∏_{i=1}^{length(doc)} P(a_i = w_k | v_j)
where P(a_i = w_k|v_j) is the probability that the word in position i is w_k, given v_j.
One more assumption (time/position invariance): P(a_i = w_k|v_j) = P(a_m = w_k|v_j), ∀ i, m

31 Example: Learning to Classify Text (cont.)
Learn_Naive_Bayes_Text(Examples, V):
1. Collect all words and other tokens that occur in Examples:
   Vocabulary ← all distinct words and other tokens in Examples
2. Calculate the required P(v_j) and P(w_k|v_j) probability terms. For each target value v_j:
   docs_j ← subset of Examples for which the target value is v_j
   P(v_j) ← |docs_j| / |Examples|
   Text_j ← a single document created by concatenating all members of docs_j
   n ← total number of words in Text_j (counting duplicate words multiple times)
   For each word w_k in Vocabulary:
     n_k ← number of times word w_k occurs in Text_j
     P(w_k|v_j) ← (n_k + 1) / (n + |Vocabulary|)   (a smoothed unigram estimate)

32 Example: Learning to Classify Text (cont.)
Classify_Naive_Bayes_Text(Doc):
  positions ← all word positions in Doc that contain tokens found in Vocabulary
  Return v_NB, where
  v_NB = arg max_{v_j ∈ V} P(v_j) ∏_{i ∈ positions} P(a_i|v_j)
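A compact sketch of Learn_Naive_Bayes_Text and Classify_Naive_Bayes_Text. The two toy documents are invented, and log-probabilities are used to avoid underflow on long documents:

```python
import math
from collections import Counter

def learn(examples):
    """examples: list of (token_list, label)."""
    vocab = {w for doc, _ in examples for w in doc}
    prior, word_prob = {}, {}
    for v in {lab for _, lab in examples}:
        docs_v = [doc for doc, lab in examples if lab == v]
        prior[v] = len(docs_v) / len(examples)
        text = [w for doc in docs_v for w in doc]
        counts, n = Counter(text), len(text)
        word_prob[v] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, prior, word_prob

def classify(doc, vocab, prior, word_prob):
    def log_score(v):
        return math.log(prior[v]) + sum(
            math.log(word_prob[v][w]) for w in doc if w in vocab)
    return max(prior, key=log_score)

examples = [("our offer wins money".split(), "-"),
            ("machine learning lecture notes".split(), "+")]
model = learn(examples)
print(classify("free money offer".split(), *model))   # '-'
```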

33 Bayesian Networks
Premise: the naïve Bayes assumption of conditional independence is too restrictive, but inference is intractable without some such assumptions.
Bayesian networks describe conditional independence among subsets of variables, allowing prior knowledge about (in)dependences among variables to be combined with observed training data.
Bayesian networks are also called Bayesian belief networks, Bayes nets, belief networks, probabilistic networks, graphical models, etc.

34 Bayesian Networks (cont.)
A simple graphical notation for conditional independence assertions, and hence for compact specification of full joint distributions.
Syntax:
A set of nodes, one per variable (discrete or continuous); discrete variables can be either binary or not.
A directed acyclic graph (a link/arrow means "directly influences").
A conditional distribution for each node given its parents: P(X_i | Parents(X_i)).
In the simplest case, the conditional distribution is represented as a Conditional Probability Table (CPT) giving the distribution over X_i for each combination of parent values.

35 Bayesian Networks (cont.)
E.g., nodes of discrete binary variables, each with a Conditional Probability Table (CPT). [Figure: example directed acyclic graph with a CPT over binary parent values.]
Each node is asserted to be conditionally independent of its nondescendants given its immediate predecessors.

36 Example 1: Dentist Network
Topology of the network encodes conditional independence assertions:
Weather is independent of the other variables.
Toothache and Catch are conditionally independent given Cavity.
Cavity is a direct cause of Toothache and Catch.

37 Conditional (In)dependence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
∀x, y, z: P(X=x | Y=y, Z=z) = P(X=x | Z=z)
More compactly, we can write P(X|Y, Z) = P(X|Z).
Conditional independence allows breaking down inference into calculations over small groups of variables.

38 Conditional (In)dependence (cont.)
Example: Thunder is conditionally independent of Rain given Lightning:
P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Recall that naïve Bayes uses conditional independence to justify:
P(X, Y|Z) = P(X|Y, Z) P(Y|Z) = P(X|Z) P(Y|Z)
i.e., X and Y are mutually independent given Z.

39 Conditional (In)dependence (cont.)
A Bayesian network can also be thought of as a causal graph that illustrates causalities between variables, e.g., Rain (R) causing Wet grass (W). We can make a diagnostic inference from the effect to the cause: P(R|W)?
P(R|W) = P(W|R) P(R) / P(W) = P(W|R) P(R) / [P(W|R) P(R) + P(W|¬R) P(¬R)]
The posterior exceeds the prior: P(R|W) > P(R) = 0.4.
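Numerically, with P(R) = 0.4 as on the slide and assumed values for the two conditionals (0.9 and 0.2 are illustrative, not read off the slide):

```python
p_r = 0.4
p_w = {True: 0.9, False: 0.2}      # P(W|R), P(W|~R): assumed values

evidence = p_w[True] * p_r + p_w[False] * (1 - p_r)   # P(W) = 0.48
print(p_w[True] * p_r / evidence)  # P(R|W) = 0.75 > P(R) = 0.4
```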

40 Conditional (In)dependence (cont.)
Suppose that a sprinkler (S) is included as another cause of wet grass.
Predictive inference:
P(W|S) = P(W|R, S) P(R|S) + P(W|¬R, S) P(¬R|S) = P(W|R, S) P(R) + P(W|¬R, S) P(¬R)
P(W) = P(W|R, S) P(R) P(S) + P(W|R, ¬S) P(R) P(¬S) + P(W|¬R, S) P(¬R) P(S) + P(W|¬R, ¬S) P(¬R) P(¬S)
Diagnostic inference (I):
P(S|W) = P(W|S) P(S) / P(W)   ( > P(S) = 0.2 )

41 Conditional (In)dependence (cont.)
Diagnostic inference (II), conditioning on both R and W:
P(S|R, W) = P(W|R, S) P(S|R) / P(W|R) = P(W|R, S) P(S) / P(W|R)   ( > P(S) = 0.2 )
where P(W|R) = P(W|R, S) P(S|R) + P(W|R, ¬S) P(¬S|R) = P(W|R, S) P(S) + P(W|R, ¬S) P(¬S)

42 Example 2: Burglary Network
You're at work; neighbor John calls to say your alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
P(Burglary | JohnCalls = T, MaryCalls = F)?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
Network topology reflects causal knowledge:
A burglar can set the alarm off.
An earthquake can set the alarm off.
The alarm can cause Mary to call.
The alarm can cause John to call.
But John sometimes confuses the telephone ringing with the alarm, and Mary likes rather loud music and sometimes misses the alarm.

43 Example 2: Burglary Network (cont.)
Conditional Probability Table (CPT): each row shows the probability given one state of the parents. For Boolean variables, just the probability for true is shown.

44 Compactness
Chain rule:
P(B, E, A, J, M) = P(B) P(E|B) P(A|B, E) P(J|B, E, A) P(M|B, E, A, J) = P(B) P(E) P(A|B, E) P(J|A) P(M|A)
A CPT for Boolean X_i with k Boolean (true/false) parents has 2^k rows for the combinations of parent values; each row requires one number p for X_i = true (the number for X_i = false is just 1 − p).
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers, i.e., it grows linearly with n, vs. O(2^n) for the full joint distribution.
For the burglary net: 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31).

45 Global Semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:
P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | Parents(X_i))
The Bayesian network is thus semantically:
A representation of the joint distribution.
An encoding of a collection of conditional independence statements.
E.g., P(J ∧ M ∧ A ∧ ¬B ∧ ¬E) = P(J|A) P(M|A) P(A|¬B, ¬E) P(¬B) P(¬E)
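Evaluating the example joint numerically, with the usual textbook CPT values for the burglary network (assumed here, since the table itself is not reproduced in this transcription):

```python
p_b, p_e = 0.001, 0.002            # P(B), P(E): assumed textbook values
p_a_nb_ne = 0.001                  # P(A | ~B, ~E)
p_j_a, p_m_a = 0.90, 0.70          # P(J|A), P(M|A)

p = p_j_a * p_m_a * p_a_nb_ne * (1 - p_b) * (1 - p_e)
print(p)   # P(J ^ M ^ A ^ ~B ^ ~E) ~ 0.000628
```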

46 Local Semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents.
Local semantics ⇔ global semantics.

47 Markov Blanket
Each node is conditionally independent of all others given its Markov blanket: its parents + children + children's parents.

48 Constructing Bayesian Networks
We need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics.
1. Choose an ordering of variables X_1, ..., X_n.
2. For i = 1 to n: add X_i to the network and select parents from X_1, ..., X_{i−1} such that
   P(X_i | X_1, ..., X_{i−1}) = P(X_i | Parents(X_i))
This choice of parents guarantees the global semantics:
P(X_1, ..., X_n) = ∏_{i=1}^n P(X_i | X_1, ..., X_{i−1})   (chain rule)
                 = ∏_{i=1}^n P(X_i | Parents(X_i))   (by construction)

49 Example for Constructing a Bayesian Network
Suppose we choose the ordering M, J, A, B, E.
P(J|M) = P(J)?

50 Example (cont.)
Suppose we choose the ordering M, J, A, B, E.
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? P(A|J, M) = P(A)?

51 Example (cont.)
Suppose we choose the ordering M, J, A, B, E.
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? No. P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? P(B|A, J, M) = P(B)?

52 Example (cont.)
Suppose we choose the ordering M, J, A, B, E.
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? No. P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes. P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? P(E|B, A, J, M) = P(E|A, B)?

53 Example (cont.)
Suppose we choose the ordering M, J, A, B, E.
P(J|M) = P(J)? No
P(A|J, M) = P(A|J)? No. P(A|J, M) = P(A)? No
P(B|A, J, M) = P(B|A)? Yes. P(B|A, J, M) = P(B)? No
P(E|B, A, J, M) = P(E|A)? No. P(E|B, A, J, M) = P(E|A, B)? Yes

54 Example (cont.)
Summary:
Deciding conditional independence is hard in noncausal directions. (Causal models and conditional independence seem hardwired for humans!)
Assessing conditional probabilities is hard in noncausal directions.
The network is less compact: 1 + 2 + 4 + 2 + 4 = 13 numbers needed.

55 Inference Tasks
Simple queries: compute the posterior marginal P(X_i | E = e), e.g., P(Burglary | JohnCalls = true, MaryCalls = true).
Conjunctive queries: P(X_i, X_j | E = e) = P(X_i | E = e) P(X_j | X_i, E = e).
Optimal decisions: probabilistic inference of P(Outcome | Action, Evidence).

56 Inference by Enumeration
A slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation.
Simple query on the burglary network:
P(B|j, m) = P(B, j, m) / P(j, m) = α P(B, j, m) = α ∑_e ∑_a P(B, e, a, j, m)
Rewrite full joint entries using products of CPT entries:
P(B|j, m) = α ∑_e ∑_a P(B) P(e) P(a|B, e) P(j|a) P(m|a) = α P(B) ∑_e P(e) ∑_a P(a|B, e) P(j|a) P(m|a)
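The last line of the slide, run as code over the burglary network; the CPT values are the usual textbook numbers (assumed, since the table is not reproduced here):

```python
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a=true | b, e)
P_J = {True: 0.90, False: 0.05}                      # P(j=true | a)
P_M = {True: 0.70, False: 0.01}                      # P(m=true | a)

def score(b):
    """P(b, j=true, m=true) = sum_e sum_a P(b) P(e) P(a|b,e) P(j|a) P(m|a)."""
    p_b = P_B if b else 1 - P_B
    total = 0.0
    for e in (True, False):
        p_e = P_E if e else 1 - P_E
        for a in (True, False):
            p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
            total += p_b * p_e * p_a * P_J[a] * P_M[a]
    return total

s_t, s_f = score(True), score(False)
print(s_t / (s_t + s_f))   # P(B=true | j, m) ~ 0.284
```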

57 Evaluation Tree
Enumeration is inefficient: it repeats computation. E.g., it computes P(j|a) P(m|a) for each value of e.

58 HW-4: Bayesian Networks
Add a new binary variable F to the wet-grass network, concerning a cat making noise on the roof (Roof).
Predictive inferences: P(W|F)? P(W|F, S)?

59 Bayesian Networks for Information Retrieval
[Figure: a three-layer network linking Documents (D), Topics (T), and Words (W).]
The probability of a word given a document factorizes through the topics:
P(w|d) = ∑_t P(w|t, d) P(t|d)

60 Bayesian Networks for Information Retrieval (cont.)
[Figure: the same Documents → Topics → Words network, with words conditionally independent of documents given topics, i.e., P(w|t, d) = P(w|t).]
P(w|d) = ∑_t P(w|t) P(t|d)
