UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 12: Bayes Classifiers. Dr. Yanjun Qi. University of Virginia

1 UVA CS 6316/4501 Fall 2016 Machine Learning. Lecture 12: Generative Bayes Classifiers. Dr. Yanjun Qi. University of Virginia, Department of Computer Science.

2 Where are we? Five major sections of this course: regression (supervised), classification (supervised), unsupervised models, learning theory, graphical models.

3 Where are we? Three major types of classification. We can divide the large variety of classification approaches into roughly three major types: 1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree. 2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks. 3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors.

4 Last Lecture Recap: Probability Review. The big picture: data <-> probabilistic model. Sample space, events, and event spaces. Random variables. Joint probability, marginal probability, conditional probability, chain rule, Bayes rule, law of total probability, etc. Structural properties: independence, conditional independence.

5 Today: Generative Bayes Classifiers. Bayes classifier: MAP classification rule; generative Bayes classifier. Naïve Bayes classifier. Gaussian Bayes classifiers: Gaussian distribution, Gaussian NBC, LDA, QDA.

6 A Dataset for Classification. Output as discrete class label $C_1, C_2, \dots, C_L$. Data points/instances/examples/samples/records: [rows]. Features/attributes/dimensions/independent variables/covariates/predictors/regressors: [columns, except the last]. Target/outcome/response/label/dependent variable: special column to be predicted [last column].

7 Bayes Classifiers. Treat each feature attribute and the class label as random variables. Given a sample $x$ with attributes $(x_1, x_2, \dots, x_p)$, the goal is to predict its class $C$. Specifically, we want to find the value of $C$ that maximizes $p(C \mid x_1, x_2, \dots, x_p)$. Can we estimate $p(C \mid x) = p(C \mid x_1, x_2, \dots, x_p)$ directly from data?

10 Bayes Classifiers → MAP classification rule. Establishing a probabilistic model for classification → MAP classification rule. MAP: Maximum A Posteriori. Assign $x$ to $c^*$ if $P(C = c^* \mid X = x) > P(C = c \mid X = x)$ for $c \neq c^*$, $c = c_1, \dots, c_L$. (Adapted from Prof. Ke Chen's NB slides.)

13 Bayes Classifiers → MAP classification rule. Establishing a probabilistic model for classification: (1) Discriminative, (2) Generative.

14 (1) Discriminative: model $P(C \mid X)$ directly, with $C = c_1, \dots, c_L$ and $X = (X_1, \dots, X_p)$. A single discriminative probabilistic classifier takes the input $x = (x_1, x_2, \dots, x_p)$ and outputs $P(c_1 \mid x), P(c_2 \mid x), \dots, P(c_L \mid x)$. (Adapted from Prof. Ke Chen's NB slides.)

15 (2) Generative: model $P(X \mid C)$, with $C = c_1, \dots, c_L$ and $X = (X_1, \dots, X_p)$. One generative probabilistic model per class: $P(x \mid c_1)$ (model for class 1), $P(x \mid c_2)$ (model for class 2), ..., $P(x \mid c_L)$ (model for class L), each over $x = (x_1, x_2, \dots, x_p)$. (Adapted from Prof. Ke Chen's NB slides.)

17 Review: Bayes Rule for Generative Bayes Classifiers. $P(C, X) = P(C \mid X)\,P(X) = P(X \mid C)\,P(C)$, so $P(C \mid X) = P(X \mid C)\,P(C) / P(X)$ (the posterior). Priors: $P(C_1), P(C_2), \dots, P(C_L)$. Posteriors: $P(C_1 \mid x), P(C_2 \mid x), \dots, P(C_L \mid x)$.

20 Summary: Generative Classification with the MAP Rule. MAP classification rule (MAP: Maximum A Posteriori): assign $x$ to $c^*$ if $P(C = c^* \mid X = x) > P(C = c \mid X = x)$ for $c \neq c^*$, $c = c_1, \dots, c_L$. Generative classification with the MAP rule: apply Bayes rule to convert the class-conditional models into posterior probabilities, $P(C = c_i \mid X = x) = \frac{P(X = x \mid C = c_i)\,P(C = c_i)}{P(X = x)} \propto P(X = x \mid C = c_i)\,P(C = c_i)$ for $i = 1, 2, \dots, L$; then apply the MAP rule. (Adapted from Prof. Ke Chen's NB slides.)

22 Summary: Generative Bayes Classifier with the MAP Rule. Task: classify a new instance $X = (X_1, X_2, \dots, X_p)$, a tuple of attribute values, into one of the classes. $c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \dots, x_p) = \arg\max_{c_j \in C} \frac{P(x_1, \dots, x_p \mid c_j)\,P(c_j)}{P(x_1, \dots, x_p)} = \arg\max_{c_j \in C} P(x_1, \dots, x_p \mid c_j)\,P(c_j)$. MAP = Maximum A Posteriori. (Adapted from Prof. Carlos Guestrin's probability tutorial.)
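
The rule above reduces to "score each class by likelihood times prior, return the best." A minimal sketch of that rule (names are ours, not the slides'; any model of $P(x \mid c)$ plugs in):

```python
# Generative MAP rule: assign x to the class c maximizing P(x|c) * P(c).
# `priors` maps class -> P(C=c); `likelihood(x, c)` is any model of P(x|c).
def map_classify(x, priors, likelihood):
    # P(x) is the same for every class, so it is dropped from the argmax.
    return max(priors, key=lambda c: likelihood(x, c) * priors[c])
```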

23 Example: Play Tennis. [The PlayTennis training dataset table is shown on the slide.]

26 Maximum likelihood estimates (explained later): simply use the frequencies in the data.

27 Generative Bayes Classifier: Learning Phase. Class priors $P(C_1), P(C_2), \dots, P(C_L)$: P(Play=Yes) = 9/14, P(Play=No) = 5/14. Class-conditional tables $P(X_1, X_2, \dots, X_p \mid C_1)$, $P(X_1, X_2, \dots, X_p \mid C_2)$ over Outlook (3 values), Temperature (3 values), Humidity (2 values), Wind (2 values):

Outlook  Temperature  Humidity  Wind    | Play=Yes | Play=No
sunny    hot          high      weak    | 0/9      | 1/5
sunny    hot          high      strong  | …/9      | …/5
sunny    hot          normal    weak    | …/9      | …/5
sunny    hot          normal    strong  | …/9      | …/5
...

$3 \cdot 3 \cdot 2 \cdot 2$ [conjunctions of attributes] $\times$ 2 [two classes] = 72 parameters.

28 Generative Bayes Classifier: Test Phase. Given an unknown instance $X_{test} = (a_1, \dots, a_p)$, look up the tables to assign the label $c^*$ to $X_{test}$ if $\hat P(a_1, \dots, a_p \mid c^*)\,\hat P(c^*) > \hat P(a_1, \dots, a_p \mid c)\,\hat P(c)$ for $c \neq c^*$, $c = c_1, \dots, c_L$. Given a new instance, x = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong).

29 Today: Generative Bayes Classifiers. Bayes classifier: MAP classification rule; generative Bayes classifier. Naïve Bayes classifier. Gaussian Bayes classifiers: Gaussian distribution, Gaussian NBC, LDA, QDA.

30 Naïve Bayes Classifier. Bayes classification: $\arg\max_{c_j \in C} P(x_1, x_2, \dots, x_p \mid c_j)\,P(c_j)$. Difficulty: learning the joint probability $P(x_1, \dots, x_p \mid c_j)$. Naïve Bayes classification: assume that all input attributes are conditionally independent!

31 Naïve Bayes Classifier. Assumption: all input attributes are conditionally independent given the class. Then $P(X_1, X_2, \dots, X_p \mid C) = P(X_1 \mid X_2, \dots, X_p, C)\,P(X_2, \dots, X_p \mid C) = P(X_1 \mid C)\,P(X_2, \dots, X_p \mid C) = P(X_1 \mid C)\,P(X_2 \mid C)\cdots P(X_p \mid C)$ (chain rule, then conditional independence, applied recursively).

32 Naïve Bayes Classifier. With the naïve assumption $P(X_1, X_2, \dots, X_p \mid C) = P(X_1 \mid C)\,P(X_2 \mid C)\cdots P(X_p \mid C)$, the MAP classification rule for a sample $x = (x_1, x_2, \dots, x_p)$ becomes: $[P(x_1 \mid c^*)\cdots P(x_p \mid c^*)]\,P(c^*) > [P(x_1 \mid c)\cdots P(x_p \mid c)]\,P(c)$ for $c \neq c^*$, $c = c_1, \dots, c_L$.

34 Naïve Bayes Classifier (for discrete input attributes): training. Learning Phase: given a training set S, for each target value $c_i$ ($c_i = c_1, \dots, c_L$): $\hat P(C = c_i) \leftarrow$ estimate $P(C = c_i)$ with the examples in S; for every attribute value $x_{jk}$ of each attribute $X_j$ ($j = 1, \dots, p$; $k = 1, \dots, K_j$): $\hat P(X_j = x_{jk} \mid C = c_i) \leftarrow$ estimate $P(X_j = x_{jk} \mid C = c_i)$ with the examples in S. Output: conditional probability tables; for each $X_j$, $K_j \times L$ elements.

37 Naïve Bayes (for discrete input attributes): testing. Test Phase: given an unknown instance $X' = (a'_1, \dots, a'_p)$, look up the tables to assign the label $c^*$ to $X'$ if $[\hat P(a'_1 \mid c^*)\cdots \hat P(a'_p \mid c^*)]\,\hat P(c^*) > [\hat P(a'_1 \mid c)\cdots \hat P(a'_p \mid c)]\,\hat P(c)$ for $c \neq c^*$, $c = c_1, \dots, c_L$.
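
Both phases amount to a few lines of counting and a product; a minimal sketch of the algorithm above (function and variable names are ours, not the slides'):

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Learning phase: estimate P(C=c) and P(X_j = x_jk | C=c) by frequency counts."""
    n = len(labels)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    counts = defaultdict(Counter)                 # (j, c) -> Counter over values of X_j
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            counts[(j, c)][v] += 1
    n_per_class = Counter(labels)
    tables = {jc: {v: cnt / n_per_class[jc[1]] for v, cnt in ctr.items()}
              for jc, ctr in counts.items()}
    return priors, tables

def classify_nb(x, priors, tables):
    """Test phase: assign the label c* maximizing [prod_j P_hat(x_j|c)] * P_hat(c)."""
    def score(c):
        s = priors[c]
        for j, v in enumerate(x):
            s *= tables.get((j, c), {}).get(v, 0.0)   # unseen value -> probability 0
        return s
    return max(priors, key=score)
```

Note that an unseen attribute value gets probability 0 here; the smoothing slides later in the lecture address exactly that problem.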

38 Example: Play Tennis. [The PlayTennis training dataset table is shown again.]

39 Learning (training) the NBC Model. Maximum likelihood estimates (explained later): simply use the frequencies in the data: $\hat P(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j)}{N(C = c_j)}$, $\hat P(c_j) = \frac{N(C = c_j)}{N}$. [Graphical model on the slide: class node C with child feature nodes $X_1, X_2, \dots, X_6$.]

41 Learning Phase. Estimate $P(X_j = x_{jk} \mid C = c_i)$ with the training examples:

Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny      2/9       3/5          Hot          2/9       2/5
Overcast   4/9       0/5          Mild         4/9       2/5
Rain       3/9       2/5          Cool         3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High       3/9       4/5          Strong       3/9       3/5
Normal     6/9       1/5          Weak         6/9       2/5

$(3+3+2+2)$ [naïve assumption] $\times$ 2 [two classes] = 20 parameters. P(Play=Yes) = 9/14, P(Play=No) = 5/14.

42 Testing the NBC Model. Test Phase: given a new instance x = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), apply the rule $[\hat P(a_1 \mid c^*)\cdots \hat P(a_p \mid c^*)]\,\hat P(c^*) > [\hat P(a_1 \mid c)\cdots \hat P(a_p \mid c)]\,\hat P(c)$.

44 Testing the NBC Model. Test Phase: given the new instance x = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the conditional-probability tables: P(Outlook=Sunny|Play=Yes) = 2/9, P(Temperature=Cool|Play=Yes) = 3/9, P(Humidity=High|Play=Yes) = 3/9, P(Wind=Strong|Play=Yes) = 3/9, P(Play=Yes) = 9/14; P(Outlook=Sunny|Play=No) = 3/5, P(Temperature=Cool|Play=No) = 1/5, P(Humidity=High|Play=No) = 4/5, P(Wind=Strong|Play=No) = 3/5, P(Play=No) = 5/14. MAP rule: P(Yes|x) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]·P(Play=Yes) ≈ 0.0053; P(No|x) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]·P(Play=No) ≈ 0.0206. Given the fact P(Yes|x) < P(No|x), we label x to be No.
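
The slide's numbers check out directly (a quick verification of the arithmetic above):

```python
# MAP scores for x = (Sunny, Cool, High, Strong), using the tables from slide 41.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
print("Yes" if p_yes > p_no else "No")           # prints "No"
```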

45 WHY the Naïve Bayes Assumption? $P(c_j)$ can be estimated from the frequency of classes in the training examples. Not naïve: $P(x_1, x_2, \dots, x_p \mid c_j)$ has $O(|X_1| \cdot |X_2| \cdots |X_p| \cdot |C|)$ parameters, which could only be estimated if a very, very large number of training examples were available. Naïve: with the conditional independence assumption only the $P(x_k \mid c_j)$ are needed, i.e., $O((|X_1| + |X_2| + \cdots + |X_p|) \cdot |C|)$ parameters. The assumption: the probability of observing the conjunction of attributes equals the product of the individual probabilities $P(x_i \mid c_j)$. (Adapted from Manning's text-categorization tutorial.)

48 DETOUR: Course Schedule. WED / in class / 70 minutes. Open to your notes + the printed lectures + the four HWs we have had so far; nothing else is allowed. Please turn off your phone at the beginning. No electronic devices (other than a basic calculator). The final exam will be closed-note!

50 For instance: C = Flu, features $X_1, \dots, X_5$, $X_6$ = Muscle-ache. What if we have seen no training cases where a patient had no flu but muscle aches? Then $\hat P(X_6 = t \mid C = \text{not\_flu}) = \frac{N(X_6 = t,\, C = \text{nf})}{N(C = \text{nf})} = 0$, and the prediction $\arg\max_c \hat P(c) \prod_i \hat P(x_i \mid c)$ is forced to zero for that class: zero probabilities cannot be conditioned away, no matter the other evidence!

52 Smoothing to Avoid Overfitting. Add-one (Laplace) smoothing: $\hat P(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j) + 1}{N(C = c_j) + k}$, where $k$ = # of values of feature $X_i$; the $+k$ in the denominator makes $\sum_{x_i} \hat P(x_i \mid c_j) = 1$. (Adapted from Manning's text-categorization tutorial.)

53 Smoothing to Avoid Overfitting. $\hat P(x_i \mid c_j) = \frac{N(X_i = x_i,\, C = c_j) + 1}{N(C = c_j) + k}$ ($k$ = # of values of $X_i$). A somewhat more subtle version (the m-estimate): $\hat P(x_{i,k} \mid c_j) = \frac{N(X_i = x_{i,k},\, C = c_j) + m\,p_{i,k}}{N(C = c_j) + m}$, where $p_{i,k}$ is the overall fraction of the data with $X_i = x_{i,k}$, and $m$ is the extent of smoothing.
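
A sketch of both estimators on this slide (names are illustrative):

```python
def laplace_estimate(n_xc, n_c, k):
    """Add-one smoothing: (N(X_i=x, C=c) + 1) / (N(C=c) + k), k = # of values of X_i."""
    return (n_xc + 1) / (n_c + k)

def m_estimate(n_xc, n_c, p_overall, m):
    """Subtler version: (N(X_i=x, C=c) + m*p) / (N(C=c) + m),
    p = overall fraction of the data with X_i = x, m = extent of smoothing."""
    return (n_xc + m * p_overall) / (n_c + m)
```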

54 Today: Generative Bayes Classifiers. Bayes classifier: MAP classification rule; generative Bayes classifier. Naïve Bayes classifier. Gaussian Bayes classifiers: Gaussian distribution, Gaussian NBC, LDA, QDA.

55 Review: Continuous Random Variables. Probability density function (pdf) instead of probability mass function (pmf). For a discrete RV, the pmf gives $P(X = x_i)$. A pdf is any function $f(x)$ that describes the probability density in terms of the input variable $x$.

56 Review: Probability of a Continuous RV. Properties of a pdf: $f(x) \geq 0$ and $\int_{-\infty}^{+\infty} f(x)\,dx = 1$. Actual probabilities are obtained by taking the integral of the pdf; e.g., the probability of $X$ being between 5 and 6 is $P(5 \leq X \leq 6) = \int_5^6 f(x)\,dx$.

57 Review: Mean and Variance of a RV. Mean (expectation): for discrete RVs, $E(X) = \sum_v v\,P(X = v)$ and $E(g(X)) = \sum_v g(v)\,P(X = v)$; for continuous RVs, $E(X) = \int_{-\infty}^{+\infty} x f(x)\,dx$ and $E(g(X)) = \int_{-\infty}^{+\infty} g(x) f(x)\,dx$. (Adapted from Prof. Carlos Guestrin's probability tutorial.)

58 Review: Mean and Variance of a RV. Variance: $Var(X) = E((X - \mu)^2)$; for discrete RVs, $Var(X) = \sum_v (v - \mu)^2 P(X = v)$; for continuous RVs, $Var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx$. Covariance: $Cov(X, Y) = E((X - \mu_x)(Y - \mu_y)) = E(XY) - \mu_x \mu_y$. (Adapted from Prof. Carlos Guestrin's probability tutorial.)

59 Gaussian Distribution. $X \sim N(\mu, \sigma^2)$. [Figures: univariate Gaussian with its mean; multivariate Gaussian with its covariance matrix.] Courtesy: http://research.microsoft.com/~cmbishop/PRML/index.htm

60 Multivariate Normal (Gaussian) PDFs. The only widely used continuous joint PDF is the multivariate normal (or Gaussian): $f(x) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\!\big(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\big)$, where $|\cdot|$ denotes the determinant. For the bivariate normal PDF: the mean of the normal PDF is at the peak value; contours of equal PDF form ellipses. The covariance matrix captures linear dependences among the variables.

61 Example: the Bivariate Normal Distribution. $f(x_1, x_2) = (2\pi)^{-1}\,|\Sigma|^{-1/2} \exp\!\big(-\tfrac{1}{2} (\vec x - \vec\mu)^T \Sigma^{-1} (\vec x - \vec\mu)\big)$ with $\vec\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$, $|\Sigma| = \sigma_1^2\sigma_2^2 - \sigma_{12}^2 = \sigma_1^2\sigma_2^2(1 - \rho^2)$.
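
Evaluating the reconstructed density numerically (a sketch with NumPy; the parameter values are arbitrary illustrative choices):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """(2*pi)^(-p/2) * |Sigma|^(-1/2) * exp(-0.5 * (x-mu)^T Sigma^{-1} (x-mu))."""
    p = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)          # (x-mu)^T Sigma^{-1} (x-mu)
    return (2 * np.pi) ** (-p / 2) / np.sqrt(np.linalg.det(Sigma)) * np.exp(-0.5 * quad)

# Bivariate case with rho = 0.5 and unit variances:
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(mvn_pdf(np.array([0.0, 0.0]), np.zeros(2), Sigma))   # density at the mean
```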

62 Surface plots of the bivariate Normal distribution. [Figures only.]

63 Contour plots of the bivariate Normal distribution. [Figures only.]

64 Scatter plots of data from the bivariate Normal distribution. [Figures only.]

65 Trivariate Normal distribution. [Figure only; axes $x_1, x_2, x_3$.]

66 How to Fit a Gaussian: MLE (more later). We can fit statistical models by maximizing the probability/likelihood of generating the observed samples: $L(x_1, \dots, x_n \mid \theta) = p(x_1 \mid \theta) \cdots p(x_n \mid \theta)$ (the samples are assumed to be IID). In the 1D Gaussian case, we simply set the mean and the variance to the sample mean and the sample variance: $\mu = \frac{1}{n}\sum_{i=1}^n x_i$, $\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2$.
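
A minimal sketch of that 1D fit (NumPy; names are ours):

```python
import numpy as np

def fit_gaussian_mle(x):
    """MLE for a 1D Gaussian: sample mean and (biased, divide-by-n) sample variance."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # note: /n, not /(n-1); this is the MLE
    return mu, sigma2

samples = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=10_000)
print(fit_gaussian_mle(samples))      # close to (2.0, 1.5**2)
```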

67 The p-Multivariate Normal Distribution: $(X_1, X_2, \dots, X_p) \sim N(\vec\mu, \Sigma)$.

68 DETOUR: Probabilistic Interpretation of Linear Regression. Let us assume that the target variable and the inputs are related by the equation $y_i = \theta^T x_i + \varepsilon_i$, where $\varepsilon$ is an error term of unmodeled effects or random noise. Now assume that $\varepsilon$ follows a Gaussian $N(0, \sigma^2)$; then we have $p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\big(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\big)$. By the IID (independent and identically distributed) assumption: $L(\theta) = \prod_{i=1}^n p(y_i \mid x_i; \theta) = \big(\frac{1}{\sqrt{2\pi}\,\sigma}\big)^n \exp\!\big(-\frac{\sum_{i=1}^n (y_i - \theta^T x_i)^2}{2\sigma^2}\big)$.

71 We can learn $\theta$ by maximizing the probability/likelihood of generating the observed samples: $l(\theta) = \log L(\theta) = n \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^n (y_i - \theta^T x_i)^2$. Note the last term: $J(\theta) = \frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2$.

74 Maximum Likelihood Estimation: a general statement. Consider a sample set $T = (X_1, \dots, X_n)$ drawn from a probability distribution $P(X \mid \theta)$, where $\theta$ are parameters. If the $X_i$ are independent with probability density function $P(X_i \mid \theta)$, the joint probability of the whole set is $P(X_1, \dots, X_n \mid \theta) = \prod_{i=1}^n P(X_i \mid \theta)$. This may be maximized with respect to $\theta$ to give the maximum likelihood estimate: $\hat\theta = \arg\max_\theta P(X_1, \dots, X_n \mid \theta)$.

76 The idea is to: assume a particular model with unknown parameters; we can then define the probability of observing a given event conditional on a particular set of parameters, $P(X \mid \theta)$. We have observed a set of outcomes in the real world. It is then possible to choose the set of parameters which is most likely to have produced the observed results: $\hat\theta = \arg\max_\theta P(X_1, \dots, X_n \mid \theta)$. This is maximum likelihood. In most cases it is both consistent and efficient, and it provides a standard against which to compare other estimation techniques. It is often convenient to work with the log of the likelihood function: $\log L(\theta) = \sum_{i=1}^n \log P(X_i \mid \theta)$.

79 DETOUR: Probabilistic Interpretation of Linear Regression. Hence the log-likelihood is $l(\theta) = \log L(\theta) = n \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2$. Recognize the last term? Yes, it is $J(\theta) = \frac{1}{2}\sum_{i=1}^n (y_i - \theta^T x_i)^2$. Thus, under the independent Gaussian residual assumption, minimizing the residual squared error is equivalent to MLE of $\theta$!
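
A small numeric check of that equivalence (synthetic data; names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=100)   # Gaussian noise, as assumed

# Minimizing J(theta) = 0.5 * sum_i (y_i - theta^T x_i)^2 is ordinary least squares,
# which per the slide is also the MLE of theta under Gaussian residuals.
theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_hat)   # close to theta_true
```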

81 Today: Generative Bayes Classifiers. Bayes classifier: MAP classification rule; generative Bayes classifier. Naïve Bayes classifier. Gaussian Bayes classifiers: Gaussian distribution, Gaussian NBC, not-naïve Gaussian BC → LDA, QDA.

82 Gaussian Naïve Bayes Classifier. $\arg\max_C P(C \mid X) = \arg\max_C P(X, C) = \arg\max_C P(X \mid C)\,P(C)$. Naïve Bayes: $P(X \mid C) = P(X_1, X_2, \dots, X_p \mid C) = P(X_1 \mid C)\,P(X_2 \mid C)\cdots P(X_p \mid C)$. Model each factor as a univariate normal: $\hat P(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\big(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\big)$, where $\mu_{ji}$ is the mean (average) of the attribute values $X_j$ of the examples for which $C = c_i$, and $\sigma_{ji}$ is the standard deviation of the attribute values $X_j$ of the examples for which $C = c_i$.

83 Gaussian Naïve Bayes Classifier: continuous-valued input attributes. Conditional probability modeled with the normal distribution: $\hat P(X_j \mid C = c_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ji}} \exp\!\big(-\frac{(X_j - \mu_{ji})^2}{2\sigma_{ji}^2}\big)$. Learning phase: for $X = (X_1, \dots, X_p)$ and $C = c_1, \dots, c_L$, output $p \times L$ normal distributions and $P(C = c_i)$, $i = 1, \dots, L$. Test phase: for $X' = (X'_1, \dots, X'_p)$, calculate the conditional probabilities with all the normal distributions and apply the MAP rule to make a decision.
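
A minimal sketch of those two phases (NumPy arrays assumed; per-class, per-feature means and standard deviations; names are ours):

```python
import numpy as np

def train_gnb(X, y):
    """Learning phase: per class c, the mean/std of each feature plus the prior P(C=c)."""
    return {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0), np.mean(y == c))
            for c in np.unique(y)}

def classify_gnb(x, params):
    """Test phase: MAP rule with one univariate normal per (feature, class);
    log-space products for numerical stability."""
    def log_score(c):
        mu, sigma, prior = params[c]
        log_pdf = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
        return log_pdf.sum() + np.log(prior)
    return max(params, key=log_score)
```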

85 What does "naïve" mean for Gaussians? Not naïve: $P(X_1, X_2, \dots, X_p \mid C)$ is a full multivariate Gaussian. Naïve: $P(X_1, X_2, \dots, X_p \mid C = c_j) = P(X_1 \mid C)\,P(X_2 \mid C)\cdots P(X_p \mid C) = \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\big(-\frac{(X_j - \mu_j)^2}{2\sigma_j^2}\big)$, i.e., each class covariance matrix is diagonal: $\Sigma_{c_k} = \Lambda_{c_k}$.

86 Today: Generative Bayes Classifiers. Bayes classifier: MAP classification rule; generative Bayes classifier. Naïve Bayes classifier. Gaussian Bayes classifiers: Gaussian distribution, Gaussian NBC, not-naïve Gaussian BC → LDA, QDA.

87 (1) If the covariance matrices are the same across classes → LDA (Linear Discriminant Analysis). Each class covariance matrix is the same. [Figures: class k and class l with equally shaped Gaussian contours.]

88 Optimal Classification. $\arg\max_k P(C_k \mid X) = \arg\max_k P(X, C_k) = \arg\max_k P(X \mid C_k)\,P(C_k)$.

89 $\arg\max_k P(C_k \mid X) = \arg\max_k P(X, C_k) = \arg\max_k P(X \mid C_k)\,P(C_k) = \arg\max_k \log\{P(X \mid C_k)\,P(C_k)\}$.

91 $\log \frac{P(C_k \mid X)}{P(C_l \mid X)} = \log \frac{P(X \mid C_k)}{P(X \mid C_l)} + \log \frac{P(C_k)}{P(C_l)}$.

92 → The decision boundary between classes k and l, $\{x : \delta_k(x) = \delta_l(x)\}$, is linear. Boundary points $x$: where $P(c_k \mid x) = P(c_l \mid x)$, the linear equation on the left equals zero, giving a line/plane.
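
For concreteness, the standard LDA discriminant implied by these slides (not written out explicitly here) is $\delta_k(x) = x^T \Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log \pi_k$, linear in $x$; a sketch with a shared covariance $\Sigma$ (names are ours):

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, log_prior_k):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k."""
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + log_prior_k

# The boundary between classes k and l is {x : delta_k(x) == delta_l(x)}, a hyperplane.
```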

93 Visualization (three classes). [Figure only.]

94 (2) If the covariance matrices are not the same across classes → QDA (Quadratic Discriminant Analysis).
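
The corresponding standard quadratic discriminant, with a per-class covariance $\Sigma_k$ (not written out on the slide; names are ours), is $\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) + \log \pi_k$:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, log_prior_k):
    """delta_k(x) = -0.5 log|Sigma_k| - 0.5 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k;
    quadratic in x, so class boundaries are quadrics rather than hyperplanes."""
    d = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)
    return -0.5 * logdet - 0.5 * d @ np.linalg.solve(Sigma_k, d) + log_prior_k
```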

95 LDA on an Expanded Basis. LDA with a quadratic basis versus QDA. [Figure comparing the two decision boundaries.]

96 (3) Regularized Discriminant Analysis. [Details on the slide not transcribed.]

97 An Example: Gaussian Bayes Classifier. [Figure only.]

98 Gaussian Bayes Classifier. [Figure only.]

99 Today Recap: Generative Bayes Classifiers. Bayes classifier: MAP classification rule; generative Bayes classifier. Naïve Bayes classifier. Gaussian naïve Bayes classifiers: Gaussian distribution, Gaussian NBC, not-naïve Gaussian BC → LDA, QDA.

100 Generative Bayes Classifier. $\arg\max_k P(C_k \mid X) = \arg\max_k P(X, C_k) = \arg\max_k P(X \mid C_k)\,P(C_k)$.

Task: classification. Representation: probabilistic models $p(X \mid C)$. Score function: EPE with 0-1 loss → likelihood $P(X_1, \dots, X_p \mid C)$. Search/optimization: many options. Models/parameters: the probabilistic models' parameters.

Probabilistic model / parameterization:
Bernoulli naïve: $p(w_i = \text{true} \mid c_k) = p_{i,k}$.
Gaussian naïve: $\hat P(X_j \mid C = c_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{jk}} \exp\!\big(-\frac{(X_j - \mu_{jk})^2}{2\sigma_{jk}^2}\big)$.
Multinomial: $P(W_1 = n_1, \dots, W_v = n_v \mid c_k) = \frac{N!}{n_{1k}!\,n_{2k}!\cdots n_{vk}!}\,\theta_{1k}^{n_{1k}}\,\theta_{2k}^{n_{2k}}\cdots\theta_{vk}^{n_{vk}}$.

101 References: Prof. Andrew Moore's review tutorial; Prof. Ke Chen's NB slides; Prof. Carlos Guestrin's recitation slides.


More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Stat 543 Exam 2 Spring 2016

Stat 543 Exam 2 Spring 2016 Stat 543 Exam 2 Sprng 2016 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of 11 questons. Do at least 10 of the 11 parts of the man exam.

More information

Kernel Methods and SVMs Extension

Kernel Methods and SVMs Extension Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general

More information

Probability Theory (revisited)

Probability Theory (revisited) Probablty Theory (revsted) Summary Probablty v.s. plausblty Random varables Smulaton of Random Experments Challenge The alarm of a shop rang. Soon afterwards, a man was seen runnng n the street, persecuted

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 10: Classifica8on with Support Vector Machine (cont.

UVA CS / Introduc8on to Machine Learning and Data Mining. Lecture 10: Classifica8on with Support Vector Machine (cont. UVA CS 4501-001 / 6501 007 Introduc8on to Machne Learnng and Data Mnng Lecture 10: Classfca8on wth Support Vector Machne (cont. ) Yanjun Q / Jane Unversty of Vrgna Department of Computer Scence 9/6/14

More information

Rockefeller College University at Albany

Rockefeller College University at Albany Rockefeller College Unverst at Alban PAD 705 Handout: Maxmum Lkelhood Estmaton Orgnal b Davd A. Wse John F. Kenned School of Government, Harvard Unverst Modfcatons b R. Karl Rethemeer Up to ths pont n

More information

Introduction to Regression

Introduction to Regression Introducton to Regresson Dr Tom Ilvento Department of Food and Resource Economcs Overvew The last part of the course wll focus on Regresson Analyss Ths s one of the more powerful statstcal technques Provdes

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Classification Bayesian Classifiers

Classification Bayesian Classifiers lassfcaton Bayesan lassfers Jeff Howbert Introducton to Machne Learnng Wnter 2014 1 Bayesan classfcaton A robablstc framework for solvng classfcaton roblems. Used where class assgnment s not determnstc,.e.

More information