MACHINE LEARNING
Department of Computer Science, Artificial Intelligence Research Laboratory, Iowa State University


1 MACHINE LEARNING
Vasant Honavar
Bioinformatics and Computational Biology Program
Center for Computational Intelligence, Learning, & Discovery
Iowa State University

2 Recall the Bayesian Recipe for Classification
The Bayesian recipe is simple, optimal, and, in principle, straightforward to apply.
To use this recipe in practice, we need to know P(X | ω_i), the generative model for the data for each class, and P(ω_i), the prior probabilities of the classes.
Because these probabilities are unknown, we need to estimate them from data, or learn them!
X is typically high-dimensional, so we need to estimate P(X | ω_i) from limited data.

3 Naïve Bayes Classifier
We can classify if we know P(X | ω_i). How do we learn P(X | ω_i)?
One solution: assume that the random variables in X are conditionally independent given the class.
Result: the Naïve Bayes classifier, which performs optimally under certain assumptions; a simple, practical learning algorithm grounded in probability theory.
When to use:
- Attributes that describe instances are likely to be conditionally independent given the classification
- The data is insufficient to estimate all the probabilities reliably if we do not assume independence

4 Naïve Bayes Classifier
Successful applications:
- Diagnosis
- Document classification
- Protein function classification
- Prediction of protein-protein interfaces
and many others.

5 Conditional Independence
Let Z_1, Z_2, ..., Z_n and W be random variables on a given event space. Z_1, Z_2, ..., Z_n are mutually independent given W if

P(Z_1, Z_2, ..., Z_n | W) = ∏_i P(Z_i | W)

Note that these represent sets of equations, one for each possible assignment of values to the random variables.

6 Implications of Independence
Suppose we have 5 binary attributes and a binary class label.
Without independence, in order to specify the joint distribution, we need to specify a probability for each possible assignment of values to the variables, resulting in a table of size 2^6 = 64.
If the features are independent given the class label, we only need 5 × 2 = 10 entries.
The reduction in the number of probabilities to be estimated is even more striking when N, the number of attributes, is large: from O(2^N) to O(N).

7 Naïve Bayes Classifier
Consider a discrete-valued target function f : X → Ω, where an instance x ∈ X is described in terms of attribute values (x_1, x_2, ..., x_n).

ω_MAP = argmax_{ω_j ∈ Ω} P(ω_j | x_1, x_2, ..., x_n)
      = argmax_{ω_j ∈ Ω} P(x_1, x_2, ..., x_n | ω_j) P(ω_j) / P(x_1, x_2, ..., x_n)
      = argmax_{ω_j ∈ Ω} P(x_1, x_2, ..., x_n | ω_j) P(ω_j)

ω_MAP is called the maximum a posteriori classification.

8 Naïve Bayes Classifier

ω_MAP = argmax_{ω_j ∈ Ω} P(ω_j | x_1, x_2, ..., x_n)
      = argmax_{ω_j ∈ Ω} P(x_1, x_2, ..., x_n | ω_j) P(ω_j)

If the attributes are independent given the class, we have

ω_NB = argmax_{ω_j ∈ Ω} P(ω_j) ∏_i P(x_i | ω_j)
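A minimal Python sketch of this decision rule, assuming the priors and class-conditional probabilities have already been estimated; the dictionary layout is hypothetical, and the log-space computation is a standard trick to avoid underflow when many attribute probabilities are multiplied:

```python
import math

def naive_bayes_classify(x, priors, cond_probs):
    """Return argmax over classes of P(w) * prod_i P(x_i | w), in log space.

    x:          tuple of attribute values (x_1, ..., x_n)
    priors:     dict mapping class label w -> P(w)
    cond_probs: dict mapping (w, i, value) -> P(x_i = value | w)
    Assumes all probabilities are strictly positive (see the discussion
    of zero probabilities and smoothing later in these notes).
    """
    best_label, best_score = None, -math.inf
    for w, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            score += math.log(cond_probs[(w, i, value)])
        if score > best_score:
            best_label, best_score = w, score
    return best_label
```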

9 Naïve Bayes Learner
For each possible value ω_j of Ω:
  Estimate P̂(Ω = ω_j) from the training set D
  For each possible value a_k of each attribute X_i:
    Estimate P̂(X_i = a_k | Ω = ω_j) from D
Classify a new instance x = (x_1, ..., x_n):
  ω_NB = argmax_{ω_j ∈ Ω} P̂(ω_j) ∏_i P̂(x_i | ω_j)
Estimate(·) is a procedure for estimating the relevant probabilities from a set of training examples.

10 Estimation of Probabilities from Small Samples

P̂(a_k | ω_j) = (n_k + m p) / (n + m)

where
  n is the number of training examples of class ω_j
  n_k is the number of training examples of class ω_j which have attribute value a_k
  p is the prior estimate for P̂(a_k | ω_j)
  m is the weight given to the prior
As n → ∞, P̂(a_k | ω_j) → n_k / n.
This is effectively the same as using Dirichlet priors, as we shall see later.
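A small sketch of this m-estimate; the numbers in the usage example are illustrative only:

```python
def m_estimate(n_k, n, p, m):
    """Smoothed estimate of P(a_k | w): (n_k + m*p) / (n + m).

    n_k: count of class-w examples with attribute value a_k
    n:   total count of class-w examples
    p:   prior estimate for the probability (e.g., 1/#values for a uniform prior)
    m:   equivalent sample size, the weight given to the prior
    """
    return (n_k + m * p) / (n + m)

# With a uniform prior over 3 attribute values (p = 1/3) and m = 3,
# an unobserved value (n_k = 0, n = 5) gets probability 1/8 instead of 0.
print(m_estimate(0, 5, 1/3, 3))  # 0.125
```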

11 Sample Applications of Naïve Bayes Classifier
- Learning dating preferences
- Learning which news articles are of interest
- Learning to classify web pages by topic
- Learning to classify SPAM
- Learning to assign proteins to functional families based on amino acid composition
Naïve Bayes is among the most useful algorithms.
What attributes shall we use to represent text?

12 Learning Dating Preferences
Instances: ordered 3-tuples of attribute values corresponding to
  Height: tall (t), short (s)
  Hair: dark (d), blonde (b), red (r)
  Eye: blue (l), brown (w)
Classes: +, -
Training Data:
  Instance   Attributes   Class label
  I1         (t, d, l)    +
  I2         (s, d, l)    +
  I3         (t, b, l)    -
  I4         (t, r, l)    -
  I5         (s, b, l)    -
  I6         (t, b, w)    +
  I7         (t, d, w)    +
  I8         (s, b, w)    +

13 Probabilities to Estimate
P(+) = 5/8, P(-) = 3/8

P(Height | c):      t      s
  c = +            3/5    2/5
  c = -            2/3    1/3

P(Hair | c):        d      b      r
  c = +            3/5    2/5    0
  c = -             0     2/3   1/3

P(Eye | c):         l      w
  c = +            2/5    3/5
  c = -             1      0

Classify (Height = t, Hair = b, Eye = l):
  P(+) P(t | +) P(b | +) P(l | +) = (5/8)(3/5)(2/5)(2/5) = 0.06
  P(-) P(t | -) P(b | -) P(l | -) = (3/8)(2/3)(2/3)(1) ≈ 0.167
Classification?

Classify (Height = t, Hair = r, Eye = w): both products are 0.
Note the problem with zero probabilities. Solution: use Laplacian estimates.
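A short Python sketch reproducing this worked example with relative-frequency (maximum likelihood) estimates; the tuple encoding and names are ours, not the slides':

```python
from collections import Counter, defaultdict

# Training data from the slide: (height, hair, eye) -> class
data = [
    (("t", "d", "l"), "+"), (("s", "d", "l"), "+"),
    (("t", "b", "l"), "-"), (("t", "r", "l"), "-"), (("s", "b", "l"), "-"),
    (("t", "b", "w"), "+"), (("t", "d", "w"), "+"), (("s", "b", "w"), "+"),
]

class_counts = Counter(c for _, c in data)
value_counts = defaultdict(Counter)       # value_counts[(class, attr_index)][value]
for x, c in data:
    for i, v in enumerate(x):
        value_counts[(c, i)][v] += 1

def score(x, c):
    """Unnormalized posterior: P(c) * prod_i P(x_i | c), by relative frequencies."""
    s = class_counts[c] / len(data)
    for i, v in enumerate(x):
        s *= value_counts[(c, i)][v] / class_counts[c]
    return s

print(score(("t", "b", "l"), "+"), score(("t", "b", "l"), "-"))  # 0.06 vs ~0.167
print(score(("t", "r", "w"), "+"), score(("t", "r", "w"), "-"))  # both 0.0
```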

14 Learning to Classify Text
Target concept Interesting? : Documents → {+, -}
Learning: use training examples to estimate P(+), P(-), P(d | +), P(d | -)
Alternative generative models for documents:
- Represent each document by a sequence of words. In the most general case, we need a probability for each word occurrence in each position in the document, for each possible document length. Too many probabilities to estimate!
- Represent each document by tuples of word counts

15 Learning to Classify Text

P(d | ω_j) = P(length(d)) P(d | ω_j, length(d))

This would require estimating |Vocabulary|^length(d) probabilities for each possible document length!
To simplify matters, assume that the probability of encountering a specific word in a particular position is independent of the position, and of the document length. Treat each document as a bag of words!

16 Bag of Words Representation
So we estimate one position-independent class-conditional probability P(w_k | ω_j) for each word w_k, instead of the set of position-specific probabilities. The number of probabilities to be estimated drops to |Vocabulary| × |Ω|.
The result is a generative model for documents that treats each document as an ordered tuple of word frequencies.
More sophisticated models can consider dependencies between adjacent word positions (Markov models; we will come back to these later).

17 Learning to Classify Text
With the bag of words representation, we have

P(d | ω_j) ∝ ∏_k P(w_k | ω_j)^{n_kd} / n_kd!

where n_kd is the number of occurrences of w_k in document d (ignoring the dependence on the length of the document).
We can estimate P(w_k | ω_j) from the labeled bags of words we have.
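A sketch of the resulting class score in log space, assuming smoothed word probabilities are already available; class-independent constants (the multinomial coefficient) are dropped, since they do not affect the argmax over classes:

```python
import math
from collections import Counter

def log_multinomial_score(doc_words, log_prior, log_word_probs):
    """log P(w) + sum_k n_kd * log P(w_k | w): the bag-of-words class score.

    doc_words:      list of word tokens in the document
    log_prior:      log P(class)
    log_word_probs: dict word -> log P(word | class), assumed smoothed (no zeros)
    """
    counts = Counter(doc_words)
    return log_prior + sum(n * log_word_probs[w] for w, n in counts.items())
```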

18 Naïve Bayes Text Classifier
Given 1000 training documents from each group, learn to classify new documents according to the newsgroup where it belongs. Naïve Bayes achieves 89% classification accuracy.
comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns, misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey, sci.space, sci.crypt, sci.electronics, sci.med

19 Naïve Bayes Text Classifier
Representative article from rec.sport.hockey:

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: John Doe
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided.

20 Sequence Classification
Need a generative model for sequences.
Simplest alternative: a sequence-length independent multinomial (bag of letters) model!
More sophisticated alternatives are possible, for example, Markov models that capture dependencies among small windows of neighboring letters.

21 Naïve Bayes Learner Summary
Produces a minimum error classifier if attributes are conditionally independent given the class.
When to use:
- Attributes that describe instances are likely to be conditionally independent given the classification
- There is not enough data to estimate all the probabilities reliably if we do not assume independence
Often works well even when the independence assumption is violated (Domingos and Pazzani, 1996).
Can be used iteratively (Kang et al., 2006).

22 Estimating Probabilities from Data (discrete case)
- Maximum likelihood estimation
- Bayesian estimation
- Maximum a posteriori estimation

23 Example: Binomial Experiment
When tossed, a thumbtack can land in one of two positions: Head or Tail.
We denote by θ the unknown probability P(H).
Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.

24 Statistical Parameter Fitting
Consider samples x[1], x[2], ..., x[M] such that
- The set of values that each x[i] can take is known
- Each x[i] is sampled from the same distribution
- Each x[i] is sampled independently of the rest (i.i.d. samples)
The task is to find a parameter Θ so that the data can be summarized by a probability P(x[i] | Θ).
The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc. We will focus first on binomial and then on multinomial distributions; the main ideas generalize to other distribution families.

25 The Likelihood Function
How good is a particular θ? It depends on how likely it is to generate the observed data D:

L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

The likelihood for the sequence H, T, T, H, H is

L(θ : D) = θ (1 − θ)(1 − θ) θ θ = θ³ (1 − θ)²

26 Likelihood Function
The likelihood function L(θ : D) provides a measure of relative preferences for various values of the parameter θ, given a collection of observations D drawn from a distribution that is parameterized by a fixed but unknown θ.
L(θ : D) is the probability of the observed data D, considered as a function of θ.
Suppose the data D is 5 heads out of 8 tosses. What is the likelihood function, assuming that the observations were generated by a binomial distribution with an unknown but fixed parameter θ?

27 Sufficient Statistics
To compute the likelihood in the thumbtack example we only require N_H and N_T, the number of heads and the number of tails.
N_H and N_T are sufficient statistics for the parameter θ that specifies the binomial distribution.
A statistic is simply a function of the data. A sufficient statistic for a parameter θ is a function s(D) that summarizes from the data D the relevant information needed to compute the likelihood L(θ : D): if s(D) = s(D'), then L(θ : D) = L(θ : D').

L(θ : D) = θ^{N_H} (1 − θ)^{N_T}
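A tiny sketch making the point concrete: the likelihood depends on the data only through the counts N_H and N_T, so any two sequences with the same counts yield the same likelihood function:

```python
def binomial_likelihood(theta, n_heads, n_tails):
    """L(theta : D) = theta^N_H * (1 - theta)^N_T; depends on D only via counts."""
    return theta**n_heads * (1 - theta)**n_tails

# The sequences HTTHH and HHHTT share the sufficient statistics (3 heads,
# 2 tails), so they give identical likelihood functions.
print(binomial_likelihood(0.6, 3, 2))  # 0.03456
```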

28 Maximum Likelihood Estimation
Main idea: learn the parameters that maximize the likelihood function.
Maximum likelihood estimation is
- Intuitively appealing
- One of the most commonly used estimators in statistics
- Based on the assumption that the parameter to be estimated is fixed, but unknown

29 Example: MLE for Binomial Data
Applying the MLE principle we get (why?)

θ̂ = N_H / (N_H + N_T)

Example: (N_H, N_T) = (3, 2); the ML estimate is 3/5 = 0.6.

30 MLE for Binomial Data

L(θ : D) = θ^{N_H} (1 − θ)^{N_T}
log L(θ : D) = N_H log θ + N_T log(1 − θ)

The likelihood is positive for all legitimate values of θ, so maximizing the likelihood is equivalent to maximizing its logarithm, i.e., the log likelihood.
Setting the derivative of log L(θ : D) to 0 at the extrema of L(θ : D):

N_H / θ − N_T / (1 − θ) = 0, which gives θ_ML = N_H / (N_H + N_T)

Note that the likelihood is indeed maximized at θ_ML, because in the neighborhood of θ_ML the value of the likelihood is smaller than it is at θ_ML.
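A small sketch, under the stated binomial model, comparing the closed-form estimate against a brute-force grid search over the log likelihood (using NumPy):

```python
import numpy as np

n_heads, n_tails = 3, 2

def log_likelihood(theta):
    # log L(theta : D) = N_H log(theta) + N_T log(1 - theta)
    return n_heads * np.log(theta) + n_tails * np.log(1 - theta)

# Closed form: theta_ML = N_H / (N_H + N_T)
theta_ml = n_heads / (n_heads + n_tails)

# Numerical check: evaluate the log likelihood on a grid and take the argmax
grid = np.linspace(0.001, 0.999, 999)
print(theta_ml, grid[np.argmax(log_likelihood(grid))])  # both ~0.6
```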

31 Maximum and Curvature of the Likelihood Around the Maximum
At the maximum, the derivative of the log likelihood is zero.
At the maximum, the second derivative is negative. The curvature of the log likelihood is defined as

I(θ) = − d²/dθ² log L(θ : D)

A large observed curvature I(θ_ML) at θ_ML is associated with a sharp peak, intuitively indicating less uncertainty about the maximum likelihood estimate.
I(θ_ML) is called the Fisher information.

32 Maximum Likelihood Estimate
The ML estimate can be shown to be
- Asymptotically unbiased: lim_{N→∞} E[θ_ML] = θ_True
- Asymptotically consistent: it converges to the true value as the number of examples approaches infinity, lim_{N→∞} Pr(|θ_ML − θ_True| ≤ ε) = 1
- Asymptotically efficient: it achieves the lowest variance that any estimate can achieve for a training set of a certain size (it satisfies the Cramér-Rao bound)

33 Maximum Likelihood Estimate
The ML estimate can be shown to be representationally invariant: if θ_ML is an ML estimate of θ, and g(θ) is a function of θ, then g(θ_ML) is an ML estimate of g(θ).
When the number of samples is large, the probability distribution of θ_ML is Gaussian with mean θ_True, the actual value of the parameter. This is a consequence of the central limit theorem (a random variable which is a sum of a large number of random variables has a Gaussian distribution), and the ML estimate is related to a sum of random variables.
We can use the likelihood ratio to reject the null hypothesis corresponding to θ_0 as unsupported by the data if the ratio of the likelihoods evaluated at θ_0 and at θ_ML is small. The ratio can be calibrated when the likelihood function is approximately quadratic.

34 Naïve Bayes Classifier
We can define the likelihood for a Naïve Bayes classifier.
Let Θ_i be the class-conditional parameters for the i-th attribute, and let L_i be the corresponding likelihood. The likelihood factorizes over the i.i.d. samples:

L(Θ : D) = ∏_p P(x_1[p], ..., x_n[p] : Θ)
         = ∏_i ∏_p P(x_i[p] : Θ_i)   (independence factorization)

Each Θ_i specifies, for each class, a binomial distribution associated with the i-th attribute.

35 Naïve Bayes Classifier
Decomposition into independent estimation problems: if the parameters for each family are decoupled via independence, then they can be estimated independently of each other.

36 From Binomial to Multinomial
Suppose a random variable X can take the values 1, 2, ..., K.
We want to learn the parameters θ_1, ..., θ_K.
Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed.

Likelihood function: L(θ : D) = ∏_{k=1}^{K} θ_k^{N_k}

ML estimate: θ̂_k = N_k / Σ_l N_l
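A minimal sketch of this closed-form estimate, computing relative frequencies from raw samples:

```python
from collections import Counter

def multinomial_mle(samples):
    """theta_k = N_k / sum_l N_l: the relative frequency of each outcome."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

print(multinomial_mle(["a", "b", "a", "c", "a"]))  # {'a': 0.6, 'b': 0.2, 'c': 0.2}
```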

37 MLE Estimates for Naïve Bayes Classifiers
When we assume that P(x | c) is multinomial, we get the decomposition:

L(Θ : D) = ∏_m P(x[m] | c[m] : Θ) = ∏_c ∏_x θ_{x|c}^{N(x, c)}

For each class we get an independent multinomial estimation problem. The MLE is

θ̂_{x|c} = N(x, c) / N(c)

38 Summary of Maximum Likelihood Estimation
- Define a likelihood function, which is a measure of how likely it is that the observed data were generated from a probability distribution with a particular choice of parameters
- Select the parameters that maximize the likelihood
- In simple cases, the ML estimate has a closed-form solution; in other cases, ML estimation may require numerical optimization
Problem with the ML estimate: it assigns zero probability to unobserved values, which can lead to difficulties when estimating from small samples.
Question: how would a Naïve Bayes classifier behave if some of the class-conditional probability estimates are zero?

39 Bayesian Estimation
MLE commits to a specific value of the unknown parameter(s).
The MLE is the same in both cases shown (a small sample and a large sample with the same proportion of outcomes, whose likelihood functions peak at the same θ but with different sharpness).
Of course, in general, one cannot summarize a function by a single number!
Intuitively, the confidence in the estimates should be different.

40 Bayesian Estimation
The maximum likelihood approach is frequentist at its core:
- Assumes there is an unknown but fixed parameter θ
- Estimates θ with some confidence
- Predicts probabilities using the estimated parameter value
The Bayesian approach:
- Represents uncertainty about the unknown parameter
- Uses probability to quantify this uncertainty: unknown parameters are treated as random variables
- Prediction follows from the rules of probability: expectation over the unknown parameters

41 Example: Binomial Data Revisited
Suppose that we choose a uniform prior p(θ) = 1 for θ in [0, 1].
In this case, p(θ | D) is proportional to the likelihood L(θ : D):

p(θ | D) = p(D | θ) p(θ) / p(D) ∝ p(D | θ)

The Bayesian prediction is

P(x[M+1] = H | D) = ∫ θ p(θ | D) dθ

42 Example: Binomial Data Revisited
(N_H, N_T) = (4, 1); the MLE for P(H) is 4/5 = 0.8.
The Bayesian estimate is

P(x[M+1] = H | D) = ∫ θ p(θ | D) dθ = 5/7 ≈ 0.714

In this example, MLE and Bayesian prediction differ.
It can be proved that if the prior is well-behaved, i.e., does not assign 0 density to any feasible parameter value, then both the MLE and the Bayesian estimate converge to the same value in the limit; both almost surely converge to the underlying distribution P.
But the ML and Bayesian approaches behave differently when the number of samples is small.

43 All Relative Frequencies Are Not Equi-Probable
In practice we might want priors that express our beliefs regarding the parameter to be estimated.
For example, we might want a prior that assigns a higher probability to parameter values that describe a fair coin than it does to an unfair coin.
The beta distribution allows us to capture such prior beliefs.

44 Beta Distribution
Gamma function:

Γ(x) = ∫₀^∞ t^{x−1} e^{−t} dt

The integral converges if and only if x > 0. If x is an integer greater than 0, it can be shown that Γ(x) = (x − 1)!, so Γ(x + 1) = x Γ(x).
The beta density function with parameters (a, b), N = a + b, where a, b are real numbers > 0, is

p(θ) = beta(θ; a, b) = [Γ(N) / (Γ(a) Γ(b))] θ^{a−1} (1 − θ)^{b−1},  where 0 ≤ θ ≤ 1

45 Beta Distribution
If a, b are real numbers > 0, then

∫₀¹ θ^{a−1} (1 − θ)^{b−1} dθ = Γ(a) Γ(b) / Γ(a + b)

If θ has distribution beta(θ; a, b), then E[θ] = a / (a + b).
Let D = {x[1], ..., x[M]} be a sequence of i.i.d. samples from a binomial distribution, with N_H = s and N_T = t. Then we can show that if p(θ) = beta(θ; a, b),

p(θ | D) = beta(θ; a + s, b + t)

Update of the parameter θ with a beta prior based on data yields a beta posterior.
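A minimal sketch of this conjugate update; it reproduces the earlier worked example, where a uniform beta(1, 1) prior and 4 heads out of 5 tosses give a posterior mean of 5/7:

```python
def beta_posterior(a, b, n_heads, n_tails):
    """Conjugate update: beta(a, b) prior + binomial data -> beta(a + N_H, b + N_T)."""
    return a + n_heads, b + n_tails

def beta_mean(a, b):
    """E[theta] = a / (a + b) for a beta(a, b) distribution."""
    return a / (a + b)

# Uniform prior beta(1, 1), then observe 4 heads and 1 tail:
a_post, b_post = beta_posterior(1, 1, 4, 1)
print(a_post, b_post, beta_mean(a_post, b_post))  # 5 2 0.714...
```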

46 Conjugate Families
The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy.
Conjugate families are useful because:
- For many distributions we can represent them with hyperparameters
- They allow for sequential updates to obtain the posterior
- In many cases we have a closed-form solution for prediction
The beta prior is a conjugate family for the binomial likelihood.

47 Bayesian Prediction
Prior: beta(θ; a, b)
Data: D = {x[1], ..., x[M]}
Posterior: p(θ | D) = beta(θ; a + N_H, b + N_T)
Prediction:

P(x[M+1] = H | D) = (a + N_H) / (a + b + N_H + N_T) = (a + N_H) / (N + M)

48 Dirichlet Priors
Recall that the likelihood function is

L(Θ : D) = ∏_{k=1}^{K} θ_k^{N_k}

A Dirichlet prior with hyperparameters α_1, ..., α_K is defined as

P(Θ) = [Γ(Σ_k α_k) / ∏_k Γ(α_k)] ∏_{k=1}^{K} θ_k^{α_k − 1},  where θ_k ≥ 0 and Σ_k θ_k = 1

Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ ∏_k θ_k^{α_k − 1} ∏_k θ_k^{N_k} = ∏_k θ_k^{α_k + N_k − 1}

49 Dirichlet Priors
Dirichlet priors enable closed-form prediction based on multinomial samples.
If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

P(x[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_l α_l

Since the posterior is also Dirichlet, we get

P(x[M+1] = k | D) = ∫ θ_k P(Θ | D) dΘ = (α_k + N_k) / Σ_l (α_l + N_l)
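A small sketch of this closed-form predictive distribution; the outcome labels in the usage example are arbitrary:

```python
def dirichlet_predictive(alphas, counts):
    """P(next outcome = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l).

    alphas: dict outcome -> hyperparameter alpha_k
    counts: dict outcome -> observed count N_k
    """
    total = sum(alphas[k] + counts.get(k, 0) for k in alphas)
    return {k: (alphas[k] + counts.get(k, 0)) / total for k in alphas}

# Uniform Dirichlet(1, 1, 1) prior over three outcomes; 'c' is unseen but
# still gets nonzero predictive probability (no zero-probability problem).
print(dirichlet_predictive({"a": 1, "b": 1, "c": 1}, {"a": 3, "b": 2}))
```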

50 Intuition Behind Priors
The hyperparameters α_1, ..., α_K can be thought of as imaginary counts from our prior experience.
Equivalent sample size = α_1 + ... + α_K.
The larger the equivalent sample size, the more confident we are in our prior.

51 Effect of Priors
(Plots: prediction of P(H) after seeing data with N_H = 0.5 N_T, for different sample sizes.)
- Different strength α_H + α_T, fixed ratio α_H / α_T
- Fixed strength α_H + α_T, different ratio α_H / α_T

52 Effect of Priors
In real data, Bayesian estimates are less sensitive to noise in the data.
(Plot: P(X = H | D) against the number of tosses N, comparing the MLE with Dirichlet(0.5, 0.5), Dirichlet(1, 1), Dirichlet(5, 5), and Dirichlet(10, 10) priors, together with the toss results.)

53 Conjugate Families
The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy. The Dirichlet prior is a conjugate family for the multinomial likelihood.
Conjugate families are useful because:
- For many distributions we can represent them with hyperparameters
- They allow for sequential updates within the same representation
- In many cases we have a closed-form solution for prediction

54 Bayesian Estimation

P(x[M+1] | x[1], ..., x[M]) = P(x[M+1], x[1], ..., x[M]) / P(x[1], ..., x[M])
                            = ∫ P(x[M+1] | θ) P(θ | x[1], ..., x[M]) dθ

where the posterior is

P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])
(Posterior = Likelihood × Prior / Probability of the data)

55 Summary of Bayesian Estimation
- Treat the unknown parameters as random variables
- Assume a prior distribution for the unknown parameters
- Update the distribution of the parameters based on the data
- Use Bayes rule to make predictions

56 Maximum a Posteriori (MAP) Estimates
Reconciling the ML and Bayesian approaches:

Θ_MAP = argmax_Θ P(Θ | D)
      = argmax_Θ P(D | Θ) P(Θ) / P(D)
      = argmax_Θ P(D | Θ) P(Θ)
      = argmax_Θ L(Θ : D) P(Θ)

57 Maximum a Posteriori (MAP) Estimates

Θ_MAP = argmax_Θ L(Θ : D) P(Θ)

Like in Bayesian estimation, we treat the unknown parameters as random variables. But we estimate a single value for the parameter: the maximum a posteriori estimate, which corresponds to the most probable value of the parameter given the data, for a given choice of the prior.

58 Back to the Naïve Bayes Classifier
If P̂(a_l | ω_j) = 0 for some attribute value a_l, then P̂(ω_j) ∏_k P̂(a_k | ω_j) = 0.
If one of the attribute values has an estimated class-conditional probability of 0, it dominates all the other attribute values. When we have few examples, this is more likely.
Solution: use priors; e.g., assume each value to be equally likely unless the data indicates otherwise.

59 Decision Tree Classifiers
- Decision tree representation for modeling dependencies among input variables, using elements of information theory
- How to learn decision trees from data
- Over-fitting and how to minimize it
- How to deal with missing values in the data
- Learning decision trees from distributed data
- Learning decision trees at multiple levels of abstraction

60 Decision Tree Representation
In the simplest case:
- each internal node tests on an attribute
- each branch corresponds to an attribute value
- each leaf node corresponds to a class label
In general:
- each internal node corresponds to a test on input instances, with mutually exclusive and exhaustive outcomes; tests may be univariate or multivariate
- each branch corresponds to an outcome of a test
- each leaf node corresponds to a class label

61 Decision Tree Representation
Data set: four examples described by binary attributes x and y, with class label c ∈ {A, B}. Two decision trees are consistent with the data: Tree 1, which classifies by testing x alone, and Tree 2, which tests y first and then x along each branch.
Should we choose Tree 1 or Tree 2? Why?

62 Decision Tree Representation
Any Boolean function can be represented by a decision tree.
Any function f : A_1 × A_2 × ... × A_n → C, where each A_i is the domain of the i-th attribute and C is a discrete set of values (class labels), can be represented by a decision tree.
In general, the inputs need not be discrete-valued.

63 Learning Decision Tree Classifiers
Decision trees are especially well suited for representing simple rules for classifying instances that are described by discrete attribute values.
Decision tree learning algorithms:
- Implement Ockham's razor as a preference bias: simpler decision trees are preferred over more complex trees
- Are relatively efficient: linear in the size of the decision tree and the size of the data set
- Produce comprehensible results
- Are often among the first to be tried on a new data set

64 Learning Decision Tree Classifiers
Ockham's razor recommends that we pick the simplest decision tree that is consistent with the training set.
The simplest tree is one that takes the fewest bits to encode (why? information theory).
There are far too many trees that are consistent with a training set, and searching for the simplest tree that is consistent with the training set is not typically computationally feasible.
Solution:
- Use a greedy algorithm: not guaranteed to find the simplest tree, but works well in practice
- Or restrict the space of hypotheses to a subset of simple trees

65 Information: Some Intuitions
- Information reduces uncertainty
- Information is relative to what you already know
- The information content of a message is related to how surprising the message is
- Information depends on context

66 Digression: Information and Uncertainty
Sender → Message → Receiver
You are stuck inside. You send me out to report back to you on what the weather is like. I do not lie, so you trust me. You and I are both generally familiar with the weather in Iowa.
- On a July afternoon in Iowa, I walk into the room and tell you it is hot outside.
- On a January afternoon in Iowa, I walk into the room and tell you it is hot outside.

67 Digression: Information and Uncertainty
Sender → Message → Receiver
How much information does a message contain?
If my message to you describes a scenario that you expect with certainty, the information content of the message for you is zero.
The more surprising the message to the receiver, the greater the amount of information conveyed by the message.
What does it mean for a message to be surprising?

68 Digression: Information and Uncertainty
Suppose I have a coin with heads on both sides, and you know that I have a coin with heads on both sides. I toss the coin, and without showing you the outcome, tell you that it came up heads. How much information did I give you?
Suppose I have a fair coin, and you know that I have a fair coin. I toss the coin, and without showing you the outcome, tell you that it came up heads. How much information did I give you?

69 Information
Without loss of generality, assume that messages are binary: made of 0s and 1s.
Conveying the outcome of a fair coin toss requires 1 bit of information: we need to identify one out of two equally likely outcomes.
Conveying the outcome of an experiment with 8 equally likely outcomes requires 3 bits, and so on.
Conveying an outcome that is certain takes 0 bits.
In general, if an outcome has probability p, the information content of the corresponding message is

I(p) = −log₂ p,  so I(1) = 0

70 Information is Subjective
Suppose there are 3 agents, Adrian, Oksana, and Jun, in a world where a die has been tossed. Adrian observes that the outcome is a 6 and whispers to Oksana that the outcome is even, but Jun knows nothing about the outcome.
The probability assigned by Oksana to the event "6" is a subjective measure of Oksana's belief about the state of the world.
- Information gained by Adrian by looking at the outcome of the die: log₂ 6 bits.
- Information conveyed by Adrian to Oksana: log₂ 6 − log₂ 3 = 1 bit.
- Information conveyed by Adrian to Jun: 0 bits.

71 Information and Shannon Entropy
Suppose we have a message that conveys the result of a random experiment with m possible discrete outcomes, with probabilities p_1, p_2, ..., p_m.
The expected information content of such a message is called the entropy of the probability distribution:

H(p_1, ..., p_m) = Σ_i p_i I(p_i)

where I(p_i) = −log₂ p_i provided p_i ≠ 0, and p_i I(p_i) = 0 otherwise.

72 Shannon's Entropy as a Measure of Information
Let P = (p_1, ..., p_n) be a discrete probability distribution. The entropy of the distribution P is given by

H(P) = Σ_{i=1}^{n} p_i log₂ (1 / p_i)

Examples:
  H(1/2, 1/2) = (1/2) log₂ 2 + (1/2) log₂ 2 = 1 bit
  H(1, 0) = 1 · I(1) + 0 · I(0) = 0 bits
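A minimal sketch of the entropy computation, using the 0 · log 0 = 0 convention from the previous slide:

```python
import math

def entropy(probs):
    """H(P) = -sum_i p_i log2 p_i, with 0 * log 0 taken to be 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin
print(entropy([1.0, 0.0]))   # 0.0 bits: a certain outcome
print(entropy([1/6] * 6))    # ~2.585 bits: a fair die, log2(6)
```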

73 Properties of Shannon's Entropy
- H(P) ≥ 0
- If there are N possible outcomes, H(P) ≤ log₂ N, with equality iff p_i = 1/N for all i
- If there exists i such that p_i = 1, then H(P) = 0
- H(P) is a continuous function of P

74 Shannon's Entropy as a Measure of Information
For any distribution P, H(P) is the optimal number of binary questions required, on average, to determine an outcome drawn from P.
We can extend these ideas to talk about how much information is conveyed by the observation of the outcome of one experiment about the possible outcomes of another (mutual information).
We can also quantify the difference between two probability distributions (Kullback-Leibler divergence, or relative entropy).

75 Coding Theory Perspective
Suppose you and I both know the distribution P, and I choose an outcome according to P.
Suppose I want to send you a message about the outcome. You and I could agree in advance on the questions, and I can simply send you the answers.
The optimal message length, on average, is H(P).
This generalizes to noisy communication.

76 Entropy of Random Variables and Sets of Random Variables
For a random variable X taking values a_1, ..., a_n:

H(X) = −Σ_i P(X = a_i) log₂ P(X = a_i)

If X is a set of random variables, then

H(X) = −Σ_x P(x) log₂ P(x)

where the sum is over all joint value assignments x.

77 Joint Entropy and Conditional Entropy
For random variables X and Y, the joint entropy is

H(X, Y) = −Σ_{x,y} P(x, y) log₂ P(x, y)

The conditional entropy of X given Y is

H(X | Y) = Σ_a P(Y = a) H(X | Y = a) = −Σ_{x,y} P(x, y) log₂ P(x | y)

78 Joint Entropy and Conditional Entropy
Some useful results:

H(X, Y) ≤ H(X) + H(Y)
H(Y | X) ≤ H(Y)

When do we have equality?
Chain rule for entropy:

H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

79 Example of Entropy Calculations
P(X = H, Y = H) = 0.2    P(X = H, Y = T) = 0.4
P(X = T, Y = H) = 0.3    P(X = T, Y = T) = 0.1

H(X, Y) = −0.2 log₂ 0.2 − 0.4 log₂ 0.4 − 0.3 log₂ 0.3 − 0.1 log₂ 0.1 ≈ 1.85
P(X = H) = 0.6, so H(X) ≈ 0.97
P(Y = H) = 0.5, so H(Y) = 1.0
P(Y = H | X = H) = 0.2/0.6    P(Y = T | X = H) = 0.4/0.6
P(Y = H | X = T) = 0.3/0.4    P(Y = T | X = T) = 0.1/0.4
H(Y | X) ≈ 0.88
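A short sketch verifying these numbers, using the chain rule H(Y | X) = H(X, Y) − H(X):

```python
import math

def H(probs):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution P(X, Y) from the slide
joint = {("H", "H"): 0.2, ("H", "T"): 0.4, ("T", "H"): 0.3, ("T", "T"): 0.1}

H_XY = H(joint.values())          # ~1.85
H_X = H([0.6, 0.4])               # marginal of X, ~0.97
# Conditional entropy via the chain rule: H(Y | X) = H(X, Y) - H(X)
print(H_XY, H_X, H_XY - H_X)      # ~1.85 0.97 0.88
```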

80 Mutual Information
For random variables X and Y, the average mutual information between X and Y is

I(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)

Or, by using the chain rule, I(X, Y) = H(X) + H(Y) − H(X, Y).
In terms of the probability distributions,

I(X, Y) = Σ_{a,b} P(X = a, Y = b) log₂ [P(X = a, Y = b) / (P(X = a) P(Y = b))]

Question: when is I(X, Y) = 0?
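A small sketch of the probability-based formula, reusing the joint distribution from the previous slide; it agrees with H(Y) − H(Y | X) = 1.0 − 0.88:

```python
import math

def mutual_information(joint):
    """I(X, Y) = sum_{a,b} P(a, b) * log2( P(a, b) / (P(a) P(b)) ).

    joint: dict (a, b) -> P(X = a, Y = b)
    """
    p_x, p_y = {}, {}
    for (a, b), p in joint.items():
        p_x[a] = p_x.get(a, 0) + p
        p_y[b] = p_y.get(b, 0) + p
    return sum(p * math.log2(p / (p_x[a] * p_y[b]))
               for (a, b), p in joint.items() if p > 0)

joint = {("H", "H"): 0.2, ("H", "T"): 0.4, ("T", "H"): 0.3, ("T", "T"): 0.1}
print(mutual_information(joint))  # ~0.125 bits; I(X, Y) = 0 iff X, Y independent
```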

81 Relative Entropy
Let P and Q be two distributions over a random variable X.
The relative entropy (Kullback-Leibler distance) is a measure of the "distance" from P to Q:

D(P ‖ Q) = Σ_x P(x) log₂ [P(x) / Q(x)]

Note that D(P ‖ Q) ≠ D(Q ‖ P) in general, D(P ‖ Q) ≥ 0, and D(P ‖ P) = 0.
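A minimal sketch of the KL divergence; the two example distributions are arbitrary, and the usage line illustrates the asymmetry noted above:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log2( P(x) / Q(x) ).

    p, q: dicts mapping outcomes to probabilities; assumes Q(x) > 0
    wherever P(x) > 0.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}
print(kl_divergence(p, q), kl_divergence(q, p))  # asymmetric: ~0.74 vs ~0.53
```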
