CHAPTER 3

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION

Machine Learning
Copyright (c) 2015. Tom M. Mitchell. All rights reserved.
*DRAFT OF September 23, 2017*
*PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR'S PERMISSION*

This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to Tom.Mitchell@cmu.edu.

1 Learning Classifiers based on Bayes Rule

Here we consider the relationship between supervised learning, or function approximation problems, and Bayesian reasoning. We begin by considering how to design learning algorithms based on Bayes rule.

Consider a supervised learning problem in which we wish to approximate an unknown target function f : X -> Y, or equivalently P(Y|X). To begin, we will assume Y is a boolean-valued random variable, and X is a vector containing n boolean attributes. In other words, X = <X_1, X_2, ..., X_n>, where X_i is the boolean random variable denoting the ith attribute of X.

Applying Bayes rule, we see that P(Y = y_i | X) can be represented as

P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i) \, P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j) \, P(Y = y_j)}

where y_m denotes the mth possible value for Y, x_k denotes the kth possible vector value for X, and where the summation in the denominator is over all legal values of the random variable Y.

One way to learn P(Y|X) is to use the training data to estimate P(X|Y) and P(Y). We can then use these estimates, together with Bayes rule above, to determine P(Y|X = x_k) for any new instance x_k.

A NOTE ON NOTATION: We will consistently use upper case symbols (e.g., X) to refer to random variables, including both vector and non-vector variables. If X is a vector, then we use subscripts (e.g., X_i to refer to each random variable, or feature, in X). We use lower case symbols to refer to values of random variables (e.g., X_i = x_ij may refer to random variable X_i taking on its jth possible value). We will sometimes abbreviate by omitting variable names, for example abbreviating P(X_i = x_ij | Y = y_k) to P(x_ij | y_k). We will write E[X] to refer to the expected value of X. We use superscripts to index training examples (e.g., X_i^j refers to the value of the random variable X_i in the jth training example). We use \delta(x) to denote an indicator function whose value is 1 if its logical argument x is true, and whose value is 0 otherwise. We use the #D{x} operator to denote the number of elements in the set D that satisfy property x. We use a "hat" to indicate estimates; for example, \hat{\theta} indicates an estimated value of \theta.

1.1 Unbiased Learning of Bayes Classifiers is Impractical

If we are going to train a Bayes classifier by estimating P(X|Y) and P(Y), then it is reasonable to ask how much training data will be required to obtain reliable estimates of these distributions. Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(X), then allowing a teacher to label this example with its Y value.

A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value [1] when Y is a boolean variable. However, accurately estimating P(X|Y) typically requires many more examples. To see why, consider the number of parameters we must estimate when Y is boolean and X is a vector of n boolean attributes. In this case, we need to estimate a set of parameters

\theta_{ij} \equiv P(X = x_i \mid Y = y_j)

where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^{n+1} parameters. To calculate the exact number of required parameters, note for any fixed j, the sum over i of \theta_{ij} must be one. Therefore, for any particular value y_j, and the 2^n possible values of x_i, we need compute only 2^n - 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n - 1) such \theta_{ij} parameters.

[1] Why? See Chapter 5 of edition 1 of Machine Learning.
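To make the count concrete, a quick evaluation of this expression (a worked example added here for illustration) shows how fast it grows with the number of attributes n:

2(2^{10} - 1) = 2046, \qquad 2(2^{30} - 1) = 2^{31} - 2 = 2{,}147{,}483{,}646 \approx 2.1 \times 10^{9}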

Unfortunately, this corresponds to two distinct parameters for each of the distinct instances in the instance space for X. Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times! This is clearly unrealistic in most practical learning domains. For example, if X is a vector containing 30 boolean features, then we will need to estimate more than 2 billion parameters.

2 Naive Bayes Algorithm

Given the intractable sample complexity for learning Bayesian classifiers, we must look for ways to reduce this complexity. The Naive Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modeling P(X|Y), from our original 2(2^n - 1) to just 2n.

2.1 Conditional Independence

Definition: Given three sets of random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the probability distribution governing X is independent of the value of Y given Z; that is

(\forall i, j, k) \quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)

As an example, consider three boolean random variables to describe the current weather: Rain, Thunder and Lightning. We might reasonably assert that Thunder is independent of Rain given Lightning. Because we know Lightning causes Thunder, once we know whether or not there is Lightning, no additional information about Thunder is provided by the value of Rain. Of course there is a clear dependence of Thunder on Rain in general, but there is no conditional dependence once we know the value of Lightning. Although X, Y and Z are each single random variables in this example, more generally the definition applies to sets of random variables. For example, we might assert that variables {A, B} are conditionally independent of {C, D} given variables {E, F}.

2.2 Derivation of Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes rule and a set of conditional independence assumptions. Given the goal of learning P(Y|X) where X = <X_1, ..., X_n>, the Naive Bayes algorithm makes the assumption that each X_i is conditionally independent of each of the other X_k's given Y, and also independent of each subset of the other X_k's given Y. The value of this assumption is that it dramatically simplifies the representation of P(X|Y), and the problem of estimating it from the training data. Consider, for example, the case where X = <X_1, X_2>.

In this case,

P(X \mid Y) = P(X_1, X_2 \mid Y)
            = P(X_1 \mid X_2, Y) \, P(X_2 \mid Y)
            = P(X_1 \mid Y) \, P(X_2 \mid Y)

where the second line follows from a general property of probabilities, and the third line follows directly from our above definition of conditional independence. More generally, when X contains n attributes which satisfy the conditional independence assumption, we have

P(X_1 \ldots X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)     (1)

Notice that when Y and the X_i are boolean variables, we need only 2n parameters to define P(X_i = x_ik | Y = y_j) for the necessary i, j, k. This is a dramatic reduction compared to the 2(2^n - 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.

Let us now derive the Naive Bayes algorithm, assuming in general that Y is any discrete-valued variable, and the attributes X_1 ... X_n are any discrete or real-valued attributes. Our goal is to train a classifier that will output the probability distribution over possible values of Y, for each new instance X that we ask it to classify. The expression for the probability that Y will take on its kth possible value, according to Bayes rule, is

P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k) \, P(X_1 \ldots X_n \mid Y = y_k)}{\sum_j P(Y = y_j) \, P(X_1 \ldots X_n \mid Y = y_j)}

where the sum is taken over all possible values y_j of Y. Now, assuming the X_i are conditionally independent given Y, we can use equation (1) to rewrite this as

P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}     (2)

Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a new instance X^new = <X_1 ... X_n>, this equation shows how to calculate the probability that Y will take on any given value, given the observed attribute values of X^new and given the distributions P(Y) and P(X_i|Y) estimated from the training data. If we are interested only in the most probable value of Y, then we have the Naive Bayes classification rule:

Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}

which simplifies to the following (because the denominator does not depend on y_k):

Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)     (3)
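To make equation (3) concrete, the following is a minimal sketch of the Naive Bayes decision rule, added for illustration and not part of the original text. It assumes the distributions P(Y) and P(X_i | Y) have already been estimated and are supplied as plain Python dictionaries; the variable names are made up, and the rule is applied in log space, which leaves the argmax unchanged but avoids numerical underflow as n grows.

```python
import math

def naive_bayes_classify(x, priors, likelihoods):
    """Return the most probable class label y_k under equation (3).

    x           -- list of observed attribute values [x_1, ..., x_n]
    priors      -- dict mapping each class y_k to P(Y = y_k)
    likelihoods -- dict mapping (i, x_i, y_k) to P(X_i = x_i | Y = y_k)
    """
    best_label, best_log_score = None, -math.inf
    for y_k, prior in priors.items():
        # log P(Y = y_k) + sum_i log P(X_i = x_i | Y = y_k)
        log_score = math.log(prior)
        for i, x_i in enumerate(x):
            log_score += math.log(likelihoods[(i, x_i, y_k)])
        if log_score > best_log_score:
            best_label, best_log_score = y_k, log_score
    return best_label

# Tiny illustrative example with two boolean attributes and a boolean Y.
priors = {0: 0.6, 1: 0.4}
likelihoods = {(0, 0, 0): 0.7, (0, 1, 0): 0.3, (1, 0, 0): 0.8, (1, 1, 0): 0.2,
               (0, 0, 1): 0.2, (0, 1, 1): 0.8, (1, 0, 1): 0.5, (1, 1, 1): 0.5}
print(naive_bayes_classify([1, 1], priors, likelihoods))  # prints the argmax class
```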

2.3 Naive Bayes for Discrete-Valued Inputs

To summarize, let us precisely define the Naive Bayes learning algorithm by describing the parameters that must be estimated, and how we may estimate them. When the n input attributes X_i each take on J possible discrete values, and Y is a discrete variable taking on K possible values, then our learning task is to estimate two sets of parameters. The first is

\theta_{ijk} \equiv P(X_i = x_{ij} \mid Y = y_k)     (4)

for each input attribute X_i, each of its possible values x_ij, and each of the possible values y_k of Y. Note there will be nJK such parameters, and note also that only n(J - 1)K of these are independent, given that they must satisfy 1 = \sum_j \theta_{ijk} for each pair of i, k values.

In addition, we must estimate parameters that define the prior probability over Y:

\pi_k \equiv P(Y = y_k)     (5)

Note there are K of these parameters, (K - 1) of which are independent. We can estimate these parameters using either maximum likelihood estimates (based on calculating the relative frequencies of the different events in the data), or using Bayesian MAP estimates (augmenting this observed data with prior distributions over the values of these parameters).

Maximum likelihood estimates for \theta_{ijk} given a set of training examples D are given by

\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}     (6)

where the #D{x} operator returns the number of elements in the set D that satisfy property x.

One danger of this maximum likelihood estimate is that it can sometimes result in \theta estimates of zero, if the data does not happen to contain any training examples satisfying the condition in the numerator. To avoid this, it is common to use a "smoothed" estimate which effectively adds in a number of additional "hallucinated" examples, and which assumes these hallucinated examples are spread evenly over the possible values of X_i. This smoothed estimate is given by

\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + \ell}{\#D\{Y = y_k\} + \ell J}     (7)

where J is the number of distinct values X_i can take on, and \ell determines the strength of this smoothing (i.e., the number of hallucinated examples is \ell J). This expression corresponds to a MAP estimate for \theta_{ijk} if we assume a Dirichlet prior distribution over the \theta_{ijk} parameters, with equal-valued parameters. If \ell is set to 1, this approach is called Laplace smoothing.

Maximum likelihood estimates for \pi_k are

\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}     (8)

where |D| denotes the number of elements in the training set D.
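The estimators (6)-(8) are simple counting operations over the training set. The sketch below is illustrative only; the function and argument names are invented for the example, and the smoothing constant \ell defaults to 1 (Laplace smoothing). It returns the smoothed \hat{\theta}_{ijk} of equation (7) and the maximum likelihood \hat{\pi}_k of equation (8).

```python
from collections import Counter

def estimate_naive_bayes(examples, labels, J, smoothing=1.0):
    """Return functions computing smoothed theta_hat (eq. 7) and MLE pi_hat (eq. 8).

    examples  -- list of attribute vectors, each a list of n discrete values
    labels    -- list of class labels y_k, one per example
    J         -- number of distinct values each attribute X_i can take on
    smoothing -- the constant l controlling the number lJ of hallucinated examples
    """
    class_counts = Counter(labels)                       # #D{Y = y_k}
    joint_counts = Counter()                             # #D{X_i = x_ij and Y = y_k}
    for x, y in zip(examples, labels):
        for i, x_ij in enumerate(x):
            joint_counts[(i, x_ij, y)] += 1

    def theta_hat(i, x_ij, y_k):
        # (count + l) / (class count + l * J), equation (7)
        return (joint_counts[(i, x_ij, y_k)] + smoothing) / (class_counts[y_k] + smoothing * J)

    def pi_hat(y_k):
        # relative frequency of class y_k, equation (8)
        return class_counts[y_k] / len(labels)

    return theta_hat, pi_hat

# Example: two boolean attributes, boolean labels.
theta, pi = estimate_naive_bayes([[1, 0], [1, 1], [0, 0]], [1, 1, 0], J=2)
print(pi(1), theta(0, 1, 1))   # P(Y = 1) and smoothed P(X_1 = 1 | Y = 1)
```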

Alternatively, we can obtain a smoothed estimate, or equivalently a MAP estimate based on a Dirichlet prior over the \pi_k parameters assuming equal priors on each \pi_k, by using the following expression

\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + \ell}{|D| + \ell K}     (9)

where K is the number of distinct values Y can take on, and \ell again determines the strength of the prior assumptions relative to the observed data D.

2.4 Naive Bayes for Continuous Inputs

In the case of continuous inputs X_i, we can of course continue to use equations (2) and (3) as the basis for designing a Naive Bayes classifier. However, when the X_i are continuous we must choose some other way to represent the distributions P(X_i|Y). One common approach is to assume that for each possible discrete value y_k of Y, the distribution of each continuous X_i is Gaussian, and is defined by a mean and standard deviation specific to X_i and y_k. In order to train such a Naive Bayes classifier we must therefore estimate the mean and standard deviation of each of these Gaussians:

\mu_{ik} = E[X_i \mid Y = y_k]     (10)

\sigma_{ik}^2 = E[(X_i - \mu_{ik})^2 \mid Y = y_k]     (11)

for each attribute X_i and each possible value y_k of Y. Note there are 2nK of these parameters, all of which must be estimated independently.

Of course we must also estimate the priors on Y as well:

\pi_k = P(Y = y_k)     (12)

The above model summarizes a Gaussian Naive Bayes classifier, which assumes that the data X is generated by a mixture of class-conditional (i.e., dependent on the value of the class variable Y) Gaussians. Furthermore, the Naive Bayes assumption introduces the additional constraint that the attribute values X_i are independent of one another within each of these mixture components. In particular problem settings where we have additional information, we might introduce additional assumptions to further restrict the number of parameters or the complexity of estimating them. For example, if we have reason to believe that noise in the observed X_i comes from a common source, then we might further assume that all of the \sigma_{ik} are identical, regardless of the attribute i or class k (see the homework exercise on this issue).
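Before turning to how these Gaussian parameters are estimated, here is a small sketch of how such a classifier makes a prediction, by plugging class-conditional Gaussian densities into equation (3). It is an illustration added to this copy, assumes NumPy is available, and takes already-estimated parameters as inputs with made-up names.

```python
import numpy as np

def gnb_predict(x, priors, mu, sigma2):
    """Return the most probable class index under equation (3) with Gaussian P(X_i|Y).

    x      -- array of shape (n,): one instance of continuous attributes
    priors -- array of shape (K,): pi_k = P(Y = y_k)
    mu     -- array of shape (K, n): mu[k, i] is the mean of X_i given Y = y_k
    sigma2 -- array of shape (K, n): the matching variances
    """
    # log N(x_i; mu_ik, sigma2_ik), summed over attributes i, plus log pi_k
    log_gauss = -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)
    log_scores = np.log(priors) + log_gauss.sum(axis=1)
    return int(np.argmax(log_scores))

# Illustrative parameters for two classes and two attributes.
priors = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma2 = np.ones((2, 2))
print(gnb_predict(np.array([2.5, 2.9]), priors, mu, sigma2))   # -> 1
```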

Again, we can use either maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for these parameters. The maximum likelihood estimator for \mu_{ik} is

\hat{\mu}_{ik} = \frac{\sum_j X_i^j \, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}     (13)

where the superscript j refers to the jth training example, and where \delta(Y = y_k) is 1 if Y = y_k and 0 otherwise. Note the role of \delta here is to select only those training examples for which Y = y_k.

The maximum likelihood estimator for \sigma_{ik}^2 is

\hat{\sigma}_{ik}^2 = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j (X_i^j - \hat{\mu}_{ik})^2 \, \delta(Y^j = y_k)     (14)

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is

\hat{\sigma}_{ik}^2 = \frac{1}{\left(\sum_j \delta(Y^j = y_k)\right) - 1} \sum_j (X_i^j - \hat{\mu}_{ik})^2 \, \delta(Y^j = y_k)     (15)
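Equations (13)-(15) translate directly into a few lines of array code. The following sketch is illustrative rather than the author's implementation; it assumes NumPy and uses a flag to switch between the biased MLE of (14) and the MVUE of (15).

```python
import numpy as np

def gnb_estimates(X, y, unbiased=False):
    """Estimate mu_hat[k, i] and sigma2_hat[k, i] for a Gaussian Naive Bayes model.

    X        -- array of shape (m, n): m training examples, n continuous attributes
    y        -- array of shape (m,) of discrete class labels
    unbiased -- if True use the MVUE of eq. (15); otherwise the MLE of eq. (14)
    """
    classes = np.unique(y)
    mu = np.zeros((len(classes), X.shape[1]))
    sigma2 = np.zeros_like(mu)
    for k, y_k in enumerate(classes):
        X_k = X[y == y_k]                      # the delta(Y^j = y_k) selection
        mu[k] = X_k.mean(axis=0)               # equation (13)
        denom = len(X_k) - 1 if unbiased else len(X_k)
        sigma2[k] = ((X_k - mu[k]) ** 2).sum(axis=0) / denom   # eq. (14) or (15)
    return classes, mu, sigma2

# Tiny synthetic example with two classes and two attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(gnb_estimates(X, y, unbiased=True)[1])   # per-class attribute means
```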

3 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X -> Y, or P(Y|X) in the case where Y is discrete-valued, and X = <X_1 ... X_n> is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (16)

and

P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (17)

Notice that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.

[Figure 1: Form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.]

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value y_k that maximizes P(Y = y_k | X). Put another way, we assign the label Y = 0 if the following condition holds:

1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)}

Substituting from equations (16) and (17), this becomes

1 < \exp(w_0 + \sum_{i=1}^n w_i X_i)

and taking the natural log of both sides we have a linear classification rule that assigns label Y = 0 if X satisfies

0 < w_0 + \sum_{i=1}^n w_i X_i     (18)

and assigns Y = 1 otherwise.

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we can view Logistic Regression as a closely related alternative to GNB, though the two can produce different results in many cases.

3.1 Form of P(Y|X) for Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the form used by Logistic Regression and summarized in equations (16) and (17). In particular, consider a GNB based on the following modeling assumptions:

- Y is boolean, governed by a Bernoulli distribution, with parameter \pi = P(Y = 1)
- X = <X_1 ... X_n>, where each X_i is a continuous random variable
- For each X_i, P(X_i | Y = y_k) is a Gaussian distribution of the form N(\mu_{ik}, \sigma_i)
- For all i and j \neq i, X_i and X_j are conditionally independent given Y

Note here we are assuming the standard deviations \sigma_i vary from attribute to attribute, but do not depend on Y.

We now derive the parametric form of P(Y|X) that follows from this set of GNB assumptions. In general, Bayes rule allows us to write

P(Y = 1 \mid X) = \frac{P(Y = 1) \, P(X \mid Y = 1)}{P(Y = 1) \, P(X \mid Y = 1) + P(Y = 0) \, P(X \mid Y = 0)}

Dividing both the numerator and denominator by the numerator yields:

P(Y = 1 \mid X) = \frac{1}{1 + \frac{P(Y = 0) \, P(X \mid Y = 0)}{P(Y = 1) \, P(X \mid Y = 1)}}

or equivalently

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{P(Y = 0) \, P(X \mid Y = 0)}{P(Y = 1) \, P(X \mid Y = 1)} \right)}

Because of our conditional independence assumption we can write this

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{P(Y = 0)}{P(Y = 1)} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} \right)}
                = \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} \right)}     (19)

Note the final step expresses P(Y = 0) and P(Y = 1) in terms of the binomial parameter \pi.

Now consider just the summation in the denominator of equation (19). Given our assumption that P(X_i | Y = y_k) is Gaussian, we can expand this term as follows:

\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}
  = \sum_i \ln \frac{ \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( \frac{-(X_i - \mu_{i0})^2}{2\sigma_i^2} \right) }{ \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( \frac{-(X_i - \mu_{i1})^2}{2\sigma_i^2} \right) }
  = \sum_i \ln \exp\left( \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{(X_i^2 - 2 X_i \mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2 X_i \mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{2 X_i (\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)     (20)

Note this expression is a linear weighted sum of the X_i's. Substituting expression (20) back into equation (19), we have

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right) \right)}     (21)

Or equivalently,

P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (22)

where the weights w_1 ... w_n are given by

w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}

and where

w_0 = \ln \frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}

Also we have

P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (23)
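The correspondence summarized in equations (21)-(23) can be made concrete in code. The sketch below is added for illustration; it assumes NumPy arrays and the shared per-attribute variances \sigma_i^2 of the GNB variant above, and converts fitted GNB parameters \pi, \mu_{i0}, \mu_{i1}, \sigma_i^2 into logistic regression weights before evaluating equation (22).

```python
import numpy as np

def gnb_to_logistic_weights(pi, mu0, mu1, sigma2):
    """Map GNB parameters to the logistic regression weights of eqs. (21)-(23).

    pi     -- scalar P(Y = 1)
    mu0    -- array of per-attribute means mu_{i0} for class Y = 0
    mu1    -- array of per-attribute means mu_{i1} for class Y = 1
    sigma2 -- array of per-attribute variances sigma_i^2 (shared across classes)
    """
    w = (mu0 - mu1) / sigma2                                               # w_i
    w0 = np.log((1 - pi) / pi) + ((mu1**2 - mu0**2) / (2 * sigma2)).sum()  # w_0
    return w0, w

def p_y1_given_x(x, w0, w):
    """Equation (22): P(Y = 1 | X) = 1 / (1 + exp(w_0 + sum_i w_i X_i))."""
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))

# Illustrative numbers only.
w0, w = gnb_to_logistic_weights(pi=0.5, mu0=np.array([0.0, 0.0]),
                                mu1=np.array([3.0, 3.0]), sigma2=np.array([1.0, 1.0]))
print(p_y1_given_x(np.array([3.0, 3.0]), w0, w))   # close to 1 for a class-1-like point
```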

3.2 Estimating Parameters for Logistic Regression

The above subsection proves that P(Y|X) can be expressed in the parametric form given by equations (16) and (17), under the Gaussian Naive Bayes assumptions detailed there. It also provides the value of the weights w_i in terms of the parameters estimated by the GNB classifier. Here we describe an alternative method for estimating these weights. We are interested in this alternative for two reasons. First, the form of P(Y|X) assumed by Logistic Regression holds in many problem settings beyond the GNB problem detailed in the above section, and we wish to have a general method for estimating it in a broader range of cases. Second, in many cases we may suspect the GNB assumptions are not perfectly satisfied. In this case we may wish to estimate the w_i parameters directly from the data, rather than going through the intermediate step of estimating the GNB parameters, which forces us to adopt its more stringent modeling assumptions.

One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood. The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their corresponding X values. We choose parameters W that satisfy

W \leftarrow \arg\max_W \prod_l P(Y^l \mid X^l, W)

where W = <w_0, w_1 ... w_n> is the vector of parameters to be estimated, Y^l denotes the observed value of Y in the lth training example, and X^l denotes the observed value of X in the lth training example. The expression to the right of the argmax is the conditional data likelihood. Here we include W in the conditional, to emphasize that the expression is a function of the W we are attempting to maximize.

Equivalently, we can work with the log of the conditional likelihood:

W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W)

This conditional data log likelihood, which we will denote l(W), can be written as

l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l = 0 \mid X^l, W)

Note here we are utilizing the fact that Y can take only values 0 or 1, so only one of the two terms in the expression will be non-zero for any given Y^l.

To keep our derivation consistent with common usage, we will in this section flip the assignment of the boolean variable Y so that we assign

P(Y = 0 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (24)

and

P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (25)

In this case, we can re-express the log of the conditional likelihood as:

l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l = 0 \mid X^l, W)
     = \sum_l Y^l \ln \frac{P(Y^l = 1 \mid X^l, W)}{P(Y^l = 0 \mid X^l, W)} + \ln P(Y^l = 0 \mid X^l, W)
     = \sum_l Y^l \left( w_0 + \sum_i^n w_i X_i^l \right) - \ln\left( 1 + \exp\left( w_0 + \sum_i^n w_i X_i^l \right) \right)

where X_i^l denotes the value of X_i for the lth training example. Note the superscript l is not related to the log likelihood function l(W).

Unfortunately, there is no closed form solution to maximizing l(W) with respect to W. Therefore, one common approach is to use gradient ascent, in which we work with the gradient, which is the vector of partial derivatives. The ith component of the vector gradient has the form

\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)

where \hat{P}(Y^l | X^l, W) is the Logistic Regression prediction using equations (24) and (25) and the weights W. To accommodate weight w_0, we assume an imaginary X_0^l = 1 for all l. This expression for the derivative has an intuitive interpretation: the term inside the parentheses is simply the prediction error; that is, the difference between the observed Y^l and its predicted probability!

Note if Y^l = 1 then we wish for \hat{P}(Y^l = 1 | X^l, W) to be 1, whereas if Y^l = 0 then we prefer that \hat{P}(Y^l = 1 | X^l, W) be 0 (which makes \hat{P}(Y^l = 0 | X^l, W) equal to 1). This error term is multiplied by the value of X_i^l, which accounts for the magnitude of the w_i X_i^l term in making this prediction.

Given this formula for the derivative of each w_i, we can use standard gradient ascent to optimize the weights W. Beginning with initial weights of zero, we repeatedly update the weights in the direction of the gradient, on each iteration changing every weight w_i according to

w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)

where \eta is a small constant (e.g., 0.01) which determines the step size. Because the conditional log likelihood l(W) is a concave function in W, this gradient ascent procedure will converge to a global maximum. Gradient ascent is described in greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where computational efficiency is important it is common to use a variant of gradient ascent called conjugate gradient ascent, which often converges more quickly.
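A minimal batch implementation of this gradient ascent procedure might look as follows. This is an illustrative sketch rather than the author's code: it assumes NumPy, uses the flipped convention of equations (24) and (25), starts from zero weights, and runs a fixed number of iterations with a fixed step size \eta.

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.01, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X -- array of shape (m, n) of training inputs
    y -- array of shape (m,) of 0/1 labels
    Returns the weight vector W = (w_0, w_1, ..., w_n).
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # imaginary X_0 = 1 for w_0
    W = np.zeros(X1.shape[1])                       # begin with weights of zero
    for _ in range(n_iters):
        # P_hat(Y = 1 | X, W) under the flipped convention of eqs. (24)-(25)
        p1 = 1.0 / (1.0 + np.exp(-X1 @ W))
        gradient = X1.T @ (y - p1)                  # sum_l X_i^l (Y^l - P_hat)
        W = W + eta * gradient                      # ascend the log likelihood
    return W

# Tiny separable example: one attribute, labels follow its sign.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(train_logistic_regression(X, y))
```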

3.3 Regularization in Logistic Regression

Overfitting the training data is a problem that can arise in Logistic Regression, especially when data is very high dimensional and training data is sparse. One approach to reducing overfitting is regularization, in which we create a modified "penalized log likelihood function," which penalizes large values of W. One approach is to use the penalized log likelihood function

W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W) - \frac{\lambda}{2} \|W\|^2

which adds a penalty proportional to the squared magnitude of W. Here \lambda is a constant that determines the strength of this penalty term.

Modifying our objective by adding in this penalty term gives us a new objective to maximize. It is easy to show that maximizing it corresponds to calculating the MAP estimate for W under the assumption that the prior distribution P(W) is a Normal distribution with mean zero, and a variance related to 1/\lambda. Notice that in general, the MAP estimate for W involves optimizing the objective

\sum_l \ln P(Y^l \mid X^l, W) + \ln P(W)

and if P(W) is a zero mean Gaussian distribution, then \ln P(W) yields a term proportional to \|W\|^2.

Given this penalized log likelihood function, it is easy to rederive the gradient ascent rule. The derivative of this penalized log likelihood function is similar to our earlier derivative, with one additional penalty term

\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right) - \lambda w_i

which gives us the modified gradient ascent rule

w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right) - \eta \lambda w_i     (26)

In cases where we have prior knowledge about likely values for specific w_i, it is possible to derive a similar penalty term by using a Normal prior on W with a non-zero mean.

3.4 Logistic Regression for Functions with Many Discrete Values

Above we considered using Logistic Regression to learn P(Y|X) only for the case where Y is a boolean variable. More generally, if Y can take on any of the discrete values {y_1, ..., y_K}, then the form of P(Y = y_k | X) for Y = y_1, Y = y_2, ... Y = y_{K-1} is:

P(Y = y_k \mid X) = \frac{\exp(w_{k0} + \sum_{i=1}^n w_{ki} X_i)}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}     (27)

When Y = y_K, it is

P(Y = y_K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}     (28)

Here w_{ji} denotes the weight associated with the jth class Y = y_j and with input X_i. It is easy to see that our earlier expressions for the case where Y is boolean (equations (16) and (17)) are a special case of the above expressions. Note also that the form of the expression for P(Y = y_K | X) assures that \sum_{k=1}^K P(Y = y_k \mid X) = 1.

The primary difference between these expressions and those for boolean Y is that when Y takes on K possible values, we construct K - 1 different linear expressions to capture the distributions for the different values of Y. The distribution for the final, Kth, value of Y is simply one minus the probabilities of the first K - 1 values.

In this case, the gradient ascent rule with regularization becomes:

w_{ji} \leftarrow w_{ji} + \eta \sum_l X_i^l \left( \delta(Y^l = y_j) - \hat{P}(Y^l = y_j \mid X^l, W) \right) - \eta \lambda w_{ji}     (29)

where \delta(Y^l = y_j) = 1 if the lth training value, Y^l, is equal to y_j, and \delta(Y^l = y_j) = 0 otherwise. Note our earlier learning rule, equation (26), is a special case of this new learning rule, when K = 2. As in the case for K = 2, the quantity inside the parentheses can be viewed as an error term which goes to zero if the estimated conditional probability \hat{P}(Y^l = y_j | X^l, W) perfectly matches the observed value of Y^l.
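For completeness, here is a sketch of one regularized gradient ascent step for the multiclass model of equations (27)-(29). It is illustrative only, with NumPy conventions as in the earlier sketches and names invented for the example: the weight matrix stores one row of weights per class y_1, ..., y_{K-1}, and the Kth class is handled implicitly as in equation (28).

```python
import numpy as np

def multiclass_probabilities(X1, W):
    """Eqs. (27)-(28): rows of W are (w_j0, ..., w_jn) for classes y_1 .. y_{K-1}.

    X1 -- array of shape (m, n+1) with a leading column of ones
    W  -- array of shape (K-1, n+1)
    Returns an (m, K) array of P(Y = y_k | X), with the last column for y_K.
    """
    scores = np.exp(X1 @ W.T)                        # exp(w_j0 + sum_i w_ji X_i)
    denom = 1.0 + scores.sum(axis=1, keepdims=True)  # 1 + sum_j exp(...)
    return np.hstack([scores / denom, 1.0 / denom])  # classes y_1..y_{K-1}, then y_K

def regularized_ascent_step(X1, y, W, eta=0.01, lam=0.1):
    """One application of update rule (29) to every weight w_ji.

    y -- integer labels in {0, ..., K-1}, with K-1 standing for y_K
    """
    K = W.shape[0] + 1
    probs = multiclass_probabilities(X1, W)          # P_hat(Y = y_j | X, W)
    delta = np.eye(K)[y][:, :K - 1]                  # delta(Y^l = y_j) for j < K
    gradient = (delta - probs[:, :K - 1]).T @ X1     # sum_l X_i^l (delta - P_hat)
    return W + eta * gradient - eta * lam * W        # equation (29)

# One illustrative step with K = 3 classes and two attributes.
X1 = np.hstack([np.ones((4, 1)), np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [0.5, 0.5]])])
y = np.array([0, 1, 2, 0])
print(regularized_ascent_step(X1, y, W=np.zeros((2, 3))))
```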

4 Relationship Between Naive Bayes Classifiers and Logistic Regression

To summarize, Logistic Regression directly estimates the parameters of P(Y|X), whereas Naive Bayes directly estimates parameters for P(Y) and P(X|Y). We often call the former a discriminative classifier, and the latter a generative classifier. We showed above that the assumptions of one variant of a Gaussian Naive Bayes classifier imply the parametric form of P(Y|X) used in Logistic Regression. Furthermore, we showed that the parameters w_i in Logistic Regression can be expressed in terms of the Gaussian Naive Bayes parameters. In fact, if the GNB assumptions hold, then asymptotically (as the number of training examples grows toward infinity) the GNB and Logistic Regression converge toward identical classifiers.

The two algorithms also differ in interesting ways:

- When the GNB modeling assumptions do not hold, Logistic Regression and GNB typically learn different classifier functions. In this case, the asymptotic (as the number of training examples approaches infinity) classification accuracy for Logistic Regression is often better than the asymptotic accuracy of GNB. Although Logistic Regression is consistent with the Naive Bayes assumption that the input features X_i are conditionally independent given Y, it is not rigidly tied to this assumption as is Naive Bayes. Given data that disobeys this assumption, the conditional likelihood maximization algorithm for Logistic Regression will adjust its parameters to maximize the fit to (the conditional likelihood of) the data, even if the resulting parameters are inconsistent with the Naive Bayes parameter estimates.

- GNB and Logistic Regression converge toward their asymptotic accuracies at different rates. As Ng & Jordan (2002) show, GNB parameter estimates converge toward their asymptotic values in order log n examples, where n is the dimension of X. In contrast, Logistic Regression parameter estimates converge more slowly, requiring order n examples. The authors also show that in several data sets Logistic Regression outperforms GNB when many training examples are available, but GNB outperforms Logistic Regression when training data is scarce.

5 What You Should Know

The main points of this chapter include:

- We can use Bayes rule as the basis for designing learning algorithms (function approximators), as follows: Given that we wish to learn some target function f : X -> Y, or equivalently, P(Y|X), we use the training data to learn estimates of P(X|Y) and P(Y). New X examples can then be classified using these estimated probability distributions, plus Bayes rule. This type of classifier is called a generative classifier, because we can view the distribution P(X|Y) as describing how to generate random instances X conditioned on the target attribute Y.

- Learning Bayes classifiers typically requires an unrealistic number of training examples (i.e., more than |X| training examples where X is the instance space) unless some form of prior assumption is made about the form of P(X|Y). The Naive Bayes classifier assumes all attributes describing X are conditionally independent given Y. This assumption dramatically reduces the number of parameters that must be estimated to learn the classifier. Naive Bayes is a widely used learning algorithm, for both discrete and continuous X.

- When X is a vector of discrete-valued attributes, Naive Bayes learning algorithms can be viewed as linear classifiers; that is, every such Naive Bayes classifier corresponds to a hyperplane decision surface in X. The same statement holds for Gaussian Naive Bayes classifiers if the variance of each feature is assumed to be independent of the class (i.e., if \sigma_{ik} = \sigma_i).

- Logistic Regression is a function approximation algorithm that uses training data to directly estimate P(Y|X), in contrast to Naive Bayes. In this sense, Logistic Regression is often referred to as a discriminative classifier because we can view the distribution P(Y|X) as directly discriminating the value of the target value Y for any given instance X.

- Logistic Regression is a linear classifier over X.

- The linear classifiers produced by Logistic Regression and Gaussian Naive Bayes are identical in the limit as the number of training examples approaches infinity, provided the Naive Bayes assumptions hold. However, if these assumptions do not hold, the Naive Bayes bias will cause it to perform less accurately than Logistic Regression, in the limit. Put another way, Naive Bayes is a learning algorithm with greater bias, but lower variance, than Logistic Regression. If this bias is appropriate given the actual data, Naive Bayes will be preferred. Otherwise, Logistic Regression will be preferred.

- We can view function approximation learning algorithms as statistical estimators of functions, or of conditional distributions P(Y|X). They estimate P(Y|X) from a sample of training data. As with other statistical estimators, it can be useful to characterize learning algorithms by their bias and expected variance, taken over different samples of training data.

6 Further Reading

Wasserman (2004) describes a reweighted least squares method for Logistic Regression. Ng and Jordan (2002) provide a theoretical and experimental comparison of the Naive Bayes classifier and Logistic Regression.

EXERCISES

1. At the beginning of the chapter we remarked that "A hundred training examples will usually suffice to obtain an estimate of P(Y) that is within a few percent of the correct value." Describe conditions under which the 95% confidence interval for our estimate of P(Y) will be ± ___ .

2. Consider learning a function X -> Y where Y is boolean, where X = <X_1, X_2>, and where X_1 is a boolean variable and X_2 a continuous variable. State the parameters that must be estimated to define a Naive Bayes classifier in this case. Give the formula for computing P(Y|X), in terms of these parameters and the feature values X_1 and X_2.

3. In Section 3 we showed that when Y is boolean and X = <X_1 ... X_n> is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y|X) is given by the logistic function with appropriate parameters W. In particular:

P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}

and

P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}

Consider instead the case where Y is boolean and X = <X_1 ... X_n> is a vector of boolean variables. Prove for this case also that P(Y|X) follows this same form (and hence that Logistic Regression is also the discriminative counterpart to a Naive Bayes generative classifier over boolean features).

Hints: Simple notation will help. Since the X_i are boolean variables, you need only one parameter to define P(X_i | Y = y_k). Define \theta_{i1} \equiv P(X_i = 1 | Y = 1), in which case P(X_i = 0 | Y = 1) = (1 - \theta_{i1}). Similarly, use \theta_{i0} to denote P(X_i = 1 | Y = 0). Notice with the above notation you can represent P(X_i | Y = 1) as follows

P(X_i \mid Y = 1) = \theta_{i1}^{X_i} (1 - \theta_{i1})^{(1 - X_i)}

Note when X_i = 1 the second term is equal to 1 because its exponent is zero. Similarly, when X_i = 0 the first term is equal to 1 because its exponent is zero.

4. (based on a suggestion from Sandra Zilles) This question asks you to consider the relationship between the MAP hypothesis and the Bayes optimal hypothesis. Consider a hypothesis space H defined over the set of instances X, and containing just two hypotheses, h1 and h2, with equal prior probabilities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training data D which we use to calculate the posterior probabilities P(h1|D) and P(h2|D).

Based on this we choose the MAP hypothesis, and calculate the Bayes optimal hypothesis. Suppose we find that the Bayes optimal classifier is not equal to either h1 or to h2, which is generally the case because the Bayes optimal hypothesis corresponds to averaging over all hypotheses in H. Now we create a new hypothesis h3 which is equal to the Bayes optimal classifier with respect to H, X and D; that is, h3 classifies each instance in X exactly the same as the Bayes optimal classifier for H and D. We now create a new hypothesis space H' = {h1, h2, h3}. If we train using the same training data D, will the MAP hypothesis from H' be h3? Will the Bayes optimal classifier with respect to H' be equivalent to h3? (Hint: the answer depends on the priors we assign to the hypotheses in H'. Can you give constraints on these priors that assure the answers will be "yes" or "no"?)

7 Acknowledgements

I very much appreciate receiving helpful comments on earlier drafts of this chapter from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.

REFERENCES

Mitchell, T. (1997). Machine Learning, McGraw Hill.

Ng, A.Y. & Jordan, M.I. (2002). On Discriminative vs. Generative Classifiers: A comparison of Logistic Regression and Naive Bayes, Neural Information Processing Systems.

Wasserman, L. (2004). All of Statistics, Springer-Verlag.


More information

G : Statistical Mechanics

G : Statistical Mechanics G25.2651: Statstca Mechancs Notes for Lecture 11 I. PRINCIPLES OF QUANTUM STATISTICAL MECHANICS The probem of quantum statstca mechancs s the quantum mechanca treatment of an N-partce system. Suppose the

More information

Machine learning: Density estimation

Machine learning: Density estimation CS 70 Foundatons of AI Lecture 3 Machne learnng: ensty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square ata: ensty estmaton {.. n} x a vector of attrbute values Objectve: estmate the model of

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics )

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics ) Ismor Fscher, 8//008 Stat 54 / -8.3 Summary Statstcs Measures of Center and Spread Dstrbuton of dscrete contnuous POPULATION Random Varable, numercal True center =??? True spread =???? parameters ( populaton

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA RESEARCH ARTICLE MOELING FIXE OS BETTING FOR FUTURE EVENT PREICTION Weyun Chen eartment of Educatona Informaton Technoogy, Facuty of Educaton, East Chna Norma Unversty, Shangha, CHINA {weyun.chen@qq.com}

More information

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System.

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System. Part 7: Neura Networ & earnng /2/05 Superved earnng Neura Networ and Bac-Propagaton earnng Produce dered output for tranng nput Generaze reaonaby & appropratey to other nput Good exampe: pattern recognton

More information

Journal of Multivariate Analysis

Journal of Multivariate Analysis Journa of Mutvarate Anayss 3 (04) 74 96 Contents sts avaabe at ScenceDrect Journa of Mutvarate Anayss journa homepage: www.esever.com/ocate/jmva Hgh-dmensona sparse MANOVA T. Tony Ca a, Yn Xa b, a Department

More information

Introduction to Dummy Variable Regressors. 1. An Example of Dummy Variable Regressors

Introduction to Dummy Variable Regressors. 1. An Example of Dummy Variable Regressors ECONOMICS 5* -- Introducton to Dummy Varable Regressors ECON 5* -- Introducton to NOTE Introducton to Dummy Varable Regressors. An Example of Dummy Varable Regressors A model of North Amercan car prces

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry Quantum Runge-Lenz ector and the Hydrogen Atom, the hdden SO(4) symmetry Pasca Szrftgser and Edgardo S. Cheb-Terrab () Laboratore PhLAM, UMR CNRS 85, Unversté Le, F-59655, France () Mapesoft Let's consder

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

22.51 Quantum Theory of Radiation Interactions

22.51 Quantum Theory of Radiation Interactions .51 Quantum Theory of Radaton Interactons Fna Exam - Soutons Tuesday December 15, 009 Probem 1 Harmonc oscator 0 ponts Consder an harmonc oscator descrbed by the Hamtonan H = ω(nˆ + ). Cacuate the evouton

More information

Semi-Supervised Learning

Semi-Supervised Learning Sem-Supervsed Learnng Consder the problem of Prepostonal Phrase Attachment. Buy car wth money ; buy car wth wheel There are several ways to generate features. Gven the lmted representaton, we can assume

More information

Robert Eisberg Second edition CH 09 Multielectron atoms ground states and x-ray excitations

Robert Eisberg Second edition CH 09 Multielectron atoms ground states and x-ray excitations Quantum Physcs 量 理 Robert Esberg Second edton CH 09 Multelectron atoms ground states and x-ray exctatons 9-01 By gong through the procedure ndcated n the text, develop the tme-ndependent Schroednger equaton

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information