CHAPTER 3

GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION

Machine Learning
Copyright (c) 2015. Tom M. Mitchell. All rights reserved.
*DRAFT OF September 23, 2017*
*PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR'S PERMISSION*

This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to Tom.Mitchell@cmu.edu.

1 Learning Classifiers based on Bayes Rule

Here we consider the relationship between supervised learning, or function approximation problems, and Bayesian reasoning. We begin by considering how to design learning algorithms based on Bayes rule.

Consider a supervised learning problem in which we wish to approximate an unknown target function f : X -> Y, or equivalently P(Y|X). To begin, we will assume Y is a boolean-valued random variable, and X is a vector containing n boolean attributes. In other words, X = <X_1, X_2, ..., X_n>, where X_i is the boolean random variable denoting the ith attribute of X.

Applying Bayes rule, we see that P(Y = y_i | X) can be represented as

P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i) \, P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j) \, P(Y = y_j)}

where y_m denotes the mth possible value for Y, x_k denotes the kth possible vector value for X, and where the summation in the denominator is over all legal values of the random variable Y.

One way to learn P(Y|X) is to use the training data to estimate P(X|Y) and P(Y). We can then use these estimates, together with Bayes rule above, to determine P(Y|X = x_k) for any new instance x_k.

A NOTE ON NOTATION: We will consistently use upper case symbols (e.g., X) to refer to random variables, including both vector and non-vector variables. If X is a vector, then we use subscripts (e.g., X_i to refer to each random variable, or feature, in X). We use lower case symbols to refer to values of random variables (e.g., X_i = x_ij may refer to random variable X_i taking on its jth possible value). We will sometimes abbreviate by omitting variable names, for example abbreviating P(X_i = x_ij | Y = y_k) to P(x_ij | y_k). We will write E[X] to refer to the expected value of X. We use superscripts to index training examples (e.g., X_i^j refers to the value of the random variable X_i in the jth training example). We use \delta(x) to denote an indicator function whose value is 1 if its logical argument x is true, and whose value is 0 otherwise. We use the #D{x} operator to denote the number of elements in the set D that satisfy property x. We use a "hat" to indicate estimates; for example, \hat{\theta} indicates an estimated value of \theta.

1.1 Unbiased Learning of Bayes Classifiers is Impractical

If we are going to train a Bayes classifier by estimating P(X|Y) and P(Y), then it is reasonable to ask how much training data will be required to obtain reliable estimates of these distributions. Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(X), then allowing a teacher to label this example with its Y value.

A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value [1] when Y is a boolean variable. However, accurately estimating P(X|Y) typically requires many more examples. To see why, consider the number of parameters we must estimate when Y is boolean and X is a vector of n boolean attributes. In this case, we need to estimate a set of parameters

\theta_{ij} \equiv P(X = x_i \mid Y = y_j)

where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^{n+1} parameters. To calculate the exact number of required parameters, note for any fixed j, the sum over i of \theta_{ij} must be one. Therefore, for any particular value y_j, and the 2^n possible values of x_i, we need compute only 2^n - 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n - 1) such \theta_{ij} parameters.

[1] Why? See Chapter 5 of edition 1 of Machine Learning.
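To make the count concrete, a quick evaluation of this expression (a worked example added here for illustration) shows how fast it grows with the number of attributes n:

2(2^{10} - 1) = 2046, \qquad 2(2^{30} - 1) = 2^{31} - 2 = 2{,}147{,}483{,}646 \approx 2.1 \times 10^{9}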

Unfortunately, this corresponds to two distinct parameters for each of the distinct instances in the instance space for X. Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times! This is clearly unrealistic in most practical learning domains. For example, if X is a vector containing 30 boolean features, then we will need to estimate more than 2 billion parameters.

2 Naive Bayes Algorithm

Given the intractable sample complexity for learning Bayesian classifiers, we must look for ways to reduce this complexity. The Naive Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modeling P(X|Y), from our original 2(2^n - 1) to just 2n.

2.1 Conditional Independence

Definition: Given three sets of random variables X, Y and Z, we say X is conditionally independent of Y given Z, if and only if the probability distribution governing X is independent of the value of Y given Z; that is

(\forall i, j, k) \quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)

As an example, consider three boolean random variables to describe the current weather: Rain, Thunder and Lightning. We might reasonably assert that Thunder is independent of Rain given Lightning. Because we know Lightning causes Thunder, once we know whether or not there is Lightning, no additional information about Thunder is provided by the value of Rain. Of course there is a clear dependence of Thunder on Rain in general, but there is no conditional dependence once we know the value of Lightning. Although X, Y and Z are each single random variables in this example, more generally the definition applies to sets of random variables. For example, we might assert that variables {A, B} are conditionally independent of {C, D} given variables {E, F}.

2.2 Derivation of Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes rule and a set of conditional independence assumptions. Given the goal of learning P(Y|X) where X = <X_1, ..., X_n>, the Naive Bayes algorithm makes the assumption that each X_i is conditionally independent of each of the other X_k's given Y, and also independent of each subset of the other X_k's given Y. The value of this assumption is that it dramatically simplifies the representation of P(X|Y), and the problem of estimating it from the training data. Consider, for example, the case where X = <X_1, X_2>.

In this case,

P(X \mid Y) = P(X_1, X_2 \mid Y)
            = P(X_1 \mid X_2, Y) \, P(X_2 \mid Y)
            = P(X_1 \mid Y) \, P(X_2 \mid Y)

where the second line follows from a general property of probabilities, and the third line follows directly from our above definition of conditional independence. More generally, when X contains n attributes which satisfy the conditional independence assumption, we have

P(X_1 \ldots X_n \mid Y) = \prod_{i=1}^{n} P(X_i \mid Y)     (1)

Notice that when Y and the X_i are boolean variables, we need only 2n parameters to define P(X_i = x_ik | Y = y_j) for the necessary i, j, k. This is a dramatic reduction compared to the 2(2^n - 1) parameters needed to characterize P(X|Y) if we make no conditional independence assumption.

Let us now derive the Naive Bayes algorithm, assuming in general that Y is any discrete-valued variable, and the attributes X_1 ... X_n are any discrete or real-valued attributes. Our goal is to train a classifier that will output the probability distribution over possible values of Y, for each new instance X that we ask it to classify. The expression for the probability that Y will take on its kth possible value, according to Bayes rule, is

P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k) \, P(X_1 \ldots X_n \mid Y = y_k)}{\sum_j P(Y = y_j) \, P(X_1 \ldots X_n \mid Y = y_j)}

where the sum is taken over all possible values y_j of Y. Now, assuming the X_i are conditionally independent given Y, we can use equation (1) to rewrite this as

P(Y = y_k \mid X_1 \ldots X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}     (2)

Equation (2) is the fundamental equation for the Naive Bayes classifier. Given a new instance X^new = <X_1 ... X_n>, this equation shows how to calculate the probability that Y will take on any given value, given the observed attribute values of X^new and given the distributions P(Y) and P(X_i|Y) estimated from the training data. If we are interested only in the most probable value of Y, then we have the Naive Bayes classification rule:

Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}

which simplifies to the following (because the denominator does not depend on y_k):

Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)     (3)
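To make equation (3) concrete, the following is a minimal sketch of the Naive Bayes decision rule, added for illustration and not part of the original text. It assumes the distributions P(Y) and P(X_i | Y) have already been estimated and are supplied as plain Python dictionaries; the variable names are made up, and the rule is applied in log space, which leaves the argmax unchanged but avoids numerical underflow as n grows.

```python
import math

def naive_bayes_classify(x, priors, likelihoods):
    """Return the most probable class label y_k under equation (3).

    x           -- list of observed attribute values [x_1, ..., x_n]
    priors      -- dict mapping each class y_k to P(Y = y_k)
    likelihoods -- dict mapping (i, x_i, y_k) to P(X_i = x_i | Y = y_k)
    """
    best_label, best_log_score = None, -math.inf
    for y_k, prior in priors.items():
        # log P(Y = y_k) + sum_i log P(X_i = x_i | Y = y_k)
        log_score = math.log(prior)
        for i, x_i in enumerate(x):
            log_score += math.log(likelihoods[(i, x_i, y_k)])
        if log_score > best_log_score:
            best_label, best_log_score = y_k, log_score
    return best_label

# Tiny illustrative example with two boolean attributes and a boolean Y.
priors = {0: 0.6, 1: 0.4}
likelihoods = {(0, 0, 0): 0.7, (0, 1, 0): 0.3, (1, 0, 0): 0.8, (1, 1, 0): 0.2,
               (0, 0, 1): 0.2, (0, 1, 1): 0.8, (1, 0, 1): 0.5, (1, 1, 1): 0.5}
print(naive_bayes_classify([1, 1], priors, likelihoods))  # prints the argmax class
```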

2.3 Naive Bayes for Discrete-Valued Inputs

To summarize, let us precisely define the Naive Bayes learning algorithm by describing the parameters that must be estimated, and how we may estimate them. When the n input attributes X_i each take on J possible discrete values, and Y is a discrete variable taking on K possible values, then our learning task is to estimate two sets of parameters. The first is

\theta_{ijk} \equiv P(X_i = x_{ij} \mid Y = y_k)     (4)

for each input attribute X_i, each of its possible values x_ij, and each of the possible values y_k of Y. Note there will be nJK such parameters, and note also that only n(J - 1)K of these are independent, given that they must satisfy 1 = \sum_j \theta_{ijk} for each pair of i, k values.

In addition, we must estimate parameters that define the prior probability over Y:

\pi_k \equiv P(Y = y_k)     (5)

Note there are K of these parameters, (K - 1) of which are independent. We can estimate these parameters using either maximum likelihood estimates (based on calculating the relative frequencies of the different events in the data), or using Bayesian MAP estimates (augmenting this observed data with prior distributions over the values of these parameters).

Maximum likelihood estimates for \theta_{ijk} given a set of training examples D are given by

\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\}}{\#D\{Y = y_k\}}     (6)

where the #D{x} operator returns the number of elements in the set D that satisfy property x.

One danger of this maximum likelihood estimate is that it can sometimes result in \theta estimates of zero, if the data does not happen to contain any training examples satisfying the condition in the numerator. To avoid this, it is common to use a "smoothed" estimate which effectively adds in a number of additional "hallucinated" examples, and which assumes these hallucinated examples are spread evenly over the possible values of X_i. This smoothed estimate is given by

\hat{\theta}_{ijk} = \hat{P}(X_i = x_{ij} \mid Y = y_k) = \frac{\#D\{X_i = x_{ij} \wedge Y = y_k\} + \ell}{\#D\{Y = y_k\} + \ell J}     (7)

where J is the number of distinct values X_i can take on, and \ell determines the strength of this smoothing (i.e., the number of hallucinated examples is \ell J). This expression corresponds to a MAP estimate for \theta_{ijk} if we assume a Dirichlet prior distribution over the \theta_{ijk} parameters, with equal-valued parameters. If \ell is set to 1, this approach is called Laplace smoothing.

Maximum likelihood estimates for \pi_k are

\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\}}{|D|}     (8)

where |D| denotes the number of elements in the training set D.
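The estimators (6)-(8) are simple counting operations over the training set. The sketch below is illustrative only; the function and argument names are invented for the example, and the smoothing constant \ell defaults to 1 (Laplace smoothing). It returns the smoothed \hat{\theta}_{ijk} of equation (7) and the maximum likelihood \hat{\pi}_k of equation (8).

```python
from collections import Counter

def estimate_naive_bayes(examples, labels, J, smoothing=1.0):
    """Return functions computing smoothed theta_hat (eq. 7) and MLE pi_hat (eq. 8).

    examples  -- list of attribute vectors, each a list of n discrete values
    labels    -- list of class labels y_k, one per example
    J         -- number of distinct values each attribute X_i can take on
    smoothing -- the constant l controlling the number lJ of hallucinated examples
    """
    class_counts = Counter(labels)                       # #D{Y = y_k}
    joint_counts = Counter()                             # #D{X_i = x_ij and Y = y_k}
    for x, y in zip(examples, labels):
        for i, x_ij in enumerate(x):
            joint_counts[(i, x_ij, y)] += 1

    def theta_hat(i, x_ij, y_k):
        # (count + l) / (class count + l * J), equation (7)
        return (joint_counts[(i, x_ij, y_k)] + smoothing) / (class_counts[y_k] + smoothing * J)

    def pi_hat(y_k):
        # relative frequency of class y_k, equation (8)
        return class_counts[y_k] / len(labels)

    return theta_hat, pi_hat

# Example: two boolean attributes, boolean labels.
theta, pi = estimate_naive_bayes([[1, 0], [1, 1], [0, 0]], [1, 1, 0], J=2)
print(pi(1), theta(0, 1, 1))   # P(Y = 1) and smoothed P(X_1 = 1 | Y = 1)
```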

Alternatively, we can obtain a smoothed estimate, or equivalently a MAP estimate based on a Dirichlet prior over the \pi_k parameters assuming equal priors on each \pi_k, by using the following expression

\hat{\pi}_k = \hat{P}(Y = y_k) = \frac{\#D\{Y = y_k\} + \ell}{|D| + \ell K}     (9)

where K is the number of distinct values Y can take on, and \ell again determines the strength of the prior assumptions relative to the observed data D.

2.4 Naive Bayes for Continuous Inputs

In the case of continuous inputs X_i, we can of course continue to use equations (2) and (3) as the basis for designing a Naive Bayes classifier. However, when the X_i are continuous we must choose some other way to represent the distributions P(X_i|Y). One common approach is to assume that for each possible discrete value y_k of Y, the distribution of each continuous X_i is Gaussian, and is defined by a mean and standard deviation specific to X_i and y_k. In order to train such a Naive Bayes classifier we must therefore estimate the mean and standard deviation of each of these Gaussians:

\mu_{ik} = E[X_i \mid Y = y_k]     (10)

\sigma_{ik}^2 = E[(X_i - \mu_{ik})^2 \mid Y = y_k]     (11)

for each attribute X_i and each possible value y_k of Y. Note there are 2nK of these parameters, all of which must be estimated independently.

Of course we must also estimate the priors on Y as well:

\pi_k = P(Y = y_k)     (12)

The above model summarizes a Gaussian Naive Bayes classifier, which assumes that the data X is generated by a mixture of class-conditional (i.e., dependent on the value of the class variable Y) Gaussians. Furthermore, the Naive Bayes assumption introduces the additional constraint that the attribute values X_i are independent of one another within each of these mixture components. In particular problem settings where we have additional information, we might introduce additional assumptions to further restrict the number of parameters or the complexity of estimating them. For example, if we have reason to believe that noise in the observed X_i comes from a common source, then we might further assume that all of the \sigma_{ik} are identical, regardless of the attribute i or class k (see the homework exercise on this issue).
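Before turning to how these Gaussian parameters are estimated, here is a small sketch of how such a classifier makes a prediction, by plugging class-conditional Gaussian densities into equation (3). It is an illustration added to this copy, assumes NumPy is available, and takes already-estimated parameters as inputs with made-up names.

```python
import numpy as np

def gnb_predict(x, priors, mu, sigma2):
    """Return the most probable class index under equation (3) with Gaussian P(X_i|Y).

    x      -- array of shape (n,): one instance of continuous attributes
    priors -- array of shape (K,): pi_k = P(Y = y_k)
    mu     -- array of shape (K, n): mu[k, i] is the mean of X_i given Y = y_k
    sigma2 -- array of shape (K, n): the matching variances
    """
    # log N(x_i; mu_ik, sigma2_ik), summed over attributes i, plus log pi_k
    log_gauss = -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)
    log_scores = np.log(priors) + log_gauss.sum(axis=1)
    return int(np.argmax(log_scores))

# Illustrative parameters for two classes and two attributes.
priors = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])
sigma2 = np.ones((2, 2))
print(gnb_predict(np.array([2.5, 2.9]), priors, mu, sigma2))   # -> 1
```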

Again, we can use either maximum likelihood estimates (MLE) or maximum a posteriori (MAP) estimates for these parameters. The maximum likelihood estimator for \mu_{ik} is

\hat{\mu}_{ik} = \frac{\sum_j X_i^j \, \delta(Y^j = y_k)}{\sum_j \delta(Y^j = y_k)}     (13)

where the superscript j refers to the jth training example, and where \delta(Y = y_k) is 1 if Y = y_k and 0 otherwise. Note the role of \delta here is to select only those training examples for which Y = y_k.

The maximum likelihood estimator for \sigma_{ik}^2 is

\hat{\sigma}_{ik}^2 = \frac{1}{\sum_j \delta(Y^j = y_k)} \sum_j (X_i^j - \hat{\mu}_{ik})^2 \, \delta(Y^j = y_k)     (14)

This maximum likelihood estimator is biased, so the minimum variance unbiased estimator (MVUE) is sometimes used instead. It is

\hat{\sigma}_{ik}^2 = \frac{1}{\left(\sum_j \delta(Y^j = y_k)\right) - 1} \sum_j (X_i^j - \hat{\mu}_{ik})^2 \, \delta(Y^j = y_k)     (15)
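Equations (13)-(15) translate directly into a few lines of array code. The following sketch is illustrative rather than the author's implementation; it assumes NumPy and uses a flag to switch between the biased MLE of (14) and the MVUE of (15).

```python
import numpy as np

def gnb_estimates(X, y, unbiased=False):
    """Estimate mu_hat[k, i] and sigma2_hat[k, i] for a Gaussian Naive Bayes model.

    X        -- array of shape (m, n): m training examples, n continuous attributes
    y        -- array of shape (m,) of discrete class labels
    unbiased -- if True use the MVUE of eq. (15); otherwise the MLE of eq. (14)
    """
    classes = np.unique(y)
    mu = np.zeros((len(classes), X.shape[1]))
    sigma2 = np.zeros_like(mu)
    for k, y_k in enumerate(classes):
        X_k = X[y == y_k]                      # the delta(Y^j = y_k) selection
        mu[k] = X_k.mean(axis=0)               # equation (13)
        denom = len(X_k) - 1 if unbiased else len(X_k)
        sigma2[k] = ((X_k - mu[k]) ** 2).sum(axis=0) / denom   # eq. (14) or (15)
    return classes, mu, sigma2

# Tiny synthetic example with two classes and two attributes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(gnb_estimates(X, y, unbiased=True)[1])   # per-class attribute means
```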

3 Logistic Regression

Logistic Regression is an approach to learning functions of the form f : X -> Y, or P(Y|X) in the case where Y is discrete-valued, and X = <X_1 ... X_n> is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final subsection we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (16)

and

P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (17)

Notice that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.

[Figure 1: Form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.]

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value y_k that maximizes P(Y = y_k | X). Put another way, we assign the label Y = 0 if the following condition holds:

1 < \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)}

Substituting from equations (16) and (17), this becomes

1 < \exp(w_0 + \sum_{i=1}^n w_i X_i)

and taking the natural log of both sides we have a linear classification rule that assigns label Y = 0 if X satisfies

0 < w_0 + \sum_{i=1}^n w_i X_i     (18)

and assigns Y = 1 otherwise.

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we can view Logistic Regression as a closely related alternative to GNB, though the two can produce different results in many cases.

3.1 Form of P(Y|X) for Gaussian Naive Bayes Classifier

Here we derive the form of P(Y|X) entailed by the assumptions of a Gaussian Naive Bayes (GNB) classifier, showing that it is precisely the form used by Logistic Regression and summarized in equations (16) and (17). In particular, consider a GNB based on the following modeling assumptions:

- Y is boolean, governed by a Bernoulli distribution, with parameter \pi = P(Y = 1)
- X = <X_1 ... X_n>, where each X_i is a continuous random variable
- For each X_i, P(X_i | Y = y_k) is a Gaussian distribution of the form N(\mu_{ik}, \sigma_i)
- For all i and j \neq i, X_i and X_j are conditionally independent given Y

Note here we are assuming the standard deviations \sigma_i vary from attribute to attribute, but do not depend on Y.

We now derive the parametric form of P(Y|X) that follows from this set of GNB assumptions. In general, Bayes rule allows us to write

P(Y = 1 \mid X) = \frac{P(Y = 1) \, P(X \mid Y = 1)}{P(Y = 1) \, P(X \mid Y = 1) + P(Y = 0) \, P(X \mid Y = 0)}

Dividing both the numerator and denominator by the numerator yields:

P(Y = 1 \mid X) = \frac{1}{1 + \frac{P(Y = 0) \, P(X \mid Y = 0)}{P(Y = 1) \, P(X \mid Y = 1)}}

or equivalently

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{P(Y = 0) \, P(X \mid Y = 0)}{P(Y = 1) \, P(X \mid Y = 1)} \right)}

Because of our conditional independence assumption we can write this

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{P(Y = 0)}{P(Y = 1)} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} \right)}
                = \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)} \right)}     (19)

Note the final step expresses P(Y = 0) and P(Y = 1) in terms of the binomial parameter \pi.

Now consider just the summation in the denominator of equation (19). Given our assumption that P(X_i | Y = y_k) is Gaussian, we can expand this term as follows:

\sum_i \ln \frac{P(X_i \mid Y = 0)}{P(X_i \mid Y = 1)}
  = \sum_i \ln \frac{ \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( \frac{-(X_i - \mu_{i0})^2}{2\sigma_i^2} \right) }{ \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left( \frac{-(X_i - \mu_{i1})^2}{2\sigma_i^2} \right) }
  = \sum_i \ln \exp\left( \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{(X_i - \mu_{i1})^2 - (X_i - \mu_{i0})^2}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{(X_i^2 - 2 X_i \mu_{i1} + \mu_{i1}^2) - (X_i^2 - 2 X_i \mu_{i0} + \mu_{i0}^2)}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{2 X_i (\mu_{i0} - \mu_{i1}) + \mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)
  = \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right)     (20)

Note this expression is a linear weighted sum of the X_i's. Substituting expression (20) back into equation (19), we have

P(Y = 1 \mid X) = \frac{1}{1 + \exp\left( \ln \frac{1 - \pi}{\pi} + \sum_i \left( \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2} X_i + \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2} \right) \right)}     (21)

Or equivalently,

P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (22)

where the weights w_1 ... w_n are given by

w_i = \frac{\mu_{i0} - \mu_{i1}}{\sigma_i^2}

and where

w_0 = \ln \frac{1 - \pi}{\pi} + \sum_i \frac{\mu_{i1}^2 - \mu_{i0}^2}{2\sigma_i^2}

Also we have

P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (23)
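The correspondence summarized in equations (21)-(23) can be made concrete in code. The sketch below is added for illustration; it assumes NumPy arrays and the shared per-attribute variances \sigma_i^2 of the GNB variant above, and converts fitted GNB parameters \pi, \mu_{i0}, \mu_{i1}, \sigma_i^2 into logistic regression weights before evaluating equation (22).

```python
import numpy as np

def gnb_to_logistic_weights(pi, mu0, mu1, sigma2):
    """Map GNB parameters to the logistic regression weights of eqs. (21)-(23).

    pi     -- scalar P(Y = 1)
    mu0    -- array of per-attribute means mu_{i0} for class Y = 0
    mu1    -- array of per-attribute means mu_{i1} for class Y = 1
    sigma2 -- array of per-attribute variances sigma_i^2 (shared across classes)
    """
    w = (mu0 - mu1) / sigma2                                               # w_i
    w0 = np.log((1 - pi) / pi) + ((mu1**2 - mu0**2) / (2 * sigma2)).sum()  # w_0
    return w0, w

def p_y1_given_x(x, w0, w):
    """Equation (22): P(Y = 1 | X) = 1 / (1 + exp(w_0 + sum_i w_i X_i))."""
    return 1.0 / (1.0 + np.exp(w0 + np.dot(w, x)))

# Illustrative numbers only.
w0, w = gnb_to_logistic_weights(pi=0.5, mu0=np.array([0.0, 0.0]),
                                mu1=np.array([3.0, 3.0]), sigma2=np.array([1.0, 1.0]))
print(p_y1_given_x(np.array([3.0, 3.0]), w0, w))   # close to 1 for a class-1-like point
```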

3.2 Estimating Parameters for Logistic Regression

The above subsection proves that P(Y|X) can be expressed in the parametric form given by equations (16) and (17), under the Gaussian Naive Bayes assumptions detailed there. It also provides the value of the weights w_i in terms of the parameters estimated by the GNB classifier. Here we describe an alternative method for estimating these weights. We are interested in this alternative for two reasons. First, the form of P(Y|X) assumed by Logistic Regression holds in many problem settings beyond the GNB problem detailed in the above section, and we wish to have a general method for estimating it in a broader range of cases. Second, in many cases we may suspect the GNB assumptions are not perfectly satisfied. In this case we may wish to estimate the w_i parameters directly from the data, rather than going through the intermediate step of estimating the GNB parameters, which forces us to adopt its more stringent modeling assumptions.

One reasonable approach to training Logistic Regression is to choose parameter values that maximize the conditional data likelihood. The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their corresponding X values. We choose parameters W that satisfy

W \leftarrow \arg\max_W \prod_l P(Y^l \mid X^l, W)

where W = <w_0, w_1 ... w_n> is the vector of parameters to be estimated, Y^l denotes the observed value of Y in the lth training example, and X^l denotes the observed value of X in the lth training example. The expression to the right of the argmax is the conditional data likelihood. Here we include W in the conditional, to emphasize that the expression is a function of the W we are attempting to maximize.

Equivalently, we can work with the log of the conditional likelihood:

W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W)

This conditional data log likelihood, which we will denote l(W), can be written as

l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l = 0 \mid X^l, W)

Note here we are utilizing the fact that Y can take only values 0 or 1, so only one of the two terms in the expression will be non-zero for any given Y^l.

To keep our derivation consistent with common usage, we will in this section flip the assignment of the boolean variable Y so that we assign

P(Y = 0 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (24)

and

P(Y = 1 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}     (25)

In this case, we can re-express the log of the conditional likelihood as:

l(W) = \sum_l Y^l \ln P(Y^l = 1 \mid X^l, W) + (1 - Y^l) \ln P(Y^l = 0 \mid X^l, W)
     = \sum_l Y^l \ln \frac{P(Y^l = 1 \mid X^l, W)}{P(Y^l = 0 \mid X^l, W)} + \ln P(Y^l = 0 \mid X^l, W)
     = \sum_l Y^l \left( w_0 + \sum_i^n w_i X_i^l \right) - \ln\left( 1 + \exp\left( w_0 + \sum_i^n w_i X_i^l \right) \right)

where X_i^l denotes the value of X_i for the lth training example. Note the superscript l is not related to the log likelihood function l(W).

Unfortunately, there is no closed form solution to maximizing l(W) with respect to W. Therefore, one common approach is to use gradient ascent, in which we work with the gradient, which is the vector of partial derivatives. The ith component of the vector gradient has the form

\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)

where \hat{P}(Y^l | X^l, W) is the Logistic Regression prediction using equations (24) and (25) and the weights W. To accommodate weight w_0, we assume an imaginary X_0^l = 1 for all l. This expression for the derivative has an intuitive interpretation: the term inside the parentheses is simply the prediction error; that is, the difference between the observed Y^l and its predicted probability!

Note if Y^l = 1 then we wish for \hat{P}(Y^l = 1 | X^l, W) to be 1, whereas if Y^l = 0 then we prefer that \hat{P}(Y^l = 1 | X^l, W) be 0 (which makes \hat{P}(Y^l = 0 | X^l, W) equal to 1). This error term is multiplied by the value of X_i^l, which accounts for the magnitude of the w_i X_i^l term in making this prediction.

Given this formula for the derivative of each w_i, we can use standard gradient ascent to optimize the weights W. Beginning with initial weights of zero, we repeatedly update the weights in the direction of the gradient, on each iteration changing every weight w_i according to

w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right)

where \eta is a small constant (e.g., 0.01) which determines the step size. Because the conditional log likelihood l(W) is a concave function in W, this gradient ascent procedure will converge to a global maximum. Gradient ascent is described in greater detail, for example, in Chapter 4 of Mitchell (1997). In many cases where computational efficiency is important it is common to use a variant of gradient ascent called conjugate gradient ascent, which often converges more quickly.
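A minimal batch implementation of this gradient ascent procedure might look as follows. This is an illustrative sketch rather than the author's code: it assumes NumPy, uses the flipped convention of equations (24) and (25), starts from zero weights, and runs a fixed number of iterations with a fixed step size \eta.

```python
import numpy as np

def train_logistic_regression(X, y, eta=0.01, n_iters=1000):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X -- array of shape (m, n) of training inputs
    y -- array of shape (m,) of 0/1 labels
    Returns the weight vector W = (w_0, w_1, ..., w_n).
    """
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # imaginary X_0 = 1 for w_0
    W = np.zeros(X1.shape[1])                       # begin with weights of zero
    for _ in range(n_iters):
        # P_hat(Y = 1 | X, W) under the flipped convention of eqs. (24)-(25)
        p1 = 1.0 / (1.0 + np.exp(-X1 @ W))
        gradient = X1.T @ (y - p1)                  # sum_l X_i^l (Y^l - P_hat)
        W = W + eta * gradient                      # ascend the log likelihood
    return W

# Tiny separable example: one attribute, labels follow its sign.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
print(train_logistic_regression(X, y))
```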

3.3 Regularization in Logistic Regression

Overfitting the training data is a problem that can arise in Logistic Regression, especially when data is very high dimensional and training data is sparse. One approach to reducing overfitting is regularization, in which we create a modified "penalized log likelihood function," which penalizes large values of W. One approach is to use the penalized log likelihood function

W \leftarrow \arg\max_W \sum_l \ln P(Y^l \mid X^l, W) - \frac{\lambda}{2} \|W\|^2

which adds a penalty proportional to the squared magnitude of W. Here \lambda is a constant that determines the strength of this penalty term.

Modifying our objective by adding in this penalty term gives us a new objective to maximize. It is easy to show that maximizing it corresponds to calculating the MAP estimate for W under the assumption that the prior distribution P(W) is a Normal distribution with mean zero, and a variance related to 1/\lambda. Notice that in general, the MAP estimate for W involves optimizing the objective

\sum_l \ln P(Y^l \mid X^l, W) + \ln P(W)

and if P(W) is a zero mean Gaussian distribution, then \ln P(W) yields a term proportional to \|W\|^2.

Given this penalized log likelihood function, it is easy to rederive the gradient ascent rule. The derivative of this penalized log likelihood function is similar to our earlier derivative, with one additional penalty term

\frac{\partial l(W)}{\partial w_i} = \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right) - \lambda w_i

which gives us the modified gradient ascent rule

w_i \leftarrow w_i + \eta \sum_l X_i^l \left( Y^l - \hat{P}(Y^l = 1 \mid X^l, W) \right) - \eta \lambda w_i     (26)

In cases where we have prior knowledge about likely values for specific w_i, it is possible to derive a similar penalty term by using a Normal prior on W with a non-zero mean.

3.4 Logistic Regression for Functions with Many Discrete Values

Above we considered using Logistic Regression to learn P(Y|X) only for the case where Y is a boolean variable. More generally, if Y can take on any of the discrete values {y_1, ..., y_K}, then the form of P(Y = y_k | X) for Y = y_1, Y = y_2, ... Y = y_{K-1} is:

P(Y = y_k \mid X) = \frac{\exp(w_{k0} + \sum_{i=1}^n w_{ki} X_i)}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}     (27)

When Y = y_K, it is

P(Y = y_K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_{j0} + \sum_{i=1}^n w_{ji} X_i)}     (28)

Here w_{ji} denotes the weight associated with the jth class Y = y_j and with input X_i. It is easy to see that our earlier expressions for the case where Y is boolean (equations (16) and (17)) are a special case of the above expressions. Note also that the form of the expression for P(Y = y_K | X) assures that \sum_{k=1}^K P(Y = y_k \mid X) = 1.

The primary difference between these expressions and those for boolean Y is that when Y takes on K possible values, we construct K - 1 different linear expressions to capture the distributions for the different values of Y. The distribution for the final, Kth, value of Y is simply one minus the probabilities of the first K - 1 values.

In this case, the gradient ascent rule with regularization becomes:

w_{ji} \leftarrow w_{ji} + \eta \sum_l X_i^l \left( \delta(Y^l = y_j) - \hat{P}(Y^l = y_j \mid X^l, W) \right) - \eta \lambda w_{ji}     (29)

where \delta(Y^l = y_j) = 1 if the lth training value, Y^l, is equal to y_j, and \delta(Y^l = y_j) = 0 otherwise. Note our earlier learning rule, equation (26), is a special case of this new learning rule, when K = 2. As in the case for K = 2, the quantity inside the parentheses can be viewed as an error term which goes to zero if the estimated conditional probability \hat{P}(Y^l = y_j | X^l, W) perfectly matches the observed value of Y^l.
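For completeness, here is a sketch of one regularized gradient ascent step for the multiclass model of equations (27)-(29). It is illustrative only, with NumPy conventions as in the earlier sketches and names invented for the example: the weight matrix stores one row of weights per class y_1, ..., y_{K-1}, and the Kth class is handled implicitly as in equation (28).

```python
import numpy as np

def multiclass_probabilities(X1, W):
    """Eqs. (27)-(28): rows of W are (w_j0, ..., w_jn) for classes y_1 .. y_{K-1}.

    X1 -- array of shape (m, n+1) with a leading column of ones
    W  -- array of shape (K-1, n+1)
    Returns an (m, K) array of P(Y = y_k | X), with the last column for y_K.
    """
    scores = np.exp(X1 @ W.T)                        # exp(w_j0 + sum_i w_ji X_i)
    denom = 1.0 + scores.sum(axis=1, keepdims=True)  # 1 + sum_j exp(...)
    return np.hstack([scores / denom, 1.0 / denom])  # classes y_1..y_{K-1}, then y_K

def regularized_ascent_step(X1, y, W, eta=0.01, lam=0.1):
    """One application of update rule (29) to every weight w_ji.

    y -- integer labels in {0, ..., K-1}, with K-1 standing for y_K
    """
    K = W.shape[0] + 1
    probs = multiclass_probabilities(X1, W)          # P_hat(Y = y_j | X, W)
    delta = np.eye(K)[y][:, :K - 1]                  # delta(Y^l = y_j) for j < K
    gradient = (delta - probs[:, :K - 1]).T @ X1     # sum_l X_i^l (delta - P_hat)
    return W + eta * gradient - eta * lam * W        # equation (29)

# One illustrative step with K = 3 classes and two attributes.
X1 = np.hstack([np.ones((4, 1)), np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [0.5, 0.5]])])
y = np.array([0, 1, 2, 0])
print(regularized_ascent_step(X1, y, W=np.zeros((2, 3))))
```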

4 Relationship Between Naive Bayes Classifiers and Logistic Regression

To summarize, Logistic Regression directly estimates the parameters of P(Y|X), whereas Naive Bayes directly estimates parameters for P(Y) and P(X|Y). We often call the former a discriminative classifier, and the latter a generative classifier. We showed above that the assumptions of one variant of a Gaussian Naive Bayes classifier imply the parametric form of P(Y|X) used in Logistic Regression. Furthermore, we showed that the parameters w_i in Logistic Regression can be expressed in terms of the Gaussian Naive Bayes parameters. In fact, if the GNB assumptions hold, then asymptotically (as the number of training examples grows toward infinity) the GNB and Logistic Regression converge toward identical classifiers.

The two algorithms also differ in interesting ways:

- When the GNB modeling assumptions do not hold, Logistic Regression and GNB typically learn different classifier functions. In this case, the asymptotic (as the number of training examples approaches infinity) classification accuracy for Logistic Regression is often better than the asymptotic accuracy of GNB. Although Logistic Regression is consistent with the Naive Bayes assumption that the input features X_i are conditionally independent given Y, it is not rigidly tied to this assumption as is Naive Bayes. Given data that disobeys this assumption, the conditional likelihood maximization algorithm for Logistic Regression will adjust its parameters to maximize the fit to (the conditional likelihood of) the data, even if the resulting parameters are inconsistent with the Naive Bayes parameter estimates.

- GNB and Logistic Regression converge toward their asymptotic accuracies at different rates. As Ng & Jordan (2002) show, GNB parameter estimates converge toward their asymptotic values in order log n examples, where n is the dimension of X. In contrast, Logistic Regression parameter estimates converge more slowly, requiring order n examples. The authors also show that in several data sets Logistic Regression outperforms GNB when many training examples are available, but GNB outperforms Logistic Regression when training data is scarce.

5 What You Should Know

The main points of this chapter include:

- We can use Bayes rule as the basis for designing learning algorithms (function approximators), as follows: Given that we wish to learn some target function f : X -> Y, or equivalently, P(Y|X), we use the training data to learn estimates of P(X|Y) and P(Y). New X examples can then be classified using these estimated probability distributions, plus Bayes rule. This type of classifier is called a generative classifier, because we can view the distribution P(X|Y) as describing how to generate random instances X conditioned on the target attribute Y.

- Learning Bayes classifiers typically requires an unrealistic number of training examples (i.e., more than |X| training examples where X is the instance space) unless some form of prior assumption is made about the form of P(X|Y). The Naive Bayes classifier assumes all attributes describing X are conditionally independent given Y. This assumption dramatically reduces the number of parameters that must be estimated to learn the classifier. Naive Bayes is a widely used learning algorithm, for both discrete and continuous X.

- When X is a vector of discrete-valued attributes, Naive Bayes learning algorithms can be viewed as linear classifiers; that is, every such Naive Bayes classifier corresponds to a hyperplane decision surface in X. The same statement holds for Gaussian Naive Bayes classifiers if the variance of each feature is assumed to be independent of the class (i.e., if \sigma_{ik} = \sigma_i).

- Logistic Regression is a function approximation algorithm that uses training data to directly estimate P(Y|X), in contrast to Naive Bayes. In this sense, Logistic Regression is often referred to as a discriminative classifier because we can view the distribution P(Y|X) as directly discriminating the value of the target value Y for any given instance X.

- Logistic Regression is a linear classifier over X.

- The linear classifiers produced by Logistic Regression and Gaussian Naive Bayes are identical in the limit as the number of training examples approaches infinity, provided the Naive Bayes assumptions hold. However, if these assumptions do not hold, the Naive Bayes bias will cause it to perform less accurately than Logistic Regression, in the limit. Put another way, Naive Bayes is a learning algorithm with greater bias, but lower variance, than Logistic Regression. If this bias is appropriate given the actual data, Naive Bayes will be preferred. Otherwise, Logistic Regression will be preferred.

- We can view function approximation learning algorithms as statistical estimators of functions, or of conditional distributions P(Y|X). They estimate P(Y|X) from a sample of training data. As with other statistical estimators, it can be useful to characterize learning algorithms by their bias and expected variance, taken over different samples of training data.

6 Further Reading

Wasserman (2004) describes a reweighted least squares method for Logistic Regression. Ng and Jordan (2002) provide a theoretical and experimental comparison of the Naive Bayes classifier and Logistic Regression.

EXERCISES

1. At the beginning of the chapter we remarked that "A hundred training examples will usually suffice to obtain an estimate of P(Y) that is within a few percent of the correct value." Describe conditions under which the 95% confidence interval for our estimate of P(Y) will be ± ___ .

2. Consider learning a function X -> Y where Y is boolean, where X = <X_1, X_2>, and where X_1 is a boolean variable and X_2 a continuous variable. State the parameters that must be estimated to define a Naive Bayes classifier in this case. Give the formula for computing P(Y|X), in terms of these parameters and the feature values X_1 and X_2.

3. In Section 3 we showed that when Y is boolean and X = <X_1 ... X_n> is a vector of continuous variables, then the assumptions of the Gaussian Naive Bayes classifier imply that P(Y|X) is given by the logistic function with appropriate parameters W. In particular:

P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}

and

P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_{i=1}^n w_i X_i)}{1 + \exp(w_0 + \sum_{i=1}^n w_i X_i)}

Consider instead the case where Y is boolean and X = <X_1 ... X_n> is a vector of boolean variables. Prove for this case also that P(Y|X) follows this same form (and hence that Logistic Regression is also the discriminative counterpart to a Naive Bayes generative classifier over boolean features).

Hints: Simple notation will help. Since the X_i are boolean variables, you need only one parameter to define P(X_i | Y = y_k). Define \theta_{i1} \equiv P(X_i = 1 | Y = 1), in which case P(X_i = 0 | Y = 1) = (1 - \theta_{i1}). Similarly, use \theta_{i0} to denote P(X_i = 1 | Y = 0). Notice with the above notation you can represent P(X_i | Y = 1) as follows

P(X_i \mid Y = 1) = \theta_{i1}^{X_i} (1 - \theta_{i1})^{(1 - X_i)}

Note when X_i = 1 the second term is equal to 1 because its exponent is zero. Similarly, when X_i = 0 the first term is equal to 1 because its exponent is zero.

4. (based on a suggestion from Sandra Zilles) This question asks you to consider the relationship between the MAP hypothesis and the Bayes optimal hypothesis. Consider a hypothesis space H defined over the set of instances X, and containing just two hypotheses, h1 and h2, with equal prior probabilities P(h1) = P(h2) = 0.5. Suppose we are given an arbitrary set of training data D which we use to calculate the posterior probabilities P(h1|D) and P(h2|D).

Based on this we choose the MAP hypothesis, and calculate the Bayes optimal hypothesis. Suppose we find that the Bayes optimal classifier is not equal to either h1 or to h2, which is generally the case because the Bayes optimal hypothesis corresponds to averaging over all hypotheses in H. Now we create a new hypothesis h3 which is equal to the Bayes optimal classifier with respect to H, X and D; that is, h3 classifies each instance in X exactly the same as the Bayes optimal classifier for H and D. We now create a new hypothesis space H' = {h1, h2, h3}. If we train using the same training data D, will the MAP hypothesis from H' be h3? Will the Bayes optimal classifier with respect to H' be equivalent to h3? (Hint: the answer depends on the priors we assign to the hypotheses in H'. Can you give constraints on these priors that assure the answers will be "yes" or "no"?)

7 Acknowledgements

I very much appreciate receiving helpful comments on earlier drafts of this chapter from the following: Nathaniel Fairfield, Rainer Gemulla, Vineet Kumar, Andrew McCallum, Anand Prahlad, Wei Wang, Geoff Webb, and Sandra Zilles.

REFERENCES

Mitchell, T. (1997). Machine Learning, McGraw Hill.

Ng, A.Y. & Jordan, M.I. (2002). On Discriminative vs. Generative Classifiers: A comparison of Logistic Regression and Naive Bayes, Neural Information Processing Systems.

Wasserman, L. (2004). All of Statistics, Springer-Verlag.


More information

G : Statistical Mechanics

G : Statistical Mechanics G25.2651: Statstca Mechancs Notes for Lecture 11 I. PRINCIPLES OF QUANTUM STATISTICAL MECHANICS The probem of quantum statstca mechancs s the quantum mechanca treatment of an N-partce system. Suppose the

More information

Machine learning: Density estimation

Machine learning: Density estimation CS 70 Foundatons of AI Lecture 3 Machne learnng: ensty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square ata: ensty estmaton {.. n} x a vector of attrbute values Objectve: estmate the model of

More information

STAT 3008 Applied Regression Analysis

STAT 3008 Applied Regression Analysis STAT 3008 Appled Regresson Analyss Tutoral : Smple Lnear Regresson LAI Chun He Department of Statstcs, The Chnese Unversty of Hong Kong 1 Model Assumpton To quantfy the relatonshp between two factors,

More information

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models ECO 452 -- OE 4: Probt and Logt Models ECO 452 -- OE 4 Maxmum Lkelhood Estmaton of Bnary Dependent Varables Models: Probt and Logt hs note demonstrates how to formulate bnary dependent varables models

More information

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics )

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics ) Ismor Fscher, 8//008 Stat 54 / -8.3 Summary Statstcs Measures of Center and Spread Dstrbuton of dscrete contnuous POPULATION Random Varable, numercal True center =??? True spread =???? parameters ( populaton

More information

Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE) Maxmum Lkelhood Estmaton (MLE) Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Wnter 01 UCSD Statstcal Learnng Goal: Gven a relatonshp between a feature vector x and a vector y, and d data samples (x,y

More information

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA

Xin Li Department of Information Systems, College of Business, City University of Hong Kong, Hong Kong, CHINA RESEARCH ARTICLE MOELING FIXE OS BETTING FOR FUTURE EVENT PREICTION Weyun Chen eartment of Educatona Informaton Technoogy, Facuty of Educaton, East Chna Norma Unversty, Shangha, CHINA {weyun.chen@qq.com}

More information

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System.

Supervised Learning. Neural Networks and Back-Propagation Learning. Credit Assignment Problem. Feedforward Network. Adaptive System. Part 7: Neura Networ & earnng /2/05 Superved earnng Neura Networ and Bac-Propagaton earnng Produce dered output for tranng nput Generaze reaonaby & appropratey to other nput Good exampe: pattern recognton

More information

Journal of Multivariate Analysis

Journal of Multivariate Analysis Journa of Mutvarate Anayss 3 (04) 74 96 Contents sts avaabe at ScenceDrect Journa of Mutvarate Anayss journa homepage: www.esever.com/ocate/jmva Hgh-dmensona sparse MANOVA T. Tony Ca a, Yn Xa b, a Department

More information

Introduction to Dummy Variable Regressors. 1. An Example of Dummy Variable Regressors

Introduction to Dummy Variable Regressors. 1. An Example of Dummy Variable Regressors ECONOMICS 5* -- Introducton to Dummy Varable Regressors ECON 5* -- Introducton to NOTE Introducton to Dummy Varable Regressors. An Example of Dummy Varable Regressors A model of North Amercan car prces

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased

More information

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry

Quantum Runge-Lenz Vector and the Hydrogen Atom, the hidden SO(4) symmetry Quantum Runge-Lenz ector and the Hydrogen Atom, the hdden SO(4) symmetry Pasca Szrftgser and Edgardo S. Cheb-Terrab () Laboratore PhLAM, UMR CNRS 85, Unversté Le, F-59655, France () Mapesoft Let's consder

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

22.51 Quantum Theory of Radiation Interactions

22.51 Quantum Theory of Radiation Interactions .51 Quantum Theory of Radaton Interactons Fna Exam - Soutons Tuesday December 15, 009 Probem 1 Harmonc oscator 0 ponts Consder an harmonc oscator descrbed by the Hamtonan H = ω(nˆ + ). Cacuate the evouton

More information

Semi-Supervised Learning

Semi-Supervised Learning Sem-Supervsed Learnng Consder the problem of Prepostonal Phrase Attachment. Buy car wth money ; buy car wth wheel There are several ways to generate features. Gven the lmted representaton, we can assume

More information

Robert Eisberg Second edition CH 09 Multielectron atoms ground states and x-ray excitations

Robert Eisberg Second edition CH 09 Multielectron atoms ground states and x-ray excitations Quantum Physcs 量 理 Robert Esberg Second edton CH 09 Multelectron atoms ground states and x-ray exctatons 9-01 By gong through the procedure ndcated n the text, develop the tme-ndependent Schroednger equaton

More information

Comparison of Regression Lines

Comparison of Regression Lines STATGRAPHICS Rev. 9/13/2013 Comparson of Regresson Lnes Summary... 1 Data Input... 3 Analyss Summary... 4 Plot of Ftted Model... 6 Condtonal Sums of Squares... 6 Analyss Optons... 7 Forecasts... 8 Confdence

More information