Bayes (Naïve or not) Classifiers: Generative Approach

Logistic regression

Bayes (Naïve or not) Classifiers: Generative Approach

What do we mean by a generative approach: learn $p(y)$ and $p(x \mid y)$, and then apply Bayes' rule to compute $p(y \mid x)$ for making predictions. This is essentially making the assumption that the data is generated according to a generative process governed by $p(y)$ and $p(x \mid y)$. In particular, to generate a data point:
1. Sample from $p(y)$ to determine its class label.
2. Sample its feature vector $x$ from $p(x \mid y)$.

Bayes classifier: $y \to x$, with $p(y)$ and $p(x \mid y)$. To generate one example in our training or test set:
1. Generate a $y$ according to $p(y)$.
2. Generate an $x$ vector according to $p(x \mid y)$.

Naïve Bayes classifier: $y \to x_1, \ldots, x_m$, with $p(y)$ and a separate $p(x_i \mid y)$ for each feature. To generate one example in our training or test set:
1. Generate a $y$ according to $p(y)$.
2. For $i = 1, \ldots, m$, generate a value for feature $x_i$ according to $p(x_i \mid y)$.
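As a concrete illustration of this generative process (not from the original slides), here is a minimal Python sketch, assuming a Bernoulli class prior p(y) and, for the Naïve Bayes case, an independent Gaussian p(x_i | y) per feature; the prior, means, and standard deviations below are made-up values.

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed illustrative parameters: prior p(y) and per-feature Gaussian p(x_i | y).
    prior_y1 = 0.3                              # p(y = 1)
    means = {0: np.array([0.0, 0.0]),           # mean of each feature given y
             1: np.array([2.0, 1.0])}
    stds  = {0: np.array([1.0, 1.0]),           # std of each feature given y
             1: np.array([1.0, 0.5])}

    def sample_example():
        """Step 1: sample y from p(y); step 2: sample each feature x_i from p(x_i | y)."""
        y = int(rng.random() < prior_y1)
        x = rng.normal(means[y], stds[y])       # features drawn independently given y
        return x, y

    examples = [sample_example() for _ in range(5)]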

The generative approach is just one type of learning approach used in machine learning. Learning a correct generative model $p(x \mid y)$ can be difficult: density estimation is a challenging problem in its own right, and sometimes unnecessary. In contrast, perceptron, KNN and DT are what we call discriminative methods. They are not concerned with any probabilistic model for generating the observations; they only care about finding a good discriminative function. Perceptron, KNN and DT learn deterministic functions, not probabilistic ones. One can also take a probabilistic approach to learning discriminative functions, i.e., learn $p(y \mid x)$ directly without assuming $x$ is generated from some particular distribution given $y$ (i.e., without modeling $p(x \mid y)$). Logistic regression is one such approach.

Logistic regression

Recall the problem of regression: it learns a mapping from an input vector $x$ to a continuous output $y$. Logistic regression extends traditional regression to handle a binary output $y$. In particular, we assume that

$P(y = 1 \mid x) = g(w \cdot x) = 1 / (1 + e^{-(w_0 + w_1 x_1 + \cdots + w_m x_m)})$,

where $g(t) = 1 / (1 + e^{-t})$ is the sigmoid function.
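A short, hedged sketch of the sigmoid and the resulting conditional probability (the function names and the separate intercept argument are illustrative choices, not from the slides):

    import numpy as np

    def sigmoid(t):
        """g(t) = 1 / (1 + e^{-t}): squashes any real number into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-t))

    def p_y1_given_x(x, w, w0):
        """P(y = 1 | x) = g(w0 + w . x) for feature vector x and weights (w0, w)."""
        return sigmoid(w0 + np.dot(w, x))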

Logistic Regression

Equivalently, we have the following:

$\log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} = w_0 + w_1 x_1 + \cdots + w_m x_m$,

where $\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}$ is the odds of $y = 1$.

Side note: the odds in favor of an event are the quantity $p / (1 - p)$, where $p$ is the probability of the event. If I toss a fair die, what are the odds that I will get a six? Here $p = 1/6$, so the odds are $(1/6)/(5/6) = 1/5$.

In other words, LR assumes that the log odds of $y = 1$ is a linear function of the input features.

Logistic Regression learns a linear decision boundary

We predict $y = 1$ if $p(y = 1 \mid x) > p(y = 0 \mid x)$.
Equivalently, predict $y = 1$ if $\frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} > 1$.
Equivalently, predict $y = 1$ if $\log \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = w_0 + w_1 x_1 + \cdots + w_m x_m > 0$.
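These three rules are the same test written three ways; a tiny check with assumed example weights makes the equivalence concrete:

    import numpy as np

    w0, w = -1.0, np.array([2.0, -0.5])     # assumed illustrative weights
    x = np.array([1.0, 0.4])

    score = w0 + np.dot(w, x)               # log odds of y = 1
    p1 = 1.0 / (1.0 + np.exp(-score))       # P(y = 1 | x)

    # Probability comparison, odds ratio, and sign of the log odds all agree.
    assert (p1 > 1 - p1) == (p1 / (1 - p1) > 1) == (score > 0)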

Learning w for logistic regression

Given a set of training data points, we would like to find a weight vector $w$ such that

$P(y = 1 \mid x) = 1 / (1 + e^{-(w_0 + w_1 x_1 + \cdots + w_m x_m)})$

is large (approaching 1) for positive training examples, and small (approaching 0) for negative examples. Maximum Likelihood estimation allows us to capture this precisely.

Maximum Likelihood Estimation

Goal: estimate the parameters of the distribution given data, assuming the examples are independently and identically distributed (i.i.d.) according to a distribution with parameter $\theta$. Let $D = \{d_1, d_2, \ldots, d_n\}$ be the observed data. We define the likelihood function of $\theta$ as

$L(\theta) = P(D; \theta) = \prod_{i=1}^{n} P(d_i; \theta)$.

It measures how likely it is to observe the data $D$ given that the parameter is $\theta$. The Maximum Likelihood Estimator (MLE) is

$\theta_{MLE} = \arg\max_\theta L(\theta)$.

Example

A coin toss (denoted by $x$) follows a binary distribution: $P(x) = \theta^x (1 - \theta)^{1 - x}$, where $\theta = P(x = 1)$ is the parameter we want to estimate. Observed data: $n$ i.i.d. coin tosses $D = \{x_1, x_2, \ldots, x_n\}$. Previously we mentioned a counting recipe: $\theta = \frac{1}{n} \sum_{i=1}^{n} x_i$. This is actually the Maximum Likelihood Estimate of $\theta$.

What is the likelihood function?

$L(\theta) = P(D; \theta) = \prod_{i=1}^{n} P(x_i; \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i}$.

MLE estimate for the binary distribution

$\theta_{MLE} = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta) = \arg\max_\theta \left[ \left(\sum_{i=1}^{n} x_i\right) \log\theta + \left(n - \sum_{i=1}^{n} x_i\right) \log(1 - \theta) \right]$.

Write $n_1 = \sum_{i=1}^{n} x_i$ for the number of 1s and $n_0 = n - n_1$ for the number of 0s. Setting the derivative to zero,

$\frac{d \log L(\theta)}{d\theta} = \frac{n_1}{\theta} - \frac{n_0}{1 - \theta} = 0 \;\Rightarrow\; n_1 (1 - \theta) = n_0 \theta \;\Rightarrow\; \theta_{MLE} = \frac{n_1}{n_0 + n_1} = \frac{1}{n} \sum_{i=1}^{n} x_i$.
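A quick numerical sanity check of this closed form (the tosses below are made-up data): the log-likelihood over a grid of θ values peaks at the sample mean.

    import numpy as np

    tosses = np.array([1, 0, 1, 1, 0, 1, 0, 1])          # illustrative i.i.d. coin tosses
    n1 = tosses.sum()                                    # number of 1s
    n0 = len(tosses) - n1                                # number of 0s

    thetas = np.linspace(0.01, 0.99, 99)
    log_lik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)

    print(thetas[np.argmax(log_lik)])    # grid maximizer, near 0.625
    print(tosses.mean())                 # 0.625, the closed-form MLE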

MLE for logistic regression

$w_{MLE} = \arg\max_w l(w) = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i; w)$
$\;\;= \arg\max_w \sum_{i=1}^{n} \left[ y_i \log P(y_i = 1 \mid x_i; w) + (1 - y_i) \log(1 - P(y_i = 1 \mid x_i; w)) \right]$
$\;\;= \arg\max_w \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i(w) + (1 - y_i) \log(1 - \hat{y}_i(w)) \right]$.

As such, given a set of training data points, we would like to find a weight vector $w$ such that $p(y = 1 \mid x; w)$ is large (e.g., close to 1) for positive training examples, and small (e.g., close to 0) otherwise, the same as our intuition.
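A minimal sketch of this conditional log-likelihood for 0/1 labels, assuming the intercept w_0 has been folded into w via a constant feature of 1 (an assumption of this sketch, not stated on the slide):

    import numpy as np

    def log_likelihood(w, X, y):
        """l(w) = sum_i [ y_i log yhat_i + (1 - y_i) log(1 - yhat_i) ], yhat_i = sigmoid(w . x_i)."""
        yhat = 1.0 / (1.0 + np.exp(-X @ w))
        return np.sum(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))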

Optimizing l(w)

Unfortunately this does not have a closed-form solution: you can take the derivative and set it to zero, but no closed-form solution exists. Instead, we iteratively search for the optimal $w$: start with a random $w$ and iteratively improve it (similar to Perceptron) by following the gradient.

Gradient of l(w)

$l(w) = \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i(w) + (1 - y_i) \log(1 - \hat{y}_i(w)) \right]$.

Useful fact: $\nabla_w \hat{y}_i(w) = \hat{y}_i(w) (1 - \hat{y}_i(w)) \, x_i$.

It follows that $\nabla_w l(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i(w)) \, x_i$.
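The gradient formula can be checked numerically against a finite-difference approximation of l(w); the data and weights below are made up purely for the check.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))                 # illustrative data (no intercept, for brevity)
    y = (rng.random(20) < 0.5).astype(float)     # illustrative 0/1 labels
    w = rng.normal(size=3)

    def ll(w):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    yhat = 1.0 / (1.0 + np.exp(-X @ w))
    grad = (y - yhat) @ X                        # sum_i (y_i - yhat_i) x_i

    eps = 1e-6
    e0 = np.array([eps, 0.0, 0.0])
    print(grad[0], (ll(w + e0) - ll(w - e0)) / (2 * eps))   # the two numbers should closely agree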

Batch Learning for Logistic Regression (note: $y$ takes values 0/1 here, not 1/-1)

Given: training examples $(x_i, y_i)$, $i = 1, \ldots, N$
Let $w \leftarrow (0, 0, 0, \ldots, 0)$
Repeat until convergence:
    $d \leftarrow (0, 0, 0, \ldots, 0)$
    For $i = 1$ to $N$ do:
        $\hat{y}_i \leftarrow 1 / (1 + e^{-w \cdot x_i})$
        $error \leftarrow y_i - \hat{y}_i$
        $d \leftarrow d + error \cdot x_i$
    $w \leftarrow w + d$

Note the striking similarity between LR and perceptron: both learn a linear decision boundary, and the iterative algorithm takes a very similar form.
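A runnable version of the batch update above, in Python; the learning rate eta, the convergence tolerance, and the synthetic data are assumptions added for illustration (the pseudocode itself just writes w ← w + d).

    import numpy as np

    def train_logistic_regression(X, y, eta=0.01, tol=1e-6, max_iters=10000):
        """Batch gradient ascent on l(w). X includes a leading column of 1s for w_0; y is 0/1."""
        w = np.zeros(X.shape[1])                  # w <- (0, 0, ..., 0)
        for _ in range(max_iters):
            yhat = 1.0 / (1.0 + np.exp(-X @ w))   # yhat_i = 1 / (1 + e^{-w . x_i})
            d = (y - yhat) @ X                    # d = sum_i (y_i - yhat_i) x_i
            w = w + eta * d                       # ascent step (eta is an added assumption)
            if np.linalg.norm(eta * d) < tol:     # crude convergence test (also an assumption)
                break
        return w

    # Tiny usage example on made-up two-class data.
    rng = np.random.default_rng(0)
    X_raw = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
    y = np.concatenate([np.zeros(50), np.ones(50)])
    X = np.hstack([np.ones((100, 1)), X_raw])     # prepend the constant feature for w_0
    w = train_logistic_regression(X, y)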

Connection to Naïve Bayes

If we use Naïve Bayes and assume a Gaussian distribution for $P(x \mid y)$, we can show that $P(y = 1 \mid x)$ takes the exact same functional form as Logistic Regression. So, are they equivalent? No: logistic regression does not assume a Gaussian distribution or conditional independence, and thus makes weaker modeling assumptions. A strong modeling assumption can make a significant difference in practice.

Comparatively

Naïve Bayes, a generative model: $p(x \mid y)$ makes strong assumptions about the distribution of the data attributes. When the assumptions are reasonable, Naïve Bayes can estimate a good model with even a small number of examples.

Logistic regression, a discriminative model: it directly learns $P(y \mid x)$. There are fewer parameters to estimate, but they are tied together, which makes learning harder. It makes weaker assumptions and may need a larger number of training examples.

Bottom line: if the Naïve Bayes assumption holds and the probabilistic models are accurate (i.e., $x$ is Gaussian given $y$, etc.), NB would be a good choice; otherwise, logistic regression often works better.

Summary

We introduced the concept of generative vs. discriminative methods. Given a method that we discussed in class, you need to know which category it belongs to.

Logistic regression:
- Assumes that the log odds of $y = 1$ is a linear function of $x$ (i.e., $w \cdot x$).
- The learning goal is to learn a weight vector $w$ such that examples with $y = 1$ are predicted to have high $P(y = 1 \mid x)$ and vice versa. Maximum likelihood estimation is an approach that achieves this.
- An iterative algorithm learns $w$ using MLE.
- Note the similarities and differences between LR and Perceptrons.
- Logistic regression learns a linear decision boundary. By introducing nonlinear features (e.g., $x_1^2$, $x_2^2$, $x_1 x_2$, ...), it can be extended to a nonlinear boundary.